Exercise 1-23. Write a program to remove all comments from a C program. Don’t forget to handle quoted strings and character constants properly. C comments don’t nest.
Approach
The core challenge is that the same character sequence means completely different things depending on context. A /* inside a string literal is not a comment. A " inside a block comment does not start a string. You have to track where you are at every character — and that is exactly what a finite state machine does.
The solution defines five states: normal code, inside a block comment, inside a line comment, inside a double-quoted string, and inside a single-quoted character constant. One switch on the current state drives the entire loop. The only real trick is that two-character tokens (/*, */, //) require remembering the previous character, which is handled by a prev variable rather than any look-ahead on getchar().
The Five States and Their Transitions
Each state defines what the program does when it reads each character. Here is the complete transition table:
CODE — copying source to output, watching for comment or string openers:
- prev = '/' and current is * → enter BLOCK_CMT; discard the buffered /
- prev = '/' and current is / → enter LINE_CMT; discard the buffered /
- current is " → flush prev, output ", enter IN_STRING
- current is ' → flush prev, output ', enter IN_CHAR
- anything else → output prev (if set), buffer current in prev
BLOCK_CMT — inside /* … */; suppress everything:
- prev = '*' and current is / → output a single space (prevents adjacent tokens from merging), enter CODE
- current is * → set prev = '*' (might be the start of */)
- anything else → clear prev, discard character
LINE_CMT — inside // … to end of line; suppress everything:
- current is \n → output the newline (preserves line count), enter CODE
- anything else → discard character
IN_STRING — inside "…"; copy everything verbatim:
- prev = '\\' → the current character is escaped; output it, clear prev
- current is \\ → output it, set prev = '\\' (next char may be escaped)
- current is " → output it, enter CODE
- anything else → output it, clear prev
IN_CHAR — inside '…'; copy everything verbatim:
- prev = '\\' → escaped character; output it, clear prev
- current is \\ → output it, set prev = '\\'
- current is ' → output it, enter CODE
- anything else → output it, clear prev
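The transition table can also be exercised on its own, without any output logic. The sketch below is one possible way to do that, not part of the solution: it swaps the #define constants for an enum (an assumption made here for readability), and final_state is a hypothetical helper name. It runs only the transitions over a string and reports the state reached at the end, which makes it easy to check individual rows of the table.

```c
enum state { CODE, BLOCK_CMT, LINE_CMT, IN_STRING, IN_CHAR };

/* Hypothetical helper: apply only the state transitions to src and
 * return the state the machine is in after the last character. */
enum state final_state(const char *src)
{
    enum state st = CODE;
    int prev = 0;                       /* 0 means "nothing remembered" */

    for (; *src != '\0'; src++) {
        int c = (unsigned char)*src;

        switch (st) {
        case CODE:
            if (prev == '/' && c == '*') {
                st = BLOCK_CMT; prev = 0;
            } else if (prev == '/' && c == '/') {
                st = LINE_CMT; prev = 0;
            } else if (c == '"') {
                st = IN_STRING; prev = 0;
            } else if (c == '\'') {
                st = IN_CHAR; prev = 0;
            } else {
                prev = c;
            }
            break;
        case BLOCK_CMT:
            if (prev == '*' && c == '/') {
                st = CODE; prev = 0;
            } else {
                prev = (c == '*') ? c : 0;
            }
            break;
        case LINE_CMT:
            if (c == '\n') {
                st = CODE; prev = 0;
            }
            break;
        case IN_STRING:                 /* both quoted states share the */
        case IN_CHAR:                   /* same escape handling         */
            if (prev == '\\') {
                prev = 0;               /* c was escaped: stay inside */
            } else if (c == '\\') {
                prev = c;
            } else {
                if (c == (st == IN_STRING ? '"' : '\''))
                    st = CODE;          /* unescaped closing quote */
                prev = 0;
            }
            break;
        }
    }
    return st;
}
```

For instance, final_state("x // y") should come back as LINE_CMT, while a string that opens a quote before a comment-like sequence should stay in IN_STRING.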
Why the prev Variable?
You cannot tell whether a / starts a comment until you see the next character. Reading that next character with a second getchar() call in the same iteration consumes it, and nothing introduced in K&R Chapter 1 can put it back if it turns out to be ordinary code. The solution sidesteps this by holding every character in prev for one cycle. When the next character arrives, both halves of a possible two-character token are visible, and the decision is made then. If the second character is not * or /, prev is flushed normally and the new character is buffered. Nothing is lost.
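The one-behind buffering idea can be seen in isolation with a much smaller filter. The following is an illustrative sketch only: drop_openers is a hypothetical function that works on strings rather than on getchar(), and all it does is delete each slash-star pair while letting a lone slash through.

```c
/* Hypothetical demo: copy in to out, deleting every two-character
 * slash-star opener. out must have room for strlen(in) + 1 bytes. */
void drop_openers(const char *in, char *out)
{
    int prev = 0;                     /* 0 means "nothing buffered" */

    for (; *in != '\0'; in++) {
        int c = (unsigned char)*in;

        if (prev == '/' && c == '*') {
            prev = 0;                 /* both halves of the token vanish */
        } else {
            if (prev)
                *out++ = (char)prev;  /* held-back char was ordinary */
            prev = c;                 /* hold the new char for one cycle */
        }
    }
    if (prev)
        *out++ = (char)prev;          /* flush the last held-back char */
    *out = '\0';
}
```

A slash followed by anything other than a star is emitted one step late but never dropped, which is the same guarantee the CODE state relies on, including the final flush after the loop.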
Solution
/* K&R Exercise 1-23: remove all comments from a C program
 * Compile: gcc -ansi -Wall ex1-23.c -o ex1-23
 * Usage:   ./ex1-23 < input.c > output.c
 */
#include <stdio.h>

#define CODE      0   /* normal source code */
#define BLOCK_CMT 1   /* inside a block comment */
#define LINE_CMT  2   /* inside a // line comment */
#define IN_STRING 3   /* inside "..." string literal */
#define IN_CHAR   4   /* inside '...' char constant */

int main(void)
{
    int c, state, prev;

    state = CODE;
    prev = 0;
    while ((c = getchar()) != EOF) {
        switch (state) {
        case CODE:
            if (prev == '/' && c == '*') {
                state = BLOCK_CMT;      /* block opener: discard the buffered slash */
                prev = 0;
            } else if (prev == '/' && c == '/') {
                state = LINE_CMT;       /* line opener: discard the buffered slash */
                prev = 0;
            } else if (c == '"') {
                if (prev) putchar(prev);
                putchar(c);
                prev = 0;
                state = IN_STRING;
            } else if (c == '\'') {
                if (prev) putchar(prev);
                putchar(c);
                prev = 0;
                state = IN_CHAR;
            } else {
                if (prev) putchar(prev);
                prev = c;               /* buffer one character behind */
            }
            break;
        case BLOCK_CMT:
            if (prev == '*' && c == '/') {
                putchar(' ');           /* one space replaces the comment */
                state = CODE;
                prev = 0;
            } else {
                prev = (c == '*') ? c : 0;
            }
            break;
        case LINE_CMT:
            if (c == '\n') {
                putchar('\n');          /* preserve newline / line count */
                state = CODE;
                prev = 0;
            }
            /* else: discard everything until end of line */
            break;
        case IN_STRING:
            putchar(c);
            if (prev == '\\') {
                prev = 0;               /* this char was escaped: reset */
            } else if (c == '\\') {
                prev = c;               /* backslash: next char is escaped */
            } else if (c == '"') {
                state = CODE;
                prev = 0;
            } else {
                prev = 0;
            }
            break;
        case IN_CHAR:
            putchar(c);
            if (prev == '\\') {
                prev = 0;               /* this char was escaped: reset */
            } else if (c == '\\') {
                prev = c;               /* backslash: next char is escaped */
            } else if (c == '\'') {
                state = CODE;
                prev = 0;
            } else {
                prev = 0;
            }
            break;
        }
    }
    /* flush any buffered character left at end of input */
    if (state == CODE && prev)
        putchar(prev);
    return 0;
}
Compile and Run
gcc -ansi -Wall ex1-23.c -o ex1-23
./ex1-23 < input.c > stripped.c
The program reads from standard input and writes to standard output, so you redirect an existing C file in and collect the result.
Worked Examples
Block comment removal
Input:
int x = /* set initial value */ 5;
Output:
int x =   5;
The comment is replaced by a single space. Three spaces appear between = and 5: the space that was already before /*, the replacement space, and the space that followed */. The replacement is intentional: without it, adjacent identifiers such as a/**/b would incorrectly merge into ab in the output.
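The token-merging risk is easy to reproduce in valid C. In the sketch below (answer is a hypothetical function name chosen for illustration), each comment is the only thing separating the tokens on either side of it. A C compiler treats a comment as a single space, so this compiles as written; a stripper that deleted the comments outright would leave intanswer and return42, which do not compile.

```c
/* The two comments inside this definition are the only separators
 * between the tokens around them. */
int/* separator */answer(void)
{
    return/* separator */42;
}
```

Running this file through the solution should turn each comment into a space, so the result still declares a function named answer that returns 42.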
Line comment removal
Input:
n++; // increment counter
Output:
n++;
Everything from // to the end of the line is suppressed. The space before // is ordinary code, so it survives as trailing whitespace. The newline itself is preserved so that line numbers in the output still match the original source.
String literal containing comment-like text
Input:
printf("/* this is not a comment */\n");
Output (unchanged):
printf("/* this is not a comment */\n");
Once the machine enters IN_STRING it copies every character through, including sequences that look like comment delimiters. It only returns to CODE when it sees an unescaped closing ".
Escaped quote inside a string
Input:
char *msg = "say \"hello\"";
Output (unchanged):
char *msg = "say \"hello\"";
The prev == '\\' check ensures that \" does not terminate the string; only a bare " with no preceding backslash ends the IN_STRING state.
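The same escape rule can be checked separately from the rest of the machine. This is a minimal sketch under stated assumptions: literal_length is a hypothetical helper, and it expects s to begin at the opening quote. It returns the number of characters in the string literal, both quotes included, or 0 if the literal is never closed.

```c
#include <stddef.h>

/* Hypothetical helper: length of the string literal starting at s[0],
 * counting both quotes; 0 if s does not start with a quote or the
 * literal is unterminated. */
size_t literal_length(const char *s)
{
    size_t i;
    int prev = 0;                 /* '\\' while the next char is escaped */

    if (s[0] != '"')
        return 0;
    for (i = 1; s[i] != '\0'; i++) {
        if (prev == '\\')
            prev = 0;             /* s[i] is escaped, whatever it is */
        else if (s[i] == '\\')
            prev = '\\';          /* escape applies to the next char */
        else if (s[i] == '"')
            return i + 1;         /* bare quote: the literal ends here */
        else
            prev = 0;
    }
    return 0;                     /* ran out of input inside the literal */
}
```

Note that an escaped backslash clears prev just like any other escaped character, so a literal ending in two backslashes followed by a quote closes correctly.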
What This Exercise Teaches
- Finite state machines in C: a switch on a state variable is the standard idiom for character-by-character parsing that depends on context.
- One-character lookahead without un-getting: the prev variable buffers one character so you can recognise two-character tokens (/*, */, //) without a second getchar() call or ungetc().
- Edge cases in lexical analysis: comments inside strings, escaped quotes inside strings, and the rule that C comments do not nest are exactly the kinds of edge cases that trip up naive solutions and that real-world lexers must handle.
- Output correctness under deletion: replacing a block comment with a single space rather than nothing is a deliberate design choice that keeps the surrounding tokens separate; understanding why reveals something important about how the C preprocessor sees source text.
Set Up Your C Environment
To compile and run this solution you need GCC installed. If you haven’t set up C on your machine yet:
- Install GCC on Windows 11
- Install GCC on macOS
- Install GCC on Ubuntu/Linux
- VS Code for C Programming — recommended editor
- Complete C Development Environment Setup — step-by-step guide for beginners
Book:
The C Programming Language, 2nd Ed — Kernighan & Ritchie