K&R C Exercise 1-23: Remove Comments from a C Program

Exercise 1-23. Write a program to remove all comments from a C program. Don’t forget to handle quoted strings and character constants properly. C comments don’t nest.

Approach

The core challenge is that the same character sequence means completely different things depending on context. A /* inside a string literal is not a comment. A " inside a block comment does not start a string. You have to track where you are at every character — and that is exactly what a finite state machine does.

The solution defines five states: normal code, inside a block comment, inside a line comment, inside a double-quoted string, and inside a single-quoted character constant. One switch on the current state drives the entire loop. Line comments (//) are not part of the ANSI C that K&R describes (they were standardized in C99), so handling them goes slightly beyond what the exercise asks. The only real trick is that the two-character tokens (/*, */, //) require remembering the previous character, which a prev variable handles without any look-ahead on getchar().

The Five States and Their Transitions

Each state defines what the program does when it reads each character. Here is the complete transition table:

CODE — copying source to output, watching for comment or string openers:

  • prev='/' and current is * → enter BLOCK_CMT; discard the buffered /
  • prev='/' and current is / → enter LINE_CMT; discard the buffered /
  • current is " → flush prev, output ", enter IN_STRING
  • current is ' → flush prev, output ', enter IN_CHAR
  • anything else → output prev (if set), buffer current in prev

BLOCK_CMT — inside /* … */; suppress everything:

  • prev='*' and current is / → output a single space (prevents adjacent tokens from merging), enter CODE
  • current is * → set prev='*' (might be the start of */)
  • anything else → clear prev, discard character

LINE_CMT — inside // … to end of line; suppress everything:

  • current is \n → output the newline (preserves line count), enter CODE
  • anything else → discard character

IN_STRING — inside "…"; copy everything verbatim:

  • prev='\\' → the current character is escaped; output it, clear prev
  • current is \\ → output it, set prev='\\' (next char may be escaped)
  • current is " → output it, enter CODE
  • anything else → output it, clear prev

IN_CHAR — inside '…'; copy everything verbatim:

  • prev='\\' → escaped character; output it, clear prev
  • current is \\ → output it, set prev='\\'
  • current is ' → output it, enter CODE
  • anything else → output it, clear prev
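
IN_STRING and IN_CHAR differ only in the closing delimiter, so the two tables above can be folded into one step function. The following is a minimal sketch, not part of the original solution: the names quoted_step and closing_index are hypothetical, and the pointer parameter goes beyond what Chapter 1 of K&R has introduced.

```c
/* Hypothetical helper, not part of the original solution.
 * Processes one character c inside a quoted literal.  quote is the closing
 * delimiter ('"' or '\'') and *prev remembers a pending backslash.
 * Returns 1 when c closes the literal, 0 otherwise.  The caller is
 * expected to output c in every case. */
int quoted_step(int c, int quote, int *prev)
{
    if (*prev == '\\') {
        *prev = 0;          /* c was escaped, so it cannot close the literal */
        return 0;
    }
    if (c == '\\') {
        *prev = c;          /* the next character is escaped */
        return 0;
    }
    *prev = 0;
    return c == quote;      /* only an unescaped delimiter closes */
}

/* Usage sketch: index of the character that closes a literal whose body
 * (the text after the opening delimiter) is s, or -1 if it never closes. */
int closing_index(const char *s, int quote)
{
    int prev = 0;
    int i;

    for (i = 0; s[i] != '\0'; i++)
        if (quoted_step((unsigned char)s[i], quote, &prev))
            return i;
    return -1;
}
```

The order of the checks matters: the escaped case is tested first, so a backslash that is itself escaped does not go on to escape the character after it.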

Why the prev Variable?

You cannot tell whether a / is the start of a comment until you see the next character. If you call getchar() a second time inside the same iteration to peek at it, that character is consumed, and the Chapter 1 toolkit has no way to put it back (ungetc() appears only later in the book). The solution sidesteps this by holding each character in prev for one cycle before writing it. When the next character arrives, the two-character token is fully visible, and the decision is made then. If the second character is not * or /, prev is flushed as ordinary text and the new character is buffered in its place. Nothing is lost.
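
The delay can be seen in isolation in a much smaller routine. The sketch below is an illustration only: the function name drop_openers and its string-in, buffer-out interface are assumptions, and it recognises just one token (a slash followed by a star), which it drops from the copy.

```c
#include <stddef.h>

/* Sketch only: copy in to out, dropping every two-character slash-star
 * token.  out must have room for strlen(in) + 1 bytes.  prev holds the one
 * character whose fate is not yet known; 0 means nothing is buffered. */
void drop_openers(const char *in, char *out)
{
    size_t n = 0;
    int prev = 0;

    for (; *in != '\0'; in++) {
        int c = (unsigned char)*in;

        if (prev == '/' && c == '*') {
            prev = 0;                   /* token complete: drop both halves */
        } else {
            if (prev)
                out[n++] = (char)prev;  /* prev was ordinary text after all */
            prev = c;                   /* delay c until the next character */
        }
    }
    if (prev)
        out[n++] = (char)prev;          /* flush the last delayed character */
    out[n] = '\0';
}
```

Every character is written one step late, which is why the final flush after the loop is needed, just as in the full solution.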

Solution

/* K&R Exercise 1-23 — remove all comments from a C program
 * Compile: gcc -ansi -Wall ex1-23.c -o ex1-23
 * Usage:   ./ex1-23 < input.c > output.c
 */

#include <stdio.h>

#define CODE      0   /* normal source code          */
#define BLOCK_CMT 1   /* inside a block comment      */
#define LINE_CMT  2   /* inside // ... comment       */
#define IN_STRING 3   /* inside "..." string literal */
#define IN_CHAR   4   /* inside '...' char constant  */

int main(void)
{
    int c, state, prev;

    state = CODE;
    prev  = 0;

    while ((c = getchar()) != EOF) {
        switch (state) {

        case CODE:
            if (prev == '/' && c == '*') {
                state = BLOCK_CMT;   /* comment opens; drop buffered '/' */
                prev  = 0;
            } else if (prev == '/' && c == '/') {
                state = LINE_CMT;    /* opening // */
                prev  = 0;
            } else if (c == '"') {
                if (prev) putchar(prev);
                putchar(c);
                prev  = 0;
                state = IN_STRING;
            } else if (c == '\'') {
                if (prev) putchar(prev);
                putchar(c);
                prev  = 0;
                state = IN_CHAR;
            } else {
                if (prev) putchar(prev);
                prev = c;           /* buffer one character behind */
            }
            break;

        case BLOCK_CMT:
            if (prev == '*' && c == '/') {
                putchar(' ');        /* one space replaces the comment */
                state = CODE;
                prev  = 0;
            } else {
                prev = (c == '*') ? c : 0;
            }
            break;

        case LINE_CMT:
            if (c == '\n') {
                putchar('\n');       /* preserve newline / line count */
                state = CODE;
                prev  = 0;
            }
            /* else: discard everything until end of line */
            break;

        case IN_STRING:
            putchar(c);
            if (prev == '\\') {
                prev = 0;           /* this char was escaped — reset */
            } else if (c == '\\') {
                prev = c;           /* backslash: next char is escaped */
            } else if (c == '"') {
                state = CODE;
                prev  = 0;
            } else {
                prev = 0;
            }
            break;

        case IN_CHAR:
            putchar(c);
            if (prev == '\\') {
                prev = 0;
            } else if (c == '\\') {
                prev = c;
            } else if (c == '\'') {
                state = CODE;
                prev  = 0;
            } else {
                prev = 0;
            }
            break;
        }
    }

    /* flush any buffered character left at end of input */
    if (state == CODE && prev)
        putchar(prev);

    return 0;
}

Compile and Run

gcc -ansi -Wall ex1-23.c -o ex1-23
./ex1-23 < input.c > stripped.c

The program reads from standard input and writes to standard output, so you redirect an existing C file in and collect the result.

Worked Examples

Block comment removal

Input:

int x = /* set initial value */ 5;

Output:

int x =   5;

The comment is replaced by a single space. Three spaces appear between = and 5: the space that was already before /*, the replacement space, and the space that followed */. The replacement is intentional: without it, adjacent identifiers such as a/**/b would incorrectly merge into ab in the output.
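
The replacement space matches what a compiler does: the C standard's translation phases replace each comment with one space character. As a small illustration (the variable x and the function get_x are made up for this sketch), the comment below is the only separator between two tokens, and the code still compiles as an ordinary declaration.

```c
/* The comment is the only thing separating the two tokens below.  The
 * compiler treats it as one space, so this is the declaration "int x = 3;".
 * With the comment deleted outright it would read "intx = 3;". */
int/**/x = 3;

int get_x(void)
{
    return x;
}
```

A comment stripper that emitted nothing in place of the comment would turn this valid file into one that no longer compiles.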

Line comment removal

Input:

n++;   // increment counter

Output:

n++;

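Everything from // to the end of the line is suppressed; the three spaces before // remain in the output as trailing whitespace. The newline itself is preserved, so a line comment never changes the line count. A block comment that spans several lines is different: it collapses to a single space, so line numbers after it no longer match the original source.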

String literal containing comment-like text

Input:

printf("/* this is not a comment */\n");

Output (unchanged):

printf("/* this is not a comment */\n");

Once the machine enters IN_STRING it copies every character through, including sequences that look like comment delimiters. It only returns to CODE when it sees an unescaped closing ".

Escaped quote inside a string

Input:

char *msg = "say \"hello\"";

Output (unchanged):

char *msg = "say \"hello\"";

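The prev == '\\' check ensures that \" does not terminate the string; only a " that is not escaped by a preceding backslash ends the IN_STRING state. The same check treats \\ as an escaped backslash, so the quote that follows it in "\\" does close the string.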
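
The worked examples can be checked mechanically with a string-based port of the state machine. The sketch below is an adaptation, not the original program: the function name strip_comments, the enum, and the in/out buffer interface are assumptions made so the logic can be called from a test. It is meant to be compiled as its own file, and its transitions are intended to mirror the solution above.

```c
#include <stddef.h>

enum state { CODE, BLOCK_CMT, LINE_CMT, IN_STRING, IN_CHAR };

/* Sketch: same transitions as the solution, but reading from the string in
 * and writing to out instead of using getchar() and putchar().
 * out must have room for strlen(in) + 1 bytes.  Returns out. */
char *strip_comments(const char *in, char *out)
{
    enum state state = CODE;
    size_t n = 0;
    int prev = 0;

    for (; *in != '\0'; in++) {
        int c = (unsigned char)*in;

        switch (state) {
        case CODE:
            if (prev == '/' && c == '*') {
                state = BLOCK_CMT;          /* drop the buffered slash */
                prev = 0;
            } else if (prev == '/' && c == '/') {
                state = LINE_CMT;
                prev = 0;
            } else if (c == '"' || c == '\'') {
                if (prev) out[n++] = (char)prev;
                out[n++] = (char)c;
                prev = 0;
                state = (c == '"') ? IN_STRING : IN_CHAR;
            } else {
                if (prev) out[n++] = (char)prev;
                prev = c;                   /* stay one character behind */
            }
            break;
        case BLOCK_CMT:
            if (prev == '*' && c == '/') {
                out[n++] = ' ';             /* one space replaces the comment */
                state = CODE;
                prev = 0;
            } else {
                prev = (c == '*') ? c : 0;
            }
            break;
        case LINE_CMT:
            if (c == '\n') {
                out[n++] = '\n';            /* keep the newline */
                state = CODE;
                prev = 0;
            }
            break;
        case IN_STRING:
        case IN_CHAR:
            out[n++] = (char)c;             /* literals are copied verbatim */
            if (prev == '\\') {
                prev = 0;                   /* c was escaped */
            } else if (c == '\\') {
                prev = c;
            } else {
                prev = 0;
                if (c == (state == IN_STRING ? '"' : '\''))
                    state = CODE;
            }
            break;
        }
    }
    if (state == CODE && prev)
        out[n++] = (char)prev;              /* flush the delayed character */
    out[n] = '\0';
    return out;
}
```

The output is never longer than the input, because every comment shrinks to at most one character, so a buffer the size of the input string is always large enough.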

What This Exercise Teaches

  • Finite state machines in C: a switch on a state variable is the standard idiom for character-by-character parsing that depends on context.
  • Two-character tokens without un-getting: the prev variable holds one character back so you can recognise /*, */, and // without a second getchar() call or ungetc().
  • Edge cases in lexical analysis: comments inside strings, escaped quotes inside strings, and the rule that C comments do not nest are exactly the kinds of cases that trip up naive solutions and that real-world lexers must handle.
  • Output correctness under deletion: replacing a block comment with a single space rather than nothing keeps the surrounding tokens separate. It mirrors the C standard, whose translation phases replace each comment with one space character as the source is split into preprocessing tokens.

Set Up Your C Environment

To compile and run this solution with the commands shown above, you need GCC installed.

Book:

The C Programming Language, 2nd Ed — Kernighan & Ritchie
