@ChAoSUnItY
Collaborator

@ChAoSUnItY ChAoSUnItY commented Nov 15, 2025

This patch aims to resolve most of the current preprocessor issues in the parser unit, including:

  1. Inappropriately eager file inclusion: a file is always included before the enclosing if-else guard has been evaluated, which sometimes produces weird results.
  2. Poor error reporting: the current system uses a raw source index to tell the user where an error is, which is not helpful in practice.
  3. Lack of expansion-only compilation.
  4. Lack of predefined macros, including __LINE__ and __FILE__.

In this approach, we introduce a whole new phase dedicated to preprocessing (a practice seen in chibicc and similar C compilers), instead of scattering preprocessing across several different places.
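To illustrate the first issue in the list above, here is a minimal sketch of deciding whether an #include should be expanded only after the enclosing conditional has been evaluated. It is not the patch's implementation: `count_expanded_includes`, `is_defined`, and `MAX_DEPTH` are hypothetical names, input is assumed to be pre-split into lines, and only #ifdef, #ifndef, #else, and #endif are handled (no #if expressions or #elif).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_DEPTH 32 /* hypothetical nesting limit for this sketch */

/* Linear lookup in a list of defined macro names. */
static bool is_defined(const char *name, const char *const *defs, int ndefs)
{
    for (int i = 0; i < ndefs; i++)
        if (!strcmp(name, defs[i]))
            return true;
    return false;
}

/* Count the #include directives that sit in an active region, i.e. the
 * ones a guard-aware preprocessor would actually expand. */
int count_expanded_includes(const char *const *lines, int nlines,
                            const char *const *defs, int ndefs)
{
    bool active[MAX_DEPTH]; /* is this nesting level currently emitted? */
    bool taken[MAX_DEPTH];  /* did the first branch of this level fire? */
    int depth = 0, includes = 0;
    active[0] = true;
    taken[0] = true;

    for (int i = 0; i < nlines; i++) {
        const char *l = lines[i];
        if (!strncmp(l, "#ifdef ", 7) || !strncmp(l, "#ifndef ", 8)) {
            bool neg = l[3] == 'n'; /* "#ifndef" has 'n' at index 3 */
            bool cond = is_defined(l + (neg ? 8 : 7), defs, ndefs) != neg;
            assert(depth + 1 < MAX_DEPTH);
            depth++;
            taken[depth] = cond;
            active[depth] = active[depth - 1] && cond;
        } else if (!strcmp(l, "#else")) {
            assert(depth > 0);
            active[depth] = active[depth - 1] && !taken[depth];
        } else if (!strcmp(l, "#endif")) {
            assert(depth > 0);
            depth--;
        } else if (!strncmp(l, "#include ", 9)) {
            /* The decision is made here, after the guard is known. */
            if (active[depth])
                includes++;
        }
    }
    return includes;
}
```

An eager implementation would expand every #include it sees; in this sketch an #include inside a false branch is simply never counted.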

Current status

This draft is still in development. The screenshot below compares the output of cpp -E and out/shecc -E:

comparison

test.c:

#define B
#define EXP(...) __VA_ARGS__
#include "a.c"

int main()
{
#if defined(B)
    __LINE__;
    __FILE__;
#endif
    EXP(1, +, 2);
    printf("Hello, World!\n");
    return 0;
}

a.c:

#pragma once
int fib(void)
{
    A;
    __FILE__;
    return 0;
}

The main compilation workflow is not yet affected by this patch; this is expected to be resolved in later commits.

Memory overhead

Memory usage is expected to increase, since we now use the token_t struct to track positions, and the whole token stream is generated at once for the preprocessor and parser to use. Once the parser has finished, we expect to free all tokens.


Summary by cubic

Introduces a dedicated preprocessing phase and rebuilds token handling to fix include guard behavior, add robust macro expansion, and improve error diagnostics. Adds an -E mode to output preprocessed code comparable to cpp -E.

  • New Features

    • Standalone preprocessor with object/function-like macros, variadics, and hide sets.
    • Conditional directives: #if, #ifdef, #ifndef, #elif, #else, #endif.
    • #pragma once and proper include path parsing for "<...>" and "..." files.
    • Predefined macros: __LINE__ and __FILE__.
    • -E expansion-only mode with preprocessed output emitter.
    • Better token tracking and caching, with precise file:line:column locations.
  • Bug Fixes

    • Include expansion now respects guards; files are included after condition evaluation.
    • Error messages upgraded from raw source indices to readable diagnostics with caret highlights.
    • Corrected #elif evaluation during preprocessing.
    • Deferred string/char literal unescaping until after preprocessing; fixes hex and octal escape handling.

Written for commit a62f3c9. Summary will update automatically on new commits.
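For the predefined macros mentioned in the summary, a minimal sketch of how replacement text could be derived from a token's recorded position is shown here. `expand_predefined` is a hypothetical helper, not a function from the patch, and it does not escape quotes or backslashes in the file name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* If `name` is a predefined macro, write its replacement text into `out`
 * and return 1; otherwise leave `out` untouched and return 0.
 * `filename` and `line` are assumed to come from the token's location. */
int expand_predefined(const char *name, const char *filename, int line,
                      char *out, size_t cap)
{
    assert(name && filename && out && cap > 0);
    if (!strcmp(name, "__LINE__")) {
        snprintf(out, cap, "%d", line); /* decimal integer constant */
        return 1;
    }
    if (!strcmp(name, "__FILE__")) {
        snprintf(out, cap, "\"%s\"", filename); /* string literal */
        return 1;
    }
    return 0;
}
```

Because the value depends on where the identifier appears, this kind of expansion only works once each token carries its own file and line.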

hashmap_free(CONSTANTS_MAP);
}

void dbg_token(token_t *token)
Collaborator

Is it needed? Use conditional build.

Collaborator

Would it be better to use a global array to store the strings and use token->kind as the index to retrieve the corresponding one?

/* Initialize this array in global_init() */
char *token_str[NUM_OF_TOKENS];
if (token->kind >= 0 && token->kind <= T_inclusion_path)
    name = token_str[token->kind];
else
    name = "<unknown>";

Collaborator Author

I've guarded this function with a conditional macro, DEBUG_BUILD, defined in defs.h.

src/globals.c Outdated
string_literal_pool->literals = hashmap_create(256);

SOURCE = strbuf_create(MAX_SOURCE);
SRC_FILE_MAP = hashmap_create(8);
Collaborator

How about defining a new macro for the constant value 8?

@ChAoSUnItY
Collaborator Author

The initial benchmark results are below. The memory overhead and minor performance overhead are the expected outcome of the algorithms and data structures used in this patch:

  1. Each source file is loaded into memory once (it is used later for error diagnostics and tokenization).
  2. The token stream is then computed (tokenized) and stored in TOKEN_ARENA; this is expected to be the major memory overhead of this patch.
  3. The computed token stream of each source file is also cached.
  4. Later, in the preprocessor, each token is copied before being preprocessed, so that the cached token stream is not corrupted if it is used again. This is expected to be another major memory overhead.
  5. After preprocessing is done, the compiler passes the whole computed token stream to the parser.

The performance overhead is expected to come from multiple traversals of the token stream: at least three, in the lexer, the preprocessor, and the parser.
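The copy-before-preprocess step (item 4 above) can be sketched as follows. This is only an illustration: `tok_t`, `copy_token_stream`, and `free_token_stream` are hypothetical, and plain malloc stands in for the TOKEN_ARENA allocator used by the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, reduced token; the real token_t carries more fields. */
typedef struct tok {
    int kind;
    int line, column;
    struct tok *next;
} tok_t;

/* Duplicate a cached token list node by node, so the preprocessor can
 * rewrite the copy without touching the cached original. */
tok_t *copy_token_stream(const tok_t *cached)
{
    tok_t *head = NULL, **tail = &head;
    for (const tok_t *t = cached; t; t = t->next) {
        tok_t *dup = malloc(sizeof(*dup));
        assert(dup); /* sketch only: abort on allocation failure */
        *dup = *t;
        dup->next = NULL;
        *tail = dup;       /* append to the copy */
        tail = &dup->next; /* advance the append point */
    }
    return head;
}

/* Release a list produced by copy_token_stream(). */
void free_token_stream(tok_t *head)
{
    while (head) {
        tok_t *next = head->next;
        free(head);
        head = next;
    }
}
```

Each use of a cached file therefore costs one extra allocation per token, which matches the second source of memory overhead described above.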

performance benchmark (stage 0)

Before

/bin/time -v out/shecc src/main.c:

        Command being timed: "./out/shecc ./src/main.c"
        User time (seconds): 0.16
        System time (seconds): 0.33
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.50
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 305664
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 76195
        Voluntary context switches: 0
        Involuntary context switches: 2
        Swaps: 0
        File system inputs: 0
        File system outputs: 728
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

After

/bin/time -v out/shecc src/main.c:

        Command being timed: "./out/shecc ./src/main.c"
        User time (seconds): 0.20
        System time (seconds): 0.44
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.66
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 384000
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 3
        Minor (reclaiming a frame) page faults: 96004
        Voluntary context switches: 44
        Involuntary context switches: 2
        Swaps: 0
        File system inputs: 1992
        File system outputs: 800
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
performance benchmark (stage 1)

Before

/bin/time -v out/shecc-stage1.elf src/main.c:

        Command being timed: "./out/shecc-stage1.elf ./src/main.c"
        User time (seconds): 1.49
        System time (seconds): 1.23
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.76
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 188772
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 64
        Minor (reclaiming a frame) page faults: 44474
        Voluntary context switches: 158
        Involuntary context switches: 5
        Swaps: 0
        File system inputs: 10880
        File system outputs: 728
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

After

/bin/time -v out/shecc-stage1.elf src/main.c:

        Command being timed: "./out/shecc-stage1.elf ./src/main.c"
        User time (seconds): 1.94
        System time (seconds): 1.50
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.47
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 233612
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 63
        Minor (reclaiming a frame) page faults: 55761
        Voluntary context switches: 50
        Involuntary context switches: 15
        Swaps: 0
        File system inputs: 9120
        File system outputs: 800
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@ChAoSUnItY
Collaborator Author

There are still some improvements left to do, e.g.:

  1. Delay the computation of escaped characters in string / character literals, so that the result doesn't have to be converted back again when passing -E.
  2. Rename functions and structures; some leftover unused structures will also be removed.
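A minimal sketch of the first item: if tokens keep the raw spelling of a literal, -E can print it unchanged, and escapes are decoded only when a later phase needs the value. `unescape_literal` and `hex_val` are hypothetical names, and the set of escapes handled here is an assumption rather than the patch's exact behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Value of a hexadecimal digit, or -1 if `c` is not one. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Decode the raw body of a literal (without the surrounding quotes) into
 * `out`. Returns the decoded length, or -1 on a bad escape or overflow. */
int unescape_literal(const char *raw, char *out, size_t cap)
{
    assert(raw && out);
    if (cap == 0)
        return -1;
    size_t n = 0;
    while (*raw) {
        int ch;
        if (*raw != '\\') {
            ch = (unsigned char) *raw++;
        } else if (raw[1] == 'x') { /* \x followed by hex digits */
            raw += 2;
            if (hex_val(*raw) < 0)
                return -1;
            for (ch = 0; hex_val(*raw) >= 0; raw++)
                ch = (ch * 16 + hex_val(*raw)) & 0xff;
        } else if (raw[1] >= '0' && raw[1] <= '7') { /* up to 3 octal digits */
            raw++;
            ch = 0;
            for (int i = 0; i < 3 && *raw >= '0' && *raw <= '7'; i++)
                ch = ch * 8 + (*raw++ - '0');
            ch &= 0xff;
        } else {
            switch (raw[1]) {
            case 'n': ch = '\n'; break;
            case 't': ch = '\t'; break;
            case 'r': ch = '\r'; break;
            case '\\': ch = '\\'; break;
            case '"': ch = '"'; break;
            case '\'': ch = '\''; break;
            default: return -1; /* unknown escape or trailing backslash */
            }
            raw += 2;
        }
        if (n + 1 >= cap)
            return -1; /* keep room for the terminator */
        out[n++] = (char) ch;
    }
    out[n] = '\0';
    return (int) n;
}
```

With this split, the -E path never calls the decoder, so there is nothing to convert back.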

@ChAoSUnItY
Collaborator Author

I seem to be unable to reply to the latest review suggestion, so I'll reply here:

Would it be better to use a global array to store the strings and use token->kind as the index to retrieve the corresponding one?

Perhaps? That approach requires reorganizing the order of the token kinds, and I don't think it's necessary just to boost performance while debugging (I didn't encounter any significant performance overhead when calling it).

@ChAoSUnItY
Collaborator Author

I'm thinking of an approach to validate our compiler's expansion-only compilation flag (-E). The procedure is as follows:

  1. make distclean config out/shecc
  2. ./out/shecc -E src/main.c > out/out.c
  3. ./out/shecc --no-libc -o out/shecc-stage1.elf out/out.c
  4. If the above procedure succeeds, we can continue to bootstrap the stage 2 executable with the same procedure.

@jserv what do you think?

@jserv
Collaborator

jserv commented Nov 19, 2025

I'm thinking of an approach to validate our compiler's expansion-only compilation flag (-E). The procedure is as follows:

  1. make distclean config out/shecc
  2. ./out/shecc -E src/main.c > out/out.c

Furthermore, we can adapt the preprocessor implementation from slimcc.
