Lexer & Tokenizer

The lexer in cudafe++ is EDG 6.6's lexical.c implementation -- a hand-coded, state-machine-driven tokenizer that converts raw source bytes into a stream of 357 distinct token kinds. It spans approximately 185 functions across the address range 0x668330--0x689130 and constitutes one of the densest subsystems in the binary. The design is a classic multi-layered scanner: a byte-level character scanner (sub_679800, 907 lines) feeds into a token acquisition engine (sub_6810F0, 3,811 lines), which in turn is wrapped by a cache-aware token delivery function (sub_676860, 1,995 lines). CUDA keyword recognition is injected at the get_token_main level, gated on dword_106C2C0 (GPU compilation mode flag).

The lexer does not use generated tables from tools like flex. Instead, every character-class test, keyword match, and operator scan is written as explicit C switch/if chains, compiled into dense jump tables by the optimizer. This produces extremely large functions -- get_token_main alone has approximately 300 local variables in its decompiled form -- but eliminates the overhead of table-driven DFA transitions for a language as context-sensitive as C++.

Key Facts

| Property | Value |
| --- | --- |
| Source file | lexical.c (~185 functions) |
| Address range | 0x668330--0x689130 |
| Token kinds | 357 (indexed from off_E6D240 name table) |
| Primary scanner | sub_679800 (scan_token, 907 lines) |
| Token acquisition | sub_6810F0 (get_token_main, 3,811 lines, ~300 locals) |
| Cache + delivery | sub_676860 (get_next_token, 1,995 lines) |
| Numeric literal scanner | sub_672390 (scan_numeric_literal, 1,571 lines) |
| Keyword registration | sub_5863A0 (keyword_init, in fe_init.c, 200+ keywords) |
| Universal char scanner | sub_6711E0 (scan_universal_character, 278 lines) |
| Template arg scanner | sub_67DC90 (scan_template_argument_list, 1,078 lines) |
| Token cache entry size | 80--112 bytes (8 cache entry kinds) |
| Scope entry size | 784 bytes (at qword_126C5E8) |
| GPU mode gate | dword_106C2C0 |
| Current token global | word_126DD58 |

Architecture

The lexer is organized as three layers that sit between the parser and the raw input buffer, each calling into the one below it:

Parser (expr.c, decls.c, statements.c)
  │
  ▼
get_next_token (sub_676860)         ← Cache management, macro rescan
  │
  ▼
get_token_main (sub_6810F0)         ← Keyword classification, CUDA gates
  │
  ▼
scan_token (sub_679800)             ← Character-level scanning
  │
  ▼
Input buffer (qword_126DDA0)        ← Raw bytes from source file

The parser never calls the character-level scanner directly. All token consumption flows through get_next_token, which checks the token cache and rescan lists before falling through to get_token_main. This layering allows the lexer to support lookahead, backtracking, macro expansion replay, and template argument rescanning without modifying the core scanner.

Token System

The 357 Token Kinds

Every token produced by the lexer carries a 16-bit token code stored in word_126DD58. The complete set of 357 token kinds is indexed through the name table at off_E6D240, which maps each token code to its string representation. The stop-token table at qword_126DB48 + 8 contains 357 boolean entries used by the error recovery scanner to identify synchronization points.

Token codes are assigned in blocks:

| Range | Category | Examples |
| --- | --- | --- |
| 1--51 | Operators and punctuation | +, -, *, /, (, ), {, }, ::, -> |
| 52--76 | Alternative tokens / digraphs | and, or, not, <%, %>, <:, :> |
| 77--108 | C89 keywords | auto(77), break(78), case(79), char(80), while(108) |
| 109--131 | C99/C11 keywords | restrict(119), _Bool(120), _Complex(121), _Imaginary(122) |
| 132--136 | MSVC keywords | __declspec(132), __int8(133), __int16(134), __int32(135), __int64(136) |
| 137--199 | C++ keywords | catch(150), class(151), template(160), decltype(185), typeof(189) |
| 200--206 | Compiler internal | Internal token kinds for the preprocessor |
| 207--330 | Type traits | __is_class(207), __has_trivial_copy, ..., NVIDIA-specific traits at 328--330 |
| 331--356 | Extended types / recent additions | _Float32(331)--_Float128(335), C++23/26 features |
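As a rough illustration of how the block assignment can be used, the sketch below maps a token code to its category by range. The helper name token_category and the token_range type are hypothetical; only the numeric boundaries come from the table above, and code 0 is simply reported as unknown because the table does not cover it.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical range record; boundaries are copied from the table above. */
typedef struct {
    int first, last;
    const char *category;
} token_range;

static const token_range TOKEN_RANGES[] = {
    {1,   51,  "operator/punctuation"},
    {52,  76,  "alternative token/digraph"},
    {77,  108, "C89 keyword"},
    {109, 131, "C99/C11 keyword"},
    {132, 136, "MSVC keyword"},
    {137, 199, "C++ keyword"},
    {200, 206, "compiler internal"},
    {207, 330, "type trait"},
    {331, 356, "extended type/recent addition"},
};

/* Returns the category name for a token code, or "unknown" when the
   code falls outside every listed range. */
const char *token_category(int code) {
    size_t n = sizeof TOKEN_RANGES / sizeof TOKEN_RANGES[0];
    for (size_t i = 0; i < n; i++) {
        if (code >= TOKEN_RANGES[i].first && code <= TOKEN_RANGES[i].last)
            return TOKEN_RANGES[i].category;
    }
    return "unknown";
}
```

A linear scan is enough here because there are only nine ranges; nothing in the binary suggests the lexer itself performs this lookup.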

CUDA-Specific Token Kinds

Three NVIDIA type-trait keywords occupy dedicated token codes registered during keyword_init:

| Token Code | Keyword | Purpose |
| --- | --- | --- |
| 328 | __nv_is_extended_device_lambda_closure_type | Tests if type is a device lambda |
| 329 | __nv_is_extended_host_device_lambda_closure_type | Tests if type is a host-device lambda |
| 330 | __nv_is_extended_device_lambda_with_preserved_return_type | Tests if device lambda preserves return type |

These are registered as standard type-trait keywords and participate in the same token classification path as the 60+ standard __is_xxx/__has_xxx traits.

Token State Globals

When a token is produced, the following globals are populated:

| Address | Name | Type | Description |
| --- | --- | --- | --- |
| word_126DD58 | current_token_code | WORD | 16-bit token kind (0--356) |
| qword_126DD38 | current_source_position | QWORD | Encoded file/line/column |
| qword_126DD48 | token_text_ptr | QWORD | Pointer to identifier/literal text |
| src | token_start_position | char* | Start of token in input buffer |
| n | token_text_length | size_t | Length of token text |
| dword_126DF90 | token_flags_1 | DWORD | Classification flags |
| dword_126DF8C | token_flags_2 | DWORD | Additional flags |
| qword_126DF80 | token_extra_data | QWORD | Context-dependent payload |
| xmmword_106C380--106C3B0 | identifier_lookup_result | 4 x 128-bit | SSE-packed lookup result for identifiers (64 bytes) |

The 64-byte identifier lookup result is stored in four adjacent 16-byte globals (xmmword_106C380 through xmmword_106C3B0) by the identifier classification path. When a scanned identifier is also a keyword, the lookup result contains the keyword's token code, scope information, and classification flags. The compiled code copies this packed state in bulk through SSE registers using movaps/movups instructions.

Token Cache

The token cache provides the lookahead, backtracking, and macro-expansion replay capabilities required by C++ parsing. Tokens are stored in a linked list of cache entries that can be consumed, rewound, and re-scanned.

Cache Entry Layout (80--112 bytes)

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| +0 | 8 | next | Next entry in cache linked list |
| +8 | 8 | source_position | Encoded source location |
| +16 | 2 | token_code | Token kind (0--356) |
| +18 | 1 | cache_entry_kind | Discriminator for payload type (see below) |
| +20 | 4 | flags | Token flags |
| +24 | 4 | extra_flags | Additional flags |
| +32 | 8 | extra_data | Context-dependent data |
| +40.. | varies | payload | Kind-specific payload data |
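The layout above can be written down as a C struct and checked with offsetof. This is a sketch, not recovered source: the struct name, the two padding members, and the use of a flexible array for the payload are assumptions, and the offsets only hold on a 64-bit (LP64) target where pointers are 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of the common cache entry header.
   Field names follow the table; pad19 and pad28 are assumed padding. */
typedef struct token_cache_entry {
    struct token_cache_entry *next;  /* +0  next entry in the list   */
    uint64_t source_position;        /* +8  encoded source location  */
    uint16_t token_code;             /* +16 token kind               */
    uint8_t  cache_entry_kind;       /* +18 payload discriminator    */
    uint8_t  pad19;                  /* +19 assumed padding          */
    uint32_t flags;                  /* +20 token flags              */
    uint32_t extra_flags;            /* +24 additional flags         */
    uint32_t pad28;                  /* +28 assumed padding          */
    uint64_t extra_data;             /* +32 context-dependent data   */
    unsigned char payload[];         /* +40 kind-specific payload    */
} token_cache_entry;

/* Compile-time checks that the struct matches the documented offsets. */
_Static_assert(offsetof(token_cache_entry, token_code) == 16, "token_code at +16");
_Static_assert(offsetof(token_cache_entry, cache_entry_kind) == 18, "kind at +18");
_Static_assert(offsetof(token_cache_entry, flags) == 20, "flags at +20");
_Static_assert(offsetof(token_cache_entry, extra_data) == 32, "extra_data at +32");
```

With a 40-byte header, the documented 80--112 byte entry sizes would leave 40--72 bytes for the kind-specific payload.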

Cache Entry Kinds

| Kind | Value | Payload | Description |
| --- | --- | --- | --- |
| identifier | 1 | Name pointer + lookup result | Identifier token with pre-resolved scope lookup |
| macro_def | 2 | Macro definition pointer | Macro definition for re-expansion (calls sub_5BA500) |
| pragma | 3 | Pragma data | Preprocessor pragma for deferred processing |
| pp_number | 4 | Number text | Preprocessing number (not yet classified as int/float) |
| (reserved) | 5 | -- | Not observed in use |
| string | 6 | String data + encoding | String literal token |
| (reserved) | 7 | -- | Not observed in use |
| concatenated_string | 8 | Concatenated string data | Wide or multi-piece concatenated string literal |

Cache Management Globals

| Address | Name | Description |
| --- | --- | --- |
| qword_1270150 | cached_token_rescan_list | Head of list of tokens to re-scan (pushed back for lookahead) |
| qword_1270128 | reusable_cache_stack | Stack of reusable cache entry blocks |
| qword_1270148 | free_token_list | Free list for recycling cache entries |
| qword_1270140 | macro_definition_chain | Active macro definition chain |
| dword_126DB74 | has_cached_tokens | Boolean flag: nonzero when cache is non-empty |

Cache Operations

| Address | Identity | Description |
| --- | --- | --- |
| sub_669650 | copy_tokens_from_cache | Copies cached preprocessor tokens for macro re-expansion (assert at lexical.c:3417) |
| sub_669D00 | allocate_token_cache_entry | Allocates from free list at qword_1270118 |
| sub_669EB0 | create_cached_token_node | Creates and initializes token cache node |
| sub_66A000 | append_to_token_cache | Appends token to cache list, maintains tail pointer |
| sub_66A140 | push_token_to_rescan_list | Pushes token onto rescan stack at qword_1270150 |
| sub_66A2C0 | free_single_cache_entry | Returns cache entry to free list |

Layer 1: scan_token (sub_679800)

scan_token is the character-level scanner. It reads raw bytes from the input buffer at qword_126DDA0, classifies them, and produces a single token. The function is 907 lines and dispatches on the first byte of each token.

Character Dispatch

The scanner reads the byte at the current input position and enters one of the following paths:

| First Byte | Action |
| --- | --- |
| 0x00 (NUL) | Control byte processing (8 embedded control types, see below) |
| 0x09 (TAB), 0x0B (VT), 0x0C (FF), 0x20 (space) | Whitespace -- advance and retry |
| a--z, A--Z, _ | Identifier or keyword scanning |
| 0--9 | Numeric literal scanning (decimal, hex, octal, binary) |
| ' | Character literal scanning |
| " | String literal scanning |
| / | Comment (// or /* */) or division operator |
| . | Dot operator, or float literal if followed by digit |
| < | Less-than, <=, <<, <<=, <=>, or template bracket |
| > | Greater-than, >=, >>, >>=, or template bracket |
| +, -, *, %, ^, ~, !, =, &, \| | Operator scanning (single or compound) |
| (, ), [, ], {, }, ;, ,, ?, @ | Single-character tokens |
| # | Preprocessor directive or stringification operator |
| \ | Universal character name (\uXXXX, \UXXXXXXXX) or line continuation |

Embedded Control Bytes (NUL Dispatch)

The input buffer uses embedded NUL bytes (0x00) as in-band control markers. When the scanner encounters a NUL, it reads the next byte as a control type code:

| Control Type | Value | Action |
| --- | --- | --- |
| Newline marker | 1 | End of line -- calls sub_6702F0 (refill_buffer) to read next source line |
| (reserved) | 2 | -- |
| Macro position | 3 | Macro expansion position marker -- calls sub_66A770 to update position tracking |
| End of directive | 4 | Marks end of a preprocessor directive |
| EOF (primary) | 5 | End of current source file -- pops file stack |
| Stale position | 6 | Invalid position marker -- emits diagnostic 1192 or 861 |
| Continuation | 7 | Backslash-newline continuation was here |
| EOF (secondary) | 8 | Secondary EOF marker for nested includes |

This in-band signaling approach avoids the cost of checking buffer boundaries on every character read. The refill_buffer function (sub_6702F0, 792 lines) places these marker bytes at the end of each source line, so the scanner can detect line endings and EOF without comparing the input pointer against a limit.
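The in-band marker scheme can be sketched as a loop that never compares its pointer against a buffer limit. The function and type names below are hypothetical, only the newline (1) and primary EOF (5) control values are taken from the table above, and the sketch assumes the caller guarantees that the buffer ends with a NUL followed by the EOF code.

```c
#include <assert.h>
#include <stddef.h>

/* Control type values from the table above. */
enum { CTRL_NEWLINE = 1, CTRL_EOF = 5 };

/* Hypothetical result record for this sketch. */
typedef struct {
    int lines;   /* newline markers seen   */
    int bytes;   /* ordinary source bytes  */
} scan_stats;

/* Walks a marker-terminated buffer. There is no end pointer: the loop
   stops only when it reads the NUL + CTRL_EOF pair. */
scan_stats scan_marked_buffer(const unsigned char *p) {
    scan_stats s = {0, 0};
    for (;;) {
        unsigned char c = *p++;
        if (c != 0) {                /* ordinary source byte */
            s.bytes++;
            continue;
        }
        unsigned char kind = *p++;   /* byte after NUL selects the control type */
        if (kind == CTRL_NEWLINE)
            s.lines++;
        else if (kind == CTRL_EOF)
            return s;
        /* other control types are ignored in this sketch */
    }
}
```

The cost of this design is that a buffer without a terminating EOF marker would be read past its end, which is why the refill code must always write the marker.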

Input Buffer System

| Address | Name | Description |
| --- | --- | --- |
| qword_126DDA0 | current_input_position | Read pointer into the input buffer |
| qword_126DDD8 | input_buffer_base | Start of the allocated input buffer |
| qword_126DDD0 | input_buffer_end | End of the allocated input buffer |
| qword_126DDF0 | file_stack | Stack of open source files (for #include) |
| qword_127FBA8 | current_file_handle | FILE* for the current source file |
| dword_127FBA0 | eof_flag | Set when current file reaches EOF |
| dword_127FB9C | multibyte_encoding_mode | Values >1 enable multibyte character decoding via sub_5B09B0 |
| dword_126DDA8 | source_line_counter | Lines read from current source file |
| dword_126DDBC | output_line_counter | Lines emitted to preprocessed output |

Buffer Refill: read_next_source_line (sub_66F4E0)

sub_66F4E0 (735 lines) reads the next line from the source file into the input buffer. It calls getc() for single-byte mode or sub_5B09B0 for multibyte mode (controlled by dword_127FB9C > 1). The function:

  1. Reads characters one at a time until newline or EOF
  2. Handles backslash-newline line splicing (joining continuation lines)
  3. Places control byte markers at newline positions (type 1) and EOF (type 5/8)
  4. Updates the line counter at dword_126DDA8
  5. Manages trigraph warnings (diagnostic 1750) through the companion function sub_6702F0

Layer 2: get_token_main (sub_6810F0)

get_token_main is the largest function in the lexer at 3,811 decompiled lines with approximately 300 local variables. It wraps scan_token and performs the complete token classification pipeline: keyword recognition, CUDA keyword gating, template parameter detection, operator overload name lookup, access specifier tracking, and namespace scope management.

Token Classification Pipeline

After scan_token produces a raw token, get_token_main performs these classification steps:

scan_token produces raw token
  │
  ├── Identifier?
  │     ├── Look up in keyword table
  │     │     ├── Standard C/C++ keyword → set token_code to keyword kind
  │     │     ├── CUDA keyword (dword_106C2C0 != 0) → set token_code
  │     │     ├── Type trait keyword → set token_code (207-330)
  │     │     └── Not a keyword → classify as identifier token
  │     │
  │     ├── Check template parameter context
  │     │     └── If inside template<>, classify as type-name or non-type
  │     │
  │     └── Entity lookup for context-sensitive classification
  │           ├── typedef name → classify as TYPE_NAME token
  │           ├── class/struct name → classify as CLASS_NAME
  │           ├── enum name → classify as ENUM_NAME
  │           ├── namespace name → classify as NAMESPACE_NAME
  │           └── template name → classify as TEMPLATE_NAME
  │
  ├── Numeric literal?
  │     └── Route to scan_numeric_literal (sub_672390)
  │
  ├── String/character literal?
  │     └── Handle encoding prefix (L, u8, u, U, R)
  │
  └── Operator/punctuation?
        ├── Check for template angle bracket context
        ├── Handle digraphs/alternative tokens
        └── Produce operator token code

CUDA Keyword Detection

CUDA keyword handling is gated on dword_106C2C0 (GPU mode). When this flag is nonzero, get_token_main recognizes CUDA-specific identifiers and routes them to the CUDA attribute processing path:

// Pseudocode from get_token_main
if (token_is_identifier) {
    // ... standard keyword lookup ...

    if (dword_106C2C0 != 0) {  // GPU mode active
        // Check for __device__, __host__, __global__,
        // __shared__, __constant__, __managed__,
        // __launch_bounds__, __grid_constant__
        // Route to CUDA attribute handlers
        if (dword_106BA08) {   // CUDA attribute processing enabled
            sub_74DC30(...);   // CUDA attribute resolution
            sub_74E240(...);   // CUDA attribute application
        }
    }
}

The GPU mode flag dword_106C2C0 is also checked during:

  • Attribute token processing in sub_686350 (handle_attribute_token, 584 lines)
  • Deferred diagnostic emission in sub_668660 (severity override via byte_126ED55)
  • Entity visibility computation in sub_669130

C++ Standard Version Gating

Throughout get_token_main, keyword classification is gated on the C++ standard version stored in dword_126EF68:

| Version Value | Standard | Keywords Enabled |
| --- | --- | --- |
| 201102 | C++11 | constexpr, decltype, nullptr, char16_t, char32_t, static_assert |
| 201402 | C++14 | binary literals, digit separators |
| 201703 | C++17 | if constexpr, char8_t, structured bindings |
| 202002 | C++20 | concept, requires, co_yield, co_return, co_await, consteval, constinit |
| 202302 | C++23 | typeof, typeof_unqual, extended digit separators |
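One way to picture this gating is a table of minimum version values consulted when an identifier is classified. The sketch below is illustrative only: keyword_enabled, the gated_keyword table, and the use of a >= comparison are assumptions rather than recovered code; the keywords and version values are a subset of the table above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record: a keyword and the lowest standard version value
   at which it is treated as a keyword. */
typedef struct {
    const char *keyword;
    long min_version;
} gated_keyword;

static const gated_keyword GATED[] = {
    {"constexpr", 201102L}, {"decltype",  201102L}, {"nullptr",   201102L},
    {"concept",   202002L}, {"requires",  202002L}, {"consteval", 202002L},
    {"constinit", 202002L},
};

/* Returns 1 if name is a keyword under std_version, 0 if it is left as
   an ordinary identifier (either gated off or not in the table). */
int keyword_enabled(const char *name, long std_version) {
    size_t n = sizeof GATED / sizeof GATED[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, GATED[i].keyword) == 0)
            return std_version >= GATED[i].min_version;
    }
    return 0;
}
```

Under this model, consteval is an ordinary identifier when the version value is 201703 and a keyword once it reaches 202002.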

The language mode at dword_126EFB4 controls broader dialect selection:

| Value | Mode | Effect |
| --- | --- | --- |
| 1 | GNU/default | GNU extensions enabled, alternative tokens recognized |
| 2 | MSVC | MSVC keywords enabled (__declspec, __int8--__int64), some GNU extensions disabled |

Context-Sensitive Token Classification

C++ requires the lexer to classify identifiers based on declaration context. The functions supporting this classification:

| Address | Identity | Description |
| --- | --- | --- |
| sub_668C90 | classify_identifier_entity | Dispatches on entity kind: typedef(3), class(4,5), function(7,9), namespace(19-22) |
| sub_668E00 | resolve_entity_through_alias | Walks typedef/using chains (kind=3 with +104 flag, kind=16 → **[+88]) |
| sub_668F80 | get_resolved_entity_type | Resolves entity to underlying type through alias chains |
| sub_668900 | handle_token_identifier_type_check | Determines if token is identifier vs typename vs template |
| sub_666720 | select_dual_lookup_symbol | Selects between two candidate symbols in dual-scope lookup (372 lines) |

Entity classification reads the entity_kind byte at offset +80 of entity nodes:

switch (entity->kind) {    // offset +80
    case 3:                // typedef
        return TYPE_NAME;
    case 4: case 5:        // class / struct
        return CLASS_NAME;
    case 6:                // enum
        return ENUM_NAME;
    case 7:                // function
        return IDENTIFIER;
    case 9: case 10:       // namespace / namespace alias
        return NAMESPACE_NAME;
    case 19: case 20: case 21: case 22:  // template kinds
        return TEMPLATE_NAME;
    case 16:               // using declaration
        return resolve_through_using(entity);
    case 24:               // namespace alias (resolved)
        return NAMESPACE_NAME;
}

Layer 3: get_next_token (sub_676860)

get_next_token (1,995 lines) is the token delivery function called by the parser. It manages the token cache, handles macro expansion replay, and calls get_token_main only when no cached tokens are available.

Token Delivery Flow

get_next_token (sub_676860)
  │
  ├── Check cached_token_rescan_list (qword_1270150)
  │     └── If non-empty: pop token, dispatch on cache_entry_kind
  │           ├── kind 1 (identifier): load xmmword_106C380..106C3B0
  │           ├── kind 2 (macro_def): call sub_5BA500 (macro expansion)
  │           ├── kind 3 (pragma): process deferred pragma
  │           ├── kind 4 (pp_number): return as-is
  │           ├── kind 6 (string): return string token
  │           └── kind 8 (concatenated_string): return concatenated string
  │
  ├── Check reusable_cache_stack (qword_1270128)
  │     └── If non-empty: pop and return cached token
  │           (assert: "get_token_from_reusable_cache_stack" at 4450, 4469)
  │
  ├── Check pending_macro_arg (qword_106B8A0)
  │     └── If set: process macro argument token
  │
  └── Fall through to get_token_main (sub_6810F0)
        └── Full token acquisition from source

The function sets the following globals on every token delivery:

  • word_126DD58 = token code
  • qword_126DD38 = source position
  • dword_126DF90 = token flags 1
  • dword_126DF8C = token flags 2
  • qword_126DF80 = extra data
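The core ordering of this flow (serve pushed-back tokens first, scan new input only when none remain) can be sketched in a few lines. Everything named here is hypothetical: the lexer struct stands in for the globals, an int array stands in for get_token_main, and returning 0 at end of input is an assumption. The reusable cache stack and macro argument paths are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache node: just a link and a token code. */
typedef struct cached_token {
    struct cached_token *next;
    int token_code;
} cached_token;

/* Hypothetical lexer state for this sketch. */
typedef struct {
    cached_token *rescan_list;  /* plays the role of qword_1270150 */
    const int *source;          /* stands in for get_token_main    */
    size_t source_len;
    size_t source_pos;
} lexer;

/* Pushes a token back so the next delivery returns it again (LIFO). */
void push_back(lexer *lx, cached_token *node, int code) {
    node->token_code = code;
    node->next = lx->rescan_list;
    lx->rescan_list = node;
}

/* Delivers the next token: rescan list first, then fresh input.
   Returns 0 when both are exhausted. */
int next_token(lexer *lx) {
    if (lx->rescan_list != NULL) {
        cached_token *node = lx->rescan_list;
        lx->rescan_list = node->next;
        return node->token_code;
    }
    if (lx->source_pos < lx->source_len)
        return lx->source[lx->source_pos++];
    return 0;
}
```

Because the parser only ever calls the delivery function, a token that was pushed back is indistinguishable from one freshly scanned, which is what makes lookahead and backtracking cheap.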

CUDA Attribute Token Interception

When CUDA attribute processing is enabled (dword_106BA08 != 0), get_next_token intercepts identifier tokens and routes them through CUDA attribute resolution via sub_74DC30 and sub_74E240. This allows CUDA execution-space attributes (__device__, __host__, __global__) to be recognized at the token level rather than requiring full declaration parsing.

Numeric Literal Scanner: scan_numeric_literal (sub_672390)

The numeric literal scanner is 1,571 lines and handles every numeric literal format defined by C89 through C++23.

Literal Prefix Dispatch

scan_numeric_literal
  │
  ├── First char '0':
  │     ├── 0x/0X → hex literal (isxdigit validation)
  │     ├── 0b/0B → binary literal (C++14)
  │     ├── 0[0-7] → octal literal
  │     └── 0 alone → decimal zero
  │
  ├── First char '1'-'9':
  │     └── decimal literal
  │
  └── After integer part:
        ├── '.' → floating-point literal
        ├── 'e'/'E' → decimal float exponent
        ├── 'p'/'P' → hex float exponent
        └── suffix → type suffix parsing

C++14 Digit Separators

Digit separators (' characters within numeric literals) are handled through a two-flag system:

| Address | Name | Purpose |
| --- | --- | --- |
| dword_126EEFC | cpp14_digit_separators_enabled | Master enable for digit separator support |
| dword_126DB58 | digit_separator_seen | Set when a separator is encountered in the current literal |

When dword_126EEFC is enabled, the scanner accepts ' between digits:

// Digit separator handling in scan_numeric_literal
while (isdigit(*pos) || (*pos == '\'' && dword_126EEFC)) {
    if (*pos == '\'') {
        dword_126DB58 = 1;  // mark separator seen
        pos++;
        if (!isdigit(*pos))
            emit_diagnostic(2629);  // separator not followed by digit
        continue;
    }
    // process digit...
    pos++;  // advance past the digit so the loop makes progress
}
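A self-contained version of the same loop, restricted to decimal digits, might look like the following. The function and struct names are hypothetical, the enable flag is passed as a parameter instead of read from dword_126EEFC, an error flag replaces the diagnostic call, and overflow of the accumulated value is not checked.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical result record for this sketch. */
typedef struct {
    unsigned long value;   /* accumulated decimal value (no overflow check) */
    int separator_seen;    /* mirrors digit_separator_seen                  */
    int error;             /* 1 if a separator is not between two digits    */
    int consumed;          /* number of characters consumed                 */
} decimal_scan;

decimal_scan scan_decimal(const char *pos, int separators_enabled) {
    decimal_scan r = {0, 0, 0, 0};
    const char *start = pos;
    while (isdigit((unsigned char)*pos) ||
           (*pos == '\'' && separators_enabled)) {
        if (*pos == '\'') {
            /* A separator must sit between two digits. */
            if (pos == start || !isdigit((unsigned char)pos[1])) {
                r.error = 1;
                break;
            }
            r.separator_seen = 1;
            pos++;
            continue;
        }
        r.value = r.value * 10 + (unsigned long)(*pos - '0');
        pos++;
    }
    r.consumed = (int)(pos - start);
    return r;
}
```

With separators disabled, the quote character simply ends the literal, so 1'000 scans as the single digit 1.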

C++23 extended digit separators (for binary, octal, hex) are gated on dword_126EF68 > 202302:

if (dword_126EF68 > 202302) {
    // C++23: allow digit separators in binary/octal/hex
} else {
    emit_diagnostic(2628);  // C++23 feature used in earlier mode
}

Integer Suffix Parsing

sub_6748A0 (convert_integer_suffix, 137 lines) parses the following suffixes:

| Suffix | Type |
| --- | --- |
| (none) | int (or promoted per value) |
| u / U | unsigned int |
| l / L | long |
| ll / LL | long long |
| ul / UL | unsigned long |
| ull / ULL | unsigned long long |
| z / Z | signed counterpart of size_t (C++23) |
| uz / UZ | size_t (C++23) |

sub_674BB0 (determine_numeric_literal_type, 400 lines) applies the C++ promotion rules based on the literal value and suffix to determine the final type.
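Suffix parsing of this kind can be sketched as a small loop that accepts each component at most once. The names below are hypothetical and the structure is not taken from sub_6748A0; the accepted combinations follow the C++ grammar, where the u part may come before or after the length part and ll must not mix cases.

```c
#include <assert.h>

/* Hypothetical decoded suffix for this sketch. */
typedef struct {
    int is_unsigned;  /* u or U present            */
    int long_count;   /* 0, 1 (l/L) or 2 (ll/LL)   */
    int is_size;      /* z or Z present            */
    int valid;        /* 0 if the suffix is malformed */
} int_suffix;

int_suffix parse_int_suffix(const char *s) {
    int_suffix r = {0, 0, 0, 1};
    while (*s) {
        char c = *s;
        if ((c == 'u' || c == 'U') && !r.is_unsigned) {
            r.is_unsigned = 1;
            s++;
        } else if ((c == 'l' || c == 'L') && r.long_count == 0 && !r.is_size) {
            if (s[1] == c) {          /* ll or LL, same case only */
                r.long_count = 2;
                s += 2;
            } else {
                r.long_count = 1;
                s++;
            }
        } else if ((c == 'z' || c == 'Z') && !r.is_size && r.long_count == 0) {
            r.is_size = 1;
            s++;
        } else {
            r.valid = 0;              /* repeated or unknown component */
            break;
        }
    }
    return r;
}
```

The decoded flags would then feed the type determination step, which also needs the literal's value and base.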

Floating-Point Literal Handling

| Address | Identity | Description |
| --- | --- | --- |
| sub_675390 | scan_float_exponent | Scans e/E/p/P exponent suffix (57 lines) |
| sub_6754B0 | convert_float_literal | Converts float literal string to value (338 lines) |

Float suffixes: f/F (float), l/L (long double), none (double).

Universal Character Names: scan_universal_character (sub_6711E0)

sub_6711E0 (278 lines, assert at lexical.c:12384) scans \uXXXX and \UXXXXXXXX universal character names in identifiers and string/character literals.

#include <ctype.h>
#include <stdint.h>

// input points at the backslash of \uXXXX or \UXXXXXXXX
void scan_universal_character(const char *input, uint32_t *result) {
    int width;
    if (input[1] == 'u')
        width = 4;    // \uXXXX
    else
        width = 8;    // \UXXXXXXXX
    input += 2;       // skip the "\u" / "\U" prefix before reading digits

    uint32_t value = 0;
    for (int i = 0; i < width; i++) {
        unsigned char c = (unsigned char)*input++;
        if (!isxdigit(c)) {
            // emit error diagnostic
            return;
        }
        uint32_t digit;
        if (c >= '0' && c <= '9')
            digit = c - 48u;     // '0' = 48
        else if (islower(c))
            digit = c - 87u;     // 'a' = 97, 97-87 = 10
        else
            digit = c - 55u;     // 'A' = 65, 65-55 = 10
        value = (value << 4) | digit;
    }
    *result = value;
}

sub_671870 (validate_universal_character_value, 62 lines) performs range checking after scanning: surrogate code points (0xD800--0xDFFF) are rejected, and values outside the valid Unicode range (> 0x10FFFF) produce an error.
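The two range checks named above amount to testing for a Unicode scalar value. A minimal sketch, with a hypothetical function name and a boolean result in place of the diagnostics; any further restrictions the real validator applies are not modelled:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if v is a Unicode scalar value: not a surrogate code point
   and not above the last code point U+10FFFF. */
int is_valid_ucn_value(uint32_t v) {
    if (v >= 0xD800u && v <= 0xDFFFu)
        return 0;               /* surrogate range */
    if (v > 0x10FFFFu)
        return 0;               /* beyond the Unicode code space */
    return 1;
}
```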

The feature is controlled by dword_106BCC4 (universal characters enabled) and dword_106BD4C (extended character mode).

Keyword Registration: keyword_init (sub_5863A0)

sub_5863A0 (1,113 lines, in fe_init.c) registers all C/C++ keywords with the symbol table during frontend initialization. It calls sub_7463B0 (enter_keyword) once per keyword, passing the token ID and string representation. GNU double-underscore variants are registered via sub_585B10, and alternative tokens via sub_749600.

Keyword Categories and Version Gating

Keywords are registered conditionally based on language mode and standard version:

keyword_init (sub_5863A0)
  │
  ├── C89 core (always registered)
  │     auto(77), break(78), case(79), char(80), continue(82),
  │     default(83), do(84), double(85), else(86), enum(87),
  │     extern(88), float(89), for(90), goto(91), if(92),
  │     int(93), long(94), register(95), return(96), short(97),
  │     sizeof(99), static(100), struct(101), switch(102),
  │     typedef(103), union(104), unsigned(105), void(106), while(108)
  │
  ├── C99 (gated on C99+ mode)
  │     _Bool(120), _Complex(121), _Imaginary(122), restrict(119)
  │
  ├── C11 (gated on C11+ mode)
  │     _Generic(262), _Atomic(263), _Alignof(247), _Alignas(248),
  │     _Thread_local(194), _Static_assert(184), _Noreturn(260)
  │
  ├── C23 (gated on C23 mode)
  │     bool, true, false, alignof, alignas, static_assert,
  │     thread_local, typeof(189), typeof_unqual(190)
  │
  ├── C++ core (gated on C++ mode: dword_126EFB4 == 2)
  │     catch(150), class(151), friend(153), inline(154),
  │     mutable(174), operator(156), new(155), delete(152),
  │     private(157), protected(158), public(159), template(160),
  │     this(161), throw(162), try(163), virtual(164),
  │     namespace(175), using(179), typename(183), typeid(178),
  │     const_cast(166), dynamic_cast(167), static_cast(177),
  │     reinterpret_cast(176)
  │
  ├── C++ alternative tokens (gated on C++ mode)
  │     and(52), and_eq(64), bitand(33), bitor(51), compl(37),
  │     not(38), not_eq(48), or(53), or_eq(66), xor(50), xor_eq(65)
  │
  ├── C++ modern keywords (gated on standard version)
  │     C++11: constexpr(244), decltype(185), nullptr(237),
  │            char16_t(126), char32_t(127)
  │     C++17: char8_t(128)
  │     C++20: consteval(245), constinit(246), co_yield(267),
  │            co_return(268), co_await(269), concept(295), requires(294)
  │     C++23: typeof(189), typeof_unqual(190)
  │
  ├── GNU extensions (gated on dword_126EFA8)
  │     __extension__(187), __auto_type(186), __attribute(142),
  │     __builtin_offsetof(117), __builtin_types_compatible_p(143),
  │     __builtin_shufflevector(258), __builtin_convertvector(259),
  │     __builtin_complex(261), __builtin_has_attribute(296),
  │     __builtin_addressof(271), __builtin_bit_cast(297),
  │     __int128(239), __bases(249), __direct_bases(250),
  │     _Float32(331), _Float32x(332), _Float64(333),
  │     _Float64x(334), _Float128(335)
  │
  ├── MSVC extensions (gated on dword_126EFB0)
  │     __declspec(132), __int8(133), __int16(134),
  │     __int32(135), __int64(136)
  │
  ├── Clang extensions (gated on Clang version at qword_126EF90)
  │     _Nullable(264), _Nonnull(265), _Null_unspecified(266)
  │
  ├── Type traits (60+, gated by standard version)
  │     __is_class(207), __is_enum, __is_union, __has_trivial_copy,
  │     __has_virtual_destructor, ... through token code 327
  │
  ├── NVIDIA CUDA type traits (gated on GPU mode)
  │     __nv_is_extended_device_lambda_closure_type(328),
  │     __nv_is_extended_host_device_lambda_closure_type(329),
  │     __nv_is_extended_device_lambda_with_preserved_return_type(330)
  │
  └── EDG internal keywords (always registered)
        __edg_type__(272), __edg_size_type__(277),
        __edg_ptrdiff_type__(278), __edg_bool_type__(279),
        __edg_wchar_type__(280), __edg_opnd__(282),
        __edg_throw__(281), __edg_is_deducible(304),
        __edg_vector_type__(273), __edg_neon_vector_type__(274)

Version gating globals used during keyword registration:

| Address | Name | Values |
| --- | --- | --- |
| dword_126EFB4 | language_mode | 1 = K&R C / GNU default, 2 = C++ |
| dword_126EF68 | cpp_standard_version | 199900, 201102, 201402, 201703, 202002, 202302 |
| qword_126EF98 | gnu_version | e.g., 0x9FC3 = 40899 (GCC 4.8.99 in major*10000 + minor*100 + patch form) |
| qword_126EF90 | clang_version | e.g., 0x15F8F, 0x1D4BF |
| dword_126EFA8 | gnu_extensions_enabled | Boolean |
| dword_126EFA4 | extensions_enabled | Boolean (Clang compat) |
| dword_126EFAC | c_language_mode | Boolean: C vs C++ |
| dword_126EFB0 | microsoft_extensions_enabled | Boolean |

String and Character Literal Scanning

Character Literal Scanning

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_66CB30 | scan_character_literal_prefix | 34 | Detects encoding prefix (L, u, U, u8) |
| sub_66CBD0 | scan_character_literal | 111 | Scans 'x' / L'x' / u'x' / U'x' / u8'x' literals |

String Literal Scanning

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_66C550 | scan_string_literal | 356 | Scans quoted string literals with escape sequences |
| sub_676080 | scan_raw_string_literal | 391 | Scans R"delimiter(content)delimiter" raw strings |
| sub_66E6E0 | scan_identifier_suffix | 94 | Checks for user-defined literal suffixes (C++11) |
| sub_66E920 | is_valid_ud_suffix | 51 | Validates user-defined literal suffix names |
| sub_6892F0 | string_literal_concatenation_check | 107 | Checks adjacent string literal tokens for concatenation |
| sub_689550 | process_user_defined_literal | 332 | Handles C++11 UDL operator lookup |

Encoding Prefixes

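The lexer distinguishes five string literal encodings (four prefixes plus the unprefixed form), each producing a different string literal type:

| Prefix | Token | Character Type | Width |
| --- | --- | --- | --- |
| (none) | "..." | char | 1 byte |
| L | L"..." | wchar_t | 4 bytes (Linux) |
| u8 | u8"..." | char8_t (C++20) / char | 1 byte |
| u | u"..." | char16_t | 2 bytes |
| U | U"..." | char32_t | 4 bytes |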

Scope Entry Layout

The lexer interacts heavily with the scope system. Scope entries are 784-byte records stored in an array at qword_126C5E8, indexed by dword_126C5E4 (current scope index).

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| +0 | 4 | name_hash | Hash of scope name for lookup |
| +4 | 1 | scope_kind | Kind code (12 = file scope, see below) |
| +6 | 1 | scope_flags | Bit flags: bit 5 = inline namespace |
| +7 | 1 | access_flags | Bit 0 = in class context |
| +10 | 1 | extra_flags | Bit 0 = module scope |
| +12 | 1 | template_flags | Bit 0 = in template argument scan, bit 4 = has concepts |
| +24 | 8 | symbol_chain_or_hash_ptr | Head of symbol chain or hash table |
| +32 | 8 | hash_table_ptr | Hash table for O(1) lookup in large scopes |
| +192 | 8 | lazy_load_scope_ptr | Pointer for lazy symbol loading (calls sub_7C1900) |
| +208 | 4 | scope_depth | Nesting depth counter |
| +376 | 8 | parent_template_info | Template context for template scope entries |
| +416 | 8 | module_info | C++20 module partition data |
| +632 | 8 | class_info_ptr | Pointer to class descriptor for class scopes |

Scope-related globals:

| Address | Name | Description |
| --- | --- | --- |
| dword_126C5E4 | current_scope_index | Index into scope table |
| dword_126C5C4 | class_scope_index | Innermost class scope (-1 if none) |
| dword_126C5C8 | namespace_scope_index | Innermost namespace scope (-1 if none) |
| dword_126C5DC | file_scope_index | File (global) scope index |
| xmmword_126C520 | entity_kind_to_language_mode_map | 32-entry table mapping entity kinds to required language modes |

Lexer State Stack

The lexer supports push/pop of its entire state for speculative parsing and template argument scanning.

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_688320 | push_lexical_state | 137 | Pushes current lexer state onto qword_126DB40 stack |
| sub_668330 | pop_lexical_state_stack_full | 166 | Pops state, restores stop-token table, macro chains (assert at lexical.c:17808) |

State stack nodes are 80-byte linked-list entries:

| Offset | Size | Field |
| --- | --- | --- |
| +0 | 8 | next (previous state) |
| +8 | 8 | cached_tokens |
| +16 | 8 | source_position |
| +24--+72 | 48 | token_cache_state (saved cache pointers and flags) |

The push/pop mechanism is used for:

  • Template argument list scanning (sub_67DC90, 1,078 lines)
  • Speculative parsing in disambiguation contexts
  • Macro expansion state save/restore

Template Argument Scanning: scan_template_argument_list (sub_67DC90)

sub_67DC90 (1,078 lines, assert at lexical.c:19918) scans template argument lists (<...>). This is one of the most complex lexer functions because of the >> ambiguity: in vector<vector<int>>, the closing >> must be split into two > tokens to close two template argument lists.

The scanner:

  1. Pushes lexer state and sets template argument scanning mode (scope entry offset +12, bit 0)
  2. Scans tokens while tracking nesting depth of <> pairs
  3. Handles nested template-ids recursively
  4. Creates token cache entries for deferred parsing
  5. Uses the scope system to classify identifiers within template arguments
  6. Disambiguates >> as either right-shift or double template close

The entity kind check at offset +80 (values 19--22) identifies template entities for recursive template-id scanning.
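The depth-tracking part of this process can be illustrated on a pre-split token list. This is a deliberate simplification with hypothetical names: every "<" is assumed to open a template argument list (the real scanner decides that through entity lookup), and ">>" is split only when at least two lists are open, otherwise it is counted as a right shift.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of one pass over the tokens. */
typedef struct {
    int closes;       /* template argument lists closed      */
    int shifts;       /* ">>" tokens kept as right shift      */
    int final_depth;  /* lists still open at the end          */
} angle_stats;

angle_stats scan_angles(const char *const *tokens, size_t n) {
    angle_stats s = {0, 0, 0};
    int depth = 0;
    for (size_t i = 0; i < n; i++) {
        const char *t = tokens[i];
        if (strcmp(t, "<") == 0) {
            depth++;                       /* assumed template opener */
        } else if (strcmp(t, ">") == 0 && depth > 0) {
            depth--;
            s.closes++;
        } else if (strcmp(t, ">>") == 0) {
            if (depth >= 2) {              /* split into two '>' closers */
                depth -= 2;
                s.closes += 2;
            } else {
                s.shifts++;                /* treat as right shift */
            }
        }
    }
    s.final_depth = depth;
    return s;
}
```

For vector<vector<int>> the single >> token closes both lists, while in a >> 2 with no open list it stays a shift.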

Preprocessor Integration

The lexer handles several preprocessor-related responsibilities:

Source Position Tracking

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_66D100 | set_source_position | 282 | Converts raw input position to file/line/column (called from dozens of locations) |
| sub_66D5E0 | emit_output_line | 491 | Emits source text and #line directives to preprocessed output |
| sub_66B1F0 | emit_preprocessed_output | 231 | Outputs #line directives via qword_106C280 (output FILE*) |

Macro Expansion Support

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_66A770 | lookup_macro_at_position | 41 | Scans macro chain (qword_126DD80) for macro enclosing given position |
| sub_66A7F0 | create_macro_expansion_record | 44 | Allocates macro expansion tracking node |
| sub_66A890 | push_macro_expansion | 41 | Pushes new expansion onto active stack |
| sub_66A940 | pop_macro_expansion | 28 | Pops expansion from stack |
| sub_66A9D0 | is_in_macro_expansion | 12 | Returns whether currently inside macro expansion |
| sub_66A9F0 | get_macro_expansion_depth | 17 | Returns nesting depth of macro expansions |
| sub_66A310 | invalidate_macro_node | 56 | Clears macro definition when it goes out of scope |
| sub_66A5E0 | free_macro_definition_chain | 91 | Walks and frees macro chain via qword_126DD70 / qword_126DDE0 |

Include File Handling

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_66BB50 | open_source_file | 332 | Opens include files via sub_4F4970 (fopen wrapper), creates file tracking nodes |
| sub_66EA70 | open_next_input_file | 364 | Opens next input source after current file ends, manages include-stack unwinding |
| sub_67BAB0 | scan_header_name | 110 | Scans <filename> or "filename" for #include directives |

Token Pasting and Stringification

| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_67D1E0 | handle_token_pasting | 117 | Implements `##` preprocessor operator |
| sub_67D440 | stringify_token | 251 | Implements `#` preprocessor operator |
| sub_67D050 | check_token_paste_validity | 57 | Validates token paste produces a valid token |
| sub_67D900 | expand_macro_argument | 204 | Expands a single macro argument during substitution |
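
The two operators can be sketched on plain spellings. Both helper names and the buffer-based interface are hypothetical. The stringify rule follows the C and C++ standards: `"` and `\` are escaped only inside string and character literals. For brevity the sketch detects only unprefixed literals, and the paste check accepts only results that form an identifier, whereas a full validity check would accept any single valid token.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* "#" operator on one token spelling: wrap it in quotes, escaping '"' and
 * '\\' when the token is itself an (unprefixed) string or char literal. */
static bool stringify_spelling(const char *spelling, char *out, size_t cap) {
    bool literal = spelling[0] == '"' || spelling[0] == '\'';
    size_t n = 0;
    if (cap < 3)
        return false;
    out[n++] = '"';
    for (const char *p = spelling; *p != '\0'; p++) {
        bool esc = literal && (*p == '"' || *p == '\\');
        /* keep room for this char, the closing quote, and the NUL */
        if (n + (esc ? 2u : 1u) + 2u > cap)
            return false;
        if (esc)
            out[n++] = '\\';
        out[n++] = *p;
    }
    out[n++] = '"';
    out[n] = '\0';
    return true;
}

/* "##" operator, restricted to identifier results for this sketch. */
static bool paste_identifiers(const char *lhs, const char *rhs,
                              char *out, size_t cap) {
    int n = snprintf(out, cap, "%s%s", lhs, rhs);
    if (n <= 0 || (size_t)n >= cap)
        return false;
    if (!(isalpha((unsigned char)out[0]) || out[0] == '_'))
        return false;
    for (int i = 1; i < n; i++)
        if (!(isalnum((unsigned char)out[i]) || out[i] == '_'))
            return false;
    return true;
}
```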

Operator Scanning

Multi-character operators are scanned by a set of dedicated functions in the 0x67ABB0--0x67BAB0 range. The scanner reads the first operator character and dispatches to the appropriate function to check for compound operators:

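| First Char | Possible Tokens |
|---|---|
| `<` | `<`, `<=`, `<<`, `<<=`, `<=>`, `<%` (digraph `{`), `<:` (digraph `[`) |
| `>` | `>`, `>=`, `>>`, `>>=` |
| `+` | `+`, `++`, `+=` |
| `-` | `-`, `--`, `-=`, `->`, `->*` |
| `*` | `*`, `*=` |
| `&` | `&`, `&&`, `&=` |
| `\|` | `\|`, `\|\|`, `\|=` |
| `=` | `=`, `==` |
| `!` | `!`, `!=` |
| `:` | `:`, `::` |
| `.` | `.`, `...`, `.*` |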
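
The dispatch described above is a longest-match ("maximal munch") scan: after the first character, the scanner looks ahead to pick the longest operator that fits. A minimal sketch for the `<` row follows. The enum and function name are hypothetical. The `<::` special case comes from the C++ standard's lexing rules; whether the binary handles it in this particular function is not confirmed.

```c
#include <stddef.h>

/* Hypothetical token kinds for the '<' family. */
enum op {
    OP_LT, OP_LE, OP_SHL, OP_SHL_ASSIGN, OP_SPACESHIP,
    OP_LBRACE_DIGRAPH, OP_LBRACKET_DIGRAPH
};

/* p points at '<' in a NUL-terminated buffer. Stores the number of
 * characters consumed in *len and returns the token kind. */
static enum op scan_less_than(const char *p, size_t *len) {
    switch (p[1]) {
    case '=':
        if (p[2] == '>') { *len = 3; return OP_SPACESHIP; }
        *len = 2; return OP_LE;
    case '<':
        if (p[2] == '=') { *len = 3; return OP_SHL_ASSIGN; }
        *len = 2; return OP_SHL;
    case '%':
        *len = 2; return OP_LBRACE_DIGRAPH;
    case ':':
        /* "<::" not followed by ':' or '>' lexes as '<' then "::",
         * so that A<::B> works without a space. */
        if (p[2] == ':' && p[3] != ':' && p[3] != '>') {
            *len = 1; return OP_LT;
        }
        *len = 2; return OP_LBRACKET_DIGRAPH;
    default:
        *len = 1; return OP_LT;
    }
}
```

Each lookahead index is read only after the previous character is known to be non-NUL, so the function never reads past the terminator.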

Template Angle Bracket Disambiguation

sub_67CB70 (handle_template_angle_brackets, 263 lines) handles the critical disambiguation of < and > in template contexts. In template argument lists, < opens and > closes, but in expressions, they are comparison operators. The function uses scope context information and the current parsing state (from the 784-byte scope entries) to make the determination.

Error Recovery

| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_6887C0 | skip_to_token | 317 | Error recovery: skips tokens until finding a synchronization point (`;`, `}`, etc.) |
| sub_6886F0 | expect_token | 31 | Checks current token matches expected kind, emits diagnostic on mismatch |
| sub_688560 | peek_next_token | 44 | Looks ahead at next token without consuming it |

The stop-token table at qword_126DB48 + 8 (357 entries) controls which token kinds are valid synchronization points for error recovery.
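
A table indexed by token kind makes the synchronization test a single array lookup. The sketch below illustrates that idea only: the `stop_tokens` array stands in for the table at qword_126DB48 + 8, the token stream is a plain array rather than the lexer's delivery function, and `TOK_EOF = 0` is an assumed value.

```c
#include <stddef.h>

/* 357 comes from the text; the EOF kind value is hypothetical. */
enum { TOKEN_KIND_COUNT = 357, TOK_EOF = 0 };

/* Stand-in for the stop-token table: nonzero marks a synchronization point. */
static unsigned char stop_tokens[TOKEN_KIND_COUNT];

/* Advance *cursor through tokens[] until it rests on a stop token or the
 * input ends. Returns the kind stopped at, or TOK_EOF if none was found. */
static int skip_to_token(const int *tokens, size_t count, size_t *cursor) {
    while (*cursor < count) {
        int kind = tokens[*cursor];
        if (kind == TOK_EOF)
            return TOK_EOF;
        if (kind > 0 && kind < TOKEN_KIND_COUNT && stop_tokens[kind])
            return kind;          /* leave the cursor on the stop token */
        (*cursor)++;
    }
    return TOK_EOF;
}
```

A caller would set the entries for the kinds it can resume at (for example the kinds for `;` and `}`) before invoking the skip, then continue parsing from the returned token.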

Built-in Type and Attribute Handling

| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_685AB0 | handle_builtin_type_token | 289 | Processes built-in type keywords (int, float, etc.) into type tokens |
| sub_685F10 | process_decltype_token | 212 | Handles decltype() expression in token stream |
| sub_686350 | handle_attribute_token | 584 | Processes `[[attribute]]` and `__attribute__((x))` syntax, including CUDA attributes |
| sub_686F40 | process_asm_or_extension_keyword | 244 | Handles asm, `__asm__`, and extension keywords |

Diagnostic Strings

| String | Source | Condition |
|---|---|---|
| "pop_lexical_state_stack_full" | sub_668330 | Assert at lexical.c:17808 |
| "copy_tokens_from_cache" | sub_669650 | Assert at lexical.c:3417 |
| "scan_universal_character" | sub_6711E0 | Assert at lexical.c:12384 |
| "get_token_from_cached_token_rescan_list" | sub_676860 | Assert at lexical.c:4302 |
| "get_token_from_reusable_cache_stack" | sub_676860 | Assert at lexical.c:4450, 4469 |
| "scan_template_argument_list" | sub_67DC90 | Assert at lexical.c:19918 |
| "select_dual_lookup_symbol" | sub_666720 | Assert at lexical.c:22477 |
| "keyword_init" | sub_5863A0 | Assert at fe_init.c:1597 |
| "fe_translation_unit_init" | sub_5863A0 | Assert at fe_init.c:2373 |

| Diagnostic Code | Context | Meaning |
|---|---|---|
| 870 | Character literal scanning | Invalid character in literal |
| 912 | select_dual_lookup_symbol | Ambiguous lookup result |
| 1192 | Control byte type 6 | Stale source position marker |
| 861 | Control byte type 6 | Invalid position reference |
| 1665 | check_deferred_diagnostics | Deferred macro-related warning |
| 1750 | refill_buffer | Trigraph sequence warning |
| 2628 | Numeric literal scanner | C++23 digit separator used in earlier mode |
| 2629 | Numeric literal scanner | Digit separator not followed by digit |

Function Map

| Address | Identity | Confidence | Lines | EDG Source |
|---|---|---|---|---|
| sub_5863A0 | keyword_init / fe_translation_unit_init | 98% | 1,113 | fe_init.c:1597 |
| sub_666720 | select_dual_lookup_symbol | HIGH | 372 | lexical.c:22477 |
| sub_668330 | pop_lexical_state_stack_full | HIGH | 166 | lexical.c:17808 |
| sub_668660 | check_deferred_diagnostics | MEDIUM | 104 | lexical.c |
| sub_6688A0 | get_scope_from_entity | HIGH | 32 | lexical.c |
| sub_668C90 | classify_identifier_entity | MEDIUM | 89 | lexical.c |
| sub_668E00 | resolve_entity_through_alias | MEDIUM | 88 | lexical.c |
| sub_669650 | copy_tokens_from_cache | HIGH | 385 | lexical.c:3417 |
| sub_669D00 | allocate_token_cache_entry | MEDIUM | 119 | lexical.c |
| sub_66A000 | append_to_token_cache | MEDIUM | 88 | lexical.c |
| sub_66A140 | push_token_to_rescan_list | MEDIUM | 46 | lexical.c |
| sub_66A3F0 | create_source_region_node | MEDIUM | 84 | lexical.c |
| sub_66A5E0 | free_macro_definition_chain | MEDIUM | 91 | lexical.c |
| sub_66A770 | lookup_macro_at_position | MEDIUM | 41 | lexical.c |
| sub_66A890 | push_macro_expansion | MEDIUM | 41 | lexical.c |
| sub_66AA50 | process_preprocessor_directive | MEDIUM | 380 | lexical.c |
| sub_66B1F0 | emit_preprocessed_output | MEDIUM | 231 | lexical.c |
| sub_66B910 | skip_whitespace_and_comments | MEDIUM | 105 | lexical.c |
| sub_66BB50 | open_source_file | HIGH | 332 | lexical.c |
| sub_66C550 | scan_string_literal | MEDIUM | 356 | lexical.c |
| sub_66CBD0 | scan_character_literal | MEDIUM | 111 | lexical.c |
| sub_66D100 | set_source_position | HIGH | 282 | lexical.c |
| sub_66D5E0 | emit_output_line | HIGH | 491 | lexical.c |
| sub_66DFF0 | scan_pp_number | MEDIUM | 268 | lexical.c |
| sub_66EA70 | open_next_input_file | MEDIUM | 364 | lexical.c |
| sub_66F4E0 | read_next_source_line | HIGH | 735 | lexical.c |
| sub_6702F0 | refill_buffer | HIGH | 792 | lexical.c |
| sub_6711E0 | scan_universal_character | HIGH | 278 | lexical.c:12384 |
| sub_671870 | validate_universal_character_value | MEDIUM | 62 | lexical.c |
| sub_6719B0 | scan_identifier_or_keyword | HIGH | 400 | lexical.c |
| sub_672390 | scan_numeric_literal | HIGH | 1,571 | lexical.c |
| sub_6748A0 | convert_integer_suffix | MEDIUM | 137 | lexical.c |
| sub_674BB0 | determine_numeric_literal_type | MEDIUM | 400 | lexical.c |
| sub_675390 | scan_float_exponent | MEDIUM | 57 | lexical.c |
| sub_6754B0 | convert_float_literal | MEDIUM | 338 | lexical.c |
| sub_676080 | scan_raw_string_literal | MEDIUM-HIGH | 391 | lexical.c |
| sub_676860 | get_next_token | HIGHEST | 1,995 | lexical.c:4302 |
| sub_679800 | scan_token | HIGH | 907 | lexical.c |
| sub_67BAB0 | scan_header_name | MEDIUM | 110 | lexical.c |
| sub_67CB70 | handle_template_angle_brackets | MEDIUM | 263 | lexical.c |
| sub_67D050 | check_token_paste_validity | LOW | 57 | lexical.c |
| sub_67D1E0 | handle_token_pasting | MEDIUM | 117 | lexical.c |
| sub_67D440 | stringify_token | MEDIUM | 251 | lexical.c |
| sub_67D900 | expand_macro_argument | MEDIUM | 204 | lexical.c |
| sub_67DC90 | scan_template_argument_list | HIGH | 1,078 | lexical.c:19918 |
| sub_67F2E0 | create_template_argument_cache | MEDIUM | 184 | lexical.c |
| sub_67F740 | rescan_template_arguments | MEDIUM-HIGH | 583 | lexical.c |
| sub_680670 | resolve_dependent_template_id | MEDIUM | 240 | lexical.c |
| sub_680AE0 | handle_dependent_name_context | MEDIUM | 235 | lexical.c |
| sub_6810F0 | get_token_main | HIGHEST | 3,811 | lexical.c |
| sub_685AB0 | handle_builtin_type_token | MEDIUM | 289 | lexical.c |
| sub_685F10 | process_decltype_token | MEDIUM | 212 | lexical.c |
| sub_686350 | handle_attribute_token | MEDIUM-HIGH | 584 | lexical.c |
| sub_686F40 | process_asm_or_extension_keyword | MEDIUM | 244 | lexical.c |
| sub_687F30 | setup_lexer_for_parsing_mode | MEDIUM | 216 | lexical.c |
| sub_688320 | push_lexical_state | MEDIUM | 137 | lexical.c |
| sub_688560 | peek_next_token | MEDIUM | 44 | lexical.c |
| sub_6886F0 | expect_token | MEDIUM | 31 | lexical.c |
| sub_6887C0 | skip_to_token | MEDIUM | 317 | lexical.c |

Cross-References