Lexer & Tokenizer

The lexer in cudafe++ is EDG 6.6's lexical.c implementation -- a hand-coded, state-machine-driven tokenizer that converts raw source bytes into a stream of tokens drawn from 357 distinct token kinds. It spans approximately 185 functions across the address range 0x668330--0x689130 and constitutes one of the densest subsystems in the binary. The design is a classic three-layer scanner: a byte-level character scanner (scan_token, sub_679800, 907 lines) feeds into a token acquisition engine (get_token_main, sub_6810F0, 3,811 lines), which in turn is wrapped by a cache-aware token delivery function (get_next_token, sub_676860, 1,995 lines). CUDA keyword recognition is injected at the get_token_main level, gated on dword_106C2C0 (GPU compilation mode flag).

The lexer does not use generated tables from tools like flex. Instead, every character-class test, keyword match, and operator scan is written as explicit C switch/if chains, compiled into dense jump tables by the optimizer. This produces extremely large functions -- get_token_main alone has approximately 300 local variables in its decompiled form -- but eliminates the overhead of table-driven DFA transitions for a language as context-sensitive as C++.

Key Facts

Property	Value
Source file	`lexical.c` (~185 functions)
Address range	`0x668330`--`0x689130`
Token kinds	357 (indexed from `off_E6D240` name table)
Primary scanner	`sub_679800` (`scan_token`, 907 lines)
Token acquisition	`sub_6810F0` (`get_token_main`, 3,811 lines, ~300 locals)
Cache + delivery	`sub_676860` (`get_next_token`, 1,995 lines)
Numeric literal scanner	`sub_672390` (`scan_numeric_literal`, 1,571 lines)
Keyword registration	`sub_5863A0` (`keyword_init`, in `fe_init.c`, 200+ keywords)
Universal char scanner	`sub_6711E0` (`scan_universal_character`, 278 lines)
Template arg scanner	`sub_67DC90` (`scan_template_argument_list`, 1,078 lines)
Token cache entry size	80--112 bytes (8 cache entry kinds)
Scope entry size	784 bytes (at `qword_126C5E8`)
GPU mode gate	`dword_106C2C0`
Current token global	`word_126DD58`

Architecture

The lexer is organized as three layers that sit between the parser and the raw input buffer, each calling into the one below it:

Parser (expr.c, decls.c, statements.c)
  │
  ▼
get_next_token (sub_676860)         ← Cache management, macro rescan
  │
  ▼
get_token_main (sub_6810F0)         ← Keyword classification, CUDA gates
  │
  ▼
scan_token (sub_679800)             ← Character-level scanning
  │
  ▼
Input buffer (qword_126DDA0)        ← Raw bytes from source file

The parser never calls the character-level scanner directly. All token consumption flows through get_next_token, which checks the token cache and rescan lists before falling through to get_token_main. This layering allows the lexer to support lookahead, backtracking, macro expansion replay, and template argument rescanning without modifying the core scanner.

Token System

The 357 Token Kinds

Every token produced by the lexer carries a 16-bit token code stored in word_126DD58. The complete set of 357 token kinds is indexed through the name table at off_E6D240, which maps each token code to its string representation. The stop-token table at qword_126DB48 + 8 contains 357 boolean entries used by the error recovery scanner to identify synchronization points.

Token codes are assigned in blocks:

Range	Category	Examples
1--51	Operators and punctuation	`+`, `-`, `*`, `/`, `(`, `)`, `{`, `}`, `::`, `->`
52--76	Alternative tokens / digraphs	`and`, `or`, `not`, `<%`, `%>`, `<:`, `:>`
77--108	C89 keywords	`auto`(77), `break`(78), `case`(79), `char`(80), `while`(108)
109--131	C99/C11 keywords	`restrict`(119), `_Bool`(120), `_Complex`(121), `_Imaginary`(122)
132--136	MSVC keywords	`__declspec`(132), `__int8`(133), `__int16`(134), `__int32`(135), `__int64`(136)
137--199	C++ keywords	`catch`(150), `class`(151), `template`(160), `decltype`(185), `typeof`(189)
200--206	Compiler internal	Internal token kinds for the preprocessor
207--330	Type traits	`__is_class`(207), `__has_trivial_copy`, ..., NVIDIA-specific traits at 328--330
331--356	Extended types / recent additions	`_Float32`(331)--`_Float128`(335), C++23/26 features

CUDA-Specific Token Kinds

Three NVIDIA type-trait keywords occupy dedicated token codes registered during keyword_init:

Token Code	Keyword	Purpose
328	`__nv_is_extended_device_lambda_closure_type`	Tests if type is a device lambda
329	`__nv_is_extended_host_device_lambda_closure_type`	Tests if type is a host-device lambda
330	`__nv_is_extended_device_lambda_with_preserved_return_type`	Tests if device lambda preserves return type

These are registered as standard type-trait keywords and participate in the same token classification path as the 60+ standard __is_xxx/__has_xxx traits.

Token State Globals

When a token is produced, the following globals are populated:

Address	Name	Type	Description
`word_126DD58`	`current_token_code`	WORD	16-bit token kind (0--356)
`qword_126DD38`	`current_source_position`	QWORD	Encoded file/line/column
`qword_126DD48`	`token_text_ptr`	QWORD	Pointer to identifier/literal text
`src`	`token_start_position`	char*	Start of token in input buffer
`n`	`token_text_length`	size_t	Length of token text
`dword_126DF90`	`token_flags_1`	DWORD	Classification flags
`dword_126DF8C`	`token_flags_2`	DWORD	Additional flags
`qword_126DF80`	`token_extra_data`	QWORD	Context-dependent payload
`xmmword_106C380`--`106C3B0`	`identifier_lookup_result`	4 x 128-bit	SSE-packed lookup result for identifiers (64 bytes)

The 64-byte identifier lookup result is written by the identifier classification path into four consecutive 128-bit globals (xmmword_106C380 through xmmword_106C3B0). When a scanned identifier is also a keyword, the lookup result contains the keyword's token code, scope information, and classification flags. The compiled code uses movaps/movups instructions to read and write this packed state in bulk.

Token Cache

The token cache provides the lookahead, backtracking, and macro-expansion replay capabilities required by C++ parsing. Tokens are stored in a linked list of cache entries that can be consumed, rewound, and re-scanned.

Cache Entry Layout (80--112 bytes)

Offset	Size	Field	Description
`+0`	8	`next`	Next entry in cache linked list
`+8`	8	`source_position`	Encoded source location
`+16`	2	`token_code`	Token kind (0--356)
`+18`	1	`cache_entry_kind`	Discriminator for payload type (see below)
`+20`	4	`flags`	Token flags
`+24`	4	`extra_flags`	Additional flags
`+32`	8	`extra_data`	Context-dependent data
`+40`..	varies	`payload`	Kind-specific payload data
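
The fixed header of a cache entry can be written down as a C struct. This is a minimal sketch, assuming an LP64 target: the struct name, the field names, and the explicit padding members are hypothetical, and only the offsets are taken from the table above. The payload is sized for the smallest (80-byte) entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; only the offsets come from the layout table.
   The padding members are assumptions that keep the offsets stable
   on an LP64 target (8-byte pointers). */
typedef struct token_cache_entry {
    struct token_cache_entry *next; /* +0  next entry in the list      */
    uint64_t source_position;       /* +8  encoded source location     */
    uint16_t token_code;            /* +16 token kind (0..356)         */
    uint8_t  cache_entry_kind;      /* +18 payload discriminator       */
    uint8_t  pad0;                  /* +19 assumed padding             */
    uint32_t flags;                 /* +20 token flags                 */
    uint32_t extra_flags;           /* +24 additional flags            */
    uint32_t pad1;                  /* +28 assumed padding             */
    uint64_t extra_data;            /* +32 context-dependent data      */
    unsigned char payload[40];      /* +40 smallest payload, 80 total  */
} token_cache_entry;

/* Compile-time checks that the sketch matches the documented offsets. */
static_assert(offsetof(token_cache_entry, source_position) == 8, "+8");
static_assert(offsetof(token_cache_entry, token_code) == 16, "+16");
static_assert(offsetof(token_cache_entry, cache_entry_kind) == 18, "+18");
static_assert(offsetof(token_cache_entry, flags) == 20, "+20");
static_assert(offsetof(token_cache_entry, extra_flags) == 24, "+24");
static_assert(offsetof(token_cache_entry, extra_data) == 32, "+32");
```

Larger entry kinds would extend the payload up to the documented 112-byte maximum.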

Cache Entry Kinds

Kind	Value	Payload	Description
identifier	1	Name pointer + lookup result	Identifier token with pre-resolved scope lookup
macro_def	2	Macro definition pointer	Macro definition for re-expansion (calls `sub_5BA500`)
pragma	3	Pragma data	Preprocessor pragma for deferred processing
pp_number	4	Number text	Preprocessing number (not yet classified as int/float)
(reserved)	5	--	Not observed in use
string	6	String data + encoding	String literal token
(reserved)	7	--	Not observed in use
concatenated_string	8	Concatenated string data	Wide or multi-piece concatenated string literal

Cache Management Globals

Address	Name	Description
`qword_1270150`	`cached_token_rescan_list`	Head of list of tokens to re-scan (pushed back for lookahead)
`qword_1270128`	`reusable_cache_stack`	Stack of reusable cache entry blocks
`qword_1270148`	`free_token_list`	Free list for recycling cache entries
`qword_1270140`	`macro_definition_chain`	Active macro definition chain
`dword_126DB74`	`has_cached_tokens`	Boolean flag: nonzero when cache is non-empty

Cache Operations

Address	Identity	Description
`sub_669650`	`copy_tokens_from_cache`	Copies cached preprocessor tokens for macro re-expansion (assert at `lexical.c:3417`)
`sub_669D00`	`allocate_token_cache_entry`	Allocates from free list at `qword_1270118`
`sub_669EB0`	`create_cached_token_node`	Creates and initializes token cache node
`sub_66A000`	`append_to_token_cache`	Appends token to cache list, maintains tail pointer
`sub_66A140`	`push_token_to_rescan_list`	Pushes token onto rescan stack at `qword_1270150`
`sub_66A2C0`	`free_single_cache_entry`	Returns cache entry to free list
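
The interplay between the rescan list and the free list can be illustrated with a small model. This is a sketch only: cache_node, push_rescan, and pop_rescan are hypothetical names, the node carries just a token code, and the real functions manage considerably more state.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for a cache entry: link pointer plus token code. */
typedef struct cache_node {
    struct cache_node *next;
    unsigned short token_code;
} cache_node;

static cache_node *rescan_head; /* models the rescan list head */
static cache_node *free_head;   /* models the free list head   */

/* Reuse a recycled node when one is available, otherwise allocate. */
static cache_node *alloc_node(void) {
    cache_node *n = free_head;
    if (n != NULL)
        free_head = n->next;
    else
        n = malloc(sizeof *n);
    assert(n != NULL);
    return n;
}

/* Push a token back so that it is delivered again before new input. */
void push_rescan(unsigned short token_code) {
    cache_node *n = alloc_node();
    n->token_code = token_code;
    n->next = rescan_head;
    rescan_head = n;
}

/* Pop the most recently pushed token; returns -1 when the list is empty.
   The node goes back to the free list instead of being freed. */
int pop_rescan(void) {
    cache_node *n = rescan_head;
    if (n == NULL)
        return -1;
    int code = n->token_code;
    rescan_head = n->next;
    n->next = free_head;
    free_head = n;
    return code;
}
```

Pushing `int` (93) and then `void` (106) yields 106 first on the next pop, which is the last-in, first-out order a pushed-back lookahead token needs.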

Layer 1: scan_token (sub_679800)

scan_token is the character-level scanner. It reads raw bytes from the input buffer at qword_126DDA0, classifies them, and produces a single token. The function is 907 lines and dispatches on the first byte of each token.

Character Dispatch

The scanner reads the byte at the current input position and enters one of the following paths:

First Byte	Action
`0x00` (NUL)	Control byte processing (8 embedded control types, see below)
`0x09` (TAB), `0x0B` (VT), `0x0C` (FF), `0x20` (space)	Whitespace -- advance and retry
`a`--`z`, `A`--`Z`, `_`	Identifier or keyword scanning
`0`--`9`	Numeric literal scanning (decimal, hex, octal, binary)
`'`	Character literal scanning
`"`	String literal scanning
`/`	Comment (`//` or `/* */`) or division operator
`.`	Dot operator, or float literal if followed by digit
`<`	Less-than, `<=`, `<<`, `<<=`, `<=>`, or template bracket
`>`	Greater-than, `>=`, `>>`, `>>=`, or template bracket
`+`, `-`, `*`, `%`, `^`, `~`, `!`, `=`, `&`, `|`	Operator scanning (single or compound)
`(`, `)`, `[`, `]`, `{`, `}`, `;`, `,`, `?`, `@`	Single-character tokens
`#`	Preprocessor directive or stringification operator
`\`	Universal character name (`\uXXXX`, `\UXXXXXXXX`) or line continuation

Embedded Control Bytes (NUL Dispatch)

The input buffer uses embedded NUL bytes (0x00) as in-band control markers. When the scanner encounters a NUL, it reads the next byte as a control type code:

Control Type	Value	Action
Newline marker	1	End of line -- calls `sub_6702F0` (`refill_buffer`) to read next source line
(reserved)	2	--
Macro position	3	Macro expansion position marker -- calls `sub_66A770` to update position tracking
End of directive	4	Marks end of a preprocessor directive
EOF (primary)	5	End of current source file -- pops file stack
Stale position	6	Invalid position marker -- emits diagnostic 1192 or 861
Continuation	7	Backslash-newline continuation was here
EOF (secondary)	8	Secondary EOF marker for nested includes

This in-band signaling approach avoids the cost of checking buffer boundaries on every character read. The refill_buffer function (sub_6702F0, 792 lines) places these marker bytes at the end of each source line, so the scanner can detect line endings and EOF without comparing the input pointer against a limit.
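
The in-band marker scheme can be sketched as a loop that never compares its pointer against a buffer limit. The enum names and the function count_lines_until_eof are hypothetical. The sketch assumes every buffer ends with an EOF marker (type 5 or 8) and simply skips the control types it does not model.

```c
#include <assert.h>
#include <stddef.h>

/* Control type codes from the table above; the names are illustrative. */
enum control_type {
    CTRL_NEWLINE = 1,
    CTRL_MACRO_POSITION = 3,
    CTRL_END_OF_DIRECTIVE = 4,
    CTRL_EOF_PRIMARY = 5,
    CTRL_STALE_POSITION = 6,
    CTRL_CONTINUATION = 7,
    CTRL_EOF_SECONDARY = 8
};

/* Walk a marker-terminated buffer and count newline markers.
   Termination relies on the EOF marker, not on a length check. */
int count_lines_until_eof(const unsigned char *p) {
    assert(p != NULL);
    int lines = 0;
    for (;;) {
        if (*p != 0) {          /* ordinary source byte */
            p++;
            continue;
        }
        switch (p[1]) {         /* byte after the NUL is the control type */
        case CTRL_NEWLINE:
            lines++;
            p += 2;
            break;
        case CTRL_EOF_PRIMARY:
        case CTRL_EOF_SECONDARY:
            return lines;
        default:                /* other markers are skipped in this sketch */
            p += 2;
            break;
        }
    }
}
```

The common path (an ordinary byte) costs one comparison against zero, which is the saving the in-band design is after.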

Input Buffer System

Address	Name	Description
`qword_126DDA0`	`current_input_position`	Read pointer into the input buffer
`qword_126DDD8`	`input_buffer_base`	Start of the allocated input buffer
`qword_126DDD0`	`input_buffer_end`	End of the allocated input buffer
`qword_126DDF0`	`file_stack`	Stack of open source files (for `#include`)
`qword_127FBA8`	`current_file_handle`	`FILE*` for the current source file
`dword_127FBA0`	`eof_flag`	Set when current file reaches EOF
`dword_127FB9C`	`multibyte_encoding_mode`	Values >1 enable multibyte character decoding via `sub_5B09B0`
`dword_126DDA8`	`source_line_counter`	Lines read from current source file
`dword_126DDBC`	`output_line_counter`	Lines emitted to preprocessed output

Buffer Refill: read_next_source_line (sub_66F4E0)

sub_66F4E0 (735 lines) reads the next line from the source file into the input buffer. It calls getc() for single-byte mode or sub_5B09B0 for multibyte mode (controlled by dword_127FB9C > 1). The function:

Reads characters one at a time until newline or EOF
Handles backslash-newline line splicing (joining continuation lines)
Places control byte markers at newline positions (type 1) and EOF (type 5/8)
Updates the line counter at dword_126DDA8
Manages trigraph warnings (diagnostic 1750) through the companion function sub_6702F0

Layer 2: get_token_main (sub_6810F0)

get_token_main is the largest function in the lexer at 3,811 decompiled lines with approximately 300 local variables. It wraps scan_token and performs the complete token classification pipeline: keyword recognition, CUDA keyword gating, template parameter detection, operator overload name lookup, access specifier tracking, and namespace scope management.

Token Classification Pipeline

After scan_token produces a raw token, get_token_main performs these classification steps:

scan_token produces raw token
  │
  ├── Identifier?
  │     ├── Look up in keyword table
  │     │     ├── Standard C/C++ keyword → set token_code to keyword kind
  │     │     ├── CUDA keyword (dword_106C2C0 != 0) → set token_code
  │     │     ├── Type trait keyword → set token_code (207-330)
  │     │     └── Not a keyword → classify as identifier token
  │     │
  │     ├── Check template parameter context
  │     │     └── If inside template<>, classify as type-name or non-type
  │     │
  │     └── Entity lookup for context-sensitive classification
  │           ├── typedef name → classify as TYPE_NAME token
  │           ├── class/struct name → classify as CLASS_NAME
  │           ├── enum name → classify as ENUM_NAME
  │           ├── namespace name → classify as NAMESPACE_NAME
  │           └── template name → classify as TEMPLATE_NAME
  │
  ├── Numeric literal?
  │     └── Route to scan_numeric_literal (sub_672390)
  │
  ├── String/character literal?
  │     └── Handle encoding prefix (L, u8, u, U, R)
  │
  └── Operator/punctuation?
        ├── Check for template angle bracket context
        ├── Handle digraphs/alternative tokens
        └── Produce operator token code

CUDA Keyword Detection

CUDA keyword handling is gated on dword_106C2C0 (GPU mode). When this flag is nonzero, get_token_main recognizes CUDA-specific identifiers and routes them to the CUDA attribute processing path:

// Pseudocode from get_token_main
if (token_is_identifier) {
    // ... standard keyword lookup ...

    if (dword_106C2C0 != 0) {  // GPU mode active
        // Check for __device__, __host__, __global__,
        // __shared__, __constant__, __managed__,
        // __launch_bounds__, __grid_constant__
        // Route to CUDA attribute handlers
        if (dword_106BA08) {   // CUDA attribute processing enabled
            sub_74DC30(...);   // CUDA attribute resolution
            sub_74E240(...);   // CUDA attribute application
        }
    }
}

The GPU mode flag dword_106C2C0 is also checked during:

Attribute token processing in sub_686350 (handle_attribute_token, 584 lines)
Deferred diagnostic emission in sub_668660 (severity override via byte_126ED55)
Entity visibility computation in sub_669130

C++ Standard Version Gating

Throughout get_token_main, keyword classification is gated on the C++ standard version stored in dword_126EF68:

Version Value	Standard	Keywords / Features Enabled
201102	C++11	`constexpr`, `decltype`, `nullptr`, `char16_t`, `char32_t`, `static_assert`
201402	C++14	binary literals, digit separators
201703	C++17	`if constexpr`, `char8_t`, structured bindings
202002	C++20	`concept`, `requires`, `co_yield`, `co_return`, `co_await`, `consteval`, `constinit`
202302	C++23	`typeof`, `typeof_unqual`, extended digit separators

The language mode at dword_126EFB4 controls broader dialect selection:

Value	Mode	Effect
1	GNU/default	GNU extensions enabled, alternative tokens recognized
2	MSVC	MSVC keywords enabled (`__declspec`, `__int8`--`__int64`), some GNU extensions disabled

Context-Sensitive Token Classification

C++ requires the lexer to classify identifiers based on declaration context. The functions supporting this classification:

Address	Identity	Description
`sub_668C90`	`classify_identifier_entity`	Dispatches on entity kind: typedef(3), class(4,5), function(7,9), template(19-22)
`sub_668E00`	`resolve_entity_through_alias`	Walks typedef/using chains (kind=3 with `+104` flag, kind=16 → `**[+88]`)
`sub_668F80`	`get_resolved_entity_type`	Resolves entity to underlying type through alias chains
`sub_668900`	`handle_token_identifier_type_check`	Determines if token is identifier vs typename vs template
`sub_666720`	`select_dual_lookup_symbol`	Selects between two candidate symbols in dual-scope lookup (372 lines)

Entity classification reads the entity_kind byte at offset +80 of entity nodes:

// Pseudocode: entity kind to token class
switch (entity->kind) {    // offset +80
    case 3:                // typedef
        return TYPE_NAME;
    case 4: case 5:        // class / struct
        return CLASS_NAME;
    case 6:                // enum
        return ENUM_NAME;
    case 7:                // function
        return IDENTIFIER;
    case 9: case 10:       // namespace / namespace alias
        return NAMESPACE_NAME;
    case 19: case 20: case 21: case 22:  // template kinds
        return TEMPLATE_NAME;
    case 16:               // using declaration
        return resolve_through_using(entity);
    case 24:               // namespace alias (resolved)
        return NAMESPACE_NAME;
    default:               // assumed fallback: any other kind is a plain identifier
        return IDENTIFIER;
}

Layer 3: get_next_token (sub_676860)

get_next_token (1,995 lines) is the token delivery function called by the parser. It manages the token cache, handles macro expansion replay, and calls get_token_main only when no cached tokens are available.

Token Delivery Flow

get_next_token (sub_676860)
  │
  ├── Check cached_token_rescan_list (qword_1270150)
  │     └── If non-empty: pop token, dispatch on cache_entry_kind
  │           ├── kind 1 (identifier): load xmmword_106C380..106C3B0
  │           ├── kind 2 (macro_def): call sub_5BA500 (macro expansion)
  │           ├── kind 3 (pragma): process deferred pragma
  │           ├── kind 4 (pp_number): return as-is
  │           ├── kind 6 (string): return string token
  │           └── kind 8 (concatenated_string): return concatenated string
  │
  ├── Check reusable_cache_stack (qword_1270128)
  │     └── If non-empty: pop and return cached token
  │           (assert: "get_token_from_reusable_cache_stack" at 4450, 4469)
  │
  ├── Check pending_macro_arg (qword_106B8A0)
  │     └── If set: process macro argument token
  │
  └── Fall through to get_token_main (sub_6810F0)
        └── Full token acquisition from source

The function sets the following globals on every token delivery:

word_126DD58 = token code
qword_126DD38 = source position
dword_126DF90 = token flags 1
dword_126DF8C = token flags 2
qword_126DF80 = extra data

CUDA Attribute Token Interception

When CUDA attribute processing is enabled (dword_106BA08 != 0), get_next_token intercepts identifier tokens and routes them through CUDA attribute resolution via sub_74DC30 and sub_74E240. This allows CUDA execution-space attributes (__device__, __host__, __global__) to be recognized at the token level rather than requiring full declaration parsing.

Numeric Literal Scanner: scan_numeric_literal (sub_672390)

The numeric literal scanner is 1,571 lines and handles every numeric literal format defined by C89 through C++23.

Literal Prefix Dispatch

scan_numeric_literal
  │
  ├── First char '0':
  │     ├── 0x/0X → hex literal (isxdigit validation)
  │     ├── 0b/0B → binary literal (C++14)
  │     ├── 0[0-7] → octal literal
  │     └── 0 alone → decimal zero
  │
  ├── First char '1'-'9':
  │     └── decimal literal
  │
  └── After integer part:
        ├── '.' → floating-point literal
        ├── 'e'/'E' → decimal float exponent
        ├── 'p'/'P' → hex float exponent
        └── suffix → type suffix parsing
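
The prefix dispatch above can be sketched as a small classifier. The enum and the function classify_literal_base are hypothetical names for illustration; the sketch only looks at the first two characters and ignores the language-mode gating that the real scanner applies to binary literals.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { LIT_DECIMAL, LIT_HEX, LIT_BINARY, LIT_OCTAL } literal_base;

/* Decide the base of a numeric literal from its leading characters.
   Precondition: s points at the first digit of the literal. */
literal_base classify_literal_base(const char *s) {
    assert(s != NULL && isdigit((unsigned char)s[0]));
    if (s[0] != '0')
        return LIT_DECIMAL;              /* first char '1'..'9' */
    if (s[1] == 'x' || s[1] == 'X')
        return LIT_HEX;                  /* 0x / 0X */
    if (s[1] == 'b' || s[1] == 'B')
        return LIT_BINARY;               /* 0b / 0B */
    if (s[1] >= '0' && s[1] <= '7')
        return LIT_OCTAL;                /* 0 followed by an octal digit */
    return LIT_DECIMAL;                  /* lone 0, including "0.5" */
}
```

A literal such as `0.5` falls into the final branch, after which the floating-point path in the diagram takes over.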

C++14 Digit Separators

Digit separators (' characters within numeric literals) are handled through a two-flag system:

Address	Name	Purpose
`dword_126EEFC`	`cpp14_digit_separators_enabled`	Master enable for digit separator support
`dword_126DB58`	`digit_separator_seen`	Set when a separator is encountered in the current literal

When dword_126EEFC is enabled, the scanner accepts ' between digits:

// Digit separator handling in scan_numeric_literal
while (isdigit((unsigned char)*pos) || (*pos == '\'' && dword_126EEFC)) {
    if (*pos == '\'') {
        dword_126DB58 = 1;  // mark separator seen
        pos++;
        if (!isdigit((unsigned char)*pos))
            emit_diagnostic(2629);  // separator not followed by digit
        continue;
    }
    // process digit...
    pos++;  // advance past the digit
}

C++23 extended digit separators (for binary, octal, hex) are gated on dword_126EF68 >= 202302:

if (dword_126EF68 >= 202302) {
    // C++23 or later: allow digit separators in binary/octal/hex
} else {
    emit_diagnostic(2628);  // C++23 feature used in earlier mode
}

Integer Suffix Parsing

sub_6748A0 (convert_integer_suffix, 137 lines) parses the following suffixes:

Suffix	Type
(none)	`int` (or promoted per value)
`u` / `U`	`unsigned int`
`l` / `L`	`long`
`ll` / `LL`	`long long`
`ul` / `UL`	`unsigned long`
`ull` / `ULL`	`unsigned long long`
`z` / `Z`	signed counterpart of `size_t` (C++23)
`uz` / `UZ`	`size_t` (C++23)

sub_674BB0 (determine_numeric_literal_type, 400 lines) applies the C++ promotion rules based on the literal value and suffix to determine the final type.
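
Suffix parsing of this kind can be sketched as a loop that records which components were seen. The struct int_suffix and the function parse_int_suffix are hypothetical; the sketch follows the standard C++ rules (mixed-case `lL` is rejected, `z` cannot combine with `l`) and is not a transcription of sub_6748A0.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int is_unsigned;  /* u / U seen                         */
    int long_count;   /* 0 = none, 1 = l / L, 2 = ll / LL   */
    int is_size;      /* z / Z seen (C++23)                 */
    int valid;        /* 0 if the suffix is malformed       */
} int_suffix;

/* Parse the text that follows the digits of an integer literal. */
int_suffix parse_int_suffix(const char *s) {
    assert(s != NULL);
    int_suffix r = {0, 0, 0, 1};
    while (*s != '\0') {
        if ((*s == 'u' || *s == 'U') && !r.is_unsigned) {
            r.is_unsigned = 1;
            s++;
        } else if ((*s == 'l' || *s == 'L') && r.long_count == 0 && !r.is_size) {
            if (s[1] == s[0]) {      /* ll or LL, same case required */
                r.long_count = 2;
                s += 2;
            } else {
                r.long_count = 1;
                s++;
            }
        } else if ((*s == 'z' || *s == 'Z') && !r.is_size && r.long_count == 0) {
            r.is_size = 1;
            s++;
        } else {
            r.valid = 0;             /* repeated or unknown component */
            break;
        }
    }
    return r;
}
```

The caller would then combine these flags with the literal's value to pick the final type, which is the role the text assigns to determine_numeric_literal_type.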

Floating-Point Literal Handling

Address	Identity	Description
`sub_675390`	`scan_float_exponent`	Scans `e`/`E`/`p`/`P` exponent suffix (57 lines)
`sub_6754B0`	`convert_float_literal`	Converts float literal string to value (338 lines)

Float suffixes: f/F (float), l/L (long double), none (double).

Universal Character Names: scan_universal_character (sub_6711E0)

sub_6711E0 (278 lines, assert at lexical.c:12384) scans \uXXXX and \UXXXXXXXX universal character names in identifiers and string/character literals.

#include <ctype.h>
#include <stdint.h>

// input points at the backslash of \uXXXX or \UXXXXXXXX
void scan_universal_character(char *input, uint32_t *result) {
    int width;
    if (input[1] == 'u')
        width = 4;    // \uXXXX
    else if (input[1] == 'U')
        width = 8;    // \UXXXXXXXX
    else
        return;       // not a universal character name
    input += 2;       // skip the backslash and the u/U marker

    uint32_t value = 0;
    for (int i = 0; i < width; i++) {
        unsigned char c = (unsigned char)*input++;
        if (!isxdigit(c)) {
            // emit error diagnostic
            return;
        }
        uint32_t digit;
        if (c >= '0' && c <= '9')
            digit = c - 48;      // '0' = 48
        else if (islower(c))
            digit = c - 87;      // 'a' = 97, 97-87 = 10
        else
            digit = c - 55;      // 'A' = 65, 65-55 = 10
        value = (value << 4) | digit;
    }
    *result = value;
}

sub_671870 (validate_universal_character_value, 62 lines) performs range checking after scanning: surrogate pair values (0xD800--0xDFFF) are rejected, and values outside the valid Unicode range (> 0x10FFFF) produce an error.
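
The two range checks described above fit in a few lines. This is a minimal sketch with a hypothetical function name, is_valid_ucn_value; it covers only the surrogate and upper-bound checks named in the text, not any further restrictions the real validator may apply.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the scanned value is acceptable, 0 otherwise. */
int is_valid_ucn_value(uint32_t value) {
    if (value > 0x10FFFF)
        return 0;                      /* beyond the Unicode code space */
    if (value >= 0xD800 && value <= 0xDFFF)
        return 0;                      /* surrogate range */
    return 1;
}
```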

The feature is controlled by dword_106BCC4 (universal characters enabled) and dword_106BD4C (extended character mode).

Keyword Registration: keyword_init (sub_5863A0)

sub_5863A0 (1,113 lines, in fe_init.c) registers all C/C++ keywords with the symbol table during frontend initialization. It calls sub_7463B0 (enter_keyword) once per keyword, passing the token ID and string representation. GNU double-underscore variants are registered via sub_585B10, and alternative tokens via sub_749600.

Keyword Categories and Version Gating

Keywords are registered conditionally based on language mode and standard version:

keyword_init (sub_5863A0)
  │
  ├── C89 core (always registered)
  │     auto(77), break(78), case(79), char(80), continue(82),
  │     default(83), do(84), double(85), else(86), enum(87),
  │     extern(88), float(89), for(90), goto(91), if(92),
  │     int(93), long(94), register(95), return(96), short(97),
  │     sizeof(99), static(100), struct(101), switch(102),
  │     typedef(103), union(104), unsigned(105), void(106), while(108)
  │
  ├── C99 (gated on C99+ mode)
  │     _Bool(120), _Complex(121), _Imaginary(122), restrict(119)
  │
  ├── C11 (gated on C11+ mode)
  │     _Generic(262), _Atomic(263), _Alignof(247), _Alignas(248),
  │     _Thread_local(194), _Static_assert(184), _Noreturn(260)
  │
  ├── C23 (gated on C23 mode)
  │     bool, true, false, alignof, alignas, static_assert,
  │     thread_local, typeof(189), typeof_unqual(190)
  │
  ├── C++ core (gated on C++ mode: dword_126EFB4 == 2)
  │     catch(150), class(151), friend(153), inline(154),
  │     mutable(174), operator(156), new(155), delete(152),
  │     private(157), protected(158), public(159), template(160),
  │     this(161), throw(162), try(163), virtual(164),
  │     namespace(175), using(179), typename(183), typeid(178),
  │     const_cast(166), dynamic_cast(167), static_cast(177),
  │     reinterpret_cast(176)
  │
  ├── C++ alternative tokens (gated on C++ mode)
  │     and(52), and_eq(64), bitand(33), bitor(51), compl(37),
  │     not(38), not_eq(48), or(53), or_eq(66), xor(50), xor_eq(65)
  │
  ├── C++ modern keywords (gated on standard version)
  │     C++11: constexpr(244), decltype(185), nullptr(237),
  │            char16_t(126), char32_t(127)
  │     C++17: char8_t(128)
  │     C++20: consteval(245), constinit(246), co_yield(267),
  │            co_return(268), co_await(269), concept(295), requires(294)
  │     C++23: typeof(189), typeof_unqual(190)
  │
  ├── GNU extensions (gated on dword_126EFA8)
  │     __extension__(187), __auto_type(186), __attribute(142),
  │     __builtin_offsetof(117), __builtin_types_compatible_p(143),
  │     __builtin_shufflevector(258), __builtin_convertvector(259),
  │     __builtin_complex(261), __builtin_has_attribute(296),
  │     __builtin_addressof(271), __builtin_bit_cast(297),
  │     __int128(239), __bases(249), __direct_bases(250),
  │     _Float32(331), _Float32x(332), _Float64(333),
  │     _Float64x(334), _Float128(335)
  │
  ├── MSVC extensions (gated on dword_126EFB0)
  │     __declspec(132), __int8(133), __int16(134),
  │     __int32(135), __int64(136)
  │
  ├── Clang extensions (gated on Clang version at qword_126EF90)
  │     _Nullable(264), _Nonnull(265), _Null_unspecified(266)
  │
  ├── Type traits (60+, gated by standard version)
  │     __is_class(207), __is_enum, __is_union, __has_trivial_copy,
  │     __has_virtual_destructor, ... through token code 327
  │
  ├── NVIDIA CUDA type traits (gated on GPU mode)
  │     __nv_is_extended_device_lambda_closure_type(328),
  │     __nv_is_extended_host_device_lambda_closure_type(329),
  │     __nv_is_extended_device_lambda_with_preserved_return_type(330)
  │
  └── EDG internal keywords (always registered)
        __edg_type__(272), __edg_size_type__(277),
        __edg_ptrdiff_type__(278), __edg_bool_type__(279),
        __edg_wchar_type__(280), __edg_opnd__(282),
        __edg_throw__(281), __edg_is_deducible(304),
        __edg_vector_type__(273), __edg_neon_vector_type__(274)

Version gating globals used during keyword registration:

Address	Name	Values
`dword_126EFB4`	`language_mode`	1 = K&R C / GNU default, 2 = C++
`dword_126EF68`	`cpp_standard_version`	199900, 201102, 201402, 201703, 202002, 202302
`qword_126EF98`	`gnu_version`	e.g., `0x9FC3` (decimal 40899; GCC 4.8.99 if encoded as major*10000 + minor*100 + patch)
`qword_126EF90`	`clang_version`	e.g., `0x15F8F`, `0x1D4BF`
`dword_126EFA8`	`gnu_extensions_enabled`	Boolean
`dword_126EFA4`	`extensions_enabled`	Boolean (Clang compat)
`dword_126EFAC`	`c_language_mode`	Boolean: C vs C++
`dword_126EFB0`	`microsoft_extensions_enabled`	Boolean
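
Version-gated registration can be modeled as a table lookup with a minimum standard version per keyword. This is a sketch under stated assumptions: gated_keyword and lookup_gated_keyword are hypothetical names, the real code calls enter_keyword conditionally rather than consulting a table, and only a handful of keywords are included. The token codes and version values are the ones listed in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;
    int token_code;
    long min_version;   /* smallest cpp_standard_version that enables it */
} gated_keyword;

static const gated_keyword gated_keywords[] = {
    { "constexpr", 244, 201102L },
    { "decltype",  185, 201102L },
    { "nullptr",   237, 201102L },
    { "consteval", 245, 202002L },
    { "constinit", 246, 202002L },
    { "concept",   295, 202002L },
};

/* Returns the token code if the keyword is enabled for std_version,
   or 0 if the name is unknown or not yet enabled. */
int lookup_gated_keyword(const char *name, long std_version) {
    assert(name != NULL);
    size_t count = sizeof gated_keywords / sizeof gated_keywords[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(gated_keywords[i].name, name) == 0)
            return std_version >= gated_keywords[i].min_version
                       ? gated_keywords[i].token_code
                       : 0;
    }
    return 0;
}
```

With this model, `consteval` in C++17 mode returns 0 and would therefore be lexed as an ordinary identifier.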

String and Character Literal Scanning

Character Literal Scanning

Address	Identity	Lines	Description
`sub_66CB30`	`scan_character_literal_prefix`	34	Detects encoding prefix (`L`, `u`, `U`, `u8`)
`sub_66CBD0`	`scan_character_literal`	111	Scans `'x'` / `L'x'` / `u'x'` / `U'x'` / `u8'x'` literals

String Literal Scanning

Address	Identity	Lines	Description
`sub_66C550`	`scan_string_literal`	356	Scans quoted string literals with escape sequences
`sub_676080`	`scan_raw_string_literal`	391	Scans `R"delimiter(content)delimiter"` raw strings
`sub_66E6E0`	`scan_identifier_suffix`	94	Checks for user-defined literal suffixes (C++11)
`sub_66E920`	`is_valid_ud_suffix`	51	Validates user-defined literal suffix names
`sub_6892F0`	`string_literal_concatenation_check`	107	Checks adjacent string literal tokens for concatenation
`sub_689550`	`process_user_defined_literal`	332	Handles C++11 UDL operator lookup

Encoding Prefixes

The lexer recognizes five string encodings (the unprefixed form plus four prefixes), each producing a different string literal type:

Prefix	Token	Character Type	Width
(none)	`"..."`	`char`	1 byte
`L`	`L"..."`	`wchar_t`	4 bytes (Linux)
`u8`	`u8"..."`	`char8_t` (C++20) / `char`	1 byte
`u`	`u"..."`	`char16_t`	2 bytes
`U`	`U"..."`	`char32_t`	4 bytes
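
Prefix detection for the table above can be sketched as follows. The enum and the function detect_string_prefix are hypothetical, and raw-string (`R`) handling is deliberately left out. The key detail is that `u8` must be tested before `u`, and that a prefix only counts when a double quote follows.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    ENC_NONE,          /* "..."   char              */
    ENC_WIDE,          /* L"..."  wchar_t           */
    ENC_UTF8,          /* u8"..." char8_t / char    */
    ENC_UTF16,         /* u"..."  char16_t          */
    ENC_UTF32,         /* U"..."  char32_t          */
    ENC_NOT_A_STRING   /* text is not a string literal */
} string_encoding;

/* Classify the encoding prefix of the token text starting at s. */
string_encoding detect_string_prefix(const char *s) {
    assert(s != NULL);
    string_encoding enc = ENC_NONE;
    if (s[0] == 'u' && s[1] == '8') {   /* longest prefix first */
        enc = ENC_UTF8;
        s += 2;
    } else if (s[0] == 'u') {
        enc = ENC_UTF16;
        s += 1;
    } else if (s[0] == 'U') {
        enc = ENC_UTF32;
        s += 1;
    } else if (s[0] == 'L') {
        enc = ENC_WIDE;
        s += 1;
    }
    /* Without an opening quote this is an identifier such as "user". */
    return *s == '"' ? enc : ENC_NOT_A_STRING;
}
```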

Scope Entry Layout

The lexer interacts heavily with the scope system. Scope entries are 784-byte records stored in an array at qword_126C5E8, indexed by dword_126C5E4 (current scope index).

Offset	Size	Field	Description
`+0`	4	`name_hash`	Hash of scope name for lookup
`+4`	1	`scope_kind`	Kind code (for example, 12 = file scope)
`+6`	1	`scope_flags`	Bit flags: bit 5 = inline namespace
`+7`	1	`access_flags`	Bit 0 = in class context
`+10`	1	`extra_flags`	Bit 0 = module scope
`+12`	1	`template_flags`	Bit 0 = in template argument scan, bit 4 = has concepts
`+24`	8	`symbol_chain_or_hash_ptr`	Head of symbol chain or hash table
`+32`	8	`hash_table_ptr`	Hash table for O(1) lookup in large scopes
`+192`	8	`lazy_load_scope_ptr`	Pointer for lazy symbol loading (calls `sub_7C1900`)
`+208`	4	`scope_depth`	Nesting depth counter
`+376`	8	`parent_template_info`	Template context for template scope entries
`+416`	8	`module_info`	C++20 module partition data
`+632`	8	`class_info_ptr`	Pointer to class descriptor for class scopes

Scope-related globals:

Address	Name	Description
`dword_126C5E4`	`current_scope_index`	Index into scope table
`dword_126C5C4`	`class_scope_index`	Innermost class scope (-1 if none)
`dword_126C5C8`	`namespace_scope_index`	Innermost namespace scope (-1 if none)
`dword_126C5DC`	`file_scope_index`	File (global) scope index
`xmmword_126C520`	`entity_kind_to_language_mode_map`	32-entry table mapping entity kinds to required language modes

Lexer State Stack

The lexer supports push/pop of its entire state for speculative parsing and template argument scanning.

Address	Identity	Lines	Description
`sub_688320`	`push_lexical_state`	137	Pushes current lexer state onto `qword_126DB40` stack
`sub_668330`	`pop_lexical_state_stack_full`	166	Pops state, restores stop-token table, macro chains (assert at `lexical.c:17808`)

State stack nodes are 80-byte linked-list entries:

Offset	Size	Field
`+0`	8	`next` (previous state)
`+8`	8	`cached_tokens`
`+16`	8	`source_position`
`+24`--`+79`	56	`token_cache_state` (saved cache pointers and flags)

The push/pop mechanism is used for:

Template argument list scanning (sub_67DC90, 1,078 lines)
Speculative parsing in disambiguation contexts
Macro expansion state save/restore

Template Argument Scanning: scan_template_argument_list (sub_67DC90)

sub_67DC90 (1,078 lines, assert at lexical.c:19918) scans template argument lists (<...>). This is one of the most complex lexer functions because of the >> ambiguity: in vector<vector<int>>, the closing >> must be split into two > tokens to close two template argument lists.

The scanner:

Pushes lexer state and sets template argument scanning mode (scope entry offset +12, bit 0)
Scans tokens while tracking nesting depth of <> pairs
Handles nested template-ids recursively
Creates token cache entries for deferred parsing
Uses the scope system to classify identifiers within template arguments
Disambiguates >> as either right-shift or double template close

The entity kind checks at offset +80 (values 19--22) identify template entities for recursive template-id scanning.
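
The depth-tracking idea behind the >> split can be shown on a plain character string. This is a deliberately simplified sketch with hypothetical names (angle_scan, scan_angles): every `<` is assumed to open a template argument list, and `>>` is split only when at least two lists are open. The real scanner consults scope and entity information, which this model omits.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int depth;   /* template argument lists still open at the end */
    int splits;  /* number of ">>" treated as two closing '>'      */
    int shifts;  /* number of ">>" treated as right-shift          */
} angle_scan;

/* Scan text, tracking '<' nesting and deciding how to treat each ">>". */
angle_scan scan_angles(const char *s) {
    assert(s != NULL);
    angle_scan r = {0, 0, 0};
    for (; *s != '\0'; s++) {
        if (*s == '<') {
            r.depth++;                   /* assumed: always opens a list */
        } else if (*s == '>' && s[1] == '>') {
            if (r.depth >= 2) {
                r.depth -= 2;            /* split into two closing '>' */
                r.splits++;
            } else {
                r.shifts++;              /* ordinary right-shift */
            }
            s++;                         /* consume the second '>' */
        } else if (*s == '>' && r.depth > 0) {
            r.depth--;                   /* single closing '>' */
        }
    }
    return r;
}
```

For `vector<vector<int>>` the sketch reports one split and a final depth of zero, while `x >> 2` is counted as a shift.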

Preprocessor Integration

The lexer handles several preprocessor-related responsibilities:

Source Position Tracking

Address	Identity	Lines	Description
`sub_66D100`	`set_source_position`	282	Converts raw input position to file/line/column (called from dozens of locations)
`sub_66D5E0`	`emit_output_line`	491	Emits source text and `#line` directives to preprocessed output
`sub_66B1F0`	`emit_preprocessed_output`	231	Outputs `#line` directives via `qword_106C280` (output `FILE*`)

Macro Expansion Support

Address	Identity	Lines	Description
`sub_66A770`	`lookup_macro_at_position`	41	Scans macro chain (`qword_126DD80`) for macro enclosing given position
`sub_66A7F0`	`create_macro_expansion_record`	44	Allocates macro expansion tracking node
`sub_66A890`	`push_macro_expansion`	41	Pushes new expansion onto active stack
`sub_66A940`	`pop_macro_expansion`	28	Pops expansion from stack
`sub_66A9D0`	`is_in_macro_expansion`	12	Returns whether currently inside macro expansion
`sub_66A9F0`	`get_macro_expansion_depth`	17	Returns nesting depth of macro expansions
`sub_66A310`	`invalidate_macro_node`	56	Clears macro definition when it goes out of scope
`sub_66A5E0`	`free_macro_definition_chain`	91	Walks and frees macro chain via `qword_126DD70` / `qword_126DDE0`
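
The push, pop, membership, and depth operations in this table fit a singly linked stack of expansion records. The following is a minimal sketch of that shape; the `Expansion` struct, its fields, and the function names are hypothetical and do not reflect the recovered node layout.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical expansion record: one node per macro being expanded. */
typedef struct Expansion {
    const char       *macro_name;  /* macro under expansion */
    struct Expansion *outer;       /* enclosing expansion, NULL at top level */
} Expansion;

static Expansion *g_active;        /* innermost active expansion */

/* Returns 1 on success, 0 if allocation fails. */
static int push_expansion(const char *name) {
    Expansion *e = malloc(sizeof *e);
    if (e == NULL) return 0;
    e->macro_name = name;
    e->outer = g_active;
    g_active = e;
    return 1;
}

/* Pops the innermost expansion; a no-op when none is active. */
static void pop_expansion(void) {
    Expansion *e = g_active;
    if (e == NULL) return;
    g_active = e->outer;
    free(e);
}

static int in_expansion(void) { return g_active != NULL; }

/* Walks the chain outward to count nesting levels. */
static size_t expansion_depth(void) {
    size_t depth = 0;
    for (const Expansion *e = g_active; e != NULL; e = e->outer) ++depth;
    return depth;
}
```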

Include File Handling

Address	Identity	Lines	Description
`sub_66BB50`	`open_source_file`	332	Opens include files via `sub_4F4970` (fopen wrapper), creates file tracking nodes
`sub_66EA70`	`open_next_input_file`	364	Opens next input source after current file ends, manages include-stack unwinding
`sub_67BAB0`	`scan_header_name`	110	Scans `<filename>` or `"filename"` for `#include` directives
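
Header-name scanning amounts to finding the matching delimiter on the same line. The sketch below is an illustration under assumptions, not the decompiled logic of `sub_67BAB0`: the signature, the output buffer, and the `is_system` flag are invented for the example, and macro-expanded `#include` operands are ignored.

```c
#include <stddef.h>
#include <string.h>

/* Scans <name> or "name" at p (leading blanks allowed).
 * On success copies the name into out, sets *is_system to 1 for the
 * angle-bracket form and 0 for the quoted form, and returns 1.
 * Returns 0 for a missing, empty, unterminated, or oversized name. */
static int scan_header_name(const char *p, char *out, size_t out_size,
                            int *is_system) {
    while (*p == ' ' || *p == '\t') ++p;

    char close;
    if (*p == '<')      close = '>';
    else if (*p == '"') close = '"';
    else return 0;
    *is_system = (close == '>');
    ++p;

    const char *end = strchr(p, close);
    if (end == NULL || end == p) return 0;          /* unterminated or empty */
    size_t len = (size_t)(end - p);
    if (memchr(p, '\n', len) != NULL) return 0;     /* must fit on one line */
    if (len >= out_size) return 0;                  /* no room for the NUL */

    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}
```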

Token Pasting and Stringification

Address	Identity	Lines	Description
`sub_67D1E0`	`handle_token_pasting`	117	Implements `##` preprocessor operator
`sub_67D440`	`stringify_token`	251	Implements `#` preprocessor operator
`sub_67D050`	`check_token_paste_validity`	57	Validates token paste produces a valid token
`sub_67D900`	`expand_macro_argument`	204	Expands a single macro argument during substitution
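
The `#` operator wraps an argument's spelling in double quotes and, per the C and C++ standards, inserts a backslash before each `"` and `\` that appears inside a string or character literal. The sketch below applies that rule to a single token spelling. It is a simplification with assumed names (`stringify_one_token`, `out`, `out_size`): whitespace handling between tokens is omitted, and prefixed literals such as `L"..."` are not recognized.

```c
#include <stddef.h>

/* Writes the stringified form of one token spelling into out.
 * Returns 1 on success, 0 if out is too small. */
static int stringify_one_token(const char *tok, char *out, size_t out_size) {
    /* Only literals get their quotes and backslashes escaped. */
    int is_literal = (tok[0] == '"' || tok[0] == '\'');
    size_t n = 0;

    if (out_size < 3) return 0;            /* need at least "" plus NUL */
    out[n++] = '"';
    for (; *tok != '\0'; ++tok) {
        int needs_escape = is_literal && (*tok == '"' || *tok == '\\');
        size_t width = needs_escape ? 2u : 1u;
        /* Keep room for the closing quote and the terminating NUL. */
        if (n + width + 2 > out_size) return 0;
        if (needs_escape) out[n++] = '\\';
        out[n++] = *tok;
    }
    out[n++] = '"';
    out[n] = '\0';
    return 1;
}
```

Stringifying the identifier `x` yields `"x"`, while the string literal `"a\n"` becomes `"\"a\\n\""`.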

Operator Scanning

Multi-character operators are scanned by a set of dedicated functions in the 0x67ABB0--0x67BAB0 range. The scanner reads the first operator character and dispatches to the appropriate function to check for compound operators:

First Char	Possible Tokens
`<`	`<`, `<=`, `<<`, `<<=`, `<=>`, `<%` (digraph `{`), `<:` (digraph `[`)
`>`	`>`, `>=`, `>>`, `>>=`
`+`	`+`, `++`, `+=`
`-`	`-`, `--`, `-=`, `->`, `->*`
`*`	`*`, `*=`
`&`	`&`, `&&`, `&=`
`\|`	`\|`, `\|\|`, `\|=`
`=`	`=`, `==`
`!`	`!`, `!=`
`:`	`:`, `::`
`.`	`.`, `...`, `.*`
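
The `<` row of the table can be written as the kind of explicit switch chain this lexer favors, always preferring the longest match. This is a sketch rather than the decompiled code: the `OpKind` enumerators and `scan_less` are hypothetical names, and the C++ special case for `<::` is deliberately ignored.

```c
/* Hypothetical token kinds for the operators that start with '<'. */
typedef enum {
    OP_LT,          /* <   */
    OP_LE,          /* <=  */
    OP_SHL,         /* <<  */
    OP_SHL_ASSIGN,  /* <<= */
    OP_SPACESHIP,   /* <=> */
    OP_LBRACE,      /* <%  (digraph for {) */
    OP_LBRACKET     /* <:  (digraph for [) */
} OpKind;

/* *pp must point at '<'. Advances *pp past the longest operator found. */
static OpKind scan_less(const char **pp) {
    const char *p = *pp + 1;   /* step over the leading '<' */
    OpKind kind = OP_LT;

    switch (*p) {
    case '=':
        ++p;
        if (*p == '>') { ++p; kind = OP_SPACESHIP; }
        else           { kind = OP_LE; }
        break;
    case '<':
        ++p;
        if (*p == '=') { ++p; kind = OP_SHL_ASSIGN; }
        else           { kind = OP_SHL; }
        break;
    case '%': ++p; kind = OP_LBRACE;   break;
    case ':': ++p; kind = OP_LBRACKET; break;
    default:  break;           /* plain '<' */
    }
    *pp = p;
    return kind;
}
```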

Template Angle Bracket Disambiguation

sub_67CB70 (handle_template_angle_brackets, 263 lines) handles the critical disambiguation of < and > in template contexts. In template argument lists, < opens and > closes, but in expressions, they are comparison operators. The function uses scope context information and the current parsing state (from the 784-byte scope entries) to make the determination.
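
At its simplest, the decision for `<` reduces to asking whether the name just before it denotes a template. The sketch below captures only that idea. The fixed name list is a stand-in for real scope lookup, and `classify_less` and `AngleKind` are hypothetical; the actual function consults far more parsing state.

```c
#include <stddef.h>
#include <string.h>

typedef enum { ANGLE_LESS_THAN, ANGLE_TEMPLATE_OPEN } AngleKind;

/* Stand-in for scope lookup: names assumed to denote templates. */
static const char *const k_template_names[] = { "vector", "pair" };

static int names_template(const char *id) {
    size_t count = sizeof k_template_names / sizeof k_template_names[0];
    for (size_t i = 0; i < count; ++i)
        if (strcmp(id, k_template_names[i]) == 0) return 1;
    return 0;
}

/* Classifies a '<' given the identifier immediately before it
 * (NULL when the preceding token is not an identifier). */
static AngleKind classify_less(const char *preceding_identifier) {
    if (preceding_identifier != NULL && names_template(preceding_identifier))
        return ANGLE_TEMPLATE_OPEN;
    return ANGLE_LESS_THAN;
}
```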

Error Recovery

Address	Identity	Lines	Description
`sub_6887C0`	`skip_to_token`	317	Error recovery: skips tokens until finding a synchronization point (`;`, `}`, etc.)
`sub_6886F0`	`expect_token`	31	Checks current token matches expected kind, emits diagnostic on mismatch
`sub_688560`	`peek_next_token`	44	Looks ahead at next token without consuming it

The stop-token table at qword_126DB48 + 8 (357 entries) controls which token kinds are valid synchronization points for error recovery.
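
A stop table indexed by token kind makes the recovery loop very small. The sketch below assumes a tiny hypothetical token set and a pre-lexed array in place of the live token stream; only the table-lookup idea is taken from the description above.

```c
#include <stddef.h>

/* Hypothetical token kinds; the real table has 357 entries. */
enum { TK_EOF, TK_IDENT, TK_NUMBER, TK_SEMI, TK_RBRACE, TK_COUNT };

/* Advances from pos until a token whose stop[] entry is nonzero, or
 * end of input. Returns the index of that token. The tokens array
 * must be terminated by TK_EOF. */
static size_t skip_to_stop_token(const int *tokens, size_t pos,
                                 const unsigned char stop[TK_COUNT]) {
    while (tokens[pos] != TK_EOF && !stop[tokens[pos]]) ++pos;
    return pos;
}
```

With `;` and `}` marked as stop tokens, recovery starting inside `a 1 b ; c` halts on the `;`, and the parser can resume at the next statement.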

Built-in Type and Attribute Handling

Address	Identity	Lines	Description
`sub_685AB0`	`handle_builtin_type_token`	289	Processes built-in type keywords (`int`, `float`, etc.) into type tokens
`sub_685F10`	`process_decltype_token`	212	Handles `decltype()` expression in token stream
`sub_686350`	`handle_attribute_token`	584	Processes `[[attribute]]` and `__attribute__((x))` syntax, including CUDA attributes
`sub_686F40`	`process_asm_or_extension_keyword`	244	Handles `asm`, `__asm__`, and extension keywords

Diagnostic Strings

String	Source	Condition
`"pop_lexical_state_stack_full"`	`sub_668330`	Assert at `lexical.c:17808`
`"copy_tokens_from_cache"`	`sub_669650`	Assert at `lexical.c:3417`
`"scan_universal_character"`	`sub_6711E0`	Assert at `lexical.c:12384`
`"get_token_from_cached_token_rescan_list"`	`sub_676860`	Assert at `lexical.c:4302`
`"get_token_from_reusable_cache_stack"`	`sub_676860`	Assert at `lexical.c:4450`, `4469`
`"scan_template_argument_list"`	`sub_67DC90`	Assert at `lexical.c:19918`
`"select_dual_lookup_symbol"`	`sub_666720`	Assert at `lexical.c:22477`
`"keyword_init"`	`sub_5863A0`	Assert at `fe_init.c:1597`
`"fe_translation_unit_init"`	`sub_5863A0`	Assert at `fe_init.c:2373`

Diagnostic Code	Context	Meaning
870	Character literal scanning	Invalid character in literal
912	`select_dual_lookup_symbol`	Ambiguous lookup result
1192	Control byte type 6	Stale source position marker
861	Control byte type 6	Invalid position reference
1665	`check_deferred_diagnostics`	Deferred macro-related warning
1750	`refill_buffer`	Trigraph sequence warning
2628	Numeric literal scanner	Digit separator (C++14 / C23) used in an earlier language mode
2629	Numeric literal scanner	Digit separator not followed by digit
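
The condition behind code 2629 can be illustrated with a small validity check on a literal's spelling. This is a sketch with an assumed function name, not the scanner's logic. It is also slightly stricter than the table entry: it requires a digit before each separator as well as after it, which matches the C++ grammar for digit sequences.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if every ' in the spelling sits directly between two
 * (hexadecimal) digits, 0 otherwise. Prefixes such as 0x pass through
 * untouched because they contain no separator. */
static int digit_separators_ok(const char *spelling) {
    for (size_t i = 0; spelling[i] != '\0'; ++i) {
        if (spelling[i] != '\'') continue;
        if (i == 0) return 0;                                    /* leading ' */
        if (!isxdigit((unsigned char)spelling[i - 1])) return 0; /* e.g. 1''0 */
        if (!isxdigit((unsigned char)spelling[i + 1])) return 0; /* trailing or doubled */
    }
    return 1;
}
```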

Function Map

Address	Identity	Confidence	Lines	EDG Source
`sub_5863A0`	`keyword_init` / `fe_translation_unit_init`	98%	1,113	`fe_init.c:1597`
`sub_666720`	`select_dual_lookup_symbol`	HIGH	372	`lexical.c:22477`
`sub_668330`	`pop_lexical_state_stack_full`	HIGH	166	`lexical.c:17808`
`sub_668660`	`check_deferred_diagnostics`	MEDIUM	104	`lexical.c`
`sub_6688A0`	`get_scope_from_entity`	HIGH	32	`lexical.c`
`sub_668C90`	`classify_identifier_entity`	MEDIUM	89	`lexical.c`
`sub_668E00`	`resolve_entity_through_alias`	MEDIUM	88	`lexical.c`
`sub_669650`	`copy_tokens_from_cache`	HIGH	385	`lexical.c:3417`
`sub_669D00`	`allocate_token_cache_entry`	MEDIUM	119	`lexical.c`
`sub_66A000`	`append_to_token_cache`	MEDIUM	88	`lexical.c`
`sub_66A140`	`push_token_to_rescan_list`	MEDIUM	46	`lexical.c`
`sub_66A3F0`	`create_source_region_node`	MEDIUM	84	`lexical.c`
`sub_66A5E0`	`free_macro_definition_chain`	MEDIUM	91	`lexical.c`
`sub_66A770`	`lookup_macro_at_position`	MEDIUM	41	`lexical.c`
`sub_66A890`	`push_macro_expansion`	MEDIUM	41	`lexical.c`
`sub_66AA50`	`process_preprocessor_directive`	MEDIUM	380	`lexical.c`
`sub_66B1F0`	`emit_preprocessed_output`	MEDIUM	231	`lexical.c`
`sub_66B910`	`skip_whitespace_and_comments`	MEDIUM	105	`lexical.c`
`sub_66BB50`	`open_source_file`	HIGH	332	`lexical.c`
`sub_66C550`	`scan_string_literal`	MEDIUM	356	`lexical.c`
`sub_66CBD0`	`scan_character_literal`	MEDIUM	111	`lexical.c`
`sub_66D100`	`set_source_position`	HIGH	282	`lexical.c`
`sub_66D5E0`	`emit_output_line`	HIGH	491	`lexical.c`
`sub_66DFF0`	`scan_pp_number`	MEDIUM	268	`lexical.c`
`sub_66EA70`	`open_next_input_file`	MEDIUM	364	`lexical.c`
`sub_66F4E0`	`read_next_source_line`	HIGH	735	`lexical.c`
`sub_6702F0`	`refill_buffer`	HIGH	792	`lexical.c`
`sub_6711E0`	`scan_universal_character`	HIGH	278	`lexical.c:12384`
`sub_671870`	`validate_universal_character_value`	MEDIUM	62	`lexical.c`
`sub_6719B0`	`scan_identifier_or_keyword`	HIGH	400	`lexical.c`
`sub_672390`	`scan_numeric_literal`	HIGH	1,571	`lexical.c`
`sub_6748A0`	`convert_integer_suffix`	MEDIUM	137	`lexical.c`
`sub_674BB0`	`determine_numeric_literal_type`	MEDIUM	400	`lexical.c`
`sub_675390`	`scan_float_exponent`	MEDIUM	57	`lexical.c`
`sub_6754B0`	`convert_float_literal`	MEDIUM	338	`lexical.c`
`sub_676080`	`scan_raw_string_literal`	MEDIUM-HIGH	391	`lexical.c`
`sub_676860`	`get_next_token`	HIGHEST	1,995	`lexical.c:4302`
`sub_679800`	`scan_token`	HIGH	907	`lexical.c`
`sub_67BAB0`	`scan_header_name`	MEDIUM	110	`lexical.c`
`sub_67CB70`	`handle_template_angle_brackets`	MEDIUM	263	`lexical.c`
`sub_67D050`	`check_token_paste_validity`	LOW	57	`lexical.c`
`sub_67D1E0`	`handle_token_pasting`	MEDIUM	117	`lexical.c`
`sub_67D440`	`stringify_token`	MEDIUM	251	`lexical.c`
`sub_67D900`	`expand_macro_argument`	MEDIUM	204	`lexical.c`
`sub_67DC90`	`scan_template_argument_list`	HIGH	1,078	`lexical.c:19918`
`sub_67F2E0`	`create_template_argument_cache`	MEDIUM	184	`lexical.c`
`sub_67F740`	`rescan_template_arguments`	MEDIUM-HIGH	583	`lexical.c`
`sub_680670`	`resolve_dependent_template_id`	MEDIUM	240	`lexical.c`
`sub_680AE0`	`handle_dependent_name_context`	MEDIUM	235	`lexical.c`
`sub_6810F0`	`get_token_main`	HIGHEST	3,811	`lexical.c`
`sub_685AB0`	`handle_builtin_type_token`	MEDIUM	289	`lexical.c`
`sub_685F10`	`process_decltype_token`	MEDIUM	212	`lexical.c`
`sub_686350`	`handle_attribute_token`	MEDIUM-HIGH	584	`lexical.c`
`sub_686F40`	`process_asm_or_extension_keyword`	MEDIUM	244	`lexical.c`
`sub_687F30`	`setup_lexer_for_parsing_mode`	MEDIUM	216	`lexical.c`
`sub_688320`	`push_lexical_state`	MEDIUM	137	`lexical.c`
`sub_688560`	`peek_next_token`	MEDIUM	44	`lexical.c`
`sub_6886F0`	`expect_token`	MEDIUM	31	`lexical.c`
`sub_6887C0`	`skip_to_token`	MEDIUM	317	`lexical.c`

Cross-References

Pipeline Overview -- keyword registration during sub_5863A0
Entry Point & Initialization -- frontend init calls keyword_init
Template Engine -- template argument scanning at lexer level
Type System -- entity kind classification used by lexer
Token Kind Table -- full 357-entry token table
Scope Entry -- 784-byte scope entry structure
Entity Node Layout -- entity node offsets used by identifier classification
Global Variable Index -- all global addresses referenced here
Attribute System Overview -- CUDA attribute handling at token level
