EDG 6.6 Frontend
NVIDIA licenses the Edison Design Group (EDG) C/C++ front end — a commercial compiler frontend used by several major compilers including Intel ICC. In cicc v13.0, EDG version 6.6 occupies 3.2 MB of code (0x5D0000–0x8F0000), making it the largest single subsystem in the binary. Unlike most modern compilers that parse directly to an SSA-based IR, EDG operates as a source-to-source translator: it parses CUDA C++ source code and emits transformed C code containing CUDA runtime API calls. This output is then fed into a second compilation phase that produces NVVM IR (LLVM bitcode). This two-stage design means the CUDA language extensions (kernel launch syntax, memory space qualifiers, device/host function annotations) are resolved entirely within EDG, and the LLVM-based backend never sees raw CUDA syntax.
The EDG frontend is configured at compile time through 737 #define macros, including GCC 8.1 emulation mode and Clang 9.1 emulation mode. Exceptions are disabled by default — CUDA device code cannot use C++ exceptions — while RTTI remains enabled for dynamic_cast support in host-side code that interacts with device objects.
| EDG version | 6.6 (string: "Based on Edison Design Group C/C++ Front End, version 6.6") |
| Entry symbol | lgenfe_main (string at sub_617BD0) |
| GCC emulation | 8.1 (DEFAULT_GNU_VERSION = 80100) |
| Clang emulation | 9.1 (DEFAULT_CLANG_VERSION = 90100) |
| C++ standards | C++98, C++11, C++14, C++17, C++20, C++23 (unk_4F07778 = year code) |
| C standards | C99, C11, C18, C23 |
| Exceptions | Disabled by default (DEFAULT_EXCEPTIONS_ENABLED = 0) |
| RTTI | Enabled by default (DEFAULT_RTTI_ENABLED = 1) |
| Target model | LP64 (TARG_SIZEOF_POINTER = 8, TARG_SIZEOF_LONG = 8) |
| Backend | C-codegen (BACK_END_IS_C_GEN_BE = 1) — emits C source, not LLVM IR directly |
| Functions | ~5,000 in range, 300+ above 5KB |
Architecture
The compilation flow through EDG has four major phases: CLI parsing (282-case switch), translation unit initialization (keyword tables, parser bootstrapping), parsing and semantic analysis (the bulk of the 3.2 MB), and backend code emission (generating three output files: .int.c for internal declarations, .device.c for device code, and .stub.c for host-side launch stubs). Error recovery uses setjmp/longjmp — any of the 478 call sites that invoke the abort handler (sub_721090) will unwind back to the orchestrator rather than crashing the process.
sub_5D2A80 (orchestrator, setjmp error recovery)
│
├─ sub_617BD0 (lgenfe_main: 282-case CLI switch, 737 config #defines)
│ ├─ sub_610260 (register 300+ CLI options)
│ └─ sub_6140E0 (option fetcher loop)
│
├─ sub_8D0BC0 (translation unit init)
│ ├─ sub_706250 (keyword table: ~350 keywords via sub_885C00)
│ ├─ sub_858C60 (parser entry)
│ └─ sub_709290 (finalize)
│
├─ sub_709330 ("Generating Needed Template Instantiations", "Wrapping up translation unit")
│
└─ sub_5E3AD0 (backend entry: "Generating NVVM IR")
├─ Opens .int.c / .device.c / .stub.c output files
├─ sub_5DB980 (top-level declaration dispatcher)
│ ├─ sub_5E13C0 (function declaration printer, 44KB)
│ ├─ sub_5DBFC0 (expression printer, 41KB, 61 self-references)
│ ├─ sub_5DFD00 (statement printer, 26KB)
│ ├─ sub_5D80F0 (initializer printer)
│ ├─ sub_5DAD30 (struct/union/enum printer)
│ └─ sub_5DF1B0 (inline asm printer)
│
└─ dlopen("libTileIRCompiler_shared.so") [optional, gated by dword_4D045A0]
└─ dlsym("cudacc_back_end") — 17-entry function pointer table
Timer callbacks record "Front end time", "Back end time", and "Total compilation time" via sub_7211D0.
Orchestrator — sub_5D2A80
The master entry point for the entire frontend. Uses setjmp for non-local error recovery — when any of the ~5,000 EDG functions detects an unrecoverable error (type system inconsistency, parser corruption, internal assertion failure), it calls sub_721090, which longjmps back to this function. The 478 call sites that reference the abort handler demonstrate just how pervasive error checking is throughout the frontend — roughly 10% of all functions in the EDG range can trigger a fatal abort.
| Global | Purpose |
|---|---|
unk_4D045D8 | Phase callback (prints "Generating NVVM IR" etc.) |
unk_4D04744 | Timer enable flag |
unk_4F074B0 | Error flag (frontend errors occurred) |
unk_4F074A8 | Warning count |
qword_4F076F0 | Input source filename |
Frontend Entry — sub_617BD0 (lgenfe_main)
At 123KB and 3,113 decompiled lines, lgenfe_main is the largest function in the EDG range. The name "lgenfe" stands for "LLVM-generating front end" — a hint that this function was originally designed for a different backend before NVIDIA adopted the EDG+LLVM architecture. The function is divided into three distinct regions: a massive 282-case switch for CLI option parsing (2,000 lines), a post-parse validation phase that checks for conflicting options and enforces CUDA-specific constraints, and a file I/O setup phase that installs 11 signal handlers and returns a pointer to the configured compilation context.
Signature: (int argc, __int64 argv).
Structure
| Region | Lines | Content |
|---|---|---|
| A | 164–2157 | 282-case switch on option ID (v6) |
| B | 2157–2700 | Post-parse validation and cross-option consistency |
| C | 2700–3113 | File I/O setup, 11 signal handlers, return &qword_4D046F0 |
Architecture Parsing (case 0x52)
compute_75, compute_80, compute_86, compute_87, compute_88, compute_89
compute_90, compute_90a
compute_100, compute_100a, compute_100f
compute_103, compute_103a, compute_103f
compute_110, compute_110a, compute_110f
compute_120, compute_120a, compute_120f
compute_121, compute_121a, compute_121f
Storage: unk_4D045E8 = SM number, unk_4D045E4 = a suffix flag, unk_4D045E0 = f suffix flag.
Configuration Emission (case 0xE1)
Emits 737 #define macros to configure the EDG compiler. Key defines:
| Define | Value | Meaning |
|---|---|---|
VERSION_NUMBER | "6.6" | EDG frontend version |
EDG_MAIN | "lgenfe_main" | Entry point symbol |
DEFAULT_GNU_VERSION | 80100 | Emulate GCC 8.1 |
DEFAULT_CLANG_VERSION | 90100 | Emulate Clang 9.1 |
DEFAULT_EXCEPTIONS_ENABLED | 0 | CUDA: no exceptions |
TARG_SIZEOF_POINTER | 8 | 64-bit pointers |
TARG_SIZEOF_LONG_DOUBLE | 16 | 128-bit long double |
TARG_LITTLE_ENDIAN | 1 | x86-64 host |
USE_SOFTFLOAT | 1 | Software FP for constexpr |
ABI_COMPATIBILITY_VERSION | 9999 | Maximum ABI compat |
MODULE_MAX_LINE_NUMBER | 250000 | Max lines per module |
CLI Option Registration — sub_610260
Registers ~300 options via sub_6101D0(id, name, flag, ...). CUDA-specific options include:
| ID | Name | Purpose |
|---|---|---|
| 51 | no-device-int128 | Disable __int128 on device |
| 59 | emit-llvm-bc | Emit LLVM bitcode directly |
| 60 | device-debug | Device-side debug info |
| 68 | force-volatile | Force volatile on memory space (global/shared/constant/local/generic/all) |
| 73 | kernel-params-are-restrict | All kernel pointer params are __restrict__ |
| 82 | nv_arch | compute_XX architecture selection |
| 93 | device-c | Separate compilation mode |
| 105 | tile-only | TileIR-only compilation |
| 124 | extended-lambda | Extended lambda support (--expt-extended-lambda) |
| 132 | emit-lifetime-intrinsics | LLVM lifetime intrinsics |
Translation Unit Processing
Translation unit processing is where EDG transitions from CLI configuration to actual compilation. The init function sets up the lexer, allocates the translation unit data structure (416 bytes), populates the keyword table with ~350 entries, and enters the recursive-descent parser. EDG uses a keyword-registration model where each keyword is individually registered with its token ID — this allows NVIDIA to add CUDA-specific keywords (like __shared__ or __nv_fp8_e4m3) without modifying the core parser grammar.
Init — sub_8D0BC0
- Reset token state (
dword_4F063F8 = 0) - Call
sub_727950(lexer init) - Allocate 416-byte TU object via
sub_823970 - Call
sub_706250— keyword table init (~350 keywords) - Call parser entry (
sub_858C60or PCH pathsub_852E40) - Call
sub_709290— finalize
Keyword Registration — sub_706250
30KB. Calls sub_885C00(token_id, "keyword_string") ~350 times. Initializes 30+ subsystems before keyword registration. Categories:
- C89 keywords:
auto,break,case,const,continue,default,do,double,else, ... - C99 additions:
_Bool,_Complex,_Generic,_Atomic,restrict,inline - C11/C23:
_Static_assert,_Thread_local,_Alignas,_Alignof,constexpr,typeof - C++ keywords:
class,template,virtual,namespace,using,try,catch,throw, ... - C++20:
co_yield,co_return,co_await,requires,concept - Type traits (~80):
__is_pod,__is_abstract,__is_trivially_copyable,__has_virtual_destructor, ... - NVIDIA extensions:
__nv_is_extended_device_lambda_closure_type,__nv_is_extended_host_device_lambda_closure_type,__nv_is_extended_device_lambda_with_preserved_return_type - EDG internal:
__edg_type__,__edg_vector_type__,__edg_neon_vector_type__,__edg_scalable_vector_type__
Version-gated by dword_4F077C4 (language mode), unk_4F07778 (standard year), qword_4F077B4 (feature flags).
Finalization — sub_709330
Strings: "Generating Needed Template Instantiations", "Wrapping up translation unit". Calls sub_8B18F0 for C++ template instantiation when dword_4F077C4 == 2.
Preprocessor
EDG includes its own preprocessor rather than relying on an external cpp. This is standard for EDG-based compilers — the preprocessor is tightly integrated with the parser to handle complex interactions between macros and C++ syntax (e.g., __VA_OPT__ in C++20, which requires the preprocessor to understand syntactic context). The preprocessor occupies ~250KB across four major functions and maintains a 99-entry predefined macro table plus a 25-entry feature-test macro table.
Token Scanner — sub_7B8B50 (59KB)
The main preprocessor tokenizer. Handles all C/C++ token kinds: identifiers, numbers (delegates to sub_7B40D0), string literals, operators, punctuators, UCN sequences. Detects C++20 module/import keywords via string comparison.
Numeric Literal Parser — sub_7B40D0 (42KB)
Second-largest preprocessor function. Handles: integer suffixes (u/U/l/L/ll/LL), float suffixes (f/F/l/L), hex floats (0x...p...), binary literals (0b...), C++14 digit separators (').
Macro Expander — sub_81B8F0 (77KB)
The central macro expansion engine. Features:
__VA_ARGS__(C99) and__VA_OPT__(C++20) support- 99-entry predefined macro table at
off_4B7C440(stride 40 bytes) - 25-entry feature-test macro table at
off_4B7C360 - Recursion limit: 300 expansions (error
0xE3) - Intrinsic type-trait macros:
__type_pack_element,__is_signed,__make_integer_seq,__is_pointer
Character Scanner — sub_7BC390 (29KB)
Giant switch on character value. Handles trigraph sequences, line splices, multi-byte characters, comment detection (// and /*).
Parser & Declaration Processing
The parser subsystem is the largest part of the EDG frontend — over 1 MB of code spread across dozens of functions. EDG uses a recursive-descent parser augmented with a declaration-specifier state machine. The state machine design is necessary because C/C++ declaration specifiers can appear in any order (const unsigned long long int and int long unsigned long const are identical), requiring the parser to accumulate specifiers into bitmasks and resolve the final type only after all specifiers have been consumed.
NVIDIA's major contribution to the parser is the CUDA type extension infrastructure: 19 new FP8/FP6/FP4/MX-format type tokens (339–354) for Blackwell's tensor core operations, 9 address-space qualifier tokens (272–280) for GPU memory spaces, and 4 memory-space declaration specifiers (133–136) that piggyback on the existing width-modifier field. These extensions are grafted onto EDG's type system in a way that minimizes changes to the core parser logic — CUDA qualifiers reuse existing state variables with previously-unused value ranges.
Declaration Specifier State Machine — sub_672A20 (132KB, 4,371 lines)
The central parser function and one of the most complex functions in the binary. A while(2)/switch dispatcher on token codes from word_4F06418[0] with ~80 case labels. It accumulates type specifiers, qualifiers, storage-class specifiers, and CUDA address-space qualifiers from the token stream into a set of bitmask variables, then constructs the final type node from the accumulated state.
State Variables
| Variable | Stack | Bits | Role |
|---|---|---|---|
v325 | [rsp+B8h] | uint | Type specifier kind (see table below) |
v327 | [rsp+C0h] | uint64 | Specifier category bitmask |
v307 | [rsp+90h] | int | CV-qualifier accumulation bits |
v302 | [rsp+78h] | uint | Long count (0=none, 1=long, 2=long long) |
v305 | [rsp+84h] | uint | Signedness/width — reused for CUDA (4–7) |
v299 | [rsp+68h] | int | _Complex (1) / _Imaginary (2) tracking |
Type Specifier Kind (v325)
| Value | Meaning | Token Case |
|---|---|---|
| 0 | None yet | — |
| 2 | char | 80 |
| 3 | wchar_t | 165 |
| 4 | bool / _Bool | 128 / 120 |
| 5 | float | 126 |
| 6 | double | 127 |
| 7 | void | 180 |
| 8 | signed / __int8 | 93 / 239 |
| 9 | __float128 | 331 |
| 12 | int (explicit) | 89 |
| 14 | __float16 | 332 |
| 15 | short / half | 85 |
| 16 | _Float16 | 333 |
| 17 | __bf16 | 334 |
| 19 | bfloat16 | 335 |
| 20 | Resolved typedef/CUDA type name | scope lookup |
| 21 | struct/union/enum tag | 101/104/151 |
| 23 | decltype() | 183 |
| 24 | auto (deduced) | 186 |
| 25 | Resolved identifier type | C++ lookup |
| 26 | Error recovery type | diagnostic |
Specifier Bitmask (v327)
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x1 | Storage class (extern/static/etc.) |
| 1 | 0x2 | CV-qualifier seen |
| 2 | 0x4 | Type specifier seen |
| 3 | 0x8 | friend specifier |
| 4 | 0x10 | __declspec / attribute seen |
| 5 | 0x20 | explicit specifier |
| 6 | 0x40 | inline specifier |
| 7 | 0x80 | _Thread_local / thread_local |
| 10 | 0x400 | typeof / decltype |
| 12 | 0x1000 | __declspec() already processed |
| 13 | 0x2000 | explicit(bool) already processed |
| 14 | 0x4000 | _Noreturn / [[noreturn]] |
| 15 | 0x8000 | _Atomic |
CV-Qualifier Bits (v307)
| Bit | Mask | Qualifier |
|---|---|---|
| 0 | 0x01 | const (case 81) |
| 1 | 0x02 | volatile (case 107) |
| 2 | 0x04 | restrict / __restrict (cases 118/119) |
| 3 | 0x08 | __unaligned (case 263 with parens) |
| 4 | 0x10 | __ptr32 (case 264) |
| 5 | 0x20 | __ptr64 (case 265) |
| 6 | 0x40 | __sptr / __uptr (case 266) |
Duplicate CV qualifiers trigger diagnostic 83.
CUDA Memory Space Tokens (133–136)
These piggyback on the signedness/width field v305 with values 4–7:
| Token | Keyword | v305 | v325 | Formula |
|---|---|---|---|---|
| 133 | __shared__ | 4 | 2 | Special case |
| 134 | __device__ | 5 | 8 | token - 129 |
| 135 | __constant__ | 6 | 8 | token - 129 |
| 136 | __managed__ | 7 | 8 | token - 129 |
Clean separation: values 0–3 = standard C width modifiers, 4–7 = CUDA address-space qualifiers. The type-construction switch handles both ranges.
CUDA Extended Type Tokens (339–354)
| Token | Type | Format |
|---|---|---|
| 236 | __nv_fp8_e4m3 | FP8 |
| 339 | __nv_fp8_e5m2 | FP8 |
| 340–343 | __nv_fp8x{2,4}_e{4m3,5m2} | FP8 vector |
| 344–345 | __nv_fp6_e{2m3,3m2} | FP6 |
| 346–347 | __nv_fp6x2_e{2m3,3m2} | FP6 vector |
| 348–349 | __nv_mxfp8_e{4m3,5m2} | MX-format FP8 |
| 350–351 | __nv_mxfp6_e{2m3,3m2} | MX-format FP6 |
| 352 | __nv_mxfp4_e2m1 | MX-format FP4 |
| 353 | __nv_satfinite | Saturation type |
| 354 | __nv_e8m0 | Exponent-only E8M0 |
All resolve via sub_6911B0() → type node, then set v325=20, v327|=4.
CUDA Address Space Qualifier Tokens (272–280)
| Token | Keyword | Space ID | Handler |
|---|---|---|---|
| 272 | __attribute__((address_space(N))) | parsed int | sub_6210B0 |
| 273 | __global__ | 0 | sub_667B60(0,...) |
| 274 | __shared__ (addr space) | 2 | sub_667B60(2,...) |
| 275 | __constant__ (addr space) | 3 | sub_667B60(3,...) |
| 276 | __generic__ | — | sub_72B620(type, cv) |
| 277 | __nv_tex_surf_handle_t | — | sub_72BA30(unk_4F06A51) |
| 278 | __nv_buffer_handle_t | — | sub_72BA30(unk_4F06A60) |
| 279 | __nv_grid_constant | — | sub_72C390() |
| 280 | __nv_is_extended_device_lambda | — | sub_72C270() |
Type Construction Functions
| Function | Purpose | Trigger |
|---|---|---|
sub_72BA30(code) | Fundamental signed integer type | int, short, long, long long |
sub_72BC30(code) | CUDA extended-width integer | CUDA mode + v305 > 3 |
sub_72BCF0(code) | Unsigned fundamental type | unsigned combos |
sub_72BDB0(code) | CUDA unsigned extended type | CUDA mode + unsigned |
sub_72BF70() | float type | v325 == 5 |
sub_72C030() | double type | v325 == 6 |
sub_72C0F0() | long double type | long + double |
sub_72C1B0() | __float128 type | v325 == 9 |
sub_72C610(kind) | Float-by-kind (mapped from v325) | FP8/FP6/BF16/etc. |
sub_72C6F0(kind) | _Complex float variant | v299 == 1 |
sub_72C7D0(kind) | _Imaginary float variant | v299 == 2 |
sub_72C930(code) | Error/placeholder type | diagnostic issued |
sub_72CBA0() | Dependent type | v325 == 25 |
sub_72CBE0(...) | __int128 type | v325 == 1 |
sub_73C570(type, cv, flags) | Apply CV-qualifiers to type | post-construction |
Accumulation Flow
- Initialize: all state variables to 0
- Loop: read
word_4F06418[0], dispatch through switch — set bitmask bits, update kind/cv/width - Exit: unrecognized token →
LABEL_8(default exit) - Type construction: switch on
v325 × v302 × v305→ call appropriatesub_72B*/sub_72C* - CV application:
sub_73C570wraps the type with const/volatile/restrict - Return: type stored at
ds->field_272, CV bits atds->field_120
Declaration Specifier Parser — sub_7C0F00 (184KB, 3,953 lines)
Uses goto-driven dispatch (393 LABEL_ references) — NOT a switch/case. This is a massive state machine for declaration specifier resolution. Self-recursive at line 2407 with flags=20 for nested declarator parsing.
Top-Level Declaration Parser — sub_662DE0 (61KB)
Declarator parsing — handles pointer (*), reference (&/&&), array ([]), and function (()) declarators. Uses SSE __m128i for bulk struct copying of 64-byte EDG type nodes.
Overload Resolution — sub_6523A0 (64KB)
The master overload resolution function. Given a declaration being introduced and a set of existing candidates from name lookup, it decides whether the declaration is a new overload, a redeclaration, or an error. At 2,448 decompiled lines with 39 diagnostic call sites, it is one of the heaviest diagnostic emitters in the frontend.
Candidate collection uses a 72-byte ranking context (v320 on stack) and dispatches to one of three collectors: sub_644100 for non-member/ADL candidates, sub_648CF0 for member + using-declaration candidates (chosen when C++ mode, prior declaration exists, and the class has base classes or is a template), or sub_6418E0 for C-linkage functions. The best candidate is selected by sub_641B60.
__builtin_ prefix forwarding (lines 2060-2162): after resolution, if the resolved symbol is a bodyless non-member function, the resolver checks if a compiler builtin equivalent exists. It hardcodes three function names by length: "abs" (3), "ceil" (4), "strlen" (6). For each, it constructs "__builtin_" + name in a scratch buffer at qword_4F06C50, looks it up via sub_878540, then compares parameter types via sub_8DED30(type1, type2, 0x100004) (exact match + qualification conversion). On match, the builtin's scope entry is linked into the user function's auxiliary data at offset +256 field 8.
OpenMP variant dispatch (lines 727-752): when unk_4D03A10 is set, the resolver renames the declaration to "<name>$$OMP_VARIANT%06d" using a monotonic counter unk_4D03A0C. This creates unique internal names for each device/host specialization.
Constexpr/consteval propagation (lines 2288-2301): gated by unk_4F07778 (C++ standard year). For C++11 and later, byte +204 of the scope entry is bit-packed with three globals: bits 5-6 = unk_4F06C58 (constexpr disposition), bits 1-2 = unk_4F06C5A (consteval disposition), bits 3-4 = unk_4F06C59 (immediate-function flag). Diagnostic 2383 fires on constexpr mismatch between declaration and definition.
Device/host overload sets: CUDA allows the same function name to have both __host__ and __device__ overloads. EDG does not treat execution space as part of the function signature for overload resolution purposes -- the standard C++ overload rules apply first, and execution space filtering happens later during code generation. The $$OMP_VARIANT renaming mechanism is used for OpenMP dispatch variants that need distinct host/device specializations, but regular CUDA __host__/__device__ overloads rely on the backend's execution space filtering rather than frontend overload resolution. This means that if two functions have identical C++ signatures but differ only in __host__ vs __device__, they are treated as redeclarations (not overloads) at the EDG level, and the execution space annotation at scope entry offset +198 determines which version survives into device or host code.
CUDA Memory Space Processing — sub_6582F0 (22KB)
Validates __shared__, __constant__, __managed__ attributes on declarations. Emits diagnostic for automatic variables in inappropriate memory spaces.
Type System
Type Node Layout (192 bytes = 12 x __m128i)
| Offset | Size | Field |
|---|---|---|
| +8 | 8 | Next pointer (linked lists) |
| +40 | 8 | Name pointer |
| +48 | 1 | Declaration kind byte |
| +80 | 1 | Entity kind byte |
| +140 | 1 | TYPE KIND DISCRIMINATOR (the central dispatch key) |
| +160 | 8 | Inner/child type pointer (typedef chains, pointer bases) |
| +168 | 8 | Member list / parameter chain |
| +173 | 1 | Specifier/node kind byte |
| +176 | 2 | Entity kind (uint16, dispatch key for constexpr evaluator) |
| +185 | 1 | CV-qualifier bits (bit 0=const, 1=volatile, 2=restrict) |
| +200 | 1 | Attribute flags |
Type kind discriminator values at offset +140:
| Value | Type | Notes |
|---|---|---|
| 0 | void | |
| 1 | error type | Sentinel |
| 2–4 | fundamental (char, int, ...) | |
| 5 | pointer | Follows +160 chain |
| 6 | pointer-to-member | |
| 7 | function type | Complex: 17 sub-kinds for calling conventions |
| 8 | array | Element count at +128, element type at +160 |
| 9–11 | class / struct / union | Members at +168 |
| 12 | typedef / cv-qualified | Follow +160 for underlying type (critical: skip in type-walk loops) |
| 13 | enum | |
| 14 | void (incomplete) | |
| 15 | vector | Element count at +128 |
| 19 | decltype | |
| 21 | placeholder / auto |
Scope Table Entry (776 bytes)
Indexed by dword_4F04C64 into base qword_4F04C68:
| Offset | Field |
|---|---|
| +0 | Scope identifier |
| +4 | Scope kind (5=namespace, 6=class, 7=function, 8=block, 9=enum, 12=template) |
| +6–10 | Flag bytes |
| +24 | Name list head |
| +32 | Name list tail |
| +208 | Class type pointer |
| +232 | Deferred list |
| +328 | Template info |
| +552 | Parent scope index |
| +624 | Declaration pointer |
| +680 | Linkage specification |
Type Comparison — sub_7386E0 (23KB)
The core type equivalence engine. Takes two type node pointers packed in an __int128 and a flags word, returns boolean equality. The flags word controls comparison mode: bits 0-1 select cv-qualifier strictness (0=strict, 1=relaxed, 2=overload), bit 2 enables template matching (class-equivalence shortcuts), and bit 5 enables anonymous-class structural comparison.
Entry sequence: both types are first canonicalized through sub_72EC50, which peels through chains of non-template typedef aliases. The canonicalizer checks three fields on the elaborated type node: +173 == 12 (typedef kind), +176 == 1 (single-member), and +170 bit 4 == 0 (no template specialization). If all hold, it unwraps one level via sub_72E9A0 and loops. This means typedef int MyInt; typedef MyInt YourInt; canonicalizes YourInt directly to int.
After canonicalization, a quick-reject compares three header bytes without recursing: byte +24 (type kind) must match exactly, bytes +25 XOR must be zero for bits 0x03 (const/volatile) and 0x40 (restrict), and byte +26 XOR must be zero for bit 0x04. Any mismatch short-circuits to return 0.
The main switch dispatches on 38 type kinds. Key cases for CUDA:
- Case 1 (fundamental): compares sub-kind at +56, extra flags at +58 (bits 0x3A), and the base type chain at +72. For integer sub-kind (
sub_kind == 'i'), follows a resolution chain to find the underlying class scope. In template matching mode (flags bit 2), usessub_8C7520to check whether two class instantiations share the same primary template, thensub_89AB40to compare template argument lists. This path handles CUDA's exotic numeric types (__nv_fp8_e4m3,__nv_fp8_e5m2, etc.) which are represented as fundamental types with distinct sub-kinds. - Case 3 (class/struct/union): fast identity via scope pointer equality, then unique-ID shortcut via
dword_4F07588. For anonymous classes with template matching, callssub_740200to extract canonical member lists and performs structural comparison. This is relevant for CUDA lambda closure types, which are anonymous classes. - Case 33 (using-declaration/alias): in overload mode (flags bit 1), performs a hash table lookup via
*qword_4D03BF8to retrieve base class triples and compare element-by-element. This ensures that twousingdeclarations resolving to different base classes are treated as distinct for overload discrimination.
Overload mode specifics (flags & 2): the post-switch check additionally verifies that both types agree on the presence/absence of the +80 "extra declaration" pointer. Template parameters are forced unequal (never match for overload purposes without being identical). Scope pointer equivalence is verified via unique-ID for using-declaration discrimination.
CUDA type equivalence: the NVIDIA-specific float types (__nv_fp8_e4m3, __bf16, _Float16, etc.) each have distinct sub-kind values at type node +56 (see the type mangling table: sub-kind 0 = _Float16, 1 = __fp16, 9 = __bf16, 0xA = _Float16 alternate, 0xB = _Float32, 0xC = _Float64, 0xD = _Float128). The type comparison treats them as distinct fundamental types -- _Float16 and __fp16 are NOT equivalent despite both being 16-bit floats. The half type in CUDA maps to _Float16 (sub-kind 0 or 0xA depending on context), while __half in cuda_fp16.h is a wrapper struct (type kind 9, class/struct), so half and __half are never type-equivalent at the EDG level. User code relies on implicit conversions defined in the CUDA headers, not on type equivalence.
Type-to-String Emitter — sub_74A390 (29KB, 19 callers)
The backbone type printer. Walks type nodes recursively, emitting textual representation for diagnostics. Handles NVIDIA-specific types: __surface_type__, __texture_type__, __nv_bool.
IL Tree Infrastructure
EDG represents parsed code as an Intermediate Language (IL) tree — a rich AST that preserves full C++ semantic information including template instantiation state, scope chains, and type qualifiers. The IL is not LLVM IR; it is EDG's proprietary tree representation that predates the LLVM integration. All semantic analysis, template instantiation, and overload resolution operate on this tree.
The IL tree is traversed by four structurally identical walker functions that share the same 87 node-type dispatch table. The walkers are instantiated from a common template with different callback functions — a design pattern where the traversal logic is fixed but the action at each node is parameterized through function pointers stored in six global variables. This callback-driven walker system is central to EDG's architecture: template instantiation, type checking, code emission, and tree copying all use the same walker infrastructure with different callbacks.
| Function | Size | Self-recursive Calls | Purpose |
|---|---|---|---|
sub_7506E0 | 190KB | 297 | Primary walker |
sub_760BD0 | 109KB | 427 | Parallel walker (deeper traversal) |
sub_75C0C0 | 87KB | 316 | Third-pass walker |
sub_766570 | 148KB | 2 | Copier/transformer (takes callback params) |
Walker Callback System
Six global function pointers form the visitor dispatch table:
| Global | Role |
|---|---|
qword_4F08028 | Node pointer remapper (called before recursion) |
qword_4F08020 | Linked-list child remapper |
qword_4F08038 | String field processor |
qword_4F08030 | Pre-visit callback (return nonzero to skip) |
qword_4F08040 | Post-visit callback |
dword_4F08014 | Skip-shared-nodes flag |
dword_4F08018 | Clear/detach mode (null out fields for ownership transfer) |
IL Node Types (87 types, from walker case labels)
| ID | Type | ID | Type |
|---|---|---|---|
| 1 | source_file | 28 | integral_constant |
| 2 | scope (15 sub-kinds) | 29 | float_constant |
| 3 | type_qualifier | 30 | expression (generic) |
| 4 | simple_type | 41 | call_expression |
| 5 | pointer_type | 42 | cast_expression |
| 6 | function_type (17 sub-kinds) | 43 | conditional_expression |
| 7 | class_type | 44 | string_literal |
| 8 | enum_type | 48 | template_argument (4 sub-kinds) |
| 9 | array_type | 59 | concept_expression (10 sub-kinds) |
| 10 | bitfield_type | 65 | type_list (core linked list) |
| 13 | statement (30+ sub-kinds) | 75 | block/compound_statement |
| 23 | scope_entry (root) | 76 | access_specifier |
Deep Copy — sub_766570 with sub_8C2C50
sub_8C2C50 calls sub_766570 with copy callback sub_8C38E0 and list-copy callback sub_8C3810. Node size table at qword_4B6D500[node_type] provides memcpy sizes. Critical for template instantiation.
Constexpr Evaluator
The constexpr evaluator is arguably the most technically impressive subsystem in the EDG frontend. It is a complete tree-walking interpreter that can execute arbitrary C++ code at compile time, implementing the full C++20 constexpr specification including heap allocation (constexpr new), string literals, virtual function dispatch, and complex control flow. At 317KB for the expression evaluator alone, plus 77KB for the statement executor and ~200KB in supporting functions, it constitutes nearly 20% of the entire EDG frontend.
The evaluator operates on EDG's IL tree directly — it does not compile to bytecode or any intermediate form. Instead, it recursively walks expression and statement nodes, maintaining its own memory model (a 3-tier page arena), variable bindings (an open-addressing hash table), and lifetime tracking (scope epoch counters). This design trades execution speed for implementation simplicity and guaranteed semantic fidelity with the compiler's own type system.
Signature:
bool constexpr_eval_expr(
constexpr_ctx *ctx, // a1: evaluation context (hash table, arena, flags)
expr_node **expr, // a2: expression AST node
__m128i *result, // a3: output value slot (16 or 32 bytes)
char *frame_base // a4: stack frame base pointer for lifetime tracking
);
Expression Evaluator — sub_786210 (317KB, 9,075 lines)
The largest function in the entire EDG frontend. Two-level dispatch: outer switch on expression kind *(a2+24), inner switch on operator code *(a2+56) with 124 cases.
Outer Switch — Expression Kinds
| Kind | Hex | Meaning | Notes |
|---|---|---|---|
| 0 | 0x00 | Void/empty | Sets `ctx+132 |
| 1 | 0x01 | Operator expression | → 124-case inner switch on *(a2+56) |
| 2 | 0x02 | Variable reference | Hash table lookup, kind==1(const) or kind==3(constexpr) |
| 3 | 0x03 | Function reference / enumerator | Subkind==5: has constexpr body → recurse |
| 4 | 0x04 | Literal (int/float constant) | Immediate return — value is in the node |
| 5–6 | 0x05–06 | String / compound literal | C++20 mode required (dword_4F077C4 == 2) |
| 7 | 0x07 | Function call | Most complex case (~1200 lines) |
| 10 | 0x0A | Parenthesized expression | Recurse on a2[7] |
| 11 | 0x0B | Member access (->) | Navigate member hierarchy via type-size table |
| 17 | 0x11 | Lambda expression | Save/restore ctx+72, execute body via sub_7987E0 |
| 18 | 0x12 | Capture variable | Hash table lookup by a2[7] |
| 20 | 0x14 | Address-of | Set flags a3+8 = 0x20 (IS_SYMBOLIC) |
| 23 | 0x17 | sizeof / alignof | Delegate to sub_620D80 |
| 24 | 0x18 | Subscript (array[index]) | Bounds check, compute elem_size * index |
| 27 | 0x1B | Implicit conversion | Navigate chain, recurse on inner |
| 31 | 0x1F | Requires expression (C++20) | Execute body via sub_79B7D0 |
| 32 | 0x20 | Type trait | sub_693DC0 → xmmword_4F08280/xmmword_4F08290 |
| 33 | 0x21 | SFINAE / substitution failure | Template context check, sub_6F2300 |
Inner Switch — Operator Codes (124 cases, selected)
| Cases | Category | Operations |
|---|---|---|
| 0–1 | Assignment | = / initialization (ref types: 32-byte memcpy) |
| 3–4 | Conversion | Lvalue-to-rvalue via sub_7A0070 |
| 5 | Type cast | static_cast — massive dispatch: int→int(sub_622780), float→float(sub_709EF0), int→float(sub_710280), ptr→ptr(sub_770010) |
| 14–15 | Member access | . and -> — offset via sub_8D5CF0, virtual base via sub_771030 |
| 16–17 | Pointer arithmetic | Subtraction, ptrdiff_t via sub_7764B0 |
| 20, 29 | Comparison | ==, != via sub_7759B0 |
| 26–28 | Unary | ++, --, unary minus (sub_621DB0) |
| 30–31 | Vector ops | Element-wise comparison loop, broadcast |
| 39–45 | Arithmetic | +(sub_621270), -(sub_6215F0), *(sub_621F20), /(sub_6220A0), %(sub_6220C0), <<(sub_70BBE0), >>(sub_70BCF0) — all with overflow/divzero checks |
| 46–49 | Bitwise | &, |, ^, ~ |
| 50–57 | Logical | &&, || with short-circuit evaluation |
| 58–59 | Detailed comparison | Integer(sub_621000), float(sub_70BE30), pointer(address+symbolic) |
| 64 | Spaceship | <=> → strong_ordering values at unk_4F06BD8–unk_4F06C30 |
| 73–84 | Compound assignment | += through ^= with lifetime validation, const-check (diag 0x1318) |
| 91–93 | Conditional | Ternary ?:, array subscript (bounds-checked, error 0xA84) |
| 94–95 | Virtual dispatch | Vtable lookup → sub_79CCD0 |
| 96–97 | Allocation | Placement new / operator new |
| 103 | Exception | throw (always fails in constexpr) |
| 105–108 | Delegated | → sub_77FCB0 (builtin operators) |
Value Slot Layout (16 bytes at a3)
| Offset | Size | Field |
|---|---|---|
| 0–7 | 8 | Primary value (integer, IEEE float, or arena pointer) |
| 8 | 1 | Flags byte (see below) |
| 9–11 | 3 | Alignment info, compound assignment tracking |
| 12–15 | 4 | Scope epoch ID (lifetime validation) |
Extended slot (32 bytes for reference types) adds secondary address at +16 and frame base at +24.
Flags Byte (offset +8)
| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | IS_POINTER | Value is an indirect pointer |
| 1 | 0x02 | IS_PAST_END | One-past-the-end pointer |
| 2 | 0x04 | HAS_CLEANUP | Destructor chain at +16 |
| 3 | 0x08 | HAS_SUBOBJECT | Refers to a subobject |
| 4 | 0x10 | HAS_BITFIELD | Bitfield offset in bits 8–31 |
| 5 | 0x20 | IS_SYMBOLIC | Unresolved symbolic reference |
| 6 | 0x40 | IS_CONST | From a const declaration |
| 7 | 0x80 | IS_ARRAY_MEMBER | Part of array storage |
Statement Executor — sub_795660 (77KB)
Dispatch on *(a2+40) — statement kind:
| Case | Kind | Notes |
|---|---|---|
| 0 | Declaration | Arena alloc → eval initializer → insert into scoped hash table |
| 1–4 | If / if-else / if-init / if-constexpr | Condition → bool via sub_620EE0 → branch |
| 5 | While loop | Step counter at ctx+120, limit from qword_4D042E0 (~1M). Error 0x97F on exceeded. |
| 6 | Jump (break/continue/goto) | Sets control flow bits: bit 1=continue, bit 2=break, bit 3=goto |
| 7,15,24 | Null/empty | Return success |
| 8 | Return | Walk call chain at ctx+72, store result, set "returned" flag |
| 11 | Expression statement | Evaluate for side effects via sub_7987E0 |
| 12 | For loop | Init → alloc → [condition → body → increment → cleanup] loop |
| 13 | Do-while | Delegates to sub_7A0E60 |
| 14 | Range-based for | 4 temp slots via sub_77A250, iterator advance via sub_7A0470 |
Memory Management — 3-Tier Page Arena
| Tier | Location | Page Size | Threshold | Purpose |
|---|---|---|---|---|
| Primary | ctx+16/ctx+24 | 64KB | default | Expression evaluation temporaries |
| Secondary | ctx+144/ctx+152 | 64KB | lazy init (ctx+132 & 8) | Variable declarations |
| Tertiary | ctx+80 | 64KB | nullable | String/compound literals |
Overflow: allocations >1024 bytes go to heap via sub_822B10(size+16), forming a singly-linked list from ctx+32. Freed by walking until scope epoch matches.
Value slot header: type pointer at offset -8 (8 bytes), lifetime bits at offset -9 (1 byte, bit 0 = "initialized").
Scope epoch: monotonic counter at ctx+128. Hash table at ctx+56/ctx+64 maps epoch → page state. Arena rewound on scope exit.
Hash Table (ctx+0/ctx+8)
Open-addressing with 16-byte entries [key, value]. Hash: key_pointer >> 3. Collision: linear probing. Doubles at 2 * count > capacity (via sub_7704A0). Secondary table at ctx+56/ctx+64/ctx+68 uses 4-byte integer keys (scope epoch IDs).
Diagnostic Codes
| Code | Hex | Meaning |
|---|---|---|
| 61 | 0x3D | Division by zero |
| 2431 | 0x97F | Step limit exceeded |
| 2692 | 0xA84 | Array index out of bounds |
| 2695 | 0xA87 | Unsupported jump in constexpr |
| 2698 | 0xA8A | Null pointer dereference |
| 2705 | 0xA91 | Negative shift count |
| 2707 | 0xA93 | Integer overflow/underflow |
| 2712 | 0xA98 | Use of uninitialized variable |
| 2721 | 0xAA1 | Not a constant expression (generic) |
| 2727 | 0xAA7 | Invalid type conversion |
| 2735 | 0xAAF | Pointer below array start |
| 2751 | 0xABF | Access outside lifetime |
| 2766 | 0xACE | Modification through null pointer |
| 2959 | 0xB8B | Missing return in constexpr function |
| 3007 | 0xBBF | reinterpret_cast in constexpr |
| 3022 | 0xBCE | Call to undefined constexpr function |
Silent mode: ctx+132 bit 5 (0x20) suppresses diagnostics (SFINAE contexts).
Constexpr and CUDA: Host-Side Evaluation of Device Code
A key architectural question for any CUDA compiler is whether constexpr functions annotated __device__ are evaluated at host compile time. In cicc v13.0, the answer is yes, conditionally. The constexpr evaluator operates entirely within the EDG frontend, which runs on the host. When a constexpr __device__ function is used in a context requiring a constant expression (template argument, array bound, static_assert, constexpr variable initializer), the evaluator executes it using its tree-walking interpreter regardless of the function's execution space annotation. The execution space attributes (__device__, __host__, __global__) are semantic annotations for code generation, not for the constexpr evaluator -- the evaluator sees only the IL tree and does not distinguish between host and device function bodies.
This works because EDG's constexpr evaluator uses software floating point (USE_SOFTFLOAT = 1 in the 737-define configuration block). All floating-point arithmetic in constexpr contexts goes through the softfloat library (sub_70B8D0 add, sub_70B9E0 sub, sub_70BBE0 mul, sub_70BCF0 div, sub_709EF0 convert) rather than the host CPU's FPU. This guarantees that constexpr evaluation of device code produces results consistent with IEEE 754 semantics regardless of the host platform's floating-point behavior. The softfloat library handles all precision levels including _Float16, __bf16, _Float32, _Float64, and __float128.
SM architecture gates influence constexpr relaxations. The global qword_4F077A8 (SM version) gates certain constexpr features:
- SM >= 89 (
qword_4F077A8 > 0x15F8F): relaxed constexpr rules for variables with incomplete types dword_4F077C4 == 2: C++20 features including constexprnew, constexpr string literals, and constexpr member access (expression evaluator cases 5/6)dword_4D04880: C++14 relaxed constexpr (loops, local variable mutation, multiple return statements)- C++23/26 extensions: constexpr
try-catch(statement executor case 14), constexpr placementnew(expression evaluator case 103), constexprdynamic_cast(error0xBB7)
The evaluator enforces a step limit (qword_4D042E0, default ~1M iterations) to prevent infinite loops in constexpr evaluation. This limit applies uniformly to both host and device constexpr functions. When exceeded, diagnostic 0x97F ("constexpr evaluation step limit exceeded") is emitted.
One important consequence: __global__ (kernel) functions cannot be constexpr because they have no return value in the conventional sense -- they are launched asynchronously. The parser enforces this at the declaration specifier level, not in the constexpr evaluator.
Supporting Functions
| Function | Size | Role |
|---|---|---|
sub_79CCD0 | 67KB | Object member accessor (base classes, virtual bases, union tracking) |
sub_799B70 | 33KB | Aggregate initializer (arrays, structs, designated init, brace elision) |
sub_79B7D0 | 29KB | Function call evaluator (argument binding, body execution, recursion limits) |
sub_7987E0 | 11KB | Statement list executor entry |
sub_77FCB0 | 150KB | Top-level dispatch (80 expression types + 62-entry intrinsic table) |
sub_7764B0 | 18KB | Type size calculator (Robin Hood hash memoization, 64MB cap) |
sub_7707D0 | — | Clone constexpr object |
sub_7790A0 | — | Trivial aggregate copy |
sub_7A0070 | — | Lvalue-to-rvalue load |
sub_77F5C0 | — | Bounds check (ptr, type → idx, err, size) |
sub_76FFC0 | — | Run cleanup/destructor chain |
Bigint Library (sub_621*)
| Function | Operation |
|---|---|
sub_621000 | compare(a, width_a, b, width_b) → {-1,0,1} |
sub_621270 | add(dst, src, width, overflow_out) |
sub_6215F0 | sub(dst, src, width, overflow_out) |
sub_621F20 | mul(dst, src, width, overflow_out) |
sub_6220A0 | div(dst, src, width, divzero_out) |
sub_6220C0 | mod(dst, src, width, divzero_out) |
sub_621DB0 | negate(dst) |
sub_620EE0 | to_int(value, width, result_out) |
Float Library (sub_70B*)
| Function | Operation |
|---|---|
sub_70B8D0 | add(type, lhs, rhs, dst, inexact, exception) |
sub_70B9E0 | sub |
sub_70BAF0 | negate |
sub_70BBE0 | mul |
sub_70BCF0 | div |
sub_70BE30 | compare(type, lhs, rhs, nan_result) → {-1,0,1,NaN} |
sub_709EF0 | convert(src, src_prec, dst, dst_prec, inexact) |
Key Globals
| Variable | Purpose |
|---|---|
dword_4F077C4 | C++ standard version (2 = C++20, enables constexpr new/string) |
dword_4D04880 | C++14 relaxed constexpr (enables loops, mutation) |
qword_4D042E0 | Max constexpr evaluation steps (~1M) |
xmmword_4F08280 | Canonical constexpr TRUE |
xmmword_4F08290 | Canonical constexpr FALSE |
qword_4F08380 | Global type-size hash table base |
qword_4F08060 | Global allocator function pointer (constexpr new detection) |
CUDA-Specific Extensions
NVIDIA's extensions to the EDG frontend fall into four categories: memory space qualifiers that map to GPU address spaces, kernel launch syntax that gets lowered to CUDA runtime API calls, registration stubs that tell the CUDA runtime about compiled kernels, and atomic builtin generation for the C++11 atomics model on GPU. These extensions are concentrated in the 0x650000–0x810000 range and reference SM architecture version globals extensively — many features are gated by qword_4F077A8 comparisons against architecture thresholds.
CUDA Keyword Extensions
NVIDIA extends the EDG keyword table with execution space qualifiers, memory space qualifiers, and type intrinsics. These exist in four distinct layers -- registered keywords, declaration specifier tokens, address space attribute tokens, and extended type tokens -- each integrated differently into the EDG parser infrastructure.
The critical architectural fact: __device__, __host__, and __global__ are not keywords in the EDG keyword table. They are processed through the C/C++ attribute system, where EDG maps them to internal single-character codes. The declaration specifier state machine (sub_672A20) and the address space handler together resolve these attributes into symbol-table fields that downstream passes consume.
Token ID Inventory
NVIDIA uses four non-contiguous token ID ranges:
| Range | Category | Count | Registration |
|---|---|---|---|
| 133-136 | Memory space declaration specifiers | 4 | Hardcoded in sub_672A20 switch |
| 236, 339-354 | Extended numeric types (FP8/FP6/FP4/MX) | 17 | Resolved via sub_6911B0 |
| 272-280 | Address space qualifier / special type tokens | 9 | Hardcoded handlers in sub_672A20 |
| 328-330 | NVIDIA type trait intrinsics | 3 | Registered via sub_885C00 in sub_706250 |
Only tokens 328-330 use the standard sub_885C00(token_id, "keyword") registration path. All other CUDA tokens are wired directly into parser switch cases, bypassing the keyword table entirely.
Execution Space Qualifiers -- Attribute Path
__device__, __host__, and __global__ are recognized by the attribute parser, which stores them as single-character codes at declaration context offset +269. The complete internal attribute character map (sub_5C79F0 at 0x5C79F0):
| Char | Hex | Attribute | Scope Entry Bits |
|---|---|---|---|
'V' | 0x56 | __host__ | -- (host is the default) |
'W' | 0x57 | __device__ | +198 bit 4 (0x10) |
'X' | 0x58 | __global__ | +198 bit 4 (0x10) AND bit 5 (0x20) |
'Y' | 0x59 | __tile_global__ | -- |
'Z' | 0x5A | __shared__ | -- (stored in +136 as space code 3) |
'[' | 0x5B | __constant__ | -- (stored in +136 as space code 2) |
'\' | 0x5C | __launch_bounds__ | Arguments at decl+336 struct |
']' | 0x5D | __maxnreg__ | -- |
'^' | 0x5E | __local_maxnreg__ | -- |
'_' | 0x5F | __tile_builtin__ | -- |
'f' | 0x66 | __managed__ | -- (stored in +136 as space code 5) |
'k' | 0x6B | __cluster_dims__ | Arguments at cluster config struct |
'l' | 0x6C | __block_size__ | -- |
'r' | 0x72 | __nv_pure__ | -- |
The attribute character code at +269 is consumed by sub_6582F0 (declaration-side validation) and sub_65F400 (definition-side validation). These functions never see the CUDA qualifier as a keyword token -- they only see the resolved character code.
Execution space at scope entry offset +198 is the authoritative record of a function's execution space for all downstream passes:
- Bit 4 (0x10): function is
__device__or__global__-- activates device-scope variable validation - Bit 5 (0x20): function is
__global__(kernel entry point) -- triggers kernel metadata emission viasub_12735D0, which emits("kernel", 1)to LLVM IR - Bit 2 (0x04) at offset +199:
full_custom_abiflag
When a function has bit 5 set, the attribute emitter also iterates the parameter array (40-byte entries at decl+16) and emits ("grid_constant", param_index) for each parameter where byte +33 is nonzero. The preserve-register struct at decl+336 (three int32 fields: data, control, after) is consumed and cleared (set to -1) after emission.
Memory Space Declaration Specifiers (Tokens 133-136)
These piggyback on the signedness/width field v305 in the declaration specifier state machine with values 4-7, cleanly separated from the standard C width modifiers (0-3):
| Token | Keyword | v305 Value | v325 Value | Formula |
|---|---|---|---|---|
| 133 | __shared__ | 4 | 2 | Special case |
| 134 | __device__ | 5 | 8 | token - 129 |
| 135 | __constant__ | 6 | 8 | token - 129 |
| 136 | __managed__ | 7 | 8 | token - 129 |
The type construction switch in sub_672A20 branches on v305 > 3 to invoke CUDA-specific type constructors (sub_72BC30 for signed, sub_72BDB0 for unsigned) instead of the standard C type constructors used for v305 values 0-3.
Address Space Qualifier Tokens (272-280)
Processed by dedicated handlers in the declaration specifier parser:
| Token | Keyword | Handler | Argument |
|---|---|---|---|
| 272 | __attribute__((address_space(N))) | sub_6210B0 | Parses integer N |
| 273 | __global__ (addr space annotation) | sub_667B60(0, ...) | Space ID = 0 |
| 274 | __shared__ (addr space annotation) | sub_667B60(2, ...) | Space ID = 2 |
| 275 | __constant__ (addr space annotation) | sub_667B60(3, ...) | Space ID = 3 |
| 276 | __generic__ | sub_72B620(type, cv) | -- |
| 277 | __nv_tex_surf_handle_t | sub_72BA30(unk_4F06A51) | Texture/surface handle |
| 278 | __nv_buffer_handle_t | sub_72BA30(unk_4F06A60) | Buffer handle |
| 279 | __nv_grid_constant | sub_72C390() | Grid-constant marker |
| 280 | __nv_is_extended_device_lambda | sub_72C270() | Lambda closure check |
Note the dual role of __shared__, __constant__, and __global__: each appears both as a memory space declaration specifier (tokens 133-135) and as an address space qualifier (tokens 273-275). The declaration specifier path stores the result in the symbol-table entry's memory_space_code at offset +136 and memory_space_flags at offset +156. The address space qualifier path stores the result in the EDG type node's qualifier word at offset +18 (values 1=global, 32=shared, 33=constant). Both representations flow downstream: the symbol-table code controls declaration validation, while the type qualifier controls LLVM pointer type construction in sub_911D10.
The __grid_constant__ qualifier (token 279, handler sub_72C390) marks kernel parameters as grid-constant -- the parameter is read-only across all thread blocks and may be placed in constant memory by the backend. This is a SM 70+ feature.
NVIDIA Type Trait Keywords (Tokens 328-330)
The only CUDA tokens registered through the standard sub_885C00 keyword registration path. Always registered -- not gated by any version, language mode, or feature flag:
| Token | Keyword | Registration |
|---|---|---|
| 328 | __nv_is_extended_device_lambda_closure_type | sub_885C00(328, ...) |
| 329 | __nv_is_extended_host_device_lambda_closure_type | sub_885C00(329, ...) |
| 330 | __nv_is_extended_device_lambda_with_preserved_return_type | sub_885C00(330, ...) |
These type traits are used by CUDA's extended lambda machinery to query whether a lambda closure type carries device or host-device execution space annotations. They participate in SFINAE and if constexpr contexts for compile-time dispatch between host and device lambda implementations.
The lambda mangling extensions in sub_80FE00 use the execution space information from these traits to choose between three proprietary Itanium ABI mangling prefixes: Unvdl (device lambda), Unvdtl (device template lambda), and Unvhdl (host-device lambda). The selection is based on flag byte +92 of the closure descriptor, where bit 5 (0x20) marks an extended CUDA lambda, bit 4 (0x10) marks host-device, and bit 2 (0x04) marks a template lambda.
Extended Numeric Type Tokens (236, 339-354)
Blackwell tensor core operations require exotic floating-point formats. These are resolved via sub_6911B0() to a type node, then set v325=20, v327|=4 in the declaration specifier state machine:
| Token | Type | Format | Width |
|---|---|---|---|
| 236 | __nv_fp8_e4m3 | FP8 | 8b |
| 339 | __nv_fp8_e5m2 | FP8 | 8b |
| 340-341 | __nv_fp8x2_e{4m3,5m2} | FP8 vector | 16b |
| 342-343 | __nv_fp8x4_e{4m3,5m2} | FP8 vector | 32b |
| 344-345 | __nv_fp6_e{2m3,3m2} | FP6 | 6b |
| 346-347 | __nv_fp6x2_e{2m3,3m2} | FP6 vector | 12b |
| 348-349 | __nv_mxfp8_e{4m3,5m2} | MX-format FP8 | 8b |
| 350-351 | __nv_mxfp6_e{2m3,3m2} | MX-format FP6 | 6b |
| 352 | __nv_mxfp4_e2m1 | MX-format FP4 | 4b |
| 353 | __nv_satfinite | Saturation modifier | -- |
| 354 | __nv_e8m0 | Exponent-only E8M0 | 8b |
These types are represented as fundamental types with distinct sub-kind values at type node +56 in the EDG type system. The type comparison engine (sub_7386E0, case 1) compares sub-kind, extra flags at +58 (bits 0x3A), and the base type chain at +72 to ensure each format is treated as a distinct type.
Attribute Processing Pipeline
The complete pipeline from CUDA source keyword to LLVM IR metadata:
CUDA source: __global__ void kernel() __launch_bounds__(256, 2)
|
v
Phase 1: Attribute parser → char code 'X' (0x58) at decl context +269
|
v
Phase 2: Declaration specifier state machine (sub_672A20)
→ scope entry +198 bit 5 set (kernel)
|
v
Phase 3: Post-parse fixup (sub_5D0FF0)
→ __launch_bounds__(256, 2) extracted to launch config struct
|
v
Phase 4: CUDA attribute validator (sub_826060)
→ validates __launch_bounds__ on __global__ function
→ diagnostic 0xDCE (3534) if __launch_bounds__ on non-kernel
→ diagnostic 0xE83 (3715) if values out of range
→ diagnostic 0xE87 (3719) if __launch_bounds__ + __maxnreg__ conflict
|
v
Phase 5: Attribute emission to LLVM IR (sub_12735D0)
→ emits ("kernel", 1) from bit 5 of decl+198
→ emits ("grid_constant", N) per qualifying parameter
|
v
Phase 6: Kernel metadata generation (sub_93AE30)
→ "nvvm.maxntid" = "256,1,1"
→ "nvvm.minctasm" = "2"
Memory Space Attributes
sub_6582F0 (22KB) and sub_65F400 (28KB) validate __shared__, __constant__, __managed__ on variable declarations and definitions respectively. Token cases 133-136 in the parser handle these as first-class declaration specifiers. The validation logic enforces CUDA semantics: __shared__ variables cannot have initializers (shared memory is not initialized on kernel launch), __constant__ variables must have static storage duration, and __managed__ variables require unified memory support on the target architecture.
Symbol Table Memory Space Encoding
Memory space is tracked in two locations within each symbol-table entry:
| Offset | Size | Field | Values |
|---|---|---|---|
| +136 | 1 byte | memory_space_code | 0=default, 1=__device__, 2=__constant__, 3=__shared__, 5=__managed__ |
| +156 | 1 byte | memory_space_flags | bit 0=device, bit 1=shared, bit 2=constant, bit 4=thread_local interaction |
| +157 | 1 byte | Extended flags | bit 0=managed |
The dual encoding exists because the flags are additive from parsed attributes (multiple attributes can be OR'd in) while the code is the single resolved value used by downstream passes. The code at +136 is set by sub_735FB0 (symbol entry constructor) and queried throughout the compiler.
Declaration-Side Validation -- sub_6582F0
The validation follows a ten-phase pipeline:
-
__managed__pre-resolution: whendword_4F04C5C == dword_4F04C34(host-only mode), managed variables are silently downgraded to__device__(space code 1) and the extern flag is cleared. -
Extern handling: sets bit 0 of decl context +122 and the
is_externtracking variables. -
Type normalization: checks function-type declarations against CUDA criteria via
sub_8D4C10; emits diagnostic 891 for function types with memory space. -
Specifier processing: calls
sub_6413B0against the current compilation target. -
Prior-declaration conflict detection: looks up existing symbol, compares memory space codes. Mismatch with
dword_4F077C4 == 2(separate compilation) triggers diagnostic 172 (warning 4). -
New symbol creation:
sub_735FB0(type_ptr, space_code, target_id, is_new_decl). -
__managed__namespace binding: validates namespace name viasub_703C10; checks class/struct type compatibility (diagnostic 1560 on failure). -
Storage class adjustments: processes constant-space read-only flags.
-
Device-scope enforcement: when scope +198 bit 4 is set (inside
__device__/__global__function), local variables cannot carry device memory qualifiers. Diagnostic 3484:"an automatic variable may not be declared as __device__". The memory space name is determined by the priority cascade:__constant__>__managed__>__shared__>__device__. -
Final fixup: type validation (
sub_8D9350), attribute propagation (sub_8756F0),"main"function warnings (diagnostic 2948), thread-safety analysis (sub_826000).
Memory Space Mutual Exclusivity
The code consistently enforces these combinations:
| Combination | Diagnostic | Severity |
|---|---|---|
__shared__ + __constant__ | 3481 | error |
__constant__ + __managed__ | 3568 | error |
__constant__ + __shared__ + __managed__ | 3568 | error |
thread_local + __device__ | 892 | error |
thread_local + any device space | 3578 | error |
auto variable + __device__/__constant__/__managed__ | 3484 | error |
__shared__ + initializer | 3510 | error |
__constant__ in device-function scope | 3512 | error |
register + device memory | 3485/3688 | error |
volatile + __constant__ | 1378 | error |
| redeclaration with different memory space | 3499 | warning 5 |
The diagnostic name-string priority cascade (__constant__ > __managed__ > __shared__ > __device__) appears identically in six locations: sub_6582F0 lines 734-739, sub_65F400 lines 541-549 and 927-935, sub_5C6B80 lines 22-34, sub_667550 lines 87-98, and sub_5D9330 (the symbol printer).
Address Space Flow: EDG to LLVM to PTX
| CUDA Source | EDG Token | Symbol +136 | Type Qualifier +18 | LLVM AS | PTX Directive |
|---|---|---|---|---|---|
__device__ int x; | 134 | 1 | 1 | 1 | .global |
__shared__ int x; | 133 | 3 | 32 | 3 | .shared |
__constant__ int x; | 135 | 2 | 33 | 4 | .const |
__managed__ int x; | 136 | 5 | 1 | 1 | .global + runtime registration |
| (local in kernel) | -- | 0 | -- | 0/5 | .local/.param |
The EDG type node qualifier word (offset +18, masked to 0x7FFF) carries address space through the type system. During EDG-to-LLVM type translation, sub_911D10 reads this qualifier from pointer/reference types (kind 75/76) and maps to LLVM address space numbers via sub_5FFE90. __managed__ variables are compiled as __device__ (LLVM address space 1) with additional runtime registration calls generated by sub_806F60 for unified memory management.
Kernel Launch Lowering — sub_7F2B50 (16KB)
Transforms CUDA's <<<gridDim, blockDim, sharedMem, stream>>> kernel launch syntax into CUDA runtime API calls. The lowered sequence allocates a parameter buffer via cudaGetParameterBufferV2, copies kernel arguments into it, and launches with cudaLaunchDeviceV2. For the simpler launch path, it generates __cudaPushCallConfiguration followed by individual __cudaSetupArg/__cudaSetupArgSimple calls. This lowering happens entirely within EDG — by the time the code reaches the LLVM backend, kernel launches are ordinary function calls.
Registration Stub Generator — sub_806F60
Generates __cudaRegisterAll function with calls to:
__cudaRegisterEntry, __cudaRegisterVariable, __cudaRegisterGlobalTexture, __cudaRegisterGlobalSurface, __cudaRegisterManagedVariable, __cudaRegisterBinary, ____cudaRegisterLinkedBinary.
Host-side stubs generated by sub_808590: "__device_stub_%s", "__cudaLaunch", "__cudaSetupArg", "__cudaSetupArgSimple".
Atomic Builtin Generator — sub_6BBC40 (34KB)
Constructs __nv_atomic_fetch_{add,sub,and,xor,or,max,min} names with type suffixes (_s, _f, _u) and width (_%u).
SM Architecture Gates
Two functions configure ~160 optimization/feature flags based on SM version:
| Function | Role | Thresholds |
|---|---|---|
sub_60D650 | Optimization level → 109 unk_4D04* flags | Single integer parameter (O-level) |
sub_60E7C0 | SM arch → 60 unk_4D04* feature flags | SM 75 (30399), SM 80 (40000), SM 90 (89999), SM 100 (109999), SM 120 (119999) |
Each flag is gated by a byte_4CF8* user-override check, preventing auto-configuration when the user explicitly sets a flag via CLI.
TileIR Backend
sub_5E3AD0 optionally loads libTileIRCompiler_shared.so via dlopen and looks up symbol "cudacc_back_end". A 17-entry function pointer table is passed. Gated by dword_4D045A0.
Diagnostic System
EDG's diagnostic system supports three output formats: human-readable terminal output (with ANSI color and word-wrapping), SARIF JSON for IDE integration, and a machine-readable log format for automated tooling. All three share the same diagnostic numbering scheme and severity classification. The terminal output handler alone is 37KB — it implements its own word-wrapping algorithm with configurable terminal width, recursive child diagnostic emission (for "note: see declaration of X" chains), and color coding by severity level.
Terminal Output — sub_681D20 (37KB)
Formats error/warning/remark messages with:
- Severity labels: remark (2), warning (4), caution (5), severe-warning (6), error (7–8), catastrophe (9–10), internal-error (11)
- Source location:
file:line:col - ANSI color escapes (gated by
dword_4F073CC) - Word-wrapping at
dword_4D039D0(terminal width) - Recursive child diagnostic emission
SARIF JSON Output — sub_6837D0 (20KB)
Structured diagnostics for IDE integration, enabled by --diagnostics_format=sarif (CLI case 0x125, sets unk_4D04198 = 1). The output is a comma-separated stream of SARIF result objects -- NOT a complete SARIF envelope with $schema, runs[], etc. The caller or a post-processor is expected to wrap the stream in the standard SARIF container.
Each diagnostic emits one JSON object:
{
"ruleId": "EC<number>",
"level": "error",
"message": {"text": "<json-escaped message>"},
"locations": [
{
"physicalLocation": {
"artifactLocation": {"uri": "file://<path>"},
"region": {
"startLine": 42,
"startColumn": 17
}
}
}
],
"relatedLocations": [
{
"message": {"text": "see declaration of X"},
"physicalLocation": { ... }
}
]
}
Rule ID format: "EC" + decimal error number from the diagnostic record at offset +176. For example, EDG error 1234 becomes "EC1234".
Severity mapping (byte at diagnostic node +180):
| Severity | EDG Meaning | SARIF level |
|---|---|---|
| 4 | remark | "remark" |
| 5 | warning | "warning" |
| 7, 8 | error | "error" |
| 9 | catastrophe | "catastrophe" |
| 11 | internal error | "internal_error" |
Note that SARIF spec only defines "warning", "error", and "note" as standard levels. The "remark", "catastrophe", and "internal_error" values are EDG extensions -- consuming tools should treat unknown levels as "error".
Message text escaping: sub_683690 renders the diagnostic text into qword_4D039E8, then copies character-by-character into the output buffer, escaping " as \" and \ as \\. No other JSON escaping (e.g., control characters, Unicode) is applied.
Location resolution: sub_67C120 calls sub_729E00 to decompose the packed source location into (file-id, line, column), then sub_722DF0 to resolve the file-id to a filesystem path. The startColumn field is omitted when column is zero.
Related locations: the linked list at diagnostic node +72 chains "note" sub-diagnostics. Each is emitted as a relatedLocations array entry with its own message and physical location.
Filtering before emission: diagnostics pass through severity threshold check (byte_4F07481[0]), duplicate detection (byte_4CFFE80[4*errnum + 2] bit flags), pragma-based suppression (sub_67D520), and error limit check (unk_4F074B0 + unk_4F074B8 >= unk_4F07478). All filtering happens before the SARIF/text format branch.
Machine-Readable Log
Writes to qword_4D04908 in format: <severity-char> "<filename>" <line> <col> <message>\n. Severity chars from "RwweeccccCli": R=remark, w=warning, e=error, c=catastrophe.
Name Mangling (Itanium ABI)
EDG includes a complete implementation of the Itanium C++ ABI name mangling specification. NVIDIA extends the standard mangling with three proprietary prefixes (Unvdl, Unvdtl, Unvhdl) for device lambdas, device template lambdas, and host-device lambdas respectively. These extensions are necessary because CUDA's execution model requires distinguishing between host and device versions of the same lambda — they must have different mangled names to avoid linker collisions when both host and device code are linked into the same binary.
Address range 0x810000–0x8EFFFF:
| Function | Size | Role |
|---|---|---|
sub_8E74B0 | 29KB | Primary mangling entry |
sub_8E9FF0 | 26KB | Type mangling |
sub_816460 | 24KB | Type component mangling |
sub_813790 | 13KB | Expression mangling |
sub_80E340 | 23KB | Builtin type mangling (incl. DF16_, DF16b, Cu6__bf16, u6__mfp8) |
sub_80FE00 | 8KB | NVIDIA extension mangling (Unvdl, Unvdtl, Unvhdl) |
NVIDIA lambda mangling extensions (sub_80FE00): standard Itanium ABI uses Ul<params>E<index>_ for unnamed lambda types and Ut<index>_ for unnamed non-lambda types. NVIDIA adds three proprietary prefixes chosen based on flag byte +92 of the lambda's closure descriptor:
| Prefix | Meaning | Condition |
|---|---|---|
Unvdl | __device__ lambda | flag_byte_92 & 0x20 set, not host-device, not template |
Unvdtl | __device__ template lambda | flag_byte_92 & 0x20 set, flag_byte_92 & 4 set |
Unvhdl | __host__ __device__ lambda | flag_byte_92 & 0x20 set, flag_byte_92 & 0x10 set |
The Unvhdl prefix carries three single-digit flags separated by underscores after the prefix: Unvhdl<index>_<has_explicit_return>_<is_host_device>_<has_template_params>_. Each flag is '0' or '1'. This is richer than the standard Ul which only encodes parameter types.
NVIDIA vendor type manglings (sub_80E340): the type mangler handles CUDA-specific types as Itanium vendor types (prefix u + length + name):
| Type | Mangling | Notes |
|---|---|---|
__bf16 (bfloat16) | u6__bf16 or DF16b | ABI-gated: qword_4F077B4 lo32 selects vendor vs C++23 encoding |
__mfp8 (FP8) | u6__mfp8 | NVIDIA micro-float 8-bit for transformer inference |
__metainfo | U10__metainfo | Kernel parameter metadata type attribute |
float80 | u7float80 | x87 extended precision (vendor type) |
The __bf16 mangling has a three-way gate reflecting the ongoing ABI transition: qword_4F077B4 lo32 != 0 selects "u6__bf16" (vendor type); hi32 == 0 selects "DF16b" (C++23 standardized P1467); otherwise qword_4F06A78 determines which encoding. The ABI version variable unk_4D04250 controls this and other encoding decisions, with known thresholds at 0x76BF (GCC 3.3 compat) and 0xC350 (GCC 12 compat).
Standard float types follow Itanium: _Float16 = "DF16_", __fp16 = "Dh", float = "f", double = "d", __float128 = "g", with the complex variants adding a 'C' prefix.
Key Global Variables
| Variable | Size | Role |
|---|---|---|
dword_4F077C4 | 4 | Language mode: 0=neither, 1=C, 2=C++ |
unk_4F07778 | 4 | C/C++ standard year (199711, 201103, 201402, 201703, 202002, 202310) |
qword_4F077B4 | 8 | Dialect extension flags (lo=CUDA extensions, hi=GNU extensions) |
dword_4F077BC | 4 | NVCC mode flag |
dword_4F077C0 | 4 | GCC compatibility mode |
qword_4F077A8 | 8 | SM architecture version (controls feature gates throughout) |
word_4F06418 | 2 | Current parser token |
qword_4F04C68 | 8 | Scope table base pointer (776-byte entries) |
dword_4F04C64 | 4 | Current scope index |
qword_4CF7CE0 | 8 | AST printer callback vtable |
qword_4D03FF0 | 8 | Current translation unit pointer |
qword_4D04908 | 8 | Machine-readable diagnostic log FILE* |
qword_4F08028–qword_4F08040 | 48 | IL tree walker callback table |
dword_4D045A0 | 4 | TileIR mode flag |