Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

EDG 6.6 Frontend

NVIDIA licenses the Edison Design Group (EDG) C/C++ front end — a commercial compiler frontend used by several major compilers including Intel ICC. In cicc v13.0, EDG version 6.6 occupies 3.2 MB of code (0x5D00000x8F0000), making it the largest single subsystem in the binary. Unlike most modern compilers that parse directly to an SSA-based IR, EDG operates as a source-to-source translator: it parses CUDA C++ source code and emits transformed C code containing CUDA runtime API calls. This output is then fed into a second compilation phase that produces NVVM IR (LLVM bitcode). This two-stage design means the CUDA language extensions (kernel launch syntax, memory space qualifiers, device/host function annotations) are resolved entirely within EDG, and the LLVM-based backend never sees raw CUDA syntax.

The EDG frontend is configured at compile time through 737 #define macros, including GCC 8.1 emulation mode and Clang 9.1 emulation mode. Exceptions are disabled by default — CUDA device code cannot use C++ exceptions — while RTTI remains enabled for dynamic_cast support in host-side code that interacts with device objects.

EDG version6.6 (string: "Based on Edison Design Group C/C++ Front End, version 6.6")
Entry symbollgenfe_main (string at sub_617BD0)
GCC emulation8.1 (DEFAULT_GNU_VERSION = 80100)
Clang emulation9.1 (DEFAULT_CLANG_VERSION = 90100)
C++ standardsC++98, C++11, C++14, C++17, C++20, C++23 (unk_4F07778 = year code)
C standardsC99, C11, C18, C23
ExceptionsDisabled by default (DEFAULT_EXCEPTIONS_ENABLED = 0)
RTTIEnabled by default (DEFAULT_RTTI_ENABLED = 1)
Target modelLP64 (TARG_SIZEOF_POINTER = 8, TARG_SIZEOF_LONG = 8)
BackendC-codegen (BACK_END_IS_C_GEN_BE = 1) — emits C source, not LLVM IR directly
Functions~5,000 in range, 300+ above 5KB

Architecture

The compilation flow through EDG has four major phases: CLI parsing (282-case switch), translation unit initialization (keyword tables, parser bootstrapping), parsing and semantic analysis (the bulk of the 3.2 MB), and backend code emission (generating three output files: .int.c for internal declarations, .device.c for device code, and .stub.c for host-side launch stubs). Error recovery uses setjmp/longjmp — any of the 478 call sites that invoke the abort handler (sub_721090) will unwind back to the orchestrator rather than crashing the process.

sub_5D2A80 (orchestrator, setjmp error recovery)
  │
  ├─ sub_617BD0 (lgenfe_main: 282-case CLI switch, 737 config #defines)
  │    ├─ sub_610260 (register 300+ CLI options)
  │    └─ sub_6140E0 (option fetcher loop)
  │
  ├─ sub_8D0BC0 (translation unit init)
  │    ├─ sub_706250 (keyword table: ~350 keywords via sub_885C00)
  │    ├─ sub_858C60 (parser entry)
  │    └─ sub_709290 (finalize)
  │
  ├─ sub_709330 ("Generating Needed Template Instantiations", "Wrapping up translation unit")
  │
  └─ sub_5E3AD0 (backend entry: "Generating NVVM IR")
       ├─ Opens .int.c / .device.c / .stub.c output files
       ├─ sub_5DB980 (top-level declaration dispatcher)
       │    ├─ sub_5E13C0 (function declaration printer, 44KB)
       │    ├─ sub_5DBFC0 (expression printer, 41KB, 61 self-references)
       │    ├─ sub_5DFD00 (statement printer, 26KB)
       │    ├─ sub_5D80F0 (initializer printer)
       │    ├─ sub_5DAD30 (struct/union/enum printer)
       │    └─ sub_5DF1B0 (inline asm printer)
       │
       └─ dlopen("libTileIRCompiler_shared.so") [optional, gated by dword_4D045A0]
            └─ dlsym("cudacc_back_end") — 17-entry function pointer table

Timer callbacks record "Front end time", "Back end time", and "Total compilation time" via sub_7211D0.

Orchestrator — sub_5D2A80

The master entry point for the entire frontend. Uses setjmp for non-local error recovery — when any of the ~5,000 EDG functions detects an unrecoverable error (type system inconsistency, parser corruption, internal assertion failure), it calls sub_721090, which longjmps back to this function. The 478 call sites that reference the abort handler demonstrate just how pervasive error checking is throughout the frontend — roughly 10% of all functions in the EDG range can trigger a fatal abort.

GlobalPurpose
unk_4D045D8Phase callback (prints "Generating NVVM IR" etc.)
unk_4D04744Timer enable flag
unk_4F074B0Error flag (frontend errors occurred)
unk_4F074A8Warning count
qword_4F076F0Input source filename

Frontend Entry — sub_617BD0 (lgenfe_main)

At 123KB and 3,113 decompiled lines, lgenfe_main is the largest function in the EDG range. The name "lgenfe" stands for "LLVM-generating front end" — a hint that this function was originally designed for a different backend before NVIDIA adopted the EDG+LLVM architecture. The function is divided into three distinct regions: a massive 282-case switch for CLI option parsing (2,000 lines), a post-parse validation phase that checks for conflicting options and enforces CUDA-specific constraints, and a file I/O setup phase that installs 11 signal handlers and returns a pointer to the configured compilation context.

Signature: (int argc, __int64 argv).

Structure

RegionLinesContent
A164–2157282-case switch on option ID (v6)
B2157–2700Post-parse validation and cross-option consistency
C2700–3113File I/O setup, 11 signal handlers, return &qword_4D046F0

Architecture Parsing (case 0x52)

compute_75, compute_80, compute_86, compute_87, compute_88, compute_89
compute_90, compute_90a
compute_100, compute_100a, compute_100f
compute_103, compute_103a, compute_103f
compute_110, compute_110a, compute_110f
compute_120, compute_120a, compute_120f
compute_121, compute_121a, compute_121f

Storage: unk_4D045E8 = SM number, unk_4D045E4 = a suffix flag, unk_4D045E0 = f suffix flag.

Configuration Emission (case 0xE1)

Emits 737 #define macros to configure the EDG compiler. Key defines:

DefineValueMeaning
VERSION_NUMBER"6.6"EDG frontend version
EDG_MAIN"lgenfe_main"Entry point symbol
DEFAULT_GNU_VERSION80100Emulate GCC 8.1
DEFAULT_CLANG_VERSION90100Emulate Clang 9.1
DEFAULT_EXCEPTIONS_ENABLED0CUDA: no exceptions
TARG_SIZEOF_POINTER864-bit pointers
TARG_SIZEOF_LONG_DOUBLE16128-bit long double
TARG_LITTLE_ENDIAN1x86-64 host
USE_SOFTFLOAT1Software FP for constexpr
ABI_COMPATIBILITY_VERSION9999Maximum ABI compat
MODULE_MAX_LINE_NUMBER250000Max lines per module

CLI Option Registration — sub_610260

Registers ~300 options via sub_6101D0(id, name, flag, ...). CUDA-specific options include:

IDNamePurpose
51no-device-int128Disable __int128 on device
59emit-llvm-bcEmit LLVM bitcode directly
60device-debugDevice-side debug info
68force-volatileForce volatile on memory space (global/shared/constant/local/generic/all)
73kernel-params-are-restrictAll kernel pointer params are __restrict__
82nv_archcompute_XX architecture selection
93device-cSeparate compilation mode
105tile-onlyTileIR-only compilation
124extended-lambdaExtended lambda support (--expt-extended-lambda)
132emit-lifetime-intrinsicsLLVM lifetime intrinsics

Translation Unit Processing

Translation unit processing is where EDG transitions from CLI configuration to actual compilation. The init function sets up the lexer, allocates the translation unit data structure (416 bytes), populates the keyword table with ~350 entries, and enters the recursive-descent parser. EDG uses a keyword-registration model where each keyword is individually registered with its token ID — this allows NVIDIA to add CUDA-specific keywords (like __shared__ or __nv_fp8_e4m3) without modifying the core parser grammar.

Init — sub_8D0BC0

  1. Reset token state (dword_4F063F8 = 0)
  2. Call sub_727950 (lexer init)
  3. Allocate 416-byte TU object via sub_823970
  4. Call sub_706250 — keyword table init (~350 keywords)
  5. Call parser entry (sub_858C60 or PCH path sub_852E40)
  6. Call sub_709290 — finalize

Keyword Registration — sub_706250

30KB. Calls sub_885C00(token_id, "keyword_string") ~350 times. Initializes 30+ subsystems before keyword registration. Categories:

  • C89 keywords: auto, break, case, const, continue, default, do, double, else, ...
  • C99 additions: _Bool, _Complex, _Generic, _Atomic, restrict, inline
  • C11/C23: _Static_assert, _Thread_local, _Alignas, _Alignof, constexpr, typeof
  • C++ keywords: class, template, virtual, namespace, using, try, catch, throw, ...
  • C++20: co_yield, co_return, co_await, requires, concept
  • Type traits (~80): __is_pod, __is_abstract, __is_trivially_copyable, __has_virtual_destructor, ...
  • NVIDIA extensions: __nv_is_extended_device_lambda_closure_type, __nv_is_extended_host_device_lambda_closure_type, __nv_is_extended_device_lambda_with_preserved_return_type
  • EDG internal: __edg_type__, __edg_vector_type__, __edg_neon_vector_type__, __edg_scalable_vector_type__

Version-gated by dword_4F077C4 (language mode), unk_4F07778 (standard year), qword_4F077B4 (feature flags).

Finalization — sub_709330

Strings: "Generating Needed Template Instantiations", "Wrapping up translation unit". Calls sub_8B18F0 for C++ template instantiation when dword_4F077C4 == 2.

Preprocessor

EDG includes its own preprocessor rather than relying on an external cpp. This is standard for EDG-based compilers — the preprocessor is tightly integrated with the parser to handle complex interactions between macros and C++ syntax (e.g., __VA_OPT__ in C++20, which requires the preprocessor to understand syntactic context). The preprocessor occupies ~250KB across four major functions and maintains a 99-entry predefined macro table plus a 25-entry feature-test macro table.

Token Scanner — sub_7B8B50 (59KB)

The main preprocessor tokenizer. Handles all C/C++ token kinds: identifiers, numbers (delegates to sub_7B40D0), string literals, operators, punctuators, UCN sequences. Detects C++20 module/import keywords via string comparison.

Numeric Literal Parser — sub_7B40D0 (42KB)

Second-largest preprocessor function. Handles: integer suffixes (u/U/l/L/ll/LL), float suffixes (f/F/l/L), hex floats (0x...p...), binary literals (0b...), C++14 digit separators (').

Macro Expander — sub_81B8F0 (77KB)

The central macro expansion engine. Features:

  • __VA_ARGS__ (C99) and __VA_OPT__ (C++20) support
  • 99-entry predefined macro table at off_4B7C440 (stride 40 bytes)
  • 25-entry feature-test macro table at off_4B7C360
  • Recursion limit: 300 expansions (error 0xE3)
  • Intrinsic type-trait macros: __type_pack_element, __is_signed, __make_integer_seq, __is_pointer

Character Scanner — sub_7BC390 (29KB)

Giant switch on character value. Handles trigraph sequences, line splices, multi-byte characters, comment detection (// and /*).

Parser & Declaration Processing

The parser subsystem is the largest part of the EDG frontend — over 1 MB of code spread across dozens of functions. EDG uses a recursive-descent parser augmented with a declaration-specifier state machine. The state machine design is necessary because C/C++ declaration specifiers can appear in any order (const unsigned long long int and int long unsigned long const are identical), requiring the parser to accumulate specifiers into bitmasks and resolve the final type only after all specifiers have been consumed.

NVIDIA's major contribution to the parser is the CUDA type extension infrastructure: 19 new FP8/FP6/FP4/MX-format type tokens (339–354) for Blackwell's tensor core operations, 9 address-space qualifier tokens (272–280) for GPU memory spaces, and 4 memory-space declaration specifiers (133–136) that piggyback on the existing width-modifier field. These extensions are grafted onto EDG's type system in a way that minimizes changes to the core parser logic — CUDA qualifiers reuse existing state variables with previously-unused value ranges.

Declaration Specifier State Machine — sub_672A20 (132KB, 4,371 lines)

The central parser function and one of the most complex functions in the binary. A while(2)/switch dispatcher on token codes from word_4F06418[0] with ~80 case labels. It accumulates type specifiers, qualifiers, storage-class specifiers, and CUDA address-space qualifiers from the token stream into a set of bitmask variables, then constructs the final type node from the accumulated state.

State Variables

VariableStackBitsRole
v325[rsp+B8h]uintType specifier kind (see table below)
v327[rsp+C0h]uint64Specifier category bitmask
v307[rsp+90h]intCV-qualifier accumulation bits
v302[rsp+78h]uintLong count (0=none, 1=long, 2=long long)
v305[rsp+84h]uintSignedness/width — reused for CUDA (4–7)
v299[rsp+68h]int_Complex (1) / _Imaginary (2) tracking

Type Specifier Kind (v325)

ValueMeaningToken Case
0None yet
2char80
3wchar_t165
4bool / _Bool128 / 120
5float126
6double127
7void180
8signed / __int893 / 239
9__float128331
12int (explicit)89
14__float16332
15short / half85
16_Float16333
17__bf16334
19bfloat16335
20Resolved typedef/CUDA type namescope lookup
21struct/union/enum tag101/104/151
23decltype()183
24auto (deduced)186
25Resolved identifier typeC++ lookup
26Error recovery typediagnostic

Specifier Bitmask (v327)

BitMaskMeaning
00x1Storage class (extern/static/etc.)
10x2CV-qualifier seen
20x4Type specifier seen
30x8friend specifier
40x10__declspec / attribute seen
50x20explicit specifier
60x40inline specifier
70x80_Thread_local / thread_local
100x400typeof / decltype
120x1000__declspec() already processed
130x2000explicit(bool) already processed
140x4000_Noreturn / [[noreturn]]
150x8000_Atomic

CV-Qualifier Bits (v307)

BitMaskQualifier
00x01const (case 81)
10x02volatile (case 107)
20x04restrict / __restrict (cases 118/119)
30x08__unaligned (case 263 with parens)
40x10__ptr32 (case 264)
50x20__ptr64 (case 265)
60x40__sptr / __uptr (case 266)

Duplicate CV qualifiers trigger diagnostic 83.

CUDA Memory Space Tokens (133–136)

These piggyback on the signedness/width field v305 with values 4–7:

TokenKeywordv305v325Formula
133__shared__42Special case
134__device__58token - 129
135__constant__68token - 129
136__managed__78token - 129

Clean separation: values 0–3 = standard C width modifiers, 4–7 = CUDA address-space qualifiers. The type-construction switch handles both ranges.

CUDA Extended Type Tokens (339–354)

TokenTypeFormat
236__nv_fp8_e4m3FP8
339__nv_fp8_e5m2FP8
340–343__nv_fp8x{2,4}_e{4m3,5m2}FP8 vector
344–345__nv_fp6_e{2m3,3m2}FP6
346–347__nv_fp6x2_e{2m3,3m2}FP6 vector
348–349__nv_mxfp8_e{4m3,5m2}MX-format FP8
350–351__nv_mxfp6_e{2m3,3m2}MX-format FP6
352__nv_mxfp4_e2m1MX-format FP4
353__nv_satfiniteSaturation type
354__nv_e8m0Exponent-only E8M0

All resolve via sub_6911B0() → type node, then set v325=20, v327|=4.

CUDA Address Space Qualifier Tokens (272–280)

TokenKeywordSpace IDHandler
272__attribute__((address_space(N)))parsed intsub_6210B0
273__global__0sub_667B60(0,...)
274__shared__ (addr space)2sub_667B60(2,...)
275__constant__ (addr space)3sub_667B60(3,...)
276__generic__sub_72B620(type, cv)
277__nv_tex_surf_handle_tsub_72BA30(unk_4F06A51)
278__nv_buffer_handle_tsub_72BA30(unk_4F06A60)
279__nv_grid_constantsub_72C390()
280__nv_is_extended_device_lambdasub_72C270()

Type Construction Functions

FunctionPurposeTrigger
sub_72BA30(code)Fundamental signed integer typeint, short, long, long long
sub_72BC30(code)CUDA extended-width integerCUDA mode + v305 > 3
sub_72BCF0(code)Unsigned fundamental typeunsigned combos
sub_72BDB0(code)CUDA unsigned extended typeCUDA mode + unsigned
sub_72BF70()float typev325 == 5
sub_72C030()double typev325 == 6
sub_72C0F0()long double typelong + double
sub_72C1B0()__float128 typev325 == 9
sub_72C610(kind)Float-by-kind (mapped from v325)FP8/FP6/BF16/etc.
sub_72C6F0(kind)_Complex float variantv299 == 1
sub_72C7D0(kind)_Imaginary float variantv299 == 2
sub_72C930(code)Error/placeholder typediagnostic issued
sub_72CBA0()Dependent typev325 == 25
sub_72CBE0(...)__int128 typev325 == 1
sub_73C570(type, cv, flags)Apply CV-qualifiers to typepost-construction

Accumulation Flow

  1. Initialize: all state variables to 0
  2. Loop: read word_4F06418[0], dispatch through switch — set bitmask bits, update kind/cv/width
  3. Exit: unrecognized token → LABEL_8 (default exit)
  4. Type construction: switch on v325 × v302 × v305 → call appropriate sub_72B*/sub_72C*
  5. CV application: sub_73C570 wraps the type with const/volatile/restrict
  6. Return: type stored at ds->field_272, CV bits at ds->field_120

Declaration Specifier Parser — sub_7C0F00 (184KB, 3,953 lines)

Uses goto-driven dispatch (393 LABEL_ references) — NOT a switch/case. This is a massive state machine for declaration specifier resolution. Self-recursive at line 2407 with flags=20 for nested declarator parsing.

Top-Level Declaration Parser — sub_662DE0 (61KB)

Declarator parsing — handles pointer (*), reference (&/&&), array ([]), and function (()) declarators. Uses SSE __m128i for bulk struct copying of 64-byte EDG type nodes.

Overload Resolution — sub_6523A0 (64KB)

The master overload resolution function. Given a declaration being introduced and a set of existing candidates from name lookup, it decides whether the declaration is a new overload, a redeclaration, or an error. At 2,448 decompiled lines with 39 diagnostic call sites, it is one of the heaviest diagnostic emitters in the frontend.

Candidate collection uses a 72-byte ranking context (v320 on stack) and dispatches to one of three collectors: sub_644100 for non-member/ADL candidates, sub_648CF0 for member + using-declaration candidates (chosen when C++ mode, prior declaration exists, and the class has base classes or is a template), or sub_6418E0 for C-linkage functions. The best candidate is selected by sub_641B60.

__builtin_ prefix forwarding (lines 2060-2162): after resolution, if the resolved symbol is a bodyless non-member function, the resolver checks if a compiler builtin equivalent exists. It hardcodes three function names by length: "abs" (3), "ceil" (4), "strlen" (6). For each, it constructs "__builtin_" + name in a scratch buffer at qword_4F06C50, looks it up via sub_878540, then compares parameter types via sub_8DED30(type1, type2, 0x100004) (exact match + qualification conversion). On match, the builtin's scope entry is linked into the user function's auxiliary data at offset +256 field 8.

OpenMP variant dispatch (lines 727-752): when unk_4D03A10 is set, the resolver renames the declaration to "<name>$$OMP_VARIANT%06d" using a monotonic counter unk_4D03A0C. This creates unique internal names for each device/host specialization.

Constexpr/consteval propagation (lines 2288-2301): gated by unk_4F07778 (C++ standard year). For C++11 and later, byte +204 of the scope entry is bit-packed with three globals: bits 5-6 = unk_4F06C58 (constexpr disposition), bits 1-2 = unk_4F06C5A (consteval disposition), bits 3-4 = unk_4F06C59 (immediate-function flag). Diagnostic 2383 fires on constexpr mismatch between declaration and definition.

Device/host overload sets: CUDA allows the same function name to have both __host__ and __device__ overloads. EDG does not treat execution space as part of the function signature for overload resolution purposes -- the standard C++ overload rules apply first, and execution space filtering happens later during code generation. The $$OMP_VARIANT renaming mechanism is used for OpenMP dispatch variants that need distinct host/device specializations, but regular CUDA __host__/__device__ overloads rely on the backend's execution space filtering rather than frontend overload resolution. This means that if two functions have identical C++ signatures but differ only in __host__ vs __device__, they are treated as redeclarations (not overloads) at the EDG level, and the execution space annotation at scope entry offset +198 determines which version survives into device or host code.

CUDA Memory Space Processing — sub_6582F0 (22KB)

Validates __shared__, __constant__, __managed__ attributes on declarations. Emits diagnostic for automatic variables in inappropriate memory spaces.

Type System

Type Node Layout (192 bytes = 12 x __m128i)

OffsetSizeField
+88Next pointer (linked lists)
+408Name pointer
+481Declaration kind byte
+801Entity kind byte
+1401TYPE KIND DISCRIMINATOR (the central dispatch key)
+1608Inner/child type pointer (typedef chains, pointer bases)
+1688Member list / parameter chain
+1731Specifier/node kind byte
+1762Entity kind (uint16, dispatch key for constexpr evaluator)
+1851CV-qualifier bits (bit 0=const, 1=volatile, 2=restrict)
+2001Attribute flags

Type kind discriminator values at offset +140:

ValueTypeNotes
0void
1error typeSentinel
2–4fundamental (char, int, ...)
5pointerFollows +160 chain
6pointer-to-member
7function typeComplex: 17 sub-kinds for calling conventions
8arrayElement count at +128, element type at +160
9–11class / struct / unionMembers at +168
12typedef / cv-qualifiedFollow +160 for underlying type (critical: skip in type-walk loops)
13enum
14void (incomplete)
15vectorElement count at +128
19decltype
21placeholder / auto

Scope Table Entry (776 bytes)

Indexed by dword_4F04C64 into base qword_4F04C68:

OffsetField
+0Scope identifier
+4Scope kind (5=namespace, 6=class, 7=function, 8=block, 9=enum, 12=template)
+6–10Flag bytes
+24Name list head
+32Name list tail
+208Class type pointer
+232Deferred list
+328Template info
+552Parent scope index
+624Declaration pointer
+680Linkage specification

Type Comparison — sub_7386E0 (23KB)

The core type equivalence engine. Takes two type node pointers packed in an __int128 and a flags word, returns boolean equality. The flags word controls comparison mode: bits 0-1 select cv-qualifier strictness (0=strict, 1=relaxed, 2=overload), bit 2 enables template matching (class-equivalence shortcuts), and bit 5 enables anonymous-class structural comparison.

Entry sequence: both types are first canonicalized through sub_72EC50, which peels through chains of non-template typedef aliases. The canonicalizer checks three fields on the elaborated type node: +173 == 12 (typedef kind), +176 == 1 (single-member), and +170 bit 4 == 0 (no template specialization). If all hold, it unwraps one level via sub_72E9A0 and loops. This means typedef int MyInt; typedef MyInt YourInt; canonicalizes YourInt directly to int.

After canonicalization, a quick-reject compares three header bytes without recursing: byte +24 (type kind) must match exactly, bytes +25 XOR must be zero for bits 0x03 (const/volatile) and 0x40 (restrict), and byte +26 XOR must be zero for bit 0x04. Any mismatch short-circuits to return 0.

The main switch dispatches on 38 type kinds. Key cases for CUDA:

  • Case 1 (fundamental): compares sub-kind at +56, extra flags at +58 (bits 0x3A), and the base type chain at +72. For integer sub-kind (sub_kind == 'i'), follows a resolution chain to find the underlying class scope. In template matching mode (flags bit 2), uses sub_8C7520 to check whether two class instantiations share the same primary template, then sub_89AB40 to compare template argument lists. This path handles CUDA's exotic numeric types (__nv_fp8_e4m3, __nv_fp8_e5m2, etc.) which are represented as fundamental types with distinct sub-kinds.
  • Case 3 (class/struct/union): fast identity via scope pointer equality, then unique-ID shortcut via dword_4F07588. For anonymous classes with template matching, calls sub_740200 to extract canonical member lists and performs structural comparison. This is relevant for CUDA lambda closure types, which are anonymous classes.
  • Case 33 (using-declaration/alias): in overload mode (flags bit 1), performs a hash table lookup via *qword_4D03BF8 to retrieve base class triples and compare element-by-element. This ensures that two using declarations resolving to different base classes are treated as distinct for overload discrimination.

Overload mode specifics (flags & 2): the post-switch check additionally verifies that both types agree on the presence/absence of the +80 "extra declaration" pointer. Template parameters are forced unequal (never match for overload purposes without being identical). Scope pointer equivalence is verified via unique-ID for using-declaration discrimination.

CUDA type equivalence: the NVIDIA-specific float types (__nv_fp8_e4m3, __bf16, _Float16, etc.) each have distinct sub-kind values at type node +56 (see the type mangling table: sub-kind 0 = _Float16, 1 = __fp16, 9 = __bf16, 0xA = _Float16 alternate, 0xB = _Float32, 0xC = _Float64, 0xD = _Float128). The type comparison treats them as distinct fundamental types -- _Float16 and __fp16 are NOT equivalent despite both being 16-bit floats. The half type in CUDA maps to _Float16 (sub-kind 0 or 0xA depending on context), while __half in cuda_fp16.h is a wrapper struct (type kind 9, class/struct), so half and __half are never type-equivalent at the EDG level. User code relies on implicit conversions defined in the CUDA headers, not on type equivalence.

Type-to-String Emitter — sub_74A390 (29KB, 19 callers)

The backbone type printer. Walks type nodes recursively, emitting textual representation for diagnostics. Handles NVIDIA-specific types: __surface_type__, __texture_type__, __nv_bool.

IL Tree Infrastructure

EDG represents parsed code as an Intermediate Language (IL) tree — a rich AST that preserves full C++ semantic information including template instantiation state, scope chains, and type qualifiers. The IL is not LLVM IR; it is EDG's proprietary tree representation that predates the LLVM integration. All semantic analysis, template instantiation, and overload resolution operate on this tree.

The IL tree is traversed by four structurally identical walker functions that share the same 87 node-type dispatch table. The walkers are instantiated from a common template with different callback functions — a design pattern where the traversal logic is fixed but the action at each node is parameterized through function pointers stored in six global variables. This callback-driven walker system is central to EDG's architecture: template instantiation, type checking, code emission, and tree copying all use the same walker infrastructure with different callbacks.

FunctionSizeSelf-recursive CallsPurpose
sub_7506E0190KB297Primary walker
sub_760BD0109KB427Parallel walker (deeper traversal)
sub_75C0C087KB316Third-pass walker
sub_766570148KB2Copier/transformer (takes callback params)

Walker Callback System

Six global function pointers form the visitor dispatch table:

GlobalRole
qword_4F08028Node pointer remapper (called before recursion)
qword_4F08020Linked-list child remapper
qword_4F08038String field processor
qword_4F08030Pre-visit callback (return nonzero to skip)
qword_4F08040Post-visit callback
dword_4F08014Skip-shared-nodes flag
dword_4F08018Clear/detach mode (null out fields for ownership transfer)

IL Node Types (87 types, from walker case labels)

IDTypeIDType
1source_file28integral_constant
2scope (15 sub-kinds)29float_constant
3type_qualifier30expression (generic)
4simple_type41call_expression
5pointer_type42cast_expression
6function_type (17 sub-kinds)43conditional_expression
7class_type44string_literal
8enum_type48template_argument (4 sub-kinds)
9array_type59concept_expression (10 sub-kinds)
10bitfield_type65type_list (core linked list)
13statement (30+ sub-kinds)75block/compound_statement
23scope_entry (root)76access_specifier

Deep Copy — sub_766570 with sub_8C2C50

sub_8C2C50 calls sub_766570 with copy callback sub_8C38E0 and list-copy callback sub_8C3810. Node size table at qword_4B6D500[node_type] provides memcpy sizes. Critical for template instantiation.

Constexpr Evaluator

The constexpr evaluator is arguably the most technically impressive subsystem in the EDG frontend. It is a complete tree-walking interpreter that can execute arbitrary C++ code at compile time, implementing the full C++20 constexpr specification including heap allocation (constexpr new), string literals, virtual function dispatch, and complex control flow. At 317KB for the expression evaluator alone, plus 77KB for the statement executor and ~200KB in supporting functions, it constitutes nearly 20% of the entire EDG frontend.

The evaluator operates on EDG's IL tree directly — it does not compile to bytecode or any intermediate form. Instead, it recursively walks expression and statement nodes, maintaining its own memory model (a 3-tier page arena), variable bindings (an open-addressing hash table), and lifetime tracking (scope epoch counters). This design trades execution speed for implementation simplicity and guaranteed semantic fidelity with the compiler's own type system.

Signature:

bool constexpr_eval_expr(
    constexpr_ctx *ctx,     // a1: evaluation context (hash table, arena, flags)
    expr_node **expr,       // a2: expression AST node
    __m128i *result,        // a3: output value slot (16 or 32 bytes)
    char *frame_base        // a4: stack frame base pointer for lifetime tracking
);

Expression Evaluator — sub_786210 (317KB, 9,075 lines)

The largest function in the entire EDG frontend. Two-level dispatch: outer switch on expression kind *(a2+24), inner switch on operator code *(a2+56) with 124 cases.

Outer Switch — Expression Kinds

KindHexMeaningNotes
00x00Void/emptySets `ctx+132
10x01Operator expression→ 124-case inner switch on *(a2+56)
20x02Variable referenceHash table lookup, kind==1(const) or kind==3(constexpr)
30x03Function reference / enumeratorSubkind==5: has constexpr body → recurse
40x04Literal (int/float constant)Immediate return — value is in the node
5–60x05–06String / compound literalC++20 mode required (dword_4F077C4 == 2)
70x07Function callMost complex case (~1200 lines)
100x0AParenthesized expressionRecurse on a2[7]
110x0BMember access (->)Navigate member hierarchy via type-size table
170x11Lambda expressionSave/restore ctx+72, execute body via sub_7987E0
180x12Capture variableHash table lookup by a2[7]
200x14Address-ofSet flags a3+8 = 0x20 (IS_SYMBOLIC)
230x17sizeof / alignofDelegate to sub_620D80
240x18Subscript (array[index])Bounds check, compute elem_size * index
270x1BImplicit conversionNavigate chain, recurse on inner
310x1FRequires expression (C++20)Execute body via sub_79B7D0
320x20Type traitsub_693DC0xmmword_4F08280/xmmword_4F08290
330x21SFINAE / substitution failureTemplate context check, sub_6F2300

Inner Switch — Operator Codes (124 cases, selected)

CasesCategoryOperations
0–1Assignment= / initialization (ref types: 32-byte memcpy)
3–4ConversionLvalue-to-rvalue via sub_7A0070
5Type caststatic_cast — massive dispatch: int→int(sub_622780), float→float(sub_709EF0), int→float(sub_710280), ptr→ptr(sub_770010)
14–15Member access. and -> — offset via sub_8D5CF0, virtual base via sub_771030
16–17Pointer arithmeticSubtraction, ptrdiff_t via sub_7764B0
20, 29Comparison==, != via sub_7759B0
26–28Unary++, --, unary minus (sub_621DB0)
30–31Vector opsElement-wise comparison loop, broadcast
39–45Arithmetic+(sub_621270), -(sub_6215F0), *(sub_621F20), /(sub_6220A0), %(sub_6220C0), <<(sub_70BBE0), >>(sub_70BCF0) — all with overflow/divzero checks
46–49Bitwise&, |, ^, ~
50–57Logical&&, || with short-circuit evaluation
58–59Detailed comparisonInteger(sub_621000), float(sub_70BE30), pointer(address+symbolic)
64Spaceship<=> → strong_ordering values at unk_4F06BD8unk_4F06C30
73–84Compound assignment+= through ^= with lifetime validation, const-check (diag 0x1318)
91–93ConditionalTernary ?:, array subscript (bounds-checked, error 0xA84)
94–95Virtual dispatchVtable lookup → sub_79CCD0
96–97AllocationPlacement new / operator new
103Exceptionthrow (always fails in constexpr)
105–108Delegatedsub_77FCB0 (builtin operators)

Value Slot Layout (16 bytes at a3)

OffsetSizeField
0–78Primary value (integer, IEEE float, or arena pointer)
81Flags byte (see below)
9–113Alignment info, compound assignment tracking
12–154Scope epoch ID (lifetime validation)

Extended slot (32 bytes for reference types) adds secondary address at +16 and frame base at +24.

Flags Byte (offset +8)

BitMaskNameMeaning
00x01IS_POINTERValue is an indirect pointer
10x02IS_PAST_ENDOne-past-the-end pointer
20x04HAS_CLEANUPDestructor chain at +16
30x08HAS_SUBOBJECTRefers to a subobject
40x10HAS_BITFIELDBitfield offset in bits 8–31
50x20IS_SYMBOLICUnresolved symbolic reference
60x40IS_CONSTFrom a const declaration
70x80IS_ARRAY_MEMBERPart of array storage

Statement Executor — sub_795660 (77KB)

Dispatch on *(a2+40) — statement kind:

CaseKindNotes
0DeclarationArena alloc → eval initializer → insert into scoped hash table
1–4If / if-else / if-init / if-constexprCondition → bool via sub_620EE0 → branch
5While loopStep counter at ctx+120, limit from qword_4D042E0 (~1M). Error 0x97F on exceeded.
6Jump (break/continue/goto)Sets control flow bits: bit 1=continue, bit 2=break, bit 3=goto
7,15,24Null/emptyReturn success
8ReturnWalk call chain at ctx+72, store result, set "returned" flag
11Expression statementEvaluate for side effects via sub_7987E0
12For loopInit → alloc → [condition → body → increment → cleanup] loop
13Do-whileDelegates to sub_7A0E60
14Range-based for4 temp slots via sub_77A250, iterator advance via sub_7A0470

Memory Management — 3-Tier Page Arena

TierLocationPage SizeThresholdPurpose
Primaryctx+16/ctx+2464KBdefaultExpression evaluation temporaries
Secondaryctx+144/ctx+15264KBlazy init (ctx+132 & 8)Variable declarations
Tertiaryctx+8064KBnullableString/compound literals

Overflow: allocations >1024 bytes go to heap via sub_822B10(size+16), forming a singly-linked list from ctx+32. Freed by walking until scope epoch matches.

Value slot header: type pointer at offset -8 (8 bytes), lifetime bits at offset -9 (1 byte, bit 0 = "initialized").

Scope epoch: monotonic counter at ctx+128. Hash table at ctx+56/ctx+64 maps epoch → page state. Arena rewound on scope exit.

Hash Table (ctx+0/ctx+8)

Open-addressing with 16-byte entries [key, value]. Hash: key_pointer >> 3. Collision: linear probing. Doubles at 2 * count > capacity (via sub_7704A0). Secondary table at ctx+56/ctx+64/ctx+68 uses 4-byte integer keys (scope epoch IDs).

Diagnostic Codes

CodeHexMeaning
610x3DDivision by zero
24310x97FStep limit exceeded
26920xA84Array index out of bounds
26950xA87Unsupported jump in constexpr
26980xA8ANull pointer dereference
27050xA91Negative shift count
27070xA93Integer overflow/underflow
27120xA98Use of uninitialized variable
27210xAA1Not a constant expression (generic)
27270xAA7Invalid type conversion
27350xAAFPointer below array start
27510xABFAccess outside lifetime
27660xACEModification through null pointer
29590xB8BMissing return in constexpr function
30070xBBFreinterpret_cast in constexpr
30220xBCECall to undefined constexpr function

Silent mode: ctx+132 bit 5 (0x20) suppresses diagnostics (SFINAE contexts).

Constexpr and CUDA: Host-Side Evaluation of Device Code

A key architectural question for any CUDA compiler is whether constexpr functions annotated __device__ are evaluated at host compile time. In cicc v13.0, the answer is yes, conditionally. The constexpr evaluator operates entirely within the EDG frontend, which runs on the host. When a constexpr __device__ function is used in a context requiring a constant expression (template argument, array bound, static_assert, constexpr variable initializer), the evaluator executes it using its tree-walking interpreter regardless of the function's execution space annotation. The execution space attributes (__device__, __host__, __global__) are semantic annotations for code generation, not for the constexpr evaluator -- the evaluator sees only the IL tree and does not distinguish between host and device function bodies.

This works because EDG's constexpr evaluator uses software floating point (USE_SOFTFLOAT = 1 in the 737-define configuration block). All floating-point arithmetic in constexpr contexts goes through the softfloat library (sub_70B8D0 add, sub_70B9E0 sub, sub_70BBE0 mul, sub_70BCF0 div, sub_709EF0 convert) rather than the host CPU's FPU. This guarantees that constexpr evaluation of device code produces results consistent with IEEE 754 semantics regardless of the host platform's floating-point behavior. The softfloat library handles all precision levels including _Float16, __bf16, _Float32, _Float64, and __float128.

SM architecture gates influence constexpr relaxations. The global qword_4F077A8 (SM version) gates certain constexpr features:

  • SM >= 89 (qword_4F077A8 > 0x15F8F): relaxed constexpr rules for variables with incomplete types
  • dword_4F077C4 == 2: C++20 features including constexpr new, constexpr string literals, and constexpr member access (expression evaluator cases 5/6)
  • dword_4D04880: C++14 relaxed constexpr (loops, local variable mutation, multiple return statements)
  • C++23/26 extensions: constexpr try-catch (statement executor case 14), constexpr placement new (expression evaluator case 103), constexpr dynamic_cast (error 0xBB7)

The evaluator enforces a step limit (qword_4D042E0, default ~1M iterations) to prevent infinite loops in constexpr evaluation. This limit applies uniformly to both host and device constexpr functions. When exceeded, diagnostic 0x97F ("constexpr evaluation step limit exceeded") is emitted.

One important consequence: __global__ (kernel) functions cannot be constexpr because they have no return value in the conventional sense -- they are launched asynchronously. The parser enforces this at the declaration specifier level, not in the constexpr evaluator.

Supporting Functions

FunctionSizeRole
sub_79CCD067KBObject member accessor (base classes, virtual bases, union tracking)
sub_799B7033KBAggregate initializer (arrays, structs, designated init, brace elision)
sub_79B7D029KBFunction call evaluator (argument binding, body execution, recursion limits)
sub_7987E011KBStatement list executor entry
sub_77FCB0150KBTop-level dispatch (80 expression types + 62-entry intrinsic table)
sub_7764B018KBType size calculator (Robin Hood hash memoization, 64MB cap)
sub_7707D0Clone constexpr object
sub_7790A0Trivial aggregate copy
sub_7A0070Lvalue-to-rvalue load
sub_77F5C0Bounds check (ptr, type → idx, err, size)
sub_76FFC0Run cleanup/destructor chain

Bigint Library (sub_621*)

FunctionOperation
sub_621000compare(a, width_a, b, width_b) → {-1,0,1}
sub_621270add(dst, src, width, overflow_out)
sub_6215F0sub(dst, src, width, overflow_out)
sub_621F20mul(dst, src, width, overflow_out)
sub_6220A0div(dst, src, width, divzero_out)
sub_6220C0mod(dst, src, width, divzero_out)
sub_621DB0negate(dst)
sub_620EE0to_int(value, width, result_out)

Float Library (sub_70B*)

FunctionOperation
sub_70B8D0add(type, lhs, rhs, dst, inexact, exception)
sub_70B9E0sub
sub_70BAF0negate
sub_70BBE0mul
sub_70BCF0div
sub_70BE30compare(type, lhs, rhs, nan_result) → {-1,0,1,NaN}
sub_709EF0convert(src, src_prec, dst, dst_prec, inexact)

Key Globals

VariablePurpose
dword_4F077C4C++ standard version (2 = C++20, enables constexpr new/string)
dword_4D04880C++14 relaxed constexpr (enables loops, mutation)
qword_4D042E0Max constexpr evaluation steps (~1M)
xmmword_4F08280Canonical constexpr TRUE
xmmword_4F08290Canonical constexpr FALSE
qword_4F08380Global type-size hash table base
qword_4F08060Global allocator function pointer (constexpr new detection)

CUDA-Specific Extensions

NVIDIA's extensions to the EDG frontend fall into four categories: memory space qualifiers that map to GPU address spaces, kernel launch syntax that gets lowered to CUDA runtime API calls, registration stubs that tell the CUDA runtime about compiled kernels, and atomic builtin generation for the C++11 atomics model on GPU. These extensions are concentrated in the 0x6500000x810000 range and reference SM architecture version globals extensively — many features are gated by qword_4F077A8 comparisons against architecture thresholds.

CUDA Keyword Extensions

NVIDIA extends the EDG keyword table with execution space qualifiers, memory space qualifiers, and type intrinsics. These exist in four distinct layers -- registered keywords, declaration specifier tokens, address space attribute tokens, and extended type tokens -- each integrated differently into the EDG parser infrastructure.

The critical architectural fact: __device__, __host__, and __global__ are not keywords in the EDG keyword table. They are processed through the C/C++ attribute system, where EDG maps them to internal single-character codes. The declaration specifier state machine (sub_672A20) and the address space handler together resolve these attributes into symbol-table fields that downstream passes consume.

Token ID Inventory

NVIDIA uses four non-contiguous token ID ranges:

RangeCategoryCountRegistration
133-136Memory space declaration specifiers4Hardcoded in sub_672A20 switch
236, 339-354Extended numeric types (FP8/FP6/FP4/MX)17Resolved via sub_6911B0
272-280Address space qualifier / special type tokens9Hardcoded handlers in sub_672A20
328-330NVIDIA type trait intrinsics3Registered via sub_885C00 in sub_706250

Only tokens 328-330 use the standard sub_885C00(token_id, "keyword") registration path. All other CUDA tokens are wired directly into parser switch cases, bypassing the keyword table entirely.

Execution Space Qualifiers -- Attribute Path

__device__, __host__, and __global__ are recognized by the attribute parser, which stores them as single-character codes at declaration context offset +269. The complete internal attribute character map (sub_5C79F0 at 0x5C79F0):

CharHexAttributeScope Entry Bits
'V'0x56__host__-- (host is the default)
'W'0x57__device__+198 bit 4 (0x10)
'X'0x58__global__+198 bit 4 (0x10) AND bit 5 (0x20)
'Y'0x59__tile_global__--
'Z'0x5A__shared__-- (stored in +136 as space code 3)
'['0x5B__constant__-- (stored in +136 as space code 2)
'\'0x5C__launch_bounds__Arguments at decl+336 struct
']'0x5D__maxnreg__--
'^'0x5E__local_maxnreg__--
'_'0x5F__tile_builtin__--
'f'0x66__managed__-- (stored in +136 as space code 5)
'k'0x6B__cluster_dims__Arguments at cluster config struct
'l'0x6C__block_size__--
'r'0x72__nv_pure__--

The attribute character code at +269 is consumed by sub_6582F0 (declaration-side validation) and sub_65F400 (definition-side validation). These functions never see the CUDA qualifier as a keyword token -- they only see the resolved character code.

Execution space at scope entry offset +198 is the authoritative record of a function's execution space for all downstream passes:

  • Bit 4 (0x10): function is __device__ or __global__ -- activates device-scope variable validation
  • Bit 5 (0x20): function is __global__ (kernel entry point) -- triggers kernel metadata emission via sub_12735D0, which emits ("kernel", 1) to LLVM IR
  • Bit 2 (0x04) at offset +199: full_custom_abi flag

When a function has bit 5 set, the attribute emitter also iterates the parameter array (40-byte entries at decl+16) and emits ("grid_constant", param_index) for each parameter where byte +33 is nonzero. The preserve-register struct at decl+336 (three int32 fields: data, control, after) is consumed and cleared (set to -1) after emission.

Memory Space Declaration Specifiers (Tokens 133-136)

These piggyback on the signedness/width field v305 in the declaration specifier state machine with values 4-7, cleanly separated from the standard C width modifiers (0-3):

TokenKeywordv305 Valuev325 ValueFormula
133__shared__42Special case
134__device__58token - 129
135__constant__68token - 129
136__managed__78token - 129

The type construction switch in sub_672A20 branches on v305 > 3 to invoke CUDA-specific type constructors (sub_72BC30 for signed, sub_72BDB0 for unsigned) instead of the standard C type constructors used for v305 values 0-3.

Address Space Qualifier Tokens (272-280)

Processed by dedicated handlers in the declaration specifier parser:

TokenKeywordHandlerArgument
272__attribute__((address_space(N)))sub_6210B0Parses integer N
273__global__ (addr space annotation)sub_667B60(0, ...)Space ID = 0
274__shared__ (addr space annotation)sub_667B60(2, ...)Space ID = 2
275__constant__ (addr space annotation)sub_667B60(3, ...)Space ID = 3
276__generic__sub_72B620(type, cv)--
277__nv_tex_surf_handle_tsub_72BA30(unk_4F06A51)Texture/surface handle
278__nv_buffer_handle_tsub_72BA30(unk_4F06A60)Buffer handle
279__nv_grid_constantsub_72C390()Grid-constant marker
280__nv_is_extended_device_lambdasub_72C270()Lambda closure check

Note the dual role of __shared__, __constant__, and __global__: each appears both as a memory space declaration specifier (tokens 133-135) and as an address space qualifier (tokens 273-275). The declaration specifier path stores the result in the symbol-table entry's memory_space_code at offset +136 and memory_space_flags at offset +156. The address space qualifier path stores the result in the EDG type node's qualifier word at offset +18 (values 1=global, 32=shared, 33=constant). Both representations flow downstream: the symbol-table code controls declaration validation, while the type qualifier controls LLVM pointer type construction in sub_911D10.

The __grid_constant__ qualifier (token 279, handler sub_72C390) marks kernel parameters as grid-constant -- the parameter is read-only across all thread blocks and may be placed in constant memory by the backend. This is a SM 70+ feature.

NVIDIA Type Trait Keywords (Tokens 328-330)

The only CUDA tokens registered through the standard sub_885C00 keyword registration path. Always registered -- not gated by any version, language mode, or feature flag:

TokenKeywordRegistration
328__nv_is_extended_device_lambda_closure_typesub_885C00(328, ...)
329__nv_is_extended_host_device_lambda_closure_typesub_885C00(329, ...)
330__nv_is_extended_device_lambda_with_preserved_return_typesub_885C00(330, ...)

These type traits are used by CUDA's extended lambda machinery to query whether a lambda closure type carries device or host-device execution space annotations. They participate in SFINAE and if constexpr contexts for compile-time dispatch between host and device lambda implementations.

The lambda mangling extensions in sub_80FE00 use the execution space information from these traits to choose between three proprietary Itanium ABI mangling prefixes: Unvdl (device lambda), Unvdtl (device template lambda), and Unvhdl (host-device lambda). The selection is based on flag byte +92 of the closure descriptor, where bit 5 (0x20) marks an extended CUDA lambda, bit 4 (0x10) marks host-device, and bit 2 (0x04) marks a template lambda.

Extended Numeric Type Tokens (236, 339-354)

Blackwell tensor core operations require exotic floating-point formats. These are resolved via sub_6911B0() to a type node, then set v325=20, v327|=4 in the declaration specifier state machine:

TokenTypeFormatWidth
236__nv_fp8_e4m3FP88b
339__nv_fp8_e5m2FP88b
340-341__nv_fp8x2_e{4m3,5m2}FP8 vector16b
342-343__nv_fp8x4_e{4m3,5m2}FP8 vector32b
344-345__nv_fp6_e{2m3,3m2}FP66b
346-347__nv_fp6x2_e{2m3,3m2}FP6 vector12b
348-349__nv_mxfp8_e{4m3,5m2}MX-format FP88b
350-351__nv_mxfp6_e{2m3,3m2}MX-format FP66b
352__nv_mxfp4_e2m1MX-format FP44b
353__nv_satfiniteSaturation modifier--
354__nv_e8m0Exponent-only E8M08b

These types are represented as fundamental types with distinct sub-kind values at type node +56 in the EDG type system. The type comparison engine (sub_7386E0, case 1) compares sub-kind, extra flags at +58 (bits 0x3A), and the base type chain at +72 to ensure each format is treated as a distinct type.

Attribute Processing Pipeline

The complete pipeline from CUDA source keyword to LLVM IR metadata:

CUDA source: __global__ void kernel() __launch_bounds__(256, 2)
  |
  v
Phase 1: Attribute parser → char code 'X' (0x58) at decl context +269
  |
  v
Phase 2: Declaration specifier state machine (sub_672A20)
         → scope entry +198 bit 5 set (kernel)
  |
  v
Phase 3: Post-parse fixup (sub_5D0FF0)
         → __launch_bounds__(256, 2) extracted to launch config struct
  |
  v
Phase 4: CUDA attribute validator (sub_826060)
         → validates __launch_bounds__ on __global__ function
         → diagnostic 0xDCE (3534) if __launch_bounds__ on non-kernel
         → diagnostic 0xE83 (3715) if values out of range
         → diagnostic 0xE87 (3719) if __launch_bounds__ + __maxnreg__ conflict
  |
  v
Phase 5: Attribute emission to LLVM IR (sub_12735D0)
         → emits ("kernel", 1) from bit 5 of decl+198
         → emits ("grid_constant", N) per qualifying parameter
  |
  v
Phase 6: Kernel metadata generation (sub_93AE30)
         → "nvvm.maxntid" = "256,1,1"
         → "nvvm.minctasm" = "2"

Memory Space Attributes

sub_6582F0 (22KB) and sub_65F400 (28KB) validate __shared__, __constant__, __managed__ on variable declarations and definitions respectively. Token cases 133-136 in the parser handle these as first-class declaration specifiers. The validation logic enforces CUDA semantics: __shared__ variables cannot have initializers (shared memory is not initialized on kernel launch), __constant__ variables must have static storage duration, and __managed__ variables require unified memory support on the target architecture.

Symbol Table Memory Space Encoding

Memory space is tracked in two locations within each symbol-table entry:

OffsetSizeFieldValues
+1361 bytememory_space_code0=default, 1=__device__, 2=__constant__, 3=__shared__, 5=__managed__
+1561 bytememory_space_flagsbit 0=device, bit 1=shared, bit 2=constant, bit 4=thread_local interaction
+1571 byteExtended flagsbit 0=managed

The dual encoding exists because the flags are additive from parsed attributes (multiple attributes can be OR'd in) while the code is the single resolved value used by downstream passes. The code at +136 is set by sub_735FB0 (symbol entry constructor) and queried throughout the compiler.

Declaration-Side Validation -- sub_6582F0

The validation follows a ten-phase pipeline:

  1. __managed__ pre-resolution: when dword_4F04C5C == dword_4F04C34 (host-only mode), managed variables are silently downgraded to __device__ (space code 1) and the extern flag is cleared.

  2. Extern handling: sets bit 0 of decl context +122 and the is_extern tracking variables.

  3. Type normalization: checks function-type declarations against CUDA criteria via sub_8D4C10; emits diagnostic 891 for function types with memory space.

  4. Specifier processing: calls sub_6413B0 against the current compilation target.

  5. Prior-declaration conflict detection: looks up existing symbol, compares memory space codes. Mismatch with dword_4F077C4 == 2 (separate compilation) triggers diagnostic 172 (warning 4).

  6. New symbol creation: sub_735FB0(type_ptr, space_code, target_id, is_new_decl).

  7. __managed__ namespace binding: validates namespace name via sub_703C10; checks class/struct type compatibility (diagnostic 1560 on failure).

  8. Storage class adjustments: processes constant-space read-only flags.

  9. Device-scope enforcement: when scope +198 bit 4 is set (inside __device__/__global__ function), local variables cannot carry device memory qualifiers. Diagnostic 3484: "an automatic variable may not be declared as __device__". The memory space name is determined by the priority cascade: __constant__ > __managed__ > __shared__ > __device__.

  10. Final fixup: type validation (sub_8D9350), attribute propagation (sub_8756F0), "main" function warnings (diagnostic 2948), thread-safety analysis (sub_826000).

Memory Space Mutual Exclusivity

The code consistently enforces these combinations:

CombinationDiagnosticSeverity
__shared__ + __constant__3481error
__constant__ + __managed__3568error
__constant__ + __shared__ + __managed__3568error
thread_local + __device__892error
thread_local + any device space3578error
auto variable + __device__/__constant__/__managed__3484error
__shared__ + initializer3510error
__constant__ in device-function scope3512error
register + device memory3485/3688error
volatile + __constant__1378error
redeclaration with different memory space3499warning 5

The diagnostic name-string priority cascade (__constant__ > __managed__ > __shared__ > __device__) appears identically in six locations: sub_6582F0 lines 734-739, sub_65F400 lines 541-549 and 927-935, sub_5C6B80 lines 22-34, sub_667550 lines 87-98, and sub_5D9330 (the symbol printer).

Address Space Flow: EDG to LLVM to PTX

CUDA SourceEDG TokenSymbol +136Type Qualifier +18LLVM ASPTX Directive
__device__ int x;134111.global
__shared__ int x;1333323.shared
__constant__ int x;1352334.const
__managed__ int x;136511.global + runtime registration
(local in kernel)--0--0/5.local/.param

The EDG type node qualifier word (offset +18, masked to 0x7FFF) carries address space through the type system. During EDG-to-LLVM type translation, sub_911D10 reads this qualifier from pointer/reference types (kind 75/76) and maps to LLVM address space numbers via sub_5FFE90. __managed__ variables are compiled as __device__ (LLVM address space 1) with additional runtime registration calls generated by sub_806F60 for unified memory management.

Kernel Launch Lowering — sub_7F2B50 (16KB)

Transforms CUDA's <<<gridDim, blockDim, sharedMem, stream>>> kernel launch syntax into CUDA runtime API calls. The lowered sequence allocates a parameter buffer via cudaGetParameterBufferV2, copies kernel arguments into it, and launches with cudaLaunchDeviceV2. For the simpler launch path, it generates __cudaPushCallConfiguration followed by individual __cudaSetupArg/__cudaSetupArgSimple calls. This lowering happens entirely within EDG — by the time the code reaches the LLVM backend, kernel launches are ordinary function calls.

Registration Stub Generator — sub_806F60

Generates __cudaRegisterAll function with calls to: __cudaRegisterEntry, __cudaRegisterVariable, __cudaRegisterGlobalTexture, __cudaRegisterGlobalSurface, __cudaRegisterManagedVariable, __cudaRegisterBinary, ____cudaRegisterLinkedBinary.

Host-side stubs generated by sub_808590: "__device_stub_%s", "__cudaLaunch", "__cudaSetupArg", "__cudaSetupArgSimple".

Atomic Builtin Generator — sub_6BBC40 (34KB)

Constructs __nv_atomic_fetch_{add,sub,and,xor,or,max,min} names with type suffixes (_s, _f, _u) and width (_%u).

SM Architecture Gates

Two functions configure ~160 optimization/feature flags based on SM version:

FunctionRoleThresholds
sub_60D650Optimization level → 109 unk_4D04* flagsSingle integer parameter (O-level)
sub_60E7C0SM arch → 60 unk_4D04* feature flagsSM 75 (30399), SM 80 (40000), SM 90 (89999), SM 100 (109999), SM 120 (119999)

Each flag is gated by a byte_4CF8* user-override check, preventing auto-configuration when the user explicitly sets a flag via CLI.

TileIR Backend

sub_5E3AD0 optionally loads libTileIRCompiler_shared.so via dlopen and looks up symbol "cudacc_back_end". A 17-entry function pointer table is passed. Gated by dword_4D045A0.

Diagnostic System

EDG's diagnostic system supports three output formats: human-readable terminal output (with ANSI color and word-wrapping), SARIF JSON for IDE integration, and a machine-readable log format for automated tooling. All three share the same diagnostic numbering scheme and severity classification. The terminal output handler alone is 37KB — it implements its own word-wrapping algorithm with configurable terminal width, recursive child diagnostic emission (for "note: see declaration of X" chains), and color coding by severity level.

Terminal Output — sub_681D20 (37KB)

Formats error/warning/remark messages with:

  • Severity labels: remark (2), warning (4), caution (5), severe-warning (6), error (7–8), catastrophe (9–10), internal-error (11)
  • Source location: file:line:col
  • ANSI color escapes (gated by dword_4F073CC)
  • Word-wrapping at dword_4D039D0 (terminal width)
  • Recursive child diagnostic emission

SARIF JSON Output — sub_6837D0 (20KB)

Structured diagnostics for IDE integration, enabled by --diagnostics_format=sarif (CLI case 0x125, sets unk_4D04198 = 1). The output is a comma-separated stream of SARIF result objects -- NOT a complete SARIF envelope with $schema, runs[], etc. The caller or a post-processor is expected to wrap the stream in the standard SARIF container.

Each diagnostic emits one JSON object:

{
  "ruleId": "EC<number>",
  "level": "error",
  "message": {"text": "<json-escaped message>"},
  "locations": [
    {
      "physicalLocation": {
        "artifactLocation": {"uri": "file://<path>"},
        "region": {
          "startLine": 42,
          "startColumn": 17
        }
      }
    }
  ],
  "relatedLocations": [
    {
      "message": {"text": "see declaration of X"},
      "physicalLocation": { ... }
    }
  ]
}

Rule ID format: "EC" + decimal error number from the diagnostic record at offset +176. For example, EDG error 1234 becomes "EC1234".

Severity mapping (byte at diagnostic node +180):

SeverityEDG MeaningSARIF level
4remark"remark"
5warning"warning"
7, 8error"error"
9catastrophe"catastrophe"
11internal error"internal_error"

Note that SARIF spec only defines "warning", "error", and "note" as standard levels. The "remark", "catastrophe", and "internal_error" values are EDG extensions -- consuming tools should treat unknown levels as "error".

Message text escaping: sub_683690 renders the diagnostic text into qword_4D039E8, then copies character-by-character into the output buffer, escaping " as \" and \ as \\. No other JSON escaping (e.g., control characters, Unicode) is applied.

Location resolution: sub_67C120 calls sub_729E00 to decompose the packed source location into (file-id, line, column), then sub_722DF0 to resolve the file-id to a filesystem path. The startColumn field is omitted when column is zero.

Related locations: the linked list at diagnostic node +72 chains "note" sub-diagnostics. Each is emitted as a relatedLocations array entry with its own message and physical location.

Filtering before emission: diagnostics pass through severity threshold check (byte_4F07481[0]), duplicate detection (byte_4CFFE80[4*errnum + 2] bit flags), pragma-based suppression (sub_67D520), and error limit check (unk_4F074B0 + unk_4F074B8 >= unk_4F07478). All filtering happens before the SARIF/text format branch.

Machine-Readable Log

Writes to qword_4D04908 in format: <severity-char> "<filename>" <line> <col> <message>\n. Severity chars from "RwweeccccCli": R=remark, w=warning, e=error, c=catastrophe.

Name Mangling (Itanium ABI)

EDG includes a complete implementation of the Itanium C++ ABI name mangling specification. NVIDIA extends the standard mangling with three proprietary prefixes (Unvdl, Unvdtl, Unvhdl) for device lambdas, device template lambdas, and host-device lambdas respectively. These extensions are necessary because CUDA's execution model requires distinguishing between host and device versions of the same lambda — they must have different mangled names to avoid linker collisions when both host and device code are linked into the same binary.

Address range 0x8100000x8EFFFF:

FunctionSizeRole
sub_8E74B029KBPrimary mangling entry
sub_8E9FF026KBType mangling
sub_81646024KBType component mangling
sub_81379013KBExpression mangling
sub_80E34023KBBuiltin type mangling (incl. DF16_, DF16b, Cu6__bf16, u6__mfp8)
sub_80FE008KBNVIDIA extension mangling (Unvdl, Unvdtl, Unvhdl)

NVIDIA lambda mangling extensions (sub_80FE00): standard Itanium ABI uses Ul<params>E<index>_ for unnamed lambda types and Ut<index>_ for unnamed non-lambda types. NVIDIA adds three proprietary prefixes chosen based on flag byte +92 of the lambda's closure descriptor:

PrefixMeaningCondition
Unvdl__device__ lambdaflag_byte_92 & 0x20 set, not host-device, not template
Unvdtl__device__ template lambdaflag_byte_92 & 0x20 set, flag_byte_92 & 4 set
Unvhdl__host__ __device__ lambdaflag_byte_92 & 0x20 set, flag_byte_92 & 0x10 set

The Unvhdl prefix carries three single-digit flags separated by underscores after the prefix: Unvhdl<index>_<has_explicit_return>_<is_host_device>_<has_template_params>_. Each flag is '0' or '1'. This is richer than the standard Ul which only encodes parameter types.

NVIDIA vendor type manglings (sub_80E340): the type mangler handles CUDA-specific types as Itanium vendor types (prefix u + length + name):

TypeManglingNotes
__bf16 (bfloat16)u6__bf16 or DF16bABI-gated: qword_4F077B4 lo32 selects vendor vs C++23 encoding
__mfp8 (FP8)u6__mfp8NVIDIA micro-float 8-bit for transformer inference
__metainfoU10__metainfoKernel parameter metadata type attribute
float80u7float80x87 extended precision (vendor type)

The __bf16 mangling has a three-way gate reflecting the ongoing ABI transition: qword_4F077B4 lo32 != 0 selects "u6__bf16" (vendor type); hi32 == 0 selects "DF16b" (C++23 standardized P1467); otherwise qword_4F06A78 determines which encoding. The ABI version variable unk_4D04250 controls this and other encoding decisions, with known thresholds at 0x76BF (GCC 3.3 compat) and 0xC350 (GCC 12 compat).

Standard float types follow Itanium: _Float16 = "DF16_", __fp16 = "Dh", float = "f", double = "d", __float128 = "g", with the complex variants adding a 'C' prefix.

Key Global Variables

VariableSizeRole
dword_4F077C44Language mode: 0=neither, 1=C, 2=C++
unk_4F077784C/C++ standard year (199711, 201103, 201402, 201703, 202002, 202310)
qword_4F077B48Dialect extension flags (lo=CUDA extensions, hi=GNU extensions)
dword_4F077BC4NVCC mode flag
dword_4F077C04GCC compatibility mode
qword_4F077A88SM architecture version (controls feature gates throughout)
word_4F064182Current parser token
qword_4F04C688Scope table base pointer (776-byte entries)
dword_4F04C644Current scope index
qword_4CF7CE08AST printer callback vtable
qword_4D03FF08Current translation unit pointer
qword_4D049088Machine-readable diagnostic log FILE*
qword_4F08028qword_4F0804048IL tree walker callback table
dword_4D045A04TileIR mode flag