cudafe++ v13.0 -- Reverse Engineering Reference
cudafe++ is NVIDIA's CUDA frontend compiler -- the first stage of the CUDA compilation pipeline. It is built on the Edison Design Group (EDG) C++ Front End v6.6, a commercial compiler frontend licensed by compiler vendors worldwide. NVIDIA ships cudafe++ as a statically-linked, stripped ELF binary inside every CUDA Toolkit installation. This binary accepts .cu source files, parses them as C++ with CUDA extensions, separates device code from host code, and produces two outputs: an EDG Intermediate Language (IL) stream consumed by cicc (the NVIDIA PTX code generator), and a transformed .int.c host file consumed by the system C++ compiler (gcc, clang, or cl.exe).
This wiki documents the complete internals of the cudafe++ binary from CUDA Toolkit 13.0, reverse-engineered through static analysis (IDA Pro + Hex-Rays decompilation) of all 6,483 functions. The goal is reimplementation-grade documentation: every page should give a senior compiler engineer enough information to build equivalent functionality from scratch.
Binary Identity
| Property | Value |
|---|---|
| Binary | cudafe++ from CUDA Toolkit 13.0 |
| Format | ELF 64-bit LSB executable, x86-64, statically linked, stripped |
| File size | 8,910,936 bytes (8.5 MB) |
| EDG base | Edison Design Group C++ Front End v6.6 |
| Build path | /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/ |
| Total functions | 6,483 |
| Functions mapped to source | 2,209 (34%) |
Segment Layout
| Section | Start | End | Size | Description |
|---|---|---|---|---|
| .text | 0x403300 | 0x829722 | 4,351,010 bytes (4.15 MB) | Executable code |
| .rodata | 0x829740 | 0xAA3FA3 | 2,599,011 bytes (2.48 MB) | Read-only data (string tables, jump tables, constants) |
| .data | 0xD46480 | 0xE7EFF0 | 1,280,880 bytes (1.22 MB) | Initialized global variables |
| .bss | 0xE7F000 | 0x12D6F20 | 4,554,528 bytes (4.34 MB) | Zero-initialized globals |
| .eh_frame | 0xCB1210 | 0xD3F398 | 582,024 bytes | Exception handling unwind tables |
| .data.rel.ro | 0xD428C0 | 0xD45E00 | 13,632 bytes | Relocation-read-only (vtables, GOT-relative) |
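The Size column can be re-derived from the Start and End columns. The sketch below does that subtraction for the six sections listed; the `SECTIONS` dictionary and the `section_size` helper are illustrative names, not anything taken from the binary or from the analysis tooling.

```python
# Start/end virtual addresses copied from the segment table above.
SECTIONS = {
    ".text":        (0x403300, 0x829722),
    ".rodata":      (0x829740, 0xAA3FA3),
    ".data":        (0xD46480, 0xE7EFF0),
    ".bss":         (0xE7F000, 0x12D6F20),
    ".eh_frame":    (0xCB1210, 0xD3F398),
    ".data.rel.ro": (0xD428C0, 0xD45E00),
}


def section_size(name: str) -> int:
    """Return the byte size of a section as End minus Start."""
    start, end = SECTIONS[name]
    return end - start


if __name__ == "__main__":
    for name in SECTIONS:
        size = section_size(name)
        # The MB figures in the table are binary megabytes (2**20 bytes).
        print(f"{name:14s} {size:>10,d} bytes  {size / 2**20:5.2f} MB")
```

The same subtraction is a quick consistency check for any other address range quoted on these pages.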
Role in the CUDA Toolchain
                    input.cu
                       |
                       v
     cudafe++  ──────── THIS BINARY ────────
        |                          |
        v                          v
device.gpu (EDG IL)        input.int.c (transformed host C++)
        |                          |
        v                          v
      cicc                 gcc / clang / cl.exe
        |                          |
        v                          v
   device.ptx                   host.o
        |                          |
        v                          v
      ptxas                        ld
        |                          |
        v                          v
  device.cubin ──────────────────> final executable
cudafe++ is a source-to-source compiler. It never generates machine code directly. Its job is to take a single .cu translation unit, understand which code is device (__device__, __global__) and which is host, then:
- For the device track: Emit EDG IL -- a typed, scope-linked intermediate representation containing every declaration, type, expression, and statement. This IL is consumed by cicc, which lowers it through LLVM to PTX assembly.
- For the host track: Emit a .int.c file -- valid C++ source where device function bodies are suppressed inside #if 0 / #endif, __global__ kernels are replaced by __wrapper__device_stub_<name>() forwarding functions, and CUDA runtime registration boilerplate is appended.
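As a toy illustration of the two host-track rewrites named above, the following sketch wraps a piece of device-only source text in #if 0 / #endif and derives a stub name from a kernel name. It is not the emitter's logic: `suppress_device_body` and `stub_name` are hypothetical helpers, and the sketch assumes the stub suffix is the plain kernel identifier, which the real output may not use (for example if the name is mangled).

```python
def suppress_device_body(source: str) -> str:
    """Wrap device-only source text so the host compiler skips it."""
    return "#if 0\n" + source.rstrip("\n") + "\n#endif\n"


def stub_name(kernel: str) -> str:
    """Build a forwarding-stub name of the form __wrapper__device_stub_<name>.

    Assumption: <name> is the unmangled kernel identifier.
    """
    return "__wrapper__device_stub_" + kernel


if __name__ == "__main__":
    device_fn = "__device__ int twice(int x) { return 2 * x; }"
    print(suppress_device_body(device_fn), end="")
    print(stub_name("saxpy"))
```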
The binary runs as a single-threaded, single-pass-per-stage pipeline with 8 stages: pre-init, CLI parsing (276 flags), one-time init (38 subsystem initializers), TU state reset, frontend parse (EDG parser + CUDA extensions), 5-pass IL finalization, backend .int.c emission, and exit. See Pipeline Overview for the full stage diagram.
Source Attribution
The binary embeds __FILE__ strings from the EDG build system, revealing the original source file structure. From these strings plus address-range analysis of decompiled code, 52 .c source files and 13 .h header files have been identified:
| Category | Files | Functions Mapped | Description |
|---|---|---|---|
| EDG core parser | 15 .c | ~800 | Lexer, expression/declaration parser, statement handling |
| EDG type system | 6 .c | ~350 | Type representation, checking, conversion |
| EDG templates | 5 .c | ~300 | Template parsing, instantiation, deduction |
| EDG IL subsystem | 8 .c | ~250 | IL node types, allocation, walking, display, comparison |
| EDG infrastructure | 12 .c | ~400 | Memory management, error handling, name mangling, scope management |
| EDG code generation | 3 .c | ~150 | Backend .int.c emission, ASM handling |
| NVIDIA additions | 3 .c | ~110 | CUDA transforms, attribute validation, lambda wrappers |
| Headers | 13 .h | (inline) | Shared constants, struct layouts, macro definitions |
The NVIDIA-specific source files are:
- nv_transforms.c (~34 functions, ~14 KB of .text): The heart of CUDA support. Implements device/host-device lambda wrapper template generation (__nv_dl_wrapper_t, __nv_hdl_wrapper_t, __nv_hdl_create_wrapper_t), CUDA attribute validation (__launch_bounds__, __cluster_dims__, __block_size__, __maxnreg__), host reference array emission (.nvHRKI / .nvHRDE / .nvHRCE ELF sections), lambda preamble injection (sub_6BCC20), and array capture helper generation.
- nv_transforms.h: Header with NVIDIA-specific declarations, type trait template names, and bitmask table definitions.
- 3 modified EDG files: cmd_line.c (CUDA CLI flags spliced into EDG's flag table), fe_init.c (CUDA-specific initialization at stage 3), and cp_gen_be.c (device stub generation, lambda wrapper emission, registration table output in the backend).
Key Discoveries
Execution Space Bitfield
Every entity node in the EDG IL carries CUDA execution-space information at byte offset +182 (relative to the entity node base). The bitfield encoding:
| Bit | Mask | Meaning |
|---|---|---|
| 4-5 | 0x30 | Execution space: 0=none, 1=__host__, 2=__device__, 3=__host__ __device__ |
| 6 | 0x40 | Device/global flag (set for __device__ and __global__ functions) |
| 7 | 0x80 | __global__ kernel flag |
This bitfield is checked throughout the pipeline -- in cross-space call validation, device/host code separation, the keep-in-IL predicate, and backend stub generation.
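The encoding in the table can be decoded with three masks. The sketch below is a minimal reading of that table, assuming the entity node is available as a raw byte buffer; the function and constant names are illustrative and do not correspond to symbols in the binary.

```python
ENTITY_EXEC_SPACE_OFFSET = 182   # byte offset within the entity node (from the text)

EXEC_SPACE_MASK = 0x30           # bits 4-5
EXEC_SPACE_SHIFT = 4
DEVICE_OR_GLOBAL_FLAG = 0x40     # bit 6
GLOBAL_KERNEL_FLAG = 0x80        # bit 7

EXEC_SPACE_NAMES = {
    0: "none",
    1: "__host__",
    2: "__device__",
    3: "__host__ __device__",
}


def decode_exec_space(flags: int) -> dict:
    """Split the execution-space byte into the fields listed in the table."""
    return {
        "space": EXEC_SPACE_NAMES[(flags & EXEC_SPACE_MASK) >> EXEC_SPACE_SHIFT],
        "device_or_global": bool(flags & DEVICE_OR_GLOBAL_FLAG),
        "is_kernel": bool(flags & GLOBAL_KERNEL_FLAG),
    }


def exec_space_of(entity: bytes) -> dict:
    """Decode the byte at +182 of a raw entity node buffer."""
    return decode_exec_space(entity[ENTITY_EXEC_SPACE_OFFSET])


if __name__ == "__main__":
    # 0x60 is an arbitrary test value, not a byte observed in the binary.
    print(decode_exec_space(0x60))
```

The lower four bits of the byte are not described on this page, so the sketch ignores them.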
Lambda Wrapper Template Injection
CUDA extended lambdas (__device__ and __host__ __device__ lambdas) cannot be passed directly across the host/device boundary. cudafe++ solves this by injecting a library of template wrapper structs into the compilation at backend time. The master emitter sub_6BCC20 (nv_emit_lambda_preamble) generates all __nv_* templates in a single function call, driven by two 1024-bit bitmasks that record which capture counts were actually needed during parsing:
- unk_1286980: Device lambda capture counts (bit N = need __nv_dl_wrapper_t for N captures)
- unk_1286900: Host-device lambda capture counts (need __nv_hdl_wrapper_t for N captures)
Only the required specializations are emitted, keeping the generated code minimal.
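A capture-count bitmask of this shape can be modelled as a 128-byte array with one bit per capture count. The sketch below assumes least-significant-bit-first ordering within each byte, which this page does not state; `mark_capture_count`, `needs_wrapper`, and `needed_counts` are hypothetical helper names.

```python
MASK_BYTES = 128  # 1024 bits, one per possible capture count


def mark_capture_count(mask: bytearray, n: int) -> None:
    """Record that a lambda with n captures was seen during parsing."""
    mask[n >> 3] |= 1 << (n & 7)   # assumption: LSB-first within each byte


def needs_wrapper(mask: bytes, n: int) -> bool:
    """True if a wrapper specialization for n captures must be emitted."""
    return bool(mask[n >> 3] & (1 << (n & 7)))


def needed_counts(mask: bytes) -> list:
    """List every capture count whose bit is set, in ascending order."""
    return [n for n in range(len(mask) * 8) if needs_wrapper(mask, n)]


if __name__ == "__main__":
    device_mask = bytearray(MASK_BYTES)
    for captures in (0, 2, 9):
        mark_capture_count(device_mask, captures)
    print(needed_counts(device_mask))
```

An emitter driven by such a mask only has to walk `needed_counts` once per wrapper family, which matches the "only the required specializations" behaviour described above.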
CUDA Error Catalog
The binary contains 3,795 diagnostic messages in the EDG error table. Of these, 338 are CUDA-specific (error numbers in the 20000+ range and the 3500-3800 range). These cover:
- Execution space violations (calling __device__ from __host__ and vice versa)
- __global__ function constraints (no return value, no variadic args, no virtual)
- Lambda restrictions (35+ distinct error categories for extended lambda misuse)
- Attribute conflicts (__launch_bounds__ + __maxnreg__ mutual exclusion)
- RDC mode restrictions (user-defined copy constructors in kernel arguments)
- Architecture feature gates (feature X requires SM_YY or higher)
IL Entry Kind System
The EDG IL uses 85 defined entry kinds (0-84), each representing a distinct node type in the typed, scope-linked IL graph. Key node types include: routine (288 bytes, functions/methods), variable (232 bytes), type (176 bytes, 22 sub-kinds), expr_node (72 bytes, 36 sub-kinds), statement (80 bytes, 26 sub-kinds), and scope (288 bytes, 9 sub-kinds). All nodes live in a region-based arena allocator with 64 KB blocks. See IL Overview for the complete entry kind table.
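To make the allocation scheme concrete, here is a minimal bump allocator over 64 KB blocks using the node sizes quoted above. It is a sketch under assumptions: the 8-byte alignment and the policy of never splitting a node across blocks are guesses, and the `Arena` class is a hypothetical stand-in for the allocator in the binary.

```python
BLOCK_SIZE = 64 * 1024  # block size from the text

# Node sizes in bytes, as quoted in the paragraph above.
NODE_SIZES = {
    "routine": 288, "variable": 232, "type": 176,
    "expr_node": 72, "statement": 80, "scope": 288,
}


class Arena:
    """Bump allocator: nodes are carved sequentially out of fixed-size blocks."""

    def __init__(self, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = []
        self.offset = block_size      # forces the first alloc to open a block

    def alloc(self, size: int, align: int = 8):
        """Return (block_index, offset) for a new node of the given size."""
        if size > self.block_size:
            raise ValueError("node larger than a block")
        start = (self.offset + align - 1) & ~(align - 1)  # assumed alignment
        if start + size > self.block_size:
            self.blocks.append(bytearray(self.block_size))  # open a new block
            start = 0
        self.offset = start + size
        return len(self.blocks) - 1, start


if __name__ == "__main__":
    arena = Arena()
    print(arena.alloc(NODE_SIZES["routine"]))
    print(arena.alloc(NODE_SIZES["expr_node"]))
```

Because nothing is freed individually, releasing a region amounts to dropping its block list, which is the usual motivation for this design in compilers.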
CLI Flag Inventory
cudafe++ accepts 276 command-line flags parsed in sub_459630 (cmd_line.c). These control:
- Language mode and C++ standard version (__cplusplus value)
- Host compiler identity (MSVC, GCC, Clang) and version
- CUDA-specific modes: extended lambdas, RDC, JIT, architecture target
- Diagnostic suppression and promotion
- Include paths and macro definitions
- Output format and timing
Flags are passed from nvcc via the -Xcudafe forwarding mechanism. Many flags are undocumented EDG internals.
Wiki Structure
This wiki is organized into 12 sections covering the binary from top-level pipeline down to individual data structures.
Overview
- Function Map -- address-to-identity table for all 2,209 mapped functions
- Binary Layout -- segment map, memory regions, address space organization
- Methodology -- RE tools, approach, confidence scoring
Compilation Pipeline
The 8-stage pipeline from main() at 0x408950 through exit. Covers initialization, CLI parsing, EDG frontend invocation, 5-pass IL finalization, backend .int.c emission, and exit code mapping.
CUDA Execution Model
How cudafe++ handles __device__, __host__, and __global__ execution spaces. Device/host code separation, cross-space call validation, kernel stub generation, RDC (relocatable device code) mode, JIT mode, and SM architecture feature gating.
CUDA Attributes
The internal attribute system: __global__ function constraints, __launch_bounds__ / __cluster_dims__ / __block_size__ / __maxnreg__ validation, __grid_constant__ parameter handling, __managed__ variable support, and minor attributes (__nv_pure__, __nv_register_params__).
Lambda Transformations
Extended lambda support architecture: device lambda wrapper (__nv_dl_wrapper_t), host-device lambda wrapper (__nv_hdl_wrapper_t / __nv_hdl_create_wrapper_t), capture handling (field types, array wrappers for up to 8D), preamble injection (sub_6BCC20), and the 35+ lambda restriction error categories.
EDG Intermediate Language
The 85-entry-kind IL format: node allocation (region-based arena), tree walking (5-callback traversal), device code selection (keep-in-IL predicate), display (debug dump), and comparison/copy operations.
Host Output Generation
The .int.c file format, CUDA runtime boilerplate (__nv_managed_rt initialization, crt/host_runtime.h inclusion), host reference arrays (.nvHRKI/.nvHRDE/.nvHRCE ELF sections for device symbol registration), and CRC32-based module ID generation.
EDG Frontend Internals
The stock EDG 6.6 subsystems: lexer/tokenizer (357 token kinds), expression parser, declaration parser, overload resolution, template engine (instantiation worklist), CUDA-specific template restrictions, constexpr interpreter, Itanium ABI name mangling with CUDA extensions, and the type system (176-byte type node, 22 type kinds).
Error & Diagnostic System
The 3,795-entry diagnostic table, CUDA-specific error catalog (338 entries), format specifier system (%t/%s/%n/%sq/%p/%d), and SARIF output / pragma control.
Data Structures
Byte-level layouts for the core IL node types: entity node (execution/memory space at +182), scope entry (784 bytes), translation unit descriptor (424 bytes), type node (176 bytes, 22 kinds), and template instance record (128 bytes).
Configuration
CLI flag inventory (276 flags by category), EDG build configuration (compile-time constants baked into the binary), architecture detection (--nv_arch and SM version mapping), and experimental feature flags.
Reference
EDG source file map (52 .c + 13 .h), global variable index, token kind table (357 types), full error message catalog, and virtual override mismatch matrix.
Navigating This Wiki
If you want to understand the compilation pipeline: Start with Pipeline Overview, then follow the stage-by-stage links.
If you want to understand CUDA-specific behavior: Start with the CUDA Execution Model section. The execution spaces page explains the fundamental bitfield encoding that everything else depends on.
If you want to understand lambda transformations: Start with the Lambda Transformations overview. Lambda support is the most complex NVIDIA addition and involves template injection, capture-count bitmasks, and 5 distinct wrapper template families.
If you want to understand the IL format: Start with IL Overview for the 85 entry kinds, then Keep-in-IL for how device code is selected.
If you want to look up a specific function: The Function Map provides address-to-identity mappings for all 2,209 identified functions. The EDG Source File Map shows which source file each address range belongs to.
Data Sources
This wiki is derived from:
- 6,202 Hex-Rays decompiled C pseudocode files -- one per function with recognizable control flow
- 6,342 x86-64 disassembly files -- full instruction-level coverage
- 9.5 MB strings database with cross-references to every function that uses each string
- 161 MB cross-reference database -- complete caller/callee and data-reference mappings
- 7.7 MB call graph in JSON and DOT format
- 6,483 control flow graphs with basic block boundaries
- 247 MB IDA Pro database (.i64)
All analysis was performed on the binary shipped with CUDA Toolkit 13.0, obtained from NVIDIA's public distribution channels.
Function Map
Every function in the cudafe++ binary that triggers an EDG assertion encodes three pieces of data in the assertion string: the source file path, the line number, and the enclosing function name. These strings survive in .rodata and cross-reference back to the compiled functions, providing a ground-truth mapping from binary address to EDG source file. This page catalogs that mapping for all 52 .c source files and 13 .h header files identified in the CUDA 13.0 build of cudafe++ (EDG 6.6).
The mapping was produced by extracting all string literals matching /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/*.c and *.h from the binary's .rodata section, then tracing their cross-references to determine which functions load each path. A function that references attribute.c in an assertion string was compiled from attribute.c. Functions that reference no source path at all (the "unmapped" pool) are either too small to contain assertions, are inlined from headers, or belong to the statically-linked C++ runtime.
Coverage Summary
| Category | Functions | Percentage |
|---|---|---|
| Mapped via .c file paths | 2,129 | 32.8% |
| Mapped via .h file paths only | 80 | 1.2% |
| Total mapped | 2,209 | 34.1% |
| Unmapped in EDG region (0x403300--0x7E0000) | 2,906 | 44.8% |
| C++ runtime / demangler (0x7E0000--0x829722) | 1,085 | 16.7% |
| PLT stubs + init (0x402A18--0x403300) | 283 | 4.4% |
| Total functions in binary | 6,483 | 100% |
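The percentages follow directly from the function counts. A small sketch that recomputes them and checks that the buckets add up to the binary total (the `BUCKETS` dictionary and `percent` helper are illustrative names):

```python
TOTAL_FUNCTIONS = 6483

# Function counts copied from the coverage table above.
BUCKETS = {
    "Mapped via .c file paths": 2129,
    "Mapped via .h file paths only": 80,
    "Unmapped in EDG region": 2906,
    "C++ runtime / demangler": 1085,
    "PLT stubs + init": 283,
}


def percent(count: int, total: int = TOTAL_FUNCTIONS) -> float:
    """Share of the binary's functions, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


if __name__ == "__main__":
    # Every function falls into exactly one bucket.
    assert sum(BUCKETS.values()) == TOTAL_FUNCTIONS
    for name, count in BUCKETS.items():
        print(f"{name:32s} {count:5d}  {percent(count):4.1f}%")
    mapped = BUCKETS["Mapped via .c file paths"] + BUCKETS["Mapped via .h file paths only"]
    print(f"{'Total mapped':32s} {mapped:5d}  {percent(mapped):4.1f}%")
```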
The 2,906 unmapped functions in the EDG region include inlined header expansions (e.g., util.h vector/hash helpers, types.h type queries), small leaf functions below the assertion threshold, switch-table dispatch fragments, and functions from translation units compiled without assertions enabled (notably il_to_str.c display routines and parts of floating.c).
Binary Layout
The EDG .text region (0x403300--0x7E0000) has a three-part structure:
- Assert stub region (0x403300--0x408B40): 235 small __noreturn functions, one per assertion site. Each encodes a source file path, line number, and function name, then calls sub_4F2930 (the internal error handler). These stubs are sorted by source file name -- the linker grouped them from all 52 .c files into one contiguous block. 200 stubs map to .c files; the remaining 35 are from .h files inlined into .c compilation units.
- Constructor region (0x408B40--0x409350): 15 C++ static constructor functions (ctor_001 through ctor_015) that initialize global tables at program startup.
- Main body region (0x409350--0x7DFFF0): The bulk of the compiler. Source files are laid out roughly in alphabetical order by filename, a consequence of the linker processing object files in directory-listing order. The ordering holds from attribute.c at 0x409350 and class_decl.c at 0x419280 through to types.c at 0x7A4940; modules.c at 0x7C0C60 and floating.c at 0x7D0EB0 follow at the end, out of alphabetical order.
Source File Address Table
The table below lists all 52 .c source files sorted by their main body start address. "Total Funcs" counts all functions referencing the file (stubs + main body). "Stubs" counts assert stubs in 0x403300--0x408B40. "Main Funcs" counts functions in the main body region.
| # | Source File | Origin | Total Funcs | Stubs | Main Funcs | Main Body Start | Main Body End | Sweep |
|---|---|---|---|---|---|---|---|---|
| 1 | attribute.c | EDG | 177 | 7 | 170 | 0x409350 | 0x418F80 | P1.01 |
| 2 | class_decl.c | EDG | 273 | 9 | 264 | 0x419280 | 0x447930 | P1.01--02 |
| 3 | cmd_line.c | EDG | 44 | 1 | 43 | 0x44B250 | 0x459630 | P1.02--03 |
| 4 | const_ints.c | EDG | 4 | 1 | 3 | 0x461C20 | 0x4659A0 | P1.03 |
| 5 | cp_gen_be.c | EDG | 226 | 25 | 201 | 0x466F90 | 0x489000 | P1.03--04 |
| 6 | debug.c | EDG | 2 | 0 | 2 | 0x48A1B0 | 0x48A1B0 | P1.04 |
| 7 | decl_inits.c | EDG | 196 | 4 | 192 | 0x48B3F0 | 0x4A1540 | P1.04--05 |
| 8 | decl_spec.c | EDG | 88 | 3 | 85 | 0x4A1BF0 | 0x4B37F0 | P1.05 |
| 9 | declarator.c | EDG | 64 | 0 | 64 | 0x4B3970 | 0x4C00A0 | P1.05 |
| 10 | decls.c | EDG | 207 | 5 | 202 | 0x4C0910 | 0x4E8C40 | P1.05--06 |
| 11 | disambig.c | EDG | 5 | 1 | 4 | 0x4E9E70 | 0x4EC690 | P1.06 |
| 12 | error.c | EDG | 51 | 1 | 50 | 0x4EDCD0 | 0x4F8F80 | P1.06 |
| 13 | expr.c | EDG | 538 | 10 | 528 | 0x4F9870 | 0x5565E0 | P1.07--08 |
| 14 | exprutil.c | EDG | 299 | 13 | 286 | 0x558720 | 0x583540 | P1.08--09 |
| 15 | extasm.c | EDG | 7 | 0 | 7 | 0x584CA0 | 0x585850 | P1.09 |
| 16 | fe_init.c | EDG | 6 | 1 | 5 | 0x585B10 | 0x5863A0 | P1.09 |
| 17 | fe_wrapup.c | EDG | 2 | 0 | 2 | 0x588D40 | 0x588F90 | P1.09 |
| 18 | float_pt.c | EDG | 79 | 0 | 79 | 0x589550 | 0x594150 | P1.09--10 |
| 19 | folding.c | EDG | 139 | 9 | 130 | 0x594B30 | 0x5A4FD0 | P1.10 |
| 20 | func_def.c | EDG | 56 | 1 | 55 | 0x5A51B0 | 0x5AAB80 | P1.10 |
| 21 | host_envir.c | EDG | 19 | 2 | 17 | 0x5AD540 | 0x5B1E70 | P1.10 |
| 22 | il.c | EDG | 358 | 16 | 342 | 0x5B28F0 | 0x5DFAD0 | P1.10--11d |
| 23 | il_alloc.c | EDG | 38 | 1 | 37 | 0x5E0600 | 0x5E8300 | P1.11a--11e |
| 24 | il_to_str.c | EDG | 83 | 1 | 82 | 0x5F7FD0 | 0x6039E0 | P1.11f--12 |
| 25 | il_walk.c | EDG | 27 | 1 | 26 | 0x603FE0 | 0x620190 | P1.12 |
| 26 | interpret.c | EDG | 216 | 5 | 211 | 0x620CE0 | 0x65DE10 | P1.12--13 |
| 27 | layout.c | EDG | 21 | 2 | 19 | 0x65EA50 | 0x665A60 | P1.13 |
| 28 | lexical.c | EDG | 140 | 5 | 135 | 0x666720 | 0x689130 | P1.13--14 |
| 29 | literals.c | EDG | 21 | 0 | 21 | 0x68ACC0 | 0x68F2B0 | P1.14 |
| 30 | lookup.c | EDG | 71 | 2 | 69 | 0x68FAB0 | 0x69BE80 | P1.14 |
| 31 | lower_name.c | EDG | 179 | 11 | 168 | 0x69C980 | 0x6AB280 | P1.14--15 |
| 32 | macro.c | EDG | 43 | 1 | 42 | 0x6AB6E0 | 0x6B5C10 | P1.15 |
| 33 | mem_manage.c | EDG | 9 | 2 | 7 | 0x6B6DD0 | 0x6BA230 | P1.15 |
| 34 | nv_transforms.c | NVIDIA | 1 | 0 | 1 | 0x6BE300 | 0x6BE300 | P1.15 |
| 35 | overload.c | EDG | 284 | 3 | 281 | 0x6BE4A0 | 0x6EF7A0 | P1.15--16 |
| 36 | pch.c | EDG | 23 | 3 | 20 | 0x6F2790 | 0x6F5DA0 | P1.16 |
| 37 | pragma.c | EDG | 28 | 0 | 28 | 0x6F61B0 | 0x6F8320 | P1.16 |
| 38 | preproc.c | EDG | 10 | 0 | 10 | 0x6F9B00 | 0x6FC940 | P1.16 |
| 39 | scope_stk.c | EDG | 186 | 6 | 180 | 0x6FE160 | 0x7106B0 | P1.16--17 |
| 40 | src_seq.c | EDG | 57 | 1 | 56 | 0x710F10 | 0x718720 | P1.17 |
| 41 | statements.c | EDG | 83 | 1 | 82 | 0x719300 | 0x726A50 | P1.17 |
| 42 | symbol_ref.c | EDG | 42 | 2 | 40 | 0x726F20 | 0x72CEA0 | P1.17 |
| 43 | symbol_tbl.c | EDG | 175 | 8 | 167 | 0x72D950 | 0x74B8D0 | P1.17--18 |
| 44 | sys_predef.c | EDG | 35 | 1 | 34 | 0x74C690 | 0x751470 | P1.18 |
| 45 | target.c | EDG | 11 | 0 | 11 | 0x7525F0 | 0x752DF0 | P1.18 |
| 46 | templates.c | EDG | 455 | 12 | 443 | 0x7530C0 | 0x794D30 | P1.18 |
| 47 | trans_copy.c | EDG | 2 | 0 | 2 | 0x796BA0 | 0x796BA0 | P1.18 |
| 48 | trans_corresp.c | EDG | 88 | 6 | 82 | 0x796E60 | 0x7A3420 | P1.18--19 |
| 49 | trans_unit.c | EDG | 10 | 0 | 10 | 0x7A3BB0 | 0x7A4690 | P1.19 |
| 50 | types.c | EDG | 88 | 5 | 83 | 0x7A4940 | 0x7C02A0 | P1.19 |
| 51 | modules.c | EDG | 22 | 3 | 19 | 0x7C0C60 | 0x7C2560 | P1.19 |
| 52 | floating.c | EDG | 50 | 9 | 41 | 0x7D0EB0 | 0x7D59B0 | P1.19 |
Totals: 5,338 cross-references across 52 .c files, resolving to 2,129 unique functions. With .h file references added, 2,209 unique functions are mapped.
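The address table lends itself to a sorted-range lookup: given a function address, find the source file whose main body range contains it. The sketch below uses a handful of rows from the table and treats "Main Body End" as inclusive, which is an assumption (the column appears to hold the address of the last mapped function rather than a byte limit). `RANGES` and `source_file_for` are illustrative names.

```python
import bisect

# (main body start, main body end, source file) -- a subset of the table rows.
RANGES = [
    (0x409350, 0x418F80, "attribute.c"),
    (0x419280, 0x447930, "class_decl.c"),
    (0x44B250, 0x459630, "cmd_line.c"),
    (0x461C20, 0x4659A0, "const_ints.c"),
    (0x466F90, 0x489000, "cp_gen_be.c"),
    (0x710F10, 0x718720, "src_seq.c"),
    (0x719300, 0x726A50, "statements.c"),
]
STARTS = [start for start, _, _ in RANGES]


def source_file_for(addr: int):
    """Return the source file whose main body range contains addr, else None."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        return None
    start, end, name = RANGES[i]
    # Assumption: the end column is inclusive; addresses past it are gaps.
    return name if addr <= end else None


if __name__ == "__main__":
    for addr in (0x432280, 0x47ECC0, 0x448000):
        print(hex(addr), source_file_for(addr))
```

Addresses that fall between two confirmed ranges return None, which corresponds to the unmapped boundary gaps between files.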
Largest Source Files by Function Count
| Source File | Main Body Funcs | Approximate Code Size |
|---|---|---|
| expr.c | 528 | ~373 KB (0x4F9870--0x5565E0) |
| templates.c | 443 | ~282 KB (0x7530C0--0x794D30) |
| il.c | 342 | ~185 KB (0x5B28F0--0x5DFAD0) |
| exprutil.c | 286 | ~175 KB (0x558720--0x583540) |
| overload.c | 281 | ~200 KB (0x6BE4A0--0x6EF7A0) |
| class_decl.c | 264 | ~187 KB (0x419280--0x447930) |
| interpret.c | 211 | ~241 KB (0x620CE0--0x65DE10) |
| decls.c | 202 | ~165 KB (0x4C0910--0x4E8C40) |
| cp_gen_be.c | 201 | ~141 KB (0x466F90--0x489000) |
| decl_inits.c | 192 | ~91 KB (0x48B3F0--0x4A1540) |
Header File Cross-References
Thirteen .h header files appear in assertion strings. These are headers that contain non-trivial inline functions or macros that expand to assertion-bearing code. When a function compiled from decls.c triggers an assertion whose __FILE__ is types.h, that assertion was inlined from types.h into the decls.c compilation unit.
| # | Header File | Xrefs | Stubs | Main Funcs | Address Range | Inlined Into |
|---|---|---|---|---|---|---|
| 1 | decls.h | 1 | 0 | 1 | 0x4E08F0 | decls.c |
| 2 | float_type.h | 63 | 0 | 63 | 0x7D1C90--0x7DEB90 | floating.c |
| 3 | il.h | 5 | 2 | 3 | 0x52ABC0--0x6011F0 | expr.c, il.c, il_to_str.c |
| 4 | lexical.h | 1 | 0 | 1 | 0x68F2B0 | lexical.c / literals.c boundary |
| 5 | mem_manage.h | 4 | 0 | 4 | 0x4EDCD0 | error.c |
| 6 | modules.h | 5 | 0 | 5 | 0x7C1100--0x7C2560 | modules.c |
| 7 | nv_transforms.h | 3 | 0 | 3 | 0x432280--0x719D20 | class_decl.c, cp_gen_be.c, src_seq.c |
| 8 | overload.h | 1 | 0 | 1 | 0x6C9E40 | overload.c |
| 9 | scope_stk.h | 4 | 0 | 4 | 0x503D90--0x574DD0 | expr.c, exprutil.c |
| 10 | symbol_tbl.h | 2 | 1 | 1 | 0x7377D0 | symbol_tbl.c |
| 11 | types.h | 17 | 4 | 13 | 0x469260--0x7B05E0 | Many files (scattered type queries) |
| 12 | util.h | 124 | 10 | 114 | 0x430E10--0x7C2B10 | All major .c files |
| 13 | walk_entry.h | 51 | 0 | 51 | 0x604170--0x618660 | il_walk.c |
Notable Header Patterns
util.h is the most widely included header, with 124 cross-references (114 in main body) spanning nearly the entire EDG .text region from 0x430E10 to 0x7C2B10. It provides generic container templates (dynamic arrays, hash tables, sorted sets) used by every major subsystem. The compiler inlined these templates into each compilation unit, creating many small util.h-attributed functions scattered across the binary.
float_type.h is concentrated in a single 52 KB block at 0x7D1C90--0x7DEB90, immediately after floating.c. It contains 63 template instantiations for IEEE 754 floating-point type operations (comparison, conversion, arithmetic) for each target floating-point width. These templates were instantiated in the floating.c compilation unit.
walk_entry.h contributes 51 functions in the tight range 0x604170--0x618660, all within the il_walk.c region. These are the per-entry-kind callback dispatch functions generated by preprocessor macros in the IL walker header.
nv_transforms.h is NVIDIA-specific. Its 3 cross-references appear in class_decl.c (sub_432280 at 0x432280), cp_gen_be.c (sub_47ECC0 at 0x47ECC0), and src_seq.c (sub_719D20 at 0x719D20). These are the integration points where NVIDIA's CUDA transform hooks are called from standard EDG code paths -- class definition processing, backend code generation, and source sequence ordering.
NVIDIA-Specific Files
nv_transforms.c
The only NVIDIA-authored .c file in the EDG source tree. Despite having only 1 mapped function via __FILE__ (sub_6BE300 at 0x6BE300), the sweep analysis of the 0x6BAE70--0x6BE4A0 range identified approximately 40 functions compiled from this file. The discrepancy exists because nv_transforms.c uses NVIDIA's own assertion macros (not EDG's standard internal_error path), so most functions do not reference the EDG-style __FILE__ string.
Functions confirmed in the nv_transforms.c region:
| Address | Identity | Purpose |
|---|---|---|
| 0x6BAE70 | nv_init_transforms | Zero all NVIDIA transform state at startup |
| 0x6BAF70 | alloc_mem_block | 64 KB memory block allocator for NV region pools |
| 0x6BB290 | reset_mem_state | Emergency OOM recovery -- clear memory tracking |
| 0x6BB350 | init_memory_regions | Bootstrap region 0 and region 1 with initial blocks |
| 0x6BB790 | emit_device_lambda_wrapper | Generate __nv_dl_wrapper_t<> specialization |
| 0x6BCC20 | emit_lambda_preamble | Inject lambda wrapper preamble declarations |
| 0x6BD490 | emit_host_device_lambda_wrapper | Generate __nv_hdl_wrapper_t<> specialization |
| 0x6BE300 | (mapped function) | Single function with EDG-style __FILE__ reference |
Key infrastructure in this file:
- __nv_dl_wrapper_t<> / __nv_hdl_wrapper_t<> struct template generation
- Host reference array emission (.nvHRKE, .nvHRKI, .nvHRDE, .nvHRDI, .nvHRCE, .nvHRCI)
- Capture count bitmask tables: unk_1286980 (device) and unk_1286900 (host-device), 128 bytes each
- Lambda-to-closure entity mapping via hash table at qword_12868F0
nv_transforms.h
NVIDIA's hook header, #include-d from three EDG source files. It declares the functions that bridge standard EDG processing to NVIDIA's CUDA transform layer. The three inclusion sites represent the three points where EDG's standard C++ frontend cedes control to NVIDIA-specific logic:
- class_decl.c (sub_432280 at 0x432280): Called during class definition processing to apply CUDA execution-space attributes to closure types and validate lambda capture constraints.
- cp_gen_be.c (sub_47ECC0 at 0x47ECC0): Called during backend code generation to emit CUDA-specific output constructs (device stubs, host reference arrays, registration calls).
- src_seq.c (sub_719D20 at 0x719D20): Called during source sequence processing to inject NVIDIA preamble declarations and wrapper type definitions into the correct position in the declaration order.
Unmapped Regions (Gap Analysis)
Several address ranges within the EDG .text region contain functions that could not be mapped to any source file via __FILE__ strings. The major gaps and their probable contents:
| Gap Range | Size | Probable Content | Evidence |
|---|---|---|---|
| 0x408B40--0x409350 | ~2 KB | Static constructors (ctor_001--ctor_015) | No source path; global table initializers |
| 0x447930--0x44B250 | ~14 KB | class_decl.c / cmd_line.c boundary helpers | Between confirmed ranges |
| 0x459630--0x461C20 | ~34 KB | cmd_line.c tail + const_ints.c preamble | Unmapped option handlers |
| 0x5E8300--0x5F7FD0 | ~63 KB | IL display routines (il_to_str.c early body) | No assertions (display-only code) |
| 0x665A60--0x666720 | ~3 KB | layout.c / lexical.c boundary | Small gap between confirmed ranges |
| 0x689130--0x68ACC0 | ~7 KB | lexical.c tail + literals.c preamble | Token/literal conversion helpers |
| 0x6AB280--0x6AB6E0 | ~1 KB | lower_name.c / macro.c boundary | Mangling helpers |
| 0x6BA230--0x6BAE70 | ~3 KB | mem_manage.c / nv_transforms.c boundary | Memory infrastructure |
| 0x6EF7A0--0x6F2790 | ~12 KB | overload.c / pch.c boundary | Overload resolution helpers |
| 0x6FC940--0x6FE160 | ~6 KB | preproc.c / scope_stk.c boundary | Preprocessor tail |
| 0x751470--0x7525F0 | ~4 KB | sys_predef.c / target.c boundary | Predefined macro infrastructure |
| 0x7A4690--0x7A4940 | ~1 KB | trans_unit.c / types.c boundary | Translation unit helpers |
| 0x7C2560--0x7D0EB0 | ~59 KB | Type-name mangling / encoding for output | Between modules.c and floating.c |
| 0x7D1C90--0x7DEB90 | ~52 KB | float_type.h template instantiations | Confirmed via .h path strings |
| 0x7DFFF0--0x82A000 | ~304 KB | C++ runtime, demangler, soft-float, EH | Statically-linked libstdc++/libgcc |
The largest unmapped gap within EDG code is the IL display region at 0x5E8300--0x5F7FD0 (64,720 bytes, about 63 KB). These functions were compiled from il_to_str.c but contain no assertions because the display/dump subsystem was built without assertion macros -- it is purely diagnostic code that formats IL trees to stdout.
The float_type.h block at 0x7D1C90--0x7DEB90 (52 KB) is technically mapped via .h cross-references but has no .c file attribution because the template instantiations carry only the header's __FILE__ path.
Alphabetical Ordering Observation
The files are laid out in the binary in rough alphabetical order, consistent with a build system that compiles object files in directory-listing order and a linker that processes them sequentially:
0x409350 attribute.c (a)
0x419280 class_decl.c (c)
0x44B250 cmd_line.c (c)
0x461C20 const_ints.c (c)
0x466F90 cp_gen_be.c (c)
0x48A1B0 debug.c (d)
0x48B3F0 decl_inits.c (d)
0x4A1BF0 decl_spec.c (d)
0x4B3970 declarator.c (d)
0x4C0910 decls.c (d)
0x4E9E70 disambig.c (d)
0x4EDCD0 error.c (e)
0x4F9870 expr.c (e)
0x558720 exprutil.c (e)
0x584CA0 extasm.c (e)
0x585B10 fe_init.c (f)
0x588D40 fe_wrapup.c (f)
0x589550 float_pt.c (f)
0x594B30 folding.c (f)
0x5A51B0 func_def.c (f)
0x5AD540 host_envir.c (h)
0x5B28F0 il.c (i)
0x5E0600 il_alloc.c (i)
0x5F7FD0 il_to_str.c (i)
0x603FE0 il_walk.c (i)
0x620CE0 interpret.c (i)
0x65EA50 layout.c (l)
0x666720 lexical.c (l)
0x68ACC0 literals.c (l)
0x68FAB0 lookup.c (l)
0x69C980 lower_name.c (l)
0x6AB6E0 macro.c (m)
0x6B6DD0 mem_manage.c (m)
0x6BAE70 nv_transforms.c (n) [region start; mapped func at 0x6BE300]
0x6BE4A0 overload.c (o)
0x6F2790 pch.c (p)
0x6F61B0 pragma.c (p)
0x6F9B00 preproc.c (p)
0x6FE160 scope_stk.c (s)
0x710F10 src_seq.c (s)
0x719300 statements.c (s)
0x726F20 symbol_ref.c (s)
0x72D950 symbol_tbl.c (s)
0x74C690 sys_predef.c (s)
0x7525F0 target.c (t)
0x7530C0 templates.c (t)
0x796BA0 trans_copy.c (t)
0x796E60 trans_corresp.c (t)
0x7A3BB0 trans_unit.c (t)
0x7A4940 types.c (t)
0x7C0C60 modules.c (m) [breaks alphabetical order]
0x7D0EB0 floating.c (f) [breaks alphabetical order]
Two files break the alphabetical pattern: modules.c at 0x7C0C60 and floating.c at 0x7D0EB0. Both appear after types.c instead of in their expected positions (between mem_manage.c and nv_transforms.c for modules.c, between float_pt.c and folding.c for floating.c). This suggests these two files are compiled as separate translation units outside the main EDG source directory, or are added to the link line after the alphabetically-sorted EDG objects.
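The ordering claim is easy to check mechanically: walk the filenames in address order and flag any name that sorts below the largest name seen so far. The sketch below does this with plain byte-wise string comparison, which is an assumption about how the build's directory listing sorts; `FILES_IN_ADDRESS_ORDER` is transcribed from the listing above and `out_of_order` is an illustrative helper.

```python
# Filenames in ascending main-body address order, as listed above.
FILES_IN_ADDRESS_ORDER = [
    "attribute.c", "class_decl.c", "cmd_line.c", "const_ints.c", "cp_gen_be.c",
    "debug.c", "decl_inits.c", "decl_spec.c", "declarator.c", "decls.c",
    "disambig.c", "error.c", "expr.c", "exprutil.c", "extasm.c",
    "fe_init.c", "fe_wrapup.c", "float_pt.c", "folding.c", "func_def.c",
    "host_envir.c", "il.c", "il_alloc.c", "il_to_str.c", "il_walk.c",
    "interpret.c", "layout.c", "lexical.c", "literals.c", "lookup.c",
    "lower_name.c", "macro.c", "mem_manage.c", "nv_transforms.c", "overload.c",
    "pch.c", "pragma.c", "preproc.c", "scope_stk.c", "src_seq.c",
    "statements.c", "symbol_ref.c", "symbol_tbl.c", "sys_predef.c", "target.c",
    "templates.c", "trans_copy.c", "trans_corresp.c", "trans_unit.c", "types.c",
    "modules.c", "floating.c",
]


def out_of_order(names):
    """Return names that sort below the highest name seen before them."""
    breaks = []
    highest = ""
    for name in names:
        if name < highest:      # byte-wise comparison ('_' sorts before letters)
            breaks.append(name)
        else:
            highest = name
    return breaks


if __name__ == "__main__":
    print(out_of_order(FILES_IN_ADDRESS_ORDER))  # ['modules.c', 'floating.c']
```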
Data Source
All mappings were extracted from the binary's .rodata string table. The extraction command:
jq '[.[] | select(.value | test("/dvs/p4/.*\\.c$")) |
{file: (.value | split("/") | last),
xrefs: [.xrefs[].func] | length}
] | sort_by(.file)' cudafe++_strings.json
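For readers without jq, the same filter can be expressed in Python. The sketch assumes the JSON layout implied by the jq expression (a top-level array of objects with a "value" string and an "xrefs" array whose elements carry a "func" field); `count_source_xrefs` is a hypothetical helper name.

```python
import json
import re

SOURCE_PATH = re.compile(r"/dvs/p4/.*\.c$")


def count_source_xrefs(strings):
    """Mirror the jq filter: one row per .c path string, sorted by file name."""
    rows = []
    for entry in strings:
        if SOURCE_PATH.search(entry["value"]):        # jq test() is unanchored
            rows.append({
                "file": entry["value"].split("/")[-1],
                "xrefs": len([x["func"] for x in entry["xrefs"]]),
            })
    return sorted(rows, key=lambda row: row["file"])


def load_and_count(path="cudafe++_strings.json"):
    """Read the strings database from disk and apply the filter."""
    with open(path, encoding="utf-8") as handle:
        return count_source_xrefs(json.load(handle))
```

Swapping the pattern's `\.c$` for `\.h$` gives the corresponding header-file counts.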
The full build path for every source file is:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/<filename>
Address ranges were verified against the 20 sweep reports (P1.01 through P1.20) produced during the binary analysis phase.
Binary Layout
cudafe++ ships as a single statically-linked, stripped ELF 64-bit x86-64 executable. Static linking pulls in the entirety of libstdc++ (locale facets, iostream, exception handling), Berkeley SoftFloat 3e (half/quad-precision arithmetic), and glibc CRT startup code. The resulting 8.5 MB binary has no external shared library dependencies -- it runs identically on any Linux x86-64 host regardless of installed C++ runtime version.
This page documents the complete segment and section layout, the internal organization of each major section, and the key data structures located within each region. All addresses are virtual addresses from the ELF load image.
ELF Header
| Property | Value |
|---|---|
| Format | ELF 64-bit LSB executable |
| Architecture | x86-64 (AMD64) |
| Linking | Statically linked |
| Stripped | Yes (no debug symbols, no .symtab) |
| File size | 8,910,936 bytes (8.5 MB) |
| Entry point | 0x40918C (_start, glibc CRT) |
| Main | 0x408950 |
Complete Section Table
| Section | Start | End | Size (bytes) | Size (human) | Permissions | Purpose |
|---|---|---|---|---|---|---|
| LOAD (ELF hdr) | 0x400000 | 0x402A18 | 10,776 | 10.5 KB | r-x | ELF headers and program header table |
| .init | 0x402A18 | 0x402A30 | 24 | 24 B | r-x | Initialization stub (calls init_proc) |
| .plt | 0x402A30 | 0x403300 | 2,256 | 2.2 KB | r-x | Procedure Linkage Table (141 entries) |
| .text | 0x403300 | 0x829722 | 4,351,010 | 4.15 MB | r-x | All executable code |
| .fini | 0x829724 | 0x829732 | 14 | 14 B | r-x | Finalization stub (empty body) |
| .rodata | 0x829740 | 0xAA3FA3 | 2,599,011 | 2.48 MB | r-- | Read-only data |
| .eh_frame_hdr | 0xAA3FA4 | 0xAB0350 | 50,092 | 48.9 KB | r-- | Exception frame header index |
| .eh_frame | 0xCB1210 | 0xD3F398 | 582,024 | 568.4 KB | rw- | Exception unwind tables (CFI) |
| .gcc_except_table | 0xD3F398 | 0xD42854 | 13,500 | 13.2 KB | rw- | GCC LSDA exception handler tables |
| .ctors | 0xD42858 | 0xD428B0 | 88 | 88 B | rw- | Constructor table (9 function pointers + 2 sentinels) |
| .dtors | 0xD428B0 | 0xD428C0 | 16 | 16 B | rw- | Destructor table (2 sentinels, empty) |
| .data.rel.ro | 0xD428C0 | 0xD45E00 | 13,632 | 13.3 KB | rw- | Vtables and relocation-read-only data |
| .got | 0xD45FC0 | 0xD45FF8 | 56 | 56 B | rw- | Global Offset Table |
| .got.plt | 0xD46000 | 0xD46478 | 1,144 | 1.1 KB | rw- | GOT for PLT entries |
| .data | 0xD46480 | 0xE7EFF0 | 1,280,880 | 1.22 MB | rw- | Initialized globals |
| .bss | 0xE7F000 | 0x12D6F20 | 4,554,528 | 4.34 MB | rw- | Zero-initialized globals |
| .tls | 0x12D6F20 | 0x12D6F38 | 24 | 24 B | --- | Thread-local storage (exception state) |
| extern | 0x12D6F38 | 0x12D73A8 | 1,136 | 1.1 KB | --- | External symbol stubs |
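Total virtual address span: 0x12D73A8 - 0x400000 = 0xED73A8 bytes (15,561,640 bytes, about 14.8 MB), including the unmapped gaps between sections. The highest address, 0x12D73A8, sits about 18.8 MB above address zero.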
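The sizes in the section table can be cross-checked with a few lines of arithmetic. The sketch below recomputes end minus start for the four largest sections and for the whole image; the dictionary name `SECTIONS` and the helpers are illustrative, and the addresses are copied from the table above.

```python
# (start, end, size in bytes) as listed in the section table.
SECTIONS = {
    ".text":   (0x403300, 0x829722, 4_351_010),
    ".rodata": (0x829740, 0xAA3FA3, 2_599_011),
    ".data":   (0xD46480, 0xE7EFF0, 1_280_880),
    ".bss":    (0xE7F000, 0x12D6F20, 4_554_528),
}


def span(start, end):
    """Number of bytes in the half-open range [start, end)."""
    return end - start


def mib(n_bytes):
    """Convert bytes to units of 2**20 bytes (the 'MB' used in the tables)."""
    return n_bytes / (1 << 20)


for name, (start, end, listed) in SECTIONS.items():
    assert span(start, end) == listed, name

# From the first mapped byte to the end of the extern segment.
image_span = span(0x400000, 0x12D73A8)
print(f"{image_span:#x} bytes = {mib(image_span):.1f} MB")
```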
.text -- Executable Code (4.15 MB)
The .text section contains all 6,483 functions in the binary. It divides into four distinct regions, laid out contiguously by the linker:
0x403300 0x829722
|-- assert stubs --|-- ctors --|---- EDG main body ----|-- C++ runtime ----|
0x403300 0x408B40 0x409350 0x7DF400 0x829722
34 KB 8 KB 3.61 MB 304 KB
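When triaging an address from a disassembly listing, it helps to map it to one of these regions mechanically. A minimal sketch, assuming the region boundaries shown above are exact; the function name `text_region` is illustrative.

```python
from bisect import bisect_right

# Region start addresses from the layout above, in ascending order.
TEXT_REGIONS = [
    (0x403300, "assert stubs"),
    (0x408B40, "constructors"),
    (0x409350, "EDG main body"),
    (0x7DF400, "C++ runtime"),
]
TEXT_END = 0x829722
_STARTS = [start for start, _ in TEXT_REGIONS]


def text_region(addr):
    """Return the .text region containing addr, or None if outside .text."""
    if not (_STARTS[0] <= addr < TEXT_END):
        return None
    # bisect_right finds the first region starting after addr; step back one.
    return TEXT_REGIONS[bisect_right(_STARTS, addr) - 1][1]


print(text_region(0x4F2930))   # address of the internal_error handler
print(text_region(0x7E42E0))   # start of the libstdc++ range
```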
Assert Stub Region (0x403300 -- 0x408B40, 34 KB)
Contains 235 small __noreturn functions, each encoding a single assertion site. Every stub loads three arguments -- the source file path and function name as string constants, and the line number as an integer immediate -- then calls sub_4F2930 (the internal_error handler in error.c). These stubs are called from the bodies of larger functions when an impossible condition is detected.
The linker groups all stubs from all 52 .c source files into this contiguous block, sorted approximately by source file name. Of the 235 stubs:
- 200 map to .c source files (e.g., attribute.c:10897 at 0x403300, cp_gen_be.c:22342 at 0x4036F6)
- 35 map to .h header files inlined into .c compilation units (e.g., types.h at 0x40345C)
Each stub is exactly 29 bytes: a lea for the file path, a mov for the line number, a lea for the function name, then a call to sub_4F2930.
Constructor Region (0x408B40 -- 0x409350, 8 KB)
Contains 9 C++ global constructor functions (ctor_001 through ctor_009) registered in the .ctors table. These run before main() via __libc_start_main's init callback at 0x829640. The constructors, in execution order:
| Constructor | Address | Identity | What It Initializes |
|---|---|---|---|
| ctor_001 | 0x408B40 | EDG diagnostic list | Doubly-linked list at E7FE40..E7FE68 (self-referencing empty sentinel) |
| ctor_002 | 0x408B90 | Stream state table | 13 qwords at 126ED80..126EDE0 (output channel array including 126EDF0 = stderr FILE*) |
| ctor_003 | 0x408C20 | EDG internal caches | ios_base::Init + 7 doubly-linked lists at 12C6A40, 12868C0..1286780 (symbol/type caches) |
| ctor_004 | 0x408E50 | Emergency exception pool | 72,704-byte malloc pool at 12D4870, free-list at 12D4868, with pthread mutex |
| ctor_005 | 0x408ED0 | Locale once-flags (set 1) | 8 flags at 12D6A68..12D6AA0 |
| ctor_006 | 0x408F50 | Locale once-flags (set 2) | 8 flags at 12D6AF0..12D6B28 |
| ctor_007 | 0x408FD0 | Locale once-flags (set 3) | 12 flags at 12D6D28..12D6D80 |
| ctor_008 | 0x409090 | Locale once-flags (set 4) | 12 flags at 12D6DE8..12D6E40 |
| ctor_009 | 0x409150 | Stream buffer destructors | __cxa_atexit for basic_streambuf<char> and basic_streambuf<wchar_t> |
Constructors 4--9 belong to statically-linked libstdc++. Only constructors 1--3 initialize EDG/NVIDIA state.
EDG Main Body (0x409350 -- 0x7DF400, 3.61 MB)
The core of the compiler. Contains 5,115 functions compiled from 52 .c source files (49 EDG files plus 3 NVIDIA-specific files, matching the subsystem breakdown below). Functions are laid out in approximate alphabetical order by source file name -- the linker processed object files in directory-listing order:
0x409350 attribute.c (170 functions)
0x419280 class_decl.c (264 functions)
0x44B250 cmd_line.c (43 functions)
0x461C20 const_ints.c (3 functions)
0x466F90 cp_gen_be.c (201 functions)
...
0x6BE300 nv_transforms.c (1 mapped function, NVIDIA)
0x6BE4A0 overload.c (281 functions)
...
0x7A4940 types.c
0x7C0C60 modules.c
0x7D0EB0 floating.c
~0x7DF400 end of EDG code
The 52 source files break down by subsystem:
| Subsystem | Files | Functions | Description |
|---|---|---|---|
| Parser | 15 .c | ~800 | Lexer, expression/declaration parser, statements |
| Type system | 6 .c | ~350 | Type representation, checking, conversion |
| Templates | 5 .c | ~300 | Parsing, instantiation, deduction |
| IL subsystem | 8 .c | ~250 | Node types, allocation, walking, display, comparison |
| Infrastructure | 12 .c | ~400 | Memory, errors, name mangling, scope management |
| Code generation | 3 .c | ~150 | Backend .int.c emission |
| NVIDIA additions | 3 .c | ~110 | CUDA transforms, attribute validation, lambda wrappers |
See Function Map for the complete address-to-source-file table.
C++ Runtime Region (0x7DF400 -- 0x829722, 304 KB)
Statically-linked library code with no EDG source attribution. Contains approximately 900 functions from three libraries:
Berkeley SoftFloat 3e (0x7E0D30 -- 0x7E4150, ~80 functions). IEEE 754 arithmetic for half-precision (float16), extended precision (float80), and quad-precision (float128). Operations: add, sub, mul, div, sqrt, comparisons, int/float conversions. Global state at 12D4820 (exception flags) and 12D4821 (rounding mode). Used by the EDG floating.c subsystem for constant folding of non-native float types.
libstdc++ / libsupc++ (0x7E42E0 -- 0x829600, ~800 functions). The C++ runtime:
- operator new / operator delete with new-handler retry loop (0x7E42E0)
- Exception handling: __cxa_throw (0x823050), __cxa_begin_catch (0x822EB0), __cxa_allocate_exception (0x7E4750), std::terminate (0x8231A0)
- Emergency exception pool: 72,704-byte fallback allocator for OOM during exception handling (0x7E45C0)
- iostream initialization: ios_base::Init constructor/destructor (0x7E5650 / 0x7E5F20) setting up cout/cin/cerr + wide variants
- Full locale system: 600+ functions implementing ctype, num_get, num_put, numpunct, collate, time_get/put, money_get/put, moneypunct, messages, and codecvt facets for both char and wchar_t
CUDA-aware name demangler (at 0x7CABB0, technically in the EDG tail region). NVIDIA's custom Itanium ABI demangler with extensions for CUDA lambda wrapper templates. Recognizes mangled prefixes: "Unvdl" for __nv_dl_wrapper_t<>, "Unvdtl" for __nv_dl_wrapper_t<> with trailing return, and "Unvhdl" for __nv_hdl_wrapper_t<>.
CRT startup (0x40918C and 0x829640 -- 0x829722). _start at 0x40918C calls __libc_start_main(main@0x408950, init@0x829640, fini@0x8296D0). The .fini_array processor at 0x8296E0 iterates backwards through function pointers at off_D428A0.
.rodata -- Read-Only Data (2.48 MB)
The .rodata section at 0x829740 -- 0xAA3FA3 holds all constant data: string literals, jump tables, error message templates, IL metadata tables, and format strings. Major structures:
Error Message Table (off_88FAA0)
The EDG diagnostic system's message template table. An array of 3,795 const char* pointers, indexed by error code 0--3794:
off_88FAA0[0] = "" // error 0: unused
off_88FAA0[1] = "last line of file ends ..." // error 1
...
off_88FAA0[3794] = "..." // error 3794
Each pointer references a NUL-terminated format string elsewhere in .rodata containing % fill-in specifiers (%t = type, %s = string, %n = name, %sq = quoted string, %p = position, %d = decimal). Error codes above 3456 are CUDA-specific (338 entries covering execution space violations, lambda restrictions, architecture feature gates). See Diagnostic Overview.
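The indexing rules above can be sketched in a few lines. This is an illustration only: it handles just the six fill-in specifiers listed in the text (the real diagnostic formatter may accept more forms), and the helper names and the sample template are hypothetical.

```python
import re

ERROR_TABLE_ENTRIES = 3795   # codes 0..3794
LAST_EDG_ERROR_CODE = 3456   # codes above this are CUDA-specific

# Only the specifiers named in the text; "sq" must be tried before "s".
FILL_IN_RE = re.compile(r"%(sq|[tsnpd])")


def is_cuda_specific(code):
    """True if the error code falls in the CUDA-specific range of the table."""
    if not 0 <= code < ERROR_TABLE_ENTRIES:
        raise ValueError(f"error code {code} is outside the table")
    return code > LAST_EDG_ERROR_CODE


def fill_ins(template):
    """List the fill-in specifiers of a message template, in order."""
    return FILL_IN_RE.findall(template)


# Made-up template, not a string taken from the binary.
print(fill_ins("cannot convert %t to %t at %p: %sq"))
print(sum(is_cuda_specific(c) for c in range(ERROR_TABLE_ENTRIES)))
```

The second line counts the codes classified as CUDA-specific, which should agree with the 338 entries quoted above.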
IL Entry Kind Name Table (off_E6DD80)
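Maps the 85 entry_kind enum values (0--84) to human-readable strings. Used by the IL display subsystem (il_to_str.c) for debug output. Note that the pointer array at 0xE6DD80 falls inside the .data address range (0xD46480 -- 0xE7EFF0), not .rodata; it is described here alongside the string data it indexes: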
off_E6DD80[0] = "scope"
off_E6DD80[6] = "type"
off_E6DD80[11] = "routine"
off_E6DD80[23] = "variable"
...
off_E6DD80[84] = "last" // sentinel
The il_one_time_init function (sub_5CF7F0) validates at startup that this table ends with the "last" sentinel, catching version mismatches between the table and the enum.
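A sentinel check of this kind can be sketched as follows. How sub_5CF7F0 actually performs the comparison is not shown here, so treat this as an assumption about the shape of the check; the function name `check_entry_kind_names` and the placeholder strings are hypothetical.

```python
ENTRY_KIND_COUNT = 85     # enum values 0..84
SENTINEL_NAME = "last"


def check_entry_kind_names(names):
    """Raise ValueError if the name table and the enum appear out of sync."""
    if len(names) != ENTRY_KIND_COUNT:
        raise ValueError(f"expected {ENTRY_KIND_COUNT} names, got {len(names)}")
    if names[-1] != SENTINEL_NAME:
        raise ValueError(f"table ends with {names[-1]!r}, not {SENTINEL_NAME!r}")


# Only the entries quoted above are filled in; the rest are placeholders.
names = ["<not shown>"] * ENTRY_KIND_COUNT
names[0], names[6], names[11], names[23] = "scope", "type", "routine", "variable"
names[84] = SENTINEL_NAME
check_entry_kind_names(names)   # passes silently
```

If an enum value were added without a matching string, the sentinel would no longer sit at index 84 and the check would fail at startup rather than producing mislabeled debug output later.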
EDG Source File Path Strings
Approximately 65 string literals of the form /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/<file>.<ext>. These are __FILE__ expansions embedded in assertion macros. Each is referenced by the corresponding assert stub in the 0x403300 region.
Jump Tables
Switch-statement jump tables for the major dispatch functions. The largest are:
- Expression parser dispatch (~120 case targets)
- Declaration specifier dispatch (~80 case targets)
- IL walker entry-kind dispatch (~85 case targets)
- Backend code generation dispatch (~90 case targets)
Format Strings
Printf-style format strings for the .int.c backend emitter. These include CUDA runtime boilerplate templates ("#include \"crt/host_runtime.h\"", "static __nv_managed_rt ...", "void __device_stub__...") and IL display format strings.
.data -- Initialized Globals (1.22 MB)
The .data section at 0xD46480 -- 0xE7EFF0 holds all initialized global variables. Major structures, ordered by address:
Attribute Descriptor Table (off_D46820)
The master attribute dispatch table, starting at 0xD46820 and extending to approximately 0xD47A60. Each entry is 32 bytes and describes one EDG/CUDA attribute kind: kind code (1 byte), flags (1 byte), name string pointer, validation function pointer, and application function pointer. See Attribute System Overview.
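One plausible packing of the five fields into 32 bytes is two leading bytes, six bytes of alignment padding, and three 8-byte pointers. The sketch below decodes entries under that assumption; the field offsets, the `AttrDescriptor` name, and the sample values are not confirmed by the binary and are labeled as hypothetical.

```python
import struct
from collections import namedtuple

# ASSUMED layout: kind (u8), flags (u8), 6 pad bytes, then three 64-bit pointers.
ATTR_ENTRY = struct.Struct("<BB6xQQQ")
AttrDescriptor = namedtuple(
    "AttrDescriptor", "kind flags name_ptr validate_fn apply_fn")

TABLE_START = 0xD46820
TABLE_END_APPROX = 0xD47A60   # the text gives this end address as approximate


def parse_attr_table(raw):
    """Decode a byte string holding consecutive 32-byte descriptors."""
    return [AttrDescriptor(*fields) for fields in ATTR_ENTRY.iter_unpack(raw)]


# Rough entry count implied by the address range.
approx_entries = (TABLE_END_APPROX - TABLE_START) // ATTR_ENTRY.size

# Hypothetical entry, packed and decoded again to show the round trip.
raw = ATTR_ENTRY.pack(1, 0, 0x829740, 0x409350, 0x409400)
print(parse_attr_table(raw)[0], approx_entries)
```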
Diagnostic Fill-in Tables (off_D481E0)
Named-label fill-in descriptors for the diagnostic system. Maps fill-in label strings to format specifier dispatch codes. Located at 0xD481E0.
Keyword Tables
The EDG keyword registration system stores keyword-to-token-ID mappings. Initialized during fe_translation_unit_init (sub_5863A0) with 200+ C/C++ keywords (from auto through co_yield), 60+ type trait intrinsics (__is_class, __has_trivial_copy, etc.), and CUDA extension keywords (__device__, __global__, __shared__, __constant__, __managed__, __launch_bounds__, __grid_constant__).
Error Severity Override Table
Maps error codes to their overridden severity levels. Populated by --diag_suppress, --diag_warning, --diag_error CLI flags.
libstdc++ Vtables (0xD428C0 -- 0xD45E00, in .data.rel.ro)
The .data.rel.ro section holds vtables for all statically-linked C++ classes. Key vtables:
| Address | Class |
|---|---|
| off_D42C00 | __gnu_cxx::__concurrence_lock_error |
| off_D42C28 | __gnu_cxx::__concurrence_unlock_error |
| off_D42CD8 | std::bad_alloc |
| off_D45740 | std::basic_istream<char> |
| off_D457C0 | std::basic_istream<wchar_t> |
| off_D45860 | std::basic_ostream<char> |
| off_D458E0 | std::basic_ostream<wchar_t> |
| off_D45A28 | std::basic_streambuf<char> |
| off_D45A78 | std::basic_streambuf<wchar_t> |
Exception Handler Pointers (0xE7EExx)
Located at the tail of .data:
| Address | Type | Identity |
|---|---|---|
| off_E7EEB0 | qword | atexit target: basic_streambuf<wchar_t> object |
| off_E7EEB8 | qword | atexit target: basic_streambuf<char> object |
| off_E7EEC0 | qword | std::unexpected_handler pointer |
| off_E7EEC8 | qword | std::terminate_handler pointer |
EDG Diagnostic List Head (0xE7FE40)
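A 40-byte doubly-linked list structure at 0xE7FE40..0xE7FE68. Initialized by ctor_001 as an empty self-referencing sentinel (both forward and backward pointers point to the list head). Used to chain diagnostic records during compilation. Although described here with the initialized globals, the address lies inside .bss (which begins at 0xE7F000); the structure gets its non-zero state from ctor_001 at startup rather than from the ELF image.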
.bss -- Zero-Initialized Globals (4.34 MB)
The .bss section at 0xE7F000 -- 0x12D6F20 is the largest section by virtual size. It contains all zero-initialized global state for both the EDG compiler and the statically-linked runtime. The .bss occupies no space in the ELF file on disk -- it is allocated and zeroed by the OS loader.
The 4.34 MB .bss divides into two logical regions; the 24-byte .tls section that immediately follows it is described alongside them:
EDG Compiler State (0xE7F000 -- 0x1290000, ~4.1 MB)
The bulk of .bss holds the EDG frontend's global state. Major structures:
Scope stack and symbol tables (~1.5 MB). The EDG scope stack (scope_stk.c) maintains nested scope contexts during parsing. Each scope entry is 784 bytes. The scope stack globals, various hash tables for name lookup, and the associated symbol table arrays consume the largest contiguous blocks.
IL region tracking (~800 KB). Region indices, region-to-scope mappings (qword_126EB90), region memory tables (qword_126EC88), and IL entry list heads. The region counter at dword_126EC80 tracks active regions. Each function definition creates a new region.
Translation unit state (~400 KB). The TU descriptor itself is dynamically allocated (424 bytes), but the per-TU global variables -- source file table, include stack, macro state, conditional compilation depth -- live in .bss. sub_7A4860 (reset_tu_state) zeroes these between compilations.
Parser state (~600 KB). Token lookahead buffers, declaration nesting depth, template argument stacks, expression evaluation context. The lexer maintains character classification tables and identifier hash buckets.
Error and diagnostic state (~200 KB). Error count (qword_126ED90), warning count (qword_126ED98), error limit (qword_126ED60), diagnostic suppression bitmaps, and the stream state table at 126ED80..126EDE0 (13 qwords including the stderr FILE* at qword_126EDF0).
Configuration flags (~100 KB). The 0x106xxxx region contains hundreds of dword flags set by CLI parsing and used throughout compilation. Examples:
| Address | Type | Identity |
|---|---|---|
| dword_106B640 | int | Keep-in-IL guard flag |
| dword_106B4B0 | int | Catastrophic error re-entry guard |
| dword_106B4BC | int | Warnings-as-errors recursion guard |
| dword_106B9E8 | int | TU stack depth |
| dword_106BA08 | int | TU-copy mode flag |
| dword_106BBB8 | int | Output format (0=text, 1=SARIF) |
| dword_106BCD4 | int | Predefined macro file mode |
| dword_106C088 | int | Warnings-are-errors mode |
| dword_106C188 | int | wchar_t keyword enabled |
| dword_106C254 | int | Skip backend (errors present) |
| dword_106C2C0 | int | GPU mode flag |
| dword_1065928 | int | Internal error re-entry guard |
Lambda capture bitmasks (~256 bytes). Two 1024-bit bitmasks recording which lambda capture counts were used during parsing:
| Address | Size | Identity |
|---|---|---|
| unk_1286900 | 128 bytes | Host-device lambda capture counts |
| unk_1286980 | 128 bytes | Device lambda capture counts |
Bit N set means a lambda with N captures was encountered, triggering emission of the corresponding __nv_dl_wrapper_t or __nv_hdl_wrapper_t specialization in the backend.
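The bitmask mechanics can be sketched with a 128-byte array. The bit ordering within each byte (least significant bit first) is an assumption made for the example, and the helper names are illustrative rather than taken from the binary.

```python
MASK_BYTES = 128   # 128 bytes = 1024 bits, one bit per capture count


def record_capture_count(mask, n):
    """Set bit n: a lambda with n captures was seen."""
    if not 0 <= n < len(mask) * 8:
        raise ValueError(f"capture count {n} does not fit in the mask")
    mask[n >> 3] |= 1 << (n & 7)   # ASSUMED ordering: LSB-first within a byte


def recorded_capture_counts(mask):
    """Return every n whose bit is set, in ascending order."""
    return [n for n in range(len(mask) * 8) if (mask[n >> 3] >> (n & 7)) & 1]


device_mask = bytearray(MASK_BYTES)
for count in (0, 3, 3, 17):          # duplicates collapse into a single bit
    record_capture_count(device_mask, count)

print(recorded_capture_counts(device_mask))   # [0, 3, 17]
```

A backend pass would then iterate the recorded counts and emit one wrapper specialization per set bit, which is why the mask stores presence rather than a tally.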
IL walker callbacks (5 function pointers at qword_126FB68..126FB88). The five IL tree-walk callback slots: entry filter, entry replace, pre-walk check, string callback, and entry callback. Swapped in and out by different IL traversal passes.
libstdc++ Runtime State (0x1290000 -- 0x12D6F20, ~280 KB)
SoftFloat globals (16 bytes). Exception flags at byte_12D4820, rounding mode at byte_12D4821.
Emergency exception pool (24 bytes of metadata). Free-list head (qword_12D4868), base address (qword_12D4870), capacity (qword_12D4878 = 72,704 bytes). The pool itself is heap-allocated at startup by ctor_004.
Locale system (~2 KB). The "C" locale singleton (unk_12D5E60), global locale impl pointer (qword_12D5E70), classic locale impl pointer (qword_12D5E78), character classification tables (12D5BE0..12D5D50), locale ID counter (dword_12D5E58), and pthread_once control variables.
iostream objects (~2 KB). The six standard stream objects and their backing file buffers:
| Address | Identity |
|---|---|
| 0x12D6000 | std::cerr |
| 0x12D6060 | std::cin |
| 0x12D60C0 | std::cout |
| 0x12D5EE0 | std::wcerr |
| 0x12D5F40 | std::wcin |
| 0x12D5FA0 | std::wcout |
Each stream object is backed by a basic_filebuf at a known offset (e.g., cout's filebuf at 0x12D67E0).
Demangler caches (40 bytes). Template argument cache at qword_12C7B40/12C7B48/12C7B50 (capacity/count/buffer pointer, grows by 500 entries via realloc). Block-scope suppress flag at dword_12C6A24.
EDG internal lists (7 x 48 bytes). Seven doubly-linked list structures at 12868C0..1286780 initialized by ctor_003. Serve as symbol/scope/type caches with destructor sub_6BD820.
Thread-Local Storage (0x12D6F20 -- 0x12D6F38, 24 bytes)
The .tls section holds exactly 24 bytes of thread-local data. This is the __cxa_eh_globals structure (accessed via __readfsqword(0) - 16):
struct __cxa_eh_globals {
void *caught_exception_stack; // +0x00: linked list of caught exceptions
uint32_t uncaughtExceptions; // +0x08: count of in-flight exceptions
};
Despite cudafe++ being single-threaded, the TLS infrastructure exists because libstdc++ exception handling unconditionally uses TLS offsets compiled into the static library.
.ctors / .dtors -- Constructor/Destructor Tables
The .ctors section at 0xD42858 is 88 bytes: a -1 sentinel (8 bytes), 9 constructor function pointers (72 bytes), and a 0 terminator (8 bytes). The 9 constructors are ctor_001 through ctor_009 documented above.
The .dtors section at 0xD428B0 is 16 bytes: a -1 sentinel and a 0 terminator. No destructors are registered -- all cleanup is done via __cxa_atexit handlers registered during construction.
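The table framing described above (a -1 sentinel, the function pointers, a 0 terminator) can be decoded as shown in this sketch. The function name `parse_ctors` is illustrative, and the order of the pointers in the sample is an assumption: the text does not state how the nine constructors are ordered inside the table.

```python
import struct

SENTINEL = 0xFFFFFFFFFFFFFFFF   # -1 as an unsigned 64-bit value


def parse_ctors(raw):
    """Return the function pointers between the -1 sentinel and 0 terminator."""
    if len(raw) % 8:
        raise ValueError("table size must be a multiple of 8 bytes")
    words = struct.unpack("<%dQ" % (len(raw) // 8), raw)
    if len(words) < 2 or words[0] != SENTINEL or words[-1] != 0:
        raise ValueError("missing -1 sentinel or 0 terminator")
    return list(words[1:-1])


# Constructor addresses from the constructor table; in-table order is ASSUMED.
CTOR_ADDRS = [0x408B40, 0x408B90, 0x408C20, 0x408E50, 0x408ED0,
              0x408F50, 0x408FD0, 0x409090, 0x409150]
ctors_image = struct.pack("<11Q", SENTINEL, *CTOR_ADDRS, 0)
dtors_image = struct.pack("<2Q", SENTINEL, 0)

print(len(ctors_image), len(parse_ctors(ctors_image)))
print(len(dtors_image), parse_ctors(dtors_image))
```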
.eh_frame / .gcc_except_table -- Exception Handling
The .eh_frame section (582 KB) contains DWARF Call Frame Information (CFI) records for stack unwinding during C++ exception propagation. The .gcc_except_table section (13.2 KB) contains GCC Language-Specific Data Area (LSDA) records that map program counters to catch handlers and cleanup functions.
The .eh_frame_hdr section (48.9 KB) is a binary search index into .eh_frame, enabling O(log n) lookup of unwind information by instruction pointer during exception throw.
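The lookup that .eh_frame_hdr enables can be illustrated with a sorted table and a binary search. This is a simplified sketch: the entries are made up, the class name `UnwindIndex` is hypothetical, and a real unwinder additionally checks that the program counter lies within the address range recorded in the selected FDE.

```python
from bisect import bisect_right


class UnwindIndex:
    """Sorted (function start -> FDE address) table with O(log n) lookup."""

    def __init__(self, entries):
        ordered = sorted(entries)
        self._starts = [start for start, _ in ordered]
        self._fdes = [fde for _, fde in ordered]

    def lookup(self, pc):
        """Return the FDE for the last function starting at or before pc."""
        i = bisect_right(self._starts, pc) - 1
        return self._fdes[i] if i >= 0 else None


# Hypothetical entries: (initial location, FDE address).
index = UnwindIndex([
    (0x7E42E0, 0xCB1300),
    (0x7E4750, 0xCB1340),
    (0x823050, 0xCB9000),
])
print(hex(index.lookup(0x7E4760)))   # pc just inside the second function
print(index.lookup(0x400000))        # below every entry
```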
These sections exist because libstdc++ exception handling requires them. cudafe++ itself rarely throws exceptions -- the EDG frontend uses longjmp-based error recovery. However, the statically-linked libstdc++ code (particularly operator new and locale initialization) uses C++ exceptions internally.
.plt / .got.plt -- PLT Stubs
The .plt section (2.2 KB, 141 entries) and .got.plt (1.1 KB) implement lazy binding for the 141 libc functions that cudafe++ imports despite static linking. These are glibc internal symbols resolved at load time. The PLT stubs are the standard x86-64 two-instruction pattern: indirect jump through GOT, then fallback to the dynamic linker (which never executes since the binary is statically linked -- the GOT is pre-resolved by the static linker).
Static Libraries Linked
The binary statically links four library components:
| Library | Functions | .text Range | Purpose |
|---|---|---|---|
| libstdc++ (locale) | ~600 | 0x7EA800 -- 0x829600 | Full locale facet implementations |
| libstdc++ (iostream/exception) | ~60 | 0x7E42E0 -- 0x7EA800 | Streams, exceptions, operator new |
| Berkeley SoftFloat 3e | ~80 | 0x7E0D30 -- 0x7E4150 | float16/float80/float128 arithmetic |
| glibc CRT | ~10 | 0x40918C, 0x829640 -- 0x829722 | _start, init, fini |
No shared libraries are loaded at runtime. The binary is fully self-contained.
Virtual Address Space Map
0x400000 +-----------------------+
| ELF headers | 10.5 KB
0x402A18 | .init | 24 B
0x402A30 | .plt | 2.2 KB
0x403300 | .text | 4.15 MB
| assert stubs | 34 KB (0x403300 - 0x408B40)
| constructors | 8 KB (0x408B40 - 0x409350)
| EDG main body | 3.61 MB (0x409350 - 0x7DF400)
| C++ runtime | 304 KB (0x7DF400 - 0x829722)
0x829722 | padding |
0x829724 | .fini | 14 B
0x829740 | .rodata | 2.48 MB
| error table | 30 KB (off_88FAA0)
| string literals | ~2 MB
| IL kind names | <1 KB (pointer table off_E6DD80 lies in .data) |
| jump tables | ~400 KB
0xAA3FA4 | .eh_frame_hdr | 48.9 KB
| [gap] |
0xCB1210 | .eh_frame | 568 KB
0xD3F398 | .gcc_except_table | 13.2 KB
0xD42858 | .ctors | 88 B
0xD428B0 | .dtors | 16 B
0xD428C0 | .data.rel.ro | 13.3 KB (vtables)
0xD45E00 | [padding/GOT] |
0xD46480 | .data | 1.22 MB
| attribute table | ~5 KB (off_D46820)
| keyword tables | variable
| handler pointers | (at 0xE7EExx)
0xE7EFF0 | [padding] |
0xE7F000 | .bss | 4.34 MB
| diagnostic list | (at 0xE7FE40)
| EDG compiler state | ~4.1 MB (0xE7F000 - 0x1290000)
| libstdc++ state | ~280 KB (0x1290000 - 0x12D6F20)
0x12D6F20| .tls | 24 B
0x12D6F38| extern | 1.1 KB
0x12D73A8+-----------------------+
Key Observations
The .bss dominates. At 4.34 MB, the .bss is the largest section -- larger than .text. This reflects the EDG frontend's design: hundreds of global variables hold parser state, scope stacks, symbol tables, and IL region metadata. A reimplementation should strongly consider replacing these globals with a context struct passed through the call chain.
Static linking adds 304 KB of runtime code, most of it dead weight. The C++ runtime region (0x7DF400 -- 0x829722) contains roughly 900 functions, the majority of which (600+ locale facet methods) are never called by cudafe++. The locale system is pulled in transitively through iostream initialization. A reimplementation that avoids std::cout/std::cerr could eliminate the locale code entirely.
The EDG code is tightly packed. The 3.61 MB EDG main body has almost no inter-function padding. Functions from the same source file are contiguous, and the alphabetical ordering by filename is consistent across the entire range. This makes address-to-source-file attribution reliable.
The binary is position-dependent. No PIE (Position-Independent Executable) flag is set. All code references use absolute addressing. The .got is minimal (56 bytes / 7 entries) -- almost all data references are direct.
Methodology
This page documents the reverse engineering methodology used to produce every page in this wiki. The goal is full transparency: a reader should be able to reproduce any finding by following the same techniques against the same binary. Every claim in the wiki traces back to one of four evidence categories (CONFIRMED, HIGH, MEDIUM, LOW), and this page defines exactly what each level means, what tools produced the raw data, and how that data was refined into the structured documentation that follows.
Toolchain
| Component | Version | Role |
|---|---|---|
| IDA Pro | 9.0 (64-bit) | Interactive disassembler and database host |
| Hex-Rays | x86-64 decompiler (IDA 9.0 bundled) | Pseudocode generation for all 6,483 functions |
| IDAPython | 3.x (IDA-embedded) | Scripted extraction via analyze_cudafe++.py (531 lines) |
| Target binary | cudafe++ from CUDA Toolkit 13.0 | ELF 64-bit, statically linked, stripped, 8,910,936 bytes |
| IDA database | cudafe++.i64 | 247 MB analysis state (all function boundaries, xrefs, type info, decompilation caches) |
The binary was loaded into IDA Pro 9.0 with default x86-64 analysis settings. IDA's auto-analysis resolved all code/data boundaries, generated function boundaries for 6,483 functions, and identified 52,489 string literals. The Hex-Rays decompiler was invoked on all 6,483 functions; the IDAPython extraction log reports 6,343 successful decompilations (the remaining 140 failures are exception personality routines, SoftFloat leaf functions, and tiny thunks where Hex-Rays cannot reconstruct a valid C AST). However, due to function-name collisions in the output filenames (multiple sub_XXXXXX entries mapping to the same sanitized name after / replacement), the actual decompiled output directory contains 6,202 unique .c files -- the number used throughout this wiki.
Extraction Script
All raw data was exported from the IDA database in a single automated pass using analyze_cudafe++.py, an IDAPython script that runs inside IDA's scripting environment. The script produces 12 output artifacts:
| Artifact | File | Records | Size | Description |
|---|---|---|---|---|
| String table | cudafe++_strings.json | 52,489 strings | 9.2 MB | Every string literal with address, type, and all cross-references |
| Function table | cudafe++_functions.json | 6,483 functions | 12 MB | Address, size, instruction count, callers, callees per function |
| Import table | cudafe++_imports.json | 142 imports | 16 KB | Imported PLT symbols (glibc wrappers in static binary) |
| Segment table | cudafe++_segments.json | 26 segments | 3.3 KB | ELF section addresses, sizes, types, permissions |
| Cross-reference table | cudafe++_xrefs.json | 1,243,258 xrefs | 154 MB | Every code and data xref with source function attribution |
| Comment table | cudafe++_comments.json | 22,911 comments | 2.0 MB | All IDA comments (regular + repeatable) |
| Name table | cudafe++_names.json | 54,771 names | 3.5 MB | All named locations (IDA auto-names + user-defined) |
| Call graph | cudafe++_callgraph.json + .dot | 67,756 edges | 7.4 MB | Complete inter-procedural call graph (5,057 unique callers, 5,382 unique callees) |
| .rodata dump | cudafe++_rodata.bin | 2,599,011 bytes | 2.5 MB | Raw bytes of the read-only data section |
| Disassembly | disasm/<func>_<addr>.asm | 6,342 files | 86 MB | Per-function annotated disassembly with hex bytes |
| CFG graphs | graphs/<func>_<addr>.json + .dot | 12,684 files | 184 MB | Per-function basic-block graph with instructions and edges (JSON + DOT) |
| Decompiled code | decompiled/<func>_<addr>.c | 6,202 files | 38 MB | Hex-Rays pseudocode per function |
Script Architecture
The script is structured as a main() function that calls idaapi.auto_wait() to block until IDA's auto-analysis completes, then executes 12 extraction passes in a fixed order. Output is written to four directories: the root output directory (JSON databases), graphs/ (per-function CFGs), disasm/ (per-function disassembly), and decompiled/ (per-function pseudocode). Directories are created if they do not exist.
The 12 passes, in execution order:
- export_all_strings() -- Enumerates idautils.Strings(), then for each string walks XrefsTo(string_ea) to record every function that references it. Each string entry captures the address, string value, string type code, and a list of xref records ({from_addr, func_name, xref_type}). This is the foundation for source attribution (see below).
- export_all_functions() -- For each function in idautils.Functions(), records start/end address, size, instruction count (via idc.is_code() on each head), library flag (FUNC_LIB), thunk flag (FUNC_THUNK), and builds caller/callee lists. Callers are found via XrefsTo(func_start); callees via XrefsFrom(head) filtered to call-type xrefs (fl_CF = type 16, fl_CN = type 17).
- export_imports() -- Enumerates all imported modules via idaapi.get_import_module_qty() and idaapi.enum_import_names(). Records module name, symbol name, address, and ordinal for each of the 142 glibc imports.
- export_segments() -- Iterates idautils.Segments() to record each ELF section's name, start/end address, size, type code, and permission bits.
- export_xrefs() -- Full enumeration of all cross-references from every instruction head in every function. For each xref, records source address, source function, target address, target function (if any), and xref type code. Produces the 1,243,258-record xref table. The six xref type codes in the output (names follow IDA's numbering, in which 19 is fl_JN and 21 is fl_F):

  | Type | Code | Count | Meaning |
  |---|---|---|---|
  | dr_O | 1 | 29,631 | Data offset reference |
  | dr_W | 2 | 11,488 | Data write reference |
  | dr_R | 3 | 42,364 | Data read reference |
  | fl_CN | 17 | 67,756 | Code near call |
  | fl_JN | 19 | 189,364 | Code near jump |
  | fl_F | 21 | 902,655 | Ordinary flow (fall-through to the next instruction) |

- export_comments() -- Walks every instruction head in the database via idautils.Heads(), extracting both regular comments (idc.get_cmt(ea, 0)) and repeatable comments (idc.get_cmt(ea, 1)).
- export_names() -- Iterates idautils.Names() to export all named locations (function names, data labels, IDA auto-generated names).
- extract_rodata() -- Reads the raw bytes of the .rodata segment via ida_bytes.get_bytes() and writes them to a binary file. Used for offline string scanning and jump table analysis.
- export_callgraph() -- Builds the 67,756-edge call graph by iterating every function and scanning its instruction heads for outgoing call xrefs (fl_CN, fl_CF). Output in both JSON (array of {from, from_addr, to, to_addr} edge records) and Graphviz DOT format (67,759 lines).
- export_complete_disassembly() -- Per-function disassembly files. For each function, iterates all instruction heads within the function's address range, generating hex byte dumps alongside disassembly text via idc.generate_disasm_line(). Each file includes a header with function name, address range, and byte size.
- export_function_graphs() -- Per-function control flow graphs via idaapi.FlowChart(). For each basic block: block ID, start/end address, size, and full instruction listing. Block-to-block edges (fall-through and branch targets) are extracted via block.succs(). Output as both JSON (structured blocks + edges) and DOT (for Graphviz visualization).
- export_decompilation() -- Calls idaapi.init_hexrays_plugin() to initialize the Hex-Rays decompiler, then iterates all functions and calls idaapi.decompile(func_ea). On success, the pseudocode string (str(cfunc)) is written to a .c file with a header comment containing the function name and address. Failures are silently caught via a bare except Exception and skipped.
The script is invoked via IDA's headless batch mode or interactive scripting console. It does not call qexit() at the end, allowing the IDA database to remain open for further interactive analysis after extraction. Total extraction time is approximately 30-45 minutes on a workstation-class machine, dominated by the 6,483 decompilation calls in pass 12.
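The per-type xref counts can be re-derived offline from the exported table as a sanity check. The sketch below is plain Python that runs outside IDA; it assumes each xref record carries its numeric type code in a field named `type`, which is a guess at the JSON schema, and the sample records are made up.

```python
from collections import Counter

# Counts per numeric xref type code, as reported for the export.
REPORTED_COUNTS = {1: 29_631, 2: 11_488, 3: 42_364,
                   17: 67_756, 19: 189_364, 21: 902_655}
REPORTED_TOTAL = 1_243_258
CALLGRAPH_EDGES = 67_756


def tally_xref_types(records):
    """Count records per type code; 'type' is an ASSUMED field name."""
    return Counter(record["type"] for record in records)


# The reported per-type counts add up to the reported total, and the
# near-call count equals the number of call graph edges.
assert sum(REPORTED_COUNTS.values()) == REPORTED_TOTAL
assert REPORTED_COUNTS[17] == CALLGRAPH_EDGES

# Tiny hypothetical sample to show the tally itself.
sample = [{"type": 17}, {"type": 21}, {"type": 21}, {"type": 3}]
print(tally_xref_types(sample))
```

Running `tally_xref_types` over the full cudafe++_xrefs.json and comparing the result with `REPORTED_COUNTS` would confirm that an export is complete.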
Source Attribution Technique
The single most powerful technique in this analysis is source attribution via __FILE__ strings. The EDG C++ frontend uses C-style assertions throughout its codebase. When an assertion fires, the handler receives the source file path, line number, and function name as compile-time string constants embedded by the __FILE__, __LINE__, and __func__ macros. Because the binary is stripped (no .symtab), these assertion strings are the only surviving link to the original source tree.
The Assert Handler
The central assert handler is sub_4F2930, located in error.c. It is a __noreturn function that formats and emits an internal compiler error message, then terminates the process. A total of 2,139 functions in the binary call sub_4F2930, with 5,178 total call sites (many functions have multiple assertion points throughout their bodies).
The highest-density callers are the 235 assert stubs in the region 0x403300--0x408B40. Each stub is exactly 29 bytes: three register loads (source file path via lea rdi, line number via mov esi, function name via lea rdx) followed by a call to sub_4F2930:
sub_403300: ; assert stub for is_aliasable (attribute.c:10897)
lea rdi, aAttributeC ; "/dvs/p4/.../EDG_6.6/src/attribute.c"
mov esi, 10897 ; line number (integer, not string)
lea rdx, aIsAliasable ; "is_aliasable"
call sub_4F2930 ; internal_error(__FILE__, __LINE__, __func__)
Of the 235 stubs, 200 reference .c file paths and 35 reference .h file paths (inlined assertions from header files). The stubs are sorted approximately by source file name within the stub region -- the linker grouped them from all 52 .c compilation units into one contiguous block.
Beyond the dedicated stubs, 1,904 additional functions contain inline assertion checks: the lea rdi, <file_path> instruction appears within the function body at the assertion site, not in a separate stub. These inline assertions provide the same source-file attribution as the stubs.
The Attribution Chain
The attribution chain works in three steps:
- String discovery. Extract all strings matching the EDG build path prefix /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/. This yields one string per source file, each cross-referenced by the assert stubs that load it.
- Xref tracing. For each assert stub, follow XrefsTo() to find which main-body functions call it. A function at 0x40DFD0 that calls the attribute.c:5108 stub was compiled from attribute.c. This attributes the caller to the source file.
- Range extension. Assert stubs are sparse -- not every function contains an assertion. Once a set of functions in a contiguous address range are attributed to the same source file, the entire range is assigned to that file. This works because the linker places all object code from a single .c file contiguously, and the files are arranged roughly alphabetically by filename.
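The range-extension step can be sketched as follows. This is one conservative reading of the rule, not the exact procedure used in the analysis: a function inherits a file only when the nearest attributed function on each side agrees. Apart from 0x40DFD0, the addresses are hypothetical.

```python
from bisect import bisect_right

def extend_ranges(func_addrs, anchors):
    """Assign a file to every function that lies between two anchors of the same file.

    func_addrs: sorted list of function start addresses.
    anchors:    {address: source_file} for functions attributed via assert xrefs.
    """
    result = dict(anchors)
    anchor_addrs = sorted(anchors)
    for addr in func_addrs:
        if addr in anchors:
            continue
        i = bisect_right(anchor_addrs, addr)
        if i == 0 or i == len(anchor_addrs):
            continue  # no anchor on one side: leave unattributed
        prev_file = anchors[anchor_addrs[i - 1]]
        next_file = anchors[anchor_addrs[i]]
        if prev_file == next_file:
            result[addr] = prev_file
    return result

# Hypothetical layout: three anchored functions and two unanchored ones.
FUNCS = [0x40DFD0, 0x40E100, 0x40E400, 0x412000, 0x412300]
ANCHORS = {0x40DFD0: "attribute.c", 0x40E400: "attribute.c", 0x412300: "class_decl.c"}
attributed = extend_ranges(FUNCS, ANCHORS)
print({hex(a): f for a, f in sorted(attributed.items())})
```

The function at 0x412000 sits on a file boundary and stays unattributed here, which mirrors the caution behind the LOW confidence level: address proximity alone cannot say which side of the boundary it belongs to.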
This technique attributed 2,209 functions (34.1% of the binary) to specific source files. The remaining 4,274 functions fall into three categories: C++ runtime code (1,085 functions from libstdc++/glibc, identifiable by address range), PLT/init stubs (283 functions), and unmapped EDG functions (2,906 functions that contain no assertions and cannot be confidently attributed).
Build Path
The full build path embedded in the binary is:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/
This reveals the NVIDIA internal Perforce depot structure (/dvs/p4/), the release branch (r13.0), and the EDG version (EDG_6.6). It confirms the binary was built from EDG C++ Front End version 6.6, licensed from Edison Design Group.
Confidence Levels
Every identification in the raw sweep reports and wiki pages carries one of four confidence levels:
| Level | Tag | Criteria | Example |
|---|---|---|---|
| CONFIRMED | Direct match | The function's identity is proven by assertion operands that encode the exact function name, source file, and line number. No ambiguity. | sub_403300 loads "is_aliasable", the attribute.c path, and the integer 10897 -- it is the assertion stub for is_aliasable() in attribute.c at line 10897. |
| HIGH | String + callgraph | The function references a distinctive string (error message, format string, keyword literal) AND its position in the call graph is consistent with a single plausible identity. | sub_459630 references 276 CLI flag strings and is called from main() at the position where command-line processing occurs -- identified as proc_command_line(). |
| MEDIUM | Pattern + context | The function matches a known EDG pattern (struct layout access, IL node walking, type query) and its address falls within the expected source file range, but no string or assertion directly confirms the identity. | A function at 0x5B3000 accesses the IL node kind field at the expected struct offset and falls within the il.c address range -- likely an IL accessor, but the specific function name is inferred. |
| LOW | Address proximity | The function's address falls within a source file's range, but no internal evidence (strings, struct accesses, callees) distinguishes it from neighboring functions. The attribution is based solely on the linker's contiguous placement of object code. | A small leaf function at 0x5B2F80 sits between two il.c-attributed functions -- probably from il.c, but it could be an inlined header function. |
In practice, approximately 34% of functions are CONFIRMED (via assert strings), ~20% are HIGH (via distinctive strings or unique callgraph positions), ~25% are MEDIUM, and ~21% are LOW or unattributed.
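The four criteria can be read as a priority order over the available evidence. The sketch below encodes that reading. The boolean evidence flags are an illustrative simplification and are not fields of any exported data.

```python
from enum import Enum

class Confidence(Enum):
    CONFIRMED = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4

def classify(has_assert_string=False, has_distinctive_string=False,
             callgraph_consistent=False, matches_edg_pattern=False,
             in_file_range=False):
    """Return the strongest level the evidence supports, or None if unattributed."""
    if has_assert_string:
        return Confidence.CONFIRMED
    if has_distinctive_string and callgraph_consistent:
        return Confidence.HIGH
    if matches_edg_pattern and in_file_range:
        return Confidence.MEDIUM
    if in_file_range:
        return Confidence.LOW
    return None

print(classify(has_assert_string=True))
print(classify(matches_edg_pattern=True, in_file_range=True))
print(classify(in_file_range=True))
```

Checking the strongest evidence first means a function with both an assertion string and a matching struct pattern is still reported as CONFIRMED.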
Call Graph Analysis
The complete call graph contains 67,756 edges connecting the 6,483 functions. This graph is the primary tool for understanding system architecture -- which subsystems call which, where the hot paths are, and how NVIDIA's additions integrate with the EDG base.
Hub Identification
Hub functions -- those with exceptionally high in-degree (many callers) or out-degree (many callees) -- reveal the architectural spine of the compiler:
| Hub Type | Function | Description | Degree |
|---|---|---|---|
| Top callee | sub_4F2930 | internal_error handler | 2,139 calling functions (5,178 call sites), including all 235 assert stubs |
| Top callee | Type query functions (104 total) | is_class_or_struct_or_union_type, etc. | 407 call sites for top query |
| Top caller | sub_7A40A0 | process_translation_unit | Calls into parser, IL, type system |
| Top caller | sub_459630 | proc_command_line (4,105-line monster) | Touches 276 flag variables |
| Top caller | sub_585DB0 | fe_one_time_init | 38 subsystem initializer calls |
| Cross-module bridge | sub_6BCC20 | Lambda preamble injection (NVIDIA) | Called from EDG statement handlers |
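Hub ranking of this kind reduces to counting in-degree and out-degree over the edge list. A minimal sketch, assuming the call graph is available as (caller, callee) pairs with one pair per call site; the toy edge list is illustrative only.

```python
from collections import Counter

def degree_tables(edges):
    """edges: iterable of (caller, callee) pairs, possibly one per call site."""
    unique = set(edges)  # count distinct caller/callee relationships, not call sites
    in_deg = Counter(callee for _, callee in unique)
    out_deg = Counter(caller for caller, _ in unique)
    return in_deg, out_deg

# Toy edge list (illustrative, not extracted from the binary).
EDGES = [
    ("sub_403300", "sub_4F2930"),
    ("sub_40DFD0", "sub_403300"),
    ("sub_40DFD0", "sub_4F2930"),
    ("sub_40DFD0", "sub_4F2930"),  # second call site in the same function
    ("main", "sub_459630"),
    ("main", "sub_585DB0"),
]
in_deg, out_deg = degree_tables(EDGES)
print(in_deg.most_common(1), out_deg.most_common(2))
```

Deduplicating before counting separates "number of calling functions" from "number of call sites". That is the same distinction as the 2,139 functions versus 5,178 call sites reported for sub_4F2930.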
Graph Structure
The call graph exhibits a layered structure typical of compiler frontends:
- Entry layer. main() at 0x408950 calls exactly 8 stage functions in sequence.
- Stage layer. Each stage function (init, CLI, parse, wrapup, backend) fans out to dozens of subsystem entry points.
- Core layer. The parser (expr.c, decls.c, statements.c) calls into the type system (types.c, exprutil.c), IL builder (il.c, il_alloc.c), and name lookup (lookup.c, scope_stk.c).
- Leaf layer. Memory management (mem_manage.c), error reporting (error.c), and type queries form the bottom of the call hierarchy, referenced from almost every subsystem.
NVIDIA's nv_transforms.c sits as a lateral extension at the core layer: it is called from class_decl.c, cp_gen_be.c, and statements.c (via nv_transforms.h inlines), but does not itself call back into the EDG parser. This clean separation suggests NVIDIA modifies the EDG source minimally, preferring to hook into existing EDG extension points rather than fork the core.
String-Based Discovery
The binary contains 52,489 strings in .rodata. These strings are the second most important evidence source after the assertion paths. Major categories:
| Category | Approximate Count | Usage |
|---|---|---|
| EDG assertion paths (/dvs/p4/...) | 65 (52 .c + 13 .h) | Source attribution |
| CUDA keyword strings | ~300 | Keyword table initialization, CLI flag names |
| Error message templates | ~3,800 | Diagnostic emission (off_88FAA0 error table, 3,795 entries) |
| C/C++ keyword strings | ~200 | Lexer token recognition |
| Format strings (%s, %d, etc.) | ~500 | Output formatting in .int.c emission and diagnostics |
| IL kind names | ~200 | IL node type display (off_E6DD80 table) |
| Type name fragments | ~400 | Mangling output, type display |
| CUDA architecture names (sm_XX) | ~50 | Architecture feature gating |
| Internal EDG config strings | ~200 | Build configuration, feature flags |
String Mining Techniques
Three string mining techniques are used throughout the analysis:
- Error message tracing. CUDA-specific error messages (e.g., "calling a __host__ function from a __device__ function is not allowed") are grepped from the string table, their xrefs traced to the emitting function, and the emitting function's callers analyzed to understand the validation logic that triggers the error.
- Keyword enumeration. The keyword initialization function (sub_5863A0) loads 200+ string constants in sequence. By reading the strings in load order, the complete CUDA keyword vocabulary is recovered -- including internal-only keywords not documented in the CUDA C++ Programming Guide.
- Format string analysis. Format strings in the backend (cp_gen_be.c) reveal the exact syntax of .int.c output. A string like "static void __device_stub__%s(" tells us the precise naming convention for device stub wrapper functions.
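Error message tracing is a two-hop lookup: string to emitting function, then emitting function to its callers. The sketch below assumes three dictionaries in hypothetical export shapes. The address 0x900000 and the sub_EMIT and sub_CHECK names are placeholders; only the message text is taken from the binary.

```python
def trace_message(needle, strings, string_xrefs, callers):
    """Map each function that references a matching string to its sorted callers."""
    hits = {}
    for addr, text in strings.items():
        if needle not in text:
            continue
        for emitter in string_xrefs.get(addr, []):
            hits[emitter] = sorted(callers.get(emitter, []))
    return hits

# Hypothetical export shapes: string table, string xrefs, and caller lists.
STRINGS = {0x900000: "calling a __host__ function from a __device__ function is not allowed"}
STRING_XREFS = {0x900000: ["sub_EMIT"]}
CALLERS = {"sub_EMIT": ["sub_CHECK_B", "sub_CHECK_A"]}

hits = trace_message("__host__ function from a __device__", STRINGS, STRING_XREFS, CALLERS)
print(hits)
```

The callers returned in the second hop are where the validation logic usually lives, so they are the functions to read next.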
Decompilation Quality
Hex-Rays produces readable pseudocode for the vast majority of functions, but several systematic limitations affect the analysis:
Control Flow Artifacts
Hex-Rays occasionally introduces control flow constructs that do not exist in the original source. The most prominent example is the while(1) loop in main() (sub_408950): the decompiler wraps the entire function body in an infinite loop because a setjmp-based error recovery mechanism creates a backward edge in the CFG. In reality, main() executes linearly and returns -- the while(1) is a decompiler artifact, not a real loop.
Similar artifacts appear in functions with complex switch statements (EDG uses computed gotos for performance), where Hex-Rays may produce nested if-else chains instead of the flat dispatch table the original code uses.
Lost Preprocessor Logic
The original EDG source makes heavy use of preprocessor conditionals (#if CUDA_SUPPORT, #ifdef FRONT_END_CPFE, etc.). The compiled binary contains only the taken branch -- the preprocessor evaluated all conditions at build time. This means the decompiled code shows the CUDA-enabled configuration only; any host-only or non-CUDA EDG behavior is invisible.
Similarly, C macros that wrap common patterns (assertion macros, IL access macros, type query macros) are fully expanded in the binary. The decompiled output shows the expanded form -- a sequence of struct field accesses and conditional jumps -- rather than the concise macro invocation the original source used.
Unnamed Variables
The binary is stripped. All local variable names are lost. Hex-Rays assigns synthetic names (v1, v2, a1, a2) based on register allocation and stack slot positions. Function parameters are named a1 through aN in declaration order. During analysis, meaningful names are sometimes manually applied in the IDA database, but most decompiled output uses the synthetic names.
Structure field accesses appear as byte-offset expressions (*((_BYTE *)a1 + 182)) rather than named fields (entity->execution_space). Reconstructing the structure layouts from these offset patterns is a core part of the analysis -- see the Entity Node Layout page for the most extensively reconstructed structure.
Decompilation Failures
The IDAPython extraction log reports 6,343 successful decompilations out of 6,483 attempts (140 failures). Due to filename collisions in the output directory (functions with identical sanitized names at different addresses overwrite each other), the actual output directory contains 6,202 unique .c files. The 281 "missing" files break down as:
| Category | Count | Reason |
|---|---|---|
| Hex-Rays decompilation failure | ~140 | Exception personality routines, SoftFloat leaf functions, tiny thunks, irreducible CFG |
| Filename collisions (overwritten) | ~141 | Multiple functions with the same IDA name (after / to _ sanitization) write to the same output path |
The 140 true decompilation failures are concentrated in the C++ runtime region (0x7DF400--0x829722), particularly in the libstdc++ locale facet implementations (complex template instantiations with deeply nested virtual dispatch) and Berkeley SoftFloat 3e functions (pure arithmetic with non-standard calling conventions). For these functions, analysis relies on the raw disassembly output in disasm/ instead.
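The collision effect is easy to reproduce in miniature. The sketch below assumes the output path is the IDA name with "/" replaced by "_" plus a ".c" suffix, which is a guess at the export script's naming rule, and uses hypothetical function names. The last line rechecks the reported totals.

```python
from collections import defaultdict

def output_path(ida_name):
    # Assumed naming rule: replace "/" with "_" and append ".c".
    return ida_name.replace("/", "_") + ".c"

def find_collisions(functions):
    """functions: list of (address, ida_name). Returns {path: [addresses]} for clashes."""
    by_path = defaultdict(list)
    for addr, name in functions:
        by_path[output_path(name)].append(addr)
    return {path: addrs for path, addrs in by_path.items() if len(addrs) > 1}

# Hypothetical names: two different functions that sanitize to the same file.
FUNCS = [(0x1000, "ns/helper"), (0x2000, "ns_helper"), (0x3000, "other")]
collisions = find_collisions(FUNCS)
overwritten = sum(len(addrs) - 1 for addrs in collisions.values())
print(collisions, overwritten)

# Reported totals: 6,483 functions, 140 decompilation failures, 141 overwrites.
print(6483 - 140, 6483 - 140 - 141)
```

Appending the function address to each output filename would remove this class of loss in a re-run of the export.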
Phase 1: Address-Range Sweeps
The first phase of analysis consists of 20 address-range sweeps that collectively cover the entire .text section from 0x403000 to 0x82A000. Each sweep examines a contiguous address range of 136--384 KB, documenting every function within that range.
Sweep Index
| Sweep | Address Range | Size | Primary Source Files | Key Findings |
|---|---|---|---|---|
| P1.01 | 0x403000--0x425000 | 136 KB | attribute.c, class_decl.c | Assert stub region, CUDA attribute handlers |
| P1.02 | 0x425000--0x450000 | 172 KB | class_decl.c, cmd_line.c | Virtual override checking, execution space propagation |
| P1.03 | 0x450000--0x478000 | 160 KB | cmd_line.c, const_ints.c, cp_gen_be.c | 4,105-line CLI parser, 276 flags |
| P1.04 | 0x478000--0x4A0000 | 160 KB | cp_gen_be.c, decl_inits.c | Backend .int.c emission, device stub generation |
| P1.05 | 0x4A0000--0x4C8000 | 160 KB | decl_inits.c, decl_spec.c, declarator.c, decls.c | Declaration parsing pipeline |
| P1.06 | 0x4C8000--0x4F8000 | 192 KB | decls.c, disambig.c, error.c | Error table (off_88FAA0, 3,795 entries) |
| P1.07 | 0x4F8000--0x530000 | 224 KB | expr.c | Expression parser (528 functions) |
| P1.08 | 0x530000--0x560000 | 192 KB | expr.c, exprutil.c | Expression utilities, operator overloads |
| P1.09 | 0x560000--0x598000 | 224 KB | exprutil.c, extasm.c, fe_init.c, fe_wrapup.c | Initialization chain, 5-pass wrapup |
| P1.10 | 0x598000--0x5C8000 | 192 KB | float_pt.c, folding.c, func_def.c, host_envir.c | Constant folding, timing infrastructure |
| P1.11a--f | 0x5C8000--0x5F8000 | 192 KB | il.c, il_alloc.c | IL node creation, arena allocator |
| P1.12 | 0x5F8000--0x628000 | 192 KB | il_to_str.c, il_walk.c, interpret.c | IL display, tree walking, constexpr |
| P1.13 | 0x628000--0x668000 | 256 KB | interpret.c, layout.c, lexical.c | Constexpr interpreter, struct layout, lexer |
| P1.14 | 0x668000--0x6A8000 | 256 KB | lexical.c, literals.c, lookup.c, lower_name.c | Name lookup, name mangling |
| P1.15 | 0x6A8000--0x6D0000 | 160 KB | lower_name.c, macro.c, mem_manage.c, nv_transforms.c, overload.c | NVIDIA transforms, memory management |
| P1.16 | 0x6D0000--0x708000 | 224 KB | overload.c, pch.c, pragma.c, preproc.c, scope_stk.c | Overload resolution, scope stack |
| P1.17 | 0x708000--0x740000 | 224 KB | scope_stk.c, src_seq.c, statements.c, symbol_ref.c, symbol_tbl.c | Statement parsing, symbol table |
| P1.18 | 0x740000--0x7A0000 | 384 KB | symbol_tbl.c, sys_predef.c, templates.c | Template engine (443 functions) |
| P1.19 | 0x7A0000--0x7E0000 | 256 KB | trans_unit.c, types.c, modules.c, trans_corresp.c | Type system, TU processing |
| P1.20 | 0x7E0000--0x82A000 | 296 KB | (C++ runtime) | libstdc++, SoftFloat, CRT, demangler |
The P1.11 sweep was subdivided into six sub-sweeps (11a through 11f) because the il.c region is dense and complex, containing the core IL node creation and manipulation functions that are referenced from nearly every other source file.
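The Size column follows directly from the range endpoints. A short sketch that recomputes it for three of the rows and for the full span covered by the sweeps:

```python
def size_kb(start, end):
    """Size of the half-open address range [start, end) in KB."""
    return (end - start) // 1024

SWEEPS = {
    "P1.01": (0x403000, 0x425000),
    "P1.13": (0x628000, 0x668000),
    "P1.18": (0x740000, 0x7A0000),
}
sizes = {name: size_kb(start, end) for name, (start, end) in SWEEPS.items()}
total_kb = size_kb(0x403000, 0x82A000)
print(sizes, total_kb)
```

Because the ranges are contiguous, the per-sweep sizes should sum to the total span, which gives a quick consistency check on the table.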
Sweep Report Format
Each sweep report follows a consistent format:
================================================================================
P1.XX SWEEP: Address range 0xNNNNNN - 0xMMMMMM
================================================================================
Range: 0xNNNNNN - 0xMMMMMM
Functions found: N
EDG source files:
- file.c (assert stub range, main body range)
...
### 0xAAAAAA -- sub_AAAAAA (NN bytes / NN lines)
**Identity**: function_name (source_file.c:NNNN)
**Confidence**: CONFIRMED / HIGH / MEDIUM / LOW
**EDG Source**: source_file.c
**Notes**: Additional observations about behavior, callers, callees
Every function in the sweep range gets an entry. Functions are documented in address order. The identity field records the inferred function name and source location. The confidence field uses the four-level system defined above. Notes capture anything unusual -- unexpected callers, CUDA-specific behavior, undocumented error codes, or connections to other subsystems.
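Because the entry format is regular, the reports can be loaded back into structured records. The following is a sketch, assuming entries appear exactly as in the template above. The sample entry is illustrative: its line count and note are made up.

```python
import re

ENTRY_RE = re.compile(r"^### (0x[0-9A-Fa-f]+) -- (\S+) \((\d+) bytes / (\d+) lines\)$")
FIELD_RE = re.compile(r"^\*\*(Identity|Confidence|EDG Source|Notes)\*\*: (.*)$")

def parse_entries(report_text):
    """Return one dict per '### 0x...' entry found in a sweep report."""
    entries, current = [], None
    for raw in report_text.splitlines():
        line = raw.strip()
        header = ENTRY_RE.match(line)
        if header:
            current = {
                "address": int(header.group(1), 16),
                "name": header.group(2),
                "bytes": int(header.group(3)),
                "lines": int(header.group(4)),
            }
            entries.append(current)
            continue
        field = FIELD_RE.match(line)
        if field and current is not None:
            current[field.group(1)] = field.group(2)
    return entries

# Illustrative entry; the line count and note are hypothetical.
SAMPLE = """\
### 0x403300 -- sub_403300 (29 bytes / 5 lines)
**Identity**: is_aliasable (attribute.c:10897)
**Confidence**: CONFIRMED
**EDG Source**: attribute.c
**Notes**: assert stub
"""
entries = parse_entries(SAMPLE)
print(entries[0]["name"], entries[0]["Confidence"])
```

Records in this form can be grouped by Confidence or EDG Source to produce per-file summaries.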
Phase 2: Targeted Deep Dives
After the Phase 1 sweep establishes the complete function map and identifies all source files, Phase 2 produces the detailed wiki pages. Each wiki page corresponds to one W-series work report that focuses on a specific subsystem or topic.
Deep Dive Methodology
Each W-series report follows a consistent process:
- Scope definition. Identify the set of functions relevant to the topic. For example, W012 (Execution Spaces) requires the CUDA attribute application handlers in attribute.c, the execution space checking functions in nv_transforms.c, and the virtual override validator in class_decl.c.
- Decompilation review. Read the full Hex-Rays pseudocode for every function in scope. For complex functions, also review the raw disassembly to catch decompiler artifacts.
- String evidence collection. Grep the string table for all strings referenced by the in-scope functions. Error messages reveal validation rules; format strings reveal output patterns; keyword strings reveal accepted syntax.
- Call graph traversal. Starting from the in-scope functions, walk callers and callees to understand the full data flow. Who calls apply_nv_global_attr? What does it call? How does data arrive and where does it go?
- Struct layout reconstruction. When decompiled code accesses struct fields via byte offsets, reconstruct the field layout by collecting all access patterns across all functions that touch the same struct. Cross-validate offsets across multiple functions.
- Pseudocode reconstruction. Translate the Hex-Rays output into readable C-like pseudocode with meaningful variable names, proper control flow, and comments explaining the logic. This reconstructed pseudocode appears in the wiki pages.
- Cross-reference synthesis. Link findings to other wiki pages and W-series reports. Every page should situate itself within the overall architecture.
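The struct layout reconstruction step can be partly automated by harvesting offset expressions from the pseudocode. The sketch below handles only the cast-then-add form with a decimal index, assuming Hex-Rays prints accesses like the \*((_BYTE \*)a1 + 182) example used elsewhere on this page. The sample functions sub_A and sub_B are hypothetical. In C pointer arithmetic the index is scaled by the element size, so \*((_DWORD \*)a1 + 4) touches byte offset 16, not 4.

```python
import re
from collections import defaultdict

ACCESS_RE = re.compile(r"\*\(\((_BYTE|_WORD|_DWORD|_QWORD) \*\)(\w+) \+ (\d+)\)")
WIDTH = {"_BYTE": 1, "_WORD": 2, "_DWORD": 4, "_QWORD": 8}

def collect_accesses(pseudocode_by_func, param="a1"):
    """Map byte offset -> {access width: set of functions} for one parameter."""
    layout = defaultdict(lambda: defaultdict(set))
    for func, code in pseudocode_by_func.items():
        for ctype, base, index in ACCESS_RE.findall(code):
            if base != param:
                continue
            offset = int(index) * WIDTH[ctype]  # index is scaled by element size
            layout[offset][WIDTH[ctype]].add(func)
    return {off: dict(by_width) for off, by_width in sorted(layout.items())}

# Hypothetical pseudocode fragments.
SAMPLE = {
    "sub_A": "if ( (*((_BYTE *)a1 + 182) & 0x40) != 0 ) return *((_DWORD *)a1 + 4);",
    "sub_B": "v2 = *((_BYTE *)a1 + 182);",
}
layout = collect_accesses(SAMPLE)
print(layout)
```

An offset seen with the same width in several unrelated functions is a stronger field candidate than one seen once, which is the cross-validation described in the list above.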
W-Series Report Index
As of this writing, 29 W-series reports have been produced, each backing one or more wiki pages:
| Report | Topic | Wiki Page(s) |
|---|---|---|
| W001 | Index page | index.md |
| W002 | Function map | function-map.md |
| W003 | Binary layout | binary-layout.md |
| W004 | Methodology | methodology.md (this page) |
| W005 | Pipeline overview | pipeline/overview.md |
| W006 | Entry point | pipeline/entry.md |
| W010 | Backend code gen | pipeline/backend.md |
| W012 | Execution spaces | cuda/execution-spaces.md |
| W014 | Cross-space validation | cuda/cross-space-validation.md |
| W015 | Device/host separation | cuda/device-host-separation.md |
| W016 | Kernel stubs | cuda/kernel-stubs.md |
| W020 | Attribute system | attributes/overview.md |
| W021 | __global__ constraints | attributes/global-function.md |
| W026 | Lambda overview | lambda/overview.md |
| W027 | Device wrapper | lambda/device-wrapper.md |
| W028 | Host-device wrapper | lambda/host-device-wrapper.md |
| W029 | Capture handling | lambda/capture-handling.md |
| W032 | IL overview | il/overview.md |
| W033 | IL allocation | il/allocation.md |
| W035 | Keep-in-IL | il/keep-in-il.md |
| W038 | .int.c format | output/int-c-format.md |
| W042 | EDG overview | edg/overview.md |
| W047 | Template engine | edg/template-engine.md |
| W052 | Diagnostics overview | diagnostics/overview.md |
| W053 | CUDA errors | diagnostics/cuda-errors.md |
| W056 | Entity node layout | structs/entity-node.md |
| W061 | CLI flags | config/cli-flags.md |
| W065 | EDG source map | reference/edg-source-map.md |
| W066 | Global variables | reference/global-variables.md |
Numerical Summary
| Metric | Value |
|---|---|
| Binary file size | 8,910,936 bytes (8.5 MB) |
| Total functions in binary | 6,483 |
| Decompiled functions (log-reported) | 6,343 |
| Decompiled files (actual on disk) | 6,202 |
| Disassembly files | 6,342 |
| CFG files (JSON + DOT) | 12,684 |
| Functions attributed to source files | 2,209 (34.1%) |
| Functions calling sub_4F2930 (assert handler) | 2,139 |
| Total call sites to sub_4F2930 | 5,178 |
| Assert stubs (0x403300--0x408B40) | 235 |
| Source files identified (.c) | 52 |
| Header files identified (.h) | 13 |
| EDG build-path strings in .rodata | 65 |
| String literals extracted | 52,489 |
| Cross-references extracted | 1,243,258 |
| Call graph edges | 67,756 (5,057 callers, 5,382 callees) |
| Named locations | 54,771 |
| IDA comments | 22,911 |
| Imported glibc symbols | 142 |
| ELF segments | 26 |
| .rodata raw dump | 2,599,011 bytes |
| IDA database (.i64) | 247 MB |
| Phase 1 sweep reports | 28 files (20 ranges + 8 sub-sweeps), 38,221 lines |
| Phase 2 deep-dive reports (W-series) | 29 |
| Wiki pages | 55 |
| Error table entries (off_88FAA0) | 3,795 |
| CLI flags documented | 276 |
| Total exported data | ~500 MB |
Limitations and Caveats
What This Analysis Cannot Determine
- Preprocessor-disabled code. Any EDG code behind #if 0, #ifndef CUDA_SUPPORT, or similar guards was compiled out. The binary reflects only the CUDA-enabled, Linux x86-64, EDG 6.6 configuration. Other EDG frontend features (e.g., Fortran support, Windows target, older C++ standards) are not present.
- Inlined function boundaries. When the compiler inlines a function, its code merges with the caller. The binary may contain hundreds of inlined instances of small EDG utility functions (type queries, IL accessors) that are invisible as separate entities. The 6,483 function count represents only the non-inlined functions.
- Original variable names. All local and most global variable names are lost. The wiki uses reconstructed names based on semantics (e.g., execution_space_byte for *((_BYTE *)entity + 182)), but these are analyst-assigned, not original.
- Exact source line mapping. While assertion strings encode line numbers, these are the assertion site's line number, not the calling function's line number. The analyst can determine that is_aliasable in attribute.c has an assertion at line 10897, but cannot determine the start line of is_aliasable itself.
- NVIDIA-internal documentation. Any design documents, code comments, commit messages, or internal wikis that informed the original development are unavailable. All behavioral descriptions in this wiki are inferred from the binary alone.
Reproducibility
Every finding in this wiki can be reproduced by:
- Obtaining cudafe++ from CUDA Toolkit 13.0 (version string embedded in binary as the build path prefix r13.0).
- Loading it into IDA Pro 9.0 (64-bit) with default x86-64 analysis settings. Wait for auto-analysis to complete (5-10 minutes).
- Running analyze_cudafe++.py via File > Script File to extract all raw data (30-45 minutes).
- Querying the exported JSON files with jq to trace cross-reference chains, string lookups, and callgraph paths.
- Reading the decompiled .c files and raw .asm files for behavioral analysis.
No proprietary tools beyond IDA Pro + Hex-Rays are required. The analysis does not depend on NVIDIA source code access, NDA-protected documentation, or insider knowledge. Every claim is derived from the publicly distributed binary.
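A callgraph path query of the kind mentioned above can also be done in a few lines of Python once the edges are loaded. This sketch assumes the export can be reduced to a list of (caller, callee) pairs; the actual JSON schema is not shown on this page. The sample edges are the caller/callee relationships described in the Pipeline Overview.

```python
from collections import deque

def call_path(edges, src, dst):
    """Shortest call chain from src to dst via breadth-first search, or None."""
    graph = {}
    for caller, callee in edges:
        graph.setdefault(caller, []).append(callee)
    queue, parent = deque([src]), {src: None}
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Edges taken from relationships described on this page (main is sub_408950).
EDGES = [
    ("sub_408950", "sub_585DB0"),
    ("sub_585DB0", "sub_585EE0"),
    ("sub_585EE0", "sub_5863A0"),
    ("sub_408950", "sub_7A40A0"),
    ("sub_7A40A0", "sub_586240"),
]
path = call_path(EDGES, "sub_408950", "sub_5863A0")
print(" -> ".join(path))
```

Breadth-first search returns a shortest chain, which is usually the most readable answer to "how does main reach this function".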
Pipeline Overview
cudafe++ is a source-to-source compiler. It reads a .cu file, parses it as C++ with CUDA extensions using a modified EDG 6.6 frontend, then emits a transformed .int.c file where device code is suppressed and host-side stubs replace kernel launch sites. The entire binary is a single-threaded, single-pass-per-stage pipeline controlled from main() at 0x408950.
Pipeline Diagram
input.cu
|
v
[1] fe_pre_init sub_585D60 fe_init.c
9 subsystem pre-initializers
|
v
* sub_5AF350(v7) ---- capture "Total compilation time" start
|
v
[2] proc_command_line sub_459630 cmd_line.c
276 CLI flags parsed, mode selection
|
v
[3] fe_one_time_init sub_585DB0 fe_init.c
38 subsystem initializers + keyword registration
|--- fe_init_part_1 (sub_585EE0): per-unit inits, output file open
|--- keyword_init + fe_translation_unit_init (sub_5863A0)
|
v
* sub_5AF350(v8) ---- capture "Front end time" start
|
v
[4] reset_tu_state sub_7A4860 trans_unit.c
Zero all TU globals
|
v
[5] process_trans_unit sub_7A40A0 trans_unit.c
Allocate 424-byte TU descriptor, parse source,
build EDG IL tree, CUDA attribute propagation
|
v
[6] fe_wrapup sub_588F90 fe_wrapup.c
5-pass IL finalization: needed-flags, keep-in-IL marking,
dead entity elimination, scope cleanup
|
v
* sub_5AF350(v9) ---- capture "Front end time" end
* sub_5AF390("Front end time", v8, v9)
|
v
* sub_5AF350(v10) --- capture "Back end time" start
|
v
[7] Backend entry sub_489000 cp_gen_be.c
Walk source sequence, emit .int.c, device stubs,
lambda wrappers, registration tables
|
v
* sub_5AF350(v11) --- capture "Back end time" end
* sub_5AF390("Back end time", v10, v11)
|
v
* sub_5AF350(v12) --- capture "Total compilation time" end
* sub_5AF390("Total compilation time", v7, v12)
|
v
[8] exit_with_status sub_5AF1D0 host_envir.c
Map internal status to exit code, terminate
|----- "Front end time" covers stages 4-6 ----------|
|----- "Back end time" covers stage 7 ---------------|
|----- "Total compilation time" covers stages 2-7 ---|
Call Hierarchy from main()
The decompiled main() at 0x408950 calls the pipeline stages in this exact order:
void main(int argc, char **argv, char **envp)
{
sub_585D60(argc, argv, envp); // [1] fe_pre_init
sub_5AF350(v7); // capture_time (total start)
sub_459630(argc, argv); // [2] proc_command_line
// [stack limit adjustment via setrlimit]
sub_585DB0(); // [3] fe_one_time_init
if (dword_106C0A4)
sub_5AF350(v8); // capture_time (frontend start)
sub_7A4860(); // [4] reset_tu_state
sub_7A40A0(qword_126EEE0); // [5] process_translation_unit
sub_588F90(v5, 1); // [6] fe_wrapup
if (dword_106C0A4) {
sub_5AF350(v9);
sub_5AF390("Front end time", v8, v9);
}
// --- error-recovery re-compilation loop ---
if (qword_126ED90) { // errors present?
dword_106C254 = 1; // skip backend
}
while (1) {
sub_6B8B20(0); // reset file state
sub_589530(); // write signoff + cleanup
// exit code computation
if (dword_106C0A4)
sub_5AF390("Total compilation time", ...);
sub_5AF1D0(exit_code); // [8] exit
// --- if dword_106C254 == 0, backend runs ---
if (!dword_106C254) {
if (dword_106C0A4)
sub_5AF350(v10); // capture_time (backend start)
sub_489000(); // [7] process_file_scope_entities
if (dword_106C0A4) {
sub_5AF350(v11);
sub_5AF390("Back end time", v10, v11);
}
}
}
}
The while(1) loop with sub_5AF1D0 (which calls exit() / abort()) never actually iterates -- the call to sub_5AF1D0 is __noreturn. The compiler just arranged the basic blocks this way: the backend stage at label LABEL_16 falls through from a goto at the top of the loop when dword_106C254 == 0 (no errors).
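Read linearly, the control flow above amounts to "run the backend unless errors were recorded, then clean up and exit". The sketch below models that reading as an ordered list of stage labels. It is a simplification for illustration, not a transcription of the binary, and the label strings are chosen here.

```python
def run_pipeline(has_errors, timing=False):
    """Return the stage order implied by the listing, as a list of labels."""
    events = [
        "fe_pre_init", "proc_command_line", "fe_one_time_init",
        "reset_tu_state", "process_translation_unit", "fe_wrapup",
    ]
    if timing:
        events.append("report Front end time")
    skip_backend = has_errors          # models dword_106C254 = 1
    if not skip_backend:
        events.append("backend")
        if timing:
            events.append("report Back end time")
    events += ["reset_file_state", "write_signoff"]
    if timing:
        events.append("report Total compilation time")
    events.append("exit_with_status")  # noreturn: nothing runs after this
    return events

clean = run_pipeline(has_errors=False)
failed = run_pipeline(has_errors=True, timing=True)
print(clean)
print(failed)
```

In this model the only difference between a clean and a failing compilation is the presence of the backend stage and its timing report.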
Stage Details
Stage 1: fe_pre_init -- sub_585D60 (0x585D60)
Source: fe_init.c
Performs absolute minimum initialization before anything else can run. Called with the raw argc, argv, envp from the OS.
| Call | Address | Identity | Purpose |
|---|---|---|---|
| 1 | sub_48B3C0 | error_handling_init | Zero error counters |
| 2 | sub_6BB290 | source_file_mgr_init | File descriptor table setup |
| 3 | sub_5B1E70 | scope_symbol_pre_init | Scope stack index = -1 |
| 4 | sub_752C90 | type_system_pre_init | Type table allocation |
| 5 | sub_45EB40 | cmd_line_pre_init | Register CLI flag table |
| 6 | sub_4ED530 | declaration_pre_init | Declaration state zeroing |
| 7 | sub_6F6020 | il_pre_init | IL node allocator setup |
| 8 | sub_7A48B0 | tu_tracking_pre_init | Zero all TU globals |
| 9 | sub_7C00F0 | template_pre_init | Template engine state |
Sets dword_126C5E4 = -1 (current scope index = "none") and dword_126C5C8 = -1 (secondary scope index = "none").
Data flow: No input beyond process args. Output: global state zeroed and ready for CLI parsing.
Stage 2: proc_command_line -- sub_459630 (0x459630)
Source: cmd_line.c (4105 decompiled lines)
Parses all 276 CLI flags. Populates global configuration variables that control every subsequent stage. Key outputs:
| Global | Address | Meaning |
|---|---|---|
| dword_126EFB4 | 0x126EFB4 | Language mode: 1=K&R C, 2=C++ |
| dword_126EF68 | 0x126EF68 | C++ standard version (__cplusplus value) |
| dword_106C0A4 | 0x106C0A4 | Timing enabled (print stage durations) |
| dword_126E1D8 | 0x126E1D8 | MSVC host compiler |
| dword_126E1F8 | 0x126E1F8 | GNU/GCC host compiler |
| dword_126E1E8 | 0x126E1E8 | Clang host compiler |
| dword_106BF38 | 0x106BF38 | Extended lambda mode |
| qword_126EEE0 | 0x126EEE0 | Output filename (or "-" for stdout) |
| qword_106BA00 | 0x106BA00 | Primary source filename |
| dword_106C29C | 0x106C29C | Preprocessing-only mode |
| dword_106C064 | 0x106C064 | Stack limit adjustment flag |
The parser builds four hash tables for macro defines (qword_106C248), include paths (qword_106C240), and system includes (qword_106C238, qword_106C228). It also suppresses a default set of diagnostic numbers (1257, 1373, 1374, 1375, 1633, 2330, 111, 185, 175).
Data flow: Input: argv. Output: 150+ global configuration variables populated.
Stage 3: fe_one_time_init -- sub_585DB0 (0x585DB0)
Source: fe_init.c
The heaviest initialization stage. Calls 38 subsystem initializers in dependency order, then validates the function pointer dispatch table (a sentinel check: off_D560C0 must equal the address of nullsub_6). After validation, calls sub_585EE0 (fe_init_part_1) which:
- Records compilation timestamp via time()/ctime() into byte_106B5C0
- Runs 26 per-compilation-unit initializers
- Opens the output file (qword_106C280 = stdout or file)
- Writes the output file header via sub_5AEDB0
- Calls the keyword registration function sub_5863A0, which registers 200+ C/C++ keywords plus NVIDIA CUDA-specific type traits (__nv_is_extended_device_lambda_closure_type, etc.)
38 subsystem initializers (in call order):
| # | Address | Subsystem |
|---|---|---|
| 1 | sub_752DF0 | types |
| 2 | sub_5B1D40 | scopes |
| 3 | sub_447430 | errors |
| 4 | sub_4B37F0 | preprocessor |
| 5 | sub_4E8ED0 | declarations |
| 6 | sub_4C0840 | attributes |
| 7 | sub_4A1B60 | names |
| 8 | sub_4E9CF0 | declarations (part 2) |
| 9 | sub_4ED710 | declarations (part 3) |
| 10 | sub_510C30 | statements |
| 11 | sub_56DC90 | expression utilities |
| 12 | sub_5A5160 | expressions |
| 13 | sub_603B00 | parser |
| 14 | sub_5CF7F0 | classes |
| 15 | sub_65DC50 | overload resolution |
| 16 | sub_69C8B0 | templates |
| 17 | sub_665A00 | template instantiation |
| 18 | sub_689550 | exception handling |
| 19 | sub_68F640 | implicit conversions |
| 20 | sub_6B6510 | IL |
| 21 | sub_6BAE70 | source file manager |
| 22 | sub_6F5FC0 | IL walking |
| 23 | sub_6F8300 | IL (part 2) |
| 24 | sub_6FDFF0 | lowering |
| 25 | sub_726DC0 | name mangling |
| 26 | sub_72D410 | name mangling (part 2) |
| 27 | sub_74B9A0 | type checking |
| 28 | sub_710B70 | IL (part 3) |
| 29 | sub_76D630 | code generation |
| 30 | nullsub_11 | debug (no-op) |
| 31 | sub_7A4690 | allocation |
| 32 | sub_7A3920 | memory pools |
| 33 | sub_6A0E90 | templates (part 2) |
| 34 | sub_418F80 | diagnostics |
| 35 | sub_5859C0 | extended asm |
| 36 | sub_751540 | types (part 2) |
| 37 | sub_7C25F0 | templates (part 3) |
| 38 | sub_7DF400 | CUDA-specific init |
Data flow: Input: populated config globals. Output: all subsystems initialized, keyword table built, output file open.
Stage 4: reset_tu_state -- sub_7A4860 (0x7A4860)
Source: trans_unit.c
Zeroes all translation unit tracking globals to prepare for processing:
qword_106BA10 = 0; // current_translation_unit
qword_106B9F0 = 0; // primary_translation_unit
qword_12C7A90 = 0; // tu_chain_tail
dword_106B9F8 = 0; // has_module_info
qword_106BA18 = 0; // tu_stack_top
dword_106B9E8 = 0; // tu_stack_depth
Data flow: No input. Output: TU state clean-slated.
Stage 5: process_translation_unit -- sub_7A40A0 (0x7A40A0)
Source: trans_unit.c
The main frontend workhorse. This single call parses the entire .cu source file into the EDG intermediate language. Workflow:
- Debug trace: "Processing translation unit %s"
- Clean up any previous TU state (sub_7A3A50)
- Reset error state (sub_5EAEC0)
- Allocate 424-byte TU descriptor via sub_6BA0D0
- Initialize TU scope state (offsets 24..192 via sub_7046E0)
- Set as primary TU (qword_106B9F0) if first
- Link into TU chain
- Call sub_586240 -- parse the source file (this enters the EDG parser, which handles all of C++ plus CUDA extensions: __device__, __host__, __global__, __shared__, __managed__, etc.)
- Depending on mode:
  - Module compilation: sub_6FDDF0
  - Standard compilation: sub_6F4AD0 (header-unit) + sub_4E8A60 (standard)
- Post-processing: sub_588E90 (translation_unit_wrapup -- scope closure, template wrapup, IL output)
- Debug trace: "Done processing translation unit %s"
At the end of this stage, the EDG IL tree is fully built. Every declaration, type, expression, and statement from the source has been parsed into IL nodes. CUDA execution-space attributes (__device__, __host__, __global__) have been recorded on entity nodes at byte offset +182 (bit 6 = device/global, bits 4-5 = execution space).
Data flow: Input: source filename from qword_126EEE0. Output: complete EDG IL tree anchored at qword_106BA10 (TU descriptor), source sequence list at *(qword_106BA10 + 8).
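The bit positions quoted above for the byte at entity offset +182 can be unpacked as follows. This is a sketch of the field extraction only: the meaning of each 2-bit value in bits 4-5 is not stated here, so the helper returns the raw value, and the field names are chosen for illustration.

```python
def decode_exec_space_byte(value):
    """Split the byte at entity offset +182 into the two fields named above."""
    return {
        "device_or_global": bool(value & 0x40),   # bit 6
        "exec_space_bits": (value >> 4) & 0x3,    # bits 4-5, raw 2-bit value
    }

print(decode_exec_space_byte(0x50))
print(decode_exec_space_byte(0x20))
```

The mask 0x40 is the same constant that shows up in decompiled tests of the form (\*((_BYTE \*)a1 + 182) & 0x40) != 0.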
Stage 6: fe_wrapup -- sub_588F90 (0x588F90)
Source: fe_wrapup.c
Five-pass finalization over all translation units. Each pass iterates the TU chain (qword_106B9F0). Passes 2-4 are per-TU error-gated (skip TUs with qword_126ED90 != 0); passes 1 and 5 run unconditionally.
| Pass | Function | Purpose | Error-gated? |
|---|---|---|---|
| 1 | sub_588C60 | Per-file IL wrapup: template/exception cleanup, IL tree walk (sub_706710), IL finalize (sub_706F40), destroy temporaries | No |
| 2 | sub_707040 | Needed-flags computation: determine which entities must be preserved for backend consumption | Per-TU skip |
| 3 | sub_610420(23) | Keep-in-IL marking: mark entities for device code preservation with guard flag dword_106B640 | Per-TU skip |
| 4 | sub_5CCA40 + sub_5CC410 + sub_5CCBF0 | Dead entity elimination (C++ gate on sub_5CCA40): clear unneeded instantiation flags, remove dead function bodies, remove unneeded IL entries | Per-TU skip |
| 5 | sub_588D40 | Statement finalization, scope assertions, IL output + template output | No |
Between Pass 1 and Pass 2, if no errors have occurred, sub_796C00 runs cross-TU entity marking.
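The gating rule in the table can be sketched as a nested loop: one outer iteration per pass, one inner walk of the TU chain. This is only an illustration of the control flow described above, not the decompiled code. The tu_t record, its fields, and the per-TU error_count are hypothetical stand-ins (the document describes the check against qword_126ED90), and each pass is reduced to setting a bit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical TU record; the real 424-byte descriptor has many more fields. */
typedef struct tu {
    struct tu *next;       /* next TU in the chain */
    long error_count;      /* stand-in for the error check on qword_126ED90 */
    unsigned passes_run;   /* bit N is set when pass N ran for this TU */
} tu_t;

/* Passes 2-4 are error-gated; passes 1 and 5 always run. */
static int pass_is_error_gated(int pass)
{
    return pass >= 2 && pass <= 4;
}

/* Walk the whole chain once per pass, skipping errored TUs on gated passes. */
static void run_wrapup_passes(tu_t *chain)
{
    for (int pass = 1; pass <= 5; ++pass) {
        for (tu_t *tu = chain; tu != NULL; tu = tu->next) {
            assert(tu->error_count >= 0);   /* error counts are never negative */
            if (pass_is_error_gated(pass) && tu->error_count != 0)
                continue;                   /* per-TU skip */
            tu->passes_run |= 1u << pass;   /* placeholder for the real pass body */
        }
    }
}
```

With this model, a TU that has errors only ever sees passes 1 and 5, while a clean TU sees all five.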
Post-pass operations:
- Cross-TU consistency (sub_796BA0, error-gated)
- Scope renumbering (sub_707480 double-loop)
- Template validation (sub_765480)
- File index cleanup (sub_6B8B20 for indices 2..dword_126EC80)
- Output flush + close three output files (IDs 1513, 1514, 1515)
- Memory statistics: sums 10 space_used() callbacks
- State teardown
Data flow: Input: fully built IL tree. Output: finalized IL with dead entities eliminated and device-needed entities marked. The source sequence list (qword_1065748) is the ordered list of top-level declarations the backend will walk.
Stage 7: Backend Code Generation -- sub_489000 (0x489000)
Source: cp_gen_be.c (723 decompiled lines, the largest single function in the backend)
This is the host-side C++ code generator. It walks the EDG source sequence and emits the .int.c file that the host compiler (gcc/cl.exe/clang) will compile. The backend is gated by dword_106C254: if set to 1 (errors occurred), stage 7 is skipped entirely.
Initialization:
- Zeros output state: dword_1065834 (indent level), stream handle, counters
- Clears four 512KB hash tables (memset 0x7FFE0 bytes each)
- Sets up gen_be_info callback table (xmmword_1065760..10657B0)
- Creates output file: <input>.int.c (or stdout for "-")
Boilerplate emission:
- #pragma GCC diagnostic push/pop blocks for suppressing host compiler warnings
- __nv_managed_rt initialization boilerplate (for __managed__ variables)
- Lambda type-trait macro definitions
Main processing loop:
- Walks qword_1065748 (global source sequence list)
- For each entry: dispatches to sub_47ECC0 (gen_template/process_source_sequence)
- Kind 57 entries are pragma interleavings (handled inline)
CUDA-specific transformations performed:
- Device stub generation: For __global__ kernels, emit __wrapper__device_stub_<name>() forwarding, wrap original body in #if 0/#endif
- Device-only suppression: Device-only declarations wrapped in #if 0/#endif
- Lambda wrappers: __nv_dl_wrapper_t<> for device lambdas, __nv_hdl_create_wrapper_t<> for host-device lambdas
- Runtime header injection: #include "crt/host_runtime.h" at first CUDA entity
- Registration tables: sub_6BCF80 called 6 times for device/host/managed/constant combinations
- Anonymous namespace: _NV_ANON_NAMESPACE macro for unique global symbols
Trailer:
- Empty-file guard: int __dummy_to_avoid_empty_file;
- Re-inclusion of original source via #include "<original_file>"
- #undef _NV_ANON_NAMESPACE
Data flow: Input: finalized source sequence from stage 6. Output: .int.c file on disk.
Stage 8: exit_with_status -- sub_5AF1D0 (0x5AF1D0)
Source: host_envir.c
Maps internal compilation status to process exit codes:
| Internal Status | Meaning | Exit Code | Action |
|---|---|---|---|
| 3, 4, 5 | Success | 0 | exit(0) |
| 8 | Warnings only | 2 | exit(2) |
| 9, 10 | Errors | 4 | exit(4) + "Compilation terminated." |
| 11 | Internal error | -- | abort() + "Compilation aborted." |
In SARIF mode (dword_106BBB8), text messages are suppressed but exit codes remain the same.
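The mapping can be expressed as a pure function, which makes the SARIF rule easy to see: only the message is affected, never the exit code. The sketch below is an assumption-laden restatement of the table, not the body of sub_5AF1D0. The names exit_plan_t and plan_exit are hypothetical, and the default branch follows the "(other)" row listed in the Entry Point page's version of this table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of what the exit routine would do. */
typedef struct {
    int aborts;            /* 1 = abort(), 0 = exit(exit_code) */
    int exit_code;         /* meaningful only when aborts == 0 */
    const char *message;   /* NULL when nothing is printed */
} exit_plan_t;

static exit_plan_t plan_exit(int status, int sarif_mode)
{
    exit_plan_t plan = { 0, 0, NULL };

    assert(sarif_mode == 0 || sarif_mode == 1);
    switch (status) {
    case 3: case 4: case 5:        /* success variants */
        break;
    case 8:                        /* warnings only */
        plan.exit_code = 2;
        break;
    case 9: case 10:               /* compilation errors */
        plan.exit_code = 4;
        plan.message = "Compilation terminated.";
        break;
    case 11:                       /* internal error */
        plan.aborts = 1;
        plan.message = "Compilation aborted.";
        break;
    default:                       /* unlisted status: abort without a message */
        plan.aborts = 1;
        break;
    }
    if (sarif_mode)
        plan.message = NULL;       /* SARIF suppresses text, keeps the exit code */
    return plan;
}
```

Separating the decision from the side effect (exit or abort) is a choice made for this sketch so the mapping can be checked without terminating the process.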
Key Global Variables Controlling Flow
| Variable | Address | Type | Role |
|---|---|---|---|
| dword_106C254 | 0x106C254 | int | Skip-backend flag. Set to 1 when qword_126ED90 (error count) is nonzero after frontend. Prevents stage 7 from running. |
| dword_106C0A4 | 0x106C0A4 | int | Timing flag. When set, sub_5AF350/sub_5AF390 bracket each phase with CPU + wall-clock timestamps. |
| dword_126EFB4 | 0x126EFB4 | int | Language mode. 1=K&R C, 2=C++. Controls C++ class finalization in pass 4 of fe_wrapup, keyword set selection, and backend behavior. In CUDA mode, always 2. |
| qword_126ED90 | 0x126ED90 | qword | Error count. Checked after stages 5-6 to decide whether to run backend. Nonzero skips needed-flags, keep-in-IL marking, and dead entity elimination passes in fe_wrapup. |
| qword_126EEE0 | 0x126EEE0 | char* | Source filename. Passed to sub_7A40A0 for TU naming. Used by backend to construct .int.c path. |
| dword_1065850 | 0x1065850 | int | Device stub mode. Toggled during backend generation: 1 = currently emitting device stub code (changes parameter types, suppresses bodies). |
| dword_106C064 | 0x106C064 | int | Stack limit flag. When set, main adjusts RLIMIT_STACK to max before entering frontend (deep recursion in parser/template engine). |
Timing Regions
When dword_106C0A4 is set (via --timing or equivalent flag), three timing regions are printed:
Front end time                      12.34 (CPU)      15.67 (elapsed)
Back end time                        3.45 (CPU)       4.56 (elapsed)
Total compilation time              15.79 (CPU)      20.23 (elapsed)
Format string: "%-30s %10.2f (CPU) %10.2f (elapsed)\n"
The timing is implemented via sub_5AF350 (capture_time: records clock() as CPU milliseconds and time() as wall seconds) and sub_5AF390 (report_timing: computes deltas and prints).
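A minimal sketch of the capture/report pair follows, assuming a two-field stamp (CPU milliseconds and wall seconds). The struct layout and the function names capture_time and format_timing are illustrative assumptions; only the use of clock(), time(), and the format string come from the description above. The sketch formats into a buffer instead of printing so the result can be inspected.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stamp layout; the real timestamp structure is not documented here. */
typedef struct {
    double cpu_ms;   /* CPU time in milliseconds, derived from clock() */
    time_t wall;     /* wall-clock seconds from time() */
} timestamp_t;

static void capture_time(timestamp_t *out)
{
    assert(out != NULL);
    out->cpu_ms = (double)clock() * 1000.0 / CLOCKS_PER_SEC;
    out->wall = time(NULL);
}

/* Format one timing line with the format string quoted above.
 * Returns the number of characters snprintf would write. */
static int format_timing(char *buf, size_t size, const char *label,
                         const timestamp_t *start, const timestamp_t *end)
{
    double cpu_s = (end->cpu_ms - start->cpu_ms) / 1000.0;
    double wall_s = difftime(end->wall, start->wall);

    return snprintf(buf, size, "%-30s %10.2f (CPU) %10.2f (elapsed)\n",
                    label, cpu_s, wall_s);
}
```

Because the label is left-justified in 30 columns and each number is right-justified in 10, every line produced by this format has the same width for labels of up to 30 characters.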
| Region | Start | End | Covers |
|---|---|---|---|
| Front end | After sub_585DB0 (fe_one_time_init) | After sub_588F90 (fe_wrapup) | Stages 4-6: TU reset, parse, IL build, wrapup |
| Back end | Before sub_489000 | After sub_489000 | Stage 7: .int.c generation |
| Total | After sub_585D60 (fe_pre_init), before sub_459630 (CLI) | Before sub_5AF1D0 (exit) | Stages 2-8: CLI parsing through exit |
Error Recovery Loop
The main() function contains a while(1) loop that appears to support re-compilation (the TU processing infrastructure has a dword_106BA08 "is_recompilation" flag and sub_7A40A0 checks an a2 recompilation parameter). In practice, for the standard CUDA compilation flow, this loop executes exactly once: sub_5AF1D0 is __noreturn and terminates the process.
The loop body:
- sub_6B8B20(0) -- reset file state for the source file manager
- sub_589530() -- write output signoff (sub_5AEE00) + close source manager (sub_6B8DE0)
- Compute exit code from qword_126ED90 (errors) and qword_126ED88 (additional status)
- Print total timing if enabled
- Restore stack limit if it was raised
- sub_5AF1D0(exit_code) -- terminate
Cross-References
- Entry Point & Initialization -- detailed breakdown of stages 1-3
- CLI Processing -- all 276 flags parsed in stage 2
- Frontend Invocation -- stage 5 (parse + IL build) in depth
- Frontend Wrapup -- 5-pass architecture of stage 6
- Backend Code Generation -- stage 7 (.int.c emission) in depth
- Timing & Exit -- stage 8 and exit code mapping
- Device/Host Separation -- how the backend filters device vs host code
- Kernel Stub Generation -- __wrapper__device_stub_ pattern
- Extended Lambda Overview -- lambda wrapper generation in backend
- .int.c File Format -- structure of the backend output
Entry Point & Initialization
main() at 0x408950 is a 488-byte __noreturn function that orchestrates the entire cudafe++ compilation pipeline. It takes the standard POSIX signature (int argc, char **argv, char **envp), performs two phases of subsystem initialization, optionally raises the process stack limit, then runs the frontend, backend, and exit sequence in a linearized loop that executes exactly once. The function has 22 direct callees (including getrlimit, setrlimit, and library calls) and never returns -- sub_5AF1D0 at the bottom of the loop calls exit() or abort().
Key Facts
| Property | Value |
|---|---|
| Address | 0x408950 |
| Size | 488 bytes |
| Source file | fe_init.c / host_envir.c (initialization); fe_wrapup.c (finalization) |
| Signature | void __noreturn main(int argc, char **argv, char **envp) |
| Direct callees | 22 (9 pre-init + CLI + heavy-init + 5 pipeline stages + timing/exit helpers) |
| Stack frame | 0x88 bytes (136 bytes: 6 timing stamps + rlimit struct + alignment) |
| Attribute | __noreturn -- the while(1) loop terminates via sub_5AF1D0 which calls exit()/abort() |
Annotated Decompilation
void __noreturn main(int argc, char **argv, char **envp)
{
rlim_t original_stack;
bool stack_was_raised;
uint8_t exit_code;
struct rlimit rlimits;
timestamp_t t_total_start, t_fe_start, t_fe_end, t_be_start, t_be_end, t_total_end;
// --- Redirect diagnostic output to stderr ---
s = stderr; // 0x126EDF0 alias
qword_126EDF0 = stderr; // diagnostic stream
// === PHASE 1: Pre-initialization (9 subsystem calls) ===
sub_585D60(argc, argv, envp); // fe_pre_init
// --- Capture total compilation start time ---
sub_5AF350(&t_total_start); // capture_time
// === PHASE 2: Command-line parsing ===
sub_459630(argc, argv); // proc_command_line (276 flags)
// === Stack limit adjustment ===
if (dword_106C064 // --modify_stack_limit (default: ON)
&& !getrlimit(RLIMIT_STACK, &rlimits))
{
original_stack = rlimits.rlim_cur;
rlimits.rlim_cur = rlimits.rlim_max; // raise to hard limit
stack_was_raised = (setrlimit(RLIMIT_STACK, &rlimits) == 0);
}
// === PHASE 3: Heavy initialization (38 subsystem calls + validation) ===
sub_585DB0(); // fe_one_time_init
// └─ sub_585EE0() fe_init_part_1: appears only on the sentinel-check error path of sub_585DB0 (33 per-unit inits, output file, keywords)
if (dword_106C0A4) // --timing enabled?
sub_5AF350(&t_fe_start); // capture frontend start
// === PHASE 4: Translation unit setup ===
sub_7A4860(); // reset_tu_state (zero 6 TU globals)
// === PHASE 5: Frontend parse + IL build ===
sub_7A40A0(qword_126EEE0); // process_translation_unit
// === PHASE 6: Frontend wrapup (5-pass IL finalization) ===
sub_588F90(qword_126EEE0, 1); // fe_wrapup
if (dword_106C0A4) {
sub_5AF350(&t_fe_end);
sub_5AF390("Front end time", &t_fe_start, &t_fe_end);
}
// --- Error gate: skip backend if frontend had errors ---
if (!qword_126ED90) goto backend; // no errors → run backend
dword_106C254 = 1; // skip-backend flag
// === Linearized exit loop (executes once) ===
while (1) {
exit_code = 8; // default: warnings
sub_6B8B20(0); // reset file state
sub_589530(); // write signoff + close source mgr
if (!qword_126ED90) // re-check after wrapup
exit_code = qword_126ED88 ? 5 : 3; // success codes
if (dword_106C0A4) {
sub_5AF350(&t_total_end);
sub_5AF390("Total compilation time", &t_total_start, &t_total_end);
}
if (stack_was_raised) { // restore original stack limit
rlimits.rlim_cur = original_stack;
setrlimit(RLIMIT_STACK, &rlimits);
}
sub_5AF1D0(exit_code); // __noreturn: exit() or abort()
backend:
if (!dword_106C254) { // backend not skipped
if (dword_106C0A4)
sub_5AF350(&t_be_start);
sub_489000(); // process_file_scope_entities (backend)
if (dword_106C0A4) {
sub_5AF350(&t_be_end);
sub_5AF390("Back end time", &t_be_start, &t_be_end);
}
}
}
}
The while(1) never actually loops. The call to sub_5AF1D0 is __noreturn (it calls exit() or abort() internally), so control never reaches the second iteration. The compiler arranged the basic blocks this way because the backend code at backend: is reached via a goto from the error-gate check, placing it logically "after" the exit call in the CFG.
Phase 1: fe_pre_init -- sub_585D60 (0x585D60)
The first thing main() does after redirecting stderr is call sub_585D60, which performs the absolute minimum initialization needed before command-line parsing can proceed. This function lives in fe_init.c and makes 9 sequential calls to subsystem pre-initializers, plus two inline global assignments.
Pre-Init Call Table
| # | Address | Identity | Source | Purpose |
|---|---|---|---|---|
| 1 | sub_48B3C0 | error_pre_init | error.c | Zero 4 error-tracking globals: qword_1065870=0, qword_1065868=0, dword_1065860=-1, qword_1065858=0 |
| 2 | sub_6BB290 | source_file_mgr_pre_init | srcfile.c | Zero 10 file descriptor table globals: file chain head, file count, file hash, include stack |
| 3 | sub_5B1E70 | host_envir_early_init | host_envir.c | Heaviest pre-init call. Signal handlers, locale, CWD capture, env vars. See below. |
| 4 | sub_752C90 | type_system_pre_init | type.c | Set dword_126E4A8=-1 (dialect version unset), call sub_7515D0 (type table alloc), set host compiler defaults (qword_126E1F0=70300 = GCC 7.3.0 default), init 3 type comparison descriptor pools via sub_465510 |
| 5 | sub_45EB40 | cmd_line_pre_init | cmd_line.c | Zero the 272-flag was-set bitmap (byte_E7FF40, 0x110 bytes), set dword_E7FF20=1 (skip argv[0]), initialize ~350 global config variables to defaults. Notable: dword_106C064=1 (stack limit adjustment ON by default) |
| 6 | sub_4ED530 | declaration_pre_init | decls.c | Set stderr into two global stream pointers, zero error/warning counters (qword_126ED80..qword_126EDE0), set diagnostic defaults (byte_126ED69=5, byte_126ED68=8, qword_126ED60=100 max errors), clear 15.2KB diagnostic severity table (byte_1067920, 0x3B50 bytes) |
| 7 | sub_6F6020 | il_pre_init | il.c | Zero 3 globals: dword_12C6C8C=0 (PCH event counter), qword_12C6EC0=0, qword_12C6EB8=0 |
| -- | (inline) | scope_index_init | fe_init.c | dword_126C5E4 = -1 (current scope stack index = "none"), dword_126C5C8 = -1 (secondary scope index = "none") |
| 8 | sub_7A48B0 | tu_tracking_pre_init | trans_unit.c | Zero 13 TU tracking globals: source filename, compilation mode flags, TU stack pointers, PCH state |
| 9 | sub_7C00F0 | template_pre_init | template.c | Single assignment: dword_106BA20 = 0 (template nesting depth = 0) |
host_envir_early_init (sub_5B1E70) Detail
This is the most substantial pre-init call. It initializes the host environment interface layer from host_envir.c:
Signal handlers (one-time, guarded by dword_E6E120):
| Signal | Handler | Behavior |
|---|---|---|
| SIGINT (2) | handler at 0x5AF2C0 | Write newline to stderr, call sub_5AF2B0(9) which writes signoff then exit(4) |
| SIGTERM (15) | handler at 0x5AF2C0 | Same as SIGINT |
| SIGXCPU (24) | sub_5AF270 | Print "Internal error: CPU time limit exceeded.\n", call sub_5AF1D0(11) which calls abort() |
| SIGXFSZ (25) | SIG_IGN | Ignored (prevents crash on large output files) |
After signal setup, dword_E6E120 is set to 0 so handlers are registered only once.
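The one-time registration pattern can be sketched as below. This is an illustration under assumptions: handlers_pending plays the role of dword_E6E120, the handler names are hypothetical, and the interrupt handler calls _exit(4) directly where the real one goes through sub_5AF2B0(9) and its signoff path.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static volatile sig_atomic_t handlers_pending = 1;   /* role of dword_E6E120 */

/* SIGINT / SIGTERM: newline to stderr, then terminate with status 4. */
static void on_interrupt(int signo)
{
    ssize_t ignored = write(STDERR_FILENO, "\n", 1);
    (void)ignored;
    (void)signo;
    _exit(4);
}

/* SIGXCPU: report the CPU limit and abort. */
static void on_cpu_limit(int signo)
{
    static const char msg[] = "Internal error: CPU time limit exceeded.\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;
    (void)signo;
    abort();
}

/* Returns 1 if handlers were installed by this call, 0 if already done. */
static int install_handlers_once(void)
{
    void (*prev)(int);

    if (!handlers_pending)
        return 0;
    signal(SIGINT, on_interrupt);
    signal(SIGTERM, on_interrupt);
    signal(SIGXCPU, on_cpu_limit);
    prev = signal(SIGXFSZ, SIG_IGN);     /* ignore file-size-limit signals */
    assert(prev != SIG_ERR);
    handlers_pending = 0;                /* later calls become no-ops */
    return 1;
}
```

The handlers use only write(), _exit(), and abort(), which are async-signal-safe; that restriction is a property of this sketch and is not a claim about the binary's handlers.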
Locale: Calls newlocale(LC_NUMERIC, "C", 0) then uselocale() to force the C locale for numeric output. If either call fails, asserts with "could not set LC_NUMERIC locale" at host_envir.c:264.
Working directory: Iteratively calls getcwd() with a growing buffer (starting at 256 bytes, expanding by 256 on ERANGE) until it fits, then copies the result into qword_126EEA0 via permanent allocation.
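The growing-buffer loop can be sketched as follows. It assumes plain realloc() in place of the permanent allocator that fills qword_126EEA0, and the function name capture_cwd is hypothetical; the 256-byte start and 256-byte growth step on ERANGE follow the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns a heap-allocated copy of the current directory, or NULL on failure.
 * The caller owns the result and releases it with free(). */
static char *capture_cwd(void)
{
    size_t size = 256;          /* initial buffer size */
    char *buf = NULL;

    for (;;) {
        char *grown = realloc(buf, size);
        if (grown == NULL) {    /* out of memory: release what we had */
            free(buf);
            return NULL;
        }
        buf = grown;
        if (getcwd(buf, size) != NULL)
            return buf;         /* path fit in the buffer */
        if (errno != ERANGE) {  /* a real error, not "buffer too small" */
            free(buf);
            return NULL;
        }
        size += 256;            /* grow by a fixed step and retry */
    }
}
```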
Environment variables:
- EDG_BASE -- read into qword_126EE38 (base path for EDG data files; empty string if unset)
- EDG_SUPPRESS_ASSERTION_LINE_NUMBER -- if set and not "0", sets dword_126ED40 = 1 (suppress line numbers in internal assertion messages)
CPU time limit: Calls getrlimit(RLIMIT_CPU) then setrlimit() with rlim_cur = RLIM_INFINITY to disable the CPU time limit.
Global zeroing: Zeros ~50 host-environment globals including file descriptors, path buffers, platform flags, output filename pointers.
Language mode: Sets dword_126EFB4 = 2 (default to C++ mode -- this is later overridden by CLI parsing if -x c is specified).
Sentinel validation: Checks off_E6E0E0 against the string "last" to verify that the predef_macro_mode_names table was properly initialized at link time. On mismatch, asserts with "predef_macro_mode_names not initialized properly" at host_envir.c:6927.
Stack Limit Adjustment
Between CLI parsing and heavy initialization, main() conditionally raises the process stack limit:
if (dword_106C064 && !getrlimit(RLIMIT_STACK, &rlimits)) {
original_stack = rlimits.rlim_cur;
rlimits.rlim_cur = rlimits.rlim_max; // raise soft to hard limit
stack_was_raised = (setrlimit(RLIMIT_STACK, &rlimits) == 0);
}
The flag dword_106C064 is set to 1 by default in sub_45EB40 (cmd_line_pre_init) and can be disabled via the --modify_stack_limit=false CLI flag. The purpose is to prevent stack overflow during deep recursion in the C++ parser, template instantiation engine, and constexpr interpreter. After compilation completes (just before exit), main() restores the original rlim_cur value.
Phase 3: fe_one_time_init -- sub_585DB0 (0x585DB0)
This is the heaviest initialization stage. It zeroes the token state (qword_126DD38 -- 6 bytes packed as a dword + word), optionally calls sub_5AF330 for profiling init if dword_106BD4C is set, then makes 38 sequential calls to subsystem one-time initializers.
One-Time Init Call Table
| # | Address | Identity | Source file |
|---|---|---|---|
| 1 | sub_752DF0 | type_one_time_init | type.c |
| 2 | sub_5B1D40 | scope_one_time_init | scope.c |
| 3 | sub_447430 | error_one_time_init | error.c |
| 4 | sub_4B37F0 | preprocessor_one_time_init | preproc.c |
| 5 | sub_4E8ED0 | declaration_one_time_init | decls.c |
| 6 | sub_4C0840 | attribute_one_time_init | attribute.c |
| 7 | sub_4A1B60 | name_one_time_init | lookup.c |
| 8 | sub_4E9CF0 | declaration_one_time_init_2 | decl_spec.c |
| 9 | sub_4ED710 | declaration_one_time_init_3 | declarator.c |
| 10 | sub_510C30 | statement_one_time_init | stmt.c |
| 11 | sub_56DC90 | exprutil_one_time_init | exprutil.c |
| 12 | sub_5A5160 | expression_one_time_init | expr.c |
| 13 | sub_603B00 | parser_one_time_init | parse.c |
| 14 | sub_5CF7F0 | class_one_time_init | class_decl.c |
| 15 | sub_65DC50 | overload_one_time_init | overload.c |
| 16 | sub_69C8B0 | template_one_time_init | template.c |
| 17 | sub_665A00 | instantiation_one_time_init | instantiate.c |
| 18 | sub_689550 | exception_one_time_init | except.c |
| 19 | sub_68F640 | conversion_one_time_init | convert.c |
| 20 | sub_6B6510 | il_one_time_init | il.c |
| 21 | sub_6BAE70 | srcfile_one_time_init | srcfile.c |
| 22 | sub_6F5FC0 | il_walk_one_time_init | il_walk.c |
| 23 | sub_6F8300 | il_one_time_init_2 | il.c |
| 24 | sub_6FDFF0 | lower_one_time_init | lower_il.c |
| 25 | sub_726DC0 | mangling_one_time_init | lower_name.c |
| 26 | sub_72D410 | mangling_one_time_init_2 | lower_name.c |
| 27 | sub_74B9A0 | typecheck_one_time_init | typecheck.c |
| 28 | sub_710B70 | il_one_time_init_3 | il.c |
| 29 | sub_76D630 | codegen_one_time_init | cp_gen_be.c |
| 30 | nullsub_11 | debug_one_time_init | debug.c (no-op) |
| 31 | sub_7A4690 | allocation_one_time_init | il_alloc.c |
| 32 | sub_7A3920 | pool_one_time_init | il_alloc.c |
| 33 | sub_6A0E90 | template_one_time_init_2 | template.c |
| 34 | sub_418F80 | diagnostics_one_time_init | diag.c |
| 35 | sub_5859C0 | extasm_one_time_init | extasm.c |
| 36 | sub_751540 | type_one_time_init_2 | type.c |
| 37 | sub_7C25F0 | template_one_time_init_3 | template.c |
| 38 | sub_7DF400 | cuda_one_time_init | nv_transforms.c |
The call order reflects dependency constraints: types before scopes, scopes before declarations, declarations before expressions, expressions before the parser, etc. Template initialization is split across three calls (#16, #33, #37) because different phases of template support depend on different subsystems being initialized first.
Function Pointer Table Validation
After all 38 initializers complete, sub_585DB0 performs a critical integrity check:
if (funcs_6F71AE || off_D560C0 != nullsub_6)
sub_4F21C0("function_pointers is incorrectly initialized");
This validates two conditions:
- funcs_6F71AE must be zero. This global acts as a "dirty flag" -- if any initializer wrote a nonzero value here, the table was not properly zeroed during static initialization.
- off_D560C0 must point to nullsub_6 (0x585B00). The address off_D560C0 is the last entry in a function pointer dispatch table; 0xD560C0 falls inside the .data range (0xD46480..0xE7EFF0) of the segment layout, not .rodata. The empty function nullsub_6 acts as a sentinel -- its known address is compared against the table's last slot to verify that the table was correctly populated at link time. If the linker reordered or dropped entries, the sentinel would not match.
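The two conditions can be illustrated with a small, self-contained model of a sentinel-terminated dispatch table. Everything here is hypothetical naming: table_sentinel stands in for nullsub_6, the dirty argument for funcs_6F71AE, and the three-entry table for the real dispatch table whose contents are not documented on this page.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*handler_fn)(void);

static void handler_a(void) {}
static void handler_b(void) {}
static void table_sentinel(void) {}   /* plays the role of nullsub_6 */

/* The last slot must hold the sentinel. */
static const handler_fn dispatch_table[] = {
    handler_a,
    handler_b,
    table_sentinel,
};

/* Returns 1 when both conditions hold: the dirty flag is zero and the
 * final slot of the table is the sentinel function. */
static int table_is_intact(const handler_fn *table, size_t count, int dirty)
{
    assert(table != NULL);
    return dirty == 0 && count > 0 && table[count - 1] == table_sentinel;
}
```

A truncated or shifted table fails the check because its last slot no longer holds the sentinel's address, which is the failure mode the text attributes to reordered or dropped entries.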
If either check fails, sub_4F21C0 emits a fatal diagnostic ("function_pointers is incorrectly initialized") and then falls through to sub_585EE0 (fe_init_part_1) regardless. This is a non-recoverable error that will likely cause crashes later, but the code attempts to continue. The sub_585EE0 call on this path appears to be a fallthrough from the panic handler.
Correction from the sweep report: Examination of the actual decompiled code shows that sub_585EE0 (fe_init_part_1) is called only on the error path of the sentinel check within sub_585DB0. On the normal (no-error) path, sub_585DB0 does not call sub_585EE0: it returns sub_7DF400()'s return value directly, and main() proceeds. The actual invocation of fe_init_part_1 in the normal flow must therefore occur elsewhere in the pipeline (likely from within one of the subsystem initializers or from sub_7A40A0).
fe_init_part_1 -- sub_585EE0 (0x585EE0)
This function performs per-compilation-unit initialization. It is identified by the debug trace string "fe_init_part_1" at level 5 and an assertion path fe_init.c:2007. Its responsibilities:
Compilation Timestamp
time(&timer);
char *t = ctime(&timer);
if (!t) t = "Sun Jan 01 00:00:00 1900\n";
if (strlen(t) > 127)
assert("fe_init.c", 2007, "fe_init_part_1"); // buffer overflow guard
strcpy(byte_106B5C0, t); // 128-byte timestamp buffer
dword_126EE48 = 1; // init-complete flag
Per-Unit Initializer Call Table
After the timestamp, sub_585EE0 calls 33 per-compilation-unit initializers:
| # | Address | Identity |
|---|---|---|
| 1 | sub_4ED7C0 | declaration_unit_init |
| 2 | nullsub_7 | (no-op placeholder) |
| 3 | sub_65DC20 | overload_unit_init |
| 4 | sub_6BB350 | srcfile_unit_init |
| 5 | sub_5B22E0 | scope_unit_init |
| 6 | sub_603B30 | parser_unit_init |
| 7 | sub_5D0170 | class_unit_init |
| 8 | sub_61EBD0 | expression_unit_init |
| 9 | sub_68A0D0 | exception_unit_init |
| 10 | sub_74BFF0 | typecheck_unit_init |
| 11 | sub_710DE0 | il_unit_init |
| 12 | sub_4E8F10 | declaration_unit_init_2 |
| 13 | sub_4C0860 | attribute_unit_init |
| 14 | nullsub_2 | (no-op placeholder) |
| 15 | sub_4474D0 | error_unit_init |
| 16 | sub_665A60 | instantiation_unit_init |
| 17 | sub_4E9D10 | decl_spec_unit_init |
| 18 | sub_76D780 | codegen_unit_init |
| 19 | sub_7C0300 | template_unit_init |
| 20 | sub_7A3980 | pool_unit_init |
| 21 | sub_56DEE0 | exprutil_unit_init |
| 22 | nullsub_10 | (no-op placeholder) |
| 23 | sub_6B6890 | il_unit_init_2 |
| 24 | sub_726EE0 | mangling_unit_init |
| 25 | sub_6F5DA0 | il_walk_unit_init |
| 26 | sub_6F8320 | il_unit_init_3 |
| 27 | sub_6FE130 | lower_unit_init |
| 28 | sub_752FC0 | type_unit_init |
| 29 | sub_4660B0 | folding_unit_init |
| 30 | sub_5943E0 | float_unit_init |
| 31 | sub_6A0F40 | template_unit_init_2 |
| 32 | sub_4190B0 | diagnostics_unit_init |
| 33 | sub_7C2640 | template_unit_init_3 |
Compilation Mode Flags
After the per-unit initializers, sub_585EE0 copies global configuration values (set during CLI parsing) into the compilation-mode descriptor at 0x126EB88:
| Field | Address | Source | Meaning |
|---|---|---|---|
| byte_126EB88 | 0x126EB88 | dword_126E498 | Dialect flags |
| byte_126EBB0 | 0x126EBB0 | dword_126EFB4 == 1 | K&R C mode |
| dword_126EBA8 | 0x126EBA8 | dword_126EFB4 != 2 | Not-C++ flag |
| dword_126EBAC | 0x126EBAC | dword_126EF68 | C standard version |
| byte_126EBB8 | 0x126EBB8 | dword_126EFB0 | Strict C mode |
| byte_126EBB9 | 0x126EBB9 | dword_126EFAC | EDG GNU-compat extensions |
| byte_126EBBA | 0x126EBBA | dword_126EFA4 | Clang extensions enabled |
| xmmword_126EBC0 | 0x126EBC0 | qword_126EF90 | Clang + GNU version thresholds (16 bytes packed) |
Output File Setup
if (dword_106C298) { // output enabled
if (qword_106C278) // output path specified
qword_106C280 = sub_4F48F0(qword_106C278, 0, 0, 16, 1513); // open file at the specified path (ID 1513)
else
qword_106C280 = stdout; // default to stdout
}
sub_5AEDB0(); // write output header
The output file ID 1513 is one of three output file slots used during compilation (1513, 1514, 1515).
Initialization Summary
The total initialization sequence before parsing begins involves 80+ subsystem init calls across three layers:
main()
├─ sub_585D60() fe_pre_init 9 subsystem pre-inits
│ ├─ sub_48B3C0 error 4 globals zeroed
│ ├─ sub_6BB290 srcfile 10 globals zeroed
│ ├─ sub_5B1E70 host_envir signals, locale, CWD, env vars, ~50 globals
│ ├─ sub_752C90 types type table alloc, compiler defaults
│ ├─ sub_45EB40 cmd_line 272-flag bitmap, ~350 config defaults
│ ├─ sub_4ED530 declarations error counters, diagnostic severity table (15KB)
│ ├─ sub_6F6020 il 3 globals zeroed
│ ├─ [inline] scope indices dword_126C5E4 = dword_126C5C8 = -1
│ ├─ sub_7A48B0 tu_tracking 13 globals zeroed
│ └─ sub_7C00F0 templates 1 global zeroed
│
├─ sub_459630() proc_command_line 276 flags → ~150 config globals
│
├─ [RLIMIT_STACK adjustment] raise soft limit to hard limit
│
└─ sub_585DB0() fe_one_time_init 38 subsystem one-time inits
├─ token state zeroing qword_126DD38 = 0 (6 bytes)
├─ 38 subsystem calls types → scopes → errors → ... → CUDA
├─ sentinel check funcs_6F71AE == 0 && off_D560C0 == nullsub_6
└─ sub_585EE0() fe_init_part_1 (on error path, or called from subsystem)
├─ compilation timestamp byte_106B5C0 via ctime()
├─ 33 per-unit inits declarations → overload → ... → templates
├─ compilation mode flags copy CLI config into descriptor struct
├─ output file open stdout or file (ID 1513)
└─ sub_5AEDB0() write output header
Global State Set Before Parsing
By the time sub_7A40A0 (process_translation_unit) is called, the following critical globals have been established:
| Global | Address | Value | Set by |
|---|---|---|---|
| dword_126EFB4 | 0x126EFB4 | 2 (C++) | sub_5B1E70 default, may be overridden by CLI |
| dword_126EF68 | 0x126EF68 | C/C++ standard version | CLI parsing |
| dword_106C064 | 0x106C064 | 1 (stack limit ON) | sub_45EB40 default |
| dword_106C0A4 | 0x106C0A4 | 0 or 1 | CLI --timing flag |
| qword_126EEE0 | 0x126EEE0 | source filename | CLI parsing |
| qword_106C280 | 0x106C280 | output FILE* | sub_585EE0 |
| qword_126EDF0 | 0x126EDF0 | stderr | main() + sub_4ED530 |
| dword_126EE48 | 0x126EE48 | 1 | sub_585EE0 (init-complete flag) |
| byte_106B5C0 | 0x106B5C0 | ctime string | sub_585EE0 (compilation timestamp) |
| dword_126C5E4 | 0x126C5E4 | -1 then updated | sub_585D60 then scope init |
| qword_126F120 | 0x126F120 | C locale handle | sub_5B1E70 |
| qword_126EEA0 | 0x126EEA0 | CWD string copy | sub_5B1E70 |
The Error Gate
The transition from frontend to backend is controlled by a simple error check:
if (!qword_126ED90) // qword_126ED90 = error count from frontend
goto backend_label; // no errors → run backend
dword_106C254 = 1; // errors → set skip-backend flag
When dword_106C254 == 1, the backend stage (sub_489000) is skipped entirely. The process still writes a signoff trailer and exits with a nonzero status code. This means a cudafe++ compilation with frontend errors produces no .int.c output file -- the backend never runs.
Exit Code Mapping
The exit function sub_5AF1D0 at 0x5AF1D0 maps internal status codes to process exit codes:
| Internal Code | Meaning | Process Exit | Message |
|---|---|---|---|
| 3, 4, 5 | Success (various) | exit(0) | (none) |
| 8 | Warnings only | exit(2) | (none) |
| 9, 10 | Compilation errors | exit(4) | "Compilation terminated.\n" |
| 11 | Internal error | abort() | "Compilation aborted.\n" |
| (other) | Unknown/fatal | abort() | (none) |
In SARIF mode (dword_106BBB8 set), the text messages ("Compilation terminated.", "Compilation aborted.") are suppressed, but exit codes remain identical.
Cross-References
- Pipeline Overview -- complete 8-stage pipeline diagram
- CLI Processing -- detailed breakdown of sub_459630 and all 276 flags
- Frontend Invocation -- sub_7A40A0 (process_translation_unit) internals
- Frontend Wrapup -- 5-pass architecture of sub_588F90
- Backend Code Generation -- sub_489000 (.int.c emission)
- Timing & Exit -- sub_5AF350/sub_5AF390/sub_5AF1D0 details
- EDG Overview -- EDG 6.6 source tree and NVIDIA modifications
- EDG Lexer -- keyword registration performed during sub_5863A0
CLI Processing
proc_command_line (sub_459630) at 0x459630 is a 21,773-byte function (4,105 decompiled lines, 296 callees) in cmd_line.c that parses the entire cudafe++ command line. It registers 276 flags into a flat lookup table, iterates argv with prefix-matching against that table, dispatches each matched flag through a 275-case switch statement, then resolves language dialect settings and opens output files. This function is the second stage of the pipeline, called directly from main() at 0x408950 before any heavy initialization.
Nobody invokes cudafe++ directly. NVIDIA's driver compiler nvcc decomposes its own options and passes the appropriate low-level flags via -Xcudafe <flag>. The full flag inventory is in CLI Flag Inventory; this page documents the implementation mechanics of the parsing system itself.
Key Facts
| Property | Value |
|---|---|
| Address | 0x459630 |
| Binary size | 21,773 bytes |
| Decompiled lines | 4,105 |
| Source file | cmd_line.c |
| Signature | int64_t proc_command_line(int argc, char** argv) |
| Direct callees | 296 |
| Flag table base | dword_E80060 |
| Flag table entry size | 40 bytes |
| Flag table capacity | 552 entries (overflow panics via sub_40351D) |
| Registered flags | 276 |
| Switch cases | 275 (case IDs 1--275) |
| Default-suppressed diagnostics | 9 (1257, 1373, 1374, 1375, 1633, 2330, 111, 185, 175) |
Flag Table Layout
The flag table is a contiguous array starting at dword_E80060. Each of the 552 slots occupies 40 bytes. The current entry count is tracked in dword_E80058.
Offset Field Type Access pattern
------ ----- ---- --------------
+0 case_id int32 dword_E80060[idx * 10]
+8 name char* qword_E80068[idx * 5]
+16 short_char int16 word_E80070[idx * 20] (low byte = char, high byte = 1)
+17 is_valid int8 (high byte of short_char word, always 1)
+18 takes_value int8 byte_E80072[idx * 40]
+19 visible int8 (part of dword_E80080[idx * 10] at +32)
+20 is_boolean int8 byte_E80073[idx * 40]
+24 name_length int64 qword_E80078[idx * 5] (precomputed strlen)
+32 mode_flag int32 dword_E80080[idx * 10]
The flag-was-set bitmap at byte_E7FF40 spans 0x110 bytes (272 slots). When a flag is matched during parsing, the corresponding byte is set to 1 to record that the user explicitly provided it. The bitmap is zeroed by default_init (sub_45EB40) before every compilation.
Registration: sub_452010 (init_command_line_flags)
sub_452010 at 0x452010 is a 30,133-byte function (3,849 decompiled lines) that populates the entire flag table. It is called once, at line 280 of proc_command_line, before the parsing loop begins.
register_command_flag (sub_451F80)
Each flag is registered through sub_451F80 (25 lines), called approximately 275 times from sub_452010:
void register_command_flag(
int case_id, // dispatch ID for the switch (1-275)
char* name, // flag name without dashes ("preprocess", "timing", etc.)
char short_opt, // single-char alias ('E', '#', etc.), 0 for none
char takes_value, // 1 if the flag requires =<value>
int mode_flag, // visibility/classification (mode vs. action)
char enabled // whether the flag is active (1 = registered, 0 = disabled)
);
The function writes into the next free slot at index dword_E80058, precomputes strlen(name) into name_length, always sets the is_valid byte to 1, then increments the counter. If the counter reaches 552, it panics via sub_40351D -- the table is statically sized.
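A sketch of the registration step, assuming the field offsets from the layout table above on an LP64 target (where this struct packs to 40 bytes). The names flag_entry_t, flag_table, and register_flag are hypothetical. The sketch omits the enabled parameter because the page does not say which field it lands in, and it returns -1 on overflow where the real function panics via sub_40351D.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FLAG_TABLE_CAPACITY 552

/* Field order follows the documented offsets (comments), LP64 assumed. */
typedef struct {
    int32_t     case_id;      /* +0  dispatch ID for the switch */
    const char *name;         /* +8  flag name without dashes */
    char        short_char;   /* +16 single-character alias, 0 for none */
    uint8_t     is_valid;     /* +17 always 1 once registered */
    uint8_t     takes_value;  /* +18 */
    uint8_t     visible;      /* +19 not written by this sketch */
    uint8_t     is_boolean;   /* +20 not written by this sketch */
    int64_t     name_length;  /* +24 precomputed strlen(name) */
    int32_t     mode_flag;    /* +32 */
} flag_entry_t;

static flag_entry_t flag_table[FLAG_TABLE_CAPACITY];
static int flag_count;        /* role of dword_E80058 */

/* Fill the next free slot; returns its index, or -1 when the table is full. */
static int register_flag(int case_id, const char *name, char short_opt,
                         char takes_value, int mode_flag)
{
    flag_entry_t *entry;

    assert(name != NULL);
    if (flag_count >= FLAG_TABLE_CAPACITY)
        return -1;
    entry = &flag_table[flag_count];
    entry->case_id = case_id;
    entry->name = name;
    entry->short_char = short_opt;
    entry->is_valid = 1;
    entry->takes_value = (uint8_t)takes_value;
    entry->name_length = (int64_t)strlen(name);
    entry->mode_flag = mode_flag;
    return flag_count++;
}
```

Storing the precomputed length lets the matching loop compare names without calling strlen() for every table entry on every argument.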
Paired Toggle Registration
Approximately half of all flags are boolean toggles registered as pairs: --flag and --no_flag share the same case_id but differ in which value they write. Pairs are registered in two ways:
- Two sequential register_command_flag calls -- both point to the same case_id; the parsing loop determines whether the matched name starts with no_ and sets the target global to 0 or 1 accordingly.
- Inline table population -- seven additional paired flags (relaxed_abstract_checking, concepts, colors, keep_restrict_in_signatures, check_unicode_security, old_id_chars, add_match_notes) are written directly into the array without going through register_command_flag.
Parsing Loop
Before iterating argv, proc_command_line performs five sequential setup steps (flag registration is the second), then enters the main argv iteration.
Pre-Loop Setup
Step 1: Initialize qword_126DD38, qword_126EDE8 (token state / source position)
Step 2: Call sub_452010() -- register all 276 flags
Step 3: Allocate 4 hash tables (16-byte header + 256-byte data each):
qword_106C248 macro define/alias map
qword_106C240 include path list
qword_106C238 system include map
qword_106C228 additional system include map
Step 4: Suppress 9 diagnostic numbers by default:
1257, 1373, 1374, 1375, 1633, 2330, 111, 185, 175
Each via sub_4ED400(number, suppress_severity, 1)
Step 5: Set dword_E7FF20 = 1 (argv index, skipping argv[0])
The default-suppressed diagnostics are EDG warnings that NVIDIA considers noise for CUDA compilation. Diagnostics 111 ("statement is unreachable"), 185 ("pointless comparison of unsigned integer with zero"), and 175 ("subscript out of range") are common false positives in template-heavy CUDA code.
argv Iteration
The loop processes argv[dword_E7FF20] through argv[argc-1]. For each argument:
- Dash detection -- if the argument does not start with -, it is treated as the input filename (stored in qword_126EEE0). Only one non-flag argument is expected.
- Short flag matching -- for single-dash arguments (-X), the parser scans the flag table for an entry whose short_char matches. If the flag takes_value, the next argv element is consumed as the value.
- Long flag matching -- for double-dash arguments (--flag-name), the parser calls parse_flag_name_value (sub_451EC0) to split on =:
// sub_451EC0: split "--name=value" into name and value
// Respects backslash escapes and quoted strings
// If no '=' found: *name_out = src, *value_out = NULL
void parse_flag_name_value(char* src, char** name_out, char** value_out);
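A splitter with this contract could look like the sketch below. The exact escape and quoting rules of sub_451EC0 are not reproduced here: treating a backslash as "skip the next character" and double quotes as the only quoting style are assumptions, and split_name_value is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split "name=value" in place at the first '=' that is neither escaped
   by a backslash nor inside double quotes. Returns a pointer to the
   value, or NULL when no such '=' exists (name is then all of src). */
static char *split_name_value(char *src)
{
    int in_quotes = 0;
    assert(src != NULL);
    for (char *p = src; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0') {
            p++;                      /* skip the escaped character */
        } else if (*p == '"') {
            in_quotes = !in_quotes;   /* toggle quoted state */
        } else if (*p == '=' && !in_quotes) {
            *p = '\0';                /* terminate the name */
            return p + 1;             /* value starts after '=' */
        }
    }
    return NULL;
}
```

Splitting in place avoids allocation; the caller keeps using src as the name.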
The name portion is then matched against the flag table using strncmp with each entry's precomputed name_length. The parser iterates all entries and counts exact and prefix matches:
- Exact match (length equals name_length and strncmp returns 0) -- dispatches immediately.
- Unique prefix match (only one entry's name starts with the given prefix) -- dispatches to that entry.
- Ambiguous prefix (multiple entries match the prefix) -- emits error 923 ("ambiguous command-line option").
- No match -- the argument is silently ignored or treated as input.
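The four outcomes above can be expressed as one small lookup routine. The sketch below works on a plain array of names rather than the real 40-byte entries; match_flag_name, MATCH_NONE, and MATCH_AMBIGUOUS are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MATCH_NONE = -1, MATCH_AMBIGUOUS = -2 };

/* Return the index of the entry selected by `query`, or MATCH_NONE /
   MATCH_AMBIGUOUS. An exact match wins even when the query is also a
   prefix of other names. */
static int match_flag_name(const char *const *names, size_t count,
                           const char *query)
{
    size_t qlen = strlen(query);
    int prefix_hits = 0;
    int last_hit = MATCH_NONE;

    assert(names != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strncmp(query, names[i], qlen) != 0)
            continue;                 /* query is not a prefix of names[i] */
        if (strlen(names[i]) == qlen)
            return (int)i;            /* exact match: dispatch immediately */
        prefix_hits++;
        last_hit = (int)i;
    }
    if (prefix_hits > 1)
        return MATCH_AMBIGUOUS;       /* the binary reports error 923 here */
    return last_hit;                  /* unique prefix hit or MATCH_NONE */
}
```

With the names "timing" and "trace_includes" both registered, the query "t" is ambiguous while "ti" resolves to timing.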
Conflict Detection
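Before the main loop, check_conflicting_flags (sub_451E80, 15 lines) validates that mutually exclusive flags were not specified together. It checks byte_E7FFF2 || byte_E80031 || byte_E80032 || byte_E80033. Measured from the bitmap base byte_E7FF40, these bytes sit at offsets 0xB2, 0xF1, 0xF2, and 0xF3, i.e. bitmap slots 178, 241, 242, and 243, which are the case IDs of c99, c11, c17, and c23 in the language standard table. If any conflict is detected, it emits error 1027 via sub_4F8480.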
The Dispatch Switch (275 Cases)
After a flag is matched, its case_id indexes into a giant switch statement occupying the bulk of proc_command_line. The following sections document the most important cases grouped by function.
Preprocessor Control (Cases 3--9)
| Case | Flag | Global(s) | Behavior |
|---|---|---|---|
| 3 | no_line_commands | dword_106C29C=1, dword_106C294=1, dword_106C288=0 | Suppress #line in preprocessor output |
| 4 | preprocess | dword_106C29C=1, dword_106C294=1, dword_106C288=1 | Preprocessor-only mode (output to stdout) |
| 5 | comments | (flag bitmap) | Preserve comments in preprocessor output |
| 6 | old_line_commands | (flag bitmap) | Use old-style # N "file" line directives |
| 8 | dependencies | (multiple) | Dependencies output mode (preprocessor-only + dependency emission) |
| 9 | trace_includes | (flag bitmap) | Print each #include as it is opened |
Compilation Mode (Cases 14, 20--26)
| Case | Flag | Global | Behavior |
|---|---|---|---|
| 14 | no_code_gen | dword_106C254 = 1 | Parse-only mode -- sets the skip-backend flag, preventing process_file_scope_entities from running |
| 20 | timing | dword_106C0A4 = 1 | Enable compilation phase timing. main() checks this flag to decide whether to call sub_5AF350/sub_5AF390 for "Front end time", "Back end time", "Total compilation time" |
| 21 | version | (stdout) | Print the version banner and continue (does not exit). Banner includes: "cudafe: NVIDIA (R) Cuda Language Front End", "Portions Copyright (c) 2005, 2024-YYYY NVIDIA Corporation", "Portions Copyright (c) 1988-2018, 2024 Edison Design Group Inc.", "Based on Edison Design Group C/C++ Front End, version 6.6", "Cuda compilation tools, release 13.0, V13.0.88" |
| 22 | no_warnings | byte_126ED69 = 7 | Set diagnostic severity threshold to error-only (suppress all warnings and remarks) |
| 23 | promote_warnings | byte_126ED68 = 5 | Promote all warnings to errors |
| 24 | remarks | byte_126ED69 = 4 | Lower threshold to include remark-level diagnostics |
| 25 | c | calls sub_44C4F0(0) | Force C language mode (overrides default C++ if currently in C++ mode) |
| 26 | c++ | calls sub_44C4F0(2) | Force C++ language mode |
Diagnostic Control (Cases 39--44)
Cases 39--43 (diag_suppress, diag_remark, diag_warning, diag_error, diag_once) share the same value-parsing logic:
1. Read the value string (after '=')
2. Strip leading/trailing whitespace
3. Split on commas
4. For each token:
a. Parse as integer (diagnostic number)
b. Call sub_4ED400(number, severity, 1)
The severity values map to:
- Suppress = skip entirely
- Remark = informational (level 4)
- Warning = default warning (level 5)
- Error = hard error (level 7)
- Once = emit on first occurrence only
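The shared value-parsing steps can be sketched as below. This version collects the numbers into an array so it stays self-contained; in the binary each number is described as being passed straight to sub_4ED400. The name parse_diag_list and the stop-at-first-non-number behavior are assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Parse a comma-separated list of diagnostic numbers such as
   " 111, 185 ,175" into out[]. Returns how many numbers were stored. */
static size_t parse_diag_list(const char *value, long *out, size_t max_out)
{
    size_t n = 0;
    const char *p = value;

    assert(value != NULL && out != NULL);
    while (*p != '\0' && n < max_out) {
        /* Skip separators and surrounding whitespace. */
        while (isspace((unsigned char)*p) || *p == ',')
            p++;
        if (*p == '\0')
            break;
        char *end;
        long number = strtol(p, &end, 10);
        if (end == p)
            break;                /* token is not a number: stop parsing */
        out[n++] = number;
        p = end;
    }
    return n;
}
```

A caller would then loop over the result and apply the severity that belongs to the flag being processed (suppress, remark, warning, error, or once).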
Case 44 (display_error_number / no_display_error_number) toggles whether error codes appear in diagnostic messages.
CUDA-Specific Flags (Cases 45--89)
Output File Paths
| Case | Flag | Global | Description |
|---|---|---|---|
| 45 | gen_c_file_name | qword_106BF20 | Path for the generated .int.c file |
| 85 | gen_device_file_name | (has_arg global) | Device-side output file name |
| 86 | stub_file_name | (has_arg global) | Stub file output path |
| 87 | module_id_file_name | (has_arg global) | Module ID file path |
| 88 | tile_bc_file_name | (has_arg global) | Tile bitcode file path |
Data Model (Cases 65--66, 90--91)
| Case | Flag | Behavior |
|---|---|---|
| 65 | force-lp64 | LP64 model: pointer size=8, long size=8, specific type encodings for 64-bit |
| 66 | force-llp64 | LLP64 model (Windows): pointer size=8, long size=4 |
| 90 | m32 | ILP32 model: all type sizes set for 32-bit (pointer=4, long=4, etc.) |
| 91 | m64 | 64-bit mode (default on Linux x86-64) |
Device Compilation Control
| Case | Flag | Global | Description |
|---|---|---|---|
| 46 | msvc_target_version | dword_126E1D4 | MSVC version for compatibility emulation |
| 47 | host-stub-linkage-explicit | boolean | Use explicit linkage on generated host stubs |
| 48 | static-host-stub | boolean | Generate static (internal linkage) host stubs |
| 49 | device-hidden-visibility | boolean | Apply hidden visibility to device symbols |
| 52 | no-device-int128 | boolean | Disable __int128 type support on device |
| 53 | no-device-float128 | boolean | Disable __float128 type support on device |
| 54 | fe-inlining | dword_106C068 = 1 | Enable frontend inlining pass |
| 55 | modify-stack-limit | dword_106C064 | Whether main() raises the process stack limit via setrlimit. Default is ON. Value parsed as integer: nonzero enables, zero disables. |
| 71 | keep-device-functions | boolean | Do not strip unused device functions |
| 72 | device-syntax-only | boolean | Device-side syntax check without code generation |
| 77 | device-c | boolean | Relocatable device code (RDC) mode |
| 82 | debug_mode | dword_106BFC4=1, dword_106BFC0=1, dword_106BFBC=1 | Full debug mode (sets three debug globals simultaneously) |
| 89 | tile-only | boolean | Tile-only compilation mode |
Template Instantiation (Case 16)
The instantiate flag takes a string value and sets dword_106C094:
| Value | dword_106C094 | Meaning |
|---|---|---|
| "none" | 0 | No implicit instantiation |
| "all" | 1 | Instantiate all referenced templates |
| "used" | 2 | Instantiate only used templates |
| "local" | 3 | Local instantiation only |
Include and Macro Arguments (Cases 29--31)
Cases 29 (include_directory / -I) and 167 (sys_include) append entries to linked lists via sub_4595D0:
// sub_4595D0: append_to_linked_list
// Allocates a 24-byte node: {next_ptr, string_ptr, int_field}
// Appends to singly-linked list with head/tail pointers
void append_to_linked_list(list_head*, char* string, int type);
A special case: -I- (the literal string "-") sets a flag for stdin include mode rather than appending to the path list. It calls sub_5AD0A0 for the actual path registration.
Case 30 (define_macro / -D) builds a linked list of macro definitions via sub_4595D0. Case 31 (undefine_macro / -U) allocates the same 24-byte node but marks the int_field as 1 to indicate undefine.
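The node shape and head/tail append described for sub_4595D0 can be sketched like this. malloc stands in for whatever allocator the binary uses, and arg_node, arg_list, and append_arg are hypothetical names; on an LP64 target the three fields plus padding come to the 24 bytes mentioned in the text.

```c
#include <assert.h>
#include <stdlib.h>

/* One list node: {next_ptr, string_ptr, int_field}. */
typedef struct arg_node {
    struct arg_node *next;
    const char      *string;  /* path or macro text, not copied */
    int              kind;    /* e.g. 0 = define, 1 = undefine */
} arg_node;

typedef struct arg_list {
    arg_node *head;
    arg_node *tail;
} arg_list;

/* Append in O(1) using the tail pointer, preserving command-line order. */
static void append_arg(arg_list *list, const char *string, int kind)
{
    arg_node *node = malloc(sizeof *node);
    assert(node != NULL);
    node->next   = NULL;
    node->string = string;
    node->kind   = kind;
    if (list->tail)
        list->tail->next = node;
    else
        list->head = node;        /* first element */
    list->tail = node;
}
```

Keeping insertion order matters for -D and -U, because a later undefine has to be applied after the define it cancels.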
Language Standard Selection (Cases 228, 240--252)
These cases set dword_126EF68 -- the internal value of __cplusplus or __STDC_VERSION__:
| Case(s) | Flag | dword_126EF68 | Standard |
|---|---|---|---|
| 228 | c++98 | 199711 | C++98/03 |
| 204 | c++11 | 201103 | C++11 |
| 240 | c++14 | 201402 | C++14 |
| 246 | c++17 | 201703 | C++17 |
| 251 | c++20 | 202002 | C++20 |
| 252 | c++23 | 202302 | C++23 |
| 178 | c99 | 199901 | C99 (calls set_c_mode) |
| 179 | pre-c99 | 199000 | Pre-C99 |
| 241 | c11 | 201112 | C11 |
| 242 | c17 | 201710 | C17 |
| 243 | c23 | 202311 | C23 |
| 7 | old_c | (K&R) | K&R C via sub_44C4F0(1) |
SM Architecture Target (Case 245)
case 245: // --target=<sm_arch>
dword_126E4A8 = sub_7525E0(value_string);
sub_7525E0 parses the SM architecture string (e.g., "sm_90", "sm_100") and returns the internal architecture code stored in dword_126E4A8. This value gates which CUDA features are available during compilation (see Architecture Feature Gating).
Host Compiler Compatibility (Cases 182--188)
| Case | Flag | Globals | Behavior |
|---|---|---|---|
| 182 | gcc / no_gcc | dword_126EFA8, dword_126EFB0 | Enable/disable GCC compatibility mode + GNU extensions |
| 184 | gnu_version | qword_126EF98 | GCC version number (default: 80100 = GCC 8.1.0). Parsed as integer. |
| 187 | clang / no_clang | dword_126EFA4 | Enable/disable Clang compatibility mode |
| 188 | clang_version | qword_126EF90 | Clang version number (default: 90100 = Clang 9.1.0) |
| 95 | pgc++ | boolean | PGI C++ compiler mode |
| 96 | icc | boolean | Intel ICC mode |
| 97 | icc_version | (has_arg) | Intel ICC version number |
| 98 | icx | boolean | Intel ICX (oneAPI DPC++) mode |
Raw Flag Manipulation (Case 193)
case 193: // --set_flag=<name>=<value> or --clear_flag=<name>
// Looks up <name> in off_D47CE0 (a name-to-address lookup table)
// Sets the corresponding global to <value> (integer)
This is a backdoor for nvcc to set arbitrary internal globals by name, used for flags that do not have dedicated case_id entries.
Output Mode (Case 274)
case 274: // --output_mode=text or --output_mode=sarif
if (strcmp(value, "text") == 0)
output_mode = 0; // plain text diagnostics (default)
else if (strcmp(value, "sarif") == 0)
output_mode = 1; // SARIF JSON diagnostics
SARIF (Static Analysis Results Interchange Format) output is used by IDE integrations and CI pipelines. When enabled, diagnostic messages are emitted as structured JSON instead of traditional file:line: error: format.
Dump Options (Case 273)
case 273: // --dump_command_options
// Iterates the entire flag table
// For each entry where is_valid == 1:
// printf("--%s ", name);
// Then exits
This is a diagnostic/debug mode that prints every registered flag name and exits. Used by nvcc to discover the cudafe++ flag namespace.
Post-Parsing: Dialect Resolution
After the argv loop exits, proc_command_line enters a massive dialect resolution block (approximately 800 lines). This phase reconciles the various mode flags into a consistent configuration.
Input Filename Extraction
The last non-flag argv element is the input filename, stored in qword_126EEE0. This pointer is later passed to process_translation_unit (sub_7A40A0) in stage 5 of the pipeline.
Memory Region Initialization
Eleven memory regions (numbered 1--11) are initialized with default configurations. These correspond to CUDA memory spaces (global, shared, constant, local, texture, etc.) and are used by the frontend to track address space qualifiers.
GCC/Clang Feature Resolution
The resolver checks GCC version thresholds to decide which extensions to enable:
GCC version thresholds (encoded as major*10000 + minor*100 + patch):
40299 (0x9D6B) -- GCC 4.2.99 boundary
40599 (0x9E97) -- GCC 4.5.99 boundary
40699 (0x9EFB) -- GCC 4.6.99 boundary
etc.
For each threshold, specific feature flags are conditionally enabled. For example, if GCC version >= 40599, rvalue references and variadic templates are enabled even if the language standard is technically C++03. This emulates how GCC provides extensions ahead of standards.
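The version encoding implied by the defaults (80100 for GCC 8.1.0) and by the thresholds (40599 for the 4.5.99 boundary) is major*10000 + minor*100 + patch. The helper names below are hypothetical, and the gate simply mirrors the ">= 40599" example from the text.

```c
#include <assert.h>

/* Encode a GCC-style version triple the way gnu_version is stored. */
static long encode_gnu_version(int major, int minor, int patch)
{
    assert(minor >= 0 && minor < 100 && patch >= 0 && patch < 100);
    return (long)major * 10000 + (long)minor * 100 + patch;
}

/* Hypothetical gate: nonzero when the emulated GCC version reaches
   the given threshold. */
static int gnu_version_at_least(long gnu_version, long threshold)
{
    return gnu_version >= threshold;
}
```

For example, an emulated GCC 4.6.0 encodes to 40600 and passes the 40599 gate, while GCC 4.5.0 (40500) does not.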
C++ Standard Feature Cascade
Based on the value of dword_126EF68 (__cplusplus), the resolver enables feature flags in a cascade:
199711 (C++98): base features only
201103 (C++11): + lambdas, rvalue_refs, auto_type, nullptr,
variadic_templates, unrestricted_unions,
delegating_constructors, user_defined_literals, ...
201402 (C++14): + digit_separators, generic lambdas, relaxed_constexpr
201703 (C++17): + exc_spec_in_func_type, aligned_new, if_constexpr,
structured_bindings, fold_expressions, ...
202002 (C++20): + concepts, modules, coroutines, consteval, ...
202302 (C++23): + deducing_this, multidimensional_subscript, ...
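One way to picture the cascade is a cumulative bitmask: each standard level adds bits on top of everything enabled by the earlier levels. The binary uses individual globals rather than a mask, so the FEAT_* constants and features_for_standard below are purely illustrative, and only one or two representative features per level are shown.

```c
#include <assert.h>

enum {
    FEAT_LAMBDAS           = 1 << 0,  /* C++11 */
    FEAT_RVALUE_REFS       = 1 << 1,  /* C++11 */
    FEAT_RELAXED_CONSTEXPR = 1 << 2,  /* C++14 */
    FEAT_IF_CONSTEXPR      = 1 << 3,  /* C++17 */
    FEAT_CONCEPTS          = 1 << 4,  /* C++20 */
    FEAT_DEDUCING_THIS     = 1 << 5   /* C++23 */
};

/* Map a __cplusplus value to the cumulative feature set. Every test is
   ">=", so a newer standard inherits all earlier additions. */
static unsigned features_for_standard(long cplusplus)
{
    unsigned f = 0;
    assert(cplusplus > 0);
    if (cplusplus >= 201103L) f |= FEAT_LAMBDAS | FEAT_RVALUE_REFS;
    if (cplusplus >= 201402L) f |= FEAT_RELAXED_CONSTEXPR;
    if (cplusplus >= 201703L) f |= FEAT_IF_CONSTEXPR;
    if (cplusplus >= 202002L) f |= FEAT_CONCEPTS;
    if (cplusplus >= 202302L) f |= FEAT_DEDUCING_THIS;
    return f;
}
```

With this shape, selecting c++17 yields the C++11 and C++14 bits as well, but not concepts.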
Conflict Validation
Post-dialect resolution performs consistency checks:
- If both gcc and clang modes are enabled, GCC takes precedence
- If cfront_2.1 or cfront_3.0 is set alongside modern C++ features, the modern features are silently disabled
- If no_exceptions is set but coroutines is requested, coroutines are disabled (they require exceptions)
Output File Opening
After all flags are resolved:
- The output .int.c file is opened (path from case 45 / gen_c_file_name, or stdout if path is "-")
- The error output file is opened if --error_output was specified (case 35)
- The listing file is opened if --list was specified (case 33)
Default Diagnostic Severity Overrides
Nine diagnostic numbers are suppressed by default before any user --diag_suppress flags are processed:
| Diagnostic | EDG meaning | Why suppressed |
|---|---|---|
| 1257 | (C++11 narrowing conversion in aggregate init) | Common in CUDA kernel argument forwarding |
| 1373 | (nonstandard extension used: zero-sized array in struct) | Used in CUDA runtime headers |
| 1374 | (nonstandard extension used: struct with no members) | Empty base optimization patterns |
| 1375 | (nonstandard extension used: unnamed struct/union) | Windows SDK compatibility |
| 1633 | (inline function linkage conflict) | Host/device function linkage edge cases |
| 2330 | (implicit narrowing conversion) | Template-heavy CUDA code triggers false positives |
| 111 | statement is unreachable | __builtin_unreachable() and device code control flow |
| 185 | pointless comparison of unsigned integer with zero | Generic template code comparing unsigned with zero |
| 175 | subscript out of range | Static analysis false positives in device intrinsics |
Users can override these defaults with explicit --diag_error=111 (or similar) on the command line, since user-specified severity always wins.
Key Helper Functions
| Function | Address | Lines | Identity | Role |
|---|---|---|---|---|
| sub_451E80 | 0x451E80 | 15 | check_conflicting_flags | Validates mutually exclusive flags (bitmap slots 178/241/242/243, derived from the byte offsets relative to byte_E7FF40) |
| sub_451EC0 | 0x451EC0 | 57 | parse_flag_name_value | Splits --name=value on =, respecting quotes and backslash escapes |
| sub_451F80 | 0x451F80 | 25 | register_command_flag | Inserts one entry into the flag table |
| sub_452010 | 0x452010 | 3,849 | init_command_line_flags | Registers all 276 flags (called once from proc_command_line) |
| sub_4595D0 | 0x4595D0 | 21 | append_to_linked_list | Allocates 24-byte node, appends to -D/-I argument lists |
| sub_45EB40 | 0x45EB40 | 470 | default_init | Zeros 350 global config variables + flag-was-set bitmap |
| sub_44C4F0 | 0x44C4F0 | -- | set_c_mode | Sets language mode: 0=C, 1=K&R, 2=C++ |
| sub_44C460 | 0x44C460 | -- | parse_integer_arg | Parses string argument as integer (used by error_limit, etc.) |
| sub_4ED400 | 0x4ED400 | -- | set_diagnostic_severity | Sets severity for a single diagnostic number |
Key Global Variables
| Variable | Address | Type | Set by | Description |
|---|---|---|---|---|
| dword_E80058 | 0xE80058 | int32 | register_command_flag | Current flag table entry count (max 552) |
| dword_E80060 | 0xE80060 | array | register_command_flag | Flag table base (40 bytes/entry) |
| byte_E7FF40 | 0xE7FF40 | byte[272] | Parsing loop | Flag-was-set bitmap |
| dword_E7FF20 | 0xE7FF20 | int32 | default_init | Current argv index (initialized to 1) |
| qword_126EEE0 | 0x126EEE0 | char* | Post-parse | Input source filename |
| dword_106C254 | 0x106C254 | int32 | Case 14 | Skip-backend flag (--no_code_gen) |
| dword_106C0A4 | 0x106C0A4 | int32 | Case 20 | Timing enabled (--timing) |
| dword_126EF68 | 0x126EF68 | int32 | Standard flags | __cplusplus / __STDC_VERSION__ value |
| dword_126EFB4 | 0x126EFB4 | int32 | Mode flags | Language mode (0=unset, 1=C, 2=C++) |
| dword_126EFA8 | 0x126EFA8 | int32 | Case 182 | GCC compatibility mode enabled |
| dword_126EFA4 | 0x126EFA4 | int32 | Case 187 | Clang compatibility mode enabled |
| qword_126EF98 | 0x126EF98 | int64 | Case 184 | GCC version (default 80100 = 8.1.0) |
| qword_126EF90 | 0x126EF90 | int64 | Case 188 | Clang version (default 90100 = 9.1.0) |
| dword_126EFB0 | 0x126EFB0 | int32 | Case 182 | GNU extensions enabled |
| dword_106C064 | 0x106C064 | int32 | Case 55 | Modify stack limit (default 1) |
| dword_126E4A8 | 0x126E4A8 | int32 | Case 245 | Target SM architecture code |
| dword_106C094 | 0x106C094 | int32 | Case 16 | Template instantiation mode (0--3) |
| byte_126ED69 | 0x126ED69 | int8 | Cases 22/24 | Diagnostic severity threshold |
| byte_126ED68 | 0x126ED68 | int8 | Case 23 | Warning promotion threshold |
| qword_106BF20 | 0x106BF20 | char* | Case 45 | Output .int.c file path |
| qword_106C248 | 0x106C248 | void* | Pre-loop | Macro define/alias hash table |
| qword_106C240 | 0x106C240 | void* | Pre-loop | Include path hash table |
| qword_106C238 | 0x106C238 | void* | Pre-loop | System include map hash table |
Annotated Parsing Flow
int64_t proc_command_line(int argc, char** argv)
{
// --- Phase 1: Global state init ---
qword_126DD38 = 0; // zero token state
qword_126EDE8 = 0; // zero source position
// --- Phase 2: Register all flags ---
init_command_line_flags(); // sub_452010: 3849 lines, 276 flags
// --- Phase 3: Allocate hash tables ---
qword_106C248 = alloc_hash_table(); // macro defines/aliases
qword_106C240 = alloc_hash_table(); // include paths
qword_106C238 = alloc_hash_table(); // system includes
qword_106C228 = alloc_hash_table(); // additional system includes
// --- Phase 4: Default diagnostic suppressions ---
set_diagnostic_severity(1257, SUPPRESS, 1);
set_diagnostic_severity(1373, SUPPRESS, 1);
set_diagnostic_severity(1374, SUPPRESS, 1);
set_diagnostic_severity(1375, SUPPRESS, 1);
set_diagnostic_severity(1633, SUPPRESS, 1);
set_diagnostic_severity(2330, SUPPRESS, 1);
set_diagnostic_severity(111, SUPPRESS, 1);
set_diagnostic_severity(185, SUPPRESS, 1);
set_diagnostic_severity(175, SUPPRESS, 1);
// --- Phase 5: Main parsing loop ---
for (int i = 1; i < argc; i++) {
char* arg = argv[i];
if (arg[0] != '-') {
qword_126EEE0 = arg; // input filename
continue;
}
// Split --name=value (long-flag path shown; -X short flags are matched on short_char instead)
char *name, *value;
parse_flag_name_value(arg + 2, &name, &value); // sub_451EC0
// Match against flag table
size_t name_len = strlen(name);
int match_count = 0;
int matched_id = -1;
for (int f = 0; f < dword_E80058; f++) {
if (strncmp(name, flag_table[f].name, name_len) == 0) {
if (name_len == flag_table[f].name_length) {
matched_id = flag_table[f].case_id; // exact match wins outright
match_count = 1; // discard earlier prefix hits
break;
}
match_count++; // prefix match
matched_id = flag_table[f].case_id;
}
}
if (match_count > 1) {
error(923); // "ambiguous command-line option"
continue;
}
if (match_count == 0)
continue; // no match: skip, never index the bitmap with -1
byte_E7FF40[matched_id] = 1; // mark flag as set
switch (matched_id) {
case 3: /* no_line_commands */ ...
case 4: /* preprocess */ ...
...
case 274: /* output_mode */ ...
case 275: /* incognito */ ...
}
}
// --- Phase 6: Post-parsing dialect resolution ---
// ~800 lines: resolve gcc/clang versions, cascade C++ features,
// validate consistency, open output files
// --- Phase 7: Memory region init (1-11) ---
// Initialize CUDA memory space descriptors
return 0;
}
Version Banner
Case 21 (--version / -v) prints the following banner to stdout (does not exit):
cudafe: NVIDIA (R) Cuda Language Front End
Portions Copyright (c) 2005, 2024-YYYY NVIDIA Corporation
Portions Copyright (c) 1988-2018, 2024 Edison Design Group Inc.
Based on Edison Design Group C/C++ Front End, version 6.6 (BUILD_DATE BUILD_TIME)
Cuda compilation tools, release 13.0, V13.0.88
Case 92 (--Version / -V) prints a different copyright format and then calls exit(1). This variant is used for machine-parseable version queries.
Relationship to Pipeline
proc_command_line is called as stage 2 of the pipeline, after fe_pre_init (sub_585D60) has initialized signal handlers, locale, working directory, and default config:
main()
|-- sub_585D60() [1] fe_pre_init (10 subsystem pre-initializers)
|-- sub_5AF350() capture_time (total start)
|-- sub_459630(argc, argv) [2] proc_command_line <-- THIS FUNCTION
|-- setrlimit() conditional stack raise (gated by dword_106C064)
|-- sub_585DB0() [3] fe_one_time_init (38 subsystem initializers)
...
By the time proc_command_line returns, every global configuration variable is set to its final value. The subsequent fe_one_time_init phase reads these globals to configure keyword tables, type system parameters, and per-translation-unit state.
Frontend Invocation
process_translation_unit (sub_7A40A0, 1267 bytes at 0x7A40A0, from EDG source file trans_unit.c) is the main frontend workhorse -- stage 5 of the pipeline. Called once from main(), it orchestrates the entire transformation from .cu source text to a fully-built EDG IL tree. The function allocates a 424-byte translation unit descriptor, opens the source file via the lexer, drives the C++ parser to completion, runs semantic analysis on the parsed declarations, and finally performs per-TU wrapup (stop-token verification, class linkage checking, module finalization). By the time it returns, every declaration, type, expression, and statement from the source has been parsed into IL nodes, CUDA execution-space attributes have been resolved, and the TU is linked into the global TU chain ready for the 5-pass fe_wrapup stage.
Key Facts
| Property | Value |
|---|---|
| Function | sub_7A40A0 (process_translation_unit) |
| Binary address | 0x7A40A0 |
| Binary size | 1267 bytes |
| EDG source | trans_unit.c |
| Confidence | DEFINITE (source file name and function names are embedded in the assertion calls for trans_unit.c lines 696, 725, and 556) |
| Signature | int process_translation_unit(char *filename, int is_recompilation, void *module_info) |
| Direct callees | 27 |
| Debug trace entry | "Processing translation unit %s\n" |
| Debug trace exit | "Done processing translation unit %s\n" |
| TU descriptor size | 424 bytes (allocated via sub_6BA0D0) |
| TU stack entry size | 16 bytes ([0]=next, [8]=tu_ptr) |
Annotated Decompilation
int process_translation_unit(char *filename, // source file path
int is_recompilation, // nonzero on error-retry pass
void *module_info) // non-NULL for C++20 module TUs
{
bool is_primary = (module_info == NULL);
// --- Debug trace on entry ---
if (debug_verbosity > 0 || (debug_enabled && trace_category("trans_unit")))
fprintf(stderr, "Processing translation unit %s\n", filename);
// --- Module-mode state validation ---
// If this is a primary TU (no module_info) but we've already seen a module TU,
// that's an internal consistency error.
if (is_recompilation || !is_primary) {
if (module_info)
has_seen_module_tu = 1; // dword_12C7A88
} else if (has_seen_module_tu) {
assertion_failure("trans_unit.c", 696, "process_translation_unit", 0, 0);
}
// --- Save previous TU state if any ---
if (current_translation_unit) // qword_106BA10
save_translation_unit_state(current_translation_unit); // sub_7A3A50
// --- Reset per-TU compilation state ---
current_source_position = 0; // qword_126DD38
is_recompilation_flag = is_recompilation; // dword_106BA08
current_filename = filename; // qword_106BA00
has_module_info = (module_info != NULL); // dword_106B9F8
// --- Initialize error/parser state ---
reset_error_state(); // sub_5EAEC0
if (is_recompilation)
fe_init_part_1(); // sub_585EE0
// ==========================================================
// PHASE 1: Allocate and initialize TU descriptor (424 bytes)
// ==========================================================
registration_complete = 1; // dword_12C7A8C
tu_descriptor *tu = allocate_storage(424); // sub_6BA0D0
tu->next_tu = NULL; // [0]
++tu_count; // qword_12C7A78
tu->storage_buffer = allocate_storage(per_tu_storage_size); // [16], sub_6BA0D0
tu->tu_name = NULL; // [8]
init_scope_state((char *)tu + 24); // sub_7046E0, byte offsets [24..192]
tu->field_192 = 0;
tu->field_352 = 0;
tu->field_184 = 0;
memset(&tu->scope_decl_area, 0, ...); // [200..360] zeroed
tu->field_360 = 0;
tu->field_368 = 0;
tu->field_376 = 0;
tu->flags = 0x0100; // [392] = "initialized"
tu->error_severity_count = 0; // [408]
tu->field_416 = 0;
// --- Copy registered variable defaults into per-TU storage ---
for (reg = registered_variable_list; reg; reg = reg->next) {
if (reg->offset_in_tu)
*(tu + reg->offset_in_tu) = reg->variable_value;
}
// --- Set module info pointer and primary flag ---
tu->module_info_ptr = module_info; // [376]
tu->is_primary = is_primary; // [392] byte 0
// ==========================================================
// PHASE 2: Link TU into global chains
// ==========================================================
// --- Set as primary TU if this is the first ---
if (primary_translation_unit == NULL) { // qword_106B9F0
primary_translation_unit = tu;
if (!is_recompilation)
assertion_failure("trans_unit.c", 725, "process_translation_unit", 0, 0);
}
// --- Push onto TU stack ---
current_translation_unit = tu; // qword_106BA10
// (stack entry allocated from free list or via permanent alloc)
stack_entry = alloc_stack_entry(); // 16 bytes
stack_entry->tu_ptr = tu;
stack_entry->next = tu_stack_top;
if (tu != primary_translation_unit)
++tu_stack_depth; // dword_106B9E8
tu_stack_top = stack_entry; // qword_106BA18
// --- Append to TU linked list ---
if (tu_chain_tail) // qword_12C7A90
tu_chain_tail->next_tu = tu;
tu_chain_tail = tu;
// ==========================================================
// PHASE 3: Source file setup + parse
// ==========================================================
if (module_info) {
// --- Module compilation path ---
// Extract header info from module descriptor
module_id = module_info[7];
module_info[2] = tu; // back-link TU into module
current_module_id = module_id; // qword_106C0B0
// ... copy include paths, source paths from module descriptor ...
source_dir = intern_directory_path(filename, 1); // sub_5ADC60
set_include_paths(source_dir, &include_list, &sys_list); // sub_5AD120
fe_translation_unit_init(source_dir, &include_list); // sub_5863A0
import_module = module_info[3];
tu->error_severity_count = current_error_severity; // [408]
set_module_id(import_module); // sub_5AF7F0
if (preprocessing_only) // dword_106C29C
goto compile;
goto compile_module;
}
// --- Standard (non-module) path ---
fe_translation_unit_init(0, 0); // sub_5863A0
tu->error_severity_count = current_error_severity;
if (preprocessing_only)
goto compile;
// --- PCH header processing (optional) ---
if (pch_enabled && !pch_skip_flag) { // dword_106BF18, dword_106B6AC
setup_pch_source(); // sub_5861C0
precompiled_header_processing(); // sub_6F4AD0
}
compile:
// --- Main compilation: parse + build IL ---
compile_primary_source(); // sub_586240
semantic_analysis(); // sub_4E8A60 (standard path)
goto wrapup;
compile_module:
compile_primary_source(); // sub_586240
module_compilation(); // sub_6FDDF0 (module path)
wrapup:
// ==========================================================
// PHASE 4: Per-TU wrapup + stack pop
// ==========================================================
translation_unit_wrapup(); // sub_588E90
// --- Pop TU stack (inlined pop_translation_unit_stack) ---
top = tu_stack_top;
popped_tu = top->tu_ptr;
if (popped_tu != current_translation_unit)
assertion_failure("trans_unit.c", 556,
"pop_translation_unit_stack", 0, 0);
if (popped_tu != primary_translation_unit)
--tu_stack_depth;
tu_stack_top = top->next;
// (return stack entry to free list)
if (tu_stack_top)
switch_translation_unit(tu_stack_top->tu_ptr); // sub_7A3D60
// --- Debug trace on exit ---
if (debug_verbosity > 0 || (debug_enabled && trace_category("trans_unit")))
fprintf(stderr, "Done processing translation unit %s\n", filename);
}
Execution Flow
process_translation_unit (sub_7A40A0)
|
|-- [1] Debug trace: "Processing translation unit %s"
|-- [2] Module-state validation (assert at trans_unit.c:696)
|-- [3] Save previous TU state (sub_7A3A50)
|-- [4] Reset error state (sub_5EAEC0)
|-- [5] If recompilation: re-run fe_init_part_1 (sub_585EE0)
|
|-- [6] Allocate 424-byte TU descriptor (sub_6BA0D0)
| |-- Allocate per-TU storage buffer (sub_6BA0D0(per_tu_storage_size))
| |-- Initialize scope state at [24..192] (sub_7046E0)
| |-- Zero remaining fields [192..416]
| |-- Copy registered variable defaults
| |-- Set module_info_ptr [376] and flags [392]
|
|-- [7] Set as primary TU if first (assert at trans_unit.c:725)
|-- [8] Push onto TU stack, link into TU chain
|
|-- [9] Module path? (module_info != NULL)
| |-- YES: Extract module header info
| | sub_5ADC60 (intern_directory_path)
| | sub_5AD120 (set_include_paths)
| | sub_5863A0 (fe_translation_unit_init)
| | sub_5AF7F0 (set_module_id)
| |
| |-- NO: sub_5863A0 (fe_translation_unit_init) with NULL args
| sub_5861C0 + sub_6F4AD0 (PCH processing, if enabled)
|
|-- [10] sub_586240 -- compile_primary_source (parser entry)
|
|-- [11] Post-parse semantic analysis:
| |-- Module path: sub_6FDDF0 (module_compilation)
| |-- Standard path: sub_4E8A60 (translation_unit / semantic analysis)
|
|-- [12] sub_588E90 -- translation_unit_wrapup
|
|-- [13] Pop TU stack (assert at trans_unit.c:556)
|-- [14] Debug trace: "Done processing translation unit %s"
Phase 1: Error State Reset -- sub_5EAEC0
Before any parsing begins, sub_5EAEC0 resets the parser's error recovery state. This is a tiny function (22 bytes) that configures the error-recovery token scan depth based on whether this is a recompilation pass:
void reset_error_state(void) {
if (is_recompilation) { // dword_106BA08
error_scan_depth = 8; // dword_126F68C -- shallower scan on retry
error_scan_mode = 0; // dword_126F688
error_recovery_kind = 16;
} else {
error_recovery_kind = 24; // full recovery on first pass
}
error_token_limit = error_recovery_kind; // dword_126F694
error_count_local = 0; // dword_126F690
}
The different error_recovery_kind values (16 vs 24) control how aggressively the parser attempts to resynchronize after encountering a syntax error. On recompilation (error-retry), the compiler uses a smaller recovery window to avoid cascading errors.
Phase 2: TU Descriptor Allocation
The 424-byte TU descriptor is the central data structure tracking a single translation unit's state during compilation. It is allocated from EDG's permanent storage pool via sub_6BA0D0 and linked into two separate data structures: the TU linked list and the TU stack.
Translation Unit Descriptor Layout (424 bytes)
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | next_tu | Singly-linked list pointer: chains all TUs in processing order. qword_106B9F0 (primary TU) is the head; qword_12C7A90 is the tail. |
| 8 | 8 | tu_name | Initially NULL. Set later by the parser to the TU's internal identifier. |
| 16 | 8 | storage_buffer | Pointer to a dynamically-sized buffer holding per-TU copies of all registered global variables. Size = qword_12C7A98 (accumulated during f_register_trans_unit_variable calls). |
| 24-192 | 168 | scope_state | Initialized by sub_7046E0. Contains the TU's scope stack snapshot: file scope descriptor, scope nesting state, using-directive lists. Saved/restored during TU switching by sub_7A3A50/sub_7A3D60. |
| 184 | 8 | source_file_entry | Set to *(qword_126DDF0 + 64) after the source file is opened -- the file descriptor from the source file manager. |
| 192 | 8 | (cleared) | Zero-initialized. |
| 200-352 | 152 | scope_decl_area | Bulk-zeroed via memset. Holds scope-level declaration state that accumulates during parsing. The zero-init ensures clean state for a new TU. |
| 352 | 8 | (cleared) | Zero-initialized. |
| 360-376 | 24 | additional_state | Three qwords, all zeroed. Purpose unclear; possibly reserved for future EDG versions. |
| 376 | 8 | module_info_ptr | Pointer to the C++20 module descriptor (a3 parameter). NULL for standard compilation. When set, the TU participates in modular compilation. |
| 392 | 2 | flags | Byte 0: is_primary (1 if this is the first TU, 0 otherwise). Byte 1: initialized marker (always 1 = 0x100 in the word). |
| 408 | 4 | error_severity_count | Snapshot of dword_126EC90 at TU creation time. Compared during wrapup to detect new errors introduced during this TU's compilation. |
| 416 | 8 | (cleared) | Zero-initialized. |
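The two-byte flags field at offset 392 can be pictured as one 16-bit word. The following is a minimal sketch, assuming the little-endian byte order of x86-64; the helper names `make_tu_flags` and `tu_flags_is_primary` are hypothetical and do not appear in the binary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build the 16-bit flags word stored at offset 392.
 * Byte 0 (low byte) carries is_primary; byte 1 carries the initialized
 * marker, which the layout table reports as always 1. */
static uint16_t make_tu_flags(int is_primary) {
    uint16_t initialized = 1;
    return (uint16_t)((initialized << 8) | (is_primary ? 1u : 0u));
}

/* Hypothetical helper: extract the is_primary byte from the word. */
static int tu_flags_is_primary(uint16_t flags) {
    return flags & 0xFF;
}
```

Under these assumptions the primary TU carries 0x101 and every later TU carries 0x100, which matches the "0x100 in the word" note in the table.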
Registered Variable Mechanism
EDG's multi-TU infrastructure requires certain global variables to be saved and restored when switching between translation units (e.g., during relocatable device code compilation). The mechanism works as follows:
- Registration phase (during initialization, before any TU processing): Subsystem initializers call f_register_trans_unit_variable (sub_7A3C00) to register global variables that need per-TU state. Each registration creates a 40-byte entry:
| Offset | Size | Field |
|---|---|---|
| 0 | 8 | next -- linked list pointer |
| 8 | 8 | variable_address -- pointer to the global variable |
| 16 | 8 | variable_name -- debug name string (e.g., "is_recompilation") |
| 24 | 8 | prior_accumulated_size -- offset into per-TU storage buffer |
| 32 | 8 | field_offset_in_tu -- if nonzero, the offset within the TU descriptor where the default value lives |
- Accumulated size tracking: Each registration pads the variable's size to 8-byte alignment and adds it to qword_12C7A98 (per-TU storage size). The linked list head is qword_12C7AA8, tail is qword_12C7AA0.
- TU creation: When a TU descriptor is allocated, a storage buffer of per_tu_storage_size bytes is allocated alongside it and its pointer is stored at offset [16]. Default values from the field_offset_in_tu entries are copied into the TU descriptor's own fields.
- TU switching: save_translation_unit_state (sub_7A3A50) iterates the registered variable list, copying each variable's current value from its global address into the outgoing TU's storage buffer. switch_translation_unit (sub_7A3D60) does the reverse: it copies from the incoming TU's storage buffer back to the global addresses.
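The register, save, and restore steps above can be sketched as follows. This is an illustrative model, not the decompiled code: the `size` field is an assumption (the entry table does not show where the variable size is kept), it takes the place of field_offset_in_tu, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a registration entry. */
typedef struct reg_entry {
    struct reg_entry *next;
    void             *variable_address;
    const char       *variable_name;
    size_t            buffer_offset;   /* prior_accumulated_size */
    size_t            size;            /* assumption: not shown in the table */
} reg_entry;

static reg_entry *reg_head, *reg_tail;  /* cf. qword_12C7AA8 / qword_12C7AA0 */
static size_t per_tu_storage_size;      /* cf. qword_12C7A98 */

/* Register one global; returns 0 on success, -1 on allocation failure. */
static int register_variable(void *addr, size_t size, const char *name) {
    reg_entry *e = calloc(1, sizeof *e);
    if (!e) return -1;
    e->variable_address = addr;
    e->variable_name = name;
    e->buffer_offset = per_tu_storage_size;
    e->size = size;
    per_tu_storage_size += (size + 7) & ~(size_t)7;  /* pad to 8 bytes */
    if (reg_tail) reg_tail->next = e; else reg_head = e;
    reg_tail = e;
    return 0;
}

/* Copy every registered global into the outgoing TU's buffer. */
static void save_state(unsigned char *buf) {
    for (reg_entry *e = reg_head; e; e = e->next)
        memcpy(buf + e->buffer_offset, e->variable_address, e->size);
}

/* Copy the incoming TU's buffer back into the globals. */
static void restore_state(const unsigned char *buf) {
    for (reg_entry *e = reg_head; e; e = e->next)
        memcpy(e->variable_address, buf + e->buffer_offset, e->size);
}
```

Because every offset is fixed at registration time, each TU buffer has the same layout, so a switch reduces to one pass of memcpy calls in each direction.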
Three core variables are always registered (by sub_7A4690):
| Variable | Address | Size | Name |
|---|---|---|---|
| is_recompilation | dword_106BA08 | 4 | "is_recompilation" |
| current_filename | qword_106BA00 | 8 | "current_filename" |
| has_module_info | dword_106B9F8 | 4 | "has_module_info" |
Additional variables are registered by other subsystem initializers (trans_corresp registers 3 more via sub_7A3920).
Phase 3: TU Linking and Stack Management
TU Linked List
Translation units are linked in processing order through the next_tu field at offset [0]:
qword_106B9F0 (primary_translation_unit)
    |
    v
  TU_0 --[next_tu]--> TU_1 --[next_tu]--> TU_2 --[next_tu]--> NULL
                                            ^
                                            |
                              qword_12C7A90 (tu_chain_tail)
qword_106B9F0 always points to the first (primary) TU. qword_12C7A90 always points to the last. The chain is walked by fe_wrapup (sub_588F90) during its 5-pass finalization.
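A head-plus-tail list of this shape supports constant-time append. The sketch below assumes a reduced descriptor that contains only the next_tu link; `append_tu` is a hypothetical name for whatever code performs the linking in the binary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced descriptor: only the next_tu link at offset 0. */
typedef struct tu_desc {
    struct tu_desc *next_tu;
} tu_desc;

static tu_desc *primary_tu;   /* cf. qword_106B9F0 */
static tu_desc *chain_tail;   /* cf. qword_12C7A90 */

/* Append in O(1): the first TU becomes the head, later ones link at the tail. */
static void append_tu(tu_desc *tu) {
    tu->next_tu = NULL;
    if (!primary_tu)
        primary_tu = tu;
    else
        chain_tail->next_tu = tu;
    chain_tail = tu;
}
```

Keeping a separate tail pointer means the head never moves once the primary TU is linked, consistent with the description of qword_106B9F0 above.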
TU Stack
The TU stack tracks the active compilation context. Each stack entry is a 16-byte structure:
| Offset | Size | Field |
|---|---|---|
| 0 | 8 | next -- points to the entry below on the stack |
| 8 | 8 | tu_ptr -- pointer to the TU descriptor |
Stack entries are allocated from a free list (qword_12C7AB8); when the free list is empty, a new 16-byte block is allocated via sub_6B7340 (permanent allocator).
qword_106BA18 (tu_stack_top)
|
v
entry_N: [next=entry_N-1, tu_ptr=current_tu]
entry_N-1: [next=entry_N-2, tu_ptr=prev_tu]
...
entry_0: [next=NULL, tu_ptr=primary_tu]
The stack depth counter dword_106B9E8 tracks how many non-primary TUs are stacked. It is incremented on push (if tu != primary_tu) and decremented on pop.
The pop operation at the end of process_translation_unit includes an assertion (at trans_unit.c:556) verifying that the top-of-stack TU matches current_translation_unit. This guards against mismatched push/pop sequences, which would corrupt the multi-TU state:
if (stack_top->tu_ptr != current_translation_unit)
    assertion_failure("trans_unit.c", 556, "pop_translation_unit_stack", 0, 0);
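The push and pop behavior described in this section can be sketched as below. This is a hedged model: the function names are hypothetical, malloc stands in for the permanent allocator, and two details are assumptions not confirmed by the text (the depth counter is decremented under the same non-primary condition as the increment, and the current TU falls back to the new stack top after a pop).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct stack_entry {
    struct stack_entry *next;    /* entry below on the stack */
    void               *tu_ptr;  /* TU descriptor */
} stack_entry;

static stack_entry *stack_top;   /* cf. qword_106BA18 */
static stack_entry *free_list;   /* cf. qword_12C7AB8 */
static int          stack_depth; /* cf. dword_106B9E8 */
static void        *primary_tu;  /* cf. qword_106B9F0 */
static void        *current_tu;  /* cf. qword_106BA10 */

/* Push a TU; returns 0 on success, -1 if no node could be allocated. */
static int push_tu(void *tu) {
    stack_entry *e = free_list;
    if (e)
        free_list = e->next;                 /* recycle a freed node */
    else if (!(e = malloc(sizeof *e)))
        return -1;
    e->next = stack_top;
    e->tu_ptr = tu;
    stack_top = e;
    if (tu != primary_tu)
        stack_depth++;
    current_tu = tu;
    return 0;
}

/* Pop the top TU, checking the invariant asserted at trans_unit.c:556. */
static void pop_tu(void) {
    stack_entry *e = stack_top;
    assert(e && e->tu_ptr == current_tu);
    if (e->tu_ptr != primary_tu)             /* assumption: mirrors push */
        stack_depth--;
    stack_top = e->next;
    e->next = free_list;                     /* return node to the free list */
    free_list = e;
    current_tu = stack_top ? stack_top->tu_ptr : NULL;  /* assumption */
}
```

Recycling nodes through the free list keeps the allocation count low, which is consistent with the separate stack entry counter reported in the statistics.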
Phase 4: Source File Setup
The source file setup differs between standard compilation and C++20 module compilation.
Standard Path (module_info == NULL)
- sub_5863A0 (fe_translation_unit_init / keyword_init, 1113 lines, fe_init.c): The largest initialization function in the binary. Performs five tasks in sequence:
  - Token state reset: Zeros qword_126DD38 (6-byte source position) and qword_126EDE8 (mirror).
  - Per-TU subsystem reinit: Calls 15+ subsystem re-initializers to prepare for a new compilation unit (source file manager, scope system, preprocessor, diagnostics, etc.).
  - Keyword registration: Registers 200+ C/C++ keywords via sub_7463B0 (enter_keyword), including all C89/C99/C11/C23 keywords, C++ keywords through C++26, GNU extensions, MSVC extensions, Clang extensions, 60+ type traits, and three NVIDIA CUDA-specific type trait keywords (__nv_is_extended_device_lambda_closure_type, __nv_is_extended_host_device_lambda_closure_type, __nv_is_extended_device_lambda_with_preserved_return_type). Keyword registration is version-gated by the language mode (dword_126EFB4) and C++ standard version (dword_126EF68).
  - File scope creation: Calls sub_7047C0(0) to push the initial file scope onto the scope stack.
  - C++ builtins: For C++ mode, registers namespace std, the operator new / operator delete allocation functions, and std::align_val_t.
- PCH processing (optional, if dword_106BF18 is set): Calls sub_5861C0 to open the source file with minimal setup (same as sub_586240 but without the recompilation logic), followed by sub_6F4AD0 (precompiled_header_processing, 721 lines, pch.c), which searches for an applicable .pch file, validates memory allocation history, and restores saved variable state from the precompiled header.
Module Path (module_info != NULL)
When compiling a C++20 module unit, the module descriptor (passed as a3) provides pre-computed configuration:
module_info[2] = tu; // back-link TU into module descriptor
qword_106C0B0 = module_info[7]; // module identifier
qword_126EE98 = module_info[4]; // include path list
qword_126EE78 = module_info[6]; // system include path list
qword_126EE90 = module_info[5]; // additional path list
The module path then calls:
- sub_5ADC60(filename, 1) -- intern the source directory path (cached allocation)
- sub_5AD120(source_dir, &include_list, &sys_list) -- configure include search paths from the module descriptor
- sub_5863A0(source_dir, &include_list) -- fe_translation_unit_init with module-specific paths
- sub_5AF7F0(module_info[3]) -- set the module identifier for this TU (asserts not already set)
Phase 5: Compilation Driver -- sub_586240
sub_586240 (fe_init.c, 63 lines) is the compilation driver that opens the source file and launches the parser. It is called for both standard and module compilation paths.
void compile_primary_source(void) {
    // If recompilation: reset file-scope scope pointer
    if (is_recompilation)
        *(uint64_t *)&xmmword_126EB60 = 0;
    // Allocate mutable copy of filename for the lexer
    char *fn_copy = temp_allocate(strlen(current_filename) + 1);  // sub_5E0460
    strcpy(fn_copy, current_filename);
    // --- Open source file and push onto input stack ---
    open_file_and_push_input_stack(fn_copy, 0, 0, 0, 0, 0, 0, 0, 0, 0);  // sub_66E6E0
    // Record source file descriptor in TU
    current_tu->source_file_entry = *(source_file_descriptor + 64);  // [184]
    // --- Scope handling ---
    if (!pch_mode) {                                  // dword_106B690
        init_global_scope_flag = 1;                   // dword_126C708
        global_scope_decl_list = global_decl_chain;   // qword_126C710
        finalize_scope();                             // sub_66E920
    }
    open_scope(1, 0);                                 // sub_6702F0
    // --- PCH recompilation metadata ---
    if (is_recompilation) {
        // Allocate 4-byte version marker (3550774 = "6.6\0")
        char *ver = temp_allocate(4);
        *(uint32_t *)ver = 3550774;                   // EDG 6.6 version tag
        edg_version_ptr = ver;                        // qword_126EB78
        // Copy compilation timestamp (+1 for the terminating NUL)
        char *ts = temp_allocate(strlen(byte_106B5C0) + 1);
        compilation_timestamp_copy = strcpy(ts, byte_106B5C0);  // qword_126EB80
        dialect_version_snapshot = dialect_version;   // dword_126EBF8
    }
    // --- PCH header loading ---
    if (pch_mode) {
        load_precompiled_header(byte_106B5C0);        // sub_6B5C10
        pch_header_loaded = 1;                        // dword_106B6B0
    }
}
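The version marker constant can be checked by hand: 3550774 is 0x00362E36, and a 32-bit store on little-endian x86-64 lays the bytes out as 0x36, 0x2E, 0x36, 0x00, which is the string "6.6" with its terminator. The small helper below (a hypothetical name, written only to illustrate the arithmetic) performs the same decoding:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode a 32-bit tag into its four bytes in little-endian order, which is
 * how a uint32_t store lays them out in memory on x86-64. */
static void decode_version_tag(uint32_t tag, char out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (char)((tag >> (8 * i)) & 0xFF);
}
```

Passing 3550774 yields a NUL-terminated buffer that compares equal to "6.6", so the marker can be read directly as a C string through edg_version_ptr.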
Parser Entry: sub_66E6E0 (open_file_and_push_input_stack)
sub_66E6E0 (lexical.c, 95 lines) is the gateway from file-level compilation into the EDG lexer/parser. It takes 10 parameters controlling how the source file is opened:
| Parameter | Position | Typical Value | Meaning |
|---|---|---|---|
| filename | a1 | source path | Path to the .cu file |
| include_mode | a2 | 0 | 0 = primary source, nonzero = #include |
| search_type | a3 | 0 | 0 = absolute path, nonzero = search include dirs |
| is_system | a4 | 0 | System header flag |
| guard_flag | a5 | 0 | Include guard checking mode |
| is_pragma | a6 | 0 | Pragma-include flag |
| embed_mode | a7 | 0 | #embed processing flag |
| line_adjust | a8 | 0 | Line number adjustment |
| recovery | a9 | 0 | Error recovery mode |
| result_out | a10 | 0 | Output: set to 1 if file was skipped (guard) |
The function delegates to sub_66CBD0 which resolves the file path, opens the file handle, and creates the file descriptor. Then sub_66DFF0 pushes the opened file onto the lexer's input stack, making it the active source for tokenization. The lexer reads from this stack via get_next_token (sub_676860, 1995 lines).
At debug verbosity > 3, it prints: "open_file_and_push_input_stack: skipping guarded include file %s\n" when an include guard causes the file to be skipped.
Phase 6: Semantic Analysis -- sub_4E8A60
After parsing completes, sub_4E8A60 (translation_unit, decls.c, 77 lines) performs semantic analysis on the parsed declarations. This function is called only on the standard (non-module) compilation path.
void translation_unit(void) {
    // PCH mode: additional scope finalization
    if (pch_mode)
        finalize_pch_scope();                  // sub_6FC900
    if (global_decl_chain)
        process_pending_declarations();        // sub_6FDD60
    // --- Main declaration processing loop ---
    declaration_processing_active = 1;         // dword_126C704
    parse_declaration_seq();                   // sub_676860 (get_next_token)
    declaration_processing_active = 0;
    // Header-unit stop detection
    if (header_unit_mode)
        finalize_header_unit();                // sub_6F4A10
    // --- Top-level declaration loop ---
    // Repeatedly processes declarations until token 9 (EOF) is reached.
    // For C++ (dword_126EFB4 == 2) at C++11 or later
    // (dword_126EF68 > 201102, i.e. >= 201103), or with has_cpp20_features:
    //   calls sub_6FBCD0 (deferred template processing)
    //   then sub_4E6F80(1, 0) (process next declaration)
    while (current_token != 9) {               // 9 = EOF token
        if (is_cpp && (cpp_version > 201102 || has_cpp20_features))
            process_deferred_templates();      // sub_6FBCD0
        if (declaration_enabled)
            process_declaration(1, 0);         // sub_4E6F80
    }
    // --- Post-parse validation ---
    if (!header_unit_mode) {
        if (is_cpp && (cpp_version > 201102 || has_cpp20_features))
            process_deferred_templates();      // sub_6FBCD0 final pass
        finalize_module_interface();           // sub_6F81D0
    } else {
        // Header-unit mode assertion: stop position must be found
        assertion_failure("decls.c", 23975, "translation_unit",
                          "translation_unit:", "header stop position not found");
    }
}
The C++ standard version check (dword_126EF68 > 201102) gates deferred template processing. Because the __cplusplus value for C++11 is 201103, the comparison > 201102 is equivalent to >= 201103 and therefore admits C++11 and every later standard, not only C++14 and later. When the gate is open (or has_cpp20_features is set), sub_6FBCD0 handles deferred template processing between declaration groups.
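The gate can be restated as a small predicate. This is a sketch under the assumption that the three inputs map to dword_126EFB4, dword_126EF68, and the has_cpp20_features flag from the pseudocode; the function name `deferred_templates_enabled` is hypothetical. Since C++11 defines __cplusplus as 201103L, a C++11 translation unit passes the test.

```c
#include <assert.h>

enum { LANG_CPP = 2 };   /* value compared against dword_126EFB4 */

/* Sketch of the gate guarding process_deferred_templates (sub_6FBCD0). */
static int deferred_templates_enabled(int lang_mode, long std_version,
                                      int has_cpp20_features) {
    return lang_mode == LANG_CPP &&
           (std_version > 201102L || has_cpp20_features != 0);
}
```

With these inputs, C++98 (199711L) fails the gate unless the feature flag is set, and a non-C++ language mode fails regardless of the version value.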
Phase 7: Translation Unit Wrapup -- sub_588E90
sub_588E90 (translation_unit_wrapup, fe_wrapup.c, 36 lines) performs per-TU finalization after parsing and semantic analysis are complete. It is the last step before the TU stack is popped.
void translation_unit_wrapup(void) {
    if (debug_enabled)
        trace_enter(1, "translation_unit_wrapup");
    // [1] Stop-token verification
    check_all_stop_token_entries_are_reset(    // sub_675DA0
        file_scope_stop_tokens + 8);           // qword_126DB48 + 8
    // [2] Class linkage checking (conditional)
    if (!preprocessing_only) {
        if (rdc_enabled || rdc_alt_enabled)    // dword_106C2BC, dword_106C2B8
            check_class_linkage();             // sub_446F80
    }
    // [3] Module import finalization
    finalize_module_imports();                 // sub_7C24D0
    // [4] IL output
    complete_scope();                          // sub_709250
    // [5] Close file scope
    close_file_scope(1);                       // sub_7047C0
    // [6] Module correspondence finalization (non-preprocessing)
    if (!preprocessing_only)
        process_verification_list();           // sub_7A2FE0
    // [7] Write compilation unit boundary
    make_module_id(0);                         // sub_5AF830
    // [8] Namespace cleanup (C++ only, non-PCH, non-preprocessing)
    if (is_cpp && !is_recompilation && !preprocessing_only)
        namespace_cleanup();                   // sub_76C910
    if (debug_enabled)
        trace_leave();                         // sub_48AFD0
}
sub_675DA0: check_all_stop_token_entries_are_reset
Iterates all 357 entries in the stop-token array. If any nonzero entry is found, logs "stop_tokens[\"%s\"] != 0\n" (using off_E6D240 as the token name table) and asserts with "stop token array not all zero" at lexical.c:17680. This catches lexer state corruption where a stop-token (used during error recovery and tentative parsing) was not properly cleared.
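The scan can be sketched as a simple loop over the array. The element type is an assumption (one byte per entry; the text does not state the entry width), `token_names` stands in for the name table at off_E6D240, and the function name is hypothetical. The sketch returns a count instead of asserting so that the caller decides how to fail.

```c
#include <assert.h>
#include <stdio.h>

#define STOP_TOKEN_COUNT 357

/* Returns the number of entries that are still nonzero, logging each one
 * with the same format string the binary uses. */
static int count_unreset_stop_tokens(const unsigned char *stop_tokens,
                                     const char *const *token_names) {
    int bad = 0;
    for (int i = 0; i < STOP_TOKEN_COUNT; i++) {
        if (stop_tokens[i] != 0) {
            fprintf(stderr, "stop_tokens[\"%s\"] != 0\n", token_names[i]);
            bad++;
        }
    }
    return bad;
}
```

A caller modeling the real behavior would treat any nonzero return value as the "stop token array not all zero" assertion failure.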
sub_446F80: check_class_linkage
Called only when relocatable device code (RDC) compilation is enabled (dword_106C2BC or dword_106C2B8). Iterates file-scope type entities looking for class/struct/union types (kind 9-11) and scoped enums (kind 2, bit 3 of +145) that need external linkage for cross-TU visibility. For qualifying types, calls sub_41F800 (make_class_externally_linked) to set the linkage bits at offset +80 to 0x20 (external linkage flag). The function performs a two-pass scan:
- Pass 1: Identify types needing external linkage. Checks whether the type is used by externally-visible definitions, has nested types with external linkage requirements, or has member functions with non-inline definitions.
- Pass 2: If any types were promoted, propagates linkage to member functions and nested class template instantiations via sub_41FD90.
sub_7A2FE0: process_verification_list (Module Finalization)
sub_7A2FE0 (trans_corresp.c, 69 lines) processes the deferred correspondence verification list for multi-TU compilation. This is the mechanism EDG uses to verify that declarations shared across translation units are structurally compatible (One Definition Rule checking for RDC).
void process_verification_list(void) {
    if (is_recompilation || error_count != saved_error_count)
        goto skip;                              // skip if new errors appeared
    correspondence_active = 1;                  // dword_106B9E4
    source_seq = *(current_tu + 8);             // TU source sequence
    prepare_correspondence(source_seq);         // sub_79FE00
    verify_correspondence(source_seq);          // sub_7A2CC0
    // Process pending verification items
    while (pending_list) {                      // qword_12C7790
        pending_list_snapshot = pending_list;
        pending_list = NULL;
        for (item = pending_list_snapshot; item; item = next) {
            next = item->next;
            switch (item->kind) {               // byte at [8]
            case 0:  break;                                                     // no-op
            case 2:  verify_typedef_correspondence(item->data); break;          // sub_7986A0
            case 6:  verify_friend_correspondence(item->data); break;           // sub_7A1830
            case 7:  verify_nested_class_correspondence(item->data); break;     // sub_798960
            case 8:  verify_enum_member_correspondence(item->data); break;      // sub_798770
            case 11: verify_member_function_correspondence(item->data); break;  // sub_7A1DB0
            case 28: verify_using_declaration_correspondence(item->data); break;// sub_7982C0
            case 58: verify_base_class_correspondence(item->data); break;       // sub_7A27B0
            default: assertion_failure("trans_corresp.c", 7709, ...);
            }
            // Return item to free list
            item->next = corresp_free_list;
            corresp_free_list = item;
        }
    }
    correspondence_active = 0;
skip:
    correspondence_complete = 1;                // dword_106B9E0
}
The kind codes (0, 2, 6, 7, 8, 11, 28, 58) correspond to EDG declaration kinds: typedef (2), friend (6), nested class (7), enum member (8), member function (11), using declaration (28), base class (58).
Module vs Standard Compilation Path
The control flow diverges based on dword_106C29C (preprocessing-only mode) and the presence of module_info:
                 module_info?
                /            \
             YES              NO
              |                |
         sub_5ADC60     sub_5863A0(0,0)
         sub_5AD120            |
         sub_5863A0       PCH enabled?
         sub_5AF7F0        /        \
              |          YES         NO
              |           |           |
              |      sub_5861C0       |
              |      sub_6F4AD0       |
              |           |           |
              +-----+-----+-----+-----+
                    |           |
               sub_586240  sub_586240
                    |           |
                 preprocessing_only?
                   /           \
                 YES            NO
                  |              |
             sub_6FDDF0     sub_4E8A60
           (module comp)  (standard comp)
                  |              |
                  +-------+------+
                          |
                     sub_588E90
              (translation_unit_wrapup)
Note: sub_6FDDF0 is the module compilation driver (59 lines, lower_il.c). It enters a loop calling sub_676860 (get_next_token) until EOF (token 9), processing module import/export declarations. Between module units, it calls sub_66EA70 to close the current input source and advance to the next module partition.
Global State Variables
Translation Unit Tracking
| Variable | Address | Type | Description |
|---|---|---|---|
| current_translation_unit | qword_106BA10 | tu_descriptor* | Points to the TU currently being compiled. Set during TU creation and switching. |
| primary_translation_unit | qword_106B9F0 | tu_descriptor* | Points to the first TU. Set exactly once and never changed afterwards. |
| tu_chain_tail | qword_12C7A90 | tu_descriptor* | Tail of the TU linked list. Used for O(1) append of new TUs. |
| tu_stack_top | qword_106BA18 | stack_entry* | Top of the TU stack. Each entry is a 16-byte {next, tu_ptr} node. |
| tu_stack_depth | dword_106B9E8 | int | Number of non-primary TUs on the stack. Incremented on push, decremented on pop. |
| current_filename | qword_106BA00 | char* | Path of the .cu file being compiled. Per-TU variable (saved/restored on switch). |
| is_recompilation | dword_106BA08 | int | Nonzero during error-retry recompilation pass. Per-TU variable. |
| has_module_info | dword_106B9F8 | int | 1 if the current TU is a C++20 module unit. Per-TU variable. |
Registration Infrastructure
| Variable | Address | Type | Description |
|---|---|---|---|
| registered_variable_list_head | qword_12C7AA8 | reg_entry* | Head of the registered variable linked list. Built during initialization. |
| registered_variable_list_tail | qword_12C7AA0 | reg_entry* | Tail of the registered variable list. Used for O(1) append. |
| per_tu_storage_size | qword_12C7A98 | size_t | Accumulated size of all registered variables (8-byte aligned). Determines the storage buffer size at TU descriptor offset [16]. |
| registration_complete | dword_12C7A8C | int | Set to 1 at the start of process_translation_unit. After this, no more variables can be registered. |
| has_seen_module_tu | dword_12C7A88 | int | Set to 1 when a module-info TU is processed. Guards against mixing module and non-module TUs. |
| stack_entry_free_list | qword_12C7AB8 | stack_entry* | Free list for recycling 16-byte TU stack entries. |
Statistics Counters
| Variable | Address | Description |
|---|---|---|
| tu_count | qword_12C7A78 | Total TU descriptors allocated (424 bytes each) |
| stack_entry_count | qword_12C7A80 | Total stack entries allocated (16 bytes each) |
| registration_count | qword_12C7A68 | Total variable registration entries (40 bytes each) |
| corresp_count | qword_12C7A70 | Total correspondence entries (24 bytes each) |
These counters are reported by sub_7A45A0 (print_trans_unit_statistics), which prints formatted memory usage:
trans. unit corresps N x 24 bytes
translation units N x 424 bytes
trans. unit stack entry N x 16 bytes
variable registration N x 40 bytes
Assertions
process_translation_unit contains three assertion checks, each producing a fatal diagnostic via sub_4F2930:
| Line | Condition | Message | Meaning |
|---|---|---|---|
| trans_unit.c:696 | Primary TU (no module_info) but has_seen_module_tu is set | (none) | Cannot process a non-module TU after a module TU has been seen |
| trans_unit.c:725 | primary_translation_unit is set but is_recompilation is false | (none) | First TU must be on the initial compilation pass, not a retry |
| trans_unit.c:556 | Stack top's TU pointer does not match current_translation_unit | (none) | TU stack push/pop mismatch -- corrupted compilation state |
Callee Reference Table
| Address | Identity | Source | Role in Pipeline |
|---|---|---|---|
| sub_48A7E0 | trace_category | error.c | Check if debug category "trans_unit" is enabled |
| sub_5EAEC0 | reset_error_state | parse.c | Reset parser error recovery state |
| sub_585EE0 | fe_init_part_1 | fe_init.c | Re-run per-unit init on recompilation |
| sub_6BA0D0 | allocate_storage | il_alloc.c | Permanent storage allocator (424-byte TU, per-TU buffer) |
| sub_7046E0 | init_scope_state | scope_stk.c | Initialize scope fields at TU descriptor [24..192] |
| sub_6B7340 | permanent_alloc | il_alloc.c | Allocate 16-byte TU stack entry |
| sub_7A3A50 | save_translation_unit_state | trans_unit.c | Save current TU's registered variables and scope state |
| sub_7A3D60 | switch_translation_unit | trans_unit.c | Restore a TU's state (inverse of save) |
| sub_5ADC60 | intern_directory_path | host_envir.c | Cache directory path string (module path) |
| sub_5AD120 | set_include_paths | host_envir.c | Configure include search paths from module descriptor |
| sub_5863A0 | fe_translation_unit_init | fe_init.c | Per-TU init + keyword registration (1113 lines) |
| sub_5AF7F0 | set_module_id | host_envir.c | Set module identifier for current TU |
| sub_5861C0 | setup_pch_source | fe_init.c | Open source file for PCH mode |
| sub_6F4AD0 | precompiled_header_processing | pch.c | Find/load applicable PCH file (721 lines) |
| sub_586240 | compile_primary_source | fe_init.c | Open source, launch parser, build IL |
| sub_66E6E0 | open_file_and_push_input_stack | lexical.c | Open source file, push onto lexer input stack (10 params) |
| sub_676860 | get_next_token | lexical.c | Main tokenizer (1995 lines) |
| sub_6702F0 | open_scope | scope_stk.c | Push a new scope onto the scope stack |
| sub_6FDDF0 | module_compilation | lower_il.c | Module compilation driver (EOF-driven loop) |
| sub_4E8A60 | translation_unit | decls.c | Standard compilation: semantic analysis + declaration loop |
| sub_588E90 | translation_unit_wrapup | fe_wrapup.c | Per-TU finalization (8 sub-steps) |
| sub_675DA0 | check_all_stop_token_entries_are_reset | lexical.c | Verify all 357 stop-tokens are cleared |
| sub_446F80 | check_class_linkage | class_decl.c | RDC: promote class types to external linkage |
| sub_7C24D0 | finalize_module_imports | modules.c | C++20 module import finalization |
| sub_709250 | complete_scope | il.c | IL scope completion |
| sub_7047C0 | close_file_scope | scope_stk.c | Pop file scope, activate using-directives |
| sub_7A2FE0 | process_verification_list | trans_corresp.c | ODR verification for multi-TU (RDC) |
| sub_76C910 | namespace_cleanup | cp_gen_be.c | C++ namespace state cleanup |
| sub_4F2930 | assertion_failure | error.c | Fatal assertion handler (prints source path + line) |
Cross-References
- Pipeline Overview -- complete 8-stage pipeline diagram showing where process_translation_unit fits
- Entry Point & Initialization -- stages 1-3 that execute before this function
- Frontend Wrapup -- the 5-pass fe_wrapup (stage 6) that runs after this function
- Backend Code Generation -- stage 7 that consumes the IL tree built here
- CLI Processing -- all 276 flags that configure the compilation mode
- Timing & Exit -- exit code mapping and timing infrastructure
- EDG Overview -- EDG 6.6 source tree and NVIDIA modifications
- Execution Spaces -- how __device__/__host__/__global__ attributes are recorded during parsing
- Device/Host Separation -- how the backend filters device vs host code from the IL tree
Frontend Wrapup
fe_wrapup (sub_588F90, 1433 bytes at 0x588F90, from fe_wrapup.c:776) is the sixth stage of the cudafe++ pipeline. It runs after the parser has built the complete EDG IL tree and before the backend emits the .int.c file. The function performs five sequential passes over the translation unit chain, each pass iterating the linked list rooted at qword_106B9F0. After the five passes, it runs a series of post-pass operations: cross-TU consistency checks, graph optimization, template validation, memory statistics reporting, and global state teardown. The function has 51 direct callees.
The five passes transform the raw IL tree into a finalized, pruned representation: Pass 1 cleans up parsing artifacts, Pass 2 computes which entities are needed, Pass 3 marks entities that must be preserved in the IL for device compilation, Pass 4 eliminates everything not marked, and Pass 5 serializes the result and validates scope consistency. The entire sequence is the bridge between the parser's "everything parsed" state and the backend's "only what matters" input.
Key Facts
| Property | Value |
|---|---|
| Function | sub_588F90 (fe_wrapup) |
| Binary address | 0x588F90 |
| Binary size | 1433 bytes |
| EDG source | fe_wrapup.c, line 776 |
| Direct callees | 51 |
| Debug trace name | "fe_wrapup" (level 1 via sub_48AE00) |
| Assertion | "bad translation unit in fe_wrapup" if dword_106BA08 == 0 |
| Error check | qword_126ED90 -- passes 2-4 skip TUs with errors |
| Language gate | dword_126EFB4 == 2 gates C++-only operations in pass 4 |
Architecture Overview
sub_588F90 (fe_wrapup)
|
|-- Preamble: debug trace, assertion, C++ wrapup, diagnostic hooks
|
|-- Pass 1: per-TU basic declaration processing (sub_588C60)
|-- Pass 2: template/inline instantiation + needed-flags (sub_707040)
| |-- gated by !qword_126ED90 (skip error TUs)
| |-- preceded by cross-TU marking (sub_796C00) on first run
|-- Pass 3: keep-in-IL marking for device code (sub_610420 with arg 23)
| |-- sets dword_106B640=1 guard, clears after
|-- Pass 4: constant folding + CUDA transforms + dead entity elimination
| |-- sub_5CCA40 (C++ only), sub_5CC410, sub_5CCBF0
|-- Pass 5: per-TU final cleanup (sub_588D40)
|
|-- Post-pass: cross-TU consistency (sub_796BA0)
|-- Post-pass: graph optimization (sub_707480 double-loop)
|-- Post-pass: template validation (sub_765480)
|-- Post-pass: final main-TU cleanup (sub_588D40)
|-- Post-pass: file index processing (sub_6B8B20 loop)
|-- Post-pass: output flush (sub_5F7DF0)
|-- Post-pass: close output files (sub_4F7B10 x3)
|-- Post-pass: memory statistics (10 subsystem counters -> sub_6B95C0)
|-- Post-pass: debug dumps (sub_702DC0, sub_6C6570)
|-- Post-pass: final teardown (sub_5E1D00, sub_4ED0E0, zero 6 globals)
Translation Unit Chain
All five passes iterate the same linked list structure. Each translation unit descriptor is a 424-byte allocation. The first qword of each descriptor is the next pointer, forming a singly-linked list. The head is qword_106B9F0 (the primary TU). For standard single-file CUDA compilation, there is typically one primary TU and zero secondary TUs, but the multi-TU infrastructure exists for module compilation and precompiled headers.
Before processing each TU, sub_7A3D60 (switch_translation_unit, written as set_current_tu in the pseudocode on this page) is called to switch global state to point at that TU. This updates qword_106BA10 (current TU descriptor), which is then used by all subsystems to find the current scope, IL root, file info, and error state.
The file scope IL node -- the root of the IL tree for a TU -- is at *(qword_106BA10 + 8).
The iteration pattern shared by all passes:
// Walk secondary TUs (linked from primary)
node = *(qword **)qword_106B9F0;   // first secondary TU
while (node) {
    sub_7A3D60(node);              // set node as current TU
    // ... pass-specific work on *(qword_106BA10 + 8) ...
    node = *(qword **)node;        // follow next pointer at +0
}
// Then process primary TU
sub_7A3D60(qword_106B9F0);
// ... pass-specific work on main TU ...
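The same traversal order can be expressed as a reusable helper. This is a sketch, assuming a reduced descriptor with only the next link and a hypothetical callback type; the real passes switch the current TU with sub_7A3D60 before doing any work, which the sketch leaves to the callback.

```c
#include <assert.h>
#include <stddef.h>

typedef struct tu_desc {
    struct tu_desc *next_tu;   /* offset 0 */
} tu_desc;

typedef void (*tu_pass_fn)(tu_desc *tu, void *ctx);

/* Visit every secondary TU in chain order, then the primary TU last. */
static void for_each_tu(tu_desc *primary, tu_pass_fn pass, void *ctx) {
    for (tu_desc *tu = primary->next_tu; tu; tu = tu->next_tu)
        pass(tu, ctx);
    pass(primary, ctx);
}

/* Example pass: record the visit order into a small fixed-size log. */
typedef struct {
    tu_desc *seen[8];
    int      count;
} visit_log;

static void record_visit(tu_desc *tu, void *ctx) {
    visit_log *log = ctx;
    if (log->count < 8)
        log->seen[log->count++] = tu;
}
```

Running `for_each_tu` with `record_visit` on a chain of one primary and two secondary TUs logs the two secondaries first and the primary last, which is the order every pass in this section relies on.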
Preamble
Before the five passes begin, fe_wrapup performs:
- Debug trace: If dword_126EFC8 (debug mode) is set, logs "fe_wrapup" at level 1 via sub_48AE00.
- Set current TU: Calls sub_7A3D60(qword_106B9F0) to select the primary TU.
- Assertion: Checks dword_106BA08 != 0 -- the "full compilation mode" flag. If false, triggers a fatal assertion: "bad translation unit in fe_wrapup". This flag is set during TU initialization; its absence here indicates a corrupted pipeline state.
- C++ template wrapup: If dword_126EFB4 == 2 (C++ mode), calls sub_78A9D0 (template_and_inline_entity_wrapup). This performs cross-TU template instantiation setup, walking all TUs and their pending instantiation lists.
- No-op hook: Calls nullsub_5 -- a disabled debug hook in the exprutil address range (0x56DC80). Likely a compile-time-disabled expression validation point.
- CUDA diagnostics: If dword_106C268 is set, calls sub_6B3260 (CUDA-specific diagnostic processing).
- Source sequence debug: If debug mode and the "source_file_for_seq_info" flag is active, calls sub_5B9580 to dump source file sequence information.
Pass 1: Basic Declaration Processing
Function: sub_588C60 (file_scope_il_wrapup)
Address: 0x588C60
Per-TU: Yes (iterates all secondary TUs, then processes the primary TU)
Error-gated: No -- runs unconditionally
This pass performs initial cleanup on each translation unit's IL tree. It runs unconditionally on every TU, regardless of error status, because the cleanup operations (template state release, exception spec finalization) are safe and necessary even after errors.
Operations per TU:
| Step | Function | Purpose |
|---|---|---|
| 1 | sub_7C2690 | Template cleanup -- release deferred template instantiation state |
| 2 | sub_68A0C0 | Exception handling cleanup -- finalize exception specifications, resolve pending catch-block types |
| 3 | sub_446F80 | Class linkage checking (check_class_linkage; conditional: only if dword_106C2BC or dword_106C2B8 is set, and dword_106C29C is clear -- i.e., RDC enabled and not preprocessing-only mode) |
| 4 | sub_706710 | IL tree walk with parameters (root, 0, scope_list, 1, 0, 0) -- traverses the full IL tree performing bookkeeping: arg 2=0 means initial walk, arg 4=1 enables scope processing, arg 3 passes the TU scope list at qword_106BA10 + 24 |
| 5 | sub_706F40 | IL finalize -- post-walk finalization of the IL root node, marks it as ready for lowering |
| 6 | sub_5BD350 | Destroy temporaries (C++ only, dword_126EFB4 == 2) -- cleans up temporary objects from expression evaluation |
| 7 | (inline loop) | Clear deferred declaration flags (C++ only, dword_126EE50 == 0): iterates the declaration chain at *(root + 280), and for each declaration where bit 2 of byte +81 is set and sub_5CA6F0 returns true, clears the pointer at +40 and clears bit 2 of byte +81. This removes deferred-initialization markers from declarations whose initialization has completed. |
| 8 | sub_65D9A0 | Overload resolution cleanup -- releases candidate sets and viability data |
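Step 7 of the table is a flag-and-pointer clearing loop. The sketch below models it with a hypothetical reduced node: the field names, the `next` link, and the `init_done` field (a stand-in for the sub_5CA6F0 predicate) are all illustrative; only the tested bit (bit 2 of the byte at +81) and the cleared pointer (at +40) follow the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DECL_DEFERRED 0x04u   /* bit 2 of the flag byte at +81 */

/* Hypothetical reduced declaration node. */
typedef struct decl {
    struct decl *next;
    void        *deferred_init;  /* pointer at +40 in the real layout */
    uint8_t      flags;          /* byte at +81 in the real layout */
    int          init_done;      /* stand-in for the sub_5CA6F0 predicate */
} decl;

/* Clear the deferred marker on every declaration whose initialization has
 * completed; returns how many declarations were updated. */
static int clear_completed_deferrals(decl *chain) {
    int cleared = 0;
    for (decl *d = chain; d; d = d->next) {
        if ((d->flags & DECL_DEFERRED) && d->init_done) {
            d->deferred_init = NULL;
            d->flags &= (uint8_t)~DECL_DEFERRED;
            cleared++;
        }
    }
    return cleared;
}
```

Declarations that are still pending (flag set, predicate false) and declarations that never had the flag are left untouched, so other bits in the flag byte survive the pass.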
After all secondary TUs are processed, the primary TU itself gets the same treatment:
for (tu = *primary_tu; tu != NULL; tu = *tu) {
    set_current_tu(tu);
    file_scope_il_wrapup();    // sub_588C60
}
set_current_tu(primary_tu);
file_scope_il_wrapup();        // for the primary TU itself
Cross-TU Marking (Between Pass 1 and Pass 2)
Before Pass 2 begins, if no errors have occurred (!qword_126ED90), sub_796C00 (mark_secondary_trans_unit_IL_entities_used_from_primary_as_needed) is called. This function:
- Calls sub_60E4F0 with callbacks sub_796A60 (IL walk visitor) and sub_796A20 (scope visitor) to walk the primary TU's IL and mark entities referenced from secondary TUs.
- Iterates the file table (dword_126EC80 entries starting at index 2), and for each valid file scope that is not bit-2 flagged in byte -8 and has a non-zero scope kind byte +28, calls sub_610200 with the same visitor callbacks.
- Runs the walk twice (controlled by a counter: first pass with callback sub_796A60, second with NULL). The two-pass design ensures transitive closure: the first pass discovers direct references, the second propagates through chains of indirect references.
Pass 2: Template/Inline Instantiation and Needed-Flags
Function: sub_707040 (set_needed_flags_at_end_of_file_scope)
Address: 0x707040
Per-TU: Yes, but skips TUs with errors (qword_126ED90 check)
Source: scope_stk.c:8090
This pass determines which entities are "needed" -- must be preserved in the IL for backend consumption. It is the EDG "needed flags" computation, which decides based on linkage, usage, and language rules whether each declaration must survive to the output.
The function operates on a file scope IL node and walks four declaration lists at different offsets:
| Offset | List | Entity Kind | Processing |
|---|---|---|---|
+168 | Nested scopes | Namespace/class scopes | Recursively calls sub_707040 on each scope's IL root at *(entry + 120), skipping entries with bit 0 of byte +116 set (extern linkage marker) |
+104 | Type declarations | Classes (kind 9-11 at byte +132) | Calls sub_61D7F0(entry, 6) to set needed flag; recursively processes the class scope at *(*(entry+152) + 128) if non-null and bit 5 of byte +29 is clear |
+112 | Variable declarations | Variables/objects | Complex multi-condition evaluation (see below) |
+144 | Routine declarations | Functions/methods | Checks template body availability at *(entry+240) and *(*(entry+240)+8), bit 2 of byte +186 (not-needed marker), and entity class at byte +164; marks via sub_61CE20(entry, 0xB); preserves and restores bit 5 of byte +177 across the call |
Variable needed-flag logic
For each variable in the +112 list, the algorithm checks (in order of precedence):
- If bit 3 of byte +80 is set (external/imported), skip the remaining checks -- always mark as needed via sub_61CE20(entry, 7).
- Check sub_7A7850(*(entry+112)) -- if referenced, mark as needed.
- Check sub_7A7890(*(entry+112)) -- if used, mark as needed.
- Otherwise evaluate:
  - Byte +162 bit 4 set and full compilation mode: check linkage class at byte +128 (1=external) and base type completeness via sub_75C1F0.
  - Byte +128 == 0 (no linkage) or byte +169 == 2: check initializer pointer at +224 and constexpr flags at byte +164.
  - Internal/external linkage with specific storage class: check definition pointer at +200, storage class byte +128, and flag patterns in bytes +160, +161.
At the start of file scope processing, dword_106B640 is set to 1. At the end, after optionally calling sub_6FE8C0 (C++ scope merging), it is cleared to 0.
Debug trace: prints "Start/End of set_needed_flags_at_end_of_file_scope" when the "needed_flags" debug flag is active.
Pass 3: Keep-in-IL Marking (Device Code Selection)
Function: sub_610420 (mark_to_keep_in_il)
Address: 0x610420
Per-TU: Yes, skips error TUs
Source: il_walk.c:1959
Argument: 23 (the file-scope walk mode)
This is the critical CUDA-specific pass. It determines which entities must be preserved in the intermediate language for device code compilation by cicc. The guard flag dword_106B640 is set to 1 before the call and cleared to 0 after, preventing accidental re-invocation.
The keep-in-IL bit is bit 7 (0x80) of the byte at (entity_pointer - 8). Testing uses signed comparison: *(entry - 8) < 0 means "marked for keeping."
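The signed-comparison idiom can be illustrated with a minimal sketch. The helper names are hypothetical (they do not come from the binary); only the -8 offset and the 0x80 mask are taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { KEEP_IN_IL_BIT = 0x80 };  /* bit 7 of the byte at entity - 8 */

/* Reads the pre-header byte as signed: bit 7 is the sign bit, so "< 0"
   is the same test as "(byte & 0x80) != 0". */
bool is_marked_keep_in_il(const void *entity)
{
    const int8_t *flags = (const int8_t *)entity - 8;
    return *flags < 0;
}

/* Sets the mark (OR with 0x80). */
void mark_keep_in_il(void *entity)
{
    *((uint8_t *)entity - 8) |= KEEP_IN_IL_BIT;
}

/* Clears the mark (AND with 0x7F), leaving the other header bits alone. */
void clear_keep_in_il(void *entity)
{
    *((uint8_t *)entity - 8) &= 0x7F;
}
```

A decompiler typically renders the bit test as a signed comparison because the compiler emits a sign test on the byte rather than a mask-and-compare.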
Operation
- Save/restore state: Saves and restores 9 global callback/state variables (qword_126FB88 through dword_126FB60), installing sub_617310 (prune_keep_in_il_walk) as the walk prune callback at qword_126FB78. All other callback slots are zeroed. The callback set at dword_126FB58 is set to (byte_at_a1_minus_8 & 2) != 0 -- derived from a flag in the scope node header.
- File scope walk: When a2 == 23 and scope kind byte *(a1+28) is 0 (file scope), clears bit 7 of byte *(a1-8) via AND 0x7F. Then calls sub_6115E0(a1, 23) -- the recursive walk_tree_and_set_keep_in_il traversal on the file scope root.
- C++ companion walk: For C++ mode (dword_126EFB4 == 2), calls sub_6175F0(a1) to walk scopes and mark out-of-line definitions and friend declarations.
- Guard assertion: Asserts dword_106B640 != 0. If the guard was cleared during the walk, fires a fatal assertion at il_walk.c:1959 with function name mark_to_keep_in_il.
- Pending entity lists: Iterates the deferred entity list at qword_126EBA0, calling sub_6115E0(entity, 55) for each entry with bit 2 set in byte *(entity[1] + 187) (the "deferred instantiation needed" flag).
- 43 category-specific walks: Iterates 43 global lists, each containing entities of a specific IL category. Each list is walked with a category-specific tag argument:

  | Global range | Tags | Count |
  |---|---|---|
  | qword_126E610 -- qword_126E770 | 1--23 | 23 lists |
  | qword_126E7B0 -- qword_126E7E0 | 27--30 | 4 lists |
  | qword_126E810 -- qword_126E8A0 | 33--42 | 10 lists |
  | qword_126E8E0 -- qword_126E900 | 46--48 | 3 lists |
  | qword_126E9B0, qword_126E9D0, qword_126E9E0, qword_126E9F0 | 59, 61, 62, 63 | 4 lists |
  | qword_126EA80 | 72 | 1 list |

  These lists follow a reverse-linked structure where the back-pointer is at *(list_entry - 16), not at offset +0. Each entity's tag tells sub_6115E0 what kind of entity it is processing, which affects how the keep_in_il mark propagates to dependents.
- Using-declaration fixed-point: Processes namespace member entries at *(root + 256) via sub_6170C0(member, is_file_scope, &changed) in a loop that repeats until changed == 0. The is_file_scope flag is derived from *(a1+28) being 2 or 17.
- Hidden name resolution: If *(a1+264) is non-NULL, walks hidden name entries. Each entry has a linked list at entry[1] with per-entry kind at *(entry + 16) (byte). Five kinds are handled:

  | Kind | Name | Action |
  |---|---|---|
  | 0x35 | Instantiation | Walk via sub_6170C0 on *(entry[3] + 8) |
  | 0x33 | Function template | Conditional marking based on scope type and entity mark |
  | 0x34 | Variable template | Same as 0x33 with v111 = entry[3] |
  | 0x36 | Alias template | Same as 0x33 with v110 = entry[3] |
  | 6 | Class/struct | Special handling: checks typedef chain at byte +132 == 12 with non-null source at +8; marks via sub_6115E0(entity, 6) for file-scope entries |

  For each marked hidden name entry, the keep_in_il bit at *(entry - 8) is set via OR with 0x80.
- Context restore: Restores all saved function pointers and state variables.
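The using-declaration step is a classic fixed-point iteration: sweep the member list, record whether any sweep made progress, and stop when one does not. The sketch below models only that control flow. The member structure and the marking rule are hypothetical simplifications, and visit_member merely stands in for sub_6170C0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a member becomes kept once the entity it names is kept. */
struct member {
    bool kept;
    const struct member *target;  /* entity referred to, or NULL */
    struct member *next;
};

/* Stand-in for sub_6170C0: marks one member and reports progress via *changed. */
void visit_member(struct member *m, int *changed)
{
    if (!m->kept && m->target != NULL && m->target->kept) {
        m->kept = true;
        *changed = 1;
    }
}

/* Repeats full sweeps until a sweep makes no new marks; returns the sweep count. */
int mark_until_stable(struct member *head)
{
    int sweeps = 0;
    int changed;
    do {
        changed = 0;
        for (struct member *m = head; m != NULL; m = m->next)
            visit_member(m, &changed);
        sweeps++;
    } while (changed);
    return sweeps;
}
```

When a dependency chain runs against list order, each sweep can only advance the mark by one link, which is why a single pass is not sufficient in general.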
Debug trace: "Beginning/Ending file scope keep_in_il walk" when the "needed_flags" flag is active.
For full details on the keep-in-IL mechanism, see Keep-in-IL.
Pass 4: Constant Folding, CUDA Transforms, and Dead Entity Elimination
Per-TU: Yes, skips error TUs
C++ only: The sub_5CCA40 call is gated by dword_126EFB4 == 2
This pass has three sub-stages per TU. The first (sub_5CCA40) clears flags to prevent unnecessary work. The second (sub_5CC410) removes function bodies. The third (sub_5CCBF0) removes entire IL entries.
Stage 4a: Clear Unneeded Instantiation Flags -- sub_5CCA40
Address: 0x5CCA40
Source: il.c:29450 (clear_instantiation_required_on_unneeded_entities)
C++ only: Asserts dword_126EFB4 == 2
Walks the same four declaration lists as Pass 2 (nested scopes at +168, types at +104, routines at +144, and for non-file scopes variables at +112). For routines that are not marked for keeping but have instantiation-required flags set, calls sub_78A380(entity, 0, 2) to clear the instantiation-required bit. This prevents the template engine from instantiating definitions that will be eliminated in the next sub-stage.
The conditions for clearing a routine's instantiation-required flag are:
- Byte +80 bit 3 clear (not an external/imported entity)
- Byte +179 bit 4 clear (not a special instantiation)
- Byte +179 bits 1-2 == 0b10 (has "instantiation required" set) OR (dword_126E204 is set AND byte +176 bit 7 is set)
- Non-null template pointer at *(entity + 0) (has a source template)
- Byte +176 bit 1 clear (not already processed)
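The routine conditions above can be read as a single predicate over the entity's raw bytes. The following is a sketch under stated assumptions: the function and parameter names are hypothetical, and "bits 1-2 == 0b10" is interpreted as the two-bit field at bits 1..2 holding the value 2, which the text does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Offsets are the ones listed above; everything else is illustrative. */
bool should_clear_instantiation_required(const uint8_t *routine,
                                         bool flag_126E204)
{
    void *source_template;
    memcpy(&source_template, routine, sizeof source_template); /* *(entity + 0) */

    bool external     = (routine[80]  & 0x08) != 0;            /* +80  bit 3 */
    bool special_inst = (routine[179] & 0x10) != 0;            /* +179 bit 4 */
    /* Assumption: the 2-bit field at bits 1..2 must equal 0b10. */
    bool required     = ((routine[179] >> 1) & 0x03) == 0x02;
    bool alt_required = flag_126E204 && (routine[176] & 0x80) != 0; /* +176 bit 7 */
    bool processed    = (routine[176] & 0x02) != 0;            /* +176 bit 1 */

    return !external && !special_inst && (required || alt_required)
        && source_template != NULL && !processed;
}
```

The real function additionally requires that the routine is not marked for keeping before it calls sub_78A380; that check is omitted here to keep the sketch focused on the five listed conditions.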
For non-file scopes (byte +28 of scope is nonzero), additionally processes variables in the +112 list with an analogous pattern: byte +162 bit 6 clear, the flag pattern (v8 & 0xB0) == 0x10 (that is, bit 4 set with bits 5 and 7 clear), and a non-null pointer at *(entry + 0).
Stage 4b: Eliminate Unneeded Function Bodies -- sub_5CC410
Address: 0x5CC410
Source: il.c:29231 (eliminate_bodies_of_unneeded_functions)
Gate: dword_126E55C != 0 (deferred class members exist)
Iterates the scope table (qword_126EB98, 16-byte entries: {qword scope_ptr, int file_index, pad}). The iteration runs from index 1 through dword_126EC78. For each entry:
- Checks that the file reference at qword_126EC88[file_index] is non-null.
- Checks TU ownership:
  - Primary TU (qword_106BA10 == qword_106B9F0): checks ((*(scope_ptr - 8) >> 1) ^ 1) & 1 -- bit 1 of the pre-header flags byte must be clear.
  - Secondary TU: checks qword_126DFE0[*(scope_ptr + 24)] == qword_106BA10 -- the scope's file index maps to the current TU.
- Verifies scope kind byte +28 == 17 (class/namespace scope).
- Checks the keep-in-il mark: bit 2 of byte *(scope_ptr + 187) must be clear (not needed) AND the scope file entry has bit 0 of byte +29 set (eligible for elimination).
- If all checks pass, calls sub_5CAB40 to remove the function body from the scope.
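Two details of this loop are easy to misread: the 16-byte table entry layout and the shift-xor-mask expression used for the primary-TU test. The sketch below spells both out. The struct and function names are hypothetical, and the pointer is stored as a 64-bit integer to match the x86-64 binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One scope table entry: {qword scope_ptr, int file_index, pad}. */
struct scope_table_entry {
    uint64_t scope_ptr;   /* +0  */
    int32_t  file_index;  /* +8  */
    int32_t  pad;         /* +12 */
};
_Static_assert(sizeof(struct scope_table_entry) == 16, "entries are 16 bytes");
_Static_assert(offsetof(struct scope_table_entry, file_index) == 8,
               "file index lives at +8");

/* The expression as it appears in the decompilation. */
int bit1_clear_decompiled(uint8_t flags)
{
    return ((flags >> 1) ^ 1) & 1;
}

/* The same test written as a plain mask check. */
int bit1_clear_plain(uint8_t flags)
{
    return (flags & 0x02) == 0;
}
```

Both functions return 1 exactly when bit 1 of the flags byte is clear; the decompiled form is simply what the compiler's shift/xor/and sequence looks like when lifted back to C.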
In C++ mode (dword_126EFB4 == 2), also calls sub_6FFBA0 to reorganize namespace-level declarations after body removal.
Debug trace: "eliminate_bodies_of_unneeded_functions" at level 3.
Stage 4c: Eliminate Unneeded IL Entries -- sub_5CCBF0
Address: 0x5CCBF0
Source: il.c:29598 (eliminate_unneeded_il_entries)
Gate: dword_126E55C != 0
The heaviest sub-stage. First calls sub_703C30(a1) to get a scope summary structure (7-element qword array stored at v2), asserting the result is non-null. Then walks four entity lists, removing entries whose keep-in-IL mark (bit 7 of byte at entity - 8) is clear:
| List | Offset | Entity Type | Removal actions |
|---|---|---|---|
| Variables | +112 | Variable declarations | Unlink from list; for C++, call sub_7B0B60 on type pointers at +112 and +216 with callback sub_5C71B0 (id 147) to clean up associated type metadata |
| Routines | +144 | Function/method declarations | Unlink from list; same sub_7B0B60 type cleanup on type pointers at +144 and +248; set bit 5 of byte +87 in the routine supplement at *(entity+152) |
| Types | +104 | Type declarations | Unlink from list; for class entities (kind 9-11 at byte +132), call sub_5CB920 (C++ member cleanup) then sub_5E2D70 (scope deallocation); set bit 5 of byte +87 in the entity supplement |
| Hidden names | +272 | Hidden name entries | Unlink unmarked entries from list |
After variable/routine/type processing, the tail pointers are stored into v2[5], v2[6], and v2[4] respectively (the scope summary structure).
For file-scope nodes (byte +28 == 0), additionally calls sub_5CC570 (eliminate unneeded scope orphaned list entries) after variable processing, and sub_718720 (scope-level cleanup) after type/hidden-name processing.
After list processing, walks qword_126EBE0 (a global deferred entity chain) and removes entries where *(entry - 8) >= 0 (bit 7 clear = not marked).
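The per-list removal amounts to unlinking unmarked nodes from a singly linked list while tracking the surviving tail. A minimal sketch follows, assuming a simplified entity that carries its flags inline; in the binary the keep bit lives at entity - 8, and the link offsets differ per entity kind. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified entity: flags are inline here purely for illustration. */
struct entity {
    unsigned char keep_flags;  /* bit 7 (0x80) = keep in IL */
    struct entity *next;
};

/* Unlinks every entry whose keep bit is clear. Returns the new head and
   stores the last surviving entry (or NULL) through *tail. */
struct entity *prune_unmarked(struct entity *head, struct entity **tail)
{
    struct entity **link = &head;  /* slot that points at the current entry */
    struct entity *last = NULL;

    while (*link != NULL) {
        struct entity *e = *link;
        if (e->keep_flags & 0x80) {
            last = e;              /* survivor: advance past it */
            link = &e->next;
        } else {
            *link = e->next;       /* unlink: predecessor skips this entry */
            e->next = NULL;
        }
    }
    *tail = last;
    return head;
}
```

Returning the tail alongside the head mirrors the way the tail pointers are written back into the scope summary structure after each list is processed.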
String arithmetic in debug output
The diagnostic output uses a pointer arithmetic trick: "TARG_VERT_TAB_CHAR" + 17 evaluates to "R", so the format string "%semoving variable " produces either "Removing variable ..." (when the entity is being removed) or "Not removing variable ..." (when kept).
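The trick is easy to verify in isolation: the string is 18 characters long, so index 17 is its final "R". In the sketch below, the alternative prefix "Not r" is an assumption (the text only gives the two resulting messages), and format_removal is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds the trace line. &"..."[17] is the same pointer as "..." + 17 and
   points at the trailing "R" of the literal. */
int format_removal(char *buf, size_t n, int removing, const char *name)
{
    const char *prefix = removing ? &"TARG_VERT_TAB_CHAR"[17]
                                  : "Not r";  /* assumed alternative prefix */
    return snprintf(buf, n, "%semoving variable %s", prefix, name);
}
```

The likely cause is linker string merging: a one-character literal "R" can share storage with the tail of any longer literal that ends in the same bytes, and the decompiler then shows the longer symbol plus an offset.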
Deferred-Class-Members Flag
Pass 4 checks dword_126E55C after each TU's stage 4a. This flag indicates whether there are deferred class member definitions that need processing. If no errors occurred and the flag is set, stages 4b and 4c run. If errors are present during the per-TU loop, the flag is simply cleared to 0 and stages 4b/4c are skipped for that TU.
Pass 5: Per-TU Final Cleanup
Function: sub_588D40 (file_scope_il_wrapup_part_3)
Address: 0x588D40
Source: fe_wrapup.c:559
Per-TU: Yes (all TUs, no error skip)
This pass performs final statement-level processing and scope validation, then optionally re-runs the Pass 2-4 sequence for the main compilation unit.
Operations
- Statement finalization: sub_5BAD30 -- finalizes statement-level IL nodes (label resolution, goto target binding, fall-through analysis).
- Scope stack assertion (C++ with dword_106BA08): Verifies that *(qword_126C5E8 + 784 * dword_126C5E4 + 496) == qword_126E4C0. The scope stack is an array of 784-byte entries at qword_126C5E8, indexed by dword_126C5E4 (current depth). The assertion checks that the scope pointer at offset +496 of the current entry matches the expected file scope entity (qword_126E4C0). On mismatch, triggers a fatal assertion at fe_wrapup.c:559 with function name file_scope_il_wrapup_part_3.
- Scope cleanup: For C++ mode, calls sub_5C9E10(0) -- finalizes class scope processing, resolves deferred member access checks.
- IL output: sub_709250 -- serializes the IL tree to the IL output stream. This produces the internal representation that the backend reads, not the final .int.c file.
- Template output: sub_7C2560 -- serializes template instantiation information to the output.
- Mirrored 3-pass sequence (only when dword_106BA08 -- full compilation mode): Re-runs passes 2-4 on the main TU's file scope node. This handles entities that were discovered or modified during the per-TU passes. The re-run is necessary because secondary TU processing may have added new cross-references to the primary TU's entities:
  - sub_707040(file_scope) (needed flags) -- if errors appear (qword_126ED90), clears dword_126E55C and skips remaining
  - sub_610420(file_scope, 23) with dword_106B640 = 1/0 guard -- again abort if errors
  - sub_5CCA40(file_scope) (clear instantiation flags, C++ only)
  - sub_5CC410() + sub_5CCBF0(file_scope) (eliminate, if dword_126E55C)
- Source file state: sub_6B9580 -- updates source file tracking counters.
- Diagnostic flush: sub_4F4030 -- flushes pending diagnostic messages for this TU.
- File scope cleanup: sub_6B9340(dword_126EC90) -- closes file scope state, passing the current error count for this file.
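The scope stack assertion in the list above is plain address arithmetic over fixed-size records. A minimal sketch, assuming the stack is viewed as a raw byte array; the function and parameter names are hypothetical stand-ins for qword_126C5E8, dword_126C5E4, and qword_126E4C0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SCOPE_ENTRY_SIZE = 784, SCOPE_PTR_OFFSET = 496 };

/* Reads the pointer stored at base + 784 * depth + 496. */
const void *scope_ptr_at_depth(const unsigned char *stack_base, int depth)
{
    const void *p;
    memcpy(&p,
           stack_base + (size_t)SCOPE_ENTRY_SIZE * (size_t)depth
                      + SCOPE_PTR_OFFSET,
           sizeof p);
    return p;
}

/* Shape of the check only; the real code raises a fatal assertion on mismatch. */
int scope_stack_is_consistent(const unsigned char *stack_base, int depth,
                              const void *expected_file_scope)
{
    return scope_ptr_at_depth(stack_base, depth) == expected_file_scope;
}
```

Because the check reads the entry at the current depth, it only passes when the stack has unwound back to the file scope, which is the invariant expected at the end of translation.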
Post-Pass Operations
After all five passes complete, fe_wrapup performs a series of global operations that are not per-TU.
Cross-TU IL Consistency -- sub_796BA0
Address: 0x796BA0
Source: trans_copy.c:3003 (copy_secondary_trans_unit_IL_to_primary)
Called only when there are no errors (!qword_126ED90), the multi-TU flag is clear (!dword_106C2B4), and there are secondary TUs (*(qword_106B9F0) != 0). In the current binary, this function always triggers a fatal assertion at trans_copy.c:3003 -- the multi-TU IL copy infrastructure is compiled but disabled, likely reserved for future C++ module compilation support. The function traces "copy_secondary_trans_unit_IL_to_primary" before aborting.
Scope Renumbering -- sub_707480
Address: 0x707480
Source: scope_stk.c
Called when dword_126C5A0 (scope renumbering flag) is set and dword_126EC78 > 0 (scope count is positive). Executes a double-loop:
unsigned pass = 1;
do {
for (int idx = 1; idx < dword_126EC78; idx++)
sub_707480(idx, pass);
if (!pass) break;
pass = 0;
} while (dword_126EC78 > 0);
dword_126C5A0 = 0;
For each scope entry at qword_126EB98 + 16 * idx:
- Extracts the scope pointer at +0 and file index at +8
- Checks non-null scope pointer, valid file reference in qword_126EC88[file_index]
- Verifies scope kind byte +28 == 17 (class/namespace scope)
- In pass=1: skips entries where byte +176 of the entity at *(scope+32) is non-negative
- Checks bit 1 of byte *(scope-8) is clear and bit 0 of byte +29 is clear
- In C++ mode with bit 5 of byte *(*(scope+32) + 186) set: calls sub_6FFBA0 to reorganize scope members
- Calls sub_6FE2A0(scope, 0, 1) to renumber the scope's declaration entries
After the double-loop, clears dword_126C5A0 = 0.
Template Validation -- sub_765480
Address: 0x765480
Source: templates.c:19822 (remove_unneeded_instantiations)
Called unless dword_106C094 == 1 (minimal compilation mode). Walks the instantiation pending list at qword_12C7740 (linked via offset +8) and removes template instantiations that are no longer needed:
Referent kind (byte +80) | Entity kind | Action |
|---|---|---|
| 9 | Class template instantiation | If function body exists and is unreferenced (or dword_106C094 == 2), call sub_5CE710 to eliminate class definition |
| 7 | Function template instantiation | Same check with dword_106C094 != 2 guard |
| 10-11 | Variable/alias template | Call sub_5BBC70 to find underlying function, then sub_5CAB40 to remove body |
Each entry has: offset +16 = template entity pointer, offset +24 = referent entity, offset +80 = flags byte.
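The entry layout can be captured as a struct whose offsets are checked at compile time. The field names and the padding members are hypothetical; only the offsets (+8 link, +16, +24, +80) come from the text, and pointers are stored as 64-bit integers to match the x86-64 binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pending_instantiation {
    uint64_t unknown_00;       /* +0   not described              */
    uint64_t next;             /* +8   list link                  */
    uint64_t template_entity;  /* +16  template entity pointer    */
    uint64_t referent;         /* +24  referent entity            */
    uint8_t  unknown_20[48];   /* +32 .. +79, not described       */
    uint8_t  flags;            /* +80  flags byte                 */
};

_Static_assert(offsetof(struct pending_instantiation, next) == 8, "link at +8");
_Static_assert(offsetof(struct pending_instantiation, template_entity) == 16,
               "template entity at +16");
_Static_assert(offsetof(struct pending_instantiation, referent) == 24,
               "referent at +24");
_Static_assert(offsetof(struct pending_instantiation, flags) == 80,
               "flags byte at +80");

/* Walks the list through the +8 link and counts the entries. */
size_t count_pending(const struct pending_instantiation *head)
{
    size_t n = 0;
    for (const struct pending_instantiation *e = head; e != NULL;
         e = (const struct pending_instantiation *)(uintptr_t)e->next)
        n++;
    return n;
}
```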
Final Main-TU Cleanup
Calls sub_588D40 one more time on the main translation unit (not iterating the chain). This ensures the primary TU gets the same final cleanup treatment as secondary TUs.
File Index Processing
If the primary TU has secondary TUs (*(qword_106B9F0) != 0), iterates the file table starting at index 2 through dword_126EC80:
for (int idx = 2; idx <= dword_126EC80; idx++) {
if (!qword_126EC88[idx] || *(byte *)(qword_126EB90[idx] + 28))
continue;
sub_6B8B20(idx);
}
sub_6B8B20 resets the file state for each valid, non-header file index, updating the source file manager's tracking structures.
Output Flush and File Close
- Conditional flush: If dword_106C250 is set and no errors, calls sub_5F7DF0(0) -- flushes the IL output stream.
- Close three output files via sub_4F7B10:

  | Call | File pointer | ID | Identity |
  |---|---|---|---|
  | sub_4F7B10(&qword_106C280, 1513) | Primary output | 1513 | Main .int.c output (or stdout) |
  | sub_4F7B10(&qword_106C260, 1514) | Secondary output | 1514 | Module interface or IL dump |
  | sub_4F7B10(&qword_106C258, 1515) | Tertiary output | 1515 | Template instantiation log |

  sub_4F7B10 checks if the file pointer is non-null, zeroes it, calls sub_5AEAD0 (fclose wrapper), and on error triggers diagnostic sub_4F7AA0 with the given ID.
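The close helper follows a common "take, clear, close" pattern that makes repeated calls harmless. A minimal sketch using standard C I/O follows; close_output_file and last_diag_id are hypothetical names, with the global standing in for the diagnostic raised through sub_4F7AA0.

```c
#include <assert.h>
#include <stdio.h>

int last_diag_id = 0;  /* records the ID of the last close failure, if any */

/* Closes *slot if it is open. The slot is cleared before fclose is called,
   so a second call on the same slot is a no-op. */
void close_output_file(FILE **slot, int diag_id)
{
    FILE *fp = *slot;
    if (fp == NULL)
        return;
    *slot = NULL;
    if (fclose(fp) != 0)
        last_diag_id = diag_id;  /* stand-in for emitting the diagnostic */
}
```

Clearing the slot first means that even if the close fails, no later code path can touch the stale handle.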
Memory Statistics Reporting
Triggered when any of these conditions hold:
- dword_106BC80 is set (always-report-stats flag)
- dword_126EFCC > 0 (verbosity level > 0)
- Debug mode (dword_126EFC8) with "space_used" flag active
Sums the return values of 10 subsystem space_used functions:
| # | Function | Address | Subsystem | Report Header |
|---|---|---|---|---|
| 1 | sub_74A980 | 0x74A980 | Symbol table | "Symbol table use:" |
| 2 | sub_6B6280 | 0x6B6280 | Macro table | "Macro table use:" |
| 3 | sub_4ED970 | 0x4ED970 | Error/diagnostic table | "Error table use:" |
| 4 | sub_6887C0 | 0x6887C0 | Conversion table | (conversion/cast subsystem) |
| 5 | sub_4E8F60 | 0x4E8F60 | Declaration table | (declarations subsystem) |
| 6 | sub_56D8C0 | 0x56D8C0 | Expression table | "Expression table use:" |
| 7 | sub_5CEA80 | 0x5CEA80 | IL table | (IL node/class subsystem) |
| 8 | sub_726C80 | 0x726C80 | Mangling table | (name mangling subsystem) |
| 9 | sub_6FDF00 | 0x6FDF00 | Lowering table | (IL lowering subsystem) |
| 10 | sub_419150 | 0x419150 | Diagnostic table | (diagnostic output subsystem) |
Each function prints its own detailed allocation table to stderr in a standardized format with columns Table / Number / Each / Total, tracks "lost" entries (allocated count minus free-list traversal count), and returns its total byte count.
The cumulative sum is passed to sub_6B95C0 (print_memory_management_statistics at 0x6B95C0), which prints the grand total accounting report:
Memory management table use:
Table Number Each Total
text buffers NNN 40 NNNN
Total NNNN
Allocated space in all categories:
Total of above NNNNNNN
Skipped for alignment NNNN
File mapped memory 0
Mapped from PCH 0 (included in previous line)
Mapped IL file size 0
Not listed NNNNN
Total used NNNNNNN
Avail in used mem blocks NNNN
Avail in freed mem blocks 0
Max mem alloc NNNNNNN
The "Not listed" entry is computed as qword_1280700 + qword_1280708 - qword_12806F8 - total_above -- it captures memory allocated by subsystems that do not have their own space_used reporter.
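As a worked instance of that formula, the sketch below names the three globals by address only, since their individual meanings are not established here; the sample numbers in the usage note are made up purely to show the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* not_listed = qword_1280700 + qword_1280708 - qword_12806F8 - total_above */
int64_t not_listed_bytes(int64_t q1280700, int64_t q1280708,
                         int64_t q12806F8, int64_t total_above)
{
    return q1280700 + q1280708 - q12806F8 - total_above;
}
```

For example, with illustrative inputs 1000, 200, 50, and a subsystem total of 900, the residual is 1000 + 200 - 50 - 900 = 250 bytes.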
Debug Dumps
If debug mode (dword_126EFC8) is active:
- "scope_stack" flag: calls sub_702DC0 -- dumps the entire scope stack to stderr, showing all active scopes with their indices, kinds, and entity counts.
- "viability" flag: calls sub_6C6570 -- dumps overload viability information, showing candidate sets and resolution decisions.
Final Teardown
- IL allocator check -- sub_5E1D00 (check_local_constant_use at il_alloc.c:1177): Copies qword_126EFB8 to qword_126EDE8 (restores the IL source position to a baseline). Asserts qword_126F680 == 0 -- no pending local constants should remain after wrapup. If nonzero, fires a fatal assertion.
- Zero global state (five variables, with one cleanup call interleaved):
  - qword_126DB48 = 0 -- pending entity pointer (scope tracking)
  - Call sub_4ED0E0() -- declaration subsystem cleanup (releases declaration pools)
  - dword_126EE48 = 0 -- init-complete flag (cleared, marking end of frontend processing)
  - qword_106BA10 = 0 -- current TU descriptor (no active TU)
  - qword_12C7768 = 0 -- template state pointer 1
  - qword_12C7770 = 0 -- template state pointer 2
- Timing: If debug mode, calls sub_48AFD0 (print trace timing footer for the fe_wrapup section).
Error Gating Summary
Each pass has a distinct error-gating pattern. The conditions below are verified against the decompiled sub_588F90:
| Pass | Error behavior | Decompiled condition |
|---|---|---|
Pass 1 (sub_588C60) | No gate -- always runs. Cleanup operations (template release, exception spec finalization) are safe and necessary even after errors. | None. Unconditional iteration of all secondary TUs followed by primary. |
Cross-TU (sub_796C00) | Skipped entirely if any errors occurred. This prevents cross-TU marking from propagating errors between units. | if (!qword_126ED90) sub_796C00(); (line 67-68 of decompiled) |
Pass 2 (sub_707040) | Per-TU skip. Inside the TU iteration loop, each TU is independently gated: if errors exist when that TU is selected, it is skipped but subsequent TUs may still run. | sub_7A3D60(tu); if (!qword_126ED90) sub_707040(*(qword_106BA10 + 8)); (lines 77-84) |
Pass 3 (sub_610420) | Per-TU skip. Same per-TU gating as Pass 2. When a TU is skipped, dword_106B640 is never set to 1, so the guard flag remains 0. | sub_7A3D60(tu); if (!qword_126ED90) { dword_106B640 = 1; sub_610420(..., 23); dword_106B640 = 0; } (lines 97-108) |
Pass 4 (sub_5CCA40 etc.) | Per-TU skip. On error for a TU: dword_126E55C is cleared to 0, which prevents stages 4b (sub_5CC410) and 4c (sub_5CCBF0) from running for that TU. Stage 4a (sub_5CCA40) is additionally gated by dword_126EFB4 == 2 (C++ only). | sub_7A3D60(tu); if (!qword_126ED90) { ... if (dword_126E55C) { sub_5CC410(); sub_5CCBF0(v8); } } else { dword_126E55C = 0; } (lines 120-137) |
Pass 5 (sub_588D40) | No gate on the per-TU iteration -- always runs. However, the internal mirrored 2-3-4 re-run within sub_588D40 is individually error-gated at each stage. | Unconditional iteration. Internal re-run checks qword_126ED90 before each of sub_707040, sub_610420, sub_5CCA40. |
| Post-passes | sub_796BA0 requires !qword_126ED90 && !dword_106C2B4 && *(qword_106B9F0) != 0. sub_5F7DF0 requires dword_106C250 && !qword_126ED90. All others run unconditionally. | Line 158: if (!qword_126ED90 && !dword_106C2B4 && *v4) sub_796BA0(); Line 213: if (dword_106C250 && !qword_126ED90) sub_5F7DF0(0); |
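The Pass 4 row is the most intricate gate in the table, so a small model may help. This is a sketch of the control flow only: the struct and its counters are hypothetical, standing in for qword_126ED90, dword_126EFB4 == 2, dword_126E55C, and the three stage functions.

```c
#include <assert.h>

struct wrapup_state {
    long error_flag;        /* models qword_126ED90           */
    int  is_cpp;            /* models dword_126EFB4 == 2      */
    int  deferred_members;  /* models dword_126E55C           */
    int  ran_4a, ran_4b, ran_4c;  /* call counters for the sketch */
};

/* Per-TU body of Pass 4, following the decompiled condition in the table. */
void pass4_for_tu(struct wrapup_state *s)
{
    if (!s->error_flag) {
        if (s->is_cpp)
            s->ran_4a++;             /* sub_5CCA40, C++ only */
        if (s->deferred_members) {
            s->ran_4b++;             /* sub_5CC410 */
            s->ran_4c++;             /* sub_5CCBF0 */
        }
    } else {
        s->deferred_members = 0;     /* error: disable 4b/4c */
    }
}
```

Note that the error branch has a lasting effect: once the flag is cleared, later TUs in the same loop skip stages 4b and 4c as well, unless something sets it again.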
Data Flow Summary
| Input | Description |
|---|---|
qword_106B9F0 | TU chain head -- linked list of all translation units |
*(qword_106BA10 + 8) | File scope IL root node -- the IL tree for each TU |
qword_126ED90 | Error flag -- nonzero means compilation errors occurred |
dword_126EFB4 | Language mode -- 2 for C++, gates pass 4 and template operations |
dword_106BA08 | Full compilation mode flag -- gates Pass 5's mirrored sequence |
| Output | Description |
|---|---|
| Finalized IL tree | Entities marked for keeping preserved; all others eliminated |
dword_106B640 | IL emission guard flag -- 0 at completion |
dword_126E55C | Deferred class members flag -- 0 after processing |
| Closed output files | Three output streams (IDs 1513-1515) flushed and closed |
| Zeroed globals | qword_106BA10, dword_126EE48, qword_126DB48, template state -- all cleared |
Function Map
| Address | Identity | Source file | Role in fe_wrapup |
|---|---|---|---|
sub_588F90 | fe_wrapup | fe_wrapup.c:776 | Top-level entry, called from main() |
sub_588C60 | file_scope_il_wrapup | fe_wrapup.c | Pass 1: template/exception cleanup, IL walk, IL finalize |
sub_588D40 | file_scope_il_wrapup_part_3 | fe_wrapup.c:559 | Pass 5: statement finalization, scope assertion, IL/template output |
sub_588E90 | translation_unit_wrapup | fe_wrapup.c | Called from process_translation_unit, not directly from fe_wrapup |
sub_707040 | set_needed_flags_at_end_of_file_scope | scope_stk.c:8090 | Pass 2: compute needed-flags on all entity lists |
sub_610420 | mark_to_keep_in_il | il_walk.c:1959 | Pass 3: mark entities for device code preservation |
sub_5CCA40 | clear_instantiation_required_on_unneeded_entities | il.c:29450 | Pass 4a: prevent unnecessary template instantiation |
sub_5CC410 | eliminate_bodies_of_unneeded_functions | il.c:29231 | Pass 4b: remove dead function bodies |
sub_5CCBF0 | eliminate_unneeded_il_entries | il.c:29598 | Pass 4c: remove dead entities from IL lists |
sub_796C00 | mark_secondary_trans_unit_IL_entities_used_from_primary_as_needed | scope_stk.c | Between Pass 1 and 2: cross-TU reference marking |
sub_796BA0 | copy_secondary_trans_unit_IL_to_primary | trans_copy.c:3003 | Post-pass: dead in CUDA build (always asserts) |
sub_707480 | scope renumber (inferred) | scope_stk.c | Post-pass: renumber scope declarations |
sub_765480 | remove_unneeded_instantiations | templates.c:19822 | Post-pass: prune template instantiation list |
sub_6B95C0 | print_memory_management_statistics | memory mgmt | Post-pass: grand total memory report |
sub_5E1D00 | check_local_constant_use | il_alloc.c:1177 | Post-pass: assert no pending local constants |
sub_7A3D60 | set_current_translation_unit | trans_unit.c | Called before every per-TU operation |
sub_706710 | IL tree walk | IL subsystem | Pass 1 via sub_588C60 |
sub_706F40 | IL finalize | IL subsystem | Pass 1 via sub_588C60 |
sub_6115E0 | walk_tree_and_set_keep_in_il | il_walk.c | Pass 3: recursive keep_in_il walker |
sub_6170C0 | namespace member walk | il_walk.c | Pass 3: using-declaration fixed-point |
sub_6175F0 | C++ companion walk | il_walk.c | Pass 3: out-of-line definitions |
sub_617310 | prune_keep_in_il_walk | il_walk.c | Pass 3: installed as walk prune callback |
sub_5BD350 | destroy temporaries | IL subsystem | Pass 1: C++ temporary cleanup |
sub_7C2690 | template cleanup | template engine | Pass 1: release deferred template state |
sub_68A0C0 | exception cleanup | exception handling | Pass 1: finalize exception specs |
sub_78A9D0 | template_and_inline_entity_wrapup | C++ support | Preamble: C++ pre-wrapup |
sub_78A380 | clear instantiation-required flag | template engine | Pass 4a via sub_5CCA40 |
sub_5CAB40 | eliminate function body | IL subsystem | Pass 4b via sub_5CC410, post-pass via sub_765480 |
sub_5CE710 | eliminate class definition | IL subsystem | Post-pass via sub_765480 |
sub_5CB920 | C++ member cleanup | class subsystem | Pass 4c via sub_5CCBF0 |
sub_5E2D70 | scope deallocation | scope subsystem | Pass 4c via sub_5CCBF0 |
sub_5CC570 | eliminate scope orphaned entries | IL subsystem | Pass 4c via sub_5CCBF0 |
sub_718720 | scope-level cleanup | scope subsystem | Pass 4c via sub_5CCBF0 |
sub_703C30 | get scope summary | scope subsystem | Pass 4c via sub_5CCBF0 |
sub_7B0B60 | walk type tree | type subsystem | Pass 4c: type metadata cleanup |
sub_5C71B0 | type cleanup callback | type subsystem | Pass 4c: invoked via sub_7B0B60 with id 147 |
sub_6FE2A0 | renumber scope entries | scope subsystem | Post-pass via sub_707480 |
sub_6FFBA0 | reorganize scope members | scope subsystem | Pass 4b, scope renumbering |
sub_6FE8C0 | C++ scope merge | scope subsystem | Pass 2: merge declaration/scope lists |
sub_4F7B10 | close output file | file I/O | Post-pass: close 3 files |
sub_5F7DF0 | flush IL output | IL output | Post-pass: conditional flush |
sub_6B8B20 | process file entry | source file mgr | Post-pass: file index loop |
sub_4ED0E0 | declaration cleanup | declarations | State teardown |
sub_709250 | IL output | IL output | Pass 5: serialize IL tree |
sub_7C2560 | template output | template engine | Pass 5: serialize template info |
sub_5BAD30 | statement finalization | statement subsystem | Pass 5: finalize statement-level nodes |
sub_5C9E10 | class scope finalization | class subsystem | Pass 5: C++ scope cleanup |
sub_6B9580 | source file state update | source file mgr | Pass 5: update file tracking |
sub_4F4030 | diagnostic flush | diagnostics | Pass 5: flush pending messages |
sub_6B9340 | file scope close | source file mgr | Pass 5: close file scope with error count |
sub_702DC0 | scope stack dump | scope subsystem | Post-pass: debug dump |
sub_6C6570 | viability dump | overload resolution | Post-pass: debug dump |
sub_48AE00 | debug trace enter | debug subsystem | Preamble, Pass 4b/4c |
sub_48AFD0 | debug trace exit/timing | debug subsystem | Final: print timing |
sub_48A7E0 | debug flag check | debug subsystem | Multiple: check named trace flags |
Diagnostic Strings
| String | Location | When emitted |
|---|---|---|
"fe_wrapup" | sub_588F90 preamble | Debug trace at function entry |
"bad translation unit in fe_wrapup" | sub_588F90 preamble | Fatal assertion when dword_106BA08 == 0 |
"source_file_for_seq_info" | sub_588F90 preamble | Debug flag name for source sequence dump |
"Start of set_needed_flags_at_end_of_file_scope" | sub_707040 entry | Pass 2 debug trace |
"End of set_needed_flags_at_end_of_file_scope" | sub_707040 exit | Pass 2 debug trace |
"needed_flags" | sub_707040, sub_610420 | Debug flag name for needed-flags diagnostics |
"bad scope kind" | sub_707040 | Fatal assertion when scope kind is not 0, 3, or 6 |
"variable_needed_even_if_unreferenced" | sub_707040 | Assertion function name at scope_stk.c:7999/8001 |
"Beginning file scope keep_in_il walk" | sub_610420 entry | Pass 3 debug trace |
"Ending file scope keep_in_il walk" | sub_610420 exit | Pass 3 debug trace |
"mark_to_keep_in_il" | sub_610420 | Fatal assertion function name at il_walk.c:1959 |
"file_scope_il_wrapup_part_3" | sub_588D40 | Assertion function name at fe_wrapup.c:559 |
"clear_instantiation_required_on_unneeded_entities" | sub_5CCA40 | Assertion function name at il.c:29450 |
"eliminate_bodies_of_unneeded_functions" | sub_5CC410 | Debug trace at level 3 |
"eliminate_unneeded_il_entries" | sub_5CCBF0 | Debug trace at level 3 |
"Removing variable ..." | sub_5CCBF0 | Verbose output when removing a variable entity |
"Not removing variable ..." | sub_5CCBF0 | Verbose output when keeping a variable entity |
"Removing routine ..." | sub_5CCBF0 | Verbose output when removing a function entity |
"Not removing routine ..." | sub_5CCBF0 | Verbose output when keeping a function entity |
"Removing hidden name entry for ..." | sub_5CCBF0 | Verbose output during hidden name cleanup |
"check_local_constant_use" | sub_5E1D00 | Assertion function name at il_alloc.c:1177 |
"copy_secondary_trans_unit_IL_to_primary" | sub_796BA0 | Debug trace + fatal assertion at trans_copy.c:3003/3008 |
"remove_unneeded_instantiations" | sub_765480 | Assertion function name at templates.c:19822/19848 |
"scope_stack" | sub_588F90 post-pass | Debug flag name for scope stack dump |
"viability" | sub_588F90 post-pass | Debug flag name for viability analysis dump |
"space_used" | sub_588F90 post-pass | Debug flag name for memory statistics |
"dump_elim" | sub_5CCBF0, sub_5CC410 | Debug flag name for entity removal details |
"Memory management table use:" | sub_6B95C0 | Memory statistics report header |
"Symbol table use:" | sub_74A980 | Symbol table statistics header |
"Macro table use:" | sub_6B6280 | Macro table statistics header |
"Error table use:" | sub_4ED970 | Error table statistics header |
"Expression table use:" | sub_56D8C0 | Expression table statistics header |
Key Global Variables
| Variable | Address | Role in fe_wrapup |
|---|---|---|
qword_106B9F0 | 0x106B9F0 | TU chain head. Iterated by all 5 passes. |
qword_106BA10 | 0x106BA10 | Current TU descriptor. Switched by sub_7A3D60 before each TU. |
qword_126ED90 | 0x126ED90 | Error flag. Passes 2-4 skip TUs when nonzero. |
dword_126EFB4 | 0x126EFB4 | Language mode. 2 = C++. Gates sub_5CCA40, sub_78A9D0, template operations. |
dword_106BA08 | 0x106BA08 | Full compilation flag. Gates preamble assertion and Pass 5's mirrored sequence. |
dword_106B640 | 0x106B640 | IL emission guard. Set=1 during Pass 2 (file scope entry) and Pass 3 (caller). Asserted by sub_610420. Cleared=0 at end. |
dword_126E55C | 0x126E55C | Deferred class members flag. When set, enables stages 4b and 4c. Cleared on error exit. |
dword_126C5A0 | 0x126C5A0 | Scope renumbering flag. When set, enables post-pass sub_707480 double-loop. Cleared after. |
dword_126EC78 | 0x126EC78 | Scope count. Controls iteration bounds for sub_707480 and sub_5CC410. |
qword_126EB98 | 0x126EB98 | Scope table base. 16-byte entries: {qword scope_ptr, int file_index, pad}. |
dword_126EC80 | 0x126EC80 | File table entry count. Controls file index processing loop. |
qword_126EC88 | 0x126EC88 | File table (name/scope pointers). Indexed by file ID. |
qword_126EB90 | 0x126EB90 | File table (info entries). Indexed by file ID. |
dword_106C094 | 0x106C094 | Compilation mode. Value 1 skips sub_765480 (template validation). |
dword_106C250 | 0x106C250 | Output flush flag. When set with no errors, calls sub_5F7DF0(0). |
dword_106C268 | 0x106C268 | CUDA diagnostics flag. Gates sub_6B3260 in preamble. |
dword_106C2B4 | 0x106C2B4 | Cross-TU copy disabled. When set, skips sub_796BA0. |
dword_126EFC8 | 0x126EFC8 | Debug/trace mode. Enables trace output and debug dumps throughout. |
dword_126EFCC | 0x126EFCC | Diagnostic verbosity level. Level > 0 enables memory stats, > 2 enables dump_elim. |
dword_106BC80 | 0x106BC80 | Always-report-stats flag. Forces memory statistics regardless of verbosity. |
dword_126EE48 | 0x126EE48 | Init-complete flag. Set to 1 during fe_init_part_1, cleared to 0 during teardown. |
qword_126DB48 | 0x126DB48 | Scope tracking pointer. Cleared during teardown. |
qword_12C7768 | 0x12C7768 | Template state pointer 1. Cleared during teardown. |
qword_12C7770 | 0x12C7770 | Template state pointer 2. Cleared during teardown. |
qword_126E4C0 | 0x126E4C0 | Expected file scope entity. Compared in Pass 5 scope assertion. |
qword_126C5E8 | 0x126C5E8 | Scope stack base pointer. Array of 784-byte entries. |
dword_126C5E4 | 0x126C5E4 | Current scope stack depth index. |
dword_126E204 | 0x126E204 | Template mode flag. Affects instantiation-required clearing in Pass 4a. |
qword_126EBA0 | 0x126EBA0 | Deferred entity list head. Walked in Pass 3. |
qword_126EBE0 | 0x126EBE0 | Global deferred entity chain. Cleaned in Pass 4c. |
qword_12C7740 | 0x12C7740 | Template instantiation pending list. Walked by sub_765480. |
qword_126DFE0 | 0x126DFE0 | File-index-to-TU mapping table. Used for TU ownership checks. |
Cross-References
- Pipeline Overview -- fe_wrapup is stage 6 in the 8-stage pipeline
- Keep-in-IL -- detailed coverage of the device code selection mechanism (Pass 3)
- IL Overview -- the IL data structures walked by all five passes
- Backend Code Generation -- stage 7, consumes the finalized IL produced by fe_wrapup
- Entry Point & Initialization -- the main() function that calls sub_588F90
- Frontend Invocation -- stage 5, builds the IL tree that fe_wrapup finalizes
- Timing & Exit -- fe_wrapup completion marks the end of "Front end time"
- Device/Host Separation -- the keep_in_il mechanism's relationship to device code isolation
Backend Code Generation
The backend is the final stage of the cudafe++ pipeline (stage 7 in the overview). It lives in a single function, process_file_scope_entities (sub_489000, 723 decompiled lines, 4520 bytes), whose job is to walk the EDG source sequence produced by the frontend and emit a .int.c file that the host C++ compiler (gcc, clang, or cl.exe) can compile. The function resides in cp_gen_be.c at EDG source lines around 19916-26628, and it delegates per-entity code generation to gen_template (sub_47ECC0, 1917 decompiled lines), which dispatches on entity kind to specialized generators for variables, types, routines, namespaces, and templates.
The backend is gated by the skip-backend flag (dword_106C254): if set to 1 (errors occurred during the frontend), main() never calls sub_489000 and proceeds directly to exit.
Key Facts
| Property | Value |
|---|---|
| Function | sub_489000 (process_file_scope_entities) |
| Binary address | 0x489000 |
| Binary size | 4520 bytes (723 decompiled lines) |
| EDG source | cp_gen_be.c |
| Callees | ~140 distinct call targets |
| Output | .int.c file (or stdout when filename is "-") |
| Main dispatcher | sub_47ECC0 (gen_template, 1917 lines) |
| Host reference emitter | sub_6BCF80 (nv_emit_host_reference_array) |
| Module ID writer | sub_5B0180 (write_module_id_to_file) |
| Skip-backend flag | dword_106C254 |
| Backend timing label | "Back end time" |
Output Primitives
All output to the .int.c file passes through a small set of character-level emitters. Understanding these is essential for reading the decompiled backend code, since every line of generated C/C++ is assembled from these calls:
| Function | Address | Identity | Behavior |
|---|---|---|---|
sub_467D60 | 0x467D60 | emit_newline | Writes \n via putc(10, stream). Increments dword_1065820 (line counter). Resets dword_106581C (column counter) and dword_1065830 to 0. Calls sub_403730 (write error abort) on failure. |
sub_467DA0 | 0x467DA0 | emit_line_directive | Checks dword_1065818 (needs-line-directive flag). If the current source position (qword_1065810) differs from the output line counter, calls sub_467EB0 to emit a #line N "file" directive. Resets dword_1065818 to 0. Handles close-range line gaps (within 5 lines) by emitting blank lines instead of a #line directive. |
sub_467E50 | 0x467E50 | emit_string | If dword_1065818 is set, calls emit_line_directive first. Writes each character of the string via putc. Increments dword_106581C by the string length. |
sub_467EB0 | 0x467EB0 | emit_line_number | Emits #line N "file" or # N "file" (short form when dword_106C28C or MSVC EDG-native mode is set). Constructs the directive in a stack buffer starting with #line , appends the decimal line number, then the quoted filename via sub_5B1940. Sets dword_1065820 to the target line number. Resets column counters. |
sub_468150 | 0x468150 | emit_char | If dword_1065818 is set, calls emit_line_directive first. Writes a single character via putc. Increments dword_106581C by 1. |
sub_468190 | 0x468190 | emit_raw_string | Like emit_string but without strlen -- walks the string character by character, incrementing dword_106581C per character. Calls emit_line_directive first if dword_1065818 is set. |
sub_468270 | 0x468270 | emit_decimal | Writes an unsigned integer as decimal digits. Has fast paths for 1-5 digit numbers (manual digit extraction via division by powers of 10). Falls back to sub_465480 (sprintf-style) for larger numbers. Calls emit_line_directive first if needed. |
sub_46BC80 | 0x46BC80 | emit_line_start | If the column counter is nonzero, first emits a newline. Increments dword_1065834 (indent level). Calls emit_line_directive if needed. Then writes the string character by character. Used for the first token on a new line (e.g., #define, #ifdef). |
Output State Variables
| Variable | Address | Type | Role |
|---|---|---|---|
stream | 0x106583x | FILE* | Output file handle for .int.c |
dword_1065834 | 0x1065834 | int | Indent level counter. Incremented by emit_line_start, decremented after each directive block. Not used for actual indentation emission -- tracks logical nesting depth for #line management. |
dword_1065820 | 0x1065820 | int | Output line counter. Tracks the current line number in the generated .int.c file. Incremented by every \n written. |
dword_106581C | 0x106581C | int | Output column counter. Tracks the current column position. Reset to 0 after each newline. |
dword_1065830 | 0x1065830 | int | Column counter after last newline (secondary tracking). Reset to 0 with dword_106581C. |
dword_1065818 | 0x1065818 | int | Needs-line-directive flag. Set to 1 when the source position changes. Checked by every output primitive; when set, a #line directive is emitted before the next output. |
qword_1065810 | 0x1065810 | qword | Current source position (line number from the original .cu file). Updated when processing each entity. |
qword_1065828 | 0x1065828 | qword | Current source file index. Compared against new file references to decide whether to emit a #line with filename. |
qword_126EDE8 | 0x126EDE8 | qword | Mirror of qword_1065810. Updated in parallel; used by other subsystems to query current position. |
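The interaction between the primitives and the state variables above can be sketched in a few lines of C. This is an illustrative model only, not the decompiled code: the Emitter struct, put_raw, and the in-memory buffer are hypothetical stand-ins for the globals and the FILE* stream, the filename part of the #line directive is omitted, and the 5-line gap rule is taken from the emit_line_directive description.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the output globals; the real backend writes to a FILE*. */
typedef struct {
    char   buf[512];    /* in-memory replacement for stream */
    size_t len;
    int    out_line;    /* role of dword_1065820 */
    int    out_col;     /* role of dword_106581C */
    int    needs_line;  /* role of dword_1065818 */
    int    src_line;    /* role of qword_1065810 */
} Emitter;

/* Appends one character, silently dropping output once the buffer is full. */
static void put_raw(Emitter *e, char c) {
    if (e->len + 1 < sizeof e->buf) {
        e->buf[e->len++] = c;
        e->buf[e->len] = '\0';
    }
}

static void emit_newline(Emitter *e) {
    put_raw(e, '\n');
    e->out_line++;
    e->out_col = 0;
}

/* Small forward gaps become blank lines; anything else becomes "#line N". */
static void emit_line_directive(Emitter *e) {
    e->needs_line = 0;
    if (e->out_col != 0)
        emit_newline(e);
    int gap = e->src_line - e->out_line;
    if (gap >= 0 && gap <= 5) {
        while (gap-- > 0)
            emit_newline(e);
    } else {
        char tmp[32];
        int n = snprintf(tmp, sizeof tmp, "#line %d\n", e->src_line);
        for (int i = 0; i < n; i++)
            put_raw(e, tmp[i]);
        e->out_line = e->src_line;
        e->out_col = 0;
    }
}

static void emit_string(Emitter *e, const char *s) {
    if (e->needs_line)
        emit_line_directive(e);
    for (const char *p = s; *p; p++)
        put_raw(e, *p);
    e->out_col += (int)strlen(s);
}
```

A caller would zero-initialize an Emitter, set out_line to 1, and then set src_line and needs_line whenever the source position changes before calling emit_string.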
Execution Flow
The backend proceeds through seven sequential phases within sub_489000:
sub_489000 (process_file_scope_entities)
|
|-- Phase 1: State initialization (40+ globals zeroed, 4 buffers cleared)
|-- Phase 2: Output file opening (.int.c or stdout)
|-- Phase 3: Boilerplate emission (GCC diagnostics, managed runtime, lambda macros)
|-- Phase 4: Main entity loop (walk source sequence, dispatch to gen_template)
|-- Phase 5: Empty file guard + scope unwind (sub_466C10)
|-- [optional] Breakpoint placeholders (qword_1065840 list)
|-- Phase 6: File trailer (#line, _NV_ANON_NAMESPACE, #include, #undef)
|-- Phase 7: Host reference arrays (sub_6BCF80 x 6, conditional on dword_106BFD0/BFCC)
|
+-- sub_4F7B10: close output file (ID 1701)
Phase 1: State Initialization
The function begins by zeroing approximately 40 global variables and clearing four large buffers. This ensures no state leaks between compilation units (relevant in the recompilation loop, though in practice sub_489000 runs exactly once).
Scalar Zeroing
The first 20 lines of the decompiled function zero individual globals:
dword_1065834 = 0; // indent level
dword_1065830 = 0; // column after newline
stream = 0; // FILE* handle
qword_126EDE8 = 0; // current source position (low 6 bytes)
qword_1065828 = 0; // current file index
dword_1065820 = 0; // output line counter
dword_106581C = 0; // output column counter
dword_1065818 = 0; // needs-line-directive flag
qword_1065748 = 0; // source sequence cursor
qword_1065740 = 0; // alternate source sequence cursor
qword_126C5D0 = 0; // (template instantiation tracking)
dword_106573C = 0;
dword_1065734 = 0;
dword_1065730 = 0;
dword_106572C = 0;
qword_1065708 = 0; // scope stack head
qword_1065720 = 0; // scope free list
qword_1065700 = 0; // scope pool head
dword_10656FC = 0; // current access specifier
// ... additional counters, flags, sequence pointers
Additional globals zeroed later (after the callback setup):
dword_1065758 = 0; dword_1065754 = 0; dword_1065750 = 0;
dword_10656F8 = 0; dword_10656F4 = 0;
qword_1065718 = 0; qword_1065710 = 0;
dword_1065728 = 0; qword_F05708 = 0;
Buffer Clearing
Four memset calls clear hash tables / lookup buffers:
| Buffer Base | Size (hex) | Size (decimal) | Description |
|---|---|---|---|
unk_FE5700 | 0x7FFE0 | 524,256 bytes (~512 KB) | Entity lookup hash table |
unk_F65720 | 0x7FFE0 | 524,256 bytes (~512 KB) | Type lookup hash table |
qword_E85720 | 0x7FFE0 | 524,256 bytes (~512 KB) | Declaration tracking table |
xmmword_F05720 | 0x5FFE8 | 393,192 bytes (~384 KB) | Scope/name resolution table |
Total: 1,965,960 bytes (approximately 1.87 MB) of memory zeroed at backend entry.
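As a quick check of the arithmetic, the four sizes from the table can be summed directly. The array and function names below are hypothetical; only the hex sizes come from the table. The sum is 1,965,960 bytes, which is about 1.87 MB when 1 MB is taken as 1,048,576 bytes.

```c
#include <stddef.h>

/* Sizes copied from the memset table above. */
static const size_t kClearedSizes[4] = { 0x7FFE0, 0x7FFE0, 0x7FFE0, 0x5FFE8 };

/* Returns the total number of bytes cleared by the four memset calls. */
static size_t total_cleared_bytes(void) {
    size_t total = 0;
    for (size_t i = 0; i < 4; i++)
        total += kClearedSizes[i];
    return total;
}
```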
Callback Table Setup
After zeroing, the function initializes two tables of function pointers:
gen_be_info callbacks (6 entries at xmmword_1065760..10657B0):
sub_5F9040(&xmmword_1065760); // clear the table first
xmmword_1065760 = off_83BD60; // callback 0: expression gen
xmmword_1065778 = off_83BD68; // callback 1: type gen
xmmword_1065788 = off_83BD70; // callback 2: declaration gen
xmmword_10657A0 = off_83BD78; // callback 3: statement gen
xmmword_10657B0 = qword_83BD80; // callback 4: scope gen
These pointers are loaded from read-only data via SSE (_mm_loadh_ps), packing two 8-byte function pointers per 16-byte XMM value.
Direct callback assignments (4 entries):
| Variable | Address | Value | Identity |
|---|---|---|---|
qword_10657C0 | 0x10657C0 | sub_46BEE0 | gen_statement_expression (only set when not in MSVC __declspec mode) |
qword_10657C8 | 0x10657C8 | loc_469200 | gen_type_operator_expression |
qword_10657D0 | 0x10657D0 | sub_466F40 | gen_be_helper_1 |
qword_10657D8 | 0x10657D8 | sub_4686C0 | gen_be_helper_2 |
Host Compiler Version Detection
A block of conditionals determines warning suppression behavior based on the host compiler version:
byte_10657F0 = 1; // always set
byte_10657F1 = byte_126EBB0; // copy verbose-line-dir flag
if (dword_126EFB4 == 2 // CUDA mode
|| dword_126EF68 <= 199900) // C++ standard <= C++98
{
byte_10657F4 = (dword_126EFB0 != 0); // copy flag
} else {
byte_10657F4 = 1; // force on for newer standards
}
The byte_1065803 flag is set to 1 when MSVC mode (dword_126E1D8) is active or when the GNU/Clang version falls in a narrow range: the check subtracts 40500 from qword_126E1F0 and accepts a difference of at most 2, i.e., encoded versions 40500-40502.
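The subtract-then-compare pattern is a common way for compilers to test a closed range with a single comparison. The sketch below assumes the comparison is unsigned (the usual form of this idiom); the helper name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: true when the encoded version lies in [40500, 40502]. */
static int version_in_window(uint64_t encoded_version) {
    /* Unsigned wraparound turns values below 40500 into very large numbers,
       so one comparison covers both the lower and the upper bound. */
    return (encoded_version - 40500u) <= 2u;
}
```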
Scope Stack Allocation
A dynamic scope tracking structure is allocated (or resized if it exists from a prior run):
if (qword_10656E8) {
// resize existing: realloc to 16 * (count + 1) bytes
sub_6B74D0(*(qword_10656E8), 16 * (*(qword_10656E8 + 8) + 1));
} else {
// allocate fresh: 16-byte header
v0 = sub_6B7340(16);
qword_10656E8 = v0;
}
// allocate 1024-byte data block, zero it, attach to header
v2 = sub_6B7340(1024);
// zero 1024 bytes in 16-byte steps (64 iterations of 16 bytes each)
*v0 = v2;
v0[1] = 63; // capacity = 63 entries
This creates a 64-slot lookup table (63 usable entries plus sentinel) for tracking entity references during code generation.
Phase 2: Output File Opening
The function opens the output .int.c file. Two paths are possible:
Stdout mode: If the output filename (qword_126EEE0) equals "-", the function sets stream = stdout.
// strcmp(qword_126EEE0, "-")
if (filename_is_dash) {
stream = stdout;
}
File mode: Otherwise, the function constructs the output path by appending .int.c to the base filename (stripping the original extension):
v55 = qword_106BF20; // pre-set output path (CLI override)
if (!v55)
v55 = sub_5ADD90(qword_126EEE0, ".int.c"); // derive_name: strip ext, add ".int.c"
stream = sub_4F48F0(v55, 0, 0, 0, 1701); // open_output_file (mode 1701)
The sub_5ADD90 function (derive_name) finds the last . in the filename, strips the extension, and appends .int.c. It handles multi-byte UTF-8 characters correctly when scanning for the dot position. The constant 1701 is the file descriptor identifier used by the file management subsystem.
After opening the file, sub_5B9A20 is called to initialize the output stream state, and sub_467EB0 emits the initial #line 1 directive.
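The extension replacement performed by derive_name can be approximated as follows. This is a simplified sketch, not the decompiled routine: the function name is hypothetical, and restricting the dot search to the part after the last '/' is an assumption. A plain byte scan is safe for UTF-8 input because the byte 0x2E never occurs inside a multi-byte UTF-8 sequence.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical simplification of derive_name (sub_5ADD90).
   Returns a malloc'd string that the caller must free, or NULL on failure. */
static char *derive_name_sketch(const char *path, const char *new_ext) {
    const char *slash = strrchr(path, '/');
    const char *base  = slash ? slash + 1 : path;   /* assumption: ignore dots in directories */
    const char *dot   = strrchr(base, '.');
    size_t stem_len   = dot ? (size_t)(dot - path) : strlen(path);
    size_t ext_len    = strlen(new_ext);
    char *out = malloc(stem_len + ext_len + 1);
    if (!out)
        return NULL;
    memcpy(out, path, stem_len);
    memcpy(out + stem_len, new_ext, ext_len + 1);   /* copies the terminating NUL too */
    return out;
}
```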
Phase 3: Boilerplate Emission
Before processing any user declarations, the backend emits several blocks of boilerplate that the host compiler needs. The exact output depends on the host compiler identity (Clang, GCC, MSVC) and the CUDA mode.
GCC Diagnostic Suppressions
Multiple #pragma GCC diagnostic directives suppress host compiler warnings that would be spurious for generated code:
// Conditional on Clang version > 30599 (0x7787) or GNU version > 40799 (0x9F5F)
#pragma GCC diagnostic ignored "-Wunused-local-typedefs"
// Conditional on dword_126EFA8 (attribute mode) && dword_106C07C
#pragma GCC diagnostic ignored "-Wattributes"
// Clang or recent GNU/Clang:
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-function"
// Clang-specific additional suppressions:
#pragma GCC diagnostic ignored "-Wunused-private-field"
#pragma GCC diagnostic ignored "-Wunused-parameter"
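The version thresholds use the encoded host compiler version from qword_126EF90 (Clang version) and qword_126E1F0 (GCC/Clang combined version). Assuming the usual major*10000 + minor*100 + patch encoding, each constant is the highest x.y.99 value of a release series, so a greater-than test selects the next minor release and later:
| Hex constant | Decimal | Approximate version |
|---|---|---|
0x7787 | 30,599 | Clang 3.5.99 (greater-than selects 3.6+) |
0x9D07 | 40,199 | GCC/Clang 4.1.99 (greater-than selects 4.2+) |
0x9E97 | 40,599 | GCC/Clang 4.5.99 (greater-than selects 4.6+) |
0x9F5F | 40,799 | GCC/Clang 4.7.99 (greater-than selects 4.8+) |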
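The thresholds are easier to read once the encoding is spelled out. The sketch below assumes the conventional major*10000 + minor*100 + patch scheme, which is consistent with the constants in the table but is not confirmed by the binary itself; both function names are hypothetical.

```c
/* Hypothetical encoder for the assumed major*10000 + minor*100 + patch scheme. */
static long encode_version(int major, int minor, int patch) {
    return (long)major * 10000 + (long)minor * 100 + patch;
}

/* Example gate modeled on the -Wunused-local-typedefs condition for GNU hosts. */
static int gnu_needs_unused_local_typedefs_pragma(long gnu_version) {
    return gnu_version > 0x9F5F;   /* 40799: anything newer than 4.7.99 */
}
```

Under this assumption, GCC 4.8.0 encodes as 40800 and passes the gate, while GCC 4.7.2 encodes as 40702 and does not.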
Managed Runtime Boilerplate
A block of C code is emitted unconditionally for __managed__ variable support:
static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) {
__nv_fatbinhandle_for_managed_rt = in;
}
static char __nv_init_managed_rt_with_module(void **);
Followed by the inline initialization helper:
__attribute__((unused)) // added when dword_106BF6C (alt host mode) is set
static inline void __nv_init_managed_rt(void) {
__nv_inited_managed_rt = (__nv_inited_managed_rt
? __nv_inited_managed_rt
: __nv_init_managed_rt_with_module(__nv_fatbinhandle_for_managed_rt));
}
This boilerplate is surrounded by a #pragma GCC diagnostic push / pop pair to suppress warnings about unused variables/functions in the boilerplate itself.
After the pop, additional #pragma GCC diagnostic ignored directives may be emitted for the remainder of the file (outside the push/pop scope), depending on compiler version.
Lambda Detection Macros
When extended lambda mode (dword_106BF38) is NOT active, three stub macro definitions are emitted:
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
Followed by a self-checking #if defined block:
#if defined(__nv_is_extended_device_lambda_closure_type) \
&& defined(__nv_is_extended_host_device_lambda_closure_type) \
&& defined(__nv_is_extended_device_lambda_with_preserved_return_type)
#endif
When extended lambda mode IS active, these macros are not emitted -- the frontend's keyword registration has already defined them as built-in type traits recognized by the parser. The empty #if defined / #endif block serves as a guard that downstream tools can detect.
Phase 4: Main Entity Loop
This is the core of the backend. The source sequence cursor qword_1065748 is initialized from the file scope IL node's declaration list at offset +256: qword_1065748 = *(*(xmmword_126EB60 + 8) + 256), where the high qword of xmmword_126EB60 points to the file scope root (set during fe_wrapup). The cursor walks this linked list of top-level declarations in the order they appeared in the source file. For each entry, it dispatches based on the entry's kind field at offset +16.
Source Sequence Entry Structure
Each source sequence entry has this layout:
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Pointer to next entry in the linked list |
| +8 | 1 | sub_kind | Sub-classification within the kind |
| +9 | 1 | skip_flag | If nonzero, entry has already been processed |
| +16 | 1 | kind | Entry kind (see dispatch table below) |
| +24 | 8 | entity | Pointer to the EDG entity node for this declaration |
| +32 | 8 | source_position | Source file/line encoding |
| +48 | 8 | pragma_text | For pragma entries: pointer to raw pragma string |
| +56 | 8 | stdc_kind / pragma_data | STDC pragma kind or additional pragma metadata |
| +57 | 1 | stdc_value | STDC pragma value (ON/OFF/DEFAULT) |
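For reimplementation work it can help to pin the offsets from the table in a C struct and let the compiler verify them. The struct below is a hypothetical view: field names follow the table, pointers are modeled as 64-bit integers so the layout does not depend on the build host, the padding members cover bytes the table does not describe, and the 8-byte pragma_data that overlays +56 is not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the layout table; padding covers undocumented bytes. */
typedef struct SourceSeqEntry {
    uint64_t next;             /* +0  pointer to next entry */
    uint8_t  sub_kind;         /* +8  */
    uint8_t  skip_flag;        /* +9  */
    uint8_t  pad0[6];          /* +10..+15, not described */
    uint8_t  kind;             /* +16 */
    uint8_t  pad1[7];          /* +17..+23, not described */
    uint64_t entity;           /* +24 pointer to entity node */
    uint64_t source_position;  /* +32 */
    uint8_t  pad2[8];          /* +40..+47, not described */
    uint64_t pragma_text;      /* +48 pointer to raw pragma string */
    uint8_t  stdc_kind;        /* +56 */
    uint8_t  stdc_value;       /* +57 */
} SourceSeqEntry;

/* Compile-time checks against the documented offsets. */
_Static_assert(offsetof(SourceSeqEntry, kind) == 16, "kind must sit at +16");
_Static_assert(offsetof(SourceSeqEntry, entity) == 24, "entity must sit at +24");
_Static_assert(offsetof(SourceSeqEntry, pragma_text) == 48, "pragma_text must sit at +48");
_Static_assert(offsetof(SourceSeqEntry, stdc_value) == 57, "stdc_value must sit at +57");
```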
Dual-Cursor Iteration
The loop uses two cursors -- qword_1065748 (primary) and qword_1065740 (alternate) -- to handle pragma interleavings. When the primary cursor encounters a kind-53 entry (a continuation marker), it switches to the alternate cursor. This mechanism handles the case where pragmas are interleaved between parts of a single declaration:
for (i = qword_1065748; i != NULL; ) {
if (entry_kind(i) == 53) { // continuation marker
// save as alternate, follow continuation chain
alt_cursor = i;
i = *(i->entity + 8); // follow entity's next pointer
continue;
}
if (entry_kind(i) == 57) { // pragma interleave
entity = i->entity;
// advance past pragma entries to find next real entity
for (i = i->next; i && entry_kind(i) == 53; ) {
alt_cursor = i;
i = *(i->entity + 8);
}
// handle the pragma inline (see below)
...
} else {
// non-pragma entity: dispatch to gen_template
sub_47ECC0(0);
}
}
When the primary cursor is exhausted and an alternate cursor exists, the primary takes the alternate's next pointer and continues. This ensures correct ordering when pragmas split a declaration sequence.
Full Main Loop Pseudocode
The following pseudocode is derived from the decompiled sub_489000 (lines 288-558) and shows the complete dispatch logic. The variable v12 tracks whether any non-pragma entity was emitted (used by the empty file guard in Phase 5). The variable v14 saves/restores byte_10657FB across pragma handling.
// Initialize source sequence cursor from file scope node
qword_1065748 = *(xmmword_126EB60_high + 256); // source sequence list head
byte_10656F0 = (dword_126EFB4 != 2) + 2; // linkage: 3=C++, 2=C
sub_466E60(...); // init output state
v12 = 0; // no entities emitted yet
while (1) {
v14 = byte_10657FB; // save pragma-in-progress flag
i = qword_1065748; // primary cursor
alt = qword_1065740; // alternate cursor
modified_primary = false;
modified_alt = false;
while (i != NULL) {
kind = *(byte*)(i + 16);
if (kind == 57) {
// --- Pragma interleave ---
entity = *(qword*)(i + 24);
// Walk past continuation markers (kind 53)
for (i = *(qword*)i; i != NULL; ) {
if (*(byte*)(i + 16) != 53) break;
alt = i;
modified_alt = true;
i = *(qword*)(*(qword*)(i + 24) + 8); // follow entity next
}
if (i == NULL && alt != NULL) {
i = *(qword*)alt;
alt = NULL;
modified_alt = true;
}
modified_primary = true;
if (*(byte*)(entity + 9)) // skip_flag set?
continue; // already processed
// Commit cursor state
qword_1065748 = i;
if (modified_alt) qword_1065740 = alt;
byte_10657FB = 1; // mark pragma context
// Set source position from pragma entity
dword_1065818 = 1; // needs line directive
qword_1065810 = *(qword*)(entity + 32);
qword_126EDE8 = *(qword*)(entity + 32);
sub_kind = *(byte*)(entity + 8);
switch (sub_kind) {
case 26: // STDC pragma
emit_line_start("#pragma ");
emit_raw("STDC ");
switch (*(byte*)(entity + 56)) {
case 1: emit_raw("FP_CONTRACT "); break;
case 2: emit_raw("FENV_ACCESS "); break;
case 3: emit_raw("CX_LIMITED_RANGE "); break;
default: assertion("gen_stdc_pragma: bad kind");
}
switch (*(byte*)(entity + 57)) {
case 1: emit_raw("OFF"); break;
case 2: emit_raw("ON"); break;
case 3: emit_raw("DEFAULT"); break;
default: assertion("gen_stdc_pragma: bad value");
}
emit_newline();
break;
case 21: // Line directive pragma
emit_line_start("#line ");
byte_10657F9 = 1;
sub_5FCAF0(*(qword*)(entity + 56), 0, &xmmword_1065760);
byte_10657F9 = 0;
emit_newline();
break;
default: // Generic pragma (including sub_kind 19)
if (!*(qword*)(entity + 48))
assertion("gen_pragma: NULL pragma_text");
emit_line_start("#pragma ");
emit_raw(*(char**)(entity + 48));
emit_newline();
if (sub_kind == 19)
dword_10656F8 = *(int*)(entity + 56); // track #pragma pack
break;
}
byte_10657FB = v14; // restore saved flag
continue; // next iteration
}
// --- Non-pragma entity ---
if (modified_primary) qword_1065748 = i;
if (modified_alt) qword_1065740 = alt;
if (kind == 53) {
// Continuation marker: switch to alternate cursor
alt = i;
modified_alt = true;
i = *(qword*)(*(qword*)(i + 24) + 8);
continue;
}
if (kind == 52) // end_of_construct: should never appear at top level
sub_4F2930("cp_gen_be.c", 26628,
"process_file_scope_entities",
"Top-level end-of-construct entry", 0);
v12 = 1; // mark: entity emitted
sub_47ECC0(0); // gen_template(recursion_level=0)
// Loop continues from updated qword_1065748
}
// Exhausted primary cursor; check for pending alternate
if (i == NULL && alt != NULL) {
i = *(qword*)alt;
alt = NULL;
// ... continue outer loop
} else {
break; // done
}
}
// Final cursor cleanup
if (modified_primary) qword_1065748 = 0;
if (modified_alt) qword_1065740 = alt;
Entity Kind Dispatch
For non-pragma entries (kind != 57), the loop calls sub_47ECC0(0) (gen_template with recursion level 0), which reads the current entity from qword_1065748 and dispatches based on the entity's kind:
| Kind | Name | Handler |
|---|---|---|
| 2 | variable_decl | sub_484A40 (gen_variable_decl) or inline |
| 6 | type_decl | sub_4864F0 (gen_type_decl) |
| 7 | parameter_decl | sub_484A40 |
| 8 | field_decl | Inline field handler |
| 11 | routine_decl | sub_47BFD0 (gen_routine_decl, 1831 lines) |
| 28 | namespace | Inline namespace handler (recursive sub_47ECC0(0)) |
| 29 | using_decl | Inline using-declaration handler |
| 42 | asm_decl | __asm(...) generation |
| 51 | indirect | Unwrap and re-dispatch |
| 52 | end_of_construct | Assertion (kind 52 triggers sub_4F2930 diagnostic) |
| 54 | instantiation | Template instantiation directive |
| 58 | template | Template definition |
| 66 | alias_decl | Alias declaration (using X = Y) |
| 67 | concept_decl | Concept handling |
| 83 | deduction_guide | Deduction guide |
Inline Pragma Handling
Kind 57 entries are pragma interleavings that appear between declarations. The backend handles three sub-kinds inline within sub_489000:
Sub-kind 26: STDC Pragma
Emits #pragma STDC <kind> <value>:
// Read pragma kind from offset +56
switch (stdc_kind) {
case 1: emit("FP_CONTRACT "); break;
case 2: emit("FENV_ACCESS "); break;
case 3: emit("CX_LIMITED_RANGE "); break;
default: assertion_failure("gen_stdc_pragma: bad kind");
}
// Read pragma value from offset +57
switch (stdc_value) {
case 1: emit("OFF"); break;
case 2: emit("ON"); break;
case 3: emit("DEFAULT"); break;
default: assertion_failure("gen_stdc_pragma: bad value");
}
The #pragma keyword is emitted character-by-character from a hardcoded string at address 0x838441 ("#pragma "), followed by "STDC " from address 0x83847B.
Sub-kind 21: Raw Pragma (Line Directive)
Calls sub_5FCAF0 to emit a preprocessor line directive using the pragma's data. The byte_10657F9 flag is set to 1 during emission and reset to 0 afterward, temporarily changing the line-directive emission format.
Sub-kind 19 (or other): Generic Pragma
For all other pragma sub-kinds, the backend reads the raw pragma text from offset +48 and emits it character by character after a #pragma prefix:
if (!entity->pragma_text)
assertion_failure("gen_pragma: NULL pragma_text");
emit("#pragma ");
emit_raw_string(entity->pragma_text);
emit_newline();
For sub-kind 19 specifically, the function also records the pragma data in dword_10656F8, tracking #pragma pack state.
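The three inline pragma paths can be condensed into one formatting function. This is a sketch under stated assumptions: format_pragma is a hypothetical name, it formats into a caller-supplied buffer instead of calling the emit primitives, it returns -1 where the backend would raise an assertion, and sub-kind 21 is left unmodeled because its output is produced by sub_5FCAF0.

```c
#include <stdio.h>

/* Returns the number of characters written, or -1 for unmodeled or asserting cases. */
static int format_pragma(char *out, size_t cap, int sub_kind,
                         int stdc_kind, int stdc_value, const char *text) {
    static const char *const kinds[]  = { NULL, "FP_CONTRACT", "FENV_ACCESS", "CX_LIMITED_RANGE" };
    static const char *const values[] = { NULL, "OFF", "ON", "DEFAULT" };

    if (sub_kind == 26) {                       /* STDC pragma */
        if (stdc_kind < 1 || stdc_kind > 3 || stdc_value < 1 || stdc_value > 3)
            return -1;                          /* "bad kind" / "bad value" assertions */
        return snprintf(out, cap, "#pragma STDC %s %s\n",
                        kinds[stdc_kind], values[stdc_value]);
    }
    if (sub_kind == 21)
        return -1;                              /* line directive path, not modeled here */
    if (!text)
        return -1;                              /* NULL pragma_text assertion */
    return snprintf(out, cap, "#pragma %s\n", text);   /* generic pragma, including 19 */
}
```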
Linkage Specification
The variable byte_10656F0 tracks the current linkage specification:
| Value | Meaning |
|---|---|
| 2 | extern "C" linkage |
| 3 | extern "C++" linkage |
Set at initialization: byte_10656F0 = (dword_126EFB4 != 2) + 2. The comparison contributes 0 when dword_126EFB4 == 2 and 1 otherwise, so the initial value is 2 when dword_126EFB4 == 2 and 3 in every other mode. This controls how the backend wraps declarations that need explicit linkage changes.
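Evaluating the initialization expression literally makes the two possible outcomes explicit. The enum and function names below are hypothetical; the numeric values come from the linkage table and the expression itself.

```c
enum { LINKAGE_C = 2, LINKAGE_CXX = 3 };   /* values from the linkage table */

/* Mirrors byte_10656F0 = (dword_126EFB4 != 2) + 2. */
static unsigned char initial_linkage(int language_mode) {
    /* In C, a comparison yields exactly 0 or 1, so the result is 2 or 3. */
    return (unsigned char)((language_mode != 2) + 2);
}
```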
Phase 5: Empty File Guard
After the main loop completes, the function checks whether any entities were actually emitted:
if (!v12 && dword_126EFB4 != 2) {
sub_467E50("int __dummy_to_avoid_empty_file;");
sub_467D60(); // newline
}
The variable v12 tracks whether sub_47ECC0 was called at least once (set to 1 when any non-pragma entity is processed). If no entities were processed AND the mode is not CUDA (dword_126EFB4 != 2), a dummy variable declaration is emitted to prevent the host compiler from rejecting an empty translation unit. In CUDA mode, the file always has content due to the managed runtime boilerplate.
Phase 6: File Trailer
After all entities and the empty-file guard, the function emits a structured trailer. Before the trailer is written, the call to sub_466C10 (listed under Phase 5 in the execution flow) performs scope stack unwinding -- it pops any remaining scope entries, restoring entity attributes that were temporarily modified during code generation (specifically, bits in bytes +82 and +134 of entity nodes).
#line Reset
Two #line 1 "<original_file>" directives bracket the trailer, resetting the host compiler's notion of the current source location back to the original .cu file:
sub_46BC80("#");
if (!dword_126E1F8) // not GNU mode: use long form
sub_467E50("line");
sub_467E50(" 1 \"");
filename = sub_5AF450(qword_106BF88); // get original filename
sub_467E50(filename);
sub_468150(34); // closing quote '"'
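The same sequence of emit calls can be expressed as a single format string. In this sketch, format_line_reset and its short_form parameter are hypothetical; short_form plays the role of the dword_126E1F8 test, and the filename is inserted without any escaping, which may differ from what sub_5AF450 does.

```c
#include <stdio.h>

/* Produces #line 1 "file" or, in short form, # 1 "file". */
static int format_line_reset(char *out, size_t cap, int short_form, const char *file) {
    return snprintf(out, cap, "#%s 1 \"%s\"", short_form ? "" : "line", file);
}
```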
_NV_ANON_NAMESPACE Macro
The anonymous namespace support macro is emitted:
#define _NV_ANON_NAMESPACE <unique_id>
The unique identifier is generated by sub_6BC7E0 (get_anonymous_namespace_name), which returns "_GLOBAL__N_<filename>" -- a mangled name that ensures anonymous namespace entities from different translation units do not collide in the final linked binary.
This is followed by a guard block:
#ifdef _NV_ANON_NAMESPACE
#endif
The #ifdef/#endif block appears to be a deliberate no-op that downstream tools (nvcc's driver) can detect to confirm the file was processed by cudafe++.
MSVC Pack Reset
In MSVC host compiler mode (dword_126E1D8), a #pragma pack() is emitted to reset the packing alignment to the compiler default:
if (dword_126E1D8) {
sub_46BC80("#pragma pack()");
sub_467D60();
}
Source Re-inclusion
The original source file is re-included via #include:
#include "<original_file>"
This is the mechanism by which the host compiler sees the original source code: the .int.c file first declares all the generated stubs and boilerplate, then #includes the original file. The EDG frontend has already parsed the original file and knows which declarations are host-visible; the re-inclusion lets the host compiler process them with the stubs already in scope.
A final #line 1 directive follows, and then:
#undef _NV_ANON_NAMESPACE
This removes the macro once the re-included source has been processed, so it is not left defined at the end of the translation unit.
Phase 7: Host Reference Arrays
The final emission step generates CUDA host reference arrays via sub_6BCF80 (nv_emit_host_reference_array). These arrays are placed in special ELF sections that the CUDA runtime linker uses to discover device symbols at launch time.
The function is called 6 times with different flag combinations:
// Signature: nv_emit_host_reference_array(emit_fn, is_kernel, is_device, is_internal)
sub_6BCF80(sub_467E50, 1, 0, 1); // kernel, internal -> .nvHRKI
sub_6BCF80(sub_467E50, 1, 0, 0); // kernel, external -> .nvHRKE
sub_6BCF80(sub_467E50, 0, 1, 1); // device, internal -> .nvHRDI
sub_6BCF80(sub_467E50, 0, 1, 0); // device, external -> .nvHRDE
sub_6BCF80(sub_467E50, 0, 0, 1); // constant, internal -> .nvHRCI
sub_6BCF80(sub_467E50, 0, 0, 0); // constant, external -> .nvHRCE
| Section | Array Name | Symbol Type | Linkage |
|---|---|---|---|
.nvHRKI | hostRefKernelArrayInternalLinkage | __global__ kernel | Internal (anonymous namespace) |
.nvHRKE | hostRefKernelArrayExternalLinkage | __global__ kernel | External |
.nvHRDI | hostRefDeviceArrayInternalLinkage | __device__ variable | Internal |
.nvHRDE | hostRefDeviceArrayExternalLinkage | __device__ variable | External |
.nvHRCI | hostRefConstantArrayInternalLinkage | __constant__ variable | Internal |
.nvHRCE | hostRefConstantArrayExternalLinkage | __constant__ variable | External |
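The section names follow a regular pattern that can be derived from the three flags. The helper below is a hypothetical illustration of that mapping, not the logic inside sub_6BCF80, which may select the names differently.

```c
#include <stdio.h>

/* Builds ".nvHR" + {K,D,C} + {I,E} from the three flags (7 characters plus NUL). */
static void host_ref_section(char out[8], int is_kernel, int is_device, int is_internal) {
    char kind = is_kernel ? 'K' : (is_device ? 'D' : 'C');
    snprintf(out, 8, ".nvHR%c%c", kind, is_internal ? 'I' : 'E');
}
```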
Each array entry encodes a device symbol's mangled name as a byte array:
extern "C" {
extern __attribute__((section(".nvHRKE")))
__attribute__((weak))
const unsigned char hostRefKernelArrayExternalLinkage[] = {
0x5f, 0x5a, ... /* mangled name bytes */ 0x00
};
}
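The byte-array initializer can be produced by formatting each character of the mangled name, including the terminating NUL, as a hex literal. This is a sketch only: format_name_bytes is a hypothetical name, and the exact separator and digit case used by the real emitter are assumptions.

```c
#include <stdio.h>

/* Writes "0x5f,0x5a,...,0x00" for a NUL-terminated name.
   Returns the number of characters written, or -1 if the buffer is too small. */
static int format_name_bytes(char *out, size_t cap, const char *name) {
    size_t used = 0;
    for (const unsigned char *p = (const unsigned char *)name; ; p++) {
        int n = snprintf(out + used, cap - used, "%s0x%02x",
                         used ? "," : "", (unsigned)*p);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;                 /* output would be truncated */
        used += (size_t)n;
        if (*p == 0)
            break;                     /* the terminator has been emitted */
    }
    return (int)used;
}
```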
The 6 global lists from which these symbols are collected reside at:
| Address | Contents |
|---|---|
unk_1286780 | Device-external symbols |
unk_12867C0 | Device-internal symbols |
unk_1286800 | Constant-external symbols |
unk_1286840 | Constant-internal symbols |
unk_1286880 | Kernel-external symbols |
unk_12868C0 | Kernel-internal symbols |
This phase is conditional: it only executes when dword_106BFD0 (CUDA device registration) or dword_106BFCC (CUDA constant registration) is nonzero.
Module ID Output
Before the host reference arrays, if dword_106BFB8 is set, sub_5B0180 (write_module_id_to_file) writes the CRC32-based module identifier to a separate file. This ID is used by the CUDA runtime to match device code fatbinaries with their host-side registration code.
Breakpoint Placeholders (Between Phase 5 and Phase 6)
After the empty file guard and scope unwinding (sub_466C10) but before the file trailer, if the breakpoint placeholder list (qword_1065840) is non-empty, the backend emits debug breakpoint functions:
static __attribute__((used)) void __nv_breakpoint_placeholder<N>_<name>(void) {
exit(0);
}
The placeholder list is a linked list where each node contains:
| Offset | Field |
|---|---|
| +0 | next pointer |
| +8 | Source position (start) |
| +16 | Source position (end) |
| +24 | Name string (or NULL) |
Each placeholder is numbered sequentially (starting from 0). The __attribute__((used)) tells the host compiler to emit each function even though it is static and never referenced, so the symbol is not discarded as unused. The exit(0) body ensures the function has a concrete implementation that a debugger can set a breakpoint on. The underscore after the number separates the numeric index from the name.
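The node layout and naming scheme above can be sketched in C. The field names are hypothetical (only the offsets come from the table), the struct assumes an LP64 target like the x86-64 binary, and rendering a NULL name as an empty suffix is an assumption, since the text does not say what the backend prints in that case:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Field names are illustrative; offsets follow the documented layout. */
struct bp_placeholder {
    struct bp_placeholder *next;  /* +0  next pointer           */
    uint64_t pos_start;           /* +8  source position, start */
    uint64_t pos_end;             /* +16 source position, end   */
    const char *name;             /* +24 name string, or NULL   */
};

/* Formats "__nv_breakpoint_placeholder<N>_<name>" into buf.
   Returns the snprintf result (length that was or would be written). */
static int format_placeholder_name(unsigned index, const char *name,
                                   char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);
    /* Assumption: a NULL name becomes an empty suffix. */
    return snprintf(buf, cap, "__nv_breakpoint_placeholder%u_%s",
                    index, name ? name : "");
}
```

With index 0 and name `name`, this produces the identifier shown in the placeholder function signature pattern above.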
Complete .int.c File Structure
Putting all phases together, the output .int.c file has this structure:
#line 1 "<input>.cu" // initial line directive
#pragma GCC diagnostic ignored "-Wunused-local-typedefs"
#pragma GCC diagnostic ignored "-Wattributes"
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-function"
// ... additional suppressions for Clang
// --- managed runtime boilerplate ---
static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) { ... }
static char __nv_init_managed_rt_with_module(void **);
static inline void __nv_init_managed_rt(void) { ... }
#pragma GCC diagnostic pop
#pragma GCC diagnostic ignored "-Wunused-variable"
// --- lambda detection macros (when not in extended lambda mode) ---
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
#if defined(...) && defined(...) && defined(...)
#endif
// --- main entity output ---
// [user declarations, type definitions, function stubs, etc.]
// [device-only code wrapped in #if 0 / #endif]
// [__global__ kernels -> __wrapper__device_stub_ forwarding]
// [pragmas interleaved at original positions]
// --- empty file guard (non-CUDA mode only) ---
int __dummy_to_avoid_empty_file;
// --- breakpoint placeholders (if any) ---
static __attribute__((used)) void __nv_breakpoint_placeholder0_name(void) { exit(0); }
// --- file trailer ---
#line 1 "<input>.cu"
#define _NV_ANON_NAMESPACE _GLOBAL__N_<input>
#ifdef _NV_ANON_NAMESPACE
#endif
#pragma pack() // MSVC only
#line 1 "<input>.cu"
#include "<input>.cu" // re-include original source
#line 1 "<input>.cu"
#undef _NV_ANON_NAMESPACE
// --- host reference arrays (if CUDA registration active) ---
extern "C" { extern __attribute__((section(".nvHRKI"))) ... }
extern "C" { extern __attribute__((section(".nvHRKE"))) ... }
extern "C" { extern __attribute__((section(".nvHRDI"))) ... }
extern "C" { extern __attribute__((section(".nvHRDE"))) ... }
extern "C" { extern __attribute__((section(".nvHRCI"))) ... }
extern "C" { extern __attribute__((section(".nvHRCE"))) ... }
Key Global Variables
| Variable | Address | Type | Role |
|---|---|---|---|
| stream | output state | FILE* | Output file handle |
| dword_1065834 | 0x1065834 | int | Indent/nesting level |
| dword_1065820 | 0x1065820 | int | Output line counter |
| dword_106581C | 0x106581C | int | Output column counter |
| dword_1065818 | 0x1065818 | int | Needs-line-directive flag |
| qword_1065810 | 0x1065810 | qword | Current source position |
| qword_1065828 | 0x1065828 | qword | Current source file index |
| qword_1065748 | 0x1065748 | qword | Source sequence cursor (primary) |
| qword_1065740 | 0x1065740 | qword | Source sequence cursor (alternate) |
| dword_1065850 | 0x1065850 | int | Device stub mode toggle |
| byte_10656F0 | 0x10656F0 | byte | Current linkage spec (2=C, 3=C++) |
| dword_10656F8 | 0x10656F8 | int | Current #pragma pack state |
| qword_1065708 | 0x1065708 | qword | Scope stack head |
| qword_1065700 | 0x1065700 | qword | Scope pool head |
| qword_1065720 | 0x1065720 | qword | Scope free list |
| dword_106BF38 | 0x106BF38 | int | Extended lambda mode |
| dword_106BFB8 | 0x106BFB8 | int | Emit module ID flag |
| dword_106BFD0 | 0x106BFD0 | int | CUDA device registration flag |
| dword_106BFCC | 0x106BFCC | int | CUDA constant registration flag |
| dword_106BF6C | 0x106BF6C | int | Alternative host compiler mode |
| dword_126EFB4 | 0x126EFB4 | int | Compiler mode (2 = CUDA) |
| dword_126E1D8 | 0x126E1D8 | int | MSVC host compiler flag |
| dword_126E1F8 | 0x126E1F8 | int | GNU/GCC host compiler flag |
| dword_126E1E8 | 0x126E1E8 | int | Clang host compiler flag |
| qword_126E1F0 | 0x126E1F0 | qword | GCC/Clang version number |
| dword_126EF68 | 0x126EF68 | int | C++ standard version (__cplusplus) |
Cross-References
- Pipeline Overview -- where stage 7 fits in the full compilation flow
- Frontend Wrapup -- stage 6, produces the finalized IL that the backend consumes
- .int.c File Format -- detailed structure of the backend output file
- Managed Memory Boilerplate -- the __nv_managed_rt initialization pattern
- Host Reference Arrays -- .nvHRKI/.nvHRDE section format
- Module ID -- CRC32 module identification
- Device/Host Separation -- how the backend filters device vs host code
- Kernel Stub Generation -- __wrapper__device_stub_ pattern in gen_routine_decl
- Extended Lambda Overview -- lambda wrapper generation
- Lambda Preamble Injection -- sub_6BCC20 emission in gen_template
Timing & Exit
The timing and exit subsystem lives in host_envir.c and handles three responsibilities: measuring CPU and wall-clock time for compilation phases, formatting the compilation summary (error/warning counts), and mapping internal status codes to process exit codes. All functions write to qword_126EDF0 (the diagnostic output stream, initialized to stderr in main()).
Key Facts
| Property | Value |
|---|---|
| Source file | host_envir.c (EDG 6.6) |
| Timing functions | sub_5AF350 (capture_time), sub_5AF390 (report_timing) |
| Exit function | sub_5AF1D0 (exit_with_status), 145 bytes, __noreturn |
| Signoff function | sub_5AEE00 (write_signoff), sub_589530 (write_signoff + free_mem_blocks) |
| Timing enable flag | dword_106C0A4 at 0x106C0A4, set by CLI flag --timing (case 20) |
| Diagnostic stream | qword_126EDF0 at 0x126EDF0 (stderr) |
| SARIF mode flag | dword_106BBB8 at 0x106BBB8 |
Timing Infrastructure
capture_time -- sub_5AF350 (0x5AF350)
A 48-byte function that samples both CPU time and wall-clock time into a 16-byte timestamp structure.
// Annotated decompilation
void capture_time(timestamp_t *out) // sub_5AF350
{
out->cpu_ms = (int)((double)(int)clock() * 1000.0 / 1e6); // [0]: CPU milliseconds
out->wall_s = time(NULL); // [1]: wall-clock seconds
}
Timestamp structure layout (16 bytes, two 64-bit fields):
| Offset | Size | Type | Content |
|---|---|---|---|
| +0 | 8 | int64_t | CPU time in milliseconds: clock() * 1000 / CLOCKS_PER_SEC |
| +8 | 8 | time_t | Wall-clock time via time(0) (epoch seconds) |
The CPU time computation clock() * 1000.0 / 1000000.0 normalizes the clock() return value (microseconds on Linux where CLOCKS_PER_SEC = 1000000) to milliseconds, then truncates to integer. This means CPU time resolution is 1 ms.
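A portable sketch of the same sampling logic, assuming the two-field layout from the table above. The name `capture_time_sketch` is illustrative, and `CLOCKS_PER_SEC` replaces the literal 1e6 from the decompilation so the arithmetic stays correct on platforms with a different clock rate:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef struct {
    int64_t cpu_ms;  /* +0: CPU time in milliseconds           */
    time_t  wall_s;  /* +8: wall-clock seconds since the epoch */
} timestamp_t;

/* Samples CPU time (truncated to whole milliseconds) and wall-clock time. */
static void capture_time_sketch(timestamp_t *out)
{
    assert(out != NULL);
    out->cpu_ms = (int64_t)((double)clock() * 1000.0 / (double)CLOCKS_PER_SEC);
    out->wall_s = time(NULL);
}
```

Because the CPU field is truncated to an integer, two samples taken less than a millisecond of CPU time apart can compare equal.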
report_timing -- sub_5AF390 (0x5AF390)
Computes deltas between two timestamps and prints a formatted timing line.
// Annotated decompilation
void report_timing(const char *label, // sub_5AF390
timestamp_t *start,
timestamp_t *end)
{
double elapsed = difftime(end->wall_s, start->wall_s); // wall seconds
double cpu_sec = (double)(end->cpu_ms - start->cpu_ms) / 1000.0; // CPU seconds
fprintf(qword_126EDF0,
"%-30s %10.2f (CPU) %10.2f (elapsed)\n",
label, cpu_sec, elapsed);
}
The decompiled code contains explicit unsigned-to-double conversion handling for 64-bit values (the v6 & 1 | (v6 >> 1) pattern followed by doubling). This is the compiler's standard idiom for converting unsigned 64-bit integers to double on x86-64 when the value might exceed INT64_MAX. In practice, clock() millisecond values fit comfortably in signed 64-bit range, so this path is never taken.
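For readers unfamiliar with the idiom, here is a minimal C rendering of what the compiler-generated sequence computes. It is a sketch of the general technique, not a transcription of the decompiled function, and the name `u64_to_double_sketch` is made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Converts an unsigned 64-bit value to double using only a signed
   conversion, mirroring the (v & 1) | (v >> 1) pattern. */
static double u64_to_double_sketch(uint64_t v)
{
    if (v <= (uint64_t)INT64_MAX)
        return (double)(int64_t)v;        /* common path: value fits in int64 */
    uint64_t half = (v >> 1) | (v & 1);   /* halve, keeping a sticky low bit  */
    assert(half <= (uint64_t)INT64_MAX);  /* halving always fits in int64     */
    double d = (double)(int64_t)half;
    return d + d;                         /* scale back up by two             */
}
```

The sticky low bit keeps the final rounding identical to a direct unsigned conversion.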
Output format: "%-30s %10.2f (CPU) %10.2f (elapsed)\n"
Front end time                      12.34 (CPU)      15.67 (elapsed)
Back end time                        3.45 (CPU)       4.56 (elapsed)
Total compilation time              15.79 (CPU)      20.23 (elapsed)
The label is left-justified in a 30-character field. CPU and elapsed times are right-justified in 10-character fields with 2 decimal places.
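The field widths can be checked with a small formatter that reuses the documented format string. This is a sketch that writes into a caller-supplied buffer rather than to the diagnostic stream; `format_timing_line` and its parameters are hypothetical names:

```c
#include <assert.h>
#include <stdio.h>

/* Formats one timing line: cpu_ms_delta is the difference of two
   millisecond CPU samples, elapsed_s the wall-clock delta in seconds. */
static int format_timing_line(char *buf, size_t cap, const char *label,
                              long cpu_ms_delta, double elapsed_s)
{
    assert(buf != NULL && label != NULL);
    return snprintf(buf, cap, "%-30s %10.2f (CPU) %10.2f (elapsed)\n",
                    label, (double)cpu_ms_delta / 1000.0, elapsed_s);
}
```

For any label of at most 30 characters and values that fit in their 10-character fields, every line has the same length, which is what keeps the three report lines column-aligned.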
Timing Flag Activation
The timing flag dword_106C0A4 is registered in the CLI flag table as flag ID 20:
// In sub_452010 (register_internal_flags)
sub_451F80(20, "timing", 35, 0, 0, 1);
// ^id ^name ^case ^ ^ ^undocumented
When --timing is passed on the command line, the CLI parser (sub_459630) hits case 20 in its switch statement, which sets dword_106C0A4 = 1. The flag defaults to 0 (disabled), set explicitly in sub_45EB40 (cmd_line_pre_init).
Timing Brackets in main()
main() at 0x408950 allocates six 16-byte timestamp slots on its stack frame:
| Variable | Stack offset | Purpose |
|---|---|---|
| v7 | [rsp+0x00] | Total compilation start |
| v8 | [rsp+0x10] | Frontend start |
| v9 | [rsp+0x20] | Frontend end |
| v10 | [rsp+0x30] | Backend start |
| v11 | [rsp+0x40] | Backend end |
| v12 | [rsp+0x50] | Total compilation end |
Three timing regions are measured:
Region 1: Frontend
Captured after sub_585DB0 (fe_one_time_init) and reported after sub_588F90 (fe_wrapup). Covers stages 3-6 of the pipeline: heavy initialization, TU state reset, source parsing + IL build, and the 5-pass wrapup.
if (dword_106C0A4)
capture_time(&t_fe_start); // v8
reset_tu_state(); // sub_7A4860
process_translation_unit(filename); // sub_7A40A0
fe_wrapup(filename, 1); // sub_588F90
if (dword_106C0A4) {
capture_time(&t_fe_end); // v9
report_timing("Front end time", &t_fe_start, &t_fe_end);
}
Region 2: Backend
Captured around sub_489000 (process_file_scope_entities). Only executed when dword_106C254 == 0 (no frontend errors).
if (!dword_106C254) {
if (dword_106C0A4)
capture_time(&t_be_start); // v10
process_file_scope_entities(); // sub_489000
if (dword_106C0A4) {
capture_time(&t_be_end); // v11
report_timing("Back end time", &t_be_start, &t_be_end);
}
}
Region 3: Total
Starts before CLI parsing (sub_459630) and ends just before exit. Always uses v7 (captured once at the very beginning) as the start timestamp.
capture_time(&t_total_start); // v7 — captured before CLI parsing
// ... entire compilation ...
if (dword_106C0A4) {
capture_time(&t_total_end); // v12
report_timing("Total compilation time", &t_total_start, &t_total_end);
}
Note that the "Total compilation time" region begins before command-line parsing and includes the CLI parsing overhead, all initialization, frontend, backend, and signoff. The "Front end time" region does NOT include CLI parsing or pre-init -- it starts after fe_one_time_init.
Compilation Summary -- write_signoff
sub_5AEE00 (0x5AEE00) -- write_signoff
This 490-byte function writes the compilation summary trailer to the diagnostic stream. It has two completely separate code paths: SARIF mode and text mode.
SARIF Mode (dword_106BBB8 == 1)
Closes the SARIF JSON document started by sub_5AEDB0 (write_init):
fwrite("]}]}\n", 1, 5, qword_126EDF0);
This closes the results array, the run object, the runs array, and the top-level SARIF document. If dword_106BBB8 is set but not equal to 1, the function hits an assertion: write_signoff at host_envir.c:2203.
Text Mode (dword_106BBB8 == 0)
The text-mode path assembles a human-readable summary from four counters:
| Global | Address | Meaning |
|---|---|---|
| qword_126ED90 | 0x126ED90 | Error count |
| qword_126ED98 | 0x126ED98 | Warning count |
| qword_126EDB0 | 0x126EDB0 | Suppressed error count |
| qword_126EDB8 | 0x126EDB8 | Suppressed warning count |
The function uses EDG's message catalog (sub_4F2D60) for all translatable strings:
| Message ID | Purpose | Likely content |
|---|---|---|
| 1742 | Error (singular) | "error" |
| 1743 | Errors (plural) | "errors" |
| 1744 | Warning (singular) | "warning" |
| 1745 | Warnings (plural) | "warnings" |
| 1746 | Conjunction | "and" |
| 1747 | Source file indicator (format) | "in compilation of \"%s\"" |
| 1748 | Generated indicator | "generated" |
| 3234 | Suppressed intro | "of which" |
| 3235 | Suppressed verb | "were suppressed" / "was suppressed" |
Output assembly logic (simplified pseudocode):
void write_text_signoff(void) // text-mode path of sub_5AEE00
{
int64_t errors = qword_126ED90;
int64_t warnings = qword_126ED98;
int64_t total = errors + warnings;
// Debug: module declaration count (only if dword_126EFC8 + "module_report")
if (dword_126EFC8 && is_debug_enabled("module_report") && qword_106B9C8)
fprintf(stream, "%lu modules declarations processed (%lu failed).\n",
qword_106B9C8, qword_106B9C0);
if (total == 0)
return; // nothing to report
int64_t suppressed_warn = qword_126EDB8;
int64_t suppressed_total = suppressed_warn + qword_126EDB0;
int64_t displayed = total - suppressed_total;
// --- Print displayed counts ---
if (displayed != 0) { // there ARE unsuppressed diagnostics
if (errors)
fprintf(stream, "%lu %s", errors, msg(errors != 1 ? 1743 : 1742));
if (errors && warnings)
fprintf(stream, " %s ", msg(1746)); // " and "
if (warnings)
fprintf(stream, "%lu %s", warnings, msg(warnings != 1 ? 1745 : 1744));
}
// --- Print suppressed counts ---
if (suppressed_total > 0) {
// Assertion: suppressed_warn must be 0 if we reach here
// (i.e., only suppressed errors are expected; a nonzero suppressed-warning count triggers the assert)
if (suppressed_warn)
assert(0); // host_envir.c:2141, "write_text_signoff"
if (displayed) {
fprintf(stream, " (%s ", msg(3234)); // " (of which "
fprintf(stream, "%lu %s %s",
suppressed_total,
msg(3235), // "was/were suppressed"
msg(suppressed_total == 1 ? 1742 : 1743));
fputc(')', stream); // close paren
} else {
// All diagnostics were suppressed -- just print suppressed count
fprintf(stream, "%lu %s %s",
suppressed_total,
msg(3235),
msg(suppressed_total == 1 ? 1742 : 1743));
}
}
// --- Print source filename ---
fputc(' ', stream);
if (qword_126EEE0 && *qword_126EEE0 && strcmp(qword_126EEE0, "-") != 0) {
char *display_name = qword_106C040 ? qword_106C040 : qword_126EEE0;
char *basename = normalize_path(display_name) + 32; // sub_5AC020 returns
// buffer, basename at +32
fprintf(stream, msg(1747), basename); // "in compilation of \"%s\""
} else {
fputs(msg(1748), stream); // "generated" (stdout mode)
}
fputc('\n', stream);
}
Example output:
2 errors and 1 warning in compilation of "kernel.cu"
3 errors (of which 1 was suppressed error) in compilation of "main.cu"
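The unsuppressed part of the summary can be reproduced with a few lines of C. This sketch hard-codes the English strings from the "Likely content" column instead of calling the message catalog, omits the suppressed-count branch, and uses a hypothetical function name, so it illustrates the pluralization rule rather than the recovered function:

```c
#include <assert.h>
#include <stdio.h>

/* Writes e.g. "2 errors and 1 warning in compilation of \"kernel.cu\"\n".
   Returns the snprintf result, or 0 when there is nothing to report. */
static int format_signoff(char *buf, size_t cap, unsigned long errors,
                          unsigned long warnings, const char *source_name)
{
    assert(buf != NULL && cap > 0 && source_name != NULL);
    const char *e_word = errors != 1 ? "errors" : "error";
    const char *w_word = warnings != 1 ? "warnings" : "warning";

    if (errors && warnings)
        return snprintf(buf, cap, "%lu %s and %lu %s in compilation of \"%s\"\n",
                        errors, e_word, warnings, w_word, source_name);
    if (errors)
        return snprintf(buf, cap, "%lu %s in compilation of \"%s\"\n",
                        errors, e_word, source_name);
    if (warnings)
        return snprintf(buf, cap, "%lu %s in compilation of \"%s\"\n",
                        warnings, w_word, source_name);
    buf[0] = '\0';  /* total == 0: the real function returns without output */
    return 0;
}
```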
sub_589530 (0x589530) -- write_signoff + free_mem_blocks
A thin wrapper (13 bytes) called from main()'s exit path. Performs two operations:
void fe_finish(void) // sub_589530
{
write_signoff(); // sub_5AEE00 — print summary
free_mem_blocks(); // sub_6B8DE0 — release all frontend memory pools
}
sub_6B8DE0 (free_mem_blocks) is the master memory deallocation function from mem_manage.c (assertion at line 1438, function name "free_mem_blocks"). It operates in two modes depending on the global dword_1280728:
Pool allocator mode (dword_1280728 set): Walks three linked lists of allocated memory blocks:
- Current block at qword_1280720: freed first, looked up in the free-block hash chain at qword_1280748, then the block descriptor itself is freed.
- Hash table at qword_126EC88 with dword_126EC80 buckets: each bucket is a singly-linked list of block descriptors. Blocks with nonzero size (field [4]) are freed; blocks with zero size trigger the mem_manage.c:1438 assertion (invariant: a complete block must have a recorded size).
- Overflow list at qword_1280730: same walk-and-free logic.
Each block deallocation decrements qword_1280718 (total allocated bytes) and optionally updates qword_1280710 (low-water mark). At debug level > 4, each free prints: "free_complete_block: freeing block of size %lu\n".
Non-pool mode (dword_1280728 == 0): Iterates source file entries via sub_6B8B20(N) for each entry N from dword_126EC80 down to 0, then walks the permanent allocation array at qword_126EC58, calling sub_5B0500 for each (which wraps munmap or free).
Exit Handling
exit_with_status -- sub_5AF1D0 (0x5AF1D0)
A 145-byte __noreturn function that maps internal compilation status codes to POSIX exit codes. This is the only exit point for normal compilation flow -- every path through main() ends here.
// Full annotated decompilation
__noreturn void exit_with_status(uint8_t status) // sub_5AF1D0
{
// --- Text-mode messages (suppressed in SARIF mode) ---
if (!dword_106BBB8) { // not SARIF mode
if (status == 9 || status == 10) {
fwrite("Compilation terminated.\n", 1, 0x18, qword_126EDF0);
exit(4); // goto LABEL_8
}
if (status == 11) {
fwrite("Compilation aborted.\n", 1, 0x15, qword_126EDF0);
fflush(qword_126EDF0);
abort(); // goto LABEL_10
}
}
// --- Exit code mapping (both text and SARIF modes) ---
switch (status) {
case 3:
case 4:
case 5: exit(0); // success
case 8: exit(2); // main()'s default status, kept when the error count is nonzero
case 9:
case 10: exit(4); // errors (SARIF mode reaches here)
default: fflush(qword_126EDF0);
abort(); // internal error (11, or any unknown)
}
}
Status-to-exit-code mapping:
| Internal Status | Meaning | Text Output | Exit Code | Termination |
|---|---|---|---|---|
| 3 | Clean success (no warnings, no additional status) | (none) | 0 | exit(0) |
| 4 | Success variant | (none) | 0 | exit(0) |
| 5 | Success with additional status (qword_126ED88 != 0) | (none) | 0 | exit(0) |
| 8 | Default status from main(), kept when the error count qword_126ED90 is nonzero | (none) | 2 | exit(2) |
| 9 | Errors | "Compilation terminated.\n" | 4 | exit(4) |
| 10 | Errors (variant) | "Compilation terminated.\n" | 4 | exit(4) |
| 11 | Internal error / fatal | "Compilation aborted.\n" | (n/a) | abort() |
In SARIF mode (dword_106BBB8 != 0), the text messages "Compilation terminated." and "Compilation aborted." are suppressed. The exit codes remain the same -- the function falls through to the switch which dispatches identically.
The default case handles status 11 and any unexpected status value by calling abort() after flushing the diagnostic stream. abort() raises SIGABRT, which typically produces a core dump for debugging when core dumps are enabled.
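Stripped of its I/O, the mapping in the table is a pure function. The sketch below returns the exit code instead of terminating, with a made-up sentinel for the abort() path, so that the table can be checked in isolation:

```c
/* Hypothetical sentinel meaning "terminate via abort()", not an exit code. */
enum { TERMINATE_BY_ABORT = -1 };

/* Maps an internal status to the process exit code from the table. */
static int exit_code_for_status(unsigned char status)
{
    switch (status) {
    case 3: case 4: case 5: return 0;
    case 8:                 return 2;
    case 9: case 10:        return 4;
    default:                return TERMINATE_BY_ABORT;  /* 11 or unknown */
    }
}
```

The text-mode messages do not affect the result, which is why SARIF and text mode can share this one mapping.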
Control flow note
The code structure looks unusual because the decompiler linearizes a two-phase dispatch. First, text-mode messages are emitted for statuses 9/10 and 11 (with early exit(4) or abort() respectively). If SARIF mode is active OR status is not 9/10/11, execution falls through to the switch statement. This means statuses 9/10 reach exit(4) via two different paths depending on SARIF mode, but the exit code is always 4.
Exit Code Determination in main()
The exit code passed to sub_5AF1D0 is computed in main() based on two global counters:
// From main() at 0x408950
uint8_t exit_code = 8; // default, kept when qword_126ED90 != 0 (v6 = 8)
sub_6B8B20(0); // reset file state
sub_589530(); // write_signoff + free_mem_blocks
if (!qword_126ED90) // no errors?
exit_code = qword_126ED88 ? 5 : 3; // success codes
// ... timing, stack restore ...
exit_with_status(exit_code);
Decision tree:
qword_126ED90 != 0 (errors present)
└── exit_code = 8 → exit(2)
NOTE: This is counterintuitive. When the error count is nonzero,
main() keeps the default status 8, which exit_with_status maps
to exit(2), not exit(4). Statuses 9/10 (exit(4)) are never
produced on this path; in this section they are only passed by
other callers, e.g. the signal handler via
terminate_compilation(9).
qword_126ED90 == 0 (no errors)
├── qword_126ED88 != 0 → exit_code = 5 → exit(0) (success w/ status)
└── qword_126ED88 == 0 → exit_code = 3 → exit(0) (clean success)
The variable qword_126ED88 at 0x126ED88 is initialized to 0 in sub_4ED530 (declaration_pre_init) and sub_4ED7C0. It appears to track whether any notable conditions occurred during compilation that are not errors or warnings -- possibly informational remarks or specific compiler actions taken. When nonzero, the exit code changes from 3 to 5, but both map to exit(0).
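The decision tree reduces to a two-input function. A minimal sketch, with parameter names chosen for readability (the binary reads the two globals directly):

```c
#include <stdint.h>

/* Mirrors main(): status 8 is the default, replaced by 3 or 5 only
   when the error count is zero. */
static unsigned char status_for_counts(uint64_t error_count,
                                       uint64_t additional_status)
{
    unsigned char status = 8;
    if (error_count == 0)
        status = additional_status ? 5 : 3;
    return status;
}
```

Note that the warning count plays no role here: only qword_126ED90 and qword_126ED88 feed the status.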
Stack Limit Restoration
Before calling exit_with_status, main() restores the process stack limit if it was raised during initialization:
if (stack_was_raised) {
rlimits.rlim_cur = original_stack; // restore saved soft limit
setrlimit(RLIMIT_STACK, &rlimits);
}
The boolean stack_was_raised (stored in rbp, variable v4) is set during startup when dword_106C064 (the --modify_stack_limit flag, default ON) causes main() to raise RLIMIT_STACK from its soft limit to the hard limit. This restoration is a defensive measure -- it ensures any child processes spawned during cleanup (or signal handlers) inherit a normal stack size.
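The raise-then-restore pattern can be sketched with the POSIX resource-limit calls. The function names are hypothetical and the error handling is simplified; the sketch only shows the save/raise/restore shape described above:

```c
#include <sys/resource.h>

/* Raises the soft stack limit to the hard limit. On success stores the
   previous soft limit in *saved and returns 1; returns 0 otherwise. */
static int raise_stack_limit(rlim_t *saved)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == rl.rlim_max)
        return 0;                      /* query failed or nothing to raise */
    *saved = rl.rlim_cur;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_STACK, &rl) == 0;
}

/* Puts the saved soft limit back; the hard limit is left untouched. */
static int restore_stack_limit(rlim_t saved)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return 0;
    rl.rlim_cur = saved;
    return setrlimit(RLIMIT_STACK, &rl) == 0;
}
```

The return value of `raise_stack_limit` plays the role of the stack_was_raised boolean: restoration is attempted only when the raise actually happened.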
Signal-Driven Exit Paths
Three additional paths reach exit_with_status:
SIGINT / SIGTERM Handler -- handler (0x5AF2C0)
Registered in sub_5B1E70 (host_envir_early_init) for signals 2 (SIGINT) and 15 (SIGTERM). The registration is one-shot, guarded by dword_E6E120 (set to 0 after first call). SIGINT registration is conditional: the code first calls signal(SIGINT, SIG_IGN) and checks the return value. If the previous handler was already SIG_IGN (meaning the parent process -- typically nvcc -- has set the child to ignore interrupts), it stays ignored. Otherwise, the custom handler is installed. SIGTERM always gets the handler unconditionally.
__noreturn void handler(void) // 0x5AF2C0
{
fputc('\n', qword_126EDF0); // newline to stderr
terminate_compilation(9); // sub_5AF2B0
}
terminate_compilation -- sub_5AF2B0 (0x5AF2B0)
Bridge function: writes signoff then exits.
__noreturn void terminate_compilation(uint8_t status) // sub_5AF2B0
{
write_signoff(); // sub_5AEE00
exit_with_status(status); // sub_5AF1D0
}
When called from handler, status is 9 (errors), which produces "Compilation terminated.\n" followed by exit(4).
SIGXCPU Handler -- sub_5AF270 (0x5AF270)
Registered for signal 24 (SIGXCPU):
__noreturn void cpu_time_limit_handler(void) // sub_5AF270
{
fputc('\n', qword_126EDF0);
fwrite("Internal error: CPU time limit exceeded.\n", 1, 0x29, qword_126EDF0);
exit_with_status(11); // sub_5AF1D0 → abort()
}
This handler fires if the process receives SIGXCPU despite sub_5B1E70 having set RLIMIT_CPU to RLIM_INFINITY at startup. A SIGXCPU could still arrive if an external resource manager (e.g., batch scheduler) overrides the limit after initialization. Status 11 causes abort() with a core dump.
SIGXFSZ
Set to SIG_IGN in sub_5B1E70 (signal(25, SIG_IGN)). This prevents the process from being killed when writing a .int.c file that exceeds the process's file-size resource limit (RLIMIT_FSIZE); with the signal ignored, the offending write fails with an error instead. Without this, large compilation outputs could trigger an unhandled SIGXFSZ (25), whose default action terminates the process with a core dump.
SARIF Output Bookends
The SARIF JSON output is bracketed by two functions:
| Function | Address | When Called | Output |
|---|---|---|---|
| sub_5AEDB0 (write_init) | 0x5AEDB0 | During fe_init_part_1 (stage 3) | {"version":"2.1.0","$schema":"...","runs":[{"tool":{"driver":{"name":"EDG CPFE","version":"6.6",...}},"columnKind":"unicodeCodePoints","results":[ |
| sub_5AEE00 (write_signoff) | 0x5AEE00 | During sub_589530 (pre-exit) | ]}]} + newline |
The tool metadata identifies the frontend as "EDG CPFE" version "6.6" from "Edison Design Group", with fullName "Edison Design Group C/C++ Front End - 6.6" and informationUri "https://edg.com/c". The column kind is "unicodeCodePoints" (not byte offsets). Individual diagnostics are appended to the results array by the error subsystem between these two calls.
The write_init function (sub_5AEDB0) has the same assertion guard as write_signoff: if dword_106BBB8 is set but not equal to 1, it triggers an assertion at host_envir.c:2017 ("write_init"). Both assertions enforce the invariant that SARIF mode is exactly 0 or 1, never any other value.
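One way to convince yourself that `]}]}` is the right closer is to compute the unbalanced brackets of the opening prefix. The helper below is a hypothetical checker written for this page, not code from the binary; it tracks open brackets on a small stack and skips string contents:

```c
#include <stddef.h>

/* Writes the closing brackets that balance a JSON prefix into out.
   Returns the number of closers, or -1 on overflow or mismatch. */
static int json_closers(const char *prefix, char *out, size_t cap)
{
    char stack[64];
    size_t depth = 0;
    int in_string = 0;
    for (const char *p = prefix; *p; ++p) {
        if (in_string) {
            if (*p == '\\' && p[1]) ++p;         /* skip escaped character */
            else if (*p == '"') in_string = 0;
        } else if (*p == '"') {
            in_string = 1;
        } else if (*p == '{' || *p == '[') {
            if (depth == sizeof stack) return -1;
            stack[depth++] = (*p == '{') ? '}' : ']';
        } else if (*p == '}' || *p == ']') {
            if (depth == 0 || stack[depth - 1] != *p) return -1;
            --depth;
        }
    }
    if (depth + 1 > cap) return -1;              /* room for closers + NUL */
    size_t n = 0;
    while (depth > 0) out[n++] = stack[--depth]; /* innermost closes first */
    out[n] = '\0';
    return (int)n;
}
```

Fed an abbreviated version of the write_init prefix (the elided fields do not change the nesting), it returns the four characters that write_signoff emits before the newline.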
Profiling Init -- sub_5AF330 (0x5AF330)
A separate but related mechanism. During sub_585DB0 (fe_one_time_init), if dword_106BD4C is set, sub_5AF330 is called:
int profiling_init(void) // sub_5AF330
{
int was_initialized = dword_126F110;
if (!dword_126F110)
dword_126F110 = 1; // mark as initialized
return was_initialized; // 0 on first call, 1 on subsequent
}
This is a one-shot initializer for a profiling subsystem distinct from the --timing flag. The dword_106BD4C gate is set by a different CLI flag and controls a more granular, per-function profiling infrastructure (used by the EDG debug trace system, not the phase-level timing brackets). The dword_126F110 flag prevents double-initialization if fe_one_time_init is called more than once.
Signal Handler Registration Detail
The full signal setup in sub_5B1E70 (host_envir_early_init):
if (dword_E6E120) { // one-shot guard (starts nonzero)
if (signal(SIGINT, SIG_IGN) != SIG_IGN) // was SIGINT not already ignored?
signal(SIGINT, handler); // install interrupt handler
signal(SIGTERM, handler); // always install
signal(SIGXFSZ, SIG_IGN); // ignore file-size limit signals
signal(SIGXCPU, sub_5AF270); // CPU time limit → abort
dword_E6E120 = 0; // prevent re-registration
}
| Signal | Number | Handler | Behavior |
|---|---|---|---|
| SIGINT | 2 | handler (0x5AF2C0) | Conditional: only if not inherited as SIG_IGN. Writes newline, calls terminate_compilation(9). |
| SIGTERM | 15 | handler (0x5AF2C0) | Always installed. Same handler as SIGINT. |
| SIGXFSZ | 25 | SIG_IGN | Ignored. Prevents crash on large .int.c output. |
| SIGXCPU | 24 | sub_5AF270 (0x5AF270) | Prints "Internal error: CPU time limit exceeded.\n", then exit_with_status(11) (abort). |
After signal setup, sub_5B1E70 also disables the CPU time limit by setting RLIMIT_CPU soft limit to RLIM_INFINITY:
getrlimit(RLIMIT_CPU, &rlimits);
rlimits.rlim_cur = RLIM_INFINITY; // -1 = unlimited
setrlimit(RLIMIT_CPU, &rlimits);
This prevents normal compilations from hitting SIGXCPU. The handler at sub_5AF270 is a safety net for cases where an external resource manager re-imposes the limit after initialization.
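The conditional SIGINT installation is a common Unix idiom and can be sketched on its own. The handler here deliberately differs from the recovered one: it uses write() and _exit(), which are async-signal-safe, instead of the stdio and signoff calls described above. All names are illustrative:

```c
#include <signal.h>
#include <unistd.h>

/* Illustrative handler: prints a message and exits with code 4. */
static void on_interrupt(int sig)
{
    static const char msg[] = "\nCompilation terminated.\n";
    (void)sig;
    if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
        /* nothing useful can be done about a failed write here */
    }
    _exit(4);
}

/* One-shot registration, mirroring the guard that starts nonzero. */
static void install_handlers_once(void)
{
    static int first_call = 1;
    if (!first_call)
        return;
    first_call = 0;
    if (signal(SIGINT, SIG_IGN) != SIG_IGN)  /* keep an inherited SIG_IGN */
        signal(SIGINT, on_interrupt);
    signal(SIGTERM, on_interrupt);           /* always installed */
    signal(SIGXFSZ, SIG_IGN);                /* ignore file-size signals */
}
```

Probing with `signal(SIGINT, SIG_IGN)` briefly ignores the signal even when a handler is about to be installed; the sketch accepts that small window, as does the sequence shown above.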
Complete Exit Sequence
The full sequence from compilation completion to process termination:
1. sub_6B8B20(0) Reset source file manager state
2. sub_589530() Write signoff + free memory
├── sub_5AEE00() Print error/warning summary (or close SARIF JSON)
└── sub_6B8DE0() Free all frontend memory pools
3. Compute exit_code Based on qword_126ED90, qword_126ED88
4. [If timing enabled]
├── sub_5AF350(v12) Capture total end timestamp
└── sub_5AF390(...) Print "Total compilation time"
5. [If stack was raised]
└── setrlimit(...) Restore original stack soft limit
6. sub_5AF1D0(exit_code) Map status → exit code, terminate
├── 3,4,5 → exit(0)
├── 8 → exit(2)
├── 9,10 → exit(4) + "Compilation terminated."
└── 11 → abort() + "Compilation aborted."
Global Variable Reference
| Variable | Address | Size | Role |
|---|---|---|---|
| dword_106C0A4 | 0x106C0A4 | 4 | Timing enable flag. CLI flag 20 (--timing). |
| dword_106BBB8 | 0x106BBB8 | 4 | SARIF output mode. 0=text, 1=SARIF JSON. |
| qword_126EDF0 | 0x126EDF0 | 8 | Diagnostic output FILE* (stderr). |
| qword_126ED90 | 0x126ED90 | 8 | Total error count. |
| qword_126ED98 | 0x126ED98 | 8 | Total warning count. |
| qword_126ED88 | 0x126ED88 | 8 | Additional status (nonzero changes exit code from 3 to 5). |
| qword_126EDB0 | 0x126EDB0 | 8 | Suppressed error count. |
| qword_126EDB8 | 0x126EDB8 | 8 | Suppressed warning count. |
| qword_126EEE0 | 0x126EEE0 | 8 | Output filename (for source display in signoff). |
| qword_106C040 | 0x106C040 | 8 | Display filename override (used if set, else falls back to qword_126EEE0). |
| dword_106C254 | 0x106C254 | 4 | Skip-backend flag. Set to 1 when errors detected after frontend. |
| dword_106C064 | 0x106C064 | 4 | Stack limit adjustment flag (--modify_stack_limit, default ON). |
| dword_E6E120 | 0xE6E120 | 4 | One-shot guard for signal handler registration in sub_5B1E70. |
| dword_126F110 | 0x126F110 | 4 | Profiling initialized flag. Set to 1 by sub_5AF330. |
| dword_106BD4C | 0x106BD4C | 4 | Profiling gate flag. When set, fe_one_time_init calls sub_5AF330. |
| qword_106B9C8 | 0x106B9C8 | 8 | Module declarations processed count (for debug module_report). |
| qword_106B9C0 | 0x106B9C0 | 8 | Module declarations failed count. |
| dword_1280728 | 0x1280728 | 4 | Memory manager mode flag. Controls pool vs non-pool deallocation in sub_6B8DE0. |
Cross-References
- Pipeline Overview -- placement of timing/exit in the 8-stage pipeline
- Entry Point & Initialization -- main() structure, signal handler registration, stack limit setup
- CLI Processing -- flag 20 (--timing) registration and parsing
- Backend Code Generation -- the "Back end time" measurement target
- SARIF & Pragma Control -- SARIF JSON format details
- Diagnostic Overview -- error/warning counting infrastructure
Execution Spaces
Every CUDA function lives in one or more execution spaces that govern where the function can run (host CPU, device GPU, or both) and what it can call. cudafe++ encodes execution space as a single-byte bitfield at offset +182 of the entity (routine) node. This byte is the most frequently tested field in CUDA-specific code paths -- it drives attribute application, redeclaration compatibility, virtual override checking, call-graph validation, IL marking, and code generation selection. Understanding this byte is prerequisite to understanding nearly every CUDA-specific subsystem in cudafe++.
The three CUDA execution-space keywords (__host__, __device__, __global__) are parsed as EDG attributes with internal kind codes 'V' (86), 'W' (87), and 'X' (88) respectively. The attribute dispatch table in apply_one_attribute (sub_413240) routes each kind to a dedicated handler that validates constraints and sets the bitfield. Functions without any explicit annotation default to __host__.
Key Facts
| Property | Value |
|---|---|
| Source file | attribute.c (handlers), class_decl.c (redecl/override), nv_transforms.h (inline predicates) |
| Bitfield location | Entity node byte at offset +182 |
| __global__ handler | sub_40E1F0 / sub_40E7F0 (apply_nv_global_attr, two variants) |
| __device__ handler | sub_40EB80 (apply_nv_device_attr) |
| __host__ handler | sub_4108E0 (apply_nv_host_attr) |
| Virtual override checker | sub_432280 (record_virtual_function_override) |
| Execution space mask table | dword_E7C760[] (indexed by space enum) |
| Mask lookup | sub_6BCF60 (nv_check_execution_space_mask) |
| Annotation helper | sub_41A1F0 (validates HD annotations on types) |
| Relaxed mode flag | dword_106BFF0 (permits otherwise-illegal space combinations) |
| main() entity pointer | qword_126EB70 (compared during attribute application) |
The Execution Space Bitfield (Entity + 182)
Byte offset +182 within a routine entity node encodes the execution space as a bitfield. Individual bits carry distinct meanings:
Byte at entity+182:
bit 0 (0x01) device_capable Function can execute on device
bit 1 (0x02) device_explicit __device__ was explicitly written
bit 2 (0x04) host_capable Function can execute on host
bit 3 (0x08) (reserved)
bit 4 (0x10) host_explicit __host__ was explicitly written
bit 5 (0x20) device_annotation Secondary device flag (used in HD detection)
bit 6 (0x40) global_kernel Function is a __global__ kernel
bit 7 (0x80) hd_combined Combined __host__ __device__ flag
Combined Patterns
The attribute handlers do not set individual bits -- they OR entire patterns into the byte. Each CUDA keyword produces a characteristic bitmask:
| Keyword | OR mask | Resulting byte | Bit breakdown |
|---|---|---|---|
| __global__ | 0x61 | 0xE1 | device_capable + device_annotation + global_kernel + bit 7 (always set) |
| __device__ | 0x23 | 0x23 | device_capable + device_explicit + device_annotation |
| __host__ | 0x15 | 0x15 | device_capable + host_capable + host_explicit |
| __host__ __device__ | 0x23 \| 0x15 | 0x37 | device_capable + device_explicit + host_capable + host_explicit + device_annotation |
| (no annotation) | none | 0x00 | Implicit __host__ -- bits remain zero |
The 0x80 bit is set unconditionally by the __global__ handler. After the |= 0x61 operation (which sets bit 6), the handler reads the byte back and checks (byte & 0x40) != 0. Since bit 6 was just set, this is always true, so |= 0x80 always executes. Despite the field name hd_combined in some tooling, the bit functions as a "has global annotation" marker in practice.
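The masks and the always-taken 0x80 step can be written out as a small C model. The enumerator and function names are hypothetical labels for the documented bits; only the numeric values come from the tables above:

```c
#include <stdint.h>

enum {
    ES_DEVICE_CAPABLE    = 0x01,
    ES_DEVICE_EXPLICIT   = 0x02,
    ES_HOST_CAPABLE      = 0x04,
    ES_HOST_EXPLICIT     = 0x10,
    ES_DEVICE_ANNOTATION = 0x20,
    ES_GLOBAL_KERNEL     = 0x40,
    ES_BIT7              = 0x80
};

/* Each keyword ORs a whole pattern into the byte at entity+182. */
static uint8_t apply_device(uint8_t b) { return (uint8_t)(b | 0x23); }
static uint8_t apply_host(uint8_t b)   { return (uint8_t)(b | 0x15); }

static uint8_t apply_global(uint8_t b)
{
    b = (uint8_t)(b | 0x61);
    if (b & ES_GLOBAL_KERNEL)          /* always true after the OR above */
        b = (uint8_t)(b | ES_BIT7);
    return b;
}
```

Applying `apply_device` and `apply_host` in either order gives 0x37, since OR is commutative.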
Why device_capable (bit 0) Appears in __host__
The __host__ mask 0x15 includes bit 0 (device_capable). This is not an error. Bit 0 acts as a "has execution space annotation" marker rather than a strict "runs on device" flag. The actual device-only vs host-only distinction is determined by the two-bit extraction at bits 4-5 (the 0x30 mask), described below.
Execution Space Classification (0x30 Mask)
The critical two-bit extraction byte & 0x30 classifies a routine into one of four categories:
(byte & 0x30):
0x00 -> no explicit annotation (implicit __host__)
0x10 -> __host__ only
0x20 -> __device__ only
0x30 -> __host__ __device__
This extraction is the basis of nv_is_device_only_routine, an inline predicate defined in nv_transforms.h (line 367). The full check from the decompiled binary is:
// nv_is_device_only_routine (inlined from nv_transforms.h:367)
// entity_sym: the symbol table entry for the routine
// entity_sym+88 -> associated routine entity
__int64 entity = *(entity_sym + 88);
if (!entity)
internal_error("nv_transforms.h", 367, "nv_is_device_only_routine");
char byte = *(char*)(entity + 182);
bool is_device_only = ((byte & 0x30) == 0x20) && ((byte & 0x60) == 0x20);
The double-check (byte & 0x60) == 0x20 ensures the function is device-only and NOT a __global__ kernel (which would have bit 6 set, making byte & 0x60 == 0x60). This predicate is used in:
- check_void_return_okay (sub_719D20): suppress missing-return warnings for device-only functions
- record_virtual_function_override (sub_432280): drive virtual override execution space propagation
- Cross-space call validation: determine whether a call crosses execution space boundaries
- IL keep-in-il marking: identify device-reachable code
The 0x60 Mask (Kernel vs Device)
A secondary extraction byte & 0x60 distinguishes kernels from plain device functions:
(byte & 0x60):
0x00 -> no device annotation
0x20 -> __device__ only (not a kernel)
0x40 -> __global__ only (should not occur in isolation)
0x60 -> __global__ (which implies __device__)
nv_is_device_only_routine Truth Table
The predicate is inlined from nv_transforms.h:367 and appears in multiple call sites. Its internal_error guard string "nv_is_device_only_routine" appears in sub_432280 at the source path EDG_6.6/src/nv_transforms.h. The complete truth table for all execution space combinations:
| Execution space | byte+182 | byte & 0x30 | byte & 0x60 | Result |
|---|---|---|---|---|
| (none, implicit __host__) | 0x00 | 0x00 | 0x00 | false |
| __host__ | 0x15 | 0x10 | 0x00 | false |
| __device__ | 0x23 | 0x20 | 0x20 | true |
| __host__ __device__ | 0x37 | 0x30 | 0x20 | false |
| __global__ | 0xE1 | 0x20 | 0x60 | false |
The __global__ case is the key distinction: byte & 0x30 yields 0x20 (same as __device__), but byte & 0x60 yields 0x60 (not 0x20), so the predicate correctly rejects kernels.
// Full pseudocode for nv_is_device_only_routine
// Inlined at every call site; not a standalone function in the binary.
//
// Input: sym -- a symbol table entry (not the entity itself)
// Output: true if the routine is __device__ only (not __host__, not __global__)
bool nv_is_device_only_routine(symbol *sym) {
entity *e = sym->entity; // sym + 88
if (!e)
internal_error("nv_transforms.h", 367, "nv_is_device_only_routine");
char byte = e->byte_182;
// First check: bits 4-5 == 0x20 -> has __device__, no __host__
// Second check: bits 5-6 == 0x20 -> has __device__, no __global__
return ((byte & 0x30) == 0x20) && ((byte & 0x60) == 0x20);
}
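The truth table can be checked mechanically. The following is a minimal sketch, assuming a hypothetical standalone helper `is_device_only` that takes the byte at entity+182 directly and omits the `internal_error` guard; the `ES_*` names are illustrative labels for the byte values listed in the keyword table, not identifiers from the binary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte values at entity+182 for each annotation (names are hypothetical). */
enum {
    ES_NONE   = 0x00,  /* no annotation, implicit __host__ */
    ES_HOST   = 0x15,
    ES_DEVICE = 0x23,
    ES_HD     = 0x37,  /* __host__ __device__ */
    ES_GLOBAL = 0xE1
};

/* Hypothetical standalone form of the inlined predicate.
 * uint8_t is used instead of char so 0xE1 is not sign-extended. */
static bool is_device_only(uint8_t es)
{
    /* bits 4-5 == 0x20: device bit set, host bit clear
     * bits 5-6 == 0x20: device bit set, kernel bit clear */
    return (es & 0x30) == 0x20 && (es & 0x60) == 0x20;
}
```

Only `ES_DEVICE` satisfies both mask tests; `ES_GLOBAL` passes the first test and fails the second, which is the case the second mask exists for.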
Complete Redeclaration Matrix
The matrix below documents every possible pair of (existing annotation, newly-applied annotation) and the result. Each cell is derived from the three attribute handler functions. "Relaxed" means the outcome changes when dword_106BFF0 is set.
| Existing \ Applying | __host__ | __device__ | __global__ |
|---|---|---|---|
| (none) 0x00 | 0x15 -- OK | 0x23 -- OK | 0xE1 -- OK |
| __host__ 0x15 | 0x15 -- idempotent | 0x37 -- OK (HD) | error 3481 (always: handler checks byte & 0x10 unconditionally) |
| __device__ 0x23 | 0x37 -- OK (HD) | 0x23 -- idempotent | error 3481 (relaxed: OK) |
| __global__ 0xE1 | error 3481 (always) | error 3481 (relaxed: OK) | 0xE1 -- idempotent |
| __host__ __device__ 0x37 | 0x37 -- idempotent | 0x37 -- idempotent | error 3481 (always: byte & 0x10 fires) |
The __global__ column always errors when the existing annotation includes __host__ (bit 4 = 0x10), because the __global__ handler's condition (v5 & 0x10) != 0 is not guarded by the relaxed-mode flag. The __device__ column errors on existing __global__ only when relaxed mode is off, because the __device__ handler guards its check with !dword_106BFF0.
Note that __global__'s byte value is 0xE1 (not 0x61) because the 0x80 bit is always set after __global__ is applied, as documented above.
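The matrix can be reproduced with a small model of the three handlers. This is a sketch under stated assumptions: `apply_result`, `apply_host`, `apply_device`, and `apply_global` are hypothetical names; only the error 3481 condition and the OR mask of each handler are modeled (all other validation is omitted); and the mask is ORed in even when a conflict is reported, because the handler pseudocode on this page shows no early return after error 3481.

```c
#include <stdbool.h>
#include <stdint.h>

/* Outcome of applying one attribute (hypothetical type): the updated
 * byte at entity+182 and whether error 3481 would have been reported. */
typedef struct {
    uint8_t byte;
    bool    conflict;
} apply_result;

static apply_result apply_host(uint8_t es)
{
    /* __host__ after __global__ conflicts regardless of relaxed mode. */
    apply_result r = { es, (es & 0x40) != 0 };
    r.byte |= 0x15;
    return r;
}

static apply_result apply_device(uint8_t es, bool relaxed)
{
    /* __device__ after __global__ conflicts unless relaxed mode is on. */
    apply_result r = { es, !relaxed && (es & 0x40) != 0 };
    r.byte |= 0x23;
    return r;
}

static apply_result apply_global(uint8_t es, bool relaxed)
{
    /* Left term: existing device-only (guarded by relaxed mode).
     * Right term: existing explicit __host__ (never guarded). */
    apply_result r = { es, (!relaxed && (es & 0x60) == 0x20) || (es & 0x10) != 0 };
    r.byte |= 0x61;
    if (r.byte & 0x40)   /* always true after |= 0x61 */
        r.byte |= 0x80;
    return r;
}
```

For example, `apply_device(0x15, false)` yields byte 0x37 with no conflict, while `apply_global(0x15, true)` reports a conflict even in relaxed mode.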
Attribute Application Functions
apply_nv_global_attr (sub_40E1F0 / sub_40E7F0)
Two nearly identical entry points exist. Both apply __global__ to a function entity. The variant at sub_40E7F0 uses a do-while loop for parameter iteration instead of a for loop, but the validation logic is identical. Both variants may exist because EDG generates different code paths for attribute-on-declaration vs attribute-on-definition.
The function performs extensive validation before setting the bitmask:
// Pseudocode for apply_nv_global_attr (sub_40E1F0)
int64_t apply_nv_global_attr(attr_node *a1, entity *a2, char target_kind) {
if (target_kind != 11) // only applies to functions
return a2;
// Check constexpr lambda with wrong linkage
if ((a2->qword_184 & 0x800001000000) == 0x800000000000) {
char *name = get_entity_name(a2, 0);
error(3469, a1->source_loc, "__global__", name);
return a2;
}
// Static member check
if ((signed char)a2->byte_176 < 0 && !(a2->byte_81 & 0x04))
warning(3507, a1->source_loc, "__global__");
// operator() check
if (a2->byte_166 == 5)
error(3644, a1->source_loc);
// Return type must not be auto/decltype(auto); skip cv-qualifier wrappers first
type *ret = a2->return_type; // +144
while (ret->kind == 12) // 12 = cv-qualifier wrapper
ret = ret->next; // +144
if (ret->prototype->exception_spec) // +152 -> +56
error(3647, a1->source_loc); // auto/decltype(auto) return
// Execution space conflict check (single condition with ||)
char es = a2->byte_182;
if ((!dword_106BFF0 && (es & 0x60) == 0x20) || (es & 0x10) != 0)
error(3481, a1->source_loc);
// Left branch: already __device__ (not relaxed mode) -> conflict
// Right branch: already __host__ explicit (unconditional) -> conflict
// Return type must be void (non-constexpr path)
if (!(a2->byte_179 & 0x10)) { // not constexpr
if (a2->byte_191 & 0x01) // lambda
error(3506, a1->source_loc);
else if (!is_void_return(a2))
error(3505, a1->source_loc);
}
// Variadic check
// ... skip to prototype, check bit 0 of proto+16
if (proto_flags & 0x01)
error(3503, a1->source_loc);
// >>> SET THE BITMASK <<<
a2->byte_182 |= 0x61; // bits 0,5,6: device_capable + device_annotation + global_kernel
// Local function check
if (a2->byte_81 & 0x04)
error(3688, a1->source_loc);
// main() check
if (a2 == qword_126EB70 && (a2->byte_182 & 0x20))
error(3538, a1->source_loc);
// Always set bit 7 after __global__: the check reads the byte AFTER |= 0x61,
// so bit 6 is always set, making this unconditional.
if (a2->byte_182 & 0x40)
a2->byte_182 |= 0x80;
// Parameter default-init check (device-side warning)
// ... iterate parameters, warn 3669 if missing defaults
return a2;
}
apply_nv_device_attr (sub_40EB80)
Handles both variables (target_kind == 7) and functions (target_kind == 11). For variables, it sets the memory space bitfield at +148 (bit 0 = __device__). For functions, it sets the execution space.
// Variable path (target_kind == 7):
a2->byte_148 |= 0x01; // __device__ memory space
if (((a2->byte_148 & 0x02) != 0) + ((a2->byte_148 & 0x04) != 0) == 2)
error(3481, ...); // both __shared__ (bit 1) AND __constant__ (bit 2) set
if ((signed char)a2->byte_161 < 0)
error(3482, ...); // thread_local
if (a2->byte_81 & 0x04)
error(3485, ...); // local variable
// Function path (target_kind == 11):
// Same constexpr-lambda check as __global__
if (!dword_106BFF0 && (a2->byte_182 & 0x40))
error(3481, ...); // already __global__, now __device__
a2->byte_182 |= 0x23; // device_capable + device_explicit + device_annotation
if ((a2->byte_81 & 0x04) && (a2->byte_182 & 0x40))
error(3688, ...); // local function with __global__
if (a2 == qword_126EB70 && (a2->byte_182 & 0x20))
error(3538, ...); // __device__ on main()
apply_nv_host_attr (sub_4108E0)
The simplest of the three handlers. It applies only to functions (target_kind == 11) and performs fewer validation checks than the __global__ or __device__ handlers.
// Function path (target_kind == 11):
// Same constexpr-lambda check
if (a2->byte_182 & 0x40)
error(3481, ...); // already __global__, now __host__
a2->byte_182 |= 0x15; // device_capable + host_capable + host_explicit
if ((a2->byte_81 & 0x04) && (a2->byte_182 & 0x40))
error(3688, ...); // local function
if (a2 == qword_126EB70 && (a2->byte_182 & 0x20))
error(3538, ...); // main() with bit 5 (0x20) already set; the 0x15 mask alone does not set it
Default Execution Space
Functions without any explicit annotation have byte +182 == 0x00. This is treated as implicit __host__:
- The 0x30 mask yields 0x00, which the cross-space validator treats identically to 0x10 (explicit __host__)
- The function is compiled for the host side only
- It is excluded from device IL during the keep-in-il pass
In JIT compilation mode (--default-device), the default flips to __device__. This changes which functions are kept in device IL without requiring explicit annotations.
Execution Space Conflict Detection
The attribute handlers enforce a mutual-exclusion matrix. When a second execution space attribute is applied to a function that already has one, the handler checks for conflicts using error 3481:
| Already set | Applying | Result |
|---|---|---|
| (none) | __host__ | 0x15 -- accepted |
| (none) | __device__ | 0x23 -- accepted |
| (none) | __global__ | 0xE1 -- accepted |
| __host__ (0x15) | __device__ | 0x37 -- accepted (HD) |
| __device__ (0x23) | __host__ | 0x37 -- accepted (HD) |
| __host__ (0x15) | __global__ | error 3481 (always -- byte & 0x10 is unconditional) |
| __device__ (0x23) | __global__ | error 3481 (unless dword_106BFF0) |
| __global__ (0xE1) | __host__ | error 3481 (always) |
| __global__ (0xE1) | __device__ | error 3481 (unless dword_106BFF0) |
| __host__ (0x15) | __host__ | idempotent OR, no error |
| __device__ (0x23) | __device__ | idempotent OR, no error |
| __global__ (0xE1) | __global__ | idempotent OR, no error |
The relaxed mode flag dword_106BFF0 suppresses only the conflicts between __device__ and __global__. When set, those two combinations (in either order) are silently accepted instead of producing error 3481. This flag corresponds to --expt-relaxed-constexpr or similar permissive compilation modes. The relaxed flag does NOT affect the __host__ -> __global__ or __global__ -> __host__ paths: these always error because the __global__ handler checks byte & 0x10 unconditionally, and the __host__ handler checks byte & 0x40 unconditionally.
Virtual Function Override Checking (sub_432280)
When a derived class overrides a virtual function, cudafe++ must verify execution space compatibility. This check is embedded in record_virtual_function_override (sub_432280, 437 lines, from class_decl.c).
nv_is_device_only_routine Inline Check
The function first tests whether the overriding function has the __device__ flag at +177 bit 4 (0x10). If so, and the overridden function does NOT have this flag, execution space propagation occurs:
// Propagation logic (simplified from sub_432280, lines 70-94)
if (overriding->byte_177 & 0x10) { // overriding is __device__
if (!(overridden->byte_177 & 0x10)) { // overridden is NOT __device__
char es = overridden->byte_182;
if ((es & 0x30) != 0x20) { // overridden is not device-only
overriding->byte_182 |= 0x10; // propagate __host__ flag
}
if (es & 0x20) { // overridden has device_annotation
overriding->byte_182 |= 0x20; // propagate device_annotation
}
}
}
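The propagation step can be exercised in isolation. The sketch below assumes a hypothetical helper `propagate_override_space`; its two bool parameters stand in for the +177 bit 4 tests on the overriding and overridden routines, and the return value is the overriding routine's updated byte +182.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring the simplified propagation logic.
 * overriding_flag / overridden_flag model (byte_177 & 0x10) != 0. */
static uint8_t propagate_override_space(uint8_t overriding_es, bool overriding_flag,
                                        uint8_t overridden_es, bool overridden_flag)
{
    if (overriding_flag && !overridden_flag) {
        if ((overridden_es & 0x30) != 0x20)   /* overridden is not device-only */
            overriding_es |= 0x10;            /* inherit the host bit */
        if (overridden_es & 0x20)             /* overridden has device_annotation */
            overriding_es |= 0x20;            /* inherit the device bit */
    }
    return overriding_es;
}
```

With an overriding byte of 0x23 and a `__host__` base (0x15), the result is 0x33: the host bit is inherited, so the override no longer classifies as device-only under the 0x30 mask.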
Six Virtual Override Mismatch Errors (3542-3547)
When the overriding function is NOT __device__, the checker looks up execution space attributes using sub_5CEE70 (attribute kind 87 = __device__, kind 86 = __host__). Based on which attributes are found on the overriding function and the execution space of the overridden function, one of six errors is emitted:
| Error | Overriding has | Overridden space (byte & 0x30) | Meaning |
|---|---|---|---|
| 3542 | __device__ only | 0x00 or 0x10 (host/implicit) | Device override of host virtual |
| 3543 | __device__ + __host__ | 0x00 (no annotation) | HD override of implicit-host virtual |
| 3544 | __device__ + __host__ | 0x20 (device-only) | HD override of device-only virtual |
| 3545 | no __device__ | 0x20 (device-only) | Host override of device-only virtual |
| 3546 | no __device__ | 0x30 (HD) | Host override of HD virtual |
| 3547 | __device__ only | 0x30 (HD), relaxed mode | Device override of HD virtual (relaxed) |
The errors are emitted via sub_4F4F10 with severity 8 (hard error). The dword_106BFF0 relaxed mode flag modulates certain paths: in relaxed mode, some combinations that would otherwise error are accepted or downgraded, while 3547 is reached only when the flag is set.
Decision Logic
// Pseudocode for override mismatch detection (sub_432280, lines 95-188)
char es = overridden->byte_182;
char mask_30 = es & 0x30;
bool has_device_bit = (es & 0x20) != 0;   // bit 5: device_annotation
bool is_hd = (mask_30 == 0x30);
bool relaxed_mode = (dword_106BFF0 != 0);
bool has_device_attr = has_attribute(overriding, 87 /*__device__*/);
bool has_host_attr = has_attribute(overriding, 86 /*__host__*/);
if (has_device_attr) {
    if (has_host_attr) {
        // Overriding is __host__ __device__
        if (has_device_bit)
            error = 3544;   // HD overrides device-only
        else if (mask_30 != 0x20)
            error = 3543;   // HD overrides implicit-host
    } else {
        // Overriding is __device__ only
        if (!has_device_bit)
            error = 3542;   // device overrides host
        if (is_hd && relaxed_mode)
            error = 3547;   // device overrides HD (relaxed)
    }
} else {
    // Overriding has no __device__
    if (mask_30 == 0x20)
        error = 3545;       // host overrides device-only
    else if (mask_30 == 0x30)
        error = 3546;       // host overrides HD
}
__global__ Function Constraints
The __global__ handler enforces the strictest constraints of any execution space. A kernel function must satisfy all of the following:
| Constraint | Check | Error |
|---|---|---|
| Must be a function (not variable/type) | target_kind == 11 | silently ignored if not |
| Not a constexpr lambda with wrong linkage | (qword_184 & 0x800001000000) != 0x800000000000 | 3469 |
| Not a static member function | (signed char)byte_176 >= 0 \|\| (byte_81 & 0x04) | 3507 (warning) |
| Not operator() | byte_166 != 5 | 3644 |
| Return type not auto/decltype(auto) | no exception spec at proto+56 | 3647 |
| No conflicting execution space | see conflict matrix above | 3481 |
| Return type is void (non-constexpr) | is_void_return(a2) | 3505 / 3506 |
| Not variadic | !(proto_flags & 0x01) | 3503 |
| Not a local function | !(byte_81 & 0x04) | 3688 |
| Not main() | a2 != qword_126EB70 | 3538 |
| Parameters have default init (device-side) | walk parameter list | 3669 (warning) |
Execution Space Annotation Helper (sub_41A1F0)
This function validates that type arguments used in __host__ __device__ or __device__ template contexts are well-formed. It traverses the type chain (following cv-qualifier wrappers where kind == 12), emitting diagnostics:
- Error 3597: Type nesting depth exceeds 7 levels
- Error 3598: Type is not device-callable (fails sub_550E50 check)
- Error 3599: Type lacks appropriate constructor/destructor for device context
The third argument (a3) selects the annotation string: when a3 == 0, the string is "__host__ __device__"; when a3 != 0, it is "__device__".
Attribute Dispatch (apply_one_attribute)
The central dispatcher sub_413240 (apply_one_attribute, 585 lines) routes attribute kinds to their handlers via a switch statement:
| Kind byte | Decimal | Attribute | Handler |
|---|---|---|---|
| 'V' | 86 | __host__ | sub_4108E0 |
| 'W' | 87 | __device__ | sub_40EB80 |
| 'X' | 88 | __global__ | sub_40E1F0 or sub_40E7F0 |
Attribute display names are resolved by sub_40A310 (attribute_display_name), which maps the kind byte back to the human-readable CUDA keyword string for use in diagnostic messages.
Execution Space Mask Table (dword_E7C760)
A lookup table at dword_E7C760 stores precomputed bitmasks indexed by execution space enum value. The function sub_6BCF60 (nv_check_execution_space_mask) performs return a1 & dword_E7C760[a2], allowing fast bitwise checks of whether a given entity's execution space matches a target space category. This table is used throughout cross-space validation and IL marking.
Diagnostics Reference
| Error | Severity | Meaning |
|---|---|---|
| 3469 | error | Execution space attribute on constexpr lambda with wrong linkage |
| 3481 | error | Conflicting execution spaces |
| 3482 | error | __device__ variable with thread_local storage |
| 3485 | error | __device__ attribute on local variable |
| 3503 | error | __global__ function cannot be variadic |
| 3505 | error | __global__ return type must be void (non-constexpr path) |
| 3506 | error | __global__ return type must be void (constexpr/lambda path) |
| 3507 | warning | __global__ on static member function |
| 3538 | error | Execution space attribute on main() |
| 3542 | error | Virtual override: __device__ overrides host |
| 3543 | error | Virtual override: __host__ __device__ overrides implicit-host |
| 3544 | error | Virtual override: __host__ __device__ overrides device-only |
| 3545 | error | Virtual override: host overrides device-only |
| 3546 | error | Virtual override: host overrides __host__ __device__ |
| 3547 | error | Virtual override: __device__ overrides HD (relaxed mode) |
| 3577 | error | Memory space attribute incompatible with __grid_constant__ |
| 3597 | error | Type nesting too deep for execution space annotation |
| 3598 | error | Type not callable in target execution space |
| 3599 | error | Type lacks device-compatible constructor/destructor |
| 3644 | error | __global__ on operator() |
| 3647 | error | __global__ return type cannot be auto/decltype(auto) |
| 3669 | warning | __global__ parameter without default initializer (device-side) |
| 3688 | error | Execution space attribute on local function |
Function Map
| Address | Identity | Lines | Source |
|---|---|---|---|
| sub_40A310 | attribute_display_name | 83 | attribute.c |
| sub_40E1F0 | apply_nv_global_attr (variant 1) | 89 | attribute.c |
| sub_40E7F0 | apply_nv_global_attr (variant 2) | 86 | attribute.c |
| sub_40EB80 | apply_nv_device_attr | 100 | attribute.c |
| sub_4108E0 | apply_nv_host_attr | 31 | attribute.c |
| sub_413240 | apply_one_attribute (dispatch) | 585 | attribute.c |
| sub_41A1F0 | execution space annotation helper | 82 | class_decl.c |
| sub_432280 | record_virtual_function_override | 437 | class_decl.c |
| sub_6BCF60 | nv_check_execution_space_mask | 7 | nv_transforms.c |
| sub_719D20 | check_void_return_okay | 271 | statements.c |
Cross-References
- Memory Spaces -- variable-side __device__/__shared__/__constant__ at entity+148
- Cross-Space Validation -- call-graph enforcement of execution space rules
- Device/Host Separation -- IL marking driven by execution space
- Kernel Stubs -- host-side stub generation for __global__ functions
- Entity Node Layout -- full byte map of the entity structure
- Virtual Override Matrix -- detailed 6-error mismatch table
- JIT Mode -- --default-device flag that changes implicit execution space
Memory Spaces
Every CUDA variable that resides in GPU memory belongs to one of four memory spaces: __device__ (global memory), __shared__ (per-block scratchpad), __constant__ (read-only broadcast memory), or __managed__ (unified memory). cudafe++ encodes memory space as a two-byte bitfield at offsets +148 and +149 of the variable entity node. These two bytes are the variable-side analog of the execution space byte at +182 used for functions -- the two systems are complementary but independent.
The memory space bitfield passes through three processing stages. First, attribute handlers in attribute.c set the appropriate bits and enforce mutual exclusion constraints (no __shared__ + __constant__, no thread_local, no grid_constant conflict). Second, declaration processing in decls.c applies additional validation: VLA restrictions for __shared__, constexpr and external-linkage restrictions for __constant__/__device__, and structured binding constraints for all spaces. Third, symbol reference recording in symbol_ref.c checks whether host code illegally accesses device-side variables at reference time.
Memory spaces apply exclusively to variables (entity kind 7). __shared__ and __constant__ have no function-side meaning -- only __device__ (kind 'W', 87) doubles as a function execution space attribute.
Key Facts
| Property | Value |
|---|---|
| Memory space offset | Entity node byte +148 (3-bit memory space field in bits 0-2) |
| Extended space offset | Entity node byte +149 (1 bit for __managed__) |
| __device__ handler | sub_40EB80 (apply_nv_device_attr, 100 lines, attribute.c) |
| __managed__ handler | sub_40E0D0 (apply_nv_managed_attr, 47 lines, attribute.c:10523) |
| __shared__ handler | Kind 'Z' (90), not individually decompiled; sets +148 \|= 0x02 |
| __constant__ handler | Kind '[' (91), not individually decompiled; sets +148 \|= 0x04 |
| Declaration processor | sub_4DEC90 (variable_declaration, 1098 lines, decls.c) |
| Variable declaration | sub_4CA6C0 (decl_variable, 1090 lines, decls.c:7730) |
| Variable fixup | sub_4CC150 (cuda_variable_fixup, 120 lines, decls.c) |
| Defined-variable check | sub_4DC200 (mark_defined_variable, 26 lines, decls.c) |
| Cross-space reference checker | sub_72A650 / sub_72B510 (record_symbol_reference_full, symbol_ref.c) |
| Device-var-in-host checker | sub_6BCF10 (nv_check_device_variable_in_host, nv_transforms.c) |
| Post-validation | sub_6BC890 (nv_validate_cuda_attributes, 161 lines, nv_transforms.c) |
| Attribute kind codes | 'W'=87 (__device__), 'Z'=90 (__shared__), '['=91 (__constant__), 'f'=102 (__managed__) |
The Memory Space Bitfield (Entity +148 / +149)
Byte +148: Primary Memory Space
Byte at entity+148:
bit 0 (0x01) __device__ Variable in device global memory
bit 1 (0x02) __shared__ Variable in per-block shared memory
bit 2 (0x04) __constant__ Variable in constant memory
bit 3 (0x08) type_member Set when variable inherits space from type context
bit 4 (0x10) device_at_file __device__ at file scope (no enclosing function)
bit 7 (0x80) weak_odr Set by apply_nv_weak_odr_attr (sub_40AD80)
Bits 3 and 4 are set by decl_variable (sub_4CA6C0) during declaration processing, not by the memory space attribute handlers; bit 7 is set separately by apply_nv_weak_odr_attr (sub_40AD80). Bit 3 is set via *(_BYTE *)(v33 + 148) |= 8u when the variable inherits its memory space from a type context (such as a static member of a class with a device annotation). Bit 4 is set via *(_BYTE *)(v43 + 148) = v73 | 0x10 when a __device__ variable is declared at file scope (dword_126C5D8 == -1, meaning no enclosing function).
Byte +149: Extended Memory Space
Byte at entity+149:
bit 0 (0x01) __managed__ Unified memory (host + device accessible)
bits 1-7 (reserved)
Word-Level Access
Some validation code reads bytes +148 and +149 together as a 16-bit word. The __grid_constant__ conflict check in apply_nv_managed_attr tests:
// sub_40E0D0, line 26 (apply_nv_managed_attr)
if ( (a2[164] & 4) != 0 && (*((_WORD *)a2 + 74) & 0x102) != 0 )
Here (_WORD *)(a2 + 148) (offset 74 in 16-bit units) is tested against 0x0102. In little-endian layout, 0x0102 means byte +148 bit 1 (__shared__) OR byte +149 bit 0 (__managed__). This catches the case where a __grid_constant__ parameter also carries __shared__ or __managed__.
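The 16-bit view can be illustrated without relying on host endianness. This is a minimal sketch, assuming hypothetical helpers `space_word` and `hits_grid_constant_mask` that take the two bytes as plain arguments; composing the word as low byte | (high byte << 8) gives the same value the x86-64 binary obtains from its little-endian `_WORD` load.

```c
#include <stdbool.h>
#include <stdint.h>

/* Builds the word the decompiled code reads at entity+148:
 * byte +148 is the low half, byte +149 is the high half. */
static uint16_t space_word(uint8_t byte_148, uint8_t byte_149)
{
    return (uint16_t)(byte_148 | (byte_149 << 8));
}

/* True when the 0x0102 mask matches: __shared__ (+148 bit 1)
 * or __managed__ (+149 bit 0). */
static bool hits_grid_constant_mask(uint8_t byte_148, uint8_t byte_149)
{
    return (space_word(byte_148, byte_149) & 0x0102) != 0;
}
```

A byte pair of (0x02, 0x00) or (0x01, 0x01) matches the mask, while a plain `__device__` pair (0x01, 0x00) does not.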
Mutual Exclusion
In valid CUDA programs, at most one of __device__, __shared__, and __constant__ should be set. However, __managed__ always implies __device__ -- the handler sets both +149 bit 0 and +148 bit 0. The validation logic permits __device__ + __managed__ but rejects combinations like __shared__ + __constant__.
The mutual exclusion check appears identically in both apply_nv_managed_attr and apply_nv_device_attr:
// From sub_40EB80 (apply_nv_device_attr), variable path:
v9 = *(_BYTE *)(a2 + 148) | 1; // set __device__ bit
*(_BYTE *)(a2 + 148) = v9;
if ( ((v9 & 2) != 0) + ((v9 & 4) != 0) == 2 )
sub_4F81B0(3481, a1 + 56); // error: conflicting spaces
The expression ((v9 & 2) != 0) + ((v9 & 4) != 0) == 2 is true only when both __shared__ (bit 1) and __constant__ (bit 2) are set simultaneously. This means:
- __device__ + __shared__ is allowed (the bits coexist)
- __device__ + __constant__ is allowed
- __shared__ + __constant__ triggers error 3481
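The sum-of-comparisons idiom is easy to misread, so the sketch below restates it next to an equivalent single-mask test. Both function names are hypothetical; the first body copies the decompiled expression, and the second is an assumed equivalent form, not something observed in the binary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decompiled form: add two 0/1 comparison results and test for 2. */
static bool shared_and_constant(uint8_t byte_148)
{
    return ((byte_148 & 2) != 0) + ((byte_148 & 4) != 0) == 2;
}

/* Equivalent single-mask form: bit 1 and bit 2 must both be set. */
static bool shared_and_constant_mask(uint8_t byte_148)
{
    return (byte_148 & 0x06) == 0x06;
}
```

Bit 0 (`__device__`) never influences the result, which is why 0x03 and 0x05 pass while 0x06 and 0x07 are rejected.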
Attribute Handlers
apply_nv_managed_attr -- sub_40E0D0
The __managed__ handler is the simplest and most thoroughly documented. It demonstrates the full validation pattern that all memory space handlers share.
Entry point: Called from apply_one_attribute (sub_413240) when attribute kind is 'f' (102).
Decompiled logic (47 lines, attribute.c:10523):
// sub_40E0D0 -- apply_nv_managed_attr
// a1: attribute node, a2: entity node, a3: entity kind
// Gate: only applies to variables
if ( a3 != 7 )
internal_error("attribute.c", 10523, "apply_nv_managed_attr");
// Step 1: Set managed flag AND device flag
v3 = a2[148]; // save old memory space byte
a2[149] |= 1; // set __managed__ bit
a2[148] = v3 | 1; // set __device__ bit (managed implies device)
// Step 2: Mutual exclusion check
if ( ((v3 & 2) != 0) + ((v3 & 4) != 0) == 2 )
error(3481, ...); // __shared__ + __constant__ conflict
// Step 3: Thread-local check
if ( (char)a2[161] < 0 )
error(3482, ...); // __managed__ on thread_local
// Step 4: Local variable check
if ( (a2[81] & 4) != 0 )
error(3485, ...); // __managed__ on local variable
// Step 5: __grid_constant__ conflict
if ( (a2[164] & 4) != 0 && (*(WORD*)(a2 + 148) & 0x102) != 0 )
{
// Determine which space string to display
v4 = a2[148];
v5 = "__constant__";
if ( (v4 & 4) == 0 ) {
v5 = "__managed__";
if ( (a2[149] & 1) == 0 ) {
v5 = "__shared__";
if ( (v4 & 2) == 0 ) {
v5 = "__device__";
if ( (v4 & 1) == 0 )
v5 = "";
}
}
}
error(3577, ..., v5); // incompatible with __grid_constant__
}
The space-name selection cascade (__constant__ > __managed__ > __shared__ > __device__ > empty) is used in error messages to show which memory space conflicts with __grid_constant__. The cascade tests bits in priority order, matching the most "restrictive" space first.
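Flattened into early returns, the cascade reads as a simple priority list. This is a sketch assuming a hypothetical helper `space_name` that takes bytes +148 and +149 as arguments; the nested-if form in the decompiled code selects the same string for every input.

```c
#include <stdint.h>

/* Returns the display name for the highest-priority memory space bit:
 * __constant__ > __managed__ > __shared__ > __device__ > "". */
static const char *space_name(uint8_t byte_148, uint8_t byte_149)
{
    if (byte_148 & 0x04) return "__constant__";
    if (byte_149 & 0x01) return "__managed__";
    if (byte_148 & 0x02) return "__shared__";
    if (byte_148 & 0x01) return "__device__";
    return "";
}
```

Because `apply_nv_managed_attr` sets +149 bit 0 before this point, a call from that handler can only ever select "__constant__" or "__managed__"; the lower-priority names matter when the same cascade is reached from other handlers.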
apply_nv_device_attr -- sub_40EB80
The __device__ handler is dual-purpose: it handles both variables (a3 == 7) and functions (a3 == 11).
Entry point: Called from apply_one_attribute when attribute kind is 'W' (87).
Variable path (entity kind 7):
// sub_40EB80, variable branch
*(_BYTE *)(a2 + 148) |= 1; // set __device__ bit
// Validation (identical to __managed__):
// 1. Error 3481 if __shared__ + __constant__ both set
// 2. Error 3482 if thread_local (byte +161 bit 7)
// 3. Error 3485 if local variable (byte +81 bit 2)
// 4. Error 3577 if __grid_constant__ conflict
Function path (entity kind 11):
// sub_40EB80, function branch
// Check: not an implicitly-deleted function (Execution Spaces labels this same bit test the constexpr-lambda check)
if ( (*(_QWORD *)(a2 + 184) & 0x800001000000LL) != 0x800000000000LL
|| (*(_BYTE *)(a2 + 176) & 2) != 0 )
{
// Conflict with __global__
if ( !dword_106BFF0 && (*(_BYTE *)(a2 + 182) & 0x40) != 0 )
error(3481, ...);
*(_BYTE *)(a2 + 182) |= 0x23; // set device execution space
// Local function with __global__ conflict
if ( (*(_BYTE *)(a2 + 81) & 4) != 0 && (*(_BYTE *)(a2 + 182) & 0x40) != 0 )
error(3688, ...);
// __device__ on main()
if ( a2 == qword_126EB70 && (*(_BYTE *)(a2 + 182) & 0x20) != 0 )
error(3538, ...);
}
else
{
// Implicitly-deleted function: report 3469 and leave byte +182 unchanged
v14 = get_entity_display_name(a2);
error(3469, ..., "__device__", v14);
}
// Check function parameters for missing default initializers
// (warning 3669 for parameters without defaults in device context)
The function path is documented in Execution Spaces -- here we focus on the variable path.
__shared__ and __constant__ Handlers
The __shared__ and __constant__ attribute handlers are dispatched through apply_one_attribute (sub_413240) when attribute kind codes 'Z' (90) and '[' (91) are encountered. Their variable-path logic mirrors __device__ and __managed__:
| Step | __shared__ ('Z') | __constant__ ('[') |
|---|---|---|
| Set memory space bit | byte +148 \|= 0x02 | byte +148 \|= 0x04 |
| Mutual exclusion (3481) | Check __constant__ bit (bit 2) | Check __shared__ bit (bit 1) |
| Thread-local check (3482) | Yes | Yes |
| Local variable check (3485) | Yes | Yes |
| __grid_constant__ conflict (3577) | Yes | Yes |
The __shared__ and __constant__ keywords apply only to variables (kind 7). Unlike __device__, they do not have a function-path branch -- there is no __shared__ or __constant__ function execution space.
Variable Declaration Processing
sub_4DEC90 -- variable_declaration
The top-level declaration processor (decls.c) performs additional CUDA-specific validation after attribute handlers have set the memory space bits. This function is 1098 lines and handles both normal variable declarations and static data member definitions.
CUDA-specific checks in variable_declaration:
| Error | Condition | Description |
|---|---|---|
| 149 | Memory space attribute at illegal scope | CUDA storage class at namespace scope (specific scenarios) |
| 892 | auto with __constant__ | auto-typed __constant__ variable |
| 893 | auto with CUDA attribute | auto-typed variable with other CUDA memory space |
| 3510 | __shared__ with VLA | __shared__ variable with variable-length array type |
| 3566 | __constant__ + constexpr + auto | __constant__ constexpr with auto deduction |
| 3567 | CUDA variable with VLA | CUDA memory-space variable with VLA type |
| 3568 | __constant__ + constexpr | __constant__ combined with constexpr |
| 3578 | CUDA attribute in discarded branch | CUDA attribute on variable in constexpr-if discarded branch |
| 3579 | CUDA attribute + structured binding | CUDA attribute at namespace scope with structured binding |
| 3580 | CUDA attribute on VLA | CUDA attribute on variable-length array |
Memory space string selection (used in error messages):
// sub_4DEC90, line ~357: selecting display name for the memory space
v50 = "__constant__";
if ( (v49 & 4) == 0 ) {
v50 = "__managed__";
if ( (*(_BYTE *)(v15 + 149) & 1) == 0 ) {
v50 = "__host__ __device__" + 9; // pointer arithmetic: = "__device__"
if ( (v49 & 2) != 0 )
v50 = "__shared__";
}
}
The string "__device__" is produced by taking the string "__host__ __device__" and advancing by 9 bytes, skipping past "__host__ ". This is a binary-level optimization -- the compiler shares string storage between the combined "__host__ __device__" literal and the standalone "__device__" reference.
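The offset arithmetic can be confirmed with a few lines of C. This is an illustrative sketch: `combined` and `device_name` are hypothetical names, and the explicit array stands in for the shared string literal in the binary's .rodata.

```c
#include <string.h>

/* Stand-in for the combined literal stored once in read-only data. */
static const char combined[] = "__host__ __device__";

/* Returns a pointer into `combined`, just past the "__host__ " prefix.
 * The prefix is 8 characters plus one space, so the offset is 9. */
static const char *device_name(void)
{
    return combined + strlen("__host__ ");
}
```

The returned pointer shares storage with `combined` and compares equal to "__device__" under `strcmp`, which is the behavior the decompiled `+ 9` expression relies on.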
sub_4CA6C0 -- decl_variable
The core variable declaration function (1090 lines, decls.c:7730) handles CUDA memory space propagation during symbol table entry creation. Key behaviors:
Storage class mapping: When declaration state byte at offset +269 equals 5, it indicates a CUDA memory space storage class. The function performs a scope walk to determine the correct namespace scope for the variable. If a prior declaration exists at the same scope (dword_126C5DC == dword_126C5B4), the CUDA storage class is reset to allow redeclaration.
Scope walk: Traverses the scope chain (784-byte scope entries at qword_126C5E8, indexed by dword_126C5E4) upward through class scopes (scope_kind 4) and template scopes (bit 0x20 at scope entry +9), until reaching a non-class, non-template scope. This determines whether the variable is at namespace scope, class scope, or block scope.
Error 3483 -- memory space in non-device function: When a variable with a device memory space bit (+148 bit 0 set) is declared inside a function body, and the enclosing routine is NOT device-only ((+182 & 0x30) != 0x20), the function emits error 3483 with the storage kind and space name:
// From sub_4CA6C0, ~line 886-910
if (!at_namespace_scope) {
char space = entity->byte_148;
if (storage_class != 1 && (space & 0x01)) {
routine_descriptor = qword_126C5D0;
if (routine_descriptor) {
entity_ptr = *(routine_descriptor + 32);
if (entity_ptr && (entity_ptr[182] & 0x30) != 0x20) {
const char *name = get_space_name(entity); // priority cascade
const char *kind = (storage_class == 2) ? "a static" : "an automatic";
error(3483, source_loc, kind, name);
}
}
}
}
File-scope device flag: When a __device__ variable is at file scope (dword_126C5D8 == -1), the function sets bit 4 of +148:
if ((entity->byte_148 & 0x01) && dword_126C5D8 == -1)
entity->byte_148 |= 0x10; // bit 4: device_at_file_scope
Redeclaration checking: When a variable is redeclared, the function compares memory space encoding at offset +136 (the attribute byte) between the existing and new entity. Error 1306 is emitted for mismatched CUDA memory spaces.
Memory space propagation: Calls sub_4C4750 (set_variable_attributes) for final attribute propagation, and sub_4CA480 (check_variable_redeclaration) for prior-declaration compatibility.
sub_4DC200 -- mark_defined_variable
Post-declaration validation for device-memory variables with external linkage (26 lines):
// sub_4DC200 -- mark_defined_variable (decompiled)
void mark_defined_variable(entity_t *a1, int a2) {
if (a1[164] & 0x10) { // already marked as defined
if (!dword_106BFD0 // cross-space checking not overridden
&& (a1[148] & 3) == 1 // __device__ set, __shared__ NOT set
&& !is_compiler_generated(a1) // not compiler-generated
&& (a1[80] & 0x70) != 0x10) // not anonymous
{
warning(3648, a1 + 64); // external linkage warning
}
} else if (!a2 && (*(byte*)(*(qword*)a1 + 81) & 2)) {
error(1655, ...); // tentative definition of constexpr
} else {
// Same 3648 check on first definition
if (!dword_106BFD0 && (a1[148] & 3) == 1 && ...)
warning(3648, a1 + 64);
a1[164] |= 0x10; // mark as defined
}
}
The condition (a1[148] & 3) == 1 tests that bit 0 (__device__) is set AND bit 1 (__shared__) is NOT set. This catches __device__ variables (including __device__ __constant__ and __device__ __managed__, since those have bit 0 set) but excludes __shared__ variables (which have bit 1 set). The check is NOT about __constant__ alone -- a pure __constant__ variable (only bit 2 set, value 0x04) would yield (0x04 & 3) == 0, failing the test. The p1.06 report's characterization of error 3648 as "constant with external linkage" is misleading; the actual condition is "device-accessible (non-shared) variable with external linkage."
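The mask test is easy to check by hand against each memory-space combination. The sketch below isolates only that test; the function name is hypothetical and the remaining conditions of the 3648 check (compiler-generated, anonymous, override flag) are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: `entity` points at the start of a variable entity
 * node, so entity[148] is the memory-space byte described above. */
bool passes_3648_space_test(const uint8_t *entity)
{
    assert(entity != NULL);
    /* Mask 0x03 keeps bit 0 (__device__) and bit 1 (__shared__); the result
     * must be exactly 0x01. Bit 2 (__constant__) is ignored by the mask. */
    return (entity[148] & 0x03) == 0x01;
}
```

With this predicate, byte values 0x01 (`__device__`) and 0x05 (`__device__ __constant__`) pass, while 0x02, 0x03, and 0x04 do not.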
sub_4CC150 -- cuda_variable_fixup
Called from variable_declaration after CUDA constexpr-if detection. This function:
- Manipulates variable entity fields at offset +148 (memory space) and +162 (visibility flags)
- Adjusts scope chains using the 784-byte scope entry array
- Creates new type entries for CUDA-specific variable rewriting
Bit Assignment Resolution
Two sweep reports provided conflicting bit assignments for byte +148:
| Source | bit 0 | bit 1 | bit 2 |
|---|---|---|---|
| p1.01 (attribute.c handlers) | __device__ | __shared__ | __constant__ |
| p1.06 (decls.c) | __constant__ | __shared__ | __managed__ |
The decompiled code resolves this definitively in favor of the p1.01 assignment. Two independent functions confirm it:
- sub_40E0D0 (apply_nv_managed_attr) sets a2[149] |= 1 (managed at +149) and a2[148] = v3 | 1 (device at +148 bit 0). The subsequent conflict check tests (v3 & 2) for __shared__ and (v3 & 4) for __constant__.
- sub_40EB80 (apply_nv_device_attr) sets *(_BYTE *)(a2 + 148) | 1 (device at +148 bit 0), then uses the identical conflict test ((v9 & 2) != 0) + ((v9 & 4) != 0) == 2.
The canonical encoding is:
Byte +148: bit 0 = __device__, bit 1 = __shared__, bit 2 = __constant__
Byte +149: bit 0 = __managed__
The p1.06 report's alternative encoding is an analysis error, caused by mark_defined_variable (sub_4DC200) testing +148 & 3 == 1 in the context of error 3648. That test checks for __device__ set (bit 0) without __shared__ (bit 1) -- not for __constant__ at bit 0. The error was then characterized as "constant with external linkage" based on the error message text rather than the actual bit test.
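For reimplementation, the canonical encoding can be written down as named constants. The following is a sketch under the assumption that the entity node is addressed as a raw byte array; the enum and function names are illustrative and do not come from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Canonical bits of entity byte +148, plus the managed bit of byte +149. */
enum {
    SPACE_DEVICE   = 0x01,  /* +148 bit 0 */
    SPACE_SHARED   = 0x02,  /* +148 bit 1 */
    SPACE_CONSTANT = 0x04,  /* +148 bit 2 */
    SPACE_MANAGED  = 0x01   /* +149 bit 0 */
};

/* Mirrors the two stores attributed to apply_nv_managed_attr:
 * managed at +149, and the device bit at +148. */
void mark_managed(uint8_t *entity)
{
    assert(entity != NULL);
    entity[149] |= SPACE_MANAGED;
    entity[148] |= SPACE_DEVICE;
}

/* The two bytes read as one little-endian 16-bit word at +148. */
uint16_t space_word(const uint8_t *entity)
{
    assert(entity != NULL);
    return (uint16_t)(entity[148] | (entity[149] << 8));
}
```

Reading both bytes as one word explains masks such as 0x0102 used elsewhere on this page: the low byte selects bits of +148 and the high byte selects bits of +149.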
Validation Constraints
__managed__ Constraints
__managed__ has the strictest requirements among memory space annotations. All five checks occur in apply_nv_managed_attr (sub_40E0D0):
| Constraint | Binary test | Error | Description |
|---|---|---|---|
| Variables only | a3 != 7 | internal_error | __managed__ can only apply to variables, not functions or types |
| No shared+constant | ((old & 2) != 0) + ((old & 4) != 0) == 2 | 3481 | Both __shared__ and __constant__ already set |
| Not thread-local | (signed char)byte+161 < 0 | 3482 | Bit 7 of +161 = thread_local storage |
| Not reference/local | byte+81 & 4 | 3485 | Bit 2 of +81 = reference type or local variable |
| Not grid_constant | byte+164 & 4 and word +148 & 0x0102 | 3577 | __grid_constant__ parameter with managed or shared space |
The __managed__ keyword requires compute capability >= 3.0. This is verified at compilation time via version threshold comparisons (qword_126EF90 > 0x78B3, where 0x78B3 = 30899 in the CUDA version encoding scheme). The specific error code for architecture-too-low is not captured in the decompiled attribute handler.
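The table of binary tests can be read as a sequence of guards. The sketch below strings them together in table order; that order, the struct, and the function name are assumptions made for illustration, and the handler in the binary may interleave the checks with the attribute stores differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened view of the fields the table refers to. */
struct managed_check_view {
    int     kind;      /* a3 in the table; 7 = variable */
    uint8_t byte_81;   /* bit 2: reference type or local variable */
    uint8_t byte_148;  /* memory-space bits */
    uint8_t byte_149;  /* bit 0: __managed__ */
    uint8_t byte_161;  /* bit 7: thread_local */
    uint8_t byte_164;  /* bit 2: __grid_constant__ parameter */
};

/* Returns 0 if no constraint fires, -1 for the internal_error case,
 * otherwise the error number from the table. */
int check_managed_constraints(const struct managed_check_view *e)
{
    assert(e != NULL);
    if (e->kind != 7)
        return -1;
    if (((e->byte_148 & 2) != 0) + ((e->byte_148 & 4) != 0) == 2)
        return 3481;
    if (e->byte_161 & 0x80)       /* same as (signed char)byte < 0 */
        return 3482;
    if (e->byte_81 & 0x04)
        return 3485;
    uint16_t word_148 = (uint16_t)(e->byte_148 | (e->byte_149 << 8));
    if ((e->byte_164 & 0x04) && (word_148 & 0x0102))
        return 3577;
    return 0;
}
```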
__shared__ Constraints
__shared__ variables have restrictions enforced across multiple functions:
| Constraint | Where | Error | Description |
|---|---|---|---|
| No VLA type | sub_4DEC90 | 3510 | __shared__ variable cannot have variable-length array type |
| No VLA (general) | sub_4DEC90 | 3580 | CUDA memory-space attribute on variable-length array |
| Not thread-local | Attribute handler | 3482 | __shared__ on thread_local variable |
| Not local (non-block) | Attribute handler | 3485 | Cannot appear on local variables outside device function scope |
| No grid_constant | Attribute handler | 3577 | Incompatible with __grid_constant__ parameter |
__constant__ Constraints
__constant__ carries additional restrictions related to constexpr and type:
| Constraint | Where | Error | Description |
|---|---|---|---|
| No constexpr | sub_4DEC90 | 3568 | __constant__ combined with constexpr (when managed+device bits also set) |
| No constexpr+auto | sub_4DEC90 | 3566 | Constexpr with const-qualified type |
| No VLA type | sub_4DEC90 | 3567 | CUDA memory-space variable with VLA type |
| Not thread-local | Attribute handler | 3482 | __constant__ on thread_local variable |
| Not local | Attribute handler | 3485 | Cannot appear on local variables |
| No grid_constant | Attribute handler | 3577 | Incompatible with __grid_constant__ parameter |
Note: Error 3648 (external linkage warning) is emitted by sub_4DC200 but the condition tests (byte+148 & 3) == 1, which checks for __device__ set without __shared__ -- not specifically __constant__. The check applies to any device-accessible non-shared variable, including __device__, __device__ __constant__, and __device__ __managed__.
Cross-Space Variable Access Checking
When host code references a device-side variable, the symbol reference recorder emits diagnostics. This checking occurs in record_symbol_reference_full (sub_72A650 / sub_72B510, symbol_ref.c) and is gated by global flags dword_106BFD0 and dword_106BFCC.
Gate Logic
1. Is cross-space checking enabled?
→ dword_106BFD0 != 0 OR dword_106BFCC != 0
2. Is the referenced entity a variable (kind == 7)?
→ Yes: proceed to nv_check_device_var_ref_in_host
→ No (kind 10/11/20 -- function): check nv_check_host_var_ref_in_device
3. Get current routine from scope stack (dword_126C5D8)
4. Check routine execution space at +182 (0x30 mask):
→ 0x00 or 0x10 (host): emit device-var-in-host errors
→ 0x20 (device): emit host-var-in-device errors
Device Variable Referenced from Host Code
The nv_check_device_var_ref_in_host path (assert string at symbol_ref.c:2347) checks memory space bits and produces specific errors based on which space the variable occupies:
| Error | Condition | Description |
|---|---|---|
| 3548 | Variable has __shared__ or __constant__ (byte+148 bits 1-2) | Reference to __shared__ / __constant__ variable from host code |
| 3549 | Variable has __constant__ and reference is in initializer context (ref_kind bit 4) | Initializer referencing device memory variable from host |
| 3550 | Variable has __shared__ and reference is a write (ref_kind bit 1) | Write to __shared__ variable from host code |
| 3486 | Via sub_6BCF10 -- complex linkage check (+176 & 0x200000000002000, +166 == 5, +168 in [1,4]) | Illegal device variable reference from host (operator function context) |
Host Variable Referenced from Device Code
The nv_check_host_var_ref_in_device path (assert string at symbol_ref.c:2390) handles the reverse direction:
| Error | Condition | Description |
|---|---|---|
| 3623 | Device-only function referenced outside device context | Use of __device__-only function outside the bodies of device functions |
The error 3623 has two context strings:
- "outside the bodies of device functions" -- general case
- "from a constexpr or consteval __device__ function" -- constexpr context
Relaxation: dword_106BF40
When dword_106BF40 is set (corresponding to --expt-relaxed-constexpr), and the current routine at +182 has the device annotation pattern (& 0x30 == 0x20) with +177 bit 1 set (explicit __device__), cross-space variable access checks are suppressed. This allows constexpr device functions to reference host variables during constant evaluation.
Host Reference Arrays
When the backend emits host-side code, variables marked with __device__, __shared__, or __constant__ are registered in ELF section arrays so the CUDA runtime can discover them at load time. The emission function sub_6BCF80 (nv_emit_host_reference_array) writes entries into six separate sections:
| Section | Array Name | Memory Space | Linkage |
|---|---|---|---|
| .nvHRDE | hostRefDeviceArrayExternalLinkage | __device__ | External |
| .nvHRDI | hostRefDeviceArrayInternalLinkage | __device__ | Internal |
| .nvHRCE | hostRefConstantArrayExternalLinkage | __constant__ | External |
| .nvHRCI | hostRefConstantArrayInternalLinkage | __constant__ | Internal |
| .nvHRKE | hostRefKernelArrayExternalLinkage | __global__ (kernel) | External |
| .nvHRKI | hostRefKernelArrayInternalLinkage | __global__ (kernel) | Internal |
Each array entry contains the mangled name of the device symbol as a byte array:
extern "C" {
extern __attribute__((section(".nvHRDE")))
__attribute__((weak))
const unsigned char hostRefDeviceArrayExternalLinkage[] = {
/* mangled name bytes */ 0x0
};
}
Six global lists (at addresses unk_1286780 through unk_12868C0) accumulate symbols during compilation, one per section type. Note that __shared__ variables do NOT get host reference arrays -- they have no host-visible address.
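The six-way split is a two-dimensional lookup: symbol class by linkage. A minimal sketch of that lookup follows; the enum and function are hypothetical, and only the section names are taken from the table above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of a symbol that needs a host reference. */
enum host_ref_kind { HOST_REF_DEVICE, HOST_REF_CONSTANT, HOST_REF_KERNEL };

/* Picks the ELF section for one symbol; column 0 = internal linkage,
 * column 1 = external linkage. */
const char *host_ref_section(enum host_ref_kind kind, bool external_linkage)
{
    static const char *const sections[3][2] = {
        { ".nvHRDI", ".nvHRDE" },   /* __device__   */
        { ".nvHRCI", ".nvHRCE" },   /* __constant__ */
        { ".nvHRKI", ".nvHRKE" },   /* __global__   */
    };
    assert((int)kind >= 0 && (int)kind <= 2);
    return sections[kind][external_linkage ? 1 : 0];
}
```

There is no row for `__shared__`, matching the note that such variables never receive a host reference array.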
Redeclaration Compatibility
When a variable is redeclared, decl_variable (sub_4CA6C0) compares the memory space bits between the prior declaration and the new one. Error 1306 is emitted for mismatched CUDA memory spaces:
Error 1306: CUDA memory space mismatch on redeclaration
The comparison tests byte +148 of both the existing entity and the new declaration's computed attributes. The CUDA memory space acts as an implicit storage class -- storage class value 5 in the declaration state (offset 269) indicates a CUDA-specific storage class that requires special scope-walking behavior.
String Table Usage
The memory space keywords appear in the binary's string table and are referenced by error message formatting code:
| String | Usage |
|---|---|
| "__constant__" | Error messages for __constant__ constraints, space name display |
| "__managed__" | Error messages for __managed__ constraints |
| "__device__" | Obtained via "__host__ __device__" + 9 (pointer arithmetic), or direct literal |
| "__shared__" | Error messages for __shared__ constraints |
| "__host__ __device__" | Combined string; +9 yields "__device__" |
The pointer-arithmetic trick for "__device__" appears in both sub_4DEC90 (variable_declaration) and error message formatting throughout the attribute handlers. It saves binary space by reusing the combined "__host__ __device__" string constant.
Error Code Summary
Attribute Application Errors
| Error | Severity | Description |
|---|---|---|
| 3481 | Error | Conflicting CUDA memory spaces (__shared__ + __constant__ simultaneously) |
| 3482 | Error | CUDA memory space attribute on thread_local variable |
| 3485 | Error | CUDA memory space attribute on local variable |
| 3577 | Error | Memory space incompatible with __grid_constant__ parameter |
Declaration Processing Errors
| Error | Severity | Description |
|---|---|---|
| 149 | Error | Illegal CUDA storage class at namespace scope |
| 892 | Error | auto type with __constant__ variable |
| 893 | Error | auto type with CUDA memory space variable |
| 1306 | Error | CUDA memory space mismatch on redeclaration |
| 3483 | Error | Memory space qualifier on automatic/static variable in non-device function |
| 3510 | Error | __shared__ variable with variable-length array |
| 3566 | Error | __constant__ with constexpr and auto deduction |
| 3567 | Error | CUDA variable with VLA type |
| 3568 | Error | __constant__ combined with constexpr |
| 3578 | Error | CUDA attribute in constexpr-if discarded branch |
| 3579 | Error | CUDA attribute at namespace scope with structured binding |
| 3580 | Error | CUDA attribute on variable-length array |
| 3648 | Warning | Device-accessible (non-shared) variable with external linkage |
Cross-Space Reference Errors
| Error | Severity | Description |
|---|---|---|
| 3486 | Error | Illegal device variable reference from host (operator function context) |
| 3548 | Error | Reference to __shared__ / __constant__ variable from host code |
| 3549 | Error | Initializer referencing device memory variable from host |
| 3550 | Error | Write to __shared__ variable from host code |
| 3623 | Error | Use of __device__-only function outside device context |
Global State Variables
| Variable | Type | Description |
|---|---|---|
| dword_126EFA8 | int | CUDA mode flag (nonzero when compiling CUDA) |
| dword_126EFB4 | int | CUDA dialect (2 = CUDA C++) |
| dword_126EFAC | int | Extended CUDA features flag |
| dword_126EFA4 | int | CUDA version-check control |
| qword_126EF98 | int64 | CUDA version threshold (hex: 0x9E97 = 40599, 0x9D6C, etc.) |
| qword_126EF90 | int64 | CUDA version threshold (hex: 0x78B3 = 30899 for compute_30) |
| dword_106BFD0 | int | Enable cross-space reference checking (primary) |
| dword_106BFCC | int | Enable cross-space reference checking (secondary) |
| dword_106BF40 | int | Allow __device__ function refs in host (--expt-relaxed-constexpr) |
| dword_106BFF0 | int | Relaxed execution space mode (permits otherwise-illegal combos) |
| qword_126EB70 | ptr | Entity pointer for main() (prevents __device__ on main) |
| qword_126C5E8 | ptr | Scope stack base pointer (784-byte entries) |
| dword_126C5E4 | int | Current scope stack top index |
| dword_126C5D8 | int | Current function scope index (-1 if none) |
Function Map
| Address | Identity | Size | Source |
|---|---|---|---|
| sub_40AD80 | apply_nv_weak_odr_attr | 0.2 KB | attribute.c:10497 |
| sub_40E0D0 | apply_nv_managed_attr | 0.4 KB | attribute.c:10523 |
| sub_40E1F0 | apply_nv_global_attr (variant 1) | 0.9 KB | attribute.c |
| sub_40E7F0 | apply_nv_global_attr (variant 2) | 0.9 KB | attribute.c |
| sub_40EB80 | apply_nv_device_attr | 1.0 KB | attribute.c |
| sub_4108E0 | apply_nv_host_attr | 0.3 KB | attribute.c |
| sub_413240 | apply_one_attribute (dispatch) | 5.9 KB | attribute.c |
| sub_413ED0 | apply_attributes_to_entity | 4.9 KB | attribute.c |
| sub_40A310 | attribute_display_name | 0.6 KB | attribute.c:1307 |
| sub_4CA6C0 | decl_variable | 11 KB | decls.c:7730 |
| sub_4CC150 | cuda_variable_fixup | 1.2 KB | decls.c:20654 |
| sub_4DC200 | mark_defined_variable | 0.3 KB | decls.c |
| sub_4DEC90 | variable_declaration | 11 KB | decls.c:12956 |
| sub_6BC890 | nv_validate_cuda_attributes | 1.6 KB | nv_transforms.c |
| sub_6BCF10 | nv_check_device_variable_in_host | 0.2 KB | nv_transforms.c |
| sub_6BCF80 | nv_emit_host_reference_array | 0.8 KB | nv_transforms.c |
| sub_72A650 | record_symbol_reference_full (6-arg) | 6.6 KB | symbol_ref.c |
| sub_72B510 | record_symbol_reference_full (4-arg) | 7.3 KB | symbol_ref.c |
See Also
- Execution Spaces -- function-level __host__ / __device__ / __global__ encoding at entity +182
- Cross-Space Call Validation -- full cross-space call checking algorithm
- Entity Node Layout -- complete entity node offset map
- __managed__ Variables -- __managed__ attribute system details
- __grid_constant__ -- __grid_constant__ parameter attribute
- Host Reference Arrays -- runtime device symbol discovery
Cross-Space Call Validation
CUDA's execution model partitions code into host (CPU) and device (GPU) worlds. A function in one execution space cannot directly call a function in the other -- a __host__ function cannot call a __device__ function, and vice versa. cudafe++ enforces these rules at two points during compilation: at explicit call sites in expressions (expr.c) and at symbol reference recording time (symbol_ref.c). Together these checks cover both direct function calls and indirect references -- variable accesses, implicit constructor/destructor invocations, and template-instantiated calls. The validation produces 12 distinct calling error messages (6 normal + 6 constexpr-with-suggestion variants), plus 4 variable access errors and 1 device-only function reference error.
Key Facts
| Property | Value |
|---|---|
| Source files | expr.c (call site checks), symbol_ref.c (reference-time checks), class_decl.c (type hierarchy walk), nv_transforms.c (helpers) |
| Call-site checker | sub_505720 (check_cross_execution_space_call, 4.0 KB) |
| Template variant | sub_505B40 (check_cross_space_call_in_template, 2.7 KB) |
| Reference checker | sub_72A650 (record_symbol_reference_full, 6-arg, 659 lines) |
| Reference checker (short) | sub_72B510 (record_symbol_reference_full, 4-arg, 732 lines) |
| Type hierarchy walker | sub_41A1F0 (annotation helper, walks nested types for HD violations) |
| Type hierarchy entry | sub_41A3E0 (validates lambda/class HD annotation, calls sub_41A1F0) |
| Space name helper | sub_6BC6B0 (get_entity_display_name, 49 lines) |
| Trivial-device-copyable | sub_6BC680 (is_device_or_extended_device_lambda, 16 lines) |
| Device ref expression walker | sub_6BE330 (nv_scan_expression_for_device_refs, 89 lines) |
| Diagnostic emission | sub_4F7450 (multi-arg diagnostic), sub_4F8090 (type+entity diagnostic) |
| Calling errors | 3462, 3463, 3464, 3465, 3508 |
| Variable access errors | 3548, 3549, 3550, 3486 |
| Device-only function ref | 3623 |
| Type annotation errors | 3593, 3594, 3597, 3598, 3599, 3615, 3635, 3691 |
| Cross-space enable flag | dword_106BFD0 (primary), dword_106BFCC (secondary) |
| Device ref relaxation | dword_106BF40 (allow __device__ function refs in host) |
| Relaxed constexpr flag | dword_126EFB0 (also referenced as CLI flag 104) |
Execution Space Recall
The execution space is encoded at byte offset +182 of the entity (routine) node. The two-bit extraction byte & 0x30 classifies the routine:
| byte & 0x30 | Space | Meaning |
|---|---|---|
| 0x00 | (none) | Implicit __host__ |
| 0x10 | __host__ | Explicit host-only |
| 0x20 | __device__ | Device-only |
| 0x30 | __host__ __device__ | Both spaces |
The 0x60 mask distinguishes __global__ kernels: (byte & 0x60) == 0x20 means plain __device__, while byte & 0x40 set means __global__.
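The two masks can be expressed as small predicates over byte +182. This is an illustrative sketch; the enum and function names are not from the binary, and the functions take the byte value directly rather than an entity pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum exec_space {
    EXEC_IMPLICIT_HOST,   /* 0x00 */
    EXEC_HOST_ONLY,       /* 0x10 */
    EXEC_DEVICE_ONLY,     /* 0x20 */
    EXEC_HOST_DEVICE      /* 0x30 */
};

/* Classifies a routine by the two-bit field selected with mask 0x30. */
enum exec_space classify_exec_space(uint8_t byte_182)
{
    switch (byte_182 & 0x30) {
    case 0x00: return EXEC_IMPLICIT_HOST;
    case 0x10: return EXEC_HOST_ONLY;
    case 0x20: return EXEC_DEVICE_ONLY;
    case 0x30: return EXEC_HOST_DEVICE;
    default:
        assert(!"unreachable: mask 0x30 has only four values");
        return EXEC_IMPLICIT_HOST;
    }
}

/* Bit 0x40 marks a __global__ kernel. */
bool is_global_kernel(uint8_t byte_182)
{
    return (byte_182 & 0x40) != 0;
}

/* The 0x60 test: device bit set, __global__ bit clear. It does not look at
 * bit 0x10, so the value 0x30 (__host__ __device__) also satisfies it. */
bool device_bit_without_global(uint8_t byte_182)
{
    return (byte_182 & 0x60) == 0x20;
}
```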
Additional flags at byte +177 encode secondary space information:
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x01 | __host__ annotation present |
| 1 | 0x02 | __device__ annotation present |
| 2 | 0x04 | constexpr device |
| 4 | 0x10 | implicitly HD / __forceinline__ relaxation |
The +177 & 0x10 bit is the critical bypass: when set, the function is treated as implicitly __host__ __device__ and exempt from cross-space checks. This covers constexpr functions (which are implicitly HD since CUDA 7.5) and __forceinline__ functions (which the compiler may allow to be instantiated in either space).
The Implicitly-HD Bypass
Before any cross-space error is emitted, both the caller and callee are tested for the implicitly-HD condition. The exact binary test is:
// Implicitly-HD check (appears in both sub_505720 and sub_505B40)
// entity: pointer to routine entity node
bool is_implicitly_hd(int64_t entity) {
// Check 1: bit 0x10 at +177 (constexpr/forceinline HD)
if ((*(uint8_t*)(entity + 177) & 0x10) != 0)
return true;
// Check 2: deleted function with specific annotation combo
// +184 is an 8-byte extended flags field
// 0x800000000000 = deleted bit, 0x1000000 = explicit annotation
// If deleted but NOT explicitly annotated, AND byte+176 bit 1 is clear:
if ((*(uint64_t*)(entity + 184) & 0x800001000000LL) == 0x800000000000LL
&& (*(uint8_t*)(entity + 176) & 2) == 0)
return true;
return false;
}
This means:
- constexpr functions -- the +177 & 0x10 bit is set during attribute processing, making them callable from both host and device code without explicit annotation.
- __forceinline__ functions -- same bit, allowing cross-space inlining.
- Implicitly-deleted functions -- defaulted special members (constructors, destructors, assignment operators) that are deleted due to non-copyable members. These get a pass because they will never actually be called.
If either the caller or the callee is implicitly HD, the cross-space check returns immediately without error.
Call-Site Validation: sub_505720
check_cross_execution_space_call is called during expression scanning in scan_expr_full whenever a function call expression is processed. It takes three parameters:
// sub_505720 -- check_cross_execution_space_call
// a1: entity pointer of the callee function (may be NULL)
// a2: bool -- if true, this is a "must be callable" context (__global__ launch)
// a3: source location pointer for diagnostics
// returns: char (nonzero if diagnostic was emitted)
char check_cross_execution_space_call(int64_t callee, bool must_callable, uint64_t *src_loc);
Algorithm
The function follows a multi-stage gate structure. At each gate, an early return can skip the check entirely:
Gate 1 -- Class scope suppression. If we are inside a class definition scope (dword_126C5C8 != -1) and the current scope has device-scope flags set (scope_entry[6] & 0x06), AND we are inside a type node context (dword_106B670 != -1, type_entry[5] & 0x08), the check is suppressed. This allows member function declarations inside device classes to reference host functions without error -- the actual check happens when the member is instantiated/defined.
Gate 2 -- Diagnostic suppression scope. If the current scope entry has diagnostic-suppression bit 1 of byte +14 set (scope_entry[14] & 0x02), checks are suppressed. This covers SFINAE contexts and decltype evaluation.
Gate 3 -- Concept/requires context. If the current context pointer (qword_106B970) is non-null and byte +17 has bit 1 set (strict-mode or concept context), checks are suppressed.
Gate 4 -- No enclosing function. If dword_126C5D8 == -1 (no enclosing function scope), the caller space defaults to host-only (v7=0, v8=1) -- meaning we are at file scope, which is implicitly host.
Gate 5 -- Extract caller space. The enclosing function entity is retrieved from the scope stack at qword_126C5E8 + 784 * dword_126C5D8 + 224. Its execution space is extracted:
- v7 = (caller[182] & 0x60) == 0x20 -- caller is host-only
- v8 = (caller[182] & 0x30) != 0x20 -- caller is NOT device-only
- v5 = (caller[-8] & 0x10) != 0 -- caller has secondary device mark (the -8 offset reads a flags byte 8 bytes before the entity, in the preceding allocation header)
Gate 6 -- Caller implicitly HD. The caller is tested for implicitly-HD status. If true, return immediately.
Gate 7 -- Callee implicitly HD. The callee (parameter a1) is tested for implicitly-HD status. If true, return immediately.
Gate 8 -- No caller entity or secondary device. If no caller entity exists or the secondary device flag is set, skip to the __global__ check.
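The scope-stack lookup used by Gates 4 and 5 is plain address arithmetic over 784-byte entries. The sketch below models it over a caller-supplied byte buffer; the function name and the two macros are illustrative, and the globals qword_126C5E8 and dword_126C5D8 are passed in as parameters rather than read directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCOPE_ENTRY_SIZE     784  /* bytes per scope stack entry */
#define SCOPE_ROUTINE_OFFSET 224  /* slot holding the routine entity pointer */

/* Returns the enclosing routine entity, or NULL at file scope (index -1).
 * scope_base plays the role of qword_126C5E8, func_scope_index of
 * dword_126C5D8. */
const uint8_t *enclosing_routine(const uint8_t *scope_base, int func_scope_index)
{
    assert(scope_base != NULL);
    if (func_scope_index == -1)
        return NULL;              /* Gate 4: no enclosing function */
    assert(func_scope_index >= 0);
    const uint8_t *slot = scope_base
                        + (size_t)SCOPE_ENTRY_SIZE * (size_t)func_scope_index
                        + SCOPE_ROUTINE_OFFSET;
    const uint8_t *entity;
    memcpy(&entity, slot, sizeof entity);  /* read the stored pointer */
    return entity;
}
```

Once the entity pointer is known, the space bits are read from its byte +182 exactly as in the list above.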
Error Decision Logic
After passing all gates, the function computes which error to emit based on caller/callee space combination:
// Pseudocode for the error decision tree
bool callee_is_not_device = (callee[182] & 0x30) != 0x20; // v3
bool callee_is_host_only = (callee[182] & 0x60) == 0x20; // v4
bool callee_is_global = (callee[182] & 0x40) != 0; // v11 in some paths
bool caller_is_host_only = (caller[182] & 0x60) == 0x20; // v7
bool caller_not_device = (caller[182] & 0x30) != 0x20; // v8
bool has_forceinline = (caller[181] & 0x20) != 0;
if (caller_is_host_only && caller_not_device) {
// Caller is __host__ __device__ (both flags set)
if (has_forceinline || callee_is_not_device || !callee_is_host_only)
goto global_check;
// HD caller calling host-only callee
if (!is_device_or_extended_lambda(callee)) {
char *caller_name = get_entity_display_name(caller, 0);
char *callee_name = get_entity_display_name(callee, 1);
int errcode = 3462 + ((callee[177] & 0x02) != 0); // 3462 or 3463
emit_diagnostic(errcode, src_loc, callee_name, caller_name);
}
} else if (caller_not_device) {
// Caller is host-only, callee is device-only
if (has_forceinline || callee_is_not_device || !callee_is_host_only)
goto global_check;
// Check relaxed-constexpr bypass
if ((callee[177] & 0x02) != 0 && dword_106BF40) {
// Callee has __device__ annotation AND relaxation flag is set
if (must_callable && !callee_is_global)
goto global_check; // suppress for __global__ must-call context
// else suppress entirely
}
// Check constexpr-device bypass
if ((callee[177] & 0x04) != 0)
goto global_check; // constexpr device functions get a pass
// Host caller calling device-only callee
char *caller_name = get_entity_display_name(caller, 0);
char *callee_name = get_entity_display_name(callee, 1);
int errcode = 3465 - ((callee[177] & 0x02) == 0); // 3464 or 3465
emit_diagnostic(errcode, src_loc, callee_name, caller_name);
}
global_check:
if (must_callable && !callee_is_global) {
// must_callable is true but callee is not __global__
// (this path is for __global__ launch checks)
// no error here -- fall through
} else if (!must_callable && callee_is_global) {
// __global__ function called from wrong context
if (callee_is_host_only) {
// __global__ called from host-only -- "cannot be called from host"
emit_diagnostic(3508, src_loc, "host", "cannot");
} else if (!callee_is_host_only) {
// __global__ called from __device__ context
emit_diagnostic(3508, src_loc, "__device__", "cannot");
}
} else if (must_callable || !callee_is_global) {
return; // no __global__ issue
} else {
emit_diagnostic(3508, src_loc, "__global__", "must");
}
Error 3462 vs 3463 (Device-from-Host Direction)
The distinction between errors 3462 and 3463 is the +177 & 0x02 bit on the callee -- whether it has an explicit __device__ annotation:
- 3462: __device__ function called from __host__ context. The callee has no explicit __device__ annotation (it was implicitly device-only).
- 3463: Same violation, but the callee has explicit __device__ annotation. The error message includes an additional note about the __host__ __device__ context.
The computation: 3462 + ((callee[177] & 0x02) != 0) yields 3462 when the bit is clear, 3463 when set.
Error 3464 vs 3465 (Host-from-Device Direction)
Similarly for the reverse direction:
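- 3464: __host__ function called from __device__ context. Selected when the callee's +177 & 0x02 bit is clear (no explicit __device__ annotation), because the comparison then evaluates to 1 and is subtracted from 3465.
- 3465: Same violation. Selected when the bit is set (explicit __device__ annotation), because nothing is subtracted.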
The computation: 3465 - ((callee[177] & 0x02) == 0) yields 3464 when the bit is clear, 3465 when set.
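Both selections rely on a C comparison evaluating to 0 or 1. A small sketch makes the mapping explicit; the function names are hypothetical and each takes the callee's byte +177 as a plain value.

```c
#include <assert.h>
#include <stdint.h>

/* 3462 + (bit set): 3462 when +177 bit 1 is clear, 3463 when it is set. */
int select_3462_or_3463(uint8_t callee_byte_177)
{
    int code = 3462 + ((callee_byte_177 & 0x02) != 0);
    assert(code == 3462 || code == 3463);
    return code;
}

/* 3465 - (bit clear): 3464 when +177 bit 1 is clear, 3465 when it is set. */
int select_3464_or_3465(uint8_t callee_byte_177)
{
    int code = 3465 - ((callee_byte_177 & 0x02) == 0);
    assert(code == 3464 || code == 3465);
    return code;
}
```

In both pairs the lower error number corresponds to the bit being clear and the higher one to the bit being set.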
Error 3508 (__global__ Misuse)
Error 3508 is a parameterized error with two string arguments: the context string and the verb. The combinations are:
| Context | Verb | Meaning |
|---|---|---|
| "host" | "cannot" | __global__ function cannot be called from __host__ code directly (must use <<<>>>) |
| "__device__" | "cannot" | __global__ function cannot be called from __device__ code |
| "__host__ __device__" + 9 = "__device__" | "cannot" | Same, from HD context with device focus |
| "__global__" | "must" | A __global__ function must be called with <<<>>> syntax |
Template Variant: sub_505B40
check_cross_space_call_in_template performs the same validation but is called during template instantiation rather than initial expression scanning. It differs from sub_505720 in five ways:
- Guard on dword_126C5C4 == -1: only runs when no nested class scope is active. If dword_126C5C4 != -1, the entire function is skipped -- template instantiation inside nested class definitions defers cross-space checks.
- Additional scope guards: checks scope_entry[4] != 12 (not a namespace scope) and qword_106B970 + 17 & 0x40 == 0 (not in a concept context). These prevent false positives during dependent name resolution.
- No return value: returns void instead of char. It only emits diagnostics; it does not report whether a diagnostic was emitted.
- Error code selection: uses 3463 - ((callee[177] & 0x02) == 0) for the HD-caller case (yielding 3462 or 3463), and 3465 - ((callee[177] & 0x02) == 0) for the host-caller case (yielding 3464 or 3465). The __global__ error always uses the "must" verb.
- No must_callable parameter: the template variant does not handle the must/cannot distinction for __global__. It always emits 3508 with "__global__" and "must" if the callee is __global__.
Complete Calling Error Matrix
The following matrix shows which errors fire for each caller/callee space combination:
| Caller \ Callee | __host__ | __device__ | __host__ __device__ | __global__ |
|---|---|---|---|---|
| __host__ (explicit) | OK | 3464 or 3465 | OK | 3508 ("must") |
| __device__ | 3462 or 3463 | OK | OK | 3508 ("cannot") |
| __host__ __device__ | OK | 3462 or 3463 | OK | 3508 |
| (no annotation) = host | OK | 3464 or 3465 | OK | 3508 ("must") |
| __global__ | OK | OK | OK | 3508 ("cannot") |
Entries marked "OK" pass the cross-space check without error. The specific error (3462 vs 3463, 3464 vs 3465) depends on whether the callee has the +177 & 0x02 bit (explicit __device__ annotation).
Bypass Conditions (No Error Despite Mismatch)
Even when the matrix says an error should fire, the following conditions suppress it:
- Caller or callee is implicitly HD (+177 & 0x10): constexpr functions, __forceinline__ functions, implicitly-deleted special members.
- Caller has __forceinline__ relaxation (+181 & 0x20): the caller has a __forceinline__ attribute that relaxes cross-space restrictions.
- Callee is a device lambda that passes trivial-device-copyable check (sub_6BC680 returns true): extended lambda optimization.
- Callee has constexpr-device flag (+177 & 0x04): constexpr functions marked for device use.
- dword_106BF40 is set and callee has explicit __device__ (+177 & 0x02): the --expt-relaxed-constexpr or similar flag allows device function references from host code.
- Current scope has diagnostic suppression (scope_entry[14] & 0x02): SFINAE context.
- Concept/requires context (qword_106B970 + 17 & 0x40).
The 12 Calling Error Messages
cudafe++ emits 6 base error messages for cross-space call violations. Each has a variant that adds a --expt-relaxed-constexpr suggestion when the callee is a constexpr function, yielding 12 total messages:
| Error | Direction | Context | Suggestion? |
|---|---|---|---|
| 3462 | device called from host | Callee lacks explicit __device__ | No |
| 3463 | device called from HD | Callee has explicit __device__ (HD context note) | No |
| 3464 | host called from device | Callee lacks explicit __device__ (+177 & 0x02 clear, so 1 is subtracted from 3465) | No |
| 3465 | host called from device | Callee has explicit __device__ (+177 & 0x02 set) | No |
| 3508 | __global__ context error | Parameterized: "must" / "cannot" + space string | No |
| 3462+constexpr | device called from host | constexpr callee | Yes: --expt-relaxed-constexpr |
| 3463+constexpr | device called from HD | constexpr callee | Yes |
| 3464+constexpr | host called from device | constexpr callee | Yes |
| 3465+constexpr | host called from device | constexpr callee | Yes |
| 3508+constexpr | __global__ context | constexpr callee | Yes |
The constexpr suggestion variants are selected by the relaxed-constexpr flag state. When dword_106BF40 (the --expt-relaxed-constexpr relaxation flag) is NOT set and the callee has constexpr annotations, the error message includes a note suggesting the flag to resolve the issue.
Variable Access Validation: symbol_ref.c
The record_symbol_reference_full functions (sub_72A650 / sub_72B510) enforce cross-space rules at the symbol reference level. This is a different check point than the call-site checker -- it catches variable accesses and implicit function references that are not explicit function calls.
Reference Kind Bitmask (Parameter a1)
The first parameter encodes the kind of reference being made:
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x01 | Address reference (&var) |
| 1 | 0x02 | Write reference (assignment target) |
| 2 | 0x04 | Non-modifying reference (read) |
| 3 | 0x08 | Direct use |
| 4 | 0x10 | Initializer |
| 5 | 0x20 | Potential modification |
| 6 | 0x40 | Move reference |
| 10 | 0x400 | Template argument |
| 13 | 0x2000 | ODR-use |
| 15 | 0x8000 | Negative offset |
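Because the parameter is a plain bitmask, several reference kinds can be combined in one call. The sketch below names the masks from the table and decodes a value into a readable string. The enumerator names and the helper describe_ref_kind are illustrative choices for this page, not identifiers recovered from the binary.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Mask values come from the table above; the names are hypothetical. */
enum ref_kind {
    REF_ADDRESS         = 0x0001,
    REF_WRITE           = 0x0002,
    REF_READ            = 0x0004,
    REF_DIRECT_USE      = 0x0008,
    REF_INITIALIZER     = 0x0010,
    REF_POTENTIAL_MOD   = 0x0020,
    REF_MOVE            = 0x0040,
    REF_TEMPLATE_ARG    = 0x0400,
    REF_ODR_USE         = 0x2000,
    REF_NEGATIVE_OFFSET = 0x8000
};

struct ref_name { uint32_t mask; const char *name; };

static const struct ref_name REF_NAMES[] = {
    { REF_ADDRESS, "address" },           { REF_WRITE, "write" },
    { REF_READ, "read" },                 { REF_DIRECT_USE, "direct_use" },
    { REF_INITIALIZER, "initializer" },   { REF_POTENTIAL_MOD, "potential_mod" },
    { REF_MOVE, "move" },                 { REF_TEMPLATE_ARG, "template_arg" },
    { REF_ODR_USE, "odr_use" },           { REF_NEGATIVE_OFFSET, "negative_offset" }
};

/* Writes a '|'-separated list of the set flags into buf and returns buf.
 * Output is truncated (but still NUL-terminated) if buf is too small. */
static char *describe_ref_kind(uint32_t kind, char *buf, size_t cap)
{
    size_t used = 0;
    assert(buf != NULL && cap > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof REF_NAMES / sizeof REF_NAMES[0]; ++i) {
        if (!(kind & REF_NAMES[i].mask))
            continue;
        int n = snprintf(buf + used, cap - used, "%s%s",
                         used ? "|" : "", REF_NAMES[i].name);
        if (n < 0 || (size_t)n >= cap - used)
            break;                      /* ran out of space */
        used += (size_t)n;
    }
    return buf;
}
```

For instance, 0x18 decodes to direct_use|initializer, which is the combination that reaches both the 0x08 and the 0x10 tests in the variable check that follows.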
Device Variable in Host Code (Errors 3548-3550)
When the referenced symbol is a variable (kind == 7), record_symbol_reference_full checks whether a __shared__, __constant__, or __managed__ variable is accessed from host code:
// nv_check_device_var_ref_in_host (inlined in sub_72A650, lines ~330-374)
// var_info: *(entity + 88) -- pointer to variable descriptor
uint64_t caller = 0;
if (dword_126C5D8 != -1)                       // enclosing function scope exists
    caller = *(qword_126C5E8 + 784 * dword_126C5D8 + 224);
uint8_t mem_flags = *(var_info + 148);         // memory space flags
const char *space_name;
if (mem_flags & 0x02)
    space_name = "__shared__";
else if (mem_flags & 0x04)
    space_name = "__constant__";
else
    space_name = "";                           // no specific space (managed or other)
// Exemption: managed variables with bit 0x100 set are OK
if ((*(uint16_t*)(var_info + 148) & 0x0101) == 0x0101)
    return;                                    // managed + exemption flag
// Only check if: has device memory annotation, there is a caller,
// caller is NOT device-only, caller is not implicitly-HD
if ((ref_kind & 0x12040) == 0                  // not a transparent reference
    && (mem_flags & 0x07) != 0                 // has device memory annotation
    && caller != 0
    && (*(caller + 182) & 0x30) != 0x20        // caller NOT device-only
    && (*(caller + 177) & 0x10) == 0           // caller NOT implicitly HD
    && !is_implicitly_hd(caller))              // extended implicit-HD check
{
    if (ref_kind & 0x08)                       // direct use
        emit_diag(3548, src_loc, space_name, entity);  // "reference to __shared__"
    if (ref_kind & 0x10)                       // initializer
        emit_diag(3549, src_loc, space_name, entity);  // "initializer for __constant__"
    if ((mem_flags & 0x02) && (ref_kind & 0x20))       // __shared__ + write
        emit_diag(3550, src_loc, space_name, entity);  // "write to __shared__"
}
| Error | Condition | Message |
|---|---|---|
| 3548 | Direct use of __shared__/__constant__ variable from host | Reference to device memory variable from host code |
| 3549 | Initializer referencing __shared__/__constant__ from host | Cannot initialize from host |
| 3550 | Write to __shared__ variable from host | Cannot write to shared memory from host |
Device-Only Function Reference (Error 3623)
For function-type symbols (kind 10 or 11, or concept kind 20), the check validates that __device__-only functions are not referenced from host code:
// nv_check_device_function_ref_in_host (inlined in sub_72A650, lines ~382-454)
// entity: the function being referenced
// entity + 88 -> routine info (for kind 10/11)
// entity + 88 -> +192 for concepts (kind 20)
int64_t routine_info = ...;  // resolve through type chain
if (routine_info == 0)
    return;
// Only check if: has device annotation, is device-only,
// has no implicit-HD flags
if ((*(routine_info + 191) & 0x01) == 0        // not a coroutine exemption
    || (*(routine_info + 182) & 0x30) != 0x20  // not device-only
    || (*(routine_info + 177) & 0x15) != 0)    // has HD/host/constexpr flags
    return;
// Check if already exempted by extended flags
if (is_implicitly_hd(routine_info))
    return;
// Determine caller context.
// Default string: also used when there is no enclosing routine at all.
const char *context = "outside the bodies of device functions";
int64_t caller_routine = 0;
if (dword_126C5D8 != -1) {
    caller_routine = *(qword_126C5E8 + 784 * dword_126C5D8 + 224);
} else if (dword_126C5B8) {
    // Walk scope stack to find enclosing try block
    int64_t entry = 0;                         // declared here so it survives the loop
    int scope_idx = dword_126C5E4;
    while (scope_idx != -1) {
        entry = qword_126C5E8 + 784 * scope_idx;
        if (*(int32_t*)(entry + 408) != -1)    // has try block
            break;
        scope_idx = *(int32_t*)(entry + 560);  // parent scope
    }
    if (scope_idx == -1) return;
    caller_routine = *(entry + 224);
}
if (caller_routine == 0) goto emit_outside;
if (is_implicitly_hd(caller_routine)) return;
if ((*(caller_routine + 182) & 0x30) == 0x20) {
    // Caller is __device__-only
    if ((*(caller_routine + 177) & 0x05) == 0)
        return;                                // no constexpr/consteval markers
    context = "from a constexpr or consteval __device__ function";
}
emit_outside:
const char *name = *(routine_info + 8);        // function name
if (!name) name = "";
emit_diagnostic(3623, src_loc, name, context);
Error 3623 has two context strings:
- "outside the bodies of device functions" -- the reference is from file scope or host code
- "from a constexpr or consteval __device__ function" -- the reference is from a constexpr/consteval device function that cannot actually call the target
The dword_106BFD0 / dword_106BFCC Gate
Both record_symbol_reference_full variants gate the cross-space device-reference scan (sub_6BE330) with:
if (dword_106BFD0 || dword_106BFCC) {
    // Cross-space reference checking is enabled
    if (!qword_126C5D0                                       // no current routine descriptor
        || *(qword_126C5D0 + 32) == 0                        // no routine entity
        || (*(*(qword_126C5D0 + 32) + 182) & 0x30) != 0x20   // not device-only
        || (dword_106BF40 && (*(*(qword_126C5D0 + 32) + 177) & 0x02) != 0))
    {
        // Call sub_6BE330 to walk expression tree for device references
        nv_scan_expression_for_device_refs(entity);
    }
}
The scan is skipped when the current routine is __device__-only, since device code referencing other device symbols is always valid. The dword_106BF40 term is the exception: because it is OR-ed into the condition, the scan still runs for a device-only routine when the flag is set and the routine has an explicit __device__ annotation (+177 & 0x02).
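The gate condition can be restated as a standalone predicate, which makes the four OR-ed terms easier to exercise one at a time. This is a sketch under assumptions: the struct layouts and the name should_scan_for_device_refs are hypothetical stand-ins for the raw offsets in the pseudocode above, and the logic transcribes that pseudocode literally rather than adding anything recovered separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two entity bytes the gate reads. */
struct routine_entity {
    uint8_t byte_177;   /* bit 0x02: explicit __device__ */
    uint8_t byte_182;   /* (byte & 0x30) == 0x20: device-only */
};

/* Models qword_126C5D0; `entity` models the pointer at +32. */
struct routine_descriptor {
    const struct routine_entity *entity;
};

struct gate_flags {
    int primary;     /* models dword_106BFD0 */
    int secondary;   /* models dword_106BFCC */
    int relaxed;     /* models dword_106BF40 */
};

/* Returns true when the expression scan (sub_6BE330) would be invoked. */
static bool should_scan_for_device_refs(const struct gate_flags *g,
                                        const struct routine_descriptor *cur)
{
    assert(g != NULL);
    if (!g->primary && !g->secondary)
        return false;                           /* checking disabled */
    if (cur == NULL || cur->entity == NULL)
        return true;                            /* terms 1 and 2 */
    if ((cur->entity->byte_182 & 0x30) != 0x20)
        return true;                            /* term 3: not device-only */
    /* term 4: relaxation flag plus explicit __device__ */
    return g->relaxed && (cur->entity->byte_177 & 0x02) != 0;
}
```

Keeping each term on its own line mirrors the short-circuit order of the original expression, so a null descriptor is never dereferenced.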
Type Hierarchy Walk: sub_41A1F0 / sub_41A3E0
The type hierarchy walkers handle a different class of violation: when a __host__ __device__ or __device__ annotation is applied to a class or lambda whose member types contain HD-incompatible nested types. These functions live in class_decl.c and are called during class completion.
sub_41A3E0 (Entry Point)
This function validates a complete type annotation context. It receives a lambda/class info structure and checks multiple conditions:
// sub_41A3E0 -- validate_type_hd_annotation
// a1: type annotation context structure
//   +8:  entity pointer
//   +32: flags byte (bit 0 = has_host, bit 3 = has_conflict, bit 4 = has_device,
//        bit 5 = has_virtual)
//   +36: source location
// a2: 0 = __host__ __device__, nonzero = __device__ only
// a3: enable additional nested check (for OptiX path)
char *space_name = (a2 == 0) ? "__host__ __device__" : "__device__";
// Error 3615: duplicate HD annotation conflict
if (a2 == 0 && (flags & 0x01))
    emit_diag(3615, src_loc);
// Error 3593: conflict between __host__ and __device__ on type
if (flags & 0x08) {
    if (entity && entity[163] < 0) {           // entity has device-negative flag
        if ((flags & 0x18) != 0x18)
            goto check_members;
        emit_diag(3635, src_loc);              // both __host__ and __device__ + conflict
    } else {
        emit_diag(3593, src_loc, space_name);
    }
}
// Error 3594: virtual function in __device__ context
if (flags & 0x20 || ...)
    emit_diag(3594, src_loc, space_name);
check_members:                                 // label placement assumed: target of the goto above
// Recurse into member types
walk_type_for_hd_violations(type_entry, src_loc, a2);  // sub_41A1F0
// Error 3691: nested OptiX check
if (a3 && (flags & 0x10))
    emit_diag(3691, src_loc, space_name);
sub_41A1F0 (Recursive Type Walker)
This function walks the type hierarchy to find nested violations. It uses sub_7A8370 (is-array-type check) and sub_7A9310 (get-array-element-type) to traverse through arrays, and walks through cv-qualified type wrappers (kind == 12) by following the +144 pointer chain.
// sub_41A1F0 -- walk_type_for_hd_violations (recursive)
// a1: type node pointer
// a2: source location pointer
// a3: 0 = HD mode, nonzero = device-only mode
char *space_name = (a3) ? "__device__" : "__host__ __device__";
if (a1 == 0 || !is_valid_type(a1)) {           // null test first, then validity
    // Base case: no type to check, or check passed at top level
    goto label_20;
}
int depth = 0;
int64_t current = a1;
do {
    if (!is_array_type(current)) {             // sub_7A8370
        // Not an array -- check this type for violations
        if (depth > 7)
            emit_diag(3597, src_loc, space_name, a1);  // nesting depth exceeded
        // Walk through cv-qualified wrappers
        while (*(current + 132) == 12)         // cv-qual kind
            current = *(current + 144);        // underlying type
        // Guard: skip if in nested class scope
        if (dword_126C5C4 != -1)
            return;
        if ((scope_entry[6] & 0x06) != 0)
            return;
        if (scope_entry[4] == 12)              // namespace scope
            goto walk_callback;
        // Error 3598: type not valid in device context
        if (!check_type_valid_for_space(30, current, 0))   // sub_550E50
            emit_diag(3598, src_loc, space_name, current);
        // Error 3599: type has problematic member
        int64_t display = get_type_display_name(current);  // sub_5BD540
        if (!check_member_compat(60, display, current))    // sub_510860
            emit_diag(3599, src_loc, space_name, current);
        goto label_20;
    }
    ++depth;
    current = get_array_element_type(current); // sub_7A9310
} while (current != 0);
label_20:
// Final phase: walk_tree with callback sub_41B420
if (dword_126C5C4 != -1) return;
if ((scope_entry[6] & 0x06) != 0) return;
if (scope_entry[4] == 12) return;
// Save/restore diagnostic state
saved_state = qword_126EDE8;
qword_126EDE8 = *src_loc;
dword_E7FE78 = 0;
walk_tree(a1, sub_41B420, 792);                // sub_7B0B60 with callback
qword_126EDE8 = saved_state;
The callback sub_41B420 is used in the tree walk to check each nested type member. This is the same callback used for OptiX extended lambda body validation, applied to validate that all types referenced within the annotated scope are compatible with the target execution space.
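The array-peeling and wrapper-stripping steps at the top of the walker can be modeled in isolation. The following is a simplified sketch, not a transcription: type_node, peel_type, and the K_ARRAY kind value are hypothetical, while kind 12 for the cv-qualified wrapper and the depth limit of 7 are taken from the pseudocode above. The real function also emits diagnostics and consults scope state, which this model omits.

```c
#include <assert.h>
#include <stddef.h>

enum {
    K_ARRAY   = 1,    /* hypothetical: the binary asks sub_7A8370 instead */
    K_CLASS   = 9,    /* class kind, as used elsewhere on this page */
    K_TYPEREF = 12    /* cv-qualified wrapper kind from the pseudocode */
};

/* Hypothetical type node: `inner` is the element type for arrays and the
 * underlying type for wrappers (the +144 pointer in the binary). */
struct type_node {
    int kind;
    const struct type_node *inner;
};

struct peel_result {
    const struct type_node *leaf;  /* type left after peeling */
    int array_depth;               /* number of array layers removed */
    int too_deep;                  /* nonzero when error 3597 would fire */
};

static struct peel_result peel_type(const struct type_node *t)
{
    struct peel_result r = { NULL, 0, 0 };
    /* Step 1: strip array layers, counting them. */
    while (t != NULL && t->kind == K_ARRAY) {
        ++r.array_depth;
        t = t->inner;
    }
    assert(r.array_depth >= 0);
    r.too_deep = r.array_depth > 7;
    /* Step 2: follow cv-qualified wrappers down to the real type. */
    while (t != NULL && t->kind == K_TYPEREF)
        t = t->inner;
    r.leaf = t;
    return r;
}
```

A two-dimensional array of a const-qualified class therefore yields a depth of 2 and the class node as the leaf, which is the node the 3598/3599 checks would then inspect.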
Type Annotation Errors
| Error | Condition | Message |
|---|---|---|
| 3593 | Conflict between __host__ and __device__ on extended lambda/type | Cannot apply both annotations |
| 3594 | Virtual function in __device__ or HD context | Virtual dispatch not supported on device |
| 3597 | Type nesting depth exceeds 7 levels in HD validation | Type hierarchy too deep for device |
| 3598 | Nested type not valid in device context | Type X cannot be used in __device__ code |
| 3599 | Nested type member incompatible with device execution | Member of type X is not device-compatible |
| 3615 | Duplicate __host__ __device__ annotation | Already annotated as HD |
| 3635 | Both __host__ and __device__ annotations with negative device flag | Conflicting explicit annotations |
| 3691 | Nested OptiX annotation conflict | OptiX extended lambda nested check failure |
Global State Variables
| Global | Type | Purpose |
|---|---|---|
| qword_126C5E8 | int64_t | Scope stack base pointer (array of 784-byte entries) |
| dword_126C5E4 | int32_t | Current scope stack top index |
| dword_126C5D8 | int32_t | Current function scope index (-1 if none) |
| dword_126C5C8 | int32_t | Class scope index (-1 if none) |
| dword_126C5C4 | int32_t | Nested class scope (-1 if none) |
| dword_126C5B8 | int32_t | Is-member-of-template flag |
| qword_126C5D0 | int64_t | Current routine descriptor pointer |
| qword_106B970 | int64_t | Current compilation context |
| dword_106BFD0 | int32_t | Enable cross-space reference checking (primary) |
| dword_106BFCC | int32_t | Enable cross-space reference checking (secondary) |
| dword_106BF40 | int32_t | Allow __device__ function references in host |
| dword_106B670 | int32_t | Current type node context index (-1 if none) |
| qword_106B678 | int64_t | Type node table base pointer |
| dword_E7FE78 | int32_t | Diagnostic state flag (cleared during type walks) |
| qword_126EDE8 | int64_t | Saved diagnostic source position |
Function Map
| Address | Size | Identity | Source |
|---|---|---|---|
| sub_41A1F0 | ~0.5 KB | walk_type_for_hd_violations | class_decl.c |
| sub_41A3E0 | ~0.5 KB | validate_type_hd_annotation | class_decl.c |
| sub_41B420 | (callback) | Type walk callback for device compat | class_decl.c |
| sub_4F7450 | ~0.3 KB | emit_diag_multi_arg (cross-space diagnostics) | expr.c |
| sub_505720 | 4.0 KB | check_cross_execution_space_call | expr.c |
| sub_505AA0 | 0.8 KB | get_execution_space_string | expr.c |
| sub_505B40 | 2.7 KB | check_cross_space_call_in_template | expr.c |
| sub_6BC680 | 0.1 KB | is_device_or_extended_device_lambda | nv_transforms.c |
| sub_6BC6B0 | 0.5 KB | get_entity_display_name | nv_transforms.c |
| sub_6BE330 | 0.9 KB | nv_scan_expression_for_device_refs | nv_transforms.c |
| sub_72A650 | 6.6 KB | record_symbol_reference_full (6-arg) | symbol_ref.c |
| sub_72B510 | 7.3 KB | record_symbol_reference_full (4-arg) | symbol_ref.c |
Cross-References
- Execution Spaces -- the +182 byte encoding and attribute handlers
- Device/Host Separation -- how validated code is split into device and host IL
- Kernel Stubs -- __global__ function wrapper generation
- Entity Node -- byte offsets +176, +177, +182, +184
- Diagnostics Overview -- error emission pipeline
- Lambda Overview -- extended lambda HD annotation validation
Device/Host Separation
A single .cu file contains both host and device code intermixed. Conventional wisdom assumes cudafe++ splits them with two compilation passes -- one for host, one for device. That assumption is wrong. cudafe++ uses a single-pass, tag-and-filter architecture: the EDG frontend builds one unified IL tree from the entire translation unit, every entity gets execution-space bits written into its node, and then two separate output paths filter the tagged IL -- one path emits the .int.c host file, the other emits the device IL for cicc. There is no re-parse, no second invocation of the frontend.
This page documents the global variables that control the split, the IL-marking walk that selects device-reachable entries, the host-output filtering logic that suppresses device-only entities, and the output files produced.
Key Facts
| Property | Value |
|---|---|
| Architecture | Single-pass: parse once, tag with execution-space bits, filter at output time |
| Language mode flag | dword_126EFB4 -- language mode (1 = C, 2 = C++) |
| Host compiler identity | dword_126EFA4 -- clang mode; dword_126EFA8 -- gcc mode |
| Device stub mode | dword_1065850 -- toggled per-entity in sub_47BFD0 (gen_routine_decl) |
| Device-only filter | sub_46B3F0 -- returns 0 for device-only entities when generating host output |
| Keep-in-IL entry point | sub_610420 (mark_to_keep_in_il), 892 lines |
| Keep-in-IL worker | sub_6115E0 (walk_tree_and_set_keep_in_il), 4649 lines |
| Prune callback | sub_617310 (prune_keep_in_il_walk), 127 lines |
| Host output entry point | sub_489000 (process_file_scope_entities) |
| Host sequence dispatcher | sub_47ECC0 (gen_template / top-level source sequence processor), 1917 lines |
| Routine declaration | sub_47BFD0 (gen_routine_decl), 1831 lines |
| Host output file | <input>.int.c (transformed C++ for host compiler) |
| Device output file | Named via --gen_device_file_name CLI flag (binary IL for cicc) |
| Module ID file | Named via --module_id_file_name CLI flag |
| Stub file | Named via --stub_file_name CLI flag |
Why Single-Pass Matters
Old NVIDIA documentation and third-party descriptions sometimes describe a "two-pass" compilation model where cudafe++ runs once to extract device code and once to extract host code. This is not what the binary does. The evidence:
- One frontend invocation. sub_489000 (process_file_scope_entities) is called once. It walks the source sequence list (qword_1065748) a single time, dispatching each entity through sub_47ECC0.
- No re-parse. The EDG frontend builds the IL tree in memory once. The keep-in-IL walk (sub_610420) runs during fe_wrapup pass 3, marking device-reachable entries with bit 7 of the prefix byte. The host backend then emits .int.c from the same IL tree, filtering based on execution-space bits.
- dword_126EFB4 is a language mode, not a pass counter. Its value 2 means "C++ mode," not "second pass." It never changes between device and host output phases.
- The device IL is a byte-level binary dump of marked entries, not the output of a separate code-generation pass. The host output is a text-mode C++ file produced by the gen_* family of functions.
The practical implication: every CUDA entity exists once in memory with its execution-space tag at entity+182. The tag drives all downstream decisions -- what goes into device IL, what appears in host .int.c, what gets wrapped in #if 0, and what gets a kernel stub.
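The tag-and-filter idea can be illustrated with two small functions that both read the same execution-space byte, one per output path. This is a hedged sketch: the function names and the host_action enum are hypothetical, and the bit assignments (0x10 for __host__, 0x20 for __device__, 0x40 for __global__) follow the masks used on this page rather than a separately verified layout.

```c
#include <assert.h>
#include <stdint.h>

/* What the host path does with a routine (hypothetical labels). */
enum host_action {
    HOST_EMIT_BODY,   /* body written to .int.c unchanged */
    HOST_EMIT_STUB,   /* body replaced by a kernel stub */
    HOST_SUPPRESS     /* body wrapped in #if 0 / #endif */
};

/* Host-side filter: decides from the tag alone, no second parse. */
static enum host_action host_action_for(uint8_t space)
{
    if (space & 0x40)
        return HOST_EMIT_STUB;            /* __global__ kernel */
    if ((space & 0x30) == 0x20)
        return HOST_SUPPRESS;             /* __device__ without __host__ */
    return HOST_EMIT_BODY;                /* host-only or __host__ __device__ */
}

/* Device-side filter: is this routine a seed for the keep-in-IL walk? */
static int is_device_seed(uint8_t space)
{
    return (space & 0x60) != 0;           /* __device__ or __global__ */
}

/* Convenience check used to sanity-test the two filters together. */
static int appears_in_both_outputs(uint8_t space)
{
    assert((space & 0x8F) == (space & 0x8F)); /* no constraint on other bits */
    return host_action_for(space) != HOST_SUPPRESS && is_device_seed(space);
}
```

A __host__ __device__ routine (0x30) is the case where both filters say yes without any stub substitution, which is why it has to be valid in both execution contexts.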
Control Globals
dword_126EFB4 -- Language Mode
| Value | Meaning |
|---|---|
| 0 | Unset / not initialized |
| 1 | C mode |
| 2 | C++ mode |
Set during CLI processing (sub_45C200, case 228/240/246/251/252 for C++ standard versions). In CUDA compilation this is always 2 because .cu files are compiled as C++. The keep-in-IL logic at sub_610420 checks dword_126EFB4 == 2 to decide whether to run the secondary routine-definition marking pass (sub_6175F0).
dword_126EFA4 -- Clang Mode / Device Code Mode
This global has different semantics depending on context. In CLI processing (case 187), it records whether clang host compiler mode is active. In the template instantiation system (p1.18 sweep), it acts as a device-code mode flag (1 = device code path, 0 = host stubs). The dual use reflects the fact that cudafe++ reuses the same global for different phases.
dword_126EFA8 -- GCC Mode / GPU Compilation Mode
Set when gcc host compiler mode is active. In template-related code paths, a nonzero value indicates GPU compilation mode is enabled.
dword_1065850 -- Device Stub Mode Toggle
This global flag controls how __global__ kernel bodies are emitted. It is toggled inside gen_routine_decl (sub_47BFD0). The toggle mechanism is a self-inverting flip that causes gen_routine_decl to process each __global__ kernel TWICE. Because the toggle fires at the TOP of the function (before body emission), the first call (0->1) emits the static stub definition, and the recursive call (1->0) emits the forwarding body.
Toggle Pseudocode (from sub_47BFD0, decompiled line 551-553)
// v3 = entity pointer, v8 = is_friend flag
uint64_t flags = *(uint64_t*)(entity + 176); // 8-byte flags field
// Bitmask 0x40000002000000 combines the __global__ attribute bit (0x40000000000000)
// and a definition/linkage flag (0x2000000) from the entity's flags field at +176.
if ((flags & 0x40000002000000) == 0x40000002000000 && !is_friend)
    dword_1065850 = (dword_1065850 == 0);  // flip: 0->1 or 1->0
This toggle fires at the TOP of gen_routine_decl, before either stub variant is emitted. Because the function calls itself recursively at the end (decompiled line 1821: return sub_47BFD0(v152, a2)), the toggle fires again on re-entry, resetting the flag.
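The 64-bit mask is easier to relate to the per-byte tests used elsewhere on this page once it is split by byte. On a little-endian target such as x86-64, bit 25 of the field at +176 lands in byte +179 (mask 0x02) and bit 54 lands in byte +182 (mask 0x40). The sketch below checks that equivalence on a plain byte buffer; the function names are hypothetical and the buffer is only a stand-in for an entity node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads the 8-byte field at +176 the way a little-endian load would,
 * written byte by byte so the sketch does not depend on host endianness. */
static uint64_t load_flags_176(const uint8_t *entity)
{
    uint64_t flags = 0;
    assert(entity != NULL);
    for (int i = 7; i >= 0; --i)
        flags = (flags << 8) | entity[176 + i];
    return flags;
}

/* The combined test as it appears in the toggle pseudocode. */
static int toggle_mask_matches(const uint8_t *entity)
{
    const uint64_t mask = 0x40000002000000ULL;
    return (load_flags_176(entity) & mask) == mask;
}

/* The same test expressed on individual bytes. */
static int toggle_bytes_match(const uint8_t *entity)
{
    return (entity[179] & 0x02) && (entity[182] & 0x40);
}
```

Both forms agree for any buffer contents, so the toggle condition can be read as "byte +179 has 0x02 set and byte +182 has 0x40 set".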
Body Emission Decision (decompiled line 1421-1432)
The actual stub body selection happens later in the function, based on the CURRENT value of dword_1065850 (which has already been toggled):
if ((entity->byte_182 & 0x40) != 0) {          // has __global__ annotation
    char has_body = entity->byte_179 & 0x02;   // has a definition
    if (dword_1065850) {
        // First call (toggle 0->1): emit static stub with cudaLaunchKernel placeholder
        if (!is_specialization && has_body) {
            emit("{ ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}");
        }
    } else if (has_body) {
        // Recursive call (toggle 1->0): emit forwarding stub
        emit("{");
        emit_scope_qualifier(entity);
        emit("__wrapper__device_stub_");
        emit(entity->name);
        emit_template_args_if_needed(entity);
        emit_parameter_forwarding(entity);
        emit(");return;}");
    }
    // Both invocations: wrap original body in #if 0 / #endif
}
Self-Recursion (decompiled line 1817-1821)
After the first call emits the static stub, the function checks whether dword_1065850 is nonzero (the toggle set it to 1). If so, it restores the source sequence pointer and calls itself:
if (dword_1065850) {
    qword_1065748 = saved_source_sequence;
    return sub_47BFD0(context, a2);  // recursive self-call
}
The recursive invocation toggles dword_1065850 back to 0, emits the forwarding body, and returns without further recursion (since dword_1065850 == 0 at the self-recursion check).
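The interplay of the toggle and the self-call can be reproduced with a very small model. This is a sketch under assumptions: struct routine, record, and gen_routine_decl_model are hypothetical, the friend exclusion and all real emission are omitted, and the global device_stub_mode merely stands in for dword_1065850.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Models dword_1065850 (device stub mode). */
static int device_stub_mode = 0;

/* Hypothetical, heavily reduced stand-in for a routine entity. */
struct routine {
    const char *name;
    int is_global_definition;  /* models the 0x40000002000000 test at +176 */
};

/* Appends "<what>(<name>);" to out, which must hold a NUL-terminated string. */
static void record(char *out, size_t cap, const char *what, const char *name)
{
    size_t used = strlen(out);
    assert(used < cap);
    snprintf(out + used, cap - used, "%s(%s);", what, name);
}

static void gen_routine_decl_model(const struct routine *r, char *out, size_t cap)
{
    assert(r != NULL && out != NULL);
    if (!r->is_global_definition) {
        record(out, cap, "plain", r->name);        /* no toggle, no recursion */
        return;
    }
    device_stub_mode = (device_stub_mode == 0);    /* toggle at the top */
    record(out, cap,
           device_stub_mode ? "static-stub" : "forwarding-body", r->name);
    if (device_stub_mode)                          /* self-recursion check */
        gen_routine_decl_model(r, out, cap);
}
```

Calling the model once on a kernel records static-stub followed by forwarding-body and leaves the flag at 0, which matches the order described above.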
The flag is also set in sub_47ECC0 when processing template instantiation directives (source sequence kind 54): if the entity has byte_182 & 0x40 (the __global__ kernel bit) and CUDA language mode is active, dword_1065850 is set to 1 before emitting the instantiation directive.
dword_126EBA8 -- Language Standard Mode
Value 1 indicates C language standard mode. The device-only filtering function sub_46B3F0 references this to determine whether EBA (EDG binary archive) mode applies.
Host-Output Filtering: sub_46B3F0
This compact function (39 lines decompiled) is the gatekeeper that determines whether an entity should be emitted in the host .int.c output. It is called from sub_47ECC0 at the point where the host backend decides whether to emit a type/variable declaration or wrap it in #if 0.
Decompiled Logic
// sub_46B3F0 -- returns 0 to suppress (device-only), nonzero to emit
uint64_t sub_46B3F0(entry *a1, entry *a2) {
    char kind = a1->byte_132;
    int is_device;
    // Classes, structs, unions (kind 9-11): always check device-only
    if ((unsigned char)(kind - 9) <= 2)
        goto check_device_flag;
    // Enums (kind 2): check if scoped enum is device-only
    if (kind == 2) {
        if ((a1->byte_145 & 0x08) == 0)        // not an enum definition
            return 1;                          // emit it
        goto check_device_flag;
    }
    // Typedefs (kind 12): check underlying type kind
    if (kind == 12) {
        unsigned char underlying = a1->byte_160;  // unsigned: used as a shift count
        if (underlying > 10)
            return 0;
        // Magic bitmask: 0x71D = 0b11100011101
        // Bits set for kinds 0,2,3,4,8,9,10 -> emit
        return (0x71DULL >> underlying) & 1;
    }
    return 1;                                  // everything else: emit
check_device_flag:
    if (a2)
        is_device = a2->byte_49 & 1;
    else
        is_device = a1->byte_135 >> 7;
    if (!is_device)
        return 0;                              // not device-related, suppress? (inverted logic)
    // Device entity: check if it should still be emitted
    return dword_126EBA8                       // C mode -> emit anyway
        || (unsigned char)(kind - 9) > 2       // not a class/struct/union -> emit
        || *(a1->ptr_152 + 89) != 1;           // scope check
}
The function uses a bitmask trick (0x71D >> underlying_kind) to quickly determine which typedef underlying types pass the filter. The bit pattern 0b11100011101 selects kinds 0 (void/basic), 2 (enum), 3 (parameter), 4 (pointer), 8 (field), 9 (class), and 10 (struct).
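The lookup works because each allowed kind is stored as one bit of the constant, so a shift and a mask replace a seven-way comparison. The sketch below rebuilds the constant from the listed kinds and applies the same test; typedef_kind_passes and mask_from_kinds are hypothetical helper names introduced for this illustration.

```c
#include <assert.h>

/* Returns 1 if a typedef with the given underlying kind passes the filter,
 * using the same range guard and shift as the decompiled logic. */
static int typedef_kind_passes(unsigned underlying)
{
    if (underlying > 10)
        return 0;
    return (int)((0x71DULL >> underlying) & 1);
}

/* Builds a membership mask with one bit per listed kind. */
static unsigned long long mask_from_kinds(const unsigned *kinds, unsigned count)
{
    unsigned long long mask = 0;
    for (unsigned i = 0; i < count; ++i) {
        assert(kinds[i] < 64);          /* must fit in a 64-bit mask */
        mask |= 1ULL << kinds[i];
    }
    return mask;
}
```

Feeding the kinds 0, 2, 3, 4, 8, 9 and 10 to mask_from_kinds reproduces 0x71D, which confirms that the constant and the list in the text describe the same set.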
Where It Is Called
In sub_47ECC0 (the master source-sequence dispatcher), when processing type declarations (kind 6):
case 6:  // type_decl
    sub_4864F0(recursion_level, &continuation, kind_byte);
    if (!recursion_level && !sub_46B3F0(type_entry, scope_entry)) {
        // Entity is device-only in host context
        // Wrap in #if 0 / #endif
    }
This is the mechanism that makes device-only classes, structs, and enums invisible to the host compiler. They still exist in the IL tree (and participate in the keep-in-IL walk for device output), but their text representation is suppressed in .int.c.
Device-Only Suppression in Host Output
When sub_46B3F0 returns 0 for an entity, or when the execution-space check in gen_routine_decl identifies a device-only function, the host backend wraps the declaration in preprocessor guards:
#if 0
__device__ void device_only_function() {
    // ... original body ...
}
#endif
This pattern appears in three locations:
- Type declarations -- sub_47ECC0 wraps device-only types via the sub_46B3F0 check.
- Routine declarations -- sub_47BFD0 checks entity->byte_81 & 0x04 (has device scope) combined with execution-space bits at entity+182. When a function is device-only and the current output track is host, the function body is suppressed.
- Lambda bodies -- sub_47B890 (gen_lambda) wraps device lambda bodies in #if 0 / #endif and emits __nv_dl_wrapper_t wrapper types instead.
The nv_is_device_only_routine Check
The inline predicate from nv_transforms.h:367 is the canonical way to test if a routine lives exclusively in device space:
bool nv_is_device_only_routine(entity *e) {
    char byte = e->byte_182;
    return ((byte & 0x30) == 0x20)     // device annotation, no host
        && ((byte & 0x60) == 0x20);    // device, not __global__
}
The double-mask check distinguishes three cases: device-only routines pass both tests, __host__ __device__ routines fail the first, and __global__ routines fail the second:
- (byte & 0x30) == 0x20: has __device__ but not __host__ (bits 4-5)
- (byte & 0x60) == 0x20: has __device__ but not __global__ (bits 5-6)
A __global__ function fails the second test because bit 6 is set ((byte & 0x60) == 0x60). This matters because __global__ functions ARE emitted in host output -- as stubs that call __wrapper__device_stub_<name>.
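Taken together, the two masked comparisons require bit 5 to be set and bits 4 and 6 to be clear, which is the same as a single comparison against mask 0x70. The sketch below states both forms so they can be compared exhaustively; the function names are hypothetical, and the single-mask form is an observation about the arithmetic, not something recovered from the binary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The predicate as written in the text: two overlapping two-bit tests. */
static bool is_device_only_two_masks(uint8_t b)
{
    return (b & 0x30) == 0x20      /* __device__ set, __host__ clear */
        && (b & 0x60) == 0x20;     /* __device__ set, __global__ clear */
}

/* Equivalent single test over bits 4, 5 and 6. */
static bool is_device_only_one_mask(uint8_t b)
{
    return (b & 0x70) == 0x20;
}

/* Returns the number of byte values on which the two forms disagree. */
static int count_disagreements(void)
{
    int bad = 0;
    for (int b = 0; b < 256; ++b)
        if (is_device_only_two_masks((uint8_t)b) != is_device_only_one_mask((uint8_t)b))
            ++bad;
    assert(bad >= 0);
    return bad;
}
```

With the encodings used on this page, 0x20 (device-only) is accepted while 0x30 (__host__ __device__) and 0x60 (__global__) are rejected by both forms.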
The Keep-in-IL Walk (Device Code Selection)
The keep-in-IL mechanism runs during fe_wrapup pass 3 and selects which IL entries belong to the device output. The full details are documented in the Keep-in-IL page; this section covers the aspects relevant to device/host separation.
Call Chain
sub_610420 (mark_to_keep_in_il)
  |
  +-- installs pre_walk_check = sub_617310 (prune_keep_in_il_walk)
  +-- walks file-scope IL via sub_6115E0 (walk_tree_and_set_keep_in_il)
  |     |
  |     +-- for each child entry:
  |           *(child - 8) |= 0x80    // set bit 7 = keep_in_il
  |           recurse into child
  |
  +-- if dword_126EFB4 == 2 (C++ mode):
  |     sub_6175F0 (walk_scope_and_mark_routine_definitions)
  |
  +-- iterates 45+ global entry-kind linked lists
  +-- processes using-declarations (fixed-point loop)
The Keep Bit
Every IL entry has an 8-byte prefix. Bit 7 (0x80) of the byte at entry_ptr - 8 is the keep-in-IL flag:
Byte at (entry_ptr - 8):
  bit 0    (0x01)  is_file_scope
  bit 1    (0x02)  is_in_secondary_il
  bit 2    (0x04)  current_il_region
  bits 3-6         reserved
  bit 7    (0x80)  keep_in_il   <<<< THE DEVICE CODE MARKER
The sign bit doubles as the flag, enabling a fast test: *(signed char*)(entry - 8) < 0 means "keep." The recursive worker sub_6115E0 sets this bit on every reachable sub-entry by ORing 0x80 into the prefix byte and recursing.
Transitive Closure
The walk implements a transitive closure: if a __device__ function references a type, that type gets marked, which transitively marks its member types, base classes, template parameters, and any routines they reference. The prune callback (sub_617310) prevents infinite loops by returning 1 (skip) when an entry already has bit 7 set.
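The marking walk and its pruning rule can be sketched on a toy graph. This is an illustrative model only: struct il_entry, its fixed-size refs array, and the function names are hypothetical, and the prefix byte is stored inside the struct instead of 8 bytes before it. What it does preserve is the sign-bit test and the "already marked means stop" rule that keeps cyclic references from recursing forever.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REFS 4

/* Hypothetical IL entry: a prefix byte plus outgoing references. */
struct il_entry {
    uint8_t prefix;                   /* models the byte at entry_ptr - 8 */
    struct il_entry *refs[MAX_REFS];  /* entries this one references */
};

/* Sign-bit test: equivalent to (prefix & 0x80) != 0 on two's-complement
 * targets, which is what the decompiled comparison relies on. */
static int is_kept(const struct il_entry *e)
{
    assert(e != NULL);
    return (int8_t)e->prefix < 0;
}

/* Marks e and everything reachable from it. The early return plays the
 * role of the prune callback: an entry that is already marked is skipped. */
static void mark_keep(struct il_entry *e)
{
    if (e == NULL || is_kept(e))
        return;
    e->prefix |= 0x80;                /* low bits are left untouched */
    for (size_t i = 0; i < MAX_REFS; ++i)
        mark_keep(e->refs[i]);
}
```

Starting from a single seed, every entry on a reference cycle ends up marked exactly once, and entries that are not reachable keep a clear bit 7.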
Additional "keep definition" flags exist for deeper marking:
| Entity | Field | Bit | Effect |
|---|---|---|---|
| Type (class/struct) | entry + 162 | bit 7 (0x80) | Retain full class body, not just forward decl |
| Routine | entry + 187 | bit 2 (0x04) | Retain function body |
Seed Entries
The walk starts from entities already tagged with execution-space bits. These seeds include:
- Functions with __device__ or __global__ at entity+182
- Variables with __shared__, __constant__, or __managed__ memory space attributes
- Extended device/host-device lambdas
Everything reachable from a seed gets the keep bit. Everything without the keep bit is eliminated from the device IL by the elimination pass (sub_5CCBF0).
__host__ __device__ Functions
Functions annotated with both __host__ and __device__ have bits 4 and 5 set in entity+182, producing (byte & 0x30) == 0x30. These functions participate in BOTH output paths:
- Host output (.int.c): The nv_is_device_only_routine check returns false for the function because bit 4 is set alongside bit 5, so the function body is emitted normally -- no #if 0 wrapping, no stub substitution.
- Device IL: The keep-in-IL walk marks the function and all its dependencies because it has device-capable bits set. The full function body is retained in the device IL.
This dual inclusion is why __host__ __device__ functions must be valid C++ in both execution contexts. They are compiled once by EDG, then the same IL is consumed by both the host compiler (via .int.c text) and cicc (via binary IL).
Template Instantiation Interaction
When sub_47ECC0 processes a template instantiation directive (source sequence kind 54) for a __host__ __device__ template, it does NOT set dword_1065850. The stub mode toggle only activates for entities with byte_182 & 0x40 (the __global__ kernel bit). Host-device functions get their bodies emitted directly in both tracks.
Output Files
cudafe++ produces up to four output files from a single compilation:
1. Host C++ File (.int.c)
Generated by sub_489000 (process_file_scope_entities). The filename is derived from the input: <input>.int.c, or stdout if the output name is "-".
Contents:
- Pragma boilerplate (#pragma GCC diagnostic ignored ...)
- Managed runtime initialization (__nv_init_managed_rt, __nv_fatbinhandle_for_managed_rt)
- Lambda macro definitions (__nv_is_extended_device_lambda_closure_type, etc.)
- #include "crt/host_runtime.h" (injected when first CUDA-tagged type is encountered)
- All host-visible declarations with device-only entities wrapped in #if 0
- Kernel functions replaced with forwarding stubs to __wrapper__device_stub_<name>
- Registration tables (sub_6BCF80 called 6 times for device/host x managed/constant combinations)
- Anonymous namespace macro (_NV_ANON_NAMESPACE)
- Original source re-inclusion (#include "<original_file>")
2. Device IL File
Named via --gen_device_file_name CLI flag (flag index 85). Contains the binary IL for all entries that passed the keep-in-IL walk. This file is consumed by cicc (the CUDA IL compiler).
3. Module ID File
Named via --module_id_file_name CLI flag (flag index 87). Contains the CRC32-based unique identifier for this compilation unit, computed by make_module_id (sub_5B5500). Used to prevent ODR violations across separate compilation units in RDC mode.
4. Stub File
Named via --stub_file_name CLI flag (flag index 86). Contains the __wrapper__device_stub_<name> function definitions that bridge host-side kernel launch calls to the CUDA runtime.
Kernel Stub Generation
For __global__ kernel functions, the host output replaces the original body with two stub forms. The toggle dword_1065850 flips 0->1 at the top of gen_routine_decl, so the static definition is emitted first, followed by the forwarding body from the recursive call:
// Output 1 (dword_1065850 == 1 after toggle, emitted first):
static void __wrapper__device_stub_kernel_name(params) {
    ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
#if 0
<original body>
#endif
// Output 2 (dword_1065850 == 0 after toggle, emitted by recursive call):
void kernel_name(params) {
    <scope>::__wrapper__device_stub_kernel_name(params);
    return;
}
#if 0
<original body>
#endif
The static stub provides the definition of __wrapper__device_stub_ that the forwarding body calls. The cudaLaunchKernel(0, 0, 0, 0, 0, 0) placeholder creates a linker dependency on the CUDA runtime without performing an actual kernel launch.
For template kernels, the forwarding stub includes explicit template arguments: __wrapper__device_stub_kernel_name<T1, T2, ...>(params). For full details see Kernel Stubs.
Architectural Diagram
                   .cu source
                        |
            EDG Frontend (parse once)
                        |
                 Unified IL Tree
              (all entities tagged
                 at entity+182)
                        |
          +-------------+-------------+
          |                           |
  fe_wrapup pass 3          Backend (sub_489000)
 mark_to_keep_in_il         walks source sequence
    (sub_610420)                      |
          |                 sub_47ECC0 per entity
    set bit 7 on                      |
  device-reachable             +------+------+
       entries                 |             |
          |               sub_46B3F0    sub_47BFD0
  Device IL output        returns 0?    __global__?
 (binary, for cicc)            |             |
                          #if 0/endif    stub body
                            wrap it     replacement
                               |             |
                               +------+------+
                                      |
                                .int.c output
                             (text C++ for host
                                  compiler)
Function Map
| Address | Name | Lines | Role |
|---|---|---|---|
| sub_489000 | process_file_scope_entities | 723 | Backend entry point, .int.c emission |
| sub_47ECC0 | gen_template (source sequence dispatcher) | 1917 | Dispatches each entity; calls sub_46B3F0 for type filtering |
| sub_47BFD0 | gen_routine_decl | 1831 | Routine declaration/definition; toggles dword_1065850 |
| sub_46B3F0 | device-only type filter | 39 | Returns 0 for device-only entities in host output |
| sub_610420 | mark_to_keep_in_il | 892 | Top-level device IL marking entry point |
| sub_6115E0 | walk_tree_and_set_keep_in_il | 4649 | Recursive worker that sets bit 7 on reachable entries |
| sub_617310 | prune_keep_in_il_walk | 127 | Pre-walk callback; skips already-marked entries |
| sub_6175F0 | walk_scope_and_mark_routine_definitions | 634 | Additional pass for C++ routine definitions |
| sub_47B890 | gen_lambda | 336 | Lambda wrapper generation; #if 0 for device lambda bodies |
| sub_4864F0 | gen_type_decl | 751 | Type declaration emission; host runtime injection |
| sub_5CCBF0 | eliminate_unneeded_il_entries | 345 | Elimination pass (removes entries without keep bit) |
Cross-References
- Execution Spaces -- byte +182 bitfield encoding for __host__/__device__/__global__; the nv_is_device_only_routine predicate that drives host-output filtering
- Kernel Stubs -- detailed stub generation logic: forwarding body (pass 1) and static cudaLaunchKernel body (pass 2)
- Keep-in-IL -- full documentation of the device code marking walk, the keep bit at entry_ptr - 8, and the transitive closure algorithm
- Memory Spaces -- variable-side __device__/__shared__/__constant__ at entity+148; these are the seed entries for the keep-in-IL walk
- .int.c File Format -- structure of the generated host translation file
- Entity Node Layout -- full byte map of the entity structure including offset +176 (flags field) and +182 (execution space byte)
Kernel Stub Generation
When cudafe++ generates the .int.c host translation of a CUDA source file, every __global__ kernel function undergoes a critical transformation: the original kernel body is suppressed and replaced with a device stub -- a lightweight host-callable wrapper that delegates to cudaLaunchKernel. This mechanism is how CUDA kernel launch syntax (kernel<<<grid, block>>>(args)) ultimately becomes a regular C++ function call that the host compiler can process. The stub generation logic lives entirely within gen_routine_decl (sub_47BFD0), a 1,831-line function in cp_gen_be.c that is the central code generator for all C++ function declarations and definitions. A secondary function, gen_bare_name (sub_473F10), handles the character-by-character emission of the __wrapper__device_stub_ prefix into function names.
The stub mechanism operates in two passes controlled by a global toggle, dword_1065850 (the device_stub_mode flag). The toggle fires at the top of gen_routine_decl, BEFORE the body-selection logic runs. Because the toggle is dword_1065850 = (dword_1065850 == 0), it flips 0->1 on the first invocation. This means:
- First invocation (toggle 0->1): dword_1065850 == 1 at decision points -> emits the static declaration with cudaLaunchKernel placeholder body, then recurses.
- Recursive invocation (toggle 1->0): dword_1065850 == 0 at decision points -> emits the forwarding body that calls __wrapper__device_stub_<name>.
Both invocations wrap the original kernel body in #if 0 / #endif so the host compiler never sees device code.
Key Facts
| Property | Value |
|---|---|
| Source file | cp_gen_be.c (EDG 6.6 backend code generator) |
| Main generator | sub_47BFD0 (gen_routine_decl, 1831 lines) |
| Bare name emitter | sub_473F10 (gen_bare_name, 671 lines) |
| Stub prefix string | "__wrapper__device_stub_" at 0x839420 |
| Specialization prefix | "__specialization_" at 0x839960 |
| cudaLaunchKernel body | "{ ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}" at 0x839CB8 |
| Device-only dummy (ctor/dtor) | "{int *volatile ___ = 0;" at 0x839A3E + "::free(___);" at 0x839A72 |
| Device-only dummy (global) | "{int volatile ___ = 1;" at 0x839A56 + "::exit(___);" at 0x839A80 |
| Stub mode flag | dword_1065850 (global toggle) |
| Static template stub CLI flag | -static-global-template-stub=true |
| Parameter list generator | sub_478900 (gen_parameter_list) |
| Scope qualifier emitter | sub_474D60 (recursive namespace path) |
| Parameter name emitter | sub_474BB0 (emit entity name for forwarding) |
The Device Stub Mode Toggle
The entire stub generation mechanism hinges on a single global variable, dword_1065850. This flag acts as a modal switch: when set, all subsequent code generation for __global__ functions produces the static stub variant rather than the forwarding body.
Toggle Logic
The toggle occurs in gen_routine_decl at the point where the function's CUDA flags are inspected. The critical line from the decompiled binary:
// sub_47BFD0, around decompiled line 553
// v3 = routine entity pointer, v8 = is_friend flag
__int64 flags = *(_QWORD *)(v3 + 176);
if ((flags & 0x40000002000000) == 0x40000002000000 && v8 != 1)
dword_1065850 = dword_1065850 == 0; // toggle: 0->1 or 1->0
The bitmask 0x40000002000000 covers two bits (25 and 54) of the entity's 8-byte flags field at offset +176, encoding a combination of the __global__ attribute and a linkage/definition flag. The condition requires both bits to be set, and the declaration must not be a friend declaration (v8 != 1). The toggle expression dword_1065850 == 0 flips the flag: 0 becomes 1, and 1 becomes 0.
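The "all mask bits set" test can be isolated in a few lines. The following is a minimal sketch, assuming only what the decompiled condition shows; the helper name stub_toggle_fires and the macro KERNEL_TOGGLE_MASK are hypothetical and do not appear in the binary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask taken from the decompiled condition: bit 25 (0x2000000) and
 * bit 54 (0x40000000000000) of the 8-byte flags field at entity+176.
 * The macro name is hypothetical. */
#define KERNEL_TOGGLE_MASK UINT64_C(0x40000002000000)

/* Hypothetical helper: returns true when the toggle would fire.
 * (flags & MASK) == MASK demands every mask bit, whereas
 * (flags & MASK) != 0 would accept either bit alone. */
static bool stub_toggle_fires(uint64_t flags, int is_friend)
{
    return (flags & KERNEL_TOGGLE_MASK) == KERNEL_TOGGLE_MASK
        && is_friend != 1;
}
```

With only one of the two bits set, or with is_friend equal to 1, the helper returns false and the flag would be left untouched.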
This means gen_routine_decl is called twice for every __global__ kernel. Crucially, the toggle fires at the TOP of the function, BEFORE the body emission logic:
- First call (dword_1065850 == 0 at entry -> toggled to 1): All subsequent decision points see dword_1065850 == 1. Emits the static stub with cudaLaunchKernel placeholder body. Then recurses.
- Recursive call (dword_1065850 == 1 at entry -> toggled to 0): All subsequent decision points see dword_1065850 == 0. Emits the forwarding stub body. Does NOT recurse (the flag is 0 at the end).
The self-recursion that drives the second call is explicit at the end of gen_routine_decl:
// sub_47BFD0, decompiled line 1817-1821
if (dword_1065850) {
qword_1065748 = (int64_t)v163; // restore source sequence pointer
return sub_47BFD0(v152, a2); // recursive self-call
}
After emitting the static stub (first call), the self-recursion check at line 1817 fires because dword_1065850 == 1. The function restores the source sequence state and calls itself. In the recursive call, the toggle fires again (1->0), and the forwarding body is emitted with dword_1065850 == 0. At the end of the recursive call, dword_1065850 == 0, so no further recursion occurs.
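The toggle-then-recurse control flow can be modelled in isolation. This is a sketch of the control flow only, under the assumption that the toggle and the end-of-function check behave exactly as described above; model_gen_routine_decl, stub_mode and the emitted_kind enum are hypothetical names, and nothing here reproduces the real emission logic.

```c
#include <stddef.h>

enum emitted_kind { EMIT_STATIC_STUB = 1, EMIT_FORWARDING = 2 };

static int stub_mode = 0;   /* stands in for dword_1065850 */

/* Hypothetical model: toggle at the top, record which variant this call
 * would emit, then recurse while the flag is still set. Returns the total
 * number of invocations, including recursive ones. */
static int model_gen_routine_decl(int is_kernel_def,
                                  int *trace, size_t cap, size_t *count)
{
    int calls = 1;

    if (is_kernel_def)
        stub_mode = (stub_mode == 0);            /* toggle: 0->1 or 1->0 */

    if (is_kernel_def && *count < cap)
        trace[(*count)++] = stub_mode ? EMIT_STATIC_STUB : EMIT_FORWARDING;

    if (stub_mode)                               /* end-of-function check */
        calls += model_gen_routine_decl(is_kernel_def, trace, cap, count);

    return calls;
}
```

For a kernel definition the model runs twice and records the static stub before the forwarding body, leaving the flag at 0 afterwards; for any other routine it runs once and records nothing.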
Stub Generation: The Forwarding Body
When dword_1065850 == 0 and the entity has __global__ annotation (byte +182 & 0x40) with a body (byte +179 & 0x02), gen_routine_decl emits a forwarding body instead of the original kernel implementation. This is the output produced by the recursive (second) invocation.
Step-by-Step Emission
The forwarding body is assembled from multiple sub_468190 (emit raw string) calls:
// Condition: (byte[182] & 0x40) != 0 && (byte[179] & 2) != 0 && dword_1065850 == 0
// 1. Open brace
sub_468190("{");
// 2. Scope qualification (if kernel is in a namespace)
scope = *(v3 + 40); // entity's enclosing scope
if (scope && byte_at(scope + 28) == 3) { // scope kind 3 = namespace
sub_474D60(*(scope + 32)); // recursively emit namespace::namespace::...
sub_468190("::");
}
// 3. Emit "__wrapper__device_stub_" prefix
sub_468190("__wrapper__device_stub_");
// 4. Emit the original function name
sub_468190(*(char **)(v3 + 8)); // entity name string at offset +8
Template Argument Emission
After the function name, template arguments must be forwarded. The logic branches on whether the function is an explicit template specialization (v153) or a non-template member of a template class:
Case A: Explicit specialization (v153 != 0) -- uses the template argument list at entity offset +224:
v135 = *(v3 + 224); // template_args linked list
if (v135) {
    putc('<', stream); // emit '<'
    do {
        arg_kind = byte_at(v135 + 8);
        if (arg_kind == 0) {
            // Type argument: emit type specifier + declarator
            sub_5FE8B0(v135[4], ...); // gen_type_specifier
            sub_5FB270(v135[4], ...); // gen_declarator
        } else if (arg_kind == 1) {
            // Value argument (non-type template param)
            sub_5FCAF0(v135[4], 1, ...); // gen_constant
        } else {
            // Template-template argument
            sub_472730(v135[4], ...); // gen_template_arg
        }
        v135 = *v135; // next in linked list
        separator = v135 ? ',' : '>';
        putc(separator, stream);
    } while (v135);
}
Case B: Non-specialization -- template parameters from the enclosing class template are forwarded:
// v162 = template parameter info from enclosing scope
v92 = v162[1]; // template parameter list
if (v92 && (byte_at(v92 + 113) & 2) == 0) {
    sub_467E50("<");
    do {
        param_kind = byte_at(v92 + 112);
        if (param_kind == 1) {
            // type parameter -- emit the type
            sub_5FE8B0(*(v92 + 120), ...);
            sub_5FB270(*(v92 + 120), ...);
        } else if (param_kind == 2) {
            // non-type parameter -- emit constant
            sub_5FCAF0(*(v92 + 120), 1, ...);
        } else {
            // template-template parameter
            sub_472730(*(v92 + 120), ...);
        }
        if (byte_at(v92 + 113) & 1)
            sub_467E50("..."); // parameter pack expansion
        v92 = *(v92 + 104); // next parameter
        emit(v92 ? "," : ">");
    } while (v92);
}
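Both cases share the same list-walking shape: emit an item, advance, then choose the separator based on whether a next node exists. A minimal sketch of that shape follows, assuming each argument has already been rendered to text; struct tparam, targ_append and emit_template_args are hypothetical stand-ins rather than names from the binary.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a template parameter node. The real entity
 * keeps its next pointer at +104 and kind/flag bytes at +112/+113. */
struct tparam {
    const char *text;             /* already-rendered argument text */
    int is_pack;                  /* models (byte +113 & 1) */
    const struct tparam *next;
};

/* Append s to the NUL-terminated buffer out; -1 if it would not fit. */
static int targ_append(char *out, size_t cap, const char *s)
{
    size_t used = strlen(out), need = strlen(s);
    if (used + need + 1 > cap)
        return -1;
    memcpy(out + used, s, need + 1);
    return 0;
}

/* Emits "<a,b...>" for a non-empty list and nothing for an empty one. */
static int emit_template_args(char *out, size_t cap, const struct tparam *p)
{
    if (!p)
        return 0;
    if (targ_append(out, cap, "<"))
        return -1;
    do {
        if (targ_append(out, cap, p->text))
            return -1;
        if (p->is_pack && targ_append(out, cap, "..."))
            return -1;
        p = p->next;              /* advance first ... */
        if (targ_append(out, cap, p ? "," : ">"))   /* ... then pick separator */
            return -1;
    } while (p);
    return 0;
}
```

Choosing the separator after advancing means the closing '>' needs no special handling once the loop ends.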
Parameter Forwarding
After the name and template arguments, the forwarding call's actual arguments are emitted:
// 5. Emit parameter forwarding: "(param1, param2, ...)"
sub_468150(40); // '('
param = *(v167 + 40); // first parameter entity from definition scope
if (param) {
    for (separator = ""; ; separator = ",") {
        sub_468190(separator);
        sub_474BB0(param, 7); // emit parameter name
        if (byte_at(param + 166) & 0x40) {
            sub_468190("..."); // variadic parameter pack expansion
        }
        param = *(param + 104); // next parameter in list
        if (!param) break;
    }
}
sub_468190(");");
// 6. Emit return statement and closing brace
sub_468190("return;}");
Complete Output Example
For a kernel:
namespace my_ns {
template<typename T>
__global__ void my_kernel(T* data, int n) { /* device code */ }
}
The forwarding body (emitted during the recursive call with dword_1065850 == 0) produces:
template<typename T>
void my_ns::my_kernel(T* data, int n) {
my_ns::__wrapper__device_stub_my_kernel<T>(data, n);
return;
}
#if 0
/* original kernel body here */
#endif
Note: __host__ is NOT emitted in the forwarding body. The __global__ attribute is stripped and no explicit execution space appears. The function appears as a plain C++ function in .int.c.
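The emission steps for the forwarding body can be condensed into a single string builder. This is a sketch under stated assumptions: the scope path and template argument text are passed in pre-rendered, pack expansion is omitted, and build_forwarding_body is a hypothetical name. It follows the decompiled separator "," (no space), so its output is more compact than the pretty-printed example above.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical builder. scope may be NULL (file-scope kernel) and targs
 * may be NULL (non-template kernel); targs is pre-rendered, e.g. "<T>".
 * Returns 0 on success, -1 if the buffer is too small. */
static int build_forwarding_body(char *out, size_t cap,
                                 const char *scope, const char *name,
                                 const char *targs,
                                 const char *const *params, size_t nparams)
{
    size_t used;
    int n = snprintf(out, cap, "{%s%s__wrapper__device_stub_%s%s(",
                     scope ? scope : "", scope ? "::" : "",
                     name, targs ? targs : "");
    if (n < 0 || (size_t)n >= cap)
        return -1;
    used = (size_t)n;

    for (size_t i = 0; i < nparams; i++) {
        /* separator is "" before the first name and "," afterwards */
        n = snprintf(out + used, cap - used, "%s%s", i ? "," : "", params[i]);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, cap - used, ");return;}");
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    return 0;
}
```

Calling it with scope "my_ns", name "my_kernel", targs "<T>" and parameters data and n yields the body for the namespaced template kernel shown above, minus whitespace.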
Stub Generation: The Static cudaLaunchKernel Placeholder
When dword_1065850 == 1 (the first invocation, after the toggle), the function declaration is rewritten with a different storage class and body. Despite being called "pass 2" conceptually (it produces the definition that the forwarding body calls), it is emitted FIRST in the output because the toggle sets the flag before any body emission logic runs.
Declaration Modifiers
When dword_1065850 is set, gen_routine_decl forces the storage class to static and optionally prepends the __specialization_ prefix:
// sub_47BFD0, decompiled lines 897-903
if (dword_1065850) {
v164 = 2; // force storage class = static
v23 = "static";
if (v153) // if template specialization
sub_467E50("__specialization_");
goto emit_storage_class; // -> sub_467E50("static"); sub_468150(' ');
}
The __specialization_ prefix is emitted BEFORE static for template specializations, with no separator between the two sub_467E50 calls. The output therefore contains the single token __specialization_static, as in __specialization_static void __wrapper__device_stub_kernel(...), which sets specialization stubs apart from primary template stubs.
Name Emission via gen_bare_name
In stub mode, gen_bare_name (sub_473F10) prepends the wrapper prefix character-by-character. The relevant code path:
// sub_473F10, decompiled lines 130-144
if (byte_at(v2 + 182) & 0x40 && dword_1065850) {
    // Emit line directive if pending
    if (dword_1065818)
        sub_467DA0();
    // Character-by-character emission of "__wrapper__device_stub_"
    v25 = "_wrapper__device_stub_"; // note: starts at second char
    v26 = 95; // first char: '_' (0x5F = 95)
    do {
        ++v25;
        putc(v26, stream);
        v26 = *(v25 - 1);
        ++dword_106581C;
    } while ((char)v26);
}
The technique is notable: the decompiled pointer starts at the second character of the prefix ("_wrapper__device_stub_"), while the first underscore (_, ASCII 95) is held separately as the initial character. Each iteration of the do/while loop writes the current character via putc, loads the next one through the advanced pointer, and increments the column counter (dword_106581C); the loop ends when the loaded character is the terminating NUL. This emits the full 23-character __wrapper__device_stub_ prefix before the actual function name is emitted.
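The loop shape is easy to misread, so a buffer-based re-creation may help. This is a sketch that keeps the decompiled structure but writes into a caller-supplied buffer instead of calling putc; emit_stub_prefix and its column parameter are hypothetical names.

```c
/* Re-creation of the decompiled loop shape. out must have room for the
 * 23-character prefix plus a terminating NUL; *column stands in for the
 * column counter dword_106581C. Returns the number of characters written. */
static int emit_stub_prefix(char *out, int *column)
{
    const char *p = "_wrapper__device_stub_";  /* prefix minus its first char */
    int ch = '_';                               /* first char held separately */
    int n = 0;

    do {
        ++p;
        out[n++] = (char)ch;   /* stands in for putc(ch, stream) */
        ch = *(p - 1);         /* load the character to write next */
        ++*column;
    } while ((char)ch);        /* stop once the NUL has been loaded */

    out[n] = '\0';
    return n;
}
```

The separately held underscore plus the 22 characters of the shortened literal give the complete prefix, and the column counter advances once per character written.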
cudaLaunchKernel Placeholder Body
For non-specialization __global__ kernels in stub mode, the body is a single-line placeholder:
// sub_47BFD0, decompiled lines 1424-1429
if (dword_1065850) {
if (!v153 && v90) { // not a specialization AND has __global__ body
sub_468190("{ ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}");
goto suppress_original;
}
}
The call ::cudaLaunchKernel(0, 0, 0, 0, 0, 0) is never actually executed at runtime. It exists solely to create a linker dependency on the CUDA runtime library, ensuring that cudaLaunchKernel is linked even though the real launch is performed through the CUDA driver API. The six zero arguments match the signature cudaError_t cudaLaunchKernel(const void*, dim3, dim3, void**, size_t, cudaStream_t).
Complete Output Example (Static Stub)
For the same kernel above, the static stub (emitted first, with dword_1065850 == 1) produces:
static void __wrapper__device_stub_my_kernel(float* data, int n) {
::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
Dummy Bodies for Non-Kernel Device Functions
Not all CUDA-annotated functions are __global__ kernels. Device-only functions (constructors, destructors, and plain __device__ functions) that have definitions also need host-side bodies to prevent host compiler errors. These receive dummy bodies designed to suppress optimizer warnings while remaining syntactically valid.
Condition for Dummy Body Emission
The dummy body path activates in the ELSE branch of the __global__ check -- that is, for non-kernel device functions. The condition from the decompiled code (lines 1603-1606):
// This path is reached when (byte[182] & 0x40) == 0 -- entity is NOT __global__
// The flags field at offset +176 is an 8-byte bitfield encoding linkage/definition state.
uint64_t flags = *(uint64_t*)(entity + 176);
if ((flags & 0x30000000000500) != 0x20000000000000) // NOT a device-only entity with definition
goto emit_original_body; // skip dummy, emit normally
if (!dword_106BFDC || (entity->byte_81 & 4) != 0) // whole-program flag check
{
// Emit dummy body for device-only function visible in host output
}
The bitmask 0x30000000000500 extracts the device-annotation and definition bits from the 8-byte flags field. The target value 0x20000000000000 selects entities that have device annotation set but no host-side definition -- exactly the functions that need a dummy body to satisfy the host compiler.
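Unlike the toggle condition, this test compares the masked value against a target in which only one of the four mask bits is set. A minimal sketch of that comparison follows; the macro and function names are hypothetical, and the per-bit meanings are not asserted beyond what the text above states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants from the decompiled condition; names are hypothetical. */
#define DUMMY_MASK   UINT64_C(0x30000000000500)  /* bits 8, 10, 52, 53 */
#define DUMMY_TARGET UINT64_C(0x20000000000000)  /* bit 53 only */

/* True when bit 53 is set and bits 8, 10 and 52 are all clear.
 * Bits outside the mask do not affect the result. */
static bool takes_dummy_body_path(uint64_t flags)
{
    return (flags & DUMMY_MASK) == DUMMY_TARGET;
}
```

A value with both bit 52 and bit 53 set, or with either low bit set, fails the comparison and falls through to emit_original_body.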
Constructor/Destructor Dummy (definition_kind 1 or 2)
For constructors (definition_kind == 1) and destructors (definition_kind == 2), the dummy body allocates a volatile null pointer and frees it:
// sub_47BFD0, decompiled lines 1611-1651
if ((unsigned char)(byte[166] - 1) <= 1) {
sub_468190("{int *volatile ___ = 0;");
// ... emit (void)param; for each parameter ...
sub_468190("::free(___);}");
}
Output:
{int *volatile ___ = 0;(void)param1;(void)param2;::free(___);}
The volatile qualifier keeps the optimizer from discarding the pointer variable or folding it to a constant. Because the pointer is null, ::free(___) is a no-op at runtime, but the call establishes a dependency on the C library and prevents dead code elimination of the entire body.
global / Regular Device Function Dummy (definition_kind >= 3)
For non-constructor/destructor device functions, a different pattern is used:
else {
sub_468190("{int volatile ___ = 1;");
// ... emit (void)param; for each parameter ...
sub_468190("::exit(___);}");
}
Output:
{int volatile ___ = 1;(void)param1;(void)param2;::exit(___);}
The ::exit(1) call guarantees the function is never considered to "return normally" by the host compiler's control-flow analysis, suppressing missing-return-value warnings for non-void functions.
Parameter Usage Emission
Between the opening and closing statements, each named parameter is referenced with (void)param; to suppress unused-parameter warnings. The loop walks the parameter list:
for (kk = *(v167 + 40); kk; kk = *(kk + 104)) {
    if (*(kk + 8) && !(byte_at(kk + 166) & 0x40)) { // has name, not a pack
        // For aggregate types with GNU host compiler: complex cast chain
        if (!dword_1065750 && dword_126E1F8
            && is_aggregate_type(*(kk + 112))
            && has_nontrivial_dtor(*(kk + 112))) {
            sub_468190("(void)");
            sub_468190("reinterpret_cast<void *>(&(const_cast<char &>");
            sub_468190("(reinterpret_cast<const volatile char &>(");
            sub_474BB0(kk, 7); // parameter name
            sub_468190("))))");
        } else {
            sub_468190("(void)");
            sub_474BB0(kk, 7); // parameter name
        }
        sub_468150(';');
    }
}
The complex reinterpret_cast chain for aggregate types with non-trivial destructors avoids triggering GCC/Clang warnings about taking the address of a parameter that might be passed in registers.
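The two dummy-body shapes and the (void)param; loop can be combined into one string builder. This is a sketch under stated assumptions: it covers only the simple (void)name; form, leaves out the reinterpret_cast chain for aggregates, and models an unnamed parameter as a NULL entry; build_dummy_body is a hypothetical name.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical builder. kind follows the definition_kind values quoted
 * above (1 = constructor, 2 = destructor, anything else = regular).
 * Returns 0 on success, -1 if the buffer is too small. */
static int build_dummy_body(char *out, size_t cap, int kind,
                            const char *const *params, size_t nparams)
{
    /* Same range trick as the decompilation: true only for 1 and 2. */
    int is_cdtor = (unsigned char)(kind - 1) <= 1;
    size_t used;
    int n = snprintf(out, cap, "%s",
                     is_cdtor ? "{int *volatile ___ = 0;"
                              : "{int volatile ___ = 1;");
    if (n < 0 || (size_t)n >= cap)
        return -1;
    used = (size_t)n;

    for (size_t i = 0; i < nparams; i++) {
        if (!params[i])
            continue;                       /* unnamed parameter: skipped */
        n = snprintf(out + used, cap - used, "(void)%s;", params[i]);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, cap - used, "%s",
                 is_cdtor ? "::free(___);}" : "::exit(___);}");
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    return 0;
}
```

Passing kind 3 with parameters x and y reproduces the ::exit form, and kind 1 or 2 switches both the opening declaration and the closing call to the ::free form.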
The #if 0 / #endif Suppression
After the stub body is emitted, the original kernel body is wrapped in preprocessor guards to hide it from the host compiler:
// sub_47BFD0, decompiled lines 1598-1601
sub_46BC80("#if 0"); // emit "#if 0\n"
--dword_1065834; // decrease indent level
sub_467D60(); // emit newline
// ... then emit the original body via:
dword_1065850_saved = dword_1065850;
dword_1065850 = 0; // temporarily disable stub mode
sub_47AEF0(*(v167 + 80), 0); // gen_statement_full: emit original body
dword_1065850 = dword_1065850_saved; // restore stub mode
sub_466C10(); // finalize
// ... then emit #endif
putc('#', stream);
// character-by-character emission of "#endif\n"
The function temporarily disables stub mode (dword_1065850 = 0) while emitting the original body so that any nested constructs are generated normally. Stub mode is restored as soon as the body has been emitted, and #endif follows.
For definitions (when v112 == 0), a trailing ; is appended after #endif to satisfy host compilers that may expect a statement terminator.
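The save, disable, restore pattern around the body emitter can be shown on its own. This is a minimal sketch, assuming the body emitter reads the global flag; every name here (stub_mode_flag, emit_original_body, emit_suppressed_body, mode_seen_by_body) is hypothetical, and the trailing ; case is left out.

```c
#include <stddef.h>
#include <string.h>

static int stub_mode_flag = 1;       /* stands in for dword_1065850 */
static int mode_seen_by_body = -1;   /* records what the body emitter saw */

/* Hypothetical body emitter: appends a placeholder body and notes the
 * flag value in effect while it ran. out must be NUL-terminated. */
static void emit_original_body(char *out, size_t cap)
{
    mode_seen_by_body = stub_mode_flag;
    strncat(out, "{ /* original body */ }\n", cap - strlen(out) - 1);
}

/* Wraps the body in #if 0 / #endif with stub mode switched off inside. */
static void emit_suppressed_body(char *out, size_t cap)
{
    int saved;

    strncat(out, "#if 0\n", cap - strlen(out) - 1);
    saved = stub_mode_flag;
    stub_mode_flag = 0;              /* nested constructs generate normally */
    emit_original_body(out, cap);
    stub_mode_flag = saved;          /* restore before emitting #endif */
    strncat(out, "#endif\n", cap - strlen(out) - 1);
}
```

The caller's view of the flag is unchanged after the call, while the body emitter only ever observes 0.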
The -static-global-template-stub Flag
The CLI flag -static-global-template-stub=true controls how template __global__ functions are stubbed. When enabled, template kernel stubs receive static linkage, which avoids ODR violations when the same template kernel is instantiated in multiple translation units during whole-program compilation (-rdc=false).
The flag produces two diagnostic messages when it encounters problematic patterns:
- Extern template kernel: "when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false")" -- An extern template kernel cannot receive a static stub because the definitions would conflict across TUs.
- Missing definition: "when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit" -- The static stub requires a local definition to replace.
Both diagnostics recommend either switching to -rdc=true (separate compilation) or explicitly setting -static-global-template-stub=false.
Diagnostic Push/Pop Around Stubs
Before emitting device stub declarations, gen_routine_decl wraps the output in compiler-specific diagnostic suppression to prevent spurious warnings:
For GCC/Clang hosts (dword_126E1F8 set, version > 0x9E97 = 40599):
sub_467E50("\n#pragma GCC diagnostic push\n");
sub_467E50("#pragma GCC diagnostic ignored \"-Wunused-parameter\"\n");
// ... stub emission ...
sub_467E50("\n#pragma GCC diagnostic pop\n");
For MSVC hosts (dword_126E1D8 set):
sub_467E50("\n__pragma(warning(push))\n");
sub_467E50("__pragma(warning(disable : 4100))\n"); // unreferenced formal parameter
// ... stub emission ...
sub_467E50("\n__pragma(warning(pop))\n");
For static template specialization stubs, an additional warning is suppressed:
- GCC/Clang: #pragma GCC diagnostic ignored "-Wunused-function" (warning 4505 on MSVC: "unreferenced local function has been removed")
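The host-compiler branch can be summarised as a lookup that returns the three strings to emit around a stub. This is a sketch, assuming that hosts matching neither condition get no wrapping (the text above does not say what happens in that case); host_kind, diag_wrap and pick_diag_wrap are hypothetical names, and the meaning of the version encoding behind 0x9E97 is not asserted here.

```c
#include <stddef.h>

enum host_kind { HOST_GNU, HOST_MSVC, HOST_OTHER };

struct diag_wrap {
    const char *push;     /* emitted before the stub */
    const char *ignore;   /* warning suppression line */
    const char *pop;      /* emitted after the stub */
};

/* Hypothetical selector for the unused-parameter suppression strings.
 * Returns all-NULL members when no wrapping applies (an assumption). */
static struct diag_wrap pick_diag_wrap(enum host_kind host, long gnu_version)
{
    struct diag_wrap none = { NULL, NULL, NULL };

    if (host == HOST_GNU && gnu_version > 0x9E97) {   /* 0x9E97 == 40599 */
        struct diag_wrap gnu = {
            "\n#pragma GCC diagnostic push\n",
            "#pragma GCC diagnostic ignored \"-Wunused-parameter\"\n",
            "\n#pragma GCC diagnostic pop\n"
        };
        return gnu;
    }
    if (host == HOST_MSVC) {
        struct diag_wrap msvc = {
            "\n__pragma(warning(push))\n",
            "__pragma(warning(disable : 4100))\n",
            "\n__pragma(warning(pop))\n"
        };
        return msvc;
    }
    return none;
}
```

A caller would emit push and ignore, generate the stub, then emit pop, skipping all three when the members are NULL.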
Deferred Function List for Whole-Program Mode
When dword_106BFBC (a whole-program compilation flag) is set and dword_106BFDC is clear, instead of emitting a dummy body immediately, gen_routine_decl adds the function to a deferred list:
// sub_47BFD0, decompiled lines 1713-1745
v117 = sub_6B7340(32); // allocate 32-byte node
v117[0] = qword_1065840; // link to previous head
v117[1] = source_start; // source position start
v117[2] = source_end; // source position end
if (has_name)
v117[3] = strdup(name); // copy of function name
else
v117[3] = NULL;
qword_1065840 = v117; // push onto list head
This deferred list (qword_1065840) is later consumed during the breakpoint placeholder generation phase in process_file_scope_entities (sub_489000), where each deferred entry produces a static __attribute__((used)) void __nv_breakpoint_placeholder<N>_<name>(void) { exit(0); } function.
Function Map
| Address | Name | Role |
|---|---|---|
| sub_47BFD0 | gen_routine_decl | Main stub generator; 1831 lines; handles all function declarations |
| sub_473F10 | gen_bare_name | Character-by-character name emission with __wrapper__device_stub_ prefix |
| sub_474BB0 | gen_entity_name | Parameter name emission for forwarding calls |
| sub_474D60 | gen_scope_qualifier | Recursive namespace path emission (ns1::ns2::) |
| sub_478900 | gen_parameter_list | Parameter list with type transformation in stub mode |
| sub_478D70 | gen_function_declarator_with_scope | Full function declarator with cv-qualifiers and ref-qualifiers |
| sub_47AEF0 | gen_statement_full | Statement generator used for emitting original body inside #if 0 |
| sub_47ECC0 | gen_template / process_source_sequence | Top-level dispatch; also sets dword_1065850 for instantiation directives |
| sub_46BC80 | (emit #if directive) | Emits #if 0 / #if 1 preprocessor lines |
| sub_467E50 | (emit string) | Primary string emission to output stream |
| sub_468190 | (emit raw string) | Raw string emission (no line directive) |
| sub_489000 | process_file_scope_entities | Backend entry point; consumes deferred function list |
Concrete Example: Simple Kernel Stub Output
Given this input CUDA source:
__global__ void add_one(int *data, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n)
data[idx] += 1;
}
cudafe++ generates the following in the .int.c host translation file. The toggle fires at the top of gen_routine_decl (0->1), so the static stub definition is emitted FIRST, followed by the forwarding body from the recursive call.
Output 1: Static Stub Definition (first call, dword_1065850 == 1 after toggle)
The static stub provides the definition that the forwarding body calls. Diagnostic pragmas wrap the declaration to suppress unused-parameter warnings:
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-parameter"
static void __wrapper__device_stub_add_one(int *data, int n) {
::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
#if 0
/* Original kernel body -- hidden from host compiler */
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n)
data[idx] += 1;
}
#endif
#pragma GCC diagnostic pop
The static storage class is forced by the check at decompiled line 897-903. The __wrapper__device_stub_ prefix is emitted by gen_bare_name (sub_473F10). The cudaLaunchKernel placeholder body comes from the string literal at 0x839CB8.
Output 2: Forwarding Body (recursive call, dword_1065850 == 0 after toggle)
After the static stub is emitted and gen_routine_decl recurses, the forwarding body replaces the original kernel body. The __global__ attribute is stripped (kernels become regular host functions in .int.c):
void add_one(int *data, int n) {__wrapper__device_stub_add_one(data, n);return;}
#if 0
/* Original kernel body -- hidden from host compiler (emitted again) */
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n)
data[idx] += 1;
}
#endif
The forwarding body is assembled piece by piece through successive sub_468190 calls:
- { -- open brace
- Scope qualifier (none for file-scope kernels; ns:: for namespaced ones)
- __wrapper__device_stub_ -- the stub prefix from string at 0x839420
- add_one -- the original function name from entity + 8
- (data, n) -- parameter names forwarded (no types, just names via sub_474BB0)
- );return;} -- close the forwarding call and return
The original body appears in #if 0 in both outputs because both code paths reach the same LABEL_457 -> sub_46BC80("#if 0") emission point.
Template Kernel Example
For a template kernel:
template<typename T>
__global__ void scale(T *data, T factor, int n) { /* ... */ }
// explicit instantiation
template __global__ void scale<float>(float *, float, int);
Output 1 (first call, dword_1065850 == 1) produces a specialization stub:
__specialization_static void __wrapper__device_stub_scale(float *data, float factor, int n) {
::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
Output 2 (recursive call, dword_1065850 == 0) produces a forwarding stub with template arguments:
template<typename T>
void scale(T *data, T factor, int n) {__wrapper__device_stub_scale<T>(data, factor, n);return;}
The __specialization_ prefix is emitted only when the entity is a template specialization (v153 != 0) and dword_1065850 is set (decompiled line 901-902).
Device-Only Function Example
For a non-kernel __device__ function with a body:
__device__ int device_helper(int x, int y) {
return x + y;
}
The host output uses a dummy body instead of a forwarding stub (since there is no __wrapper__device_stub_ target for non-kernel functions):
__attribute__((unused)) int device_helper(int x, int y) {int volatile ___ = 1;(void)x;(void)y;::exit(___);}
#if 0
{
return x + y;
}
#endif
The __attribute__((unused)) prefix is emitted when the function's execution space is device-only ((byte_182 & 0x70) == 0x20) and dword_126E1F8 (GCC host compiler mode) is set (decompiled line 905-906).
Cross-References
- Execution Spaces -- byte +182 bitfield that drives the __global__ check; complete redeclaration matrix
- Device/Host Separation -- IL marking that determines which functions need stubs; the dword_1065850 toggle lifecycle
- RDC Mode -- separate compilation mode that affects stub linkage
- .int.c File Format -- overall structure of the generated host file
- CUDA Runtime Boilerplate -- managed memory initialization emitted alongside stubs
RDC Mode
CUDA supports two compilation models that fundamentally change how cudafe++ processes device code: whole-program mode (-rdc=false, the default) and separate compilation mode (-rdc=true, also called Relocatable Device Code). The mode switch affects error checking, stub linkage, module ID generation, anonymous namespace mangling, and -- when multiple translation units are involved -- triggers EDG's cross-TU correspondence machinery for structural type verification.
From cudafe++'s perspective, the distinction maps to a single CLI flag (--device-c, flag index 77) and a handful of global booleans that gate code paths throughout the binary. This page documents what changes between the two modes, how module IDs are generated, how cross-TU IL correspondence works, and how host stub linkage is controlled.
Key Facts
| Property | Value |
|---|---|
| RDC CLI flag | --device-c (flag index 77, no argument) |
| Whole-program mode flag | dword_106BFBC (also set by --debug_mode) |
| Module ID cache | qword_126F0C0 (cached string, computed once) |
| Module ID generator | sub_5AF830 (make_module_id, ~450 lines) |
| Module ID setter | sub_5AF7F0 (set_module_id) |
| Module ID getter | sub_5AF820 (get_module_id) |
| Module ID file writer | sub_5B0180 (write_module_id_to_file) |
| Module ID file flag | --gen_module_id_file (flag 83) |
| Module ID file path | --module_id_file_name (flag 87) |
| Cross-TU IL copier | sub_796BA0 (copy_secondary_trans_unit_IL_to_primary, trans_copy.c) |
| Cross-TU usage marker | sub_796C00 (mark_secondary_IL_entities_used_from_primary) |
| Class correspondence | sub_7A00D0 (verify_class_type_correspondence, 703 lines) |
| TU processing entry | sub_7A40A0 (process_translation_unit) |
| TU switch | sub_7A3D60 (switch_translation_unit) |
| Host stub linkage flag | --host-stub-linkage-explicit (flag 47) |
| Static host stub flag | --static-host-stub (flag 48) |
| Static template stub flag | --static-global-template-stub (set_flag mechanism) |
| EDG source files | host_envir.c (module ID), trans_copy.c, trans_corresp.c, trans_unit.c |
Whole-Program Mode (-rdc=false)
Whole-program mode is the default. All device code for a given translation unit must be defined within that single .cu file, and no external device symbols are allowed. Because every device reference resolves inside the TU, nvlink is not required for device code linking.
Constraints Enforced
Five diagnostics are specific to whole-program mode or are closely tied to the internal-linkage consequences of non-RDC compilation:
1. Inline device/constant/managed variables must have internal linkage.
An inline __device__/__constant__/__managed__ variable must have
internal linkage when the program is compiled in whole program
mode (-rdc=false)
In whole-program mode there is no device link step to resolve external inline variables across TUs. An inline __device__ variable with external linkage would need cross-TU deduplication that only nvlink can provide. The frontend therefore requires static (or anonymous-namespace) linkage and emits an error if the variable has external linkage.
2. Extern __global__ function templates are forbidden (with -static-global-template-stub=true).
when "-static-global-template-stub=true", extern __global__ function
template is not supported in whole program compilation mode ("-rdc=false").
To resolve the issue, either use separate compilation mode ("-rdc=true"),
or explicitly set "-static-global-template-stub=false" (but see nvcc
documentation about downsides of turning it off)
The -static-global-template-stub flag causes template kernel stubs to receive static linkage to avoid ODR violations when the same template is instantiated in multiple host-side compilation units. An extern template declaration conflicts with this because the extern stub expects an external definition while the static stub forces a local one. The diagnostic tag for this is extern_kernel_template.
3. __global__ template instantiations must have local definitions (with -static-global-template-stub=true).
when "-static-global-template-stub=true" in whole program compilation
mode ("-rdc=false"), a __global__ function template instantiation or
specialization (%sq) must have a definition in the current translation
unit.
A static stub requires a definition in the same TU. If the instantiation point references a template defined in another header without an explicit instantiation, the stub has no body to emit. The diagnostic tag is template_global_no_def.
Both template-related diagnostics recommend either switching to -rdc=true or setting -static-global-template-stub=false. The 4 usage contexts in the binary for -static-global-template-stub all appear in error message strings (at addresses 0x88E588 and 0x88E6E0).
4. Kernel launch from __device__ or __global__ functions requires separate compilation.
kernel launch from __device__ or __global__ functions requires
separate compilation mode
Dynamic parallelism -- launching a kernel from device code (a __device__ or __global__ function calling <<<...>>>) -- requires the device linker (nvlink) to resolve cross-module kernel references. In whole-program mode, no device linking occurs, so the construct is illegal. The diagnostic tag is device_launch_no_sepcomp.
5. Address of internal linkage device function (bug mitigation).
address of internal linkage device function (%sq) was taken
(nv bug 2001144). mitigation: no mitigation required if the
address is not used for comparison, or if the target function
is not a CUDA C++ builtin. Otherwise, write a wrapper function
to call the builtin, and take the address of the wrapper
function instead
This diagnostic fires in whole-program mode when code takes the address of a static __device__ function. Because device functions with internal linkage get module-ID-based name mangling, their addresses may differ across compilations or across TUs even when they refer to the "same" function. The warning documents a known NVIDIA bug (2001144) and provides a workaround: wrap the builtin in a non-internal function and take the wrapper's address instead. This diagnostic has no associated tag name -- it is emitted unconditionally when the condition is detected.
Deferred Function List
When dword_106BFBC (whole-program mode) is set and dword_106BFDC (skip-device-only) is clear, gen_routine_decl (sub_47BFD0) adds device-only functions to a deferred linked list (qword_1065840) rather than emitting dummy bodies inline. Each list node is 32 bytes:
| Offset | Field |
|---|---|
| +0 | next pointer |
| +8 | Source position (start) |
| +16 | Source position (end) |
| +24 | Name string (strdup'd, or NULL) |
This list is consumed during the breakpoint placeholder phase in process_file_scope_entities (sub_489000), where each entry produces a static __attribute__((used)) void __nv_breakpoint_placeholder<N>_<name>(void) { exit(0); } function for debugger support.
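The node layout above can be sketched as a C struct. This is a minimal reconstruction assuming an LP64 target (8-byte pointers) and 8-byte opaque source positions; the type name `deferred_fn`, the field names, and the helper `defer_function` are hypothetical, and whether the binary prepends or appends to the list is not established here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reconstruction of one deferred-list node (32 bytes on LP64). */
typedef struct deferred_fn {
    struct deferred_fn *next;   /* +0:  next pointer */
    uint64_t pos_start;         /* +8:  source position (start), treated as opaque */
    uint64_t pos_end;           /* +16: source position (end), treated as opaque */
    char *name;                 /* +24: owned copy of the name, or NULL */
} deferred_fn;

/* Prepend a node (an assumption; the real insertion order is unknown).
   Returns the new head, or the old head if allocation fails. */
static deferred_fn *defer_function(deferred_fn *head, uint64_t start,
                                   uint64_t end, const char *name)
{
    deferred_fn *n = malloc(sizeof *n);
    if (!n)
        return head;
    n->next = head;
    n->pos_start = start;
    n->pos_end = end;
    n->name = NULL;
    if (name) {
        size_t len = strlen(name) + 1;
        n->name = malloc(len);
        if (n->name)
            memcpy(n->name, name, len);
    }
    return n;
}
```

The four 8-byte fields account for the full 32 bytes, so no padding is implied by the documented offsets.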
Separate Compilation Mode (-rdc=true)
When nvcc passes --device-c (flag index 77) to cudafe++, separate compilation mode is activated. This:
- Allows __device__, __constant__, and __managed__ variables to have external linkage
- Permits extern __global__ template functions
- Enables dynamic parallelism (kernel launches from device code)
- Requires nvlink to resolve device-side cross-TU references
- Generates a module ID that uniquely identifies each compilation unit for runtime registration
In this mode, the host stubs are generated with external linkage (by default) so the host linker can resolve cross-TU kernel calls. The module ID is embedded in the registration code to match host stubs with their corresponding device fatbinary segments.
Multi-TU Processing in EDG
When multiple translation units are compiled in a single cudafe++ invocation (as happens during RDC compilation with nvcc), the EDG frontend processes them sequentially using a stack-based TU management system:
| Global | Purpose |
|---|---|
| qword_106BA10 | Current translation unit pointer |
| qword_106B9F0 | Primary (first) translation unit |
| qword_106BA18 | TU stack top |
| dword_106B9E8 | TU stack depth (excluding primary) |
process_translation_unit (sub_7A40A0, trans_unit.c) is the main entry point called from main() for each source file:
- Allocates a 424-byte TU descriptor via sub_6BA0D0
- Initializes scope state and copies registered variable defaults
- Sets the primary TU pointer (qword_106B9F0) for the first file
- Links the TU into the processing chain
- Opens the source file and sets up include paths
- Runs the parser (sub_586240)
- Dispatches to standard compilation (sub_4E8A60) or module compilation (sub_6FDDF0)
- Calls finalization (sub_588E90)
- Pops the TU from the stack
switch_translation_unit (sub_7A3D60, trans_unit.c, line 514) saves/restores per-TU state when the frontend needs to reference entities from a different TU:
- Asserts qword_106BA10 != 0 (current TU exists)
- If target differs from current: saves current TU via sub_7A3A50
- Restores target TU state via memcpy from per-TU buffer
- Sets qword_106BA10 = target
- Restores scope chain: xmmword_126EB60, qword_126EB70, etc.
- Recomputes file scope indices via sub_704490
Per-TU state is registered through f_register_trans_unit_variable (sub_7A3C00, trans_unit.c, line 227), which accumulates variables into a linked list (qword_12C7AA8). Each registration record is 40 bytes with fields for the variable pointer, name, prior size, and buffer offset. The total per-TU buffer size is tracked in qword_12C7A98.
Three core variables are always registered (sub_7A4690):
- dword_106BA08 (is_recompilation), 4 bytes
- qword_106BA00 (current_filename), 8 bytes
- dword_106B9F8 (has_module_info), 4 bytes
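The registration and save/restore scheme can be illustrated with a small sketch. It assumes a fixed-size array instead of the linked list the binary uses, and it packs saved copies back to back with no alignment padding; the names `tu_var`, `register_tu_var`, `save_tu_state`, and `restore_tu_state` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TU_VARS 16

/* Hypothetical registration record: where a global lives and where its
   saved copy sits inside a per-TU buffer. */
typedef struct {
    void *addr;        /* address of the registered global */
    const char *name;  /* diagnostic name */
    size_t size;       /* number of bytes to save */
    size_t offset;     /* offset of the saved copy in the per-TU buffer */
} tu_var;

static tu_var tu_vars[MAX_TU_VARS];
static size_t tu_var_count;
static size_t tu_buffer_size;  /* running total, playing the role of qword_12C7A98 */

/* Register one global; returns 0 on success, -1 if the table is full. */
static int register_tu_var(void *addr, const char *name, size_t size)
{
    if (tu_var_count == MAX_TU_VARS)
        return -1;
    tu_var *v = &tu_vars[tu_var_count++];
    v->addr = addr;
    v->name = name;
    v->size = size;
    v->offset = tu_buffer_size;
    tu_buffer_size += size;
    return 0;
}

/* Copy every registered global into buf (at least tu_buffer_size bytes). */
static void save_tu_state(unsigned char *buf)
{
    for (size_t i = 0; i < tu_var_count; i++)
        memcpy(buf + tu_vars[i].offset, tu_vars[i].addr, tu_vars[i].size);
}

/* Copy the saved bytes back into the registered globals. */
static void restore_tu_state(const unsigned char *buf)
{
    for (size_t i = 0; i < tu_var_count; i++)
        memcpy(tu_vars[i].addr, buf + tu_vars[i].offset, tu_vars[i].size);
}
```

Switching TUs then amounts to one save into the outgoing TU's buffer followed by one restore from the incoming TU's buffer, which matches the memcpy-based description of switch_translation_unit.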
Module ID Generation
Every compilation unit in CUDA needs a unique identifier to associate host-side registration code with the correct device fatbinary. This identifier -- the module ID -- is generated by make_module_id (sub_5AF830, host_envir.c, ~450 lines) and cached in qword_126F0C0.
Algorithm
The module ID generator has three source modes, tried in order:
Mode 1: Module ID file. If qword_106BF80 (set by --module_id_file_name) is non-NULL, the entire contents of the specified file are read and used as the module ID. This allows build systems to inject deterministic identifiers.
Mode 2: Explicit numeric token. If the caller provides a non-NULL string argument (nptr), it is parsed via strtoul. If the parse succeeds, the numeric value is used directly. If the parse fails (the string is not a pure integer), the string itself is CRC32-hashed and the hash is used.
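A "pure integer" check with strtoul is usually written with an end pointer. The sketch below is one way to do it, assuming base 10 (the base used by the binary is not stated here); `parse_numeric_token` is a hypothetical name.

```c
#include <errno.h>
#include <stdlib.h>

/* Returns 1 and stores the value if the whole string is consumed by
   strtoul in base 10; returns 0 otherwise. Note that strtoul itself
   accepts leading whitespace and a sign, so those pass this check too. */
static int parse_numeric_token(const char *s, unsigned long *out)
{
    char *end;
    if (s == NULL || *s == '\0')
        return 0;
    errno = 0;
    unsigned long v = strtoul(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0')
        return 0;  /* overflow, no digits, or trailing characters */
    *out = v;
    return 1;
}
```

When the check fails, the documented fallback is to CRC32-hash the string itself.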
Mode 3: Default computation. The default path builds the ID from several components:
- Calls stat() on the source file to obtain mtime
- Formats ctime() of the modification time
- Reads getpid() for the current process ID
- Collects qword_106C038 (command-line options hash input)
- Computes the CRC32 hash of the options string
- Takes the output filename, strips it to basename
- If the source filename exceeds 8 characters, replaces it with its CRC32 hex representation
The final string is assembled in the format:
{options_crc}_{output_name_len}_{output_name}_{source_or_crc}[_{extra}][_{pid}]
All non-alphanumeric characters in the result are replaced with underscores. The string is allocated permanently and cached in qword_126F0C0.
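The assembly and sanitizing steps can be sketched as follows. This is an illustration of the documented layout only: it omits the optional {extra} component, takes the CRC fields as preformatted strings, and uses the hypothetical names `sanitize_module_id` and `build_module_id`.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Replace every non-alphanumeric character with '_' in place. */
static void sanitize_module_id(char *s)
{
    for (; *s; s++)
        if (!isalnum((unsigned char)*s))
            *s = '_';
}

/* Assemble {options_crc}_{output_name_len}_{output_name}_{source_or_crc}[_{pid}].
   Pass pid < 0 to omit the PID suffix. Returns 0 on success, -1 if the
   buffer is too small. */
static int build_module_id(char *buf, size_t cap, const char *options_crc,
                           const char *output_name, const char *source_part,
                           long pid)
{
    int n = snprintf(buf, cap, "%s_%zu_%s_%s", options_crc,
                     strlen(output_name), output_name, source_part);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    if (pid >= 0) {
        int m = snprintf(buf + n, cap - (size_t)n, "_%ld", pid);
        if (m < 0 || (size_t)m >= cap - (size_t)n)
            return -1;
    }
    sanitize_module_id(buf);
    return 0;
}
```

For example, the inputs "1a2b3c4d", "k.o", "k.cu", and PID 42 yield 1a2b3c4d_3_k_o_k_cu_42, since the dots are replaced during sanitizing.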
Debug tracing (gated by dword_126EFC8) emits:
make_module_id: str1 = %s, str2 = %s, pid = %ld
make_module_id: final string = %s
CRC32 Implementation
The function contains an inline CRC32 implementation that appears three times (for the options hash, the source filename, and the extra string). All three copies use the same algorithm:
- Polynomial: 0xEDB88320 (standard reflected CRC-32)
- Initial value: 0xFFFFFFFF
- Processing: bit-by-bit, 8 iterations per byte
- Final XOR: no explicit step identified (the reflected form does not supply one by itself; CRC-32/ISO-HDLC applies 0xFFFFFFFF)

The triple inlining suggests the CRC32 was originally a macro or small inline function that the compiler expanded at each call site. The polynomial 0xEDB88320 is the bit-reversed form of the standard CRC-32 polynomial 0x04C11DB7. Together with the 0xFFFFFFFF initial value, this is consistent with the ubiquitous CRC-32/ISO-HDLC algorithm, which additionally complements the result (final XOR 0xFFFFFFFF).
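For reference, a bit-by-bit reflected CRC-32 with these parameters looks like the sketch below. It implements standard CRC-32/ISO-HDLC, including the final complement; whether the inlined copies in the binary complement the result is an assumption here, and `crc32_bitwise` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise reflected CRC-32 (CRC-32/ISO-HDLC): polynomial 0xEDB88320,
   initial value 0xFFFFFFFF, 8 shift steps per byte, final complement. */
static uint32_t crc32_bitwise(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, zero otherwise */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;  /* assumption: drop this line if the binary omits it */
}
```

The standard check value for this algorithm is 0xCBF43926 for the nine ASCII bytes "123456789", which gives a quick way to compare a reimplementation against the decompiled loop.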
PID Incorporation
The getpid() call ensures that concurrent compilations of the same source file produce different module IDs. Without the PID, two parallel nvcc invocations compiling the same .cu file with the same flags would generate identical module IDs, potentially causing runtime registration collisions. The PID is appended as the final underscore-separated component.
Module ID File Output
When --gen_module_id_file (flag 83) is set, write_module_id_to_file (sub_5B0180) generates the module ID via sub_5AF830(0) and writes it to the file specified by qword_106BF80 (--module_id_file_name, flag 87). If the filename is not set, it emits "module id filename not specified". If the write fails, it emits "error writing module id to file".
In the backend output phase, if dword_106BFB8 (emit-symbol-table flag) is set, sub_5B0180 is also called to write the module ID before the host reference arrays are emitted.
Entity-Based Module ID Selection
An alternative module ID source is available through use_variable_or_routine_for_module_id_if_needed (sub_5CF030, il.c, line 31969, ~65 lines). Rather than computing a hash from file metadata, this function selects a representative entity (variable or function) from the current TU whose mangled name can serve as a stable identifier. The selection criteria are strict:
- Entity kind must be 7 (variable) or 11 (routine), tested via ((kind - 7) & 0xFB) == 0
- Must have a definition (for variables: offset +169 != 0; for routines: has a body)
- Must not be a class member
- Must not be in an unnamed namespace
- Must have storage class == 0 (no explicit static, extern, or register)
- Must not be template-related or marked with special compilation flags
- For routines: must not have explicit specialization, return type must not be a builtin
The selected entity is stored in qword_126F140 with its kind byte in byte_126F138 (7 for variable, 11 for routine). This entity's name is then fed into sub_5AF830 to produce the final module ID string. The entity-based approach provides a more deterministic ID than the PID-based default, since it is derived from source content rather than runtime state.
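The kind test in the first criterion is a compact way to accept exactly two values. A minimal sketch, assuming the kind is a single byte (the helper name and enum constants are hypothetical):

```c
#include <stdint.h>

enum { KIND_VARIABLE = 7, KIND_ROUTINE = 11 };

/* Subtracting 7 maps the two accepted kinds to 0 and 4. Masking with
   0xFB clears bit 2, so both become 0; every other byte value keeps
   at least one bit set. The explicit parentheses matter in C because
   == binds tighter than &. */
static int is_variable_or_routine(uint8_t kind)
{
    return ((uint8_t)(kind - KIND_VARIABLE) & 0xFB) == 0;
}
```

Kinds 8, 9, 10, and 12 map to 1, 2, 3, and 5 after the subtraction, none of which is cleared by the mask, so they are rejected.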
Anonymous Namespace Mangling
The module ID directly controls how anonymous namespaces are mangled in the .int.c output. The function sub_6BC7E0 (in nv_transforms.c) constructs the anonymous namespace identifier:
// sub_6BC7E0 implementation:
if (qword_1286A00) // cached?
return qword_1286A00;
module_id = sub_5AF830(0); // get or compute module ID
buf = malloc(strlen(module_id) + 12); // "_GLOBAL__N_" = 11 chars + NUL
strcpy(buf, "_GLOBAL__N_");
strcat(buf, module_id);
qword_1286A00 = buf; // cache for reuse
return buf;
This _GLOBAL__N_<module_id> string is emitted in the .int.c trailer as:
#define _NV_ANON_NAMESPACE _GLOBAL__N_<module_id>
#ifdef _NV_ANON_NAMESPACE
#endif
#include "<source_file>"
#undef _NV_ANON_NAMESPACE
The #define gives anonymous namespace entities a stable, unique mangled name that is consistent between the device and host compilation paths. The #ifdef/#endif guard is defensive -- it tests that the macro was defined (it always is at this point). The #include re-includes the original source file with the macro defined, allowing the host compiler to see the anonymous namespace entities with their module-ID-qualified names. The #undef cleans up to avoid polluting later inclusions.
The anonymous namespace hash also appears during host reference array name construction. For static or anonymous-namespace device entities, the scoped name prefix builder (sub_6BD2F0) inserts _GLOBAL__N_<module_id> as the namespace component, ensuring the mangled name in the .nvHR* section uniquely identifies the entity even across TUs with the same anonymous namespace structure.
Usage in Output
The module ID is used in three places, two of them in the generated .int.c output:
- Anonymous namespace mangling: sub_6BC7E0 constructs _GLOBAL__N_<module_id> for anonymous-namespace symbols in device code, producing unique mangled names per TU.
- Registration boilerplate: The __cudaRegisterFatBinary call passes the module ID to the CUDA runtime, which uses it to match host registration with the correct device fatbinary.
- Module ID file: When requested, the ID is written to a separate file for consumption by the build system or nvlink.
Cross-TU IL Correspondence
When multiple TUs are processed in a single cudafe++ invocation, the same C++ types, templates, and declarations may appear in multiple TUs. EDG's correspondence system verifies structural equivalence and establishes canonical entries to avoid duplicate definitions in the merged output.
trans_copy.c: IL Copying Between TUs
The trans_copy.c file contains a single function at address 0x796BA0:
copy_secondary_trans_unit_IL_to_primary -- Copies IL entries from secondary translation units into the primary TU's IL tree. Called after all TUs have been parsed, during the fe_wrapup finalization phase (specifically, after the 5-pass multi-TU iteration). This function ensures that device-reachable IL entries from secondary TUs are available in the primary TU's output scope.
A closely related function exists at 0x796C00:
mark_secondary_IL_entities_used_from_primary (sub_796C00) -- Called during fe_wrapup pass 2 (IL lowering), before the TU iteration loop that applies sub_707040 to each TU's file-scope IL. This function marks IL entities in secondary TUs that are referenced from the primary TU, ensuring they survive any dead-code elimination in later passes.
trans_corresp.c: Structural Equivalence Checking
The trans_corresp.c file (address range 0x796E60--0x7A3420, 88 functions) implements the full cross-TU correspondence verification system. The core functions:
verify_class_type_correspondence (sub_7A00D0, 703 lines) is the centerpiece. It performs a deep structural comparison of two class types from different TUs:
- Base class comparison via sub_7A27B0 (verify_base_class_correspondence) -- iterates base class lists, comparing virtual/non-virtual status, accessibility, and type identity
- Friend declaration comparison via sub_7A1830 (verify_friend_declaration_correspondence) -- walks friend lists checking structural equivalence
- Member function comparison via sub_7A1DB0 (verify_member_function_correspondence, 411 lines) -- compares function signatures, attributes, constexpr status, and virtual overrides
- Nested type comparison via sub_798960 (equiv_member_constants) -- verifies nested class/enum/typedef correspondence
- Template parameter comparison via sub_7B2260 -- validates template parameter lists match structurally
- Using declaration comparison -- dispatches by kind: 36 = alias, 6/11 = using declaration, 7/58 = namespace using declaration
If any comparison fails, the function delegates to sub_797180 to emit a diagnostic (error codes 1795/1796), then falls through to f_set_no_trans_unit_corresp (5 variants at sub_797B50-sub_7981A0 for different entity kinds).
The type node layout used by the correspondence system:
- Offset +132: type kind (9=struct, 10=class, 11=union)
- Offset +144: referenced type / next pointer
- Offset +152: class info pointer
- Offset +161: flags byte (bits for anonymous, elaborated, template, local)
- Class info at +128: scope block with members at indexed offsets [12], [13], [14], [18], [22]
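When working from raw offsets like these, small accessor helpers keep the layout knowledge in one place. The sketch below assumes the type kind at +132 is a single byte and that the class info pointer at +152 is a native 8-byte pointer; both widths are assumptions, and all names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum { TK_STRUCT = 9, TK_CLASS = 10, TK_UNION = 11 };

/* Read the type kind at offset +132 (assumed to be one byte wide). */
static uint8_t type_kind(const unsigned char *node)
{
    return node[132];
}

/* Read the class info pointer at offset +152. memcpy avoids
   alignment and aliasing problems when reading from a byte buffer. */
static const void *type_class_info(const unsigned char *node)
{
    const void *p;
    memcpy(&p, node + 152, sizeof p);
    return p;
}

/* True for the three class-like kinds handled by the class verifier. */
static int is_class_like(const unsigned char *node)
{
    uint8_t k = type_kind(node);
    return k == TK_STRUCT || k == TK_CLASS || k == TK_UNION;
}
```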
Supporting verification functions:
| Address | Name | Scope |
|---|---|---|
| sub_7A0E10 | verify_enum_type_correspondence | Enum underlying type and enumerator list |
| sub_7A1230 | verify_function_type_correspondence | Parameter and return type |
| sub_7A1390 | verify_type_correspondence | Dispatcher to class/enum/function variants |
| sub_7A1460 | set_type_correspondence | Links two types as corresponding |
| sub_7A1CC0 | verify_nested_class_body_correspondence | Nested class scope comparison |
| sub_7A2C10 | verify_template_parameter_correspondence | Template parameter list |
| sub_7A3140 | check_decl_correspondence_with_body | Declaration with definition |
| sub_7A3420 | check_decl_correspondence_without_body | Declaration-only case |
| sub_7A38A0 | check_decl_correspondence | Dispatcher (with/without body) |
| sub_7A38D0 | same_source_position | Source position comparison |
| sub_7999C0 | find_template_correspondence | Cross-TU template entity matching (601 lines) |
| sub_79A5A0 | determine_correspondence | General correspondence determination |
| sub_79B8D0 | mark_canonical_instantiation | Updates instantiation canonical status |
| sub_79C1A0 | get_canonical_entry_of | Returns canonical entity for a TU entry |
| sub_79D080 | establish_instantiation_correspondences | Links instantiations across TUs |
| sub_79DFC0 | set_type_corresp | Sets type correspondence |
| sub_79E760 | find_routine_correspondence | Cross-TU function matching |
| sub_79F320 | find_namespace_correspondence | Cross-TU namespace matching |
Correspondence Lifecycle
The correspondence system uses three hash tables (qword_12C7800, qword_12C7880, qword_12C7900, each 0x70 bytes / 14 slots) plus linked lists to track established correspondences. The lifecycle:
- Registration (sub_7A3920): Registers three global variables (dword_106B9E4, dword_106B9E0, qword_12C7798) for per-TU save/restore
- Initialization (sub_7A3980): Zeroes all correspondence hash tables and list pointers
- Discovery during parsing: As the secondary TU is parsed, types/functions that match primary-TU entities are identified through name and scope comparison
- Verification: verify_class_type_correspondence and its siblings perform deep structural comparison
- Linkage: set_type_correspondence (sub_7A1460) and f_set_trans_unit_corresp (sub_79C400, 511 lines) connect matching entities
- Canonicalization: canonical_ranking (sub_796E60) determines which TU's entity is the canonical representative; mark_canonical_instantiation (sub_79B8D0) updates instantiation records
The correspondence allocation uses 24-byte nodes from a free list (qword_12C7AB0) managed by alloc_trans_unit_corresp (sub_7A3B50) and free_trans_unit_corresp (sub_7A3BB0). The free function decrements a refcount at offset +16; when it reaches 1, the node returns to the free list.
Integration with fe_wrapup
The cross-TU correspondence system hooks into the 5-pass multi-TU architecture in fe_wrapup (sub_588E90):
| Pass | Action | Cross-TU Role |
|---|---|---|
| 1 | Per-file IL wrapup (sub_588C60) | Iterates TU chain, prepares file scope IL |
| 2 | IL lowering (sub_707040) | Calls sub_796C00 (mark secondary IL) before loop |
| 3 | IL emission (sub_610420, arg 23) | Marks device-reachable entries per TU |
| 4 | C++ class finalization | Deferred member processing |
| 5 | Per-file part 3 (sub_588D40) | Final per-TU cleanup |
| Post | Cleanup | Calls sub_796BA0 (copy secondary IL to primary) |
After all five passes complete, sub_796BA0 copies remaining secondary-TU IL into the primary TU's tree, and scope renumbering fixes up any index conflicts.
Host Reference Arrays and Linkage Splitting
The six .nvHR* ELF sections emitted in the .int.c output trailer encode device symbol names for CUDA runtime discovery. These arrays are split along two axes: symbol type (kernel, device variable, constant variable) and linkage (external, internal). The split is critical for RDC: external-linkage symbols are globally resolvable by nvlink across all TUs, while internal-linkage symbols are TU-local and require module-ID-based prefixing to avoid collisions.
| Section | Array Name | Symbol Type | Linkage |
|---|---|---|---|
| .nvHRKE | hostRefKernelArrayExternalLinkage | __global__ kernel | external |
| .nvHRKI | hostRefKernelArrayInternalLinkage | __global__ kernel | internal |
| .nvHRDE | hostRefDeviceArrayExternalLinkage | __device__ variable | external |
| .nvHRDI | hostRefDeviceArrayInternalLinkage | __device__ variable | internal |
| .nvHRCE | hostRefConstantArrayExternalLinkage | __constant__ variable | external |
| .nvHRCI | hostRefConstantArrayInternalLinkage | __constant__ variable | internal |
The emission is driven by 6 calls to nv_emit_host_reference_array (sub_6BCF80, 79 lines, nv_transforms.c) with parameters (emit_callback, is_kernel, is_device, is_internal_linkage):
// From sub_489000 (process_file_scope_entities), backend output phase:
if (dword_106BFD0 || dword_106BFCC) {
sub_6BCF80(sub_467E50, 1, 0, 1); // kernel, internal
sub_6BCF80(sub_467E50, 1, 0, 0); // kernel, external
sub_6BCF80(sub_467E50, 0, 1, 1); // device, internal
sub_6BCF80(sub_467E50, 0, 1, 0); // device, external
sub_6BCF80(sub_467E50, 0, 0, 1); // constant, internal
sub_6BCF80(sub_467E50, 0, 0, 0); // constant, external
}
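The flag triples in the six calls line up with the six section names in the table. A minimal sketch of that mapping, assuming the flags are interpreted exactly as the inline comments suggest (`host_ref_section` is a hypothetical helper, not a function found in the binary):

```c
#include <stddef.h>

/* Map (is_kernel, is_device, is_internal) to the .nvHR* section name.
   The combination is_kernel && is_device never occurs in the six
   documented calls, so it is rejected here. */
static const char *host_ref_section(int is_kernel, int is_device, int is_internal)
{
    if (is_kernel && is_device)
        return NULL;
    if (is_kernel)
        return is_internal ? ".nvHRKI" : ".nvHRKE";
    if (is_device)
        return is_internal ? ".nvHRDI" : ".nvHRDE";
    /* neither flag set: __constant__ variable */
    return is_internal ? ".nvHRCI" : ".nvHRCE";
}
```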
Each call iterates a separate global list that was populated during the entity walk:
| List Address | Content |
|---|---|
| unk_1286880 | kernel external |
| unk_12868C0 | kernel internal |
| unk_1286780 | device external |
| unk_12867C0 | device internal |
| unk_1286800 | constant external |
| unk_1286840 | constant internal |
Entity registration into these lists is performed by nv_get_full_nv_static_prefix (sub_6BE300, 370 lines, nv_transforms.c:2164). This function examines each device-annotated entity and routes it to the appropriate list based on its execution space bits (at entity offset +182) and linkage (internal linkage = static or anonymous namespace, determined by flags at entity offset +80).
For internal linkage entities, the function builds a scoped name prefix:
- Recursively constructs the scope path via sub_6BD2F0 (nv_build_scoped_name_prefix)
- For anonymous namespaces, inserts the _GLOBAL__N_<module_id> prefix (via qword_1286A00)
- Hashes the full path with format_string_to_sso (sub_6BD1C0)
- Constructs the prefix: off_E7C768 + len + "_" + filename + "_"
- Caches the prefix in qword_1286760 for reuse
- Appends "_" and the entity's mangled name
For external linkage entities, the path is simpler: the :: scope-qualified name is used directly without module-ID-based prefixing.
The generated output for each symbol:
extern "C" {
extern __attribute__((section(".nvHRKE")))
__attribute__((weak))
const unsigned char hostRefKernelArrayExternalLinkage[] = {
0x5f, 0x5a, /* ... mangled name bytes ... */ 0x00
};
}
The __attribute__((weak)) allows multiple TUs to define the same array without linker errors -- the CUDA runtime reads whichever copy survives.
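The byte initializer in the emitted array is a hex dump of the name followed by a terminating NUL. A sketch of how such an initializer could be formatted for a single name, assuming lowercase two-digit hex separated by ", " as in the sample (`format_name_bytes` is a hypothetical helper, and how multiple names share one array is not covered):

```c
#include <stdio.h>
#include <string.h>

/* Write "0x5f, 0x5a, ..., 0x00" for name, including the trailing NUL
   byte, into out. Returns 0 on success, -1 if out is too small. */
static int format_name_bytes(const char *name, char *out, size_t cap)
{
    size_t used = 0;
    size_t len = strlen(name) + 1;  /* +1 to emit the terminator */
    for (size_t i = 0; i < len; i++) {
        int n = snprintf(out + used, cap - used, "%s0x%02x",
                         i ? ", " : "", (unsigned char)name[i]);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}
```

For the two-character prefix "_Z" that starts every Itanium-mangled name, this produces 0x5f, 0x5a, 0x00, matching the leading bytes in the sample output.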
Host Stub Linkage Flags
Three CLI flags control the linkage of generated host stubs:
--host-stub-linkage-explicit (Flag 47)
When set, host stubs are emitted with explicit linkage specifiers rather than relying on the default linkage of the surrounding context. This ensures that the stub's linkage matches what nvcc/nvlink expects regardless of the source file's linkage context (e.g., inside an anonymous namespace or extern "C" block).
--static-host-stub (Flag 48)
Forces all generated host stubs (__wrapper__device_stub_*) to have static linkage. This is used in single-TU compilation where the stubs do not need to be visible to other object files. It prevents symbol conflicts when the same kernel name appears in multiple compilation units that are linked together.
--static-global-template-stub (set_flag Mechanism)
Unlike the direct CLI flags above, -static-global-template-stub is set through the generic --set_flag mechanism (flag 193), which looks up the name in the off_D47CE0 table and stores the value. It has 4 usage contexts in the binary, all in error message strings.
When enabled (=true), template __global__ function stubs receive static linkage. This prevents ODR violations in whole-program mode when the same template kernel is instantiated in multiple host-side TUs. The tradeoff is that extern template kernels and out-of-TU instantiations become illegal (see the constraints in the whole-program section above).
Output Differences Between Modes
| Output Aspect | Whole-Program (-rdc=false) | Separate Compilation (-rdc=true) |
|---|---|---|
| Host stub linkage | Can be static (with flags 47/48) | External (default) |
| Template stub linkage | static (with -static-global-template-stub) | External |
| Module ID generation | Generated but less critical | Required for registration matching |
| Module ID file | Optional | Typically generated |
| Device code embedding | Inline fatbinary in host object | Relocatable device object (.rdc) |
| nvlink requirement | No | Yes (resolves device symbols) |
| Dynamic parallelism | Forbidden | Allowed |
| Extern device variables | Forbidden | Allowed |
| Anonymous namespace hash | Used for device symbol uniqueness | Used for device symbol uniqueness |
| Deferred function list | Active (breakpoint placeholders) | Behavior depends on dword_106BFDC |
| Cross-TU correspondence | N/A (single TU) | Active when multi-TU invocation |
Global Variables
| Address | Size | Name | Purpose |
|---|---|---|---|
| dword_106BFBC | 4 | whole_program_mode | Whole-program mode; also set by --debug_mode (flag 82, which sets dword_106BFC4=1, dword_106BFC0=1, dword_106BFBC=1) |
| dword_106BFDC | 4 | skip_device_only | Disables deferred function list accumulation |
| dword_106BFB8 | 4 | emit_symbol_table | Emit symbol table + module ID to file |
| dword_106BFD0 | 4 | device_registration | Device registration / cross-space reference checking |
| dword_106BFCC | 4 | constant_registration | Constant registration flag |
| qword_126F0C0 | 8 | cached_module_id | Cached module ID string |
| qword_106BF80 | 8 | module_id_file_path | Module ID file path (from --module_id_file_name) |
| qword_106BA10 | 8 | current_translation_unit | Pointer to current TU descriptor |
| qword_106B9F0 | 8 | primary_translation_unit | Pointer to first TU (primary) |
| qword_106BA18 | 8 | translation_unit_stack | Top of TU stack |
| dword_106B9E8 | 4 | tu_stack_depth | TU stack depth (excluding primary) |
| qword_12C7AA8 | 8 | registered_variable_list_head | Per-TU variable registration list |
| qword_12C7A98 | 8 | per_tu_storage_size | Total per-TU buffer size |
| qword_12C7AB0 | 8 | corresp_free_list | Correspondence node free list |
| qword_12C7AB8 | 8 | stack_entry_free_list | TU stack entry free list |
| qword_1065840 | 8 | deferred_function_list | Breakpoint placeholder linked list head |
Function Map
| Address | Name | Source File | Lines | Role |
|---|---|---|---|---|
| sub_5AF830 | make_module_id | host_envir.c | ~450 | CRC32-based unique TU identifier |
| sub_5AF7F0 | set_module_id | host_envir.c | ~10 | Setter for cached module ID |
| sub_5AF820 | get_module_id | host_envir.c | ~3 | Getter for cached module ID |
| sub_5B0180 | write_module_id_to_file | host_envir.c | ~30 | Writes module ID to file |
| sub_5CF030 | use_variable_or_routine_for_module_id_if_needed | il.c:31969 | ~65 | Selects representative entity for ID |
| sub_6BC7E0 | (anon namespace hash) | nv_transforms.c | ~20 | Generates _GLOBAL__N_<module_id> |
| sub_6BCF80 | nv_emit_host_reference_array | nv_transforms.c | 79 | Emits .nvHR* ELF section with symbol names |
| sub_6BD2F0 | nv_build_scoped_name_prefix | nv_transforms.c | ~95 | Recursive scope-qualified name builder |
| sub_6BE300 | nv_get_full_nv_static_prefix | nv_transforms.c:2164 | ~370 | Scoped name + host ref array registration |
| sub_796BA0 | copy_secondary_trans_unit_IL_to_primary | trans_copy.c | ~50 | Copies secondary TU IL to primary |
| sub_796C00 | mark_secondary_IL_entities_used_from_primary | -- | -- | Marks secondary IL referenced from primary |
| sub_796E60 | canonical_ranking | trans_corresp.c | -- | Determines canonical TU entry |
| sub_7975D0 | may_have_correspondence | trans_corresp.c | -- | Quick correspondence eligibility check |
| sub_797990 | f_change_canonical_entry | trans_corresp.c | -- | Updates canonical representative |
| sub_7983A0 | f_same_name | trans_corresp.c | -- | Cross-TU symbol name comparison |
| sub_79C400 | f_set_trans_unit_corresp | trans_corresp.c | 511 | Establishes entity correspondence |
| sub_7A00D0 | verify_class_type_correspondence | trans_corresp.c | 703 | Deep class structural comparison |
| sub_7A0E10 | verify_enum_type_correspondence | trans_corresp.c | -- | Enum comparison |
| sub_7A1230 | verify_function_type_correspondence | trans_corresp.c | -- | Function type comparison |
| sub_7A1460 | set_type_correspondence | trans_corresp.c | -- | Links corresponding types |
| sub_7A1DB0 | verify_member_function_correspondence | trans_corresp.c | 411 | Member function comparison |
| sub_7A27B0 | verify_base_class_correspondence | trans_corresp.c | -- | Base class list comparison |
| sub_7A3920 | register_trans_corresp_variables | trans_corresp.c | -- | Registers per-TU state variables |
| sub_7A3980 | init_trans_corresp_state | trans_corresp.c | -- | Zeroes all correspondence state |
| sub_7A3A50 | save_translation_unit_state | trans_unit.c | -- | Saves current TU state to buffer |
| sub_7A3C00 | f_register_trans_unit_variable | trans_unit.c | -- | Registers a per-TU variable |
| sub_7A3CF0 | fix_up_translation_unit | trans_unit.c | -- | Finalizes TU state |
| sub_7A3D60 | switch_translation_unit | trans_unit.c | -- | Saves/restores TU context |
| sub_7A3EF0 | push_translation_unit_stack | trans_unit.c | -- | Pushes TU onto stack |
| sub_7A3F70 | pop_translation_unit_stack | trans_unit.c | -- | Pops TU from stack |
| sub_7A40A0 | process_translation_unit | trans_unit.c | -- | Main TU processing entry point |
| sub_7A4690 | register_builtin_trans_unit_variables | trans_unit.c | -- | Registers 3 core per-TU vars |
Cross-References
- Kernel Stub Generation -- -static-global-template-stub details and the stub toggle mechanism
- Device/Host Separation -- How the single-pass tag-and-filter architecture works
- .int.c File Format -- Anonymous namespace mangling and module ID in output
- Backend Code Generation -- Module ID output phase
- Host Reference Arrays -- .nvHR* section format and runtime discovery
- CLI Flag Inventory -- Flag indices 47, 48, 77, 83, 87
- CUDA Error Catalog -- Category 11 (RDC / whole-program diagnostics)
- EDG 6.6 Overview -- Cross-TU correspondence section
- Template Engine -- Template instantiation deduplication across TUs
- Global Variable Index -- All globals referenced here
JIT Mode
JIT mode is a compilation mode where cudafe++ produces device code only -- no host .int.c file, no kernel stubs, no CUDA runtime registration tables. The output is a standalone device IL payload that the rest of the JIT pipeline (as driven by NVRTC's nvrtcCompileProgram) lowers to PTX or cubin, which is then loaded through the CUDA Driver API (cuModuleLoadData, cuModuleLoadDataEx). Because there is no host compiler invocation downstream, anything that belongs exclusively to the host side is illegal: explicit __host__ functions, unannotated functions (which default to __host__), namespace-scope variables without memory-space qualifiers, non-const class static data members, and lambda closures inferred to have __host__ execution space.
The --default-device flag inverts the annotation default -- unannotated entities become __device__ instead of __host__, allowing C++ code written without CUDA annotations to compile directly for the GPU. This is the recommended workaround for all four unannotated-entity diagnostics.
Key Facts
| Property | Value |
|---|---|
| Compilation output | Device IL only (no .int.c, no stubs, no registration) |
| Host output suppression | --gen_c_file_name (flag 45) not supplied by driver |
| Device output path | --gen_device_file_name (flag 85) |
| Default execution space (normal) | __host__ (entity+182 byte == 0x00) |
| Default execution space (JIT + --default-device) | __device__ (entity+182 byte == 0x23) |
| Annotation override flag | --default-device (passed to cudafe++ by NVRTC or nvcc) |
| RDC mode flag | --device-c (flag 77) -- relocatable device code; orthogonal to JIT |
| JIT diagnostic count | 5 error messages (1 explicit-host + 4 unannotated-entity) |
| Diagnostic tag suffix | All five tags end with _in_jit |
| NVRTC integration | NVRTC calls cudafe++ with JIT-appropriate flags internally |
| Driver API consumers | cuModuleLoadData, cuModuleLoadDataEx, cuLinkAddData |
How JIT Mode Is Activated
cudafe++ is never invoked directly by application code. In the standard offline compilation pipeline, nvcc invokes cudafe++ with both --gen_c_file_name (flag 45, the host .int.c path) and --gen_device_file_name (flag 85, the device IL path). Both outputs are generated from a single frontend invocation -- cudafe++ uses a single-pass architecture internally (see Device/Host Separation).
In JIT mode, the driving tool -- typically NVRTC -- invokes cudafe++ with only the device-side output path. The host-output file name (--gen_c_file_name) is not provided, so no .int.c file is generated. The absence of a host output target is what structurally makes this "JIT mode": without a host file, there is no host compiler to feed, and therefore no host-side constructs can be tolerated.
Activation Conditions
JIT mode is not a single user-facing CLI flag. It is an internal compilation state activated by the combination of flags that the driving tool (nvcc or NVRTC) sets when invoking cudafe++:
- NVRTC invocation. NVRTC always invokes cudafe++ in JIT mode. NVRTC compiles CUDA C++ source to PTX at application runtime. There is no host compiler, no host object file, and no linking -- the output is pure device code.
- nvcc --ptx or --cubin without host compilation. When nvcc is asked to produce only PTX or cubin output (no host object), it may invoke cudafe++ with the JIT mode configuration to skip host-side generation entirely.
- Architecture target combined with device-only flags. The internal JIT state is set when the target configuration (--target, flag 245 -> dword_126E4A8) is combined with device-only compilation flags (e.g., --device-syntax-only, flag 72).
The practical effect: when JIT mode is active, the entire implicit-host-annotation system becomes a source of errors rather than a convenience. Every function without __device__ or __global__ defaults to __host__, and host entities are illegal.
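The interaction between the annotation default and JIT legality can be modeled in a few lines. This is a simplified sketch with hypothetical names: it treats each entity as carrying at most one annotation (combined __host__ __device__ functions are not modeled) and it abstracts away the actual bit encoding at entity+182.

```c
/* Source-level annotation on a function (hypothetical model). */
enum annotation { ANN_NONE, ANN_HOST, ANN_DEVICE, ANN_GLOBAL };

/* Resolved execution space after applying the default. */
enum space { SPACE_HOST, SPACE_DEVICE, SPACE_GLOBAL };

/* Explicit annotations always win; only unannotated entities are
   affected by --default-device. */
static enum space effective_space(enum annotation a, int default_device)
{
    switch (a) {
    case ANN_HOST:   return SPACE_HOST;
    case ANN_DEVICE: return SPACE_DEVICE;
    case ANN_GLOBAL: return SPACE_GLOBAL;
    default:         return default_device ? SPACE_DEVICE : SPACE_HOST;
    }
}

/* In JIT mode anything that resolves to the host space is an error. */
static int legal_in_jit(enum annotation a, int default_device)
{
    return effective_space(a, default_device) != SPACE_HOST;
}
```

Under this model an unannotated function is rejected in JIT mode unless --default-device is in effect, while an explicitly __host__ function is rejected either way.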
NVRTC Runtime Compilation Path
NVRTC (libnvrtc.so / nvrtc64_*.dll) is NVIDIA's runtime compilation library. Application code calls nvrtcCreateProgram with CUDA C++ source text, then nvrtcCompileProgram to compile it. Internally, NVRTC embeds a complete CUDA compilation pipeline including cudafe++ and cicc, invoking them with JIT-appropriate flags:
Application
|
v
nvrtcCompileProgram(prog, numOptions, options)
|
v
cudafe++ --target <sm_code> --gen_device_file_name <tmpfile> [--default-device] ...
| (no --gen_c_file_name => JIT mode)
v
cicc <tmpfile> --> PTX
|
v
ptxas / cuModuleLoadData --> device binary (cubin)
The user-facing NVRTC options (--gpu-architecture=compute_90, --device-debug, etc.) are translated by the NVRTC library into internal cudafe++ and cicc flags. The --default-device flag is passed through when the user includes it in the NVRTC options array.
CUDA Driver API Consumption
The PTX or cubin produced by the JIT pipeline is consumed by the CUDA Driver API:
- cuModuleLoadData / cuModuleLoadDataEx: Load a compiled module (PTX or cubin) into the current context. The driver JIT-compiles PTX to native binary at load time.
- cuLinkAddData / cuLinkComplete: Link multiple compiled objects into a single module (JIT linking for RDC workflows).
- cuModuleGetFunction: Retrieve a __global__ kernel handle from the loaded module for launch via cuLaunchKernel.
Because JIT-compiled code has no host-side registration (no __cudaRegisterFunction calls, no fatbin embedding), the Driver API is the only path to launch kernels from JIT-compiled modules. The CUDA Runtime API launch syntax (<<<>>>) is not available for JIT-compiled kernels -- the application must use cuLaunchKernel explicitly.
The --default-device Flag
In normal (offline) compilation, functions and namespace-scope variables without explicit CUDA annotations default to __host__. This default makes sense when both host and device outputs are generated: the unannotated entities go into the host .int.c file and are compiled by the host compiler.
In JIT mode, this default is counterproductive. Most code intended for JIT compilation targets the GPU, and requiring explicit __device__ on every function and variable is verbose and incompatible with header-only libraries written for standard C++.
The --default-device flag changes the default:
| Entity type | Default without --default-device | Default with --default-device |
|---|---|---|
| Unannotated function | __host__ (entity+182 == 0x00) | __device__ (entity+182 == 0x23) |
| Namespace-scope variable (no memory space) | Host variable | __device__ variable (entity+148 bit 0 set) |
| Non-const class static data member | Host variable | __device__ variable |
| Lambda closure class (namespace scope) | __host__ inferred space | __device__ inferred space |
| Explicitly __host__ function | __host__ (unchanged) | __host__ (unchanged -- always error in JIT) |
| Explicitly __device__ function | __device__ (unchanged) | __device__ (unchanged) |
| __global__ kernel | __global__ (unchanged) | __global__ (unchanged) |
Entities with explicit annotations are unaffected. Only entities that would otherwise receive the implicit __host__ default are redirected to __device__.
Interaction with Entity+182
The execution-space bitfield at entity+182 (documented in Execution Spaces) is set during attribute application. Without --default-device, an unannotated function has byte 0x00 at entity+182 -- the 0x30 mask extracts 0x00, which is treated as implicit __host__. With --default-device active, the frontend treats unannotated functions as if __device__ had been applied, setting byte+182 to 0x23 (the standard __device__ OR mask: device_capable | device_explicit | device_annotation).
This means the downstream subsystems -- keep-in-IL marking, cross-space validation, device-only filtering -- all see a properly-annotated __device__ entity and process it identically to an explicitly annotated one. The flag does not add a "JIT mode" code path through every subsystem; it simply changes the default annotation, and the existing execution-space machinery handles the rest.
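The byte-level effect described above can be modeled in a few lines. This is a minimal sketch, not code recovered from the binary: the masks 0x23, 0x15, and 0x30 come from the text, while the enum, the function names, and the reading of 0x30 as "both masks applied" (that is, __host__ __device__) are assumptions made for illustration.

```cpp
#include <cstdint>

// Masks quoted in the text; every name below is hypothetical.
constexpr std::uint8_t kDeviceOrMask = 0x23;  // OR-ed in for __device__
constexpr std::uint8_t kHostOrMask   = 0x15;  // OR-ed in for explicit __host__
constexpr std::uint8_t kSpaceMask    = 0x30;  // classification mask

enum class Space { ImplicitHost, ExplicitHost, Device, HostDevice };

// Classify the byte at entity+182 by its 0x30 bits.
constexpr Space classify(std::uint8_t byte182) {
    switch (byte182 & kSpaceMask) {
        case 0x00: return Space::ImplicitHost;
        case 0x10: return Space::ExplicitHost;
        case 0x20: return Space::Device;
        default:   return Space::HostDevice;  // 0x30: assumed to mean both masks
    }
}

// Model of the default applied to an entity during attribute application.
constexpr std::uint8_t apply_default(std::uint8_t byte182, bool default_device) {
    // Only a byte with no execution-space bits is eligible for the new default.
    if ((byte182 & kSpaceMask) == 0x00 && default_device)
        return static_cast<std::uint8_t>(byte182 | kDeviceOrMask);
    return byte182;
}
```

In this model an unannotated function (0x00) becomes 0x23 when the flag is on, while a byte that already carries the host mask is returned unchanged, which matches the "explicit annotations are unaffected" rule.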
How to Pass the Flag
In normal nvcc workflows, --default-device is passed through -Xcudafe:
nvcc -Xcudafe --default-device source.cu
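In NVRTC workflows, the flag is passed via the nvrtcCompileProgram options array. The public NVRTC spelling is --device-as-default-execution-space, with -default-device as its short form (the same single-dash spelling that appears in the diagnostic messages):
const char *opts[] = {"-default-device"};
nvrtcCompileProgram(prog, 1, opts);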
JIT Mode Diagnostics
Five error messages enforce JIT mode restrictions. All five are emitted during semantic analysis when the frontend encounters an entity that cannot exist in a device-only compilation. The messages are self-documenting: four of the five include an explicit suggestion to use --default-device.
Diagnostic 1: Explicit host Function
Tag: no_host_in_jit
Message:
A function explicitly marked as a __host__ function is not allowed in JIT mode
Trigger: The function declaration carries an explicit __host__ annotation (entity+182 has bit 4 set via the 0x15 OR mask from apply_nv_host_attr at sub_4108E0). This is unconditionally illegal in JIT mode -- there is no device-side representation of a host-only function, and JIT mode produces no host output.
No --default-device suggestion: This is the only JIT diagnostic that does not suggest --default-device. The flag only affects unannotated entities. An explicit __host__ annotation overrides the default. The fix must be a source code change: remove __host__, change it to __device__, or change it to __host__ __device__.
Example:
// JIT mode: error no_host_in_jit
__host__ void setup() { /* ... */ }
// Fix options:
__device__ void setup() { /* ... */ }
__host__ __device__ void setup() { /* ... */ } // if needed in both contexts
Diagnostic 2: Unannotated Function
Tag: unannotated_function_in_jit
Message:
A function without execution space annotations (__host__/__device__/__global__)
is considered a host function, and host functions are not allowed in JIT mode.
Consider using -default-device flag to process unannotated functions as __device__
functions in JIT mode
Trigger: A function entity has (entity+182 & 0x30) == 0x00 -- no explicit execution-space annotation. By default this means implicit __host__, which is illegal in JIT mode.
Fix: Either add __device__ to the function declaration, or compile with --default-device.
Example:
// JIT mode without --default-device: error unannotated_function_in_jit
int compute(int x) { return x * x; }
// Fix 1: explicit annotation
__device__ int compute(int x) { return x * x; }
// Fix 2: compile with --default-device (function becomes implicitly __device__)
Diagnostic 3: Unannotated Namespace-Scope Variable
Tag: unannotated_variable_in_jit
Message:
A namespace scope variable without memory space annotations
(__device__/__constant__/__shared__/__managed__) is considered a host variable,
and host variables are not allowed in JIT mode. Consider using -default-device flag
to process unannotated namespace scope variables as __device__ variables in JIT mode
Trigger: A variable declared at namespace scope (including global scope and anonymous namespaces) lacks a CUDA memory-space annotation. In normal compilation, such variables live in host memory. In JIT mode, host memory is inaccessible.
The check applies to the memory-space bitfield at entity+148, not the execution-space bitfield at entity+182. Without any annotation, none of the memory-space bits (__device__ bit 0, __shared__ bit 1, __constant__ bit 2, __managed__ bit 3) are set.
Scope note: This check targets namespace-scope variables only. Local variables inside __device__ or __global__ functions are not subject to this check -- they live on the device stack or in registers.
Fix: Add a memory-space annotation, or compile with --default-device.
Example:
// JIT mode without --default-device: error unannotated_variable_in_jit
int table[256] = { /* ... */ };
// Fix 1: mutable device memory
__device__ int table[256] = { /* ... */ };
// Fix 2: read-only data
__constant__ int table[256] = { /* ... */ };
Diagnostic 4: Non-Const Class Static Data Member
Tag: unannotated_static_data_member_in_jit
Message:
A class static data member with non-const type is considered a host variable,
and host variables are not allowed in JIT mode. Consider using -default-device flag
to process such data members as __device__ variables in JIT mode
Trigger: A class or struct has a static data member whose type is not const-qualified. Static data members are allocated at namespace scope (not per-instance), so they are subject to the same host-variable prohibition as namespace-scope variables.
Why non-const only: const and constexpr static members with compile-time-constant initializers can be folded into device code by cicc without requiring an actual global variable in host memory. Non-const static members require mutable storage that must be explicitly placed in device memory.
Example:
struct Config {
// JIT mode without --default-device: error unannotated_static_data_member_in_jit
static int max_iterations;
// OK: const with constant initializer (compile-time folding)
static const int default_value = 42;
// OK: constexpr (compile-time constant)
static constexpr float pi = 3.14159f;
};
// Fix: explicit annotation
struct Config {
__device__ static int max_iterations;
};
Diagnostic 5: Lambda Closure Class with Inferred host Space
Tag: host_closure_class_in_jit
Message:
The execution space for the lambda closure class members was inferred to be __host__
(based on context). This is not allowed in JIT mode. Consider using -default-device
to infer __device__ execution space for namespace scope lambda closure classes.
Trigger: A lambda expression at namespace scope (or in a context where the enclosing function has implicit __host__ space) produces a closure class whose execution space is inferred to be __host__. The lambda was not explicitly annotated with __device__, and the enclosing context is host-only, so cudafe++'s execution-space inference assigns __host__ to the closure class members.
This diagnostic interacts with the extended lambda system (documented in Extended Lambda Overview). In normal compilation, a namespace-scope lambda without annotations is host-only and gets a closure type compiled for the CPU. In JIT mode, that closure type has no valid compilation target.
Fix: Either annotate the lambda with __device__ (requires extended lambdas: --expt-extended-lambda), or pass --default-device to change the inference to __device__.
Example:
// JIT mode without --default-device: error host_closure_class_in_jit
auto fn = [](int x) { return x * 2; };
// Fix 1: explicit annotation (requires --expt-extended-lambda)
auto fn = [] __device__ (int x) { return x * 2; };
// Fix 2: compile with --default-device
Diagnostic Summary
| Tag | Entity type | --default-device suggested | Suppressible |
|---|---|---|---|
| no_host_in_jit | Explicit __host__ function | No | Yes (via --diag_suppress) |
| unannotated_function_in_jit | Function with no annotation | Yes | Yes |
| unannotated_variable_in_jit | Namespace-scope variable, no annotation | Yes | Yes |
| unannotated_static_data_member_in_jit | Non-const static data member | Yes | Yes |
| host_closure_class_in_jit | Lambda closure inferred __host__ | Yes | Yes |
All five diagnostics use the standard cudafe++ diagnostic system. They can be controlled via CLI flags or source pragmas:
--diag_suppress=unannotated_function_in_jit
--diag_warning=no_host_in_jit
#pragma nv_diag_suppress unannotated_variable_in_jit
Warning: Suppressing these diagnostics silences the messages but does not change the underlying problem. The entities still have host execution space and will be absent from the device IL output, leading to link errors or runtime failures when the module is loaded.
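The five rules can be condensed into one decision function. The sketch below is an illustrative model only: the Entity struct and its fields are hypothetical stand-ins for facts the frontend reads from its entity nodes, and the order of the checks is an assumption. Only the tag strings are taken from the binary.

```cpp
#include <string>

// Hypothetical summary of an entity as seen by the JIT checks.
enum class Kind { Function, NamespaceVariable, StaticDataMember, LambdaClosure };

struct Entity {
    Kind kind;
    bool explicit_host = false;  // function marked __host__ only
    bool annotated = false;      // has __device__/__global__ or a memory space
    bool const_type = false;     // only meaningful for StaticDataMember
};

// Returns the diagnostic tag that would fire in JIT mode, or "" if none.
std::string jit_diagnostic(const Entity& e, bool default_device) {
    if (e.kind == Kind::Function && e.explicit_host)
        return "no_host_in_jit";  // --default-device does not help here
    if (e.annotated || default_device)
        return "";                // explicit, or redirected to __device__
    switch (e.kind) {
        case Kind::Function:          return "unannotated_function_in_jit";
        case Kind::NamespaceVariable: return "unannotated_variable_in_jit";
        case Kind::StaticDataMember:
            return e.const_type ? "" : "unannotated_static_data_member_in_jit";
        case Kind::LambdaClosure:     return "host_closure_class_in_jit";
    }
    return "";
}
```

The early return for explicit __host__ is what makes no_host_in_jit the one diagnostic that the flag cannot silence.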
Architecture: JIT Mode vs Normal Mode
| Aspect | Normal (offline) mode | JIT mode |
|---|---|---|
| Driver tool | nvcc | NVRTC (or nvcc with --ptx / --cubin) |
| Host output (.int.c) | Generated via sub_489000 | Not generated |
| Device IL output | Generated via keep-in-IL walk | Generated via keep-in-IL walk (identical) |
| Kernel stubs | __wrapper__device_stub_ in .int.c | Not needed |
| Registration code | __cudaRegisterFunction / __cudaRegisterVar | Not emitted |
| Fatbin embedding | Embedded in host object | Not applicable |
| Default unannotated space | __host__ | __host__ (error) or __device__ (with --default-device) |
| Kernel launch mechanism | <<<>>> -> cudaLaunchKernel (Runtime API) | cuLaunchKernel (Driver API) |
| Module loading | Automatic (CUDA runtime startup) | Manual (cuModuleLoadData) |
| Link model | Static linking with host object | JIT linking (cuLinkAddData) or direct load |
Single-Pass Architecture Impact
cudafe++ uses a single-pass architecture: the EDG frontend parses the source once, builds a unified IL tree, and tags every entity with execution-space bits at entity+182. In normal mode, two output filters run on this tree -- one for the host .int.c file (driven by sub_489000 -> sub_47ECC0), one for the device IL (driven by the keep-in-IL walk at sub_610420). In JIT mode, only the device IL output path runs. The host output path is simply never invoked because no host output was requested.
This means JIT mode does not require a fundamentally different code path through the frontend. Parsing, semantic analysis, template instantiation, and IL construction all proceed identically. The difference manifests at two points:
- Diagnostic emission during semantic analysis. The five JIT diagnostics fire when the frontend detects entities that would be host-only. In normal mode, these entities are silently accepted because they will appear in the host output.
- Output generation. The backend skips host-file emission entirely. The keep-in-IL walk runs as usual, marking device-reachable entries with bit 7 of the prefix byte (entry_ptr - 8). The device IL writer produces the binary output. No stub generation (gen_routine_decl stub path), no registration table emission, no .int.c formatting.
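The keep-in-IL mark is a single bit in a prefix that sits before each IL entry. A minimal sketch of setting and testing that bit follows; the offset (-8) and bit position (7) are the ones stated above, while the helper names and the use of a plain byte buffer are assumptions, and nothing else about the prefix layout is modeled.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::ptrdiff_t kPrefixOffset = -8;  // prefix byte lives at entry_ptr - 8
constexpr std::uint8_t kKeepInIlBit = 0x80;   // bit 7

// Marks an entry as reachable from device code.
inline void mark_keep_in_il(std::uint8_t* entry_ptr) {
    entry_ptr[kPrefixOffset] |= kKeepInIlBit;
}

// Tests the mark without modifying the entry.
inline bool is_kept_in_il(const std::uint8_t* entry_ptr) {
    return (entry_ptr[kPrefixOffset] & kKeepInIlBit) != 0;
}
```

Because the mark lives outside the entry body, the same walk can serve both modes; JIT mode simply has no second consumer for the unmarked entries.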
Interaction with Other Modes
RDC (Relocatable Device Code)
JIT mode is orthogonal to RDC (--device-c, flag 77). RDC controls whether device code is compiled for separate linking (enabling cross-TU __device__ function calls and extern __device__ variables), while JIT mode controls whether host output is produced. Both can be active simultaneously -- for example, NVRTC with --relocatable-device-code=true compiles device code for separate device linking without any host output.
When RDC is combined with JIT mode, NVRTC compiles each source file to relocatable device code, and the driver-API linker (cuLinkAddData, cuLinkComplete) resolves cross-references at load time. Without RDC, all device code must be self-contained within a single translation unit.
Extended Lambdas
Extended lambdas (--expt-extended-lambda, controlled by dword_106BF38) interact with JIT mode through the lambda closure class inference. The host_closure_class_in_jit diagnostic targets the case where a lambda's closure is inferred as host-side. With --default-device, the inference changes to device-side, resolving the conflict. Extended lambda capture rules still apply in JIT mode -- captures must be trivially device-copyable, subject to the 1023-capture limit, and array captures are limited to 7 dimensions.
Relaxed Constexpr
Relaxed constexpr mode (--expt-relaxed-constexpr, flag 104, sets dword_106BFF0) makes constexpr functions implicitly __host__ __device__. In JIT mode, this resolves many unannotated-function errors because constexpr functions gain the __device__ annotation implicitly via the HD bypass (entity+177 bit 4). However, non-constexpr unannotated functions still trigger unannotated_function_in_jit unless --default-device is also active.
Practical Patterns
Pattern 1: Minimal JIT Kernel
// Source passed to nvrtcCreateProgram -- no --default-device needed
extern "C" __global__ void add(float* a, float* b, float* c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) c[i] = a[i] + b[i];
}
No annotations needed beyond __global__ on the kernel. All code within the kernel body is implicitly device code. The extern "C" prevents name mangling so the kernel can be found by cuModuleGetFunction.
Pattern 2: JIT-Compiling Library Code with --default-device
// Header-only math library, no CUDA annotations
template <typename T>
T clamp(T val, T lo, T hi) {
return val < lo ? lo : (val > hi ? hi : val);
}
__global__ void kernel(float* data, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) data[i] = clamp(data[i], 0.0f, 1.0f);
}
Without --default-device, clamp triggers unannotated_function_in_jit. With --default-device, clamp is implicitly __device__ and compiles cleanly.
Pattern 3: Guarding Host Code with Preprocessor
// Use __CUDACC_RTC__ to guard host-only code
#ifndef __CUDACC_RTC__
__host__ void cpu_fallback(float* data, int n) {
for (int i = 0; i < n; i++) data[i] *= 2.0f;
}
#endif
__global__ void gpu_process(float* data, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) data[i] *= 2.0f;
}
__CUDACC_RTC__ is predefined by NVRTC. Code guarded by #ifndef __CUDACC_RTC__ is invisible to the JIT compiler, avoiding no_host_in_jit errors.
Pattern 4: Static Data Members in JIT
struct Constants {
static constexpr int BLOCK_SIZE = 256; // OK: constexpr, folded at compile time
static const float EPSILON; // Error without --default-device (non-constexpr const)
};
#ifdef __CUDACC_RTC__
__device__
#endif
const float Constants::EPSILON = 1e-6f; // Annotated for JIT mode
Function Map
| Address | Name | Lines | Role |
|---|---|---|---|
| sub_459630 | proc_command_line | 4105 | CLI parser; processes --default-device and --device-c flags |
| sub_452010 | init_command_line_flags | 3849 | Registers all flags including default-device |
| sub_610420 | mark_to_keep_in_il | 892 | Device IL marking (runs identically in JIT and normal mode) |
| sub_489000 | process_file_scope_entities | 723 | Host .int.c backend (skipped entirely in JIT mode) |
| sub_47ECC0 | gen_template | 1917 | Source-sequence dispatcher; host output path (skipped in JIT) |
| sub_40EB80 | apply_nv_device_attr | 100 | Sets __device__ bits; entity+182 OR 0x23 (function), entity+148 OR 0x01 (variable) |
| sub_4108E0 | apply_nv_host_attr | 31 | Sets __host__ bits; entity+182 OR 0x15 |
Cross-References
- Execution Spaces -- entity+182 bitfield, __host__/__device__/__global__ OR masks, 0x30 mask classification
- Device/Host Separation -- single-pass architecture, keep-in-IL walk, host/device output file generation
- Cross-Space Validation -- execution-space call checking (still applies in JIT mode for HD entities)
- CUDA Error Catalog -- Category 10 (JIT Mode), all five diagnostic messages with tag names
- CLI Flag Inventory -- flag table, --gen_device_file_name (85), --gen_c_file_name (45), --device-c (77)
- Architecture Feature Gating -- --target SM code (dword_126E4A8) and feature thresholds
- Extended Lambda Overview -- lambda closure class execution-space inference, wrapper types
- Kernel Stubs -- __wrapper__device_stub_ mechanism (absent in JIT mode)
- RDC Mode -- relocatable device code, separate compilation for device-side linking
Architecture Feature Gating
cudafe++ enforces architecture-dependent feature gates that prevent use of CUDA constructs on hardware that cannot support them. These gates operate at three distinct layers: compile-time SM version checks against dword_126E4A8 during semantic analysis, string-embedded diagnostic messages with architecture names baked into .rodata, and host-compiler version gating controlling which GCC/Clang-specific #pragma directives and language constructs appear in the generated .int.c output. A separate mechanism, the --db debug system, provides runtime tracing that can expose architecture checks as they execute. This page documents all three layers, the global variables involved, every discovered threshold constant, and the complete data flow from nvcc invocation to feature gate evaluation.
Key Facts
| Property | Value |
|---|---|
| SM version storage | dword_126E4A8 (sm_architecture, set by --target / case 245) |
| SM version TU-level copy | dword_126EBF8 (target_config_index, copied during TU init in sub_586240) |
| Architecture parser stub | sub_7525E0 (6-byte stub returning -1; actual parsing done by nvcc) |
| Post-parse initializer | sub_7525F0 (set_target_configuration, target.c:299) |
| Type table initializer | sub_7515D0 (sets 100+ type-size/alignment globals, called from sub_7525F0) |
| GCC version global | qword_126EF98 (default 80100 = GCC 8.1.0, set by --gnu_version case 184) |
| Clang version global | qword_126EF90 (default 90100 = Clang 9.1.0, set by --clang_version case 188) |
| GCC host dialect flag | dword_126E1F8 (host compiler identified as GCC) |
| Clang host dialect flag | dword_126E1E8 (host compiler identified as Clang) |
| Host GCC version copy | qword_126E1F0 (copied from qword_126EF98 during dialect init) |
| Host Clang version copy | qword_126E1E0 (copied from qword_126EF90 during dialect init) |
| --nv_arch error string | "invalid or no value specified with --nv_arch flag" at 0x8884F0 |
| Debug option parser | sub_48A390 (proc_debug_option, 238 lines, debug.c) |
| Debug trace linked list | qword_1065870 (head pointer) |
| Invalid arch sentinel | -1 (0xFFFFFFFF) |
| Feature threshold count | 17 CUDA features across 7 SM versions (20, 30, 52, 60, 70, 80, 90/90a) |
| Host compiler threshold count | 19 version constants across GCC 3.0 through GCC 14.0 |
Layer 1: SM Architecture Input
How the Architecture Reaches cudafe++
cudafe++ never parses architecture strings directly from the user. The driver (nvcc) translates user-facing flags like --gpu-architecture=sm_90 into an internal numeric code and passes it via the --target flag when spawning the cudafe++ process. Inside cudafe++, the --target flag is registered as CLI flag 245 and handled in proc_command_line (sub_459630).
The handler calls sub_7525E0, which in the CUDA Toolkit 13.0 binary is a 6-byte stub:
; sub_7525E0 -- architecture parser stub
; Address: 0x7525E0, Size: 6 bytes
mov eax, 0FFFFFFFFh ; return -1 unconditionally
retn
This stub always returns -1 (the invalid-architecture sentinel), so it cannot be where the SM code is actually decoded: read literally, every --target invocation would take the error path in the handler. The architecture code that nvcc places in the argument string must therefore be decoded elsewhere, either by logic the compiler inlined into the caller or through a mechanism resolved at link time; the static analysis does not settle which. The handler stores the result in dword_126E4A8:
// proc_command_line (sub_459630), case 245
v80 = sub_7525E0(qword_E7FF28, v23, v20, v30); // parse SM code from arg string
dword_126E4A8 = v80; // store in sm_architecture
if (v80 == -1) {
sub_4F8420(2664); // emit error 2664
// error string: "invalid or no value specified with --nv_arch flag"
sub_4F2930("cmd_line.c", 12219, "proc_command_line", 0, 0);
// assert_fail -- unreachable if error handler returns
}
sub_7525F0(v80); // set_target_configuration
Error 2664 fires when the architecture value is -1. The error string at 0x8884F0 references --nv_arch (the nvcc-facing name for this flag). This string has no direct xrefs in the IDA analysis, meaning it is loaded indirectly through the error message table (off_88FAA0). The --nv_arch name in the error message is a user-facing alias; internally cudafe++ processes it as --target (flag 245).
set_target_configuration (sub_7525F0)
After storing the SM version, sub_7525F0 performs post-parse initialization. This function lives in target.c:299:
// sub_7525F0 -- set_target_configuration
__int64 __fastcall sub_7525F0(int a1)
{
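if ((unsigned int)(a1 + 1) > 1)     // true for every value except -1 and 0
assert_fail("set_target_configuration", 299);
sub_7515D0();                        // initialize type table for target platform
qword_126E1B0 = "lib";               // library search path prefix
}
The guard (unsigned int)(a1 + 1) > 1 is an unsigned comparison. The increment maps -1 to 0 and 0 to 1; every other input, including ordinary SM numbers such as 70 or 90, produces a value greater than 1. As decompiled, the assertion therefore fires for anything other than -1 or 0, which is the opposite of a "reject only -1" check. Either the argument is a target-configuration index that is expected to be -1 or 0 rather than a raw SM number, or the decompiled condition does not reflect the real control flow; the listing should be rechecked against the disassembly before relying on it.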
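A quick way to see which inputs the decompiled guard lets through is to evaluate the same expression on a few values. The snippet below is a standalone check, not code from the binary; guard_fires is a hypothetical name, and the cast is moved ahead of the increment so the C++ stays well-defined for INT_MAX (the machine-level result is the same).

```cpp
// Mirrors the decompiled condition (unsigned int)(a1 + 1) > 1.
// Returns true when the assertion in set_target_configuration would fire.
constexpr bool guard_fires(int a1) {
    return static_cast<unsigned>(a1) + 1u > 1u;
}

static_assert(!guard_fires(-1), "-1 wraps to 0 and passes");
static_assert(!guard_fires(0), "0 becomes 1 and passes");
static_assert(guard_fires(90), "any other value trips the assertion");
```

Only -1 and 0 pass; a value such as 90 does not, so the argument is unlikely to be a plain SM number if the decompilation is accurate.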
Type Table Initialization (sub_7515D0)
The sub_7515D0 function, called from set_target_configuration, initializes over 100 global variables describing the target platform's type sizes, alignments, and numeric limits. This establishes the data model for CUDA device code:
// sub_7515D0 -- target type initialization (excerpt)
// Sets LP64 data model with CUDA-specific type properties
dword_126E328 = 8; // sizeof(long)
dword_126E338 = 4; // sizeof(int)
dword_126E2FC = 16; // sizeof(long double)
dword_126E308 = 16; // alignof(long double)
dword_126E2B8 = 8; // sizeof(pointer)
dword_126E2AC = 8; // alignof(pointer)
dword_126E420 = 2; // sizeof(wchar_t)
dword_126E4A0 = 8; // target vector width
dword_126E258 = 53; // double mantissa bits
dword_126E250 = 1024; // double max exponent
dword_126E254 = -1021; // double min exponent
dword_126E234 = 113; // __float128 mantissa bits
dword_126E22C = 0x4000; // __float128 max exponent
dword_126E230 = -16381; // __float128 min exponent
// ... ~80 more assignments ...
The function unconditionally returns -1, which is not used by the caller.
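The floating-point entries in this table are the standard IEEE 754 binary64 parameters, which can be cross-checked against the host's std::numeric_limits. The struct below is a hypothetical container for a few of the quoted values, not a layout recovered from the binary, and the check assumes the host double is binary64.

```cpp
#include <limits>

// Hypothetical struct; field values are the ones quoted from sub_7515D0.
struct TargetTypeInfo {
    int sizeof_long = 8;
    int sizeof_int = 4;
    int sizeof_pointer = 8;
    int double_mantissa_bits = 53;
    int double_max_exponent = 1024;
    int double_min_exponent = -1021;
};

// True when the host's double has the same parameters as the target description.
constexpr bool host_double_matches(const TargetTypeInfo& t) {
    using lim = std::numeric_limits<double>;
    return lim::digits == t.double_mantissa_bits &&
           lim::max_exponent == t.double_max_exponent &&
           lim::min_exponent == t.double_min_exponent;
}
```

The same comparison style could be extended to the other globals, but the sizes of long and wchar_t vary across host platforms, so only the double parameters are compared here.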
SM Version Propagation
During translation unit initialization (sub_586240, called from fe_translation_unit_init), the SM version is copied into a TU-level global:
// sub_586240, line 54 in decompiled output
dword_126EBF8 = dword_126E4A8; // target_config_index = sm_architecture
After this point, architecture checks throughout the compiler read either dword_126E4A8 (the CLI-level global) or dword_126EBF8 (the TU-level copy). Both contain the same integer SM version code. The dual-variable pattern exists because EDG's architecture supports multi-TU compilation where each TU could theoretically target a different architecture (though CUDA compilation always uses a single target per cudafe++ invocation).
Layer 2: CUDA Feature Thresholds
cudafe++ checks the SM architecture version at semantic analysis time to gate CUDA-specific features. When a feature is used on an architecture below its minimum requirement, the compiler emits a diagnostic error or warning. All thresholds below were extracted from error strings embedded in the binary's .rodata section and confirmed through cross-reference with diagnostic tag names.
Complete Feature Threshold Table
| Feature | Min Architecture | Diagnostic Tag | Error String |
|---|---|---|---|
| Virtual base classes | compute_20 | use_of_virtual_base_on_compute_1x | Use of a virtual base (%t) requires the compute_20 or higher architecture |
| Device variadic functions | compute_30 | device_function_has_ellipsis | __device__ or __host__ __device__ function with ellipsis requires compute_30 or higher architecture |
| __managed__ variables | compute_30 | unsupported_arch_for_managed_capability | __managed__ variables require architecture compute_30 or higher |
| alloca() in device code | compute_52 | alloca_unsupported_for_lower_than_arch52 | alloca() is not supported for architectures lower than compute_52 |
| Atomic scope argument | sm_60 | (inline) | atomic operations' scope argument is supported on architecture sm_60 or above. Fall back to use membar. |
| Atomic f64 add/sub | sm_60 | (inline) | atomic add and sub for 64-bit float is supported on architecture sm_60 or above. |
| __nv_atomic_* functions | sm_60 | (inline) | __nv_atomic_* functions are not supported on arch < sm_60. |
| __grid_constant__ | compute_70 | grid_constant_unsupported_arch | __grid_constant__ annotation is only allowed for architecture compute_70 or later |
| Atomic memory order | sm_70 | (inline) | atomic operations' argument of memory order is supported on architecture sm_70 or above. Fall back to use membar. |
| 128-bit atomic load/store | sm_70 | (inline) | 128-bit atomic load and store are supported on architecture sm_70 or above. |
| 16-bit atomic CAS | sm_70 | (inline) | 16-bit atomic compare-and-exchange is supported on architecture sm_70 or above. |
| __nv_register_params__ | compute_80 | register_params_unsupported_arch | __nv_register_params__ is only supported for compute_80 or later architecture |
| __wgmma_mma_async | sm_90a | wgmma_mma_async_not_enabled | __wgmma_mma_async builtins are only available for sm_90a |
| Atomic cluster scope | sm_90 | (inline) | atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
| Atomic cluster scope (load/store) | sm_90 | (inline) | atomic load and store's scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
| 128-bit atomic exch/CAS | sm_90 | nv_atomic_exch_cas_b128_not_supported | 128-bit atomic exchange or compare-and-exchange is supported on architecture sm_90 or above. |
GPU-Architecture-Gated Attributes (No Specific SM in String)
Several features check the architecture but their error strings do not embed a specific SM version number. Instead, they use the generic phrase "this GPU architecture", meaning the threshold is encoded in the comparison logic rather than the diagnostic text:
| Feature | Diagnostic Tag | Error String |
|---|---|---|
| __cluster_dims__ | cluster_dims_unsupported | __cluster_dims__ is not supported for this GPU architecture |
| max_blocks_per_cluster | max_blocks_per_cluster_unsupported | cannot specify max blocks per cluster for this GPU architecture |
| __block_size__ | block_size_unsupported | __block_size__ is not supported for this GPU architecture |
| __managed__ (config) | unsupported_configuration_for_managed_capability | __managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system) |
These features are gated by the same dword_126E4A8 comparison mechanism as the features in the main table, but their exact SM threshold values would require tracing the specific comparison instructions in the semantic analysis functions.
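For features whose threshold is known from the error string, the gate reduces to an integer comparison against the stored SM value. The table-driven sketch below covers a subset of the documented thresholds; the feature keys, the FeatureGate struct, and the lookup function are hypothetical, and the real frontend performs each comparison inline in its semantic-analysis code rather than through a table.

```cpp
#include <string>
#include <vector>

struct FeatureGate {
    std::string feature;  // illustrative key, not a name from the binary
    int min_sm;           // numeric threshold taken from the error string
};

// Subset of the documented thresholds.
const std::vector<FeatureGate>& feature_gates() {
    static const std::vector<FeatureGate> gates = {
        {"virtual_base", 20},
        {"managed_variable", 30},
        {"alloca", 52},
        {"atomic_scope", 60},
        {"grid_constant", 70},
        {"register_params", 80},
        {"atomic_cluster_scope", 90},
    };
    return gates;
}

// Returns true when `sm` meets the feature's minimum. Unknown features are
// treated as ungated in this sketch.
bool feature_allowed(const std::string& feature, int sm) {
    for (const FeatureGate& g : feature_gates())
        if (g.feature == feature) return sm >= g.min_sm;
    return true;
}
```

The attributes whose strings say only "this GPU architecture" are left out on purpose, since their thresholds have not been recovered.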
Diagnostic Behavior: Errors vs Warnings vs Demotions
Architecture gate violations produce three distinct behaviors depending on the feature class:
Hard errors -- Compilation halts. Features that fundamentally cannot work on the target architecture:
- __managed__ below compute_30 -- No unified memory hardware support
- __grid_constant__ below compute_70 -- No hardware constant propagation mechanism
- __nv_register_params__ below compute_80 -- Register parameter ABI not available
- __wgmma_mma_async below sm_90a -- No warp-group MMA hardware
- alloca() below compute_52 -- No dynamic stack allocation support on device
- Virtual base classes below compute_20 -- No vtable support on earliest GPU architectures
Fallback warnings -- Compilation continues with degraded behavior. The compiler generates functionally correct but potentially less performant code:
- Atomic scope arguments on pre-sm_60 -- Falls back to membar-based synchronization
- Atomic memory order on pre-sm_70 -- Falls back to membar-based ordering
- 64-bit float atomics on pre-sm_60 -- Falls back to CAS loop emulation
Scope demotion warnings -- Informational diagnostics about automatic scope narrowing:
- Cluster scope atomics on pre-sm_90 -- Automatically demoted to device scope, with a diagnostic reporting the change ("Using device scope instead")
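The scope-demotion case is the simplest of the three behaviors to model: the request is rewritten and a warning flag is raised. The sketch below assumes a four-value scope enum and a result struct, both hypothetical; only the sm_90 threshold and the "device scope instead" outcome are taken from the diagnostic text.

```cpp
enum class AtomicScope { Block, Cluster, Device, System };

struct ScopeResult {
    AtomicScope scope;  // scope actually used
    bool warned;        // true when a demotion warning would be issued
};

// Cluster scope below sm_90 is demoted to device scope with a warning;
// every other request is passed through unchanged.
constexpr ScopeResult resolve_scope(AtomicScope requested, int sm) {
    if (requested == AtomicScope::Cluster && sm < 90)
        return {AtomicScope::Device, true};
    return {requested, false};
}
```

Demoting to a wider scope keeps the program correct, which is why this case is a warning rather than an error.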
compute_XX vs sm_XX Naming
Error strings use two naming conventions that reflect CUDA's split between virtual and physical architectures:
- compute_XX -- Virtual architecture. Checked at PTX generation time. Features gated by compute_XX are relevant to the intermediate PTX representation and are independent of the specific GPU die. Examples: __managed__ (requires unified memory ISA support), alloca() (requires dynamic stack frame instructions).
- sm_XX -- Physical architecture. Checked at SASS generation time. Features gated by sm_XX are tied to specific hardware capabilities of a GPU die. Examples: 128-bit atomics (require specific load/store unit widths), cluster scope (requires the SM 9.0 thread block cluster hardware).
In practice, cudafe++ stores a single integer in dword_126E4A8 and the distinction is purely semantic -- both forms gate against the same numeric value. The value is a compute capability number (e.g., 70 for Volta, 90 for Hopper).
The sm_90a suffix (with the a accelerator flag) is a special case used exclusively for __wgmma_mma_async builtins. This variant requires the Hopper accelerated architecture, which is distinct from the base sm_90. The a suffix is encoded in the SM integer value passed to cudafe++ by nvcc.
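To make the naming split concrete, the sketch below decomposes an architecture name into its prefix, number, and optional a suffix. It illustrates the naming convention only: cudafe++ itself never sees these strings (nvcc translates them first), the struct and function are hypothetical, and how the suffix is folded into the integer that reaches cudafe++ is not modeled because the analysis does not establish it.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

struct ArchName {
    bool is_virtual;    // true for compute_XX, false for sm_XX
    int number;         // numeric part, e.g. 70 or 90
    bool has_a_suffix;  // true for a trailing 'a', as in sm_90a
};

// Parses "compute_NN", "sm_NN" or a name with a trailing 'a'.
// Returns false on malformed input.
bool parse_arch(const std::string& s, ArchName& out) {
    out = ArchName{false, 0, false};
    std::size_t pos = 0;
    if (s.compare(0, 8, "compute_") == 0) {
        out.is_virtual = true;
        pos = 8;
    } else if (s.compare(0, 3, "sm_") == 0) {
        pos = 3;
    } else {
        return false;
    }
    const std::size_t first_digit = pos;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
        if (pos - first_digit >= 4) return false;  // cap the number of digits
        out.number = out.number * 10 + (s[pos] - '0');
        ++pos;
    }
    if (pos == first_digit) return false;          // no digits at all
    if (pos < s.size() && s[pos] == 'a') {
        out.has_a_suffix = true;
        ++pos;
    }
    return pos == s.size();                        // reject trailing characters
}
```

Both prefixes yield the same kind of integer, which mirrors the observation that the compute/sm distinction is semantic once the value is stored.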
__wgmma_mma_async Detail
The warp-group matrix multiply-accumulate builtin has the most granular validation of any architecture-gated feature. Beyond the sm_90a architecture check, cudafe++ also validates:
| Check | Diagnostic Tag | Error String |
|---|---|---|
| Architecture gate | wgmma_mma_async_not_enabled | __wgmma_mma_async builtins are only available for sm_90a |
| Shape validation | wgmma_mma_async_bad_shape | The shape %s is not supported for __wgmma_mma_async builtin |
| A operand type | wgmma_mma_async_bad_A_type | (type mismatch diagnostic) |
| B operand type | wgmma_mma_async_bad_B_type | (type mismatch diagnostic) |
| Missing arguments | wgmma_mma_async_missing_args | The 'A' or 'B' argument to __wgmma_mma_async call is missing |
| Non-constant args | wgmma_mma_async_nonconstant_arg | Non-constant argument to __wgmma_mma_async call |
The validation function is identified as check_wgmma_mma_async (string at 0x888CAC). Four type-specific builtin variants are registered: __wgmma_mma_async_f16, __wgmma_mma_async_bf16, __wgmma_mma_async_tf32, and __wgmma_mma_async_f8.
nv_register_params Detail
The register parameter attribute has four distinct checks, only one of which is an architecture gate:
| Check | Diagnostic Tag | Error String |
|---|---|---|
| Feature enable flag | register_params_not_enabled | __nv_register_params__ support is not enabled |
| Architecture gate | register_params_unsupported_arch | __nv_register_params__ is only supported for compute_80 or later architecture |
| Function type check | register_params_unsupported_function | __nv_register_params__ is not allowed on a %s function |
| Ellipsis check | register_params_ellipsis_function | (variadic function diagnostic) |
The attribute handler is apply_nv_register_params_attr (string at 0x830C78).
SM Version to Feature Summary
| SM Version | Features Introduced | Feature Count |
|---|---|---|
| compute_20 | Virtual base classes in device code | 1 |
| compute_30 | __managed__ variables, device variadic functions | 2 |
| compute_52 | alloca() in device code | 1 |
| sm_60 | Atomic scope argument, 64-bit float atomics, __nv_atomic_* API | 3 |
| sm_70 | __grid_constant__, 128-bit atomic load/store, atomic memory order, 16-bit CAS | 4 |
| compute_80 | __nv_register_params__ | 1 |
| sm_90 / sm_90a | __wgmma_mma_async, thread block clusters, 128-bit atomic exchange/CAS, cluster scope atomics | 5 |
Notably absent from cudafe++ error strings are features like cooperative groups (sm_60+), tensor cores (sm_70+), and dynamic parallelism (sm_35+). These are checked at runtime or by the PTX assembler (ptxas) rather than the language frontend.
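The summary table can be read as a lookup from feature to minimum compute capability. The following is a minimal sketch under that reading; the enum, array, and function names are hypothetical, and cudafe++ itself performs these comparisons inline against dword_126E4A8 rather than through a table:

```c
/* Hypothetical feature identifiers; the names are illustrative only. */
enum cuda_feature {
    FEAT_VIRTUAL_BASE,     /* compute_20 */
    FEAT_MANAGED,          /* compute_30 */
    FEAT_ALLOCA,           /* compute_52 */
    FEAT_GRID_CONSTANT,    /* sm_70 */
    FEAT_REGISTER_PARAMS,  /* compute_80 */
    FEAT_CLUSTER_SCOPE,    /* sm_90 */
    FEAT_COUNT
};

/* Minimum compute capability per feature, taken from the summary table. */
static const int min_arch[FEAT_COUNT] = {
    [FEAT_VIRTUAL_BASE]    = 20,
    [FEAT_MANAGED]         = 30,
    [FEAT_ALLOCA]          = 52,
    [FEAT_GRID_CONSTANT]   = 70,
    [FEAT_REGISTER_PARAMS] = 80,
    [FEAT_CLUSTER_SCOPE]   = 90,
};

/* Returns 1 when target_arch (the role played by dword_126E4A8) meets the
   feature's minimum. compute_XX and sm_XX gates compare the same integer. */
static int feature_allowed(enum cuda_feature f, int target_arch)
{
    return target_arch >= min_arch[f];
}
```

The sm_90a case is left out on purpose: the text only says the a suffix is encoded in the SM integer, not how, so a plain >= comparison would be a guess.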
Layer 3: Host Compiler Version Gating
cudafe++ generates .int.c output that must compile cleanly under the host C++ compiler (GCC, Clang, or MSVC). Because different host compiler versions support different warning pragmas, attributes, and language features, cudafe++ gates its output based on the host compiler version stored in qword_126EF98 (GCC) and qword_126EF90 (Clang). Additionally, several C++ language feature flags in the EDG frontend are conditionally enabled based on host compiler version to match the behavior the user expects from their host compiler.
Version Encoding
Both GCC and Clang versions are encoded as a single integer: major * 10000 + minor * 100 + patch. For example, GCC 8.1.0 is encoded as 80100. The compiler tests these values against hexadecimal threshold constants using strictly-greater-than (>) comparisons. Because every threshold uses a 99 patch level (e.g., 40299 for GCC 4.2.99), the gate > 40299 is equivalent to >= 40300, that is, "GCC 4.3 or later."
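The encoding and the strictly-greater-than gate can be checked with a few lines of C. This is a sketch of the arithmetic only; encode_version and version_gate are hypothetical helper names, not functions recovered from the binary:

```c
/* Encode a host compiler version as described in the text:
   major * 10000 + minor * 100 + patch. */
static long encode_version(int major, int minor, int patch)
{
    return (long)major * 10000 + (long)minor * 100 + patch;
}

/* Strictly-greater-than test against an "x.y.99" threshold constant.
   With threshold 0x9D6B (40299) this accepts 4.3.0 and rejects 4.2.99. */
static int version_gate(long encoded, long threshold)
{
    return encoded > threshold;
}
```

For example, version_gate(encode_version(4, 3, 0), 0x9D6B) is true, while the same call with 4.2.99 is false because 40299 is not greater than itself.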
Complete Threshold Table
| Hex Constant | Decimal | Encoded Version | Effective Gate | Occurrence Count |
|---|---|---|---|---|
| 0x752F | 29,999 | 2.99.99 | GCC/Clang >= 3.0 | 1 (dialect resolution) |
| 0x75F7 | 30,199 | 3.01.99 | GCC/Clang >= 3.2 | low |
| 0x76BF | 30,399 | 3.03.99 | GCC/Clang >= 3.4 | low (cuda_compat_flag gate) |
| 0x7787 | 30,599 | 3.05.99 | Clang >= 3.6 | medium (-Wunused-local-typedefs) |
| 0x78B3 | 30,899 | 3.08.99 | Clang >= 3.9 | low |
| 0x9C3F | 39,999 | 3.99.99 | GCC >= 4.0 | medium (dword_106BDD8 + Clang gate) |
| 0x9D07 | 40,199 | 4.01.99 | GCC >= 4.2 | medium (-Wunused-variable file-level) |
| 0x9D6B | 40,299 | 4.02.99 | GCC >= 4.3 | medium (variadic templates) |
| 0x9DCF | 40,399 | 4.03.99 | GCC >= 4.4 | low (dialect resolution) |
| 0x9E33 | 40,499 | 4.04.99 | GCC >= 4.5 | low (dialect resolution) |
| 0x9E97 | 40,599 | 4.05.99 | GCC >= 4.6 | medium (diagnostic push/pop) |
| 0x9EFB | 40,699 | 4.06.99 | GCC >= 4.7 | low (feature flag gating) |
| 0x9F5F | 40,799 | 4.07.99 | GCC >= 4.8 | medium (-Wunused-local-typedefs) |
| 0xEA5F | 59,999 | 5.99.99 | GCC >= 6.0 | 22 files (C++14/17 features) |
| 0xEB27 | 60,199 | 6.01.99 | GCC >= 6.2 | low (HasFuncPtrConv gate) |
| 0x1116F | 69,999 | 6.99.99 | GCC >= 7.0 | medium (dword_106BDD8 + feature flags) |
| 0x15F8F | 89,999 | 8.99.99 | GCC/Clang >= 9.0 | medium (C++17/20 features) |
| 0x1D4BF | 119,999 | 11.99.99 | GCC/Clang >= 12.0 | 8 files |
| 0x1FBCF | 129,999 | 12.99.99 | GCC >= 13.0 | 13 files |
| 0x222DF | 139,999 | 13.99.99 | GCC >= 14.0 | 5 files |
How Thresholds Are Used
The thresholds serve three purposes:
1. Diagnostic pragma emission. The .int.c output includes #pragma GCC diagnostic directives to suppress host compiler warnings about CUDA-generated code. Different GCC/Clang versions introduced different warning flags, so the pragmas are conditionally emitted:
// From sub_489000 (backend boilerplate emission)
// -Wunused-local-typedefs: GCC 4.8+ (0x9F5F) or Clang 3.6+ (0x7787)
if ((dword_126E1E8 && qword_126EF90 > 0x7787)
|| (!dword_106BF6C && !dword_106BF68
&& dword_126E1F8 && qword_126E1F0 > 0x9F5F))
{
emit("#pragma GCC diagnostic ignored \"-Wunused-local-typedefs\"");
}
// Push/pop block for managed RT: GCC 4.6+ (0x9E97) or Clang
if (dword_126E1E8 || (!dword_106BF6C && dword_126E1F8 && qword_126E1F0 > 0x9E97))
{
emit("#pragma GCC diagnostic push");
emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");
emit("#pragma GCC diagnostic ignored \"-Wunused-function\"");
// ... managed runtime boilerplate ...
emit("#pragma GCC diagnostic pop");
}
// File-level -Wunused-variable: GCC 4.2+ (0x9D07) or Clang
if (dword_126E1E8 || (dword_126E1F8 && qword_126E1F0 > 0x9D07))
emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");
2. C++ feature gating during dialect resolution. The post-parsing dialect resolution in proc_command_line and the sub_44B6B0 dialect setup function use qword_126EF98 thresholds to decide which C++ language features to enable. Examples from the decompiled code:
// sub_44B6B0 -- dialect resolution, ~400 lines
// GCC 4.3+ (0x9D6B): enable variadic templates
if (qword_126EF98 > 0x9D6B)
dword_106BE1C = 1; // variadic_templates
// GCC 4.7+ (0x9EFB): enable list initialization under certain conditions
if (qword_126EF98 > 0x9EFB && dword_106BE1C && (!byte_E7FFF1 || dword_106C10C))
dword_106BE10 = 1;
// GCC 6.0+ (0xEA5F) or Clang: enable C++14/17 features
if (dword_126EFA4 || (dword_126EFA8 && qword_126EF98 > 0xEA5F))
// Enable feature (Clang always, GCC only 6.0+)
3. CUDA compatibility mode. A special flag dword_E7FF10 (cuda_compat_flag) is set when dword_126EFAC && qword_126EF98 <= 0x76BF -- that is, when extended features are enabled but the GCC version is 3.3.99 or below. This activates a legacy compatibility path for very old host compilers that lack modern C++ support.
The 0xEA5F (59999) Threshold -- The Most Pervasive Gate
The threshold 0xEA5F (GCC 6.0) is the most widely used version constant in the binary, appearing in 22 decompiled functions. It gates the C++14/17 feature set boundary: GCC 6 was the first GCC release series to make C++14 its default dialect, and it added a number of C++17 features.
The typical usage pattern is:
// Pattern: "Clang (any version) OR GCC 6.0+"
if (dword_126EFA4 || (dword_126EFA8 && qword_126EF98 > 0xEA5F))
// Enable C++14/17 feature
// Pattern: "GNU extensions but not Clang, GCC 6.0+"
if (dword_126EFAC && !dword_126EFA4 && qword_126EF98 > 0xEA5F)
// Enable GNU-specific extended feature
Functions using this threshold include: declaration processing (sub_40D900), attribute application (sub_413ED0), class declaration (sub_431590), dialect resolution (sub_44B6B0), initializer processing (sub_48C710, sub_4B6760), backend code generation (sub_4688C0), expression canonicalization (sub_4CA6C0, sub_4D2B70), IL walking (sub_54AED0), scope management (sub_59C9B0, sub_59AF40), type processing (sub_5D1350), overload resolution (sub_662670, sub_666720), and template specialization (sub_6A3B00).
Version-Gated Feature Flag: dword_106BDD8
One particular feature flag (dword_106BDD8) is set during dialect resolution based on a compound version check:
// sub_44B6B0, decompiled line ~228-231
// v4 = (dword_126EFA4 != 0), i.e., is_clang_mode
if ((dword_126EFAC && !v4 && qword_126EF98 > 0x1116F) // GNU ext, not Clang, GCC >= 7.0
|| (v4 && qword_126EF90 > 0x9C3F)) // or Clang >= 4.0
{
dword_106BDD8 = 1;
}
This flag is referenced in 7 decompiled functions (sub_430920, sub_42FE50, sub_447930, sub_44AAC0, sub_44B6B0, sub_45EB40, sub_724630). The W066 global variables report identifies it as optix_mode, but the decompiled code shows it is set purely based on compiler version thresholds during dialect resolution, not from any --emit-optix-ir CLI flag. It likely controls a C++ language feature (possibly structured bindings or another C++17 feature) that requires GCC 7.0+ or Clang 4.0+ support, and the "optix_mode" name in the report may be a misidentification based on context where it was encountered. The flag gates behavior in attribute validation (sub_42FE50), where it interacts with dword_106B670 to control feature availability.
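The compound condition that sets dword_106BDD8 can be restated as a pure function, which makes the boundary cases easy to test. This is a sketch only; the struct and its field names are hypothetical stand-ins for the globals named in the comments:

```c
/* Inputs mirror the globals read by the decompiled check; the struct and
   field names are hypothetical. */
struct host_mode {
    int  gnu_ext;        /* dword_126EFAC */
    int  clang_mode;     /* dword_126EFA4 */
    long gcc_version;    /* qword_126EF98 */
    long clang_version;  /* qword_126EF90 */
};

/* Returns the value the check would store in dword_106BDD8. */
static int version_gated_feature(const struct host_mode *m)
{
    int is_clang = (m->clang_mode != 0);
    return (m->gnu_ext && !is_clang && m->gcc_version > 0x1116F) /* GCC >= 7.0 */
        || (is_clang && m->clang_version > 0x9C3F);              /* Clang >= 4.0 */
}
```

Note that in Clang mode the GCC version is ignored entirely, so a Clang 3.x host leaves the flag clear even when the GCC compatibility version is high.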
Dialect Initialization Flow
The host compiler version globals are initialized in proc_command_line and propagated to the dialect system during TU initialization:
proc_command_line (CLI parsing, sub_459630):
case 184 (--gnu_version=X): qword_126EF98 = X // GCC version
case 188 (--clang_version=X): qword_126EF90 = X // Clang version
case 182 (--gcc): dword_126EFA8 = 1 // GCC mode flag
case 187 (--clang): dword_126EFA4 = 1 // Clang mode flag
dialect_init (sub_44B6B0, called during setup):
// ~400 lines of version-threshold-based feature flag resolution
// Sets 30+ EDG feature flags based on gcc_version, clang_version,
// cpp_standard_version, and extension mode flags
target dialect (sub_752A80, select_cp_gen_be_target_dialect):
if (dword_126EFA8): // GCC mode
dword_126E1F8 = 1 // host_dialect_gnu
qword_126E1F0 = qword_126EF98 // host_gcc_version
if (dword_126EFA4): // Clang mode
dword_126E1E8 = 1 // host_dialect_clang
qword_126E1E0 = qword_126EF90 // host_clang_version
The defaults for unspecified versions are qword_126EF98 = 80100 (GCC 8.1.0) and qword_126EF90 = 90100 (Clang 9.1.0), set during default_init (sub_45EB40).
The --db Debug Mechanism
The --db flag (CLI case 37) activates EDG's internal debug tracing system by calling sub_48A390 (proc_debug_option). While not directly related to architecture gating, the --db mechanism is covered here because its globals (dword_126EFC8, dword_126EFCC) sit next to the host compiler mode flags in memory, and because debug tracing can expose architecture checks as they execute.
Connection Between --db and Architecture
The --db flag does not set or modify any architecture-related globals. Its connection to the architecture system is observational: when debug tracing is enabled, the compiler emits trace output at key decision points throughout compilation, including the semantic analysis functions that evaluate architecture thresholds. Enabling --db=5 (verbosity level 5) causes the compiler to log IL entry kinds, template instantiation steps, and scope transitions, which provides visibility into when and why architecture gates fire.
The CLI dispatch for --db:
// proc_command_line (sub_459630), case 37
case 37: // --db=<string>
if (sub_48A390(qword_E7FF28)) // proc_debug_option
goto error; // returns nonzero on parse failure
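dword_106C2A0 = dword_126EFCC; // snapshot of dword_126EFCC taken after parsing
After proc_debug_option returns, dword_106C2A0 captures the current value of dword_126EFCC (the debug verbosity level). The globals summary names this variable error_count_baseline and treats it as a baseline for subsequent error tracking, even though the value stored at this point is the verbosity level rather than an error count.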
proc_debug_option (sub_48A390)
This 238-line function (debug.c) parses debug control strings. On entry, it unconditionally sets dword_126EFC8 = 1 (debug tracing enabled), then dispatches based on the first character of the input:
// sub_48A390 entry
dword_126EFC8 = 1;                        // enable debug tracing
v3 = (unsigned __int8)*nptr;
if ((unsigned int)(v3 - 48) <= 9) {       // first char is a digit ('0'..'9'); unsigned compare rejects chars below '0'
    dword_126EFCC = strtol(nptr, 0, 10);  // set verbosity level
    return 0;
}
The full parsing grammar:
| Input Format | Parsed As | Action |
|---|---|---|
| "5" (numeric only) | Verbosity level | Sets dword_126EFCC = 5 |
| "name=3" | Name with level | Adds trace node: action=1, level=3 |
| "name+=3" | Additive trace | Adds trace node: action=2, level=3 |
| "name-=3" | Subtractive trace | Adds trace node: action=3, level=3 |
| "name=3!" | Permanent trace | Adds trace node: action=1, level=3, permanent=1 |
| "#name" | Hash removal | Removes matching node from trace list |
| "-name" | Dash removal | Removes matching node from trace list |
| "a,b=2,c=3" | Comma-separated | Processes each entry independently |
Debug Trace Node Structure
Debug trace requests are stored as a singly-linked list rooted at qword_1065870. Each node is 28 bytes, allocated via sub_6B7340 (the IL allocator):
struct debug_trace_node { // 28 bytes (32 allocated)
struct debug_trace_node* next; // +0: linked list link
char* name_string; // +8: entity name to trace (heap copy)
int32 action_type; // +16: 1=set, 2=add, 3=subtract, 4=remove
int32 level; // +20: trace level (integer)
int32 permanent; // +24: 1=survives reset, 0=cleared on reset
};
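The parsing grammar and the node fields suggest a parser along the following lines. This is a hedged sketch, not the decompiled logic: struct db_request and parse_db_entry are hypothetical, action 0 for a bare verbosity level is an invention of this sketch, and comma splitting is assumed to happen before the call.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result record for one comma-free entry of a --db string. */
struct db_request {
    char name[64];   /* empty for a bare verbosity level */
    int  action;     /* 0=verbosity (sketch only), 1=set, 2=add, 3=subtract, 4=remove */
    int  level;
    int  permanent;
};

/* Parses one entry; returns 0 on success, nonzero on a malformed entry. */
static int parse_db_entry(const char *s, struct db_request *out)
{
    memset(out, 0, sizeof *out);
    if (isdigit((unsigned char)*s)) {                 /* "5": verbosity level */
        out->level = (int)strtol(s, NULL, 10);
        return 0;
    }
    if (*s == '#' || *s == '-') {                     /* "#name" / "-name": removal */
        if (s[1] == '\0' || strlen(s + 1) >= sizeof out->name)
            return 1;
        strcpy(out->name, s + 1);
        out->action = 4;
        return 0;
    }
    const char *eq = strchr(s, '=');
    if (eq == NULL || eq == s)
        return 1;
    size_t n = (size_t)(eq - s);
    out->action = 1;                                  /* "name=3" */
    if (eq[-1] == '+')      { out->action = 2; n--; } /* "name+=3" */
    else if (eq[-1] == '-') { out->action = 3; n--; } /* "name-=3" */
    if (n == 0 || n >= sizeof out->name)
        return 1;
    memcpy(out->name, s, n);                          /* name[] is already zeroed */
    char *end;
    out->level = (int)strtol(eq + 1, &end, 10);
    if (end == eq + 1)
        return 1;                                     /* no digits after '=' */
    if (*end == '!') { out->permanent = 1; end++; }   /* "name=3!" */
    return *end != '\0';
}
```

One ambiguity the table does not resolve is an entry such as "-name=3"; the sketch treats any leading '-' as a removal request.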
When proc_debug_option encounters its own name in the trace list (the self-referential check !strcmp(src, "proc_debug_option")), it prints the entire trace state to stderr:
if (qword_1065870 && (v2 & 1) != 0) {
    node = qword_1065870;                 // start at the head of the trace list
    do {
        fprintf(s, "debug request for: %s\n", node->name_string);
        fprintf(s, "action=%d, level=%d\n", node->action_type, node->level);
        node = node->next;
    } while (node);
}
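The node layout and the dump loop can be exercised outside the binary with a small stand-in. In this sketch trace_push and trace_dump are hypothetical helpers, malloc replaces the binary's own allocator, and prepending to the list is an assumption (the insertion order used by proc_debug_option is not documented here). The offsets assume an LP64 target, as in the x86-64 binary.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct debug_trace_node {
    struct debug_trace_node *next;  /* +0  */
    char   *name_string;            /* +8  */
    int32_t action_type;            /* +16 */
    int32_t level;                  /* +20 */
    int32_t permanent;              /* +24 */
};

/* Prepends a request; returns the new head, or NULL on allocation failure. */
static struct debug_trace_node *
trace_push(struct debug_trace_node *head, const char *name,
           int32_t action, int32_t level, int32_t permanent)
{
    struct debug_trace_node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    size_t len = strlen(name) + 1;
    n->name_string = malloc(len);          /* heap copy of the name */
    if (n->name_string == NULL) {
        free(n);
        return NULL;
    }
    memcpy(n->name_string, name, len);
    n->next = head;
    n->action_type = action;
    n->level = level;
    n->permanent = permanent;
    return n;
}

/* Prints every node in the format used by the dump loop; returns the count. */
static int trace_dump(FILE *s, const struct debug_trace_node *node)
{
    int count = 0;
    for (; node != NULL; node = node->next, count++) {
        fprintf(s, "debug request for: %s\n", node->name_string);
        fprintf(s, "action=%d, level=%d\n",
                (int)node->action_type, (int)node->level);
    }
    return count;
}
```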
Debug Verbosity Levels
The dword_126EFCC verbosity level controls trace output granularity across the entire compiler:
| Level | Effect |
|---|---|
| 0 | No debug output (default) |
| 1-2 | Basic trace: function entry/exit markers |
| 3 | Detailed trace: includes entity names, scope indices |
| 4 | Very detailed: IL entry kinds, overload candidate lists |
| 5+ | Full trace: IL tree walking with "Walking IL tree, entry kind = ..." |
db_name (CLI case 190)
The --db_name flag (case 190) calls a separate function sub_48AD80 to register a debug name filter. Unlike --db which enables global tracing, --db_name restricts trace output to entities matching the specified name pattern. If sub_48AD80 fails (returns nonzero), error 570 is emitted.
Three-Layer Checking Model
Layer 1: Compile-Time Semantic Checks (cudafe++ Frontend)
These are the primary gates. During semantic analysis, cudafe++ reads dword_126E4A8 and compares it against threshold constants. Violations emit diagnostic errors through the standard error system (diagnostic IDs in the 3000+ range, displayed as 20000-series via the +16543 offset formula). These checks are unconditional -- they fire regardless of whether the code would actually execute at runtime.
Enforcement point: Declaration processing, type checking, attribute application, and CUDA-specific semantic validation passes.
Examples:
- __managed__ variable declaration with dword_126E4A8 < 30 triggers unsupported_arch_for_managed_capability
- __grid_constant__ parameter with dword_126E4A8 < 70 triggers grid_constant_unsupported_arch
- __wgmma_mma_async call on non-sm_90a triggers wgmma_mma_async_not_enabled
- Virtual base class with dword_126E4A8 < 20 triggers use_of_virtual_base_on_compute_1x
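The relationship between internal diagnostic IDs and the numbers a user sees is plain addition. A minimal sketch follows; displayed_diag_number is a hypothetical helper, and it applies the offset to whatever ID it is given without modeling which IDs the real formatter remaps:

```c
/* Offset between internal diagnostic IDs and the displayed 20000-series
   numbers, as given in the text. */
#define CUDA_DIAG_DISPLAY_OFFSET 16543

/* Hypothetical helper: converts an internal ID to its displayed number. */
static int displayed_diag_number(int internal_id)
{
    return internal_id + CUDA_DIAG_DISPLAY_OFFSET;
}
```

Under this formula internal error 3534 is shown as 20077 and 3661 as 20204.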
Layer 2: String-Embedded Diagnostic Formatting
Error strings with architecture names baked into .rodata represent the complete set of architecture-dependent diagnostics. These strings are loaded by the diagnostic system and serve as the user-visible feedback for Layer 1 checks.
The architecture name in the string (e.g., "compute_70", "sm_90a") is a literal constant, not a formatted parameter -- the compiler does not interpolate the actual target architecture into these messages. This means the error messages always state the minimum required architecture, not what the user actually specified. Where a format specifier does appear, it carries something other than the architecture: the virtual base error uses %t (a type formatter) to include the base class name, and the %s in the __wgmma_mma_async shape and __nv_register_params__ function-type messages carries the shape and the function kind.
Layer 3: Host Compiler Version Gating
This layer does not check GPU architecture at all -- instead, it gates the output format of the generated .int.c file based on the host C++ compiler's version. The thresholds ensure that GCC/Clang-specific pragmas, attributes, and language constructs in the generated code are compatible with the actual host compiler that will consume the output.
Enforcement point: Backend code generation (sub_489000 and related functions in cp_gen_be.c).
Impact: Incorrect host compiler version gating does not cause compilation failure -- it may produce warnings from the host compiler due to unrecognized pragmas, or miss warning suppression directives that would silence spurious diagnostics.
Interaction Between Layers
nvcc (driver)
|
| --target=<sm_code> --gnu_version=<ver> --clang_version=<ver>
v
cudafe++ process
|
+-- CLI parsing (proc_command_line)
| dword_126E4A8 = sm_code (SM architecture)
| qword_126EF98 = gcc_version (host GCC version)
| qword_126EF90 = clang_version (host Clang version)
|
+-- set_target_configuration (sub_7525F0)
| sub_7515D0() -- type table init (100+ globals)
|
+-- dialect_resolution (sub_44B6B0)
| 30+ feature flags set based on version thresholds
| dword_126E1F8 / dword_126E1E8 -- host dialect set
| qword_126E1F0 / qword_126E1E0 -- host version copies
|
+-- TU init (sub_586240)
| dword_126EBF8 = dword_126E4A8 (SM version copy)
|
+-- [Layer 1] Semantic analysis
| Compare dword_126E4A8 against SM thresholds
| Emit CUDA-specific errors for unsupported features
|
+-- [Layer 2] Diagnostic formatting
| Load error string with baked-in architecture name
| Format and display error to user
|
+-- [Layer 3] .int.c code generation
| Compare qword_126E1F0 / qword_126E1E0 against host thresholds
| Emit appropriate #pragma directives
| Generate host-compiler-compatible boilerplate
|
v
Host Compiler (gcc / clang / cl.exe)
Layers 1 and 2 operate during the frontend phase and can halt compilation. Layer 3 operates during the backend phase and only affects the format of the generated output file.
Global Variable Summary
| Address | Size | Name | Role |
|---|---|---|---|
| dword_126E4A8 | 4 | sm_architecture | Target SM version from --target (case 245). Sentinel: -1. |
| dword_126EBF8 | 4 | target_config_index | TU-level copy of dword_126E4A8, set in sub_586240. |
| qword_126EF98 | 8 | gcc_version | GCC compatibility version. Default 80100. Set by --gnu_version (case 184). |
| qword_126EF90 | 8 | clang_version | Clang compatibility version. Default 90100. Set by --clang_version (case 188). |
| dword_126EFA8 | 4 | gcc_extensions | GCC mode enabled. Set by --gcc (case 182). |
| dword_126EFA4 | 4 | clang_extensions | Clang mode enabled. Set by --clang (case 187). |
| dword_126EFAC | 4 | extended_features | Extended features / GNU compat mode. |
| dword_126EFB0 | 4 | gnu_extensions_enabled | GNU extensions active. |
| dword_126E1F8 | 4 | host_dialect_gnu | Host compiler is GCC/GNU. Set during dialect init. |
| dword_126E1E8 | 4 | host_dialect_clang | Host compiler is Clang. Set during dialect init. |
| qword_126E1F0 | 8 | host_gcc_version | Host GCC version, copied from qword_126EF98. |
| qword_126E1E0 | 8 | host_clang_version | Host Clang version, copied from qword_126EF90. |
| dword_126EFC8 | 4 | debug_trace_enabled | Debug tracing active. Set unconditionally by --db. |
| dword_126EFCC | 4 | debug_verbosity | Debug output level. >2=detailed, >4=IL walk trace. |
| dword_E7FF10 | 4 | cuda_compat_flag | Legacy compat: dword_126EFAC && qword_126EF98 <= 0x76BF. |
| dword_106BDD8 | 4 | version_gated_feature | Set when GCC >= 7.0 or Clang >= 4.0. Referenced in 7 functions. |
| dword_106C2A0 | 4 | error_count_baseline | Saved from dword_126EFCC after --db processing. |
| qword_1065870 | 8 | debug_trace_list | Head of debug trace request linked list. |
| dword_126E4A0 | 4 | target_vector_width | Set to 8 by sub_7515D0. |
Cross-References
- CLI Flag Inventory -- --target, --gnu_version, --clang_version, --db flag details
- Architecture Detection -- --target flag and SM version parsing details
- CUDA Error Catalog -- Complete diagnostic messages for each feature gate
- .int.c File Format -- Host compiler pragma emission details
- Backend Code Generation -- GCC/Clang version threshold usage in output
- Global Variable Index -- Full address-level documentation
- Execution Spaces -- Execution space bitfield and attribute handlers
- __managed__ Variables -- Managed variable attribute and SM 30 gate
- __grid_constant__ -- Grid constant attribute and SM 70 gate
Attribute System Overview
cudafe++ processes CUDA attributes through NVIDIA's customization of the EDG 6.6 attribute subsystem. EDG provides a general-purpose attribute infrastructure in attribute.c (approximately 11,500 lines of source, spanning addresses 0x409350--0x418F80 in the binary) that handles C++11 [[...]] attributes, GNU __attribute__((...)), MSVC __declspec, and alignas. NVIDIA extends this infrastructure by injecting 14 CUDA-specific attribute kinds into EDG's attribute kind enumeration, registering CUDA-specific handler callbacks, and adding a post-declaration validation pass that enforces cross-attribute consistency rules (e.g., __launch_bounds__ requires __global__).
The attribute system operates in four phases: scanning (lexer recognizes attribute syntax and builds attribute node lists), lookup (maps attribute names to descriptors via a hash table), application (dispatches to per-attribute handler functions that modify entity nodes), and validation (post-declaration consistency checks). CUDA attributes participate in all four phases, using the same node structures and dispatch mechanisms as standard C++/GNU attributes.
CUDA Attribute Kind Enum
Every attribute node carries a kind byte at offset +8. For standard C++/GNU attributes, EDG assigns kinds from its built-in descriptor table (byte_82C0E0 in the .rodata segment). For CUDA attributes, NVIDIA reserves a block of kind values in the ASCII printable range. The function attribute_display_name (sub_40A310, from attribute.c:1307) contains the authoritative switch table that maps kind values to human-readable names:
| Kind | Hex | ASCII | Display Name | Category | Handler |
|---|---|---|---|---|---|
| 86 | 0x56 | 'V' | __host__ | Execution space | sub_4108E0 |
| 87 | 0x57 | 'W' | __device__ | Execution space | sub_40EB80 |
| 88 | 0x58 | 'X' | __global__ | Execution space | sub_40E1F0 / sub_40E7F0 |
| 89 | 0x59 | 'Y' | __tile_global__ | Execution space | (internal) |
| 90 | 0x5A | 'Z' | __shared__ | Memory space | sub_40E0D0 (shared path) |
| 91 | 0x5B | '[' | __constant__ | Memory space | sub_40E0D0 (constant path) |
| 92 | 0x5C | '\' | __launch_bounds__ | Launch config | sub_411C80 |
| 93 | 0x5D | ']' | __maxnreg__ | Launch config | sub_410F70 |
| 94 | 0x5E | '^' | __local_maxnreg__ | Launch config | sub_411090 |
| 95 | 0x5F | '_' | __tile_builtin__ | Internal | (internal) |
| 102 | 0x66 | 'f' | __managed__ | Memory space | sub_40E0D0 (managed path) |
| 107 | 0x6B | 'k' | __cluster_dims__ | Launch config | sub_4115F0 |
| 108 | 0x6C | 'l' | __block_size__ | Launch config | sub_4109E0 |
| 110 | 0x6E | 'n' | __nv_pure__ | Optimization | (internal) |
The kind values are not contiguous. Kinds 86--95 form a dense block for the original CUDA attributes. Kinds 102, 107, 108, and 110 were added later (managed memory in CUDA 6.0, cluster dimensions in CUDA 11.8, block size and nv_pure more recently), occupying gaps in the ASCII range.
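The kind table translates directly into an enumeration plus a membership test. The enumerator names below are hypothetical; only the numeric values come from the table, and the character equivalents assume an ASCII execution character set:

```c
/* CUDA attribute kind values from the table above. */
enum cuda_attr_kind {
    CAK_HOST          = 86,   /* 'V' */
    CAK_DEVICE        = 87,   /* 'W' */
    CAK_GLOBAL        = 88,   /* 'X' */
    CAK_TILE_GLOBAL   = 89,   /* 'Y' */
    CAK_SHARED        = 90,   /* 'Z' */
    CAK_CONSTANT      = 91,   /* '[' */
    CAK_LAUNCH_BOUNDS = 92,   /* '\\' */
    CAK_MAXNREG       = 93,   /* ']' */
    CAK_LOCAL_MAXNREG = 94,   /* '^' */
    CAK_TILE_BUILTIN  = 95,   /* '_' */
    CAK_MANAGED       = 102,  /* 'f' */
    CAK_CLUSTER_DIMS  = 107,  /* 'k' */
    CAK_BLOCK_SIZE    = 108,  /* 'l' */
    CAK_NV_PURE       = 110   /* 'n' */
};

/* Returns 1 when kind is one of the 14 CUDA attribute kinds. */
static int is_cuda_attr_kind(unsigned char kind)
{
    if (kind >= CAK_HOST && kind <= CAK_TILE_BUILTIN)
        return 1;                         /* dense block 86..95 */
    return kind == CAK_MANAGED || kind == CAK_CLUSTER_DIMS
        || kind == CAK_BLOCK_SIZE || kind == CAK_NV_PURE;
}
```

Because the later kinds are scattered, a range check alone is not enough; values such as 96 through 101 fall between the dense block and __managed__ and must be rejected.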
attribute_display_name (sub_40A310)
This function serves dual duty: it formats the display name for diagnostic messages, and its switch table is the canonical enumeration of all CUDA attribute kinds. The logic:
// sub_40A310 -- attribute_display_name (attribute.c:1307)
// a1: pointer to attribute node
const char* attribute_display_name(attr_node_t* a1) {
const char* name = a1->name; // +16
const char* ns = a1->namespace_str; // +24
// If scoped (namespace::name), format "namespace::name"
if (ns) {
size_t ns_len = strlen(ns);
assert(ns_len + strlen(name) + 3 <= 204); // buffer byte_E7FB80
sprintf(byte_E7FB80, "%s::%s", ns, name);
name = intern_string(byte_E7FB80); // sub_5E0700
}
// Override with CUDA display name based on kind byte
switch (a1->kind) { // byte at +8
case 'V': return "__host__";
case 'W': return "__device__";
case 'X': return "__global__";
case 'Y': return "__tile_global__";
case 'Z': return "__shared__";
case '[': return "__constant__";
case '\\': return "__launch_bounds__";
case ']': return "__maxnreg__";
case '^': return "__local_maxnreg__";
case '_': return "__tile_builtin__";
case 'f': return "__managed__";
case 'k': return "__cluster_dims__";
case 'l': return "__block_size__";
case 'n': return "__nv_pure__";
default: return name ? name : "";
}
}
The 204-byte static buffer byte_E7FB80 is shared across calls (not thread-safe, but cudafe++ is single-threaded per translation unit). The intern_string call (sub_5E0700) ensures the formatted "namespace::name" string is deduplicated into EDG's permanent string pool.
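The size check in front of the sprintf can be illustrated with a bounded variant. This is a sketch under stated assumptions: format_scoped_name is a hypothetical helper that writes into a caller-supplied buffer and returns an error code, whereas the binary uses a static buffer and asserts.

```c
#include <stdio.h>
#include <string.h>

#define DISPLAY_BUF_SIZE 204  /* size of the static buffer described above */

/* Formats "ns::name" into buf, which must hold DISPLAY_BUF_SIZE bytes.
   Returns 0 on success, -1 if the result would not fit. The +3 covers the
   two ':' characters and the terminating NUL. */
static int format_scoped_name(char *buf, const char *ns, const char *name)
{
    if (strlen(ns) + strlen(name) + 3 > DISPLAY_BUF_SIZE)
        return -1;
    snprintf(buf, DISPLAY_BUF_SIZE, "%s::%s", ns, name);
    return 0;
}
```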
Attribute Node Structure
Every attribute is represented by a 72-byte IL node (entry kind 0x48 = attribute). The node layout:
struct attr_node_t { // 72 bytes, IL entry kind 0x48
attr_node_t* next; // +0 next attribute in list
uint8_t kind; // +8 attribute kind byte (CUDA: 'V'..'n')
uint8_t source_mode; // +9 1=C++11, 2=GNU, 3=MSVC, 4=alignas, 5=clang
uint8_t target_kind; // +10 what entity type this targets
uint8_t flags; // +11 bit 0=applies_to_params
// bit 1=skip_arg_check
// bit 4=scoped attribute
// bit 7=unknown/unrecognized
uint32_t _pad; // +12 (alignment)
const char* name; // +16 attribute name string
const char* namespace_str; // +24 namespace (NULL for unscoped)
arg_node_t* arguments; // +32 argument list head
void* source_pos; // +40 source position info
void* decl_context; // +48 declaration context / scope
void* src_loc_1; // +56 source location
void* src_loc_2; // +64 secondary source location
};
For CUDA attributes, the kind byte at offset +8 is the discriminator. When get_attr_descr_for_attribute (sub_40FDB0) resolves an attribute name, it writes the corresponding kind value from the descriptor table (byte_82C0E0) into this field. All subsequent dispatch operates on this byte alone.
The source_mode byte at +9 indicates the syntactic form the user wrote. CUDA attributes like __host__ are parsed as GNU-style attributes (source_mode = 2), because cudafe++ defines them via __attribute__((...)) internally.
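The documented offsets can be sanity-checked by declaring the node in C and comparing against offsetof. The sketch below assumes an LP64 target; the flag macro names and attr_is_scoped are hypothetical, and the argument list pointer is typed as void * because the argument node layout is not described here.

```c
#include <stddef.h>
#include <stdint.h>

struct attr_node {
    struct attr_node *next;     /* +0  */
    uint8_t     kind;           /* +8  */
    uint8_t     source_mode;    /* +9  1=C++11, 2=GNU, 3=MSVC, 4=alignas, 5=clang */
    uint8_t     target_kind;    /* +10 */
    uint8_t     flags;          /* +11 */
    uint32_t    pad;            /* +12 */
    const char *name;           /* +16 */
    const char *namespace_str;  /* +24 */
    void       *arguments;      /* +32 */
    void       *source_pos;     /* +40 */
    void       *decl_context;   /* +48 */
    void       *src_loc_1;      /* +56 */
    void       *src_loc_2;      /* +64 */
};

/* Bits of the flags byte at +11, as listed in the layout. */
#define ATTR_FLAG_APPLIES_TO_PARAMS 0x01u  /* bit 0 */
#define ATTR_FLAG_SKIP_ARG_CHECK    0x02u  /* bit 1 */
#define ATTR_FLAG_SCOPED            0x10u  /* bit 4 */
#define ATTR_FLAG_UNRECOGNIZED      0x80u  /* bit 7 */

/* Returns 1 when the scoped-attribute bit is set. */
static int attr_is_scoped(const struct attr_node *a)
{
    return (a->flags & ATTR_FLAG_SCOPED) != 0;
}
```

On an LP64 compiler sizeof(struct attr_node) comes out to 72 and offsetof(struct attr_node, name) to 16, matching the table.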
Attribute Descriptor Table and Name Lookup
Master Descriptor Table (off_D46820)
The attribute descriptor table is a static array at off_D46820, extending to unk_D47A60. Both addresses fall inside the .data section (0xD46480--0xE7EFF0), not .rodata. Each entry is 32 bytes and encodes:
- Attribute name string
- Kind byte (written to attr_node_t.kind on match)
- Handler function pointer (the apply_* callback)
- Mode/version condition string (e.g., 'g' for GCC-only, 'l' for Clang-only)
- Target applicability mask
Initialization: init_attr_name_map (sub_418F80)
At startup, init_attr_name_map iterates the descriptor table, validates each name is at most 100 characters, and inserts it into the hash table qword_E7FB60 (created via sub_7425C0). This hash table enables O(1) lookup of attribute names during parsing.
// sub_418F80 -- init_attr_name_map (attribute.c:1524)
void init_attr_name_map(void) {
attr_name_map = create_hash_table(); // qword_E7FB60
for (attr_descr* d = off_D46820; d < unk_D47A60; d++) {
assert(strlen(d->name) <= 100);
insert_into_hash_table(attr_name_map, d->name, d);
}
// Also initializes dword_E7F078 and processes config if dword_106BF18 set
}
A companion function init_attr_token_map (sub_419070) creates a second hash table qword_E7F038 that maps attribute tokens to their descriptors, used during lexer-level attribute recognition.
Name Normalization: sub_40A250
Before looking up an attribute name, EDG strips __ prefixes and suffixes. The function at sub_40A250 checks whether the name starts with "__" and ends with "__", strips them, and looks up the bare name in qword_E7FB60. This means __host__, __attribute__((host)), and host all resolve to the same descriptor. The stripping respects the current language standard (dword_126EFB4) and C++ version (dword_126EF68).
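The stripping step can be sketched as a small string helper. normalize_attr_name is a hypothetical name, and the sketch ignores the language-standard and C++-version conditions mentioned above; it only shows the prefix/suffix rule:

```c
#include <string.h>

/* Copies name into out (capacity cap), dropping one leading and one trailing
   "__" when both are present and something remains between them.
   Returns 0 on success, -1 if out is too small. */
static int normalize_attr_name(const char *name, char *out, size_t cap)
{
    size_t len = strlen(name);
    const char *start = name;

    if (len > 4 && strncmp(name, "__", 2) == 0 &&
        strcmp(name + len - 2, "__") == 0) {
        start = name + 2;   /* skip the leading "__" */
        len -= 4;           /* drop both pairs of underscores */
    }
    if (len + 1 > cap)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

With this rule both "__host__" and "host" normalize to "host", while a name with underscores on only one side, such as "__x", is left untouched.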
Central Dispatch: get_attr_descr_for_attribute (sub_40FDB0)
This 227-line function is the central attribute resolution path. Given an attribute node with a name, it:
- Looks up the name in the hash table
- Checks mode compatibility (GCC mode via dword_126EFA8, Clang mode via dword_126EFA4, MSVC mode via dword_106BF68/dword_106BF58)
- Checks namespace match ("gnu", "__gnu__", "clang") via cond_matches_attr_mode (sub_40C4C0)
- Evaluates version-conditional availability via in_attr_cond_range (sub_40D620)
- Writes the kind byte from the matched descriptor into attr_node_t.kind
- Returns the descriptor entry (which carries the handler function pointer)
The mode condition strings use a compact encoding: 'g'=GCC, 'l'=Clang, 's'=Sun, 'c'=C++, 'm'=MSVC; 'x'=extension, '+'=positive match, '!'=boundary marker.
Attribute Application Pipeline
Phase 1: Scanning
The lexer recognizes attribute syntax and calls into the scanning functions:
| Function | Address | Role |
|---|---|---|
| scan_std_attribute_group | sub_412650 | Parses [[...]] C++11 and __attribute__((...)) GNU attributes |
| scan_gnu_attribute_groups | sub_412F20 | Handles __attribute__((...)) specifically |
| scan_attributes_list | sub_4124A0 | Iterates token stream building attribute node lists |
| parse_attribute_argument_clause | sub_40C8B0 | Parses attribute argument expressions |
| get_balanced_token | sub_40C6C0 | Handles balanced parentheses/brackets in arguments |
Scanning produces a linked list of attr_node_t nodes. At this stage, the kind byte is unset; only the name and namespace_str fields are populated.
Phase 2: Lookup and Kind Assignment
When the parser reaches a declaration, get_attr_descr_for_attribute resolves each attribute name to a descriptor and writes the kind byte. For CUDA attributes, this assigns values in the 'V'--'n' range.
Phase 3: Application -- apply_one_attribute (sub_413240)
The central dispatcher is a 585-line function containing a switch on the kind byte. For each CUDA kind, it calls the corresponding handler:
// sub_413240 -- apply_one_attribute (attribute.c, main dispatch)
// 585 lines, giant switch on attribute kind
void apply_one_attribute(attr_node_t* attr, entity_t* entity, int target_kind) {
switch (attr->kind) {
case 'V': apply_nv_host_attr(attr, entity, target_kind); break;
case 'W': apply_nv_device_attr(attr, entity, target_kind); break;
case 'X': apply_nv_global_attr(attr, entity, target_kind); break;
case 'Z': apply_nv_shared_attr(attr, entity, target_kind); break;
case '[': apply_nv_constant_attr(attr, entity, target_kind); break;
case '\\': apply_nv_launch_bounds(attr, entity, target_kind); break;
case ']': apply_nv_maxnreg_attr(attr, entity, target_kind); break;
case '^': apply_nv_local_maxnreg(attr, entity, target_kind); break;
case 'f': apply_nv_managed_attr(attr, entity, target_kind); break;
case 'k': apply_nv_cluster_dims(attr, entity, target_kind); break;
case 'l': apply_nv_block_size(attr, entity, target_kind); break;
// ... standard attributes handled similarly ...
}
}
The outer iteration is apply_attributes_to_entity (sub_413ED0, 492 lines), which walks the attribute list, calls apply_one_attribute for each, and handles deferred attributes, attribute merging, and ordering constraints.
Phase 4: Post-Declaration Validation -- sub_6BC890
After all attributes on a declaration are applied, sub_6BC890 (nv_validate_cuda_attributes, from nv_transforms.c) performs cross-attribute consistency checking. This function validates that combinations of CUDA attributes are legal:
// sub_6BC890 -- nv_validate_cuda_attributes (nv_transforms.c)
// a1: entity (function), a2: diagnostic location
void nv_validate_cuda_attributes(entity_t* fn, source_loc_t* loc) {
    if (!fn || (fn->byte_177 & 0x10))          // skip if null or already validated
        return;
    uint8_t exec_space = fn->byte_182;         // CUDA execution space bits
    launch_config_t* lc = fn->launch_config;   // entity+256
    // Check 1: parameters with rvalue-reference in __global__ functions
    // Walks parameter list, emits error 3702 for ref-qualified params
    // Check 2: __nv_register_params__ on __host__-only or __global__
    if (fn->byte_183 & 0x08) {
        if (exec_space & 0x40)                 // __global__
            emit_error(3661, "__global__");
        else if ((exec_space & 0x30) == 0x20)  // __host__ only (no __device__)
            emit_error(3661, "__host__");
    }
    // Check 3: __launch_bounds__ without __global__
    if (lc && !(exec_space & 0x40)) {
        if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor)
            emit_error(3534, "__launch_bounds__");
    }
    // Check 4: __cluster_dims__ / __block_size__ without __global__
    if (lc && !(exec_space & 0x40) &&
        ((fn->byte_183 & 0x40) || lc->cluster_dim_x > 0)) {
        const char* name = (lc->block_size_x > 0) ? "__block_size__" : "__cluster_dims__";
        emit_error(3534, name);
    }
    // Check 5: cluster product exceeds maxBlocksPerClusterSize
    if (lc && lc->cluster_dim_x > 0 && lc->maxBlocksPerClusterSize > 0) {
        if (lc->maxBlocksPerClusterSize <
            lc->cluster_dim_x * lc->cluster_dim_y * lc->cluster_dim_z) {
            emit_error(3707, ...);
        }
    }
    // Check 6: __maxnreg__ without __global__
    if (lc && lc->maxnreg >= 0 && !(exec_space & 0x40))
        emit_error(3715, "__maxnreg__");
    // Check 7: __launch_bounds__ + __maxnreg__ conflict
    if (lc && lc->maxThreadsPerBlock && lc->maxnreg >= 0)
        emit_error(3719, "__launch_bounds__ and __maxnreg__");
    // Check 8: __global__ without __launch_bounds__
    if ((exec_space & 0x40) && (!lc || (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor)))
        emit_warning(3695);  // "no __launch_bounds__ specified for __global__ function"
}
Error Codes in Validation
| Error | Severity | Message |
|---|---|---|
| 3534 | 7 (error) | "%s" attribute is not allowed on a non-__global__ function |
| 3661 | 7 (error) | __nv_register_params__ is not allowed on a %s function |
| 3695 | 4 (warning) | no __launch_bounds__ specified for __global__ function |
| 3702 | 7 (error) | Parameter with rvalue reference in __global__ function |
| 3707 | 7 (error) | total number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit |
| 3715 | 7 (error) | __maxnreg__ is not allowed on a non-__global__ function |
| 3719 | 7 (error) | __launch_bounds__ and __maxnreg__ may not be used on the same declaration |
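The launch-configuration checks can be exercised in isolation. The following is a minimal sketch, not the decompiled code: `launch_cfg_model`, `push_diag`, and `validate_model` are hypothetical names, the struct is a trimmed-down stand-in for the real launch configuration, and only checks 3, 5, 6, and 7 from the pseudocode are modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down model of the launch configuration. */
typedef struct {
    uint64_t max_threads_per_block;   /* __launch_bounds__ argument 1 */
    uint64_t min_blocks_per_sm;       /* __launch_bounds__ argument 2 */
    int32_t  max_blocks_per_cluster;  /* __launch_bounds__ argument 3 */
    int32_t  cluster_x, cluster_y, cluster_z;
    int32_t  maxnreg;                 /* -1 means "unset" */
} launch_cfg_model;

/* Appends a diagnostic number if there is room in the output array. */
static void push_diag(int out[], size_t cap, size_t *n, int code)
{
    if (*n < cap)
        out[(*n)++] = code;
}

/* Models checks 3, 5, 6 and 7; returns how many diagnostics were recorded. */
static size_t validate_model(uint8_t exec_space, const launch_cfg_model *lc,
                             int out[], size_t cap)
{
    size_t n = 0;
    int is_kernel = (exec_space & 0x40) != 0;   /* bit 6: __global__ */

    if (lc == NULL)
        return 0;
    if (!is_kernel && (lc->max_threads_per_block || lc->min_blocks_per_sm))
        push_diag(out, cap, &n, 3534);
    if (lc->cluster_x > 0 && lc->max_blocks_per_cluster > 0 &&
        lc->max_blocks_per_cluster <
            lc->cluster_x * lc->cluster_y * lc->cluster_z)
        push_diag(out, cap, &n, 3707);
    if (!is_kernel && lc->maxnreg >= 0)
        push_diag(out, cap, &n, 3715);
    if (lc->max_threads_per_block && lc->maxnreg >= 0)
        push_diag(out, cap, &n, 3719);
    return n;
}
```

Under this model, a kernel with a 2x2x2 cluster, a cluster limit of 4, and both `__launch_bounds__` and `__maxnreg__` set records 3707 followed by 3719, since none of the checks short-circuits the ones after it.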
Per-Attribute Handler Function Table
Each CUDA attribute has a dedicated apply_* function registered in the descriptor table. These functions modify entity node fields (execution space bits, memory space bits, launch configuration) and emit diagnostics for invalid usage.
| Attribute | Handler | Address | Lines | Entity Fields Modified |
|---|---|---|---|---|
| __host__ | apply_nv_host_attr | sub_4108E0 | 31 | entity+182 \|= 0x15 |
| __device__ | apply_nv_device_attr | sub_40EB80 | 100 | Functions: entity+182 \|= 0x23; Variables: entity+148 \|= 0x01 |
| __global__ | apply_nv_global_attr | sub_40E1F0 | 89 | entity+182 \|= 0x61 |
| __global__ (variant 2) | apply_nv_global_attr | sub_40E7F0 | 86 | Same as above (alternate entry point) |
| __shared__ | (via device attr path) | -- | -- | entity+148 \|= 0x02 |
| __constant__ | (via device attr path) | -- | -- | entity+148 \|= 0x04 |
| __managed__ | apply_nv_managed_attr | sub_40E0D0 | 47 | entity+148 \|= 0x01, entity+149 \|= 0x01 |
| __launch_bounds__ | apply_nv_launch_bounds_attr | sub_411C80 | 98 | entity+256 -> launch config +0, +8, +16 |
| __maxnreg__ | apply_nv_maxnreg_attr | sub_410F70 | 67 | entity+256 -> launch config +32 |
| __local_maxnreg__ | apply_nv_local_maxnreg_attr | sub_411090 | 67 | entity+256 -> launch config +36 |
| __cluster_dims__ | apply_nv_cluster_dims_attr | sub_4115F0 | 145 | entity+256 -> launch config +20, +24, +28 |
| __block_size__ | apply_nv_block_size_attr | sub_4109E0 | 265 | entity+256 -> launch config +40..+52 |
| __nv_register_params__ | apply_nv_register_params_attr | sub_40B0A0 | 38 | entity+183 \|= 0x08 |
Attribute Registration (sub_6B5E50)
The function sub_6B5E50 (160 lines, in the nv_transforms.c / mem_manage.c area) registers NVIDIA-specific pseudo-attributes into EDG's keyword and macro systems at startup. It operates after EDG's standard keyword initialization but before parsing begins.
The registration creates macro-like definitions that the lexer expands before attribute processing. The function:
- Allocates attribute definition nodes via sub_6BA0D0 (EDG's node allocator)
- Looks up existing definitions via sub_734430 (hash table search) -- if a definition already exists, it chains the new handler onto it via sub_6AC190
- Creates new keyword entries via sub_749600 if no prior definition exists
- Registers __nv_register_params__ as a 40-byte attribute definition node (kind marker 8961) with chain linkage
- Registers __noinline__ as a 30-byte attribute definition node (kind marker 6401), including the "oinline))" suffix for __attribute__((__noinline__)) expansion
- Conditionally registers ARM SME attributes (__arm_in, __arm_inout, __arm_out, __arm_preserves, __arm_streaming, __arm_streaming_compatible) via sub_6ACCB0 when Clang version >= 180000 and ARM target flags are set
- Registers _Pragma as an operator-like keyword for _Pragma("...") processing
If any registration fails (the existing entry cannot be extended), it emits internal error 1338 with the attribute name and calls sub_6B6280 (fatal error handler).
Entity Node: CUDA Attribute Fields
CUDA attributes modify specific byte fields in entity nodes. The key fields for a reimplementation:
Execution Space (entity+182)
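Bit 0 (0x01): annotated -- set by all three handlers (__host__, __device__, __global__)
Bit 1 (0x02): __device__ -- set by apply_nv_device_attr (part of 0x23)
Bit 2 (0x04): __host__ -- set by apply_nv_host_attr
Bit 4 (0x10): __host__ explicit -- set by apply_nv_host_attr
Bit 5 (0x20): device annotation -- set by apply_nv_device_attr and apply_nv_global_attr
Bit 6 (0x40): __global__ -- set by apply_nv_global_attr
Bit 7 (0x80): __host__ __device__ (combined) -- set when both are specified; also set at the end of apply_nv_global_attr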
Handlers use OR-masks: __host__ sets 0x15 (bits 0+2+4), __device__ sets 0x23 (bits 0+1+5), __global__ sets 0x61 (bits 0+5+6). The overlap at bit 0 means all execution-space-annotated functions have bit 0 set, which serves as a quick "has CUDA annotation" predicate.
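The mask arithmetic is easy to check by hand. The sketch below uses only the three OR-mask values stated above; the enum and function names (`MASK_HOST`, `has_cuda_annotation`, `is_kernel`, `apply_mask`) are hypothetical labels chosen for illustration and do not appear in the binary.

```c
#include <stdbool.h>
#include <stdint.h>

/* OR-masks applied to the execution space byte at entity+182. */
enum {
    MASK_HOST   = 0x15,  /* bits 0, 2, 4 */
    MASK_DEVICE = 0x23,  /* bits 0, 1, 5 */
    MASK_GLOBAL = 0x61   /* bits 0, 5, 6 */
};

/* Bit 0 is common to all three masks, so it acts as the
   "has CUDA annotation" predicate. */
static bool has_cuda_annotation(uint8_t es) { return (es & 0x01) != 0; }

/* Bit 6 is only present in the __global__ mask. */
static bool is_kernel(uint8_t es) { return (es & 0x40) != 0; }

/* Handlers only ever OR bits in; they never clear existing bits. */
static uint8_t apply_mask(uint8_t es, uint8_t mask) { return (uint8_t)(es | mask); }
```

Because the handlers only OR bits in, applying `__host__` and then `__device__` to a zeroed byte yields 0x15 | 0x23 = 0x37, regardless of the order in which the two attributes are processed.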
Memory Space (entity+148)
Bit 0 (0x01): __device__ device memory
Bit 1 (0x02): __shared__ shared memory
Bit 2 (0x04): __constant__ constant memory
Extended Memory Space (entity+149)
Bit 0 (0x01): __managed__ managed (unified) memory
Launch Configuration (entity+256)
A pointer to a separately allocated launch_config_t structure (created by sub_5E52F0):
struct launch_config_t {
    uint64_t maxThreadsPerBlock;          // +0   from __launch_bounds__(N, ...)
    uint64_t minBlocksPerMultiprocessor;  // +8   from __launch_bounds__(N, M, ...)
    int32_t  maxBlocksPerClusterSize;     // +16  from __launch_bounds__(N, M, K)
    int32_t  cluster_dim_x;               // +20  from __cluster_dims__(X, ...)
    int32_t  cluster_dim_y;               // +24  from __cluster_dims__(X, Y, ...)
    int32_t  cluster_dim_z;               // +28  from __cluster_dims__(X, Y, Z)
    int32_t  maxnreg;                     // +32  from __maxnreg__(N)
    int32_t  local_maxnreg;               // +36  from __local_maxnreg__(N)
    int32_t  block_size_x;                // +40  from __block_size__(X, ...)
    int32_t  block_size_y;                // +44  from __block_size__(X, Y, ...)
    int32_t  block_size_z;                // +48  from __block_size__(X, Y, Z, ...)
    uint8_t  flags;                       // +52  bit 0=cluster_dims_set
                                          //      bit 1=block_size_set
};
This structure is allocated lazily: it is created only when a launch configuration attribute is first applied to a function. The allocation function sub_5E52F0 returns a structure that is zero-initialized except for maxnreg and local_maxnreg, which are set to -1 (the sentinel for "unset").
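The offsets in the struct comments can be verified mechanically. The sketch below restates the layout with C11 static assertions and models the lazy allocation with its -1 sentinels. `entity_model`, `alloc_launch_config`, and `get_or_create_launch_config` are hypothetical names, and `calloc` stands in for EDG's own allocator, which is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct launch_config {
    uint64_t maxThreadsPerBlock;
    uint64_t minBlocksPerMultiprocessor;
    int32_t  maxBlocksPerClusterSize;
    int32_t  cluster_dim_x, cluster_dim_y, cluster_dim_z;
    int32_t  maxnreg;
    int32_t  local_maxnreg;
    int32_t  block_size_x, block_size_y, block_size_z;
    uint8_t  flags;
} launch_config_t;

/* Compile-time checks against the documented offsets. */
_Static_assert(offsetof(launch_config_t, maxBlocksPerClusterSize) == 16, "+16");
_Static_assert(offsetof(launch_config_t, cluster_dim_x) == 20, "+20");
_Static_assert(offsetof(launch_config_t, maxnreg) == 32, "+32");
_Static_assert(offsetof(launch_config_t, local_maxnreg) == 36, "+36");
_Static_assert(offsetof(launch_config_t, block_size_x) == 40, "+40");
_Static_assert(offsetof(launch_config_t, flags) == 52, "+52");
_Static_assert(sizeof(launch_config_t) == 56, "56 bytes including tail padding");

/* Zero-initialized except for the two "unset" sentinels. */
static launch_config_t *alloc_launch_config(void)
{
    launch_config_t *lc = calloc(1, sizeof *lc);
    if (lc != NULL) {
        lc->maxnreg = -1;
        lc->local_maxnreg = -1;
    }
    return lc;
}

/* Hypothetical entity model holding only the pointer at entity+256. */
typedef struct {
    launch_config_t *launch_config;
} entity_model;

/* Lazy creation: allocate on first use, reuse afterwards. */
static launch_config_t *get_or_create_launch_config(entity_model *e)
{
    if (e->launch_config == NULL)
        e->launch_config = alloc_launch_config();
    return e->launch_config;
}
```

The sentinel matters for the validator: a test such as `maxnreg >= 0` distinguishes "attribute never applied" from an explicit value, which a plain zero-initialized field could not do.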
Attribute Processing Global State
| Global | Address | Purpose |
|---|---|---|
| qword_E7FB60 | 0xE7FB60 | Attribute name hash table (created by init_attr_name_map) |
| qword_E7F038 | 0xE7F038 | Attribute token hash table (created by init_attr_token_map) |
| byte_E7FB80 | 0xE7FB80 | 204-byte static buffer for formatted attribute display names |
| off_D46820 | 0xD46820 | Master attribute descriptor table (32 bytes per entry, extends to 0xD47A60) |
| qword_E7F070 | 0xE7F070 | Visibility stack (for __attribute__((visibility(...))) nesting) |
| qword_E7F048 | 0xE7F048 | Alias/ifunc free list head |
| qword_E7F058/E7F050 | 0xE7F058/0xE7F050 | Alias chain list head/tail |
| dword_E7F080 | 0xE7F080 | Attribute processing flags |
| dword_E7F078 | 0xE7F078 | Extended attribute config flag |
The function reset_attribute_processing_state (sub_4190B0) zeroes all of these at the start of each translation unit.
Function Map
| Address | Identity | Source | Confidence |
|---|---|---|---|
| sub_40A250 | strip_double_underscores_and_lookup | attribute.c | HIGH |
| sub_40A310 | attribute_display_name | attribute.c:1307 | HIGH |
| sub_40C4C0 | cond_matches_attr_mode | attribute.c | HIGH |
| sub_40C6C0 | get_balanced_token | attribute.c | HIGH |
| sub_40C8B0 | parse_attribute_argument_clause | attribute.c | HIGH |
| sub_40D620 | in_attr_cond_range | attribute.c | HIGH |
| sub_40E0D0 | apply_nv_managed_attr | attribute.c:10523 | HIGH |
| sub_40E1F0 | apply_nv_global_attr (variant 1) | attribute.c | HIGH |
| sub_40E7F0 | apply_nv_global_attr (variant 2) | attribute.c | HIGH |
| sub_40EB80 | apply_nv_device_attr | attribute.c | HIGH |
| sub_40FDB0 | get_attr_descr_for_attribute | attribute.c:1902 | HIGH |
| sub_4108E0 | apply_nv_host_attr | attribute.c | HIGH |
| sub_4109E0 | apply_nv_block_size_attr | attribute.c | HIGH |
| sub_410F70 | apply_nv_maxnreg_attr | attribute.c | HIGH |
| sub_411090 | apply_nv_local_maxnreg_attr | attribute.c | HIGH |
| sub_4115F0 | apply_nv_cluster_dims_attr | attribute.c | HIGH |
| sub_411C80 | apply_nv_launch_bounds_attr | attribute.c | HIGH |
| sub_412650 | scan_std_attribute_group | attribute.c:2914 | HIGH |
| sub_413240 | apply_one_attribute | attribute.c | HIGH |
| sub_413ED0 | apply_attributes_to_entity | attribute.c | HIGH |
| sub_418F80 | init_attr_name_map | attribute.c:1524 | HIGH |
| sub_419070 | init_attr_token_map | attribute.c | HIGH |
| sub_4190B0 | reset_attribute_processing_state | attribute.c | HIGH |
| sub_6B5E50 | process_nv_register_params / attribute registration | nv_transforms.c | HIGH |
| sub_6BC890 | nv_validate_cuda_attributes | nv_transforms.c | VERY HIGH |
Cross-References
- __global__ Function Constraints -- detailed validation rules for __global__
- Launch Configuration Attributes -- __launch_bounds__, __cluster_dims__, __block_size__
- __grid_constant__ -- grid-constant parameter attribute
- __managed__ Variables -- managed memory attribute
- Minor CUDA Attributes -- __noinline__, __forceinline__, __nv_register_params__, __nv_pure__
- Entity Node Layout -- full entity structure with CUDA field offsets
- CUDA Execution Spaces -- how execution space bits drive code generation
- CUDA Memory Spaces -- memory space bitfield semantics
__global__ Function Constraints
The __global__ attribute designates a CUDA kernel -- a function that executes on the GPU and is callable from host code via the <<<...>>> launch syntax. Of all CUDA execution space attributes, __global__ imposes the most constraints. cudafe++ enforces these constraints across three separate validation passes: attribute application (when __global__ is first applied to an entity), post-declaration validation (after all attributes on a declaration are resolved), and semantic analysis (during template instantiation, redeclaration merging, and lambda processing). This page documents all constraint checks, their implementation in the binary, the entity node fields they inspect, and the diagnostics they emit.
Key Facts
| Property | Value |
|---|---|
| Source files | attribute.c (apply handler), nv_transforms.c (post-validation), class_decl.c (redeclaration, lambda), decls.c (template packs) |
| Apply handler (variant 1) | sub_40E1F0 (89 lines) |
| Apply handler (variant 2) | sub_40E7F0 (86 lines) |
| Post-validation | sub_6BC890 (nv_validate_cuda_attributes, 161 lines) |
| Attribute kind byte | 0x58 = 'X' |
| OR mask applied | entity+182 \|= 0x61 (bits 0 + 5 + 6) |
| HD combined flag | entity+182 \|= 0x80 (set at the end of the handler whenever bit 6 is set, which is always the case once 0x61 has been applied) |
| Total constraint checks | 37 distinct error conditions |
| Entity fields read | +81, +144, +148, +152, +166, +176, +179, +182, +183, +184, +191 |
| Relaxed mode flag | dword_106BFF0 (suppresses certain conflict checks) |
| main() entity pointer | qword_126EB70 (compared to detect __global__ main) |
Two Variants of apply_nv_global_attr
Two nearly identical functions implement the __global__ application logic. Both perform the same 11 validation checks and apply the same 0x61 bitmask. The difference is purely structural: sub_40E1F0 uses a for loop with a null-terminated break for the parameter default-init iteration, while sub_40E7F0 uses a do-while loop with an explicit null check and early return. Both exist because EDG's attribute subsystem may route through different call paths depending on whether the attribute appears on a declaration or a definition.
// Pseudocode for apply_nv_global_attr (sub_40E1F0 / sub_40E7F0)
// a1: attribute node, a2: entity node, a3: target kind
entity_t* apply_nv_global_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {
    // Gate: only applies to functions (kind 11)
    if (a3 != 11)
        return a2;
    // ---- Phase 1: Linkage / constexpr lambda check ----
    // Bits 47 and 24 of the 48-bit field at +184
    if ((a2->qword_184 & 0x800001000000) == 0x800000000000) {
        // Constexpr lambda with internal linkage but no local flag
        char* name = get_entity_display_name(a2, 0);  // sub_6BC6B0
        emit_error(3469, a1->src_loc, "__global__", name);
        return a2;  // bail out, do not apply __global__
    }
    // ---- Phase 2: Structural constraints ----
    // 2a. Static member function check
    if ((signed char)a2->byte_176 < 0 && !(a2->byte_81 & 0x04))
        emit_warning(3507, a1->src_loc, "__global__");  // severity 5
    // 2b. operator() check
    if (a2->byte_166 == 5)
        emit_error(3644, a1->src_loc);  // severity 7
    // 2c. Exception specification check (uses return type chain)
    type_t* ret = a2->type_chain;        // entity+144
    while (ret->kind == 12)              // skip cv-qualifier wrappers
        ret = ret->referenced;           // type+144
    if (ret->prototype->exception_spec)  // proto+152 -> +56
        emit_error(3647, a1->src_loc);   // auto/decltype(auto) return
    // 2d. Execution space conflict
    uint8_t es = a2->byte_182;
    if (!relaxed_mode && (es & 0x60) == 0x20)  // already __device__ only
        emit_error(3481, a1->src_loc);
    if (es & 0x10)                             // already __host__ explicit
        emit_error(3481, a1->src_loc);
    // 2e. Return type must be void
    if (!(a2->byte_179 & 0x10)) {        // not constexpr
        if (a2->byte_191 & 0x01)         // is lambda
            emit_error(3506, a1->src_loc);
        else {
            type_t* base = skip_typedefs(a2->type_chain);  // sub_7A68F0
            if (!is_void_type(base->referenced))           // sub_7A6E90
                emit_error(3505, a1->src_loc);
        }
    }
    // 2f. Variadic (ellipsis) check
    type_t* proto_type = a2->type_chain;  // +144
    while (proto_type->kind == 12)
        proto_type = proto_type->referenced;
    if (proto_type->prototype->flags_16 & 0x01)  // bit 0 of proto+16
        emit_error(3503, a1->src_loc);
    // ---- Phase 3: Apply the bitmask ----
    a2->byte_182 |= 0x61;  // device_capable + device_annotation + global_kernel
    // ---- Phase 4: Additional checks (after bitmask set) ----
    // 4a. Local function (constexpr local)
    if (a2->byte_81 & 0x04)
        emit_error(3688, a1->src_loc);
    // 4b. main() function check
    if (a2 == main_entity && (a2->byte_182 & 0x20))
        emit_error(3538, a1->src_loc);
    // ---- Phase 5: Parameter iteration (__grid_constant__ warning) ----
    if (a1->flags & 0x01) {  // attr_node+11 bit 0: applies to parameters
        // Walk parameter list from prototype
        proto_type = a2->type_chain;
        while (proto_type->kind == 12)
            proto_type = proto_type->referenced;
        param_t* param = *proto_type->prototype->param_list;  // deref +152
        source_loc_t loc = a1->src_loc;                       // +56
        for (; param != NULL; param = param->next) {
            // Peel cv-qualifier wrappers
            type_t* ptype = param->type;  // param[1]
            while (ptype->kind == 12)
                ptype = ptype->referenced;
            // Check: is type a __grid_constant__ candidate?
            if (!has_grid_constant_flag(ptype) && scope_index == -1) {
                // sub_7A6B60: checks byte+133 bit 5 (0x20)
                // scope records are 784 bytes each
                scope_t* scope = (scope_t*)(scope_table_base + 784 * scope_table_index);
                if ((scope->flags_6 & 0x06) == 0 && scope->kind_4 != 12) {
                    type_t* ptype2 = param->type;
                    while (ptype2->kind == 12)
                        ptype2 = ptype2->referenced;
                    if (!ptype2->default_init)  // type+120 == NULL
                        emit_error(3669, &loc);
                }
            }
        }
    }
    // ---- Phase 6: HD combined flag ----
    if (a2->byte_182 & 0x40)   // __global__ now set
        a2->byte_182 |= 0x80;  // mark as combined HD
    return a2;
}
Execution Order Detail
The 0x61 bitmask is applied before the local-function (3688) and main() (3538) checks but after all structural checks (3507, 3644, 3647, 3481, 3505/3506, 3503). This means the bitmask is set even when errors are emitted -- cudafe++ continues processing after errors to collect as many diagnostics as possible in a single compilation pass.
The constexpr-lambda check at the top (error 3469) is the only check that causes an early return. If the function is a constexpr lambda with wrong linkage, the bitmask is NOT set and no further validation is performed.
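The ordering can be modeled in a few lines. This is a simplified sketch under stated assumptions: `entity_model` and `apply_global_model` are hypothetical, the `returns_void` field replaces the real return-type walk, and only one structural check (3505) and one post-mask check (3688) are represented.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entity model with only the fields this sketch needs. */
typedef struct {
    uint64_t qword_184;     /* linkage/template flags (low 48 bits used) */
    uint8_t  byte_81;       /* bit 2: local (block-scope) function */
    uint8_t  byte_182;      /* execution space byte */
    int      returns_void;  /* stand-in for the real return-type walk */
} entity_model;

/* Records diagnostics in out[] and returns how many were recorded. */
static size_t apply_global_model(entity_model *e, int out[], size_t cap)
{
    size_t n = 0;

    /* Only early return: bit 47 set and bit 24 clear. */
    if ((e->qword_184 & 0x800001000000ULL) == 0x800000000000ULL) {
        if (n < cap)
            out[n++] = 3469;
        return n;                     /* bitmask is NOT applied */
    }
    /* Structural check before the mask: reported, but not fatal. */
    if (!e->returns_void && n < cap)
        out[n++] = 3505;
    /* The mask is applied even if errors were recorded above. */
    e->byte_182 |= 0x61;
    /* Checks that run after the mask. */
    if ((e->byte_81 & 0x04) && n < cap)
        out[n++] = 3688;
    /* Bit 6 was just set, so this branch is always taken. */
    if (e->byte_182 & 0x40)
        e->byte_182 |= 0x80;
    return n;
}
```

A non-void local function therefore ends up with two diagnostics and an execution space byte of 0xE1, while an entity that trips the 3469 check keeps its byte untouched.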
Validation Error Catalog
The 37 validation errors are organized by the phase in which they are checked and by semantic category. Error codes below are cudafe++ internal diagnostic numbers; severity values match the sub_4F41C0 severity parameter (5 = warning, 7 = error, 8 = hard error).
Category 1: Return Type
| Error | Severity | Check | Message |
|---|---|---|---|
| 3505 | 7 | !is_void_type(skip_typedefs(entity+144)->referenced) | a __global__ function must have a void return type |
| 3506 | 7 | entity+191 & 0x01 (lambda) and non-void | a __global__ function must not have a deduced return type |
| 3647 | 7 | entity+152 -> +56 != NULL (exception spec present on return proto) | auto/decltype(auto) deduced return type |
Errors 3505 and 3506 are mutually exclusive paths guarded by the constexpr flag (byte+179 & 0x10); both are skipped when the function is constexpr. Otherwise the handler checks whether the entity is a lambda (3506 path, byte+191 bit 0) or a regular function (3505 path, which resolves the type through skip_typedefs, sub_7A68F0, and tests the result with is_void_type, sub_7A6E90). skip_typedefs follows the type chain while type->kind == 12 (cv-qualifier wrapper) and (type->byte_161 & 0x7F) == 0 (no qualifier flags). is_void_type follows the same chain and returns whether kind == 1 (void).
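The chain walk described above can be sketched with a three-field type node. `type_model`, `skip_wrappers`, and `is_void_model` are hypothetical names, and the NULL guards are an addition for safety in this sketch rather than something observed in the binary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { KIND_VOID = 1, KIND_WRAPPER = 12 };

/* Hypothetical type node: only the fields the walk inspects. */
typedef struct type_model {
    uint8_t kind;                   /* 1 = void, 12 = wrapper */
    uint8_t byte_161;               /* low 7 bits: qualifier flags */
    struct type_model *referenced;  /* next type in the chain */
} type_model;

/* Follows the chain while the node is a wrapper with no qualifier flags. */
static const type_model *skip_wrappers(const type_model *t)
{
    while (t != NULL && t->kind == KIND_WRAPPER && (t->byte_161 & 0x7F) == 0)
        t = t->referenced;
    return t;
}

/* True when the chain bottoms out at a void node. */
static bool is_void_model(const type_model *t)
{
    t = skip_wrappers(t);
    return t != NULL && t->kind == KIND_VOID;
}
```

The explicit parentheses around `byte_161 & 0x7F` matter in C, because `==` binds more tightly than `&`.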
Error 3647 is checked independently of 3505/3506. The check examines the exception specification pointer at prototype offset +56. In EDG's type system, auto and decltype(auto) return types are represented with a non-null exception specification node on the return type's prototype -- this is a repurposed field that indicates the return type is deduced.
Category 2: Parameters
| Error | Severity | Check | Message |
|---|---|---|---|
| 3503 | 8 | proto+16 & 0x01 (has ellipsis) | a __global__ function cannot have ellipsis |
| 3702 | 7 | param_flags & 0x02 (rvalue ref) | a __global__ function cannot have a parameter with rvalue reference type |
| -- | 7 | Parameter with __restrict__ on reference type | a __global__ function cannot have a parameter with __restrict__ qualified reference type |
| -- | 7 | Parameter of type va_list | A __global__ function or function template cannot have a parameter with va_list type |
| -- | 7 | Parameter of type std::initializer_list | a __global__ function or function template cannot have a parameter with type std::initializer_list |
| -- | 7 | Oversized alignment on win32 | cannot pass a parameter with a too large explicit alignment to a __global__ function on win32 platforms |
| 3669 | 8 | Device-scope parameter without default init | __grid_constant__ parameter warning (device-side check) |
Error 3503 (ellipsis) is checked in the apply handler by testing bit 0 of the function prototype's flags word at offset +16. This bit indicates the parameter list ends with ....
Error 3702 (rvalue reference) is checked in the post-validation pass (sub_6BC890), not in the apply handler. The post-validator walks the parameter list and checks byte offset +32 (bit 1) of each parameter node.
The __restrict__ reference, va_list, initializer_list, and win32 alignment checks are scattered across separate validation functions in nv_transforms.c and are triggered during declaration processing rather than during attribute application.
Error 3669 is checked in the apply handler's parameter iteration loop. It walks each parameter, resolves through cv-qualifier wrappers, and tests whether sub_7A6B60 returns false (meaning the parameter type has bit 5 of byte+133 clear -- not a __grid_constant__ type) AND the scope lookup produces a non-array, non-qualifier type without a default initializer at type+120.
Category 3: Modifiers
| Error | Severity | Check | Message |
|---|---|---|---|
| 3507 | 5 | (signed char)byte_176 < 0 && !(byte_81 & 0x04) | A __global__ function or function template cannot be marked constexpr (warning for static member) |
| 3688 | 8 | byte_81 & 0x04 (local function) | A __global__ function or function template cannot be marked constexpr (constexpr local) |
| 3481 | 8 | Execution space conflict (see matrix) | Conflicting CUDA execution spaces |
| -- | 7 | Function is consteval | A __global__ function or function template cannot be marked consteval |
| 3644 | 7 | byte_166 == 5 (operator function kind) | An operator function cannot be a __global__ function |
| -- | 7 | Defined in friend declaration | A __global__ function or function template cannot be defined in a friend declaration |
| -- | 7 | Exception specification present | An exception specification is not allowed for a __global__ function or function template |
| -- | 7 | Declared in inline unnamed namespace | A __global__ function or function template cannot be declared within an inline unnamed namespace |
| 3538 | 7 | a2 == qword_126EB70 (is main()) | function main cannot be marked __device__ or __global__ |
Error 3507 deserves special attention. The decompiled code shows:
if ((signed char)a2->byte_176 < 0 && !(a2->byte_81 & 0x04))
    emit_warning(3507, ...);
The signed char cast is equivalent to testing byte_176 >= 0x80, i.e. bit 7, which marks a static member function. The !(byte_81 & 0x04) condition excludes local functions. The diagnostic is emitted through sub_4F8DB0 with severity 5, so it is a warning rather than an error: NVIDIA chose to warn about __global__ on static members instead of rejecting it, even though the official documentation says it is not allowed. The displayed string is "A __global__ function or function template cannot be marked constexpr", with "__global__" passed as the attribute name parameter, although the field being checked indicates that the actual condition is "static member function".
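The equivalence between the sign test and the bit-7 test can be confirmed exhaustively. In this sketch the function names are hypothetical; the conversion of values above 127 to `signed char` is implementation-defined in older C standards, so the equivalence is assumed for two's-complement targets such as x86-64, which is what the decompiled binary targets.

```c
#include <stdbool.h>
#include <stdint.h>

/* The form Hex-Rays produces: reinterpret the byte as signed. */
static bool bit7_via_sign(uint8_t b) { return (signed char)b < 0; }

/* The equivalent explicit mask test. */
static bool bit7_via_mask(uint8_t b) { return (b & 0x80) != 0; }

/* Full 3507 condition: static member (bit 7 of +176) and not local (bit 2 of +81). */
static bool warns_3507(uint8_t byte_176, uint8_t byte_81)
{
    return bit7_via_sign(byte_176) && !(byte_81 & 0x04);
}
```

When reading decompiler output, any `(signed char)x < 0` or `(char)x < 0` comparison on a flags byte is therefore best read as `x & 0x80`.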
Error 3644 checks entity+166 == 5. This field stores the "operator function kind" enum value, where 5 corresponds to operator(). This prevents lambda call operators or functors from being directly marked __global__.
Error 3688 is checked after the bitmask is set (byte_182 |= 0x61). It tests byte_81 & 0x04, which indicates a local (block-scope) function. The handler emits with severity 8 (via sub_4F81B0, hard error).
Error 3538 compares the entity pointer against qword_126EB70, which holds the entity pointer for main() (set during initial declaration processing). The condition also requires byte_182 & 0x20 (device annotation bit set), which is always true after |= 0x61.
Category 4: Template Constraints
| Error | Severity | Check | Message |
|---|---|---|---|
| -- | 7 | Pack parameter is not last template parameter | Pack template parameter must be the last template parameter for a variadic __global__ function template |
| -- | 7 | Multiple pack parameters | Multiple pack parameters are not allowed for a variadic __global__ function template |
These checks are performed during template declaration processing in decls.c, not in the apply handler. They constrain variadic __global__ function templates: CUDA requires that pack parameters appear last (so the runtime can enumerate kernel arguments), and only a single pack is permitted (the CUDA launch infrastructure cannot handle multiple parameter packs).
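The two pack rules reduce to a scan over the template parameter list. The sketch below is illustrative only: `pack_check_result` and `check_kernel_template_packs` are hypothetical, the parameter list is modeled as an array of flags, and reporting the multiple-pack error before the not-last error is an assumption about ordering that the binary analysis above does not establish.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { PACKS_OK, PACK_NOT_LAST, MULTIPLE_PACKS } pack_check_result;

/* is_pack[i] is true when template parameter i is a parameter pack. */
static pack_check_result check_kernel_template_packs(const bool is_pack[], size_t count)
{
    size_t packs = 0;
    size_t last_pack = 0;

    for (size_t i = 0; i < count; ++i) {
        if (is_pack[i]) {
            ++packs;
            last_pack = i;
        }
    }
    if (packs > 1)
        return MULTIPLE_PACKS;
    if (packs == 1 && last_pack != count - 1)
        return PACK_NOT_LAST;
    return PACKS_OK;
}
```

For example, a parameter list shaped like `<typename T, typename... Ts>` passes, while `<typename... Ts, typename T>` yields `PACK_NOT_LAST`.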
Category 5: Redeclaration
| Error | Severity | Check | Message |
|---|---|---|---|
| -- | 7 | Previously __global__, now no execution space | a __global__ function(%no1) redeclared without __global__ |
| -- | 7 | Previously __global__, now __host__ | a __global__ function(%no1) redeclared with __host__ |
| -- | 7 | Previously __global__, now __device__ | a __global__ function(%no1) redeclared with __device__ |
| -- | 7 | Previously __global__, now __host__ __device__ | a __global__ function(%no1) redeclared with __host__ __device__ |
These four variants are mirrored by three diagnostics for the reverse direction:
- a __device__ function(%no1) redeclared with __global__
- a __host__ function(%no1) redeclared with __global__
- a __host__ __device__ function(%no1) redeclared with __global__
Redeclaration checks occur during declaration merging in class_decl.c. When a function is redeclared and the execution space of the new declaration does not match the original, cudafe++ emits one of these errors. The %no1 format specifier inserts the function name. These checks run independently of the apply_nv_global_attr handler -- they operate on the merged entity after both attribute sets have been processed.
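The seven redeclaration messages form a small lookup keyed on the previous and new execution space. The sketch below is a hypothetical model: `exec_space_model` and `redecl_message` are invented for illustration, the real code works on the merged entity's bitfields rather than an enum, and pairs that do not involve `__global__` are outside its scope.

```c
#include <stddef.h>

typedef enum { ES_NONE, ES_HOST, ES_DEVICE, ES_HOST_DEVICE, ES_GLOBAL } exec_space_model;

/* Returns the diagnostic format string for a mismatching redeclaration
   that involves __global__, or NULL when none of the seven applies. */
static const char *redecl_message(exec_space_model prev, exec_space_model now)
{
    if (prev == now)
        return NULL;
    if (prev == ES_GLOBAL) {
        switch (now) {
        case ES_NONE:        return "a __global__ function(%no1) redeclared without __global__";
        case ES_HOST:        return "a __global__ function(%no1) redeclared with __host__";
        case ES_DEVICE:      return "a __global__ function(%no1) redeclared with __device__";
        case ES_HOST_DEVICE: return "a __global__ function(%no1) redeclared with __host__ __device__";
        default:             return NULL;
        }
    }
    if (now == ES_GLOBAL) {
        switch (prev) {
        case ES_DEVICE:      return "a __device__ function(%no1) redeclared with __global__";
        case ES_HOST:        return "a __host__ function(%no1) redeclared with __global__";
        case ES_HOST_DEVICE: return "a __host__ __device__ function(%no1) redeclared with __global__";
        default:             return NULL;
        }
    }
    return NULL;
}
```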
Category 6: Constexpr Lambda Linkage
| Error | Severity | Check | Message |
|---|---|---|---|
| 3469 | 5 | (qword_184 & 0x800001000000) == 0x800000000000 | __global__ on constexpr lambda with wrong linkage |
This is the first check in the apply handler and the only one that causes early return. The 48-bit field at entity+184 encodes template and linkage properties. Bit 47 (0x800000000000) indicates internal linkage or a similar constraint, while bit 24 (0x000001000000) indicates a local entity. When bit 47 is set but bit 24 is clear, the entity is a constexpr lambda that cannot legally receive __global__. The handler calls sub_6BC6B0 (get_entity_display_name) to format the entity name for the diagnostic message, then returns without setting the bitmask.
Category 7: Post-Validation (sub_6BC890)
These checks run after all attributes on a declaration have been applied, in the nv_validate_cuda_attributes function:
| Error | Severity | Check | Message |
|---|---|---|---|
| 3702 | 7 | Parameter with rvalue reference flag (bit 1 at param+32) | a __global__ function cannot have a parameter with rvalue reference type |
| 3661 | 7 | __nv_register_params__ on __global__ | __nv_register_params__ is not allowed on a __global__ function |
| 3534 | 7 | __launch_bounds__ on non-__global__ | %s attribute is not allowed on a non-__global__ function |
| 3707 | 7 | maxBlocksPerCluster < cluster product | total number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit |
| 3715 | 7 | __maxnreg__ on non-__global__ | __maxnreg__ is not allowed on a non-__global__ function |
| 3719 | 7 | Both __launch_bounds__ and __maxnreg__ | __launch_bounds__ and __maxnreg__ may not be used on the same declaration |
| 3695 | 4 | __global__ without __launch_bounds__ | no __launch_bounds__ specified for __global__ function (warning) |
Error 3695 is a severity-4 diagnostic (informational warning). It fires when a __global__ function has no associated launch configuration, encouraging developers to specify __launch_bounds__ for optimal register allocation. It is the lowest-severity diagnostic in this catalog and is purely advisory; the only other non-error diagnostics are the severity-5 entries 3469 and 3507.
Entity Node Field Reference
The apply handler reads and writes specific fields within the entity node. Complete field semantics:
| Offset | Size | Field Name | Role in __global__ Validation |
|---|---|---|---|
| +81 | 1 byte | local_flags | Bit 2 (0x04): function is local (block-scope). Checked for 3688 and as exemption for 3507. |
| +144 | 8 bytes | type_chain | Pointer to return type. Followed through kind==12 cv-qualifier wrappers. |
| +152 | 8 bytes | prototype | Function prototype pointer. At prototype+16: flags (bit 0 = ellipsis). At prototype+56: exception spec pointer. At prototype+0: parameter list head (double deref for first param). |
| +166 | 1 byte | operator_kind | Value 5 = operator(). Checked for 3644. |
| +176 | 1 byte | member_flags | Bit 7 (0x80, checked as signed char < 0): static member function. Checked for 3507. |
| +179 | 1 byte | constexpr_flags | Bit 4 (0x10): function is constexpr. Guards 3505/3506 check (skipped if constexpr). |
| +182 | 1 byte | execution_space | The primary execution space bitfield. \|= 0x61 sets global kernel. Read for conflict checks (0x60, 0x10 masks). |
| +183 | 1 byte | extended_cuda | Bit 3 (0x08): __nv_register_params__. Checked in post-validation. Bit 6 (0x40): __cluster_dims__ set. |
| +184 | 8 bytes | linkage_template | 48-bit field encoding template/linkage flags. Only lower 48 bits used; mask 0x800001000000 checks constexpr lambda linkage. |
| +191 | 1 byte | lambda_flags | Bit 0 (0x01): entity is a lambda. Routes to 3506 instead of 3505 for void-return check. |
| +256 | 8 bytes | launch_config | Pointer to launch configuration struct (56 bytes). NULL if no launch attributes applied. Read in post-validation. |
The 0x61 Bitmask
The OR mask 0x61 sets three bits in the execution space byte:
0x61 = 0b01100001
bit 0 (0x01): device_capable -- function can run on device
bit 5 (0x20): device_annotation -- has explicit device-side annotation
bit 6 (0x40): global_kernel -- function is a __global__ kernel
Bit 0 is shared with __device__ (0x23) and __host__ (0x15). It serves as a "has CUDA annotation" predicate -- any entity with bit 0 set has been explicitly annotated with at least one execution space keyword. This enables fast if (byte_182 & 0x01) checks throughout the codebase.
Bit 5 is shared with __device__. A __global__ function is considered device-annotated because kernel code executes on the GPU.
Bit 6 is unique to __global__. The mask byte_182 & 0x40 is the canonical predicate for "is this a kernel function?" used in dozens of locations throughout the binary.
HD Combined Flag (0x80)
After setting 0x61, the handler checks whether bit 6 (0x40, global kernel) is now set. If so, it ORs 0x80 into the byte. This bit means "combined host+device" and is set as a secondary effect. The logic at the end of the function:
if (a2->byte_182 & 0x40)   // just set via |= 0x61, so always true after apply
    a2->byte_182 |= 0x80;
This means every __global__ function ends up with byte_182 & 0x80 set, which marks it as "combined" in the execution space classification. This is semantically correct: a kernel has both a host-side stub (for launching) and device-side code (for execution).
Parameter Iteration for __grid_constant__
The final section of the apply handler iterates the function's parameter list to check for parameters that should be annotated __grid_constant__. This check only runs when attr_node->flags bit 0 (a1+11 & 0x01) is set, indicating the attribute application context includes parameter-level processing.
The iteration follows this structure:
// Navigate to function prototype
type_t* proto_type = entity->type_chain;  // +144
while (proto_type->kind == 12)            // skip cv-qualifiers
    proto_type = proto_type->referenced;  // +144
// Get parameter list head (double dereference)
param_t** param_list = proto_type->prototype->param_head;  // proto+152 -> deref
param_t* param = *param_list;                              // deref again
for (; param != NULL; param = param->next) {
    // Navigate to unqualified parameter type
    type_t* ptype = param->type;  // param[1], i.e. offset 8
    while (ptype->kind == 12)
        ptype = ptype->referenced;
    // sub_7A6B60: checks byte+133 bit 5 (0x20) -- "has __grid_constant__"
    bool has_gc = (ptype->byte_133 & 0x20) != 0;
    if (!has_gc && dword_126C5C4 == -1) {
        // Scope table lookup: 784-byte records
        uint8_t* scope = (uint8_t*)(qword_126C5E8 + 784 * dword_126C5E4);
        uint8_t scope_flags = scope[6];
        uint8_t scope_kind = scope[4];
        // Skip if scope has qualifier flags or is a cv-qualified scope
        if ((scope_flags & 0x06) == 0 && scope_kind != 12) {
            // Re-navigate to unqualified type
            type_t* ptype2 = param->type;
            while (ptype2->kind == 12)
                ptype2 = ptype2->referenced;
            // Check for default initializer
            if (ptype2->qword_120 == 0)
                emit_error(3669, &saved_source_loc);
        }
    }
}
The scope table lookup uses a 784-byte scope structure (at qword_126C5E8 indexed by dword_126C5E4) to determine whether the current context is device-side. The dword_126C5C4 == -1 check verifies we are in device compilation mode. This entire parameter iteration is a device-side warning mechanism: it alerts developers when a kernel parameter lacks a default initializer in a context where __grid_constant__ would be appropriate.
Post-Declaration Validation (sub_6BC890)
After all attributes on a declaration are applied, nv_validate_cuda_attributes (sub_6BC890, 161 lines) performs cross-attribute consistency checks. For __global__ functions, this function enforces:
Rvalue Reference Parameters (3702)
// Walk parameter list
type_t* ret = entity->type_chain;
while (ret->kind == 12)
ret = ret->referenced;
param_t* param = **((param_t***)ret + 19); // proto -> param list
while (param) {
if (param->byte_32 & 0x02) // rvalue reference flag
emit_error(3702, source_loc);
param = param->next;
}
This check scans all parameters for the rvalue reference flag (bit 1 at parameter node offset +32). Kernel functions cannot accept rvalue references because kernel launch involves copying arguments through the CUDA runtime, which does not support move semantics across the host-device boundary.
nv_register_params Conflict (3661)
if (entity->byte_183 & 0x08) { // __nv_register_params__ set
if (entity->byte_182 & 0x40)
emit_error(3661, ..., "__global__");
else if ((entity->byte_182 & 0x30) == 0x20)
emit_error(3661, ..., "__host__");
}
The __nv_register_params__ attribute (bit 3 of byte+183) is incompatible with __global__ because kernel parameter passing uses a fixed ABI that cannot be overridden.
Launch Configuration Without global (3534)
launch_config_t* lc = entity->launch_config; // +256
if (lc && !(entity->byte_182 & 0x40)) {
if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor)
emit_error(3534, ..., "__launch_bounds__");
}
The __launch_bounds__, __cluster_dims__, and __block_size__ attributes require __global__. If a non-kernel function has any of these, error 3534 fires.
Cluster Dimension Product Check (3707)
if (lc->cluster_dim_x > 0 && lc->maxBlocksPerCluster > 0) {
uint64_t product = lc->cluster_dim_x * lc->cluster_dim_y * lc->cluster_dim_z;
if (lc->maxBlocksPerCluster < product)
emit_error(3707, ...);
}
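As a worked illustration of the product check, the following C sketch models only the fields involved. The struct `cluster_cfg` and the function `cluster_product_exceeds_limit` are hypothetical names chosen for this example, not names recovered from the binary. The sketch mirrors the decompiled condition above, widens the multiplication to 64 bits, and assumes that the y and z dimensions are positive whenever x is.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the launch config fields (names assumed). */
struct cluster_cfg {
    int32_t max_blocks_per_cluster; /* -1 = not set */
    int32_t dim_x, dim_y, dim_z;    /* -1 = not set */
};

/* Returns true when the condition guarding error 3707 would hold. */
static bool cluster_product_exceeds_limit(const struct cluster_cfg *c)
{
    /* Same guard as the decompiled check: both values must be set. */
    if (c->dim_x <= 0 || c->max_blocks_per_cluster <= 0)
        return false;
    /* Assumption: dim_y and dim_z are positive whenever dim_x is. */
    uint64_t product = (uint64_t)c->dim_x * (uint64_t)c->dim_y
                     * (uint64_t)c->dim_z;
    return (uint64_t)c->max_blocks_per_cluster < product;
}
```

For instance, a 2 x 2 x 2 cluster has a product of 8, so it would conflict with a maxBlocksPerCluster of 4 but not with a value of 8.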
launch_bounds and maxnreg Conflict (3719)
if (lc->maxThreadsPerBlock && lc->maxnreg >= 0)
emit_error(3719, ..., "__launch_bounds__ and __maxnreg__");
These two attributes provide contradictory register pressure hints and cannot coexist.
Missing launch_bounds Warning (3695)
if ((entity->byte_182 & 0x40) &&
(!lc || (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor)))
emit_warning(3695);
Severity 4 (advisory). Encourages developers to annotate kernels with __launch_bounds__ for optimal register allocation.
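The three launch-bounds checks above (3534, 3719, 3695) can be summarized in one small model. This is a sketch under stated assumptions, not a transcription of sub_6BC890: the struct `lc_model`, the `DIAG_*` bit names, and the function `validate_launch_attrs` are all hypothetical, and the function returns a bitmask instead of calling the emit routines.

```c
#include <stddef.h>
#include <stdint.h>

enum { DIAG_3534 = 1u << 0, DIAG_3719 = 1u << 1, DIAG_3695 = 1u << 2 };

/* Hypothetical subset of launch_config_t (field names assumed). */
struct lc_model {
    int64_t max_threads_per_block; /* 0 = not specified */
    int64_t min_blocks_per_sm;     /* 0 = not specified */
    int32_t maxnreg;               /* -1 = not set */
};

/* is_kernel models (byte_182 & 0x40) != 0; lc may be NULL. */
static unsigned validate_launch_attrs(int is_kernel, const struct lc_model *lc)
{
    unsigned diags = 0;
    int has_bounds = lc != NULL
        && (lc->max_threads_per_block != 0 || lc->min_blocks_per_sm != 0);

    if (!is_kernel && has_bounds)
        diags |= DIAG_3534; /* launch bounds without __global__ */
    if (lc != NULL && lc->max_threads_per_block != 0 && lc->maxnreg >= 0)
        diags |= DIAG_3719; /* __launch_bounds__ together with __maxnreg__ */
    if (is_kernel && !has_bounds)
        diags |= DIAG_3695; /* advisory: kernel without launch bounds */
    return diags;
}
```

Returning a bitmask keeps the sketch free of side effects, so several diagnostics can be observed from a single call, which matches the pass's habit of reporting every problem it finds.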
Execution Space Conflict Matrix
When __global__ is applied to a function that already has an execution space annotation, the handler checks for conflicts using two conditions:
// Condition 1: already __device__ only (without relaxed mode)
if (!dword_106BFF0 && (byte_182 & 0x60) == 0x20)
error(3481);
// Condition 2: already __host__ explicit
if (byte_182 & 0x10)
error(3481);
| Current byte_182 | Applying __global__ | (byte & 0x60) == 0x20 | byte & 0x10 | Result |
|---|---|---|---|---|
| 0x00 (none) | \|= 0x61 -> 0x61 | false | false | accepted |
| 0x23 (__device__) | | true | false | error 3481 (unless relaxed) |
| 0x15 (__host__) | | false | true | error 3481 |
| 0x37 (__host__ __device__) | | true | true | error 3481 |
| 0x61 (__global__) | \|= 0x61 -> 0x61 | false | false | no conflict from these two conditions -- idempotent bitmask |
In relaxed mode (dword_106BFF0 != 0), the first condition is suppressed, allowing __device__ + __global__ combinations. The second condition (explicit __host__) is never relaxed.
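The two conditions are plain bit tests, so they can be checked by hand or with a few lines of C. The sketch below is illustrative only: `global_conflicts` and `apply_global_bits` are hypothetical helper names, and the `relaxed` parameter stands in for dword_106BFF0 != 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if applying __global__ would report error 3481 under the
   two conditions listed above. */
static bool global_conflicts(uint8_t byte_182, bool relaxed)
{
    bool device_only   = (byte_182 & 0x60) == 0x20; /* condition 1 */
    bool host_explicit = (byte_182 & 0x10) != 0;    /* condition 2 */
    return (!relaxed && device_only) || host_explicit;
}

/* Bits ORed in by the apply handler: 0x61 first, then 0x80 because
   bit 6 (0x40) is set by the first step. */
static uint8_t apply_global_bits(uint8_t byte_182)
{
    byte_182 |= 0x61;
    if (byte_182 & 0x40)
        byte_182 |= 0x80;
    return byte_182;
}
```

Keeping the relaxed-mode flag as an explicit parameter makes it visible that only the first condition depends on it.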
Helper Functions
| Address | Identity | Lines | Purpose |
|---|---|---|---|
| sub_6BC6B0 | get_entity_display_name | 49 | Formats entity name for diagnostic messages. Handles demangling, strips leading ::. |
| sub_7A68F0 | skip_typedefs | 19 | Follows type chain through kind==12 wrappers while byte_161 & 0x7F == 0. |
| sub_7A6E90 | is_void_type | 16 | Follows type chain through kind==12, returns kind == 1. |
| sub_7A6B60 | has_grid_constant_flag | 9 | Follows type chain through kind==12, returns byte_133 & 0x20. |
| sub_4F7510 | emit_error_with_names | 66 | Emits error with two string arguments (attribute name + entity name). |
| sub_4F8DB0 | emit_warning_with_name | 38 | Emits warning (severity 5) with one string argument. |
| sub_4F8200 | emit_error_basic | 10 | Emits error with severity + code + source location. |
| sub_4F81B0 | emit_error_minimal | 10 | Emits error (severity 8) with code + source location. |
| sub_4F8490 | emit_error_with_extra | 38 | Emits error with one supplementary argument. |
Additional global Constraints (Outside Apply Handler)
Beyond the apply handler and post-validation, several other subsystems enforce __global__-specific rules. These checks occur during template instantiation, lambda processing, and declaration merging:
Template Argument Type Restrictions
CUDA restricts which types can appear as template arguments in __global__ function template instantiations:
- Host-local types (defined inside a __host__ function) cannot be used
- Private/protected class members cannot be used (unless the class is local to a __device__/__global__ function)
- Unnamed types cannot be used (unless local to a __device__/__global__ function)
- Lambda closure types cannot be used (unless the lambda is defined in a __device__/__global__ function, or is an extended lambda with --extended-lambda)
- Texture/surface variables cannot be used as non-type template arguments
- Private/protected template template arguments from class scope cannot be used
Static Global Template Stub
In whole-program compilation mode (-rdc=false) with -static-global-template-stub=true:
- Extern __global__ function templates are not supported
- __global__ function template instantiations must have definitions in the current TU
Device-Side Restrictions
Functions marked __global__ (or __device__) are subject to additional restrictions during semantic analysis:
- address of label extension is not supported
- ASM operands may specify only one constraint letter
- Certain ASM constraint letters are forbidden
- Texture/surface variables cannot have their address taken or be indirected
- Anonymous union member variables at global/namespace scope cannot be directly accessed
- Function-scope static variables require a memory space specifier
- Dynamic initialization of function-scope static variables is not supported
Function Map
| Address | Identity | Lines | Source File |
|---|---|---|---|
| sub_40E1F0 | apply_nv_global_attr (variant 1) | 89 | attribute.c |
| sub_40E7F0 | apply_nv_global_attr (variant 2) | 86 | attribute.c |
| sub_6BC890 | nv_validate_cuda_attributes | 161 | nv_transforms.c |
| sub_6BC6B0 | get_entity_display_name | 49 | nv_transforms.c |
| sub_7A68F0 | skip_typedefs | 19 | types.c |
| sub_7A6E90 | is_void_type | 16 | types.c |
| sub_7A6B60 | has_grid_constant_flag | 9 | types.c |
| sub_4F7510 | emit_error_with_names | 66 | error.c |
| sub_4F8DB0 | emit_warning_with_name | 38 | error.c |
| sub_4F8200 | emit_error_basic | 10 | error.c |
| sub_4F81B0 | emit_error_minimal | 10 | error.c |
| sub_4F8490 | emit_error_with_extra | 38 | error.c |
| sub_413240 | apply_one_attribute (dispatch) | 585 | attribute.c |
Global Variables
| Global | Address | Purpose |
|---|---|---|
| dword_106BFF0 | 0x106BFF0 | Relaxed mode flag. When set, suppresses __device__ + __global__ conflict (3481). |
| qword_126EB70 | 0x126EB70 | Pointer to the entity node for main(). Compared during 3538 check. |
| dword_126C5C4 | 0x126C5C4 | Scope index sentinel (-1 = device compilation mode). Guards 3669 parameter check. |
| dword_126C5E4 | 0x126C5E4 | Current scope table index. |
| qword_126C5E8 | 0x126C5E8 | Scope table base pointer. Each entry is 784 bytes. |
Cross-References
- Execution Spaces -- bitfield layout, conflict matrix, virtual override checking
- Attribute System Overview -- dispatch table, attribute node structure, application pipeline
- grid_constant -- the parameter attribute that interacts with the 3669 check
- Launch Configuration Attributes -- __launch_bounds__, __cluster_dims__, __block_size__ (post-validation errors 3534, 3707, 3715, 3719, 3695)
- Entity Node Layout -- full byte map with all CUDA fields
- Kernel Stubs -- host-side stub generation triggered by byte_182 & 0x40
- CUDA Template Restrictions -- template argument type restrictions for __global__ instantiations
- Diagnostics Overview -- error emission functions and severity levels
Launch Configuration Attributes
cudafe++ supports five attributes that control CUDA kernel launch parameters: __launch_bounds__, __cluster_dims__, __block_size__, __maxnreg__, and __local_maxnreg__. All five store their values into a shared 56-byte launch configuration struct pointed to by entity+256. The struct is lazily allocated on first use by sub_5E52F0 and initialized with sentinel values (-1 for all int32 fields, 0 for the two leading int64 fields, flags cleared). Each attribute handler parses its arguments through a shared constant-expression evaluation pipeline (sub_461640 for value extraction, sub_461980 for sign checking), validates positivity and 32-bit range, then writes results into specific offsets of the struct. A post-declaration validation pass (sub_6BC890 in nv_transforms.c) enforces cross-attribute constraints: launch config attributes require __global__, the product of the cluster dimensions must not exceed the maxBlocksPerCluster value from __launch_bounds__, and __maxnreg__ is mutually exclusive with __launch_bounds__.
Key Facts
| Property | Value |
|---|---|
| Source files | attribute.c (apply handlers), nv_transforms.c (post-validation) |
| __launch_bounds__ handler | sub_411C80 (98 lines) |
| __cluster_dims__ handler | sub_4115F0 (145 lines) |
| __block_size__ handler | sub_4109E0 (265 lines) |
| __maxnreg__ handler | sub_410F70 (67 lines) |
| __local_maxnreg__ handler | sub_411090 (67 lines) |
| Post-validation | sub_6BC890 (nv_validate_cuda_attributes, 161 lines) |
| Struct allocator | sub_5E52F0 (42 lines) |
| Constant value extractor | sub_461640 (const_expr_get_value, 53 lines) |
| Constant sign checker | sub_461980 (const_expr_sign_compare, 97 lines) |
| Dependent-type check | sub_7BE9E0 (is_dependent_type) |
| Entity field | entity+256 -- pointer to launch_config_t (56 bytes, NULL if no launch attrs) |
| Entity extended flags | entity+183 bit 6 (0x40): cluster_dims intent (set by zero-argument __cluster_dims__) |
| Total error codes | 17 distinct diagnostics across all five attributes and post-validation |
Attribute Kind Codes
Each CUDA attribute carries a kind byte at attr_node+8. The five launch config attributes use these values from the attribute_display_name (sub_40A310) switch table:
| Kind | Hex | ASCII | Attribute | Handler |
|---|---|---|---|---|
| 92 | 0x5C | '\' | __launch_bounds__ | sub_411C80 |
| 93 | 0x5D | ']' | __maxnreg__ | sub_410F70 |
| 94 | 0x5E | '^' | __local_maxnreg__ | sub_411090 |
| 107 | 0x6B | 'k' | __cluster_dims__ | sub_4115F0 |
| 108 | 0x6C | 'l' | __block_size__ | sub_4109E0 |
Kinds 92--94 are part of the original dense block (86--95). Kinds 107 and 108 were added later for cluster/Hopper-era features, occupying gaps in the ASCII range.
Launch Configuration Struct Layout
The struct is allocated by sub_5E52F0 and returned with a 16-byte offset from the raw allocation base. All handlers access the struct through the pointer stored at entity+256. The allocator initializes all int32 fields to -1 (sentinel for "not set") and zeroes the two leading int64 fields and the flags byte.
struct launch_config_t { // 56 bytes (offsets from entity+256 pointer)
int64_t maxThreadsPerBlock; // +0 from __launch_bounds__ arg 1 (init: 0)
int64_t minBlocksPerMultiprocessor; // +8 from __launch_bounds__ arg 2 (init: 0)
int32_t maxBlocksPerCluster; // +16 from __launch_bounds__ arg 3 (init: -1)
int32_t cluster_dim_x; // +20 from __cluster_dims__ / __block_size__ (init: -1)
int32_t cluster_dim_y; // +24 from __cluster_dims__ / __block_size__ (init: -1)
int32_t cluster_dim_z; // +28 from __cluster_dims__ / __block_size__ (init: -1)
int32_t maxnreg; // +32 from __maxnreg__ (init: -1)
int32_t local_maxnreg; // +36 from __local_maxnreg__ (init: -1)
int32_t block_size_x; // +40 from __block_size__ (init: -1)
int32_t block_size_y; // +44 from __block_size__ (init: -1)
int32_t block_size_z; // +48 from __block_size__ (init: -1)
uint8_t flags; // +52 bit 0: cluster_dims_set
// bit 1: block_size_set
// +53..+55: padding
};
The struct packs integer fields of mixed widths. The first two fields (maxThreadsPerBlock and minBlocksPerMultiprocessor) are 64-bit to accommodate the full range of CUDA launch bounds values. The cluster dimensions, block sizes, and register counts are 32-bit because individual values cannot exceed hardware limits. The flags byte at offset +52 records which dimension-setting attributes have been applied, enabling mutual exclusion enforcement between __cluster_dims__ and __block_size__.
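The layout can be sanity-checked with a compile-time model. The struct below is a hypothetical mirror of the documented offsets; the name `launch_config_model` and its member names are assumptions for this example, and the three dimension triples are folded into arrays for brevity. It assumes a typical LP64 target such as x86-64 Linux.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the documented 56-byte layout. */
struct launch_config_model {
    int64_t max_threads_per_block;  /* +0  */
    int64_t min_blocks_per_sm;      /* +8  */
    int32_t max_blocks_per_cluster; /* +16 */
    int32_t cluster_dim[3];         /* +20, +24, +28 */
    int32_t maxnreg;                /* +32 */
    int32_t local_maxnreg;          /* +36 */
    int32_t block_size[3];          /* +40, +44, +48 */
    uint8_t flags;                  /* +52, followed by 3 padding bytes */
};

/* Compilation fails if the model drifts from the documented offsets. */
_Static_assert(offsetof(struct launch_config_model, max_blocks_per_cluster) == 16, "+16");
_Static_assert(offsetof(struct launch_config_model, cluster_dim) == 20, "+20");
_Static_assert(offsetof(struct launch_config_model, maxnreg) == 32, "+32");
_Static_assert(offsetof(struct launch_config_model, local_maxnreg) == 36, "+36");
_Static_assert(offsetof(struct launch_config_model, block_size) == 40, "+40");
_Static_assert(offsetof(struct launch_config_model, flags) == 52, "+52");
_Static_assert(sizeof(struct launch_config_model) == 56, "56 bytes");
```

The trailing padding falls out of the alignment of the 64-bit members, which is why the flags byte at +52 still yields a 56-byte struct.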
Allocator: sub_5E52F0
The allocator performs arena allocation via sub_6B7D60, then initializes every field:
// sub_5E52F0 -- allocate_launch_config
launch_config_t* allocate_launch_config() {
void* raw = arena_alloc(pool_id, launch_config_pool_size + 56);
char* base = pool_base + raw;
if (!abi_mode) { // dword_106BA08 == 0
++alloc_counter_prefix;
base += 8;
*(int64_t*)(base - 8) = 0; // 8-byte ABI prefix
}
++alloc_counter_main;
// Zero the int64 fields
*(int64_t*)(base + 0) = 0; // returned-16: header word
*(int64_t*)(base + 8) = 0; // returned-8: header word (internal flags byte written below)
*(int64_t*)(base + 16) = 0; // returned+0: maxThreadsPerBlock = 0
// Initialize all int32 fields to -1 (sentinel = "not set")
*(int32_t*)(base + 32) = -1; // returned+16: maxBlocksPerCluster
*(int32_t*)(base + 36) = -1; // returned+20: cluster_dim_x
*(int32_t*)(base + 40) = -1; // returned+24: cluster_dim_y
*(int32_t*)(base + 44) = -1; // returned+28: cluster_dim_z
*(int32_t*)(base + 48) = -1; // returned+32: maxnreg
*(int32_t*)(base + 52) = -1; // returned+36: local_maxnreg
*(int32_t*)(base + 56) = -1; // returned+40: block_size_x
*(int32_t*)(base + 60) = -1; // returned+44: block_size_y
*(int32_t*)(base + 64) = -1; // returned+48: block_size_z
base[68] &= 0xFC; // returned+52: clear flags bits 0 and 1
// Set internal flags byte combining ABI mode, device mode, marker
base[8] = (8 * (device_flag & 1)) & 0x7F
| (2 * (!abi_mode)) & 0x7E
| 1;
return (launch_config_t*)(base + 16); // return with 16-byte offset
}
The sentinel value -1 (0xFFFFFFFF as unsigned, -1 as signed) is semantically meaningful throughout: handlers and the post-validator test field >= 0 or field > 0 to determine whether a field has been set. A value of -1 always fails both tests, so unset fields are correctly treated as absent. The two leading int64 fields use 0 as their sentinel since they store __launch_bounds__ arguments where zero means "not specified."
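A minimal stand-in for the allocator shows the header offset and the sentinel convention together. This is a sketch, not the real routine: `calloc` replaces the arena, the nine int32 fields are folded into one array, and the names `lc_model`, `lc_alloc`, `lc_free`, and `lc_field_is_set` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

enum { LC_HEADER_BYTES = 16, LC_INT32_FIELDS = 9 };

struct lc_model {
    int64_t max_threads_per_block;   /* 0 = not specified */
    int64_t min_blocks_per_sm;       /* 0 = not specified */
    int32_t fields[LC_INT32_FIELDS]; /* +16..+48, -1 = not set */
    uint8_t flags;                   /* +52 */
};

/* Allocates header plus struct and returns a pointer past the header. */
static struct lc_model *lc_alloc(void)
{
    unsigned char *base = calloc(1, LC_HEADER_BYTES + sizeof(struct lc_model));
    if (base == NULL)
        return NULL;
    struct lc_model *lc = (struct lc_model *)(base + LC_HEADER_BYTES);
    for (int i = 0; i < LC_INT32_FIELDS; i++)
        lc->fields[i] = -1; /* sentinel: attribute not applied */
    return lc;
}

/* Undoes the header offset before releasing the block. */
static void lc_free(struct lc_model *lc)
{
    if (lc != NULL)
        free((unsigned char *)lc - LC_HEADER_BYTES);
}

/* The test used for the int32 fields: -1 fails, any stored value passes. */
static int lc_field_is_set(int32_t field)
{
    return field >= 0;
}
```

Because the returned pointer sits 16 bytes into the allocation, any code that releases or inspects the raw block has to subtract the same offset, as `lc_free` does here.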
Constant-Expression Evaluation Pipeline
All five attribute handlers share the same two-function pipeline for parsing attribute argument values from EDG's internal 128-bit constant representation.
sub_461980 -- const_expr_sign_compare
Compares a constant expression's value against a 64-bit threshold. Returns +1 if the expression value is greater, -1 if less, 0 if equal. The comparison operates on the 128-bit extended-precision value stored at offsets +152 through +166 (eight 16-bit words) of the expression node.
// sub_461980 -- const_expr_sign_compare(expr_node, threshold)
// Returns: +1 if expr > threshold, -1 if expr < threshold, 0 if equal
int32_t const_expr_sign_compare(expr_node_t* expr, int64_t threshold) {
// Decompose threshold into eight 16-bit words with sign extension
uint16_t thresh_words[8];
// ... sign-extension propagation through all 8 words ...
// Navigate to base type, skipping cv-qualifier wrappers (kind == 12)
type_t* type = expr->type_chain; // expr+112
while (type->kind_132 == 12)
type = type->referenced; // type+144
// Determine signedness from base type
bool is_signed = (type->kind_132 == 2
&& is_signed_type_table[type->subkind_144]);
if (is_signed && (expr->word_152 & 0x8000)) {
// Negative expression value
if (!(thresh_words[0] & 0x8000)) // thresh_words[0] = most-significant threshold word
return -1; // negative < non-negative
} else if (!is_signed) {
if (thresh_words[0] & 0x8000)
return 1; // non-negative > negative threshold
}
// Word-by-word comparison from most-significant to least
// expr+152 (MSW) through expr+166 (LSW) vs threshold words
const uint16_t* expr_words = (const uint16_t*)((const char*)expr + 152);
for (int i = 0; i < 8; i++) {
if (expr_words[i] > thresh_words[i]) return 1;
if (expr_words[i] < thresh_words[i]) return -1;
}
return 0; // equal
}
The handlers call const_expr_sign_compare(expr, 0) to check positivity:
- <= 0 means non-positive (used by __cluster_dims__, __block_size__, __maxnreg__, __local_maxnreg__)
- < 0 means strictly negative (used by __launch_bounds__ arg 3, where zero is allowed)
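The comparison itself can be sketched as a standalone function over eight 16-bit words stored most significant first. This is a cleaned-up model, not a transcription of sub_461980: the name `compare_words128` and the explicit `value_is_signed` parameter are assumptions, since the real routine derives signedness from the expression's type.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compares a 128-bit value (eight 16-bit words, most significant first)
   against a sign-extended 64-bit threshold. Returns +1, 0, or -1. */
static int compare_words128(const uint16_t value[8], bool value_is_signed,
                            int64_t threshold)
{
    uint16_t thresh[8];
    uint16_t fill = threshold < 0 ? 0xFFFF : 0x0000;
    uint64_t bits = (uint64_t)threshold;
    for (int i = 0; i < 4; i++) {
        thresh[i] = fill;                                  /* upper 64 bits */
        thresh[4 + i] = (uint16_t)(bits >> (48 - 16 * i)); /* lower 64 bits */
    }

    /* Different signs decide the result immediately. */
    bool value_neg = value_is_signed && (value[0] & 0x8000) != 0;
    bool thresh_neg = threshold < 0;
    if (value_neg != thresh_neg)
        return value_neg ? -1 : 1;

    /* Same sign: unsigned word order matches numeric order. */
    for (int i = 0; i < 8; i++) {
        if (value[i] > thresh[i]) return 1;
        if (value[i] < thresh[i]) return -1;
    }
    return 0;
}
```

With a threshold of 0, a result <= 0 corresponds to the strict positivity test and a result < 0 to the "zero allowed" test described above.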
sub_461640 -- const_expr_get_value
Extracts a uint64_t value from a constant expression node's 128-bit representation. Sets an overflow flag if the value does not fit in 64 bits (accounting for sign).
// sub_461640 -- const_expr_get_value(expr_node, *overflow_flag)
// Returns: uint64_t value; *overflow_flag = 1 if truncation occurred
uint64_t const_expr_get_value(expr_node_t* expr, int32_t* overflow) {
// Navigate to base type
type_t* type = expr->type_chain; // expr+112
while (type->kind_132 == 12)
type = type->referenced;
uint16_t sign_word = expr->word_152; // most-significant of 128-bit value
bool is_signed = (type->kind_132 == 2
&& is_signed_type_table[type->subkind_144]);
int16_t expected_high;
if (is_signed) {
*overflow = 0;
expected_high = -(sign_word >> 15); // -1 if negative, 0 if positive
} else {
*overflow = 0;
expected_high = 0;
}
// Verify that the upper 64 bits match the expected sign-extension pattern
bool has_overflow = (sign_word != (uint16_t)expected_high);
if (expr->word_154 != (uint16_t)expected_high) has_overflow = true;
if (expr->word_156 != (uint16_t)expected_high) has_overflow = true;
if (expr->word_158 != (uint16_t)expected_high) has_overflow = true;
// Reconstruct 64-bit value from the lower four 16-bit words
uint64_t result = ((uint64_t)expr->word_160 << 48)
| ((uint64_t)expr->word_162 << 32)
| ((uint64_t)expr->word_164 << 16)
| ((uint64_t)expr->word_166);
if (!is_signed) {
if (has_overflow) { *overflow = 1; }
return result;
}
// Signed: verify sign bit consistency
if (((uint16_t)expected_high) != (uint16_t)(result >> 63)
|| has_overflow
|| (int16_t)sign_word < 0) {
*overflow = 1;
}
return result;
}
The overflow flag is used by all handlers with a consistent check pattern:
int32_t overflow;
uint64_t val = const_expr_get_value(expr, &overflow);
if (overflow || val > 0x7FFFFFFF)
emit_error(OVERFLOW_ERROR_CODE, src_loc);
else
launch_config->field = (int32_t)val;
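The extraction and range check can be modeled end to end. The sketch below is a simplified model rather than a transcription of sub_461640: it reports overflow only when the upper 64 bits are not a plain extension of the lower half, and the names `words128_to_u64` and `store_int32_field` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns the low 64 bits of a 128-bit value stored as eight 16-bit words
   (most significant first). *overflow is set when the upper 64 bits are
   not zero (unsigned) or not a sign extension of the low half (signed). */
static uint64_t words128_to_u64(const uint16_t value[8], bool value_is_signed,
                                bool *overflow)
{
    uint64_t low = 0;
    for (int i = 4; i < 8; i++)
        low = (low << 16) | value[i];
    uint16_t expected = (value_is_signed && (low >> 63)) ? 0xFFFF : 0x0000;
    *overflow = false;
    for (int i = 0; i < 4; i++)
        if (value[i] != expected)
            *overflow = true;
    return low;
}

/* Mirrors the handlers' pattern: store only values that fit in int32. */
static bool store_int32_field(const uint16_t value[8], bool value_is_signed,
                              int32_t *field)
{
    bool overflow;
    uint64_t v = words128_to_u64(value, value_is_signed, &overflow);
    if (overflow || v > 0x7FFFFFFF)
        return false; /* a handler would emit its overflow diagnostic here */
    *field = (int32_t)v;
    return true;
}
```

Comparing the unsigned result against 0x7FFFFFFF also rejects negative inputs, because their low 64 bits have the top bit set.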
Template-Dependent Argument Bailout
Before evaluating constant expressions, all five handlers walk the attribute argument list checking for template-dependent types via sub_7BE9E0 (is_dependent_type). The walk follows a linked list of argument nodes (head at attr_node+32), where each node has:
| Offset | Field | Description |
|---|---|---|
| +0 | next | Next argument node in list |
| +10 | kind | Argument kind: 3 = type-qualified, 4 = expression, 5 = indirect expression |
| +32 | expr | Expression/type pointer (accessed as node[4] in decompiled code) |
If any argument has a dependent type, the handler returns immediately without modifying the entity. This defers attribute processing to template instantiation time, when concrete values are available:
// Common bailout pattern (appears in all 5 handlers)
arg_node_t* walk = *(arg_node_t**)(attr_node + 32);
while (walk) {
switch (walk->kind_10) {
case 3: // type-qualified argument
if (walk->expr[4]->kind_148 == 12) // cv-qualifier wrapper
return entity; // dependent -- bail
break;
case 4: // expression argument
if (is_dependent_type(walk->expr[4])) // sub_7BE9E0
return entity;
if (walk->kind_10 != 5)
break;
// fallthrough to case 5
case 5: // indirect expression
if (is_dependent_type(*(walk->expr[4])))
return entity;
break;
default:
break;
}
walk = walk->next;
}
// All args are concrete -- proceed with evaluation
launch_bounds (sub_411C80)
Syntax: __launch_bounds__(maxThreadsPerBlock [, minBlocksPerMultiprocessor [, maxBlocksPerCluster]])
Accepts 1 to 3 arguments. Registered at kind byte 0x5C ('\\').
// sub_411C80 -- apply_nv_launch_bounds_attr (attribute.c, 98 lines)
// a1: attribute node, a2: entity node
entity_t* apply_nv_launch_bounds(attr_node_t* attr, entity_t* entity) {
// ---- Error 3535: launch_bounds on local function ----
// Note: does NOT return early -- continues to store values
if (entity->byte_81 & 0x04)
emit_error_with_name(7, 3535, attr->src_loc, "__launch_bounds__");
// ---- Parse argument list ----
arg_list_t* args = attr->arg_list; // attr+32
if (!args)
return entity;
// ---- Allocate launch config if needed ----
launch_config_t* lc = entity->launch_config; // entity+256
if (!lc) {
lc = allocate_launch_config(); // sub_5E52F0
entity->launch_config = lc;
}
// ---- Arg 1: maxThreadsPerBlock (required, stored as int64) ----
// Copied directly from constant expression value -- no sign/overflow check
lc->maxThreadsPerBlock = args->const_value; // +0, int64
// ---- Arg 2: minBlocksPerMultiprocessor (optional, stored as int64) ----
arg_node_t* arg2_list = *args; // first child
if (!arg2_list)
return entity;
expr_node_t* arg2_expr = *arg2_list; // expression node
lc->minBlocksPerMultiprocessor = arg2_list[4]; // +8, int64, raw copy
// ---- Check for arg 3 existence ----
if (!arg2_expr)
goto process_arg3;
// ---- Template-dependent bailout for remaining args ----
arg_node_t* walk = *(arg_node_t**)(attr + 32);
if (!walk)
goto process_arg3;
// ... dependent type walk (same pattern as documented above) ...
// If any arg is dependent, return entity unchanged
process_arg3:
// ---- Arg 3: maxBlocksPerCluster (optional, int32, uses full pipeline) ----
expr_node_t* expr3 = arg2_expr->const_value; // 3rd arg expression
if (!expr3)
return entity;
if (const_expr_sign_compare(expr3, 0) < 0) {
// Error 3705: negative maxBlocksPerCluster
emit_error(7, 3705, attr->src_loc);
} else {
int32_t overflow;
uint64_t val = const_expr_get_value(expr3, &overflow);
if (overflow || val > 0x7FFFFFFF) {
// Error 3706: overflow
emit_error(7, 3706, attr->src_loc);
} else if (val != 0) {
lc->maxBlocksPerCluster = (int32_t)val; // +16
}
// val == 0: not stored, sentinel -1 remains (means "use default")
}
return entity;
}
Argument Semantics
| Arg | Field | Offset | Type | Validation | Description |
|---|---|---|---|---|---|
| 1 (required) | maxThreadsPerBlock | +0 | int64 | None -- raw copy | Maximum threads per block. Guides register allocation in ptxas. |
| 2 (optional) | minBlocksPerMultiprocessor | +8 | int64 | None -- raw copy | Minimum resident blocks per SM. Guides occupancy optimization. |
| 3 (optional) | maxBlocksPerCluster | +16 | int32 | sign_compare < 0 (3705), overflow (3706) | Maximum blocks per cluster (CUDA 11.8+). |
Critical Implementation Details
First two args bypass the sign/overflow pipeline. Arguments 1 and 2 are copied directly from the constant expression node's value field as 64-bit quantities. They do not pass through const_expr_sign_compare or const_expr_get_value. This means negative or excessively large values for maxThreadsPerBlock and minBlocksPerMultiprocessor are accepted at parse time -- downstream consumers (ptxas) are responsible for rejecting them.
Third argument uses the strict pipeline. Only argument 3 (maxBlocksPerCluster) passes through both const_expr_sign_compare and const_expr_get_value with the overflow check. This argument was added later (CUDA 11.8 cluster launch) and uses the newer, stricter validation pattern.
Zero is acceptable for arg 3. The sign check uses const_expr_sign_compare(expr, 0) < 0 (strictly negative), not <= 0. A zero value passes the sign check but is not written (else if (val != 0) guard), leaving the sentinel -1 in place. This means zero effectively means "use default."
Error 3535 does not abort. The local-function check fires but does NOT return early. Processing continues, arguments are stored, and the launch config struct is populated even after emitting the error. This is consistent with cudafe++'s design of collecting as many diagnostics as possible in a single compilation pass.
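The treatment of argument 3 reduces to three outcomes, which the following sketch makes explicit. It assumes the argument has already been reduced to a plain 64-bit integer; the function name `store_max_blocks_per_cluster` and the enum names are hypothetical.

```c
#include <stdint.h>

enum { ARG3_OK = 0, ARG3_NEGATIVE = 3705, ARG3_OVERFLOW = 3706 };

/* Returns ARG3_OK or the diagnostic number that would be emitted. */
static int store_max_blocks_per_cluster(int64_t value, int32_t *field)
{
    if (value < 0)
        return ARG3_NEGATIVE;    /* strictly negative only */
    if (value > 0x7FFFFFFF)
        return ARG3_OVERFLOW;
    if (value != 0)
        *field = (int32_t)value; /* zero leaves the -1 sentinel in place */
    return ARG3_OK;
}
```

Note that zero returns success without touching the field, which is how "use default" is represented.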
cluster_dims (sub_4115F0)
Syntax: __cluster_dims__(x [, y [, z]]) or __cluster_dims__()
Accepts 0 to 3 arguments. Missing dimensions default to 1. Sets flag bit 0 at +52. Registered at kind byte 0x6B ('k').
// sub_4115F0 -- apply_nv_cluster_dims_attr (attribute.c, 145 lines)
entity_t* apply_nv_cluster_dims(attr_node_t* attr, entity_t* entity) {
arg_list_t* args = attr->arg_list; // attr+32
// ---- No-argument form: set intent flag only ----
if (args->kind_10 == 0) { // no arguments present
entity->byte_183 |= 0x40; // cluster_dims intent flag
return entity;
}
// ---- Extract argument expressions (up to 3) ----
expr_node_t* expr_x = args->value;
arg_node_t* child1 = args->first_child;
expr_node_t* expr_y = child1 ? child1->value : NULL;
expr_node_t* expr_z = NULL;
if (child1 && child1->first_child)
expr_z = child1->first_child->value;
// ---- Template-dependent bailout ----
// ... same walk pattern as __launch_bounds__ ...
// ---- Allocate launch config if needed ----
launch_config_t* lc = entity->launch_config;
if (!lc) {
lc = allocate_launch_config();
entity->launch_config = lc;
}
// ---- Conflict check: __block_size__ already set cluster dims ----
if (lc->flags & 0x02) { // bit 1 = block_size_set
emit_error(7, 3791, attr->src_loc);
lc = entity->launch_config; // reload after error emit
}
// ---- Set cluster_dims flag ----
lc->flags |= 0x01; // bit 0 = cluster_dims_set
// ---- Arg 1: cluster_dim_x ----
if (!expr_x) {
lc->cluster_dim_x = 1; // +20, default
} else if (const_expr_sign_compare(expr_x, 0) <= 0) {
emit_error_with_name(7, 3685, attr->src_loc, "__cluster_dims__");
lc = entity->launch_config; // reload
} else {
int32_t overflow;
uint64_t val = const_expr_get_value(expr_x, &overflow);
if (overflow || val > 0x7FFFFFFF)
emit_error(7, 3686, attr->src_loc);
else
lc->cluster_dim_x = (int32_t)val;
}
// ---- Arg 2: cluster_dim_y (defaults to 1) ----
if (!expr_y) {
lc->cluster_dim_y = 1; // +24
} else {
// Same sign_compare/get_value/3685/3686 pattern
// Stores at lc->cluster_dim_y (+24)
}
// ---- Arg 3: cluster_dim_z (defaults to 1) ----
if (!expr_z) {
lc->cluster_dim_z = 1; // +28
} else {
// Same pattern, stores at lc->cluster_dim_z (+28)
}
return entity;
}
Key Observations
Zero-argument form. When __cluster_dims__() is called with no arguments, the handler does not allocate the launch config struct. It sets entity+183 |= 0x40 (the "cluster_dims intent" flag) and returns. This intent flag is checked during post-validation to detect __cluster_dims__ on non-__global__ functions (error 3534) even when no dimensions were specified.
Conflict check with block_size. Before storing dimensions, the handler checks lc->flags & 0x02 (bit 1 = block_size_set). If __block_size__ was already applied, error 3791 fires. Crucially, the handler does NOT return early after this error -- it continues to set the flag and attempt to store values. The reverse conflict (applying __block_size__ after __cluster_dims__) is checked in sub_4109E0 with the same error code, testing lc->flags & 0x01.
Strict positivity (zero rejected). All three dimensions use const_expr_sign_compare(expr, 0) <= 0, rejecting zero. Error 3685 fires with the attribute name "__cluster_dims__" as a format argument. Error 3686 fires for values exceeding 0x7FFFFFFF.
Defaults to 1. Unspecified dimensions default to 1, not 0. A cluster dimension of 1 means "no clustering in that dimension" -- the neutral value. The default is written explicitly (lc->cluster_dim_x = 1), overwriting the -1 sentinel from allocation.
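The per-dimension rule (default 1, zero rejected, 32-bit limit) fits in one helper. The sketch below assumes each argument is available as an optional 64-bit integer, with NULL standing for an omitted argument; `set_cluster_dim` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 on success or the diagnostic number the handler would emit.
   arg == NULL models an omitted dimension. */
static int set_cluster_dim(const int64_t *arg, int32_t *field)
{
    if (arg == NULL) {
        *field = 1;  /* omitted: explicit default, overwrites -1 */
        return 0;
    }
    if (*arg <= 0)
        return 3685; /* zero and negative values are rejected */
    if (*arg > 0x7FFFFFFF)
        return 3686; /* does not fit in int32 */
    *field = (int32_t)*arg;
    return 0;
}
```

On either error the field keeps whatever value it had, which for a freshly allocated struct is the -1 sentinel.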
block_size (sub_4109E0)
Syntax: __block_size__(bx [, by [, bz [, cx [, cy [, cz]]]]])
Accepts up to 6 arguments: three block dimensions followed by three optional cluster dimensions. Registered at kind byte 0x6C ('l'). At 265 lines, this is the largest launch config handler.
// sub_4109E0 -- apply_nv_block_size_attr (attribute.c, 265 lines)
entity_t* apply_nv_block_size(attr_node_t* attr, entity_t* entity) {
// ---- Parse up to 6 argument expressions ----
arg_list_t* args = attr->arg_list;
expr_node_t* block_x = args->value; // arg 1
expr_node_t* block_y = NULL; // arg 2
expr_node_t* block_z = NULL; // arg 3
expr_node_t* cluster_x = NULL; // arg 4
expr_node_t* cluster_y = NULL; // arg 5
expr_node_t* cluster_z = NULL; // arg 6
// ... linked-list traversal to extract args 2-6 ...
// ---- Template-dependent bailout ----
// ... same walk pattern ...
// ---- Allocate launch config ----
launch_config_t* lc = entity->launch_config;
if (!lc) {
lc = allocate_launch_config();
entity->launch_config = lc;
}
// ---- Block dimensions: args 1-3 ----
// Each uses: sign_compare <= 0 -> error 3788
// get_value overflow or > 0x7FFFFFFF -> error 3789
// else store at +40/+44/+48
// missing args default to 1
// block_size_x (+40):
if (!block_x)
lc->block_size_x = 1;
else
validate_positive_int32(block_x, &lc->block_size_x, 3788, 3789, attr);
// block_size_y (+44): same pattern, default 1
// block_size_z (+48): same pattern, default 1
// ---- Cluster dimensions: args 4-6 (only if arg 4 present) ----
if (!cluster_x) {
// No cluster dims from __block_size__
lc->flags &= ~0x02; // clear bit 1 temporarily
if (!(lc->flags & 0x01)) { // cluster_dims NOT already set
// Write default cluster dims
lc->cluster_dim_x = 1; // +20
lc->cluster_dim_y = 1; // +24
lc->cluster_dim_z = 1; // +28
}
return entity;
}
// ---- Conflict check: cluster_dims already set ----
if (lc->flags & 0x01) { // bit 0 = cluster_dims_set
emit_error(7, 3791, attr->src_loc);
lc = entity->launch_config;
}
// ---- Set block_size flag ----
lc->flags |= 0x02; // bit 1 = block_size_set
if (lc->flags & 0x01) // cluster_dims_set -> conflict, bail
return entity;
// ---- Parse cluster dims from args 4-6 ----
// Uses error 3788 for non-positive, 3789 for overflow
// (same codes as block dims, with "__block_size__" as attr name)
// Stores at +20/+24/+28, defaults to 1 if absent
return entity;
}
Key Observations
Dual-purpose attribute. __block_size__ combines block dimensions and cluster dimensions in a single attribute. Arguments 1-3 specify the thread block shape (stored at +40/+44/+48); arguments 4-6 specify the cluster shape (stored at +20/+24/+28). This is NVIDIA's older, combined syntax, compared to the newer separate __cluster_dims__ attribute.
Shared cluster fields. Both __block_size__ and __cluster_dims__ write to the same offsets (+20/+24/+28). The flags byte (bit 0 for cluster_dims, bit 1 for block_size) provides mutual exclusion via error 3791.
Block size fields are separate from launch_bounds. The block dimensions from __block_size__ go to +40/+44/+48, distinct from __launch_bounds__'s maxThreadsPerBlock at +0. The __block_size__ attribute specifies exact dimensions; __launch_bounds__ specifies an upper bound. Both can coexist on the same function.
Defaulting behavior when no cluster args. When only 3 arguments are provided (block dims only), the handler checks whether __cluster_dims__ was already applied (flags & 0x01). If not, it writes default cluster dims of (1, 1, 1) to +20/+24/+28. If __cluster_dims__ was already applied, it leaves the existing cluster dim values untouched.
Error 3788/3789. These are the __block_size__-specific equivalents of __cluster_dims__'s 3685/3686. Both use strict positivity (<= 0), rejecting zero.
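The interaction between the two flag bits can be modeled without the rest of the struct. The sketch below tracks only the flags byte; the function names and the enum constants are hypothetical, and each function returns whether error 3791 would be reported.

```c
#include <stdbool.h>
#include <stdint.h>

enum { CLUSTER_DIMS_SET = 0x01, BLOCK_SIZE_SET = 0x02 };

/* Models __cluster_dims__ with at least one argument. */
static bool apply_cluster_dims_flag(uint8_t *flags)
{
    bool conflict = (*flags & BLOCK_SIZE_SET) != 0;
    *flags |= CLUSTER_DIMS_SET; /* set even after the error */
    return conflict;
}

/* Models __block_size__; has_cluster_args is true when arg 4 is present. */
static bool apply_block_size_flag(uint8_t *flags, bool has_cluster_args)
{
    if (!has_cluster_args) {
        *flags &= (uint8_t)~BLOCK_SIZE_SET; /* block dims only: no claim */
        return false;
    }
    bool conflict = (*flags & CLUSTER_DIMS_SET) != 0;
    *flags |= BLOCK_SIZE_SET;
    return conflict;
}
```

In this model the conflict is reported by whichever attribute is applied second, so the diagnostic appears regardless of the order in which the two attributes are written.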
maxnreg (sub_410F70)
Syntax: __maxnreg__(N)
Accepts exactly 1 argument. Stores at launch_config+32. Registered at kind byte 0x5D (']').
// sub_410F70 -- apply_nv_maxnreg_attr (attribute.c, 67 lines)
entity_t* apply_nv_maxnreg(attr_node_t* attr, entity_t* entity) {
arg_list_t* args = attr->arg_list; // attr+32
if (!args)
return entity;
// ---- Template-dependent bailout ----
// ... same walk pattern ...
// ---- Allocate launch config ----
if (!entity->launch_config)
entity->launch_config = allocate_launch_config();
// ---- Parse the single argument ----
expr_node_t* expr = args->const_value; // argument expression
if (!expr)
return entity;
if (const_expr_sign_compare(expr, 0) <= 0) {
emit_error(7, 3717, attr->src_loc); // non-positive register count
} else {
int32_t overflow;
uint64_t val = const_expr_get_value(expr, &overflow);
if (overflow || val > 0x7FFFFFFF)
emit_error(7, 3718, attr->src_loc); // register count too large
else
entity->launch_config->maxnreg = (int32_t)val; // +32
}
return entity;
}
The maxnreg field defaults to -1 from the allocator. A value >= 0 in post-validation unambiguously means the attribute was applied with a valid value (since zero would be caught by the <= 0 check here, the minimum valid value is 1).
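The sentinel convention can be checked with a small model of the argument validation. This is a sketch under stated assumptions: `model_apply_maxnreg` is a hypothetical stand-in that takes an already-evaluated constant plus an overflow flag, in place of the real const_expr_sign_compare / const_expr_get_value pair. The error codes and the 0x7FFFFFFF bound are taken from the pseudocode above.

```c
#include <stdint.h>

enum { MAXNREG_UNSET = -1, ERR_NONPOSITIVE = 3717, ERR_TOO_LARGE = 3718 };

/* Hypothetical stand-in for the single-argument parse.
   `overflow` models the flag written by const_expr_get_value.
   Returns 0 on success or the diagnostic code that would be emitted. */
int model_apply_maxnreg(int32_t *maxnreg_field, int64_t value, int overflow) {
    if (value <= 0)
        return ERR_NONPOSITIVE;               /* field keeps its -1 sentinel */
    if (overflow || (uint64_t)value > 0x7FFFFFFFu)
        return ERR_TOO_LARGE;                 /* field keeps its -1 sentinel */
    *maxnreg_field = (int32_t)value;          /* launch_config+32 */
    return 0;
}
```

Because both error paths leave the field at -1, a later `maxnreg >= 0` test distinguishes "applied with a valid value" from "absent or rejected" without a separate flag bit.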
Post-Validation Conflict
The __maxnreg__ handler does not check for conflicts with __launch_bounds__ at application time. The mutual exclusion is enforced in post-validation (sub_6BC890), which emits error 3719 when both maxThreadsPerBlock != 0 and maxnreg >= 0. This design allows the apply handlers to be called in any order.
local_maxnreg (sub_411090)
Syntax: __local_maxnreg__(N)
Structurally identical to __maxnreg__. Stores at launch_config+36. Registered at kind byte 0x5E ('^').
// sub_411090 -- apply_nv_local_maxnreg_attr (attribute.c, 67 lines)
entity_t* apply_nv_local_maxnreg(attr_node_t* attr, entity_t* entity) {
// ... identical structure to __maxnreg__ ...
if (const_expr_sign_compare(expr, 0) <= 0) {
emit_error(7, 3786, attr->src_loc); // error 3786: non-positive
} else {
int32_t overflow;
uint64_t val = const_expr_get_value(expr, &overflow);
if (overflow || val > 0x7FFFFFFF)
emit_error(7, 3787, attr->src_loc); // error 3787: too large
else
entity->launch_config->local_maxnreg = (int32_t)val; // +36
}
return entity;
}
The __local_maxnreg__ attribute limits register usage within a specific device function scope rather than at the kernel level. It uses a separate struct field (+36 vs +32) so both can coexist. The post-validator does NOT check local_maxnreg for __global__-only enforcement -- __local_maxnreg__ is more permissive than __maxnreg__ and may appear on __device__ functions.
Post-Declaration Validation (sub_6BC890)
After all attributes on a declaration have been applied, nv_validate_cuda_attributes (sub_6BC890, 160 lines, in nv_transforms.c) performs cross-attribute consistency checks. This function is called from the declaration processing pipeline and operates on the completed entity node. Multiple errors can be emitted from a single validation pass -- cudafe++ does not short-circuit after the first error.
// sub_6BC890 -- nv_validate_cuda_attributes (nv_transforms.c, 160 lines)
// a1: entity pointer, a2: source location for diagnostics
void nv_validate_cuda_attributes(entity_t* entity, source_loc_t* loc) {
if (!entity || (entity->byte_177 & 0x10))
return; // null or suppressed entity
// ---- Phase 1: Parameter validation (rvalue refs, error 3702) ----
// Walks parameter list checking for rvalue reference flag
// [documented on __global__ page]
// ---- Phase 2: __nv_register_params__ check (error 3661) ----
// [documented on __global__ page]
// ---- Phase 3: Launch config attribute checks ----
launch_config_t* lc = entity->launch_config; // entity+256
uint8_t es = entity->byte_182; // execution space
if (!lc)
goto check_global_advisory;
if (es & 0x40) // is __global__
goto cross_attribute_checks;
// ==== Error 3534: launch config on non-__global__ ====
// 3534 for __launch_bounds__
if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor) {
emit_error_with_name(7, 3534, &global_loc, "__launch_bounds__");
lc = entity->launch_config; // reload after emit
}
// 3534 for __cluster_dims__ or __block_size__
if ((entity->byte_183 & 0x40) || lc->cluster_dim_x >= 0) {
const char* name = (lc->block_size_x > 0) ? "__block_size__"
: "__cluster_dims__";
emit_error_with_name(7, 3534, &global_loc, name);
lc = entity->launch_config;
if (!lc)
goto check_global_advisory;
}
cross_attribute_checks:
// ==== Error 3707: cluster size exceeds maxBlocksPerCluster ====
if (lc->cluster_dim_x > 0) {
if (lc->maxBlocksPerCluster > 0) {
uint64_t cluster_product = (int64_t)lc->cluster_dim_x
* (int64_t)lc->cluster_dim_y
* (int64_t)lc->cluster_dim_z;
if ((uint64_t)lc->maxBlocksPerCluster < cluster_product) {
const char* name = (lc->block_size_x > 0) ? "__block_size__"
: "__cluster_dims__";
emit_error_with_name(7, 3707, &global_loc, name);
lc = entity->launch_config;
if (!lc)
goto check_maxnreg;
}
}
}
// ==== __maxnreg__ checks: 3715 (non-__global__), then 3719 (conflict) ====
if (lc->maxnreg >= 0) {
if (!(es & 0x40)) {
// ==== Error 3715: __maxnreg__ on non-__global__ ====
emit_error_with_name(7, 3715, &global_loc, "__maxnreg__");
lc = entity->launch_config;
if (lc)
goto check_maxnreg_conflict;
goto check_global_advisory;
}
check_maxnreg_conflict:
if (!lc->maxThreadsPerBlock) {
// No __launch_bounds__ -- nothing for __maxnreg__ to conflict with.
// Reached by both __global__ and non-__global__ entities; the latter
// have already been reported via 3715 above.
goto check_global_advisory;
}
// Both __launch_bounds__ and __maxnreg__ present
emit_error_with_name(7, 3719, &global_loc,
"__launch_bounds__ and __maxnreg__");
}
check_maxnreg:
check_global_advisory:
// ==== Warning 3695: __global__ without __launch_bounds__ ====
if (!(es & 0x40))
return; // not __global__, no advisory needed
lc = entity->launch_config;
if (!lc) {
emit_warning(4, 3695, &kernel_decl_loc);
return;
}
if (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor) {
// Launch config exists but no __launch_bounds__ values set
// (struct was allocated by __cluster_dims__ or __block_size__)
emit_warning(4, 3695, &kernel_decl_loc);
}
}
Validation Logic Detail
Error 3534 -- Launch config on non-global. Tests entity->byte_182 & 0x40 (the __global__ bit). If clear, any non-default values in the launch config struct trigger error 3534. The error message uses %s with the specific attribute name. Notably, the check for __cluster_dims__ or __block_size__ tests lc->cluster_dim_x >= 0 (which is true when any cluster dim handler has run, since they write non-negative values). It also checks the intent flag (entity->byte_183 & 0x40) for the zero-argument __cluster_dims__() form.
Error 3707 -- Cluster product exceeds maxBlocksPerCluster. Computes cluster_dim_x * cluster_dim_y * cluster_dim_z using signed 64-bit arithmetic and compares against maxBlocksPerCluster. The multiplication uses the actual stored dimension values. The error message names whichever attribute set the cluster dims ("__block_size__" if block_size_x > 0, otherwise "__cluster_dims__"). This is a compile-time consistency check: if the programmer specifies both a cluster shape and a maximum cluster block count, the shape must fit.
Error 3715 -- maxnreg on non-global. Separate from the general 3534 check. While 3534 covers __launch_bounds__/__cluster_dims__/__block_size__, __maxnreg__ uses its own code because it appears in a different branch of the validation logic.
Error 3719 -- launch_bounds + maxnreg conflict. These two attributes provide contradictory register allocation hints: __launch_bounds__ asks the compiler to choose registers based on occupancy targets; __maxnreg__ overrides with a hard limit. Detected by lc->maxThreadsPerBlock != 0 && lc->maxnreg >= 0.
Warning 3695 -- Missing launch_bounds advisory. Severity 4 (informational). Fires when a __global__ function has no __launch_bounds__ annotation. Tests both lc == NULL (no launch config at all) and maxThreadsPerBlock == 0 && minBlocksPerMultiprocessor == 0 (struct exists but was allocated by other attrs). Not an error; can be suppressed.
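The checks above can be condensed into a goto-free model that returns the diagnostic codes in emission order. This is a minimal sketch, not the decompiled function: `lc_model`, `model_validate`, and the "unset" encodings in the comments are illustrative assumptions, and the zero-argument __cluster_dims__() intent flag (entity+183 bit 6) is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened view of the launch-config fields used here. */
typedef struct {
    int32_t max_threads_per_block;   /* 0 = unset */
    int32_t min_blocks_per_mp;       /* 0 = unset */
    int32_t max_blocks_per_cluster;  /* 0 = unset */
    int32_t cluster_dim[3];          /* negative = unset */
    int32_t block_size_x;            /* > 0 when __block_size__ ran */
    int32_t maxnreg;                 /* -1 = unset */
} lc_model;

/* Appends diagnostic codes to out[] (capacity cap); returns how many. */
size_t model_validate(const lc_model *lc, int is_global, int *out, size_t cap) {
    size_t n = 0;
#define PUSH(code) do { if (n < cap) out[n++] = (code); } while (0)
    if (lc) {
        if (!is_global) {
            if (lc->max_threads_per_block || lc->min_blocks_per_mp)
                PUSH(3534);                       /* __launch_bounds__ */
            if (lc->cluster_dim[0] >= 0)
                PUSH(3534);                       /* cluster / block size */
        }
        if (lc->cluster_dim[0] > 0 && lc->max_blocks_per_cluster > 0) {
            /* Unsigned 64-bit product avoids signed-overflow UB in the model. */
            uint64_t product = (uint64_t)lc->cluster_dim[0]
                             * (uint64_t)lc->cluster_dim[1]
                             * (uint64_t)lc->cluster_dim[2];
            if ((uint64_t)lc->max_blocks_per_cluster < product)
                PUSH(3707);
        }
        if (lc->maxnreg >= 0) {
            if (!is_global)
                PUSH(3715);
            if (lc->max_threads_per_block)
                PUSH(3719);
        }
    }
    /* Advisory: kernel with no __launch_bounds__ values. */
    if (is_global && (!lc || (!lc->max_threads_per_block && !lc->min_blocks_per_mp)))
        PUSH(3695);
#undef PUSH
    return n;
}
```

For a kernel with `__launch_bounds__(256, 0, 4)`, a 2x2x2 cluster, and `__maxnreg__(32)`, the model reports 3707 followed by 3719, which matches the no-short-circuit behavior described for the real validator.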
Error Catalog
Apply-Time Errors
| Error | Sev | Attribute | Condition | Sign test | Emit function |
|---|---|---|---|---|---|
| 3535 | 7 | __launch_bounds__ | entity+81 & 0x04 (local function) | -- | sub_4F79D0 |
| 3685 | 7 | __cluster_dims__ | sign_compare(expr, 0) <= 0 | <= 0 (zero rejected) | sub_4F79D0 |
| 3686 | 7 | __cluster_dims__ | overflow \|\| val > 0x7FFFFFFF | -- | sub_4F8200 |
| 3705 | 7 | __launch_bounds__ (arg 3) | sign_compare(expr, 0) < 0 | < 0 (zero allowed) | sub_4F8200 |
| 3706 | 7 | __launch_bounds__ (arg 3) | overflow \|\| val > 0x7FFFFFFF | -- | sub_4F8200 |
| 3717 | 7 | __maxnreg__ | sign_compare(expr, 0) <= 0 | <= 0 | sub_4F8200 |
| 3718 | 7 | __maxnreg__ | overflow \|\| val > 0x7FFFFFFF | -- | sub_4F8200 |
| 3786 | 7 | __local_maxnreg__ | sign_compare(expr, 0) <= 0 | <= 0 | sub_4F8200 |
| 3787 | 7 | __local_maxnreg__ | overflow \|\| val > 0x7FFFFFFF | -- | sub_4F8200 |
| 3788 | 7 | __block_size__ | sign_compare(expr, 0) <= 0 | <= 0 | sub_4F79D0 |
| 3789 | 7 | __block_size__ | overflow \|\| val > 0x7FFFFFFF | -- | sub_4F8200 |
| 3791 | 7 | __cluster_dims__ / __block_size__ | flags & opposite_bit | -- | sub_4F8200 |
Post-Validation Errors
| Error | Sev | Condition | Emit function |
|---|---|---|---|
| 3534 | 7 | Launch config attrs on non-__global__ | sub_4F79D0 |
| 3695 | 4 | __global__ without __launch_bounds__ | sub_4F8200 |
| 3707 | 7 | maxBlocksPerCluster < cluster_x * cluster_y * cluster_z | sub_4F79D0 |
| 3715 | 7 | maxnreg >= 0 on non-__global__ | sub_4F79D0 |
| 3719 | 7 | maxThreadsPerBlock != 0 && maxnreg >= 0 | sub_4F79D0 |
Sign-Test Summary
| Attribute | Non-positive error | Overflow error | Sign test | Zero allowed? |
|---|---|---|---|---|
| __launch_bounds__ arg 1-2 | (none) | (none) | No check | Yes |
| __launch_bounds__ arg 3 | 3705 | 3706 | < 0 | Yes (not stored) |
| __cluster_dims__ | 3685 | 3686 | <= 0 | No |
| __block_size__ | 3788 | 3789 | <= 0 | No |
| __maxnreg__ | 3717 | 3718 | <= 0 | No |
| __local_maxnreg__ | 3786 | 3787 | <= 0 | No |
Attribute Interaction Matrix
| | __launch_bounds__ | __cluster_dims__ | __block_size__ | __maxnreg__ | __local_maxnreg__ |
|---|---|---|---|---|---|
| __launch_bounds__ | -- | OK | OK | 3719 | OK |
| __cluster_dims__ | OK | -- | 3791 | OK | OK |
| __block_size__ | OK | 3791 | -- | OK | OK |
| __maxnreg__ | 3719 | OK | OK | -- | OK |
| __local_maxnreg__ | OK | OK | OK | OK | -- |
Additional constraints:
- All attributes except __local_maxnreg__ require __global__ execution space (error 3534 / 3715)
- __launch_bounds__ arg 3 must be >= cluster product when cluster dims are set (error 3707)
- __launch_bounds__ is also rejected on local functions at application time (error 3535)
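The pairwise matrix can be expressed as a lookup table, which is convenient when reimplementing the checks in a table-driven form. This is an illustrative sketch only: the enum, the table, and `model_conflict_code` are hypothetical names, and cudafe++ itself detects these conflicts through flag tests and field comparisons rather than through a table. The diagonal (the same attribute paired with itself) is outside what the matrix documents and is encoded as 0 here.

```c
/* Hypothetical attribute identifiers, in the same order as the matrix. */
enum attr_id {
    A_LAUNCH_BOUNDS,
    A_CLUSTER_DIMS,
    A_BLOCK_SIZE,
    A_MAXNREG,
    A_LOCAL_MAXNREG,
    A_COUNT
};

/* 0 = the pair may coexist; otherwise the diagnostic code from the matrix. */
static const int conflict_table[A_COUNT][A_COUNT] = {
    /* launch_bounds */ { 0,    0,    0,    3719, 0 },
    /* cluster_dims  */ { 0,    0,    3791, 0,    0 },
    /* block_size    */ { 0,    3791, 0,    0,    0 },
    /* maxnreg       */ { 3719, 0,    0,    0,    0 },
    /* local_maxnreg */ { 0,    0,    0,    0,    0 },
};

/* Returns the conflict diagnostic for an attribute pair, or 0 if none. */
int model_conflict_code(enum attr_id a, enum attr_id b) {
    return conflict_table[a][b];
}
```

The table is symmetric, mirroring the fact that the apply handlers can run in either order and still produce the same diagnostic.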
Entity Node Field Reference
| Offset | Size | Field | Role in Launch Config |
|---|---|---|---|
| +81 | 1 byte | local_flags | Bit 2 (0x04): local function. Checked by sub_411C80 for error 3535. |
| +177 | 1 byte | suppress_flags | Bit 4 (0x10): entity suppressed. Post-validation skips if set. |
| +182 | 1 byte | execution_space | Bit 6 (0x40): __global__. Checked by sub_6BC890 for 3534, 3695, 3715. |
| +183 | 1 byte | extended_cuda | Bit 6 (0x40): cluster_dims intent (set by zero-arg __cluster_dims__). |
| +256 | 8 bytes | launch_config | Pointer to launch_config_t (56 bytes). NULL if no launch config attrs. |
Error Emission Functions
| Address | Identity | Signature | Used for |
|---|---|---|---|
| sub_4F79D0 | emit_error_with_name | (severity, code, loc, name_str) | 3535, 3685, 3534, 3707, 3715, 3719, 3788 |
| sub_4F8200 | emit_error_basic | (severity, code, loc) | 3686, 3705, 3706, 3717, 3718, 3786, 3787, 3789, 3791, 3695 |
sub_4F79D0 takes an extra string argument (the attribute name) and substitutes it into the diagnostic message via %s. sub_4F8200 emits a fixed-format message with no string interpolation. Warning 3695 uses severity 4 through sub_4F8200; all other diagnostics use severity 7.
Function Map
| Address | Identity | Lines | Source File |
|---|---|---|---|
| sub_411C80 | apply_nv_launch_bounds_attr | 98 | attribute.c |
| sub_4115F0 | apply_nv_cluster_dims_attr | 145 | attribute.c |
| sub_4109E0 | apply_nv_block_size_attr | 265 | attribute.c |
| sub_410F70 | apply_nv_maxnreg_attr | 67 | attribute.c |
| sub_411090 | apply_nv_local_maxnreg_attr | 67 | attribute.c |
| sub_6BC890 | nv_validate_cuda_attributes | 160 | nv_transforms.c |
| sub_5E52F0 | allocate_launch_config | 42 | il.c (IL allocation) |
| sub_461640 | const_expr_get_value | 53 | const_expr.c |
| sub_461980 | const_expr_sign_compare | 97 | const_expr.c |
| sub_7BE9E0 | is_dependent_type | 15 | template.c |
| sub_4F79D0 | emit_error_with_name | -- | error.c |
| sub_4F8200 | emit_error_basic | -- | error.c |
Global Variables
| Address | Name | Purpose |
|---|---|---|
| qword_126EDE8 | global_source_loc | Default source location used in post-validation error emission |
| qword_126DD38 | kernel_decl_loc | Source location for kernel declaration (used in 3695 advisory) |
| dword_126EC90 | il_pool_id | Arena allocator pool ID for launch config allocation |
| dword_126F694 | launch_config_size | Size parameter for arena allocator |
| dword_126F690 | pool_base | Base pointer of the IL arena pool |
| dword_106BA08 | abi_mode | ABI compatibility flag; when 0, allocator adds 8-byte prefix |
| dword_126E5FC | device_flag | Device compilation mode; bit 0 affects launch config flags byte |
| byte_E6D1B0 | is_signed_type_table | Lookup table indexed by type subkind; true if type is signed integer |
Cross-References
- Attribute System Overview -- dispatch table, attribute node structure, kind enum
- global Function Constraints -- the __global__ attribute that launch config attributes require
- grid_constant -- parameter attribute that interacts with kernel parameter checks
- Minor Attributes -- __nv_register_params__, __noinline__, __forceinline__
- Entity Node Layout -- full byte map of entity node with +256 pointer
- Execution Spaces -- byte_182 bitfield layout and __global__ predicate
- Diagnostics Overview -- error emission functions, severity levels
grid_constant
The __grid_constant__ attribute marks a const-qualified, non-reference __global__ function parameter as a read-only object that lives for the duration of the grid and presents a single address to every thread. Per the CUDA C++ Programming Guide, kernel parameters are already passed to the device through constant memory; what the annotation changes is that the compiler uses the parameter's own address when code takes it, instead of first copying the parameter into thread-local memory. The attribute was introduced in CUDA 11.7 and requires compute capability 7.0 or later (Volta+).
cudafe++ enforces 8 validation checks on __grid_constant__ parameters, distributed across three phases: attribute application (checking type constraints -- const qualification, no reference types, SM version), post-declaration validation (checking that the annotation appears only on __global__ function parameters), and redeclaration/template merging (checking consistency of annotations between declarations). A ninth related check (error 3669) in the __global__ apply handler issues an advisory when a kernel parameter lacks a default initializer in device compilation mode, suggesting that __grid_constant__ would be appropriate.
Key Facts
| Property | Value |
|---|---|
| Internal keyword | grid_constant (stored at 0x82bf0f), displayed as __grid_constant__ (at 0x82bf1d) |
| Attribute category | Optimization (parameter-level) |
| Minimum architecture | compute_70 (Volta), gated by dword_126E4A8 >= 70 |
| Entity node flag | entity+164 bit 2 (0x04) -- set on the parameter entity during attribute application |
| Type node flag | type+133 bit 5 (0x20) -- checked by sub_7A6B60 (type chain query) |
| Parameter node flag | param+32 bit 1 (0x02) -- checked during post-declaration validation in sub_6BC890 |
| Total diagnostics | 8 unique error strings + 1 related advisory (3669) + 1 memory space conflict (3577) |
| Diagnostic tag prefix | grid_constant_* (8 tags in .rodata at 0x84810f--0x857770) |
| Message string block | 0x88d8b0--0x88dbe8 (contiguous block in .rodata) |
Why grid_constant Exists
A parameter annotated __grid_constant__ tells the compiler four things. The semantics below follow the CUDA C++ Programming Guide; cudafe++ itself only implements the validation.
1. The parameter object is shared by the whole grid. Kernel parameters are passed by value through the kernel launch API, so the value is already identical for every thread. The annotation adds that the object has grid lifetime, is private to the grid, and presents the same address to all threads.
2. The parameter's address can be taken without a local copy. Kernel parameters are passed to the device through constant memory whether or not they are annotated. Without the annotation, taking a parameter's address makes the compiler copy it into thread-local memory, because C++ lets each thread modify its own copy of a function parameter. With __grid_constant__, the compiler uses the generic address of the parameter itself. This provides:
- No per-thread copy: a large struct passed by value is not duplicated into local memory for every thread that takes its address.
- Less local-memory traffic and stack usage: helper functions that accept a pointer read the original parameter storage directly.
3. The parameter must be const-qualified. Since the object is shared across the grid, writes would be nonsensical; modifying a __grid_constant__ object is undefined behavior. cudafe++ enforces this at the type level.
4. The parameter must not be a reference type. Kernel parameters are copied to the device by the CUDA runtime, and a reference to host memory would dangle because it points into host address space. Even a reference to device memory is not valid here -- __grid_constant__ parameters must be values, not indirections.
SM_70+ Requirement Rationale
The compute_70 (Volta) minimum is encoded in cudafe++ only as an integer comparison against dword_126E4A8; the binary does not say why. A likely explanation, based on the public PTX ISA documentation rather than on anything in this binary, is that the feature depends on forming a generic address for a kernel parameter:
- Kernel parameters live in the .param state space and are normally read with ld.param
- PTX ISA 7.7 (shipped with CUDA 11.7) extended cvta and isspacep to the .param state space for kernel parameters
- Those .param forms require sm_70 or later
On pre-Volta targets the compiler has no way to hand out the address of the parameter storage itself, so the thread-local copy cannot be avoided and the annotation cannot be honored.
Where Validation Happens
The __grid_constant__ validation logic is spread across multiple compilation phases because the checks require different kinds of information. The type-level checks (const, reference) can be performed as soon as the attribute is applied. The context check (must be on a __global__ parameter) requires the function's execution space to be resolved. The redeclaration checks require both the old and new declarations to be available.
Phase 1: Attribute Application
Checks 1 (const), 2 (reference), and 4 (architecture) execute during attribute application, when the __grid_constant__ attribute handler runs. This handler is registered in EDG's attribute descriptor table under the kind byte for __grid_constant__. It receives the attribute node, the entity node, and the target kind. The handler inspects the parameter's type node to verify const-qualification and absence of reference semantics, and checks dword_126E4A8 against the threshold value 70.
Phase 2: Post-Declaration Validation
Check 3 (must be on __global__ parameter) executes in nv_validate_cuda_attributes (sub_6BC890). This function runs after all attributes on a declaration have been applied and resolved. It walks the function's parameter list and checks whether any parameter carries the __grid_constant__ flag (param+32 bit 1) on a non-__global__ function.
Phase 3: Redeclaration/Template Merging
Checks 5--8 (consistency across redeclarations, template redeclarations, specializations, and explicit instantiations) execute during the declaration merging passes in class_decl.c, decls.c, and template.c. These passes compare the entity+164 bit 2 flag on corresponding parameters of the old and new declarations.
Validation Check 1: const-Qualified Type
| Property | Value |
|---|---|
| Tag | grid_constant_not_const (at 0x848146) |
| Message | a parameter annotated with __grid_constant__ must have const-qualified type (at 0x88d8b0) |
| Severity | error |
| Phase | Attribute application |
The parameter's type must carry the const qualifier. The check peels through the type chain, following cv-qualifier wrapper nodes (kind == 12) to reach the underlying type, then verifies the const flag is present.
The type-level check uses the same type chain navigation pattern found throughout EDG's type system:
// Conceptual logic (from the __grid_constant__ attribute handler)
type_t* ptype = param->type;
while (ptype->kind == 12) // skip cv-qualifier wrapper nodes
ptype = ptype->referenced; // follow chain at type+144
if (!(ptype->cv_quals & CONST_FLAG))
emit_error("grid_constant_not_const", param->src_loc);
If the user writes:
__global__ void kernel(__grid_constant__ int x) { ... }
cudafe++ emits grid_constant_not_const because int is not const-qualified. The correct form is:
__global__ void kernel(__grid_constant__ const int x) { ... }
Validation Check 2: No Reference Type
| Property | Value |
|---|---|
| Tag | grid_constant_reference_type (at 0x84815e) |
| Message | a parameter annotated with __grid_constant__ must not have reference type (at 0x88d900) |
| Severity | error |
| Phase | Attribute application |
The parameter must not be a reference (& or &&). This check fires independently of the const check -- both can fire on the same parameter.
In EDG's type system, reference types have kind == 7 (lvalue reference) or kind == 19 (rvalue reference). The check walks the type chain through cv-qualifier wrappers and tests the final type kind:
type_t* ptype = param->type;
while (ptype->kind == 12)
ptype = ptype->referenced;
if (ptype->kind == 7 || ptype->kind == 19) // lvalue ref or rvalue ref
emit_error("grid_constant_reference_type", param->src_loc);
Example that triggers this error:
__global__ void kernel(__grid_constant__ const int& x) { ... }
The rationale is that kernel parameters are copied across the host-device boundary by the CUDA runtime. A reference to host memory would be invalid on the device, and a reference to device memory does not participate in the kernel launch parameter copying mechanism. The __grid_constant__ attribute specifically requests constant-memory placement of the parameter value -- a reference has no value to place.
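Checks 1 and 2 can be exercised together with a toy type graph, which also shows why both diagnostics can fire on one parameter. This is a sketch under stated assumptions: `type_model`, the `CV_CONST` bit position, `KIND_INT`, and the bitmask return value are hypothetical; only the kind values 7, 12, and 19 and the peel-then-test structure come from the pseudocode above.

```c
#include <stddef.h>
#include <stdint.h>

/* Kind values 7, 12, 19 are from the analysis; KIND_INT is a placeholder. */
enum { KIND_INT = 2, KIND_LVALUE_REF = 7, KIND_WRAPPER = 12, KIND_RVALUE_REF = 19 };
enum { CV_CONST = 0x01 };                       /* hypothetical bit position */
enum { DIAG_NOT_CONST = 1, DIAG_REFERENCE = 2 };

typedef struct type_model {
    uint8_t kind;
    uint8_t cv_quals;
    const struct type_model *referenced;        /* type+144 in the real node */
} type_model;

/* Follow wrapper nodes (kind == 12) down to the underlying type. */
static const type_model *peel(const type_model *t) {
    while (t->kind == KIND_WRAPPER)
        t = t->referenced;
    return t;
}

/* Returns a bitmask of the diagnostics the two checks would emit. */
unsigned model_check_grid_constant_type(const type_model *param_type) {
    const type_model *t = peel(param_type);
    unsigned diags = 0;
    if (!(t->cv_quals & CV_CONST))
        diags |= DIAG_NOT_CONST;                /* grid_constant_not_const */
    if (t->kind == KIND_LVALUE_REF || t->kind == KIND_RVALUE_REF)
        diags |= DIAG_REFERENCE;                /* grid_constant_reference_type */
    return diags;
}
```

In this model a `const int&` parameter reports both diagnostics, because the reference node itself carries no const qualifier; the const belongs to the referenced type.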
Validation Check 3: Only on global Parameters
| Property | Value |
|---|---|
| Tag | grid_constant_non_kernel (at 0x84812d) |
| Message | __grid_constant__ annotation is only allowed on a parameter of a __global__ function (at 0x88db38) |
| Error code | 3702 |
| Severity | 7 (standard error) |
| Phase | Post-declaration validation (sub_6BC890) |
This check enforces that __grid_constant__ only appears on parameters of __global__ (kernel) functions. Parameters of __device__ or __host__ __device__ functions do not participate in the kernel launch mechanism and have no grid-wide constant memory optimization path.
The check executes in nv_validate_cuda_attributes (sub_6BC890, 160 lines, nv_transforms.c). The validator navigates from the function entity to its parameter list, then walks each parameter testing for the __grid_constant__ flag. The reconstructed pseudocode:
// From nv_validate_cuda_attributes (sub_6BC890)
// a1: function entity node
// a2: pointer to source location for diagnostics
void nv_validate_cuda_attributes(entity_t* a1, source_loc_t* a2) {
if (!a1 || (a1->byte_177 & 0x10))
return; // null entity or suppressed
type_t* type_chain = a1->type_chain; // entity+144
uint8_t exec_space = a1->byte_182; // execution space bitfield
// Skip the walk when there is no type chain, or when bit 4 is clear and
// bits 5 and 6 are both set (bit 6 = __global__)
if (!type_chain || ((exec_space & 0x30) == 0x20 &&
(exec_space & 0x60) != 0x20))
goto skip_param_walk;
// Navigate through cv-qualifier wrappers to reach the function type
while (type_chain->kind == 12)
type_chain = type_chain->referenced; // type+144
// Get parameter list from prototype (double dereference)
param_t* param = **(param_t***)(type_chain + 152);
// Walk each parameter
while (param) {
if (param->byte_32 & 0x02) {
// __grid_constant__ flag is set on a non-__global__ parameter
emit_error(7, 3702, a2); // grid_constant_non_kernel
}
param = param->next;
}
// ... (continues with __launch_bounds__ validation below)
}
The param->byte_32 & 0x02 test checks bit 1 of the parameter node's byte at offset +32. This bit is the __grid_constant__ flag on the parameter node in the function prototype (distinct from the entity+164 bit 2 flag on the parameter entity). It is set when the __grid_constant__ attribute is first applied, and checked here to verify the containing function is actually a kernel.
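The error fires for any execution space that is NOT __global__. The pre-filter at the top of the function, (exec_space & 0x30) == 0x20 && (exec_space & 0x60) != 0x20, is what makes this so. The first clause requires bit 5 set and bit 4 clear; with bit 5 set, the second clause can only hold when bit 6 (0x40, the __global__ bit) is also set. The condition is therefore equivalent to (exec_space & 0x70) == 0x60, and it skips the parameter walk for kernels. Every other execution space falls through to the walk, where any parameter carrying the flag triggers error 3702.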
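A quick way to see which execution-space bytes take the skip path is to enumerate all 256 values and compare the pre-filter against a simpler closed form. The function names below are hypothetical; the first expression is copied from the pseudocode above, and the second is the candidate simplification being tested.

```c
#include <stdint.h>

/* The pre-filter exactly as written in the reconstructed pseudocode. */
int skips_param_walk(uint8_t exec_space) {
    return (exec_space & 0x30) == 0x20 && (exec_space & 0x60) != 0x20;
}

/* Candidate simplification: bit 4 clear, bits 5 and 6 both set. */
int skips_param_walk_simplified(uint8_t exec_space) {
    return (exec_space & 0x70) == 0x60;
}

/* Returns 1 if the two forms agree for every possible byte value. */
int forms_agree(void) {
    for (int v = 0; v < 256; ++v) {
        if (skips_param_walk((uint8_t)v) != skips_param_walk_simplified((uint8_t)v))
            return 0;
    }
    return 1;
}
```

The two forms agree on every input, so the skip path is taken only when bit 6 (the __global__ bit) is set together with bit 5 and without bit 4. A byte with bit 6 clear never skips the walk.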
Validation Check 4: compute_70+ Architecture
| Property | Value |
|---|---|
| Tag | grid_constant_unsupported_arch (at 0x857770) |
| Message | __grid_constant__ annotation is only allowed for architecture compute_70 or later (at 0x88db90) |
| Severity | error |
| Phase | Attribute application |
The target architecture, stored in dword_126E4A8 (set by the --target CLI flag via case 245 in proc_command_line), must be >= 70. The architecture code is an integer representation: sm_70 maps to 70, sm_80 to 80, sm_90 to 90, etc.
// Architecture gate in the __grid_constant__ attribute handler
if (dword_126E4A8 < 70)
emit_error("grid_constant_unsupported_arch", param->src_loc);
If the user compiles with -arch=compute_60 or lower and uses __grid_constant__, this error fires. The check is a straightforward integer comparison -- no bitmask, no table lookup.
The architecture value reaches cudafe++ through nvcc, which translates user-facing flags like --gpu-architecture=sm_70 into the internal numeric code and passes it via the --target flag. Inside cudafe++, sub_7525E0 (a 6-byte stub returning -1) nominally parses this value, but the actual number is injected by nvcc into the argument string. See Architecture Feature Gating for the full data flow.
Validation Checks 5--8: Redeclaration Consistency
The four redeclaration consistency checks share the same algorithmic structure but apply to different declaration contexts. They all enforce the invariant that __grid_constant__ annotations must match between declarations: if the first declaration annotates a parameter with __grid_constant__, every subsequent declaration (redeclaration, template redeclaration, specialization, explicit instantiation) must also annotate the corresponding parameter, and vice versa.
Why These Checks Exist
The CUDA C++ Programming Guide requires all declarations of a function to agree on their __grid_constant__ parameters. The annotation changes how device code may access the parameter (its own address is used, with no thread-local copy), so a declaration that omits it describes a different contract for the same kernel. In RDC (relocatable device code) mode, where kernels can be declared in one TU and defined in another, such a mismatch is not visible from the definition alone. The compiler catches it at declaration merging time to prevent this.
Check 5: Function Redeclaration
| Property | Value |
|---|---|
| Tag | grid_constant_incompat_redecl (at 0x84810f) |
| Message | incompatible __grid_constant__ annotation for parameter %s in function redeclaration (see previous declaration %p) (at 0x88d950) |
| Phase | Redeclaration merging (class_decl.c) |
When a __global__ function is redeclared, cudafe++ compares the entity+164 bit 2 (0x04) flag on each parameter between the existing and new declarations. If the flags differ for any parameter at the same position, the error fires.
// Redeclaration consistency check (conceptual, in class_decl.c)
param_t* old_param = get_params(old_decl);
param_t* new_param = get_params(new_decl);
while (old_param && new_param) {
bool old_gc = (old_param->entity->byte_164 & 0x04) != 0;
bool new_gc = (new_param->entity->byte_164 & 0x04) != 0;
if (old_gc != new_gc)
emit_error("grid_constant_incompat_redecl",
new_param->name, old_decl->src_loc);
old_param = old_param->next;
new_param = new_param->next;
}
Example:
__global__ void kernel(__grid_constant__ const int x);
__global__ void kernel(const int x); // ERROR: grid_constant_incompat_redecl
The %s in the message is expanded to the parameter name, and %p is expanded to a source location reference pointing at the previous declaration.
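The lockstep walk can be modelled on two small linked lists. This is a minimal sketch: `param_model` and `model_first_grid_constant_mismatch` are hypothetical names, and unlike the conceptual pseudocode above, which emits a diagnostic for every mismatching position, the model stops at the first mismatch so that the result is easy to assert on.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter record; byte_164 bit 2 (0x04) = __grid_constant__. */
typedef struct param_model {
    uint8_t byte_164;
    const struct param_model *next;
} param_model;

/* Walks both parameter lists in lockstep.
   Returns the zero-based index of the first position whose
   __grid_constant__ flags differ, or -1 if all compared positions agree. */
int model_first_grid_constant_mismatch(const param_model *old_p,
                                       const param_model *new_p) {
    for (int i = 0; old_p && new_p; ++i) {
        int old_gc = (old_p->byte_164 & 0x04) != 0;
        int new_gc = (new_p->byte_164 & 0x04) != 0;
        if (old_gc != new_gc)
            return i;
        old_p = old_p->next;
        new_p = new_p->next;
    }
    return -1;
}
```

Only bit 2 participates in the comparison, so other bits in the same byte may differ between declarations without triggering the diagnostic.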
Check 6: Function Template Redeclaration
| Property | Value |
|---|---|
| Tag | grid_constant_incompat_templ_redecl (at 0x857748) |
| Message | incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p) (at 0x88d9c8) |
| Phase | Template redeclaration merging (class_decl.c) |
Same logic as check 5, but for function template redeclarations. Template redeclaration merging occurs in a separate code path from regular function redeclaration because template entities have additional metadata (template parameter lists, partial specialization chains) that must be reconciled.
template<typename T>
__global__ void kernel(__grid_constant__ const T x);
template<typename T>
__global__ void kernel(const T x); // ERROR: grid_constant_incompat_templ_redecl
Check 7: Template Specialization
| Property | Value |
|---|---|
| Tag | grid_constant_incompat_specialization (at 0x857720) |
| Message | incompatible __grid_constant__ annotation for parameter %s in function specialization (see previous declaration %p) (at 0x88da48) |
| Phase | Template specialization processing |
When a function template specialization's __grid_constant__ annotations disagree with the primary template, this error fires. The specialization must preserve the __grid_constant__ annotation from the primary template because the compiler may have already committed to constant-memory parameter placement based on the primary template's declaration.
template<typename T>
__global__ void kernel(__grid_constant__ const T x);
template<>
__global__ void kernel<int>(const int x); // ERROR: grid_constant_incompat_specialization
A specialization that omits the annotation would require a different ABI for that particular instantiation, which the kernel launch infrastructure cannot accommodate on a per-specialization basis.
Check 8: Explicit Instantiation Directive
| Property | Value |
|---|---|
| Tag | grid_constant_incompat_instantiation_directive (at 0x8576f0) |
| Message | incompatible __grid_constant__ annotation for parameter %s in instantiation directive (see previous declaration %p) (at 0x88dac0) |
| Phase | Explicit instantiation processing |
This mirrors the specialization check but applies to explicit instantiation declarations and definitions (template void ... and extern template void ...).
template<typename T>
__global__ void kernel(__grid_constant__ const T x) { ... }
template __global__ void kernel<int>(const int x);
// ERROR: grid_constant_incompat_instantiation_directive
The instantiation directive must match the primary template's __grid_constant__ annotation for each parameter.
Memory Space Conflict Check (Error 3577)
While not one of the 8 __grid_constant__ validation checks, error 3577 provides a guard in the reverse direction. When apply_nv_managed_attr (sub_40E0D0) or apply_nv_device_attr (sub_40EB80) applies a memory space attribute to a variable, the handler checks whether the entity has the __grid_constant__ flag set at entity+164 bit 2. If it does, and the entity also carries a __shared__ or __managed__ memory space bit, error 3577 is emitted with the name of the conflicting memory space.
The check is identical in both handlers. Here is the reconstructed pseudocode from apply_nv_managed_attr (sub_40E0D0):
// From apply_nv_managed_attr (sub_40E0D0, attribute.c:10523)
// a1: attribute node, a2: entity node, a3: target kind (must be 7 = variable)
entity_t* apply_nv_managed_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {
// Gate: variables only
if (a3 != 7)
internal_error("attribute.c", 10523, "apply_nv_managed_attr", 0, 0);
// Apply memory space flags
uint8_t old_memspace = a2->byte_148;
a2->byte_149 |= 0x01; // set __managed__ flag
a2->byte_148 = old_memspace | 0x01; // set __device__ flag (managed implies device)
// Check for conflicting memory space combinations
if (((old_memspace & 0x02) != 0) + ((old_memspace & 0x04) != 0) == 2)
emit_error(3481, a1->src_loc); // both __shared__ and __constant__ set
if ((signed char)a2->byte_161 < 0)
emit_error(3482, a1->src_loc); // thread_local conflict
if (a2->byte_81 & 0x04)
emit_error(3485, a1->src_loc); // local variable conflict
// Grid constant conflict check
if ((a2->byte_164 & 0x04) != 0 // has __grid_constant__ flag
&& (*(uint16_t*)(a2 + 148) & 0x0102) != 0) // __shared__ OR __managed__
{
// Determine which memory space to report in the diagnostic
uint8_t mem = a2->byte_148;
const char* space;
if (mem & 0x04) space = "__constant__";
else if (a2->byte_149 & 0x01) space = "__managed__";
else if (mem & 0x02) space = "__shared__";
else if (mem & 0x01) space = "__device__";
else space = "";
emit_error_with_string(3577, a1->src_loc, space);
}
return a2;
}
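The 0x0102 mask on the 16-bit word at a2 + 148 checks two bits: bit 1 of byte +148 (__shared__, value 0x02) and bit 0 of byte +149 (__managed__, value 0x01 shifted left by 8 bits = 0x0100). The conflict check therefore fires when a __grid_constant__ entity also carries __shared__ or __managed__ -- memory spaces that are incompatible with constant memory placement. In apply_nv_managed_attr the __managed__ bit has just been set, so the mask test always succeeds there and the check reduces to the __grid_constant__ flag alone; in apply_nv_device_attr it fires only if __shared__ or __managed__ was already present.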
The priority order for the diagnostic message (__constant__ > __managed__ > __shared__ > __device__) determines which memory space name appears in the error output when multiple conflicting spaces are present simultaneously.
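The mask test and the name cascade can be modelled in isolation. The sketch below assumes the bit layout described above and uses hypothetical function names; it takes the two flag bytes as plain arguments rather than reading a real entity node.

```c
#include <stdint.h>

/* Assumed layout: byte +148 holds __device__ (0x01), __shared__ (0x02),
 * __constant__ (0x04); byte +149 bit 0 holds __managed__. */
static uint16_t space_word(uint8_t byte_148, uint8_t byte_149) {
    /* Little-endian 16-bit read at +148: low byte +148, high byte +149. */
    return (uint16_t)(byte_148 | ((uint16_t)byte_149 << 8));
}

/* True when the __shared__ or __managed__ bit is set (mask 0x0102). */
int conflicts_with_grid_constant(uint8_t byte_148, uint8_t byte_149) {
    return (space_word(byte_148, byte_149) & 0x0102) != 0;
}

/* Priority cascade that picks the name reported in error 3577. */
const char *reported_space_name(uint8_t byte_148, uint8_t byte_149) {
    if (byte_148 & 0x04) return "__constant__";
    if (byte_149 & 0x01) return "__managed__";
    if (byte_148 & 0x02) return "__shared__";
    if (byte_148 & 0x01) return "__device__";
    return "";
}
```

Note that a plain __device__ entity (byte +148 = 0x01, byte +149 = 0x00) does not trip the mask, while the cascade can still report "__constant__" when that bit happens to be set alongside a conflicting one.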
The apply_nv_device_attr handler (sub_40EB80) performs the identical check in its variable-handling branch (when a3 == 7):
// From apply_nv_device_attr (sub_40EB80), variable branch
if (a3 == 7) {
a2->byte_148 |= 0x01; // set __device__ flag
// ... shared/constant conflict, thread_local, local variable checks ...
// Identical grid_constant conflict check
if ((a2->byte_164 & 0x04) != 0 && (*(uint16_t*)(a2 + 148) & 0x0102) != 0) {
// Same priority cascade for space name
// ...
emit_error_with_string(3577, a1->src_loc, space);
}
return a2;
}
Entity Node Fields
Three distinct locations in entity/type/parameter nodes carry __grid_constant__ state:
entity+164 bit 2 (0x04): Grid Constant Declaration Flag
Set during attribute application when a parameter is declared __grid_constant__. This is the "declaration-side" flag that records the programmer's intent. Used by:
- Memory space conflict check (error 3577) in apply_nv_managed_attr and apply_nv_device_attr
- Redeclaration consistency checks (checks 5--8)
type+133 bit 5 (0x20): Type-Level Flag
A flag on the type node (not the entity node) checked by sub_7A6B60. This function follows the type chain through cv-qualifier wrappers (kind == 12) and tests byte+133 & 0x20:
// sub_7A6B60 (types.c)
// In the broader EDG type system, this function checks bit 5 of the
// type's flag byte. For CUDA parameter types, this bit indicates
// __grid_constant__ annotation. The same bit is also used as the
// dependent-type flag in template contexts (hence 299 callers in the binary).
bool type_has_flag_0x20(type_t* type) {
while (type->kind == 12) // skip cv-qualifier wrappers
type = type->referenced; // follow type chain at +144
return (type->byte_133 & 0x20) != 0;
}
Used by the __global__ apply handler's parameter iteration to detect parameters that are already annotated with __grid_constant__, suppressing the error 3669 advisory for those parameters.
param+32 bit 1 (0x02): Parameter Node Flag
A flag on the parameter node itself, checked during post-declaration validation (sub_6BC890). The validator walks the parameter list and tests each parameter's byte at offset +32 for bit 1. If set on a parameter of a non-__global__ function, error 3702 (grid_constant_non_kernel) is emitted.
The three flags serve different purposes: the entity flag records the declaration intent and is used for cross-declaration consistency checks, the type flag enables efficient type-level queries during attribute application, and the parameter flag enables the post-validation pass to scan parameter lists without resolving entity or type chains.
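The post-validation scan over the parameter flag can be sketched as follows. The param_node struct is a hypothetical stand-in that models only the list link and the flag byte the text places at offset +32; the real node layout is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a parameter node. */
typedef struct param_node {
    struct param_node *next;
    uint8_t flags_32;          /* bit 1 (0x02): __grid_constant__ */
} param_node;

/* Counts parameters that would trigger error 3702: the flag is set but
 * the enclosing function is not __global__. */
size_t count_grid_constant_non_kernel(const param_node *head, bool is_global) {
    size_t hits = 0;
    if (is_global)
        return 0;              /* the annotation is legal on kernels */
    for (const param_node *p = head; p != NULL; p = p->next) {
        if (p->flags_32 & 0x02)
            ++hits;
    }
    return hits;
}
```

The walk touches only the parameter nodes, which matches the stated purpose of the flag: no entity or type chain has to be resolved.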
Parameter Iteration in the __global__ Apply Handler
The apply_nv_global_attr handlers (sub_40E1F0 and sub_40E7F0) contain a parameter iteration loop that interacts with __grid_constant__. This loop checks each kernel parameter for types that should be __grid_constant__ but are not annotated as such. When found in device compilation mode (dword_126C5C4 == -1), error 3669 is emitted as an advisory.
// From apply_nv_global_attr (sub_40E1F0), Phase 5: parameter iteration
// This section runs only when attr_node+11 bit 0 is set (applies to parameters)
if (a1->byte_11 & 0x01) {
// Navigate to function prototype through cv-qualifier chain
type_t* proto_type = entity->type_chain; // entity+144
while (proto_type->kind == 12)
proto_type = proto_type->referenced;
// Get parameter list head (double dereference from prototype+152)
param_t* param = **(param_t***)(proto_type + 152);
source_loc_t saved_loc = a1->src_loc; // attr_node+56
for (; param != NULL; param = param->next) {
// Peel cv-qualifier wrappers from parameter type
type_t* ptype = param->type; // param[1] (offset 8)
while (ptype->kind == 12)
ptype = ptype->referenced;
// sub_7A6B60: returns true if type+133 bit 5 is set
// (parameter is already __grid_constant__)
if (!sub_7A6B60(ptype) && dword_126C5C4 == -1) {
// Scope table lookup (784-byte entries)
scope_entry_t* scope = (scope_entry_t*)(qword_126C5E8 + 784 * dword_126C5E4);
// Skip if scope has qualifier flags or is a cv-qualified scope
if ((scope->byte_6 & 0x06) == 0 && scope->byte_4 != 12) {
// Re-navigate to unqualified type
type_t* ptype2 = param->type;
while (ptype2->kind == 12)
ptype2 = ptype2->referenced;
// If no default initializer, suggest __grid_constant__
if (ptype2->qword_120 == 0)
emit_error(3669, &saved_loc);
}
}
}
}
The logic: for each parameter in a __global__ function, if the parameter type does NOT already have the __grid_constant__ flag AND we are in device compilation mode AND the current scope is not a cv-qualified context AND the parameter type lacks a default initializer (the type+120 pointer is null), then emit error 3669 as an advisory. The advisory nudges kernel authors to add __grid_constant__ annotations for better performance.
The scope table lookup (qword_126C5E8 indexed by dword_126C5E4, 784-byte entries) determines whether the current compilation context is device-side. The dword_126C5C4 == -1 sentinel explicitly indicates device compilation mode. Together these two conditions ensure the advisory only fires when processing the device-side compilation of a kernel, not during host-side stub generation.
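Reduced to a predicate, the advisory decision looks like the sketch below. The advisory_inputs struct is a hypothetical flattening of the values the loop consults for one parameter; the field names are illustrative only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattening of the per-parameter inputs. */
typedef struct {
    bool        type_flag_0x20;  /* result of the type+133 bit 5 query */
    int32_t     mode_sentinel;   /* models dword_126C5C4 */
    uint8_t     scope_byte_6;    /* scope entry byte +6 */
    uint8_t     scope_byte_4;    /* scope entry byte +4 */
    const void *type_qword_120;  /* pointer stored at type+120 */
} advisory_inputs;

/* Mirrors the nested conditions of the loop body in order. */
bool should_emit_3669(const advisory_inputs *in) {
    if (in->type_flag_0x20)             return false;
    if (in->mode_sentinel != -1)        return false;
    if ((in->scope_byte_6 & 0x06) != 0) return false;
    if (in->scope_byte_4 == 12)         return false;
    return in->type_qword_120 == NULL;
}
```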
Keyword Registration
The __grid_constant__ keyword is registered during fe_translation_unit_init (sub_5863A0), alongside other CUDA extension keywords (__device__, __global__, __shared__, __constant__, __managed__, __launch_bounds__). The registration inserts both grid_constant (bare form, for attribute name lookup) and __grid_constant__ (double-underscore form, for lexer recognition) into EDG's keyword-to-token-ID mapping.
The attribute name lookup function (sub_40A250) strips leading and trailing double underscores before searching the attribute name hash table (qword_E7FB60), so __grid_constant__ resolves to the same descriptor entry as the bare grid_constant form.
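The normalization step can be illustrated with a small helper. This is a sketch, not the recovered routine: the function name is hypothetical, and it assumes the strip happens only when both the leading and trailing "__" are present.

```c
#include <stddef.h>
#include <string.h>

/* Copies `name` into `out`, removing one leading "__" and one trailing
 * "__" when both are present. Returns the number of characters written
 * (excluding the terminator); returns 0 if the buffer is too small or
 * the name is empty. */
size_t strip_double_underscores(const char *name, char *out, size_t out_size) {
    size_t len = strlen(name);
    const char *start = name;
    if (len > 4 && strncmp(name, "__", 2) == 0
        && strncmp(name + len - 2, "__", 2) == 0) {
        start = name + 2;
        len -= 4;
    }
    if (len + 1 > out_size)
        return 0;
    memcpy(out, start, len);
    out[len] = '\0';
    return len;
}
```

Both spellings then produce the same key, so a single hash table entry serves either form.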
Diagnostic Tag Summary
| Tag | Error Code | Message | Phase |
|---|---|---|---|
| grid_constant_not_const | -- | a parameter annotated with __grid_constant__ must have const-qualified type | Application |
| grid_constant_reference_type | -- | a parameter annotated with __grid_constant__ must not have reference type | Application |
| grid_constant_non_kernel | 3702 | __grid_constant__ annotation is only allowed on a parameter of a __global__ function | Post-validation |
| grid_constant_unsupported_arch | -- | __grid_constant__ annotation is only allowed for architecture compute_70 or later | Application |
| grid_constant_incompat_redecl | -- | incompatible __grid_constant__ annotation for parameter %s in function redeclaration (see previous declaration %p) | Redeclaration |
| grid_constant_incompat_templ_redecl | -- | incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p) | Template redecl |
| grid_constant_incompat_specialization | -- | incompatible __grid_constant__ annotation for parameter %s in function specialization (see previous declaration %p) | Specialization |
| grid_constant_incompat_instantiation_directive | -- | incompatible __grid_constant__ annotation for parameter %s in instantiation directive (see previous declaration %p) | Instantiation |
Error codes for checks 1, 2, 4--8 are not individually mapped in the decompiled code available for this analysis. Error 3702 (check 3) is confirmed from the post-validation function sub_6BC890. Error 3577 (memory space conflict) is confirmed from sub_40E0D0 and sub_40EB80.
Function Map
| Address | Identity | Lines | Source File | Role |
|---|---|---|---|---|
| sub_7A6B60 | type flag query (byte_133 & 0x20) | 9 | types.c | Follows type chain, returns grid_constant / dependent flag |
| sub_40E0D0 | apply_nv_managed_attr | 47 | attribute.c:10523 | Memory space conflict check (3577) for __managed__ |
| sub_40EB80 | apply_nv_device_attr | 100 | attribute.c | Memory space conflict check (3577) for __device__ |
| sub_6BC890 | nv_validate_cuda_attributes | 161 | nv_transforms.c | Post-validation: param walk for 3702 (grid_constant_non_kernel) |
| sub_40E1F0 | apply_nv_global_attr (variant 1) | 89 | attribute.c | Parameter iteration with grid_constant flag check (3669 advisory) |
| sub_40E7F0 | apply_nv_global_attr (variant 2) | 86 | attribute.c | Same parameter iteration (alternate call path, do-while loop) |
| sub_5863A0 | fe_translation_unit_init | -- | fe_init.c | Registers __grid_constant__ keyword |
| sub_40A250 | attribute name lookup | -- | attribute.c | Strips __ prefix/suffix, searches hash table |
Global Variables
| Global | Address | Purpose |
|---|---|---|
| dword_126E4A8 | 0x126E4A8 | Target SM architecture code (from --target). Must be >= 70 for __grid_constant__. |
| dword_126C5C4 | 0x126C5C4 | Scope index sentinel. -1 = device compilation mode. Guards 3669 advisory check. |
| dword_126C5E4 | 0x126C5E4 | Current scope table index. Used in 3669 scope lookup. |
| qword_126C5E8 | 0x126C5E8 | Scope table base pointer (784-byte entries). Used in 3669 scope lookup. |
Cross-References
- Attribute System Overview -- attribute node structure, dispatch pipeline, kind byte enumeration
- __global__ Function Constraints -- parameter iteration for __grid_constant__ advisory (error 3669), full apply handler pseudocode
- Entity Node Layout -- entity+164 bit 2 (grid_constant flag), param+32 bit 1
- CUDA Error Catalog -- all 8 grid_constant_* diagnostic tags
- CLI Flag Inventory -- --target flag setting dword_126E4A8
- Architecture Feature Gating -- SM version gating mechanism, dword_126E4A8 data flow
- CUDA Memory Spaces -- constant memory semantics, error 3577 conflict
- RDC Mode -- why redeclaration consistency matters across translation units
__managed__ Variables
The __managed__ attribute declares a variable in CUDA Unified Memory -- a memory region accessible from both host (CPU) and device (GPU) code, with the CUDA runtime handling page migration transparently. Unlike __device__ variables, which host code can reach only through explicit copy calls such as cudaMemcpyToSymbol and cudaMemcpyFromSymbol, managed variables can be read and written by both the host and device using the same pointer. The hardware and driver cooperate to migrate pages between CPU and GPU memory, so neither the programmer nor the compiler needs to issue explicit copies.
The constraint set on __managed__ reflects two fundamental realities. First, unified memory is a runtime feature: the compiler cannot resolve managed addresses at compile time, so every host-side access must be gated behind a lazy initialization call that registers the variable with the CUDA runtime's unified memory subsystem. Second, unified memory requires hardware support: managed memory is supported only on Kepler (compute capability 3.0) and later GPUs, and it builds on the Unified Virtual Addressing (UVA) infrastructure. These two realities drive the entire implementation -- the attribute handler sets both a managed flag and a device flag (because managed memory is device-global memory with extra runtime semantics), the validation chain rejects memory spaces and qualifiers that conflict with runtime writability, and the code generator wraps every host-side access in a comma-operator expression that forces lazy initialization.
Key Facts
| Property | Value |
|---|---|
| Attribute kind byte | 0x66 = 'f' (102) |
| Handler function | sub_40E0D0 (apply_nv_managed_attr, 47 lines, attribute.c:10523) |
| Entity node flags set | entity+149 bit 0 (__managed__) AND entity+148 bit 0 (__device__) |
| Detection bitmask | (*(_WORD*)(entity + 148) & 0x101) == 0x101 |
| Minimum architecture | compute_30 (Kepler) -- dword_126E4A8 >= 30 |
| Applies to | Variables only (entity kind 7) |
| Diagnostic codes | 3481, 3482, 3485, 3577 (attribute application); arch/config errors (declaration processing) |
| Managed RT boilerplate emitter | sub_489000 (process_file_scope_entities, line 218) |
| Access wrapper emitters | sub_4768F0 (gen_name_ref), sub_484940 (gen_variable_name) |
| Managed access prefix string | 0x839570 (65 bytes) |
| Managed RT static block string | 0x83AAC8 (243 bytes) |
| Managed RT init function string | 0x83ABC0 (210 bytes) |
Semantic Meaning
A __managed__ variable occupies a single virtual address that is valid on both host and device. The CUDA runtime allocates the variable through cudaMallocManaged during module initialization and registers it so the driver can track page ownership. When a kernel accesses the variable, the GPU's page fault handler migrates the page from CPU memory (if needed). When host code accesses it after a kernel launch, the runtime ensures the GPU has finished writing and the page is migrated back to CPU-accessible memory.
This is fundamentally different from the other three memory spaces:
| Space | Accessibility | Migration | Lifetime |
|---|---|---|---|
| __device__ | Device only (host needs cudaMemcpy) | Manual | Program lifetime |
| __shared__ | Device only, per-thread-block | None (on-chip SRAM) | Block lifetime |
| __constant__ | Device read-only (host writes via cudaMemcpyToSymbol) | Manual | Program lifetime |
| __managed__ | Host and device, same pointer | Automatic (page faults) | Program lifetime |
Because managed memory is fundamentally device global memory with runtime-managed migration, the __managed__ handler always sets the __device__ bit alongside the __managed__ bit. This is not redundant -- it ensures that all code paths that check for "device-accessible variable" (error 3483 scope checks, external linkage warning 3648, cross-space reference validation) treat managed variables correctly. A managed variable IS a device variable; it just happens to also be host-accessible through the runtime's page migration.
Why the Constraints Exist
Each validation check enforced by the handler exists for a specific hardware or semantic reason:
- Variables only (kind 7): Unified memory is a storage concept. Functions do not reside in managed memory -- they have execution spaces, not memory spaces.
- Cannot be __shared__ or __constant__: These are mutually exclusive memory spaces that occupy different physical hardware. __shared__ is per-block on-chip SRAM with no concept of host accessibility. __constant__ is a read-only cached region with no write path from device code. Managed memory is global DRAM with page migration. They cannot coexist.
- Cannot be thread_local: Thread-local storage uses thread-specific addressing (TLS segments) which is a host-side concept incompatible with CUDA's execution model. A managed variable must have a single global address visible to all threads on both host and device.
- Cannot be a local variable or reference type: Managed variables require runtime registration with the CUDA driver during module loading. Local variables are stack-allocated with lifetimes that cannot be tracked by the runtime. References cannot cross address spaces -- a reference to a managed variable on the host would hold a CPU virtual address that is meaningless on the device.
- Requires compute_30+: Managed memory is supported only on Kepler (compute capability 3.0) and later GPUs. It builds on Unified Virtual Addressing (UVA), under which host and device share one virtual address space; without a shared address space, transparent page migration behind a single pointer is impossible.
- Incompatible with __grid_constant__: Grid-constant parameters are loaded into constant memory at kernel launch. A managed variable's value is determined by its current page state, which can change between kernel launches. The two semantics are contradictory.
Attribute Application: apply_nv_managed_attr
sub_40E0D0 -- Full Pseudocode
The __managed__ attribute handler is the simplest of the four memory space handlers and demonstrates the complete validation template. It is called from apply_one_attribute (sub_413240) when the attribute kind byte is 'f' (102).
// sub_40E0D0 -- apply_nv_managed_attr (attribute.c:10523)
// a1: attribute node pointer (attribute_node_t*)
// a2: entity node pointer (entity_t*)
// a3: entity kind (uint8_t)
// returns: entity node pointer (passthrough)
entity_t* apply_nv_managed_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {
// ===== Gate: variables only =====
// Entity kind 7 = variable. Any other kind (function=11, type=6, etc.)
// is an internal error -- the dispatch table should never route
// __managed__ to a non-variable entity.
if (a3 != 7)
internal_error("attribute.c", 10523, "apply_nv_managed_attr", 0, 0);
// ===== Step 1: Set managed + device flags =====
// Save current memory space byte for later checks.
// Managed memory IS device global memory, so both flags must be set.
uint8_t old_space = a2->byte_148;
a2->byte_149 |= 0x01; // set __managed__ flag
a2->byte_148 = old_space | 1; // set __device__ flag
// ===== Step 2: Mutual exclusion -- shared + constant =====
// The expression ((x & 2) != 0) + ((x & 4) != 0) == 2 is true
// only when BOTH __shared__ (bit 1) and __constant__ (bit 2) are set.
// This catches an impossible three-way conflict, NOT managed+shared
// or managed+constant individually. The individual conflicts
// (__managed__ + __shared__, __managed__ + __constant__) are caught
// by the __grid_constant__ check or by subsequent declaration processing.
if (((old_space & 2) != 0) + ((old_space & 4) != 0) == 2)
emit_error(3481, a1->source_loc); // "conflicting CUDA memory spaces"
// ===== Step 3: Thread-local check =====
// Byte +161 bit 7 (sign bit when read as signed char) indicates
// thread_local storage duration. Managed variables must have
// static storage duration with a single global address.
if ((signed char)a2->byte_161 < 0)
emit_error(3482, a1->source_loc); // "CUDA memory space on thread_local"
// ===== Step 4: Local variable / reference type check =====
// Byte +81 bit 2 indicates the entity is declared in a local scope
// (block scope, function parameter, or reference type).
// Managed variables require file-scope lifetime for runtime registration.
if (a2->byte_81 & 0x04)
emit_error(3485, a1->source_loc); // "CUDA memory space on local/ref"
// ===== Step 5: __grid_constant__ conflict =====
// Byte +164 bit 2 is the __grid_constant__ flag on the parameter entity.
// If set, check whether this entity also has a conflicting memory space.
// The 16-bit word read at +148 with mask 0x0102 catches:
// byte +148 bit 1 (0x02) = __shared__
// byte +149 bit 0 (0x01, as 0x100 in word) = __managed__
// (Little-endian: word = byte_149 << 8 | byte_148)
if ((a2->byte_164 & 0x04) && (*(uint16_t*)(a2 + 148) & 0x0102)) {
// Build error message: select most restrictive space name
uint8_t space = a2->byte_148;
const char* name = "__constant__";
if (!(space & 0x04)) {
name = "__managed__";
if (!(a2->byte_149 & 0x01)) {
name = "__shared__";
if (!(space & 0x02)) {
name = "__device__";
if (!(space & 0x01))
name = "";
}
}
}
emit_error_with_name(3577, a1->source_loc, name);
// "memory space %s incompatible with __grid_constant__"
}
return a2;
}
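The Step 2 expression is easier to read once it is compared against a single-mask form. The sketch below shows both spellings; the function names are hypothetical, and the claim of equivalence is limited to the two expressions as written here.

```c
#include <stdint.h>

/* Decompiled form: add two 0/1 truth values and compare the sum with 2. */
int both_shared_and_constant_decompiled(uint8_t space) {
    return ((space & 0x02) != 0) + ((space & 0x04) != 0) == 2;
}

/* Equivalent single-mask form: both bits of 0x06 must be set. */
int both_shared_and_constant_masked(uint8_t space) {
    return (space & 0x06) == 0x06;
}
```

Neither form looks at the __managed__ bit, which is why error 3481 in this handler cannot fire for __managed__ combined with only one of __shared__ or __constant__.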
Entity Node Fields Modified
| Offset | Field | Bits Set | Meaning |
|---|---|---|---|
| +148 | memory_space | bit 0 (0x01) | __device__ -- variable lives in device global memory |
| +149 | extended_space | bit 0 (0x01) | __managed__ -- variable is in unified memory |
Entity Node Fields Read (Validation)
| Offset | Field | Mask | Meaning |
|---|---|---|---|
| +148 | memory_space | 0x02 | __shared__ flag (mutual exclusion check) |
| +148 | memory_space | 0x04 | __constant__ flag (mutual exclusion check) |
| +161 | storage_flags | bit 7 (sign) | thread_local storage duration |
| +81 | scope_flags | 0x04 | Local scope / reference type indicator |
| +164 | cuda_flags | 0x04 | __grid_constant__ parameter flag |
| +148:149 | space_word | 0x0102 | Combined __shared__ OR __managed__ (grid_constant conflict) |
Comparison with apply_nv_device_attr (sub_40EB80)
The __device__ handler's variable path (entity kind 7) is structurally identical to apply_nv_managed_attr, minus the byte_149 |= 1 step. Both handlers:
- Set byte_148 |= 0x01 (device memory space)
- Check error 3481 (shared + constant mutual exclusion)
- Check error 3482 (thread_local)
- Check error 3485 (local variable)
- Check error 3577 (grid_constant conflict)
The only difference: __managed__ additionally sets byte_149 |= 0x01. The __device__ handler also has a function path (kind 11) for setting execution space bits -- __managed__ has no function path because managed memory is a storage concept, not an execution concept.
Architecture Gating
The compute_30 requirement for __managed__ is enforced during declaration processing, not in the attribute handler itself. The attribute handler (sub_40E0D0) sets the bitfield flags unconditionally; the architecture check happens later when the declaration is fully processed.
Two diagnostic tags cover managed architecture gating:
| Tag | Message | Condition |
|---|---|---|
| unsupported_arch_for_managed_capability | __managed__ variables require architecture compute_30 or higher | dword_126E4A8 < 30 |
| unsupported_configuration_for_managed_capability | __managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system) | Configuration-specific flag check |
The architecture check uses the global dword_126E4A8, which stores the SM version number set by the --target flag. The value 30 corresponds to sm_30 (Kepler), the minimum architecture on which managed memory is supported. The configuration check covers edge cases like 32-bit compilation mode or unsupported operating systems where the CUDA runtime's managed memory subsystem is unavailable.
Managed Runtime Boilerplate
Every .int.c file emitted by cudafe++ contains a block of managed runtime initialization code, emitted unconditionally by sub_489000 (process_file_scope_entities) at line 218. This block is emitted regardless of whether the translation unit contains any __managed__ variables -- the static guard flag ensures zero overhead when no managed variables exist.
Static Declarations
Four declarations are emitted as a single string literal from 0x83AAC8 (243 bytes):
// Emitted verbatim by sub_489000, line 218
static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) {
__nv_fatbinhandle_for_managed_rt = in;
}
static char __nv_init_managed_rt_with_module(void **);
Each symbol serves a specific role in the initialization chain:
| Symbol | Type | Role |
|---|---|---|
| __nv_inited_managed_rt | static char | Guard flag: 0 = uninitialized, nonzero = initialized |
| __nv_fatbinhandle_for_managed_rt | static void** | Cached fatbinary handle, populated during __cudaRegisterFatBinary |
| __nv_save_fatbinhandle_for_managed_rt | static void(void**) | Callback that stores the fatbin handle -- called at program startup |
| __nv_init_managed_rt_with_module | static char(void**) | Forward declaration -- defined later by crt/host_runtime.h |
The forward declaration of __nv_init_managed_rt_with_module is critical: this function is provided by the CUDA runtime headers and performs the actual cudaRegisterManagedVariable calls. By forward-declaring it here, the managed runtime boilerplate can reference it before the runtime header is #included later in the .int.c file.
Lazy Initialization Function
Emitted immediately after the static block (string at 0x83ABC0, 210 bytes):
// sub_489000, lines 221-224
// Conditional prefix:
if (dword_106BF6C) // alternative host compiler mode
emit("__attribute__((unused)) ");
// Function body:
static inline void __nv_init_managed_rt(void) {
__nv_inited_managed_rt = (
__nv_inited_managed_rt
? __nv_inited_managed_rt
: __nv_init_managed_rt_with_module(
__nv_fatbinhandle_for_managed_rt)
);
}
The ternary is a lazy-init idiom. On first call, __nv_inited_managed_rt is 0 (falsy), so the false branch executes __nv_init_managed_rt_with_module, which registers all managed variables in the translation unit and returns nonzero. The result is stored back into __nv_inited_managed_rt, so subsequent calls short-circuit through the true branch and return the existing nonzero value without re-initializing.
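The idiom can be exercised outside the CUDA toolchain with a stub in place of the runtime call. In this sketch every name is a hypothetical stand-in (the emitted symbols use reserved __nv_ identifiers), and the stub merely counts invocations instead of registering anything.

```c
/* Hypothetical stand-ins for the emitted symbols. */
static char   inited_rt = 0;
static void **saved_handle = 0;
static int    init_calls = 0;

/* Stub for the function that crt/host_runtime.h would provide. */
static char init_rt_with_module_stub(void **handle) {
    (void)handle;
    ++init_calls;
    return 1;                  /* nonzero marks "initialized" */
}

/* Same ternary shape as the emitted wrapper. */
static void init_managed_rt(void) {
    inited_rt = (inited_rt ? inited_rt
                           : init_rt_with_module_stub(saved_handle));
}

/* Calls the wrapper n times and reports how often the stub has run. */
int run_lazy_init(int n) {
    for (int i = 0; i < n; ++i)
        init_managed_rt();
    return init_calls;
}
```

However many times the wrapper is called, the stub runs once, provided it returns nonzero. If the real initializer returned 0, the guard would stay clear and the call would be retried on the next access.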
The __attribute__((unused)) prefix is conditionally added when dword_106BF6C (alternative host compiler mode) is set. This suppresses -Wunused-function warnings on host compilers that may not see any call sites for this function if no managed variables exist in the translation unit.
Runtime Registration Sequence
The full initialization flow spans the compilation and runtime startup pipeline:
Compile time (cudafe++ emits into .int.c):
1. __nv_save_fatbinhandle_for_managed_rt() -- defined, stores fatbin handle
2. __nv_init_managed_rt_with_module() -- forward-declared only
3. __nv_init_managed_rt() -- defined, lazy init wrapper
4. #include "crt/host_runtime.h" -- provides _with_module() definition
Program startup:
5. __cudaRegisterFatBinary() calls __nv_save_fatbinhandle_for_managed_rt()
to cache the fatbin handle for this translation unit
First managed variable access:
6. Comma-operator wrapper calls __nv_init_managed_rt()
7. Guard flag is 0, so __nv_init_managed_rt_with_module() executes
8. __nv_init_managed_rt_with_module() calls cudaRegisterManagedVariable()
for every __managed__ variable in the translation unit
9. Guard flag set to nonzero, preventing re-initialization
Subsequent accesses:
10. Comma-operator wrapper calls __nv_init_managed_rt()
11. Guard flag is nonzero, ternary short-circuits, no runtime call
Host Access Transformation: The Comma-Operator Pattern
When cudafe++ generates the .int.c host-side code and encounters a reference to a __managed__ variable, it wraps the access in a comma-operator expression. This is the core mechanism that ensures the CUDA managed memory runtime is initialized before any managed variable is touched on the host.
Detection
Two backend emitter functions detect managed variables using the same 16-bit bitmask test:
// Used by both sub_4768F0 (gen_name_ref) and sub_484940 (gen_variable_name)
if ((*(_WORD*)(entity + 148) & 0x101) == 0x101)
In little-endian layout, the 16-bit word at offset 148 spans bytes +148 (low) and +149 (high). The mask 0x101 tests:
- Bit 0 of byte +148 (0x01): __device__ flag
- Bit 0 of byte +149 (0x100 in the word): __managed__ flag
Both bits are always set together by apply_nv_managed_attr, so this test is equivalent to "is this a managed variable?"
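A byte-order-independent way to express the same test is sketched below. The raw byte buffer and the names ENTITY_SIZE, SPACE_OFFSET, and is_managed_variable are assumptions for illustration; only offsets +148 and +149 come from the analysis above.

```c
#include <stdint.h>

/* Hypothetical entity buffer: only offsets +148 and +149 matter here. */
enum { ENTITY_SIZE = 256, SPACE_OFFSET = 148 };

/* Combines the two flag bytes the way a little-endian 16-bit load would,
 * regardless of the byte order of the machine running this code. */
static uint16_t read_space_word(const uint8_t *entity) {
    return (uint16_t)(entity[SPACE_OFFSET] | (entity[SPACE_OFFSET + 1] << 8));
}

/* Both the __device__ and __managed__ bits must be set. */
int is_managed_variable(const uint8_t *entity) {
    return (read_space_word(entity) & 0x0101) == 0x0101;
}
```

The equality comparison matters: a non-zero result alone would also accept a plain __device__ variable.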
Transformed Output
For a managed variable named managed_var, the emitter produces:
(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (managed_var)))
The prefix string lives at 0x839570 (65 bytes):
"(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), ("
After emitting the variable name, the suffix ))) closes the expression.
Why This Works: Anatomy of the Expression
Reading from inside out:
(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (managed_var)))
    ^--- ternary: lazy init guard ---------------------------^  ^-- value --^
  ^--- comma operator: init side-effect, then yield value -------------------^
^--- dereference: access the managed variable's storage ----------------------^
- Ternary __nv_inited_managed_rt ? (void)0: __nv_init_managed_rt() -- The guard flag is checked. If nonzero (already initialized), the expression evaluates to (void)0, which generates no code. If zero (first access), __nv_init_managed_rt() is called, which performs CUDA runtime registration and sets the guard flag to nonzero.
- Comma operator (init_expr, (managed_var)) -- The C comma operator evaluates its left operand for side effects only, discards the result, then evaluates and returns its right operand. This guarantees the initialization side-effect is sequenced before the variable access, per C/C++ sequencing rules (C11 6.5.17, C++17 [expr.comma]).
- Outer dereference *(...) -- The outer * dereferences the result. After runtime registration, the managed variable's symbol resolves to the unified memory pointer that the CUDA runtime allocated via cudaMallocManaged. The dereference yields the actual variable value.
The entire expression is parenthesized to be safely usable in any expression context -- assignments, function arguments, member access, etc.
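The wrapper shape can be tried in plain C by treating the host-side symbol as an ordinary pointer. Everything in this sketch is a stand-in: managed_storage plays the role of the unified memory allocation, managed_var the host-side pointer symbol, and MANAGED_ACCESS is a hypothetical macro that reproduces the emitted text.

```c
/* Hypothetical stand-ins for the emitted symbols. */
static int  managed_storage = 0;
static int *managed_var = &managed_storage;
static char inited_rt = 0;
static int  init_calls = 0;

static void init_managed_rt(void) {
    ++init_calls;
    inited_rt = 1;
}

/* Same shape as the emitted wrapper: guard, comma, then dereference. */
#define MANAGED_ACCESS(var) \
    (*( (inited_rt ? (void)0 : init_managed_rt()), (var)))

/* Writes then reads through the wrapper; returns the value read. */
int write_then_read(int value) {
    MANAGED_ACCESS(managed_var) = value;   /* the result is an lvalue */
    return MANAGED_ACCESS(managed_var);
}

int init_call_count(void) { return init_calls; }
```

Because the outermost operator is a dereference, the wrapped expression remains an lvalue, so the same text works on either side of an assignment.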
Two Emitter Paths
The access transformation is applied by two separate functions, covering different name resolution contexts:
sub_484940 (gen_variable_name, 52 lines) -- handles direct variable name emission. Simpler structure: check the 0x101 bitmask, emit prefix, emit the name (handling three sub-cases: thread-local via this, anonymous via sub_483A80, or regular via sub_472730), emit suffix.
// sub_484940 -- gen_variable_name (pseudocode)
void gen_variable_name(entity_t* a1) {
bool needs_suffix = false;
// Check: is this a __managed__ variable?
if ((*(uint16_t*)(a1 + 148) & 0x101) == 0x101) {
needs_suffix = true;
emit("(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (");
}
// Emit variable name (three cases)
if (a1->byte_163 & 0x80)
emit("this"); // thread-local proxy
else if (a1->byte_165 & 0x04)
emit_anonymous_name(a1); // compiler-generated name
else
gen_expression_or_name(a1, 7); // regular name emission
if (needs_suffix)
emit(")))");
}
sub_4768F0 (gen_name_ref, 237 lines) -- handles qualified name references with :: scope resolution, template arguments, __super:: qualifier, and member access. The managed wrapping applies an additional gate: a3 == 7 (entity is a variable) AND !v7 (the fourth parameter is zero, meaning no nested context that already handles initialization).
// sub_4768F0 -- gen_name_ref, managed wrapping (lines 160-163, 231-236)
int gen_name_ref(context_t* ctx, entity_t* entity, uint8_t kind, int nested) {
bool needs_suffix = false;
if (!nested && kind == 7
&& (*(uint16_t*)(entity + 148) & 0x101) == 0x101) {
needs_suffix = true;
emit("(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (");
}
// ... 200+ lines of qualified name emission ...
// handles ::, template<>, __super::, member access paths
if (needs_suffix) {
emit(")))");
return 1;
}
// ...
}
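Both emitters follow the same prefix/name/suffix pattern, which the sketch below isolates. It writes into a caller-supplied buffer instead of the real output stream; emit_variable_ref and its parameters are hypothetical, while the prefix and suffix strings are the ones quoted above.

```c
#include <stdint.h>
#include <stdio.h>

#define MANAGED_PREFIX \
    "(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), ("
#define MANAGED_SUFFIX ")))"

/* Writes the host-side spelling of `name` into `out`. `space_word`
 * models the 16-bit value at entity+148. Returns snprintf's result. */
int emit_variable_ref(char *out, size_t out_size,
                      const char *name, uint16_t space_word) {
    if ((space_word & 0x0101) == 0x0101)
        return snprintf(out, out_size, "%s%s%s",
                        MANAGED_PREFIX, name, MANAGED_SUFFIX);
    return snprintf(out, out_size, "%s", name);
}
```

A non-managed variable passes through unchanged, so the transformation costs nothing for ordinary host variables.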
Host-Side Exemption in Cross-Space Checking
Managed variables receive a special exemption in the cross-space reference validation performed by record_symbol_reference_full (sub_72A650). When host code references a __device__ variable, the checker would normally emit error 3548. But managed variables are specifically exempted:
// Inlined in sub_72A650, cross-space variable reference check
if ((*(uint16_t*)(var_info + 148) & 0x0101) == 0x0101)
return; // managed variable -- host access is legal
This uses the same 0x0101 bitmask to detect managed variables. The exemption exists because managed variables are explicitly designed for host access -- that is their entire purpose. Without this exemption, every host-side __managed__ variable access would trigger a spurious "reference to device variable from host code" error.
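As a small illustration of why one 16-bit test covers both flag bytes, the sketch below composes the word from the bytes at +148 and +149 the way a little-endian 16-bit load would, then applies the same mask. The helper names space_word and is_managed are hypothetical and do not appear in the binary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Combine the two flag bytes as a little-endian 16-bit load at +148 would:
   byte +148 lands in the low half, byte +149 in the high half. */
static uint16_t space_word(uint8_t byte148, uint8_t byte149) {
    return (uint16_t)(byte148 | ((uint16_t)byte149 << 8));
}

/* Managed means bit 0 of +148 (__device__) and bit 0 of +149 are both set. */
static bool is_managed(uint16_t word) {
    return (word & 0x0101u) == 0x0101u;
}
```

Either bit alone is not enough: a plain __device__ variable has only the low bit, so it still reaches the error path.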
Managed Variables and constexpr
The declaration processor sub_4DEC90 (variable_declaration) validates CUDA memory-space attributes that are combined with constexpr. The dedicated error codes are:
| Error | Condition | Description |
|---|---|---|
| 3568 | __constant__ + constexpr | __constant__ combined with constexpr (prevents runtime initialization) |
| 3566 | __constant__ + constexpr + auto | __constant__ constexpr with auto deduction |
These errors target __constant__ specifically, but the validation cascade also generates the space name for managed variables when constructing error messages. The space name selection uses the same priority cascade as the attribute handler:
// sub_4DEC90, line ~357 -- selecting display name for error messages
const char* space_name = "__constant__";
if (!(space & 0x04)) {
space_name = "__managed__";
if (!(*(uint8_t*)(entity + 149) & 0x01)) {
space_name = "__host__ __device__" + 9; // pointer trick: "__device__"
if (space & 0x02)
space_name = "__shared__";
}
}
The string "__device__" is obtained by taking "__host__ __device__" and advancing by 9 bytes, skipping the "__host__ " prefix. This is a binary-level optimization where the compiler shares string storage between the combined form and the standalone "__device__" substring.
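The cascade and the shared-suffix trick can be checked with a small standalone function. This is a sketch under the assumptions stated in the pseudocode above (bit 0x04 is __constant__, bit 0x02 is __shared__, and bit 0 of byte +149 is the managed flag); the names combined and space_display_name are hypothetical.

```c
#include <stdint.h>

/* One string literal serves both the combined and the standalone form. */
static const char combined[] = "__host__ __device__";

static const char *space_display_name(uint8_t space, uint8_t byte149) {
    const char *name = "__constant__";
    if (!(space & 0x04)) {
        name = "__managed__";
        if (!(byte149 & 0x01)) {
            name = combined + 9;        /* skip the 9 bytes of "__host__ " */
            if (space & 0x02)
                name = "__shared__";
        }
    }
    return name;
}
```

The offset 9 is the length of "__host__" (8 characters) plus the separating space.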
Error 3648: External Linkage Warning
The post-definition check in sub_4DC200 (mark_defined_variable) warns when a device-accessible variable has external linkage. This affects managed variables because they always have the __device__ bit set:
// sub_4DC200 -- mark_defined_variable
// Condition for warning 3648:
if ((entity->byte_148 & 3) == 1 // __device__ set AND __shared__ NOT set
&& !is_compiler_generated(entity)
&& (entity->byte_80 & 0x70) != 0x10) // not anonymous
{
warning(3648, entity->source_loc);
}
The bit test (byte_148 & 3) == 1 checks that bit 0 (__device__) is set and bit 1 (__shared__) is NOT set. This catches:
- __device__ variables (0x01): yes, (0x01 & 3) == 1
- __managed__ variables (0x01 at +148, 0x01 at +149): yes, (0x01 & 3) == 1
- __device__ __constant__ (0x05): yes, (0x05 & 3) == 1
- __shared__ (0x02): no, (0x02 & 3) == 2
- __constant__ alone (0x04): no, (0x04 & 3) == 0
Managed variables therefore trigger this warning if they have external linkage and are not compiler-generated.
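The cases listed above can be verified mechanically. The following sketch isolates just the memory-space part of the warning condition; the function name passes_space_test is hypothetical, and the other two conditions (compiler-generated and the byte +80 check) are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when bit 0 (__device__) is set and bit 1 (__shared__) is clear. */
static bool passes_space_test(uint8_t byte148) {
    return (byte148 & 3) == 1;
}
```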
Diagnostic Summary
| Error | Phase | Condition | Message |
|---|---|---|---|
| 3481 | Attribute application | __shared__ AND __constant__ both set | Conflicting CUDA memory spaces |
| 3482 | Attribute application | thread_local storage duration | CUDA memory space on thread_local variable |
| 3485 | Attribute application | Local scope or reference type | CUDA memory space on local variable |
| 3577 | Attribute application | __grid_constant__ + managed/shared | Memory space incompatible with __grid_constant__ |
| 3648 | Post-definition | External linkage on device-accessible (non-shared) var | External linkage warning |
| (arch) | Declaration processing | dword_126E4A8 < 30 | __managed__ requires compute_30 or higher |
| (config) | Declaration processing | Unsupported OS/bitness | __managed__ not supported for this configuration |
Function Map
| Address | Name | Lines | Role |
|---|---|---|---|
| sub_40E0D0 | apply_nv_managed_attr | 47 | Attribute handler -- sets flags, validates |
| sub_40EB80 | apply_nv_device_attr | 100 | Device handler (variable path is structurally identical) |
| sub_413240 | apply_one_attribute | 585 | Dispatch -- routes kind 'f' to sub_40E0D0 |
| sub_489000 | process_file_scope_entities | 723 | Emits managed RT boilerplate into .int.c |
| sub_4768F0 | gen_name_ref | 237 | Access wrapper -- qualified name path |
| sub_484940 | gen_variable_name | 52 | Access wrapper -- direct name path |
| sub_4DEC90 | variable_declaration | 1098 | Declaration processing, constexpr/VLA checks |
| sub_4DC200 | mark_defined_variable | 26 | External linkage warning (error 3648) |
| sub_72A650 | record_symbol_reference_full | ~400 | Cross-space check with managed exemption |
| sub_6BC890 | nv_validate_cuda_attributes | 161 | Post-declaration cross-attribute validation |
Cross-References
- Memory Spaces -- bitfield encoding at entity +148/+149, all four memory space handlers
- Attribute System Overview -- dispatch table, attribute kind enum, application pipeline
- grid_constant -- error 3577 conflict with managed
- Architecture Feature Gating -- compute_30 gate for __managed__
- CUDA Runtime Boilerplate -- managed RT emission, lambda stubs, __cudaPushCallConfiguration
- Cross-Space Validation -- managed exemption in host access checks
- Entity Node Layout -- byte +148/+149 field definitions
Minor CUDA Attributes
cudafe++ defines several CUDA-specific attributes beyond the core execution-space, memory-space, and launch-configuration families. These attributes serve diverse purposes: optimization hints for the downstream compiler, parameter passing strategy selection, inline control that bridges the EDG front-end with cicc's code generation, and internal annotations for tile/cooperative infrastructure. Most are undocumented by NVIDIA. This page covers each in detail: what the attribute does, why it exists, how cudafe++ validates and stores it, and where the flags end up in the entity node.
Attribute Summary
| Kind | Hex | ASCII | Display Name | Category | Handler / Flag |
|---|---|---|---|---|---|
| 110 | 0x6E | 'n' | __nv_pure__ | Optimization | entity+183 (via IL propagation) |
| -- | -- | -- | __nv_register_params__ | ABI | sub_40B0A0 (38 lines), entity+183 bit 3 |
| -- | -- | -- | __forceinline__ | Inline control | entity+177 bit 4 |
| -- | -- | -- | __noinline__ | Inline control | sub_40F5F0 / sub_40F6F0, entity+179 bit 5, entity+180 bit 7 |
| -- | -- | -- | __inline_hint__ | Inline control | entity+179 bit 4 |
| 89 | 0x59 | 'Y' | __tile_global__ | Internal | (no handler observed) |
| 95 | 0x5F | '_' | __tile_builtin__ | Internal | (no handler observed) |
| 94 | 0x5E | '^' | __local_maxnreg__ | Launch config | sub_411090 (67 lines) |
| 108 | 0x6C | 'l' | __block_size__ | Launch config | sub_4109E0 (265 lines) |
Note: __nv_register_params__, __forceinline__, __noinline__, and __inline_hint__ do not have CUDA attribute kind codes. They are processed through different paths (EDG's standard attribute system, pragma-like registration at startup, or direct flag manipulation). Only __nv_pure__, __tile_global__, __tile_builtin__, __local_maxnreg__, and __block_size__ have dedicated CUDA kind bytes in the attribute_display_name switch table.
__nv_pure__ (Kind 0x6E = 'n')
Purpose
__nv_pure__ marks a function as having no observable side effects: given the same inputs, it always returns the same result and does not modify any state visible to the caller. This is an optimization hint for cicc (the CUDA compiler backend). A pure function can be:
- Common-subexpression eliminated (CSE): if f(x) appears twice in the same basic block, the second call can be replaced by the first call's result.
- Hoisted out of loops: if f(x) is invariant across loop iterations, it can be computed once before the loop (LICM -- loop-invariant code motion).
- Dead-code eliminated: if the result of f(x) is never used and the function has no side effects, the call can be removed entirely.
This is semantically equivalent to GCC's __attribute__((pure)) and LLVM's readonly function attribute, but expressed through NVIDIA's internal attribute system rather than the standard GNU attribute path. The choice of a separate internal attribute rather than reusing the GNU pure attribute reflects cudafe++'s design of routing all CUDA-specific semantics through its own kind-byte dispatch, keeping the NVIDIA optimization pipeline cleanly separated from EDG's standard attribute handling.
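To make the loop-hoisting case concrete, the sketch below shows a loop as written and the form a compiler is permitted to produce once the callee is known to be pure. Since __nv_pure__ is not exposed to user code, GCC/Clang's __attribute__((pure)) is used here purely as a stand-in; the function names are hypothetical, and the hoisted version is written by hand rather than taken from any compiler output.

```c
/* 'pure' stand-in: no side effects, result depends only on the argument. */
__attribute__((pure)) static int weight(int x) {
    return x * x + 1;
}

/* As written: weight(k) is called on every iteration. */
static int sum_naive(int n, int k) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += weight(k);
    return s;
}

/* What LICM may produce: the invariant call is computed once. */
static int sum_hoisted(int n, int k) {
    int w = weight(k);
    int s = 0;
    for (int i = 0; i < n; i++)
        s += w;
    return s;
}
```

The two functions are observably identical only because weight has no side effects; without the purity guarantee the compiler would have to keep every call.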
Binary Encoding
In the attribute kind enum, __nv_pure__ has kind value 110 (0x6E, ASCII 'n'). This is the highest kind value in the CUDA attribute range, added later than the original dense block (86--95).
The attribute_display_name switch (sub_40A310) maps it:
case 'n': return "__nv_pure__";
Application Behavior
In the apply_one_attribute constraint checker (sub_413240), kind 'n' has the following entry:
case 'n':
if (target_kind == 28) // target is a namespace-level entity
goto LABEL_21; // -> pass through (no per-entity modification)
goto LABEL_8; // -> attribute doesn't apply to this target
The handler does not modify any entity node fields directly. Unlike __host__ or __device__ which set bitmask flags at entity+182, __nv_pure__ propagates through the attribute node list itself. The attribute node with kind 0x6E remains attached to the entity's attribute chain and is consumed later by:
- The .int.c output generator (sub_5565E0 and related functions), which emits the __nv_pure__ attribute into the intermediate C output. In the IL code generator, kind 0x6E shares handling with __launch_bounds__ (0x5C):
case 0x5C:
case 0x6E:
a2->kind_field = 25; // IL node type for "function attribute"
sub_540560(0, 0, a2, a4, ...); // emit attribute to .int.c
break;
- cicc then reads the __nv_pure__ annotation from the .int.c output and applies the corresponding LLVM-level optimization attributes (readonly, willreturn, etc.) to the function in the NVVM IR.
Why It Exists
CUDA device code has optimization opportunities that GCC's pure does not capture. Device functions execute in a constrained environment (no system calls, no I/O, deterministic memory model), which makes purity easier to verify and more valuable to exploit. By providing __nv_pure__ as a separate internal attribute, NVIDIA can:
- Gate it behind CUDA mode (it only appears in device compilation flows).
- Attach it to internal runtime functions (__shfl_sync, math intrinsics, etc.) that NVIDIA knows are pure but that cannot carry GCC pure through the host compilation path.
- Avoid interactions with EDG's GNU attribute conflict checking, which has its own rules for pure vs const vs noreturn.
String Evidence
The string table contains exactly one reference to __nv_pure__ at address 0x829848, and a diagnostic tag nv_pure at 0x88cc08. The low reference count confirms this is an internal optimization attribute not exposed to user code through documented CUDA APIs.
__nv_register_params__ (Entity+183 bit 3)
Purpose
__nv_register_params__ tells cicc to pass kernel parameters in registers instead of through constant memory. By default, CUDA kernel parameters are loaded via ld.param instructions, which access a dedicated constant memory bank visible to the kernel launch mechanism. This works well when parameter counts are large (the constant memory bank is 4 KB per kernel), but for small parameter counts, passing values directly in registers avoids the latency of the constant memory load path.
Register parameter passing eliminates the constant-bank load latency (typically 4--8 cycles on modern architectures) and removes potential bank conflicts when multiple warps read the same parameters. The trade-off is that it consumes registers from the limited register file, which can reduce occupancy if the kernel already uses many registers.
Requirements
The attribute has four validation checks, enforced across two separate locations:
- Enablement flag (dword_106C028): a compiler internal flag that must be set. If not set, the handler emits error 3659 with the message "__nv_register_params__ support is not enabled". This flag is controlled by an internal nvcc option, not exposed to users.
- Architecture check (implied by error string): the string "__nv_register_params__ is only supported for compute_80 or later architecture" exists in the binary at 0x88cb80. This check is performed outside the apply handler, in the post-validation or downstream pipeline.
- Function type restriction (implied by error string): the string "__nv_register_params__ is not allowed on a %s function" at 0x88cbd0 shows that certain function types (likely __host__ or non-kernel functions) are rejected. The post-validation in sub_6BC890 checks: if entity+183 & 0x08 is set (register_params flag) but the execution space at entity+182 is __global__ (bit 6) or the function is not a pure __device__ function, it emits error 3661 with the relevant space name.
- Ellipsis (variadic) check: the apply handler (sub_40B0A0) traverses the function's return type chain to reach the prototype, then checks prototype+16 & 0x01 (the variadic flag). If set, it emits error 3662 with the message "__nv_register_params__ is not allowed on a function with ellipsis". Variadic functions cannot use register parameter passing because the parameter count is not known at compile time.
Apply Handler: sub_40B0A0 (38 lines)
// sub_40B0A0 -- apply_nv_register_params_attr (attribute.c:10537)
entity_t* apply_nv_register_params_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {
assert(a3 == 11); // functions only
bool enabled = true;
if (!dword_106C028) { // enablement flag not set
emit_error(7, 3659, a1->src_loc); // "support is not enabled"
enabled = false;
}
if (!a2) return a2;
// Walk return type chain to get function prototype
type_t* ret_type = a2->type_at_144;
if (!ret_type) goto set_flag;
while (ret_type->kind == 12) // skip cv-qualifier wrappers
ret_type = ret_type->next; // +144
// Check variadic flag
if (ret_type->prototype->flags_16 & 0x01) {
emit_error(7, 3662, a1->src_loc); // "not allowed on variadic"
return a2;
}
set_flag:
if (enabled)
a2->byte_183 |= 0x08; // set register_params bit
return a2;
}
The flag is stored at entity+183 bit 3 (0x08), the same byte that holds the cluster_dims intent flag (bit 6, 0x40). These two flags coexist without conflict because they serve orthogonal purposes.
Post-Declaration Validation
In sub_6BC890 (nv_validate_cuda_attributes), if entity+183 & 0x08 is set:
if (entity->byte_183 & 0x08) {
uint8_t es = entity->byte_182;
if (es & 0x40) { // __global__ function
emit_error(7, 3661, src, "__global__");
} else if ((es & 0x30) != 0x20) { // not pure __device__
emit_error(7, 3661, src, "__host__");
}
// else: pure __device__ function -- register_params is valid
}
This means __nv_register_params__ is only valid on __device__ functions (not __global__, not __host__, not __host__ __device__). Kernel functions (__global__) have their own parameter passing ABI dictated by the CUDA runtime, and host functions use the host ABI.
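The post-validation rule can be restated as a small classifier over the two flag bytes. This is a sketch that mirrors the pseudocode above, assuming (as that pseudocode implies) that 0x40 in byte +182 marks __global__ and that (byte & 0x30) == 0x20 marks a pure __device__ function; the function name register_params_conflict is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns NULL when register_params is acceptable (or not applied),
   otherwise the space name that error 3661 would be formatted with. */
static const char *register_params_conflict(uint8_t byte182, uint8_t byte183) {
    if (!(byte183 & 0x08))
        return NULL;                    /* attribute not present */
    if (byte182 & 0x40)
        return "__global__";            /* kernel entry point */
    if ((byte182 & 0x30) != 0x20)
        return "__host__";              /* anything other than pure __device__ */
    return NULL;                        /* pure __device__: valid */
}
```

Note that a __host__ __device__ function (0x30 under these assumptions) is reported with the "__host__" name, matching the single else-branch in the decompiled check.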
Registration at Startup
The function sub_6B5E50 (called during compiler initialization) registers __nv_register_params__ as a preprocessor macro expansion. It looks up the name via sub_734430, and if not found, creates a new macro definition node and registers it in the symbol table via sub_749600. The macro body is a 40-byte token sequence that, when expanded, produces the __attribute__((__nv_register_params__)) syntax that EDG's attribute parser can consume. This macro-based registration is why __nv_register_params__ does not have a CUDA kind byte -- it enters the attribute system through the standard GNU __attribute__ path, not through the CUDA attribute descriptor table.
The same startup function also registers __noinline__ with a similar mechanism and, conditionally, _Pragma.
Inline Control Attributes
cudafe++ provides three inline control attributes that interact with EDG's inline heuristic system. These attributes do not have CUDA kind bytes; they are processed through EDG's standard attribute infrastructure and NVIDIA's own flag-setting paths.
Entity Node Fields
entity+177 cuda_flags (byte):
bit 4 (0x10) = __forceinline__
entity+179 more_cuda_flags (byte):
bit 4 (0x10) = __inline_hint__
bit 5 (0x20) = __noinline__ (EDG internal noinline)
entity+180 function_attrs (byte):
bit 7 (0x80) = __noinline__ (GNU attribute form)
__forceinline__
__forceinline__ requests that the compiler always inline the function, overriding cost-based heuristics. It is stored at entity+177 bit 4 (0x10). This bit is checked during cross-execution-space call validation (sub_505720): a __forceinline__ function is treated as implicitly host-device, meaning it suppresses cross-space call errors. The logic in the cross-space checker:
if (entity->byte_177 & 0x10) // __forceinline__
// treat as implicitly __host__ __device__
This relaxation exists because __forceinline__ functions are expected to be inlined at the call site, so their execution space becomes the caller's execution space. There is no separate call to resolve, hence no cross-space violation.
In the .int.c output, __forceinline__ is emitted so that cicc can apply it during NVVM IR generation. cicc translates it to LLVM's alwaysinline attribute.
__noinline__
__noinline__ prevents the compiler from inlining a function, regardless of heuristics. It has two separate handlers because it can arrive through two syntactic paths:
Path 1: EDG internal form (sub_40F5F0, 51 lines)
This handler is invoked when __noinline__ is recognized as a CUDA-specific attribute (source_mode 3 or with the scoped-attribute bit set). It sets entity+179 |= 0x20. In C mode (dword_126EFB4 == 2), it additionally creates an ABI annotation node by calling sub_5E5130 and linking it to the function's prototype exception-spec chain at prototype+56. This ABI node carries flags 0x19 and signals to the code generator that the noinline directive should be preserved across compilation boundaries.
// sub_40F5F0 -- apply_noinline_attr (EDG internal path)
if (target_kind == 11) { // function
if (attr->kind) {
entity->byte_179 |= 0x20; // noinline flag
if (attr->source_mode == 3 && dword_126EFB4 == 2) {
// Create ABI annotation for C mode
extract_func_type(entity+144, &ft_out);
if (!ft_out->prototype->abi_info) {
abi_node_t* n = alloc_abi_node();
*n |= 0x19;
ft_out->prototype->abi_info = n;
}
}
}
return entity;
}
// else: emit error 1835 (wrong target) or 2470 (alignas context)
Path 2: GNU attribute form (sub_40F6F0, 37 lines)
This handler is invoked when __noinline__ arrives through the __attribute__((__noinline__)) GNU attribute path. It sets a different bit: entity+180 |= 0x80. This separation allows the compiler to distinguish between the CUDA-specific noinline directive and the GNU portable one, although in practice both prevent inlining.
Additionally, the handler calls sub_5CEE70(28, entity->attr_chain) to record the noinline directive for device-side compilation, but only when all four of the following hold: entity+176 bit 7 is set (static member), the attribute's source_mode indicates GNU/Clang syntax (source_mode == 2, or bit 4 of the attribute flags is set), entity+81 bit 2 is set (local), and entity+187 bit 0 is clear.
// sub_40F6F0 -- apply_noinline_attr (GNU form)
if (target_kind == 11) {
entity->byte_180 |= 0x80;
if ((signed char)entity->byte_176 < 0
&& (attr->source_mode == 2 || (attr->flags & 0x10))
&& (entity->byte_81 & 0x04)
&& !(entity->byte_187 & 0x01)) {
sub_5CEE70(28, entity->attr_chain);
}
} else {
// emit error 1835/2470 with appropriate severity
}
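The four-way gate in the GNU path is easier to read as a predicate. The sketch below restates it with explicit bit masks and also shows that the decompiler's (signed char) < 0 idiom is the same test as checking bit 7 (on the usual two's-complement targets). The function names are hypothetical; the offsets and masks are taken from the pseudocode above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decompiler idiom: a byte reinterpreted as signed is negative iff bit 7 is set. */
static bool sign_test(uint8_t b) {
    return (signed char)b < 0;
}

/* True when the GNU-path handler would call sub_5CEE70(28, ...). */
static bool records_device_noinline(uint8_t byte176, int source_mode,
                                    uint8_t attr_flags, uint8_t byte81,
                                    uint8_t byte187) {
    return (byte176 & 0x80) != 0
        && (source_mode == 2 || (attr_flags & 0x10) != 0)
        && (byte81 & 0x04) != 0
        && (byte187 & 0x01) == 0;
}
```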
__inline_hint__
__inline_hint__ is an internal NVIDIA attribute that provides a non-binding suggestion to the compiler's inlining heuristics. Unlike __forceinline__, which mandates inlining, __inline_hint__ merely biases the cost model in favor of inlining. It is stored at entity+179 bit 4 (0x10).
The attribute is registered through the same startup mechanism as __nv_register_params__ in sub_6B5E50, and its handler apply_nv_inline_hint_attr (referenced at address 0x40A999 within sub_40A8A0) sets the flag. The diagnostic tag nv_inline_hint exists at 0x82bf2f in the string table, suggesting diagnostic messages exist for conflicts.
Mutual Exclusion
__forceinline__ and __noinline__ are mutually exclusive. The diagnostic system includes 2 messages for inline hint conflicts (identified in the W053 error report). When both are applied to the same function, the compiler emits a diagnostic. However, __inline_hint__ can coexist with either, as it is merely a suggestion that the other directives override.
The mutual exclusion is enforced through the constraint checker in apply_one_attribute (sub_413240) and through post-validation checks. The constraint string for the 'r' (routine/function) constraint class includes property codes m (for member/constexpr) and v (for virtual), with + and - qualifiers controlling whether the attribute is allowed or forbidden. Error codes 1835--1843 and 1858--1871 cover the various conflict scenarios.
IL Output
In the .int.c output, the inline control attributes are emitted as standard GNU __attribute__ annotations:
// emitted for __noinline__:
__attribute__((noinline))
// emitted for __forceinline__:
__attribute__((always_inline))
cicc reads these and translates them to LLVM's noinline and alwaysinline function attributes respectively.
__tile_global__ (Kind 0x59 = 'Y')
Purpose
__tile_global__ is an internal execution-space attribute that appears in the attribute_display_name switch table but has no user-facing documentation. Its kind value (89, 'Y') places it in the original dense block of CUDA attributes between __global__ (88, 'X') and __shared__ (90, 'Z').
The name strongly suggests this attribute is related to NVIDIA's tile-based cooperative group infrastructure or the Tensor Memory Accelerator (TMA) programming model, where "tile global" would denote a function that operates on a tile of global memory. In the cooperative groups model, tiled partitions allow threads to cooperatively access contiguous memory regions, and a __tile_global__ function might be the kernel entry point for such a tiled execution pattern.
Binary Evidence
The attribute is defined in the kind enum (the attribute_display_name switch case), but no handler function has been identified in the binary. In the apply_one_attribute dispatcher (sub_413240), there is no case for kind 'Y'. This means:
- The attribute can be parsed and stored in an attribute node.
- It has a display name for diagnostics.
- It does not modify entity node fields through the standard apply pipeline.
This is consistent with the attribute being consumed downstream by cicc or another tool in the compilation pipeline, rather than requiring cudafe++ to perform validation beyond basic parsing. Alternatively, it may be a reserved placeholder for future functionality.
__tile_builtin__ (Kind 0x5F = '_')
Purpose
__tile_builtin__ is another internal attribute in the CUDA kind enum, with kind value 95 (0x5F, ASCII '_'). Its kind value is the last in the original dense block (86--95).
The name suggests this attribute marks functions that are tile-level builtins -- compiler intrinsics that implement tile-based operations. These would be functions like cooperative_groups::thread_block_tile<N>::shfl(), cooperative_groups::thread_block_tile<N>::ballot(), or TMA copy intrinsics, which are compiled by cudafe++ as ordinary function calls but need special handling by cicc for efficient code generation.
Binary Evidence
Like __tile_global__, __tile_builtin__ has no handler in the apply_one_attribute dispatcher. It appears only in the attribute_display_name switch table. The attribute node with kind 0x5F passes through cudafe++ without entity node modification and is consumed by the downstream compiler.
The pairing of __tile_global__ (Y) and __tile_builtin__ (_) suggests a two-part infrastructure:
- __tile_global__ marks kernel-level entry points for tiled execution.
- __tile_builtin__ marks the intrinsic operations available within that tiled execution context.
__local_maxnreg__ (Kind 0x5E = '^')
Purpose
__local_maxnreg__ sets a per-function register limit, as opposed to __maxnreg__ which is per-kernel. The distinction matters for __device__ helper functions called from kernels: __maxnreg__ can only be applied to __global__ functions, but __local_maxnreg__ can be applied to any device function. This allows fine-grained register pressure tuning at the function level without requiring the entire kernel to be constrained.
When cicc compiles a __device__ function with __local_maxnreg__, it sets the target register limit for that specific function during register allocation, potentially spilling more aggressively to local memory. The surrounding kernel can use a different register budget.
Apply Handler: sub_411090 (67 lines)
The handler is structurally identical to sub_410F70 (__maxnreg__), differing only in the offset within the launch config struct where it stores the value:
// sub_411090 -- apply_nv_local_maxnreg_attr
entity_t* apply_nv_local_maxnreg_attr(attr_node_t* a1, entity_t* entity, ...) {
// Allocate launch config struct if needed
if (!entity->launch_config)
entity->launch_config = allocate_launch_config(); // sub_5E52F0
// Skip if template-dependent argument
if (is_dependent_type(arg))
return entity;
// Validate: must be positive
if (const_expr_sign_compare(arg, 0) <= 0) { // sub_461980
emit_error(7, 3786, a1->src_loc); // non-positive value
return entity;
}
// Validate: must fit in int32
int64_t val = const_expr_get_value(arg); // sub_461640
if (val > INT32_MAX) {
emit_error(7, 3787, a1->src_loc); // value too large
return entity;
}
entity->launch_config->local_maxnreg = (int32_t)val; // offset +36
return entity;
}
Post-Validation Difference from __maxnreg__
In sub_6BC890, __maxnreg__ (stored at launch_config+32) is validated to require __global__ (error 3715: "__maxnreg__ is only valid on __global__ functions"). __local_maxnreg__ has no such check in post-validation. This is intentional: it is designed to work on __device__ functions as well. The post-validation function only checks the maxnreg field (offset +32) for the __global__ requirement; the local_maxnreg field (offset +36) is left unchecked.
Diagnostics
| Error | Message | Condition |
|---|---|---|
| 3786 | Non-positive __local_maxnreg__ value | const_expr_sign_compare(arg, 0) <= 0 |
| 3787 | __local_maxnreg__ value too large | Value exceeds int32 range |
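The two diagnostics above correspond to a simple range check on the constant argument. The following sketch reproduces that check in isolation; it assumes the argument has already been folded to a 64-bit integer, and the function and enumerator names are hypothetical.

```c
#include <stdint.h>

enum { REG_OK = 0, ERR_NON_POSITIVE = 3786, ERR_TOO_LARGE = 3787 };

/* Returns 0 and stores the limit, or the diagnostic number that would be
   emitted. On error *out is left untouched, as in the handler. */
static int check_local_maxnreg(int64_t val, int32_t *out) {
    if (val <= 0)
        return ERR_NON_POSITIVE;
    if (val > INT32_MAX)
        return ERR_TOO_LARGE;
    *out = (int32_t)val;
    return REG_OK;
}
```

The order of the checks matters only for reporting: a non-positive value is always diagnosed as 3786 and never reaches the range test.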
__block_size__ (Kind 0x6C = 'l')
Purpose
__block_size__ specifies the thread block dimensions (and optionally cluster dimensions) for a kernel at compile time. Unlike __launch_bounds__, which provides hints for the compiler's register allocator, __block_size__ declares the actual block geometry. This enables the compiler to optimize based on known block dimensions: unrolling loops by the block dimension, computing shared memory bank conflict patterns at compile time, and statically determining the number of warps.
Apply Handler: sub_4109E0 (265 lines)
This is the largest of the launch config attribute handlers. It accepts up to 6 arguments: three block dimensions (x, y, z) and three cluster dimensions (x, y, z).
// sub_4109E0 -- apply_nv_block_size_attr (simplified)
entity_t* apply_nv_block_size_attr(attr_node_t* a1, entity_t* entity, ...) {
// Allocate launch config struct if needed
if (!entity->launch_config)
entity->launch_config = allocate_launch_config();
launch_config_t* lc = entity->launch_config;
// Parse block dimensions (arguments 1-3)
// Each: validate positive, validate fits in int32
for (int i = 0; i < 3 && arg_exists; i++) {
if (const_expr_sign_compare(arg, 0) <= 0)
emit_error(7, 3788, src); // non-positive
else {
int64_t val = const_expr_get_value(arg);
if (val > INT32_MAX)
emit_error(7, 3789, src); // too large
else
lc->block_size[i] = (int32_t)val; // +40, +44, +48
}
}
// Parse optional cluster dimensions (arguments 4-6)
if (cluster_args_present) {
// Check for conflict with prior __cluster_dims__
if (lc->flags & 0x01)
emit_error(7, 3791, src); // conflict
for (int i = 0; i < 3 && arg_exists; i++) {
// same positive/range validation
lc->cluster_dim[i] = (int32_t)val; // +20, +24, +28
}
} else if (!(lc->flags & 0x01)) {
// Default cluster dims to (1,1,1) when no cluster args
// and no prior __cluster_dims__
lc->cluster_dim_x = 1;
lc->cluster_dim_y = 1;
lc->cluster_dim_z = 1;
}
lc->flags |= 0x02; // mark block_size_set
return entity;
}
Conflict with __cluster_dims__
__block_size__ and __cluster_dims__ have a bidirectional conflict. Each handler checks the other's flag:
- __block_size__ checks flags & 0x01 (cluster_dims_set) before writing cluster dims: error 3791.
- __cluster_dims__ checks flags & 0x02 (block_size_set) before writing cluster dims: error 3791.
However, neither handler returns early on this conflict. Both continue to set their respective flag bits, so after conflict the flags byte can be 0x03 (both bits set). The error diagnostic is emitted but the compilation continues.
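The non-short-circuiting behavior is easy to see in a reduced model of the two handlers. This sketch keeps only the flags byte and the conflict test; the struct, macro, and function names are hypothetical, and dimension parsing is omitted entirely.

```c
#include <stdint.h>

#define CLUSTER_DIMS_SET 0x01
#define BLOCK_SIZE_SET   0x02
#define ERR_CONFLICT     3791

typedef struct { uint8_t flags; } launch_cfg;   /* reduced launch config */

/* Model of the __cluster_dims__ handler: diagnose, then set its bit anyway. */
static int apply_cluster_dims(launch_cfg *lc) {
    int err = (lc->flags & BLOCK_SIZE_SET) ? ERR_CONFLICT : 0;
    lc->flags |= CLUSTER_DIMS_SET;
    return err;
}

/* Model of the __block_size__ handler: the conflict is only checked when
   cluster arguments (4-6) are present. */
static int apply_block_size(launch_cfg *lc, int has_cluster_args) {
    int err = (has_cluster_args && (lc->flags & CLUSTER_DIMS_SET))
                  ? ERR_CONFLICT : 0;
    lc->flags |= BLOCK_SIZE_SET;
    return err;
}
```

In either application order the second handler reports 3791 and the flags byte ends up as 0x03, so later passes cannot use the flags alone to tell which attribute was written first.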
Diagnostics
| Error | Message | Condition |
|---|---|---|
| 3788 | Non-positive __block_size__ dimension | const_expr_sign_compare(arg, 0) <= 0 |
| 3789 | __block_size__ dimension too large | Value exceeds int32 range |
| 3791 | Conflicting __cluster_dims__ and __block_size__ | Both attributes applied to same entity |
Global State and Registration
Startup Registration (sub_6B5E50)
The function sub_6B5E50 runs during compiler initialization and registers three names as preprocessor macro definitions:
- __nv_register_params__: looked up via sub_734430; if not found, creates a new macro via sub_749600 and associates it with a 40-byte token sequence. The token body encodes the magic value 8961 (0x2301) as a prefix, followed by attribute argument tokens. If the symbol already exists (the macro was predefined), it appends the token body to the existing definition's expansion via sub_6AC190.
- __noinline__: registered with the same mechanism. The token body contains the string "oinline))" as a suffix (the decompiled code shows strcpy((char*)(v11+20), "oinline))");), which reconstructs the full __attribute__((__noinline__)) expansion.
- _Pragma: conditionally registered if dword_106C0E0 is set. The _Pragma macro registration enables MSVC-compatible pragma handling in certain compilation modes.
Additionally, if Clang compatibility mode is active (dword_126EFA4 set, qword_126EF90 > 0x2BF1F, and specific extension flags are enabled), the function registers Arm SME attribute macros (__arm_in, __arm_inout, __arm_out, __arm_preserves, __arm_streaming, __arm_streaming_compatible). The threshold 0x2BF1F is 179,999 in decimal, which corresponds to Clang >= 18.0 if the version word is encoded as major*10000 + minor*100 + patch.
Entity Node Field Summary
entity+177 bit 4 (0x10): __forceinline__
entity+179 bit 4 (0x10): __inline_hint__
entity+179 bit 5 (0x20): __noinline__ (EDG path)
entity+180 bit 7 (0x80): __noinline__ (GNU path)
entity+181 bit 5 (0x20): __forceinline__ relaxation flag
entity+182 [byte]: execution space (see overview)
entity+183 bit 3 (0x08): __nv_register_params__
entity+183 bit 6 (0x40): __cluster_dims__ intent
entity+256 [pointer]: launch_config_t* (for __local_maxnreg__, __block_size__)
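For tooling that inspects raw entity nodes, the summary above translates into a small decoder. This is a sketch that reads only the single-bit flags from a byte buffer at the listed offsets; the struct and function names are hypothetical, and the buffer is assumed to be at least 184 bytes long.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool forceinline;          /* entity+177 bit 4 */
    bool inline_hint;          /* entity+179 bit 4 */
    bool noinline_edg;         /* entity+179 bit 5 */
    bool noinline_gnu;         /* entity+180 bit 7 */
    bool register_params;      /* entity+183 bit 3 */
    bool cluster_dims_intent;  /* entity+183 bit 6 */
} attr_flags;

/* Decode the minor-attribute bits from a raw entity image. */
static attr_flags decode_entity_flags(const uint8_t *entity) {
    attr_flags f;
    f.forceinline         = (entity[177] & 0x10) != 0;
    f.inline_hint         = (entity[179] & 0x10) != 0;
    f.noinline_edg        = (entity[179] & 0x20) != 0;
    f.noinline_gnu        = (entity[180] & 0x80) != 0;
    f.register_params     = (entity[183] & 0x08) != 0;
    f.cluster_dims_intent = (entity[183] & 0x40) != 0;
    return f;
}
```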
Function Map
| Address | Size | Identity | Source |
|---|---|---|---|
| sub_40A310 | 83 lines | attribute_display_name | attribute.c:1307 |
| sub_40A8A0 | 23 lines | apply_nv_inline_hint_attr (contains) | attribute.c |
| sub_40B0A0 | 38 lines | apply_nv_register_params_attr | attribute.c:10537 |
| sub_40F5F0 | 51 lines | apply_noinline_attr (EDG path) | attribute.c |
| sub_40F6F0 | 37 lines | apply_noinline_attr (GNU path) | attribute.c |
| sub_40F7B0 | 61 lines | apply_noinline_scoped_attr | attribute.c |
| sub_4109E0 | 265 lines | apply_nv_block_size_attr | attribute.c |
| sub_411090 | 67 lines | apply_nv_local_maxnreg_attr | attribute.c |
| sub_413240 | 585 lines | apply_one_attribute (dispatch) | attribute.c |
| sub_6B5E50 | 160 lines | Startup registration | nv_transforms.c adjacent |
| sub_6BC890 | 160 lines | nv_validate_cuda_attributes | nv_transforms.c |
Diagnostic Tag Index
| Error | Diagnostic Tag | Attribute |
|---|---|---|
| 3659 | register_params_not_enabled | __nv_register_params__ |
| 3661 | register_params_unsupported_function | __nv_register_params__ |
| 3662 | register_params_ellipsis_function | __nv_register_params__ |
| -- | register_params_unsupported_arch | __nv_register_params__ |
| 3786 | local_maxnreg_negative | __local_maxnreg__ |
| 3787 | local_maxnreg_too_large | __local_maxnreg__ |
| 3788 | block_size_must_be_positive | __block_size__ |
| 3789 | (block_size dimension overflow) | __block_size__ |
| 3791 | conflict_between_cluster_dim_and_block_size | __block_size__ / __cluster_dims__ |
| 1835 | (attribute on wrong target) | __noinline__ |
| 2470 | (attribute in alignas context) | __noinline__ |
Cross-References
- Attribute System Overview -- kind enum, descriptor table, application pipeline
- Launch Configuration Attributes -- shared launch_config_t struct, __launch_bounds__, __maxnreg__, __cluster_dims__
- __global__ Function Constraints -- post-validation checks in sub_6BC890
- Entity Node Layout -- entity+177, +179, +180, +182, +183 field definitions
- Cross-Space Validation -- __forceinline__ relaxation in cross-space calling
- Architecture Feature Gating -- __nv_register_params__ compute_80 requirement
Extended Lambda Overview
Extended lambdas are the most complex NVIDIA addition to the EDG frontend. Standard C++ lambdas produce closure classes with host linkage only -- they cannot appear in __global__ kernel launches or __device__ function calls because the closure type has no device-side instantiation. The --extended-lambda flag (dword_106BF38) enables a transformation pipeline that wraps each annotated lambda in a device-visible template struct, making the closure class callable across the host/device boundary.
Two wrapper types exist. __nv_dl_wrapper_t handles device-only lambdas (annotated __device__). __nv_hdl_wrapper_t handles host-device lambdas (annotated __host__ __device__). The wrappers are parameterized template structs that store captured variables as typed fields, providing the device compiler with a concrete, instantiatable type for each lambda's captures. The wrapper templates do not exist in any header file -- they are synthesized as raw C++ text and injected into the compilation stream by the backend code generator.
Key Facts
| Property | Value |
|---|---|
| Enable flag | dword_106BF38 (--extended-lambda / --expt-extended-lambda) |
| Source files | class_decl.c (scan), nv_transforms.c (emit), cp_gen_be.c (gen) |
| Device wrapper type | __nv_dl_wrapper_t<Tag, CapturedVarTypePack...> |
| Host-device wrapper type | __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, NeverThrows, Tag, OpFunc, CapturedVarTypePack...> |
| Device bitmap | unk_1286980 (128 bytes, 1024 bits) |
| Host-device bitmap | unk_1286900 (128 bytes, 1024 bits) |
| Max captures supported | 1024 per wrapper type |
| lambda_info allocator | sub_5E92A0 |
| Preamble injection marker | Type named __nv_lambda_preheader_injection |
End-to-End Flow
The extended lambda system spans the entire cudafe++ pipeline -- from parsing through backend emission. Five major functions form the chain:
FRONTEND (class_decl.c)            BACKEND (cp_gen_be.c + nv_transforms.c)
========================           ========================================
sub_447930  scan_lambda            sub_47ECC0  gen_template (dispatcher)
  |                                  |
  +-- detect annotations             +-- sees __nv_lambda_preheader_injection
  |   (bits at lambda+25)            |
  +-- validate constraints           +-- sub_4864F0  gen_type_decl
  |   (35+ error codes)              |     triggers preamble emission
  |                                  |
  +-- record capture count           +-- sub_6BCC20  nv_emit_lambda_preamble
      in bitmap                      |     emits ALL __nv_* templates
                                     |
                                     +-- sub_47B890  gen_lambda
                                           emits per-lambda wrapper call
Stage 1: scan_lambda (sub_447930, 2113 lines)
The frontend entry point for all lambda expressions. Called from the expression parser when it encounters [. For extended lambdas, this function performs three critical operations:
- Execution space detection -- Walks up the scope stack looking for scope_kind == 17 (function body). Reads execution space byte at offset +182: bit 4 = __device__, bit 5 = __host__. Sets can_be_host and can_be_device flags.
- Annotation processing -- Parses the __nv_parent specifier (NVIDIA extension for closure-to-parent linkage) and __host__ / __device__ attribute annotations on the lambda expression itself. Sets decision bits at lambda_info + 25.
- Validation -- When dword_106BF38 is set, validates that the lambda's execution space is compatible with its enclosing context. Emits errors 3592-3634 and 3689-3690 for violations. Records the capture count in the appropriate bitmap via sub_6BCBF0.
Stage 2: Annotation Detection (Decision Bits)
The scan_lambda function sets bits at lambda_info + 25 that control all downstream behavior:
| Bit | Mask | Meaning | Set when |
|---|---|---|---|
| bit 3 | 0x08 | Device lambda wrapper needed | Lambda has __device__ annotation |
| bit 4 | 0x10 | Host-device lambda wrapper needed | Lambda has __host__ __device__ |
| bit 5 | 0x20 | Has __nv_parent | __nv_parent pragma parsed in capture list |
Additional flags at lambda_info + 24:
| Bit | Mask | Meaning |
|---|---|---|
| bit 4 | 0x10 | Capture-default is = |
| bit 5 | 0x20 | Capture-default is & |
And at lambda_info + 25 lower bits:
| Bit | Mask | Meaning |
|---|---|---|
| bit 0 | 0x01 | Is generic lambda |
| bit 1 | 0x02 | Has __host__ execution space |
| bit 2 | 0x04 | Has __device__ execution space |
Stage 3: Preamble Trigger (sub_4864F0, gen_type_decl)
During backend code generation, sub_47ECC0 (the master source sequence dispatcher) encounters a type declaration whose name matches __nv_lambda_preheader_injection. This sentinel type is never used by user code -- it exists solely as a trigger. When matched:
- The backend emits #line 1 "nvcc_internal_extended_lambda_implementation".
- It calls sub_6BCC20 (nv_emit_lambda_preamble) to inject the entire __nv_* template library.
- It wraps the trigger type in #if 0 / #endif so it never reaches the host compiler.
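The sentinel handling can be modelled as a small text filter. This is a minimal sketch under stated assumptions: the function names, the std::function callback type, the placeholder preamble text, and the exact ordering of the three steps are all illustrative, not recovered from sub_4864F0.

```cpp
#include <functional>
#include <string>

using EmitSink = std::function<void(const char*)>;

// Stand-in for sub_6BCC20; the real function emits the whole template library.
inline void emit_preamble_stub(const EmitSink& emit) {
    emit("/* __nv_* template library would be emitted here */\n");
}

// Hypothetical model of the sentinel check. Returns true when the declaration
// was the preamble trigger; other declarations pass through unchanged.
inline bool handle_type_decl(const std::string& type_name,
                             const std::string& decl_text,
                             const EmitSink& emit) {
    if (type_name != "__nv_lambda_preheader_injection") {
        emit(decl_text.c_str());
        return false;
    }
    emit("#line 1 \"nvcc_internal_extended_lambda_implementation\"\n");
    emit_preamble_stub(emit);
    emit("#if 0\n");          // hide the sentinel from the host compiler
    emit(decl_text.c_str());
    emit("#endif\n");
    return true;
}
```

Because the trigger is an ordinary type declaration, the preamble lands wherever that declaration sits in the source sequence, which is what lets the frontend control the injection point.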
Stage 4: Preamble Emission (sub_6BCC20, 244 lines)
This is the single point where all CUDA lambda support templates enter the compilation. It takes a void(*emit)(const char*) callback and emits raw C++ source text. The exact emission order, verified against the decompiled binary, is:
1. __NV_LAMBDA_WRAPPER_HELPER macro, __nvdl_remove_ref (with T&, T&&, T(&)(Args...) specializations), and __nvdl_remove_const trait helpers
2. __nv_dl_tag template (device lambda tag type)
3. Array capture helpers via sub_6BC290 (__nv_lambda_array_wrapper primary + dimension 2-8 specializations, __nv_lambda_field_type primary + array/const-array specializations)
4. Primary __nv_dl_wrapper_t with static_assert + zero-capture __nv_dl_wrapper_t<Tag> specialization (emitted as a single string literal)
5. __nv_dl_trailing_return_tag definition + its zero-capture wrapper specialization with __builtin_unreachable() body (emitted as two consecutive string literals)
6. Device bitmap scan -- iterates unk_1286980 (1024 bits). For each set bit N > 0, calls sub_6BB790(N, emit) to generate two __nv_dl_wrapper_t specializations (standard tag + trailing-return tag) for N captures
7. __nv_hdl_helper class (anonymous namespace, with fp_copier, fp_deleter, fp_caller, fp_noobject_caller static members + out-of-line definitions)
8. Primary __nv_hdl_wrapper_t with static_assert
9. Host-device bitmap scan -- iterates unk_1286900 (1024 bits). For each set bit N (including 0), emits four wrapper specializations per N: sub_6BBB10(0, N) (non-mutable, HasFuncPtrConv=false), sub_6BBEE0(0, N) (mutable, HasFuncPtrConv=false), sub_6BBB10(1, N) (non-mutable, HasFuncPtrConv=true), sub_6BBEE0(1, N) (mutable, HasFuncPtrConv=true)
10. __nv_hdl_helper_trait_outer with const and non-const operator() specializations, plus conditionally (when dword_126E270 is set for C++17 noexcept-in-type-system) const noexcept and non-const noexcept specializations -- all inside the same struct, closed by \n};
11. __nv_hdl_create_wrapper_t factory
12. Type trait helpers: __nv_lambda_trait_remove_const, __nv_lambda_trait_remove_volatile, __nv_lambda_trait_remove_cv (composed from the first two)
13. __nv_extended_device_lambda_trait_helper + #define __nv_is_extended_device_lambda_closure_type(X) (emitted together in one string)
14. __nv_lambda_trait_remove_dl_wrapper (unwraps device lambda wrapper to get inner tag)
15. __nv_extended_device_lambda_with_trailing_return_trait_helper + #define __nv_is_extended_device_lambda_with_preserved_return_type(X) (emitted together)
16. __nv_extended_host_device_lambda_trait_helper + #define __nv_is_extended_host_device_lambda_closure_type(X) (emitted together)
Note: each SFINAE trait and its corresponding detection macro are emitted as a single a1() call in the decompiled code, not as separate steps. The device bitmap scan skips bit 0 (zero-capture handled by step 4's specialization), but the host-device bitmap scan processes bit 0 (zero-capture host-device wrappers require distinct HasFuncPtrConv specializations).
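The host-device scan differs from the device scan in two ways: bit 0 is processed, and each set bit produces four emitter calls. A minimal sketch of that call order follows, assuming a plain array of sixteen 64-bit words; the HdlCall record and function name are hypothetical and stand in for the real calls to sub_6BBB10 and sub_6BBEE0.

```cpp
#include <cstdint>
#include <vector>

// One recorded emitter call (hypothetical bookkeeping type).
struct HdlCall {
    bool mutable_variant;   // false -> sub_6BBB10 stand-in, true -> sub_6BBEE0 stand-in
    int has_func_ptr_conv;  // first argument passed to the emitter (0 or 1)
    unsigned captures;      // N, the capture count
};

// Walks all 1024 bits, including bit 0, and records the four calls made for
// each set bit in the order given in the emission list above.
inline std::vector<HdlCall> scan_host_device_bitmap(const std::uint64_t (&bitmap)[16]) {
    std::vector<HdlCall> calls;
    for (unsigned n = 0; n < 1024; ++n) {
        if (((bitmap[n >> 6] >> (n & 63)) & 1u) == 0) continue;
        calls.push_back({false, 0, n});  // non-mutable, HasFuncPtrConv=false
        calls.push_back({true, 0, n});   // mutable,     HasFuncPtrConv=false
        calls.push_back({false, 1, n});  // non-mutable, HasFuncPtrConv=true
        calls.push_back({true, 1, n});   // mutable,     HasFuncPtrConv=true
    }
    return calls;
}
```

A translation unit whose host-device lambdas use capture counts 0 and 2 would therefore yield eight specializations.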
Stage 5: Per-Lambda Wrapper Emission (sub_47B890, gen_lambda, 336 lines)
For each lambda expression in the translation unit, the backend emits the wrapper call. The decision depends on the bits at lambda_info + 25:
Device lambda (bit 3 set, byte[25] & 0x08):
__nv_dl_wrapper_t< /* closure type tag */ >(/* captured values */)
The original lambda body is wrapped in #if 0 / #endif so it is invisible to the host compiler. The device compiler sees the wrapper struct which provides the captured values as typed fields.
Host-device lambda (bit 4 set, byte[25] & 0x10):
__nv_hdl_create_wrapper_t<IsMutable, HasFuncPtrConv, Tag, CaptureTypes...>
::__nv_hdl_create_wrapper( /* lambda expression */, capture_args... )
The lambda expression is emitted inline as the first argument (binds to Lambda &&lam in the factory). The factory internally calls std::move(lam) when heap-allocating. Unlike the device lambda path, the original lambda body is NOT wrapped in #if 0 -- it must be visible to both host and device compilers.
Neither bit set (plain lambda or (byte[25] & 0x06) == 0x02):
Standard lambda emission with no wrapping. If (byte[25] & 0x06) == 0x02 (that is, __host__ execution space set and __device__ execution space clear), emits an empty body placeholder { } with the real body in #if 0 / #endif.
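The three cases above reduce to a small classification over the flag byte at lambda_info + 25. This is a minimal sketch, assuming the device bit is tested before the host-device bit; the enum and function names are hypothetical, and the test order is an assumption rather than a recovered fact.

```cpp
#include <cstdint>

// Hypothetical labels for the emission paths described above.
enum class LambdaEmission {
    DeviceWrapper,         // bit 3 (0x08): wrapper construction, body in #if 0
    HostDeviceWrapper,     // bit 4 (0x10): factory call, body stays visible
    EmptyBodyPlaceholder,  // (flags & 0x06) == 0x02: "{ }" plus body in #if 0
    Plain                  // ordinary lambda emission
};

inline LambdaEmission classify_lambda(std::uint8_t flags_byte_25) {
    if (flags_byte_25 & 0x08) return LambdaEmission::DeviceWrapper;
    if (flags_byte_25 & 0x10) return LambdaEmission::HostDeviceWrapper;
    // Parentheses matter: == binds tighter than & in C and C++.
    if ((flags_byte_25 & 0x06) == 0x02) return LambdaEmission::EmptyBodyPlaceholder;
    return LambdaEmission::Plain;
}
```

Without the parentheses, flags & 0x06 == 0x02 would evaluate 0x06 == 0x02 first and always mask with zero.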
Bitmap System
Rather than generating all 1024 possible capture-count specializations for each wrapper type, cudafe++ tracks which capture counts were actually used during frontend parsing. This is a critical compile-time optimization.
Bitmap Layout
unk_1286980 (device lambda bitmap):
128 bytes = 16 x uint64 = 1024 bits
Bit N set => __nv_dl_wrapper_t specialization for N captures is needed
unk_1286900 (host-device lambda bitmap):
128 bytes = 16 x uint64 = 1024 bits
Bit N set => __nv_hdl_wrapper_t specializations for N captures are needed
Bitmap Operations
| Function | Address | Operation |
|---|---|---|
| nv_reset_capture_bitmasks | sub_6BCBC0 | Zeroes both 128-byte bitmaps. Called before each translation unit. |
| nv_record_capture_count | sub_6BCBF0 | Sets bit capture_count in the appropriate bitmap. a1 == 0 targets device, a1 != 0 targets host-device. Implementation: result[a2 >> 6] |= 1LL << a2. |
| Scan in sub_6BCC20 | inline | Iterates each uint64 word, shifts right to test each bit, calls the wrapper emitter for each set bit. |
The scan loop in sub_6BCC20 processes 64 bits at a time:
    uint64_t *ptr = (uint64_t *)&unk_1286980;
    unsigned int idx = 0;
    do {
        uint64_t word = *ptr;
        unsigned int limit = idx + 64;
        do {
            if (idx != 0 && (word & 1))
                emit_device_lambda_wrapper(idx, callback);  // sub_6BB790
            ++idx;
            word >>= 1;
        } while (limit != idx);
        ++ptr;
    } while (limit != 1024);
Note that bit 0 never triggers a call to sub_6BB790 -- the zero-capture case is handled by the __nv_dl_wrapper_t<Tag> specialization that sub_6BCC20 emits unconditionally alongside the primary template.
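The three bitmap operations fit together as shown in the following self-contained model. It is a sketch, not the recovered code: the struct and function names are hypothetical, and the explicit `& 63` on the shift count is added here so the model does not depend on the hardware masking that the decompiled `1LL << a2` implicitly relies on.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical container for the two 128-byte bitmaps.
struct CaptureBitmaps {
    std::uint64_t device[16];       // models unk_1286980
    std::uint64_t host_device[16];  // models unk_1286900
};

// Models nv_reset_capture_bitmasks: zero both bitmaps.
inline void reset_capture_bitmaps(CaptureBitmaps& b) {
    std::memset(&b, 0, sizeof b);
}

// Models nv_record_capture_count: first argument selects the bitmap.
inline void record_capture_count(CaptureBitmaps& b, int is_host_device,
                                 unsigned count) {
    std::uint64_t* words = is_host_device ? b.host_device : b.device;
    words[count >> 6] |= std::uint64_t{1} << (count & 63);
}

// Models the device scan: returns the capture counts that would be passed to
// the specialization emitter. Bit 0 is skipped, as in the loop above.
inline std::vector<unsigned> device_counts_to_emit(const CaptureBitmaps& b) {
    std::vector<unsigned> out;
    for (unsigned idx = 0; idx < 1024; ++idx) {
        bool set = (b.device[idx >> 6] >> (idx & 63)) & 1u;
        if (idx != 0 && set) out.push_back(idx);
    }
    return out;
}
```

Recording counts 0, 1, and 70 for device lambdas yields emitter calls for 1 and 70 only, in ascending order.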
The __nv_parent Pragma
__nv_parent is an NVIDIA-specific capture-list extension that provides closure-to-parent class linkage. It appears in the lambda capture list as a special identifier:
auto lam = [__nv_parent = ParentClass, x, y]() __device__ { /* ... */ };
Processing in scan_lambda
During capture list parsing (Phase 3 of sub_447930, around line 584):
- The parser checks for a token matching the string "__nv_parent" at address 0x82e284.
- If found, calls sub_52FB70 to resolve the parent class by name lookup.
- Sets lambda_info + 25 |= 0x20 (bit 5 = has __nv_parent).
- Stores the resolved parent class pointer at lambda_info + 32.
- If __nv_parent is specified more than once, emits error 3590.
- If __nv_parent is specified without __device__, emits error 3634.
The __nv_parent class reference is used during device code generation to establish the relationship between the lambda's closure type and its enclosing class, which is necessary for the device compiler to properly resolve member accesses through the closure.
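The two diagnostics attached to __nv_parent can be summarized as a pair of independent checks. This is a minimal sketch with hypothetical names; whether the real parser reports both errors for the same lambda or stops at the first is not established here and is an assumption of the model.

```cpp
#include <vector>

// Hypothetical summary of what the capture-list parser has seen so far.
struct CaptureListSummary {
    unsigned nv_parent_count;    // how many times __nv_parent appeared
    bool has_device_annotation;  // lambda carries __device__
};

// Returns the error numbers described above for the given summary.
inline std::vector<int> check_nv_parent(const CaptureListSummary& s) {
    std::vector<int> errors;
    if (s.nv_parent_count > 1) errors.push_back(3590);  // specified more than once
    if (s.nv_parent_count > 0 && !s.has_device_annotation)
        errors.push_back(3634);                         // used without __device__
    return errors;
}
```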
lambda_info Structure
Allocated by sub_5E92A0. This is the per-lambda metadata node created during scan_lambda and consumed during backend generation.
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | captured_variable_list | Head of linked list of capture entries |
| +8 | 8 | closure_class_type_node | Pointer to the closure class type in the IL |
| +16 | 8 | call_operator_symbol | Pointer to the operator() routine entity |
| +24 | 1 | flags_byte_1 | bit 0 = has captures, bit 3 = __host__, bit 4 = __device__, bit 5 = has __nv_parent, bit 6 = is opaque, bit 7 = constexpr const |
| +25 | 1 | flags_byte_2 | bit 0 = is generic, bit 1 = __host__ exec space, bit 2 = __device__ exec space, bit 3 = device wrapper needed, bit 4 = host-device wrapper needed, bit 5 = has __nv_parent |
| +32 | 8 | __nv_parent_class | Parent class pointer (NVIDIA extension) |
| +40 | 4 | lambda_number | Unique lambda index within scope |
| +44 | 4 | source_location | Source position of lambda expression |
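The offsets in the table can be checked mechanically by mirroring them in a struct. This is a minimal sketch, assuming pointer fields are 8 bytes (modelled as 64-bit integers so the layout is host-independent); the struct name and the padding member are hypothetical, and the real node may extend beyond the documented fields.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the documented lambda_info fields.
struct LambdaInfoSketch {
    std::uint64_t captured_variable_list;   // +0
    std::uint64_t closure_class_type_node;  // +8
    std::uint64_t call_operator_symbol;     // +16
    std::uint8_t  flags_byte_1;             // +24
    std::uint8_t  flags_byte_2;             // +25
    std::uint8_t  undocumented_26[6];       // +26..+31, not described in the table
    std::uint64_t nv_parent_class;          // +32
    std::uint32_t lambda_number;            // +40
    std::uint32_t source_location;          // +44
};

static_assert(offsetof(LambdaInfoSketch, call_operator_symbol) == 16, "+16");
static_assert(offsetof(LambdaInfoSketch, flags_byte_1) == 24, "+24");
static_assert(offsetof(LambdaInfoSketch, flags_byte_2) == 25, "+25");
static_assert(offsetof(LambdaInfoSketch, nv_parent_class) == 32, "+32");
static_assert(offsetof(LambdaInfoSketch, lambda_number) == 40, "+40");
static_assert(offsetof(LambdaInfoSketch, source_location) == 44, "+44");
static_assert(sizeof(LambdaInfoSketch) == 48, "documented fields only");
```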
Key Functions
| Address | Name (recovered) | Source | Lines | Role |
|---|---|---|---|---|
| sub_447930 | scan_lambda | class_decl.c | 2113 | Frontend: parse lambda, validate constraints, record capture count |
| sub_42FE50 | scan_lambda_capture_list | class_decl.c | 524 | Frontend: parse [...] capture list, handle __nv_parent |
| sub_42EE00 | make_field_for_lambda_capture | class_decl.c | 551 | Frontend: create closure class fields for captures |
| sub_42D710 | scan_lambda_capture_list (inner) | class_decl.c | 1025 | Frontend: process individual capture entries |
| sub_42F910 | field_for_lambda_capture | class_decl.c | ~200 | Frontend: resolve capture field via hash lookup |
| sub_436DF0 | Lambda template decl helper | class_decl.c | 65 | Frontend: propagate execution space to call operator template |
| sub_6BCC20 | nv_emit_lambda_preamble | nv_transforms.c | 244 | Backend: emit ALL __nv_* template infrastructure |
| sub_6BB790 | emit_device_lambda_wrapper_specialization | nv_transforms.c | 191 | Backend: emit __nv_dl_wrapper_t<Tag, F1..FN> for N captures |
| sub_6BBB10 | emit_host_device_lambda_wrapper (const) | nv_transforms.c | 238 | Backend: emit __nv_hdl_wrapper_t non-mutable variant |
| sub_6BBEE0 | emit_host_device_lambda_wrapper (mutable) | nv_transforms.c | 236 | Backend: emit __nv_hdl_wrapper_t mutable variant |
| sub_6BC290 | emit_array_capture_helpers | nv_transforms.c | 183 | Backend: emit __nv_lambda_array_wrapper for dim 2-8 |
| sub_6BCBC0 | nv_reset_capture_bitmasks | nv_transforms.c | 9 | Init: zero both 128-byte bitmaps |
| sub_6BCBF0 | nv_record_capture_count | nv_transforms.c | 13 | Record: set bit in device or host-device bitmap |
| sub_6BCDD0 | nv_find_parent_lambda_function | nv_transforms.c | 33 | Query: find enclosing host/device function for nested lambda |
| sub_6BC680 | is_device_or_extended_device_lambda | nv_transforms.c | 16 | Query: test if entity qualifies as device lambda |
| sub_47B890 | gen_lambda | cp_gen_be.c | 336 | Backend: emit per-lambda wrapper construction call |
| sub_4864F0 | gen_type_decl | cp_gen_be.c | 751 | Backend: detect preamble trigger, invoke emission |
| sub_47ECC0 | gen_template (dispatcher) | cp_gen_be.c | 1917 | Backend: master source sequence dispatcher |
| sub_489000 | process_file_scope_entities | cp_gen_be.c | 723 | Backend: entry point, emits lambda macro defines in boilerplate |
Global State
| Variable | Address | Purpose |
|---|---|---|
| dword_106BF38 | 0x106BF38 | Extended lambda mode flag (--extended-lambda) |
| dword_106BF40 | 0x106BF40 | Lambda host-device mode flag |
| unk_1286980 | 0x1286980 | Device lambda capture-count bitmap (128 bytes) |
| unk_1286900 | 0x1286900 | Host-device lambda capture-count bitmap (128 bytes) |
| qword_12868F0 | 0x12868F0 | Entity-to-closure mapping hash table |
| dword_126E270 | 0x126E270 | C++17 noexcept-in-type-system flag (controls noexcept wrapper variants) |
| qword_E7FEC8 | 0xE7FEC8 | Lambda hash table (Robin Hood, 16 bytes/slot, 1024 entries) |
| ptr (E7FE40 area) | 0xE7FE40 | Red-black tree root for lambda numbering per source position |
| dword_E7FE48 | 0xE7FE48 | Red-black tree sentinel node |
| dword_E85700 | 0xE85700 | host_runtime.h already included flag |
| dword_106BDD8 | 0x106BDD8 | OptiX mode flag (triggers error 3689 on incompatible lambdas) |
Concrete End-to-End Example
Consider a user writing this CUDA code with --extended-lambda:
// user.cu
#include <cstdio>
__global__ void kernel(int *out) {
int scale = 2;
auto f = [=] __device__ (int x) { return x * scale; };
out[threadIdx.x] = f(threadIdx.x);
}
Here is the transformation at each stage.
Stage 1: scan_lambda detects the lambda
The frontend parser encounters [=] __device__ (int x) { ... }. sub_447930 runs:
- Finds __device__ annotation on the lambda expression.
- Sets lambda_info + 25 |= 0x08 (bit 3: device wrapper needed) and lambda_info + 25 |= 0x04 (bit 2: has __device__ exec space).
- Sets lambda_info + 24 |= 0x10 (bit 4: capture-default is =).
- Counts one capture (scale). Calls sub_6BCBF0(0, 1) to set bit 1 in the device bitmap unk_1286980.
- Creates a closure class (compiler-generated name like __lambda_17_16) with one field of type int for the captured scale.
Stage 2: Preamble injection
When the backend encounters the sentinel type __nv_lambda_preheader_injection, sub_6BCC20 emits the template library. Because bit 1 is set in the device bitmap, it calls sub_6BB790(1, emit) which generates a one-capture specialization:
template <typename Tag, typename F1>
struct __nv_dl_wrapper_t<Tag, F1> {
typename __nv_lambda_field_type<F1>::type f1;
__nv_dl_wrapper_t(Tag, F1 in1) : f1(in1) { }
template <typename...U1>
int operator()(U1...) { return 0; }
};
template <typename U, U func, typename Return, unsigned Id, typename F1>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, F1> {
typename __nv_lambda_field_type<F1>::type f1;
__nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>, F1 in1)
: f1(in1) { }
template <typename...U1>
Return operator()(U1...) { __builtin_unreachable(); }
};
Stage 3: Per-lambda wrapper emission
sub_47B890 (gen_lambda) reads byte[25] & 0x08 (device lambda flag is set) and emits the wrapper construction call. The lambda body is hidden from the host compiler:
// Output in .int.c (what the host compiler sees):
__nv_dl_wrapper_t< __nv_dl_tag<
__NV_LAMBDA_WRAPPER_HELPER(&__lambda_17_16::operator(), 0u)>,
int>(
__nv_dl_tag<
__NV_LAMBDA_WRAPPER_HELPER(&__lambda_17_16::operator(), 0u)>{},
scale)
#if 0
[=] __device__ (int x) { return x * scale; }
#endif
The __NV_LAMBDA_WRAPPER_HELPER(X, Y) macro expands to decltype(X), Y, so a single macro invocation supplies two consecutive template arguments of the tag: a type argument (decltype(X)) followed by a non-type argument (Y).
What each compiler sees
Host compiler sees a __nv_dl_wrapper_t<Tag, int> struct with field f1 holding the captured scale. The operator() returns int(0) (never actually called on host). The original lambda body is inside #if 0.
Device compiler sees the same wrapper struct but resolves the tag's encoded function pointer &__lambda_17_16::operator() to call the actual lambda body. The wrapper's f1 field provides the captured scale value.
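The host-side view can be reproduced with ordinary C++. The sketch below uses renamed stand-ins (the *_sketch names are hypothetical, chosen to avoid reserved identifiers), and the tag arguments follow the three-parameter tag definition rather than the exact macro invocation shown above; closure_sketch stands in for the compiler-generated closure class.

```cpp
// Stand-in for __nv_lambda_field_type (identity case only).
template <typename T>
struct field_type_sketch { typedef T type; };

// Stand-in for __nv_dl_tag<U, func, unsigned>.
template <typename U, U func, unsigned>
struct dl_tag_sketch { };

// Stand-in for the primary __nv_dl_wrapper_t: traps unexpected capture counts.
template <typename Tag, typename... CapturedVarTypePack>
struct dl_wrapper_sketch {
    static_assert(sizeof...(CapturedVarTypePack) == 0,
                  "unexpected number of captures");
};

// One-capture specialization, mirroring the text emitted for N = 1.
template <typename Tag, typename F1>
struct dl_wrapper_sketch<Tag, F1> {
    typename field_type_sketch<F1>::type f1;
    dl_wrapper_sketch(Tag, F1 in1) : f1(in1) { }
    template <typename... U1>
    int operator()(U1...) { return 0; }  // dummy body, result is never used
};

// Hypothetical closure class; only its operator() address matters for the tag.
struct closure_sketch {
    int operator()(int x) const { return x; }
};

typedef dl_tag_sketch<decltype(&closure_sketch::operator()),
                      &closure_sketch::operator(), 0u> tag_sketch;

// What the rewritten call site amounts to on the host: carry `scale` in f1.
inline dl_wrapper_sketch<tag_sketch, int> make_wrapper(int scale) {
    return dl_wrapper_sketch<tag_sketch, int>(tag_sketch{}, scale);
}
```

On the host, make_wrapper(2) stores 2 in f1, and invoking the wrapper returns the dummy 0 regardless of arguments.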
Architecture: Text Template Approach
NVIDIA's lambda support uses a raw text emission pattern rather than constructing AST nodes. The template infrastructure is generated as C++ source text strings, passed through a callback function:
emit("template <typename Tag, typename...CapturedVarTypePack>\n"
"struct __nv_dl_wrapper_t {\n"
"static_assert(sizeof...(CapturedVarTypePack) == 0,"
"\"nvcc internal error: unexpected number of captures!\");\n"
"};\n");
This text is emitted to the .int.c output file and subsequently parsed by the host compiler. The device compiler receives the same text through a parallel path. This design is architecturally simpler than building proper AST nodes for the wrapper templates, at the cost of the templates existing only as generated text rather than first-class IL entities.
The preamble injection point is controlled by a sentinel type declaration: when the backend encounters a type named __nv_lambda_preheader_injection, it emits the entire template library and wraps the sentinel in #if 0. This guarantees the templates appear exactly once, before any lambda expression that references them, regardless of declaration ordering in the user's source.
Related Pages
- Device Lambda Wrapper -- __nv_dl_wrapper_t template structure in detail
- Host-Device Lambda Wrapper -- __nv_hdl_wrapper_t type-erased design
- Capture Handling -- __nv_lambda_field_type, __nv_lambda_array_wrapper
- Preamble Injection -- sub_6BCC20 emission pipeline step by step
- Lambda Restrictions -- 35+ error categories and validation rules
Device Lambda Wrapper (__nv_dl_wrapper_t)
When a C++ lambda is annotated __device__ inside CUDA code compiled with --extended-lambda, the closure class that the frontend creates has host linkage only -- it cannot be instantiated on the device. The device lambda wrapper system solves this by replacing the lambda expression at the call site with a construction of __nv_dl_wrapper_t<Tag, F1, ..., FN>, a template struct whose type parameters encode the lambda's identity (via Tag) and whose fields store the captured variables in device-accessible storage. The wrapper struct has a dummy operator() that never executes real code on the device side -- its purpose is purely to carry captured state across the host/device boundary. The actual device-side call is dispatched through the tag type, which encodes a function pointer to the lambda's operator() as a non-type template parameter.
Two tag types exist. __nv_dl_tag is the standard tag for lambdas with auto-deduced return types. __nv_dl_trailing_return_tag handles lambdas with explicit trailing return types, preserving the user-specified return type through the wrapper. Both tag types carry the lambda's operator() function pointer and a unique ID as template parameters.
The wrapper template does not exist in any header file. It is synthesized as raw C++ text by sub_6BB790 (emit_device_lambda_wrapper_specialization) in nv_transforms.c and injected into the compilation stream during preamble emission. Only the capture counts actually used in the translation unit are emitted, controlled by a 1024-bit bitmap at unk_1286980.
Key Facts
| Property | Value |
|---|---|
| Wrapper type | __nv_dl_wrapper_t<Tag, CapturedVarTypePack...> |
| Standard tag | __nv_dl_tag<U, func, unsigned> |
| Trailing-return tag | __nv_dl_trailing_return_tag<U, func, Return, unsigned> |
| Specialization emitter | sub_6BB790 (emit_device_lambda_wrapper_specialization, 191 lines) |
| Per-lambda emission | sub_47B890 (gen_lambda, 336 lines, cp_gen_be.c) |
| Preamble master emitter | sub_6BCC20 (nv_emit_lambda_preamble, 244 lines) |
| Capture bitmap | unk_1286980 (128 bytes = 1024 bits, device lambda) |
| Bitmap setter | sub_6BCBF0 (nv_record_capture_count, 13 lines) |
| Max supported captures | 1024 |
| Source file | nv_transforms.c (specialization emitter), cp_gen_be.c (per-lambda call) |
| Field type trait | __nv_lambda_field_type<T> |
Primary Template and Zero-Capture Specialization
The primary template is a static_assert trap -- any instantiation with a non-zero variadic pack that was not explicitly specialized triggers a compilation error. The zero-capture specialization (Tag only, no F parameters) provides a trivial constructor and a dummy operator() returning 0.
This code is emitted verbatim as a single string literal from sub_6BCC20:
// Exact binary string (emitted as a single a1() call in sub_6BCC20):
template <typename Tag,typename...CapturedVarTypePack>
struct __nv_dl_wrapper_t {
static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures!");
};
template <typename Tag>
struct __nv_dl_wrapper_t<Tag> {
__nv_dl_wrapper_t(Tag) { }
template <typename...U1>
int operator()(U1...) { return 0; }
};
Note: no space after the comma in Tag,typename... and no indentation -- this is the literal text injected into the .int.c output. The primary template and the zero-capture specialization are emitted as a single string literal.
The primary template's static_assert acts as a safety net: if the frontend records a capture count of N but fails to emit the corresponding N-capture specialization, the host compiler will produce a diagnostic rather than silently generating broken code. The zero-capture specialization's operator() returns int(0) -- this value is never used at runtime because the device compiler dispatches through the tag's encoded function pointer, not through the wrapper's operator().
Tag Types
__nv_dl_tag
The standard device lambda tag. Three template parameters encode the lambda identity. Exact binary string:
template <typename U, U func, unsigned>
struct __nv_dl_tag { };
The string is "\ntemplate <typename U, U func, unsigned>\nstruct __nv_dl_tag { };\n" -- note the leading newline.
| Parameter | Role |
|---|---|
| U | Type of the lambda's operator() (deduced via decltype) |
| func | Non-type template parameter: pointer to the lambda's operator() |
| unsigned | Unnamed parameter: unique ID disambiguating lambdas with identical operator types |
The __NV_LAMBDA_WRAPPER_HELPER(X, Y) macro (emitted at preamble start) expands to decltype(X), Y, providing the U, func pair from a single expression. The full macro and helper text emitted as the first a1() call:
#define __NV_LAMBDA_WRAPPER_HELPER(X, Y) decltype(X), Y
template <typename T>
struct __nvdl_remove_ref { typedef T type; };
template<typename T>
struct __nvdl_remove_ref<T&> { typedef T type; };
template<typename T>
struct __nvdl_remove_ref<T&&> { typedef T type; };
template <typename T, typename... Args>
struct __nvdl_remove_ref<T(&)(Args...)> {
typedef T(*type)(Args...);
};
template <typename T>
struct __nvdl_remove_const { typedef T type; };
template <typename T>
struct __nvdl_remove_const<T const> { typedef T type; };
The __nvdl_remove_ref specialization for function references (T(&)(Args...)) is notable: it converts a function reference type to a function pointer type (T(*)(Args...)). This handles the case where a lambda captures a function by reference -- the wrapper field needs a copyable function pointer, not a reference.
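The effect of the function-reference specialization can be verified with static assertions. The sketch below copies the trait under a different name (remove_ref_sketch, a hypothetical stand-in, to avoid reserved identifiers); twice and call_through_field are illustrative helpers that do not appear in the binary.

```cpp
#include <type_traits>

template <typename T>
struct remove_ref_sketch { typedef T type; };
template <typename T>
struct remove_ref_sketch<T&> { typedef T type; };
template <typename T>
struct remove_ref_sketch<T&&> { typedef T type; };
// Function references map to function pointers, not to bare function types.
template <typename T, typename... Args>
struct remove_ref_sketch<T(&)(Args...)> {
    typedef T (*type)(Args...);
};

static_assert(std::is_same<remove_ref_sketch<int&>::type, int>::value, "T&");
static_assert(std::is_same<remove_ref_sketch<int&&>::type, int>::value, "T&&");
static_assert(std::is_same<remove_ref_sketch<int (&)(float)>::type,
                           int (*)(float)>::value,
              "function reference becomes function pointer");

inline int twice(int x) { return 2 * x; }

// A field of the mapped type is copyable and still callable.
inline int call_through_field(int (&fn)(int), int arg) {
    remove_ref_sketch<int (&)(int)>::type field = fn;  // decays to a pointer
    return field(arg);
}
```

Without the extra specialization, the T& case would yield a bare function type, which cannot be declared as a data member.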
__nv_dl_trailing_return_tag
For lambdas with explicit trailing return types (-> ReturnType), a separate tag preserves the return type:
template <typename U, U func, typename Return, unsigned>
struct __nv_dl_trailing_return_tag { };
The additional Return parameter carries the user-specified return type. This is necessary because the wrapper's operator() must return this type rather than int, and the body uses __builtin_unreachable() to satisfy the compiler without generating actual return-value code.
Trailing-Return Zero-Capture Specialization
The zero-capture variant for trailing-return lambdas uses __builtin_unreachable() instead of return 0. The exact binary text (emitted as two consecutive a1() calls):
template <typename U, U func, typename Return, unsigned>
struct __nv_dl_trailing_return_tag { };
template <typename U, U func, typename Return, unsigned Id>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id> > {
__nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>) { }
template <typename...U1> Return operator()(U1...) { __builtin_unreachable(); }
};
Note: the __nv_dl_trailing_return_tag definition and its zero-capture wrapper specialization are emitted together (two strings in immediate succession: the first ends at { before __builtin_unreachable, the second contains __builtin_unreachable(); }\n}; \n\n -- note the trailing space before the newlines).
The __builtin_unreachable() tells the compiler this code path is never taken, so no return value needs to be materialized. This is safe because the wrapper's operator() is never called on the device side -- the device compiler resolves the call through the tag's encoded function pointer directly.
Per-Capture-Count Specialization Generator (sub_6BB790)
The function sub_6BB790 generates partial specializations of __nv_dl_wrapper_t for a specific capture count N. It takes two arguments: the capture count (unsigned int a1) and an emit callback (void(*a2)(const char*)). For each N, it emits two struct specializations: one for __nv_dl_tag and one for __nv_dl_trailing_return_tag.
Generated Template Structure (N captures)
For a lambda capturing N variables, sub_6BB790(N, emit) produces:
// Standard tag specialization
template <typename Tag, typename F1, typename F2, ..., typename FN>
struct __nv_dl_wrapper_t<Tag, F1, F2, ..., FN> {
typename __nv_lambda_field_type<F1>::type f1;
typename __nv_lambda_field_type<F2>::type f2;
...
typename __nv_lambda_field_type<FN>::type fN;
__nv_dl_wrapper_t(Tag, F1 in1, F2 in2, ..., FN inN)
: f1(in1), f2(in2), ..., fN(inN) { }
template <typename...U1>
int operator()(U1...) { return 0; }
};
// Trailing-return tag specialization
template <typename U, U func, typename Return, unsigned Id,
typename F1, typename F2, ..., typename FN>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>,
F1, F2, ..., FN> {
typename __nv_lambda_field_type<F1>::type f1;
typename __nv_lambda_field_type<F2>::type f2;
...
typename __nv_lambda_field_type<FN>::type fN;
__nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>,
F1 in1, F2 in2, ..., FN inN)
: f1(in1), f2(in2), ..., fN(inN) { }
template <typename...U1>
Return operator()(U1...) { __builtin_unreachable(); }
};
__nv_lambda_field_type Indirection
Each field is declared as typename __nv_lambda_field_type<Fi>::type fi rather than Fi fi. This indirection allows the lambda infrastructure to intercept array types (which cannot be passed by value to the wrapper constructor or copy-initialized as plain array members) and replace them with __nv_lambda_array_wrapper instances that perform element-by-element copying. The primary template is an identity transform:
template <typename T>
struct __nv_lambda_field_type {
typedef T type;
};
Specializations for array types (emitted by sub_6BC290) map T[D1]...[DN] to __nv_lambda_array_wrapper<T[D1]...[DN]>, and const T[D1]...[DN] to const __nv_lambda_array_wrapper<T[D1]...[DN]>.
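The array mapping can be illustrated with a one-dimensional model. This is a simplified sketch: the names field_type_model and array_wrapper_model are hypothetical, only single-dimension arrays are handled, and the real helpers emitted by sub_6BC290 cover more dimensions and may differ in detail.

```cpp
#include <cstddef>
#include <type_traits>

// Simplified stand-in for __nv_lambda_array_wrapper (1-D arrays only).
template <typename T>
struct array_wrapper_model;

template <typename T, std::size_t N>
struct array_wrapper_model<T[N]> {
    T arr[N];
    // Arrays cannot be copy-initialized as members, so copy element by element.
    array_wrapper_model(const T (&in)[N]) {
        for (std::size_t i = 0; i < N; ++i) arr[i] = in[i];
    }
};

// Stand-in for __nv_lambda_field_type: identity for non-array types.
template <typename T>
struct field_type_model { typedef T type; };

template <typename T, std::size_t N>
struct field_type_model<T[N]> { typedef array_wrapper_model<T[N]> type; };

template <typename T, std::size_t N>
struct field_type_model<const T[N]> { typedef const array_wrapper_model<T[N]> type; };

static_assert(std::is_same<field_type_model<int>::type, int>::value,
              "non-array types pass through unchanged");
static_assert(std::is_same<field_type_model<int[3]>::type,
                           array_wrapper_model<int[3]> >::value,
              "arrays are redirected to the wrapper");

// Builds a field of the mapped type from a real array and reads it back.
inline int sum_of_copy(const int (&in)[3]) {
    field_type_model<int[3]>::type field(in);
    return field.arr[0] + field.arr[1] + field.arr[2];
}
```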
Emission Mechanics
The decompiled sub_6BB790 reveals the emission is entirely sprintf-based: each C++ source fragment is formatted into a 1064-byte stack buffer (v29[1064]) and passed through the emit callback. The function has two major branches:
Branch 1: a1 == 0 (zero captures) -- Dead code. Falls through to emit __nv_dl_wrapper_t(Tag,) : with a trailing comma and empty initializer list, which would produce syntactically invalid C++. This path is never reached because the bitmap scan loop in sub_6BCC20 skips bit 0 (if (v2 && (v3 & 1) != 0)). The zero-capture case is handled by the primary template's __nv_dl_wrapper_t<Tag> specialization emitted unconditionally as a string literal in sub_6BCC20.
Branch 2: a1 > 0 (N captures) -- Generates the N-ary specializations through seven sequential loops:
Loop 1: Emit template parameter list ", typename F1, ..., typename FN"
Loop 2: Emit partial specialization ", F1, ..., FN"
Loop 3: Emit field declarations "typename __nv_lambda_field_type<Fi>::type fi;\n"
Loop 4: Emit constructor parameters "F1 in1, F2 in2, ..., FN inN"
Loop 5: Emit initializer list "f1(in1), f2(in2), ..., fN(inN)"
Emit operator() with "return 0"
Then repeat Loops 1-5 for __nv_dl_trailing_return_tag variant
Loop 6: Same parameter/field emission for trailing-return variant
Loop 7: Same initializer list for trailing-return variant
Emit operator() with __builtin_unreachable()
Each loop uses sprintf(v29, "...", index) for numbered parameters and a2(v29) to emit the fragment. The first element in each comma-separated list is handled specially (no leading comma), with subsequent elements prefixed by ", ".
Key string literals used by sub_6BB790 (extracted from binary):
| String | Purpose |
|---|---|
| "\ntemplate <typename Tag" | Opens template parameter list |
| ", typename F%u" | Each additional type parameter |
| ">\nstruct __nv_dl_wrapper_t<Tag" | Opens partial specialization |
| ", F%u" | Each type argument in specialization |
| "typename __nv_lambda_field_type<F%u>::type f%u;\n" | Field declaration |
| "__nv_dl_wrapper_t(Tag," | Constructor declaration (standard tag) |
| "F%u in%u" | Constructor parameter |
| "f%u(in%u)" | Initializer list entry |
| " { }\ntemplate <typename...U1>\nint operator()(U1...) { return 0; }\n};\n" | Standard operator() |
| "__nv_dl_trailing_return_tag<U, func, Return, Id>" | Trailing-return tag name |
| " { }\ntemplate <typename...U1>\nReturn operator()(U1...) " | Trailing-return operator() |
| "{ __builtin_unreachable(); }\n};\n\n" | Unreachable body |
Concrete Example: 2 Captures
For a lambda capturing two variables, sub_6BB790(2, emit) produces:
template <typename Tag, typename F1, typename F2>
struct __nv_dl_wrapper_t<Tag, F1, F2> {
typename __nv_lambda_field_type<F1>::type f1;
typename __nv_lambda_field_type<F2>::type f2;
__nv_dl_wrapper_t(Tag, F1 in1, F2 in2) : f1(in1), f2(in2) { }
template <typename...U1>
int operator()(U1...) { return 0; }
};
template <typename U, U func, typename Return, unsigned Id,
typename F1, typename F2>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>,
F1, F2> {
typename __nv_lambda_field_type<F1>::type f1;
typename __nv_lambda_field_type<F2>::type f2;
__nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>,
F1 in1, F2 in2) : f1(in1), f2(in2) { }
template <typename...U1>
Return operator()(U1...) { __builtin_unreachable(); }
};
Per-Lambda Wrapper Emission (sub_47B890)
The backend code generator sub_47B890 (gen_lambda in cp_gen_be.c) handles the per-lambda transformation at each lambda expression's usage site. It reads the decision bits at lambda_info + 25 and emits a wrapper construction call that replaces the lambda expression in the output .int.c file.
Device Lambda Path (bit 3 set: byte[25] & 0x08)
When the device lambda flag is set, the emitter produces a wrapper construction expression followed by a #if 0 block that hides the original lambda body from the host compiler:
// sub_47B890, decompiled lines 46-58
if ((v2 & 8) != 0) {
sub_467E50("__nv_dl_wrapper_t< "); // open wrapper type
sub_475820(a1); // emit tag type (closure class)
sub_46E640(a1); // emit capture type list
sub_467E50(">( "); // close template args, open ctor
sub_475820(a1); // emit tag constructor arg
sub_467E50("{} "); // empty-brace tag construction
sub_46E550(*a1); // emit captured value expressions
sub_467E50(") "); // close ctor call
sub_46BC80("#if 0"); // suppress original lambda
--dword_1065834; // adjust nesting depth
sub_467D60(); // newline
}
The generated output for a device lambda with two captures looks like:
__nv_dl_wrapper_t< __nv_dl_tag<decltype(&ClosureType::operator()),
&ClosureType::operator(), 0u>, int, float>(
__nv_dl_tag<decltype(&ClosureType::operator()),
&ClosureType::operator(), 0u>{}, x, y)
#if 0
// original lambda body hidden from host compiler
[x, y]() __device__ { /* ... */ }
#endif
The #if 0 suppression ensures the host compiler never attempts to parse the device lambda body, which may contain device-only intrinsics and constructs. The device compiler sees the wrapper struct and resolves the call through the tag type's encoded function pointer.
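To see why the replacement expression is valid host code, the following sketch compiles the same shape with ordinary C++. All names are hypothetical stand-ins (`dl_tag` for `__nv_dl_tag`, `dl_wrapper` for `__nv_dl_wrapper_t`, `enclosing` for whatever entity the real tag encodes), and the fields are declared directly rather than through `__nv_lambda_field_type`.

```cpp
// Hypothetical stand-ins for __nv_dl_tag and __nv_dl_wrapper_t.
template <typename U, U func, unsigned Id>
struct dl_tag {};

template <typename Tag, typename... Fs>
struct dl_wrapper;  // primary template, specialized per capture count

template <typename Tag, typename F1, typename F2>
struct dl_wrapper<Tag, F1, F2> {
    F1 f1;
    F2 f2;
    dl_wrapper(Tag, F1 in1, F2 in2) : f1(in1), f2(in2) {}
    template <typename... U1>
    int operator()(U1...) { return 0; }  // host-side placeholder body
};

// Stands in for the entity whose member address is encoded in the tag.
struct enclosing {
    void operator()() const {}
};

typedef dl_tag<decltype(&enclosing::operator()), &enclosing::operator(), 0u>
    site_tag;

// Roughly what the host compiler sees in place of the device lambda:
// a tag-typed wrapper built from the captured values, with no lambda body.
inline dl_wrapper<site_tag, int, float> make_wrapper(int x, float y) {
    return dl_wrapper<site_tag, int, float>(site_tag{}, x, y);
}
```

The host object carries only the captured values and a type that identifies the lambda site; the placeholder `operator()` never executes the device body.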
Body Suppression for Host-Only Pass (bit pattern (byte[25] & 0x06) == 0x02)
A separate suppression path handles lambdas where the body should not be compiled on the current pass. In this case, the emitter outputs an empty body { } and wraps the real body in #if 0 / #endif:
// sub_47B890, decompiled lines 290-306
if ((*(_BYTE *)(a1 + 25) & 6) == 2) {
sub_467D60(); // newline
sub_468190("{ }"); // empty body placeholder
sub_46BC80("#if 0"); // start suppression
--dword_1065834;
sub_467D60();
}
// ... emit original body under #if 0 ...
sub_47AEF0(body, 0); // emit body (invisible due to #if 0)
if ((*(_BYTE *)(a1 + 25) & 6) == 2) {
sub_46BC80("#endif"); // end suppression
--dword_1065834;
sub_467D60();
dword_1065820 = 0;
qword_1065828 = 0;
}
After the body emission completes, the device lambda path also emits a matching #endif to close the #if 0 block opened at the wrapper call:
// sub_47B890, decompiled lines 312-320
if ((v29 & 8) != 0) { // device lambda
sub_46BC80("#endif"); // close #if 0 from wrapper call
--dword_1065834;
sub_467D60();
dword_1065820 = 0;
qword_1065828 = 0;
}
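The bracketing logic of the body-suppression path reduces to a small model. This is a simplified sketch: `emit_lambda_body` is a hypothetical name, the output is accumulated in a `std::string` instead of going through the backend's emit routines, and the exact newline placement is an assumption.

```cpp
#include <string>

// Hypothetical model of the (byte[25] & 0x06) == 0x02 suppression bracket.
std::string emit_lambda_body(unsigned char flags25, const std::string &body) {
    std::string out;
    const bool suppress = (flags25 & 0x06) == 0x02;
    if (suppress) {
        out += "\n{ }\n";  // empty placeholder body seen by the compiler
        out += "#if 0\n";  // hide the real body
    }
    out += body;           // the body is always emitted, possibly hidden
    if (suppress) {
        out += "\n#endif\n";
    }
    return out;
}
```

Because the same condition guards both the opening and the closing directive, the two can never become unbalanced as long as the flag byte is not modified while the body is being emitted.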
Host-Device Lambda Path (bit 4 set: byte[25] & 0x10)
Host-device lambdas take a different path through __nv_hdl_create_wrapper_t rather than __nv_dl_wrapper_t. This is covered in the Host-Device Lambda Wrapper page.
Bitmap-Driven Emission
Only capture counts that were actually used during frontend parsing get specializations emitted. The scan loop in sub_6BCC20 processes the 128-byte bitmap at unk_1286980 as an array of 16 uint64_t values:
uint64_t *ptr = (uint64_t *)&unk_1286980;
unsigned int idx = 0;
do {
uint64_t word = *ptr;
unsigned int limit = idx + 64;
do {
if (idx != 0 && (word & 1)) // skip bit 0 (handled by primary)
sub_6BB790(idx, callback); // emit N-capture specialization
++idx;
word >>= 1;
} while (limit != idx);
++ptr;
} while (limit != 1024);
Bit 0 is skipped because the zero-capture case is already handled by the __nv_dl_wrapper_t<Tag> specialization that is emitted unconditionally as a string literal. For each remaining set bit N, sub_6BB790(N, emit) produces two structs (standard tag and trailing-return tag), so a translation unit using lambdas with 1, 3, and 5 captures emits exactly 6 wrapper struct specializations rather than the 2,046 (1,023 capture counts x 2 tag variants) that exhaustive generation would produce.
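The record-then-scan scheme can be sketched with a plain 1024-bit bitmap. The type and function names here are hypothetical, and the bounds check in `record_capture_count` is an assumption (the document does not say how out-of-range counts are treated).

```cpp
#include <cstdint>
#include <vector>

// 16 x 64 bits = 1024 bits, mirroring the 128-byte bitmap.
struct CaptureBitmap {
    std::uint64_t words[16] = {};
};

// Set bit n when a lambda with n captures is seen (bounds check is assumed).
inline void record_capture_count(CaptureBitmap &bm, unsigned n) {
    if (n < 1024) bm.words[n / 64] |= std::uint64_t(1) << (n % 64);
}

// Return the capture counts that need an N-ary specialization, skipping bit 0.
inline std::vector<unsigned> counts_to_emit(const CaptureBitmap &bm) {
    std::vector<unsigned> out;
    unsigned idx = 0;
    for (int w = 0; w < 16; ++w) {
        std::uint64_t word = bm.words[w];
        for (unsigned b = 0; b < 64; ++b, ++idx, word >>= 1) {
            if (idx != 0 && (word & 1)) out.push_back(idx);
        }
    }
    return out;
}
```

Recording 0, 1, 3, and 5 captures yields the list 1, 3, 5, which corresponds to the six structs in the example above (two tag variants per count).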
Detection Traits
After all wrapper specializations are emitted, sub_6BCC20 emits SFINAE trait templates that allow compile-time detection of device-lambda wrapper types. These are emitted AFTER the host-device wrapper infrastructure (steps 7-12 in the emission sequence), not immediately after the device bitmap scan. Each trait + its #define macro is emitted as a single a1() call:
// Emitted as one string (step 13 in sub_6BCC20):
template <typename T>
struct __nv_extended_device_lambda_trait_helper {
static const bool value = false;
};
template <typename T1, typename...Pack>
struct __nv_extended_device_lambda_trait_helper<__nv_dl_wrapper_t<T1, Pack...> > {
static const bool value = true;
};
#define __nv_is_extended_device_lambda_closure_type(X) __nv_extended_device_lambda_trait_helper< typename __nv_lambda_trait_remove_cv<X>::type>::value
Note: in the binary, the #define is a single line (no backslash continuation). The 2-space indentation on static const bool matches the binary exactly.
An unwrapper trait strips the wrapper to recover the inner tag type (step 14 in emission):
template<typename T> struct __nv_lambda_trait_remove_dl_wrapper { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_dl_wrapper< __nv_dl_wrapper_t<T> > { typedef T type; };
A separate trait detects whether a wrapper uses a trailing-return tag (step 15 in emission):
template <typename T>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper {
static const bool value = false;
};
template <typename U, U func, typename Return, unsigned Id, typename...Pack>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper<__nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, Pack...> > {
static const bool value = true;
};
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) __nv_extended_device_lambda_with_trailing_return_trait_helper< typename __nv_lambda_trait_remove_cv<X>::type >::value
Note: the emission order in sub_6BCC20 is: device trait (step 13), then __nv_lambda_trait_remove_dl_wrapper (step 14), then trailing-return trait (step 15), then host-device trait (step 16). The unwrapper appears between the two detection traits, not after both of them.
These traits and macros enable the CUDA runtime headers and device compiler to distinguish wrapped device lambdas from ordinary closure types at compile time, which is necessary for proper template argument deduction in kernel launch expressions.
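The detection pattern is ordinary partial specialization and can be exercised on the host. In this sketch every name is a hypothetical stand-in (`dl_wrapper`, `trailing_return_tag`, the two helper structs, and the `IS_DL_CLOSURE` macro), and `std::remove_cv` substitutes for `__nv_lambda_trait_remove_cv`.

```cpp
#include <type_traits>

template <typename U, U func, typename Return, unsigned Id>
struct trailing_return_tag {};

template <typename Tag, typename... Fs>
struct dl_wrapper {};

// Primary: not a wrapper.
template <typename T>
struct is_dl_wrapper_helper { static const bool value = false; };

// Any dl_wrapper instantiation matches, regardless of capture count.
template <typename T1, typename... Pack>
struct is_dl_wrapper_helper<dl_wrapper<T1, Pack...> > {
    static const bool value = true;
};

// Primary: no trailing-return tag.
template <typename T>
struct has_trailing_return_helper { static const bool value = false; };

// Matches only wrappers whose first argument is a trailing-return tag.
template <typename U, U func, typename Return, unsigned Id, typename... Pack>
struct has_trailing_return_helper<
    dl_wrapper<trailing_return_tag<U, func, Return, Id>, Pack...> > {
    static const bool value = true;
};

// cv-qualifiers are stripped first so that const wrappers are still detected.
#define IS_DL_CLOSURE(X) \
    is_dl_wrapper_helper<typename std::remove_cv<X>::type>::value

inline int sample_fn(int v) { return v; }  // illustrative function for the tag

typedef dl_wrapper<int, float> plain_w;
typedef dl_wrapper<trailing_return_tag<int (*)(int), &sample_fn, int, 0u>, float>
    trailing_w;
```

An ordinary closure type or any other class falls through to the primary templates and reports `false`.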
Function Map
| Address | Name (recovered) | Source | Lines | Role |
|---|---|---|---|---|
| sub_6BB790 | emit_device_lambda_wrapper_specialization | nv_transforms.c | 191 | Emit __nv_dl_wrapper_t<Tag, F1..FN> for N captures (both tag variants) |
| sub_6BCC20 | nv_emit_lambda_preamble | nv_transforms.c | 244 | Master emitter: primary template, zero-capture, bitmap scan, traits |
| sub_6BCBF0 | nv_record_capture_count | nv_transforms.c | 13 | Set bit N in device or host-device bitmap |
| sub_6BCBC0 | nv_reset_capture_bitmasks | nv_transforms.c | 9 | Zero both 128-byte bitmaps before each TU |
| sub_47B890 | gen_lambda | cp_gen_be.c | 336 | Per-lambda wrapper call emission in .int.c output |
| sub_467E50 | emit_string | cp_gen_be.c | -- | Low-level string emitter to output buffer |
| sub_46BC80 | emit_preprocessor_directive | cp_gen_be.c | -- | Emit #if 0 / #endif suppression blocks |
| sub_475820 | emit_closure_tag_type | cp_gen_be.c | -- | Emit tag type for wrapper construction |
| sub_46E640 | emit_capture_type_list | cp_gen_be.c | -- | Emit template argument list of capture types |
| sub_46E550 | emit_capture_value_list | cp_gen_be.c | -- | Emit constructor arguments (captured values) |
| sub_6BC290 | emit_array_capture_helpers | nv_transforms.c | 183 | Emit __nv_lambda_array_wrapper for dim 2-8 |
Global State
| Variable | Address | Purpose |
|---|---|---|
| unk_1286980 | 0x1286980 | Device lambda capture-count bitmap (128 bytes, 1024 bits) |
| dword_106BF38 | 0x106BF38 | --extended-lambda mode flag (enables entire system) |
| dword_1065834 | 0x1065834 | Preprocessor nesting depth (decremented after each #if 0 and #endif emission) |
| dword_1065820 | 0x1065820 | Output state flag (reset after #endif emission) |
| qword_1065828 | 0x1065828 | Output state pointer (reset after #endif emission) |
Related Pages
- Extended Lambda Overview -- End-to-end flow through the five pipeline stages
- Host-Device Lambda Wrapper -- __nv_hdl_wrapper_t type-erased design
- Capture Handling -- __nv_lambda_field_type, __nv_lambda_array_wrapper for array captures
- Preamble Injection -- sub_6BCC20 full emission sequence
- Lambda Restrictions -- Validation rules and error codes
- Kernel Stub Generation -- Parallel #if 0 suppression pattern for __global__ functions
Host-Device Lambda Wrapper
The __nv_hdl_wrapper_t template is cudafe++'s type-erased wrapper for __host__ __device__ extended lambdas. Unlike the device-only __nv_dl_wrapper_t, which is a simple aggregate of captured fields, the host-device wrapper must operate on both the host (through the host compiler) and the device (through cicc, the device compiler). This dual requirement forces a fundamentally different design: the wrapper uses void*-based type erasure with a manager<Lambda> inner struct that provides do_copy, do_call, and do_delete operations as static function pointers. The Lambda type is known only inside the constructor -- after construction, all operations go through the type-erased function pointer table stored in __nv_hdl_helper.
A second, lightweight path exists for lambdas that have no captures and can convert to a raw function pointer. When HasFuncPtrConv=true, the wrapper skips heap allocation entirely and stores the lambda directly as a function pointer via fp_noobject_caller, providing an operator __opfunc_t*() conversion operator.
Both paths are generated as raw C++ source text by two nearly-identical emitter functions in nv_transforms.c: sub_6BBB10 (non-mutable, IsMutable=false, const operator()) and sub_6BBEE0 (mutable, IsMutable=true, non-const operator()). For each capture count N observed during frontend parsing, the preamble emitter (sub_6BCC20) calls each function twice -- once with HasFuncPtrConv=0 and once with HasFuncPtrConv=1 -- producing four partial specializations per capture count: (non-mutable, no-fptr), (mutable, no-fptr), (non-mutable, fptr), (mutable, fptr).
Key Facts
| Property | Value |
|---|---|
| Full template signature | __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, NeverThrows, Tag, OpFunc, Captures...> |
| Source file | nv_transforms.c (EDG 6.6) |
| Non-mutable emitter | sub_6BBB10 (238 lines, IsMutable=false) |
| Mutable emitter | sub_6BBEE0 (236 lines, IsMutable=true) |
| Helper class | __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...> (anonymous namespace) |
| Factory | __nv_hdl_create_wrapper_t<IsMutable, HasFuncPtrConv, Tag, CaptureArgs...> |
| Trait deduction | __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...> |
| Bitmap | unk_1286900 (128 bytes, 1024 bits) |
| Primary template static_assert | "nvcc internal error: unexpected number of captures in __host__ __device__ lambda!" |
| Specializations per capture count | 4 (2 mutability x 2 HasFuncPtrConv); each of the four emitter calls made by sub_6BCC20 emits one specialization |
| Noexcept variants | Additional 2 trait specializations when dword_126E270 is set (C++17) |
Template Parameters
template <bool IsMutable, // false = const operator(), true = non-const
bool HasFuncPtrConv, // true = captureless, function pointer path
bool NeverThrows, // maps to noexcept(NeverThrows)
typename Tag, // unique tag type per lambda site
typename OpFunc, // operator() signature as R(Args...)
typename... CapturedVarTypePack> // captured variable types F1..FN
struct __nv_hdl_wrapper_t;
| Parameter | Role |
|---|---|
| IsMutable | Controls whether operator() is const. false for lambdas without mutable keyword (the common case), true for mutable lambdas. Emitted as "false," by sub_6BBB10 and "true," by sub_6BBEE0. |
| HasFuncPtrConv | true when the lambda has no captures and can be implicitly converted to a function pointer. Enables the lightweight fp_noobject_caller path instead of heap allocation. Passed as a1 to the emitter functions. |
| NeverThrows | Propagated to noexcept(NeverThrows) on operator(). Set to true only when dword_126E270 is active (C++17 noexcept-in-type-system) and the lambda's operator() is declared noexcept. |
| Tag | A unique type tag generated per lambda call site, used to give each __nv_hdl_helper instantiation its own static function pointer storage. Same tag system as device lambdas. |
| OpFunc | The lambda's call signature decomposed as OpFuncR(OpFuncArgs...). Used to type the function pointers in __nv_hdl_helper and the wrapper's operator(). |
| CapturedVarTypePack | F1, F2, ..., FN -- one type per captured variable. Each becomes a field typename __nv_lambda_field_type<Fi>::type fi in the wrapper struct. |
The __nv_hdl_helper Class
Before any __nv_hdl_wrapper_t specialization is emitted, sub_6BCC20 emits the __nv_hdl_helper class inside an anonymous namespace. This class holds the static function pointers that enable type erasure -- the Lambda type is known when the constructor assigns the pointers, but the operator(), copy constructor, and destructor access them without knowing the concrete Lambda type.
// Exact binary string (emitted as a single a1() call):
namespace {template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
struct __nv_hdl_helper {
typedef void * (*fp_copier_t)(void *);
typedef OpFuncR (*fp_caller_t)(void *, OpFuncArgs...);
typedef void (*fp_deleter_t) (void *);
typedef OpFuncR (*fp_noobject_caller_t)(OpFuncArgs...);
static fp_copier_t fp_copier;
static fp_deleter_t fp_deleter;
static fp_caller_t fp_caller;
static fp_noobject_caller_t fp_noobject_caller;
};
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier_t __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier;
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_deleter_t __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_deleter;
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_caller_t __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_caller;
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_noobject_caller_t __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_noobject_caller;
}
Note three details in the binary that differ from a hand-written version: (1) namespace {template has no newline between the opening brace and template, (2) fp_deleter_t has a space before (void *) that the other typedefs lack: typedef void (*fp_deleter_t) (void *), (3) the blank line between fp_caller and fp_noobject_caller out-of-line definitions is missing -- they are separated by only one newline.
The anonymous namespace is critical: it gives each translation unit its own copy of the static function pointers, preventing ODR violations when multiple TUs use the same lambda tag type. The Tag parameter ensures that different lambda call sites within the same TU get independent function pointer storage even if they share the same OpFuncR(OpFuncArgs...) signature.
Function Pointer Roles
| Pointer | Type | Set by | Used by | Purpose |
|---|---|---|---|---|
| fp_copier | void*(*)(void*) | Constructor (capturing path) | Copy constructor | Heap-allocates a new Lambda copy from void* buffer |
| fp_caller | OpFuncR(*)(void*, OpFuncArgs...) | Constructor (capturing path) | operator() | Casts void* back to Lambda* and invokes it |
| fp_deleter | void(*)(void*) | Constructor (capturing path) | Destructor | Casts void* to Lambda* and deletes it |
| fp_noobject_caller | OpFuncR(*)(OpFuncArgs...) | Constructor (non-capturing path) | operator() + conversion operator | Stores the lambda directly as a function pointer |
Type-Erasure Mechanism
The following diagram shows how a void* data pointer and the manager<Lambda> static functions work together to erase the concrete lambda type:
Construction (concrete Lambda type known):
============================================
__nv_hdl_wrapper_t ctor(Tag{}, Lambda &&lam, F1 in1, ...)
|
|-- data = new Lambda(std::move(lam)) // heap-allocate
|
|-- __nv_hdl_helper<Tag,...>::fp_copier // ASSIGN function pointers
| = &manager<Lambda>::do_copy // (Lambda type captured here)
|-- __nv_hdl_helper<Tag,...>::fp_deleter
| = &manager<Lambda>::do_delete
|-- __nv_hdl_helper<Tag,...>::fp_caller
| = &manager<Lambda>::do_call
After construction (Lambda type erased):
============================================
__nv_hdl_wrapper_t
+----------------------------+
| f1, f2, ..., fN | captured variable fields (typed)
| void *data ----------------+---> heap: Lambda object
+----------------------------+
(concrete type unknown here)
operator()(args...):
fp_caller(data, args...)
|
v
manager<Lambda>::do_call(void *buf, args...)
auto ptr = static_cast<Lambda*>(buf);
return (*ptr)(args...);
Copy ctor:
data = fp_copier(in.data)
|
v
manager<Lambda>::do_copy(void *buf)
return new Lambda(*static_cast<Lambda*>(buf));
Move ctor:
data = in.data; in.data = 0; // pointer steal
Destructor:
fp_deleter(data)
|
v
manager<Lambda>::do_delete(void *buf)
delete static_cast<Lambda*>(buf);
The Tag template parameter is critical: it ensures each lambda call site gets its own set of __nv_hdl_helper static function pointers. Without Tag, two different lambdas with the same OpFuncR(OpFuncArgs...) signature would share the same function pointers, and the second constructor call would overwrite the first's fp_caller/fp_copier/fp_deleter.
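The mechanism in the diagram can be reproduced in a compact, host-only sketch. The names `helper`, `erased`, and `site_a` are hypothetical, there are no captured fields or NeverThrows parameter, and the layout is simplified relative to the emitted template; only the void* plus static-function-pointer technique is the same.

```cpp
#include <utility>

namespace {
// Per-tag static function pointers (stand-in for __nv_hdl_helper).
template <typename Tag, typename R, typename... Args>
struct helper {
    typedef void *(*copier_t)(void *);
    typedef R (*caller_t)(void *, Args...);
    typedef void (*deleter_t)(void *);
    static copier_t fp_copier;
    static caller_t fp_caller;
    static deleter_t fp_deleter;
};
template <typename Tag, typename R, typename... Args>
typename helper<Tag, R, Args...>::copier_t helper<Tag, R, Args...>::fp_copier;
template <typename Tag, typename R, typename... Args>
typename helper<Tag, R, Args...>::caller_t helper<Tag, R, Args...>::fp_caller;
template <typename Tag, typename R, typename... Args>
typename helper<Tag, R, Args...>::deleter_t helper<Tag, R, Args...>::fp_deleter;
}  // namespace

template <typename Tag, typename Sig>
struct erased;

template <typename Tag, typename R, typename... Args>
struct erased<Tag, R(Args...)> {
    typedef helper<Tag, R, Args...> H;
    void *data;  // heap-allocated Lambda, concrete type erased

    template <typename Lambda>
    struct manager {
        static void *do_copy(void *buf) {
            return new Lambda(*static_cast<Lambda *>(buf));
        }
        static R do_call(void *buf, Args... args) {
            return (*static_cast<Lambda *>(buf))(std::forward<Args>(args)...);
        }
        static void do_delete(void *buf) { delete static_cast<Lambda *>(buf); }
    };

    // The only place where the concrete Lambda type is known.
    template <typename Lambda>
    erased(Tag, Lambda &&lam) : data(new Lambda(std::move(lam))) {
        H::fp_copier = &manager<Lambda>::do_copy;
        H::fp_caller = &manager<Lambda>::do_call;
        H::fp_deleter = &manager<Lambda>::do_delete;
    }
    R operator()(Args... args) const {
        return H::fp_caller(data, std::forward<Args>(args)...);
    }
    erased(const erased &in) : data(H::fp_copier(in.data)) {}
    erased(erased &&in) : data(in.data) { in.data = nullptr; }  // pointer steal
    ~erased() { H::fp_deleter(data); }  // deleting a null pointer is a no-op
    erased &operator=(const erased &) = delete;
};

struct site_a {};  // illustrative tag: one distinct type per lambda site
```

Constructing `erased<site_a, int(int)>` from a capturing lambda, then copying or moving it, exercises each of the three function pointers in turn. Reusing `site_a` for a second, different closure type would overwrite the pointers, which is the hazard the Tag parameter exists to prevent.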
The Capturing Path (HasFuncPtrConv=false)
When HasFuncPtrConv=false (the a1=0 path in the emitter), the wrapper uses heap allocation for type erasure. This is the full-weight path for lambdas that capture state.
Reconstructed Template (N captures, non-mutable)
The following is the complete C++ output reconstructed from sub_6BBB10 with a1=0 (HasFuncPtrConv=false) and a2=N captures:
template <bool NeverThrows, typename Tag, typename OpFuncR,
typename... OpFuncArgs, typename F1, typename F2, /* ...FN */>
struct __nv_hdl_wrapper_t<false, false, NeverThrows, Tag,
OpFuncR(OpFuncArgs...), F1, F2, /* ...FN */> {
// --- Captured fields ---
typename __nv_lambda_field_type<F1>::type f1;
typename __nv_lambda_field_type<F2>::type f2;
// ...
typename __nv_lambda_field_type<FN>::type fN;
typedef OpFuncR(__opfunc_t)(OpFuncArgs...);
// --- Data member for type-erased lambda ---
void *data;
// --- Type erasure manager ---
template <typename Lambda>
struct manager {
static void *do_copy(void *buf) {
auto ptr = static_cast<Lambda *>(buf);
return static_cast<void *>(new Lambda(*ptr));
};
static OpFuncR do_call(void *buf, OpFuncArgs... args) {
auto ptr = static_cast<Lambda *>(buf);
return (*ptr)(std::forward<OpFuncArgs>(args)...);
};
static void do_delete(void *buf) {
auto ptr = static_cast<Lambda *>(buf);
delete ptr;
}
};
// --- Constructor: heap-allocate Lambda, register function pointers ---
template <typename Lambda>
__nv_hdl_wrapper_t(Tag, Lambda &&lam, F1 in1, F2 in2, /* ...FN inN */)
: f1(in1), f2(in2), /* ...fN(inN), */
data(static_cast<void *>(new Lambda(std::move(lam)))) {
__nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier
= &manager<Lambda>::do_copy;
__nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_deleter
= &manager<Lambda>::do_delete;
__nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_caller
= &manager<Lambda>::do_call;
}
// --- Call operator: delegate through type-erased fp_caller ---
// Binary emits: "OpFuncR operator() (OpFuncArgs... args) " + "const " + "noexcept(NeverThrows) "
OpFuncR operator() (OpFuncArgs... args) const noexcept(NeverThrows) {
return __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>
::fp_caller(data, std::forward<OpFuncArgs>(args)...);
}
// --- Copy constructor: delegate through fp_copier ---
__nv_hdl_wrapper_t(const __nv_hdl_wrapper_t &in)
: f1(in.f1), f2(in.f2), /* ...fN(in.fN), */
data(__nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>
::fp_copier(in.data)) { }
// --- Move constructor: steal void* pointer ---
__nv_hdl_wrapper_t(__nv_hdl_wrapper_t &&in)
: f1(std::move(in.f1)), f2(std::move(in.f2)), /* ...fN(std::move(in.fN)), */
data(in.data) { in.data = 0; }
// --- Destructor: delegate through fp_deleter ---
~__nv_hdl_wrapper_t(void) {
__nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_deleter(data);
}
// --- Copy assignment: deleted ---
__nv_hdl_wrapper_t & operator=(const __nv_hdl_wrapper_t &in) = delete;
};
Key Design Decisions
Heap allocation in constructor. The lambda is std::moved into a heap-allocated copy via new Lambda(std::move(lam)). This erases the concrete type -- the wrapper only holds a void* afterward. The manager<Lambda> static methods are assigned to the __nv_hdl_helper static function pointers during construction, preserving the type information as function pointer values rather than as template parameters.
Static function pointers instead of vtable. Rather than using virtual functions, the wrapper stores the type-erasure operations in static function pointers on __nv_hdl_helper. This is an unconventional choice -- it means all wrappers with the same Tag share the same function pointer storage. This works because within a single translation unit, each tag corresponds to exactly one lambda closure type. The approach avoids vtable overhead (no virtual destructor, no vptr in the wrapper) at the cost of not being safe across multiple lambda types sharing a tag.
Move constructor steals pointer. The move constructor copies the void* data pointer and sets the source to 0 (null). The destructor unconditionally calls fp_deleter(data), so a null data pointer after move must be handled by the deleter. Since delete on a null pointer is a no-op in C++, the moved-from wrapper's destructor call is safe.
Copy assignment is deleted. Only copy construction and move construction are supported. This avoids the complexity of managing the void* lifetime during assignment (which would require deleting the old data and copying the new).
Zero-Capture Specialization
When a2=0 (no captures), the emitter skips the field declarations and the field portions of the member initializer lists. The wrapper degenerates to holding only void* data with no fN fields. The constructor takes only (Tag, Lambda&&) with no capture arguments. The copy and move constructors handle only the data member.
The Lightweight Path (HasFuncPtrConv=true)
When HasFuncPtrConv=true (the a1=1 path), the lambda has no captures and can be implicitly converted to a raw function pointer. The emitter produces a drastically simpler wrapper:
template <bool NeverThrows, typename Tag, typename OpFuncR,
typename... OpFuncArgs>
struct __nv_hdl_wrapper_t<false, true, NeverThrows, Tag,
OpFuncR(OpFuncArgs...)> {
typedef OpFuncR(__opfunc_t)(OpFuncArgs...);
// --- Constructor: store lambda as function pointer ---
template <typename Lambda>
__nv_hdl_wrapper_t(Tag, Lambda &&lam)
{ __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_noobject_caller = lam; }
// --- Call operator: invoke through stored function pointer ---
// Binary: "OpFuncR operator() (OpFuncArgs... args) " + "const " + "noexcept(NeverThrows) "
OpFuncR operator() (OpFuncArgs... args) const noexcept(NeverThrows) {
return __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>
::fp_noobject_caller(std::forward<OpFuncArgs>(args)...);
}
// --- Function pointer conversion operator ---
// Binary: "operator __opfunc_t * () const { ... }"
operator __opfunc_t * () const {
return __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_noobject_caller;
}
// --- Copy assignment: deleted ---
__nv_hdl_wrapper_t & operator=(const __nv_hdl_wrapper_t &in) = delete;
};
No void* data member. No manager struct. No heap allocation. No copy constructor, move constructor, or destructor (the compiler-generated defaults suffice). The lambda is stored directly as a function pointer in fp_noobject_caller, and the wrapper provides an implicit conversion to __opfunc_t* -- the raw function pointer type matching the lambda's signature.
This path is selected when gen_lambda (sub_47B890) detects that the lambda has no capture list (*(_QWORD *)a1 == 0, the capture head pointer is null) and the lambda does not use capture-default = (bit 4 at byte[24] is clear). Additional conditions involving dword_126EFAC, dword_126EFA4, and qword_126EF98 (a version threshold at 0xEB27 = 60199, likely a CUDA toolkit version) gate this detection, suggesting the function-pointer conversion path was added in a specific toolkit release.
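A host-only model of the lightweight path shows how little machinery remains. As before, this is a sketch with hypothetical names (`fp_helper`, `fp_wrapper`, `site_c`) and without the NeverThrows parameter; it relies only on the standard conversion from a captureless closure to a function pointer.

```cpp
#include <utility>

namespace {
// Per-tag storage for the plain function pointer.
template <typename Tag, typename R, typename... Args>
struct fp_helper {
    typedef R (*noobject_caller_t)(Args...);
    static noobject_caller_t fp_noobject_caller;
};
template <typename Tag, typename R, typename... Args>
typename fp_helper<Tag, R, Args...>::noobject_caller_t
    fp_helper<Tag, R, Args...>::fp_noobject_caller;
}  // namespace

template <typename Tag, typename Sig>
struct fp_wrapper;

template <typename Tag, typename R, typename... Args>
struct fp_wrapper<Tag, R(Args...)> {
    typedef R(opfunc_t)(Args...);
    typedef fp_helper<Tag, R, Args...> H;

    // A captureless closure converts implicitly to a function pointer.
    template <typename Lambda>
    fp_wrapper(Tag, Lambda &&lam) { H::fp_noobject_caller = lam; }

    R operator()(Args... args) const {
        return H::fp_noobject_caller(std::forward<Args>(args)...);
    }
    // Conversion back to the raw function pointer type.
    operator opfunc_t *() const { return H::fp_noobject_caller; }

    fp_wrapper &operator=(const fp_wrapper &) = delete;
};

struct site_c {};  // illustrative tag type
```

The wrapper object itself is empty: all state lives in the per-tag static pointer, so copies are trivially cheap and no destructor is needed.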
Mutable vs Non-Mutable (sub_6BBB10 vs sub_6BBEE0)
The two emitter functions are structurally identical. The sole differences:
| Aspect | sub_6BBB10 (non-mutable) | sub_6BBEE0 (mutable) |
|---|---|---|
| First template bool emitted | "false," | "true," |
| operator() qualifier | a3("const ") before noexcept | No "const " emission |
| Binary difference | Line 190: emits "const " | Line 188: skips to noexcept |
In the decompiled binary, the two functions are 238 and 236 lines respectively. The 2-line difference is exactly the a3("const ") call present in sub_6BBB10 but absent from sub_6BBEE0.
For a mutable lambda, the C++ standard says operator() is non-const, allowing the lambda body to modify captured-by-value variables. The wrapper faithfully propagates this: sub_6BBEE0 generates operator() without the const qualifier. In the capturing path, this means the do_call function pointer invokes a non-const Lambda, which is sound because the lambda is heap-allocated and accessed through a mutable void*.
Emitter Call Matrix
sub_6BCC20 emits all four combinations for each set bit N in the host-device bitmap:
sub_6BBB10(0, N, emit); // IsMutable=false, HasFuncPtrConv=false
sub_6BBEE0(0, N, emit); // IsMutable=true, HasFuncPtrConv=false
sub_6BBB10(1, N, emit); // IsMutable=false, HasFuncPtrConv=true
sub_6BBEE0(1, N, emit); // IsMutable=true, HasFuncPtrConv=true
This produces four partial specializations per set bitmap bit N. The NeverThrows parameter remains a template parameter (not a partial-specialization value), handled at instantiation time. Note in the decompiled binary that the fourth call uses v9 (which holds v6 before the post-increment): v9 = v6++; ... sub_6BBEE0(1, v9, a1); -- all four calls use the same capture count N.
The __nv_hdl_helper_trait_outer Deduction Helper
After the per-capture-count specializations, sub_6BCC20 emits a trait class that deduces the wrapper return type from the lambda's operator() signature:
template <bool IsMutable, bool HasFuncPtrConv, typename ...CaptureArgs>
struct __nv_hdl_helper_trait_outer {
// Primary: extract operator() signature via decltype(&Lambda::operator())
template <typename Tag, typename Lambda>
struct __nv_hdl_helper_trait
: public __nv_hdl_helper_trait<Tag, decltype(&Lambda::operator())> { };
// Specialization for const operator() (non-mutable lambda):
template <typename Tag, typename C, typename R, typename... OpFuncArgs>
struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const> {
template <typename Lambda>
static auto get(Lambda lam, CaptureArgs... args)
-> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
Tag, R(OpFuncArgs...), CaptureArgs...>;
};
// Specialization for non-const operator() (mutable lambda):
template <typename Tag, typename C, typename R, typename... OpFuncArgs>
struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...)> {
template <typename Lambda>
static auto get(Lambda lam, CaptureArgs... args)
-> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
Tag, R(OpFuncArgs...), CaptureArgs...>;
};
// C++17 noexcept variants (only when dword_126E270 is set):
template <typename Tag, typename C, typename R, typename... OpFuncArgs>
struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const noexcept> {
template <typename Lambda>
static auto get(Lambda lam, CaptureArgs... args)
-> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
Tag, R(OpFuncArgs...), CaptureArgs...>;
};
template <typename Tag, typename C, typename R, typename... OpFuncArgs>
struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) noexcept> {
template <typename Lambda>
static auto get(Lambda lam, CaptureArgs... args)
-> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
Tag, R(OpFuncArgs...), CaptureArgs...>;
};
};
The trick here is the primary __nv_hdl_helper_trait inheriting from a specialization on decltype(&Lambda::operator()). The compiler deduces the member function pointer type of operator(), which pattern-matches against one of the four specializations. The non-noexcept specializations pass NeverThrows=false; the noexcept specializations pass NeverThrows=true. This is how the NeverThrows template parameter gets its value -- through trait deduction, not through an explicit argument.
The C++17 noexcept variants are gated on dword_126E270. In C++17, noexcept became part of the type system, so R(C::*)(Args...) noexcept is a distinct type from R(C::*)(Args...). Without the additional specializations, the compiler would fail to match noexcept member function pointers.
In the decompiled sub_6BCC20, the emission is split into three a1() calls: (1) the base struct with the const and non-const specializations (ending with the }; that closes the non-const specialization), (2) conditionally (if (dword_126E270)) the const noexcept and noexcept specializations, and (3) a1("\n};") to close the outer struct. The closing brace of __nv_hdl_helper_trait_outer is therefore always emitted, while the noexcept specializations inside it are conditional. A subtle consequence: in non-C++17 mode, the emitted text between the non-const specialization's }; and the outer }; consists only of the newline from the third call, so the outer struct closes immediately after the last inner specialization.
The __nv_hdl_create_wrapper_t Factory
The factory struct ties everything together. It provides a single static method that the backend emits at each host-device lambda usage site:
template <bool IsMutable, bool HasFuncPtrConv,
          typename Tag, typename... CaptureArgs>
struct __nv_hdl_create_wrapper_t {
    template <typename Lambda>
    static auto __nv_hdl_create_wrapper(Lambda &&lam, CaptureArgs... args)
        -> decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...))
    {
        typedef decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...)) container_type;
        return container_type(Tag{}, std::move(lam), args...);
    }
};
The trailing return type uses decltype to invoke the trait chain and deduce the exact __nv_hdl_wrapper_t specialization. The body constructs that deduced type with Tag{} (a value-initialized tag), the moved lambda, and the capture arguments.
Backend Emission at Lambda Call Site
When gen_lambda (sub_47B890) encounters a host-device lambda (bit 4 set at byte[25]), it emits the factory call in two phases:
Phase 1 (before lambda body): Opens the factory call with template arguments and the method name:
__nv_hdl_create_wrapper_t< IsMutable, HasFuncPtrConv, Tag, CaptureTypes... >
::__nv_hdl_create_wrapper(
Phase 2 (after lambda body): The lambda expression is emitted as the first argument to __nv_hdl_create_wrapper, then the captured value expressions are appended as trailing arguments, followed by the closing ):
/* lambda expression emitted inline */,
capture_arg1, capture_arg2, ... )
This differs from the device lambda path where the original lambda body is wrapped in #if 0 / #endif. In the host-device path, the lambda is passed by rvalue reference to the factory method, which moves it into a heap-allocated copy for type erasure. The captured values are passed separately (via sub_46E550 at line 323 of the decompiled binary) so the wrapper can store them as typed fields alongside the void* data.
The IsMutable decision comes from byte[24] & 0x02 (mutable keyword present). The HasFuncPtrConv decision involves nested conditions, all gated on the capture list head being null (*(_QWORD *)a1 == 0):
HasFuncPtrConv = false;  // default
if (capture_list_head == NULL) {
    if (dword_126EFAC && !dword_126EFA4 && qword_126EF98 <= 0xEB27) {
        HasFuncPtrConv = true;  // forced true for old toolkit versions
    } else {
        // General path: true iff no capture-default '='
        HasFuncPtrConv = !(byte[24] & 0x10);
    }
}
When dword_126EFAC is set and dword_126EFA4 is clear, the toolkit version qword_126EF98 is compared against 0xEB27 (60199). At or below this threshold, HasFuncPtrConv is unconditionally true. Above the threshold, it falls through to the general path, which checks whether the lambda has a capture-default = (bit 4 at byte[24], mask 0x10): with an empty capture list and no = default, the lambda is captureless and can convert to a function pointer.
This logic is at lines 62-77 of the decompiled sub_47B890.
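The decision can be restated as a pure function, which makes the gating easier to test in a reimplementation. This is a sketch under stated assumptions: the HdlInputs struct and its field names are hypothetical, chosen only to name the values the decompiled code reads. The masks and the 0xEB27 threshold are the ones quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the values gen_lambda inspects; names are illustrative.
struct HdlInputs {
    bool capture_list_empty;   // *(_QWORD *)a1 == 0
    std::uint8_t flags24;      // byte[24] of the lambda info
    int gate_efac;             // dword_126EFAC
    int gate_efa4;             // dword_126EFA4
    std::uint64_t version;     // qword_126EF98
};

constexpr std::uint64_t kLegacyThreshold = 0xEB27;  // 60199
constexpr std::uint8_t kMutableBit = 0x02;          // mutable keyword present
constexpr std::uint8_t kCaptureDefaultEq = 0x10;    // capture-default '='

bool is_mutable(const HdlInputs& in) {
    return (in.flags24 & kMutableBit) != 0;
}

bool has_func_ptr_conv(const HdlInputs& in) {
    if (!in.capture_list_empty)
        return false;  // any explicit capture rules out the conversion
    if (in.gate_efac && !in.gate_efa4 && in.version <= kLegacyThreshold)
        return true;   // legacy path: forced true
    return (in.flags24 & kCaptureDefaultEq) == 0;  // general path
}
```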
SFINAE Detection Traits
At the end of the preamble, sub_6BCC20 emits a detection trait and macro for identifying host-device lambda wrappers:
// Exact binary string (step 16 in sub_6BCC20, emitted as a single a1() call):
template <typename>
struct __nv_extended_host_device_lambda_trait_helper {
  static const bool value = false;
};
template <bool B1, bool B2, bool B3, typename T1, typename T2, typename...Pack>
struct __nv_extended_host_device_lambda_trait_helper<__nv_hdl_wrapper_t<B1, B2, B3, T1, T2, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_host_device_lambda_closure_type(X)  __nv_extended_host_device_lambda_trait_helper< typename __nv_lambda_trait_remove_cv<X>::type>::value
Note: binary has typename...Pack (no space), Pack...> > (space between angle brackets -- pre-C++11 syntax), two spaces before __nv_extended_host_device_lambda_trait_helper in the macro, and 2-space indentation on static const bool.
This allows compile-time detection of whether a type is a host-device lambda wrapper, used internally by the CUDA runtime headers and by nvcc to apply special handling to extended host-device lambda closure types.
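The detection pattern is an ordinary partial-specialization match and can be exercised in isolation. The sketch below assumes hypothetical stand-ins: wrapper_t replaces __nv_hdl_wrapper_t, is_hdl_wrapper_helper replaces the emitted helper, and std::remove_cv replaces __nv_lambda_trait_remove_cv (whose definition is not shown in this section).

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for __nv_hdl_wrapper_t; only its parameter shape matters.
template <bool IsMutable, bool HasFuncPtrConv, bool NeverThrows,
          typename Tag, typename Signature, typename... Captures>
struct wrapper_t {};

// Primary template: anything that is not a wrapper_t.
template <typename>
struct is_hdl_wrapper_helper {
    static const bool value = false;
};

// Partial specialization: matches every wrapper_t instantiation.
template <bool B1, bool B2, bool B3, typename T1, typename T2, typename... Pack>
struct is_hdl_wrapper_helper<wrapper_t<B1, B2, B3, T1, T2, Pack...> > {
    static const bool value = true;
};

// Function form of the macro: strip cv-qualifiers, then ask the helper.
template <typename X>
constexpr bool is_hdl_wrapper() {
    return is_hdl_wrapper_helper<typename std::remove_cv<X>::type>::value;
}

struct SampleTag {};
using SampleWrapper = wrapper_t<false, false, false, SampleTag, int(int), int>;
```

Stripping cv-qualifiers first matters because const SampleWrapper would otherwise fall through to the primary template.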
Emission Sequence in sub_6BCC20
The host-device wrapper infrastructure is emitted in steps 7-12 of the 20-step preamble emission sequence:
| Step | Content | Function |
|---|---|---|
| 7 | __nv_hdl_helper class (anonymous namespace, 4 static function pointer members + out-of-line definitions) | sub_6BCC20 inline |
| 8 | Primary __nv_hdl_wrapper_t with static_assert (catches unexpected capture counts) | sub_6BCC20 inline |
| 9 | Per-capture-count specializations: for each bit N set in unk_1286900, emit 4 calls: sub_6BBB10(0,N), sub_6BBEE0(0,N), sub_6BBB10(1,N), sub_6BBEE0(1,N) | sub_6BBB10, sub_6BBEE0 |
| 10 | __nv_hdl_helper_trait_outer deduction helper (2 or 4 trait specializations depending on C++17) | sub_6BCC20 inline |
| 11 | C++17 noexcept trait variants (conditional on dword_126E270) | sub_6BCC20 inline |
| 12 | __nv_hdl_create_wrapper_t factory | sub_6BCC20 inline |
The bitmap scan loop for host-device wrappers differs from the device-lambda loop in one important way: bit 0 IS emitted. The device-lambda loop skips bit 0 (the zero-capture case is handled by the primary template), but the host-device loop processes every set bit including 0. This is because the zero-capture host-device wrapper still requires distinct specializations for the HasFuncPtrConv=true and HasFuncPtrConv=false paths.
// sub_6BCC20, host-device bitmap scan (decompiled)
v5 = (unsigned __int64 *)&unk_1286900;
v6 = 0;
do {
    v7 = *v5;
    v8 = v6 + 64;
    do {
        while ((v7 & 1) == 0) {      // skip unset bits
            ++v6;
            v7 >>= 1;
            if (v6 == v8) goto LABEL_13;
        }
        sub_6BBB10(0, v6, a1);       // non-mutable, HasFuncPtrConv=false
        sub_6BBEE0(0, v6, a1);       // mutable, HasFuncPtrConv=false
        sub_6BBB10(1, v6, a1);       // non-mutable, HasFuncPtrConv=true
        v9 = v6++;
        v7 >>= 1;
        sub_6BBEE0(1, v9, a1);       // mutable, HasFuncPtrConv=true
    } while (v6 != v8);
LABEL_13:
    ++v5;
} while (v6 != 1024);
Comparison with Device Lambda Wrapper
| Aspect | __nv_dl_wrapper_t | __nv_hdl_wrapper_t |
|---|---|---|
| Type erasure | None -- concrete fields only | void* + manager<Lambda> function pointers |
| Heap allocation | Never | Yes (capturing path) or never (HasFuncPtrConv path) |
| Copy semantics | Trivially copyable aggregate | Custom copy ctor via fp_copier; copy assign deleted |
| Move semantics | Default | Custom move ctor stealing void*; moved-from nulled |
| Destructor | Trivial | Calls fp_deleter(data) |
| operator() body | return 0; / __builtin_unreachable() (placeholder) | Delegates through fp_caller or fp_noobject_caller |
| Function pointer conversion | Not supported | operator __opfunc_t * () when HasFuncPtrConv=true |
| Specializations per N | 2 (standard tag + trailing-return tag) | 4 (2 mutability x 2 HasFuncPtrConv) |
| Template params (partial spec) | Tag, F1..FN | IsMutable, HasFuncPtrConv, NeverThrows, Tag, OpFuncR(OpFuncArgs...), F1..FN |
The host-device wrapper is fundamentally more complex because it must produce a callable object that works on both host and device. The device-only wrapper can use placeholder operator bodies (return 0) because the device compiler sees the original lambda body through a different mechanism. The host-device wrapper must actually call the lambda through the type-erased function pointer table.
Concrete Example: Host-Device Lambda with One Capture
User code:
auto add_n = [n] __host__ __device__ (int x) { return x + n; };
int result = add_n(42);
This lambda has one capture (n, by value), is not mutable (default), and cannot convert to a function pointer (it captures). The frontend sets bit 4 at byte[25] (host-device wrapper needed) and calls sub_6BCBF0(1, 1) to set bit 1 in the host-device bitmap unk_1286900.
During preamble emission, sub_6BCC20 sees bit 1 set and emits four specializations via sub_6BBB10(0,1), sub_6BBEE0(0,1), sub_6BBB10(1,1), sub_6BBEE0(1,1). The relevant one for this lambda (non-mutable, capturing) is from sub_6BBB10(0,1).
At the lambda call site, gen_lambda emits:
__nv_hdl_create_wrapper_t< false, false, __nv_dl_tag<...>, int >
::__nv_hdl_create_wrapper(
[n] __host__ __device__ (int x) { return x + n; },
n )
The factory method deduces the wrapper type via __nv_hdl_helper_trait_outer and constructs:
__nv_hdl_wrapper_t<false, false, false, Tag, int(int), int>
At runtime on the host: the constructor heap-allocates the lambda, stores n as field f1, and sets the fp_caller/fp_copier/fp_deleter static function pointers. Calling add_n(42) invokes fp_caller(data, 42) which casts void* back to the lambda type and calls operator()(42).
At runtime on the device: the same wrapper struct is memcpy'd to device memory. The device compiler sees the wrapper's fields and operator() which delegates through the function pointer table, resolving to the lambda body.
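The host-side behaviour described above (heap copy, fp_caller/fp_copier/fp_deleter, deleted copy assignment, pointer-stealing move) is a conventional type-erasure pattern. The following is a simplified host-only sketch, not the emitted wrapper: erased_callable and manager are hypothetical names, the function pointers are stored per object rather than as static members of a helper class, and the captured fields, Tag, and template flags are omitted.

```cpp
#include <cassert>
#include <utility>

template <typename Signature>
class erased_callable;

template <typename R, typename... Args>
class erased_callable<R(Args...)> {
    // One manager instantiation per lambda type recovers the erased type.
    template <typename Lambda>
    struct manager {
        static R call(void* p, Args... args) {
            return (*static_cast<Lambda*>(p))(std::forward<Args>(args)...);
        }
        static void* copy(void* p) {
            return new Lambda(*static_cast<Lambda*>(p));
        }
        static void destroy(void* p) { delete static_cast<Lambda*>(p); }
    };

    void* data_;
    R (*fp_caller_)(void*, Args...);
    void* (*fp_copier_)(void*);
    void (*fp_deleter_)(void*);

public:
    template <typename Lambda>
    explicit erased_callable(Lambda lam)
        : data_(new Lambda(std::move(lam))),
          fp_caller_(&manager<Lambda>::call),
          fp_copier_(&manager<Lambda>::copy),
          fp_deleter_(&manager<Lambda>::destroy) {}

    // Copy construction clones the heap object through fp_copier_.
    erased_callable(const erased_callable& other)
        : data_(other.data_ ? other.fp_copier_(other.data_) : nullptr),
          fp_caller_(other.fp_caller_),
          fp_copier_(other.fp_copier_),
          fp_deleter_(other.fp_deleter_) {}

    // Move construction steals the pointer and nulls the source.
    erased_callable(erased_callable&& other) noexcept
        : data_(other.data_),
          fp_caller_(other.fp_caller_),
          fp_copier_(other.fp_copier_),
          fp_deleter_(other.fp_deleter_) {
        other.data_ = nullptr;
    }

    erased_callable& operator=(const erased_callable&) = delete;

    ~erased_callable() {
        if (data_) fp_deleter_(data_);
    }

    R operator()(Args... args) const {
        return fp_caller_(data_, std::forward<Args>(args)...);
    }
};
```

With this sketch, erased_callable<int(int)> add_n([n](int x) { return x + n; }) followed by add_n(42) follows the same fp_caller(data, 42) route as the example above.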
Emitter Function Signature
Both sub_6BBB10 and sub_6BBEE0 share the same prototype:
__int64 __fastcall sub_6BBB10(int a1, unsigned int a2,
void (__fastcall *a3)(const char *));
| Parameter | Role |
|---|---|
| a1 | HasFuncPtrConv flag. 0 = full type-erased path. 1 = lightweight function pointer path. |
| a2 | Number of captured variables (0 to 1023). |
| a3 | Emit callback. Called with C++ source text fragments that are concatenated to form the output. |
The functions use a 1080-byte stack buffer (v28[1080]) for sprintf formatting of per-capture template parameters and field declarations. The buffer is large enough for field names up to F1023 / f1023 / in1023 with surrounding template syntax.
Key Functions
| Address | Name | Lines | Role |
|---|---|---|---|
| sub_6BBB10 | emit_hdl_wrapper_nonmutable | 238 | Emit __nv_hdl_wrapper_t<false, ...> specialization |
| sub_6BBEE0 | emit_hdl_wrapper_mutable | 236 | Emit __nv_hdl_wrapper_t<true, ...> specialization |
| sub_6BCC20 | nv_emit_lambda_preamble | 244 | Master emitter; calls both for each bitmap bit |
| sub_47B890 | gen_lambda | 336 | Per-lambda site emission of __nv_hdl_create_wrapper_t::__nv_hdl_create_wrapper(...) call |
| sub_6BCBF0 | nv_record_capture_count | 13 | Sets bit in unk_1286900 bitmap during frontend scan |
| sub_6BCBC0 | nv_reset_capture_bitmasks | 9 | Zeroes both bitmaps before each TU |
Global State
| Variable | Address | Purpose |
|---|---|---|
| unk_1286900 | 0x1286900 | Host-device lambda capture-count bitmap (128 bytes, 1024 bits) |
| dword_126E270 | 0x126E270 | C++17 noexcept-in-type-system flag; gates noexcept trait variants |
| dword_126EFAC | 0x126EFAC | Influences HasFuncPtrConv deduction in gen_lambda |
| dword_126EFA4 | 0x126EFA4 | Secondary gate for HasFuncPtrConv path |
| qword_126EF98 | 0x126EF98 | Toolkit version threshold for HasFuncPtrConv (compared against 0xEB27) |
| dword_106BF38 | 0x106BF38 | Extended lambda mode flag (--extended-lambda) |
Related Pages
- Extended Lambda Overview -- end-to-end pipeline and bitmap system
- Device Lambda Wrapper -- __nv_dl_wrapper_t simpler aggregate approach
- Capture Handling -- __nv_lambda_field_type and array capture helpers
- Preamble Injection -- sub_6BCC20 full emission sequence
- Lambda Restrictions -- validation rules and error codes
Capture Handling
C++ lambdas capture variables by creating closure-class fields -- one field per captured entity. For scalars this is straightforward: the closure stores a copy (or reference) of the variable. Arrays present a problem for the wrapper: a closure can hold an array member, but a C-style array cannot be passed by value to a constructor (the parameter decays to a pointer), so a wrapper struct cannot copy one directly. CUDA extended lambdas compound the problem: the wrapper template that carries captures across the host/device boundary needs a uniform way to express every field's type, including multi-dimensional arrays and const-qualified variants. cudafe++ solves this with two injected template families: __nv_lambda_field_type<T> (a type trait that maps each captured variable's declared type to a storable type) and __nv_lambda_array_wrapper<T[D1]...[DN]> (a wrapper struct that holds a deep copy of an N-dimensional array with element-by-element copy in its constructor).
A separate subsystem handles the backend code generator's emission of capture type declarations and capture value expressions for each lambda. nv_gen_extended_lambda_capture_types (sub_46E640) walks the capture list and emits decltype-based template arguments wrapped in __nvdl_remove_ref / __nvdl_remove_const / __nv_lambda_trait_remove_cv. sub_46E550 emits the corresponding capture values (variable names, this, *this, or init-capture expressions).
All of this is driven by a bitmap system that tracks which capture counts were actually used, so cudafe++ only emits the wrapper specializations that a given translation unit requires.
Key Facts
| Property | Value |
|---|---|
| Field type trait | __nv_lambda_field_type<T> |
| Array wrapper | __nv_lambda_array_wrapper<T[D1]...[DN]> |
| Supported array dims | 1D (identity) through 7D (generated for ranks 2-8) |
| Array helper emitter | sub_6BC290 (emit_array_capture_helpers) in nv_transforms.c |
| Capture type emitter | sub_46E640 (nv_gen_extended_lambda_capture_types) in cp_gen_be.c |
| Capture value emitter | sub_46E550 in cp_gen_be.c |
| Device bitmap | unk_1286980 (128 bytes = 1024 bits) |
| Host-device bitmap | unk_1286900 (128 bytes = 1024 bits) |
| Bitmap initializer | sub_6BCBC0 (nv_reset_capture_bitmasks) |
| Bitmap setter | sub_6BCBF0 (nv_record_capture_count) |
__nv_lambda_field_type
This is the type trait that maps every captured variable's declared type to a type suitable for storage in a wrapper struct field. For scalar types (and anything that is not an array), it is the identity:
template <typename T>
struct __nv_lambda_field_type {
typedef T type;
};
For array types, it maps to the corresponding __nv_lambda_array_wrapper specialization. cudafe++ generates partial specializations for dimensions 2 through 8, each in both non-const and const variants.
Generated Specializations (Example: 3D)
// Non-const array
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_field_type<T [D1][D2][D3]> {
typedef __nv_lambda_array_wrapper<T [D1][D2][D3]> type;
};
// Const array
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_field_type<const T [D1][D2][D3]> {
typedef const __nv_lambda_array_wrapper<T [D1][D2][D3]> type;
};
For 1D arrays (T[D1]), no specialization is generated. The primary template handles them -- 1D arrays decay to pointers in standard capture, so this is the identity case. The explicit specializations cover dimensions 2 through 8 (template parameter lists with D1 through D2...D7 respectively).
Why Ranks 2-8
The loop in sub_6BC290 runs with counter v1 from 2 to 8 inclusive (while (v1 != 9)). Rank 1 is handled by the primary template. Rank 9+ triggers the static_assert in the unspecialized __nv_lambda_array_wrapper primary template. This bounds the maximum supported array dimensionality for lambda capture at 7D -- an extremely generous limit (standard CUDA kernels rarely exceed 3D arrays).
__nv_lambda_array_wrapper<T[D1]...[DN]>
The array wrapper is a struct that owns a copy of an N-dimensional C-style array. Since a C-style array cannot be passed by value in C++ (an array parameter decays to a pointer), this wrapper provides the deep-copy semantics that CUDA extended lambdas need.
Primary Template (Trap)
The unspecialized primary template contains only a static_assert that always fires:
template <typename T>
struct __nv_lambda_array_wrapper {
static_assert(sizeof(T) == 0,
"nvcc internal error: unexpected failure in capturing array variable");
};
This catches any array dimensionality that falls outside the range [2, 8]. Since sizeof(T) is never zero for a real type, the assertion always fails if the primary template is instantiated.
Generated Specializations
For each rank N from 2 through 8, sub_6BC290 generates a partial specialization:
// Example: rank 3
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_array_wrapper<T [D1][D2][D3]> {
T arr[D1][D2][D3];
__nv_lambda_array_wrapper(const T in[D1][D2][D3]) {
for(size_t i1 = 0; i1 < D1; ++i1)
for(size_t i2 = 0; i2 < D2; ++i2)
for(size_t i3 = 0; i3 < D3; ++i3)
arr[i1][i2][i3] = in[i1][i2][i3];
}
};
The constructor takes a const T in[D1]...[DN] parameter and performs element-by-element copy via nested for-loops. Each loop variable is named i1 through iN and iterates from 0 to D1 through DN respectively. The assignment arr[i1]...[iN] = in[i1]...[iN] copies each element.
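The interaction between the field-type trait, the array wrapper, and a decayed constructor parameter can be checked with a small host-only program. This is a sketch, not the emitted text: array_wrapper, field_type, and holder are hypothetical stand-ins for __nv_lambda_array_wrapper, __nv_lambda_field_type, and a one-capture wrapper, and only the two-dimensional specialization is written out.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Stand-in for __nv_lambda_array_wrapper; the primary is left undefined here.
template <typename T>
struct array_wrapper;

template <typename T, std::size_t D1, std::size_t D2>
struct array_wrapper<T[D1][D2]> {
    T arr[D1][D2];
    // The parameter decays to a pointer to the first row; the loops copy
    // every element into the owned array.
    array_wrapper(const T in[D1][D2]) {
        for (std::size_t i1 = 0; i1 < D1; ++i1)
            for (std::size_t i2 = 0; i2 < D2; ++i2)
                arr[i1][i2] = in[i1][i2];
    }
};

// Stand-in for __nv_lambda_field_type: identity unless T is a 2D array.
template <typename T>
struct field_type {
    typedef T type;
};

template <typename T, std::size_t D1, std::size_t D2>
struct field_type<T[D1][D2]> {
    typedef array_wrapper<T[D1][D2]> type;
};

// Stand-in for a one-capture wrapper: F1 in1 decays when F1 is an array.
template <typename F1>
struct holder {
    typename field_type<F1>::type f1;
    holder(F1 in1) : f1(in1) {}
};

static_assert(std::is_same<field_type<int>::type, int>::value,
              "scalars map to themselves");

// Returns true if the holder kept its own copy after the source changed.
bool deep_copy_survives_mutation() {
    float m[2][3] = {{1, 2, 3}, {4, 5, 6}};
    holder<float[2][3]> h(m);
    m[0][0] = 99.0f;
    return h.f1.arr[0][0] == 1.0f && h.f1.arr[1][2] == 6.0f;
}
```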
Reconstructed Output for Rank 4
What sub_6BC290 actually emits for a 4-dimensional array (directly from the decompiled string fragments):
template<typename T, size_t D1, size_t D2, size_t D3, size_t D4>
struct __nv_lambda_array_wrapper<T [D1][D2][D3][D4]> {
T arr[D1][D2][D3][D4];
__nv_lambda_array_wrapper(const T in[D1][D2][D3][D4]) {
for(size_t i1 = 0; i1  < D1; ++i1)
for(size_t i2 = 0; i2  < D2; ++i2)
for(size_t i3 = 0; i3  < D3; ++i3)
for(size_t i4 = 0; i4  < D4; ++i4)
arr[i1][i2][i3][i4] = in[i1][i2][i3][i4];
}
};
Note the double-space before < in the for condition -- this is present in the actual emitted code (visible in the decompiled sprintf format string "for(size_t i%u = 0; i%u  < D%u; ++i%u)").
sub_6BC290: emit_array_capture_helpers
Address 0x6BC290, 183 decompiled lines, in nv_transforms.c. Takes a single argument: void (*a1)(const char *), the text emission callback.
Algorithm
The function has two major loops, each iterating rank from 2 to 8.
Loop 1 -- Array wrapper specializations:
for rank = 2 to 8:
    emit "template<typename T"
    for d = 1 to rank-1:
        emit ", size_t D{d}"
    emit ">\nstruct __nv_lambda_array_wrapper<T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> {T arr"
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit ";\n__nv_lambda_array_wrapper(const T in"
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit ") {"
    for d = 1 to rank-1:
        emit "\nfor(size_t i{d} = 0; i{d} < D{d}; ++i{d})"
    emit " arr"
    for d = 1 to rank-1:
        emit "[i{d}]"
    emit " = in"
    for d = 1 to rank-1:
        emit "[i{d}]"
    emit ";\n}\n};\n"
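For a reimplementation, the fragment-by-fragment emission above can be collapsed into a string builder. The sketch below is illustrative only: subscripts and array_wrapper_text are hypothetical helpers, the function is parameterized directly by the number of dimensions to emit rather than by the loop counter, and whitespace is normalized rather than byte-exact.

```cpp
#include <cassert>
#include <string>

// Builds "[D1][D2]...[Dn]" (prefix "D") or "[i1][i2]...[in]" (prefix "i").
std::string subscripts(const std::string& prefix, unsigned dims) {
    std::string out;
    for (unsigned d = 1; d <= dims; ++d)
        out += "[" + prefix + std::to_string(d) + "]";
    return out;
}

// Produces one array-wrapper specialization for the given dimension count.
std::string array_wrapper_text(unsigned dims) {
    std::string out = "template<typename T";
    for (unsigned d = 1; d <= dims; ++d)
        out += ", size_t D" + std::to_string(d);
    out += ">\nstruct __nv_lambda_array_wrapper<T " + subscripts("D", dims);
    out += "> {\nT arr" + subscripts("D", dims);
    out += ";\n__nv_lambda_array_wrapper(const T in" + subscripts("D", dims);
    out += ") {";
    for (unsigned d = 1; d <= dims; ++d) {
        const std::string s = std::to_string(d);
        out += "\nfor(size_t i" + s + " = 0; i" + s + " < D" + s +
               "; ++i" + s + ")";
    }
    out += " arr" + subscripts("i", dims);
    out += " = in" + subscripts("i", dims);
    out += ";\n}\n};\n";
    return out;
}
```

Building a std::string instead of calling the emit callback per fragment is a design choice for testability; the binary streams each fragment through the callback as it is formatted.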
Loop 2 -- Field type specializations:
First emits the primary __nv_lambda_field_type:
emit "template <typename T>\nstruct __nv_lambda_field_type {\ntypedef T type;};"
Then for each rank from 2 to 8, emits two specializations (non-const and const):
for rank = 2 to 8:
    // Non-const specialization
    emit "template<typename T"
    for d = 1 to rank-1:
        emit ", size_t D{d}"
    emit ">\nstruct __nv_lambda_field_type<T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> {\ntypedef __nv_lambda_array_wrapper<T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> type;\n};\n"
    // Const specialization
    emit "template<typename T"
    for d = 1 to rank-1:
        emit ", size_t D{d}"
    emit ">\nstruct __nv_lambda_field_type<const T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> {\ntypedef const __nv_lambda_array_wrapper<T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> type;\n};\n"
Stack Usage
Two stack buffers: v33[1024] for the for-loop lines (the sprintf format includes four %u substitutions) and s[1064] for the dimension fragments (smaller format: "%s%u%s" with prefix/suffix).
Emission Order in Preamble
sub_6BC290 is called from sub_6BCC20 (nv_emit_lambda_preamble) at step 3, after __nvdl_remove_ref/__nvdl_remove_const trait helpers and __nv_dl_tag, but before the primary __nv_dl_wrapper_t definition. This ordering is critical: __nv_dl_wrapper_t field declarations reference __nv_lambda_field_type, which in turn references __nv_lambda_array_wrapper, so both must be defined first.
Capture Type Emission (sub_46E640)
Address 0x46E640, approximately 400 decompiled lines, in cp_gen_be.c. Confirmed identity: nv_gen_extended_lambda_capture_types (assert string at line 17368 of cp_gen_be.c).
This function emits the template type arguments that appear in a wrapper struct instantiation. For a device lambda wrapper __nv_dl_wrapper_t<Tag, F1, F2, ..., FN>, this function generates the F1 through FN types. Each type must precisely match the declared type of the captured variable, with references and top-level const stripped.
Input
Takes __int64 **a1 -- a pointer to the lambda info structure. The capture list is a linked list starting at *a1 (offset +0 of the lambda info). Each capture entry is a node with:
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | next pointer (linked list) |
| +8 | 8 | variable_entity -- pointer to the captured variable's entity node |
| +24 | 8 | init_capture_scope -- scope for init-capture expressions |
| +32 | 1 | flags_byte_1 -- bit 0 = init-capture, bit 3 = explicit this capture, bit 7 = has braces/parens |
| +33 | 1 | flags_byte_2 -- bit 0 = paren-init (vs brace-init) |
The variable entity at offset +8 has:
- Offset +8: name string (null if *this capture)
- Offset +163: sign bit (bit 7) -- if set, this is a *this or this capture
Algorithm: Three Capture Kinds
The function walks the capture list and for each entry, dispatches on two conditions: the init-capture flag (i[4] & 1) and the *this flag (byte at entity+163 sign bit).
Case 1: Regular variable capture ((i[4] & 1) == 0 and the signed byte at entity+163 >= 0)
Emits:
, typename __nvdl_remove_ref<decltype(varname)>::type
Where varname is the string at entity+8. This strips reference qualification from the variable's type. The decltype(varname) ensures the type is deduced from the actual declaration, not from any decay.
Case 2: *this capture ((i[4] & 1) == 0 and the signed byte at entity+163 < 0)
Two sub-cases, depending on whether the capture is an explicit this (the object pointer) or a traditional *this (a by-value copy of the object, available since C++17):
If i[4] & 8 (explicit this):
, decltype(this) const
Otherwise (traditional *this):
, typename __nvdl_remove_const<typename __nvdl_remove_ref<decltype(*this) > ::type> :: type
If the lambda is non-const (mutable), const is not appended. The mutable check reads (byte)a1[3] & 2 -- if clear, appends const.
Case 3: Init-capture ((i[4] & 1) != 0)
Emits:
, typename __nv_lambda_trait_remove_cv<typename __nvdl_remove_ref<decltype({expr})>::type>::type
Where {expr} is the init-capture expression, emitted by calling sub_46D910 (the expression code generator). The expression is wrapped in {...} (brace-init) or (...) (paren-init) depending on byte+33 bit 0. The additional __nv_lambda_trait_remove_cv wrapper strips top-level const and volatile from the deduced type.
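What the emitted type expressions evaluate to can be checked with standard traits. The sketch below is an approximation under stated assumptions: std::remove_reference and std::remove_cv stand in for __nvdl_remove_ref and __nv_lambda_trait_remove_cv, whose emitted definitions are not shown in this section, and the helper function names are hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical aliases for the injected traits (assumed to behave like std's).
template <typename T>
using remove_ref_t = typename std::remove_reference<T>::type;
template <typename T>
using remove_cv_t = typename std::remove_cv<T>::type;

// Case 1: decltype(varname) names the declared type, so a reference
// variable yields T& and the trait strips it back to T.
bool regular_capture_strips_reference() {
    int value = 1;
    int& ref = value;
    (void)ref;
    return std::is_same<remove_ref_t<decltype(ref)>, int>::value;
}

// Case 1 with an array: decltype keeps the full array type (no decay),
// which is what lets the field-type trait select the array wrapper.
bool regular_capture_keeps_array_extents() {
    float matrix[3][4] = {};
    (void)matrix;
    return std::is_same<remove_ref_t<decltype(matrix)>, float[3][4]>::value;
}

// Case 3: decltype of a parenthesized lvalue is a reference to a possibly
// const type; both layers are stripped.
bool init_capture_strips_ref_and_cv() {
    const int c = 3;
    (void)c;
    return std::is_same<remove_cv_t<remove_ref_t<decltype((c))> >, int>::value;
}
```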
GCC Diagnostic Guards
When dword_126E1E8 is set (indicating the host compiler is GCC-based), the init-capture path wraps the decltype expression in pragma guards:
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunevaluated-expression"
decltype({expr})
#pragma GCC diagnostic pop
This suppresses GCC warnings about using decltype on expressions that are not evaluated. The flag dword_126E1E8 is likely set when the target host compiler is GCC rather than MSVC or Clang.
Character-by-Character Emission
The decompiled code reveals that sub_46E640 does not use sub_467E50 (emit string) for all output. For short constant strings like ", ", "typename __nvdl_remove_ref<decltype(", etc., it emits character-by-character via putc(ch, stream) with a manual loop. This is a common pattern in EDG's code generator where inline string emission avoids function-call overhead for fixed text.
The character counter dword_106581C tracks the column position for line-wrapping decisions. Each emission path increments it by the string length.
Capture Value Emission (sub_46E550)
Address 0x46E550, 60 decompiled lines, in cp_gen_be.c. This function emits the actual values passed to the wrapper constructor -- the runtime expressions that initialize each captured field.
Algorithm
Walks the same capture linked list. For each entry, emits , followed by:
| Condition | Output |
|---|---|
| Regular variable ((byte+32 & 1) == 0, entity+163 >= 0) | Variable name string from entity+8 |
| Explicit this ((byte+32 & 8) != 0, entity+163 < 0) | this |
| Traditional *this ((byte+32 & 8) == 0, entity+163 < 0) | *this |
| Init-capture ((byte+32 & 1) != 0) | The init-capture expression via sub_46D910 |
For init-captures, the expression is wrapped in (...) or {...} based on bit 0 of byte+33:
- Bit 0 set: paren-init (expr)
- Bit 0 clear: brace-init {expr}
Relationship to Type Emission
sub_46E550 and sub_46E640 are called in sequence by the per-lambda wrapper emitter (sub_47B890, gen_lambda). The type emission produces the template type parameters; the value emission produces the constructor arguments. Together they construct an expression like:
__nv_dl_wrapper_t<
__nv_dl_tag<decltype(&Closure::operator()), &Closure::operator(), 42>,
typename __nvdl_remove_ref<decltype(x)>::type,
typename __nvdl_remove_ref<decltype(y)>::type
>(tag, x, y)
Bitmap System
Rather than generating wrapper specializations for all possible capture counts (0 through 1023), cudafe++ maintains two 1024-bit bitmaps that record which counts were actually observed during frontend parsing. During preamble emission, only the specializations for set bits are generated.
Memory Layout
unk_1286980 (device lambda bitmap):
Address: 0x1286980
Size: 128 bytes = 16 x uint64_t = 1024 bits
Bit N: __nv_dl_wrapper_t specialization for N captures needed
unk_1286900 (host-device lambda bitmap):
Address: 0x1286900
Size: 128 bytes = 16 x uint64_t = 1024 bits
Bit N: __nv_hdl_wrapper_t specializations for N captures needed
sub_6BCBC0: nv_reset_capture_bitmasks
Address 0x6BCBC0, 9 decompiled lines. Called before each translation unit.
memset(&unk_1286980, 0, 0x80); // Clear device bitmap (128 bytes)
memset(&unk_1286900, 0, 0x80); // Clear host-device bitmap (128 bytes)
sub_6BCBF0: nv_record_capture_count
Address 0x6BCBF0, 13 decompiled lines. Called from scan_lambda (sub_447930) after counting captures.
_QWORD *result = &unk_1286900;   // Default: host-device bitmap
if (!a1)
    result = &unk_1286980;       // a1 == 0: device bitmap
result[a2 >> 6] |= 1LL << a2;    // Set bit a2
Parameters:
- a1 (int): Bitmap selector. 0 = device, non-zero = host-device.
- a2 (unsigned): Capture count (0-1023).
The bit-set logic: a2 >> 6 selects the uint64_t word (divides by 64), and 1LL << a2 sets the appropriate bit within that word. A shift count of 64 or more is formally undefined in C, but the x86-64 shift instruction masks a 64-bit shift count to its low 6 bits, so in the compiled binary the word index and bit index are consistent.
Note the mapping inversion: a1 == 0 maps to unk_1286980 (device), while a1 != 0 maps to unk_1286900 (host-device). This is counterintuitive but confirmed by the decompiled code.
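A portable version of the setter makes the masking explicit instead of relying on the hardware shift behaviour. This is a minimal sketch: the CaptureBitmaps struct and both function names are hypothetical, while the selector convention (0 = device, non-zero = host-device) follows the decompiled code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the two 1024-bit bitmaps (16 words each).
struct CaptureBitmaps {
    std::uint64_t device[16] = {};       // models unk_1286980
    std::uint64_t host_device[16] = {};  // models unk_1286900
};

// selector == 0 picks the device bitmap, anything else the host-device one.
// count must be in the range 0..1023.
void record_capture_count(CaptureBitmaps& maps, int selector, unsigned count) {
    std::uint64_t* words = selector ? maps.host_device : maps.device;
    // Mask explicitly: a shift by 64 or more is undefined in C and C++.
    words[count >> 6] |= std::uint64_t{1} << (count & 63);
}

bool is_recorded(const CaptureBitmaps& maps, int selector, unsigned count) {
    const std::uint64_t* words = selector ? maps.host_device : maps.device;
    return ((words[count >> 6] >> (count & 63)) & 1u) != 0;
}
```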
Bitmap Scan in nv_emit_lambda_preamble
The scan loop in sub_6BCC20 processes each bitmap as 16 uint64_t words:
// Device lambda bitmap scan
uint64_t *ptr = (uint64_t *)&unk_1286980;
unsigned int idx = 0;
do {
    uint64_t word = *ptr;
    unsigned int limit = idx + 64;
    do {
        if (idx != 0 && (word & 1))
            sub_6BB790(idx, callback);  // emit_device_lambda_wrapper_specialization
        ++idx;
        word >>= 1;
    } while (limit != idx);
    ++ptr;
} while (idx != 1024);
// Host-device lambda bitmap scan
ptr = (uint64_t *)&unk_1286900;
idx = 0;
do {
    uint64_t word = *ptr;
    unsigned int limit = idx + 64;
    do {
        while ((word & 1) == 0) {       // Skip unset bits
            ++idx;
            word >>= 1;
            if (idx == limit) goto next_word;
        }
        sub_6BBB10(0, idx, callback);   // Non-mutable, HasFuncPtrConv=false
        sub_6BBEE0(0, idx, callback);   // Mutable, HasFuncPtrConv=false
        sub_6BBB10(1, idx, callback);   // Non-mutable, HasFuncPtrConv=true
        sub_6BBEE0(1, idx++, callback); // Mutable, HasFuncPtrConv=true
        word >>= 1;
    } while (idx != limit);
next_word:
    ++ptr;
} while (idx != 1024);
Key differences between the two scans:
- The device scan skips bit 0 (if (idx != 0 && ...)). The zero-capture case is handled by the primary template and its explicit <Tag> specialization already emitted as static text.
- The host-device scan does not skip bit 0 -- zero-capture host-device lambdas (stateless lambdas with __host__ __device__) still need wrapper specializations because the host-device wrapper has function-pointer-conversion variants.
- Each set bit in the host-device bitmap triggers four emitter calls (non-mutable/mutable x HasFuncPtrConv false/true), compared to one call per bit for device lambdas.
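The two scans can be restated as a function that returns the emitter calls in order, which is convenient for checking a reimplementation against the behaviour described above. This is a sketch: plan_emission is a hypothetical name, the calls are recorded as strings instead of being executed, and the word-at-a-time loop structure is flattened into a simple bit index.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Returns the emitter calls in emission order for the two 1024-bit bitmaps.
std::vector<std::string> plan_emission(const std::uint64_t (&device)[16],
                                       const std::uint64_t (&host_device)[16]) {
    std::vector<std::string> calls;

    // Device scan: bit 0 is skipped (zero captures use the static text).
    for (unsigned n = 1; n < 1024; ++n) {
        if ((device[n >> 6] >> (n & 63)) & 1u)
            calls.push_back("sub_6BB790(" + std::to_string(n) + ")");
    }

    // Host-device scan: bit 0 is included; four emitter calls per set bit.
    for (unsigned n = 0; n < 1024; ++n) {
        if (!((host_device[n >> 6] >> (n & 63)) & 1u))
            continue;
        const std::string s = std::to_string(n);
        calls.push_back("sub_6BBB10(0," + s + ")");  // non-mutable, no conv
        calls.push_back("sub_6BBEE0(0," + s + ")");  // mutable, no conv
        calls.push_back("sub_6BBB10(1," + s + ")");  // non-mutable, conv
        calls.push_back("sub_6BBEE0(1," + s + ")");  // mutable, conv
    }
    return calls;
}
```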
How Fields Use __nv_lambda_field_type
When sub_6BB790 (emit_device_lambda_wrapper_specialization) generates a wrapper struct for N captures, each field is declared as:
typename __nv_lambda_field_type<F1>::type f1;
typename __nv_lambda_field_type<F2>::type f2;
// ... through fN
This indirection through __nv_lambda_field_type means:
- If F1 is int, the field type is int (identity via primary template).
- If F1 is float[3][4], the field type is __nv_lambda_array_wrapper<float[3][4]>, which stores a deep copy.
- If F1 is const double[2][2], the field type is const __nv_lambda_array_wrapper<double[2][2]>.
The constructor mirrors this pattern:
__nv_dl_wrapper_t(Tag, F1 in1, F2 in2, ..., FN inN)
: f1(in1), f2(in2), ..., fN(inN) { }
For array captures, the f1(in1) initialization invokes __nv_lambda_array_wrapper's constructor, which performs the element-by-element copy. For scalar captures, it is a trivial copy/move.
End-to-End Example
Given user code:
int x = 42;
float matrix[3][4];
auto lam = [x, matrix]() __device__ { /* use x and matrix */ };
cudafe++ produces:
- Frontend (scan_lambda): Counts 2 captures. Calls sub_6BCBF0(0, 2) to set bit 2 in the device bitmap.
- Preamble emission (sub_6BCC20): Scans the device bitmap, finds bit 2 set. Calls sub_6BB790(2, emit) which generates:
template <typename Tag, typename F1, typename F2>
struct __nv_dl_wrapper_t<Tag, F1, F2> {
typename __nv_lambda_field_type<F1>::type f1;
typename __nv_lambda_field_type<F2>::type f2;
__nv_dl_wrapper_t(Tag, F1 in1, F2 in2) : f1(in1), f2(in2) { }
template <typename...U1>
int operator()(U1...) { return 0; }
};
- Per-lambda emission (sub_47B890 calling sub_46E640 and sub_46E550):
__nv_dl_wrapper_t<
__nv_dl_tag<decltype(&ClosureType::operator()), &ClosureType::operator(), 0>,
typename __nvdl_remove_ref<decltype(x)>::type, // int
typename __nvdl_remove_ref<decltype(matrix)>::type // float[3][4]
>(tag, x, matrix)
- Template instantiation: The host compiler instantiates the wrapper. F1 = int so __nv_lambda_field_type<int>::type = int (identity). F2 = float[3][4] so __nv_lambda_field_type<float[3][4]>::type = __nv_lambda_array_wrapper<float[3][4]>, which selects the two-dimensional specialization with its doubly nested for-loop constructor.
Function Map
| Address | Name (recovered) | Source | Lines | Role |
|---|---|---|---|---|
| sub_6BC290 | emit_array_capture_helpers | nv_transforms.c | 183 | Emit __nv_lambda_array_wrapper (ranks 2-8) and __nv_lambda_field_type specializations |
| sub_6BCBC0 | nv_reset_capture_bitmasks | nv_transforms.c | 9 | Zero both 128-byte bitmaps at translation unit start |
| sub_6BCBF0 | nv_record_capture_count | nv_transforms.c | 13 | Set bit N in device or host-device bitmap |
| sub_6BCC20 | nv_emit_lambda_preamble | nv_transforms.c | 244 | Master emitter -- scans bitmaps, calls all sub-emitters |
| sub_6BB790 | emit_device_lambda_wrapper_specialization | nv_transforms.c | 191 | Emit __nv_dl_wrapper_t<Tag, F1..FN> for N captures |
| sub_46E640 | nv_gen_extended_lambda_capture_types | cp_gen_be.c | ~400 | Emit decltype-based template type args for each capture |
| sub_46E550 | (capture value emitter) | cp_gen_be.c | ~60 | Emit variable names / this / *this / init-capture exprs |
| sub_46D910 | (expression code generator) | cp_gen_be.c | -- | Called by both sub_46E640 and sub_46E550 for init-captures |
| sub_467E50 | (emit string to output) | cp_gen_be.c | -- | String emission helper used by code generator |
| sub_467DA0 | (column tracking helper) | cp_gen_be.c | -- | Called when dword_1065818 is set for line-length management |
Global State
| Variable | Address | Size | Purpose |
|---|---|---|---|
| unk_1286980 | 0x1286980 | 128 bytes | Device lambda capture-count bitmap |
| unk_1286900 | 0x1286900 | 128 bytes | Host-device lambda capture-count bitmap |
| dword_106581C | 0x106581C | 4 bytes | Column counter for output line tracking |
| dword_1065818 | 0x1065818 | 4 bytes | Line-length management enabled flag |
| dword_126E1E8 | 0x126E1E8 | 4 bytes | GCC-compatible host compiler flag (enables diagnostic pragmas) |
| stream | (global) | 8 bytes | Output FILE* for code generation |
Related Pages
- Extended Lambda Overview -- end-to-end lambda pipeline and lambda_info structure
- Device Lambda Wrapper -- __nv_dl_wrapper_t template anatomy
- Host-Device Lambda Wrapper -- __nv_hdl_wrapper_t type-erased design
- Preamble Injection -- sub_6BCC20 emission sequence in full detail
- Lambda Restrictions -- validation errors for malformed captures
Preamble Injection
The entire CUDA extended lambda template library -- every __nv_dl_wrapper_t, every __nv_hdl_wrapper_t, every trait helper and detection macro -- enters the compilation through a single function: sub_6BCC20 (nv_emit_lambda_preamble). This 244-line function in nv_transforms.c accepts a void(*emit)(const char*) callback and produces raw C++ source text that is injected into the .int.c output stream. The preamble is emitted exactly once per translation unit, triggered by a sentinel type declaration named __nv_lambda_preheader_injection. The trigger mechanism lives in sub_4864F0 (gen_type_decl in cp_gen_be.c), which string-compares each type declaration's name against the sentinel marker, emits a synthetic #line directive, and then calls the master emitter.
The preamble contains 17 logical emission steps, ranging from simple type traits (4 lines each) to bitmap-driven loops that generate hundreds of template specializations. The design is driven by a critical optimization: rather than emitting all 1024 possible capture-count specializations for each wrapper type, cudafe++ maintains two 1024-bit bitmaps (unk_1286980 for device lambdas, unk_1286900 for host-device lambdas) that track which capture counts were actually used during frontend parsing. The preamble emitter scans these bitmaps and generates only the specializations that the translation unit requires.
Key Facts
| Property | Value |
|---|---|
| Master emitter | sub_6BCC20 (nv_emit_lambda_preamble, 244 lines, nv_transforms.c) |
| Trigger function | sub_4864F0 (gen_type_decl, 751 lines, cp_gen_be.c) |
| Emit callback (typical) | sub_467E50 (raw text output to .int.c stream) |
| Sentinel type name | __nv_lambda_preheader_injection |
| Synthetic source file | "nvcc_internal_extended_lambda_implementation" |
| Enable flag | dword_106BF38 (--extended-lambda / --expt-extended-lambda) |
| Device bitmap | unk_1286980 (128 bytes = 16 x uint64 = 1024 bits) |
| Host-device bitmap | unk_1286900 (128 bytes = 16 x uint64 = 1024 bits) |
| C++17 noexcept gate | dword_126E270 (controls noexcept trait variants) |
| One-shot guarantee | Once emitted, the sentinel type is wrapped in #if 0 / #endif |
| Max capture count | 1024 (bit index range 0..1023) |
| Array dimension range | 2D through 8D (7 specializations per wrapper) |
Trigger Mechanism: sub_4864F0 (gen_type_decl)
The preamble is not emitted eagerly at the start of compilation. Instead, the EDG frontend inserts a synthetic type declaration named __nv_lambda_preheader_injection into the IL at the point where the lambda template library is needed. During backend code generation, sub_4864F0 (the type declaration emitter in cp_gen_be.c) encounters this declaration and performs the following sequence:
// sub_4864F0, decompiled lines 200-242
// Check: is this a type tagged with the preheader marker? (bit at v4-8 & 0x10)
if ((*(_BYTE *)(v4 - 8) & 0x10) != 0)
{
    if (dword_106BF38) // --extended-lambda enabled?
    {
        v18 = *(_QWORD *)(v4 + 8); // get type name pointer
        if (v18)
        {
            // Compare name against "__nv_lambda_preheader_injection" (31 chars + NUL = 32 bytes)
            v30 = "__nv_lambda_preheader_injection";
            v31 = 32; // comparison length
            do {
                if (!v31) break;
                v29 = *(_BYTE *)v18++ == *v30++;
                --v31;
            } while (v29);
            if (v29) // name matched
            {
                if (dword_106581C) // pending newline needed
                    sub_467D60(); // emit newline
                // Emit #line directive pointing to synthetic source file
                v32 = "#line";
                if (dword_126E1DC) // shorthand mode
                    v32 = "#";
                sub_467E50(v32);
                sub_467E50(" 1 \"nvcc_internal_extended_lambda_implementation\"");
                if (dword_106581C)
                    sub_467D60();
                // THE CRITICAL CALL: emit entire lambda template library
                sub_6BCC20(sub_467E50);
                dword_1065820 = 0; // reset line tracking state
                qword_1065828 = 0;
            }
        }
    }
    // Suppress the sentinel type from host compiler output
    sub_46BC80("#if 0");
    --dword_1065834;
    sub_467D60();
}
Trigger Conditions
Three conditions must all be true for preamble emission:
- Marker bit set -- The type declaration node has bit 0x10 set at offset -8 (the IL node header flags). This bit marks NVIDIA-injected synthetic declarations.
- Extended lambda mode active -- dword_106BF38 is nonzero, meaning --extended-lambda (or --expt-extended-lambda) was passed to nvcc.
- Name matches sentinel -- The type's name at offset +8 is byte-equal to "__nv_lambda_preheader_injection" (31 characters plus the terminating NUL, 32 bytes in total; the comparison loop runs up to 32 iterations, so the NUL is compared as well).
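The three checks can be read as a single predicate. The following is a minimal sketch under stated assumptions, not the recovered code: TypeDeclView, names_equal, and should_emit_preamble are hypothetical names, and the struct models only the two fields the decompilation reads (the flag byte at offset -8 and the name pointer at offset +8).

```cpp
#include <cstdint>

// Hypothetical view of the two node fields the trigger reads.
struct TypeDeclView {
    std::uint8_t header_flags;  // byte at node offset -8
    const char*  name;          // pointer at node offset +8 (may be null)
};

// Byte-wise equality that also compares the terminating NUL.
constexpr bool names_equal(const char* a, const char* b) {
    return *a != *b ? false : (*a == '\0' ? true : names_equal(a + 1, b + 1));
}

// True only when all three trigger conditions hold.
constexpr bool should_emit_preamble(const TypeDeclView& d, bool extended_lambda_enabled) {
    return (d.header_flags & 0x10) != 0      // 1. marker bit set
        && extended_lambda_enabled           // 2. dword_106BF38 is nonzero
        && d.name != nullptr                 //    name pointer present
        && names_equal(d.name, "__nv_lambda_preheader_injection");  // 3. sentinel match
}

static_assert(should_emit_preamble(TypeDeclView{0x10, "__nv_lambda_preheader_injection"}, true),
              "all three conditions satisfied");
```

Note that in the decompiled code the #if 0 suppression depends only on the marker bit, so this predicate covers preamble emission, not sentinel suppression.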
Synthetic Source File Context
Before calling sub_6BCC20, the trigger emits:
#line 1 "nvcc_internal_extended_lambda_implementation"
This #line directive serves two purposes: it changes the apparent source file for any diagnostics emitted during template parsing, and it provides a recognizable marker in the generated .int.c file for debugging. All lambda template infrastructure appears to originate from "nvcc_internal_extended_lambda_implementation" rather than from the user's source file. The dword_126E1DC flag selects between #line and the shorthand # form for the line directive.
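The effect of the directive on the host compiler can be checked with standard C++ alone. This sketch assumes nothing about cudafe++ beyond the emitted directive text; same_text and the two constants are hypothetical names introduced for the demonstration.

```cpp
// Hypothetical helper: constexpr text comparison for the check below.
constexpr bool same_text(const char* a, const char* b) {
    return *a != *b ? false : (*a == '\0' ? true : same_text(a + 1, b + 1));
}

// The directive renumbers what follows: the next source line becomes line 1
// of the named pseudo-file.
#line 1 "nvcc_internal_extended_lambda_implementation"
constexpr int kLineAfterDirective = __LINE__;
constexpr const char* kFileAfterDirective = __FILE__;

static_assert(kLineAfterDirective == 1, "first line after the directive is line 1");
static_assert(same_text(kFileAfterDirective, "nvcc_internal_extended_lambda_implementation"),
              "__FILE__ now reports the synthetic name");
```

Any diagnostic raised inside the preamble text is therefore attributed to the synthetic file name until the next line directive restores the user's file.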
One-Shot Guarantee and Sentinel Suppression
After the preamble is emitted, the sentinel type declaration is wrapped in #if 0 / #endif. The #if 0 is emitted immediately after the preamble call (line 239: sub_46BC80("#if 0")). The matching #endif is emitted later when sub_4864F0 reaches the closing path for this declaration type (lines 736-745):
else if ((*(_BYTE *)(v4 - 8) & 0x10) != 0)
{
    if (dword_106581C)
        sub_467D60();
    ++dword_1065834;
    sub_468190("#endif");
    --dword_1065834;
    sub_467D60();
    dword_1065820 = 0;
    qword_1065828 = 0;
}
The sentinel type __nv_lambda_preheader_injection never reaches the host compiler's type system -- it exists solely as a positional marker in the IL. Because the EDG frontend inserts exactly one such declaration per translation unit, and the backend processes declarations sequentially, the preamble is guaranteed to be emitted exactly once.
After emission, dword_1065820 (output line counter) and qword_1065828 (output state pointer) are reset to zero, ensuring subsequent #line directives correctly track the user's source file.
Master Emitter: sub_6BCC20
The function signature:
__int64 __fastcall sub_6BCC20(void (__fastcall *a1)(const char *));
The single parameter a1 is an output callback. In production, this is always sub_467E50 -- the function that writes raw text to the .int.c output stream. Every a1("...") call appends the given string literal to the output. The function has no other state parameters; all needed state (bitmaps, C++17 flag) is read from globals.
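The callback design means the emitter can be pointed at any sink with a matching signature. As an illustrative sketch (emit_fn, g_captured, capture_emit, emit_mini_preamble, and run_mini_preamble are all hypothetical names, and the two emitted strings are just the step 2 text), a capture buffer can stand in for the .int.c stream:

```cpp
#include <cassert>
#include <string>

// Same shape as the recovered a1 parameter: a plain function pointer.
using emit_fn = void (*)(const char*);

// Hypothetical capture buffer standing in for the .int.c stream.
static std::string g_captured;
static void capture_emit(const char* text) { g_captured += text; }

// Hypothetical miniature emitter: fixed order, all text flows through `emit`.
static void emit_mini_preamble(emit_fn emit) {
    assert(emit != nullptr);
    emit("template <typename U, U func, unsigned>\n");
    emit("struct __nv_dl_tag { };\n");
}

// Runs the emitter against the capture callback and returns the collected text.
static std::string run_mini_preamble() {
    g_captured.clear();
    emit_mini_preamble(capture_emit);
    return g_captured;
}
```

Because the callback carries no context pointer, any state a sink needs has to live in globals, which matches the way the real emitter reads its bitmaps and flags.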
The 17 emission steps run in a fixed order. Steps 6 and 9 contain bitmap-scanning loops that conditionally call sub-emitters based on which capture counts were registered during frontend parsing. Step 11 is the only step whose text is gated on a flag (the C++17 noexcept flag).
Step 1: Type Removal Traits and Wrapper Helper Macro
The first a1(...) call emits the largest single string literal in the function -- three foundational metaprogramming utilities:
#define __NV_LAMBDA_WRAPPER_HELPER(X, Y) decltype(X), Y
template <typename T>
struct __nvdl_remove_ref { typedef T type; };
template<typename T>
struct __nvdl_remove_ref<T&> { typedef T type; };
template<typename T>
struct __nvdl_remove_ref<T&&> { typedef T type; };
template <typename T, typename... Args>
struct __nvdl_remove_ref<T(&)(Args...)> {
typedef T(*type)(Args...);
};
template <typename T>
struct __nvdl_remove_const { typedef T type; };
template <typename T>
struct __nvdl_remove_const<T const> { typedef T type; };
__NV_LAMBDA_WRAPPER_HELPER(X, Y) expands to decltype(X), Y. It provides the <U, func> pair for tag type construction from a single expression. At each lambda wrapper call site, the per-lambda emitter (sub_47B890) generates __NV_LAMBDA_WRAPPER_HELPER(&Closure::operator(), &Closure::operator()), which expands to decltype(&Closure::operator()), &Closure::operator().
__nvdl_remove_ref strips lvalue and rvalue references, with a special case for function references (T(&)(Args...) -> T(*)(Args...)). __nvdl_remove_const strips top-level const. Both are used during capture type emission to normalize captured variable types before passing them as template arguments to wrapper structs.
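The behavior of the two traits can be verified at compile time. The block below restates the emitted templates so that it compiles on its own; the static_assert lines are additions for illustration and are not part of the preamble.

```cpp
#include <type_traits>

// Restated from the emitted preamble so the sketch is self-contained.
template <typename T> struct __nvdl_remove_ref { typedef T type; };
template <typename T> struct __nvdl_remove_ref<T&> { typedef T type; };
template <typename T> struct __nvdl_remove_ref<T&&> { typedef T type; };
template <typename T, typename... Args>
struct __nvdl_remove_ref<T(&)(Args...)> { typedef T (*type)(Args...); };
template <typename T> struct __nvdl_remove_const { typedef T type; };
template <typename T> struct __nvdl_remove_const<T const> { typedef T type; };

// Illustrative checks (not emitted by cudafe++).
static_assert(std::is_same<__nvdl_remove_ref<int&>::type, int>::value, "lvalue ref stripped");
static_assert(std::is_same<__nvdl_remove_ref<int&&>::type, int>::value, "rvalue ref stripped");
static_assert(std::is_same<__nvdl_remove_ref<float(&)(int)>::type, float(*)(int)>::value,
              "function reference becomes function pointer");
static_assert(std::is_same<__nvdl_remove_const<const int>::type, int>::value, "top-level const stripped");
static_assert(std::is_same<__nvdl_remove_const<const int*>::type, const int*>::value,
              "pointee const is not top-level and is kept");
```

The function-reference case works because T(&)(Args...) is more specialized than T&, so partial ordering selects it without ambiguity.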
Step 2: Device Lambda Tag
template <typename U, U func, unsigned>
struct __nv_dl_tag { };
The device lambda tag type. U is the type of the lambda's operator(), func is a non-type template parameter holding the pointer to that operator, and the unsigned disambiguates lambdas with identical operator types at different call sites within the same TU.
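A small sketch of how the helper macro and the tag combine. The macro and tag are restated from the emitted text; ClosureA, TagSite0, and TagSite1 are hypothetical stand-ins, and the example assumes two sites that would otherwise produce the same U and func pair.

```cpp
#include <type_traits>

#define __NV_LAMBDA_WRAPPER_HELPER(X, Y) decltype(X), Y
template <typename U, U func, unsigned> struct __nv_dl_tag { };

// Hypothetical stand-in for a closure type.
struct ClosureA { int operator()(int x) const { return x; } };

// The macro supplies the first two template arguments; the trailing unsigned is the site ID.
using TagSite0 = __nv_dl_tag<
    __NV_LAMBDA_WRAPPER_HELPER(&ClosureA::operator(), &ClosureA::operator()), 0>;
using TagSite1 = __nv_dl_tag<
    __NV_LAMBDA_WRAPPER_HELPER(&ClosureA::operator(), &ClosureA::operator()), 1>;

static_assert(!std::is_same<TagSite0, TagSite1>::value, "the unsigned parameter separates the sites");
static_assert(std::is_same<TagSite0,
                           __nv_dl_tag<int (ClosureA::*)(int) const, &ClosureA::operator(), 0> >::value,
              "macro expands to decltype(X), Y");
```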
Step 3: Array Capture Helpers (sub_6BC290)
sub_6BCC20 calls sub_6BC290(a1), which emits the __nv_lambda_array_wrapper and __nv_lambda_field_type infrastructure for C-style array captures. This is a separate 183-line function that generates templates for array dimensions 2 through 8.
Three template families are emitted:
Primary template (static_assert trap):
template <typename T>
struct __nv_lambda_array_wrapper {
static_assert(sizeof(T) == 0,
"nvcc internal error: unexpected failure in capturing array variable");
};
Per-dimension partial specializations (dimensions 2-8). For each dimension D from 2 to 8, sub_6BC290 generates a partial specialization with D size_t template parameters and a nested-for-loop constructor:
// Example: 3D (v1 = 3)
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_array_wrapper<T [D1][D2][D3]> {
    T arr[D1][D2][D3];
    __nv_lambda_array_wrapper(const T in[D1][D2][D3]) {
        for(size_t i1 = 0; i1 < D1; ++i1)
            for(size_t i2 = 0; i2 < D2; ++i2)
                for(size_t i3 = 0; i3 < D3; ++i3)
                    arr[i1][i2][i3] = in[i1][i2][i3];
    }
};
Field type trait specializations:
template <typename T>
struct __nv_lambda_field_type { typedef T type; };
// For each dimension D from 2 to 8:
template<typename T, size_t D1, ..., size_t DN>
struct __nv_lambda_field_type<T [D1]...[DN]> {
typedef __nv_lambda_array_wrapper<T [D1]...[DN]> type;
};
template<typename T, size_t D1, ..., size_t DN>
struct __nv_lambda_field_type<const T [D1]...[DN]> {
typedef const __nv_lambda_array_wrapper<T [D1]...[DN]> type;
};
The loop structure in sub_6BC290 uses two stack buffers: v33[1024] for the nested-for-loop lines (each sprintf call formats four copies of the loop index variable) and s[1064] for dimension parameters and array subscript expressions. The outer loop runs from v1 = 2 to v1 = 8 (inclusive, 7 iterations). 1D arrays do not need a wrapper -- they can be captured directly. Arrays of 9+ dimensions are unsupported (the primary template's static_assert fires).
See Capture Handling for detailed documentation.
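To make the array path concrete, here is a reduced sketch that restates only the 2D specializations (the static_assert trap and dimensions 3 through 8 are left out) and exercises them. copied_element is a hypothetical helper added for the demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
using std::size_t;

template <typename T> struct __nv_lambda_array_wrapper;  // primary; trap body omitted here
template <typename T, size_t D1, size_t D2>
struct __nv_lambda_array_wrapper<T [D1][D2]> {
    T arr[D1][D2];
    __nv_lambda_array_wrapper(const T in[D1][D2]) {
        for (size_t i1 = 0; i1 < D1; ++i1)
            for (size_t i2 = 0; i2 < D2; ++i2)
                arr[i1][i2] = in[i1][i2];
    }
};
template <typename T> struct __nv_lambda_field_type { typedef T type; };
template <typename T, size_t D1, size_t D2>
struct __nv_lambda_field_type<T [D1][D2]> {
    typedef __nv_lambda_array_wrapper<T [D1][D2]> type;
};

// Scalars pass through unchanged; 2D arrays are mapped to the wrapper.
static_assert(std::is_same<__nv_lambda_field_type<int>::type, int>::value, "");
static_assert(std::is_same<__nv_lambda_field_type<int[2][3]>::type,
                           __nv_lambda_array_wrapper<int[2][3]> >::value, "");

// Hypothetical helper: copies a 2x3 array through the wrapper and reads one element back.
inline int copied_element() {
    const int src[2][3] = {{1, 2, 3}, {4, 5, 6}};
    __nv_lambda_field_type<int[2][3]>::type w(src);
    assert(w.arr[0][0] == 1);
    return w.arr[1][2];
}
```

The wrapper exists because a raw array member cannot be initialized from another array in a constructor initializer list, so the element-wise copy has to be spelled out in the constructor body.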
Step 4: Primary __nv_dl_wrapper_t and Zero-Capture Specialization
template <typename Tag, typename...CapturedVarTypePack>
struct __nv_dl_wrapper_t {
static_assert(sizeof...(CapturedVarTypePack) == 0,
"nvcc internal error: unexpected number of captures!");
};
template <typename Tag>
struct __nv_dl_wrapper_t<Tag> {
__nv_dl_wrapper_t(Tag) { }
template <typename...U1>
int operator()(U1...) { return 0; }
};
The primary template traps any instantiation with a non-zero capture count that lacks a matching specialization. The zero-capture specialization provides a trivial constructor and a dummy operator() returning int(0). This return value is never used at runtime -- the device compiler dispatches through the tag's encoded function pointer.
Step 5: Trailing-Return Tag and Base Specialization
template <typename U, U func, typename Return, unsigned>
struct __nv_dl_trailing_return_tag { };
template <typename U, U func, typename Return, unsigned Id>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id> > {
__nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>) { }
template <typename...U1> Return operator()(U1...) {
__builtin_unreachable();
}
};
For lambdas with explicit trailing return types (-> ReturnType), the tag carries the Return type as a template parameter. The operator() returns Return instead of int, with __builtin_unreachable() satisfying the compiler without generating actual return-value code.
The trailing-return tag and its zero-capture specialization are emitted as two separate a1(...) calls. The __builtin_unreachable() body is split: a1("__builtin_unreachable(); }\n}; \n\n").
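The difference between the two zero-capture wrappers shows up in the deduced return type of operator(). The templates below are restated from steps 2, 4, and 5; Closure, Op, PlainW, and TrailW are hypothetical names. __builtin_unreachable is a GCC/Clang builtin, so this sketch assumes one of those host compilers.

```cpp
#include <type_traits>
#include <utility>

template <typename U, U func, unsigned> struct __nv_dl_tag { };
template <typename U, U func, typename Return, unsigned> struct __nv_dl_trailing_return_tag { };

template <typename Tag, typename... CapturedVarTypePack>
struct __nv_dl_wrapper_t {
    static_assert(sizeof...(CapturedVarTypePack) == 0,
                  "nvcc internal error: unexpected number of captures!");
};
template <typename Tag>
struct __nv_dl_wrapper_t<Tag> {
    __nv_dl_wrapper_t(Tag) { }
    template <typename... U1> int operator()(U1...) { return 0; }
};
template <typename U, U func, typename Return, unsigned Id>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id> > {
    __nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>) { }
    template <typename... U1> Return operator()(U1...) { __builtin_unreachable(); }
};

struct Closure { double operator()(int) const { return 0.0; } };  // hypothetical closure
using Op     = decltype(&Closure::operator());
using PlainW = __nv_dl_wrapper_t<__nv_dl_tag<Op, &Closure::operator(), 0> >;
using TrailW = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<Op, &Closure::operator(), double, 0> >;

// Only unevaluated calls are made, so the unreachable body is never executed.
static_assert(std::is_same<decltype(std::declval<PlainW&>()(1)), int>::value, "dummy int result");
static_assert(std::is_same<decltype(std::declval<TrailW&>()(1)), double>::value, "Return preserved");
```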
Step 6: Device Lambda Bitmap Scan
Scans unk_1286980 (the device lambda bitmap, 1024 bits) and calls sub_6BB790 for each set bit with index greater than zero:
// Decompiled from sub_6BCC20
v1 = (unsigned __int64 *)&unk_1286980;
v2 = 0;
do {
    v3 = *v1; // load 64-bit word
    v4 = v2 + 64; // word boundary
    do {
        if (v2 && (v3 & 1) != 0) // skip bit 0, emit for set bits
            sub_6BB790(v2, a1); // emit_device_lambda_wrapper_specialization
        ++v2;
        v3 >>= 1;
    } while (v4 != v2);
    ++v1;
} while (v4 != 1024);
Bit 0 is explicitly skipped (if (v2 && ...)). The zero-capture case is handled by the specializations in steps 4 and 5.
For each set bit N > 0, sub_6BB790(N, a1) emits two __nv_dl_wrapper_t partial specializations: one for __nv_dl_tag and one for __nv_dl_trailing_return_tag, each with N typed fields, a constructor taking N parameters, and an initializer list binding inK to fK. See Device Lambda Wrapper for full emitter logic.
This bitmap-driven approach is the critical compile-time optimization. A translation unit using lambdas with capture counts 1, 3, and 5 emits exactly 6 struct specializations rather than 2046 (1023 counts x 2 tag variants).
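The scan can be restated in portable C++ as follows. This is a sketch of the same logic rather than the recovered code: device_counts_to_emit and demo_specialization_count are hypothetical names, and the result is collected into a vector instead of calling an emitter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the capture counts the device scan would emit for.
inline std::vector<unsigned> device_counts_to_emit(const std::uint64_t (&bitmap)[16]) {
    std::vector<unsigned> counts;
    for (unsigned word = 0; word < 16; ++word) {
        std::uint64_t bits = bitmap[word];
        for (unsigned i = 0; i < 64; ++i, bits >>= 1) {
            const unsigned n = word * 64 + i;
            if (n != 0 && (bits & 1))   // bit 0 is skipped, as in the decompilation
                counts.push_back(n);
        }
    }
    return counts;
}

// Counts 0, 1, 3, 5 and 64 recorded; each emitted count yields two specializations.
inline std::size_t demo_specialization_count() {
    std::uint64_t bitmap[16] = {};
    bitmap[0] = (1ULL << 0) | (1ULL << 1) | (1ULL << 3) | (1ULL << 5);
    bitmap[1] = 1ULL;  // bit 64
    const std::vector<unsigned> counts = device_counts_to_emit(bitmap);
    assert(!counts.empty() && counts.front() == 1);  // count 0 never appears
    return 2 * counts.size();
}
```

With the bitmap above, four counts survive the scan (1, 3, 5, 64), giving eight specializations.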
Step 7: Host-Device Helper Class (__nv_hdl_helper)
Emitted inside an anonymous namespace:
namespace {
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
struct __nv_hdl_helper {
typedef void * (*fp_copier_t)(void *);
typedef OpFuncR (*fp_caller_t)(void *, OpFuncArgs...);
typedef void (*fp_deleter_t)(void *);
typedef OpFuncR (*fp_noobject_caller_t)(OpFuncArgs...);
static fp_copier_t fp_copier;
static fp_deleter_t fp_deleter;
static fp_caller_t fp_caller;
static fp_noobject_caller_t fp_noobject_caller;
};
// Out-of-line static member definitions (4 members):
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier_t
__nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier;
// ... (fp_deleter, fp_caller, fp_noobject_caller follow the same pattern)
}
The anonymous namespace prevents ODR violations across TUs. The Tag parameter isolates function pointer storage per lambda site even when call signatures are identical. The entire struct definition plus all four out-of-line member definitions are emitted as a single a1(...) call.
| Pointer | Purpose |
|---|---|
| fp_copier | Heap-copies a Lambda from void* (used by copy constructor) |
| fp_caller | Casts void* to Lambda* and invokes operator() |
| fp_deleter | Casts void* to Lambda* and deletes it |
| fp_noobject_caller | Stores captureless lambda as raw function pointer |
Step 8: Primary __nv_hdl_wrapper_t
template <bool IsMutable, bool HasFuncPtrConv, bool NeverThrows,
typename Tag, typename OpFunc, typename...CapturedVarTypePack>
struct __nv_hdl_wrapper_t {
static_assert(sizeof...(CapturedVarTypePack) == 0,
"nvcc internal error: unexpected number of captures "
"in __host__ __device__ lambda!");
};
Same safety-net pattern as the device wrapper.
Step 9: Host-Device Lambda Bitmap Scan
Scans unk_1286900 (the host-device bitmap, 1024 bits). Unlike the device scan, this loop does not skip bit 0 -- the zero-capture host-device case still requires distinct specializations for HasFuncPtrConv=true vs HasFuncPtrConv=false.
For each set bit N, four specialization calls are made:
v5 = (unsigned __int64 *)&unk_1286900;
v6 = 0;
do {
    v7 = *v5;
    v8 = v6 + 64;
    do {
        while ((v7 & 1) == 0) { // fast-skip unset bits
            ++v6;
            v7 >>= 1;
            if (v6 == v8) goto LABEL_13;
        }
        sub_6BBB10(0, v6, a1); // IsMutable=false, HasFuncPtrConv=false
        sub_6BBEE0(0, v6, a1); // IsMutable=true, HasFuncPtrConv=false
        sub_6BBB10(1, v6, a1); // IsMutable=false, HasFuncPtrConv=true
        v9 = v6++;
        v7 >>= 1;
        sub_6BBEE0(1, v9, a1); // IsMutable=true, HasFuncPtrConv=true
    } while (v6 != v8);
LABEL_13:
    ++v5;
} while (v6 != 1024);
Note the ordering asymmetry in the fourth call: sub_6BBEE0(1, v9, a1) uses the pre-increment value v9 because v6 has already been incremented by the v9 = v6++ expression.
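The inner while ((v7 & 1) == 0) loop advances past consecutive unset bits in a tight shift-and-test loop, so the four emitter calls are reached only when a set bit is found. The device scan gets the same result with a single combined condition evaluated on every bit; the two loops differ in shape, not in which bits produce output.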
| Call | a1 | a2 | IsMutable | HasFuncPtrConv | operator() qualifier |
|---|---|---|---|---|---|
| sub_6BBB10(0, N, emit) | 0 | N | false | false | const noexcept(NeverThrows) |
| sub_6BBEE0(0, N, emit) | 0 | N | true | false | noexcept(NeverThrows) (no const) |
| sub_6BBB10(1, N, emit) | 1 | N | false | true | const noexcept(NeverThrows) |
| sub_6BBEE0(1, N, emit) | 1 | N | true | true | noexcept(NeverThrows) (no const) |
The sole difference between sub_6BBB10 and sub_6BBEE0 is that sub_6BBB10 emits "false," for IsMutable and adds a3("const ") before the noexcept qualifier on operator(), while sub_6BBEE0 emits "true," and omits the const. They are otherwise structurally identical -- 238 vs 236 lines, the 2-line difference being exactly the a3("const ") call.
See Host-Device Lambda Wrapper for the complete internal structure of each specialization.
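The practical consequence of the const qualifier can be shown with two reduced wrappers. Everything in this sketch is hypothetical (NonMutableWrapper, MutableWrapper, callable_when_const); it only illustrates why the IsMutable=false variant must carry const on operator().

```cpp
#include <type_traits>
#include <utility>

// Hypothetical miniature wrappers: only the operator() qualifier differs.
struct NonMutableWrapper { int operator()(int x) const { return x; } };
struct MutableWrapper    { int operator()(int x)       { return x; } };

// Detects whether a const-qualified object of type T can be invoked with an int.
template <typename T, typename = void>
struct callable_when_const : std::false_type { };
template <typename T>
struct callable_when_const<T, decltype(void(std::declval<const T&>()(0)))> : std::true_type { };

static_assert(callable_when_const<NonMutableWrapper>::value,
              "const operator() can be called through a const wrapper");
static_assert(!callable_when_const<MutableWrapper>::value,
              "non-const operator() cannot be called through a const wrapper");
```

This mirrors ordinary C++ lambdas, where operator() is const unless the lambda is declared mutable.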
Step 10: __nv_hdl_helper_trait_outer (Base Specializations)
The deduction helper trait that extracts the wrapper type from a lambda's operator() signature:
template <bool IsMutable, bool HasFuncPtrConv, typename ...CaptureArgs>
struct __nv_hdl_helper_trait_outer {
    template <typename Tag, typename Lambda>
    struct __nv_hdl_helper_trait
        : public __nv_hdl_helper_trait<Tag, decltype(&Lambda::operator())> { };
    // Match const operator() (non-mutable lambda):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
                                  Tag, R(OpFuncArgs...), CaptureArgs...>;
    };
    // Match non-const operator() (mutable lambda):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...)> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
                                  Tag, R(OpFuncArgs...), CaptureArgs...>;
    };
The primary __nv_hdl_helper_trait inherits from a specialization on decltype(&Lambda::operator()). The compiler deduces the member function pointer type and pattern-matches against the const or non-const specialization. Both produce NeverThrows=false.
This block is emitted without a closing }; -- the noexcept variants (step 11) are conditionally appended before the closing brace.
Step 11: C++17 Noexcept Trait Variants (Conditional)
Gated on dword_126E270:
if (dword_126E270)
a1(/* noexcept trait specializations */);
a1("\n};"); // close __nv_hdl_helper_trait_outer
When C++17 noexcept-in-type-system is active, two additional __nv_hdl_helper_trait specializations are emitted:
// Match const noexcept operator():
template <typename Tag, typename C, typename R, typename... OpFuncArgs>
struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const noexcept> {
template <typename Lambda>
static auto get(Lambda lam, CaptureArgs... args)
-> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
Tag, R(OpFuncArgs...), CaptureArgs...>;
};
// Match non-const noexcept operator():
template <typename Tag, typename C, typename R, typename... OpFuncArgs>
struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) noexcept> {
template <typename Lambda>
static auto get(Lambda lam, CaptureArgs... args)
-> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
Tag, R(OpFuncArgs...), CaptureArgs...>;
};
The noexcept specializations produce NeverThrows=true. In C++17, R(C::*)(Args...) const noexcept is a distinct type from R(C::*)(Args...) const, so without these specializations, noexcept lambdas would fail to match and the trait chain would break.
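The gating can be reproduced in miniature. In this sketch never_throws is a hypothetical reduction of __nv_hdl_helper_trait that keeps only the NeverThrows result, and the standard feature-test macro __cpp_noexcept_function_type plays the role that dword_126E270 plays inside cudafe++ (that correspondence is an assumption made for illustration).

```cpp
#include <type_traits>

// Hypothetical reduction of __nv_hdl_helper_trait: operator() type -> NeverThrows.
template <typename MemFn> struct never_throws;
template <typename C, typename R, typename... A>
struct never_throws<R (C::*)(A...) const> : std::false_type { };
#ifdef __cpp_noexcept_function_type
// Only legal when noexcept is part of the function type; otherwise this would
// redefine the specialization above.
template <typename C, typename R, typename... A>
struct never_throws<R (C::*)(A...) const noexcept> : std::true_type { };
constexpr bool kNoexceptInType = true;
#else
constexpr bool kNoexceptInType = false;
#endif

struct Plain   { int operator()(int) const          { return 0; } };  // hypothetical
struct NoThrow { int operator()(int) const noexcept { return 0; } };  // hypothetical

static_assert(!never_throws<decltype(&Plain::operator())>::value, "plain operator() never matches");
static_assert(never_throws<decltype(&NoThrow::operator())>::value == kNoexceptInType,
              "noexcept is detected exactly when it is part of the type");
```

Emitting the noexcept specializations unconditionally would fail before C++17, because they would be duplicate definitions of the non-noexcept ones; that is a plausible reason for the conditional emission.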
Step 12: __nv_hdl_create_wrapper_t Factory
template<bool IsMutable, bool HasFuncPtrConv, typename Tag,
         typename...CaptureArgs>
struct __nv_hdl_create_wrapper_t {
    template <typename Lambda>
    static auto __nv_hdl_create_wrapper(Lambda &&lam, CaptureArgs... args)
        -> decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...))
    {
        typedef decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...)) container_type;
        return container_type(Tag{}, std::move(lam), args...);
    }
};
This factory is the entry point called at each host-device lambda usage site. The trailing return type chains through the trait hierarchy to deduce the exact __nv_hdl_wrapper_t specialization. The body constructs the deduced wrapper with Tag{}, the moved lambda, and the capture arguments.
Step 13: CV-Removal Traits
template<typename T> struct __nv_lambda_trait_remove_const { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_const<T const> { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_volatile { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_volatile<T volatile> { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_cv {
typedef typename __nv_lambda_trait_remove_const<
typename __nv_lambda_trait_remove_volatile<T>::type>::type type;
};
These are distinct from the __nvdl_remove_ref/__nvdl_remove_const emitted in step 1. The step-1 traits are used during capture type normalization at wrapper call sites. The step-13 traits are used by the detection macros (steps 14-17) to strip CV qualifiers before testing whether a type is an extended lambda wrapper.
Step 14: Device Lambda Detection Trait
template <typename T>
struct __nv_extended_device_lambda_trait_helper {
static const bool value = false;
};
template <typename T1, typename...Pack>
struct __nv_extended_device_lambda_trait_helper<__nv_dl_wrapper_t<T1, Pack...> > {
static const bool value = true;
};
#define __nv_is_extended_device_lambda_closure_type(X) \
__nv_extended_device_lambda_trait_helper< \
typename __nv_lambda_trait_remove_cv<X>::type>::value
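Detection by partial specialization for device lambda wrappers: the primary template yields false, and the specialization on __nv_dl_wrapper_t<T1, Pack...> yields true. The macro strips CV qualifiers first, ensuring const __nv_dl_wrapper_t<...> is also detected. Used by CUDA runtime headers for conditional behavior on extended lambda types.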
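A compile-time check of the detection macro, combining the step 13 and step 14 text. The wrapper here is an empty stub and DummyTag and Wrapper are hypothetical names; the traits and macro are restated from the emitted preamble.

```cpp
template <typename Tag, typename... Pack> struct __nv_dl_wrapper_t { };  // stub, body omitted
struct DummyTag { };                                                     // hypothetical tag

template<typename T> struct __nv_lambda_trait_remove_const { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_const<T const> { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_volatile { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_volatile<T volatile> { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_cv {
    typedef typename __nv_lambda_trait_remove_const<
        typename __nv_lambda_trait_remove_volatile<T>::type>::type type;
};

template <typename T>
struct __nv_extended_device_lambda_trait_helper { static const bool value = false; };
template <typename T1, typename... Pack>
struct __nv_extended_device_lambda_trait_helper<__nv_dl_wrapper_t<T1, Pack...> > {
    static const bool value = true;
};
#define __nv_is_extended_device_lambda_closure_type(X) \
    __nv_extended_device_lambda_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

// An alias keeps the comma in the template argument list out of the macro argument.
typedef __nv_dl_wrapper_t<DummyTag, int> Wrapper;

static_assert(__nv_is_extended_device_lambda_closure_type(Wrapper), "plain wrapper detected");
static_assert(__nv_is_extended_device_lambda_closure_type(const Wrapper), "const wrapper detected");
static_assert(!__nv_is_extended_device_lambda_closure_type(int), "ordinary type rejected");
```

Without the remove_cv step the const Wrapper case would fall through to the primary template and report false.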
Step 15: Device Lambda Wrapper Unwrapper
template<typename T> struct __nv_lambda_trait_remove_dl_wrapper { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_dl_wrapper<__nv_dl_wrapper_t<T> > {
typedef T type;
};
Extracts the inner tag type from a zero-capture device lambda wrapper. Only matches __nv_dl_wrapper_t<T> with a single template parameter (the tag). Used to access __nv_dl_tag or __nv_dl_trailing_return_tag for device function dispatch resolution.
Step 16: Trailing-Return Device Lambda Detection
template <typename T>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper {
static const bool value = false;
};
template <typename U, U func, typename Return, unsigned Id, typename...Pack>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper<
__nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, Pack...> > {
static const bool value = true;
};
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) \
__nv_extended_device_lambda_with_trailing_return_trait_helper< \
typename __nv_lambda_trait_remove_cv<X>::type>::value
Detects whether a device lambda wrapper uses the trailing-return tag variant. Needed because trailing-return lambdas require different handling during device compilation -- the return type is explicit and must be preserved, rather than deduced.
Step 17: Host-Device Lambda Detection Trait
The final emission:
template <typename>
struct __nv_extended_host_device_lambda_trait_helper {
static const bool value = false;
};
template <bool B1, bool B2, bool B3, typename T1, typename T2, typename...Pack>
struct __nv_extended_host_device_lambda_trait_helper<
__nv_hdl_wrapper_t<B1, B2, B3, T1, T2, Pack...> > {
static const bool value = true;
};
#define __nv_is_extended_host_device_lambda_closure_type(X) \
__nv_extended_host_device_lambda_trait_helper< \
typename __nv_lambda_trait_remove_cv<X>::type>::value
Detects any __nv_hdl_wrapper_t instantiation. The partial specialization matches all six template parameters (B1=IsMutable, B2=HasFuncPtrConv, B3=NeverThrows, T1=Tag, T2=OpFunc, Pack=captures).
sub_6BCC20 returns the result of this final a1(...) call.
Bitmap Infrastructure
Registration: sub_6BCBF0 (nv_record_capture_count)
During frontend parsing, scan_lambda (sub_447930) records each lambda's capture count:
__int64 __fastcall sub_6BCBF0(int a1, unsigned int a2)
{
    unsigned __int64 *result;
    if (a1)
        result = (unsigned __int64 *)&unk_1286900; // host-device bitmap
    else
        result = (unsigned __int64 *)&unk_1286980; // device bitmap
    result[a2 >> 6] |= 1ULL << a2;
    return (__int64)result;
}
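The function selects the bitmap based on a1 (0 = device, nonzero = host-device), computes the word index as a2 >> 6 (divide by 64), and ORs in the bit at position a2 mod 64. The decompiler shows an unmasked 1ULL << a2 because the x86-64 shift instruction uses only the low 6 bits of the count for 64-bit operands; portable C would need an explicit a2 & 63. No synchronization is needed because the frontend is single-threaded.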
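The index arithmetic can be written portably as below. This is a sketch with hypothetical names (BitPosition, locate, mask_for, record), and it assumes the capture count is below 1024, which is the range the bitmap covers.

```cpp
#include <cstdint>

struct BitPosition { unsigned word; unsigned bit; };

// Word index is count / 64; bit-in-word is count % 64.
constexpr BitPosition locate(unsigned capture_count) {
    return BitPosition{capture_count >> 6, capture_count & 63u};
}

// The explicit mask keeps the shift well-defined for counts of 64 and above.
constexpr std::uint64_t mask_for(unsigned capture_count) {
    return std::uint64_t(1) << (capture_count & 63u);
}

// Hypothetical portable equivalent of the recorded bit set (capture_count < 1024 assumed).
inline void record(std::uint64_t (&bitmap)[16], unsigned capture_count) {
    bitmap[capture_count >> 6] |= mask_for(capture_count);
}

static_assert(locate(0).word == 0 && locate(0).bit == 0, "count 0 is bit 0 of word 0");
static_assert(locate(64).word == 1 && locate(64).bit == 0, "count 64 starts the second word");
static_assert(locate(1023).word == 15 && locate(1023).bit == 63, "last representable count");
```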
Reset: sub_6BCBC0 (nv_reset_capture_bitmasks)
Before each translation unit, both bitmaps are zeroed:
void sub_6BCBC0(void)
{
    memset(&unk_1286980, 0, 128); // device bitmap
    memset(&unk_1286900, 0, 128); // host-device bitmap
}
Scan Algorithm Differences
| Aspect | Device scan (step 6) | Host-device scan (step 9) |
|---|---|---|
| Bitmap | unk_1286980 | unk_1286900 |
| Bit 0 | Skipped (if (v2 && ...)) | Processed |
| Skip strategy | Tests every bit individually | Inner while fast-skips consecutive zeros |
| Calls per set bit | 1 (sub_6BB790) | 4 (sub_6BBB10 x2 + sub_6BBEE0 x2) |
| Specializations per set bit | 2 (standard + trailing-return) | 4 (IsMutable x HasFuncPtrConv) |
The device scan skips bit 0 because the zero-capture case is handled by the always-emitted zero-capture specializations (steps 4 and 5), not by a per-count specialization. The host-device scan processes bit 0 because the zero-capture case requires explicit specializations for the HasFuncPtrConv and IsMutable dimensions -- the always-emitted primary template contains only a static_assert trap.
Complete Emission Order Summary
| Step | Content | Emitter | Templates Produced |
|---|---|---|---|
| 1 | Ref/const removal traits | inline string | __NV_LAMBDA_WRAPPER_HELPER, __nvdl_remove_ref, __nvdl_remove_const |
| 2 | Device tag | inline string | __nv_dl_tag |
| 3 | Array helpers | sub_6BC290 | __nv_lambda_array_wrapper (dim 2-8), __nv_lambda_field_type specializations |
| 4 | Device wrapper primary | inline string | __nv_dl_wrapper_t primary + zero-capture |
| 5 | Trailing-return tag | inline string | __nv_dl_trailing_return_tag + zero-capture specialization |
| 6 | Device bitmap scan | loop + sub_6BB790 | N-capture __nv_dl_wrapper_t (2 per set bit N > 0) |
| 7 | HD helper | inline string | __nv_hdl_helper (anonymous namespace, 4 static FPs) |
| 8 | HD wrapper primary | inline string | __nv_hdl_wrapper_t primary with static_assert |
| 9 | HD bitmap scan | loop + sub_6BBB10 x2 + sub_6BBEE0 x2 | N-capture __nv_hdl_wrapper_t (4 per set bit) |
| 10 | Trait outer | inline string | __nv_hdl_helper_trait_outer (const + non-const specializations) |
| 11 | C++17 noexcept | conditional inline | Noexcept __nv_hdl_helper_trait specializations |
| 12 | Factory | inline string | __nv_hdl_create_wrapper_t |
| 13 | CV traits | inline string | __nv_lambda_trait_remove_const/volatile/cv |
| 14 | Device detection | inline string | __nv_extended_device_lambda_trait_helper + macro |
| 15 | Wrapper unwrap | inline string | __nv_lambda_trait_remove_dl_wrapper |
| 16 | Trailing-return detection | inline string | __nv_extended_device_lambda_with_trailing_return_trait_helper + macro |
| 17 | HD detection | inline string | __nv_extended_host_device_lambda_trait_helper + macro |
Output Size Characteristics
The preamble size depends on the number of distinct capture counts used:
| Component | Fixed/Variable | Approximate Size |
|---|---|---|
| Steps 1-5 (fixed templates) | Fixed | ~1.5 KB |
| Step 3 (array helpers, dim 2-8) | Fixed | ~4 KB |
| Step 6 (device, per capture count) | Variable | ~0.8 KB per count |
| Steps 7-8 (HD helper + primary) | Fixed | ~1.5 KB |
| Step 9 (HD, per capture count) | Variable | ~6 KB per count (4 specializations) |
| Steps 10-17 (traits, macros) | Fixed | ~3 KB |
A typical translation unit with 3-5 distinct capture counts produces approximately 30-50 KB of injected C++ text.
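The estimate can be reproduced from the table with simple arithmetic. The sketch below assumes the table rows are additive (that is, the ~4 KB for step 3 is counted on top of the ~1.5 KB for the other fixed text in steps 1-5) and that each distinct capture count appears in both bitmaps; the constant and function names are hypothetical. Sizes are kept in tenths of a KB to stay in integer arithmetic.

```cpp
// Sizes in tenths of a KB, taken from the table and assumed to be additive.
constexpr unsigned kFixedTenthsKB   = 15 + 40 + 15 + 30;  // steps 1-5, step 3, steps 7-8, steps 10-17
constexpr unsigned kDevicePerCount  = 8;                  // ~0.8 KB per device capture count
constexpr unsigned kHostDevPerCount = 60;                 // ~6 KB per host-device capture count

constexpr unsigned estimate_tenths_kb(unsigned device_counts, unsigned hd_counts) {
    return kFixedTenthsKB + kDevicePerCount * device_counts + kHostDevPerCount * hd_counts;
}

static_assert(estimate_tenths_kb(3, 3) == 304, "about 30.4 KB for 3 counts in each bitmap");
static_assert(estimate_tenths_kb(5, 5) == 440, "about 44 KB for 5 counts in each bitmap");
```

Under these assumptions the fixed text accounts for roughly 10 KB and the host-device specializations dominate the variable part.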
Design Rationale
Text Emission vs AST Construction
The preamble is emitted as raw C++ source text rather than constructed as AST nodes in the EDG IL. This trades correctness-by-construction for implementation simplicity:
- Avoids IL complexity. Constructing proper AST nodes for template partial specializations, static member definitions, anonymous namespaces, and macros would require deep integration with the EDG IL construction API.
- Matches output format. The .int.c file is plain C++ text consumed by the host compiler. Since the templates must eventually become text, generating them as text from the start eliminates a serialize-deserialize round trip.
- Self-documenting. The emitted text is directly readable in the .int.c file. grep for __nv_dl_wrapper_t to see exactly what was produced.
The cost is that the templates exist only as generated text, not as first-class IL entities. They cannot be analyzed or transformed by other EDG passes. This is acceptable because the preamble templates are infrastructure -- they are never the target of user-facing diagnostics or transformations.
Why Bitmaps Instead of Lists
The 1024-bit bitmap offers constant-time set (O(1) via shift-and-OR) and linear-time scan (O(1024) = effectively constant for a fixed-size structure). The bitmap has zero dynamic allocation, fits in two cache lines (128 bytes), and the scan loop compiles to simple shift-and-test instructions. Alternative representations (sorted lists, hash sets) would add allocation overhead and complexity for negligible benefit given the fixed 128-byte size.
Why Bit 0 Is Skipped for Device but Not Host-Device
The device lambda zero-capture case is fully handled by the primary template's zero-capture specialization (step 4), which is always emitted. No per-capture-count specialization is needed because the zero-capture wrapper has no fields, no constructor parameters, and no specialization-specific behavior.
The host-device zero-capture case requires distinct specializations for HasFuncPtrConv=true (lightweight function pointer path) and HasFuncPtrConv=false (heap-allocated type erasure path). These paths have fundamentally different internal structure. The always-emitted primary template contains only a static_assert trap, not a working implementation, so bit 0 must be processed to generate the actual zero-capture specializations.
Function Map
| Address | Name (recovered) | Source | Lines | Role |
|---|---|---|---|---|
| sub_6BCC20 | nv_emit_lambda_preamble | nv_transforms.c | 244 | Master emitter: 17-step template injection pipeline |
| sub_4864F0 | gen_type_decl | cp_gen_be.c | 751 | Trigger: detects sentinel, emits #line, calls master emitter |
| sub_467E50 | emit_string | cp_gen_be.c | ~29 | Output callback: writes string char-by-char via putc() |
| sub_467D60 | emit_newline | cp_gen_be.c | ~15 | Emits \n, increments line counter |
| sub_6BC290 | emit_array_capture_helpers | nv_transforms.c | 183 | Step 3: __nv_lambda_array_wrapper for dim 2-8 |
| sub_6BB790 | emit_device_lambda_wrapper_specialization | nv_transforms.c | 191 | Step 6: N-capture __nv_dl_wrapper_t (both tag variants) |
| sub_6BBB10 | emit_hdl_wrapper_nonmutable | nv_transforms.c | 238 | Step 9: __nv_hdl_wrapper_t<false,...> specialization |
| sub_6BBEE0 | emit_hdl_wrapper_mutable | nv_transforms.c | 236 | Step 9: __nv_hdl_wrapper_t<true,...> specialization |
| sub_6BCBF0 | nv_record_capture_count | nv_transforms.c | 13 | Sets bit N in device or HD bitmap |
| sub_6BCBC0 | nv_reset_capture_bitmasks | nv_transforms.c | 9 | Zeroes both 128-byte bitmaps before each TU |
| sub_46BC80 | emit_preprocessor_directive | cp_gen_be.c | -- | Emits #if 0 / #endif suppression blocks |
Global State
| Variable | Address | Type | Purpose |
|---|---|---|---|
| unk_1286980 | 0x1286980 | uint64_t[16] | Device lambda capture-count bitmap (1024 bits) |
| unk_1286900 | 0x1286900 | uint64_t[16] | Host-device lambda capture-count bitmap (1024 bits) |
| dword_106BF38 | 0x106BF38 | int32 | --extended-lambda mode flag |
| dword_126E270 | 0x126E270 | int32 | C++17 noexcept-in-type-system flag |
| dword_126E1DC | 0x126E1DC | int32 | EDG native mode flag (# vs #line format) |
| dword_106581C | 0x106581C | int32 | Output column counter |
| dword_1065820 | 0x1065820 | int32 | Output line counter (reset after preamble) |
| qword_1065828 | 0x1065828 | int64 | Output state pointer (reset after preamble) |
| dword_1065818 | 0x1065818 | int32 | Pending indentation flag |
| dword_1065834 | 0x1065834 | int32 | Preprocessor nesting depth counter |
Related Pages
- Extended Lambda Overview -- end-to-end pipeline architecture and bitmap system
- Device Lambda Wrapper -- __nv_dl_wrapper_t template structure, sub_6BB790 emitter
- Host-Device Lambda Wrapper -- __nv_hdl_wrapper_t type erasure design, sub_6BBB10 / sub_6BBEE0
- Capture Handling -- __nv_lambda_field_type, __nv_lambda_array_wrapper, sub_6BC290
- Lambda Restrictions -- validation rules and error codes
Lambda Restrictions
Extended lambdas are CUDA's most constraint-heavy feature. Before a lambda can be wrapped in __nv_dl_wrapper_t or __nv_hdl_wrapper_t for device transfer, cudafe++ must verify that the closure type is serializable: no reference captures (device memory cannot hold host-side pointers), no function-local types in the public interface (device compiler has no access to them), no unnamed parent classes (the wrapper tag requires a mangleable name), and dozens of other structural invariants. The restriction checker runs as Phase 4 of scan_lambda (sub_447930, lines 626--866 of the 2113-line function) and continues through per-capture validation in make_field_for_lambda_capture (sub_42EE00) and recursive type walks in sub_41A3E0 / sub_41A1F0. Together, these functions enforce 39 distinct diagnostic tags covering 35+ error categories and approximately 45 unique error code call sites.
All restrictions apply only when dword_106BF38 (--extended-lambda / --expt-extended-lambda) is set and the lambda has an explicit __device__ or __host__ __device__ annotation. Standard C++ lambdas and lambdas defined inside __device__ / __global__ function bodies are exempt.
Key Facts
| Property | Value |
|---|---|
| Primary validator | sub_447930 (scan_lambda, Phase 4, ~240 lines within 2113-line function) |
| Per-capture validator | sub_42EE00 (make_field_for_lambda_capture, 551 lines) |
| Type hierarchy walker | sub_41A3E0 (validate_type_hd_annotation, 75 lines) |
| Array/element checker | sub_41A1F0 (walk_type_for_hd_violations, 81 lines) |
| Type walk callback | sub_41B420 (33 lines, issues errors 3603/3604/3606/3607/3610/3611) |
| Diagnostic tag count | 39 unique tags for extended lambda errors |
| Error code range | 3592--3635, 3689--3691 |
| Error severity | All severity 7 (error), except 3612 (warning) and 3590 (error) |
| Enable flag | dword_106BF38 (--extended-lambda) |
| OptiX gate | dword_106BDD8 && dword_106B670 (triggers 3689) |
Restriction Categories
The tables below list every restriction enforced by the extended lambda validator, organized by the phase of validation in which each check occurs. The Error column gives the internal error index (displayed to users as 20000-series with the renumbering formula code + 16543). The Tag column gives the diagnostic tag name usable with --diag_suppress / #pragma nv_diag_suppress.
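The renumbering between the internal index and the user-visible number is plain offset arithmetic. The following is a minimal sketch of that mapping; the helper names and the enum are hypothetical and do not correspond to recovered functions in the binary.

```c
/* Offset between the internal error index and the number displayed to
 * users, taken from the formula in the text (displayed = code + 16543). */
enum { NV_DIAG_RENUMBER_OFFSET = 16543 };

/* Hypothetical helper: internal index -> user-visible 20000-series number. */
int nv_displayed_error_number(int internal_code)
{
    return internal_code + NV_DIAG_RENUMBER_OFFSET;
}

/* Hypothetical helper: user-visible number -> internal index. */
int nv_internal_error_index(int displayed_number)
{
    return displayed_number - NV_DIAG_RENUMBER_OFFSET;
}
```

For example, internal error 3593 maps to displayed number 20136 under this formula.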
Category 1: Capture Restrictions
These checks run in two phases. The per-lambda checks (3593, 3595) occur in scan_lambda Phase 4 and in sub_4F9F20 (capture count finalization). The per-capture checks (3596--3599, 3616) run inside make_field_for_lambda_capture (sub_42EE00), which calls sub_41A1F0 for array dimension and constructibility analysis.
| Error | Tag | Restriction | Enforcement Location |
|---|---|---|---|
| 3593 | extended_lambda_reference_capture | Reference capture ([&] or [&x]) is prohibited. Device memory cannot hold host-side references. Fires when capture_default == & and capture_mode == & on the same lambda (byte+24 bits 4 and 5 both set). | sub_447930 Phase 4, line ~825 |
| 3595 | extended_lambda_too_many_captures | Maximum 1023 captures. The bitmap system uses 1024 bits (128 bytes) per wrapper type; bit 0 is reserved for the zero-capture primary template, so the usable range is 1--1023. Capture count > 0x3FE triggers this error. | sub_4F9F20 line ~616 |
| 3596 | extended_lambda_init_capture_array | Init-captures with array type are not supported. The init-capture's type node is checked for kind 3 (array type) with element kind 1 and sub-kind 21. | sub_42EE00 line ~508 |
| 3597 | extended_lambda_array_capture_rank | Arrays with more than 7 dimensions cannot be captured. The walker sub_41A1F0 counts array nesting depth via sub_7A8370 (is_array_type) and sub_7A9310 (get_element_type). If depth > 7, error fires. The limit matches the generated __nv_lambda_array_wrapper specializations (dims 2--8, plus dim 1 as identity). | sub_41A1F0 lines ~29, ~54 |
| 3598 | extended_lambda_array_capture_default_constructible | Array element type must be default-constructible on the host. After unwinding CV-qualifiers (kind 12 loop), calls sub_550E50(30, element_type, 0) to check default-constructibility. Failure emits this error. | sub_41A1F0 line ~40 |
| 3599 | extended_lambda_array_capture_assignable | Array element type must be copy-assignable on the host. Calls sub_5BD540 to get the assignment operator, then sub_510860(60, ...) to verify it is callable. Failure emits this error. | sub_41A1F0 lines ~42--44 |
| 3616 | extended_lambda_pack_capture | Cannot capture an element of a parameter pack. After calling sub_41A1F0 for type validation, sub_7A8C00 checks whether the capture type involves a pack expansion; if so, this error fires. | sub_42EE00 line ~517 |
| 3610 | extended_lambda_init_capture_initlist | Init-captures with std::initializer_list type are prohibited. The type walk callback sub_41B420 checks kind and class identity. | sub_41B420 / sub_4907A0 |
| 3602 | extended_lambda_capture_in_constexpr_if | An extended lambda cannot first-capture a variable inside a constexpr if branch. The capture must be visible outside the discarded branch. | sub_447930 Phase 6 |
| 3614 | extended_lambda_hd_init_capture | Init-captures are completely prohibited for __host__ __device__ lambdas. When byte+25 bit 4 is set (HD wrapper) and the lambda has any captures, this error fires and the HD bits are cleared. | sub_447930 line ~1710 |
| -- | this_addr_capture_ext_lambda | Implicit capture of this in an extended lambda triggers a warning. Separate from the errors above; fires during capture list processing. | sub_42FE50 / sub_42D710 |
| -- | (no tag) | *this capture requires either __device__-only or definition inside __device__/__global__ function, unless enabled by language dialect. | sub_42FE50 |
Category 2: Type Restrictions
Type restrictions enforce that every type visible in the lambda's public interface (captures, parameters, return type, and parent function template arguments) is accessible to the device compiler. Three contexts are checked, each with two sub-checks (function-local types and private/protected class member types). Additionally, the parent function's template arguments are checked for private/protected template members.
| Error | Tag | Context | Restriction |
|---|---|---|---|
| 3603 | extended_lambda_capture_local_type | Capture variable type | A type local to a function cannot appear in the type of a captured variable. |
| 3604 | extended_lambda_capture_private_type | Capture variable type | A private or protected class member type cannot appear in the type of a captured variable. |
| 3606 | extended_lambda_call_operator_local_type | operator() signature | A function-local type cannot appear in the return or parameter types of the lambda's operator(). |
| 3607 | extended_lambda_call_operator_private_type | operator() signature | A private/protected class member type cannot appear in the operator() return or parameter types. |
| 3610 | extended_lambda_parent_local_type | Parent template args | A function-local type cannot appear in the template arguments of the enclosing parent function or any parent classes. |
| 3611 | extended_lambda_parent_private_type | Parent template args | A private/protected class member type cannot appear in the template arguments of the enclosing parent function or parent classes. |
| 3635 | extended_lambda_parent_private_template_arg | Parent template args | A template that is itself a private/protected class member cannot be used as a template argument of the enclosing parent. |
Type Walk Dispatch via dword_E7FE78
The callback sub_41B420 uses a global discriminator dword_E7FE78 to select between the three contexts. Each context is called with a different value:
| dword_E7FE78 | Context | Local-type error | Private-type error |
|---|---|---|---|
| 0 | Capture variable type | 3603 | 3604 |
| 1 | operator() signature | 3606 | 3607 |
| 2 | Parent template args | 3610 | 3611 |
The dispatch in sub_41B420 is a conditional followed by a computed offset, not a single formula. When dword_E7FE78 == 0, the base code is used unchanged: 3603 for local types, 3604 for private types. Otherwise the code is 4 * (dword_E7FE78 != 1) + 3606 for local types and 4 * (dword_E7FE78 != 1) + 3607 for private types. When dword_E7FE78 == 1, the multiplier is 4*0 = 0, yielding 3606 / 3607. When dword_E7FE78 == 2, the multiplier is 4*1 = 4, yielding 3610 / 3611. The decompiled code shows:
// For function-local type check:
v2 = 3603;
if (dword_E7FE78)
v2 = 4 * (unsigned int)(dword_E7FE78 != 1) + 3606;
// dword_E7FE78=0 -> 3603
// dword_E7FE78=1 -> 4*0 + 3606 = 3606
// dword_E7FE78=2 -> 4*1 + 3606 = 3610
// For private/protected type check:
v4 = 3604;
if (dword_E7FE78)
v4 = 4 * (unsigned int)(dword_E7FE78 != 1) + 3607;
// dword_E7FE78=0 -> 3604
// dword_E7FE78=1 -> 4*0 + 3607 = 3607
// dword_E7FE78=2 -> 4*1 + 3607 = 3611
The tree walk itself is invoked via sub_7B0B60(type_node, sub_41B420, error_base). The error_base parameter (792 or 795) is stored in a global and used by the walker to control recursion behavior, not error selection.
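The conditional dispatch in the decompiled code can be restated as two small functions that take the context value as a parameter instead of reading the global. This is a sketch for checking the arithmetic only; the function and enumerator names are hypothetical, and the context values 0/1/2 are those listed in the dispatch table.

```c
/* Context values carried by dword_E7FE78 (names are illustrative). */
enum walk_context {
    CTX_CAPTURE = 0,              /* capture variable type      */
    CTX_CALL_OPERATOR = 1,        /* operator() signature       */
    CTX_PARENT_TEMPLATE_ARGS = 2  /* parent template arguments  */
};

/* Error code for a function-local type found in the given context. */
int local_type_error(int ctx)
{
    int code = 3603;                    /* context 0 */
    if (ctx)
        code = 4 * (ctx != 1) + 3606;   /* 1 -> 3606, 2 -> 3610 */
    return code;
}

/* Error code for a private/protected member type in the given context. */
int private_type_error(int ctx)
{
    int code = 3604;                    /* context 0 */
    if (ctx)
        code = 4 * (ctx != 1) + 3607;   /* 1 -> 3607, 2 -> 3611 */
    return code;
}
```

Passing the context explicitly makes the six outcomes easy to enumerate in a unit test, which is harder when the discriminator lives in a global.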
Category 3: Enclosing Parent Function Restrictions
The parent function (the function in whose body the extended lambda is defined) must satisfy several naming and linkage constraints. These exist because the device compiler must be able to instantiate the wrapper template at a globally-unique mangled name derived from the parent function's signature.
| Error | Tag | Restriction | Rationale |
|---|---|---|---|
| 3605 | extended_lambda_enclosing_function_local | Parent function must not be defined inside another function (local function). | Nested function bodies have no externally-visible mangling; the wrapper tag would be unresolvable. |
| 3608 | extended_lambda_cant_take_function_address | Parent function must allow its address to be taken. Checks entity+80 bits 0-1 for address-taken capability. | The wrapper tag encodes a function pointer to the parent's operator(). If address-of is forbidden (e.g., deleted functions), the tag is ill-formed. |
| 3609 | extended_lambda_parent_class_unnamed | Parent function cannot be a member of an unnamed class. Walks the scope chain checking entity+8 (name pointer) for null. | Unnamed classes have no mangled name, making the wrapper tag unresolvable. |
| 3601 | extended_lambda_parent_non_extern | On Windows only: parent function must have external linkage. Internal or no linkage is prohibited. | Windows COFF requires external linkage for cross-TU symbol resolution. On Linux ELF this restriction does not apply. Checks entity+81 bit 2 (has_qualified_scope) and entity+8 (name). |
| 3608 | extended_lambda_inaccessible_parent | Parent function cannot have private or protected access within its class. Checks entity+80 bits 0-1 (access specifier). | Private/protected member functions are not visible to the device compiler's separate compilation pass. |
| 3592 | extended_lambda_enclosing_function_deducible | Parent function must not have a deduced return type (auto return). Checks entity+81 bit 0 (is_deprecated flag used as deducible marker). | Deduced return types are resolved lazily; the wrapper template needs a concrete type. |
| 3600 | (no dedicated tag) | Parent function cannot be = deleted or = defaulted. Checks entity+166 for values 1 or 2 (deleted, defaulted). | A deleted/defaulted function has no body, so the lambda cannot exist. |
| 3613 | (no dedicated tag) | Parent function cannot have a noexcept specification. Checks entity+191 bit 0. | Exception specifications interact with the wrapper's NeverThrows template parameter in ways that cannot be validated at frontend time. |
| 3615 | extended_lambda_enclosing_function_not_found | The validator (sub_41A3E0) could not locate the enclosing function. Fires when the type annotation context byte has bit 0 set but the host-device validation context a2 == 0. | Internal consistency check; should not occur in well-formed code. |
Category 4: Template Parameter Restrictions
The parent function's template parameter list must satisfy naming and variadic constraints to ensure the wrapper tag type can be uniquely instantiated.
| Error | Tag | Restriction |
|---|---|---|
| -- | extended_lambda_parent_template_param_unnamed | Every template parameter of the enclosing parent function must be named. Anonymous template parameters (template <typename>) prevent the wrapper from referencing the parameter in its tag type. Checked per-parameter during scope walk. |
| -- | extended_lambda_nest_parent_template_param_unnamed | Same restriction applied to nested parent scopes (enclosing class templates, enclosing function templates above the immediate parent). |
| -- | extended_lambda_multiple_parameter_packs | The parent template function can have at most one variadic parameter pack, and it must be the last parameter. Multiple packs or non-trailing packs prevent the device compiler from deducing the wrapper specialization. |
Category 5: Nesting and Context Restrictions
| Error | Tag | Restriction | Rationale |
|---|---|---|---|
| -- | extended_lambda_enclosing_function_generic_lambda | An extended lambda cannot be defined inside a generic lambda expression. Generic lambdas have template operator() which makes the closure type non-deducible for wrapper tag generation. | Generic lambdas produce dependent types that the wrapper system cannot resolve. |
| -- | extended_lambda_enclosing_function_hd_lambda | An extended lambda cannot be defined inside another extended __host__ __device__ lambda. | The wrapper for the outer HD lambda would need to capture the inner wrapper, creating a recursive type dependency. |
| -- | extended_host_device_generic_lambda | A __host__ __device__ extended lambda cannot be a generic lambda (i.e., with auto parameters). | The HD wrapper uses type erasure with concrete function pointer types. Generic lambdas would require polymorphic function pointers, which the type erasure scheme cannot express. |
| -- | extended_lambda_inaccessible_ancestor | An extended lambda cannot be defined inside a class that has private or protected access within another class. | The wrapper tag must be visible to both host and device compilation passes. A privately-nested class is not accessible from the translation-unit scope where the wrapper template is instantiated. |
| -- | extended_lambda_inside_constexpr_if | An extended lambda cannot be defined inside the if or else block of a constexpr if statement (platform/dialect dependent). | Discarded constexpr if branches may eliminate the lambda entirely, but the preamble has already been committed. Restriction prevents dangling wrapper specializations. |
| 3590 | extended_lambda_multiple_parent | Cannot specify __nv_parent more than once in a single lambda's capture list. | __nv_parent stores a single parent class pointer at lambda_info + 32; only one slot exists. |
| 3634 | (no dedicated tag) | __nv_parent requires the lambda to be __device__ annotated. If __nv_parent is specified without __device__ execution space, this error fires. Additionally validates that the enclosing scope has __host__ but not __device__ execution space (bits at entity+182). | __nv_parent is used to link the device closure to its enclosing class for member access. This is only meaningful in device execution context. |
Category 6: Specifier and Annotation Restrictions
| Error | Tag | Restriction |
|---|---|---|
| 3612 | extended_lambda_disallowed | __host__ or __device__ annotation on a lambda when --extended-lambda is not enabled. This is a warning, not an error. The flag must be explicitly passed on the command line. |
| 3620 | extended_lambda_constexpr | The constexpr specifier is not allowed on an extended lambda's operator(). Also applies to consteval. Two separate emit calls: one for "constexpr" and one for "consteval". |
| 3621 | (no dedicated tag) | The operator() function for a lambda cannot be explicitly annotated with execution space annotations (__host__/__device__/__global__). The annotations are derived from the closure class, not the operator. Fires when entity+182 bits 1-2 are set on the call operator. |
| 3689 | (no dedicated tag) | OptiX mode incompatibility. When both dword_106BDD8 (OptiX) and dword_106B670 (a secondary OptiX flag) are set, and the lambda body at qword_106B678 + 176*dword_106B670 + 5 has bit 3 set, this error fires. OptiX has stricter lambda body requirements than standard CUDA. |
| 3690 | extended_lambda_discriminator | Lambda numbering lookup failure in the red-black tree (ptr / dword_E7FE48). The tree maps source positions to lambda indices for unique wrapper tag generation. If the tree search fails, the wrapper cannot be uniquely identified. |
| 3691 | (no dedicated tag) | Extended lambda with __host__ __device__ annotation where the type annotation byte has bit 4 set (HD init-capture validation context). Issued by sub_41A3E0 as a final post-check. |
Category 7: Enclosing Scope Miscellaneous
| Error | Tag | Restriction |
|---|---|---|
| 3617 | extended_lambda_no_parent_func | No enclosing function could be found for the extended lambda. sub_6BCDD0 (nv_find_parent_lambda_function) walked the scope chain and returned null. The lambda may be at file scope, which is not a valid context for an extended lambda. |
| 3618 | extended_lambda_illegal_parent | Ambiguous overload when resolving the enclosing function. sub_6BCDD0 found multiple candidate functions. Emitted via sub_4F6E50 with three operands (location, space string, function name). |
| 3619 | (no dedicated tag) | Secondary ambiguity variant. Same as 3618 but fires on a different branch (the v291[0] check rather than v287[0]), indicating the ambiguity was detected through a different resolution path. |
| 3601 | (duplicate) | Lambda defined in unnamed namespace (entity+81 bit 2 set and entity+8 name pointer is null). The wrapper tag requires a named scope. |
| 3605 | (duplicate) | Non-trivially-copyable type in capture scope. When entity+80 bits 0-1 indicate non-trivial copy semantics, the capture cannot be transferred to device memory. |
Validation Architecture
Phase 4 of scan_lambda: Per-Lambda Validation
After parsing the capture list and annotations (Phases 1--3), scan_lambda enters the extended lambda validation block. This block is guarded by dword_106BF38 (extended lambda mode) and the annotation bits at lambda_info + 25. The validation proceeds as:
sub_447930 (scan_lambda), Phase 4 entry:
|
+-- Call sub_6BCDD0 (nv_find_parent_lambda_function)
| Returns: parent function node, sets is_device/is_template flags
|
+-- If parent == NULL: emit error 3617
+-- If ambiguous: emit error 3618 or 3619
|
+-- Validate parent function properties:
| entity+81 bit 0 -> error 3592 (deprecated/deducible)
| entity+191 bit 0 -> error 3613 (noexcept spec)
| entity+166 == 1|2 -> error 3600 (deleted/defaulted)
| entity+81 bit 2 -> unnamed scope check -> error 3601
| entity+80 bits 0-1 -> address-taken / access check -> error 3608
|
+-- Walk parent scope chain for unnamed classes:
| entity+8 == NULL -> error 3609
| Non-trivial copy -> error 3605
|
+-- Check capture-default conflicts:
| byte+24 bits 4+5 both set -> error 3593 (& and = conflict)
|
+-- OptiX gate: dword_106BDD8 -> error 3689
|
+-- Lambda numbering via red-black tree:
Lookup failure -> error 3690
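The three parent-function checks in the tree that depend on a single byte each (3592, 3613, 3600) can be sketched as follows. This is an illustration under stated assumptions: the entity is treated as a raw byte array, "bit 0" is taken to mean the least significant bit, and the function name and output-buffer interface are hypothetical. The pointer-based checks (3601, 3609) and the access check (3608) are omitted because their exact conditions are not fully specified here.

```c
#include <stddef.h>
#include <stdint.h>

/* Appends the error codes triggered by the byte-flag checks to out[] and
 * returns the number written. `entity` must cover at least 192 bytes. */
size_t check_parent_function_flags(const uint8_t *entity,
                                   int *out, size_t out_cap)
{
    size_t n = 0;

    /* entity+81 bit 0: deduced return type -> 3592 */
    if ((entity[81] & 0x01) && n < out_cap)
        out[n++] = 3592;

    /* entity+191 bit 0: noexcept specification -> 3613 */
    if ((entity[191] & 0x01) && n < out_cap)
        out[n++] = 3613;

    /* entity+166 == 1 or 2: deleted / defaulted -> 3600 */
    if ((entity[166] == 1 || entity[166] == 2) && n < out_cap)
        out[n++] = 3600;

    return n;
}
```

Collecting codes into a buffer rather than emitting diagnostics directly is a choice made for this sketch; the binary is described as emitting each error at its check site.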
Per-Capture Validation: sub_42EE00
For each captured variable, make_field_for_lambda_capture runs targeted checks:
sub_42EE00 (make_field_for_lambda_capture):
|
+-- If byte+25 bit 3 set (device wrapper):
| |
| +-- Check init-capture for array type
| | type_node+48 == 3 && sub_kind == 21 -> error 3596
| |
| +-- Call sub_41A1F0 (walk_type_for_hd_violations)
| | Counts array dimensions, checks element type
| | dim > 7 -> error 3597
| | Not default-constructible -> error 3598
| | Not assignable -> error 3599
| |
| +-- Check for pack expansion
| sub_7A8C00 returns true -> error 3616
|
+-- (Later) If byte+25 bit 4 set (HD wrapper):
|
+-- Call sub_7B0B60 with sub_41B420 callback
Walks entire type tree, fires 3603/3604 for
function-local and private/protected types
Type Hierarchy Walker: sub_41A3E0 / sub_41A1F0
sub_41A3E0 is the outer wrapper that validates the per-capture annotation context. sub_41A1F0 performs the recursive array dimension walk and element-type validation.
sub_41A3E0 (validate_type_hd_annotation):
|
+-- Determine context string: "__device__" or "__host__ __device__"
| Based on a2 parameter (0 = HD, nonzero = device-only)
|
+-- Check annotation byte (a1+32):
| bit 0 set && a2==0 -> error 3615
| bit 3 set -> check parent visibility:
| entity+163 < 0 (private) -> check bit pattern
| Both bits 3+4 set with private parent -> error 3635
| Otherwise -> error 3593
| bit 5 set -> error 3594 (private/protected access)
|
+-- Unwrap CV-qualifiers on element type (kind==12 loop)
|
+-- Call sub_41A1F0 (walk_type_for_hd_violations):
| Recursive array walker:
| v6 = dimension counter
| Loop: while sub_7A8370(type) returns true
| increment v6, follow sub_7A9310 to element type
| If v6 > 7: error 3597
| Unwrap CV (kind==12 loop)
| If not in dependent context (dword_126C5C4 == -1):
| Check scope flags (byte+6 bits 1-2)
| sub_550E50(30, type, 0) -> error 3598 (not default-constructible)
| sub_5BD540 + sub_510860(60, ...) -> error 3599 (not assignable)
| Call sub_7B0B60(type, sub_41B420, 792) for deep type walk
|
+-- If a3 (third parameter) set:
Check bit 4 of annotation byte -> error 3691
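The array-rank portion of the walk above can be sketched with a simplified type node. The struct, the enum, and the function names are hypothetical stand-ins: the real EDG node layout is not reproduced, and the loops only mirror the described behavior of sub_7A8370 / sub_7A9310 (array test, element fetch) and the kind==12 CV unwrapping.

```c
#include <stddef.h>

/* Simplified, hypothetical type node. */
enum type_kind { TK_SCALAR, TK_ARRAY, TK_CV_QUALIFIED };

struct type_node {
    enum type_kind kind;
    const struct type_node *inner; /* element type or unqualified type */
};

enum { MAX_CAPTURED_ARRAY_RANK = 7, ERR_ARRAY_CAPTURE_RANK = 3597 };

/* Counts consecutive array levels starting at t. */
int array_rank(const struct type_node *t)
{
    int rank = 0;
    while (t != NULL && t->kind == TK_ARRAY) {
        ++rank;
        t = t->inner;
    }
    return rank;
}

/* Skips array levels, then strips CV-qualifier wrappers, and returns
 * the innermost element type. */
const struct type_node *array_element_type(const struct type_node *t)
{
    while (t != NULL && t->kind == TK_ARRAY)
        t = t->inner;
    while (t != NULL && t->kind == TK_CV_QUALIFIED)
        t = t->inner;
    return t;
}

/* Returns 3597 when the rank exceeds 7, otherwise 0. */
int check_array_capture_rank(const struct type_node *t)
{
    return array_rank(t) > MAX_CAPTURED_ARRAY_RANK
               ? ERR_ARRAY_CAPTURE_RANK : 0;
}
```

The default-constructibility and assignability checks (3598, 3599) would run on the result of array_element_type; they are left out because they depend on EDG's overload-resolution machinery.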
sub_41B420: Type Walk Callback
This compact callback (33 lines decompiled) is invoked by sub_7B0B60 for every type node in the capture's type tree. It checks two properties:
- Function-local type -- entity+81 bit 0 set: the type is defined inside a function body. Error selection uses dword_E7FE78 to pick between capture context (3603), operator() context (3606), and parent template-arg context (3610).
- Private/protected member type -- entity+81 bit 2 set AND entity+80 bits 0-1 in range [1,2] (private or protected access specifier). Error selection parallels the local-type case: 3604, 3607, or 3611 depending on dword_E7FE78.
Special case: when entity+132 == 9 (template parameter dependent type) AND entity+152 points to a class with byte+86 bit 0 set AND entity+72 is non-null, the function-local check is suppressed. This handles template parameters that are not themselves local but instantiate with local types -- the error is deferred to instantiation time.
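The two property tests performed by the callback reduce to byte-mask checks. The sketch below assumes the entity can be viewed as a byte array and that bit numbering starts at the least significant bit; the function names are hypothetical, and the template-parameter special case is not modeled.

```c
#include <stdint.h>

/* entity+81 bit 0 set: type is defined inside a function body. */
int is_function_local_type(const uint8_t *entity)
{
    return (entity[81] & 0x01) != 0;
}

/* entity+81 bit 2 set and entity+80 bits 0-1 equal to 1 or 2:
 * type is a private or protected class member. */
int is_private_or_protected_member_type(const uint8_t *entity)
{
    unsigned access = entity[80] & 0x03u;
    return (entity[81] & 0x04) != 0 && (access == 1 || access == 2);
}
```

An access value of 0 or 3 in entity+80 bits 0-1 does not satisfy the second test, even when entity+81 bit 2 is set.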
Diagnostic Tag Reference
Complete list of all 39 extended lambda diagnostic tags, sorted alphabetically. All tags can be used with --diag_suppress, --diag_warning, --diag_error on the command line, and with #pragma nv_diag_suppress, #pragma nv_diag_warning, #pragma nv_diag_error in source.
| Tag | Category |
|---|---|
| extended_host_device_generic_lambda | Nesting |
| extended_lambda_array_capture_assignable | Capture |
| extended_lambda_array_capture_default_constructible | Capture |
| extended_lambda_array_capture_rank | Capture |
| extended_lambda_call_operator_local_type | Type |
| extended_lambda_call_operator_private_type | Type |
| extended_lambda_cant_take_function_address | Parent |
| extended_lambda_capture_in_constexpr_if | Capture |
| extended_lambda_capture_local_type | Type |
| extended_lambda_capture_private_type | Type |
| extended_lambda_constexpr | Specifier |
| extended_lambda_disallowed | Specifier |
| extended_lambda_discriminator | Internal |
| extended_lambda_enclosing_function_deducible | Parent |
| extended_lambda_enclosing_function_generic_lambda | Nesting |
| extended_lambda_enclosing_function_hd_lambda | Nesting |
| extended_lambda_enclosing_function_local | Parent |
| extended_lambda_enclosing_function_not_found | Parent |
| extended_lambda_hd_init_capture | Capture |
| extended_lambda_illegal_parent | Parent |
| extended_lambda_inaccessible_ancestor | Nesting |
| extended_lambda_inaccessible_parent | Parent |
| extended_lambda_init_capture_array | Capture |
| extended_lambda_init_capture_initlist | Capture |
| extended_lambda_inside_constexpr_if | Nesting |
| extended_lambda_multiple_parameter_packs | Template |
| extended_lambda_multiple_parent | Nesting |
| extended_lambda_nest_parent_template_param_unnamed | Template |
| extended_lambda_no_parent_func | Parent |
| extended_lambda_pack_capture | Capture |
| extended_lambda_parent_class_unnamed | Parent |
| extended_lambda_parent_local_type | Type |
| extended_lambda_parent_non_extern | Parent |
| extended_lambda_parent_private_template_arg | Type |
| extended_lambda_parent_private_type | Type |
| extended_lambda_parent_template_param_unnamed | Template |
| extended_lambda_reference_capture | Capture |
| extended_lambda_too_many_captures | Capture |
| this_addr_capture_ext_lambda | Capture |
Bitmap Interaction
The capture count limit of 1023 derives from the bitmap architecture. Each wrapper type (device and host-device) uses a 128-byte bitmap (unk_1286980 / unk_1286900) storing 1024 bits. The bitmap setter sub_6BCBF0 performs:
result[capture_count >> 6] |= 1LL << capture_count;
Bit 0 is never emitted as a wrapper specialization (the zero-capture case uses the primary template). Bits 1--1023 map to generated partial specializations. The error check (capture count > 0x3FE, i.e. > 1022) fires before the bitmap set operation. Read literally, that comparison also rejects a count of exactly 1023, so the largest accepted count may be 1022 even though bit 1023 is the highest representable index. A count of 1024 or more would select word index 16 (1024 >> 6), one element past the end of the 16-word array, though in practice the error check prevents this.
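The word-and-bit addressing used by the setter can be sketched in portable C. Two details here are additions of this sketch and are not claimed to be in the binary: the explicit range check, and the `& 63` mask on the shift count. The mask is needed because ISO C leaves a 64-bit shift by 64 or more undefined, whereas the x86-64 shift instruction masks the count to its low six bits in hardware. Function and constant names are hypothetical.

```c
#include <stdint.h>

enum { CAPTURE_BITMAP_WORDS = 16, CAPTURE_BITMAP_BITS = 1024 };

/* Sets the bit for capture_count. Returns 1 on success, 0 if the index
 * does not fit in the 1024-bit map (bounds check added by this sketch). */
int record_capture_count(uint64_t bitmap[CAPTURE_BITMAP_WORDS],
                         unsigned capture_count)
{
    if (capture_count >= CAPTURE_BITMAP_BITS)
        return 0;
    /* Word index = count / 64, bit index = count % 64. */
    bitmap[capture_count >> 6] |= (uint64_t)1 << (capture_count & 63);
    return 1;
}

/* Returns 1 if the bit for capture_count is set, otherwise 0. */
int capture_count_recorded(const uint64_t bitmap[CAPTURE_BITMAP_WORDS],
                           unsigned capture_count)
{
    if (capture_count >= CAPTURE_BITMAP_BITS)
        return 0;
    return (int)((bitmap[capture_count >> 6] >> (capture_count & 63)) & 1u);
}
```

With this layout, capture count 64 lands in bit 0 of word 1, and capture count 1023 lands in bit 63 of word 15.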
Operator() Annotation Derivation
Error 3621 enforces a fundamental design rule: the operator() function of an extended lambda must not carry explicit execution space annotations. Instead, the execution space is derived from the closure class. During scan_lambda Phase 5 (decl_call_operator_for_lambda), the code sets the call operator's execution space from lambda_info + 25:
// Propagate device/host from lambda_info to call operator
byte[operator+182] = (4 * byte[lambda+25]) & 0x10 | byte[operator+182] & 0xEF;
byte[operator+182] = (16 * byte[lambda+25]) & 0x20 | byte[operator+182] & 0xDF;
If the call operator already has execution space bits set (from explicit annotation by the user), error 3621 fires. The rationale is that the wrapper template's tag type already encodes the execution space; having the operator carry its own annotations would create an inconsistency that the device compiler cannot resolve.
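The two propagation statements above can be restated as a pure function over the two flag bytes, which makes the bit movement easier to verify. The function name is hypothetical; the masks and shift amounts come directly from the statements (multiplying by 4 and 16 is a left shift by 2 and 4). Which execution space each bit denotes is not asserted here.

```c
#include <stdint.h>

/* Returns the updated operator+182 byte given the current operator byte
 * and the lambda+25 byte. */
uint8_t propagate_exec_space(uint8_t operator_flags, uint8_t lambda_flags)
{
    /* lambda bit 2 -> operator bit 4 (mask 0x10); other bits preserved */
    operator_flags = (uint8_t)(((lambda_flags << 2) & 0x10)
                               | (operator_flags & 0xEF));
    /* lambda bit 1 -> operator bit 5 (mask 0x20); other bits preserved */
    operator_flags = (uint8_t)(((lambda_flags << 4) & 0x20)
                               | (operator_flags & 0xDF));
    return operator_flags;
}
```

Both target bits are overwritten rather than OR-ed in, so a lambda byte of zero clears operator bits 4 and 5 regardless of their previous state.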
Key Functions
| Address | Name (recovered) | Lines | Role |
|---|---|---|---|
| sub_447930 | scan_lambda | 2113 | Master lambda parser; Phase 4 = restriction validator |
| sub_42EE00 | make_field_for_lambda_capture | 551 | Per-capture field creator with device-lambda validation |
| sub_41A3E0 | validate_type_hd_annotation | 75 | Outer type annotation checker (errors 3593/3594/3615/3635/3691) |
| sub_41A1F0 | walk_type_for_hd_violations | 81 | Recursive array dim / element-type validator (3597/3598/3599) |
| sub_41B420 | (type walk callback) | 33 | Issues 3603/3604/3606/3607/3610/3611 via dword_E7FE78 dispatch |
| sub_6BCDD0 | nv_find_parent_lambda_function | 33 | Scope chain walk to find enclosing host/device function |
| sub_6BCBF0 | nv_record_capture_count | 13 | Set bit in device or host-device bitmap |
| sub_4F9F20 | (capture count finalizer) | ~620 | Checks capture count > 0x3FE, calls bitmap setter |
| sub_7B0B60 | (tree walker) | -- | Recursive type tree traversal, calls callback for each node |
| sub_7A8370 | (is_array_type) | -- | Returns nonzero if type node is an array type |
| sub_7A9310 | (get_element_type) | -- | Returns the element type of an array type node |
| sub_550E50 | (check_default_constructible) | -- | sub_550E50(30, type, 0) tests default-constructibility |
| sub_510860 | (check_callable) | -- | sub_510860(60, op, type) tests if operator is callable |
Global State
| Variable | Address | Purpose |
|---|---|---|
| dword_106BF38 | 0x106BF38 | Extended lambda mode flag (--extended-lambda) |
| dword_106BDD8 | 0x106BDD8 | OptiX mode flag |
| dword_106B670 | 0x106B670 | Secondary OptiX lambda flag |
| qword_106B678 | 0x106B678 | OptiX lambda body array base pointer |
| dword_E7FE78 | 0xE7FE78 | Type walk context discriminator (0=capture, 1=operator, 2=parent) |
| ptr | (stack) | Red-black tree root for lambda numbering per source position |
| dword_E7FE48 | 0xE7FE48 | Red-black tree sentinel node |
| dword_126C5C4 | 0x126C5C4 | Dependent scope index (-1 = not in dependent context) |
| dword_126EFAC | 0x126EFAC | CUDA mode flag |
| dword_126EFA4 | 0x126EFA4 | GCC extensions flag |
| qword_126EF98 | 0x126EF98 | GCC compatibility version |
Related Pages
- Extended Lambda Overview -- end-to-end pipeline, annotation bits, lambda_info layout
- Device Lambda Wrapper -- __nv_dl_wrapper_t template structure
- Host-Device Lambda Wrapper -- __nv_hdl_wrapper_t type-erased design
- Capture Handling -- __nv_lambda_field_type, __nv_lambda_array_wrapper
- Preamble Injection -- sub_6BCC20 emission pipeline
- CUDA Error Catalog -- complete error index with message templates
- Cross-Space Call Validation -- execution space checking infrastructure
IL Overview
The Intermediate Language (IL) is EDG's central data structure -- a typed, scope-linked graph of every declaration, type, expression, statement, and template in the translation unit. cudafe++ (EDG 6.6) builds the IL during parsing, walks it for CUDA device/host separation, and emits it as the .int.c output. The IL never touches disk: IL_SHOULD_BE_WRITTEN_TO_FILE=0 forces in-memory-only operation. All IL nodes live in a region-based arena allocator, organized into file-scope (region 1) and per-function (region N) memory pools.
The IL is versioned as IL_VERSION_NUMBER="6.6" and carries the compile-time flag ALL_TEMPLATE_INFO_IN_IL=1, meaning template definitions, specializations, and instantiation directives are fully represented in the IL graph rather than deferred to a separate template database.
Key Configuration Constants
| Constant | Value | Meaning |
|---|---|---|
| IL_VERSION_NUMBER | "6.6" | IL format version, matches EDG version |
| IL_SHOULD_BE_WRITTEN_TO_FILE | 0 | IL is never serialized to disk |
| ALL_TEMPLATE_INFO_IN_IL | 1 | Full template data in IL graph |
| IL_FILE_SUFFIX | (string) | Suffix for IL file names if serialization were enabled |
| sizeof_il_entry sentinel | 9999 | Validated at init time (guard value in qword_E6C580) |
IL Entry Kind System
Every IL node carries an entry_kind byte that identifies its type. The name table off_E6DD80 (aliased as il_entry_kind_names at off_E6E020) maps these bytes to human-readable strings. The il_one_time_init function (sub_5CF7F0) validates that this table ends with a "last" sentinel.
There are 85 defined entry kind values (0-84). Some are primary node types with their own linked lists; others are auxiliary records displayed inline by their parent.
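The sentinel check on the name table can be illustrated with a truncated table. Everything in this sketch is illustrative: the demo table holds only the first four kind names from the table that follows, the "last" entry is assumed to sit at the index equal to the number of kinds, and the function names are hypothetical rather than recovered from sub_5CF7F0.

```c
#include <stddef.h>
#include <string.h>

/* Truncated demo table: kinds 0-3 followed by the "last" sentinel. */
enum { DEMO_KIND_COUNT = 4 };

const char *const demo_kind_names[DEMO_KIND_COUNT + 1] = {
    "none", "source_file_entry", "constant", "param_type", "last"
};

/* Returns 1 if the entry just past the final kind is the sentinel. */
int name_table_has_sentinel(const char *const *names, size_t kind_count)
{
    return strcmp(names[kind_count], "last") == 0;
}

/* Returns the name for a kind, or NULL if the kind is out of range. */
const char *entry_kind_name(const char *const *names, size_t kind_count,
                            unsigned kind)
{
    return kind < kind_count ? names[kind] : NULL;
}
```

A check of this shape catches a name table that has drifted out of sync with the kind enumeration, since a missing or extra entry moves the sentinel away from the expected index.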
Complete il_entry_kind Table
| Kind | Hex | Name | Bytes | Display | Notes |
|---|---|---|---|---|---|
| 0 | 0x00 | none | -- | -- | Null/invalid sentinel |
| 1 | 0x01 | source_file_entry | 80 | Case 1 | File name, line ranges, include flags |
| 2 | 0x02 | constant | 184 | Case 2 | 16 sub-kinds (ck_*) |
| 3 | 0x03 | param_type | 80 | Case 3 | Parameter type in function signature |
| 4 | 0x04 | routine_type_supplement | 64 | Inline | Embedded in routine type node |
| 5 | 0x05 | routine_type_extra | -- | Inline | Additional routine type data |
| 6 | 0x06 | type | 176 | Case 6 | 22 sub-kinds (tk_*) |
| 7 | 0x07 | variable | 232 | Case 7 | Variables, parameters, structured bindings |
| 8 | 0x08 | field | 176 | Case 8 | Class/struct/union members |
| 9 | 0x09 | exception_specification | 16 | Case 9 | noexcept, throw() specs |
| 10 | 0x0A | exception_spec_type | 24 | Case 0xA | Type in exception specification |
| 11 | 0x0B | routine | 288 | Case 0xB | Functions, methods, constructors, destructors |
| 12 | 0x0C | label | 128 | Case 0xC | Goto labels, break/continue targets |
| 13 | 0x0D | expr_node | 72 | Case 0xD | 36 sub-kinds (enk_*) |
| 14 | 0x0E | (reserved) | -- | Inline | Skipped in display |
| 15 | 0x0F | (reserved) | -- | Inline | Skipped in display |
| 16 | 0x10 | switch_case_entry | 56 | Case 0x10 | Case value + range for switch |
| 17 | 0x11 | switch_info | 24 | Case 0x11 | Switch statement descriptor |
| 18 | 0x12 | handler | 40 | Case 0x12 | try/catch handler entry |
| 19 | 0x13 | try_supplement | 32 | Inline | Try block extra info |
| 20 | 0x14 | asm_supplement | -- | Inline | Inline asm statement data |
| 21 | 0x15 | statement | 80 | Case 0x15 | 26 sub-kinds (stmk_*) |
| 22 | 0x16 | object_lifetime | 64 | Case 0x16 | Destruction ordering |
| 23 | 0x17 | scope | 288 | Case 0x17 | 9 sub-kinds (sck_*) |
| 24 | 0x18 | base_class | 112 | Case 0x18 | Inheritance record |
| 25 | 0x19 | string_text | 1* | -- | Raw string literal bytes |
| 26 | 0x1A | other_text | 1* | -- | Compiler version, misc text |
| 27 | 0x1B | template_parameter | 136 | Case 0x1B | Template param with supplement |
| 28 | 0x1C | namespace | 128 | Case 0x1C | Namespace declarations |
| 29 | 0x1D | using_declaration | 80 | Case 0x1D | Using declarations/directives |
| 30 | 0x1E | dynamic_init | 104 | Case 0x1E | 9 sub-kinds (dik_*) |
| 31 | 0x1F | local_static_variable_init | 40 | Case 0x1F | Static local init records |
| 32 | 0x20 | vla_dimension | 48 | Case 0x20 | Variable-length array bound |
| 33 | 0x21 | overriding_virtual_func | 40 | Case 0x21 | Virtual override info |
| 34 | 0x22 | (reserved) | -- | Inline | Skipped in display |
| 35 | 0x23 | derivation_path | 24 | Case 0x23 | Base-class derivation step |
| 36 | 0x24 | base_class_derivation | 32 | -- | Derivation detail record |
| 37 | 0x25 | (reserved) | -- | Inline | Skipped in display |
| 38 | 0x26 | (reserved) | -- | Inline | Skipped in display |
| 39 | 0x27 | class_info | 208 | Case 0x27 | Class type supplement |
| 40 | 0x28 | (reserved) | -- | -- | Skipped in display |
| 41 | 0x29 | constructor_init | 48 | Case 0x29 | Ctor member/base initializer |
| 42 | 0x2A | asm_entry | 152 | Case 0x2A | Inline assembly block |
| 43 | 0x2B | asm_operand | -- | Case 0x2B | Asm constraint + expression |
| 44 | 0x2C | asm_clobber | -- | Case 0x2C | Asm clobber register |
| 45 | 0x2D | (reserved) | -- | Inline | Skipped in display |
| 46 | 0x2E | (reserved) | -- | Inline | Skipped in display |
| 47 | 0x2F | (reserved) | -- | Inline | Skipped in display |
| 48 | 0x30 | (reserved) | -- | Inline | Skipped in display |
| 49 | 0x31 | element_position | 24 | -- | Designator element position |
| 50 | 0x32 | source_sequence_entry | 32 | Case 0x32 | Declaration ordering |
| 51 | 0x33 | full_entity_decl_info | 56 | Case 0x33 | Full declaration info |
| 52 | 0x34 | instantiation_directive | 40 | Case 0x34 | Explicit instantiation |
| 53 | 0x35 | src_seq_sublist | 24 | Case 0x35 | Source sequence sub-list |
| 54 | 0x36 | explicit_instantiation_decl | -- | Case 0x36 | extern template |
| 55 | 0x37 | orphaned_entities | 56 | Case 0x37 | Entities without parent scope |
| 56 | 0x38 | hidden_name | 32 | Case 0x38 | Hidden name entry |
| 57 | 0x39 | pragma | 64 | Case 0x39 | Pragma records (43 kinds) |
| 58 | 0x3A | template | 208 | Case 0x3A | Template declaration |
| 59 | 0x3B | template_decl | 40 | Case 0x3B | Template declaration head |
| 60 | 0x3C | requires_clause | 16 | Case 0x3C | C++20 requires clause |
| 61 | 0x3D | template_param | 136 | Case 0x3D | Template parameter entry |
| 62 | 0x3E | name_reference | 40 | Case 0x3E | Name lookup reference |
| 63 | 0x3F | name_qualifier | 40 | Case 0x3F | Qualified name qualifier |
| 64 | 0x40 | seq_number_lookup | 32 | Case 0x40 | Sequence number index |
| 65 | 0x41 | local_expr_node_ref | -- | Case 0x41 | Local expression reference |
| 66 | 0x42 | static_assert | 24 | Case 0x42 | Static assertion |
| 67 | 0x43 | linkage_spec | 32 | Case 0x43 | extern "C"/"C++" block |
| 68 | 0x44 | scope_ref | 32 | Case 0x44 | Scope back-reference |
| 69 | 0x45 | (reserved) | -- | Inline | Skipped in display |
| 70 | 0x46 | lambda | -- | Case 0x46 | Lambda expression |
| 71 | 0x47 | lambda_capture | -- | Case 0x47 | Lambda capture entry |
| 72 | 0x48 | attribute | 72 | Case 0x48 | C++11/GNU attribute |
| 73 | 0x49 | attribute_argument | 40 | Case 0x49 | Attribute argument |
| 74 | 0x4A | attribute_group | 8 | Case 0x4A | Attribute group |
| 75 | 0x4B | (reserved) | -- | Inline | Skipped in display |
| 76 | 0x4C | (reserved) | -- | Inline | Skipped in display |
| 77 | 0x4D | (reserved) | -- | Inline | Skipped in display |
| 78 | 0x4E | (reserved) | -- | Inline | Skipped in display |
| 79 | 0x4F | template_info | -- | Case 0x4F | Template instantiation info |
| 80 | 0x50 | subobject_path | 24 | Case 0x50 | Address constant sub-path |
| 81 | 0x51 | (reserved) | -- | Inline | Skipped in display |
| 82 | 0x52 | module_info | -- | Case 0x52 | C++20 module metadata |
| 83 | 0x53 | module_decl | -- | Case 0x53 | Module declaration |
| 84 | 0x54 | last | -- | -- | Sentinel for table validation |
Inline entries (kinds 4, 5, 14, 15, 19, 20, 27, 34, 37, 38, 40, 45-48, 69, 75-78, 81) are displayed as part of their parent node rather than as standalone IL entries. The display dispatcher (sub_5F4930) returns immediately for these kinds.
IL Header Structure
The IL header lives in the BSS segment at 0x126EB60 and is printed by display_il_header_and_file_scope (sub_5F76B0). It records translation-unit-level metadata:
struct il_header { // at xmmword_126EB60
il_entry* primary_source_file; // +0x00 head of source file list
scope* primary_scope; // +0x08 file-scope root
routine* main_routine; // +0x10 main() if present
char* compiler_version; // +0x18 "6.6" version string
char* time_of_compilation; // +0x20 build timestamp
uint8_t plain_chars_are_signed; // +0x28 signedness of plain char
uint32_t source_language; // +0x2C 0=C++, 1=C (dword_126EBA8)
uint32_t std_version; // +0x30 e.g. 201703 (dword_126EBAC)
uint8_t pcc_compatibility_mode; // +0x34 PCC compat flag
uint8_t enum_type_is_integral; // +0x35
uint32_t default_max_member_align; // +0x38
uint8_t gcc_mode; // +0x3C GCC compatibility
uint8_t gpp_mode; // +0x3D G++ compatibility
uint32_t gnu_version; // +0x40 e.g. 40201
uint8_t short_enums; // +0x44
uint8_t default_nocommon; // +0x45
uint8_t UCN_identifiers_used; // +0x46
uint8_t vla_used; // +0x47
uint8_t any_templates_seen; // +0x48
uint8_t prototype_instantiations_in_il; // +0x49
uint8_t il_has_all_prototype_instantiations; // +0x4A
uint8_t il_has_C_semantics; // +0x4B
uint8_t nontag_types_used_in_exception_or_rtti; // +0x4C
il_entry* seq_number_lookup_entries; // +0x50
uint32_t target_configuration_index; // +0x58
};
The source_language field selects the display string "sl_Cplusplus" or "sl_C". When source_language == 1 (C mode) and std_version > 199900, the routine display additionally prints C99 pragma state fields (fp_contract, fenv_access, cx_limited_range).
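The two conditions described above can be written as small predicates. This is a sketch under the assumption that only the values 0 and 1 occur in source_language; the function names are hypothetical.

```c
#include <assert.h>

/* Display string for the source_language field (0 = C++, 1 = C). */
const char *source_language_name(unsigned source_language) {
    assert(source_language <= 1);
    return source_language == 1 ? "sl_C" : "sl_Cplusplus";
}

/* True when the routine display also prints the C99 pragma state fields. */
int prints_c99_pragma_state(unsigned source_language, unsigned std_version) {
    return source_language == 1 && std_version > 199900;
}
```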
Memory Region System
IL entries are allocated in numbered memory regions managed by a bump allocator (sub_6B7D60):
| Region | Purpose | Lifetime | Globals |
|---|---|---|---|
| 1 | File scope | Entire translation unit | dword_126EC90 (region ID), dword_126F690/dword_126F694 (base offset / prefix size) |
| 2..N | Per-function scope | Duration of function body processing | dword_126EB40 (current region), dword_126F688/dword_126F68C (base offset / prefix size) |
Region 1 contains all file-scope declarations: types, global variables, function declarations, namespaces, templates. Regions 2+ are allocated one per function definition and hold that function's local variables, statements, expressions, labels, and temporaries. The region table at qword_126EC88 maps region indices to their memory, while qword_126EB90 maps region indices to their associated scope entries. dword_126EC80 tracks the total number of regions.
The allocator selects file-scope vs function-scope by comparing dword_126EB40 == dword_126EC90. When equal, the node goes into region 1; otherwise it goes into the current function region. Some node types force a specific region:
- Labels (alloc_label at sub_5E5CA0): Assert that the current region is NOT file scope
- Templates (alloc_template at sub_5E8D20): Always file-scope only
- Sequence number lookups (sub_5E9170): Force region 1 by temporarily setting TU-copy mode
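The region-selection rule and the two simplest forced cases can be modeled as below. This is a hedged sketch: `region_state`, `select_region`, and `label_region_is_valid` are hypothetical names, and the struct fields merely stand in for the two globals being compared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the two globals compared by the allocators. */
typedef struct {
    int file_scope_region;  /* stands in for dword_126EC90 */
    int current_region;     /* stands in for dword_126EB40 */
} region_state;

/* Picks the region a new node lands in; force_file_scope models
   allocators such as alloc_template that never use a function region. */
int select_region(const region_state *st, int force_file_scope) {
    assert(st != NULL);
    if (force_file_scope || st->current_region == st->file_scope_region)
        return st->file_scope_region;
    return st->current_region;
}

/* Models the label allocator's check: labels need a function region. */
int label_region_is_valid(const region_state *st) {
    assert(st != NULL);
    return st->current_region != st->file_scope_region;
}
```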
The display system (sub_5F7DF0) iterates all regions:
// File scope
printf("Intermediate language for memory region 1 (file scope):");
walk_file_scope_il(display_il_entry, ...); // sub_60E4F0
// Per-function regions
for (long r = 2; r <= region_count; r++) {   // long matches the %ld format below
scope* s = scope_table[r];
routine* fn = s->assoc_routine;
printf("Intermediate language for memory region %ld (function \"%s\"):",
r, fn->name);
walk_routine_scope_il(r, display_il_entry, ...); // sub_610200
}
IL Entry Prefix
Every IL node has a prefix of one to three qwords preceding the node body. The prefix size depends on allocation mode: 24 bytes (3 qwords) in normal file-scope mode, 16 bytes (2 qwords) in TU-copy mode, and 8 bytes (1 qword) for function-scope allocations. The allocator (sub_6B7D60) allocates a contiguous block and the caller returns a pointer past the prefix, so the prefix occupies negative offsets from the returned node pointer.
Normal file-scope mode (dword_106BA08 == 0, dword_126F694 = 24):
Raw allocation layout (normal file-scope, 24-byte prefix):
Offset Size Field
------ ---- -----
+0 8 translation_unit_copy_address (qword, zeroed in normal mode)
+8 8 next_in_list (qword, linked list pointer)
+16 8 prefix flags qword (flags byte at +16, 7 bytes padding)
+24 ... node body starts here (returned pointer)
Node pointer perspective (ptr = raw + 24):
ptr - 24 = TU copy address (8 bytes at raw+0)
ptr - 16 = next pointer (8 bytes at raw+8)
ptr - 8 = prefix flags byte (8 bytes at raw+16, flags in low byte)
ptr + 0 = first byte of node body
TU-copy mode (dword_106BA08 != 0, dword_126F694 = 16):
Raw allocation layout (TU-copy mode, 16-byte prefix):
+0 8 next_in_list (no TU copy slot)
+8 8 prefix flags qword
+16 ... node body starts here (returned pointer)
Function-scope allocations (dword_126F68C = 8):
Raw allocation layout (function-scope, 8-byte prefix):
+0 8 prefix flags qword (no TU copy slot, no next_in_list link)
+8 ... node body starts here (returned pointer)
The prefix flags byte is at ptr - 8 from the returned node pointer (in all modes). The next_in_list pointer at ptr - 16 is the linked list link used by the IL walker to traverse all entries of a given kind (file-scope only). The translation_unit_copy_address at ptr - 24 stores the original address when a node is copied between translation units; it is zeroed in normal mode and absent in TU-copy and function-scope modes.
The keep_in_il test throughout cudafe++ uses *(signed char*)(entry - 8) < 0 to check bit 7 of the prefix flags byte -- this works because the flags byte is always at offset -8 from the node pointer regardless of allocation mode.
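The layout rule (flags byte always at offset -8 from the node pointer) can be demonstrated with a small model. This is a sketch only: calloc stands in for the region allocator, and `alloc_with_prefix`, `prefix_flags`, and `keep_in_il` are hypothetical helper names. The low-byte placement assumes a little-endian target such as x86-64.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { PREFIX_NORMAL = 24, PREFIX_TU_COPY = 16, PREFIX_FUNCTION = 8 };

/* Allocates prefix + body, zeroed, and returns the body pointer.
   calloc stands in for the region allocator; the raw pointer is
   (body - prefix_size) and is what must be passed to free(). */
void *alloc_with_prefix(size_t prefix_size, size_t body_size) {
    assert(prefix_size >= 8 && prefix_size % 8 == 0);
    unsigned char *raw = calloc(1, prefix_size + body_size);
    if (raw == NULL) return NULL;
    return raw + prefix_size;
}

/* The flags byte is the first byte of the qword just before the body. */
uint8_t *prefix_flags(void *node) {
    return (uint8_t *)node - 8;
}

/* Same test as *(signed char*)(entry - 8) < 0: true when bit 7 is set. */
int keep_in_il(const void *node) {
    return *((const signed char *)node - 8) < 0;
}
```

Because the flags qword is the last prefix slot in every mode, the same `keep_in_il` expression works for 24-, 16-, and 8-byte prefixes without knowing which one was used.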
Prefix Flags Byte
The prefix flags byte (at offset -8 from the returned node pointer) encodes scope and language information:
| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | allocated | Always set on allocation |
| 1 | 0x02 | file_scope | Set when !dword_106BA08 (not in TU-copy mode) |
| 2 | 0x04 | is_in_secondary_il | Entry came from secondary translation unit |
| 3 | 0x08 | language_flag | Copies dword_126E5FC & 1 (C++ vs C mode indicator) |
| 7 | 0x80 | keep_in_il | CUDA-critical: marks entry for device IL output |
Bit 7 (keep_in_il) is the mechanism by which cudafe++ selects device-relevant declarations. The mark_to_keep_in_il pass in il_walk.c sets this bit on all entries that are needed for device compilation. See Device/Host Separation and keep-in-il for details.
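The flag bits from the table can be expressed as masks. The sketch below models how a flags byte might be composed on the file-scope allocation path; the `PF_*` names and both functions are illustrative, not recovered symbols, and the parameters stand in for dword_106BA08 and dword_126E5FC.

```c
#include <assert.h>
#include <stdint.h>

/* Masks taken from the table above; the enumerator names are illustrative. */
enum {
    PF_ALLOCATED       = 0x01,
    PF_FILE_SCOPE      = 0x02,
    PF_IN_SECONDARY_IL = 0x04,
    PF_LANGUAGE_FLAG   = 0x08,
    PF_KEEP_IN_IL      = 0x80
};

/* Builds the initial flags byte for a fresh node. tu_copy_mode models
   dword_106BA08, language_word models dword_126E5FC. */
uint8_t initial_prefix_flags(int tu_copy_mode, unsigned language_word) {
    uint8_t flags = PF_ALLOCATED;
    if (!tu_copy_mode)
        flags |= PF_FILE_SCOPE;
    if (language_word & 1u)
        flags |= PF_LANGUAGE_FLAG;
    return flags;
}

/* Marks a node for device IL output by setting bit 7. */
uint8_t mark_keep_in_il(uint8_t flags) {
    assert((flags & PF_ALLOCATED) != 0);  /* only allocated nodes get marked */
    return (uint8_t)(flags | PF_KEEP_IN_IL);
}
```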
Sub-Kind Systems
Most primary IL entry kinds use a secondary kind byte to discriminate between variants. These sub-kind enums are the core classification taxonomy of the IL.
Type Kinds (tk_*)
The type kind byte lives at offset +132 in the type node body. 22 values, dispatched by set_type_kind (sub_5E2E80) and displayed by display_type (sub_5F06B0):
| Value | Name | Supplement | Size | Notes |
|---|---|---|---|---|
| 0 | tk_error | -- | -- | Error/placeholder type |
| 1 | tk_void | -- | -- | void |
| 2 | tk_integer | integer_type_supplement | 32 | int, char, bool, enum, wchar_t, char8/16/32_t |
| 3 | tk_float | -- | -- | float, double, long double |
| 4 | tk_complex | -- | -- | _Complex float/double/ldouble |
| 5 | tk_imaginary | -- | -- | _Imaginary (C99) |
| 6 | tk_pointer | -- | -- | Pointer, reference, rvalue reference |
| 7 | tk_routine | routine_type_supplement | 64 | Function type (return + params) |
| 8 | tk_array | -- | -- | Fixed and variable-length arrays |
| 9 | tk_class | class_type_supplement | 208 | class types |
| 10 | tk_struct | class_type_supplement | 208 | struct types |
| 11 | tk_union | class_type_supplement | 208 | union types |
| 12 | tk_typeref | typeref_type_supplement | 56 | typedef, using, decltype, typeof |
| 13 | tk_ptr_to_member | -- | -- | Pointer-to-member |
| 14 | tk_template_param | templ_param_supplement | 40 | Template type parameter |
| 15 | tk_vector | -- | -- | SIMD vector type |
| 16 | tk_scalable_vector | -- | -- | Scalable vector (SVE) |
| 17 | tk_nullptr | -- | -- | std::nullptr_t |
| 18 | tk_mfp8 | -- | -- | 8-bit floating point |
| 19 | tk_scalable_vector_count | -- | -- | Scalable vector predicate |
| 20 | (auto/decltype_auto) | -- | -- | Placeholder types |
| 21 | (typeof_unqual/typeof_type) | -- | -- | C23 typeof |
The display function references off_A6FE40 (22 string entries) for type kind names. The typeref sub-kind table at off_A6F640 has 28 entries covering typedef aliases, decltype expressions, auto, and concept-constrained placeholders.
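As a compact restatement of the supplement column, the following sketch maps a type kind to the supplement size listed in the table. It is a lookup written for illustration, assuming kinds without a listed supplement carry none; `type_supplement_size` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Subset of the tk_* values from the table above. */
enum type_kind {
    tk_integer = 2, tk_routine = 7, tk_class = 9, tk_struct = 10,
    tk_union = 11, tk_typeref = 12, tk_template_param = 14,
    tk_kind_count = 22
};

/* Returns the supplement size in bytes listed for a type kind,
   or 0 for kinds that the table shows without a supplement. */
size_t type_supplement_size(int kind) {
    assert(kind >= 0 && kind < tk_kind_count);
    switch (kind) {
    case tk_integer:        return 32;
    case tk_routine:        return 64;
    case tk_class:
    case tk_struct:
    case tk_union:          return 208;  /* shared class_type_supplement */
    case tk_typeref:        return 56;
    case tk_template_param: return 40;
    default:                return 0;
    }
}
```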
Constant Kinds (ck_*)
The constant kind byte lives at offset +148 in the constant node. 16 values, dispatched by display_constant (sub_5F2720):
| Value | Name | Notes |
|---|---|---|
| 0 | ck_error | Error placeholder |
| 1 | ck_integer | Integer value (arbitrary precision via sub_602F20) |
| 2 | ck_string | String/character literal (char kind + length + raw bytes) |
| 3 | ck_float | Floating-point constant |
| 4 | ck_complex | Complex constant (real + imaginary) |
| 5 | ck_imaginary | Imaginary constant |
| 6 | ck_address | Address constant with 7 address sub-kinds (abk_*) |
| 7 | ck_ptr_to_member | Pointer-to-member constant |
| 8 | ck_label_difference | GNU label address difference |
| 9 | ck_dynamic_init | Dynamically initialized constant |
| 10 | ck_aggregate | Aggregate initializer (linked list of sub-constants) |
| 11 | ck_init_repeat | Repeated initializer (constant + count) |
| 12 | ck_template_param | Template parameter constant with 15 sub-kinds (tpck_*) |
| 13 | ck_designator | Designated initializer |
| 14 | ck_void | Void constant |
| 15 | ck_reflection | Reflection entity reference |
Address constant sub-kinds (abk_*): abk_routine, abk_variable, abk_constant, abk_temporary, abk_uuidof, abk_typeid, abk_label.
Template parameter constant sub-kinds (tpck_*), 14 of the 15 listed here: tpck_param, tpck_expression, tpck_member, tpck_unknown_function, tpck_address, tpck_sizeof, tpck_datasizeof, tpck_alignof, tpck_uuidof, tpck_typeid, tpck_noexcept, tpck_template_ref, tpck_integer_pack, tpck_destructor.
Expression Node Kinds (enk_*)
The expression kind byte lives at offset +24 in the expression node. 36 values, dispatched by display_expr_node (sub_5ECFE0):
| Value | Name | Notes |
|---|---|---|
| 0 | enk_error | Error expression |
| 1 | enk_operation | Binary/unary/ternary operation (120 operator sub-kinds via eok_*) |
| 2 | enk_constant | Constant reference |
| 3 | enk_variable | Variable reference |
| 4 | enk_field | Field access |
| 5 | enk_temp_init | Temporary initialization |
| 6 | enk_lambda | Lambda expression |
| 7 | enk_new_delete | new/delete expression (56-byte supplement) |
| 8 | enk_throw | throw expression (24-byte supplement) |
| 9 | enk_condition | Conditional expression (32-byte supplement) |
| 10 | enk_object_lifetime | Object lifetime management |
| 11 | enk_typeid | typeid expression |
| 12 | enk_sizeof | sizeof expression |
| 13 | enk_sizeof_pack | sizeof...(pack) |
| 14 | enk_alignof | alignof expression |
| 15 | enk_datasizeof | NVIDIA __datasizeof extension |
| 16 | enk_address_of_ellipsis | Address of variadic parameter |
| 17 | enk_statement | Statement expression (GCC extension) |
| 18 | enk_reuse_value | Reused value reference |
| 19 | enk_routine | Function reference |
| 20 | enk_type_operand | Type as operand (e.g., in sizeof) |
| 21 | enk_builtin_operation | Compiler builtin (indexed via off_E6C5A0) |
| 22 | enk_param_ref | Parameter reference |
| 23 | enk_braced_init_list | C++11 braced init list |
| 24 | enk_c11_generic | C11 _Generic selection |
| 25 | enk_builtin_choose_expr | GCC __builtin_choose_expr |
| 26 | enk_yield | C++20 co_yield |
| 27 | enk_await | C++20 co_await |
| 28 | enk_fold_expression | C++17 fold expression |
| 29 | enk_initializer | Initializer expression |
| 30 | enk_concept_id | C++20 concept-id |
| 31 | enk_requires | C++20 requires expression |
| 32 | enk_compound_req | Compound requirement |
| 33 | enk_nested_req | Nested requirement |
| 34 | enk_const_eval_deferred | Deferred constexpr evaluation |
| 35 | enk_template_name | Template name expression |
The enk_operation kind (value 1) carries an additional operation.kind byte dispatched through off_A6F840 (120 entries, the eok_* enum) and an operation.type_kind byte from off_A6FE40 (22 entries).
Expression Operation Kinds (eok_*)
The 120 operation kinds cover all C++ operators. Key groups:
| Category | Operations |
|---|---|
| Arithmetic | eok_add, eok_subtract, eok_multiply, eok_divide, eok_remainder, eok_negate, eok_unary_plus |
| Bitwise | eok_and, eok_or, eok_xor, eok_complement, eok_shiftl, eok_shiftr |
| Comparison | eok_eq, eok_ne, eok_lt, eok_gt, eok_le, eok_ge, eok_spaceship |
| Logical | eok_land, eok_lor, eok_not |
| Assignment | eok_assign, eok_add_assign, eok_subtract_assign, eok_multiply_assign, etc. |
| Pointer | eok_indirect, eok_address_of, eok_padd, eok_psubtract, eok_pdiff, eok_subscript |
| Member access | eok_dot_field, eok_points_to_field, eok_dot_static, eok_points_to_static, eok_pm_field, eok_points_to_pm_call |
| Casts | eok_cast, eok_lvalue_cast, eok_ref_cast, eok_dynamic_cast, eok_bool_cast, eok_base_class_cast, eok_derived_class_cast |
| Calls | eok_call, eok_dot_member_call, eok_points_to_member_call, eok_dot_pm_call, eok_points_to_pm_func_ptr |
| Increment | eok_pre_incr, eok_pre_decr, eok_post_incr, eok_post_decr |
| Complex | eok_real_part, eok_imag_part, eok_xconj |
| Vector | eok_vector_fill, eok_vector_eq, eok_vector_ne, eok_vector_lt, eok_vector_gt, eok_vector_le, eok_vector_ge, eok_vector_subscript, eok_vector_question, eok_vector_land, eok_vector_lor, eok_vector_not |
| Control | eok_comma, eok_question, eok_parens, eok_lvalue, eok_lvalue_adjust, eok_noexcept |
| Variadic | eok_va_start, eok_va_end, eok_va_arg, eok_va_copy, eok_va_start_single_operand |
| Virtual | eok_virtual_function_ptr, eok_dot_vacuous_destructor_call, eok_points_to_vacuous_destructor_call |
| Misc | eok_array_to_pointer, eok_reference_to, eok_ref_indirect, eok_ref_dynamic_cast, eok_pm_base_class_cast, eok_pm_derived_class_cast, eok_class_rvalue_adjust |
Statement Kinds (stmk_*)
The statement kind byte lives at offset +32 in the statement node. 26 values:
| Value | Name | Supplement | Notes |
|---|---|---|---|
| 0 | stmk_expr | -- | Expression statement |
| 1 | stmk_if | -- | if statement |
| 2 | stmk_constexpr_if | 24 bytes | if constexpr (C++17) |
| 3 | stmk_if_consteval | -- | if consteval (C++23) |
| 4 | stmk_if_not_consteval | -- | if !consteval (C++23) |
| 5 | stmk_while | -- | while loop |
| 6 | stmk_goto | -- | goto statement |
| 7 | stmk_label | -- | Label statement |
| 8 | stmk_return | -- | return statement |
| 9 | stmk_coroutine | 128 bytes | C++20 coroutine body (full coroutine descriptor) |
| 10 | stmk_coroutine_return | -- | co_return statement |
| 11 | stmk_block | 32 bytes | Compound statement / block |
| 12 | stmk_end_test_while | -- | do-while loop |
| 13 | stmk_for | 24 bytes | for loop |
| 14 | stmk_range_based_for | -- | C++11 range-for (iterator, begin, end, incr) |
| 15 | stmk_switch_case | -- | case label |
| 16 | stmk_switch | 24 bytes | switch statement |
| 17 | stmk_init | -- | Declaration with initializer |
| 18 | stmk_asm | -- | Inline assembly |
| 19 | stmk_try_block | 32 bytes | try block |
| 20 | stmk_decl | -- | Declaration statement |
| 21 | stmk_set_vla_size | -- | VLA size computation |
| 22 | stmk_vla_decl | -- | VLA declaration |
| 23 | stmk_assigned_goto | -- | GCC computed goto |
| 24 | stmk_empty | -- | Empty statement |
| 25 | stmk_stmt_expr_result | -- | GCC statement expression result |
The coroutine statement (kind 9) carries the largest supplement at 128 bytes, containing traits, handle, promise, initial/final suspend calls, unhandled_exception call, get_return_object call, new/delete routines, and parameter copies. A preserved typo in the EDG source reads "paramter_copies" (missing 'e'), confirming genuine EDG lineage.
Scope Kinds (sck_*)
The scope kind byte lives at offset +28 in the scope node. 9 observed values:
| Value | Name | Notes |
|---|---|---|
| 0 | sck_file | File scope (translation unit root) |
| 1 | sck_func_prototype | Function prototype scope |
| 2 | sck_block | Block scope (compound statement) |
| 3 | sck_namespace | Namespace scope |
| 6 | sck_class_struct_union | Class/struct/union scope |
| 8 | sck_template_declaration | Template declaration scope |
| 15 | sck_condition | Condition scope (if/while/for condition variable) |
| 16 | sck_enum | Enum scope (C++11 scoped enums) |
| 17 | sck_function | Function body scope (has routine ptr, parameters, ctor inits) |
Scope kinds determine which child lists are displayed. The bitmask (1 << kind) & 0x20044 (bits 2, 6, 17 = block, class/struct/union, function) and (1 << kind) & 0x9 (bits 0, 3 = file, namespace) control whether namespaces, using_declarations, and using_directives lists appear.
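The two bitmask tests can be checked directly. The sketch below uses the scope kind values from the table; the function names are hypothetical and only name the masks, since the exact list each mask gates is described above rather than reproduced here.

```c
#include <assert.h>

enum { SCK_FILE = 0, SCK_BLOCK = 2, SCK_NAMESPACE = 3,
       SCK_CLASS_STRUCT_UNION = 6, SCK_ENUM = 16, SCK_FUNCTION = 17 };

/* Nonzero for block (2), class/struct/union (6), and function (17) scopes. */
int in_mask_20044(unsigned kind) {
    assert(kind < 32);
    return ((1u << kind) & 0x20044u) != 0;
}

/* Nonzero for file (0) and namespace (3) scopes. */
int in_mask_9(unsigned kind) {
    assert(kind < 32);
    return ((1u << kind) & 0x9u) != 0;
}
```

Expanding the constants confirms the bit positions: 0x20044 = (1 << 17) | (1 << 6) | (1 << 2), and 0x9 = (1 << 3) | (1 << 0).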
Dynamic Init Kinds (dik_*)
The dynamic init kind byte lives at offset +48. 9 values:
| Value | Name | Notes |
|---|---|---|
| 0 | dik_none | No initialization |
| 1 | dik_zero | Zero initialization |
| 2 | dik_constant | Constant initializer |
| 3 | dik_expression | Expression initializer |
| 4 | dik_class_result_via_ctor | Class value via constructor call |
| 5 | dik_constructor | Constructor call (routine + args) |
| 6 | dik_nonconstant_aggregate | Non-constant aggregate init |
| 7 | dik_bitwise_copy | Bitwise copy from source |
| 8 | dik_lambda | Lambda initialization |
Common IL Node Header
All primary IL node types (type, variable, field, routine, scope, namespace, template, etc.) share a 96-byte common header copied from a template at xmmword_126F6A0..126F6F0. This header is initialized by init_il_alloc (sub_5EAD80) and contains:
- Source correspondence (source_corresp) block: name, position, parent scope, access specifier, linkage, flags
- The display function display_source_corresp (sub_5EDF40) prints these fields for every entity type
Key source correspondence fields (printed for all entities):
- name and unmangled_name_or_mangled_encoding
- decl_position (line + column)
- name_references list
- is_class_member + access (from off_A6F760: public/protected/private/none)
- parent_scope and enclosing_routine
- name_linkage (from off_E6E040: none/internal/external/C/C++)
- Flags: referenced, needed, is_local_to_function, marked_as_gnu_extension, externalized, maybe_unused, is_deprecated_or_unavailable
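The template-copy idea behind the common header can be sketched as below. The range xmmword_126F6A0..126F6F0 spans six 16-byte slots, which matches the 96-byte size. Everything else here is an assumption made for illustration: the names `header_template` and `init_node_from_template` are hypothetical, and zeroing the remainder of the node is a guess at typical allocator behavior, not something recovered from the binary.

```c
#include <assert.h>
#include <string.h>

enum { COMMON_HEADER_SIZE = 96 };  /* six 16-byte xmmword slots */

/* Stand-in for the template at xmmword_126F6A0..126F6F0. */
unsigned char header_template[COMMON_HEADER_SIZE];

/* Stamps the shared header onto a fresh node, then zeroes the rest
   (the zeroing step is an assumption of this sketch). */
void init_node_from_template(void *node, size_t node_size) {
    assert(node != NULL && node_size >= COMMON_HEADER_SIZE);
    memcpy(node, header_template, COMMON_HEADER_SIZE);
    memset((unsigned char *)node + COMMON_HEADER_SIZE, 0,
           node_size - COMMON_HEADER_SIZE);
}
```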
Initialization and Reset
The IL subsystem initializes in two phases:
One-Time Init (sub_5CF7F0)
Called once at program startup. Validates that 7 name-table arrays end with "last" sentinels:
| Table | Address | Content |
|---|---|---|
| il_entry_kind_names | off_E6E020 | 85 IL entry kind names |
| db_storage_class_names | off_E6CD78 | Storage class enum names |
| db_special_function_kinds | off_E6D228 | Special function kind names |
| db_operator_names | off_E6CD20 | Operator kind names |
| name_linkage_kind_names | off_E6E060 | Linkage kind names |
| decl_modifier_names | off_E6CD88 | Declaration modifier names |
| pragma_ids | off_E6CF38 | Pragma identifier names |
Also validates unsigned_int_kind_of table (byte_E6D1AD == 111 == 'o') and initializes 60+ allocation pools via sub_7A3C00 (pool_init) with element sizes ranging from 1 to 1344 bytes.
Per-TU Init (sub_5CFE20)
Called at the start of each translation unit compilation. Zeroes all pool heads, allocates the constant-sharing hash table (16,312 bytes = 2,039 buckets at qword_126F228), and the character-type hash table (3,240 bytes at qword_126F2F8). Sets sharing mode flags (byte_126E558..126E55A = 3). Tail-calls sub_5EAF00 to reset float constant caches.
Secondary Pool Reset (sub_5D0170)
Resets ~80 transient globals in the 126F680..126F978 range between template instantiation passes. Pure state zeroing, no allocation.
Constant Sharing
IL constants are deduplicated via a 2,039-bucket hash table at qword_126F228. The alloc_shareable_constant function (sub_5D2390) checks constant_is_shareable (sub_5D2210) -- which excludes aggregate constants (kind 10), template parameter constants (kind 12), and string literals when string sharing is disabled (dword_126E1C0).
On a cache hit, the existing constant is relinked to the front of its bucket chain. On a miss, a new 184-byte constant is allocated and inserted. Statistics are tracked: total allocations (qword_126F208), comparisons (qword_126F200), region hits (qword_126F218), global hits (qword_126F220), and new buckets (qword_126F210).
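The hit and miss paths can be illustrated with a reduced model of a move-to-front bucket chain. This is a sketch, not the recovered algorithm: the constant is shrunk to a single integer payload, the hash (value modulo bucket count) is an assumption, and `shared_const`, `const_table`, and `find_or_insert` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { BUCKET_COUNT = 2039 };   /* 2,039 * 8 bytes = 16,312 bytes of heads */

/* Hypothetical reduced constant: one integer payload plus the chain link. */
typedef struct shared_const {
    struct shared_const *next;
    long value;
} shared_const;

typedef struct {
    shared_const *buckets[BUCKET_COUNT];
    unsigned long hits, misses;
} const_table;

/* Looks up `value`; on a hit the entry moves to the head of its chain.
   On a miss, `fresh` is filled in and inserted at the head. */
shared_const *find_or_insert(const_table *t, long value, shared_const *fresh) {
    assert(t != NULL && fresh != NULL);
    size_t b = (size_t)((unsigned long)value % BUCKET_COUNT);  /* assumed hash */
    shared_const **link = &t->buckets[b];
    for (shared_const *c = *link; c != NULL; link = &c->next, c = c->next) {
        if (c->value == value) {
            *link = c->next;            /* unlink from current position */
            c->next = t->buckets[b];    /* relink at the front */
            t->buckets[b] = c;
            t->hits++;
            return c;
        }
    }
    fresh->value = value;
    fresh->next = t->buckets[b];
    t->buckets[b] = fresh;
    t->misses++;
    return fresh;
}
```

Moving a hit to the front keeps recently used constants near the head of the chain, which shortens later searches when the same literal recurs in nearby code.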
CUDA Extensions to IL
NVIDIA adds several CUDA-specific fields to standard EDG IL nodes:
- Routine flags (bytes 182-183): nvvm_intrinsic, global (__global__), device (__device__), host (__host__)
- Variable flags: shared (__shared__), constant (__constant__), device (__device__), managed (__managed__)
- keep_in_il bit (prefix byte bit 7): The mechanism for device/host code separation
- Lambda entries (kinds 0x46, 0x47): Extended lambda wrapper support
These extensions are what make cudafe++ the CUDA-aware C++ frontend rather than a stock EDG compiler.
Function Map
| Address | Function | Source | Notes |
|---|---|---|---|
| sub_5CF7F0 | il_one_time_init | il.c | Validates tables, inits 60+ pools |
| sub_5CFE20 | il_init / il_reset | il.c | Per-TU initialization |
| sub_5D0170 | il_reset_secondary_pools | il.c | Template instantiation reset |
| sub_5D01F0 | il_rebuild_entry_index | il.c | Build entry pointer index |
| sub_5D02F0 | il_invalidate_entry_index | il.c | Clear entry index |
| sub_5D0750 | compare_expressions | il.c | Deep structural equality |
| sub_5D1350 | compare_constants | il.c | Constant comparison (525 lines) |
| sub_5D1FE0 | compare_dynamic_inits | il.c | Dynamic init comparison |
| sub_5D2210 | constant_is_shareable | il.c | Shareability predicate |
| sub_5D2390 | alloc_shareable_constant | il.c | Hash-table dedup allocation |
| sub_5D2F90 | i_copy_expr_tree | il.c | Deep expression tree copy |
| sub_5D3B90 | i_copy_constant_full | il.c | Deep constant copy |
| sub_5D47A0 | i_copy_dynamic_init | il.c | Deep dynamic init copy |
| sub_5E2E80 | set_type_kind | il_alloc.c | Type kind dispatch (22 kinds) |
| sub_5E3D40 | alloc_type | il_alloc.c | 176-byte type node |
| sub_5E4D20 | alloc_variable | il_alloc.c | 232-byte variable node |
| sub_5E4F70 | alloc_field | il_alloc.c | 176-byte field node |
| sub_5E53D0 | alloc_routine | il_alloc.c | 288-byte routine node |
| sub_5E5CA0 | alloc_label | il_alloc.c | 128-byte label node |
| sub_5E5F00 | set_expr_node_kind | il_alloc.c | Expression kind dispatch |
| sub_5E62E0 | alloc_expr_node | il_alloc.c | 72-byte expression node |
| sub_5E6E20 | set_statement_kind | il_alloc.c | Statement kind dispatch |
| sub_5E7060 | alloc_statement | il_alloc.c | 80-byte statement node |
| sub_5E7D80 | alloc_scope | il_alloc.c | 288-byte scope node |
| sub_5E7A70 | alloc_namespace | il_alloc.c | 128-byte namespace node |
| sub_5E8D20 | alloc_template | il_alloc.c | 208-byte template node |
| sub_5E99D0 | dump_il_table_statistics | il_alloc.c | Print allocation stats |
| sub_5EAD80 | init_il_alloc | il_alloc.c | Initialize common header template |
| sub_5F4930 | display_il_entry | il_to_str.c | Main display dispatcher (~1,686 lines) |
| sub_5F76B0 | display_il_header_and_file_scope | il_to_str.c | IL header + region 1 |
| sub_5F7DF0 | display_il_file | il_to_str.c | Top-level display entry point |
| sub_60E4F0 | walk_file_scope_il | il_walk.c | File-scope tree walker |
| sub_610200 | walk_routine_scope_il | il_walk.c | Per-function tree walker |
Cross-References
- IL Allocation -- Arena allocator details, node sizes, free lists
- IL Walking -- Tree traversal framework with 5 callback slots
- keep-in-il -- Device code selection via bit 7
- IL Display -- Debug dump format and output
- IL Comparison & Copy -- Expression/constant comparison and deep copy
- Device/Host Separation -- CUDA IL marking
- Type System -- 22 type kinds in detail
IL Allocation
Every IL node in cudafe++ is allocated through a region-based bump allocator implemented in il_alloc.c (EDG 6.6 source at /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/il_alloc.c). The allocator manages 70+ distinct IL entry types across two memory region categories -- file-scope (persistent for the entire translation unit) and per-function-scope (transient, freed after each function body is processed). Free-lists recycle high-churn node types to reduce region pressure. The allocation subsystem occupies address range 0x5E0600-0x5EAF00 in the binary, roughly 43KB of compiled code covering 100+ functions.
Key Facts
| Property | Value |
|---|---|
| Source file | il_alloc.c (EDG 6.6) |
| Address range | 0x5E0600-0x5EAF00 |
| Core allocator | sub_6B7D60 (region_alloc(region_id, size)) |
| File-scope allocator | sub_5E03D0 (alloc_in_file_scope_region) |
| Dual-region allocator | sub_5E02E0 (alloc_in_region) |
| Scratch-region allocator | sub_5E0460 (alloc_in_scratch_region) |
| Stats dump | sub_5E99D0 (dump_il_table_stats), 340 lines |
| Init function | sub_5EAD80 (init_il_alloc) |
| Reset watermarks | sub_5EAEC0 (reset_region_offsets) |
| Clear free-lists | sub_5EAF00 (clear_free_lists) |
| Node types tracked | 70+ (each with per-type counter) |
| Free-list types | 6 (template_arg, constant_list, expr_node, constant, param_type, source_seq_entry) |
Region-Based Bump Allocator
The core allocation primitive is sub_6B7D60 (region_alloc), a bump allocator that takes a region ID and requested size, and returns a pointer to the allocated block within the region's memory. The caller then writes prefix fields and returns a pointer past the prefix to the node body.
region_alloc Pseudocode
// sub_6B7D60 -- region_alloc(region_id, total_size)
// Returns pointer to start of allocated block within the region.
void* region_alloc(int region_id, int64_t requested_size) {
    // Step 1: Round the request up to an 8-byte boundary (minimum 8 bytes)
    int64_t aligned_size = requested_size;
    if (requested_size == 0) {
        aligned_size = 8;                               // minimum allocation
    } else if (requested_size & 7) {
        aligned_size = (requested_size + 7) & ~7;       // round up to 8
    }
    int64_t check_size = aligned_size + 8;              // capacity check includes an 8-byte margin
    // Step 2: Get current region block
    mem_block_t* block = region_table[region_id];       // qword_126EC88[region_id]
    void* alloc_ptr = block->next_free;                 // block[2] = bump pointer
    // Step 3: Check if current block has enough space
    if (block->end - alloc_ptr < check_size) {
        // Not enough space -- try free-list or allocate new block
        bool is_reuse = block->is_reusable;             // block byte +40
        int64_t block_size;
        if (is_reuse) {
            block_size = 2048;                          // small reuse block
        } else {
            flush_region(region_table[region_id]);      // sub_6B68D0
            block_size = 0x10000;                       // 64KB default
        }
        // Search free-list (qword_1280730) for a suitable block
        block = find_free_block(aligned_size + 56, block_size);
        if (!block) {
            // Allocate fresh block from heap (56 = 48-byte header + 8-byte margin)
            if (block_size < aligned_size + 56)
                block_size = aligned_size + 56;
            block_size = (block_size + 7) & ~7;         // align to 8
            block = malloc(block_size);
            if (!block) fatal_error(4);                 // out of memory
            block->capacity = block_size;
            block->end = (char*)block + block_size;
            block->next_free = (char*)block + 48;       // skip 48-byte header (6 qwords)
        }
        // Link new block into region's block chain
        block->is_reusable = 0;
        alloc_ptr = block->next_free;
        block->next = region_table[region_id];
        region_table[region_id] = block;
    }
    // Step 4: Bump the pointer
    total_allocated += aligned_size;                    // qword_1280700
    block->next_free = (char*)alloc_ptr + aligned_size;
    alignment_waste += aligned_size - requested_size;   // qword_12806F8
    per_region_total[region_id] += aligned_size;        // qword_126EC50[region_id]
    return alloc_ptr;
}
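The pseudocode above can be condensed into a compilable sketch. This is a simplified illustration, not the recovered implementation: the `sketch_` names and the three-field block header are assumptions, there is no block free-list or reuse flag, and out-of-memory returns NULL instead of calling a fatal-error routine. It keeps only the 8-byte rounding, the 8-byte capacity margin, and the block chaining.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical block header; field names are illustrative, not recovered. */
typedef struct sketch_block {
    struct sketch_block *next;  /* older block in the same region */
    char *next_free;            /* bump pointer */
    char *end;                  /* one past the last usable byte */
} sketch_block;

enum { SKETCH_BLOCK_SIZE = 0x10000 };  /* 64KB default, as in the pseudocode */

/* Round a request up to a multiple of 8, with an 8-byte minimum. */
static int64_t sketch_align8(int64_t requested) {
    if (requested == 0) return 8;
    return (requested + 7) & ~(int64_t)7;
}

/* Allocate from *region, chaining a fresh block when the current one is full.
   Returns NULL on out-of-memory. */
static void *sketch_region_alloc(sketch_block **region, int64_t requested) {
    int64_t size = sketch_align8(requested);
    sketch_block *b = *region;
    if (b == NULL || b->end - b->next_free < size + 8) {
        int64_t block_size = SKETCH_BLOCK_SIZE;
        int64_t needed = size + 8 + (int64_t)sizeof(sketch_block);
        if (block_size < needed) block_size = sketch_align8(needed);
        b = malloc((size_t)block_size);
        if (b == NULL) return NULL;
        b->next = *region;                        /* link into the region's chain */
        b->next_free = (char *)b + sizeof(sketch_block);
        b->end = (char *)b + block_size;
        *region = b;
    }
    void *p = b->next_free;
    b->next_free += size;                         /* bump the pointer */
    return p;
}
```

Two consecutive requests of 13 bytes and 1 byte land 16 bytes apart, because the first request is rounded up to 16.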
Region Architecture
dword_126EC90 = file_scope_region_id (region 1, persistent)
dword_126EB40 = current_region_id (file-scope or per-function)
dword_126F690 = file-scope base offset (typically 0)
dword_126F694 = file-scope prefix size (24 normal, 16 TU-copy)
dword_126F688 = function-scope base offset
dword_126F68C = function-scope prefix size (8)
qword_126EC88 = region_table (region index -> memory block)
qword_126EB90 = scope_table (region index -> scope entry)
dword_126EC80 = total_region_count
Region selection uses a simple identity test: when dword_126EB40 == dword_126EC90, the current scope is file-scope and nodes go into region 1. When the values differ, the current scope is a function body, and nodes go into the current function's region. Some allocators force a specific behavior:
- File-scope only: alloc_in_file_scope_region (sub_5E03D0) always uses dword_126EC90
- Dual-region: alloc_in_region (sub_5E02E0) branches on the identity test
- Scratch region: alloc_in_scratch_region (sub_5E0460) temporarily sets TU-copy mode, allocates from region 1, and restores state
- Same-region-as: Used by alloc_class_list_entry (sub_5E2410) and alloc_based_type_list_member (sub_5E29C0) -- inspects the prefix byte of an existing node to determine which region it lives in, then allocates the new node in that same region
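The identity test and the first two policies in the list can be sketched as follows. The struct, enum, and function names are hypothetical; only the comparison of the two region IDs comes from the analysis above, and the scratch and same-region-as policies are left out.

```c
#include <stdbool.h>

/* Hypothetical mirror of the two globals used by the identity test. */
typedef struct {
    int file_scope_region_id;  /* models dword_126EC90 */
    int current_region_id;     /* models dword_126EB40 */
} sketch_region_state;

typedef enum {
    SKETCH_FILE_SCOPE_ONLY,    /* behaves like alloc_in_file_scope_region */
    SKETCH_DUAL_REGION         /* behaves like alloc_in_region */
} sketch_policy;

/* True when the current scope is file scope. */
static bool sketch_in_file_scope(const sketch_region_state *s) {
    return s->current_region_id == s->file_scope_region_id;
}

/* Pick the region a node would be placed in under the given policy. */
static int sketch_select_region(const sketch_region_state *s, sketch_policy p) {
    if (p == SKETCH_FILE_SCOPE_ONLY || sketch_in_file_scope(s))
        return s->file_scope_region_id;
    return s->current_region_id;
}
```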
Allocation Protocol
Every IL node allocator follows a consistent protocol. The prefix size varies by mode: 24 bytes for file-scope (normal), 16 bytes for file-scope (TU-copy mode), and 8 bytes for function-scope.
File-scope allocation (normal mode, dword_126F694 = 24):
1. if (dword_126EFC8) trace_enter(5, "alloc_<name>")
2. raw = region_alloc(file_scope_region, entry_size + 24)
3. ptr = raw + dword_126F690          // base offset (typically 0)
4. *(ptr+0) = 0                       // zero TU copy address slot (8 bytes)
   ++qword_126F7C0                    // TU copy addr counter
5. *(ptr+8) = 0                       // zero the next-in-list pointer (8 bytes)
   ++qword_126F750                    // orphan pointer counter
6. ++qword_126F7D8                    // IL entry prefix counter
7. *(ptr+16) = flags_byte:            // prefix flags (8-byte qword, flags in low byte)
   bit 0 = 1                          // allocated
   bit 1 = 1                          // file_scope (not TU-copy)
   bit 3 = dword_126E5FC & 1          // language flag (C++ vs C)
8. node = ptr + 24                    // skip 24-byte prefix
9. ++qword_126F8xx                    // per-type counter
10. initialize type-specific fields
11. copy 96-byte common header from template globals
12. if (dword_126EFC8) trace_leave()
13. return node
Function-scope allocation (dword_126F68C = 8):
1. raw = region_alloc(current_region, entry_size + 8)
2. ptr = raw + dword_126F688          // function-scope base offset
3. *(ptr+0) = flags_byte:             // prefix flags (8-byte qword)
   bit 1 = !dword_106BA08             // file_scope flag
   bit 3 = dword_126E5FC & 1          // language flag
   (no TU copy slot, no next-in-list slot)
4. node = ptr + 8                     // skip 8-byte prefix
5. return node
The returned pointer skips the prefix, so all field offsets documented in the IL are relative to this returned pointer. The prefix flags byte is always at node - 8 regardless of allocation mode. The next-in-list link (file-scope only) is at node - 16, and the TU-copy address (normal file-scope only) is at node - 24.
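Because every prefix field sits at a fixed negative offset from the returned node pointer, consumers can read it without knowing the allocation mode. The sketch below is an illustration under assumptions: the `sketch_` helpers are hypothetical, only the normal file-scope layout is built, and the flags are written to the low byte of the qword (little-endian, as on x86-64).

```c
#include <stdint.h>
#include <string.h>

/* Offsets relative to the node pointer returned by an allocator. */
enum {
    SKETCH_FLAGS_OFF   = -8,   /* present in every mode */
    SKETCH_NEXT_OFF    = -16,  /* file-scope only */
    SKETCH_TU_COPY_OFF = -24   /* normal file-scope only */
};

/* Read the prefix flags byte of a node. */
static uint8_t sketch_prefix_flags(const void *node) {
    return *((const uint8_t *)node + SKETCH_FLAGS_OFF);
}

/* Read the next-in-list link of a file-scope node. */
static void *sketch_next_in_list(const void *node) {
    void *next;
    memcpy(&next, (const char *)node + SKETCH_NEXT_OFF, sizeof next);
    return next;
}

/* Lay out a normal file-scope prefix at raw and return the node pointer.
   raw must be 8-byte aligned and hold at least 24 bytes plus the body. */
static void *sketch_init_file_scope_prefix(void *raw, int cpp_mode) {
    memset(raw, 0, 24);                       /* TU-copy slot, next link, flags qword */
    ((uint8_t *)raw)[16] = (uint8_t)(0x01     /* bit 0: allocated */
                         | 0x02               /* bit 1: file_scope */
                         | (cpp_mode ? 0x08 : 0));  /* bit 3: language flag */
    return (char *)raw + 24;                  /* skip the 24-byte prefix */
}
```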
Common IL Header Template
Every IL node contains a 96-byte common header, copied from six __m128i template globals initialized by init_il_alloc (sub_5EAD80). A seventh global, qword_126F700, supplies the 8 bytes that follow the header at +96:
xmmword_126F6A0 [+0..+15] 16 bytes, zeroed
xmmword_126F6B0 [+16..+31] 16 bytes (high qword zeroed)
xmmword_126F6C0 [+32..+47] 16 bytes, zeroed
xmmword_126F6D0 [+48..+63] 16 bytes, zeroed
xmmword_126F6E0 [+64..+79] 16 bytes (from qword_126EFB8 = source position)
xmmword_126F6F0 [+80..+95] 16 bytes (low word = 4, high qword = 0)
qword_126F700 [+96..+103] 8 bytes (current source file reference)
This template captures the current source position and language state at the moment of allocation. The template is refreshed when the parser advances through source positions, so each newly-allocated node carries the file/line/column of the construct it represents.
IL Entry Prefix
Every IL entry has a variable-size raw prefix preceding the node body. The prefix is 24 bytes in normal file-scope mode, 16 bytes in TU-copy file-scope mode, and 8 bytes in function-scope mode.
Normal file-scope (24-byte prefix, ptr = raw + 24):
+0 [8 bytes] TU copy ptr - 24 translation_unit_copy_address
+8 [8 bytes] next ptr - 16 next_in_list link
+16 [8 bytes] flags ptr - 8 prefix flags byte (+ 7 padding)
+24 [...] body ptr + 0 node-specific fields
TU-copy file-scope (16-byte prefix, ptr = raw + 16):
+0 [8 bytes] next ptr - 16 next_in_list link
+8 [8 bytes] flags ptr - 8 prefix flags byte (+ 7 padding)
+16 [...] body ptr + 0 node-specific fields
Function-scope (8-byte prefix, ptr = raw + 8):
+0 [8 bytes] flags ptr - 8 prefix flags byte (+ 7 padding)
+8 [...] body ptr + 0 node-specific fields
Prefix Flags Byte
| Bit | Mask | Name | Set When |
|---|---|---|---|
| 0 | 0x01 | allocated | Always set on fresh allocation |
| 1 | 0x02 | file_scope | !dword_106BA08 (not in TU-copy mode) |
| 2 | 0x04 | is_in_secondary_il | Entry from secondary translation unit |
| 3 | 0x08 | language_flag | dword_126E5FC & 1 (C++ mode indicator) |
| 7 | 0x80 | keep_in_il | Set by device code marking pass |
Bit 7 is the CUDA-critical keep_in_il flag used to select device-relevant declarations. See Keep-in-IL for the marking algorithm. The flags byte is always at entry - 8 regardless of allocation mode, and the sign-bit position allows a fast test: *(signed char*)(entry - 8) < 0 means "keep this entry."
Some allocators preserve bit 7 across free-list recycling (notably alloc_local_constant at sub_5E1A80 and alloc_derivation_step at sub_5E1EE0), ensuring that the keep-in-il status is not lost when a node is reclaimed and reissued.
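The flag masks and the sign-bit test translate directly into C. The enum and function names below are hypothetical; the masks come from the table above. The second helper illustrates one way a recycling allocator could carry bit 7 across reuse, and is an assumption about the mechanism, not recovered code.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    SKETCH_PF_ALLOCATED       = 0x01,
    SKETCH_PF_FILE_SCOPE      = 0x02,
    SKETCH_PF_IN_SECONDARY_IL = 0x04,
    SKETCH_PF_LANGUAGE        = 0x08,
    SKETCH_PF_KEEP_IN_IL      = 0x80
};

/* Sign-bit form of the keep test: bit 7 set means the byte is negative
   when read as a signed char. */
static bool sketch_keep_in_il(const void *entry) {
    return *((const signed char *)entry - 8) < 0;
}

/* Compute flags for a recycled node: take the fresh flags but carry
   keep_in_il over from the old flags. */
static uint8_t sketch_recycled_flags(uint8_t old_flags, uint8_t fresh_flags) {
    return (uint8_t)((fresh_flags & ~SKETCH_PF_KEEP_IN_IL) |
                     (old_flags & SKETCH_PF_KEEP_IN_IL));
}
```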
Complete Node Size Table
The stats dump function sub_5E99D0 prints the allocation table with exact names and per-unit sizes for all 70+ IL entry types. Sizes listed are the allocation unit in bytes -- the values passed to region_alloc.
Primary IL Nodes
| IL Entry Type | Size (bytes) | Counter Global | Allocator | Region |
|---|---|---|---|---|
| type | 176 | qword_126F8E0 | sub_5E3D40 | file-scope |
| variable | 232 | qword_126F8C0 | sub_5E4D20 | dual (kind-dependent) |
| routine | 288 | qword_126F8A8 | sub_5E53D0 | file-scope |
| expr_node | 72 | qword_126F880 | sub_5E62E0 | dual + free-list |
| statement | 80 | qword_126F818 | sub_5E7060 | dual |
| scope | 288 | qword_126F7E8 | sub_5E7D80 | dual |
| constant | 184 | qword_126F968 | sub_5E11C0 | dual |
| field | 176 | qword_126F8B0 | sub_5E4F70 | file-scope |
| label | 128 | qword_126F888 | sub_5E5CA0 | function-scope only |
| asm_entry | 152 | qword_126F890 | sub_5E57B0 | dual |
| namespace | 128 | qword_126F7F8 | sub_5E7A70 | file-scope |
| template | 208 | qword_126F720 | sub_5E8D20 | file-scope |
| template_parameters | 136 | qword_126F728 | sub_5E8A90 | file-scope |
| template_arg | 64 | qword_126F900 | sub_5E2190 | file-scope + free-list |
Type Supplements
Auxiliary structures allocated alongside type nodes by set_type_kind (sub_5E2E80):
| Supplement | Size | Counter | For Type Kinds |
|---|---|---|---|
| integer_type_supplement | 32 | qword_126F8E8 | tk_integer (2) |
| routine_type_supplement | 64 | qword_126F958 | tk_routine (7) |
| class_type_supplement | 208 | qword_126F948 | tk_class (9), tk_struct (10), tk_union (11) |
| typeref_type_supplement | 56 | qword_126F8F0 | tk_typeref (12) |
| templ_param_supplement | 40 | qword_126F8F8 | tk_template_param (14) |
Expression Supplements
Allocated inline by set_expr_node_kind (sub_5E5F00) for expression kinds that need extra storage:
| Supplement | Size | Counter | For Expression Kind |
|---|---|---|---|
| new/delete supplement | 56 | qword_126F868 | enk_new_delete (7) |
| throw supplement | 24 | qword_126F860 | enk_throw (8) |
| condition supplement | 32 | qword_126F858 | enk_condition (9) |
Statement Supplements
Allocated inline by set_statement_kind (sub_5E6E20):
| Supplement | Size | Counter | For Statement Kind |
|---|---|---|---|
| constexpr_if | 24 | qword_126F798 | stmk_constexpr_if (2) |
| block | 32 | qword_126F830 | stmk_block (11) |
| for_loop | 24 | qword_126F820 | stmk_for (13) |
| try supplement | 32 | qword_126F838 | stmk_try_block (19) |
| switch_stmt_descr | 24 | qword_126F848 | stmk_switch (16) |
| coroutine_descr | 128 | qword_126F828 | stmk_coroutine (9) |
Linked-List Entry Types
| Entry Type | Size | Counter | Notes |
|---|---|---|---|
| class_list_entry | 16 | qword_126F940 | Region-aware (sub_5E2410) or simple (sub_5E26A0) |
| routine_list_entry | 16 | qword_126F938 | sub_5E2750 |
| variable_list_entry | 16 | qword_126F930 | sub_5E2800 |
| constant_list_entry | 16 | qword_126F928 | Free-list recycled (sub_5E28B0) |
| IL_entity_list_entry | 24 | qword_126F7B8 | sub_5E94F0 |
| based_type_list_member | 24 | qword_126F950 | Region-aware (sub_5E29C0) |
Inheritance and Virtual Dispatch
| Entry Type | Size | Counter | Allocator |
|---|---|---|---|
| base_class | 112 | qword_126F908 | sub_5E2300 |
| base_class_derivation | 32 | qword_126F910 | sub_5E1FD0 |
| derivation_step | 24 | qword_126F918 | sub_5E1EE0 |
| overriding_virtual_func | 40 | qword_126F920 | sub_5E20D0 |
Variable and Routine Auxiliaries
| Entry Type | Size | Counter | Allocator |
|---|---|---|---|
| dynamic_init | 104 | qword_126F8D8 | sub_5E4650 |
| local_static_var_init | 40 | qword_126F8D0 | sub_5E4870 |
| vla_dimension | 48 | qword_126F8C8 | sub_5E49C0 |
| variable_template_info | 24 | qword_126F8B8 | sub_5E4C70 |
| exception_specification | 16 | qword_126F8A0 | sub_5E5130 |
| exception_spec_type | 24 | qword_126F898 | sub_5E51D0 |
| param_type | 80 | qword_126F960 | sub_5E1D40 (free-list recycled) |
| constructor_init | 48 | qword_126F810 | sub_5E7410 |
| handler | 40 | qword_126F840 | sub_5E6B90 |
| switch_case_entry | 56 | qword_126F850 | sub_5E6A60 |
Scope and Source Tracking
| Entry Type | Size | Counter | Allocator |
|---|---|---|---|
| source_sequence_entry | 32 | qword_126F780 | sub_5E8300 (free-list recycled) |
| src-seq_secondary_decl | 56 | qword_126F778 | sub_5E8480 |
| src-seq_end_of_construct | 24 | qword_126F770 | sub_5E85B0 |
| src-seq_sublist | 24 | qword_126F768 | sub_5E86C0 |
| local-scope-ref | 32 | qword_126F7E0 | sub_5E80A0 |
| object_lifetime | 64 | qword_126F800 | sub_5E7800 (free-list recycled) |
| static_assertion | 24 | qword_126F788 | sub_5E81B0 |
Templates, Names, and Pragmas
| Entry Type | Size | Counter | Allocator |
|---|---|---|---|
| template_decl | 40 | qword_126F738 | sub_5E8C60 |
| requires_clause | 16 | qword_126F730 | sub_5E8BB0 |
| name_reference | 40 | qword_126F718 | sub_5E90B0 |
| name_qualifier | 40 | qword_126F710 | sub_5E8FC0 |
| element_position | 24 | qword_126F708 | sub_5E8EB0 |
| pragma | 64 | qword_126F808 | sub_5E7570 |
| using-decl | 80 | qword_126F7F0 | sub_5E7BF0 |
| instantiation_directive | 40 | qword_126F758 | sub_5E8770 |
| linkage_spec_block | 32 | qword_126F760 | sub_5E8830 |
| hidden_name | 32 | qword_126F740 | sub_5E8980 |
Attributes and Miscellaneous
| Entry Type | Size | Counter | Allocator |
|---|---|---|---|
| attribute | 72 | qword_126F7B0 | sub_5E9600 |
| attribute_arg | 40 | qword_126F7A8 | sub_5E96F0 |
| attribute_group | 8 | qword_126F7A0 | sub_5E97C0 |
| source_file | 80 | qword_126F970 | sub_5E08D0 |
| seq_number_lookup_entry | 32 | qword_126F7C8 | sub_5E9170 |
| subobject_path | 24 | qword_126F790 | sub_5E0A30 |
| orphaned_list_header | 56 | qword_126F748 | sub_5E0800 |
Bookkeeping Counters (No Separate Allocator)
| Counter | Size | Global | Meaning |
|---|---|---|---|
| string_literal_text | 1 | qword_126F7D0 | Raw string literal bytes (accumulated) |
| fs_orphan_pointers | 8 | qword_126F750 | File-scope orphan pointer slots |
| trans_unit_copy_addr | 8 | qword_126F7C0 | TU-copy address slots written |
| IL_entry_prefix | 4 | qword_126F7D8 | Total prefix flags bytes written |
Free-List Recycling
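Seven node types use free-list recycling to avoid allocating fresh memory for high-churn entries: five through global free-list heads and two (source_seq_entry, object_lifetime) through per-scope lists. Each free-list is a singly-linked list with the link pointer embedded in the node itself.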
Active Free-Lists
| Node Type | Free-List Head | Link Offset | Alloc Function | Free Function |
|---|---|---|---|---|
| template_arg (64B) | qword_126F670 | +0 | sub_5E2190 | sub_5E22D0 (free_template_arg_list) |
| constant_list_entry (16B) | qword_126F668 | +0 | sub_5E28B0 | sub_5E2990 (return_constant_list_entries_to_free_list) |
| expr_node (72B) | qword_126E4B0 | +64 | sub_5E62E0 | (kind set to 36 = ek_reclaimed) |
| constant (184B) | qword_126E4B8 | +104 | sub_5E1A80 (alloc_local_constant) | sub_5E1B70 (free_local_constant) |
| param_type (80B) | qword_126F678 | +0 | sub_5E1D40 (alloc_param_type) | sub_5E1EB0 (free_param_type_list) |
| source_seq_entry (32B) | scope+328 | -- | sub_5E8300 | (per-scope recycling) |
| object_lifetime (64B) | scope+512 | +56 | sub_5E7800 | (per-scope recycling) |
Expression Node Recycling
Expression nodes use the most sophisticated free-list protocol. The allocator (sub_5E62E0) checks qword_126E4B0 before allocating fresh memory:
// Pseudocode for alloc_expr_node
if (expr_free_list != NULL) {
    node = expr_free_list;
    assert(node->kind == 36);          // ek_reclaimed sentinel
    expr_free_list = *(node + 64);     // link at offset +64
    // reuse node (preserves bit 7 of prefix)
} else {
    node = region_alloc(region_id, 72);
    // full prefix initialization
}
set_expr_node_kind(node, requested_kind);
++total_expr_count;
++fs_expr_count;
update_rescan_counter(&rescan_expr_count);
When expression nodes are freed, their kind byte at offset +24 is set to 36 (ek_reclaimed), and their link pointer at offset +64 chains them into the free list. The stats dump walks this free list to count available recycled nodes, printing them as "(avail. fs expr node)".
A source-tracking variant alloc_expr_node_with_source_tracking (sub_5E66B0) wraps the allocation in save_source_correspondence/restore_source_correspondence calls (sub_5B8910/sub_5B89C0). For non-same-region allocations, this variant uses alloc_permanent(72) instead of the dual-region allocator because the free list cannot safely cross region boundaries.
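The free-list protocol for expression nodes can be sketched in a few lines of C. This is a minimal illustration under assumptions: `calloc` stands in for the region allocator, the prefix and counters are omitted, the `sketch_` names are hypothetical, and a corrupted list returns NULL where the real code asserts. The offsets (+24 for the kind byte, +64 for the link) and the sentinel value 36 come from the description above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum {
    SKETCH_EXPR_SIZE     = 72,
    SKETCH_EXPR_KIND_OFF = 24,
    SKETCH_EXPR_LINK_OFF = 64,
    SKETCH_EK_RECLAIMED  = 36
};

static void *sketch_expr_free_list = NULL;  /* models qword_126E4B0 */

/* Mark the node reclaimed and push it onto the free list. */
static void sketch_free_expr(void *node) {
    ((uint8_t *)node)[SKETCH_EXPR_KIND_OFF] = SKETCH_EK_RECLAIMED;
    memcpy((char *)node + SKETCH_EXPR_LINK_OFF, &sketch_expr_free_list,
           sizeof(void *));
    sketch_expr_free_list = node;
}

/* Pop a recycled node if one is available, otherwise allocate fresh memory. */
static void *sketch_alloc_expr(uint8_t kind) {
    void *node = sketch_expr_free_list;
    if (node != NULL) {
        if (((uint8_t *)node)[SKETCH_EXPR_KIND_OFF] != SKETCH_EK_RECLAIMED)
            return NULL;  /* sentinel missing: list is corrupted */
        memcpy(&sketch_expr_free_list, (char *)node + SKETCH_EXPR_LINK_OFF,
               sizeof(void *));
    } else {
        node = calloc(1, SKETCH_EXPR_SIZE);  /* stand-in for region_alloc */
        if (node == NULL) return NULL;
    }
    ((uint8_t *)node)[SKETCH_EXPR_KIND_OFF] = kind;
    return node;
}
```

The list is last-in, first-out: the most recently freed node is the first one reissued.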
Constant Recycling
Local constants use a separate free-list (qword_126E4B8) with the link at offset +104. The free_local_constant function (sub_5E1B70) validates that the node is in use (bit 0 of the prefix) before returning it to the free list. The check_local_constant_use assertion function (sub_5E1D00) verifies qword_126F680 == 0 at function boundaries, ensuring all borrowed constants have been returned.
The duplicate_constant_to_other_region function (sub_5E1BB0) handles the case where a constant must be copied from one region to another. When source and destination are the same region, it works in-place. When they differ, it allocates 184 bytes in the target region, copies contents via sub_5BA500, frees the original to the free list, and applies post-copy fixups (sub_5B9DE0, sub_5D39A0).
set_type_kind -- Type Kind Dispatch
set_type_kind (sub_5E2E80, confirmed at il_alloc.c:2334) writes the type kind byte at offset +132 of the type node and allocates any required type supplement. It handles 22 type kinds (0x00-0x15):
| Kind | Name | Action |
|---|---|---|
| 0 | tk_error | No-op |
| 1 | tk_void | No-op |
| 2 | tk_integer | Allocates 32-byte integer_type_supplement, sets default access=5 |
| 3 | tk_float | Sets format byte = 2 |
| 4 | tk_complex | Sets format byte = 2 |
| 5 | tk_imaginary | Sets format byte = 2 |
| 6 | tk_pointer | Zeroes 2 payload fields |
| 7 | tk_routine | Allocates 64-byte routine_type_supplement, initializes calling convention and parameter bitfields |
| 8 | tk_array | Zeroes size and flags fields |
| 9 | tk_class | Allocates 208-byte class_type_supplement, stores kind at +100 |
| 10 | tk_struct | Same as class |
| 11 | tk_union | Same as class |
| 12 | tk_typeref | Allocates 56-byte typeref_type_supplement |
| 13 | tk_ptr_to_member | Zeroes fields |
| 14 | tk_template_param | Allocates 40-byte templ_param_supplement |
| 15 | tk_vector | Zeroes fields |
| 16 | tk_scalable_vector | Zeroes fields |
| 17-21 | Pack/special types | No-op or zeroes |
| default | -- | internal_error("set_type_kind: bad type kind") |
The class type supplement (208 bytes) is the largest supplement. init_class_type_supplement_fields (sub_5E2D70) initializes it with defaults: access=1, virtual_function_table_index=-1, and zeroed member lists. The companion function init_class_type_supplement (sub_5E2C70) accesses the supplement through the type node's pointer at offset +152.
A combined function init_type_fields_and_set_kind (sub_5E3590, 317 lines) copies the 96-byte template header and then runs the same switch as set_type_kind inline. This is used by alloc_type (sub_5E3D40) to avoid a separate function call.
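The supplement-allocating cases of the dispatch reduce to a lookup from type kind to supplement size. The function below is a hypothetical helper that encodes only the sizes listed in the table above; it does not model field initialization, and returning -1 for an out-of-range kind stands in for the internal_error call.

```c
/* Supplement size in bytes for a type kind, 0 when the kind has no
   supplement, or -1 for kinds outside the documented range 0..21. */
static int sketch_type_supplement_size(int kind) {
    switch (kind) {
    case 2:  return 32;    /* tk_integer: integer_type_supplement */
    case 7:  return 64;    /* tk_routine: routine_type_supplement */
    case 9:                /* tk_class */
    case 10:               /* tk_struct */
    case 11: return 208;   /* tk_union: class_type_supplement */
    case 12: return 56;    /* tk_typeref: typeref_type_supplement */
    case 14: return 40;    /* tk_template_param: templ_param_supplement */
    default:
        return (kind >= 0 && kind <= 21) ? 0 : -1;
    }
}
```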
set_expr_node_kind -- Expression Kind Dispatch
set_expr_node_kind (sub_5E5F00, confirmed at il_alloc.c:3932) writes the expression kind byte at offset +24 and zeroes offset +8. It handles 36 expression kinds (0-35):
| Kind | Name | Action |
|---|---|---|
| 0 | enk_error | No-op |
| 1 | enk_operation | Sets operation bytes (0x78=120, 0x15=21, 0, 0), zeroes 2 qwords |
| 2-6 | enk_constant..enk_lambda | Zeroes 2 qword operand fields |
| 7 | enk_new_delete | Allocates 56-byte supplement via permanent alloc |
| 8 | enk_throw | Allocates 24-byte supplement |
| 9 | enk_condition | Allocates 32-byte supplement |
| 10 | enk_object_lifetime | Zeroes 2 qwords |
| 11,25,32 | Address-of variants | 1 qword + flag |
| 12-15 | Cast variants | Sets word=1, 1 qword |
| 16 | enk_address_of_ellipsis | No-op |
| 17,18,22,23,29,33,35 | Simple operand | 1 qword |
| 19 | enk_routine | 3 qwords |
| 20 | enk_type_operand | 2 qwords |
| 21 | enk_builtin_operation | Sets byte=117 (0x75), 1 qword |
| 24,26,27,30,31 | Complex operand | 2 qwords |
| 28 | enk_fold_expression | 1 qword + 1 dword |
| 34 | enk_const_eval_deferred | 1 qword + 1 dword |
| default | -- | internal_error("set_expr_node_kind: bad kind") |
The reinit_expr_node_kind function (sub_5E60E0) performs the same dispatch but additionally resets header fields (flag bits and source position from qword_126EFB8) before the kind switch. This is used when an existing expression node is repurposed without reallocation.
set_statement_kind -- Statement Kind Dispatch
set_statement_kind (sub_5E6E20, confirmed at il_alloc.c:4513) writes the statement kind byte at offset +32 and zeroes offset +40. It handles 26 statement kinds (0x00-0x19):
| Kind | Name | Supplement |
|---|---|---|
| 0 | stmk_expr | 1 qword (expression pointer) |
| 1 | stmk_if | 2 qwords (condition + body) |
| 2 | stmk_constexpr_if | Allocates 24 bytes |
| 3,4 | stmk_if_consteval | 2 qwords |
| 5 | stmk_while | 1 qword |
| 6,7 | stmk_goto/stmk_label | 2 qwords |
| 8 | stmk_return | 1 qword |
| 9 | stmk_coroutine | 1 qword (links to 128-byte coroutine_descr) |
| 10,23,24,25 | Various | No-op |
| 11 | stmk_block | Allocates 32 bytes, stores source pos, sets priority |
| 12 | stmk_end_test_while | 1 qword |
| 13 | stmk_for | Allocates 24 bytes |
| 14 | stmk_range_based_for | 2 qwords |
| 15 | stmk_switch_case | 2 qwords |
| 16 | stmk_switch | Allocates 24 bytes |
| 17 | stmk_init | 1 qword |
| 18 | stmk_asm | 1 qword + flag |
| 19 | stmk_try_block | Allocates 32 bytes |
| 20 | stmk_decl | 1 qword |
| 21,22 | VLA statements | 1 qword |
| default | -- | internal_error("set_statement_kind: bad kind") |
set_constant_kind -- Constant Kind Dispatch
set_constant_kind (sub_5E0C60, confirmed at il_alloc.c:952) writes the constant kind byte at offset +148 and initializes the variant-specific union fields. 16 constant kinds (0-15):
| Kind | Name | Action |
|---|---|---|
| 0 | ck_error | Zeroes variant fields |
| 1 | ck_integer | Calls init_target_int (sub_461260) |
| 2 | ck_string | Zeroes string fields |
| 3 | ck_float | Zeroes float fields |
| 4 | ck_address | Allocates 32-byte sub-node in file-scope region |
| 5 | ck_complex | Zeroes complex fields |
| 6 | ck_imaginary | Zeroes imaginary fields |
| 7 | ck_ptr_to_member | Zeroes 2 fields |
| 8 | ck_label_difference | Zeroes 2 fields |
| 9 | ck_dynamic_init | Zeroes |
| 10 | ck_aggregate | Zeroes aggregate list head |
| 11 | ck_init_repeat | Zeroes repeat fields |
| 12 | ck_template_param | Zeroes, dispatches to set_template_param_constant_kind |
| 13 | ck_designator | Zeroes |
| 14 | ck_void | Zeroes |
| 15 | ck_reflection | Zeroes |
| default | -- | internal_error("set_constant_kind: bad kind") |
The template parameter constant kind has its own sub-dispatch (sub_5E0B40, il_alloc.c:768) handling 14 sub-kinds (tpck_*), each zeroing variant fields at offsets +160, +168, +176. It validates the parent constant kind is 12 (ck_template_param) before proceeding.
Additional Kind Dispatchers
set_routine_special_kind
sub_5E5280 (confirmed at il_alloc.c:3065) sets the routine special kind byte at offset +166. 8 values (0-7):
| Kind | Action |
|---|---|
| 0 | Sets word at +168 to 0 |
| 1-4 | No-op |
| 5 | Zeroes byte at +168 |
| 6-7 | Zeroes qword at +168 |
| default | internal_error("set_routine_special_kind: bad kind") |
set_dynamic_init_kind
sub_5E45C0 (confirmed at il_alloc.c:2506) sets the dynamic init kind at offset +48. 10 values (0-9) controlling what fields are initialized in the dynamic initialization variant union.
Statistics Dump
dump_il_table_stats (sub_5E99D0) prints a formatted table of all IL allocation counters. It is invoked when tracing is enabled or on explicit request. The output format:
IL table use:
Table Number Each Total
----- ------ ---- -----
source file 42 80 3360
constant 1847 184 339848
type 923 176 162448
variable 412 232 95584
routine 287 288 82656
expr node 12847 72 924984
statement 5923 80 473840
scope 312 288 89856
...
Total 2172576
The function iterates all 70+ counters, multiplies count by per-unit size, accumulates a running total, and adds the passed argument a1 (typically the raw region overhead) for the final sum. It also walks the expr_node free list (qword_126E4B0) to count available recycled nodes, printing them separately as "(avail. fs expr node)".
The counter globals are contiguous in BSS from qword_126F680 through qword_126F970, with 8-byte spacing (qword counters). The full ordered list of counters is documented in the Complete Node Size Table above.
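The accumulation performed by the stats dump amounts to a sum of count times per-unit size, plus the overhead argument. The struct and function names below are hypothetical, and the real function reads its counters from globals, not from an array.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of the stats table: entry name, number allocated, bytes each. */
typedef struct {
    const char *name;
    uint64_t count;
    uint64_t each;
} sketch_il_counter;

/* Sum count * each over all rows, starting from the caller-supplied overhead. */
static uint64_t sketch_il_table_total(const sketch_il_counter *rows, size_t n,
                                      uint64_t overhead) {
    uint64_t total = overhead;
    for (size_t i = 0; i < n; ++i)
        total += rows[i].count * rows[i].each;
    return total;
}
```

Feeding in the eight rows shown in the sample output with zero overhead yields 2172576, the total printed in that sample.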
Initialization and Reset
init_il_alloc (sub_5EAD80)
Called once at compiler startup. Responsibilities:
- Zeroes the 96-byte common header template (xmmword_126F6A0-xmmword_126F6F0)
- Sets the source position portion of the template from qword_126EFB8
- Computes the language mode byte: byte_126E5F8 = (dword_126EFB4 != 2) + 2 (C++ mode detection)
- Registers 6 allocator state variables with sub_7A3C00 (saveable state for region offset save/restore across compilation phases)
- Optionally calls sub_6F5D00 if dword_106BF18 is set (debug initialization)
reset_region_offsets (sub_5EAEC0)
Resets the bump allocator watermarks. Called at region boundaries:
dword_126F690 = 0;            // base offset reset
if (dword_106BA08) {          // TU-copy mode
    dword_126F68C = 8;        // function-scope watermark
    dword_126F688 = 0;        // function-scope base
    dword_126F694 = 16;       // file-scope watermark
} else {
    dword_126F694 = 24;       // file-scope watermark (extra 8 for TU copy addr)
}
The different initial watermark values (16 vs 24) reflect the prefix size in each mode: normal mode uses a 24-byte prefix (8 TU-copy + 8 next-link + 8 flags), while TU-copy mode uses a 16-byte prefix (8 next-link + 8 flags, no TU-copy slot). Function-scope allocations use dword_126F68C = 8 (8-byte prefix: flags only).
clear_free_lists (sub_5EAF00)
Zeroes all 5 global free-list heads:
qword_126F678 = 0; // param_type free list
qword_126F670 = 0; // template_arg free list
qword_126E4B8 = 0; // local_constant free list
qword_126E4B0 = 0; // expr_node free list
qword_126F668 = 0; // constant_list_entry free list
Called at function-scope exit to prevent dangling pointers into freed regions.
String Allocation
Two specialized allocators handle string storage in regions:
copy_string_to_region (sub_5E0600, il_alloc.c:548)
char* copy_string_to_region(int region_id, const char* str) {
    size_t len = strlen(str);
    char* buf;
    if (region_id == 0)
        buf = heap_alloc(len + 1);                       // general heap
    else if (region_id == file_scope_region)
        buf = region_alloc(file_scope_region, len + 1);  // file-scope region
    else if (region_id == -1)
        buf = persistent_alloc(len + 1);                 // persistent heap
    else
        internal_error("copy_string_to_region");
    return strcpy(buf, str);
}
copy_string_of_length_to_region (sub_5E0700, il_alloc.c:572)
Same three-way dispatch but takes an explicit length parameter and uses strncpy with explicit null termination: result[len] = 0.
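The explicit-length copy can be illustrated as follows. This is a sketch under assumptions: `malloc` stands in for whichever allocator the three-way dispatch selects, and the function name is hypothetical. The explicit terminator matters because `strncpy` does not write one when the source is at least `len` bytes long.

```c
#include <stdlib.h>
#include <string.h>

/* Copy the first len bytes of str into fresh storage and null-terminate.
   Returns NULL if the allocation fails. */
static char *sketch_copy_string_of_length(const char *str, size_t len) {
    char *result = malloc(len + 1);  /* stand-in for the region allocator */
    if (result == NULL) return NULL;
    strncpy(result, str, len);       /* copies at most len bytes, no terminator guaranteed */
    result[len] = 0;                 /* explicit null termination */
    return result;
}
```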
Special Allocation Patterns
Labels -- Function-Scope Assertion
alloc_label (sub_5E5CA0) asserts that dword_126EB40 != dword_126EC90 (must be in function scope). Labels cannot exist at file scope -- they are always allocated in a function's region:
assert(current_region != file_scope_region); // il_alloc.c:3588
Variables -- Kind-Dependent Region
alloc_variable (sub_5E4D20) uses the variable's linkage kind to select the allocation strategy: when kind > 2, it goes through the dual-region allocator (sub_5E02E0), so the node lands in the current function's region when the allocation happens inside a function body. Otherwise it allocates directly in the file-scope region. The net effect is that only variables taking the dual-region path can live in a function region, while all others persist in the file-scope region.
GNU Supplement for Routines
alloc_gnu_supplement_for_routine (sub_5E56D0, il_alloc.c:3412) asserts that no supplement already exists (*(routine+240) == 0), then allocates a 40-byte supplement and stores the pointer at routine+240. This is for GCC-extension attributes on functions (visibility, alias, constructor/destructor priority).
Pragma -- 43 Kinds
alloc_pragma (sub_5E7570, il_alloc.c:4781) uses the same-region-as pattern (handling null, non-file-scope, scratch, and same-region-as cases) and dispatches a switch covering 43 pragma kinds (0-42). Most kinds are no-ops; kinds 19, 21, 26, 28, and 29 have small payload fields.
Scope -- Routine Association
alloc_scope (sub_5E7D80) validates that if assoc_routine (argument a3) is non-null, the scope kind must be 17 (sck_function). Violation triggers internal_error("assoc_routine is non-NULL") at il_alloc.c:4946. After kind dispatch, it zeroes 26 qword fields (offsets 80-280) and sets *(result+240) = -1 as a sentinel.
Global Variable Map
| Address | Name | Purpose |
|---|---|---|
| dword_126EC90 | file_scope_region_id | Region 1 identifier |
| dword_126EB40 | current_region_id | Active allocation region |
| dword_106BA08 | tu_copy_mode | TU-copy mode flag (affects prefix layout) |
| dword_126EFC8 | tracing_enabled | When set, brackets alloc calls with trace_enter/leave |
| qword_126EFB8 | null_source_position | Default source position for new nodes |
| qword_126F700 | current_source_file | Current source file reference |
| qword_106B9B0 | compilation_context | Active compilation context pointer |
| dword_126E5FC | source_file_flags | Bit 0 = C++ mode indicator |
| byte_126E5F8 | language_std_byte | Language standard (controls routine type init) |
| dword_106BFF0 | uses_exceptions | Exception model flag (set in routine alloc) |
IL Tree Walking
The IL tree walking framework is the backbone of every operation that must visit the complete IL graph: debug display, device code marking, IL serialization, and IL copying for template instantiation. The framework lives in il_walk.c (with entry-kind dispatch logic auto-generated from walk_entry.h). It provides a generic, callback-driven traversal engine consisting of two core functions: walk_file_scope_il (sub_60E4F0), which orchestrates the top-level iteration over all global entry-kind lists, and walk_entry_and_subtree (sub_604170), which recursively descends into a single entry's children according to the IL schema. Five global function-pointer slots allow each client to customize the walk's behavior without modifying the walker itself.
The framework follows a strict separation of traversal and action. The walker knows how to navigate the IL graph; the callbacks decide what to do at each node. This design enables the same walker to serve four fundamentally different purposes: pretty-printing, transitive-closure marking, pointer remapping during copy, and entry filtering during serialization.
Key Facts
| Property | Value |
|---|---|
| Source file | il_walk.c (EDG 6.6) |
| Header (auto-generated dispatch) | walk_entry.h |
| Assert path | /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/il_walk.c |
| Top-level file-scope walker | sub_60E4F0 (walk_file_scope_il), 2043 lines |
| Recursive entry walker | sub_604170 (walk_entry_and_subtree), 7763 lines / 42KB |
| Routine-scope walker | sub_610200 (walk_routine_scope_il), 108 lines |
| Hash table reset | sub_603B30 (clear_walk_hash_table), 23 lines |
| Anonymous union lookup | sub_603FE0 (find_parent_var_of_anon_union_type), 127 lines |
| Entry kinds covered | 85 (switch cases 0--84) |
| Recursive self-calls | ~330 (in walk_entry_and_subtree) |
| Callback slots | 5 global function pointers |
Callback Slot Architecture
The five callback slots are stored as global function pointers. Before any walk, the caller saves all five values, installs its own set, and restores the originals on exit. This save/restore discipline makes walks re-entrant -- a callback can itself trigger a nested walk with different callbacks.
| Address | Slot Name | Signature | Purpose |
|---|---|---|---|
| qword_126FB88 | entry_callback | entry_ptr(entry_ptr, entry_kind) | Called for each entry visited; may return a replacement pointer |
| qword_126FB80 | string_callback | void(char_ptr, string_kind, byte_length) | Called for each string field; string_kind is 24 (id_name), 25 (string_text), or 26 (other_text); byte_length is strlen+1 for kinds 24/26, field-based for kind 25 |
| qword_126FB78 | pre_walk_check | int(entry_ptr, entry_kind) | Called before descending into an entry; returns nonzero to skip the subtree |
| qword_126FB70 | entry_replace | entry_ptr(entry_ptr, entry_kind) | Called to remap an entry pointer (used during IL copy to translate old pointers to new ones) |
| qword_126FB68 | entry_filter | entry_ptr(entry_ptr, entry_kind) | Called on linked-list heads to filter entries; returning NULL removes the entry from the list |
The pre_walk_check slot is the only one whose return value gates descent: nonzero means "already handled, skip this subtree." The keep-in-il pass uses this to avoid revisiting already-marked entries (preventing infinite recursion on cyclic references). The entry_replace slot is used during IL copy operations to translate pointers from the source IL region to the destination region.
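The save, install, and restore discipline, together with the pre-walk gate, can be sketched with two of the five slots. Everything here is illustrative: the `sketch_` names are hypothetical, the walk is a flat loop instead of a recursive descent, and the example check mimics a mark-once pass by treating each entry as an `int` flag.

```c
#include <stddef.h>

typedef void *(*sketch_entry_cb)(void *entry, int kind);
typedef int (*sketch_pre_walk_check)(void *entry, int kind);

/* Two of the five slots, modeled as file-level globals. */
static sketch_entry_cb sketch_entry_callback = NULL;
static sketch_pre_walk_check sketch_pre_check = NULL;

/* Visit one entry under the installed callbacks.
   Returns 1 if visited, 0 if the pre-walk check asked to skip it. */
static int sketch_visit(void *entry, int kind) {
    if (sketch_pre_check && sketch_pre_check(entry, kind))
        return 0;
    if (sketch_entry_callback)
        sketch_entry_callback(entry, kind);
    return 1;
}

/* Walk a flat list with the caller's callbacks, restoring the previous
   slot values on exit so that nested walks do not clobber each other. */
static int sketch_walk(void **entries, size_t n, int kind,
                       sketch_entry_cb cb, sketch_pre_walk_check check) {
    sketch_entry_cb saved_cb = sketch_entry_callback;
    sketch_pre_walk_check saved_check = sketch_pre_check;
    sketch_entry_callback = cb;
    sketch_pre_check = check;
    int visited = 0;
    for (size_t i = 0; i < n; ++i)
        visited += sketch_visit(entries[i], kind);
    sketch_entry_callback = saved_cb;
    sketch_pre_check = saved_check;
    return visited;
}

/* Example pre-walk check: skip entries already marked, mark the rest. */
static int sketch_skip_if_marked(void *entry, int kind) {
    int *mark = entry;
    (void)kind;
    if (*mark) return 1;
    *mark = 1;
    return 0;
}
```

With this check installed, an entry that appears twice in the list is visited once, which is the same property that stops a marking pass from looping on cyclic references.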
Walk State Globals
In addition to the five callback slots, four state variables track the walker's context:
| Address | Name | Description |
|---|---|---|
| dword_126FB5C | is_file_scope_walk | 1 during walk_file_scope_il, 0 during walk_routine_scope_il |
| dword_126FB58 | is_secondary_il | 1 if the current scope belongs to the secondary IL region |
| dword_106B644 | current_il_region | Toggles per IL region; used to stamp bit 2 of entry flags |
| dword_126FB60 | walk_mode_flags | Bitmask controlling walk behavior (e.g., strip template info) |
All four are saved and restored alongside the callback slots, making the entire walk context atomically swappable.
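The save/install/restore discipline can be illustrated with a minimal sketch. It assumes the nine globals are bundled into one struct so they can be swapped in a single assignment; the binary keeps them as separate globals, and the names walk_context, g_walk, with_walk_context, and record_mode_flags are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*generic_fn)(void); /* placeholder for the real callback types */

/* Hypothetical bundle of the five callback slots and four state variables. */
typedef struct walk_context {
    generic_fn entry_callback, string_callback, pre_walk_check,
               entry_replace, entry_filter;
    int is_file_scope_walk, is_secondary_il, current_il_region, walk_mode_flags;
} walk_context;

walk_context g_walk; /* stands in for the nine separate globals */

/* Save the live context, install the caller's, run the body, then restore. */
void with_walk_context(const walk_context *wanted, void (*body)(void *), void *user)
{
    assert(wanted != NULL && body != NULL);
    walk_context saved = g_walk;
    g_walk = *wanted;
    body(user);          /* may itself call with_walk_context (nested walk) */
    g_walk = saved;
}

/* Example body: records the mode flags visible while the walk is running. */
void record_mode_flags(void *user)
{
    *(int *)user = g_walk.walk_mode_flags;
}
```

Because the saved copy lives on the caller's stack, a callback that starts a nested walk gets its own saved copy, and each level restores what it found on entry.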
walk_file_scope_il (sub_60E4F0)
This is the central traversal entry point. Every operation that needs to visit the entire file-scope IL calls this function with its desired callbacks. It takes six arguments:
void walk_file_scope_il(
entry_callback_t a1, // entry visitor (qword_126FB88)
string_callback_t a2, // string visitor (qword_126FB80)
entry_replace_t a3, // pointer remapper (qword_126FB70)
entry_filter_t a4, // list filter (qword_126FB68)
pre_walk_check_t a5, // pre-visit gate (qword_126FB78)
int a6 // walk_mode_flags (dword_126FB60)
);
Initialization
The function begins by saving all five callback slots and all four walk state variables, then installs the caller's values:
// Save current state
saved_entry_cb = qword_126FB88;
saved_string_cb = qword_126FB80;
saved_pre_walk = qword_126FB78;
saved_entry_replace = qword_126FB70;
saved_entry_filter = qword_126FB68;
saved_is_file_scope = dword_126FB5C;
saved_is_secondary = dword_126FB58;
saved_il_region = dword_106B644;
saved_mode_flags = dword_126FB60;
// Install new callbacks
qword_126FB88 = a1;
qword_126FB80 = a2;
qword_126FB70 = a3;
qword_126FB68 = a4;
qword_126FB78 = a5;
dword_126FB60 = a6;
dword_126FB5C = 1; // mark as file-scope walk
An assertion fires if pre_walk_check is NULL and the primary scope is in secondary IL (bit 1 of flags byte set):
if (!a5 && is_secondary)
assert_fail("il_walk.c", 270, "walk_file_scope_il");
This prevents unguarded walks into secondary IL regions, which would produce incorrect results because secondary entries need canonical-entry delegation.
Walk Order
The function visits IL entries in a fixed, deterministic order. This order is significant for serialization (the IL binary format expects entries in this exact sequence) and for display (the --dump_il output follows this structure).
Phase 1: Primary scope (kind 23)
primary_scope = xmmword_126EB60[1]; // second qword of IL header
// If entry_replace callback exists, remap the scope pointer first
if (entry_replace)
primary_scope = entry_replace(primary_scope, 23);
// Determine IL region flags from the scope's prefix byte
is_secondary_il = (*(primary_scope - 8) & 0x02) != 0;
current_il_region = ((*(primary_scope - 8) >> 2) ^ 1) & 1;
walk_entry_and_subtree(primary_scope, 23);
The scope entry (kind 23) is walked first because it is the root of the scope tree. Walking the scope recursively visits all nested scopes and their member lists.
Phase 2: Source file entries (kind 1)
for (entry = xmmword_126EB60[0]; entry; entry = entry->child_file) {
if (entry_filter && !entry_filter(entry, 1))
continue; // filtered out
walk_entry_and_subtree(entry, 1);
}
Source file entries form a linked list via offset +56 (child_file). Each entry holds the file name, full path, and name_as_written strings.
Phase 3: main_routine pointer and string entries
Before walking strings, the function remaps the main_routine pointer from the IL header:
// main_routine (qword_126EB70, IL header + 0x10)
if (entry_replace) {
il_header.main_routine = entry_replace(il_header.main_routine, 11);
// Also remap compiler_version through entry_replace
compiler_version = entry_replace(compiler_version, 26);
}
Then two string entries from the IL header are walked as "other text" (kind 26):
// compiler_version (qword_126EB78, IL header + 0x18)
if (compiler_version) {
if (trace_verbosity > 4)
fprintf(s, "Walking IL tree, string entry kind = %s\n", "other text");
if (string_callback)
string_callback(compiler_version, 26, strlen(compiler_version) + 1);
}
// time_of_compilation (qword_126EB80, IL header + 0x20) -- same pattern
if (entry_replace)
time_of_compilation = entry_replace(time_of_compilation, 26);
if (time_of_compilation) {
if (string_callback)
string_callback(time_of_compilation, 26, strlen(time_of_compilation) + 1);
}
Strings are walked with kind 26 (other_text) and the string callback receives the raw character pointer, the kind, and the length including the null terminator.
Phase 4: Orphaned entities list (kind 55)
for (entry = qword_126EBA0; entry; entry = entry->next) {
if (entry_filter && !entry_filter(entry, 55))
continue;
walk_entry_and_subtree(entry, 55);
}
Kind 55 entries are orphaned entities -- declarations that lost their parent scope (e.g., after template instantiation cleanup). They are stored in a separate linked list headed at qword_126EBA0.
Phase 5: Global entry-kind lists (kinds 1--72)
The bulk of the walk iterates 45 global linked lists, one per entry kind. Each list head is stored at a fixed address in the 0x126E610--0x126EA80 range, with 16-byte spacing. The complete walk order, verified from the decompiled sub_60E4F0:
| # | Global Address | Kind | Entry Kind Name |
|---|---|---|---|
| 1 | qword_126E610 | 1 | source_file_entry |
| 2 | qword_126E620 | 2 | constant |
| 3 | qword_126E630 | 3 | param_type |
| 4 | qword_126E640 | 4 | routine_type_supplement |
| 5 | qword_126E650 | 5 | routine_type_extra |
| 6 | qword_126E660 | 6 | type |
| 7 | qword_126E670 | 7 | variable |
| 8 | qword_126E680 | 8 | field |
| 9 | qword_126E690 | 9 | exception_specification |
| 10 | qword_126E6A0 | 10 | exception_spec_type |
| 11 | qword_126E6B0 | 11 | routine |
| 12 | qword_126E6C0 | 12 | label |
| 13 | qword_126E6D0 | 13 | expr_node |
| 14 | qword_126E6E0 | 14 | (reserved) |
| 15 | qword_126E6F0 | 15 | (reserved) |
| 16 | qword_126E700 | 16 | switch_case_entry |
| 17 | qword_126E710 | 17 | switch_info |
| 18 | qword_126E720 | 18 | handler |
| 19 | qword_126E730 | 19 | try_supplement |
| 20 | qword_126E740 | 20 | asm_supplement |
| 21 | qword_126E750 | 21 | statement |
| 22 | qword_126E760 | 22 | object_lifetime |
| 23 | qword_126E770 | 23 | scope |
| 24 | qword_126E7B0 | 27 | template_parameter |
| 25 | qword_126E7C0 | 28 | namespace |
| 26 | qword_126E7D0 | 29 | using_declaration |
| 27 | qword_126E7E0 | 30 | dynamic_init |
| 28 | qword_126E810 | 33 | overriding_virtual_func |
| 29 | qword_126E820 | 34 | (reserved) |
| 30 | qword_126E830 | 35 | derivation_path |
| 31 | qword_126E840 | 36 | base_class_derivation |
| 32 | qword_126E850 | 37 | (reserved) |
| 33 | qword_126E860 | 38 | (reserved) |
| 34 | qword_126E870 | 39 | class_info |
| 35 | qword_126E880 | 40 | (reserved) |
| 36 | qword_126E890 | 41 | constructor_init |
| 37 | qword_126E8A0 | 42 | asm_entry |
| 38 | qword_126E8E0 | 46 | lambda |
| 39 | qword_126E8F0 | 47 | lambda_capture |
| 40 | qword_126E900 | 48 | attribute |
| 41 | qword_126E9D0 | 61 | template_param |
| 42 | qword_126E9B0 | 59 | template_decl |
| 43 | qword_126E9E0 | 62 | name_reference |
| 44 | qword_126E9F0 | 63 | name_qualifier |
| 45 | qword_126EA80 | 72 | attribute (C++11) |
Note the gaps in the walk order: kinds 24--26 (base_class, string_text, other_text), 31--32 (local_static_variable_init, vla_dimension), 43--45 (asm_operand, asm_clobber, reserved), and 49--58 (element_position through hidden_name) are skipped. Kind 60 and kinds 65--71 are likewise absent from the table (kind 64 is handled separately in Phase 6). These entry kinds are either embedded inline within parent entries, accessed only through the recursive descent of walk_entry_and_subtree, or have no file-scope lists. Also note that kinds 59 and 61 appear out of order (61 before 59); this ordering is verified in the binary.
For each non-empty list, the walk applies the entry_replace callback (if present) to each entry before descending, and follows the next pointer (at offset -16 in the raw allocation, which is the next_in_list link in the entry prefix).
Phase 6: Special trailing lists
Three additional lists are walked after the main kind-indexed sequence:
// seq_number_lookup entries (kind 64) at qword_126EBE8
for (entry = qword_126EBE8; entry; entry = entry->next) {
if (entry_filter) ...
walk_entry_and_subtree(entry, 64);
}
// External declarations (kind 6) at qword_126EBE0
// -- uses entry_filter with kind 6 and follows offset +104 links
// Kind 83 entries at qword_126EC00
for (entry = qword_126EC00; entry; entry = entry->next) {
if (entry_filter) ...
walk_entry_and_subtree(entry, 83);
}
Cleanup
After all phases complete, the function restores all saved state:
dword_126FB5C = saved_is_file_scope;
dword_126FB58 = saved_is_secondary;
dword_106B644 = saved_il_region;
dword_126FB60 = saved_mode_flags;
qword_126FB88 = saved_entry_cb;
qword_126FB80 = saved_string_cb;
qword_126FB78 = saved_pre_walk;
qword_126FB70 = saved_entry_replace;
qword_126FB68 = saved_entry_filter;
If tracing is active (dword_126EFC8), the function emits trace-leave via sub_48AFD0.
walk_entry_and_subtree (sub_604170)
This is the recursive engine -- the second-largest function in the entire cudafe++ binary at 7763 lines / 42KB of decompiled code. It takes an entry pointer and its kind, then recursively walks every child entry according to the IL schema.
Entry Protocol
Before descending into any entry, the function executes a two-path check:
while (true) {
if (pre_walk_check != NULL) {
// Callback path: delegate decision to the callback
if (pre_walk_check(entry, entry_kind))
return; // callback says skip
} else {
// Default path: check flags
flags = *(entry - 8);
// If not file-scope walk and entry has file-scope bit: skip
if (!is_file_scope_walk && (flags & 0x01))
return;
// If entry's il_region bit matches current_il_region: skip
if (((flags & 0x04) != 0) == current_il_region)
return;
// Stamp the entry's il_region bit to match current region
*(entry - 8) = (4 * (current_il_region & 1)) | (flags & 0xFB);
}
// Trace output at verbosity > 4
if (trace_verbosity > 4)
fprintf(s, "Walking IL tree, entry kind = %s\n",
il_entry_kind_names[entry_kind]);
// Dispatch on entry kind
switch ((char)entry_kind) { ... }
}
The while(true) loop structure exists because certain cases (particularly linked-list tails) use continue to re-enter the check with a new entry, avoiding redundant function-call overhead for tail chains.
The default-path flags check serves two purposes:
- Scope isolation: File-scope entries encountered during a routine-scope walk are skipped (they belong to the outer walk).
- Region tracking: The current_il_region toggle prevents visiting the same entry twice within a single walk -- once stamped, an entry's bit 2 matches current_il_region, and the equality check causes the walker to skip it.
Entry Kind Dispatch
The giant switch covers all 85 entry kinds. Each case knows the exact layout of that entry type and recursively calls walk_entry_and_subtree on every child pointer. The three callbacks are invoked at appropriate points:
- entry_replace: Called on each child pointer before recursion, potentially replacing it with a remapped pointer.
- string_callback: Called on string fields (file names, identifier text), receiving the string pointer, the string kind (24, 25, or 26), and the byte length.
- entry_filter: Called on linked-list head pointers, returning NULL to remove the entry from the list.
Coverage by Entry Kind
The following table shows the major entry kinds and what the walker visits for each:
| Kind | Name | Children Walked |
|---|---|---|
| 1 | source_file_entry | file_name (string, kind 26 at [0]), full_name (string, kind 26 at [1]), name_as_written (string, kind 26 at [2]), child file list (kind 1, linked via offset +56 at [5]), associated entry at [6] (kind 1), module info at [8] (kind 82) |
| 2 | constant | Type refs at [14]/[15] (kind 6), expression at [16] (kind 13); sub-switch on constant_kind byte at +148 (see below) |
| 3 | parameter | Type (kind 6), declared_type (kind 6), default_arg_expr (kind 13), attributes (kind 72) |
| 6 | type | Base type (kind 6), member field list (kind 8), template info (kind 58), scope (kind 23), base class list (kind 24), class_info supplement (kind 39) |
| 7 | variable | Type (kind 6), initializer expression (kind 13), attributes (kind 72), declared_type (kind 6) |
| 8 | field | Next field (kind 8), type (kind 6), bit_size_constant (kind 2) |
| 9 | exception_spec | Type list (kind 10), noexcept expression (kind 13) |
| 11 | routine | Return type (kind 6), parameter list (kind 3), body scope (kind 23), template info (kind 58), exception spec (kind 9), attributes (kind 72) |
| 12 | label | Break label (kind 12), continue label (kind 12) |
| 13 | expression | Sub-expressions (kind 13), operand entries, type references (kind 6); sub-switch on expression operator covers ~120 operator kinds |
| 16 | switch_case | Statement (kind 21), case_value constant (kind 2) |
| 17 | switch_info | Case list (kind 16), default case (kind 16), sorted case array |
| 18 | catch_entry | Parameter (kind 7), statement (kind 21), dynamic_init expression (kind 13) |
| 21 | statement | Sub-statements (kind 21), expressions (kind 13), labels (kind 12); sub-switch on statement kind |
| 22 | object_lifetime | Variable (kind 7), lifetime scope boundary |
| 23 | scope | Variables list (kind 7), routines list (kind 11), types list (kind 6), nested scopes (kind 23), namespaces (kind 28), using-declarations (kind 29), hidden names (kind 56), labels (kind 12) |
| 24 | base_class | Next (kind 24), type (kind 6), derived_class (kind 6), offset expression |
| 27 | template_parameter | Default value, constraint expression (kind 60), template param supplement |
| 28 | namespace | Associated scope (kind 23), flags |
| 29 | using_declaration | Target entity, position, access specifier |
| 30 | dynamic_init | Expression (kind 13), associated variable (kind 7) |
| 39 | class_info | Constructor initializer list (kind 41), friend list, base class list (kind 24) |
| 41 | constructor_init | Next (kind 41), member/base expression, initializer expression |
| 55 | orphaned_entities | Entity list, scope reference |
| 58 | template | Template parameter list (kind 61), body, specializations list |
| 72 | attribute | Attribute arguments (kind 73), next attribute (kind 72) |
| 80 | subobject_path | Linked list (kind 80), each entry walked recursively |
Constant Entry Sub-Switch (Case 2)
The constant entry handler is one of the most complex cases. After walking two type references ([14], [15] as kind 6) and one expression ([16] as kind 13), it dispatches on the constant_kind byte at entry + 148:
// Walk shared fields first
walk(a1[14], 6); // type
walk(a1[15], 6); // declared_type
walk(a1[16], 13); // associated expression
// Strip template info if walk_mode_flags set
if (walk_mode_flags)
a1[17] = 0;
switch (constant_kind) {
case 0: /* ck_error */
case 1: /* ck_integer */
case 3: /* ck_float */
case 5: /* ck_imaginary */
case 14: /* ck_void */
break; // leaf constants, no children
case 2: /* ck_string */
// Walk string data at [20] as string_text (kind 25)
// Length comes from [19] (not strlen -- may have embedded NULs)
if (string_callback)
string_callback(a1[20], 25, a1[19]);
break;
case 4: /* ck_complex */
walk(a1[19], 27); // template_parameter (real/imaginary parts)
break;
case 6: /* ck_address -- 7 sub-kinds at entry+152 */
switch (address_sub_kind) {
case 0: entry_replace(a1[20], 11); break; // routine
case 1: entry_replace(a1[20], 7); break; // variable
case 2: case 3:
walk(a1[20], 2); break; // constant (recurse)
case 4: entry_replace(a1[20], 6); break; // type (typeid)
case 5: walk(a1[20], 6); break; // type (uuidof, recurse)
case 6: entry_replace(a1[20], 12); break; // label
default: error("bad address const kind");
}
// Then walk subobject_path list at [22] (kind 80)
break;
case 7: /* ck_ptr_to_member */
entry_replace(a1[19], 36); // derivation_path
walk(a1[20], 62); // name_reference
// Conditional: if a1[21] & 2, replace [22] as routine(11)
// else replace [22] as field(8)
break;
case 8: /* ck_label_difference */
walk(a1[20], 2); // constant (recurse)
break;
case 9: /* ck_dynamic_init */
walk(a1[19], 30); // dynamic_init entry
break;
case 10: /* ck_aggregate */
// Linked list of constants at [19], each via offset +104
for each constant in list: walk(entry, 2);
entry_replace(a1[20], 2); // tail constant
break;
case 11: /* ck_init_repeat */
walk(a1[19], 2); // repeated constant
break;
case 12: /* ck_template_param -- 15 sub-kinds at entry+152 */
// Another sub-switch with cases 0-13 + default error
break;
case 13: /* ck_designator */
walk(a1[20], 2); // constant value
break;
case 15: /* ck_reflection */
// Walk [20] with kind from entry+152 byte
break;
}
The walk_mode_flags field zeroing (a1[17] = 0) strips template parameter constant info during IL binary output. This is the template-stripping behavior controlled by argument a6 of walk_file_scope_il.
String Entry Handling
String fields within entries are walked with three distinct string kind values:
| String Kind | Value | Display Name | Used For | Length Source |
|---|---|---|---|---|
| id_name | 24 | "id name" | Identifier names (variable, function, field names) | strlen(str) + 1 |
| string_text | 25 | "string text" | String literal content (for ck_string constants) | Constant's length field [19] |
| other_text | 26 | "other text" | File names, compiler version, compilation time, asm text | strlen(str) + 1 |
The string_text kind (25) is special: its length comes from the enclosing constant entry's [19] field rather than strlen, because C/C++ string literals may contain embedded null bytes. All other string kinds use strlen(str) + 1.
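The length rule can be written as a single selection function. This is a sketch only: walked_string_length and the SK_* constants are hypothetical names, and field_length stands for whatever the enclosing constant's [19] field holds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SK_ID_NAME = 24, SK_STRING_TEXT = 25, SK_OTHER_TEXT = 26 };

/* Length handed to string_callback: field-based for string_text,
   strlen + 1 (terminator included) for the other two kinds. */
size_t walked_string_length(int string_kind, const char *text, size_t field_length)
{
    assert(text != NULL);
    if (string_kind == SK_STRING_TEXT)
        return field_length;    /* literal may contain embedded NULs */
    return strlen(text) + 1;
}
```

For a literal such as "ab\0cd", strlen stops at the embedded NUL, so only the field-based length covers the whole literal.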
Error Strings
The function contains diagnostic strings from walk_entry.h that fire on unexpected sub-kind values:
| String | Line | Triggers When |
|---|---|---|
| "walk_entry_and_subtree: bad address const kind" | 883 | Unknown address_constant_kind in constant entry (kind 2, sub-kind 6) |
| "walk_entry_and_subtree: bad template param constant kind" | 1035 | Unknown template_param_constant_kind in constant entry (kind 2, sub-kind 12) |
| "walk_entry_and_subtree: bad constant kind" | 1051 | Unknown constant_kind in constant entry (kind 2) |
All three errors reference walk_entry.h as the source file and walk_entry_and_subtree as the function name, confirming the dispatch code is generated from the header file.
walk_routine_scope_il (sub_610200)
The routine-scope counterpart of walk_file_scope_il. It takes a routine index and walks that routine's scope chain:
void walk_routine_scope_il(int routine_index, ...) {
// Same 5-callback + 4-state save/restore pattern
// Trace: "walk_routine_scope_il"
// Assert: il_walk.c, line 376
dword_126FB5C = 0; // NOT file-scope walk
scope = qword_126EB90[routine_index]; // routine_scope_array
while (scope) {
walk_entry_and_subtree(scope, 23);
if (entry_replace)
scope = entry_replace(scope, 23);
scope = scope->next;
}
}
The key difference from walk_file_scope_il is that is_file_scope_walk is set to 0, which changes the entry protocol in walk_entry_and_subtree: entries with the file-scope bit set in their flags byte are skipped, because they belong to the file-scope IL and should not be processed during a routine-scope walk.
Callers and Use Cases
The walk framework serves four distinct purposes. Each caller installs a different callback configuration.
IL Display
The --dump_il debug output uses the walk framework with display_il_entry (sub_5F4930) as the entry callback:
// sub_5F76B0 (display_il_header)
walk_file_scope_il(
display_il_entry, // a1: entry callback = sub_5F4930
NULL, // a2: no string callback
NULL, // a3: no replace
NULL, // a4: no filter
NULL, // a5: no pre-walk check
0 // a6: no special flags
);
With all callbacks NULL except entry_callback, the walker visits every entry in walk order and calls display_il_entry on each, which dispatches on entry kind to print formatted field dumps. The pre_walk_check is NULL, so the default flags-based skip logic applies -- the current_il_region toggle prevents double-visiting.
Keep-in-IL Marking
The device code selection pass (mark_to_keep_in_il, sub_610420) installs the prune callback as pre_walk_check and NULL for everything else:
// sub_610420 (mark_to_keep_in_il)
qword_126FB88 = NULL; // no entry callback
qword_126FB80 = NULL; // no string callback
qword_126FB78 = prune_keep_in_il_walk; // sub_617310
qword_126FB70 = NULL; // no replace
qword_126FB68 = NULL; // no filter
The prune_keep_in_il_walk callback (sub_617310) sets bit 7 (0x80) of each entry's flags byte and returns 1 for already-marked entries (preventing infinite recursion). The actual subtree walk is handled by a specialized copy of walk_entry_and_subtree (sub_6115E0, walk_tree_and_set_keep_in_il, 4649 lines) that directly sets the keep bit on every reachable child rather than using callbacks. See Keep-in-IL for the full mechanism.
IL Serialization
IL binary output (when IL_SHOULD_BE_WRITTEN_TO_FILE would be enabled, or for device IL output) uses all five callback slots:
- entry_callback: Records each entry's position in the output stream
- string_callback: Serializes string data with length prefix
- entry_replace: Translates IL pointers to output-stream offsets
- entry_filter: Skips entries that should not appear in the output (e.g., entries without keep_in_il for device IL)
- pre_walk_check: Prevents re-serializing entries already written
IL Copy (Template Instantiation)
When EDG instantiates a template, it copies the template's IL subtree into a new region. The copy operation uses entry_replace to remap all pointers from the source region to the destination:
- entry_replace: For each child pointer, allocates a new entry in the destination region, copies the source entry's contents, and returns the new pointer
- string_callback: Copies string data into the destination region
- pre_walk_check: Tracks which entries have already been copied (using the visited-set hash table at qword_126FB50)
Hash Table for Visited Set
The walk framework includes a visited-set hash table for cycle detection and deduplication:
| Address | Name | Description |
|---|---|---|
| qword_126FB50 | hash_table_array | Pointer to hash table bucket array |
| dword_126FB48 | hash_table_count | Number of entries in hash table |
| qword_126FB40 | visited_set | Pointer to visited-set data |
| dword_126FB30 | visited_count | Number of visited entries |
The hash table is reset by sub_603B30 (clear_walk_hash_table) before each walk operation. It uses open addressing and is primarily employed during IL copy operations to map source entry pointers to their destination counterparts.
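To make the idea concrete, here is a minimal sketch of an open-addressing map from source pointers to destination pointers. Everything beyond "open addressing" is an assumption: the fixed slot count, the hash mixing, linear probing, and the names ptr_map, ptr_map_find, and ptr_map_insert are all hypothetical and are not recovered from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_SLOTS 64  /* assumed power-of-two capacity */

typedef struct ptr_map {
    const void *keys[MAP_SLOTS];   /* NULL marks an empty slot */
    void *values[MAP_SLOTS];
    size_t count;
} ptr_map;

static size_t slot_for(const void *key)
{
    uintptr_t h = (uintptr_t)key;
    h ^= h >> 7;                   /* cheap mixing; hypothetical */
    return (size_t)(h & (MAP_SLOTS - 1));
}

/* Returns the mapped value, or NULL if key has not been recorded. */
void *ptr_map_find(const ptr_map *map, const void *key)
{
    assert(map != NULL && key != NULL);
    for (size_t i = slot_for(key), n = 0; n < MAP_SLOTS;
         i = (i + 1) & (MAP_SLOTS - 1), n++) {
        if (map->keys[i] == NULL) return NULL;
        if (map->keys[i] == key) return map->values[i];
    }
    return NULL;
}

/* Records key -> value. Returns 1 if inserted, 0 if present or table full. */
int ptr_map_insert(ptr_map *map, const void *key, void *value)
{
    assert(map != NULL && key != NULL);
    if (map->count + 1 >= MAP_SLOTS) return 0;  /* always keep one empty slot */
    for (size_t i = slot_for(key);; i = (i + 1) & (MAP_SLOTS - 1)) {
        if (map->keys[i] == key) return 0;
        if (map->keys[i] == NULL) {
            map->keys[i] = key;
            map->values[i] = value;
            map->count++;
            return 1;
        }
    }
}
```

In a copy walk, a pre_walk_check-style callback could call ptr_map_find first and skip the subtree when the source entry already has a destination.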
Helper Functions
Several helper functions support the walk framework:
| Address | Identity | Lines | Purpose |
|---|---|---|---|
| sub_603B30 | clear_walk_hash_table | 23 | Zeros the visited-set hash table (qword_126FB50, dword_126FB48) |
| sub_603FE0 | find_parent_var_of_anon_union_type | 127 | Searches scope member lists for the variable that owns an anonymous union type |
| sub_603BB0 | find_var_in_nested_scopes | 333 | Recursively searches nested scopes for a variable (deeply unrolled, 8+ levels) |
| sub_603B00 | (trivial getter) | 9 | Walk-state accessor |
| sub_610200 | walk_routine_scope_il | 108 | Routine-scope walker (counterpart to walk_file_scope_il) |
Keep-in-IL Specialized Walkers
The keep-in-il pass uses parallel implementations of the walk framework that bypass the callback mechanism for performance:
| Address | Identity | Lines | Purpose |
|---|---|---|---|
| sub_6115E0 | walk_tree_and_set_keep_in_il | 4649 | File-scope variant -- sets bit 7 directly on every reachable entry |
| sub_618660 | walk_entry_and_set_keep_in_il_routine_scope | 3728 | Routine-scope variant |
| sub_61CE20--sub_620190 | (keep-in-il helpers) | various | Per-kind helpers for template args, exception specs, array bounds, expressions, statements |
These specialized walkers are structurally identical to walk_entry_and_subtree but replace callback invocations with direct *(entry - 8) |= 0x80 operations. They exist as separate functions rather than callback-based walks because the keep-in-il marking is performance-critical: it runs on every CUDA compilation, and removing the function-pointer indirection at the ~330 recursive call sites presumably reduces the per-entry overhead.
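The direct-marking approach amounts to a depth-first transitive closure in which the keep bit doubles as the visited marker. The sketch below shows this on a toy graph; toy_entry, its fixed-size child array, and mark_reachable are hypothetical stand-ins for the real per-kind layouts, and the flags byte is stored inside the node rather than at entry - 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEEP_IN_IL 0x80u
#define MAX_CHILDREN 4

typedef struct toy_entry {
    uint8_t flags;                          /* stands in for the byte at entry - 8 */
    struct toy_entry *children[MAX_CHILDREN];
} toy_entry;

/* Marks entry and everything reachable from it.
   Returns the number of entries newly marked by this call. */
int mark_reachable(toy_entry *entry)
{
    assert(entry != NULL);
    if (entry->flags & KEEP_IN_IL)
        return 0;                           /* already marked: prune (breaks cycles) */
    entry->flags |= KEEP_IN_IL;             /* direct store, no callback involved */
    int marked = 1;
    for (size_t i = 0; i < MAX_CHILDREN; i++)
        if (entry->children[i] != NULL)
            marked += mark_reachable(entry->children[i]);
    return marked;
}
```

Entries that are never reached from a marked root keep a clear bit 7, which is the property the elimination side relies on.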
Global Entry-Kind List Layout
The per-kind linked lists are stored in a contiguous global array starting at 0x126E600, with a 16-byte stride. The formula 0x126E600 + kind * 0x10 gives the list head for most entry kinds up to kind 72. In total the file-scope walk traverses 50 lists: the source-file list (Phase 2), the orphaned-entities list (Phase 4), the 45 kind-indexed lists in the Phase 5 table above, and the 3 trailing lists of Phase 6 (which include seq_number_lookup).
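The address formula can be checked against the Phase 5 table with a one-line helper. This is only an arithmetic sketch; kind_list_head_address is a hypothetical name, and the formula holds only for kinds that actually have a list in the contiguous array.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the list-head global for a given entry kind (1..72). */
uintptr_t kind_list_head_address(unsigned kind)
{
    assert(kind >= 1 && kind <= 72);
    return (uintptr_t)0x126E600u + (uintptr_t)kind * 0x10u;
}
```

For example, kind 23 (scope) maps to 0x126E770 and kind 72 maps to 0x126EA80, matching rows 23 and 45 of the table.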
The three trailing lists (Phase 6) are stored outside the contiguous array at separate addresses in the IL header/footer region:
| Address | Kind | Purpose | Next-Pointer Strategy |
|---|---|---|---|
| qword_126EBE8 | 64 | Sequence number lookup entries | Standard next_in_list at node prefix |
| qword_126EBE0 | 6 | External declarations (type list) | Type-specific next at offset +104 |
| qword_126EC00 | 83 | Module declarations (C++20) | Standard next_in_list at node prefix |
The external declarations list (qword_126EBE0) is notable: it walks entries as kind 6 (type) but uses a different linked-list strategy (offset +104 rather than the standard prefix next pointer). This is because the external declarations list is a secondary index over type entries that are also present in the main type list at qword_126E660.
Walk Order Diagram
walk_file_scope_il(callbacks...)
|
+-- [save 5 callbacks + 4 state vars]
+-- [install caller's callbacks]
|
+-- Phase 1: walk_entry_and_subtree(primary_scope, 23)
| |
| +-- Recursively visits all nested scopes,
| their member lists (vars, routines, types),
| and all subtrees
|
+-- Phase 2: source_file list (kind 1)
| +-- for each file: walk(file, 1)
| +-- walks file_name, full_name, child files
|
+-- Phase 3: main_routine + string entries
| +-- entry_replace(main_routine, 11)
| +-- string_callback(compiler_version, 26, len)
| +-- string_callback(time_of_compilation, 26, len)
|
+-- Phase 4: orphaned_entities list (kind 55)
| +-- for each orphan: walk(orphan, 55)
|
+-- Phase 5: global lists (kinds 1, 2, 3, ..., 72)
| +-- for each kind:
| for each entry in list:
| entry_replace(entry, kind)
| walk(entry, kind)
|
+-- Phase 6: trailing lists (kinds 64, 6-ext, 83)
|
+-- [restore saved state]
Diagnostic Strings
| String | Source | Condition |
|---|---|---|
| "walk_file_scope_il" | sub_60E4F0 | Trace enter (dword_126EFC8 nonzero) |
| "walk_routine_scope_il" | sub_610200 | Trace enter |
| "Walking IL tree, entry kind = %s\n" | sub_604170 | dword_126EFCC > 4 |
| "Walking IL tree, string entry kind = %s\n" | sub_604170 / sub_60E4F0 | dword_126EFCC > 4 |
| "walk_entry_and_subtree: bad address const kind" | sub_604170 | Unknown address constant sub-kind (walk_entry.h:883) |
| "walk_entry_and_subtree: bad template param constant kind" | sub_604170 | Unknown template param constant sub-kind (walk_entry.h:1035) |
| "walk_entry_and_subtree: bad constant kind" | sub_604170 | Unknown constant kind (walk_entry.h:1051) |
| "find_parent_var_of_anon_union_type" | sub_603FE0 | Assert at lines 511, 523 |
| "find_parent_var_of_anon_union_type: var not found" | sub_603FE0 | Variable lookup failed |
Function Map
| Address | Identity | Confidence | Lines | EDG Source |
|---|---|---|---|---|
| sub_60E4F0 | walk_file_scope_il | 99% | 2043 | il_walk.c:270 |
| sub_604170 | walk_entry_and_subtree | 99% | 7763 | il_walk.c / walk_entry.h |
| sub_610200 | walk_routine_scope_il | 98% | 108 | il_walk.c:376 |
| sub_603B30 | clear_walk_hash_table | 85% | 23 | il_walk.c |
| sub_603FE0 | find_parent_var_of_anon_union_type | 99% | 127 | il_walk.c:511 |
| sub_603BB0 | find_var_in_nested_scopes | 85% | 333 | il_walk.c |
| sub_603B00 | (trivial walk-state accessor) | 80% | 9 | il_walk.c |
| sub_6115E0 | walk_tree_and_set_keep_in_il | 98% | 4649 | il_walk.c |
| sub_618660 | walk_entry_and_set_keep_in_il_routine_scope | 88% | 3728 | il_walk.c |
| sub_61CE20 | (keep-in-il helper: template args) | 80% | 100 | il_walk.c |
| sub_61D0C0 | (keep-in-il helper: exception spec) | 80% | 108 | il_walk.c |
| sub_61D330 | (keep-in-il helper: array bound) | 80% | 97 | il_walk.c |
| sub_61D570 | (keep-in-il helper: overriding virtual) | 80% | 120 | il_walk.c |
| sub_61D7F0 | (keep-in-il helper: base class) | 80% | 69 | il_walk.c |
| sub_61D9B0 | (keep-in-il helper: attributes) | 80% | 202 | il_walk.c |
| sub_61DEC0 | (keep-in-il helper: using-decl) | 80% | 101 | il_walk.c |
| sub_61E160 | (keep-in-il helper: object lifetime) | 80% | 76 | il_walk.c |
| sub_61E370 | (keep-in-il helper: expressions) | 80% | 369 | il_walk.c |
| sub_61ECF0 | (keep-in-il helper: statements) | 80% | 466 | il_walk.c |
| sub_61F420 | (keep-in-il helper: additional exprs) | 80% | 631 | il_walk.c |
| sub_61FEA0 | (keep-in-il helper: decl sequence) | 80% | 173 | il_walk.c |
Cross-References
- IL Overview -- entry kind table, IL header structure
- IL Allocation -- entry prefix layout, flags byte definition
- Keep-in-IL -- device code marking pass using this framework
- IL Display -- display_il_entry callback
- Pipeline Overview -- when walks are triggered
- Device/Host Separation -- higher-level context
Keep-in-IL (Device Code Selection)
cudafe++ compiles a single .cu translation unit that contains both host and device code. After the EDG frontend builds the complete IL tree, cudafe++ must split the two worlds: host-side declarations feed into the .int.c output, while device-side declarations feed into the binary IL emitted for cicc. The keep-in-il mechanism performs this split. It is a transitive-closure walk that starts from known device entities (functions with __device__/__global__ attributes, __shared__/__constant__/__managed__ variables) and recursively marks every IL entry they reference. Entries that survive the mark phase are written to the device IL; entries without the mark are stripped by the elimination pass.
The entire mechanism lives in il_walk.c (the mark/walk side) and il.c (the elimination side). It runs as pass 3 of fe_wrapup, after IL lowering (pass 2) and before C++ class finalization (pass 4).
Key Facts
| Property | Value |
|---|---|
| Source file | il_walk.c (mark), il.c (eliminate) |
| Mark entry point | sub_610420 (mark_to_keep_in_il), 892 lines |
| Recursive worker | sub_6115E0 (walk_tree_and_set_keep_in_il), 4649 lines / 23KB |
| Prune callback | sub_617310 (prune_keep_in_il_walk), 127 lines |
| Elimination entry point | sub_5CCBF0 (eliminate_unneeded_il_entries), 345 lines |
| Template cleanup | sub_5CCA40 (clear_instantiation_required_on_unneeded_entities), 86 lines |
| Body removal | sub_5CC410 (eliminate_bodies_of_unneeded_functions), ~200 lines |
| Trigger | fe_wrapup pass 3, argument 23 (scope entry kind) |
| Guard flag | dword_106B640 (set=1 before walk, cleared=0 after) |
| Key bit | Bit 7 (0x80) of byte at entry_ptr - 8 |
The Keep-in-IL Bit
Every IL entry is preceded by an 8-byte prefix. The byte at offset -8 from the entry pointer contains per-entry flags:
Byte at (entry_ptr - 8):
bit 0 (0x01) is_file_scope Entry belongs to file-scope IL region
bit 1 (0x02) is_in_secondary_il Entry is in the secondary IL (second TU)
bit 2 (0x04) current_il_region Toggles per IL region (0 or 1)
bits 3-6 (reserved)
bit 7 (0x80) keep_in_il DEVICE CODE MARKER
The sign bit of this byte doubles as the keep-in-il flag. This allows a fast check: *(signed char*)(entry - 8) < 0 means "keep this entry." The elimination pass exploits this: it tests *(char*)(entry - 8) >= 0 to identify entries to remove (plain char is signed under the x86-64 System V ABI, so this reads the same bit).
Two additional "keep definition" flags exist on specific entity types:
| Entity kind | Field | Bit | Meaning |
|---|---|---|---|
| Type (kind 6, class/struct) | entry + 162 | bit 7 (0x80) | keep_definition_in_il -- retain full class body |
| Routine (kind 11) | entry + 187 | bit 2 (0x04) | keep_definition_in_il -- retain function body |
The keep_definition_in_il flag is stronger than the base keep_in_il flag. A type marked with only keep_in_il may be emitted as a forward declaration; one marked with keep_definition_in_il retains its full member list, base classes, and nested types.
Pipeline Context
fe_wrapup (sub_588F90)
|
+-- Pass 1: sub_588C60 per-file IL wrapup
+-- Pass 2: sub_707040 IL lowering
+-- Pass 3: dword_106B640 = 1 <<<< guard ON
| sub_610420(file_scope, 23) <<<< MARK PHASE
| dword_106B640 = 0 <<<< guard OFF
+-- Pass 4: sub_5CCA40 clear_instantiation_required <<<< CLEANUP
| sub_5CC410 eliminate_bodies_of_unneeded (if deferred)
| sub_5CCBF0 eliminate_unneeded_il_entries (if deferred)
+-- Pass 5: sub_588D40 per-file part_3
The guard flag dword_106B640 is an assertion variable: mark_to_keep_in_il asserts it is nonzero during the walk. If the assertion fires, it triggers internal_error at il_walk.c:1959. This prevents accidental invocation of the marking logic outside the designated pass window.
Mark Phase: mark_to_keep_in_il (sub_610420)
This is the top-level entry point for device code marking. It orchestrates the entire transitive-closure walk.
Callback Installation
The function saves all five IL tree-walk callback slots and installs a specialized configuration:
// Save existing callbacks
saved_entry_cb = qword_126FB88; // entry_callback
saved_string_cb = qword_126FB80; // string_callback
saved_pre_walk = qword_126FB78; // pre_walk_check
saved_entry_replace = qword_126FB70; // entry_replace
saved_entry_filter = qword_126FB68; // entry_filter
// Install keep-in-il configuration
qword_126FB88 = NULL; // no entry callback
qword_126FB80 = NULL; // no string callback
qword_126FB78 = prune_keep_in_il_walk; // sub_617310
qword_126FB70 = NULL; // no replacement
qword_126FB68 = NULL; // no filter
dword_126FB60 = 0; // walk_mode_flags = 0
dword_126FB58 = (flags_byte & 0x02) != 0; // is_secondary_il
All callbacks are restored on exit, making the function safely re-entrant from nested invocations (the prune callback can recursively call mark_to_keep_in_il for canonical entries in secondary IL).
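The save/install/restore discipline is what makes nested invocation safe. The following is a simplified sketch of that pattern, not the decompiled code: struct walk_config, g_walk, with_mark_config, and the demo bodies are hypothetical names, and the real slots are separate globals rather than one struct.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*pre_walk_fn)(void *entry, int kind);

/* Hypothetical stand-in for the five callback slots plus the walk state. */
struct walk_config {
    void       *entry_cb;
    void       *string_cb;
    pre_walk_fn pre_walk;
    void       *entry_replace;
    void       *entry_filter;
    int         walk_mode_flags;
    int         is_secondary_il;
};

static struct walk_config g_walk;        /* models the global slots */
static int g_inner_secondary = -1;       /* recorded by the demo bodies */
static int g_outer_secondary = -1;

static int demo_prune(void *entry, int kind)
{
    (void)entry; (void)kind;
    return 0;
}

/* Save the active configuration, install the marking one, run `body`,
   then restore, so a nested call leaves the outer walk's state intact. */
static void with_mark_config(int secondary, void (*body)(void))
{
    struct walk_config saved = g_walk;
    struct walk_config fresh = {0};
    fresh.pre_walk = demo_prune;
    fresh.is_secondary_il = secondary;
    g_walk = fresh;
    if (body) body();
    g_walk = saved;
}

static void inner_body(void) { g_inner_secondary = g_walk.is_secondary_il; }

static void outer_body(void)
{
    with_mark_config(1, inner_body);              /* nested invocation */
    g_outer_secondary = g_walk.is_secondary_il;   /* outer state is back */
}
```

Calling with_mark_config(0, outer_body) lets the nested call observe is_secondary_il == 1 while the outer body still sees 0 after the nested call returns.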
Scope Initialization
For scope entries (kind 23), the function handles two cases:
- Scope already has keep_in_il set (byte at entry + 28 is nonzero): Call walk_tree_and_set_keep_in_il directly. The scope was previously identified as device-relevant.
- Fresh scope (byte at entry + 28 is zero): Clear bit 7 of the entry's flags byte, then walk. This is the file-scope entry point where the walk begins with the keep bit initially cleared, allowing the recursive walk to set it transitively.
if (entry_kind == 23) { // scope
if (*(entry + 28) != 0) {
walk_tree_and_set_keep_in_il(entry, 23);
} else {
*(entry - 8) &= 0x7F; // clear keep_in_il
// Debug: "Beginning file scope keep_in_il walk"
walk_tree_and_set_keep_in_il(entry, 23);
if (dword_126EFB4 == 2) // C++ mode
walk_scope_and_mark_routine_definitions(entry); // sub_6175F0
}
}
Global Entry-Kind List Walk
After processing the scope, mark_to_keep_in_il iterates all 45+ global entry-kind linked lists. These lists at 0x126E610--0x126EA80 hold every file-scope entity indexed by entry kind. The function visits each list and calls walk_tree_and_set_keep_in_il on every entry:
// Orphaned scope list (kind 55) -- only entries with keep_definition flag
for (entry = qword_126EBA0; entry; entry = entry->next) {
if (entry->routine_byte_187 & 0x04) // keep_definition_in_il set
walk_tree_and_set_keep_in_il(entry, 55);
}
// Source files (kind 1), constants (kind 2), parameters (kind 3), ...
// through to concepts (kind 72)
for (int kind = 1; kind <= 72; kind++) {
for (entry = global_list[kind]; entry; entry = entry->next)
walk_tree_and_set_keep_in_il(entry, kind);
}
The iteration order mirrors walk_file_scope_il (sub_60E4F0), processing kinds 1 through 72 with some gaps (kinds 24--26, 31--32, 43--45, 49--58, 60, 64--71 are skipped because those lists are empty or handled differently).
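The gap list can be cross-checked against the "45+" figure with a small sketch. The function names are hypothetical; the ranges are copied from the sentence above.

```c
#include <assert.h>

/* Returns 1 if `kind` falls in one of the skipped ranges quoted above. */
static int kind_is_skipped(int kind)
{
    return (kind >= 24 && kind <= 26) || (kind >= 31 && kind <= 32) ||
           (kind >= 43 && kind <= 45) || (kind >= 49 && kind <= 58) ||
           kind == 60 || (kind >= 64 && kind <= 71);
}

/* Counts the kinds in 1..72 that the loop would actually visit. */
static int count_visited_kinds(void)
{
    int n = 0;
    for (int kind = 1; kind <= 72; kind++)
        if (!kind_is_skipped(kind))
            n++;
    return n;
}
```

With these ranges, 27 of the 72 kinds are skipped and 45 are visited, which agrees with the count of global lists quoted earlier.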
Using-Declaration Fixup
After the main walk, the function processes using-declarations attached to scopes. This is a fixed-point loop that repeats until no new entities are marked:
do {
changed = 0;
process_using_decl_list(scope->using_decls, is_class_scope, &changed);
} while (changed);
For each scope region (iterated via entry + 264), it walks the using-declaration chain and handles six declaration kinds:
| Using-decl kind byte | Name | Action |
|---|---|---|
| 0x33 | Simple using | If target entity is marked, mark the using-decl |
| 0x34 | Using with namespace | If target entity is marked, mark using-decl + namespace |
| 0x35 | Nested scope | Recurse via sub_6170C0 |
| 0x36 | Using with template | If target entity is marked, mark using-decl + template |
| 6 | Type alias (typedef) | Special: if typedef of a class/struct with has_definition flag, and the underlying class is marked, mark the typedef too |
| 66 | Using-everything | Force-mark unconditionally, set changed = 1 |
The typedef case (kind 6) deserves attention. When a typedef aliases a marked class, the typedef entry gets marked so that device code can reference the class through its alias name. The check verifies entry + 132 == 12 (typedef type kind), the underlying type is a class/struct/union (kinds 9--11), and the has_definition flag (entry + 161, bit 2) is set.
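The reason a single sweep is not enough can be shown with a reduced model of the fixed-point loop. This is a sketch under simplifying assumptions: struct using_decl and both functions are hypothetical, and each declaration has exactly one target whose keep state is read through a pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified using-declaration: marked when its target is. */
struct using_decl {
    int  marked;              /* stands in for the keep_in_il bit */
    int *target_marked;       /* keep state of the entity it names */
    struct using_decl *next;
};

/* One sweep over the chain; returns nonzero if anything new was marked. */
static int sweep_using_decls(struct using_decl *list)
{
    int changed = 0;
    for (struct using_decl *u = list; u != NULL; u = u->next) {
        if (!u->marked && *u->target_marked) {
            u->marked = 1;
            changed = 1;
        }
    }
    return changed;
}

/* Repeat sweeps until a full pass marks nothing; returns the sweep count. */
static int mark_using_decls_to_fixed_point(struct using_decl *list)
{
    int sweeps = 0;
    do {
        sweeps++;
    } while (sweep_using_decls(list));
    return sweeps;
}
```

If declaration a names declaration b, b names an already marked entity, and a precedes b in the chain, the first sweep marks only b, the second marks a, and the third confirms that nothing changed.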
Recursive Worker: walk_tree_and_set_keep_in_il (sub_6115E0)
This 23KB function is structurally identical to the generic walk_entry_and_subtree (sub_604170) but specialized: instead of invoking callbacks, it directly sets the keep_in_il bit on every reachable sub-entry and recurses.
The function dispatches on entry kind (approximately 80 cases) and for each child pointer it encounters, performs:
if (child != NULL) {
*(child - 8) |= 0x80; // set keep_in_il
walk_tree_and_set_keep_in_il(child, child_kind); // recurse
}
Key entry kinds and what they transitively mark:
| Entry kind | ID | Children marked |
|---|---|---|
| source_file | 1 | file_name, full_name, child files |
| constant | 2 | type, string data, address target |
| parameter | 3 | type, declared_type, default_arg_expr, attributes |
| type | 6 | base_type, member fields, template info, scope, base classes |
| variable | 7 | type, initializer expression, attributes |
| field | 8 | next field, type, bit_size_constant |
| routine | 11 | return_type, parameters, body, template info, exception specs |
| expression | 13 | sub-expressions, operands, type references |
| statement | 21 | sub-statements, expressions, labels |
| scope | 23 | all member lists (variables, routines, types, nested scopes) |
| template_parameter | 39 | default values, constraints |
| namespace | 28 | associated scope |
The function also handles cross-references in template instantiations: when it encounters a template specialization, it follows the primary template pointer and marks the template definition too. This ensures that if device code uses vector<int>, the vector template itself is retained.
Pre-Walk Check Integration
Before recursing into any entry, the walk checks the pre_walk_check callback (qword_126FB78), which is set to prune_keep_in_il_walk. This callback returns 1 (skip) if the entry is already marked, preventing infinite recursion on cyclic references (classes referencing their own members) and avoiding redundant work.
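The role of the pre-walk check in terminating cyclic walks can be sketched with a two-node graph. The node layout and function names are hypothetical; only the return convention (1 = skip, 0 = descend) follows the description above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical simplified IL node; `kept` models bit 7 of the prefix byte. */
struct node {
    int kept;
    int visits;                          /* times the walker descended */
    struct node *child[MAX_CHILDREN];
};

/* Pre-walk check: 1 = skip (already marked), 0 = mark and descend. */
static int prune(struct node *n)
{
    if (n->kept)
        return 1;
    n->kept = 1;
    return 0;
}

/* Recursive walk that consults the check before every descent. */
static void walk_and_mark(struct node *n)
{
    if (n == NULL || prune(n))
        return;
    n->visits++;
    for (int i = 0; i < MAX_CHILDREN; i++)
        walk_and_mark(n->child[i]);
}
```

When a class node and a member node point at each other, each is descended into exactly once: the second arrival at the class finds the bit already set and returns immediately.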
Prune Callback: prune_keep_in_il_walk (sub_617310)
This callback is installed as the pre_walk_check during the keep-in-il walk. It runs before the walker descends into each entry.
Decision Logic
int prune_keep_in_il_walk(entry_ptr, entry_kind) {
signed char flags = *(signed char *)(entry_ptr - 8); // read as signed so bit 7 shows up as a negative value
// Case 1: Secondary IL mismatch -- delegate to canonical
if (is_secondary_il && !(flags & 0x02)) {
canonical = lookup_canonical(entry_ptr, entry_kind); // sub_5B9EE0
if (dword_126EE48) { // CUDA mode
if (canonical && canonical->assoc_entry) {
target = *canonical->assoc_entry;
if (target != entry_ptr && (*(target - 8) & 0x02))
mark_to_keep_in_il(target, entry_kind); // recurse
}
}
return 1; // skip this entry (handled via canonical)
}
// Case 2: Already marked -- skip
if (flags < 0) // bit 7 set = signed negative
return 1;
// Case 3: Type with class/struct/union definition -- mark definition too
if (entry_kind == 6 && (unsigned char)(*(entry + 132) - 9) <= 2) { // type kind 9..11 (class/struct/union)
if (is_local || is_imported || !has_name || has_definition)
set_keep_definition_on_type(entry); // sub_6111C0
}
// Set the keep_in_il bit
*(entry_ptr - 8) |= 0x80;
// Debug output
if (trace_active && trace_filter("needed_flags", entry, kind)) {
switch (entry_kind) {
case 6: fprintf(s, "Setting keep_in_il on type "); break;
case 7: fprintf(s, "Setting keep_in_il on var "); break;
case 11: fprintf(s, "Setting keep_in_il on rout "); break;
case 28: fprintf(s, "Setting keep_in_il on namespace "); break;
}
}
// Case 4: Variable/routine in non-guard mode -- check class membership
if (!dword_106B640) {
if (!(*(entry + 82) & 0x10)) {
canonical = lookup_canonical(entry, entry_kind);
// Assert canonical exists (il_walk.c:1885)
if (*(canonical + 81) & 0x04) { // is class member
class_type = **(canonical + 40 + 32);
walk_tree_and_set_keep_in_il(class_type, 6);
set_keep_definition_on_type(class_type);
}
}
return 1;
}
// Handle canonical entry in secondary IL (CUDA mode)
canonical = lookup_canonical(entry, entry_kind);
if (dword_126EE48 && canonical) {
assoc = *(canonical + 32);
if (assoc) {
target = *assoc;
if (target != entry && (*(target - 8) & 0x02))
mark_to_keep_in_il(target, entry_kind);
}
}
return 0; // continue walking into this entry's children
}
The callback's return value controls the walk: returning 1 tells the walker to skip the subtree (entry already processed or delegated to canonical), returning 0 tells it to descend into children.
Secondary IL Handling
When cudafe++ processes multiple translation units (e.g., through #include chains that bring in separate compilation units), it maintains primary and secondary IL regions. The secondary IL flag (bit 1 of the flags byte) distinguishes them. The prune callback handles cross-region references by looking up the canonical (primary) version of each entry via sub_5B9EE0 and recursively marking that version instead. This ensures the device IL output contains the primary definitions, not secondary duplicates.
Keep-Definition Logic
For Types (sub_6111C0 / sub_611300)
When a class/struct/union type needs its definition kept (not just a forward declaration), set_keep_definition_on_type performs:
void set_keep_definition_on_type(entry) {
// Debug: "Setting keep_definition_in_il on <type>"
*(entry + 162) |= 0x80; // set keep_definition bit
// If already marked keep_in_il, clear and re-walk
// (definition requires deeper traversal than reference)
if (*(entry - 8) & 0x80) {
*(entry - 8) &= ~0x80; // clear keep_in_il
mark_to_keep_in_il(entry, 6); // re-walk with full traversal
}
// For class/struct: also clear/re-walk the associated scope
if (entry_kind is class/struct/union) {
scope = entry->associated_scope;
*(scope - 8) &= ~0x80;
// Follow canonical type chain
}
}
The clear-and-re-walk pattern is important: when an entity was initially marked via a shallow reference (e.g., a pointer to the class), only the type entry itself was marked. When the definition is later needed (e.g., the device code accesses a member), the keep bit is cleared and the walk restarts, this time descending into all members, base classes, and nested types.
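The interaction between the prune check and the re-walk can be modeled in a few lines. This sketch assumes, for illustration only, that descent into members is gated on the keep_definition flag; struct type_entry, mark_type, and request_definition are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified type entry with up to two members. */
struct type_entry {
    int kept;                        /* models keep_in_il */
    int keep_definition;             /* models keep_definition_in_il */
    struct type_entry *member[2];
};

/* Marks the entry; descends into members only when the definition is
   required (an assumption of this sketch). Already-marked entries are
   skipped, exactly as a prune check would. */
static void mark_type(struct type_entry *t)
{
    if (t == NULL || t->kept)
        return;
    t->kept = 1;
    if (!t->keep_definition)
        return;                      /* shallow: reference only */
    for (int i = 0; i < 2; i++)
        mark_type(t->member[i]);
}

/* Upgrades a type to "definition needed". The keep bit must be cleared
   first, otherwise mark_type would skip the entry and never reach the
   members. */
static void request_definition(struct type_entry *t)
{
    t->keep_definition = 1;
    if (t->kept) {
        t->kept = 0;
        mark_type(t);
    }
}
```

A class first reached through a shallow reference is marked without its field; after request_definition the class is marked again and the field is marked along with it.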
For Routines (sub_6113F0 / sub_6181E0)
void set_keep_definition_on_routine(entry) {
// Debug: "Setting keep_definition_in_il on rout <name>"
*(entry + 187) |= 0x04; // set keep_definition bit
// If template specialization: also mark the primary template
if (*(entry + 177) & 0x20) {
primary = lookup_primary_template(entry); // sub_5BBCC0
mark_to_keep_in_il(primary, 11);
}
// Special member handling (copy/move constructors)
if (special_member_kind == 1 || special_member_kind == 2) {
// Recurse on associated class type's ctor/dtor
}
}
Scope-Level Routine Walk: sub_6175F0
In C++ mode (dword_126EFB4 == 2), after the main mark pass, mark_to_keep_in_il calls sub_6175F0 on the file scope. This function performs an additional sweep through all scope hierarchies to ensure routine definitions are correctly retained:
- For each class/struct scope with keep_in_il set: recurse into the class scope
- For each namespace (non-alias): recurse into the namespace scope
- For routines in class scopes with external linkage: if marked but not keep_definition, call set_keep_definition_on_routine
This handles the case where a class method is referenced by device code through a virtual call or template instantiation, requiring the full function body to be available in the device IL.
Elimination Phase
After the mark phase completes, three functions strip unmarked entities from the IL.
clear_instantiation_required_on_unneeded_entities (sub_5CCA40)
Runs in pass 4 of fe_wrapup, C++ mode only. Prevents unnecessary template instantiations from being triggered during IL output.
The function recursively walks the scope tree and for each routine with template instantiation flags, checks whether the instantiation is still needed:
void clear_instantiation_required_on_unneeded_entities(scope) {
assert(dword_126EFB4 == 2); // C++ only
// Recurse into child scopes (skip namespace aliases)
for (child = scope->nested_namespaces; child; child = child->next) {
if (!(child->flags & 0x01)) // not an alias
recurse(child->associated_scope);
}
// Recurse into class scopes
for (type = scope->types_list; type; type = type->next) {
if (is_class_struct_union(type) && !is_anonymous(type))
recurse(type->type_extra->scope_entry);
}
// Clear instantiation_required on unneeded routines
for (rout = scope->routines_list; rout; rout = rout->next) {
if (!(rout->flags_80 & 0x08) // not suppressed
&& !(rout->flags_179 & 0x10) // not already cleared
&& ((rout->flags_179 & 6) == 2 || (dword_126E204 && rout->flags_176 < 0))
&& rout->source_corresp != NULL
&& !(rout->flags_176 & 0x02)) // not locally defined
{
clear_instantiation_required(rout->name, 0, 2); // sub_78A380
}
}
// For non-file scopes: also clear on variable templates
if (scope->scope_kind != 0) {
for (var = scope->variables_list; var; var = var->next) {
if (!(var->flags_80 & 0x08)
&& !(var->flags_162 & 0x40)
&& (var->flags_162 & 0xB0) == 0x10
&& var->name != NULL)
{
clear_instantiation_required(var->name, 0, 2);
}
}
}
}
eliminate_bodies_of_unneeded_functions (sub_5CC410)
Walks the IL table (qword_126EB98) and removes function bodies for routines that were not marked with keep_definition_in_il:
void eliminate_bodies_of_unneeded_functions() {
for (idx = 1; idx <= dword_126EC78; idx++) {
scope = qword_126EC88[idx];
if (!scope) continue;
if (scope not in current TU) continue;
if (scope_kind != 17) continue; // 17 = function body
routine = scope->owning_entity;
if (routine->keep_definition_in_il) // byte+187 & 0x04
continue;
if (!(routine->flags_29 & 0x01))
continue;
remove_function_body(routine); // sub_5CAB40
}
}
eliminate_unneeded_il_entries (sub_5CCBF0)
The main elimination pass. Walks the scope tree and removes all unmarked entities from the IL linked lists.
void eliminate_unneeded_il_entries(scope) {
emit_info = get_emit_info(scope); // sub_703C30
assert(emit_info != NULL); // il.c:29598
// Recurse into child scopes (skip namespace aliases)
for (child = scope->nested_namespaces; child; child = child->next) {
if (!(child->flags & 0x01))
eliminate_unneeded_il_entries(child->associated_scope);
}
// --- Eliminate variables ---
prev = NULL;
for (var = scope->variables_list; var; var = next) {
next = var->next_in_list; // offset +104
if (*(signed char*)(var - 8) < 0) { // keep_in_il set
prev = var; // keep in list
} else {
// Unlink from list
if (prev) prev->next = next;
else scope->variables_list = next;
var->next = NULL;
// C++ mode: walk expression trees to clear hidden names
if (cpp_mode) {
walk_tree(var->expr_tree, clear_hidden_name_cb, 147);
walk_tree(var->alt_tree, clear_hidden_name_cb, 147);
}
}
}
emit_info[5] = prev; // last kept variable
// File scope: also clean orphaned scope list
if (scope->scope_kind == 0)
eliminate_unneeded_scope_orphaned_list_entries(); // sub_5CC570
// --- Eliminate routines (same pattern as variables) ---
prev = NULL;
for (rout = scope->routines_list; rout; rout = next) {
next = rout->next_in_list;
if (*(signed char*)(rout - 8) < 0) {
prev = rout;
} else {
// Unlink + clear hidden names
}
}
emit_info[6] = prev;
// Clear global variable reference if unmarked
if (qword_126EB70 && *(signed char*)(qword_126EB70 - 8) >= 0)
qword_126EB70 = NULL;
// --- Eliminate types ---
prev = NULL;
for (type = scope->types_list; type; type = next) {
next = type->next_in_list;
// Follow typedef chains to find the real type for sign-bit check
real = type;
if (real->type_kind == 12 && !real->name) { // anonymous typedef
do { real = real->base_type; }
while (real->type_kind == 12 && !real->name);
}
if (*(signed char*)(real - 8) < 0) { // marked
prev = type;
if (is_class_struct_union(type))
eliminate_unneeded_class_definitions(type); // sub_5CC1B0
} else {
// Unlink + process eliminated class members
if (is_class_struct_union(type) && cpp_mode)
process_members_of_eliminated_class(type);
type->base_type = NULL;
clear_type_extra_member_lists(type->type_extra);
type->type_extra->flags |= 0x20; // mark as eliminated
}
}
emit_info[4] = prev;
// --- Eliminate hidden names ---
// (same sign-bit check, unlink unmarked entries from scope->hidden_names)
// File scope: emit orphaned scopes
if (scope->scope_kind == 0)
emit_orphaned_scopes(scope); // sub_718720
// Clean external declarations list
for (ext = qword_126EBE0; ext; ext = next) {
next = ext->next;
if (*(signed char*)(ext - 8) >= 0) {
// Unlink from external declarations list
}
}
}
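The variable, routine, and type sweeps all share one list idiom: remember the last kept entry and splice around anything unmarked. A standalone sketch of that idiom follows. struct il_item and sweep_unmarked are hypothetical, and the flag byte is an ordinary member here rather than a prefix at offset -8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified entry for a singly linked IL list. */
struct il_item {
    unsigned char flags;         /* bit 7 (0x80) = keep_in_il */
    struct il_item *next;
};

/* Unlinks every entry whose keep bit is clear and returns the last kept
   entry (NULL if none), mirroring the value stored into emit_info. */
static struct il_item *sweep_unmarked(struct il_item **head)
{
    struct il_item *prev = NULL;
    struct il_item *next;
    for (struct il_item *it = *head; it != NULL; it = next) {
        next = it->next;                     /* read before unlinking */
        if ((signed char)it->flags < 0) {    /* sign bit = keep_in_il */
            prev = it;
        } else {
            if (prev) prev->next = next;
            else      *head = next;
            it->next = NULL;
        }
    }
    return prev;
}
```

For a list a, b, c, d in which only b and d are marked, the sweep leaves b followed by d and returns d as the last kept entry.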
The debug output for eliminated vs. retained entities uses a string trick: "TARG_VERT_TAB_CHAR" + 17 evaluates to "R", so the output reads either "Removing variable <name>" (eliminated) or "Not removing variable <name>" (retained).
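The pointer arithmetic behind this trick is easy to verify in isolation. In the sketch below, removal_prefix and format_removal are hypothetical, and the "%semoving variable %s" layout is an assumption made to reproduce the two messages; the actual format string in the binary is not shown here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* "TARG_VERT_TAB_CHAR" has 18 characters, so skipping 17 of them leaves
   the one-character string "R". &s[17] is the same address as s + 17. */
static const char *removal_prefix(int removed)
{
    return removed ? &"TARG_VERT_TAB_CHAR"[17] : "Not r";
}

/* Hypothetical formatter; the message layout is assumed, not recovered. */
static int format_removal(char *buf, size_t n, int removed, const char *name)
{
    return snprintf(buf, n, "%semoving variable %s",
                    removal_prefix(removed), name);
}
```

With removed set, the prefix is "R" and the message starts with "Removing"; otherwise the prefix "Not r" yields "Not removing".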
Global State
| Address | Name | Description |
|---|---|---|
| dword_106B640 | keep_in_il_walk_active | Assertion guard; 1 during pass 3, 0 otherwise |
| dword_126EFB4 | cpp_mode | 2 = C++ mode (enables class/template processing) |
| dword_126EFC8 | trace_active | Nonzero enables diagnostic output |
| dword_126EFCC | trace_verbosity | Higher = more output (>2 prints elimination details) |
| dword_126EE48 | cuda_mode | Nonzero enables CUDA-specific canonical entry handling |
| dword_126E204 | template_compat_flag | Affects template instantiation clearing criteria |
| qword_126EBA0 | orphaned_scope_list | File-scope orphaned scopes (kind 55 list) |
| qword_126EB70 | global_variable_ref | Cleared if its entry is unmarked |
| qword_126EBE0 | external_decl_list | External declarations; unmarked ones removed |
| qword_126EB98 | il_table | Array of IL scope pointers, indexed by il_table_index |
| qword_126FB78 | pre_walk_check | Callback slot; set to prune_keep_in_il_walk during mark |
| qword_126FB88 | entry_callback | Callback slot; NULL during mark phase |
| dword_126FB58 | is_secondary_il | Walk state: 1 if currently in secondary IL region |
| dword_126FB5C | is_file_scope_walk | Walk state: 1 during file-scope walk |
| dword_106B644 | current_il_region | Walk state: toggles per IL region |
| dword_126FB60 | walk_mode_flags | Walk state: 0 during keep-in-il walk |
Diagnostic Strings
| String | Source | Condition |
|---|---|---|
| "Beginning file scope keep_in_il walk" | sub_610420 | trace_active && trace_category("needed_flags") |
| "Ending file scope keep_in_il walk" | sub_610420 | Same |
| "Setting keep_in_il on type " | sub_617310 | trace_active && trace_filter("needed_flags", entry, 6) |
| "Setting keep_in_il on var " | sub_617310 | Same, kind 7 |
| "Setting keep_in_il on rout " | sub_617310 | Same, kind 11 |
| "Setting keep_in_il on namespace " | sub_617310 | Same, kind 28 |
| "Setting keep_definition_in_il on " | sub_6111C0 | Trace active |
| "Setting keep_definition_in_il on rout " | sub_6113F0 | Trace active |
| "Removing variable <name>" | sub_5CCBF0 | trace_verbosity > 2 or trace_filter("dump_elim") |
| "Not removing variable <name>" | sub_5CCBF0 | Same (for kept entries) |
| "Removing routine <name>" | sub_5CCBF0 | Same |
| "Removing <type>" | sub_5CCBF0 | Same |
| "eliminate_unneeded_il_entries" | sub_5CCBF0 | trace_active (level 3 trace enter/exit) |
Function Map
| Address | Identity | Confidence | Lines | EDG Source |
|---|---|---|---|---|
| sub_610420 | mark_to_keep_in_il | 99% | 892 | il_walk.c:1959 |
| sub_6115E0 | walk_tree_and_set_keep_in_il | 98% | 4649 | il_walk.c |
| sub_617310 | prune_keep_in_il_walk | 99% | 127 | il_walk.c:1885 |
| sub_6111C0 | set_keep_definition_on_type | 95% | 63 | il_walk.c |
| sub_611300 | set_keep_definition_on_type_simple | 92% | 48 | il_walk.c |
| sub_6113F0 | set_keep_definition_on_routine | 95% | 81 | il_walk.c |
| sub_6181E0 | set_keep_definition_on_routine_unconditional | 90% | 69 | il_walk.c |
| sub_6170C0 | process_using_decl_list | 92% | 154 | il_walk.c |
| sub_6175F0 | walk_scope_and_mark_routine_definitions | 90% | 634 | il_walk.c |
| sub_616EE0 | mark_virtual_function_types_to_keep | 85% | 88 | il_walk.c |
| sub_618370 | walk_and_set_keep_in_il_helper | 80% | 119 | il_walk.c |
| sub_618660 | walk_entry_and_set_keep_in_il_routine_scope | 88% | 3728 | il_walk.c |
| sub_5CCBF0 | eliminate_unneeded_il_entries | 100% | 345 | il.c:29598 |
| sub_5CCA40 | clear_instantiation_required_on_unneeded_entities | 100% | 86 | il.c:29450 |
| sub_5CC410 | eliminate_bodies_of_unneeded_functions | 100% | ~200 | il.c:29231 |
| sub_5CC1B0 | eliminate_unneeded_class_definitions | 100% | ~200 | il.c |
| sub_5CC570 | eliminate_unneeded_scope_orphaned_list_entries | 100% | ~200 | il.c:29398 |
| sub_5CB920 | process_members_of_eliminated_class_definition | 100% | ~300 | il.c:29097 |
| sub_5B9EE0 | lookup_canonical_entry | -- | -- | il_walk.c |
| sub_78A380 | clear_instantiation_required | -- | -- | template.c |
Cross-References
- Pipeline Overview -- overall compilation flow
- IL Overview -- entry kinds, header, regions
- IL Tree Walking -- generic walker with 5 callbacks
- Device/Host Separation -- higher-level splitting strategy
- Execution Spaces -- how entities get device/host attributes
IL Display
The IL display subsystem produces a human-readable text dump of the entire Intermediate Language graph. It is compiled from EDG's il_to_str.c (source path /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/il_to_str.c, confirmed by an assertion at line 6175 in form_float_constant). The display code occupies address range 0x5EB290--0x603A00 in the binary (0x18770 bytes, roughly 98KB), with the main dispatch functions at 0x5EC600--0x5F7FD0 and formatting helpers continuing through 0x6039E0.
Activation is via the il_display CLI flag (flag index 10 in the boolean flag table), which triggers display_il_file after the frontend completes parsing. Output goes to stdout by default, routed through a replaceable function-pointer callback (qword_126F980). When active, every IL entry in every memory region is printed with labeled fields, 25-column-aligned formatting, and scope/address annotations.
Key Facts
| Property | Value |
|---|---|
| Source file | il_to_str.c (EDG 6.6) |
| Address range | 0x5EB290--0x6039E0 |
| Top-level entry point | sub_5F7DF0 (display_il_file), 56 lines |
| Header + file-scope | sub_5F76B0 (display_il_header), 174 lines |
| Main dispatcher | sub_5F4930 (display_il_entry), 1,686 lines |
| Single-entity display | sub_5F7D50 (display_single_entity), 38 lines |
| CLI flag | il_display (index 10, boolean) |
| Output callback | qword_126F980 (function pointer, default sub_5EB290 = fputs(s, stdout)) |
| Display-active flag | byte_126FA16 (set to 1 during display) |
| Scope context flag | dword_126FA30 (1 = file-scope region, 0 = function-scope) |
| Entry kind name table | off_E6DD80 (~84 entries, indexed by entry kind byte) |
Top-Level Control Flow
display_il_file (sub_5F7DF0)
│
│ printf("Display of IL file \"%s\", produced by the compilation of \"%s\"\n",
│ il_file_name, source_file_name)
│
├── display_il_header (sub_5F76B0)
│ │ dword_126FA30 = 1 // file-scope mode
│ │ puts("\n\nIntermediate language for memory region 1 (file scope):")
│ │ puts("\nil_header:")
│ │ ... 30+ header fields ...
│ │
│ └── walk_file_scope_il(display_il_entry, ...) // sub_60E4F0
│ └── display_il_entry (sub_5F4930) // callback per entity
│
└── for region = 2 .. dword_126EC80:
dword_126FA30 = 0 // function-scope mode
// lookup function name from scope table
printf("\n\nIntermediate language for memory region %ld (function \"%s\"):\n",
region, function_name)
walk_routine_scope_il(region, display_il_entry, ...) // sub_610200
└── display_il_entry (sub_5F4930) // callback per entity
Memory region 1 is always file-scope (global declarations, types, templates). Regions 2+ correspond to individual function bodies. The scope table at qword_126EB90 maps each region index to its owning scope entry; the display code checks scope.kind == 17 (sck_function) and extracts the routine name for the banner.
IL Header Fields
display_il_header (sub_5F76B0) prints the translation-unit-level metadata stored in BSS at 0x126EB60--0x126EBF8:
| Field | Type | Notes |
|---|---|---|
| primary_source_file | IL pointer | Source file entry for the main .cu file |
| primary_scope | IL pointer | File-scope scope entry |
| main_routine | IL pointer | main() routine entry, if present |
| compiler_version | string | EDG compiler version string |
| time_of_compilation | string | Build timestamp |
| plain_chars_are_signed | bool | Default char signedness |
| source_language | enum | sl_Cplusplus (0) or sl_C (1), from dword_126EBA8 |
| std_version | integer | C/C++ standard version (e.g., 201703 for C++17), from dword_126EBAC |
| pcc_compatibility_mode | bool | PCC compatibility |
| enum_type_is_integral | bool | Whether enum underlying type is integral |
| default_max_member_alignment | integer | Default structure packing alignment |
| gcc_mode | bool | GCC compatibility mode |
| gpp_mode | bool | G++ compatibility mode |
| gnu_version | integer | GNU compatibility version number |
| short_enums | bool | -fshort-enums behavior |
| default_nocommon | bool | Default -fno-common |
| UCN_identifiers_used | bool | Universal character names in identifiers |
| vla_used | bool | Variable-length arrays present |
| any_templates_seen | bool | Whether any templates were parsed |
| prototype_instantiations_in_il | bool | Template prototype instantiations included |
| il_has_all_prototype_instantiations | bool | All prototypes included (ALL_TEMPLATE_INFO_IN_IL=1) |
| il_has_C_semantics | bool | C-language semantics active |
| nontag_types_used_in_exception_or_rtti | bool | Non-tag types in EH/RTTI |
| seq_number_lookup_entries | integer | Count of source sequence entries |
| target_configuration_index | integer | Target configuration selector |
After printing the header, display_il_header calls walk_file_scope_il (sub_60E4F0) with display_il_entry as the per-entity callback. This iterates every IL entry in file-scope region 1.
The Main Dispatcher: display_il_entry
display_il_entry (sub_5F4930, 1,686 lines) is the central per-entity display function. It receives an entry pointer and an entry kind byte, and dispatches to the appropriate per-kind display function.
Transparent (Inline) Kinds
The first switch handles kinds that are displayed inline by their parent and should not appear as standalone entries. These return immediately without output:
Transparent kinds (early return):
4 routine_type_supplement
5 routine_type_extra
14 (reserved)
15 (reserved)
19 try_supplement
20 asm_supplement
27 template_parameter_suppl
34 (reserved)
37 (reserved)
38 (reserved)
46 (reserved)
47 (reserved)
48 (reserved)
75 (reserved)
76 (reserved)
77 (reserved)
78 (reserved)
81 (reserved)
Entry Header Line
For non-transparent kinds, the dispatcher prints a scope-annotated header:
file-scope type@7f3a4b200100
func-scope variable@7f3a4b300200
The scope prefix comes from dword_126FA30 (1 = "file-scope", 0 = "func-scope"). The kind name is looked up from off_E6DD80[kind_byte]. The address is the raw entry pointer value. For entries in function-scope regions while dword_126FA30 == 1, a warning "**NON FILE SCOPE PTR**" is emitted.
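The header line can be reproduced with a short formatter. This is a sketch only: format_entry_header is a hypothetical name, the kind name is passed in directly instead of being read from off_E6DD80, and the "%s %s@%llx" layout is inferred from the two sample lines above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "<scope> <kind>@<hex address>"; file_scope_mode plays the role
   of dword_126FA30 (1 = file-scope, 0 = func-scope). */
static int format_entry_header(char *buf, size_t n, int file_scope_mode,
                               const char *kind_name, unsigned long long addr)
{
    return snprintf(buf, n, "%s %s@%llx",
                    file_scope_mode ? "file-scope" : "func-scope",
                    kind_name, addr);
}
```

Passing mode 1, the kind name "type", and the address 0x7f3a4b200100 yields the first sample line shown above.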
Dispatch Table
The second switch dispatches to specialized display functions:
| Kind | Hex | Name | Display function | Lines |
|---|---|---|---|---|
| 1 | 0x01 | source_file_entry | inline in dispatcher | ~40 |
| 2 | 0x02 | constant | sub_5F2720 (display_constant) | 605 |
| 3 | 0x03 | param_type | inline in dispatcher | ~30 |
| 6 | 0x06 | type | sub_5F06B0 (display_type) | 1,033 |
| 7 | 0x07 | variable | sub_5EE500 (display_variable) | 614 |
| 8 | 0x08 | field | inline in dispatcher | ~80 |
| 9 | 0x09 | exception_specification | inline | ~20 |
| 10 | 0x0A | exception_spec_type | inline | ~10 |
| 11 | 0x0B | routine | sub_5EF1A0 (display_routine) | 1,160 |
| 12 | 0x0C | label | inline | ~30 |
| 13 | 0x0D | expr_node | sub_5ECFE0 (display_expr_node) | 534 |
| 16 | 0x10 | switch_case_entry | inline | ~15 |
| 17 | 0x11 | switch_info | inline | ~10 |
| 18 | 0x12 | handler | inline | ~15 |
| 21 | 0x15 | statement | sub_5EC600 (display_statement) | 328 |
| 22 | 0x16 | object_lifetime | inline | ~20 |
| 23 | 0x17 | scope | sub_5F2140 (display_scope) | 177 |
| 28 | 0x1C | namespace | inline | ~20 |
| 29 | 0x1D | using_declaration | inline | ~20 |
| 30 | 0x1E | dynamic_init | sub_5F37F0 (display_dynamic_init) | 248 |
| 31 | 0x1F | local_static_variable_init | inline | ~15 |
| 32 | 0x20 | vla_dimension | inline | ~10 |
| 33 | 0x21 | overriding_virtual_func | inline | ~15 |
| 35 | 0x23 | derivation_path | inline | ~10 |
| 36 | 0x24 | base_class | inline | ~25 |
| 39 | 0x27 | class_info | sub_5F4030 (display_class_supplement) | 366 |
| 41 | 0x29 | constructor_init | inline | ~15 |
| 42 | 0x2A | asm_entry | inline | ~25 |
| 43 | 0x2B | asm_operand | inline | ~15 |
| 44 | 0x2C | asm_clobber | inline | ~10 |
| 50 | 0x32 | source_sequence_entry | inline | ~15 |
| 51 | 0x33 | full_entity_decl_info | inline | ~15 |
| 52 | 0x34 | instantiation_directive | inline | ~10 |
| 53 | 0x35 | src_seq_sublist | inline | ~10 |
| 54 | 0x36 | explicit_instantiation_decl | inline | ~10 |
| 55 | 0x37 | orphaned_entities | inline | ~10 |
| 56 | 0x38 | hidden_name | inline | ~10 |
| 57 | 0x39 | pragma | inline | ~20 |
| 58 | 0x3A | template | inline | ~20 |
| 59 | 0x3B | template_decl | inline | ~15 |
| 60 | 0x3C | requires_clause | inline | ~10 |
| 61 | 0x3D | template_param | inline | ~15 |
| 62 | 0x3E | name_reference | sub_5EBC60 (display_name_reference) | 84 |
| 63 | 0x3F | name_qualifier | inline | ~15 |
| 64 | 0x40 | seq_number_lookup | inline | ~10 |
| 65 | 0x41 | local_expr_node_ref | inline | ~10 |
| 66 | 0x42 | static_assert | inline | ~10 |
| 67 | 0x43 | linkage_spec | inline | ~10 |
| 68 | 0x44 | scope_ref | inline | ~10 |
| 70 | 0x46 | lambda | inline | ~15 |
| 71 | 0x47 | lambda_capture | inline | ~15 |
| 72 | 0x48 | attribute | inline | ~20 |
| 73 | 0x49 | attribute_argument | inline | ~10 |
| 74 | 0x4A | attribute_group | inline | ~10 |
| 79 | 0x4F | template_info | inline | ~15 |
| 80 | 0x50 | subobject_path | inline | ~10 |
| 82 | 0x52 | module_info | inline | ~10 |
| 83 | 0x53 | module_decl | inline | ~10 |
Per-Kind Display Functions
source_file_entry (Kind 1)
Displayed inline in the dispatcher. Fields:
| Field | Type | Notes |
|---|---|---|
| file_name | string | Short file name |
| full_name | string | Full path |
| name_as_written | string | As-written in #include |
| first_seq_number | integer | First source sequence number in this file |
| last_seq_number | integer | Last source sequence number |
| first_line_number | integer | First line number |
| child_files | IL pointer list | Included files |
| is_implicit_include | bool | Implicitly included |
| is_include_file | bool | Is an #included file (not the primary TU) |
| top_level_file | bool | Top-level compilation unit |
source_corresp (Shared Prefix)
All named entities (variable, routine, type, field, label, namespace, template_param) share a source_corresp sub-record, printed by display_source_corresp (sub_5EDF40, 170 lines). This is the first thing displayed for each such entity:
source_corresp:
name: foo
unmangled_name_or_mangled_encoding: _Z3foov
decl_position.seq: 42
decl_position.column: 5
name_references: name_reference@7f3a...
is_class_member: TRUE
access: public
parent_scope: file-scope scope@7f3a...
enclosing_routine: NULL
referenced: TRUE
needed: TRUE
name_linkage: external
Fields displayed by display_source_corresp:
| Field | Type | Lookup table |
|---|---|---|
| name | string | Direct string |
| unmangled_name_or_mangled_encoding | string | Direct string |
| decl_position | position | seq + column sub-fields |
| name_references | IL pointer | name_reference entry |
| is_class_member | bool | -- |
| access | enum | off_A6F760 (4 entries: public/protected/private/none) |
| parent_scope | IL pointer | Scope entry |
| enclosing_routine | IL pointer | Routine entry |
| referenced | bool | -- |
| needed | bool | -- |
| is_local_to_function | bool | -- |
| parent_via_local_scope_ref | IL pointer | -- |
| name_linkage | enum | off_E6E040 (none/internal/external/C/C++) |
| has_associated_pragma | bool | -- |
| is_decl_after_first_in_comma_list | bool | -- |
| copied_from_secondary_trans_unit | bool | -- |
| same_name_as_external_entity_in_secondary_trans_unit | bool | -- |
| member_of_unknown_base | bool | -- |
| qualified_unknown_base_member | bool | -- |
| marked_as_gnu_extension | bool | -- |
| is_deprecated_or_unavailable | bool | -- |
| externalized | bool | -- |
| maybe_unused | bool | [[maybe_unused]] attribute |
| source_sequence_entry | IL pointer | -- |
| attributes | IL pointer | Attribute list |
type (Kind 6)
display_type (sub_5F06B0, 1,033 lines) handles all 22 type kinds. After calling display_source_corresp, it prints common type fields then switches on the type kind byte at offset +132:
Common type fields:
| Field | Type / lookup table |
|---|---|
| next | IL pointer |
| based_types | Linked list, kind from off_A6F420 (6 entries) |
| size | Integer |
| alignment | Integer |
| incomplete | bool |
| used_in_exception_or_rtti | bool |
| declared_in_function_prototype | bool |
| alignment_set_explicitly | bool |
| variables_are_implicitly_referenced | bool |
| may_alias | bool |
| autonomous_primary_tag_decl | bool |
| is_builtin_va_list | bool |
| is_builtin_va_list_from_cstdarg | bool |
| has_gnu_abi_tag_attribute | bool |
| in_gnu_abi_tag_namespace | bool |
| type_kind | Enum from off_A6FE40 (22 entries) |
Type kind switch (offset +132):
| Kind | Name | Key sub-fields |
|---|---|---|
| 2 | integer | int_kind (via sub_5F9110), explicitly_signed, wchar_t_type, char8_t_type, char16_t_type, char32_t_type, bool_type; for enums: is_scoped_enum, packed, originally_unnamed, is_template_enum, ELF_visibility, base_type, assoc_template |
| 3/4/5 | float/double/ldouble | float_kind (via sub_5F93D0) |
| 6 | pointer | type_pointed_to, is_reference, is_rvalue_reference |
| 7 | function | return_type, param_type_list, assoc_routine, has_ellipsis, prototyped, trailing_return_type, value_returned_by_cctor, does_not_return, result_should_be_used, is_const, explicit_calling_convention, calling_convention (from off_E6CDA0), this_class, qualifiers, ref_qualifiers, prototype_scope, exception_specification |
| 8 | array | element_type, qualifiers, is_static, is_variable_size_array, is_vla, element_count, bound_constant |
| 9/10/11 | class/struct/union | field_list, extra_info (class supplement via sub_5F4030), final, abstract, any_virtual_base_classes, any_virtual_functions, originally_unnamed, is_template_class, is_specialized, is_empty_class, is_packed, max_member_alignment |
| 12 | typeref | typeref_type, template_arg_list, assoc_template, typeref_kind (from off_A6F640, 28 entries), qualifiers, predeclared, has_variably_modified_type, is_nonreal |
| 13 | member pointer | class_of_which_a_member, type |
| 14 | template param | kind (tptk_param/tptk_member/tptk_unknown), is_pack, is_generic_param, is_auto_param, class_type, coordinates |
| 15 | vector | element_type, size_constant, is_boolean_vector, vector_kind |
| 16 | tuple | element_type, tuple_elements |
variable (Kind 7)
display_variable (sub_5EE500, 614 lines) is one of the most field-heavy display functions. After display_source_corresp, it prints:
| Field | Lookup table / Notes |
|---|---|
| next | IL pointer |
| type | IL pointer |
| storage_class | off_A6FE00 (7 entries: none/auto/register/static/extern/mutable/thread_local) |
| declared_storage_class | Same table |
| asm_name or reg | off_A6F480 (53 register kind entries) |
| alignment | Integer |
| ELF_visibility | off_A6F720 (5 entries) |
| init_priority | Integer |
| cleanup_routine | IL pointer |
| container / bindings | Selected by bits at offset +162 |
| section | String (ELF section name) |
| aliased_variable | IL pointer |
| declared_type | IL pointer |
| template_info | IL pointer |
CUDA-specific variable fields:
| Field | Notes |
|---|---|
| shared | __shared__ memory space |
| constant | __constant__ memory space |
| device | __device__ memory space |
Boolean flags (approximately 50 flags spanning bytes 144--208):
is_weak, is_weakref, is_gnu_alias, has_gnu_used_attribute, has_gnu_abi_tag_attribute, is_not_common, is_common, has_internal_linkage_attribute, asm_name_is_valid, used, address_taken, is_parameter, is_parameter_pack, is_pack_element, is_enhanced_for_iterator, initializer_in_class, constant_valued, is_thread_local, extends_lifetime, is_template_param_object, compiler_generated, is_in_class_specialization, is_handler_param, is_this_parameter, referenced_non_locally, modified_within_try_block, is_template_variable, is_prototype_instantiation, is_nonreal, is_specialized, specialized_with_old_syntax, explicit_instantiation, class_explicitly_instantiated, explicit_do_not_instantiate, param_value_has_been_changed, param_used_more_than_once, is_anonymous_parent_object, is_member_constant, is_constexpr, declared_constinit, is_inline, suppress_inline_definition, superseded_external, has_variably_modified_type, is_vla, is_compound_literal, has_explicit_initializer, has_parenthesized_initializer, has_direct_braced_initializer, has_flexible_array_initializer, declared_with_auto_type_specifier, declared_with_decltype_auto, declared_with_class_template_placeholder
routine (Kind 11)
display_routine (sub_5EF1A0, ~1,160 lines) is the single largest per-kind display function. After display_source_corresp:
| Field | Lookup table / Notes |
|---|---|
| next | IL pointer |
| type | IL pointer (function type) |
| function_def_number | Integer |
| memory_region | Integer (region index for function body) |
| storage_class | off_A6FE00 (7 entries) |
| declared_storage_class | Same table |
| special_kind | off_A6FC00 (13 entries: none/constructor/destructor/conversion/operator/lambda_call_operator/...) |
| opname_kind | off_A6FC80 (47 entries) |
| builtin_function_kind | Integer |
| ELF_visibility | off_A6F720 |
| virtual_function_number | Integer |
| constexpr_intrinsic_number | Integer |
| section | String |
| aliased_routine | IL pointer |
| inline_partner | IL pointer |
| ctor_priority / dtor_priority | Integer |
| asm_name | String |
| declared_type | IL pointer |
| generating_using_decl | IL pointer |
| befriending_classes | IL pointer list |
| assoc_template | IL pointer |
| template_arg_list | Via display_template_arg_list |
CUDA-specific routine flags (bytes 182--183):
| Flag | Byte | Bit | Meaning |
|---|---|---|---|
| nvvm_intrinsic | 182 | 4 | NVVM intrinsic function |
| device | 182 | 5 | __device__ execution space |
| global | 182 | 6 | __global__ execution space |
| host | 183 | 4 | __host__ execution space |
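The bit positions in the table above can be read back with a few shift-and-mask tests. The following is a minimal sketch, not decompiled code: the struct `cuda_routine_flags` and the function `decode_cuda_routine_flags` are hypothetical names, and the sketch assumes that bit 0 is the least significant bit of each byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the CUDA bits in a routine entry. */
struct cuda_routine_flags {
    bool nvvm_intrinsic;
    bool device;
    bool global;
    bool host;
};

/* Decode the CUDA flags from a raw routine entry.
 * Offsets 182/183 and the bit numbers come from the table above;
 * treating bit 0 as the least significant bit is an assumption. */
static struct cuda_routine_flags
decode_cuda_routine_flags(const unsigned char *routine_entry)
{
    assert(routine_entry != NULL);
    unsigned char b182 = routine_entry[182];
    unsigned char b183 = routine_entry[183];
    struct cuda_routine_flags f;
    f.nvvm_intrinsic = (b182 >> 4) & 1;
    f.device         = (b182 >> 5) & 1;
    f.global         = (b182 >> 6) & 1;
    f.host           = (b183 >> 4) & 1;
    return f;
}
```

Under these assumptions, a routine declared `__host__ __device__` would decode with both `host` and `device` set.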
C99-specific fields (displayed when dword_126EBA8 == 1 and std_version > 199900):
fp_contract, fenv_access, cx_limited_range -- pragma state values from off_A6F460 (4 entries).
Boolean flags (approximately 70 flags spanning bytes 176--191):
address_taken, is_virtual, overrides_base_member, pure_virtual, final, override, covariant_return_virtual_override, is_inline, is_declared_constexpr, is_constexpr, is_constexpr_intrinsic, compiler_generated, defined, called, is_explicit_constructor, is_explicit_conversion_function, is_trivial_default_constructor, is_trivial_copy_function, is_trivial_destructor, is_initializer_list_ctor, is_delegating_ctor, is_inheriting_ctor, assignment_to_this_done, is_prototype_instantiation, is_template_function, is_specialized, specialized_with_old_syntax, explicit_instantiation, class_explicitly_instantiated, explicit_do_not_instantiate, has_nodiscard_attribute, never_throws, is_in_class_specialization, never_inline, is_pure, is_initialization_routine, is_finalization_routine, is_weak, is_weakref, is_gnu_alias, is_ifunc, has_gnu_used_attribute, has_gnu_abi_tag_attribute, in_gnu_abi_tag_namespace, allocates_memory, no_instrument_function, no_check_memory_usage, always_inline, gnu_c89_inline, implicit_alias, has_internal_linkage_attribute, contains_try_block, contains_local_class_type, superseded_external, defined_in_friend_decl, contains_statement_expression, inline_in_class_definition, is_lambda_body, is_defaulted, is_deleted, contains_local_static_variable, is_raw_literal_operator, is_tls_init_routine, has_deducible_return_type, has_deduced_return_type, contains_generic_lambda, is_coroutine, is_top_level_in_mem_region, friend_defined_in_instantiation, is_ineligible, definition_needed, defined_outside_of_parent, trailing_requires_clause
expr_node (Kind 13)
display_expr_node (sub_5ECFE0, 534 lines) handles 36 expression node kinds. Common expression fields are printed first:
| Field | Notes |
|---|---|
| type | IL pointer (expression type) |
| orig_lvalue_type | IL pointer |
| next | IL pointer |
| is_lvalue | bool |
| is_xvalue | bool |
| result_is_not_used | bool |
| is_pack_expansion | bool |
| is_parenthesized | bool |
| compiler_generated | bool |
| volatile_fetch | bool |
| do_not_interpret | bool |
| type_definition_needed | bool |
Expression kind switch (offset +24):
| Kind | Name | Key sub-fields |
|---|---|---|
| 0 | enk_error | (none) |
| 1 | enk_operation | operation.kind from off_A6F840 (120 operator kinds), operation.type_kind from off_A6FE40 (22 type kinds), 20+ boolean flags for cast semantics, ADL suppression, virtual call properties, evaluation order |
| 2 | enk_constant | Constant value reference |
| 3 | enk_variable | Variable reference |
| 4 | enk_field | Field access |
| 5 | enk_temp_init | Temporary initialization |
| 6 | enk_lambda | Lambda expression |
| 7 | enk_new_delete | is_new, placement_new, aligned_version, array_delete, global_new_or_delete, deducible_type, type, routine, arg, dynamic_init, number_of_elements |
| 8 | enk_throw | Throw expression |
| 9 | enk_condition | Conditional expression |
| 10 | enk_object_lifetime | Object lifetime tracking |
| 11 | enk_typeid | typeid expression |
| 12 | enk_sizeof | sizeof expression |
| 13 | enk_sizeof_pack | sizeof... (pack) |
| 14 | enk_alignof | alignof expression |
| 15 | enk_datasizeof | __datasizeof |
| 16 | enk_address_of_ellipsis | Address of ... |
| 17 | enk_statement | Statement expression |
| 18 | enk_reuse_value | Value reuse |
| 19 | enk_routine | Function reference |
| 20 | enk_type_operand | Type operand |
| 21 | enk_builtin_operation | Built-in op from off_E6C5A0 |
| 22 | enk_param_ref | Parameter reference |
| 23 | enk_braced_init_list | Braced initializer |
| 24 | enk_c11_generic | _Generic selection |
| 25 | enk_builtin_choose_expr | __builtin_choose_expr |
| 26 | enk_yield | co_yield |
| 27 | enk_await | co_await |
| 28 | enk_fold | Fold expression |
| 29 | enk_initializer | Initializer |
| 30 | enk_concept_id | Concept ID |
| 31 | enk_requires | requires expression |
| 32 | enk_compound_req | Compound requirement |
| 33 | enk_nested_req | Nested requirement |
| 34 | enk_const_eval_deferred | Consteval deferred |
| 35 | enk_template_name | Template name |
Every expression case ends with dump_source_position("position", ...) to record the source location.
statement (Kind 21)
display_statement (sub_5EC600, 328 lines) handles 26 statement kinds. Common fields first:
| Field | Notes |
|---|---|
| position | Source position |
| next | IL pointer |
| parent | IL pointer (enclosing scope/block) |
| attributes | IL pointer |
| has_associated_pragma | bool |
| is_initialization_guard | bool |
| is_lowering_boilerplate | bool |
| is_fallthrough_statement | bool |
| is_likely | bool |
| is_unlikely | bool |
Statement kind switch (offset +32):
| Kind | Name | Key sub-fields |
|---|---|---|
| 0 | stmk_expr | Expression statement |
| 1 | stmk_if | if |
| 2 | stmk_constexpr_if | if constexpr |
| 3 | stmk_if_consteval | if consteval (C++23) |
| 4 | stmk_if_not_consteval | if !consteval |
| 5 | stmk_while | while loop |
| 6 | stmk_goto | goto |
| 7 | stmk_label | label |
| 8 | stmk_return | return |
| 9 | stmk_coroutine | Coroutine body (see below) |
| 10 | stmk_coroutine_return | Coroutine return |
| 11 | stmk_block | Block/compound: statements, final_position, assoc_scope, lifetime, end_of_block_reachable, is_statement_expression |
| 12 | stmk_end_test_while | do-while |
| 13 | stmk_for | for loop |
| 14 | stmk_range_based_for | Range-for: iterator, range, begin, end, ne_call_expr, incr_call_expr |
| 15 | stmk_switch_case | switch case |
| 16 | stmk_switch | switch |
| 17 | stmk_init | Initialization |
| 18 | stmk_asm | Inline assembly |
| 19 | stmk_try_block | try block |
| 20 | stmk_decl | Declaration |
| 21 | stmk_set_vla_size | VLA size |
| 22 | stmk_vla_decl | VLA declaration |
| 23 | stmk_assigned_goto | Computed goto |
| 24 | stmk_empty | Empty statement |
| 25 | stmk_stmt_expr_result | Statement expression result |
Coroutine statement (case 9) displays the full C++20 coroutine lowering structure:
traits, handle, promise, init_await_resume, this_param_copy,
paramter_copies, final_suspend_label, initial_suspend_call,
final_suspend_call, unhandled_exception_call, get_return_object_call,
new_routine, delete_routine, ...
The field name "paramter_copies" (missing the 'e' in "parameter") is a typo preserved verbatim from the EDG source. This confirms the display strings originate from Edison Design Group's own il_to_str.c -- a reimplementation would spell it correctly.
scope (Kind 23)
display_scope (sub_5F2140, 177 lines) handles 9 scope kinds:
| Kind | Name | Extra fields |
|---|---|---|
| 0 | sck_file | Top-level file scope |
| 1 | sck_func_prototype | Function prototype scope |
| 2 | sck_block | assoc_handler |
| 3 | sck_namespace | assoc_namespace |
| 6 | sck_class_struct_union | assoc_type |
| 8 | sck_template_declaration | Template declaration scope |
| 15 | sck_condition | assoc_statement |
| 16 | sck_enum | assoc_type |
| 17 | sck_function | routine.ptr, parameters, constructor_inits, lifetime_of_local_static_vars, this_param_variable, return_value_variable |
Common scope fields: next, parent, kind
Boolean flags: do_not_free_memory_region, is_constexpr_routine, is_stmt_expr_block, is_placeholder_scope, needed_walk_done
Child entity lists: assoc_block, lifetime, constants, types, variables, nonstatic_variables, labels, routines, asm_entries, scopes
Conditional lists (controlled by bitmask tests on scope kind):
// Bitmask 0x20044 = bits 2+6+17 = sck_block + sck_class_struct_union + sck_function
// Bitmask 0x9 = bits 0+3 = sck_file + sck_namespace
if ((1LL << kind) & 0x20044) {
// display: namespaces, using_declarations, using_directives
}
if ((1LL << kind) & 0x9) {
// display: namespaces, using_declarations, using_directives
}
// Also: dynamic_inits, local_static_variable_inits (function/block scopes)
// expr_node_refs, scope_refs, vla_dimensions (function scope + C mode)
// pragmas, hidden_names, templates, source_sequence_list, src_seq_sublist_list
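The bitmask tests above are a compact way of asking whether a scope kind belongs to a small set. As a hedged illustration, the helper below (`scope_kind_in_mask`, a hypothetical name) reproduces the test with the kind values from the scope kind table; the range check is an addition of this sketch so the shift stays well defined for unexpected values.

```c
#include <assert.h>
#include <stdbool.h>

/* Scope kind values taken from the scope kind table. */
enum {
    SCK_FILE = 0, SCK_FUNC_PROTOTYPE = 1, SCK_BLOCK = 2, SCK_NAMESPACE = 3,
    SCK_CLASS_STRUCT_UNION = 6, SCK_TEMPLATE_DECLARATION = 8,
    SCK_CONDITION = 15, SCK_ENUM = 16, SCK_FUNCTION = 17
};

/* True when `kind` is one of the bits set in `mask`.
 * The range check keeps the shift defined for out-of-range kinds. */
static bool scope_kind_in_mask(unsigned kind, unsigned long long mask)
{
    return kind < 64 && ((1ULL << kind) & mask) != 0;
}
```

With this helper, `scope_kind_in_mask(kind, 0x20044)` holds for block, class/struct/union, and function scopes, and `scope_kind_in_mask(kind, 0x9)` holds for file and namespace scopes.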
constant (Kind 2)
display_constant (sub_5F2720, 605 lines) handles 16 constant kinds. After display_source_corresp, common fields include next, type, orig_type, expr, and approximately 25 boolean flags.
Constant kind switch (offset +148):
| Kind | Name | Key sub-fields |
|---|---|---|
| 0 | ck_error | (none) |
| 1 | ck_integer | Integer value via sub_602F20 |
| 2 | ck_string | character_kind (char/wchar_t/char8_t/char16_t/char32_t), length, literal_kind (see below) |
| 3 | ck_float | Float value via sub_5FCAF0 |
| 4 | ck_complex | Complex value |
| 5 | ck_imaginary | Imaginary value |
| 6 | ck_address | Sub-kind: abk_routine/variable/constant/temporary/uuidof/typeid/label; subobject_path, offset |
| 7 | ck_ptr_to_member | casting_base_class, name_reference, cast_to_base, is_function_ptr |
| 8 | ck_label_difference | from_address, to_address |
| 9 | ck_dynamic_init | dynamic_init pointer |
| 10 | ck_aggregate | first_constant, last_constant, has_dynamic_init_component |
| 11 | ck_init_repeat | constant, count, multidimensional_aggr_tail_not_repeated |
| 12 | ck_template_param | Sub-kinds: tpck_param/expression/member/unknown_function/address/sizeof/datasizeof/alignof/uuidof/typeid/noexcept/template_ref/integer_pack/destructor |
| 13 | ck_designator | is_field_designator, is_generic, uses_direct_init_syntax |
| 14 | ck_void | (none) |
| 15 | ck_reflection | entity, local_scope_number |
dynamic_init (Kind 30)
display_dynamic_init (sub_5F37F0, 248 lines) handles 9 dynamic initialization kinds:
| Kind | Name | Key sub-fields |
|---|---|---|
| 0 | dik_none | (none) |
| 1 | dik_zero | Zero-initialization |
| 2 | dik_constant | Constant initialization |
| 3 | dik_expression | Expression initialization |
| 4 | dik_class_result_via_ctor | Class result through constructor |
| 5 | dik_constructor | routine, args, is_copy_constructor_with_implied_source, is_implicit_copy_for_copy_initialization, value_initialization |
| 6 | dik_nonconstant_aggregate | Non-constant aggregate |
| 7 | dik_bitwise_copy | source |
| 8 | dik_lambda | lambda, constant, non_constant |
Common fields: next, variable, destructor, lifetime, next_in_destruction_list, unordered, init_expr_lifetime, and approximately 20 boolean flags including static_temp, follows_an_exec_statement, inside_conditional_expression, has_temporary_lifetime, is_constructor_init, is_freeing_of_storage_on_exception, overlaps_temps_in_inner_lifetime, is_reused_value, is_creation_of_initializer_list_object, master_entry.
class_info (Kind 39)
display_class_type_supplement (sub_5F4030, 366 lines) is not dispatched directly from the kind table but called by display_type when the type kind is class/struct/union (kinds 9/10/11). It prints the class supplement record:
| Field | Notes |
|---|---|
| base_classes | IL pointer list |
| direct_base_classes | IL pointer list |
| preorder_base_classes | IL pointer list |
| primary_base_class | IL pointer |
| size_without_virtual_base_classes | Integer |
| alignment_without_virtual_base_classes | Integer |
| highest_virtual_function_number | Integer |
| virtual_function_info_offset | Integer |
| virtual_function_info_base_class | IL pointer |
| ELF_visibility | off_A6F720 |
| is_lambda_closure_class | bool |
| is_generic_lambda_closure_class | bool |
| has_lambda_conversion_function | bool |
| is_initializer_list | bool |
| has_initializer_list_ctor | bool |
| has_anonymous_union_member | bool |
| anonymous_union_kind | enum (auk_none/auk_variable/auk_field) |
| is_va_list_tag | bool |
| has_nodiscard_attribute | bool |
| has_field_initializer | bool |
| removed_from_il | bool |
| contains_error | bool |
| befriending_classes | Linked list (checks kind bytes 9/10/11 for class/struct/union) |
| friend_routines | IL pointer list |
| friend_classes | IL pointer list |
| assoc_scope | IL pointer |
| assoc_template | IL pointer |
| template_arg_list | Via display_template_arg_list |
| lambda_parent.variable / .field / .routine | Selected by bits in byte 86 |
| proxy_of_type | IL pointer |
Formatting Infrastructure
25-Column Field Labels
dump_field_label (sub_5EB2A0, 22 lines) is the universal field label formatter. It prints "field_name:" then pads with spaces to column 25. If the label plus colon exceeds 24 characters, it prints a newline first to avoid misalignment:
storage_class:           static
alignment:               16
is_constexpr:            TRUE
This produces the consistent columnar output visible in all IL dumps.
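The padding rule can be sketched in a few lines of C. This is an illustration under stated assumptions rather than the decompiled routine: `format_field_label` is a hypothetical name, it writes into a caller-supplied buffer instead of going through the output callback, it treats "column 25" as a zero-based offset, and it reads the overflow rule as "break the line after the label, then indent the value position on the next line".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define VALUE_COLUMN 25

/* Format "label:" followed by padding so the value starts at offset 25.
 * If "label:" is longer than 24 characters, break the line and indent
 * the continuation line instead (one reading of the overflow rule).
 * Returns the number of characters written, or -1 if `out` is too small. */
static int format_field_label(char *out, size_t cap, const char *label)
{
    size_t len = strlen(label) + 1;                    /* label plus ':' */
    int overflow = len > (size_t)(VALUE_COLUMN - 1);
    int pad = overflow ? VALUE_COLUMN : (int)(VALUE_COLUMN - len);
    int n = snprintf(out, cap, "%s:%s%*s",
                     label, overflow ? "\n" : "", pad, "");
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

A caller would append the value text directly after the returned prefix, which is what keeps the values in a single column regardless of label length.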
Boolean Fields
dump_field_bool (sub_5EB450, 25 lines) prints a label and "TRUE" or "FALSE":
is_virtual: TRUE
pure_virtual: FALSE
Source Position Fields
dump_source_position (sub_5EB4E0, 82 lines) prints position as two sub-fields when the position is non-zero (seq != 0 or column != 0):
position.seq: 42
position.column: 5
The function reads a 32-bit sequence number at *position and a 16-bit column at *(position + 4).
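The layout described above (32-bit sequence number at offset 0, 16-bit column at offset 4) can be decoded as follows. This is a sketch only: `src_pos`, `read_position`, and `format_position` are hypothetical names, host byte order is assumed, and the column padding applied by dump_field_label is left out for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical decoded form of a source position. */
struct src_pos { uint32_t seq; uint16_t column; };

/* Read seq (32-bit, offset 0) and column (16-bit, offset 4) from raw bytes.
 * memcpy avoids alignment and aliasing problems; host byte order assumed. */
static struct src_pos read_position(const unsigned char *position)
{
    struct src_pos p;
    memcpy(&p.seq, position, sizeof p.seq);
    memcpy(&p.column, position + 4, sizeof p.column);
    return p;
}

/* Format the two sub-fields, or nothing when the position is all zero.
 * Returns the number of characters written. */
static int format_position(char *out, size_t cap, const char *label,
                           const unsigned char *position)
{
    struct src_pos p = read_position(position);
    if (p.seq == 0 && p.column == 0) {
        if (cap > 0)
            out[0] = '\0';
        return 0;
    }
    return snprintf(out, cap, "%s.seq: %u\n%s.column: %u\n",
                    label, (unsigned)p.seq, label, (unsigned)p.column);
}
```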
IL Pointer Annotations
dump_il_entity_pointer (sub_5EB8B0, 99 lines) is the most comprehensive pointer formatter. For each IL entity pointer, it prints:
- Scope prefix: "file-scope" or "func-scope" (from bit 0 of the entry prefix byte at entry_ptr - 8)
- Kind name: from off_E6DD80[kind_byte]
- Hex address: @%lx
- Entity name (kind-dependent):
  - Kinds with name at offset +8 (bitmask 0x2000000010001984): prints the name string
  - Kind 12 (label): prints "label " prefix + name
  - Kind 6 (type): calls qualified name formatter
  - Kind 2 (constant): calls type display
  - Kind 0x40 (seq_number_lookup): prints qualified name from offset +0
  - Kind with bit 36 set: prints qualified name from offset +40, plus "in" context from +56
primary_source_file: file-scope source_file_entry@7f3a4b100020 "test.cu"
main_routine: file-scope routine@7f3a4b200100 "main"
The variant dump_il_string_pointer (sub_5EB670) prints the same format but includes the string value from the pointed-to entry. A scope mismatch (e.g., function-scope pointer found during file-scope display) triggers a "**NON FILE SCOPE PTR**" warning.
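Two of the tests used by the pointer formatter are easy to express directly: the bitmask that selects kinds whose name sits at offset +8, and the scope prefix read from the byte 8 bytes before the entry. The sketch below uses hypothetical helper names, and which value of bit 0 means "file-scope" is an assumption of the sketch, not something established above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NAME_AT_OFFSET_8_MASK 0x2000000010001984ULL

/* True for entry kinds whose bit is set in the mask quoted above
 * (bits 2, 7, 8, 11, 12, 28 and 61). */
static bool kind_has_name_at_offset_8(unsigned kind)
{
    return kind < 64 && ((NAME_AT_OFFSET_8_MASK >> kind) & 1ULL) != 0;
}

/* Pick the scope prefix from bit 0 of the prefix byte at entry_ptr - 8.
 * Mapping a set bit to "file-scope" is an assumption in this sketch. */
static const char *scope_prefix(const unsigned char *entry_ptr)
{
    unsigned char prefix = *(entry_ptr - 8);
    return (prefix & 1) ? "file-scope" : "func-scope";
}
```

Decoding the mask this way shows that it covers variable (7), routine (11), and template_param (61), which matches the named-entity kinds listed for source_corresp.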
Entity List Display
display_entity_list (sub_5EC450, 87 lines) walks a linked list of entity pointers and prints each with scope/kind/address annotations:
entities: file-scope variable@7f3a... "x"
func-scope variable@7f3a... "y"
It follows the next link at offset 0 of each list node until NULL.
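The traversal itself is an ordinary singly linked list walk. A minimal sketch follows; only the `next` link at offset 0 comes from the description above, while the `entity` payload slot, the visitor type, and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node: only the link at offset 0 comes from the text. */
struct entity_list_node {
    struct entity_list_node *next;   /* offset 0 */
    void *entity;                    /* assumed payload slot */
};

typedef void (*entity_visitor)(void *entity, void *ctx);

/* Visit every node until the NULL terminator; returns the node count. */
static size_t walk_entity_list(const struct entity_list_node *head,
                               entity_visitor visit, void *ctx)
{
    size_t count = 0;
    for (const struct entity_list_node *n = head; n != NULL; n = n->next) {
        if (visit != NULL)
            visit(n->entity, ctx);
        count++;
    }
    return count;
}
```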
String Literal Display
dump_string_value (sub_5EB300, 41 lines) prints string values with proper escape handling:
- NULL pointers print "NULL"
- Non-printable characters are printed as octal escapes (\OOO)
- Backslash and double-quote are backslash-escaped (\\, \")
- The octal mask width is controlled by dword_126E49C (CHAR_BIT equivalent, typically 8)
file_name: "test.cu"
full_name: "/home/user/project/test.cu"
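The escaping rules listed above can be sketched as a small formatter. This is an approximation under stated assumptions, not the decompiled routine: `format_string_value` is a hypothetical name, it writes to a buffer rather than the output callback, it assumes 8-bit characters (so every escape has three octal digits), it treats bytes outside 0x20..0x7e as non-printable, and it emits NULL without surrounding quotes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write an escaped, double-quoted copy of `s` into `out`.
 * NULL prints as NULL; '\\' and '"' get a backslash; bytes outside the
 * printable ASCII range become three-digit octal escapes (assumes 8-bit
 * chars). Returns the length written, or -1 if `out` is too small. */
static int format_string_value(char *out, size_t cap, const char *s)
{
    size_t n = 0;
    if (s == NULL) {
        if (cap < 5)
            return -1;
        memcpy(out, "NULL", 5);
        return 4;
    }
    if (cap < 3)
        return -1;                       /* room for two quotes and NUL */
    out[n++] = '"';
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if (n + 6 > cap)
            return -1;                   /* worst case: \OOO, quote, NUL */
        if (*p == '\\' || *p == '"') {
            out[n++] = '\\';
            out[n++] = (char)*p;
        } else if (*p < 0x20 || *p > 0x7e) {
            n += (size_t)snprintf(out + n, cap - n, "\\%03o", (unsigned)*p);
        } else {
            out[n++] = (char)*p;
        }
    }
    out[n++] = '"';
    out[n] = '\0';
    return (int)n;
}
```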
Float Constant Formatting
form_float_constant (sub_5F7FD0, 302 lines) handles float-to-string conversion with EDG-specific formatting. An assertion at line 6175 guards against buffer overflow (63-byte limit).
Float kind suffixes:
| Kind | Suffix | Type |
|---|---|---|
| 0 | (none) | double |
| 2 | f/F | float |
| 3 | f32x | extended float32 |
| 5 | f64x | extended float64 |
| 6 | l/L | long double |
| 7 | w | float128/wide |
| 8 | q | quad |
| 9 | bf16 | bfloat16 |
| 10 | f16 | float16 |
| 11 | f32 | float32 |
| 12 | f64 | float64 |
| 13 | f128 | float128 |
Special value handling:
- NaN: __builtin_nanf(""), __builtin_nan(""), etc. (when compiler version > 30299)
- Infinity: __builtin_huge_valf() or (__extension__ 0x1.0p<exp>f)
- Division form: (f/0.0f) or (f/(0,0.0f)) (C++ vs C modes, selected by dword_126E1D8/dword_126E1E8)
- User-defined literals: (funcname("string_value")) form
Data Tables Referenced
The display subsystem relies on approximately 20 enum-to-string lookup tables. The off_A6F* tables lie in the .rodata segment; the off_E6* tables lie in .data:
| Address | Name | Entries | Used by |
|---|---|---|---|
| off_A6F000 | attr_arg_kind_names | 6 | Attribute argument display |
| off_A6F040 | attr_location_names | 24 | Attribute display |
| off_A6F100 | attr_family_names | 5 | Attribute display |
| off_A6F140 | attr_kind_names | 86 | Attribute display |
| off_A6F3F0 | class_kind_labels | 3 | befriending_classes display |
| off_A6F420 | based_type_kind_names | 6 | display_type based_types |
| off_A6F460 | pragma_state_names | 4 | fp_contract/fenv_access/cx_limited_range |
| off_A6F480 | register_kind_names | 53 | display_variable reg field |
| off_A6F640 | typeref_kind_names | 28 | display_type typeref |
| off_A6F720 | elf_visibility_kind_names | 5 | ELF visibility (all entity types) |
| off_A6F760 | access_specifier_names | 4 | public/protected/private/none |
| off_A6F840 | expr_operator_kind_names | 120 | display_expr_node operations |
| off_A6FC00 | special_function_kind_names | 13 | display_routine special_kind |
| off_A6FC80 | operator_name_kind_names | 47 | display_routine opname_kind |
| off_A6FE00 | storage_class_names | 7 | Storage class (variable + routine) |
| off_A6FE40 | type_kind_names | 22 | Type kind (all type displays) |
| off_E6C5A0 | builtin_operation_names | varies | display_expr_node builtins |
| off_E6CDA0 | calling_convention_names | varies | display_type calling conventions |
| off_E6CDE0 | pragma_kind_names | varies | Pragma display |
| off_E6CF40 | asm_clobber_reg_names | varies | Asm clobber display |
| off_E6D240 | token_kind_names | varies | Fold expression / attribute_arg tokens |
| off_E6DD80 | il_entry_kind_names | ~84 | All display functions (entry kind) |
| off_E6E040 | linkage_kind_names | varies | Name linkage (source_corresp) |
All tables use the same bounds-checking pattern:
const char *name = "**BAD STORAGE CLASS**";
if ((unsigned char)value <= 6u)
name = storage_class_names[value];
puts(name);
Out-of-range values produce "**BAD <KIND>**" sentinel strings, which serve as diagnostic markers for corrupted IL.
Global State
| Address | Name | Type | Purpose |
|---|---|---|---|
| dword_126FA30 | is_file_scope_region | int | 1 during file-scope display, 0 during function-scope |
| qword_126F980 | output_callbacks | function ptr | Output function (default: sub_5EB290 = fputs(s, stdout)) |
| byte_126FA16 | display_active | byte | Set to 1 during display, prevents re-entrant calls |
| byte_126FA11 | pcc_compat_shadow | byte | Shadow of PCC compatibility mode during display |
| dword_126EBA8 | source_language | int | 0 = C++, 1 = C |
| dword_126EBAC | std_version | int | C/C++ standard version number |
| dword_126EC80 | total_region_count | int | Number of memory regions (1 = file scope only) |
| qword_126EC88 | region_table | pointer array | Region index to memory block mapping |
| qword_126EB90 | scope_table | pointer array | Region index to scope entry mapping |
| qword_126EEE0 | source_file_name | string ptr | Name of the source file being compiled |
Helper Functions (0x5F8000--0x6039E0)
The display subsystem includes approximately 50 additional helper functions in the address range beyond the main dispatchers:
| Address | Lines | Identity | Purpose |
|---|---|---|---|
| sub_5F85E0 | 78 | display_bool_field | Boolean TRUE/FALSE output |
| sub_5F8760 | 97 | display_flags_word | Flags word display |
| sub_5F8910 | 88 | display_type_qualifiers | const/volatile/restrict qualifier flags |
| sub_5F8A80 | 49 | display_storage_class | Storage class enum |
| sub_5F8BD0 | 139 | display_access_specifier | Access with indentation |
| sub_5F8DF0 | 103 | display_linkage_kind | Linkage kind enum |
| sub_5F9040 | 28 | init_output_context | Initialize display callback state |
| sub_5F9110 | 149 | display_int_type_kind | Integer type kind name |
| sub_5F93D0 | 70 | display_float_type_kind | Float type kind name |
| sub_5F9500 | 70 | display_int_type_size | Integer type size name |
| sub_5F9650 | 99 | display_qualifier_flags | Full qualifier flags |
| sub_5F9820 | 18 | display_ref_qualifier | & or && |
| sub_5F9860 | 91 | display_calling_convention | Calling convention from off_E6CDA0 |
| sub_5F99A0 | 115 | display_attribute_target | Attribute target kind |
| sub_5F9BC0 | 20 | display_asm_keyword | "asm" or "volatile" |
| sub_5F9C10 | 26 | display_elaborated_type | Elaborated type specifier |
| sub_5F9CA0 | 50 | display_struct_layout | Structure layout padding mode |
| sub_5F9D80 | 89 | display_member_alignment | Member alignment field |
| sub_5F9F70 | 57 | display_template_kind | Template kind name |
| sub_5FA0D0 | 283 | display_template_arg_list | Full template argument list |
| sub_5FA660 | 127 | display_constraint_expr | Constraint expression (C++20) |
| sub_5FA8F0 | 118 | display_deduction_guide | Deduction guide info |
| sub_5FAB70 | 333 | display_capture_list | Lambda capture list |
| sub_5FB270 | 556 | display_expr_operator_name | Expression operator name (120 kinds) |
| sub_5FBCD0 | 571 | display_expr_details | Operator-specific expression details |
| sub_5FCAF0 | 1,319 | display_float_constant | Float/complex/imaginary formatting |
| sub_5FE7C0 | 55 | display_expr_flag | Expression flag display |
| sub_5FE8B0 | 1,659 | display_expr_operator | Expression operator details (2nd largest) |
| sub_600740 | 72 | display_for_range | Range-based for details |
| sub_600870 | 171 | display_coroutine_info | Coroutine info (C++20) |
| sub_600BF0 | 19 | display_designated_init | Designated initializer |
| sub_600C50 | 107 | display_attribute_entry | Attribute entry |
| sub_600E00 | 55 | display_asm_operand | Asm operand display |
| sub_600EF0 | 76 | display_asm_statement | Asm statement details |
| sub_600FF0 | 29 | display_gcc_builtin_kind | GCC built-in kind |
| sub_601070 | 87 | display_pragma_info | Pragma info |
| sub_6011F0 | 155 | display_declspec_attribute | __declspec attribute |
| sub_601460 | 92 | display_thread_local | Thread-local info |
| sub_6015A0 | 73 | display_module_info | Module info (C++20) |
| sub_6016F0 | 197 | display_concept_requires | Concept/requires expression |
| sub_601B10 | 48 | display_pack_expansion | Pack expansion info |
| sub_601BE0 | 50 | display_structured_binding | Structured binding (C++17) |
| sub_601CB0 | 562 | display_additional_expr | Additional expression info |
| sub_6027D0 | 144 | display_deduced_class | Deduced class info |
| sub_6029B0 | 190 | display_decl_sequence | Declaration sequence entry |
| sub_602DC0 | 74 | display_enum_underlying | Enum underlying type |
| sub_602F20 | 306 | display_integer_constant | Integer constant formatting |
| sub_603670 | 134 | display_vendor_attribute | Vendor attribute details |
| sub_6038F0 | 26 | display_cleanup_handler | Cleanup handler |
| sub_6039E0 | 78 | display_sequence_entry | Last function in il_to_str region |
The "paramter_copies" Typo
The coroutine statement display (case 9 in display_statement) prints the field label "paramter_copies" -- missing the 'e' in "parameter." This typo is present in the compiled binary's string table and originates from Edison Design Group's source code. It serves as strong provenance evidence: a clean-room reimplementation would not reproduce this exact spelling error, confirming that cudafe++ links genuine EDG il_to_str.c object code.
Complete Call Graph
display_il_file (sub_5F7DF0) ─── TOP LEVEL
├── display_il_header (sub_5F76B0)
│ ├── init_output_context (sub_5F9040)
│ ├── dump_il_entity_pointer (sub_5EB8B0) ×30+ for header fields
│ ├── dump_field_bool (sub_5EB450) ×15+ for header booleans
│ ├── dump_string (sub_5EB790)
│ └── walk_file_scope_il (sub_60E4F0)
│ └── display_il_entry (sub_5F4930) ─── callback per entity
│
└── [loop over regions 2..N]
└── walk_routine_scope_il (sub_610200)
└── display_il_entry (sub_5F4930) ─── callback per entity
display_il_entry (sub_5F4930) ─── MAIN DISPATCHER
├── display_source_corresp (sub_5EDF40) ─── shared by named entities
├── display_statement (sub_5EC600) ─── case 0x15
│ ├── display_coroutine_info (sub_600870)
│ └── display_for_range (sub_600740)
├── display_expr_node (sub_5ECFE0) ─── case 0x0D
│ ├── display_expr_operator (sub_5FE8B0)
│ ├── display_expr_operator_name (sub_5FB270)
│ └── display_expr_details (sub_5FBCD0)
├── display_variable (sub_5EE500) ─── case 0x07
│ └── display_init_kind (sub_5EBB50)
├── display_routine (sub_5EF1A0) ─── case 0x0B
│ └── display_template_arg_list (sub_5EBF60 / sub_5FA0D0)
├── display_type (sub_5F06B0) ─── case 0x06
│ ├── display_class_type_supplement (sub_5F4030)
│ ├── display_int_type_kind (sub_5F9110)
│ └── display_float_type_kind (sub_5F93D0)
├── display_scope (sub_5F2140) ─── case 0x17
├── display_constant (sub_5F2720) ─── case 0x02
│ ├── display_integer_constant (sub_602F20)
│ └── display_float_constant (sub_5FCAF0)
├── display_dynamic_init (sub_5F37F0) ─── case 0x1E
├── display_name_reference (sub_5EBC60) ─── case 0x3E
└── display_entity_list (sub_5EC450) ─── multiple cases
display_single_entity (sub_5F7D50) ─── TARGETED DISPLAY
├── entity_lookup (sub_73D400)
├── resolve_entity (sub_7377D0)
├── get_entity_kind (sub_5C64C0)
├── init_output_context (sub_5F9040)
└── display_il_entry (sub_5F4930)
Relationship to Other Subsystems
The IL display subsystem is read-only: it never modifies the IL graph. It shares the same entry walker functions used by the IL Tree Walking framework (walk_file_scope_il = sub_60E4F0, walk_routine_scope_il = sub_610200) and the Keep-in-IL mark phase, but passes display_il_entry as the callback instead of a transformation function.
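The walker-plus-callback arrangement described above can be illustrated with a toy version. Everything in this sketch is hypothetical (the entry struct, the callback type, and the array-based walker); the real signatures of sub_60E4F0 and sub_610200 are not reproduced here. The point is only that one traversal serves several clients, and a display-style callback receives the entry as read-only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry and callback types for illustration only. */
struct il_entry { int kind; };
typedef void (*il_entry_callback)(const struct il_entry *entry, void *ctx);

/* One walker serves every client: only the callback differs. */
static void walk_entries(const struct il_entry *entries, size_t count,
                         il_entry_callback cb, void *ctx)
{
    for (size_t i = 0; i < count; i++)
        cb(&entries[i], ctx);
}

/* A display-style callback only reads the entry (note the const). */
static void count_entry(const struct il_entry *entry, void *ctx)
{
    (void)entry;
    ++*(size_t *)ctx;
}
```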
The IL Allocation subsystem provides dump_il_table_stats (sub_5E99D0), which dumps allocation counters rather than IL content -- a complementary diagnostic activated separately.
The field offsets printed by the display functions serve as ground truth for the IL Overview entry kind table and the Entity Node Layout documentation.
IL Comparison & Deep Copy
The IL comparison and deep copy engines are two tightly coupled subsystems in EDG's il.c that serve template instantiation, constant sharing, and overload resolution. The comparison engine determines structural equivalence between two IL expression trees or constant nodes -- needed when the compiler must decide whether two template arguments are "the same" or whether a constant has already been allocated. The deep copy engine clones expression trees while optionally substituting template parameters for their actual arguments -- the core mechanism behind template instantiation. Both subsystems are recursive tree walkers dispatched by node-kind switches, and both operate on the same IL node layout described in IL Overview.
These two engines live within the address range 0x5D0750--0x5DFAD0 in the binary (roughly 37KB of compiled code, in non-contiguous regions). The comparison engine occupies 0x5D0750--0x5D2160, the constant sharing infrastructure sits at 0x5D2170--0x5D2D80, the expression copy engine fills 0x5D2DE0--0x5D5550, and the template parameter substitution dispatcher spans 0x5DC000--0x5DFAD0.
Key Facts
| Property | Value |
|---|---|
| Source file | il.c (EDG 6.6) |
| Assert path | /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/il.c |
| Comparison engine | sub_5D0750 (compare_expressions), 588 lines |
| Constant comparison | sub_5D1350 (compare_constants), 525 lines |
| Dynamic init comparison | sub_5D1FE0 (compare_dynamic_inits), ~80 lines |
| Constant sharing allocator | sub_5D2390 (alloc_shareable_constant), ~200 lines |
| Expression tree copier | sub_5D2F90 (i_copy_expr_tree), 424 lines |
| Constant deep copier | sub_5D3B90 (i_copy_constant_full), 305 lines |
| Template substitution dispatcher | sub_5DC000 (copy_template_param_expr), 1416 lines |
| Template constant dispatcher | sub_5DE290 (copy_template_param_con), 819 lines |
| Constant sharing hash buckets | 2039 |
| Recursion depth guard | dword_126F1D0 (incremented/decremented around compare_expressions) |
Part 1: The Comparison Engine
Why It Exists
Three front-end subsystems need structural equality testing on IL trees:
- Template argument deduction. When the compiler deduces template arguments from a function call, it must compare the deduced value against a previously deduced value for the same parameter. Two independently constructed expression trees representing sizeof(int) must compare as equal even though they are distinct heap allocations.
- Constant sharing. Identical constants across the translation unit are deduplicated into a single canonical node in file-scope memory. The comparison engine is the hash table's equality predicate -- after two constants hash to the same bucket, compare_constants determines whether they are structurally identical.
- Overload resolution. When the compiler checks whether two function template specializations have equivalent signatures, it compares their template argument expressions for equivalence.
compare_expressions (sub_5D0750)
This is the main entry point. It takes two expression-node pointers and a flags word, and returns 1 (match) or 0 (mismatch). It uses a 36-case switch on the expression-node kind byte (offset +24 in the node layout).
compare_expressions(expr_a, expr_b, flags) -> bool:
    if expr_a == expr_b:
        return TRUE // pointer identity short-circuit
    if expr_a->kind != expr_b->kind:
        return FALSE // different node types never match
    recursion_depth++ // dword_126F1D0
    switch expr_a->kind:
        case 0 (null):
            result = FALSE // two null nodes are never "equal"
        case 1 (operation):
            if expr_a->op_code != expr_b->op_code:
                result = FALSE
            else:
                // compare each operand in the linked list pairwise
                result = compare_operand_lists(expr_a->operands, expr_b->operands, flags)
                if result:
                    result = equiv_types(expr_a->result_type, expr_b->result_type)
        case 2 (constant reference):
            result = compare_constants(expr_a->constant, expr_b->constant, flags)
        case 3 (entity reference):
            // first try pointer equality on the referenced entity
            if expr_a->entity == expr_b->entity:
                result = TRUE
            elif sharing_enabled and same_sharing_symbol(expr_a, expr_b):
                result = TRUE
            else:
                // deep entity comparison via equiv_types + compare_template_variables
                result = equiv_types(expr_a->entity->type, expr_b->entity->type)
        case 4, 19 (type reference):
            result = (expr_a->type_ptr == expr_b->type_ptr)
                     or equiv_types(expr_a->type_ptr, expr_b->type_ptr)
        case 5, 18 (dynamic init):
            result = compare_dynamic_inits(expr_a->init, expr_b->init, flags)
        case 6 (source position):
            result = (expr_a->offset == expr_b->offset)
        case 7 (full expression info):
            result = compare_flags(expr_a, expr_b)
                     and equiv_types(...)
                     and compare_expressions(expr_a->sub_expr, expr_b->sub_expr, flags)
        case 8 (template arguments):
            // element-by-element comparison of arg lists
            result = compare_template_arg_lists(expr_a->args, expr_b->args, flags)
        case 10, 33 (sub-expression wrapper):
            result = compare_expressions(expr_a->inner, expr_b->inner, flags)
        case 11, 32 (unary with boolean):
            result = (expr_a->bool_field == expr_b->bool_field)
                     and compare_expressions(expr_a->inner, expr_b->inner, flags)
        case 12, 14, 15 (typed value):
            result = (expr_a->value_byte == expr_b->value_byte)
                     and compare_type_or_value(...)
        case 13 (two-byte key):
            result = (expr_a->key_word == expr_b->key_word)
        case 16 (always-equal sentinel):
            result = TRUE
        case 17, 22, 35 (opaque pointer):
            result = (expr_a->ptr == expr_b->ptr)
        case 20 (pointer with fallback):
            result = (expr_a->ptr == expr_b->ptr)
                     or deep_compare_via_sub_7B2260(...)
        case 21 (keyed sub-expression):
            result = (expr_a->key == expr_b->key)
                     and compare_expressions(expr_a->inner, expr_b->inner, flags)
        case 23 (simple sub-expression):
            result = compare_expressions(expr_a->inner, expr_b->inner, flags)
        case 24 (nested expression pair):
            result = compare_pair(expr_a, expr_b, flags)
        case 25 (lambda/closure):
            result = chase_closure_ptrs_and_compare(...)
        case 28 (attributed expression):
            result = (expr_a->attr_flags == expr_b->attr_flags)
                     and compare_expressions(expr_a->inner, expr_b->inner, flags)
        case 30 (template specialization):
            if expr_a->hash != expr_b->hash:
                result = FALSE
            else:
                result = compare_template_specializations(expr_a, expr_b)
        case 31 (function template args):
            result = compare_each_arg_type(expr_a->args, expr_b->args)
        default:
            internal_error("compare_expressions: bad expr kind")
    recursion_depth--
    return result
Flags interpretation. The third argument flags is a bitmask:
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x01 | Strict mode -- entity references must be pointer-identical, not just structurally equivalent |
| 1 | 0x02 | Check constraints -- compare template constraints alongside types |
| 2 | 0x04 | Allow specialization -- used by the equivalence wrapper (sub_5D1320) when comparing for specialization matching |
Recursion depth guard. The global dword_126F1D0 is incremented on entry and decremented on exit. This counter is not used for depth limiting -- it exists so that diagnostic routines (guarded by dword_126EFC8) can print indented traces via sub_5C4B70 (dump_expr_tree).
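The overall shape of the comparison (identity short-circuit, kind check, per-kind recursion, and a depth counter that only feeds diagnostics) can be illustrated with a small model. This is a minimal sketch, not the EDG implementation: the `Expr` class, the three node kinds, and the `trace` list are hypothetical stand-ins, and only bit 0 (strict mode) of the flags word is modelled.

```python
from dataclasses import dataclass, field
from typing import List

STRICT = 0x01  # bit 0 of the flags word (strict entity comparison)

@dataclass(eq=False)  # eq=False keeps == on nodes as identity, like raw pointers
class Expr:
    kind: str                       # hypothetical kinds: "op", "type_ref", "entity"
    payload: object = None          # op code, type name, or entity object
    operands: List["Expr"] = field(default_factory=list)

depth = 0    # stands in for dword_126F1D0
trace = []   # indented trace lines, one per node pair visited

def compare_expressions(a, b, flags=0):
    global depth
    if a is b:                      # pointer identity short-circuit
        return True
    if a.kind != b.kind:            # different node kinds never match
        return False
    trace.append("  " * depth + a.kind)
    depth += 1
    try:
        if a.kind == "entity":
            # strict mode demands the same entity object, not an equal one
            same = a.payload is b.payload if flags & STRICT else a.payload == b.payload
        else:
            same = (a.payload == b.payload
                    and len(a.operands) == len(b.operands)
                    and all(compare_expressions(x, y, flags)
                            for x, y in zip(a.operands, b.operands)))
    finally:
        depth -= 1                  # counter is balanced on every exit path
    return same

# Two independently built trees for sizeof(int) compare as equal.
t1 = Expr("op", "sizeof", [Expr("type_ref", "int")])
t2 = Expr("op", "sizeof", [Expr("type_ref", "int")])
print(compare_expressions(t1, t2))
```

The counter never limits recursion here either; it only determines how far each trace line is indented.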
compare_constants (sub_5D1350)
Constants are the most structurally complex IL nodes. A single constant node is 184 bytes and carries a constant_kind byte at offset +148 that selects among 16 primary kinds, some of which contain nested sub-kinds. The comparison function uses an outer switch on constant_kind and inner switches for aggregate and template-parameter sub-kinds.
compare_constants(const_a, const_b, flags) -> bool:
    if const_a == const_b:
        return TRUE
    if const_a->constant_kind != const_b->constant_kind:
        return FALSE
    switch const_a->constant_kind:
        case 0, 14 (trivial kinds):
            return TRUE
        case 1 (integer):
            return compare_integer_values(const_a->value, const_b->value)
                   and (const_a->flags == const_b->flags)
        case 2 (string literal):
            return memcmp(const_a->bytes, const_b->bytes, const_a->length) == 0
        case 3, 5 (float):
            return compare_float_value(const_a->value, const_b->value)
        case 4 (complex):
            return compare_float_value(const_a->real, const_b->real)
                   and compare_float_value(const_a->imag, const_b->imag)
        case 6 (address constant):
            // nested switch on address_kind (offset +152), 6 sub-kinds:
            switch const_a->address_kind:
                case 0, 1: pointer equality or compare_entities(...)
                case 2: recursive type comparison
                case 3, 6: pointer equality at offset +160
                case 5: type comparison via deep_compare
            // uses while(2) loop for manual tail-call optimization
            // on case 2 and case 13
        case 7 (template argument):
            return compare_template_arg(...)
        case 8 (pair of constants):
            return compare_constants(const_a->first, const_b->first, flags)
                   and compare_constants(const_a->second, const_b->second, flags)
        case 9 (dynamic init):
            return compare_dynamic_inits(const_a->init, const_b->init, flags)
        case 10 (aggregate):
            // walk linked lists of sub-constants in lockstep
            a_elem = const_a->first_element
            b_elem = const_b->first_element
            while a_elem and b_elem:
                if not compare_constants(a_elem, b_elem, flags): return FALSE
                a_elem = a_elem->next; b_elem = b_elem->next
            return (a_elem == NULL and b_elem == NULL)
        case 11 (constant + scope):
            return compare_constants(const_a->sub, const_b->sub, flags)
                   and (const_a->scope_byte == const_b->scope_byte)
        case 12 (template parameter constant):
            // deeply nested -- 14 sub-kinds at offset +152:
            switch const_a->template_param_kind:
                case 0: pack parameter comparison
                case 1: compare_expressions on embedded expr
                case 2: compare_types via sub_5B3080
                case 3: compare_types + type + flags
                case 4, 12: recursive compare_constants
                case 5-10: type equality + sub_5BFB80 comparison
                case 11: type + template argument list
                case 13: type equality only
        case 13 (entity ref constant):
            return (const_a->flags == const_b->flags)
                   and (pointer_equal_or_sharing_match(const_a->entity, const_b->entity))
        case 15 (literal value):
            return (const_a->value_ptr == const_b->value_ptr)
        default:
            internal_error("compare_constants: bad constant kind")
Manual tail-call optimization. For cases 6 (address constant with type sub-kind) and 13 (entity ref), the function uses while(2) (an infinite loop that reassigns the operands and continues from the top of the comparison) instead of making a recursive call. This avoids stack growth when comparing chains of address constants, which can be deeply nested in pointer-to-member types.
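The loop-instead-of-recursion pattern can be shown on a toy chain of nested constants. This is a sketch only: `AddrConst`, its `tag` and `inner` fields, and `make_chain` are hypothetical and do not mirror the real 184-byte layout. The point is that rebinding both operands and looping keeps stack usage constant regardless of chain length.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class AddrConst:
    tag: int                               # stands in for the kind/sub-kind bytes
    inner: Optional["AddrConst"] = None    # hypothetical nested constant

def compare_chain(a, b):
    # Each iteration does the work of one recursive call.
    while True:
        if a is b:                         # identity (also covers None vs None)
            return True
        if a is None or b is None or a.tag != b.tag:
            return False
        a, b = a.inner, b.inner            # the "tail call": rebind and loop

def make_chain(n):
    # Build the chain iteratively so construction does not recurse either.
    node = None
    for i in range(n):
        node = AddrConst(tag=i, inner=node)
    return node

# 10,000 levels would exceed Python's default recursion limit if the
# comparison were written recursively; the loop handles it without issue.
print(compare_chain(make_chain(10_000), make_chain(10_000)))
```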
compare_dynamic_inits (sub_5D1FE0)
Dynamic initializers represent runtime initialization expressions (constructors, aggregate init, etc.). The comparison function dispatches on the init kind byte at offset +48:
compare_dynamic_inits(init_a, init_b, flags) -> bool:
    if init_a->kind != init_b->kind:
        return FALSE
    if init_a->flags != init_b->flags:
        return FALSE
    // entity fields at +8, +16 compared with sharing-aware equality
    switch init_a->kind:
        case 0, 1: return TRUE (after header match)
        case 2, 6: return compare_constants(init_a->constant, init_b->constant)
        case 3, 4: return compare_expressions(init_a->expr, init_b->expr)
        case 5: return compare_entity_ref(...)
                and compare_sub_exprs(...)
Part 2: Constant Sharing
Why It Exists
Without deduplication, every occurrence of the integer constant 42 in a translation unit would allocate a separate 184-byte constant node in the IL. For large programs (especially heavy template users), this wastes significant memory. The constant sharing system maintains a hash table of canonical constants: when a new constant is about to be allocated, alloc_shareable_constant first checks whether an identical constant already exists in the hash table. If so, the existing node is returned; if not, a new canonical copy is created in the file-scope region and inserted into the table.
Shareability Predicate
Not all constants can be shared. The predicate constant_is_shareable (sub_5D2210) checks several blocking conditions:
constant_is_shareable(constant) -> bool:
    if not sharing_enabled (dword_126EE48):
        return FALSE
    if constant has parent:
        // parent must be type 2 (constant); checks sharing flag 0x40 at byte+81
        // and calls compare_constants on the parent's value
        return parent_is_shareable(...)
    // blocking conditions for parentless constants:
    if constant->associated_entry != NULL: return FALSE // already bound to an entry
    if constant->extra_data != 0: return FALSE // has auxiliary data
    if constant->flags & 0x02: return FALSE // flag bit 1 blocks sharing
    switch constant->constant_kind:
        case 2 (string): return string_sharing_enabled (dword_126E1C0)
        case 6 (address): return TRUE unless has extra payload at +176
                          or address_subkind==4 with data
        case 7 (template): return (constant->extra_ptr == NULL)
        case 10 (aggregate): return FALSE // aggregates never shared
        case 12 (template param): return FALSE // template params never shared
        default: return TRUE
The rationale for excluding aggregates and template parameters: aggregate constants contain linked lists of sub-constants that would require recursive sharing checks, and template parameter constants are inherently unique to their instantiation context.
Hash Table Structure
The hash table is allocated during il_init (sub_5CFE20) as a 16,312-byte block (stored at qword_126F228), yielding 2039 bucket slots (16312 / 8 = 2039). Each bucket is a pointer to the head of a singly-linked chain of constant nodes.
Hash Table Layout (qword_126F228):
+--------+--------+--------+ +--------+
| slot 0 | slot 1 | slot 2 | ... |slot 2038|
+--------+--------+--------+ +--------+
| | |
v v v
const -> const -> NULL (singly-linked chains)
|
v
const -> NULL
Why 2039? The number 2039 is prime. A prime table size helps the modular-reduction step (hash % 2039) spread keys evenly across buckets even when the hash function produces values that share common factors. The compiled code computes the modulus through an optimized multiply-and-shift sequence (multiply by 0x121456F, then shift) rather than a hardware division instruction.
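The multiply-and-shift reduction can be reproduced arithmetically. The constant 0x121456F equals ceil(2^43 / 2039) minus 2^32, which is the usual "magic number" form for unsigned 32-bit division by 2039 when the full multiplier needs 33 bits. The sketch below assumes the hash is a 32-bit unsigned value and uses the standard add-and-shift fix-up; the exact instruction sequence in the binary is not reproduced here, and `bucket_index` is a hypothetical name.

```python
BUCKETS = 2039
MAGIC = 0x121456F   # constant quoted in the text

# The full 33-bit multiplier is ceil(2**43 / 2039); its low 32 bits are MAGIC.
FULL_MAGIC = (2**43 + BUCKETS - 1) // BUCKETS
assert FULL_MAGIC == (1 << 32) + MAGIC

def bucket_index(hash32):
    """Compute hash32 % 2039 without dividing (hash32 must fit in 32 bits)."""
    t = (hash32 * MAGIC) >> 32               # high half of the 32x32 product
    q = (t + ((hash32 - t) >> 1)) >> 10      # equals (hash32 * FULL_MAGIC) >> 43
    return hash32 - q * BUCKETS              # remainder = value - quotient * divisor

for h in (0, 2038, 2039, 123456789, 0xFFFFFFFF):
    print(h, bucket_index(h), h % BUCKETS)
```

The two right-hand columns printed by the loop agree for every 32-bit input, because the multiplier overshoots 2^43 / 2039 by less than 2039 / 2^43 in relative terms, which is too small to change the floor.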
alloc_shareable_constant (sub_5D2390)
This is the entry point for all constant allocation when sharing is enabled. It implements a hash-table lookup with MRU (most recently used) reordering of the chain:
alloc_shareable_constant(local_constant) -> constant*:
    total_alloc_count++ // qword_126F208
    if not sharing_enabled or not constant_is_shareable(local_constant):
        return alloc_constant(local_constant) // fallback to non-shared alloc
    if local_constant has parent:
        // parent's shared pointer is already the canonical copy
        assert parent->type == 2
        return parent->shared_ptr
    // ---- hash table lookup ----
    hash = compute_constant_hash(local_constant) // sub_5BE150
    bucket_index = hash % 2039
    bucket_ptr = &hash_table[bucket_index]
    prev = NULL
    curr = *bucket_ptr
    while curr != NULL:
        comparison_count++ // qword_126F200
        if compare_constants(curr, local_constant, 0):
            // ---- HIT: MRU reorder ----
            if prev != NULL:
                // unlink curr from current position
                prev->next = curr->next
                // move curr to front of chain
                curr->next = *bucket_ptr
                *bucket_ptr = curr
            if curr is in same region:
                region_hit_count++ // qword_126F218
            else:
                global_hit_count++ // qword_126F220
            return curr
        prev = curr
        curr = curr->next
    // ---- MISS: allocate new canonical constant ----
    new_bucket_count++ // qword_126F210
    new_constant = alloc_in_file_scope(184) // sub_5E11C0 or sub_5E1620
    memcpy(new_constant, local_constant, 184) // 11 x 16-byte SSE moves + 8-byte tail
    clear_sharing_flags(new_constant)
    fixup_constant_references(new_constant) // sub_5D39A0
    // link at head of chain
    new_constant->next = *bucket_ptr
    *bucket_ptr = new_constant
    return new_constant
MRU optimization rationale. When a hash bucket chain holds many colliding constants, recently matched constants are likely to be matched again soon (temporal locality from template instantiation expanding the same types repeatedly). Moving the matched node to the front of the chain turns an O(n) chain scan into an O(1) lookup for repeated accesses to the same constant.
Statistics counters. The sharing system maintains five counters for profiling:
| Address | Counter | Meaning |
|---|---|---|
| qword_126F200 | comparisons | Total compare_constants calls during sharing |
| qword_126F208 | total_allocs | Total calls to alloc_shareable_constant |
| qword_126F210 | new_buckets | Number of cache misses (new canonical entries) |
| qword_126F218 | region_hits | Sharing hits where the existing constant is in the same region |
| qword_126F220 | global_hits | Sharing hits where the existing constant is in a different region |
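The lookup, move-to-front reordering, and counter updates fit in a few lines when modelled on ordinary Python values. This is a minimal sketch under simplifying assumptions: `SharingTable` and `Node` are hypothetical names, Python's built-in `hash` and `==` stand in for `compute_constant_hash` and `compare_constants`, and the region/global hit split is collapsed into a single `hits` counter.

```python
BUCKETS = 2039

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

class SharingTable:
    def __init__(self):
        self.buckets = [None] * BUCKETS
        self.comparisons = 0     # cf. qword_126F200
        self.total_allocs = 0    # cf. qword_126F208
        self.new_entries = 0     # cf. qword_126F210
        self.hits = 0            # region + global hits combined

    def intern(self, value):
        self.total_allocs += 1
        index = hash(value) % BUCKETS
        prev, curr = None, self.buckets[index]
        while curr is not None:
            self.comparisons += 1
            if curr.value == value:
                if prev is not None:
                    # unlink, then relink at the head of the chain
                    prev.next = curr.next
                    curr.next = self.buckets[index]
                    self.buckets[index] = curr
                self.hits += 1
                return curr
            prev, curr = curr, curr.next
        # miss: create the canonical node and push it at the chain head
        self.new_entries += 1
        node = Node(value)
        node.next = self.buckets[index]
        self.buckets[index] = node
        return node

table = SharingTable()
first = table.intern(1)
table.intern(1 + BUCKETS)        # small ints hash to themselves, so these
table.intern(1 + 2 * BUCKETS)    # three values all land in bucket 1
print(table.intern(1) is first, table.buckets[1] is first)
```

After the final lookup the matched node sits at the head of bucket 1, so the next lookup of the same value costs one comparison instead of three.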
String Constant Interning (sub_5DBAB0)
String literals receive a separate interning pass through intern_string_constant at 0x5DBAB0. This function reuses the same 2039-bucket hash table (qword_126F228) but with string-specific comparison logic:
intern_string_constant(string, context_a, context_b) -> constant*:
    hash = compute_constant_hash(string)
    bucket_index = hash % 2039
    // linear chain search with exact match (flag=1)
    for each entry in chain:
        if compare_constants(entry, local_constant, 1): // strict mode
            move_to_front(entry) // MRU
            return entry
    // miss: allocate new string constant in file-scope region
    new = alloc_constant_with_source_sequence(ck_string)
    memcpy(new, local_constant, 184)
    new->string_data = alloc_string_storage(strlen(string) + 1)
    strcpy(new->string_data, string)
    clear_sharing_flags(new)
    fixup_constant_references(new)
    link_at_chain_head(bucket_index, new)
    free_local_constant(local_constant)
    return new
fixup_constant_references (sub_5D39A0)
After a constant is copied into the shared region, some of its internal pointers may still reference nodes in the source (non-shared) region. fixup_constant_references walks the constant's internal structure and redirects these dangling references:
- If the constant's associated IL entry is not in the shared region, the back-pointer at offset +128 is cleared.
- For template parameter constants (kind 12), sub-kinds 1 and 5-10 may embed expression trees at offsets +160/+168. If these expressions are not in the shared region, they are deep-copied via copy_expr_tree or reattached via attach_to_region.
- For literal value constants (kind 15) with expression sub-kind 13, the constant kind is rewritten based on the expression's kind (expr kind 2 becomes const kind 2, etc.), effectively inlining the expression into the constant.
Part 3: The Deep Copy Engine
Why It Exists
Template instantiation requires cloning expression trees from template definitions while replacing template parameter references with the actual arguments provided at the instantiation site. This is not a simple memcpy -- every node in the tree must be visited, its pointers updated to reference the new region's copies, and template parameter nodes must be intercepted and replaced with substituted values. The deep copy engine provides this transformation.
Default argument expansion also uses the copy engine: when a function call omits an argument that has a default, the default's expression tree is cloned from the function declaration into the call site.
i_copy_expr_tree (sub_5D2F90)
The central expression copier. It takes an expression node, a flags word, and a substitution-list context, then returns a freshly allocated deep copy.
i_copy_expr_tree(src_expr, flags, subst_list) -> expr_node*:
    // shallow clone: allocate new node, copy fixed fields
    dest = allocate_expr_node_clone(src_expr) // sub_5C28B0
    switch src_expr->kind:
        case 0 (null): // no children to copy
        case 3 (entity ref): // entity pointer is shared, not copied
        case 4 (type ref): // type pointer is shared
        case 16 (sentinel): // no data
        case 19 (template ref): // entity pointer is shared
        case 20 (type constant): // type pointer is shared
        case 22 (opaque ptr): // shallow only
        case 30 (template spec): // shallow only
            break // nothing beyond the shallow clone
        case 1 (operation):
            // recursively copy the operand linked list
            dest->operands = i_copy_list_of_expr_trees(src_expr->operands, flags, subst)
        case 2 (constant reference):
            // deep-copy the constant node
            dest->constant = i_copy_constant_full(src_expr->constant, NULL, flags, subst)
        case 5 (dynamic init):
            dest->init = i_copy_dynamic_init(src_expr->init, flags, subst)
        case 6 (call expression):
            // walk argument list, copy each argument expression
            dest->args = copy_arg_list(src_expr->args, flags, subst)
        case 7 (full expression info):
            // copy 6 sub-fields (type, scope, sub-expression, etc.)
            copy_full_expr_children(dest, src_expr, flags, subst)
        case 8 (template arguments):
            dest->type_list = copy_type_list(src_expr->type_list, flags)
        case 9 (pack expansion):
            dest->type = copy_type(src_expr->type)
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
        case 10 (object lifetime):
            push_lifetime_scope()
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            attach_lifetime(dest)
        case 11, 23, 32, 33 (sub-expression list):
            dest->list = i_copy_list_of_expr_trees(src_expr->list, flags, subst)
        case 12, 14, 15 (typed value):
            // conditional copy based on value byte
            if src_expr->value_byte matches copy-condition:
                dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
        case 13 (two-byte key):
            // no children beyond the key value
        case 17 (entity reference, copyable):
            if flags & 0x80: // copy_entities mode
                dest->entity = alloc_constant_from_entity(src_expr->entity)
        case 18 (substitution slot):
            // look up in subst_list for replacement
            dest = resolve_from_substitution_list(subst_list, src_expr->index)
        case 21, 26, 27 (expression + list):
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            dest->list = i_copy_list_of_expr_trees(src_expr->list, flags, subst)
        case 24 (list + pointer):
            dest->list = i_copy_list_of_expr_trees(src_expr->list, flags, subst)
            dest->ptr = copy_pointer_target(src_expr->ptr)
        case 25 (expression + flags):
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            dest->flags |= propagated_flags
        case 28 (attributed expression):
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            // attribute flags are already copied in the shallow clone
        case 31 (expression + extracted pointer):
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            dest->extra = extract_pointer(src_expr)
        case 34 (constexpr fold):
            dest = copy_constexpr_fold(src_expr) // sub_65AE50
        default:
            internal_error("i_copy_expr_tree: bad expr kind")
    // ---- post-copy entity resolution (LABEL_11) ----
    if flags & 0x10: // resolve_refs mode
        for kinds 2, 3, 7, 19:
            resolve_entity_ref(dest) // sub_5B3030
    return dest
Flags interpretation for the copy engine:
| Bit | Mask | Meaning |
|---|---|---|
| 4 | 0x10 | resolve_refs -- after copying, resolve entity references through the symbol table |
| 7 | 0x80 | copy_entities -- copy entity nodes themselves (not just references to them) |
| 12 | 0x1000 | mark_instantiated -- stamp copied nodes with the instantiation flag |
| 14 | 0x4000 | preserve_source_pos -- carry source-position annotations from source to copy |
i_copy_list_of_expr_trees (sub_5D38C0)
A helper that walks a linked list of expression nodes (connected via the next pointer at offset +16), copies each via i_copy_expr_tree, and links the copies into a new list:
i_copy_list_of_expr_trees(head, flags, subst) -> expr_node*:
    result_head = NULL
    result_tail = NULL
    curr = head
    while curr != NULL:
        copy = i_copy_expr_tree(curr, flags, subst)
        if result_head == NULL:
            result_head = copy
        else:
            result_tail->next = copy
        result_tail = copy
        curr = curr->next
    return result_head
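The head/tail bookkeeping above is the standard order-preserving linked-list copy. A minimal runnable sketch follows; `ExprNode` and `copy_one` are hypothetical stand-ins (the latter for `i_copy_expr_tree`), and the flags and substitution-list arguments are omitted.

```python
class ExprNode:
    def __init__(self, label, next=None):
        self.label = label
        self.next = next          # plays the role of the link at offset +16

def copy_one(node):
    # Stand-in for i_copy_expr_tree: clone one node without its list link.
    return ExprNode(node.label)

def copy_list(head):
    result_head = result_tail = None
    curr = head
    while curr is not None:
        copy = copy_one(curr)
        if result_head is None:
            result_head = copy          # first copy becomes the new head
        else:
            result_tail.next = copy     # append after the current tail
        result_tail = copy              # tail advances on every iteration
        curr = curr.next
    return result_head

def labels(head):
    out = []
    while head is not None:
        out.append(head.label)
        head = head.next
    return out

original = ExprNode("a", ExprNode("b", ExprNode("c")))
print(labels(copy_list(original)))
```

Tracking the tail keeps the copy linear in the list length and preserves operand order, which a push-at-head copy would reverse.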
i_copy_constant_full (sub_5D3B90)
The constant copier handles the 184-byte constant node and its recursive sub-structure. It maintains a substitution list to avoid duplicating shared type definitions across the copy tree.
i_copy_constant_full(src, dest_or_null, flags, subst_list) -> constant*:
    if dest_or_null:
        dest = dest_or_null // copy in place
    else:
        dest = alloc_constant_node() // sub_5E11C0
    memcpy(dest, src, 184) // 11 x SSE + 8-byte tail
    clear_sharing_flag(dest, bit 2 at [5].byte[3])
    clear_sharing_flag(dest, bit 5 at [9].byte[3])
    switch dest->constant_kind:
        case 10 (aggregate):
            // walk linked list of sub-constants, deep-copy each
            for each element in dest->element_list:
                element = i_copy_constant_full(element, NULL, flags, subst_list)
                relink(element)
        case 11 (constant + scope):
            dest->sub_constant = i_copy_constant_full(
                dest->sub_constant, NULL, flags, subst_list)
        case 9 (dynamic init):
            dest->init = i_copy_dynamic_init(dest->init, flags, subst_list)
        case 6 (address constant with type definition):
            // substitution-list management: look up whether this type
            // has already been copied in this tree; if so, reuse the copy
            existing = lookup_in_subst_list(subst_list, src->type_def)
            if existing:
                dest->type_def = existing
            else:
                dest->type_def = copy_type(src->type_def)
                add_to_subst_list(subst_list, src->type_def, dest->type_def)
        case 12 (template parameter constant):
            switch dest->template_param_kind:
                case 0, 2, 3, 13: // no extra copy needed
                case 1: dest->value_expr = copy_expr_tree(dest->value_expr)
                case 4, 12: dest->inner = i_copy_constant_full(dest->inner, ...)
                case 5-10: dest->extra_expr = copy_expr_tree(dest->extra_expr)
                case 11: dest->type = copy_type(dest->type)
                         dest->arg_list = copy_template_arg_list(dest->arg_list)
    fixup_constant_references(dest) // sub_5D39A0
    return alloc_shareable_constant(dest) // sub_5D2390 -- may deduplicate
Substitution list purpose. When copying an expression tree that references the same type definition in multiple sub-expressions (e.g., two occurrences of decltype(x) in a single template), the substitution list ensures both references point to the same copied type node, preserving the sharing relationship from the original tree.
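The effect of the substitution list is the same as memoizing the copy of each shared node. The sketch below assumes a dictionary keyed by object identity in place of EDG's list, and the `TypeDef` and `Expr` classes are hypothetical; it shows only that two references to one original type end up referencing one copied type.

```python
class TypeDef:
    def __init__(self, name):
        self.name = name

class Expr:
    def __init__(self, type_def, children=()):
        self.type_def = type_def
        self.children = list(children)

def i_copy(expr, subst):
    # subst maps id(original TypeDef) -> its copy within this copy operation
    key = id(expr.type_def)
    if key not in subst:
        subst[key] = TypeDef(expr.type_def.name)   # first sighting: copy it
    return Expr(subst[key], [i_copy(child, subst) for child in expr.children])

def copy_tree(expr):
    # Public-wrapper shape: a fresh substitution map per top-level copy.
    return i_copy(expr, {})

shared = TypeDef("decltype(x)")
tree = Expr(TypeDef("int"), [Expr(shared), Expr(shared)])
clone = copy_tree(tree)
print(clone.children[0].type_def is clone.children[1].type_def)
```

Without the map, each child would receive its own `TypeDef` copy and the sharing relationship present in the original tree would be lost.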
Public Wrappers
The i_copy_* functions are internal -- they take a substitution-list parameter that must be cleaned up after use. Public wrappers handle this lifecycle:
| Wrapper | Address | Internal function | Purpose |
|---|---|---|---|
| copy_expr_tree | sub_5D3940 | i_copy_expr_tree | Expression deep copy with auto-cleanup |
| copy_constant_full | sub_5D4300 | i_copy_constant_full | Constant deep copy with auto-cleanup |
| copy_dynamic_init | sub_5D4CF0 | i_copy_dynamic_init | Dynamic init deep copy with auto-cleanup |
| copy_constant | sub_5D4D50 | i_copy_constant_full | Simple constant copy (flags=0) |
Each wrapper allocates a local substitution list, calls the internal function, then appends the list entries to the global free list at qword_126F1E0.
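The wrapper lifecycle (local list, internal call, return entries to a free list) can be sketched as follows. The names `free_list`, `new_entry`, `i_copy_value`, and `copy_value` are hypothetical, and the assumption that new entries are drawn from the free list before allocating fresh ones is an inference, not something the text above states.

```python
free_list = []   # stands in for the global free list at qword_126F1E0

def new_entry(src, dst):
    # Assumption: reuse a recycled entry when one is available.
    entry = free_list.pop() if free_list else {}
    entry["src"], entry["dst"] = src, dst
    return entry

def i_copy_value(value, subst):
    # Internal copier: records every (original, copy) pair it creates.
    copy = list(value)
    subst.append(new_entry(value, copy))
    return copy

def copy_value(value):
    # Public wrapper: owns the substitution list for one copy operation.
    subst = []
    try:
        return i_copy_value(value, subst)
    finally:
        free_list.extend(subst)   # hand all entries back, even on error

print(copy_value([1, 2]), len(free_list))
```

Keeping the list local to the wrapper means callers of the public API never see or leak substitution entries.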
Part 4: Template Parameter Substitution
Why It Exists
The deep copy engine (Part 3) performs a mechanical tree clone -- it duplicates structure but does not transform content. Template instantiation requires more: when the copier encounters a node referencing template parameter T, it must replace that node with the actual type argument (e.g., int). When it encounters sizeof(T), it must evaluate the expression with T=int and produce the constant 4. The template parameter substitution engine is the transformation layer that sits on top of the copy engine, intercepting template-parameter nodes and performing the substitution.
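The difference between cloning and substituting can be seen on a toy expression language. In this sketch, expressions are plain tuples, `substitute` is a hypothetical stand-in for the dispatcher, and the `SIZEOF` table encodes assumed target sizes; none of this reflects the real node layout.

```python
SIZEOF = {"char": 1, "int": 4, "double": 8}   # assumed target sizes

def substitute(expr, bindings):
    kind = expr[0]
    if kind == "param":
        # template parameter reference -> the bound type argument
        return ("type", bindings[expr[1]])
    if kind == "sizeof":
        operand = substitute(expr[1], bindings)
        if operand[0] == "type":
            return ("const", SIZEOF[operand[1]])   # fold once the type is known
        return ("sizeof", operand)                 # still dependent: keep as-is
    if kind == "op":
        # rebuild the operation with substituted operands
        return ("op", expr[1]) + tuple(substitute(e, bindings) for e in expr[2:])
    return expr                                    # constants etc. pass through

template_expr = ("op", "+", ("sizeof", ("param", "T")), ("const", 1))
print(substitute(template_expr, {"T": "int"}))
# -> ('op', '+', ('const', 4), ('const', 1))
```

A plain deep copy would have reproduced the `("param", "T")` node unchanged; the substitution layer replaces it and lets the enclosing `sizeof` collapse to a constant.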
copy_template_param_expr (sub_5DC000)
This is the central dispatcher for expression-level template substitution. At 1416 lines and 7872 bytes of compiled code, it is the largest single function in the comparison/copy subsystem. It takes up to 10 arguments:
copy_template_param_expr(
    expr_node* a1,     // expression to substitute
    template_ctx a2,   // template argument context
    template_ctx a3,   // secondary context (for nested templates)
    type* a4,          // expected result type
    scope* a5,         // current scope
    int a6,            // flags
    int* a7,           // error_flag (output: set to 1 on failure)
    diag_info* a8,     // diagnostics context
    constant* a9,      // scratch constant (pre-allocated workspace)
    constant** a10     // output constant pointer
) -> expr_node*        // substituted expression, or NULL (use a9/a10)
The function dispatches on expr->kind and, for operation nodes, further dispatches on the operation code:
copy_template_param_expr(expr, tctx, ...):
    switch expr->kind:
        case 0 (empty):
            return expr // pass through unchanged
        case 1 (operation):
            switch expr->op_code:
                case 116 (type expression):
                    return copy_template_param_type_expr(expr, tctx, ...)
                case 5 (cast):
                    // substitute the cast's target type
                    new_type = copy_template_param_type(expr->target_type, tctx)
                    // recursively substitute the operand
                    new_operand = copy_template_param_expr(expr->operands, tctx, ...)
                    return build_cast_node(new_type, new_operand)
                case 0, 25, 28, 29, 53-57, 71, 72, 87, 88, 103 (unary/simple binary):
                    // substitute operand(s) recursively
                    new_ops = substitute_operand_list(expr->operands, tctx)
                    return build_operation_node(expr->op_code, new_ops, new_type)
                case 26, 27, 39-43, 58-63 (binary with type check):
                    // substitute both operands
                    lhs = copy_template_param_expr(expr->operands[0], tctx, ...)
                    rhs = copy_template_param_expr(expr->operands[1], tctx, ...)
                    // post-substitution type promotion:
                    do_conversions_on_operands_of_copied_template_expr(op, &lhs, &rhs)
                    return build_operation_node(op, lhs, rhs, result_type)
                case 39 (ternary / conditional):
                    cond = copy_template_param_expr(operands[0], ...)
                    true_ = copy_template_param_expr(operands[1], ...)
                    false_ = copy_template_param_expr(operands[2], ...)
                    return build_conditional(cond, true_, false_)
                case 44, 45 (imaginary):
                    internal_error("imaginary operators not implemented")
        case 2 (constant reference):
            return copy_template_param_con(expr->constant, tctx, ...)
        case 3 (variable / entity reference):
            // look up the variable in the substitution context
            subst = find_substitution(tctx, expr->entity)
            if subst found:
                // check type compatibility between expected and actual
                if is_pointer_compatible(expected_type, subst->type):
                    return build_value_from_constant(subst)
                else:
                    apply_type_conversion(subst, expected_type)
            return expr // no substitution needed
        case 5 (function call in constant context):
            // dispatch on call sub-kind
            switch expr->call_subkind:
                case 1: substitute type + validate value
                case 2: delegate to copy_template_param_con
        case 19 (template parameter reference):
            // same logic as case 3 for template parameters
            subst = find_substitution(tctx, expr->template_param)
            return build_substituted_expr(subst)
        case 20 (type constant):
            new_type = copy_template_param_type(expr->type, tctx)
            return build_type_constant_expr(new_type)
        case 21 (builtin operation):
            return copy_template_param_builtin_operation(expr, tctx, ...)
            // asserts no error in process
        case 22 (type reference):
            new_type = copy_template_param_type(expr->type, tctx)
            return build_type_ref_expr(new_type)
        case 23 (expression wrapper):
            inner = copy_template_param_expr(expr->inner, tctx, ...)
            return wrap(inner)
        case 30 (pack expansion):
            return expand_pack(expr, tctx, ...)
        case 31 (dependent entity):
            // complex hash-map based instantiation tracking
            // with get_with_hash / vector_insert / entity list processing
            return instantiate_dependent_entity(expr, tctx, ...)
Post-substitution type conversions. The internal helper do_conversions_on_operands_of_copied_template_expr (at il.c line 18885) handles arithmetic promotions that must occur after template parameter substitution. For example, T + U where T=int and U=double requires promoting the int operand to double -- this promotion is not present in the template definition's expression tree because the types are unknown there. The function handles:
- Shift operators (ops 53-54): promote the result type to the promoted LHS type.
- Comparison operators (ops 58-63): compute the common type and apply usual arithmetic conversions.
- Arithmetic operators (default): compute common type via sub_5657C0 and insert implicit conversion nodes.
- Imaginary operators (ops 44-45): explicitly not implemented (triggers internal_error).
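The three implemented cases can be modelled with a simplified type ranking. This sketch is an assumption-laden illustration: the `RANK` table, the `(type, expr)` operand tuples, and the `do_conversions` name are hypothetical, integer promotion is omitted, and the `bool` result for comparisons follows standard C++ rules rather than anything recovered from the binary.

```python
RANK = {"int": 0, "long": 1, "float": 2, "double": 3}   # simplified ranking
COMPARISONS = ("<", "<=", ">", ">=", "==", "!=")

def common_type(lhs_type, rhs_type):
    # The higher-ranked type wins, as in the usual arithmetic conversions.
    return lhs_type if RANK[lhs_type] >= RANK[rhs_type] else rhs_type

def convert(operand, target):
    # Wrap the operand in an implicit cast node only when its type differs.
    type_, expr = operand
    return operand if type_ == target else (target, ("cast", target, expr))

def do_conversions(op, lhs, rhs):
    if op in ("<<", ">>"):
        return lhs, rhs, lhs[0]          # shifts: result takes the LHS type
    target = common_type(lhs[0], rhs[0])
    lhs, rhs = convert(lhs, target), convert(rhs, target)
    result_type = "bool" if op in COMPARISONS else target
    return lhs, rhs, result_type

# T + U with T=int, U=double: the int operand gains a cast to double.
print(do_conversions("+", ("int", "a"), ("double", "b")))
```

The cast node appears only after substitution because, inside the template definition, neither operand type is known and no common type can be computed.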
copy_template_param_con (sub_5DE290)
The constant-level substitution dispatcher. At 819 lines, it handles the case where a constant node in a template definition contains a reference to a template parameter:
copy_template_param_con(constant, tctx, expected_type, scope, flags,
error_flag, diag, scratch) -> constant*:
switch constant->constant_kind:
case 12 (template parameter constant):
// this is the core case -- the constant IS a template parameter
switch constant->template_param_kind:
case 0 (value parameter):
// look up the bound value in the template argument list
binding = lookup_template_arg(tctx, constant->param_index)
if binding is a pack:
return expand_pack_element(binding, ...)
return binding->value_constant
case 1 (expression parameter):
// try overload resolution first
result = copy_template_param_con_overload_resolution(...)
if result: return result
// fall back to full expression-level substitution
return copy_template_param_expr(constant->expr, tctx, ...)
case 2 (non-member entity parameter):
return copy_template_param_unknown_entity_con(constant, FALSE, ...)
case 3 (member entity parameter):
return copy_template_param_unknown_entity_con(constant, TRUE, ...)
case 4 (nested constant parameter):
return copy_template_param_con(constant->inner, tctx, ...)
case 5-10 (scalar value parameters: sizeof, alignof, etc.):
// look up the substitution via sub_5BFB80
// perform type equality check
// apply type conversions if needed
return substituted_scalar_constant(...)
case 11 (entity + argument pack):
// entity substitution with argument list processing
return substitute_entity_with_args(...)
case 12 (nested recursive):
return copy_template_param_con(constant->inner, tctx, ...)
case 6 (address/aggregate constant):
switch constant->address_kind:
case 3 (function call):
// substitute callee type + each argument recursively
callee_type = copy_template_param_type(constant->callee_type, tctx)
for each arg in constant->args:
arg = copy_template_param_con(arg, tctx, ...)
return build_call_constant(callee_type, args)
default:
if is_dependent_type(constant->type):
return deep_copy_constant(constant)
// handle address-space attribute patterns
case 15 (expression constant):
switch constant->expr_constant_kind:
case 46 (strip_template_arg):
// dispatch on template argument type:
// 0 = type argument -> type substitution
// 1 = value argument -> value substitution
// 2 = template argument -> template substitution
case 6: return type_substitution(...)
case 13: return non_type_param_substitution(...)
case 2: return recursive copy_template_param_con(inner, ...)
default:
internal_error("copy_template_param_con: unexpected kind")
copy_template_param_con_with_substitution (sub_5DFAD0)
The top-level entry point for template constant substitution, called from the template instantiation driver. It manages the IL region switch (moving allocation to file-scope for the duration of instantiation), handles the initial overload-resolution check, and performs post-substitution type normalization:
copy_template_param_con_with_substitution(constant, template_args, scope,
expected_type, access, flags,
error_flag, scratch):
saved_region = current_region
switch_to_file_scope_region() // with debug trace
local_scratch = alloc_local_constant()
// ---- special case: overload resolution for expression parameters ----
if constant->kind == 12 and constant->param_kind == 1:
overload_info = lookup_overload_candidate(constant)
if overload_info:
result = copy_template_param_con_overload_resolution(
constant, overload_info, tctx, ...)
if result: goto post_process
// ---- validate expected type ----
if expected_type is pointer_type:
validate_pointer_binding(expected_type)
// ---- main substitution ----
result = copy_template_param_expr(constant->expr, tctx, ...)
// or: result = copy_template_param_con(constant, tctx, ...)
// depending on whether the constant embeds an expression
post_process:
// ---- post-substitution type normalization ----
if result->type is pointer_type:
validate_binding(result)
result = try_implicit_conversion(result)
elif result->type is array_type:
result = try_implicit_conversion(result)
result = array_to_pointer_decay(result)
elif result->type is function_type:
result = try_implicit_conversion(result)
result = function_to_pointer_decay(result)
else:
result = general_conversion(result)
// ---- handle deferred instantiation ----
if is_deferred_instantiation(result):
copy_deferred_data_into_scratch(result, scratch)
restore_region(saved_region)
free_local_constant(local_scratch)
return result
Supporting Functions
| Function | Address | Lines | Purpose |
|---|---|---|---|
| copy_template_param_type_expr | sub_5DDEB0 | 82 | Handles op=116 type expressions within template substitution; extracts and substitutes the type, checks dependent-type status |
| copy_template_param_expr_list | sub_5DE010 | 77 | Iterates an expression linked list, calling copy_template_param_expr on each element; shares a single scratch constant across all iterations |
| copy_template_param_value_expr | sub_5DE1A0 | 55 | Single-expression variant; passes the expression's own type as the expected type |
| copy_template_param_con_overload_resolution | sub_5DF6A0 | 180 | Attempts overload resolution during template substitution when the template parameter refers to a set of overloaded functions; validates result type compatibility |
| copy_template_param_unknown_entity_con | sub_5DB420 | 213 | Handles entity constants where the entity kind is not known until substitution time (using declarations, namespace aliases, variables, templates, types) |
Part 5: Data Flow Between the Subsystems
The four subsystems interact in a specific calling pattern during template instantiation:
Template Instantiation Driver
|
+-> copy_template_param_con_with_substitution (entry point)
|
+-> copy_template_param_expr (expression-level dispatch)
| |
| +-> copy_template_param_con (constant-level dispatch)
| | |
| | +-> copy_template_param_unknown_entity_con
| | +-> copy_template_param_con_overload_resolution
| | +-> [recursive: copy_template_param_expr]
| | +-> [recursive: copy_template_param_con]
| |
| +-> copy_template_param_type (type-level, in type.c)
| +-> copy_template_param_type_expr
| +-> copy_template_param_expr_list
| +-> copy_template_param_builtin_operation
|
+-> alloc_shareable_constant (deduplication on output)
|
+-> compare_constants (hash table equality check)
+-> fixup_constant_references
The comparison engine is not called during the copy itself -- it is called only at the end, when the newly constructed constants are passed through alloc_shareable_constant for deduplication. This means the copy engine may temporarily create duplicate constants that are later merged by the sharing infrastructure. The design separates concerns: the copy engine focuses on correctness (producing a valid substituted tree), while the sharing engine focuses on efficiency (deduplicating identical results).
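The copy-then-deduplicate split can be sketched as a small interning table. Everything here is a simplification under stated assumptions: the `constant` struct, `hash_constant`, `constants_equal`, and `share_constant` are hypothetical stand-ins, and the hash function is invented for the example. Only the 2039-bucket count comes from the analysis.

```c
#include <stddef.h>
#include <stdlib.h>

#define SHARE_BUCKETS 2039  /* bucket count reported for the sharing table */

/* Hypothetical, heavily simplified constant node. */
typedef struct constant {
    int kind;
    long value;
    struct constant *next_in_bucket;
} constant;

static constant *share_table[SHARE_BUCKETS];

/* Assumed hash; the hash used by the binary is not reproduced here. */
static size_t hash_constant(const constant *c) {
    unsigned long h = (unsigned long)c->kind * 31u + (unsigned long)c->value;
    return (size_t)(h % SHARE_BUCKETS);
}

/* Structural equality, standing in for compare_constants. */
static int constants_equal(const constant *a, const constant *b) {
    return a->kind == b->kind && a->value == b->value;
}

/* Return the canonical node for tmp: an existing equal node if one is
 * already in the table, otherwise a fresh heap copy that is inserted. */
static constant *share_constant(const constant *tmp) {
    size_t b = hash_constant(tmp);
    for (constant *p = share_table[b]; p; p = p->next_in_bucket)
        if (constants_equal(p, tmp))
            return p;
    constant *n = malloc(sizeof *n);
    if (!n)
        return NULL;
    *n = *tmp;
    n->next_in_bucket = share_table[b];
    share_table[b] = n;
    return n;
}
```

A copy pass can build temporary constants freely and hand each one to `share_constant` at the end; equal temporaries collapse to one canonical pointer, so the comparison cost is paid once per finished constant rather than during the copy.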
Part 6: Initialization and Reset
il_one_time_init (sub_5CF7F0)
Called once at program startup. Validates that seven name-table arrays end with the "last" sentinel string, checks the sizeof_il_entry guard value (9999), and initializes 60+ allocation pools via pool_init (sub_7A3C00) with element sizes ranging from 1 byte to 1344 bytes. Conditionally initializes C++-mode pools (guarded by dword_106BF68 || dword_106BF58).
il_init (sub_5CFE20)
Called at the start of each translation unit. Zeroes all global pool heads, allocates and zeroes the two hash tables:
- Character type table: 3240 bytes at qword_126F2F8 (5 character types x 81 slots = 405 entries, 8 bytes each).
- Constant sharing table: 16312 bytes at qword_126F228 (2039 buckets, 8 bytes each).
Sets the three sharing mode bytes (byte_126E558, byte_126E559, byte_126E55A) to 3 (all sharing enabled), and tail-calls il_init_float_constants (sub_5EAF00).
il_reset_secondary_pools (sub_5D0170)
Zeroes ~80 qword globals in the 0x126F680--0x126F978 range. These are transient counters, list heads, and cached type pointers used during template instantiation. Called separately from il_init, suggesting it resets state between instantiation passes within the same translation unit.
Address Map
| Address | Function | Lines | Role |
|---|---|---|---|
| 0x5CF7F0 | il_one_time_init | ~200 | One-time startup validation + pool init |
| 0x5CFE20 | il_init | ~100 | Per-TU hash table allocation + state reset |
| 0x5D0170 | il_reset_secondary_pools | ~40 | Reset instantiation-transient state |
| 0x5D0750 | compare_expressions | 588 | Expression tree structural equality |
| 0x5D1320 | compare_expressions_for_equivalence | ~10 | Thin wrapper (flags=4) |
| 0x5D1350 | compare_constants | 525 | Constant structural equality, 16 kinds |
| 0x5D1FE0 | compare_dynamic_inits | ~80 | Dynamic init comparison |
| 0x5D2160 | compare_constants_default | ~5 | Thin wrapper (flags=0) |
| 0x5D2170 | expr_tree_contains_template_param_constant | ~50 | Template param presence check |
| 0x5D2210 | constant_is_shareable | ~100 | Shareability predicate |
| 0x5D2390 | alloc_shareable_constant | ~200 | Hash table deduplication allocator |
| 0x5D2890 | alloc_il_entry_from_constant | ~20 | Wraps constant in IL entry |
| 0x5D2F90 | i_copy_expr_tree | 424 | Expression tree deep copy (35-case switch) |
| 0x5D38C0 | i_copy_list_of_expr_trees | ~40 | Linked-list copy helper |
| 0x5D3940 | copy_expr_tree | ~30 | Public wrapper with cleanup |
| 0x5D39A0 | fixup_constant_references | ~80 | Post-copy pointer fixup |
| 0x5D3B90 | i_copy_constant_full | 305 | Constant deep copy (16-kind switch) |
| 0x5D4300 | copy_constant_full | ~20 | Public wrapper with cleanup |
| 0x5D47A0 | i_copy_dynamic_init | ~150 | Dynamic init deep copy |
| 0x5D4C00 | copy_lambda_capture | ~60 | Lambda capture list copy |
| 0x5D4DB0 | alloc_constant | ~150 | Non-shared constant allocation with kind-specific cleanup |
| 0x5DBAB0 | intern_string_constant | ~92 | String literal interning via hash table |
| 0x5DC000 | copy_template_param_expr | 1416 | Template substitution -- expression dispatcher |
| 0x5DDEB0 | copy_template_param_type_expr | 82 | Template substitution -- type expressions |
| 0x5DE010 | copy_template_param_expr_list | 77 | Template substitution -- expression list |
| 0x5DE1A0 | copy_template_param_value_expr | 55 | Template substitution -- single value expr |
| 0x5DE290 | copy_template_param_con | 819 | Template substitution -- constant dispatcher |
| 0x5DF6A0 | copy_template_param_con_overload_resolution | 180 | Template substitution -- overload resolution |
| 0x5DFAD0 | copy_template_param_con_with_substitution | 288 | Template substitution -- top-level entry |
.int.c File Format
When cudafe++ processes a CUDA source file, the backend code generator emits a transformed C++ translation called the .int.c file (short for "intermediate C"). This is the host-side output that the downstream host compiler (GCC, Clang, or MSVC) will compile. The file preserves all host-visible declarations from the original source but replaces device code with stubs, injects CUDA runtime boilerplate, and appends registration tables and anonymous namespace support. The entire emission is driven by process_file_scope_entities (sub_489000), a 723-line function in cp_gen_be.c that serves as the backend entry point. It initializes output state, opens the output stream, emits a fixed sequence of preamble sections, walks the EDG intermediate language source sequence to generate the transformed C++ body, then appends a fixed trailer with _NV_ANON_NAMESPACE handling, #pragma pack() for MSVC, and CUDA host reference arrays.
Key Facts
| Property | Value |
|---|---|
| Backend entry point | sub_489000 (process_file_scope_entities, 723 lines) |
| EDG source file | cp_gen_be.c (lines 19916-26628) |
| Default output name | <input>.int.c (via sub_5ADD90 string concatenation) |
| Output override global | qword_106BF20 (set by CLI flag gen_c_file_name, case 45) |
| Stdout sentinel | "-" (output filename compared character-by-character) |
| Output stream global | stream (FILE pointer at fixed address) |
| Line counter | dword_1065820 (incremented on every \n) |
| Column counter | dword_106581C (character position within current line) |
| Indent level | dword_1065834 (decremented with -- around directive blocks) |
| Needs-line-directive flag | dword_1065818 (triggers #line emission before next output) |
| Source sequence cursor | qword_1065748 (current IL entry being processed) |
| Device stub mode toggle | dword_1065850 (0=normal, 1=generating __wrapper__device_stub_) |
| Empty file guard string | "int __dummy_to_avoid_empty_file;" at 0x83AED8 |
| Anon namespace macro string | "_NV_ANON_NAMESPACE" at 0x83AF45 |
| Managed RT boilerplate | inline static functions for __managed__ variable support |
Output File Naming
The output filename is determined by three inputs, checked in order:
// sub_489000, decompiled lines 153-177
char *input_name = qword_126EEE0; // source filename from CLI
// 1. Check for stdout mode
if (strcmp(input_name, "-") == 0) {
stream = stdout;
}
else {
// 2. Check for explicit output name override
char *output_name = qword_106BF20;
if (!output_name)
// 3. Default: append ".int.c" to input filename
output_name = sub_5ADD90(input_name, ".int.c");
stream = sub_4F48F0(output_name, 0, 0, 0, 1701); // open for writing
}
The - sentinel enables piping cudafe++ output to stdout for debugging or toolchain integration. The qword_106BF20 override is set by the gen_c_file_name CLI option (case 45 in the CLI parser at sub_459630), allowing nvcc to specify an explicit output path. The default .int.c suffix means a file kernel.cu produces kernel.cu.int.c.
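The three-way decision can be written as a small standalone function. This is a sketch under assumptions: `resolve_output_name`, `out_kind`, and the caller-supplied buffer are invented for the example, and `snprintf` stands in for the concatenation helper sub_5ADD90.

```c
#include <stdio.h>
#include <string.h>

typedef enum { OUT_STDOUT, OUT_FILE, OUT_TOO_LONG } out_kind;

/* Mirror the order of checks in the listing: "-" selects stdout, an
 * explicit override wins next, otherwise ".int.c" is appended to the
 * input name. The resulting path is written into buf. */
static out_kind resolve_output_name(const char *input_name,
                                    const char *override_name,
                                    char *buf, size_t buf_size) {
    if (strcmp(input_name, "-") == 0)
        return OUT_STDOUT;
    int n = override_name
        ? snprintf(buf, buf_size, "%s", override_name)
        : snprintf(buf, buf_size, "%s.int.c", input_name);
    return (n < 0 || (size_t)n >= buf_size) ? OUT_TOO_LONG : OUT_FILE;
}
```

Calling it with "kernel.cu" and no override fills the buffer with kernel.cu.int.c; passing an override path returns that path unchanged.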
Complete .int.c File Structure
A fully-generated .int.c file follows this fixed section ordering, top to bottom:
+------------------------------------------------------------------+
| 1. #line directive (initial source position) |
+------------------------------------------------------------------+
| 2. #pragma GCC diagnostic ignored "-Wunused-local-typedefs" |
| #pragma GCC diagnostic ignored "-Wattributes" |
+------------------------------------------------------------------+
| 3. #pragma GCC diagnostic push |
| #pragma GCC diagnostic ignored "-Wunused-variable" |
| #pragma GCC diagnostic ignored "-Wunused-function" |
+------------------------------------------------------------------+
| 4. Managed runtime boilerplate |
| (static __nv_inited_managed_rt, __nv_init_managed_rt, etc.) |
+------------------------------------------------------------------+
| 5. #pragma GCC diagnostic pop |
+------------------------------------------------------------------+
| 6. #pragma GCC diagnostic ignored "-Wunused-variable" |
| #pragma GCC diagnostic ignored "-Wunused-private-field" |
| #pragma GCC diagnostic ignored "-Wunused-parameter" |
+------------------------------------------------------------------+
| 7. Extended lambda macro definitions (or #define false stubs) |
+------------------------------------------------------------------+
| 8. MAIN BODY: transformed C++ from source sequence walk |
| - #include "crt/host_runtime.h" (injected at first CUDA type) |
| - Device stubs for __global__ kernels |
| - #if 0 / #endif around device-only code |
| - All host-visible declarations, types, functions |
+------------------------------------------------------------------+
| 9. Empty file guard (if no entities generated) |
+------------------------------------------------------------------+
| 10. Breakpoint placeholders (debug builds only) |
+------------------------------------------------------------------+
| 11. _NV_ANON_NAMESPACE define / include / undef trick |
+------------------------------------------------------------------+
| 12. #pragma pack() (MSVC only) |
+------------------------------------------------------------------+
| 13. Module ID file output (if dword_106BFB8 set) |
+------------------------------------------------------------------+
| 14. Host reference arrays (.nvHRKI, .nvHRDE, etc.) |
+------------------------------------------------------------------+
Section 1: Initial #line Directive
After opening the output stream, sub_489000 emits a #line directive via sub_46D1A0 to establish the initial source mapping. This directive points the host compiler's diagnostic messages back to the original .cu file:
// sub_489000, decompiled lines 283-287
sub_46D1A0(v10, v11); // emit #line <number> "<filename>"
The #line directive format depends on the host compiler. For GCC/Clang hosts (dword_126E1F8 set), the line keyword is omitted (producing # 1 "file.cu"). For MSVC hosts (dword_126E1D8 set), the full #line 1 "file.cu" form is used. This pattern recurs throughout the file wherever source position changes.
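The two directive spellings differ only in whether the line keyword follows the hash. A minimal formatter, assuming a hypothetical `format_line_directive` helper and ignoring whatever path escaping the real emitter applies, could look like this:

```c
#include <stdio.h>

/* gnu_style_host plays the role of dword_126E1F8: when it is nonzero the
 * "line" keyword is omitted. The file name is copied verbatim; no path
 * transformation or escaping is modeled here. */
static int format_line_directive(char *buf, size_t size, int gnu_style_host,
                                 unsigned line, const char *file) {
    return snprintf(buf, size, "#%s %u \"%s\"",
                    gnu_style_host ? "" : "line", line, file);
}
```

With line 42 and file kernel.cu, the GCC/Clang form is # 42 "kernel.cu" and the MSVC form is #line 42 "kernel.cu".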
Section 2-6: Diagnostic Suppressions
The preamble contains a layered set of #pragma GCC diagnostic directives that suppress warnings the host compiler would otherwise emit on the generated code. The exact set depends on which host compiler is active and its version.
Suppression Decisions
The conditions controlling each suppression are checked against host compiler identification globals:
| Global | Meaning |
|---|---|
| dword_126E1E8 | Host is Clang |
| dword_126E1F8 | Host is GCC (including Clang in GCC-compat mode) |
| dword_126E1D8 | Host is MSVC |
| qword_126EF90 | Clang version number |
| qword_126E1F0 | GCC/Clang version number |
| dword_106BF6C | Alternative host compiler mode |
| dword_106BF68 | Secondary host compiler flag |
-Wunused-local-typedefs
Emitted early, outside any push/pop scope:
// sub_489000, decompiled lines 182-187
if ((dword_126E1E8 && qword_126EF90 > 0x7787) // Clang > 30599
|| (!dword_106BF6C && !dword_106BF68
&& dword_126E1F8 && qword_126E1F0 > 0x9F5F)) // GCC > 40799
{
emit("#pragma GCC diagnostic ignored \"-Wunused-local-typedefs\"");
}
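This targets GCC 4.8+ and Clang 3.6+, which introduced the -Wunused-local-typedefs warning (assuming the usual major*10000 + minor*100 + patch encoding, > 40799 means 4.8.0 or later and > 30599 means 3.6.0 or later). CUDA template machinery frequently creates local typedefs that are used only by device code (suppressed in #if 0 blocks), triggering spurious warnings.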
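The threshold test can be reproduced in isolation. The sketch assumes the version globals hold major*10000 + minor*100 + patch, which fits the decimal values in the listing but is not confirmed by the binary; `encode_version` and `wants_unused_local_typedefs_pragma` are hypothetical names.

```c
/* Assumed encoding of the host compiler version globals. */
static long encode_version(int major, int minor, int patch) {
    return major * 10000L + minor * 100L + patch;
}

/* Same boolean structure as the decompiled condition; the parameters
 * stand in for dword_126E1E8, qword_126EF90, dword_126E1F8,
 * qword_126E1F0, dword_106BF6C and dword_106BF68 respectively. */
static int wants_unused_local_typedefs_pragma(int is_clang, long clang_version,
                                              int is_gcc, long gcc_version,
                                              int alt_mode, int secondary_flag) {
    return (is_clang && clang_version > 0x7787)          /* > 30599 */
        || (!alt_mode && !secondary_flag
            && is_gcc && gcc_version > 0x9F5F);          /* > 40799 */
}
```

Under that encoding, GCC 4.7.4 (40704) stays below the threshold while GCC 4.8.0 (40800) crosses it, and setting either of the two mode flags disables the GCC branch regardless of version.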
-Wattributes
// sub_489000, decompiled lines 188-189
if (dword_126EFA8 && dword_106C07C)
emit("\n#pragma GCC diagnostic ignored \"-Wattributes\"\n");
Suppresses warnings about unknown or ignored __attribute__ annotations. Emitted when CUDA-specific attribute processing is active (dword_126EFA8) and a secondary flag (dword_106C07C) indicates the host compiler would reject CUDA-specific attributes.
Push/Pop Block with -Wunused-variable and -Wunused-function
The managed runtime boilerplate (section 4) is wrapped in a diagnostic push/pop block:
// sub_489000, decompiled lines 190-234
emit("#pragma GCC diagnostic push");
emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");
emit("#pragma GCC diagnostic ignored \"-Wunused-function\"");
// ... managed runtime boilerplate here ...
emit("#pragma GCC diagnostic pop");
The push/pop scope isolates these suppressions to the managed runtime code. The conditions for emitting this block check Clang presence (dword_126E1E8), or GCC version > 40599 (qword_126E1F0 > 0x9E97). The managed runtime functions are static and may be unused in translation units without __managed__ variables.
Post-Pop File-Level Suppressions
After the pop, additional file-scoped suppressions are emitted that remain active for the rest of the file:
// sub_489000, decompiled lines 243-250
emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"\n");
if (dword_126E1E8) { // Clang only
emit("#pragma GCC diagnostic ignored \"-Wunused-private-field\"\n");
emit("#pragma GCC diagnostic ignored \"-Wunused-parameter\"\n");
}
The -Wunused-private-field and -Wunused-parameter suppressions are Clang-specific. GCC does not have -Wunused-private-field, and GCC's -Wunused-parameter behavior differs.
Summary of All Suppressions
| Warning | Scope | Host Compiler | Version Threshold |
|---|---|---|---|
| -Wunused-local-typedefs | File-level | Clang, GCC | Clang > 30599, GCC > 40799 |
| -Wattributes | File-level | GCC/Clang | When CUDA attrs active |
| -Wunused-variable | Push/pop block | Clang, GCC > 40599 | Around managed RT only |
| -Wunused-function | Push/pop block | Clang, GCC > 40599 | Around managed RT only |
| -Wunused-variable | File-level | Clang, GCC >= 40199 | Rest of file |
| -Wunused-private-field | File-level | Clang only | Always |
| -Wunused-parameter | File-level | Clang only | Always |
Section 7: Extended Lambda Macros
When extended lambda mode is NOT active (dword_106BF38 == 0), three stub macros are defined:
// sub_489000, decompiled lines 259-264
emit("#define __nv_is_extended_device_lambda_closure_type(X) false\n");
emit("#define __nv_is_extended_host_device_lambda_closure_type(X) false\n");
emit("#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false\n");
emit("#if defined(__nv_is_extended_device_lambda_closure_type)"
" && defined(__nv_is_extended_host_device_lambda_closure_type)"
"&& defined(__nv_is_extended_device_lambda_with_preserved_return_type)\n"
"#endif\n");
These macros are consumed by crt/host_runtime.h to conditionally compile lambda wrapper infrastructure. When extended lambdas are disabled, all three evaluate to false, causing the runtime header to skip lambda wrapper code. The #if defined(...) && defined(...) block that immediately follows tests that all three macros are defined, but its body is empty: whichever way the condition evaluates, the block contributes nothing to the host compilation and cannot by itself raise an error if some other header has #undef'd them.
When extended lambda mode IS active (dword_106BF38 != 0), these defines are skipped entirely. The lambda preamble injection system (via sub_6BCC20) provides the real implementations later in the main body.
Section 8: Main Body -- Source Sequence Walk
The main body is generated by iterating the global source sequence list (qword_1065748), which is a linked list of EDG IL entries representing every top-level declaration in the translation unit. For each entry, the backend dispatches to sub_47ECC0 (gen_template / process_source_sequence), which handles all declaration kinds:
// sub_489000, decompiled lines 288-316 (simplified)
while (qword_1065748) {
entry = qword_1065748;
kind = entry->kind; // byte at offset +16
if (kind == 57) {
// Pragma interleaving -- handled inline
handle_pragma(entry);
} else if (kind == 52) {
// End-of-construct -- should not appear at top level
fatal_error("Top-level end-of-construct entry");
} else {
entities_generated = 1;
sub_47ECC0(0); // gen_template at recursion level 0
}
}
During this walk, several CUDA-specific injections occur:
- #include "crt/host_runtime.h" -- injected by sub_4864F0 (gen_type_decl) or sub_47ECC0 when the first CUDA-tagged entity at global scope is encountered. The flag dword_E85700 prevents duplicate inclusion.
- Device stub pairs -- __global__ kernel functions trigger two calls to gen_routine_decl (sub_47BFD0): first the forwarding body, then the static cudaLaunchKernel placeholder, controlled by the dword_1065850 toggle.
- #if 0 / #endif guards -- device-only declarations are wrapped in preprocessor guards to hide them from the host compiler.
- Interleaved pragmas -- source sequence entries of kind 57 represent #pragma directives from the original source (including #pragma pack, #pragma STDC, and user pragmas), which are re-emitted at their original positions.
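The dispatch on entry kind can be modeled with a plain linked list. This is a sketch only: `seq_entry`, `walk_result`, and `walk_source_sequence` are hypothetical, the cursor is advanced explicitly here (the simplified listing above leaves that to the handlers), and the only kind values taken from the analysis are 57 and 52.

```c
#include <stddef.h>

/* Hypothetical stand-in for a source sequence entry. */
typedef struct seq_entry {
    unsigned char kind;
    struct seq_entry *next;
} seq_entry;

typedef struct {
    int declarations;      /* entries handed to the declaration generator */
    int pragmas;           /* kind 57 entries re-emitted in place */
    int stray_end_markers; /* kind 52 at top level, treated as fatal */
} walk_result;

static walk_result walk_source_sequence(const seq_entry *cursor) {
    walk_result r = { 0, 0, 0 };
    for (; cursor; cursor = cursor->next) {
        if (cursor->kind == 57) {
            r.pragmas++;
        } else if (cursor->kind == 52) {
            r.stray_end_markers++;
            break;  /* the real backend reports a fatal error here */
        } else {
            r.declarations++;  /* corresponds to entities_generated = 1 */
        }
    }
    return r;
}
```

A sequence that contains only pragma entries leaves `declarations` at zero, which corresponds to the entities_generated flag never being set.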
Section 9: Empty File Guard
If the source sequence walk produced no entities (v12 == 0 in the decompilation, written as entities_generated in the listings here) and the compilation is not in pure CUDA mode (dword_126EFB4 != 2), a dummy declaration is emitted to prevent the host compiler from rejecting an empty translation unit:
// sub_489000, decompiled lines 565-569
if (!entities_generated && dword_126EFB4 != 2) {
emit("int __dummy_to_avoid_empty_file;");
newline();
}
Some host compilers (notably older GCC versions) produce warnings or errors on completely empty .c files. The int __dummy_to_avoid_empty_file; line is a minimal valid C/C++ declaration that avoids this.
Section 10: Breakpoint Placeholders
When the deferred function list (qword_1065840) is non-empty, the backend emits one breakpoint placeholder function per entry. These are used for debugger support in whole-program compilation mode:
// sub_489000, decompiled lines 573-651 (simplified)
node = qword_1065840; // linked list of deferred functions
index = 0;
while (node) {
emit("static __attribute__((used)) void __nv_breakpoint_placeholder");
emit_decimal(index);
putc('_', stream);
if (node->name)
emit(node->name);
emit("(void) ");
// Set source position from node
set_source_position(node->source_start);
emit("{ ");
set_source_position(node->source_end);
emit("exit(0); }");
node = node->next;
index++;
}
Each placeholder has the form static __attribute__((used)) void __nv_breakpoint_placeholderN_funcname(void) { exit(0); }. The __attribute__((used)) tells the host compiler to emit each function even though it is static and never referenced, so it is not discarded before the debugger can see it. The debugger uses their addresses to set breakpoints on device functions that have been stripped from the host binary.
The deferred list is populated by gen_routine_decl when dword_106BFBC (whole-program mode) is set and dword_106BFDC is clear -- device-only functions that need host-side breakpoint anchors are pushed onto this list rather than receiving dummy bodies inline.
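The text of one placeholder can be assembled with a single format string. The helper name `format_breakpoint_placeholder` is hypothetical, and the sketch leaves out the source-position updates that the real emitter performs between the pieces.

```c
#include <stdio.h>

/* Build one placeholder definition. A NULL name yields an empty name
 * part, matching the "if (node->name)" guard in the listing. */
static int format_breakpoint_placeholder(char *buf, size_t size,
                                         unsigned index, const char *name) {
    return snprintf(buf, size,
        "static __attribute__((used)) void "
        "__nv_breakpoint_placeholder%u_%s(void) { exit(0); }",
        index, name ? name : "");
}
```

The index comes first and the function name second, so two deferred entries with the same name still produce distinct symbols.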
Section 11: _NV_ANON_NAMESPACE Trick
The trailer contains a fixed sequence (steps 1 through 5 in the listing below) that handles C++ anonymous namespace mangling for CUDA. Anonymous namespaces in C++ create translation-unit-local symbols, but CUDA device code requires globally unique symbol names (because device code from multiple TUs is linked together by the device linker). The _NV_ANON_NAMESPACE mechanism assigns a globally unique identifier to each TU's anonymous namespace.
Step-by-Step Emission
// sub_489000, decompiled lines 654-710
// Step 1: #line back to original source
emit("#");
if (!dword_126E1F8) // MSVC: include "line" keyword
emit("line");
emit(" 1 \"");
emit(path_transform(qword_106BF88)); // original source file path
emit("\"");
// Step 2: #define _NV_ANON_NAMESPACE <hash>
emit("#define ");
emit("_NV_ANON_NAMESPACE");
emit(" ");
emit(sub_6BC7E0()); // generate unique hash string
newline();
// Step 3: #ifdef / #endif (force inclusion check)
emit("#ifdef ");
emit("_NV_ANON_NAMESPACE");
newline();
emit("#endif");
newline();
// Step 3b: #pragma pack() for MSVC
if (dword_126E1D8) { // MSVC host
emit("#pragma pack()");
newline();
}
// Step 4: #include "<original_file>"
emit("#");
if (!dword_126E1F8)
emit("line");
emit(" 1 \"");
emit(path_transform(qword_106BF88));
emit("\"");
newline();
emit("#include ");
emit("\"");
emit(path_transform(qword_106BF88));
emit("\"");
newline();
// Step 5: Reset #line and #undef
emit("#");
if (!dword_126E1F8)
emit("line");
emit(" 1 \"");
emit(path_transform(qword_106BF88));
emit("\"");
newline();
emit("#undef ");
emit("_NV_ANON_NAMESPACE");
newline();
The Hash Generator (sub_6BC7E0)
The _NV_ANON_NAMESPACE value is produced by sub_6BC7E0, which constructs the string _GLOBAL__N_ followed by the module ID hash:
// sub_6BC7E0 (20 lines)
if (cached_result)
return cached_result;
char *module_id = sub_5AF830(0); // compute CRC32-based module ID
size_t len = strlen(module_id);
char *result = allocate(len + 12);
strcpy(result, "_GLOBAL__N_");
strcpy(result + 11, module_id);
cached_result = result;
return result;
The module ID (sub_5AF830) is a CRC32-based hash incorporating the source filename, compiler options, file modification time, and process ID. This produces values like _GLOBAL__N_1a2b3c4d5e6f7890, unique enough to avoid collisions between TUs. Because the listed inputs include the file modification time and the process ID, the value should not be expected to stay identical across otherwise identical rebuilds.
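A compilable version of the prefix-and-cache logic is short. In this sketch the module ID is passed in as a parameter rather than computed, since sub_5AF830 is not reproduced here; `anon_namespace_name` and `cached_anon_name` are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

static char *cached_anon_name;  /* plays the role of cached_result */

/* Return "_GLOBAL__N_" followed by module_id. The first result is cached
 * and returned on every later call, whatever module_id is passed then. */
static const char *anon_namespace_name(const char *module_id) {
    static const char prefix[] = "_GLOBAL__N_";
    if (cached_anon_name)
        return cached_anon_name;
    size_t plen = sizeof prefix - 1;          /* 11 characters */
    size_t len = strlen(module_id);
    char *result = malloc(plen + len + 1);    /* same size as len + 12 */
    if (!result)
        return NULL;
    memcpy(result, prefix, plen);
    memcpy(result + plen, module_id, len + 1); /* includes the NUL */
    cached_anon_name = result;
    return result;
}
```

The len + 12 allocation in the decompiled listing is exactly the 11-character prefix plus the module ID plus one terminating NUL.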
Why the Define/Include/Undef Sequence
The three-step define/include/undef pattern serves a specific purpose:
- #define _NV_ANON_NAMESPACE <hash> -- establishes the macro before the source file is re-included.
- #include "<original_file>" -- re-includes the original .cu source. During this second inclusion, any code inside anonymous namespaces that uses _NV_ANON_NAMESPACE gets the unique hash substituted, producing globally unique symbol names for device code.
- #undef _NV_ANON_NAMESPACE -- cleans up the macro after inclusion.
The #ifdef _NV_ANON_NAMESPACE / #endif block between define and include has an empty body. It tests that the macro is defined but has no effect on the host compilation either way, so it cannot stop the sequence if the macro were missing.
This mechanism works in conjunction with the EDG frontend's anonymous namespace handling. When the frontend encounters namespace { ... } containing device code, it generates references to _NV_ANON_NAMESPACE that become concrete identifiers during the re-inclusion pass. The name mangling in the demangler (sub_7CA140, sub_7C5650, sub_7C4E80) also uses _NV_ANON_NAMESPACE to produce consistent mangled names.
Section 12: #pragma pack() for MSVC
When the host compiler is MSVC (dword_126E1D8 set), a bare #pragma pack() is emitted to reset the packing alignment to the compiler default:
// sub_489000, decompiled lines 676-681
if (dword_126E1D8) {
emit("#pragma pack()");
newline();
}
This reset ensures that any #pragma pack(N) directive left active by the original source or by included CUDA headers does not carry over into the code that follows in the same translation unit; in the emission order shown for Section 11, the pragma is written just before the source file is re-included. The emission is guarded by the MSVC flag, so GCC and Clang hosts never see it.
Section 13-14: Module ID and Host Reference Arrays
The final two sections are conditional:
Module ID output (sub_5B0180): When dword_106BFB8 is set, the module ID string (the same CRC32-based hash from sub_5AF830) is written to a separate file. This ID is used by the CUDA runtime to match host-side registration code with the device fatbinary.
Host reference arrays (sub_6BCF80): When dword_106BFD0 (device registration) or dword_106BFCC (constant registration) is set, six calls to sub_6BCF80 emit ELF section declarations for host reference arrays:
// sub_489000, decompiled lines 713-721
// nv_emit_host_reference_array(emit_fn, is_kernel, is_device, is_internal)
sub_6BCF80(emit_callback, 1, 0, 1); // kernel, internal -> .nvHRKI
sub_6BCF80(emit_callback, 1, 0, 0); // kernel, external -> .nvHRKE
sub_6BCF80(emit_callback, 0, 1, 1); // device, internal -> .nvHRDI
sub_6BCF80(emit_callback, 0, 1, 0); // device, external -> .nvHRDE
sub_6BCF80(emit_callback, 0, 0, 1); // constant, internal -> .nvHRCI
sub_6BCF80(emit_callback, 0, 0, 0); // constant, external -> .nvHRCE
These produce extern "C" declarations with __attribute__((section(".nvHRXX"))) annotations, where XX is one of KE, KI, DE, DI, CE, CI (Kernel/Device/Constant + External/Internal). The arrays contain mangled names of device symbols, enabling the CUDA runtime to locate and register them at program startup.
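The mapping from the three flag arguments to a section name follows directly from the six calls listed above. The function name `host_ref_section` is hypothetical; the real sub_6BCF80 also emits the array contents, which this sketch does not attempt.

```c
/* Map (is_kernel, is_device, is_internal) to the section name shown in
 * the call comments: K/D/C for the symbol class, I/E for linkage. */
static const char *host_ref_section(int is_kernel, int is_device,
                                    int is_internal) {
    if (is_kernel)
        return is_internal ? ".nvHRKI" : ".nvHRKE";
    if (is_device)
        return is_internal ? ".nvHRDI" : ".nvHRDE";
    return is_internal ? ".nvHRCI" : ".nvHRCE";  /* constant */
}
```

The constant class is the fall-through case: it is selected when both is_kernel and is_device are zero, which matches the last two calls.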
Complete Example
For a source file kernel.cu containing a single __global__ kernel function and a host function, the generated kernel.cu.int.c looks approximately like this:
# 1 "kernel.cu"
#pragma GCC diagnostic ignored "-Wunused-local-typedefs"
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-function"
static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) {
__nv_fatbinhandle_for_managed_rt = in;
}
static char __nv_init_managed_rt_with_module(void **);
static inline void __nv_init_managed_rt(void) {
__nv_inited_managed_rt = (__nv_inited_managed_rt
? __nv_inited_managed_rt
: __nv_init_managed_rt_with_module(
__nv_fatbinhandle_for_managed_rt));
}
#pragma GCC diagnostic pop
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-private-field"
#pragma GCC diagnostic ignored "-Wunused-parameter"
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
#if defined(__nv_is_extended_device_lambda_closure_type) \
&& defined(__nv_is_extended_host_device_lambda_closure_type) \
&& defined(__nv_is_extended_device_lambda_with_preserved_return_type)
#endif
/* === main body begins here === */
#include "crt/host_runtime.h"
# 5 "kernel.cu"
void host_function(int *data, int n) {
for (int i = 0; i < n; i++) data[i] *= 2;
}
# 10 "kernel.cu"
void my_kernel(float *data, int n) {
::my_kernel::__wrapper__device_stub_my_kernel(data, n);
return;
}
#if 0
/* original __global__ kernel body suppressed */
#endif
static void __wrapper__device_stub_my_kernel(float *data, int n) {
::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
/* === main body ends === */
# 1 "kernel.cu"
#define _NV_ANON_NAMESPACE _GLOBAL__N_a1b2c3d4e5f67890
#ifdef _NV_ANON_NAMESPACE
#endif
# 1 "kernel.cu"
#include "kernel.cu"
# 1 "kernel.cu"
#undef _NV_ANON_NAMESPACE
Initialization State
Before emitting any output, sub_489000 zeroes all output-related global state and clears four large hash tables with memset: three of 0x7FFE0 bytes (524,256, just under 512 KB) and one of 0x5FFE8 bytes (393,192, just under 384 KB). It also sets up a function pointer table (xmmword_1065760 through xmmword_10657B0) containing code generation callbacks:
// sub_489000, decompiled lines 62-97 (summarized)
dword_1065834 = 0; // indent level
stream = NULL; // output file handle
dword_1065820 = 0; // line counter
dword_106581C = 0; // column counter
dword_1065818 = 0; // needs-line-directive
qword_1065748 = 0; // source sequence cursor
qword_1065740 = 0; // alternate cursor
dword_1065850 = 0; // device stub mode
// Clear four hash tables (three just under 512 KB, one just under 384 KB)
memset(&unk_FE5700, 0, 0x7FFE0); // 524,256 bytes
memset(&unk_F65720, 0, 0x7FFE0);
memset(qword_E85720, 0, 0x7FFE0);
memset(&xmmword_F05720, 0, 0x5FFE8); // 393,192 bytes (smaller)
// Callback setup
if (!dword_126DFF0) // not MSVC mode
qword_10657C0 = sub_46BEE0; // gen_be callback
qword_10657C8 = loc_469200; // line directive callback
qword_10657D0 = sub_466F40; // output callback
qword_10657D8 = sub_4686C0; // error callback
#line Directive Protocol
Throughout the file, #line directives maintain the mapping between generated output and original source positions. The emission protocol differs by host compiler:
| Host Compiler | #line Format | Example |
|---|---|---|
| GCC / Clang | # <line> "<file>" | # 42 "kernel.cu" |
| MSVC | #line <line> "<file>" | #line 42 "kernel.cu" |
The dword_1065818 flag (needs_line_directive) is set whenever the current source position changes. Before emitting the next declaration or statement, sub_467DA0 checks this flag and emits a #line directive if needed, then clears the flag. The source position is tracked in two globals: qword_1065810 (pending position) and qword_126EDE8 (current position).
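The flag-and-flush behavior described above can be sketched in C. This is a minimal illustration under stated assumptions, not the decompiled code: needs_line_directive stands in for dword_1065818, while host_is_msvc, mark_position_changed, and flush_line_directive are hypothetical names, and the real emitter writes to the output stream rather than to a caller-supplied buffer.

```c
#include <stdio.h>

static int needs_line_directive;  /* stands in for dword_1065818 */
static int host_is_msvc;          /* hypothetical: selects the MSVC spelling */

/* Called whenever the current source position changes. */
void mark_position_changed(void)
{
    needs_line_directive = 1;
}

/* Formats a pending directive into buf and clears the flag.
 * Returns snprintf's result, or 0 if no directive was pending. */
int flush_line_directive(char *buf, size_t cap, unsigned line, const char *file)
{
    if (!needs_line_directive)
        return 0;
    needs_line_directive = 0;
    if (host_is_msvc)
        return snprintf(buf, cap, "#line %u \"%s\"\n", line, file);
    return snprintf(buf, cap, "# %u \"%s\"\n", line, file);
}
```

Calling flush_line_directive twice without an intervening mark_position_changed produces a directive only the first time, which mirrors the check-emit-clear sequence described for sub_467DA0.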
Function Map
| Address | Name | Role |
|---|---|---|
| sub_489000 | process_file_scope_entities | Backend entry point; orchestrates entire .int.c emission |
| sub_47ECC0 | gen_template / process_source_sequence | Walks source sequence, dispatches all declaration kinds |
| sub_47BFD0 | gen_routine_decl | Function declaration/definition generator; kernel stub logic |
| sub_4864F0 | gen_type_decl | Type declaration generator; injects #include "crt/host_runtime.h" |
| sub_484A40 | gen_variable_decl | Variable declaration generator; managed memory registration |
| sub_467E50 | (emit string) | Primary string emission to output stream |
| sub_468190 | (emit raw string) | Raw string emission without line directive check |
| sub_46BC80 | (emit directive) | Emits #if / #endif preprocessor lines |
| sub_467DA0 | (emit line directive) | Conditionally emits #line when dword_1065818 is set |
| sub_467D60 | (emit newline) | Emits newline and flushes pending line directive |
| sub_46CF20 | (emit source position) | Sets source position for next #line directive |
| sub_5ADD90 | (string concat) | Concatenates input filename with .int.c extension |
| sub_4F48F0 | (file open) | Opens output file for writing (mode 1701) |
| sub_6BC7E0 | (anon namespace hash) | Generates _GLOBAL__N_<module_id> string |
| sub_5AF830 | make_module_id | CRC32-based unique TU identifier |
| sub_5B0180 | write_module_id_to_file | Writes module ID to separate file |
| sub_6BCF80 | nv_emit_host_reference_array | Emits .nvHRKE/.nvHRDI/etc. ELF sections |
| sub_4F7B10 | (file close) | Closes output stream (mode 1701) |
Cross-References
- Kernel Stub Generation -- detailed stub mechanism using dword_1065850 toggle
- Device/Host Separation -- how device-only code gets #if 0 guards
- CUDA Runtime Boilerplate -- managed memory initialization functions
- Host Reference Arrays -- .nvHRKI/.nvHRDE section format
- Module ID & Registration -- CRC32 hash computation details
- Pipeline Overview -- where backend generation fits in the 7-stage pipeline
- Extended Lambda Overview -- lambda macro definitions and preamble injection
CUDA Runtime Boilerplate
Every .int.c file emitted by cudafe++ contains a fixed block of CUDA runtime initialization code, injected unconditionally before the main body. This boilerplate implements lazy initialization of the CUDA managed memory runtime and defines macro stubs for the extended lambda detection system. The managed runtime block is always emitted regardless of whether the translation unit uses __managed__ variables -- the static flag __nv_inited_managed_rt ensures the runtime is initialized at most once, and the static linkage prevents symbol conflicts across translation units. The lambda detection macros provide a compile-time protocol between cudafe++ and crt/host_runtime.h: the runtime header inspects these macros to decide whether to compile lambda wrapper infrastructure.
Key Facts
| Property | Value |
|---|---|
| Emitter function | sub_489000 (process_file_scope_entities, line 218) |
| Managed RT string address | 0x83AAC8 (243 bytes) |
| Init function string address | 0x83ABC0 (210 bytes) |
| Managed access wrapper string | 0x839570 (65 bytes) |
| Access wrapper emitters | sub_4768F0 (gen_name_ref, xref at 0x476DCF), sub_484940 (gen_variable_name, xref at 0x484A08) |
| Lambda stub macros string | 0x83AD10, 0x83AD50, 0x83AD98 |
| Lambda existence check string | 0x83ADE8 (194 bytes) |
| Extended lambda mode flag | dword_106BF38 (extended_lambda_mode) |
| Alternative host flag | dword_106BF6C (alternative_host_compiler_mode) |
| __cudaPushCallConfiguration lookup | sub_511D40 (scan_expr_full), string at 0x899213 |
| Push config error message | 0x88CA48, error code 3654 |
| Managed variable detection | (*(_WORD *)(entity + 148) & 0x101) == 0x101 |
| EDG source file | cp_gen_be.c |
Managed Memory Runtime Initialization
Static Variables Block
The first emission at line 218 of sub_489000 outputs four declarations as a single string literal:
static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) {
__nv_fatbinhandle_for_managed_rt = in;
}
static char __nv_init_managed_rt_with_module(void **);
These are emitted verbatim from a single string at 0x83AAC8:
"static char __nv_inited_managed_rt = 0; static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in)
{__nv_fatbinhandle_for_managed_rt = in;} static char __nv_init_managed_rt_with_module(void **);"
Each component serves a specific role:
| Symbol | Type | Purpose |
|---|---|---|
| __nv_inited_managed_rt | static char | Guard flag: 0 = not initialized, nonzero = initialized |
| __nv_fatbinhandle_for_managed_rt | static void** | Cached fatbinary handle, set during __cudaRegisterFatBinary |
| __nv_save_fatbinhandle_for_managed_rt | static void (void**) | Stores the fatbin handle for later use by the init function |
| __nv_init_managed_rt_with_module | static char (void**) | Forward declaration -- defined by crt/host_runtime.h |
The forward declaration of __nv_init_managed_rt_with_module is critical: this function is provided by the CUDA runtime headers (crt/host_runtime.h) and performs the actual CUDA runtime API calls to register managed variables with the unified memory system. By forward-declaring it here, the managed runtime boilerplate can reference it before the header is #included later in the file.
Lazy Initialization Function
Immediately after the static block, sub_489000 emits the __nv_init_managed_rt inline function. The emission has a conditional prefix:
// sub_489000, decompiled lines 221-224
if (dword_106BF6C) // alternative host compiler mode
emit("__attribute__((unused)) ");
emit(" static inline void __nv_init_managed_rt(void) {"
" __nv_inited_managed_rt = (__nv_inited_managed_rt"
" ? __nv_inited_managed_rt"
" : __nv_init_managed_rt_with_module("
"__nv_fatbinhandle_for_managed_rt));}");
When dword_106BF6C (alternative host compiler mode) is set, the function is prefixed with __attribute__((unused)) to suppress "defined but not used" warnings on host compilers that do not understand CUDA semantics.
The emitted function, reformatted for readability:
static inline void __nv_init_managed_rt(void) {
__nv_inited_managed_rt = (
__nv_inited_managed_rt
? __nv_inited_managed_rt
: __nv_init_managed_rt_with_module(
__nv_fatbinhandle_for_managed_rt)
);
}
This is a lazy initialization pattern. On first call, __nv_inited_managed_rt is 0 (falsy), so the ternary takes the false branch and calls __nv_init_managed_rt_with_module. That function performs CUDA runtime registration and returns a nonzero value which is stored back into __nv_inited_managed_rt. On subsequent calls, the ternary short-circuits and returns the existing value without re-initializing. The function is static inline to allow the host compiler to inline it at every managed variable access site, and static to avoid symbol collisions across translation units.
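To watch the guard behave as described, the emitted function can be compiled against a mock. In the sketch below the emitted declarations are copied from the text, but the body of __nv_init_managed_rt_with_module is a stand-in written for this example (the real one comes from crt/host_runtime.h), and init_calls is a hypothetical counter added purely for observation.

```c
static int init_calls;  /* hypothetical instrumentation, not in the emitted code */

static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;

/* Mock of the runtime-provided function: counts calls and reports success. */
static char __nv_init_managed_rt_with_module(void **handle)
{
    (void)handle;
    ++init_calls;
    return 1;
}

/* Same shape as the emitted function: the ternary stores the existing flag
 * back when it is nonzero, and the init result otherwise. */
static inline void __nv_init_managed_rt(void)
{
    __nv_inited_managed_rt = (__nv_inited_managed_rt
        ? __nv_inited_managed_rt
        : __nv_init_managed_rt_with_module(__nv_fatbinhandle_for_managed_rt));
}
```

With this mock, calling __nv_init_managed_rt() any number of times leaves init_calls at 1. If the mock returned 0 instead, every call would retry, which is a consequence of storing the return value as the guard.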
Runtime Registration Flow
The complete managed memory initialization sequence spans the compilation pipeline:
1. cudafe++ emits __nv_save_fatbinhandle_for_managed_rt() definition
2. cudafe++ emits forward decl of __nv_init_managed_rt_with_module()
3. cudafe++ emits __nv_init_managed_rt() with lazy init pattern
4. #include "crt/host_runtime.h" provides __nv_init_managed_rt_with_module()
5. __cudaRegisterFatBinary() calls __nv_save_fatbinhandle_for_managed_rt()
to cache the fatbin handle
6. First access to any __managed__ variable triggers __nv_init_managed_rt()
7. __nv_init_managed_rt_with_module() calls __cudaRegisterManagedVariable()
for every __managed__ variable in the TU
Managed Variable Access Transformation
When the backend encounters a reference to a __managed__ variable during code generation, it wraps the access in a comma-operator expression that forces lazy initialization. This transformation is performed by two functions:
- sub_4768F0 (gen_name_ref, xref at 0x476DCF) -- handles qualified name references
- sub_484940 (gen_variable_name, xref at 0x484A08) -- handles direct variable name emission
Detection Condition
Both functions detect __managed__ variables using the same bitfield test:
// sub_484940, decompiled line 11
if ((*(_WORD *)(entity + 148) & 0x101) == 0x101)
This tests two bits simultaneously as a 16-bit word read at offset 148:
| Byte | Bit | Mask | Meaning |
|---|---|---|---|
| +148 | bit 0 | 0x01 | __device__ memory space |
| +149 | bit 0 | 0x01 (reads as 0x100 in word) | __managed__ flag |
The combined mask 0x101 matches when both __device__ and __managed__ are set. The __managed__ attribute handler (sub_40E0D0, apply_nv_managed_attr) always sets both bits: __managed__ implies the variable resides in device global memory (__device__), with the additional unified-memory semantics.
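The word-sized test can be reproduced outside the decompiler as a short C sketch. The function name entity_is_managed and the flat byte-array view of the entity are assumptions made for illustration; only the offset 148 and the mask 0x101 come from the text. memcpy replaces the decompiler's pointer cast so the 16-bit load is alignment-safe.

```c
#include <stdint.h>
#include <string.h>

enum { MEMSPACE_OFFSET = 148 };  /* byte offset taken from the text */

/* Returns 1 when bit 0 of byte +148 (__device__) and bit 0 of byte +149
 * (__managed__) are both set, 0 otherwise. */
int entity_is_managed(const unsigned char *entity)
{
    uint16_t word;
    memcpy(&word, entity + MEMSPACE_OFFSET, sizeof word);
    return (word & 0x101) == 0x101;
}
```

Because both flags occupy bit 0 of adjacent bytes, the mask 0x101 selects the same pair of bits regardless of byte order, so the sketch does not depend on the x86-64 little-endian layout.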
Emitted Wrapper
When the condition matches, the emitter outputs a prefix string from 0x839570:
(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (
After the variable name is emitted normally, the suffix ))) closes the expression. The complete transformed access for a managed variable managed_var becomes:
(*( (__nv_inited_managed_rt ? (void)0 : __nv_init_managed_rt()), (managed_var)))
Breaking down the expression:
- Outer *(...) -- dereferences the result (the managed variable is accessed through a pointer after initialization)
- Comma operator (init_expr, (managed_var)) -- evaluates the init expression for its side effect, then yields the variable
- Ternary __nv_inited_managed_rt ? (void)0 : __nv_init_managed_rt() -- lazy init guard: if already initialized, the ternary evaluates to (void)0 (no-op). Otherwise, it calls __nv_init_managed_rt(), which performs runtime registration
This pattern guarantees that any access to any __managed__ variable triggers runtime initialization exactly once, regardless of access order. The comma operator ensures the initialization is a sequenced side effect evaluated before the variable access.
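The wrapper can be exercised in plain C to confirm that it is an lvalue and that initialization happens before the first access. This is a sketch under assumptions: MANAGED_ACCESS is a hypothetical macro that reproduces the emitted text, __nv_init_managed_rt is mocked to set the flag directly, and managed_var is modeled as a host-side pointer to hypothetical backing storage so that the outer dereference is well formed.

```c
static int init_calls;                  /* hypothetical instrumentation */
static char __nv_inited_managed_rt = 0;

/* Mock: the real function defers to __nv_init_managed_rt_with_module. */
static void __nv_init_managed_rt(void)
{
    ++init_calls;
    __nv_inited_managed_rt = 1;
}

static int managed_storage;             /* hypothetical backing store */
static int *managed_var = &managed_storage;

/* Same token sequence as the emitted prefix and suffix around a name. */
#define MANAGED_ACCESS(v) \
    (*( (__nv_inited_managed_rt ? (void)0 : __nv_init_managed_rt()), (v)))
```

Writing MANAGED_ACCESS(managed_var) = 5 and then reading it back calls the mock init once, on the first access. Both ternary arms have type void, which is what lets the guard sit on the left of the comma operator without producing a value.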
sub_4768F0 (gen_name_ref) -- Qualified Access Path
The name reference generator at sub_4768F0 handles the more complex case where the variable access includes scope qualification (::, template arguments, member access):
// sub_4768F0, decompiled lines 160-163
if (!v7 && a3 == 7 && (*(_WORD *)(v9 + 148) & 0x101) == 0x101) {
v13 = 1; // flag: need closing )))
emit("(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (");
// ... emit qualified name with scope resolution ...
}
The condition a3 == 7 indicates the entity is a variable (IL entry kind 7). The !v7 check (v7 = a4, the fourth parameter) gates on whether the access is from a context that already handles initialization. The v13 flag tracks whether the closing ))) needs to be emitted after the complete name expression:
// sub_4768F0, decompiled lines 231-236
if (v13) {
emit(")))");
return 1;
}
sub_484940 (gen_variable_name) -- Direct Access Path
The direct variable name emitter at sub_484940 follows the same pattern but with a simpler structure:
// sub_484940, decompiled lines 10-15
v1 = 0;
if ((*(_WORD *)(a1 + 148) & 0x101) == 0x101) {
v1 = 1; // flag: need closing )))
emit("(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (");
}
// ... emit variable name (possibly anonymous, templated, etc.) ...
if (v1) {
emit(")))");
return;
}
This function handles three variable name forms:
- Thread-local variables (byte +163 bit 7 set) -- emits "this" string (4 characters via inline loop)
- Anonymous variables (byte +165 bit 2 set) -- dispatches to sub_483A80 for generated name emission
- Regular variables -- dispatches to sub_472730 (gen_expression_or_name, mode 7)
The managed wrapper is applied around all three forms.
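The prefix/name/suffix structure shared by both emitters can be summarized as a small text-producing function. The sketch below is an assumption-laden simplification: emit_variable_name is a hypothetical name, the name is passed in as a ready-made string instead of being generated by one of the three paths, and output goes to a buffer rather than through the emit callback. The prefix and suffix strings are the ones quoted in the text.

```c
#include <stdio.h>

#define MANAGED_PREFIX \
    "(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), ("
#define MANAGED_SUFFIX ")))"

/* Writes the variable name into out, wrapped when is_managed is nonzero.
 * Returns snprintf's result. */
int emit_variable_name(char *out, size_t cap, const char *name, int is_managed)
{
    return snprintf(out, cap, "%s%s%s",
                    is_managed ? MANAGED_PREFIX : "",
                    name,
                    is_managed ? MANAGED_SUFFIX : "");
}
```

The prefix leaves three parentheses open (the outermost group, the dereference operand, and the group around the name), which is why the suffix is exactly three closing parentheses.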
__cudaPushCallConfiguration Lookup
When cudafe++ processes a CUDA kernel launch expression (kernel<<<grid, block, shmem, stream>>>(args...)), the frontend must locate the __cudaPushCallConfiguration runtime function to lower the <<<>>> syntax into standard C++ function calls. This lookup occurs in sub_511D40 (scan_expr_full), the 80KB expression scanner.
Lookup Mechanism
At case 0x48 (decimal 72, the token for kernel launch <<<), the scanner performs a name lookup:
// sub_511D40, decompiled lines 1999-2006
sub_72EEF0("__cudaPushCallConfiguration", 0x1B); // inject name into scope
v206 = sub_698940(v255, 0); // lookup the declaration
if (!v206 || *(_BYTE *)(v206 + 80) != 11) { // not found or not a function
sub_4F8200(0x0B, 3654, &qword_126DD38); // emit error 3654
}
The lookup calls sub_72EEF0 to insert the identifier __cudaPushCallConfiguration (27 bytes, 0x1B) into the current scope context, then sub_698940 performs the actual name resolution. If the declaration is not found (!v206) or the entity at offset +80 is not a function (kind != 11), error 3654 is emitted.
Error 3654
The error string at 0x88CA48:
unable to find __cudaPushCallConfiguration declaration.
CUDA toolkit installation may be corrupt.
This error indicates that the CUDA runtime headers have not been properly included or that the toolkit installation is broken. The __cudaPushCallConfiguration function is declared in crt/device_runtime.h (included transitively through crt/host_runtime.h), so this error should only appear if the include paths are misconfigured.
The error is emitted with severity 0x0B (11), which maps to a fatal error -- compilation cannot continue without this function because every kernel launch depends on it.
Kernel Launch Lowering
After successful lookup, the scanner builds an AST node representing the lowered kernel launch. The <<<grid, block, shmem, stream>>> syntax is transformed into:
// Conceptual lowering:
if (__cudaPushCallConfiguration(grid, block, shmem, stream) == 0) {
    kernel(args...); // host stub is called only when the configuration push succeeded
}
Error 3655 (emitted at line 2019) handles the case where the call configuration push succeeds syntactically but the stream argument is missing in contexts that require it. The string for this is "explicit stream argument not provided in kernel launch".
Lambda Detection Macros
Default Stub Macros (No Extended Lambdas)
When dword_106BF38 (extended_lambda_mode) is 0, sub_489000 emits three macro definitions that evaluate to false, followed by an existence check:
// sub_489000, decompiled lines 259-264
emit("#define __nv_is_extended_device_lambda_closure_type(X) false\n");
emit("#define __nv_is_extended_host_device_lambda_closure_type(X) false\n");
emit("#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false\n");
emit("#if defined(__nv_is_extended_device_lambda_closure_type)"
" && defined(__nv_is_extended_host_device_lambda_closure_type)"
"&& defined(__nv_is_extended_device_lambda_with_preserved_return_type)\n"
"#endif\n");
Verbatim emitted code:
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
#if defined(__nv_is_extended_device_lambda_closure_type) && defined(__nv_is_extended_host_device_lambda_closure_type)&& defined(__nv_is_extended_device_lambda_with_preserved_return_type)
#endif
Note the missing space before && in the second conjunction -- this is exactly how the string appears in the binary at 0x83ADE8. The #endif immediately follows the #if, so the block has no body: it evaluates defined(...) on the three macro names but cannot change what the host compiler sees, whatever the result. It is therefore best read as an existence check that groups the three macros as a set, not as an enforcing assertion; a misbehaving header that #undef'd one of them before its use in crt/host_runtime.h would not be diagnosed at this point.
These macros are consumed by crt/host_runtime.h to conditionally compile lambda wrapper infrastructure. When all three evaluate to false, the runtime header skips device lambda wrapper template instantiation, host-device lambda wrapper instantiation, and trailing-return-type lambda handling.
Trait-Based Macros (Extended Lambdas Active)
When dword_106BF38 is nonzero (--extended-lambda or --expt-extended-lambda CLI flag), the stub macros are NOT emitted. Instead, the lambda preamble emitter sub_6BCC20 (nv_emit_lambda_preamble) provides trait-based implementations later in the file body. The decision is made at line 256 of sub_489000:
// sub_489000, decompiled lines 251-264
if (dword_106BF38) // extended lambdas enabled?
goto LABEL_38; // skip stub macros, jump to next section
// else: emit stubs
emit("#define __nv_is_extended_device_lambda_closure_type(X) false\n");
// ...
The trait-based implementations emitted by sub_6BCC20 replace the constant false with template specialization. Each name is still a preprocessor macro, but it is #define'd to invoke a type trait helper:
Device lambda detection (string at 0xA82CF8):
template <typename T>
struct __nv_extended_device_lambda_trait_helper {
static const bool value = false;
};
template <typename T1, typename...Pack>
struct __nv_extended_device_lambda_trait_helper<__nv_dl_wrapper_t<T1, Pack...> > {
static const bool value = true;
};
#define __nv_is_extended_device_lambda_closure_type(X) \
__nv_extended_device_lambda_trait_helper< \
typename __nv_lambda_trait_remove_cv<X>::type>::value
Preserved return type detection (string at 0xA82F68):
template <typename T>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper {
static const bool value = false;
};
template <typename U, U func, typename Return, unsigned Id, typename...Pack>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper<
__nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, Pack...> > {
static const bool value = true;
};
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) \
__nv_extended_device_lambda_with_trailing_return_trait_helper< \
typename __nv_lambda_trait_remove_cv<X>::type >::value
Host-device lambda detection (string at 0xA831B0):
template <typename>
struct __nv_extended_host_device_lambda_trait_helper {
static const bool value = false;
};
template <bool B1, bool B2, bool B3, typename T1, typename T2, typename...Pack>
struct __nv_extended_host_device_lambda_trait_helper<
__nv_hdl_wrapper_t<B1, B2, B3, T1, T2, Pack...> > {
static const bool value = true;
};
#define __nv_is_extended_host_device_lambda_closure_type(X) \
__nv_extended_host_device_lambda_trait_helper< \
typename __nv_lambda_trait_remove_cv<X>::type>::value
All three trait helpers follow the same pattern: a primary template with value = false, a partial specialization matching the corresponding wrapper type with value = true, and a macro that instantiates the trait after stripping cv-qualifiers via __nv_lambda_trait_remove_cv. The cv-stripping is necessary because lambda closure types may be captured as const references.
Macro Registration in the Frontend
The three macro names are registered as built-in identifiers by sub_5863A0 (a frontend initialization function), which calls sub_7463B0 to register each name with a unique identifier code:
// sub_5863A0, decompiled lines 976-978
sub_7463B0(328, "__nv_is_extended_device_lambda_closure_type");
sub_7463B0(329, "__nv_is_extended_host_device_lambda_closure_type");
sub_7463B0(330, "__nv_is_extended_device_lambda_with_preserved_return_type");
These registrations (IDs 328, 329, 330) make the names known to the EDG lexer before any source code is parsed, ensuring they can be resolved during preprocessing even if no header has defined them yet.
Diagnostic Suppression Scope
The managed runtime boilerplate is wrapped in a #pragma GCC diagnostic push / pop block to isolate its warning suppressions:
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-function"
/* managed runtime declarations */
#pragma GCC diagnostic pop
The push/pop is emitted only when the host compiler supports it: Clang (dword_126E1E8 set), or GCC version > 40599 (qword_126E1F0 > 0x9E97 and dword_106BF6C not set). The suppressions are necessary because __nv_inited_managed_rt and __nv_init_managed_rt are static symbols that may never be referenced in translation units without __managed__ variables, causing -Wunused-variable and -Wunused-function warnings.
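The gating condition can be written as a single predicate. The sketch below is one reading of the sentence above, not the decompiled control flow: supports_diagnostic_push_pop and its parameter names are hypothetical, the parameters stand in for dword_126E1E8, dword_126E1F8, qword_126E1F0, and dword_106BF6C, and the version encoding (major*10000 + minor*100 + patch, so 40600 would be GCC 4.6.0) is an assumption inferred from the threshold value.

```c
#include <stdint.h>

enum { GCC_PUSH_POP_THRESHOLD = 0x9E97 };  /* 40599 decimal */

/* Returns 1 when the push/pop pair would be emitted under this reading. */
int supports_diagnostic_push_pop(int host_is_clang, int host_is_gcc,
                                 int64_t version, int alt_host_mode)
{
    if (host_is_clang)
        return 1;
    return host_is_gcc && version > GCC_PUSH_POP_THRESHOLD && !alt_host_mode;
}
```

Under the assumed encoding, the threshold admits GCC 4.6.0 and later, which matches the GCC release that introduced #pragma GCC diagnostic push and pop.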
Global State Dependencies
| Global | Type | Meaning | Effect on Emission |
|---|---|---|---|
| dword_106BF38 | int | extended_lambda_mode | 0: emit false stubs. Nonzero: skip stubs, sub_6BCC20 provides traits |
| dword_106BF6C | int | alternative_host_compiler_mode | Adds __attribute__((unused)) to __nv_init_managed_rt |
| dword_126E1E8 | int | Host is Clang | Controls push/pop and extra suppressions |
| dword_126E1F8 | int | Host is GCC | Controls push/pop version threshold |
| qword_126E1F0 | int64 | GCC/Clang version number | > 0x9E97 (40599) for push/pop support |
Function Map
| Address | Name | Role |
|---|---|---|
| sub_489000 | process_file_scope_entities | Emits managed RT block and lambda macros |
| sub_4768F0 | gen_name_ref | Wraps qualified managed variable accesses |
| sub_484940 | gen_variable_name | Wraps direct managed variable accesses |
| sub_511D40 | scan_expr_full | Looks up __cudaPushCallConfiguration for <<<>>> lowering |
| sub_6BCC20 | nv_emit_lambda_preamble | Emits trait-based lambda detection macros |
| sub_5863A0 | (frontend init) | Registers lambda macro names as built-in identifiers |
| sub_467E50 | (emit string) | Primary string emission to output stream |
| sub_72EEF0 | (inject identifier) | Inserts __cudaPushCallConfiguration into scope for lookup |
| sub_698940 | (name lookup) | Resolves identifier to entity declaration |
| sub_4F8200 | (emit error) | Error emission with severity and error code |
Cross-References
- .int.c File Format -- complete file structure showing where runtime boilerplate sits
- Device Lambda Wrapper -- __nv_dl_wrapper_t matched by trait macros
- Host-Device Lambda Wrapper -- __nv_hdl_wrapper_t matched by trait macros
- Preamble Injection -- sub_6BCC20 emission of trait templates
- Entity Node Layout -- byte +148/+149 memory space bitfield
- __managed__ Variables -- attribute handler setting the 0x101 bits
- Kernel Stub Generation -- device stub side of kernel launch lowering
- Host Reference Arrays -- registration tables that reference managed variables
Host Reference Arrays
When cudafe++ splits a CUDA source file into device and host halves, the host-side .int.c output is compiled by a standard C++ compiler (GCC, Clang, or MSVC) that has no concept of device symbols. The CUDA runtime, however, needs to know which __global__ kernels, __device__ variables, and __constant__ variables exist so it can register them at program startup. cudafe++ solves this by emitting host reference arrays -- static byte arrays containing the mangled names of device symbols, placed into specially-named ELF sections that downstream tools (the fatbinary linker and crt/host_runtime.h registration code) read to enumerate device entities. The mechanism exists because the host compiler's symbol table contains only host-side symbols; the .nvHR* sections provide the complementary device-side symbol directory that the CUDA runtime needs to build the host-device binding table.
The arrays are emitted at the very end of the .int.c file, after the #undef _NV_ANON_NAMESPACE cleanup, by six calls to nv_emit_host_reference_array (sub_6BCF80, 79 lines, nv_transforms.c). Each call handles one combination of symbol type (kernel, device variable, constant variable) and linkage class (external, internal). The split by linkage is critical for RDC (relocatable device code) compilation: external-linkage symbols are globally visible across translation units and resolved by nvlink, while internal-linkage symbols (from static declarations or anonymous namespaces) are TU-local and must carry module-ID-based name prefixes to avoid collisions.
Key Facts
| Property | Value |
|---|---|
| Emission function | sub_6BCF80 (nv_emit_host_reference_array, 79 lines) |
| EDG source file | nv_transforms.c |
| Caller | sub_489000 (process_file_scope_entities, lines 713--721) |
| Guard condition | dword_106BFD0 || dword_106BFCC (device or constant registration enabled) |
| Emit callback | sub_467E50 (primary string emitter to output stream) |
| Registration function | sub_6BE300 (nv_get_full_nv_static_prefix, 370 lines, nv_transforms.c:2164) |
| Scope prefix builder | sub_6BD2F0 (nv_build_scoped_name_prefix, 95 lines) |
| Expression walker | sub_6BE330 (nv_scan_expression_for_device_refs, 89 lines) |
| List data structure | std::list<std::string>-like containers at 6 global addresses |
| Static prefix cache | qword_1286760 |
| Anonymous namespace name | qword_1286A00 (format: _GLOBAL__N_<module_id>) |
| Prefix format string | at off_E7C768, expanded as "%s%lu_%s_" |
| Assert guard | nv_transforms.c:2164, "nv_get_full_nv_static_prefix" |
The Six Sections
The arrays are organized into 6 ELF sections along two axes: symbol type (3 values) and linkage (2 values):
| Section | Array Name | Symbol Type | Linkage | Global List Address |
|---|---|---|---|---|
| .nvHRKE | hostRefKernelArrayExternalLinkage | __global__ kernel | External | unk_1286880 |
| .nvHRKI | hostRefKernelArrayInternalLinkage | __global__ kernel | Internal | unk_12868C0 |
| .nvHRDE | hostRefDeviceArrayExternalLinkage | __device__ variable | External | unk_1286780 |
| .nvHRDI | hostRefDeviceArrayInternalLinkage | __device__ variable | Internal | unk_12867C0 |
| .nvHRCE | hostRefConstantArrayExternalLinkage | __constant__ variable | External | unk_1286800 |
| .nvHRCI | hostRefConstantArrayInternalLinkage | __constant__ variable | Internal | unk_1286840 |
The section name encoding is: .nvHR (host reference) + one letter for symbol type (K=kernel, D=device, C=constant) + one letter for linkage (E=external, I=internal).
Note that __shared__ variables are not included -- they have no host-visible address and exist only within a kernel's execution lifetime.
Emission Architecture
Invocation from the Backend
The backend entry point sub_489000 (process_file_scope_entities) calls sub_6BCF80 six times at the very end of .int.c generation (decompiled lines 713--721). The calls are guarded by two flags: dword_106BFD0 (device registration mode) and dword_106BFCC (constant registration mode). If neither is set, no arrays are emitted.
// sub_489000 trailer, decompiled lines 713-721
if (dword_106BFD0 || dword_106BFCC) {
// nv_emit_host_reference_array(emit_fn, is_kernel, is_device, is_internal)
sub_6BCF80(sub_467E50, 1, 0, 1); // kernel, internal -> .nvHRKI
sub_6BCF80(sub_467E50, 1, 0, 0); // kernel, external -> .nvHRKE
sub_6BCF80(sub_467E50, 0, 1, 1); // device, internal -> .nvHRDI
sub_6BCF80(sub_467E50, 0, 1, 0); // device, external -> .nvHRDE
sub_6BCF80(sub_467E50, 0, 0, 1); // constant, internal -> .nvHRCI
sub_6BCF80(sub_467E50, 0, 0, 0); // constant, external -> .nvHRCE
}
The function signature is:
void nv_emit_host_reference_array(
void (*emit)(const char *), // a1: string emission callback
int is_kernel, // a2: 1 = kernel, 0 = variable
int is_device, // a3: 1 = __device__, 0 = __constant__ (only when is_kernel=0)
int is_internal // a4: 1 = internal linkage, 0 = external linkage
);
The flag decoding for selecting which global list, section name, and array name to use works as follows:
if is_kernel (a2 != 0):
if is_internal (a4 != 0): list = unk_12868C0, section = ".nvHRKI", name = "hostRefKernelArrayInternalLinkage"
else: list = unk_1286880, section = ".nvHRKE", name = "hostRefKernelArrayExternalLinkage"
else if is_internal (a4 != 0):
if is_device (a3 != 0): list = unk_12867C0, section = ".nvHRDI", name = "hostRefDeviceArrayInternalLinkage"
else: list = unk_1286840, section = ".nvHRCI", name = "hostRefConstantArrayInternalLinkage"
else:
if is_device (a3 != 0): list = unk_1286780, section = ".nvHRDE", name = "hostRefDeviceArrayExternalLinkage"
else: list = unk_1286800, section = ".nvHRCE", name = "hostRefConstantArrayExternalLinkage"
Note the precedence: the kernel flag is checked first. When is_kernel=1, the is_device flag is ignored entirely -- kernels are always kernels regardless of is_device.
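The decoding above maps directly onto a small selector function. The sketch below is illustrative: struct host_ref_target and select_host_ref_target are hypothetical names, and the global list addresses are left out because they are only meaningful inside the binary. The section and array names are the ones listed in the table of six sections.

```c
struct host_ref_target {
    const char *section;     /* ELF section name */
    const char *array_name;  /* identifier of the emitted array */
};

/* Kernel flag wins; is_device is consulted only when is_kernel is 0. */
struct host_ref_target select_host_ref_target(int is_kernel, int is_device,
                                              int is_internal)
{
    struct host_ref_target t;
    if (is_kernel) {
        t.section    = is_internal ? ".nvHRKI" : ".nvHRKE";
        t.array_name = is_internal ? "hostRefKernelArrayInternalLinkage"
                                   : "hostRefKernelArrayExternalLinkage";
    } else if (is_device) {
        t.section    = is_internal ? ".nvHRDI" : ".nvHRDE";
        t.array_name = is_internal ? "hostRefDeviceArrayInternalLinkage"
                                   : "hostRefDeviceArrayExternalLinkage";
    } else {
        t.section    = is_internal ? ".nvHRCI" : ".nvHRCE";
        t.array_name = is_internal ? "hostRefConstantArrayInternalLinkage"
                                   : "hostRefConstantArrayExternalLinkage";
    }
    return t;
}
```

The branch order differs from the decompiled nesting (which tests is_internal before is_device for variables), but the resulting mapping from the three flags to a section is the same for all eight flag combinations.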
Emission Output Format
For each section, sub_6BCF80 emits a single array declaration:
extern "C" {
extern __attribute__((section(".nvHRKE")))
__attribute__((weak))
const unsigned char hostRefKernelArrayExternalLinkage[] = {
/* _Z8myKernelPfi */
0x5f,0x5a,0x38,0x6d,0x79,0x4b,0x65,0x72,0x6e,0x65,0x6c,0x50,0x66,0x69,0x0,
/* _Z11otherKernelPd */
0x5f,0x5a,0x31,0x31,0x6f,0x74,0x68,0x65,0x72,0x4b,0x65,0x72,0x6e,0x65,0x6c,0x50,0x64,0x0,
0x0};
}
Key details about the emitted C:
- extern "C" wrapping ensures no C++ name mangling is applied to the array itself. The section name in the ELF binary is the sole identifier.
- __attribute__((section(".nvHRXX"))) places the array in a named ELF section that downstream tools scan by name.
- __attribute__((weak)) allows multiple translation units to define the same array name without causing linker errors. When multiple TUs each emit their own hostRefKernelArrayExternalLinkage, the linker keeps one copy. This is safe because the CUDA runtime reads the section contents, not the symbol -- it concatenates all .nvHRKE section contributions from all object files.
- const unsigned char[] encodes each mangled name as individual hex bytes, not as a string literal. This avoids any issues with embedded NUL bytes or special characters in mangled names.
- Each symbol name is preceded by a /* mangled_name */ comment for human readability.
- Each name is terminated by 0x0 (NUL byte).
- If the list is empty (no symbols of that type/linkage), the array contains a single 0x0 sentinel.
The iteration traverses a doubly-linked list rooted at the global list variable. From the decompiled code:
// Decompiled iteration in sub_6BCF80, lines 56-73
for (node = list[3]; list + 1 != node; node = next_node(node)) {
emit("/* ");
emit(*(char **)(node + 32)); // mangled name string
emit(" */\n");
size_t len = *(size_t *)(node + 40); // string length
for (size_t j = 0; j < len; j++) {
char byte = *(char *)(*(char **)(node + 32) + j);
snprintf(buf, 128, "0x%x,", byte);
emit(buf);
}
emit("0x0,"); // NUL terminator for this name
}
Each node in the linked list stores:
- +32: pointer to the mangled name string
- +40: length of the mangled name
The list structure itself is a std::list<std::string>-compatible container where list[3] (offset +24) points to the first data node and list + 1 (offset +8) is the sentinel/end node.
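The per-name encoding performed by the loop above can be sketched as a standalone helper. This is a minimal illustration, not code recovered from the binary: the function name encode_host_ref_entry and its buffer-based interface are hypothetical, and the cast through unsigned char is an added safeguard (the decompiled loop passes a plain char to "%x", which behaves identically for the ASCII-only bytes of a mangled name).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the binary): writes one host-reference
 * entry for `name` into `out` as a comment line followed by hex bytes
 * and a 0x0 terminator. Returns the number of characters written, or
 * -1 if `cap` is too small. */
int encode_host_ref_entry(const char *name, char *out, size_t cap)
{
    size_t used;
    int n = snprintf(out, cap, "/* %s */\n", name);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    used = (size_t)n;

    for (const char *p = name; *p != '\0'; ++p) {
        /* Cast through unsigned char so bytes >= 0x80 are not sign-extended. */
        n = snprintf(out + used, cap - used, "0x%x,", (unsigned)(unsigned char)*p);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }

    /* NUL terminator for this name, as in the emitted arrays. */
    n = snprintf(out + used, cap - used, "0x0,\n");
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

Calling the helper once per list node and appending a final 0x0 would reproduce the array body shown in the emission format.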
Symbol Registration Pipeline
The host reference arrays are the output of a two-phase pipeline: (1) symbol collection during compilation, and (2) array emission at the end of the backend pass.
Phase 1: Collection During Compilation
As cudafe++ processes the AST, it encounters declarations marked with __global__, __device__, or __constant__. Each such entity must be registered in the appropriate global list so it appears in the host reference array. This registration is performed by two cooperating functions:
nv_scan_expression_for_device_refs (sub_6BE330, 89 lines) recursively walks expression trees looking for references to device-annotated entities. It dispatches on expression kind:
| Expression Kind | Handling |
|---|---|
| 7 (variable reference) | Checks __global__ bit, registers if device-annotated |
| 11 (function reference) | Checks function attributes, registers if __global__ |
| 15 (member access) | Recurses on the member |
| 16 (pointer dereference) | Recurses on the operand |
| 17 (expression list) | Recurses on each element |
| 20 (call expression) | Checks the callee |
| 24 (cast expression) | Recurses on the operand |
When the walker finds a device entity, it tail-calls into nv_get_full_nv_static_prefix.
nv_get_full_nv_static_prefix (sub_6BE300, 370 lines) is the master registration function. It determines the symbol's linkage class and constructs the name that goes into the host reference array. The function begins with two early-exit checks:
if (!entity) return;
if ((entity[182] & 0x40) == 0) return; // not __global__
Byte +182 of the entity node carries execution space bits. Bit 6 (0x40) indicates __global__. Byte +179 carries additional flags where bits 0x12 indicate device/constant annotation. Byte +80 bits 0x70 encode the linkage class: 0x10 = internal (static/anonymous), 0x30 = external.
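The bit tests described above can be written as small predicates. This is a sketch under the stated offsets only: the function and enum names are hypothetical, the byte values are passed in directly rather than read from a real entity node, and linkage values other than 0x10 and 0x30 are treated as unknown because the text does not describe them.

```c
#include <stdint.h>

/* Hypothetical names; only the masks come from the text above. */
enum linkage_class { LINKAGE_OTHER = 0, LINKAGE_INTERNAL, LINKAGE_EXTERNAL };

/* Bit 6 (0x40) of entity byte +182 marks a __global__ entity. */
int entity_is_kernel(uint8_t byte_182)
{
    return (byte_182 & 0x40) != 0;
}

/* Bits 0x70 of entity byte +80 hold the linkage class. */
enum linkage_class entity_linkage(uint8_t byte_80)
{
    switch (byte_80 & 0x70) {
    case 0x10: return LINKAGE_INTERNAL;  /* static / anonymous namespace */
    case 0x30: return LINKAGE_EXTERNAL;
    default:   return LINKAGE_OTHER;     /* not described in the text */
    }
}
```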
The function then splits into two paths based on linkage:
Internal Linkage Path
For static functions, anonymous-namespace entities, or entities with forced internal linkage, the name must include a TU-unique prefix to prevent collisions across translation units:
- Scope prefix construction (sub_6BD2F0): Recursively walks the entity's enclosing scopes (byte +28 == 3 indicates "has parent scope"). For each scope level, the scope name is extracted from +32 -> +8 (the scope's identifier string). For anonymous namespaces (where the scope name pointer is NULL), the function substitutes _GLOBAL__N_<module_id>, constructing and caching this string in qword_1286A00.
- Hash computation (sub_6BD1C0): The scope-qualified name is hashed using vsnprintf with format string at address 8573734 (likely "%s%lu" or similar) and a 32-byte buffer. This produces a deterministic hash of the scope path.
- Static prefix construction: The full prefix is assembled as snprintf(buf, size, "%s%lu_%s_", off_E7C768, strlen(module_id), module_id), where off_E7C768 is a fixed prefix string (likely "__nv_static_" or similar) and module_id comes from sub_5AF830 (the CRC32-based module identifier). The result is cached in qword_1286760 so it is computed only once per TU.
- Name assembly: The prefix, a "_" separator, and the entity's mangled name (from entity +8) are concatenated.
- List insertion: The assembled name is pushed into the internal-linkage list (unk_12868C0 for kernels, unk_12867C0 for device variables, unk_1286840 for constants) via a std::list::push_back-equivalent call.
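The static prefix step can be illustrated with the same snprintf format. This is a sketch, not the recovered function: build_static_prefix is a hypothetical name, and the marker string is supplied by the caller because the actual contents of off_E7C768 are only guessed at above ("__nv_static_" is used purely as an example value).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats <marker><len(module_id)>_<module_id>_
 * into `buf`. Returns the length written, or -1 if `size` is too small. */
int build_static_prefix(char *buf, size_t size,
                        const char *marker, const char *module_id)
{
    int n = snprintf(buf, size, "%s%lu_%s_",
                     marker, (unsigned long)strlen(module_id), module_id);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

With the assumed marker "__nv_static_" and a module ID of "abc", the helper writes "__nv_static_3_abc_".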
External Linkage Path
For entities with default (external) linkage, the path is simpler:
- A " ::" scope prefix is prepended (string at address 10998575, corresponding to " ::" -- two bytes).
- If the entity has a parent scope (byte +28 == 3 at the scope entry), the scope-qualified name is built by recursing through parent scopes, concatenating "::" separators and hashing each level with sub_6BD1C0.
- The entity's mangled name (from entity +8) is appended directly.
- The result is pushed into the external-linkage list (unk_1286880 for kernels, unk_1286780 for device variables, unk_1286800 for constants).
Phase 2: Emission (Backend Trailer)
After the entire source file has been processed and all entity walks have populated the 6 global lists, the backend trailer calls sub_6BCF80 six times. Each call drains one list and emits the corresponding ELF section declaration. The emission is always performed for all 6 sections, even if some lists are empty (producing arrays with only a 0x0 sentinel).
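The mapping from flags to the six section names can be sketched as follows. The section names are taken from the emitted output; the function name host_ref_section, the three-flag signature, and the fallback of "neither kernel nor device means constant" are assumptions made for illustration, following the stated rule that the kernel flag is checked first.

```c
/* Hypothetical selector: picks the ELF section for one host reference
 * array. The kernel flag takes precedence over the device flag. */
const char *host_ref_section(int is_kernel, int is_device, int is_internal)
{
    if (is_kernel)
        return is_internal ? ".nvHRKI" : ".nvHRKE";
    if (is_device)
        return is_internal ? ".nvHRDI" : ".nvHRDE";
    /* Assumed: anything else is a __constant__ list. */
    return is_internal ? ".nvHRCI" : ".nvHRCE";
}
```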
Internal vs. External Linkage Split
The split into internal and external linkage sections serves two distinct purposes:
Whole-Program Mode (-rdc=false)
In whole-program (non-RDC) mode, all device code from a single TU is embedded directly in the host object file as a fatbinary. The host reference arrays tell crt/host_runtime.h's __cudaRegisterLinkedBinary machinery which symbols exist in the fatbinary so it can register them with the CUDA driver at program startup.
Internal-linkage symbols require the TU-unique prefix to avoid name collisions if two TUs define identically-named static __global__ kernels. The prefix incorporates the module ID (a CRC32-based TU identifier, described under Module ID & Registration) to ensure uniqueness.
Separate Compilation Mode (-rdc=true)
In RDC mode, device code is compiled to relocatable device objects (.rdc files) that nvlink links together. External-linkage device symbols must be globally resolvable across TUs. The .nvHRKE/.nvHRDE/.nvHRCE sections provide the symbol directory that nvlink uses to match device symbols with their host-side registration entries.
Internal-linkage symbols in RDC mode remain TU-local. They carry module-ID prefixes and are placed in the *I sections, which nvlink processes separately. The split ensures that nvlink does not attempt to deduplicate or cross-reference symbols that were intentionally given internal linkage.
Downstream Consumption
Host Compiler
GCC/Clang/MSVC compiles the .int.c file and sees the extern "C" array declarations with __attribute__((section(...))). The host compiler places each array into the named ELF section (or PE section on Windows). Because the arrays are const unsigned char[] with weak linkage, they impose no runtime overhead and can be safely deduplicated by the linker.
Fatbinary Linker (fatbinary / nvlink)
The fatbinary linker reads the .nvHR* sections from each object file to discover which device symbols need registration. For each entry in the byte arrays, it extracts the mangled name (scanning for 0x0 terminators) and matches it against the device code in the fatbinary or relocatable device object.
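The scan for 0x0 terminators can be sketched as a small reader over one section's bytes. This is an illustration of the array layout only, not the logic of fatbinary or nvlink: the function name and interface are hypothetical, and stopping at the first empty string assumes a single array contribution ending in the 0x0 sentinel.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical reader: stores pointers to each NUL-terminated name in
 * `section` into `names` (at most `max_names`) and returns the count.
 * An empty string is treated as the trailing sentinel. */
size_t collect_host_ref_names(const unsigned char *section, size_t size,
                              const char **names, size_t max_names)
{
    size_t count = 0;
    size_t pos = 0;

    while (pos < size && count < max_names) {
        const unsigned char *end = memchr(section + pos, 0, size - pos);
        if (end == NULL)
            break;                       /* unterminated tail: stop */
        size_t len = (size_t)(end - (section + pos));
        if (len == 0)
            break;                       /* sentinel reached */
        names[count++] = (const char *)(section + pos);
        pos += len + 1;                  /* skip name and its terminator */
    }
    return count;
}
```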
CUDA Runtime (crt/host_runtime.h)
At program startup, the CUDA runtime's __cudaRegisterLinkedBinary function (or __cudaRegisterFatBinary in whole-program mode) walks the .nvHR* sections to:
- Register each __global__ kernel with __cudaRegisterFunction
- Register each __device__ variable with __cudaRegisterVar
- Register each __constant__ variable with __cudaRegisterVar (with the constant flag)
This registration enables the host-side API (cudaLaunchKernel, cudaMemcpyToSymbol, etc.) to resolve device symbols by name at runtime.
Supporting Data Structures
Global List Nodes
Each of the 6 global lists (unk_1286780 through unk_12868C0) is a std::list<std::string>-compatible doubly-linked list. The list head structure occupies 48 bytes (3 pointers + metadata):
| Offset | Field | Description |
|---|---|---|
| +0 | allocator | Allocator state |
| +8 | sentinel | Sentinel/end node address (comparison target for iteration end) |
| +16 | size | Number of entries |
| +24 | first | Pointer to first data node |
Each data node stores:
| Offset | Field | Description |
|---|---|---|
| +0 | prev | Previous node pointer |
| +8 | next | Next node pointer |
| +16 | data_start | Start of string data area |
| +32 | str_ptr | Pointer to mangled name character data |
| +40 | str_len | Length of the mangled name |
The strings use SSO (Small String Optimization): if the mangled name is 15 bytes or shorter, the character data is stored inline starting at offset +16; otherwise str_ptr at +32 points to a heap allocation and offset +16 stores the heap capacity.
Static Prefix Cache
qword_1286760 caches the internal-linkage prefix string computed by nv_get_full_nv_static_prefix. The format is:
<off_E7C768><module_id_length>_<module_id>_
Where off_E7C768 is a fixed string (the NVIDIA static prefix marker), the module ID comes from sub_5AF830 (CRC32-based), and the underscores separate the components. This prefix is allocated once via sub_5E03D0 and reused for all internal-linkage entities in the TU.
Anonymous Namespace Name Cache
qword_1286A00 caches the anonymous namespace identifier, constructed as _GLOBAL__N_<module_id>. This follows the Itanium ABI convention for anonymous namespace mangling but uses the CUDA module ID instead of a random hash. It is allocated once by sub_6BD2F0 and reused for all entities in anonymous namespaces.
Scope-Qualified Name Builder
sub_6BD2F0 (nv_build_scoped_name_prefix) recursively constructs scope-qualified names for internal-linkage entities:
void nv_build_scoped_name_prefix(char **scope_name, scope_entry *parent, string *result) {
// Recurse to parent scope first
if (parent && parent->kind == 3) // byte +28 == 3
nv_build_scoped_name_prefix(parent->parent->name, parent->parent->scope, result);
char *name = *scope_name;
if (!name)
name = get_or_create_anon_namespace_name(); // _GLOBAL__N_<module_id>
// Build: hash(name) via vsnprintf with format at 8573734, 32-byte buffer
// Append to result string
format_string_to_sso(&tmp, vsnprintf, 32, 8573734, name_len);
string_append(result, tmp);
}
The recursion visits ancestor scopes from outermost to innermost, concatenating hashed scope names. This produces a deterministic, collision-resistant path that uniquely identifies the entity's position in the namespace hierarchy.
Host Reference Trie
During compilation, cudafe++ maintains a trie (prefix tree) structure for deduplicating host reference entries. This trie is stored alongside the linear lists and prevents the same symbol from being registered twice if it is referenced from multiple points in the source.
The trie is cleaned up at the end of compilation by:
- sub_6BD530 (nv_free_host_ref_tree, 257 lines) -- deeply recursive tree destructor with 9 levels of inlined recursion
- sub_6BD820 (nv_free_host_ref_list, 34 lines) -- iterates the linked list, calling nv_free_host_ref_tree for each node's tree, then frees the node
Each trie node structure:
| Offset | Field | Description |
|---|---|---|
| +0 | next | Next sibling pointer |
| +8 | (reserved) | Alignment/flags |
| +16 | child_chain | First child in chain |
| +24 | child_tree | Child subtree pointer |
| +32 | data_ptr | Pointer to name data (or +48 if inline) |
| +40 | data_len | Length of name data |
| +48 | inline_data | Inline storage for short names |
If data_ptr == &node[48] (the inline data area), no separate allocation was made; otherwise data_ptr points to a heap-allocated string that nv_free_host_ref_tree frees separately.
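The node layout and the inline-storage test can be expressed as a C struct. This is a sketch for a 64-bit target: the struct and function names are hypothetical, the field names follow the table above, and the 16-byte size of the inline area is an assumption since the table gives only its starting offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of the node layout on a 64-bit target. */
struct host_ref_tree_node {
    struct host_ref_tree_node *next;         /* +0  next sibling        */
    uintptr_t reserved;                      /* +8  alignment/flags     */
    struct host_ref_tree_node *child_chain;  /* +16 first child         */
    struct host_ref_tree_node *child_tree;   /* +24 child subtree       */
    char *data_ptr;                          /* +32 name data           */
    size_t data_len;                         /* +40 name length         */
    char inline_data[16];                    /* +48 size is assumed     */
};

/* Returns nonzero when the name lives in the node itself, meaning a
 * destructor has no separate string allocation to free. */
int node_name_is_inline(const struct host_ref_tree_node *n)
{
    return n->data_ptr == n->inline_data;
}
```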
Complete Emission Example
For a source file containing:
__global__ void myKernel(float *data, int n) { /* ... */ }
__device__ int d_counter;
static __constant__ float c_table[256];
The .int.c trailer emits:
extern "C" {
extern __attribute__ ((section (".nvHRKI"))) __attribute__((weak)) const unsigned char hostRefKernelArrayInternalLinkage[] = {
0x0};
}
extern "C" {
extern __attribute__ ((section (".nvHRKE"))) __attribute__((weak)) const unsigned char hostRefKernelArrayExternalLinkage[] = {
/* _Z8myKernelPfi */
0x5f,0x5a,0x38,0x6d,0x79,0x4b,0x65,0x72,0x6e,0x65,0x6c,0x50,0x66,0x69,0x0,
0x0};
}
extern "C" {
extern __attribute__ ((section (".nvHRDI"))) __attribute__((weak)) const unsigned char hostRefDeviceArrayInternalLinkage[] = {
0x0};
}
extern "C" {
extern __attribute__ ((section (".nvHRDE"))) __attribute__((weak)) const unsigned char hostRefDeviceArrayExternalLinkage[] = {
/* _Z9d_counter */
0x5f,0x5a,0x39,0x64,0x5f,0x63,0x6f,0x75,0x6e,0x74,0x65,0x72,0x0,
0x0};
}
extern "C" {
extern __attribute__ ((section (".nvHRCI"))) __attribute__((weak)) const unsigned char hostRefConstantArrayInternalLinkage[] = {
/* __nv_static_42_kernel_cu_c_table */
0x5f,0x5f,0x6e,0x76,0x5f,...,0x0,
0x0};
}
extern "C" {
extern __attribute__ ((section (".nvHRCE"))) __attribute__((weak)) const unsigned char hostRefConstantArrayExternalLinkage[] = {
0x0};
}
Note how c_table (declared static __constant__) appears in the internal-linkage .nvHRCI section with its module-ID-prefixed name, while myKernel (external linkage by default) appears in .nvHRKE with its standard Itanium-ABI mangled name.
Function Map
| Address | Name | Source | Lines | Role |
|---|---|---|---|---|
| sub_6BCF80 | nv_emit_host_reference_array | nv_transforms.c | 79 | Selects section/list by flags, emits array declaration |
| sub_6BE300 | nv_get_full_nv_static_prefix | nv_transforms.c:2164 | 370 | Master registration: determines linkage, builds name, inserts into list |
| sub_6BE330 | nv_scan_expression_for_device_refs | nv_transforms.c | 89 | Recursive expression walker that finds device entity references |
| sub_6BD2F0 | nv_build_scoped_name_prefix | nv_transforms.c | 95 | Recursive scope-qualified name builder for internal-linkage entities |
| sub_6BD1C0 | format_string_to_sso | nv_transforms.c | 48 | Formats via vsnprintf into std::string SSO buffer |
| sub_6BD530 | nv_free_host_ref_tree | nv_transforms.c | 257 | Recursive deep-free of deduplication trie |
| sub_6BD820 | nv_free_host_ref_list | nv_transforms.c | 34 | Frees linked list of host reference entries |
| sub_6BCF10 | nv_check_device_variable_in_host | nv_transforms.c | 16 | Validates device variable not improperly referenced from host |
| sub_5AF830 | make_module_id | host_envir.c | ~450 | CRC32-based TU identifier used in internal-linkage prefixes |
| sub_489000 | process_file_scope_entities | cp_gen_be.c | 723 | Backend entry point; calls sub_6BCF80 x6 in trailer |
| sub_467E50 | (emit string) | cp_gen_be.c | -- | Primary string emission callback passed to sub_6BCF80 |
Global Variables
| Address | Type | Name | Purpose |
|---|---|---|---|
| unk_1286780 | list | device external list | Accumulates __device__ external-linkage symbol names |
| unk_12867C0 | list | device internal list | Accumulates __device__ internal-linkage symbol names |
| unk_1286800 | list | constant external list | Accumulates __constant__ external-linkage symbol names |
| unk_1286840 | list | constant internal list | Accumulates __constant__ internal-linkage symbol names |
| unk_1286880 | list | kernel external list | Accumulates __global__ external-linkage symbol names |
| unk_12868C0 | list | kernel internal list | Accumulates __global__ internal-linkage symbol names |
| qword_1286760 | char* | static prefix cache | Cached internal-linkage prefix string (computed once per TU) |
| qword_1286A00 | char* | anon namespace name | Cached _GLOBAL__N_<module_id> string |
| dword_106BFD0 | int | device registration flag | Enables device symbol registration (guard for emission) |
| dword_106BFCC | int | constant registration flag | Enables constant symbol registration (guard for emission) |
Cross-References
- .int.c File Format -- complete file structure showing where host reference arrays sit (sections 13--14)
- CUDA Runtime Boilerplate -- managed memory initialization that references registered symbols
- Module ID & Registration -- CRC32 hash computation used in internal-linkage prefixes
- RDC Mode -- how the internal/external split interacts with separate compilation
- Memory Spaces -- __device__/__constant__/__shared__ attribute encoding
- Name Mangling -- nv_get_full_nv_static_prefix and Itanium ABI encoding
- Backend Code Generation -- Phase 7 host reference array emission
- CLI Flag Inventory -- flags controlling device/constant registration
Module ID & Registration
When CUDA programs are compiled with separate compilation (-rdc=true), each .cu translation unit is compiled independently and later linked by nvlink. The host-side registration code emitted by cudafe++ must associate its __cudaRegisterFatBinary call with the correct device fatbinary, and anonymous namespace device symbols must receive globally unique mangled names. The module ID is a string identifier computed by make_module_id (sub_5AF830, host_envir.c, ~450 lines) that provides this uniqueness. It is derived from a CRC32 hash of the compiler options and source filename, combined with the output filename and process ID. Once computed, the module ID is cached in qword_126F0C0 and referenced throughout the backend code generator -- in _NV_ANON_NAMESPACE construction, _GLOBAL__N_ mangling, _INTERNAL prefixing, host reference array scoped names, and the module ID file written for nvlink consumption.
Key Facts
| Property | Value |
|---|---|
| Generator function | sub_5AF830 (make_module_id, ~450 lines, host_envir.c) |
| Setter | sub_5AF7F0 (set_module_id, host_envir.c, line 3387 assertion) |
| Getter | sub_5AF820 (get_module_id, host_envir.c) |
| File writer | sub_5B0180 (write_module_id_to_file, host_envir.c) |
| Entity-based selector | sub_5CF030 (use_variable_or_routine_for_module_id_if_needed, il.c, line 31969) |
| Anon namespace constructor | sub_6BC7E0 (nv_transforms.c, ~20 lines) |
| Cached module ID global | qword_126F0C0 (8 bytes, initially NULL) |
| Selected entity global | qword_126F140 (8 bytes, IL entity pointer) |
| Selected entity kind | byte_126F138 (1 byte, 7=variable or 11=routine) |
| Module ID file path global | qword_106BF80 (set by --module_id_file_name, flag 87) |
| Generate-module-ID-file flag | --gen_module_id_file (flag 83, no argument) |
| Module ID file path flag | --module_id_file_name (flag 87, has argument) |
| Options hash input global | qword_106C038 (string, command-line options to hash) |
| Output filename global | qword_106C040 (display filename override) |
| Emit-symbol-table flag | dword_106BFB8 (triggers write_module_id_to_file in backend) |
| CRC32 polynomial | 0xEDB88320 (CRC-32/ISO-HDLC, reflected) |
| CRC32 initial value | 0xFFFFFFFF |
| Debug trace topic | "module_id" (gated by dword_126EFC8) |
| Debug format strings | "make_module_id: str1 = %s, str2 = %s, pid = %ld\n" at 0xA5DA48; "make_module_id: final string = %s\n" at 0xA5DA80 |
Algorithm Overview
The module ID generator has three source modes, tried in priority order. The result is always cached in qword_126F0C0 -- the function returns immediately if the cache is populated.
Mode 1: Module ID File
If qword_106BF80 (set by the --module_id_file_name CLI flag) is non-NULL and dword_106BFB8 is clear, the function opens the specified file, reads its entire contents into a heap-allocated buffer, null-terminates it, and uses that as the module ID verbatim. This allows build systems to inject deterministic, reproducible identifiers from external sources (e.g., a content hash of the source file computed by the build system).
// sub_5AF830, mode 1: read module ID from file
if (!dword_106BFB8 && qword_106BF80) {
FILE *f = open_file(qword_106BF80, "r"); // sub_4F4870
if (!f) fatal("unable to open module id file for reading");
fseek(f, 0, SEEK_END);
size_t len = ftell(f);
rewind(f);
char *buf = allocate(len + 1); // sub_6B7340
if (fread(buf, 1, len, f) != len)
fatal("unable to read module id from file");
buf[len] = '\0';
fclose(f);
qword_126F0C0 = buf;
return buf;
}
Mode 2: Explicit Token (Caller-Provided String)
If the caller passes a non-NULL first argument (src), the function enters the default computation path using that string as the source filename component. When a secondary string argument (nptr) is provided instead (used by use_variable_or_routine_for_module_id_if_needed), it is first parsed with strtoul. If the parse succeeds (the entire string was consumed as a number), the numeric value is formatted as an 8-digit hex string. If the parse fails (the string is not purely numeric), the string is CRC32-hashed and the hash is used as the hex token. The working directory (qword_126EEA0) is used as an extra component, and the PID is always appended.
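The parse-or-hash rule for the nptr token can be sketched as below. This is an illustration of the described behavior rather than the recovered code: the function names are hypothetical, base 0 for strtoul is taken from the pseudocode's strtoul(nptr, 0), and the CRC routine is a plain bit-by-bit CRC-32 with the polynomial and initial value listed under Key Facts.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Bit-by-bit reflected CRC-32 (polynomial 0xEDB88320, init 0xFFFFFFFF). */
static uint32_t crc32_string(const char *s)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (; *s != '\0'; ++s) {
        uint8_t byte = (uint8_t)*s;
        for (int bit = 0; bit < 8; ++bit) {
            if ((crc ^ (uint32_t)(byte >> bit)) & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical helper: a fully numeric token yields its value,
 * anything else yields the CRC-32 of the token text. */
unsigned long token_to_value(const char *nptr)
{
    char *end = NULL;
    unsigned long value = strtoul(nptr, &end, 0);
    if (end == nptr || *end != '\0')
        value = crc32_string(nptr);   /* not a pure number: hash it */
    return value;
}

/* Formats the token as the hex component, e.g. "255" -> "000000ff". */
int token_to_hex(const char *nptr, char *out, size_t cap)
{
    return snprintf(out, cap, "%08lx", token_to_value(nptr));
}
```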
Mode 3: Default Computation (stat + ctime + getpid)
When no caller-provided string is available, the function stat()s the output file. If the stat succeeds and the file is a regular file (S_IFREG), the modification time (st_mtime) is converted to a string via ctime(), and the PID is obtained via getpid(). If the stat fails or the result is not a regular file, only the PID is used, with the compilation timestamp string (qword_126EB80) as the source component.
Complete Generation Pseudocode
function make_module_id(src_arg):
    // Check cache
    if qword_126F0C0 != NULL:
        return qword_126F0C0
    // Mode 1: read from file
    if !dword_106BFB8 AND qword_106BF80 != NULL:
        return read_file_contents(qword_106BF80)
    // Determine the output filename base
    if dword_126EE48:  // multi-TU mode
        output_name = **(qword_106BA10 + 184)  // from TU descriptor
    else:
        output_name = xmmword_126EB60[0]  // primary source file
    if qword_106C040 != NULL:
        output_name = qword_106C040  // display name override
    // Determine source string and extra string
    pid = 0
    extra = NULL
    if src_arg != NULL:
        src = src_arg
        // skip nptr processing, fall through to assembly
    else if nptr != NULL:  // caller-provided numeric token
        (value, endptr) = strtoul(nptr, 0)
        if endptr <= nptr OR *endptr != '\0':
            value = crc32(nptr)  // not a pure number, hash it
        src = sprintf("%08lx", value)
        pid = getpid()
        extra = qword_126EEA0  // working directory
    else:  // default: stat the output file
        if stat(output_name) succeeds AND is regular file:
            mtime = stat.st_mtime
            src = ctime(mtime)
            pid = getpid()
            extra = qword_126EEA0
        else:
            pid = getpid()
            src = qword_126EB80  // compilation timestamp
            extra = qword_126EEA0
    // --- Assemble the module ID string ---
    // Step 1: CRC32 of command-line options
    if qword_106C038 != NULL:
        options_crc = crc32(qword_106C038)
        options_hex = sprintf("_%08lx", options_crc)
    else:
        options_hex = sprintf("_%08lx", 0)
    // Step 2: source name compression
    name_len = strlen(src) + (extra ? strlen(extra) + 1 : 0)
    if name_len > 8:
        // Source name too long -- replace with CRC32
        combined_crc = crc32(src)
        if extra:
            combined_crc = crc32_continue(combined_crc, extra)
        src = sprintf("%08lx", combined_crc)
        // extra is consumed into the hash, set to NULL
        extra = NULL
    // Step 3: PID suffix
    if pid != 0:
        pid_suffix = sprintf("_%ld", pid)
    else:
        pid_suffix = ""
    // Step 4: extract basename of output file
    basename = strip_directory_prefix(output_name)  // sub_5AC1F0
    basename_len = strlen(basename)
    // Step 5: concatenate all components
    result = options_hex + "_" + basename_len + "_" + basename + "_" + src
    if extra:
        result += "_" + extra
    if pid != 0 AND nptr == NULL:
        result += pid_suffix
    // Step 6: sanitize -- replace all non-alphanumeric with '_'
    for each character c in result:
        if !isalnum(c):
            c = '_'
    // Cache and return
    qword_126F0C0 = result
    return result
Module ID Format
The final module ID string follows this structure:
_{options_crc}_{basename_len}_{basename}_{source_or_crc}[_{extra}][_{pid}]
All non-alphanumeric characters are replaced with underscores after assembly. A concrete example for a file kernel.cu compiled with nvcc -arch=sm_89 -rdc=true:
_a1b2c3d4_9_kernel_cu_5e6f7890_1234
 |        | |         |        |
 |        | |         |        +-- PID (getpid())
 |        | |         +----------- CRC32 of source name (> 8 chars compressed)
 |        | +--------------------- output basename ("kernel.cu", dot -> "_")
 |        +----------------------- basename length (9, "kernel.cu")
 +-------------------------------- CRC32 of options string
The leading underscore comes from the options_hex format ("_%08lx"). All dots, slashes, dashes, and other non-alphanumeric characters are uniformly replaced with underscores, making the result safe for use as a C identifier suffix.
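The assembly and sanitization steps can be sketched in C as follows. This is a simplified illustration of the format above, not the recovered function: assemble_module_id is a hypothetical name, all components are passed in already computed, and the caller suppresses the PID suffix by passing 0 instead of reproducing the generator's internal conditions.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds
 *   _{options_crc}_{basename_len}_{basename}_{src}[_{extra}][_{pid}]
 * and replaces every non-alphanumeric character with '_'.
 * Returns the length written, or -1 if `cap` is too small. */
int assemble_module_id(char *out, size_t cap, unsigned long options_crc,
                       const char *basename, const char *src,
                       const char *extra, long pid)
{
    size_t used;
    int n = snprintf(out, cap, "_%08lx_%lu_%s_%s", options_crc,
                     (unsigned long)strlen(basename), basename, src);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    used = (size_t)n;

    if (extra != NULL) {
        n = snprintf(out + used, cap - used, "_%s", extra);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }
    if (pid != 0) {
        n = snprintf(out + used, cap - used, "_%ld", pid);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }

    /* Sanitize: dots, slashes, dashes and the like all become '_'. */
    for (char *p = out; *p != '\0'; ++p) {
        if (!isalnum((unsigned char)*p))
            *p = '_';
    }
    return (int)used;
}
```

Passing options_crc 0xa1b2c3d4, basename "kernel.cu", src "5e6f7890", no extra string, and pid 1234 reproduces the example string shown above.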
CRC32 Implementation
The function contains an inline CRC32 implementation that appears three times in the decompiled output -- once for the options string hash, once for the source filename hash, and once for the extra string hash. All three are byte-identical in the binary, indicating the compiler inlined a shared helper (likely a static inline function or macro) at each call site.
The algorithm is the standard bit-by-bit reflected CRC-32 used by ISO 3309, ITU-T V.42, Ethernet, PNG, and zlib. The polynomial 0xEDB88320 is the bit-reversed form of the generator polynomial 0x04C11DB7.
CRC32 Pseudocode
function crc32(data: byte_string) -> uint32:
    crc = 0xFFFFFFFF  // initialization vector
    for each byte in data:
        for bit_index in 0..7:
            // XOR the lowest bit of crc with the current data bit
            if ((crc ^ (byte >> bit_index)) & 1) != 0:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc = crc >> 1
    return crc ^ 0xFFFFFFFF  // final inversion
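The pseudocode translates directly into portable C. The sketch below is a rolled-loop equivalent, not the unrolled form found in the binary, and the function name crc32_bitwise is hypothetical. It can be checked against the published CRC-32/ISO-HDLC check value: the ASCII string "123456789" hashes to 0xCBF43926.

```c
#include <stdint.h>

/* Bit-by-bit reflected CRC-32 over a NUL-terminated string.
 * Polynomial 0xEDB88320, initial value 0xFFFFFFFF, final XOR 0xFFFFFFFF. */
uint32_t crc32_bitwise(const char *data)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (; *data != '\0'; ++data) {
        uint8_t byte = (uint8_t)*data;
        for (int bit = 0; bit < 8; ++bit) {
            /* Compare the low CRC bit with the current data bit. */
            if ((crc ^ (uint32_t)(byte >> bit)) & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Testing each data bit against the low CRC bit, as here, gives the same result as the more common formulation that XORs the whole byte into the CRC before the eight shifts.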
CRC32 Decompiled (Single Instance)
This is one of the three identical inline copies from sub_5AF830, processing the options string at qword_106C038:
// sub_5AF830, lines 121-165 (options CRC32)
uint64_t crc = 0xFFFFFFFF;
uint8_t *ptr = (uint8_t *)qword_106C038;
if (ptr) {
while (*ptr) {
uint8_t byte = *ptr;
while (1) {
++ptr;
// Bit 0
uint64_t tmp = crc >> 1;
if (((uint8_t)crc ^ byte) & 1) tmp ^= 0xEDB88320;
// Bit 1
uint64_t tmp2 = tmp >> 1;
if (((uint8_t)tmp ^ (byte >> 1)) & 1) tmp2 ^= 0xEDB88320;
// Bit 2
uint64_t tmp3 = tmp2 >> 1;
if (((uint8_t)tmp2 ^ (byte >> 2)) & 1) tmp3 ^= 0xEDB88320;
// Bit 3
uint64_t tmp4 = tmp3 >> 1;
if (((uint8_t)tmp3 ^ (byte >> 3)) & 1) tmp4 ^= 0xEDB88320;
// Bit 4
uint64_t tmp5 = tmp4 >> 1;
if (((uint8_t)tmp4 ^ (byte >> 4)) & 1) tmp5 ^= 0xEDB88320;
// Bit 5
uint64_t tmp6 = tmp5 >> 1;
if (((uint8_t)tmp5 ^ (byte >> 5)) & 1) tmp6 ^= 0xEDB88320;
// Bit 6
uint64_t tmp7 = tmp6 >> 1;
if (((uint8_t)tmp6 ^ (byte >> 6)) & 1) tmp7 ^= 0xEDB88320;
// Bit 7
crc = tmp7 >> 1;
if ((((uint8_t)tmp7 ^ (byte >> 7)) & 1) == 0)
break;
byte = *ptr;
crc ^= 0xEDB88320;
if (!*ptr) goto done;
}
}
done:
sprintf(options_hex, "_%08lx", crc ^ 0xFFFFFFFF);
}
The unrolled 8-iteration loop processes one byte at a time without a lookup table. Each iteration shifts the CRC right by one bit and conditionally XORs the polynomial. The final XOR with 0xFFFFFFFF is the standard CRC-32 finalization step. The compiler fully unrolled the inner 8-bit loop, turning what was originally a counted for (int i = 0; i < 8; i++) loop into 8 sequential if-shift-xor blocks. The three copies in the function differ only in which input string they process and which output variable receives the result.
Why Three Inline Copies
The CRC32 code appears at three locations within sub_5AF830:
| Copy | Input | Output | Purpose |
|---|---|---|---|
| 1 (lines 121-164) | qword_106C038 (options string) | options_hex | Hash compiler flags into the module ID prefix |
| 2 (lines 186-273) | src + extra (source + extra strings) | src (overwritten with hex) | Compress long source filenames (> 8 chars) into a fixed-width hash |
| 3 (lines 361-407) | nptr (explicit token string) | v67 | Hash non-numeric caller-provided tokens |
Copy 2 is a two-pass CRC: it first hashes the source filename string (pass 2a), then continues the CRC state into the extra string (the working directory, pass 2b), producing a single combined hash. This is why the code between the two passes checks if (extra_len != 0) before starting the second one.
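The two-pass compression can be sketched by carrying the raw CRC state from one string into the next. This is an illustration under assumptions: the function names are hypothetical, the state is assumed to be continued without an intermediate final XOR, and no separator byte is hashed between the two strings (the text does not say whether one is).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Advances a raw CRC-32 state over a NUL-terminated string. */
static uint32_t crc32_update(uint32_t state, const char *s)
{
    for (; *s != '\0'; ++s) {
        uint8_t byte = (uint8_t)*s;
        for (int bit = 0; bit < 8; ++bit) {
            if ((state ^ (uint32_t)(byte >> bit)) & 1u)
                state = (state >> 1) ^ 0xEDB88320u;
            else
                state >>= 1;
        }
    }
    return state;
}

/* Hypothetical helper for the name_len > 8 rule: writes either `src`
 * unchanged or an 8-digit hex hash of src (+ extra) into `out`.
 * Returns 1 if the name was compressed (extra is then consumed),
 * 0 if it was copied as-is, -1 if `cap` is too small. */
int compress_source_name(const char *src, const char *extra,
                         char *out, size_t cap)
{
    size_t name_len = strlen(src) + (extra ? strlen(extra) + 1 : 0);
    int n;

    if (name_len <= 8) {
        n = snprintf(out, cap, "%s", src);
        return (n < 0 || (size_t)n >= cap) ? -1 : 0;
    }
    uint32_t state = crc32_update(0xFFFFFFFFu, src);   /* pass 2a */
    if (extra != NULL)
        state = crc32_update(state, extra);            /* pass 2b */
    n = snprintf(out, cap, "%08lx", (unsigned long)(state ^ 0xFFFFFFFFu));
    return (n < 0 || (size_t)n >= cap) ? -1 : 1;
}
```

Under these assumptions the combined hash equals the CRC-32 of the two strings concatenated, so hashing "1234" and then "56789" gives the same value as hashing "123456789" in one pass.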
The original C source almost certainly had a single crc32_string() helper function (or macro) that the compiler inlined at each call site during optimization. The EDG front-end codebase uses similar inline expansion patterns elsewhere (e.g., the 9 copies of UTF-8 decoding logic in the same file).
Module ID Source Modes -- Decision Tree
make_module_id(src)
  |
  +-- qword_126F0C0 set? --> return cached
  |
  +-- File mode available?
  |     (qword_106BF80 != NULL && !dword_106BFB8)
  |     YES --> read file, cache, return
  |
  +-- Caller provided src argument?
  |     YES --> use src as source component, no PID
  |
  +-- nptr set (explicit token)?
  |     YES --> strtoul(nptr)
  |       |
  |       +-- parse OK? --> use numeric value
  |       +-- parse fail? --> CRC32 hash nptr
  |     extra = working_directory
  |     pid = getpid()
  |
  +-- Default (no src, no nptr)
        stat(output_file)
          |
          +-- stat OK && regular file?
          |     src = ctime(st_mtime)
          |     pid = getpid()
          |     extra = working_directory
          |
          +-- stat fail
                src = qword_126EB80 (compilation timestamp)
                pid = getpid()
                extra = working_directory
Entity-Based Module ID Selection
An alternative entry path into the module ID system is use_variable_or_routine_for_module_id_if_needed (sub_5CF030, il.c, line 31969, ~65 lines). Instead of computing a hash from file metadata, this function selects a representative entity (variable or function) from the current translation unit whose mangled name serves as a stable identifier. The mangled name is then passed to sub_5AF830 as the src argument.
Selection Criteria
The function is invoked during IL processing. It first checks sub_5AF820 (get_module_id) -- if a module ID is already cached, it returns immediately. Otherwise, it evaluates the candidate entity:
// sub_5CF030, simplified
char *use_variable_or_routine_for_module_id_if_needed(entity, kind) {
if (get_module_id())
return get_module_id(); // already computed
if (qword_126F140) {
// Already selected an entity, extract its name
assert(dword_106BF10 || dword_106BEF8); // il.c:32064
goto extract_name;
}
// Validate entity kind: must be 7 (variable) or 11 (routine)
assert(entity && ((kind - 7) & 0xFB) == 0); // il.c:31969
// Check if entity is unsuitable (member of TU scope, etc.)
if (entity->scope == primary_scope
|| (entity->flags_81 & 0x04) // unnamed namespace
|| (entity->scope && entity->scope->kind == 3))
{
// Skip: entity in primary scope, unnamed namespace, or class scope
...
return NULL;
}
if (kind == 7) { // Variable
// Must have: no storage class, has definition, not template-related,
// not inline, not constexpr, not thread-local
if (entity->storage_class == 0
&& entity->has_definition // offset +169
&& !(entity->flags_162 & 0x10) // not explicit specialization
&& !(entity->flags_164 & 0x10) // not partial specialization
&& entity->flags_148 >= 0 // not extern template
&& !(entity->flags_160 & 0x08) // not inline variable
&& entity->flags_165 >= 0) // not constexpr
{
qword_126F140 = entity;
byte_126F138 = 7;
}
}
else { // Routine (kind == 11)
// Must have: no specialization, no builtin return type,
// no template parameters, not defaulted/deleted
if (!entity->flags_164
&& entity->flags_176 >= 0 // not defaulted
&& !(entity->flags_179 & 0x02) // not deleted
&& !(entity->flags_180 & 0x38) // not template-related
&& !(entity->flags_184 & 0x20)) // not consteval
{
// Additional checks: return type not builtin, not coroutine
if (!is_builtin_type(entity->return_type)
&& !is_generic_function(entity)
&& !is_concept_function(entity->return_type_entry))
{
qword_126F140 = entity;
byte_126F138 = 11;
}
}
}
extract_name:
// Get the entity's mangled name
char *name;
if (byte_126F138 == 7) {
// Variable: check unnamed namespace, use mangled or lowered name
if ((entity->flags_81 & 0x04) || (entity->scope && entity->scope->kind == 3))
name = get_lowered_name(); // sub_6A70C0
else
name = entity->name; // offset +8
} else {
// Routine: similar checks, use name or lowered name
assert(byte_126F138 == 11); // il.c:32079
if (dword_126EFB4 == 2) // C++20 mode
name = get_mangled_name(); // sub_6A76C0
else
name = entity->name;
}
assert(name != NULL); // il.c:32086
return make_module_id(name); // sub_5AF830(name)
}
The strict filtering ensures the selected entity is one whose mangled name is deterministic across compilations of the same source. Template instantiations, inline variables, and unnamed namespace entities are excluded because their names may vary or conflict.
set_module_id and get_module_id
The module ID cache has a setter/getter pair for use by external callers that compute the ID through other means:
// sub_5AF7F0 -- set_module_id (host_envir.c, line 3387)
void set_module_id(char *id) {
assert(qword_126F0C0 == NULL); // "set_module_id" -- must not be set already
qword_126F0C0 = id;
}
// sub_5AF820 -- get_module_id (host_envir.c)
char *get_module_id(void) {
return qword_126F0C0;
}
The setter asserts that the module ID has not been previously set. This is a safety guard: the module ID must be computed exactly once per compilation. Any attempt to set it twice indicates a logic error in the pipeline.
write_module_id_to_file
The write_module_id_to_file function (sub_5B0180, host_envir.c, ~30 lines) is called during the backend output phase when dword_106BFB8 (emit-symbol-table flag) is set. It generates the module ID (via sub_5AF830(0)) and writes the raw string to a file:
// sub_5B0180 -- write_module_id_to_file
void write_module_id_to_file(void) {
char *id = make_module_id(NULL); // sub_5AF830(0)
char *path = qword_106BF80; // module ID file path
if (!path)
fatal("module id filename not specified");
FILE *f = open_file_for_writing(path); // sub_4F48F0
size_t len = strlen(id);
if (fwrite(id, 1, len, f) != len)
fatal("error writing module id to file");
fclose(f);
}
The module ID file is a plain text file containing nothing but the module ID string (no newline, no header). This file is consumed by the fatbinary tool (fatbinary) and nvlink during the device linking phase.
Downstream Consumers
The module ID is referenced in seven distinct locations across the cudafe++ binary:
1. Anonymous Namespace Mangling (sub_6BC7E0)
Constructs the _GLOBAL__N_<module_id> string used as the _NV_ANON_NAMESPACE macro value in the .int.c trailer:
// sub_6BC7E0 (nv_transforms.c, ~20 lines)
if (qword_1286A00) // cached?
return qword_1286A00;
char *id = make_module_id(NULL); // sub_5AF830(0)
char *buf = allocate(strlen(id) + 12); // "_GLOBAL__N_" = 11 chars + NUL
strcpy(buf, "_GLOBAL__N_");
strcpy(buf + 11, id);
qword_1286A00 = buf; // cache for reuse
return buf;
This string appears in the .int.c output as:
#define _NV_ANON_NAMESPACE _GLOBAL__N_a1b2c3d4e5f67890
#ifdef _NV_ANON_NAMESPACE
#endif
#include "kernel.cu"
#undef _NV_ANON_NAMESPACE
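The build-once-and-cache pattern used for the anonymous namespace name can be sketched as a standalone function. This is an illustrative sketch rather than the decompiled code: the function name `anon_namespace_name`, the static cache variable, and the use of `malloc` in place of the EDG allocator are all assumptions, and the module ID is passed in as a parameter instead of being obtained from `make_module_id`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for qword_1286A00: the cached result. */
static char *cached_anon_name = NULL;

/* Build "_GLOBAL__N_<module_id>" on the first call and reuse it afterwards.
 * malloc replaces the EDG allocator here (assumption). */
const char *anon_namespace_name(const char *module_id) {
    static const char prefix[] = "_GLOBAL__N_";   /* 11 characters */
    if (cached_anon_name)
        return cached_anon_name;                  /* cache hit */
    assert(module_id != NULL);
    size_t prefix_len = sizeof prefix - 1;
    /* prefix + module ID + terminating NUL */
    char *buf = malloc(prefix_len + strlen(module_id) + 1);
    if (!buf)
        return NULL;
    memcpy(buf, prefix, prefix_len);
    strcpy(buf + prefix_len, module_id);
    cached_anon_name = buf;
    return buf;
}
```

Because the result is cached, a second call with a different argument still returns the first string, which mirrors the "computed once" behavior described for the module ID itself.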
2. Scoped Name Prefix Builder (sub_6BD2F0)
The recursive nv_build_scoped_name_prefix function uses the same _GLOBAL__N_<module_id> string when building scope-qualified names for internal-linkage device entities in host reference arrays. If the entity is in an anonymous namespace and qword_1286A00 is not yet computed, it calls sub_5AF830(0) directly to generate the module ID.
3. Internal Linkage Prefix (sub_69DAA0)
Constructs _INTERNAL<module_id> for internal-linkage entities during name lowering:
// sub_69DAA0 (lower_name.c context)
char *id = make_module_id(NULL);
char *buf = allocate(strlen(id) + 10);
strcpy(buf, "_INTERNAL"); // 0x414E5245544E495F in little-endian
strcpy(buf + 9, id);
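The little-endian constant in the comment above can be checked by decoding the immediate byte by byte. The sketch below is a verification aid only; `decode_le_qword` and `qword_spells` are hypothetical helper names, not functions from the binary. Note that a 64-bit immediate covers only the first eight characters ("_INTERNA"); the ninth character and the terminator would have to be stored separately.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode a 64-bit immediate into the 8 bytes a little-endian store would
 * write to memory, lowest-order byte first. out must hold 9 bytes. */
void decode_le_qword(uint64_t value, char out[9]) {
    for (int i = 0; i < 8; i++)
        out[i] = (char)((value >> (8 * i)) & 0xFF);
    out[8] = '\0';
}

/* Return 1 if the immediate decodes to the given 8-character text. */
int qword_spells(uint64_t value, const char *text) {
    char buf[9];
    assert(strlen(text) == 8);
    decode_le_qword(value, buf);
    return strcmp(buf, text) == 0;
}
```

The decoding uses shifts rather than a `memcpy` of the integer, so it gives the same answer regardless of the host's byte order.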
4. Unnamed Namespace Naming (sub_69ED40, give_unnamed_namespace_a_name)
When the name lowering pass encounters an unnamed (anonymous) namespace entity, it calls sub_5AF830(0) to obtain the module ID and constructs a _GLOBAL__N_<module_id> name for the namespace. The function is confirmed as give_unnamed_namespace_a_name from assert strings at lower_name.c lines 7880 and 7889.
5. Frontend Wrapup (sub_588E90)
The translation_unit_wrapup function (sub_588E90, fe_wrapup.c) calls sub_5AF830(0) unconditionally during frontend finalization. This ensures the module ID is computed and cached before the backend code generator needs it, even if no earlier consumer triggered computation.
6. Entity-Based Selection (sub_5CF030)
As described above, use_variable_or_routine_for_module_id_if_needed selects a representative entity and passes its mangled name to sub_5AF830, which then uses the name as the src component instead of file metadata.
7. Module ID File Output (sub_5B0180)
Writes the raw module ID string to a file for consumption by fatbinary and nvlink.
Integration with the Compilation Pipeline
The module ID is requested at multiple points during compilation, but it is computed only once: the first call caches the result, and every subsequent call returns the cached value:
Pipeline stage Module ID action
--------------------------------------------------------------
CLI parsing Flags 83/87 set qword_106BF80
Options string stored in qword_106C038
Frontend processing sub_5CF030 may select entity-based ID
Frontend wrapup (sub_588E90) sub_5AF830(0) ensures ID is computed
Backend output (sub_489000) sub_6BC7E0 uses ID for _NV_ANON_NAMESPACE
sub_6BCF80 uses ID in host reference arrays
sub_5B0180 writes ID to file (if dword_106BFB8)
The --gen_module_id_file flag (83) controls whether a module ID file is generated at all. The --module_id_file_name flag (87) specifies its path. Both are set by nvcc when invoking cudafe++ with -rdc=true.
PID Incorporation
The getpid() call ensures that concurrent compilations of the same source file produce different module IDs. Without the PID, two parallel nvcc invocations compiling the same .cu file with the same flags would generate identical module IDs, causing runtime registration collisions when the resulting objects are linked together. The PID is appended as the final underscore-separated component and is only included in modes 2 and 3 (not when the caller provides a src argument directly, and not when the module ID is read from a file). This means reproducible builds require mode 1 (file-based) or entity-based selection.
Global Variables
| Address | Size | Name | Description |
|---|---|---|---|
| qword_126F0C0 | 8 | cached_module_id | Cached module ID string (computed once, never freed) |
| qword_106BF80 | 8 | module_id_file_path | Path from --module_id_file_name (flag 87) |
| qword_106C038 | 8 | options_hash_input | Command-line options string for CRC32 hashing |
| qword_106C040 | 8 | display_filename | Output filename override (used as basename source) |
| qword_126F140 | 8 | selected_entity | Entity chosen by use_variable_or_routine_for_module_id_if_needed |
| byte_126F138 | 1 | selected_entity_kind | Kind of selected entity (7=variable, 11=routine) |
| dword_106BFB8 | 4 | emit_symbol_table | Flag: write module ID file + symbol table in backend |
| qword_1286A00 | 8 | cached_anon_namespace_hash | Cached _GLOBAL__N_<module_id> string |
| qword_126EEA0 | 8 | working_directory | Current working directory (set during host_envir_early_init) |
| qword_126EB80 | 8 | compilation_timestamp | ctime() of compilation start (IL header) |
| dword_126EFC8 | 4 | debug_trace_flag | Enables debug trace output to FILE s |
Function Map
| Address | Name | Source File | Lines | Role |
|---|---|---|---|---|
| sub_5AF830 | make_module_id | host_envir.c | ~450 | CRC32-based unique TU identifier generator |
| sub_5AF7F0 | set_module_id | host_envir.c | ~10 | Setter with assert guard (may be called at most once) |
| sub_5AF820 | get_module_id | host_envir.c | ~3 | Returns qword_126F0C0 |
| sub_5B0180 | write_module_id_to_file | host_envir.c | ~30 | Writes module ID to file for nvlink |
| sub_5CF030 | use_variable_or_routine_for_module_id_if_needed | il.c:31969 | ~65 | Selects representative entity for stable ID |
| sub_6BC7E0 | (anon namespace hash) | nv_transforms.c | ~20 | Constructs _GLOBAL__N_<module_id> |
| sub_6BD2F0 | nv_build_scoped_name_prefix | nv_transforms.c | ~95 | Recursive scope-qualified name builder |
| sub_69DAA0 | (internal linkage prefix) | lower_name.c | ~60 | Constructs _INTERNAL<module_id> prefix |
| sub_69ED40 | give_unnamed_namespace_a_name | lower_name.c:7880 | ~80 | Names anonymous namespaces with module ID |
| sub_588E90 | translation_unit_wrapup | fe_wrapup.c | ~30 | Ensures module ID is computed during wrapup |
Cross-References
- .int.c File Format -- _NV_ANON_NAMESPACE trailer section that consumes the module ID
- CUDA Runtime Boilerplate -- managed memory registration that uses the fatbinary handle
- Host Reference Arrays -- .nvHR* sections where scoped names include the module ID
- RDC Mode -- separate compilation mode that requires module IDs for cross-TU linking
- CLI Flag Inventory -- flags 83 (gen_module_id_file) and 87 (module_id_file_name)
- Backend Code Generation -- output phase where write_module_id_to_file is called
- Frontend Wrapup -- translation_unit_wrapup triggers early module ID computation
EDG 6.6 Overview
cudafe++ is built on top of Edison Design Group's (EDG) commercial C++ frontend, version 6.6. EDG provides the complete C++ language implementation -- lexer, preprocessor, parser, semantic analysis, type system, template instantiation, overload resolution, constant evaluation, and Itanium ABI name mangling. NVIDIA licenses this frontend and compiles it from source with CUDA-specific modifications injected at three distinct integration levels: a dedicated NVIDIA source file (nv_transforms.c), surgical modifications to EDG source files that call into NVIDIA headers, and a large layer of CUDA property-query leaf functions that permeate every compilation phase.
The build path embedded in the binary is:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/
Source Tree
The binary contains debug path references to 52 .c files and 13 .h files. Together these constitute the entire EDG frontend plus NVIDIA's single dedicated source file.
Source Files (.c)
| # | File | Pipeline role |
|---|---|---|
| 1 | attribute.c | C++11/GNU/CUDA attribute parsing and validation |
| 2 | class_decl.c | Class/struct/union declaration processing, lambda scanning |
| 3 | cmd_line.c | Command-line argument parsing (276 flags) |
| 4 | const_ints.c | Compile-time integer constant evaluation |
| 5 | cp_gen_be.c | Backend -- .int.c code generation, source sequence walking |
| 6 | debug.c | Debug output and IL dump infrastructure |
| 7 | decl_inits.c | Declaration initializer processing |
| 8 | decl_spec.c | Declaration specifier parsing (storage class, type qualifiers) |
| 9 | declarator.c | Declarator parsing (pointers, arrays, function signatures) |
| 10 | decls.c | General declaration processing |
| 11 | disambig.c | Syntactic disambiguation (expression vs. declaration) |
| 12 | error.c | Diagnostic message formatting and emission (3,795 messages) |
| 13 | expr.c | Expression parsing and semantic analysis |
| 14 | exprutil.c | Expression utility functions (coercion, evaluation) |
| 15 | extasm.c | Extended inline assembly parsing |
| 16 | fe_init.c | Frontend initialization (36 subsystem init routines) |
| 17 | fe_wrapup.c | Frontend finalization (5-pass wrapup sequence) |
| 18 | float_pt.c | Floating-point literal parsing |
| 19 | floating.c | IEEE 754 constant folding (arbitrary precision) |
| 20 | folding.c | General constant folding |
| 21 | func_def.c | Function definition processing |
| 22 | host_envir.c | Host environment interface (file I/O, exit, signals) |
| 23 | il.c | IL node creation, linking, and management |
| 24 | il_alloc.c | IL arena allocator (region-based, 64KB blocks) |
| 25 | il_to_str.c | IL-to-string conversion for debug display |
| 26 | il_walk.c | IL tree walking with 5 callback functions |
| 27 | interpret.c | Constexpr interpreter (compile-time evaluation engine) |
| 28 | layout.c | Struct/class memory layout computation |
| 29 | lexical.c | Lexer / tokenizer (357 token kinds) |
| 30 | literals.c | String and numeric literal processing |
| 31 | lookup.c | Name lookup (unqualified, qualified, ADL) |
| 32 | lower_name.c | Itanium ABI name mangling |
| 33 | macro.c | Preprocessor macro expansion |
| 34 | mem_manage.c | Internal memory management (arena allocator, tracking) |
| 35 | modules.c | C++20 module support (mostly stubs in CUDA build) |
| 36 | nv_transforms.c | NVIDIA-authored -- CUDA AST transforms, lambda wrappers |
| 37 | overload.c | C++ overload resolution |
| 38 | pch.c | Precompiled header support |
| 39 | pragma.c | Pragma processing (43 pragma kinds) |
| 40 | preproc.c | Preprocessor directives (#include, #ifdef, etc.) |
| 41 | scope_stk.c | Scope stack management |
| 42 | src_seq.c | Source sequence (declaration ordering for emission) |
| 43 | statements.c | Statement parsing and semantic analysis |
| 44 | symbol_ref.c | Symbol reference tracking |
| 45 | symbol_tbl.c | Symbol table operations (hash-based lookup) |
| 46 | sys_predef.c | System predefinitions (built-in types, macros) |
| 47 | target.c | Target configuration (data model, ABI) |
| 48 | templates.c | Template instantiation, specialization, deduction |
| 49 | trans_copy.c | Translation unit IL deep copy |
| 50 | trans_corresp.c | Cross-TU type correspondence verification (RDC) |
| 51 | trans_unit.c | Translation unit lifecycle (the main entry point) |
| 52 | types.c | C++ type system (22 type kinds, queries, construction) |
Header Files (.h)
| # | File | Contents |
|---|---|---|
| 1 | decls.h | Declaration node structure definitions |
| 2 | float_type.h | Floating-point type descriptors |
| 3 | il.h | IL entry kind enums, node structure definitions |
| 4 | lexical.h | Token kind enums, lexer state |
| 5 | mem_manage.h | Memory allocator interface |
| 6 | modules.h | Module system declarations |
| 7 | nv_transforms.h | NVIDIA-authored -- CUDA transform API, called from EDG files |
| 8 | overload.h | Overload resolution structures |
| 9 | scope_stk.h | Scope stack interface |
| 10 | symbol_tbl.h | Symbol table interface |
| 11 | types.h | Type node structure, type kind enum |
| 12 | util.h | General utility macros and inline functions |
| 13 | walk_entry.h | IL walking callback signatures |
Code Breakdown
The binary contains approximately 6,300 identifiable functions, which divide into EDG code and the statically linked C++ runtime:
| Category | Functions | % of binary | Description |
|---|---|---|---|
| Attributed to source files | ~2,200 | ~35% | Matched to one of the 52 .c files via assert strings, source path references, or address-range mapping |
| Unmapped EDG functions | ~2,900 | ~46% | EDG code without source file attribution (inlined, optimized, or from headers) |
| C++ runtime / ABI | ~1,200 | ~19% | Itanium ABI runtime, exception handling, std:: library, operator new/delete |
Top 10 Source Files by Function Count
| Rank | File | Functions | Primary responsibility |
|---|---|---|---|
| 1 | expr.c | ~195 | Expression parsing, operator semantics, implicit conversions |
| 2 | il.c | ~185 | IL node creation, entry kind dispatch, node linking |
| 3 | templates.c | ~172 | Template instantiation worklist, SFINAE, deduction |
| 4 | exprutil.c | ~154 | Expression coercion, arithmetic conversions, lvalue analysis |
| 5 | symbol_tbl.c | ~102 | Symbol table hash operations, scope chain walking |
| 6 | overload.c | ~100 | Candidate set construction, ICS ranking, best viable function |
| 7 | class_decl.c | ~90 | Class body parsing, member declarations, lambda scanning |
| 8 | attribute.c | ~83 | Attribute parsing, CUDA attribute validation dispatch |
| 9 | cp_gen_be.c | ~81 | Backend emission, .int.c generation, device stub writing |
| 10 | scope_stk.c | ~72 | Scope push/pop, scope kind management, lookup context |
Architecture: Classic Frontend Pipeline
EDG implements a textbook multi-pass compiler frontend. cudafe++ drives it in a single-threaded, sequential pipeline from main() at 0x408950:
source.cu
|
v
+-----------+ lexical.c, macro.c, preproc.c, literals.c
| Lexer / | 357 token kinds, trigraph handling, raw string
| Preproc | adjustment, __CUDA_ARCH__ macro injection
+-----------+
| token stream
v
+-----------+ expr.c, declarator.c, decl_spec.c, statements.c,
| Parser | class_decl.c, disambig.c, func_def.c, extasm.c
| | Recursive-descent with disambiguation
+-----------+
| parse tree
v
+-----------+ overload.c, exprutil.c, lookup.c, templates.c,
| Semantic | types.c, attribute.c, const_ints.c, folding.c
| Analysis | Type checking, overload resolution, template
| | instantiation, constexpr evaluation
+-----------+
| annotated AST
v
+-----------+ il.c, il_alloc.c, il_walk.c, scope_stk.c,
| IL Build | symbol_tbl.c, src_seq.c, trans_unit.c
| | Scope-linked graph of all declarations, types,
| | expressions, statements, templates
+-----------+
| IL graph
v
+-----------+ fe_wrapup.c, lower_name.c, trans_corresp.c
| Wrapup | 5-pass finalization: dead code marking,
| | name lowering, cross-TU correspondence (RDC)
+-----------+
| finalized IL
v
+-----------+ cp_gen_be.c, nv_transforms.c, host_envir.c
| Backend | Walk source sequence, emit .int.c file,
| Emission | inject CUDA stubs, lambda wrappers, host
| | reference arrays, managed variable boilerplate
+-----------+
|
v
output.int.c
The process_translation_unit function (sub_7A40A0 in trans_unit.c) is the main entry point for compilation. It allocates a 424-byte TU descriptor, opens the source file, and orchestrates the parse-to-IL sequence. For the main compilation path, it calls:
1. sub_586240 -- parse the translation unit (drives lexer + parser)
2. sub_4E8A60 -- standard compilation finalization (IL completion)
3. sub_588F90 -- fe_wrapup (5-pass IL finalization)
4. sub_489000 -- backend entry (.int.c emission, "Back end time")
NVIDIA Modifications
NVIDIA's CUDA integration is organized in three layers, from most isolated to most pervasive.
Level 1: NVIDIA-Authored Source (nv_transforms.c + nv_transforms.h)
Level 1 is a single dedicated NVIDIA source file occupying address range 0x6BAE70--0x6BE4A0, with approximately 34 functions in ~14KB of code. This file implements all CUDA-specific AST transformations:
| Function | Address | Purpose |
|---|---|---|
| nv_init_transforms | 0x6BAE70 | Zero all NVIDIA transform state at startup |
| emit_device_lambda_wrapper | 0x6BB790 | Generate __nv_dl_wrapper_t<Tag, F1..FN> partial specialization |
| emit_hdl_wrapper (non-mutable) | 0x6BBB10 | Generate __nv_hdl_wrapper_t<false, ...> type-erased wrapper |
| emit_hdl_wrapper (mutable) | 0x6BBEE0 | Same as above but operator() is non-const |
| emit_array_capture_helpers | 0x6BC290 | Generate __nv_lambda_array_wrapper for 2D-8D arrays |
| nv_validate_cuda_attributes | 0x6BC890 | Validate __launch_bounds__, __cluster_dims__, __maxnreg__ |
| nv_reset_capture_bitmasks | 0x6BCBC0 | Zero device/host-device capture bitmasks per TU |
| nv_record_capture_count | 0x6BCBF0 | Set bit N in capture bitmap for wrapper generation |
| nv_emit_lambda_preamble | 0x6BCC20 | Master emitter: inject all __nv_* templates into compilation |
| nv_find_parent_lambda_function | 0x6BCDD0 | Walk scope chain for enclosing device/global function |
| nv_emit_host_reference_array | 0x6BCF80 | Generate .nvHRKE/.nvHRDI/etc. ELF section arrays |
| nv_get_full_nv_static_prefix | 0x6BE300 | Build scoped name + register entity in host ref arrays |
The companion header nv_transforms.h declares the API surface that EDG source files call into. This is the primary NVIDIA integration point -- EDG files reference nv_transforms.c functions only through the declarations in this header.
Key data structures managed by nv_transforms.c:
| Global | Size | Purpose |
|---|---|---|
| unk_1286980 | 128 bytes (1024 bits) | Device lambda capture-count bitmap |
| unk_1286900 | 128 bytes (1024 bits) | Host-device lambda capture-count bitmap |
| qword_12868F0 | pointer | Entity-to-closure ID hash table |
| qword_1286A00 | pointer | Cached anonymous namespace name (_GLOBAL__N_<module_id>) |
| qword_1286760 | pointer | Cached static name prefix string |
| unk_1286780--unk_12868C0 | 6 lists | Host reference array symbol lists (one per section type) |
| dword_126E270 | 4 bytes | C++17 noexcept-in-type-system flag |
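
The capture-count bitmaps lend themselves to a short sketch. The code below models a 128-byte (1024-bit) bitmap in which bit N records that a lambda with N captures was seen, as described for nv_record_capture_count. The function names, the static array, and the bit order within each byte are assumptions made for illustration; the binary's actual bit layout is not confirmed here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CAPTURE_BITMAP_BYTES 128   /* 128 bytes = 1024 bits, per the table */

/* Hypothetical stand-in for the device lambda bitmap (unk_1286980). */
static uint8_t device_capture_bitmap[CAPTURE_BITMAP_BYTES];

/* Clear every bit, as the per-TU reset is described as doing. */
void reset_capture_bitmap(uint8_t *bitmap) {
    memset(bitmap, 0, CAPTURE_BITMAP_BYTES);
}

/* Record that a lambda with `count` captures was seen (set bit `count`). */
void record_capture_count(uint8_t *bitmap, unsigned count) {
    assert(count < CAPTURE_BITMAP_BYTES * 8);
    bitmap[count / 8] |= (uint8_t)(1u << (count % 8));
}

/* Return 1 if a wrapper for `count` captures has been requested. */
int capture_count_seen(const uint8_t *bitmap, unsigned count) {
    assert(count < CAPTURE_BITMAP_BYTES * 8);
    return (bitmap[count / 8] >> (count % 8)) & 1;
}
```

A bitmap of this kind lets the emitter generate only the wrapper specializations whose capture counts actually occur in the translation unit.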
Level 2: NVIDIA-Modified EDG Files
Three EDG source files contain direct calls into nv_transforms.h functions, making them the "NVIDIA-aware" EDG files:
cp_gen_be.c -- The backend code generator. When it encounters a type named __nv_lambda_preheader_injection during source sequence walking, it calls nv_emit_lambda_preamble (sub_6BCC20) to inject the entire __nv_* template library. It also calls NVIDIA functions for host reference array emission, managed variable boilerplate, and device stub generation.
class_decl.c -- The class/struct declaration processor. The scan_lambda function (sub_447930, 2113 lines) detects __host__/__device__ annotations on lambda expressions, validates CUDA-specific constraints (35+ error codes in range 3592--3690), and records capture counts in the bitmaps via nv_record_capture_count.
statements.c -- The statement parser. Calls NVIDIA transform functions for statement-level CUDA validation, such as checking that __syncthreads() is not called in divergent control flow within __global__ functions.
Level 3: CUDA Property Query Layer
The most pervasive integration layer consists of 104 small leaf functions clustered at addresses 0x7A6000--0x7AA000 (within types.c). These are type-system query functions that answer questions like "is this type a __device__ pointer?", "does this class have __shared__ storage?", "is this a kernel function type?".
Each follows a canonical pattern:
bool is_<property>_type(type_node *t) {
while (t->kind == 12) // 12 = tk_typedef
t = t->referenced_type; // strip typedef layers
return <check on underlying type>;
}
These 104 accessors account for 3,648 total call sites across the binary. The ten most frequently called accessors, ranked by number of call sites:
| Address | Callers | Identity | Returns |
|---|---|---|---|
| 0x7A8A30 | 407 | is_class_or_struct_or_union_type | kind in {9, 10, 11} |
| 0x7A9910 | 389 | type_pointed_to | ptr->referenced_type (kind == 6) |
| 0x7A9E70 | 319 | get_cv_qualifiers | accumulated cv-qual bits (& 0x7F) |
| 0x7A6B60 | 299 | is_dependent_type | bit 5 of byte +133 |
| 0x7A7630 | 243 | is_object_pointer_type | kind == 6 && !(bit 0 of +152) |
| 0x7A8370 | 221 | is_array_type | kind == 8 |
| 0x7A7B30 | 199 | is_member_pointer_or_ref | kind == 6 && (bit 0 of +152) |
| 0x7A6AC0 | 185 | is_reference_type | kind == 7 |
| 0x7A8DC0 | 169 | is_function_type | kind == 14 |
| 0x7A6E90 | 140 | is_void_type | kind == 1 |
CUDA integration is pervasive because these tiny accessors are called from every phase of compilation -- the parser checks execution space during declaration, semantic analysis validates cross-space calls, the type system queries CUDA qualifiers during overload resolution, and the backend reads them during IL emission. There is no isolated "CUDA layer"; the CUDA awareness is distributed across the entire frontend through these leaf functions.
Type Kind Constants
The type query functions operate on a type_node structure (176 bytes, IL entry kind 6). The kind field at offset +132 encodes:
| kind | Name | Description |
|---|---|---|
| 0 | tk_none | Null/invalid |
| 1 | tk_void | void |
| 2 | tk_integer | All integer types including bool, char, enums |
| 3 | tk_float | float |
| 4 | tk_double | double |
| 5 | tk_long_double | long double |
| 6 | tk_pointer | Pointer types (object and member) |
| 7 | tk_reference | Lvalue reference (T&) |
| 8 | tk_array | Array types (T[], T[N]) |
| 9 | tk_struct | struct |
| 10 | tk_class | class |
| 11 | tk_union | union |
| 12 | tk_typedef | Typedef alias (stripped by all query functions) |
| 13 | tk_pointer_to_member | Pointer-to-member (T C::*) |
| 14 | tk_function | Function type |
| 15 | tk_bitfield | Bit-field |
| 16 | tk_pack_expansion | Parameter pack expansion |
| 17 | tk_pack_expansion | Alternate pack expansion form |
| 18 | tk_auto | auto / decltype(auto) placeholder |
| 19 | tk_rvalue_reference | Rvalue reference (T&&) |
| 20 | tk_nullptr_t | std::nullptr_t |
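
The canonical accessor pattern can be made concrete with a simplified model. The sketch below uses a two-field `type_node` instead of the real 176-byte structure (where kind lives at offset +132), and only a subset of the kind constants from the table; the helper name `skip_typedefs` is an assumption introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Subset of the kind values listed in the table above. */
enum type_kind {
    tk_void = 1, tk_integer = 2, tk_pointer = 6, tk_array = 8,
    tk_struct = 9, tk_class = 10, tk_union = 11, tk_typedef = 12
};

/* Simplified model of a type node; not the real 176-byte layout. */
struct type_node {
    enum type_kind kind;
    struct type_node *referenced_type;   /* typedef target or pointee */
};

/* Strip typedef layers until a non-typedef type is reached. */
static const struct type_node *skip_typedefs(const struct type_node *t) {
    while (t->kind == tk_typedef)
        t = t->referenced_type;
    return t;
}

/* True for kinds 9, 10, and 11 after typedef stripping. */
int is_class_or_struct_or_union_type(const struct type_node *t) {
    assert(t != NULL);
    t = skip_typedefs(t);
    return t->kind == tk_struct || t->kind == tk_class || t->kind == tk_union;
}

/* True for kind 8 after typedef stripping. */
int is_array_type(const struct type_node *t) {
    assert(t != NULL);
    return skip_typedefs(t)->kind == tk_array;
}
```

Every query first resolves typedef chains, so a typedef of a typedef of a struct answers the same as the struct itself.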
Memory Management
EDG uses a custom region-based arena allocator implemented in mem_manage.c (address range 0x6B5E40--0x6BA230). Key characteristics:
- Block size: 64KB (0x10000) per block
- Region model: Multiple numbered regions (file-scope = region 1, per-function = region N)
- Free list recycling: Freed blocks go to qword_1280730 for reuse before new allocation
- Trim threshold: Blocks with more than 1,887 unused bytes are split; remainder goes to free list
- Tracking: All allocations recorded for watermark monitoring (qword_1280718 = total, qword_1280710 = peak)
- Dual mode: Malloc-based (mode 0) or mmap-based (mode 1), selected by dword_1280728 from CLI flag
Block structure (48+ bytes header per 64KB block):
| Offset | Type | Field |
|---|---|---|
| +0 | void* | Next pointer (block chain) |
| +8 | void* | Current allocation pointer |
| +16 | void* | High-water mark within block |
| +24 | void* | End-of-block pointer |
| +32 | int64 | Block total size (0 if sub-block) |
| +40 | byte | Trimmed flag |
| +48 | -- | Start of usable data |
The free_fe function (sub_6BA230, 533 lines) implements a hash-table-based deduplicating allocator for front-end object deallocation, using open addressing with linear probing.
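
The block header layout can be expressed as a C struct with a simple bump allocator on top. This is a sketch under stated assumptions: field names are illustrative, an LP64 target is assumed so that pointers are 8 bytes, `malloc` stands in for the real block source, and the 8-byte rounding is a guess rather than a recovered detail. Free-list recycling and trimming are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_SIZE 0x10000u   /* 64KB per block, per the text */

/* Offsets follow the table above; field names are illustrative. */
struct arena_block {
    struct arena_block *next;   /* +0  block chain */
    char *cur;                  /* +8  current allocation pointer */
    char *high_water;           /* +16 high-water mark within block */
    char *end;                  /* +24 end-of-block pointer */
    int64_t total_size;         /* +32 block total size (0 if sub-block) */
    uint8_t trimmed;            /* +40 trimmed flag */
    uint8_t pad[7];             /* padding so usable data starts at +48 */
};

static_assert(sizeof(struct arena_block) == 48, "header must be 48 bytes");

/* Allocate and initialize one empty 64KB block. */
struct arena_block *block_new(void) {
    struct arena_block *b = malloc(BLOCK_SIZE);
    if (!b)
        return NULL;
    b->next = NULL;
    b->cur = (char *)b + sizeof *b;
    b->high_water = b->cur;
    b->end = (char *)b + BLOCK_SIZE;
    b->total_size = BLOCK_SIZE;
    b->trimmed = 0;
    return b;
}

/* Bump-allocate n bytes, rounded up to 8; NULL if the block is too full. */
void *block_alloc(struct arena_block *b, size_t n) {
    assert(b != NULL);
    if (n > BLOCK_SIZE)
        return NULL;
    n = (n + 7) & ~(size_t)7;
    if (n > (size_t)(b->end - b->cur))
        return NULL;
    void *p = b->cur;
    b->cur += n;
    if (b->cur > b->high_water)
        b->high_water = b->cur;
    return p;
}
```

A real region would chain further blocks through `next` when `block_alloc` returns NULL.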
C++20 Modules (Stubs)
The modules.c file (address range 0x7C0C60--0x7C2560) contains approximately 20 functions implementing the C++20 module import/export interface. CUDA does not support C++20 modules, so most functions are stubs that return 0:
- has_pending_template_definition_from_module -- returns 0
- has_pending_template_specializations_from_module -- returns 0
- Seven additional stub functions at 0x7C2350--0x7C2410 -- all return 0
The non-stub functions handle the binary module interface file format (magic header {0x9A, 0x13, 0x37, 0x7D}) and basic module name matching, likely preserved from the EDG baseline for future CUDA module support.
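
A magic-header check of the kind implied above is straightforward to sketch. The function name `has_module_magic` is hypothetical, and the sketch assumes the four magic bytes sit at offset 0 of the file, which the analysis does not explicitly confirm.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Magic bytes reported for the binary module interface format. */
static const unsigned char MODULE_MAGIC[4] = { 0x9A, 0x13, 0x37, 0x7D };

/* Return 1 if the buffer begins with the module magic header.
 * Assumes the magic is at offset 0 (not confirmed by the analysis). */
int has_module_magic(const unsigned char *data, size_t len) {
    assert(data != NULL || len == 0);
    return len >= sizeof MODULE_MAGIC
        && memcmp(data, MODULE_MAGIC, sizeof MODULE_MAGIC) == 0;
}
```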
Cross-TU Correspondence (RDC Mode)
When compiling with Relocatable Device Code (--rdc), multiple translation units are processed sequentially. The trans_corresp.c file (address range 0x7A00D0--0x7A38A0) implements structural equivalence checking between types from different TUs:
- verify_class_type_correspondence (sub_7A00D0, 703 lines) -- Deep comparison of class types: base classes, friend declarations, member functions, nested types, template parameters
- verify_enum_type_correspondence (sub_7A0E10) -- Enum underlying type and enumerator list comparison
- verify_function_type_correspondence (sub_7A1230) -- Parameter list and return type comparison
- set_type_correspondence (sub_7A1460) -- Links two corresponding types across TUs
The trans_unit.c file manages TU lifecycle with a stack-based model:
| Global | Purpose |
|---|---|
| qword_106BA10 | Current translation unit pointer |
| qword_106B9F0 | Primary (first) translation unit |
| qword_106BA18 | TU stack top |
| dword_106B9E8 | TU stack depth (excluding primary) |
process_translation_unit (sub_7A40A0) allocates a 424-byte TU descriptor and drives the parse-to-completion sequence. switch_translation_unit (sub_7A3D60) saves/restores per-TU state (registered variables, scope stack, file scope) when switching between TUs during RDC compilation.
Cross-References
- Pipeline Overview -- How EDG stages map to the 7-stage pipeline
- IL Overview -- The 85 entry kinds that EDG produces
- Extended Lambda Overview -- The nv_transforms.c lambda pipeline in detail
- Type System -- Deep dive on 22 type kinds and class layout
- Template Engine -- Template instantiation worklist
- Name Mangling -- Itanium ABI encoding with CUDA extensions
- Lexer -- Tokenizer and keyword registration
- Overload Resolution -- Candidate evaluation and ICS ranking
- Diagnostics Overview -- The 3,795 error message system
Lexer & Tokenizer
The lexer in cudafe++ is EDG 6.6's lexical.c implementation -- a hand-coded, state-machine-driven tokenizer that converts raw source bytes into a stream of 357 distinct token kinds. It spans approximately 185 functions across the address range 0x668330--0x689130 and constitutes one of the densest subsystems in the binary. The design is a classic multi-layered scanner: a byte-level character scanner (sub_679800, 907 lines) feeds into a token acquisition engine (sub_6810F0, 3,811 lines), which in turn is wrapped by a cache-aware token delivery function (sub_676860, 1,995 lines). CUDA keyword recognition is injected at the get_token_main level, gated on dword_106C2C0 (GPU compilation mode flag).
The lexer does not use generated tables from tools like flex. Instead, every character-class test, keyword match, and operator scan is written as explicit C switch/if chains, compiled into dense jump tables by the optimizer. This produces extremely large functions -- get_token_main alone has approximately 300 local variables in its decompiled form -- but eliminates the overhead of table-driven DFA transitions for a language as context-sensitive as C++.
Key Facts
| Property | Value |
|---|---|
| Source file | lexical.c (~185 functions) |
| Address range | 0x668330--0x689130 |
| Token kinds | 357 (indexed from off_E6D240 name table) |
| Primary scanner | sub_679800 (scan_token, 907 lines) |
| Token acquisition | sub_6810F0 (get_token_main, 3,811 lines, ~300 locals) |
| Cache + delivery | sub_676860 (get_next_token, 1,995 lines) |
| Numeric literal scanner | sub_672390 (scan_numeric_literal, 1,571 lines) |
| Keyword registration | sub_5863A0 (keyword_init, in fe_init.c, 200+ keywords) |
| Universal char scanner | sub_6711E0 (scan_universal_character, 278 lines) |
| Template arg scanner | sub_67DC90 (scan_template_argument_list, 1,078 lines) |
| Token cache entry size | 80--112 bytes (8 cache entry kinds) |
| Scope entry size | 784 bytes (at qword_126C5E8) |
| GPU mode gate | dword_106C2C0 |
| Current token global | word_126DD58 |
Architecture
The lexer is organized as four concentric layers, each calling into the one below it:
Parser (expr.c, decls.c, statements.c)
│
▼
get_next_token (sub_676860) ← Cache management, macro rescan
│
▼
get_token_main (sub_6810F0) ← Keyword classification, CUDA gates
│
▼
scan_token (sub_679800) ← Character-level scanning
│
▼
Input buffer (qword_126DDA0) ← Raw bytes from source file
The parser never calls the character-level scanner directly. All token consumption flows through get_next_token, which checks the token cache and rescan lists before falling through to get_token_main. This layering allows the lexer to support lookahead, backtracking, macro expansion replay, and template argument rescanning without modifying the core scanner.
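
The cache-before-scanner layering can be illustrated with a toy lexer. In the sketch below tokens are single characters, the cache is a small array used as a stack rather than the linked list the binary uses, and the middle keyword-classification layer is omitted. The struct, the `unget_token` helper, and the reuse of the names `get_next_token` and `scan_token` for these toy functions are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: tokens are single characters; 0 marks end of input. */
struct lexer {
    const char *input;   /* stand-in for the raw input buffer */
    int cache[8];        /* pushed-back tokens, used as a stack */
    size_t cache_len;
};

/* Lowest layer: produce the next token directly from the input. */
static int scan_token(struct lexer *lx) {
    while (*lx->input == ' ')
        lx->input++;
    return *lx->input ? (unsigned char)*lx->input++ : 0;
}

/* Top layer: serve cached tokens first, then fall through to the scanner. */
int get_next_token(struct lexer *lx) {
    if (lx->cache_len > 0)
        return lx->cache[--lx->cache_len];
    return scan_token(lx);
}

/* Return a token to the cache so the next request yields it again. */
void unget_token(struct lexer *lx, int tok) {
    assert(lx->cache_len < sizeof lx->cache / sizeof lx->cache[0]);
    lx->cache[lx->cache_len++] = tok;
}
```

Because the parser only ever sees `get_next_token`, lookahead and backtracking reduce to pushing tokens back into the cache; the scanner itself stays unaware of them.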
Token System
The 357 Token Kinds
Every token produced by the lexer carries a 16-bit token code stored in word_126DD58. The complete set of 357 token kinds is indexed through the name table at off_E6D240, which maps each token code to its string representation. The stop-token table at qword_126DB48 + 8 contains 357 boolean entries used by the error recovery scanner to identify synchronization points.
Token codes are assigned in blocks:
| Range | Category | Examples |
|---|---|---|
| 1--51 | Operators and punctuation | +, -, *, /, (, ), {, }, ::, -> |
| 52--76 | Alternative tokens / digraphs | and, or, not, <%, %>, <:, :> |
| 77--108 | C89 keywords | auto(77), break(78), case(79), char(80), while(108) |
| 109--131 | C99/C11 keywords | restrict(119), _Bool(120), _Complex(121), _Imaginary(122) |
| 132--136 | MSVC keywords | __declspec(132), __int8(133), __int16(134), __int32(135), __int64(136) |
| 137--199 | C++ keywords | catch(150), class(151), template(160), decltype(185), typeof(189) |
| 200--206 | Compiler internal | Internal token kinds for the preprocessor |
| 207--330 | Type traits | __is_class(207), __has_trivial_copy, ..., NVIDIA-specific traits at 328--330 |
| 331--356 | Extended types / recent additions | _Float32(331)--_Float128(335), C++23/26 features |
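
The block assignment above translates directly into a range lookup. The sketch below maps a token code to its category using the upper bound of each range from the table; the enum and function names are hypothetical, and code 0 is treated as uncategorized because the table does not assign it to a block.

```c
#include <assert.h>
#include <stddef.h>

enum token_category {
    TC_UNKNOWN, TC_OPERATOR, TC_ALT_TOKEN, TC_C89_KEYWORD,
    TC_C99_C11_KEYWORD, TC_MSVC_KEYWORD, TC_CXX_KEYWORD,
    TC_INTERNAL, TC_TYPE_TRAIT, TC_EXTENDED
};

/* Classify a token code by the ranges in the table above. */
enum token_category classify_token_code(unsigned code) {
    /* Each entry holds the last code of a block; blocks are contiguous. */
    static const struct { unsigned last; enum token_category cat; } ranges[] = {
        {  51, TC_OPERATOR },
        {  76, TC_ALT_TOKEN },
        { 108, TC_C89_KEYWORD },
        { 131, TC_C99_C11_KEYWORD },
        { 136, TC_MSVC_KEYWORD },
        { 199, TC_CXX_KEYWORD },
        { 206, TC_INTERNAL },
        { 330, TC_TYPE_TRAIT },
        { 356, TC_EXTENDED },
    };
    if (code == 0)
        return TC_UNKNOWN;   /* not covered by the table */
    for (size_t i = 0; i < sizeof ranges / sizeof ranges[0]; i++)
        if (code <= ranges[i].last)
            return ranges[i].cat;
    return TC_UNKNOWN;       /* beyond the 357 known kinds */
}
```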
CUDA-Specific Token Kinds
Three NVIDIA type-trait keywords occupy dedicated token codes registered during keyword_init:
| Token Code | Keyword | Purpose |
|---|---|---|
| 328 | __nv_is_extended_device_lambda_closure_type | Tests if type is a device lambda |
| 329 | __nv_is_extended_host_device_lambda_closure_type | Tests if type is a host-device lambda |
| 330 | __nv_is_extended_device_lambda_with_preserved_return_type | Tests if device lambda preserves return type |
These are registered as standard type-trait keywords and participate in the same token classification path as the 60+ standard __is_xxx/__has_xxx traits.
Token State Globals
When a token is produced, the following globals are populated:
| Address | Name | Type | Description |
|---|---|---|---|
| word_126DD58 | current_token_code | WORD | 16-bit token kind (0--356) |
| qword_126DD38 | current_source_position | QWORD | Encoded file/line/column |
| qword_126DD48 | token_text_ptr | QWORD | Pointer to identifier/literal text |
| src | token_start_position | char* | Start of token in input buffer |
| n | token_text_length | size_t | Length of token text |
| dword_126DF90 | token_flags_1 | DWORD | Classification flags |
| dword_126DF8C | token_flags_2 | DWORD | Additional flags |
| qword_126DF80 | token_extra_data | QWORD | Context-dependent payload |
| xmmword_106C380--106C3B0 | identifier_lookup_result | 4 x 128-bit | SSE-packed lookup result for identifiers (64 bytes) |
The 64-byte identifier lookup result is stored in four adjacent 128-bit globals (xmmword_106C380 through xmmword_106C3B0) by the identifier classification path. When a scanned identifier is also a keyword, the lookup result contains the keyword's token code, scope information, and classification flags. The compiled code copies this packed state in bulk, 16 bytes at a time, using SSE movaps/movups instructions.
Token Cache
The token cache provides the lookahead, backtracking, and macro-expansion replay capabilities required by C++ parsing. Tokens are stored in a linked list of cache entries that can be consumed, rewound, and re-scanned.
Cache Entry Layout (80--112 bytes)
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Next entry in cache linked list |
| +8 | 8 | source_position | Encoded source location |
| +16 | 2 | token_code | Token kind (0--356) |
| +18 | 1 | cache_entry_kind | Discriminator for payload type (see below) |
| +20 | 4 | flags | Token flags |
| +24 | 4 | extra_flags | Additional flags |
| +32 | 8 | extra_data | Context-dependent data |
| +40.. | varies | payload | Kind-specific payload data |
Cache Entry Kinds
| Kind | Value | Payload | Description |
|---|---|---|---|
| identifier | 1 | Name pointer + lookup result | Identifier token with pre-resolved scope lookup |
| macro_def | 2 | Macro definition pointer | Macro definition for re-expansion (calls sub_5BA500) |
| pragma | 3 | Pragma data | Preprocessor pragma for deferred processing |
| pp_number | 4 | Number text | Preprocessing number (not yet classified as int/float) |
| (reserved) | 5 | -- | Not observed in use |
| string | 6 | String data + encoding | String literal token |
| (reserved) | 7 | -- | Not observed in use |
| concatenated_string | 8 | Concatenated string data | Wide or multi-piece concatenated string literal |
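The header layout and kind discriminator can be written down as a C struct for checking offsets. This is a sketch under assumptions: the type names and the two padding members are hypothetical, the bytes at +19 and +28 are not described in the tables, and the layout assumes the x86-64 (LP64) ABI of the binary.

```c
#include <stddef.h>
#include <stdint.h>

/* Kind values from the "Cache Entry Kinds" table; 5 and 7 were not observed. */
enum cache_entry_kind {
    CEK_IDENTIFIER          = 1,
    CEK_MACRO_DEF           = 2,
    CEK_PRAGMA              = 3,
    CEK_PP_NUMBER           = 4,
    CEK_STRING              = 6,
    CEK_CONCATENATED_STRING = 8
};

/* Field names follow the layout table; pad19 and pad28 are assumptions. */
struct token_cache_entry {
    struct token_cache_entry *next;   /* +0  next entry in the list       */
    uint64_t source_position;         /* +8  encoded source location      */
    uint16_t token_code;              /* +16 token kind                   */
    uint8_t  cache_entry_kind;        /* +18 payload discriminator        */
    uint8_t  pad19;                   /* +19 undocumented                 */
    uint32_t flags;                   /* +20 token flags                  */
    uint32_t extra_flags;             /* +24 additional flags             */
    uint32_t pad28;                   /* +28 undocumented                 */
    uint64_t extra_data;              /* +32 context-dependent data       */
    unsigned char payload[];          /* +40 kind-specific payload        */
};

/* Compile-time checks that the struct matches the documented offsets. */
_Static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");
_Static_assert(offsetof(struct token_cache_entry, token_code) == 16, "token_code");
_Static_assert(offsetof(struct token_cache_entry, cache_entry_kind) == 18, "kind");
_Static_assert(offsetof(struct token_cache_entry, flags) == 20, "flags");
_Static_assert(offsetof(struct token_cache_entry, extra_flags) == 24, "extra_flags");
_Static_assert(offsetof(struct token_cache_entry, extra_data) == 32, "extra_data");
```

With a 40-byte fixed header, the documented 80--112 byte entry sizes would leave 40--72 bytes for the kind-specific payload.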
Cache Management Globals
| Address | Name | Description |
|---|---|---|
| qword_1270150 | cached_token_rescan_list | Head of list of tokens to re-scan (pushed back for lookahead) |
| qword_1270128 | reusable_cache_stack | Stack of reusable cache entry blocks |
| qword_1270148 | free_token_list | Free list for recycling cache entries |
| qword_1270140 | macro_definition_chain | Active macro definition chain |
| dword_126DB74 | has_cached_tokens | Boolean flag: nonzero when cache is non-empty |
Cache Operations
| Address | Identity | Description |
|---|---|---|
| sub_669650 | copy_tokens_from_cache | Copies cached preprocessor tokens for macro re-expansion (assert at lexical.c:3417) |
| sub_669D00 | allocate_token_cache_entry | Allocates from free list at qword_1270118 |
| sub_669EB0 | create_cached_token_node | Creates and initializes token cache node |
| sub_66A000 | append_to_token_cache | Appends token to cache list, maintains tail pointer |
| sub_66A140 | push_token_to_rescan_list | Pushes token onto rescan stack at qword_1270150 |
| sub_66A2C0 | free_single_cache_entry | Returns cache entry to free list |
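The interplay between the free list and the rescan list can be illustrated with a reduced model. This is a sketch only: the node type is cut down to two fields, all names are hypothetical, and the real functions handle payloads, tail pointers, and assertions that are omitted here.

```c
#include <stdlib.h>

/* Simplified node; the real cache entries are 80--112 bytes. */
typedef struct tok_node {
    struct tok_node *next;
    int token_code;
} tok_node;

static tok_node *free_list;     /* stands in for the free list of cache entries */
static tok_node *rescan_list;   /* stands in for cached_token_rescan_list       */

/* Take an entry from the free list if possible, otherwise allocate a new one. */
tok_node *alloc_entry(int token_code) {
    tok_node *n = free_list;
    if (n)
        free_list = n->next;
    else
        n = malloc(sizeof *n);
    if (!n)
        return NULL;
    n->next = NULL;
    n->token_code = token_code;
    return n;
}

/* Recycle an entry by pushing it onto the free list. */
void free_entry(tok_node *n) {
    n->next = free_list;
    free_list = n;
}

/* Push a token back so that it is delivered again before any new input. */
void push_rescan(tok_node *n) {
    n->next = rescan_list;
    rescan_list = n;
}

/* Pop the most recently pushed token code, or return -1 if the list is empty. */
int pop_rescan(void) {
    tok_node *n = rescan_list;
    if (!n)
        return -1;
    rescan_list = n->next;
    int code = n->token_code;
    free_entry(n);
    return code;
}
```

Because tokens are pushed and popped at the head, pushing tokens back in reverse order replays them in their original order, which is the usual way to implement multi-token lookahead.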
Layer 1: scan_token (sub_679800)
scan_token is the character-level scanner. It reads raw bytes from the input buffer at qword_126DDA0, classifies them, and produces a single token. The function is 907 lines and dispatches on the first byte of each token.
Character Dispatch
The scanner reads the byte at the current input position and enters one of the following paths:
| First Byte | Action |
|---|---|
| 0x00 (NUL) | Control byte processing (8 embedded control types, see below) |
| 0x09 (TAB), 0x0B (VT), 0x0C (FF), 0x20 (space) | Whitespace -- advance and retry |
| a--z, A--Z, _ | Identifier or keyword scanning |
| 0--9 | Numeric literal scanning (decimal, hex, octal, binary) |
| ' | Character literal scanning |
| " | String literal scanning |
| / | Comment (// or /* */) or division operator |
| . | Dot operator, or float literal if followed by digit |
| < | Less-than, <=, <<, <<=, <=>, or template bracket |
| > | Greater-than, >=, >>, >>=, or template bracket |
| +, -, *, %, ^, ~, !, =, &, \| | Operator scanning (single or compound) |
| (, ), [, ], {, }, ;, ,, ?, @ | Single-character tokens |
| # | Preprocessor directive or stringification operator |
| \ | Universal character name (\uXXXX, \UXXXXXXXX) or line continuation |
Embedded Control Bytes (NUL Dispatch)
The input buffer uses embedded NUL bytes (0x00) as in-band control markers. When the scanner encounters a NUL, it reads the next byte as a control type code:
| Control Type | Value | Action |
|---|---|---|
| Newline marker | 1 | End of line -- calls sub_6702F0 (refill_buffer) to read next source line |
| (reserved) | 2 | -- |
| Macro position | 3 | Macro expansion position marker -- calls sub_66A770 to update position tracking |
| End of directive | 4 | Marks end of a preprocessor directive |
| EOF (primary) | 5 | End of current source file -- pops file stack |
| Stale position | 6 | Invalid position marker -- emits diagnostic 1192 or 861 |
| Continuation | 7 | Backslash-newline continuation was here |
| EOF (secondary) | 8 | Secondary EOF marker for nested includes |
This in-band signaling approach avoids the cost of checking buffer boundaries on every character read. The refill_buffer function (sub_6702F0, 792 lines) places these marker bytes at the end of each source line, so the scanner can detect line endings and EOF without comparing the input pointer against a limit.
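The sentinel technique can be shown with a toy scanner loop. This is a minimal sketch rather than the recovered code: it assumes every control sequence is exactly two bytes (NUL plus type code), the function and constant names are hypothetical, and only the newline and EOF types are acted on.

```c
#include <stddef.h>

/* Control type codes from the table above. */
enum {
    CTL_NEWLINE       = 1,
    CTL_MACRO_POS     = 3,
    CTL_END_DIRECTIVE = 4,
    CTL_EOF           = 5,
    CTL_STALE_POS     = 6,
    CTL_CONTINUATION  = 7,
    CTL_EOF_SECONDARY = 8
};

/* Counts ordinary bytes and newline markers until an EOF marker is reached.
   The loop never compares pos against a buffer end: the markers terminate it. */
void scan_buffer(const unsigned char *pos, size_t *chars, size_t *lines) {
    *chars = 0;
    *lines = 0;
    for (;;) {
        if (*pos != 0) {               /* fast path: ordinary source byte */
            ++*chars;
            ++pos;
            continue;
        }
        unsigned char ctl = pos[1];    /* byte after the NUL selects the action */
        pos += 2;
        if (ctl == CTL_NEWLINE)
            ++*lines;
        else if (ctl == CTL_EOF || ctl == CTL_EOF_SECONDARY)
            return;
        /* other control types are skipped in this sketch */
    }
}
```

The common case costs a single compare against zero per byte; the price is that the buffer must always end with an EOF marker, otherwise the loop reads past the end.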
Input Buffer System
| Address | Name | Description |
|---|---|---|
| qword_126DDA0 | current_input_position | Read pointer into the input buffer |
| qword_126DDD8 | input_buffer_base | Start of the allocated input buffer |
| qword_126DDD0 | input_buffer_end | End of the allocated input buffer |
| qword_126DDF0 | file_stack | Stack of open source files (for #include) |
| qword_127FBA8 | current_file_handle | FILE* for the current source file |
| dword_127FBA0 | eof_flag | Set when current file reaches EOF |
| dword_127FB9C | multibyte_encoding_mode | Values >1 enable multibyte character decoding via sub_5B09B0 |
| dword_126DDA8 | source_line_counter | Lines read from current source file |
| dword_126DDBC | output_line_counter | Lines emitted to preprocessed output |
Buffer Refill: read_next_source_line (sub_66F4E0)
sub_66F4E0 (735 lines) reads the next line from the source file into the input buffer. It calls getc() for single-byte mode or sub_5B09B0 for multibyte mode (controlled by dword_127FB9C > 1). The function:
- Reads characters one at a time until newline or EOF
- Handles backslash-newline line splicing (joining continuation lines)
- Places control byte markers at newline positions (type 1) and EOF (type 5/8)
- Updates the line counter at dword_126DDA8
- Manages trigraph warnings (diagnostic 1750) through the companion function sub_6702F0
Layer 2: get_token_main (sub_6810F0)
get_token_main is the largest function in the lexer at 3,811 decompiled lines with approximately 300 local variables. It wraps scan_token and performs the complete token classification pipeline: keyword recognition, CUDA keyword gating, template parameter detection, operator overload name lookup, access specifier tracking, and namespace scope management.
Token Classification Pipeline
After scan_token produces a raw token, get_token_main performs these classification steps:
scan_token produces raw token
│
├── Identifier?
│ ├── Look up in keyword table
│ │ ├── Standard C/C++ keyword → set token_code to keyword kind
│ │ ├── CUDA keyword (dword_106C2C0 != 0) → set token_code
│ │ ├── Type trait keyword → set token_code (207--330)
│ │ └── Not a keyword → classify as identifier token
│ │
│ ├── Check template parameter context
│ │ └── If inside template<>, classify as type-name or non-type
│ │
│ └── Entity lookup for context-sensitive classification
│ ├── typedef name → classify as TYPE_NAME token
│ ├── class/struct name → classify as CLASS_NAME
│ ├── enum name → classify as ENUM_NAME
│ ├── namespace name → classify as NAMESPACE_NAME
│ └── template name → classify as TEMPLATE_NAME
│
├── Numeric literal?
│ └── Route to scan_numeric_literal (sub_672390)
│
├── String/character literal?
│ └── Handle encoding prefix (L, u8, u, U, R)
│
└── Operator/punctuation?
├── Check for template angle bracket context
├── Handle digraphs/alternative tokens
└── Produce operator token code
CUDA Keyword Detection
CUDA keyword handling is gated on dword_106C2C0 (GPU mode). When this flag is nonzero, get_token_main recognizes CUDA-specific identifiers and routes them to the CUDA attribute processing path:
// Pseudocode from get_token_main
if (token_is_identifier) {
    // ... standard keyword lookup ...
    if (dword_106C2C0 != 0) {          // GPU mode active
        // Check for __device__, __host__, __global__,
        // __shared__, __constant__, __managed__,
        // __launch_bounds__, __grid_constant__
        // Route to CUDA attribute handlers
        if (dword_106BA08) {           // CUDA attribute processing enabled
            sub_74DC30(...);           // CUDA attribute resolution
            sub_74E240(...);           // CUDA attribute application
        }
    }
}
The GPU mode flag dword_106C2C0 is also checked during:
- Attribute token processing in sub_686350 (handle_attribute_token, 584 lines)
- Deferred diagnostic emission in sub_668660 (severity override via byte_126ED55)
- Entity visibility computation in sub_669130
C++ Standard Version Gating
Throughout get_token_main, keyword classification is gated on the C++ standard version stored in dword_126EF68:
| Version Value | Standard | Keywords Enabled |
|---|---|---|
| 201102 | C++11 | constexpr, decltype, nullptr, char16_t, char32_t, static_assert |
| 201402 | C++14 | binary literals, digit separators |
| 201703 | C++17 | if constexpr, structured bindings |
| 202002 | C++20 | char8_t, concept, requires, co_yield, co_return, co_await, consteval, constinit |
| 202302 | C++23 | typeof, typeof_unqual, extended digit separators |
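Version gating of this kind can be modeled as a table of per-keyword thresholds. The sketch below is illustrative only: the struct and function names are hypothetical, the table holds just a few entries, and treating each listed value as an inclusive lower bound (`>=`) is an assumption, since the comparisons in the binary may be written differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical gating record: keyword plus the lowest version that enables it. */
typedef struct {
    const char *name;
    long min_version;
} gated_keyword;

static const gated_keyword k_gated[] = {
    { "constexpr", 201102 },
    { "decltype",  201102 },
    { "nullptr",   201102 },
    { "concept",   202002 },
    { "requires",  202002 },
    { "consteval", 202002 },
};

/* Returns 1 if name is a gated keyword enabled at std_version, else 0.
   std_version plays the role of the value stored in dword_126EF68. */
int keyword_enabled(const char *name, long std_version) {
    for (size_t i = 0; i < sizeof k_gated / sizeof k_gated[0]; i++)
        if (strcmp(name, k_gated[i].name) == 0)
            return std_version >= k_gated[i].min_version;
    return 0;   /* not a version-gated keyword in this sketch */
}
```

For example, `keyword_enabled("concept", 201703)` yields 0, so in that mode `concept` would be delivered to the parser as an ordinary identifier.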
The language mode at dword_126EFB4 selects between C and C++. Vendor dialects are controlled by separate flags (GNU extensions by dword_126EFA8, MSVC extensions by dword_126EFB0):
| Value | Mode | Effect |
|---|---|---|
| 1 | C (K&R / GNU default) | C keyword set only; C++ core keywords are not registered |
| 2 | C++ | C++ core keywords and alternative tokens are registered |
Context-Sensitive Token Classification
C++ requires the lexer to classify identifiers based on declaration context. The functions supporting this classification:
| Address | Identity | Description |
|---|---|---|
| sub_668C90 | classify_identifier_entity | Dispatches on entity kind: typedef(3), class/struct(4,5), enum(6), function(7), namespace(9,10), template(19-22) |
| sub_668E00 | resolve_entity_through_alias | Walks typedef/using chains (kind=3 with +104 flag, kind=16 → **[+88]) |
| sub_668F80 | get_resolved_entity_type | Resolves entity to underlying type through alias chains |
| sub_668900 | handle_token_identifier_type_check | Determines if token is identifier vs typename vs template |
| sub_666720 | select_dual_lookup_symbol | Selects between two candidate symbols in dual-scope lookup (372 lines) |
Entity classification reads the entity_kind byte at offset +80 of entity nodes:
switch (entity->kind) {                     // offset +80
    case 3:                                 // typedef
        return TYPE_NAME;
    case 4: case 5:                         // class / struct
        return CLASS_NAME;
    case 6:                                 // enum
        return ENUM_NAME;
    case 7:                                 // function
        return IDENTIFIER;
    case 9: case 10:                        // namespace / namespace alias
        return NAMESPACE_NAME;
    case 19: case 20: case 21: case 22:     // template kinds
        return TEMPLATE_NAME;
    case 16:                                // using declaration
        return resolve_through_using(entity);
    case 24:                                // namespace alias (resolved)
        return NAMESPACE_NAME;
}
Layer 3: get_next_token (sub_676860)
get_next_token (1,995 lines) is the token delivery function called by the parser. It manages the token cache, handles macro expansion replay, and calls get_token_main only when no cached tokens are available.
Token Delivery Flow
get_next_token (sub_676860)
│
├── Check cached_token_rescan_list (qword_1270150)
│ └── If non-empty: pop token, dispatch on cache_entry_kind
│ ├── kind 1 (identifier): load xmmword_106C380..106C3B0
│ ├── kind 2 (macro_def): call sub_5BA500 (macro expansion)
│ ├── kind 3 (pragma): process deferred pragma
│ ├── kind 4 (pp_number): return as-is
│ ├── kind 6 (string): return string token
│ └── kind 8 (concatenated_string): return concatenated string
│
├── Check reusable_cache_stack (qword_1270128)
│ └── If non-empty: pop and return cached token
│ (assert: "get_token_from_reusable_cache_stack" at 4450, 4469)
│
├── Check pending_macro_arg (qword_106B8A0)
│ └── If set: process macro argument token
│
└── Fall through to get_token_main (sub_6810F0)
└── Full token acquisition from source
The function sets the following globals on every token delivery:
- word_126DD58 = token code
- qword_126DD38 = source position
- dword_126DF90 = token flags 1
- dword_126DF8C = token flags 2
- qword_126DF80 = extra data
CUDA Attribute Token Interception
When CUDA attribute processing is enabled (dword_106BA08 != 0), get_next_token intercepts identifier tokens and routes them through CUDA attribute resolution via sub_74DC30 and sub_74E240. This allows CUDA execution-space attributes (__device__, __host__, __global__) to be recognized at the token level rather than requiring full declaration parsing.
Numeric Literal Scanner: scan_numeric_literal (sub_672390)
The numeric literal scanner is 1,571 lines and handles every numeric literal format defined by C89 through C++23.
Literal Prefix Dispatch
scan_numeric_literal
│
├── First char '0':
│ ├── 0x/0X → hex literal (isxdigit validation)
│ ├── 0b/0B → binary literal (C++14)
│ ├── 0[0-7] → octal literal
│ └── 0 alone → decimal zero
│
├── First char '1'-'9':
│ └── decimal literal
│
└── After integer part:
├── '.' → floating-point literal
├── 'e'/'E' → decimal float exponent
├── 'p'/'P' → hex float exponent
└── suffix → type suffix parsing
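The first branch of this dispatch, selecting the radix from the leading characters, is small enough to sketch directly. The enum and function names below are hypothetical; the real scanner additionally validates digits and continues into fraction, exponent, and suffix handling.

```c
typedef enum { LIT_DECIMAL, LIT_HEX, LIT_BINARY, LIT_OCTAL } literal_base;

/* Classifies the radix of a numeric literal from its first two characters.
   s must point at the first digit of the literal. */
literal_base classify_literal_base(const char *s) {
    if (s[0] != '0')
        return LIT_DECIMAL;                 /* starts with '1'-'9' */
    if (s[1] == 'x' || s[1] == 'X')
        return LIT_HEX;
    if (s[1] == 'b' || s[1] == 'B')
        return LIT_BINARY;
    if (s[1] >= '0' && s[1] <= '7')
        return LIT_OCTAL;
    return LIT_DECIMAL;                     /* lone 0, or 0 followed by '.', 'e', suffix */
}
```

Note that `0.5` and `0e1` fall into the last case: a leading zero only implies octal when another octal digit follows.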
C++14 Digit Separators
Digit separators (' characters within numeric literals) are handled through a two-flag system:
| Address | Name | Purpose |
|---|---|---|
| dword_126EEFC | cpp14_digit_separators_enabled | Master enable for digit separator support |
| dword_126DB58 | digit_separator_seen | Set when a separator is encountered in the current literal |
When dword_126EEFC is enabled, the scanner accepts ' between digits:
// Digit separator handling in scan_numeric_literal
while (isdigit(*pos) || (*pos == '\'' && dword_126EEFC)) {
    if (*pos == '\'') {
        dword_126DB58 = 1;           // mark separator seen
        pos++;
        if (!isdigit(*pos))
            emit_diagnostic(2629);   // separator not followed by digit
        continue;
    }
    // process digit...
}
C++23 extended digit separators (for binary, octal, hex) are gated on dword_126EF68 > 202302:
if (dword_126EF68 > 202302) {
    // C++23: allow digit separators in binary/octal/hex
} else {
    emit_diagnostic(2628);   // C++23 feature used in earlier mode
}
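A self-contained version of the decimal separator loop is sketched below. It is an illustration under assumptions: the function name and parameters are hypothetical, the two globals are replaced by arguments, and where the real scanner emits a diagnostic and keeps going, this sketch returns the diagnostic number.

```c
#include <ctype.h>

/* Parses decimal digits with optional ' separators.
   separators_enabled plays the role of dword_126EEFC;
   *separator_seen mirrors dword_126DB58.
   Returns 0 on success or 2629 if a separator is not followed by a digit. */
int parse_separated_decimal(const char *s, int separators_enabled,
                            unsigned long *value, int *separator_seen) {
    *value = 0;
    *separator_seen = 0;
    while (isdigit((unsigned char)*s) || (*s == '\'' && separators_enabled)) {
        if (*s == '\'') {
            *separator_seen = 1;
            s++;
            if (!isdigit((unsigned char)*s))
                return 2629;              /* separator not followed by digit */
            continue;
        }
        *value = *value * 10 + (unsigned long)(*s - '0');
        s++;
    }
    return 0;
}
```

With separators disabled, the loop simply stops at the first `'`, leaving it to be scanned as the start of a character literal.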
Integer Suffix Parsing
sub_6748A0 (convert_integer_suffix, 137 lines) parses the following suffixes:
| Suffix | Type |
|---|---|
| (none) | int (or promoted per value) |
| u / U | unsigned int |
| l / L | long |
| ll / LL | long long |
| ul / UL | unsigned long |
| ull / ULL | unsigned long long |
| z / Z | signed counterpart of size_t (C++23) |
| uz / UZ | size_t (C++23) |
sub_674BB0 (determine_numeric_literal_type, 400 lines) applies the C++ promotion rules based on the literal value and suffix to determine the final type.
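Suffix recognition can be sketched as a small state machine over the suffix characters. This is not the recovered convert_integer_suffix: the enum and function names are hypothetical, and the sketch accepts the unsigned marker on either side of the length marker, as the C++ grammar does.

```c
typedef enum {
    SUF_INVALID, SUF_NONE, SUF_U, SUF_L, SUF_LL,
    SUF_UL, SUF_ULL, SUF_Z, SUF_UZ
} int_suffix;

/* Classifies the suffix text that follows the digits of an integer literal. */
int_suffix parse_int_suffix(const char *s) {
    int u = 0, l = 0, z = 0;
    while (*s) {
        if (*s == 'u' || *s == 'U') {
            if (u) return SUF_INVALID;          /* at most one unsigned marker */
            u = 1;
            s++;
        } else if (*s == 'l' || *s == 'L') {
            if (l || z) return SUF_INVALID;     /* one length marker only */
            l = 1;
            if (s[1] == s[0]) {                 /* "ll" or "LL", never "lL" */
                l = 2;
                s++;
            }
            s++;
        } else if (*s == 'z' || *s == 'Z') {
            if (l || z) return SUF_INVALID;
            z = 1;
            s++;
        } else {
            return SUF_INVALID;
        }
    }
    if (z) return u ? SUF_UZ : SUF_Z;
    if (l == 2) return u ? SUF_ULL : SUF_LL;
    if (l == 1) return u ? SUF_UL : SUF_L;
    return u ? SUF_U : SUF_NONE;
}
```

The result only names the suffix; choosing the final type still depends on the literal's value and radix, which is the job attributed above to determine_numeric_literal_type.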
Floating-Point Literal Handling
| Address | Identity | Description |
|---|---|---|
| sub_675390 | scan_float_exponent | Scans e/E/p/P exponent suffix (57 lines) |
| sub_6754B0 | convert_float_literal | Converts float literal string to value (338 lines) |
Float suffixes: f/F (float), l/L (long double), none (double).
Universal Character Names: scan_universal_character (sub_6711E0)
sub_6711E0 (278 lines, assert at lexical.c:12384) scans \uXXXX and \UXXXXXXXX universal character names in identifiers and string/character literals.
#include <ctype.h>
#include <stdint.h>

// Simplified reconstruction: input points at the backslash of \uXXXX or \UXXXXXXXX.
void scan_universal_character(const char *input, uint32_t *result) {
    int width;
    if (input[1] == 'u')
        width = 4;                       // \uXXXX
    else
        width = 8;                       // \UXXXXXXXX
    input += 2;                          // skip the "\u" / "\U" introducer
    uint32_t value = 0;
    for (int i = 0; i < width; i++) {
        unsigned char c = (unsigned char)*input++;
        if (!isxdigit(c)) {
            // emit error diagnostic; *result is left unchanged
            return;
        }
        int digit;
        if (c >= '0' && c <= '9')
            digit = c - 48;              // '0' = 48
        else if (islower(c))
            digit = c - 87;              // 'a' = 97, 97-87 = 10
        else
            digit = c - 55;              // 'A' = 65, 65-55 = 10
        value = (value << 4) | (uint32_t)digit;
    }
    *result = value;
}
sub_671870 (validate_universal_character_value, 62 lines) performs range checking after scanning: surrogate code points (0xD800--0xDFFF) are rejected, and values outside the valid Unicode range (> 0x10FFFF) produce an error.
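The two range checks described here reduce to a few comparisons. The sketch below uses a hypothetical function name and returns a flag instead of emitting diagnostics; any further checks the real function may perform (for example, on characters from the basic source character set) are not modeled.

```c
#include <stdint.h>

/* Returns 1 if cp is acceptable as a universal character value, 0 otherwise. */
int is_valid_ucn_value(uint32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                 /* surrogate range is never a valid character */
    if (cp > 0x10FFFF)
        return 0;                 /* beyond the last Unicode code point */
    return 1;
}
```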
The feature is controlled by dword_106BCC4 (universal characters enabled) and dword_106BD4C (extended character mode).
Keyword Registration: keyword_init (sub_5863A0)
sub_5863A0 (1,113 lines, in fe_init.c) registers all C/C++ keywords with the symbol table during frontend initialization. It calls sub_7463B0 (enter_keyword) once per keyword, passing the token ID and string representation. GNU double-underscore variants are registered via sub_585B10, and alternative tokens via sub_749600.
Keyword Categories and Version Gating
Keywords are registered conditionally based on language mode and standard version:
keyword_init (sub_5863A0)
│
├── C89 core (always registered)
│ auto(77), break(78), case(79), char(80), continue(82),
│ default(83), do(84), double(85), else(86), enum(87),
│ extern(88), float(89), for(90), goto(91), if(92),
│ int(93), long(94), register(95), return(96), short(97),
│ sizeof(99), static(100), struct(101), switch(102),
│ typedef(103), union(104), unsigned(105), void(106), while(108)
│
├── C99 (gated on C99+ mode)
│ _Bool(120), _Complex(121), _Imaginary(122), restrict(119)
│
├── C11 (gated on C11+ mode)
│ _Generic(262), _Atomic(263), _Alignof(247), _Alignas(248),
│ _Thread_local(194), _Static_assert(184), _Noreturn(260)
│
├── C23 (gated on C23 mode)
│ bool, true, false, alignof, alignas, static_assert,
│ thread_local, typeof(189), typeof_unqual(190)
│
├── C++ core (gated on C++ mode: dword_126EFB4 == 2)
│ catch(150), class(151), friend(153), inline(154),
│ mutable(174), operator(156), new(155), delete(152),
│ private(157), protected(158), public(159), template(160),
│ this(161), throw(162), try(163), virtual(164),
│ namespace(175), using(179), typename(183), typeid(178),
│ const_cast(166), dynamic_cast(167), static_cast(177),
│ reinterpret_cast(176)
│
├── C++ alternative tokens (gated on C++ mode)
│ and(52), and_eq(64), bitand(33), bitor(51), compl(37),
│ not(38), not_eq(48), or(53), or_eq(66), xor(50), xor_eq(65)
│
├── C++ modern keywords (gated on standard version)
│ C++11: constexpr(244), decltype(185), nullptr(237),
│ char16_t(126), char32_t(127)
│ C++20: char8_t(128), consteval(245), constinit(246), co_yield(267),
│ co_return(268), co_await(269), concept(295), requires(294)
│ C++23: typeof(189), typeof_unqual(190)
│
├── GNU extensions (gated on dword_126EFA8)
│ __extension__(187), __auto_type(186), __attribute(142),
│ __builtin_offsetof(117), __builtin_types_compatible_p(143),
│ __builtin_shufflevector(258), __builtin_convertvector(259),
│ __builtin_complex(261), __builtin_has_attribute(296),
│ __builtin_addressof(271), __builtin_bit_cast(297),
│ __int128(239), __bases(249), __direct_bases(250),
│ _Float32(331), _Float32x(332), _Float64(333),
│ _Float64x(334), _Float128(335)
│
├── MSVC extensions (gated on dword_126EFB0)
│ __declspec(132), __int8(133), __int16(134),
│ __int32(135), __int64(136)
│
├── Clang extensions (gated on Clang version at qword_126EF90)
│ _Nullable(264), _Nonnull(265), _Null_unspecified(266)
│
├── Type traits (60+, gated by standard version)
│ __is_class(207), __is_enum, __is_union, __has_trivial_copy,
│ __has_virtual_destructor, ... through token code 327
│
├── NVIDIA CUDA type traits (gated on GPU mode)
│ __nv_is_extended_device_lambda_closure_type(328),
│ __nv_is_extended_host_device_lambda_closure_type(329),
│ __nv_is_extended_device_lambda_with_preserved_return_type(330)
│
└── EDG internal keywords (always registered)
__edg_type__(272), __edg_size_type__(277),
__edg_ptrdiff_type__(278), __edg_bool_type__(279),
__edg_wchar_type__(280), __edg_opnd__(282),
__edg_throw__(281), __edg_is_deducible(304),
__edg_vector_type__(273), __edg_neon_vector_type__(274)
Version gating globals used during keyword registration:
| Address | Name | Values |
|---|---|---|
| dword_126EFB4 | language_mode | 1 = K&R C / GNU default, 2 = C++ |
| dword_126EF68 | cpp_standard_version | 199900, 201102, 201402, 201703, 202002, 202302 |
| qword_126EF98 | gnu_version | e.g., 0x9FC3 = 40899, which reads as GCC 4.8.99 under a major*10000 + minor*100 + patch encoding (a threshold just below GCC 4.9) |
| qword_126EF90 | clang_version | e.g., 0x15F8F = 89999, 0x1D4BF = 119999 (thresholds just below Clang 9 and Clang 12 under the same encoding) |
| dword_126EFA8 | gnu_extensions_enabled | Boolean |
| dword_126EFA4 | extensions_enabled | Boolean (Clang compat) |
| dword_126EFAC | c_language_mode | Boolean: C vs C++ |
| dword_126EFB0 | microsoft_extensions_enabled | Boolean |
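The hexadecimal constants quoted for the compiler-version globals are easier to read in decimal: 0x9FC3 is 40899, 0x15F8F is 89999, and 0x1D4BF is 119999. The sketch below decodes them under the assumption that the packing is major*10000 + minor*100 + patch; that encoding is inferred from the values and is not confirmed by the binary, and the type and function names are hypothetical.

```c
/* Assumed packing: major * 10000 + minor * 100 + patch. */
typedef struct {
    int major;
    int minor;
    int patch;
} compiler_version;

compiler_version decode_version(unsigned long encoded) {
    compiler_version v;
    v.major = (int)(encoded / 10000);
    v.minor = (int)(encoded / 100 % 100);
    v.patch = (int)(encoded % 100);
    return v;
}
```

Under this reading, all three constants end in 99 (4.8.99, 8.99.99, 11.99.99), which is the shape one would expect from `version > constant` comparisons that test for "at least" the next release.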
String and Character Literal Scanning
Character Literal Scanning
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_66CB30 | scan_character_literal_prefix | 34 | Detects encoding prefix (L, u, U, u8) |
| sub_66CBD0 | scan_character_literal | 111 | Scans 'x' / L'x' / u'x' / U'x' / u8'x' literals |
String Literal Scanning
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_66C550 | scan_string_literal | 356 | Scans quoted string literals with escape sequences |
| sub_676080 | scan_raw_string_literal | 391 | Scans R"delimiter(content)delimiter" raw strings |
| sub_66E6E0 | scan_identifier_suffix | 94 | Checks for user-defined literal suffixes (C++11) |
| sub_66E920 | is_valid_ud_suffix | 51 | Validates user-defined literal suffix names |
| sub_6892F0 | string_literal_concatenation_check | 107 | Checks adjacent string literal tokens for concatenation |
| sub_689550 | process_user_defined_literal | 332 | Handles C++11 UDL operator lookup |
Encoding Prefixes
The lexer recognizes five string literal encodings (four prefixes plus the unprefixed form), each producing a different string literal type:
| Prefix | Token | Character Type | Width |
|---|---|---|---|
| (none) | "..." | char | 1 byte |
| L | L"..." | wchar_t | 4 bytes (Linux) |
| u8 | u8"..." | char8_t (C++20) / char | 1 byte |
| u | u"..." | char16_t | 2 bytes |
| U | U"..." | char32_t | 4 bytes |
Scope Entry Layout
The lexer interacts heavily with the scope system. Scope entries are 784-byte records stored in an array at qword_126C5E8, indexed by dword_126C5E4 (current scope index).
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 4 | name_hash | Hash of scope name for lookup |
| +4 | 1 | scope_kind | Kind code (12 = file scope, see below) |
| +6 | 1 | scope_flags | Bit flags: bit 5 = inline namespace |
| +7 | 1 | access_flags | Bit 0 = in class context |
| +10 | 1 | extra_flags | Bit 0 = module scope |
| +12 | 1 | template_flags | Bit 0 = in template argument scan, bit 4 = has concepts |
| +24 | 8 | symbol_chain_or_hash_ptr | Head of symbol chain or hash table |
| +32 | 8 | hash_table_ptr | Hash table for O(1) lookup in large scopes |
| +192 | 8 | lazy_load_scope_ptr | Pointer for lazy symbol loading (calls sub_7C1900) |
| +208 | 4 | scope_depth | Nesting depth counter |
| +376 | 8 | parent_template_info | Template context for template scope entries |
| +416 | 8 | module_info | C++20 module partition data |
| +632 | 8 | class_info_ptr | Pointer to class descriptor for class scopes |
Scope-related globals:
| Address | Name | Description |
|---|---|---|
| dword_126C5E4 | current_scope_index | Index into scope table |
| dword_126C5C4 | class_scope_index | Innermost class scope (-1 if none) |
| dword_126C5C8 | namespace_scope_index | Innermost namespace scope (-1 if none) |
| dword_126C5DC | file_scope_index | File (global) scope index |
| xmmword_126C520 | entity_kind_to_language_mode_map | 32-entry table mapping entity kinds to required language modes |
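Because the scope table is an array of fixed-size 784-byte records, a field access is a multiply-and-add on the scope index followed by a byte load. The sketch below illustrates this with raw byte offsets; the helper names are hypothetical, and only the record size, the offsets +4 and +12, and the file-scope kind value 12 come from the tables above.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    SCOPE_ENTRY_SIZE         = 784,
    SCOPE_KIND_OFF           = 4,
    SCOPE_TEMPLATE_FLAGS_OFF = 12,
    SCOPE_KIND_FILE          = 12
};

/* table stands in for the array at qword_126C5E8, index for dword_126C5E4. */
const uint8_t *scope_entry(const uint8_t *table, int index) {
    return table + (size_t)index * SCOPE_ENTRY_SIZE;
}

/* Reads the scope_kind byte at +4. */
uint8_t scope_kind(const uint8_t *entry) {
    return entry[SCOPE_KIND_OFF];
}

/* Tests bit 0 of template_flags at +12 (template argument scan in progress). */
int in_template_arg_scan(const uint8_t *entry) {
    return entry[SCOPE_TEMPLATE_FLAGS_OFF] & 1;
}
```

In decompiled output this pattern typically appears as an expression of the form `*(base + 784 * index + offset)`, which is how the field offsets in the layout table can be cross-checked.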
Lexer State Stack
The lexer supports push/pop of its entire state for speculative parsing and template argument scanning.
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_688320 | push_lexical_state | 137 | Pushes current lexer state onto qword_126DB40 stack |
| sub_668330 | pop_lexical_state_stack_full | 166 | Pops state, restores stop-token table, macro chains (assert at lexical.c:17808) |
State stack nodes are 80-byte linked-list entries:
| Offset | Size | Field |
|---|---|---|
| +0 | 8 | next (previous state) |
| +8 | 8 | cached_tokens |
| +16 | 8 | source_position |
| +24--+79 | 56 | token_cache_state (saved cache pointers and flags) |
The push/pop mechanism is used for:
- Template argument list scanning (sub_67DC90, 1,078 lines)
- Speculative parsing in disambiguation contexts
- Macro expansion state save/restore
Template Argument Scanning: scan_template_argument_list (sub_67DC90)
sub_67DC90 (1,078 lines, assert at lexical.c:19918) scans template argument lists (<...>). This is one of the most complex lexer functions because of the >> ambiguity: in vector<vector<int>>, the closing >> must be split into two > tokens to close two template argument lists.
The scanner:
- Pushes lexer state and sets template argument scanning mode (scope entry offset +12, bit 0)
- Scans tokens while tracking nesting depth of <> pairs
- Handles nested template-ids recursively
- Creates token cache entries for deferred parsing
- Uses the scope system to classify identifiers within template arguments
- Disambiguates >> as either right-shift or double template close
The entity kind checks at offset +80 (values 19--22) identify template entities for recursive template-id scanning.
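The depth tracking and >> splitting can be illustrated on a toy token stream. This is a deliberately reduced sketch, not the recovered algorithm: tokens are single characters ('<', '>', 'S' for one >> token, anything else for other tokens), the function name is hypothetical, and parenthesized sub-expressions, where >> remains a shift, are not modeled.

```c
/* Scans tokens that follow an already-consumed opening '<'.
   Returns the index of the token that closes that list, or -1 if it never closes.
   *splits counts >> tokens that were treated as two closing brackets.
   *leftover_gt is set when the closing token was a >> whose second '>' remains. */
int find_template_close(const char *toks, int *splits, int *leftover_gt) {
    int depth = 1;
    *splits = 0;
    *leftover_gt = 0;
    for (int i = 0; toks[i] != '\0'; i++) {
        switch (toks[i]) {
        case '<':
            depth++;
            break;
        case '>':
            if (--depth == 0)
                return i;
            break;
        case 'S':                       /* a single >> token */
            ++*splits;
            if (--depth == 0) {         /* first half closes the outer list */
                *leftover_gt = 1;
                return i;
            }
            if (--depth == 0)           /* second half closes the outer list */
                return i;
            break;
        default:
            break;
        }
    }
    return -1;
}
```

For `vector<vector<int>>`, the tokens after the first `<` are identifier, `<`, identifier, `>>`; the sketch reaches depth 2, then the single `>>` token closes both lists at once.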
Preprocessor Integration
The lexer handles several preprocessor-related responsibilities:
Source Position Tracking
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_66D100 | set_source_position | 282 | Converts raw input position to file/line/column (called from dozens of locations) |
| sub_66D5E0 | emit_output_line | 491 | Emits source text and #line directives to preprocessed output |
| sub_66B1F0 | emit_preprocessed_output | 231 | Outputs #line directives via qword_106C280 (output FILE*) |
Macro Expansion Support
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_66A770 | lookup_macro_at_position | 41 | Scans macro chain (qword_126DD80) for macro enclosing given position |
| sub_66A7F0 | create_macro_expansion_record | 44 | Allocates macro expansion tracking node |
| sub_66A890 | push_macro_expansion | 41 | Pushes new expansion onto active stack |
| sub_66A940 | pop_macro_expansion | 28 | Pops expansion from stack |
| sub_66A9D0 | is_in_macro_expansion | 12 | Returns whether currently inside macro expansion |
| sub_66A9F0 | get_macro_expansion_depth | 17 | Returns nesting depth of macro expansions |
| sub_66A310 | invalidate_macro_node | 56 | Clears macro definition when it goes out of scope |
| sub_66A5E0 | free_macro_definition_chain | 91 | Walks and frees macro chain via qword_126DD70 / qword_126DDE0 |
Include File Handling
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_66BB50 | open_source_file | 332 | Opens include files via sub_4F4970 (fopen wrapper), creates file tracking nodes |
| sub_66EA70 | open_next_input_file | 364 | Opens next input source after current file ends, manages include-stack unwinding |
| sub_67BAB0 | scan_header_name | 110 | Scans <filename> or "filename" for #include directives |
Token Pasting and Stringification
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_67D1E0 | handle_token_pasting | 117 | Implements ## preprocessor operator |
| sub_67D440 | stringify_token | 251 | Implements # preprocessor operator |
| sub_67D050 | check_token_paste_validity | 57 | Validates token paste produces a valid token |
| sub_67D900 | expand_macro_argument | 204 | Expands a single macro argument during substitution |
Operator Scanning
Multi-character operators are scanned by a set of dedicated functions in the 0x67ABB0--0x67BAB0 range. The scanner reads the first operator character and dispatches to the appropriate function to check for compound operators:
| First Char | Possible Tokens |
|---|---|
| < | <, <=, <<, <<=, <=>, <% (digraph {), <: (digraph [) |
| > | >, >=, >>, >>= |
| + | +, ++, += |
| - | -, --, -=, ->, ->* |
| * | *, *= |
| & | &, &&, &= |
| \| | \|, \|\|, \|= |
| = | =, == |
| ! | !, != |
| : | :, :: |
| . | ., ..., .* |
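Compound operator scanning follows the usual longest-match ("maximal munch") rule. The sketch below shows it for the `<` family only, using a candidate list ordered longest first; the names are hypothetical, the binary uses dedicated per-character functions rather than a string table, and the C++ special case for `<::` is not handled.

```c
#include <stddef.h>
#include <string.h>

/* Candidates for a leading '<', longest first so the first match is the longest. */
static const char *const k_lt_ops[] = { "<<=", "<=>", "<<", "<=", "<%", "<:", "<" };

/* Returns the length of the operator token starting at s, or 0 if s[0] != '<'. */
size_t scan_lt_operator(const char *s) {
    for (size_t i = 0; i < sizeof k_lt_ops / sizeof k_lt_ops[0]; i++) {
        size_t n = strlen(k_lt_ops[i]);
        if (strncmp(s, k_lt_ops[i], n) == 0)
            return n;
    }
    return 0;
}
```

Ordering the candidates by length is what makes `<<=` win over `<<` and `<`; a hand-written scanner gets the same effect by peeking one character at a time.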
Template Angle Bracket Disambiguation
sub_67CB70 (handle_template_angle_brackets, 263 lines) handles the critical disambiguation of < and > in template contexts. In template argument lists, < opens and > closes, but in expressions, they are comparison operators. The function uses scope context information and the current parsing state (from the 784-byte scope entries) to make the determination.
Error Recovery
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_6887C0 | skip_to_token | 317 | Error recovery: skips tokens until finding a synchronization point (;, }, etc.) |
| sub_6886F0 | expect_token | 31 | Checks current token matches expected kind, emits diagnostic on mismatch |
| sub_688560 | peek_next_token | 44 | Looks ahead at next token without consuming it |
The stop-token table at qword_126DB48 + 8 (357 entries) controls which token kinds are valid synchronization points for error recovery.
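A table-driven recovery loop of this kind can be sketched as follows. The sketch makes several assumptions: the table is modeled as one byte per token kind, the token stream is a plain array, the names are hypothetical, and the token code used in the usage example is a placeholder because the codes for `;` and `}` are not listed here.

```c
#include <stddef.h>

enum { TOKEN_KIND_COUNT = 357 };   /* token codes 0..356 */

/* Advances from start until a token whose stop-table entry is nonzero.
   Returns the index of that token, or n if no synchronization point is found. */
size_t skip_to_stop_token(const unsigned short *toks, size_t n, size_t start,
                          const unsigned char stop_table[TOKEN_KIND_COUNT]) {
    size_t i = start;
    while (i < n && !(toks[i] < TOKEN_KIND_COUNT && stop_table[toks[i]]))
        i++;
    return i;
}
```

For instance, with only entry 200 set in the table and the stream {93, 1, 200, 77}, scanning from index 0 stops at index 2. Saving and restoring the whole table, as the lexer state stack is described as doing, lets nested parsing contexts install their own synchronization sets.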
Built-in Type and Attribute Handling
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_685AB0 | handle_builtin_type_token | 289 | Processes built-in type keywords (int, float, etc.) into type tokens |
| sub_685F10 | process_decltype_token | 212 | Handles decltype() expression in token stream |
| sub_686350 | handle_attribute_token | 584 | Processes [[attribute]] and __attribute__((x)) syntax, including CUDA attributes |
| sub_686F40 | process_asm_or_extension_keyword | 244 | Handles asm, __asm__, and extension keywords |
Diagnostic Strings
| String | Source | Condition |
|---|---|---|
| "pop_lexical_state_stack_full" | sub_668330 | Assert at lexical.c:17808 |
| "copy_tokens_from_cache" | sub_669650 | Assert at lexical.c:3417 |
| "scan_universal_character" | sub_6711E0 | Assert at lexical.c:12384 |
| "get_token_from_cached_token_rescan_list" | sub_676860 | Assert at lexical.c:4302 |
| "get_token_from_reusable_cache_stack" | sub_676860 | Assert at lexical.c:4450, 4469 |
| "scan_template_argument_list" | sub_67DC90 | Assert at lexical.c:19918 |
| "select_dual_lookup_symbol" | sub_666720 | Assert at lexical.c:22477 |
| "keyword_init" | sub_5863A0 | Assert at fe_init.c:1597 |
| "fe_translation_unit_init" | sub_5863A0 | Assert at fe_init.c:2373 |
| Diagnostic Code | Context | Meaning |
|---|---|---|
| 870 | Character literal scanning | Invalid character in literal |
| 912 | select_dual_lookup_symbol | Ambiguous lookup result |
| 1192 | Control byte type 6 | Stale source position marker |
| 861 | Control byte type 6 | Invalid position reference |
| 1665 | check_deferred_diagnostics | Deferred macro-related warning |
| 1750 | refill_buffer | Trigraph sequence warning |
| 2628 | Numeric literal scanner | C++23 digit separator used in earlier mode |
| 2629 | Numeric literal scanner | Digit separator not followed by digit |
Function Map
| Address | Identity | Confidence | Lines | EDG Source |
|---|---|---|---|---|
| sub_5863A0 | keyword_init / fe_translation_unit_init | 98% | 1,113 | fe_init.c:1597 |
| sub_666720 | select_dual_lookup_symbol | HIGH | 372 | lexical.c:22477 |
| sub_668330 | pop_lexical_state_stack_full | HIGH | 166 | lexical.c:17808 |
| sub_668660 | check_deferred_diagnostics | MEDIUM | 104 | lexical.c |
| sub_6688A0 | get_scope_from_entity | HIGH | 32 | lexical.c |
| sub_668C90 | classify_identifier_entity | MEDIUM | 89 | lexical.c |
| sub_668E00 | resolve_entity_through_alias | MEDIUM | 88 | lexical.c |
| sub_669650 | copy_tokens_from_cache | HIGH | 385 | lexical.c:3417 |
| sub_669D00 | allocate_token_cache_entry | MEDIUM | 119 | lexical.c |
| sub_66A000 | append_to_token_cache | MEDIUM | 88 | lexical.c |
| sub_66A140 | push_token_to_rescan_list | MEDIUM | 46 | lexical.c |
| sub_66A3F0 | create_source_region_node | MEDIUM | 84 | lexical.c |
| sub_66A5E0 | free_macro_definition_chain | MEDIUM | 91 | lexical.c |
| sub_66A770 | lookup_macro_at_position | MEDIUM | 41 | lexical.c |
| sub_66A890 | push_macro_expansion | MEDIUM | 41 | lexical.c |
| sub_66AA50 | process_preprocessor_directive | MEDIUM | 380 | lexical.c |
| sub_66B1F0 | emit_preprocessed_output | MEDIUM | 231 | lexical.c |
| sub_66B910 | skip_whitespace_and_comments | MEDIUM | 105 | lexical.c |
| sub_66BB50 | open_source_file | HIGH | 332 | lexical.c |
| sub_66C550 | scan_string_literal | MEDIUM | 356 | lexical.c |
| sub_66CBD0 | scan_character_literal | MEDIUM | 111 | lexical.c |
| sub_66D100 | set_source_position | HIGH | 282 | lexical.c |
| sub_66D5E0 | emit_output_line | HIGH | 491 | lexical.c |
| sub_66DFF0 | scan_pp_number | MEDIUM | 268 | lexical.c |
| sub_66EA70 | open_next_input_file | MEDIUM | 364 | lexical.c |
| sub_66F4E0 | read_next_source_line | HIGH | 735 | lexical.c |
| sub_6702F0 | refill_buffer | HIGH | 792 | lexical.c |
| sub_6711E0 | scan_universal_character | HIGH | 278 | lexical.c:12384 |
| sub_671870 | validate_universal_character_value | MEDIUM | 62 | lexical.c |
| sub_6719B0 | scan_identifier_or_keyword | HIGH | 400 | lexical.c |
| sub_672390 | scan_numeric_literal | HIGH | 1,571 | lexical.c |
| sub_6748A0 | convert_integer_suffix | MEDIUM | 137 | lexical.c |
| sub_674BB0 | determine_numeric_literal_type | MEDIUM | 400 | lexical.c |
| sub_675390 | scan_float_exponent | MEDIUM | 57 | lexical.c |
| sub_6754B0 | convert_float_literal | MEDIUM | 338 | lexical.c |
| sub_676080 | scan_raw_string_literal | MEDIUM-HIGH | 391 | lexical.c |
| sub_676860 | get_next_token | HIGHEST | 1,995 | lexical.c:4302 |
| sub_679800 | scan_token | HIGH | 907 | lexical.c |
| sub_67BAB0 | scan_header_name | MEDIUM | 110 | lexical.c |
| sub_67CB70 | handle_template_angle_brackets | MEDIUM | 263 | lexical.c |
| sub_67D050 | check_token_paste_validity | LOW | 57 | lexical.c |
| sub_67D1E0 | handle_token_pasting | MEDIUM | 117 | lexical.c |
| sub_67D440 | stringify_token | MEDIUM | 251 | lexical.c |
| sub_67D900 | expand_macro_argument | MEDIUM | 204 | lexical.c |
| sub_67DC90 | scan_template_argument_list | HIGH | 1,078 | lexical.c:19918 |
| sub_67F2E0 | create_template_argument_cache | MEDIUM | 184 | lexical.c |
| sub_67F740 | rescan_template_arguments | MEDIUM-HIGH | 583 | lexical.c |
sub_680670 | resolve_dependent_template_id | MEDIUM | 240 | lexical.c |
sub_680AE0 | handle_dependent_name_context | MEDIUM | 235 | lexical.c |
sub_6810F0 | get_token_main | HIGHEST | 3,811 | lexical.c |
sub_685AB0 | handle_builtin_type_token | MEDIUM | 289 | lexical.c |
sub_685F10 | process_decltype_token | MEDIUM | 212 | lexical.c |
sub_686350 | handle_attribute_token | MEDIUM-HIGH | 584 | lexical.c |
sub_686F40 | process_asm_or_extension_keyword | MEDIUM | 244 | lexical.c |
sub_687F30 | setup_lexer_for_parsing_mode | MEDIUM | 216 | lexical.c |
sub_688320 | push_lexical_state | MEDIUM | 137 | lexical.c |
sub_688560 | peek_next_token | MEDIUM | 44 | lexical.c |
sub_6886F0 | expect_token | MEDIUM | 31 | lexical.c |
sub_6887C0 | skip_to_token | MEDIUM | 317 | lexical.c |
Cross-References
- Pipeline Overview -- keyword registration during sub_5863A0
- Entry Point & Initialization -- frontend init calls keyword_init
- Template Engine -- template argument scanning at lexer level
- Type System -- entity kind classification used by lexer
- Token Kind Table -- full 357-entry token table
- Scope Entry -- 784-byte scope entry structure
- Entity Node Layout -- entity node offsets used by identifier classification
- Global Variable Index -- all global addresses referenced here
- Attribute System Overview -- CUDA attribute handling at token level
Expression Parser
The expression parser is the largest subsystem in cudafe++. It lives in EDG 6.6's expr.c, which compiles to approximately 335KB of code (address range 0x4F8000--0x556600) containing roughly 320 functions. The central function scan_expr_full (sub_511D40) alone occupies 80KB -- approximately 2,000 decompiled lines with over 300 local variables. EDG uses a hand-written recursive descent parser, not a generated one (no yacc/bison). Each C++ operator precedence level has its own scanning function, and the call chain follows the precedence hierarchy: assignment, conditional, logical-or, logical-and, bitwise-or, bitwise-xor, bitwise-and, equality, relational, shift, additive, multiplicative, pointer-to-member, unary, postfix, primary.
CUDA-specific extensions are woven directly into this subsystem: cross-execution-space call validation at every function call site, remapping of GCC __sync_fetch_and_* builtins to NVIDIA __nv_atomic_fetch_* intrinsics, and constexpr-if gating of literal evaluation based on compilation mode.
Key Facts
| Property | Value |
|---|---|
| Source file | expr.c (~320 functions) + exprutil.c (~90 functions) |
| Address range | 0x4F8000--0x556600 (expr.c), 0x558720--0x55FE10 (exprutil.c) |
| Total code size | ~385KB |
| Central dispatcher | sub_511D40 (scan_expr_full, 80KB, ~2,000 lines, 300+ locals) |
| Ternary handler | sub_526E30 (scan_conditional_operator, 48KB) |
| Function call handler | sub_545F00 (scan_function_call, 2,490 lines) |
| New-expression handler | sub_54AED0 (scan_new_operator, 2,333 lines) |
| Identifier handler | sub_5512B0 (scan_identifier, 1,406 lines) |
| Template rescan | sub_5565E0 (rescan_expr_with_substitution_internal, 1,558 lines) |
| Atomic builtin remapper | sub_537BF0 (adjust_sync_atomic_builtin, 1,108 lines, NVIDIA-specific) |
| Cross-space validation | sub_505720 (check_cross_execution_space_call, 4KB) |
| Current token global | word_126DD58 (16-bit token kind) |
| Expression context | qword_106B970 (current scope/context pointer) |
| Trace flag | dword_126EFC8 (debug trace), dword_126EFCC (verbosity level) |
Architecture
Recursive Descent, No Generator
EDG's expression parser is entirely hand-written C. There are no parser tables, no DFA state machines, and no grammar transformation output. Each operator precedence level maps to one or more scan_* functions that call down the precedence chain via direct function calls. The parser is effectively a family of mutually recursive functions whose call graph encodes the C++ grammar.
The top-level entry point is scan_expr_full, which serves a dual role: (1) it contains the primary-expression scanner as a massive switch on token kind, and (2) after scanning a primary expression, it enters a post-scan binary-operator dispatch loop that routes to the correct precedence-level handler based on the next operator token.
scan_expr_full (sub_511D40)
│
├─ [token switch] ─────────► Primary expressions
│ case 1 → scan_identifier (sub_5512B0)
│ case 2,3 → scan_numeric_literal (sub_5632C0)
│ case 27 → scan_cast_or_expr (sub_544290)
│ case 155 → scan_new_operator (sub_54AED0)
│ case 161 → scan_class_new_expression (sub_6C9940)
│ case 162 → scan_throw_operator (sub_5211B0)
│ ... (100+ token cases)
│
├─ [postfix loop] ──────────► Postfix operators
│ () → scan_function_call (sub_545F00)
│ [] → scan_subscript_operator (sub_540560)
│ .-> → scan_field_selection_operator (sub_5303E0)
│ ++-- → scan_postfix_incr_decr (sub_510D70)
│
└─ [binary dispatch] ───────► Binary operators by precedence
prec 64 → scan_simple_assignment_operator (sub_53FD70)
scan_compound_assignment_operator (sub_536E80)
prec 60 → scan_conditional_operator (sub_526E30)
prec 59 → scan_logical_operator (sub_526040) [&&]
prec 58 → scan_logical_operator (sub_526040) [||]
prec 57 → scan_comma_operator (sub_529720)
... → scan_bit_operator (sub_525BC0) [| ^ &]
... → scan_eq_operator (sub_524ED0) [== !=]
... → scan_add_operator (sub_523EB0) [+ -]
... → scan_mult_operator (sub_5238C0) [* / %]
... → scan_shift_operator (sub_524960) [<< >>]
... → scan_ptr_to_member_operator (sub_522650) [.* ->*]
Precedence Levels
The parser assigns numeric precedence levels internally, passed as the a3 (third) parameter to scan_expr_full. The precedence integer increases with binding strength (higher values = tighter binding):
| Level | Operators | Handler |
|---|---|---|
| 57 | , (comma) | scan_comma_operator |
| 58 | || | scan_logical_operator |
| 59 | && | scan_logical_operator |
| 60 | ? : (conditional) | scan_conditional_operator |
| 61 | | | scan_bit_operator |
| 62 | ^ | scan_bit_operator |
| 63 | & | scan_bit_operator |
| 64 | = += -= ... | scan_simple_assignment_operator / scan_compound_assignment_operator |
When scan_expr_full encounters a binary operator token whose precedence is lower than the current precedence parameter, it returns immediately, allowing the caller at that precedence level to consume the operator. This is the standard recursive descent technique: each level calls the next-higher-precedence scanner for its operands.
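The early-return rule described above is the core of precedence climbing. The following is a minimal sketch of that technique on integer arithmetic. The precedence values, the `Parser` class, and `evaluate` are assumptions made for the demo and do not correspond to EDG's internal levels or function signatures.

```python
import re

# Hypothetical binding levels for the demo; higher binds tighter.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}
APPLY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,
}

def tokenize(text):
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def primary(self):
        tok = self.next()
        if tok == "(":
            value = self.expr(0)
            assert self.next() == ")", "expected ')'"
            return value
        return int(tok)

    def expr(self, min_prec):
        left = self.primary()
        while True:
            op = self.peek()
            # Stop when the operator binds less tightly than this level,
            # leaving it for a caller running at a lower precedence.
            if op not in PREC or PREC[op] < min_prec:
                return left
            self.next()
            # Left-associative: the right operand must bind strictly tighter.
            right = self.expr(PREC[op] + 1)
            left = APPLY[op](left, right)

def evaluate(text):
    return Parser(text).expr(0)
```

Calling `evaluate("1+2*3")` scans `2*3` in a nested call at the higher level before the outer call applies `+`, which is the same shape as the dispatch loop in scan_expr_full.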
scan_expr_full -- The Central Dispatcher
scan_expr_full (sub_511D40, 80KB) is the largest function in the entire cudafe++ binary. Its structure follows this pattern:
function scan_expr_full(result, scan_info, precedence, flags, ...) {
// 1. Trace entry
if (debug_trace_flag)
trace_enter(4, "scan_expr_full")
if (debug_verbosity > 3)
fprintf(trace_stream, "precedence level = %d\n", precedence)
// 2. Extract context flags from current scope
context = current_scope // qword_106B970
in_cuda_extension = (context[20] & 0x08) != 0
in_pack_expansion = context[21] & 0x01
saved_pending_expr = pending_expression // qword_106B968
pending_expression = 0
// 3. Handle template rescan context
if (in_template_context) {
if (context.flags == TEMPLATE_ONLY_DEPENDENT)
init_expr_stack_entry(...)
// Mark as template-argument context
}
// 4. Handle forced-parenthesized-expression flag
if (flags & 0x08)
goto scan_cast_or_expr // sub_544290
// 5. Check for decltype token (185)
if (current_token == 185 && dialect == C++)
call sub_6810F0(...) // re-classify through lexer
// 6. MASTER TOKEN SWITCH -- dispatch on word_126DD58
switch (current_token) {
case 1: // identifier
// Special-case: check if identifier is a hidden type trait
if (identifier_is("__is_pointer")) { set_token(320); scan_unary_type_trait(); break; }
if (identifier_is("__is_invocable")) { set_token(225); scan_call_like_builtin(); break; }
if (identifier_is("__is_signed")) { set_token(324); scan_unary_type_trait(); break; }
// Default: full identifier scan
scan_identifier(result, flags, precedence, ...)
break;
case 2, 3, 123, 124, 125: // numeric, char, utf literals
// Context-sensitive literal handling:
// - Check constexpr-if context (execution-space dependent)
// - Route to appropriate literal scanner
if (is_constexpr_if_context) {
value = compute_constexpr_literal()
scan_constexpr_literal_result(value, result)
} else {
scan_numeric_literal(literal_data, result) // sub_5632C0
}
break;
case 4, 5, 6, 181, 182: // string literals
scan_string_literal(literal_data, result) // sub_5632C0
// Vector deprecation check for CUDA
if ((cuda_mode || cuda_device_mode) && has_vector_literal_flag)
result.flags |= VECTOR_DEPRECATED
break;
case 7: // postfix-string-context (interpolated strings)
check_postfix_string_context(...)
scan_string_expression(literal_data, result) // sub_563580
break;
case 27: // left-paren '('
scan_cast_or_expr(result, scratch, flags) // sub_544290
// Disambiguates: C-cast, grouped expr, GNU statement expr, fold expr
break;
case 31, 32: // prefix ++ / --
scan_prefix_incr_decr(result, ...) // sub_516080
break;
case 33: // & (address-of)
scan_ampersand_operator(result, ...) // sub_516720
break;
case 34: // * (indirection)
scan_indirection_operator(result, ...) // sub_517270
break;
case 35, 36, 37, 38: // unary + - ~ !
scan_arith_prefix_operator(result, ...) // sub_517680
break;
case 77: // lambda expression '['
scan_lambda_expression(result, ...) // sub_5BBA60
break;
case 99, 284: // sizeof
scan_sizeof_operator(result, ...) // sub_517BD0
break;
case 109: // _Generic
scan_type_generic_operator(result, ...) // inlined
break;
case 152: // requires
scan_requires_expression(result, ...) // sub_52CFF0
break;
case 155: // new (in C++ concept context path)
scan_new_operator(result, ...) // sub_54AED0
break;
case 161: // new-expression
scan_class_new_expression(result, ...) // sub_6C9940/sub_6C9C50
break;
case 162: // throw
scan_throw_operator(result, ...) // sub_5211B0
break;
case 166: // const_cast
scan_const_cast_operator(result, ...) // sub_520280
break;
case 167: // static_cast
scan_static_cast_operator(result, ...) // sub_51F670
break;
case 176: // reinterpret_cast
scan_reinterpret_cast_operator(result, ...) // sub_5209A0
break;
case 177: // dynamic_cast
scan_named_cast_operator(result, ...) // sub_53D590
break;
case 178: // typeid
scan_typeid_operator(result, ...) // sub_535370
break;
case 185: // decltype
scan_decltype_operator(result, ...) // sub_52A3B0
break;
case 195 ... 356: // type traits (__is_class, __is_enum, etc.), excluding codes with dedicated cases below
// unary traits:
scan_unary_type_trait_helper(result, ...) // sub_51A690
// binary traits:
scan_binary_type_trait_helper(result, ...) // sub_51B650
break;
case 243: // noexcept
scan_noexcept_operator(result, ...) // sub_51D910
break;
case 267: // co_yield
// Coroutine yield expression handling
scan_braced_init_list_full(result, ...) // sub_5360D0
add_await_to_operand(result, ...) // sub_50B630
break;
case 269: // co_await
// Recursive scan of operand, then wrap with await semantics
scan_expr_full(result, info, precedence, flags | AWAIT)
add_await_to_operand(result, ...) // sub_50B630
break;
case 297: // __builtin_bit_cast
scan_builtin_bit_cast(result, ...) // sub_51CC60
break;
// ... approximately 100 additional cases
}
// 7. POST-SCAN BINARY OPERATOR DISPATCH LOOP
// After scanning a primary/prefix expression, check for binary operators
while (true) {
op = current_token
op_prec = get_binary_op_precedence(op)
if (op_prec < precedence)
break // operator binds less tightly than our level
switch (op) {
case '?': scan_conditional_operator(result, info, flags) // sub_526E30
case '=': scan_simple_assignment_operator(result, ...) // sub_53FD70
case '+=': scan_compound_assignment_operator(result, ...) // sub_536E80
case '||': scan_logical_operator(result, info, ...) // sub_526040
case '&&': scan_logical_operator(result, info, ...) // sub_526040
case '|': scan_bit_operator(result, ...) // sub_525BC0
case '^': scan_bit_operator(result, ...)
case '&': scan_bit_operator(result, ...)
case '==': scan_eq_operator(result, ...) // sub_524ED0
case '!=': scan_eq_operator(result, ...)
case '<': scan_rel_operator(result, ...) // sub_543A90
case '+': scan_add_operator(result, ...) // sub_523EB0
case '-': scan_add_operator(result, ...)
case '*': scan_mult_operator(result, ...) // sub_5238C0
case '/': scan_mult_operator(result, ...)
case '%': scan_mult_operator(result, ...)
case '<<': scan_shift_operator(result, ...) // sub_524960
case '>>': scan_shift_operator(result, ...)
case '.*': scan_ptr_to_member_operator(result, ...) // sub_522650
case '->*': scan_ptr_to_member_operator(result, ...)
case ',': scan_comma_operator(result, ...) // sub_529720
// Postfix operators (not precedence-gated):
case '(': scan_function_call(result, ...) // sub_545F00
case '[': scan_subscript_operator(result, ...) // sub_540560
case '.': scan_field_selection_operator(result, ...) // sub_5303E0
case '->': scan_field_selection_operator(result, ...)
case '++': scan_postfix_incr_decr(result, ...) // sub_510D70
case '--': scan_postfix_incr_decr(result, ...)
}
}
// 8. Restore saved state and return
pending_expression = saved_pending_expr
if (debug_trace_flag)
trace_exit(...)
return result
}
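The identifier special-casing at the top of the master switch amounts to a spelling-to-token remap. A minimal sketch follows, using the three token codes shown in the pseudocode above. The dictionary layout and the name `reclassify_identifier` are assumptions, and the real function may recognise more spellings than these three.

```python
# Token codes copied from the pseudocode above; the table layout is an assumption.
HIDDEN_TRAIT_TOKENS = {
    "__is_pointer": 320,
    "__is_invocable": 225,
    "__is_signed": 324,
}
IDENTIFIER_TOKEN = 1

def reclassify_identifier(spelling):
    """Return the token code the dispatcher would switch on for an identifier spelling."""
    return HIDDEN_TRAIT_TOKENS.get(spelling, IDENTIFIER_TOKEN)
```

Any spelling not in the table keeps token code 1 and takes the ordinary scan_identifier path.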
Token Dispatch Map (Complete)
The master switch in scan_expr_full covers approximately 120 distinct token cases. The full dispatch table:
| Token Code(s) | Expression Form | Handler |
|---|---|---|
| 1 | Identifier (with __is_pointer/__is_signed detection) | scan_identifier (sub_5512B0) |
| 2, 3 | Integer / floating-point literal | scan_numeric_literal (sub_5632C0) |
| 4, 5, 6, 181, 182 | String literal (narrow, wide, UTF-8/16/32) | scan_string_literal (sub_5632C0) |
| 7 | Postfix string context | sub_563580 |
| 8 | Literal operator call | make_func_operand_for_literal_operator_call (sub_4FFFB0) |
| 18, 80--136, 165, 180, 183 | Type keywords in expression context | scan_type_returning_type_trait_operator / scan_identifier |
| 25 | __extension__ | scan_expr_splicer (sub_52FD70) or scan_statement_expression (sub_4F9F20) |
| 27 | ( | scan_cast_or_expr (sub_544290) -- disambiguates cast/group/fold/stmt-expr |
| 31, 32 | ++ / -- (prefix) | scan_prefix_incr_decr (sub_516080) |
| 33 | & (address-of) | scan_ampersand_operator (sub_516720) |
| 34 | * (indirection) | scan_indirection_operator (sub_517270) |
| 35--38 | + - ~ ! (unary) | scan_arith_prefix_operator (sub_517680) |
| 50 | __builtin_expect | bound_function_in_cast (sub_503F70) |
| 77 | [ (lambda) | scan_lambda_expression (sub_5BBA60) |
| 99, 284 | sizeof | scan_sizeof_operator (sub_517BD0) |
| 109 | _Generic | scan_type_generic_operator (inlined) |
| 111, 247 | alignof / _Alignof | scan_alignof_operator (sub_519300) |
| 112 | __intaddr | scan_intaddr_operator (sub_520EE0) |
| 113 | va_start | scan_va_start_operator (sub_51E8A0) |
| 114 | va_arg | scan_va_arg_operator (sub_51DFA0) |
| 115 | va_end | scan_va_end_operator (sub_51E4A0) |
| 116 | va_copy | scan_va_copy_operator (sub_51E670) |
| 117 | offsetof | scan_offsetof (sub_555530) |
| 123 | char literal | scan_utf_char_literal (sub_5659D0) |
| 124 | wchar_t literal | scan_wchar_literal (sub_5658D0) |
| 125 | UTF character literal | scan_wide_char_literal (sub_565950) |
| 138--141 | __FUNCTION__/__PRETTY_FUNCTION__/__func__ | setup_function_name_literal (sub_50AC80) |
| 143 | __builtin_types_compatible_p | scan_builtin_operation_args_list (sub_534920) |
| 144, 145 | __real__ / __imag__ | scan_complex_projection (sub_51D210) |
| 146 | typeid (execution-space variant) | scan_typeid_operator (sub_535370) |
| 152 | requires (C++20) | scan_requires_expression (sub_52CFF0) |
| 155 | Concept expression | scan_new_operator path (sub_54AED0) |
| 161 | new | scan_class_new_expression (sub_6C9940) |
| 162 | throw | scan_throw_operator (sub_5211B0) |
| 166 | const_cast | scan_const_cast_operator (sub_520280) |
| 167 | static_cast | scan_static_cast_operator (sub_51F670) |
| 176 | reinterpret_cast | scan_reinterpret_cast_operator (sub_5209A0) |
| 177 | dynamic_cast | scan_named_cast_operator (sub_53D590) |
| 178 | typeid | scan_typeid_operator (sub_535370) |
| 185 | decltype | scan_decltype_operator (sub_52A3B0) |
| 188 | wchar_t literal (alt) | sub_5BCDE0 |
| 189 | typeof | scan_typeof_operator (sub_52B540) |
| 195--206 | Unary type traits | scan_unary_type_trait_helper (sub_51A690) |
| 207--292 | Binary type traits | scan_binary_type_trait_helper (sub_51B650) |
| 225, 226 | __is_invocable / __is_nothrow_invocable | dispatch_call_like_builtin (sub_535080) |
| 227--235 | Builtin operations | sub_535080 / sub_51BC10 / sub_51B0C0 |
| 237 | __builtin_constant_p | sub_5BC7E0 |
| 243 | noexcept (operator) | scan_noexcept_operator (sub_51D910) |
| 251--256 | Builtin atomic operations | check_operand_is_pointer (sub_5338B0/sub_533B80) |
| 257, 258 | Fold expression tokens | scan_builtin_shuffle (sub_53E480) |
| 259 | __builtin_convertvector | scan_builtin_convertvector (sub_521950) |
| 261 | __builtin_complex | scan_builtin_complex (sub_521DB0) |
| 262 | __builtin_choose_expr | scan_c11_generic_selection (sub_554400) |
| 267 | co_yield | Braced-init-list + coroutine add_await_to_operand (sub_50B630) |
| 269 | co_await | Recursive scan_expr_full + add_await_to_operand |
| 270 | __builtin_launder | sub_51B0C0(60, ...) |
| 271 | __builtin_addressof | scan_builtin_addressof (sub_519CF0) |
| 294 | Pack expansion | scan_requires_expr (sub_542D90) |
| 296 | __has_attribute | scan_builtin_has_attribute (sub_51C780) |
| 297 | __builtin_bit_cast | scan_builtin_bit_cast (sub_51CC60) |
| 300, 301 | __is_pointer_interconvertible_with_class | sub_51BE60 |
| 302, 303 | __is_corresponding_member | sub_51C270 |
| 304 | __edg_is_deducible | sub_51B360 |
| 306, 307 | __builtin_source_location | sub_5BC720 / sub_534920 |
scan_conditional_operator -- Ternary ? :
scan_conditional_operator (sub_526E30, 48KB) is the second-largest expression-scanning function. The ternary operator is complex to implement because the parser must derive one result type from two branches whose types may differ completely. The function handles:
- Type unification between branches: determines the common type of the true and false expressions. This involves the usual arithmetic conversions for numeric types, pointer-to-derived to pointer-to-base conversions, null pointer conversions, and user-defined conversion sequences.
- Lvalue conditional expressions: when both branches are lvalues of the same type, the result is itself an lvalue (standard in C++; in C, a historical GCC extension).
- Void branches: if one or both branches are void expressions, the result type is void.
- Throw in branches: a throw expression in one branch causes the result to take the type of the other branch.
- Constexpr evaluation: when the condition is a constant expression, only one branch is semantically evaluated (the other is discarded).
- Reference binding: determines whether the result is an lvalue reference, rvalue reference, or prvalue.
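- Overload resolution for class-type operands: ?: itself cannot be overloaded in C++, but when a branch has class type the conversions to apply are chosen by overload resolution over the built-in candidates.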
function scan_conditional_operator(context, result, flags) {
// 1. The condition has already been scanned -- it is in 'result'
// We are positioned at the '?' token
// 2. Save expression stack state
saved_stack = save_expr_stack()
// 3. Scan true branch (between ? and :)
// Note: precedence resets -- assignment expressions allowed here
init_expr_stack_entry(...)
scan_expr_full(true_result, info, ASSIGNMENT_PREC, flags)
// 4. Expect and consume ':'
expect_token(':')
// 5. Scan false branch
scan_expr_full(false_result, info, ASSIGNMENT_PREC, flags)
// 6. Type unification of true_result and false_result
true_type = get_type(true_result)
false_type = get_type(false_result)
if (both_void(true_type, false_type))
result_type = void
else if (is_throw(true_result))
result_type = false_type
else if (is_throw(false_result))
result_type = true_type
else if (arithmetic_types(true_type, false_type))
result_type = usual_arithmetic_conversions(true_type, false_type)
else if (same_class_lvalues(true_result, false_result))
result_type = common_lvalue_type(true_type, false_type) // GCC ext
else if (pointer_types(true_type, false_type))
result_type = composite_pointer_type(true_type, false_type)
else
// Try user-defined conversions (overload resolution)
result_type = resolve_via_conversion_sequences(true_type, false_type)
// 7. Apply cv-qualification merging
result_type = merge_cv_qualifications(true_type, false_type, result_type)
// 8. Build result expression node
build_conditional_expr_node(result, condition, true_result, false_result, result_type)
// 9. Restore stack
restore_expr_stack(saved_stack)
}
The complexity arises from the 15+ different type-pair combinations (arithmetic-arithmetic, pointer-pointer, pointer-null, class-class with conversions, void-void, throw-anything, lvalue-lvalue GCC extension) that each require different conversion logic.
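The order of the type-unification tests can be sketched on a toy type model. This is an illustration only: types are plain strings, `ARITH_RANK` is a made-up ranking that ignores promotion and signedness, and `unify_branch_types` is a hypothetical name. Only the order of the checks follows the pseudocode above.

```python
# Toy conversion ranks (an assumption); real usual arithmetic conversions
# also involve integral promotion and signedness rules.
ARITH_RANK = {"int": 0, "long": 1, "float": 2, "double": 3}

def unify_branch_types(true_type, false_type,
                       true_is_throw=False, false_is_throw=False):
    """Return the result type of `c ? t : f` in the toy model, or None."""
    if true_type == "void" and false_type == "void":
        return "void"
    if true_is_throw:
        return false_type            # a throw branch adopts the other type
    if false_is_throw:
        return true_type
    if true_type in ARITH_RANK and false_type in ARITH_RANK:
        return max(true_type, false_type, key=ARITH_RANK.__getitem__)
    if true_type == false_type:
        return true_type             # identical types need no conversion
    if true_type.endswith("*") and false_type == "nullptr_t":
        return true_type             # pointer vs. null pointer constant
    if false_type.endswith("*") and true_type == "nullptr_t":
        return false_type
    return None  # would fall through to conversion-sequence resolution
```

Each early return corresponds to one family of type pairs; the final `None` stands in for the user-defined conversion path, which this sketch does not model.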
scan_function_call -- All Call Forms
scan_function_call (sub_545F00, 2,490 lines) handles every form of function call expression. It is invoked from the postfix operator dispatch in scan_expr_full when a ( follows a primary expression, and also from various specialized paths.
The function handles:
- Regular function calls with overload resolution
- Builtin function calls -- GCC/Clang __builtin_* with special semantics
- Pseudo-calls to builtins -- va_start, __builtin_va_start, etc.
- GNU __builtin_classify_type -- compile-time type classification
- SFINAE context -- failed overload resolution suppresses errors instead of aborting
- Template argument deduction for function templates at call sites
- CUDA atomic builtin remapping -- delegates to adjust_sync_atomic_builtin (see below)
function scan_function_call(callee_operand, flags, context, ...) {
// 1. Classify the callee
operand_kind = get_operand_kind(callee_operand)
assert(operand_kind is valid) // "scan_function_call: bad operand kind"
// 2. Scan argument list
scan_call_arguments(arg_list, ...) // sub_545760
// 3. Branch on callee kind
if (is_builtin_function(callee_operand)) {
// Check if this is a special builtin
if (is_sync_atomic_builtin(callee_operand)) {
// CUDA-specific: remap __sync_fetch_and_* → __nv_atomic_fetch_*
result = adjust_sync_atomic_builtin(callee, args, ...) // sub_537BF0
return result
}
// check_builtin_function_for_call: validate args for builtins
check_builtin_function_for_call(callee, arg_list, ...)
// scan_builtin_pseudo_call: for builtins with special evaluation
if (is_pseudo_call_builtin(callee))
return scan_builtin_pseudo_call(callee, arg_list, ...)
}
// 4. Overload resolution
if (has_overload_candidates(callee_operand)) {
best = perform_overload_resolution(callee, arg_list, ...)
if (best == AMBIGUOUS)
emit_error(...)
if (best == NO_MATCH && in_sfinae_context)
return SFINAE_FAILURE
callee = best.function
}
// 5. Template argument deduction (if callee is a function template)
if (is_function_template(callee)) {
deduced = deduce_template_args(callee, arg_list, ...)
if (deduction_failed && in_sfinae_context)
return SFINAE_FAILURE
callee = instantiate_template(callee, deduced)
}
// 6. CUDA cross-execution-space check
if (cuda_mode)
check_cross_execution_space_call(callee, ...) // sub_505720
// 7. Apply implicit conversions to arguments
for each (arg, param) in zip(arg_list, callee.params) {
convert_arg_to_param_type(arg, param)
}
// 8. Build call expression node
build_call_expression(result, callee, arg_list, return_type)
}
scan_call_arguments (sub_545760, 332 lines)
The argument scanner called from scan_function_call:
function scan_call_arguments(arg_list_out, ...) {
// assert "scan_call_arguments"
// Loop: scan comma-separated expressions until ')'
while (current_token != ')') {
scan_expr_full(arg, info, ASSIGNMENT_PREC, flags)
append(arg_list_out, arg)
if (current_token == ',')
consume(',')
else
break
}
// Handle default arguments for missing trailing params
// Handle parameter pack expansion
}
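The comma-separated loop can be tried out on a flat token list. This is a minimal sketch under simplifying assumptions: each argument is a single token standing in for a full assignment-expression scan, nested parentheses are not handled, and the Python signature is hypothetical.

```python
def scan_call_arguments(tokens, pos):
    """Collect comma-separated arguments starting just after '('.

    Returns (args, index of the closing ')'). The token list must
    contain a ')' at or after `pos`.
    """
    args = []
    while tokens[pos] != ")":
        # Stand-in for scanning one assignment-expression: take one token.
        args.append(tokens[pos])
        pos += 1
        if tokens[pos] == ",":
            pos += 1
        else:
            break
    if tokens[pos] != ")":
        raise SyntaxError("expected ')'")
    return args, pos
```

For the tokens `a , b )` the call returns the two arguments and the index of the closing parenthesis; an empty argument list returns immediately.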
scan_new_operator -- All new Forms
scan_new_operator (sub_54AED0, 2,333 lines) implements the complete C++ new expression. The function name strings embedded in the binary confirm the following sub-operations:
| Sub-operation | Embedded Assert String |
|---|---|
| Entry point | "scan_new_operator" |
| Rescan in template | "rescan_new_operator_expr" |
| Token validation | "scan_new_operator: expected new or gcnew" |
| Token extraction | "get_new_operator_token" |
| Type parsing | "scan_new_type" |
| Paren-as-braced fallback | "scan_paren_expr_list_as_braced_list" |
| Array size deduction | "deduce_new_array_size" |
| Deallocation lookup | "determine_deletion_for_new" |
| Paren initializer | "prep_new_object_init_paren_initializer" |
| Brace initializer | "prep_new_object_init_braced_initializer" |
| No initializer | "prep_new_object_init_no_initializer" |
| Non-POD error | "scan_new_operator: non-POD class has neither actual nor assumed ctor" |
The function processes all forms:
function scan_new_operator(result, flags, context, ...) {
// Determine scope: ::new (global) vs. new (class-scope)
is_global = check_and_consume("::")
// Parse optional placement arguments: new(placement_args)
if (current_token == '(')
placement_args = scan_expression_list(...)
// Parse the allocated type: new Type
type = scan_new_type(...)
// Parse optional array dimension: new Type[size]
if (current_token == '[') {
array_size = scan_expression(...)
if (can_deduce_size)
deduce_new_array_size(type, initializer)
}
// Parse optional initializer
if (current_token == '(')
init = prep_new_object_init_paren_initializer(type, ...)
else if (current_token == '{')
init = prep_new_object_init_braced_initializer(type, ...)
else
init = prep_new_object_init_no_initializer(type, ...)
// Look up matching operator new
new_fn = lookup_operator_new(type, placement_args, is_global, ...)
// Look up matching operator delete (for exception cleanup)
determine_deletion_for_new(new_fn, type, placement_args, ...)
// For template-dependent types, defer to rescan at instantiation
if (is_dependent_type(type))
record_for_rescan(...)
// Build new-expression node
build_new_expr(result, new_fn, type, init, placement_args, array_size)
}
scan_identifier -- Name Resolution in Expression Context
scan_identifier (sub_5512B0, 1,406 lines) handles the case where the current token is an identifier in expression context. This is far more complex than a simple name lookup because identifiers in C++ can resolve to variables, functions, enumerators, type names (triggering functional-notation casts), anonymous union members, or preprocessing constants.
The function contains assert strings revealing its sub-operations:
| Assert String | Purpose |
|---|---|
"scan_identifier" | Entry point |
"scan_identifier: in preprocessing expr" | Identifier in #if context evaluates to 0 or 1 |
"anonymous_parent_variable_of" | Navigate to parent variable of anonymous union member |
"anonymous_parent_variable_of: bad symbol kind on list" | Error path for malformed anonymous union chain |
"make_anonymous_union_field_operand" | Construct operand for anonymous union member access |
"get_with_hash" | Hash-based lookup for cached resolution results |
function scan_identifier(result, flags, precedence, ...) {
// 1. Preprocessing-expression context
// In #if, undefined identifiers evaluate to 0
if (in_preprocessing_expression) {
// "scan_identifier: in preprocessing expr"
result = make_integer_constant(0)
return
}
// 2. Look up identifier in current scope
lookup_result = scope_lookup(current_identifier, current_scope)
// 3. If identifier resolves to a type name → functional-notation cast
if (is_type_entity(lookup_result)) {
scan_functional_notation_type_conversion(type, result, ...) // sub_54E7C0
return
}
// 4. If identifier is an anonymous union member
if (is_anonymous_union_member(lookup_result)) {
// Walk up to find the named parent variable
parent = anonymous_parent_variable_of(lookup_result)
result = make_anonymous_union_field_operand(parent, lookup_result)
return
}
// 5. If identifier is a function (possibly overloaded)
if (is_function_entity(lookup_result)) {
result = make_func_operand(lookup_result)
// Lambda capture check
if (in_lambda_scope)
check_var_for_lambda_capture(lookup_result, ...)
return
}
// 6. Variable reference
result = make_var_operand(lookup_result)
// 7. Lambda capture analysis
if (in_lambda_scope)
check_var_for_lambda_capture(lookup_result, ...)
// 8. Cross-execution-space reference check (CUDA)
if (cuda_mode)
check_cross_execution_space_reference(lookup_result, ...)
}
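Step 4 walks from an anonymous union member outward to a named variable. The sketch below shows that walk on a hypothetical symbol model: the `Symbol` class, its `parent` link, and the use of `None` as the anonymous marker are all assumptions, since the actual entity layout is documented elsewhere.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Symbol:
    name: Optional[str]                 # None marks an anonymous union object
    parent: Optional["Symbol"] = None   # enclosing object, if any

def anonymous_parent_variable_of(member):
    """Walk outward from a member until a named enclosing variable is found."""
    node = member.parent
    while node is not None and node.name is None:
        node = node.parent
    if node is None:
        raise ValueError("no named parent variable on the chain")
    return node
```

With nested anonymous unions the loop skips every unnamed level, so the operand built afterwards can refer to the outermost named object plus the member.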
CUDA-Specific: Cross-Execution-Space Call Validation
Two functions implement the CUDA execution space enforcement that prevents illegal calls between __host__ and __device__ code:
check_cross_execution_space_call (sub_505720)
Called from scan_function_call and other call sites. The function extracts execution space information from bit-packed flags at entity offset +182:
function check_cross_execution_space_call(callee, is_must_check, diag_ctx) {
// Extract callee's execution space from entity flags
if (callee != NULL) {
is_not_device_only = (callee[182] & 0x30) != 0x20 // bits 4-5
is_host_only = (callee[182] & 0x60) == 0x20 // bits 5-6
is_global = (callee[182] & 0x40) != 0 // bit 6
}
// Early exits for special contexts
if (compilation_chain == -1) return // not in compilation
if (CU has CUDA flags cleared) return // not a CUDA compilation unit
if (in_SFINAE_context) return // errors suppressed
// Get caller's execution space from enclosing function
enclosing_fn = CU_table[enclosing_CU_index].function // at +224
if (enclosing_fn != NULL) {
caller_host_only = (enclosing_fn[182] & 0x60) == 0x20
caller_not_device_only = (enclosing_fn[182] & 0x30) != 0x20
} else {
// Top-level code: treated as __host__
caller_host_only = 0
caller_not_device_only = 1
}
// Check for implicitly HD (constexpr or __host__ __device__ by inference)
if (callee[177] & 0x10) return // callee is implicitly HD
if (callee has deleted+explicit HD flags) return
// The actual cross-space check matrix:
// caller=host, callee=device → error 3462 or 3463
// caller=device, callee=host → error 3464 or 3465
// callee=__global__ → error 3508
if (caller_not_device_only && caller_host_only) {
// Caller is __host__ only
if (callee is __device__ only) {
if (is_trivial_device_copyable(callee)) // sub_6BC680
return // allow
space1 = get_execution_space_name(enclosing_fn, 0) // sub_6BC6B0
space2 = get_execution_space_name(callee, 1)
emit_error(3462 + has_explicit_host, ...)
}
} else if (caller_not_device_only) {
// Caller is __device__ only
if (callee is __host__ only)
emit_error(3464 + has_explicit_device, ...)
}
if (callee is __global__) {
emit_error(3508, is_must_check ? "must" : "cannot", ...)
}
}
The bit encoding at entity offset +182:
| Bits | Mask | Meaning |
|---|---|---|
| 4--5 | & 0x30 | __device__ flag: 0x20 = device-only |
| 5--6 | & 0x60 | __host__ flag: 0x20 = host-only |
| 6 | & 0x40 | __global__ flag |
Error codes issued:
| Code | Meaning |
|---|---|
| 3462 | __device__ function called from __host__ context |
| 3463 | Variant of 3462 with __host__ annotation note |
| 3464 | __host__ function called from __device__ context |
| 3465 | Variant of 3464 with __device__ annotation note |
| 3508 | __global__ function called from wrong context |
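The host/device part of the check matrix can be sketched in a few lines. This is a simplified model, not the decompiled logic: the string constants, the function name `cross_space_error`, and the `explicit_annotation` parameter (standing in for `has_explicit_host` / `has_explicit_device`) are illustrative assumptions, and the early exits and the separate 3508 check are deliberately left out.

```python
from typing import Optional

# Hypothetical labels for the two execution spaces modelled here.
HOST, DEVICE = "host", "device"

def cross_space_error(caller: str, callee: str,
                      explicit_annotation: bool = False) -> Optional[int]:
    """Return the diagnostic code for a caller/callee pair, or None if allowed.

    Only the host<->device half of the matrix is modelled; the __global__
    check (3508) and the implicit-HD early exits are not.
    """
    if caller == HOST and callee == DEVICE:
        return 3462 + int(explicit_annotation)   # 3462 or 3463
    if caller == DEVICE and callee == HOST:
        return 3464 + int(explicit_annotation)   # 3464 or 3465
    return None                                   # same space: no diagnostic
```

The `+ int(explicit_annotation)` step mirrors the `3462 + has_explicit_host` arithmetic in the pseudocode, which is why each base code is immediately followed by its "annotation note" variant.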
check_cross_space_call_in_template (sub_505B40)
A simplified variant (2.7KB) used during template instantiation. The logic mirrors check_cross_execution_space_call but operates when dword_126C5C4 == -1 (template instantiation depth guard). It does not take the is_must_check parameter and always checks both directions.
See the Execution Spaces page for full details on the CUDA execution model.
CUDA-Specific: adjust_sync_atomic_builtin
adjust_sync_atomic_builtin (sub_537BF0, 1,108 lines) is the largest NVIDIA-specific function in the expression parser. It transforms GCC-style __sync_fetch_and_* atomic builtins into NVIDIA's own __nv_atomic_fetch_* intrinsics.
Why This Remapping Exists
CUDA inherits GCC's __sync_fetch_and_* builtin family from the host-side C/C++ dialect, but NVIDIA's GPU ISA (PTX) uses a different instruction encoding for atomic operations. The GPU atomics have type-specific variants that the PTX backend needs to select the correct instruction. Rather than teaching the backend to decompose generic __sync_* builtins, NVIDIA front-loads the transformation in the parser, mapping each builtin to a type-suffixed __nv_atomic_fetch_* intrinsic that directly corresponds to a PTX atomic instruction.
The type suffix ensures correct instruction selection:
| Suffix | Type Category | PTX Atomic Type |
|---|---|---|
| _s | Signed integer | .s32, .s64 |
| _u | Unsigned integer | .u32, .u64 |
| _f | Floating-point | .f32, .f64 |
Remapping Table
| GCC Builtin | NVIDIA Intrinsic (base) |
|---|---|
| __sync_fetch_and_add | __nv_atomic_fetch_add |
| __sync_fetch_and_sub | __nv_atomic_fetch_sub |
| __sync_fetch_and_and | __nv_atomic_fetch_and |
| __sync_fetch_and_xor | __nv_atomic_fetch_xor |
| __sync_fetch_and_or | __nv_atomic_fetch_or |
| __sync_fetch_and_max | __nv_atomic_fetch_max |
| __sync_fetch_and_min | __nv_atomic_fetch_min |
Pseudocode
function adjust_sync_atomic_builtin(callee, args, arg_list, builtin_info, result_ptr) {
// assert "adjust_sync_atomic_builtin" at line 6073
original_entity = get_builtin_entity(callee) // sub_568F30
assert(original_entity != NULL)
// Check arg count -- if extra args and first arg is not pointer type
if (builtin_info.extra_arg_count && callee[8] != 1) {
// Reset and emit diagnostic 3768 (wrong arg type for atomic)
original_entity = NULL
if (validate_arg_types(...))
emit_error(3768, diag_ctx)
return original_entity
}
// Walk argument list to find the pointee type (type of *ptr)
if (args == NULL) {
// Use declared arg count from builtin info
arg_index = builtin_info.declared_arg_count
// ... validate, may emit error 3769 or 1645
} else {
// Navigate to the relevant argument node
// Extract the pointee type by unwinding cv-qualifiers
arg_type = get_init_component_type(args)
pointee = unwrap_cv_qualifiers(arg_type) // while type_kind == 12
}
// Determine the type suffix based on pointee type
if (is_integer_type(pointee)) {
if (is_signed(pointee))
suffix = "_s" // signed
else
suffix = "_u" // unsigned
} else if (is_float_type(pointee)) {
suffix = "_f" // floating-point
} else {
// Not a supported atomic type
if (validate_arg_types(...))
emit_error(1645 or 852, diag_ctx)
return original_entity
}
// Construct the NVIDIA intrinsic name
// Map __sync_fetch_and_OP → __nv_atomic_fetch_OP + suffix
base_name = map_sync_to_nv(original_entity.name)
// e.g., "__sync_fetch_and_add" → "__nv_atomic_fetch_add"
full_name = base_name + suffix
// e.g., "__nv_atomic_fetch_add_s" for signed int
// Look up or create the NVIDIA intrinsic entity
nv_entity = lookup_nv_intrinsic(full_name)
// Replace the callee with the NVIDIA intrinsic
*result_ptr = nv_entity
return original_entity
}
The function validates that the pointee type is one of the supported atomic types. If the user passes a pointer to an unsupported type (e.g., a struct), it falls through to emit diagnostic 1645 ("argument type not supported for atomic operation") or 852 (a more specific variant when the __sync function has explicit type constraints).
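The name construction step can be illustrated with a small sketch. It covers only the string mapping from the remapping table plus the suffix rule; the function name `remap_sync_builtin` and the `pointee_kind` / `is_signed` parameters are assumptions made for illustration, and returning `None` stands in for the diagnostic paths (1645 / 852) of the real function.

```python
from typing import Optional

SYNC_PREFIX = "__sync_fetch_and_"
NV_PREFIX = "__nv_atomic_fetch_"
# Operations listed in the remapping table.
SUPPORTED_OPS = {"add", "sub", "and", "xor", "or", "max", "min"}

def remap_sync_builtin(name: str, pointee_kind: str,
                       is_signed: bool = True) -> Optional[str]:
    """Map a __sync_fetch_and_OP name to __nv_atomic_fetch_OP plus type suffix.

    pointee_kind is a hypothetical classification: "int" or "float".
    Returns None where the real function would emit a diagnostic.
    """
    if not name.startswith(SYNC_PREFIX):
        return None
    op = name[len(SYNC_PREFIX):]
    if op not in SUPPORTED_OPS:
        return None
    if pointee_kind == "int":
        suffix = "_s" if is_signed else "_u"
    elif pointee_kind == "float":
        suffix = "_f"
    else:
        return None  # unsupported pointee type, e.g. a struct
    return NV_PREFIX + op + suffix
```

For a signed integer pointee, `remap_sync_builtin("__sync_fetch_and_add", "int")` yields `__nv_atomic_fetch_add_s`, matching the example in the pseudocode comments.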
Template Expression Rescanning
When a template is instantiated, expression trees from the template definition are re-evaluated with concrete template argument substitutions. This is handled by rescan_expr_with_substitution_internal (sub_5565E0, 1,558 lines), the third-largest function in the expression parser.
The function dispatches on expression kind (not token kind -- these are IL expression nodes, not source tokens) and recursively rescans each sub-expression with substitutions applied:
| Assert String | Purpose |
|---|---|
| "rescan_expr_with_substitution_internal" | Entry point |
| "operator_token_for_builtin_operator" | Maps operator codes to tokens for rescan |
| "operator_token_for_expr_rescan" | Alternate operator-to-token mapping |
| "invalid expr kind in expr rescan" | Unreachable default case |
| "rescan_braced_init_list" | Rescans {init-list} nodes |
| "make_operand_for_rescanned_identifier" | Rebuilds identifier operands after substitution |
| "symbol_for_template_param_unknown_entity_rescan" | Handles dependent names during rescan |
| "scan_rel_operator" | Rescans relational operators (for comparison rewriting) |
The key insight is that during template definition parsing, the parser builds a partially-evaluated expression tree where template-dependent parts are stored as opaque nodes. During instantiation, this function walks that tree, substitutes concrete types/values, and re-runs the semantic analysis that was deferred.
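A toy version of this walk, assuming a made-up tuple-based tree (`("lit", v)`, `("param", name)`, `("binop", op, lhs, rhs)`) rather than EDG's real IL node layout, shows the shape of the dispatch: switch on expression kind, recurse into operands, and redo the deferred work once the operands are concrete.

```python
def rescan(node, substitutions):
    """Rebuild an expression tree with template parameters substituted.

    Node shapes are hypothetical stand-ins for IL expression nodes.
    """
    kind = node[0]
    if kind == "lit":
        return node                               # nothing dependent here
    if kind == "param":
        # Opaque dependent node: replace with the concrete argument.
        return ("lit", substitutions[node[1]])
    if kind == "binop":
        _, op, lhs, rhs = node
        lhs = rescan(lhs, substitutions)
        rhs = rescan(rhs, substitutions)
        # Deferred semantic analysis, reduced here to constant folding of "+".
        if op == "+" and lhs[0] == "lit" and rhs[0] == "lit":
            return ("lit", lhs[1] + rhs[1])
        return ("binop", op, lhs, rhs)
    # Mirrors the unreachable default case named in the assert strings.
    raise ValueError("invalid expr kind in expr rescan")
```

The real function handles far more node kinds and re-runs full overload resolution and type checking; the sketch only conveys the recursive substitute-then-reanalyze structure.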
Supporting Infrastructure
Diagnostic Emission (30+ wrapper functions, 0x4F8000--0x4F8F80)
The expression parser uses a family of thin diagnostic wrapper functions at the beginning of the address range. Each wraps the core pattern: create_diag(code) -> add_arg(type/entity/string) -> emit(diag). The variants differ only in argument count and types:
| Function | Identity | Arguments |
|---|---|---|
| sub_4F8090 | emit_diag_with_type_and_entity | Type arg + entity arg |
| sub_4F8160 | emit_diag_1arg | Single argument |
| sub_4F8220 | emit_diag_with_2_type_args | Two type arguments |
| sub_4F8320 | emit_diag_with_entity_and_type | Entity first, type second |
| sub_4F8B20 | issue_incomplete_type_diag | Incomplete type diagnostic (assert confirmed) |
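The create/add_arg/emit pattern behind these wrappers can be sketched as follows. The `Diag` class, the `emitted` list, and the argument tags are hypothetical; only the two wrapper names and their argument order come from the table.

```python
class Diag:
    """Hypothetical diagnostic record: a code plus an ordered argument list."""
    def __init__(self, code):
        self.code = code
        self.args = []

    def add_arg(self, kind, value):
        self.args.append((kind, value))
        return self  # allow chaining: create -> add_arg -> add_arg -> emit

emitted = []  # stand-in for the real diagnostic sink

def emit(diag):
    emitted.append((diag.code, tuple(diag.args)))

def emit_diag_with_type_and_entity(code, type_, entity):
    # Type argument first, entity argument second.
    emit(Diag(code).add_arg("type", type_).add_arg("entity", entity))

def emit_diag_with_entity_and_type(code, entity, type_):
    # Same pattern, arguments attached in the opposite order.
    emit(Diag(code).add_arg("entity", entity).add_arg("type", type_))
```

Because every wrapper shares this three-step body, the family differs only in how many arguments are attached and in which order, which is consistent with the functions being small and densely packed at the start of the address range.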
Expression Stack (exprutil.c, 0x558720+)
The expression parser maintains a stack of expression contexts via qword_106B970. Each stack entry (the "current context") holds compilation mode flags, scope depth, CUDA execution space state, and template context bits. Key operations:
| Function | Identity | Purpose |
|---|---|---|
| sub_55D0D0 | save_expr_stack | Saves current expression stack state |
| sub_55D100 | init_expr_stack_entry | Creates new stack frame |
| sub_55DB50 | pop_expr_stack | Restores previous frame |
| sub_55E490 | set_operand_kind | Sets the operand classification |
| sub_55C180 | alloc_ref_entry | Allocates reference-entry for tracking |
| sub_55C830 | free_init_component | Frees initializer component node |
Comparison Rewriting (C++20, 0x501020--0x508DC0)
The C++20 three-way comparison operator (<=>) triggers rewriting of traditional comparison expressions. complete_comparison_rewrite (sub_505E80, 6.9KB) rewrites a < b into (a <=> b) < 0 when a spaceship operator exists. It uses a recursion counter at qword_106B510 limited to 100 to prevent infinite rewrite loops. Related functions:
| Function | Identity |
|---|---|
| sub_501020 | determine_defaulted_spaceship_return_type |
| sub_5015D0 | synthesize_defaulted_comparison_body |
| sub_501B00 | check_comparison_category_type |
| sub_505E10 | token_for_rel_op -- maps operator kinds to tokens (16->43, 17->44, 32->45, 33->46) |
| sub_505E80 | complete_comparison_rewrite -- core rewrite engine |
| sub_506430 | check_defaulted_eq_properties |
| sub_5068F0 | check_defaulted_secondary_comp |
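The rewrite and its recursion guard can be modelled with a small context manager. This is a sketch under assumptions: `RewriteDepthGuard` and `rewrite_relational` are invented names, the rewrite operates on strings rather than expression nodes, and whether the real counter trips at exactly 100 or 101 nested rewrites is not established by the analysis.

```python
class RewriteDepthGuard:
    """Counts nested comparison rewrites and refuses to go past LIMIT."""
    LIMIT = 100  # documented maximum; exact boundary behaviour is assumed

    def __init__(self):
        self.depth = 0

    def __enter__(self):
        if self.depth >= self.LIMIT:
            raise RecursionError(
                "diagnostic 2982: comparison rewrite recursion limit exceeded")
        self.depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self.depth -= 1
        return False  # never swallow exceptions

guard = RewriteDepthGuard()

def rewrite_relational(op, lhs, rhs):
    """Rewrite `lhs op rhs` as `(lhs <=> rhs) op 0` for a relational op."""
    if op not in ("<", ">", "<=", ">="):
        raise ValueError("not a relational operator: " + op)
    with guard:
        return "({} <=> {}) {} 0".format(lhs, rhs, op)
```

Decrementing on exit keeps the counter balanced even when an error unwinds through several nested rewrites, which is the property a global counter like qword_106B510 needs in order to stay usable after a failed rewrite.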
Range-Based For Loop Desugaring (0x50C510, 16.8KB)
fill_in_range_based_for_loop_constructs (sub_50C510) generates the desugared components of for (auto x : range):
// Source: for (auto x : range_expr) body
// Desugared: {
// auto && __range = range_expr;
// auto __begin = begin(__range);
// auto __end = end(__range);
// for (; __begin != __end; ++__begin) {
// auto x = *__begin;
// body
// }
// }
The function calls sub_6EF7A0 (overload resolution) to look up begin() and end() via ADL, and emits error 2285 when no suitable begin/end is found.
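As a textual illustration of the desugaring shown above, the following sketch emits the expanded form as source text. It is a string template only, with a hypothetical function name; the real function builds IL nodes and resolves begin/end through overload resolution rather than pasting text.

```python
def desugar_range_for(decl: str, range_expr: str, body: str) -> str:
    """Return the desugared text of `for (decl : range_expr) body`."""
    return "\n".join([
        "{",
        "    auto && __range = {};".format(range_expr),
        "    auto __begin = begin(__range);",
        "    auto __end = end(__range);",
        "    for (; __begin != __end; ++__begin) {",
        "        {} = *__begin;".format(decl),
        "        {}".format(body),
        "    }",
        "}",
    ])
```

Calling `desugar_range_for("auto x", "v", "use(x);")` produces a block whose loop variable declaration reads `auto x = *__begin;`, the same shape as the commented expansion.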
Key Global Variables
| Address | Name | Type | Description |
|---|---|---|---|
| word_126DD58 | current_token_code | WORD | Current token kind (0--356) |
| qword_126DD38 | current_source_position | QWORD | Encoded file/line/column |
| qword_106B970 | current_scope | QWORD | Expression context stack pointer |
| qword_106B968 | pending_expression | QWORD | Pending expression accumulator |
| dword_126EFC8 | debug_trace_flag | DWORD | Nonzero enables trace output |
| dword_126EFCC | debug_verbosity | DWORD | Trace verbosity level (>3 prints precedence) |
| dword_126EFB4 | language_dialect | DWORD | 1=C, 2=C++ |
| qword_126EF98 | standard_version | QWORD | Language standard version level |
| dword_126EFA8 | in_template_context | DWORD | Nonzero during template parsing |
| dword_126EFA4 | strict_mode | DWORD | Strict conformance mode flag |
| dword_126EFAC | extended_features | DWORD | Extended features enabled |
| xmmword_106C380 | identifier_lookup_result | 128-bit | SSE-packed identifier lookup (64 bytes total, 4 xmmwords) |
| qword_106B510 | comparison_rewrite_depth | QWORD | Recursion counter for C++20 comparison rewriting (max 100) |
| dword_106C2C0 | gpu_compilation_mode | DWORD | Nonzero during GPU compilation |
| qword_126C5E8 | compilation_unit_table | QWORD | Base of CU array (784-byte stride) |
| dword_126C5E4 | current_CU_index | DWORD | Index into compilation unit table |
| dword_126C5D8 | enclosing_function_CU_index | DWORD | CU index of enclosing function |
| dword_126C5C4 | template_instantiation_depth | DWORD | -1 = not in template instantiation |
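The table-indexing arithmetic implied by the 784-byte stride is simple enough to state as code. The helper names and the example base address are hypothetical; the stride and the +224 offset of the enclosing-function pointer are the values given in this page.

```python
CU_ENTRY_STRIDE = 784            # bytes per entry in the table at qword_126C5E8
ENCLOSING_FUNCTION_OFFSET = 224  # offset of the enclosing-function pointer

def cu_entry_address(table_base: int, index: int) -> int:
    """Address of entry `index`, given the runtime value of the table base."""
    return table_base + CU_ENTRY_STRIDE * index

def enclosing_function_slot(table_base: int, index: int) -> int:
    """Address of the enclosing-function pointer inside entry `index`."""
    return cu_entry_address(table_base, index) + ENCLOSING_FUNCTION_OFFSET
```

In a decompilation this shows up as `base + 784 * idx + 224`, so recognizing the constant 784 is a quick way to spot accesses to this table.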
Diagnostic Codes
The expression parser emits approximately 50 distinct diagnostic codes:
| Code | Meaning |
|---|---|
| 57 | Pointer-to-member on non-class type |
| 58 | Pointer-to-member on incomplete type |
| 60 | Pointer-to-member on wrong class type |
| 165 | Wrong argument count for builtin |
| 244 | Type access violation in member selection |
| 529 | Pointer-to-member in concept context |
| 852 | Unsupported type for atomic operation (typed variant) |
| 1022 | Inaccessible member in selection |
| 1032 | Invalid _Generic controlling expression |
| 1036 | Unsupported predefined function name |
| 1436 | __builtin_types_compatible_p not available |
| 1543 | __builtin_source_location not available |
| 1596 | Invalid literal operator call |
| 1645 | Argument type not supported for atomic operation |
| 1733 | new-expression in module context |
| 1763 | GNU statement expression not available |
| 1777 | Statement expression in constexpr context |
| 2285 | No begin/end for range-based for |
| 2669 | co_yield outside coroutine |
| 2747 | co_yield not in function scope |
| 2866 | Statement expression in constexpr context |
| 2896 | Statement expression in template instantiation |
| 2982 | Comparison rewrite recursion limit exceeded |
| 3462 | __device__ function called from __host__ context |
| 3463 | Variant of 3462 with __host__ annotation note |
| 3464 | __host__ function called from __device__ context |
| 3465 | Variant of 3464 with __device__ annotation note |
| 3508 | __global__ function called from wrong context |
| 3768 | Wrong argument type for atomic builtin (extra arg) |
| 3769 | Wrong argument type for atomic builtin (declared arg) |
Function Index
Complete listing of confirmed functions in the expression parser, grouped by subsystem:
Core Expression Scanning
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_511D40 | 80KB | scan_expr_full | DEFINITE |
| sub_526E30 | 48KB | scan_conditional_operator | DEFINITE |
| sub_545F00 | 16KB | scan_function_call | DEFINITE |
| sub_54AED0 | 15KB | scan_new_operator | DEFINITE |
| sub_5512B0 | 9KB | scan_identifier | DEFINITE |
| sub_544290 | 6KB | scan_cast_or_expr | DEFINITE |
| sub_5565E0 | 10KB | rescan_expr_with_substitution_internal | DEFINITE |
| sub_529720 | 12KB | scan_comma_operator | DEFINITE |
| sub_526040 | 15KB | scan_logical_operator | DEFINITE |
| sub_543A90 | 1.4KB | scan_rel_operator | DEFINITE |
| sub_540160 | 1.2KB | apply_one_fold_operator | DEFINITE |
| sub_543FA0 | 1KB | assemble_fold_expression_operand | DEFINITE |
Unary Operators
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_516080 | 7.6KB | scan_prefix_incr_decr | DEFINITE |
| sub_516720 | 13KB | scan_ampersand_operator | DEFINITE |
| sub_517270 | 4.4KB | scan_indirection_operator | DEFINITE |
| sub_517680 | 5.1KB | scan_arith_prefix_operator | DEFINITE |
| sub_517BD0 | 26KB | scan_sizeof_operator | DEFINITE |
| sub_519300 | 9.4KB | scan_alignof_operator | DEFINITE |
| sub_519CF0 | 6.1KB | scan_builtin_addressof | DEFINITE |
| sub_510D70 | 8.2KB | scan_postfix_incr_decr | DEFINITE |
Binary Operators
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_5238C0 | 5.4KB | scan_mult_operator | DEFINITE |
| sub_523EB0 | 10.6KB | scan_add_operator | DEFINITE |
| sub_524960 | 5.8KB | scan_shift_operator | DEFINITE |
| sub_524ED0 | 5.6KB | scan_eq_operator | DEFINITE |
| sub_525BC0 | 4.7KB | scan_bit_operator | DEFINITE |
| sub_525450 | 8.6KB | scan_gnu_min_max_operator | DEFINITE |
| sub_522650 | 19.8KB | scan_ptr_to_member_operator | DEFINITE |
Assignment
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_53FD70 | 1.1KB | scan_simple_assignment_operator | DEFINITE |
| sub_536E80 | 3.1KB | scan_compound_assignment_operator | DEFINITE |
| sub_508770 | 4.7KB | process_simple_assignment | DEFINITE |
Member Access
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_5303E0 | 15KB | scan_field_selection_operator | DEFINITE |
| sub_4FEB60 | 4.5KB | make_field_selection_operand | DEFINITE |
| sub_4FEF00 | 4.6KB | do_field_selection_operation | DEFINITE |
| sub_540560 | 3.1KB | scan_subscript_operator | DEFINITE |
Cast Operators
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_51EE00 | 8.3KB | scan_new_style_cast | DEFINITE |
| sub_51F670 | 13.5KB | scan_static_cast_operator | DEFINITE |
| sub_520280 | 8.8KB | scan_const_cast_operator | DEFINITE |
| sub_5209A0 | 4.9KB | scan_reinterpret_cast_operator | DEFINITE |
| sub_53C690 | 3.6KB | scan_named_cast_operator | HIGH |
Type Traits
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_51A690 | 12KB | scan_unary_type_trait_helper | DEFINITE |
| sub_51B650 | 7.2KB | scan_binary_type_trait_helper | DEFINITE |
| sub_535080 | 0.2KB | dispatch_call_like_builtin | MEDIUM |
| sub_534B60 | 1.8KB | scan_call_like_builtin_operation | DEFINITE |
| sub_549700 | 2.2KB | compute_is_invocable | DEFINITE |
| sub_550E50 | 1.3KB | compute_is_constructible | DEFINITE |
| sub_510410 | 2.1KB | compute_is_convertible | DEFINITE |
| sub_510860 | 2.3KB | compute_is_assignable | DEFINITE |
CUDA-Specific
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_505720 | 4KB | check_cross_execution_space_call | DEFINITE |
| sub_505B40 | 2.7KB | check_cross_space_call_in_template | DEFINITE |
| sub_537BF0 | 7KB | adjust_sync_atomic_builtin | DEFINITE |
| sub_520EE0 | 2.7KB | scan_intaddr_operator | DEFINITE |
Initializers and Braced-Init-Lists
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_5360D0 | 4.7KB | parse_braced_init_list_full | DEFINITE |
| sub_5392B0 | 0.2KB | complete_braced_init_list_parsing | DEFINITE |
| sub_539340 | 1KB | scan_braced_init_list_cast | DEFINITE |
| sub_539670 | 0.4KB | get_braced_init_list | DEFINITE |
| sub_541000 | 2KB | scan_member_constant_initializer_expression | DEFINITE |
| sub_541DC0 | 5.5KB | prescan_initializer_for_auto_type_deduction | DEFINITE |
Coroutines
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_50B630 | 10KB | add_await_to_operand | DEFINITE |
| sub_50C070 | 1.8KB | check_coroutine_context | HIGH |
| sub_50E080 | 4.5KB | make_coroutine_result_expression | DEFINITE |
C++20 Concepts and Requires
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_52CFF0 | 13.5KB | scan_requires_expression | DEFINITE |
| sub_542D90 | 3.8KB | scan_requires_expr | DEFINITE |
| sub_52EB60 | 8.6KB | scan_requires_clause | DEFINITE |
Declaration Parser
C++ declaration parsing is the most ambiguity-ridden phase of front-end compilation. A statement like T(x); is simultaneously a valid function-style cast (expression) and a variable declaration with redundant parentheses. EDG 6.6 in cudafe++ resolves this by splitting the work into two stages: a prescanning/disambiguation phase (disambig.c) that probes ahead in the token stream to classify ambiguous constructs, followed by committed parsing across four tightly-coupled source files -- decl_spec.c (declaration specifiers), declarator.c (declarator syntax), decls.c (symbol table insertion and semantic validation), and decl_inits.c (initializer processing). CUDA adds a fifth axis of complexity: every declaration may carry execution space attributes (__device__, __host__, __global__) and memory space qualifiers (__shared__, __constant__, __managed__), which are parsed as attribute category 4 and must be separated from standard C++ attributes before semantic analysis.
The core pipeline processes approximately 22,000 lines of decompiled logic across six major functions, each exceeding 1,000 lines. The design is a classic recursive-descent parser with significant state carried in stack-allocated structures (128-byte decl_spec accumulators packed as __m128i arrays) and global scope chain state (784-byte entries in the scope table at qword_126C5E8).
Key Facts
| Property | Value |
|---|---|
| Source files | decl_spec.c, declarator.c, decls.c, decl_inits.c, disambig.c |
| Address range | 0x4A0000--0x4F8000 (~360 KB of code, ~530 functions) |
| Central dispatcher | sub_4ACF80 (decl_specifiers, 4,761 lines) |
| Declarator entry | sub_4B7BC0 (declarator, 284 lines) |
| Function declarator | sub_4B8190 (function_declarator, 3,144 lines) |
| Recursive declarator | sub_4BC950 (r_declarator, 2,578 lines) |
| Function declaration | sub_4CE420 (decl_routine, 2,858 lines) |
| Variable declaration | sub_4CA6C0 (decl_variable, 1,090 lines) |
| Top-level variable entry | sub_4DEC90 (variable_declaration, 1,098 lines) |
| Disambiguation | sub_4EA560 (prescan_declaration, ~400 lines) |
| Scope entry size | 784 bytes (at qword_126C5E8) |
| Decl specifier accumulator | 128 bytes (8 x __m128i, stack-allocated) |
| CUDA mode flag | dword_126EFA8 (bool), dialect in dword_126EFB4 (2 = C++) |
| Current token global | word_126DD58 |
| Token advance | sub_676860 (get_next_token) |
Architecture
The declaration parsing pipeline operates as a five-stage waterfall. Each stage narrows the interpretation of the token stream until a fully-resolved declaration is inserted into the symbol table:
Token Stream (from lexer)
│
▼
STAGE 1: Disambiguation (disambig.c)
│ prescan_declaration ─── lookahead to classify ambiguous constructs
│ prescan_gnu_attribute ── skip __attribute__((...)) blocks
│ find_for_loop_separator ── distinguish for-init from expression
│
▼
STAGE 2: Declaration Specifiers (decl_spec.c)
│ decl_specifiers ─── 4,761-line switch dispatching on token kind
│ ├── storage class: auto, register, static, extern, typedef
│ ├── type specifiers: int, char, void, class/struct/enum, typename
│ ├── cv-qualifiers: const, volatile, restrict
│ ├── function specifiers: inline, virtual, explicit, constexpr, consteval
│ ├── CUDA attributes: __device__, __host__, __global__ (category 4)
│ └── class_specifier / enum_specifier (recursive for definitions)
│
▼
STAGE 3: Declarator (declarator.c)
│ declarator ─── coordinates pointer/array/function declarators
│ ├── pointer_declarator ── *, &, &&, ::*
│ ├── r_declarator ── recursive descent on declarator-id
│ ├── array_declarator ── [expression], []
│ ├── function_declarator ── (params) cv-qualifiers -> trailing-return noexcept
│ └── scan_declarator_attributes ── separates CUDA attrs from standard
│
▼
STAGE 4: Declaration Processing (decls.c)
│ decl_routine ─── function/method declarations (2,858 lines)
│ decl_variable ── variable declarations with CUDA memory space
│ variable_declaration ── top-level entry with CUDA error emission
│ find_linked_symbol ── redeclaration detection
│ id_linkage ── linkage determination (internal/external/none)
│
▼
STAGE 5: Initializer Processing (decl_inits.c)
ctor_inits_for_inheriting_ctor ── inheriting constructors
dtor_initializer ── destructor init lists
check_for_missing_initializer_full ── missing initializer diagnostics
Stage 1: Disambiguation (disambig.c)
The Problem
C++ has a famous syntactic ambiguity: many token sequences can be parsed as either declarations or expressions. The canonical example:
T(x); // declaration of variable x of type T? or function-style cast of x to T?
T(x)(y); // declaration of function x returning T? or call to T(x) with argument y?
T * x; // declaration of pointer-to-T named x? or multiplication of T and x?
The C++ standard resolves these with the "if it can be a declaration, it is a declaration" rule. EDG implements this by prescanning: before committing to a parse, the parser saves the lexer state, probes ahead through the token stream to determine whether the construct is a declaration, then restores the lexer state and dispatches to the appropriate parser.
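The save/probe/restore discipline described above can be sketched with a toy token cursor. Everything here is illustrative: `TokenCursor`, `looks_like_declaration`, and the "first token names a known type" heuristic are assumptions that stand in for the much richer classification EDG performs.

```python
class TokenCursor:
    """Minimal token stream with checkpointing (hypothetical helper)."""
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def save(self):
        return self.pos            # lexer state reduced to a position

    def restore(self, state):
        self.pos = state

    def advance(self):
        tok = self.tokens[self.pos] if self.pos < len(self.tokens) else None
        self.pos += 1
        return tok

def looks_like_declaration(cursor, type_names):
    """Probe ahead, classify, and always rewind before returning."""
    saved = cursor.save()
    try:
        first = cursor.advance()
        # Toy rule: if the construct starts with a known type name, apply
        # "if it can be a declaration, it is a declaration".
        return first in type_names
    finally:
        cursor.restore(saved)      # the probe must leave no trace
```

The `finally` clause captures the key invariant: whichever way the classification goes, the committed parser starts from exactly the token where the prescan began.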
prescan_declaration (sub_4EA560)
This is the top-level disambiguation entry point, called when the parser encounters an ambiguous construct at statement or declaration level. It operates in a non-destructive lookahead mode: it consumes tokens tentatively, classifies the construct, then rewinds.
prescan_declaration(flags):
save_lexer_state()
# Compute CUDA-aware skip mode
if (flags & 0x800) == 0: # not in template context
skip_mode = 16385 # 0x4001: standard prescan
else:
skip_mode = 67125249 # 0x4004001: template-aware prescan
# In CUDA C++ mode, use cuda_skip_token for identifier classification
if dword_126EFB4 == 2: # CUDA C++ dialect
while not at_end_of_tentative_scan():
token = current_token()
if is_cuda_keyword(token):
cuda_skip_token(skip_mode) # sub_6810F0
else:
advance_token() # sub_676860
classify_declaration_vs_expression()
restore_lexer_state()
return classification # DECLARATION or EXPRESSION
The skip_mode is a bitmask encoding which token classes to recognize during prescanning. The wider mask (67125249 = 0x4004001, i.e. the standard 0x4001 with bit 26 added) is selected when flag 0x800 is set. In CUDA mode the mask is passed to cuda_skip_token, which recognizes CUDA execution-space keywords so that __device__ int x; is correctly classified as a declaration even though __device__ is not a standard C++ keyword.
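The two decimal constants from the decompilation can be checked directly. The constant and function names below are hypothetical labels; the values and the 0x800 test are taken from the pseudocode.

```python
STANDARD_PRESCAN = 16385      # decimal constant from the decompilation
TEMPLATE_PRESCAN = 67125249   # decimal constant from the decompilation

def select_skip_mode(flags: int) -> int:
    """Pick the prescan mask based on flag bit 0x800."""
    # Parenthesize the mask test: `flags & 0x800 == 0` would bind as
    # `flags & (0x800 == 0)` in both C and Python.
    return STANDARD_PRESCAN if (flags & 0x800) == 0 else TEMPLATE_PRESCAN

# Bits present in the wide mask but not in the standard one.
EXTRA_BITS = TEMPLATE_PRESCAN & ~STANDARD_PRESCAN
```

Converting to hex gives 0x4001 for the standard mask and 0x4004001 for the wide one, so the two differ by a single bit (bit 26).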
prescan_gnu_attribute (sub_4E9E70)
Attributes complicate disambiguation because __attribute__((foo)) can appear almost anywhere in a declaration. This function skips over balanced GNU attribute sequences during prescanning:
prescan_gnu_attribute():
assert current_token == 142 # GNU __attribute__ token
while current_token == 142:
advance_token() # consume __attribute__
match_balanced_parens() # skip ((...))
# CUDA extension: check if identifier is CUDA keyword
if dword_126EFB4 == 2: # CUDA C++ mode
if BYTE1(xmmword_106C390) & 2: # CUDA extension flag
cuda_skip_token(...)
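The balanced-parenthesis skip can be written out as a runnable sketch over a list of token strings. The string token representation and the function name are assumptions; the CUDA keyword handling is omitted.

```python
ATTRIBUTE = "__attribute__"

def skip_gnu_attributes(tokens, pos=0):
    """Return the index of the first token after any leading
    __attribute__((...)) sequences starting at `pos`."""
    while pos < len(tokens) and tokens[pos] == ATTRIBUTE:
        pos += 1                                  # consume __attribute__
        if pos >= len(tokens) or tokens[pos] != "(":
            break                                 # malformed: nothing to balance
        depth = 0
        while pos < len(tokens):
            tok = tokens[pos]
            pos += 1
            if tok == "(":
                depth += 1
            elif tok == ")":
                depth -= 1
                if depth == 0:
                    break                         # matched the opening "(("
    return pos
```

Tracking depth rather than counting a fixed pair of parentheses matters because attribute arguments such as `aligned(8)` nest further parentheses inside the outer `((...))`.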
find_for_loop_separator (sub_4EC690)
A special-purpose disambiguator for for loops. In for(init; cond; incr), the parser must find the semicolons that separate the three clauses. This is non-trivial because the init clause can contain declarations with complex types, nested parentheses, and template angle brackets.
find_for_loop_separator():
create_disambiguation_checkpoint() # sub_67B4F0
paren_depth = 0
while true:
token = current_token()
if token == '(':
paren_depth++
elif token == ')':
if paren_depth == 0:
break
paren_depth--
elif token == ';' and paren_depth == 0:
restore_checkpoint()
return SEMICOLON_FOUND # 0x4B = 75
elif token == EOF:
restore_checkpoint()
return EOF # 9
advance_token() # no separator yet: consume the token and keep scanning
restore_checkpoint()
return NOT_FOUND # 0
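A runnable rendering of this scan over a token list is shown below. The return values reuse the codes from the pseudocode (75, 9, 0); the string tokens, the constant names, and the index-based scan (which makes the checkpoint restore implicit) are simplifications, and template angle brackets are not modelled.

```python
SEMICOLON_FOUND = 75   # 0x4B, as in the pseudocode
EOF_REACHED = 9
NOT_FOUND = 0

def find_for_loop_separator(tokens, pos=0):
    """Scan tokens following `for (` and report what ends the first clause."""
    depth = 0
    i = pos                      # scanning by index leaves `pos` untouched
    while True:
        if i >= len(tokens):
            return EOF_REACHED
        tok = tokens[i]
        if tok == "(":
            depth += 1
        elif tok == ")":
            if depth == 0:
                return NOT_FOUND  # closing paren of the for header itself
            depth -= 1
        elif tok == ";" and depth == 0:
            return SEMICOLON_FOUND
        i += 1                    # consume and continue
```

A header such as `for (int i = f(0); ...)` yields SEMICOLON_FOUND, while a range-based `for (x : v)` reaches the closing parenthesis first and yields NOT_FOUND.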
Stage 2: Declaration Specifiers (decl_spec.c)
decl_specifiers (sub_4ACF80) -- The Central Dispatcher
This is the most complex function in the declaration parser: 4,761 decompiled lines, a while(2) loop containing a giant switch on token kinds, processing every specifier in a C++ declaration. It handles storage classes, type specifiers, cv-qualifiers, function specifiers, and CUDA attributes, accumulating results into a 128-byte stack structure.
Input Parameter: Context Flags
The a1 parameter encodes the parsing context as a bitmask:
| Bit | Mask | Context |
|---|---|---|
| 2 | 0x4 | Inside class member declaration |
| 3 | 0x8 | Inside function parameter list |
| 4 | 0x10 | At block scope |
| 6 | 0x40 | Inside template parameter list |
| 14 | 0x4000 | Friend declaration |
| 15 | 0x8000 | At class scope |
| 18 | 0x40000 | In-declaration (re-entrant) |
| 20 | 0x100000 | Constexpr lambda context |
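When reading decompiled call sites it helps to decode the a1 argument mechanically. The sketch below encodes the table as an `IntFlag`; the member names are invented labels for the documented contexts, and bits not listed in the table are ignored.

```python
from enum import IntFlag

class DeclContext(IntFlag):
    """Context bits of decl_specifiers' first parameter (names are labels)."""
    CLASS_MEMBER = 0x4
    FUNCTION_PARAM = 0x8
    BLOCK_SCOPE = 0x10
    TEMPLATE_PARAM = 0x40
    FRIEND = 0x4000
    CLASS_SCOPE = 0x8000
    IN_DECLARATION = 0x40000
    CONSTEXPR_LAMBDA = 0x100000

def describe(flags: int):
    """List the documented context names set in `flags`, in table order."""
    return [member.name for member in DeclContext if flags & member]
```

For example, a call site passing 0x8004 decodes to a class member declaration at class scope.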
The Accumulator Structure
Results are accumulated into a stack-allocated structure (parameter a2) laid out as:
| Offset | Size | Field | Description |
|---|---|---|---|
| +8 | 4 | specifier_flags | Bitmask of specifiers seen |
| +32 | 8 | source_position | Position of first specifier |
| +120 | 4 | flags | Parsing state flags |
| +132 | 4 | context | Context discriminator |
| +200 | 8 | attribute_list | Linked list of parsed attributes |
| +208 | 8 | attribute_list_alt | Secondary attribute list (CUDA exec space) |
| +228 | 4 | modifiers | Accumulated modifier bits |
| +272 | 8 | type_ptr | Resolved type pointer |
Pseudocode
decl_specifiers(context_flags, accumulator, type_chain, ...):
debug_trace(3, "decl_specifiers")
spec_bits = 0 # accumulated specifier combination flags
error_flag = 0
while true: # while(2) in decompilation
token = word_126DD58 # current token
switch token:
# ── Storage class specifiers ──
case TOKEN_AUTO: # 77
case TOKEN_REGISTER: # 119
case TOKEN_STATIC: # 99
case TOKEN_EXTERN: # 88
case TOKEN_TYPEDEF: # 103
process_storage_class_specifier(
auto_flag, ..., context_flags, accumulator,
prev_scope, &spec_bits, &result, &type_out, &error_flag
)
continue
# ── Type specifiers (keywords) ──
case TOKEN_VOID .. TOKEN_DOUBLE: # 81-119 range
case TOKEN_SIGNED:
case TOKEN_UNSIGNED:
case TOKEN_CHAR:
case TOKEN_INT:
case TOKEN_FLOAT:
case TOKEN_DOUBLE:
# Validate combination with existing specifiers
if spec_bits & CONFLICTING_TYPE_MASK:
emit_error(84) # conflicting type specifiers
spec_bits |= type_specifier_bit(token)
advance_token()
continue
# ── cv-qualifiers ──
case TOKEN_CONST: # 263
case TOKEN_VOLATILE: # 264
case TOKEN_RESTRICT: # 265, 266
accumulator.modifiers |= cv_bit(token)
advance_token()
continue
# ── Function specifiers ──
case TOKEN_INLINE:
spec_bits |= INLINE_BIT
advance_token()
continue
case TOKEN_VIRTUAL:
spec_bits |= VIRTUAL_BIT
advance_token()
continue
case TOKEN_EXPLICIT:
spec_bits |= EXPLICIT_BIT
advance_token()
continue
# ── C++11/17/20 specifiers ──
case TOKEN_CONSTEXPR:
spec_bits |= CONSTEXPR_BIT
if context_flags & 0x100000: # constexpr lambda
emit_error(1570)
advance_token()
continue
case TOKEN_CONSTEVAL:
spec_bits |= CONSTEVAL_BIT
advance_token()
continue
case TOKEN_CONSTINIT:
spec_bits |= CONSTINIT_BIT
advance_token()
continue
case TOKEN_THREAD_LOCAL:
spec_bits |= THREAD_LOCAL_BIT
advance_token()
continue
# ── Class/struct/union/enum definitions ──
case TOKEN_CLASS: # 151
case TOKEN_STRUCT:
case TOKEN_UNION:
class_specifier(scope, context_flags, ..., &result, &error_flag)
continue
case TOKEN_ENUM:
enum_specifier(scope, context_flags, ..., &result, &error_flag)
continue
# ── typename specifier ──
case TOKEN_TYPENAME: # 183
typename_specifier(&type_out, accumulator, context_flags, ...)
continue
# ── Identifier (type name or constructor) ──
case TOKEN_IDENTIFIER: # 1
# This is the declaration/expression ambiguity hotspot
if try_interpret_as_type_name(accumulator): # sub_4C4F80
continue
if is_constructor_decl(enclosing_class): # sub_4AC970
continue
# Not a type name — fall through to end of specifiers
break
# ── GNU __attribute__ / __declspec ──
case TOKEN_ATTRIBUTE: # 142
parse_attribute_list(accumulator)
# CUDA: execution space attributes separated here
continue
# ── typeof / decltype ──
case TOKEN_TYPEOF: # 189
case TOKEN_DECLTYPE: # 185
parse_typeof_or_decltype(accumulator)
continue
# ── End of specifiers ──
case TOKEN_SEMICOLON: # 55
default:
break # exit while loop
# Post-processing: validate specifier combinations
if spec_bits == 0 and no_type_found:
emit_error(79) # missing type specifier
# CUDA: check execution space context
if dword_126EFB4 == 2: # CUDA C++ mode
validate_cuda_execution_space(accumulator, context_flags)
if invalid_cuda_context:
emit_error(3537) # execution space attribute in wrong context
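The accumulate-until-non-specifier loop can be demonstrated with a heavily reduced model. Only a handful of keywords are handled, tokens are plain strings, and the function name and return shape are assumptions; the two diagnostics (84 for conflicting type specifiers, 79 for a missing type specifier) are the ones named in the pseudocode.

```python
# Base type keywords chosen so that any two of them conflict; modifiers such
# as long/short/signed/unsigned are deliberately left out of this toy model.
TYPE_KEYWORDS = {"void", "char", "int", "float", "double"}
OTHER_SPECIFIERS = {"const", "volatile", "inline", "virtual", "explicit"}

def scan_specifiers(tokens):
    """Consume leading specifiers; return (next_pos, types, others, errors)."""
    types, others, errors = [], set(), []
    pos = 0
    while pos < len(tokens):
        tok = tokens[pos]
        if tok in TYPE_KEYWORDS:
            if types:
                errors.append(84)      # conflicting type specifiers
            types.append(tok)
        elif tok in OTHER_SPECIFIERS:
            others.add(tok)
        else:
            break                      # first non-specifier ends the loop
        pos += 1
    if not types:
        errors.append(79)              # missing type specifier
    return pos, types, others, errors
```

The returned position is where the declarator stage would take over, which is the hand-off between Stage 2 and Stage 3 in the pipeline diagram.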
Token Classification Map
The switch in decl_specifiers handles the following token kinds:
| Token Code | Keyword | Category |
|---|---|---|
| 1 | identifier | Type name or constructor check |
| 77 | auto | Storage class (C++03) / placeholder type (C++11) |
| 88 | extern | Storage class |
| 99 | static | Storage class |
| 103 | typedef | Storage class |
| 119 | register | Storage class |
| 80--108 | C type keywords | Type specifiers |
| 142 | __attribute__ | GNU attribute |
| 151 | class | Class specifier |
| 183 | typename | Typename specifier |
| 185 | decltype | Decltype specifier |
| 189 | typeof | GNU typeof |
| 263--266 | cv-qualifiers | const, volatile, restrict, __restrict |
process_storage_class_specifier (sub_4A31A0)
Validates and records a storage class specifier. C++ allows at most one storage class per declaration (with some exceptions for thread_local).
process_storage_class_specifier(auto_flag, ..., context_flags, decl_info,
prev_scope, spec_bits, result, type_out, error_flag):
# Flag bits in context_flags:
# 1=function, 4=class, 8=extern, 0x10=static, 0x200=register,
# 0x4000=friend, 0x8000=at class scope, 0x100000=constexpr lambda
if *spec_bits & STORAGE_CLASS_MASK:
emit_error(80) # duplicate storage class
return
if conflicting_with_previous_specifier:
emit_error(81) # conflicting storage class
return
switch current_storage_class:
case EXTERN:
if at_block_scope and not_cpp_mode:
emit_error(85)
if at_file_scope and not_standard_mode:
emit_error(149)
decl_info.linkage_byte = 3 # external linkage
case STATIC:
if in_class_definition and cpp_mode:
emit_error(328)
case REGISTER:
emit_error(481) # deprecated
case AUTO:
if dword_126EF4C: # auto parameter support enabled
# C++20: auto in parameter list = abbreviated template
create_placeholder_type() # sub_5BBA60
else:
emit_error(1598) # auto type in invalid context
*spec_bits |= storage_class_bit
class_specifier (sub_4A57C0, 2,179 lines)
Parses class/struct/union specifiers including the full class body. This function manages scope entry/exit, base class lists, member declarations, access specifiers, and CUDA execution space propagation.
Key operations:
- Calls scan_tag_name (sub_4A38A0, 1,216 lines) to parse the class name, handling qualified names and template parameters
- Calls check_for_class_modifiers (sub_4A3610) to detect final/__final
- Manages the scope stack: pushes a class scope (kind 6 or 7) at qword_126C5E8 + 784 * scope_index
- Sets CUDA execution space flags at scope entry offset +182 (bit 0x20) for device-side class definitions
- Issues error 2407 for enum definitions in prohibited CUDA execution contexts
enum_specifier (sub_4AA2F0, 1,437 lines)
Parses enum, enum class, and enum struct specifiers, including:
- Underlying type (enum E : int)
- Opaque enum declarations (enum class E : int;)
- Scoped vs. unscoped enum semantics
- Calls scan_enumerator_list (sub_4A89F0, 950 lines) for the enumerator body
Specifier Validation Functions
After decl_specifiers accumulates all specifiers, several validation functions check that the combination is legal:
| Function | Address | Lines | Purpose |
|---|---|---|---|
| check_use_of_constexpr | sub_4A22B0 | 153 | Validates constexpr on functions and variables |
| check_use_of_consteval | sub_4A1BF0 | 104 | Validates consteval on functions only |
| check_use_of_constinit | sub_4A1EC0 | 77 | Validates constinit on variables with static storage |
| check_use_of_thread_local | sub_4A2000 | 111 | Validates thread_local placement |
| check_explicit_specifier | sub_4A1DF0 | 45 | Validates explicit on constructors/conversions |
| check_gnu_c_auto_type | sub_4A2580 | 52 | Validates GNU __auto_type |
Each follows the same pattern: examine the accumulated specifier bits and the entity kind at offset +80 of the declaration node, and emit a targeted error if the combination is illegal. For example, check_use_of_consteval:
check_use_of_consteval(decl_info):
entity = decl_info[0]
kind = entity[+80] # symbol kind
if kind != FUNCTION (10) and kind != MEMBER_FUNCTION (11):
emit_error(2926) # consteval on non-function
entity[+177] &= 0xF9 # mask 0xF9 clears bits 1-2 (constexpr and consteval)
return
func_kind = entity[+166]
if func_kind == DESTRUCTOR (2):
emit_error(2927) # consteval on destructor
entity[+177] &= 0xF9
return
if func_kind == CONSTRUCTOR (1):
if type_has_virtual_base(entity[+88]):
emit_error(2928) # consteval on ctor with virtual base
entity[+177] &= 0xF9
return
if func_kind == CONVERSION (5):
if certain_conversion_conditions:
emit_error(2959) # consteval on certain conversions
entity[+177] &= 0xF9
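The validation pattern above can be sketched in Python. This is a minimal model, not the decompiled logic: the function signature, the tuple return value, and the `has_virtual_base` parameter are assumptions made for illustration, and the conversion-function case is omitted because its conditions are not spelled out in the pseudocode. The kind codes, error numbers, and the 0xF9 mask are taken from the text.

```python
# Hypothetical model of check_use_of_consteval; only the constants come from the text.
FUNCTION, MEMBER_FUNCTION = 10, 11     # entity kind at +80
CONSTRUCTOR, DESTRUCTOR = 1, 2         # function kind at +166
CLEAR_MASK = 0xF9                      # clears bits 1 and 2 of the byte at +177


def check_use_of_consteval(kind, func_kind, has_virtual_base, flags):
    """Return (error_number_or_None, updated_flags)."""
    if kind not in (FUNCTION, MEMBER_FUNCTION):
        return 2926, flags & CLEAR_MASK      # consteval on non-function
    if func_kind == DESTRUCTOR:
        return 2927, flags & CLEAR_MASK      # consteval on destructor
    if func_kind == CONSTRUCTOR and has_virtual_base:
        return 2928, flags & CLEAR_MASK      # ctor of class with virtual base
    return None, flags                       # combination accepted


if __name__ == "__main__":
    # A variable (kind 7) carrying the consteval bit is rejected and the bit is cleared.
    print(check_use_of_consteval(7, 0, False, 0x04))
```

Returning the updated flag byte instead of mutating an entity keeps the sketch free of any assumed entity structure.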
Stage 3: Declarator Parsing (declarator.c)
Architecture
Declarator parsing uses inside-out construction: the C++ declarator syntax places the declared name in the center, with type constructors radiating outward (pointers to the left, arrays and function parameters to the right). The parser builds a derived-type chain that is later unwound against the base type from decl_specifiers to produce the final type.
Declarator syntax (C++ grammar):
declarator := pointer-declarator
pointer-declarator := {*, &, &&, C::*} cv-qualifiers* direct-declarator
direct-declarator := declarator-id | ( declarator ) | direct-declarator ( params ) | direct-declarator [ expr ]
declarator-id := qualified-name | unqualified-name
The parser coordinates five specialized sub-parsers:
| Function | Address | Lines | Role |
|---|---|---|---|
| declarator | sub_4B7BC0 | 284 | Top-level entry: dispatches to pointer/r_declarator |
| r_declarator | sub_4BC950 | 2,578 | Recursive descent on direct-declarator |
| pointer_declarator | sub_4B72A0 | 440 | *, &, &&, ::* with cv-qualifiers |
| array_declarator | sub_4B6760 | 518 | [expr] and [] |
| function_declarator | sub_4B8190 | 3,144 | (params) cv-quals ref-qual noexcept -> ret |
scan_declarator_attributes (sub_4B3970) -- CUDA Attribute Separation
This is the critical function that separates CUDA execution space attributes from standard C++ attributes on declarators. In standard C++, attributes apply to the entity being declared. CUDA adds a parallel attribute dimension -- execution space -- that must be routed to a separate storage location.
The function iterates through the attribute list and sorts each attribute by its category byte at offset +9:
scan_declarator_attributes(decl_info, attr_accumulator):
attr_list = decl_info[+200] # primary attribute list
for each attr in attr_list:
category = attr[+9] # attribute category byte
kind = attr[+8] # attribute kind
placement = attr[+10] # where in declaration it appeared
switch category:
case 1: # TYPE attribute (alignas, etc.)
# Keep on primary list, set placement
attr[+10] = 10 # after type specifier
case 2: # DECLARATION attribute ([[nodiscard]], etc.)
if attr[+11] & 0x10:
# CUDA/vendor declaration attribute
route_to_vendor_list(attr)
else:
# Standard declaration attribute
attr[+10] = 12 # before declarator
case 3: # STATEMENT attribute ([[fallthrough]], etc.)
if decl_info[+131] & 8: # class-key context
handle_class_key_stmt_attr(attr)
case 4: # CUDA EXECUTION SPACE attribute
# __device__, __host__, __global__
# Move to SECONDARY attribute list
move_to_list(attr, decl_info[+184])
# Error if misplaced
if wrong_position:
emit_error(1847) # attribute in wrong position
# Mark all processed attributes
for each attr in processed:
attr[+11] |= 1 # set "consumed" flag
The separation into primary (offset +200) and secondary (offset +184) attribute lists is essential: downstream code (decl_routine, decl_variable) reads execution space from the secondary list and standard attributes from the primary list. This prevents CUDA execution space from interfering with standard attribute processing like [[nodiscard]] or [[deprecated]].
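A minimal Python sketch of the primary/secondary split follows. The `Attr` dataclass and the `separate_attributes` name are illustrative assumptions; the category values, placement codes, and flag bits come from the pseudocode above. Vendor routing for declaration attributes, the statement-attribute case, and the error path are left out.

```python
from dataclasses import dataclass

TYPE, DECLARATION, STATEMENT, EXEC_SPACE = 1, 2, 3, 4   # category byte at +9


@dataclass
class Attr:                 # hypothetical stand-in for an attribute node
    name: str
    category: int
    placement: int = 0      # models byte +10
    flags: int = 0          # models byte +11


def separate_attributes(attrs):
    """Split attributes into (primary, secondary) lists by category."""
    primary, secondary = [], []
    for attr in attrs:
        if attr.category == EXEC_SPACE:
            secondary.append(attr)          # __device__, __host__, __global__
        else:
            if attr.category == TYPE:
                attr.placement = 10         # after type specifier
            elif attr.category == DECLARATION and not (attr.flags & 0x10):
                attr.placement = 12         # before declarator
            primary.append(attr)
        attr.flags |= 1                     # mark as consumed
    return primary, secondary


if __name__ == "__main__":
    attrs = [Attr("nodiscard", DECLARATION), Attr("__device__", EXEC_SPACE),
             Attr("alignas", TYPE)]
    primary, secondary = separate_attributes(attrs)
    print([a.name for a in primary], [a.name for a in secondary])
```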
function_declarator (sub_4B8190, 3,144 lines)
The largest function in declarator.c, and second only to decl_specifiers (4,761 lines) across the declaration parser. It handles the complete C++ function declarator grammar including C++11 trailing return types, C++11/17 noexcept specifications, C++23 deducing this, and the C++ function qualifier trailer (const, volatile, &, &&).
function_declarator(decl_info, context_flags):
debug_trace(3, "function_declarator")
# Parse parameter list
expect_token('(')
param_list = parse_parameter_list()
expect_token(')')
# C++ member function qualifiers
cv_quals = 0
while is_cv_qualifier(current_token):
cv_quals |= cv_bit(current_token)
advance_token()
# Ref-qualifier (& or &&)
ref_qual = NONE
if current_token == '&':
ref_qual = LVALUE_REF
advance_token()
elif current_token == '&&':
ref_qual = RVALUE_REF
advance_token()
# Exception specification
except_spec = NONE
if current_token == TOKEN_THROW:
except_spec = parse_throw_spec()
elif current_token == TOKEN_NOEXCEPT:
except_spec = parse_noexcept_spec()
# C++11 trailing return type
trailing_return = NULL
if current_token == TOKEN_ARROW: # ->
advance_token()
trailing_return = parse_type()
# C++20 trailing requires clause
requires_clause = NULL
if current_token == TOKEN_REQUIRES:
requires_clause = scan_trailing_requires_clause()
# C++23 deducing this
if has_explicit_this_parameter(param_list):
mark_deducing_this()
# Build function type node
func_type = add_to_derived_type_list(
FUNCTION_TYPE,
param_list, cv_quals, ref_qual,
except_spec, trailing_return, requires_clause
)
return func_type
Derived Type Construction
add_to_derived_type_list (sub_4B4CF0, 600 lines) is the type-chain builder. Each declarator modifier (pointer, reference, array, function) appends a new node to a linked list. After parsing completes, form_declared_type (sub_4B4870) walks this chain bottom-up, applying each modifier to the base type to produce the final declared type.
For a declaration like const int *(*fp)(double):
Base type: const int
Derived chain: [pointer] → [function(double)] → [pointer]
Unwound: pointer to function(double) returning pointer to const int
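The unwinding step can be illustrated with a small Python model. It assumes the chain is recorded in inside-out order (the modifier nearest the declared name first). The actual node layout and ordering inside add_to_derived_type_list are not established here, and the `Type` class and `describe` helper are hypothetical. The example declaration is `char *(*handlers[4])(int)`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Type:                       # hypothetical type node
    kind: str                     # "base", "pointer", "array", or "function"
    detail: str = ""              # base name, array bound, or parameter list
    inner: Optional["Type"] = None


def form_declared_type(base, chain):
    """Apply modifiers to the base type, outermost modifier first."""
    result = base
    for kind, detail in reversed(chain):   # chain is stored nearest-to-name first
        result = Type(kind, detail, result)
    return result


def describe(t):
    if t.kind == "base":
        return t.detail
    if t.kind == "pointer":
        return "pointer to " + describe(t.inner)
    if t.kind == "array":
        return f"array of {t.detail} " + describe(t.inner)
    if t.kind == "function":
        return f"function({t.detail}) returning " + describe(t.inner)
    raise ValueError(f"unknown kind: {t.kind}")


if __name__ == "__main__":
    # char *(*handlers[4])(int), read outward from the name
    chain = [("array", "4"), ("pointer", ""), ("function", "int"), ("pointer", "")]
    print(describe(form_declared_type(Type("base", "char"), chain)))
```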
Stage 4: Declaration Processing (decls.c)
decl_variable (sub_4CA6C0, 1,090 lines)
Processes variable declarations after specifiers and declarator have been parsed. This is where CUDA memory space qualifiers are applied and the variable entity is inserted into the symbol table.
CUDA Memory Space Bits
Variable entries carry a CUDA memory space bitmask at offset +148:
| Bit | Mask | Memory Space | Meaning |
|---|---|---|---|
| 0 | 0x01 | __constant__ | Device-side constant memory |
| 1 | 0x02 | __shared__ | Block-shared memory (per-SM) |
| 2 | 0x04 | __managed__ | Unified memory (host + device accessible) |
| 4 | 0x10 | __device__ | Device global memory |
These bits are set from the declaration state object (parameter a2), which carries the parsed CUDA attribute at offset +240:
decl_variable(decl_specs, decl_state, storage_class, out_entity, out_flags):
debug_trace(3, "decl_variable")
assert(decl_state != NULL) # decls.c:7730
# Look up existing variable in scope
existing = lookup_variable_in_scope( # sub_4C84B0
scope, name, type_info
)
# Create new variable entity
var_entity = create_variable_entry( # sub_5C9840
name, type, storage_class
)
# Apply CUDA memory space from declaration state
if dword_126EFA8: # CUDA mode enabled
cuda_attr_ptr = decl_state[+240]
if cuda_attr_ptr != NULL:
# Extract memory space from attribute
space = extract_memory_space(cuda_attr_ptr)
var_entity[+148] = space # set memory space bits
# Scope walk: determine if variable is at namespace scope
# or inside a function (affects valid memory space combinations)
scope_idx = dword_126C5E4 # current scope index
scope_base = qword_126C5E8 # scope table base
while scope_idx > 0:
scope_entry = scope_base + 784 * scope_idx
scope_kind = scope_entry[+4]
if scope_kind == 4: # class scope — walk up
scope_idx = scope_entry[+256] # parent scope
continue
break
# Template scope check
if scope_entry[+9] & 0x20: # is_template_scope
handle_template_variable()
# Check redeclaration compatibility
if existing != NULL:
old_space = existing[+148]
new_space = var_entity[+148]
if old_space != new_space:
# Determine which string to use for error message
if new_space & 0x04:
space_name = "__managed__"
elif new_space & 0x01:
space_name = "__constant__"
elif new_space & 0x02:
space_name = "__shared__"
elif new_space & 0x10:
space_name = "__device__"
emit_error(1306) # CUDA memory space mismatch on redeclaration
# Anonymous type check
if type_is_anonymous(var_entity):
emit_error(891) # anonymous type in variable declaration
# Apply remaining attributes
set_variable_attributes(var_entity) # sub_4C4750
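As an illustration of the memory space bitmask, the sketch below decodes the byte at +148 and reproduces the name-selection priority used for the redeclaration mismatch diagnostic (managed, then constant, then shared, then device). The function names are assumptions; the bit values and the priority order are taken from the table and pseudocode above.

```python
# Bit values from the CUDA Memory Space Bits table.
MEMORY_SPACE_BITS = {
    0x01: "__constant__",
    0x02: "__shared__",
    0x04: "__managed__",
    0x10: "__device__",
}


def decode_memory_space(mask):
    """List every memory space name whose bit is set in the mask."""
    return [name for bit, name in MEMORY_SPACE_BITS.items() if mask & bit]


def mismatch_space_name(new_space):
    """Pick the name reported on a redeclaration mismatch (priority order)."""
    for bit in (0x04, 0x01, 0x02, 0x10):
        if new_space & bit:
            return MEMORY_SPACE_BITS[bit]
    return None


if __name__ == "__main__":
    print(decode_memory_space(0x14), mismatch_space_name(0x14))
```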
variable_declaration (sub_4DEC90, 1,098 lines) -- Top-Level Entry
This is the outermost entry point for processing a variable declaration. It wraps decl_variable with CUDA-specific validation, constexpr/constinit checks, and static data member definition handling.
CUDA-Specific Error Emission
The function contains a dense block of CUDA error checks for variable declarations:
variable_declaration(decl_info, ...):
# Early CUDA checks
check_constexpr_variable_init(decl_info) # sub_4DAC80
# CUDA memory space string selection for error messages
mem_space_bits = entity[+148]
byte_149 = entity[+149]
if mem_space_bits & 0x04: # __managed__
# No __managed__-specific string needed here
pass
# Build human-readable attribute name for diagnostics
if byte_149 & 1:
space_str = "__constant__"
elif mem_space_bits & 4 == 0:
space_str = "__managed__"
if byte_149 & 1 == 0:
space_str = "__device__"
if mem_space_bits & 2:
space_str = "__shared__"
# CUDA variable constraint errors
if is_shared_variable:
if is_variable_length_array:
emit_error(3510) # __shared__ variable with VLA
if is_constant_variable:
if is_constexpr:
emit_error(3568) # __constant__ combined with constexpr
if is_volatile:
emit_error(3566) # __constant__ combined with volatile
if is_vla:
emit_error(3567) # __constant__ with VLA
if has_cuda_attribute:
if in_constexpr_if_discarded_branch:
emit_error(3578) # CUDA attribute in discarded branch
if at_namespace_scope and is_structured_binding:
emit_error(3579) # CUDA attribute on structured binding
if is_variable_length_array:
emit_error(3580) # CUDA attribute on VLA
# Dispatch to decl_variable or define_static_data_member
if is_static_member_definition:
define_static_data_member(...)
else:
decl_variable(decl_specs, decl_state, storage_class, ...)
# Post-declaration CUDA fixup
cuda_variable_fixup(entity) # sub_4CC150
mark_defined_variable(entity) # sub_4DC200
Complete CUDA Variable Error Table
| Error | Condition | Message Summary |
|---|---|---|
| 149 | Illegal CUDA storage class at namespace scope | Storage class not allowed here |
| 891 | Anonymous type in variable declaration | Anonymous type cannot be used |
| 892 | auto-typed CUDA variable (variant) | auto not allowed with CUDA qualifier |
| 893 | auto-typed CUDA variable | auto not allowed with CUDA qualifier |
| 1306 | Memory space mismatch on redeclaration | Conflicting CUDA memory space |
| 1655 | Tentative definition of constexpr variable | Missing initializer |
| 3483 | (CUDA variable context error) | CUDA attribute context mismatch |
| 3510 | __shared__ variable with VLA | Variable-length arrays not allowed in __shared__ |
| 3566 | __constant__ with volatile | volatile incompatible with __constant__ |
| 3567 | __constant__ with VLA | Variable-length arrays not allowed in __constant__ |
| 3568 | __constant__ with constexpr | constexpr incompatible with __constant__ |
| 3578 | CUDA attribute in constexpr if discarded branch | CUDA attribute in dead code |
| 3579 | CUDA attribute on structured binding at namespace scope | Structured binding cannot have CUDA attribute |
| 3580 | CUDA attribute on VLA | Variable-length arrays not allowed with CUDA attribute |
| 3648 | __constant__ with external linkage | External __constant__ not allowed |
decl_routine (sub_4CE420, 2,858 lines)
The largest function in the declaration processing stage. It handles function and method declarations, integrating CUDA calling convention validation, attribute consistency checking, and template interaction.
Parameters
| Parameter | Offset | Description |
|---|---|---|
| a1 | -- | decl_specifiers accumulator (__m128i*) |
| a2 | -- | Declaration state object |
| a3 | -- | Function info (offset +64 = flags, +80 = prior type) |
| a4 | -- | SRK flags bitmask |
| a5--a8 | -- | Output pointers and context |
SRK Flag Bits
The a4 parameter carries "scan result kind" flags that describe what was parsed:
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x01 | SRK_DECLARATION -- forward declaration |
| 1 | 0x02 | SRK_DEFINITION -- has function body |
| 7 | 0x80 | SRK_IMPLICIT -- compiler-generated |
| 8 | 0x100 | SRK_CONSTEXPR -- constexpr function |
Function Entity Layout
After processing, a function entity contains:
| Offset | Size | Field | Description |
|---|---|---|---|
| +80 | 1 | entity_kind | 10 = function, 11 = member function |
| +88 | 8 | descriptor | Pointer to function descriptor |
| +144 | 8 | type | Function type pointer |
| +164 | 1 | defined_flag | Set when definition is seen |
| +166 | 1 | function_kind | 1=ctor, 2=dtor, 5=conversion, 7=deduction guide |
| +168 | 8 | template_info | Template instantiation info |
| +177 | 1 | attribute_flags | bit 1=constexpr, bit 2=consteval |
| +188 | 1 | cuda_flags_1 | CUDA calling convention |
| +189 | 1 | cuda_flags_2 | CUDA execution space |
| +192 | 8 | parameter_list | Head of parameter linked list |
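One way to work with this layout when inspecting a memory dump is to treat the table as (offset, width) pairs and read them with Python's `struct` module. The sketch below is a hedged illustration: the `read_function_entity` helper and the synthetic 200-byte buffer are assumptions, and little-endian byte order is assumed because the binary is x86-64.

```python
import struct

# (offset, struct format) per field, copied from the layout table.
FUNCTION_ENTITY_FIELDS = {
    "entity_kind": (80, "B"),
    "descriptor": (88, "Q"),
    "type": (144, "Q"),
    "defined_flag": (164, "B"),
    "function_kind": (166, "B"),
    "template_info": (168, "Q"),
    "attribute_flags": (177, "B"),
    "cuda_flags_1": (188, "B"),
    "cuda_flags_2": (189, "B"),
    "parameter_list": (192, "Q"),
}


def read_function_entity(buf):
    """Decode the documented fields from a raw entity image (little-endian)."""
    return {
        name: struct.unpack_from("<" + fmt, buf, offset)[0]
        for name, (offset, fmt) in FUNCTION_ENTITY_FIELDS.items()
    }


if __name__ == "__main__":
    image = bytearray(200)                    # synthetic entity, all zeros
    image[80] = 11                            # member function
    image[166] = 1                            # constructor
    struct.pack_into("<Q", image, 192, 0xDEADBEEF)
    entity = read_function_entity(image)
    print(entity["entity_kind"], entity["function_kind"], hex(entity["parameter_list"]))
```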
Pseudocode
decl_routine(decl_specs, decl_state, func_info, srk_flags, ...):
debug_trace(3, "decl_routine")
# Assertions
assert func_info != NULL # decls.c:10057
assert storage_class is valid # decls.c:10059
assert srk_flags & SRK_DECLARATION # decls.c:10061
assert func_type is routine type # decls.c:10063
if srk_flags & SRK_DEFINITION:
assert body follows # decls.c:10068
if srk_flags & SRK_IMPLICIT:
assert compiler-generated context # decls.c:10149
# CUDA calling convention check
if dword_126EFB4 == 2: # CUDA C++ mode
check_cuda_calling_convention( # sub_4C6AB0
func_type, decl_specs
)
check_cuda_attribute_consistency( # sub_4C6D50
decl_state
)
# Look up existing declaration
existing = find_linked_symbol(name, scope)
if existing != NULL:
# Redeclaration checks
if existing.calling_convention != new_calling_convention:
emit_error(948) # calling convention mismatch
if has_cuda_attribute(existing) and has_cuda_attribute(new):
if not compatible_cuda_attributes(existing, new):
emit_error(1430) # function attribute mismatch
# CUDA-specific restrictions
if has_global_attribute:
if return_type is auto:
emit_error(1158) # auto return type with __global__
if is_deduction_guide:
if has_any_cuda_attribute:
emit_error(2885) # CUDA attribute on deduction guide
if is_explicit_instantiation:
if conflicting_template_attributes:
emit_error(1034) # explicit instantiation conflict
# Process CUDA attributes on the function
process_cuda_attributes(decl_state) # sub_42A250
remove_cuda_trailing_return(decl_state) # sub_42A210
# Canonicalize trailing return type in CUDA mode
if dword_126EFB4 == 2:
canonicalize_return_type(func_type) # sub_5DBCB0
# Symbol table insertion
entity = create_function_entity(name, func_type, storage_class)
# Set defined flag
assert entity.defined_flag is correct # decls.c:10417
# OpenMP variant handling (if active)
if dword_106B4B8: # omp_declare_variant_active
create_omp_variant_name("$$OMP_VARIANT%06d", variant_id)
CUDA Attribute Integration
Attribute Category System
EDG classifies attributes using a category byte at offset +9 in the attribute node:
| Category | Value | Meaning | Examples |
|---|---|---|---|
| Type | 1 | Applies to the type | alignas, __aligned__ |
| Declaration | 2 | Applies to the declaration | [[nodiscard]], [[deprecated]] |
| Statement | 3 | Applies to a statement | [[fallthrough]], [[likely]] |
| Execution space | 4 | CUDA execution space | __device__, __host__, __global__ |
Category 4 is NVIDIA's addition to EDG's attribute system. Standard EDG uses categories 1-3. CUDA execution space attributes are recognized by the lexer as identifiers, classified as CUDA keywords by get_token_main (sub_6810F0) when dword_106C2C0 (GPU mode) is active, and converted to attribute nodes with category 4 during attribute parsing.
Attribute Node Layout
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Next attribute in linked list |
| +8 | 1 | kind | Attribute kind (0 when cleared/consumed) |
| +9 | 1 | category | 1=type, 2=decl, 3=stmt, 4=exec-space |
| +10 | 1 | placement | Where in declaration it appeared (10=after type, 12=before declarator) |
| +11 | 1 | flags | bit 0 = consumed, bit 4 = CUDA/vendor |
| +16 | 8 | payload | Attribute-specific data |
Execution Space Propagation
When a CUDA execution space attribute is parsed, it flows through three processing points:
- decl_specifiers (sub_4ACF80): CUDA attributes are recognized as token 142 (attribute) and parsed into the attribute list. The attribute parser sets category 4 for execution space attributes.
- scan_declarator_attributes (sub_4B3970): Separates category-4 attributes from the primary attribute list and moves them to the secondary list at offset +184 of the declaration info structure.
- decl_routine / decl_variable: Reads execution space from the secondary attribute list and applies it to the function/variable entity. For functions, the execution space goes to offsets +188/+189 of the entity. For variables, the memory space goes to offset +148.
warn_on_cuda_execution_space_attributes (sub_4A8990)
A safety valve that catches execution space attributes in places where they should not appear (e.g., on type definitions that are not function or variable declarations):
warn_on_cuda_execution_space_attributes(attr_list):
warned = false
for each attr in attr_list:
category = attr[+9]
if category == 1 or category == 4: # type or exec-space
if not warned:
emit_error(1882) # invalid exec space attr
warned = true
attr[+8] = 0 # clear kind (suppress further processing)
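The warn-once behavior can be modeled in a few lines of Python. Representing attributes as dictionaries and returning the list of emitted error numbers are assumptions made to keep the sketch testable; the category values and error number 1882 come from the pseudocode above.

```python
def warn_on_exec_space_attrs(attrs):
    """Clear type/exec-space attributes and report error 1882 at most once."""
    emitted = []
    for attr in attrs:
        if attr["category"] in (1, 4):       # type or execution space
            if not emitted:
                emitted.append(1882)         # reported only for the first hit
            attr["kind"] = 0                 # suppress further processing
    return emitted


if __name__ == "__main__":
    attrs = [
        {"kind": 7, "category": 4},
        {"kind": 3, "category": 2},
        {"kind": 9, "category": 4},
    ]
    print(warn_on_exec_space_attrs(attrs), [a["kind"] for a in attrs])
```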
Scope Chain and Context Tracking
The declaration parser relies heavily on the scope chain stored in the global scope table. Every declaration must be inserted at the correct scope, and many validation checks depend on whether the current scope is namespace-scope, class-scope, block-scope, or template-scope.
Scope Entry Layout (784 bytes)
| Offset | Size | Field | Description |
|---|---|---|---|
| +4 | 1 | scope_kind | 2=namespace, 4=class, 6=function, 8=nested block, 10=block, 12=template, 15/17=special |
| +6 | 1 | flags_1 | bit 1=extern, bit 2=inline namespace, bit 7=pending class flag |
| +7 | 1 | flags_2 | bit 1=has using directives |
| +9 | 1 | template_flags | bit 5=is template scope, bit 1-3=template kind |
| +12 | 4 | scope_flags | bit 2-3=scope modifier |
| +182 | 1 | cuda_flags | bit 5 (0x20)=CUDA device-side scope |
| +192 | 8 | first_entity | Head of entity linked list |
| +216 | 8 | type_pointer | Associated type (for class scopes) |
| +224 | 8 | namespace_ptr | Associated namespace |
| +256 | 4 | parent_scope | Index of parent scope in table |
| +368 | 8 | source_begin | Source position where scope begins |
| +376 | 8 | associated_entity | Entity that opened this scope |
| +408 | 4 | parent_scope_idx | Alternate parent scope index |
Scope Table Globals
| Address | Name | Description |
|---|---|---|
| qword_126C5E8 | scope_table_base | Array of 784-byte scope entries |
| dword_126C5E4 | current_scope_index | Index into scope table |
| dword_126C5DC | current_scope_id | Current scope identifier |
| dword_126C5B4 | namespace_scope_id | Nearest enclosing namespace scope |
| dword_126C5BC | class_scope_depth | Nesting depth of class scopes |
| dword_126C5C4 | lambda_scope_id | Current lambda scope (-1 if none) |
| dword_126C5C8 | template_scope_id | Current template scope (-1 if none) |
Scope Walk for CUDA Memory Space
When processing a CUDA variable declaration, the parser walks up the scope chain to determine if the variable is at namespace scope (where __device__/__constant__/__managed__ are valid) or inside a function body (where __shared__ is additionally valid):
determine_cuda_variable_scope(var_entity):
scope_idx = dword_126C5E4
scope_base = qword_126C5E8
while scope_idx > 0:
entry = scope_base + 784 * scope_idx
kind = entry[+4]
if kind == 4: # class scope
# Walk through class scopes to find enclosing namespace/function
scope_idx = entry[+256] # parent scope
continue
if kind == 2: # namespace scope
# Variable is at namespace scope
# Valid spaces: __device__, __constant__, __managed__
return NAMESPACE_SCOPE
if kind == 6 or kind == 10: # function or block scope
# Variable is inside a function body
# Valid spaces: __shared__, __device__, __constant__, __managed__
return FUNCTION_SCOPE
scope_idx = entry[+256]
return FILE_SCOPE
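The walk above can be exercised against a toy scope table. In this sketch the table is a Python list of dictionaries rather than an array of 784-byte entries, and the `scope_entry_address` helper only shows the address arithmetic; both are illustrative assumptions. Scope kind values follow the Scope Entry Layout table.

```python
NAMESPACE, CLASS, FUNCTION, BLOCK = 2, 4, 6, 10
SCOPE_ENTRY_SIZE = 784


def scope_entry_address(table_base, index):
    """Address of a scope entry: base + 784 * index."""
    return table_base + SCOPE_ENTRY_SIZE * index


def determine_cuda_variable_scope(scopes, index):
    """Walk parent links until a namespace or function/block scope is found."""
    while index > 0:
        entry = scopes[index]
        if entry["kind"] == NAMESPACE:
            return "NAMESPACE_SCOPE"
        if entry["kind"] in (FUNCTION, BLOCK):
            return "FUNCTION_SCOPE"
        index = entry["parent"]              # class and other scopes: keep walking
    return "FILE_SCOPE"


if __name__ == "__main__":
    scopes = [
        {"kind": 0, "parent": 0},            # index 0: file scope sentinel
        {"kind": NAMESPACE, "parent": 0},
        {"kind": CLASS, "parent": 1},        # class inside a namespace
        {"kind": FUNCTION, "parent": 1},
        {"kind": CLASS, "parent": 3},        # local class inside a function
    ]
    print(determine_cuda_variable_scope(scopes, 2))
    print(determine_cuda_variable_scope(scopes, 4))
```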
Linkage Determination
id_linkage (sub_4C3380, 310 lines)
Determines whether an identifier has internal, external, or no linkage. This is called during decl_variable and decl_routine to set the linkage byte on the entity.
id_linkage(entity, storage_class, scope):
debug_trace(3, "id_linkage")
kind = entity[+80] # entity kind
# C++ linkage rules
if dword_126EFB4 == 2: # C++ mode
if storage_class == STATIC:
return INTERNAL # 0x10
if storage_class == EXTERN:
return EXTERNAL # 0x20
if scope_kind == NAMESPACE:
if kind == FUNCTION:
return EXTERNAL
if kind == VARIABLE:
if is_const_qualified and not explicitly_extern:
return INTERNAL
return EXTERNAL
if scope_kind == BLOCK:
return NONE # 0x00
# C linkage rules (simpler)
if storage_class == STATIC:
return INTERNAL
if scope_kind == FILE:
return EXTERNAL
return NONE
# Debug output
debug_print(linkage_string) # "internal" / "external" / "none"
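The linkage rules above reduce to a small decision function. The sketch below is a simplified model with string-valued inputs, which are an assumption for readability; it does not reproduce the entity lookups or debug output of the real routine, and in C++ mode it treats any scope other than namespace scope as having no linkage.

```python
def id_linkage(cpp_mode, storage_class, scope_kind, entity_kind,
               is_const=False, explicitly_extern=False):
    """Return "internal", "external", or "none" for a declared identifier."""
    if cpp_mode:
        if storage_class == "static":
            return "internal"
        if storage_class == "extern":
            return "external"
        if scope_kind == "namespace":
            # Namespace-scope const variables default to internal linkage in C++.
            if entity_kind == "variable" and is_const and not explicitly_extern:
                return "internal"
            return "external"
        return "none"
    # C rules
    if storage_class == "static":
        return "internal"
    if scope_kind == "file":
        return "external"
    return "none"


if __name__ == "__main__":
    print(id_linkage(True, None, "namespace", "variable", is_const=True))
    print(id_linkage(False, None, "file", "function"))
```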
find_linked_symbol (sub_4C1CC0, 608 lines)
The redeclaration detection engine. When a new declaration is processed, this function searches the current and enclosing scopes for a previously-declared symbol with the same name and compatible linkage:
find_linked_symbol(name, scope, entity_kind):
debug_trace(3, "find_linked_symbol")
# Look up in symbol table
existing = symbol_lookup(name, scope) # sub_698940
if existing == NULL:
return NULL
# For functions: handle overload sets
if entity_kind == FUNCTION:
# Walk overload set checking for compatible signature
for each overload in existing.overload_set:
if types_match(overload.type, new_type):
return overload
return NULL # new overload, not redeclaration
# For variables: check linkage compatibility
if entity_kind == VARIABLE:
if existing.linkage == new_linkage:
return existing
# Special case: extern at block scope refers to
# namespace-scope variable with same name
if new_storage_class == EXTERN and scope_kind == BLOCK:
return walk_to_namespace_scope_and_search(name)
return NULL
Constructor and Destructor Initialization (decl_inits.c)
ctor_inits_for_inheriting_ctor (sub_4A0310, 746 lines)
Builds the initialization sequence for inheriting constructors (C++11 using Base::Base;). The function iterates virtual base member lists to find matching base constructors and constructs the initialization order:
ctor_inits_for_inheriting_ctor(decl_info):
class_type = decl_info[+40][+32] # enclosing class type
member_list = class_type[+152] # member list
# Iterate virtual bases
for each member in member_list:
if member[+80] == 8: # base class member kind
base_type = resolve_base_type(member)
base_ctor = find_base_constructor(base_type)
if decl_info[+178] & 0x40: # inheriting-ctor redirection
# Walk class hierarchy via offset+216 link
while has_redirect(current):
current = current[+216]
base_ctor = find_redirect_target(current)
# Check accessibility
check_base_ctor_accessibility(base_ctor) # sub_48B3F0
# Build init entry
init_entry = allocate_init_entry() # sub_6BA0D0
init_entry.target = base_ctor
append_to_init_list(init_entry)
dtor_initializer (sub_4A0EC0, 339 lines)
Builds the destructor initialization (destruction) list for a class. The destruction order is the reverse of construction order -- members are destroyed in reverse declaration order, then base classes in reverse order:
dtor_initializer(decl_info):
debug_trace(3, "dtor_initializer") # decl_inits.c:10153
class_type = decl_info[5][+32]
member_list = class_type[+152]
# Check for delegating constructor
if decl_info[22] & 2:
return # delegating ctor, no separate dtor init needed
# Pass 1: members with flag (offset[10] & 2)
for each member in member_list:
if member[10] & 2:
if class_type[+132] != 11: # not union
dtor = resolve_member_destructor(member)
entry = allocate_init_entry()
entry.destructor = dtor
# Pass 2: members with (offset[10] & 3) == 1
for each member in member_list:
if (member[10] & 3) == 1:
dtor = resolve_member_destructor(member)
entry = allocate_init_entry()
entry.destructor = dtor
# Base class destructors (reverse order)
base_list = class_type[+96]
for each base in reverse(base_list):
dtor = resolve_base_destructor(base) # sub_737270
entry = allocate_init_entry()
entry.destructor = dtor
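The ordering rule stated above (destruction is the reverse of construction) can be checked with a short sketch. It models only non-virtual bases and direct members, and the function names are illustrative rather than taken from the binary.

```python
def construction_order(bases, members):
    """Bases in declaration order, then members in declaration order."""
    return list(bases) + list(members)


def destruction_order(bases, members):
    """Members in reverse declaration order, then bases in reverse order."""
    return list(reversed(members)) + list(reversed(bases))


if __name__ == "__main__":
    bases, members = ["A", "B"], ["x", "y"]
    print(construction_order(bases, members))
    print(destruction_order(bases, members))
```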
check_for_missing_initializer_full (sub_4A1540, 248 lines)
Checks whether a variable declaration is missing a required initializer:
check_for_missing_initializer_full(entity, type, unused, deferred_error):
kind = entity[+80] # 7=variable, 9=static member
# VLA check
if is_variable_length_array(type):
emit_error(252) # VLA cannot have initializer
# const check (C++ mode)
if dword_126EFB4 == 2: # C++ mode
if is_const_qualified(type) and not has_initializer(entity):
if not is_extern(entity):
emit_error(257) # const object requires initializer
# Abstract class check
if type[+160] & 2: # abstract class flag
if (type[+132] & 0xFB) == 8: # array of abstract
emit_error(812) # array of abstract class
else:
emit_error(516) # abstract class cannot be instantiated
# constexpr check
if entity has constexpr flag:
if not has_initializer(entity):
emit_error(517) # constexpr variable requires initializer
CUDA Mode Control Globals
The declaration parser is gated on several CUDA mode flags that control which code paths are active:
| Address | Name | Type | Description |
|---|---|---|---|
| dword_126EFA8 | is_cuda_compilation | bool | Master CUDA mode flag |
| dword_126EFB4 | cuda_dialect | int | 0=none, 1=C, 2=C++ |
| dword_126EFAC | extended_cuda_features | bool | Additional CUDA extensions enabled |
| dword_126EFA4 | cuda_host_compilation | bool | Compiling host-side code |
| dword_126EFB0 | cuda_relaxed_constexpr | bool | Allow constexpr on device functions |
| dword_106C17C | constexpr_cuda_enabled | bool | CUDA constexpr compatibility mode |
| qword_126EF98 | cuda_version_threshold_1 | int64 | Version gate (0x9E97 = 40599 = CUDA 12.x) |
| qword_126EF90 | cuda_version_threshold_2 | int64 | Version gate (0x78B3 = 30899 = CUDA 11.x) |
| dword_126EF68 | cpp_standard_version | int | C++ standard year (201102, 201402, ...) |
| dword_126EF64 | cpp_extensions_enabled | bool | Language extensions active |
CUDA Version Gating
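Several CUDA-specific code paths are guarded by version thresholds. The three constants are 0x78B3 (30899), 0x9E97 (40599), and 0x1116F (69999). The CUDA 11.x and 12.x labels below are a working interpretation: a major * 1000 + minor * 10 + patch encoding does not reproduce them (40599 would decode to major version 40). The values do fit a major * 10000 + minor * 100 + patch encoding (3.8.99, 4.5.99, 6.99.99), so these gates may be compiler-compatibility version checks rather than CUDA release checks.
# Labeled CUDA 11.x and later: enable extended constexpr
if qword_126EF90 > 0x78B3: # 30899
enable_extended_constexpr()
# Labeled CUDA 12.x and later: enable managed memory attributes
if qword_126EF98 > 0x9E97: # 40599
enable_managed_attributes()
# Most recent gate: enable namespace-scope CUDA variable checks
if qword_126EF98 > 0x1116F: # 69999, so the gate opens at 70000 and above
enable_strict_namespace_checks()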
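To make the threshold arithmetic easy to check, the sketch below converts the three hexadecimal constants to decimal and splits them under two candidate encodings. Which encoding the binary actually uses is not established here; the `decode_version` helper and both weight pairs are assumptions for illustration.

```python
THRESHOLDS = {"qword_126EF90": 0x78B3, "qword_126EF98": 0x9E97, "strict_gate": 0x1116F}


def decode_version(value, major_weight, minor_weight):
    """Split an integer into (major, minor, patch) for the given weights."""
    major, rest = divmod(value, major_weight)
    minor, patch = divmod(rest, minor_weight)
    return major, minor, patch


if __name__ == "__main__":
    for name, value in THRESHOLDS.items():
        print(name, value,
              decode_version(value, 1000, 10),      # major*1000 + minor*10 + patch
              decode_version(value, 10000, 100))    # major*10000 + minor*100 + patch
```

A comparison of the form `value > 69999` is equivalent to `value >= 70000` for integers, which is why the third gate is described as opening at 70000.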
Function Map
decl_spec.c (0x4A1BF0--0x4B37F0)
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_4A1BF0 | check_use_of_consteval | 104 | Validate consteval specifier |
| sub_4A1DF0 | check_explicit_specifier | 45 | Validate explicit specifier |
| sub_4A1EC0 | check_use_of_constinit | 77 | Validate constinit specifier |
| sub_4A2000 | check_use_of_thread_local | 111 | Validate thread_local specifier |
| sub_4A22B0 | check_use_of_constexpr | 153 | Validate constexpr specifier |
| sub_4A2580 | check_gnu_c_auto_type | 52 | Validate GNU __auto_type |
| sub_4A2630 | scan_edg_vector_type | 203 | Parse vector type syntax |
| sub_4A2B80 | is_function_declaration_ahead | 162 | Lookahead: function declaration? |
| sub_4A2E40 | process_auto_parameter | 153 | C++20 auto parameters |
| sub_4A31A0 | process_storage_class_specifier | 223 | Storage class validation |
| sub_4A3610 | check_for_class_modifiers | 139 | Detect final/__final |
| sub_4A38A0 | scan_tag_name | 1,216 | Parse class/enum name |
| sub_4A4FD0 | set_name_linkage_for_type | 41 | Set type linkage |
| sub_4A5140 | update_membership_of_class | 173 | Update class scope info |
| sub_4A5510 | attach_tag_attributes | 143 | Attach attributes to types |
| sub_4A57C0 | class_specifier | 2,179 | Parse class/struct/union definition |
| sub_4A8990 | warn_on_cuda_execution_space_attributes | 33 | CUDA exec space warning |
| sub_4A89F0 | scan_enumerator_list | 950 | Parse enum body |
| sub_4AA2F0 | enum_specifier | 1,437 | Parse enum specifier |
| sub_4AC550 | typename_specifier | 197 | Parse typename T::type |
| sub_4AC970 | is_constructor_decl | 225 | Detect constructor declaration |
| sub_4ACE00 | enclosing_class_type | 43 | Get enclosing class from scope |
| sub_4ACF80 | decl_specifiers | 4,761 | Central specifier dispatcher |
| sub_4B37F0 | decl_spec_one_time_init | 40 | Module initialization |
declarator.c (0x4B3920--0x4C00A0)
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_4B3970 | scan_declarator_attributes | 297 | Separate CUDA exec-space attrs |
| sub_4B3E80 | scan_trailing_requires_clause | 136 | C++20 requires clause |
| sub_4B4230 | check_for_restrict_qualifier_on_derived_type | 124 | Restrict validation |
| sub_4B4870 | form_declared_type | 53 | Combine base type + derived chain |
| sub_4B4990 | report_bad_return_type_qualifier | 89 | cv-qual on return type |
| sub_4B4CF0 | add_to_derived_type_list | 600 | Build derived type chain |
| sub_4B5A70 | delayed_scan_of_exception_spec | 211 | Deferred exception spec |
| sub_4B6760 | array_declarator | 518 | Parse [expr] |
| sub_4B72A0 | pointer_declarator | 440 | Parse *, &, &&, ::* |
| sub_4B7BC0 | declarator | 284 | Top-level declarator entry |
| sub_4B8190 | function_declarator | 3,144 | Parse function signature |
| sub_4BC7F0 | scan_requires_expr_parameters | 61 | C++20 requires-expr params |
| sub_4BC950 | r_declarator | 2,578 | Recursive descent declarator |
| sub_4C00A0 | scan_lambda_declarator | 414 | Lambda declarator |
decls.c (0x4C0840--0x4F0000)
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_4C0910 | incompatible_types_are_SVR4_compatible | 77 | SVR4 ABI compat check |
| sub_4C0B10 | set_default_calling_convention | 112 | Calling convention setup |
| sub_4C0CB0 | record_overload | 91 | Record function overload |
| sub_4C0E90 | set_linkage_for_class_members | 107 | Propagate class linkage |
| sub_4C10E0 | set_linkage_environment | 138 | Linkage environment setup |
| sub_4C15D0 | check_use_of_placeholder_type | 175 | Validate auto/decltype(auto) |
| sub_4C1CC0 | find_linked_symbol | 608 | Redeclaration detection |
| sub_4C3380 | id_linkage | 310 | Linkage determination |
| sub_4C3A80 | qualified_name_redecl_sym | 320 | Qualified redeclaration |
| sub_4CA6C0 | decl_variable | 1,090 | Variable declaration processing |
| sub_4CC150 | cuda_variable_fixup | 120 | CUDA post-decl variable fixup |
| sub_4CE420 | decl_routine | 2,858 | Function declaration processing |
| sub_4DAC80 | check_constexpr_variable_init | 60 | CUDA constexpr check |
| sub_4DB440 | process_asm_block | 200 | Inline assembly declaration |
| sub_4DC200 | mark_defined_variable | 26 | CUDA constexpr linkage |
| sub_4DD710 | check_trailing_return_type | 80 | Auto type deduction check |
| sub_4DEC90 | variable_declaration | 1,098 | Top-level variable entry |
disambig.c (0x4E9E70--0x4EC690)
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_4E9E70 | prescan_gnu_attribute | 98 | Skip __attribute__ in prescan |
| sub_4EA560 | prescan_declaration | 400 | Top-level disambiguation |
| sub_4EB270 | prescan_declarator | 200 | Prescan declarator tokens |
| sub_4EC690 | find_for_loop_separator | 100 | Find ; in for-init |
decl_inits.c (0x4A0310--0x4A1BE0)
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_4A0310 | ctor_inits_for_inheriting_ctor | 746 | Inheriting ctor init list |
| sub_4A0EC0 | dtor_initializer | 339 | Destructor init list |
| sub_4A1540 | check_for_missing_initializer_full | 248 | Missing init diagnostic |
| sub_4A1B60 | decl_inits_init | 11 | Module initialization |
| sub_4A1BB0 | decl_inits_reset | 9 | Module reset |
Cross-References
- Lexer -- token production, word_126DD58, sub_676860 (get_next_token)
- Template Engine -- template scope interaction during declarator parsing
- CUDA Template Restrictions -- __global__ template argument validation, executed after decl_routine
- Name Mangling -- mangled name generation for declared entities
- Overload Resolution -- overload set construction during find_linked_symbol
- Constexpr Interpreter -- invoked during check_use_of_constexpr for validation
Overload Resolution
The overload resolution engine in cudafe++ is EDG 6.6's implementation of the C++ overload resolution algorithm (ISO C++ [over.match]). It lives in overload.c -- approximately 100 functions spanning address range 0x6BE4A0--0x6EF7A0 (roughly 200KB of compiled code). Overload resolution is one of the most complex subsystems in any C++ compiler because it sits at the intersection of nearly every other language feature: implicit conversions, user-defined conversions, template argument deduction, SFINAE, partial ordering, reference binding, list initialization, copy elision, and operator overloading each contribute decision branches to the algorithm. EDG implements the standard three-phase architecture -- candidate collection, viability checking, best-viable selection -- with NVIDIA-specific extensions for CUDA execution-space filtering.
Key Facts
| Property | Value |
|---|---|
| Source file | overload.c (~100 functions) |
| Address range | 0x6BE4A0--0x6EF7A0 |
| Total code size | ~200KB |
| Main selection entry | sub_6E6400 (select_overloaded_function, 1,483 lines, 20 parameters) |
| Operator dispatch | sub_6EF7A0 (select_overloaded_operator, 2,174 lines) |
| Viability checker | sub_6E2040 (determine_function_viability, 2,120 lines) |
| Candidate evaluator | sub_6C4C00 (candidate evaluation, 1,044 lines) |
| Main driver | sub_6CE6E0 (overload resolution driver, 1,246 lines) |
| Built-in candidates | sub_6CD010 (built-in operator candidates, 752 lines) |
| Candidate iterator | sub_6E4FA0 (try_overloaded_function_match, 633 lines) |
| Conversion scoring | sub_6BEE10 (standard_conversion_sequence, 375 lines) |
| ICS comparison | sub_6CBC40 (implicit conversion sequence comparison, 345 lines) |
| Qualification compare | sub_6BE6C0 (compare_qualification_conversions, 127 lines) |
| Copy constructor select | sub_6DBEA0 (select_overloaded_copy_constructor, 625 lines) |
| Default constructor select | sub_6E9080 (select_overloaded_default_constructor, 358 lines) |
| Assignment operator select | sub_6DD600 (select_overloaded_assignment_operator, 492 lines) |
| CTAD entry | sub_6E8300 (deduce_class_template_args, 285 lines) |
| List initializer | sub_6D7C80 (prep_list_initializer, 2,119 lines) |
| Overload set traversal | sub_6BA230 (iterate overload set) |
| Overload debug trace | dword_126EFC8 (enable), qword_106B988 (output stream) |
| CUDA extensions flag | byte_126E349 |
| Language mode | dword_126EFB4 (2 = C++) |
| Standard version | dword_126EF68 (201103 = C++11, 201703 = C++17, 202301 = C++23) |
Why Overload Resolution Is Hard
Overload resolution is not a simple "find the best match" operation. The C++ standard defines it as a partial ordering problem over implicit conversion sequences, where each sequence is itself a multi-step chain of type transformations. The key sources of complexity:
- Implicit conversion sequences (ICS). Each argument-to-parameter match produces an ICS consisting of up to three steps: a standard conversion (lvalue-to-rvalue, array-to-pointer, etc.), optionally a user-defined conversion (constructor or conversion function), then another standard conversion. Ranking two ICSs against each other requires comparing each step independently.
- User-defined conversions. When no standard conversion exists, the compiler must search for converting constructors on the target type AND conversion operators on the source type, then perform a nested overload resolution among those candidates. This creates recursive invocations of the overload engine.
- Template argument deduction. Function templates produce candidates only after deduction succeeds. Deduction may fail (SFINAE), producing no candidate. Successfully deduced candidates participate in a separate tie-breaking rule: non-template functions are preferred over template specializations, and "more specialized" templates are preferred over "less specialized" ones ([over.match.best] p2.5).
- Partial ordering. When comparing two function templates that are both viable, the compiler must determine which is "more specialized" by attempting deduction in both directions (templates.c handles this). The result feeds back into overload ranking.
- Operator overloading. Built-in operators (like + on int) compete with user-defined operator+. The compiler synthesizes "built-in candidate functions" representing every valid built-in operator signature, adds them to the candidate set alongside user-defined operators, and runs the same best-viable algorithm on the combined set.
- Special contexts. Copy-initialization vs. direct-initialization, list-initialization, reference binding, and conditional-operator type determination each have their own overload resolution sub-procedures with modified candidate sets and ranking rules.
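Several of the ranking rules in the list above can be observed from ordinary C++ without looking at the binary. The following is a minimal sketch for illustration only: the function names pick, prefer, and spec are invented for this example and do not come from cudafe++. Each overload returns a distinct tag so that a static_assert can report which candidate won.

```cpp
// Illustrative only: pick/prefer/spec are hypothetical names.

// Exact match beats promotion, and promotion beats conversion.
constexpr int pick(int)    { return 1; }
constexpr int pick(long)   { return 2; }
constexpr int pick(double) { return 3; }

static_assert(pick(0) == 1,    "int -> int is an exact match");
static_assert(pick('a') == 1,  "char -> int is a promotion; char -> long/double are conversions");
static_assert(pick(1.5f) == 3, "float -> double is a promotion; float -> int/long are conversions");

// With equal conversion sequences, a non-template beats a template.
template <class T> constexpr int prefer(T) { return 10; }
constexpr int prefer(int)                  { return 20; }

static_assert(prefer(7) == 20,  "tie on conversions: non-template preferred");
static_assert(prefer(7L) == 10, "template is an exact match; prefer(int) needs long -> int");

// Between two viable templates, the more specialized one wins.
template <class T> constexpr int spec(T)  { return 100; }
template <class T> constexpr int spec(T*) { return 200; }

static_assert(spec(static_cast<int*>(nullptr)) == 200, "T* is more specialized than T");
```

All checks run at compile time, so a successful compile is the whole demonstration.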
Architecture: Three-Phase Pipeline
                         PHASE 1                  PHASE 2                    PHASE 3
                  Candidate Collection        Viability Check         Best-Viable Selection
                  ┌───────────────────┐     ┌──────────────────┐     ┌──────────────────────┐
                  │                   │     │                  │     │                      │
f(args...) ──────►│ Name lookup       │────►│ For each cand:   │────►│ Pairwise comparison  │
                  │ ADL (arg-dep.)    │     │ - param count    │     │ of viable candidates │
                  │ Using-declarations│     │ - conversions    │     │ via ICS ranking      │──► winner
                  │ Template deduction│     │ - constraints    │     │                      │
                  │ Built-in synth    │     │ - access check   │     │ Tie-breakers:        │──► or ambiguity
                  │                   │     │                  │     │ - non-template pref  │
                  └───────────────────┘     └──────────────────┘     │ - partial ordering   │──► or no match
                                                                     │ - cv-qual ranking    │
                                                                     └──────────────────────┘
Phase 1: Candidate Collection
Candidates are collected into an overload set -- a linked list of entries allocated via sub_6BA0D0 and iterated via sub_6BA230. The overload set is built by the caller before invoking select_overloaded_function. Sources of candidates include:
- Name lookup results. All declarations visible by name at the call site, including base class members and using-declarations.
- Argument-dependent lookup (ADL). Additional functions found by searching the associated namespaces of the argument types (Koenig lookup). These are added to the set by the name lookup machinery before overload resolution begins.
- Template specializations. For each function template in the name lookup result, template argument deduction is attempted. If deduction succeeds, the resulting specialization is added as a candidate. If deduction fails, the template is silently dropped (SFINAE).
- Built-in operator candidates. For operator expressions, sub_6CD010 synthesizes candidate functions representing every valid built-in operator signature for the given operand types. These synthetic candidates use single-character type classification codes to match operand patterns.
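Two of the candidate sources listed above, ADL and template deduction with SFINAE, can be seen in a few lines of standard C++. The sketch below is illustrative: the namespace units and the function describe are invented names, not anything recovered from the binary.

```cpp
#include <type_traits>

namespace units {
struct Meters { int value; };
// Reachable from outside `units` only through argument-dependent lookup.
constexpr int describe(Meters) { return 1; }
}  // namespace units

// Deduction succeeds only for integral T; otherwise the template
// contributes no candidate at all (SFINAE).
template <class T>
constexpr typename std::enable_if<std::is_integral<T>::value, int>::type
describe(T) { return 2; }

// Always viable, but an ellipsis conversion ranks below everything else.
constexpr int describe(...) { return 3; }

static_assert(describe(units::Meters{5}) == 1, "ADL adds units::describe to the set");
static_assert(describe(42) == 2,  "deduction succeeds for int");
static_assert(describe(1.5) == 3, "deduction fails for double, template silently dropped");
```

The third assertion is the interesting one: the failed deduction produces no diagnostic, and resolution simply proceeds with the remaining candidate.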
Phase 2: Viability Checking
determine_function_viability (sub_6E2040, 2,120 lines) is the core viability checker. For each candidate function, it determines whether all arguments can be implicitly converted to the corresponding parameter types.
determine_function_viability (sub_6E2040, 2120 lines)
Input: candidate function F, argument list A[0..n-1]
Output: viability flag, per-argument conversion summaries
// Guard: SFINAE context handling
if (in_sfinae_context)
push_diagnostic_suppression()
// PASS 1: Basic eligibility
if (F is deleted)
return NOT_VIABLE
if (F is template && deduction_failed)
return NOT_VIABLE
if (F has fewer params than args && !F.is_variadic)
return NOT_VIABLE
if (F has more params than args && excess params lack defaults)
return NOT_VIABLE
// Handle implicit 'this' parameter for member functions
if (F is non-static member function) {
this_match = selector_match_with_this_param(
object_operand, F.this_param_type) // sub_6D0A80
if (this_match == FAILED)
return NOT_VIABLE
}
// PASS 2: Per-argument conversion check
for i in 0..n-1:
log("determine_function_viability: arg %d", i)
param_type = F.params[i].type
arg_type = A[i].type
// Compute implicit conversion sequence
ics = compute_standard_conversion_sequence( // sub_6BEE10
arg_type, param_type, context_flags)
if (ics == NO_CONVERSION) {
// Try user-defined conversion
ics = try_user_defined_conversion(
arg_type, param_type) // sub_6BF610
if (ics == NO_CONVERSION)
return NOT_VIABLE
}
// Check narrowing for list-initialization
if (context == LIST_INIT && ics.is_narrowing)
return NOT_VIABLE
// Record per-argument match summary
summaries[i] = ics
log("(pass 2)") // second pass through for detailed scoring
// All arguments convertible -- candidate is viable
return VIABLE, summaries[]
The function implements a two-pass approach visible in the debug trace output: pass 1 performs a quick rejection check (parameter count, deleted status, deduction success), and pass 2 computes the full conversion sequence for each argument. The per-argument summaries are stored in a 48-byte structure (set_arg_summary_for_user_conversion at sub_6BE990 initializes these).
Phase 3: Best-Viable Selection
select_overloaded_function (sub_6E6400, 1,483 lines, 20 parameters) performs the final selection. It is the master entry point for overload resolution -- called from the expression parser, from CTAD, and from special member function selection.
select_overloaded_function (sub_6E6400, 1483 lines, 20 params)
Input: overload_set, arg_list, context_flags, ...
Output: best_function or AMBIGUOUS or NO_MATCH
log("Entering select_overloaded_function with ...")
// Early exit: dependent type arguments => defer to instantiation time
if (selector_type_is_dependent)
return DEPENDENT
// Step 1: Iterate candidates and check viability
viable_set = []
try_overloaded_function_match( // sub_6E4FA0
overload_set, arg_list, &viable_set, ...)
if (viable_set is empty)
return NO_MATCH
if (viable_set has exactly 1 candidate)
return viable_set[0]
// Step 2: Pairwise comparison of viable candidates
// For each pair (F1, F2), compare their ICS for each argument
best = viable_set[0]
ambiguous = false
for each candidate C in viable_set[1..]:
cmp = compare_candidates(best, C)
// compare_candidates calls compare_conversion_sequences (sub_6BFF70)
// for each argument position, and applies tie-breakers
if (cmp == C_IS_BETTER)
best = C
ambiguous = false
else if (cmp == NEITHER_BETTER)
ambiguous = true
// Step 3: Verify best is strictly better than ALL others
if (ambiguous) {
// Final check: is there a single candidate that beats all?
for each candidate C in viable_set:
if (C != best) {
cmp = compare_candidates(best, C)
if (cmp != BEST_IS_BETTER)
return AMBIGUOUS
}
}
return best
Candidate Comparison Rules
The pairwise comparison between two viable candidates F1 and F2 follows [over.match.best]. The result is one of: F1-better, F2-better, or indistinguishable.
compare_candidates(F1, F2):
// Rule 1: Compare implicit conversion sequences argument-by-argument
f1_better_count = 0
f2_better_count = 0
for i in 0..n-1:
cmp = compare_conversion_sequences( // sub_6BFF70
F1.ics[i], F2.ics[i])
if (cmp == F1_BETTER) f1_better_count++
if (cmp == F2_BETTER) f2_better_count++
if (f1_better_count > 0 && f2_better_count == 0)
return F1_IS_BETTER
if (f2_better_count > 0 && f1_better_count == 0)
return F2_IS_BETTER
// Rule 2: Non-template preferred over template
if (F1 is non-template && F2 is template)
return F1_IS_BETTER
if (F2 is non-template && F1 is template)
return F2_IS_BETTER
// Rule 3: More-specialized template preferred
if (both are templates) {
partial = partial_ordering(F1.template, F2.template)
if (partial == F1_MORE_SPECIALIZED)
return F1_IS_BETTER
if (partial == F2_MORE_SPECIALIZED)
return F2_IS_BETTER
}
// Rule 4: Compare qualification conversions
cmp_qual = compare_qualification_conversions( // sub_6BE6C0
F1.qual_info, F2.qual_info)
if (cmp_qual != 0)
return cmp_qual
return NEITHER_BETTER
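Rule 1 and the selection loop can be modelled in a few lines. The sketch below is a simplified model under stated assumptions, not decompiled code: ranks are plain integers (lower is better), a candidate is just an array of per-argument ranks, and the names Candidate, compare, and select_best are hypothetical. Unlike the pseudocode above, this version always runs the verification pass, which is the conservative reading of [over.match.best]. It requires C++14 for the loops in constexpr functions.

```cpp
// Hypothetical model; none of these names come from the binary.
enum Cmp { FirstBetter, SecondBetter, Neither };

constexpr int kArgs = 2;
struct Candidate { int rank[kArgs]; };  // per-argument rank, lower is better

// Rule 1 only: a wins if it is better on at least one argument
// and not worse on any other.
constexpr Cmp compare(const Candidate& a, const Candidate& b) {
    int a_better = 0, b_better = 0;
    for (int i = 0; i < kArgs; ++i) {
        if (a.rank[i] < b.rank[i]) ++a_better;
        if (b.rank[i] < a.rank[i]) ++b_better;
    }
    return (a_better > 0 && b_better == 0) ? FirstBetter
         : (b_better > 0 && a_better == 0) ? SecondBetter
                                           : Neither;
}

// Returns the index of the best candidate, or -1 if the call is ambiguous.
constexpr int select_best(const Candidate* set, int n) {
    int best = 0;
    for (int i = 1; i < n; ++i)
        if (compare(set[i], set[best]) == FirstBetter) best = i;
    // Verification pass: the winner must strictly beat every other candidate.
    for (int i = 0; i < n; ++i)
        if (i != best && compare(set[best], set[i]) != FirstBetter) return -1;
    return best;
}

constexpr Candidate kClear[] = {{{3, 1}}, {{1, 1}}, {{2, 1}}};
constexpr Candidate kCross[] = {{{1, 3}}, {{3, 1}}};

static_assert(select_best(kClear, 3) == 1,  "candidate 1 dominates the others");
static_assert(select_best(kCross, 2) == -1, "each is better on one argument: ambiguous");
```

The kCross case shows why counting wins is not enough: a candidate must be at least as good on every argument, so two candidates that each win one position are ambiguous.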
Implicit Conversion Sequence (ICS) Model
An ICS is the sequence of transformations needed to convert an argument type to a parameter type. EDG computes and stores ICS information in a compact structure.
Standard Conversion Sequence
standard_conversion_sequence (sub_6BEE10, 375 lines) computes the standard-conversion component of an ICS. It produces a conversion rank used in comparison.
| Rank | Description | Examples | Priority |
|---|---|---|---|
| Exact Match | No conversion needed | int to int, lvalue-to-rvalue | 1 (best) |
| Promotion | Integer/float promotion | short to int, float to double | 2 |
| Conversion | Standard conversion | int to double, derived-to-base | 3 |
| User-Defined | User conversion + std conversion | Foo to Bar via constructor | 4 |
| Ellipsis | Match via ... parameter | Any type to variadic | 5 (worst) |
Within the same rank, additional criteria refine the comparison:
- Qualification adjustment. T to const T is worse than T to T. compare_qualification_conversions (sub_6BE6C0) encodes cv-qualification as a bitmask (const = 0x20, volatile = 0x40, restrict = 0x80) and compares subset relationships.
- Derived-to-base distance. Conversion through a shorter inheritance chain is better. Checked via sub_7AB300.
- Reference binding. Binding to T&& is preferred over binding to const T& when the argument is an rvalue.
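The subset comparison on the cv-qualifier bitmask can be sketched as follows. The three bit values are the ones reported above; the function compare_quals and its return convention (-1, 0, +1) are assumptions made for this illustration and may not match how sub_6BE6C0 reports its result.

```cpp
// Bit values as reported for compare_qualification_conversions.
constexpr unsigned kConst    = 0x20;
constexpr unsigned kVolatile = 0x40;
constexpr unsigned kRestrict = 0x80;

// Hypothetical reconstruction: compares the qualifier sets ADDED by two
// conversions. Returns -1 if a adds a proper subset of b's qualifiers
// (a is better), +1 for the reverse, 0 if equal or unordered.
constexpr int compare_quals(unsigned a, unsigned b) {
    return a == b       ? 0
         : (a & b) == a ? -1   // a is a proper subset of b
         : (a & b) == b ? +1   // b is a proper subset of a
                        : 0;   // neither set contains the other
}

static_assert(compare_quals(0, kConst) == -1, "adding nothing beats adding const");
static_assert(compare_quals(kConst | kVolatile, kConst) == +1, "const volatile is worse than const");
static_assert(compare_quals(kConst, kVolatile) == 0, "const vs volatile: unordered");
static_assert(compare_quals(kRestrict, kRestrict) == 0, "identical sets");
```

The unordered case matters: const versus volatile gives no preference in either direction, so the comparison falls through to later tie-breakers.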
User-Defined Conversion Sequence
When no standard conversion exists, try_conversion_function_match_full (sub_6D0F50, 1,085 lines) searches for a user-defined conversion path. It considers:
- Converting constructors on the target type (non-explicit constructors that accept the source type).
- Conversion functions on the source type (operator T() members).
For each candidate conversion, it checks:
try_conversion_function_match_full (sub_6D0F50, 1085 lines)
Input: source_class_type, dest_type, context_flags
Output: selected conversion function/constructor, or AMBIGUOUS, or NONE
log("considering conversion functions for [%lu.%d]")
if (source is not class type)
error("try_conversion_function_match_full: source not class")
// Iterate conversion function candidates of source class
for each conv_func in source_class.conversion_functions: // via sub_6BA230
return_type = conv_func.return_type
if (return_type is compatible with dest_type) {
// Check standard conversion from return_type to dest_type
post_ics = compute_standard_conversion_sequence(
return_type, dest_type)
if (post_ics != NO_CONVERSION)
add_to_viable(conv_func, post_ics)
}
// Also check for converting constructors on dest type
conversion_from_class_possible( // sub_6D28C0/6D2ED0
source_class_type, dest_type, &viable_set)
// Select best among viable user-defined conversions
if (viable_set has 1 candidate)
return viable_set[0]
if (viable_set has multiple candidates)
return best-of or AMBIGUOUS
return NONE
The conversion_from_class_possible functions (sub_6D28C0 252 lines, sub_6D2ED0 293 lines) emit full debug traces with entry/exit messages:
Entering conversion_from_class_possible, dest_type = <type>
Candidate functions list: ...
Leaving conversion_from_class_possible: <result>
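The three possible outcomes of a user-defined conversion search (a selected conversion, AMBIGUOUS, or NONE) have direct counterparts in source-level C++. The sketch below is illustrative: Celsius, Kelvin, sink, route, and can_route are invented names. It uses a SFINAE probe to observe ambiguity without triggering a hard compile error.

```cpp
#include <type_traits>
#include <utility>

struct Celsius {
    double degrees;
    constexpr operator double() const { return degrees; }  // conversion function
};

struct Kelvin {
    double value;
    constexpr Kelvin(Celsius c) : value(c.degrees + 273.15) {}  // converting constructor
};

// Both overloads use the SAME conversion function (operator double), so the
// trailing standard conversion decides: identity beats double -> long.
constexpr int sink(double) { return 1; }
constexpr int sink(long)   { return 2; }

// These overloads need DIFFERENT user-defined conversions for a Celsius
// argument (operator double vs. Kelvin's constructor).
int route(long);
int route(Kelvin);

// Probe: true if route(T) is a well-formed call.
template <class T, class = void>
struct can_route : std::false_type {};
template <class T>
struct can_route<T, decltype(void(route(std::declval<T>())))> : std::true_type {};

static_assert(sink(Celsius{20.0}) == 1, "second standard conversion breaks the tie");
static_assert(can_route<int>::value, "int -> long is a plain standard conversion");
static_assert(!can_route<Celsius>::value, "two unrelated user-defined conversions: ambiguous");
```

Two user-defined conversion sequences are comparable only when they go through the same conversion function or constructor, which is why sink resolves while route does not.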
The Main Overload Resolution Driver
sub_6CE6E0 (1,246 lines) is the central driver function -- "THE MONSTER" -- that coordinates the overload resolution pipeline. It is called from determine_selector_match_level and from the candidate evaluation logic, acting as the type-comparison and scoring backbone that feeds the higher-level selection functions.
overload_resolution_driver (sub_6CE6E0, 1246 lines)
// This function performs the detailed type comparison and conversion
// sequence computation that determines how well a candidate matches.
//
// It is called per-candidate, per-argument-position from the viability
// checker and the candidate evaluator.
// 1. Quick identity check
if (arg_type == param_type)
return EXACT_MATCH
// 2. Chase typedef chains to canonical types
arg_canon = canonical_type(arg_type)
param_canon = canonical_type(param_type)
// 3. Apply lvalue-to-rvalue conversion
if (param expects rvalue && arg is lvalue)
apply lvalue_to_rvalue conversion, record in ICS
// 4. Apply array-to-pointer / function-to-pointer decay
if (arg is array) convert to pointer-to-element
if (arg is function) convert to pointer-to-function
// 5. Check for standard conversions (integral promotion, float promotion,
// integral conversion, floating conversion, pointer conversion,
// pointer-to-member conversion, boolean conversion)
std_conv = find_applicable_standard_conversion(arg_canon, param_canon)
if (std_conv != NONE)
return std_conv with rank
// 6. Check for qualification conversion (add const/volatile)
qual_conv = check_qualification_conversion(arg_canon, param_canon)
if (qual_conv)
return EXACT_MATCH with qual adjustment
// 7. Check derived-to-base conversion
if (is_class(arg_canon) && is_class(param_canon)) {
if (is_derived_from(arg_canon, param_canon))
return CONVERSION_RANK with derived-to-base marker
}
// 8. No standard conversion found
return NO_CONVERSION
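Steps 1, 5, and 7 of the outline above can be approximated with standard type traits. The sketch below is a rough model under stated assumptions, not a reconstruction of sub_6CE6E0: it ignores references, cv-qualifiers, decay, and pointer conversions, and the names Rank, is_promotion, and classify are hypothetical.

```cpp
#include <type_traits>
#include <utility>

enum class Rank { Exact, Promotion, Conversion, None };

// Primary template: non-arithmetic sources never promote.
template <class From, class To, bool = std::is_arithmetic<From>::value>
struct is_promotion : std::false_type {};

// Unary plus applies the integral promotions, so its result type reveals
// what From promotes to; float -> double is the one floating promotion.
template <class From, class To>
struct is_promotion<From, To, true>
    : std::integral_constant<
          bool, !std::is_same<From, To>::value &&
                    (std::is_same<decltype(+std::declval<From>()), To>::value ||
                     (std::is_same<From, float>::value &&
                      std::is_same<To, double>::value))> {};

template <class From, class To>
constexpr Rank classify() {
    return std::is_same<From, To>::value   ? Rank::Exact
         : is_promotion<From, To>::value   ? Rank::Promotion
         : ((std::is_arithmetic<From>::value && std::is_arithmetic<To>::value) ||
            std::is_base_of<To, From>::value)
                                           ? Rank::Conversion
                                           : Rank::None;
}

struct Base {};
struct Derived : Base {};

static_assert(classify<int, int>() == Rank::Exact, "identity");
static_assert(classify<short, int>() == Rank::Promotion, "integral promotion");
static_assert(classify<float, double>() == Rank::Promotion, "floating promotion");
static_assert(classify<int, double>() == Rank::Conversion, "floating-integral conversion");
static_assert(classify<Derived, Base>() == Rank::Conversion, "derived-to-base");
static_assert(classify<int*, double>() == Rank::None, "no standard conversion");
```

The checks are ordered from best rank to worst, mirroring the early-return structure of the outline.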
Candidate Evaluation Function
sub_6C4C00 (1,044 lines) is the candidate evaluation function -- it scores each candidate by computing the full set of implicit conversion sequences across all arguments and produces the data that compare_candidates uses.
evaluate_candidate (sub_6C4C00, 1044 lines)
Input: candidate F, argument list args[], match_context
Output: per-argument ICS array, overall viability
for each argument position i:
// Compute the implicit conversion sequence
ics = overload_resolution_driver( // sub_6CE6E0
args[i].type, F.params[i].type, flags)
if (ics == NO_CONVERSION) {
// Try user-defined conversion
udc = try_user_defined_conversion(args[i].type, F.params[i].type)
if (udc == NONE)
mark F as non-viable for position i
return NON_VIABLE
ics = user_defined_ics(udc)
}
// Record the ICS for this position
F.arg_summaries[i] = ics
// Compute overall match quality
F.match_level = worst(F.arg_summaries[0..n-1])
return VIABLE
Candidate Iteration
try_overloaded_function_match (sub_6E4FA0, 633 lines, and variant sub_6E5B20, 367 lines) iterates the overload set and calls determine_function_viability for each candidate.
try_overloaded_function_match (sub_6E4FA0, 633 lines)
Input: overload_set, arg_list, context
Output: viable_candidates[]
log("try_overloaded_function_match")
// Traverse the overload set
cursor = overload_set.head
while (cursor != NULL): // via sub_6BA230
candidate = cursor.function
log("try_overloaded_function_match: considering %s",
candidate.name) // via sub_5B72C0
// Set up traversal symbol for template deduction
set_overload_set_traversal_symbol(cursor)
// Check viability
viable = determine_function_viability( // sub_6E2040
candidate, arg_list, context)
if (viable) {
add candidate to viable_candidates[]
record conversion summaries
}
cursor = cursor.next
Operator Overloading
Operator overloading resolution follows a specialized path because it must consider both user-defined operators AND synthesized built-in operator candidates.
Entry Point: select_overloaded_operator
sub_6EF7A0 (2,174 lines) is the master entry point for operator overloading. It is called from the expression parser whenever an operator expression involves a class-type operand.
select_overloaded_operator / check_for_operator_overloading
(sub_6EF7A0, 2174 lines)
Input: operator_kind, lhs_operand, rhs_operand (if binary), context
Output: selected function (user-defined or built-in), or use-builtin flag
log("Entering check_for_operator_overloading")
// Guard: dependent operands => defer
if (lhs is dependent || rhs is dependent)
log("check_for_operator_overloading: dep operand")
return DEPENDENT
// Step 1: Collect user-defined operator candidates
// Search member operators of lhs class
// Search non-member operators via name lookup + ADL
user_candidates = collect_user_operator_candidates(
operator_kind, lhs, rhs)
// Step 2: Generate built-in operator candidates
builtin_candidates = generate_builtin_candidates( // sub_6CD010
operator_kind, lhs.type, rhs.type)
// Step 3: Combine candidate sets
combined = user_candidates + builtin_candidates
// Step 4: Run standard overload resolution on combined set
result = select_overloaded_function( // sub_6E6400
combined, [lhs, rhs], OPERATOR_CONTEXT)
if (result is a built-in candidate) {
// Adjust operands for built-in semantics
adjust_operand_for_builtin_operator( // sub_6E0E50
lhs, rhs, operator_kind)
return USE_BUILTIN
}
log("Leaving f_check_for_operator_overloading")
return result.function
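The competition between user-defined and synthesized built-in candidates is visible in ordinary C++. In the sketch below (illustrative only; Meters is an invented type), the same + token resolves to the user-defined operator in one expression and to the built-in int + int in the other, which corresponds to the USE_BUILTIN outcome in the pseudocode.

```cpp
#include <type_traits>

struct Meters {
    int v;
    constexpr operator int() const { return v; }  // enables built-in candidates
};

constexpr Meters operator+(Meters a, Meters b) { return Meters{a.v + b.v}; }

// Both operands are Meters: the user-defined operator is an exact match,
// while every built-in candidate needs a user-defined conversion.
static_assert(std::is_same<decltype(Meters{2} + Meters{3}), Meters>::value,
              "user-defined operator+ selected");

// Right operand is int: int cannot convert to Meters, so the user-defined
// operator is not viable and built-in operator+(int, int) wins.
static_assert(std::is_same<decltype(Meters{2} + 1), int>::value,
              "built-in operator+ selected");
static_assert((Meters{2} + 1) == 3, "left operand converted through operator int");
```

In the second case the left operand is converted through operator int before the built-in addition is applied, the kind of adjustment the text attributes to adjust_operand_for_builtin_operator.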
Built-in Operator Candidate Generation
sub_6CD010 (752 lines) generates synthetic candidate functions representing built-in operators. It uses a type classification code scheme where each type category is encoded as a single character.
Type Classification Codes
| Code | Meaning | Query Function |
|---|---|---|
| A / a | Arithmetic type | sub_7A7590 (is_arithmetic) |
| B | Boolean type | is_bool |
| b | Boolean-equivalent | is_pointer/bool |
| C | Class type | sub_7A8A30 (is_class) |
| D / I / i | Integer/integral type | sub_7A71E0 (is_integral) |
| E | Enum type | sub_7A70F0 (is_enum) |
| F | Pointer-to-function | is_function_pointer |
| H | Handle type (CLI) | is_handle |
| M | Pointer-to-member | sub_7A8D90 (is_member_pointer) |
| N | nullptr_t | is_nullptr |
| O | Pointer-to-object | is_object_pointer |
| P | Pointer (any) | is_pointer |
| S | Scoped enum | is_scoped_enum |
| h | Handle-to-CLI-array | is_handle_array |
| n | Non-bool arithmetic | is_non_bool_arithmetic |
The function matches_type_code (sub_6BECA0) dispatches on these codes to check whether an operand matches a candidate pattern. The function name_for_type_code (sub_6BE4A0, 67 lines) converts codes to human-readable strings for diagnostics (e.g., A becomes "arithmetic").
Candidate Pattern Matching
try_builtin_operands_match (sub_6ED2A0, 812 lines) matches operands against built-in operator patterns. The patterns are encoded as strings like "A;P" where each character is a type code and ; separates operand positions.
try_builtin_operands_match (sub_6ED2A0, 812 lines)
Input: operator_kind, pattern_string, operand types
Output: match result
log("try_builtin_operands_match: considering %s", pattern_string)
for i in 0..num_operands-1:
code = pattern_string[i] // after skipping separators
log("try_builtin_operands_match: operand %d", i)
if (!matches_type_code(operand[i].type, code))
log("try_builtin_operands_match: ran off pattern")
return NO_MATCH
return MATCH with conversion cost
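The pattern-string scheme can be modelled compactly. In the sketch below, only the code letters and the ; separator are taken from the text above; the Kind enum, the function names, and the exact membership of each category are assumptions made for illustration, and only a subset of the codes is covered. It requires C++14.

```cpp
// Hypothetical operand classification, far coarser than the real type system.
enum class Kind { Bool, Int, Float, Enum, Pointer, Class };

// Does an operand of kind k satisfy the single-character type code?
constexpr bool matches_code(Kind k, char code) {
    switch (code) {
        case 'A': case 'a':
            return k == Kind::Bool || k == Kind::Int || k == Kind::Float;
        case 'D': case 'I': case 'i':
            return k == Kind::Bool || k == Kind::Int;
        case 'n': return k == Kind::Int || k == Kind::Float;
        case 'B': return k == Kind::Bool;
        case 'E': return k == Kind::Enum;
        case 'P': return k == Kind::Pointer;
        case 'C': return k == Kind::Class;
        default:  return false;  // codes not modelled in this sketch
    }
}

// Walks an "A;P"-style pattern: one code per operand, ';' between operands.
constexpr bool matches_pattern(const char* pattern, const Kind* operands, int count) {
    int i = 0;
    for (const char* p = pattern; *p != '\0'; ++p) {
        if (*p == ';') continue;
        if (i >= count || !matches_code(operands[i], *p)) return false;
        ++i;
    }
    return i == count;  // the pattern must cover every operand
}

constexpr Kind kIntPtr[]   = {Kind::Int, Kind::Pointer};
constexpr Kind kClassPtr[] = {Kind::Class, Kind::Pointer};

static_assert(matches_pattern("A;P", kIntPtr, 2), "arithmetic, then pointer");
static_assert(!matches_pattern("A;P", kClassPtr, 2), "class is not arithmetic");
static_assert(!matches_pattern("A", kIntPtr, 2), "pattern shorter than operand list");
```

A real implementation would also attach a conversion cost to each match, as the pseudocode's final line indicates; this model only answers yes or no.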
try_conversions_for_builtin_operator (sub_6EE340, 1,058 lines) contains a large switch over operator kinds that selects the appropriate type pattern tables. It checks dword_126EF68 for C++17 features (>= 201703) and dword_126EFB4 for language mode.
Special Member Function Selection
Overload resolution for special member functions uses dedicated entry points that share the same underlying machinery but provide specialized candidate sets and matching rules.
Copy/Move Constructor Selection
select_overloaded_copy_constructor (sub_6DBEA0, 625 lines)
Input: class_type, source_operand, context_flags
Output: selected constructor symbol, or NULL
log("Entering select_overloaded_copy_constructor, class_type = %s",
class_type.name)
// Iterate all constructors of the class
for each ctor in class_type.constructors: // via sub_6BA230
log("select_overloaded_copy_constructor: considering %s",
ctor.name) // via sub_5B72C0
// Check copy parameter match
match = determine_copy_param_match( // sub_6DBAC0
ctor, source_operand)
// determine_copy_param_match calls:
// sub_6CE6E0 (type comparison)
// sub_6BE5D0 (value category check)
// sub_6DB6E0 (deduce_one_parameter for template ctors)
if (match.viable) {
if (match better than current_best)
current_best = ctor
// Check for ambiguity
if (match ties with the match recorded for current_best && ctor != current_best)
ambiguous = true
}
log("Leaving select_overloaded_copy_constructor, cctor_sym = %s",
current_best.name)
return current_best
The value category check (sub_6BE5D0, copy_function_not_callable_because_of_arg_value_category, 39 lines) is critical for C++11 move semantics: it rejects copy constructors when the source is an rvalue and a move constructor is available, and vice versa.
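The effect of argument value category on constructor selection is easy to observe at the language level. The sketch below is illustrative only (Tracker and the two helper functions are invented names): each constructor records how the object was created. It requires C++14.

```cpp
#include <utility>

struct Tracker {
    int how = 0;  // 0 = default-constructed, 1 = copied, 2 = moved
    constexpr Tracker() = default;
    constexpr Tracker(const Tracker&) : how(1) {}
    constexpr Tracker(Tracker&&) : how(2) {}
};

// Source is an lvalue: Tracker&& cannot bind, only the copy constructor is viable.
constexpr int from_lvalue() {
    Tracker a;
    Tracker b(a);
    return b.how;
}

// Source is an xvalue: both constructors are viable, and binding an rvalue
// reference to an rvalue ranks better than binding const Tracker&.
constexpr int from_rvalue() {
    Tracker a;
    Tracker b(std::move(a));
    return b.how;
}

static_assert(from_lvalue() == 1, "copy constructor selected");
static_assert(from_rvalue() == 2, "move constructor selected");
```

Note the asymmetry: for an lvalue source the move constructor is not viable at all, whereas for an rvalue source the copy constructor is viable but loses on ranking.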
Default Constructor Selection
select_overloaded_default_constructor (sub_6E9080, 358 lines)
Input: class_type
Output: selected constructor symbol
log("Entering select_overloaded_default_constructor, class_type = %s",
class_type.name)
// Collect zero-argument constructors
// Check for default arguments (a 1-param ctor with default is a default ctor)
// Run standard overload resolution with empty argument list
log("Leaving select_overloaded_default_constructor, ctor_sym = %s",
result.name)
return result
Assignment Operator Selection
select_overloaded_assignment_operator (sub_6DD600, 492 lines)
Input: class_type, rhs_operand
Output: selected assignment operator symbol
log("Entering select_overloaded_assignment_operator, class_type = %s",
class_type.name)
// Iterate assignment operator candidates
for each assign_op in class_type.assignment_operators: // via sub_6BA230
log("select_overloaded_assignment_operator: considering %s",
assign_op.name)
// Check parameter match (similar to copy constructor)
// ...
log("Leaving select_overloaded_assignment_operator, assign_sym = %s",
result.name)
return result
Copy Elision
Copy elision is handled by handle_elided_copy_constructor_no_guard (two variants: sub_6DCD60 166 lines and sub_6DD180 169 lines). Where elision is permitted but not guaranteed (every elision before C++17, and cases such as named return value optimization afterwards), the compiler must still verify that the copy/move constructor would be callable: the constructor is selected via select_overloaded_copy_constructor but never actually invoked. The wrapper arg_copy_can_be_done_via_constructor (sub_6DCC00, 55 lines) performs this check.
List Initialization
prep_list_initializer (sub_6D7C80, 2,119 lines) implements C++11 brace-enclosed initializer list resolution. It is one of the largest functions in overload.c, reflecting the combinatorial complexity of list initialization.
prep_list_initializer (sub_6D7C80, 2119 lines)
Input: init_list (braced expression list), target_type, context
Output: converted initializer expression
// The algorithm (per [dcl.init.list]):
//
// 1. If T has an initializer_list<X> constructor and the braced-init-list
// can be converted to initializer_list<X>, use that constructor.
//
// 2. If T is an aggregate, perform aggregate initialization.
//
// 3. If T has constructors, overload resolution selects a constructor
// with the elements of the braced-init-list as arguments.
//
// 4. If T is a reference, bind to a temporary or element.
//
// At each step, check for narrowing conversions (C++11 requirement).
// Gate: C++11 required
if (dword_126EF68 < 201103) // std_version < C++11
return LEGACY_PATH
// Step 1: Check for initializer_list constructor
init_list_ctor = find_initializer_list_constructor( // sub_6DFEC0
target_type, element_type)
if (init_list_ctor) {
init_list_obj = make_initializer_list_object( // sub_6DFEC0
init_list, element_type)
return set_up_for_constructor_call(init_list_ctor,
init_list_obj)
}
// Step 2: Aggregate initialization (recursive for nested braces)
if (is_aggregate(target_type)) {
for each element in init_list:
// Recursively call prep_list_initializer for nested braces
prep_list_initializer(element, member_type, ...) // recursive
return aggregate_init_expr
}
// Step 3: Constructor overload resolution
result = select_overloaded_function( // sub_6E6400
target_type.constructors, init_list.elements, LIST_INIT)
// Step 4: Check for narrowing
check_narrowing_conversions(init_list, result)
return result
The find_initializer_list_constructor / make_initializer_list_object function (sub_6DFEC0, 692 lines) handles std::initializer_list<T> construction. It iterates constructors to find one taking initializer_list<T> and sets up the backing array via set_overload_set_traversal_symbol.
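The precedence of initializer_list constructors is the most visible consequence of this algorithm. A minimal sketch, with invented type names Box and Point, requiring C++14:

```cpp
#include <initializer_list>

struct Box {
    int which;
    constexpr Box(int, int) : which(1) {}
    constexpr Box(std::initializer_list<int>) : which(2) {}
};

struct Point { int x, y; };  // aggregate: no constructors involved

static_assert(Box(1, 2).which == 1, "parentheses: ordinary overload resolution");
static_assert(Box{1, 2}.which == 2, "braces: initializer_list constructor tried first");
static_assert(Point{3, 4}.y == 4,   "braces on an aggregate: aggregate initialization");
```

Box(1, 2) and Box{1, 2} pass the same arguments yet select different constructors, which is why list initialization needs its own entry point rather than reusing the plain call path.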
Class Template Argument Deduction (CTAD)
C++17 CTAD is implemented by deduce_class_template_args (sub_6E8300, 285 lines). CTAD works by synthesizing a set of "deduction guides" -- function-like entities derived from the class template's constructors -- and running overload resolution on them.
deduce_class_template_args (sub_6E8300, 285 lines)
Input: class_template, constructor_arguments, context
Output: deduced template arguments
// Step 1: Generate implicit deduction guides from constructors
// For each constructor C(P1, P2, ...) of class template T<A, B, ...>:
// Create guide: T(P1, P2, ...) -> T<deduced-A, deduced-B, ...>
// Step 2: Add explicit deduction guides (user-provided)
// Step 3: Run overload resolution among all guides
selected_guide = select_overloaded_function( // sub_6E6400
deduction_guides, constructor_args, CTAD_CONTEXT)
// Step 4: Extract deduced template arguments from selected guide
return selected_guide.deduced_args
CTAD delegates entirely to select_overloaded_function for the actual resolution -- the deduction guides are treated as ordinary function candidates with synthesized parameter types.
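Because guides are ranked like ordinary functions, the usual tie-breakers apply to them. The sketch below (C++17 or later; Wrapper is an invented type) shows an implicit guide winning on conversion rank in one case, and a user-provided non-template guide winning a tie in the other.

```cpp
#include <type_traits>

template <class T>
struct Wrapper {
    T value;
    constexpr Wrapper(T v) : value(v) {}
    // Implicit guide generated from this constructor:
    //   template <class T> Wrapper(T) -> Wrapper<T>;
};

// Explicit (user-provided) deduction guide.
Wrapper(float) -> Wrapper<double>;

// int argument: the implicit guide is an exact match, while the explicit
// guide needs int -> float, so the implicit guide wins.
static_assert(std::is_same<decltype(Wrapper(42)), Wrapper<int>>::value,
              "implicit guide selected");

// float argument: both guides are exact matches, and the non-template
// explicit guide is preferred.
static_assert(std::is_same<decltype(Wrapper(1.5f)), Wrapper<double>>::value,
              "explicit guide selected");
```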
Auto Type Deduction
deduce_auto_type (sub_6DB010, 314 lines) implements C++11 auto type deduction, which is structurally similar to template argument deduction. It handles the special case of auto x = {1, 2, 3} where the deduced type is std::initializer_list<int>.
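The braced-initializer special case, along with the template-like behaviour of auto&&, can be checked with decltype. The variable names below are arbitrary.

```cpp
#include <initializer_list>
#include <type_traits>

const auto braced = {1, 2, 3};  // special case: std::initializer_list<int>
const auto plain  = 1;          // ordinary deduction: int

int counter = 0;
auto&& lref = counter;          // forwarding reference bound to an lvalue: int&

static_assert(std::is_same<decltype(braced), const std::initializer_list<int>>::value,
              "braced initializer deduces initializer_list");
static_assert(std::is_same<decltype(plain), const int>::value, "plain deduction");
static_assert(std::is_same<decltype(lref), int&>::value, "reference collapsing, as in templates");
```

A function template parameter T would fail to deduce from {1, 2, 3}, so the first case is the point where auto deduction departs from template argument deduction.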
Conversion Infrastructure
Reference Binding
prep_reference_initializer_operand (sub_6D47B0, 1,121 lines) handles reference initialization, which has its own overload-resolution sub-algorithm for selecting the correct binding path:
- Direct binding. If the initializer is an lvalue of the right type (or derived), bind directly.
- Conversion-through-temporary. If a user-defined conversion exists, create a temporary and bind the reference to it.
- Direct reference binding check. conversion_for_direct_reference_binding_possible (sub_6D4610, 49 lines) checks whether direct binding is possible.
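The preference between binding paths shows up whenever a function is overloaded on reference kind. A minimal sketch, with invented names which_ref and stored:

```cpp
constexpr int which_ref(const int&) { return 1; }
constexpr int which_ref(int&&)      { return 2; }

constexpr int stored = 7;  // a named object, so an lvalue of type const int

// Lvalue argument: int&& cannot bind, so const int& binds directly.
static_assert(which_ref(stored) == 1, "direct binding to const int&");

// Prvalue argument: both overloads are viable; binding an rvalue reference
// to an rvalue is the better conversion sequence.
static_assert(which_ref(7) == 2, "rvalue reference preferred");
```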
Operand Conversion
After overload resolution selects a function, the arguments must be physically converted to match the parameter types:
| Function | Lines | Role |
|---|---|---|
| sub_6D6650 (user_convert_operand) | 427 | Applies user-defined conversion (constructor call or conversion function call) |
| sub_6E1430 (convert_operand_into_temp) | 418 | Creates a temporary and converts operand into it |
| sub_6E1C40 (prep_argument variant 1) | 69 | Prepares argument for function call |
| sub_6E1E40 (prep_argument variant 2) | 69 | Simplified argument preparation |
| sub_6EB1C0 (adjust_overloaded_function_call_arguments) | 249 | Post-resolution argument adjustment |
| sub_6E0E50 (adjust_operand_for_builtin_operator) | 199 | Adjusts operands for built-in operator semantics |
The high-level call setup function select_and_prepare_to_call_overloaded_function (sub_6EB550, 392 lines) combines overload resolution with argument preparation in a single entry point.
Dynamic Initialization
determine_dynamic_init_for_class_init (sub_6DEBC0, 679 lines) determines whether a class object initialization requires a runtime (dynamic) initialization routine rather than static initialization. It checks whether the constructor is trivial, whether the initializer is a constant expression, and whether the target requires dynamic dispatch.
Conditional Operator
conditional_operator_conversion_possible (sub_6EBFC0, 326 lines) handles the special overload resolution for the ternary conditional operator (? :), which has unique type-determination rules involving common type computation between the second and third operands.
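The type-determination rules in question can be checked with a few static assertions. The sketch below is standard C++17; `Celsius` is a hypothetical class introduced only to show the case where exactly one operand can be converted to the type of the other.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

struct Celsius {
    double v;
    operator double() const { return v; }   // hypothetical conversion function
};

// Arithmetic operands: the usual arithmetic conversions select double.
static_assert(std::is_same_v<decltype(true ? 1 : 2.0), double>);

// Two lvalues of the same type: the result is an lvalue of that type.
static_assert(std::is_same_v<
    decltype(true ? std::declval<int&>() : std::declval<int&>()), int&>);

// Class vs. arithmetic: only Celsius -> double is possible, so the result is double.
static_assert(std::is_same_v<decltype(true ? Celsius{1.0} : 2.0), double>);

inline double pick(bool c, Celsius t) { return c ? t : 0.0; }
```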
Ambiguity Diagnostics
When overload resolution fails due to ambiguity, dedicated diagnostic functions produce the error messages:
| Function | Lines | Role |
|---|---|---|
sub_6D7040 (diagnose_overload_ambiguity standalone) | 191 | Formats and emits ambiguity diagnostic with candidate list |
sub_6D35E0 (user_defined_conversion_possible with diagnosis) | 399 | Handles ambiguity in user-defined conversion resolution |
The diagnostic output uses sub_4F59D0, sub_4F5C10, sub_4F5CF0, and sub_4F5D50 for type-to-string formatting, producing messages in the format:
ambiguous overload for 'operator+(A, B)':
candidate: operator+(int, int)
candidate: operator+(A::operator int(), int)
Missing Sentinel Warning
warn_if_missing_sentinel (sub_6E9C60, 1,170 lines) is a large function that checks for missing sentinel arguments (NULL terminators) in variadic function calls. It references multiple CUDA extension flags (byte_126E349, byte_126E358, byte_126E3C0, byte_126E3C1, byte_126E481) because CUDA functions have different variadic conventions.
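To show what a sentinel protects, here is a minimal variadic function that relies on a trailing null pointer to find the end of its argument list. It is an illustration only; `count_strings` is a hypothetical function, and the sketch does not reproduce the checks performed by sub_6E9C60.

```cpp
#include <cassert>
#include <cstdarg>

// Counts the C strings passed before the terminating null pointer.
// Omitting the terminator would make the loop read past the real arguments,
// which is the mistake a missing-sentinel warning is meant to catch.
inline int count_strings(const char* first, ...) {
    va_list ap;
    va_start(ap, first);
    int n = 0;
    for (const char* s = first; s != nullptr; s = va_arg(ap, const char*))
        ++n;
    va_end(ap);
    return n;
}
```

A correct call ends with an explicitly typed null pointer, for example `count_strings("a", "b", static_cast<const char*>(nullptr))`.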
CUDA Execution Space Interaction
CUDA introduces an additional dimension to overload resolution: execution space compatibility. In standard C++, any visible function is a candidate. In CUDA, a candidate from the wrong execution space may be excluded or penalized.
How Execution Spaces Affect Candidates
The CUDA execution space interaction with overload resolution happens at two levels:
Level 1: Post-resolution validation (expr.c). After overload resolution selects the best viable function, check_cross_execution_space_call (sub_505720, 4KB) validates that the selected function is callable from the current execution context. If the call is illegal (e.g., calling a __device__-only function from __host__ code), one of errors 3462--3465 is emitted. This check runs AFTER overload resolution, not during candidate filtering.
Level 2: Overload-internal CUDA awareness (overload.c). Within overload.c itself, the CUDA extensions flag byte_126E349 gates CUDA-specific behavior in several functions:
- try_conversion_function_match_full (sub_6D0F50): Checks byte_126E349 when evaluating whether a conversion function is viable. In CUDA mode, conversion functions from the wrong execution space may be excluded from consideration during the user-defined conversion search.
- warn_if_missing_sentinel (sub_6E9C60): Uses byte_126E349 and byte_126E358 to adjust sentinel checking behavior for CUDA-annotated variadic functions.
The key architectural decision is that, apart from the byte_126E349-gated paths listed above, CUDA does NOT filter candidates during Phase 1 (candidate collection) or Phase 2 (viability checking) of overload resolution proper. Instead, execution-space validation is a separate pass that runs after the standard C++ overload algorithm completes. This preserves EDG's clean separation between the standard-conforming overload engine and NVIDIA's CUDA extensions.
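The select-then-validate split can be sketched as two independent functions. Everything here is a hypothetical model (the `Space` enum, the `Candidate` struct, and the integer rank are invented for illustration); only the error numbers 3462 and 3464 come from this page, and HD callers and __global__ callees are deliberately not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Space { Host, Device, HostDevice };

struct Candidate {
    int   rank;    // lower is better; stands in for conversion-sequence ranking
    Space space;   // execution space of the candidate
};

// Selection step: never looks at the execution space.
// Returns the index of the best-ranked candidate, or -1 for an empty set.
inline int select_best(const std::vector<Candidate>& set) {
    int best = -1;
    for (std::size_t i = 0; i < set.size(); ++i)
        if (best < 0 || set[i].rank < set[static_cast<std::size_t>(best)].rank)
            best = static_cast<int>(i);
    return best;
}

// Validation step: runs afterwards on the single selected callee.
// Returns 0 when the call is allowed, otherwise the documented error number.
inline int validate_call(Space caller, Space callee) {
    if (callee == Space::HostDevice) return 0;
    if (caller == Space::Host   && callee == Space::Device) return 3462;
    if (caller == Space::Device && callee == Space::Host)   return 3464;
    return 0;   // remaining combinations are outside this sketch
}
```

With this split, a better-ranked __device__ candidate still wins over a worse-ranked __host__ one in host code, and the call is then rejected by the validation step rather than silently redirected.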
Cross-Space Validation
The execution space is encoded in the entity node at offset +182 as a bitfield:
| Bit Pattern | Meaning |
|---|---|
(byte & 0x30) == 0x20 | __device__ only |
(byte & 0x60) == 0x20 | __host__ only |
(byte & 0x60) == 0x40 | __global__ |
(byte & 0x30) == 0x30 | __host__ __device__ |
The cross-space checker (sub_505720) compares the caller's execution space with the callee's and emits:
| Error | Condition |
|---|---|
| 3462 | __device__ called from __host__ |
| 3463 | Variant of 3462 for HD context |
| 3464 | __host__ called from __device__ |
| 3465 | Variant of 3464 with __device__ note |
| 3508 | __global__ called from wrong context |
A template-instantiation variant (sub_505B40, check_cross_space_call_in_template) performs the same checks during template instantiation.
Debug Tracing
Overload resolution includes extensive debug tracing controlled by dword_126EFC8. When enabled, functions emit trace output via sub_48AFD0 / sub_48AE00 to the stream at qword_106B988:
Entering select_overloaded_function with ...
try_overloaded_function_match: considering foo(int)
determine_function_viability: arg 0
(pass 2)
try_overloaded_function_match: considering foo(double)
determine_function_viability: arg 0
(pass 2)
comparing candidates: foo(int) vs foo(double)
Leaving select_overloaded_function: foo(int)
The trace format [%lu.%d] is used in conversion function matching to identify candidates by internal ID.
Overload Set Management
Overload sets are managed via two key functions in the memory management subsystem (sub_6BA0D0, sub_6BA230) and three helpers inside overload.c's own address range:
| Function | Role |
|---|---|
sub_6BA0D0 | Allocate a new overload set entry |
sub_6BA230 | Iterate/traverse an overload set (linked list walk) |
sub_6EC650 | Overload set traversal utility (212 lines) |
sub_6ECA20 | Overload set construction from multiple sources (137 lines) |
sub_6ECCE0 | Overload set initialization wrapper (23 lines) |
The linked-list representation means candidate iteration is O(n) per traversal, but overload sets are typically small (< 100 candidates), so this is not a performance concern.
Complete Function Map
| Address | Size (lines) | Identity | Confidence |
|---|---|---|---|
0x6BE4A0 | 67 | name_for_type_code | VERY HIGH |
0x6BE5D0 | 39 | copy_function_not_callable_because_of_arg_value_category | VERY HIGH |
0x6BE6C0 | 127 | compare_qualification_conversions | HIGH |
0x6BE990 | 68 | set_arg_summary_for_user_conversion | VERY HIGH |
0x6BEAF0 | 30 | set_explicit_flag_on_param_list | HIGH |
0x6BEB60 | 69 | find_conversion_function | VERY HIGH |
0x6BECA0 | 70 | matches_type_code | VERY HIGH |
0x6BEE10 | 375 | standard_conversion_sequence | HIGH |
0x6BF610 | 80 | check_user_defined_conversion | HIGH |
0x6BF710 | 163 | evaluate_conversion_for_argument | HIGH |
0x6BFA50 | 129 | process_builtin_operator_candidate | HIGH |
0x6BFD00 | 67 | name_for_overloaded_operator | HIGH |
0x6BFE40 | 48 | check_ambiguous_conversion | HIGH |
0x6BFF70 | 100 | compare_conversion_sequences | HIGH |
0x6C4C00 | 1,044 | candidate evaluation | HIGH |
0x6C5C90 | 386 | candidate scoring/ranking | MEDIUM |
0x6C8B70 | 418 | argument conversion computation | MEDIUM |
0x6C92B0 | 383 | template argument deduction for overloads | MEDIUM |
0x6CBC40 | 345 | implicit conversion sequence comparison | MEDIUM |
0x6CD010 | 752 | built-in operator candidate generation | HIGH |
0x6CE010 | 226 | operator overload candidate setup | MEDIUM |
0x6CE6E0 | 1,246 | overload resolution driver ("THE MONSTER") | HIGH |
0x6D03D0 | 170 | determine_selector_match_level (6-param) | HIGH |
0x6D0790 | 132 | determine_selector_match_level (4-param) | HIGH |
0x6D0A80 | 225 | selector_match_with_this_param | HIGH |
0x6D0F50 | 1,085 | try_conversion_function_match_full | HIGH |
0x6D28C0 | 252 | conversion_from_class_possible (9-param) | HIGH |
0x6D2ED0 | 293 | conversion_from_class_possible (10-param) | HIGH |
0x6D35E0 | 399 | user_defined_conversion_possible / diagnose_overload_ambiguity | HIGH |
0x6D3DC0 | 360 | conversion_possible | HIGH |
0x6D4610 | 49 | conversion_for_direct_reference_binding_possible | HIGH |
0x6D47B0 | 1,121 | prep_reference_initializer_operand | HIGH |
0x6D61F0 | 176 | reference init helper | MEDIUM |
0x6D6650 | 427 | user_convert_operand / set_up_for_conversion_function_call | HIGH |
0x6D7040 | 191 | diagnose_overload_ambiguity (standalone) | HIGH |
0x6D7410 | 239 | prep_conversion_operand | HIGH |
0x6D79E0 | 93 | conversion operand wrapper | MEDIUM |
0x6D7C80 | 2,119 | prep_list_initializer | HIGH |
0x6DACA0 | 154 | list init parameter deduction helper | MEDIUM |
0x6DB010 | 314 | deduce_auto_type | HIGH |
0x6DB6E0 | 236 | deduce_one_parameter | HIGH |
0x6DBAC0 | 175 | determine_copy_param_match | HIGH |
0x6DBEA0 | 625 | select_overloaded_copy_constructor | HIGH |
0x6DCC00 | 55 | arg_copy_can_be_done_via_constructor | HIGH |
0x6DCD60 | 166 | handle_elided_copy_constructor_no_guard (variant 1) | HIGH |
0x6DD180 | 169 | handle_elided_copy_constructor_no_guard (variant 2) | HIGH |
0x6DD600 | 492 | select_overloaded_assignment_operator | HIGH |
0x6DE110 | 31 | actualize_class_object_from_braced_init_list_for_bitwise_copy | HIGH |
0x6DE1D0 | 75 | full_adjust_class_object_type | HIGH |
0x6DE320 | 111 | set_up_for_constructor_call | HIGH |
0x6DE5A0 | 174 | temp_init_from_operand_full | HIGH |
0x6DE9E0 | 7 | temp_init_from_operand (wrapper) | HIGH |
0x6DE9F0 | 114 | find_top_temporary | HIGH |
0x6DEBC0 | 679 | determine_dynamic_init_for_class_init | HIGH |
0x6DF8C0 | 107 | conversion with dynamic init wrapper | MEDIUM |
0x6DFBF0 | 92 | convert and determine dynamic init helper | MEDIUM |
0x6DFEC0 | 692 | make_initializer_list_object / find_initializer_list_constructor | HIGH |
0x6E0E50 | 199 | adjust_operand_for_builtin_operator | HIGH |
0x6E1250 | 79 | argument preparation helper | MEDIUM |
0x6E1430 | 418 | convert_operand_into_temp | HIGH |
0x6E1C40 | 69 | prep_argument (5-param) | HIGH |
0x6E1E40 | 69 | prep_argument (4-param) | HIGH |
0x6E2040 | 2,120 | determine_function_viability | HIGH |
0x6E4FA0 | 633 | try_overloaded_function_match (variant 1) | HIGH |
0x6E5B20 | 367 | try_overloaded_function_match (variant 2) | HIGH |
0x6E61D0 | 121 | overload match wrapper | MEDIUM |
0x6E6400 | 1,483 | select_overloaded_function (20 params) | HIGH |
0x6E8300 | 285 | deduce_class_template_args (CTAD) | HIGH |
0x6E8890 | 199 | type comparison for overload | MEDIUM |
0x6E8E20 | 93 | overload candidate evaluation helper | MEDIUM |
0x6E9080 | 358 | select_overloaded_default_constructor | HIGH |
0x6E9750 | 281 | argument list builder | MEDIUM |
0x6E9C60 | 1,170 | warn_if_missing_sentinel | HIGH |
0x6EAF90 | 105 | node_for_arg_of_overloaded_function_call | HIGH |
0x6EB1C0 | 249 | adjust_overloaded_function_call_arguments | HIGH |
0x6EB550 | 392 | select_and_prepare_to_call_overloaded_function | HIGH |
0x6EBFC0 | 326 | conditional_operator_conversion_possible | HIGH |
0x6EC650 | 212 | overload set iterator | MEDIUM |
0x6ECA20 | 137 | overload set builder | MEDIUM |
0x6ECCE0 | 23 | overload set init wrapper | LOW |
0x6ECD70 | 160 | util.h insert operation | MEDIUM |
0x6ECFB0 | 193 | util.h insert variant | MEDIUM |
0x6ED2A0 | 812 | try_builtin_operands_match | HIGH |
0x6EE340 | 1,058 | try_conversions_for_builtin_operator | HIGH |
0x6EF7A0 | 2,174 | select_overloaded_operator / check_for_operator_overloading | HIGH |
Key Globals
| Global | Usage |
|---|---|
dword_126EFB4 | Language mode (2 = C++) |
dword_126EF68 | Language standard version (201103/201703/202301) |
dword_126EFA4 | GNU extensions enabled |
dword_126EFAC | Extended mode flag |
dword_126EFC8 | Debug trace enabled (controls overload trace output) |
dword_126EFCC | Debug output level |
qword_106B988 | Overload debug output stream |
qword_106B990 | Overload debug output stream (alternate) |
qword_12C6B30 | Overload candidate list |
byte_126E349 | CUDA extensions flag |
byte_126E358 | Extension flag (likely __CUDA_ARCH__-related) |
dword_106BEA8 | Overload configuration flag |
dword_106BEC0 | Overload configuration flag |
dword_106C2A8 | Used by selector match level |
dword_106C2B8 | Operator-related flag |
dword_106C2BC | Operator mode flag |
dword_106C104 | Operator configuration |
dword_106C124 | Operator configuration |
dword_106C140 | Operator configuration |
dword_106C16C | Operator configuration |
dword_126C5C4 | Template nesting depth |
dword_126C5E4 | Scope stack depth |
qword_126C5E8 | Scope stack base |
Template Engine
The template engine in cudafe++ is EDG 6.6's implementation of C++ template instantiation, argument deduction, partial specialization ordering, and the worklist-driven fixpoint loop that produces all needed template instantiations at translation-unit end. It lives primarily in templates.c (160+ functions at 0x7530C0--0x794D30) with supporting cross-TU correspondence logic in trans_corresp.c (0x796E60--0x79F9E0).
Template instantiation in a C++ compiler is fundamentally a deferred operation: the compiler parses template definitions, records their bodies in a declaration cache, and only instantiates when a concrete use forces it. EDG implements this with two pending worklists -- one for class templates, one for function/variable templates -- that accumulate entries during parsing and are drained by a fixpoint loop at the end of each translation unit. This page documents the complete instantiation pipeline from "entity added to worklist" through "instantiated body emitted into IL."
Key Facts
| Property | Value |
|---|---|
| Source file | templates.c (172 functions), trans_corresp.c (36 functions) |
| Address range | 0x7530C0--0x794D30 (templates), 0x796E60--0x79F9E0 (correspondence) |
| Fixpoint entry point | sub_78A9D0 (template_and_inline_entity_wrapup), 136 lines |
| Worklist walker | sub_78A7F0 (do_any_needed_instantiations), 72 lines |
| Should-instantiate gate | sub_774620 (should_be_instantiated), 326 lines |
| Function instantiation | sub_775E00 (instantiate_template_function_full), 839 lines |
| Class instantiation | sub_777CE0 (f_instantiate_template_class), 516 lines |
| Variable instantiation | sub_774C30 (instantiate_template_variable), 751 lines |
| Pending function/variable list | qword_12C7740 (linked list head) |
| Pending class list | qword_12C7758 (linked list head) |
| Function depth limit | qword_12C76E0 (max 255 = 0xFF) |
| Class depth limit | Per-type counter at type entry +56, via qword_106BD10 |
| Pending counter | sub_75D740 (increment) / sub_75D7C0 (decrement) |
| SSE state save | 4 xmmword registers for functions, 12 for classes |
| Instantiation modes | "none" / "all" / "used" / "local" |
| Fixpoint flag | dword_12C771C (set=1 when new work discovered, loop restarts) |
Instantiation Entry Structure
Each pending instantiation is represented as a linked-list node. The function/variable worklist uses entries with the following layout:
| Offset | Size | Field | Description |
|---|---|---|---|
+0 | 8 | entity | Primary symbol pointer |
+8 | 8 | next | Next entry in pending list |
+16 | 8 | inst_info | Instantiation info record (must be non-null) |
+24 | 8 | master_instance | Canonical template symbol |
+32 | 8 | actual_decl | Declaration in the instantiation context |
+40 | 8 | cached_decl | Cached declaration (for kind 7 / function-local) |
+64 | 8 | body_flags | Deferred/deleted function flags |
+72 | 8 | pre_computed_result | Result from prior instantiation attempt |
+80 | 1 | flags | Status bitfield (see below) |
Flags Byte at +80
| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | instantiated | Entity has been instantiated |
| 1 | 0x02 | not_needed | Entity was determined to not need instantiation |
| 3 | 0x08 | explicit_instantiation | From explicit template declaration |
| 4 | 0x10 | suppress_auto | Auto-instantiation suppressed (extern template) |
| 5 | 0x20 | excluded | Entity excluded from instantiation set |
| 7 | 0x80 | can_be_instantiated_checked | Pre-check already performed |
Flags Byte at +28 (on inst_info at +16)
| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | blocked | Instantiation blocked (dependency cycle) |
| 3 | 0x08 | debug_checked | Already checked by debug tracing path |
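The layout and flag tables above can be written down as a struct with compile-time offset checks. This is a hypothetical reconstruction: the field names follow the tables, every slot is modeled as a 64-bit integer rather than a typed pointer, the two slots at +48 and +56 are undocumented placeholders, and `is_pending` is an illustrative helper that does not correspond to a known function in the binary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct PendingEntry {
    std::uint64_t entity;               // +0
    std::uint64_t next;                 // +8
    std::uint64_t inst_info;            // +16
    std::uint64_t master_instance;      // +24
    std::uint64_t actual_decl;          // +32
    std::uint64_t cached_decl;          // +40
    std::uint64_t unknown_48;           // +48 (not documented)
    std::uint64_t unknown_56;           // +56 (not documented)
    std::uint64_t body_flags;           // +64
    std::uint64_t pre_computed_result;  // +72
    std::uint8_t  flags;                // +80
};

static_assert(offsetof(PendingEntry, inst_info) == 16);
static_assert(offsetof(PendingEntry, master_instance) == 24);
static_assert(offsetof(PendingEntry, body_flags) == 64);
static_assert(offsetof(PendingEntry, pre_computed_result) == 72);
static_assert(offsetof(PendingEntry, flags) == 80);

// Masks for the flags byte at +80, as listed in the table.
enum EntryFlags : std::uint8_t {
    kInstantiated             = 0x01,
    kNotNeeded                = 0x02,
    kExplicitInstantiation    = 0x08,
    kSuppressAuto             = 0x10,
    kExcluded                 = 0x20,
    kCanBeInstantiatedChecked = 0x80,
};

// Illustrative helper: neither instantiated nor marked not_needed.
inline bool is_pending(std::uint8_t flags) {
    return (flags & (kInstantiated | kNotNeeded)) == 0;
}
```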
The Fixpoint Loop: template_and_inline_entity_wrapup
sub_78A9D0 is the top-level entry point, called at the end of each translation unit from fe_wrapup. It implements a fixpoint loop that keeps running until no new instantiations are discovered.
template_and_inline_entity_wrapup (sub_78A9D0)
|
+-- Assert: qword_106BA18 == 0 (not nested in another TU)
+-- Check: dword_126EFB4 == 2 (language mode is C++)
|
+-- FOR EACH translation_unit IN qword_106B9F0 linked list:
| |
| +-- sub_7A3EF0: set up TU context (switch active TU)
| |
| +-- PHASE 1: Process pending class instantiations
| | Walk qword_12C7758 list:
| | For each class entry:
| | if sub_7A6B60 (is_dependent_type) == false
| | AND sub_7A8A30 (is_class_or_struct_type) == true:
| | f_instantiate_template_class(entry)
| |
| +-- PHASE 2: Enable instantiation mode
| | dword_12C7730 = 1
| |
| +-- PHASE 3: Process pending function/variable instantiations
| | do_any_needed_instantiations()
| |
| +-- sub_7A3F70: tear down TU context
|
+-- PHASE 4: Check for newly-needed instantiations
| if dword_12C771C != 0:
| dword_12C771C = 0
| LOOP BACK to top <<<< FIXPOINT
|
+-- Check dword_12C7718 for additional pass
The fixpoint is necessary because instantiating one template may trigger references to other uninstantiated templates. For example, instantiating std::vector<Foo> may require instantiating std::allocator<Foo>, Foo's copy constructor, comparison operators, and so on. The loop re-runs until dword_12C771C (the "new instantiations needed" flag) remains zero through an entire pass.
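The shape of such a loop can be sketched with a toy worklist. This is a simplified model under stated assumptions: entities are plain strings, `DepsFn` is a hypothetical callback that reports what instantiating an entity newly requires, and each pass replaces the worklist instead of re-walking persistent lists as the binary does.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

using DepsFn = std::function<std::vector<std::string>(const std::string&)>;

// Repeats whole passes until a pass discovers nothing new. The local
// `new_work` plays the role the text assigns to dword_12C771C.
inline std::set<std::string> run_to_fixpoint(std::vector<std::string> worklist,
                                             const DepsFn& deps) {
    std::set<std::string> done;
    bool new_work = true;
    while (new_work) {
        new_work = false;
        std::vector<std::string> discovered;
        for (const std::string& name : worklist) {
            if (!done.insert(name).second) continue;   // already instantiated
            for (const std::string& d : deps(name)) {
                if (done.count(d) == 0) {
                    discovered.push_back(d);
                    new_work = true;                   // forces another pass
                }
            }
        }
        worklist = std::move(discovered);
    }
    return done;
}
```

Termination follows from the `done` set: an entity is processed at most once, so a pass that finds only already-processed dependencies leaves the flag clear.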
Class-Before-Function Ordering
Classes are instantiated first (Phase 1) because function template instantiations may depend on complete class types. A function template body that accesses T::value_type requires T to be fully instantiated before the function body can be parsed. The two-phase design avoids forward-reference failures during function body replay.
Worklist Walker: do_any_needed_instantiations
sub_78A7F0 walks the pending function/variable instantiation list and processes each entry that passes the should_be_instantiated gate.
void do_any_needed_instantiations(void) {
    entry_t *v0 = qword_12C7740;            // pending list head
    while (v0) {
        if (v0->flags & 0x02) {             // not_needed (bit 1): skip this entry
            v0 = v0->next;
            continue;
        }
        inst_info_t *v2 = v0->inst_info;    // offset +16, must be non-null
        if (!(v2->flags & 0x08)) {          // not debug-checked
            if (dword_126EFC8)              // debug tracing enabled
                sub_756B40(v0);             // f_is_static_or_inline check
        }
        if (v2->flags & 0x01) {             // blocked
            v0 = v0->next;
            continue;
        }
        if (v0->flags >= 0) {               // signed-byte test: bit 7 (0x80) clear, not pre-checked
            sub_7574B0(v0);                 // f_entity_can_be_instantiated
        }
        if (should_be_instantiated(v0, 1)) {
            instantiate_template_function_full(v0, 1);
        }
        v0 = v0->next;                      // offset +8
    }
}
The walk is a simple linear traversal. New entries appended during instantiation will be visited on the current pass if they appear after the current position, or on the next fixpoint iteration otherwise.
Debug tracing output: when dword_126EFC8 is nonzero, the walker emits "do_any_needed_instantiations, checking: " followed by the entity name for each entry it considers.
Decision Gate: should_be_instantiated
sub_774620 is the critical decision function that determines whether a pending template entity actually requires instantiation. It implements a chain of rejection checks -- an entity must pass all of them to be instantiated.
int should_be_instantiated(entry_t *a1, int a2) {
    // 1. Blocked? (bit 0 of the flags byte at +28, reached through inst_info)
    if (a1->flags_28 & 0x01) return 0;
    // 2. Excluded from the instantiation set? (bit 5 of the flags byte at +80)
    if (a1->flags_80 & 0x20) return 0;
    // 3. explicit_instantiation (0x08) set while instantiated (0x01) is clear?
    if ((a1->flags_80 & 0x08) && !(a1->flags_80 & 0x01))
        return 0;
    // 4. Has valid master instance?
    if (!a1->master_instance) return 0;     // offset +24
    // 5. Entity kind filter (function-specific)
    int kind = get_entity_kind(a1->master_instance);
    switch (kind) {
    case 10: case 11:                       // class member function
    case 17:                                // lambda
    case 9:                                 // namespace-scope function
    case 7:                                 // variable template
        break;                              // eligible
    default:
        return 0;                           // not a function/variable entity
    }
    // 6. Implicit include needed?
    if (needs_implicit_include(a1))
        do_implicit_include_if_needed(a1);  // sub_754A70
    // 7. Depth limit check
    if (get_depth(a1) > *qword_106BD10)
        return 0;
    // 8. Depth warning (diagnostic 489/490)
    if (approaching_depth_limit(a1))
        emit_warning(489);                  // or 490
    return 1;
}
The depth limit at qword_106BD10 is the configurable maximum instantiation nesting depth. When exceeded, the entity is silently skipped. When approaching the limit, warnings 489 and 490 are emitted to alert the developer.
Function Instantiation: instantiate_template_function_full
sub_775E00 (839 lines) is the workhorse for instantiating function templates. It saves global parser state, replays the cached function body through the parser with substituted template arguments, and restores state afterward.
SSE State Save/Restore
The function saves and restores four 128-bit global slots (xmmword_106C380--xmmword_106C3B0) that hold critical parser/scope state. They are called SSE registers on this page because the copies go through XMM registers; the values themselves live in global memory. These slots store packed parser context (scope indices, token positions, flags) that must be preserved across instantiation because the parser is stateful and instantiation re-enters it:
Save on entry:
saved_state[0] = xmmword_106C380 // parser scope context
saved_state[1] = xmmword_106C390 // token stream state
saved_state[2] = xmmword_106C3A0 // scope nesting info
saved_state[3] = xmmword_106C3B0 // auxiliary flags
Restore on exit (always, even on error):
xmmword_106C380 = saved_state[0]
xmmword_106C390 = saved_state[1]
xmmword_106C3A0 = saved_state[2]
xmmword_106C3B0 = saved_state[3]
The use of SSE registers for state save/restore is a compiler optimization -- the generated code uses movaps/movups instructions to copy 64 bytes of state with four 16-byte moves in each direction rather than eight 8-byte mov instructions. The data itself is ordinary integer/pointer fields that the compiler copies in 128-bit chunks.
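The save-on-entry, restore-on-every-exit discipline can be sketched with a scope guard. This is an illustration only: `ParserState`, `g_parser_state`, and `ParserStateGuard` are hypothetical names, the 64 bytes are treated as opaque, and the binary itself presumably performs explicit copies rather than using a C++ destructor.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Opaque stand-in for the four 16-byte slots.
struct ParserState { std::uint8_t bytes[64]; };
static_assert(sizeof(ParserState) == 4 * 16);

inline ParserState g_parser_state{};   // stands in for xmmword_106C380..xmmword_106C3B0

// Copies the 64 bytes on construction and writes them back on destruction,
// so the restore runs on every exit path, including early returns.
class ParserStateGuard {
public:
    ParserStateGuard() : saved_(g_parser_state) {}
    ~ParserStateGuard() { g_parser_state = saved_; }
    ParserStateGuard(const ParserStateGuard&) = delete;
    ParserStateGuard& operator=(const ParserStateGuard&) = delete;
private:
    ParserState saved_;
};

inline void instantiate_stub() {
    ParserStateGuard guard;
    // Simulate the parser clobbering its global state during body replay.
    std::memset(g_parser_state.bytes, 0xFF, sizeof g_parser_state.bytes);
}   // guard restores the saved bytes here
```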
Instantiation Flow
instantiate_template_function_full (sub_775E00)
|
+-- Save 4 SSE registers (parser state)
|
+-- Check pre-existing result: a1[9] (offset +72)
| If result exists:
| Load associated translation unit
| GOTO restore
|
+-- Fresh instantiation:
| |
| +-- Check implicit include needed
| +-- Resolve actual declaration via find_corresponding_instance
| +-- For class members (kind 20): handle member function templates
| |
| +-- Depth limit check:
| | if qword_12C76E0 >= 0xFF (255):
| | emit error, GOTO restore
| | qword_12C76E0++
| |
| +-- Constraint satisfaction check:
| | sub_7C2370 / sub_7C23B0 (C++20 requires-clause)
| |
| +-- Handle deferred/deleted functions (offset +64 flags)
| |
| +-- Set up substitution context: sub_709DE0
| | Binds template parameters to concrete arguments
| |
| +-- Replay cached function body: sub_5A88B0
| | Re-parses the saved token stream with substituted types
| |
| +-- Emit into IL: sub_676860
| | Processes tokens until end marker (token kind 9)
| |
| +-- Update canonical entry: sub_79F1D0
| | Links instantiation to cross-TU correspondence table
| |
| +-- qword_12C76E0-- (decrement depth)
|
+-- Restore 4 SSE registers
Depth Counter: qword_12C76E0
This global counter tracks the current nesting depth of function template instantiations. The hard limit is 255 (0xFF). Each call to instantiate_template_function_full increments it on entry and decrements on exit. When the counter reaches 255, the function emits a fatal error and aborts instantiation.
The 255 limit is a safety valve against infinite recursive template instantiation (e.g., template<int N> struct S { S<N+1> member; }). Annex B of the C++ standard suggests a minimum of 1,024 recursively nested template instantiations, but those quantities are guidelines rather than conformance requirements, and this function-instantiation counter is capped at 255. The separate limit read through qword_106BD10 may be configurable via a CLI flag.
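The increment, check, and decrement pattern can be modeled in a few lines. This is a sketch under assumptions: `g_depth` stands in for qword_12C76E0, and `instantiate_nested` is a hypothetical function in which each level simply requests one more nested level.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMaxFunctionDepth = 0xFF;   // 255, per the text
inline std::uint64_t g_depth = 0;                   // stands in for qword_12C76E0

// Returns false if any level of the nested chain hits the depth limit.
// `wanted` is the number of further nested instantiations requested.
inline bool instantiate_nested(unsigned wanted) {
    if (g_depth >= kMaxFunctionDepth) return false; // error path: do not descend
    ++g_depth;
    bool ok = (wanted == 0) ? true : instantiate_nested(wanted - 1);
    assert(g_depth > 0);
    --g_depth;                                      // always undone on the way out
    return ok;
}
```

A chain of 255 nested levels fits under the limit, while the 256th level is refused, and the counter returns to zero in both cases.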
Class Instantiation: f_instantiate_template_class
sub_777CE0 (516 lines) instantiates class templates. It is structurally similar to the function instantiation path but saves significantly more state (12 SSE registers vs. 4) because class instantiation involves deeper parser state perturbation -- class bodies contain member declarations, nested types, and member function definitions.
SSE State Save/Restore (12 Registers)
Save on entry:
saved[0] = xmmword_106C380
saved[1] = xmmword_106C390
saved[2] = xmmword_106C3A0
saved[3] = xmmword_106C3B0
saved[4] = xmmword_106C3C0
saved[5] = xmmword_106C3D0
saved[6] = xmmword_106C3E0
saved[7] = xmmword_106C3F0
saved[8] = xmmword_106C400
saved[9] = xmmword_106C410
saved[10] = xmmword_106C420
saved[11] = xmmword_106C430
Restore on exit:
(reverse order, same 12 registers)
The additional 8 registers (beyond the 4 used by function instantiation) capture the extended scope stack state, class body parsing context, base class list, member template processing state, and access specifier tracking that class body parsing requires.
Class Type Entry Layout
Class instantiation operates on a type entry with the following relevant fields:
| Offset | Size | Field | Description |
|---|---|---|---|
+56 | 8 | instantiation_depth_counter | Per-type depth limit via qword_106BD10 |
+72 | 8 | containing_template_decl | The template declaration this specialization came from |
+88 | 8 | scope_name_info | Scope and name resolution data |
+96 | 8 | class_body_info | Pointer to cached class body tokens |
+104 | 8 | base_class_list | Linked list of base class entries |
+120 | 8 | namespace_lookup_info | Namespace and extern template info |
+132 | 1 | kind | Type kind: 9=struct, 10=class, 11=union, 12=alias |
+144 | 8 | canonical_type | Pointer to canonical type entry (follow kind==12 chain) |
+152 | 8 | parent_scope | Enclosing scope entry |
+160 | 4 | attribute_flags | Attribute bits |
+176 | 1 | template_flags | bit 0 = primary template, bit 7 = inline |
+192 | 8 | template_argument_list | Substituted template argument list |
+200 | 8 | member_template_list | Linked list of member templates |
+296 | 8 | associated_constraint | C++20 constraint expression |
+298 | 1 | extra_flags | Additional status bits |
Instantiation Flow
f_instantiate_template_class (sub_777CE0)
|
+-- Walk to canonical type entry: follow kind==12 chain at +144
+-- Get class symbol: sub_72F640
|
+-- Check extern template constraints: sub_7C2370/sub_7C23B0
|
+-- Save 12 SSE registers
|
+-- Depth limit check:
| if type_entry[+56] >= *qword_106BD10:
| emit error, GOTO restore
| type_entry[+56]++
|
+-- Set up substitution context: sub_709DE0
|
+-- Handle base class list:
| sub_415BE0 (parse base-specifier-list)
| sub_4A5510 (validate base classes)
|
+-- Parse class body from declaration cache
| Replay saved tokens with substituted types
|
+-- Process member templates:
| Loop on member_template_list (offset +200)
| sub_7856E0 for each member template
|
+-- Perform deferred access checks:
| sub_744F60 (perform_deferred_access_checks_at_depth)
|
+-- type_entry[+56]-- (decrement depth)
|
+-- Restore 12 SSE registers
Per-Type Depth Limit
Unlike function instantiation (which uses a single global counter qword_12C76E0 with a hard limit of 255), class instantiation uses a per-type counter stored at offset +56 of the type entry. The limit is still read from qword_106BD10. This per-type design prevents one deeply-nested class hierarchy from consuming the entire depth budget -- each class type tracks its own instantiation nesting independently.
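The difference from a single global counter can be seen in a small model where each type carries its own depth. The names are hypothetical: `TypeEntry::depth` stands in for the counter at +56 and `ClassDepthLimiter::limit` for the value read through qword_106BD10.

```cpp
#include <cassert>
#include <cstdint>

struct TypeEntry {
    std::uint64_t depth = 0;   // stands in for the counter at offset +56
};

struct ClassDepthLimiter {
    std::uint64_t limit;       // stands in for the value read through qword_106BD10

    // Refuses to enter once this particular type has reached the limit.
    bool enter(TypeEntry& t) const {
        if (t.depth >= limit) return false;
        ++t.depth;
        return true;
    }
    void leave(TypeEntry& t) const {
        assert(t.depth > 0);
        --t.depth;
    }
};
```

Exhausting the budget of one type leaves every other type unaffected, which is the independence property described above.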
Variable Instantiation: instantiate_template_variable
sub_774C30 (751 lines) handles variable template instantiation. Variable templates (C++14) are less common than function or class templates but follow the same pattern: extract master instance, set up substitution, replay cached declaration.
Instantiation Flow
instantiate_template_variable (sub_774C30)
|
+-- Extract master instance: a1[3]=symbol, a1[4]=decl
|
+-- Look up declaration type:
| Switch on kind: 4/5, 6, 9/10, 19-22
|
+-- Find declaration cache: offset +216 or +264
|
+-- Depth limit check: qword_106BD10
|
+-- Set up substitution context: sub_709DE0
|
+-- Create declaration state:
| memset(v77, 0, 0x1D8) // 472 bytes = declaration state
| v77[0] = symbol
| v77[3] = source position
| v77[6] = type
| v77[15] = flags
| v77[19] = self-pointer
| v77[33] = additional flags
| v77[35] = initializer
| v77[36] = IL tree
|
+-- Perform type substitution: sub_764AE0 (scan_template_declaration)
|
+-- Handle constexpr/constinit evaluation
|
+-- Handle deferred access checks
|
+-- Update canonical entry
|
+-- For kind==7 (function-local variable templates):
Special handling via sub_5C9600, copy attributes from prototype
The declaration state structure is 472 bytes (0x1D8), stack-allocated and zero-initialized. This is the same structure used by the main declaration parser -- variable template instantiation reuses the declaration parsing infrastructure with pre-populated fields.
Pending Counter Management
Two small functions manage a pending-instantiation counter that tracks how many instantiations are in flight. This counter is used for progress reporting and infinite-loop detection.
increment_pending_instantiations (sub_75D740)
Called when a new template entity is added to the pending worklist. Increments the counter and checks against a maximum threshold via too_many_pending_instantiations (sub_75D6A0).
decrement_pending_instantiations (sub_75D7C0)
Called when an instantiation completes (successfully or by rejection). Decrements the counter.
The counter itself is not directly visible in the sweep report but is inferred from the call pattern: the increment function is called from code paths that add entries to qword_12C7740 or qword_12C7758, and the decrement is called at the end of each instantiate_template_function_full / f_instantiate_template_class / instantiate_template_variable invocation.
Instantiation Modes
The template engine supports four instantiation modes, controlled by CLI flags that set dword_12C7730 and related configuration globals:
| Mode | dword_12C7730 | Behavior |
|---|---|---|
"none" | 0 | No automatic instantiation. Only explicit template declarations trigger instantiation. Used for precompiled headers. |
"used" | 1 | Instantiate templates that are actually used (ODR-referenced). This is the default mode. The should_be_instantiated function checks usage flags. |
"all" | 2 | Instantiate all templates that have been declared, whether or not they are used. Used for template library precompilation. |
"local" | 3 | Instantiate only templates with internal linkage. Extern templates are skipped. Used for split compilation models. |
The mode transitions during compilation:
- During parsing: dword_12C7730 = 0 (collection only, no instantiation)
- At wrapup entry: dword_12C7730 = 1 (enable "used" mode)
- During fixpoint: mode may escalate to "all" if dword_12C7718 is set
The precompile mode (dword_106C094 == 3) skips the fixpoint loop entirely and records template entities for later instantiation in the consuming translation unit.
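The mode values and the documented transitions can be captured in a small enum. This is a sketch with assumed names: `InstMode` and `mode_for_phase` are hypothetical, and the escalation to "all" is shown as unconditional when the extra-pass flag is set, although the text only says it may happen.

```cpp
#include <cassert>

enum class InstMode : int { None = 0, Used = 1, All = 2, Local = 3 };

inline const char* mode_name(InstMode m) {
    switch (m) {
    case InstMode::None:  return "none";
    case InstMode::Used:  return "used";
    case InstMode::All:   return "all";
    case InstMode::Local: return "local";
    }
    return "unknown";
}

// Parsing collects only; wrapup enables "used"; an extra pass escalates to "all".
inline InstMode mode_for_phase(bool in_wrapup, bool extra_pass_requested) {
    if (!in_wrapup) return InstMode::None;
    return extra_pass_requested ? InstMode::All : InstMode::Used;
}
```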
Substitution Engine: copy_type_with_substitution
sub_76D860 (1,229 lines) is the core type substitution function. It takes a type node and a set of template-parameter-to-argument bindings, and produces a new type with all template parameters replaced by their concrete values.
copy_type_with_substitution(type, bindings) -> type
|
+-- Dispatch on type->kind:
|
+-- Simple types (int, float, void): return type unchanged
|
+-- Pointer type (kind 6):
| new_pointee = copy_type_with_substitution(type->pointee, bindings)
| return make_pointer_type(new_pointee)
|
+-- Reference types (kind 7, 19):
| new_referent = copy_type_with_substitution(type->referent, bindings)
| return make_reference_type(new_referent, type->is_rvalue)
|
+-- Array type (kind 8):
| new_element = copy_type_with_substitution(type->element, bindings)
| new_size = substitute_expression(type->size_expr, bindings)
| return make_array_type(new_element, new_size)
|
+-- Function type (kind 14):
| new_return = copy_type_with_substitution(type->return_type, bindings)
| new_params = [substitute each parameter type]
| return make_function_type(new_return, new_params, type->cv_quals)
|
+-- Template parameter type:
| Look up parameter in bindings
| return concrete argument type
|
+-- Template-id type:
| new_args = copy_template_arg_list_with_substitution(type->args, bindings)
| return find_or_instantiate_template_class(type->template, new_args)
|
+-- Pack expansion (kind 16, 17):
| Expand pack with all elements from the binding
| return list of substituted types
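The recursive shape of the dispatch above can be illustrated on a toy type model. Everything in this sketch is hypothetical (the Kind enum, Type struct, substitute, and render are not EDG names, and the kinds do not use EDG's numbering); it covers only simple, pointer, function, and template-parameter types.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy type model; kinds are illustrative and do not follow EDG's numbering.
enum class Kind { Simple, Pointer, Function, TemplateParam };

struct Type;
using TypePtr = std::shared_ptr<const Type>;

struct Type {
    Kind kind;
    std::string name;             // Simple: type name; TemplateParam: parameter name
    TypePtr inner;                // Pointer: pointee; Function: return type
    std::vector<TypePtr> params;  // Function: parameter types
};

using Bindings = std::map<std::string, TypePtr>;

inline TypePtr simple(std::string n) {
    return std::make_shared<Type>(Type{Kind::Simple, std::move(n), nullptr, {}});
}
inline TypePtr param(std::string n) {
    return std::make_shared<Type>(Type{Kind::TemplateParam, std::move(n), nullptr, {}});
}
inline TypePtr pointer_to(TypePtr t) {
    return std::make_shared<Type>(Type{Kind::Pointer, "", std::move(t), {}});
}
inline TypePtr function_type(TypePtr ret, std::vector<TypePtr> ps) {
    return std::make_shared<Type>(Type{Kind::Function, "", std::move(ret), std::move(ps)});
}

// Recursively rebuild the type, replacing bound template parameters.
inline TypePtr substitute(const TypePtr& t, const Bindings& b) {
    switch (t->kind) {
    case Kind::Simple:
        return t;  // returned unchanged
    case Kind::TemplateParam: {
        auto it = b.find(t->name);
        return it != b.end() ? it->second : t;  // unbound parameters stay as they are
    }
    case Kind::Pointer:
        return pointer_to(substitute(t->inner, b));
    case Kind::Function: {
        std::vector<TypePtr> ps;
        for (const auto& p : t->params) ps.push_back(substitute(p, b));
        return function_type(substitute(t->inner, b), std::move(ps));
    }
    }
    return t;
}

// Render a type for inspection, e.g. "int*" or "int(int*, float)".
inline std::string render(const TypePtr& t) {
    switch (t->kind) {
    case Kind::Simple:
    case Kind::TemplateParam:
        return t->name;
    case Kind::Pointer:
        return render(t->inner) + "*";
    case Kind::Function: {
        std::string s = render(t->inner) + "(";
        for (std::size_t i = 0; i < t->params.size(); ++i) {
            if (i) s += ", ";
            s += render(t->params[i]);
        }
        return s + ")";
    }
    }
    return "?";
}
```

Substituting T = int into the toy function type T(T*, float) yields a type that renders as int(int*, float). The real function additionally handles references, arrays, template-ids, and pack expansions.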
Supporting substitution functions:
| Address | Identity | Description |
|---|---|---|
sub_77BA10 | copy_parent_type_with_substitution | Substitutes in enclosing class context |
sub_77BFE0 | copy_template_with_substitution | Substitutes within template declarations |
sub_77FDE0 | copy_template_arg_list_with_substitution | Substitutes within argument lists (612 lines) |
sub_780B80 | copy_template_class_reference_with_substitution | Handles class template references |
sub_78B600 | copy_template_variable_with_substitution | Handles variable template references |
sub_793DF0 | substitute_template_param_list | Walks parameter list with substitution (741 lines) |
Template Argument Deduction
The deduction subsystem determines template argument values from function call arguments. Key functions:
| Address | Identity | Lines | Description |
|---|---|---|---|
sub_77CEE0 | matches_template_type | 788 | Core deduction: matches actual type against template parameter pattern. Implements [temp.deduct]. |
sub_77CA90 | matches_template_type_for_class_type | -- | Class-specific variant with additional base class traversal |
sub_77C720 | matches_template_arg_list | -- | Matches a sequence of template arguments |
sub_77C510 | matches_template_template_param | -- | Matches template template parameters |
sub_77C240 | template_template_arg_matches_param | -- | Template template argument compatibility check |
sub_77E9F0 | matches_template_constant | -- | Matches non-type template arguments (constant expressions) |
sub_77E310 | parameter_is_more_specialized | 330 | Partial ordering rule: determines which parameter is more specialized |
sub_780FC0 | all_templ_params_have_values | 332 | Post-deduction check: verifies all parameters received values |
sub_781660 | wrapup_template_argument_deduction | -- | Finalizes deduction, applies default arguments |
sub_781C40 | matches_partial_specialization | 316 | Tests actual arguments against a partial specialization |
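To make the matching step concrete, here is a minimal sketch of structural deduction on a toy type model. It is only loosely in the spirit of matches_template_type and all_templ_params_have_values: the types, the matches function, and the string-based binding table are assumptions made for illustration, and none of the [temp.deduct] adjustments (cv-qualifiers, references, decay) are modeled.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class Kind { Simple, Pointer, TemplateParam };

struct Type {
    Kind kind;
    std::string name;                   // Simple or TemplateParam name
    std::shared_ptr<const Type> inner;  // Pointer: pointee
};
using TypePtr = std::shared_ptr<const Type>;
using Deduced = std::map<std::string, std::string>;  // parameter -> spelled argument

inline TypePtr simple(std::string n) {
    return std::make_shared<Type>(Type{Kind::Simple, std::move(n), nullptr});
}
inline TypePtr param(std::string n) {
    return std::make_shared<Type>(Type{Kind::TemplateParam, std::move(n), nullptr});
}
inline TypePtr pointer_to(TypePtr t) {
    return std::make_shared<Type>(Type{Kind::Pointer, "", std::move(t)});
}

inline std::string spell(const TypePtr& t) {
    return t->kind == Kind::Pointer ? spell(t->inner) + "*" : t->name;
}

// Match `actual` against `pattern`, recording one binding per template parameter.
// Returns false on a structural mismatch or on conflicting deductions.
inline bool matches(const TypePtr& pattern, const TypePtr& actual, Deduced& out) {
    switch (pattern->kind) {
    case Kind::TemplateParam: {
        auto r = out.emplace(pattern->name, spell(actual));
        return r.second || r.first->second == spell(actual);
    }
    case Kind::Pointer:
        return actual->kind == Kind::Pointer && matches(pattern->inner, actual->inner, out);
    case Kind::Simple:
        return actual->kind == Kind::Simple && actual->name == pattern->name;
    }
    return false;
}

// Post-deduction check: every declared parameter must have received a value.
inline bool all_params_have_values(const std::vector<std::string>& params, const Deduced& d) {
    for (const auto& p : params) {
        if (d.find(p) == d.end()) return false;
    }
    return true;
}
```

Matching the pattern T* against int* binds T to int, while matching T* against plain int fails. A second, conflicting binding for the same parameter is rejected.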
Partial Specialization Ordering
When multiple partial specializations match, the engine must select the "most specialized" one. This implements C++ [temp.class.order] and [temp.func.order]:
check_partial_specializations (sub_774470)
|
+-- For each partial specialization of the template:
| matches_partial_specialization(actual_args, partial_spec)
| If matches: add to candidates list
| add_to_partial_order_candidates_list (sub_773E40)
|
+-- If multiple candidates:
| partial_ord (sub_75D2A0)
| Pairwise comparison using parameter_is_more_specialized
| Select most specialized, or emit ambiguity error
|
+-- Return winning specialization (or primary template if no match)
For function templates, ordering uses compare_function_templates (sub_7730D0, 665 lines) which implements the more complex function template partial ordering rules.
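The pairwise selection step can be sketched generically. This is an illustration under stated assumptions, not a transcription of partial_ord: select_most_specialized is a hypothetical name, and the caller-supplied predicate stands in for parameter_is_more_specialized.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the candidate that the predicate ranks as more
// specialized than every other candidate, or -1 when the list is empty or
// no single winner exists (the ambiguous case).
template <typename Candidate, typename Pred>
int select_most_specialized(const std::vector<Candidate>& cands, Pred more_specialized) {
    for (std::size_t i = 0; i < cands.size(); ++i) {
        bool wins = true;
        for (std::size_t j = 0; j < cands.size(); ++j) {
            if (i != j && !more_specialized(cands[i], cands[j])) {
                wins = false;
                break;
            }
        }
        if (wins) return static_cast<int>(i);
    }
    return -1;
}
```

With integer "specificity ranks" as candidates and a greater-than predicate, the list {1, 3, 2} selects index 1, while {2, 2} has no winner. A caller would map -1 on a non-empty list to the ambiguity diagnostic and -1 on an empty list to the primary template.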
Template Declaration Infrastructure
The declaration side handles parsing template<...> prefixes and setting up template entities:
| Address | Identity | Lines | Description |
|---|---|---|---|
sub_786260 | template_declaration | 2,487 | Main entry point for all template declarations. Handles primary, explicit specialization, partial specialization, and friend templates. |
sub_782690 | class_template_declaration | 2,280 | Class-specific template declaration processing |
sub_78D600 | template_or_specialization_declaration_full | 2,034 | Unified handler routing to class, function, or variable paths |
sub_764AE0 | scan_template_declaration | 412 | Parses the template<...> prefix |
sub_779D80 | scan_template_param_list | 626 | Parses template parameter lists |
sub_77AAB0 | scan_lambda_template_param_list | -- | C++20 lambda template parameter parsing |
sub_770790 | make_template_function | 914 | Creates function template entity |
sub_753870 | make_template_variable | -- | Creates variable template entity |
sub_756310 | set_up_template_decl | -- | Template declaration state initialization |
Explicit Instantiation
Explicit instantiation (template class Foo<int>; or template void f<int>();) is handled by a dedicated path:
explicit_instantiation (sub_791C70, 105 lines)
|
+-- Parse 'extern' flag: a2 & 1 = is_extern_instantiation
+-- Save compilation mode (dword_106C094)
|
+-- Determine instantiation kind:
| extern: kind = 16
| non-extern, no inline: kind = 15
| non-extern, inline: kind = 18
|
+-- For precompiled header mode: mark scope entry
|
+-- instantiation_directive (sub_7908E0, 626 lines):
| |
| +-- Initialize target scope entry (memset 472 bytes)
| +-- Check CUDA device-code instantiation pragmas
| +-- Parse declaration:
| | For classes: sub_789EF0 (update_instantiation_flags)
| | For functions: sub_78D0E0 (find_matching_template_instance)
| | then sub_7897C0 (update_instantiation_flags)
| | For variables: similar path
| +-- Handle instantiation attributes (dllexport/visibility)
| +-- Clean up parser state
|
+-- Handle deferred access checks: sub_744F60
+-- Restore compilation mode
update_instantiation_flags (sub_7897C0, 351 lines) sets the appropriate instantiation-required bits on the template entity after matching an explicit instantiation directive. It checks compilation mode, CUDA device/host targeting, and adjusts flags accordingly.
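The kind selection at the top of explicit_instantiation reduces to a small decision function. In this sketch the numeric values come from the flow above, while the enumerator names and the function name classify_explicit_instantiation are hypothetical.

```cpp
// Kind values are taken from the flow above; the names are illustrative.
enum InstantiationKind {
    kExplicit = 15,        // non-extern, no inline
    kExtern = 16,          // extern template
    kExplicitInline = 18   // non-extern, inline
};

// `flags` plays the role of the second argument (a2); bit 0 marks an
// extern instantiation. The extern case takes priority over inline.
inline InstantiationKind classify_explicit_instantiation(unsigned flags, bool is_inline) {
    const bool is_extern = (flags & 1u) != 0;
    if (is_extern) return kExtern;
    return is_inline ? kExplicitInline : kExplicit;
}
```

For example, flags = 1 yields kind 16 regardless of is_inline, and flags = 0 yields 15 or 18 depending on is_inline.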
CUDA Integration Points
The template engine interacts with CUDA through several mechanisms:
- Device/host filtering in should_be_instantiated: The function checks CUDA execution space attributes via sub_756840 (sym_can_be_instantiated) to determine if a template entity should be instantiated for the current compilation target (device or host).
- Instantiation directives: CUDA-specific #pragma directives can trigger or suppress template instantiation for device code. The instantiation_directive function checks for these at dword_126EFA8 (GPU mode) and dword_126EFA4 (device-code flag).
- Namespace injection: CUDA-specific symbols are entered into cuda::std via enter_symbol_for_namespace_cuda_std (sub_749330) and into std::meta via enter_symbol_for_namespace_std_meta (sub_7493C0, C++26 reflection support).
- Target dialect selection: select_cp_gen_be_target_dialect (sub_752A80) determines whether template instantiations emit device PTX code or host code, based on dword_126EFA8 (GPU mode) and dword_126EFA4 (device vs. host).
Cross-TU Correspondence
When compiling in RDC mode or with multiple translation units, the same template may be instantiated in different TUs. The trans_corresp.c file (0x796E60--0x79F9E0) handles deduplication and canonical entry selection:
| Address | Identity | Description |
|---|---|---|
sub_796E60 | canonical_ranking | Determines which of two TU entries is canonical |
sub_7975D0 | may_have_correspondence | Checks if cross-TU correspondence is possible |
sub_7999C0 | find_template_correspondence | Finds corresponding template across TUs (601 lines) |
sub_79A5A0 | determine_correspondence | Establishes correspondence relationship |
sub_79B8D0 | mark_canonical_instantiation | Marks the canonical version of an instantiation |
sub_79C400 | f_set_trans_unit_corresp | Sets up cross-TU correspondence (511 lines) |
sub_79D080 | establish_instantiation_correspondences | Links instantiation results across TUs |
sub_79EE80--sub_79F1D0 | update_canonical_entry (3 variants) | Updates canonical representative after instantiation |
sub_79F9E0 | record_instantiation | Records an instantiation for cross-TU tracking |
The correspondence system ensures that when std::vector<int> is instantiated in TU1 and TU2, both produce structurally equivalent IL, and only one canonical version is emitted to the output.
Global State
| Address | Name | Description |
|---|---|---|
qword_12C7740 | pending_instantiation_list | Head of pending function/variable instantiation linked list |
qword_12C7758 | pending_class_instantiation_list | Head of pending class instantiation linked list |
dword_12C7730 | instantiation_mode_active | Current instantiation mode (0=none, 1=used, 2=all, 3=local) |
dword_12C771C | new_instantiations_needed | Fixpoint flag: set to 1 when new work discovered |
dword_12C7718 | additional_pass_needed | Secondary fixpoint flag for extra passes |
qword_12C76E0 | instantiation_depth_counter | Current function template nesting depth (max 0xFF) |
qword_106BD10 | max_instantiation_depth_limit | Configurable depth limit (read by class and function paths) |
xmmword_106C380--106C3B0 | parser_state_save_area | 4 SSE registers saved by function instantiation |
xmmword_106C380--106C430 | parser_state_save_area_full | 12 SSE registers saved by class instantiation |
dword_106C094 | compilation_mode | 0=none, 1=normal, 3=precompile |
dword_126EFB4 | compilation_phase | 2=full compilation (required for fixpoint loop) |
qword_106B9F0 | translation_unit_list_head | Linked list of TUs for per-TU fixpoint iteration |
qword_106BA18 | tu_stack_top | Must be 0 (not nested) when fixpoint starts |
dword_126EFC8 | debug_tracing_enabled | Nonzero enables trace output for instantiation |
dword_126EFA8 | gpu_mode | Nonzero when compiling CUDA code |
dword_126EFA4 | device_code | 1=device-side compilation, 0=host stubs |
word_126DD58 | current_token_kind | Parser state: current token (9=END) |
qword_126DD38 | source_position | Parser state: current source location |
qword_126C5E8 | scope_table_base | Array of 784-byte scope entries |
dword_126C5E4 | current_scope_index | Index into scope table |
Diagnostic Strings
| String | Source | Condition |
|---|---|---|
"do_any_needed_instantiations, checking: " | sub_78A7F0 | dword_126EFC8 != 0 (debug tracing) |
"template_and_inline_entity_wrapup" | sub_78A9D0 | Assert string |
"should_be_instantiated" | sub_774620 | Assert string at templates.c:36894 |
"instantiate_template_function_full" | sub_775E00 | Assert string at templates.c:7359 |
"f_instantiate_template_class" | sub_777CE0 | Assert string at templates.c:5277 |
"instantiate_template_variable" | sub_774C30 | Assert string at templates.c:7814 |
"check_template_nesting_depth" | sub_7533E0 | Assert string |
"instantiation_directive" | sub_7908E0 | Assert string at templates.c:41682 |
"explicit_instantiation" | sub_791C70 | Assert string at templates.c:42231 |
"template_arg_is_dependent" | sub_7530C0 | Assert string at templates.c:8897 |
Function Map
| Address | Identity | Confidence | Lines | EDG Source |
|---|---|---|---|---|
sub_78A9D0 | template_and_inline_entity_wrapup | 100% | 136 | templates.c:40084 |
sub_78A7F0 | do_any_needed_instantiations | 100% | 72 | templates.c:39760 |
sub_774620 | should_be_instantiated | 95% | 326 | templates.c:36894 |
sub_775E00 | instantiate_template_function_full | 95% | 839 | templates.c:7359 |
sub_777CE0 | f_instantiate_template_class | 95% | 516 | templates.c:5277 |
sub_774C30 | instantiate_template_variable | 95% | 751 | templates.c:7814 |
sub_75D740 | increment_pending_instantiations | 95% | -- | templates.c |
sub_75D7C0 | decrement_pending_instantiations | 95% | -- | templates.c |
sub_75D6A0 | too_many_pending_instantiations | 95% | -- | templates.c |
sub_7574B0 | f_entity_can_be_instantiated | 95% | -- | templates.c:37066 |
sub_756B40 | f_is_static_or_inline_template_entity | 95% | -- | templates.c |
sub_756840 | sym_can_be_instantiated | 95% | -- | templates.c |
sub_754A70 | do_implicit_include_if_needed | 95% | -- | templates.c |
sub_76D860 | copy_type_with_substitution | 95% | 1229 | templates.c |
sub_77FDE0 | copy_template_arg_list_with_substitution | 95% | 612 | templates.c |
sub_793DF0 | substitute_template_param_list | 95% | 741 | templates.c |
sub_77CEE0 | matches_template_type | 95% | 788 | templates.c |
sub_780FC0 | all_templ_params_have_values | 95% | 332 | templates.c |
sub_781C40 | matches_partial_specialization | 95% | 316 | templates.c |
sub_774470 | check_partial_specializations | 95% | 58 | templates.c |
sub_773E40 | add_to_partial_order_candidates_list | 95% | 306 | templates.c |
sub_75D2A0 | partial_ord | 95% | -- | templates.c |
sub_7730D0 | compare_function_templates | 95% | 665 | templates.c |
sub_786260 | template_declaration | 95% | 2487 | templates.c |
sub_782690 | class_template_declaration | 95% | 2280 | templates.c |
sub_78D600 | template_or_specialization_declaration_full | 95% | 2034 | templates.c |
sub_764AE0 | scan_template_declaration | 95% | 412 | templates.c |
sub_779D80 | scan_template_param_list | 95% | 626 | templates.c |
sub_770790 | make_template_function | 95% | 914 | templates.c |
sub_771D50 | find_template_function | 95% | 470 | templates.c |
sub_7621A0 | find_template_class | 95% | 519 | templates.c |
sub_78AC50 | find_template_variable | 95% | 528 | templates.c |
sub_7908E0 | instantiation_directive | 95% | 626 | templates.c:41682 |
sub_791C70 | explicit_instantiation | 95% | 105 | templates.c:42231 |
sub_7897C0 | update_instantiation_flags | 90% | 351 | templates.c |
sub_7770E0 | update_instantiation_required_flag | 95% | 434 | templates.c |
sub_78D0E0 | find_matching_template_instance | 95% | -- | templates.c |
sub_709DE0 | set_up_substitution_context | -- | -- | (likely templates.c) |
sub_744F60 | perform_deferred_access_checks_at_depth | 95% | -- | symbol_tbl.c |
sub_7530C0 | template_arg_is_dependent | 95% | -- | templates.c:8897 |
sub_762C80 | template_arg_list_is_dependent_full | 95% | 839 | templates.c |
sub_75EF10 | equiv_template_arg_lists | 95% | 493 | templates.c |
sub_7931B0 | make_template_implicit_deduction_guide | 95% | 433 | templates.c |
sub_794D30 | ctad | 95% | 990 | templates.c |
sub_796E60 | canonical_ranking | 95% | -- | trans_corresp.c |
sub_7999C0 | find_template_correspondence | 95% | 601 | trans_corresp.c |
sub_79C400 | f_set_trans_unit_corresp | 95% | 511 | trans_corresp.c |
sub_79F1D0 | update_canonical_entry | 95% | -- | trans_corresp.c |
sub_79F9E0 | record_instantiation | 95% | -- | trans_corresp.c |
Cross-References
- EDG 6.6 Overview -- Architecture and NVIDIA modification layers
- CUDA Template Restrictions -- CUDA-specific template constraints
- Type System -- Type kinds and class layout referenced during substitution
- Keep-in-IL -- Device code selection interacts with instantiation results
- Pipeline Overview -- Where template wrapup fits in the compilation pipeline
- Template Instance Record -- Data structure for instantiation entries
- Scope Entry -- 784-byte scope structure used during instantiation
- Diagnostics Overview -- Warning 489/490 for depth limits
CUDA Template Restrictions
CUDA's split-compilation model imposes restrictions on C++ templates that have no counterpart in standard C++. When a __global__ function template is instantiated, cudafe++ generates a host-side stub whose mangled name must exactly match what the device compiler (cicc) independently produces. This agreement is only possible if both compilers can derive the complete mangled name from the template's signature and arguments. Types that are invisible to one side -- host-local types, unnamed types, private class members, certain lambda closures -- break this invariant and are therefore rejected. The same constraints apply to variable templates used in device contexts, and additional structural restrictions prevent variadic __global__ templates from producing ambiguous mangled names. This page documents all 24 CUDA-specific template restriction errors across 8 categories, the implementation functions that enforce them, and the __NV_name_expr mechanism that relies on these guarantees.
Key Facts
| Property | Value |
|---|---|
| Source file | cp_gen_be.c (EDG 6.6 backend code generator) |
| Access checker | sub_469F80 (template_arg_is_accessible, 144 lines) |
| Cache engine | sub_469480 (cache_access_result_for, 670 lines) |
| Arg list walker | sub_46A230 (walks template arg lists, 182 lines) |
| Pre-unnamed check | sub_46A5B0 (arg_before_unnamed_template_param_arg, 396 lines) |
| Scope resolver | sub_469F30 (resolves scope via hash lookup, 23 lines) |
| Callback for scope walk | sub_46ACC0 (passed as callback into sub_61FE60) |
| Cache hash table | xmmword_F05720 (384 KB, 16,383-entry table, 24 bytes per slot) |
| Entity lookup table | unk_FE5700 (512 KB, used by sub_469F30) |
| Free list head | qword_F05708 (recycled cache entries) |
| Total restriction errors | 24 across 8 categories |
Why These Restrictions Exist
The CUDA compilation model splits a single .cu source file into two compilation paths:
- Host path: cudafe++ generates a .int.c file containing host stubs. The host compiler (gcc, clang, MSVC) compiles these stubs and produces a host object file. Each __global__ function template instantiation becomes a __wrapper__device_stub_ function.
- Device path: The same source is compiled by cicc into PTX. The device compiler independently instantiates the same templates and produces the device-side function bodies.
At link time, the CUDA runtime matches host stubs to device functions by mangled name. Both compilers must produce identical mangled names for every __global__ template instantiation. This is only possible when all template arguments are types that both compilers can see, name, and mangle identically. A host-only local type, for example, exists only in the host compiler's scope -- cicc cannot see it and cannot produce a matching mangled name. The restrictions documented below enforce this invariant.
The same logic applies to __device__/__constant__ variable templates, which must also match across the host/device boundary for registration and symbol lookup.
Category A: __global__ Declaration Restrictions (8 errors)
These errors reject __global__ functions and function templates that use C++ features which would break host stub generation or violate kernel ABI constraints.
| Tag | Message | Reason |
|---|---|---|
global_function_constexpr | A __global__ function or function template cannot be marked constexpr | Kernels are not evaluated at compile time; constexpr is meaningless for device launch. |
global_function_consteval | A __global__ function or function template cannot be marked consteval | consteval requires compile-time evaluation, incompatible with runtime kernel launch. |
global_class_decl | A __global__ function or function template cannot be a member function | Kernels have no this pointer; the launch ABI has no slot for an object reference. |
global_friend_definition | A __global__ function or function template cannot be defined in a friend declaration | Friend definitions have limited visibility, conflicting with the requirement for a globally-linkable stub. |
global_exception_spec | An exception specification is not allowed for a __global__ function or function template | GPU hardware has no exception unwinding mechanism. |
global_function_in_unnamed_inline_ns | A __global__ function or function template cannot be declared within an inline unnamed namespace | Unnamed namespaces produce TU-local linkage, but kernel stubs must have external linkage for runtime registration. |
global_function_with_initializer_list | a __global__ function or function template cannot have a parameter with type std::initializer_list | std::initializer_list holds a pointer to backing storage that cannot be transparently transferred to device memory. |
global_va_list_type | A __global__ function or function template cannot have a parameter with va_list type | Variadic argument lists require stack-based access that does not exist on GPU hardware. |
These checks occur during attribute application in apply_nv_global_attr (sub_40E1F0 / sub_40E7F0) and in the post-validation pass nv_validate_cuda_attributes (sub_6BC890). The checks apply equally to non-template __global__ functions and __global__ function templates.
Category B: Variadic __global__ Template Constraints (2 errors)
Standard C++ allows multiple parameter packs in a template and does not require packs to be the last parameter. CUDA restricts this for __global__ templates because the host stub ABI requires unambiguous argument layout.
| Tag | Message |
|---|---|
global_function_pack_not_last | Pack template parameter must be the last template parameter for a variadic __global__ function template |
global_function_multiple_packs | Multiple pack parameters are not allowed for a variadic __global__ function template |
Rationale
The kernel launch wrapper (<<<grid, block>>>) must marshal each argument into a contiguous parameter buffer. For a variadic template like template<typename... Ts> __global__ void kernel(Ts... args), the compiler generates the buffer layout at instantiation time. If the pack is not last, or if multiple packs are present, the positional mapping between template parameters and launch arguments becomes ambiguous -- the compiler cannot determine which arguments belong to which pack without full deduction context that may not be available at stub generation time.
Example
// OK: single pack, last position
template<typename T, typename... Ts>
__global__ void kernel(T first, Ts... rest);
// Error: pack not last
template<typename... Ts, typename T>
__global__ void kernel(Ts... args, T last); // global_function_pack_not_last
// Error: multiple packs
template<typename... Ts, typename... Us>
__global__ void kernel(Ts... a, Us... b); // global_function_multiple_packs
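The two pack rules can be expressed as a check over the template parameter list. This is a minimal sketch: check_global_template_packs is a hypothetical helper, the parameter list is reduced to one is-pack flag per parameter, and the order in which the two rules are tested is an assumption rather than something recovered from the binary.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Each element says whether the corresponding template parameter is a pack.
// Returns the diagnostic tag from the table above, or an empty string if the
// parameter list is acceptable for a variadic __global__ template.
inline std::string check_global_template_packs(const std::vector<bool>& is_pack) {
    std::size_t count = 0;
    for (bool p : is_pack) {
        if (p) ++count;
    }
    if (count > 1) return "global_function_multiple_packs";
    if (count == 1 && !is_pack.back()) return "global_function_pack_not_last";
    return "";
}
```

Applied to the three declarations above, {false, true} is accepted, {true, false} reports global_function_pack_not_last, and {true, true} reports global_function_multiple_packs.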
Category C: Template Argument Visibility for __global__ (6 errors)
These are the core name-mangling restrictions. Every type used as a template argument to a __global__ function template instantiation must be visible and nameable by both the host and device compilers.
C.1: Host-local types
| Tag | Message |
|---|---|
global_func_local_template_arg | A type defined inside a __host__ function (%t) cannot be used in the template argument type of a __global__ function template instantiation |
A type defined inside a __host__ function exists only within that function's scope. The device compiler never sees it and cannot produce a matching mangled name.
void host_function() {
struct LocalType { int x; };
kernel<LocalType><<<1,1>>>(); // error: host-local type
}
C.2: Private/protected class members
| Tag | Message |
|---|---|
global_private_type_arg | A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the class is local to a __device__ or __global__ function |
Private/protected nested types are accessible only through the enclosing class's access control. While C++ allows friend access and member function access to these types, the device compiler processes templates independently and may not have the same access context. The exception for types local to __device__/__global__ functions reflects that both compilers see device function bodies.
class Outer {
struct Inner { int x; }; // private
friend void launch();
};
void launch() {
kernel<Outer::Inner><<<1,1>>>(); // error: private type
}
C.3: Unnamed types
| Tag | Message |
|---|---|
global_unnamed_type_arg | An unnamed type (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the type is local to a __device__ or __global__ function |
Unnamed types (anonymous structs, unnamed enums) have no canonical name. Itanium ABI mangling for unnamed types relies on positional encoding within the enclosing scope, which may differ between host and device compilers if they process the enclosing scope differently. Types local to __device__/__global__ functions are exempt because the device compiler processes those scopes identically.
enum { A, B, C }; // unnamed enum
kernel<decltype(A)><<<1,1>>>(); // error: unnamed type
C.4: Lambda closures
| Tag | Message |
|---|---|
global_lambda_template_arg | The closure type for a lambda (%t%s) cannot be used in the template argument type of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the flag '-extended-lambda' is specified and the lambda is an extended lambda (a __device__ or __host__ __device__ lambda defined within a __host__ or __host__ __device__ function) |
Lambda closures are compiler-generated anonymous types. Without --extended-lambda, there is no protocol for both compilers to agree on the closure type's mangled representation. The extended lambda mechanism (--extended-lambda, spelled -extended-lambda in the diagnostic text) establishes a naming convention for lambdas annotated with __device__ or __host__ __device__, enabling cross-compiler name agreement.
auto f = [](int x){ return x*2; };
kernel<decltype(f)><<<1,1>>>(); // error unless extended lambda
C.5: Private/protected template template arguments
| Tag | Message |
|---|---|
global_private_template_arg | A template that is defined inside a class and has private or protected access cannot be used in the template template argument of a __global__ function template instantiation |
The same access-control problem as C.2, but for template template parameters. A private class template used as a template template argument cannot be guaranteed visible in the device compiler's independent instantiation context.
C.6: Texture/surface non-type arguments
| Message |
|---|
A texture or surface variable cannot be used in the non-type template argument of a __device__, __host__ __device__ or __global__ function template instantiation |
Texture and surface objects have special hardware semantics. Their runtime addresses are not fixed at compile time (they are bound through the texture subsystem), so they cannot serve as non-type template arguments whose values must be known to produce a deterministic mangled name.
Implementation: The Access Checking Pipeline
The template argument restriction checks are implemented in a three-function pipeline within cp_gen_be.c:
sub_469F80 — template_arg_is_accessible
This is the primary entry point. It dispatches on the template argument kind (byte at arg+8):
int template_arg_is_accessible(arg_t *arg, int scope_depth, char check_scope, int *cache_miss) {
arg->flags_25 |= 0x10; // mark: currently checking
int kind = arg->kind; // offset +8
switch (kind) {
case 0: // type argument
type = arg->value; // offset +32
result = cache_access_result_for(type, 6, scope_depth, cache_miss);
if (!result && (check_scope & 1)) {
// walk through typedef chains (type_kind == 12)
while (type->kind == 12)
type = type->canonical; // offset +144
result = cache_access_result_for(type, 6, scope_depth, cache_miss);
if (!result) {
sub_469F30(&type_holder, 0); // resolve via entity lookup
result = (type_holder != original_type);
}
}
break;
case 1: // template argument (template template parameter)
entity = arg->value; // offset +32
// Check class accessibility via derivation chain
if (entity->base_class) { // offset +128
// Use IL walker sub_61FE60 with callback sub_46ACC0
sub_61EC40(visitor_state);
visitor_state[0] = sub_46ACC0; // the callback
sub_61FE60(entity->base_class, visitor_state);
result = (visitor_state->found == 0);
}
break;
case 2: // non-type argument
result = cache_access_result_for(arg->value, 58, scope_depth, cache_miss);
break;
default:
__assert_fail("template_arg_is_accessible", "cp_gen_be.c", 2448);
}
arg->flags_25 &= ~0x10; // clear: done checking
return result;
}
The flags_25 |= 0x10 / &= ~0x10 pattern is a recursion guard: it marks the argument as "currently being checked" to prevent infinite loops through mutually-referential template arguments.
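The guard pattern can be shown in isolation. The sketch below is an assumption-laden illustration: Arg is a hypothetical stand-in that keeps only the flag byte, and the RAII wrapper CheckGuard is a C++ convenience that the decompiled code does not use (it sets and clears the bit manually).

```cpp
#include <cstdint>

// Hypothetical stand-in for a template argument node; only the flag byte matters.
struct Arg {
    std::uint8_t flags_25 = 0;
    Arg* refers_to = nullptr;  // lets two arguments reference each other
};

constexpr std::uint8_t kBeingChecked = 0x10;

// RAII form of the |= 0x10 / &= ~0x10 pair: the bit is cleared on every exit path.
class CheckGuard {
public:
    explicit CheckGuard(Arg& a) : arg_(a) { arg_.flags_25 |= kBeingChecked; }
    ~CheckGuard() { arg_.flags_25 &= static_cast<std::uint8_t>(~kBeingChecked); }
    CheckGuard(const CheckGuard&) = delete;
    CheckGuard& operator=(const CheckGuard&) = delete;

private:
    Arg& arg_;
};

// Counts how many arguments are visited; an argument that is already marked
// is on the active path and is skipped instead of being re-entered.
inline int visit(Arg& a) {
    if (a.flags_25 & kBeingChecked) return 0;
    CheckGuard guard(a);
    return 1 + (a.refers_to ? visit(*a.refers_to) : 0);
}
```

With two arguments that refer to each other, visit terminates after touching each one once and leaves both flag bytes cleared.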
sub_469480 — cache_access_result_for
This function caches the result of access checking for a given entity to avoid redundant computation. The cache is a hash table at xmmword_F05720 with 16,383 buckets (0x3FFF), each 24 bytes wide.
Cache entry layout (24 bytes):
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Pointer to next entry in chain (collision list) |
| +8 | 8 | entity | Entity pointer being cached |
| +16 | 4 | scope_id | Scope identifier from qword_1065708 chain |
| +20 | 1 | result | Cached access result (1 = accessible, 0 = not) |
| +21 | 1 | arg_kind | Template argument kind that was checked |
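The layout in the table can be written as a struct and checked at compile time. This is a sketch that assumes an LP64 target with 8-byte pointers, matching the x86-64 binary; the struct name AccessCacheEntry is hypothetical, and bytes +22 to +23, which the table does not describe, are treated here as padding.

```cpp
#include <cstddef>
#include <cstdint>

// Field names follow the table above; the struct name is illustrative.
struct AccessCacheEntry {
    AccessCacheEntry* next;      // +0  next entry in the collision chain
    void*             entity;    // +8  entity pointer being cached
    std::uint32_t     scope_id;  // +16 scope identifier
    std::uint8_t      result;    // +20 1 = accessible, 0 = not
    std::uint8_t      arg_kind;  // +21 template argument kind that was checked
    // +22..+23: not described in the table; padding in this sketch
};

static_assert(offsetof(AccessCacheEntry, entity) == 8, "entity at +8");
static_assert(offsetof(AccessCacheEntry, scope_id) == 16, "scope_id at +16");
static_assert(offsetof(AccessCacheEntry, result) == 20, "result at +20");
static_assert(offsetof(AccessCacheEntry, arg_kind) == 21, "arg_kind at +21");
static_assert(sizeof(AccessCacheEntry) == 24, "24-byte slot");
```

The static_assert lines fail to compile on a target with a different pointer size, which makes the LP64 assumption explicit.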
Hash function: The entity pointer is right-shifted by 6 bits, then taken modulo 0x3FFF:
unsigned x = (unsigned)(entity >> 6);
unsigned hash = (unsigned)((x * 262161ULL) >> 32);
unsigned bucket = x - 0x3FFF * ((hash + ((x - hash) >> 1)) >> 13); // x % 0x3FFF
char *slot = &xmmword_F05720[24 * bucket];
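The multiply-and-shift sequence is the compiler's strength-reduced form of an unsigned division by 16383 (0x3FFF). The sketch below puts that form next to the plain modulo so the equivalence can be checked. The function names are hypothetical, and truncating the shifted pointer to 32 bits before hashing is an assumption based on the (unsigned) cast in the listing.

```cpp
#include <cstdint>

// Bucket index as emitted: multiply-high by 262161, then add-and-shift.
inline std::uint32_t bucket_magic(std::uint64_t entity) {
    const std::uint32_t x = static_cast<std::uint32_t>(entity >> 6);
    const std::uint32_t hi = static_cast<std::uint32_t>((x * 262161ULL) >> 32);
    const std::uint32_t q = (hi + ((x - hi) >> 1)) >> 13;  // q == x / 0x3FFF
    return x - 0x3FFFu * q;                                // remainder
}

// The same value written directly.
inline std::uint32_t bucket_plain(std::uint64_t entity) {
    return static_cast<std::uint32_t>(entity >> 6) % 0x3FFFu;
}
```

Both functions agree for every 32-bit value of x, because 2^32 + 262161 is the standard magic multiplier for dividing by 16383 with a total shift of 46 bits. The result always lies in the range 0 to 16382.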
Cache hit path: If slot->entity == entity and the scope matches, return the cached result immediately. The function walks the qword_1065708 chain (the scope stack) to verify that the cached result was computed in a compatible scope context.
Cache eviction: When a cached entry's scope no longer matches (the scope stack has changed since caching), the entry is moved to the free list (qword_F05708). New entries are allocated from the free list or via sub_6B7340 (24-byte allocation).
Fallback (cache miss): On cache miss, the function performs the actual accessibility analysis:
- For type arguments (kind 6): resolves typedefs, checks if the type is a class/struct/enum with access restrictions. Uses sub_5F9C10 to resolve through elaborated type specifiers. Checks entity->access_bits at +80 (bits 0-1: 0=public, 1=protected, 2=private).
- For non-type arguments (kind 58): checks the entity's accessibility directly.
- For class/struct types (kinds 9-11): walks the class's template argument list recursively via sub_469F80.
- For dependent types (kind 14): recursively checks the base type.
- For function types (kind 7) and pointer-to-member types (kind 13): recursively checks the return type, parameter types, and pointed-to class.
After computing the result, it is stored in the cache for future lookups.
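The hit, eviction, and miss paths can be summarized with a much simplified model. This sketch is not the binary's data structure: AccessCache is a hypothetical class, a std::unordered_map replaces the fixed bucket array and free list, and a single integer scope_id replaces the walk over the scope stack.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

class AccessCache {
public:
    using Compute = std::function<bool(const void*)>;

    // Returns the access result for `entity` in `scope_id`. Sets *cache_miss
    // to true when the analysis callback had to run.
    bool lookup(const void* entity, std::uint32_t scope_id,
                const Compute& compute, bool* cache_miss) {
        auto it = entries_.find(entity);
        if (it != entries_.end() && it->second.scope_id == scope_id) {
            *cache_miss = false;  // hit: same entity, compatible scope
            return it->second.result;
        }
        if (it != entries_.end()) {
            entries_.erase(it);   // stale scope: evict the old entry
            ++evictions_;
        }
        const bool result = compute(entity);  // fallback analysis
        entries_[entity] = Entry{scope_id, result};
        *cache_miss = true;
        return result;
    }

    int evictions() const { return evictions_; }

private:
    struct Entry {
        std::uint32_t scope_id;
        bool result;
    };
    std::unordered_map<const void*, Entry> entries_;
    int evictions_ = 0;
};
```

A second lookup of the same entity in the same scope returns the stored result without invoking the callback. A lookup under a different scope evicts the stale entry and recomputes.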
sub_46A230 — Template Arg List Walker
This function walks a template instantiation's argument list and checks each argument for accessibility. It uses the entity lookup hash table at unk_FE5700 to find cached resolution results.
__int64 walk_template_args(__int64 hash_table, unsigned __int64 type) {
// Resolve through typedef chains
while (type->kind == 12)
type = type->canonical; // offset +144
// Hash the type pointer into a bucket
_QWORD *bucket = hash_table + 32 * ((type >> 6) % 0x3FFF);
// Walk the bucket chain
while (bucket && bucket[1]) {
entry = bucket[1]; // the entity entry
// Check if this entry matches our type
if (entry->canonical != type && !sub_7B2260(entry->canonical, type, 0))
continue;
// Scope compatibility check
if (bucket[2] && bucket[2] != qword_126C5D0)
continue;
// For template entities (kind 10), walk their argument lists
if (entry->kind == 10) {
arg = *entry->template_args; // head of the argument list
while (arg) {
if (arg->flags_25 & 0x10) // already being checked
goto next;
if (!template_arg_is_accessible(arg, 0, 0, &miss))
goto not_found;
arg = arg->next;
}
}
// Access check on the entity itself
if (entry->access_bits != 0) // private/protected
if (!sub_467780(entry, 1, 0)) // check access
goto not_found;
// Cache the resolved entity in bucket[3]
bucket[3] = qword_10657E8;
return entry;
}
return 0;
}
The walker handles three argument kinds:
- Kind 0 (type): Checks the type entity's accessibility and, for class templates (kind 12 with subkind 10), recursively walks nested template arguments.
- Kind 1 (template): Checks the template entity's class ancestry.
- Kind 2 (non-type): Resolves the non-type argument's scope via sub_5F9BC0.
sub_46A5B0 — arg_before_unnamed_template_param_arg
This function handles the generation of template arguments that appear before unnamed template parameter arguments. It determines the positional index of each argument relative to the template parameter list and calls the appropriate code-generation routine. The assert at line 4795 guards against an unexpected argument kind (must be 0, 1, or 2; kind 3 is a pack expansion sentinel).
Category D: Variable Template Parallel Restrictions (5 errors)
Variable templates (template<typename T> __device__ T var = ...) used in device contexts carry the same restrictions as __global__ function templates. The diagnostics mirror Category C exactly:
| Tag | Message |
|---|---|
| variable_template_private_type_arg | A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a variable template instantiation, unless the class is local to a __device__ or __global__ function |
| variable_template_private_template_arg | (private template template arg in variable template) |
| variable_template_unnamed_type_template_arg | An unnamed type (%t) cannot be used in the template argument type of a variable template template instantiation, unless the type is local to a __device__ or __global__ function |
| variable_template_func_local_template_arg | A type defined inside a __host__ function (%t) cannot be used in the template argument type of a variable template template instantiation |
| variable_template_lambda_template_arg | The closure type for a lambda (%t%s) cannot be used in the template argument type of a variable template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the lambda is an 'extended lambda' and the flag --extended-lambda is specified |
The implementation shares the same cache_access_result_for / template_arg_is_accessible pipeline described in the Category C implementation section. The only difference is the error tag and message string emitted on failure.
Why Variable Templates Need the Same Restrictions
Variable templates instantiated with __device__, __constant__, or __managed__ memory space are registered by the CUDA runtime using their mangled names. The host-side .int.c file contains registration arrays (emitted in .nvHRDE, .nvHRDI, .nvHRCE, .nvHRCI sections) whose entries are byte arrays encoding mangled variable names. The device compiler independently mangles the same variable template instantiation. Both must produce identical names, so the same visibility constraints apply.
Category E: Static Global Template Stub (2 errors)
In whole-program compilation mode (-rdc=false) with -static-global-template-stub=true, template __global__ functions receive static linkage on their host stubs. This prevents ODR violations when the same template kernel is instantiated in multiple translation units. Two scenarios are incompatible with this mode:
| Tag | Message |
|---|---|
| extern_kernel_template | when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false"). To resolve the issue, either use separate compilation mode ("-rdc=true"), or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off) |
| template_global_no_def | when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit. To resolve this issue, either use separate compilation mode ("-rdc=true"), or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off) |
The Problem
An extern template kernel declaration says "this template instantiation exists elsewhere." But if the stub is static, there is no way for the linker to resolve the extern reference to a stub in another TU, because static symbols are TU-local. Similarly, a template instantiation without a definition in the current TU cannot have a static stub generated for it, because there is no body to inline.
Resolution Paths
Both diagnostics suggest the same two alternatives:
- Switch to -rdc=true (separate compilation): each TU gets its own device object, and cross-TU kernel references are resolved by the device linker (nvlink).
- Set -static-global-template-stub=false: stubs get external linkage, allowing cross-TU references at the cost of potential ODR violations if the same template is instantiated in multiple TUs.
Category F: Local Type Prevents Host Launch (1 error)
| Tag | Message |
|---|---|
| local_type_used_in_global_function | a local type %t (defined in %sq1) used in global function %sq2 template argument, the global function cannot be launched from host code. |
This is a warning-level diagnostic, not a hard error. It fires when a type local to a function (but not a __host__-function-local type, which would be Category C.1) is used as a template argument. The kernel can still be instantiated and called from device code, but the host-side launch path is blocked because the local type is not visible to the host stub generator.
This diagnostic differs from global_func_local_template_arg in severity and scope: it is a soft warning that the kernel "cannot be launched from host code," rather than a hard error that rejects the instantiation entirely.
Category G: __grid_constant__ in Instantiation Directives (1 error)
| Tag | Message |
|---|---|
| grid_constant_incompat_templ_redecl | incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p) |
When a function template is redeclared, the __grid_constant__ annotations on its parameters must match the original declaration. This is enforced because __grid_constant__ affects the ABI: a parameter marked __grid_constant__ is placed in constant memory and accessed through a different addressing mode. If a redeclaration omits the annotation, the host stub and device function would disagree on parameter layout.
The related diagnostic grid_constant_incompat_instantiation_directive applies to explicit instantiation directives (template __global__ void kernel<int>(...)) and is documented in the grid_constant page.
Category H: Kernel Launches from System File Templates (1 error)
| Message |
|---|
| kernel launches from templates are not allowed in system files |
This error fires when a <<<...>>> kernel launch expression appears inside a template function defined in a system header file. System headers are files marked with #pragma system_header or located in system include paths (e.g., the CUDA toolkit's include/ directory).
The restriction exists because system headers are processed with relaxed diagnostics. Kernel launch expressions inside template functions in system headers would be instantiated in user code contexts, but the launch transformation (replacing <<<...>>> with a __cudaPushCallConfiguration call followed by the stub call) operates during the system header's processing pass, where diagnostic state may be suppressed. Rather than risk silent miscompilation, the compiler rejects this pattern outright.
The __NV_name_expr Mechanism (6 errors)
NVRTC (NVIDIA's runtime compilation library) provides a mechanism to obtain the mangled name of a __global__ function or __device__/__constant__ variable at compile time. This mechanism is exposed through the __CUDACC_RTC__name_expr intrinsic, which the frontend processes during lowered name lookup.
Purpose
NVRTC compiles CUDA code at runtime, producing PTX that is loaded into the driver. The host application needs to look up compiled kernels and device variables by name via cuModuleGetFunction / cuModuleGetGlobal. The __NV_name_expr mechanism bridges this gap: the user provides a C++ name expression (e.g., my_kernel<int> or my_device_var<float>), and the compiler returns the corresponding mangled name (e.g., _Z9my_kernelIiEvv).
The 6 Errors
| Tag | Message |
|---|---|
| name_expr_parsing | Error in parsing name expression for lowered name lookup. Input name expression was: %sq |
| name_expr_extra_tokens | Extra tokens found after parsing name expression for lowered name lookup. Input name expression was: %sq |
| name_expr_internal_error | Internal error in parsing name expression for lowered name lookup. Input name expression was: %sq |
| name_expr_non_global_routine | Name expression cannot form address of a non-__global__ function. Input name expression was: %sq |
| name_expr_non_device_variable | Name expression cannot form address of a variable that is not a __device__/__constant__ variable. Input name expression was: %sq |
| name_expr_not_routine_or_variable | Name expression must form address of a __global__ function or the address of a __device__/__constant__ variable. Input name expression was: %sq |
Processing Pipeline
- Parsing: The name expression is parsed as a C++ id-expression. If parsing fails, name_expr_parsing is emitted. If tokens remain after a successful parse, name_expr_extra_tokens fires.
- Lookup: The parsed expression is resolved via standard C++ name lookup (qualified or unqualified, with template argument deduction if needed).
- Validation: The resolved entity is checked:
  - If it is a function, it must be __global__ (has the __global__ execution space byte set). Otherwise: name_expr_non_global_routine.
  - If it is a variable, it must be __device__ or __constant__ (memory space bits at entity+148). Otherwise: name_expr_non_device_variable.
  - If it is neither a function nor a variable: name_expr_not_routine_or_variable.
- Mangling: If validation passes, the entity is mangled using the Itanium ABI mangler (in lower_name.c) and the resulting string is recorded for NVRTC output.
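The validation step reduces to a three-way decision on the resolved entity. The sketch below illustrates that decision only; the `name_expr_entity` struct, its boolean fields, and `validate_name_expr` are hypothetical stand-ins for the real entity node and its execution-space and memory-space bits, and only the returned diagnostic tags come from the table above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a resolved entity. */
typedef enum { ENT_FUNCTION, ENT_VARIABLE, ENT_OTHER } entity_kind;

typedef struct {
    entity_kind kind;
    bool is_global;          /* function carries the __global__ execution space */
    bool is_device_or_const; /* variable is __device__ or __constant__ */
} name_expr_entity;

/* Returns NULL when the entity may be mangled, otherwise the tag of the
   diagnostic that the validation step would emit. */
static const char *validate_name_expr(const name_expr_entity *e) {
    switch (e->kind) {
    case ENT_FUNCTION:
        return e->is_global ? NULL : "name_expr_non_global_routine";
    case ENT_VARIABLE:
        return e->is_device_or_const ? NULL : "name_expr_non_device_variable";
    default:
        return "name_expr_not_routine_or_variable";
    }
}
```

Only entities for which this returns NULL would proceed to the mangling step.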
Connection to Template Restrictions
The __NV_name_expr mechanism relies on every template argument being mangleable. All of the Category C restrictions directly support this: if a template argument type cannot be mangled consistently (because it is unnamed, local, private, etc.), the name expression lookup would produce a mangled name that does not match the device-side mangling. The restrictions are enforced at template instantiation time, before any name expression lookup occurs, so that invalid instantiations never reach the mangling stage.
Data Structures
Template Argument Node (arg_t)
The template argument node is a linked-list entry used by sub_469F80 and sub_46A230:
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Next argument in the list |
| +8 | 1 | kind | Argument kind: 0=type, 1=template, 2=non-type, 3=pack expansion |
| +24 | 1 | flags_24 | Bit 0: is pack expansion |
| +25 | 1 | flags_25 | Bit 4 (0x10): currently being checked (recursion guard) |
| +32 | 8 | value | Pointer to the type/entity/expression |
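The offsets in this table can be written down as a C struct and checked at compile time. This is a sketch under the assumption of an LP64 target (8-byte pointers), matching the x86-64 binary; the struct name `arg_node`, the padding fields, and both helper functions are hypothetical, and the bytes between the documented fields are treated as unknown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of the documented fields only. */
typedef struct arg_node {
    struct arg_node *next;  /* +0  next argument in the list */
    uint8_t kind;           /* +8  0=type, 1=template, 2=non-type, 3=pack expansion */
    uint8_t unknown_9[15];  /* +9..+23 not documented */
    uint8_t flags_24;       /* +24 bit 0: is pack expansion */
    uint8_t flags_25;       /* +25 bit 4: currently being checked */
    uint8_t unknown_26[6];  /* +26..+31 not documented */
    void *value;            /* +32 type/entity/expression */
} arg_node;

_Static_assert(offsetof(arg_node, kind) == 8, "kind at +8");
_Static_assert(offsetof(arg_node, flags_24) == 24, "flags_24 at +24");
_Static_assert(offsetof(arg_node, flags_25) == 25, "flags_25 at +25");
_Static_assert(offsetof(arg_node, value) == 32, "value at +32");

/* Recursion guard test, mirroring the 0x10 check in the walker. */
static int arg_is_being_checked(const arg_node *a) {
    return (a->flags_25 & 0x10) != 0;
}

/* Number of nodes reachable through the next pointers. */
static size_t arg_list_length(const arg_node *head) {
    size_t n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```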
Entity Node (type/symbol)
Relevant fields for accessibility checking:
| Offset | Size | Field | Description |
|---|---|---|---|
| +8 | 8 | name_entry | Name string pointer (or next scope for unnamed) |
| +24 | 8 | alt_name | Alternative name (for flag bit 3 at +81) |
| +40 | 8 | scope_info | Scope information; +32 from this is the enclosing class/namespace |
| +80 | 1 | access_bits | Bits 0-1: access specifier (0=public, 1=protected, 2=private) |
| +81 | 1 | entity_flags | Bit 2 (0x04): is template specialization; bit 6 (0x40): is anonymous |
| +128 | 8 | base_class | Base class pointer (for class entities) |
| +132 | 1 | type_kind | Type kind: 6/8=pointer/ref, 7=function, 9-11=class/struct/enum, 12=typedef, 13=pointer-to-member, 14=dependent |
| +144 | 8 | canonical | Canonical type (for typedefs: the underlying type) |
| +148 | 1 | subtype_kind | Subkind (for type_kind 12: 10=template-id, 12=elaborated) |
| +152 | 8 | type_info | Type-specific data (template args, function params, etc.) |
| +160 | 1 | template_kind | For template entities: template kind |
| +161 | 1 | visibility | Bit 7 (0x80): private visibility (negative char value) |
| +162 | 2 | extra_flags | Bit 7 (0x80) + bit 9 (0x200): cached accessibility state |
Diagnostic Summary
All 24 errors sorted by category:
| # | Category | Tag | Severity |
|---|---|---|---|
| 1 | A | global_function_constexpr | error |
| 2 | A | global_function_consteval | error |
| 3 | A | global_class_decl | error |
| 4 | A | global_friend_definition | error |
| 5 | A | global_exception_spec | error |
| 6 | A | global_function_in_unnamed_inline_ns | error |
| 7 | A | global_function_with_initializer_list | error |
| 8 | A | global_va_list_type | error |
| 9 | B | global_function_pack_not_last | error |
| 10 | B | global_function_multiple_packs | error |
| 11 | C | global_func_local_template_arg | error |
| 12 | C | global_private_type_arg | error |
| 13 | C | global_unnamed_type_arg | error |
| 14 | C | global_lambda_template_arg | error |
| 15 | C | global_private_template_arg | error |
| 16 | C | (texture/surface non-type arg) | error |
| 17 | D | variable_template_private_type_arg | error |
| 18 | D | variable_template_private_template_arg | error |
| 19 | D | variable_template_unnamed_type_template_arg | error |
| 20 | D | variable_template_func_local_template_arg | error |
| 21 | D | variable_template_lambda_template_arg | error |
| 22 | E | extern_kernel_template | error |
| 23 | E | template_global_no_def | error |
| 24 | F | local_type_used_in_global_function | warning |
Category G (grid_constant_incompat_templ_redecl) and Category H (kernel launches from templates...) are counted separately as they span the template/non-template boundary.
Function Map
| Address | Identity | Lines | Role |
|---|---|---|---|
| sub_469F80 | template_arg_is_accessible | 144 | Primary access checker -- dispatches on arg kind |
| sub_469480 | cache_access_result_for | 670 | Hash-cached accessibility analysis |
| sub_46A230 | (walks template arg lists) | 182 | Iterates entity lookup table for arg lists |
| sub_46A5B0 | arg_before_unnamed_template_param_arg | 396 | Handles args before unnamed template params |
| sub_469F30 | (scope resolve helper) | 23 | Resolves scope via cache_access_result_for + entity lookup |
| sub_46ACC0 | (scope walk callback) | -- | Callback passed to IL walker sub_61FE60 |
| sub_467780 | (access check) | -- | Checks C++ access control (public/protected/private) |
| sub_466F40 | (output callback) | -- | Code generation output callback |
| sub_5BFC70 | (pack expansion resolver) | -- | Resolves pack expansion nodes (kind 3) |
| sub_5F9BC0 | (scope resolver) | -- | Resolves entity scope chain |
| sub_5F9C10 | (elaborated type resolver) | -- | Resolves elaborated type specifiers |
| sub_7B2260 | (type equivalence) | -- | Checks structural type equivalence |
| sub_61EC40 | (init visitor) | 27 | Initializes IL tree visitor state |
| sub_61FE60 | (walk expression tree) | 17 | Walks expression tree with callback |
Global Variables
| Global | Address | Description |
|---|---|---|
| xmmword_F05720 | 0xF05720 | Access check cache hash table (384 KB, 16,382 entries x 24 bytes) |
| qword_F05708 | 0xF05708 | Free list head for recycled cache entries |
| qword_F05730 | 0xF05730 | Scope ID array parallel to cache (4 bytes per entry) |
| unk_FE5700 | 0xFE5700 | Entity lookup hash table (512 KB) |
| qword_1065708 | 0x1065708 | Scope stack head (linked list of scope entries) |
| qword_126C5D0 | 0x126C5D0 | Global scope sentinel |
| qword_10657E8 | 0x10657E8 | Current scope context for entity resolution |
| dword_1065848 | 0x1065848 | Extended lambda mode flag |
| dword_1065850 | 0x1065850 | Device stub mode flag |
Cross-References
- Template Engine -- instantiation worklist, fixpoint loop, and the should_be_instantiated gate
- __global__ Function Attributes -- attribute application and post-validation checks
- Kernel Stub Generation -- host stub emission, -static-global-template-stub flag
- __grid_constant__ -- parameter annotation compatibility in template redeclarations
- CUDA Diagnostics -- complete error catalog with all 24+ messages
- Lambda Device Wrapper -- extended lambda mechanism for closure type template args
- Execution Spaces -- host/device/global space model
- Backend Pipeline -- initialization of hash tables used by the access checker
- Int-C Format -- how the .int.c output encodes device symbol registration arrays
Constexpr Interpreter
The constexpr interpreter is the compile-time expression evaluation engine inside cudafe++. It lives in EDG 6.6's interpret.c (69 functions at 0x620CE0--0x65DE10, approximately 33,000 decompiled lines) and implements a virtual machine that executes arbitrary C++ expressions during compilation. Its central function, do_constexpr_expression (sub_634740), is the single largest function in the entire cudafe++ binary: 11,205 decompiled lines, 63KB of machine code, 128 unique callees, and 28 self-recursive call sites.
The interpreter exists because C++ constexpr evaluation requires the compiler to act as an execution engine. Since C++11, constexpr has grown from simple return-expression functions to a Turing-complete subset of C++ that includes loops, branches, dynamic memory allocation (C++20), virtual dispatch, exception-like control flow, and -- as of C++26 -- compile-time reflection. The interpreter must evaluate all of these constructs faithfully, track object lifetimes, detect undefined behavior, and convert results back into IL constants.
Key Facts
| Property | Value |
|---|---|
| Source file | interpret.c (69 functions, ~33,000 decompiled lines) |
| Address range | 0x620CE0--0x65DE10 |
| Main evaluator | sub_634740 (do_constexpr_expression), 11,205 lines, 63KB |
| Builtin evaluator | sub_651150 (do_constexpr_builtin_function), 5,032 lines |
| Loop evaluator | sub_644580 (do_constexpr_range_based_for_statement), 2,836 lines |
| Constructor evaluator | sub_6480F0 (do_constexpr_ctor), 1,659 lines |
| Call dispatcher | sub_657560 (do_constexpr_call), 1,445 lines |
| Top-level entry | sub_65AE50 (interpret_expr) |
| Materialization | sub_631110 (copy_interpreter_object_to_constant), 1,444 lines |
| Value extraction | sub_64B580 (extract_value_from_constant), 2,299 lines |
| Arena block size | 64KB (0x10000) |
| Large alloc threshold | 1,024 bytes (0x400) |
| Max type size | 64MB (0x4000000) |
| Uninitialized marker | 0xDB fill pattern |
| Self-recursive calls | 28 (in do_constexpr_expression) |
| Confirmed assert IDs | 38 functions with assert strings |
| C++26 reflection | 8 std::meta::* functions |
Architecture Overview
The interpreter is structured as a tree-walking evaluator with arena-based memory, memoization caching, and a call stack that mirrors C++ function invocation. The rest of the compiler invokes it through interpret_expr, which sets up interpreter state, calls the recursive evaluator, and converts the result back to an IL constant.
AST expression node
|
v
+-----------------+
| interpret_expr | sub_65AE50 — allocates state, arena, hash table
+-----------------+
|
v
+---------------------------+
| do_constexpr_expression | sub_634740 — the 11,205-line evaluator
| | dispatches on expression-kind code
| +-- arithmetic ops | cases 40-45: +, -, *, /, %
| +-- comparisons | cases 49-51: <, >, ==, !=, <=, >=
| +-- member access | cases 3-4: . and ->
| +-- type conversions | case 5: cast sub-switch (20+ type pairs)
| +-- pointer arithmetic | cases 46-48, 50: ptr+int, ptr-ptr
| +-- function calls ------+---> do_constexpr_call (sub_657560)
| +-- constructors ------+---> do_constexpr_ctor (sub_6480F0)
| +-- builtins ------+---> do_constexpr_builtin_function (sub_651150)
| +-- loops ------+---> do_constexpr_range_based_for (sub_644580)
| +-- statements ------+---> do_constexpr_statement (sub_647850)
| +-- dynamic_cast | inline within main evaluator
| +-- typeid | inline within main evaluator
| +-- offsetof | inline within main evaluator
| +-- bit_cast | calls translate_*_bytes functions
+---------------------------+
|
v
+-------------------------------------+
| copy_interpreter_object_to_constant | sub_631110 — materializes result
+-------------------------------------+ back into IL constant nodes
|
v
IL constant (returned to compiler)
Entry Points
The interpreter has multiple entry points, each called from a different compilation phase:
| Entry | Address | Lines | Called from |
|---|---|---|---|
| interpret_expr | sub_65AE50 | 572 | General constexpr evaluation (primary) |
| Entry for expression lowering | sub_65A290 | 311 | Expression lowering phase (sub_6E2040) |
| Entry for expression trees | sub_65A8C0 | 274 | Expression handling (sub_5BB4C0, sub_5C3760) |
| interpret_dynamic_sub_initializers | sub_65CFA0 | 67 | Aggregate initialization |
| Misc entries | sub_65BAB0--sub_65D150 | 150-470 | Template instantiation, static_assert, enum values |
All entry points follow the same pattern: allocate the interpreter state object, initialize the arena and hash table, call do_constexpr_expression, then extract and convert the result.
Interpreter State Object
The interpreter state is a structure passed as the first argument (a1) to every evaluator function. It contains the evaluation stack, heap tracking, memoization cache, and diagnostic context.
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | hash_table | Pointer to variable-to-value hash table |
| +8 | 8 | hash_capacity | Hash table capacity mask (low 32) / entry count (high 32) |
| +16 | 8 | stack_top | Current stack allocation pointer |
| +24 | 8 | stack_base | Base of current arena block |
| +32 | 8 | heap_list | Head of heap allocation chain (large objects) |
| +40 | 4 | scope_depth | Current scope nesting counter |
| +56 | 8 | hash_aux_1 | Auxiliary hash table pointer |
| +64 | 8 | hash_aux_2 | Auxiliary hash table capacity |
| +72 | 8 | call_chain | Current call stack chain (for recursion tracking) |
| +88 | 8 | diag_context_1 | Diagnostic context pointer |
| +96 | 8 | diag_context_2 | Source location for error reporting |
| +112 | 8 | diag_context_3 | Additional diagnostic metadata |
| +132 | 1 | flags_1 | Mode flags (bit 0 = strict mode) |
| +133 | 1 | flags_2 | Additional mode flags |
Memory Model
The interpreter uses a dual-tier memory system: an arena allocator for small objects and direct heap allocation for large ones.
Arena Allocator
Arena blocks are 64KB (0x10000 bytes) each, linked together at offset +24:
Block layout:
+------------------+
| next_block (+0) |---> previous block (or null)
| alloc_ptr (+8) |---> current bump position
| capacity (+16) |---> end of usable space
| base (+24) |---> start of block data
+------------------+
| usable space | 64KB of object storage
| ... |
+------------------+
Allocation follows a bump-pointer pattern:
void *arena_alloc(interp_state *state, size_t size) {
size = ALIGN_UP(size, 8);
ptrdiff_t remaining = 0x10000 - (state->stack_top - state->stack_base);
if (remaining < size) {
// Allocate new 64KB block, link to chain
new_block = sub_622D20();
new_block->next = state->stack_base;
state->stack_base = new_block;
state->stack_top = new_block + HEADER_SIZE;
}
void *result = state->stack_top;
state->stack_top += size;
return result;
}
Large Object Heap
Objects larger than 1,024 bytes (0x400) bypass the arena and are allocated individually via sub_6B7340 (the compiler's general-purpose allocator). These allocations are tracked through an allocation chain so they can be freed when the interpreter scope exits.
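The two tiers can be combined into one runnable sketch. It is an illustration under stated assumptions, not a reconstruction: `malloc` stands in for both sub_622D20 and sub_6B7340, the block header is kept outside the 64KB payload for simplicity, and every name (`sketch_arena`, `sketch_alloc`, `sketch_release`, and the two node structs) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ARENA_BLOCK_SIZE 0x10000u /* 64KB of payload per block */
#define LARGE_THRESHOLD  0x400u   /* requests above this bypass the arena */

typedef struct arena_block {
    struct arena_block *next;     /* previously filled block, or NULL */
    size_t used;                  /* bump offset into data */
    unsigned char data[ARENA_BLOCK_SIZE];
} arena_block;

typedef struct large_alloc {
    struct large_alloc *next;     /* chain walked when the scope exits */
} large_alloc;

typedef struct {
    arena_block *blocks;
    large_alloc *large;
} sketch_arena;

static void *sketch_alloc(sketch_arena *a, size_t size) {
    if (size > SIZE_MAX - 7u)
        return NULL;                          /* rounding would wrap */
    if (size > LARGE_THRESHOLD) {             /* large tier: individual allocation */
        large_alloc *h = malloc(sizeof *h + size);
        if (!h) return NULL;
        h->next = a->large;
        a->large = h;
        return h + 1;                         /* payload follows the chain link */
    }
    size = (size + 7u) & ~(size_t)7u;         /* 8-byte alignment */
    if (!a->blocks || ARENA_BLOCK_SIZE - a->blocks->used < size) {
        arena_block *b = malloc(sizeof *b);   /* start a fresh block */
        if (!b) return NULL;
        b->next = a->blocks;
        b->used = 0;
        a->blocks = b;
    }
    void *p = a->blocks->data + a->blocks->used;
    a->blocks->used += size;                  /* bump the pointer */
    return p;
}

/* Frees every arena block and every tracked large allocation. */
static void sketch_release(sketch_arena *a) {
    while (a->blocks) {
        arena_block *n = a->blocks->next;
        free(a->blocks);
        a->blocks = n;
    }
    while (a->large) {
        large_alloc *n = a->large->next;
        free(a->large);
        a->large = n;
    }
}
```

Individual arena objects are never freed in this sketch; everything is released together, which matches the scope-exit cleanup described above.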
Object Header Layout
Every interpreter object has a header preceding the value bytes:
offset -10 [-10] bitmap byte 2 (validity tracking)
offset -9 [ -9] bitmap byte 1 (initialization tracking)
offset -8 [ -8] type pointer (8 bytes, points to type_node)
offset 0 [ 0] value bytes start here
... value data (size depends on type)
New objects are initialized with value bytes filled to 0xDB (decimal 219), which serves as an uninitialized-memory sentinel. Any read from an object whose bytes still contain 0xDB triggers error 2700 (access to uninitialized object).
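The fill-pattern idea can be shown in a few lines. This sketch assumes a plain byte-by-byte comparison; the real check may also consult the initialization bitmap in the object header, and the names `UNINIT_FILL`, `mark_uninitialized`, and `looks_uninitialized` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define UNINIT_FILL 0xDB /* sentinel byte documented above */

/* Stamp a freshly created object's value bytes with the sentinel. */
static void mark_uninitialized(unsigned char *value, size_t n) {
    memset(value, UNINIT_FILL, n);
}

/* Returns 1 when every value byte still holds the sentinel. */
static int looks_uninitialized(const unsigned char *value, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (value[i] != UNINIT_FILL)
            return 0;
    return n != 0;
}
```

A byte-only check like this would misreport a legitimately stored value whose bytes all equal 0xDB, which is one reason a separate bitmap is useful.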
Constexpr Value Representation
Values in the interpreter use a type-dependent representation:
| Type category | kind byte | Value size | Representation |
|---|---|---|---|
| void | 0 | 0 | Flag 0x40 set, no value bytes |
| pointer | 1 | 0 | Stored as reference metadata, not inline bytes |
| integral | 2 | 16 bytes | Two 64-bit words (supports __int128) |
| float | 3 | 16 bytes | IEEE 754 value in first 4/8 bytes, padded |
| double | 4 | 16 bytes | IEEE 754 value in first 8 bytes, padded |
| complex | 5 | 32 bytes | Real + imaginary parts |
| class/struct | 6 | 32 bytes | Reference to interpreter object |
| union | 7 | 32 bytes | Reference to interpreter object |
| array | 8 | N * elem_size | Recursive: element count times element size |
| class (variants) | 9, 10, 11 | Cached | Looked up in type-to-size hash table |
| typedef | 12 | (follow) | Chase to underlying type |
| enum | 13 | 16 bytes | Same as integral |
| nullptr_t | 19 | 32 bytes | Null pointer representation |
The reference representation for pointers and class objects uses 32 bytes (two __m128i values). The flag byte at offset +8 within a reference encodes:
| Bit | Meaning |
|---|---|
| 0 | Has concrete object backing |
| 1 | Past-the-end pointer (one past array) |
| 2 | Has allocation chain (from constexpr new) |
| 3 | Has subobject path (member/base offset chain) |
| 4 | Has bitfield information |
| 5 | Is dangling (object lifetime ended) |
| 6 | Is const-qualified |
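The flag bits map naturally onto an enum. In the sketch below the bit positions come from the table, but the enumerator names and the two predicates are hypothetical: the read and write rules follow general C++ constant-evaluation semantics (no access through dangling or past-the-end pointers, no writes through const) and are an assumption rather than decompiled behavior.

```c
/* Bit positions from the table above; names are hypothetical. */
enum ref_flag {
    REF_HAS_OBJECT  = 1u << 0, /* concrete object backing */
    REF_PAST_END    = 1u << 1, /* one past the end of an array */
    REF_ALLOC_CHAIN = 1u << 2, /* came from constexpr new */
    REF_SUBOBJ_PATH = 1u << 3, /* member/base offset chain present */
    REF_BITFIELD    = 1u << 4, /* bitfield information present */
    REF_DANGLING    = 1u << 5, /* object lifetime has ended */
    REF_CONST       = 1u << 6  /* const-qualified */
};

/* Assumed rule: a read needs a live backing object that is not past-the-end. */
static int ref_can_read(unsigned char flags) {
    return (flags & REF_HAS_OBJECT) != 0
        && (flags & (REF_PAST_END | REF_DANGLING)) == 0;
}

/* Assumed rule: a write additionally requires a non-const reference. */
static int ref_can_write(unsigned char flags) {
    return ref_can_read(flags) && (flags & REF_CONST) == 0;
}
```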
Memoization Hash Table
The interpreter maintains a hash table that maps type pointers to precomputed value sizes, avoiding redundant recursive size computations for class types:
| Global | Purpose |
|---|---|
| qword_126FEC0 | Hash table base pointer |
| qword_126FEC8 | Capacity mask (low 32 bits) / entry count (high 32 bits) |
Each entry is 16 bytes: 8-byte key (type pointer), 4-byte size value, 4-byte padding. Collision resolution uses linear probing with a bitmask. The table grows (via sub_620760) when load factor exceeds 50%.
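A table with these properties (16-byte entries, mask-based linear probing, growth once the load passes 50%) can be sketched as follows. The hash function (pointer shifted right by 3) is an assumption, since the text does not state it, and all names (`size_entry`, `size_cache`, `cache_init`, `cache_put`, `cache_get`, `cache_grow`, `home_slot`) are hypothetical. A NULL key marks an empty slot, and the capacity is assumed to be a power of two.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    const void *key;  /* type pointer; NULL marks an empty slot */
    uint32_t size;    /* cached value size */
    uint32_t pad;     /* padding to 16 bytes on LP64 */
} size_entry;

typedef struct {
    size_entry *slots;
    uint32_t mask;    /* capacity - 1 */
    uint32_t count;   /* occupied slots */
} size_cache;

static int cache_put(size_cache *c, const void *key, uint32_t size);

/* Assumed hash: drop the low alignment bits, then apply the mask. */
static uint32_t home_slot(const void *key, uint32_t mask) {
    return (uint32_t)((uintptr_t)key >> 3) & mask;
}

static int cache_init(size_cache *c, uint32_t capacity) {
    c->slots = calloc(capacity, sizeof *c->slots);
    c->mask = capacity - 1;
    c->count = 0;
    return c->slots != NULL;
}

/* Double the capacity and reinsert every live entry. */
static int cache_grow(size_cache *c) {
    size_cache bigger;
    if (!cache_init(&bigger, (c->mask + 1) * 2))
        return 0;
    for (uint32_t i = 0; i <= c->mask; i++)
        if (c->slots[i].key)
            cache_put(&bigger, c->slots[i].key, c->slots[i].size);
    free(c->slots);
    *c = bigger;
    return 1;
}

static int cache_put(size_cache *c, const void *key, uint32_t size) {
    /* Grow first if this insertion would push the load above 50%. */
    if ((uint64_t)(c->count + 1) * 2 > (uint64_t)c->mask + 1 && !cache_grow(c))
        return 0;
    uint32_t i = home_slot(key, c->mask);
    while (c->slots[i].key && c->slots[i].key != key)
        i = (i + 1) & c->mask;               /* linear probe with wraparound */
    if (!c->slots[i].key) {
        c->slots[i].key = key;
        c->count++;
    }
    c->slots[i].size = size;
    return 1;
}

static int cache_get(const size_cache *c, const void *key, uint32_t *size) {
    uint32_t i = home_slot(key, c->mask);
    while (c->slots[i].key) {
        if (c->slots[i].key == key) {
            *size = c->slots[i].size;
            return 1;
        }
        i = (i + 1) & c->mask;
    }
    return 0;                                /* hit an empty slot: not cached */
}
```

Keeping the load at or below one half guarantees that every probe sequence reaches an empty slot, so lookups for absent keys always terminate.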
Constexpr Allocation Tracking (C++20)
C++20 introduced constexpr dynamic memory allocation (new/delete in constexpr contexts). The interpreter tracks these through a global allocation chain:
| Global | Purpose |
|---|---|
| qword_126FBC0 | Free list head |
| qword_126FBB8 | Outstanding allocation count |
When std::allocator<T>::allocate() is called during constexpr (sub_62B100), the interpreter allocates from its arena, sets bit 2 in the object's flag byte, and links the allocation into the chain. std::allocator<T>::deallocate() (sub_62B470) validates that the freed pointer was actually allocated by std::allocator::allocate() and unlinks it. At the end of constexpr evaluation, any remaining allocations indicate a bug in the evaluated code (memory leaked during constant evaluation).
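The bookkeeping described here (link on allocate, validate and unlink on deallocate, leak check at the end) can be sketched independently of the arena. The sketch assumes a singly linked list of tracking nodes allocated with `malloc`; the real layout of the chain is not documented, and the names `alloc_tracker`, `tracker_record`, `tracker_release`, and `tracker_has_leaks` are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct tracked_alloc {
    struct tracked_alloc *next;
    void *object;                 /* address handed out by allocate() */
} tracked_alloc;

typedef struct {
    tracked_alloc *live;          /* chain of outstanding allocations */
    size_t outstanding;           /* mirrors the outstanding count */
} alloc_tracker;

/* Called on allocate(): remember the object. Returns 0 on out-of-memory. */
static int tracker_record(alloc_tracker *t, void *object) {
    tracked_alloc *n = malloc(sizeof *n);
    if (!n) return 0;
    n->object = object;
    n->next = t->live;
    t->live = n;
    t->outstanding++;
    return 1;
}

/* Called on deallocate(): returns 0 when the pointer was never recorded
   (or was already released), which the evaluator would treat as an error. */
static int tracker_release(alloc_tracker *t, void *object) {
    for (tracked_alloc **link = &t->live; *link; link = &(*link)->next) {
        if ((*link)->object == object) {
            tracked_alloc *dead = *link;
            *link = dead->next;
            free(dead);
            t->outstanding--;
            return 1;
        }
    }
    return 0;
}

/* End-of-evaluation check: anything still live was leaked. */
static int tracker_has_leaks(const alloc_tracker *t) {
    return t->outstanding != 0;
}
```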
The Main Evaluator: do_constexpr_expression
sub_634740 is the heart of the interpreter. It takes four parameters:
// Returns 1 on success, 0 on failure
int do_constexpr_expression(
interp_state *a1, // interpreter state
expr_node *a2, // AST expression node to evaluate
value_buf *a3, // output value buffer (32 bytes)
address_t *a4 // "home" address for reference tracking
);
The function body is organized as a nested switch statement. The outer switch dispatches on the expression category at *(a2+24), and several cases contain inner switches for further dispatch.
Outer Switch: Expression Categories
int do_constexpr_expression(interp_state *a1, expr_node *a2,
value_buf *a3, address_t *a4) {
int category = *(a2 + 24); // expression category code
switch (category) {
case 0: // ---- void expression ----
a3->flags = 0x40; // mark as void
return 1;
case 1: // ---- operator expression ----
return eval_operator(a1, a2, a3, a4); // inner switch on *(a2+40)
case 10: // ---- sub-expression wrapper ----
return do_constexpr_expression(a1, *(a2+40), a3, a4); // recurse
case 11: // ---- typeid expression ----
return do_constexpr_typeid(a1, a2, a3); // inline
case 17: // ---- statement expression (GNU extension) ----
return do_constexpr_statement(a1, *(a2+40), a3); // sub_647850
case 18: // ---- variable lookup ----
return lookup_variable(a1, a2, a3, a4); // hash table at a1+0
case 19: // ---- function / static variable reference ----
return resolve_static_ref(a1, a2, a3);
case 21: // ---- special expressions ----
return eval_special(a1, a2, a3, a4); // inner switch on *(a2+40)
default:
emit_error(2721); // "expression is not a constant expression"
return 0;
}
}
Inner Switch: Operator Codes (case 1)
The operator expression case dispatches on the operator code at *(a2+40). This is the largest sub-switch, covering 100+ cases:
int eval_operator(interp_state *a1, expr_node *a2,
value_buf *a3, address_t *a4) {
int opcode = *(a2 + 40);
switch (opcode) {
// ---- Assignment ----
case 0: case 1:
// Evaluate RHS, store to LHS address
if (!do_constexpr_expression(a1, rhs, &rval, NULL)) return 0;
if (!do_constexpr_expression(a1, lhs, &lval, NULL)) return 0;
assign_value(lval.address, &rval, type);
*a3 = lval;
return 1;
// ---- Member access (. and ->) ----
case 3: case 4:
// Evaluate base object, compute member offset
if (!do_constexpr_expression(a1, base_expr, &base, NULL)) return 0;
member_offset = compute_member_offset(member_decl, base.type);
a3->address = base.address + member_offset;
return 1;
// ---- Type conversion (cast) ----
case 5:
return eval_conversion(a1, a2, a3, a4); // 20+ type-pair sub-switch
// ---- Parenthesized expression ----
case 9:
return do_constexpr_expression(a1, *(a2+48), a3, a4); // recurse
// ---- Pointer increment/decrement ----
case 22: case 23:
if (!do_constexpr_expression(a1, operand, &val, NULL)) return 0;
// Validate pointer is within array bounds
pos = get_runtime_array_pos(val.address);
if (pos < 0 || pos >= array_size) {
emit_error(2692); // array bounds violation
return 0;
}
val.address += (opcode == 22) ? elem_size : -elem_size;
*a3 = val;
return 1;
// ---- Unary negation / bitwise complement ----
case 26:
if (!do_constexpr_expression(a1, operand, &val, NULL)) return 0;
if (is_integer_type(val.type))
a3->int_val = ~val.int_val; // or -val.int_val
else if (is_float_type(val.type))
a3->float_val = -val.float_val;
return 1;
// ---- Arithmetic binary operators ----
case 40: // addition
case 41: // subtraction
case 42: // multiplication
case 43: // division
case 44: // modulo
case 45: // (additional arithmetic)
if (!do_constexpr_expression(a1, lhs, &left, NULL)) return 0;
if (!do_constexpr_expression(a1, rhs, &right, NULL)) return 0;
if (opcode == 43 && right.int_val == 0) {
emit_error(61); // division by zero = UB
return 0;
}
result = apply_arithmetic(opcode, left, right, type);
if (check_overflow(result, type)) {
emit_error(2708); // arithmetic overflow
return 0;
}
*a3 = result;
return 1;
// ---- Pointer arithmetic ----
case 46: // pointer + integer
case 47: // integer + pointer
case 48: // pointer - integer
if (!do_constexpr_expression(a1, ptr_expr, &ptr, NULL)) return 0;
if (!do_constexpr_expression(a1, int_expr, &idx, NULL)) return 0;
// Validate result stays within allocation bounds
new_pos = get_runtime_array_pos(ptr.address) + idx.int_val;
if (new_pos < 0 || new_pos > array_size) { // past-the-end is valid
emit_error(2735); // pointer arithmetic underflow/overflow
return 0;
}
a3->address = ptr.address + idx.int_val * elem_size;
return 1;
// ---- Comparison operators ----
case 49: case 50: case 51:
if (!do_constexpr_expression(a1, lhs, &left, NULL)) return 0;
if (!do_constexpr_expression(a1, rhs, &right, NULL)) return 0;
// Pointer comparison: validate same complete object
if (is_pointer(left.type) && !same_complete_object(left, right)) {
emit_error(2734); // invalid pointer comparison
return 0;
}
a3->int_val = apply_comparison(opcode, left, right);
return 1;
// ---- Compound assignment (+=, -=, etc.) ----
case 74: case 75:
// Evaluate LHS address, compute new value, store back
...
// ---- Shift operators ----
case 80: case 81: case 82: case 83:
// Left shift, right shift (arithmetic and logical)
...
// ---- Array subscript ----
case 87: case 88: case 89: case 90: case 91:
if (!do_constexpr_expression(a1, base, &arr, NULL)) return 0;
if (!do_constexpr_expression(a1, index, &idx, NULL)) return 0;
if (idx.int_val < 0 || idx.int_val >= array_dimension) {
emit_error(2692); // array bounds violation
return 0;
}
a3->address = arr.address + idx.int_val * elem_size;
return 1;
// ---- Pointer-to-member dereference (.* and ->*) ----
case 92: case 93:
...
// ---- sizeof ----
case 94:
a3->int_val = compute_sizeof(operand_type);
return 1;
// ---- Comma operator ----
case 103:
do_constexpr_expression(a1, lhs, &discard, NULL); // evaluate + discard
return do_constexpr_expression(a1, rhs, a3, a4); // return RHS
default:
emit_error(2721); // not a constant expression
return 0;
}
}
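The two bounds checks in the switch above differ in one detail: pointer arithmetic (error 2735) accepts the one-past-the-end position, while array subscripting (error 2692) does not. A minimal sketch of that distinction follows. The helper names are hypothetical and do not come from the binary, and positions are assumed to be element indices as returned by get_runtime_array_pos.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: may a pointer at element index `pos` be moved by
 * `delta` elements inside an array of `count` elements? Index `count` is
 * the one-past-the-end position, which a pointer may legally hold. */
bool pointer_step_in_bounds(long long pos, long long delta, long long count)
{
    assert(count >= 0 && pos >= 0 && pos <= count);
    /* Compare against the remaining room instead of computing pos + delta,
     * so the check itself cannot overflow. */
    return delta >= -pos && delta <= count - pos;
}

/* Hypothetical helper: may element `index` be read or written?
 * The one-past-the-end position is excluded here. */
bool index_in_bounds(long long index, long long count)
{
    return index >= 0 && index < count;
}
```

The asymmetry mirrors the language rule: forming a past-the-end pointer is valid, accessing through it is not.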
Type Conversion Sub-Switch (operator case 5)
Type conversions are one of the most complex parts of the evaluator. The sub-switch dispatches on source/target type pairs and handles overflow detection:
int eval_conversion(interp_state *a1, expr_node *a2,
value_buf *a3, address_t *a4) {
type_node *src_type = source_type(a2);
type_node *dst_type = target_type(a2);
int src_kind = src_type->kind; // offset +132
int dst_kind = dst_type->kind;
// Evaluate the operand first
value_buf operand;
if (!do_constexpr_expression(a1, *(a2+48), &operand, NULL))
return 0;
// Dispatch on type pair
if (src_kind == 2 && dst_kind == 2) {
// int -> int: check truncation
if (!fits_in_target(operand.int_val, dst_type)) {
emit_error(2707); // integer overflow in conversion
return 0;
}
a3->int_val = truncate_to(operand.int_val, dst_type);
}
else if (src_kind == 2 && (dst_kind == 3 || dst_kind == 4)) {
// int -> float/double
a3->float_val = (double)operand.int_val;
}
else if ((src_kind == 3 || src_kind == 4) && dst_kind == 2) {
// float/double -> int: check overflow
if (operand.float_val > INT_MAX || operand.float_val < INT_MIN) {
emit_error(2728); // floating-point conversion overflow
return 0;
}
a3->int_val = (int64_t)operand.float_val;
}
else if (src_kind == 1 && dst_kind == 2) {
// pointer -> int (reinterpret_cast)
if (!cuda_allows_reinterpret_cast()) { // dword_106C2C0
emit_error(2727); // invalid conversion
return 0;
}
}
else if (src_kind == 6 && dst_kind == 6) {
// class -> class (derived-to-base or base-to-derived)
a3->address = adjust_pointer_for_base(operand.address, src_type, dst_type);
}
else if (src_kind == 19 && dst_kind == 1) {
// nullptr_t -> pointer
a3->address = 0;
a3->flags |= 0; // null pointer
}
// ... 15+ additional type pairs ...
return 1;
}
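The int -> int branch above relies on a range test before truncation. One way such a test could look is sketched below. The function name, the bit-width parameter, and the assumption that the operand is held as a sign-extended int64_t are illustrative and not taken from the binary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for fits_in_target: can `value` be represented in
 * an integer type that is `bits` wide (1..64) with the given signedness? */
bool fits_in_int_type(int64_t value, unsigned bits, bool is_signed)
{
    assert(bits >= 1 && bits <= 64);
    if (is_signed) {
        if (bits == 64)
            return true;                       /* int64_t always fits itself */
        int64_t max = ((int64_t)1 << (bits - 1)) - 1;
        int64_t min = -max - 1;
        return value >= min && value <= max;
    }
    if (value < 0)
        return false;                          /* negative never fits unsigned */
    if (bits >= 63)
        return true;                           /* any non-negative int64_t fits */
    return value <= (((int64_t)1 << bits) - 1);
}
```

For example, 128 does not fit a signed 8-bit target, while 255 fits an unsigned one.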
Variable Lookup (case 18)
When the evaluator encounters a variable reference, it looks up the variable's current value in the interpreter's hash table:
int lookup_variable(interp_state *a1, expr_node *a2,
value_buf *a3, address_t *a4) {
void *var_key = get_variable_entity(a2);
uint64_t *table = a1->hash_table; // offset +0
uint64_t mask = a1->hash_capacity; // offset +8, low 32 bits
// Linear-probing hash lookup
uint32_t idx = hash(var_key) & mask;
while (table[idx * 2] != 0) {
if (table[idx * 2] == var_key) {
// Found: load value from stored address
void *value_addr = table[idx * 2 + 1];
load_value(a3, value_addr, get_type(a2));
return 1;
}
idx = (idx + 1) & mask;
}
// Variable not in scope -> likely a static/global constexpr
return resolve_static_ref(a1, a2, a3);
}
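The lookup above walks a flat array of (key, value) pairs with linear probing. A self-contained sketch of that layout is shown here, assuming a power-of-two slot count and a key of 0 as the empty marker. The hash function, the fixed table size, and all names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_COUNT 8u                  /* must be a power of two */
#define SLOT_MASK  (SLOT_COUNT - 1u)

/* Key at slots[2*i], value at slots[2*i + 1]; key 0 marks an empty slot. */
typedef struct {
    uintptr_t slots[SLOT_COUNT * 2];
    size_t used;
} var_table;

/* Illustrative pointer hash; the hash used by the binary is not shown here. */
static uint32_t hash_key(uintptr_t key)
{
    return (uint32_t)((key >> 3) * 2654435761u);
}

int var_table_put(var_table *t, uintptr_t key, uintptr_t value)
{
    assert(key != 0);
    uint32_t idx = hash_key(key) & SLOT_MASK;
    for (unsigned probes = 0; probes < SLOT_COUNT; probes++) {
        uintptr_t k = t->slots[idx * 2];
        if (k == 0 || k == key) {
            if (k == 0) {
                /* Keep one slot empty so that lookups always terminate. */
                if (t->used + 1 == SLOT_COUNT)
                    return 0;
                t->used++;
            }
            t->slots[idx * 2] = key;
            t->slots[idx * 2 + 1] = value;
            return 1;
        }
        idx = (idx + 1) & SLOT_MASK;   /* linear probe with wrap-around */
    }
    return 0;
}

int var_table_get(const var_table *t, uintptr_t key, uintptr_t *value_out)
{
    uint32_t idx = hash_key(key) & SLOT_MASK;
    while (t->slots[idx * 2] != 0) {
        if (t->slots[idx * 2] == key) {
            *value_out = t->slots[idx * 2 + 1];
            return 1;
        }
        idx = (idx + 1) & SLOT_MASK;
    }
    return 0;                          /* hit an empty slot: key is absent */
}
```

A miss in this sketch corresponds to the fall-through to resolve_static_ref in the pseudocode.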
Function Call Dispatch: do_constexpr_call
sub_657560 (1,445 lines) handles all function call evaluation during constexpr evaluation. It is the central dispatcher that routes each call to the appropriate evaluator based on the callee kind.
int do_constexpr_call(interp_state *a1, expr_node *call_expr,
value_buf *result, address_t *home) {
// 1. Resolve the callee
func_info callee;
if (!eval_constexpr_callee(a1, call_expr, &callee)) // sub_643FE0
return 0;
// 2. Check recursion depth
int depth = count_call_chain(a1->call_chain);
if (depth > MAX_CONSTEXPR_DEPTH) {
emit_error(2701); // constexpr evaluation exceeded depth limit
return 0;
}
// 3. Dispatch by callee kind
if (callee.is_builtin) {
// Route to builtin evaluator
return do_constexpr_builtin_function( // sub_651150
a1, callee.descriptor, args, result, &success);
}
if (callee.is_constructor) {
// Route to constructor evaluator
return do_constexpr_ctor(a1, callee, args, // sub_6480F0
result, home);
}
if (callee.is_destructor) {
// Route to destructor evaluator (two variants)
return do_constexpr_dtor(a1, callee, args, // sub_64EFE0 or sub_64FB10
result);
}
if (callee.is_virtual) {
// Virtual dispatch: resolve through vtable
func_info resolved = resolve_virtual_call(callee, this_obj);
if (!resolved.is_constexpr) {
emit_error(269); // virtual function is not constexpr
return 0;
}
callee = resolved;
}
// 4. Check that function body is available
if (!callee.has_body) {
emit_error(2823); // constexpr function not defined
return 0;
}
// 5. Push call frame
call_frame frame;
frame.prev = a1->call_chain;
a1->call_chain = &frame;
frame.func = callee.entity;
// 6. Bind arguments to parameters
int ok = 0; // stays 0 if an argument fails and control jumps to cleanup
for (int i = 0; i < callee.param_count; i++) {
value_buf arg_val;
if (!do_constexpr_expression(a1, args[i], &arg_val, NULL))
goto cleanup;
bind_parameter(a1, callee.params[i], &arg_val);
}
// 7. Evaluate function body
ok = do_constexpr_statement(a1, callee.body, result); // sub_647850
// 8. Pop call frame, clean up allocations
cleanup:
a1->call_chain = frame.prev;
release_allocation_chain(a1, &frame); // sub_633EC0
return ok;
}
Callee Resolution: eval_constexpr_callee
sub_643FE0 (305 lines) resolves the callee expression of a function call. It handles direct calls, virtual dispatch (vtable lookup), and pointer-to-member-function calls. For virtual calls, it resolves overrides by walking the vtable of the most-derived type of the object being called on.
Recursion Depth Tracking
The interpreter tracks call depth through the call_chain linked list at offset +72 in the interpreter state. Each do_constexpr_call invocation pushes a frame; each return pops it. The chain is also used for diagnostic output -- when a constexpr evaluation fails, the error message includes the call stack showing how the offending expression was reached.
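The push/pop discipline described above can be sketched with frames that live on the host stack and are linked through a prev pointer. The depth limit, the struct layout, and the function names below are assumptions for illustration; the binary's actual limit is not stated here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct call_frame {
    struct call_frame *prev;   /* caller's frame, NULL at the outermost call */
    const char *func_name;     /* kept for call-stack diagnostics */
} call_frame;

typedef struct {
    call_frame *call_chain;    /* innermost active frame */
} interp;

#define DEPTH_LIMIT 4          /* illustrative value only */

int chain_depth(const call_frame *f)
{
    int n = 0;
    for (; f != NULL; f = f->prev)
        n++;
    return n;
}

/* Simulates `remaining` further nested calls. Returns 1 on success and 0
 * when the depth limit would be exceeded. */
int call_nested(interp *st, int remaining)
{
    if (chain_depth(st->call_chain) >= DEPTH_LIMIT)
        return 0;                              /* depth limit reached */

    call_frame frame = { st->call_chain, "f" };
    st->call_chain = &frame;                   /* push */

    int ok = (remaining == 0) ? 1 : call_nested(st, remaining - 1);

    assert(st->call_chain == &frame);          /* callees restored the chain */
    st->call_chain = frame.prev;               /* pop on success and failure */
    return ok;
}
```

Because the pop runs on both paths, the chain is back to its original state after a failed evaluation, which is what allows the same chain to be walked for diagnostics before unwinding.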
Constructor Evaluation: do_constexpr_ctor
sub_6480F0 (1,659 lines) evaluates constructor calls during constexpr evaluation. It implements the full C++ construction sequence:
int do_constexpr_ctor(interp_state *a1, func_info *ctor,
expr_node **args, value_buf *result,
address_t *target_addr) {
class_type *cls = ctor->parent_class;
// 1. Initialize virtual base classes (if most-derived)
for (vbase in cls->virtual_bases) {
address_t vbase_addr = target_addr + vbase.offset;
if (vbase.has_initializer) {
if (!do_constexpr_expression(a1, vbase.init, &val, &vbase_addr))
return 0;
} else {
init_subobject_to_zero(vbase_addr, vbase.type); // sub_62C030
}
}
// 2. Initialize non-virtual base classes
for (base in cls->bases) {
address_t base_addr = target_addr + base.offset;
if (base.has_ctor_call) {
if (!do_constexpr_ctor(a1, base.ctor, base.args,
&val, &base_addr))
return 0;
}
}
// 3. Initialize data members (in declaration order)
for (member in cls->members) {
address_t mem_addr = target_addr + member.offset;
if (member.has_mem_initializer) {
// From constructor's member-initializer-list
if (!do_constexpr_expression(a1, member.init, &val, &mem_addr))
return 0;
} else if (member.has_default_initializer) {
// From in-class default member initializer
if (!do_constexpr_expression(a1, member.default_init,
&val, &mem_addr))
return 0;
} else {
// Default-initialize (zero for trivial types)
init_subobject_to_zero(mem_addr, member.type);
}
}
// 4. Execute constructor body (if non-trivial)
if (ctor->has_body) {
if (!do_constexpr_statement(a1, ctor->body, result))
return 0;
}
// 5. Handle delegating constructors
if (ctor->is_delegating) {
return do_constexpr_ctor(a1, ctor->delegate_target,
args, result, target_addr);
}
// 6. For trivial copy/move, use memcpy optimization
if (ctor->is_trivial_copy) {
copy_interpreter_subobject(target_addr, source_addr, cls);
return 1; // sub_6337D0
}
return 1;
}
Loop Evaluation: do_constexpr_range_based_for_statement
sub_644580 (2,836 lines) evaluates all loop constructs during constexpr evaluation: for, while, do-while, and range-based for. Despite its name, it is not limited to range-based for, and it is self-recursive for nested loops.
int do_constexpr_range_based_for_statement(
interp_state *a1, stmt_node *loop, value_buf *result) {
// --- Range-based for ---
if (loop->kind == RANGE_FOR) {
// 1. Evaluate range expression: auto&& __range = <expr>
value_buf range_val;
if (!do_constexpr_expression(a1, loop->range_expr, &range_val, NULL))
return 0;
// 2. Evaluate begin() and end()
value_buf begin_val, end_val;
if (!do_constexpr_call(a1, loop->begin_call, &begin_val, NULL))
return 0;
if (!do_constexpr_call(a1, loop->end_call, &end_val, NULL))
return 0;
// 3. Loop: while (begin != end)
while (true) {
// Evaluate condition: begin != end
value_buf cond;
if (!do_constexpr_expression(a1, loop->condition, &cond, NULL))
return 0;
if (!cond.int_val)
break; // loop finished
// Bind loop variable: auto x = *begin
value_buf elem;
if (!do_constexpr_expression(a1, loop->deref_expr, &elem, NULL))
return 0;
bind_parameter(a1, loop->loop_var, &elem);
// Execute loop body
int body_result = do_constexpr_statement( // via loop body evaluator sub_6593C0
a1, loop->body, result);
if (body_result == BREAK) break;
if (body_result == RETURN) return body_result;
// CONTINUE falls through to increment
// Increment iterator: ++begin
if (!do_constexpr_expression(a1, loop->increment, &begin_val, NULL))
return 0;
// Destroy loop variable for this iteration
cleanup_iteration(a1, loop->loop_var); // sub_658CE0
}
return 1;
}
// --- Traditional for/while/do-while ---
if (loop->kind == FOR_LOOP) {
// Initialize
if (loop->init_stmt)
do_constexpr_statement(a1, loop->init_stmt, NULL);
while (true) {
// Condition
if (loop->condition) {
value_buf cond;
if (!do_constexpr_expression(a1, loop->condition, &cond, NULL))
return 0; // condition is not a constant expression
if (!cond.int_val) break;
}
// Body
int r = do_constexpr_statement(a1, loop->body, result);
if (r == BREAK) break;
if (r == RETURN) return r;
// Increment
if (loop->increment &&
!do_constexpr_expression(a1, loop->increment, NULL, NULL))
return 0;
}
}
return 1;
}
The loop body evaluation is delegated to sub_6593C0 (816 lines), which handles per-iteration variable binding, break/continue/return propagation, and destruction of loop-scoped temporaries.
Statement Evaluation: do_constexpr_statement
sub_647850 (509 lines) evaluates compound statements, declarations, branches, and switch statements during constexpr evaluation:
int do_constexpr_statement(interp_state *a1, stmt_node *stmt,
value_buf *result) {
switch (stmt->kind) {
case COMPOUND:
// Push scope, evaluate each sub-statement, pop scope
a1->scope_depth++;
for (s in stmt->children) {
int r = do_constexpr_statement(a1, s, result);
if (r == RETURN || r == BREAK || r == CONTINUE)
{ a1->scope_depth--; return r; }
}
a1->scope_depth--;
return OK;
case DECLARATION:
// Allocate interpreter storage, evaluate initializer
return do_constexpr_init_variable(a1, stmt->decl, result);
case IF_STMT: {
value_buf cond;
do_constexpr_expression(a1, stmt->condition, &cond, NULL);
if (cond.int_val)
return do_constexpr_statement(a1, stmt->then_branch, result);
else if (stmt->else_branch)
return do_constexpr_statement(a1, stmt->else_branch, result);
return OK;
}
case SWITCH_STMT: {
value_buf switch_val;
do_constexpr_expression(a1, stmt->condition, &switch_val, NULL);
// Find matching case label
case_label = find_case(stmt->cases, switch_val.int_val);
return do_constexpr_statement(a1, case_label->body, result);
}
case RETURN_STMT:
if (stmt->return_expr)
do_constexpr_expression(a1, stmt->return_expr, result, NULL);
return RETURN;
case FOR_STMT: case WHILE_STMT: case DO_STMT: case RANGE_FOR:
return do_constexpr_range_based_for_statement( // sub_644580
a1, stmt, result);
case BREAK_STMT: return BREAK;
case CONTINUE_STMT: return CONTINUE;
case TRY_STMT:
// try/catch in constexpr (C++26 direction, partially supported)
...
}
}
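The statement and loop evaluators communicate through status codes: a compound statement stops at the first non-OK status and hands it upward, and a loop consumes BREAK and CONTINUE while letting RETURN pass through. The toy interpreter below sketches that propagation. The statement kinds, the fixed trip count, and the counter side effect are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ST_OK, ST_BREAK, ST_CONTINUE, ST_RETURN } status;
typedef enum { K_INCREMENT, K_BREAK, K_CONTINUE, K_RETURN,
               K_COMPOUND, K_LOOP } stmt_kind;

typedef struct stmt {
    stmt_kind kind;
    const struct stmt *const *children; /* a loop's body is children[0] */
    size_t child_count;
    int iterations;                     /* K_LOOP only: fixed trip count */
} stmt;

status run(const stmt *s, int *counter)
{
    switch (s->kind) {
    case K_INCREMENT: (*counter)++; return ST_OK;
    case K_BREAK:     return ST_BREAK;
    case K_CONTINUE:  return ST_CONTINUE;
    case K_RETURN:    return ST_RETURN;
    case K_COMPOUND:
        for (size_t i = 0; i < s->child_count; i++) {
            status r = run(s->children[i], counter);
            if (r != ST_OK)
                return r;               /* unwind to the enclosing loop/call */
        }
        return ST_OK;
    case K_LOOP:
        assert(s->child_count == 1);
        for (int i = 0; i < s->iterations; i++) {
            status r = run(s->children[0], counter);
            if (r == ST_BREAK)  break;             /* consumed by this loop */
            if (r == ST_RETURN) return ST_RETURN;  /* keeps propagating */
            /* ST_CONTINUE and ST_OK both move on to the next iteration */
        }
        return ST_OK;
    }
    return ST_OK;
}

/* Runs `repeat 5 times { increment; <middle>; increment; }`. */
status demo_loop(stmt_kind middle, int *counter)
{
    const stmt inc = { K_INCREMENT, NULL, 0, 0 };
    const stmt mid = { middle, NULL, 0, 0 };
    const stmt *body_items[] = { &inc, &mid, &inc };
    const stmt body = { K_COMPOUND, body_items, 3, 0 };
    const stmt *loop_items[] = { &body };
    const stmt loop = { K_LOOP, loop_items, 1, 5 };
    *counter = 0;
    return run(&loop, counter);
}
```

With a break in the middle the counter ends at 1, with a continue it ends at 5, and with a return the loop itself reports ST_RETURN after a single increment.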
Builtin Function Evaluation: do_constexpr_builtin_function
sub_651150 (5,032 lines) evaluates compiler intrinsics and __builtin_* functions at compile time. It dispatches on the builtin function ID (a 16-bit value at *(a2+168)), using a sparse comparison tree rather than a dense switch table.
int do_constexpr_builtin_function(
interp_state *a1,
func_desc *a2, // function descriptor
value_buf **a3, // argument array
value_buf *a4, // result buffer
int *a5) { // success/failure output
uint16_t builtin_id = *(a2 + 168);
// --- Arithmetic overflow detection ---
// __builtin_add_overflow, __builtin_sub_overflow, __builtin_mul_overflow
if (builtin_id == BUILTIN_ADD_OVERFLOW) {
int64_t a = a3[0]->int_val, b = a3[1]->int_val;
bool overflow;
int64_t result = checked_add(a, b, &overflow);
a3[2]->int_val = result; // write to output parameter
a4->int_val = overflow ? 1 : 0;
return 1;
}
// --- Bit manipulation ---
// __builtin_clz, __builtin_ctz, __builtin_popcount, __builtin_parity
if (builtin_id == BUILTIN_CLZ) {
uint64_t val = a3[0]->int_val;
if (val == 0) { emit_error(61); return 0; } // UB: clz(0)
a4->int_val = __builtin_clzll(val); // 64-bit count; narrower variants must subtract (64 - operand width)
return 1;
}
if (builtin_id == BUILTIN_POPCOUNT) {
a4->int_val = __builtin_popcountll(a3[0]->int_val);
return 1;
}
if (builtin_id == BUILTIN_BSWAP32) {
a4->int_val = __builtin_bswap32((uint32_t)a3[0]->int_val);
return 1;
}
// --- String operations ---
// __builtin_strlen, __builtin_strcmp, __builtin_memcmp,
// __builtin_strchr, __builtin_memchr
if (builtin_id == BUILTIN_STRLEN) {
char *str = get_interpreter_string(a1, a3[0]);
a4->int_val = strlen(str);
return 1;
}
if (builtin_id == BUILTIN_STRCMP) {
char *s1 = get_interpreter_string(a1, a3[0]);
char *s2 = get_interpreter_string(a1, a3[1]);
a4->int_val = strcmp(s1, s2);
return 1;
}
// --- Floating-point classification ---
// __builtin_isnan, __builtin_isinf, __builtin_isfinite,
// __builtin_fpclassify, __builtin_huge_val, __builtin_nan
if (builtin_id == BUILTIN_ISNAN) {
a4->int_val = isnan(a3[0]->float_val) ? 1 : 0;
return 1;
}
if (builtin_id == BUILTIN_NAN) {
char *tag = get_interpreter_string(a1, a3[0]);
a4->float_val = nan(tag);
return 1;
}
// --- C++20/23 bit operations ---
// std::bit_cast (via __builtin_bit_cast)
if (builtin_id == BUILTIN_BIT_CAST) {
// Serialize source object to target-format bytes
translate_interpreter_object_to_target_bytes( // sub_62A490
a1, a3[0], byte_buffer);
// Deserialize into destination type
translate_target_bytes_to_interpreter_object( // sub_62C670
a1, byte_buffer, a4, dst_type);
return 1;
}
// --- Type traits ---
// __is_constant_evaluated()
if (builtin_id == BUILTIN_IS_CONSTANT_EVALUATED) {
a4->int_val = 1; // always true inside constexpr evaluator
return 1;
}
// --- Memory operations ---
// __builtin_memcpy, __builtin_memmove
if (builtin_id == BUILTIN_MEMCPY) {
// Copy N bytes between interpreter objects
copy_interpreter_bytes(a3[0]->address, a3[1]->address,
a3[2]->int_val);
*a4 = *a3[0]; // return dest pointer
return 1;
}
// ... 50+ additional builtin categories ...
emit_error(2721); // builtin not evaluable at compile time
return 0;
}
Builtin Categories Summary
| Category | Examples | Count |
|---|---|---|
| Arithmetic overflow | __builtin_add_overflow, __builtin_mul_overflow | 3 |
| Bit manipulation | __builtin_clz, __builtin_ctz, __builtin_popcount, __builtin_bswap | 8+ |
| String operations | __builtin_strlen, __builtin_strcmp, __builtin_memcmp, __builtin_strchr | 6+ |
| Math/FP classify | __builtin_isnan, __builtin_isinf, __builtin_huge_val, __builtin_nan | 8+ |
| Type queries | __is_constant_evaluated, __has_unique_object_representations | 4+ |
| Memory operations | __builtin_memcpy, __builtin_memmove | 3+ |
| C++20/23 <bit> | std::bit_cast, std::bit_ceil, std::bit_floor, std::countl_zero | 8+ |
| Atomic (limited) | Constexpr-evaluable atomic subset | 2+ |
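Two of the categories above lend themselves to a short portable sketch: overflow-checked addition and a width-aware leading-zero count. Both functions below are illustrative stand-ins with hypothetical names. They are not the helpers used by the binary, and the -1 return for a zero operand is a convention of this sketch where the evaluator would report an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Adds a and b, reporting signed overflow through *overflow. On overflow
 * the wrapped two's-complement result is returned. The conversion back to
 * int64_t is implementation-defined before C23 but wraps on common
 * compilers. */
int64_t checked_add_i64(int64_t a, int64_t b, bool *overflow)
{
    if ((b > 0 && a > INT64_MAX - b) || (b < 0 && a < INT64_MIN - b)) {
        *overflow = true;
        return (int64_t)((uint64_t)a + (uint64_t)b);
    }
    *overflow = false;
    return a + b;
}

/* Counts leading zero bits of `value` viewed as a `bits`-wide unsigned
 * integer (1..64). Returns -1 for zero, where __builtin_clz is undefined. */
int count_leading_zeros(uint64_t value, unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    if (bits < 64)
        value &= (((uint64_t)1 << bits) - 1);  /* ignore bits above the width */
    if (value == 0)
        return -1;
    int n = 0;
    for (uint64_t probe = (uint64_t)1 << (bits - 1);
         (value & probe) == 0; probe >>= 1)
        n++;
    return n;
}
```

The width parameter matters: the value 1 has 31 leading zeros as a 32-bit operand but 63 as a 64-bit one.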
Destructor Evaluation
Two functions handle constexpr destructor calls, splitting responsibilities:
do_constexpr_dtor variant 1 (sub_64EFE0, 503 lines) -- Evaluates the destructor body itself. Runs the user-written destructor code, then destroys members in reverse declaration order.
do_constexpr_dtor variant 2 / perform_destructions (sub_64FB10, 877 lines) -- Handles the full destruction sequence including base class destructors and array element destruction. Also implements perform_destructions, the post-evaluation cleanup that destroys all constexpr-created objects when their scope ends.
Materialization: Interpreter Objects to IL Constants
After constexpr evaluation completes, the interpreter's internal objects must be converted back into IL constant nodes that the rest of the compiler can consume.
copy_interpreter_object_to_constant
sub_631110 (1,444 lines) traverses the interpreter's memory representation of an object and builds the corresponding IL constant tree:
il_node *copy_interpreter_object_to_constant(
interp_state *a1, address_t obj_addr, type_node *type) {
int kind = type->kind;
switch (kind) {
case 2: case 13: // integer, enum
return make_integer_constant(load_int(obj_addr), type);
case 3: case 4: // float, double
return make_float_constant(load_float(obj_addr), type);
case 1: // pointer
if (is_null_pointer(obj_addr))
return make_null_pointer_constant(type);
// Non-null: build address expression with relocation
return make_address_constant(
translate_interpreter_offset(obj_addr), // inline helper
type);
case 6: case 9: case 10: case 11: { // class/struct/union
il_node *result = make_aggregate_constant(type);
// Recursively convert each member
for (member in get_members(type)) {
address_t mem_addr = obj_addr + member.offset;
il_node *mem_val = copy_interpreter_object_to_constant(
a1, mem_addr, member.type);
add_member_to_aggregate(result, mem_val);
}
return result;
}
case 8: { // array
il_node *result = make_array_constant(type);
for (int i = 0; i < array_dimension(type); i++) {
address_t elem_addr = obj_addr + i * elem_size;
il_node *elem = copy_interpreter_object_to_constant(
a1, elem_addr, elem_type);
add_element_to_array(result, elem);
}
return result;
}
}
}
This function also contains get_reflection_string_entry and translate_interpreter_offset as inlined helpers -- the former handles C++26 reflection string extraction, and the latter converts interpreter memory addresses into IL address expressions with proper relocations.
extract_value_from_constant (reverse direction)
sub_64B580 (2,299 lines) performs the inverse: given an IL constant node (from a previously evaluated constexpr), it extracts the value into the interpreter's internal representation. This is used when a constexpr function references another constexpr variable whose value was already computed.
__builtin_bit_cast Support
Two functions implement the byte-level serialization needed for std::bit_cast:
translate_interpreter_object_to_target_bytes (sub_62A490, 461 lines) -- Serializes an interpreter object to a target-format byte sequence. Must handle endianness conversion, padding bytes, and bitfield layout according to the target ABI.
translate_target_bytes_to_interpreter_object (sub_62C670, 529 lines) -- Deserializes target-format bytes back into an interpreter object. Validates that the source bytes represent a valid value for the destination type (e.g., no trap representations for bool).
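The serialize-then-deserialize round trip can be illustrated with a small sketch that writes values in an explicit target byte order and validates a bool on the way back. A little-endian target and IEEE 754 single-precision floats are assumed, and every function name here is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes the low `size` bytes (1..8) of `value` in little-endian order. */
void store_le(uint64_t value, unsigned char *out, size_t size)
{
    assert(size >= 1 && size <= 8);
    for (size_t i = 0; i < size; i++)
        out[i] = (unsigned char)(value >> (8 * i));
}

/* Reads `size` little-endian bytes (1..8) back into an integer. */
uint64_t load_le(const unsigned char *in, size_t size)
{
    assert(size >= 1 && size <= 8);
    uint64_t v = 0;
    for (size_t i = 0; i < size; i++)
        v |= (uint64_t)in[i] << (8 * i);
    return v;
}

/* float -> uint32_t reinterpretation routed through a target byte buffer. */
uint32_t float_bits_via_bytes(float f)
{
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t host_bits;
    memcpy(&host_bits, &f, sizeof host_bits);    /* host object representation */
    unsigned char target[4];
    store_le(host_bits, target, sizeof target);  /* serialize in target order */
    return (uint32_t)load_le(target, sizeof target);
}

/* Deserializes a bool, rejecting bytes other than 0 and 1. */
bool load_bool(const unsigned char *in, bool *out)
{
    if (in[0] > 1)
        return false;                            /* not a valid bool value */
    *out = (in[0] == 1);
    return true;
}
```

Routing the value through an explicit byte buffer keeps the result independent of the host's own byte order, which is the point of having a separate target-format representation.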
C++20 Constexpr Memory Support
std::allocator::allocate
sub_62B100 (do_constexpr_std_allocator_allocate, 177 lines) -- Handles new expressions in constexpr context. Allocates from the interpreter arena, sets the allocation-chain flag (bit 2), and links the allocation into the tracking chain.
std::allocator::deallocate
sub_62B470 (do_constexpr_std_allocator_deallocate, 195 lines) -- Handles delete in constexpr context. Validates the pointer was allocated by std::allocator::allocate() by searching the allocation chain (qword_126FBC0 / qword_126FBB8).
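The allocate/deallocate pairing amounts to a tracked allocation chain: every allocation is linked in, and a deallocation succeeds only if its pointer is found in the chain. The sketch below uses host malloc and a singly linked list for illustration. The record layout and function names are assumptions, and the interpreter's arena and flag bits are not modeled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct alloc_record {
    struct alloc_record *next;
    void *storage;
    size_t bytes;
} alloc_record;

typedef struct {
    alloc_record *head;        /* most recent allocation first */
} alloc_chain;

/* Allocates `bytes` and links the allocation into the chain. */
void *chain_allocate(alloc_chain *c, size_t bytes)
{
    alloc_record *r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->storage = malloc(bytes ? bytes : 1);
    if (r->storage == NULL) {
        free(r);
        return NULL;
    }
    r->bytes = bytes;
    r->next = c->head;
    c->head = r;
    return r->storage;
}

/* Releases `p` only if it is tracked by the chain. Returns false for
 * pointers that were never allocated here or are no longer tracked. */
bool chain_deallocate(alloc_chain *c, void *p)
{
    for (alloc_record **link = &c->head; *link != NULL; link = &(*link)->next) {
        if ((*link)->storage == p) {
            alloc_record *r = *link;
            *link = r->next;               /* unlink before freeing */
            free(r->storage);
            free(r);
            return true;
        }
    }
    return false;
}

/* Number of allocations not yet released. */
size_t chain_outstanding(const alloc_chain *c)
{
    size_t n = 0;
    for (const alloc_record *r = c->head; r != NULL; r = r->next)
        n++;
    return n;
}
```

A non-zero outstanding count at the end of an evaluation corresponds to the C++20 rule that storage allocated during constant evaluation must also be released within it.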
std::construct_at
sub_64F920 (do_constexpr_std_construct_at, 108 lines) -- Handles std::construct_at() (C++20). Validates the target pointer, then delegates to do_constexpr_ctor for actual construction.
C++26 Reflection Support
EDG 6.6 includes experimental support for the P2996 compile-time reflection proposal. Eight dedicated functions implement std::meta::* operations:
| Function | Address | Lines | Reflection operation |
|---|---|---|---|
| do_constexpr_std_meta_substitute | sub_628510 | 526 | std::meta::substitute() -- template argument substitution |
| do_constexpr_std_meta_enumerators_of | sub_62EB00 | 342 | std::meta::enumerators_of() -- enum value list |
| do_constexpr_std_meta_subobjects_of | sub_62F0B0 | 434 | std::meta::subobjects_of() -- all subobjects |
| do_constexpr_std_meta_bases_of | sub_62F7B0 | 339 | std::meta::bases_of() -- base class list |
| do_constexpr_std_meta_nonstatic_data_members_of | sub_62FD30 | 308 | std::meta::nonstatic_data_members_of() |
| do_constexpr_std_meta_static_data_members_of | sub_630280 | 308 | std::meta::static_data_members_of() |
| do_constexpr_std_meta_members_of | sub_6307E0 | 590 | std::meta::members_of() -- all members |
| do_constexpr_std_meta_define_class | sub_65DE10 | 553 | std::meta::define_class() -- class synthesis |
These functions operate on "infovecs" -- information vectors created by make_infovec (sub_62E1B0, 241 lines) that encode reflection metadata as interpreter-internal objects. The get_interpreter_string and get_interpreter_string_length helpers (also within sub_65DE10) extract string values from these infovecs for operations that take string parameters (member names, type names).
The define_class operation is particularly notable: it allows constexpr code to synthesize entirely new class types at compile time, a capability that goes beyond simple introspection.
CUDA-Specific Constexpr Behavior
The interpreter checks several global flags to relax standard constexpr restrictions for CUDA device code:
| Global | Purpose |
|---|---|
| dword_106C2C0 | Controls reinterpret_cast semantics in device constexpr |
| dword_106C1D8 | Controls pointer dereference behavior (likely --expt-relaxed-constexpr) |
| dword_106C1E0 | Controls typeid availability in device constexpr |
| dword_126EFAC | CUDA mode flag (enables/disables constexpr relaxations) |
| dword_126EFA4 | Secondary CUDA mode flag (combined with EFAC for fine control) |
Standard C++ forbids reinterpret_cast, typeid, and certain pointer operations in constexpr contexts. CUDA relaxes these restrictions because GPU programming patterns frequently require type punning and address manipulation that the standard deems non-constant. When these flags are set, the interpreter suppresses the corresponding error codes and evaluates the expression as if it were permitted.
Language Version Gates
| Global | Check | Meaning |
|---|---|---|
| qword_126EF98 | > 0x222DF (139,999) | C++20 features enabled (standard 202002) |
| qword_126EF98 | > 0x15F8F (89,999) | C++14 features enabled (standard 201402) |
| dword_126EFB4 | == 2 | Full C++20+ compilation mode |
| dword_126EF68 | >= 202001 | C++20 constexpr dynamic allocation enabled |
These version checks gate features like constexpr new/delete (C++20), constexpr dynamic_cast and typeid (C++20), and constexpr virtual dispatch (C++20).
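The hexadecimal thresholds in the table can be checked mechanically. The sketch below only verifies the hex-to-decimal arithmetic and the shape of the strict greater-than comparison. The enum and function names are invented for the example, and it makes no claim about what the compared global actually encodes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Thresholds exactly as they appear in the comparisons. */
enum {
    GATE_HIGH = 0x222DF,   /* 139,999 in decimal */
    GATE_LOW  = 0x15F8F    /*  89,999 in decimal */
};

/* A gate of the form `value > threshold`. For integers this is the same
 * as `value >= threshold + 1`, so the effective cutoffs are 140,000 and
 * 90,000. */
bool passes_gate(int64_t value, int64_t threshold)
{
    return value > threshold;
}
```

Compilers commonly emit `> N - 1` for a source-level `>= N`, which is a plausible reason the constants end in 999.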
Error Codes
The interpreter emits detailed diagnostics when constexpr evaluation fails. Each error code identifies a specific category of failure:
| Error | Meaning |
|---|---|
| 61 | Undefined behavior detected (division by zero, clz(0), etc.) |
| 269 | Virtual function called is not constexpr |
| 286 | Pure virtual function called |
| 2691 | Invalid pointer comparison direction |
| 2692 | Array bounds violation |
| 2700 | Access to uninitialized object |
| 2701 | Constexpr evaluation exceeded depth limit |
| 2707 | Integer overflow in type conversion |
| 2708 | Arithmetic overflow in computation |
| 2721 | Expression is not a constant expression |
| 2725 | Type too large for constexpr evaluation (> 64MB) |
| 2727 | Invalid type conversion in constexpr |
| 2728 | Floating-point conversion overflow |
| 2734 | Invalid pointer comparison (different complete objects) |
| 2735 | Pointer arithmetic out of bounds |
| 2751 | Null pointer dereference |
| 2760 | Pointer-to-member dereference failure |
| 2766 | Null pointer arithmetic |
| 2808 | Class too large for constexpr representation |
| 2823 | Constexpr function body not available |
| 2879 | offsetof on invalid member |
| 2921 | Direct value return failure |
| 2938 | Virtual base class offset not found |
| 2955 | Statement expression evaluation failure |
| 2993 | Object lifetime violation |
| 2999 | Variable-length array in constexpr |
| 3007 | Pointer-to-member comparison failure |
| 3024 | Dynamic initialization order issue |
| 3248 | Member access on uninitialized object |
| 3312 | Object representation mismatch (bit_cast) |
Supporting Functions
Value Management
| Function | Address | Lines | Purpose |
|---|---|---|---|
| f_value_bytes_for_type | sub_628DE0 | 843 | Compute interpreter storage size for a type |
| init_subobject_to_zero | sub_62C030 | 284 | Zero-initialize a constexpr subobject |
| mark_mutable_members_not_initialized | sub_62D0F0 | 203 | Mark mutable members after copy |
| Copy scalar value | sub_62B8A0 | 61 | Assign scalar value to interpreter object |
| Load value | sub_64EA30 | 293 | Load value from interpreter object into buffer |
| Check initialized | sub_62BF60 | 55 | Validate interpreter object is initialized |
Object Addressing
| Function | Address | Lines | Purpose |
|---|---|---|---|
| find_subobject_for_interpreter_address | sub_629D30 | 334 | Map address to subobject identity |
| obj_type_at_address | sub_62A210 | 133 | Most-derived type at an address |
| get_runtime_array_pos | sub_6341C0 | 224 | Array element index for a pointer |
| last_subobject_path_link | sub_6345D0 | 21 | Tail of subobject path chain |
| get_trailing_subobject_path_entry | sub_634630 | 82 | Trailing subobject for virtual bases |
| Copy subobject | sub_6337D0 | 379 | Copy subobject between interpreter addresses |
| Validate subobject path | sub_62B980 | 314 | Recursive validation of class hierarchy traversal |
Condition and Allocation
| Function | Address | Lines | Purpose |
|---|---|---|---|
| do_constexpr_condition | sub_658EE0 | 302 | Evaluate if/while/for condition |
| do_constexpr_condition_alloc | sub_62D810 | 187 | Allocate storage for condition result |
| do_constexpr_init_variable | sub_6509E0 | 427 | Initialize local variable in constexpr |
| Allocate value slot | sub_62D4F0 | 183 | Allocate and init a value slot in arena |
| Release allocation chain | sub_633EC0 | 157 | Free tracked constexpr allocations |
Dynamic Initialization and Lambdas
| Function | Address | Lines | Purpose |
|---|---|---|---|
| do_constexpr_dynamic_init | sub_64A040 | 1,111 | Dynamic initialization of constexpr variables |
| do_constexpr_lambda | (within sub_64A040) | -- | Lambda capture evaluation |
| do_array_constructor_copy | (within sub_64A040) | -- | Array construction via copy ctor |
Debug and Diagnostics
| Function | Address | Lines | Purpose |
|---|---|---|---|
| Format constexpr value | sub_632E80 | 268 | Format value for error messages |
| Dump constexpr value | sub_6333E0 | 166 | fprintf-based debug dump |
Complete Function Map
| Address | Lines | Identity | Confidence |
|---|---|---|---|
| sub_628180 | 237 | Init/entry wrapper | MEDIUM |
| sub_628510 | 526 | do_constexpr_std_meta_substitute | HIGH (95%) |
| sub_628DE0 | 843 | f_value_bytes_for_type | VERY HIGH (99%) |
| sub_629D30 | 334 | find_subobject_for_interpreter_address | VERY HIGH (99%) |
| sub_62A210 | 133 | obj_type_at_address | VERY HIGH (99%) |
| sub_62A490 | 461 | translate_interpreter_object_to_target_bytes | VERY HIGH (99%) |
| sub_62AD90 | 194 | Allocate interpreter value storage | HIGH (85%) |
| sub_62B100 | 177 | do_constexpr_std_allocator_allocate | VERY HIGH (99%) |
| sub_62B470 | 195 | do_constexpr_std_allocator_deallocate | VERY HIGH (99%) |
| sub_62B8A0 | 61 | Copy scalar value | HIGH (85%) |
| sub_62B980 | 314 | Validate/traverse subobject path | HIGH (80%) |
| sub_62BF60 | 55 | Validate initialization state | HIGH (85%) |
| sub_62C030 | 284 | init_subobject_to_zero | VERY HIGH (99%) |
| sub_62C670 | 529 | translate_target_bytes_to_interpreter_object | VERY HIGH (99%) |
| sub_62D0F0 | 203 | mark_mutable_members_not_initialized | VERY HIGH (99%) |
| sub_62D4F0 | 183 | Allocate constexpr value slot | HIGH (80%) |
| sub_62D810 | 187 | do_constexpr_condition_alloc | VERY HIGH (99%) |
| sub_62DB00 | 132 | Get value type size (wrapper) | HIGH (80%) |
| sub_62DD10 | 242 | Builtin dispatch helper | MEDIUM (70%) |
| sub_62E1B0 | 241 | make_infovec | VERY HIGH (99%) |
| sub_62E670 | 276 | Init/entry wrapper | MEDIUM (60%) |
| sub_62EB00 | 342 | do_constexpr_std_meta_enumerators_of | VERY HIGH (99%) |
| sub_62F0B0 | 434 | do_constexpr_std_meta_subobjects_of | VERY HIGH (99%) |
| sub_62F7B0 | 339 | do_constexpr_std_meta_bases_of | VERY HIGH (99%) |
| sub_62FD30 | 308 | do_constexpr_std_meta_nonstatic_data_members_of | VERY HIGH (99%) |
| sub_630280 | 308 | do_constexpr_std_meta_static_data_members_of | VERY HIGH (99%) |
| sub_6307E0 | 590 | do_constexpr_std_meta_members_of | VERY HIGH (99%) |
| sub_631110 | 1,444 | copy_interpreter_object_to_constant | VERY HIGH (99%) |
| sub_632CB0 | 36 | Create reflection string object | MEDIUM (70%) |
| sub_632D80 | 64 | get_reflection_string_entry helper | HIGH (85%) |
| sub_632E80 | 268 | Format constexpr value for diagnostics | MEDIUM (65%) |
| sub_6333E0 | 166 | Dump constexpr value (debug) | MEDIUM (65%) |
| sub_6337D0 | 379 | Copy interpreter subobject | HIGH (85%) |
| sub_633EC0 | 157 | Release allocation chain | HIGH (80%) |
| sub_6341C0 | 224 | get_runtime_array_pos | VERY HIGH (99%) |
| sub_6345D0 | 21 | last_subobject_path_link | VERY HIGH (99%) |
| sub_634630 | 82 | get_trailing_subobject_path_entry | VERY HIGH (99%) |
| sub_634740 | 11,205 | do_constexpr_expression | ABSOLUTE (100%) |
| sub_643C50 | 202 | Prepare constexpr callee | HIGH (85%) |
| sub_643FE0 | 305 | eval_constexpr_callee | VERY HIGH (99%) |
| sub_644580 | 2,836 | do_constexpr_range_based_for_statement | VERY HIGH (99%) |
| sub_647850 | 509 | do_constexpr_statement | HIGH (90%) |
| sub_6480F0 | 1,659 | do_constexpr_ctor | VERY HIGH (99%) |
| sub_64A040 | 1,111 | do_constexpr_dynamic_init / do_constexpr_lambda | VERY HIGH (99%) |
| sub_64B580 | 2,299 | extract_value_from_constant | VERY HIGH (99%) |
| sub_64DFA0 | 86 | Destructor chain walker | HIGH (80%) |
| sub_64E170 | 404 | Perform destruction sequence | HIGH (85%) |
| sub_64E9E0 | 26 | Predicate / flag check | MEDIUM (65%) |
| sub_64EA30 | 293 | Load value from interpreter object | HIGH (85%) |
| sub_64EFE0 | 503 | do_constexpr_dtor (variant 1) | VERY HIGH (99%) |
| sub_64F8F0 | 14 | Trivial forwarding wrapper | MEDIUM (60%) |
| sub_64F920 | 108 | do_constexpr_std_construct_at | VERY HIGH (99%) |
| sub_64FB10 | 877 | do_constexpr_dtor (v2) / perform_destructions | VERY HIGH (99%) |
| sub_6509E0 | 427 | do_constexpr_init_variable | VERY HIGH (99%) |
| sub_651150 | 5,032 | do_constexpr_builtin_function | ABSOLUTE (100%) |
| sub_657560 | 1,445 | do_constexpr_call | VERY HIGH (99%) |
| sub_658CE0 | 134 | Loop iteration cleanup | HIGH (80%) |
| sub_658EE0 | 302 | do_constexpr_condition | VERY HIGH (99%) |
| sub_6593C0 | 816 | Loop body evaluator | HIGH (85%) |
| sub_65A290 | 311 | Entry from expression lowering | MEDIUM (70%) |
| sub_65A8C0 | 274 | Entry from expression trees | MEDIUM (70%) |
| sub_65AE50 | 572 | interpret_expr (primary entry) | VERY HIGH (99%) |
| sub_65BAB0--sub_65D150 | 150-470 | Misc entry points | MEDIUM (70%) |
| sub_65CFA0 | 67 | interpret_dynamic_sub_initializers | VERY HIGH (99%) |
| sub_65D9A0--sub_65DD20 | 7-68 | Small utility/accessor functions | MEDIUM (65%) |
| sub_65DE10 | 553 | do_constexpr_std_meta_define_class | VERY HIGH (99%) |
Cross-References
- EDG 6.6 Overview -- Position of interpret.c in the source tree
- Type System -- The 22 type kinds that the interpreter evaluates
- Template Engine -- Constexpr evaluation during template instantiation
- IL Overview -- IL constant nodes that materialization produces
- Diagnostics Overview -- Error message system for constexpr failures
- Pipeline Overview -- Where constexpr evaluation sits in the compilation pipeline
Name Mangling
The name mangling subsystem in cudafe++ implements the Itanium C++ ABI name mangling specification, with NVIDIA-specific extensions for CUDA device lambda wrappers and host reference array registration. The mangling pipeline lives in lower_name.c (60+ functions spanning 0x69C980--0x6AB280) and produces the _Z prefixed symbols that appear in .int.c output and PTX. A separate CUDA-aware demangler at sub_7CABB0 (930 lines, statically linked, not EDG code) reverses the process with extensions for three NVIDIA vendor-specific mangled prefixes: Unvdl, Unvdtl, and Unvhdl. The glue between mangling and CUDA execution spaces is nv_get_full_nv_static_prefix in nv_transforms.c, which constructs scoped static prefixes for __global__ template stubs destined for host reference arrays.
Key Facts
| Property | Value |
|---|---|
| Source file | lower_name.c (60+ functions), nv_transforms.c (prefix builder) |
| Address range | 0x69C980--0x6AB280 (mangling), 0x6BE300 (static prefix) |
| Demangler | sub_7CABB0 (930 lines, NVIDIA custom, not EDG) |
| ABI standard | Itanium C++ ABI (IA-64), extended with NVIDIA vendor types |
| Operator name table | sub_69C980 (mangled_operator_name), 47 entries |
| Entity mangler | sub_6A1F00 (mangle_entity_name), ~1000 lines |
| Expression mangler | sub_6A8B10 (mangled_expression), ~700 lines |
| Scalable vector mangler | sub_69CF10 (mangled_scalable_vector_name), 170 lines |
| Static prefix builder | sub_6BE300 (nv_get_full_nv_static_prefix), 370 lines |
| Output buffer | qword_127FCC0 (dynamic buffer with capacity tracking) |
| Demangling mode flag | qword_126ED90 (non-zero = demangling/diagnostic mode) |
| Compressed mangling flag | dword_106BC7C (ABI version control) |
| ABI version selector | qword_126EF98 (selects vendor-specific vs standard codes) |
Architecture Overview
Name mangling occurs at two distinct points in the cudafe++ pipeline:
- Forward mangling (IL lowering): EDG's lower_name.c converts entity nodes into Itanium ABI mangled names during the IL-to-text code generation phase. The entry point is mangle_entity_name (sub_6A1F00), which dispatches through 60+ helper functions to handle every C++ construct -- namespaces, classes, templates, operators, expressions, lambdas, and vendor-extended types.
- Reverse demangling (diagnostics): A statically linked demangler at sub_7CABB0 converts mangled names back to human-readable form for error messages and debug output. This demangler is not EDG code -- it is NVIDIA's custom implementation that wraps the standard Itanium ABI demangling algorithm with CUDA-specific extensions for device lambda wrapper types.
Entity Node (IL)
|
+-- sub_69FF70 (check_mangling_special_cases)
| Checks: extern "C", linkage name override, builtin
| If special case handled, done.
|
+-- sub_6A1F00 (mangle_entity_name) ~1000 lines
| |
| +-- sub_69C980 (mangled_operator_name) 47 operators
| +-- sub_69E740 (mangle_type_encoding) type dispatch
| +-- sub_6A3B00 (mangle_function_encoding)
| +-- sub_6A41A0 (mangle_declaration)
| +-- sub_6A4920 (mangle_template_parameter)
| +-- sub_6A5DC0 (mangle_abi_tags) B<tag> encoding
| +-- sub_6A6AF0 (mangle_template_args)
| +-- sub_6A78B0 (mangle_complete_type)
| +-- sub_6A8390 (mangled_nested_name_component)
| +-- sub_6A85E0 (mangled_entity_reference)
| +-- sub_6A8B10 (mangled_expression) ~700 lines
| +-- sub_6AB280 (mangled_encoding_for_sizeof)
|
+-- Output buffer: qword_127FCC0
[buffer_ptr, write_pos, capacity, overflow_flag, ...]
Operator Name Table (sub_69C980)
mangled_operator_name at 0x69C980 is a pure lookup function: it takes an operator kind byte and an arity flag, and returns a pointer to the two-character Itanium ABI mangled operator code. The function covers all 47 overloadable C++ operators, including C++20 co_await.
Assert: "mangled_operator_name: bad kind" at lower_name.c:11557.
Four operators are context-sensitive -- their mangled code depends on whether the usage is unary (arity a2==1) or binary:
| Kind | Unary | Binary | C++ Operator |
|---|---|---|---|
| 5 | ps | pl | + |
| 6 | ng | mi | - |
| 7 | de | ml | * |
| 11 | ad | an | & |
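The arity-sensitive lookup can be pictured as a small switch. The following is a minimal sketch, not the decompiled function: the name operator_code, its signature, and the NULL return for unlisted kinds are assumptions made for illustration, and only a few of the kinds documented on this page are included.

```c
#include <stddef.h>

/* Hypothetical stand-in for mangled_operator_name (sub_69C980).
 * kind  : operator kind byte, numbered as in the tables on this page
 * arity : 1 selects the unary code, any other value the binary code
 * Returns the two-character code, or NULL for kinds this sketch omits
 * (the real function carries a "bad kind" assertion instead). */
static const char *operator_code(unsigned char kind, int arity)
{
    switch (kind) {
    case 1:  return "nw";                       /* new */
    case 2:  return "dl";                       /* delete */
    case 5:  return arity == 1 ? "ps" : "pl";   /* +  */
    case 6:  return arity == 1 ? "ng" : "mi";   /* -  */
    case 7:  return arity == 1 ? "de" : "ml";   /* *  */
    case 11: return arity == 1 ? "ad" : "an";   /* &  */
    default: return NULL;                       /* not covered here */
    }
}
```

Only the four kinds 5, 6, 7, and 11 consult the arity argument; every other kind maps to a single code regardless of how it is used.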
Complete Operator Kind Table
| Kind | Code | Operator | Kind | Code | Operator |
|---|---|---|---|---|---|
| 1 | nw | new | 26 | ls | << |
| 2 | dl | delete | 27 | rs | >> |
| 5 | ps/pl | + (unary/binary) | 28 | rS | >>= |
| 6 | ng/mi | - (unary/binary) | 29 | lS | <<= |
| 7 | de/ml | * (unary/binary) | 30 | eq | == |
| 9 | rm | % | 31 | ne | != |
| 11 | ad/an | & (unary/binary) | 32 | le | <= |
| 12 | or | \| | 33 | ge | >= |
| 13 | co | ~ | 34 | ss | <=> |
| 14 | nt | ! | 37 | pp | ++ |
| 16 | lt | < | 40 | pm | ->* |
| 17 | gt | > | 41 | pt | -> |
| 24 | aN | &= | 42 | cl | () |
| 43 | ix | [] | 44 | qu | ?: |
| 45 | v23min | vendor min | 46 | v23max | vendor max |
| 47 | aw | co_await (C++20) | | | |
Kinds 3, 4, 8, 10, 15, 18--23, 25, 35--36, 38--39 return pointers to .rodata string constants (unk_A7C560 etc.) that encode the remaining standard operators (dv, eo, aS, pL, mI, mL, dV, eO, aa, oo, mm, cm).
Note kinds 45 and 46: these are vendor-extended operators using the Itanium ABI v<digit><source-name> encoding, where the digit is the operand count and the source name is length-prefixed. v23min therefore decomposes as v + 2 (two operands) + 3min (the 3-character name "min"), and v23max decomposes the same way with "max". The 23 is not a single length prefix.
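The Itanium ABI grammar for a vendor-extended operator is v, one digit giving the operand count, then a length-prefixed source name. The sketch below builds such a code. The function name vendor_operator_code and its signature are hypothetical and do not correspond to an identified function in the binary.

```c
#include <stdio.h>
#include <string.h>

/* Builds "v<operand-count><name-length><name>" into out.
 * Returns the number of characters written, or -1 on bad input
 * or truncation. Name and signature are illustrative only. */
static int vendor_operator_code(char *out, size_t cap,
                                int operands, const char *name)
{
    int n;

    if (operands < 0 || operands > 9)
        return -1;              /* the grammar allows a single digit */
    n = snprintf(out, cap, "v%d%zu%s", operands, strlen(name), name);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Called with 2 operands and the name min, this produces v23min; with 1 operand and __real__, it produces v18__real__, the same shape as the codes listed on this page.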
Entity Name Mangling (sub_6A1F00)
mangle_entity_name at 0x6A1F00 is the master mangling function. It produces the complete Itanium ABI mangled name for any entity node. At roughly 1000 decompiled lines, it handles every C++ entity kind through a multi-level dispatch.
Demangling Mode Early Exit
The function begins with a demangling-mode check:
if (qword_126ED90) { // demangling / diagnostic mode
emit_char(1, output); // '?'
emit_string("?", output);
return;
}
When qword_126ED90 is non-zero, the function emits "?" and returns immediately. This mode is used during diagnostic output when the compiler needs a placeholder rather than a real mangled name.
Pre-dispatch: Special Cases (sub_69FF70)
Before the main dispatch, sub_69FF70 (check_mangling_special_cases, 447 lines at 0x69FF70) screens for entities that bypass normal mangling:
- Linkage name override: If the entity has an explicit asm("name") or [[gnu::alias("name")]], the override name is used directly.
- extern "C" linkage: Returns the unmangled source name.
- Builtin entities: Special-cased to avoid generating bogus mangled names.
Main Dispatch Structure
After special-case screening, mangle_entity_name dispatches on the entity kind byte at entity node offset +132:
| Entity Kind | Handler | Encoding |
|---|---|---|
| Regular function | sub_6A3B00 (mangle_function_encoding) | _Z<encoding> |
| Regular variable | Direct type mangling | _Z<name><type> |
| Namespace member | sub_6A0740 (mangle_namespace_prefix) | N<qual>..E |
| Class member | sub_6A0A80 (mangle_class_prefix) | N<class><name>E |
| Template specialization | sub_6A6AF0 (mangle_template_args) | I<args>E |
| Operator function | sub_69C980 (mangled_operator_name) | operator codes |
| Constructor/destructor | sub_69FE30 | C1/C2/C3/D0/D1/D2 |
| Lambda closure | Lambda-specific path | Ul<sig>E<disc>_ |
| Local entity | sub_69F830 (mangle_local_name) | Z<func>E<entity> |
| Special (vtable etc.) | sub_69FBC0 (mangle_special_name) | TV/TI/GV etc. |
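To make the N<qual>..E row concrete, the following sketch assembles the mangled name of a function nested in namespaces or classes. It is not the EDG implementation: mangle_nested_function, its parameters, and the pre-encoded parameter string are assumptions, and the sketch ignores substitutions, templates, and cv-qualifiers.

```c
#include <stdio.h>
#include <string.h>

/* Writes "_ZN" + <len><name> for each component + "E" + params.
 * parts  : scope components followed by the function name
 * params : already-encoded parameter types, e.g. "v" for (void)
 * Returns the length written, or -1 if out is too small. */
static int mangle_nested_function(char *out, size_t cap,
                                  const char *const *parts, size_t count,
                                  const char *params)
{
    size_t pos, i;
    int n = snprintf(out, cap, "_ZN");

    if (n < 0 || (size_t)n >= cap)
        return -1;
    pos = (size_t)n;
    for (i = 0; i < count; i++) {
        /* Each component is a <source-name>: decimal length, then text. */
        n = snprintf(out + pos, cap - pos, "%zu%s",
                     strlen(parts[i]), parts[i]);
        if (n < 0 || (size_t)n >= cap - pos)
            return -1;
        pos += (size_t)n;
    }
    n = snprintf(out + pos, cap - pos, "E%s", params);
    if (n < 0 || (size_t)n >= cap - pos)
        return -1;
    return (int)(pos + (size_t)n);
}
```

For a function foo::bar() taking no parameters, passing the components foo and bar with the parameter string v yields _ZN3foo3barEv.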
Type Encoding Subpipeline
Type mangling is handled by sub_69E740 (mangle_type_encoding, 177 lines at 0x69E740), which dispatches on type kind to produce Itanium ABI type codes:
| Type | Code | Type | Code |
|---|---|---|---|
void | v | bool | b |
char | c | signed char | a |
unsigned char | h | short | s |
unsigned short | t | int | i |
unsigned int | j | long | l |
unsigned long | m | long long | x |
unsigned long long | y | float | f |
double | d | long double | e |
__int128 | n | unsigned __int128 | o |
wchar_t | w | char8_t | Du |
char16_t | Ds | char32_t | Di |
_Float16 | DF16_ | __float128 | g |
std::nullptr_t | Dn | auto | Da |
decltype(auto) | Dc |
Pointer and reference types are encoded with prefix qualifiers: P (pointer), R (lvalue reference), O (rvalue reference). CV-qualifiers use K (const), V (volatile), r (restrict).
The builtin type mangler at sub_6A13A0 (396 lines) includes CUDA-specific type detection through dword_106C2C0 (GPU mode flag) to handle CUDA-extended types.
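The prefix scheme for pointers, references, and cv-qualifiers composes recursively. The sketch below shows that composition for a deliberately tiny type model. The type_node structure, the CV_* flags, and the function names are all hypothetical; the real EDG type node is far larger and is dispatched by sub_69E740.

```c
#include <stddef.h>
#include <string.h>

enum { CV_CONST = 1, CV_VOLATILE = 2, CV_RESTRICT = 4 };
enum type_kind { TK_BUILTIN, TK_POINTER, TK_LVALUE_REF, TK_RVALUE_REF };

/* Hypothetical type node used only by this sketch. */
struct type_node {
    enum type_kind kind;
    unsigned cv;                      /* CV_* bits applied to this type */
    const char *builtin_code;         /* e.g. "i", "c"; TK_BUILTIN only */
    const struct type_node *target;   /* pointee or referent otherwise  */
};

/* Appends s to out, keeping it NUL-terminated; -1 if it does not fit. */
static int put(char *out, size_t cap, size_t *pos, const char *s)
{
    size_t n = strlen(s);

    if (*pos + n + 1 > cap)
        return -1;
    memcpy(out + *pos, s, n + 1);
    *pos += n;
    return 0;
}

/* Emits qualifiers in the ABI order r, V, K, then the type itself. */
static int encode_type(const struct type_node *t,
                       char *out, size_t cap, size_t *pos)
{
    const char *prefix = NULL;

    if ((t->cv & CV_RESTRICT) && put(out, cap, pos, "r") != 0) return -1;
    if ((t->cv & CV_VOLATILE) && put(out, cap, pos, "V") != 0) return -1;
    if ((t->cv & CV_CONST)    && put(out, cap, pos, "K") != 0) return -1;

    switch (t->kind) {
    case TK_BUILTIN:    return put(out, cap, pos, t->builtin_code);
    case TK_POINTER:    prefix = "P"; break;
    case TK_LVALUE_REF: prefix = "R"; break;
    case TK_RVALUE_REF: prefix = "O"; break;
    }
    if (prefix == NULL || put(out, cap, pos, prefix) != 0)
        return -1;
    return encode_type(t->target, out, cap, pos);
}
```

A pointer node whose target is a const-qualified char encodes as PKc, and an rvalue reference to int encodes as Oi.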
Substitution Mechanism
The Itanium ABI uses substitution sequences (S_, S0_, S1_, ...) to compress repeated type references. The substitution infrastructure in lower_name.c centers on:
- sub_69F0D0 (mangle_substitution_check): Checks whether a type/name component has already been emitted and should use a substitution reference.
- sub_69F150 (mangle_with_substitution, 87 lines): Handles S_ encoding, including the well-known substitutions Sa (std::allocator), Sb (std::basic_string), Ss (std::string), Si (std::istream), So (std::ostream), Sd (std::iostream).
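The numbering behind S_, S0_, S1_, ... follows the Itanium ABI sequence-id rule: the first candidate is S_ with no digits, and candidate n (for n >= 1) is S followed by n - 1 in base 36 using digits and uppercase letters. A minimal sketch follows; format_substitution is a hypothetical helper, not an identified function in lower_name.c.

```c
#include <stddef.h>

/* Formats the substitution reference for the index-th candidate
 * (0-based). Returns the length written, or -1 if out is too small. */
static int format_substitution(unsigned long index, char *out, size_t cap)
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    char tmp[16];               /* base-36 digits, least significant first */
    size_t n = 0, pos = 0;

    if (index > 0) {
        unsigned long v = index - 1;
        do {
            tmp[n++] = digits[v % 36];
            v /= 36;
        } while (v != 0);
    }
    if (n + 3 > cap)            /* 'S' + digits + '_' + NUL */
        return -1;
    out[pos++] = 'S';
    while (n > 0)
        out[pos++] = tmp[--n];
    out[pos++] = '_';
    out[pos] = '\0';
    return (int)pos;
}
```

Index 0 gives S_, index 1 gives S0_, index 11 gives SA_, and index 37 gives S10_.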
Template Argument Mangling
Template arguments are enclosed in I...E and handled by:
- sub_69ED40 (mangle_template_args, 86 lines): Iterates the template argument list, emitting I prefix and E suffix.
- sub_69EEE0 (mangle_template_arg, 109 lines): Mangles individual template arguments, dispatching between type arguments (direct type encoding), non-type arguments (expression or literal encoding), and template template arguments.
- sub_6A4920 (mangle_template_parameter, 277 lines): Encodes template parameter references (T_, T0_, T1_, ...).
ABI Tag Mangling (sub_6A5DC0)
sub_6A5DC0 (643 lines at 0x6A5DC0) handles [[gnu::abi_tag("...")]] attribute propagation per the Itanium ABI extensions. ABI tags are encoded as B<length><tag> suffixes and must be propagated through template instantiations and inline namespaces (e.g., std::__cxx11::basic_string with tag cxx11). This is one of the more complex mangling functions due to the transitive nature of tag propagation.
Constructor/Destructor Encoding (sub_69FE30)
Constructors and destructors use the Itanium ABI's multi-variant encoding:
| Code | Meaning |
|---|---|
C1 | Complete object constructor |
C2 | Base object constructor |
C3 | Complete object allocating constructor |
D0 | Deleting destructor |
D1 | Complete object destructor |
D2 | Base object destructor |
Special Name Mangling (sub_69FBC0)
sub_69FBC0 (125 lines) produces mangled names for compiler-generated symbols:
| Prefix | Symbol |
|---|---|
_ZTV | Virtual table |
_ZTT | VTT structure (construction vtable index) |
_ZTI | typeinfo structure |
_ZTS | typeinfo name string |
_ZGV | Guard variable for static initialization |
_ZTH | Thread-local initialization function |
_ZTW | Thread-local wrapper function |
Expression Mangling (sub_6A8B10)
mangled_expression at 0x6A8B10 is the second-largest function in lower_name.c at roughly 700 decompiled lines. It produces the Itanium ABI encoding for arbitrary C++ expressions appearing in template arguments, noexcept specifications, and decltype contexts.
Assert: "mangled_encoding_for_expression_full" at lower_name.c:6870, "mangled_expr_operator_name: bad operator" at lower_name.c:11873, "mangled_call_operation" at lower_name.c:6132.
Expression Kind Dispatch
The function first calls sub_69E740 to classify the expression node, then dispatches on the expression kind byte at node offset +24:
| Kind | Description | ABI Encoding |
|---|---|---|
| 0 | Error/unknown expression | ? (demangling mode only) |
| 1 | Operator expression | Dispatches on operator byte at +40 |
| 2 | Literal value | L<type><value>E |
| 3 | Entity reference | L_Z<encoding>E or substitution |
| 4 | Template parameter | T_/T0_ etc. |
| 5 | sizeof/alignof/typeid/noexcept | Delegated to sub_6AB280 |
| 6 | Cast expression | sc/dc/rc/cv prefix |
| 7 | Call expression | cl<callee><args>E or cp<args>E |
| 8 | Member access | dt/pt prefix |
| 9 | Conditional expression | qu<cond><then><else> |
| 10 | Pack expansion | sp<pattern> |
Operator Sub-dispatch (Kind 1)
When the expression is an operator expression, the function reads the operator byte at node offset +40 and performs a large switch covering 100+ cases. For standard binary and unary operators, it calls sub_69C980 (mangled_operator_name) to get the two-character ABI code, then recursively processes operands. Notable special cases:
- Cast operators (kinds 0x05--0x13): Dispatches between sc (static_cast), dc (dynamic_cast), rc (reinterpret_cast), and cv (C-style cast) based on cast flags at node offset +25 and +42. The compressed mangling flag dword_106BC7C forces cv for all casts when set.
- Vendor extensions (0x21, 0x22): __real__ and __imag__ complex number operations, encoded as v18__real__ and v18__imag__ using the vendor-extended operator format.
- Increment/decrement (kinds 0x23--0x26): Pre/post increment (pp) and decrement (mm). The Itanium ABI distinguishes the two forms with a trailing _: the prefix forms are pp_ and mm_, while the postfix forms use the bare pp and mm codes.
- Call expressions (kinds 0x69--0x6D, 0x16--0x17): Dispatches to mangled_call_operation, which determines the callee encoding and emits the cl (call) or cp (call whose callee name is parenthesized, which suppresses argument-dependent lookup) prefix.
sizeof/alignof/typeid/noexcept (sub_6AB280)
mangled_encoding_for_sizeof at 0x6AB280 (130 lines) handles the sizeof-family of operators:
| ABI Code | Operator | Variant |
|---|---|---|
sz | sizeof(expr) | Expression operand |
st | sizeof(type) | Type operand |
az | alignof(expr) | Expression operand |
at | alignof(type) | Type operand |
te | typeid(expr) | Expression operand |
ti | typeid(type) | Type operand |
nx | noexcept(expr) | Expression operand |
For older ABI versions (controlled by dword_106BC7C and qword_126EF98), the function emits vendor-specific codes v17alignof and v18alignofe instead of the standard at/az codes.
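The table and the legacy fallback can be summarized as one selection function. This is a sketch under stated assumptions: the enum, the function name, and the single legacy_abi flag are hypothetical simplifications of the dword_106BC7C / qword_126EF98 checks, and pairing v17alignof with the type operand and v18alignofe with the expression operand is inferred from the order in which the text lists them.

```c
#include <stddef.h>

enum sizeof_op { OP_SIZEOF, OP_ALIGNOF, OP_TYPEID, OP_NOEXCEPT };

/* Picks the ABI code for a sizeof-family operator.
 * operand_is_type : nonzero for sizeof(type), zero for sizeof(expr)
 * legacy_abi      : nonzero selects the vendor alignof codes (assumed) */
static const char *sizeof_family_code(enum sizeof_op op,
                                      int operand_is_type, int legacy_abi)
{
    switch (op) {
    case OP_SIZEOF:
        return operand_is_type ? "st" : "sz";
    case OP_ALIGNOF:
        if (legacy_abi)
            return operand_is_type ? "v17alignof" : "v18alignofe";
        return operand_is_type ? "at" : "az";
    case OP_TYPEID:
        return operand_is_type ? "ti" : "te";
    case OP_NOEXCEPT:
        return operand_is_type ? NULL : "nx";  /* expression operand only */
    }
    return NULL;
}
```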
Scalable Vector Name Mangling (sub_69CF10)
mangled_scalable_vector_name at 0x69CF10 (170 lines) returns mangled names for ARM SVE and RISC-V V extension scalable vector types. EDG supports these types natively, and they must be mangled using the vendor-specific Itanium ABI encoding.
Assert: "mangled_scalable_vector_name" at lower_name.c:10473 and lower_name.c:10440.
The function dispatches on the type node's kind byte at offset +132:
Dispatch Logic
- Kind 12 (elaborated type): Unwraps through the elaboration chain (offset +144 points to the underlying type).
- Kind 3 (typedef/alias): Dispatches on subkind at offset +144:
  - Subkind 1: svint variants (signed integer vectors)
  - Subkind 2: svfloat variants (floating-point vectors)
  - Subkind 4: svbool variants (predicate vectors)
  - Subkind 9: svcount variants
- Kind 18 (mfloat8): mfloat8x types for ML inference.
- Kind 2 (plain vector): Dispatches on element type byte at offset +144, handling 8 element widths (cases 1--8).
Each type category has 4 mangling variants selected by the a2 parameter (values 1--4), corresponding to different vector widths or tuple sizes (e.g., svint8_t, svint8x2_t, svint8x3_t, svint8x4_t). The actual mangled strings are stored in .rodata pointer tables (off_A7E950 through off_A7EA18).
There is also special handling for svboolx4_t via sub_7A7220, which detects the specific boolean-tuple-of-4 predicate type and returns a dedicated mangling string.
Mangling Output Buffer
All mangling functions write into a shared output buffer managed through qword_127FCC0. The buffer structure:
| Offset | Size | Field | Description |
|---|---|---|---|
+0 | 8 | reserved | Not used during mangling |
+8 | 8 | capacity | Allocated buffer size |
+16 | 8 | write_pos | Current write position (length of mangled name so far) |
+24 | 8 | unused | Reserved |
+32 | 8 | buffer_ptr | Pointer to character buffer |
Key buffer operations:
- sub_69D850 (append_char_to_buffer): Appends a single character, calls sub_6B9B20 to grow the buffer if write_pos + 1 > capacity.
- sub_69D530 (append_string): Appends a string to the buffer.
- sub_69D580 (append_number): Appends a base-36 encoded number.
- sub_6B9B20 (ensure_output_buffer_space): Grows the buffer (doubles capacity).
The sub_69DAA0 function (mangle_number, 63 lines) writes numbers in base-36 encoding as required by the Itanium ABI for substitution indices and discriminators.
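The append-and-grow behaviour described above can be sketched with a structure that mirrors the documented field offsets on an LP64 target. Everything beyond those offsets is an assumption: the struct and function names are hypothetical, the initial capacity of 16 is arbitrary, and realloc stands in for whatever allocator sub_6B9B20 actually uses.

```c
#include <stddef.h>
#include <stdlib.h>

/* Field order follows the offset table for qword_127FCC0 (LP64). */
struct mangle_buffer {
    void  *reserved;    /* +0  not used during mangling */
    size_t capacity;    /* +8  allocated size           */
    size_t write_pos;   /* +16 characters written       */
    void  *unused;      /* +24 reserved                 */
    char  *buffer_ptr;  /* +32 character storage        */
};

/* Doubles capacity until `extra` more characters fit. */
static int ensure_space(struct mangle_buffer *b, size_t extra)
{
    size_t cap;
    char *p;

    if (b->write_pos + extra <= b->capacity)
        return 0;
    cap = b->capacity ? b->capacity : 16;   /* assumed starting size */
    while (cap < b->write_pos + extra)
        cap *= 2;
    p = realloc(b->buffer_ptr, cap);
    if (p == NULL)
        return -1;
    b->buffer_ptr = p;
    b->capacity = cap;
    return 0;
}

/* Appends one character, growing the buffer first when needed. */
static int append_char(struct mangle_buffer *b, char c)
{
    if (ensure_space(b, 1) != 0)
        return -1;
    b->buffer_ptr[b->write_pos++] = c;
    return 0;
}
```

Doubling keeps the amortized cost of each append constant, which matters for a buffer that is written one character at a time.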
Mangling Type Marks
The mangling pipeline uses a mark-and-sweep mechanism to track which types have been referenced during signature mangling (needed for substitution sequence generation):
- sub_69CCB0 (set_signature_mark, 76 lines): Marks types in a function signature for mangling. Handles function types (a2=7) and template functions (a2=11) by calling sub_5CF440 for type traversal.
- sub_69CE10 (ttt_mark_entry, 36 lines): Sets or clears the mangling mark on individual type entities. Uses bit 7 of byte at entity offset +81. The direction (mark vs unmark) is controlled by dword_127FC70.
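The per-entity mark reduces to setting or clearing one bit. The sketch below treats the entity as a raw byte array and passes the direction as a parameter instead of reading dword_127FC70; both function names are hypothetical.

```c
enum { MARK_BYTE_OFFSET = 81, MARK_BIT = 0x80 };

/* entity : raw bytes of an entity node (at least 82 bytes long)
 * set    : nonzero marks the entity, zero clears the mark; this
 *          parameter plays the role of dword_127FC70 */
static void set_mangling_mark(unsigned char *entity, int set)
{
    if (set)
        entity[MARK_BYTE_OFFSET] |= MARK_BIT;
    else
        entity[MARK_BYTE_OFFSET] &= (unsigned char)~MARK_BIT;
}

/* Returns nonzero when bit 7 of the byte at +81 is set. */
static int has_mangling_mark(const unsigned char *entity)
{
    return (entity[MARK_BYTE_OFFSET] & MARK_BIT) != 0;
}
```

The other seven bits of the byte are left untouched in both directions, so unrelated flags stored at +81 survive a mark/unmark cycle.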
CUDA Demangler Extensions (sub_7CABB0)
The CUDA-aware demangler at sub_7CABB0 (930 decompiled lines at 0x7CABB0) is a statically linked NVIDIA implementation, not part of EDG. It implements a full Itanium ABI C++ name demangler with three NVIDIA vendor-type extensions for CUDA lambda wrappers.
Function Signature
unsigned char* sub_7CABB0(
unsigned char *mangled_name, // a1: input cursor into mangled name
int64_t qualifier_out, // a2: output qualifier struct (24 bytes)
char flags, // a3: behavior flags
int64_t output_ctx // a4: output buffer context
);
Output Buffer Context (a4)
| Offset | Size | Field | Description |
|---|---|---|---|
+0 | 8 | buffer_ptr | Output character buffer |
+8 | 8 | write_pos | Current output position |
+16 | 8 | capacity | Buffer capacity |
+24 | 4 | error_flag | Set to 1 on buffer overflow |
+28 | 4 | overflow | Redundant overflow indicator |
+32 | 8 | suppress_level | When >0, output is suppressed (for dry-run parsing) |
+48 | 8 | error_count | Cumulative parse error counter |
+64 | 8 | skip_template | When set, suppresses template argument output |
Qualifier Output (a2)
| Offset | Size | Field | Description |
|---|---|---|---|
+0 | 4 | has_template_args | Set to 1 when template arguments were parsed |
+4 | 4 | cv_qualifiers | bit 0=const, bit 1=volatile, bit 2=restrict |
+8 | 4 | ref_qualifier | 0=none, 1=lvalue &, 2=rvalue && |
+16 | 8 | template_depth | Template nesting depth |
Flags (a3)
| Bit | Meaning |
|---|---|
| 0 | Static-from mode: wraps output in [static from ...]...[C++] |
| 1 | Suppress-scope mode: increments suppress level |
Parsing Dispatch
The demangler handles these Itanium ABI top-level prefixes:
| Prefix Byte | ASCII | ABI Meaning | Handler |
|---|---|---|---|
0x42 | B | EDG block-scope static entity | Block-scope handler (offset + length) |
0x4E | N | Nested name (qualified) | sub_7CA440 (nested-name parser) |
0x5A | Z | Local entity | sub_7CEAE0 (encoding parser) + local suffix |
0x53 | S | Substitution | sub_7CD7B0 (substitution resolver) |
0x53 0x74 | St | std:: prefix | Emits std:: + sub_7CD0B0 (unqualified-name) |
| other | | Unqualified name | sub_7CD0B0 (unqualified-name parser) |
After parsing the name, the function checks for I (template argument list, 0x49) and dispatches to sub_7C9D30 (template-args parser). A template argument cache at qword_12C7B48/12C7B40/12C7B50 stores parsed entries using a dynamic array that grows by 500 entries via malloc/realloc.
CUDA Vendor Type Extensions
The key NVIDIA extensions are triggered when the demangler encounters the vendor-extended type prefix U followed by nv (bytes 0x55 0x6E 0x76). Three patterns are recognized:
Unvdl -- Device Lambda Wrapper
Pattern: Unvdl<arity><encoding><type>...
Input: "Unvdl" + <numeric_arity> + <function_encoding> + <captured_types>...
Output: "__nv_dl_wrapper_t<__nv_dl_tag<(& :: <scope>), <arity>, <type1>, ...> >"
Decoded step by step:
1. Emit __nv_dl_wrapper_t<
2. Emit __nv_dl_tag<
3. Parse numeric arity via sub_7C3180, subtract 2 to get actual capture count
4. Parse one type (sub_7CE590) for the wrapped function type
5. Emit ,( followed by & :: and then recursively demangle the scope (calling sub_7CABB0 with flags=2)
6. Emit ),
7. Parse remaining captured types (count from step 3)
8. Emit > >
Unvdtl -- Trailing Return Device Lambda
Pattern: Unvdtl<arity><return_type><encoding><captured_types>...
Input: "Unvdtl" + <arity> + <type> + <func_encoding> + <captured_types>...
Output: "__nv_dl_wrapper_t<__nv_dl_trailing_return_tag<...>, <return_type>, ...>"
Same as Unvdl except:
- Emit __nv_dl_wrapper_t<
- Emit __nv_dl_trailing_return_tag< (instead of __nv_dl_tag<)
- After the scope demangling, parse an additional return type via sub_7CE590
- Parse a function type via sub_7CE5D0 (adds 1 to result pointer for the E terminator)
- Then parse remaining captured types
Unvhdl -- Host-Device Lambda Wrapper
Pattern: Unvhdl<bool1><bool2><bool3><arity><encoding><captured_types>...
Input: "Unvhdl" + <IsMutable> + <HasFuncPtrConv> + <NeverThrows> + <arity> + ...
Output: "__nv_hdl_wrapper_t<true/false, true/false, true/false,
__nv_dl_tag<(& :: <scope>), <arity>, <type1>, ...> >"
The three boolean template parameters are decoded first:
- Parse a numeric value via sub_7C3180 for IsMutable (first boolean): if the value is not 2, emit true,; if it is 2 (the encoding of false), emit false,
- Repeat for HasFuncPtrConv (second boolean)
- Repeat for NeverThrows (third boolean)
- Then proceed identically to Unvdl (emit __nv_dl_tag<, parse captures, etc.), but with the v68=1 flag marking the host-device variant
The boolean encoding convention: 2 encodes false, any other value (typically 0 or 1) encodes true. This is the reverse of the usual convention and matches the internal encoding used by nv_transforms.c when generating the mangled names.
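The two numeric conventions used by these wrappers (2 means false; encoded arity is the capture count plus 2) can be sketched as follows. The helpers operate on values that have already been parsed, since how the numbers are delimited in the mangled stream is not documented here. All three function names are hypothetical, and the exact spacing after each comma in the output is an assumption.

```c
#include <stdio.h>

/* 2 encodes false; every other parsed value reads as true. */
static const char *hdl_flag_text(long parsed)
{
    return parsed != 2 ? "true" : "false";
}

/* The encoded arity is the capture count plus 2. */
static long capture_count(long parsed_arity)
{
    return parsed_arity - 2;
}

/* Writes the opening of the host-device wrapper type.
 * Returns the length written, or -1 on truncation. */
static int format_hdl_prefix(char *out, size_t cap, long is_mutable,
                             long has_func_ptr_conv, long never_throws)
{
    int n = snprintf(out, cap, "__nv_hdl_wrapper_t<%s, %s, %s, ",
                     hdl_flag_text(is_mutable),
                     hdl_flag_text(has_func_ptr_conv),
                     hdl_flag_text(never_throws));

    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

With parsed values 1, 2, 0 the prefix reads __nv_hdl_wrapper_t<true, false, true, and an encoded arity of 5 corresponds to three captured types.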
Block-Scope Static Handling
When the input starts with B (ASCII 0x42), the demangler handles EDG's block-scope static entity encoding:
- If flags bit 0 is set and suppress_level is 0: emit [static from
- Parse an optional negative sign (n) followed by a decimal length
- Skip ahead by that length (the block-scope name)
- If suppress_level is 0: emit ] followed by [C++] (the closing bracket and C++ marker)
- If flags bit 0 is not set: decrement suppress_level
Instance Suffix
After parsing the main name, if the next character is _ followed by digits (or __ followed by digits), the demangler parses an instance discriminator and emits (instance N) suffix in the output, where N = parsed_value + 2.
Default Argument Suffix
For local entities (after Z...E), the discriminator prefix d triggers special handling:
- d_ or d<number>_: emits [default argument N (from end)]:: where N = parsed_value + 2
- dn<number>_: negative-index variant
Call Graph
The demangler calls into specialized sub-parsers:
| Address | Function | Purpose |
|---|---|---|
sub_7CA440 | Nested-name parser | Handles N...E qualified names |
sub_7CEAE0 | Encoding parser | Top-level <encoding> production |
sub_7CD0B0 | Unqualified-name parser | <source-name> and operator names |
sub_7CD7B0 | Substitution resolver | S_/S0_ back-references |
sub_7C9D30 | Template-args parser | I<args>E |
sub_7CE590 | Type parser | Full type demangling |
sub_7CE5D0 | Function-type parser | Function signature types |
sub_7C3180 | Numeric literal parser | Decimal number extraction |
sub_7C30C0 | Arity emitter | Outputs numeric arity values |
sub_7C2FB0 | String emitter | Emits literal strings to output buffer |
sub_7C3030 | Signed number parser | Handles negative numbers |
Static Prefix for __global__ Templates (sub_6BE300)
nv_get_full_nv_static_prefix at 0x6BE300 (370 lines) in nv_transforms.c constructs unique prefix strings for __global__ function templates with static/internal linkage. These prefixes are used to register device symbols in host reference arrays (the .nvHR* ELF sections that the CUDA runtime uses for symbol discovery).
Assert: "nv_get_full_nv_static_prefix" at nv_transforms.c:2164.
Entry Conditions
The function checks two conditions on the entity node:
- Bit 0x40 at entity offset +182 must be set (marks __global__ functions)
- A name string at entity offset +8 must be non-null
Internal vs External Linkage Paths
The function takes different paths based on entity linkage:
Internal linkage (bits 0x12 at offset +179 set, or storage class 0x10 at offset +80):
- Build scoped name prefix via sub_6BD2F0 (nv_build_scoped_name_prefix), which recursively walks the scope chain (offset +40 -> parent scope at offset +28) to build Namespace1::Namespace2:: style prefixes. Anonymous namespaces insert _GLOBAL__N_<filename>.
- Hash the entity name via sub_6BD1C0 (format_string_to_sso) using vsnprintf with a format string at address 0x82D326 (decimal 8573734, inside .rodata).
- Build the full prefix string using snprintf:
  snprintf(qword_1286760, n, "%s%lu_%s_", off_E7C768, strlen(filename), filename);
  Where off_E7C768 is a global prefix string (likely "_nv_static_"), the %lu is the filename length, and %s is the filename from sub_5AF830(0). The result is cached in qword_1286760 for reuse across entities in the same translation unit.
- Concatenate prefix + "_" separator + entity scoped name
- Register the full string in qword_12868C0 (kernel internal-linkage host reference list)
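The snprintf call in the internal-linkage path can be exercised in isolation. In the sketch below, build_static_prefix is a hypothetical wrapper, and the base string is passed in by the caller because the actual contents of off_E7C768 are only a guess. The format string is the one quoted above, with the length cast to unsigned long to match %lu.

```c
#include <stdio.h>
#include <string.h>

/* Formats "<base><strlen(filename)>_<filename>_" into out.
 * base stands in for the string at off_E7C768 (value unconfirmed).
 * Returns the length written, or -1 on truncation. */
static int build_static_prefix(char *out, size_t cap,
                               const char *base, const char *filename)
{
    int n = snprintf(out, cap, "%s%lu_%s_", base,
                     (unsigned long)strlen(filename), filename);

    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

Assuming the base string is _nv_static_ and the file is kernel.cu, the result is _nv_static_9_kernel.cu_. The embedded length lets a consumer recover the filename even when it contains underscores.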
External linkage:
- Build name with " ::" scope prefix (the leading space is intentional -- it matches the demangler output format)
- Walk scope chain via sub_6BD2F0 if the entity has a parent scope with kind 3 (namespace)
- Hash the entity name via sub_6BD1C0
- Append "_" separator
- Register in qword_1286880 (kernel external-linkage host reference list)
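Both linkage paths rely on the recursive scope walk in sub_6BD2F0. A minimal sketch of that recursion follows. The scope structure, the use of a NULL name to denote an anonymous namespace, and the :: appended after the _GLOBAL__N_<filename> component are assumptions; the real function reads these links from fixed offsets in EDG scope nodes.

```c
#include <stdio.h>

/* Hypothetical scope node: name is NULL for an anonymous namespace. */
struct scope {
    const char *name;
    const struct scope *parent;
};

/* Appends "Outer::Inner::" for the chain ending at s, outermost first.
 * Anonymous namespaces contribute "_GLOBAL__N_<filename>::".
 * pos tracks the write position; returns 0, or -1 on truncation. */
static int build_scope_prefix(const struct scope *s, const char *filename,
                              char *out, size_t cap, size_t *pos)
{
    int n;

    if (s == NULL)
        return 0;               /* reached the global scope */
    if (build_scope_prefix(s->parent, filename, out, cap, pos) != 0)
        return -1;              /* parents are emitted before children */
    if (s->name != NULL)
        n = snprintf(out + *pos, cap - *pos, "%s::", s->name);
    else
        n = snprintf(out + *pos, cap - *pos, "_GLOBAL__N_%s::", filename);
    if (n < 0 || (size_t)n >= cap - *pos)
        return -1;
    *pos += (size_t)n;
    return 0;
}
```

Recursing to the parent before writing the current component is what produces outermost-first ordering from a child-to-parent linked chain.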
Host Reference Arrays
The prefixes generated by this function end up in six global lists, one per combination of {kernel, device, constant} x {external, internal} linkage:
| Global | Section | Array Name |
|---|---|---|
unk_1286780 | .nvHRDE | hostRefDeviceArrayExternalLinkage |
unk_12867C0 | .nvHRDI | hostRefDeviceArrayInternalLinkage |
unk_1286800 | .nvHRCE | hostRefConstantArrayExternalLinkage |
unk_1286840 | .nvHRCI | hostRefConstantArrayInternalLinkage |
unk_1286880 | .nvHRKE | hostRefKernelArrayExternalLinkage |
unk_12868C0 | .nvHRKI | hostRefKernelArrayInternalLinkage |
These are emitted by sub_6BCF80 (nv_emit_host_reference_array) as weak extern "C" byte arrays in the specified ELF sections.
Related Mangling Infrastructure
Type Mangling Subsystem (0x7C3000--0x7D0E00)
A separate type mangling subsystem exists in the 0x7C3000--0x7D0E00 range, used for diagnostic output and type encoding (distinct from the lower_name.c mangling used for symbol generation). Key functions:
| Address | Function | Lines | Description |
|---|---|---|---|
sub_7C3480 | encode_operator_name | 716 | Operator name encoding for diagnostics |
sub_7C5650 | encode_type_for_mangling | 794 | Full type encoding dispatcher |
sub_7C6290 | encode_expression | 2519 | Largest function -- expression encoding |
sub_7C8BE0 | encode_special_expression | 674 | Special expression forms |
sub_7CBB90 | encode_builtin_type | 1314 | All builtin type mappings |
sub_7CEAE0 | encode_template_args | 1417 | Template argument encoding |
sub_7CFFC0 | encode_nullptr | 484 | nullptr-related type encoding |
The encode_expression function at sub_7C6290 (2519 lines) is the largest single function in the entire type mangling subsystem and handles the full range of C++ expressions including dynamic_cast, const_cast, reinterpret_cast, safe_cast, static_cast, subscript, and throw.
Nested Name Components (sub_6A8390)
mangled_nested_name_component at 0x6A8390 (101 lines) handles the intermediate components within N...E nested name encodings. It emits ABI substitution codes:
- dn: Destructor name
- co: Coercion operator
- sr: Unresolved scope resolution
- L_ZN: Local scope nested name
- D1Ev: Destructor suffix (complete object destructor, void return)
When in compressed mode (dword_106BC7C set), the function checks for std:: namespace via sub_7BE9E0 (is_std_namespace) and uses shortened forms.
Entity Reference Mangling (sub_6A85E0)
mangled_entity_reference at 0x6A85E0 (197 lines) is the central dispatch for mangling entity references within expressions. It handles:
- Qualified scope resolution (bit 2 at entity offset +81)
- Address-of expressions (ad prefix)
- Compressed vs full mangling paths
- Class member vs free-function encoding
Assert: "mangled_entity_reference" at lower_name.c:4183.
Mangling Discriminators (sub_69DBE0)
mangle_discriminator at 0x69DBE0 (72 lines) writes discriminators for local entities. Itanium ABI uses _ for discriminator 0, _<number>_ for higher discriminators, where the number is encoded in base-36.
Global State Summary
| Global | Type | Purpose |
|---|---|---|
qword_127FCC0 | Buffer* | Primary mangling output buffer |
qword_126ED90 | qword | Demangling/diagnostic mode flag |
dword_106BC7C | dword | Compressed/vendor-ABI mode flag |
qword_126EF98 | qword | ABI version selector |
dword_127FC70 | dword | Mark/unmark direction for type marks |
qword_1286760 | char* | Cached static prefix string |
qword_1286A00 | char* | Cached anonymous namespace name |
dword_12C6A24 | dword | Block-scope suppress level (demangler) |
qword_12C7B48 | qword | Template argument cache index |
qword_12C7B40 | qword | Template argument cache capacity |
qword_12C7B50 | qword | Template argument cache pointer |
off_E7C768 | char* | Static prefix base string |
Function Address Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
0x69C830 | 24 | init_lower_name | LOW |
0x69C980 | 168 | mangled_operator_name | HIGH |
0x69CCB0 | 76 | set_signature_mark | HIGH |
0x69CE10 | 36 | ttt_mark_entry | HIGH |
0x69CF10 | 170 | mangled_scalable_vector_name | HIGH |
0x69D530 | -- | append_string | MEDIUM |
0x69D580 | -- | append_number | MEDIUM |
0x69D850 | -- | append_char_to_buffer | MEDIUM |
0x69DAA0 | 63 | mangle_number | MEDIUM |
0x69DBE0 | 72 | mangle_discriminator | MEDIUM |
0x69E380 | 116 | mangle_cv_qualifiers | MEDIUM |
0x69E5F0 | 79 | mangle_ref_qualifier | MEDIUM |
0x69E740 | 177 | mangle_type_encoding | MEDIUM-HIGH |
0x69EA40 | 150 | mangle_function_type | MEDIUM |
0x69ED40 | 86 | mangle_template_args | MEDIUM |
0x69EEE0 | 109 | mangle_template_arg | MEDIUM |
0x69F0D0 | 28 | mangle_substitution_check | LOW |
0x69F150 | 87 | mangle_with_substitution | MEDIUM |
0x69F320 | 78 | mangle_nested_name | MEDIUM |
0x69F830 | 54 | mangle_local_name | MEDIUM |
0x69F930 | 60 | mangle_unscoped_name | MEDIUM |
0x69FA90 | 58 | mangle_source_name | MEDIUM |
0x69FBC0 | 125 | mangle_special_name | MEDIUM |
0x69FE30 | 78 | mangle_constructor_destructor | MEDIUM |
0x69FF70 | 447 | check_mangling_special_cases | MEDIUM-HIGH |
0x6A0740 | 189 | mangle_namespace_prefix | MEDIUM |
0x6A0A80 | 88 | mangle_class_prefix | MEDIUM |
0x6A0FB0 | 245 | mangle_pointer_type | MEDIUM |
0x6A13A0 | 396 | mangle_builtin_type | MEDIUM-HIGH |
0x6A1C80 | 97 | mangle_expression | MEDIUM |
0x6A1F00 | ~1000 | mangle_entity_name | HIGH |
0x6A4920 | 277 | mangle_template_parameter | MEDIUM |
0x6A5DC0 | 643 | mangle_abi_tags | MEDIUM-HIGH |
0x6A78B0 | 297 | mangle_complete_type | MEDIUM |
0x6A7F20 | 232 | mangle_initializer | MEDIUM |
0x6A8390 | 101 | mangled_nested_name_component | HIGH |
0x6A85E0 | 197 | mangled_entity_reference | HIGH |
0x6A8B10 | ~700 | mangled_expression | HIGH |
0x6AA030 | 30 | mangled_expression_list | HIGH |
0x6AB280 | 130 | mangled_encoding_for_sizeof | HIGH |
0x6BE300 | 370 | nv_get_full_nv_static_prefix | VERY HIGH |
0x7CABB0 | 930 | CUDA demangler (top-level) | HIGH |
Type System
The type system in cudafe++ is EDG 6.6's implementation of the C++ type representation, query, construction, comparison, and layout infrastructure. It lives primarily in types.c (250+ functions at 0x7A4940--0x7C02A0) with type allocation in il_alloc.c (0x5E2E80--0x5E45C0), type construction helpers in il.c (0x5D64F0--0x5D6DB0), and class layout computation in layout.c (0x65EA50--0x665B50).
Every C++ entity -- variable, function parameter, expression result, template argument -- carries a type pointer. EDG represents types as 176-byte heap-allocated nodes organized by a type_kind discriminant, with supplementary structures for complex kinds (classes, functions, integers, typedefs, template parameters). Type identity in the IL is pointer-based: two types are the "same type" if and only if they resolve to the same canonical node after chasing typedef chains. This page documents the complete type node architecture, the 22 type kinds, the 130 leaf query functions, the MRU-cached type construction pipeline, and the Itanium ABI class layout engine.
Key Facts
| Property | Value |
|---|---|
| Source file | types.c (250+ functions), il_alloc.c (allocators), il.c (construction), layout.c (class layout) |
| Address range | 0x7A4940--0x7C02A0 (types.c), 0x5E2E80--0x5E45C0 (alloc), 0x5D64F0--0x5D6DB0 (il.c), 0x65EA50--0x665B50 (layout) |
| Type node size | 176 bytes (raw allocation includes 16-byte IL prefix) |
| Type kind count | 22 values (0x00--0x15) |
| Leaf query functions | 130 at 0x7A6260--0x7A9F90 (3,648 total call sites across binary) |
| Class layout entry | sub_662670 (do_class_layout), 2,548 lines |
| Type allocator | sub_5E3D40 (alloc_type), 176-byte bump allocation |
| Kind dispatch | sub_5E2E80 (set_type_kind), 22-way switch |
| Qualified type cache | sub_5D64F0 (f_make_qualified_type), MRU linked list at type +112 |
| Type comparison | sub_7AA150 (types_are_identical), 636 lines |
| Top query by callers | is_class_or_struct_or_union_type at 0x7A8A30 (407 call sites) |
| Type counter global | qword_126F8E0 (incremented on every alloc_type) |
| Void type singleton | qword_126E5E0 |
Type Node Layout (176 Bytes)
Every type in the IL is a 176-byte node allocated by alloc_type (sub_5E3D40). The allocator prepends a 16-byte IL prefix (8-byte TU-copy address + 8-byte next pointer), so the pointer returned to callers points at offset +16 of the raw allocation. All offsets below are relative to the returned pointer.
| Offset | Size | Field | Description |
|---|---|---|---|
+0 | 96 | common header | Copied from xmmword_126F6A0..126F6F0 at allocation time |
+0 | 8 | source_corresp | Source position info |
+8 | 1 | prefix_flags | IL entry prefix: bit 0 = allocated, bit 1 = file-scope, bit 3 = language |
+112 | 8 | qualified_chain | Head of MRU linked list of cv-qualified variants |
+120 | 4 | size_info | Type size in target units (for constexpr value computation) |
+128 | 4 | alignment | Type alignment |
+132 | 1 | type_kind | Discriminant byte: 0--21 (22 values) |
+133 | 1 | type_flags_1 | Bit 5 = is_dependent |
+134 | 1 | elaboration_flags | Low 2 bits = elaboration specifier kind |
+136 | 1 | type_flags_3 | Bit 2 = bitfield flag, bit 5 = unqualified strip flag |
+144 | 8 | referenced_type | Points to base/element/return type (kind-dependent). For pointers: pointed-to type. For arrays: element type. For typedefs: underlying type |
+145 | 1 | integer_subkind | (overlaps +144 byte 1; valid when kind==2) Bit 3 = scoped enum, bit 4 = bit-int capable |
+146 | 1 | integer_flags | (overlaps +144 byte 2; valid when kind==2) Bit 2 = _BitInt |
+152 | 8 | supplement_ptr | Pointer to kind-specific supplement, or member-pointer class type (kind==6 with member bit, kind==13) |
+153 | 1 | array_flags | (overlaps +152 byte 1; valid when kind==8) Bit 0 = dependent, bit 1 = VLA, bit 5 = star-modified |
+160 | 8 | secondary_data | Array bound (kind==8) / attribute info (kind==12) / enum underlying type (kind==2) |
+161 | 1 | qualifier_or_class_flags | Typeref: cv-qualifier bits (kind==12). Class: bit 0 = local, bit 4 = template, bit 5 = anonymous (kind==9/10/11) |
+163 | 1 | class_flags_2 | (valid when kind==9/10/11) Bit 0 = empty class |
+164 | 1 | feature_usage | Copied to byte_12C7AFC by record_type_features_used |
Note: Fields at offsets +144--+164 form a union-like region. Different type kinds interpret these bytes differently. The overlap is intentional -- a pointer type uses +152 for the class pointer while an array type uses +153 for VLA flags, and so on.
The type_kind byte at +132 is the single most frequently read field in the entire binary. Every type query function begins by checking it, and the canonical typedef-chase pattern reads it in a tight loop.
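The layout table can be restated as a C struct. The following is a minimal sketch, assuming an LP64 target (8-byte pointers). The struct name type_node, the pad_* and u144/u152/u160 member names, and the three helper functions are hypothetical; bytes the table does not describe are left as opaque padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the 176-byte type node. Only fields listed in the
 * table are named; everything else is opaque padding. */
typedef struct type_node {
    uint8_t  common_header[96];         /* +0   copied from a template at allocation */
    uint8_t  undocumented_96[16];       /* +96  not described on this page */
    struct type_node *qualified_chain;  /* +112 MRU list of cv-qualified variants */
    uint32_t size_info;                 /* +120 */
    uint8_t  pad_124[4];
    uint32_t alignment;                 /* +128 */
    uint8_t  type_kind;                 /* +132 discriminant, 0..21 */
    uint8_t  type_flags_1;              /* +133 bit 5 = is_dependent */
    uint8_t  elaboration_flags;         /* +134 */
    uint8_t  pad_135;
    uint8_t  type_flags_3;              /* +136 */
    uint8_t  pad_137[7];
    /* +144..+167: union-like region, also addressable byte by byte */
    union { struct type_node *referenced_type; uint8_t bytes[8]; } u144;
    union { void *supplement_ptr; uint8_t bytes[8]; } u152;
    union { uint64_t secondary_data; uint8_t bytes[8]; } u160;
    uint8_t  pad_168[8];
} type_node;

/* Bit 5 of the byte at +133. */
int is_dependent(const type_node *t) {
    return (t->type_flags_1 >> 5) & 1;
}

/* kind == 2 and bit 3 of the byte at +145 (byte 1 of the +144 slot). */
int is_scoped_enum(const type_node *t) {
    return t->type_kind == 2 && ((t->u144.bytes[1] >> 3) & 1);
}

/* kind == 8 and bit 1 of the byte at +153 (byte 1 of the +152 slot). */
int array_is_vla(const type_node *t) {
    return t->type_kind == 8 && ((t->u152.bytes[1] >> 1) & 1);
}
```

Modeling the +144, +152 and +160 slots as unions with a byte view keeps the overlapping single-byte flags (+145, +146, +153, +161, +163, +164) reachable without inventing separate fields for them.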
Type Kind Enumeration (22 Values)
EDG uses 22 type kind values (tk_*), each with optional supplementary allocations for kind-specific metadata.
| Value | Name | Supplement | Supplement Size | Description |
|---|---|---|---|---|
| 0 | tk_none | -- | -- | Sentinel / uninitialized |
| 1 | tk_void | -- | -- | void type |
| 2 | tk_integer | integer_type_supplement | 32 B | All integer types: bool, char, short, int, long, long long, __int128, _BitInt(N), and enumerations. Subkind at +145 discriminates |
| 3 | tk_float | -- | -- | float (format byte at +144 = 2) |
| 4 | tk_double | -- | -- | double |
| 5 | tk_long_double | -- | -- | long double, __float128, _Float16, __bf16 |
| 6 | tk_pointer | -- | -- | Pointer to T. Bit 0 of +152 distinguishes member pointers from object pointers |
| 7 | tk_routine | routine_type_supplement | 64 B | Function type. Supplement holds parameter list, calling convention, this-class pointer, exception specification |
| 8 | tk_array | -- | -- | Array of T. Bound at +160, element type at +144 |
| 9 | tk_struct | class_type_supplement | 208 B | struct type |
| 10 | tk_class | class_type_supplement | 208 B | class type |
| 11 | tk_union | class_type_supplement | 208 B | union type |
| 12 | tk_typeref | typeref_type_supplement | 56 B | Typedef / elaborated type. References the underlying type at +144. This is the "chase me" kind |
| 13 | tk_pointer_to_member | -- | -- | Pointer-to-member. Member type at +144, class type at +152 |
| 14 | tk_template_param | templ_param_supplement | 40 B | Unresolved template type parameter |
| 15 | tk_typeof | -- | -- | typeof / __typeof__ expression type |
| 16 | tk_decltype | -- | -- | decltype(expr) type |
| 17 | tk_pack_expansion | -- | -- | Parameter pack expansion |
| 18 | tk_auto | -- | -- | auto / decltype(auto) placeholder |
| 19 | tk_rvalue_reference | -- | -- | Rvalue reference T&& |
| 20 | tk_nullptr_t | -- | -- | std::nullptr_t |
| 21 | tk_reserved | -- | -- | Reserved / unused (handled as no-op in set_type_kind) |
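The supplement column of the table can be condensed into a lookup. This is an illustrative sketch only: the enum constant spellings and the function name supplement_size are hypothetical, while the kind values and byte sizes are taken from the table.

```c
#include <assert.h>
#include <stddef.h>

/* Kind values copied from the table; only kinds with a supplement are named. */
enum type_kind {
    TK_INTEGER = 2, TK_ROUTINE = 7, TK_STRUCT = 9, TK_CLASS = 10,
    TK_UNION = 11, TK_TYPEREF = 12, TK_TEMPLATE_PARAM = 14
};

/* Returns the supplement size in bytes for a kind, or 0 if it has none. */
size_t supplement_size(int kind) {
    switch (kind) {
    case TK_INTEGER:        return 32;
    case TK_ROUTINE:        return 64;
    case TK_STRUCT:
    case TK_CLASS:
    case TK_UNION:          return 208;
    case TK_TYPEREF:        return 56;
    case TK_TEMPLATE_PARAM: return 40;
    default:                return 0;   /* all other kinds carry no supplement */
    }
}
```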
The Integer Type Supplement (32 Bytes)
Integer types (kind 2) carry a 32-byte supplement allocated by set_type_kind and tracked by qword_126F8E8. This supplement discriminates the enormous variety of C++ integer types -- bool, char, signed char, unsigned char, wchar_t, char8_t, char16_t, char32_t, short, int, long, long long, __int128, _BitInt(N), and all scoped/unscoped enumerations.
The integer subkind value (at byte +145 of the parent type node) encodes:
| Value | Type Category |
|---|---|
| 1--10 | Standard integer types (bool through unsigned long long) |
| 11 | _BitInt / extended integer |
| 12 | __int128 / extended |
Signedness is determined by a lookup table at byte_E6D1B0, indexed by the integer subkind value.
The Routine Type Supplement (64 Bytes)
Function types (kind 7) carry a 64-byte supplement tracked by qword_126F958. Key fields:
| Offset (in supplement) | Size | Field |
|---|---|---|
+0 | 8 | Parameter type list head |
+8 | 8 | Exception specification |
+16 | 4 | Calling convention / noexcept flags |
+32 | 16 | Bitfield struct (ABI attributes, variadic flag) |
+40 | 8 | this-class pointer (for member functions) |
The Class Type Supplement (208 Bytes)
Class/struct/union types (kinds 9/10/11) carry a 208-byte supplement tracked by qword_126F948. This is the largest supplement and contains the full class metadata:
| Offset (in supplement) | Size | Field |
|---|---|---|
+0 | 8 | Scope pointer (member declarations) |
+8 | 8 | Base class list head |
+16 | 8 | Virtual function table pointer |
+40 | 4 | Initialized to 1 by init_class_type_supplement_fields |
+86 | 1 | Bit 0 = has virtual bases, bit 3 = has user conversion |
+88 | 1 | Bit 5 = has flexible array / VLA member |
+100 | 4 | Class kind (9=struct, 10=class, 11=union) |
+128 | 8 | Scope block pointer |
+176 | 4 | Initialized to -1 (sentinel) |
The Typeref Supplement (56 Bytes)
Typedef types (kind 12) carry a 56-byte supplement tracked by qword_126F8F0. A typeref wraps another type, creating the alias chain that all query functions must chase. The supplement holds the typedef declaration entity, elaborated type specifier information, and attribute data.
The Typedef Chase Pattern
The most pervasive code pattern in the entire binary is the typedef chase loop. Because C++ types may be wrapped in arbitrarily many typedef layers (typedef int myint; typedef myint myint2;), every function that inspects a type property must first resolve through all typedef indirections to reach the underlying canonical type.
The canonical pattern appears in every one of the 130 leaf query functions:
// Canonical typedef chase -- appears 130+ times in types.c
type_t *skip_typedefs(type_t *type) {
    while (type->type_kind == 12)       // 12 = tk_typeref
        type = type->referenced_type;   // offset +144
    return type;
}
bool is_class_or_struct_or_union_type(type_t *type) {
    type = skip_typedefs(type);
    int kind = type->type_kind;         // offset +132
    return kind == 9 || kind == 10 || kind == 11;
}
In x86-64 machine code, this compiles to a four-instruction loop:
.loop:
    cmp  byte [rdi+132], 12    ; type->type_kind == tk_typeref?
    jne  .done
    mov  rdi, [rdi+144]        ; type = type->referenced_type
    jmp  .loop
.done:
Why 130 Separate Functions?
A natural question: why does EDG have 130 individual query functions instead of a single get_type_kind() accessor? The answer is the EDG compilation model. Each function in types.c is a public API entry point that other source files (parse.c, lower.c, templates.c, etc.) can call without including the full type-system header. This provides:
- Encapsulation. Callers never see the type_kind enum values or internal layout. They call is_class_or_struct_or_union_type() instead of checking kind == 9 || kind == 10 || kind == 11.
- Binary stability. If EDG adds a new type kind or renumbers existing ones, only types.c needs recompilation. All callers are insulated.
- Fast-path optimization. Each leaf function is tiny (10--30 bytes of machine code), fits in a single cache line, and branches on at most 2--3 constants. The branch predictor handles these trivially.
- Semantic naming. is_arithmetic_type() is self-documenting where kind >= 2 && kind <= 5 is not. This matters in a 2.5M-line codebase.
Query Function Catalog (Top 30 by Caller Count)
| Address | Callers | Function | Returns |
|---|---|---|---|
0x7A8A30 | 407 | is_class_or_struct_or_union_type | kind in {9,10,11} |
0x7A9910 | 389 | type_pointed_to | ptr->referenced_type (kind==6) |
0x7A9E70 | 319 | get_cv_qualifiers | Accumulated cv-qualifier bits (& 0x7F) |
0x7A6B60 | 299 | is_dependent_type | Bit 5 of byte +133 |
0x7A7630 | 243 | is_object_pointer_type | kind==6 and not member pointer |
0x7A8370 | 221 | is_array_type | kind==8 |
0x7A7B30 | 199 | is_member_pointer_or_ref | kind==6 with member bit |
0x7A6AC0 | 185 | is_reference_type | kind==7 |
0x7A8DC0 | 169 | is_function_type | kind==14 |
0x7A6E90 | 140 | is_void_type | kind==1 |
0x7A7C40 | 132 | is_trivially_copy_constructible | Recursive triviality check |
0x7A9350 | 126 | array_element_type (deep) | Strips arrays+typedefs to element |
0x7A7010 | 85 | is_enum_type | kind==2 with scoped check |
0x7A71B0 | 82 | is_integer_type | kind==2 |
0x7A8020 | 77 | type_size_and_alignment | Computes sizeof/alignof |
0x7A7810 | 77 | is_member_pointer_flag | kind==6, bit 0 of +152 |
0x7A8270 | 77 | get_mangled_type_encoding | Type encoding for name mangling |
0x7A8D90 | 76 | is_pointer_to_member_type | kind==13 |
0x7A73F0 | 70 | is_long_double_type | kind==5 |
0x7A7950 | 68 | is_member_ptr_with_both_bits | kind==6, bits 0 and 1 of +152 |
0x7A70F0 | 62 | is_scoped_enum_type | kind==2, bit 3 of +145 |
0x7A6EF0 | 56 | is_rvalue_reference_type | kind==19 (rvalue reference T&&) |
0x7A9310 | 51 | array_element_type (shallow) | One-level array to element |
0x7A6B90 | 46 | is_simple_function_type | kind==8, specific flag pattern |
0x7A7220 | 43 | is_bit_int_type | kind==2, bit 2 of +146 |
0x7A7300 | 42 | is_floating_point_type | kind in {3,4,5} |
0x7A7750 | 40 | is_non_member_ptr_type | kind==6, no member bit |
0x7A6EC0 | 39 | is_nullptr_t_type | kind==20 |
0x7A99D0 | 37 | pm_member_type | kind==13, extracts member type at +144 |
0x7A8F10 | 34 | is_unresolved_function_type | kind==14, constraint check |
Total: 128 unique query functions, 4,448 call sites, average 34.75 callers per function.
Typedef Stripping Variants
Six specialized typedef-stripping functions exist, each stopping at a different boundary:
| Address | Function | Behavior |
|---|---|---|
0x7A68F0 | skip_typedefs | Strips all typedef layers, preserves cv-qualifiers |
0x7A6930 | skip_named_typedefs | Strips only typedefs that have a name |
0x7A6970 | skip_to_attributed_typedef | Stops at typedef with attribute flag set |
0x7A69C0 | skip_typedefs_and_attributes | Strips both typedef and attribute layers |
0x7A6A10 | skip_to_elaborated_typedef | Stops at typedef with elaborated-type-specifier flag |
0x7A6A70 | skip_non_attributed_typedefs | Stops at typedef with any attribute bits |
These variants exist because C++ semantics sometimes care about intermediate typedef layers. For example, [[deprecated]] typedef int bad_int; attaches the attribute to the typedef itself, not to int. A function checking for deprecation must stop at the attributed typedef layer rather than chasing through to int.
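The difference between a full chase and a boundary-aware chase can be shown on a toy node type. This is a sketch under stated assumptions: mini_type, its has_attribute field (a stand-in for whatever flag the real typeref node carries), and both function names are hypothetical and do not reflect the decompiled code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { TK_INTEGER = 2, TK_TYPEREF = 12 };

typedef struct mini_type {
    unsigned char kind;            /* type kind discriminant */
    bool has_attribute;            /* hypothetical: typedef carries an attribute */
    struct mini_type *referenced;  /* underlying type for typeref nodes */
} mini_type;

/* Strip every typedef layer. */
const mini_type *strip_all(const mini_type *t) {
    while (t->kind == TK_TYPEREF)
        t = t->referenced;
    return t;
}

/* Strip typedef layers, but stop at the first one carrying an attribute. */
const mini_type *strip_to_attributed(const mini_type *t) {
    while (t->kind == TK_TYPEREF && !t->has_attribute)
        t = t->referenced;
    return t;
}
```

With an unattributed alias wrapping an attributed typedef of int, strip_all lands on the int node while strip_to_attributed stops one layer earlier, which is the behavior a deprecation check needs.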
Duplicate Query Functions
Several functions are exact binary duplicates with distinct addresses:
- 0x7A7630 = 0x7A7670 = 0x7A7750 (is_non_member_pointer / is_object_pointer_type)
- 0x7A7B00 = 0x7A7B70 (is_pointer_type)
- 0x7A78D0 = 0x7A7910 (is_non_const_ref)
These duplicates exist because EDG uses distinct function names for semantic clarity even when the implementation is identical. The function-level linker does not merge them because they have distinct symbols with different ABI meanings: callers of is_object_pointer_type() and is_non_member_pointer_type() conceptually ask different questions even though the current answer is the same. If a future C++ revision changed pointer semantics, only one function would need updating.
Type Allocation
Type nodes are allocated by alloc_type (sub_5E3D40), which follows the standard IL allocation pattern used by all node allocators in il_alloc.c:
type_t *alloc_type(int type_kind) {
    // 1. Optional debug trace
    if (dword_126EFC8)
        trace_enter("alloc_type");
    // 2. Bump-allocate 176 bytes from the current region
    void *raw = region_alloc(dword_126EC90, 176);
    // 3. Set up IL prefix (16 bytes before the returned pointer)
    //    raw[0..7]  = TU-copy address (0 if not in copy mode)
    //    raw[8..15] = next pointer (0)
    if (!dword_106BA08) {
        ++qword_126F7C0;            // orphan prefix count
        *(raw + 0) = 0;             // TU-copy addr
    }
    ++qword_126F750;                // IL entry count
    *(raw + 8) = 0;                 // next pointer
    // 4. Set prefix flags byte
    byte flags = 1;                 // bit 0 = allocated
    if (!dword_106BA08)
        flags |= 2;                 // bit 1 = file-scope
    if (dword_126E5FC & 1)
        flags |= 8;                 // bit 3 = C++ mode
    *(raw + 8) = flags;
    // 5. Increment type counter
    ++qword_126F8E0;
    // 6. Copy 96-byte common IL header
    type_t *result = raw + 16;
    memcpy(result, &xmmword_126F6A0, 96);
    // 7. Dispatch to set_type_kind
    set_type_kind(result, type_kind);
    // 8. Optional debug trace
    if (dword_126EFC8)
        trace_leave();
    return result;
}
set_type_kind: The 22-Way Dispatch
set_type_kind (sub_5E2E80) writes the kind byte and allocates any required supplement structure:
void set_type_kind(type_t *type, int kind) {
    type->type_kind = kind;                     // byte at +132
    switch (kind) {
    case 0: case 1:                             // tk_none, tk_void
    case 17: case 18:                           // tk_pack_expansion, tk_auto
    case 19: case 20: case 21:                  // tk_rvalue_reference, tk_nullptr_t, tk_reserved
        break;                                  // no supplement needed
    case 2:                                     // tk_integer
        type->referenced_type = 5;              // default integer subkind
        type->supplement_ptr = alloc_permanent(32);
        ++qword_126F8E8;                        // integer supplement counter
        // Store source position at supplement+16
        break;
    case 3: case 4: case 5:                     // tk_float, tk_double, tk_long_double
        type->referenced_type = 2;              // format byte
        break;
    case 6:                                     // tk_pointer
        type->supplement_ptr = 0;               // zero class-pointer field
        type->secondary_data = 0;
        break;
    case 7:                                     // tk_routine (function type)
        type->supplement_ptr = alloc_permanent(64);
        ++qword_126F958;                        // routine supplement counter
        // Initialize calling convention bitfield at supplement+32
        break;
    case 8:                                     // tk_array
        // Zero size and flags fields
        break;
    case 9: case 10: case 11:                   // tk_struct, tk_class, tk_union
        type->supplement_ptr = alloc_permanent(208);
        ++qword_126F948;                        // class supplement counter
        init_class_type_supplement_fields(type->supplement_ptr);
        type->supplement_ptr->class_kind = kind;    // at supplement+100
        break;
    case 12:                                    // tk_typeref
        type->supplement_ptr = alloc_permanent(56);
        ++qword_126F8F0;                        // typeref supplement counter
        break;
    case 13:                                    // tk_pointer_to_member
        // Zero member/class fields
        break;
    case 14:                                    // tk_template_param
        type->supplement_ptr = alloc_permanent(40);
        ++qword_126F8F8;                        // template param supplement counter
        break;
    case 15: case 16:                           // tk_typeof, tk_decltype
        // Zero expression pointer fields
        break;
    default:
        internal_error("set_type_kind: bad type kind");
    }
}
Qualified Type Construction: The MRU Cache
When the compiler needs a const int given an int, it calls f_make_qualified_type (sub_5D64F0). This function is called extremely frequently -- every variable declaration, function parameter, and expression type computation may need cv-qualified variants. EDG optimizes this with a move-to-front (MRU) linked list cache on each type node.
type_t *f_make_qualified_type(type_t *base_type, int qualifiers) {
    // qualifiers bitmask: bit 0 = const, bit 1 = volatile,
    //                     bit 2 = restrict, bits 3-6 = address space
    // 1. Array special case: cv-qualify the element type, not the array
    if (base_type->type_kind == 8) {            // array
        type_t *elem = base_type->referenced_type;
        type_t *qual_elem = f_make_qualified_type(elem, qualifiers);
        return rebuild_array_type(base_type, qual_elem);
    }
    // 2. Strip existing qualifiers that already match
    int existing = get_cv_qualifiers(base_type) & 0x7F;
    int needed = qualifiers & ~existing;
    if (needed == 0)
        return base_type;                       // already qualified as requested
    // 3. Search the MRU cache at base_type->qualified_chain (+112)
    type_t *prev = NULL;
    type_t *cur = base_type->qualified_chain;
    while (cur) {
        if (cur->type_kind == 12 &&             // must be typeref
            (cur->class_flags_1 & 0x7F) == qualifiers) {
            // Cache hit -- move to front if not already there
            if (prev) {
                prev->next = cur->next;
                cur->next = base_type->qualified_chain;
                base_type->qualified_chain = cur;
            }
            return cur;
        }
        prev = cur;
        cur = cur->next;
    }
    // 4. Cache miss -- allocate new qualified type
    type_t *qual = alloc_type(12);              // tk_typeref
    qual->referenced_type = base_type;          // +144 = underlying type
    qual->class_flags_1 = qualifiers & 0x7F;    // +161 = qualifier bits
    setup_type_node(qual);                      // sub_5B3DE0
    // 5. Insert at head of cache list
    qual->next = base_type->qualified_chain;
    base_type->qualified_chain = qual;
    return qual;
}
The MRU optimization is critical because type construction is highly skewed: const T is needed far more often than volatile const restrict T. By moving the most recently matched qualified variant to the front of the chain, subsequent lookups for the same qualification find it immediately.
The same MRU pattern appears in ptr_to_member_type_full (sub_5DB220), which caches pointer-to-member types on the member type's qualification chain at +112.
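The move-to-front behavior is easy to check in isolation. The following is a self-contained sketch of a generic MRU singly linked list, not a transcription of the decompiled function; the variant struct and the names mru_lookup and mru_insert are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct variant {
    int qualifiers;          /* cv-qualifier bits identifying this variant */
    struct variant *next;    /* next entry in the per-type cache chain */
} variant;

/* Look up `qualifiers` in the list at *head. On a hit, move the node to the
 * front and return it. Returns NULL on a miss. */
variant *mru_lookup(variant **head, int qualifiers) {
    variant *prev = NULL;
    for (variant *cur = *head; cur != NULL; prev = cur, cur = cur->next) {
        if (cur->qualifiers != qualifiers)
            continue;
        if (prev != NULL) {              /* not already at the front */
            prev->next = cur->next;      /* unlink */
            cur->next = *head;           /* relink at head */
            *head = cur;
        }
        return cur;
    }
    return NULL;
}

/* Insert a freshly created variant at the head (the cache-miss path). */
void mru_insert(variant **head, variant *v) {
    v->next = *head;
    *head = v;
}
```

A repeated lookup for the same qualifier set then terminates on the first node, which is the payoff when one variant (typically const T) dominates the requests.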
CV-Qualifier Bitmask
| Bit | Mask | Qualifier |
|---|---|---|
| 0 | 0x01 | const |
| 1 | 0x02 | volatile |
| 2 | 0x04 | __restrict |
| 3--6 | 0x78 | Address space qualifier (CUDA/OpenCL) |
The 7-bit mask (& 0x7F) at offset +161 of a typeref node encodes the full cv-qualification. get_cv_qualifiers (sub_7A9E70, 319 callers) accumulates these bits by chasing the typedef chain:
int get_cv_qualifiers(type_t *type) {
    int quals = 0;
    while (type->type_kind == 12) {             // chase typedefs
        quals |= type->class_flags_1 & 0x7F;
        type = type->referenced_type;
    }
    return quals;
}
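A runnable version of the accumulation, with the bitmask spelled out as constants, might look like the sketch below. The qual_type struct, the Q_* constant names, and address_space_of are hypothetical; in particular, treating bits 3--6 as a single 4-bit number is an assumption, since this page does not document how the address-space field is encoded.

```c
#include <assert.h>
#include <stddef.h>

enum {
    Q_CONST = 0x01, Q_VOLATILE = 0x02, Q_RESTRICT = 0x04,
    Q_ADDR_SPACE_MASK = 0x78, Q_ADDR_SPACE_SHIFT = 3, Q_ALL = 0x7F
};
enum { TK_INTEGER = 2, TK_TYPEREF = 12 };

typedef struct qual_type {
    unsigned char kind;
    unsigned char qual_bits;          /* meaningful on typeref nodes */
    struct qual_type *referenced;
} qual_type;

/* OR together the qualifier bits of every typeref layer in the chain. */
int accumulate_qualifiers(const qual_type *t) {
    int quals = 0;
    while (t->kind == TK_TYPEREF) {
        quals |= t->qual_bits & Q_ALL;
        t = t->referenced;
    }
    return quals;
}

/* Assumption: bits 3..6 are read as one 4-bit address-space value. */
int address_space_of(int quals) {
    return (quals & Q_ADDR_SPACE_MASK) >> Q_ADDR_SPACE_SHIFT;
}
```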
Type Comparison
sub_7AA150 (types_are_identical, 636 lines) is the main structural type comparison function. It handles all 22 type kinds with recursive descent into component types. The algorithm:
- Chase typedefs on both operands to reach canonical types.
- If pointer-equal after chasing, return true (the common fast path).
- If kinds differ, return false.
- Dispatch on kind:
- Integer (kind 2): Compare integer subkind values.
- Pointer (kind 6): Recursively compare pointed-to types.
- Array (kind 8): Compare bounds and recursively compare element types.
- Function (kind 7): Compare return type, then parameter-by-parameter.
- Class (kind 9/10/11): Pointer equality only (nominal typing).
- Template param (kind 14): Compare parameter index and depth.
- Pointer-to-member (kind 13): Compare both class and member types.
The comparison is structural for most types but nominal for classes. Two distinct struct Foo definitions in different scopes are different types even if they have identical members.
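The structural-versus-nominal split can be illustrated with a reduced comparison over four kinds. This is a simplified sketch, not the 636-line original: cmp_type, chase, and same_type are hypothetical names, and function, template-parameter, and pointer-to-member handling is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { TK_INTEGER = 2, TK_POINTER = 6, TK_ARRAY = 8, TK_STRUCT = 9, TK_TYPEREF = 12 };

typedef struct cmp_type {
    unsigned char kind;
    unsigned char int_subkind;           /* meaningful for TK_INTEGER */
    unsigned long bound;                 /* meaningful for TK_ARRAY */
    const struct cmp_type *referenced;   /* pointee / element / aliased type */
} cmp_type;

static const cmp_type *chase(const cmp_type *t) {
    while (t->kind == TK_TYPEREF)
        t = t->referenced;
    return t;
}

bool same_type(const cmp_type *a, const cmp_type *b) {
    a = chase(a);
    b = chase(b);
    if (a == b)
        return true;                     /* pointer-equal fast path */
    if (a->kind != b->kind)
        return false;
    switch (a->kind) {
    case TK_INTEGER:
        return a->int_subkind == b->int_subkind;
    case TK_POINTER:
        return same_type(a->referenced, b->referenced);
    case TK_ARRAY:
        return a->bound == b->bound && same_type(a->referenced, b->referenced);
    case TK_STRUCT:
        return false;                    /* nominal: distinct nodes never match */
    default:
        return false;                    /* kinds not modeled in this sketch */
    }
}
```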
Cross-TU Type Correspondence
For relocatable device code (RDC) compilation, cudafe++ must match types across translation units. sub_7B2260 (types_are_equivalent_for_correspondence, 688 lines) performs a deep structural comparison that tolerates certain cross-TU differences (different typedef layers, different source positions) while requiring identical essential structure.
Type Construction Functions
Beyond f_make_qualified_type, several other type construction functions exist. Some reuse the per-type MRU chain, one uses a dedicated table, and the rest allocate a fresh node on every call:
| Address | Function | Creates | Cache Location |
|---|---|---|---|
0x5D64F0 | f_make_qualified_type | const T, volatile T, etc. | Type +112 chain |
0x5D6770 | make_vector_type | __attribute__((vector_size(N))) T | Allocated fresh |
0x5D68E0 | character_type | char[N] string literal types | Hash table at qword_126F2F8 (81-slot per kind) |
0x5DB220 | ptr_to_member_type_full | T Class::* | Member type +112 chain (MRU) |
0x7AB9B0 | construct_function_type | R(Args...) | Allocated fresh (423 lines) |
0x7A6320 | make_cv_combined_type | Combines cv-quals from two types | Allocated fresh |
Character Type Cache
String literal types (char[5], wchar_t[12], etc.) are extremely common in C++ programs. character_type (sub_5D68E0) uses a hash-table cache at qword_126F2F8 with 81 slots per character kind (5 kinds: char, wchar_t, char8_t, char16_t, char32_t), covering array sizes 0 through 80. For sizes exceeding 80, no caching is performed and a fresh array type is allocated every time.
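A size-bounded cache of this shape can be sketched as follows. The direct [kind][length] indexing is an assumption drawn from the "81 slots per kind, sizes 0 through 80" description; how the real table is keyed is not documented here. The array_type struct and both function names are hypothetical, and the sketch never frees cached entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { CHAR_KINDS = 5, CACHED_SIZES = 81 };   /* lengths 0..80 are cached */

typedef struct array_type {
    int    char_kind;   /* 0..4: char, wchar_t, char8_t, char16_t, char32_t */
    size_t length;
} array_type;

static array_type *cache[CHAR_KINDS][CACHED_SIZES];

/* Allocate a fresh array type node; returns NULL if malloc fails. */
array_type *new_array_type(int char_kind, size_t length) {
    array_type *t = malloc(sizeof *t);
    if (t != NULL) {
        t->char_kind = char_kind;
        t->length = length;
    }
    return t;
}

/* Return the shared node for small lengths, a fresh node otherwise. */
array_type *get_character_array_type(int char_kind, size_t length) {
    if (length >= CACHED_SIZES)
        return new_array_type(char_kind, length);   /* too large: never cached */
    array_type **slot = &cache[char_kind][length];
    if (*slot == NULL)
        *slot = new_array_type(char_kind, length);
    return *slot;
}
```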
Class Layout: do_class_layout
sub_662670 (do_class_layout, 2,548 lines) is the most complex function in the type system. It implements the Itanium C++ ABI class layout algorithm with GNU extensions, MSVC compatibility mode, and CUDA-specific adjustments. It is called exactly once per class definition from sub_442680 (class definition processing).
What do_class_layout Computes
For each class/struct/union, the function determines:
- sizeof: Total class size including padding.
- alignof: Required alignment, incorporating alignas, __attribute__((aligned)), and #pragma pack.
- Member offsets: Byte offset of each non-static data member.
- Base class offsets: Byte offset of each non-virtual base class subobject.
- Virtual base offsets: Byte offset of each virtual base class subobject (stored in the vtable).
- Vtable pointer placement: Where _vptr is placed (offset 0 for primary base, elsewhere for secondary).
- Empty base optimization (EBO): Whether empty base classes can share address with data members.
- Bit-field packing: How bit-fields are packed into allocation units.
- Tail padding reuse: Whether derived classes can place members in base class tail padding (non-POD only).
Pseudocode: Itanium ABI Layout
void do_class_layout(type_t *class_type) {
    class_info_t *info = class_type->supplement_ptr;
    int sizeof_val = 0;
    int alignof_val = 1;
    int dsize = 0;                  // data size (excludes tail padding)
    // PHASE 1: Lay out non-virtual base classes
    for (base_t *base = info->base_list; base; base = base->next) {
        if (base->is_virtual)
            continue;               // defer virtual bases
        int base_size = base->type->size_info;
        int base_align = base->type->alignment;
        // Empty base optimization
        if (is_empty_class(base->type)) {
            int offset = 0;
            while (empty_base_conflict(class_type, base->type, offset))
                offset += base_align;
            set_base_class_offset(base, offset);
            // sizeof may not increase for empty bases
        } else {
            // Align dsize up to base alignment
            dsize = ALIGN_UP(dsize, base_align);
            set_base_class_offset(base, dsize);
            dsize += base_size;
        }
        alignof_val = MAX(alignof_val, base_align);
    }
    // PHASE 2: Place vptr if needed
    if (class_has_virtual_functions(class_type) &&
        !has_primary_base_with_vptr(class_type)) {
        // vptr at current offset (usually 0)
        dsize = ALIGN_UP(dsize, POINTER_ALIGN);
        dsize += POINTER_SIZE;
        alignof_val = MAX(alignof_val, POINTER_ALIGN);
    }
    // PHASE 3: Lay out non-static data members
    for (field_t *field = info->first_field; field; field = field->next) {
        int field_align = alignment_of_field_full(field);
        int field_size = field->type->size_info;
        if (field->is_bitfield) {
            align_offsets_for_bit_field(field, &dsize, &alignof_val);
            continue;
        }
        dsize = ALIGN_UP(dsize, field_align);
        // Warn if field lands in tail padding of a base class
        warn_if_offset_in_tail_padding(class_type, dsize, field);
        field->offset = dsize;
        dsize += field_size;
        alignof_val = MAX(alignof_val, field_align);
    }
    // PHASE 4: Lay out virtual base classes
    for (base_t *base = info->base_list; base; base = base->next) {
        if (!base->is_virtual)
            continue;
        int base_align = base->type->alignment;
        if (is_empty_class(base->type)) {
            int offset = sizeof_val;
            while (subobject_conflict(class_type, base->type, offset))
                offset += base_align;
            set_virtual_base_class_offset(base, offset);
        } else {
            sizeof_val = ALIGN_UP(sizeof_val > dsize ? sizeof_val : dsize,
                                  base_align);
            set_virtual_base_class_offset(base, sizeof_val);
            sizeof_val += base->type->size_info;
        }
    }
    // PHASE 5: Finalize
    sizeof_val = MAX(sizeof_val, dsize);
    sizeof_val = ALIGN_UP(sizeof_val, alignof_val);
    if (sizeof_val == 0)
        sizeof_val = 1;             // C++ requires sizeof >= 1
    compute_empty_class_bit(class_type);
    trailing_base_does_not_affect_gnu_size(class_type);
    check_explicit_alignment(class_type);
    class_type->size_info = sizeof_val;
    class_type->alignment = alignof_val;
    // Debug: dump_layout() if debug flag set
    if (dword_126EFC8)
        dump_layout(class_type);
}
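The core arithmetic of phases 3 and 5 (align, place, advance, then round the total up) can be checked against the host compiler. This is a reduced sketch that assumes plain non-bit-field members with no bases and no vptr; ALIGN_UP is defined here because the pseudocode leaves it implicit, and the field and layout structs and lay_out_fields are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of a (a must be nonzero). */
#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

typedef struct field  { size_t size, align, offset; } field;
typedef struct layout { size_t size, align; } layout;

/* Assign an offset to each field in declaration order and return the
 * resulting total size and alignment. */
layout lay_out_fields(field *fields, size_t count) {
    size_t dsize = 0, align = 1;
    for (size_t i = 0; i < count; i++) {
        dsize = ALIGN_UP(dsize, fields[i].align);   /* pad up to the field */
        fields[i].offset = dsize;
        dsize += fields[i].size;
        if (fields[i].align > align)
            align = fields[i].align;
    }
    size_t size = ALIGN_UP(dsize, align);           /* add tail padding */
    if (size == 0)
        size = 1;                                   /* empty class still has size 1 */
    layout result = { size, align };
    return result;
}
```

Feeding it the sizes and alignments of char, double, int reproduces the offsets the host compiler chooses for a struct with those members on common x86-64 ABIs.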
Key Sub-Functions
| Address | Function | Purpose |
|---|---|---|
0x65EA50 | trailing_base_does_not_affect_gnu_size | Checks if trailing empty base affects GNU-compatible size vs dsize |
0x65EE70 | empty_base_conflict | Self-recursive: detects two empty bases of same type at same address |
0x65F410 | increment_field_offsets | Advances offset counters; warns about tail-padding overlap |
0x65F9F0 | last_user_field_of | Finds last user-declared (non-compiler-generated) field |
0x65FC20 | subobject_conflict | Generalizes empty_base_conflict to all subobjects |
0x6610B0 | set_base_class_offsets | Assigns offsets to non-virtual base class subobjects |
0x6614A0 | set_virtual_base_class_offset | Assigns offsets to virtual base class subobjects |
0x6621E0 | alignment_of_field_full | Computes field alignment considering packed, aligned, pragma pack |
Empty Base Optimization
The EBO is one of the most subtle parts of C++ layout. The C++ standard requires that two distinct subobjects of the same type have different addresses. But empty base classes (no data members, no virtual functions, all bases empty) can be placed at offset 0 without consuming space -- unless another subobject of the same type already occupies that address.
empty_base_conflict (sub_65EE70, 240 lines) is self-recursive: it walks the entire base class hierarchy checking for address collisions. When a conflict is detected, the layout engine advances the offset by the base's alignment until no conflict exists.
Alignment Computation
alignment_of_field_full (sub_6621E0, 193 lines) computes the effective alignment of a data member considering all alignment modifiers in priority order:
- Natural alignment of the field's type.
- __attribute__((aligned(N))) -- increases alignment.
- __attribute__((packed)) -- reduces alignment to 1.
- #pragma pack(N) -- caps alignment at N.
- __declspec(align(N)) -- MSVC mode alignment.
The interaction between these modifiers follows complex ABI rules. For example, #pragma pack(4) on a struct with a double member reduces the double's alignment from 8 to 4, but __attribute__((aligned(16))) on the same member overrides the pack to 16.
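One way to model the interaction in the example is sketched below. It is an assumption-laden illustration, not the decompiled logic of alignment_of_field_full: the field_align_info struct and effective_alignment are hypothetical, MSVC mode is not modeled, and the ordering (packed, then pack cap, then explicit alignment) is chosen only so that it reproduces the two outcomes stated in the text.

```c
#include <assert.h>
#include <stddef.h>

typedef struct field_align_info {
    size_t natural;         /* alignment of the field's type */
    size_t explicit_align;  /* from aligned(N); 0 if absent */
    int    packed;          /* nonzero if the packed attribute applies */
    size_t pragma_pack;     /* active #pragma pack(N); 0 if none */
} field_align_info;

size_t effective_alignment(const field_align_info *f) {
    size_t a = f->natural;
    if (f->packed)
        a = 1;                                   /* packed drops to byte alignment */
    if (f->pragma_pack != 0 && a > f->pragma_pack)
        a = f->pragma_pack;                      /* pack caps the alignment */
    if (f->explicit_align > a)
        a = f->explicit_align;                   /* explicit request can raise it again */
    return a;
}
```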
Type Trait Evaluation
sub_7BDCB0 (evaluate_type_trait, 510 lines) implements the compiler built-in type traits: __is_trivially_copyable, __is_constructible, __has_unique_object_representations, __is_aggregate, __is_empty, etc. These are dispatched via a switch on trait ID and return boolean results by inspecting the class type supplement flags and calling recursive property checks.
Type Deduction
sub_7B9670 (deduce_template_argument_type, 459 lines) handles template argument deduction from function arguments to template parameters. This is separate from the template engine's substitute_in_type (sub_7BCDE0, 800 lines), which performs the reverse operation: given concrete template arguments, produce the substituted type.
Global Type Singletons
Several frequently-used types are cached as global pointers to avoid repeated allocation:
| Global | Type |
|---|---|
qword_126E5E0 | void type |
qword_126F2F0 | void type (duplicate reference) |
qword_126F1A0 | std::source_location::__impl (cached on first use) |
Statistics Tracking
Every type-related allocation increments a per-kind counter. print_trans_unit_statistics (sub_7A45A0) dumps these counters via fprintf:
| Counter | What it counts | Per-entry size |
|---|---|---|
qword_126F8E0 | Type nodes allocated | 176 B |
qword_126F8E8 | Integer type supplements | 32 B |
qword_126F958 | Routine type supplements | 64 B |
qword_126F948 | Class type supplements | 208 B |
qword_126F8F0 | Typeref supplements | 56 B |
qword_126F8F8 | Template param supplements | 40 B |
qword_126F280 | Pointer-to-member types constructed | -- |
CUDA-Specific Type Extensions
Address Space Qualifiers
CUDA's __shared__, __constant__, and __device__ memory spaces are represented as address-space qualifiers in the cv-qualifier bitmask (bits 3--6 at +161). The attribute kind values {1, 6, 11, 12} (bitmask 0x1842) are checked in compare_attribute_specifiers (sub_7A5E10) to detect incompatible address-space qualified typedefs.
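The bitmask 0x1842 is simply the set {1, 6, 11, 12} encoded one bit per attribute kind, so membership is a shift and a mask. A small sketch follows; the constant and function names are hypothetical, and the bound of 16 is an assumption that only guards the shift in this example.

```c
#include <assert.h>
#include <stdbool.h>

enum { ADDRESS_SPACE_ATTR_MASK = 0x1842 };   /* bits 1, 6, 11, 12 */

/* True if `kind` is one of the attribute kinds encoded in the mask. */
bool is_checked_attribute_kind(unsigned kind) {
    return kind < 16 && ((ADDRESS_SPACE_ATTR_MASK >> kind) & 1u);
}
```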
Feature Usage Tracking
record_type_features_used (sub_7A4F10) records GPU feature requirements based on types encountered:
- _BitInt types (integer subkind 11/12): sets bit 0 of byte_12C7AFC
- __float128 / __bf16 types: sets bit 2
- Bit-fields: sets bit 1
- Class types: copies feature bits from +164
This information feeds into architecture gating, ensuring that code using _BitInt(128) targets a GPU architecture that supports it.
Constexpr Type Size Limits
The constexpr interpreter (sub_628DE0, f_value_bytes_for_type) enforces a 64 MB limit (0x4000000 bytes) on types used in constexpr evaluation. This prevents compile-time memory exhaustion from expressions like constexpr std::array<char, 1'000'000'000> x{};.
Function Map
| Address | Lines | Function | Source |
|---|---|---|---|
0x5D64F0 | 340 | f_make_qualified_type | il.c |
0x5DB220 | 63 | ptr_to_member_type_full | il.c |
0x5E2E80 | -- | set_type_kind | il_alloc.c |
0x5E3D40 | -- | alloc_type | il_alloc.c |
0x65EA50 | 105 | trailing_base_does_not_affect_gnu_size | layout.c |
0x65EE70 | 240 | empty_base_conflict | layout.c |
0x65FC20 | 271 | subobject_conflict | layout.c |
0x6610B0 | 196 | set_base_class_offsets | layout.c |
0x6614A0 | 204 | set_virtual_base_class_offset | layout.c |
0x6621E0 | 193 | alignment_of_field_full | layout.c |
0x662670 | 2548 | do_class_layout | layout.c |
0x7A4B40 | -- | ttt_is_type_with_no_name_linkage | types.c |
0x7A4F10 | -- | record_type_features_used | types.c |
0x7A5E10 | -- | compare_attribute_specifiers | types.c |
0x7A6260 | -- | type_has_flexible_array_or_vla | types.c |
0x7A6320 | -- | make_cv_combined_type | types.c |
0x7A68F0--0x7A9F90 | -- | 130 leaf query functions | types.c |
0x7AA150 | 636 | types_are_identical | types.c |
0x7AB9B0 | 423 | construct_function_type | types.c |
0x7AE680 | 541 | adjust_type_for_templates | types.c |
0x7B2260 | 688 | types_are_equivalent_for_correspondence | types.c |
0x7B3400 | 905 | standard_conversion_sequence | types.c |
0x7B5210 | 441 | require_complete_type | types.c |
0x7B6350 | 1107 | compute_type_layout | types.c |
0x7B7750 | 784 | compute_class_properties | types.c |
0x7B9670 | 459 | deduce_template_argument_type | types.c |
0x7BDCB0 | 510 | evaluate_type_trait | types.c |
0x7BF630 | 348 | format_type_for_diagnostic | types.c |
0x7C02A0 | -- | compatible_ms_bit_field_container_types | types.c |
Diagnostic System Overview
The cudafe++ diagnostic system is a 7-stage pipeline rooted in EDG 6.6's error.c. It manages 3,795 error message templates, 9 severity levels, per-error suppression tracking, #pragma diagnostic overrides, and two output formats (text and SARIF JSON). The most-connected function in the entire binary -- sub_4F2930 (assertion handler) with 5,185 call sites -- feeds into this system, making error handling the single largest cross-cutting concern in cudafe++.
Error Table
The error message template table lives at off_88FAA0: an array of 3,795 const char* pointers indexed by error code (0--3794).
| Range | Count | Origin | Display Format |
|---|---|---|---|
| 0--3456 | 3,457 | Standard EDG 6.6 | #N-D |
| 3457--3794 | 338 | NVIDIA CUDA extensions | #(N+16543)-D (20000--20337-D series) |
The renumbering logic in construct_text_message (sub_4EF9D0):
int display_code = error_code;
if (display_code > 3456)
display_code = error_code + 16543; // 3457 → 20000, 3794 → 20337
sprintf(buf, "%d", display_code);
The -D suffix is appended only when severity <= 7 (note, remark, warning, command-line warning, soft error). Diagnostics with severity > 7 (hard error, catastrophic, command-line error, internal error) omit the suffix:
const char *suffix = "-D";
if (severity > 7)
suffix = "";
Any access with error code > 3794 triggers sub_4F2D30 (error_text), which fires an assertion: "error_text: invalid error code" (error.c:911).
Severity Levels
Nine severity values are stored as a single byte at offset 180 of the diagnostic record:
| Value | Name | Display String (lowercase) | Display String (uppercase) | Colorization | Exit Behavior |
|---|---|---|---|---|---|
| 2 | note | "note" | "Note" | cyan (code 4) | continues |
| 4 | remark | "remark" | "Remark" | cyan (code 4) | continues |
| 5 | warning | "warning" | "Warning" | magenta (code 3) | continues |
| 6 | command-line warning | "command-line warning" | "Command-line warning" | magenta (code 3) | continues |
| 7 | error (soft) | "error" | "Error" | red (code 2) | continues, counted |
| 8 | error (hard) | "error" | "Error" | red (code 2) | continues, counted, not suppressible by pragma |
| 9 | catastrophic | "catastrophic error" | "Catastrophic error" | red (code 2) | immediate exit(4) |
| 10 | command-line error | "command-line error" | "Command-line error" | red (code 2) | immediate exit(4) |
| 11 | internal error | "internal error" | "Internal error" | red (code 2) | immediate exit(11) via abort path |
Uppercase display strings are used when dword_106BCD4 is set, indicating the diagnostic originates from a predefined macro file context (e.g., "In predefined macro file: Error #...").
The special string "nv_diag_remark" at offset +8 yields "remark" -- an NVIDIA-specific annotation kind for CUDA diagnostic remarks.
Severity Byte Arrays
Three parallel byte arrays, indexed as [4 * error_code], track per-error severity state:
| Array | Address | Purpose |
|---|---|---|
byte_1067920 | 0x1067920 | Default severity -- the compile-time severity assigned to each error code |
byte_1067921 | 0x1067921 | Current severity -- the effective severity after #pragma overrides |
byte_1067922 | 0x1067922 | Tracking flags -- bit 0: first-time guard, bit 1: already-emitted, bit 2: has pragma override |
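Because the three base addresses are consecutive bytes (0x1067920, 0x1067921, 0x1067922) and the index stride is 4, the three arrays are interleaved views of one table of 4-byte slots: byte 0 of each slot holds the default severity, byte 1 the current severity, and byte 2 the tracking flags. The role of byte 3 is not established here. This layout lets the pragma override system (sub_4F30A0) reach all per-error state for a code from a single 4 * error_code offset.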
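A minimal sketch of how the 4-byte slot could be modeled when reimplementing this table. The struct, field names, and helper functions are hypothetical (they do not appear in the binary); only the stride, the byte positions, and the table length of 3,795 come from the analysis above.

```c
#include <stdint.h>

#define ERROR_CODE_COUNT 3795   /* codes 0..3794, per the error table */

/* Hypothetical view of one 4-byte slot; field names are illustrative. */
typedef struct {
    uint8_t default_severity;   /* byte_1067920[4 * code] */
    uint8_t current_severity;   /* byte_1067921[4 * code] */
    uint8_t tracking_flags;     /* byte_1067922[4 * code] */
    uint8_t unknown;            /* fourth byte: role not established */
} severity_slot;

static severity_slot severity_table[ERROR_CODE_COUNT];

/* Restore the effective severity of one error code to its default. */
static void restore_default_severity(int code) {
    severity_table[code].current_severity =
        severity_table[code].default_severity;
}

/* Byte offset of a slot from the table base, i.e. the 4 * code stride. */
static unsigned slot_offset(int code) {
    return (unsigned)((uint8_t *)&severity_table[code] -
                      (uint8_t *)severity_table);
}
```

With this view, `byte_1067921[4 * code]` and `severity_table[code].current_severity` name the same byte.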
7-Stage Diagnostic Pipeline
caller emits error
|
v
[1] create_diagnostic_entry sub_4F40C0
Allocate ~200-byte record, set error_code + severity
|
v
[2] check_for_overridden_severity sub_4F30A0
Walk #pragma diagnostic stack, apply push/pop overrides
|
v
[3] check_severity sub_4F1330 ← 62 callers, 77 callees
Central dispatch: suppress/promote, error limit, output routing
|
├─── text path ──────────────────────────────────────┐
| |
v v
[4] write_message_to_buffer sub_4EF620 [6] write_sarif_message_json sub_4EF8A0
Expand %XY format specifiers from template JSON-escape + wrap
| |
v v
[5] construct_text_message sub_4EF9D0 (6.5 KB) SARIF JSON buffer → stderr
file:line prefix, severity label, word wrap,
caret lines, template context, include stack
|
v
[4a] process_fill_in sub_4EDCD0 (1,202 lines)
Expand %T/%n/%s/%p/%d/%u/%t/%r specifiers
|
v
output → stderr or redirect file
Stage 1: create_diagnostic_entry (sub_4F40C0)
Allocates a diagnostic record via sub_4EC940 and initializes it:
record = allocate_diagnostic_record();
record->kind = 0; // primary diagnostic
record->error_code = a1; // offset 176
if (severity <= 7)
check_for_overridden_severity(a1, &severity, position);
record->severity = severity; // offset 180
// resolve source position → file, line, column
// link into global diagnostic chain (qword_106BA10)
The wrapper sub_4F41C0 sets dword_106B4A8 (file index mode) to -1 for command-line and fatal severities (6, 9, 10, 11), disabling file-index tracking for diagnostics that have no meaningful source location.
Sub-diagnostics are created by sub_4F5A70 with kind=2, linked to their parent's sub-diagnostic chain at offsets 40/48 of the parent record.
Stage 2: check_for_overridden_severity (sub_4F30A0)
Walks the #pragma diagnostic stack stored in qword_1067820. Each stack entry is a 24-byte record containing a source position, a pragma action code, and an optional error code target.
Pragma action codes and their effect on severity:
| Code | Pragma | Effect |
|---|---|---|
| 30 | ignored | Set severity to 3 (suppress) |
| 31 | remark | Set severity to 4 |
| 32 | warning | Set severity to 5 |
| 33 | error | Set severity to 7 |
| 35 | default | Restore from byte_1067920[4 * error_code] |
| 36 | push/pop marker | Scope boundary for push/pop tracking |
The function uses binary search (bsearch with comparator sub_4ECD20) to find the nearest pragma entry that applies at the current source position, then walks backward through the stack to resolve nested push/pop scopes.
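The action table above can be sketched as a small mapping function. This is an illustrative reduction, not the decompiled logic: the function name and the `default_sev` parameter (standing in for byte_1067920[4 * error_code]) are assumptions, and the position lookup and push/pop resolution are left out.

```c
#include <stdint.h>

enum { SEV_SUPPRESS = 3, SEV_REMARK = 4, SEV_WARNING = 5, SEV_ERROR = 7 };

/* Returns the severity that results from applying one pragma action code.
   default_sev stands in for byte_1067920[4 * error_code]. */
static uint8_t apply_pragma_action(int action, uint8_t current_sev,
                                   uint8_t default_sev) {
    switch (action) {
    case 30: return SEV_SUPPRESS;   /* ignored */
    case 31: return SEV_REMARK;     /* remark */
    case 32: return SEV_WARNING;    /* warning */
    case 33: return SEV_ERROR;      /* error */
    case 35: return default_sev;    /* default */
    default: return current_sev;    /* 36 (push/pop marker) and unknown codes */
    }
}
```

Since create_diagnostic_entry only consults the pragma stack for severity <= 7, a caller of this sketch would apply the same guard before calling it.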
Stage 3: check_severity (sub_4F1330)
The central dispatch function (601 decompiled lines, 62 callers, 77 callees). This is the most complex function in the error subsystem.
Complete decision tree pseudocode (derived from the decompiled sub_4F1330):
void check_severity(diagnostic_record *record) {
dword_1065938 = 0; // reset caret-position cache
uint8_t min_sev = byte_126ED69; // minimum severity threshold
// ── Gate 1: Minimum severity filter ──
if (min_sev > record->severity) {
if (min_sev == 3) // severity 3 = suppress sentinel
ASSERT_FAIL("check_severity", error.c:3859);
goto count_and_exit; // silently discard
}
// ── Gate 2: System-header / suppress-all promotion ──
if (is_system_header(record->source_sequence_number)) {
min_sev = 8; // raise threshold: only severity >= 8 passes
} else if (qword_106BCD8) { // suppress-all-but-fatal mode
min_sev = 7; // treat as error-level floor
} else if (min_sev == 3) {
ASSERT_FAIL("check_severity", error.c:3859);
}
if (record->severity < min_sev)
goto count_and_exit;
// ── Gate 3: Per-error tracking flags ──
uint8_t *flags = &byte_1067922[4 * record->error_code];
if (record->severity <= 7) { // suppressible severities only
uint8_t old = *flags;
*flags |= 2; // mark as emitted
if ((old & 1) && (old & 2)) // first-time guard + already-emitted
goto suppressed; // skip: already seen in this scope
} else {
*flags |= 2; // hard errors: always mark, never skip
}
// ── Gate 4: Pragma diagnostic check ──
if (dword_126C5E4 != -1) { // scope tracking active
if (check_pragma_diagnostic(record->error_code,
record->severity,
&record->source_seq)) {
suppressed:
// Update error/warning counters even when suppressed
uint8_t sev = record->severity;
if (!(sev <= 7 && sev < byte_126ED68)) // at or above the promotion threshold
sev = 7; // count as error
update_suppressed_counts(sev, &qword_126EDC8);
goto count_and_exit;
}
// Record pragma scope if applicable
if (in_template_scope() || has_special_scope_flags())
record_pragma_diagnostic(record->error_code, record->severity);
}
// ── Gate 5: Suppress-all-but-fatal redirect ──
if (qword_106BCD8 && !dword_106BCD4 && record->error_code != 992) {
emit_error_992(); // replace with fatal error 992
return; // guard against catastrophic loop
}
// ── Severity promotion: warning → error threshold ──
uint8_t effective_sev = record->severity;
if (effective_sev <= 7 && effective_sev >= byte_126ED68) {
effective_sev = 7; // promote to error
if (dword_126C5C8 == -1) {
update_counts(7, &qword_126ED80);
if (!dword_126ED78) // no further counting needed
goto skip_extra_counts;
goto update_additional_counts;
}
}
// Otherwise the severity is already above 7 or below the promotion threshold
update_counts(effective_sev, &qword_126ED80);
if (dword_126ED78 && (uint8_t)(effective_sev - 9) > 2) // severity outside 9..11 (not fatal)
goto update_additional_counts;
skip_extra_counts:
if (qword_126EDC0)
update_counts(effective_sev, qword_126EDC0);
// ── Allocate output buffers (first use) ──
if (!qword_106B488) {
qword_106B488 = alloc_buffer(0x400);
qword_106B480 = alloc_buffer(0x80);
}
reset_buffer(qword_106B488);
reset_buffer(qword_106B480);
// ── Catastrophic loop detection ──
if (record->severity == 9) {
if (dword_106B4B0) { // already processing catastrophic
fprintf(stderr, "%s\n", "Loop in catastrophic error processing.");
emergency_exit(9); // never returns
}
dword_106B4B0 = 1; // set catastrophic guard
if (record->error_code == 3709 || !dword_126ED48)
goto emit_message;
} else if (record->severity == 11 || record->error_code == 3709) {
goto emit_message; // internal error or warnings-as-errors
}
// ── Template context expansion ──
int context_count = 0;
for (int scope = dword_126C5E4; scope > 0; scope--) {
context_count += format_scope_context(scope);
}
// Include-file context
if (dword_126EE48 && qword_106B9F0 && has_include_context()) {
file_info = lookup_source_file(record->source_seq);
if (file_info != current_file) {
context_count++;
// Emit error 1063/1064 (include-stack context)
create_sub_diagnostic(record, (context_count != 1) ? 1064 : 1063);
}
}
// Context elision (when context_limit is set)
if (dword_126ED58 > 0 && dword_126ED58 + 1 < context_count) {
// Emit error 1150: "%d context lines elided"
}
emit_message:
// ── Output routing ──
reset_buffer(qword_106B488);
if (dword_106BBB8 == 1) {
// SARIF JSON path
write_sarif_json(record); // → qword_106B478
fputs(sarif_buffer, stderr);
fflush(stderr);
} else {
construct_text_message(record); // → sub_4EF9D0
}
// ── Termination for fatal severities ──
if (record->severity >= 9 && record->severity <= 11) {
cleanup(); // flush output, close files
emergency_exit(record->severity); // sub_5AF2B0 maps severity to the exit code
// unreachable
}
// ── Error limit enforcement ──
if (qword_126ED90 + qword_126ED98 >= qword_126ED60) {
fprintf(stderr, "%s\n", "Error limit reached.");
if (qword_106C260) // raw listing file
fwrite("C \"\" 0 0 error limit reached\n", 1, 29, listing);
cleanup();
emergency_exit(9); // exit(catastrophic)
}
// ── Warnings-as-errors promotion ──
if (record->severity == 5 && dword_106C088 && !dword_106B4BC) {
uint8_t saved_min = byte_126ED69;
byte_126ED69 = 4; // temporarily lower threshold
dword_106C088 = 0; // disable mode so 3709 cannot re-trigger itself
dword_106B4BC = 1; // set recursion guard
emit_diagnostic(4, 3709, ...); // "warnings treated as errors"
byte_126ED69 = saved_min; // restore threshold
dword_106C088 = 1; // restore mode
}
// ── File index update ──
if (dword_106B4A8 != -1)
update_file_index(record);
count_and_exit:
return;
}
Key decision points explained:
Minimum severity gate:
The global byte_126ED69 is the minimum severity threshold -- diagnostics below this level are silently discarded. When the threshold is 3 (the "suppress" sentinel), an assertion fires, which prevents the threshold from ever being set to the suppress level directly.
System-header promotion:
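When a diagnostic originates from a system header (detected by sub_5B9B60), the effective threshold is raised to 8. In the pseudocode above it is the local min_sev that changes, not record->severity, so only hard errors and above are reported from system headers and anything below severity 8 is discarded. This applies equally to CUDA system headers.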
Per-error tracking:
The tracking flags (byte_1067922[4 * error_code]) gate repeated emissions. For severities <= 7, the function reads the old flag value, sets bit 1 (already-emitted), and skips the diagnostic if the old value had both bit 0 (first-time guard) and bit 1 set. Severities above 7 always set bit 1 and are never skipped on this path.
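The flag test in Gate 3 can be isolated into a few lines. This is a sketch with hypothetical names (`mark_emitted_and_check_repeat` and the two flag macros); the bit positions and the severity cutoff of 7 are taken from the pseudocode above.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAG_FIRST_TIME_GUARD 0x01  /* bit 0 */
#define FLAG_ALREADY_EMITTED  0x02  /* bit 1 */

/* Always sets the already-emitted bit. Returns true when the diagnostic
   should be skipped as a repeat, which is only possible for severity <= 7. */
static bool mark_emitted_and_check_repeat(uint8_t *flags, uint8_t severity) {
    uint8_t old = *flags;
    *flags |= FLAG_ALREADY_EMITTED;
    if (severity > 7)
        return false;               /* hard errors and above are never skipped */
    return (old & FLAG_FIRST_TIME_GUARD) && (old & FLAG_ALREADY_EMITTED);
}
```

With the guard bit set, the first call for a code returns false and every later call returns true; without it, no call is ever skipped.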
Suppress-all-but-fatal mode:
When qword_106BCD8 is set and the error is not error 992 (the fatal sentinel), check_severity replaces the current diagnostic with error 992 and re-enters.
Catastrophic loop detection:
The re-entry guard dword_106B4B0 prevents infinite recursion when a catastrophic error triggers another catastrophic error during its own processing. The message "Loop in catastrophic error processing." is printed directly to stderr followed by emergency_exit(9).
Error limit enforcement:
The sum of qword_126ED90 (total errors) and qword_126ED98 (total warnings) is checked against qword_126ED60 (error limit). When the sum reaches or exceeds the limit, the compiler writes the limit message and exits with catastrophic status. The raw listing file also receives a machine-readable C "" 0 0 error limit reached line.
Warnings-as-errors promotion:
When dword_106C088 (warnings-are-errors mode) is set, a warning (severity 5) triggers error 3709 ("warnings treated as errors") as a follow-up diagnostic. The implementation temporarily lowers the minimum severity threshold to 4 (remark), disables warnings-as-errors mode, sets the recursion guard dword_106B4BC, emits the diagnostic, then restores the threshold and the mode flag; the pseudocode above shows no reset of the guard. This prevents the error-3709 diagnostic from itself triggering another error-3709.
Output routing:
if (dword_106BBB8 == 1)
// SARIF JSON path: sub_4EF8A0 → qword_106B478
else
sub_4EF9D0(record); // text path → construct_text_message
Termination for fatal severities:
Severities 9 (catastrophic), 10 (command-line error), and 11 (internal error) all trigger cleanup via sub_66B5E0 followed by sub_5AF2B0(severity), which maps severity to the process exit code.
Stage 4: write_message_to_buffer (sub_4EF620)
Looks up the error template string from the table and expands format specifiers:
const char *template = off_88FAA0[error_code]; // error_code must be <= 3794
Format specifier syntax: %XY...Zn where:
- X = specifier letter (T, d, n, p, r, s, t, u)
- Y...Z = option characters (a-z, A-Z), max 29
- n = trailing digit = fill-in index
Special forms:
- %% = literal %
- %[label] = named label fill-in, looked up in the off_D481E0 table
Each specifier dispatches to process_fill_in (sub_4EDCD0) with the appropriate fill-in kind.
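The %XY...Zn syntax can be parsed with a short scanner. The sketch below is an assumption-laden illustration, not the decompiled code: `fill_in_spec` and `parse_fill_in_spec` are hypothetical names, the index is reported as -1 when no digit follows (the real default is not established here), and the letter is not validated against the eight known specifiers.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_OPTIONS 29

typedef struct {
    char letter;                    /* specifier letter, e.g. 'n' or 's' */
    char options[MAX_OPTIONS + 1];  /* NUL-terminated option characters */
    int  index;                     /* fill-in index, -1 if absent (assumption) */
} fill_in_spec;

/* Parses one specifier starting at p, which must point at '%'.
   Returns the number of characters consumed, or 0 if p does not start
   a letter specifier (for example "%%" or "%[label]"). */
static size_t parse_fill_in_spec(const char *p, fill_in_spec *out) {
    size_t i = 1, n = 0;
    if (p[0] != '%' || !isalpha((unsigned char)p[1]))
        return 0;
    out->letter = p[i++];
    while (isalpha((unsigned char)p[i]) && n < MAX_OPTIONS)
        out->options[n++] = p[i++];
    out->options[n] = '\0';
    out->index = -1;
    if (isdigit((unsigned char)p[i]))
        out->index = p[i++] - '0';
    return i;
}
```

For the template fragment `%sq1` this yields letter `s`, options `q`, index 1, and a consumed length of 4.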
Stage 5: construct_text_message (sub_4EF9D0)
The largest function in the error subsystem at 1,464 decompiled lines (6.5 KB). Formats the complete diagnostic output.
Output format:
file(line): severity #code-D: message text
Variant formats:
- "At end of source: ..." -- when line number is 0
- "In predefined macro file: ..." -- when dword_106BCD4 is set
- "Line N" -- when the file name is "-" (stdin)
Sub-diagnostic indentation:
| Kind | Indent (chars) | Continuation indent |
|---|---|---|
| 0 (primary) | 0 | 10 |
| 2 (sub-diagnostic, same parent) | 10 | 20 |
| 2 (sub-diagnostic, different parent) | 12 | 22 |
| 3 (continuation) | 1 | 11 |
Word wrapping:
The function wraps output text at dword_106B470 (terminal width) column boundaries. When colorization is disabled, it uses a simple space-scanning algorithm. When colorization is enabled (ESC byte 0x1B in the formatted string), it tracks visible character width separately from escape sequence bytes and wraps only on visual boundaries.
Fill-in verification:
After output, the function iterates the fill-in linked list and asserts that every entry has used_flag == 1. An unused fill-in triggers: "construct_text_message: not all fill-ins used for error string: \"...\"" (error.c:4781).
Raw listing output:
When qword_106C260 (raw listing file) is open and the diagnostic is not a continuation (kind != 3), a machine-readable line is emitted:
S "filename" line column message\n
Where S is a single-character severity code: R (remark), W (warning), E (error), C (catastrophic/internal). Internal errors additionally prefix "(internal error) " before the message text.
Stage 6: process_fill_in (sub_4EDCD0)
Expands a single format specifier by searching the diagnostic record's fill-in linked list (head at offset 184) for an entry matching the requested kind and index. 1,202 decompiled lines.
Fill-in kind dispatch (from ASCII code of specifier letter):
| Letter | ASCII | Kind | Payload |
|---|---|---|---|
%T | 84 | 6 (type) | Type node pointer |
%d | 100 | 0 (decimal) | Integer value |
%n | 110 | 4 (entity name) | Entity node pointer + options |
%p | 112 | 2 (parameter) | Source position |
%r | 114 | 7 | Byte + pointer |
%s | 115 | 3 (string) | String pointer |
%t | 116 | 5 | (type variant) |
%u | 117 | 1 (unsigned) | Unsigned integer value |
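The dispatch table above translates directly into a switch. The function name is hypothetical; the letter-to-kind pairs are the ones listed in the table.

```c
/* Returns the fill-in kind (0..7) for a specifier letter, or -1 if the
   letter is not one of the eight listed specifiers. */
static int fill_in_kind_for_letter(char letter) {
    switch (letter) {
    case 'd': return 0;   /* decimal */
    case 'u': return 1;   /* unsigned */
    case 'p': return 2;   /* parameter / source position */
    case 's': return 3;   /* string */
    case 'n': return 4;   /* entity name */
    case 't': return 5;   /* type variant */
    case 'T': return 6;   /* type */
    case 'r': return 7;   /* byte + pointer */
    default:  return -1;
    }
}
```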
Entity name options (for %n specifier):
| Option | Meaning |
|---|---|
f | Full qualification |
o | Omit kind prefix |
p | Omit parameters |
t | Full with template arguments |
a | Omit + show accessibility |
d | Show declaration location |
T | Show template specialization |
Assertion Handler (sub_4F2930)
The most-connected function in the entire cudafe++ binary: 5,185 call sites. Declared __noreturn.
Signature:
void __noreturn assertion_handler(
char *source_file, // EDG source file path
int line_number, // source line number
const char *func_name, // enclosing function name
const char *prefix, // message prefix (or NULL)
const char *message // detail message (or NULL)
);
Message format (with prefix):
assertion failed: <prefix> <message> (<file>, line <line> in <func>)
Message format (without prefix):
assertion failed at: "<file>", line <line> in <func>
The function allocates a 0x400-byte buffer via sub_6B98A0, concatenates the message components using sub_6B9CD0 (buffer append), then calls sub_4F21C0 (internal_error). Because sub_4F21C0 is also __noreturn, the code after the call is dead -- the decompiler shows a loop structure with sprintf(v20, "%d", v8) that is never actually reached.
When dword_126ED40 (suppress assertion output) is set, the message text is replaced with "<suppressed>".
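The two message formats can be reproduced with snprintf. This is a simplified sketch: the binary builds the string with its own buffer-append routine (sub_6B9CD0) rather than snprintf, the function name is hypothetical, and the "<suppressed>" substitution is not modeled.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats an assertion message into buf using the two layouts shown above.
   Returns the snprintf result (length that would have been written). */
static int format_assertion_message(char *buf, size_t cap,
                                    const char *file, int line,
                                    const char *func,
                                    const char *prefix,
                                    const char *message) {
    if (prefix != NULL)
        return snprintf(buf, cap,
                        "assertion failed: %s %s (%s, line %d in %s)",
                        prefix, message ? message : "", file, line, func);
    return snprintf(buf, cap,
                    "assertion failed at: \"%s\", line %d in %s",
                    file, line, func);
}
```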
Internal Error Handler (sub_4F21C0)
Creates error 2656 with severity 11 (internal error), outputs it through the standard pipeline, then exits.
void __noreturn internal_error(const char *message) {
if (dword_1065928) { // re-entry guard
fprintf(stderr, "%s: %s\n", "Internal error loop", message);
sub_5AF2B0(11); // emergency exit
}
dword_1065928 = 1; // set guard
diag = sub_4F41C0(2656, &current_pos, 11); // create diag record
if (message)
sub_4F2E90(diag, message); // attach message as fill-in
sub_4F1330(diag); // route through check_severity
sub_5AF1D0(11); // cleanup + exit(11)
sub_4F2240(); // update file index (unreachable)
}
The re-entry guard dword_1065928 prevents infinite recursion: if internal_error is called while already processing an internal error (e.g., an assertion fires inside the error formatting code), it prints "Internal error loop: <message>" directly to stderr and exits immediately with code 11.
Exit Codes
| Code | Condition | Trigger |
|---|---|---|
| 0 | Compilation succeeded | Normal exit via sub_5AF1D0(0) |
| 2 | Errors encountered | total_errors > 0 at exit |
| 4 | Catastrophic error | Severity 9 or 10 reached |
| 11 | Internal error | Severity 11 (assertion failure) |
| abort | Double internal error | Re-entry in sub_4F21C0 or catastrophic loop |
The exit path flows through sub_5AF2B0, which maps the severity to the appropriate process exit code. Catastrophic loop detection ("Loop in catastrophic error processing.") calls sub_5AF2B0(9), which maps to exit code 4.
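The numeric rows of the table can be summarized as two small functions. Both names are hypothetical, and the sketch covers only the mappings stated above (9 and 10 to 4, 11 to 11, and the 0/2 choice at normal exit); whatever else sub_5AF2B0 does is not modeled.

```c
/* Maps a fatal severity to the process exit code listed in the table.
   Returns -1 for severities that do not terminate the process. */
static int exit_code_for_severity(int severity) {
    switch (severity) {
    case 9:             /* catastrophic */
    case 10:            /* command-line error */
        return 4;
    case 11:            /* internal error */
        return 11;
    default:
        return -1;
    }
}

/* Exit code when compilation runs to completion. */
static int exit_code_at_normal_end(unsigned long long total_errors) {
    return total_errors > 0 ? 2 : 0;
}
```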
Diagnostic Record Layout
Each diagnostic record is approximately 200 bytes, allocated by sub_4EC940:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | kind | 0=primary, 2=sub-diagnostic, 3=continuation |
| 8 | 8 | next | Linked list pointer (global chain) |
| 16 | 8 | parent | Parent diagnostic (for sub-diagnostics) |
| 24 | 8 | related_list | Related diagnostic chain |
| 40 | 8 | sub_diagnostic_head | First sub-diagnostic |
| 48 | 8 | sub_diagnostic_tail | Last sub-diagnostic |
| 72 | 8 | context_head | Template/include context chain |
| 88 | 8 | related_info | Related location info pointer |
| 96 | 8 | source_sequence_number | Position in source sequence |
| 136 | 4 | file_index | Index into source file table |
| 140 | 2 | column_end | End column for caret range |
| 144 | 4 | line_delta | Line offset for continuation |
| 152 | 8 | file_name_string | Canonical file path |
| 160 | 8 | display_file_name | Display-formatted file path |
| 168 | 4 | column_number | Column number |
| 172 | 4 | caret_info | Caret position data |
| 176 | 4 | error_code | Error code (0--3794) |
| 180 | 1 | severity | Severity level (2--11) |
| 184 | 8 | fill_in_list_head | First fill-in entry |
| 192 | 8 | fill_in_list_tail | Last fill-in entry |
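One way to pin the layout down for a reimplementation is a C struct with compile-time offset checks. The sketch below is hypothetical: the `unk_*` and `pad_*` members fill ranges the table does not describe, pointer fields are declared as uint64_t so the layout holds on any host, and the total of 200 bytes is what the listed offsets imply rather than a measured allocation size.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct diag_record {
    int32_t  kind;                    /*   0 */
    uint8_t  pad_4[4];
    uint64_t next;                    /*   8 */
    uint64_t parent;                  /*  16 */
    uint64_t related_list;            /*  24 */
    uint8_t  unk_32[8];               /* not described in the table */
    uint64_t sub_diagnostic_head;     /*  40 */
    uint64_t sub_diagnostic_tail;     /*  48 */
    uint8_t  unk_56[16];
    uint64_t context_head;            /*  72 */
    uint8_t  unk_80[8];
    uint64_t related_info;            /*  88 */
    uint64_t source_sequence_number;  /*  96 */
    uint8_t  unk_104[32];
    int32_t  file_index;              /* 136 */
    uint16_t column_end;              /* 140 */
    uint8_t  pad_142[2];
    int32_t  line_delta;              /* 144 */
    uint8_t  pad_148[4];
    uint64_t file_name_string;        /* 152 */
    uint64_t display_file_name;       /* 160 */
    int32_t  column_number;           /* 168 */
    int32_t  caret_info;              /* 172 */
    int32_t  error_code;              /* 176 */
    uint8_t  severity;                /* 180 */
    uint8_t  pad_181[3];
    uint64_t fill_in_list_head;       /* 184 */
    uint64_t fill_in_list_tail;       /* 192 */
} diag_record;

/* Compile-time checks against the offsets in the table. */
_Static_assert(offsetof(diag_record, sub_diagnostic_head) == 40, "offset 40");
_Static_assert(offsetof(diag_record, source_sequence_number) == 96, "offset 96");
_Static_assert(offsetof(diag_record, file_index) == 136, "offset 136");
_Static_assert(offsetof(diag_record, error_code) == 176, "offset 176");
_Static_assert(offsetof(diag_record, severity) == 180, "offset 180");
_Static_assert(offsetof(diag_record, fill_in_list_head) == 184, "offset 184");
_Static_assert(sizeof(diag_record) == 200, "size 200");
```

If later analysis identifies the `unk_*` ranges, the static assertions keep the known fields from shifting.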
Fill-In Entry Layout
Each fill-in entry is 40 bytes, allocated from a free-list pool (qword_106B490) or heap (sub_6B8070):
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | kind | Fill-in kind (0--7, mapped from format specifier letter) |
| 4 | 1 | used_flag | Set to 1 when consumed during formatting |
| 8 | 8 | next | Next fill-in in linked list |
| 16 | 8+ | payload | Union: qword for most kinds; int+int for kind 4 (entity name) |
Kind-specific initialization in alloc_fill_in_entry (sub_4F2DE0):
- Kind 2 (parameter): payload = qword_126EFB8 (current source position)
- Kind 4 (entity name): payload = 0, extra = 0xFFFFFFFF, flags = 0
- Kind 7: byte + qword payload
- Default: payload = 0
Colorization
Initialized by sub_4F2C10 (init_colorization, error.c:825):
- Check NOCOLOR environment variable -- if set, disable colorization
- Check sub_5AF770 (isatty) -- if stderr is not a terminal, disable
- Read EDG_COLORS or GCC_COLORS environment variable
- Default: "error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32"
Category codes used in escape sequences:
| Code | Category | Default ANSI |
|---|---|---|
| 1 | reset | \033[0m |
| 2 | error | \033[01;31m (bold red) |
| 3 | warning | \033[01;35m (bold magenta) |
| 4 | note/remark | \033[01;36m (bold cyan) |
| 5 | locus | \033[01m (bold) |
| 6 | quote | \033[01m (bold) |
| 7 | range1 | \033[32m (green) |
Controlled by dword_126ECA0 (colorization requested) and dword_126ECA4 (colorization active). The sub_4ECDD0 function emits escape sequences to the output buffer, and sub_4F3E50 handles escape insertion during word-wrapped output.
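The default color specification is a colon-separated list of name=SGR pairs, the same shape GCC_COLORS uses. A minimal lookup over such a string might look like the following; `lookup_color` is a hypothetical helper, not a recovered function, and it makes no claim about how sub_4F2C10 actually tokenizes the string.

```c
#include <stddef.h>
#include <string.h>

/* Looks up `category` (e.g. "error") in a colon-separated "name=SGR" spec
   and copies its SGR parameters into out. Returns 1 if found, else 0. */
static int lookup_color(const char *spec, const char *category,
                        char *out, size_t cap) {
    size_t name_len = strlen(category);
    const char *p = spec;
    while (*p != '\0') {
        const char *end = strchr(p, ':');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        if (len > name_len && strncmp(p, category, name_len) == 0 &&
            p[name_len] == '=') {
            size_t val_len = len - name_len - 1;
            if (val_len >= cap)
                return 0;           /* value does not fit in out */
            memcpy(out, p + name_len + 1, val_len);
            out[val_len] = '\0';
            return 1;
        }
        if (end == NULL)
            break;                  /* last entry examined */
        p = end + 1;
    }
    return 0;
}
```

Wrapping the returned parameters as `\033[` + value + `m` gives the ANSI sequences listed in the category table, for example `01;35` for warnings.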
Key Global Variables
| Variable | Address | Type | Purpose |
|---|---|---|---|
off_88FAA0 | 0x88FAA0 | const char*[3795] | Error message template table |
off_D481E0 | 0xD481E0 | struct[] | Named label fill-in table |
byte_1067920 | 0x1067920 | byte[4*3795] | Default severity per error |
byte_1067921 | 0x1067921 | byte[4*3795] | Current severity per error |
byte_1067922 | 0x1067922 | byte[4*3795] | Per-error tracking flags |
byte_126ED68 | 0x126ED68 | byte | Error promotion threshold |
byte_126ED69 | 0x126ED69 | byte | Minimum severity threshold |
qword_126ED60 | 0x126ED60 | qword | Error limit |
qword_126ED90 | 0x126ED90 | qword | Total error count |
qword_126ED98 | 0x126ED98 | qword | Total warning count |
dword_106B4B0 | 0x106B4B0 | int | Catastrophic error re-entry guard |
dword_106B4BC | 0x106B4BC | int | Warnings-as-errors recursion guard |
dword_106BBB8 | 0x106BBB8 | int | Output format (0=text, 1=SARIF) |
dword_106C088 | 0x106C088 | int | Warnings-are-errors mode |
dword_1065928 | 0x1065928 | int | Internal error re-entry guard |
qword_106BCD8 | 0x106BCD8 | qword | Suppress-all-but-fatal mode |
dword_106BCD4 | 0x106BCD4 | int | Predefined macro file mode |
qword_106B488 | 0x106B488 | qword | Message text buffer (0x400 initial) |
qword_106B480 | 0x106B480 | qword | Location prefix buffer (0x80 initial) |
qword_106B478 | 0x106B478 | qword | SARIF JSON buffer (0x400 initial) |
dword_106B470 | 0x106B470 | int | Terminal width for word wrapping |
qword_126EDF0 | 0x126EDF0 | FILE* | Error output stream (default stderr) |
qword_106C260 | 0x106C260 | FILE* | Raw listing output file |
Function Map
| Address | Name (Recovered) | EDG Source | Size | Role |
|---|---|---|---|---|
0x4EC940 | allocate_diagnostic_record | error.c | -- | Pool allocator for diagnostic records |
0x4ECB10 | write_sarif_physical_location | error.c | -- | SARIF location JSON fragment |
0x4ECDD0 | emit_colorization_escape | error.c | -- | Emit ANSI escape to buffer |
0x4ED190 | record_pragma_diagnostic | error.c | -- | Record pragma override in scope |
0x4ED240 | check_pragma_diagnostic | error.c | -- | Check if error suppressed by pragma |
0x4EDCD0 | process_fill_in | error.c:4297 | 1,202 lines | Format specifier expansion |
0x4EF620 | write_message_to_buffer | error.c:4703 | 159 lines | Template string expansion |
0x4EF8A0 | write_sarif_message_json | error.c | 79 lines | SARIF message JSON wrapper |
0x4EF9D0 | construct_text_message | error.c:3153 | 1,464 lines | Full text diagnostic formatter |
0x4F1330 | check_severity | error.c:3859 | 601 lines | Central severity dispatch |
0x4F2190 | check_severity_thunk | error.c | 8 lines | Tail-call wrapper |
0x4F21A0 | internal_error_variant | error.c | 9 lines | check_severity + exit(11) |
0x4F21C0 | internal_error | error.c | 22 lines | Error 2656, severity 11, re-entry guard |
0x4F2240 | update_file_index | error.c | 114 lines | LRU source-file index cache |
0x4F24B0 | build_source_caret_line | error.c | ~100 lines | Source caret underline |
0x4F2930 | assertion_handler | error.c | 101 lines | 5,185 callers, __noreturn |
0x4F2C10 | init_colorization | error.c:825 | 43 lines | Parse EDG_COLORS/GCC_COLORS |
0x4F2D30 | error_text_invalid_code | error.c:911 | 12 lines | Assert on code > 3794 |
0x4F2DE0 | alloc_fill_in_entry | error.c | 41 lines | Pool allocator for fill-ins |
0x4F2E90 | append_fill_in_string | error.c | -- | Attach string fill-in to diagnostic |
0x4F30A0 | check_for_overridden_severity | error.c:3803 | ~130 lines | Pragma diagnostic stack walk |
0x4F3480 | format_assertion_message | error.c | ~100 lines | Multi-arg string builder |
0x4F3E50 | emit_colorization_in_wrap | error.c | -- | Escape handling during word wrap |
0x4F40C0 | create_diagnostic_entry | error.c:5202 | ~50 lines | Base record creator |
0x4F41C0 | create_diagnostic_entry_with_file_index | error.c | 13 lines | Wrapper with file-index mode |
0x4F5A70 | create_sub_diagnostic | error.c:5242 | 32 lines | kind=2 sub-diagnostic creator |
0x4F6C40 | format_scope_context | error.c | -- | Extract instantiation context from scope |
Call Graph
sub_4F2930 (assertion_handler) [5,185 callers, __noreturn]
└── sub_4F21C0 (internal_error)
├── sub_4F41C0 (create_diagnostic_entry, error=2656, sev=11)
│ └── sub_4F40C0 (create_diagnostic_entry)
│ └── sub_4F30A0 (check_for_overridden_severity)
├── sub_4F2E90 (append_fill_in_string)
├── sub_4F1330 (check_severity) [62 callers, 77 callees]
│ ├── sub_4ED240 (check_pragma_diagnostic)
│ ├── sub_4EF9D0 (construct_text_message)
│ │ ├── sub_4EF620 (write_message_to_buffer)
│ │ │ └── sub_4EDCD0 (process_fill_in)
│ │ ├── sub_4F24B0 (build_source_caret_line)
│ │ └── sub_4F3E50 (emit_colorization_in_wrap)
│ ├── sub_4EF8A0 (write_sarif_message_json)
│ │ └── sub_4EF620 (write_message_to_buffer)
│ ├── sub_4F5A70 (create_sub_diagnostic)
│ ├── sub_4F2DE0 (alloc_fill_in_entry)
│ ├── sub_4F6C40 (format_scope_context)
│ ├── sub_66B5E0 (cleanup)
│ └── sub_5AF2B0 (exit)
├── sub_5AF1D0 (cleanup + exit)
└── sub_4F2240 (update_file_index)
CUDA Error Catalog
cudafe++ reserves error indices 3457--3794 for CUDA-specific diagnostics. These 338 slots are displayed to the user as error numbers 20000--20337 with a -D suffix (for suppressible severities), produced by the renumbering logic in construct_text_message (sub_4EF9D0): when the internal error code exceeds 3456, the display code is error_code + 16543. Of the 338 slots, approximately 210 carry unique error message templates; the remainder are reserved or share templates with parametric fill-ins (%s, %sq, %t, %n, %no). Every CUDA error can be suppressed, promoted, or demoted by its diagnostic tag name via --diag_suppress, --diag_warning, --diag_error, or the #pragma nv_diagnostic system.
This page is a searchable reference catalog organized by error category. For the diagnostic pipeline mechanics (severity levels, pragma stack, output formatting), see Diagnostic Overview.
Error Numbering Scheme
// construct_text_message (sub_4EF9D0), error.c:3153
int display_code = error_code;
if (display_code > 3456)
display_code = error_code + 16543; // 3457 -> 20000, 3794 -> 20337
sprintf(buf, "%d", display_code);
// Suffix: "-D" appended when severity <= 7 (note, remark, warning, soft error)
const char *suffix = (severity > 7) ? "" : "-D";
User-visible format: file(line): error #20042-D: calling a __device__ function from a __host__ function is not allowed
Mapping formula:
| Direction | Formula |
|---|---|
| Display to internal | internal = display - 16543 (for display >= 20000) |
| Internal to display | display = internal + 16543 (for internal > 3456) |
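Both directions of the mapping fit in two one-line helpers. The function names are hypothetical; the constants 3456, 16543, and 20000 are the ones given above.

```c
/* Internal error index -> number shown to the user. */
static int internal_to_display(int internal) {
    return internal > 3456 ? internal + 16543 : internal;
}

/* Number shown to the user -> internal error index. */
static int display_to_internal(int display) {
    return display >= 20000 ? display - 16543 : display;
}
```

For example, the user-visible code 20042 corresponds to internal index 3499, which is the value to use when indexing off_88FAA0.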
Diagnostic Tag Names and Suppression
Each CUDA error has an associated diagnostic tag name -- a snake_case identifier that can be passed to --diag_suppress, --diag_warning, --diag_error, or --diag_default instead of the numeric code. The tag names are also accepted by #pragma nv_diag_suppress, #pragma nv_diag_warning, etc.
# Suppress a specific CUDA error by tag name
nvcc --diag_suppress=calling_a_constexpr__host__function_from_a__device__function
# Suppress by numeric code (equivalent)
nvcc --diag_suppress=20042
# In source code
#pragma nv_diag_suppress device_function_redeclared_with_host
The pragma actions understood by cudafe++:
| Pragma | Internal Code | Effect |
|---|---|---|
nv_diag_suppress | 30 | Set severity to 3 (suppressed) |
nv_diag_remark | 31 | Set severity to 4 (remark) |
nv_diag_warning | 32 | Set severity to 5 (warning) |
nv_diag_error | 33 | Set severity to 7 (error) |
nv_diag_default | 35 | Restore original severity |
nv_diag_once | -- | Emit only on first occurrence |
Category 1: Cross-Space Calling (12 messages)
Cross-space call validation is the highest-frequency CUDA diagnostic category. The checker walks the call graph and emits an error whenever a function in one execution space calls a function in an incompatible space. Six variants cover non-constexpr calls; six more cover constexpr calls (which can be relaxed with --expt-relaxed-constexpr).
Standard Cross-Space Calls
| Tag | Message Template |
|---|---|
unsafe_device_call | calling a __device__ function(%sq1) from a __host__ function(%sq2) is not allowed |
unsafe_device_call | calling a __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed |
unsafe_device_call | calling a __host__ function(%sq1) from a __device__ function(%sq2) is not allowed |
unsafe_device_call | calling a __host__ function(%sq1) from a __global__ function(%sq2) is not allowed |
unsafe_device_call | calling a __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed |
unsafe_device_call | calling a __host__ function from a __host__ __device__ function is not allowed |
Constexpr Cross-Space Calls
These fire when --expt-relaxed-constexpr is not enabled. The message explicitly suggests the flag.
| Tag | Message Template |
|---|---|
unsafe_device_call | calling a constexpr __device__ function(%sq1) from a __host__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
unsafe_device_call | calling a constexpr __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
unsafe_device_call | calling a constexpr __host__ function(%sq1) from a __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
unsafe_device_call | calling a constexpr __host__ function(%sq1) from a __global__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
unsafe_device_call | calling a constexpr __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
unsafe_device_call | calling a constexpr __host__ function from a __host__ __device__ function is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
Implementation: Cross-space checks are performed by the call-graph walker in the CUDA validation pass. The checker compares the execution space byte (entity offset +182) of the callee against that of the caller. When the mask test fails, the message variant is selected by two criteria: whether either function is constexpr, and whether the callee has a name to fill in (the %sq form) or the anonymous form (no %sq) is used.
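
The variant selection can be illustrated with a small C++ sketch. It is not the decompiled logic: the `Space` enum, `space_name`, and `cross_space_message` are hypothetical names, and the set of rejected (callee, caller) pairs is simply read off the two tables above. Only the message wording is taken from the binary's strings, and the anonymous (no %sq) form is not modeled.

```cpp
#include <cassert>
#include <string>

// Hypothetical model: only the message wording is taken from the tables above.
enum class Space { Host, Device, HostDevice, Global };

inline const char* space_name(Space s) {
    switch (s) {
        case Space::Host:       return "__host__";
        case Space::Device:     return "__device__";
        case Space::HostDevice: return "__host__ __device__";
        case Space::Global:     return "__global__";
    }
    assert(false && "unknown execution space");
    return "";
}

// Returns the named (%sq) message for a disallowed call, or "" when the
// (callee, caller) pair is not among the combinations listed in the tables.
inline std::string cross_space_message(Space callee, Space caller, bool constexpr_variant) {
    const bool device_callee_bad = callee == Space::Device &&
        (caller == Space::Host || caller == Space::HostDevice);
    const bool host_callee_bad = callee == Space::Host &&
        (caller == Space::Device || caller == Space::Global || caller == Space::HostDevice);
    if (!device_callee_bad && !host_callee_bad) return "";

    std::string msg = "calling a ";
    if (constexpr_variant) msg += "constexpr ";
    msg += space_name(callee);
    msg += " function(%sq1) from a ";
    msg += space_name(caller);
    msg += " function(%sq2) is not allowed";
    if (constexpr_variant)
        msg += ". The experimental flag '--expt-relaxed-constexpr' can be used to allow this.";
    return msg;
}
```

Building the text from its parts makes it visible that the twelve templates differ only in the two space names and the optional constexpr wording.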
Category 2: Virtual Override Mismatch (6 messages)
When a derived class overrides a virtual function, the execution space of the override must match the base. Six combinations cover all mismatched pairs among __host__, __device__, and __host__ __device__.
| Tag | Message Template |
|---|---|
| -- | execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ function |
| -- | execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ __device__ function |
| -- | execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __device__ function |
| -- | execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __host__ __device__ function |
| -- | execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __device__ function |
| -- | execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __host__ function |
Implementation: The override checker (sub_432280, record_virtual_function_override) extracts the bits selected by mask 0x30 from the execution space byte of both the base and derived function entities. If the masked values differ, the message for that (overridden, overriding) pair is emitted. The __global__ space is not included because __global__ functions cannot be virtual (see Category 4).
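
A minimal sketch of the masked comparison, assuming the execution space byte sits at the same +182 entity offset mentioned for the cross-space checker. The helper names and the raw-byte view of an entity are illustrative; the meaning of the individual bits inside 0x30 is deliberately left unspecified here.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint8_t kSpaceMask = 0x30;  // mask named in the text
constexpr int kSpaceByteOffset = 182;      // assumed: same offset as the cross-space check

// Reads the execution space byte from a raw entity record (layout assumed).
inline std::uint8_t space_byte(const unsigned char* entity) {
    assert(entity != nullptr);
    return entity[kSpaceByteOffset];
}

// True when base and override disagree in the masked bits, i.e. when one of
// the six mismatch messages would be selected.
constexpr bool override_space_mismatch(std::uint8_t base_byte, std::uint8_t derived_byte) {
    return (base_byte & kSpaceMask) != (derived_byte & kSpaceMask);
}
```

Bits outside the mask do not take part in the comparison, so two entities that differ only in unrelated flags are treated as matching.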
Category 3: Redeclaration Mismatch (12 messages)
When a function is redeclared with a different execution space annotation, cudafe++ either emits an error (incompatible combination) or a warning (compatible promotion to __host__ __device__).
Error-Level Redeclarations (4 messages)
| Tag | Message Template |
|---|---|
device_function_redeclared_with_global | a __device__ function(%no1) redeclared with __global__ |
global_function_redeclared_with_device | a __global__ function(%no1) redeclared with __device__ |
global_function_redeclared_with_host | a __global__ function(%no1) redeclared with __host__ |
global_function_redeclared_with_host_device | a __global__ function(%no1) redeclared with __host__ __device__ |
Warning-Level Redeclarations (Promoted to HD, 5 messages)
| Tag | Message Template |
|---|---|
device_function_redeclared_with_host | a __device__ function(%no1) redeclared with __host__, hence treated as a __host__ __device__ function |
device_function_redeclared_with_host_device | a __device__ function(%no1) redeclared with __host__ __device__, hence treated as a __host__ __device__ function |
device_function_redeclared_without_device | a __device__ function(%no1) redeclared without __device__, hence treated as a __host__ __device__ function |
host_function_redeclared_with_device | a __host__ function(%no1) redeclared with __device__, hence treated as a __host__ __device__ function |
host_function_redeclared_with_host_device | a __host__ function(%no1) redeclared with __host__ __device__, hence treated as a __host__ __device__ function |
Global Redeclarations (3 messages)
| Tag | Message Template |
|---|---|
global_function_redeclared_without_global | a __global__ function(%no1) redeclared without __global__ |
host_function_redeclared_with_global | a __host__ function(%no1) redeclared with __global__ |
host_device_function_redeclared_with_global | a __host__ __device__ function(%no1) redeclared with __global__ |
Implementation: Redeclaration checking occurs in decl_routine (sub_4CE420) and check_cuda_attribute_consistency (sub_4C6D50). The checker compares the execution space byte from the prior declaration against the new declaration's attribute set. When bits differ, it selects the message based on which bits changed and whether the result is a compatible promotion.
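
The twelve redeclaration outcomes can be written down as a lookup table. The sketch below is a restatement of the three tables above, not the bit-level logic of decl_routine; `Annot`, `Redecl`, and `classify_redeclaration` are hypothetical names, and `promoted_to_hd` marks the five rows whose message says the function is treated as __host__ __device__.

```cpp
#include <cassert>
#include <string>

enum class Annot { None, Host, Device, HostDevice, Global };

struct Redecl {
    Annot prior;          // execution space of the earlier declaration
    Annot next;           // annotation on the redeclaration (None = unannotated)
    const char* tag;      // diagnostic tag from the tables
    bool promoted_to_hd;  // true for the "hence treated as" rows
};

static const Redecl kRedeclTable[] = {
    {Annot::Device,     Annot::Global,     "device_function_redeclared_with_global",      false},
    {Annot::Global,     Annot::Device,     "global_function_redeclared_with_device",      false},
    {Annot::Global,     Annot::Host,       "global_function_redeclared_with_host",        false},
    {Annot::Global,     Annot::HostDevice, "global_function_redeclared_with_host_device", false},
    {Annot::Device,     Annot::Host,       "device_function_redeclared_with_host",        true},
    {Annot::Device,     Annot::HostDevice, "device_function_redeclared_with_host_device", true},
    {Annot::Device,     Annot::None,       "device_function_redeclared_without_device",   true},
    {Annot::Host,       Annot::Device,     "host_function_redeclared_with_device",        true},
    {Annot::Host,       Annot::HostDevice, "host_function_redeclared_with_host_device",   true},
    {Annot::Global,     Annot::None,       "global_function_redeclared_without_global",   false},
    {Annot::Host,       Annot::Global,     "host_function_redeclared_with_global",        false},
    {Annot::HostDevice, Annot::Global,     "host_device_function_redeclared_with_global", false},
};

// Returns the matching row, or nullptr when the pair has no diagnostic.
inline const Redecl* classify_redeclaration(Annot prior, Annot next) {
    assert(prior != Annot::None);  // the prior declaration always has a resolved space
    for (const Redecl& r : kRedeclTable)
        if (r.prior == prior && r.next == next) return &r;
    return nullptr;
}
```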
Category 4: __global__ Function Constraints (37 messages)
__global__ (kernel) functions have the most extensive constraint set of any execution space. These errors enforce the CUDA programming model requirement that kernels have specific signatures, cannot be members, and cannot use certain C++ features.
Return Type and Signature
| Tag | Message Template |
|---|---|
global_function_return_type | a __global__ function must have a void return type |
global_function_deduced_return_type | a __global__ function must not have a deduced return type |
global_function_has_ellipsis | a __global__ function cannot have ellipsis |
global_rvalue_ref_type | a __global__ function cannot have a parameter with rvalue reference type |
global_ref_param_restrict | a __global__ function cannot have a parameter with __restrict__ qualified reference type |
global_va_list_type | A __global__ function or function template cannot have a parameter with va_list type |
global_function_with_initializer_list | a __global__ function or function template cannot have a parameter with type std::initializer_list |
global_param_align_too_big | cannot pass a parameter with a too large explicit alignment to a __global__ function on win32 platforms |
Declaration Context
| Tag | Message Template |
|---|---|
global_class_decl | A __global__ function or function template cannot be a member function |
global_friend_definition | A __global__ function or function template cannot be defined in a friend declaration |
global_function_in_unnamed_inline_ns | A __global__ function or function template cannot be declared within an inline unnamed namespace |
global_operator_function | An operator function cannot be a __global__ function |
global_new_or_delete | (internal -- global on operator new/delete) |
| -- | function main cannot be marked __device__ or __global__ |
C++ Feature Restrictions
| Tag | Message Template |
|---|---|
global_function_constexpr | A __global__ function or function template cannot be marked constexpr |
global_function_consteval | A __global__ function or function template cannot be marked consteval |
global_function_inline | (internal -- global with inline) |
global_exception_spec | An exception specification is not allowed for a __global__ function or function template |
Template Argument Restrictions
| Tag | Message Template |
|---|---|
global_private_type_arg | A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the class is local to a __device__ or __global__ function |
global_private_template_arg | A template that is defined inside a class and has private or protected access cannot be used in the template template argument of a __global__ function template instantiation |
global_unnamed_type_arg | An unnamed type (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the type is local to a __device__ or __global__ function |
global_func_local_template_arg | A type defined inside a __host__ function (%t) cannot be used in the template argument type of a __global__ function template instantiation |
global_lambda_template_arg | The closure type for a lambda (%t%s) cannot be used in the template argument type of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the flag '-extended-lambda' is specified and the lambda is an extended lambda (a __device__ or __host__ __device__ lambda defined within a __host__ or __host__ __device__ function) |
local_type_used_in_global_function | a local type %t (defined in %sq1) used in global function %sq2 template argument, the global function cannot be launched from host code. |
Variadic Template Constraints
| Tag | Message Template |
|---|---|
global_function_multiple_packs | Multiple pack parameters are not allowed for a variadic __global__ function template |
global_function_pack_not_last | Pack template parameter must be the last template parameter for a variadic __global__ function template |
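
The two variadic rules reduce to counting packs and checking the position of the last one. The following sketch assumes a template parameter list is summarized as one boolean per parameter; that representation and the function name are illustrative, and whether cudafe++ reports both tags for the same declaration is not known.

```cpp
#include <cassert>
#include <string>
#include <vector>

// is_pack[i] is true when the i-th template parameter of the kernel template
// is a parameter pack (hypothetical representation).
inline std::vector<std::string> check_kernel_packs(const std::vector<bool>& is_pack) {
    std::vector<std::string> tags;
    int packs = 0;
    for (bool p : is_pack) packs += p ? 1 : 0;
    if (packs > 1) tags.push_back("global_function_multiple_packs");
    // packs >= 1 implies the list is non-empty, so back() is safe here.
    if (packs >= 1 && !is_pack.back()) tags.push_back("global_function_pack_not_last");
    return tags;
}
```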
Variable Template Restrictions (parallel to kernel template)
| Tag | Message Template |
|---|---|
variable_template_private_type_arg | A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a variable template instantiation, unless the class is local to a __device__ or __global__ function |
variable_template_private_template_arg | (private template template arg in variable template) |
variable_template_unnamed_type_template_arg | An unnamed type (%t) cannot be used in the template argument type of a variable template template instantiation, unless the type is local to a __device__ or __global__ function |
variable_template_func_local_template_arg | A type defined inside a __host__ function (%t) cannot be used in the template argument type of a variable template template instantiation |
variable_template_lambda_template_arg | The closure type for a lambda (%t%s) cannot be used in the template argument type of a variable template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the lambda is an 'extended lambda' and the flag --extended-lambda is specified |
Launch Configuration Attributes
| Tag | Message Template |
|---|---|
bounds_attr_only_on_global_func | %s is only allowed on a __global__ function |
maxnreg_attr_only_on_global_func | (maxnreg only on global) |
| -- | The %s qualifiers cannot be applied to the same kernel |
| -- | Multiple %s specifiers are not allowed |
| -- | no __launch_bounds__ specified for __global__ function |
cuda_specifier_twice_in_group | (duplicate CUDA specifier on same declaration) |
Category 5: Extended Lambda Restrictions (38 messages)
Extended lambdas (__device__ or __host__ __device__ lambdas defined within host code, enabled by --extended-lambda) are one of the most constraint-heavy features in CUDA. The restriction set enforces that the lambda's closure type can be serialized for device transfer.
Capture Restrictions
| Tag | Message Template |
|---|---|
extended_lambda_reference_capture | An extended %s lambda cannot capture variables by reference |
extended_lambda_pack_capture | An extended %s lambda cannot capture an element of a parameter pack |
extended_lambda_too_many_captures | An extended %s lambda can only capture up to 1023 variables |
extended_lambda_array_capture_rank | An extended %s lambda cannot capture an array variable (type: %t) with more than 7 dimensions |
extended_lambda_array_capture_assignable | An extended %s lambda cannot capture an array variable whose element type (%t) is not assignable on the host |
extended_lambda_array_capture_default_constructible | An extended %s lambda cannot capture an array variable whose element type (%t) is not default constructible on the host |
extended_lambda_init_capture_array | An extended %s lambda cannot init-capture variables with array type |
extended_lambda_init_capture_initlist | An extended %s lambda cannot have init-captures with type std::initializer_list |
extended_lambda_capture_in_constexpr_if | An extended %s lambda cannot first-capture variable in constexpr-if context |
this_addr_capture_ext_lambda | Implicit capture of 'this' in extended lambda expression |
extended_lambda_hd_init_capture | init-captures are not allowed for extended __host__ __device__ lambdas |
| -- | Unless enabled by language dialect, *this capture is only supported when the lambda is either __device__ only, or is defined within a __device__ or __global__ function |
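
Two of the capture rules carry numeric limits (1023 captures, 7 array dimensions) that are easy to express directly. This sketch is an illustration under assumptions: `Capture` is a hypothetical summary record, and only three of the capture tags are modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

constexpr std::size_t kMaxCaptures = 1023;  // from extended_lambda_too_many_captures
constexpr std::size_t kMaxArrayRank = 7;    // from extended_lambda_array_capture_rank

// Compile-time view of the array-rank rule for a captured variable of type T.
template <class T>
constexpr bool array_rank_ok = std::rank<T>::value <= kMaxArrayRank;

struct Capture {             // hypothetical summary of one lambda capture
    bool by_reference;
    std::size_t array_rank;  // 0 for non-array variables
};

inline std::vector<std::string> check_captures(const std::vector<Capture>& caps) {
    std::vector<std::string> tags;
    if (caps.size() > kMaxCaptures) tags.push_back("extended_lambda_too_many_captures");
    for (const Capture& c : caps) {
        if (c.by_reference) tags.push_back("extended_lambda_reference_capture");
        if (c.array_rank > kMaxArrayRank) tags.push_back("extended_lambda_array_capture_rank");
    }
    return tags;
}
```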
Type Restrictions on Captures and Parameters
| Tag | Message Template |
|---|---|
extended_lambda_capture_local_type | A type local to a function (%t) cannot be used in the type of a variable captured by an extended __device__ or __host__ __device__ lambda |
extended_lambda_capture_private_type | A type that is a private or protected class member (%t) cannot be used in the type of a variable captured by an extended __device__ or __host__ __device__ lambda |
extended_lambda_call_operator_local_type | A type local to a function (%t) cannot be used in the return or parameter types of the operator() of an extended __device__ or __host__ __device__ lambda |
extended_lambda_call_operator_private_type | A type that is a private or protected class member (%t) cannot be used in the return or parameter types of the operator() of an extended __device__ or __host__ __device__ lambda |
extended_lambda_parent_local_type | A type local to a function (%t) cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended __device__ or __host__ __device__ lambda |
extended_lambda_parent_private_type | A type that is a private or protected class member (%t) cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended __device__ or __host__ __device__ lambda |
extended_lambda_parent_private_template_arg | A template that is a private or protected class member cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended %s lambda |
Enclosing Parent Function Restrictions
| Tag | Message Template |
|---|---|
extended_lambda_enclosing_function_local | The enclosing parent function (%sq2) for an extended %s1 lambda must not be defined inside another function |
extended_lambda_inaccessible_parent | The enclosing parent function (%sq2) for an extended %s1 lambda cannot have private or protected access within its class |
extended_lambda_enclosing_function_deducible | The enclosing parent function (%sq2) for an extended %s1 lambda must not have deduced return type |
extended_lambda_cant_take_function_address | The enclosing parent function (%sq2) for an extended %s1 lambda must allow its address to be taken |
extended_lambda_parent_non_extern | On Windows, the enclosing parent function (%sq2) for an extended %s1 lambda cannot have internal or no linkage |
extended_lambda_parent_class_unnamed | The enclosing parent function (%sq2) for an extended %s1 lambda cannot be a member function of a class that is unnamed |
extended_lambda_parent_template_param_unnamed | The enclosing parent function (%sq2) for an extended %s1 lambda cannot be in a template which has a unnamed parameter: %nd |
extended_lambda_nest_parent_template_param_unnamed | The enclosing parent %n for an extended %s lambda cannot be a template which has a unnamed parameter |
extended_lambda_multiple_parameter_packs | The enclosing parent template function (%sq2) for an extended %s1 lambda cannot have more than one variadic parameter, or it is not listed last in the template parameter list. |
Nesting and Context Restrictions
| Tag | Message Template |
|---|---|
extended_lambda_enclosing_function_generic_lambda | An extended %s1 lambda cannot be defined inside a generic lambda expression(%sq2). |
extended_lambda_enclosing_function_hd_lambda | An extended %s1 lambda cannot be defined inside an extended __host__ __device__ lambda expression(%sq2). (note: double space before "lambda" is present in the binary) |
extended_lambda_inaccessible_ancestor | An extended %s1 lambda cannot be defined inside a class (%sq2) with private or protected access within another class |
extended_lambda_inside_constexpr_if | For this host platform/dialect, an extended lambda cannot be defined inside the 'if' or 'else' block of a constexpr if statement |
extended_lambda_multiple_parent | Cannot specify multiple __nv_parent directives in a lambda declaration |
extended_host_device_generic_lambda | __host__ __device__ extended lambdas cannot be generic lambdas |
| -- | If an extended %s lambda is defined within the body of one or more nested lambda expressions, each of these enclosing lambda expressions must be defined within the immediate or nested block scope of a function. |
Specifier and Annotation
| Tag | Message Template |
|---|---|
extended_lambda_disallowed | __host__ or __device__ annotation on lambda requires --extended-lambda nvcc flag |
extended_lambda_constexpr | The %s1 specifier is not allowed for an extended %s2 lambda |
| -- | The operator() function for a lambda cannot be explicitly annotated with execution space annotations (__host__/__device__/__global__), the annotations are derived from its closure class |
Category 6: Device Code Restrictions (13 messages)
General restrictions that apply to any code executing on the GPU. These errors are emitted when C++ features unsupported by the NVPTX backend appear in __device__ or __global__ function bodies.
| Tag | Message Template |
|---|---|
cuda_device_code_unsupported_operator | The operator '%s' is not allowed in device code |
unsupported_type_in_device_code | %t %s1 a %s2, which is not supported in device code |
| -- | device code does not support exception handling |
| -- | device code does not support coroutines |
| -- | operations on vector types are not supported in device code |
undefined_device_entity | cannot use an entity undefined in device code |
undefined_device_identifier | identifier %sq is undefined in device code |
thread_local_in_device_code | cannot use thread_local specifier for variable declarations in device code |
unrecognized_pragma_device_code | unrecognized #pragma in device code |
| -- | zero-sized parameter type %t is not allowed in device code |
| -- | zero-sized variable %sq is not allowed in device code |
| -- | dynamic initialization is not supported for a function-scope static %s variable within a __device__/__global__ function |
| -- | function-scope static variable within a __device__/__global__ function requires a memory space specifier |
Category 7: Kernel Launch (6 messages)
Errors related to <<<...>>> kernel launch syntax.
| Tag | Message Template |
|---|---|
device_launch_no_sepcomp | kernel launch from __device__ or __global__ functions requires separate compilation mode |
missing_api_for_device_side_launch | device-side kernel launch could not be processed as the required runtime APIs are not declared |
| -- | explicit stream argument not provided in kernel launch |
| -- | kernel launches from templates are not allowed in system files |
device_side_launch_arg_with_user_provided_cctor | cannot pass an argument with a user-provided copy-constructor to a device-side kernel launch |
device_side_launch_arg_with_user_provided_dtor | cannot pass an argument with a user-provided destructor to a device-side kernel launch |
Category 8: Memory Space and Variable Restrictions (15 messages)
Variable Access Across Spaces
| Tag | Message Template |
|---|---|
device_var_read_in_host | a %s1 %n1 cannot be directly read in a host function |
device_var_written_in_host | a %s1 %n1 cannot be directly written in a host function |
device_var_address_taken_in_host | address of a %s1 %n1 cannot be directly taken in a host function |
host_var_read_in_device | a host %n1 cannot be directly read in a device function |
host_var_written_in_device | a host %n1 cannot be directly written in a device function |
host_var_address_taken_in_device | address of a host %n1 cannot be directly taken in a device function |
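
The six access tags follow a regular naming scheme: variable side, access kind, function side. The sketch below reconstructs a tag from those three inputs. It is a simplification with hypothetical names; it treats spaces as a plain host/device split and ignores __host__ __device__ functions and any exemptions the real checker may apply.

```cpp
#include <cassert>
#include <string>

enum class Access { Read, Write, AddressOf };

// var_is_device: the variable lives in a device memory space.
// fn_is_device:  the enclosing function executes on the device.
// Returns "" when variable and function are on the same side.
inline std::string cross_space_variable_tag(bool var_is_device, bool fn_is_device, Access a) {
    if (var_is_device == fn_is_device) return "";
    std::string tag = var_is_device ? "device_var_" : "host_var_";
    switch (a) {
        case Access::Read:      tag += "read"; break;
        case Access::Write:     tag += "written"; break;
        case Access::AddressOf: tag += "address_taken"; break;
    }
    tag += var_is_device ? "_in_host" : "_in_device";
    return tag;
}
```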
Variable Declaration Restrictions
| Tag | Message Template |
|---|---|
illegal_local_to_device_function | %s1 %sq2 variable declaration is not allowed inside a device function body |
illegal_local_to_host_function | %s1 %sq2 variable declaration is not allowed inside a host function body |
| -- | the __shared__ memory space specifier is not allowed for a variable declared by the for-range-declaration |
| -- | __shared__ variables cannot have external linkage |
device_variable_in_unnamed_inline_ns | A %s variable cannot be declared within an inline unnamed namespace |
| -- | member variables of an anonymous union at global or namespace scope cannot be directly accessed in __device__ and __global__ functions |
Auto-Deduced Device References
| Tag | Message Template |
|---|---|
auto_device_fn_ref | A non-constexpr __device__ function (%sq1) with "auto" deduced return type cannot be directly referenced %s2, except if the reference is absent when __CUDA_ARCH__ is undefined |
device_var_constexpr | (constexpr rules for device variables) |
device_var_structured_binding | (structured bindings on device variables) |
Category 9: __grid_constant__ (8 messages)
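The __grid_constant__ annotation (compute_70+) marks a kernel parameter as read-only grid-wide. Four errors enforce that the annotated parameter belongs to a __global__ function, has a const-qualified type, does not have reference type, and targets compute_70 or later. The remaining four report annotations that disagree across redeclarations, template redeclarations, specializations, and instantiation directives.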
| Tag | Message Template |
|---|---|
grid_constant_non_kernel | __grid_constant__ annotation is only allowed on a parameter of a __global__ function |
grid_constant_not_const | a parameter annotated with __grid_constant__ must have const-qualified type |
grid_constant_reference_type | a parameter annotated with __grid_constant__ must not have reference type |
grid_constant_unsupported_arch | __grid_constant__ annotation is only allowed for architecture compute_70 or later |
grid_constant_incompat_redecl | incompatible __grid_constant__ annotation for parameter %s in function redeclaration (see previous declaration %p) |
grid_constant_incompat_templ_redecl | incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p) |
grid_constant_incompat_specialization | incompatible __grid_constant__ annotation for parameter %s in function specialization (see previous declaration %p) |
grid_constant_incompat_instantiation_directive | incompatible __grid_constant__ annotation for parameter %s in instantiation directive (see previous declaration %p) |
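
The first four rules map naturally onto standard type traits. This is a sketch only: the function name and the order in which the checks run are assumptions, and the four redeclaration-consistency diagnostics are not modeled.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Returns the tag of the first violated rule for a parameter of type T,
// or "" when all four checks pass. The check order is an assumption.
template <class T>
std::string check_grid_constant(bool on_global_function, int compute_arch) {
    if (!on_global_function)         return "grid_constant_non_kernel";
    if (compute_arch < 70)           return "grid_constant_unsupported_arch";
    // Test for references first: std::is_const is false for `const int&`.
    if (std::is_reference<T>::value) return "grid_constant_reference_type";
    if (!std::is_const<T>::value)    return "grid_constant_not_const";
    return "";
}
```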
Category 10: JIT Mode (5 messages)
JIT mode (-dc for device-only compilation) restricts host constructs. These errors guide users toward the -default-device flag for unannotated declarations.
| Tag | Message Template |
|---|---|
no_host_in_jit | A function explicitly marked as a __host__ function is not allowed in JIT mode |
unannotated_function_in_jit | A function without execution space annotations (__host__/__device__/__global__) is considered a host function, and host functions are not allowed in JIT mode. Consider using -default-device flag to process unannotated functions as __device__ functions in JIT mode |
unannotated_variable_in_jit | A namespace scope variable without memory space annotations (__device__/__constant__/__shared__/__managed__) is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process unannotated namespace scope variables as __device__ variables in JIT mode |
unannotated_static_data_member_in_jit | A class static data member with non-const type is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process such data members as __device__ variables in JIT mode |
host_closure_class_in_jit | The execution space for the lambda closure class members was inferred to be __host__ (based on context). This is not allowed in JIT mode. Consider using -default-device to infer __device__ execution space for namespace scope lambda closure classes. |
Category 11: RDC / Whole-Program Mode (4 messages)
Diagnostics related to relocatable device code (-rdc=true) and whole-program compilation (-rdc=false).
| Tag | Message Template |
|---|---|
| -- | An inline __device__/__constant__/__managed__ variable must have internal linkage when the program is compiled in whole program mode (-rdc=false) |
template_global_no_def | when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit. To resolve this issue, either use separate compilation mode ("-rdc=true"), or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off) |
extern_kernel_template | when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false"). To resolve the issue, either use separate compilation mode ("-rdc=true"), or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off) |
| -- | address of internal linkage device function (%sq) was taken (nv bug 2001144). mitigation: no mitigation required if the address is not used for comparison, or if the target function is not a CUDA C++ builtin. Otherwise, write a wrapper function to call the builtin, and take the address of the wrapper function instead |
Category 12: Atomics (26 messages)
CUDA atomics are lowered to PTX instructions with specific size, type, scope, and memory order constraints. These diagnostics enforce hardware limits.
Architecture and Type Constraints
| Tag | Message Template |
|---|---|
nv_atomic_functions_not_supported_below_sm60 | __nv_atomic_* functions are not supported on arch < sm_60. |
nv_atomic_operation_not_in_device_function | atomic operations are not in a device function. |
nv_atomic_function_no_args | atomic function requires at least one argument. |
nv_atomic_function_address_taken | nv atomic function must be called directly. |
invalid_nv_atomic_operation_size | atomic operations and, or, xor, add, sub, min and max are valid only on objects of size 4, or 8. |
invalid_nv_atomic_cas_size | atomic CAS is valid only on objects of size 2, 4, 8 or 16 bytes. |
invalid_nv_atomic_exch_size | atomic exchange is valid only on objects of size 4, 8 or 16 bytes. |
invalid_data_size_for_nv_atomic_generic_function | generic nv atomic functions are valid only on objects of size 1, 2, 4, 8 and 16 bytes. |
non_integral_type_for_non_generic_nv_atomic_function | non-generic nv atomic load, store, cas and exchange are valid only on integral types. |
invalid_nv_atomic_operation_add_sub_size | atomic operations add and sub are not valid on signed integer of size 8. |
nv_atomic_add_sub_f64_not_supported | atomic add and sub for 64-bit float is supported on architecture sm_60 or above. |
invalid_nv_atomic_operation_max_min_float | atomic operations min and max are not supported on any floating-point types. |
floating_type_for_logical_atomic_operation | For a logical atomic operation, the first argument cannot be any floating-point types. |
nv_atomic_cas_b16_not_supported | (16-bit CAS not supported) |
nv_atomic_exch_cas_b128_not_supported | (128-bit exchange/CAS not supported) |
nv_atomic_load_store_b128_version_too_low | (128-bit load/store requires newer arch) |
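
The size rules stated in the four size-related messages can be collected into one predicate. The grouping into an `AtomicOp` enum is hypothetical, and the architecture-dependent tags (16-bit CAS, 128-bit exchange/CAS/load/store) are not modeled because their thresholds are not given above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical grouping of the operations named in the size messages.
enum class AtomicOp { Arithmetic /* and, or, xor, add, sub, min, max */, Cas, Exchange, Generic };

constexpr bool atomic_size_ok(AtomicOp op, std::size_t bytes) {
    switch (op) {
        case AtomicOp::Arithmetic: return bytes == 4 || bytes == 8;
        case AtomicOp::Cas:        return bytes == 2 || bytes == 4 || bytes == 8 || bytes == 16;
        case AtomicOp::Exchange:   return bytes == 4 || bytes == 8 || bytes == 16;
        case AtomicOp::Generic:    return bytes == 1 || bytes == 2 || bytes == 4 ||
                                          bytes == 8 || bytes == 16;
    }
    return false;
}
```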
Memory Order and Scope
| Tag | Message Template |
|---|---|
nv_atomic_load_order_error | atomic load's memory order cannot be release or acq_rel. |
nv_atomic_store_order_error | atomic store's memory order cannot be consume, acquire or acq_rel. |
nv_atomic_operation_order_not_constant_int | atomic operation's memory order argument is not an integer literal. |
nv_atomic_operation_scope_not_constant_int | atomic operation's scope argument is not an integer literal. |
invalid_nv_atomic_memory_order_value | (invalid memory order enum value) |
invalid_nv_atomic_thread_scope_value | (invalid thread scope enum value) |
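
The load and store ordering rules mirror the usual C++ restrictions, so `std::memory_order` is used below as a stand-in. That substitution is an assumption: the integer encoding of the memory order argument accepted by the __nv_atomic_* functions is not documented here.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Returns the tag for an invalid load order, or "" when the order is accepted.
inline std::string check_load_order(std::memory_order o) {
    if (o == std::memory_order_release || o == std::memory_order_acq_rel)
        return "nv_atomic_load_order_error";
    return "";
}

// Returns the tag for an invalid store order, or "" when the order is accepted.
inline std::string check_store_order(std::memory_order o) {
    if (o == std::memory_order_consume || o == std::memory_order_acquire ||
        o == std::memory_order_acq_rel)
        return "nv_atomic_store_order_error";
    return "";
}
```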
Scope Fallback Warnings
| Tag | Message Template |
|---|---|
nv_atomic_operations_scope_fallback_to_membar | atomic operations' scope argument is supported on architecture sm_60 or above. Fall back to use membar. |
nv_atomic_operations_memory_order_fallback_to_membar | atomic operations' argument of memory order is supported on architecture sm_70 or above. Fall back to use membar. |
nv_atomic_operations_scope_cluster_change_to_device | atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
nv_atomic_load_store_scope_cluster_change_to_device | atomic load and store's scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
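
The four warnings encode three architecture thresholds (sm_60, sm_70, sm_90). A compact way to read them is as a fallback plan computed from the target architecture. The struct and function below are hypothetical; they only restate the thresholds from the message text and do not model how the frontend actually rewrites the operation.

```cpp
#include <cassert>

struct AtomicFallbacks {
    bool scope_to_membar;    // scope argument unsupported: below sm_60
    bool order_to_membar;    // memory order argument unsupported: below sm_70
    bool cluster_to_device;  // cluster scope requested on a target below sm_90
};

constexpr AtomicFallbacks plan_fallbacks(int sm, bool cluster_scope) {
    return AtomicFallbacks{sm < 60, sm < 70, cluster_scope && sm < 90};
}
```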
Category 13: ASM in Device Code (6 messages)
Inline assembly constraints are more restrictive in device code, because the NVPTX backend supports fewer constraint letters than x86.
| Tag | Message Template |
|---|---|
asm_constraint_letter_not_allowed_in_device | asm constraint letter '%s' is not allowed inside a __device__/__global__ function |
| -- | an asm operand may specify only one constraint letter in a __device__/__global__ function |
| -- | The 'C' constraint can only be used for asm statements in device code |
| -- | The cc clobber constraint is not supported in device code |
cuda_xasm_strict_placeholder_format | (strict placeholder format in CUDA asm) |
addr_of_label_in_device_func | address of label extension is not supported in __device__/__global__ functions |
Category 14: #pragma nv_abi (10 messages)
The #pragma nv_abi directive controls the calling convention for device functions, adjusting parameter passing to match PTX ABI requirements.
| Tag | Message Template |
|---|---|
nv_abi_pragma_bad_format | (malformed #pragma nv_abi) |
nv_abi_pragma_invalid_option | #pragma nv_abi contains an invalid option |
nv_abi_pragma_missing_arg | #pragma nv_abi requires an argument |
nv_abi_pragma_duplicate_arg | #pragma nv_abi contains a duplicate argument |
nv_abi_pragma_not_constant | #pragma nv_abi argument must evaluate to an integral constant expression |
nv_abi_pragma_not_positive_value | #pragma nv_abi argument value must be a positive value |
nv_abi_pragma_overflow_value | #pragma nv_abi argument value exceeds the range of an integer |
nv_abi_pragma_device_function | #pragma nv_abi must be applied to device functions |
nv_abi_pragma_device_function_context | #pragma nv_abi is not supported inside a host function |
nv_abi_pragma_next_construct | #pragma nv_abi must appear immediately before a function declaration, function definition, or an expression statement |
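
The two value-range diagnostics can be sketched as a check on an already-evaluated constant. The function name is hypothetical, and reading "the range of an integer" as the range of a C++ `int` is an assumption; constant evaluation and option parsing are out of scope.

```cpp
#include <cassert>
#include <climits>
#include <string>

// value: the #pragma nv_abi argument after constant evaluation.
// Returns the tag of the violated rule, or "" when the value is acceptable.
inline std::string check_nv_abi_value(long long value) {
    if (value <= 0)      return "nv_abi_pragma_not_positive_value";
    if (value > INT_MAX) return "nv_abi_pragma_overflow_value";  // assumed: int range
    return "";
}
```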
Category 15: __nv_register_params__ (4 messages)
The __nv_register_params__ attribute forces all parameters to be passed in registers (compute_80+).
| Tag | Message Template |
|---|---|
register_params_not_enabled | __nv_register_params__ support is not enabled |
register_params_unsupported_arch | __nv_register_params__ is only supported for compute_80 or later architecture |
register_params_unsupported_function | __nv_register_params__ is not allowed on a %s function |
register_params_ellipsis_function | __nv_register_params__ is not allowed on a function with ellipsis |
Category 16: __CUDACC_RTC__name_expr (6 messages)
The __CUDACC_RTC__name_expr intrinsic is used by NVRTC to form the mangled name of a __global__ function or __device__/__constant__ variable at compile time.
| Tag | Message Template |
|---|---|
name_expr_parsing | (error during name expression parsing) |
name_expr_non_global_routine | Name expression cannot form address of a non-__global__ function. Input name expression was: %sq |
name_expr_non_device_variable | Name expression cannot form address of a variable that is not a __device__/__constant__ variable. Input name expression was: %sq |
name_expr_not_routine_or_variable | Name expression must form address of a __global__ function or the address of a __device__/__constant__ variable. Input name expression was: %sq |
name_expr_extra_tokens | (extra tokens after name expression) |
name_expr_internal_error | (internal error in name expression processing) |
Category 17: Texture and Surface Variables (8 messages)
Texture and surface objects have special memory semantics. These errors enforce that they are not used in ways incompatible with the GPU texture subsystem.
| Tag | Message Template |
|---|---|
texture_surface_variable_in_unnamed_inline_ns | A texture or surface variable cannot be declared within an inline unnamed namespace |
| -- | A texture or surface variable cannot be used in the non-type template argument of a __device__, __host__ __device__ or __global__ function template instantiation |
reference_to_text_surf_type_in_device_func | a reference to texture/surface type cannot be used in __device__/__global__ functions |
reference_to_text_surf_var_in_device_func | taking reference of texture/surface variable not allowed in __device__/__global__ functions |
addr_of_text_surf_var_in_device_func | cannot take address of texture/surface variable %sq in __device__/__global__ functions |
addr_of_text_surf_expr_in_device_func | cannot take address of texture/surface expression in __device__/__global__ functions |
indir_into_text_surf_var_in_device_func | indirection not allowed for accessing texture/surface through variable %sq in __device__/__global__ functions |
indir_into_text_surf_expr_in_device_func | indirection not allowed for accessing texture/surface through expression in __device__/__global__ functions |
Category 18: __managed__ Variables (7 messages)
__managed__ unified-memory variables have significant restrictions because they must be accessible from both host and device.
| Tag | Message Template |
|---|---|
managed_const_type_not_allowed | a __managed__ variable cannot have a const qualified type |
managed_reference_type_not_allowed | a __managed__ variable cannot have a reference type |
managed_cant_be_shared_constant | __managed__ variables cannot be marked __shared__ or __constant__ |
unsupported_arch_for_managed_capability | __managed__ variables require architecture compute_30 or higher |
unsupported_configuration_for_managed_capability | __managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system) |
decltype_of_managed_variable | A __managed__ variable cannot be used as an unparenthesized id-expression argument for decltype() |
| -- | (dynamic initialization restrictions for managed variables) |
Category 19: Device Function Signature Constraints (5 messages)
Restrictions on __device__ and __host__ __device__ functions that are distinct from __global__ constraints.
| Tag | Message Template |
|---|---|
device_function_has_ellipsis | __device__ or __host__ __device__ function with ellipsis requires compute_30 or higher architecture |
device_func_tex_arg | (device function with texture argument restriction) |
no_host_device_initializer_list | (std::initializer_list in host device context) |
no_host_device_move_forward | (std::move/forward in host device context) |
no_strict_cuda_error | (relaxed error checking mode) |
Category 20: __wgmma_mma_async Builtins (4 messages)
Warp Group Matrix Multiply-Accumulate builtins (sm_90a+).
| Tag | Message Template |
|---|---|
wgmma_mma_async_not_enabled | __wgmma_mma_async builtins are only available for sm_90a |
wgmma_mma_async_nonconstant_arg | Non-constant argument to __wgmma_mma_async call |
wgmma_mma_async_missing_args | The 'A' or 'B' argument to __wgmma_mma_async call is missing |
wgmma_mma_async_bad_shape | The shape %s is not supported for __wgmma_mma_async builtin |
Category 21: __block_size__ / __cluster_dims__ (8 messages)
Architecture-dependent launch configuration attributes.
| Tag | Message Template |
|---|---|
block_size_unsupported | __block_size__ is not supported for this GPU architecture |
block_size_must_be_positive | (block size values must be positive) |
cluster_dims_unsupported | __cluster_dims__ is not supported for this GPU architecture |
cluster_dims_must_be_positive | (cluster_dims values must be positive) |
cluster_dims_too_large | (cluster_dims exceeds maximum) |
conflict_between_cluster_dim_and_block_size | cannot specify the second tuple in __block_size__ while __cluster_dims__ is present |
| -- | cannot specify max blocks per cluster for this GPU architecture |
shared_block_size_must_be_positive | (shared block size must be positive) |
Category 22: Inline Hint Conflicts (2 messages)
| Tag | Message Template |
|---|---|
| -- | "__inline_hint__" and "__forceinline__" may not be used on the same declaration |
| -- | "__inline_hint__" and "__noinline__" may not be used on the same declaration |
Category 23: Miscellaneous CUDA Errors
Remaining CUDA-specific diagnostics that do not fall into the above categories.
| Tag | Message Template |
|---|---|
cuda_displaced_new_or_delete_operator | (displaced new/delete in CUDA context) |
cuda_demote_unsupported_floating_point | (unsupported floating-point type demoted) |
illegal_ucn_in_device_identifer | Universal character is not allowed in device entity name (%sq) |
thread_local_for_device_vars | (thread_local on device variables) |
| -- | __global__ function or function template cannot have a parameter with va_list type |
global_qualifier_not_allowed | (execution space qualifier not allowed here) |
Complete Diagnostic Tag Index (286 tags)
The following table lists all 286 CUDA-specific diagnostic tag names extracted from the cudafe++ binary. Each tag can be used with --diag_suppress, --diag_warning, --diag_error, or #pragma nv_diag_suppress / nv_diag_warning / nv_diag_error.
Tags are organized alphabetically within functional groups.
Cross-Space / Execution Space
| Tag Name |
|---|
unsafe_device_call |
Redeclaration
| Tag Name |
|---|
device_function_redeclared_with_global |
device_function_redeclared_with_host |
device_function_redeclared_with_host_device |
device_function_redeclared_without_device |
global_function_redeclared_with_device |
global_function_redeclared_with_host |
global_function_redeclared_with_host_device |
global_function_redeclared_without_global |
host_device_function_redeclared_with_global |
host_function_redeclared_with_device |
host_function_redeclared_with_global |
host_function_redeclared_with_host_device |
__global__ Constraints
| Tag Name |
|---|
bounds_attr_only_on_global_func |
cuda_specifier_twice_in_group |
global_class_decl |
global_exception_spec |
global_friend_definition |
global_func_local_template_arg |
global_function_consteval |
global_function_constexpr |
global_function_deduced_return_type |
global_function_has_ellipsis |
global_function_in_unnamed_inline_ns |
global_function_inline |
global_function_multiple_packs |
global_function_pack_not_last |
global_function_return_type |
global_function_with_initializer_list |
global_lambda_template_arg |
global_new_or_delete |
global_operator_function |
global_param_align_too_big |
global_private_template_arg |
global_private_type_arg |
global_qualifier_not_allowed |
global_ref_param_restrict |
global_rvalue_ref_type |
global_unnamed_type_arg |
global_va_list_type |
local_type_used_in_global_function |
maxnreg_attr_only_on_global_func |
missing_launch_bounds |
template_global_no_def |
Extended Lambda
| Tag Name |
|---|
extended_host_device_generic_lambda |
extended_lambda_array_capture_assignable |
extended_lambda_array_capture_default_constructible |
extended_lambda_array_capture_rank |
extended_lambda_call_operator_local_type |
extended_lambda_call_operator_private_type |
extended_lambda_cant_take_function_address |
extended_lambda_capture_in_constexpr_if |
extended_lambda_capture_local_type |
extended_lambda_capture_private_type |
extended_lambda_constexpr |
extended_lambda_disallowed |
extended_lambda_discriminator |
extended_lambda_enclosing_function_deducible |
extended_lambda_enclosing_function_generic_lambda |
extended_lambda_enclosing_function_hd_lambda |
extended_lambda_enclosing_function_local |
extended_lambda_enclosing_function_not_found |
extended_lambda_hd_init_capture |
extended_lambda_illegal_parent |
extended_lambda_inaccessible_ancestor |
extended_lambda_inaccessible_parent |
extended_lambda_init_capture_array |
extended_lambda_init_capture_initlist |
extended_lambda_inside_constexpr_if |
extended_lambda_multiple_parameter_packs |
extended_lambda_multiple_parent |
extended_lambda_nest_parent_template_param_unnamed |
extended_lambda_no_parent_func |
extended_lambda_pack_capture |
extended_lambda_parent_class_unnamed |
extended_lambda_parent_local_type |
extended_lambda_parent_non_extern |
extended_lambda_parent_private_template_arg |
extended_lambda_parent_private_type |
extended_lambda_parent_template_param_unnamed |
extended_lambda_reference_capture |
extended_lambda_too_many_captures |
this_addr_capture_ext_lambda |
Device Code
| Tag Name |
|---|
addr_of_label_in_device_func |
asm_constraint_letter_not_allowed_in_device |
auto_device_fn_ref |
cuda_device_code_unsupported_operator |
cuda_xasm_strict_placeholder_format |
illegal_ucn_in_device_identifer |
no_strict_cuda_error |
thread_local_in_device_code |
undefined_device_entity |
undefined_device_identifier |
unrecognized_pragma_device_code |
unsupported_type_in_device_code |
Device Function
| Tag Name |
|---|
device_func_tex_arg |
device_function_has_ellipsis |
no_host_device_initializer_list |
no_host_device_move_forward |
Kernel Launch
| Tag Name |
|---|
device_launch_no_sepcomp |
device_side_launch_arg_with_user_provided_cctor |
device_side_launch_arg_with_user_provided_dtor |
missing_api_for_device_side_launch |
Variable Access
| Tag Name |
|---|
device_var_address_taken_in_host |
device_var_constexpr |
device_var_read_in_host |
device_var_structured_binding |
device_var_written_in_host |
device_variable_in_unnamed_inline_ns |
host_var_address_taken_in_device |
host_var_read_in_device |
host_var_written_in_device |
illegal_local_to_device_function |
illegal_local_to_host_function |
Variable Template
| Tag Name |
|---|
variable_template_func_local_template_arg |
variable_template_lambda_template_arg |
variable_template_private_template_arg |
variable_template_private_type_arg |
variable_template_unnamed_type_template_arg |
__managed__
| Tag Name |
|---|
decltype_of_managed_variable |
managed_cant_be_shared_constant |
managed_const_type_not_allowed |
managed_reference_type_not_allowed |
unsupported_arch_for_managed_capability |
unsupported_configuration_for_managed_capability |
__grid_constant__
| Tag Name |
|---|
grid_constant_incompat_instantiation_directive |
grid_constant_incompat_redecl |
grid_constant_incompat_specialization |
grid_constant_incompat_templ_redecl |
grid_constant_non_kernel |
grid_constant_not_const |
grid_constant_reference_type |
grid_constant_unsupported_arch |
Atomics
| Tag Name |
|---|
floating_type_for_logical_atomic_operation |
invalid_data_size_for_nv_atomic_generic_function |
invalid_nv_atomic_cas_size |
invalid_nv_atomic_exch_size |
invalid_nv_atomic_memory_order_value |
invalid_nv_atomic_operation_add_sub_size |
invalid_nv_atomic_operation_max_min_float |
invalid_nv_atomic_operation_size |
invalid_nv_atomic_thread_scope_value |
non_integral_type_for_non_generic_nv_atomic_function |
nv_atomic_add_sub_f64_not_supported |
nv_atomic_cas_b16_not_supported |
nv_atomic_exch_cas_b128_not_supported |
nv_atomic_function_address_taken |
nv_atomic_function_no_args |
nv_atomic_functions_not_supported_below_sm60 |
nv_atomic_load_order_error |
nv_atomic_load_store_b128_version_too_low |
nv_atomic_load_store_scope_cluster_change_to_device |
nv_atomic_operation_not_in_device_function |
nv_atomic_operation_order_not_constant_int |
nv_atomic_operation_scope_not_constant_int |
nv_atomic_operations_memory_order_fallback_to_membar |
nv_atomic_operations_scope_cluster_change_to_device |
nv_atomic_operations_scope_fallback_to_membar |
nv_atomic_store_order_error |
JIT Mode
| Tag Name |
|---|
host_closure_class_in_jit |
no_host_in_jit |
unannotated_function_in_jit |
unannotated_static_data_member_in_jit |
unannotated_variable_in_jit |
RDC / Whole-Program
| Tag Name |
|---|
extern_kernel_template |
template_global_no_def |
#pragma nv_abi
| Tag Name |
|---|
nv_abi_pragma_bad_format |
nv_abi_pragma_device_function |
nv_abi_pragma_device_function_context |
nv_abi_pragma_duplicate_arg |
nv_abi_pragma_invalid_option |
nv_abi_pragma_missing_arg |
nv_abi_pragma_next_construct |
nv_abi_pragma_not_constant |
nv_abi_pragma_not_positive_value |
nv_abi_pragma_overflow_value |
__nv_register_params__
| Tag Name |
|---|
register_params_ellipsis_function |
register_params_not_enabled |
register_params_unsupported_arch |
register_params_unsupported_function |
name_expr
| Tag Name |
|---|
name_expr_extra_tokens |
name_expr_internal_error |
name_expr_non_device_variable |
name_expr_non_global_routine |
name_expr_not_routine_or_variable |
name_expr_parsing |
Texture / Surface
| Tag Name |
|---|
addr_of_text_surf_expr_in_device_func |
addr_of_text_surf_var_in_device_func |
indir_into_text_surf_expr_in_device_func |
indir_into_text_surf_var_in_device_func |
reference_to_text_surf_type_in_device_func |
reference_to_text_surf_var_in_device_func |
texture_surface_variable_in_unnamed_inline_ns |
__wgmma_mma_async
| Tag Name |
|---|
wgmma_mma_async_bad_shape |
wgmma_mma_async_missing_args |
wgmma_mma_async_nonconstant_arg |
wgmma_mma_async_not_enabled |
__block_size__ / __cluster_dims__
| Tag Name |
|---|
block_size_must_be_positive |
block_size_unsupported |
cluster_dims_must_be_positive |
cluster_dims_too_large |
cluster_dims_unsupported |
conflict_between_cluster_dim_and_block_size |
shared_block_size_must_be_positive |
shared_block_size_too_large |
Miscellaneous
| Tag Name |
|---|
cuda_demote_unsupported_floating_point |
cuda_displaced_new_or_delete_operator |
thread_local_for_device_vars |
Internal Representation
Each CUDA error message is stored as a const char* entry in the error template table at off_88FAA0. The diagnostic tag names are stored in a separate string-to-integer lookup table; the tag name resolver (sub_4ED240 and related functions) performs a binary search on this table to match tag strings against internal error codes.
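The lookup described above can be sketched with the standard bsearch over a table sorted by tag name. This is only an illustration of the technique, not the decompiled code: the names tag_entry, tag_table, compare_tag, and lookup_tag are hypothetical, and the numeric codes are placeholders rather than real error numbers. Only the tag strings are taken from the tables on this page.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table entry: tag string plus internal error code. */
struct tag_entry {
    const char *name;
    int         code;   /* placeholder values, not the real error numbers */
};

/* Must stay sorted by name (strcmp order) for bsearch to work. */
static const struct tag_entry tag_table[] = {
    { "block_size_unsupported",      1 },
    { "cluster_dims_unsupported",    2 },
    { "name_expr_parsing",           3 },
    { "register_params_not_enabled", 4 },
    { "unsafe_device_call",          5 },
};

/* bsearch comparator: key is the tag string, elem is a table entry. */
static int compare_tag(const void *key, const void *elem)
{
    const struct tag_entry *e = elem;
    return strcmp((const char *)key, e->name);
}

/* Returns the code for a tag name, or -1 if the tag is unknown. */
int lookup_tag(const char *name)
{
    const struct tag_entry *hit =
        bsearch(name, tag_table,
                sizeof tag_table / sizeof tag_table[0],
                sizeof tag_table[0], compare_tag);
    return hit ? hit->code : -1;
}
```

A resolver of this shape costs O(log n) string comparisons per tag, which is why the table has to be kept in sorted order.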
The format specifiers embedded in CUDA error messages use the same system as EDG base errors:
| Specifier | Meaning | Example in CUDA messages |
|---|---|---|
%sq | Quoted entity name | Function name in cross-space call |
%sq1, %sq2 | Indexed quoted names | Caller and callee in call errors |
%no1 | Entity name (omit kind) | Function name in redeclaration |
%n1, %n2 | Entity names | Override base/derived pair |
%nd | Entity name with decl location | Template parameter |
%s, %s1, %s2 | String fill-in | Execution space keyword |
%t | Type fill-in | Type name in template arg errors |
%p | Source position | Previous declaration location |
For full format specifier documentation, see Format Specifiers.
Format Specifiers
The cudafe++ diagnostic system uses a custom format specifier language -- not printf -- to expand parameterized error messages. The expansion engine is process_fill_in (sub_4EDCD0, 1,202 decompiled lines in error.c), called by write_message_to_buffer (sub_4EF620, 159 lines) during template string expansion. Each diagnostic record carries a linked list of typed fill-in entries that supply the actual values -- type nodes, entity pointers, strings, integers, source positions -- which the format engine renders into the final message text.
This page documents the specifier syntax, the fill-in kind system, entity-kind dispatch, suffix options, numeric indexing, and the labeled fill-in mechanism.
Specifier Syntax
When write_message_to_buffer walks an error template string (looked up from off_88FAA0[error_code]), it recognizes three format constructs:
| Syntax | Meaning | Example |
|---|---|---|
%% | Literal % character | "100%% complete" |
%XY...Zn | Fill-in specifier: letter X, options Y...Z, index n | %nfd2, %sq1, %t |
%[label] | Named label fill-in reference | %[class_or_struct] |
Positional Specifier Parsing
The parser (sub_4EF620, error.c:4703) processes %XY...Zn specifiers as follows:
// After seeing '%', read next char as specifier letter
char spec_letter = template[pos + 1]; // 'T', 'd', 'n', 'p', 'r', 's', 't', 'u'
pos += 2;
// Collect option characters (a-z, A-Z) into buffer, max 29
int opt_count = 0;
char options[30];
while (true) {
char c = template[pos];
if (c >= '0' && c <= '9') {
// Trailing digit = fill-in index (1-based)
fill_in_index = c - '0';
break;
}
if ((c & 0xDF) < 'A' || (c & 0xDF) > 'Z') {
// Not a letter -- end of specifier, index defaults to 1
fill_in_index = 1;
break;
}
options[opt_count++] = c;
if (opt_count > 29)
assertion_handler("error.c", 4739,
"write_message_to_buffer",
"construct_text_message:",
"too many option characters");
pos++;
}
options[opt_count] = '\0';
process_fill_in(diagnostic_record, spec_letter, options, fill_in_index);
The maximum of 29 option characters is enforced by an assertion. In practice, specifiers use 0--3 option characters.
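The parsing loop can be restated as a small standalone function. This is a sketch under stated assumptions: the names parsed_spec, parse_specifier, and MAX_OPTIONS are hypothetical, the letter test is written as A..Z to match the "(a-z, A-Z)" comment in the excerpt, the trailing digit is assumed to be consumed, and the overflow case returns 0 where the real code asserts.

```c
#include <stddef.h>

#define MAX_OPTIONS 29   /* limit enforced by the assertion at error.c:4739 */

/* Hypothetical result record for one parsed %XY...Zn specifier. */
struct parsed_spec {
    char letter;                    /* specifier letter, e.g. 'n' or 's' */
    char options[MAX_OPTIONS + 1];  /* option characters, NUL-terminated */
    int  index;                     /* fill-in index, defaults to 1 */
};

/*
 * Parse the specifier that starts at tmpl[pos], which must be '%'
 * followed by a specifier letter. Returns the position just past the
 * specifier, or 0 if there are too many option characters.
 */
size_t parse_specifier(const char *tmpl, size_t pos, struct parsed_spec *out)
{
    size_t n = 0;

    out->letter = tmpl[pos + 1];
    out->index = 1;
    pos += 2;

    for (;;) {
        char c = tmpl[pos];
        char upper = (char)(c & 0xDF);      /* fold ASCII lower case to upper */

        if (c >= '0' && c <= '9') {         /* trailing digit selects the index */
            out->index = c - '0';
            pos++;                          /* assumption: the digit is consumed */
            break;
        }
        if (upper < 'A' || upper > 'Z')     /* not a letter: specifier ends */
            break;
        if (n == MAX_OPTIONS) {             /* "too many option characters" */
            out->options[n] = '\0';
            return 0;
        }
        out->options[n++] = c;
        pos++;
    }
    out->options[n] = '\0';
    return pos;
}
```

For the template fragment %nfd2 this yields letter 'n', options "fd", and index 2. A bare %t followed by punctuation yields empty options and the default index 1.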
Fill-In Kinds
The specifier letter maps to a fill-in kind value through a switch on (letter - 84) in process_fill_in (sub_4EDCD0, error.c:4297):
| Letter | ASCII | letter - 84 | Kind | Payload Type | Description |
|---|---|---|---|---|---|
%T | 84 | 0 | 6 | Type node pointer | Type name, uppercase rendering ("<int, float>") |
%d | 100 | 16 | 0 | int64 | Signed decimal integer |
%n | 110 | 26 | 4 | Entity node pointer | Entity/symbol name with rich formatting |
%p | 112 | 28 | 2 | Source position cookie | Source file + line reference |
%r | 114 | 30 | 7 | byte + pointer | Template parameter reference |
%s | 115 | 31 | 3 | const char* | Plain string |
%t | 116 | 32 | 5 | Type node pointer | Type name, lowercase rendering ("int") |
%u | 117 | 33 | 1 | uint64 | Unsigned decimal integer |
Any other letter triggers the assertion: "process_fill_in: bad fill-in kind" (error.c:4297).
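The mapping in the table can be written out as a switch of the same shape. This is a minimal sketch: the function name fill_in_kind_for_letter is hypothetical, and it returns -1 where the real code raises the "bad fill-in kind" assertion.

```c
/*
 * Map a specifier letter to its fill-in kind using the same
 * (letter - 84) switch shape as the table above.
 * Returns -1 for any letter that is not a known specifier.
 */
int fill_in_kind_for_letter(char letter)
{
    switch (letter - 84) {      /* 84 == 'T' */
    case 0:  return 6;          /* %T  type, uppercase rendering */
    case 16: return 0;          /* %d  signed decimal */
    case 26: return 4;          /* %n  entity name */
    case 28: return 2;          /* %p  source position */
    case 30: return 7;          /* %r  template parameter reference */
    case 31: return 3;          /* %s  string */
    case 32: return 5;          /* %t  type, lowercase rendering */
    case 33: return 1;          /* %u  unsigned decimal */
    default: return -1;
    }
}
```

Subtracting 84 keeps the case values in the range 0..33, which a compiler can lower to a compact jump table.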
Usage Frequency Across 3,795 Templates
Measured across all error message templates in off_88FAA0:
| Specifier | Occurrences | Typical Context |
|---|---|---|
%s | ~470 | String fragments: attribute names, keyword text, flag names |
%t | ~241 | Type names in mismatch diagnostics |
%sq | ~233 | Quoted string fragments in CUDA cross-space messages |
%n | ~179 | Entity names: function, variable, class, template |
%p | ~76 | Source positions: "declared at line N of file.cu" |
%d | ~60 | Numeric values: counts, limits, sizes |
%T | ~40 | Type template parameter lists |
%u | ~20 | Unsigned counts |
%r | ~10 | Template parameter back-references |
Fill-In Entry Layout
Each fill-in entry is a 40-byte node allocated from a pool (qword_106B490) or heap by alloc_fill_in_entry (sub_4F2DE0):
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | kind | Fill-in kind (0--7, from specifier letter mapping) |
| 4 | 1 | used_flag | Set to 1 when consumed during expansion |
| 5 | 3 | (padding) | -- |
| 8 | 8 | next | Next fill-in in linked list |
| 16 | 8+ | payload | Union, varies by kind (see below) |
Payload Layout by Kind
Kind 0 (decimal, %d) / Kind 1 (unsigned, %u) / Kind 3 (string, %s) / Kind 5 (type, %t) / Kind 6 (type, %T):
| Offset | Size | Field |
|---|---|---|
| 16 | 8 | value -- int64 for kind 0, uint64 for kind 1, const char* for kind 3, type node pointer for kind 5/6 |
Kind 2 (position, %p):
| Offset | Size | Field |
|---|---|---|
| 16 | 8 | position_cookie -- initialized to qword_126EFB8 (current source position) at allocation time |
Kind 4 (entity name, %n):
| Offset | Size | Field |
|---|---|---|
| 16 | 8 | entity_ptr -- pointer to entity node |
| 24 | 4 | scope_index -- initialized to 0xFFFFFFFF (invalid) |
| 28 | 1 | full_qualification_flag |
| 29 | 1 | original_name_flag |
| 30 | 1 | parameter_list_flag |
| 31 | 1 | template_function_flag |
| 32 | 1 | definition_flag |
| 33 | 1 | alternate_original_flag |
| 34 | 1 | template_only_flag |
Kind 7 (%r):
| Offset | Size | Field |
|---|---|---|
| 16 | 1 | param_byte |
| 17 | 7 | (padding) |
| 24 | 8 | template_scope_ptr |
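The layout tables above can be cross-checked by expressing them as a C struct and letting the compiler verify the offsets. This is a sketch assuming an x86-64 LP64 target. The struct name fill_in_entry, the union member names, and the exact field types are assumptions chosen to match the documented sizes. They are not recovered type information.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the 40-byte fill-in node (x86-64, LP64 assumed). */
struct fill_in_entry {
    int32_t  kind;                      /* offset 0: fill-in kind, 0..7 */
    uint8_t  used_flag;                 /* offset 4: set to 1 once consumed */
    uint8_t  pad[3];                    /* offset 5 */
    struct fill_in_entry *next;         /* offset 8 */
    union {                             /* offset 16: varies by kind */
        int64_t     value;              /* kinds 0 and 1 */
        uint64_t    position_cookie;    /* kind 2 */
        const char *string;             /* kind 3 */
        void       *type_node;          /* kinds 5 and 6 */
        struct {                        /* kind 4 */
            void    *entity_ptr;                /* offset 16 */
            uint32_t scope_index;               /* offset 24, starts as 0xFFFFFFFF */
            uint8_t  full_qualification_flag;   /* offset 28 */
            uint8_t  original_name_flag;        /* offset 29 */
            uint8_t  parameter_list_flag;       /* offset 30 */
            uint8_t  template_function_flag;    /* offset 31 */
            uint8_t  definition_flag;           /* offset 32 */
            uint8_t  alternate_original_flag;   /* offset 33 */
            uint8_t  template_only_flag;        /* offset 34 */
        } entity;
        struct {                        /* kind 7 */
            uint8_t  param_byte;                /* offset 16 */
            void    *template_scope_ptr;        /* offset 24 */
        } tparam;
    } payload;
};

/* Compile-time checks against the documented offsets and total size. */
_Static_assert(sizeof(struct fill_in_entry) == 40, "node is 40 bytes");
_Static_assert(offsetof(struct fill_in_entry, used_flag) == 4, "used_flag");
_Static_assert(offsetof(struct fill_in_entry, next) == 8, "next");
_Static_assert(offsetof(struct fill_in_entry, payload) == 16, "payload");
_Static_assert(offsetof(struct fill_in_entry, payload.entity.scope_index) == 24, "scope_index");
_Static_assert(offsetof(struct fill_in_entry, payload.entity.template_only_flag) == 34, "template_only");
_Static_assert(offsetof(struct fill_in_entry, payload.tparam.template_scope_ptr) == 24, "template_scope_ptr");
```

The kind 4 payload ends at offset 35, and alignment padding rounds the node up to the documented 40 bytes.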
Fill-In Linked List
Fill-in entries attach to the diagnostic record as a singly-linked list:
- Head pointer: diagnostic record offset 184 (fill_in_list_head)
- Tail pointer: diagnostic record offset 192 (fill_in_list_tail)
When process_fill_in searches for a matching entry, it walks the list from head, looking for the first entry where node->kind == requested_kind. If the specifier includes an index (e.g., %t2), it skips index - 1 matching entries before consuming the target:
const __m128i *node = *(diagnostic + 184); // fill_in_list_head
if (!node)
goto fill_in_not_found;
while (node->kind != requested_kind || --index > 0) {
node = node->next; // offset 8
if (!node)
goto fill_in_not_found;
}
node->used_flag = 1; // mark consumed (offset 4)
// proceed with kind-specific rendering
If no matching entry is found, process_fill_in triggers an assertion with a diagnostic message identifying the missing fill-in: "specified fill-in (%X, N) not found for error string: \"...\"" (error.c:4317).
After all format specifiers have been expanded, construct_text_message (sub_4EF9D0) iterates the entire fill-in list and asserts that every entry has used_flag == 1. An unconsumed fill-in triggers: "construct_text_message: not all fill-ins used for error string: \"...\"" (error.c:4781).
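The search and the final consistency check can be sketched on a reduced node type. This is a minimal illustration: the struct fill_in and the function names find_fill_in and all_fill_ins_used are hypothetical, and the functions return NULL or 0 where the real code raises the assertions quoted above.

```c
#include <stddef.h>

/* Reduced, hypothetical node: only the fields the search touches. */
struct fill_in {
    int             kind;
    int             used_flag;
    struct fill_in *next;
};

/* Return the index-th entry of the given kind and mark it used; NULL if absent. */
struct fill_in *find_fill_in(struct fill_in *head, int kind, int index)
{
    struct fill_in *node = head;

    while (node != NULL) {
        /* Same short-circuit as the decompiled loop: only matching
           entries decrement the counter, so index 0 and index 1 both
           stop at the first match. */
        if (node->kind == kind && --index <= 0) {
            node->used_flag = 1;
            return node;
        }
        node = node->next;
    }
    return NULL;    /* real code asserts: "specified fill-in ... not found" */
}

/* Post-expansion check: 1 if every entry was consumed, else 0. */
int all_fill_ins_used(const struct fill_in *head)
{
    for (; head != NULL; head = head->next)
        if (!head->used_flag)
            return 0;
    return 1;
}
```

Because entries of other kinds do not decrement the counter, %t2 means "the second type fill-in", regardless of how many string or entity fill-ins sit between the two type entries in the list.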
Numeric Indexing
When a template string must reference multiple fill-ins of the same kind, a trailing digit selects which one:
| Specifier | Meaning |
|---|---|
%t | First type fill-in (index 1, default) |
%t1 | First type fill-in (index 1, explicit) |
%t2 | Second type fill-in (index 2) |
%n1 | First entity name fill-in |
%n2 | Second entity name fill-in |
%sq1 | First string fill-in, quoted |
%sq2 | Second string fill-in, quoted |
The index is a single digit 0--9. Index 0 behaves identically to index 1 (the counter is pre-decremented before comparison). In practice, most templates use indices 1 and 2; a few use up to 3.
Real template example (CUDA cross-space call, error 3499):
calling a __device__ function(%sq1) from a __host__ function(%sq2) is not allowed
Here %sq1 and %sq2 are both kind 3 (string) with option q (quoted), selecting the first and second string fill-ins respectively. The caller attaches two string fill-ins -- the called function's name and the calling function's name.
Suffix Options
String Options (%s)
The %s specifier accepts only one option character: q for quoted output.
| Form | Rendering |
|---|---|
%s | Raw string: foo |
%sq | Quoted string: "foo" |
The q option wraps the string in double-quote characters (") and applies colorization if enabled (quote category, code 6 = bold). Any other option character on %s triggers: "process_fill_in: bad option" (error.c:4364).
Multiple q characters are permitted syntactically (the parser loops over all option chars validating each is q) but have no additional effect -- only one layer of quoting is applied.
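Combining the numeric index with the q option, a reduced expander for string fill-ins only might look like the following. This is a sketch, not the real engine: the function name expand_strings and its array-based interface are hypothetical, colorization is omitted, and errors are reported with -1 where the real code asserts.

```c
#include <stddef.h>
#include <string.h>

/*
 * Expand %s / %sq specifiers (with an optional index digit) and "%%"
 * from tmpl into out. strings[i] is the (i+1)-th string fill-in.
 * Returns 0 on success, -1 on a missing fill-in, an unsupported
 * specifier, or a full buffer.
 */
int expand_strings(const char *tmpl, const char *const *strings, int count,
                   char *out, size_t cap)
{
    size_t len = 0;

    while (*tmpl != '\0') {
        if (tmpl[0] != '%') {                   /* ordinary character */
            if (len + 1 >= cap) return -1;
            out[len++] = *tmpl++;
            continue;
        }
        if (tmpl[1] == '%') {                   /* "%%" is a literal percent */
            if (len + 1 >= cap) return -1;
            out[len++] = '%';
            tmpl += 2;
            continue;
        }
        if (tmpl[1] != 's') return -1;          /* only %s is handled here */
        tmpl += 2;

        int quoted = 0, index = 1;
        while (*tmpl == 'q') { quoted = 1; tmpl++; }    /* repeated q adds nothing */
        if (*tmpl >= '0' && *tmpl <= '9') index = *tmpl++ - '0';
        if (index == 0) index = 1;              /* index 0 behaves like index 1 */
        if (index > count) return -1;

        const char *s = strings[index - 1];
        size_t slen = strlen(s);
        if (len + slen + (quoted ? 2 : 0) >= cap) return -1;
        if (quoted) out[len++] = '"';
        memcpy(out + len, s, slen);
        len += slen;
        if (quoted) out[len++] = '"';
    }
    if (cap == 0) return -1;
    out[len] = '\0';
    return 0;
}
```

Applied to the error 3499 template with the fill-ins "callee" and "caller", this produces the message with both names wrapped in double quotes.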
Entity Name Options (%n)
The %n specifier accepts a rich set of option suffixes that control how an entity is rendered. Options are processed left-to-right, setting flags on the fill-in entry's flag bytes (offsets 28--34):
| Option | Flag Byte | Effect |
|---|---|---|
f | offset 28 (full_qualification) | Show fully-qualified name with namespace/class scope chain |
o | offset 29 (original_name) | Omit the entity kind prefix (suppress "function ", "variable ", etc.) |
p | offset 30 (parameter_list) | Show function parameter types in signature |
t | offset 31 + offset 28 | Show template arguments AND full qualification (sets both flags) |
a | offset 29 + offset 33 | Show original name AND alternate/accessibility info |
d | offset 32 (definition) | Append declaration location: " (declared at line N of file.cu)" |
T | offset 34 (template_only) | Show template specialization context: " (from translation unit ...)" |
Options can be combined. Common combinations from the error template table:
| Specifier | Rendering Example |
|---|---|
%n | function "foo" |
%no | "foo" (no kind prefix) |
%nf | function "ns::cls::foo" (fully qualified) |
%nfd | function "ns::cls::foo" (declared at line 42 of bar.cu) |
%nt | function "ns::cls::foo<int>" (full + template args) |
%np | function "foo" [with parameters shown] |
%nT | function "foo" (from translation unit bar.cu) |
%na | "foo" based on template argument(s) ... |
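The option-to-flag mapping in the tables above can be sketched as a single left-to-right pass over the option string. The struct entity_flags and the function apply_entity_options are hypothetical names. Returning -1 for an unknown option character is an assumption of this sketch, since the page does not state how the real code reacts to one on %n.

```c
/* Hypothetical mirror of the flag bytes at fill-in offsets 28..34. */
struct entity_flags {
    unsigned char full_qualification;   /* offset 28 */
    unsigned char original_name;        /* offset 29 */
    unsigned char parameter_list;       /* offset 30 */
    unsigned char template_function;    /* offset 31 */
    unsigned char definition;           /* offset 32 */
    unsigned char alternate_original;   /* offset 33 */
    unsigned char template_only;        /* offset 34 */
};

/* Apply option characters left to right. Returns 0 on success, -1 on an unknown option. */
int apply_entity_options(const char *options, struct entity_flags *f)
{
    for (; *options != '\0'; options++) {
        switch (*options) {
        case 'f': f->full_qualification = 1; break;
        case 'o': f->original_name = 1; break;
        case 'p': f->parameter_list = 1; break;
        case 't': f->template_function = 1;     /* 't' sets two flags */
                  f->full_qualification = 1; break;
        case 'a': f->original_name = 1;         /* 'a' sets two flags */
                  f->alternate_original = 1; break;
        case 'd': f->definition = 1; break;
        case 'T': f->template_only = 1; break;
        default:  return -1;                    /* assumption: reject unknown options */
        }
    }
    return 0;
}
```

Since every option only sets flags, the order of the option characters does not change the result: %nfd and %ndf select the same rendering.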
No Options for Other Kinds
The %d, %u, %p, %t, %T, and %r specifiers reject all option characters:
if (*options != '\0')
assertion_handler("error.c", 4372,
"process_fill_in",
"process_fill_in: bad option", NULL);
Kind-Specific Rendering
Kind 0 -- Signed Decimal (%d)
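Renders the 64-bit signed integer payload using snprintf(buf, 20, "%lli", value), then writes the result to the output buffer. A size argument of 20 leaves room for 19 characters plus the terminating NUL, which covers every non-negative int64_t value and negative values of up to 18 digits. A 19-digit negative value such as INT64_MIN (20 characters with the sign) would lose its last digit.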
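The effect of the size argument can be checked with a short experiment. This sketch assumes the call is exactly snprintf(buf, 20, "%lli", value) as described. The wrapper name format_signed_20 is hypothetical.

```c
#include <stdio.h>

/*
 * Format value with a size argument of 20, as the page describes.
 * Returns snprintf's result: the length the full text would need.
 * A result of 20 or more means the output was truncated, because
 * at most 19 characters plus the NUL are stored.
 */
int format_signed_20(long long value, char buf[20])
{
    return snprintf(buf, 20, "%lli", value);
}
```

For the largest positive value the return is 19 and the text fits. For the most negative value the return is 20 while only 19 characters are stored, so the final digit is dropped.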
Kind 1 -- Unsigned Decimal (%u)
Formats the payload through sub_4F63D0, which renders the unsigned 64-bit value into a dynamically-sized string buffer.
Kind 2 -- Source Position (%p)
Calls sub_4F6820 (form_source_position) with the position cookie from the fill-in payload. The rendering includes:
- File name (via sub_5B15D0 for display formatting)
- Line number
- Contextual text supplied by the caller through three string arguments (prefix, suffix, end-of-source fallback)
The caller passes context strings like " (declared ", ")", "(at end of source)" to frame the position reference. When the position resolves to line 0 or the file is "-" (stdin), alternate formats are used.
Kind 3 -- String (%s / %sq)
Without the q option, writes the string pointer payload directly to the output buffer via strlen + sub_6B9CD0 (buffer append).
With the q option, wraps the string in double quotes with colorization:
if (colorization_active)
emit_escape(buffer, 6); // quote color (bold)
write_char(buffer, '"');
write_string(buffer, payload);
if (colorization_active)
emit_escape(buffer, 1); // reset
write_char(buffer, '"');
Kind 5 -- Type, Lowercase (%t)
Renders the type node through the type formatting subsystem. The rendering pipeline:
- Set byte_10678FA = 1 (name lookup kind = type display mode)
- Write opening "
- Call sub_600740 (format type for display) with the type node and the entity formatter callback table (qword_1067860)
- Write closing "
- Check via sub_7BE9C0 if the type has an "aka" (also-known-as) desugared form
- If yes, append ' (aka "desugared_type")' -- comparing the rendered forms to avoid redundant output when they are identical
The aka check compares the rendered text of the original type against the desugared type. If they produce identical strings (same length, same content via strncmp), the aka suffix is suppressed by truncating the buffer back to the pre-aka position.
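The append-then-truncate pattern can be sketched as follows. This is a simplification under stated assumptions: the function name render_type_with_aka is hypothetical, and it takes the two renderings as ready-made strings, whereas the real code formats type nodes into the output buffer and compares the rendered text there.

```c
#include <stddef.h>
#include <string.h>

/*
 * Write "name" and, when the desugared spelling differs, an aka suffix.
 * Returns the number of characters written (excluding the NUL), or 0 if
 * out is too small for the worst case.
 */
size_t render_type_with_aka(const char *name, const char *desugared,
                            char *out, size_t cap)
{
    size_t n = strlen(name), d = strlen(desugared);

    /* Worst case: "name" (aka "desugared") plus the terminating NUL. */
    if (cap < n + d + 12) return 0;

    size_t len = 0;
    out[len++] = '"';
    memcpy(out + len, name, n); len += n;
    out[len++] = '"';

    size_t mark = len;                      /* pre-aka position */
    memcpy(out + len, " (aka \"", 7); len += 7;
    memcpy(out + len, desugared, d); len += d;
    memcpy(out + len, "\")", 2); len += 2;

    /* Identical renderings: drop the suffix by truncating back to mark. */
    if (n == d && strncmp(name, desugared, n) == 0)
        len = mark;

    out[len] = '\0';
    return len;
}
```

Writing the suffix first and truncating afterwards avoids rendering the desugared type into a separate scratch buffer just to compare it.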
Kind 6 -- Type, Uppercase (%T)
Renders a type template argument list in angle brackets:
write_string(buffer, "\"<");
// Walk the template argument linked list
for (arg = payload; arg != NULL; arg = arg->next) {
if (arg->kind != 3) // skip pack expansion markers
format_template_argument(arg, &entity_formatter);
if (arg->next && arg->next->kind != 3)
write_string(buffer, ", ");
}
write_string(buffer, ">\"");
Template argument entries with kind == 3 (at byte offset +8) are pack-expansion markers and are skipped during rendering.
Kind 7 -- Template Parameter Reference (%r)
Renders a template parameter by looking up the parameter entity through sub_5B9EE0 (entity lookup by scope + index). If found and non-null, renders via sub_4F3970 (unqualified entity name). Otherwise, falls back to sub_6011F0 (generic template parameter formatting).
Entity Kind Dispatch (%n)
When processing %n specifiers, process_fill_in reads the entity kind byte at offset 80 of the entity node and dispatches to kind-specific rendering logic. The function first resolves through projection indirection: if entity_kind == 16 (typedef), it follows the pointer at entity->info_ptr->pointed_to; if entity_kind == 24 (resolved namespace alias), it follows entity->info_ptr.
The dispatch handles 25 entity kind values (0--24, with gaps at 14/15/16/24 handled as special cases):
| Entity Kind | Value | Kind Label String | Index in off_88FAA0 | Rendering Logic |
|---|---|---|---|---|
| keyword | 0 | (none -- literal "keyword") | -- | Write keyword ", then the keyword's name string from entity->name_sym->name |
| concept | 1 | (from table) | 1462 | Simple: write kind label + quoted name |
| constant template parameter | 2 | "constant" or "nontype" | -- | Check template parameter subkind: type_kind 14 with subkind 2 = "nontype", else "constant" |
| template parameter | 3 | (from table) | 1464 or 1465 | Check whether the template parameter is a type parameter (type_kind != 14) → index 1465, else 1464 |
| class | 4 | (from table, CUDA-aware) | 1466--1468 | CUDA mode: 1467 or 1468 (class vs struct); non-CUDA: 1466 |
| struct | 5 | (same as class) | 1466--1468 | Same dispatch as class, differentiated by v46 != 5 |
| enum | 6 | (from table) | 1472 | Simple: write kind label + quoted name |
| variable | 7 | "variable" or "handler parameter" | 1474 or 1475 | Check handler-parameter flag (offset 163, bit 0x40). If set: "handler parameter" (index 1474). If variable is a structured binding (offset 162, bit 1): use index 2937. Otherwise: "variable" (index 1475) with optional template context |
| field | 8 | "field" or "member" | 1480 or 1481 | CUDA C++ mode: "member" (index 1480); C mode: "field" (index 1481) |
| member | 9 | "member" | 1480 | Always "member" with optional template context from scope chain |
| function | 10 | "function" or "deduction guide" | 1478 or 2892 | Check linkage kind (offset 166 == 7): deduction guide → index 2892. Otherwise "function" (1478). Walk qualified type chain to strip cv-qualifiers |
| function overload | 11 | (same as function) | 1478 or 2892 | Same dispatch as function (case 10), merged in the switch |
| namespace | 12 | (from table) | 1463 | Simple: write kind label + quoted name |
| label | 13 | (none) | -- | Write quoted name only, no kind prefix, no type info |
| typedef (indirect variable) | 14 | "variable" | 1475 | Dereferences through entity->info_ptr->pointed_to and renders as variable |
| typedef (indirect function) | 15 | "function" | 1478 | Dereferences through entity->info_ptr, extracts function entity + routine info |
| typedef | 16 | -- | -- | Assertion: "form_symbol_summary: projection of projection kind" (error.c:2020). Should have been resolved before dispatch |
| using declaration | 17 | (from table) | 1479 | Simple: write kind label + quoted name |
| parameter | 18 | "parameter" | 1473 | Simple: write "parameter" + quoted name with type info |
| class (anonymous/unnamed) | 19 | (from table) | 1469--1471 or 1889 | Multiple sub-cases: anonymous class bit 0x40 → index 1469; class-template with bit 0x02 → index 1470; deduction_guide bit → index 1889; else index 1471 |
| function template | 20 | "function template" | 1485 (lambda) or kind label | Lambda function (offset 189, bit 0x20): index 1485 with scope entity. Otherwise: "function template" with type and parameter info |
| variable template | 21 | (from table) | 2750 | Simple: write kind label + quoted name |
| alias template | 22 | (from table) | 3050 | Simple: write kind label + quoted name |
| concept template | 23 | (from table) | 1482 | Simple: write kind label + quoted name |
| resolved namespace alias | 24 | -- | -- | Assertion: "form_symbol_summary: projection of projection kind" (same as kind 16). Should have been resolved |
Any entity kind value outside 0--24 hits the default case: "form_symbol_summary: unsupported symbol kind" (error.c:2023). Kinds 16 and 24 lie inside the range but trigger their own assertion, as shown in the table.
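For the rows marked "Simple" in the table, the dispatch reduces to a direct kind-to-index mapping. The sketch below covers only those rows. The function name simple_kind_label_index and the negative return codes are hypothetical stand-ins for the extra checks and assertions in the real code.

```c
/*
 * Label index in off_88FAA0 for the entity kinds the table marks as
 * "Simple". Returns:
 *   -1  for kinds that need flag or mode checks (see the table),
 *   -2  for projection kinds 16 and 24, which must be resolved first,
 *   -3  for values outside 0..24 (unsupported symbol kind).
 */
int simple_kind_label_index(int entity_kind)
{
    if (entity_kind < 0 || entity_kind > 24)
        return -3;

    switch (entity_kind) {
    case 1:  return 1462;   /* concept */
    case 6:  return 1472;   /* enum */
    case 12: return 1463;   /* namespace */
    case 17: return 1479;   /* using declaration */
    case 18: return 1473;   /* parameter */
    case 21: return 2750;   /* variable template */
    case 22: return 3050;   /* alias template */
    case 23: return 1482;   /* concept template */
    case 16:
    case 24: return -2;     /* projection of projection kind */
    default: return -1;     /* class, variable, function, and so on */
    }
}
```

The remaining kinds are left out on purpose, because their label depends on entity flags or on the language mode rather than on the kind value alone.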
Entity Rendering Pipeline
For entity kinds that produce a fully-formatted name (most non-trivial cases), the rendering proceeds through these stages:
1. Write entity kind label string (e.g., "function ")
└── sub_6B9EA0(buffer, kind_label_string)
└── sub_6B9CD0(buffer, " ", 1)
2. Open quote
└── Optional colorization: sub_4ECDD0(buffer, 6) // quote color
└── sub_6B9CD0(buffer, "\"", 1)
3. Render type prefix (if has_type_info and full_qualification)
└── sub_5FE8B0(type_node, 0, 1, 0, 0, &entity_formatter)
4. Render qualified or unqualified name
├── With template context: sub_737A00(entity, &entity_formatter)
└── Without template context: sub_4F3970(entity)
5. Render function parameters (if applicable)
├── Full parameter types: sub_5FB270(type, 0, 0, &entity_formatter)
└── Simple type suffix: sub_6016F0(type, &entity_formatter)
6. Close quote
└── sub_6B9CD0(buffer, "\"", 1)
└── Optional colorization: sub_4ECDD0(buffer, 1) // reset
7. Append accessibility info (if 'a' option)
└── " based on template argument(s) "
└── sub_5FA660(template_arg_list, 0, &entity_formatter)
8. Append declaration location (if 'd' option)
└── sub_4F6820(position, diag, " (declared ", ")", "(at end of source)")
9. Append translation unit info (if 'T' option)
└── " (from translation unit <filename>)"
The original_name flag (o option) suppresses steps 1 and 3, rendering only the bare quoted name without a kind prefix or type qualification. The full_qualification flag (f option) enables step 3 and uses sub_737A00 for fully-qualified name rendering in step 4. The parameter_list flag (p option) forces step 5 to include full parameter-type rendering.
Template Context in Entity Names
When dword_126E274 (show template arguments) is non-zero and the entity has template context, the renderer can walk up the template scope chain:
- Access the entity's routine info (for functions: offset 88 → offset 192 → offset 16)
- Check for the instantiated-from entity (offset 104 of scope info, guarded by !(offset_176 & 1))
- If found, use the instantiated-from entity as the display target
- For class templates (entity_kind == 20): walk the template parameter chain, rendering <param1, param2, ...> with pack-expansion markers (...) for variadic parameters
CUDA-Specific Entity Rendering
Several entity kinds have CUDA-aware rendering paths:
- Class/struct (kinds 4/5): When dword_126EFB4 == 2 (CUDA C++ mode) and the entity has an anonymous flag (offset 161, bit 0x80), rendering jumps to the anonymous-class handler (kind 19) instead
- Field (kind 8): In CUDA C++ mode, the kind label is "member" (index 1480); in C mode, it is "field" (index 1481)
- Class/struct label selection: In CUDA C++ mode, the kind label index is always 1467; in non-CUDA mode, it depends on whether the entity is class vs struct
Labeled Fill-Ins (%[label])
The %[label] syntax references a named fill-in from the label table at off_D481E0. This mechanism allows error templates to include conditional text fragments that vary based on language mode or compilation context.
Label Table Structure
off_D481E0 is an array of 24-byte entries (3 pointers per entry):
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | name | Label name string (e.g., "class_or_struct") |
| 8 | 8 | condition_ptr | Pointer to condition flag (dword) |
| 16 | 4 | true_index | String table index when *condition_ptr != 0 |
| 20 | 4 | false_index | String table index when *condition_ptr == 0 |
Label Lookup Algorithm
// write_message_to_buffer, error.c:4714
char *label_start = template + pos + 2; // skip "%["
char *label_end = strchr(template + pos + 1, ']');
if (!label_end)
assertion_handler("error.c", 4714, "write_message_to_buffer", NULL, NULL);
size_t label_len = label_end - label_start;
// Walk off_D481E0 table
struct label_entry *entry = off_D481E0;
while (entry->name) {
if (strncmp(entry->name, label_start, label_len) == 0) {
// Found matching label
int string_index;
if (*entry->condition_ptr)
string_index = entry->true_index;
else
string_index = entry->false_index;
if (string_index > 3794)
error_text_invalid_code(); // sub_4F2D30
// Expand the referenced string directly into the buffer
const char *text = off_88FAA0[string_index];
write_to_buffer(buffer, text, strlen(text));
pos = label_end + 1;
break;
}
entry++; // advance by 24 bytes
}
if (!entry->name) {
// Label not found -- fatal
fprintf(stderr, "missing fill-in label: %.*s\n", label_len, label_start);
assertion_handler("error.c", 430,
"get_label_fill_in_entry",
"get_label_fill_in_entry: no label fill-in found", NULL);
}
The label table entries reference string indices in the same off_88FAA0 table used for error messages. This allows a single error template to produce different text depending on compilation mode -- for example, using "class" vs "struct" based on a language-mode flag, or "virtual" vs "" based on a feature flag.
The label text is written directly to the output buffer without further format specifier processing -- labels cannot contain nested % specifiers.
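The table layout and lookup described above can be sketched in C. This is a minimal illustration, not recovered code: the names `label_entry` and `lookup_label_index` are hypothetical, the layout assumes an LP64 target, and the sketch requires an exact-length name match, which is slightly stricter than the strncmp prefix comparison in the listing.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of one 24-byte label table entry (LP64 assumed). */
struct label_entry {
    const char *name;          /* offset 0: label name; NULL ends the table */
    const int  *condition_ptr; /* offset 8: flag consulted at expansion time */
    int         true_index;    /* offset 16: index when *condition_ptr != 0 */
    int         false_index;   /* offset 20: index when *condition_ptr == 0 */
};

/* Return the string-table index selected for label[0..len), or -1 if absent. */
int lookup_label_index(const struct label_entry *table,
                       const char *label, size_t len)
{
    for (const struct label_entry *e = table; e->name != NULL; e++) {
        /* Exact match: same length and same bytes. */
        if (strlen(e->name) == len && strncmp(e->name, label, len) == 0)
            return *e->condition_ptr ? e->true_index : e->false_index;
    }
    return -1;
}
```

A caller would pass the text between `%[` and `]` together with its length, then index the message string table with the result.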
Output Buffer
All rendering targets the global message text buffer at qword_106B488:
- Initial allocation: 0x400 bytes (1 KB) via sub_6B98A0
- Dynamic growth: sub_6B9B20 doubles the buffer when capacity is exceeded
- String append: sub_6B9CD0(buffer, data, length) -- the workhorse write function
- String write: sub_6B9EA0(buffer, string) -- convenience wrapper (calls strlen + sub_6B9CD0)
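The allocate, double, and append behavior listed above can be illustrated with a small growable buffer. This is a sketch under assumptions: the struct fields and the `tb_*` function names are hypothetical and do not reflect the real buffer object's layout.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable text buffer; field names are illustrative only. */
struct text_buffer {
    char  *data;
    size_t length;
    size_t capacity;
};

/* Allocate a buffer with the given initial capacity (0x400 in the text). */
int tb_init(struct text_buffer *b, size_t initial_capacity)
{
    b->data = malloc(initial_capacity);
    b->length = 0;
    b->capacity = b->data ? initial_capacity : 0;
    return b->data ? 0 : -1;
}

/* Append n bytes, doubling the capacity until the data fits. */
int tb_append(struct text_buffer *b, const char *src, size_t n)
{
    size_t needed = b->length + n;
    if (needed > b->capacity) {
        size_t new_cap = b->capacity ? b->capacity : 1;
        while (new_cap < needed)
            new_cap *= 2;
        char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->capacity = new_cap;
    }
    memcpy(b->data + b->length, src, n);
    b->length = needed;
    return 0;
}

/* Convenience wrapper: strlen followed by tb_append. */
int tb_write_string(struct text_buffer *b, const char *s)
{
    return tb_append(b, s, strlen(s));
}
```

Doubling keeps the amortized cost of each append constant, which suits a buffer that is filled by many short writes.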
The entity display callback infrastructure at qword_1067860 allows the type/name formatting subsystem to write to the same buffer through an indirect call:
| Variable | Address | Purpose |
|---|---|---|
| qword_1067860 | 0x1067860 | Entity formatter callback (set to sub_5B29C0) |
| qword_1067870 | 0x1067870 | Entity formatter output buffer (set to qword_106B488) |
| byte_10678F1 | 0x10678F1 | C mode flag (dword_126EFB4 == 1) |
| byte_10678F4 | 0x10678F4 | Pre-C++11 flag |
| byte_10678FA | 0x10678FA | Name lookup kind (saved/restored around type rendering) |
| byte_10678FE | 0x10678FE | Entity display flags (saved/restored around %n processing) |
| byte_1067902 | 0x1067902 | Type desugaring mode flag (saved/restored around %t aka rendering) |
Colorization Interaction
When dword_126ECA4 (colorization active) is non-zero, the format engine inserts ANSI escape sequences around quoted names and type references:
| Context | Color Code | ANSI Sequence | Visual |
|---|---|---|---|
| Opening quote (") | 6 (quote) | \033[01m | Bold |
| Closing quote (") | 1 (reset) | \033[0m | Normal |
| Type rendering context | (inherited) | -- | Inherits from diagnostic severity color |
The escape sequences are emitted by sub_4ECDD0(buffer, color_code). The color codes correspond to the categories parsed from EDG_COLORS / GCC_COLORS environment variables during initialization.
Function Map
| Address | Name (Recovered) | Size | Role |
|---|---|---|---|
| 0x4EDCD0 | process_fill_in | 1,202 lines | Core format specifier expansion |
| 0x4EF620 | write_message_to_buffer | 159 lines | Template string walker, % parser |
| 0x4F2DE0 | alloc_fill_in_entry | 41 lines | Pool allocator for 40-byte fill-in nodes |
| 0x4F2D30 | error_text_invalid_code | 12 lines | Assert on invalid error code (> 3794) |
| 0x4F2930 | assertion_handler | 101 lines | __noreturn, 5,185 callers |
| 0x4F3480 | format_assertion_message | ~100 lines | Multi-arg string builder for assertion text |
| 0x4F6820 | form_source_position | ~130 lines | Render %p source position with file + line |
| 0x4F3970 | format_entity_unqualified | -- | Render unqualified entity name |
| 0x4F39E0 | format_entity_with_template | -- | Render entity with template args + accessibility |
| 0x737A00 | format_qualified_name | -- | Render fully-qualified name through scope chain |
| 0x5FE8B0 | format_type_with_qualifiers | -- | Render type with cv-qualifiers for %n prefix |
| 0x5FB270 | format_function_parameters | -- | Render function parameter type list |
| 0x6016F0 | format_simple_type | -- | Render simple type suffix |
| 0x600740 | format_type_for_display | -- | Render type for %t specifier |
| 0x7BE9C0 | has_desugared_type | -- | Check if type has an "aka" form |
| 0x5FA660 | format_template_argument_list | -- | Render template argument list for %n a option |
| 0x5FA0D0 | format_template_argument | -- | Render single template argument for %T |
| 0x5B9EE0 | lookup_entity_by_scope | -- | Entity lookup for %r template parameter |
| 0x4F63D0 | format_unsigned_decimal | -- | Render unsigned integer for %u |
| 0x6B9CD0 | buffer_append | -- | Write bytes to dynamic buffer |
| 0x6B9EA0 | buffer_write_string | -- | Write null-terminated string to buffer |
| 0x4ECDD0 | emit_colorization_escape | -- | Emit ANSI escape sequence |
Cross-References
- Diagnostic Overview -- 7-stage pipeline, severity levels, diagnostic record layout
- CUDA Error Catalog -- all 338 CUDA-specific error templates with specifier usage
- SARIF & Pragma Control -- SARIF JSON output and #pragma nv_diagnostic system
SARIF Output & Pragma Diagnostic Control
cudafe++ supports two diagnostic output formats -- traditional text (default) and SARIF v2.1.0 JSON -- controlled by the --output_mode flag (flag index 274, stored in dword_106BBB8). Alongside the output format, the pragma diagnostic system allows per-error severity overrides at arbitrary source positions through #pragma nv_diag_* directives, which record a stack of severity modifications binary-searched at emission time. A companion colorization subsystem adds ANSI escape sequences to text-mode output, governed by environment variables and terminal detection. This page covers the internals of all three subsystems.
For the diagnostic pipeline architecture, severity levels, and error message formatting, see Diagnostic Overview. For the CUDA error catalog and tag-name suppression, see CUDA Errors.
SARIF Output Mode
Activation
SARIF mode is activated by passing --output_mode sarif on the command line. The flag handler (case 274 in the CLI parser at sub_454160) performs a simple string comparison:
// sub_454160, case 274
if (strcmp(arg, "text") == 0)
dword_106BBB8 = 0; // text mode (default)
else if (strcmp(arg, "sarif") == 0)
dword_106BBB8 = 1; // SARIF JSON mode
else
error("unrecognized output mode (must be one of text, sarif): %s", arg);
When dword_106BBB8 == 1, three changes take effect globally:
- write_init (sub_5AEDB0) emits the SARIF JSON header instead of nothing
- check_severity (sub_4F1330) routes each diagnostic through the SARIF JSON builder instead of construct_text_message
- write_signoff (sub_5AEE00) emits ]}]}\n instead of the error/warning summary line

All other pipeline behavior -- severity computation, pragma overrides, error counting, exit codes -- is identical in both modes. In SARIF mode the termination path skips the text messages ("Compilation terminated.", "Compilation aborted.") but returns the same numeric exit codes (0, 2, 4, 11).
SARIF Header (sub_5AEDB0)
write_init is called once at the start of compilation. In SARIF mode, it writes the JSON envelope to qword_126EDF0 (the diagnostic output stream, typically stderr):
{
"version": "2.1.0",
"$schema": "https://raw.githubusercontent.com/oasis-tcs/sarif-spec/master/Schemata/sarif-schema-2.1.0.json",
"runs": [{
"tool": {
"driver": {
"name": "EDG CPFE",
"version": "6.6",
"organization": "Edison Design Group",
"fullName": "Edison Design Group C/C++ Front End - 6.6",
"informationUri": "https://edg.com/c"
}
},
"columnKind": "unicodeCodePoints",
"results": [
The version strings ("6.6") are hardcoded in the binary via two %s format arguments that both resolve to the static string "6.6". The runs array is opened but not closed -- each diagnostic result is appended as the compilation proceeds, and the array is closed by write_signoff.
An assertion guards the mode value: if dword_106BBB8 is neither 0 nor 1, the function fires sub_4F2930 with "write_init" at host_envir.c:2017.
SARIF Result Object
Each diagnostic emitted through check_severity (sub_4F1330) produces one JSON result object. The construction happens inline within check_severity at LABEL_91, building the JSON into the SARIF buffer qword_106B478:
{
"ruleId": "EC<error_code>",
"level": "<severity_string>",
"message": {"text": "<expanded_message>"},
"locations": [{"physicalLocation": <location_object>}],
"relatedLocations": [<related_location_objects>]
}
Comma handling: When qword_126ED90 + qword_126ED98 > 1 (more than one diagnostic has been emitted), a comma is prepended before the opening { to maintain valid JSON array syntax.
Rule ID Format
The rule ID is always "EC" followed by the internal error code (0--3794), not the display code:
sub_6B9CD0(sarif_buf, "\"ruleId\":", 9);
sub_6B9CD0(sarif_buf, "\"EC", 3);
sprintf(s, "%lu", *(uint32_t*)(record + 176)); // internal error code
sub_6B9CD0(sarif_buf, s, strlen(s));
sub_6B9CD0(sarif_buf, "\"", 1);
For a CUDA error with internal code 3499 (display code 20042), the rule ID is "EC3499", not "EC20042". This differs from the text-mode format which uses "EC%lu" with the same internal code in construct_text_message.
Level Mapping
The level field is derived from the diagnostic severity byte at record offset 180. When severity <= byte_126ED68 (the error-promotion threshold) and severity <= 7, it is promoted to "error" before level selection. The mapping:
| Severity | level String | SARIF Standard? |
|---|---|---|
| 4 (remark) | "remark" | Non-standard extension |
| 5 (warning) | "warning" | Standard |
| 7 (error, soft) | "error" | Standard |
| 8 (error, hard) | "error" | Standard |
| 9 (catastrophic) | "catastrophe" | Non-standard extension |
| 11 (internal) | "internal_error" | Non-standard extension |
Any other severity value triggers the assertion at error.c:4886:
sub_4F2930(..., "write_sarif_level",
"determine_severity_code: bad severity", 0);
Notes (severity 2) and command-line diagnostics (severity 6, 10) never reach the SARIF level mapper -- notes are suppressed below the minimum severity gate, and command-line diagnostics bypass the SARIF path entirely.
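The level table above reduces to a small switch. The following is a sketch only: the function name `sarif_level_for_severity` is hypothetical, the promotion step is left out, and a NULL return stands in for the assertion path.

```c
#include <stddef.h>

/* Map a severity byte to the SARIF "level" string from the table above.
 * Returns NULL for severities that the text says trigger the assertion. */
const char *sarif_level_for_severity(unsigned severity)
{
    switch (severity) {
    case 4:  return "remark";
    case 5:  return "warning";
    case 7:                      /* soft error */
    case 8:  return "error";     /* hard error */
    case 9:  return "catastrophe";
    case 11: return "internal_error";
    default: return NULL;        /* 2, 6, 10 and others never map to a level */
    }
}
```

Consumers that validate against the SARIF schema should expect the three non-standard strings ("remark", "catastrophe", "internal_error") and map them as needed.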
Message Object (sub_4EF8A0)
The message text is produced by write_sarif_message_json (sub_4EF8A0), which wraps the expanded error template in a JSON {"text":"..."} object:
- Appends {"text":" to the SARIF buffer
- Calls write_message_to_buffer (sub_4EF620) to expand the error template with fill-in values into qword_106B488
- Null-terminates the message buffer
- JSON-escapes the message: iterates each character, prepending \ before any " (0x22) or \ (0x5C) character
- Appends "} to close the message object
The escaping is minimal -- only double-quote and backslash are escaped. Control characters (newlines, tabs) are not escaped, relying on the fact that EDG error messages do not contain embedded newlines.
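The two-character escaping rule can be written as a short standalone routine. This is an illustrative sketch: the function name `sarif_escape_minimal` and its bounded-output interface are assumptions, since the binary appends to its own dynamic buffer instead.

```c
#include <stddef.h>

/* Copy src into dst, prefixing '"' and '\\' with a backslash.
 * Returns the escaped length, or (size_t)-1 if dst is too small.
 * Control characters pass through unchanged, matching the minimal
 * escaping described in the text. */
size_t sarif_escape_minimal(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;
    for (; *src != '\0'; src++) {
        size_t need = (*src == '"' || *src == '\\') ? 2 : 1;
        if (out + need >= dst_size)   /* keep room for the terminator */
            return (size_t)-1;
        if (need == 2)
            dst[out++] = '\\';
        dst[out++] = *src;
    }
    if (dst_size == 0)
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}
```

A stricter emitter would also escape bytes below 0x20, which JSON requires; the sketch deliberately stops where the described behavior stops.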
Physical Location (sub_4ECB10)
When the diagnostic record has a valid file index (offset 136 != 0), a locations array is emitted containing one physical location object:
{
"physicalLocation": {
"artifactLocation": {"uri": "file://<canonical_path>"},
"region": {"startLine": <line>, "startColumn": <column>}
}
}
The function sub_4ECB10 (write_sarif_physical_location):
- Calls sub_5B97A0 to resolve the source-position cookie at record offset 136 into file path, line number, and column number
- Calls sub_5B1060 to canonicalize the file path
- Emits the artifactLocation with a file:// URI prefix
- Emits startLine unconditionally
- Emits startColumn only when the column value is non-zero (the v4 check: if (v4))
The startColumn conditional emission means that diagnostics without column information (e.g., command-line errors) produce location objects with only startLine.
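The conditional startColumn emission can be sketched with snprintf. The function name `format_physical_location` and its parameters are hypothetical; the path is assumed to be already canonicalized and free of characters that need JSON escaping.

```c
#include <stddef.h>
#include <stdio.h>

/* Format one SARIF location element; startColumn is omitted when column == 0.
 * Returns the snprintf result (the length that would be written). */
int format_physical_location(char *dst, size_t dst_size,
                             const char *canonical_path,
                             unsigned long line, unsigned long column)
{
    if (column != 0)
        return snprintf(dst, dst_size,
            "{\"physicalLocation\":{\"artifactLocation\":{\"uri\":\"file://%s\"},"
            "\"region\":{\"startLine\":%lu,\"startColumn\":%lu}}}",
            canonical_path, line, column);
    /* No column information: emit startLine only. */
    return snprintf(dst, dst_size,
        "{\"physicalLocation\":{\"artifactLocation\":{\"uri\":\"file://%s\"},"
        "\"region\":{\"startLine\":%lu}}}",
        canonical_path, line);
}
```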
Related Locations
Sub-diagnostics (linked at record offset 72, the sub_diagnostic_head pointer) are serialized into the relatedLocations array:
if (record->sub_diagnostic_head) {
append(",\"relatedLocations\":[");
int first = 1;
for (sub = record->sub_diagnostic_head; sub; sub = sub->next) {
sub->parent = record; // back-link at offset 16
append("{\"message\":");
write_sarif_message_json(sub); // expand sub-diagnostic message
if (sub->file_index)
write_sarif_physical_location(sub);
append("}");
if (!first)
append(","); // note: comma AFTER closing }
first = 0;
}
append("]");
}
Each related location has its own message object and an optional physicalLocation. The comma is appended after the closing brace of every entry except the first, so three sub-diagnostics yield [{...}{...},{...},]. This is a bug in the JSON generation: as transcribed above, the output is malformed whenever there are two or more related locations, because the separator between the first and second entries is missing and a trailing comma precedes the closing ].
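The separator placement in the loop above can be checked with a small simulation that writes {} for each entry body. Both function names are hypothetical; the second function shows the conventional placement for comparison and is not something found in the binary.

```c
#include <string.h>

/* Reproduce the separator placement from the loop above for n entries.
 * dst must hold at least 3*n + 3 bytes. */
void emit_related_as_described(char *dst, int n)
{
    int first = 1;
    strcpy(dst, "[");
    for (int i = 0; i < n; i++) {
        strcat(dst, "{}");
        if (!first)
            strcat(dst, ",");   /* comma AFTER the entry, skipped for the first */
        first = 0;
    }
    strcat(dst, "]");
}

/* Conventional placement: comma BEFORE every entry except the first. */
void emit_related_well_formed(char *dst, int n)
{
    strcpy(dst, "[");
    for (int i = 0; i < n; i++) {
        if (i > 0)
            strcat(dst, ",");
        strcat(dst, "{}");
    }
    strcat(dst, "]");
}
```

With one entry both variants produce [{}]. With two entries the first variant produces [{}{},], which a JSON parser rejects.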
SARIF Footer (sub_5AEE00)
write_signoff closes the JSON structure:
if (dword_106BBB8 == 1) {
fwrite("]}]}\n", 1, 5, qword_126EDF0);
return;
}
This closes: results array (]), the run object (}), the runs array (]), and the top-level object (}), followed by a newline.
In text mode, write_signoff instead prints the error/warning summary (e.g., "3 errors, 2 warnings detected in file.cu"), using message-table lookups via sub_4F2D60 with IDs 1742--1748 and 3234--3235 for pluralization.
Complete SARIF Output Example
{"version":"2.1.0","$schema":"https://raw.githubusercontent.com/oasis-tcs/sarif-spec/master/Schemata/sarif-schema-2.1.0.json","runs":[{"tool":{"driver":{"name":"EDG CPFE","version":"6.6","organization":"Edison Design Group","fullName":"Edison Design Group C/C++ Front End - 6.6","informationUri":"https://edg.com/c"}},"columnKind":"unicodeCodePoints","results":[{"ruleId":"EC3499","level":"error","message":{"text":"calling a __device__ function(\"foo\") from a __host__ function(\"main\") is not allowed"},"locations":[{"physicalLocation":{"artifactLocation":{"uri":"file:///path/to/test.cu"},"region":{"startLine":10,"startColumn":5}}}]}]}]}
Pragma Diagnostic Control
Pragma Actions
cudafe++ processes #pragma nv_diag_* directives through the preprocessor, which records them as pragma action entries on a global stack. Six action codes are defined:
| Code | Pragma Directive | Severity Effect | Internal Name |
|---|---|---|---|
| 30 | #pragma nv_diag_suppress | Set severity to 3 (suppressed) | ignored |
| 31 | #pragma nv_diag_remark | Set severity to 4 (remark) | remark |
| 32 | #pragma nv_diag_warning | Set severity to 5 (warning) | warning |
| 33 | #pragma nv_diag_error | Set severity to 7 (error) | error |
| 35 | #pragma nv_diag_default | Restore from byte_1067920[4 * error_code] | default |
| 36 | #pragma nv_diag_push / pop | Scope boundary marker | push/pop |
Note the gap: action code 34 is not used. Actions 30--33 modify severity, 35 restores the compile-time default, and 36 provides push/pop scoping to allow localized overrides.
The pragmas accept either a numeric error code or a diagnostic tag name:
#pragma nv_diag_suppress 20042 // by display code
#pragma nv_diag_suppress calling_a_constexpr__host__function // by tag name
Display codes >= 20000 are converted to internal codes by sub_4ED170:
int internal_code = (display_code > 19999) ? display_code - 16543 : display_code;
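As a worked example of the conversion, display code 20042 maps to internal code 20042 - 16543 = 3499, which is the value that appears in the SARIF rule ID. A minimal sketch follows; the names `display_to_internal` and `format_rule_id` are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Offset and threshold taken from the expression above. */
enum { CUDA_DISPLAY_BASE = 20000, CUDA_DISPLAY_OFFSET = 16543 };

/* Display code -> internal code, mirroring the conversion shown above. */
int display_to_internal(int display_code)
{
    return (display_code >= CUDA_DISPLAY_BASE)
        ? display_code - CUDA_DISPLAY_OFFSET
        : display_code;
}

/* Build the "EC<internal>" rule ID used in SARIF output. */
int format_rule_id(char *dst, size_t dst_size, int display_code)
{
    return snprintf(dst, dst_size, "EC%d", display_to_internal(display_code));
}
```

Codes below 20000 pass through unchanged, so a standard EDG code used in a pragma is already an internal code.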
Pragma Stack (qword_1067820)
The pragma stack is a dynamically-growing array of 24-byte records stored at qword_1067820. The array is managed as a sorted-by-position sequence to enable binary search.
Each 24-byte stack entry has the following layout:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | position_cookie | Source position (sequence number) |
| 4 | 2 | column | Column number within the line |
| 8 | 1 | action_code | Pragma action (30--36) |
| 9 | 1 | flags | Bit 0: is push/pop with saved index |
| 16 | 8 | error_code or saved_index | Target error code, or -1/saved push index for scope markers |
The array header (pointed to by qword_1067820) contains:
| Offset | Size | Field |
|---|---|---|
| 0 | 8 | Pointer to entry array base |
| 8 | 8 | Array capacity |
| 16 | 8 | Entry count |
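The two layouts above can be expressed as C structs with compile-time checks on the documented offsets. The struct and field names are hypothetical, the padding members are assumptions that fill the gaps in the tables, and the header layout assumes 8-byte pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of one 24-byte pragma stack entry. */
struct pragma_stack_entry {
    uint32_t position_cookie;  /* offset 0 */
    uint16_t column;           /* offset 4 */
    uint16_t pad0;             /* offset 6: not described in the table */
    uint8_t  action_code;      /* offset 8: 30..36 */
    uint8_t  flags;            /* offset 9: bit 0 = saved index present */
    uint8_t  pad1[6];          /* offsets 10..15: not described */
    int64_t  code_or_index;    /* offset 16: error code, or -1 / push index */
};

/* Hypothetical C view of the header that qword_1067820 points to. */
struct pragma_stack_header {
    struct pragma_stack_entry *base;  /* offset 0 */
    uint64_t capacity;                /* offset 8 */
    uint64_t count;                   /* offset 16 */
};

_Static_assert(sizeof(struct pragma_stack_entry) == 24, "entry is 24 bytes");
_Static_assert(offsetof(struct pragma_stack_entry, action_code) == 8,
               "action code at offset 8");
_Static_assert(offsetof(struct pragma_stack_entry, code_or_index) == 16,
               "payload at offset 16");
```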
Recording Pragma Entries (sub_4ED190)
When the preprocessor encounters a #pragma nv_diag_* directive, record_pragma_diagnostic (sub_4ED190) creates a new stack entry:
void record_pragma_diagnostic(uint error_code, uint8_t severity, uint *position) {
// Hash: (column+1) * (position+1) * error_code * (severity+1)
uint64_t hash = (*(uint16_t*)(position+2) + 1) * (*position + 1)
* error_code * (severity + 1);
uint64_t bucket = hash % 983; // 0x3D7
entry = allocate(32);
entry->error_code_field = error_code; // offset 8
entry->severity = severity; // offset 12
entry->position = *position; // offset 16
entry->saved_index = 0xFFFFFFFF; // offset 24 = -1
// Insert at head of hash chain
entry->next = hash_table[bucket]; // qword_1065960
hash_table[bucket] = entry;
}
This function serves double duty: it records the pragma entry for the per-diagnostic suppression hash table (qword_1065960, 983 buckets) used by check_pragma_diagnostic (sub_4ED240), and it simultaneously records the entry on the position-sorted pragma stack.
The function also sets bit 2 (mask 0x04) of the per-error flags byte, byte_1067922[4 * error_code] |= 4, to mark that this error code has at least one pragma override. This enables the fast-path check in check_for_overridden_severity.
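The bucket computation can be isolated into a pure function. This is a sketch under assumptions: the name `pragma_hash_bucket` is hypothetical, and the operands are widened to 64 bits before multiplying, which may not match the integer widths the binary actually uses.

```c
#include <stdint.h>

enum { PRAGMA_HASH_BUCKETS = 983 };  /* 0x3D7, per the text */

/* Bucket index for (error_code, severity, position cookie, column),
 * following the product-then-modulo formula described above. */
unsigned pragma_hash_bucket(uint32_t error_code, uint8_t severity,
                            uint32_t position_cookie, uint16_t column)
{
    uint64_t hash = ((uint64_t)column + 1)
                  * ((uint64_t)position_cookie + 1)
                  * error_code
                  * ((uint64_t)severity + 1);
    return (unsigned)(hash % PRAGMA_HASH_BUCKETS);
}
```

For error code 3499, severity 3, cookie 9, and column 4, the product is 5 * 10 * 3499 * 4 = 699800, and 699800 mod 983 = 887. Because error_code is a plain factor, an error code of 0 always lands in bucket 0.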
Per-Diagnostic Suppression Check (sub_4ED240)
check_pragma_diagnostic (sub_4ED240) is the fast-path check called from check_severity to determine whether a specific diagnostic at a specific source position should be suppressed. It operates on the hash table rather than the sorted stack:
bool check_pragma_diagnostic(uint error_code, uint8_t severity, uint *position) {
uint64_t hash = (position->column + 1) * (position->cookie + 1)
* error_code * (severity + 1);
entry = hash_table[hash % 983];
// Walk hash chain matching all four fields
while (entry) {
if (entry->error_code == error_code &&
entry->severity == severity &&
entry->position == position->cookie &&
entry->column == position->column)
break;
entry = entry->next;
}
if (!entry) return false;
// Scope check: compare current scope ID
scope = scope_table[current_scope_index];
if (entry->saved_scope_id != scope->id || scope->kind == 9) {
entry->saved_scope_id = scope->id;
entry->emit_count = 0;
return true; // first time in this scope → suppress
}
// Already seen in this scope → check error limit
entry->emit_count++;
return entry->emit_count <= error_limit;
}
Severity Override Resolution (sub_4F30A0)
check_for_overridden_severity (sub_4F30A0) is the position-based pragma stack walker. It is called from create_diagnostic_entry (sub_4F40C0) for any diagnostic with severity <= 7, and determines the effective severity by walking the pragma stack backward from the diagnostic's source position.
Entry conditions:
void check_for_overridden_severity(int error_code, char *severity_out,
int64_t position, ...) {
char current_severity = byte_1067921[4 * error_code];
// Fast path: if no pragma override exists for this error code, skip
if ((byte_1067922[4 * error_code] & 4) == 0)
goto done;
// Ensure pragma stack exists and has entries
if (!qword_1067820 || !qword_1067820->count)
goto done;
Binary search phase:
When the diagnostic position is before the last pragma stack entry (i.e., the position comparison at offset 0/4 shows the diagnostic comes before the final entry), the function uses bsearch with comparator sub_4ECD20 to find the nearest pragma entry at or before the diagnostic position:
// Construct search key from diagnostic position
search_key.position = position->cookie;
search_key.column = position->column;
qword_10658F8 = 0; // scratch: will hold the best-match pointer
result = bsearch(&search_key, stack_base, entry_count, 24, comparator);
The comparator sub_4ECD20 compares position cookies first, then columns. It has a side effect: whenever the comparison result is >= 0 (the search key is at or after the candidate), it stores the candidate pointer in qword_10658F8. This means after bsearch completes, qword_10658F8 holds the rightmost entry that is at or before the search key -- the "floor" entry.
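The floor-finding trick can be reproduced with the standard bsearch and a comparator that records candidates. This is an illustrative sketch: `pos_key`, `floor_candidate`, and `find_floor` are hypothetical names, and the result relies on bsearch performing an ordinary binary search, which the C standard does not spell out but common implementations do.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified position key: cookie first, then column. */
struct pos_key {
    uint32_t cookie;
    uint16_t column;
};

/* Scratch slot written by the comparator (stand-in for qword_10658F8). */
static const struct pos_key *floor_candidate;

/* Compare key against one element; remember the element when key >= element. */
static int compare_and_track_floor(const void *key, const void *elem)
{
    const struct pos_key *k = key;
    const struct pos_key *e = elem;
    int r;
    if (k->cookie != e->cookie)
        r = (k->cookie < e->cookie) ? -1 : 1;
    else
        r = (k->column > e->column) - (k->column < e->column);
    if (r >= 0)
        floor_candidate = e;   /* side effect: best match so far */
    return r;
}

/* Return the last element at or before key in a sorted array, or NULL. */
const struct pos_key *find_floor(const struct pos_key *sorted, size_t n,
                                 const struct pos_key *key)
{
    floor_candidate = NULL;
    const struct pos_key *hit =
        bsearch(key, sorted, n, sizeof *sorted, compare_and_track_floor);
    return hit ? hit : floor_candidate;
}
```

Each time the comparator reports key >= element, the search moves right, so later recorded candidates are always closer to the key than earlier ones.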
Backward walk phase:
After finding the starting position (either via binary search or by starting from the last entry), the function walks backward through the stack:
while (1) {
uint8_t action = *(uint8_t*)(entry + 8);
if (action == 36) { // push/pop marker
if ((*(uint8_t*)(entry+9) & 1) == 0)
goto skip; // plain pop: no saved index
int64_t saved_idx = *(int64_t*)(entry + 16);
if (saved_idx == -1)
goto skip; // push without matching pop
// Jump to the push point
entry = &stack_base[24 * saved_idx];
continue;
}
if (*(uint32_t*)(entry + 16) == error_code) {
switch (action) {
case 30: current_severity = 3; goto apply; // suppress
case 31: current_severity = 4; goto apply; // remark
case 32: current_severity = 5; goto apply; // warning
case 33: current_severity = 7; goto apply; // error
case 35: // default
current_severity = byte_1067920[4 * error_code];
goto done;
default:
assertion("get_severity_from_pragma", error.c:3741);
}
}
skip:
if (entry == stack_base)
goto done; // reached bottom of stack
entry -= 24; // previous entry
}
done:
if (current_severity)
*severity_out = current_severity;
apply:
*severity_out = current_severity;
The key insight is the push/pop handling: action code 36 entries with flags & 1 set contain a saved index at offset 16 that points to the corresponding push entry. The walker jumps to the push entry, effectively skipping all pragma entries within the pushed scope, restoring the severity state from before the push.
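The effect of the pop-to-push jump is easiest to see on a tiny stack. The sketch below is a simplified model, not the recovered function: `walk_entry` and `resolve_severity` are hypothetical, indices replace pointers, and the data is assumed well formed (every saved index points at an earlier push entry whose flag bit 0 is clear).

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified pragma entry: only the fields the walker consults. */
struct walk_entry {
    uint8_t action;   /* 30..33, 35, or 36 */
    uint8_t flags;    /* bit 0: saved index present (action 36 only) */
    int64_t payload;  /* error code, or saved push index / -1 */
};

/* Walk backward from index start and return the effective severity for
 * error_code, or fallback when no matching entry is found.
 * default_severity is the value that action 35 restores. */
int resolve_severity(const struct walk_entry *stack, size_t start,
                     int64_t error_code, int default_severity, int fallback)
{
    size_t i = start;
    for (;;) {
        const struct walk_entry *e = &stack[i];
        if (e->action == 36) {
            /* Pop with a saved index: jump to the matching push. */
            if ((e->flags & 1) && e->payload != -1) {
                i = (size_t)e->payload;
                continue;
            }
        } else if (e->payload == error_code) {
            switch (e->action) {
            case 30: return 3;                 /* suppress */
            case 31: return 4;                 /* remark */
            case 32: return 5;                 /* warning */
            case 33: return 7;                 /* error */
            case 35: return default_severity;  /* default */
            }
        }
        if (i == 0)
            return fallback;                   /* reached bottom of stack */
        i--;
    }
}
```

Given the sequence suppress, push, error, pop for one error code, a diagnostic after the pop resolves to the suppress entry, while a diagnostic between push and pop resolves to the error entry.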
An out-of-bounds entry pointer triggers the assertion at error.c:3803:
if (entry < stack_base || entry >= &stack_base[24 * count])
assertion("check_for_overridden_severity", error.c:3803);
GCC Diagnostic Pragma Output
cudafe++ generates #pragma GCC diagnostic directives in its output (the transformed C++ sent to the host compiler) to suppress host-compiler warnings on code that cudafe++ knowingly generates or transforms. These are not the same as the nv_diag_* pragmas that control cudafe++'s own diagnostics.
The output pragmas are emitted via sub_467E50 (the line-output function) with hardcoded strings:
// Emitted around certain code regions
sub_467E50("#pragma GCC diagnostic push");
sub_467E50("#pragma GCC diagnostic ignored \"-Wunused-local-typedefs\"");
sub_467E50("#pragma GCC diagnostic ignored \"-Wattributes\"");
// ... generated code ...
sub_467E50("#pragma GCC diagnostic pop");
The full set of GCC warnings suppressed in output:
| Warning Flag | Context |
|---|---|
| -Wunevaluated-expression | decltype expressions in init-captures (when dword_126E1E8 = GCC host) |
| -Wattributes | CUDA attribute annotations on transformed code |
| -Wunused-parameter | Device function stubs with unused parameters |
| -Wunused-function | Forward-declared device functions not called in host path |
| -Wunused-local-typedefs | Type aliases generated for CUDA type handling |
| -Wunused-variable | Variables in constexpr-if discarded branches |
| -Wunused-private-field | Private members of device-only classes |
On MSVC host compilers, the equivalent mechanism uses __pragma(warning(push)) / __pragma(warning(pop)) instead.
Colorization
Initialization (sub_4F2C10)
Colorization is initialized by init_colorization (sub_4F2C10), called from the diagnostic pipeline setup. The function determines whether color output should be enabled and parses the color specification.
Decision sequence:
1. Assert dword_126ECA0 != 0 (colorization was requested via --colors)
2. Check getenv("NOCOLOR") → if set, disable
3. Check sub_5AF770() → if stderr is not a TTY, disable
4. If still enabled, parse color spec
5. Set dword_126ECA4 = dword_126ECA0 (activate colorization)
Step 3 calls sub_5AF770 (check_terminal_capabilities), which:
- Verifies qword_126EDF0 (diagnostic output FILE*) exists
- Calls fileno() + isatty() on it
- Calls getenv("TERM") and rejects "dumb" terminals
- Returns 1 if interactive, 0 otherwise
The --colors / --no_colors CLI flag pair controls dword_126ECA0 (colorization requested). When --no_colors is set or NOCOLOR is in the environment, colorization is unconditionally disabled regardless of terminal capabilities.
Color Specification Parsing (sub_4EC850)
The color specification string is sourced from environment variables with a fallback chain:
char *spec = getenv("EDG_COLORS");
if (!spec) {
spec = getenv("GCC_COLORS");
if (!spec)
spec = "error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32";
}
Note: although the string "DEFAULT_EDG_COLORS" appears in the binary (as a compile-time macro name), the actual default is hardcoded. The EDG_COLORS variable takes priority over GCC_COLORS, allowing EDG-specific customization while maintaining GCC compatibility.
The specification format is category=codes:category=codes:... where:
- category is one of: error, warning, note, locus, quote, range1
- codes is a semicolon-separated sequence of ANSI SGR parameters (digits and ; only)
- : separates category assignments
sub_4EC850 (parse_color_category) is called once for each of the 6 configurable categories:
sub_4EC850(2, "error"); // category code 2
sub_4EC850(3, "warning"); // category code 3
sub_4EC850(4, "note"); // category code 4
sub_4EC850(5, "locus"); // category code 5
sub_4EC850(6, "quote"); // category code 6
sub_4EC850(7, "range1"); // category code 7
For each category, the parser:
- Uses strstr() to find the category name in the spec string
- Checks that the character after the name is =
- Extracts the value up to the next : (or end of string)
- Validates that the value contains only digits (0x30--0x39) and semicolons (0x3B)
- Stores the pointer and length in qword_126ECC0[2*code] and qword_126ECC8[2*code]
- If validation fails (non-digit, non-semicolon character), nullifies the entry
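The per-category parsing steps above can be sketched as one function. The name `parse_color_category_sketch` and its out-parameter interface are assumptions; the binary stores the result in its global arrays instead of returning it.

```c
#include <stddef.h>
#include <string.h>

/* Find "name=codes" in spec. On success, store the codes pointer and
 * length and return 1; return 0 if the name is absent, is not followed
 * by '=', or the value contains a byte other than a digit or ';'. */
int parse_color_category_sketch(const char *spec, const char *name,
                                const char **codes, size_t *codes_len)
{
    *codes = NULL;
    *codes_len = 0;

    const char *p = strstr(spec, name);
    if (p == NULL)
        return 0;
    p += strlen(name);
    if (*p != '=')
        return 0;
    p++;

    size_t n = strcspn(p, ":");              /* value ends at ':' or NUL */
    for (size_t i = 0; i < n; i++) {
        if (!((p[i] >= '0' && p[i] <= '9') || p[i] == ';'))
            return 0;                        /* invalid byte: entry stays null */
    }
    *codes = p;
    *codes_len = n;
    return 1;
}
```

Because the lookup uses strstr, it finds the first occurrence of the name anywhere in the string, so a spec in which one category name is a substring of earlier text could match at the wrong place.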
Color Category Codes
Seven category codes are used internally, with code 1 reserved for reset:
| Code | Category | Default ANSI | Escape | Applied To |
|---|---|---|---|---|
| 1 | reset | \033[0m | ESC [ 0 m | End of any colored region |
| 2 | error | \033[01;31m | ESC [ 01;31 m | Error/catastrophic/internal severity labels |
| 3 | warning | \033[01;35m | ESC [ 01;35 m | Warning/command-line-warning labels |
| 4 | note/remark | \033[01;36m | ESC [ 01;36 m | Note and remark severity labels |
| 5 | locus | \033[01m | ESC [ 01 m | Source file:line location prefix |
| 6 | quote | \033[01m | ESC [ 01 m | Quoted identifiers in messages |
| 7 | range1 | \033[32m | ESC [ 32 m | Source-range underline markers |
Escape Sequence Emission
Two functions handle color escape output, depending on context:
sub_4ECDD0 (emit_colorization_escape): Used within construct_text_message for inline color markers. Writes a 2-byte internal marker (ESC byte 0x1B followed by the category code) into the output buffer. These markers are later expanded into full ANSI sequences during the final output pass.
void emit_colorization_escape(buffer *buf, uint8_t category_code) {
buf_append_byte(buf, 0x1B); // ESC
buf_append_byte(buf, category_code);
}
sub_4F3E50 (add_colorization_characters): Used during word-wrapped output to emit full ANSI escape sequences. For category 1 (reset), it writes ESC [ 0 m. For categories 2--7, it writes ESC [ followed by the parsed ANSI codes from qword_126ECC0, followed by m.
void add_colorization_characters(uint8_t category) {
if (category > 7)
assertion("add_colorization_characters", error.c:862);
if (category == 1) {
// Reset: ESC [ 0 m
buf_append(sarif_buf, ESC);
buf_append(sarif_buf, '[');
buf_append(sarif_buf, '0');
buf_append(sarif_buf, 'm');
} else if (color_pointer[category]) {
// ESC [ <codes> m
buf_append(sarif_buf, ESC);
buf_append(sarif_buf, '[');
buf_append_n(sarif_buf, color_pointer[category], color_length[category]);
buf_append(sarif_buf, 'm');
}
}
The assertion at error.c:862 fires if a category code > 7 is passed, which would indicate a programming error in the diagnostic formatter.
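The expansion from a category code to a full ANSI sequence can be sketched as a function that writes into a caller-supplied array. The name `build_ansi_sequence` and the bounded-buffer interface are hypothetical; the binary appends to its own buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Build the ANSI sequence for a category into dst (NUL-terminated).
 * codes/codes_len come from the parsed spec and are ignored for reset.
 * Returns the number of bytes written, or 0 if nothing is emitted. */
size_t build_ansi_sequence(unsigned category, const char *codes,
                           size_t codes_len, char *dst, size_t dst_size)
{
    assert(category <= 7);                 /* mirrors the documented assertion */
    if (category == 1) {
        codes = "0";                       /* reset: ESC [ 0 m */
        codes_len = 1;
    } else if (codes == NULL) {
        return 0;                          /* category has no parsed codes */
    }
    if (codes_len + 4 > dst_size)          /* ESC, '[', codes, 'm', NUL */
        return 0;
    dst[0] = 0x1B;
    dst[1] = '[';
    memcpy(dst + 2, codes, codes_len);
    dst[2 + codes_len] = 'm';
    dst[3 + codes_len] = '\0';
    return codes_len + 3;
}
```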
Word Wrapping with Colors
construct_text_message (sub_4EF9D0) has two code paths for word wrapping:
- Non-colorized: Simple space-scanning algorithm that breaks at the terminal width (dword_106B470)
- Colorized: Tracks visible character width separately from escape sequence bytes. When the formatted string contains byte 0x1B (ESC), the wrapping logic counts only non-escape characters toward the column width, ensuring that ANSI codes do not prematurely trigger line breaks.
The terminal width dword_106B470 defaults to a reasonable value (typically 80 or derived from the terminal) and controls the column at which output lines are wrapped.
Colorization State Variables
| Variable | Address | Purpose |
|---|---|---|
| dword_126ECA0 | 0x126ECA0 | Colorization requested (--colors flag) |
| dword_126ECA4 | 0x126ECA4 | Colorization active (after init_colorization) |
| qword_126ECC0 | 0x126ECC0 | Color spec pointer array (2 qwords per category) |
| qword_126ECC8 | 0x126ECC8 | Color spec length array (paired with pointers) |
| dword_106B470 | 0x106B470 | Terminal width for word wrapping |
Diagnostic Counter System (sub_4F3020)
The function update_diagnostic_counter (sub_4F3020) is called from check_severity to increment per-severity counters. These counters drive the summary output in write_signoff and the error-limit check:
void update_diagnostic_counter(uint8_t severity, uint64_t *counter_block) {
switch (severity) {
case 2: break; // notes: not counted
case 4: counter_block[0]++; break; // remarks
case 5:
case 6: counter_block[1]++; break; // warnings
case 7:
case 8: counter_block[2]++; break; // errors
case 9:
case 10:
case 11: counter_block[3]++; break; // fatal
default:
assertion("update_diagnostic_counter: bad severity", error.c:3223);
}
}
The primary counter block is at qword_126ED80 (4 qwords: remark_count, warning_count, error_count, fatal_count). The global totals qword_126ED90 (total errors) and qword_126ED98 (total warnings) are updated from a different counter block qword_126EDC8 after pragma-suppressed diagnostics are processed.
Global Variables
| Variable | Address | Type | Purpose |
|---|---|---|---|
dword_106BBB8 | 0x106BBB8 | int | Output format: 0=text, 1=SARIF |
qword_106B478 | 0x106B478 | buffer* | SARIF JSON output buffer (0x400 initial) |
qword_106B488 | 0x106B488 | buffer* | Message text buffer (0x400 initial) |
qword_106B480 | 0x106B480 | buffer* | Location prefix buffer (0x80 initial) |
qword_1067820 | 0x1067820 | array* | Pragma diagnostic stack (24-byte entries) |
qword_1065960 | 0x1065960 | ptr[983] | Per-diagnostic suppression hash table |
qword_10658F8 | 0x10658F8 | ptr | bsearch scratch: best-match pragma entry |
byte_1067920 | 0x1067920 | byte[4*3795] | Default severity per error code |
byte_1067921 | 0x1067921 | byte[4*3795] | Current severity per error code |
byte_1067922 | 0x1067922 | byte[4*3795] | Per-error tracking flags (bit 2 = has pragma) |
dword_126ECA0 | 0x126ECA0 | int | Colorization requested |
dword_126ECA4 | 0x126ECA4 | int | Colorization active |
qword_126ECC0 | 0x126ECC0 | ptr[] | Color spec pointers (per category) |
qword_126ECC8 | 0x126ECC8 | size_t[] | Color spec lengths (per category) |
qword_126EDF0 | 0x126EDF0 | FILE* | Diagnostic output stream |
Function Map
| Address | Name | EDG Source | Size | Role |
|---|---|---|---|---|
0x4EC850 | parse_color_category | error.c | 47 lines | Parse one category=codes from color spec |
0x4ECB10 | write_sarif_physical_location | error.c | 64 lines | Emit SARIF physicalLocation JSON |
0x4ECD20 | bsearch_comparator | error.c | 15 lines | Position comparator for pragma stack search |
0x4ECD50 | check_suppression_flags | error.c | 30 lines | Bit-flag suppression test |
0x4ECDD0 | emit_colorization_escape | error.c | 30 lines | Write ESC+category to buffer |
0x4ED100 | create_file_index_entry | error.c | 22 lines | Allocate 160-byte file-index node |
0x4ED170 | display_to_internal_code | error.c | 12 lines | Convert display code >= 20000 to internal |
0x4ED190 | record_pragma_diagnostic | error.c | 24 lines | Record pragma entry in hash table |
0x4ED240 | check_pragma_diagnostic | error.c | 39 lines | Hash-based per-diagnostic suppression check |
0x4EF8A0 | write_sarif_message_json | error.c | 79 lines | JSON-escape and wrap message text |
0x4F1330 | check_severity | error.c:3859 | 601 lines | Central dispatch, SARIF/text routing |
0x4F2C10 | init_colorization | error.c:825 | 43 lines | Parse color env vars, set up categories |
0x4F3020 | update_diagnostic_counter | error.c:3223 | 38 lines | Increment per-severity counters |
0x4F30A0 | check_for_overridden_severity | error.c:3803 | ~130 lines | Pragma stack walk with bsearch |
0x4F3E50 | add_colorization_characters | error.c:862 | ~80 lines | Emit full ANSI escape sequence |
0x5AEDB0 | write_init | host_envir.c:2017 | 28 lines | SARIF header / text-mode no-op |
0x5AEE00 | write_signoff | host_envir.c:2203 | 131 lines | SARIF footer / text-mode summary |
0x5AF770 | check_terminal_capabilities | host_envir.c | ~30 lines | TTY + TERM detection |
Entity Node Layout
The entity node is the central data structure in cudafe++ (EDG 6.6) for representing every named declaration: functions, variables, fields, parameters, namespaces, and types. Each node is a variable-size record -- routines occupy 288 bytes, variables 232 bytes, fields 176 bytes -- linked into scope chains and cross-referenced by type nodes, expression nodes, and template instantiation records.
This page focuses on the CUDA-specific fields that NVIDIA grafted onto the EDG entity node. These fields encode execution space (__host__/__device__/__global__), variable memory space (__shared__/__constant__/__managed__), launch configuration (__launch_bounds__/__cluster_dims__/__block_size__/__maxnreg__), and assorted kernel metadata. The attribute application functions in attribute.c write these fields; the backend code generator, cross-space validator, IL walker, and stub emitter read them.
Key Facts
| Property | Value |
|---|---|
| Routine entity size | 288 bytes (IL entry kind 11) |
| Variable entity size | 232 bytes (IL entry kind 7) |
| Field entity size | 176 bytes (IL entry kind 8) |
| Execution space offset | +182 (1 byte, bitfield) |
| Memory space offset | +148 (1 byte, bitfield) |
| Launch config pointer | +256 (8-byte pointer to 56-byte struct) |
| Source file | attribute.c (writers), nv_transforms.c / cp_gen_be.c (readers) |
| Attribute dispatch | sub_413240 (apply_one_attribute, 585 lines) |
| Post-validation | sub_6BC890 (nv_validate_cuda_attributes) |
Visual Layout (Routine Entity, 288 Bytes)
Offset 0 8 16 24 32 40 48 56
+=========+=========+=========+=========+=========+=========+=========+=========+
0x00 | next_entity_ptr | name_string_ptr | (EDG internal) |
+---------+---------+---------+---------+---------+---------+---------+---------+
0x20 | (EDG internal continued) |
+---------+---------+---------+---------+---------+---------+---------+---------+
0x40 | (EDG internal continued) |
+====+====+=========+=========+=========+=========+=========+=========+=========+
0x50 |kind|stor| | assoc_entity_ptr | |
|+80 |+81 | | | |
+----+----+---------+---------+---------+---------+---------+---------+---------+
0x60 | | variable_type_ptr | |
+=========+=========+=========+=========+====+=========+=========+==========+===+
0x80 | storage_class/align| |type_kind| | return_type_ptr |MEM |EXT | |
| | | |+132 | | +144 |+148|+149| |
+---------+---------+---------+----+----+----+---------+---------+----+----+----+
0x98 | proto_ptr / param_list +152 |link|stor| |grid| |op | | |
| |+160|+161| |+164| |+166| | |
+---------+---------+---------+----+----+----+----+----+----+---------+---------+
0xB0 |mbr |dev | |kern|func| |EXEC|CEXT| template_linkage_flags +184 |
|+176|+177| |+179|+180| |+182|+183| |
+----+----+----+----+----+----+----+----+=========+=========+=========+=========+
0xC0 | alias_chain/linkage+186 | |ctor/dtor|lambda | |
| | | +190 | +191 | |
+---------+---------+---------+---------+---------+---------+---------+---------+
0xD0 | variable_alias_chain_next +208 | |
+---------+---------+---------+---------+---------+---------+---------+---------+
0xF0 | func_extra / alias_entry +240 | |
+---------+---------+---------+---------+---------+---------+---------+---------+
0x100 | LAUNCH_CONFIG_PTR +256 | (padding to 288) |
+=========+=========+=========+=========+=========+=========+=========+=========+
CUDA-specific fields (UPPERCASE):
MEM = +148 variable memory space bitfield (__device__/__shared__/__constant__)
EXT = +149 extended memory space (__managed__)
EXEC = +182 execution space bitfield (__host__/__device__/__global__)
CEXT = +183 CUDA extended flags (__nv_register_params__, __cluster_dims__ intent)
LAUNCH_CONFIG_PTR = +256 pointer to 56-byte launch_config_t struct
Full Offset Map (CUDA-Relevant Fields)
The table below documents every entity node offset touched by CUDA attribute handlers and validation functions. Offsets are byte positions from the start of the entity node. Fields marked "EDG base" are standard EDG fields that CUDA code tests but does not define.
| Offset | Size | Field | Set By | Read By |
|---|---|---|---|---|
+0 | 8 | Next entity pointer (linked list) | EDG | Scope iteration |
+8 | 8 | Name string pointer | EDG | Error messages, stub emission |
+80 | 1 | Entity kind byte (7=variable, 8=field, 11=routine) | EDG | All attribute handlers |
+81 | 1 | Storage flags (bit 2=local, bit 3=has_name, bit 6=anonymous) | EDG | __global__ / __device__ validation |
+88 | 8 | Associated entity pointer | EDG | nv_is_device_only_routine |
+112 | 8 | Variable type pointer | EDG | get_func_type_for_attr |
+128 | 1 | Storage class code / alignment | EDG | apply_internal_linkage_attr |
+132 | 1 | Type kind byte (12=qualifier) | EDG | Return type traversal |
+144 | 8 | Return type / next-in-chain pointer | EDG | __global__ void-return check |
+148 | 1 | Variable memory space bitfield | CUDA attr handlers | Backend, IL walker |
+149 | 1 | Extended memory space | apply_nv_managed_attr | Backend, runtime init |
+152 | 8 | Function prototype / parameter list head | EDG | __global__ param checks |
+160 | 1 | Linkage/visibility bits (variable: low 3 = visibility) | Various | Visibility propagation |
+161 | 1 | Storage/linkage flags (bit 7=thread_local) | EDG | __managed__ / __device__ validation |
+164 | 1 | Storage class / grid_constant flags (bit 2=grid_constant) | __grid_constant__ handler | __managed__/__device__ conflict check |
+166 | 1 | Operator function kind (5=operator function) | EDG | __global__ validation |
+176 | 1 | Member function flags (bit 7=static member) | EDG | __global__ static-member check |
+177 | 1 | Device propagation flag (bit 4=0x10) | Virtual override propagation | Override space checking |
+179 | 1 | Constexpr/kernel flags | Declaration processing | Stub generation, attribute interaction |
+180 | 1 | Function attributes (bit 6=nodiscard, bit 7=noinline) | Various attribute handlers | Backend |
+182 | 1 | Execution space bitfield | CUDA execution space handlers | Everywhere |
+183 | 1 | CUDA extended flags | __cluster_dims__ / __nv_register_params__ | Post-validation, stub emission |
+184 | 8 | Template/linkage flags (48-bit field) | EDG + CUDA handlers | Lambda check, visibility |
+186 | 1 | Alias chain flag (bit 3=internal linkage) | apply_internal_linkage_attr | Linker |
+190 | 1 | Constructor/destructor priority flags | apply_constructor_attr / apply_destructor_attr | Backend |
+191 | 1 | Lambda flags (bit 0=is_lambda) | EDG lambda processing | __global__ validation |
+208 | 8 | Variable alias chain next pointer | apply_alias_attr | Alias loop detection |
+240 | 8 | Function extra info / alias entry | apply_alias_attr | Alias chain traversal |
+256 | 8 | Launch configuration pointer | CUDA launch config handlers | Post-validation, backend |
Execution Space Bitfield (Byte +182)
This is the most frequently read field in CUDA-specific code paths. Every function entity carries a single byte that encodes which execution spaces the function belongs to.
Byte at entity+182:
bit 0 (0x01) device_capable Function can execute on device
bit 1 (0x02) device_explicit __device__ was explicitly written
bit 2 (0x04) host_capable Function can execute on host
bit 3 (0x08) (reserved)
bit 4 (0x10) host_explicit __host__ was explicitly written
bit 5 (0x20) device_annotation Secondary device flag (HD detection)
bit 6 (0x40) global_kernel Function is a __global__ kernel
bit 7 (0x80) global_confirmed Always set by __global__ handler tail guard
Combined Patterns
The attribute handlers do not set individual bits. They OR entire patterns into the byte. Each CUDA keyword produces a fixed bitmask:
| Keyword | OR mask(s) | Result byte | Handler | Evidence |
|---|---|---|---|---|
__global__ | 0x61 then 0x80 | 0xE1 | sub_40E1F0 (apply_nv_global_attr) | `entity+182 \|= 0x61`, then `\|= 0x80` |
__device__ | 0x23 | 0x23 | sub_40EB80 (apply_nv_device_attr) | `entity+182 \|= 0x23` |
__host__ | 0x15 | 0x15 | sub_4108E0 (apply_nv_host_attr) | `entity+182 \|= 0x15` |
__host__ __device__ | 0x23 then 0x15 | 0x37 | Both handlers in sequence | OR of device + host masks |
| (no annotation) | none | 0x00 | -- | Implicit __host__ |
The 0x80 bit is set at the end of apply_nv_global_attr on every successful path. After the main body ORs 0x61 into byte +182 (setting bit 6 = global_kernel), a tail guard tests bit 6 and, when it is set, ORs in 0x80:
// sub_40E1F0, lines 84-88
v10 = *(_BYTE *)(a2 + 182);
if ( (v10 & 0x40) == 0 ) // if bit 6 (global_kernel) not set, bail
return a2; // (only reachable via early error paths)
*(_BYTE *)(a2 + 182) = v10 | 0x80; // always set for __global__
Since 0x61 was already OR'd in, bit 6 is always set on the normal path, so 0x80 is always applied. The actual result byte for any successful __global__ application is 0x61 | 0x80 = 0xE1. The guard condition only triggers on error paths where 0x61 was never applied (e.g., the template-lambda error at line 21 which returns before reaching line 56).
Extraction Patterns
Code throughout cudafe++ extracts execution space category using bitmask tests:
| Mask | Test | Meaning | Used in |
|---|---|---|---|
& 0x30 | == 0x00 | No explicit annotation (implicit host) | Space classification |
& 0x30 | == 0x10 | __host__ only | Space classification |
& 0x30 | == 0x20 | __device__ only | nv_is_device_only_routine |
& 0x30 | == 0x30 | __host__ __device__ | Space classification |
& 0x60 | == 0x20 | Device, not kernel | Device-only predicate |
& 0x60 | == 0x60 | __global__ kernel (implies device) | Kernel identification |
& 0x40 | != 0 | Is a __global__ kernel | Stub generation gate |
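The masks and tests in the tables above can be combined into one classifier. The sketch below is illustrative only: the enum labels and `classify_exec_space` are hypothetical names, and the ordering (kernel test first, then the 0x30 annotation bits) is an assumption chosen so that a `__global__` byte of 0xE1 is not misread as device-only.

```c
#include <assert.h>
#include <stdint.h>

/* OR masks as documented for the three keyword handlers. */
enum {
    EXEC_MASK_DEVICE = 0x23,
    EXEC_MASK_HOST   = 0x15,
    EXEC_MASK_GLOBAL = 0x61 | 0x80   /* main body plus tail guard */
};

static_assert((EXEC_MASK_DEVICE | EXEC_MASK_HOST) == 0x37,
              "__host__ __device__ must yield 0x37");
static_assert(EXEC_MASK_GLOBAL == 0xE1, "__global__ must yield 0xE1");

/* Hypothetical classification labels (not names from the binary). */
typedef enum {
    SPACE_IMPLICIT_HOST,
    SPACE_HOST,
    SPACE_DEVICE,
    SPACE_HOST_DEVICE,
    SPACE_KERNEL
} exec_space_t;

exec_space_t classify_exec_space(uint8_t byte182)
{
    if (byte182 & 0x40)              /* bit 6: __global__ kernel */
        return SPACE_KERNEL;
    switch (byte182 & 0x30) {        /* explicit annotation bits */
    case 0x10: return SPACE_HOST;
    case 0x20: return SPACE_DEVICE;
    case 0x30: return SPACE_HOST_DEVICE;
    default:   return SPACE_IMPLICIT_HOST;
    }
}
```

Testing bit 6 before the `& 0x30` switch matters because 0xE1 & 0x30 is 0x20, the same value a plain `__device__` function produces.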
Variable Memory Space Bitfield (Byte +148)
For variable entities (kind 7), byte +148 encodes the CUDA memory space:
Byte at entity+148:
bit 0 (0x01) __device__ Variable resides in device global memory
bit 1 (0x02) __shared__ Variable resides in shared memory
bit 2 (0x04) __constant__ Variable resides in constant memory
These bits are mutually exclusive in valid programs. The attribute handlers enforce this by checking for conflicting combinations:
// From apply_nv_device_attr (sub_40EB80), variable path:
a2->byte_148 |= 0x01; // set __device__
int shared_or_constant = a2->byte_148 & 0x06; // check __shared__ | __constant__
if (popcount(shared_or_constant) + (a2->byte_148 & 0x01) == 2)
error(3481, ...); // conflicting memory spaces
The __device__ attribute on a function (kind 11) does NOT touch byte +148. It writes to byte +182 (execution space) instead. The memory space byte is strictly for variables.
Extended Memory Space (Byte +149)
Byte at entity+149:
bit 0 (0x01) __managed__ Unified memory, accessible from both host and device
Set by apply_nv_managed_attr (sub_40E0D0). The handler also sets bit 0 of +148 (__device__) because managed memory resides in device global memory. Additional validation:
- Error 3481: conflicting if __shared__ or __constant__ is already set
- Error 3482: cannot be thread-local (byte +161 bit 7)
- Error 3485: cannot be a local variable (byte +81 bit 2)
- Error 3577: incompatible with __grid_constant__ parameter (byte +164 bit 2)
Constexpr and Kernel Flags (Byte +176, +179)
Byte +176: Member Function Flags
Byte at entity+176:
bit 7 (0x80) static_member Function is a static class member
Tested by apply_nv_global_attr to detect static __global__ functions. The check is (signed char)(a2->byte_176) < 0, which is true when bit 7 is set. Combined with the local-function test (byte +81 bit 2 clear), this triggers warning 3507.
Byte +179: Constexpr / Kernel Property Flags
Byte at entity+179:
bit 1 (0x02) kernel_body Function has a kernel body (used for stub generation)
bit 2 (0x04) (instantiation) Instantiation-required status
bit 4 (0x10) constexpr Function is constexpr
bit 5 (0x20) noinline Function is noinline
The kernel_body flag at bit 1 (0x02) is the primary gate for device stub generation. The backend code generator (gen_routine_decl in cp_gen_be.c) checks:
// From gen_routine_decl (sweep p1.04, line ~1430)
if ((*(_BYTE *)(v3 + 182) & 0x40) != 0 // is __global__ kernel
&& (*(_BYTE *)(v3 + 179) & 2) != 0) // has kernel body
{
// Emit __wrapper__device_stub_<name>(<params>) forwarding body
}
The constexpr flag at bit 4 (0x10) is tested during __global__ attribute validation. When set, the void-return-type check AND the lambda check are both skipped:
// From apply_nv_global_attr (sub_40E1F0), lines 39-50
if ( (*(_BYTE *)(a2 + 179) & 0x10) == 0 ) // NOT constexpr
{
// Non-constexpr __global__: check return type and lambda
if ( (*(_BYTE *)(a2 + 191) & 1) != 0 )
error(3506, ...); // lambda __global__ not allowed
else if ( !is_void_return_type(a2) )
error(3505, ...); // must return void
}
// If constexpr (bit 4 set): skip both checks entirely
This is a separate check from the static-member test (byte +176 bit 7 with byte +81 bit 2), which appears earlier at line 28:
if ( *(char *)(a2 + 176) < 0 // static member (bit 7 set)
&& (*(_BYTE *)(a2 + 81) & 4) == 0 ) // not local
warning(3507, "__global__"); // static __global__ warning
Operator Function Kind (Byte +166)
Byte at entity+166:
Value 5: operator function (operator(), operator+, etc.)
Tested during __global__ attribute application. If the entity is an operator function (value == 5), error 3644 is emitted: operator() cannot be declared __global__.
// From apply_nv_global_attr (sub_40E1F0), line 30-31
if ( *(_BYTE *)(a2 + 166) == 5 )
sub_4F8200(7, 3644, a1 + 56); // error: __global__ on operator function
This prevents declaring lambda call operators as kernels via the __global__ attribute directly (extended lambdas use a different mechanism with wrapper types).
Parameter List (Pointer +152)
For routine entities, offset +152 holds a pointer to the function prototype structure. The prototype's first field (+0) points to the parameter list head -- a linked list of parameter entities.
CUDA validation checks two constraints against this prototype:
- Variadic check: prototype +16 bit 0 indicates variadic parameters. If set, error 3503 is emitted (variadic __global__ functions are not allowed).
- __grid_constant__ check: the post-validation function nv_validate_cuda_attributes (sub_6BC890) walks the parameter list looking for parameters with byte +32 bit 1 set (the __grid_constant__ flag on a parameter entity). If found on a non-__global__ function, error 3702 is emitted.
// From nv_validate_cuda_attributes (sub_6BC890), lines 26-39
// Walk parameter list from prototype
v10 = **(__int64 ****)(v2 + 152); // parameter list head
while (v10) {
if (((_BYTE)v10[4] & 2) != 0) // parameter byte+32 bit 1 = __grid_constant__
error(3702, ...); // grid_constant on non-kernel parameter
v10 = (__int64 **)*v10; // next parameter
}
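A compilable version of this walk is sketched below. The `param_t` type is a hypothetical, heavily simplified stand-in for the real parameter entity: only the next pointer at +0 and the flag byte at +32 are modeled, and `count_grid_constant_params` is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified parameter node. Only the two fields the walk
 * touches are modeled: the next pointer at +0 and the flag byte at +32. */
typedef struct param {
    struct param *next;                        /* +0 */
    uint8_t pad[32 - sizeof(struct param *)];  /* bytes not modeled here */
    uint8_t flags;                             /* +32, bit 1 = __grid_constant__ */
} param_t;

static_assert(offsetof(param_t, flags) == 32, "flag byte must sit at +32");

/* Count parameters carrying the __grid_constant__ flag. On a function
 * that is not __global__, each hit corresponds to one error 3702. */
size_t count_grid_constant_params(const param_t *head)
{
    size_t hits = 0;
    for (const param_t *p = head; p != NULL; p = p->next)
        if (p->flags & 0x02)
            hits++;
    return hits;
}
```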
CUDA Extended Flags (Byte +183)
Byte at entity+183:
bit 3 (0x08) __nv_register_params__ Function uses register parameter passing
bit 6 (0x40) __cluster_dims__ intent cluster_dims attribute with no arguments
nv_register_params (Bit 0x08)
Set by apply_nv_register_params_attr (sub_40B0A0). When present, the post-validation function nv_validate_cuda_attributes checks whether the function is __global__ or __host__, and emits error 3661 if so. Device-only functions (__device__ without __host__) are exempt:
// From nv_validate_cuda_attributes (sub_6BC890), lines 42-69
if ( (*(_BYTE *)(a1 + 183) & 8) == 0 ) // no __nv_register_params__
goto check_launch_config;
if ( (v3 & 0x40) != 0 ) { // __global__ kernel
v4 = "__global__";
error(3661, &qword_126EDE8, v4); // incompatible
} else if ( (v3 & 0x30) != 0x20 ) { // NOT device-only (has host component)
v4 = "__host__";
error(3661, &qword_126EDE8, v4); // incompatible
}
// else: device-only function -- __nv_register_params__ is allowed
The key check is (v3 & 0x30) != 0x20: when the execution space annotation bits indicate device-only (bits 4,5 = 0x20), the error is skipped. This means __nv_register_params__ is valid only on __device__ functions -- it is rejected on __global__, __host__, and __host__ __device__ functions.
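The decision can be restated as a small pure function. This is a sketch, not the binary's code: `register_params_conflict` is a hypothetical name, and it returns the keyword that error 3661 would name instead of reporting the error.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the keyword that error 3661 would name, or NULL when
 * __nv_register_params__ is accepted (or not present at all). */
const char *register_params_conflict(uint8_t byte182, uint8_t byte183)
{
    if ((byte183 & 0x08) == 0)        /* attribute not applied */
        return NULL;
    if (byte182 & 0x40)               /* __global__ kernel */
        return "__global__";
    if ((byte182 & 0x30) != 0x20)     /* any host component */
        return "__host__";
    return NULL;                      /* device-only: allowed */
}
```

With the result bytes listed earlier, 0x23 (`__device__`) passes, while 0xE1, 0x15, 0x37, and an unannotated 0x00 are all rejected.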
cluster_dims Intent (Bit 0x40)
Set by apply_nv_cluster_dims_attr (sub_4115F0) when the attribute is applied with zero arguments. This marks the function as "wants cluster dimensions" without specifying concrete values -- the values may come from a separate __block_size__ attribute or from a template parameter.
Template / Linkage Flags (Field +184)
Offset +184 is a 48-bit (6-byte) field encoding template instantiation and linkage information. The __global__ attribute handler tests a specific bit pattern to detect template lambdas that are pending instantiation and have no definition body yet:
// From apply_nv_global_attr (sub_40E1F0), line 21
if ( (*(_QWORD *)(a2 + 184) & 0x800001000000LL) == 0x800000000000LL )
{
// This is a template lambda with external linkage but no definition yet.
// Applying __global__ to it is an error.
v14 = sub_6BC6B0(a2, 0); // get entity name
sub_4F7510(3469, a1 + 56, "__global__", v14);
return;
}
The mask 0x800001000000 tests two bits:
- Bit 47 (0x800000000000): template instantiation pending
- Bit 24 (0x000001000000): has definition body
When bit 47 is set but bit 24 is clear, the entity is a template lambda awaiting instantiation that has no body yet -- applying __global__ (or __device__) to such an entity produces error 3469.
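The two-bit test can be written with named constants, as in the sketch below. The macro names and `global_rejected_3469` are hypothetical; only the mask values and the comparison come from the decompiled snippet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TMPL_INST_PENDING 0x800000000000ull   /* bit 47 */
#define TMPL_HAS_BODY     0x000001000000ull   /* bit 24 */

static_assert(TMPL_INST_PENDING == (1ull << 47), "bit 47");
static_assert(TMPL_HAS_BODY == (1ull << 24), "bit 24");

/* True when the qword read at entity+184 matches the pattern that makes
 * the __global__ handler report error 3469: pending, but no body. */
bool global_rejected_3469(uint64_t qword184)
{
    return (qword184 & (TMPL_INST_PENDING | TMPL_HAS_BODY))
           == TMPL_INST_PENDING;
}
```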
Launch Configuration Struct (Pointer +256)
Offset +256 holds a pointer to a lazily-allocated 56-byte launch configuration structure. This pointer is NULL for functions without any launch configuration attributes. The allocation function sub_5E52F0 creates and zero-initializes the struct on first use.
Launch Config Layout
struct launch_config_t { // 56 bytes, allocated by sub_5E52F0
int64_t maxThreadsPerBlock; // +0 from __launch_bounds__(arg1)
int64_t minBlocksPerMP; // +8 from __launch_bounds__(arg2)
int32_t maxBlocksPerCluster; // +16 from __launch_bounds__(arg3)
int32_t cluster_dim_x; // +20 from __cluster_dims__(x) or __block_size__(x,y,z,cx)
int32_t cluster_dim_y; // +24 from __cluster_dims__(y) or __block_size__(x,y,z,cx,cy)
int32_t cluster_dim_z; // +28 from __cluster_dims__(z) or __block_size__(x,y,z,cx,cy,cz)
int32_t maxnreg; // +32 from __maxnreg__(N)
int32_t local_maxnreg; // +36 from __local_maxnreg__(N)
int32_t block_size_x; // +40 from __block_size__(x)
int32_t block_size_y; // +44 from __block_size__(y)
int32_t block_size_z; // +48 from __block_size__(z)
uint8_t flags; // +52 bit 0=cluster_dims_set, bit 1=block_size_set
}; // 3 bytes padding to 56
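The documented offsets can be checked mechanically. The sketch below restates the struct and lets the compiler verify that the field types reproduce the listed offsets and the 56-byte total; it assumes a typical ABI in which `int64_t` and `int32_t` are naturally aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct launch_config_t {
    int64_t maxThreadsPerBlock;   /* +0  */
    int64_t minBlocksPerMP;       /* +8  */
    int32_t maxBlocksPerCluster;  /* +16 */
    int32_t cluster_dim_x;        /* +20 */
    int32_t cluster_dim_y;        /* +24 */
    int32_t cluster_dim_z;        /* +28 */
    int32_t maxnreg;              /* +32 */
    int32_t local_maxnreg;        /* +36 */
    int32_t block_size_x;         /* +40 */
    int32_t block_size_y;         /* +44 */
    int32_t block_size_z;         /* +48 */
    uint8_t flags;                /* +52 */
};

/* Compile-time checks against the offsets documented above. */
static_assert(offsetof(struct launch_config_t, maxBlocksPerCluster) == 16, "+16");
static_assert(offsetof(struct launch_config_t, cluster_dim_x) == 20, "+20");
static_assert(offsetof(struct launch_config_t, maxnreg) == 32, "+32");
static_assert(offsetof(struct launch_config_t, block_size_x) == 40, "+40");
static_assert(offsetof(struct launch_config_t, flags) == 52, "+52");
static_assert(sizeof(struct launch_config_t) == 56, "56 bytes with padding");
```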
Attribute-to-Field Mapping
| Attribute | Arguments | Fields Written | Handler |
|---|---|---|---|
__launch_bounds__(M) | 1 int | +0 = M | sub_411C80 |
__launch_bounds__(M,N) | 2 ints | +0 = M, +8 = N | sub_411C80 |
__launch_bounds__(M,N,C) | 3 ints | +0 = M, +8 = N, +16 = C | sub_411C80 |
__cluster_dims__(x) | 1 int | +20 = x, +24 = 1, +28 = 1, +52 bit 0 | sub_4115F0 |
__cluster_dims__(x,y) | 2 ints | +20 = x, +24 = y, +28 = 1, +52 bit 0 | sub_4115F0 |
__cluster_dims__(x,y,z) | 3 ints | +20 = x, +24 = y, +28 = z, +52 bit 0 | sub_4115F0 |
__cluster_dims__() | 0 args | entity+183 bit 6 (intent flag only) | sub_4115F0 |
__maxnreg__(N) | 1 int | +32 = N | sub_410F70 |
__local_maxnreg__(N) | 1 int | +36 = N | sub_411090 |
__block_size__(x,y,z) | 3 ints | +40 = x, +44 = y, +48 = z, +52 bit 1 | sub_4109E0 |
__block_size__(x,y,z,cx,cy,cz) | 6 ints | block + cluster dims, +52 bits 0+1 | sub_4109E0 |
Post-Validation Constraints
The function nv_validate_cuda_attributes (sub_6BC890) performs cross-attribute validation after all attributes have been applied. The key checks on the launch config struct:
1. __launch_bounds__ only on __global__:
// sub_6BC890, lines 45-51
v5 = *(_QWORD *)(a1 + 256); // launch config pointer
if ( !v5 ) goto done;
if ( (v3 & 0x40) != 0 ) // if __global__, skip to next check
goto check_cluster;
// Not __global__ but has launch_bounds values
if ( *(_QWORD *)v5 || *(_QWORD *)(v5 + 8) )
error(3534, "__launch_bounds__"); // launch_bounds on non-kernel
2. __cluster_dims__/__block_size__ only on __global__:
// sub_6BC890, lines 81-87
if ( (*(_BYTE *)(a1 + 183) & 0x40) != 0 // cluster_dims intent
|| *(int *)(v5 + 20) >= 0 ) // cluster_dim_x set
{
v11 = "__cluster_dims__";
if ( *(int *)(v5 + 40) > 0 )
v11 = "__block_size__";
error(3534, v11); // not allowed on non-kernel
}
3. maxBlocksPerCluster vs cluster product:
// sub_6BC890, lines 101-114
v6 = *(int *)(v5 + 20); // cluster_dim_x
if ( (int)v6 > 0 ) {
v7 = *(int *)(v5 + 16); // maxBlocksPerCluster
if ( (int)v7 > 0
&& v7 < *(int*)(v5 + 28) * *(int*)(v5 + 24) * v6 )
{
// maxBlocksPerCluster < cluster_dim_x * cluster_dim_y * cluster_dim_z
error(3707, "__cluster_dims__"); // inconsistent values
}
}
4. __maxnreg__ only on __global__:
// sub_6BC890, lines 116-121
if ( *(int *)(v5 + 32) < 0 ) // maxnreg not set (sentinel -1)
goto check_launch_maxnreg_conflict;
if ( (v9 & 0x40) == 0 ) // not __global__
error(3715, "__maxnreg__"); // maxnreg on non-kernel
5. __launch_bounds__ + __maxnreg__ conflict:
// sub_6BC890, lines 144-145
if ( *(_QWORD *)v5 ) // maxThreadsPerBlock set
error(3719, "__launch_bounds__ and __maxnreg__");
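Check 3 above (maxBlocksPerCluster against the cluster volume) is the only one that involves arithmetic, so it is worth isolating. The sketch below uses a hypothetical helper name and takes the four fields as plain arguments. Widening the product to 64 bits is a choice made here to avoid overflow in the illustration; it is not a claim about the exact width used in the binary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when error 3707 would be reported, i.e.
 * maxBlocksPerCluster is set but smaller than the product of the three
 * cluster dimensions. */
bool cluster_exceeds_max_blocks(int32_t max_blocks_per_cluster,
                                int32_t cx, int32_t cy, int32_t cz)
{
    if (cx <= 0)                       /* cluster_dim_x not set */
        return false;
    if (max_blocks_per_cluster <= 0)   /* no third __launch_bounds__ arg */
        return false;
    int64_t product = (int64_t)cz * cy * cx;
    return max_blocks_per_cluster < product;
}
```

For example, a cluster of 2 x 2 x 2 blocks combined with a maxBlocksPerCluster of 4 is flagged, while the same cluster with a limit of 8 is accepted.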
Entity Kind Reference
The entity kind byte at +80 determines which offsets are valid. CUDA attribute handlers gate on this value:
| Kind | Value | CUDA offsets used | Handler examples |
|---|---|---|---|
| Variable | 7 | +148, +149, +161, +164 | __device__, __shared__, __constant__, __managed__ |
| Field | 8 | +136 | packed, aligned (non-CUDA) |
| Routine | 11 | +144, +152, +166, +176, +179, +182, +183, +184, +191, +256 | All execution space attrs, launch config |
Cross-References
- Execution Spaces -- deep dive on byte +182 semantics and the six virtual override mismatch errors
- Attributes Overview -- attribute kind enum (86-108) and apply_one_attribute dispatch
- IL Overview -- IL entry kinds 7 (variable), 8 (field), 11 (routine) node sizes
- Scope Entry -- 784-byte scope structure that contains entity chains
Scope Entry
A scope entry is a 784-byte record, one element of the scope stack, which is the central data structure in cudafe++ for tracking nested lexical scopes during C++ parsing and semantic analysis. The scope stack is a flat array at qword_126C5E8, indexed by dword_126C5E4 (current depth). Every time the parser enters a new scope -- file, block, function body, class definition, template declaration, namespace -- a new 784-byte entry is pushed onto this stack. When the scope closes, the entry is popped and all associated cleanup runs: symbol table housekeeping, using-directive deactivation, name collision discriminator assignment, template parameter restoration, and memory region disposal.
This page documents the scope stack entry layout, the scope kind enum, the key flag bytes, the CUDA-specific additions (device/host scope context), the template instantiation depth counters, and the major push/pop functions.
Key Facts
| Property | Value |
|---|---|
| Entry size | 784 bytes (constant, verified by "Stack entry size: %d\n" in debug statistics) |
| Stack base pointer | qword_126C5E8 (global, array of 784-byte entries) |
| Current depth index | dword_126C5E4 (global, 0-based index of topmost entry) |
| Function scope index | dword_126C5D8 (-1 if not inside a function scope) |
| Class scope index | dword_126C5C8 (-1 if not inside a class scope) |
| File scope index | dword_126C5DC |
| EDG source file | scope_stk.c (address range 0x6FE160-0x7106B0, ~160 functions) |
| Push function | sub_700560 (push_scope_full, 1476 lines, 13 parameters) |
| Pop function | sub_7076A0 (pop_scope, 1142 lines) |
| Index arithmetic | 784 * index for byte offset; the reverse is a right shift by 4 combined with a multiply by 0x7D6343EB1A1F58D1, the multiplicative inverse of 49 modulo 2^64 (exact division by 784 = 16 * 49) |
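The index arithmetic in the last row relies on exact division by a modular inverse: 49 * 0x7D6343EB1A1F58D1 is congruent to 1 modulo 2^64, so multiplying an exact multiple of 49 by that constant yields the quotient. The sketch below demonstrates this with hypothetical helper names, and assumes the shift is applied before the multiply (the usual compiler idiom for pointer differences).

```c
#include <assert.h>
#include <stdint.h>

#define SCOPE_ENTRY_SIZE 784u                 /* 16 * 49 */
#define INV49 0x7D6343EB1A1F58D1ull           /* 49 * INV49 == 1 (mod 2^64) */

/* Byte offset of entry `index` from the scope stack base. */
uint64_t scope_offset(uint64_t index)
{
    return index * SCOPE_ENTRY_SIZE;
}

/* Recover the index from an exact byte offset: divide by 16 with a
 * shift, then divide by 49 by multiplying with its modular inverse. */
uint64_t scope_index(uint64_t byte_offset)
{
    assert(byte_offset % SCOPE_ENTRY_SIZE == 0);  /* only exact multiples */
    return (byte_offset >> 4) * INV49;
}
```

The trick only works when the offset is an exact multiple of 784, which always holds for the difference between an entry pointer and the stack base.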
Scope Stack Global Variables
| Global | Type | Meaning |
|---|---|---|
qword_126C5E8 | void* | Base pointer to the scope stack array |
dword_126C5E4 | int32 | Current scope stack top index (0-based) |
dword_126C5D8 | int32 | Current function scope index (-1 if none) |
dword_126C5DC | int32 | File scope index / secondary depth marker |
dword_126C5AC | int32 | Saved depth for template instantiation |
qword_126C5D0 | void* | Current routine descriptor pointer |
dword_126C5B8 | int32 | is_member_of_template flag |
dword_126C5C8 | int32 | Class scope index (-1 if none) |
dword_126C5C4 | int32 | Nested class / lambda scope index (-1 if none) |
dword_126C5E0 | int32 | Scope hash / identifier |
dword_126C5B4 | int32 | Namespace scope index |
dword_126C5BC | int32 | Class scope depth counter |
qword_126C598 | void* | Pack expansion context pointer |
Full Offset Map
The table documents every field observed in the 784-byte scope stack entry. These are scope stack entry fields, not IL scope node fields (the IL scope is a separate 288-byte structure pointed to from offset +192).
| Offset | Size | Field | Evidence |
|---|---|---|---|
+0 | 4 | scope_number | Unique identifier for this scope instance; checked in pop_scope assertions |
+4 | 1 | scope_kind | Scope kind enum byte (see table below) |
+5 | 1 | scope_flags_1 | General flags |
+6 | 2 | scope_flags_2 | Bit 0 = void return flag; bit 1 = device scope context (NVIDIA addition); bit 2 = inline namespace; in some contexts bit 1 = is_extern, bit 5 = inline_namespace |
+7 | 1 | access_flags | Bit 0 = in class context; bit 1 = has using-directives; bit 4 = lambda body |
+8 | 1 | scope_flags_4 | Template/class/reactivation bits; bit 5 (0x20) = is_template_scope |
+9 | 1 | scope_flags_5 | Bit 0 = needs cleanup / scope pop control -- when set, triggers sub_67B4E0() cleanup of template instantiation artifacts before popping |
+10 | 1 | scope_flags_6 | Bit 0 = in_template_context |
+11 | 1 | scope_flags (unnamed) | Sign bit (bit 7) = in_template_dependent_context |
+12 | 1 | scope_flags_7 | Bit 0 = in_template_arg_scan; bit 2 = suppress_diagnostics; bit 4 = has_concepts / void_return_warned |
+13 | 1 | scope_flags_8 | Bit 4 = warned_no_return |
+14 | 1 | flags3 | Bit 2 = in_device_code (NVIDIA-specific, marks whether code in this scope is device code) |
+24 | 8 | symbol_chain_or_hash_ptr | Pointer to the name symbol chain or hash table for name lookup |
+32 | 8 | hash_table_ptr | Hash table pointer (when scope uses hashing for lookup) |
+32-+144 | 112 | Inline tail info | When +24 is 0, this region contains inline tail pointers for entity lists: +40 = variables tail, +48 = types tail, +56 = routines next, +88 = asm tail, +112 = namespaces tail, +144 = templates tail |
+192 | 8 | il_scope_ptr | Pointer to the associated 288-byte IL scope node (the persistent representation that survives scope pop) |
+200 | 8 | local_static_init_list | List of local static variable initializers |
+208 | 8 | vla_dimensions_list / scope_depth | VLA dimension tracking (C mode); scope depth integer |
+216 | 8 | class_type_ptr / tu_ptr | For class scopes: pointer to the class type symbol. For template instantiation scopes: pointer to the translation unit descriptor |
+224 | 8 | routine_descriptor | Pointer to the current routine descriptor (set for function scopes) |
+232 | 8 | namespace_entity | For namespace scopes: pointer to the namespace entity |
+240 | 4 | region_number | Memory region number (-1 = invalid sentinel, set by alloc_scope) |
+256 | 4 | parent_scope_index | Index of the enclosing scope in the stack (reported at both +240 and +256 in different sweeps -- likely +240 = region, +256 = parent) |
+272 | 8 | name_hiding_list | Linked list of names hidden by declarations in this scope |
+296 | 8 | local_vars_tail | Tail pointer for the local variables list |
+368 | 8 | source_begin | Source position at scope entry |
+376 | 8 | associated_entity / parent_template_info | Associated entity pointer / template information pointer |
+384 | 8 | template_argument_list | Template argument list for instantiation scopes |
+408 | 4 | try_block_index / enclosing_class_scope_index | Try block index (-1 = none); in class contexts, index to enclosing class scope |
+416 | 8 | module_info | Module information pointer (C++20 modules support) |
+424 | 4 | line_number | Line number at scope open (for diagnostics) |
+496 | 8 | root_object_lifetime | Root of the object lifetime tree for this scope |
+512 | 8 | freed_lifetime_list | List of freed object lifetimes awaiting reuse |
+560 | 4 | enclosing_scope_index | Parent scope index for pop validation |
+576 | 4 | template_instantiation_depth_counter | Nested instantiation depth counter -- incremented on recursive template instantiation push, decremented on pop; when > 0, pop just decrements without actually popping the scope stack |
+580 | 4 | orig_depth | Original scope stack depth at time of template instantiation push; validated during pop |
+584 | 4 | saved_scope_depth | Saved scope depth; restored via dword_126C5AC on template instantiation pop |
+608 | 8 | class_def_info_ptr | Class definition information pointer |
+624 | 8 | template_info_ptr | Template information record pointer |
+632 | 8 | template_parameter_list / class_info_ptr | Template parameter list pointer |
+704 | 8 | lambda_counter | Lambda expression counter within this scope (int64) |
+720 | 4 | fixup_counter | Deferred fixup counter |
+728 | 8 | has_been_completed | Completion flag (int64 used as bool) |
+736 | 8 | deferred_fixup_list_head | Head of deferred fixup linked list |
+744 | 8 | deferred_fixup_list_tail | Tail of deferred fixup linked list |
Scope Stack Kind Enum
The scope stack kind byte at +4 uses a different, larger enum than the IL scope kind (sck_*) at IL scope node +28. The scope stack enum includes additional entries for reactivation states, template instantiation context, and module scopes. The mapping is derived from scope_kind_to_string (sub_7000E0, 77 lines) which contains display string literals for each enum value, and from display_scope (sub_5F2140) in il_to_str.c.
Scope Stack Kind Values
| Value | Name | Display String | Notes |
|---|---|---|---|
| 0 | ssk_source_file | "source file" | Top-level file scope. Maps to IL sck_file (0). |
| 1 | ssk_func_prototype | "function prototype" | Function prototype scope (parameter names). Maps to IL sck_func_prototype (1). |
| 2 | ssk_block | "block" | Block scope (compound statement). Maps to IL sck_block (2). |
| 3 | ssk_alloc_namespace | "alloc_namespace" | Namespace scope (first opening). Maps to IL sck_namespace (3). |
| 4 | ssk_namespace_extension | "namespace extension" | Namespace extension (reopened namespace N { ... }). |
| 5 | ssk_namespace_reactivation | "namespace reactivation" | Namespace scope reactivated for out-of-line definition. |
| 6 | ssk_class_struct_union | "class/struct/union" | Class/struct/union scope. Maps to IL sck_class_struct_union (6). |
| 7 | ssk_class_reactivation | "class reactivation" | Class scope reactivated for out-of-line member definition (e.g., void MyClass::foo() { ... }). |
| 8 | ssk_template_declaration | "template declaration" | Template declaration scope (template<...>). Maps to IL sck_template_declaration (8). |
| 9 | ssk_template_instantiation | "template instantiation" | Template instantiation scope (pushed by push_template_instantiation_scope). |
| 10 | ssk_instantiation_context | "instantiation context" | Instantiation context scope (tracks the chain of instantiation sites for diagnostics). |
| 11 | ssk_module_decl_import | "module decl import" | C++20 module declaration/import scope. |
| 12 | ssk_module_isolation | "module isolation" | C++20 module isolation scope (module purview boundary). |
| 13 | ssk_pragma | "pragma" | Pragma scope (for pragma-delimited regions). |
| 14 | ssk_function_access | "function access" | Function access scope. |
| 15 | ssk_condition | "condition" | Condition scope (if/while/for condition variable). Maps to IL sck_condition (15). |
| 16 | ssk_enum | "enum" | Scoped enum scope (C++11 enum class). Maps to IL sck_enum (16). |
| 17 | ssk_function | "function" | Function body scope (has routine pointer, parameters, ctor init list). Maps to IL sck_function (17). |
Relationship to IL Scope Kinds
The IL scope node (288 bytes, allocated by alloc_scope at sub_5E7D80) uses a smaller sck_* enum at its +28 field. The scope stack entry at +192 points to the IL scope that persists after the stack entry is popped. Not all scope stack kinds produce an IL scope -- reactivation kinds (5, 7) and context kinds (9, 10) reuse existing IL scopes.
| IL sck_* | Value | Corresponding stack kind(s) |
|---|---|---|
| sck_file | 0 | 0 |
| sck_func_prototype | 1 | 1 |
| sck_block | 2 | 2 |
| sck_namespace | 3 | 3, 4, 5 |
| sck_class_struct_union | 6 | 6, 7 |
| sck_template_declaration | 8 | 8 |
| sck_condition | 15 | 15 |
| sck_enum | 16 | 16 |
| sck_function | 17 | 17 |
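The two tables can be read together as a many-to-one mapping from scope stack kinds to IL scope kinds. The following is a minimal sketch of that mapping, not the binary's own code: the function name `il_sck_for_stack_kind` and the `IL_SCK_NONE` sentinel are hypothetical, and kinds 9 through 14 return the sentinel only because the table above lists no IL counterpart for them.

```c
#include <assert.h>

/* Scope stack kinds (byte at scope_entry+4); values copied from the table above. */
enum scope_stack_kind {
    ssk_source_file            = 0,
    ssk_func_prototype         = 1,
    ssk_block                  = 2,
    ssk_alloc_namespace        = 3,
    ssk_namespace_extension    = 4,
    ssk_namespace_reactivation = 5,
    ssk_class_struct_union     = 6,
    ssk_class_reactivation     = 7,
    ssk_template_declaration   = 8,
    ssk_template_instantiation = 9,
    ssk_instantiation_context  = 10,
    ssk_module_decl_import     = 11,
    ssk_module_isolation       = 12,
    ssk_pragma                 = 13,
    ssk_function_access        = 14,
    ssk_condition              = 15,
    ssk_enum                   = 16,
    ssk_function               = 17
};

/* Hypothetical sentinel: no IL sck_* counterpart is listed for this kind. */
#define IL_SCK_NONE (-1)

/* Return the IL sck_* value for a scope stack kind, per the mapping table. */
int il_sck_for_stack_kind(int ssk)
{
    assert(ssk >= ssk_source_file && ssk <= ssk_function);
    switch (ssk) {
    case ssk_source_file:            return 0;   /* sck_file */
    case ssk_func_prototype:         return 1;   /* sck_func_prototype */
    case ssk_block:                  return 2;   /* sck_block */
    case ssk_alloc_namespace:
    case ssk_namespace_extension:
    case ssk_namespace_reactivation: return 3;   /* sck_namespace */
    case ssk_class_struct_union:
    case ssk_class_reactivation:     return 6;   /* sck_class_struct_union */
    case ssk_template_declaration:   return 8;   /* sck_template_declaration */
    case ssk_condition:              return 15;  /* sck_condition */
    case ssk_enum:                   return 16;  /* sck_enum */
    case ssk_function:               return 17;  /* sck_function */
    default:                         return IL_SCK_NONE;
    }
}
```

The fall-through cases make the grouping explicit: extension and reactivation kinds resolve to the same IL kind as the scope they reopen.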
CUDA-Specific Fields
NVIDIA added two device/host scope tracking bits to the scope entry, grafted onto the EDG base structure.
Byte +6, Bit 1: Device Scope Context
scope_entry+6, bit 1 (0x02):
When set: code in this scope is compiled for the device execution space.
When clear: code in this scope is compiled for the host.
This bit is tested by CUDA-specific code paths to determine whether the current compilation context targets device or host. It affects:
- Whether __device__-only functions suppress certain diagnostics (e.g., missing return value warning at check_void_return_okay, sub_719D20)
- Whether device-specific type validation applies
- Severity overrides via byte_126ED55 (default diagnostic severity for device mode)
The bit is set when entering __device__ or __global__ function scopes and cleared when entering __host__ scopes. This allows mixed host/device compilation to track which context is active at any nesting depth.
Byte +14, Bit 2: In Device Code
scope_entry+14, bit 2 (0x04):
Secondary device-code marker. Set when the parser is inside a device
function body. Used in conjunction with dword_106C2C0 (CUDA device
compilation mode flag).
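As an illustration of the two bit positions described above, here is a small sketch of accessor helpers over a raw 784-byte scope entry. The helper and macro names are hypothetical; only the byte offsets (+6, +14), the masks (0x02, 0x04), and the entry size come from this page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCOPE_ENTRY_SIZE       784
#define SCOPE_DEVICE_CTX_BYTE  6     /* scope_entry+6 */
#define SCOPE_DEVICE_CTX_MASK  0x02  /* bit 1: device scope context */
#define SCOPE_IN_DEVICE_BYTE   14    /* scope_entry+14 */
#define SCOPE_IN_DEVICE_MASK   0x04  /* bit 2: in device code */

/* Nonzero when the scope is marked as compiling for the device. */
int scope_is_device_context(const uint8_t *entry)
{
    assert(entry != NULL);
    return (entry[SCOPE_DEVICE_CTX_BYTE] & SCOPE_DEVICE_CTX_MASK) != 0;
}

/* Nonzero when the secondary "in device code" marker is set. */
int scope_in_device_code(const uint8_t *entry)
{
    assert(entry != NULL);
    return (entry[SCOPE_IN_DEVICE_BYTE] & SCOPE_IN_DEVICE_MASK) != 0;
}

/* Set or clear the device-context bit, leaving the other bits of byte +6 intact. */
void scope_set_device_context(uint8_t *entry, int on)
{
    assert(entry != NULL);
    if (on)
        entry[SCOPE_DEVICE_CTX_BYTE] =
            (uint8_t)(entry[SCOPE_DEVICE_CTX_BYTE] | SCOPE_DEVICE_CTX_MASK);
    else
        entry[SCOPE_DEVICE_CTX_BYTE] =
            (uint8_t)(entry[SCOPE_DEVICE_CTX_BYTE] & ~SCOPE_DEVICE_CTX_MASK);
}
```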
Template Instantiation Depth Counters
Three fields at offsets +576, +580, and +584 form the template instantiation depth tracking system. These fields enable the scope stack to handle nested template instantiations without fully pushing/popping scope entries at every nesting level.
Mechanism
When push_template_instantiation_scope (sub_709DE0) sets up a template instantiation, it writes the current scope stack depth into +580 (orig_depth) and the saved global depth into +584 (saved_scope_depth). The +576 counter starts at 0.
If the same template scope is re-entered for a nested instantiation (e.g., recursive template), +576 is incremented rather than pushing a full new scope entry. On pop, pop_template_instantiation_scope (sub_708EE0) checks +576:
if (scope_entry[576] > 0) {
    scope_entry[576]--;  // just decrement, don't pop
    return;
}
// Full pop: restore scope stack to orig_depth
pop_scopes_to(scope_entry[580]);
restore(dword_126C5AC, scope_entry[584]);
This optimization avoids deep scope stack growth during deeply recursive template instantiations (e.g., std::tuple<T1, T2, ..., TN> with large N).
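The counter behaviour can be exercised in isolation with a reduced model. The sketch below is an assumption-laden illustration rather than a reconstruction of sub_709DE0 or sub_708EE0: the struct and function names are hypothetical, the scope stack is collapsed to a single depth integer, and `orig_depth` is assumed to hold the depth before the push.

```c
#include <assert.h>

/* Reduced model of the three fields at +576, +580, and +584. */
struct inst_scope {
    int nested_depth;       /* +576: nested instantiation counter */
    int orig_depth;         /* +580: stack depth recorded at push */
    int saved_scope_depth;  /* +584: saved copy of dword_126C5AC */
};

/* Hypothetical stand-ins for the scope stack depth and dword_126C5AC. */
struct stack_model {
    int depth;
    int saved_global;
};

/* First entry into the instantiation scope: record depths, push one entry. */
void model_push_first(struct stack_model *s, struct inst_scope *e)
{
    e->nested_depth = 0;
    e->orig_depth = s->depth;
    e->saved_scope_depth = s->saved_global;
    s->depth++;
}

/* Re-entry of the same scope: only the counter moves. */
void model_push_nested(struct inst_scope *e)
{
    e->nested_depth++;
}

/* Returns 1 when a full pop happened, 0 when only the counter was decremented. */
int model_pop(struct stack_model *s, struct inst_scope *e)
{
    assert(e->nested_depth >= 0);
    if (e->nested_depth > 0) {
        e->nested_depth--;
        return 0;
    }
    s->depth = e->orig_depth;               /* pop back to the recorded depth */
    s->saved_global = e->saved_scope_depth; /* restore the saved global */
    return 1;
}
```

With one real push followed by two nested pushes, the first two pops only decrement the counter and the third restores the recorded depth.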
Validation
pop_template_instantiation_scope_with_check (sub_708E90) validates that +576 matches the expected depth before calling the actual pop. The assertion is at scope_stk.c line 5593. A mismatch triggers an internal error.
Push Scope: push_scope_full (sub_700560)
The core scope push function (1476 lines, 13 parameters, located at 0x700560). Called directly or via thin wrappers for each scope kind.
Parameters
The 13-parameter signature handles all scope kinds through a single entry point:
- Scope kind
- Associated entity pointer (class type, namespace entity, routine descriptor, etc.)
- Region number
- Additional flags
- Parameters 5-13: kind-specific parameters (template info, reactivation data, etc.)
Key Operations
- Stack growth: Increments dword_126C5E4. If the stack exceeds its allocation, it is reallocated (the base pointer qword_126C5E8 may change).
- Entry initialization: Zeros the 784-byte entry, then sets:
  - +0 = scope number (from a global counter)
  - +4 = scope kind
  - +240 = region number
  - +192 = IL scope pointer (newly allocated via alloc_scope or reused from a reactivated entity)
  - +560 = parent scope index
- Kind-specific setup:
  - File (0): Sets dword_126C5DC, initializes file-level state.
  - Block (2): Links to enclosing function scope.
  - Namespace (3, 4, 5): Sets +232 to namespace entity. For extensions (4), reuses existing IL scope. For reactivation (5), calls add_active_using_directives_for_scope.
  - Class (6, 7): Sets +216 to class type pointer, dword_126C5C8 to current index. For reactivation (7), walks the class hierarchy to restore template context.
  - Template declaration (8): Sets template-related bits in +8.
  - Function (17): Sets dword_126C5D8, qword_126C5D0, +224.
- Parent scope linkage: Calls set_parent_scope_on_push to establish the scope tree.
- Memory region: Calls get_enclosing_memory_region to determine the memory arena for allocations within this scope.
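The entry initialization step above can be sketched as raw offset writes into a zeroed 784-byte buffer. This is a hedged illustration only: the function names are hypothetical, and the widths chosen for +0 (4 bytes) and +4 (1 byte) are assumptions, since this page gives sizes only for +192 (8), +240 (4), and +560 (4).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCOPE_ENTRY_SIZE 784

/* Zero the entry, then write the five fields listed under "Entry initialization". */
void init_scope_entry(unsigned char *entry, int32_t scope_number, uint8_t kind,
                      int32_t region, void *il_scope, int32_t parent_index)
{
    assert(entry != NULL);
    memset(entry, 0, SCOPE_ENTRY_SIZE);
    memcpy(entry + 0, &scope_number, sizeof scope_number);   /* +0: width assumed */
    entry[4] = kind;                                         /* +4: kind byte */
    memcpy(entry + 192, &il_scope, sizeof il_scope);         /* +192: IL scope ptr */
    memcpy(entry + 240, &region, sizeof region);             /* +240: region number */
    memcpy(entry + 560, &parent_index, sizeof parent_index); /* +560: parent index */
}

/* Read a 32-bit field at a byte offset without alignment assumptions. */
int32_t read_i32(const unsigned char *entry, size_t off)
{
    int32_t v;
    memcpy(&v, entry + off, sizeof v);
    return v;
}

/* Read a pointer-sized field at a byte offset. */
void *read_ptr(const unsigned char *entry, size_t off)
{
    void *p;
    memcpy(&p, entry + off, sizeof p);
    return p;
}
```

Using memcpy for every field keeps the sketch independent of struct padding, which matters when the layout is known only by offset.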
Push Wrappers
| Wrapper | Address | Parameters | Target Kind |
|---|---|---|---|
| push_scope (thin) | sub_704790 | 7 | Various |
| push_scope_with_using_dirs | sub_7047C0 | 29 | Namespace + using |
| push_template_scope | sub_704870 | 7 | Template declaration (8) |
| push_block_reactivation_scope | sub_7048A0 | 32 | Block reactivation |
| push_namespace_scope_full | sub_7024D0 | 40 | Namespace (3) |
| push_function_scope | sub_704BB0 | 13 | Function (17) |
| push_class_scope | sub_704C10 | 17 | Class (6) |
| push_scope_for_compound_statement | sub_70C8A0 | 64 | Block (2) |
| push_scope_for_condition | sub_70C950 | 86 | Condition (15) |
| push_scope_for_init_statement | sub_70CAE0 | 49 | Block (2), C++17 init |
Pop Scope: pop_scope (sub_7076A0)
The core scope pop function (1142 lines, at 0x7076A0). Complement to push_scope_full. Performs all scope cleanup in a specific order.
Cleanup Sequence
- Debug trace: If dword_126EFC8 is set, prints "pop_scope: number = %ld, depth = %d".
- Scope wrapup: Calls wrapup_scope (sub_706710, 381 lines) which:
  - Iterates all symbols in the scope
  - Runs end_of_scope_symbol_check (sub_705440, 781 lines) for consistency validation
  - Emits needed definitions
  - Reports unreferenced entities
- Using-directive deactivation: Clears active using-directives for this scope via sub_6FEC10 (debug: clear using-directive).
- Template parameter restoration: If leaving a template scope, calls restore_default_template_params (sub_6FEEE0) to undo template parameter symbol bindings.
- Name collision discriminators: Assigns ABI discriminator values to entities with the same name in this scope via assign_discriminators_to_entities_list (sub_7036E0).
- C99 inline definitions: Checks check_c99_inline_definition (sub_703AD0) for C99-mode inline function rules.
- Module/pragma state: Adjusts STDC pragma state (byte_126E558/559/55A) and module context if applicable.
- Stack decrement: Decrements dword_126C5E4. Restores previous scope's global state (function scope index, class scope index, etc.).
- Memory region disposal: Frees the memory arena associated with this scope if the scope kind has one.
Pop Variants
| Function | Address | Lines | Purpose |
|---|---|---|---|
| pop_scope (core) | sub_7076A0 | 1142 | Full scope pop with all cleanup |
| pop_scope_full | sub_70C440 | 100 | Wrapper calling core + name hiding cleanup |
| pop_scope (validation) | sub_70C620 | 62 | Pop with object lifetime validation: asserts "pop_scope: curr_object_lifetime is not that of", "pop_scope: unexpected curr_object_lifetime" |
Template Instantiation Scope
The template instantiation scope push and pop are separate from the generic scope push/pop. They handle the complex process of binding template parameters to arguments, setting up the correct translation unit context, and managing pack expansions.
Push: push_template_instantiation_scope (sub_709DE0)
The largest function in the scope_stk.c range at 1281 lines. Takes 8 parameters: template pointer, association info, and various flags.
Key operations:
- Translation unit check: Calls sub_7418D0 to verify that the template being instantiated belongs to the same translation unit, or that cross-TU instantiation is explicitly allowed (flag & 0x1000). Failure triggers the assert "push_template_instantiation_scope: wrong translation unit".
- Template parameter binding: Iterates the template parameter list and the instantiation argument list in parallel, creating bindings. For each template parameter:
  - Type parameters: binds to the supplied type argument
  - Non-type parameters: binds to the supplied expression/value
  - Template template parameters: binds to the supplied template
- Pack expansion: For variadic templates, handles parameter pack expansion. Creates pack instantiation descriptors via create_pack_instantiation_descr (sub_70CF50, 772 lines).
- Scope entry setup: Writes +576 = 0, +580 = current depth, +584 = dword_126C5AC. Sets +216 to the translation unit pointer.
- State save/restore: Saves dword_126C5B8 (is_member_of_template), dword_126C5D8 (function scope), qword_126C5D0 (routine descriptor).
- Reactivation flags: Flag bits & 0x84000 control class template reactivation behavior. When set, the function enters class reactivation mode via sub_70BB60.
Pop: pop_template_instantiation_scope (sub_708EE0)
66 lines. Reverse of the push.
- Reads +576 (depth counter). If > 0, decrements and returns early (nested instantiation shortcut).
- If bit 0 of +9 is set (needs_cleanup), calls sub_67B4E0() to clean up template instantiation artifacts.
- Pops scope entries back to orig_depth (+580) via sub_7076A0.
- Restores dword_126C5AC from +584.
- Calls sub_6FED20 (debug trace: set using-directive).
Related Functions
| Function | Address | Lines | Role |
|---|---|---|---|
| pop_template_instantiation_scope_wrapper | sub_708E70 | 7 | Thin wrapper passing through to sub_708EE0 |
| pop_template_instantiation_scope_with_check | sub_708E90 | 14 | Validates +576 depth counter before calling sub_708EE0 |
| pop_template_instantiation_scope_variant | sub_709110 | 71 | Alternative pop with extra +8 flag processing, returns int64 |
| pop_instantiation_scope_for_rescan | sub_709000 | 54 | Pop for template argument rescan case |
| push_instantiation_scope_for_rescan | sub_70B900 | 123 | Push for template parameter rescanning |
| push_instantiation_scope_for_templ_param_rescan | sub_70B7C0 | 52 | Push for template parameter rescan |
| push_instantiation_scope_for_class | sub_70BB60 | 131 | Push for class template instantiation |
| push_class_and_template_reactivation_scope_full | sub_7098B0 | 261 | Combined class + template reactivation |
Using-Directive Activation
When entering a namespace scope that has active using namespace directives, those directives must be reactivated so that names from the nominated namespace are visible. The scope stack manages this through two functions:
add_active_using_directives_for_scope (sub_6FFCC0)
246 lines. Called during scope push when entering a namespace or block that may have inherited using-directives. Walks the using-directive list for the scope and calls add_active_using_directive_to_scope for each one.
Debug trace format: "adding using-dir at depth %d for namespace %s applies at %d".
Using-Directive Debug Traces
| Function | Address | Lines | Trace |
|---|---|---|---|
| Debug: set using-directive | sub_6FED20 | 74 | "setting using-dir at depth %d for namespace %s applies at %d" |
| Debug: clear using-directive | sub_6FEC10 | 34 | "clearing using-dir at depth %d for namespace %s applies at %d" |
| Debug: using-dir set/clear | sub_704490 | 106 | "using_dir", "setting", "clearing" |
Name Collision Discriminators
When multiple local entities share the same name (e.g., two struct S in different blocks within the same function), the Itanium ABI requires a discriminator suffix in the mangled name. The scope stack manages this through:
| Function | Address | Lines | Role |
|---|---|---|---|
| get_name_collision_list + initialize_local_name_collision_table | sub_6FE760 | 64 | Manages the name collision table at qword_12C6FE0 |
| compute_local_name_collision_discriminator + distinct_lambda_signatures | sub_702FB0 | 293 | Computes ABI discriminator values for local entities; includes lambda signature discrimination logic |
| cancel_name_collision_discriminator | sub_7034C0 | 118 | Cancels a previously assigned discriminator (7 assertion sites) |
| assign_discriminators_to_entities_list | sub_7036E0 | 46 | Assigns ABI discriminators to a list of entities at scope exit |
| set_parent_entity_for_closure_types | sub_703790 | 91 | Sets parent entity for lambda closure types (needed for correct mangling, 5 assertion sites) |
| set_parent_routine_for_closure_types_in_default_args | sub_703920 | 43 | Sets parent routine for lambdas in default argument contexts |
Class and Template Reactivation
When defining an out-of-line member function (void MyClass::foo() { ... }), the parser must reactivate the class scope so that class member names are visible. For class templates, this also requires reactivating the template instantiation scope.
reactivate_class_context (sub_7029D0 / sub_70BE50)
Two implementations exist:
- sub_7029D0 (196 lines, in p1.16): Reactivates a class scope for out-of-line definition. Asserts "reactivate_class_context: class type has NULL assoc_info".
- sub_70BE50 (130 lines, in p1.17): Additional variant that handles nesting, template flags, and scope_entry +8 bit manipulation.
push_class_and_template_reactivation_scope_full (sub_7098B0)
261 lines. Handles the combined case of class template reactivation. Reads symbol flags at offsets +80, +81, +161, +162. Processes "specified template decl info" at +64 of assoc_info. Detects member templates via bit 0x10 at +81. When dword_106BC58 is set, enters class reactivation mode with sub_70BB60.
reactivate_local_context (sub_702670 / sub_70C0F0)
- sub_702670 (120 lines): Reactivates a previously saved local scope context. Calls push_scope_full.
- sub_70C0F0 (50 lines): Asserts "reactivate_local_context".
Pack Expansion Support
The scope stack provides infrastructure for variadic template parameter pack expansion during instantiation.
| Function | Address | Lines | Role |
|---|---|---|---|
| create_pack_instantiation_descr | sub_70CF50 | 772 | Creates pack instantiation descriptors; handles sizeof..., fold expressions |
| create_pack_instantiation_descr_helper | sub_70DD60 | 212 | Helper for pack descriptor creation |
| cleanup_pack_instantiation_state | sub_70E130 | 37 | Cleans up pack expansion state |
| end_potential_pack_expansion_context | sub_70E1D0 | 392 | Processes end of pack expansion; checks C++17 via dword_126EF68 > 199900; uses qword_126C598 (pack expansion context) |
| find_template_arg_for_pack + get_enclosing_template_params_and_args | sub_6FE9B0 | 140 | Traverses scope stack to find template arguments for parameter packs |
Scope Stack Query Functions
| Function | Address | Lines | Role |
|---|---|---|---|
| get_innermost_template_dependent_context | sub_6FE160 | 72 | Traverses scope stack to find innermost template-dependent scope |
| get_outermost_template_dependent_context | sub_6FFA60 | 54 | Complement to innermost variant |
| get_curr_template_params_and_args (part 1) | sub_70E7F0 | 321 | Retrieves current template parameters and arguments from scope stack |
| get_curr_template_params_and_args (full) | sub_70F540 | 1002 | Full implementation with default argument handling and pack expansion |
| is_in_template_context | sub_70EE10 | 16 | No-arg predicate, returns bool |
| is_in_template_instantiation_scope | sub_70EDA0 | 27 | 6-arg predicate, returns bool |
| current_class_symbol_if_class_template | sub_704130 | 84 | Returns class symbol if inside a class template definition |
| is_in_deprecated_context | sub_70F440 | 43 | Checks scope_entry[83] bit 4 and walks scope stack |
| get_scope_depth | sub_70C600 | 17 | Returns current scope stack depth value |
| get_template_scope_info_for_entity | sub_7106B0 | 74 | Last function in scope_stk.c range |
Debug and Statistics
Scope Statistics Dump (sub_702DC0)
95 lines. Prints scope stack statistics when debug tracing is enabled. Output format:
Scope stack statistics
Stack entry size: 784
Max. stack depth: <N>
Followed by per-scope-kind counts using all scope kind display names.
Scope Entry Dump (sub_700260 / sub_7002D0)
- sub_700260 (17 lines): Prints " scope %d" with scope kind name via scope_kind_to_string. Detects bad depth with "***BAD SCOPE DEPTH***".
- sub_7002D0 (111 lines): Detailed dump using format "%s%3ld %3d " with associated type/symbol information.
End-of-Scope Processing
wrapup_scope (sub_706710)
381 lines. Major scope cleanup function called from pop_scope. Processes all symbols in the scope, emits needed definitions, runs end_of_scope_symbol_check. Debug traces: "wrapup_scope", "Wrapping up ", " scope".
end_of_scope_symbol_check (sub_705440)
781 lines, 6 assertion sites. The largest validation function. Checks:
- Symbol-to-IL-entry parent-class consistency ("end_of_scope_symbol_check: sym/il-entry parent-class mismatch")
- Parameter-to-routine association ("end_of_scope_symbol_check: parameter with no assoc routine")
- Hash table statistics ("hash_stats", "Hash statistics for: ")
set_needed_flags_at_end_of_file_scope (sub_707040)
188 lines. Determines which entities need to be emitted at the end of the translation unit. Validates scope kind ("set_needed_flags_at_end_of_file_scope: bad scope kind"). Debug brackets: "Start of set_needed_flags_at_end_of_file_scope\n" / "End of set_needed_flags_at_end_of_file_scope\n".
finish_function_body_processing (sub_6FE2A0)
142 lines. Post-processes function bodies after the scope closes. Determines whether the function needs to be emitted ("routine_needed_even_if_unreferenced", "Not calling mark_as_needed for", "storage class is %s\n").
Cross-References
- Entity Node Layout -- entity kind enum, execution space byte at +182
- IL Overview -- IL scope kinds (sck_*), IL entry kind 23 (scope, 288 bytes)
- IL Allocation -- alloc_scope allocator for 288-byte IL scope nodes
- Template Instance Record -- template instantiation data structures
- Translation Unit Descriptor -- TU pointer stored at scope entry +216
Translation Unit Descriptor
The translation unit descriptor is the 424-byte structure at the heart of cudafe++'s multi-TU compilation support. Every source file processed by the frontend -- whether via RDC separate compilation or C++20 module import -- gets its own TU descriptor. The descriptor holds pointers to the parser state, scope stack snapshot, error context, and IL tree root for that translation unit. When the frontend switches from one TU to another, it saves the entire set of per-TU global variables into the outgoing descriptor's storage buffer and restores the incoming descriptor's saved values, making TU switching look like a cooperative context switch for compiler state.
The descriptor is allocated from the region-based arena (sub_6BA0D0), linked into a global TU chain, and managed through a TU stack that tracks the active-TU history for nested TU switches (e.g., when processing an entity requires switching to its owning TU temporarily).
Key Facts
| Property | Value |
|---|---|
| Size | 424 bytes (confirmed by print_trans_unit_statistics: "translation units ... 424 bytes each") |
| Allocation | sub_6BA0D0(424) -- region-based arena, never individually freed |
| Source file | trans_unit.c (EDG 6.6, address range 0x7A3A50-0x7A48B0, ~12 functions) |
| Allocator | sub_7A40A0 (process_translation_unit) |
| Save function | sub_7A3A50 (save_translation_unit_state) |
| Restore function | sub_7A3D60 (switch_translation_unit) |
| Fix-up function | sub_7A3CF0 (fix_up_translation_unit) |
| Statistics | sub_7A45A0 (print_trans_unit_statistics) |
| TU count global | qword_12C7A78 (incremented on each allocation) |
| Active TU global | qword_106BA10 (current_translation_unit) |
| Primary TU global | qword_106B9F0 (primary_translation_unit) |
Full Offset Map
The table below documents every field in the 424-byte TU descriptor. Offsets are byte positions from the start of the descriptor. Fields are identified from the initialization code in process_translation_unit (sub_7A40A0), the save/restore pair (sub_7A3A50/sub_7A3D60), and the fix-up function (sub_7A3CF0).
| Offset | Size | Field | Set By | Read By |
|---|---|---|---|---|
| +0 | 8 | next_tu -- linked list pointer to next TU in chain | process_translation_unit (via qword_12C7A90) | fe_wrapup TU iteration loop |
| +8 | 8 | prev_scope_state -- saved scope pointer (xmmword_126EB60+8) | save_translation_unit_state | switch_translation_unit |
| +16 | 8 | storage_buffer -- pointer to bulk registered-variable storage | process_translation_unit (allocates sub_6BA0D0(per_tu_storage_size)) | save/switch_translation_unit |
| +24 | 160 | file_scope_info -- file scope state block (20 qwords, initialized by sub_7046E0) | sub_7046E0 (zeroes 20 fields at offsets 0-152 within this block) | Scope stack operations, sub_704490 |
| +184 | 8 | (cleared to 0) -- within file scope info tail | process_translation_unit | -- |
| +192 | 8 | (cleared to 0) -- gap between scope info and registered-variable zone | process_translation_unit | -- |
| +200 | 160 | registered-variable direct fields -- zeroed bulk region (offsets +200 through +359) | memset in process_translation_unit; individual fields written by registered-variable initialization loop | save/switch_translation_unit via storage buffer |
| +208 | 8 | scope_stack_saved_1 -- saved qword_126EB70 (scope stack depth marker) | save_translation_unit_state (a1[26]) | switch_translation_unit |
| +256 | 8 | scope_stack_saved_2 -- saved qword_126EBA0 | save_translation_unit_state (a1[32]) | switch_translation_unit |
| +320 | 8 | scope_stack_saved_3 -- saved qword_126EBE0 | save_translation_unit_state (a1[40]) | switch_translation_unit |
| +352 | 8 | (cleared to 0) -- end of registered-variable zone | process_translation_unit | -- |
| +360 | 8 | (cleared to 0) -- additional state word 1 | process_translation_unit | -- |
| +368 | 8 | (cleared to 0) -- additional state word 2 | process_translation_unit | -- |
| +376 | 8 | module_info_ptr -- pointer to module info structure (parameter a3 of process_translation_unit) | process_translation_unit | Module import path, a3[2] back-link |
| +384 | 8 | il_state_ptr -- shortcut pointer for IL state (1344-byte aggregate at unk_126E600), set via registered-variable mechanism with offset_in_tu = 384 | Registered-variable init loop | IL operations |
| +392 | 2 | flags -- bit field: byte 0 = is_primary_tu (1 if a3 == NULL), byte 1 = 0x01 (initialization sentinel, combined initial value = 0x0100) | process_translation_unit | TU classification |
| +394 | 14 | (padding / reserved) | -- | -- |
| +408 | 4 | error_severity_level -- copied from dword_126EC90 (current maximum error severity) | process_translation_unit | Error reporting, recovery decisions |
| +416 | 8 | (cleared to 0) -- additional state | process_translation_unit | -- |
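One way to sanity-check the offset map is to express it as a C struct and assert the offsets. The sketch below is an assumption-based model, not a recovered declaration: field names for undocumented slots are hypothetical, pointer slots are modeled as uint64_t so the layout does not depend on the host pointer size, and the sub-fields documented at +208, +256, +320, and +352 are left inside the 160-byte bulk region rather than named individually.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct tu_descriptor {
    uint64_t next_tu;                     /* +0   */
    uint64_t prev_scope_state;            /* +8   */
    uint64_t storage_buffer;              /* +16  */
    uint64_t file_scope_info[20];         /* +24  (160 bytes) */
    uint64_t cleared_184;                 /* +184 */
    uint64_t cleared_192;                 /* +192 */
    unsigned char registered_direct[160]; /* +200 through +359 */
    uint64_t state_360;                   /* +360 */
    uint64_t state_368;                   /* +368 */
    uint64_t module_info_ptr;             /* +376 */
    uint64_t il_state_ptr;                /* +384 */
    uint8_t  is_primary_tu;               /* +392: flags byte 0 */
    uint8_t  init_sentinel;               /* +393: flags byte 1, initially 0x01 */
    unsigned char reserved_394[14];       /* +394 */
    int32_t  error_severity_level;        /* +408 */
    unsigned char pad_412[4];             /* +412 */
    uint64_t state_416;                   /* +416 */
};

/* Abort if the model disagrees with the documented offsets or total size. */
void check_tu_layout(void)
{
    assert(offsetof(struct tu_descriptor, file_scope_info) == 24);
    assert(offsetof(struct tu_descriptor, registered_direct) == 200);
    assert(offsetof(struct tu_descriptor, module_info_ptr) == 376);
    assert(offsetof(struct tu_descriptor, il_state_ptr) == 384);
    assert(offsetof(struct tu_descriptor, is_primary_tu) == 392);
    assert(offsetof(struct tu_descriptor, error_severity_level) == 408);
    assert(offsetof(struct tu_descriptor, state_416) == 416);
    assert(sizeof(struct tu_descriptor) == 424);
}
```

Splitting the 2-byte flags word into two bytes matches the little-endian reading of the 0x0100 initial value: the sentinel lands in the byte at +393.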
Layout Diagram
Translation Unit Descriptor (424 bytes)
===========================================
+0 [next_tu ] -----> next TU in chain (NULL for last)
+8 [prev_scope_state ] -----> saved scope ptr (from xmmword_126EB60+8)
+16 [storage_buffer ] -----> heap block for registered variable values
+24 [ ]
[ file_scope_info (160 bytes, 20 qwords) ]
[ initialized by sub_7046E0: all fields zeroed ]
[ scope state snapshot for this TU's file scope ]
+184 [ (tail of scope info, cleared) ]
+192 [ (gap, cleared to 0) ]
+200 [ ]
[ registered-variable direct fields (160 bytes, bulk zeroed) ]
[ includes scope stack snapshots at +208, +256, +320 ]
[ individual fields set by registered-variable init loop ]
+352 [ (cleared to 0) ]
+360 [ (additional state, cleared) ]
+368 [ (additional state, cleared) ]
+376 [module_info_ptr ] -----> module info (NULL for primary TU)
+384 [il_state_ptr ] -----> shortcut to IL state in storage buffer
+392 [flags ] 0x0100 initial; byte 0 = is_primary
+394 [ (reserved) ]
+408 [error_severity ] from dword_126EC90
+412 [ (pad) ]
+416 [ (additional, 0) ]
+424 === end ===
Initialization Sequence
The initialization in process_translation_unit proceeds in this order:
- [+0] = 0 (next_tu pointer, not yet linked)
- [+16] = sub_6BA0D0(qword_12C7A98) (allocate storage buffer, size = accumulated registered-variable total)
- [+8] = 0 (prev_scope_state)
- sub_7046E0(tu + 24) -- zero-initialize the 160-byte file scope info block
- [+192] = 0, [+352] = 0, [+184] = 0 -- explicit clears around the bulk region
- memset(aligned(tu + 200), 0, ...) -- bulk-zero the registered-variable direct fields from +200 to +360 (aligned to 8-byte boundary)
- [+360] = 0, [+368] = 0, [+376] = 0 -- clear additional state
- [+392] = 0x0100 (flags: initialized sentinel in high byte)
- [+408] = 0, [+416] = 0
- Registered-variable default-value loop: iterate qword_12C7AA8 (registered variable list) and for each entry with offset_in_tu != 0, write variable_address into tu_desc[offset_in_tu]
- [+376] = a3 (module_info_ptr)
- [+392] byte 0 = (a3 == NULL) (is_primary flag)
Lifecycle
Phase 1: Registration (Before Any TU Processing)
Before the first translation unit is processed, every EDG subsystem registers its per-TU global variables by calling f_register_trans_unit_variable (sub_7A3C00). This happens during frontend initialization, before dword_12C7A8C (registration_complete) is set to 1.
The three core variables are registered by register_builtin_trans_unit_variables (sub_7A4690):
// sub_7A4690 -- register_builtin_trans_unit_variables
f_register_trans_unit_variable(&dword_106BA08, 4, 0); // is_recompilation
f_register_trans_unit_variable(&qword_106BA00, 8, 0); // current_filename
f_register_trans_unit_variable(&dword_106B9F8, 4, 0); // has_module_info
In total, approximately 217 calls to f_register_trans_unit_variable are made across all subsystems. Each call adds a 40-byte registration record to the linked list headed by qword_12C7AA8 and accumulates the variable size into qword_12C7A98 (the per-TU storage buffer size). The accumulated size determines how large the storage buffer allocation will be for each TU descriptor.
Phase 2: Allocation and Initialization
When process_translation_unit (sub_7A40A0) is called for each source file:
process_translation_unit(filename, is_recompilation, module_info_ptr)
- If a current TU exists (qword_106BA10 != 0), save its state via save_translation_unit_state
- Reset compilation state (sub_5EAEC0 -- error state reset)
- If recompilation mode: reset parse state (sub_585EE0)
- Set dword_12C7A8C = 1 (registration complete -- no more variable registrations allowed)
- Allocate the 424-byte descriptor and its storage buffer
- Initialize all fields (see sequence above)
- Copy registered-variable defaults into the descriptor
- Link into the TU chain
Phase 3: Linking
The descriptor is linked into two structures simultaneously:
TU Chain (singly-linked list via [+0]):
- Head: qword_106B9F0 (primary_translation_unit) -- the first TU processed
- Tail: qword_12C7A90 (tu_chain_tail) -- the most recently allocated TU
- Linking: *tu_chain_tail = new_tu; tu_chain_tail = new_tu
- Used by: fe_wrapup to iterate all TUs during the 5-pass post-processing
TU Stack (singly-linked list of 16-byte stack entries):
- Top: qword_106BA18 (translation_unit_stack_top)
- Each entry: [+0] = next, [+8] = tu_descriptor_ptr
- Free list: qword_12C7AB8 (stack entries are recycled, not freed)
- Depth counter: dword_106B9E8 (counts non-primary TUs on the stack)
TU Chain:                          TU Stack:
                                   qword_106BA18
primary_tu --> tu_2 --> tu_3           |
    ^                                  v
    |                              [next|tu_3] --> [next|tu_2] --> [next|primary] --> NULL
qword_106B9F0                      each entry: 16 bytes
Phase 4: Active TU Tracking
The global qword_106BA10 always points to the currently active TU descriptor. All compiler state -- parser globals, scope stack, symbol tables, error context -- corresponds to this TU. Switching the active TU requires a full context switch through switch_translation_unit.
Phase 5: Processing (5 Passes in fe_wrapup)
After parsing completes, fe_wrapup (sub_588F90) iterates the TU chain and performs 5 passes over all TUs:
- Pass 1 (file_scope_il_wrapup): per-TU scope cleanup, cross-TU entity marking
- Pass 2 (set_needed_flags_at_end_of_file_scope): compute needed-flags for entities
- Pass 3 (mark_to_keep_in_il): mark entities to keep in the IL tree
- Pass 4 (three sub-stages): clear unneeded instantiation flags, eliminate unneeded function bodies, eliminate unneeded IL entries
- Pass 5 (file_scope_il_wrapup_part_3): final cleanup, scope assertion, re-run of passes 2-4 for the primary TU
Each pass switches to the target TU via switch_translation_unit before processing.
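The pass structure described above amounts to a pass-major loop over the TU chain with a context switch before each visit. The sketch below models only that control flow under stated assumptions: `switch_tu` is a hypothetical stand-in for switch_translation_unit that just updates the current pointer, the pass callbacks are placeholders, and the `visits` counter exists purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct tu {
    struct tu *next_tu;  /* +0: next TU in chain */
    int visits;          /* hypothetical counter, for illustration only */
};

struct tu *current_tu;   /* stands in for qword_106BA10 */

/* Stand-in for switch_translation_unit: the real one saves and restores state. */
void switch_tu(struct tu *t)
{
    current_tu = t;
}

typedef void (*tu_pass_fn)(struct tu *);

/* Run each pass over every TU in chain order, switching before each visit. */
void run_wrapup_passes(struct tu *head, const tu_pass_fn *passes, size_t n_passes)
{
    for (size_t p = 0; p < n_passes; p++) {
        for (struct tu *t = head; t != NULL; t = t->next_tu) {
            switch_tu(t);
            passes[p](t);
        }
    }
}

/* Placeholder pass body: checks that the switch happened, then counts the visit. */
void count_visit(struct tu *t)
{
    assert(t == current_tu);
    t->visits++;
}
```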
Phase 6: Pop and Cleanup
After sub_588E90 (translation_unit_wrapup) and the main compilation passes complete, the TU is popped from the stack. The inline pop code in process_translation_unit (mirroring pop_translation_unit_stack at sub_7A3F70) performs these steps:
- Assert: stack_top->tu_ptr == current_tu (stack integrity check)
- Decrement dword_106B9E8 if popped TU is not the primary TU
- Move stack entry to free list (qword_12C7AB8)
- If a previous TU remains on the stack, switch to it via switch_translation_unit
Registered Variable Mechanism
The registered variable mechanism is the save/restore system that makes TU switching possible. It works in three phases (registration, save, and restore), plus a one-time fix-up pass for the primary TU.
Registration Phase
During frontend initialization, each subsystem calls f_register_trans_unit_variable to declare global variables that contain per-TU state. Each call creates a 40-byte registration record:
Registered Variable Entry (40 bytes)
[0] 8 next linked list pointer
[8] 8 variable_address pointer to the global variable
[16] 8 variable_size number of bytes to save/restore
[24] 8 offset_in_storage byte offset within the TU storage buffer
[32] 8 offset_in_tu byte offset within the TU descriptor (0 = none)
Registration accumulates the total storage buffer size in qword_12C7A98. Each variable gets assigned a sequential offset within the buffer:
// f_register_trans_unit_variable (sub_7A3C00), simplified
void f_register_trans_unit_variable(void *var_ptr, size_t size, size_t offset_in_tu) {
assert(!registration_complete); // dword_12C7A8C must be 0
assert(var_ptr != NULL);
record = alloc(40);
record->next = NULL;
record->variable_address = var_ptr;
record->variable_size = size;
record->offset_in_storage = per_tu_storage_size; // qword_12C7A98
record->offset_in_tu = offset_in_tu;
// append to linked list
if (!list_head) list_head = record; // qword_12C7AA8
if (list_tail) list_tail->next = record;
list_tail = record; // qword_12C7AA0
// align size to 8 bytes, accumulate
size_t aligned = (size + 7) & ~7;
per_tu_storage_size += aligned; // qword_12C7A98
}
The third parameter offset_in_tu is non-zero only for variables that need a direct shortcut pointer within the TU descriptor itself. For example, the 1344-byte IL state aggregate at unk_126E600 registers with offset_in_tu = 384, so tu_desc[384] receives a pointer to the stored copy of that aggregate within the storage buffer. Most variables pass 0 (no shortcut needed).
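The offset bookkeeping described above can be modeled in a few lines of C. This is a minimal sketch rather than the decompiled routine: the `reg_table` wrapper and the `register_variable` name are hypothetical, and `calloc` stands in for the frontend's own allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of the 40-byte registration record. */
typedef struct reg_entry {
    struct reg_entry *next;
    void   *variable_address;
    size_t  variable_size;
    size_t  offset_in_storage;
    size_t  offset_in_tu;           /* 0 = no shortcut pointer */
} reg_entry;

/* Hypothetical wrapper for the globals the real code keeps separately. */
typedef struct {
    reg_entry *head, *tail;         /* model qword_12C7AA8 / qword_12C7AA0 */
    size_t     per_tu_storage_size; /* models qword_12C7A98 */
    int        registration_complete; /* models dword_12C7A8C */
} reg_table;

/* Append a record and return the storage offset it was assigned. */
static size_t register_variable(reg_table *t, void *var, size_t size,
                                size_t offset_in_tu) {
    assert(!t->registration_complete);
    assert(var != NULL);
    reg_entry *r = calloc(1, sizeof *r);
    assert(r != NULL);
    r->variable_address  = var;
    r->variable_size     = size;
    r->offset_in_storage = t->per_tu_storage_size;
    r->offset_in_tu      = offset_in_tu;
    if (!t->head) t->head = r;
    if (t->tail)  t->tail->next = r;
    t->tail = r;
    /* Round the size up to a multiple of 8 before accumulating. */
    t->per_tu_storage_size += (size + 7) & ~(size_t)7;
    return r->offset_in_storage;
}
```

Registering a 4-byte, a 13-byte, and an 8-byte variable in that order yields storage offsets 0, 8, and 24 and a total buffer size of 32 bytes, because each size is rounded up before it is added to the running total.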
Save Phase (save_translation_unit_state)
When switching away from a TU, sub_7A3A50 saves the current state:
// save_translation_unit_state (sub_7A3A50), simplified
void save_translation_unit_state(tu_desc *tu) {
char *storage = tu->storage_buffer; // tu[2]
// Iterate all registered variables
for (reg = registered_variable_list_head; reg; reg = reg->next) {
// Copy current global value into storage buffer
void *dst = storage + reg->offset_in_storage;
memcpy(dst, reg->variable_address, reg->variable_size);
// If this variable has a direct field in the TU descriptor,
// store a pointer to the saved copy there
if (reg->offset_in_tu != 0) {
*(void **)((char *)tu + reg->offset_in_tu) = dst;
}
}
// Save scope stack state (3 explicit fields)
tu->scope_saved_1 = qword_126EB70; // tu[26]
tu->scope_saved_2 = qword_126EBA0; // tu[32]
tu->scope_saved_3 = qword_126EBE0; // tu[40]
// Save file scope indices via sub_704490
if (dword_126C5E4 != -1) {
sub_704490(dword_126C5E4, 0, 0);
// Walk scope stack entries, clear scope-to-TU back-pointers
for (entry = scope_top; entry; entry -= 784) {
if (entry[+192])
*(int *)(entry[+192] + 240) = -1;
if (!entry[+4]) break;
}
}
}
Restore Phase (switch_translation_unit)
When switching to a different TU, sub_7A3D60 restores its state:
// switch_translation_unit (sub_7A3D60), simplified
void switch_translation_unit(tu_desc *target) {
assert(current_tu != NULL); // qword_106BA10
if (current_tu == target) return; // no-op if already active
save_translation_unit_state(current_tu); // save outgoing
current_tu = target; // qword_106BA10 = target
char *storage = target->storage_buffer;
// Iterate all registered variables in the same order as save;
// only the copy direction is reversed
for (reg = registered_variable_list_head; reg; reg = reg->next) {
// Copy saved value from storage buffer back to global
memcpy(reg->variable_address, storage + reg->offset_in_storage,
reg->variable_size);
// Update shortcut pointer if present. The binary stores memcpy's
// return value (its destination), so the shortcut points at the global
if (reg->offset_in_tu != 0) {
*(void **)((char *)target + reg->offset_in_tu) =
reg->variable_address;
}
}
}
// Restore scope stack state
xmmword_126EB60_high = target[1]; // prev_scope_state
qword_126EB70 = target[26];
qword_126EBA0 = target[32];
qword_126EBE0 = target[40];
// Rebuild file scope indices via sub_704490
if (dword_126C5E4 != -1) {
// Recompute scope-to-TU back-pointers
for (entry = scope_top; entry; entry -= 784) {
if (entry[+192])
*(int *)(entry[+192] + 240) = index_formula;
if (!entry[+4]) break;
}
sub_704490(dword_126C5E4, 1, computed_flag);
}
}
The key asymmetry between save and restore: memcpy direction is reversed. Save copies global -> storage_buffer. Restore copies storage_buffer -> global. The shortcut pointer (offset_in_tu) semantics also differ: during save it points into the storage buffer; during restore it points back to the global variable.
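The two copy directions can be exercised with a small stand-alone model. The sketch below is an illustration under simplifying assumptions: it uses two made-up globals and a fixed 16-byte buffer per TU, and it omits the shortcut pointers and scope-stack handling. The names `tu_model`, `save_state`, and `switch_tu` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two stand-in "per-TU" globals; the real frontend registers many more. */
static int  g_scope_depth;
static long g_symbol_count;

/* Cut-down registration record: address, size, offset in the buffer. */
typedef struct { void *addr; size_t size; size_t offset; } reg_var;

static const reg_var g_regs[] = {
    { &g_scope_depth,  sizeof g_scope_depth,  0 },
    { &g_symbol_count, sizeof g_symbol_count, 8 },
};
enum { N_REGS = sizeof g_regs / sizeof g_regs[0] };

typedef struct { unsigned char storage[16]; } tu_model;

static tu_model *g_current; /* plays the role of qword_106BA10 */

/* Save: global -> storage buffer. */
static void save_state(tu_model *tu) {
    for (int i = 0; i < N_REGS; i++)
        memcpy(tu->storage + g_regs[i].offset, g_regs[i].addr,
               g_regs[i].size);
}

/* Switch: save the outgoing TU, then storage buffer -> global. */
static void switch_tu(tu_model *target) {
    assert(g_current != NULL);
    if (g_current == target) return;      /* already active */
    save_state(g_current);
    g_current = target;
    for (int i = 0; i < N_REGS; i++)
        memcpy(g_regs[i].addr, target->storage + g_regs[i].offset,
               g_regs[i].size);
}
```

After switching away from a TU and back again, the globals hold exactly the values they had when that TU was last active, which is the property the parser relies on.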
Fix-Up Phase
After the primary TU's registered-variable defaults are first copied into its descriptor, fix_up_translation_unit (sub_7A3CF0) performs a one-time pass that writes variable-address pointers into the TU descriptor's shortcut fields:
// fix_up_translation_unit (sub_7A3CF0)
void fix_up_translation_unit(tu_desc *primary) {
assert(primary->next_tu == NULL); // must be the first TU
for (reg = registered_variable_list_head; reg; reg = reg->next) {
if (reg->offset_in_tu != 0) {
*(void **)((char *)primary + reg->offset_in_tu) =
reg->variable_address;
}
}
}
This ensures the primary TU's shortcut pointers point directly to the live global variables rather than the storage buffer, since the primary TU's globals are the "real" values (not copies).
TU Stack Operations
The TU stack supports nested TU switches. This is needed when processing an entity declared in a different TU requires temporarily switching to that TU's context.
Push (push_translation_unit_stack)
// sub_7A3EF0 -- push_translation_unit_stack
void push_translation_unit_stack(tu_desc *tu) {
// Allocate stack entry from free list or fresh allocation
stack_entry *entry;
if (stack_entry_free_list) { // qword_12C7AB8
entry = stack_entry_free_list;
stack_entry_free_list = entry->next;
} else {
entry = alloc(16); // sub_6B7340(16)
++stack_entry_count; // qword_12C7A80
}
entry->tu_ptr = tu; // [+8]
entry->next = tu_stack_top; // [+0] = qword_106BA18
// If pushing a different TU than current, switch to it
if (current_tu != tu)
switch_translation_unit(tu);
// Track depth for non-primary TUs
if (tu != primary_tu)
++tu_stack_depth; // dword_106B9E8
tu_stack_top = entry; // qword_106BA18
}
Pop (pop_translation_unit_stack)
// sub_7A3F70 -- pop_translation_unit_stack
void pop_translation_unit_stack() {
stack_entry *top = tu_stack_top; // qword_106BA18
// Integrity assertion: top-of-stack TU must match current TU
assert(top->tu_ptr == current_tu); // top[+8] == qword_106BA10
if (top->tu_ptr != primary_tu)
--tu_stack_depth; // dword_106B9E8
// Pop: move top to free list, advance stack
stack_entry *prev = top->next;
top->next = stack_entry_free_list; // return to free list
stack_entry_free_list = top;
tu_stack_top = prev; // qword_106BA18
// If another TU remains, switch to it
if (prev)
switch_translation_unit(prev->tu_ptr);
}
Push Entity's TU (push_entity_translation_unit)
A convenience function sub_7A3FE0 pushes the TU that owns a given entity:
// sub_7A3FE0 -- push_entity_translation_unit
int push_entity_translation_unit(entity *ent) {
if (ent->flags_81 & 0x20) return 0; // anonymous entity, no TU
tu_desc *owning_tu = get_entity_tu(ent); // sub_741960
if (owning_tu == current_tu) return 0; // already in correct TU
push_translation_unit_stack(owning_tu);
return 1; // caller must pop when done
}
TU Stack Entry Layout
TU Stack Entry (16 bytes)
[0] 8 next next entry in stack (toward bottom) or free list
[8] 8 tu_desc_ptr pointer to the TU descriptor
Stack entries are recycled through a free list (qword_12C7AB8). They are allocated by sub_6B7340 (general storage, not arena) on first use and never deallocated -- only returned to the free list on pop.
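The recycling behavior can be sketched in isolation. The model below is a simplified illustration, not the decompiled code: it leaves out the context switch and the depth counter, uses `malloc` in place of sub_6B7340, and the `tu_stack`, `stack_push`, and `stack_pop` names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct stack_entry {
    struct stack_entry *next;   /* [+0] toward bottom, or next free entry */
    void               *tu_ptr; /* [+8] TU descriptor */
} stack_entry;

typedef struct {
    stack_entry *top;               /* models qword_106BA18 */
    stack_entry *free_list;         /* models qword_12C7AB8 */
    long         entries_allocated; /* models qword_12C7A80 */
} tu_stack;

static void stack_push(tu_stack *s, void *tu) {
    stack_entry *e = s->free_list;
    if (e) {
        s->free_list = e->next;     /* reuse a recycled entry */
    } else {
        e = malloc(sizeof *e);      /* first use: fresh allocation */
        assert(e != NULL);
        s->entries_allocated++;
    }
    e->tu_ptr = tu;
    e->next = s->top;
    s->top = e;
}

/* Returns the TU on top after the pop, or NULL if the stack is empty. */
static void *stack_pop(tu_stack *s) {
    stack_entry *e = s->top;
    assert(e != NULL);
    s->top = e->next;
    e->next = s->free_list;         /* return entry to the free list */
    s->free_list = e;
    return s->top ? s->top->tu_ptr : NULL;
}
```

Because popped entries go back on the free list, the allocation counter only grows when the stack reaches a new maximum depth; a push that follows a pop reuses the recycled entry.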
TU Correspondence (24 bytes)
When processing multiple TUs in RDC mode, the frontend must track structural equivalence between types and declarations across TUs. Each correspondence is a 24-byte node:
Trans Unit Correspondence (24 bytes)
[0] 8 next linked list pointer
[8] 8 ptr pointer to the corresponding entity/type
[16] 4 refcount reference count (freed when decremented to 1)
[20] 1 flag correspondence type flag
Allocation uses a free list (qword_12C7AB0) with fallback to arena allocation (sub_6BA0D0(24)). The reference-counted deallocation in free_trans_unit_corresp (sub_7A3BB0) asserts that refcount > 0 before decrementing, and only pushes the node onto the free list when the count reaches 1 (not 0 -- the last reference is the free-list entry itself).
Global Variables
TU State
| Global | Type | Identity |
|---|---|---|
| qword_106BA10 | tu_desc* | current_translation_unit -- always points to the active TU |
| qword_106B9F0 | tu_desc* | primary_translation_unit -- the first TU processed (head of chain) |
| qword_12C7A90 | tu_desc* | tu_chain_tail -- last TU in the linked list |
| qword_106BA18 | stack_entry* | translation_unit_stack_top -- top of the TU stack |
| dword_106B9E8 | int32 | tu_stack_depth -- number of non-primary TUs on the stack |
| qword_106BA00 | char* | current_filename -- source file name for the active TU |
| dword_106BA08 | int32 | is_recompilation -- 1 if this TU is being recompiled |
| dword_106B9F8 | int32 | has_module_info -- 1 if the active TU has module info |
Registration Infrastructure
| Global | Type | Identity |
|---|---|---|
| qword_12C7AA8 | reg_entry* | registered_variable_list_head |
| qword_12C7AA0 | reg_entry* | registered_variable_list_tail |
| qword_12C7A98 | size_t | per_tu_storage_size -- accumulated total size of all registered variables (determines storage buffer allocation) |
| dword_12C7A8C | int32 | registration_complete -- set to 1 when first TU is allocated; guards against late registration |
| dword_12C7A88 | int32 | has_seen_module_tu -- set to 1 when a TU with module info is processed |
Free Lists and Counters
| Global | Type | Identity |
|---|---|---|
| qword_12C7AB8 | stack_entry* | stack_entry_free_list |
| qword_12C7AB0 | corresp* | corresp_free_list |
| qword_12C7A78 | int64 | tu_count -- total TU descriptors allocated |
| qword_12C7A80 | int64 | stack_entry_count -- total stack entries allocated (not freed) |
| qword_12C7A68 | int64 | registration_count -- total registered variable entries |
| qword_12C7A70 | int64 | corresp_count -- total correspondence nodes allocated |
Correspondence State
| Global | Type | Identity |
|---|---|---|
| dword_106B9E4 | int32 | Correspondence state variable 1 (per-TU, registered) |
| dword_106B9E0 | int32 | Correspondence state variable 2 (per-TU, registered) |
| qword_12C7798 | int64 | Correspondence state variable 3 (per-TU, registered) |
| qword_12C7800 | [14] | Correspondence hash table 1 (0x70 bytes) |
| qword_12C7880 | [14] | Correspondence hash table 2 (0x70 bytes) |
| qword_12C7900 | [14] | Correspondence hash table 3 (0x70 bytes) |
Reset Functions
Two reset functions exist for different scopes:
reset_translation_unit_state (sub_7A4860) -- zeroes the 6 core runtime globals. Called during error recovery or frontend teardown:
qword_106BA10 = 0; // current_tu
qword_106B9F0 = 0; // primary_tu
qword_12C7A90 = 0; // tu_chain_tail
dword_106B9F8 = 0; // has_module_info
qword_106BA18 = 0; // tu_stack_top
dword_106B9E8 = 0; // tu_stack_depth
init_translation_unit_tracking (sub_7A48B0) -- zeroes all 13 tracking globals. Called during frontend initialization before any registrations:
qword_12C7AA8 = 0; // registered_variable_list_head
qword_12C7AA0 = 0; // registered_variable_list_tail
qword_12C7A98 = 0; // per_tu_storage_size
dword_106BA08 = 0; // is_recompilation
qword_106BA00 = 0; // current_filename
qword_12C7AB8 = 0; // stack_entry_free_list
qword_12C7AB0 = 0; // corresp_free_list
qword_12C7A80 = 0; // stack_entry_count
qword_12C7A78 = 0; // tu_count
qword_12C7A68 = 0; // registration_count
qword_12C7A70 = 0; // corresp_count
dword_12C7A8C = 0; // registration_complete
dword_12C7A88 = 0; // has_seen_module_tu
Memory Statistics
The print_trans_unit_statistics function (sub_7A45A0) reports the allocation counts and total memory for the four structure types managed by the TU system:
| Structure | Size | Counter | Storage |
|---|---|---|---|
| Trans unit correspondence | 24 bytes | qword_12C7A70 | Arena |
| Translation unit descriptor | 424 bytes | qword_12C7A78 | Arena (gen. storage) |
| TU stack entry | 16 bytes | qword_12C7A80 | General storage |
| Variable registration | 40 bytes | qword_12C7A68 | General storage |
Function Map
| Address | Identity | Confidence | Role |
|---|---|---|---|
| sub_7A3A50 | save_translation_unit_state | HIGH | Save all per-TU globals to storage buffer |
| sub_7A3B50 | alloc_trans_unit_corresp | HIGH | Allocate 24-byte correspondence node |
| sub_7A3BB0 | free_trans_unit_corresp | HIGH | Reference-counted deallocation |
| sub_7A3C00 | f_register_trans_unit_variable | DEFINITE | Register a global for per-TU save/restore |
| sub_7A3CF0 | fix_up_translation_unit | DEFINITE | One-time shortcut pointer fix-up for primary TU |
| sub_7A3D60 | switch_translation_unit | DEFINITE | Context-switch to a different TU |
| sub_7A3EF0 | push_translation_unit_stack | HIGH | Push TU onto the stack |
| sub_7A3F70 | pop_translation_unit_stack | DEFINITE | Pop TU from the stack |
| sub_7A3FE0 | push_entity_translation_unit | MEDIUM-HIGH | Push the TU that owns a given entity |
| sub_7A40A0 | process_translation_unit | DEFINITE | Main entry: allocate, init, parse, cleanup |
| sub_7A45A0 | print_trans_unit_statistics | HIGH | Memory usage report for TU subsystem |
| sub_7A4690 | register_builtin_trans_unit_variables | HIGH | Register the 3 core per-TU globals |
| sub_7A4860 | reset_translation_unit_state | DEFINITE | Zero 6 runtime globals |
| sub_7A48B0 | init_translation_unit_tracking | DEFINITE | Zero all 13 tracking globals |
Assertions
The TU system contains 8 assertion sites (calls to sub_4F2930 with source path trans_unit.c):
| Line | Function | Condition | Meaning |
|---|---|---|---|
| 163 | free_trans_unit_corresp | refcount > 0 | Correspondence double-free |
| 227 | f_register_trans_unit_variable | !registration_complete | Variable registered after first TU allocated |
| 230 | f_register_trans_unit_variable | var_ptr != NULL | NULL pointer passed for registration |
| 469 | fix_up_translation_unit | primary_tu->next == NULL | Fix-up called on non-primary TU |
| 514 | switch_translation_unit | current_tu != NULL | Switch attempted with no active TU |
| 556 | pop_translation_unit_stack | stack_top->tu == current_tu | Stack/active TU mismatch |
| 696 | process_translation_unit | !(a3==NULL && has_seen_module) | Non-module TU after module TU |
| 725 | process_translation_unit | is_recompilation (when primary and first TU) | Primary TU must be a recompilation |
Cross-References
- RDC Mode -- multi-TU compilation: correspondence system, cross-TU IL copying, module ID generation
- Frontend Wrapup -- the 5-pass post-processing architecture that iterates the TU chain
- Scope Entry -- 784-byte scope stack entries saved/restored during TU switches
- Entity Node -- entities carry a back-pointer to their owning TU (extracted via sub_741960)
- IL Overview -- the IL tree rooted in each TU's file scope
- Pipeline Overview -- where process_translation_unit sits in the full pipeline
Type Node
The type node is the fundamental type representation in EDG's intermediate language (IL). Every C++ type -- from int to const volatile std::vector<std::pair<int, float>>*& -- is represented as a 176-byte type node allocated by alloc_type (sub_5E3D40 in il_alloc.c). Type nodes form the backbone of the type system: every variable, field, routine, expression, and parameter carries a pointer to its type node. The 128 type-query leaf functions in types.c alone are called from approximately 4,448 sites across the binary.
The type node is a discriminated union. The type_kind byte at offset +132 selects one of 22 type kinds (0-21), and certain type kinds trigger allocation of a separate supplementary structure (class_type_supplement, routine_type_supplement, etc.) that hangs off the type node at offset +152. The base 176 bytes contain the common header shared with all IL entries (96 bytes), the type discriminator, qualifier flags, size/alignment, and type-kind-specific inline payload fields.
Key Facts
| Property | Value |
|---|---|
| Allocation size | 176 bytes (IL entry size) |
| Allocator | sub_5E3D40 (alloc_type), il_alloc.c |
| Kind setter | sub_5E2E80 (set_type_kind), 22 cases |
| In-place reinit | sub_5E3590 (init_type_fields), no allocation |
| Counter global | qword_126F8E0 |
| Stats label | "type" (176 bytes each) |
| Region | file-scope only (always dword_126EC90) |
| Type query functions | 128 leaf functions in types.c (0x7A4940-0x7C02A0) |
| Most-called query | is_class_or_struct_or_union_type (sub_7A8A30, 407 call sites) |
| Source files | il_alloc.c (allocation), types.c (queries/construction) |
Memory Layout
Raw Allocation vs Returned Pointer
Like all IL entries, the raw allocation includes a 16-byte prefix that is hidden from the returned pointer. The allocator returns raw + 16, so all field offsets documented below are relative to this returned pointer.
Raw allocation (192 bytes total):
raw+0 [8 bytes] TU copy address (zeroed, ptr[-2])
raw+8 [8 bytes] next-in-list link (zeroed, ptr[-1])
raw+16 [176 bytes] type node body (ptr+0 onward)
Prefix flags byte at ptr[-8] (a byte offset; the ptr[-1] and ptr[-2] indices above count 8-byte words):
bit 0 (0x01) allocated
bit 1 (0x02) file_scope
bit 3 (0x08) language_flag (C++ mode)
bit 7 (0x80) keep_in_il (CUDA device marking)
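The hidden-prefix convention can be illustrated with a short allocator model. This is a sketch under stated assumptions: `calloc` replaces the region allocator, the function names are hypothetical, and only the pointer arithmetic (return `raw + 16`, index the prefix with negative qword indices) is taken from the layout above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { IL_PREFIX_SIZE = 16, TYPE_NODE_SIZE = 176 };

/* Allocate prefix + body, zero both, and return a pointer to the body. */
static void *alloc_il_entry_model(size_t body_size) {
    unsigned char *raw = calloc(1, IL_PREFIX_SIZE + body_size);
    assert(raw != NULL);
    return raw + IL_PREFIX_SIZE;
}

/* Recover the raw allocation from the body pointer. */
static void *il_entry_raw(void *body) {
    return (unsigned char *)body - IL_PREFIX_SIZE;
}

/* Read one prefix qword using the ptr[-1] / ptr[-2] notation. */
static uint64_t il_prefix_qword(void *body, int index) {
    assert(index == -1 || index == -2);
    return ((uint64_t *)body)[index];
}
```

With a 176-byte body the raw block is 192 bytes, and both prefix qwords read as zero immediately after allocation. Callers that hold only the body pointer must go through `il_entry_raw` before releasing the block.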
Complete Field Map
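The 176 bytes of the type node body divide into three regions: the common IL header (bytes 0-95), the type discriminator and qualifier zone (bytes 96-135), and the type-kind-specific payload (bytes 136-175). Several payload entries in the map below overlap (for example +144/+145, +152/+153, and +160/+161) because each type kind interprets the same bytes differently.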
Offset Size Field Description
------ ---- ----- -----------
+0 96 common_il_header Shared with all IL entry types (see below)
+96 24 (continuation of Source position, declaration metadata,
common header area) scope/name linkage -- varies by IL kind
+120 8 type_size Computed size of this type in bytes
+128 4 alignment Required alignment (bytes)
+132 1 type_kind Type discriminator (0-21, see table below)
+133 1 type_flags_1 bit 5 = is_dependent
+134 1 type_qual_flags bit 0 = const, bit 1 = volatile
(cleared by sub_5E3580: *(a1+134) &= 0xFC)
+135 1 (padding/reserved)
+136 8 (reserved/varies) Kind-dependent inline storage
+144 8 referenced_type For tk_pointer/tk_reference/tk_typedef:
-> pointed-to/referenced/aliased type
For tk_pointer_to_member: -> class type
For tk_function: return type enum (2=void)
For tk_integer (enum): underlying type ptr
+145 1 enum_flags For tk_integer:
bit 3 = scoped_enum
bit 4 = is_bit_int_capable
+146 1 extended_int_flags For tk_integer:
bit 2 = is_BitInt
+147 1 (padding)
+148 1 (varies by kind) For tk_class/struct/union: access default
set_type_kind initializes to 1
+149 1 kind_init_byte Flags initialized during set_type_kind
+150 2 (cleared by init) Zeroed by init_type_fields_and_set_kind
+152 8 supplement_ptr For tk_class/struct/union: -> class_type_supplement
For tk_routine: -> routine_type_supplement
For tk_integer: -> integer_type_supplement
For tk_typeref: -> typeref_type_supplement
For tk_template_param: -> templ_param_supplement
For tk_pointer: member_pointer_flag
bit 0 = is_member_pointer
bit 1 = extended_member_ptr
For tk_array: bound expression pointer
+153 1 array_flags For tk_array:
bit 0 = dependent_bound
bit 1 = is_VLA
bit 5 = star_modifier
+154 6 (varies)
+160 8 typedef_attr_kind For tk_typeref: attribute kind value
For tk_array: numeric bound value
+161 1 class_flags_1 For tk_class/struct/union:
bit 0 = is_local_class
bit 2 = no_name_linkage
bit 4 = is_template_class
bit 5 = is_anonymous
bit 7 = has_nested_types
+162 1 typedef_flags For tk_typeref:
bit 0 = is_elaborated
bit 6 = is_attributed
bit 7 = has_addr_space
+163 1 class_flags_2 For tk_class/struct/union:
bit 0 = extern_template_inst
bit 3 = alignment_set
bit 4 = is_scoped (for union)
+164 2 feature_flags Target feature requirements
(copied to byte_12C7AFC by
record_type_features_used)
+166 2 (reserved)
+168 4 alignment_attr Explicit alignment / packed attribute
+172 4 (tail padding)
Common IL Header (Bytes 0-95)
The first 96 bytes are copied verbatim from the template globals (xmmword_126F6A0 through xmmword_126F6F0) during allocation. This template captures the current source file, line, and column position, and is refreshed as the parser advances. The header contains:
xmmword_126F6A0 [+0..+15] scope/class pointer, name pointer (zeroed)
xmmword_126F6B0 [+16..+31] declaration metadata (high qword zeroed)
xmmword_126F6C0 [+32..+47] reserved (zeroed)
xmmword_126F6D0 [+48..+63] reserved (zeroed)
xmmword_126F6E0 [+64..+79] source position (from qword_126EFB8)
xmmword_126F6F0 [+80..+95] low word = 4 (access default), high zeroed
qword_126F700 [+96..+103] current source file reference
The source position at bytes +64..+79 allows error messages and diagnostics to reference the exact declaration point for each type.
Type Kind Enumeration
The type kind byte at offset +132 holds one of 22 values. The set_type_kind function (sub_5E2E80, 279 lines, il_alloc.c:2334) dispatches on this value to initialize type-kind-specific fields and allocate supplement structures where needed.
| Value | Name | C++ Constructs | Supplement | Payload |
|---|---|---|---|---|
| 0 | tk_none | Placeholder / uninitialized | None | no-op |
| 1 | tk_void | void | None | no-op |
| 2 | tk_integer | bool, char, short, int, long, long long, __int128, _BitInt(N), all unsigned variants, wchar_t, char8_t, char16_t, char32_t, enumerations | 32-byte integer_type_supplement | +144=5 (default) |
| 3 | tk_float | float, _Float16, __bf16 | None | format byte = 2 |
| 4 | tk_double | double | None | format byte = 2 |
| 5 | tk_long_double | long double, __float128 | None | format byte = 2 |
| 6 | tk_pointer | T*, member pointers (bit 0 of +152) | None | 2 payload fields zeroed |
| 7 | tk_routine | Function type int(int, float) | 64-byte routine_type_supplement | calling convention, params init |
| 8 | tk_array | T[N], T[], VLAs | None | size+flags zeroed |
| 9 | tk_struct | struct S | 208-byte class_type_supplement | kind stored at supplement+100 |
| 10 | tk_class | class C | 208-byte class_type_supplement | kind stored at supplement+100 |
| 11 | tk_union | union U | 208-byte class_type_supplement | kind stored at supplement+100 |
| 12 | tk_typeref | typedef, using alias, elaborated type specifiers | 56-byte typeref_type_supplement | -- |
| 13 | tk_typeof | typeof(expr), __typeof__ | None | zeroed |
| 14 | tk_template_param | typename T, template type/non-type/template parameters | 40-byte templ_param_supplement | -- |
| 15 | tk_decltype | decltype(expr) | None | zeroed |
| 16 | tk_pack_expansion | T... (parameter pack expansion) | None | zeroed |
| 17 | tk_pack_expansion_alt | Alternate pack expansion form | None | no-op |
| 18 | tk_auto | auto, decltype(auto) | None | no-op |
| 19 | tk_rvalue_reference | T&& (rvalue reference) | None | no-op |
| 20 | tk_nullptr_t | std::nullptr_t | None | no-op |
| 21 | tk_reserved_21 | Reserved / unused | None | no-op |
Reconciling set_type_kind with types.c query functions: There is an apparent conflict between the set_type_kind dispatch (where case 7 allocates a routine supplement, case 0xD/13 is typeof, case 0xE/14 is template_param) and the types.c query function catalog (where is_reference_type tests kind==7, is_pointer_to_member_type tests kind==13, is_function_type tests kind==14). The set_type_kind switch is the authoritative source for allocation behavior -- it is a 279-line DEFINITE-confidence function with the embedded error string "set_type_kind: bad type kind". The types.c catalog was reconstructed from runtime query patterns and may reflect a different numbering or the fact that type kind values are reassigned after initial allocation. The table above follows the set_type_kind dispatch numbering. The types.c query mappings are documented in the query function catalog below for cross-reference.
set_type_kind Dispatch Summary
switch (type_kind) {
case 0, 1, 17..21: // tk_none, tk_void, alt-pack, auto, rvalue_ref, nullptr, reserved
break; // no-op: simple types with no extra state
case 2: // tk_integer
type->+144 = 5; // default integer subkind
supplement = alloc_in_file_scope_region(32); // integer_type_supplement
++qword_126F8E8;
supplement->+16 = source_position;
type->+152 = supplement;
break;
case 3, 4, 5: // tk_float, tk_double, tk_long_double
type->format_byte = 2; // IEEE format indicator
break;
case 6: // tk_pointer
type->+144 = 0; // pointed-to type (to be set later)
type->+152 = 0; // member-pointer flags (cleared)
break;
case 7: // tk_routine
supplement = alloc_in_file_scope_region(64); // routine_type_supplement
++qword_126F958;
init_bitfield_struct(supplement+32); // calling convention defaults
type->+152 = supplement;
break;
case 8: // tk_array
type->+120 = 0; // array total size (unknown)
type->+152 = 0; // bound expression (none)
type->+153 &= mask; // clear array flags
type->+160 = 0; // numeric bound (none)
break;
case 9, 10, 11: // tk_struct, tk_class, tk_union
supplement = alloc_in_file_scope_region(208); // class_type_supplement
++qword_126F948;
init_class_type_supplement_fields(supplement);
supplement->+100 = type_kind; // remember which class flavor
type->+152 = supplement;
break;
case 12 (0xC): // tk_typeref (typedef / using alias)
supplement = alloc_in_file_scope_region(56); // typeref_type_supplement
++qword_126F8F0;
type->+152 = supplement;
break;
case 13 (0xD): // tk_typeof
type->+144 = 0;
type->+152 = 0;
break;
case 14 (0xE): // tk_template_param
supplement = alloc_in_file_scope_region(40); // templ_param_supplement
++qword_126F8F8;
type->+152 = supplement;
break;
case 15 (0xF): // tk_decltype
type->+144 = 0;
type->+152 = 0;
break;
case 16 (0x10): // tk_pack_expansion
type->+144 = 0;
type->+152 = 0;
break;
default:
internal_error("set_type_kind: bad type kind");
}
type->+132 = type_kind;
Supplement Structures
Seven type kind values (2, 7, 9, 10, 11, 12, and 14) trigger allocation of one of five supplementary structures. The supplement pointer lives at type node offset +152 and points to a separately-allocated block in the file-scope region.
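The kind-to-supplement mapping from the dispatch summary can be condensed into a lookup function. The following is a hedged sketch: `supplement_size_for_kind` is a hypothetical helper that only restates the sizes listed in the table, and it does not model the allocation counters or field initialization.

```c
#include <assert.h>

/* Supplement size in bytes for a set_type_kind case.
 * Returns 0 for kinds without a supplement and -1 for values
 * outside 0..21, which the real function rejects with an
 * internal error. */
static int supplement_size_for_kind(int kind) {
    switch (kind) {
    case 2:  return 32;   /* integer_type_supplement */
    case 7:  return 64;   /* routine_type_supplement */
    case 9:               /* tk_struct */
    case 10:              /* tk_class  */
    case 11: return 208;  /* class_type_supplement (tk_union too) */
    case 12: return 56;   /* typeref_type_supplement */
    case 14: return 40;   /* templ_param_supplement */
    default: return (kind >= 0 && kind <= 21) ? 0 : -1;
    }
}
```

Kinds 9, 10, and 11 share the 208-byte class supplement, which is why the supplement itself records the class flavor at +100.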
class_type_supplement (208 bytes)
Allocated for tk_struct (kind 9), tk_class (kind 10), and tk_union (kind 11). This is the richest supplement, carrying the full class definition metadata. Initialized by init_class_type_supplement_fields (sub_5E2D70, 40 lines) and init_class_type_supplement (sub_5E2C70, 42 lines).
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | scope_ptr | Pointer to the class scope (288-byte scope node) |
| +8 | 8 | base_class_list | Head of linked list of base class entries (112 bytes each) |
| +16 | 8 | friend_decl_list | Head of friend declaration list |
| +24 | 8 | member_list_head | Member entity list (routines, variables, nested types) |
| +32 | 8 | nested_type_list | Nested type definitions |
| +40 | 4 | default_access | 1 = public (struct/union), 2 = private (class) |
| +44 | 4 | (reserved) | |
| +48 | 8 | using_decl_list | Using declarations in class scope |
| +56 | 8 | static_data_members | Static data member list |
| +64 | 8 | template_info | Template instantiation info (if template class) |
| +72 | 8 | virtual_base_list | Virtual base class chain |
| +80 | 4 | (flags) | |
| +84 | 2 | (reserved) | |
| +86 | 1 | class_property_flags | bit 0 = has_virtual_bases, bit 3 = has_user_conversion |
| +88 | 1 | extended_flags | bit 5 = has_flexible_array |
| +96 | 8 | vtable_ptr | Virtual function table pointer |
| +100 | 4 | class_kind | Copy of type_kind (9, 10, or 11) |
| +104 | 8 | destructor_ptr | Pointer to destructor entity |
| +112 | 8 | copy_ctor_ptr | Copy constructor entity |
| +120 | 8 | move_ctor_ptr | Move constructor entity |
| +128 | 8 | (scope chain) | |
| +136 | 8 | conversion_functions | User-defined conversion operator list |
| +144 | 8 | befriending_classes | List of classes that befriend this class |
| +152 | 8 | deduction_guides | Deduction guide list (C++17) |
| +160 | 8 | (reserved) | |
| +168 | 8 | (reserved) | |
| +176 | 4 | vtable_index | Virtual function table index, initialized to -1 (0xFFFFFFFF) |
| +180 | 4 | (padding) | |
| +184 | 8 | (reserved) | |
| +192 | 8 | (reserved) | |
| +200 | 8 | (reserved) | |
Counter: qword_126F948, stats label "class type supplement".
routine_type_supplement (64 bytes)
Allocated for tk_routine (kind 7) by set_type_kind. Encodes the function signature metadata.
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | param_type_list | Head of parameter type linked list (80 bytes each) |
| +8 | 8 | return_type | Return type pointer |
| +16 | 8 | exception_spec | Exception specification pointer (16 bytes) |
| +24 | 8 | (reserved) | |
| +32 | 4 | calling_convention | Calling convention bitfield (initialized by set_type_kind) |
| +36 | 4 | param_count | Number of parameters |
| +40 | 4 | flags | Variadic, noexcept, trailing-return, etc. |
| +44 | 4 | (reserved) | |
| +48 | 8 | (reserved) | |
| +56 | 8 | (reserved) | |
Counter: qword_126F958, stats label "routine type supplement".
Each parameter in the param_type_list is an 80-byte param_type node (allocated by alloc_param_type, sub_5E1D40, free-list recycled from qword_126F678). Parameter types form a singly-linked list through their +0 field.
integer_type_supplement (32 bytes)
Allocated for tk_integer (kind 2). Represents the properties of integral and enumeration types.
| Offset | Size | Field | Description |
|---|---|---|---|
+0 | 4 | integer_subkind | Subkind identifier (values 1-12, default 5) |
+4 | 4 | bit_width | Width in bits (for _BitInt(N)) |
+8 | 4 | signedness | 0=unsigned, 1=signed (lookup via byte_E6D1B0) |
+12 | 4 | (reserved) | |
+16 | 8 | source_position | Source position at allocation time |
+24 | 8 | underlying_type | For enums: pointer to the underlying integer type |
Counter: qword_126F8E8, stats label "integer type supplement".
The integer_subkind field distinguishes between the various integer types. Known subkind values from type query analysis:
| Subkind | Type |
|---|---|
| 1-10 | Standard integer types (bool through long long) |
| 11 | _Float16 / extended |
| 12 | __int128 / extended |
typeref_type_supplement (56 bytes)
Allocated for tk_typeref (kind 12 = 0xC in set_type_kind). Links the typedef/using-alias to its referenced declaration and tracks elaborated type specifier properties.
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | referenced_decl | The declaration this typedef names |
| +8 | 8 | original_type | The original type before typedef expansion |
| +16 | 8 | scope_ptr | Scope in which the typedef was declared |
| +24 | 8 | (reserved) | |
| +32 | 8 | attribute_info | Attribute specifier chain |
| +40 | 8 | template_info | Template argument list (for alias templates) |
| +48 | 8 | (reserved) | |
Counter: qword_126F8F0, stats label "typeref type supplement".
The elaborated type specifier kind is encoded in type_node+162:
- bit 0: is_elaborated (uses struct/class/union/enum keyword)
- bit 6: is_attributed (carries [[...]] attributes)
- bit 7: has_addr_space (CUDA address space attribute)
The constant 0x18C2 (= bits {1,6,7,11,12}) is used as a bitmask in is_incomplete_type_deep (sub_7A6580) to identify the set of elaborated type specifier kinds.
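The bit decomposition of the constant is easy to check mechanically. The sketch below assumes the mask is applied in the usual way for a small-integer set, by testing `1 << value` against it; that usage, and the `in_mask_18c2` helper name, are assumptions for illustration and are not confirmed by the decompilation.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (UINT32_C(1) << (n))

/* The constant observed in is_incomplete_type_deep (sub_7A6580). */
static const uint32_t KIND_MASK_18C2 = 0x18C2;

/* Test whether a small integer value is a member of the mask's bit set.
 * Values of 32 or more are rejected to keep the shift well defined. */
static int in_mask_18c2(unsigned value) {
    return value < 32 && (KIND_MASK_18C2 & BIT(value)) != 0;
}
```

OR-ing bits 1, 6, 7, 11, and 12 reproduces 0x18C2 exactly, so membership tests succeed for those five values and fail for every other value.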
templ_param_supplement (40 bytes)
Allocated for tk_template_param (kind 14 = 0xE in set_type_kind). Represents a template type parameter (typename T), non-type template parameter, or template template parameter.
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 4 | param_index | Zero-based index in the template parameter list |
| +4 | 4 | param_depth | Nesting depth (0 for outermost template) |
| +8 | 4 | param_kind | 0=type, 1=non-type, 2=template-template |
| +12 | 4 | (reserved) | |
| +16 | 8 | constraint | Associated constraint expression (C++20 concepts) |
| +24 | 8 | default_arg | Default template argument (type or expression) |
| +32 | 8 | (reserved) | |
Counter: qword_126F8F8, stats label "templ param supplement".
Type Qualifier Encoding
CV-qualifiers are not stored as separate type nodes (unlike some compiler designs). Instead, they are encoded as bit flags within the type node itself. The primary qualifier storage is at offset +134:
Byte at type+134 (type_qual_flags):
bit 0 (0x01) const
bit 1 (0x02) volatile
The function clear_type_qualifier_bits (sub_5E3580) performs *(a1+134) &= 0xFC to strip both const and volatile.
Additional qualifier information is accessed through the prefix flags byte at ptr[-8]:
- bit 5 (0x20): __restrict qualifier (has_restrict_qualifier, sub_7A7850)
- bit 6 (0x40): volatile qualifier duplicate (has_volatile_qualifier, sub_7A7890)
The function get_cv_qualifiers (sub_7A9E70, 319 call sites) accumulates cv-qualifier bits by walking through typedef chains, applying a & 0x7F mask to collect all qualifier bits from each layer.
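The accumulation described above can be sketched in C as follows. This is a simplified model, not the decompiled function: `struct qual_node` and the `toy_` names are hypothetical, and the real node offsets are not reproduced. Only the masks (0x7F, 0xFC) and the typedef kind (12) come from the text.

```c
#include <stddef.h>
#include <stdint.h>

enum { QUAL_CONST = 0x01, QUAL_VOLATILE = 0x02, TK_TYPEREF = 12 };

/* Hypothetical stand-in for a type node; layout is illustrative only. */
struct qual_node {
    uint8_t kind;                 /* type kind; 12 means typedef */
    uint8_t qual_flags;           /* models the byte at type+134 */
    struct qual_node *referenced; /* next layer of the typedef chain */
};

/* OR together the qualifier bits of every typedef layer and of the
 * first non-typedef type reached. */
static unsigned toy_get_cv_qualifiers(const struct qual_node *t) {
    unsigned quals = 0;
    while (t != NULL) {
        quals |= t->qual_flags & 0x7Fu;
        if (t->kind != TK_TYPEREF)
            break;
        t = t->referenced;
    }
    return quals;
}

/* Same mask as clear_type_qualifier_bits: drop const and volatile. */
static void toy_clear_cv(struct qual_node *t) {
    t->qual_flags &= 0xFC;
}
```

Accumulating with OR means a `volatile` typedef of a `const` type reports both qualifiers, even though neither node carries both bits.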
Type Query Function Catalog
The types.c file (address range 0x7A4940-0x7C02A0) contains approximately 250 functions. Of these, 128 are tiny leaf functions that query type node properties. They follow a canonical pattern:
// Canonical type query pattern
bool is_<property>_type(type_node *type) {
while (type->type_kind == 12) // skip typedefs
type = type->referenced_type;
return type->type_kind == <expected>; // or flag check
}
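A compilable version of the pattern, using the kind numbers quoted on this page (void = 1, integer = 2, array = 8, typedef = 12). The struct and the `toy_` function names are hypothetical; the real node is 176 bytes and these predicates live at the addresses listed in the tables below.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_kind { TK_VOID = 1, TK_INTEGER = 2, TK_ARRAY = 8, TK_TYPEREF = 12 };

/* Hypothetical two-field model of a type node. */
struct query_node {
    enum toy_kind kind;
    const struct query_node *referenced_type; /* used when kind == TK_TYPEREF */
};

/* Shared prologue of every query: walk past typedef layers. */
static const struct query_node *toy_skip_typedefs(const struct query_node *t) {
    while (t->kind == TK_TYPEREF)
        t = t->referenced_type;
    return t;
}

static bool toy_is_array_type(const struct query_node *t) {
    return toy_skip_typedefs(t)->kind == TK_ARRAY;
}

static bool toy_is_void_type(const struct query_node *t) {
    return toy_skip_typedefs(t)->kind == TK_VOID;
}
```

Because each predicate repeats the typedef walk, callers never need to normalize a type before querying it, at the cost of re-walking the chain on every call.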
Most-Referenced Query Functions
Sorted by call site count across the entire binary:
| Callers | Function | Address | Test |
|---|---|---|---|
| 407 | is_class_or_struct_or_union_type | 0x7A8A30 | kind in {9, 10, 11} |
| 389 | type_pointed_to | 0x7A9910 | kind==6, return +144 |
| 319 | get_cv_qualifiers | 0x7A9E70 | accumulate qualifier bits (& 0x7F) |
| 299 | is_dependent_type | 0x7A6B60 | bit 5 of byte +133 |
| 243 | is_object_pointer_type | 0x7A7630 | kind==6 && !(bit 0 of +152) |
| 221 | is_array_type | 0x7A8370 | kind==8 |
| 199 | is_member_pointer_or_ref | 0x7A7B30 | kind==6 && (bit 0 of +152) |
| 185 | is_reference_type | 0x7A6AC0 | kind==7 |
| 169 | is_function_type | 0x7A8DC0 | kind==14 |
| 140 | is_void_type | 0x7A6E90 | kind==1 |
| 126 | array_element_type (deep) | 0x7A9350 | strips arrays+typedefs recursively |
| 85 | is_enum_type | 0x7A7010 | kind==2 (with scoped check) |
| 82 | is_integer_type | 0x7A71B0 | kind==2 |
| 77 | is_member_pointer_flag | 0x7A7810 | kind==6, bit 0 of +152 |
| 76 | is_pointer_to_member_type | 0x7A8D90 | kind==13 |
| 70 | is_long_double_type | 0x7A73F0 | kind==5 |
| 62 | is_scoped_enum_type | 0x7A70F0 | kind==2, bit 3 of +145 |
| 56 | is_rvalue_reference_type | 0x7A6EF0 | kind==19 |
Typedef Stripping Functions
Six functions strip typedef layers with different stopping conditions:
| Function | Address | Behavior |
|---|---|---|
| skip_typedefs | 0x7A68F0 | Strips all typedef layers, preserves cv-qualifiers |
| skip_named_typedefs | 0x7A6930 | Stops at unnamed typedefs |
| skip_to_attributed_typedef | 0x7A6970 | Stops at typedef with attribute flag |
| skip_typedefs_and_attributes | 0x7A69C0 | Strips both typedefs and attributed-typedefs |
| skip_to_elaborated_typedef | 0x7A6A10 | Stops at typedef with elaborated flag |
| skip_non_attributed_typedefs | 0x7A6A70 | Stops at typedef with any attribute bits |
Compound Type Predicates
| Function | Address | Type Kinds |
|---|---|---|
| is_arithmetic_type | 0x7A7560 | {2, 3, 4, 5} |
| is_scalar_type | 0x7A7BA0 | {2, 3, 4, 5, 6(non-member), 13, 19, 20} |
| is_aggregate_type | 0x7A8B40 | {8, 9, 10, 11} |
| is_floating_point_type | 0x7A7300 | {3, 4, 5} |
| is_pack_or_auto_type | 0x7A7420 | {16, 17, 18} |
| is_pack_expansion_type | 0x7A6BE0 | {16, 17} |
| is_complete_type | 0x7A6DA0 | Not void, not reference, not incomplete class |
Duplicate Functions
EDG uses distinct function names for semantic clarity even when the implementation is identical. The compiler does not merge them:
- 0x7A7630 == 0x7A7670 == 0x7A7750 (all: is_non_member_pointer / is_object_pointer_type)
- 0x7A7B00 == 0x7A7B70 (both: is_pointer_type)
- 0x7A78D0 == 0x7A7910 (both: is_non_const_ref)
Type Construction
alloc_type (sub_5E3D40)
The primary type allocation function. Takes a single argument: the type kind. Returns a pointer to a fully-initialized 176-byte type node with the appropriate supplement structure allocated and linked.
Protocol:
- Trace enter (if dword_126EFC8 set)
- Allocate 176 bytes via region_alloc in file-scope region
- Write 16-byte prefix (TU copy addr, next link, flags byte)
- Increment qword_126F8E0 (type counter)
- Copy 96-byte common IL header from template globals
- Set default access to 1 at +148
- Dispatch set_type_kind switch for the requested kind
- Trace leave (if tracing)
- Return raw + 16
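The prefix-then-body convention in the protocol can be sketched with plain `calloc`. The text does not settle whether the 176 bytes include the 16-byte prefix; this sketch assumes the prefix is extra, and every name in it (`toy_alloc_type`, `toy_free_type`, `toy_type_counter`) is hypothetical. The header copy and the set_type_kind dispatch are omitted.

```c
#include <stdint.h>
#include <stdlib.h>

enum { PREFIX_SIZE = 16, TYPE_NODE_SIZE = 176, ACCESS_OFFSET = 148 };

static uint64_t toy_type_counter; /* stands in for qword_126F8E0 */

/* Allocate prefix + node, zero both, and hand back a pointer to the
 * node body so that the prefix sits at negative offsets. */
static unsigned char *toy_alloc_type(void) {
    unsigned char *raw = calloc(1, PREFIX_SIZE + TYPE_NODE_SIZE);
    if (raw == NULL)
        return NULL;
    toy_type_counter++;
    unsigned char *node = raw + PREFIX_SIZE; /* "return raw + 16" */
    node[ACCESS_OFFSET] = 1;                 /* default access at +148 */
    return node;
}

/* The allocation must be released through the raw pointer. */
static void toy_free_type(unsigned char *node) {
    if (node != NULL)
        free(node - PREFIX_SIZE);
}
```

Returning the interior pointer is what makes accesses such as `ptr[-8]` (the prefix flags byte mentioned in the qualifier section) valid from any node pointer.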
init_type_fields (sub_5E3590)
Re-initializes an existing type node in-place without allocating new memory. Used when a type node needs to change kind after initial allocation (rare but occurs during template instantiation). Copies the template header and dispatches the same set_type_kind switch.
make_cv_combined_type (sub_7A6320)
Constructs a new type that combines cv-qualifiers from two source types. Recursively handles arrays (recurses on element type) and pointer-to-member (recurses on member type). Allocates a fresh type node via alloc_type, copies the base type via sub_5DA0A0, then applies the combined qualifiers via sub_5D64F0.
Type Comparison
types_are_identical (sub_7AA150)
The main type comparison function (636 lines). Handles all 22 type kinds with deep structural comparison. For class types, delegates to the class scope comparison infrastructure. For function types, compares parameter lists, return types, and calling conventions.
types_are_equivalent_for_correspondence (sub_7B2260)
A 688-line function used during multi-TU compilation (CUDA RDC mode). Compares types across translation units for structural equivalence, called from verify_class_type_correspondence (sub_7A00D0).
compatible_ms_bit_field_container_types (sub_7C02A0)
The last function in types.c. Checks if two integer types are compatible for MSVC bit-field container layout rules: both must be kind==2 (integer) with matching size at offset +120.
Pointer and Reference Encoding
Pointers use type kind 6 (tk_pointer), with member-pointer status distinguished by flag bits at offset +152:
tk_pointer (kind 6):
+144 referenced_type The pointed-to / referenced type
+152 bit 0 = 0 Object pointer (T*)
bit 0 = 1 Member pointer (T C::*)
bit 1 Extended member pointer flag
The types.c query functions use the following kind tests for pointer/reference classification. Note that the kind values tested here correspond to the types.c query numbering (see reconciliation note in the type kind table):
| Query | Kind Test | +152 Test | Matches |
|---|---|---|---|
| is_pointer_type | kind==6 | -- | T*, T C::* |
| is_object_pointer_type | kind==6 | !(bit 0) | T* only |
| is_member_pointer_flag | kind==6 | bit 0 | T C::* only |
| is_reference_type | kind==7 | -- | T& (lvalue reference) |
| is_rvalue_reference_type | kind==19 | -- | T&& |
| is_pointer_to_member_type | kind==13 | -- | T C::* (alternate encoding) |
The pm_class_type (sub_7A9A10) and pm_member_type (sub_7A99D0) access +144 and +152 respectively for kind-13 nodes.
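The classification table translates directly into predicates. The sketch below models only the kind and the flag byte that the text places at +152; `struct ptr_node` and the `toy_` names are hypothetical, and typedef stripping is left out to keep the tests visible.

```c
#include <stdbool.h>
#include <stdint.h>

enum { TK_POINTER = 6, TK_REFERENCE = 7, TK_PTR_TO_MEMBER = 13, TK_RVALUE_REF = 19 };

/* Hypothetical two-byte model of the fields the queries read. */
struct ptr_node {
    uint8_t kind;
    uint8_t flags_152; /* bit 0 set means member pointer */
};

static bool toy_is_pointer_type(const struct ptr_node *t) {
    return t->kind == TK_POINTER;
}

static bool toy_is_object_pointer_type(const struct ptr_node *t) {
    return t->kind == TK_POINTER && !(t->flags_152 & 0x01);
}

static bool toy_is_member_pointer_flag(const struct ptr_node *t) {
    return t->kind == TK_POINTER && (t->flags_152 & 0x01);
}

static bool toy_is_rvalue_reference_type(const struct ptr_node *t) {
    return t->kind == TK_RVALUE_REF;
}
```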
Array Type Encoding
Array types (kind 8) store bounds inline in the type node:
tk_array (kind 8):
+120 type_size Total array size in bytes (0 if unknown)
+128 alignment Element alignment
+144 element_type Pointer to the element type node
+152 bound_expr Bound expression pointer (for VLAs and dependent)
+153 array_flags:
bit 0 = dependent_bound (template-dependent array size)
bit 1 = is_VLA (C99 variable-length array)
bit 5 = star_modifier (C99 [*] syntax)
+160 numeric_bound Compile-time bound value (when not VLA/dependent)
The function identical_array_type_level (sub_7A4E10, types.c:6779) compares two array types by checking the VLA flag, dependent flag, and then either bound expressions (via sub_5D2160) or numeric bounds at +160.
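A minimal C sketch of that comparison order, under stated assumptions: `struct toy_array` is a hypothetical model rather than the real node layout, and pointer identity stands in for the structural expression comparison that sub_5D2160 performs in the binary.

```c
#include <stdbool.h>
#include <stdint.h>

enum { ARR_DEPENDENT = 0x01, ARR_VLA = 0x02 };

/* Hypothetical model of the fields the comparison reads. */
struct toy_array {
    uint8_t flags;          /* models array_flags */
    const void *bound_expr; /* used for VLA and dependent bounds */
    uint64_t numeric_bound; /* used otherwise */
};

static bool toy_identical_array_level(const struct toy_array *a,
                                      const struct toy_array *b) {
    const uint8_t mask = ARR_DEPENDENT | ARR_VLA;
    /* The VLA and dependent flags must agree first. */
    if ((a->flags & mask) != (b->flags & mask))
        return false;
    /* Expression-bounded arrays: compare the bound expressions.
     * Assumption: identity replaces the real structural comparison. */
    if (a->flags & mask)
        return a->bound_expr == b->bound_expr;
    /* Otherwise compare the compile-time bounds. */
    return a->numeric_bound == b->numeric_bound;
}
```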
Class Type Flags
Class types (kinds 9, 10, 11) carry two flag bytes at offsets +161 and +163 in the type node, plus property flags in the class_type_supplement at supplement offset +86:
type_node+161 (class_flags_1)
| Bit | Mask | Field | Query Function |
|---|---|---|---|
| 0 | 0x01 | is_local_class | is_local_class_type (0x7A8EE0) |
| 2 | 0x04 | no_name_linkage | ttt_is_type_with_no_name_linkage (0x7A4B40) |
| 4 | 0x10 | is_template_class | is_template_class_type (0x7A8EA0) |
| 5 | 0x20 | is_anonymous | is_non_anonymous_class_type tests !(bit 5) (0x7A8A90) |
| 7 | 0x80 | has_nested_types | -- |
type_node+163 (class_flags_2)
| Bit | Mask | Field | Query Function |
|---|---|---|---|
| 0 | 0x01 | extern_template_inst | is_empty_class checks this (0x7A range) |
| 3 | 0x08 | alignment_set | -- |
| 4 | 0x10 | is_scoped | is_scoped_union_type (0x7A8B00) |
class_type_supplement+86
| Bit | Mask | Field | Query Function |
|---|---|---|---|
| 0 | 0x01 | has_virtual_bases | class_has_virtual_bases (0x7A8BC0) |
| 3 | 0x08 | has_user_conversion | class_has_user_conversion (0x7A8C00) |
Type Size and Layout
type_size_and_alignment (sub_7A8020, 132 lines) computes the size and alignment of a type for ABI purposes. The computed size is stored at type_node offset +120 and alignment at +128.
For class types, the major layout computation is performed by compute_type_layout (sub_7B6350, 1107 lines), which handles:
- Base class sub-object placement
- Virtual base class offsets
- Member field alignment and padding
- Bit-field packing (with MSVC compatibility via compatible_ms_bit_field_container_types)
- Empty base optimization
Integration with Other IL Nodes
Type nodes are referenced from virtually every other IL entity:
| IL Node | Offset | Description |
|---|---|---|
| Variable entity (232B) | +112 | Variable's declared type |
| Field entity (176B) | +112 | Field's declared type |
| Routine entity (288B) | +112 | Function's type (kind 7 with routine_type_supplement) |
| Expression node (72B) | +16 | Expression result type |
| Parameter type (80B) | +8 | Parameter's declared type |
| Constant (184B) | +112 | Constant's type |
| Template argument (64B) | +32 | Type argument value (when kind=0) |
Allocation Statistics
In a typical CUDA compilation, the stats dump (sub_5E99D0) reports type node counts in the thousands. The supplement allocation counts track closely:
type 176 bytes each (qword_126F8E0)
integer type supplement 32 bytes each (qword_126F8E8)
routine type supplement 64 bytes each (qword_126F958)
class type supplement 208 bytes each (qword_126F948)
typeref type supplement 56 bytes each (qword_126F8F0)
templ param supplement 40 bytes each (qword_126F8F8)
param type 80 bytes each (qword_126F960, free-list recycled)
Type nodes are always allocated in the file-scope region (persistent for the entire translation unit) because types must outlive any individual function body. This contrasts with expression nodes and statements which can be allocated in per-function regions and freed after each function is processed.
Template Instance Record
The template instance record is the 128-byte structure that represents a pending or completed template instantiation in cudafe++ (EDG 6.6). Every template entity that may require instantiation -- function templates, class member function templates, variable templates -- gets one of these records allocated by alloc_template_instance (sub_7416E0). The records are chained into a singly-linked worklist for function/variable templates (qword_12C7740). A separate worklist of type entries (not instance records) at qword_12C7758 tracks pending class template instantiations. A fixpoint loop at translation-unit end drains both lists, instantiating entities until no new work remains.
This page documents the instance record layout, the master instance info record, the two worklists, the depth-tracking mechanisms, the parser state save/restore during instantiation, and the fixpoint algorithm that ties everything together.
Key Facts
| Property | Value |
|---|---|
| Instance record size | 128 bytes (allocated by sub_7416E0, alloc_template_instance) |
| Master instance info size | 32 bytes (allocated by sub_7416A0, alloc_master_instance_info) |
| Allocation counter (instances) | qword_12C74F0 (incremented on each 128-byte allocation) |
| Allocation counter (master info) | qword_12C74E8 (incremented on each 32-byte allocation) |
| Memory allocator | sub_6BA0D0 (EDG arena allocator) |
| Function/variable worklist head | qword_12C7740 |
| Function/variable worklist tail | qword_12C7738 |
| Class worklist head | qword_12C7758 |
| Fixpoint entry point | sub_78A9D0 (template_and_inline_entity_wrapup) |
| Worklist walker | sub_78A7F0 (do_any_needed_instantiations) |
| Decision gate | sub_774620 (should_be_instantiated) |
| Source file | templates.c (EDG 6.6, path edg/EDG_6.6/src/templates.c) |
Instance Record Layout (128 bytes)
Each record is allocated by alloc_template_instance (sub_7416E0) and zero-initialized. The allocator clears all 128 bytes, then initializes offsets +84 and +92 from qword_126EFB8 (the current source position context). The low nibble of byte +81 is explicitly masked to zero (*(_BYTE *)(result + 81) &= 0xF0).
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | entity_primary | Primary entity pointer (the instantiation's own symbol) |
| +8 | 8 | next | Next entry in the pending worklist (singly-linked) |
| +16 | 8 | inst_info | Pointer to 32-byte master instance info record (see below) |
| +24 | 8 | master_symbol | Canonical template symbol -- the entity being instantiated from |
| +32 | 8 | actual_decl | Declaration entity in the instantiation context |
| +40 | 8 | cached_decl | Cached declaration for function-local templates (partial specialization lookup result) |
| +48 | 8 | referencing_namespace | Namespace that triggered the instantiation (set by determine_referencing_namespace, sub_75D5B0) |
| +56 | 8 | (reserved) | Zero-initialized, usage not observed |
| +64 | 8 | body_flags | Deferred/deleted function body flags |
| +72 | 8 | pre_computed_result | Result from a prior instantiation attempt (non-null skips re-instantiation) |
| +80 | 1 | flags | Status bitfield (see flags table below) |
| +81 | 1 | flags2 | Secondary flags (bit 0 = on_worklist, bit 1 = warning_emitted) |
| +84 | 8 | source_position_1 | Source location context at entry creation (from qword_126EFB8) |
| +92 | 8 | source_position_2 | Second source location context at entry creation (from qword_126EFB8) |
| +104 | 8 | (reserved) | Zero-initialized |
| +112 | 8 | (reserved) | Zero-initialized |
| +120 | 8 | (reserved) | Zero-initialized |
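The layout table can be checked mechanically by writing it as a C struct with compile-time offset assertions. This is a sketch for an LP64 target (8-byte pointers). The `struct source_pos` shape (32-bit sequence number, 16-bit column, padding) is an assumption chosen so that an 8-byte field can sit at the 4-byte-aligned offset +84; the explicit padding members are likewise inferred from the gaps in the table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte position with 4-byte alignment. */
struct source_pos { uint32_t seq; uint16_t column; uint16_t pad; };

struct template_instance {
    void *entity_primary;            /* +0   */
    struct template_instance *next;  /* +8   */
    void *inst_info;                 /* +16  */
    void *master_symbol;             /* +24  */
    void *actual_decl;               /* +32  */
    void *cached_decl;               /* +40  */
    void *referencing_namespace;     /* +48  */
    void *reserved_56;               /* +56  */
    void *body_flags;                /* +64  */
    void *pre_computed_result;       /* +72  */
    uint8_t flags;                   /* +80  */
    uint8_t flags2;                  /* +81  */
    uint8_t pad_82[2];               /* gap before +84 */
    struct source_pos source_position_1; /* +84 */
    struct source_pos source_position_2; /* +92 */
    uint8_t pad_100[4];              /* gap before +104 */
    void *reserved_104;              /* +104 */
    void *reserved_112;              /* +112 */
    void *reserved_120;              /* +120 */
};

_Static_assert(offsetof(struct template_instance, flags) == 80, "flags");
_Static_assert(offsetof(struct template_instance, flags2) == 81, "flags2");
_Static_assert(offsetof(struct template_instance, source_position_1) == 84, "pos1");
_Static_assert(offsetof(struct template_instance, source_position_2) == 92, "pos2");
_Static_assert(offsetof(struct template_instance, reserved_104) == 104, "r104");
_Static_assert(sizeof(struct template_instance) == 128, "record size");
```

If any field in the table were misplaced, the translation unit would fail to compile, which makes this a cheap consistency check when transcribing layouts from a decompiler.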
Allocator Pseudocode
// sub_7416E0 — alloc_template_instance
template_instance_t *alloc_template_instance(void) {
if (debug_tracing_enabled)
trace_enter(5, "alloc_template_instance");
template_instance_t *rec = arena_alloc(128); // sub_6BA0D0
alloc_count_instances++; // qword_12C74F0
// Zero all fields
rec->entity_primary = NULL; // +0
rec->next = NULL; // +8
rec->inst_info = NULL; // +16
rec->master_symbol = NULL; // +24
rec->actual_decl = NULL; // +32
rec->cached_decl = NULL; // +40
rec->ref_namespace = NULL; // +48
rec->reserved_56 = NULL; // +56
rec->body_flags = NULL; // +64
rec->precomputed = NULL; // +72
rec->flags = 0; // +80
rec->flags2 &= 0xF0; // +81: clear low nibble
rec->source_pos_1 = current_source_context; // +84 from qword_126EFB8
rec->source_pos_2 = current_source_context; // +92 from qword_126EFB8
rec->reserved_104 = NULL; // +104
rec->reserved_112 = NULL; // +112
rec->reserved_120 = NULL; // +120
if (debug_tracing_enabled)
trace_leave();
return rec;
}
Flags Byte at +80
Six bits are used. The byte is written by update_instantiation_required_flag (sub_7770E0) and read by do_any_needed_instantiations (sub_78A7F0) and should_be_instantiated (sub_774620).
| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | instantiation_required | Entity needs instantiation (set by update_instantiation_required_flag) |
| 1 | 0x02 | not_needed | Entity was determined not to need instantiation (skip on worklist walk) |
| 3 | 0x08 | explicit_instantiation | Explicit template declaration triggered this entry |
| 4 | 0x10 | suppress_auto | Auto-instantiation suppressed (from extern template declaration) |
| 5 | 0x20 | excluded | Entity excluded from instantiation set |
| 7 | 0x80 | can_be_instantiated_checked | Pre-check (f_entity_can_be_instantiated) already performed; skip redundant check |
Flags Byte at +81
| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | on_worklist | Entry has been linked into the pending worklist |
| 1 | 0x02 | warning_emitted | Depth-limit warning already emitted for this entry |
The on_worklist bit at +81 bit 0 is the guard that prevents double-insertion into the linked list. When add_to_instantiations_required_list sets up the linked list linkage (at qword_12C7740/qword_12C7738), it checks this bit first and sets it afterward. If the bit is already set, the function takes the "already on worklist" path which may set the new_instantiations_needed fixpoint flag (dword_12C771C = 1) instead.
Master Instance Info Record (32 bytes)
Each template entity (class, function, or variable template) has exactly one master instance info record, allocated by alloc_master_instance_info (sub_7416A0). This record is shared across all instantiations of the same template and is stored at the template's associated scope info (scope_assoc + 16). The link between a 128-byte instance record and its master info is at instance +16.
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Next master info in chain (linked list) |
| +8 | 8 | back_pointer | Pointer back to the template instance record that owns this info |
| +16 | 8 | associated_scope | Pointer to the associated scope/translation-unit data |
| +24 | 4 | pending_count | Number of pending instantiations of this template (incremented/decremented by update_instantiation_required_flag) |
| +28 | 1 | flags | Status bits (low nibble cleared on allocation) |
Master Info Flags Byte at +28
| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | blocked | Instantiation blocked (dependency cycle or extern template) |
| 2 | 0x04 | has_instances | At least one instantiation has been completed |
| 3 | 0x08 | debug_checked | Already checked by debug tracing path |
Allocator Pseudocode
// sub_7416A0 — alloc_master_instance_info
master_instance_info_t *alloc_master_instance_info(void) {
master_instance_info_t *info = arena_alloc(32); // sub_6BA0D0
alloc_count_master_info++; // qword_12C74E8
info->next = NULL; // +0
info->back_pointer = NULL; // +8
info->associated_scope = NULL; // +16
info->pending_count = 0; // +24
info->flags &= 0xF0; // +28: clear low nibble
return info;
}
find_or_create_master_instance (sub_753550)
This function connects a 128-byte instance record to its shared master info. It looks up the template's scope association, checks whether a master info record already exists at scope_assoc + 16, and creates one if absent.
// sub_753550 — find_or_create_master_instance
void find_or_create_master_instance(template_instance_t *inst) {
entity_t *sym = inst->master_symbol; // inst[3], offset +24
scope_t *scope = resolve_template_scope(sym); // sub_73DE50
// Find the template's canonical entity
entity_t *canonical;
if (is_variable(sym)) // (kind - 7) & 0xFD == 0
canonical = *find_variable_correspondence(scope); // sub_79AAA0
else
canonical = *find_function_correspondence(scope); // sub_79FD80
assert(canonical != NULL, "find_or_create_master_instance");
scope_assoc_t *assoc = canonical->scope_assoc; // offset +96
master_instance_info_t *info = assoc->master_info; // assoc + 16
if (info == NULL) {
// First instantiation of this template — allocate master info
info = alloc_master_instance_info(); // sub_7416A0
info->back_pointer = inst; // info[1] = inst
if (sym != inst->actual_decl) {
// Class members: add to secondary deferred list
// qword_12C7750 / qword_12C7748
append_to_deferred_list(info);
}
assoc->master_info = info; // assoc + 16
if (debug_tracing_enabled) {
trace("find_or_create_master_instance: symbol:");
print_symbol(inst->master_symbol);
}
}
inst->inst_info = info; // inst[2], offset +16
}
The Two Worklists
Template instantiation uses two separate worklists -- one for class templates, one for function/variable templates. This separation is fundamental to correctness: class templates must be instantiated before function templates within each fixpoint iteration, because function template bodies may reference members of class template instantiations.
Function/Variable Worklist (qword_12C7740)
| Global | Purpose |
|---|---|
| qword_12C7740 | Head of the singly-linked list |
| qword_12C7738 | Tail pointer (for O(1) append) |
Entries are 128-byte instance records linked through +8 (next pointer). New entries are appended at the tail by add_to_instantiations_required_list (the tail section of sub_7770E0).
Class Worklist (qword_12C7758)
This list holds type entries (not 128-byte instance records) that need class template instantiation. Entries are linked through offset +0 of the type entry. The list is populated by update_instantiation_flags (sub_789EF0) and drained by template_and_inline_entity_wrapup (sub_78A9D0).
Worklist Insertion: add_to_instantiations_required_list
The tail portion of update_instantiation_required_flag (sub_7770E0, starting at the label checking inst->flags2 & 0x01) implements worklist insertion:
// Tail of sub_7770E0 — add_to_instantiations_required_list
void add_to_instantiations_required_list(template_instance_t *inst) {
if (inst->flags2 & 0x01) {
// Already on the worklist — do not re-add.
// But if instantiation mode is active and instantiation_required is set,
// signal that a new fixpoint pass is needed.
if (instantiation_mode_active
&& (inst->flags & 0x01)
&& inst->inst_info != NULL
&& !(inst->inst_info->flags & 0x01)) // not blocked
{
new_instantiations_needed = 1; // dword_12C771C
tu_ptr->needs_recheck = 1; // TU + 393
}
return;
}
// Link into the function/variable worklist
if (pending_list_head) // qword_12C7740
pending_list_tail->next = inst; // qword_12C7738->next
else
pending_list_head = inst;
pending_list_tail = inst;
inst->flags2 |= 0x01; // mark as on worklist
// Verify correct translation unit
tu_t *tu = trans_unit_for_symbol(inst->master_symbol); // sub_741960
assert(tu == current_tu,
"add_to_instantiations_required_list: symbol for wrong translation unit");
}
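Stripped of the fixpoint signalling and the translation-unit assertion, the insertion logic is a guarded tail append. The following runnable C sketch isolates that part; `struct wl_entry`, `struct worklist`, and `wl_append_once` are hypothetical names, and only the guard bit (flags2 & 0x01) is taken from the pseudocode above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct wl_entry {
    struct wl_entry *next;
    uint8_t flags2; /* bit 0: on_worklist */
};

struct worklist {
    struct wl_entry *head; /* models qword_12C7740 */
    struct wl_entry *tail; /* models qword_12C7738 */
};

/* Append `e` unless it is already linked. Returns true if it was added. */
static bool wl_append_once(struct worklist *wl, struct wl_entry *e) {
    if (e->flags2 & 0x01)
        return false;       /* guard bit set: never link twice */
    e->next = NULL;
    if (wl->head != NULL)
        wl->tail->next = e; /* O(1) append through the tail pointer */
    else
        wl->head = e;
    wl->tail = e;
    e->flags2 |= 0x01;
    return true;
}
```

Without the guard bit, re-adding an entry that is already the tail would make it point at itself and turn the forward walk into an infinite loop.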
The Fixpoint Loop
The fixpoint loop is the algorithm that drives all template instantiation in cudafe++. It runs at the end of each translation unit, after parsing is complete, and iterates until no new instantiation work remains. The entry point is template_and_inline_entity_wrapup (sub_78A9D0).
Algorithm
template_and_inline_entity_wrapup (sub_78A9D0):
assert(tu_stack_top == 0) // qword_106BA18: not nested in another TU
assert(compilation_phase == 2) // dword_126EFB4: full compilation mode
LOOP:
FOR EACH translation_unit IN tu_list (qword_106B9F0):
set_up_tu_context(tu) // sub_7A3EF0
// PHASE 1 — Class templates (from qword_12C7758)
for entry in class_worklist:
if is_dependent_type(entry) continue
if !is_class_or_struct(entry) continue
f_instantiate_template_class(entry)
// PHASE 2 — Enable instantiation mode
instantiation_mode_active = 1 // dword_12C7730
// PHASE 3 — Function/variable templates
do_any_needed_instantiations() // sub_78A7F0
tear_down_tu_context() // sub_7A3F70
// PHASE 4 — Check fixpoint condition
new_instantiations_needed = 0 // dword_12C771C
FOR EACH translation_unit IN tu_list:
if tu->needs_recheck: // TU + 393
tu->needs_recheck = 0
set_up_tu_context(tu)
do_any_needed_instantiations()
// process inline entities
tear_down_tu_context()
additional_pass_needed = 1 // dword_12C7718
if new_instantiations_needed:
GOTO LOOP // restart fixpoint
The fixpoint is necessary because instantiating one template can trigger references to other not-yet-instantiated templates. For example, instantiating std::vector<Foo> may require instantiating std::allocator<Foo>, Foo's copy constructor, comparison operators, and any other templates used in std::vector's implementation. Each such reference may add a new entry to the worklist, which the next pass will discover and process. The loop terminates when a complete pass produces no new entries.
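The termination condition can be illustrated with a toy model in C. The dependency table, the array-based "worklist", and the function name `run_fixpoint` are all hypothetical; the only thing carried over from the algorithm is the outer loop that repeats while a pass has produced new work (the role played by dword_12C771C).

```c
#include <stdbool.h>

enum { N_TEMPLATES = 5 };

/* Hypothetical table: instantiating template i requires depends_on[i],
 * or nothing when the entry is -1. */
static const int depends_on[N_TEMPLATES] = { 1, 2, -1, 0, -1 };

/* Instantiate `root` and everything it transitively requires.
 * `done` must be all-false on entry. Returns the number of passes. */
static int run_fixpoint(int root, bool done[N_TEMPLATES]) {
    bool required[N_TEMPLATES] = { false };
    required[root] = true;
    int passes = 0;
    bool new_work = true;
    while (new_work) {
        new_work = false; /* cleared before each pass */
        passes++;
        for (int i = 0; i < N_TEMPLATES; i++) {
            if (!required[i] || done[i])
                continue;
            done[i] = true; /* "instantiate" i */
            int dep = depends_on[i];
            if (dep >= 0 && !required[dep]) {
                required[dep] = true; /* a new reference was discovered */
                new_work = true;
            }
        }
    }
    return passes;
}
```

Starting from template 3, the first pass discovers template 0 behind the scan position, the second pass processes 0, 1, and 2, and the third pass finds nothing new and ends the loop.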
Worklist Walker: do_any_needed_instantiations
sub_78A7F0 performs a linear walk over the function/variable worklist. For each entry, it applies a series of rejection filters, and if the entry passes all of them, dispatches to instantiate_template_function_full (sub_775E00).
// sub_78A7F0 — do_any_needed_instantiations
void do_any_needed_instantiations(void) {
template_instance_t *entry = pending_list_head; // qword_12C7740
while (entry) {
// 1. Already processed?
if (entry->flags & 0x02) { // not_needed
entry = entry->next;
continue;
}
// 2. Get master instance info
master_instance_info_t *info = entry->inst_info; // offset +16
assert(info != NULL, "do_any_needed_instantiations");
// 3. Blocked by dependency?
if (info->flags & 0x01) { // blocked
entry = entry->next;
continue;
}
// 4. Check if debug-verified
if (!(info->flags & 0x08)) // not debug_checked
f_is_static_or_inline(entry); // sub_756B40
// 5. Pre-check if not already done
if (!(entry->flags & 0x80)) // can_be_instantiated not checked
f_entity_can_be_instantiated(entry); // sub_7574B0
// 6. Mode filter
if (compilation_mode != 1 // dword_106C094
&& !(entry->flags & 0x01)) // not instantiation_required
{
entry = entry->next;
continue;
}
// 7. Blocked after pre-check?
if (info->flags & 0x01) {
entry = entry->next;
continue;
}
// 8. Decision gate
if (!should_be_instantiated(entry, 1)) { // sub_774620
entry = entry->next;
continue;
}
// 9. Instantiate
instantiate_template_function_full(entry, 1); // sub_775E00
entry = entry->next;
}
}
The walk is a simple forward traversal. Entries appended during instantiation (by new add_to_instantiations_required_list calls from within instantiated function bodies) that land after the current position will be visited on this same pass. Entries that land before the current position, or entries whose status changes after being skipped, are caught by the next fixpoint iteration.
Decision Gate: should_be_instantiated
sub_774620 (326 lines) is the final filter before instantiation. It runs a chain of nine checks, numbered in the pseudocode below. An entity must pass every one of them to be instantiated.
// sub_774620 — should_be_instantiated
bool should_be_instantiated(template_instance_t *inst, int check_implicit) {
master_instance_info_t *info = inst->inst_info; // +16
assert(info != NULL, "should_be_instantiated");
// 1. Blocked?
if (info->flags & 0x01) return false;
// 2. Excluded?
if (inst->flags & 0x20) return false; // excluded
// 3. Explicit but not required?
if (inst->flags & 0x08) { // explicit_instantiation
if (!(inst->flags & 0x01)) return false; // not marked required
}
// 4. Not required and not in normal mode?
if (!(inst->flags & 0x01) && compilation_mode != 1)
return false;
// 5. Has valid master_symbol?
entity_t *master = inst->master_symbol; // +24
if (!master) return false;
// 6. Entity kind check
// Function: kind 10/11 (member), kind 9 (namespace-scope)
// Variable: kind 7
// class-local function: kind 17 (lambda)
int kind = master->kind; // master + 80
// ... kind-specific filtering ...
// 7. Template body available?
// Check that the template has a cached body to replay
if (!has_template_body(inst)) return false;
// 8. Implicit include?
if (check_implicit && implicit_include_enabled) {
do_implicit_include_if_needed(inst); // sub_754A70
// re-check body availability after include
}
// 9. Depth limit warning (diagnostics 489/490)
if (approaching_depth_limit(inst)) {
if (!(inst->flags2 & 0x02)) { // warning not yet emitted
emit_warning(489 or 490, inst->master_symbol);
inst->flags2 |= 0x02; // mark warning emitted
}
return false;
}
return true;
}
Depth Tracking
Template instantiation depth is tracked at two levels -- a global counter for function templates and a per-type counter for class templates -- plus a pending-instantiation counter that detects runaway expansion.
Function Template Depth: qword_12C76E0
A single global counter incremented on entry to instantiate_template_function_full and decremented on exit. Hard limit: 255 (0xFF).
// Inside sub_775E00 (instantiate_template_function_full):
if (instantiation_depth >= 0xFF) { // qword_12C76E0
emit_fatal_error(/* depth exceeded */);
goto restore_state;
}
instantiation_depth++;
// ... perform instantiation ...
instantiation_depth--;
The 255 limit is a safety valve against infinite recursive template metaprogramming. Consider:
template<int N>
struct factorial {
static constexpr int value = N * factorial<N-1>::value;
};
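Without a depth limit, instantiating factorial<256> would re-enter the parser once per nested level to process the template body; as written, with no terminating specialization for N == 0, the recursion would never stop. At 255, EDG aborts with a fatal error rather than risk a stack overflow. The C++ standard (Annex B) recommends that implementations support at least 1,024 recursively nested template instantiations. The check shown above compares against the constant 0xFF, whereas the configurable limit stored in qword_106BD10 is the one applied to class templates and pending-instantiation counts (see the following sections).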
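The increment/check/decrement discipline can be exercised in isolation. In this C sketch `toy_instantiate` is a hypothetical stand-in for the recursive instantiation path; only the 0xFF bound and the counter protocol are taken from the snippet above.

```c
#include <stdbool.h>

enum { MAX_FUNCTION_DEPTH = 0xFF };

static unsigned depth;          /* models qword_12C76E0 */
static unsigned max_depth_seen; /* bookkeeping for the demonstration */

/* Simulate an instantiation that triggers `nested` further nested
 * instantiations. Returns false if the depth limit was hit. */
static bool toy_instantiate(unsigned nested) {
    if (depth >= MAX_FUNCTION_DEPTH)
        return false; /* the real code emits a fatal error here */
    depth++;
    if (depth > max_depth_seen)
        max_depth_seen = depth;
    bool ok = (nested == 0) || toy_instantiate(nested - 1);
    depth--;          /* always balanced, even on failure */
    return ok;
}
```

A chain of 255 nested calls succeeds and a chain of 256 fails, with the counter back at zero in both cases because every increment is paired with a decrement on the way out.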
Class Template Depth: Per-Type Counter at type_entry + 56
Each class type entry has its own depth counter at offset +56. The limit is read from qword_106BD10 (a configurable limit, typically 500) rather than the constant 255 used for function templates. This per-type design matters: it prevents one deeply nested class hierarchy from blocking all other class instantiations.
// Inside sub_777CE0 (f_instantiate_template_class):
uint32_t depth = type_entry->depth_counter; // type_entry + 56
if (depth >= max_depth_limit) { // qword_106BD10
emit_error(456, decl);
type_entry->flags |= 0x01; // mark completed
goto restore_state;
}
type_entry->depth_counter++;
// ... perform class instantiation ...
type_entry->depth_counter--;
Pending Instantiation Counter: increment/decrement/too_many
Three functions manage a per-type pending-instantiation counter that detects exponential expansion of template instantiation work.
increment_pending_instantiations (sub_75D740): dispatches on the entity kind byte at entity + 80 to locate the owning type entry, then increments the counter at type_entry + 56.
decrement_pending_instantiations (sub_75D7C0): mirror of the above, decrements.
too_many_pending_instantiations (sub_75D6A0): compares the counter against qword_106BD10. If the threshold is met, emits diagnostic 456 and returns true to abort the instantiation.
// sub_75D6A0 — too_many_pending_instantiations
bool too_many_pending_instantiations(entity_t *entity, entity_t *context,
source_pos_t *pos) {
type_entry_t *type = resolve_owning_type(entity); // dispatch on entity->kind
assert(type != NULL, "too_many_pending_instantiations");
uint32_t count = type->pending_counter; // type + 56
if (count >= max_depth_limit) { // qword_106BD10
emit_error(456, pos, context);
return true;
}
return false;
}
The entity-kind dispatch is identical across all three functions:
| Entity kind (byte +80) | Field accessed | Raw access |
|---|---|---|
| 4, 5 (template member function) | entity->scope_assoc->field_80 | *(entity->assoc + 96)->offset_80 |
| 6 (type alias template) | entity->scope_assoc->field_32 | *(entity->assoc + 96)->offset_32 |
| 9, 10 (namespace function, class) | entity->scope_assoc->field_56 | *(entity->assoc + 96)->offset_56 |
| 19-22 (class member types) | entity->type_info | entity->offset_88 |
Depth Limit Counter at type_entry + 432
Inside update_instantiation_required_flag (sub_7770E0), a secondary counter at type_entry + 432 (a 16-bit word) tracks how many times an entity's instantiation-required flag has been toggled. When the counter reaches exactly 200, diagnostic 599 is emitted once as a warning. From 200 onward (toggle_count > 199) the function returns early, so the instantiation is skipped entirely. This catches oscillating patterns where two mutually-dependent templates keep adding and removing each other from the worklist.
// Inside sub_7770E0, compilation_mode == 1 path:
if (!setting_required && is_function_or_variable(master_symbol)) {
int16_t toggle_count = *(int16_t *)(type_entry + 432);
toggle_count++;
*(int16_t *)(type_entry + 432) = toggle_count;
if (toggle_count == 200)
emit_warning(599, actual_decl);
if (toggle_count > 199)
return; // stop oscillating
}
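The two comparisons in the fragment above look redundant, but they separate "warn once" from "stop every time". A small self-contained sketch makes the behavior testable; the function name `note_unrequire_toggle` and the `warned` out-parameter are hypothetical and stand in for the `emit_warning(599, ...)` call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOGGLE_LIMIT 200   /* threshold quoted in the text */

/* Bumps the toggle counter (the 16-bit word at type_entry + 432).
 * Returns true if the caller should stop processing this update.
 * *warned is true only on the single call that reaches the limit. */
bool note_unrequire_toggle(int16_t *counter, bool *warned) {
    assert(counter != NULL && warned != NULL);
    int16_t n = (int16_t)(*counter + 1);
    *counter = n;
    *warned = (n == TOGGLE_LIMIT);   /* diagnostic 599 fires exactly once */
    return n > TOGGLE_LIMIT - 1;     /* same test as "toggle_count > 199" */
}
```

With this shape, the first 199 toggles proceed normally, the 200th warns and stops, and every later toggle stops silently.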
Parser State Save/Restore During Instantiation
Template instantiation re-enters the parser: the compiler replays the cached template body tokens with substituted types. This means the parser's global state -- scope indices, current token, source position, declaration context -- must be saved before instantiation and restored afterward. EDG uses movups/movaps SSE instructions to bulk-save/restore this state in 128-bit chunks.
Why SSE?
The global parser state variables are ordinary integers, pointers, and flags laid out at consecutive addresses. The optimizer (or a block copy written by hand in the source) packs adjacent globals into 128-bit SSE loads and stores, so each movups/movaps replaces two 64-bit or four 32-bit mov instructions. The packing matters because this is a hot path: template-heavy C++ codebases (Boost, STL, Eigen) can trigger thousands of instantiations, each requiring a state save/restore pair.
Function Instantiation: 4 SSE Registers
instantiate_template_function_full (sub_775E00) saves and restores 4 SSE registers covering 64 bytes of parser state at addresses 0x106C380--0x106C3B0.
Save on entry (before any parser re-entry):
local[0] = xmmword_106C380 // 16 bytes: parser scope context
local[1] = xmmword_106C390 // 16 bytes: token stream state
local[2] = xmmword_106C3A0 // 16 bytes: scope nesting info
local[3] = xmmword_106C3B0 // 16 bytes: auxiliary flags
Also saved as individual scalars:
saved_source_pos = qword_126DD38
saved_source_col = WORD2(qword_126DD38)
saved_diag_pos = qword_126EDE8
saved_diag_col = WORD2(qword_126EDE8)
Restore on exit (always, even on error path):
xmmword_106C380 = local[0]
xmmword_106C390 = local[1]
xmmword_106C3A0 = local[2]
xmmword_106C3B0 = local[3]
qword_126DD38 = saved_source_pos
qword_126EDE8 = saved_diag_pos
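In source form, a save/restore of this kind is most plausibly a plain struct copy that the optimizer lowers to 128-bit moves. The sketch below assumes that shape; the struct, the globals, and the function names are hypothetical stand-ins, and the 64-byte block carries no field structure because the individual fields are not recovered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-byte block standing in for the four xmmword globals
 * at 0x106C380, 0x106C390, 0x106C3A0 and 0x106C3B0. */
typedef struct {
    uint8_t bytes[64];
} parser_state_t;

static_assert(sizeof(parser_state_t) == 64, "four 16-byte chunks");

parser_state_t g_parser_state;   /* stand-in for the xmmword globals */
uint64_t g_source_pos;           /* stand-in for qword_126DD38 */

typedef struct {
    parser_state_t state;
    uint64_t source_pos;
} saved_parser_state_t;

/* Called before re-entering the parser. */
void save_parser_state(saved_parser_state_t *out) {
    assert(out != NULL);
    out->state = g_parser_state;   /* struct copy; may compile to SSE moves */
    out->source_pos = g_source_pos;
}

/* Called on every exit path, including error paths. */
void restore_parser_state(const saved_parser_state_t *in) {
    assert(in != NULL);
    g_parser_state = in->state;
    g_source_pos = in->source_pos;
}
```

Keeping the saved copy in a local of the instantiating function gives each nested instantiation its own snapshot, so the saves nest naturally with the call stack.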
Class Instantiation: 11 + 12 SSE Registers (Conditional)
f_instantiate_template_class (sub_777CE0) saves substantially more state because class body parsing involves deeper parser perturbation -- member declarations, nested types, access specifiers, base class processing, and member template definitions all modify global parser state.
The save is conditional on the current token kind (word_126DD58). If the token kind is between 2 and 8 inclusive (meaning the parser is mid-expression or mid-declaration when the class instantiation is triggered), the primary state block is saved; a token kind of exactly 8 additionally saves the extended block:
Primary state block (always saved when token is 2--8): 11 SSE registers from xmmword_126DC60--xmmword_126DD00, covering 176 bytes of declaration parser state, plus qword_126DD10 (8 bytes).
Save:
local[0] = xmmword_126DC60 // declaration context
local[1] = xmmword_126DC70 // access specifier state
local[2] = xmmword_126DC80 // base class list context
local[3] = xmmword_126DC90 // member template tracking
local[4] = xmmword_126DCA0 // nested type state
local[5] = xmmword_126DCB0 // friend declaration context
local[6] = xmmword_126DCC0 // using declaration state
local[7] = xmmword_126DCD0 // default argument context
local[8] = xmmword_126DCE0 // static assertion state
local[9] = xmmword_126DCF0 // concept/requires state
local[10] = xmmword_126DD00 // template parameter context
saved_dd10 = qword_126DD10 // additional scalar
Extended state block (saved only when token kind == 8, class definition in progress): 12 more SSE registers from xmmword_126DBA0--xmmword_126DC40, covering 192 bytes, plus qword_126DC50.
local[11] = xmmword_126DBA0 // class body parse state
local[12] = xmmword_126DBB0 // virtual function table context
local[13] = xmmword_126DBC0 // constructor/destructor tracking
local[14] = xmmword_126DBD0 // initializer list state
local[15] = xmmword_126DBE0 // exception specification
local[16] = xmmword_126DBF0 // noexcept evaluation context
local[17] = xmmword_126DC00 // member initializer state
local[18] = xmmword_126DC10 // default member init
local[19] = xmmword_126DC20 // alignment tracking
local[20] = unk_126DC30 // padding/layout state
local[21] = xmmword_126DC40 // class completion state
saved_dc50 = qword_126DC50 // additional scalar
The conditional save is a performance optimization: when the parser is in a simple context (token kind outside 2--8), the class instantiation only needs to save the 4 SSE registers from xmmword_106C380--xmmword_106C3B0 (same as function instantiation). The additional 23 registers (11 primary + 12 extended) are needed only when a class instantiation is triggered mid-parse (e.g., during elaborated type specifier resolution or SFINAE evaluation), and the 12 extended registers only when the token kind is 8.
Summary of State Save Areas
| Instantiation Kind | Condition | SSE Registers | Bytes Saved | Address Range |
|---|---|---|---|---|
| Function | Always | 4 | 64 | 0x106C380--0x106C3B0 |
| Class (minimal) | token not 2--8 | 4 | 64 | 0x106C380--0x106C3B0 |
| Class (mid-declaration) | token 2--7 | 4 + 11 | 64 + 184 | 0x106C380--0x106C3B0 + 0x126DC60--0x126DD10 |
| Class (mid-class-body) | token == 8 | 4 + 11 + 12 | 64 + 184 + 200 | All three ranges |
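The byte counts in the table follow mechanically from the register counts plus one trailing 8-byte scalar per conditional block. The helper below only reproduces that arithmetic for a given token kind; the function and constant names are hypothetical, and the register counts are taken from the table as given.

```c
#include <assert.h>
#include <stddef.h>

enum {
    BASE_BYTES     = 4 * 16,        /* always saved: 4 SSE registers       */
    PRIMARY_BYTES  = 11 * 16 + 8,   /* 11 registers + qword_126DD10        */
    EXTENDED_BYTES = 12 * 16 + 8    /* 12 registers + qword_126DC50        */
};

static_assert(PRIMARY_BYTES == 184, "matches the summary table");
static_assert(EXTENDED_BYTES == 200, "matches the summary table");

/* Total bytes of global parser state saved by a class instantiation
 * triggered while the current token kind is token_kind. */
size_t class_instantiation_save_bytes(int token_kind) {
    size_t total = BASE_BYTES;
    if (token_kind >= 2 && token_kind <= 8)
        total += PRIMARY_BYTES;     /* mid-expression or mid-declaration */
    if (token_kind == 8)
        total += EXTENDED_BYTES;    /* class definition in progress */
    return total;
}
```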
The update_instantiation_required_flag Function
sub_7770E0 (434 lines) is the central function that decides whether to add a template instance to the worklist. Its name in EDG source is update_instantiation_required_flag, confirmed by the assert string at templates.c:38863 and the debug trace "Setting instantiation_required flag to %s for (options=%d)".
This function is called whenever a template entity's instantiation status changes -- when a template is first referenced, when it is explicitly instantiated, when its definition becomes available, or when an extern template declaration is encountered.
Parameters
void update_instantiation_required_flag(
template_instance_t *inst, // a1: the instance record
bool setting, // a2 (int cast): true = mark required, false = unmark
unsigned int options // a3: bitmask controlling behavior
);
Options Bitmask
| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x01 | Force worklist addition even without inline body |
| 1 | 0x02 | Unmarking: decrement pending count and clear flags |
| 2 | 0x04 | Suppress should_be_instantiated check |
| 3 | 0x08 | Check for inline member of class template |
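For reimplementation purposes the bitmask can be written as named constants. The names below are invented for illustration (the binary carries no symbol names for these bits), and the two helper predicates simply restate rows of the table.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the a3 option bits. */
enum {
    OPT_FORCE_WORKLIST      = 0x01,  /* add even without inline body      */
    OPT_UNMARK              = 0x02,  /* decrement pending count, clear    */
    OPT_SUPPRESS_CHECK      = 0x04,  /* skip should_be_instantiated       */
    OPT_CHECK_INLINE_MEMBER = 0x08   /* inline member of class template   */
};

static_assert((OPT_FORCE_WORKLIST | OPT_UNMARK | OPT_SUPPRESS_CHECK |
               OPT_CHECK_INLINE_MEMBER) == 0x0F,
              "four distinct low bits");

/* True when the unmarking path (pending-count decrement) applies. */
bool takes_unmark_path(bool setting, unsigned options) {
    return !setting && (options & OPT_UNMARK) != 0;
}

/* True when the should_be_instantiated gate is still consulted. */
bool consults_decision_gate(unsigned options) {
    return (options & OPT_SUPPRESS_CHECK) == 0;
}
```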
High-Level Flow
update_instantiation_required_flag(inst, setting, options):
// 1. Resolve owning type entry (for toggle counter)
type_entry = resolve_type_from_actual_decl(inst->actual_decl)
// 2. Check inline member status
if is_function_or_variable(master_symbol):
if is_inline_member:
adjust options based on module/inline status
// 3. Debug trace
if debug_tracing_enabled:
"Setting instantiation_required flag to TRUE/FALSE for (options=N)"
print_symbol(inst->master_symbol)
// 4. Ensure inst_info exists
if inst_info is NULL:
find_or_create_master_instance(inst) // sub_753550
// 5. If setting == true and entity != actual_decl:
// — Set inst->flags |= 0x01 (instantiation_required)
// — If inst_info exists, increment pending_count
// — Set referencing_namespace
// — Possibly call should_be_instantiated + instantiate immediately
// — Add to worklist via add_to_instantiations_required_list
// 6. If setting == false and options & 0x02:
// — Decrement pending_count
// — Clear inst->flags bit 0
// — If count < 0: internal error (templates.c:38908)
// 7. Worklist linkage (add_to_instantiations_required_list tail)
// — Check on_worklist bit
// — Append to qword_12C7740/qword_12C7738
// — Or set fixpoint flag if already on list
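Step 7 is an ordinary head/tail append guarded by an "already queued" bit. The sketch below models only that step. The record layout and names are hypothetical, and it assumes the flag set for an already-queued instance is the one the global table calls new_instantiations_needed; the text does not say which of the two fixpoint flags is used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced instance record: link field plus the queued bit. */
typedef struct instance {
    struct instance *next;
    unsigned on_worklist : 1;
} instance_t;

instance_t *worklist_head;        /* stand-in for qword_12C7740 */
instance_t *worklist_tail;        /* stand-in for qword_12C7738 */
int new_instantiations_needed;    /* assumed stand-in for dword_12C771C */

void add_to_worklist(instance_t *inst) {
    assert(inst != NULL);
    if (inst->on_worklist) {
        /* Already queued: request another pass instead of re-linking. */
        new_instantiations_needed = 1;
        return;
    }
    inst->on_worklist = 1;
    inst->next = NULL;
    if (worklist_tail != NULL)
        worklist_tail->next = inst;
    else
        worklist_head = inst;
    worklist_tail = inst;
}
```

Appending at the tail preserves discovery order, and the queued bit keeps the list free of duplicates and cycles.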
Global State Reference
| Address | Name | Type | Description |
|---|---|---|---|
| qword_12C7740 | pending_instantiation_list | void* | Head of function/variable worklist |
| qword_12C7738 | pending_instantiation_list_tail | void* | Tail of function/variable worklist |
| qword_12C7758 | pending_class_list | void* | Head of class template worklist |
| qword_12C7750 | deferred_master_info_list | void* | Head of deferred master-info list |
| qword_12C7748 | deferred_master_info_list_tail | void* | Tail of deferred master-info list |
| dword_12C7730 | instantiation_mode | int32 | 0=none, 1=used, 2=all, 3=local |
| dword_12C771C | new_instantiations_needed | int32 | Fixpoint flag (1 = restart loop) |
| dword_12C7718 | additional_pass_needed | int32 | Secondary fixpoint flag |
| qword_12C76E0 | function_depth_counter | int64 | Current function instantiation depth (max 255) |
| qword_106BD10 | max_depth_limit | int64 | Configurable depth limit (read by both function and class paths) |
| qword_12C74F0 | instance_alloc_count | int64 | Total 128-byte records allocated |
| qword_12C74E8 | master_info_alloc_count | int64 | Total 32-byte master-info records allocated |
| qword_12C7708 | inline_entity_list_head | void* | Head of inline entity fixup list |
| qword_12C7700 | inline_entity_list_tail | void* | Tail of inline entity fixup list |
Diagnostic Messages
| Number | Severity | Condition | Message Summary |
|---|---|---|---|
| 456 | Error | Depth counter >= max limit | Excessive template instantiation depth |
| 489 | Warning | Approaching depth limit (explicit instantiation) | Template instantiation depth nearing limit |
| 490 | Warning | Approaching depth limit (auto instantiation) | Template instantiation depth nearing limit |
| 599 | Warning | Toggle counter reaches 200 | Instantiation flag oscillation detected |
| 759 | Error | Entity not visible at file scope | Template entity not accessible for instantiation |
Function Map
| Address | Identity | Confidence | Lines | Role |
|---|---|---|---|---|
| sub_7416E0 | alloc_template_instance | 95% | 40 | Allocate 128-byte instance record |
| sub_7416A0 | alloc_master_instance_info | 95% | 16 | Allocate 32-byte master info record |
| sub_753550 | find_or_create_master_instance | 95% | 75 | Link instance to shared master info |
| sub_7770E0 | update_instantiation_required_flag | 95% | 434 | Update flags, add to worklist |
| sub_78A7F0 | do_any_needed_instantiations | 100% | 72 | Walk function/variable worklist |
| sub_78A9D0 | template_and_inline_entity_wrapup | 100% | 136 | Fixpoint loop entry point |
| sub_774620 | should_be_instantiated | 95% | 326 | Decision gate |
| sub_775E00 | instantiate_template_function_full | 95% | 839 | Function template instantiation |
| sub_777CE0 | f_instantiate_template_class | 95% | 516 | Class template instantiation |
| sub_774C30 | instantiate_template_variable | 95% | 751 | Variable template instantiation |
| sub_75D740 | increment_pending_instantiations | 95% | -- | Increment per-type depth counter |
| sub_75D7C0 | decrement_pending_instantiations | 95% | -- | Decrement per-type depth counter |
| sub_75D6A0 | too_many_pending_instantiations | 95% | -- | Check depth limit, emit diagnostic 456 |
| sub_75D5B0 | determine_referencing_namespace | 95% | 47 | Find namespace that triggered instantiation |
| sub_7574B0 | f_entity_can_be_instantiated | 95% | -- | Pre-check: body available, constraints satisfied |
| sub_756B40 | f_is_static_or_inline_template_entity | 95% | -- | Check linkage for instantiation eligibility |
| sub_789EF0 | update_instantiation_flags | 95% | -- | Update class instantiation flags, add to class worklist |
| sub_72ED70 | alloc_symbol_list_entry | 95% | 39 | Allocate 16-byte symbol list node (for inline entity list) |
Cross-References
- Template Engine -- the full instantiation pipeline, substitution engine, argument deduction, partial ordering
- CUDA Template Restrictions -- CUDA-specific template argument accessibility checks
- Scope Entry -- 784-byte scope stack entry, template instantiation depth counters at +576/+580/+584
- Entity Node Layout -- entity kind byte at +80, execution space at +182
- Translation Unit Descriptor -- TU linked list at qword_106B9F0, per-TU needs_recheck flag at +393
CLI Flag Inventory
Quick Reference: 20 Most Important CUDA-Specific Flags
| Flag (via -Xcudafe) | nvcc Equivalent | ID | Effect |
|---|---|---|---|
--diag_suppress=N | --diag-suppress=N | 39 | Suppress diagnostic number N (comma-separated) |
--diag_error=N | --diag-error=N | 42 | Promote diagnostic N to error |
--diag_warning=N | --diag-warning=N | 41 | Demote diagnostic N to warning |
--display_error_number | -- | 44 | Show #NNNNN-D error codes in output |
--target=smXX | --gpu-architecture=smXX | 245 | Set SM architecture target (parsed via sub_7525E0) |
--relaxed_constexpr | --expt-relaxed-constexpr | 104 | Allow constexpr cross-space calls |
--extended-lambda | --expt-extended-lambda | 106 | Enable __device__/__host__ __device__ lambdas in host code (dword_106BF38) |
--device-c | -rdc=true | 77 | Relocatable device code (separate compilation) |
--keep-device-functions | --keep-device-functions | 71 | Do not strip unused device functions |
--no_warnings | -w | 22 | Suppress all warnings |
--promote_warnings | -W | 23 | Promote all warnings to errors |
--error_limit=N | -- | 32 | Maximum errors before abort (default: unbounded) |
--force-lp64 | -m64 | 65 | LP64 data model (pointer=8, long=8) |
--output_mode=sarif | -- | 274 | SARIF JSON diagnostic output |
--debug_mode | -G | 82 | Full debug mode (sets 3 debug globals) |
--device-syntax-only | -- | 72 | Device-side syntax check without codegen |
--no-device-int128 | -- | 52 | Disable __int128 on device |
--zero_init_auto_vars | -- | 81 | Zero-initialize automatic variables |
--fe-inlining | -- | 54 | Enable frontend inlining |
--gen_c_file_name=path | -- | 45 | Set output .int.c file path |
These are the flags most commonly passed through -Xcudafe for CUDA development. The full inventory of 276 flags follows below.
cudafe++ accepts 276 command-line flags registered in a flat table at dword_E80060. Users do not normally type these flags themselves: NVIDIA's compiler driver nvcc decomposes its own options and invokes cudafe++ with the appropriate low-level flags in argv. To reach a cudafe++ flag explicitly, users pass it through nvcc -Xcudafe <flag>; nvcc strips the -Xcudafe prefix and forwards the remainder as a bare argument to the cudafe++ process.
The flag system is implemented in four functions within cmd_line.c:
| Function | Address | Lines | Role |
|---|---|---|---|
| register_command_flag | sub_451F80 | 25 | Insert one entry into the flag table |
| init_command_line_flags | sub_452010 | 3,849 | Register all 276 flags (called once) |
| proc_command_line | sub_459630 | 4,105 | Main parser: match argv against table, dispatch to 275-case switch |
| default_init | sub_45EB40 | 470 | Zero 350 global config variables + flag-was-set bitmap |
Flag Table Structure
Each flag occupies a 40-byte entry in a contiguous array beginning at dword_E80060, with a maximum capacity of 552 entries (overflow triggers a panic via sub_40351D). The current count is tracked in dword_E80058.
struct flag_entry { // 40 bytes per entry, fields listed in address order
int32_t case_id; // +0 dword_E80060[idx*10] -- switch dispatch ID
char* name; // +8 qword_E80068[idx*5] -- long flag name string
int8_t short_char; // +16 low byte of word_E80070[idx*20] -- single-char alias (0 if none)
int8_t is_valid; // +17 high byte of word_E80070[idx*20] -- always 1
int8_t takes_value; // +18 byte_E80072[idx*40] -- flag requires =<value> argument
int8_t is_boolean; // +19 byte_E80073[idx*40] -- flag is on/off toggle
int64_t name_length; // +24 qword_E80078[idx*5] -- strlen(name), precomputed
int32_t visible; // +32 dword_E80080[idx*10] -- mode/action classification
};
The flag-was-set bitmap at byte_E7FF40 spans 0x110 bytes (272 flag slots, one byte each). When a flag is matched during parsing, the corresponding slot is set to record that the user explicitly provided it. This bitmap is zeroed by default_init before every compilation.
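The field offsets implied by the global addresses can be checked mechanically. The sketch below declares the entry in address order and lets the compiler verify the 40-byte stride. It assumes an LP64 target (8-byte pointers, natural alignment), which matches the x86-64 binary; the `flag_entry_t` typedef and the `flag_entry_at` helper are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Entry layout derived from the address deltas relative to dword_E80060. */
typedef struct {
    int32_t     case_id;      /* +0  dword_E80060                  */
    const char *name;         /* +8  qword_E80068                  */
    int8_t      short_char;   /* +16 low byte of word_E80070       */
    int8_t      is_valid;     /* +17 high byte of word_E80070      */
    int8_t      takes_value;  /* +18 byte_E80072                   */
    int8_t      is_boolean;   /* +19 byte_E80073                   */
    int64_t     name_length;  /* +24 qword_E80078                  */
    int32_t     visible;      /* +32 dword_E80080                  */
} flag_entry_t;

static_assert(sizeof(flag_entry_t) == 40, "40-byte stride");
static_assert(offsetof(flag_entry_t, name) == 0x68 - 0x60, "name");
static_assert(offsetof(flag_entry_t, short_char) == 0x70 - 0x60, "short_char");
static_assert(offsetof(flag_entry_t, takes_value) == 0x72 - 0x60, "takes_value");
static_assert(offsetof(flag_entry_t, is_boolean) == 0x73 - 0x60, "is_boolean");
static_assert(offsetof(flag_entry_t, name_length) == 0x78 - 0x60, "name_length");
static_assert(offsetof(flag_entry_t, visible) == 0x80 - 0x60, "visible");

/* Entry idx sits idx * 40 bytes past the table base, which is what the
 * idx*10 (dword), idx*5 (qword) and idx*40 (byte) indices all express. */
const flag_entry_t *flag_entry_at(const flag_entry_t *base, size_t idx) {
    assert(base != NULL);
    return base + idx;
}
```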
Registration Protocol
register_command_flag (sub_451F80) is called approximately 275 times from init_command_line_flags. Its prototype:
void register_command_flag(
int case_id, // dispatch ID for the switch statement
char* name, // flag name, stored without the leading "--"
char short_opt, // single-letter alias, 0 for none
char takes_value, // 1 if the flag requires =<value>
int mode_flag, // visibility / classification
char enabled // whether the flag is active
);
Some flags are registered as paired toggles -- --flag and --no_flag share the same case_id but set the target global to 1 or 0 respectively. These pairs are registered either by two calls to register_command_flag or by inline table population within init_command_line_flags.
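A registration routine of this shape is a bounded append with a precomputed name length. The following sketch is an assumption-laden model: the entry struct is simplified, `register_flag` returns -1 on overflow where the binary panics via sub_40351D, and the example pair in the usage note reuses ID 27 from the catalog.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FLAGS 552   /* table capacity quoted in the text */

/* Simplified, hypothetical entry; not the 40-byte binary layout. */
typedef struct {
    int         case_id;
    const char *name;
    char        short_opt;
    char        takes_value;
    int         mode_flag;
    char        enabled;
    size_t      name_length;
} flag_t;

flag_t flag_table[MAX_FLAGS];
int flag_count;            /* stand-in for dword_E80058 */

/* Appends one entry; returns 0 on success, -1 if the table is full. */
int register_flag(int case_id, const char *name, char short_opt,
                  char takes_value, int mode_flag, char enabled) {
    assert(name != NULL);
    if (flag_count >= MAX_FLAGS)
        return -1;                      /* the binary panics here instead */
    flag_t *e = &flag_table[flag_count++];
    e->case_id = case_id;
    e->name = name;
    e->short_opt = short_opt;
    e->takes_value = takes_value;
    e->mode_flag = mode_flag;
    e->enabled = enabled;
    e->name_length = strlen(name);      /* precomputed for later strncmp */
    return 0;
}
```

A paired toggle is then just two calls with the same ID, for example `register_flag(27, "exceptions", 'x', 0, 1, 1)` followed by `register_flag(27, "no_exceptions", 0, 0, 1, 1)`.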
Parsing Flow
proc_command_line (sub_459630) is the master CLI parser. It:
- Calls init_command_line_flags to populate the flag table (once)
- Allocates four hash tables for accumulating -D, -I, system include, and macro alias arguments
- Adjusts nine diagnostic severities by default via sub_4ED400: four are suppressed (severity 3: errors 1373, 1374, 1375, 2330) and five are demoted to remark (severity 4: errors 1257, 1633, 111, 185, 175)
- Enters the main loop over argv:
  - Scans for - prefix to identify flags
  - Handles -X short flags and --flag-name long flags
  - Handles --flag=value syntax via parse_flag_name_value (sub_451EC0)
  - Matches flag names against the registered table using strncmp against each entry's precomputed name_length
  - Dispatches to a giant switch(case_id) with 275 cases
- Executes post-parsing dialect resolution (described below)
- Opens output, error, and list files
- Treats the remaining non-flag argv entry as the input filename
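The long-flag matching step can be sketched as a linear scan that compares the first name_length characters and then inspects the character that follows. This is a simplified illustration, not the decompiled parser: `match_long_flag` and its `flag_t` struct are hypothetical, short options are ignored, and a value is accepted only in the --flag=value form.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced, hypothetical table entry. */
typedef struct {
    int         case_id;
    const char *name;
    size_t      name_length;   /* strlen(name), precomputed */
    int         takes_value;
} flag_t;

/* Returns the case_id of the matching entry, or -1 if none matches.
 * On a --name=value match, *value points at the text after '='. */
int match_long_flag(const flag_t *table, size_t count,
                    const char *arg, const char **value) {
    assert(table != NULL && arg != NULL && value != NULL);
    *value = NULL;
    if (strncmp(arg, "--", 2) != 0)
        return -1;                       /* not a long flag */
    arg += 2;
    for (size_t i = 0; i < count; i++) {
        const flag_t *e = &table[i];
        if (strncmp(arg, e->name, e->name_length) != 0)
            continue;
        /* The prefix matched, so arg has at least name_length chars. */
        char end = arg[e->name_length];
        if (end == '\0' && !e->takes_value)
            return e->case_id;
        if (end == '=' && e->takes_value) {
            *value = arg + e->name_length + 1;
            return e->case_id;
        }
    }
    return -1;
}
```

Checking the terminator after the prefix compare is what stops a longer flag from being mistaken for a shorter flag that shares its prefix.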
The -Xcudafe Pass-Through
Users never invoke cudafe++ directly. The intended usage path is:
nvcc --some-option -Xcudafe --diag_suppress=1234 source.cu
nvcc strips -Xcudafe and passes --diag_suppress=1234 directly to the cudafe++ process as an argv element. Multiple -Xcudafe arguments accumulate. Because cudafe++ flags use -- long-form prefixes, there is no ambiguity with nvcc's own flag namespace.
Certain nvcc flags like --expt-extended-lambda and --expt-relaxed-constexpr are translated by nvcc into the corresponding cudafe++ internal flags (--extended-lambda, --relaxed_constexpr) before invocation. Users do not need to know the internal names.
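From cudafe++'s point of view the pass-through is invisible, but the driver-side behavior described above amounts to collecting the argument that follows each -Xcudafe token. The sketch below illustrates only that idea; it is not nvcc's actual implementation, and `collect_xcudafe_args` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies every argument that immediately follows a "-Xcudafe" token
 * into out[] (up to out_cap entries) and returns how many were stored. */
int collect_xcudafe_args(int argc, const char *const *argv,
                         const char **out, int out_cap) {
    assert(argv != NULL && out != NULL);
    int n = 0;
    for (int i = 0; i + 1 < argc; i++) {
        if (strcmp(argv[i], "-Xcudafe") == 0) {
            if (n < out_cap)
                out[n++] = argv[i + 1];
            i++;                 /* skip the forwarded argument itself */
        }
    }
    return n;
}
```

Each collected string would then appear unchanged as one argv element of the cudafe++ process, which is why multiple -Xcudafe occurrences accumulate.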
Flag Catalog by Category
The 276 flags are grouped below by functional category. Each table lists:
- ID -- the case_id used in the dispatch switch
- Flag -- the --name as registered (paired flags shown as name / no_name)
- Short -- single-character alias (dash required: -E, -C, etc.)
- Arg -- whether the flag takes a =<value> argument
- Effect -- what the flag does internally
Core EDG Flags (1--44)
These are standard Edison Design Group frontend options that predate NVIDIA's CUDA modifications.
| ID | Flag | Short | Arg | Effect |
|---|---|---|---|---|
| 1 | strict | -A | no | Enable strict standards conformance mode |
| 2 | strict_warnings | -a | no | Strict mode with extra warnings |
| 3 | no_line_commands | -P | no | Suppress #line directives in preprocessor output |
| 4 | preprocess | -E | no | Preprocessor-only mode (output to stdout) |
| 5 | comments | -C | no | Preserve comments in preprocessor output |
| 6 | old_line_commands | -- | no | Use old-style # N "file" line directives |
| 7 | old_c | -K | no | K&R C mode (calls set_c_mode(1)) |
| 8 | dependencies | -M | no | Output #include dependency list (preprocessor-only) |
| 9 | trace_includes | -H | no | Print each #include file as it is opened |
| 10 | il_display | -- | no | Dump intermediate language after parsing |
| 11 | anachronisms / no_anachronisms | -- | no | Allow/disallow anachronistic C++ constructs |
| 12 | cfront_2.1 | -b | no | Cfront 2.1 compatibility mode |
| 13 | cfront_3.0 | -- | no | Cfront 3.0 compatibility mode |
| 14 | no_code_gen | -n | no | Parse only, skip code generation |
| 15 | signed_chars / unsigned_chars | -s | no | Default char signedness |
| 16 | instantiate | -t | yes | Template instantiation mode: none, all, used, local |
| 17 | implicit_include / no_implicit_include | -B | no | Enable/disable implicit inclusion of template definitions |
| 18 | suppress_vtbl / force_vtbl | -- | no | Control virtual table emission |
| 19 | dollar | -$ | no | Allow $ in identifiers |
| 20 | timing | -# | no | Print compilation phase timing |
| 21 | version | -v | no | Print version banner and continue |
| 22 | no_warnings | -w | no | Suppress all warnings (sets severity threshold to error-only) |
| 23 | promote_warnings | -W | no | Promote warnings to errors |
| 24 | remarks | -r | no | Enable remark-level diagnostics |
| 25 | c | -m | no | Force C language mode |
| 26 | c++ | -p | no | Force C++ language mode |
| 27 | exceptions / no_exceptions | -x | no | Enable/disable C++ exception handling |
| 28 | no_use_before_set_warnings | -j | no | Suppress "used before set" variable warnings |
| 29 | include_directory | -I | yes | Add include search path (handles - for stdin) |
| 30 | define_macro | -D | yes | Define preprocessor macro (builds linked list) |
| 31 | undefine_macro | -U | yes | Undefine preprocessor macro |
| 32 | error_limit | -e | yes | Maximum number of errors before abort |
| 33 | list | -L | yes | Generate listing file |
| 34 | xref | -X | yes | Generate cross-reference file |
| 35 | error_output | -- | yes | Redirect error output to file |
| 36 | output | -o | yes | Set output file path |
| 37 | db | -d | yes | Load debug database |
| 38 | time_limit | -- | yes | Set compilation time limit |
| 39 | diag_suppress | -- | yes | Suppress diagnostic numbers (comma-separated list) |
| 40 | diag_remark | -- | yes | Demote diagnostics to remark severity |
| 41 | diag_warning | -- | yes | Set diagnostics to warning severity |
| 42 | diag_error | -- | yes | Promote diagnostics to error severity |
| 43 | diag_once | -- | yes | Emit diagnostic only on first occurrence |
| 44 | display_error_number / no_display_error_number | -- | no | Show/hide error code numbers in output |
NVIDIA CUDA-Specific Flags (45--89)
These flags are NVIDIA additions absent from stock EDG. They control CUDA compilation modes, device code generation, and host/device interaction.
| ID | Flag | Arg | Effect |
|---|---|---|---|
| 45 | gen_c_file_name | yes | Set output .int.c file path (qword_106BF20) |
| 46 | msvc_target_version | yes | MSVC version for compatibility (dword_126E1D4) |
| 47 | host-stub-linkage-explicit | no | Use explicit linkage on host stubs |
| 48 | static-host-stub | no | Generate static host stubs |
| 49 | device-hidden-visibility | no | Apply hidden visibility to device symbols |
| 50 | no-hidden-visibility-on-unnamed-ns | no | Exempt unnamed namespaces from hidden visibility |
| 51 | no-multiline-debug | no | Disable multiline debug info |
| 52 | no-device-int128 | no | Disable __int128 on device |
| 53 | no-device-float128 | no | Disable __float128 on device |
| 54 | fe-inlining | no | Enable frontend inlining (dword_106C068 = 1) |
| 55 | modify-stack-limit | yes | Control stack limit modification (dword_106C064) |
| 56 | fassociative-math | no | Enable associative floating-point math |
| 57 | orig_src_file_name | yes | Original source file name (before preprocessing) |
| 58 | orig_src_path_name | yes | Original source path name (full path) |
| 59 | frandom-seed | yes | Random seed for reproducible output |
| 60 | check-template-param-qual | no | Check template parameter qualifications |
| 61 | check-clock-call | no | Validate clock() calls in device code |
| 62 | check-ffs-call | no | Validate ffs() calls in device code |
| 63 | check-routine-address-taken | no | Check when device routine address is taken |
| 64 | check-memory-clobber | no | Validate memory clobber in inline asm |
| 65 | force-lp64 | no | LP64 data model: pointer=8, long=8 |
| 66 | force-llp64 | no | LLP64 data model: pointer=8, long=4 |
| 67 | pgi_llvm | no | PGI/LLVM backend mode |
| 68 | pgi_arch_ppc | no | PGI PowerPC architecture |
| 69 | pgi_arch_aarch64 | no | PGI AArch64 architecture |
| 70 | pgi_version | yes | PGI compiler version number |
| 71 | keep-device-functions | no | Do not strip unused device functions |
| 72 | device-syntax-only | no | Device-side syntax check without codegen |
| 73 | device-time-trace | no | Enable device compilation time tracing |
| 74 | force_linkonce_to_weak | no | Convert linkonce to weak linkage |
| 75 | disable_host_implicit_call_check | no | Skip implicit call validation on host |
| 76 | no_strict_cuda_error | no | Relax strict CUDA error checking |
| 77 | device-c | no | Relocatable device code (RDC) mode |
| 78 | no-shadow-functions | no | Disable function shadowing in device code |
| 79 | disable_ext_lambda_cache | no | Disable extended lambda capture cache |
| 80 | no-constant-variable-inferencing | no | Disable constexpr variable inference on device |
| 81 | zero_init_auto_vars | no | Zero-initialize automatic variables |
| 82 | debug_mode | no | Full debug mode (sets 3 debug globals to 1) |
| 83 | gen_module_id_file | no | Generate module ID file |
| 84 | include_file_name | yes | Forced include file name |
| 85 | gen_device_file_name | yes | Device-side output file name |
| 86 | stub_file_name | yes | Stub file output path |
| 87 | module_id_file_name | yes | Module ID file path |
| 88 | tile_bc_file_name | yes | Tile bitcode file path |
| 89 | tile-only | no | Tile-only compilation mode |
Architecture and Host Compiler Flags (90--114)
These flags identify the target architecture and host compiler for compatibility emulation.
| ID | Flag | Short | Arg | Effect |
|---|---|---|---|---|
| 90 | m32 | -- | no | 32-bit mode: pointer=4, long=4, all types sized for ILP32 |
| 91 | m64 | -- | no | 64-bit mode (default on Linux x86-64) |
| 92 | Version | -V | no | Print version with different copyright format, then exit(1) |
| 93 | compiler_bindir | -- | yes | Host compiler binary directory |
| 94 | sdk_dir | -- | yes | SDK directory path |
| 95 | pgc++ | -- | no | PGI C++ compiler mode |
| 96 | icc | -- | no | Intel ICC compiler mode |
| 97 | icc_version | -- | yes | Intel ICC version number |
| 98 | icx | -- | no | Intel ICX (oneAPI) compiler mode |
| 99 | grco | -- | no | GRCO compiler mode |
| 100 | allow_managed | -- | no | Allow __managed__ variable declarations |
| 101 | gen_system_templates_from_text | -- | no | Generate system templates from text |
| 102 | no_host_device_initializer_list | -- | no | Disable HD initializer_list support |
| 103 | no_host_device_move_forward | -- | no | Disable HD std::move/std::forward |
| 104 | relaxed_constexpr | -- | no | Relaxed constexpr rules for device code (--expt-relaxed-constexpr) |
| 105 | dont_suppress_host_wrappers | -- | no | Emit host wrapper functions unconditionally |
| 106 | arm_cross_compiler | -- | no | ARM cross-compilation mode |
| 107 | target_woa | -- | no | Windows on ARM target |
| 108 | gen_div_approx_no_ftz | -- | no | Generate approximate division without flush-to-zero |
| 109 | gen_div_approx_ftz | -- | no | Generate approximate division with flush-to-zero |
| 110 | shared_address_immutable | -- | no | Shared memory addresses are immutable |
| 111 | uumn | -- | no | Unnamed union member naming |
C++ Language Feature Toggle Flags (115--275)
The largest group -- approximately 120 paired boolean toggles that control individual C++ language features. Most are inherited from EDG's configuration surface. Each pair shares a case_id and sets a global variable to 1 (--flag) or 0 (--no_flag).
Precompiled Headers (115--121)
| ID | Flag | Arg | Effect |
|---|---|---|---|
| 115 | unsigned_wchar_t | no | wchar_t is unsigned |
| 116 | create_pch | yes | Create precompiled header file |
| 117 | use_pch | yes | Use existing precompiled header |
| 118 | pch | no | Enable PCH mode |
| 119 | pch_messages / no_pch_messages | no | Show/hide PCH status messages |
| 120 | pch_verbose / no_pch_verbose | no | Verbose PCH output |
| 121 | pch_dir | yes | PCH file directory |
Core C++ Feature Toggles (122--171)
| ID | Flag | Arg | Default |
|---|---|---|---|
| 122 | restrict / no_restrict | no | on |
| 123 | long_lifetime_temps / short_lifetime_temps | no | -- |
| 124 | wchar_t_keyword / no_wchar_t_keyword | no | on |
| 125 | pack_alignment | yes | -- |
| 126 | alternative_tokens / no_alternative_tokens | no | on |
| 127 | svr4 / no_svr4 | no | -- |
| 128 | brief_diagnostics / no_brief_diagnostics | no | -- |
| 129 | nonconst_ref_anachronism / no_nonconst_ref_anachronism | no | -- |
| 130 | no_preproc_only | no | -- |
| 131 | rtti / no_rtti | no | on |
| 132 | building_runtime | no | -- |
| 133 | bool / no_bool | no | on |
| 134 | array_new_and_delete / no_array_new_and_delete | no | -- |
| 135 | explicit / no_explicit | no | -- |
| 136 | namespaces / no_namespaces | no | on |
| 137 | using_std / no_using_std | no | -- |
| 138 | remove_unneeded_entities / no_remove_unneeded_entities | no | on |
| 139 | typename / no_typename | no | -- |
| 140 | implicit_typename / no_implicit_typename | no | on |
| 141 | special_subscript_cost / no_special_subscript_cost | no | -- |
| 143 | old_style_preprocessing | no | -- |
| 144 | old_for_init / new_for_init | no | -- |
| 145 | for_init_diff_warning / no_for_init_diff_warning | no | -- |
| 146 | distinct_template_signatures / no_distinct_template_signatures | no | -- |
| 147 | guiding_decls / no_guiding_decls | no | on |
| 148 | old_specializations / no_old_specializations | no | on |
| 149 | wrap_diagnostics / no_wrap_diagnostics | no | -- |
| 150 | implicit_extern_c_type_conversion / no_implicit_extern_c_type_conversion | no | -- |
| 151 | long_preserving_rules / no_long_preserving_rules | no | -- |
| 152 | extern_inline / no_extern_inline | no | -- |
| 153 | multibyte_chars / no_multibyte_chars | no | -- |
| 154 | embedded_c++ | no | Embedded C++ mode |
| 155 | vla / no_vla | no | -- |
| 156 | enum_overloading / no_enum_overloading | no | -- |
| 157 | nonstd_qualifier_deduction / no_nonstd_qualifier_deduction | no | -- |
| 158 | late_tiebreaker / early_tiebreaker | no | -- |
| 159 | preinclude | yes | -- |
| 160 | preinclude_macros | yes | -- |
| 161 | pending_instantiations | yes | -- |
| 162 | const_string_literals / no_const_string_literals | no | on |
| 163 | class_name_injection / no_class_name_injection | no | on |
| 164 | arg_dep_lookup / no_arg_dep_lookup | no | on |
| 165 | friend_injection / no_friend_injection | no | on |
| 166 | nonstd_using_decl / no_nonstd_using_decl | no | -- |
| 168 | designators / no_designators | no | -- |
| 169 | extended_designators / no_extended_designators | no | -- |
| 170 | variadic_macros / no_variadic_macros | no | -- |
| 171 | extended_variadic_macros / no_extended_variadic_macros | no | -- |
Include Paths and Module Support (167, 172, 256--265)
Note: These flags use non-contiguous IDs because sys_include and incl_suffixes are registered early, while the C++20 module flags use a separate ID range (256+).
| ID | Flag | Arg | Effect |
|---|---|---|---|
| 167 | sys_include | yes | System include directory |
| 172 | incl_suffixes | yes | Include file suffix list (default "::stdh:") |
| 256 | modules_directory | yes | C++20 modules directory |
| 257 | ms_mod_file_map | yes | MSVC module file mapping |
| 258 | ms_header_unit | yes | MSVC header unit |
| 259 | ms_header_unit_quote | yes | MSVC quoted header unit |
| 260 | ms_header_unit_angle | yes | MSVC angle-bracket header unit |
| 261 | ms_mod_interface / no_ms_mod_interface | no | MSVC module interface mode |
| 262 | ms_internal_partition / no_ms_internal_partition | no | MSVC internal partition mode |
| 263 | ms_translate_include / no_ms_translate_include | no | MSVC translate #include to import |
| 264 | modules / no_modules | no | Enable/disable C++20 modules |
| 265 | module_import_diagnostics / no_module_import_diagnostics | no | Module import diagnostic messages |
Host Compiler and Language Feature Toggles (182--239)
Note: All IDs below are verified against the decompiled init_command_line_flags (sub_452010). Flags are registered by sub_451F80 (explicit call) or by inline array population. IDs are not sequential -- gaps exist where flags were removed or repurposed.
| ID | Flag | Arg | Effect |
|---|---|---|---|
| 182 | gcc / no_gcc | no | GCC compatibility mode |
| 183 | g++ / no_g++ | no | G++ mode (alias for GCC C++ mode) |
| 184 | gnu_version | yes | GCC version number (default 80100 = GCC 8.1.0) |
| 185 | report_gnu_extensions | no | Report use of GNU extensions |
| 186 | short_enums / no_short_enums | no | Use minimal-size enum representation |
| 187 | clang / no_clang | no | Clang compatibility mode |
| 188 | clang_version | yes | Clang version number (default 90100 = Clang 9.1.0) |
| 189 | strict_gnu / no_strict_gnu | no | Strict GNU mode |
| 190 | db_name | yes | Debug database name |
| 191 | long_long | no | Allow long long type |
| 192 | context_limit | yes | Maximum template instantiation context depth |
| 193 | set_flag / clear_flag | yes | Raw flag manipulation via off_D47CE0 lookup table |
| 194 | edg_base_dir | yes | EDG base directory (error on invalid path) |
| 195 | embedded_c / no_embedded_c | no | Embedded C mode (not relevant to CUDA) |
| 196 | thread_local_storage / no_thread_local_storage | no | thread_local support |
| 197 | trigraphs / no_trigraphs | no | Trigraph processing (default on) |
| 198 | nonstd_default_arg_deduction / no_nonstd_default_arg_deduction | no | -- |
| 199 | stdc_zero_in_system_headers / no_stdc_zero_in_system_headers | no | -- |
| 200 | template_typedefs_in_diagnostics / no_template_typedefs_in_diagnostics | no | -- |
| 202 | uliterals / no_uliterals | no | Unicode literals (u"", U"", u8"") |
| 203 | type_traits_helpers / no_type_traits_helpers | no | Intrinsic type traits |
| 204 | c++11 / c++0x | no | C++11 mode (sets dword_126EF68 to 201103 or 199711) |
| 205 | list_macros | no | List all defined macros after preprocessing |
| 206 | dump_configuration | no | Dump full compiler configuration |
| 207 | dump_legacy_as_target | yes | Dump legacy configuration in target format |
| 208 | signed_bit_fields / unsigned_bit_fields | no | Default bit-field signedness |
| 210 | check_concatenations / no_check_concatenations | no | String literal concatenation checks |
| 211 | unicode_source_kind | yes | Source encoding: UTF-8=1, UTF-16LE=2, UTF-16BE=3, none=0 |
| 212 | lambdas / no_lambdas | no | C++ lambda expressions |
| 213 | rvalue_refs / no_rvalue_refs | no | Rvalue references |
| 214 | rvalue_ctor_is_copy_ctor / rvalue_ctor_is_not_copy_ctor | no | Rvalue constructor treatment |
| 215 | gen_move_operations / no_gen_move_operations | no | Implicit move constructor/assignment (default on) |
| 216 | auto_type / no_auto_type | no | C++11 auto type deduction |
| 217 | auto_storage / no_auto_storage | no | auto as storage class (C++03 meaning) |
| 218 | nonstd_instantiation_lookup / no_nonstd_instantiation_lookup | no | -- |
| 219 | nullptr / no_nullptr | no | nullptr keyword |
| 220 | gcc89_inlining | no | GNU C89 (gnu89) inline function semantics |
| 221 | nonstd_gnu_keywords / no_nonstd_gnu_keywords | no | GNU extension keywords |
| 222 | default_nocommon_tentative_definitions / default_common_tentative_definitions | no | Tentative definition linkage |
| 223 | no_token_separators_in_pp_output | no | -- |
| 224 | c23_typeof / no_c23_typeof | no | C23 typeof operator |
| 225 | c++11_sfinae / no_c++11_sfinae | no | C++11 SFINAE rules |
| 226 | c++11_sfinae_ignore_access / no_c++11_sfinae_ignore_access | no | Ignore access checks in SFINAE |
| 227 | variadic_templates / no_variadic_templates | no | Parameter packs and pack expansion |
| 228 | c++03 | no | C++03 mode (sets dword_126EF68 to 199711) |
| 229 | func_prototype_tags / no_func_prototype_tags | no | -- |
| 230 | implicit_noexcept / no_implicit_noexcept | no | Implicit noexcept on destructors |
| 231 | unrestricted_unions / no_unrestricted_unions | no | Unrestricted unions (C++11) |
| 232 | max_depth_constexpr_call | yes | Maximum constexpr recursion depth (default 200) |
| 233 | max_cost_constexpr_call | yes | Maximum constexpr evaluation cost (default 256) |
| 234 | delegating_constructors / no_delegating_constructors | no | -- |
| 235 | lossy_conversion_warning / no_lossy_conversion_warning | no | -- |
| 236 | deprecated_string_conv / no_deprecated_string_conv | no | Deprecated string literal to char* conversion |
| 237 | user_defined_literals / no_user_defined_literals | no | UDL support |
| 238 | preserve_lvalues_with_same_type_casts / no_... | no | -- |
| 239 | nonstd_anonymous_unions / no_nonstd_anonymous_unions | no | -- |
Late C++/Architecture/Output Flags (240--253, 268, 273--275)
| ID | Flag | Arg | Effect |
|---|---|---|---|
| 240 | c++14 | no | C++14 mode (sets dword_126EF68 to 201402) |
| 241 | c11 | no | C11 mode (sets dword_126EF68 to 201112) |
| 242 | c17 | no | C17 mode (sets dword_126EF68 to 201710) |
| 243 | c23 | no | C23 mode (sets dword_126EF68 to 202311) |
| 244 | digit_separators / no_digit_separators | no | C++14 digit separators (1'000'000) |
| 245 | target | yes | SM architecture string, parsed via sub_7525E0 into dword_126E4A8 |
| 246 | c++17 | no | C++17 mode (sets dword_126EF68 to 201703) |
| 247 | utf8_char_literals / no_utf8_char_literals | no | UTF-8 character literal support |
| 248 | stricter_template_checking | no | Additional template constraint checks |
| 249 | exc_spec_in_func_type / no_exc_spec_in_func_type | no | Exception spec as part of function type (C++17) |
| 250 | aligned_new / no_aligned_new | no | Aligned operator new (C++17) |
| 251 | c++20 | no | C++20 mode (sets dword_126EF68 to 202002) |
| 252 | c++23 | no | C++23 mode (sets dword_126EF68 to 202302) |
| 253 | ms_std_preprocessor / no_ms_std_preprocessor | no | MSVC standard preprocessor mode |
| 268 | partial-link | no | Partial linking mode |
| 273 | dump_command_options | no | Print all registered flag names |
| 274 | output_mode | yes | Output format: text (0) or sarif (1) |
| 275 | incognito / no_incognito | no | Incognito mode |
Note: The 240--252 range interleaves C/C++ standard selectors (240--243, 246, 251, 252) with individual feature toggles (244, 247--250) and the --target option (245). The standard selection IDs are also cross-referenced in the Language Standard Selection section below.
Inline-Registered Paired Flags
Seven additional paired flags are registered through inline table population rather than calls to register_command_flag. They share the same entry structure but are populated directly into the array:
| Flag | Effect |
|---|---|
relaxed_abstract_checking / no_relaxed_abstract_checking | Relax abstract class checks |
concepts / no_concepts | C++20 concepts support |
colors / no_colors | Colorized diagnostic output |
keep_restrict_in_signatures / no_keep_restrict_in_signatures | Preserve restrict in mangled names |
check_unicode_security / no_check_unicode_security | Unicode security checks (homoglyph detection) |
old_id_chars / no_old_id_chars | Legacy identifier character rules |
add_match_notes / no_add_match_notes | Add notes about matching overloads |
Language Standard Selection
The language standard flags, six for C and six for C++, set dword_126EF68 (the internal __cplusplus / __STDC_VERSION__ value) and trigger corresponding mode changes:
C Standards
| ID | Flag | dword_126EF68 value | C standard |
|---|---|---|---|
| 7 | old_c | (K&R) | Pre-ANSI C via set_c_mode(1) |
| 179 | c89 | 198912 | ANSI C / C89 |
| 178 | c99 | 199901 | C99 |
| 241 | c11 | 201112 | C11 |
| 242 | c17 | 201710 | C17 |
| 243 | c23 | 202311 | C23 |
C++ Standards
| ID | Flag | dword_126EF68 value | C++ standard |
|---|---|---|---|
| 228 | c++03 | 199711 | C++98/03 (flag ID 204 also yields this value when its C++11 condition is not met; see the next row) |
| 204 | c++11 | 201103 | C++11 (sets 199711 if dword_E7FF14 is unset or C mode) |
| 240 | c++14 | 201402 | C++14 |
| 246 | c++17 | 201703 | C++17 |
| 251 | c++20 | 202002 | C++20 |
| 252 | c++23 | 202302 | C++23 |
When a C++ standard is selected, the post-parsing dialect resolution logic automatically enables the corresponding feature flags. For example, selecting --c++11 (value 201103) enables lambdas, rvalue references, auto type deduction, nullptr, variadic templates, and other C++11 features. The resolution logic also interacts with GCC/Clang version thresholds to determine which extensions are available.
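The two tables above amount to a lookup from flag name to the value stored in dword_126EF68. A minimal sketch of that lookup follows; the dictionary and function names are illustrative and do not come from the binary, and old_c is omitted because the table gives no numeric value for it.

```python
# Values copied from the C and C++ standards tables above.
# All names here are hypothetical; they only model the flag -> value mapping.
C_STANDARDS = {
    "c89": 198912, "c99": 199901, "c11": 201112,
    "c17": 201710, "c23": 202311,
}
CPP_STANDARDS = {
    "c++03": 199711, "c++11": 201103, "c++14": 201402,
    "c++17": 201703, "c++20": 202002, "c++23": 202302,
}


def standard_value(flag: str) -> int:
    """Return the dword_126EF68 value documented for a standard flag."""
    name = flag.lstrip("-")  # accept both "c++17" and "--c++17"
    for table in (CPP_STANDARDS, C_STANDARDS):
        if name in table:
            return table[name]
    raise KeyError(f"not a documented standard flag: {flag}")


def is_cpp_standard(flag: str) -> bool:
    """True when the flag selects a C++ (rather than C) standard."""
    return flag.lstrip("-") in CPP_STANDARDS
```

For example, `standard_value("--c++17")` yields the 201703 listed for ID 246.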
Diagnostic Control Flags
The five diag_* flags (IDs 39--43) accept comma-separated lists of diagnostic numbers. The parser strips whitespace, splits on commas, and calls sub_4ED400(number, severity, 1) for each number:
--diag_suppress=1234,5678 # suppress errors 1234 and 5678
--diag_warning=20001 # demote CUDA error 20001 to warning
--diag_error=111 # promote diagnostic 111 to error
--diag_remark=185 # demote diagnostic 185 to remark
--diag_once=175 # emit diagnostic 175 only once
The error number system is documented in Diagnostic System Overview. Numbers above 3456 in the internal range correspond to the 20000-series CUDA errors via the offset formula display_code = internal_code + 16543.
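The list parsing and the offset formula can be sketched as follows. This is a model of the documented behavior, not the decompiled code: the function names are hypothetical, and the inverse mapping is simply the arithmetic inverse of the formula above.

```python
from typing import List

CUDA_DISPLAY_OFFSET = 16543      # from display_code = internal_code + 16543
LAST_NON_CUDA_INTERNAL = 3456    # "numbers above 3456" map to the 20000 series


def parse_diag_list(arg: str) -> List[int]:
    """Split a comma-separated diagnostic list, ignoring surrounding whitespace."""
    return [int(tok.strip()) for tok in arg.split(",") if tok.strip()]


def to_display_code(internal_code: int) -> int:
    """Map an internal diagnostic number to the number shown to the user."""
    if internal_code > LAST_NON_CUDA_INTERNAL:
        return internal_code + CUDA_DISPLAY_OFFSET
    return internal_code


def to_internal_code(display_code: int) -> int:
    """Inverse of to_display_code (assumed; derived only from the formula)."""
    if display_code > LAST_NON_CUDA_INTERNAL + CUDA_DISPLAY_OFFSET:
        return display_code - CUDA_DISPLAY_OFFSET
    return display_code
```

With these definitions, internal code 3457 is the first to display as a 20000-series number, which matches the offset stated above.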
Post-Parsing Dialect Resolution
After the main parsing loop completes, proc_command_line executes a large block of dialect resolution logic that:
- Resolves host compiler mode conflicts -- If both --gcc and --clang are set, or --cfront_2.1 is combined with modern modes, the resolution picks one and adjusts feature flags accordingly
- Sets C++ feature flags from the __cplusplus version -- Based on the value in dword_126EF68:
  - 199711 (C++98/03): baseline features only
  - 201103 (C++11): enables lambdas, rvalue refs, auto, nullptr, variadic templates, range-based for, delegating constructors, unrestricted unions, user-defined literals
  - 201402 (C++14): adds digit separators, generic lambdas, relaxed constexpr
  - 201703 (C++17): adds aligned new, exception spec in function type, structured bindings
  - 202002 (C++20): adds concepts, modules, coroutines
  - 202302 (C++23): adds latest features
- Applies GCC version thresholds -- When in GCC compatibility mode, certain features are gated on the GCC version number stored in qword_126EF98 (default 80100 = GCC 8.1.0). Known thresholds:
  - 40299 (0x9D6B): GCC 4.2 (decodes as 4.2.99, the top of the 4.2 series)
  - 40599 (0x9E97): GCC 4.5 (decodes as 4.5.99)
  - 40699 (0x9EFB): GCC 4.6 (decodes as 4.6.99)
  - Higher versions enable progressively more features
- Opens output files -- Error output, listing file, output file
- Processes the input filename -- The remaining non-flag argv entry
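The version-driven step can be read as a cumulative threshold table: each standard adds features on top of the earlier ones. The sketch below models only that reading. The feature names are shorthand for the items listed above, the function name is hypothetical, and the real resolution logic also consults host compiler mode and version, which this ignores.

```python
from typing import List, Tuple

# (minimum dword_126EF68 value, features that level adds), per the list above.
FEATURES_BY_STANDARD: List[Tuple[int, List[str]]] = [
    (201103, ["lambdas", "rvalue_refs", "auto_type", "nullptr",
              "variadic_templates", "range_based_for",
              "delegating_constructors", "unrestricted_unions",
              "user_defined_literals"]),
    (201402, ["digit_separators", "generic_lambdas", "relaxed_constexpr"]),
    (201703, ["aligned_new", "exc_spec_in_func_type", "structured_bindings"]),
    (202002, ["concepts", "modules", "coroutines"]),
]


def features_for(cplusplus_value: int) -> List[str]:
    """Collect every feature whose threshold the selected standard meets."""
    enabled: List[str] = []
    for threshold, names in FEATURES_BY_STANDARD:
        if cplusplus_value >= threshold:
            enabled.extend(names)
    return enabled
```

Under this model, 199711 yields an empty list (baseline only), and 201703 includes everything from the C++11 and C++14 rows as well.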
Key Globals After Resolution
| Global | Type | Content |
|---|---|---|
dword_126EF68 | int32 | __cplusplus / __STDC_VERSION__ value |
dword_126EFB4 | int32 | Language mode: 0=unset, 1=C, 2=C++ |
dword_126EFA8 | int32 | GCC compatibility enabled |
dword_126EFA4 | int32 | Clang compatibility enabled |
qword_126EF98 | int64 | GCC version (default 80100) |
qword_126EF90 | int64 | Clang version (default 90100) |
dword_126EFB0 | int32 | GNU extensions enabled |
dword_126EFAC | int32 | Clang extensions enabled |
dword_126E4A8 | int32 | SM architecture code (from --target) |
dword_126E1D4 | int32 | MSVC target version |
The set_flag / clear_flag Mechanism
Flag ID 193 (--set_flag / --clear_flag) provides a raw escape hatch. The argument is a flag name looked up in the off_D47CE0 table -- an array of {name, global_address} pairs. If the name is found, the corresponding global variable is set to the provided integer value (--set_flag=name=value) or cleared to 0 (--clear_flag=name). This mechanism allows nvcc to toggle internal EDG configuration flags that do not have dedicated CLI flag registrations.
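A toy model of the name-to-global lookup may help when reimplementing it. Everything here is an assumption made for illustration: the class, the method names, and the example flag name are hypothetical, a Python dict stands in for the {name, global_address} array, and the behavior for an unknown name (returning False) is a guess, since the text above only describes the found case.

```python
from typing import Dict, Iterable


class FlagTable:
    """Stand-in for a {name, global_address} table such as off_D47CE0."""

    def __init__(self, names: Iterable[str]) -> None:
        # Each entry models one global int variable, initially zero.
        self._globals: Dict[str, int] = {name: 0 for name in names}

    def set_flag(self, arg: str) -> bool:
        """Handle the argument of --set_flag, written as name=value."""
        name, sep, value = arg.partition("=")
        if not sep:
            raise ValueError("expected name=value")
        if name not in self._globals:
            return False          # assumed: unknown names are rejected
        self._globals[name] = int(value)
        return True

    def clear_flag(self, name: str) -> bool:
        """Handle the argument of --clear_flag: reset the global to 0."""
        if name not in self._globals:
            return False
        self._globals[name] = 0
        return True

    def value(self, name: str) -> int:
        return self._globals[name]
```

Usage: `FlagTable(["example_gate"]).set_flag("example_gate=3")` stores 3 in the modeled global; `clear_flag("example_gate")` resets it to 0.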
Default Values
default_init (sub_45EB40) runs before proc_command_line and initializes approximately 350 global configuration variables. Notable non-zero defaults:
| Global | Default | Meaning |
|---|---|---|
dword_106C210 | 1 | Exceptions enabled |
dword_106C180 | 1 | RTTI enabled |
dword_106C178 | 1 | bool is keyword |
dword_106C194 | 1 | Namespaces enabled |
dword_106C19C | 1 | Argument-dependent lookup enabled |
dword_106C1A0 | 1 | Class name injection enabled |
dword_106C1A4 | 1 | String literals are const |
dword_106C188 | 1 | wchar_t is keyword |
dword_106C18C | 1 | Alternative tokens enabled |
dword_106C140 | 1 | Compound literals allowed |
dword_106C138 | 1 | Dependent name processing enabled |
dword_106C134 | 1 | Template parsing enabled |
dword_106C12C | 1 | Friend injection enabled |
dword_106BDB8 | 1 | restrict enabled |
dword_106BDB0 | 1 | Remove unneeded entities enabled |
dword_106BD98 | 1 | Trigraphs enabled |
dword_106BD68 | 1 | Guiding declarations allowed |
dword_106BD58 | 1 | Old specializations allowed |
dword_106BD54 | 1 | Implicit typename enabled |
dword_106BE84 | 1 | Generate move operations enabled |
dword_106C064 | 1 | Stack limit modification enabled |
qword_106BD10 | 200 | Max constexpr recursion depth |
qword_106BD08 | 256 | Max constexpr evaluation cost |
qword_126EF98 | 80100 | Default GCC version (8.1.0) |
qword_126EF90 | 90100 | Default Clang version (9.1.0) |
qword_126EF78 | 1926 | MSVC version threshold |
qword_126EF70 | 99999 | Some upper bound sentinel |
Conflict Detection
Before the main parsing loop, check_conflicting_flags (sub_451E80) verifies that flags 3, 193, 194, and 195 are not used in conflicting combinations. Flag 3 is no_line_commands; per the flag tables above, 193 is set_flag / clear_flag, 194 is edg_base_dir, and 195 is embedded_c. If any conflict is detected, error 1027 is emitted.
Version Banners
Two flags print version information:
--version (ID 21, -v):
cudafe: NVIDIA (R) Cuda Language Front End
Portions Copyright (c) 2005, 2024 NVIDIA Corporation
Portions Copyright (c) 1988-2018, 2024 Edison Design Group Inc.
Based on Edison Design Group C/C++ Front End, version 6.6
Cuda compilation tools, release 13.0, V13.0.88
--Version (ID 92, -V):
Prints a different copyright format with full date/time stamp, then calls exit(1).
Cross-References
- Pipeline Overview -- Stage 2 is proc_command_line
- Diagnostic System Overview -- diag_suppress / diag_error flag handling
- Architecture Detection -- --target flag and SM version parsing
- Experimental Flags -- --set_flag / --clear_flag for internal feature gates
- EDG 6.6 Overview -- cmd_line.c source file context
EDG Build Configuration
cudafe++ is built from Edison Design Group (EDG) C/C++ front end source code, version 6.6. At build time, NVIDIA sets approximately 750 compile-time constants that control every aspect of the front end's behavior -- from which backend generates output, to how the IL system operates, to what ABI conventions are followed. These constants are baked into the binary and cannot be changed at runtime. They represent the specific EDG configuration NVIDIA chose for CUDA compilation.
The function dump_configuration (sub_44CF30, 785 lines) prints all 747 constants as C preprocessor #define statements when invoked with --dump_configuration. Of these, 613 are defined and 134 are explicitly listed as "not defined." The output is written to qword_126EDF0 (the configuration output stream, typically stderr) in alphabetical order.
$ cudafe++ --dump_configuration
/* Configuration data for Edison Design Group C/C++ Front End */
/* version 6.6, built on Aug 20 2025 at 13:59:03. */
#define ABI_CHANGES_FOR_ARRAY_NEW_AND_DELETE 1
#define ABI_CHANGES_FOR_CONSTRUCTION_VTBLS 1
...
#define WRITE_SIGNOFF_MESSAGE 1
/* Legacy configuration: <unnamed> */
#define LEGACY_TARGET_CONFIGURATION_NAME NULL
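For tooling that consumes this dump, the #define lines can be collected into a dictionary. The sketch below is a hypothetical helper, not part of cudafe++. It parses only lines of the form shown above and skips everything else, because the exact text used for "not defined" entries is not reproduced here.

```python
import re
from typing import Dict

# Matches "#define NAME VALUE"; VALUE may be absent.
DEFINE_RE = re.compile(r"^#define\s+(\w+)(?:\s+(.*))?$")


def parse_defines(dump_text: str) -> Dict[str, str]:
    """Collect NAME -> VALUE (as text) from --dump_configuration output."""
    constants: Dict[str, str] = {}
    for line in dump_text.splitlines():
        match = DEFINE_RE.match(line.strip())
        if match:
            constants[match.group(1)] = (match.group(2) or "").strip()
    return constants


# Excerpt of the output shown above.
SAMPLE = """\
/* Configuration data for Edison Design Group C/C++ Front End */
#define ABI_CHANGES_FOR_ARRAY_NEW_AND_DELETE 1
#define ABI_CHANGES_FOR_CONSTRUCTION_VTBLS 1
#define WRITE_SIGNOFF_MESSAGE 1
/* Legacy configuration: <unnamed> */
#define LEGACY_TARGET_CONFIGURATION_NAME NULL
"""

if __name__ == "__main__":
    for name, value in parse_defines(SAMPLE).items():
        print(name, "=", value)
```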
The constants fall into seven categories: backend selection, IL system, internal checking, diagnostics, target platform model, compiler compatibility, and feature defaults.
Backend Selection
The EDG front end supports multiple backend code generators. NVIDIA configured cudafe++ for the C++ code generation backend (cp_gen_be), which means the front end's output is C++ source code -- not object code, not C, and not a serialized IL file.
| Constant | Value | Meaning |
|---|---|---|
BACK_END_IS_CP_GEN_BE | 1 | Backend generates C++ source (the .ii / .int.c output) |
BACK_END_IS_C_GEN_BE | 0 | Not the C code generation backend |
BACK_END_SHOULD_BE_CALLED | 1 | Backend phase is active (front end does not stop after parsing) |
CP_GEN_BE_TARGET_MATCHES_SOURCE_DIALECT | 1 | Generated C++ targets the same dialect as the input |
GEN_CPP_FILE_SUFFIX | ".int.c" | Output file suffix for generated C++ |
GEN_C_FILE_SUFFIX | ".int.c" | Output file suffix for generated C (same as C++, unused) |
This is the central architectural fact about cudafe++. It is a source-to-source translator: CUDA C++ goes in, host-side C++ with device stubs comes out. The cp_gen_be backend walks the IL tree and emits syntactically valid C++ that the host compiler (gcc/clang/MSVC) can consume. The generated code preserves the original types, templates, and namespaces rather than lowering to a simpler representation.
The CP_GEN_BE_TARGET_MATCHES_SOURCE_DIALECT=1 setting means the backend does not down-level the output. If the input is C++17, the generated code uses C++17 constructs. This avoids the complexity of translating modern C++ features into older dialects.
Disabled Backend Features
Several backend capabilities are compiled out:
| Constant | Value | Meaning |
|---|---|---|
GCC_IS_GENERATED_CODE_TARGET | 0 | Output is not GCC-specific C |
CLANG_IS_GENERATED_CODE_TARGET | 0 | Output is not Clang-specific C |
MSVC_IS_GENERATED_CODE_TARGET | 0 | Output is not MSVC-specific C |
SUN_IS_GENERATED_CODE_TARGET | 0 | Output is not Sun/Oracle compiler C |
MICROSOFT_DIALECT_IS_GENERATED_CODE_TARGET | 0 | Output does not use Microsoft C++ extensions |
None of the compiler-specific code generation targets are enabled. The cp_gen_be emits portable C++ that is syntactically valid across all major compilers. This is possible because CUDA's host compilation already controls dialect selection through its own flag forwarding to the host compiler.
IL System
The Intermediate Language (IL) system is the core data structure connecting the parser to the backend. NVIDIA's configuration makes a critical choice: the IL is never serialized to disk.
| Constant | Value | Meaning |
|---|---|---|
IL_SHOULD_BE_WRITTEN_TO_FILE | 0 | IL stays in memory -- never written to an IL file |
DO_IL_LOWERING | 0 | No IL transformation passes before backend |
IL_WALK_NEEDED | 1 | IL walker infrastructure is compiled in |
IL_VERSION_NUMBER | "6.6" | IL format version, matches EDG version |
ALL_TEMPLATE_INFO_IN_IL | 1 | Complete template metadata in the IL graph |
PROTOTYPE_INSTANTIATIONS_IN_IL | 1 | Uninstantiated function prototypes preserved |
NEED_IL_DISPLAY | 1 | IL display/dump routines compiled in |
NEED_NAME_MANGLING | 1 | Name mangling infrastructure compiled in |
NEED_DECLARATIVE_WALK | 0 | Declarative IL walker not needed |
Why IL_SHOULD_BE_WRITTEN_TO_FILE=0 Matters
In a standard EDG deployment (like the Comeau C++ compiler or Intel ICC's older front end), the IL can be serialized to a binary file for separate backend processing. With IL_SHOULD_BE_WRITTEN_TO_FILE=0, NVIDIA eliminates the entire IL serialization path. The IL exists only as an in-memory graph during compilation:
- The parser builds IL nodes in region-based arenas (file-scope region 1, per-function region N)
- The IL walker traverses the graph to select device vs. host code
- The cp_gen_be backend reads the IL graph directly and emits C++ source
- The arenas are freed
This design means the IL_FILE_SUFFIX constant is left undefined -- there is no suffix because there is no file. The constants LARGE_IL_FILE_SUPPORT, USE_TEMPLATE_INFO_FILE, TEMPLATE_INFO_FILE_SUFFIX, INSTANTIATION_FILE_SUFFIX, and EXPORTED_TEMPLATE_FILE_SUFFIX are all similarly undefined.
Why DO_IL_LOWERING=0 Matters
IL lowering is an optional transformation pass that simplifies the IL before the backend processes it. In a lowering-enabled build, complex constructs (VLAs, complex numbers, rvalue adjustments) are reduced to simpler forms. With DO_IL_LOWERING=0, NVIDIA bypasses nearly all of this:
| Constant | Value | Meaning |
|---|---|---|
DO_IL_LOWERING | 0 | Master lowering switch is off |
LOWER_COMPLEX | 0 | No lowering of _Complex types |
LOWER_VARIABLE_LENGTH_ARRAYS | 0 | VLAs passed through as-is |
LOWER_CLASS_RVALUE_ADJUST | 0 | No rvalue conversion lowering |
LOWER_FIXED_POINT | 0 | No fixed-point lowering |
LOWER_IFUNC | 0 | No indirect function lowering |
LOWER_STRING_LITERALS_TO_NON_CONST | 0 | String literals keep const qualification |
LOWER_EXTERN_INLINE | 1 | Exception: extern inline functions are lowered |
LOWERING_NORMALIZES_BOOLEAN_CONTROLLING_EXPRESSIONS | 0 | No boolean normalization |
LOWERING_REMOVES_UNNEEDED_CONSTRUCTIONS_AND_DESTRUCTIONS | 0 | No dead construction removal |
The only lowering that remains active is LOWER_EXTERN_INLINE=1, which handles extern inline functions that need special treatment in the generated output. Everything else passes through the IL untransformed.
This makes sense for cudafe++'s role. As a source-to-source translator, it benefits from preserving the original code structure. The host compiler handles all the actual lowering when it compiles the generated .int.c file.
Why IL_WALK_NEEDED=1 Matters
Despite no serialization and no lowering, the IL walk infrastructure is compiled in. This is because cudafe++ uses the IL walker for its primary CUDA-specific task: device/host code separation. The walker traverses the IL graph and marks each entity with execution space flags (__host__, __device__, __global__), then the backend selectively emits code based on which space is being generated.
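The mark-then-filter idea can be illustrated with a toy model. This is not the EDG data structure: the Space enum, the Entity record, and select_for are all hypothetical, and the real walker operates on IL nodes rather than a flat list.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import List


class Space(Flag):
    """Execution space bits, mirroring __host__, __device__, __global__."""
    HOST = auto()
    DEVICE = auto()
    GLOBAL = auto()


@dataclass
class Entity:
    name: str
    space: Space  # assumed to have been set by the walk


def select_for(entities: List[Entity], target: Space) -> List[str]:
    """Names of entities whose execution space overlaps the target space."""
    return [e.name for e in entities if e.space & target]


EXAMPLE = [
    Entity("host_only", Space.HOST),
    Entity("device_only", Space.DEVICE),
    Entity("both", Space.HOST | Space.DEVICE),
    Entity("kernel", Space.GLOBAL),
]
```

Selecting with `Space.HOST` keeps `host_only` and `both`; selecting with `Space.DEVICE | Space.GLOBAL` keeps `device_only`, `both`, and `kernel`.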
Template Information Preservation
| Constant | Value | Meaning |
|---|---|---|
ALL_TEMPLATE_INFO_IN_IL | 1 | Full template definitions in the IL, not a separate database |
PROTOTYPE_INSTANTIATIONS_IN_IL | 1 | Even uninstantiated prototypes kept |
RECORD_TEMPLATE_STRINGS | 1 | Template argument strings preserved |
RECORD_HIDDEN_NAMES_IN_IL | 1 | Names hidden by using declarations still recorded |
RECORD_UNRECOGNIZED_ATTRIBUTES | 1 | Unknown [[attributes]] preserved in IL |
RECORD_RAW_ASM_OPERAND_DESCRIPTIONS | 1 | Raw asm operand text kept |
KEEP_TEMPLATE_ARG_EXPR_THAT_CAUSES_INSTANTIATION | 1 | Template argument expressions that trigger instantiation are retained |
With ALL_TEMPLATE_INFO_IN_IL=1, template definitions, partial specializations, and instantiation directives live directly in the IL graph. This eliminates the need for a separate template information file (USE_TEMPLATE_INFO_FILE is undefined). Combined with PROTOTYPE_INSTANTIATIONS_IN_IL=1, the IL retains complete template metadata -- even for function templates that have been declared but not yet instantiated. This is essential for CUDA's device/host separation, where a template might be instantiated in different execution spaces.
Internal Checking
NVIDIA builds cudafe++ with assertions enabled. This produces a binary with extensive runtime self-checking.
| Constant | Value | Meaning |
|---|---|---|
CHECKING | 1 | Internal assertion macros are active |
DEBUG | 1 | Debug-mode code paths are compiled in |
CHECK_SWITCH_DEFAULT_UNEXPECTED | 1 | Default cases in switch statements trigger assertions |
EXPENSIVE_CHECKING | 0 | Costly O(n) verification checks are disabled |
OVERWRITE_FREED_MEM_BLOCKS | 0 | No memory poisoning on free |
EXIT_ON_INTERNAL_ERROR | 0 | Internal errors do not call exit() directly |
ABORT_ON_INIT_COMPONENT_LEAKAGE | 0 | No abort on init-time leaks |
TRACK_INTERPRETER_ALLOCATIONS | 0 | constexpr interpreter does not track allocations |
Assertion Infrastructure
With CHECKING=1, the internal assertion handler internal_error (sub_4F2930) is live. The binary contains 5,178 call sites across 2,139 functions that invoke this handler. Each call site passes the source file name, line number, function name, and a diagnostic message pair. When an assertion fires, the handler constructs error 2656 with severity level 11 (catastrophic) and reports it through the standard diagnostic infrastructure.
The DEBUG=1 setting enables additional code paths that perform intermediate consistency checks during parsing and IL construction. These checks are less expensive than EXPENSIVE_CHECKING (which is off) but still add measurable overhead to compilation time. NVIDIA presumably leaves both CHECKING and DEBUG on because cudafe++ is a critical toolchain component where silent corruption is far worse than a slightly slower compilation.
The CHECK_SWITCH_DEFAULT_UNEXPECTED=1 setting means that every switch statement in the EDG source that handles enumerated values will trigger an assertion if control reaches the default case. This catches missing case handling when new enum values are added.
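The pattern is easiest to see in miniature. The sketch below models it with a Python exception. The enum, the function, and the file/line values are invented for illustration; only the error number 2656 and severity 11 come from the description above, and the real handler reports through the diagnostic infrastructure rather than raising.

```python
from enum import Enum


class InternalError(Exception):
    """Models an assertion failure: error 2656, severity 11 (catastrophic)."""

    def __init__(self, file: str, line: int, func: str, message: str) -> None:
        super().__init__(f"{file}:{line}: {func}: {message}")
        self.error_number = 2656
        self.severity = 11


class TypeKind(Enum):          # hypothetical enum, not an EDG type
    INTEGER = 1
    FLOAT = 2
    POINTER = 3                # "newly added" value with no case below


def size_class(kind: TypeKind) -> str:
    """Dispatch on kind; the final branch plays the role of 'default:'."""
    if kind is TypeKind.INTEGER:
        return "int"
    if kind is TypeKind.FLOAT:
        return "fp"
    # Default case reached: an enum value was not handled above.
    raise InternalError("example.c", 42, "size_class",
                        f"unexpected kind {kind.name}")
```

Calling `size_class(TypeKind.POINTER)` raises immediately instead of silently falling through, which is the failure mode the setting is meant to surface.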
Diagnostics Configuration
These constants control the default formatting and behavior of compiler error messages.
| Constant | Value | Meaning |
|---|---|---|
DEFAULT_BRIEF_DIAGNOSTICS | 0 | Full diagnostics by default (not one-line) |
DEFAULT_DISPLAY_ERROR_NUMBER | 0 | Error numbers hidden by default |
COLUMN_NUMBER_IN_BRIEF_DIAGNOSTICS | 1 | Column numbers included in brief-mode output |
DEFAULT_ENABLE_COLORIZED_DIAGNOSTICS | 1 | ANSI color codes enabled by default |
MAX_ERROR_OUTPUT_LINE_LENGTH | 79 | Diagnostic lines wrap at 79 characters |
DEFAULT_CONTEXT_LIMIT | 10 | Maximum 10 lines of instantiation context shown |
DEFAULT_DISPLAY_ERROR_CONTEXT_ON_CATASTROPHE | 1 | Show context even on fatal errors |
DEFAULT_ADD_MATCH_NOTES | 1 | Add notes explaining overload/template resolution |
DEFAULT_DISPLAY_TEMPLATE_TYPEDEFS_IN_DIAGNOSTICS | 0 | Use raw types, not typedef aliases, in messages |
DEFAULT_OUTPUT_MODE | om_text | Default output is text, not SARIF JSON |
DEFAULT_MACRO_POSITIONS_IN_DIAGNOSTICS | (undefined) | Macro expansion position tracking is off |
ERROR_SEVERITY_EXPLICIT_IN_ERROR_MESSAGES | 1 | Severity word ("error"/"warning") always printed |
DIRECT_ERROR_OUTPUT_TO_STDOUT | 0 | Errors go to stderr |
WRITE_SIGNOFF_MESSAGE | 1 | Print summary line at compilation end |
Color Configuration
The DEFAULT_EDG_COLORS constant encodes ANSI SGR (Select Graphic Rendition) color codes for diagnostic categories:
"error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32"
| Category | SGR Code | Appearance |
|---|---|---|
error | 01;31 | Bold red |
warning | 01;35 | Bold magenta |
note | 01;36 | Bold cyan |
locus | 01 | Bold (default color) |
quote | 01 | Bold (default color) |
range1 | 32 | Green (non-bold) |
This matches GCC's default diagnostic color scheme, presumably by design, so that cudafe++ diagnostics look visually consistent with the host GCC compiler's output.
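The color string is a colon-separated list of category=SGR pairs, so it can be split into a lookup table in a few lines. The helper names below are hypothetical, and the reset sequence ESC[0m is the generic SGR reset, assumed here rather than observed in the binary.

```python
from typing import Dict

DEFAULT_EDG_COLORS = (
    "error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32"
)


def parse_colors(spec: str) -> Dict[str, str]:
    """Split 'cat=sgr:cat=sgr' into a {category: SGR parameter string} map."""
    return dict(item.split("=", 1) for item in spec.split(":") if item)


def colorize(text: str, category: str, colors: Dict[str, str]) -> str:
    """Wrap text in the SGR sequence for its category, if one is configured."""
    sgr = colors.get(category)
    if sgr is None:
        return text
    return f"\x1b[{sgr}m{text}\x1b[0m"


if __name__ == "__main__":
    table = parse_colors(DEFAULT_EDG_COLORS)
    print(colorize("error", "error", table) + ": example message")
```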
ABI Configuration
| Constant | Value | Meaning |
|---|---|---|
ABI_COMPATIBILITY_VERSION | 9999 | Maximum ABI compatibility level |
IA64_ABI | 1 | Uses Itanium C++ ABI (standard on Linux) |
ABI_CHANGES_FOR_ARRAY_NEW_AND_DELETE | 1 | Array new/delete ABI changes active |
ABI_CHANGES_FOR_CONSTRUCTION_VTBLS | 1 | Construction vtable ABI changes active |
ABI_CHANGES_FOR_COVARIANT_VIRTUAL_FUNC_RETURN | 1 | Covariant return ABI changes active |
ABI_CHANGES_FOR_PLACEMENT_DELETE | 1 | Placement delete ABI changes active |
ABI_CHANGES_FOR_RTTI | 1 | RTTI ABI changes active |
DRIVER_COMPATIBILITY_VERSION | 9999 | Maximum driver-level compatibility |
The ABI_COMPATIBILITY_VERSION=9999 is a sentinel meaning "accept all ABI changes." In EDG's versioning scheme, specific ABI compatibility versions can be set to match a particular compiler release (e.g., GCC 3.2's ABI). Setting it to 9999 means cudafe++ uses the latest ABI rules for every construct, which is appropriate because it generates source code, and the host compiler applies its own ABI rules when it compiles that output.
All five ABI_CHANGES_FOR_* constants are set to 1, meaning every ABI improvement EDG has made is active. These affect name mangling, vtable layout, and RTTI representation. Since cudafe++ emits C++ source rather than object code, these primarily affect name mangling output and the structure of compiler-generated entities.
Compiler Compatibility Layer
cudafe++ emulates GCC by default. These constants configure the compatibility surface.
| Constant | Value | Meaning |
|---|---|---|
DEFAULT_GNU_COMPATIBILITY | 1 | GCC compatibility mode is on by default |
DEFAULT_GNU_VERSION | 80100 | Default GCC version = 8.1.0 |
GNU_TARGET_VERSION_NUMBER | 70300 | Target GCC version = 7.3.0 |
DEFAULT_GNU_ABI_VERSION | 30200 | Default GNU ABI version = 3.2.0 |
DEFAULT_CLANG_COMPATIBILITY | 0 | Clang compat off by default |
DEFAULT_CLANG_VERSION | 90100 | Clang version if enabled = 9.1.0 |
DEFAULT_MICROSOFT_COMPATIBILITY | 0 | MSVC compat off by default |
DEFAULT_MICROSOFT_VERSION | 1926 | MSVC version if enabled = 19.26 (VS 2019) |
MSVC_TARGET_VERSION_NUMBER | 1926 | Same: MSVC 19.26 target |
GNU_EXTENSIONS_ALLOWED | 1 | GNU extensions compiled into the parser |
GNU_X86_ASM_EXTENSIONS_ALLOWED | 1 | GNU inline asm syntax supported |
GNU_X86_ATTRIBUTES_ALLOWED | 1 | GNU __attribute__ on x86 targets |
GNU_VECTOR_TYPES_ALLOWED | 1 | GNU vector types (__attribute__((vector_size(...)))) |
GNU_VISIBILITY_ATTRIBUTE_ALLOWED | 1 | __attribute__((visibility(...))) support |
GNU_INIT_PRIORITY_ATTRIBUTE_ALLOWED | 1 | __attribute__((init_priority(...))) support |
MICROSOFT_EXTENSIONS_ALLOWED | 0 | MSVC extensions not available |
SUN_EXTENSIONS_ALLOWED | 0 | Sun/Oracle extensions not available |
The DEFAULT_GNU_VERSION=80100 encodes GCC 8.1.0 as major*10000 + minor*100 + patch. This is the baseline GCC version cudafe++ emulates when nvcc does not specify an explicit --compiler-bindir host compiler. At runtime, nvcc overrides this with the actual detected host GCC version via --gnu_version=NNNNN.
The version numbers stored here serve as fallback defaults. They affect which GNU extensions and builtins are available, which warning behaviors are emulated, and how __GNUC__ / __GNUC_MINOR__ / __GNUC_PATCHLEVEL__ are defined for the preprocessor.
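The major*10000 + minor*100 + patch encoding is easy to get wrong by one digit, so a small encoder/decoder is worth having. The function names are hypothetical; the macro mapping follows the description above.

```python
from typing import Dict, Tuple


def encode_gnu_version(major: int, minor: int, patch: int) -> int:
    """Pack a version as major*10000 + minor*100 + patch."""
    return major * 10000 + minor * 100 + patch


def decode_gnu_version(value: int) -> Tuple[int, int, int]:
    """Unpack an encoded version into (major, minor, patch)."""
    return value // 10000, (value // 100) % 100, value % 100


def gnuc_macros(value: int) -> Dict[str, int]:
    """Predefined macro values implied by an encoded GCC version."""
    major, minor, patch = decode_gnu_version(value)
    return {
        "__GNUC__": major,
        "__GNUC_MINOR__": minor,
        "__GNUC_PATCHLEVEL__": patch,
    }
```

For example, a host GCC 12.3.0 would be passed as `--gnu_version=120300`, and the default 80100 decodes to (8, 1, 0).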
Disabled Compatibility Modes
| Constant | Value | Meaning |
|---|---|---|
CFRONT_2_1_OBJECT_CODE_COMPATIBILITY | 0 | No AT&T cfront 2.1 compat |
CFRONT_3_0_OBJECT_CODE_COMPATIBILITY | 0 | No AT&T cfront 3.0 compat |
CFRONT_GLOBAL_VS_MEMBER_NAME_LOOKUP_BUG | 0 | No cfront name lookup bug emulation |
DEFAULT_SUN_COMPATIBILITY | (undefined) | No Sun/Oracle compat |
CPPCLI_ENABLING_POSSIBLE | 0 | C++/CLI (managed C++) disabled |
CPPCX_ENABLING_POSSIBLE | 0 | C++/CX (WinRT extensions) disabled |
DEFAULT_UPC_MODE | 0 | Unified Parallel C disabled |
DEFAULT_EMBEDDED_C_ENABLED | 0 | Embedded C extensions disabled |
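Of the host-compiler compatibility modes, only GCC is enabled by default. Clang compatibility is off by default (DEFAULT_CLANG_COMPATIBILITY=0) but can be switched on with --clang, and Microsoft extensions are compiled out of this build (MICROSOFT_EXTENSIONS_ALLOWED=0). This is consistent with CUDA's host compiler support matrix: GCC and Clang on Linux, MSVC on Windows. The cfront, Sun, UPC, and embedded C modes are EDG capabilities that NVIDIA does not need.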
Target Platform Model
The TARG_* constants describe the target architecture's data model. Since cudafe++ is a source-to-source translator for the host side, these model x86-64 Linux.
Data Type Sizes (bytes)
| Type | Size | Alignment |
|---|---|---|
char | 1 | 1 |
short | 2 | 2 |
int | 4 | 4 |
long | 8 | 8 |
long long | 8 | 8 |
__int128 | 16 | 16 |
pointer | 8 | 8 |
float | 4 | 4 |
double | 8 | 8 |
long double | 16 | 16 |
__float80 | 16 | 16 |
__float128 | 16 | 16 |
ptr-to-data-member | 8 | 8 |
ptr-to-member-function | 16 | 8 |
ptr-to-virtual-base | 8 | 8 |
This is the standard LP64 data model (long and pointer are 64-bit). TARG_ALL_POINTERS_SAME_SIZE=1 confirms there are no near/far pointer distinctions.
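The size and alignment table is enough to compute plain struct layouts under this model. The sketch below applies the usual C rule (align each member, then pad the total to the largest member alignment). It is an illustration with hypothetical names, and it ignores bit-fields, packing attributes, and C++ class layout details.

```python
from typing import Dict, List, Tuple

# (size, alignment) in bytes, copied from the table above.
TARGET_TYPES: Dict[str, Tuple[int, int]] = {
    "char": (1, 1), "short": (2, 2), "int": (4, 4), "long": (8, 8),
    "long long": (8, 8), "__int128": (16, 16), "pointer": (8, 8),
    "float": (4, 4), "double": (8, 8), "long double": (16, 16),
}


def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def layout(members: List[str]) -> Tuple[List[int], int, int]:
    """Return (member offsets, total size, struct alignment)."""
    offset, struct_align, offsets = 0, 1, []
    for type_name in members:
        size, alignment = TARGET_TYPES[type_name]
        offset = align_up(offset, alignment)   # pad before the member
        offsets.append(offset)
        offset += size
        struct_align = max(struct_align, alignment)
    return offsets, align_up(offset, struct_align), struct_align
```

A struct of `char`, `long`, `int` lays out at offsets 0, 8, 16 with total size 24 and alignment 8 under this model.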
Key Target Properties
| Constant | Value | Meaning |
|---|---|---|
TARG_CHAR_BIT | 8 | 8 bits per byte |
TARG_HAS_SIGNED_CHARS | 1 | char is signed by default |
TARG_HAS_IEEE_FLOATING_POINT | 1 | IEEE 754 floating point |
TARG_SUPPORTS_X86_64 | 1 | x86-64 target support |
TARG_SUPPORTS_ARM64 | 0 | No ARM64 target support |
TARG_SUPPORTS_ARM32 | 0 | No ARM32 target support |
TARG_DEFAULT_NEW_ALIGNMENT | 16 | operator new returns 16-byte aligned |
TARG_IA64_ABI_USE_GUARD_ACQUIRE_RELEASE | 1 | Thread-safe static local init guards |
TARG_CASE_SENSITIVE_EXTERNAL_NAMES | 1 | Symbol names are case-sensitive |
TARG_EXTERNAL_NAMES_GET_UNDERSCORE_ADDED | 0 | No leading underscore on symbols |
The TARG_SUPPORTS_ARM64=0 and TARG_SUPPORTS_ARM32=0 confirm that this build of cudafe++ targets x86-64 Linux only. NVIDIA produces separate cudafe++ builds for other host platforms (ARM64 Linux, Windows).
Floating Point Model
| Constant | Value | Meaning |
|---|---|---|
FP_USE_EMULATION | 1 | Floating-point constant folding uses software emulation |
USE_SOFTFLOAT | 1 | Software floating-point library linked |
APPROXIMATE_QUADMATH | 1 | __float128 operations use approximate arithmetic |
USE_QUADMATH_LIBRARY | 0 | Not linked against libquadmath |
HOST_FP_VALUE_IS_128BIT | 1 | Host FP value representation uses 128 bits |
FP_LONG_DOUBLE_IS_80BIT_EXTENDED | 1 | long double is x87 80-bit extended precision |
FP_LONG_DOUBLE_IS_BINARY128 | 0 | long double is not IEEE binary128 |
FLOAT80_ENABLING_POSSIBLE | 1 | __float80 type can be enabled |
FLOAT128_ENABLING_POSSIBLE | 1 | __float128 type can be enabled |
The FP_USE_EMULATION=1 and USE_SOFTFLOAT=1 settings mean cudafe++ does not use the host CPU's floating-point unit for constant folding during compilation. Instead, it uses a software emulation library. This guarantees deterministic results regardless of the build machine's FPU behavior, rounding mode, or x87 precision settings. The APPROXIMATE_QUADMATH=1 indicates that __float128 constant folding uses an approximate (but portable) implementation rather than requiring libquadmath.
Memory and Host Configuration
| Constant | Value | Meaning |
|---|---|---|
USE_MMAP_FOR_MEMORY_REGIONS | 1 | IL memory regions use mmap |
USE_MMAP_FOR_MODULES | 1 | C++ module storage uses mmap |
HOST_ALLOCATION_INCREMENT | 65536 | Arena grows in 64 KB increments |
HOST_ALIGNMENT_REQUIRED | 8 | Host requires 8-byte alignment |
HOST_IL_ENTRY_PREFIX_ALIGNMENT | 8 | IL node prefix aligned to 8 bytes |
HOST_POINTER_ALIGNMENT | 8 | Pointer alignment on host platform |
USE_FIXED_ADDRESS_FOR_MMAP | 0 | No fixed mmap addresses |
NULL_POINTER_IS_ZERO | 1 | Null pointer has all-zero bit pattern |
The USE_MMAP_FOR_MEMORY_REGIONS=1 setting means the IL's region-based arena allocator uses mmap system calls (likely MAP_ANONYMOUS) rather than malloc. This gives EDG more control over memory layout and allows whole-region deallocation via munmap without fragmentation concerns. The 64 KB allocation increment (HOST_ALLOCATION_INCREMENT=65536) means each arena expansion maps a new 64 KB page-aligned chunk.
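The allocation strategy described above can be sketched as a small bump allocator. This is not the EDG allocator: the struct layout and the names arena_alloc, arena_grow, and arena_release are hypothetical, and the use of MAP_ANONYMOUS is the same assumption made in the paragraph above. Only the 64 KB increment and the 8-byte alignment come from the configuration constants.

```c
#define _DEFAULT_SOURCE            /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define ARENA_INCREMENT ((size_t)65536)  /* HOST_ALLOCATION_INCREMENT */
#define ARENA_ALIGNMENT ((size_t)8)      /* HOST_ALIGNMENT_REQUIRED */

/* Each mapped chunk starts with a link to the previously mapped chunk. */
struct chunk { struct chunk *prev; };
#define CHUNK_HEADER \
    ((sizeof(struct chunk) + ARENA_ALIGNMENT - 1) & ~(ARENA_ALIGNMENT - 1))

struct arena {
    struct chunk *head;  /* most recently mapped chunk, or NULL */
    size_t used;         /* bytes consumed in the head chunk */
};

/* Map one more 64 KB chunk and make it the head. Returns 0 on success. */
static int arena_grow(struct arena *a)
{
    void *p = mmap(NULL, ARENA_INCREMENT, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    struct chunk *c = p;
    c->prev = a->head;
    a->head = c;
    a->used = CHUNK_HEADER;
    return 0;
}

/* Bump-allocate n bytes rounded up to the required alignment.
   Requests larger than one chunk are out of scope for this sketch. */
static void *arena_alloc(struct arena *a, size_t n)
{
    if (n > ARENA_INCREMENT - CHUNK_HEADER)
        return NULL;
    size_t rounded = (n + ARENA_ALIGNMENT - 1) & ~(ARENA_ALIGNMENT - 1);
    if (a->head == NULL || a->used + rounded > ARENA_INCREMENT) {
        if (arena_grow(a) != 0)
            return NULL;
    }
    void *p = (unsigned char *)a->head + a->used;
    a->used += rounded;
    return p;
}

/* Whole-region deallocation: unmap every chunk, no per-object free. */
static void arena_release(struct arena *a)
{
    while (a->head != NULL) {
        struct chunk *prev = a->head->prev;
        munmap(a->head, ARENA_INCREMENT);
        a->head = prev;
    }
    a->used = 0;
}
```

The design point the sketch illustrates is that individual objects are never freed. Releasing a region is a loop of munmap calls over the chunk list, which is why fragmentation is not a concern.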
Code Generation Controls
These constants affect what the cp_gen_be backend emits.
| Constant | Value | Meaning |
|---|---|---|
GENERATE_SOURCE_SEQUENCE_LISTS | 1 | Source sequence lists (instantiation ordering) generated |
GENERATE_LINKAGE_SPEC_BLOCKS | 1 | extern "C" blocks preserved in output |
USING_DECLARATIONS_IN_GENERATED_CODE | 1 | using declarations appear in output |
GENERATE_EH_TABLES | 0 | No EH tables -- host compiler handles exceptions |
GENERATE_MICROSOFT_IF_EXISTS_ENTRIES | 0 | No __if_exists / __if_not_exists output |
SUPPRESS_ARRAY_STATIC_IN_GENERATED_CODE | 1 | static in array parameter declarations suppressed |
GCC_BUILTIN_VARARGS_IN_GENERATED_CODE | 0 | No GCC __builtin_va_* in output |
USE_HEX_FP_CONSTANTS_IN_GENERATED_CODE | 0 | No hex float literals in output |
ADD_BRACES_TO_AVOID_DANGLING_ELSE_IN_GENERATED_C | 0 | No extra braces for dangling else |
DOING_SOURCE_ANALYSIS | 1 | Source analysis mode (affects what is preserved) |
GENERATE_EH_TABLES=0 is significant: cudafe++ emits source code, so it generates no exception handling tables itself. The host compiler produces the actual EH tables when it compiles the transformed .int.c host file. Similarly, GCC_BUILTIN_VARARGS_IN_GENERATED_CODE=0 means the output uses standard <stdarg.h> varargs rather than GCC builtins, keeping the output compiler-portable.
Template and Instantiation Model
| Constant | Value | Meaning |
|---|---|---|
AUTOMATIC_TEMPLATE_INSTANTIATION | 0 | No automatic instantiation to separate files |
INSTANTIATION_BY_IMPLICIT_INCLUSION | 1 | Template definitions found via implicit include |
INSTANTIATE_TEMPLATES_EVERYWHERE_USED | 0 | Not every use triggers instantiation |
INSTANTIATE_EXTERN_INLINE | 0 | Extern inline templates not instantiated eagerly |
INSTANTIATE_INLINE_VARIABLES | 0 | Inline variables not instantiated eagerly |
INSTANTIATE_BEFORE_PCH_CREATION | 0 | No instantiation before PCH |
DEFAULT_INSTANTIATION_MODE | tim_none | No separate instantiation mode |
DEFAULT_MAX_PENDING_INSTANTIATIONS | 200 | Maximum pending instantiations per TU |
MAX_TOTAL_PENDING_INSTANTIATIONS | 256 | Hard cap on total pending |
MAX_UNUSED_ALL_MODE_INSTANTIATIONS | 200 | Limit on unused instantiation entries |
DEFAULT_MAX_DEPTH_CONSTEXPR_CALL | 256 | Maximum constexpr recursion depth |
DEFAULT_MAX_COST_CONSTEXPR_CALL | 2000000 | Maximum constexpr evaluation cost |
The AUTOMATIC_TEMPLATE_INSTANTIATION=0 and DEFAULT_INSTANTIATION_MODE=tim_none disable EDG's automatic template instantiation mechanism. This mechanism (where EDG writes instantiation requests to a file for later processing) is unnecessary because cudafe++ processes each translation unit in a single pass -- templates are instantiated inline as the parser encounters them, and the backend emits the instantiated code directly.
Feature Enablement Constants
The DEFAULT_* constants set the initial values of runtime-configurable features. These can be overridden by command-line flags, but they establish the baseline behavior when no flags are specified.
Enabled by Default
| Constant | Value | Feature |
|---|---|---|
DEFAULT_GNU_COMPATIBILITY | 1 | GCC compatibility mode |
DEFAULT_EXCEPTIONS_ENABLED | 1 | C++ exception handling |
DEFAULT_RTTI_ENABLED | 1 | Runtime type identification |
DEFAULT_BOOL_IS_KEYWORD | 1 | bool is a keyword (not a typedef) |
DEFAULT_WCHAR_T_IS_KEYWORD | 1 | wchar_t is a keyword |
DEFAULT_NAMESPACES_ENABLED | 1 | Namespaces are supported |
DEFAULT_ARG_DEPENDENT_LOOKUP | 1 | ADL (Koenig lookup) active |
DEFAULT_CLASS_NAME_INJECTION | 1 | Class name injected into its own scope |
DEFAULT_EXPLICIT_KEYWORD_ENABLED | 1 | explicit keyword recognized |
DEFAULT_EXTERN_INLINE_ALLOWED | 1 | extern inline permitted |
DEFAULT_IMPLICIT_NOEXCEPT_ENABLED | 1 | Implicit noexcept on dtors/deallocs |
DEFAULT_IMPLICIT_TYPENAME_ENABLED | 1 | typename implicit in dependent contexts |
DEFAULT_TYPE_TRAITS_HELPERS_ENABLED | 1 | Compiler intrinsic type traits |
DEFAULT_STRING_LITERALS_ARE_CONST | 1 | String literals have const type |
DEFAULT_TYPE_INFO_IN_NAMESPACE_STD | 1 | type_info in std:: |
DEFAULT_C_AND_CPP_FUNCTION_TYPES_ARE_DISTINCT | 1 | C and C++ function types differ |
DEFAULT_FRIEND_INJECTION | 1 | Friend declarations inject names |
DEFAULT_DISTINCT_TEMPLATE_SIGNATURES | 1 | Template signatures are distinct |
DEFAULT_ARRAY_NEW_AND_DELETE_ENABLED | 1 | operator new[] / operator delete[] |
DEFAULT_CPP11_DEPENDENT_NAME_PROCESSING | 1 | C++11-style dependent name processing |
DEFAULT_ENABLE_COLORIZED_DIAGNOSTICS | 1 | ANSI color in diagnostics |
DEFAULT_CHECK_FOR_BYTE_ORDER_MARK | 1 | UTF-8 BOM detection on |
DEFAULT_CHECK_PRINTF_SCANF_POSITIONAL_ARGS | 1 | printf/scanf format checking |
DEFAULT_ALWAYS_FOLD_CALLS_TO_BUILTIN_CONSTANT_P | 1 | __builtin_constant_p folded |
Disabled by Default (Require Explicit Enabling)
| Constant | Value | Feature |
|---|---|---|
DEFAULT_CPP_MODE | 199711 | Default language standard is C++98 |
DEFAULT_LAMBDAS_ENABLED | 0 | Lambdas off (enabled by C++ version selection) |
DEFAULT_RVALUE_REFERENCES_ENABLED | 0 | Rvalue refs off (enabled by C++ version) |
DEFAULT_VARIADIC_TEMPLATES_ENABLED | 0 | Variadic templates off (enabled by C++ version) |
DEFAULT_NULLPTR_ENABLED | 0 | nullptr off (enabled by C++ version) |
DEFAULT_RANGE_BASED_FOR_ENABLED | 0 | Range-for off (enabled by C++ version) |
DEFAULT_AUTO_TYPE_SPECIFIER_ENABLED | 0 | auto type deduction off (enabled by C++ version) |
DEFAULT_COMPOUND_LITERALS_ALLOWED | 0 | C99 compound literals off |
DEFAULT_DESIGNATORS_ALLOWED | 0 | C99/C++20 designated initializers off |
DEFAULT_C99_MODE | 0 | Not in C99 mode |
DEFAULT_VLA_ENABLED | 0 | Variable-length arrays off |
DEFAULT_CPP11_SFINAE_ENABLED | 0 | C++11 SFINAE rules off (enabled by C++ version) |
DEFAULT_MODULES_ENABLED | 0 | C++20 modules off |
DEFAULT_REFLECTION_ENABLED | 0 | C++ reflection off |
DEFAULT_MICROSOFT_COMPATIBILITY | 0 | MSVC compat off |
DEFAULT_CLANG_COMPATIBILITY | 0 | Clang compat off |
DEFAULT_BRIEF_DIAGNOSTICS | 0 | Full diagnostic output |
DEFAULT_DISPLAY_ERROR_NUMBER | 0 | Error numbers hidden |
DEFAULT_INCOGNITO | 0 | Not in incognito mode |
DEFAULT_REMOVE_UNNEEDED_ENTITIES | 0 | Dead code not removed |
The DEFAULT_CPP_MODE=199711 (C++98) looks surprising, but this is simply the EDG default. In practice, nvcc always passes an explicit --std=c++NN flag to cudafe++ that overrides this default, typically --std=c++17 in modern CUDA. The C++11/14/17/20 features listed as "disabled by default" are all enabled by the standard version selection code in proc_command_line.
Predefined Macro Constants
These constants control which macros cudafe++ automatically defines for the preprocessor.
| Constant | Value | Effect |
|---|---|---|
DEFINE_MACRO_WHEN_EXCEPTIONS_ENABLED | 1 | --exceptions causes #define __EXCEPTIONS |
DEFINE_MACRO_WHEN_RTTI_ENABLED | 1 | --rtti causes #define __RTTI |
DEFINE_MACRO_WHEN_BOOL_IS_KEYWORD | 1 | bool keyword causes #define _BOOL |
DEFINE_MACRO_WHEN_WCHAR_T_IS_KEYWORD | 1 | wchar_t keyword causes #define _WCHAR_T |
DEFINE_MACRO_WHEN_ARRAY_NEW_AND_DELETE_ENABLED | 1 | Causes #define __ARRAY_OPERATORS |
DEFINE_MACRO_WHEN_PLACEMENT_DELETE_ENABLED | 1 | Causes #define __PLACEMENT_DELETE |
DEFINE_MACRO_WHEN_VARIADIC_TEMPLATES_ENABLED | 1 | Causes #define __VARIADIC_TEMPLATES |
DEFINE_MACRO_WHEN_CHAR16_T_AND_CHAR32_T_ARE_KEYWORDS | 1 | Causes #define __CHAR16_T_AND_CHAR32_T |
DEFINE_MACRO_WHEN_LONG_LONG_IS_DISABLED | 1 | Causes #define __NO_LONG_LONG when long long is off |
DEFINE_FEATURE_TEST_MACRO_OPERATORS_IN_ALL_MODES | 1 | Feature test macros available in all modes |
MACRO_DEFINED_WHEN_IA64_ABI | "__EDG_IA64_ABI" | Always defined (since IA64_ABI=1) |
MACRO_DEFINED_WHEN_TYPE_TRAITS_HELPERS_ENABLED | "__EDG_TYPE_TRAITS_ENABLED" | Always defined (since type traits are on) |
These macros allow header files to conditionally compile based on which compiler features are active. They are part of EDG's mechanism for compatibility with GCC's predefined macro surface -- GCC defines __EXCEPTIONS when exceptions are on, so cudafe++ does the same.
Miscellaneous Constants
| Constant | Value | Meaning |
|---|---|---|
VERSION_NUMBER | "6.6" | EDG front end version |
VERSION_NUMBER_FOR_MACRO | 606 | Numeric form for __EDG_VERSION__ macro |
DIRECTORY_SEPARATOR | '/' | Unix path separator |
FILE_NAME_FOR_STDIN | "-" | Standard Unix convention for stdin |
OBJECT_FILE_SUFFIX | ".o" | Unix object file suffix |
PCH_FILE_SUFFIX | ".pch" | Precompiled header suffix |
PREDEFINED_MACRO_FILE_NAME | "predefined_macros.txt" | File with platform-defined macros |
DEFAULT_TMPDIR | "/tmp" | Default temp directory |
DEFAULT_USR_INCLUDE | "/usr/include" | Default system include path |
DEFAULT_EDG_BASE | "" | EDG base directory (empty = use argv[0] path) |
MAX_INCLUDE_FILES_OPEN_AT_ONCE | 8 | Limit on simultaneously open include files |
MODULE_MAX_LINE_NUMBER | 250000 | Maximum source lines per module |
COMPILE_MULTIPLE_SOURCE_FILES | 0 | One source file per invocation |
COMPILE_MULTIPLE_TRANSLATION_UNITS | 0 | One TU per invocation |
USING_DRIVER | 0 | Not integrated into a driver binary |
EDG_WIN32 | 0 | Not a Windows build |
WINDOWS_PATHS_ALLOWED | 0 | No backslash path separators |
VERSION_NUMBER="6.6" identifies this as EDG C/C++ front end version 6.6. VERSION_NUMBER_FOR_MACRO=606 becomes the __EDG_VERSION__ predefined macro, allowing header files to detect the exact EDG version (e.g., #if __EDG_VERSION__ >= 606).
The legacy configuration section at the bottom of the dump output reports LEGACY_TARGET_CONFIGURATION_NAME as NULL, meaning this build does not use a named legacy target configuration. In EDG's framework, named target configurations are used to preset constants for specific compilers (e.g., "gnu" or "microsoft"). NVIDIA's configuration is fully custom and does not map to any of EDG's predefined configurations.
Relationship Between Build Configuration and Runtime Flags
The build configuration constants and the runtime CLI flags form a two-layer system:
- Build-time constants (CHECKING=1, BACK_END_IS_CP_GEN_BE=1, IL_SHOULD_BE_WRITTEN_TO_FILE=0) determine what code paths exist in the binary. If IL_SHOULD_BE_WRITTEN_TO_FILE=0, the IL serialization code is not compiled in -- no runtime flag can enable it.
- DEFAULT_* constants set initial values for features that can be toggled at runtime. DEFAULT_EXCEPTIONS_ENABLED=1 means exceptions are on unless --no_exceptions is passed. These defaults are loaded by default_init (sub_45EB40) before command-line parsing.
- *_ENABLING_POSSIBLE constants gate whether a feature can be toggled at all. COROUTINE_ENABLING_POSSIBLE=1 means the --coroutines / --no_coroutines flag pair is registered. REFLECTION_ENABLING_POSSIBLE=0 means the reflection flag pair is not even registered -- the feature cannot be turned on.
This layering means the build configuration determines the binary's permanent capabilities, while the CLI flags select among the enabled possibilities.
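The layering can be modelled in a few lines of C. The flag table layout, the helper names, and the spelling "--reflection" are all hypothetical; only the three constant values and the --no_ prefix convention are taken from this page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Build-time layer: values as reported by the configuration dump. */
#define COROUTINE_ENABLING_POSSIBLE  1
#define REFLECTION_ENABLING_POSSIBLE 0
#define DEFAULT_EXCEPTIONS_ENABLED   1

/* Hypothetical flag record; the real table layout is not shown here. */
struct flag { const char *name; int *target; };

static int exceptions_enabled = DEFAULT_EXCEPTIONS_ENABLED;
static int coroutines_enabled;
#if REFLECTION_ENABLING_POSSIBLE
static int reflection_enabled;       /* not compiled in for this build */
#endif

static struct flag flags[8];
static size_t flag_count;

static void register_flag(const char *name, int *target)
{
    assert(flag_count < sizeof flags / sizeof flags[0]);
    flags[flag_count].name = name;
    flags[flag_count].target = target;
    flag_count++;
}

/* Registers only the flags whose feature the build permits. */
static void init_flags(void)
{
    flag_count = 0;
    register_flag("exceptions", &exceptions_enabled);
#if COROUTINE_ENABLING_POSSIBLE
    register_flag("coroutines", &coroutines_enabled);
#endif
#if REFLECTION_ENABLING_POSSIBLE
    register_flag("reflection", &reflection_enabled);
#endif
}

/* Runtime layer: "--name" sets a flag, "--no_name" clears it.
   Returns 0 on success, -1 if the flag was never registered. */
static int apply_flag(const char *arg)
{
    int value = 1;
    if (strncmp(arg, "--", 2) != 0)
        return -1;
    arg += 2;
    if (strncmp(arg, "no_", 3) == 0) {
        value = 0;
        arg += 3;
    }
    for (size_t i = 0; i < flag_count; i++) {
        if (strcmp(flags[i].name, arg) == 0) {
            *flags[i].target = value;
            return 0;
        }
    }
    return -1;
}
```

In this model, apply_flag("--no_exceptions") overrides a default, while apply_flag("--reflection") fails because the flag was never registered, mirroring the distinction between a default and a capability.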
Function Reference
| Function | Address | Lines | Role |
|---|---|---|---|
dump_configuration | sub_44CF30 | 785 | Print all 747 constants as #define statements |
default_init | sub_45EB40 | 470 | Initialize 350 config globals from DEFAULT_* values |
init_command_line_flags | sub_452010 | 3,849 | Register all CLI flags (gated by *_ENABLING_POSSIBLE) |
proc_command_line | sub_459630 | 4,105 | Parse flags and override DEFAULT_* settings |
Architecture Detection
cudafe++ determines the target GPU architecture through a five-stage pipeline: nvcc translates the user-facing --gpu-architecture=sm_XX flag into an internal numeric index, passes it to cudafe++ via --target, the CLI parser stores the index in a global, set_target_configuration configures over 100 type-system globals for that target, and the TU initializer copies the index into per-translation-unit state where it is read by feature gates throughout compilation. A parallel path, select_cp_gen_be_target_dialect, routes the backend to emit either device-side or host-side C++ based on a separate flag. This page documents the complete chain from nvcc invocation to the point where individual feature checks read the stored architecture value.
Key Facts
| Property | Value |
|---|---|
| Target index global | dword_126E4A8 (set by --target, CLI case 245) |
| Invalid sentinel | -1 (0xFFFFFFFF) |
| Error on invalid target | Error 2664: "invalid or no value specified with --nv_arch flag" |
| Target parser stub | sub_7525E0 (6 bytes, returns -1 unconditionally) |
| Configuration function | sub_7525F0 (set_target_configuration, target.c:299) |
| Type table initializer | sub_7515D0 (100+ globals, called from sub_7525F0) |
| Configuration validator | sub_7527B0 (check_target_configuration, target.c:512-659) |
| Field alignment initializer | sub_752DF0 (init_field_alignment_tables, target.c:825) |
| Dialect selector | sub_752A80 (select_cp_gen_be_target_dialect, target.c:736) |
| TU-level copy | dword_126EBF8 (target_configuration_index, set in sub_586240) |
| GPU mode flag | dword_126EFA8 (set by --gcc, case 182; gates dialect selection) |
| Device-side flag | dword_126EFA4 (set by --clang, case 187; selects device vs host output) |
The Full Propagation Chain
The architecture value flows through five distinct stages before it is available for feature gate checks. Each stage adds a layer of processing: parsing, validation, type model configuration, dialect routing, and per-TU state replication.
Stage 1: nvcc Stage 2: CLI parsing
--gpu-architecture=sm_90 ---> case 245 (--target)
translates to --target=<idx> sub_7525E0(<arg>) -> dword_126E4A8
if -1: error 2664, abort
|
v
Stage 3: Target init Stage 4: Dialect selection
sub_7525F0(idx) sub_752A80()
assert idx != -1 if dword_126EFA8 (GPU mode):
sub_7515D0() -> 100+ type globals if dword_126EFA4: device path
qword_126E1B0 = "lib" else: host path
sub_752DF0() -> alignment tables
sub_7527B0() -> validation
|
v
Stage 5: TU initialization
sub_586240()
dword_126EBF8 = dword_126E4A8 (per-TU copy)
version marker: "6.6\0"
timestamp copy
|
v
Feature checks throughout compilation
if (dword_126E4A8 < 70) { error("__grid_constant__ requires compute_70"); }
if (dword_126E4A8 < 80) { error("__nv_register_params__ requires compute_80"); }
...
Stage 1: nvcc Translates the Architecture
Users specify the GPU architecture through nvcc:
nvcc --gpu-architecture=sm_90 source.cu
nvcc translates this into an internal numeric index and passes it to cudafe++ as --target=<index>. The value stored in dword_126E4A8 is NOT a raw SM number like 90 -- it is an index into EDG's target configuration table. nvcc performs the mapping from user-facing strings (sm_90, compute_80, etc.) to this index. cudafe++ never sees the sm_XX string directly.
The --target flag is registered as CLI flag 253 with the internal case_id 245 in the flag table:
// From sub_452010 (init_command_line_flags)
sub_451F80(245, "target", 0, 1, 1, 1);
// ^id ^name ^no_short ^has_arg ^mode ^enabled
Stage 2: CLI Parsing (proc_command_line, case 245)
When proc_command_line (sub_459630) encounters --target, it dispatches to case 245:
// sub_459630, case 245 (decompiled)
case 245:
v80 = sub_7525E0(qword_E7FF28, v23, v20, v30);
dword_126E4A8 = v80; // store target index
if (v80 == -1) {
sub_4F8420(2664); // emit error 2664
// "invalid or no value specified with --nv_arch flag"
sub_4F2930("cmd_line.c", 12219,
"proc_command_line", 0, 0); // assert-fail
}
sub_7525F0(v80); // set_target_configuration
goto LABEL_136; // continue parsing
The error string references --nv_arch, which appears to be the nvcc-facing name for this flag. Internally cudafe++ processes it as --target (case 245). The likely reason for the discrepancy is that the error message is shared with nvcc's error reporting path.
The sub_7525E0 Stub
sub_7525E0 is the architecture parser function. In the CUDA Toolkit 13.0 binary, it is a 6-byte stub:
// sub_7525E0 -- 0x7525E0, 6 bytes
__int64 sub_7525E0()
{
return 0xFFFFFFFFLL; // always returns -1
}
; IDA disassembly
sub_7525E0:
mov eax, 0FFFFFFFFh
retn
This stub ignores its arguments and always returns the invalid sentinel -1. The call site (sub_7525E0(qword_E7FF28, v23, v20, v30)) passes four arguments, but none of them is read. Taken at face value, case 245 therefore always stores -1 in dword_126E4A8 and raises error 2664, so the decompiled code alone does not show how a valid architecture index reaches the global. Two explanations are possible, and neither is confirmed by the binary:
- The actual parsing is performed by nvcc, which passes the pre-resolved numeric index as the argument string, and sub_7525E0 was meant to convert it with strtol, with the body removed by link-time optimization. This does not by itself account for the constant -1 return.
- The function is a placeholder that is replaced at link time by a different object file when the toolchain is built.
In either case, the -1 return value is what triggers error 2664 at the call site.
Stage 3: set_target_configuration (sub_7525F0)
After the target index is stored, sub_7525F0 performs the post-parse initialization. This function lives in target.c:299:
// sub_7525F0 -- set_target_configuration
__int64 __fastcall sub_7525F0(int a1)
{
// Guard: passes only a1 == -1 and a1 == 0.
// -1 + 1 = 0, 0 > 1u is false (passes)
// 0 + 1 = 1, 1 > 1u is false (passes)
// 1 + 1 = 2, 2 > 1u is true (assert fires), likewise for any a1 >= 1
// -2 + 1 = -1, which is 0xFFFFFFFF as unsigned (assert fires),
// likewise for any a1 <= -2
if ((unsigned int)(a1 + 1) > 1)
assert_fail("target.c", 299, "set_target_configuration", 0, 0);
sub_7515D0(); // initialize type tables
qword_126E1B0 = "lib"; // library search path prefix
return -1; // return value unused
}
The unsigned comparison (unsigned)(a1 + 1) > 1u passes only the values 0 and -1 and rejects everything else, including positive indices. Because the -1 case is caught earlier by the error 2664 check in case 245, 0 is the only value that can pass both checks as decompiled. The guard is a sanity assertion rather than a functional check.
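The behaviour of the guard is easy to confirm in isolation. The helper below is a hypothetical stand-in for the decompiled condition. It casts before adding so that the arithmetic wraps in unsigned space, which gives the same result as the decompiled expression without relying on signed overflow.

```c
#include <assert.h>

/* Returns 1 when the guard in sub_7525F0 would trip the assertion,
   0 when the value passes. guard_fires is a hypothetical name. */
static int guard_fires(int a1)
{
    return ((unsigned int)a1 + 1u) > 1u;
}

/* Usage: only -1 and 0 get past the guard. */
static void demo_guard(void)
{
    assert(guard_fires(-1) == 0);
    assert(guard_fires(0) == 0);
    assert(guard_fires(1) == 1);
    assert(guard_fires(-2) == 1);
}
```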
Type Table Initialization (sub_7515D0)
sub_7515D0 is the core of Stage 3. It sets over 100 global variables that define the target platform's data model. These globals control how the EDG front end sizes types, computes alignments, and evaluates constant expressions. The function hardcodes an LP64 data model with CUDA-specific properties:
// sub_7515D0 -- target type initialization (complete decompilation)
__int64 sub_7515D0()
{
// === Integer type sizes (in bytes) ===
dword_126E338 = 4; // sizeof(int)
dword_126E328 = 8; // sizeof(long)
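dword_126E410 = 4; // sizeof(short) per cross-ref [tentative: short is 2 bytes in the data model table]
dword_126E420 = 2; // sizeof(wchar_t) [tentative: the validator reads wchar_t size from qword_126E488]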
// === Pointer properties ===
dword_126E2B8 = 8; // sizeof(pointer)
dword_126E2AC = 8; // alignof(pointer)
dword_126E4A0 = 8; // target bits-per-byte (CHAR_BIT)
dword_126E29C = 8; // sizeof(ptrdiff_t)
// === Floating-point properties ===
// float: 24-bit mantissa, exponent range [-125, 128]
dword_126E264 = 24; // float mantissa bits
dword_126E25C = 128; // float max exponent
dword_126E260 = -125; // float min exponent
// double: 53-bit mantissa, exponent range [-1021, 1024]
dword_126E258 = 53; // double mantissa bits
dword_126E250 = 1024; // double max exponent
dword_126E254 = -1021; // double min exponent
// long double: 16 bytes, same as __float128
dword_126E2FC = 16; // sizeof(long double)
dword_126E308 = 16; // alignof(long double)
// __float128: 113-bit mantissa, exponent range [-16381, 16384]
dword_126E234 = 113; // __float128 mantissa bits
dword_126E22C = 0x4000; // __float128 max exponent (16384)
dword_126E230 = -16381; // __float128 min exponent
// 80-bit extended (x87): 64-bit mantissa, same exponent range as __float128
dword_126E240 = 64; // x87 extended mantissa bits
dword_126E238 = 0x4000; // x87 extended max exponent
dword_126E23C = -16381; // x87 extended min exponent
dword_126E24C = 64; // another extended format (IBM double-double?)
dword_126E244 = 0x4000;
dword_126E248 = -16381;
// === Alignment properties ===
dword_126E400 = 8; // alignof(long long)
dword_126E3F0 = 8; // alignof(double)
dword_126E35C = 8; // alignof(long)
dword_126E3E0 = 16; // alignof(__int128) or max alignment
dword_126E318 = 16; // alignof(long double, repeated)
dword_126E278 = 16; // maximum natural alignment
// === Endianness and signedness ===
dword_126E4A4 = 1; // little-endian
dword_126E498 = 1; // char is signed
dword_126E368 = 1; // int is 2's complement
dword_126E384 = 1; // enum underlying type signed
// === Bit-field and struct layout ===
dword_126E3A8 = -1; // MSVC bit-field allocation mode (-1 = disabled)
dword_126E2A8 = 0; // no extra struct padding
dword_126E2F0 = 0; // field alignment override disabled
dword_126E398 = 0; // no special alignment for unnamed fields
dword_126E298 = 0; // no zero-length array as last field padding
dword_126E288 = 1; // field alloc order = declaration order
dword_126E294 = 1; // allow zero-sized objects
dword_126E28C = 1; // allow empty base optimization
// === ABI flags ===
dword_126E394 = 1; // ELF-style name mangling
dword_126E3AC = 1; // Itanium ABI compliance
dword_126E37C = 1; // EH table generation enabled
dword_126E3A0 = 0; // no Windows SEH
dword_126E36C = 1; // thunks for virtual calls
dword_126E380 = 1; // covariant return types
dword_126E39C = 0; // no RTTI incompatibility workaround
// === Integral type encoding (byte_126E4xx) ===
byte_126E431 = 0; // bool encoding index
byte_126E430 = 2; // char encoding index
byte_126E480 = 4; // char16_t encoding
byte_126E470 = 6; // char32_t encoding
byte_126E490 = 5; // wchar_t encoding
byte_126E481 = 6; // char8_t encoding
// === Size_t properties ===
byte_126E349 = 8; // size_t byte width indicator
qword_126E350 = -1; // SIZE_MAX (0xFFFFFFFFFFFFFFFF for 64-bit)
byte_126E348 = 7; // size_t type encoding index
// === String properties ===
dword_126E49C = 8; // host string char bit width
dword_126E1BC = 1; // feature flag (enabled)
dword_126E494 = 1; // null-terminated string assumption
// === Replicated size values (qword versions) ===
// These are 64-bit copies of the 32-bit size values above,
// used for 64-bit arithmetic in constant evaluation
qword_126E330 = 8; // sizeof(long) as int64
qword_126E340 = 4; // sizeof(int) as int64
qword_126E300 = 16; // sizeof(long double) as int64
qword_126E310 = 16; // alignof(long double) as int64
qword_126E418 = 4; // sizeof(short) as int64 [tentative: short is 2 bytes in the data model table]
qword_126E3E8 = 16; // sizeof(__int128) as int64
qword_126E408 = 8; // sizeof(long long) as int64
qword_126E320 = 16; // alignof(something 16B) as int64
qword_126E3F8 = 8; // alignof(double) as int64
qword_126E3D0 = 16; // sizeof(max int) as int64
qword_126E360 = 8; // sizeof(long) alignment as int64
qword_126E2C0 = 8; // sizeof(pointer) as int64
qword_126E2B0 = 16; // alignof(pointer, packed) as int64
qword_126E428 = 2; // sizeof(wchar_t) as int64 [tentative: the validator reads wchar_t size from qword_126E488]
qword_126E2A0 = 8; // sizeof(ptrdiff_t) as int64
// === Miscellaneous ===
qword_126E3B0 = 0; // no custom va_list
qword_126E3B8 = 0; // no custom va_list secondary
dword_126E3A4 = 0; // bit-field container sizing disabled
byte_126E2F6 = 4; // unnamed struct alignment
byte_126E2F5 = 4; // unnamed union alignment
byte_126E2F4 = 4; // default minimum alignment
byte_126E2F7 = 4; // stack alignment
byte_126E2F8 = 4; // thread-local alignment
byte_126E358 = 7; // size_t type kind encoding
dword_126E370 = 0; // padding/zero
dword_126E374 = 0;
dword_126E378 = 1; // 64-bit mode flag (LP64)
dword_126E290 = 0;
dword_126E388 = 0;
dword_126E38C = 0;
dword_126E390 = -1; // special marker
return -1; // return value unused by caller
}
The function establishes the LP64 data model: sizeof(int)=4, sizeof(long)=8, sizeof(pointer)=8. This matches the CUDA device code ABI where device pointers are 64-bit. The dword_126E378 = 1 flag explicitly marks this as 64-bit mode.
CLI Overrides for the Data Model
Three CLI flags write type-model globals that sub_7515D0 also sets:
Case 65 (--force-lp64): Records the flag and writes the LP64 values for four size and encoding globals (the pointer globals are not touched):
case 65:
dword_106C01C = 1; // force-lp64 flag recorded
qword_126E408 = 8; // sizeof(long long) = 8
dword_126E400 = 8; // alignof(long long) = 8
byte_126E349 = 8; // size_t = 8 bytes
byte_126E358 = 7; // size_t type encoding
Case 66 (--force-llp64): Selects the Windows-like LLP64 model, in which long is 32-bit while pointers stay 64-bit:
case 66:
dword_106C018 = 1; // force-llp64 flag recorded
qword_126E408 = 4; // 4-byte size (labelled sizeof(long long) above; under LLP64 the 4-byte type is long)
dword_126E400 = 4; // matching 4-byte alignment
byte_126E349 = 10; // size_t type encoding (8 under LP64)
byte_126E358 = 9; // size_t type encoding
Case 90 (--m32): Sets the complete 32-bit (ILP32) data model:
case 90:
dword_126E378 = 0; // 32-bit mode (not LP64)
qword_126E360 = 4; // sizeof(long) = 4
dword_126E35C = 4; // alignof(long) = 4
qword_126E350 = 0xFFFFFFFF; // SIZE_MAX = 32-bit
byte_126E349 = 6; // size_t = 4 bytes
byte_126E358 = 5; // size_t type encoding
qword_126E2C0 = 4; // sizeof(pointer) = 4
dword_126E2B8 = 4; // sizeof(pointer, dword) = 4
qword_126E2B0 = 8; // alignof(pointer, packed) = 8
dword_126E2AC = 4; // alignof(pointer) = 4
qword_126E2A0 = 4; // sizeof(ptrdiff_t) = 4
dword_126E29C = 4; // sizeof(ptrdiff_t, dword) = 4
qword_126E408 = 4; // 4-byte size (labelled sizeof(long long) above; under ILP32 the 4-byte type is long)
dword_126E400 = 4; // matching 4-byte alignment
byte_126E2F4 = 4; // default minimum alignment = 4
sub_7515D0 is called from sub_7525F0, which runs during case 245, so the outcome depends on processing order. If case 90 runs before case 245, the --m32 overrides are applied first and then overwritten by sub_7515D0's LP64 defaults. Every global written by case 90 above is also written by sub_7515D0, so in that order none of the --m32 values survive. The same holds for --force-lp64 and --force-llp64: their type globals (qword_126E408, dword_126E400, byte_126E349, byte_126E358) are overwritten, and only the recorded flag words (dword_106C01C, dword_106C018) persist. Conversely, an override processed after case 245 would take effect.
In practice, nvcc controls all of these flags coherently -- it does not pass conflicting combinations.
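A minimal model of this last-writer-wins behaviour follows. It assumes the parser applies each flag as it is encountered, and it tracks only two of the affected globals. All names are hypothetical stand-ins for the decompiled globals and cases.

```c
#include <assert.h>

static long long sizeof_pointer;   /* stands in for qword_126E2C0 */
static int lp64_mode;              /* stands in for dword_126E378 */

/* Models sub_7515D0: unconditional LP64 defaults. */
static void target_type_init(void)
{
    sizeof_pointer = 8;
    lp64_mode = 1;
}

/* Models case 90 (--m32). */
static void apply_m32(void)
{
    sizeof_pointer = 4;
    lp64_mode = 0;
}

enum flag_id { FLAG_M32 = 90, FLAG_TARGET = 245 };

/* Applies flags in the order given; later writes replace earlier ones. */
static void process_flags(const enum flag_id *flags, int n)
{
    for (int i = 0; i < n; i++) {
        switch (flags[i]) {
        case FLAG_M32:    apply_m32();        break;
        case FLAG_TARGET: target_type_init(); break;
        }
    }
}

/* Usage: --m32 followed by --target leaves the LP64 values in place. */
static void demo_order(void)
{
    const enum flag_id m32_then_target[] = { FLAG_M32, FLAG_TARGET };
    process_flags(m32_then_target, 2);
    assert(sizeof_pointer == 8 && lp64_mode == 1);
}
```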
Configuration Validation (sub_7527B0)
After sub_7515D0 sets the type tables, sub_752DF0 (init_field_alignment_tables) populates alignment lookup tables and then calls sub_7527B0 (check_target_configuration). This function validates the consistency of the configured type model:
// sub_7527B0 -- check_target_configuration (pseudocode summary)
void check_target_configuration()
{
// Validate char fits in 8 bytes
compute_type_size(0, &size, &precision);
if (size > 8) fatal("target char is too large");
// Validate wchar_t size
if (qword_126E488 > 8) fatal("target wchar_t is too large");
// Validate char16_t: must be unsigned, at least 16 bits
if (qword_126E478 > 8) fatal("target char16_t is too large");
if (dword_126E4A0 * qword_126E478 <= 15)
fatal("target char16_t is too small");
if (is_unsigned[byte_126E480] == 0)
assert_fail("target char16_t must be unsigned");
// Validate char32_t: must be unsigned, at least 32 bits
if (qword_126E468 > 8) fatal("target char32_t is too large");
if (dword_126E4A0 * qword_126E468 <= 31)
fatal("target char32_t is too small");
if (is_unsigned[byte_126E470] == 0)
assert_fail("target char32_t must be unsigned");
// Validate size_t range
compute_type_size(byte_126E349, &size, &precision);
size_bits = size * dword_126E4A0;
if (size_bits > 64) size_bits = 64;
if (qword_126E350 > max_for_bits(size_bits))
fatal("targ_size_t_max is too large");
// Validate largest integer type
if (qword_126E3D8 > 16) fatal("targ_sizeof_largest_integer too large");
if (qword_126E3D8 < qword_126E3F8)
fatal("invalid targ_sizeof_largest_integer");
// Validate INT_VALUE_PARTS
if (16 * dword_126E4A0 != 128)
fatal("invalid INT_VALUE_PARTS_PER_INTEGER_VALUE");
// Validate host string char
if (dword_126E49C > 8) fatal("targ_host_string_char_bit too large");
// Validate pack alignment
if (dword_126E284 < 1 || dword_126E284 > 255)
fatal("invalid targ_minimum_pack_alignment");
if (dword_126E284 > dword_126E280)
fatal("invalid targ_maximum_pack_alignment");
// Validate GNU IA-32 vector function integer sizes
if (qword_126E428 != 2 || qword_126E418 != 4 || qword_126E3F8 != 8)
assert_fail("invalid integer sizes for GNU IA-32 vector functions");
// Validate MSVC bit-field allocation
if (dword_126E3A4 && dword_126E3A8 != -1)
fatal("targ_microsoft_bit_field_allocation must be -1 "
"when targ_bit_field_container_size is TRUE");
// Validate field allocation order
if (!dword_126E3AC) assert_fail(...);
if (!dword_126E288)
fatal("targ_field_alloc_sequence_equals_decl_sequence must be TRUE");
// Validate host/target endianness match
if (dword_126E4A4 != dword_126EE40)
fatal("unexpected host/target endian mismatch");
// After validation, call dialect selector
select_cp_gen_be_target_dialect(); // sub_752A80
}
The validator confirms that the type model is internally consistent. Most of these are runtime sanity checks that should never fire with the hardcoded LP64 values from sub_7515D0, but they guard against corruption or misconfiguration if the type globals are modified by other code paths (such as --m32 or --force-llp64).
Notably, check_target_configuration calls select_cp_gen_be_target_dialect (sub_752A80) as its last action. This means dialect selection happens after all type model validation is complete.
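The character-type checks in the validator all follow the same pattern: a byte-size ceiling of 8 and a minimum bit width. A hedged sketch of that pattern, with hypothetical names and a status enum in place of the fatal-error calls:

```c
#include <assert.h>

enum width_status { WIDTH_OK, WIDTH_TOO_LARGE, WIDTH_TOO_SMALL };

/* char_bit   : bits per byte (dword_126E4A0 in the validator)
   size_bytes : configured size of the type in bytes
   min_bits   : 16 for char16_t, 32 for char32_t */
static enum width_status check_char_width(int char_bit,
                                          long long size_bytes,
                                          int min_bits)
{
    if (size_bytes > 8)
        return WIDTH_TOO_LARGE;
    if (char_bit * size_bytes < min_bits)   /* same test as "<= min_bits - 1" */
        return WIDTH_TOO_SMALL;
    return WIDTH_OK;
}

/* Usage: a 2-byte char16_t with 8-bit bytes is exactly wide enough. */
static void demo_width(void)
{
    assert(check_char_width(8, 2, 16) == WIDTH_OK);
    assert(check_char_width(8, 1, 16) == WIDTH_TOO_SMALL);
}
```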
Field Alignment Tables (sub_752DF0)
init_field_alignment_tables populates two alignment lookup tables at qword_12C7640 and qword_12C7680. These tables map integer type kinds to their struct field alignment requirements. The function only fills the tables when dword_126E2F0 (field alignment override) is nonzero; in the default CUDA configuration, this field is set to 0 by sub_7515D0, so the alignment tables remain at their initialized-to-zero state.
When the tables are populated, they read alignment values from the dword_126E2CC-dword_126E2F0 range (which sub_7515D0 leaves at zero), meaning the alignment tables are effectively disabled for CUDA targets. The function also copies qword_126E3E8 (sizeof largest integer type) into qword_126E3D8 before calling the configuration validator.
Stage 4: Dialect Selection (sub_752A80)
select_cp_gen_be_target_dialect determines whether the backend generates device-side or host-side C++ output. It is called from check_target_configuration (sub_7527B0) after all type model validation passes:
// sub_752A80 -- select_cp_gen_be_target_dialect (complete decompilation)
__int64 sub_752A80()
{
// Guard: no dialect should be set yet
if (dword_126E1F8 || dword_126E1D0 || dword_126E1FC || dword_126E1E8)
assert_fail("target.c", 736,
"select_cp_gen_be_target_dialect",
"Target dialect already set.", 0);
if (dword_126EFA8) { // GPU compilation mode enabled
dword_126E1DC = 1; // enable cp_gen backend
dword_126E1EC = 1; // enable backend output
if (dword_126EFA4) { // device-side compilation
dword_126E1E8 = 1; // set device target dialect
qword_126E1E0 = qword_126EF90; // copy Clang version
return qword_126EF90;
} else { // host-side compilation (stub generation)
dword_126E1F8 = 1; // set host target dialect
qword_126E1F0 = qword_126EF98; // copy GCC version
return qword_126EF98;
}
}
return result; // non-GPU mode: no dialect set
}
The guard at entry checks that no dialect has been previously set. This fires only if select_cp_gen_be_target_dialect is called twice, which is a programming error.
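The control flow above can be restated without globals. The following is a minimal sketch, not the decompiled code: the struct and enum names are hypothetical, each field only models the global named in its comment, and the entry guard is omitted.

```c
#include <stdint.h>

/* Hypothetical names; the binary only exposes dword_/qword_ globals. */
typedef enum { DIALECT_NONE, DIALECT_DEVICE, DIALECT_HOST } dialect_t;

typedef struct {
    int     gpu_mode;       /* models dword_126EFA8 */
    int     device_pass;    /* models dword_126EFA4 */
    int64_t clang_version;  /* models qword_126EF90 */
    int64_t gcc_version;    /* models qword_126EF98 */
} dialect_inputs;

typedef struct {
    dialect_t dialect;          /* which of dword_126E1E8 / dword_126E1F8 is set */
    int       backend_enabled;  /* models dword_126E1DC and dword_126E1EC */
    int64_t   version;          /* models qword_126E1E0 or qword_126E1F0 */
} dialect_result;

static dialect_result select_dialect(const dialect_inputs *in)
{
    dialect_result r = { DIALECT_NONE, 0, 0 };
    if (!in->gpu_mode)
        return r;                      /* non-GPU mode: nothing is set */
    r.backend_enabled = 1;
    if (in->device_pass) {
        r.dialect = DIALECT_DEVICE;
        r.version = in->clang_version;
    } else {
        r.dialect = DIALECT_HOST;
        r.version = in->gcc_version;
    }
    return r;
}
```

Returning a value instead of writing globals makes the three outcomes (device, host, none) easy to check in isolation.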
Dialect Global Roles
| Global | Role | Set When |
|---|---|---|
| dword_126EFA8 | GPU compilation mode active | --gcc flag (case 182) sets this to 1 |
| dword_126EFA4 | Device-side (vs host-side) compilation | --clang flag (case 187) sets this to 1 |
| dword_126E1DC | cp_gen backend enabled | GPU mode active |
| dword_126E1EC | Backend output enabled | GPU mode active |
| dword_126E1E8 | Device target dialect selected | Device-side compilation |
| dword_126E1F8 | Host target dialect selected | Host-side compilation |
| qword_126E1E0 | Device compiler version | Copied from qword_126EF90 (Clang version) |
| qword_126E1F0 | Host compiler version | Copied from qword_126EF98 (GCC version) |
The naming of dword_126EFA8 as "gcc mode" and dword_126EFA4 as "clang mode" is misleading. In CUDA compilation, dword_126EFA8 means "GPU compilation is active" (nvcc always passes --gcc) and dword_126EFA4 means "this is the device-side pass" (nvcc passes --clang for the device compilation pass, not for the host pass). The version numbers copied into qword_126E1E0 and qword_126E1F0 represent the host compiler's version for pragma compatibility, not the "Clang" or "GCC" version in any semantic sense.
Device vs Host Output Paths
cudafe++ is invoked twice per .cu file by nvcc:
- Device pass (dword_126EFA4 = 1): cudafe++ processes the CUDA source and emits the device-side IL that cicc lowers to PTX. The dialect is set to "device" (dword_126E1E8 = 1) and the version number comes from qword_126EF90.
- Host pass (dword_126EFA4 = 0): cudafe++ processes the same source and emits the host-side .int.c file with device stubs. The dialect is set to "host" (dword_126E1F8 = 1) and the version number comes from qword_126EF98.
The dialect selection determines which backend code paths execute during .int.c generation. Device-dialect mode generates PTX-compatible output; host-dialect mode generates host C++ with stub functions.
Stage 5: TU Initialization (sub_586240)
During translation unit initialization, sub_586240 copies the target index from the CLI-level global into per-TU state:
// sub_586240 -- fe_translation_unit_init_secondary (relevant excerpt)
if (dword_106BA08) { // is recompilation / secondary TU
// ... version marker and timestamp setup ...
v6 = allocate(4);
*(int32_t *)v6 = 3550774; // "6.6\0" -- EDG version marker
qword_126EB78 = v6; // store version string pointer
qword_126EB80 = strcpy(allocate(len), byte_106B5C0); // timestamp
dword_126EBF8 = dword_126E4A8; // CRITICAL: copy target index
}
The copy dword_126EBF8 = dword_126E4A8 replicates the architecture index into the translation unit's state block. Both globals contain the same value in single-TU compilation (which is the only mode CUDA uses). The dual-variable pattern exists because EDG's multi-TU architecture theoretically supports per-TU target configurations, but CUDA compilation always uses a single target per cudafe++ invocation.
After this point, feature checks throughout the compiler read either dword_126E4A8 (the CLI-level global) or dword_126EBF8 (the TU-level copy). Both contain the same integer target index.
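The integer 3550774 written through v6 can be checked by hand: it equals 0x00362E36, whose bytes in little-endian order are '6', '.', '6', and a terminator. A small sketch of that decoding follows; decode_marker and marker_matches are illustrative helper names, not functions from the binary.

```c
#include <stdint.h>
#include <string.h>

/* Unpack a 32-bit marker into its four bytes, least significant first,
   which is the in-memory order on a little-endian host such as x86-64.
   `out` must hold 5 bytes; a terminator is always appended. */
static void decode_marker(uint32_t marker, char out[5])
{
    for (int i = 0; i < 4; i++)
        out[i] = (char)((marker >> (8 * i)) & 0xFF);
    out[4] = '\0';
}

/* Returns nonzero when the decoded marker equals `version`. */
static int marker_matches(uint32_t marker, const char *version)
{
    char buf[5];
    decode_marker(marker, buf);
    return strcmp(buf, version) == 0;
}
```

Calling marker_matches(3550774u, "6.6") returns nonzero, consistent with the "6.6\0" comment in the excerpt.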
Feature Gate Mechanism
Individual features are gated by comparing dword_126E4A8 against threshold constants during semantic analysis. The pattern is consistent across all architecture-gated features:
// Pattern: hard error on unsupported architecture
if (dword_126E4A8 < THRESHOLD) {
emit_error(DIAGNOSTIC_ID, location);
// compilation continues or aborts depending on severity
}
Some features use a global flag that is set during target initialization rather than reading dword_126E4A8 directly. For example, __nv_register_params__ checks dword_106C028 (the "uumn" flag, set by CLI case 112) rather than comparing the architecture directly:
// sub_40B0A0 -- apply_nv_register_params_attr
if (!dword_106C028) { // feature not enabled
emit_error(7, 3659, location); // "not enabled" error
v3 = 0; // mark as invalid
}
The architecture check for __nv_register_params__ is separate -- it uses diagnostic tag register_params_unsupported_arch (requiring compute_80+), which is evaluated in a different code path from the enable flag check.
Feature Flag vs Direct Comparison
The distinction between feature-flag gating and direct SM comparison is:
- Direct comparison (dword_126E4A8 < N): Used for features where the threshold is baked into the comparison instruction. The threshold cannot be changed without recompiling cudafe++. Examples: __grid_constant__ (< 70), __managed__ (< 30), alloca() (< 52).
- Feature flag (dword_XXXXXX == 0): Used for features that can be enabled/disabled independently of the architecture. The flag is set by a CLI option, and the architecture is checked separately. Example: __nv_register_params__ uses dword_106C028 for the enable check and a separate comparison for the architecture check.
Both patterns ultimately depend on the target index value, but the feature-flag pattern adds an extra level of indirection that allows nvcc to control feature availability through CLI flags rather than relying solely on the architecture number.
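The two gating patterns can be put side by side in a short sketch. The thresholds are the ones quoted above; the enum and function names are hypothetical, and combining the enable-flag check and the architecture check in one function is a simplification, since the text places them in different code paths.

```c
#include <stdbool.h>

/* Thresholds quoted in the text; the names are illustrative only. */
enum {
    MIN_SM_MANAGED         = 30,
    MIN_SM_ALLOCA          = 52,
    MIN_SM_GRID_CONSTANT   = 70,
    MIN_SM_REGISTER_PARAMS = 80
};

/* Direct comparison: the gate fails when target_index < threshold. */
static bool arch_gate_passes(int target_index, int threshold)
{
    return target_index >= threshold;
}

/* Feature-flag pattern: the enable flag (modeling dword_106C028) is
   tested first, then the architecture is checked separately. */
static bool register_params_allowed(int enable_flag, int target_index)
{
    if (!enable_flag)
        return false;   /* corresponds to the error 3659 path */
    return arch_gate_passes(target_index, MIN_SM_REGISTER_PARAMS);
}
```

The flag-based variant can reject a feature even on a sufficiently new architecture, which is the extra control point the text attributes to nvcc.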
The --db Debug Mechanism
The --db flag (CLI case 37) activates EDG's internal debug tracing. While not directly part of the architecture detection chain, it shares adjacent globals (dword_126EFC8, dword_126EFCC) and can expose architecture checks as they execute.
The --db flag calls sub_48A390 (proc_debug_option, 238 lines, debug.c). On entry, it unconditionally enables tracing:
dword_126EFC8 = 1; // debug tracing enabled
If the argument is a bare integer, it sets the verbosity level:
if (first_char is digit) {
dword_126EFCC = strtol(arg, NULL, 10); // verbosity level
return 0;
}
Otherwise, it parses debug trace control entries (see Architecture Feature Gating for the full --db parsing grammar). After proc_debug_option returns, the CLI parser saves the verbosity level:
// proc_command_line, case 37
dword_106C2A0 = dword_126EFCC; // save verbosity level
At higher verbosity levels (5+), the compiler logs IL tree walking with messages like "Walking IL tree, entry kind = ...", which provides visibility into when architecture gate checks fire during semantic analysis.
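The bare-integer branch of proc_debug_option can be sketched as follows. This is a simplified model, assuming only the behavior described above: the struct, the function name, and the return convention are hypothetical, and the trace-entry grammar is not modeled.

```c
#include <ctype.h>
#include <stdlib.h>

typedef struct {
    int  tracing_enabled;  /* models dword_126EFC8 */
    long verbosity;        /* models dword_126EFCC */
} debug_state;

/* Returns 1 when `arg` was consumed as a bare integer verbosity level,
   0 when it would need trace-entry parsing (not modeled here).
   This return convention is this sketch's own, not the binary's. */
static int parse_db_arg(debug_state *st, const char *arg)
{
    st->tracing_enabled = 1;   /* unconditional on entry */
    if (isdigit((unsigned char)arg[0])) {
        st->verbosity = strtol(arg, NULL, 10);
        return 1;
    }
    return 0;
}
```

Note that tracing is enabled even when the argument is not a number, matching the unconditional store on entry.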
Complete Call Graph
main (sub_585EE0)
|
+-> proc_command_line (sub_459630)
| |
| +-> case 90 (--m32): set ILP32 type properties
| +-> case 65 (--force-lp64): set LP64 overrides
| +-> case 66 (--force-llp64): set LLP64 overrides
| +-> case 245 (--target):
| |
| +-> sub_7525E0(<arg>) // parse target index (stub: returns -1)
| +-> dword_126E4A8 = result // store target index
| +-> if -1: emit error 2664 // invalid target
| +-> sub_7525F0(result) // set_target_configuration
| |
| +-> sub_7515D0() // initialize 100+ type globals (LP64)
| +-> qword_126E1B0 = "lib" // library prefix
| |
| +-> [implicit via sub_752DF0]:
| +-> sub_752DF0() // init_field_alignment_tables
| +-> sub_7527B0() // check_target_configuration
| |
| +-> [20+ consistency checks]
| +-> sub_752A80() // select_cp_gen_be_target_dialect
| |
| +-> if GPU mode && device:
| | dword_126E1E8 = 1 (device dialect)
| | qword_126E1E0 = qword_126EF90
| +-> if GPU mode && host:
| dword_126E1F8 = 1 (host dialect)
| qword_126E1F0 = qword_126EF98
|
+-> fe_translation_unit_init (sub_586240)
|
+-> dword_126EBF8 = dword_126E4A8 // copy target index to TU state
+-> qword_126EB78 = "6.6\0" // EDG version marker
+-> qword_126EB80 = timestamp // compilation timestamp
[After TU init, feature checks read dword_126E4A8 or dword_126EBF8]
Global Variable Summary
Target Architecture State
| Address | Size | Name | Role |
|---|---|---|---|
| dword_126E4A8 | 4 | sm_architecture | Target index from --target. Sentinel: -1. |
| dword_126EBF8 | 4 | target_configuration_index | TU-level copy of dword_126E4A8. |
| dword_126E378 | 4 | is_64bit_mode | 1 = LP64 (64-bit), 0 = ILP32 (32-bit). |
| dword_126E4A4 | 4 | target_little_endian | 1 = little-endian. |
Type Model (Sizes, set by sub_7515D0)
| Address | Size | Name | LP64 Value |
|---|---|---|---|
| dword_126E338 / qword_126E340 | 4/8 | sizeof_int | 4 |
| dword_126E328 / qword_126E330 | 4/8 | sizeof_long | 8 |
| dword_126E2B8 / qword_126E2C0 | 4/8 | sizeof_pointer | 8 |
| dword_126E29C / qword_126E2A0 | 4/8 | sizeof_ptrdiff | 8 |
| dword_126E410 / qword_126E418 | 4/8 | sizeof_short | 4 |
| dword_126E400 / qword_126E408 | 4/8 | sizeof_long_long | 8 |
| dword_126E420 / qword_126E428 | 4/8 | sizeof_wchar | 2 |
| dword_126E2FC / qword_126E300 | 4/8 | sizeof_long_double | 16 |
| dword_126E258 | 4 | double_mantissa_bits | 53 |
| dword_126E264 | 4 | float_mantissa_bits | 24 |
| dword_126E234 | 4 | float128_mantissa_bits | 113 |
Type Model (Alignment, set by sub_7515D0)
| Address | Size | Name | LP64 Value |
|---|---|---|---|
| dword_126E2AC | 4 | alignof_pointer | 8 |
| dword_126E35C / qword_126E360 | 4/8 | alignof_long | 8 |
| dword_126E308 / qword_126E310 | 4/8 | alignof_long_double | 16 |
| dword_126E3F0 / qword_126E3F8 | 4/8 | alignof_double | 8 |
| dword_126E278 | 4 | max_natural_alignment | 16 |
| byte_126E2F4 | 1 | default_min_alignment | 4 |
Dialect Selection State
| Address | Size | Name | Role |
|---|---|---|---|
| dword_126EFA8 | 4 | gpu_mode_enabled | GPU compilation active (set by --gcc) |
| dword_126EFA4 | 4 | is_device_compilation | Device-side pass (set by --clang) |
| dword_126E1DC | 4 | cp_gen_enabled | cp_gen backend active |
| dword_126E1EC | 4 | backend_output_enabled | Backend output generation active |
| dword_126E1E8 | 4 | device_dialect_set | Device target dialect selected |
| dword_126E1F8 | 4 | host_dialect_set | Host target dialect selected |
| qword_126E1E0 | 8 | device_version | Clang version copied for device dialect |
| qword_126E1F0 | 8 | host_version | GCC version copied for host dialect |
| qword_126E1B0 | 8 | lib_prefix | Library search prefix, set to "lib" |
Feature Gate Globals
| Address | Size | Name | Role |
|---|---|---|---|
| dword_106C028 | 4 | nv_register_params_enabled | Enable flag for __nv_register_params__ (set by --uumn, case 112) |
Cross-References
- CLI Flag Inventory -- --target (case 245), --m32 (case 90), --force-lp64 (case 65), --force-llp64 (case 66) flag details
- Architecture Feature Gating -- SM version thresholds for CUDA features, host compiler version gating, --db debug mechanism
- EDG Build Configuration -- Compile-time constants controlling backend selection and IL configuration
- Pipeline Overview -- Where architecture detection fits in the compilation pipeline
- CLI Processing -- proc_command_line dispatcher and flag table mechanics
- Translation Unit Descriptor -- TU state block containing dword_126EBF8
- Global Variable Index -- Full address-level documentation of all globals
- Minor Attributes -- __nv_register_params__ attribute handler and dword_106C028 usage
Experimental and Version-Gated Flags
cudafe++ gates several categories of CUDA language features behind flags that nvcc manages automatically. Users interact with these through nvcc options like --expt-extended-lambda and --expt-relaxed-constexpr; nvcc translates these into the internal cudafe++ flags --extended-lambda and --relaxed_constexpr before invocation. A third category, C++ standard version gating, controls which language-level features affect the CUDA compilation pipeline. Two additional flags (--default-device, --no-device-int128/--no-device-float128) tune device code semantics without the "experimental" label.
This page documents the internal mechanism of each flag: the global variable it sets, every code path it unlocks, the diagnostics it suppresses or enables, and the compile-time cost of enabling it.
Flag Summary
| nvcc Flag | cudafe++ Internal Flag | Flag ID | Global Variable | Default | Effect |
|---|---|---|---|---|---|
| --expt-extended-lambda | --extended-lambda | 79* | dword_106BF38 | 0 | Enable entire extended lambda wrapper infrastructure |
| --expt-relaxed-constexpr | --relaxed_constexpr | 104 | dword_106BF40 | 0 | Allow constexpr cross-space calls |
| -std=c++NN | --c++NN / set_flag | -- | dword_126EF68 | 199711 | Gate C++ standard features |
| (JIT mode) | --default-device | ** | -- | 0 | Change unannotated default to __device__ |
| --no-device-int128 | --no-device-int128 | 52 | -- | 0 | Disable __int128 in device code |
| --no-device-float128 | --no-device-float128 | 53 | -- | 0 | Disable __float128/_Float128 in device code |
* The flag ID 79 for extended-lambda is tentative. Some reports assign slot 79 to disable_ext_lambda_cache, which is a separate flag. The case_id that parses extended-lambda lies in the CUDA-specific range 47--89, but the individual cases within the grouped 47--53 block are not fully disambiguated. The flag string "extended-lambda" is at binary address 0x836410, referenced from sub_452010 (init_command_line_flags).
** The --default-device flag is not in the standard numbered flag catalog (1--275). It is registered through one of the 7 inline-registered paired flags or the set_flag/clear_flag table (off_D47CE0). Its string literal appears in four JIT error messages in the binary.
--extended-lambda (dword_106BF38)
This is the single most impactful experimental flag in cudafe++. It enables the entire extended lambda subsystem -- approximately 40 functions in nv_transforms.c, 2,100 lines of lambda scanning in cmd_line.c, 17 steps of preamble text emission, and per-lambda wrapper generation in the backend. Without it, CUDA lambdas annotated with __device__ or __host__ __device__ are rejected outright.
What It Enables
When dword_106BF38 != 0, the following subsystems activate:
1. Lambda scanning (sub_447930, 2,113 lines)
The 7-phase scan_lambda function performs full CUDA validation on every lambda expression. Phase 4 checks all 35+ restriction categories documented in the restrictions page. Without the flag, phase 4 early-exits and emits error 3612 instead.
2. Preamble injection (sub_4864F0 + sub_6BCC20)
When the backend encounters a type declaration for the sentinel __nv_lambda_preheader_injection, three conditions must all be true for the preamble to fire:
// sub_4864F0 trigger conditions:
if ((entity_bits[-8] & 0x10) != 0 // marker bit set
&& dword_106BF38 != 0 // --extended-lambda enabled
&& name_matches_sentinel) // 30-byte name comparison
{
sub_6BCC20(emit_func); // emit ~10-50 KB of template text
}
The master emitter (sub_6BCC20) produces the complete lambda wrapper infrastructure as inline C++ text injected into the .int.c output. The 17-step emission sequence generates:
| Step | Output | Purpose |
|---|---|---|
| 1 | __NV_LAMBDA_WRAPPER_HELPER, __nvdl_remove_ref, __nvdl_remove_const | Utility macros and type traits |
| 2 | __nv_dl_tag | Device lambda tag type |
| 3 | Array capture helpers (dim 2--8) | N-dimensional array forwarding via sub_6BC290 |
| 4 | Primary __nv_dl_wrapper_t + zero-capture specialization | Device lambda wrapper template |
| 5 | __nv_dl_trailing_return_tag + zero-capture specialization | Trailing return type support |
| 6 | Device bitmap scan | One sub_6BB790 call per set bit in unk_1286980 |
| 7 | __nv_hdl_helper (anonymous namespace, 4 static function pointers) | Host-device lambda dispatch helper |
| 8 | Primary __nv_hdl_wrapper_t with static_assert | Host-device wrapper template |
| 9 | HD bitmap scan | Four calls per set bit in unk_1286900 (const x mutable x 2 helpers) |
| 10 | __nv_hdl_helper_trait_outer | Deduction helper traits |
| 11 | C++17 noexcept variants | Conditional on dword_126E270 (see C++ version gating) |
| 12 | __nv_hdl_create_wrapper_t | Factory for HD wrappers |
| 13 | __nv_lambda_trait_remove_const/volatile/cv | CV-qualifier removal traits |
| 14 | __nv_extended_device_lambda_trait_helper + detection macro | Device lambda type detection |
| 15 | __nv_lambda_trait_remove_dl_wrapper | Unwrapper trait |
| 16 | Trailing-return detection trait + macro | Type introspection |
| 17 | HD detection trait + macro | Host-device lambda type detection |
3. 1024-bit capture bitmaps
Two bitmaps track which capture counts have been observed during parsing:
| Bitmap | Address | Scope | Bits Used |
|---|---|---|---|
| Device | unk_1286980 | 128 bytes (1024 bits) | Bit N = capture count N seen in a __device__ lambda |
| Host-device | unk_1286900 | 128 bytes (1024 bits) | Bit N = capture count N seen in an HD lambda |
sub_6BCBF0 registers a capture count by setting the corresponding bit. sub_6BCBC0 resets both bitmaps to zero between translation units. The maximum representable capture count is 1023 (bit 0 is reserved for the primary template in the device path; the HD path uses bit 0). Error 3595 fires when capture count exceeds 1022 (v33 > 0x3FE).
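The bitmap operations described above reduce to a few lines. The sketch below is an illustration under stated assumptions, not the code of sub_6BCBF0 or sub_6BCBC0: the type and function names are hypothetical, the bit order within each byte is assumed, and the range check stands in for the error 3595 condition.

```c
#include <stdint.h>
#include <string.h>

#define BITMAP_BYTES 128    /* 1024 bits, as stated in the text */
#define MAX_CAPTURES 1022   /* counts above 0x3FE trigger error 3595 */

typedef struct { uint8_t bits[BITMAP_BYTES]; } capture_bitmap;

/* Clear every bit, as done between translation units. */
static void bitmap_reset(capture_bitmap *bm)
{
    memset(bm->bits, 0, sizeof bm->bits);
}

/* Record that a lambda with `count` captures was seen.
   Returns 0 on success, -1 when the count is out of range. */
static int bitmap_register(capture_bitmap *bm, unsigned count)
{
    if (count > MAX_CAPTURES)
        return -1;
    bm->bits[count / 8] |= (uint8_t)(1u << (count % 8));
    return 0;
}

/* Returns 1 when `count` has been registered, 0 otherwise. */
static int bitmap_test(const capture_bitmap *bm, unsigned count)
{
    if (count >= BITMAP_BYTES * 8)
        return 0;
    return (bm->bits[count / 8] >> (count % 8)) & 1;
}
```

Because the bitmap records only whether a count was seen, registering the same count twice has no further effect, which matches the later observation that cost scales with distinct capture counts.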
4. Per-lambda wrapper generation (sub_47B890, 336 lines)
During backend code generation, gen_lambda produces the per-lambda wrapper specialization for each extended lambda encountered. This runs in the gen_template dispatcher (sub_47ECC0).
5. Extended lambda capture type generation (sub_46E640, ~400 lines)
nv_gen_extended_lambda_capture_types generates explicit type declarations for captured variables, enabling the closure type to be serialized across host/device boundaries.
What Happens Without It
When dword_106BF38 == 0, any lambda with __host__ or __device__ annotations triggers error 3612:
error #20155-D: __host__ or __device__ annotation on lambda requires --extended-lambda nvcc flag
Additionally, the .int.c header emits hardcoded false macros (from sub_489000):
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
These definitions ensure that code using the detection macros compiles without error but reports that no extended lambdas exist.
Compile-Time Cost
Enabling --extended-lambda has measurable compile-time impact:
- Fixed overhead: ~10 KB of injected template text (steps 1--5, 7--8, 10--17) emitted for every translation unit, regardless of how many lambdas appear
- Variable per capture count: ~0.8 KB per distinct device lambda capture count, ~6 KB per distinct HD capture count (the HD path emits 4 specializations per bit: const non-mutable, const mutable, non-const non-mutable, non-const mutable)
- Typical TU with 3--5 distinct capture counts: 30--50 KB of additional .int.c text
- Template instantiation load: The wrapper templates use deep SFINAE patterns; the host compiler (gcc/clang/MSVC) must instantiate these for every extended lambda in the TU
- Lambda scanning: The 2,113-line scan_lambda function performs full restriction validation on every lambda expression, adding O(N) per-lambda overhead
The cost is proportional to the number of distinct capture counts, not the total number of lambdas. Two __device__ lambdas each capturing 3 variables share a single wrapper specialization.
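The size figures above combine into a simple linear estimate. The sketch below uses the approximate constants from the list and works in units of 0.1 KB to stay in integer arithmetic; the function name is hypothetical and the result is only as accurate as the quoted approximations.

```c
/* Approximate sizes from the text, in units of 0.1 KB. */
enum {
    FIXED_TENTHS_KB      = 100,  /* ~10 KB fixed preamble */
    PER_DEVICE_TENTHS_KB = 8,    /* ~0.8 KB per distinct device capture count */
    PER_HD_TENTHS_KB     = 60    /* ~6 KB per distinct HD capture count */
};

/* Estimated injected preamble size in tenths of a KB. Arguments are
   the numbers of DISTINCT capture counts, not the number of lambdas. */
static unsigned preamble_tenths_kb(unsigned device_counts, unsigned hd_counts)
{
    return FIXED_TENTHS_KB
         + PER_DEVICE_TENTHS_KB * device_counts
         + PER_HD_TENTHS_KB * hd_counts;
}
```

With five distinct HD capture counts and no device counts, the estimate is 400 tenths of a KB, that is 40 KB, which falls inside the 30--50 KB range quoted above.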
Related Error Codes
All 35+ extended lambda error codes (3590--3691) are documented in lambda/restrictions.md. Key errors specific to the flag gate:
| Error | Display | Tag | Condition |
|---|---|---|---|
| 3612 | 20155-D | extended_lambda_disallowed | Lambda has __host__/__device__ annotation but dword_106BF38 == 0 |
| 3595 | 20138-D | extended_lambda_too_many_captures | Capture count > 1022 (v33 > 0x3FE) |
| 3590 | 20133-D | extended_lambda_multiple_parent | Multiple __nv_parent pragmas |
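The three rows above share a constant difference of 16543 between the internal error number and the displayed number (20155 - 3612 = 20138 - 3595 = 20133 - 3590 = 16543). The sketch below encodes that observation; whether the same offset applies to diagnostics outside this table is an assumption, and display_number is a hypothetical helper rather than a function identified in the binary.

```c
/* Offset inferred from the three table rows only; it may not hold
   for other diagnostic ranges. */
enum { DISPLAY_OFFSET = 16543 };

/* Map an internal error number to the number shown with the -D suffix. */
static int display_number(int internal_error)
{
    return internal_error + DISPLAY_OFFSET;
}
```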
--expt-relaxed-constexpr (dword_106BF40)
This flag relaxes cross-execution-space calling rules for constexpr functions. Without it, a constexpr __device__ function cannot be called from a __host__ function and vice versa, even though constexpr functions are evaluated at compile time on the host regardless of their execution space annotation.
Flag Registration
Registered as flag ID 104 (relaxed_constexpr) in the CUDA-specific flag range. The --expt-relaxed-constexpr nvcc flag is translated to --relaxed_constexpr before passing to cudafe++. The flag sets dword_106BF40 to 1.
Note: Despite the W066 report labeling this global lambda_host_device_mode, the decompiled code shows it is checked in two distinct contexts: cross-space call validation (sub_505720) and extended lambda device qualification (sub_6BC680). The variable name reflects its role in relaxing constexpr constraints, not lambda-specific behavior. It affects lambda behavior only in the specific case of is_device_or_extended_device_lambda (see below).
What It Relaxes
The flag modifies behavior in two code paths:
1. Cross-space call checking (sub_505720)
In check_cross_execution_space_call, when the caller is a __device__-only function and the callee has bit 1 (mask 0x02) set in the byte at offset +177 (explicit __device__ annotation), the checker tests dword_106BF40:
// sub_505720, caller is __device__ or __global__, callee is constexpr __host__:
if ((callee[177] & 0x02) != 0) { // callee has explicit execution space
if (dword_106BF40) { // --expt-relaxed-constexpr
// skip error, allow the call
return;
}
}
Without the flag, this path falls through to emit one of the 6 constexpr-specific cross-space errors.
2. Device lambda qualification (sub_6BC680)
In is_device_or_extended_device_lambda, when an entity has the __device__ annotation (byte +177, mask 0x02) but NOT the extended lambda bit (byte +177, mask 0x04), the function returns dword_106BF40 != 0:
// sub_6BC680 (decompiled):
bool is_device_or_extended_device_lambda(entity* a1) {
if ((a1->byte_177 & 0x02) != 0) { // has __device__
if ((a1->byte_177 & 0x04) == 0) { // NOT extended lambda
return dword_106BF40 != 0; // relaxed constexpr allows it
}
return true;
}
return false;
}
This means --expt-relaxed-constexpr allows certain __device__ lambdas to be treated as extended device lambdas even without the --extended-lambda flag, but only in the specific context of device lambda type checking.
The 6 Error Messages It Suppresses
When dword_106BF40 == 0 and a constexpr function call crosses execution spaces, one of these 6 error messages is emitted. Each message explicitly suggests the flag as a workaround:
| # | Caller Space | Callee Space | Error Message |
|---|---|---|---|
| 1 | __host__ __device__ | constexpr __device__ | "calling a constexpr __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this." |
| 2 | __host__ | constexpr __device__ | "calling a constexpr __device__ function(%sq1) from a __host__ function(%sq2) is not allowed. ..." |
| 3 | __host__ __device__ | constexpr __host__ | "calling a constexpr __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. ..." |
| 4 | __host__ __device__ | constexpr __host__ | "calling a constexpr __host__ function from a __host__ __device__ function is not allowed. ..." (no entity names -- edge case for unresolved functions) |
| 5 | __device__ | constexpr __host__ | "calling a constexpr __host__ function(%sq1) from a __device__ function(%sq2) is not allowed. ..." |
| 6 | __global__ | constexpr __host__ | "calling a constexpr __host__ function(%sq1) from a __global__ function(%sq2) is not allowed. ..." |
The %sq1 and %sq2 format specifiers are cudafe++'s diagnostic format for qualified entity names (see diagnostics/format-specifiers.md).
Why It Is Experimental
The flag is labeled "experimental" because enabling it can produce silent runtime errors when:
- A constexpr function has different behavior on host vs device due to #ifdef __CUDA_ARCH__ guards or host/device-specific intrinsics. The compiler evaluates constexpr functions on the host during compilation, but with the flag enabled, a constexpr __device__ function might be evaluated on the host where __CUDA_ARCH__ is not defined, producing a different constant value than the programmer expects for device code.
- A constexpr __host__ function references host-only APIs (file I/O, system calls, host-specific math libraries). With relaxed constexpr, this function can be called from a __device__ context. If the call is not resolved at compile time (not actually evaluated as a constant expression), the linker or runtime will fail with an obscure error rather than the clear cudafe++ diagnostic.
- The relaxation applies globally -- there is no per-function opt-in. Once enabled, all constexpr cross-space calls are permitted, making it impossible to catch genuinely incorrect calls alongside intentionally relaxed ones.
The related diagnostic tag is cl_relaxed_constexpr_requires_bool (at binary address 0x853640), which indicates there was at some point a stricter validation that the flag's value must be boolean.
Interaction with Other Globals
The dword_106BF40 flag interacts with the cross-space checking infrastructure controlled by dword_106BFD0 (device_registration) and dword_106BFCC (constant_registration). When dword_106BF40 is set AND the current routine is in device scope ((+182 & 0x30) == 0x20) AND the routine has the __device__ annotation (+177 bit 1), the cross-space reference check in record_symbol_reference_full (sub_72A650/sub_72B510) skips the error entirely.
C++ Standard Version Gating (dword_126EF68)
The global variable dword_126EF68 holds the C++ (or C) standard version as an integer matching the __cplusplus or __STDC_VERSION__ predefined macro value. This is set during CLI parsing and controls feature gating throughout the frontend.
Version Values
| Standard | dword_126EF68 Value | nvcc Flag |
|---|---|---|
| C++98/03 | 199711 | -std=c++03 |
| C++11 | 201103 | -std=c++11 |
| C++14 | 201402 | -std=c++14 |
| C++17 | 201703 | -std=c++17 |
| C++20 | 202002 | -std=c++20 |
| C++23 | 202302 | -std=c++23 |
C standard values are also stored here when compiling C code:
| Standard | dword_126EF68 Value |
|---|---|
| K&R | (triggers set_c_mode(1) instead) |
| C89 | 198912 |
| C99 | 199901 |
| C11 | 201112 |
| C17 | 201710 |
| C23 | 202311 |
How Version Gating Works
Throughout the frontend, dword_126EF68 is compared against threshold values to enable or disable features. The comparison is always >= or > against the version number. Examples from the binary:
List initialization (sub_6D7DE0, overload.c): The 2,119-line list initialization function checks dword_126EF68 >= 201103 before enabling C++11 brace-enclosed initializer semantics.
Operator overloading (sub_6E7310, overload.c): Checks dword_126EF68 >= 201703 for C++17 features like class template argument deduction in operator resolution.
Preprocessor directives (sub_6FEDD0, preproc.c): Checks dword_126EF68 >= 202301 for #elifdef/#elifndef support (C++23 feature).
Byte ordering in .int.c output (sub_489000): Sets byte_10657F4 based on:
if (dword_126EFB4 == 2) // CUDA mode
byte_10657F4 = (dword_126EFB0 != 0);
else if (dword_126EF68 <= 199900) // pre-C99
byte_10657F4 = (dword_126EFB0 != 0);
else
byte_10657F4 = 1;
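The threshold comparisons described in this subsection all follow one shape. The sketch below is illustrative only: the enum constants are the version values from the table above, the thresholds are those quoted for list initialization and #elifdef support, and the function names are hypothetical.

```c
#include <stdbool.h>

/* Version values from the table; the names are illustrative. */
enum {
    STD_CXX98 = 199711, STD_CXX11 = 201103, STD_CXX14 = 201402,
    STD_CXX17 = 201703, STD_CXX20 = 202002, STD_CXX23 = 202302
};

/* Models the check quoted for sub_6D7DE0: dword_126EF68 >= 201103. */
static bool list_init_enabled(long std_version)
{
    return std_version >= 201103;
}

/* Models the check quoted for sub_6FEDD0: dword_126EF68 >= 202301. */
static bool elifdef_enabled(long std_version)
{
    return std_version >= 202301;
}
```

Because the gates use >= against a threshold, a threshold of 202301 admits the C++23 value 202302 while excluding C++20.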
C++17 noexcept-in-Type-System (dword_126E270)
A key version-gated feature for CUDA is dword_126E270, the C++17 "noexcept is part of the type system" flag. This global is set when dword_126EF68 >= 201703 and controls whether the lambda preamble injection (step 11 in sub_6BCC20) emits noexcept specializations of __nv_hdl_helper_trait_outer:
// sub_6BCC20, step 11:
if (dword_126E270) { // C++17 noexcept in type system
// Emit 2 additional trait specializations with NeverThrows=true
// for noexcept-qualified function types
emit_noexcept_trait_specialization(emit, /* const */ 0);
emit_noexcept_trait_specialization(emit, /* non-const */ 1);
}
// Closing }; of __nv_hdl_helper_trait_outer emitted unconditionally after
Without these specializations, C++17 code using noexcept lambdas in host-device contexts would fail to match the wrapper traits, producing template deduction failures.
Version Interactions with CUDA
The C++ standard version interacts with CUDA semantics in several ways:
- C++11 minimum: Most CUDA lambda features require >= 201103. Extended lambdas are only meaningful with C++11 lambda syntax.
- C++14 generic lambdas: Generic __device__ lambdas (with auto parameters) are gated on >= 201402.
- C++17 structured bindings and if constexpr: The extended lambda system interacts with if constexpr through restriction errors 3620/3621 (constexpr/consteval conflict in lambda operator()).
- C++20 concepts: The template variant of cross-space checking (sub_505B40) has a concept-context guard that checks dword_126C5C4 (nested class scope), which is only meaningful with C++20 concepts.
--default-device
This flag is specific to JIT (device-only) compilation mode and changes the default execution space for unannotated entities from __host__ to __device__.
Mechanism
When enabled, the execution-space assignment logic modifies entity+182 to receive the __device__ OR mask (0x23) instead of the implicit host default (0x00). For variables, entity+148 bit 0 (__device__ memory space) is set.
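The mask arithmetic can be checked against the device-scope test quoted earlier on this page ((+182 & 0x30) == 0x20). The sketch below is a simplified model: it treats entity+182 as a single byte, the helper names are hypothetical, and the meaning of the low two bits of 0x23 is not interpreted.

```c
#include <stdint.h>

enum {
    DEVICE_OR_MASK   = 0x23,  /* OR mask named in the text */
    SPACE_FIELD_MASK = 0x30,  /* field selected by the device-scope test */
    SPACE_DEVICE     = 0x20   /* value that the test compares against */
};

/* Apply the default execution space to an unannotated entity.
   With default_device == 0 the byte keeps its implicit host value. */
static uint8_t assign_default_space(uint8_t byte182, int default_device)
{
    return default_device ? (uint8_t)(byte182 | DEVICE_OR_MASK) : byte182;
}

/* The device-scope predicate: (byte & 0x30) == 0x20. */
static int in_device_scope(uint8_t byte182)
{
    return (byte182 & SPACE_FIELD_MASK) == SPACE_DEVICE;
}
```

Applying the 0x23 mask to the host default 0x00 yields a byte that satisfies the device-scope predicate, so the two descriptions are consistent with each other.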
JIT Mode Context
JIT mode activates when --gen_c_file_name (flag 45) is NOT provided -- there is no host output path, so the host backend never runs. This is the compilation mode used by NVRTC (the CUDA runtime compilation library) and the CUDA Driver API's runtime compilation facilities (cuModuleLoadData, cuLinkAddData).
Without --default-device, five JIT-specific diagnostics warn about unannotated entities:
| Diagnostic Tag | Message Summary |
|---|---|
| no_host_in_jit | Explicit __host__ not allowed in JIT mode (no --default-device suggestion) |
| unannotated_function_in_jit | Unannotated function considered host, not allowed in JIT |
| unannotated_variable_in_jit | Namespace-scope variable without memory space annotation |
| unannotated_static_data_member_in_jit | Non-const static data member considered host |
| host_closure_class_in_jit | Lambda closure class inferred as __host__ |
Four of the five messages explicitly suggest --default-device as a workaround. The exception is no_host_in_jit -- an explicit __host__ annotation cannot be overridden by a flag and requires a source code change.
The --default-device flag interacts with the extended lambda system (dword_106BF38): when both are active, namespace-scope lambda closure classes infer __device__ execution space instead of __host__, avoiding the host_closure_class_in_jit diagnostic.
See cuda/jit-mode.md for full JIT mode documentation.
--no-device-int128 / --no-device-float128
These two flags (IDs 52 and 53) disable 128-bit integer and floating-point types in device code respectively.
Registration
Both are registered in sub_452010 as no-argument mode flags in the CUDA-specific range:
| Flag | ID | Binary Address | Global Effect |
|---|---|---|---|
| no-device-int128 | 52 | 0x836133 | Disables __int128 type in device compilation |
| no-device-float128 | 53 | 0x836144 | Disables __float128/_Float128 in device compilation |
Purpose
The EDG frontend supports __int128 (keyword ID 239 in the builtin keyword table) and _Float128 (keyword ID 335) as extended types. In device code, these types may not be supported by all GPU architectures or may have different semantics than on the host.
The flags belong to the grouped CUDA boolean flags (cases 47--53 in proc_command_line), alongside host-stub-linkage-explicit, static-host-stub, device-hidden-visibility, no-hidden-visibility-on-unnamed-ns, and no-multiline-debug.
Type feature tracking uses byte_12C7AFC as a usage flags byte: bit 0 tracks specific integer subtypes (kinds 11, 12), bit 2 tracks float128/bfloat16 usage. The dword_106C070 global serves as the float128 feature flag, and dword_106C06C controls bfloat16.
NVRTC has specific support strings for both types in the binary (int128 NVRTC, float128 NVRTC), confirming that the JIT compilation path handles the presence or absence of these types explicitly.
Interaction Matrix
The experimental flags interact with each other and with version gating:
| Interaction | Behavior |
|---|---|
--extended-lambda + C++17 | Enables noexcept wrapper trait specializations (step 11 in preamble) via dword_126E270 |
--extended-lambda + --expt-relaxed-constexpr | A __device__ lambda without the extended-lambda bit is treated as extended if dword_106BF40 is set (via sub_6BC680) |
--extended-lambda + JIT mode | Lambda closure class execution space inference changes; --default-device affects namespace-scope lambda inference |
--expt-relaxed-constexpr + cross-space checking | Suppresses 6 specific constexpr cross-space errors; does NOT suppress the 6 non-constexpr variants |
--no-device-int128 / --no-device-float128 + NVRTC | NVRTC-specific support strings indicate that both flags are respected in JIT compilation |
| C++20 + cross-space checking | Concept context guard in sub_505B40 adds an additional bypass condition for template cross-space calls |
Global Variable Reference
| Address | Size | Semantic Name | Set By | Checked By |
|---|---|---|---|---|
dword_106BF38 | 4 | extended_lambda_mode | Flag 79* (--extended-lambda) | sub_4864F0 (trigger), sub_489000 (macros), sub_447930 (scan_lambda) |
dword_106BF40 | 4 | relaxed_constexpr_mode | Flag 104 (--relaxed_constexpr) | sub_505720 (cross-space call), sub_6BC680 (device lambda test), sub_72A650/sub_72B510 (symbol ref) |
dword_126EF68 | 4 | cpp_standard_version | CLI std selection | 28+ functions across all subsystems |
dword_126E270 | 4 | cpp17_noexcept_type | Post-parsing dialect resolution | sub_6BCC20 (preamble step 11) |
Function Reference
| Address | Lines | Identity | Source | Role |
|---|---|---|---|---|
sub_452010 | 3,849 | init_command_line_flags | cmd_line.c | Registers all 276 flags including experimental |
sub_459630 | 4,105 | proc_command_line | cmd_line.c | Parses flags, sets globals |
sub_447930 | 2,113 | scan_lambda | class_decl.c | Full lambda validation (uses dword_106BF38) |
sub_4864F0 | 751 | gen_type_decl | cp_gen_be.c | Preamble injection trigger (checks dword_106BF38) |
sub_6BCC20 | 244 | nv_emit_lambda_preamble | nv_transforms.c | Master preamble emitter (17 steps) |
sub_505720 | 147 | check_cross_execution_space_call | expr.c | Cross-space call checker (uses dword_106BF40) |
sub_505B40 | 92 | check_cross_space_call_in_template | expr.c | Template variant of cross-space checker |
sub_6BC680 | 16 | is_device_or_extended_device_lambda | nv_transforms.c | Device lambda test (uses dword_106BF40) |
sub_489000 | 723 | process_file_scope_entities | cp_gen_be.c | Backend entry; emits false macros when flag off |
sub_46E640 | ~400 | nv_gen_extended_lambda_capture_types | cp_gen_be.c | Capture type declarations for extended lambdas |
sub_6BCBF0 | 13 | nv_record_capture_count | nv_transforms.c | Bitmap bit-set for capture counts |
sub_6BCBC0 | ~10 | nv_reset_capture_bitmaps | nv_transforms.c | Reset both 1024-bit bitmaps |
Cross-References
- config/cli-flags.md -- complete flag catalog and registration protocol
- lambda/overview.md -- extended lambda pipeline architecture
- lambda/preamble-injection.md -- 17-step preamble emission detail
- lambda/restrictions.md -- all 35+ lambda restriction error codes
- cuda/cross-space-validation.md -- cross-space call checking and dword_106BF40 relaxation
- cuda/jit-mode.md -- JIT mode, --default-device flag, and NVRTC
- diagnostics/cuda-errors.md -- complete CUDA error catalog
EDG Source File Map
This page is the definitive reference table mapping all 52 .c source files and 13 .h header files from EDG 6.6 to their binary addresses in the cudafe++ CUDA 13.0 build. Every column is derived from the .rodata string cross-reference database and verified against the 20 sweep reports (P1.01 through P1.20).
For narrative discussion of these files and their roles in the compilation pipeline, see the Function Map and EDG Overview pages.
Build Path
All source files share the build prefix:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/
Coverage Summary
| Metric | Count |
|---|---|
.c files with mapped functions | 52 |
.h files with mapped functions | 13 |
| Total source files | 65 |
Functions mapped via .c paths | 2,129 |
Functions mapped via .h paths only | 80 |
| Total mapped functions | 2,209 |
Unmapped functions in EDG region (0x403300--0x7E0000) | ~2,896 |
C++ runtime / demangler (0x7E0000--0x829722) | ~1,085 |
PLT stubs + init (0x402A18--0x403300) | ~283 |
| Total functions in binary | ~6,483 |
| Mapping coverage | 34.1% |
The 34% mapping rate reflects the fact that only functions containing EDG internal_error assertions reference __FILE__ strings. Functions below the assertion threshold, display-only code compiled without assertions, inlined leaf functions, and the statically-linked C++ runtime are all invisible to this technique.
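The coverage figure can be re-derived from the table. The snippet below is a minimal sketch that only restates the counts listed in the Coverage Summary; the variable names are introduced here for illustration.

```python
# Counts copied from the Coverage Summary table.
mapped_via_c = 2129
mapped_via_h_only = 80
total_functions = 6483  # approximate, per the table

mapped = mapped_via_c + mapped_via_h_only
coverage = mapped / total_functions
print(f"{mapped} mapped functions, {coverage:.1%} coverage")
```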
Column Definitions
| Column | Meaning |
|---|---|
| # | Row number, ordered by main body start address |
| Source File | Filename from the EDG source tree |
| Origin | EDG = standard Edison Design Group code; NVIDIA = NVIDIA-authored |
| Total Funcs | Unique functions referencing this file's __FILE__ string (stubs + main) |
| Stubs | Assert wrapper functions in 0x403300--0x408B40 |
| Main Funcs | Functions in the main body region (after 0x409350) |
| Main Body Start | Lowest xref address outside the stub region |
| Main Body End | Highest xref address outside the stub region |
| Code Size | Main Body End - Main Body Start in bytes; approximate (includes interleaved .h inlines and alignment padding) |
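As a quick check of the Code Size definition, the following sketch subtracts the two xref bounds for one row. The helper name is hypothetical, and the kilobyte figure uses decimal thousands on the assumption that the region map rounds the same way.

```python
def code_size(main_body_start, main_body_end):
    """Code Size column: highest xref address minus lowest xref address."""
    return main_body_end - main_body_start


# attribute.c (row 1 of the .c table)
size = code_size(0x409350, 0x418F80)
print(f"{size:,} bytes (~{size / 1000:.0f} KB)")
```

A file with a single xref site (start equal to end) yields 0 under this definition, which is why such rows are shown as "<1 KB" rather than as a byte count.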
Source File Table -- 52 .c Files
Sorted by main body start address. This ordering reflects the binary layout, which is near-alphabetical with two exceptions noted below.
| # | Source File | Origin | Total Funcs | Stubs | Main Funcs | Main Body Start | Main Body End | Code Size |
|---|---|---|---|---|---|---|---|---|
| 1 | attribute.c | EDG | 177 | 7 | 170 | 0x409350 | 0x418F80 | 64,560 |
| 2 | class_decl.c | EDG | 273 | 9 | 264 | 0x419280 | 0x447930 | 190,160 |
| 3 | cmd_line.c | EDG | 44 | 1 | 43 | 0x44B250 | 0x459630 | 58,336 |
| 4 | const_ints.c | EDG | 4 | 1 | 3 | 0x461C20 | 0x4659A0 | 15,744 |
| 5 | cp_gen_be.c | EDG | 226 | 25 | 201 | 0x466F90 | 0x489000 | 139,376 |
| 6 | debug.c | EDG | 2 | 0 | 2 | 0x48A1B0 | 0x48A1B0 | <1 KB |
| 7 | decl_inits.c | EDG | 196 | 4 | 192 | 0x48B3F0 | 0x4A1540 | 90,448 |
| 8 | decl_spec.c | EDG | 88 | 3 | 85 | 0x4A1BF0 | 0x4B37F0 | 72,704 |
| 9 | declarator.c | EDG | 64 | 0 | 64 | 0x4B3970 | 0x4C00A0 | 50,480 |
| 10 | decls.c | EDG | 207 | 5 | 202 | 0x4C0910 | 0x4E8C40 | 164,656 |
| 11 | disambig.c | EDG | 5 | 1 | 4 | 0x4E9E70 | 0x4EC690 | 10,272 |
| 12 | error.c | EDG | 51 | 1 | 50 | 0x4EDCD0 | 0x4F8F80 | 45,744 |
| 13 | expr.c | EDG | 538 | 10 | 528 | 0x4F9870 | 0x5565E0 | 380,528 |
| 14 | exprutil.c | EDG | 299 | 13 | 286 | 0x558720 | 0x583540 | 175,648 |
| 15 | extasm.c | EDG | 7 | 0 | 7 | 0x584CA0 | 0x585850 | 2,992 |
| 16 | fe_init.c | EDG | 6 | 1 | 5 | 0x585B10 | 0x5863A0 | 2,192 |
| 17 | fe_wrapup.c | EDG | 2 | 0 | 2 | 0x588D40 | 0x588F90 | 592 |
| 18 | float_pt.c | EDG | 79 | 0 | 79 | 0x589550 | 0x594150 | 44,032 |
| 19 | folding.c | EDG | 139 | 9 | 130 | 0x594B30 | 0x5A4FD0 | 66,464 |
| 20 | func_def.c | EDG | 56 | 1 | 55 | 0x5A51B0 | 0x5AAB80 | 22,992 |
| 21 | host_envir.c | EDG | 19 | 2 | 17 | 0x5AD540 | 0x5B1E70 | 18,736 |
| 22 | il.c | EDG | 358 | 16 | 342 | 0x5B28F0 | 0x5DFAD0 | 184,800 |
| 23 | il_alloc.c | EDG | 38 | 1 | 37 | 0x5E0600 | 0x5E8300 | 31,488 |
| 24 | il_to_str.c | EDG | 83 | 1 | 82 | 0x5F7FD0 | 0x6039E0 | 47,632 |
| 25 | il_walk.c | EDG | 27 | 1 | 26 | 0x603FE0 | 0x620190 | 115,120 |
| 26 | interpret.c | EDG | 216 | 5 | 211 | 0x620CE0 | 0x65DE10 | 250,160 |
| 27 | layout.c | EDG | 21 | 2 | 19 | 0x65EA50 | 0x665A60 | 28,688 |
| 28 | lexical.c | EDG | 140 | 5 | 135 | 0x666720 | 0x689130 | 141,328 |
| 29 | literals.c | EDG | 21 | 0 | 21 | 0x68ACC0 | 0x68F2B0 | 17,904 |
| 30 | lookup.c | EDG | 71 | 2 | 69 | 0x68FAB0 | 0x69BE80 | 50,128 |
| 31 | lower_name.c | EDG | 179 | 11 | 168 | 0x69C980 | 0x6AB280 | 59,648 |
| 32 | macro.c | EDG | 43 | 1 | 42 | 0x6AB6E0 | 0x6B5C10 | 42,288 |
| 33 | mem_manage.c | EDG | 9 | 2 | 7 | 0x6B6DD0 | 0x6BA230 | 13,408 |
| 34 | nv_transforms.c | NVIDIA | 1 | 0 | 1 | 0x6BE300 | 0x6BE300 | ~22 KB [1] |
| 35 | overload.c | EDG | 284 | 3 | 281 | 0x6BE4A0 | 0x6EF7A0 | 201,472 |
| 36 | pch.c | EDG | 23 | 3 | 20 | 0x6F2790 | 0x6F5DA0 | 13,840 |
| 37 | pragma.c | EDG | 28 | 0 | 28 | 0x6F61B0 | 0x6F8320 | 8,560 |
| 38 | preproc.c | EDG | 10 | 0 | 10 | 0x6F9B00 | 0x6FC940 | 11,840 |
| 39 | scope_stk.c | EDG | 186 | 6 | 180 | 0x6FE160 | 0x7106B0 | 75,600 |
| 40 | src_seq.c | EDG | 57 | 1 | 56 | 0x710F10 | 0x718720 | 30,736 |
| 41 | statements.c | EDG | 83 | 1 | 82 | 0x719300 | 0x726A50 | 55,120 |
| 42 | symbol_ref.c | EDG | 42 | 2 | 40 | 0x726F20 | 0x72CEA0 | 24,448 |
| 43 | symbol_tbl.c | EDG | 175 | 8 | 167 | 0x72D950 | 0x74B8D0 | 122,688 |
| 44 | sys_predef.c | EDG | 35 | 1 | 34 | 0x74C690 | 0x751470 | 19,936 |
| 45 | target.c | EDG | 11 | 0 | 11 | 0x7525F0 | 0x752DF0 | 2,048 |
| 46 | templates.c | EDG | 455 | 12 | 443 | 0x7530C0 | 0x794D30 | 285,808 |
| 47 | trans_copy.c | EDG | 2 | 0 | 2 | 0x796BA0 | 0x796BA0 | <1 KB |
| 48 | trans_corresp.c | EDG | 88 | 6 | 82 | 0x796E60 | 0x7A3420 | 50,112 |
| 49 | trans_unit.c | EDG | 10 | 0 | 10 | 0x7A3BB0 | 0x7A4690 | 2,784 |
| 50 | types.c | EDG | 88 | 5 | 83 | 0x7A4940 | 0x7C02A0 | 112,480 |
| 51 | modules.c | EDG | 22 | 3 | 19 | 0x7C0C60 | 0x7C2560 | 6,400 |
| 52 | floating.c | EDG | 50 | 9 | 41 | 0x7D0EB0 | 0x7D59B0 | 19,200 |
| | TOTALS | | 5,338 | 198 | 5,140 | 0x409350 | 0x7D59B0 | ~3.57 MB |
Source File Table -- 13 .h Header Files
Header files appear in assertion strings when an inline function or macro defined in the header triggers an internal_error call. The function itself is compiled within the .c file's translation unit, but __FILE__ resolves to the header path. These functions are scattered across the binary, interleaved with the .c file that #include-d them.
| # | Header File | Total Funcs | Stubs | Main Funcs | Min Address | Max Address | Primary Host |
|---|---|---|---|---|---|---|---|
| 1 | decls.h | 1 | 0 | 1 | 0x4E08F0 | 0x4E08F0 | decls.c |
| 2 | float_type.h | 63 | 0 | 63 | 0x7D1C90 | 0x7DEB90 | floating.c |
| 3 | il.h | 5 | 2 | 3 | 0x52ABC0 | 0x6011F0 | expr.c, il.c, il_to_str.c |
| 4 | lexical.h | 1 | 0 | 1 | 0x68F2B0 | 0x68F2B0 | lexical.c / literals.c boundary |
| 5 | mem_manage.h | 4 | 0 | 4 | 0x4EDCD0 | 0x4EDCD0 | error.c |
| 6 | modules.h | 5 | 0 | 5 | 0x7C1100 | 0x7C2560 | modules.c |
| 7 | nv_transforms.h | 3 | 0 | 3 | 0x432280 | 0x719D20 | class_decl.c, cp_gen_be.c, src_seq.c |
| 8 | overload.h | 1 | 0 | 1 | 0x6C9E40 | 0x6C9E40 | overload.c |
| 9 | scope_stk.h | 4 | 0 | 4 | 0x503D90 | 0x574DD0 | expr.c, exprutil.c |
| 10 | symbol_tbl.h | 2 | 1 | 1 | 0x7377D0 | 0x7377D0 | symbol_tbl.c |
| 11 | types.h | 17 | 4 | 13 | 0x469260 | 0x7B05E0 | Many .c files (scattered type queries) |
| 12 | util.h | 124 | 10 | 114 | 0x430E10 | 0x7C2B10 | All major .c files |
| 13 | walk_entry.h | 51 | 0 | 51 | 0x604170 | 0x618660 | il_walk.c |
| | TOTALS | 281 | 17 | 264 | | | |
Header Distribution Patterns
The 13 headers fall into three distinct patterns:
Localized headers -- functions cluster in a single .c file's address range:
- float_type.h (63 funcs in 52 KB at 0x7D1C90--0x7DEB90, all within floating.c)
- walk_entry.h (51 funcs in 90 KB at 0x604170--0x618660, all within il_walk.c)
- modules.h (5 funcs in 5 KB at 0x7C1100--0x7C2560, all within modules.c)
- decls.h, lexical.h, overload.h, symbol_tbl.h (1--2 funcs each, single site)
- mem_manage.h (4 funcs, single site in error.c)
Moderately scattered headers -- functions appear in 2--3 .c files:
- il.h (5 funcs across expr.c, il.c, il_to_str.c)
- scope_stk.h (4 funcs across expr.c, exprutil.c)
- nv_transforms.h (3 funcs across class_decl.c, cp_gen_be.c, src_seq.c)
Pervasive headers -- functions inlined into most .c files:
- util.h (124 xrefs spanning 0x430E10--0x7C2B10, nearly the entire EDG region)
- types.h (17 funcs spanning 0x469260--0x7B05E0, scattered type queries)
Assert Stub Region
The region 0x403300--0x408B40 contains 198 small __noreturn functions. Each encodes a single assertion site: the source file path, line number, and enclosing function name. When the assertion condition fails, the stub calls sub_4F2930 (EDG's internal_error handler) and does not return. Every stub is 29 bytes.
Stub Distribution by Source File
| Source File | Stub Count | Source File | Stub Count |
|---|---|---|---|
cp_gen_be.c | 25 | macro.c | 1 |
il.c | 16 | mem_manage.c | 2 |
exprutil.c | 13 | modules.c | 3 |
templates.c | 12 | overload.c | 3 |
lower_name.c | 11 | pch.c | 3 |
expr.c | 10 | preproc.c | 0 |
class_decl.c | 9 | scope_stk.c | 6 |
folding.c | 9 | src_seq.c | 1 |
floating.c | 9 | statements.c | 1 |
attribute.c | 7 | symbol_ref.c | 2 |
symbol_tbl.c | 8 | symbol_tbl.h | 1 |
trans_corresp.c | 6 | sys_predef.c | 1 |
lexical.c | 5 | target.c | 0 |
decls.c | 5 | trans_copy.c | 0 |
types.c | 5 | trans_unit.c | 0 |
decl_inits.c | 4 | types.h | 4 |
interpret.c | 3 | util.h | 10 |
decl_spec.c | 3 | il.h | 2 |
host_envir.c | 2 | debug.c | 0 |
layout.c | 2 | extasm.c | 0 |
lookup.c | 2 | fe_wrapup.c | 0 |
cmd_line.c | 1 | float_pt.c | 0 |
const_ints.c | 1 | declarator.c | 0 |
disambig.c | 1 | pragma.c | 0 |
error.c | 1 | nv_transforms.c | 0 |
fe_init.c | 1 | literals.c | 0 |
func_def.c | 1 | ||
il_alloc.c | 1 | ||
il_to_str.c | 1 | ||
il_walk.c | 1 |
After the stubs, addresses 0x408B40--0x409350 contain 15 C++ static constructor functions (ctor_001 through ctor_015) that initialize global tables at program startup. These have no source file attribution.
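When triaging an unknown function address, it helps to bucket it into one of the regions named on this page first. The sketch below uses the boundaries quoted in the Coverage Summary and in this section; the labels and the helper name are informal and introduced here.

```python
# (start, end_exclusive, label) -- boundaries as quoted on this page.
REGIONS = [
    (0x402A18, 0x403300, "PLT stubs / init"),
    (0x403300, 0x408B40, "assert stubs"),
    (0x408B40, 0x409350, "static constructors"),
    (0x409350, 0x7E0000, "EDG main body"),
    (0x7E0000, 0x829722, "C++ runtime / demangler"),
]


def classify(addr):
    """Return the label of the region containing addr, if any."""
    for start, end, label in REGIONS:
        if start <= addr < end:
            return label
    return "outside .text"


print(classify(0x4F2930))  # sub_4F2930, the internal_error handler
```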
Gap Analysis -- Unmapped Regions
The following address ranges within the EDG .text region contain functions that could not be mapped to any source file via __FILE__ strings. Each gap represents functions that either lack assertions entirely, use non-EDG assertion macros, or are compiler-generated (vtable thunks, exception handlers, template instantiation artifacts).
| # | Gap Range | Size | Between | Probable Content |
|---|---|---|---|---|
| 1 | 0x408B40--0x409350 | 2 KB | stubs / attribute.c | Static constructors (ctor_001--ctor_015) |
| 2 | 0x447930--0x44B250 | 13 KB | class_decl.c / cmd_line.c | Boundary helpers, small inlines |
| 3 | 0x459630--0x461C20 | 34 KB | cmd_line.c / const_ints.c | Unmapped option handlers, flag tables |
| 4 | 0x4659A0--0x466F90 | 6 KB | const_ints.c / cp_gen_be.c | Constant integer helpers |
| 5 | 0x489000--0x48A1B0 | 5 KB | cp_gen_be.c / debug.c | Backend emission tail |
| 6 | 0x48A1B0--0x48B3F0 | 5 KB | debug.c / decl_inits.c | Debug infrastructure |
| 7 | 0x5E8300--0x5F7FD0 | 87 KB | il_alloc.c / il_to_str.c | IL display routines (no assertions) |
| 8 | 0x620190--0x620CE0 | 3 KB | il_walk.c / interpret.c | Walk epilogue |
| 9 | 0x65DE10--0x65EA50 | 3 KB | interpret.c / layout.c | Interpreter tail |
| 10 | 0x665A60--0x666720 | 3 KB | layout.c / lexical.c | Layout/lexer boundary |
| 11 | 0x689130--0x68ACC0 | 7 KB | lexical.c / literals.c | Token conversion helpers |
| 12 | 0x6AB280--0x6AB6E0 | 1 KB | lower_name.c / macro.c | Mangling helpers |
| 13 | 0x6BA230--0x6BAE70 | 3 KB | mem_manage.c / nv_transforms.c | Memory infrastructure |
| 14 | 0x6EF7A0--0x6F2790 | 12 KB | overload.c / pch.c | Overload resolution tail |
| 15 | 0x6FC940--0x6FE160 | 6 KB | preproc.c / scope_stk.c | Preprocessor tail functions |
| 16 | 0x751470--0x7525F0 | 7 KB | sys_predef.c / target.c | Predefined macro infrastructure |
| 17 | 0x7A4690--0x7A4940 | 1 KB | trans_unit.c / types.c | TU helpers |
| 18 | 0x7C2560--0x7D0EB0 | 59 KB | modules.c / floating.c | Type-name encoding, module helpers |
| 19 | 0x7D59B0--0x7DEB90 | 37 KB | floating.c tail | float_type.h template instantiations |
| 20 | 0x7DFFF0--0x82A000 | 304 KB | post-EDG | C++ runtime, demangler, soft-float, EH |
| | Total unmapped | ~582 KB | | |
The largest unmapped gap within EDG code proper is the IL display region at 0x5E8300--0x5F7FD0 (87 KB). These functions were compiled from il_to_str.c but contain no assertions because the display/dump subsystem was built without assertion macros -- it is purely diagnostic code that formats IL trees to stdout.
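The gap rows can be derived mechanically from the per-file ranges. The following is a minimal sketch over three adjacent rows of the .c table; the function name and tuple layout are assumptions made for illustration.

```python
# (file, main_body_start, main_body_end) for three adjacent table rows.
RANGES = [
    ("cmd_line.c",   0x44B250, 0x459630),
    ("const_ints.c", 0x461C20, 0x4659A0),
    ("cp_gen_be.c",  0x466F90, 0x489000),
]


def gaps(ranges):
    """Yield (prev_file, next_file, gap_start, gap_end, size) between neighbours."""
    ordered = sorted(ranges, key=lambda r: r[1])
    for (a, _, a_end), (b, b_start, _) in zip(ordered, ordered[1:]):
        if b_start > a_end:
            yield a, b, a_end, b_start, b_start - a_end


for a, b, lo, hi, size in gaps(RANGES):
    print(f"{lo:#x}--{hi:#x}  {size:>7,} bytes  {a} / {b}")
```

Because each range ends at the highest xref address rather than at the true end of the last function, a gap computed this way slightly overstates the unmapped span.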
Alphabetical Layout Observation
Source files are laid out in the binary in near-alphabetical order by filename, a consequence of the build system compiling .c files in directory-listing order and the linker processing them sequentially. The sequence is strictly alphabetical from attribute.c through types.c (rows 1--50).
Two files break this pattern:
| File | Expected Position | Actual Position | Offset |
|---|---|---|---|
modules.c | Between mem_manage.c and nv_transforms.c (#33--#34) | After types.c (#51, at 0x7C0C60) | +18 rows late |
floating.c | Between float_pt.c and folding.c (#18--#19) | After modules.c (#52, at 0x7D0EB0) | +34 rows late |
Both files appear after the main alphabetical sequence, placed at the very end of the EDG region. The most likely explanation is that modules.c and floating.c are compiled as separate translation units outside the main EDG build directory -- perhaps in a subdirectory or a secondary build target -- and are appended to the link line after the alphabetically-sorted main objects. The modules.c file implements C++20 module support (mostly stubs in the CUDA build), and floating.c implements arbitrary-precision IEEE 754 arithmetic -- both are semi-independent subsystems that could plausibly be compiled separately.
Note that the float_type.h functions (63 at 0x7D1C90--0x7DEB90) begin inside floating.c's address range and continue past its last xref, which is consistent with the two sharing a compilation unit.
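The alphabetical-order observation is easy to check programmatically. The sketch below lists the 52 .c files in binary layout order (extensions dropped) and reports any name that sorts before a file placed earlier; the helper name is hypothetical. Python compares strings by code point, so `_` sorts before lowercase letters, which matches the order seen in the binary.

```python
# Source files in binary layout order (the "#" column of the .c table).
LAYOUT = """attribute class_decl cmd_line const_ints cp_gen_be debug decl_inits
decl_spec declarator decls disambig error expr exprutil extasm fe_init fe_wrapup
float_pt folding func_def host_envir il il_alloc il_to_str il_walk interpret
layout lexical literals lookup lower_name macro mem_manage nv_transforms overload
pch pragma preproc scope_stk src_seq statements symbol_ref symbol_tbl sys_predef
target templates trans_copy trans_corresp trans_unit types modules floating""".split()


def out_of_order(names):
    """Return names that sort before some file placed earlier in the layout."""
    late, highest = [], ""
    for name in names:
        if name < highest:
            late.append(name)
        else:
            highest = name
    return late


print(out_of_order(LAYOUT))
```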
Binary Region Map
0x402A18 +--------------------------+
| PLT stubs / init (283) | 3 KB
0x403300 +--------------------------+
| Assert stubs (198) | 22 KB
0x408B40 +--------------------------+
| Constructors (15) | 2 KB
0x409350 +--------------------------+
| attribute.c | 65 KB
0x419280 | class_decl.c | 190 KB
| cmd_line.c | 58 KB
| const_ints.c | 16 KB
| cp_gen_be.c | 139 KB
| debug.c | <1 KB
| decl_inits.c | 90 KB
| decl_spec.c | 73 KB
| declarator.c | 50 KB
| decls.c | 165 KB
| disambig.c | 10 KB
| error.c | 46 KB
| expr.c | 381 KB
| exprutil.c | 176 KB
| extasm.c | 3 KB
| fe_init.c | 2 KB
| fe_wrapup.c | <1 KB
| float_pt.c | 44 KB
| folding.c | 66 KB
| func_def.c | 23 KB
| host_envir.c | 19 KB
| il.c | 185 KB
| il_alloc.c | 31 KB
| [il display gap] | 87 KB (unmapped)
| il_to_str.c | 48 KB
| il_walk.c | 115 KB
| interpret.c | 250 KB
| layout.c | 29 KB
| lexical.c | 141 KB
| literals.c | 18 KB
| lookup.c | 50 KB
| lower_name.c | 60 KB
| macro.c | 42 KB
| mem_manage.c | 13 KB
| nv_transforms.c | 22 KB
| overload.c | 201 KB
| pch.c | 14 KB
| pragma.c | 9 KB
| preproc.c | 12 KB
| scope_stk.c | 76 KB
| src_seq.c | 31 KB
| statements.c | 55 KB
| symbol_ref.c | 24 KB
| symbol_tbl.c | 123 KB
| sys_predef.c | 20 KB
| target.c | 2 KB
| templates.c | 286 KB
| trans_copy.c | <1 KB
| trans_corresp.c | 50 KB
| trans_unit.c | 3 KB
| types.c | 112 KB
| --- alphabetical break ---
| modules.c | 6 KB
| --- gap (59 KB) ---
0x7D0EB0 | floating.c | 19 KB
| float_type.h inlines | 52 KB
0x7DFFF0 +--------------------------+
| C++ runtime / demangler | 304 KB
0x82A000 +--------------------------+
Reproduction
To regenerate the source file list from the strings database:
jq '[.[] | select(.value | test("/dvs/p4/.*\\.[ch]$")) |
{file: (.value | split("/") | last),
xrefs: (.xrefs | length)}
] | group_by(.file) |
map({file: .[0].file,
total_xrefs: (map(.xrefs) | add)}) |
sort_by(.file)' cudafe++_strings.json
To extract address ranges per file:
import json
from collections import defaultdict

# Load the strings database exported from the disassembler.
with open('cudafe++_strings.json') as f:
    data = json.load(f)

# Map each source filename to the addresses that reference its __FILE__ string.
files = defaultdict(list)
for entry in data:
    val = entry.get('value', '')
    if '/dvs/p4/' not in val:
        continue
    if not (val.endswith('.c') or val.endswith('.h')):
        continue
    fname = val.split('/')[-1]
    for xref in entry.get('xrefs', []):
        files[fname].append(int(xref['from'], 16))

# Print the lowest and highest xref address for each file.
for fname in sorted(files):
    addrs = sorted(files[fname])
    print(f"{fname:25s} {hex(addrs[0]):>12s} - {hex(addrs[-1]):>12s}"
          f" ({len(addrs)} xrefs)")
[1] nv_transforms.c has only 1 function with an EDG-style __FILE__ reference, but sweep analysis confirms ~40 functions in the 0x6BAE70--0x6BE4A0 region (~22 KB). Most use NVIDIA's own assertion macros instead of EDG's internal_error path.
Global Variable Index
cudafe++ v13.0 uses more than 400 global variables spread across the .bss and .data segments. These variables fall into clear functional categories: compilation mode selectors, error/diagnostic state, I/O handles, CUDA-specific flags, translation unit management, scope tracking, IL allocation, lexer state, template instantiation, lambda transforms, and memory management. Every address listed below was confirmed through binary analysis of the x86-64 Linux ELF (sha256 6a69...). This page serves as the canonical cross-reference for all other wiki articles.
The variables cluster into three address regions: 0x106xxxx (NVIDIA-added configuration flags, typically set during CLI processing), 0x126xxxx (EDG core compiler state, used throughout parsing, IL generation, and code emission), and 0x12Cxxxx / 0x128xxxx (template instantiation, lambda transform, and arena allocator state). A few tables sit lower, at 0xE6xxxx--0xE8xxxx around the .data/.bss boundary (0xE7EFF0 / 0xE7F000), and the error message table off_88FAA0 lives in the read-only .rodata segment.
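Since IDA-style names embed the address, the owning segment of any global on this page can be looked up directly. The sketch below uses the bounds from the Segment Layout table; the helper name is an assumption, and .eh_frame is omitted for brevity.

```python
# Segment bounds copied from the Segment Layout table; end is exclusive.
SEGMENTS = [
    (".text",   0x403300, 0x829722),
    (".rodata", 0x829740, 0xAA3FA3),
    (".data",   0xD46480, 0xE7EFF0),
    (".bss",    0xE7F000, 0x12D6F20),
]


def segment_of(addr):
    """Return the segment name containing addr, or None if unmapped."""
    for name, start, end in SEGMENTS:
        if start <= addr < end:
            return name
    return None


for sym in ("dword_106BF38", "dword_126EFB4", "dword_E7C760", "off_88FAA0"):
    addr = int(sym.split("_", 1)[1], 16)  # hex address follows the prefix
    print(f"{sym:15s} {segment_of(addr)}")
```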
Compilation Mode and Language Standard
These globals control the fundamental compilation dialect -- C vs C++, which standard version, which vendor extensions are active, and whether the compiler is in CUDA mode.
| Address | Size | Name | Description |
|---|---|---|---|
dword_126EFB4 | 4 | language_mode | Master dialect selector. 1 = C, 2 = C++. Checked in virtually every subsystem. In some contexts (sweep report P1.12) it is interpreted as device_il_mode when the value is 2. |
dword_126EF68 | 4 | cpp_standard_version | __cplusplus value. 199711 = C++98, 201103 = C++11, 201402 = C++14, 201703 = C++17, 202002 = C++20, 202302 = C++23. For C mode: 199000 (pre-C99), 199901 (C99), 201112 (C11), 201710 (C17), 202311 (C23). |
dword_126EFAC | 4 | extended_features | EDG extended features / GNU compatibility mode flag. Also used as CUDA mode indicator in several paths. |
dword_126EFA8 | 4 | gcc_extensions | GCC extensions mode (1 = enabled). Also used as GPU compilation mode flag in device/host separation. |
dword_126EFA4 | 4 | clang_extensions | Clang extensions mode. Dual-use: also serves as device-code-mode flag during device/host separation (1 = compiling device side). |
dword_126EFB0 | 4 | gnu_extensions_enabled | GNU extensions active (set alongside dword_126EFA8). Also used as strict_c_mode and relaxed_constexpr in some paths. |
qword_126EF98 | 8 | gcc_version | GCC compatibility version, encoded as major*10000+minor*100+patch. Default 80100 (GCC 8.1.0). Compared as hex thresholds (e.g., 0x9E97 = 40599). |
qword_126EF90 | 8 | clang_version | Clang compatibility version. Default 90100. Used for feature gating (compared against 0x78B3, 0x15F8F, 0x1D4BF). |
qword_126EF78 | 8 | msvc_version | MSVC compatibility version. Default 1926. |
qword_126EF70 | 8 | version_threshold_max | Upper version bound. Default 99999. |
dword_126EF64 | 4 | cpp_extensions_enabled | C extension level (nonstandard extensions). |
dword_126EF80 | 4 | feature_flag_80 | Miscellaneous feature flag, default 1. |
dword_126EF48 | 4 | auto_parameter_mode | Auto parameter support flag (inverse of input). |
dword_126EF4C | 4 | auto_parameter_support | Auto-parameter enabled (C++20 auto function params). |
dword_126EEFC | 4 | digit_separators_enabled | C++14 digit separator (') support. |
dword_126EF0C | 4 | feature_flag_0C | Miscellaneous feature flag, default 1. |
dword_126E4A8 | 4 | sm_architecture | Target SM architecture version (set by --nv_arch / case 245). |
dword_126E498 | 4 | signed_chars | Whether plain char is signed. |
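The version encoding used by gcc_version makes the hex thresholds in the decompilation easier to read once decoded. The sketch below implements the major*10000+minor*100+patch scheme from the table; the function names are hypothetical, and applying the same scheme to the clang thresholds is an assumption based on the 90100 default.

```python
def encode_version(major, minor, patch=0):
    """Pack a version as major*10000 + minor*100 + patch."""
    return major * 10000 + minor * 100 + patch


def decode_version(value):
    """Unpack an encoded version into (major, minor, patch)."""
    major, rest = divmod(value, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


assert encode_version(8, 1, 0) == 80100          # default gcc_version
assert decode_version(0x9E97) == (4, 5, 99)      # 40599
for threshold in (0x78B3, 0x15F8F, 0x1D4BF):     # clang thresholds from the table
    print(hex(threshold), decode_version(threshold))
```

Thresholds ending in 99 presumably appear because a strict `>` comparison against, say, 40599 is equivalent to "version 4.6.0 or later".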
CUDA-Specific Flags
Flags controlling CUDA-specific behavior: device code generation, extended lambdas, relaxed constexpr, OptiX mode.
| Address | Size | Name | Description |
|---|---|---|---|
dword_1065850 | 4 | device_stub_mode | Device stub mode toggle. Toggled by expression dword_1065850 = (dword_1065850 == 0) in gen_routine_decl. 0 = forwarding body pass, 1 = static stub pass. |
dword_106BF38 | 4 | extended_lambda_mode | NVIDIA extended lambdas enabled (--expt-extended-lambda). Gates the lambda wrapper generation pipeline. |
dword_106BF40 | 4 | lambda_host_device_mode | Lambda host-device mode flag (the same global is listed as relaxed_constexpr_mode where the experimental flags are documented). Controls whether __device__ function references are allowed in host code. |
dword_106BF34 | 4 | lambda_validation_skip | Skip lambda validation checks. |
dword_106BFDC | 4 | skip_device_only | Skip device-only code generation. When clear, deferred function list accumulates at qword_1065840. |
dword_106BFF0 | 4 | relaxed_attribute_mode | NVIDIA relaxed override mode. Controls permissive __host__/__device__ attribute mismatch handling. Default 1 in CLI defaults. |
dword_106BFBC | 4 | whole_program_mode | Whole-program mode (affects deferred function list behavior). |
dword_106BFD0 | 4 | device_registration | Enable CUDA device registration / cross-space reference checking. |
dword_106BFCC | 4 | constant_registration | Enable CUDA constant registration / another cross-space check flag. |
dword_106BFB8 | 4 | emit_symbol_table | Emit symbol table in output. |
dword_106BF6C | 4 | alt_host_compiler_mode | Alternative host compiler mode. |
dword_106BF68 | 4 | host_compiler_flag | Host compiler attribute support flag. Also dword_106BF58. |
dword_106BDD8 | 4 | optix_mode | OptiX compilation mode flag. |
dword_106B670 | 4 | optix_kernel_index | OptiX kernel index (combined with dword_106BDD8 for error 3689). |
qword_106B678 | 8 | optix_kernel_table | OptiX kernel info table pointer. |
dword_106C2C0 | 4 | gpu_mode | GPU/device compilation mode. Controls reinterpret_cast semantics, pointer dereference, and keyword detection in device context. |
dword_106C1D8 | 4 | relaxed_constexpr_ptr | Controls pointer dereference in device constexpr (--expt-relaxed-constexpr related). |
dword_106C1E0 | 4 | device_typeid | Controls typeid availability in device constexpr context. |
dword_106C1F4 | 4 | device_class_lookup | CUDA device class member lookup flag. |
dword_E7C760 | 4[6] | exec_space_table | Execution space bitmask table (6 entries). a1 & dword_E7C760[a2] tests space compatibility. |
dword_106B640 | 4 | keep_in_il_active | Assertion guard: set to 1 before keep_in_il walk, cleared to 0 after. |
dword_E85700 | 4 | host_runtime_included | Flag: host_runtime.h already included in .int.c output. |
dword_126E270 | 4 | cpp17_noexcept_type | C++17 noexcept-in-type-system flag. Gates noexcept variant emission for lambda wrappers. |
qword_106BF80 | 8 | module_id_file | Pointer to the module-ID file path (for CRC32 calculation). Same global as module_id_file_path under I/O and File Management. |
qword_1065840 | 8 | deferred_function_list | Linked list of deferred functions (used when dword_106BFDC is clear). |
Error and Diagnostic State
The diagnostic subsystem uses a set of globals to track error/warning counts, severity thresholds, output format, and per-error suppression state.
| Address | Size | Name | Description |
|---|---|---|---|
qword_126ED90 | 8 | error_count | Total errors emitted. Also used as error-recovery-mode flag (nonzero = in recovery). |
qword_126ED98 | 8 | warning_count | Total warnings emitted. |
qword_126EDF0 | 8 | error_output_stream | FILE* for diagnostic output. Default stderr. Initialized during ctor_002. |
qword_126EDE8 | 8 | current_source_position | Current source position for error reporting. Mirrored from qword_1065810. |
qword_126ED60 | 8 | error_limit | Maximum error count before abort. |
byte_126ED69 | 1 | min_severity_threshold | Minimum severity for diagnostic output (default threshold). |
byte_126ED68 | 1 | error_promotion_threshold | Severity at or above which warnings become errors. |
dword_126ED40 | 4 | suppress_assertion_output | Suppress assertion output flag. |
dword_126ED48 | 4 | no_catastrophic_on_error | Disable catastrophic error on internal assertion. |
dword_126ED50 | 4 | no_caret_diagnostics | Disable caret (^) diagnostics. |
dword_126ED58 | 4 | max_context_lines | Maximum source context lines in diagnostics. |
dword_126ED78 | 4 | has_error_in_scope | Error occurred in current scope. |
dword_126ED44 | 4 | name_lookup_kind | Name lookup kind for diagnostic formatting. |
byte_126ED55 | 1 | device_severity_override | Default severity for device-mode diagnostics. |
byte_126ED56 | 1 | warning_level_control | Warning level control byte. |
dword_106BBB8 | 4 | output_format | Output format selector. 0 = plaintext, 1 = SARIF JSON. |
dword_106C088 | 4 | warnings_are_errors | Treat warnings as errors (-Werror equivalent). |
dword_126ECA0 | 4 | colorization_requested | Color output requested. |
dword_126ECA4 | 4 | colorization_active | Color output currently active (after TTY detection). |
off_88FAA0 | 8[3795] | error_message_table | Array of 3,795 const char* pointers indexed by error code. |
byte_1067920 | 1[3795] | default_severity_table | Default severity for each error code. |
byte_1067921 | 1[3795] | current_severity_table | Current (possibly pragma-modified) severity. |
byte_1067922 | 4[3795] | per_error_flags | Per-error tracking: bit 0 = first occurrence, other bits = suppression state. |
off_D481E0 | -- | label_fill_in_table | Diagnostic label fill-in table ({name, cond_index, default_index} entries). |
qword_106B488 | 8 | message_text_buffer | Growable message text buffer (initial 0x400 bytes via sub_6B98A0). |
qword_106B480 | 8 | location_prefix_buffer | Location prefix buffer (initial 0x80 bytes). |
qword_106B478 | 8 | sarif_json_buffer | SARIF JSON output buffer (initial 0x400 bytes). |
dword_106B470 | 4 | terminal_width | Terminal width for word wrapping. |
dword_106B4A0 | 4 | fill_in_alloc_count | Fill-in entry allocation counter. |
qword_106B490 | 8 | fill_in_free_list | Free list for 40-byte fill-in entries. |
dword_106B4B0 | 4 | catastrophic_error_guard | Re-entry guard for catastrophic error processing. |
dword_1065928 | 4 | assertion_reentry_guard | Re-entry guard for assertion handler. |
qword_1067860 | 8 | entity_formatter_callback | Entity name formatting callback (sub_5B29C0). |
qword_1067870 | 8 | entity_formatter_buffer | Entity formatter output buffer. |
byte_10678F1 | 1 | diag_is_c_mode | Diagnostic C mode flag (dword_126EFB4 == 1). |
byte_10678F4 | 1 | diag_is_pre_cpp11 | Diagnostic pre-C++11 flag. |
byte_10678FA | 1 | diag_name_lookup_kind | Name lookup kind for entity display. |
qword_106BCD8 | 8 | suppress_all_but_fatal | When set, suppress all errors except 992 (fatal). |
dword_106BCD4 | 4 | predefined_macro_file_mode | Predefined macro file mode (affects error case). |
qword_10658F8 | 8 | pragma_scratch_buffer | Scratch buffer for pragma bsearch operations. |
dword_106B4BC | 4 | werror_emitted_guard | Prevents recursion in warnings-as-errors emission. |
I/O and File Management
Globals controlling input/output filenames, streams, include paths, and preprocessor output.
| Address | Size | Name | Description |
|---|---|---|---|
qword_126EEE0 | 8 | input_filename | Current output/source filename (write-protected name). Compared against "-" for stdout mode. |
qword_106BF20 | 8 | output_filename_override | Output C file path (set by --gen_c_file_name / case 45). |
qword_106C040 | 8 | output_filename_alt | Alternative output filename (used in signoff). |
qword_106C280 | 8 | output_file | FILE* for .int.c output (stdout or file). |
qword_126EE98 | 8 | include_path_list | Include search path linked list head. |
qword_126F100 | 8 | include_path_free_list | Free list for recycled search path nodes. |
qword_126F0E8 | 8 | path_normalize_buffer | Growable buffer for path normalization (0x100 initial). |
dword_126EE58 | 4 | backslash_as_separator | Backslash as path separator (Windows mode). |
dword_126EE54 | 4 | windows_drive_letter | Recognize Windows drive-letter paths. |
dword_126EEE8 | 4 | bom_detection_enabled | Byte-order mark detection enabled. |
dword_126F110 | 4 | once_guard | One-time initialization guard for source file processing. |
qword_126F0C0 | 8 | cached_module_id | Cached module ID string (CRC32-based). |
qword_106BF80 | 8 | module_id_file_path | Module-ID file path for external ID override. |
qword_106C038 | 8 | options_hash_input | Command-line options hash input for module ID. |
qword_106C248 | 8 | macro_alias_map | Hash table: macro define/alias mappings. |
qword_106C240 | 8 | include_path_map | Include path list for CLI processing. |
qword_106C238 | 8 | sys_include_map | System include path map. |
qword_106C228 | 8 | sys_include_map_2 | Additional system include map. |
dword_106C29C | 4 | preprocess_mode | Preprocessing-only mode (1 = active). Set by CLI cases 3 and 4. |
dword_106C294 | 4 | no_line_commands | Suppress #line directives in output. |
dword_106C288 | 4 | preprocess_output_mode | Preprocess output: 0 = suppress, 1 = emit preprocessed text. |
dword_106C254 | 4 | skip_backend | Skip backend code generation entirely. |
Scope Stack
The scope stack is an array of 784-byte entries at qword_126C5E8, indexed by dword_126C5E4. It tracks the nested scope hierarchy (file, namespace, class, function, block, template).
| Address | Size | Name | Description |
|---|---|---|---|
qword_126C5E8 | 8 | scope_table_base | Base pointer to scope stack array. Each entry is 784 bytes. |
dword_126C5E4 | 4 | current_scope_index | Current top-of-stack index. |
dword_126C5DC | 4 | saved_scope_index | Saved scope index (for enum processing, lambda nesting). |
dword_126C5D8 | 4 | function_scope_index | Enclosing function scope index (-1 if none). |
dword_126C5C8 | 4 | template_scope_index | Template scope index (-1 if not in template). |
dword_126C5C4 | 4 | class_scope_index | Class/nested-class scope index (-1 if none). Also used as friend_scope_index in some paths. |
dword_126C5BC | 4 | lambda_body_flag | Lambda body processing flag / template declaration flag. |
dword_126C5B8 | 4 | class_nesting_depth | Class nesting depth / is_member_of_template flag. |
dword_126C5B4 | 4 | block_scope_counter | Block scope counter / namespace scope parameter. |
dword_126C5AC | 4 | saved_depth_template | Saved scope depth for template instantiation restore. |
dword_126C5E0 | 4 | scope_hash | Scope hash/identifier. |
dword_126C5A4 | 4 | nesting_scope_index | Nesting scope index. |
dword_126C5A0 | 4 | scope_misc_flag | Miscellaneous scope flag. |
dword_126C5C0 | 4 | instantiation_scope_index | Instantiation scope index. |
qword_126C5D0 | 8 | current_routine_ptr | Current enclosing function/routine descriptor pointer. Used for execution space checks (offset +32 -> byte +177 bit 2 for device, byte +182 & 0x30 for space mask). |
qword_126C598 | 8 | pack_expansion_context | Pack expansion context pointer (C++17). |
qword_126C590 | 8 | symbol_hash_table | Robin Hood hash table for symbol lookup within scope. |
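
The indexing described above reduces to plain pointer arithmetic. The following is a minimal sketch, assuming entry 0 sits directly at the base pointer; the function name and the example base value are hypothetical and do not come from the binary.

```python
SCOPE_ENTRY_SIZE = 784  # bytes per scope stack entry, as stated above


def scope_entry_address(scope_table_base: int, scope_index: int) -> int:
    """Return the address of scope stack entry `scope_index`.

    `scope_table_base` stands for the runtime value of qword_126C5E8.
    Assumption: entry 0 is located at the base pointer itself.
    """
    if scope_index < 0:
        # The index globals use -1 to mean "no such scope".
        raise ValueError("negative scope index means no scope")
    return scope_table_base + SCOPE_ENTRY_SIZE * scope_index


# Hypothetical base value; the real one is a heap pointer read at runtime.
base = 0x7F0000000000
current = scope_entry_address(base, 3)  # as if dword_126C5E4 == 3
parent = scope_entry_address(base, 2)
```

Consecutive entries are exactly 784 bytes apart, so `current - parent` evaluates to 784 in this example.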
Lexer and Token State
The lexer maintains its current token, source position, and preprocessor state in these globals.
| Address | Size | Name | Description |
|---|---|---|---|
word_126DD58 | 2 | current_token | Current token kind (357 possible values). Key values: 7 = identifier, 33 = comma, 55 = semicolon, 56 = =, 67 = equals, 73 = CUDA token, 76 = *, 142 = __attribute__, 161 = this, 187 = requires clause. |
qword_126DD38 | 8 | token_source_position | Source position of current token. |
qword_126DD48 | 8 | token_text_ptr | Pointer to current identifier/literal text. |
dword_126DF90 | 4 | token_flags_1 | Token flags / current declaration counter. |
dword_126DF8C | 4 | token_flags_2 | Secondary token flags. |
qword_126DF80 | 8 | token_extra_data | Token extra data pointer. |
dword_126DB74 | 4 | has_cached_tokens | Cached token state flag. |
dword_126DB58 | 4 | digit_separator_seen | C++14 digit separator seen during number scanning. |
qword_126DDA0 | 8 | input_position | Current position in input buffer. |
qword_126DDD8 | 8 | input_buffer_base | Input buffer base address. |
qword_126DDD0 | 8 | input_buffer_end | Input buffer end address. |
dword_126DDA8 | 4 | line_counter | Current line number in input. |
dword_126DDBC | 4 | source_line_number | Source line number (for #line directive tracking). |
qword_126DD80 | 8 | active_macro_chain | Active macro expansion chain head. |
qword_126DD60 | 8 | macro_expansion_marker | Macro expansion position marker. |
dword_126DD30 | 4 | in_directive_flag | Currently processing preprocessor directive. |
qword_126DD18 | 8 | current_macro_node | Current macro being expanded. |
qword_126DD70 | 8 | macro_tracking_1 | Macro position tracking state. |
qword_126DDE0 | 8 | macro_tracking_2 | Secondary macro tracking state. |
qword_126DDF0 | 8 | file_stack | Include file stack (for #include nesting). |
dword_126DDE8 | 4 | preproc_state_1 | Preprocessor state variable. |
dword_126E49C | 4 | preproc_state_2 | Preprocessor state variable. |
qword_126DB40 | 8 | lexical_state_stack | Lexical state save/restore stack (linked list of 80-byte nodes). |
qword_126DB48 | 8 | stop_token_table | Stop token table: 357 entries at offset +8, indexed by token kind. |
qword_126DD98 | 8 | raw_string_state | Raw string literal tracking state. |
dword_126EF00 | 4 | raw_string_flag | Raw string literal processing flag. |
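qword_126DDD8 | 8 | raw_string_base | Raw string buffer base. Same global as input_buffer_base above, listed here in its second role. |
qword_126DDD0 | 8 | raw_string_end | Raw string buffer end. Same global as input_buffer_end above, listed here in its second role. |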
Preprocessor and Macro System
| Address | Size | Name | Description |
|---|---|---|---|
qword_1270140 | 8 | macro_definition_chain | Macro definition chain head. |
qword_1270148 | 8 | free_token_list | Free list for recycled token nodes. |
qword_1270150 | 8 | cached_token_list | Cached token list head (for rescan). |
qword_1270128 | 8 | reusable_cache_stack | Reusable macro cache stack. |
qword_106B8A0 | 8 | pending_macro_arg | Pending macro argument pointer. |
dword_106B718 | 4 | suppress_pragma_mode | Suppress pragma processing mode. |
dword_106B720 | 4 | preprocessing_mode | Preprocessor-only mode active. |
dword_106B6EC | 4 | line_numbering_state | Line numbering state for #line output. |
qword_106B740 | 8 | pragma_binding_table | Pragma binding table (0x158 bytes initial). |
qword_106B730 | 8 | pragma_alloc_pool_1 | Pragma allocation pool. |
qword_106B738 | 8 | pragma_alloc_pool_2 | Pragma allocation pool (secondary). |
qword_106B890 | 8 | pragma_name_hash_1 | Pragma name hash table. |
qword_106B8A8 | 8 | pragma_name_hash_2 | Pragma name hash table (secondary). |
off_E6CDE0 | -- | pragma_id_table | Pragma ID-to-name mapping table. |
byte_126E558 | 1 | stdc_cx_limited_range | #pragma STDC CX_LIMITED_RANGE state. Default 3. |
byte_126E559 | 1 | stdc_fenv_access | #pragma STDC FENV_ACCESS state. Default 3. |
byte_126E55A | 1 | stdc_fp_contract | #pragma STDC FP_CONTRACT state. Default 3. |
dword_126EE48 | 4 | macro_expansion_tracking | Macro expansion tracking / secondary IL enabled flag. Set to 1 during init-complete. Also controls shareable-constants feature. |
Translation Unit State
These globals track the current translation unit, TU list, and per-TU save/restore mechanism.
| Address | Size | Name | Description |
|---|---|---|---|
qword_106BA10 | 8 | current_tu | Pointer to current translation unit descriptor (424 bytes). |
qword_106B9F0 | 8 | primary_tu | Pointer to first (primary) translation unit. |
qword_12C7A90 | 8 | tu_chain_tail | Tail of translation unit linked list. |
qword_106BA18 | 8 | tu_stack | Translation unit stack (for nested TU processing). |
dword_106B9E8 | 4 | tu_stack_depth | TU stack depth (excluding primary). |
dword_106BA08 | 4 | is_recompilation | Recompilation / secondary-TU flag: 0 = primary TU, 1 = secondary TU. Affects IL entity flag bits. |
qword_106BA00 | 8 | current_filename | Current filename string pointer. |
dword_106B9F8 | 4 | has_module_info | TU has module information. |
qword_12C7A98 | 8 | per_tu_storage_size | Total per-TU variable buffer size. |
qword_12C7AA8 | 8 | registered_var_list_head | Registered per-TU variable list head. |
qword_12C7AA0 | 8 | registered_var_list_tail | Registered per-TU variable list tail. |
qword_12C7AB8 | 8 | stack_entry_free_list | TU stack entry free list. |
qword_12C7AB0 | 8 | corresp_free_list | TU correspondence structure free list. |
dword_12C7A8C | 4 | registration_complete | Variable registration complete flag. |
dword_12C7A88 | 4 | has_seen_module_tu | Has seen a module TU. |
qword_12C7A70 | 8 | corresp_count | TU correspondence allocation counter. |
qword_12C7A78 | 8 | tu_count | Translation unit allocation counter. |
qword_12C7A80 | 8 | stack_entry_count | Stack entry allocation counter. |
qword_12C7A68 | 8 | registration_count | Variable registration allocation counter. |
IL (Intermediate Language) State
The IL subsystem uses arena-allocated regions for entities. Two primary regions exist: file-scope and function-scope.
| Address | Size | Name | Description |
|---|---|---|---|
dword_126EC90 | 4 | file_scope_region_id | File-scope IL region ID. Persistent for the entire TU. |
dword_126EB40 | 4 | current_region_id | Current allocation region ID (file-scope or function-scope). |
dword_126EC80 | 4 | max_region_id | Maximum allocated region ID. |
qword_126EB60 | 16 | il_header | IL header (SSE-width, used for expression copy). |
qword_126EB70 | 8 | main_routine | Main routine entity (main() function). Sign-bit used as elimination marker. |
qword_126EB78 | 8 | compiler_version_string | Compiler version string pointer. |
qword_126EB80 | 8 | compilation_timestamp | Compilation timestamp string. |
byte_126EB88 | 1 | plain_chars_signed | Plain chars are signed flag (IL header field). |
qword_126EB90 | 8 | routine_scope_array | Array indexed by routine number. Also per-region metadata. |
qword_126EB98 | 8 | function_def_table | Function definition table (16 bytes per entry, indexed 1..dword_126EC78). |
qword_126EBA0 | 8 | orphaned_scope_list | Orphaned scope list head (for dead code elimination). |
dword_126EBA8 | 4 | source_language | Source language (0 = C++, 1 = C). |
dword_126EBAC | 4 | std_version_il | Standard version for IL header. |
byte_126EBB0 | 1 | pcc_compatibility_mode | PCC compatibility mode. |
byte_126EBB1 | 1 | enum_type_is_integral | Enum underlying type is integral. |
dword_126EBB4 | 4 | max_member_alignment | Default maximum member alignment. |
byte_126EBB8 | 1 | il_gcc_mode | IL GCC mode. |
byte_126EBB9 | 1 | il_gpp_mode | IL G++ mode. |
byte_126EBD5 | 1 | any_templates_seen | Any templates encountered. |
byte_126EBD6 | 1 | proto_instantiations_in_il | Prototype instantiations present in IL. |
byte_126EBD7 | 1 | il_all_proto_instantiations | IL has all prototype instantiations. |
byte_126EBD8 | 1 | il_c_semantics | IL has C semantics. |
qword_126EBE0 | 8 | deferred_instantiation_list | Deferred/external declaration list head. |
qword_126EBE8 | 8 | seq_number_entries | Sequence number lookup entries (for IL index build). |
dword_126EBF8 | 4 | target_config_index | Target configuration index. |
dword_126EC78 | 4 | routine_counter | Current routine / entity counter. |
dword_126EC7C | 4 | entity_buffer_capacity | Entity buffer capacity (grows by 2048). |
qword_126EC88 | 8 | region_block_chains | Array of block chains indexed by region ID. |
qword_126EC50 | 8 | region_size_tracking | Array of region size tracking. |
qword_126EC58 | 8 | large_alloc_array | Large-allocation (mmap) array. |
dword_126E5FC | 4 | file_scope_constant_flag | Source-file-info flags (bit 0 = constant region flag). |
byte_126E5F8 | 1 | il_language_byte | Language standard byte for routine-type init. |
qword_126EFB8 | 8 | null_source_position | Default/null source position struct. |
qword_126F700 | 8 | current_source_file_ref | Current source file reference for IL entities. |
IL Entity Kind Lists
The IL maintains per-kind linked lists for file-scope entities (kinds 1 through 72+).
| Address | Size | Name | Description |
|---|---|---|---|
qword_126E610 | 8 | kind_1_list | Source file entries (kind 1). |
qword_126E620 | 8 | kind_2_list | Constant entries (kind 2). |
qword_126E630 | 8 | kind_3_list | Parameter entries (kind 3). |
| ... | ... | ... | Continues through all 72+ entry kinds. |
qword_126EA80 | 8 | kind_72_list | Last numbered kind list (kind 72). |
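
The listed rows suggest a fixed layout: kinds 1, 2, and 3 are 0x10 bytes apart, and extrapolating that stride lands exactly on the kind 72 address. A small sketch under that assumption follows; the stride is inferred from the four listed rows only, the helper name is hypothetical, and the contents of the second 8 bytes of each 16-byte slot are not documented here.

```python
KIND_LIST_BASE = 0x126E610  # kind_1_list
KIND_LIST_STRIDE = 0x10     # inferred: kinds 1, 2, 3 are 0x10 apart
LAST_NUMBERED_KIND = 72


def kind_list_address(kind: int) -> int:
    """Address of the list head for a numbered IL entry kind (1-based)."""
    if not 1 <= kind <= LAST_NUMBERED_KIND:
        raise ValueError("only kinds 1..72 are covered by this sketch")
    return KIND_LIST_BASE + KIND_LIST_STRIDE * (kind - 1)


# Cross-check the extrapolation against the rows in the table.
addresses = {k: hex(kind_list_address(k)) for k in (1, 2, 3, 72)}
```

With these constants, kind 72 resolves to 0x126EA80, which matches the last row of the table.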
IL Allocation Counters
Each IL entity type has a dedicated allocation counter used for memory statistics reporting.
| Address | Size | Name | Description |
|---|---|---|---|
qword_126F680 | 8 | local_constant_count | Local constant allocation count. Asserted zero at region boundaries. |
qword_126F748 | 8 | orphan_ptr_count | Orphan pointer allocation count. |
qword_126F750 | 8 | entity_prefix_count | Entity prefix allocation count. |
qword_126F790 | 8 | source_corresp_count | Source correspondence allocation count. |
qword_126F7C0 | 8 | gen_alloc_header_count | Gen-alloc header count (TU copy addresses). |
qword_126F7D0 | 8 | string_bytes_count | String literal bytes counter. |
qword_126F7D8 | 8 | il_entry_prefix_count | IL entry prefix allocation count. |
qword_126F8A0 | 8 | exception_spec_count | Exception specification entry count (16 bytes). |
qword_126F898 | 8 | exception_spec_type_count | Exception spec type count (24 bytes). |
qword_126F890 | 8 | asm_entry_count | ASM entry count (152 bytes). |
qword_126F8A8 | 8 | routine_count | Routine entry count (288 bytes). |
qword_126F8B0 | 8 | field_count | Field entry count (176 bytes). |
qword_126F8B8 | 8 | var_template_count | Variable template entry count (24 bytes). |
qword_126F8C0 | 8 | variable_count | Variable entry count (232 bytes). |
qword_126F8C8 | 8 | vla_dim_count | VLA dimension entry count (48 bytes). |
qword_126F8D0 | 8 | local_static_init_count | Local static init count (40 bytes). |
qword_126F8D8 | 8 | dynamic_init_count | Dynamic init entry count (104 bytes). |
qword_126F8E0 | 8 | type_count | Type entry count (176 bytes). |
qword_126F8E8 | 8 | enum_supplement_count | Enum type supplement count. |
qword_126F8F0 | 8 | typeref_supplement_count | Typeref type supplement count (56 bytes). |
qword_126F8F8 | 8 | misc_supplement_count | Misc type supplement count. |
qword_126F900 | 8 | template_arg_count | Template argument count (64 bytes). |
qword_126F908 | 8 | base_class_count | Base class count (112 bytes). |
qword_126F910 | 8 | base_class_deriv_count | Base class derivation count (32 bytes). |
qword_126F918 | 8 | derivation_step_count | Derivation step count (24 bytes). |
qword_126F920 | 8 | overriding_count | Overriding entry count (40 bytes). |
qword_126F928 | 8 | constant_list_count | Constant list entry count (16 bytes). |
qword_126F930 | 8 | variable_list_count | Variable list entry count (16 bytes). |
qword_126F938 | 8 | routine_list_count | Routine list entry count (16 bytes). |
qword_126F940 | 8 | class_list_count | Class list entry count (16 bytes). |
qword_126F948 | 8 | class_supplement_count | Class type supplement count. |
qword_126F950 | 8 | based_type_member_count | Based type list member count (24 bytes). |
qword_126F958 | 8 | routine_supplement_count | Routine type supplement count (64 bytes). |
qword_126F960 | 8 | param_type_count | Parameter type entry count (80 bytes). |
qword_126F968 | 8 | constant_alloc_count | Constant allocation count (184 bytes). |
qword_126F970 | 8 | source_file_count | Source file entry count. |
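
Because most counters are paired with a fixed entry size, a per-type byte total is just count times size. The sketch below illustrates that arithmetic for a few of the sizes listed above; the report shape and function name are hypothetical, and the actual statistics output format of cudafe++ is not reproduced here.

```python
# Entry sizes in bytes, copied from the table above (subset).
ENTRY_SIZES = {
    "routine": 288,
    "field": 176,
    "variable": 232,
    "type": 176,
    "constant": 184,
    "base_class": 112,
}


def il_memory_report(counts: dict) -> tuple:
    """Return (bytes per entity type, grand total) for the given counters.

    `counts` maps an entity type name to its allocation counter value.
    """
    per_type = {name: n * ENTRY_SIZES[name] for name, n in counts.items()}
    return per_type, sum(per_type.values())


# Example with made-up counter values.
per_type, total = il_memory_report({"routine": 10, "type": 5})
```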
IL Free Lists
Arena allocators recycle nodes through per-type free lists.
| Address | Size | Name | Description |
|---|---|---|---|
qword_126E4B8 | 8 | constant_free_list | Constants (linked via offset +104). |
qword_126E4B0 | 8 | expr_node_free_list | Expression nodes (linked via offset +64). |
qword_126F678 | 8 | param_type_free_list | Parameter type entries (linked via offset +0). |
qword_126F670 | 8 | template_arg_free_list | Template argument entries (linked via offset +0). |
qword_126F668 | 8 | constant_list_free_list | Constant list entries (linked via offset +0). |
IL Pools and Region Allocator
| Address | Size | Name | Description |
|---|---|---|---|
qword_126F600 | 104 | type_node_pool_1 | Type node pool (104-byte entries). |
qword_126F580 | 104 | type_node_pool_2 | Secondary type node pool. |
qword_126F500 | 104 | conditional_pool_1 | Conditional pool (guarded by dword_106BF68 || dword_106BF58). |
qword_126F480 | 104 | conditional_pool_2 | Conditional pool (secondary). |
qword_126F400 | 112 | expr_pool_1 | Expression/statement node pool (112 bytes). |
qword_126F380 | 112 | expr_pool_2 | Expression pool (secondary). |
qword_126F300 | 112 | expr_pool_3 | Expression pool (tertiary). |
unk_126E600 | 1344 | scope_pool | Scope table pool (1344 bytes, 384 initial count). |
qword_126E580 | 96 | common_header_pool | Common IL header pool (96 bytes). |
dword_126F690 | 4 | region_prefix_offset | Region allocation prefix offset (0 or 8). |
dword_126F694 | 4 | region_prefix_size | Region allocation prefix size (16 or 24). |
dword_126F688 | 4 | alt_prefix_offset | Alternate region prefix offset. |
dword_126F68C | 4 | alt_prefix_size | Alternate region prefix size (8). |
Constant Sharing Hash Table
| Address | Size | Name | Description |
|---|---|---|---|
qword_126F128 | 8 | constant_hash_table | Hash table for constant sharing/dedup. |
qword_126F130 | 8 | next_constant_index | Next constant index (monotonically increasing). |
qword_126F228 | 8 | shareable_constant_hash | Shareable constant hash table (2039 buckets). |
qword_126F200 | 8 | hash_comparisons | Hash comparison count (statistics). |
qword_126F208 | 8 | hash_searches | Hash search count. |
qword_126F210 | 8 | hash_new_buckets | New hash bucket count. |
qword_126F218 | 8 | hash_region_hits | Region hit count. |
qword_126F220 | 8 | hash_global_hits | Global hit count. |
qword_126F280 | 8 | member_ptr_type_count | Member-pointer / qualified type allocation counter. |
qword_126F2F8 | 3240 | char_string_type_cache | Character string type cache (405 entries = 3240/8). Indexed by 648*char_kind + 8*length. |
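
The indexing formula for char_string_type_cache implies a two-dimensional layout. Dividing the strides gives 81 slots per character kind and 5 character kinds (5 x 81 = 405 slots). The sketch below spells this out; the bounds are derived purely from that arithmetic and are an assumption, not something confirmed from the binary.

```python
SLOT_SIZE = 8        # one cached type pointer
KIND_STRIDE = 648    # bytes per character kind
CACHE_BYTES = 3240   # total size of the cache

LENGTHS_PER_KIND = KIND_STRIDE // SLOT_SIZE  # 81 (inferred)
CHAR_KINDS = CACHE_BYTES // KIND_STRIDE      # 5 (inferred)


def cache_byte_offset(char_kind: int, length: int) -> int:
    """Byte offset into the cache: 648*char_kind + 8*length."""
    if not 0 <= char_kind < CHAR_KINDS:
        raise ValueError("char_kind outside inferred range")
    if not 0 <= length < LENGTHS_PER_KIND:
        raise ValueError("length outside inferred range")
    return KIND_STRIDE * char_kind + SLOT_SIZE * length


# The last slot ends exactly at the end of the 3240-byte cache.
last_slot_end = cache_byte_offset(CHAR_KINDS - 1, LENGTHS_PER_KIND - 1) + SLOT_SIZE
```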
Cached Type Nodes
| Address | Size | Name | Description |
|---|---|---|---|
qword_126F2F0 | 8 | cached_void_type | Lazy-init cached void type node. |
qword_126F2E0 | 8 | cached_size_t_type | Lazy-init cached size_t type (for array memcpy). |
qword_126F2D0 | 8 | cached_wchar_type | Cached wchar_t type. |
qword_126F2C8 | 8 | cached_char16_type | Cached char16_t type. |
qword_126F2C0 | 8 | cached_char32_type | Cached char32_t type. |
qword_126F2B8 | 8 | cached_char8_type | Cached char8_t type (C++20). |
qword_126F610 | 8 | cached_char16_variant | Cached char16_t variant type. |
qword_106B660 | 8 | cached_void_fn_type | Cached void function type (C++ mode). |
qword_126E5E0 | 8 | global_char_type | Global char type. Used with qualifier 1 = const for const char*. |
Template Instantiation
| Address | Size | Name | Description |
|---|---|---|---|
qword_12C7740 | 8 | pending_instantiation_list | Pending function/variable instantiation worklist head. |
qword_12C7758 | 8 | pending_class_list | Pending class instantiation list. |
qword_12C76E0 | 8 | instantiation_depth | Current instantiation depth counter (max 0xFF = 255). |
qword_106BD10 | 8 | max_instantiation_depth | Maximum template instantiation depth limit. Default 200. |
qword_106BD08 | 8 | max_constexpr_cost | Maximum constexpr evaluation cost. Default 256. |
dword_12C7730 | 4 | instantiation_mode_active | Instantiation mode active flag. |
dword_12C771C | 4 | new_instantiations_needed | Fixpoint flag: new instantiations generated in current pass. |
dword_12C7718 | 4 | additional_pass_needed | Additional instantiation pass needed flag. |
dword_106C094 | 4 | compilation_mode | Compilation mode: 0 = none, 1 = normal, 2 = used-only, 3 = precompile. |
dword_106C09C | 4 | extended_language_mode | Extended language mode. |
qword_12C7B48 | 8 | template_arg_cache | Template argument cache. |
qword_12C7B40 | 8 | template_arg_cache_2 | Template argument cache (secondary). |
qword_12C7B50 | 8 | template_arg_cache_3 | Template argument cache (tertiary). |
qword_12C7800 | 112[3] | template_hash_tables | Three template hash tables (0x70 bytes each = 14 slots). |
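
The worklist and the two fixpoint flags suggest a pass-based loop: process the pending entries, note whether doing so created new work, and repeat until a pass adds nothing. The following is a generic illustration of that pattern, not a reconstruction of the cudafe++ routine; all names and the `instantiate` callback are hypothetical.

```python
def run_instantiation_passes(pending, instantiate, max_passes=1000):
    """Drain a worklist until a pass generates no new entries.

    `instantiate(item)` returns the items that `item` newly requires.
    Returns the set of processed items.
    """
    done = set()
    pending = list(pending)
    passes = 0
    while pending:
        passes += 1
        if passes > max_passes:
            raise RuntimeError("fixpoint not reached")
        new_instantiations_needed = False  # mirrors the role of dword_12C771C
        next_round = []
        for item in pending:
            if item in done:
                continue
            done.add(item)
            for dep in instantiate(item):
                if dep not in done:
                    next_round.append(dep)
                    new_instantiations_needed = True
        pending = next_round
        if not new_instantiations_needed:
            break
    return done


# Toy dependency graph: A requires B and C, B requires C.
deps = {"A": ["B", "C"], "B": ["C"], "C": []}
result = run_instantiation_passes(["A"], lambda name: deps[name])
```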
Lambda Transform State
NVIDIA's extended lambda system uses bitmaps and linked lists to track device and host-device lambda closures.
| Address | Size | Name | Description |
|---|---|---|---|
unk_1286980 | 128 | device_lambda_bitmap | Device lambda capture count bitmap (1024 bits). One bit per closure class index. |
unk_1286900 | 128 | host_device_lambda_bitmap | Host-device lambda capture count bitmap (1024 bits). |
qword_12868F0 | 8 | entity_closure_map | Entity-to-closure mapping hash table (via sub_742670). |
qword_1286A00 | 8 | cached_anon_namespace_name | Cached anonymous namespace name (_GLOBAL__N_<filename>). |
qword_1286760 | 8 | cached_static_prefix | Cached static prefix string for mangled names. |
byte_1286A20 | 256K | name_format_buffer | 256KB buffer for name formatting. |
Lambda Registration Lists
Six linked lists track device/constant/kernel entities with internal/external linkage for .int.c registration emission.
| Address | Size | Name | Description |
|---|---|---|---|
unk_1286780 | -- | device_external_list | Device entities with external linkage. |
unk_12867C0 | -- | device_internal_list | Device entities with internal linkage. |
unk_1286800 | -- | constant_external_list | Constant entities with external linkage. |
unk_1286840 | -- | constant_internal_list | Constant entities with internal linkage. |
unk_1286880 | -- | kernel_external_list | Kernel entities with external linkage. |
unk_12868C0 | -- | kernel_internal_list | Kernel entities with internal linkage. |
IL Tree Walking
The walk_tree subsystem uses global callback pointers for its 5-callback traversal model.
| Address | Size | Name | Description |
|---|---|---|---|
qword_126FB88 | 8 | entry_callback | Called for each IL entry during walk. |
qword_126FB80 | 8 | string_callback | Called for each string encountered. |
qword_126FB78 | 8 | pre_walk_check | Pre-walk filter: if returns nonzero, skip subtree. |
qword_126FB70 | 8 | entry_replace | Entry replacement callback. |
qword_126FB68 | 8 | entry_filter | Linked-list entry filter callback. |
dword_126FB5C | 4 | is_file_scope_walk | 1 = walking file-scope IL. |
dword_126FB58 | 4 | is_secondary_il | 1 = current scope is in secondary IL region. |
dword_126FB60 | 4 | walk_mode_flags | Walk mode flags (template stripping, etc.). |
dword_106B644 | 4 | current_il_region | Current IL region (0 or 1; toggles bit 2 of entry flags). |
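
To make the callback roles concrete, the sketch below models two of the five callbacks on a toy tree: the pre-walk filter, which prunes a subtree when it returns nonzero, and the per-entry callback. The `Node` type, the ordering of filter versus callback, and the function names are assumptions made for illustration; the replacement, string, and list-filter callbacks are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def walk(node, entry_callback, pre_walk_check=None):
    """Depth-first walk; a truthy pre_walk_check result skips the subtree."""
    if pre_walk_check is not None and pre_walk_check(node):
        return
    entry_callback(node)
    for child in node.children:
        walk(child, entry_callback, pre_walk_check)


tree = Node("root", [Node("a", [Node("a1")]), Node("b")])
visited = []
# Skip the subtree rooted at "a"; everything else is visited in pre-order.
walk(tree, lambda n: visited.append(n.name), lambda n: n.name == "a")
```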
IL Walk Visited-Set
| Address | Size | Name | Description |
|---|---|---|---|
dword_126FB30 | 4 | visited_count | Count of visited entries in current walk. |
qword_126FB40 | 8 | visited_set | Visited-entry set pointer. |
dword_126FB48 | 4 | hash_table_count | Hash table entry count for visited set. |
qword_126FB50 | 8 | hash_table_array | Hash table array for visited set. |
IL Display
| Address | Size | Name | Description |
|---|---|---|---|
qword_126F980 | 8 | display_output_context | IL-to-string output callback/context. |
dword_126FA30 | 4 | is_file_scope_display | 1 = displaying file-scope region. |
byte_126FA16 | 1 | display_active | IL display currently active flag. |
byte_126FA11 | 1 | pcc_mode_shadow | PCC compatibility mode shadow for display. |
qword_126FA40 | -- | display_string_buffer | Display string buffer (raw literal prefix, etc.). |
Constexpr Evaluator
| Address | Size | Name | Description |
|---|---|---|---|
qword_126FDE0 | 8 | eval_node_free_list | Evaluation node free list (0x10000-byte arena blocks). |
qword_126FDE8 | 8 | eval_nesting_depth | Evaluation nesting depth counter. |
qword_126FE00 | 8[11] | hash_bucket_free_lists | Hash bucket free lists by popcount size class (11 buckets). |
qword_126FE60 | 8[11] | value_node_free_lists | Value node free lists by popcount size class (11 buckets). |
qword_126FBC0 | 8 | variant_path_free_list | Variant path node free list. |
qword_126FBB8 | 8 | variant_path_count | Variant path allocation count. |
qword_126FBC8 | 8 | variant_path_limit | Variant path limit. |
qword_126FBD0 | 8 | variant_path_table | Variant path table pointer. |
qword_126FEC0 | 8 | constexpr_class_hash_table | Class type hash table base for constexpr. |
qword_126FEC8 | 8 | constexpr_class_hash_info | Low 32 = capacity mask, high 32 = entry count. |
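
The packed layout of constexpr_class_hash_info can be decoded with a mask and a shift. The sketch below shows that split. The `slot_for` helper additionally assumes that the capacity mask is applied with a bitwise AND, as is typical for power-of-two tables; that detail and both function names are hypothetical.

```python
def unpack_class_hash_info(value: int) -> tuple:
    """Split the 64-bit word into (capacity_mask, entry_count)."""
    capacity_mask = value & 0xFFFFFFFF       # low 32 bits
    entry_count = (value >> 32) & 0xFFFFFFFF  # high 32 bits
    return capacity_mask, entry_count


def slot_for(hash_value: int, capacity_mask: int) -> int:
    """Assumed slot selection: mask the hash down to the table size."""
    return hash_value & capacity_mask


# Example word: 5 entries in a table with mask 0x3FF (1024 slots).
mask, count = unpack_class_hash_info((5 << 32) | 0x3FF)
```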
Backend Code Generation (cp_gen_be.c)
| Address | Size | Name | Description |
|---|---|---|---|
dword_1065834 | 4 | indent_level | Current indentation depth in output. |
dword_1065820 | 4 | output_line_number | Output line counter. |
dword_106581C | 4 | output_column | Output column counter (chars since last newline). |
dword_1065830 | 4 | output_column_alt | Alternate column counter. |
dword_1065818 | 4 | needs_line_directive | Needs #line directive flag. |
qword_1065810 | 8 | output_source_position | Current source position for #line directives. |
qword_1065748 | 8 | source_sequence_ptr | Current source sequence entry pointer. |
qword_1065740 | 8 | source_sequence_alt | Secondary source sequence pointer (nested scope iteration). |
byte_10656F0 | 1 | current_linkage_spec | Current linkage spec: 2 = extern "C", 3 = extern "C++". |
qword_1065708 | 8 | output_scope_stack | Output scope stack pointer (linked list). |
qword_1065870 | 8 | debug_trace_list | Debug trace request linked list. |
Expression Parsing State
| Address | Size | Name | Description |
|---|---|---|---|
qword_106B970 | 8 | expr_stack_top | Current expression stack top pointer. Primary context object for expression parsing. Checked at offset +17 (flags), +18, +19 (bit flags), +48, +120. |
qword_106B968 | 8 | expr_stack_prev | Previous expression stack entry (push/pop). |
qword_106B580 | 8 | saved_expr_context | Saved expression context (for nested evaluation). |
qword_106B510 | 8 | rewrite_loop_counter | Rewrite loop counter (limited to 100 to prevent infinite loops). |
dword_126EF08 | 4 | requires_expr_enabled | Requires-expression enabled (C++20). |
Overload Resolution
| Address | Size | Name | Description |
|---|---|---|---|
qword_E7FE98 | 8 | override_pending_list | Virtual function override pending list head (40-byte entries). |
qword_E7FEA0 | 8 | override_free_list | Override entry free list. |
qword_E7FE88 | 8 | covariant_free_list | Covariant override free list. |
qword_E7FEC8 | 8 | lambda_hash_table | Lambda closure class hash table pointer. |
qword_E7FED0 | 8 | template_member_hash | Template member hash table pointer. |
dword_E7FE48 | 4 | rbtree_sentinel | Red-black tree sentinel node (for lambda numbering). |
qword_E7FE58 | 8 | rbtree_left_sentinel | Red-black tree left sentinel (= &dword_E7FE48). |
qword_E7FE60 | 8 | rbtree_right_sentinel | Red-black tree right sentinel (= &dword_E7FE48). |
qword_E7FE68 | 8 | rbtree_size | Red-black tree entry count. |
Attribute System
| Address | Size | Name | Description |
|---|---|---|---|
off_D46820 | 32/entry | attribute_descriptor_table | Attribute descriptor table. ~160 entries, stride 32 bytes. Runs to unk_D47A60. |
qword_E7FB60 | 8 | attribute_hash_table | Attribute name hash table (Robin Hood lookup via sub_742670). |
qword_E7F038 | 8 | attribute_hash_table_2 | Secondary attribute hash table. |
byte_E7FB80 | 204 | scoped_attr_buffer | Buffer for scoped attribute name formatting ("namespace::name"). |
byte_82C0E0 | -- | attribute_kind_table | Attribute kind descriptor table (indexed by attribute kind). |
dword_E7F078 | 4 | attr_init_flag | Attribute subsystem initialization flag. |
dword_E7F080 | 4 | attr_flags | Attribute system flags. |
qword_E7F070 | 8 | visibility_stack | Visibility stack linked list. |
qword_E7F068 | 8 | visibility_state | Current visibility state. |
qword_E7F048 | 8 | alias_ifunc_free_list | Free list for alias/ifunc entries. |
qword_E7F058 | 8 | alias_list_head | Alias entry linked list head. |
qword_E7F050 | 8 | alias_list_next | Alias entry linked list next. |
dword_106BF18 | 4 | extended_attr_config | Extended attribute configuration flag. Gates additional initialization. |
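
Descriptor addresses in attribute_descriptor_table follow from the 32-byte stride. The sketch below computes them and also derives how many strides fit between the two addresses given in the table. Taken at face value, that range holds 146 entries, somewhat fewer than the approximate figure of 160; either the table does not end exactly at unk_D47A60 or the estimate is rough. The helper name is hypothetical.

```python
ATTR_TABLE_START = 0xD46820  # off_D46820
ATTR_TABLE_END = 0xD47A60    # unk_D47A60, stated end of the table
ATTR_ENTRY_STRIDE = 32


def descriptor_address(index: int) -> int:
    """Address of attribute descriptor `index` (0-based)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return ATTR_TABLE_START + ATTR_ENTRY_STRIDE * index


span_bytes = ATTR_TABLE_END - ATTR_TABLE_START
entries_in_range = span_bytes // ATTR_ENTRY_STRIDE
```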
Control Flow Tracking
| Address | Size | Name | Description |
|---|---|---|---|
qword_12C7110 | 8 | cf_descriptor_free_list | Control flow descriptor free list. |
qword_12C7118 | 8 | cf_active_list_tail | Active control flow list tail. |
qword_12C7120 | 8 | cf_active_list_head | Active control flow list head. |
Cross-Reference System
| Address | Size | Name | Description |
|---|---|---|---|
qword_106C258 | 8 | xref_output_file | Cross-reference output file handle. When nonzero, enables xref emission. |
qword_12C7160 | 8 | xref_callback | Cross-reference callback (sub_726F10). |
dword_12C7148 | 4 | xref_enabled | Cross-reference generation enabled. |
byte_12C71FA | 1 | xref_flag_a | Cross-reference flag A. |
byte_12C71FE | 1 | xref_flag_b | Cross-reference flag B. Default 1. |
Object Lifetime Stack
| Address | Size | Name | Description |
|---|---|---|---|
qword_126E4C0 | 8 | curr_object_lifetime | Top of object lifetime stack. Used for destructor ordering and scope cleanup. |
Timing and Debug
| Address | Size | Name | Description |
|---|---|---|---|
dword_106C0A4 | 4 | timing_enabled | Timing/profiling enabled flag. |
dword_126EFC8 | 4 | debug_trace | Debug tracing active. When set, calls sub_48AE00/sub_48AFD0 trace hooks. |
dword_126EFCC | 4 | debug_verbosity | Debug verbosity level. >2 = detailed, >3 = very detailed, >4 = IL walk trace. |
byte_106B5C0 | 128 | compilation_timestamp | Compilation timestamp string (from ctime()). |
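
The verbosity thresholds for debug_verbosity are strict comparisons, so level 3 is the first value that enables detailed output. The sketch below encodes the thresholds as listed; treating them as cumulative is an assumption, and the function name is hypothetical.

```python
def enabled_trace_categories(debug_verbosity: int) -> list:
    """Map a dword_126EFCC value to the trace categories it enables."""
    categories = []
    if debug_verbosity > 2:
        categories.append("detailed")
    if debug_verbosity > 3:
        categories.append("very detailed")
    if debug_verbosity > 4:
        categories.append("IL walk trace")
    return categories


levels = {level: enabled_trace_categories(level) for level in range(2, 6)}
```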
Memory Allocator (Arena/Pool System)
| Address | Size | Name | Description |
|---|---|---|---|
qword_1280730 | 8 | block_free_list | Recycled 0x10000-byte block free list. |
qword_1280718 | 8 | total_memory_allocated | Total memory allocated (watermark). |
qword_1280710 | 8 | peak_memory_allocated | Peak memory allocated. |
qword_1280708 | 8 | tracked_alloc_total | Tracked allocation total. |
qword_1280720 | 8 | free_fe_hash_table | Hash table for free_fe tracked allocations. |
qword_1280748 | 8 | alloc_tracking_list | Linked list of allocation tracking records. |
dword_1280728 | 4 | mmap_mode | Allocation mode flag. 0 = malloc-based, 1 = mmap-based. Set from dword_106BF18. |
dword_1280750 | 4 | tracking_record_count | Tracking record count (inline up to 1023, then heap). |
unk_1280760 | -- | tracking_record_array | Inline tracking record array. |
IL Copy Remap
| Address | Size | Name | Description |
|---|---|---|---|
qword_126F1E0 | 8 | copy_remap_free_list | Copy remap entry free list (24 bytes each). |
qword_126F1D8 | 8 | copy_remap_count | Copy remap entry count. |
qword_126F1D0 | 8 | copy_recursion_depth | Copy recursion depth counter. |
qword_126F1F8 | 8 | copy_remap_stat_count | Copy remap statistics count. |
qword_126F140 | 8 | selected_entity | Selected entity for copy/comparison. |
byte_126F138 | 1 | selected_entity_kind | Kind of selected entity (7 or 11). |
IL Deferred Reordering Batch
| Address | Size | Name | Description |
|---|---|---|---|
qword_126F170 | 8 | reorder_batch | Batch reordering array (24-byte records: entity, placeholder, source_sequence). |
qword_126F158 | 8 | reorder_ptr_array | Pointer array for batch reordering. |
qword_126F150 | 8 | reorder_batch_limit | Batch size limit (100 entries). |
CLI Processing State
| Address | Size | Name | Description |
|---|---|---|---|
dword_E80058 | 4 | flag_count | Current registered CLI flag count (panics at 552 via sub_40351D). |
dword_E7FF20 | 4 | argv_index | Current argv parsing index (starts at 1). |
byte_E7FF40 | 272 | flag_was_set_bitmap | 272-byte bitmap: which CLI flags were explicitly set. |
dword_E7FF14 | 4 | language_already_set | Guard against switching language mode after initial set. |
dword_E7FF10 | 4 | cuda_compat_flag | CUDA compatibility flag (set based on dword_126EFAC && qword_126EF98 <= 0x76BF). |
off_D47CE0 | -- | set_flag_lookup_table | Lookup table for --set_flag CLI option (name-to-address mapping). |
EDG Feature Flags (0x106Bxxx-0x106Cxxx Region)
These flags control individual C/C++ language features. Set during CLI processing and standard-version initialization.
| Address | Size | Name | Description |
|---|---|---|---|
dword_106C210 | 4 | exceptions_enabled | Exception handling enabled. Default 1. |
dword_106C180 | 4 | rtti_enabled | RTTI enabled. Default 1. |
dword_106C164 | 4 | templates_enabled | Templates enabled. |
dword_106C1B8 | 4 | template_arg_context | Template argument context flag. |
dword_106C194 | 4 | namespaces_enabled | Namespaces enabled. Default 1. |
dword_106C19C | 4 | arg_dep_lookup | Argument-dependent lookup. Default 1. |
dword_106C178 | 4 | bool_keyword | bool keyword enabled. Default 1. |
dword_106C188 | 4 | wchar_t_keyword | wchar_t keyword enabled. Default 1. |
dword_106C18C | 4 | alternative_tokens | Alternative tokens enabled. Default 1. |
dword_106C1A0 | 4 | class_name_injection | Class name injection. Default 1. |
dword_106C1A4 | 4 | const_string_literals | Const string literals. Default 1. |
dword_106C134 | 4 | parse_templates | Parse templates. Default 1. |
dword_106C138 | 4 | dep_name | Dependent name processing. Default 1. |
dword_106C12C | 4 | friend_injection | Friend injection. Default 1. |
dword_106C128 | 4 | adl_related | ADL related feature. Default 1. |
dword_106C124 | 4 | module_visibility | Module-level visibility. Default 1. |
dword_106C140 | 4 | compound_literals | Compound literals. Default 1. |
dword_106C13C | 4 | base_assign_default | Base assign op is default. Default 1. |
dword_106C10C | 4 | deferred_instantiation | Deferred instantiation flag. |
dword_106C0E4 | 4 | exceptions_feature | Exceptions feature flag (version-dependent). |
dword_106C064 | 4 | modify_stack_limit | Modify stack limit. Default 1. |
dword_106C068 | 4 | fe_inlining | Frontend inlining enabled. |
dword_106C0A0 | 4 | feature_A0 | Miscellaneous feature flag. Default 1. |
dword_106C098 | 4 | feature_98 | Miscellaneous feature flag. Default 1. |
dword_106C0FC | 4 | feature_FC | Miscellaneous feature flag. Default 1. |
dword_106C154 | 4 | feature_154 | Miscellaneous feature flag. Default 1. |
dword_106C208 | 4 | constexpr_if_discard | Constexpr-if discarded-statement handling. |
dword_106C1F0 | 4 | cpp_mode_feature | C++ mode feature flag. |
dword_106C2A4 | 4 | feature_2A4 | Default 1. |
dword_106C214 | 4 | feature_214 | Default 1. |
dword_106C2BC | 4 | modules_enabled | C++20 modules enabled. |
dword_106C2B8 | 4 | module_partitions | Module partitions enabled. |
dword_106BDB8 | 4 | restrict_enabled | restrict keyword enabled. Default 1. |
dword_106BDB0 | 4 | remove_unneeded_entities | Remove unneeded entities. Default 1. |
dword_106BD98 | 4 | trigraphs_enabled | Trigraph support. Default 1. |
dword_106BD68 | 4 | guiding_decls | Guiding declarations. Default 1. |
dword_106BD58 | 4 | old_specializations | Old-style specializations. Default 1. |
dword_106BD54 | 4 | implicit_typename | Implicit typename. Default 1. |
dword_106BEA0 | 4 | rtti_config | RTTI configuration flag. |
dword_106BE84 | 4 | gen_move_operations | Generate move operations. Default 1. |
dword_106BC08 | 4 | nodiscard_enabled | [[nodiscard]] enabled. |
dword_106BC64 | 4 | visibility_support | Visibility support enabled. |
dword_106BDF0 | 4 | gnu_attr_groups | GNU attribute groups enabled. |
dword_106BDF4 | 4 | msvc_declspec | MSVC __declspec enabled. |
dword_106BCBC | 4 | template_features | Template features flag. |
dword_106BFC4 | 4 | debug_mode_1 | Debug mode flag 1 (set by --debug_mode). |
dword_106BFC0 | 4 | debug_mode_2 | Debug mode flag 2. |
dword_106BFBC | 4 | debug_mode_3 | Debug mode flag 3. |
qword_106BCE0 | 8 | include_suffix_default | Include suffix default string ("::stdh:"). |
qword_106BC70 | 8 | version_threshold | Feature version threshold. Default 30200. |
Host Compiler Target Configuration
| Address | Size | Name | Description |
|---|---|---|---|
dword_126E1D4 | 4 | msvc_target_version | MSVC target version (1200 = VC6, 1400 = VS2005, etc.). |
dword_126E1D8 | 4 | is_msvc_host | Is MSVC host compiler. |
dword_126E1DC | 4 | is_edg_native | EDG native mode. |
dword_126E1E8 | 4 | is_clang_host | Is Clang host compiler. |
dword_126E1F8 | 4 | is_gnu_host | Is GNU/GCC host compiler. |
qword_126E1F0 | 8 | gnu_host_version | GCC/Clang host version number. |
qword_126E1E0 | 8 | clang_host_version | Clang host version number. |
dword_126E1EC | 4 | backend_enabled | Backend generation enabled. |
dword_126E1BC | 4 | host_feature_flag | Host feature flag. Default 1. |
dword_126DFF0 | 4 | msvc_declspec_mode | MSVC __declspec mode enabled. |
qword_126E1B0 | 8 | library_prefix | Library search path prefix ("lib"). |
dword_126E200 | 4 | constexpr_init_flag | Constexpr initialization flag. |
dword_126E204 | 4 | instantiation_flag | Instantiation control flag. |
dword_126E224 | 4 | parameter_flag | Parameter handling flag. |
Type System Lookup Tables (Read-Only)
| Address | Size | Name | Description |
|---|---|---|---|
byte_E6D1B0 | 256 | signedness_table | Type-code-to-signedness lookup table. |
byte_E6D1AD | 1 | unsigned_int_kind_sentinel | Must equal 111 ('o') -- sentinel validation. |
byte_A668A0 | 256 | type_kind_properties | Type kind property table. Bit 1 = callable, bit 4 = aggregate. |
off_E6E020 | -- | il_entry_kind_names | IL entry kind name table (last = "last", sentinel = 9999). |
off_E6CD78 | -- | db_storage_class_names | Storage class name table (last = "last"). |
off_E6D228 | -- | db_special_function_kinds | Special function kind name table. |
off_E6CD20 | -- | db_operator_names | Operator name table. |
off_E6E060 | -- | name_linkage_kind_names | Name linkage kind names. |
off_E6CD88 | -- | decl_modifier_names | Declaration modifier names. |
off_E6CF38 | -- | pragma_ids | Pragma ID table. |
qword_E6C580 | 8 | sizeof_il_entry_sentinel | Must equal 9999 -- sizeof IL entry validation. |
off_E6DD80 | -- | il_entry_kind_display_names | IL entry kind display names (indexed by kind byte). |
off_E6E040 | -- | linkage_kind_display_names | Linkage kind display names (none/internal/external/C/C++). |
off_E6E140 | -- | feature_init_table | Feature initialization table (used with dword_106BF18). |
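Several entries above exist only to be checked: two numeric sentinels (111 and 9999) and name tables whose final string is "last". A hedged sketch of such a consistency check follows. The function names and the shape of the inputs are assumptions made for illustration, and the real check in the binary may be organized differently.

```python
UNSIGNED_INT_KIND_SENTINEL = ord("o")  # 111, expected at byte_E6D1AD
SIZEOF_IL_ENTRY_SENTINEL = 9999        # expected at qword_E6C580


def name_table_is_terminated(names):
    """A name table passes if it is non-empty and ends with "last"."""
    return bool(names) and names[-1] == "last"


def validate_tables(unsigned_int_kind_byte, sizeof_il_entry_value, name_tables):
    """Return the labels of all failed checks (empty list means all passed)."""
    problems = []
    if unsigned_int_kind_byte != UNSIGNED_INT_KIND_SENTINEL:
        problems.append("unsigned_int_kind_sentinel")
    if sizeof_il_entry_value != SIZEOF_IL_ENTRY_SENTINEL:
        problems.append("sizeof_il_entry_sentinel")
    for label, names in name_tables.items():
        if not name_table_is_terminated(names):
            problems.append(label)
    return problems
```

Checks of this kind typically catch a table that has fallen out of step with its enum after a new kind was added.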
IL Display Tables (Read-Only)
| Address | Size | Name | Description |
|---|---|---|---|
off_A6F840 | 8[120] | builtin_op_names | Builtin operation kind names (120 entries). |
off_A6FE40 | 8[22] | type_kind_names | Type kind names (22 entries: void, bool, int, float, ...). |
off_A6F760 | 8[4] | access_specifier_names | Access specifier names (public/protected/private/none). |
off_A6FE00 | 8[7] | storage_class_display_names | Storage class display names (7: none/auto/register/static/extern/mutable/thread_local). |
off_A6F480 | -- | register_kind_names | Register kind names. |
off_A6FC00 | -- | special_kind_names | Special function kind names (lambda call operator, etc.). |
off_A6FC80 | -- | opname_kind_names | Operator name kind names. |
off_A6F640 | -- | typeref_kind_names | Typeref kind names. |
off_A6F420 | -- | based_type_kind_names | Based type kind names. |
off_A6F3F0 | -- | class_kind_names | Class/struct/union kind names. |
off_E6C5A0 | -- | builtin_op_table | Builtin operation reference table. |
PCH and Serialization
| Address | Size | Name | Description |
|---|---|---|---|
dword_106B690 | 4 | pch_mode | Precompiled header mode. |
dword_106B6B0 | 4 | pch_loaded | PCH loaded flag. |
qword_12C6BA0 | 8 | pch_string_buffer_1 | PCH string buffer. |
qword_12C6BA8 | 8 | pch_string_buffer_2 | PCH string buffer (secondary). |
qword_12C6EA0 | 8 | pch_write_state | PCH binary write state. |
qword_12C6EA8 | 8 | pch_misc_state | PCH miscellaneous state. |
dword_12C6C88 | 4 | pch_config_flag | PCH configuration flag. |
byte_12C6EE0 | 1 | pch_byte_flag | PCH byte flag. |
dword_12C6C8C | 4 | saved_var_list_count | Saved variable list count (PCH). |
qword_12C6CA0 | 8 | saved_var_lists | Saved variable list array (PCH). |
Inline and Linkage Tracking
| Address | Size | Name | Description |
|---|---|---|---|
qword_12C6FC8 | 8 | inline_def_tracking_1 | Inline definition tracking. |
qword_12C6FD0 | 8 | inline_def_tracking_2 | Inline definition tracking (secondary). |
qword_12C6FD8 | 8 | inline_def_tracking_3 | Inline definition tracking (tertiary). |
qword_12C6FB8 | 8 | linkage_stack_1 | Linkage stack. |
qword_12C6FC0 | 8 | linkage_stack_2 | Linkage stack (secondary). |
qword_12C6FE0 | 8 | mangling_discriminator | ABI mangling discriminator tracking. |
qword_12C70E8 | 8 | misc_tracking | Miscellaneous definition tracking. |
Miscellaneous
| Address | Size | Name | Description |
|---|---|---|---|
qword_126E4C0 | 8 | curr_object_lifetime | Top of object lifetime stack. |
qword_106B9B0 | 8 | active_compilation_ctx | Active compilation context pointer. |
dword_126E280 | 4 | max_object_size | Maximum object size (for vector/array validation). |
dword_106B4B8 | 4 | omp_declare_variant | OpenMP declare variant active flag. |
dword_106BC7C | 4 | compressed_mangling | Compressed name mangling mode. |
dword_106BD4C | 4 | profiling_flag | Profiling / performance measurement flag. |
dword_106BCFC | 4 | traditional_enum | Traditional (unscoped) enum mode. |
dword_106BBD4 | 4 | char16_variant_flag | char16_t variant selection flag. |
dword_106BD74 | 4 | sharing_mode_config | IL sharing mode configuration. |
dword_126E1C0 | 4 | string_sharing_enabled | String sharing enabled in IL. |
byte_126E1C4 | 1 | basic_char_type | Basic char type code (for sub_5BBDF0). |
dword_106BD8C | 4 | svr4_mode | SVR4 ABI mode. |
byte_126E349 | 1 | cuda_extensions_byte | CUDA extensions flag (byte-sized). |
byte_126E358 | 1 | arch_extension_byte | Extension flag (possibly __CUDA_ARCH__). |
byte_126E3C0 | 1 | extension_byte_C0 | Extension flag byte. |
byte_126E3C1 | 1 | extension_byte_C1 | Extension flag byte. |
byte_126E481 | 1 | extension_byte_481 | Extension flag byte. |
dword_126F248 | 4 | il_index_valid | IL index valid flag (1 = index built). |
qword_126F240 | 8 | il_index_capacity | IL index array capacity. |
qword_126EBF0 | 8 | il_index_count | IL index entry count. |
qword_126F230 | 8 | il_index_aux | IL index auxiliary pointer. |
dword_12C6A24 | 4 | block_scope_suppress | Block-scope suppress level. |
dword_127FC70 | 4 | mark_direction | Mark/unmark direction for entity traversal. |
dword_127FBA0 | 4 | eof_flag | Input EOF flag. |
qword_127FBA8 | 8 | file_handle | Current input file handle. |
dword_127FB9C | 4 | multibyte_mode | Multibyte character mode (>1 = active). |
qword_126E440 | 8[6] | char_type_widths | Character type width table (indexed by char kind: 1,2,4 bytes). |
qword_126E580 | 8[11] | special_type_entries | Special type entries (11 entries). |
qword_126DE00 | -- | operator_name_table | Operator name string table. |
off_E6E0E0 | -- | predef_macro_mode_names | Predefined macro mode name table (sentinel = "last"). |
qword_126EEA0 | 8 | predef_macro_state | Predefined macro initialization state. |
dword_106BBA8 | 4 | c23_features | C23 features flag (#elifdef/#elifndef). |
dword_106C2B0 | 4 | preproc_feature_flag | Preprocessor feature flag. |
dword_106BEF8 | 4 | pch_config_2 | PCH configuration flag (secondary). |
GCC Pragma State
| Address | Size | Name | Description |
|---|---|---|---|
qword_12C6F60 | 8 | gcc_pragma_stack_1 | GCC pragma push/pop stack. |
qword_12C6F68 | 8 | gcc_pragma_stack_2 | GCC pragma stack (secondary). |
qword_12C6F78 | 8 | gcc_pragma_state | GCC pragma state. |
qword_12C6F98 | 8 | gcc_pragma_misc | GCC pragma miscellaneous state. |
Integer Range Tables (SSE-width)
| Address | Size | Name | Description |
|---|---|---|---|
xmmword_126E0E0 | 16 | integer_upper_bounds | Upper bounds for integer kinds (populated during init). |
xmmword_126E000 | 16 | integer_lower_bounds | Lower bounds for integer kinds. |
IL Common Header Template
The 96-byte (6 x 16 bytes) template copied into every new IL entity:
| Address | Size | Name |
|---|---|---|
xmmword_126F6A0 | 16 | IL header template word 0 |
xmmword_126F6B0 | 16 | IL header template word 1 |
xmmword_126F6C0 | 16 | IL header template word 2 |
xmmword_126F6D0 | 16 | IL header template word 3 |
xmmword_126F6E0 | 16 | IL header template word 4 |
xmmword_126F6F0 | 16 | IL header template word 5 |
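The six template words sit at consecutive 16-byte offsets from 0x126F6A0, so initializing an entity header amounts to six 16-byte moves. The sketch below models that copy. It assumes the header occupies the first 96 bytes of the entity and that the remainder starts zeroed; neither detail is stated in the table, and the function names are hypothetical.

```python
HEADER_BASE = 0x126F6A0
CHUNK = 16                    # one xmmword
CHUNKS = 6
HEADER_SIZE = CHUNK * CHUNKS  # 96 bytes


def template_addresses():
    """Addresses of the six template words, derived from the base."""
    return [HEADER_BASE + i * CHUNK for i in range(CHUNKS)]


def init_entity(template: bytes, body_size: int) -> bytearray:
    """Allocate header + body and copy the template in 16-byte chunks."""
    if len(template) != HEADER_SIZE:
        raise ValueError("template must be exactly 96 bytes")
    entity = bytearray(HEADER_SIZE + body_size)  # body assumed zero-filled
    for off in range(0, HEADER_SIZE, CHUNK):
        entity[off:off + CHUNK] = template[off:off + CHUNK]
    return entity
```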
Address Region Summary
| Region | Range | Count | Purpose |
|---|---|---|---|
.rodata | 0x82xxxx--0xA7xxxx | ~30 | Constant tables (attribute descriptors, operation names, type kind names) |
.data | 0xD46xxx--0xD48xxx | ~10 | Attribute descriptor table, CLI flag lookup |
.data | 0xE6xxxx--0xE8xxxx | ~40 | IL metadata tables (entry kind names, type properties, signedness, pragma IDs) |
.rodata | 0x88xxxx | 1 | Error message template table (3795 entries) |
.bss | 0x106Bxxx--0x106Cxxx | ~120 | NVIDIA-added CLI flags, feature toggles, CUDA configuration |
.bss | 0x1065xxx | ~20 | Backend code generator state (output position, stub mode) |
.bss | 0x1067xxx | ~10 | Diagnostic per-error tracking, entity formatter |
.bss | 0x126xxxx | ~200 | EDG core state (scope stack, lexer, IL, error counters, source position) |
.bss | 0x1270xxx | ~10 | Preprocessor macro chains |
.bss | 0x1280xxx | ~15 | Arena allocator tracking, lambda bitmaps |
.bss | 0x1286xxx | ~10 | Lambda transform state, registration lists |
.bss | 0x12C6xxx--0x12C7xxx | ~40 | PCH, template instantiation, TU management |
.bss | 0xE7xxxx | ~30 | Attribute system, override tracking, red-black tree |
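When cross-referencing a global against this summary, it can help to resolve the address mechanically. The sketch below uses the start and end addresses from the Segment Layout table under Binary Identity, treating each end as exclusive (start plus size). The function name is hypothetical.

```python
# (name, start, end) taken from the Segment Layout table; end is exclusive.
SEGMENTS = [
    (".text", 0x403300, 0x829722),
    (".rodata", 0x829740, 0xAA3FA3),
    (".eh_frame", 0xCB1210, 0xD3F398),
    (".data", 0xD46480, 0xE7EFF0),
    (".bss", 0xE7F000, 0x12D6F20),
]


def section_of(addr: int):
    """Return the section containing addr, or None if it falls in a gap."""
    for name, start, end in SEGMENTS:
        if start <= addr < end:
            return name
    return None
```

For example, `section_of(0xA6F840)` (builtin_op_names) resolves to `.rodata`, and `section_of(0x106C210)` (exceptions_enabled) resolves to `.bss`.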
Token Kind Table
Every token produced by cudafe++'s lexer carries a 16-bit token kind stored in the global word_126DD58. There are exactly 357 token kinds, numbered 0 through 356, with names indexed from a read-only string pointer table at off_E6D240 in the .rodata segment. A parallel 357-entry byte array at byte_E6C0E0 maps each token kind to an operator-name index, used by the initialize_opname_kinds routine (sub_588BB0) to populate the operator name display table at qword_126DE00. A boolean stop-token table at qword_126DB48 + 8 (357 entries) marks which token kinds are valid synchronization points for error recovery in skip_to_token (sub_6887C0).
Token kind assignment follows a block scheme established by the EDG 6.6 frontend: operators and punctuation occupy the lowest range, followed by alternative tokens (C++ digraphs and named operators), C89 keywords, C99/C11 extensions, MSVC keywords, core C++ keywords, compiler internals, type-trait intrinsics, and finally the newest C++23/26 and extended-type additions at the top. CUDA-specific additions from NVIDIA occupy three dedicated slots (328--330) immediately after the type-trait block, plus additional entries in the extended range. This ordering reflects the historical accretion of the C and C++ standards: each new standard appended its keywords at the end of its block rather than filling gaps, which is why several blocks contain reserved slots.
Key Facts
| Property | Value |
|---|---|
| Total token kinds | 357 (indices 0--356) |
| Name table | off_E6D240 (357 string pointers in .rodata) |
| Operator-to-name map | byte_E6C0E0 (357-byte index array) |
| Operator name display table | qword_126DE00 (48 string pointers, populated by sub_588BB0) |
| Stop-token table | qword_126DB48 + 8 (357 boolean entries) |
| Current token global | word_126DD58 (WORD) |
| Keyword registration function | sub_5863A0 (keyword_init, 1,113 lines, fe_init.c) |
| Keyword entry function | sub_7463B0 (enter_keyword) |
| GNU variant registration | sub_585B10 (enter_gnu_keyword) |
| Alternative token entry | sub_749600 (registers named operator alternative) |
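The stop-token table drives error recovery: after a syntax error, tokens are discarded until one whose kind is marked in the table. The following is a minimal sketch of that idea, not a transcription of sub_6887C0. It assumes one boolean per token kind and that end-of-file (kind 0) always terminates the skip; the token stream is modeled as a plain list of kinds, and the helper names are hypothetical.

```python
TOKEN_KIND_COUNT = 357
TK_EOF = 0


def make_stop_table(stop_kinds):
    """Build a 357-entry boolean table with the given kinds marked."""
    table = [False] * TOKEN_KIND_COUNT
    for kind in stop_kinds:
        table[kind] = True
    return table


def skip_to_token(kinds, pos, stop_table):
    """Advance pos to the next stop token or EOF and return the new index."""
    while (
        pos < len(kinds)
        and kinds[pos] != TK_EOF
        and not stop_table[kinds[pos]]
    ):
        pos += 1
    return pos
```

With `;` (kind 6) and `}` (kind 12) marked as stop tokens, scanning `[1, 13, 1, 6, 1, 0]` from index 0 stops at index 3, the semicolon.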
Token Kind Ranges
| Range | Count | Category | Description |
|---|---|---|---|
| 0 | 1 | Special | End-of-file / no-token sentinel |
| 1--31 | 31 | Operators and punctuation | Identifier and literal tokens (1--5), then core operators (+, -, *, etc.) and delimiters ((, ), {, }, ;) |
| 32--51 | 20 | Operators (continued) | Compound and remaining operators (--, ==, <<, >>, +=, &&, ::) |
| 52--76 | 25 | Alternative tokens / digraphs | C++ named operators (and, or, not) and digraphs (<%, %>, <:, :>) |
| 77--108 | 32 | C89 keywords | All keywords from ANSI C89/ISO C90 |
| 109--131 | 23 | C99/C11 keywords | restrict, _Bool, _Complex, _Imaginary, character types |
| 132--136 | 5 | MSVC keywords | __declspec, __int8--__int64 |
| 137--199 | 63 | C++ keywords | Core C++ keywords plus C++11/14/17/20/23 additions |
| 200--206 | 7 | Compiler internal | Preprocessor and internal token kinds |
| 207--327 | 121 | Type trait intrinsics | __is_xxx / __has_xxx compiler intrinsic keywords |
| 328--330 | 3 | NVIDIA CUDA type traits | NVIDIA-specific lambda type-trait intrinsics |
| 331--356 | 26 | Extended types / recent additions | _Float32--_Float128, C++23/26 features, scalable vector types |
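Because the ranges are contiguous and cover 0 through 356 without gaps, a token kind can be mapped to its category with a binary search over the range starts. This is a small illustrative helper built from the table above; the function name is hypothetical and nothing here is taken from the binary itself.

```python
import bisect

# (first kind in range, category) copied from the Token Kind Ranges table.
RANGES = [
    (0, "Special"),
    (1, "Operators and punctuation"),
    (32, "Operators (continued)"),
    (52, "Alternative tokens / digraphs"),
    (77, "C89 keywords"),
    (109, "C99/C11 keywords"),
    (132, "MSVC keywords"),
    (137, "C++ keywords"),
    (200, "Compiler internal"),
    (207, "Type trait intrinsics"),
    (328, "NVIDIA CUDA type traits"),
    (331, "Extended types / recent additions"),
]
STARTS = [start for start, _ in RANGES]
MAX_KIND = 356


def category(kind: int) -> str:
    """Return the category name for a token kind in 0..356."""
    if not 0 <= kind <= MAX_KIND:
        raise ValueError(f"token kind out of range: {kind}")
    return RANGES[bisect.bisect_right(STARTS, kind) - 1][1]
```

The per-range counts in the table sum to 357, which matches the total number of token kinds.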
Complete Token Table
Operators and Punctuation (0--51)
These tokens are produced directly by the character-level scanner sub_679800 (scan_token). Multi-character operators are resolved by dedicated scanning functions in the 0x67ABB0--0x67BAB0 range.
| Kind | Name | C/C++ Construct | Notes |
|---|---|---|---|
| 0 | <eof> | End of file | Sentinel / no-token marker |
| 1 | <identifier> | Identifier | Any non-keyword identifier |
| 2 | <integer literal> | Integer constant | Decimal, hex, octal, or binary |
| 3 | <floating literal> | Floating-point constant | Float, double, or long double |
| 4 | <character literal> | Character constant | 'x', includes wide/u8/u16/u32 |
| 5 | <string literal> | String literal | "...", includes wide/u8/u16/u32/raw |
| 6 | ; | Semicolon | Statement terminator |
| 7 | ( | Left parenthesis | Grouping, function call |
| 8 | ) | Right parenthesis | |
| 9 | , | Comma | Separator, comma operator |
| 10 | = | Assignment | a = b |
| 11 | { | Left brace | Block/initializer open |
| 12 | } | Right brace | Block/initializer close |
| 13 | + | Plus | Addition, unary plus |
| 14 | - | Minus | Subtraction, unary minus |
| 15 | * | Star | Multiplication, pointer dereference, pointer declarator |
| 16 | / | Slash | Division |
| 17 | < | Less-than | Comparison, template open bracket |
| 18 | > | Greater-than | Comparison, template close bracket |
| 19 | & | Ampersand | Bitwise AND, address-of, reference declarator |
| 20 | ? | Question mark | Ternary conditional |
| 21 | : | Colon | Label, ternary, bit-field width |
| 22 | ~ | Tilde | Bitwise complement, destructor |
| 23 | % | Percent | Modulo |
| 24 | ^ | Caret | Bitwise XOR |
| 25 | [ | Left bracket | Array subscript, attributes [[ |
| 26 | . | Dot | Member access |
| 27 | ] | Right bracket | |
| 28 | ! | Exclamation | Logical NOT |
| 29 | | | Pipe | Bitwise OR |
| 30 | -> | Arrow | Pointer member access |
| 31 | ++ | Increment | Pre/post increment |
| 32 | -- | Decrement | Pre/post decrement |
| 33 | == | Equal | Equality comparison; also bitand alt-token for & |
| 34 | != | Not-equal | Inequality comparison |
| 35 | <= | Less-or-equal | Comparison |
| 36 | >= | Greater-or-equal | Comparison |
| 37 | << | Left shift | Also compl alt-token for ~ |
| 38 | >> | Right shift | Also not alt-token for ! |
| 39 | += | Add-assign | Compound assignment |
| 40 | -= | Subtract-assign | |
| 41 | *= | Multiply-assign | |
| 42 | /= | Divide-assign | |
| 43 | %= | Modulo-assign | |
| 44 | <<= | Left-shift-assign | |
| 45 | >>= | Right-shift-assign | |
| 46 | && | Logical AND | Also rvalue reference declarator |
| 47 | || | Logical OR | |
| 48 | ^= | XOR-assign | Also not_eq alt-token for != |
| 49 | &= | AND-assign | |
| 50 | |= | OR-assign | Also xor alt-token for ^ |
| 51 | :: | Scope resolution | Also bitor alt-token for | |
Alternative Tokens and Digraphs (52--76)
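C++ alternative tokens (ISO 14882 clause 5.5) and C/C++ digraphs. These are registered during keyword_init (sub_5863A0) via sub_749600 when in C++ mode (dword_126EFB4 == 2). The range also holds the remaining multi-character operators (->*, .*, ..., <=>) and the preprocessor operators # and ##, which are not alternative tokens but share this block.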
| Kind | Name | Equivalent | Notes |
|---|---|---|---|
| 52 | and | && | Logical AND |
| 53 | or | || | Logical OR |
| 54 | ->* | ->* | Pointer-to-member via pointer |
| 55 | .* | .* | Pointer-to-member via object |
| 56 | ... | ... | Ellipsis (variadic) |
| 57 | <=> | <=> | Three-way comparison (C++20) |
| 58 | # | # | Preprocessor stringification |
| 59 | ## | ## | Preprocessor token paste |
| 60 | <% | { | Digraph for left brace |
| 61 | %> | } | Digraph for right brace |
| 62 | <: | [ | Digraph for left bracket |
| 63 | :> | ] | Digraph for right bracket |
| 64 | and_eq | &= | Bitwise AND-assign |
| 65 | xor_eq | ^= | Bitwise XOR-assign |
| 66 | or_eq | |= | Bitwise OR-assign |
| 67 | %: | # | Digraph for hash |
| 68 | %:%: | ## | Digraph for token paste |
| 69--76 | (reserved) | -- | Reserved for future alternative tokens |
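The Equivalent column above defines a straightforward spelling map. The sketch below applies it, with one assumption drawn from the registration note: named operators such as `and` are treated as operators only in C++ mode and are left as ordinary identifiers otherwise, while digraphs are mapped in both modes. The function name and the `cplusplus` parameter are hypothetical.

```python
# Spelling -> primary spelling, taken from the Equivalent column.
EQUIVALENT = {
    "and": "&&", "or": "||",
    "and_eq": "&=", "xor_eq": "^=", "or_eq": "|=",
    "<%": "{", "%>": "}", "<:": "[", ":>": "]",
    "%:": "#", "%:%:": "##",
}


def canonical_spelling(tok: str, cplusplus: bool = True) -> str:
    """Map an alternative spelling to its primary form, if applicable."""
    is_named_operator = tok.isidentifier()
    if is_named_operator and not cplusplus:
        return tok  # in C mode the word is just an identifier
    return EQUIVALENT.get(tok, tok)
```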
C89 Keywords (77--108)
Registered unconditionally. These form the base keyword set present in every compilation mode.
| Kind | Name | C/C++ Construct |
|---|---|---|
| 77 | auto | Storage class (C89); type deduction (C++11) |
| 78 | break | Loop/switch exit |
| 79 | case | Switch case label |
| 80 | char | Character type |
| 81 | const | Const qualifier |
| 82 | continue | Loop continuation |
| 83 | default | Switch default label; defaulted function (C++11) |
| 84 | do | Do-while loop |
| 85 | double | Double-precision float |
| 86 | else | If-else branch |
| 87 | enum | Enumeration |
| 88 | extern | External linkage |
| 89 | float | Single-precision float |
| 90 | for | For loop |
| 91 | goto | Unconditional jump |
| 92 | if | Conditional |
| 93 | int | Integer type |
| 94 | long | Long integer modifier |
| 95 | register | Register storage hint (deprecated in C++17) |
| 96 | return | Function return |
| 97 | short | Short integer modifier |
| 98 | signed | Signed integer modifier |
| 99 | sizeof | Size query operator |
| 100 | static | Static storage / internal linkage |
| 101 | struct | Structure |
| 102 | switch | Multi-way branch |
| 103 | typedef | Type alias (C-style) |
| 104 | union | Union type |
| 105 | unsigned | Unsigned integer modifier |
| 106 | void | Void type |
| 107 | volatile | Volatile qualifier |
| 108 | while | While loop |
C99/C11/C23 Keywords (109--131)
Gated on the C standard version at dword_126EF68 (values: 199901 = C99, 201112 = C11, 202311 = C23).
| Kind | Name | Standard | C/C++ Construct |
|---|---|---|---|
| 109 | inline | C99 | Inline function hint (already C++ keyword at 154) |
| 110--118 | (reserved) | -- | -- |
| 119 | restrict | C99 | Pointer restrict qualifier |
| 120 | _Bool | C99 | Boolean type (C-style) |
| 121 | _Complex | C99 | Complex number type |
| 122 | _Imaginary | C99 | Imaginary number type |
| 123--125 | (reserved) | -- | -- |
| 126 | char16_t | C++11/C23 | 16-bit character type |
| 127 | char32_t | C++11/C23 | 32-bit character type |
| 128 | char8_t | C++20/C23 | UTF-8 character type |
| 129--131 | (reserved) | -- | -- |
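Version gating of this kind reduces to comparing the active standard value against a per-keyword minimum. The sketch below covers only the C99 rows of the table and assumes a simple "greater than or equal" comparison against the value at dword_126EF68; the actual control flow inside keyword_init is not reproduced here, and the names are hypothetical.

```python
C99, C11, C23 = 199901, 201112, 202311  # values listed for dword_126EF68

# (token kind, spelling, minimum C standard) for the C99 rows above.
C_KEYWORDS = [
    (109, "inline", C99),
    (119, "restrict", C99),
    (120, "_Bool", C99),
    (121, "_Complex", C99),
    (122, "_Imaginary", C99),
]


def enabled_keywords(std_version: int) -> dict:
    """Return {spelling: kind} for keywords available at std_version."""
    return {
        name: kind
        for kind, name, minimum in C_KEYWORDS
        if std_version >= minimum
    }
```

Under this model a pre-C99 value such as 199409 enables none of these keywords, while 199901 and later enable all five.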
MSVC Keywords (132--136)
Gated on dword_126EFB0 (Microsoft extensions enabled, language mode 2/MSVC).
| Kind | Name | MSVC Construct |
|---|---|---|
| 132 | __declspec | MSVC declaration specifier |
| 133 | __int8 | 8-bit integer type |
| 134 | __int16 | 16-bit integer type |
| 135 | __int32 | 32-bit integer type |
| 136 | __int64 | 64-bit integer type |
C++ Core Keywords (137--199)
Gated on C++ mode (dword_126EFB4 == 2). Some keywords within this range were added in C++11 through C++23 and have additional standard-version gates.
| Kind | Name | Standard | C/C++ Construct |
|---|---|---|---|
| 137 | bool | C++98 | Boolean type |
| 138 | true | C++98 | Boolean literal |
| 139 | false | C++98 | Boolean literal |
| 140 | wchar_t | C++98 | Wide character type |
| 141 | (reserved) | -- | -- |
| 142 | __attribute | GNU | GCC attribute syntax |
| 143 | __builtin_types_compatible_p | GNU | GCC type compatibility test |
| 144--149 | (reserved) | -- | -- |
| 150 | catch | C++98 | Exception handler |
| 151 | class | C++98 | Class definition |
| 152 | delete | C++98 | Deallocation; deleted function (C++11) |
| 153 | friend | C++98 | Friend declaration |
| 154 | inline | C++98 | Inline function/variable |
| 155 | new | C++98 | Allocation expression |
| 156 | operator | C++98 | Operator overload |
| 157 | private | C++98 | Access specifier |
| 158 | protected | C++98 | Access specifier |
| 159 | public | C++98 | Access specifier |
| 160 | template | C++98 | Template declaration |
| 161 | this | C++98 | Current object pointer |
| 162 | throw | C++98 | Throw expression |
| 163 | try | C++98 | Try block |
| 164 | virtual | C++98 | Virtual function/base |
| 165 | (reserved) | -- | -- |
| 166 | const_cast | C++98 | Const cast expression |
| 167 | dynamic_cast | C++98 | Dynamic cast expression |
| 168 | (reserved) | -- | -- |
| 169 | export | C++98/20 | Export declaration (original C++98, revived for modules in C++20) |
| 170 | export | C++20 | Module export (alternate registration slot) |
| 171--173 | (reserved) | -- | -- |
| 174 | mutable | C++98 | Mutable data member |
| 175 | namespace | C++98 | Namespace declaration |
| 176 | reinterpret_cast | C++98 | Reinterpret cast expression |
| 177 | static_cast | C++98 | Static cast expression |
| 178 | typeid | C++98 | Runtime type identification |
| 179 | using | C++98 | Using declaration/directive |
| 180--182 | (reserved) | -- | -- |
| 183 | typename | C++98 | Dependent type name |
| 184 | static_assert | C++11 | Static assertion; also _Static_assert in C11 |
| 185 | decltype | C++11 | Decltype specifier |
| 186 | __auto_type | GNU | GCC auto type extension |
| 187 | __extension__ | GNU | GCC extension marker (suppress warnings) |
| 188 | (reserved) | -- | -- |
| 189 | typeof | C23/GNU | Type-of expression |
| 190 | typeof_unqual | C23 | Unqualified type-of expression |
| 191--193 | (reserved) | -- | -- |
| 194 | thread_local | C++11 | Thread-local storage; also _Thread_local in C11 |
| 195--199 | (reserved) | -- | -- |
Compiler Internal Tokens (200--206)
These tokens are used internally by the preprocessor and the token cache. They never appear in user-visible diagnostics.
| Kind | Name | Purpose |
|---|---|---|
| 200 | <pp-number> | Preprocessing number (not yet classified as integer or float) |
| 201 | <header-name> | Include file name (<file> or "file") |
| 202 | <newline> | Logical newline token (preprocessor directive boundary) |
| 203 | <whitespace> | Whitespace token (preprocessing mode only) |
| 204 | <placemarker> | Token-paste placeholder (empty argument in ##) |
| 205 | <pragma> | Pragma token (deferred for later processing) |
| 206 | <end-of-directive> | End of preprocessor directive |
Type Trait Intrinsics (207--327)
These are compiler intrinsic keywords that implement the C++ type traits (from <type_traits>) without requiring template instantiation. They are registered during keyword_init with C++ standard version gating -- earlier traits (C++11) are always available in C++ mode, while newer traits (C++20, C++23, C++26) require the corresponding standard version at dword_126EF68. Some traits are MSVC-specific (gated on dword_126EFB0) or Clang-specific (gated on qword_126EF90).
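The complete list of keywords in this range follows, in token-kind order within each sub-category. Several non-trait keywords (nullptr, constexpr, the coroutine keywords, C11 specifiers, and vendor builtins) share the range and are marked as such: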
Unary Type Predicates
| Kind | Name | Standard | Tests Whether... |
|---|---|---|---|
| 207 | __is_class | C++11 | Type is a class (not union) |
| 208 | __is_enum | C++11 | Type is an enumeration |
| 209 | __is_union | C++11 | Type is a union |
| 210 | __is_pod | C++11 | Type is POD (plain old data) |
| 211 | __is_empty | C++11 | Type has no non-static data members |
| 212 | __is_polymorphic | C++11 | Type has at least one virtual function |
| 213 | __is_abstract | C++11 | Type has at least one pure virtual function |
| 214 | __is_literal_type | C++11 | Type is a literal type (deprecated C++17) |
| 215 | __is_standard_layout | C++11 | Type is standard-layout |
| 216 | __is_trivial | C++11 | Type is trivially copyable and has trivial default constructor |
| 217 | __is_trivially_copyable | C++11 | Type is trivially copyable |
| 218 | __is_final | C++14 | Class is marked final |
| 219 | __is_aggregate | C++17 | Type is an aggregate |
| 220 | __has_virtual_destructor | C++11 | Type has a virtual destructor |
| 221 | __has_trivial_constructor | C++11 | Type has a trivial default constructor |
| 222 | __has_trivial_copy | C++11 | Type has a trivial copy constructor |
| 223 | __has_trivial_assign | C++11 | Type has a trivial copy assignment |
| 224 | __has_trivial_destructor | C++11 | Type has a trivial destructor |
| 225 | __has_nothrow_constructor | C++11 | Default constructor is noexcept |
| 226 | __has_nothrow_copy | C++11 | Copy constructor is noexcept |
| 227 | __has_nothrow_assign | C++11 | Copy assignment is noexcept |
| 228 | __has_trivial_move_constructor | C++11 | Type has a trivial move constructor |
| 229 | __has_trivial_move_assign | C++11 | Type has a trivial move assignment |
| 230 | __has_nothrow_move_assign | C++11 | Move assignment is noexcept |
| 231 | __has_unique_object_representations | C++17 | Type has unique object representations |
| 232 | __is_signed | C++11 | Type is a signed arithmetic type |
| 233 | __is_unsigned | C++11 | Type is an unsigned arithmetic type |
| 234 | __is_integral | C++11 | Type is an integral type |
| 235 | __is_floating_point | C++11 | Type is a floating-point type |
| 236 | __is_arithmetic | C++11 | Type is an arithmetic type |
| 237 | nullptr | C++11 | Null pointer literal (not a trait; shares range) |
| 238 | __is_fundamental | C++11 | Type is a fundamental type |
| 239 | __int128 | GNU | 128-bit integer type (not a trait; shares range) |
| 240 | __is_scalar | C++11 | Type is a scalar type |
| 241 | __is_object | C++11 | Type is an object type |
| 242 | __is_compound | C++11 | Type is a compound type |
| 243 | __is_reference | C++11 | Type is an lvalue or rvalue reference |
| 244 | constexpr | C++11 | Constexpr specifier (not a trait; shares range) |
| 245 | consteval | C++20 | Consteval specifier (not a trait; shares range) |
| 246 | constinit | C++20 | Constinit specifier (not a trait; shares range) |
| 247 | _Alignof | C11 | Alignment query (C11 spelling) |
| 248 | _Alignas | C11 | Alignment specifier (C11 spelling) |
| 249 | __bases | GCC | All base classes, direct and indirect (GCC extension) |
| 250 | __direct_bases | GCC | Direct base classes only (GCC extension) |
| 251 | __builtin_arm_ldrex | Clang | ARM load-exclusive intrinsic |
| 252 | __builtin_arm_ldaex | Clang | ARM load-acquire-exclusive intrinsic |
| 253 | __builtin_arm_addg | Clang | ARM MTE add-tag intrinsic |
| 254 | __builtin_arm_irg | Clang | ARM MTE insert-random-tag intrinsic |
| 255 | __builtin_arm_ldg | Clang | ARM MTE load-tag intrinsic |
| 256 | __is_member_pointer | C++11 | Type is a pointer to member |
| 257 | __is_member_function_pointer | C++11 | Type is a pointer to member function |
| 258 | __builtin_shufflevector | Clang | Clang vector shuffle intrinsic |
| 259 | __builtin_convertvector | Clang | Clang vector conversion intrinsic |
| 260 | _Noreturn | C11 | No-return function specifier |
| 261 | __builtin_complex | GNU | GCC complex number construction |
| 262 | _Generic | C11 | Generic selection expression |
| 263 | _Atomic | C11 | Atomic type qualifier/specifier |
| 264 | _Nullable | Clang | Nullable pointer qualifier |
| 265 | _Nonnull | Clang | Non-null pointer qualifier |
| 266 | _Null_unspecified | Clang | Null-unspecified pointer qualifier |
| 267 | co_yield | C++20 | Coroutine yield expression |
| 268 | co_return | C++20 | Coroutine return statement |
| 269 | co_await | C++20 | Coroutine await expression |
| 270 | __is_member_object_pointer | C++11 | Type is a pointer to data member |
| 271 | __builtin_addressof | GNU | Address-of without operator overload |
EDG Internal Keywords (272--283)
These are not user-facing keywords. They are injected by the EDG frontend into synthesized declarations for built-in types, throw specifications, and vector types.
| Kind | Name | Purpose |
|---|---|---|
| 272 | __edg_type__ | EDG internal type placeholder |
| 273 | __edg_vector_type__ | SIMD vector type (GCC __attribute__((vector_size)) lowering) |
| 274 | __edg_neon_vector_type__ | ARM NEON vector type |
| 275 | __edg_scalable_vector_type__ | ARM SVE scalable vector type |
| 276 | __edg_neon_polyvector_type__ | ARM NEON polynomial vector type |
| 277 | __edg_size_type__ | Placeholder for size_t before it is typedef'd |
| 278 | __edg_ptrdiff_type__ | Placeholder for ptrdiff_t before it is typedef'd |
| 279 | __edg_bool_type__ | Placeholder for bool / _Bool |
| 280 | __edg_wchar_type__ | Placeholder for wchar_t |
| 281 | __edg_throw__ | Throw specification in synthesized declarations |
| 282 | __edg_opnd__ | Operand reference in synthesized expressions |
| 283 | (reserved) | -- |
More Type Predicates and Binary Traits (284--327)
| Kind | Name | Standard | Tests Whether... |
|---|---|---|---|
| 284 | __is_const | C++11 | Type is const-qualified |
| 285 | __is_volatile | C++11 | Type is volatile-qualified |
| 286 | __is_void | C++11 | Type is void |
| 287 | __is_array | C++11 | Type is an array |
| 288 | __is_pointer | C++11 | Type is a pointer |
| 289 | __is_lvalue_reference | C++11 | Type is an lvalue reference |
| 290 | __is_rvalue_reference | C++11 | Type is an rvalue reference |
| 291 | __is_function | C++11 | Type is a function type |
| 292 | __is_constructible | C++11 | Type is constructible from given args |
| 293 | __is_nothrow_constructible | C++11 | Construction is noexcept |
| 294 | requires | C++20 | Requires expression/clause |
| 295 | concept | C++20 | Concept definition |
| 296 | __builtin_has_attribute | GNU | Tests if declaration has given attribute |
| 297 | __builtin_bit_cast | C++20 | Bit cast intrinsic (std::bit_cast implementation) |
| 298 | __is_assignable | C++11 | Type is assignable from given type |
| 299 | __is_nothrow_assignable | C++11 | Assignment is noexcept |
| 300 | __is_trivially_constructible | C++11 | Construction is trivial |
| 301 | __is_trivially_assignable | C++11 | Assignment is trivial |
| 302 | __is_destructible | C++11 | Type is destructible |
| 303 | __is_nothrow_destructible | C++11 | Destruction is noexcept |
| 304 | __edg_is_deducible | EDG | EDG internal: template argument is deducible |
| 305 | __is_trivially_destructible | C++11 | Destruction is trivial |
| 306 | __is_base_of | C++11 | First type is base of second (binary trait) |
| 307 | __is_convertible | C++11 | First type is convertible to second (binary trait) |
| 308 | __is_same | C++11 | Two types are the same (binary trait) |
| 309 | __is_trivially_copy_assignable | C++11 | Copy assignment is trivial |
| 310 | __is_assignable_no_precondition_check | EDG | Assignable without precondition validation |
| 311 | __is_same_as | Clang | Alias for __is_same (Clang compatibility) |
| 312 | __is_referenceable | C++11 | Type can be referenced |
| 313 | __is_bounded_array | C++20 | Type is a bounded array |
| 314 | __is_unbounded_array | C++20 | Type is an unbounded array |
| 315 | __is_scoped_enum | C++23 | Type is a scoped enumeration |
| 316 | __is_literal | C++11 | Alias for __is_literal_type |
| 317 | __is_complete_type | EDG | Type is complete (not forward-declared) |
| 318 | __is_nothrow_convertible | C++20 | Conversion is noexcept (binary trait) |
| 319 | __is_convertible_to | MSVC | MSVC alias for __is_convertible |
| 320 | __is_invocable | C++17 | Callable with given arguments |
| 321 | __is_nothrow_invocable | C++17 | Call is noexcept |
| 322 | __is_trivially_equality_comparable | Clang | Bitwise equality is equivalent |
| 323 | __is_layout_compatible | C++20 | Types have compatible layouts |
| 324 | __is_pointer_interconvertible_base_of | C++20 | Pointer-interconvertible base (binary trait) |
| 325 | __is_corresponding_member | C++20 | Corresponding members in layout-compatible types |
| 326 | __is_pointer_interconvertible_with_class | C++20 | Member pointer is interconvertible with class pointer |
| 327 | __is_trivially_relocatable | C++26 | Type can be trivially relocated |
NVIDIA CUDA Type Traits (328--330)
Three NVIDIA-specific type-trait intrinsics occupy dedicated token kinds. These are registered during keyword_init when GPU mode is active (dword_106C2C0 != 0) and participate in the same token classification pipeline as all other type traits. They are used internally by the CUDA frontend to detect extended lambda closure types during device/host separation.
| Kind | Name | Purpose |
|---|---|---|
| 328 | __nv_is_extended_device_lambda_closure_type | Tests whether a type is the closure type of an extended device lambda. Used during device code generation to identify lambda closures that require special treatment (wrapper function generation, address-space conversion). |
| 329 | __nv_is_extended_host_device_lambda_closure_type | Tests whether a type is the closure type of an extended host-device lambda (__host__ __device__). These lambdas require dual code generation paths and wrapper functions for both host and device. |
| 330 | __nv_is_extended_device_lambda_with_preserved_return_type | Tests whether a device lambda has an explicitly specified (preserved) return type rather than a deduced one. Affects how the compiler generates the wrapper function return type. |
When extended lambdas are disabled, these traits are predefined as macros expanding to false:
// Fallback definitions in preprocessor preamble:
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
Extended Types and Recent Additions (331--356)
These are the newest token kinds, added for extended floating-point types (ISO/IEC TS 18661-3) and recent C++23/26 features.
| Kind | Name | Standard | C/C++ Construct |
|---|---|---|---|
| 331 | _Float32 | TS 18661-3 | 32-bit IEEE 754 float |
| 332 | _Float32x | TS 18661-3 | Extended 32-bit float |
| 333 | _Float64 | TS 18661-3 | 64-bit IEEE 754 float |
| 334 | _Float64x | TS 18661-3 | Extended 64-bit float |
| 335 | _Float128 | TS 18661-3 | 128-bit IEEE 754 float |
| 336--340 | (reserved) | -- | -- |
| 341--356 | (recent additions) | C++23/26 | Reserved for MSVC C++/CLI traits (__is_ref_class, __is_value_class, __is_interface_class, __is_delegate, __is_sealed, __has_finalizer, __has_copy, __has_assign, __is_simple_value_class, __is_ref_array, __is_valid_winrt_type, __is_win_class, __is_win_interface) and additional future extensions |
Token Cache
The token cache provides lookahead, backtracking, and macro-expansion replay for C++ parsing. Tokens are stored in a linked list of cache entries, each 80--112 bytes depending on payload.
Cache Entry Layout
| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Next entry in linked list |
| +8 | 8 | source_position | Encoded file/line/column |
| +16 | 2 | token_code | Token kind (0--356) |
| +18 | 1 | cache_entry_kind | Payload discriminator (see table below) |
| +20 | 4 | flags | Token classification flags |
| +24 | 4 | extra_flags | Additional flags |
| +32 | 8 | extra_data | Context-dependent data |
| +40.. | varies | payload | Kind-specific data (40--72 bytes) |
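As a reading aid, the layout table can be written as a C struct. This is a minimal sketch for an LP64 target: the field names come from the table, while the two padding members (`pad_19`, `pad_28`) and the fixed 72-byte payload bound are assumptions introduced here to make the offsets line up.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the cache entry header; padding members are hypothetical. */
typedef struct token_cache_entry {
    struct token_cache_entry *next; /* +0  next entry in linked list  */
    uint64_t source_position;       /* +8  encoded file/line/column   */
    uint16_t token_code;            /* +16 token kind (0--356)        */
    uint8_t  cache_entry_kind;      /* +18 payload discriminator      */
    uint8_t  pad_19;                /* +19 not listed in the table    */
    uint32_t flags;                 /* +20 token classification flags */
    uint32_t extra_flags;           /* +24 additional flags           */
    uint32_t pad_28;                /* +28 not listed in the table    */
    uint64_t extra_data;            /* +32 context-dependent data     */
    unsigned char payload[72];      /* +40 kind-specific data         */
} token_cache_entry;

/* Compile-time checks against the documented offsets (8-byte pointers). */
_Static_assert(offsetof(token_cache_entry, token_code) == 16, "token_code");
_Static_assert(offsetof(token_cache_entry, cache_entry_kind) == 18, "kind");
_Static_assert(offsetof(token_cache_entry, flags) == 20, "flags");
_Static_assert(offsetof(token_cache_entry, extra_data) == 32, "extra_data");
_Static_assert(offsetof(token_cache_entry, payload) == 40, "payload");
_Static_assert(sizeof(token_cache_entry) == 112, "largest entry size");
```

The 112-byte total matches the upper bound of the 80--112 byte range quoted for cache entries; smaller entries would carry a shorter payload.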
Cache Entry Kinds
Eight discriminator values select the payload interpretation at offset +40:
| Kind | Value | Payload Content | Size | Description |
|---|---|---|---|---|
| identifier | 1 | Name pointer + 64-byte lookup result | 72 | Identifier with pre-resolved scope/symbol lookup. The 64-byte lookup result mirrors xmmword_106C380--106C3B0. |
| macro_def | 2 | Macro definition pointer | 8 | Reference to a macro definition for re-expansion. Dispatched to sub_5BA500. |
| pragma | 3 | Pragma data | varies | Preprocessor pragma deferred for later processing |
| pp_number | 4 | Number text pointer | 8 | Preprocessing number not yet classified as integer or float |
| (reserved) | 5 | -- | -- | Not observed in use |
| string | 6 | String data + encoding byte | varies | String literal with encoding prefix information |
| (reserved) | 7 | -- | -- | Not observed in use |
| concatenated_string | 8 | Concatenated string data | varies | Wide or multi-piece concatenated string literal |
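The discriminator values can be captured in an enum. The sketch below uses hypothetical `CEK_*` names and a helper, `cache_payload_size`, that returns the fixed payload size where the table gives one and 0 where the size is listed as "varies" or the value is reserved.

```c
#include <stddef.h>

/* Hypothetical names for the documented discriminator values. */
enum cache_entry_kind {
    CEK_IDENTIFIER          = 1,
    CEK_MACRO_DEF           = 2,
    CEK_PRAGMA              = 3,
    CEK_PP_NUMBER           = 4,
    CEK_STRING              = 6,
    CEK_CONCATENATED_STRING = 8
    /* 5 and 7 are reserved: not observed in use */
};

/* Fixed payload size in bytes, or 0 for variable-size and reserved kinds. */
size_t cache_payload_size(int kind) {
    switch (kind) {
    case CEK_IDENTIFIER: return 72; /* name pointer + 64-byte lookup result */
    case CEK_MACRO_DEF:  return 8;  /* macro definition pointer */
    case CEK_PP_NUMBER:  return 8;  /* number text pointer */
    default:             return 0;
    }
}
```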
Cache Management Globals
| Address | Name | Description |
|---|---|---|
| qword_1270150 | cached_token_rescan_list | Head of list of tokens to re-scan (pushed back for lookahead) |
| qword_1270128 | reusable_cache_stack | Stack of reusable cache entry blocks |
| qword_1270148 | free_token_list | Free list for recycling cache entries |
| qword_1270140 | macro_definition_chain | Active macro definition chain |
| qword_1270118 | cache_entry_free_list | Free list for allocate_token_cache_entry |
| dword_126DB74 | has_cached_tokens | Boolean: nonzero when cache is non-empty |
Cache Operations
| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_669650 | copy_tokens_from_cache | 385 | Copies cached preprocessor tokens for macro re-expansion (assert at lexical.c:3417) |
| sub_669D00 | allocate_token_cache_entry | 119 | Allocates from free list at qword_1270118, initializes fields |
| sub_669EB0 | create_cached_token_node | 83 | Creates and initializes cache node with source position |
| sub_66A000 | append_to_token_cache | 88 | Appends token to cache list, maintains tail pointer |
| sub_66A140 | push_token_to_rescan_list | 46 | Pushes token onto rescan stack at qword_1270150 |
| sub_66A2C0 | free_single_cache_entry | 18 | Returns cache entry to free list |
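The allocate/free pair follows the usual intrusive free-list pattern. The sketch below illustrates that pattern only: `entry_free_list` stands in for qword_1270118, the `malloc` fallback when the list is empty is an assumption (the binary's actual allocator is not described here), and the type and function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified entry: a link pointer plus the rest of a 112-byte record. */
typedef struct cache_entry {
    struct cache_entry *next;
    unsigned char body[104];
} cache_entry;

/* Stand-in for the free-list head at qword_1270118. */
static cache_entry *entry_free_list = NULL;

/* Pop a recycled entry if one exists, otherwise allocate a fresh one. */
cache_entry *alloc_entry(void) {
    cache_entry *e = entry_free_list;
    if (e != NULL) {
        entry_free_list = e->next;
    } else {
        e = malloc(sizeof *e);   /* assumed fallback, not confirmed */
        if (e == NULL)
            return NULL;
    }
    memset(e, 0, sizeof *e);     /* "initializes fields" */
    return e;
}

/* Push the entry back onto the free list for reuse. */
void free_entry(cache_entry *e) {
    e->next = entry_free_list;
    entry_free_list = e;
}
```

Recycling through a free list avoids allocator traffic when the parser repeatedly caches and discards short token runs during lookahead.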
Keyword Registration
All keywords are registered during frontend initialization by sub_5863A0 (keyword_init / fe_translation_unit_init, 1,113 lines, in fe_init.c). The function calls sub_7463B0 (enter_keyword) for each keyword, passing the numeric token kind and the keyword string. GNU double-underscore variants (e.g., __asm and __asm__ for asm) are registered via sub_585B10 (enter_gnu_keyword), which automatically generates both __name and __name__ forms from a single root. Alternative tokens are registered via sub_749600.
Version Gating Architecture
Registration is conditional on a set of global configuration flags established during CLI processing:
| Address | Name | Controls | Values |
|---|---|---|---|
| dword_126EFB4 | language_mode | C vs C++ dialect | 1 = C (GNU default), 2 = C++ |
| dword_126EF68 | cpp_standard_version | Standard version level | 199711 (C++98), 201103 (C++11), 201402 (C++14), 201703 (C++17), 202002 (C++20), 202302 (C++23) |
| dword_126EFAC | c_language_mode | C mode flag | Boolean |
| dword_126EFB0 | microsoft_extensions | MSVC keywords | Boolean |
| dword_126EFA8 | gnu_extensions | GCC keywords | Boolean |
| dword_126EFA4 | clang_extensions | Clang keywords | Boolean |
| qword_126EF98 | gnu_version | GCC version threshold | Encoded integer: e.g., 0x9FC3 = 40899, which reads as GCC 4.8.99 if the packing is major*10000 + minor*100 + patch |
| qword_126EF90 | clang_version | Clang version threshold | Encoded integer: e.g., 0x15F8F (89999), 0x1D4BF (119999) |
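The sample threshold constants decode cleanly if the versions are packed as major*10000 + minor*100 + patch. That packing is an assumption made for this sketch, as are the `compiler_version` type and `decode_version` helper.

```c
/* Hypothetical decoded form of gnu_version / clang_version. */
typedef struct {
    int major;
    int minor;
    int patch;
} compiler_version;

/* Unpacks major*10000 + minor*100 + patch (assumed encoding). */
compiler_version decode_version(long encoded) {
    compiler_version v;
    v.major = (int)(encoded / 10000);
    v.minor = (int)((encoded / 100) % 100);
    v.patch = (int)(encoded % 100);
    return v;
}
```

Under this reading, 0x9FC3 (40899) unpacks to 4.8.99, and 0x15F8F (89999) unpacks to 8.99.99. Values ending in 99 suggest the comparisons are of the form "version greater than X.99.99", that is, "at least the next release", but that interpretation is also an inference.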
Registration Pattern
The pseudocode below shows the version-gated registration pattern reconstructed from sub_5863A0:
void keyword_init(void) {
    // C89 keywords -- always registered
    enter_keyword(77, "auto");
    enter_keyword(78, "break");
    enter_keyword(79, "case");
    // ... all C89 keywords ...
    enter_keyword(108, "while");
    // C99 keywords -- gated on C99+ standard
    if (c_standard_version >= 199901) {
        enter_keyword(119, "restrict");
        enter_keyword(120, "_Bool");
        enter_keyword(121, "_Complex");
        enter_keyword(122, "_Imaginary");
    }
    // C11 keywords
    if (c_standard_version >= 201112) {
        enter_keyword(184, "_Static_assert");
        enter_keyword(247, "_Alignof");
        enter_keyword(248, "_Alignas");
        enter_keyword(260, "_Noreturn");
        enter_keyword(262, "_Generic");
        enter_keyword(263, "_Atomic");
        enter_keyword(194, "_Thread_local");
    }
    // C++ mode keywords
    if (language_mode == 2) { // C++ mode
        enter_keyword(137, "bool");
        enter_keyword(138, "true");
        enter_keyword(139, "false");
        enter_keyword(140, "wchar_t");
        enter_keyword(150, "catch");
        enter_keyword(151, "class");
        // ... all C++ core keywords ...
        enter_keyword(183, "typename");
        // Alternative tokens (C++ only)
        enter_alt_token(52, "and", /*len*/3);
        enter_alt_token(53, "or", 2);
        enter_alt_token(64, "and_eq", 6);
        // ... all alternative tokens ...
        // C++11 keywords
        if (cpp_standard_version >= 201103) {
            enter_keyword(244, "constexpr");
            enter_keyword(185, "decltype");
            enter_keyword(237, "nullptr");
            enter_keyword(126, "char16_t");
            enter_keyword(127, "char32_t");
            enter_keyword(184, "static_assert");
            enter_keyword(194, "thread_local");
        }
        // C++20 keywords
        if (cpp_standard_version >= 202002) {
            enter_keyword(245, "consteval");
            enter_keyword(246, "constinit");
            enter_keyword(267, "co_yield");
            enter_keyword(268, "co_return");
            enter_keyword(269, "co_await");
            enter_keyword(294, "requires");
            enter_keyword(295, "concept");
        }
    }
    // GNU extensions -- gated on gnu_extensions flag
    if (gnu_extensions) {
        enter_gnu_keyword(187, "__extension__");
        enter_gnu_keyword(186, "__auto_type");
        enter_gnu_keyword(142, "__attribute");
        enter_keyword(117, "__builtin_offsetof");
        enter_keyword(143, "__builtin_types_compatible_p");
        enter_keyword(239, "__int128");
        // ... all GNU extensions ...
    }
    // MSVC extensions
    if (microsoft_extensions) {
        enter_keyword(132, "__declspec");
        enter_keyword(133, "__int8");
        enter_keyword(134, "__int16");
        enter_keyword(135, "__int32");
        enter_keyword(136, "__int64");
    }
    // Type traits (C++11+, ~60 traits)
    if (language_mode == 2) {
        enter_keyword(207, "__is_class");
        enter_keyword(208, "__is_enum");
        // ... all type traits through 327 ...
    }
    // CUDA type traits (GPU mode)
    if (gpu_mode) {
        enter_keyword(328, "__nv_is_extended_device_lambda_closure_type");
        enter_keyword(329, "__nv_is_extended_host_device_lambda_closure_type");
        enter_keyword(330, "__nv_is_extended_device_lambda_with_preserved_return_type");
    }
    // Extended float types (GNU)
    if (gnu_extensions) {
        enter_keyword(331, "_Float32");
        enter_keyword(332, "_Float32x");
        enter_keyword(333, "_Float64");
        enter_keyword(334, "_Float64x");
        enter_keyword(335, "_Float128");
    }
    // Post-keyword init: scope setup, builtin registration
    // ...
}
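The same gating can be expressed as data rather than nested conditionals. The sketch below is a hypothetical table-driven variant for experimentation, not a description of how sub_5863A0 is written: the token kinds and names are taken from the pseudocode above, while `gated_keyword`, `cpp_keywords`, and `count_enabled` are illustrative names.

```c
#include <stddef.h>

/* One row per keyword: token kind, spelling, and the lowest
   cpp_standard_version value that enables it. */
typedef struct {
    int kind;
    const char *name;
    long min_cpp_version;
} gated_keyword;

static const gated_keyword cpp_keywords[] = {
    {137, "bool",      199711}, /* registered for every C++ dialect */
    {244, "constexpr", 201103},
    {185, "decltype",  201103},
    {245, "consteval", 202002},
    {295, "concept",   202002},
};

/* Counts how many rows would be registered for a given C++ version. */
size_t count_enabled(long cpp_standard_version) {
    size_t n = 0;
    size_t total = sizeof cpp_keywords / sizeof cpp_keywords[0];
    for (size_t i = 0; i < total; i++) {
        if (cpp_standard_version >= cpp_keywords[i].min_cpp_version)
            n++;
    }
    return n;
}
```

A reimplementation could replace the counter with a call to its own keyword-entry routine for each enabled row.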
GNU Double-Underscore Registration
sub_585B10 (enter_gnu_keyword, assert at fe_init.c:698) implements the pattern where a single keyword name is registered in two or three forms:
- If name starts with _: registers name as-is and __name__ (e.g., _Bool stays, plus ___Bool__ if applicable)
- Otherwise: registers __name and __name__ (e.g., asm produces __asm and __asm__)
The function uses a 49-byte stack buffer. The length check is name length + 5 <= 0x31, where the 5 covers the two-character prefix, the two-character suffix, and the null terminator. It writes the __ prefix (stored as the 16-bit constant 0x5F5F), copies the name, and appends __ with a null terminator. Both variants call sub_7463B0 (enter_keyword) with the same token kind.
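The buffer handling for the second case (a root that does not start with an underscore) can be sketched as follows. `make_gnu_forms` and `GNU_KW_BUF` are hypothetical names; the sketch only builds the two spellings and leaves out the calls to enter_keyword and the leading-underscore branch.

```c
#include <string.h>

#define GNU_KW_BUF 0x31 /* 49 bytes, the limit quoted in the text */

/* Builds "__name" in prefixed and "__name__" in wrapped.
   Returns 0 on success, -1 if the wrapped form would not fit. */
int make_gnu_forms(const char *name,
                   char prefixed[GNU_KW_BUF],
                   char wrapped[GNU_KW_BUF]) {
    size_t len = strlen(name);
    if (len + 5 > GNU_KW_BUF)             /* 2 + len + 2 + terminator */
        return -1;
    memcpy(prefixed, "__", 2);            /* the 0x5F5F prefix */
    memcpy(prefixed + 2, name, len);
    prefixed[len + 2] = '\0';
    memcpy(wrapped, prefixed, len + 2);   /* reuse "__name" */
    memcpy(wrapped + len + 2, "__", 3);   /* suffix plus terminator */
    return 0;
}
```

For the root "asm" this yields "__asm" and "__asm__", matching the example in the list.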
Operator Name Table
The operator name display table at qword_126DE00 maps operator kinds to printable names for diagnostics and error messages. It is populated by sub_588BB0 (initialize_opname_kinds) during fe_wrapup.c initialization.
The initialization loop iterates all 357 entries of byte_E6C0E0 (operator-to-name index), mapping each non-zero entry to the corresponding string from off_E6D240 (the token name table). Two special cases are hardcoded:
| Operator Kind | Display Name | Special Case |
|---|---|---|
| 42 | () | Function call operator (overridden from default) |
| 43 | [] | Array subscript operator (overridden from default) |
Additionally, the array positions for new[] and delete[] are hardcoded separately, since these operator names do not correspond to single tokens.
The routine validates that all entries in the range qword_126DE08 through qword_126DF80 (the 48 operator name slots) are non-null, and panics with "initialize_opname_kinds: bad init of opname_names" if any gap is found.
Token State Globals
When a token is produced by the lexer, the following globals are populated:
| Address | Name | Type | Description |
|---|---|---|---|
| word_126DD58 | current_token_code | WORD | 16-bit token kind (0--356) |
| qword_126DD38 | current_source_position | QWORD | Encoded file/line/column |
| qword_126DD48 | token_text_ptr | QWORD | Pointer to identifier/literal text |
| src | token_start_position | char* | Start of token in input buffer |
| n | token_text_length | size_t | Length of token text |
| dword_126DF90 | token_flags_1 | DWORD | Classification flags |
| dword_126DF8C | token_flags_2 | DWORD | Additional flags |
| qword_126DF80 | token_extra_data | QWORD | Context-dependent payload |
| xmmword_106C380--106C3B0 | identifier_lookup_result | 4 x 128-bit | SSE-packed lookup result (64 bytes, 4 XMM registers) |
Cross-References
- Lexer & Tokenizer -- full lexer subsystem documentation, architecture, and function map
- Pipeline Overview -- keyword registration during initialization
- Entry Point & Initialization -- keyword_init in the startup sequence
- Global Variable Index -- all global addresses referenced here
- Template Engine -- template argument scanning and >> disambiguation
- CUDA Lambda Overview -- NVIDIA type-trait token usage in lambda transforms
- Attribute System Overview -- CUDA attribute handling at token level
- EDG Source File Map -- lexical.c and fe_init.c binary layout
CUDA Error Catalog
cudafe++ reserves internal error indices 3457--3794 (338 slots) for CUDA-specific diagnostics. These are displayed to users as numbers 20000--20337 using the formula display = internal + 16543. Of the 338 slots, approximately 210 carry unique message templates; the remainder are reserved or share templates with parametric fill-ins. Every CUDA error can be controlled by its numeric code or diagnostic tag name via --diag_suppress, --diag_warning, --diag_error, or the #pragma nv_diagnostic system.
This page is a flat lookup table. For the diagnostic pipeline architecture (severity stack, pragma scoping, SARIF output), see Diagnostic Overview. For narrative discussion of each category with implementation details, see CUDA Errors.
Numbering and Display Format
User-visible: file.cu(42): error #20042-D: calling a __device__ function from ...
                                  ^^^^^
                                  display code = internal + 16543
| Direction | Formula | Example |
|---|---|---|
| Display to internal | internal = display - 16543 | 20042 maps to internal 3499 |
| Internal to display | display = internal + 16543 | 3457 maps to display 20000 |
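The two formulas translate directly into code. In this minimal sketch the constant and function names are hypothetical; the offset and range bounds are the ones stated above.

```c
/* Offset and internal index range for CUDA-specific diagnostics. */
enum {
    CUDA_DIAG_OFFSET = 16543,
    CUDA_DIAG_FIRST  = 3457,
    CUDA_DIAG_LAST   = 3794
};

/* Nonzero when an internal index falls in the CUDA-reserved range. */
int is_cuda_internal_index(int internal) {
    return internal >= CUDA_DIAG_FIRST && internal <= CUDA_DIAG_LAST;
}

/* display = internal + 16543 */
int internal_to_display(int internal) {
    return internal + CUDA_DIAG_OFFSET;
}

/* internal = display - 16543 */
int display_to_internal(int display) {
    return display - CUDA_DIAG_OFFSET;
}
```

The range endpoints map to display codes 20000 and 20337, which is where the 338-slot count comes from.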
The -D suffix appears when severity is 7 or below (note, remark, warning, soft error). Hard errors (severity 8+) omit the suffix.
Severity Codes
| Code | Level | Suppressible |
|---|---|---|
| 2 | note | yes |
| 4 | remark | yes |
| 5 | warning | yes |
| 6 | command-line warning | no |
| 7 | error (soft) | yes |
| 8 | error (hard, from pragma) | no |
| 9 | catastrophic error | no |
| 10 | command-line error | no |
| 11 | internal error | no |
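The suffix rule and the suppressibility column can be combined into two small helpers. This is a sketch with hypothetical function names; it encodes only what the table and the -D rule state and does not model the actual diagnostic formatter.

```c
#include <stddef.h>
#include <stdio.h>

/* Suppressible severities per the table: note, remark, warning, soft error. */
int severity_is_suppressible(int severity) {
    return severity == 2 || severity == 4 || severity == 5 || severity == 7;
}

/* Writes "#<code>-D" for severity 7 or below, "#<code>" otherwise.
   Returns the snprintf result. */
int format_diag_code(char *buf, size_t size, int display_code, int severity) {
    const char *suffix = (severity <= 7) ? "-D" : "";
    return snprintf(buf, size, "#%d%s", display_code, suffix);
}
```

Note that severity 6 (command-line warning) sits below the suffix threshold but is listed as not suppressible, so the two predicates are deliberately kept separate.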
How to Use This Catalog
Suppress by numeric code:
nvcc --diag_suppress=20042
Suppress by tag name:
nvcc --diag_suppress=unsafe_device_call
In source code:
#pragma nv_diag_suppress unsafe_device_call
#pragma nv_diag_suppress 20042
Category 1: Cross-Space Calling
Checks performed by the call-graph walker comparing the execution-space byte at entity offset +182 of caller vs. callee.
Standard Cross-Space Calls (6 messages)
| Tag | Sev | Message |
|---|---|---|
| unsafe_device_call | W | calling a __device__ function(%sq1) from a __host__ function(%sq2) is not allowed |
| unsafe_device_call | W | calling a __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed |
| unsafe_device_call | W | calling a __host__ function(%sq1) from a __device__ function(%sq2) is not allowed |
| unsafe_device_call | W | calling a __host__ function(%sq1) from a __global__ function(%sq2) is not allowed |
| unsafe_device_call | W | calling a __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed |
| unsafe_device_call | W | calling a __host__ function from a __host__ __device__ function is not allowed |
Constexpr Cross-Space Calls (6 messages)
These fire when --expt-relaxed-constexpr is not enabled.
| Tag | Sev | Message |
|---|---|---|
| unsafe_device_call | W | calling a constexpr __device__ function(%sq1) from a __host__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
| unsafe_device_call | W | calling a constexpr __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
| unsafe_device_call | W | calling a constexpr __host__ function(%sq1) from a __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
| unsafe_device_call | W | calling a constexpr __host__ function(%sq1) from a __global__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
| unsafe_device_call | W | calling a constexpr __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
| unsafe_device_call | W | calling a constexpr __host__ function from a __host__ __device__ function is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
Category 2: Virtual Override Mismatch
Override checker (sub_432280) extracts the 0x30 mask from the execution-space byte. __global__ is excluded because kernels cannot be virtual.
| Tag | Sev | Message |
|---|---|---|
| -- | E | execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ function |
| -- | E | execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ __device__ function |
| -- | E | execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __device__ function |
| -- | E | execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __host__ __device__ function |
| -- | E | execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __device__ function |
| -- | E | execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __host__ function |
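Since the six messages cover every ordered pair of distinct spaces among __host__, __device__, and __host__ __device__, the check can be modeled as an inequality on the masked bits. The sketch below assumes the override checker reads the same execution-space byte at entity offset +182 that the call-graph walker uses, and it treats entities as raw byte buffers. Which bit within 0x30 denotes which space is not stated here, so the code never interprets individual bits. All names are hypothetical.

```c
#include <stdint.h>

#define EXEC_SPACE_OFFSET 182  /* execution-space byte within the entity */
#define EXEC_SPACE_MASK   0x30 /* bits compared by the override checker  */

/* Extracts the masked execution-space bits from a raw entity record. */
static uint8_t exec_space_bits(const unsigned char *entity) {
    return (uint8_t)(entity[EXEC_SPACE_OFFSET] & EXEC_SPACE_MASK);
}

/* Nonzero when overridden and overriding functions differ in the masked
   bits (assumed to be the condition behind the mismatch errors). */
int override_space_mismatch(const unsigned char *overridden,
                            const unsigned char *overriding) {
    return exec_space_bits(overridden) != exec_space_bits(overriding);
}
```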
Category 3: Redeclaration Mismatch
Checked in decl_routine (sub_4CE420) and check_cuda_attribute_consistency (sub_4C6D50).
Incompatible Redeclarations (error-level)
| Tag | Sev | Message |
|---|---|---|
| device_function_redeclared_with_global | E | a __device__ function(%no1) redeclared with __global__ |
| global_function_redeclared_with_device | E | a __global__ function(%no1) redeclared with __device__ |
| global_function_redeclared_with_host | E | a __global__ function(%no1) redeclared with __host__ |
| global_function_redeclared_with_host_device | E | a __global__ function(%no1) redeclared with __host__ __device__ |
| global_function_redeclared_without_global | E | a __global__ function(%no1) redeclared without __global__ |
| host_function_redeclared_with_global | E | a __host__ function(%no1) redeclared with __global__ |
| host_device_function_redeclared_with_global | E | a __host__ __device__ function(%no1) redeclared with __global__ |
Compatible Promotions (warning-level, promoted to HD)
| Tag | Sev | Message |
|---|---|---|
| device_function_redeclared_with_host | W | a __device__ function(%no1) redeclared with __host__, hence treated as a __host__ __device__ function |
| device_function_redeclared_with_host_device | W | a __device__ function(%no1) redeclared with __host__ __device__, hence treated as a __host__ __device__ function |
| device_function_redeclared_without_device | W | a __device__ function(%no1) redeclared without __device__, hence treated as a __host__ __device__ function |
| host_function_redeclared_with_device | W | a __host__ function(%no1) redeclared with __device__, hence treated as a __host__ __device__ function |
| host_function_redeclared_with_host_device | W | a __host__ function(%no1) redeclared with __host__ __device__, hence treated as a __host__ __device__ function |
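The two redeclaration tables reduce to a small decision function. The sketch below models only the combinations listed above and returns `REDECL_UNLISTED` for pairs the catalog does not cover (for example, a __host__ __device__ function redeclared with __device__). Treating an identical redeclaration as acceptable is an assumption, and all type and function names are hypothetical.

```c
/* Explicit annotation on a declaration; SPACE_NONE means unannotated. */
typedef enum {
    SPACE_NONE, SPACE_HOST, SPACE_DEVICE, SPACE_HOST_DEVICE, SPACE_GLOBAL
} exec_space;

typedef enum {
    REDECL_OK,         /* same annotation (assumed acceptable)         */
    REDECL_PROMOTE_HD, /* warning; treated as __host__ __device__      */
    REDECL_ERROR,      /* incompatible redeclaration                   */
    REDECL_UNLISTED    /* combination not present in the tables        */
} redecl_result;

redecl_result classify_redeclaration(exec_space first, exec_space second) {
    if (first == second)
        return REDECL_OK;
    if (first == SPACE_GLOBAL)      /* with D, H, HD, or without __global__ */
        return REDECL_ERROR;
    if (second == SPACE_GLOBAL)     /* D, H, or HD redeclared with __global__ */
        return first == SPACE_NONE ? REDECL_UNLISTED : REDECL_ERROR;
    if (first == SPACE_DEVICE)      /* with H, HD, or without __device__ */
        return REDECL_PROMOTE_HD;
    if (first == SPACE_HOST &&
        (second == SPACE_DEVICE || second == SPACE_HOST_DEVICE))
        return REDECL_PROMOTE_HD;
    return REDECL_UNLISTED;
}
```

The pattern that emerges is that any change involving __global__ is an error, while changes among the other spaces widen the function to __host__ __device__ with a warning.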
Category 4: __global__ Function Constraints
Return Type and Signature
| Tag | Sev | Message |
|---|---|---|
| global_function_return_type | E | a __global__ function must have a void return type |
| global_function_deduced_return_type | E | a __global__ function must not have a deduced return type |
| global_function_has_ellipsis | E | a __global__ function cannot have ellipsis |
| global_rvalue_ref_type | E | a __global__ function cannot have a parameter with rvalue reference type |
| global_ref_param_restrict | E | a __global__ function cannot have a parameter with __restrict__ qualified reference type |
| global_va_list_type | E | A __global__ function or function template cannot have a parameter with va_list type |
| global_function_with_initializer_list | E | a __global__ function or function template cannot have a parameter with type std::initializer_list |
| global_param_align_too_big | E | cannot pass a parameter with a too large explicit alignment to a __global__ function on win32 platforms |
Declaration Context
| Tag | Sev | Message |
|---|---|---|
| global_class_decl | E | A __global__ function or function template cannot be a member function |
| global_friend_definition | E | A __global__ function or function template cannot be defined in a friend declaration |
| global_function_in_unnamed_inline_ns | E | A __global__ function or function template cannot be declared within an inline unnamed namespace |
| global_operator_function | E | An operator function cannot be a __global__ function |
| global_new_or_delete | E | (__global__ on operator new/delete) |
| -- | E | function main cannot be marked __device__ or __global__ |
C++ Feature Restrictions
| Tag | Sev | Message |
|---|---|---|
| global_function_constexpr | E | A __global__ function or function template cannot be marked constexpr |
| global_function_consteval | E | A __global__ function or function template cannot be marked consteval |
| global_function_inline | E | (__global__ with inline) |
| global_exception_spec | E | An exception specification is not allowed for a __global__ function or function template |
Template Argument Restrictions
| Tag | Sev | Message |
|---|---|---|
| global_private_type_arg | E | A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the class is local to a __device__ or __global__ function |
| global_private_template_arg | E | A template that is defined inside a class and has private or protected access cannot be used in the template template argument of a __global__ function template instantiation |
| global_unnamed_type_arg | E | An unnamed type (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the type is local to a __device__ or __global__ function |
| global_func_local_template_arg | E | A type defined inside a __host__ function (%t) cannot be used in the template argument type of a __global__ function template instantiation |
| global_lambda_template_arg | E | The closure type for a lambda (%t%s) cannot be used in the template argument type of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the flag '-extended-lambda' is specified and the lambda is an extended lambda |
| local_type_used_in_global_function | W | a local type %t (defined in %sq1) used in global function %sq2 template argument, the global function cannot be launched from host code. |
Variable Template Restrictions (parallel set)
| Tag | Sev | Message |
|---|---|---|
| variable_template_private_type_arg | E | (private/protected type in variable template instantiation) |
| variable_template_private_template_arg | E | (private template template arg in variable template) |
| variable_template_unnamed_type_template_arg | E | An unnamed type (%t) cannot be used in the template argument type of a variable template instantiation, unless the type is local to a __device__ or __global__ function |
| variable_template_func_local_template_arg | E | A type defined inside a __host__ function (%t) cannot be used in the template argument type of a variable template instantiation |
| variable_template_lambda_template_arg | E | The closure type for a lambda (%t%s) cannot be used in the template argument type of a variable template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the lambda is an 'extended lambda' and the flag --extended-lambda is specified |
Variadic Template Constraints
| Tag | Sev | Message |
|---|---|---|
| global_function_multiple_packs | E | Multiple pack parameters are not allowed for a variadic __global__ function template |
| global_function_pack_not_last | E | Pack template parameter must be the last template parameter for a variadic __global__ function template |
Launch Configuration Attributes
| Tag | Sev | Message |
|---|---|---|
| bounds_attr_only_on_global_func | E | %s is only allowed on a __global__ function |
| maxnreg_attr_only_on_global_func | E | (__maxnreg__ only on __global__) |
| missing_launch_bounds | W | no __launch_bounds__ specified for __global__ function |
| cuda_specifier_twice_in_group | E | (duplicate CUDA specifier on same declaration) |
| bounds_maxnreg_incompatible_qualifiers | E | (__launch_bounds__ and __maxnreg__ conflict) |
| -- | E | The %s qualifiers cannot be applied to the same kernel |
| -- | E | Multiple %s specifiers are not allowed |
| -- | E | incorrect value for launch bounds |
Category 5: Extended Lambda Restrictions
Extended lambdas (__device__ or __host__ __device__ lambdas in host code, enabled by --extended-lambda) must have closure types serializable for device transfer.
Capture Restrictions
| Tag | Sev | Message |
|---|---|---|
| extended_lambda_reference_capture | E | An extended %s lambda cannot capture variables by reference |
| extended_lambda_pack_capture | E | An extended %s lambda cannot capture an element of a parameter pack |
| extended_lambda_too_many_captures | E | An extended %s lambda can only capture up to 1023 variables |
| extended_lambda_array_capture_rank | E | An extended %s lambda cannot capture an array variable (type: %t) with more than 7 dimensions |
| extended_lambda_array_capture_assignable | E | An extended %s lambda cannot capture an array variable whose element type (%t) is not assignable on the host |
| extended_lambda_array_capture_default_constructible | E | An extended %s lambda cannot capture an array variable whose element type (%t) is not default constructible on the host |
| extended_lambda_init_capture_array | E | An extended %s lambda cannot init-capture variables with array type |
| extended_lambda_init_capture_initlist | E | An extended %s lambda cannot have init-captures with type std::initializer_list |
| extended_lambda_capture_in_constexpr_if | E | An extended %s lambda cannot first-capture variable in constexpr-if context |
| this_addr_capture_ext_lambda | W | Implicit capture of 'this' in extended lambda expression |
| extended_lambda_hd_init_capture | E | init-captures are not allowed for extended __host__ __device__ lambdas |
| -- | E | Unless enabled by language dialect, *this capture is only supported when the lambda is either __device__ only, or is defined within a __device__ or __global__ function |
Type Restrictions on Captures and Parameters
| Tag | Sev | Message |
|---|---|---|
| extended_lambda_capture_local_type | E | A type local to a function (%t) cannot be used in the type of a variable captured by an extended __device__ or __host__ __device__ lambda |
| extended_lambda_capture_private_type | E | A type that is a private or protected class member (%t) cannot be used in the type of a variable captured by an extended __device__ or __host__ __device__ lambda |
| extended_lambda_call_operator_local_type | E | A type local to a function (%t) cannot be used in the return or parameter types of the operator() of an extended __device__ or __host__ __device__ lambda |
| extended_lambda_call_operator_private_type | E | A type that is a private or protected class member (%t) cannot be used in the return or parameter types of the operator() of an extended __device__ or __host__ __device__ lambda |
| extended_lambda_parent_local_type | E | A type local to a function (%t) cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended __device__ or __host__ __device__ lambda |
| extended_lambda_parent_private_type | E | A type that is a private or protected class member (%t) cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended __device__ or __host__ __device__ lambda |
| extended_lambda_parent_private_template_arg | E | A template that is a private or protected class member cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended %s lambda |
Enclosing Parent Function Restrictions
| Tag | Sev | Message |
|---|---|---|
extended_lambda_enclosing_function_local | E | The enclosing parent function (%sq2) for an extended %s1 lambda must not be defined inside another function |
extended_lambda_enclosing_function_not_found | E | (no enclosing function found for extended lambda) |
extended_lambda_inaccessible_parent | E | The enclosing parent function (%sq2) for an extended %s1 lambda cannot have private or protected access within its class |
extended_lambda_enclosing_function_deducible | E | The enclosing parent function (%sq2) for an extended %s1 lambda must not have deduced return type |
extended_lambda_cant_take_function_address | E | The enclosing parent function (%sq2) for an extended %s1 lambda must allow its address to be taken |
extended_lambda_parent_non_extern | E | On Windows, the enclosing parent function (%sq2) for an extended %s1 lambda cannot have internal or no linkage |
extended_lambda_parent_class_unnamed | E | The enclosing parent function (%sq2) for an extended %s1 lambda cannot be a member function of a class that is unnamed |
extended_lambda_parent_template_param_unnamed | E | The enclosing parent function (%sq2) for an extended %s1 lambda cannot be in a template which has a unnamed parameter: %nd |
extended_lambda_nest_parent_template_param_unnamed | E | The enclosing parent %n for an extended %s lambda cannot be a template which has a unnamed parameter |
extended_lambda_multiple_parameter_packs | E | The enclosing parent template function (%sq2) for an extended %s1 lambda cannot have more than one variadic parameter, or it is not listed last in the template parameter list. |
extended_lambda_no_parent_func | E | (extended lambda has no parent function) |
extended_lambda_illegal_parent | E | (extended lambda in illegal parent context) |
Nesting and Context Restrictions
| Tag | Sev | Message |
|---|---|---|
extended_lambda_enclosing_function_generic_lambda | E | An extended %s1 lambda cannot be defined inside a generic lambda expression(%sq2). |
extended_lambda_enclosing_function_hd_lambda | E | An extended %s1 lambda cannot be defined inside an extended __host__ __device__ lambda expression(%sq2). |
extended_lambda_inaccessible_ancestor | E | An extended %s1 lambda cannot be defined inside a class (%sq2) with private or protected access within another class |
extended_lambda_inside_constexpr_if | E | For this host platform/dialect, an extended lambda cannot be defined inside the 'if' or 'else' block of a constexpr if statement |
extended_lambda_multiple_parent | E | Cannot specify multiple __nv_parent directives in a lambda declaration |
extended_host_device_generic_lambda | E | __host__ __device__ extended lambdas cannot be generic lambdas |
| -- | E | If an extended %s lambda is defined within the body of one or more nested lambda expressions, each of these enclosing lambda expressions must be defined within the immediate or nested block scope of a function. |
Specifier and Annotation
| Tag | Sev | Message |
|---|---|---|
extended_lambda_disallowed | E | __host__ or __device__ annotation on lambda requires --extended-lambda nvcc flag |
extended_lambda_constexpr | E | The %s1 specifier is not allowed for an extended %s2 lambda |
lambda_operator_annotated | E | The operator() function for a lambda cannot be explicitly annotated with execution space annotations (__host__/__device__/__global__), the annotations are derived from its closure class |
extended_lambda_discriminator | E | (extended lambda discriminator collision) |
Category 6: Device Code Restrictions
General restrictions that apply to all GPU-side code (__device__ and __global__ function bodies).
| Tag | Sev | Message |
|---|---|---|
cuda_device_code_unsupported_operator | E | The operator '%s' is not allowed in device code |
unsupported_type_in_device_code | E | %t %s1 a %s2, which is not supported in device code |
| -- | E | device code does not support exception handling |
no_coroutine_on_device | E | device code does not support coroutines |
| -- | E | operations on vector types are not supported in device code |
undefined_device_entity | E | cannot use an entity undefined in device code |
undefined_device_identifier | E | identifier %sq is undefined in device code |
thread_local_in_device_code | E | cannot use thread_local specifier for variable declarations in device code |
unrecognized_pragma_device_code | W | unrecognized #pragma in device code |
| -- | E | zero-sized parameter type %t is not allowed in device code |
| -- | E | zero-sized variable %sq is not allowed in device code |
| -- | E | dynamic initialization is not supported for a function-scope static %s variable within a __device__/__global__ function |
| -- | E | function-scope static variable within a __device__/__global__ function requires a memory space specifier |
use_of_virtual_base_on_compute_1x | E | Use of a virtual base (%t) requires the compute_20 or higher architecture |
| -- | E | alloca() is not supported for architectures lower than compute_52 |
Category 7: Kernel Launch
| Tag | Sev | Message |
|---|---|---|
device_launch_no_sepcomp | E | kernel launch from __device__ or __global__ functions requires separate compilation mode |
missing_api_for_device_side_launch | E | device-side kernel launch could not be processed as the required runtime APIs are not declared |
| -- | W | explicit stream argument not provided in kernel launch |
| -- | E | kernel launches from templates are not allowed in system files |
device_side_launch_arg_with_user_provided_cctor | E | cannot pass an argument with a user-provided copy-constructor to a device-side kernel launch |
device_side_launch_arg_with_user_provided_dtor | E | cannot pass an argument with a user-provided destructor to a device-side kernel launch |
Category 8: Memory Space and Variable Restrictions
Variable Access Across Spaces
| Tag | Sev | Message |
|---|---|---|
device_var_read_in_host | E | a %s1 %n1 cannot be directly read in a host function |
device_var_written_in_host | E | a %s1 %n1 cannot be directly written in a host function |
device_var_address_taken_in_host | E | address of a %s1 %n1 cannot be directly taken in a host function |
host_var_read_in_device | E | a host %n1 cannot be directly read in a device function |
host_var_written_in_device | E | a host %n1 cannot be directly written in a device function |
host_var_address_taken_in_device | E | address of a host %n1 cannot be directly taken in a device function |
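The six tags above form a regular grid: the side the variable lives on crossed with the kind of access. A minimal C++ sketch of that mapping follows. It assumes the front end has already classified both the variable and the referencing function; the `Side` and `Access` enums and the function name are illustrative, not recovered from the binary, and the sketch ignores any exemptions cudafe++ may apply before emitting these diagnostics.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification; these names are illustrative only.
enum class Side { Host, Device };
enum class Access { Read, Write, AddressOf };

// Returns the diagnostic tag for a variable on side `var` that is accessed
// from a function on side `fn`, or an empty string when the sides match.
inline std::string cross_space_tag(Side var, Side fn, Access a) {
    if (var == fn) return "";
    const char* verb = a == Access::Read  ? "read"
                     : a == Access::Write ? "written"
                                          : "address_taken";
    if (var == Side::Device)
        return std::string("device_var_") + verb + "_in_host";
    return std::string("host_var_") + verb + "_in_device";
}

// Small usage check: each call reproduces a tag from the table.
inline void cross_space_example() {
    assert(cross_space_tag(Side::Device, Side::Host, Access::Write) ==
           "device_var_written_in_host");
    assert(cross_space_tag(Side::Host, Side::Device, Access::Read) ==
           "host_var_read_in_device");
}
```

Building the tag from its parts keeps the sketch short; a reimplementation would more likely index a table of diagnostic numbers.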
Variable Declaration Restrictions
| Tag | Sev | Message |
|---|---|---|
illegal_local_to_device_function | E | %s1 %sq2 variable declaration is not allowed inside a device function body |
illegal_local_to_host_function | E | %s1 %sq2 variable declaration is not allowed inside a host function body |
shared_specifier_in_range_for | E | the __shared__ memory space specifier is not allowed for a variable declared by the for-range-declaration |
bad_shared_storage_class | E | __shared__ variables cannot have external linkage |
device_variable_in_unnamed_inline_ns | E | A %s variable cannot be declared within an inline unnamed namespace |
| -- | E | member variables of an anonymous union at global or namespace scope cannot be directly accessed in __device__ and __global__ functions |
shared_inside_struct | E | shared type inside a struct or union is not allowed |
shared_parameter | E | (__shared__ as function parameter) |
Auto-Deduced Device References
| Tag | Sev | Message |
|---|---|---|
auto_device_fn_ref | E | A non-constexpr __device__ function (%sq1) with "auto" deduced return type cannot be directly referenced %s2, except if the reference is absent when __CUDA_ARCH__ is undefined |
device_var_constexpr | E | (constexpr rules for __device__ variables) |
device_var_structured_binding | E | (structured bindings on __device__ variables) |
Category 9: __grid_constant__
The __grid_constant__ annotation (compute_70+) marks a kernel parameter as read-only grid-wide.
| Tag | Sev | Message |
|---|---|---|
grid_constant_non_kernel | E | __grid_constant__ annotation is only allowed on a parameter of a __global__ function |
grid_constant_not_const | E | a parameter annotated with __grid_constant__ must have const-qualified type |
grid_constant_reference_type | E | a parameter annotated with __grid_constant__ must not have reference type |
grid_constant_unsupported_arch | E | __grid_constant__ annotation is only allowed for architecture compute_70 or later |
grid_constant_incompat_redecl | E | incompatible __grid_constant__ annotation for parameter %s in function redeclaration (see previous declaration %p) |
grid_constant_incompat_templ_redecl | E | incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p) |
grid_constant_incompat_specialization | E | incompatible __grid_constant__ annotation for parameter %s in function specialization (see previous declaration %p) |
grid_constant_incompat_instantiation_directive | E | incompatible __grid_constant__ annotation for parameter %s in instantiation directive (see previous declaration %p) |
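The first four rows of the table are independent per-parameter checks. The sketch below models them in plain C++ under stated assumptions: `ParamInfo` and its fields are hypothetical summaries of what the front end knows about a parameter, the architecture is passed as the integer in `compute_NN`, and the sketch reports every violated rule even though the real front end may stop at the first one. The four redeclaration tags are not modeled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical summary of one function parameter; field names are illustrative.
struct ParamInfo {
    bool in_global_function;  // parameter belongs to a __global__ function
    bool is_const;            // parameter type is const-qualified
    bool is_reference;        // parameter has reference type
};

// Collects the tags a __grid_constant__ parameter would trigger when
// compiling for compute_<arch>. Order of checks is an assumption.
inline std::vector<std::string> grid_constant_tags(const ParamInfo& p, int arch) {
    std::vector<std::string> tags;
    if (!p.in_global_function) tags.push_back("grid_constant_non_kernel");
    if (!p.is_const)           tags.push_back("grid_constant_not_const");
    if (p.is_reference)        tags.push_back("grid_constant_reference_type");
    if (arch < 70)             tags.push_back("grid_constant_unsupported_arch");
    return tags;
}

// Small usage check: a const, by-value kernel parameter on compute_80 is clean.
inline void grid_constant_example() {
    assert(grid_constant_tags({true, true, false}, 80).empty());
}
```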
Category 10: JIT Mode
JIT mode (-dc for device-only compilation) restricts host constructs.
| Tag | Sev | Message |
|---|---|---|
no_host_in_jit | E | A function explicitly marked as a __host__ function is not allowed in JIT mode |
unannotated_function_in_jit | E | A function without execution space annotations (__host__/__device__/__global__) is considered a host function, and host functions are not allowed in JIT mode. Consider using -default-device flag to process unannotated functions as __device__ functions in JIT mode |
unannotated_variable_in_jit | E | A namespace scope variable without memory space annotations (__device__/__constant__/__shared__/__managed__) is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process unannotated namespace scope variables as __device__ variables in JIT mode |
unannotated_static_data_member_in_jit | E | A class static data member with non-const type is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process such data members as __device__ variables in JIT mode |
host_closure_class_in_jit | E | The execution space for the lambda closure class members was inferred to be __host__ (based on context). This is not allowed in JIT mode. Consider using -default-device to infer __device__ execution space for namespace scope lambda closure classes. |
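The function-level rules in this table can be read as a small decision procedure. The following C++ sketch is one possible encoding, assuming a hypothetical `ExecAnnotation` enum and a boolean for the `-default-device` flag mentioned in the messages. It also assumes that an explicit `__host__` annotation is rejected regardless of `-default-device`, which the message text suggests but the table does not state outright. Variables, static data members, and closure classes are left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical view of a function's explicit execution space annotation.
enum class ExecAnnotation { None, Host, Device, Global };

// Returns the JIT-mode tag for a function, or an empty string if accepted.
// `default_device` models the -default-device flag named in the messages.
inline std::string jit_function_tag(ExecAnnotation a, bool default_device) {
    switch (a) {
        case ExecAnnotation::Host:
            return "no_host_in_jit";  // assumption: not affected by the flag
        case ExecAnnotation::None:
            return default_device ? "" : "unannotated_function_in_jit";
        case ExecAnnotation::Device:
        case ExecAnnotation::Global:
            return "";
    }
    return "";
}

// Small usage check: unannotated functions pass only with -default-device.
inline void jit_mode_example() {
    assert(jit_function_tag(ExecAnnotation::None, true).empty());
    assert(jit_function_tag(ExecAnnotation::None, false) ==
           "unannotated_function_in_jit");
}
```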
Category 11: RDC / Whole-Program Mode
| Tag | Sev | Message |
|---|---|---|
| -- | E | An inline __device__/__constant__/__managed__ variable must have internal linkage when the program is compiled in whole program mode (-rdc=false) |
template_global_no_def | E | when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit |
extern_kernel_template | E | when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false") |
| -- | W | address of internal linkage device function (%sq) was taken (nv bug 2001144). mitigation: no mitigation required if the address is not used for comparison, or if the target function is not a CUDA C++ builtin |
Category 12: Atomics
CUDA atomics are lowered to PTX instructions, subject to size, type, scope, and memory-order constraints.
Architecture and Type Constraints
| Tag | Sev | Message |
|---|---|---|
nv_atomic_functions_not_supported_below_sm60 | E | __nv_atomic_* functions are not supported on arch < sm_60. |
nv_atomic_operation_not_in_device_function | E | atomic operations are not in a device function. |
nv_atomic_function_no_args | E | atomic function requires at least one argument. |
nv_atomic_function_address_taken | E | nv atomic function must be called directly. |
invalid_nv_atomic_operation_size | E | atomic operations and, or, xor, add, sub, min and max are valid only on objects of size 4, or 8. |
invalid_nv_atomic_cas_size | E | atomic CAS is valid only on objects of size 2, 4, 8 or 16 bytes. |
invalid_nv_atomic_exch_size | E | atomic exchange is valid only on objects of size 4, 8 or 16 bytes. |
invalid_data_size_for_nv_atomic_generic_function | E | generic nv atomic functions are valid only on objects of size 1, 2, 4, 8 and 16 bytes. |
non_integral_type_for_non_generic_nv_atomic_function | E | non-generic nv atomic load, store, cas and exchange are valid only on integral types. |
invalid_nv_atomic_operation_add_sub_size | E | atomic operations add and sub are not valid on signed integer of size 8. |
nv_atomic_add_sub_f64_not_supported | W | atomic add and sub for 64-bit float is supported on architecture sm_60 or above. |
invalid_nv_atomic_operation_max_min_float | E | atomic operations min and max are not supported on any floating-point types. |
floating_type_for_logical_atomic_operation | E | For a logical atomic operation, the first argument cannot be any floating-point types. |
nv_atomic_cas_b16_not_supported | E | 16-bit atomic compare-and-exchange is supported on architecture sm_70 or above. |
nv_atomic_exch_cas_b128_not_supported | E | 128-bit atomic exchange or compare-and-exchange is supported on architecture sm_90 or above. |
nv_atomic_load_store_b128_version_too_low | E | 128-bit atomic load and store are supported on architecture sm_70 or above. |
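The size and architecture rules stated in these messages can be collected into two small predicates. This is a sketch only: the `AtomicOp` grouping and both function names are hypothetical, the numbers come directly from the message strings above, and the type-based rules (signed 8-byte add/sub, floating-point min/max, integral-only non-generic forms) and the 128-bit load/store gate are not modeled.

```cpp
#include <cassert>

// Hypothetical grouping of the operations named in the messages above.
// Arith covers and, or, xor, add, sub, min and max.
enum class AtomicOp { Arith, Cas, Exch, Generic };

// Size rule taken from the four *_size diagnostics.
constexpr bool atomic_size_ok(AtomicOp op, int bytes) {
    switch (op) {
        case AtomicOp::Arith:
            return bytes == 4 || bytes == 8;
        case AtomicOp::Cas:
            return bytes == 2 || bytes == 4 || bytes == 8 || bytes == 16;
        case AtomicOp::Exch:
            return bytes == 4 || bytes == 8 || bytes == 16;
        case AtomicOp::Generic:
            return bytes == 1 || bytes == 2 || bytes == 4 || bytes == 8 ||
                   bytes == 16;
    }
    return false;
}

// Minimum sm_NN for a size-valid operation, per the *_not_supported tags:
// 16-bit CAS needs sm_70, 128-bit exchange/CAS needs sm_90, and the
// __nv_atomic_* family as a whole needs sm_60.
constexpr int atomic_min_sm(AtomicOp op, int bytes) {
    if (op == AtomicOp::Cas && bytes == 2) return 70;
    if ((op == AtomicOp::Cas || op == AtomicOp::Exch) && bytes == 16) return 90;
    return 60;
}

// Small usage check: a 2-byte exchange is rejected on size alone.
inline void atomic_rules_example() {
    assert(!atomic_size_ok(AtomicOp::Exch, 2));
    assert(atomic_min_sm(AtomicOp::Cas, 2) == 70);
}
```

Keeping the size check separate from the architecture check mirrors the table, where the two kinds of failure carry different tags.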
Memory Order and Scope
| Tag | Sev | Message |
|---|---|---|
nv_atomic_load_order_error | E | atomic load's memory order cannot be release or acq_rel. |
nv_atomic_store_order_error | E | atomic store's memory order cannot be consume, acquire or acq_rel. |
nv_atomic_operation_order_not_constant_int | E | atomic operation's memory order argument is not an integer literal. |
nv_atomic_operation_scope_not_constant_int | E | atomic operation's scope argument is not an integer literal. |
invalid_nv_atomic_memory_order_value | E | (invalid memory order enum value) |
invalid_nv_atomic_thread_scope_value | E | (invalid thread scope enum value) |
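The two order rules mirror the usual C++ restrictions on loads and stores. A minimal sketch, assuming a hypothetical `Order` enum with the six conventional memory orders; the enumerator names and their numeric values are illustrative and are not the values cudafe++ compares against.

```cpp
#include <cassert>

// Hypothetical memory order enum; numeric values are not taken from the binary.
enum class Order { Relaxed, Consume, Acquire, Release, AcqRel, SeqCst };

// nv_atomic_load_order_error: a load cannot be release or acq_rel.
constexpr bool load_order_ok(Order o) {
    return o != Order::Release && o != Order::AcqRel;
}

// nv_atomic_store_order_error: a store cannot be consume, acquire or acq_rel.
constexpr bool store_order_ok(Order o) {
    return o != Order::Consume && o != Order::Acquire && o != Order::AcqRel;
}

// Small usage check: acq_rel is rejected for both loads and stores.
inline void order_rules_example() {
    assert(!load_order_ok(Order::AcqRel));
    assert(!store_order_ok(Order::AcqRel));
}
```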
Scope Fallback Warnings
| Tag | Sev | Message |
|---|---|---|
nv_atomic_operations_scope_fallback_to_membar | W | atomic operations' scope argument is supported on architecture sm_60 or above. Fall back to use membar. |
nv_atomic_operations_memory_order_fallback_to_membar | W | atomic operations' argument of memory order is supported on architecture sm_70 or above. Fall back to use membar. |
nv_atomic_operations_scope_cluster_change_to_device | W | atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
nv_atomic_load_store_scope_cluster_change_to_device | W | atomic load and store's scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
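Read together, the warnings describe how a requested scope or memory order is downgraded on older targets. The sketch below encodes only what the messages state: cluster scope becomes device scope below sm_90, and the memory-order argument falls back to membar below sm_70. The `Scope` enumerators and both function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical thread scope enum; names and values are illustrative only.
enum class Scope { Thread, Block, Cluster, Device, System };

// Cluster scope is replaced by device scope on targets below sm_90
// (the two *_scope_cluster_change_to_device warnings).
constexpr Scope effective_scope(Scope requested, int sm) {
    return (requested == Scope::Cluster && sm < 90) ? Scope::Device : requested;
}

// The memory-order argument is honored from sm_70; below that the warning
// nv_atomic_operations_memory_order_fallback_to_membar applies.
constexpr bool order_falls_back_to_membar(int sm) {
    return sm < 70;
}

// Small usage check: cluster scope survives on sm_90 but not on sm_80.
inline void scope_fallback_example() {
    assert(effective_scope(Scope::Cluster, 90) == Scope::Cluster);
    assert(effective_scope(Scope::Cluster, 80) == Scope::Device);
}
```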
Category 13: ASM in Device Code
The NVPTX backend supports fewer inline-assembly constraint letters than x86.
| Tag | Sev | Message |
|---|---|---|
asm_constraint_letter_not_allowed_in_device | E | asm constraint letter '%s' is not allowed inside a __device__/__global__ function |
asm_constraint_must_have_single_letter | E | an asm operand may specify only one constraint letter in a __device__/__global__ function |
| -- | E | The 'C' constraint can only be used for asm statements in device code |
cc_clobber_in_device | E | The cc clobber constraint is not supported in device code |
cuda_xasm_strict_placeholder_format | E | (strict placeholder format in CUDA asm) |
addr_of_label_in_device_func | E | address of label extension is not supported in __device__/__global__ functions |
Category 14: #pragma nv_abi
Controls calling convention for device functions, adjusting parameter passing to match PTX ABI.
| Tag | Sev | Message |
|---|---|---|
nv_abi_pragma_bad_format | E | (malformed #pragma nv_abi) |
nv_abi_pragma_invalid_option | E | #pragma nv_abi contains an invalid option |
nv_abi_pragma_missing_arg | E | #pragma nv_abi requires an argument |
nv_abi_pragma_duplicate_arg | E | #pragma nv_abi contains a duplicate argument |
nv_abi_pragma_not_constant | E | #pragma nv_abi argument must evaluate to an integral constant expression |
nv_abi_pragma_not_positive_value | E | #pragma nv_abi argument value must be a positive value |
nv_abi_pragma_overflow_value | E | #pragma nv_abi argument value exceeds the range of an integer |
nv_abi_pragma_device_function | E | #pragma nv_abi must be applied to device functions |
nv_abi_pragma_device_function_context | E | #pragma nv_abi is not supported inside a host function |
nv_abi_pragma_next_construct | E | #pragma nv_abi must appear immediately before a function declaration, function definition, or an expression statement |
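The two value-range tags can be illustrated with a short check on an already evaluated pragma argument. This is a sketch under assumptions: the function name is hypothetical, "the range of an integer" is taken to mean `int` (compared against `INT_MAX`), and the order of the two tests is a guess rather than something recovered from the binary.

```cpp
#include <cassert>
#include <climits>
#include <string>

// Returns the tag for an evaluated #pragma nv_abi argument value, or an
// empty string if the value is acceptable. Assumes the constant expression
// has already been folded to a long long.
inline std::string nv_abi_value_tag(long long value) {
    if (value <= 0)      return "nv_abi_pragma_not_positive_value";
    if (value > INT_MAX) return "nv_abi_pragma_overflow_value";
    return "";
}

// Small usage check: zero is not a positive value.
inline void nv_abi_example() {
    assert(nv_abi_value_tag(0) == "nv_abi_pragma_not_positive_value");
    assert(nv_abi_value_tag(16).empty());
}
```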
Category 15: __nv_register_params__
Forces all parameters to be passed in registers (compute_80+).
| Tag | Sev | Message |
|---|---|---|
register_params_not_enabled | E | __nv_register_params__ support is not enabled |
register_params_unsupported_arch | E | __nv_register_params__ is only supported for compute_80 or later architecture |
register_params_unsupported_function | E | __nv_register_params__ is not allowed on a %s function |
register_params_ellipsis_function | E | __nv_register_params__ is not allowed on a function with ellipsis |
Category 16: Name Expression (NVRTC)
__CUDACC_RTC__name_expr forms the mangled name of a __global__ function or __device__/__constant__ variable at compile time.
| Tag | Sev | Message |
|---|---|---|
name_expr_parsing | E | Error in parsing name expression for lowered name lookup. Input name expression was: %sq |
name_expr_non_global_routine | E | Name expression cannot form address of a non-__global__ function. Input name expression was: %sq |
name_expr_non_device_variable | E | Name expression cannot form address of a variable that is not a __device__/__constant__ variable. Input name expression was: %sq |
name_expr_not_routine_or_variable | E | Name expression must form address of a __global__ function or the address of a __device__/__constant__ variable. Input name expression was: %sq |
name_expr_extra_tokens | E | Extra tokens found after parsing name expression for lowered name lookup. Input name expression was: %sq |
name_expr_internal_error | E | Internal error in parsing name expression for lowered name lookup. Input name expression was: %sq |
Category 17: Texture and Surface Variables
| Tag | Sev | Message |
|---|---|---|
texture_surface_variable_in_unnamed_inline_ns | E | A texture or surface variable cannot be declared within an inline unnamed namespace |
| -- | E | A texture or surface variable cannot be used in the non-type template argument of a __device__, __host__ __device__ or __global__ function template instantiation |
reference_to_text_surf_type_in_device_func | E | a reference to texture/surface type cannot be used in __device__/__global__ functions |
reference_to_text_surf_var_in_device_func | E | taking reference of texture/surface variable not allowed in __device__/__global__ functions |
addr_of_text_surf_var_in_device_func | E | cannot take address of texture/surface variable %sq in __device__/__global__ functions |
addr_of_text_surf_expr_in_device_func | E | cannot take address of texture/surface expression in __device__/__global__ functions |
indir_into_text_surf_var_in_device_func | E | indirection not allowed for accessing texture/surface through variable %sq in __device__/__global__ functions |
indir_into_text_surf_expr_in_device_func | E | indirection not allowed for accessing texture/surface through expression in __device__/__global__ functions |
Category 18: __managed__ Variables
| Tag | Sev | Message |
|---|---|---|
managed_const_type_not_allowed | E | a __managed__ variable cannot have a const qualified type |
managed_reference_type_not_allowed | E | a __managed__ variable cannot have a reference type |
managed_cant_be_shared_constant | E | __managed__ variables cannot be marked __shared__ or __constant__ |
unsupported_arch_for_managed_capability | E | __managed__ variables require architecture compute_30 or higher |
unsupported_configuration_for_managed_capability | E | __managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system) |
decltype_of_managed_variable | E | A __managed__ variable cannot be used as an unparenthesized id-expression argument for decltype() |
Category 19: Device Function Signature Constraints
| Tag | Sev | Message |
|---|---|---|
device_function_has_ellipsis | E | __device__ or __host__ __device__ function with ellipsis requires compute_30 or higher architecture |
device_func_tex_arg | E | (device function with texture argument restriction) |
no_host_device_initializer_list | E | (std::initializer_list in __host__ __device__ context) |
no_host_device_move_forward | E | (std::move/forward in __host__ __device__ context) |
no_strict_cuda_error | W | (relaxed error checking mode) |
Category 20: __wgmma_mma_async Builtins
Warp Group Matrix Multiply-Accumulate builtins (sm_90a only, per the diagnostic text below).
| Tag | Sev | Message |
|---|---|---|
wgmma_mma_async_not_enabled | E | __wgmma_mma_async builtins are only available for sm_90a |
wgmma_mma_async_nonconstant_arg | E | Non-constant argument to __wgmma_mma_async call |
wgmma_mma_async_missing_args | E | The 'A' or 'B' argument to __wgmma_mma_async call is missing |
wgmma_mma_async_bad_shape | E | The shape %s is not supported for __wgmma_mma_async builtin |
wgmma_mma_async_bad_A_type | E | (invalid type for operand A) |
wgmma_mma_async_bad_B_type | E | (invalid type for operand B) |
Category 21: __block_size__ / __cluster_dims__
Architecture-dependent launch configuration attributes.
| Tag | Sev | Message |
|---|---|---|
block_size_unsupported | E | __block_size__ is not supported for this GPU architecture |
block_size_must_be_positive | E | (block size values must be positive) |
cluster_dims_unsupported | E | __cluster_dims__ is not supported for this GPU architecture |
cluster_dims_must_be_positive | E | (__cluster_dims__ values must be positive) |
cluster_dims_too_large | E | cluster dimension value is too large |
conflict_between_cluster_dim_and_block_size | E | cannot specify the second tuple in __block_size__ while __cluster_dims__ is present |
max_blocks_per_cluster_unsupported | E | cannot specify max blocks per cluster for this GPU architecture |
max_blocks_per_cluster_negative | E | max blocks per cluster must not be negative |
max_blocks_per_cluster_too_large | E | max blocks per cluster is too large |
too_many_blocks_in_cluster | E | total number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit for max blocks in cluster |
shared_block_size_must_be_positive | E | the block size of a shared array must be greater than zero |
shared_block_size_too_large | E | (shared block size exceeds maximum) |
mismatched_shared_block_size | E | shared block size does not match one previously specified |
ambiguous_block_size_spec | E | (ambiguous block size specification) |
multiple_block_sizes | E | multiple block sizes not allowed |
threads_dimension_requires_definite_block_size | E | a dynamic THREADS dimension requires a definite block size |
shared_nonthreads_dim | E | (shared array dimension is not THREADS-based) |
shared_affinity_type | E | (shared affinity type mismatch) |
Category 22: Inline Hint Conflicts
| Tag | Sev | Message |
|---|---|---|
inline_hint_forceinline_conflict | E | "__inline_hint__" and "__forceinline__" may not be used on the same declaration |
inline_hint_noinline_conflict | E | "__inline_hint__" and "__noinline__" may not be used on the same declaration |
Category 23: __local_maxnreg__
| Tag | Sev | Message |
|---|---|---|
local_maxnreg | E | (__local_maxnreg__ attribute applied) |
local_maxnreg_attr_only_nonmember_func | E | (__local_maxnreg__ only on non-member functions) |
local_maxnreg_attribute_conflict | E | (__local_maxnreg__ conflicts with existing attribute) |
local_maxnreg_negative | E | (__local_maxnreg__ value is negative) |
local_maxnreg_too_large | E | (__local_maxnreg__ value exceeds maximum) |
maxnreg_attr_only_nonmember_func | E | (__maxnreg__ only on non-member functions) |
bounds_attr_only_nonmember_func | E | (launch bounds only on non-member functions) |
Category 24: Miscellaneous CUDA Errors
| Tag | Sev | Message |
|---|---|---|
cuda_displaced_new_or_delete_operator | E | (displaced new/delete in CUDA context) |
cuda_demote_unsupported_floating_point | W | (unsupported floating-point type demoted) |
illegal_ucn_in_device_identifer | E | Universal character is not allowed in device entity name (%sq) |
thread_local_for_device_vars | E | cannot use thread_local specifier for a %s variable |
global_qualifier_not_allowed | E | (execution space qualifier not allowed here) |
unsupported_nv_attribute | W | (unrecognized NVIDIA attribute) |
addr_of_nv_builtin_var | E | (address-of applied to NVIDIA builtin variable) |
shared_address_immutable | E | (__shared__ variable address is immutable) |
nonshared_blocksizeof | E | (BLOCKSIZEOF applied to non-__shared__ variable) |
nonshared_strict_relaxed | E | (strict/relaxed qualifier on non-__shared__ variable) |
extern_shared | W | (extern __shared__ variable) |
invalid_nvvm_builtin_intrinsic | E | (invalid NVVM builtin intrinsic) |
unannotated_static_not_allowed_in_device | E | (unannotated static not allowed in device code) |
missing_pushcallconfig | E | (cudaConfigureCall not found for kernel launch lowering) |
Complete Diagnostic Tag Index
All 286 CUDA-specific diagnostic tag names extracted from the cudafe++ binary, organized alphabetically within functional groups. Every tag can be used with --diag_suppress, --diag_warning, --diag_error, or #pragma nv_diag_suppress / nv_diag_warning / nv_diag_error.
Cross-Space / Execution Space (1 tag)
| # | Tag Name |
|---|---|
| 1 | unsafe_device_call |
Redeclaration (12 tags)
| # | Tag Name |
|---|---|
| 2 | device_function_redeclared_with_global |
| 3 | device_function_redeclared_with_host |
| 4 | device_function_redeclared_with_host_device |
| 5 | device_function_redeclared_without_device |
| 6 | global_function_redeclared_with_device |
| 7 | global_function_redeclared_with_host |
| 8 | global_function_redeclared_with_host_device |
| 9 | global_function_redeclared_without_global |
| 10 | host_device_function_redeclared_with_global |
| 11 | host_function_redeclared_with_device |
| 12 | host_function_redeclared_with_global |
| 13 | host_function_redeclared_with_host_device |
__global__ Constraints (32 tags)
| # | Tag Name |
|---|---|
| 14 | bounds_attr_only_on_global_func |
| 15 | bounds_maxnreg_incompatible_qualifiers |
| 16 | cuda_specifier_twice_in_group |
| 17 | global_class_decl |
| 18 | global_exception_spec |
| 19 | global_friend_definition |
| 20 | global_func_local_template_arg |
| 21 | global_function_consteval |
| 22 | global_function_constexpr |
| 23 | global_function_deduced_return_type |
| 24 | global_function_has_ellipsis |
| 25 | global_function_in_unnamed_inline_ns |
| 26 | global_function_inline |
| 27 | global_function_multiple_packs |
| 28 | global_function_pack_not_last |
| 29 | global_function_return_type |
| 30 | global_function_with_initializer_list |
| 31 | global_lambda_template_arg |
| 32 | global_new_or_delete |
| 33 | global_operator_function |
| 34 | global_param_align_too_big |
| 35 | global_private_template_arg |
| 36 | global_private_type_arg |
| 37 | global_qualifier_not_allowed |
| 38 | global_ref_param_restrict |
| 39 | global_rvalue_ref_type |
| 40 | global_unnamed_type_arg |
| 41 | global_va_list_type |
| 42 | local_type_used_in_global_function |
| 43 | maxnreg_attr_only_on_global_func |
| 44 | missing_launch_bounds |
| 45 | template_global_no_def |
Extended Lambda (39 tags)
| # | Tag Name |
|---|---|
| 46 | extended_host_device_generic_lambda |
| 47 | extended_lambda_array_capture_assignable |
| 48 | extended_lambda_array_capture_default_constructible |
| 49 | extended_lambda_array_capture_rank |
| 50 | extended_lambda_call_operator_local_type |
| 51 | extended_lambda_call_operator_private_type |
| 52 | extended_lambda_cant_take_function_address |
| 53 | extended_lambda_capture_in_constexpr_if |
| 54 | extended_lambda_capture_local_type |
| 55 | extended_lambda_capture_private_type |
| 56 | extended_lambda_constexpr |
| 57 | extended_lambda_disallowed |
| 58 | extended_lambda_discriminator |
| 59 | extended_lambda_enclosing_function_deducible |
| 60 | extended_lambda_enclosing_function_generic_lambda |
| 61 | extended_lambda_enclosing_function_hd_lambda |
| 62 | extended_lambda_enclosing_function_local |
| 63 | extended_lambda_enclosing_function_not_found |
| 64 | extended_lambda_hd_init_capture |
| 65 | extended_lambda_illegal_parent |
| 66 | extended_lambda_inaccessible_ancestor |
| 67 | extended_lambda_inaccessible_parent |
| 68 | extended_lambda_init_capture_array |
| 69 | extended_lambda_init_capture_initlist |
| 70 | extended_lambda_inside_constexpr_if |
| 71 | extended_lambda_multiple_parameter_packs |
| 72 | extended_lambda_multiple_parent |
| 73 | extended_lambda_nest_parent_template_param_unnamed |
| 74 | extended_lambda_no_parent_func |
| 75 | extended_lambda_pack_capture |
| 76 | extended_lambda_parent_class_unnamed |
| 77 | extended_lambda_parent_local_type |
| 78 | extended_lambda_parent_non_extern |
| 79 | extended_lambda_parent_private_template_arg |
| 80 | extended_lambda_parent_private_type |
| 81 | extended_lambda_parent_template_param_unnamed |
| 82 | extended_lambda_reference_capture |
| 83 | extended_lambda_too_many_captures |
| 84 | this_addr_capture_ext_lambda |
Device Code (16 tags)
| # | Tag Name |
|---|---|
| 85 | addr_of_label_in_device_func |
| 86 | asm_constraint_letter_not_allowed_in_device |
| 87 | asm_constraint_must_have_single_letter |
| 88 | auto_device_fn_ref |
| 89 | cc_clobber_in_device |
| 90 | cuda_device_code_unsupported_operator |
| 91 | cuda_xasm_strict_placeholder_format |
| 92 | illegal_ucn_in_device_identifer |
| 93 | no_coroutine_on_device |
| 94 | no_strict_cuda_error |
| 95 | thread_local_in_device_code |
| 96 | undefined_device_entity |
| 97 | undefined_device_identifier |
| 98 | unrecognized_pragma_device_code |
| 99 | unsupported_type_in_device_code |
| 100 | use_of_virtual_base_on_compute_1x |
Device Function (4 tags)
| # | Tag Name |
|---|---|
| 101 | device_func_tex_arg |
| 102 | device_function_has_ellipsis |
| 103 | no_host_device_initializer_list |
| 104 | no_host_device_move_forward |
Kernel Launch (4 tags)
| # | Tag Name |
|---|---|
| 105 | device_launch_no_sepcomp |
| 106 | device_side_launch_arg_with_user_provided_cctor |
| 107 | device_side_launch_arg_with_user_provided_dtor |
| 108 | missing_api_for_device_side_launch |
Variable Access (11 tags)
| # | Tag Name |
|---|---|
| 109 | device_var_address_taken_in_host |
| 110 | device_var_constexpr |
| 111 | device_var_read_in_host |
| 112 | device_var_structured_binding |
| 113 | device_var_written_in_host |
| 114 | device_variable_in_unnamed_inline_ns |
| 115 | host_var_address_taken_in_device |
| 116 | host_var_read_in_device |
| 117 | host_var_written_in_device |
| 118 | illegal_local_to_device_function |
| 119 | illegal_local_to_host_function |
Variable Template (5 tags)
| # | Tag Name |
|---|---|
| 120 | variable_template_func_local_template_arg |
| 121 | variable_template_lambda_template_arg |
| 122 | variable_template_private_template_arg |
| 123 | variable_template_private_type_arg |
| 124 | variable_template_unnamed_type_template_arg |
__managed__ (6 tags)
| # | Tag Name |
|---|---|
| 125 | decltype_of_managed_variable |
| 126 | managed_cant_be_shared_constant |
| 127 | managed_const_type_not_allowed |
| 128 | managed_reference_type_not_allowed |
| 129 | unsupported_arch_for_managed_capability |
| 130 | unsupported_configuration_for_managed_capability |
__grid_constant__ (8 tags)
| # | Tag Name |
|---|---|
| 131 | grid_constant_incompat_instantiation_directive |
| 132 | grid_constant_incompat_redecl |
| 133 | grid_constant_incompat_specialization |
| 134 | grid_constant_incompat_templ_redecl |
| 135 | grid_constant_non_kernel |
| 136 | grid_constant_not_const |
| 137 | grid_constant_reference_type |
| 138 | grid_constant_unsupported_arch |
Atomics (26 tags)
| # | Tag Name |
|---|---|
| 139 | floating_type_for_logical_atomic_operation |
| 140 | invalid_data_size_for_nv_atomic_generic_function |
| 141 | invalid_nv_atomic_cas_size |
| 142 | invalid_nv_atomic_exch_size |
| 143 | invalid_nv_atomic_memory_order_value |
| 144 | invalid_nv_atomic_operation_add_sub_size |
| 145 | invalid_nv_atomic_operation_max_min_float |
| 146 | invalid_nv_atomic_operation_size |
| 147 | invalid_nv_atomic_thread_scope_value |
| 148 | non_integral_type_for_non_generic_nv_atomic_function |
| 149 | nv_atomic_add_sub_f64_not_supported |
| 150 | nv_atomic_cas_b16_not_supported |
| 151 | nv_atomic_exch_cas_b128_not_supported |
| 152 | nv_atomic_function_address_taken |
| 153 | nv_atomic_function_no_args |
| 154 | nv_atomic_functions_not_supported_below_sm60 |
| 155 | nv_atomic_load_order_error |
| 156 | nv_atomic_load_store_b128_version_too_low |
| 157 | nv_atomic_load_store_scope_cluster_change_to_device |
| 158 | nv_atomic_operation_not_in_device_function |
| 159 | nv_atomic_operation_order_not_constant_int |
| 160 | nv_atomic_operation_scope_not_constant_int |
| 161 | nv_atomic_operations_memory_order_fallback_to_membar |
| 162 | nv_atomic_operations_scope_cluster_change_to_device |
| 163 | nv_atomic_operations_scope_fallback_to_membar |
| 164 | nv_atomic_store_order_error |
JIT Mode (5 tags)
| # | Tag Name |
|---|---|
| 165 | host_closure_class_in_jit |
| 166 | no_host_in_jit |
| 167 | unannotated_function_in_jit |
| 168 | unannotated_static_data_member_in_jit |
| 169 | unannotated_variable_in_jit |
RDC / Whole-Program (2 tags)
| # | Tag Name |
|---|---|
| 170 | extern_kernel_template |
| 171 | template_global_no_def |
#pragma nv_abi (10 tags)
| # | Tag Name |
|---|---|
| 172 | nv_abi_pragma_bad_format |
| 173 | nv_abi_pragma_device_function |
| 174 | nv_abi_pragma_device_function_context |
| 175 | nv_abi_pragma_duplicate_arg |
| 176 | nv_abi_pragma_invalid_option |
| 177 | nv_abi_pragma_missing_arg |
| 178 | nv_abi_pragma_next_construct |
| 179 | nv_abi_pragma_not_constant |
| 180 | nv_abi_pragma_not_positive_value |
| 181 | nv_abi_pragma_overflow_value |
__nv_register_params__ (4 tags)
| # | Tag Name |
|---|---|
| 182 | register_params_ellipsis_function |
| 183 | register_params_not_enabled |
| 184 | register_params_unsupported_arch |
| 185 | register_params_unsupported_function |
Name Expression (6 tags)
| # | Tag Name |
|---|---|
| 186 | name_expr_extra_tokens |
| 187 | name_expr_internal_error |
| 188 | name_expr_non_device_variable |
| 189 | name_expr_non_global_routine |
| 190 | name_expr_not_routine_or_variable |
| 191 | name_expr_parsing |
Texture / Surface (7 tags)
| # | Tag Name |
|---|---|
| 192 | addr_of_text_surf_expr_in_device_func |
| 193 | addr_of_text_surf_var_in_device_func |
| 194 | indir_into_text_surf_expr_in_device_func |
| 195 | indir_into_text_surf_var_in_device_func |
| 196 | reference_to_text_surf_type_in_device_func |
| 197 | reference_to_text_surf_var_in_device_func |
| 198 | texture_surface_variable_in_unnamed_inline_ns |
__wgmma_mma_async (6 tags)
| # | Tag Name |
|---|---|
| 199 | wgmma_mma_async_bad_A_type |
| 200 | wgmma_mma_async_bad_B_type |
| 201 | wgmma_mma_async_bad_shape |
| 202 | wgmma_mma_async_missing_args |
| 203 | wgmma_mma_async_nonconstant_arg |
| 204 | wgmma_mma_async_not_enabled |
__block_size__ / __cluster_dims__ (18 tags)
| # | Tag Name |
|---|---|
| 205 | ambiguous_block_size_spec |
| 206 | block_size_must_be_positive |
| 207 | block_size_unsupported |
| 208 | cluster_dims_must_be_positive |
| 209 | cluster_dims_too_large |
| 210 | cluster_dims_unsupported |
| 211 | conflict_between_cluster_dim_and_block_size |
| 212 | max_blocks_per_cluster_negative |
| 213 | max_blocks_per_cluster_too_large |
| 214 | max_blocks_per_cluster_unsupported |
| 215 | mismatched_shared_block_size |
| 216 | multiple_block_sizes |
| 217 | shared_affinity_type |
| 218 | shared_block_size_must_be_positive |
| 219 | shared_block_size_too_large |
| 220 | shared_nonthreads_dim |
| 221 | threads_dimension_requires_definite_block_size |
| 222 | too_many_blocks_in_cluster |
Inline Hint (2 tags)
| # | Tag Name |
|---|---|
| 223 | inline_hint_forceinline_conflict |
| 224 | inline_hint_noinline_conflict |
__local_maxnreg__ (7 tags)
| # | Tag Name |
|---|---|
| 225 | bounds_attr_only_nonmember_func |
| 226 | local_maxnreg |
| 227 | local_maxnreg_attr_only_nonmember_func |
| 228 | local_maxnreg_attribute_conflict |
| 229 | local_maxnreg_negative |
| 230 | local_maxnreg_too_large |
| 231 | maxnreg_attr_only_nonmember_func |
Lambda Annotation (1 tag)
| # | Tag Name |
|---|---|
| 232 | lambda_operator_annotated |
Miscellaneous (16 tags)
| # | Tag Name |
|---|---|
| 233 | addr_of_nv_builtin_var |
| 234 | bad_shared_storage_class |
| 235 | cuda_demote_unsupported_floating_point |
| 236 | cuda_displaced_new_or_delete_operator |
| 237 | extern_shared |
| 238 | invalid_nvvm_builtin_intrinsic |
| 239 | missing_pushcallconfig |
| 240 | nonshared_blocksizeof |
| 241 | nonshared_strict_relaxed |
| 242 | shared_address_immutable |
| 243 | shared_inside_struct |
| 244 | shared_parameter |
| 245 | shared_specifier_in_range_for |
| 246 | thread_local_for_device_vars |
| 247 | unannotated_static_not_allowed_in_device |
| 248 | unsupported_nv_attribute |
Diagnostic Pragma Actions (6 tags -- not suppressible, but listed for completeness)
| # | Tag Name |
|---|---|
| 249 | nv_diag_default |
| 250 | nv_diag_error |
| 251 | nv_diag_once |
| 252 | nv_diag_remark |
| 253 | nv_diag_suppress |
| 254 | nv_diag_warning |
Cross-Reference: EDG Error Codes Used for CUDA
The following standard EDG error codes (0--3456) are repurposed or frequently triggered by CUDA-specific validation. These display with their original number (not the 20000-D series).
| Internal # | Display # | Context |
|---|---|---|
| 21 | 21 | CUDA auto type with template deduction |
| 147 | 147 | redeclaration mismatch |
| 149 | 149 | illegal CUDA storage class at namespace scope |
| 246 | 246 | static member of non-class type |
| 298 | 298 | typedef/using with template name |
| 325 | 325 | thread_local in CUDA |
| 337 | 337 | calling convention mismatch |
| 453 | 453 | in template instantiation context |
| 551 | 551 | not a member function |
| 795 | 795 | definition in class scope with external linkage (CUDA) |
| 799 | 799 | definition in class scope (C++20 CUDA) |
| 891 | 891 | anonymous type in variable declaration |
| 892 | 892 | auto with __constant__ variable |
| 893 | 893 | auto with CUDA variable |
| 948 | 948 | calling convention mismatch on redeclaration |
| 992 | 992 | fatal error (suppress-all sentinel) |
| 1034 | 1034 | explicit instantiation with conflicting attributes |
| 1063 | 1063 | in include file context |
| 1118 | 1118 | CUDA attribute on namespace-scope variable |
| 1150 | 1150 | context lines truncated |
| 1158 | 1158 | auto return type with __global__ |
| 1306 | 1306 | CUDA memory space mismatch on redeclaration |
| 1418 | 1418 | incomplete type in definition |
| 1430 | 1430 | function attribute mismatch in template |
| 1560 | 1560 | CUDA constexpr class with non-trivial destructor |
| 1580 | 1580 | redeclaration with different template parameters |
| 1655 | 1655 | tentative definition of constexpr |
| 2384 | 2384 | constexpr mismatch on redeclaration (CUDA) |
| 2442 | 2442 | extern variable at block scope with CUDA attribute |
| 2443 | 2443 | extern variable at block scope with CUDA attribute (variant) |
| 2502 | 2502 | no_unique_address mismatch |
| 2503 | 2503 | no_unique_address mismatch (variant) |
| 2656 | 2656 | internal error (assertion failure) |
| 2885 | 2885 | CUDA attribute on deduction guide |
| 2937 | 2937 | structured binding with CUDA attribute |
| 3033 | 3033 | incompatible constexpr CUDA target |
| 3116 | 3116 | restrict qualifier on definition |
| 3414 | 3414 | auto with volatile/atomic qualifier |
| 3510 | 3510 | __shared__ variable with VLA |
| 3566 | 3566 | __constant__ with constexpr with auto |
| 3567 | 3567 | CUDA variable with VLA type |
| 3568 | 3568 | __constant__ with constexpr |
| 3578 | 3578 | CUDA attribute in discarded constexpr-if branch |
| 3579 | 3579 | CUDA attribute at namespace scope with structured binding |
| 3580 | 3580 | CUDA attribute on variable-length array |
| 3648 | 3648 | CUDA __constant__ with external linkage |
| 3698 | 3698 | parameter type mismatch |
| 3709 | 3709 | warnings treated as errors |
Format Specifiers in CUDA Messages
CUDA error messages use the same fill-in system as EDG base errors, expanded by process_fill_in (sub_4EDCD0).
| Specifier | Kind | Meaning | Example |
|---|---|---|---|
| %sq | 3 | Quoted entity name | Function name in cross-space call |
| %sq1, %sq2 | 3 | Indexed quoted names | Caller and callee |
| %no1 | 4 | Entity name (omit kind prefix) | Function in redeclaration |
| %n1, %n2 | 4 | Entity names | Override base/derived pair |
| %nd | 4 | Entity with declaration location | Template parameter |
| %s, %s1, %s2 | 3 | String fill-in | Execution space keyword |
| %t | 6 | Type fill-in | Type in template arg errors |
| %p | 2 | Source position | Previous declaration location |
Architecture Requirements Summary
Quick reference for minimum architecture required by various CUDA features.
| Feature | Minimum Architecture |
|---|---|
| Virtual bases | compute_20 |
| __device__/__host__ __device__ with ellipsis | compute_30 |
| __managed__ variables | compute_30 |
| alloca() | compute_52 |
| __nv_atomic_* functions | sm_60 |
| Atomic scope argument | sm_60 |
| Atomic add/sub for f64 | sm_60 |
| __grid_constant__ | compute_70 |
| Atomic memory order argument | sm_70 |
| 16-bit atomic CAS | sm_70 |
| 128-bit atomic load/store | sm_70 |
| __nv_register_params__ | compute_80 |
| Cluster scope for atomics | sm_90 |
| 128-bit atomic exchange/CAS | sm_90 |
| __wgmma_mma_async builtins | sm_90a |
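The summary table lends itself to a simple lookup. The following is a minimal sketch, not code recovered from the binary: the struct, the function names, and the feature keys are all hypothetical, and only the numeric thresholds come from the table. The compute_NN / sm_NN distinction and the "a" suffix of sm_90a are deliberately not modeled.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical table type; the feature strings are illustrative lookup keys.
struct ArchRequirement {
    const char* feature;
    int min_version;  // numeric part of compute_NN / sm_NN
};

// Values copied from the summary table. The "a" suffix of sm_90a
// (architecture-specific target) is not modeled here.
static const ArchRequirement kArchRequirements[] = {
    {"virtual_bases", 20},
    {"device_function_ellipsis", 30},
    {"managed_variables", 30},
    {"alloca", 52},
    {"nv_atomic_functions", 60},
    {"atomic_scope_argument", 60},
    {"atomic_add_sub_f64", 60},
    {"grid_constant", 70},
    {"atomic_memory_order_argument", 70},
    {"atomic_cas_b16", 70},
    {"atomic_load_store_b128", 70},
    {"nv_register_params", 80},
    {"atomic_cluster_scope", 90},
    {"atomic_exch_cas_b128", 90},
    {"wgmma_mma_async", 90},
};

// Returns the minimum version for a feature key, or -1 if the key is unknown.
inline int min_arch_for(const char* feature) {
    for (const ArchRequirement& r : kArchRequirements) {
        if (std::strcmp(r.feature, feature) == 0) return r.min_version;
    }
    return -1;
}

// True when the target version meets the feature's minimum.
inline bool feature_available(const char* feature, int target_version) {
    const int required = min_arch_for(feature);
    assert(required >= 0 && "unknown feature key");
    return target_version >= required;
}
```

For example, `feature_available("grid_constant", 61)` is false under this sketch, matching the compute_70 minimum listed for __grid_constant__.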
Virtual Override Execution Space Matrix
When a derived class overrides a base class virtual function in CUDA, the execution spaces of both functions must be compatible. A __device__ virtual cannot be overridden by a __host__ function, a __host__ virtual cannot be overridden by a __device__ function, and so on. cudafe++ enforces these rules inside record_virtual_function_override (sub_432280, 437 lines, class_decl.c), which runs each time the EDG front-end registers a virtual override during class body scanning. The function performs three tasks: (1) propagate the base class's execution space obligations onto the derived function, (2) detect illegal mismatches and emit one of six dedicated error messages (3542--3547), and (3) fall through to standard EDG override recording (covariant returns, [[nodiscard]], override/final, requires-clause checks).
This page documents the override checking logic at reimplementation-grade depth: reconstructed pseudocode from the decompiled binary, a complete compatibility matrix, the six error messages with their diagnostic tags, and the relaxed-mode flag that softens certain checks.
Key Facts
| Property | Value |
|---|---|
| Binary function | sub_432280 (record_virtual_function_override, 437 lines) |
| Source file | class_decl.c |
| Parameters | a1=derivation_info, a2=overriding_sym, a3=overridden_sym, a4=base_class_info, a5=covariant_return_adjustment |
| Entity field read | byte +182 (execution space bitfield) on both overridden and overriding entities |
| Classification mask | byte & 0x30 -- two-bit extraction: 0x00=implicit host, 0x10=explicit host, 0x20=device, 0x30=HD |
| Propagation bits | 0x10 (host_explicit), 0x20 (device_annotation) |
| Attribute lookup | sub_5CEE70 with kind 87 (__device__) and 86 (__host__) |
| Error emission | sub_4F4F10 with severity 8 (hard error) |
| Relaxed mode flag | dword_106BFF0 (relaxed_attribute_mode) |
| Implicitly-HD test | byte +177 & 0x10 on entity -- constexpr / __forceinline__ bypass |
| Override-involved mark | byte +176 |= 0x02 on overriding entity |
| Assertion guard | nv_is_device_only_routine from nv_transforms.h:367 |
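The classification mask can be expressed as a small decoder. This is a sketch under the assumptions stated in the table: only the mask 0x30 and the four field values come from the analysis, while the enum, the function names, and the printable labels are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; the numeric values are the four states of byte +182 & 0x30.
enum class ExecSpace : std::uint8_t {
    ImplicitHost = 0x00,
    ExplicitHost = 0x10,
    Device       = 0x20,
    HostDevice   = 0x30,
};

constexpr std::uint8_t kExecSpaceMask = 0x30;

// Extract the two-bit execution space field from entity byte +182.
constexpr ExecSpace classify_exec_space(std::uint8_t byte_182) {
    return static_cast<ExecSpace>(byte_182 & kExecSpaceMask);
}

// True when the device_annotation bit (0x20) is set: device-only or HD.
constexpr bool has_device_annotation(std::uint8_t byte_182) {
    return (byte_182 & 0x20) != 0;
}

// True for implicit and explicit host alike (device_annotation bit clear).
constexpr bool is_host_only(std::uint8_t byte_182) {
    return !has_device_annotation(byte_182);
}

// Printable label for diagnostics or debugging output.
inline const char* exec_space_name(ExecSpace s) {
    switch (s) {
        case ExecSpace::ImplicitHost: return "implicit host";
        case ExecSpace::ExplicitHost: return "explicit host";
        case ExecSpace::Device:       return "device";
        case ExecSpace::HostDevice:   return "HD";
    }
    assert(false && "unreachable: mask yields only four values");
    return "";
}
```

Bits outside the mask are ignored, so `classify_exec_space(0x3F)` and `classify_exec_space(0x30)` both yield `ExecSpace::HostDevice`.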
Why Virtual Functions Need Execution Space Checks
Standard C++ imposes no concept of execution space on virtual functions. CUDA introduces three execution spaces (__host__, __device__, __host__ __device__) and one launch-only space (__global__). When a virtual function in a base class is declared with one execution space, every override in every derived class must be callable in the same space. If the base declares a __device__ virtual, calling it through a base pointer on the GPU must dispatch to the derived override -- which is only possible if the override is also __device__ (or __host__ __device__).
__global__ functions cannot be virtual at all (errors 3505/3506 prevent this at the attribute application stage), so the override matrix only covers three spaces: __host__, __device__, and __host__ __device__. An unannotated function counts as implicit __host__.
Function Entry: Mark and Resolve Entities
The function begins by resolving the actual entity nodes from the symbol table entries:
// sub_432280 entry (lines 60-69 of decompiled output)
//
// a2 = overriding_sym (symbol table entry for the derived-class function)
// a3 = overridden_sym (symbol table entry for the base-class function)
//
// v10 = entity of overridden function: *(overridden_sym + 88)
// v11 = entity of overriding function: *(*(overriding_sym) + 88)
//
// The entity node at offset +88 is the "associated routine entity" --
// the actual function representation containing execution space bits.
int64_t overridden_entity = *(int64_t*)(overridden_sym + 88); // v10
int64_t overriding_entity = *(int64_t*)(*(int64_t*)overriding_sym + 88); // v11
// Mark the overriding entity as "involved in an override"
*(uint8_t*)(overriding_entity + 176) |= 0x02;
The +176 |= 0x02 flag marks the derived function as "override-involved." This flag is consumed downstream by the exception specification resolver and other class completion logic.
Phase 1: Implicitly-HD Fast Path and Execution Space Propagation
The first branch tests byte +177 & 0x10 on the overriding entity. This bit indicates the function is implicitly __host__ __device__ -- set for constexpr functions (implicitly HD since CUDA 7.5) and __forceinline__ functions. When this bit is set, the override is exempt from mismatch checking, but execution space propagation still occurs.
// Phase 1: implicitly-HD check and propagation (lines 70-94)
void check_and_propagate(int64_t overriding_entity, int64_t overridden_entity) {
if (overriding_entity->byte_177 & 0x10) {
// Overriding function is implicitly HD (constexpr / __forceinline__)
//
// Skip mismatch errors entirely -- an implicitly-HD function is
// compatible with any base execution space. But we must still
// propagate the base's space obligations onto the derived entity
// so that downstream passes (IL marking, code generation) know
// what to emit.
if (!(overridden_entity->byte_177 & 0x10)) {
// Overridden function is NOT implicitly HD -- it has an explicit
// execution space. We need to propagate that space.
//
// Guard: skip propagation for constexpr lambdas with internal
// linkage but no override flag (a degenerate case).
if ((overridden_entity->qword_184 & 0x800001000000) == 0x800000000000
&& !(overridden_entity->byte_176 & 0x02)) {
// Degenerate case -- skip propagation
goto done_nvidia_checks;
}
uint8_t base_es = overridden_entity->byte_182;
// Propagate __host__ obligation:
// If the base is NOT device-only (i.e., base is host, HD, or
// unannotated), the derived function inherits the host obligation.
if ((base_es & 0x30) != 0x20) {
overriding_entity->byte_182 |= 0x10; // set host_explicit
}
// Propagate __device__ obligation:
// If the base has the device_annotation bit set, the derived
// function inherits the device obligation.
if (base_es & 0x20) {
overriding_entity->byte_182 |= 0x20; // set device_annotation
}
}
goto done_nvidia_checks;
}
// ... Phase 2 continues below
}
Why Propagation Matters
Propagation ensures that a derived class inherits its base class's execution space obligations even when the derived function is implicitly HD. Consider:
struct Base {
__device__ virtual void f(); // byte_182 & 0x30 == 0x20
};
struct Derived : Base {
constexpr void f() override; // byte_177 & 0x10 set (implicitly HD)
};
Without propagation, Derived::f would have byte_182 == 0x00 (no explicit annotation). The device-side IL pass would skip it, and a virtual call base_ptr->f() on the GPU would dispatch to a function never compiled for the device. Propagation sets byte_182 |= 0x20 (device_annotation), ensuring the function is included in device IL.
The propagation follows strict rules:
| Base byte_182 & 0x30 | Propagated to overriding entity |
|---|---|
| 0x00 (implicit host) | \|= 0x10 (host_explicit) |
| 0x10 (explicit host) | \|= 0x10 (host_explicit) |
| 0x20 (device) | \|= 0x20 (device_annotation) |
| 0x30 (HD) | \|= 0x10 then \|= 0x20 (both) |
Phase 2: Explicit Annotation Mismatch Detection
When the overriding function is NOT implicitly HD (byte_177 & 0x10 == 0), the checker must verify that the derived function's explicit execution space matches the base. It does this by querying the attribute lists on the overriding symbol for __device__ (kind 87) and __host__ (kind 86) attributes using sub_5CEE70.
The overriding symbol has two attribute list pointers: offset +184 (primary attributes) and offset +200 (secondary/redeclaration attributes). Both are checked for each attribute kind.
Reconstructed Pseudocode
// Phase 2: explicit annotation mismatch detection (lines 96-188)
//
// At this point, overriding_entity->byte_177 & 0x10 == 0 (not implicitly HD).
// We must determine what execution space annotations the overriding function
// has, and compare against the overridden function's execution space.
void check_override_mismatch(
int64_t overriding_sym, // a2
int64_t overriding_entity, // v11
int64_t overridden_entity, // v10
int64_t overridden_sym_list, // v6 = a2+48 (location info for diagnostics)
int64_t overridden_sym_arg, // v8 = a3 (for diagnostics)
int64_t base_sym // v9 = *a2 (for diagnostics)
) {
// -- Assertion: overridden entity must exist --
if (!overridden_entity) {
internal_error("nv_transforms.h", 367, "nv_is_device_only_routine");
}
// -- Extract overridden execution space --
uint8_t base_es = overridden_entity->byte_182;
uint8_t mask_30 = base_es & 0x30; // 0x00/0x10/0x20/0x30
bool base_no_device_annotation = (base_es & 0x20) == 0; // v56
bool base_is_hd = (mask_30 == 0x30); // v58
uint8_t base_device_bit = base_es & 0x20; // v55
// -- Check overriding function for __device__ attribute (kind 87) --
bool has_device_attr = find_attribute(87, overriding_sym->attr_list_184)
|| find_attribute(87, overriding_sym->attr_list_200);
if (has_device_attr) {
// Overriding function has __device__.
// Now check if it also has __host__ (kind 86) -- making it HD.
bool has_host_attr = find_attribute(86, overriding_sym->attr_list_184)
|| find_attribute(86, overriding_sym->attr_list_200);
if (has_host_attr) {
// --- Overriding is __host__ __device__ ---
if (base_device_bit) {
// Base has device_annotation (bit 5 set).
// If base is device-only (mask_30 == 0x20), error 3544.
if (mask_30 == 0x20) {
emit_error(8, 3544, location, overridden, base);
}
// If base is HD (mask_30 == 0x30), it's legal -- no error.
// If base has device_bit but mask_30 != 0x20 and != 0x30,
// that can't happen (bit 5 set implies mask_30 is 0x20 or 0x30).
} else {
// Base has no device_annotation -- base is host or implicit host.
emit_error(8, 3543, location, overridden, base);
}
} else {
// --- Overriding is __device__ only ---
// Fall through to LABEL_83 logic.
goto device_only_check;
}
} else {
// Overriding function has NO __device__ attribute.
// It's either explicit __host__ or implicit host (no annotation).
if (dword_106BFF0) {
// Relaxed mode: check if overriding has explicit __host__.
bool has_host_attr = find_attribute(86, overriding_sym->attr_list_184)
|| find_attribute(86, overriding_sym->attr_list_200);
if (!has_host_attr) {
// No explicit __host__ either -- implicit host.
// In relaxed mode, an implicit-host override is treated like
// a device-only override for certain base configurations.
// Jump into the device-only path with modified conditions.
goto device_only_check_relaxed;
}
// Explicit __host__ in relaxed mode: fall through to normal checks.
}
// --- Overriding is __host__ (explicit or implicit) ---
if (mask_30 == 0x20) {
// Base is __device__ only
emit_error(8, 3545, location, overridden, base);
} else if (mask_30 == 0x30) {
// Base is __host__ __device__
emit_error(8, 3546, location, overridden, base);
}
// else: base is host/implicit-host, same space -- no error.
goto done_nvidia_checks;
}
device_only_check:
// Overriding is __device__ only (has __device__ but no __host__).
// v39 = base_no_device_annotation (v56), v40 = 1 (always set entering here).
{
bool should_error = base_no_device_annotation; // v39
bool relaxed_extra = true; // v40
device_only_check_relaxed:
// (relaxed mode entry: v39 = 0, a1 = v56 = base_no_device_annotation)
if (dword_106BFF0) {
// Relaxed mode: the error fires unconditionally when
// base has no device annotation (base is host/implicit-host).
// In strict mode, same condition applies.
should_error = base_no_device_annotation;
relaxed_extra = true; // always true in relaxed
}
if (should_error) {
// Base is host-only (no device_annotation) and override is device-only.
emit_error(8, 3542, location, overridden, base);
} else if (base_is_hd && relaxed_extra) {
// Base is HD, override is device-only.
// v40 (relaxed_extra) is always 1 from Entry A, so this
// fires in both strict and relaxed modes for D-overrides-HD.
emit_error(8, 3547, location, overridden, base);
}
// else: base is device-only too -- compatible, no error.
}
done_nvidia_checks:
// Continue to standard EDG override recording...
}
Decision Tree (Simplified)
overriding byte_177 & 0x10?
YES (implicitly HD) --> propagate, skip mismatch check
NO --> extract base_es = overridden byte_182
has __device__ attr on overriding?
YES --> also has __host__ attr?
YES (override=HD):
base has device_annotation?
YES and mask_30==0x20 --> ERROR 3544
NO --> ERROR 3543
NO (override=D-only):
base has NO device_annotation? --> ERROR 3542
base is HD? --> ERROR 3547
NO (override=H or implicit-H):
base mask_30==0x20 --> ERROR 3545
base mask_30==0x30 --> ERROR 3546
otherwise --> legal (same space)
The Six Error Messages
Each mismatch produces one of six errors. All are emitted at severity 8 (hard error) and are individually suppressible by their diagnostic tag via --diag_suppress or #pragma nv_diag_suppress.
| Internal | Display | Diagnostic Tag | Message Template |
|---|---|---|---|
| 3542 | 20085 | vfunc_incompat_exec_h_d | execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __device__ function |
| 3543 | 20086 | vfunc_incompat_exec_h_hd | execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __host__ __device__ function |
| 3544 | 20087 | vfunc_incompat_exec_d_hd | execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ __device__ function |
| 3545 | 20088 | vfunc_incompat_exec_d_h | execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ function |
| 3546 | 20089 | vfunc_incompat_exec_hd_h | execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __host__ function |
| 3547 | 20090 | vfunc_incompat_exec_hd_d | execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __device__ function |
The display number is computed as internal + 16543 (the standard CUDA error renumbering from construct_text_message). The tag naming convention is vfunc_incompat_exec_{overridden}_{overriding}.
The %n1 and %n2 fill-ins resolve to the entity display names of the base and derived functions respectively, including their full qualified names and parameter types.
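The renumbering and the tag convention are simple enough to state as code. A minimal sketch, assuming the offset 16543 and the vfunc_incompat_exec_{overridden}_{overriding} pattern described above; the function names are illustrative, and the "-D" suffix is taken from the diagnostic text shown in the examples on this page rather than from an analysis of construct_text_message.

```cpp
#include <cassert>
#include <string>

// Offset between internal and displayed numbers for the CUDA-specific errors.
constexpr int kCudaDisplayOffset = 16543;

// Internal error number -> displayed number (3542 -> 20085).
constexpr int display_number(int internal) {
    return internal + kCudaDisplayOffset;
}

// Builds a tag from the space abbreviations "h", "d", or "hd".
inline std::string override_tag(const std::string& overridden,
                                const std::string& overriding) {
    assert(!overridden.empty() && !overriding.empty());
    return "vfunc_incompat_exec_" + overridden + "_" + overriding;
}

// Formats the number as it appears in a diagnostic line, e.g. "#20085-D".
inline std::string display_label(int internal) {
    return "#" + std::to_string(display_number(internal)) + "-D";
}
```

With these helpers, internal error 3546 maps to display label "#20089-D" and tag `override_tag("hd", "h")`, which agrees with the table row for vfunc_incompat_exec_hd_h.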
Suppression Example
# Suppress by tag (preferred)
nvcc --diag_suppress=vfunc_incompat_exec_h_d file.cu
# Suppress by display number
nvcc --diag_suppress=20085 file.cu
# Suppress in source
#pragma nv_diag_suppress vfunc_incompat_exec_h_d
Complete Compatibility Matrix
This table shows every combination of base (overridden) and derived (overriding) execution space. "Implicit H" means the function has no execution space annotation (byte_182 & 0x30 == 0x00). Since implicit host and explicit __host__ are treated identically for override purposes (both lack the device_annotation bit and have mask_30 != 0x20), they share the same row/column behavior.
__global__ is excluded because __global__ functions cannot be virtual -- the attribute handler rejects __global__ on virtual functions before override checking ever runs.
The matrix is the same in both strict mode (dword_106BFF0 == 0) and relaxed mode (dword_106BFF0 == 1). The relaxed flag changes the code path used to reach the error decision but produces the same result for all input combinations.
| | Derived: H / implicit H | Derived: D | Derived: HD | Derived: implicitly HD |
|---|---|---|---|---|
| Base: H / implicit H | legal | error 3542 | error 3543 | legal + propagate \|= 0x10 |
| Base: D | error 3545 | legal | error 3544 | legal + propagate \|= 0x20 |
| Base: HD | error 3546 | error 3547 | legal | legal + propagate \|= 0x10, \|= 0x20 |
Reading the matrix: each row is the base class virtual function's space; each column is the derived class override's space. "Legal" means no error is emitted and the override is recorded normally. "Legal + propagate" means the override is accepted AND the base's execution space bits are OR'd into the derived entity's byte_182.
The diagonal (same space in base and derived) is always legal. The last column (implicitly HD) is always legal because an implicitly HD function is compatible with every execution space -- the mismatch check is skipped entirely and only propagation runs.
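The matrix can be encoded as a table-driven check. This is a sketch of the documented outcomes only, not of the branch structure in sub_432280; the enum, struct, and function names are hypothetical, and Host stands for both implicit and explicit host.

```cpp
#include <cassert>
#include <cstdint>

// Host covers both implicit and explicit host (same row/column in the matrix).
enum Space { kHost = 0, kDevice = 1, kHostDevice = 2 };

struct OverrideOutcome {
    int error;               // 0 = legal, otherwise internal error number 3542..3547
    std::uint8_t propagate;  // bits OR'd into the derived byte_182 (0 = none)
};

// When derived_implicitly_hd is true, the `derived` argument is ignored,
// mirroring the fast path that skips mismatch checking entirely.
inline OverrideOutcome check_override(Space base, Space derived,
                                      bool derived_implicitly_hd) {
    assert(base >= kHost && base <= kHostDevice);
    assert(derived >= kHost && derived <= kHostDevice);

    if (derived_implicitly_hd) {
        // Last column: never an error, only propagation of the base bits.
        static const std::uint8_t kPropagate[3] = {0x10, 0x20, 0x30};
        return {0, kPropagate[base]};
    }

    // Rows = base (H, D, HD); columns = derived (H, D, HD).
    static const int kError[3][3] = {
        {0,    3542, 3543},
        {3545, 0,    3544},
        {3546, 3547, 0},
    };
    return {kError[base][derived], 0};
}
```

A table keeps the nine explicit combinations visible at a glance, which makes it easy to compare against the matrix row by row.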
Why Both Modes Produce the Same Matrix
Tracing the LABEL_83 code path with the two entry points reveals that dword_106BFF0 does NOT gate error 3547. In the critical device-only-override path (Entry A), v40 is set to 1 before reaching LABEL_83 regardless of the relaxed flag. The flag only changes the assignment to a1 and v40 via conditional moves (cmovz/cmovnz in the disassembly), but the net effect is identical for all input combinations:
LABEL_83 internals (decompiled, annotated):
a2 = 3542; // tentative error
if (!dword_106BFF0) a1 = v39; // strict: a1 = v39
if (dword_106BFF0) v40 = 1; // relaxed: force v40 = 1
// BUT v40 was already 1 from Entry A (line 134)
if (a1) emit_error(3542); // base has no device_annotation
else if (v58 && v40) emit_error(3547); // base is HD
else skip; // base is D-only (compatible)
Entry A sets v39 = v56, v40 = 1, a1 = v56. In strict mode, a1 is overwritten to v39 (same value). In relaxed mode, a1 stays v56 (same value). Either way, a1 = v56 = (base has no device annotation). The v40 = 1 from Entry A is preserved. The result is identical.
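The Entry A argument can be replayed mechanically. The sketch below transcribes the annotated LABEL_83 listing into a function and assumes that listing is faithful; the names label_83 and entry_a are illustrative, and the return value 0 stands for "no error".

```cpp
#include <cassert>

// Transcription of the annotated LABEL_83 listing.
// Returns 3542, 3547, or 0 (no error).
inline int label_83(bool relaxed, bool v39, bool v40, bool a1, bool v58) {
    if (!relaxed) a1 = v39;    // strict: a1 = v39
    if (relaxed)  v40 = true;  // relaxed: force v40 = 1
    if (a1) return 3542;             // base has no device annotation
    if (v58 && v40) return 3547;     // base is HD
    return 0;                        // base is D-only (compatible)
}

// Entry A (device-only override): v39 = v56, v40 = 1, a1 = v56.
// v56 = base has no device annotation, v58 = base is HD.
inline int entry_a(bool relaxed, bool v56, bool v58) {
    // An HD base has the device bit set, so v56 and v58 cannot both hold.
    assert(!(v56 && v58));
    return label_83(relaxed, /*v39=*/v56, /*v40=*/true, /*a1=*/v56, v58);
}
```

Calling `entry_a` with `relaxed` set to false and then true for each of the three reachable (v56, v58) combinations returns the same value in both modes, which is the claim made for Entry A above.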
The relaxed flag introduces a second entry point (Entry B) for overriding functions with no explicit annotation. In relaxed mode, such functions are routed through LABEL_83 with v39 = 0 and a1 = v56, producing the same device-only check logic. In strict mode, the same functions take the direct H/implicit-H path and produce errors 3545/3546 for device/HD bases. Both paths reach the same conclusions.
Relaxed Mode: The Unannotated Override Path
When dword_106BFF0 == 1 and the overriding function has no __device__ attribute, the checker takes an additional step before falling through to the H/implicit-H path. It queries the overriding symbol for explicit __host__ (kind 86). If __host__ IS found, the function is confirmed as explicit host and errors 3545/3546 apply normally. If __host__ is NOT found (truly unannotated), the function is reclassified through the device-only check path (LABEL_83). This reclassification does not change the error outcome -- an unannotated function overriding a host base still sees no error (both are host-space), and an unannotated function overriding a device or HD base still produces the appropriate error.
Propagation Details
When the overriding function is implicitly HD (byte_177 & 0x10), execution space is propagated from the base to the derived entity by OR-ing bits into byte_182:
// Propagation (direct from decompiled sub_432280, lines 77-91)
uint8_t base_es = overridden_entity->byte_182;
// If base is NOT device-only, derived inherits host obligation
if ((base_es & 0x30) != 0x20) {
overriding_entity->byte_182 |= 0x10; // host_explicit bit
base_es = overridden_entity->byte_182; // re-read (compiler artifact)
}
// If base has device_annotation, derived inherits device obligation
if (base_es & 0x20) {
overriding_entity->byte_182 |= 0x20; // device_annotation bit
}
The re-read of overridden_entity->byte_182 after setting 0x10 on the overriding entity is a compiler artifact (the decompiler shows it reading back from v10+182 into v22, but v10 is the overridden entity, so the value hasn't changed). The OR operations are on the overriding entity only.
Propagation Matrix
| Base space (byte_182 & 0x30) | Bits OR'd into overriding byte_182 | Net effect on overriding entity |
|---|---|---|
| 0x00 (implicit H) | \|= 0x10 | Becomes explicit host (0x10) |
| 0x10 (explicit H) | \|= 0x10 | Becomes explicit host (0x10) |
| 0x20 (D only) | \|= 0x20 | Becomes device-annotated (0x20) |
| 0x30 (HD) | \|= 0x10, then \|= 0x20 | Becomes HD (0x30) |
After propagation, the overriding entity's byte_182 accurately reflects the execution space obligations inherited from its base class. Downstream passes (device/host separation, IL marking, code generation) use this byte to determine whether the function needs device-side compilation, host-side compilation, or both.
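The propagation rule can be checked against the matrix with a pure function over the two bytes. This is a minimal sketch: the function name is hypothetical, and it operates on byte values rather than on entity nodes. The "net effect" column assumes the overriding entity starts with both space bits clear.

```cpp
#include <cassert>
#include <cstdint>

// Returns the overriding entity's byte_182 after propagation from the base.
// Bits outside 0x30 in the derived byte are preserved.
inline std::uint8_t propagate_exec_space(std::uint8_t base_byte_182,
                                         std::uint8_t derived_byte_182) {
    // Base is not device-only: derived inherits the host obligation.
    if ((base_byte_182 & 0x30) != 0x20) {
        derived_byte_182 |= 0x10;  // host_explicit
    }
    // Base has device_annotation: derived inherits the device obligation.
    if (base_byte_182 & 0x20) {
        derived_byte_182 |= 0x20;  // device_annotation
    }
    // Every base state sets at least one of the two bits.
    assert((derived_byte_182 & 0x30) != 0);
    return derived_byte_182;
}
```

Because the rule only ORs bits in, applying it twice gives the same result as applying it once.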
Relaxed Mode (dword_106BFF0)
The global flag dword_106BFF0 (relaxed_attribute_mode, default 1 per CLI defaults) controls permissive handling of execution space annotations across the compiler. Its primary effects are on attribute application (allowing __device__ + __global__ coexistence) and cross-space call validation. For virtual override checking, its effect is narrower:
- Unannotated override reclassification. In relaxed mode, when the overriding function has no __device__ attribute, the checker additionally queries the overriding symbol for __host__ (kind 86). If __host__ is NOT found (the function is unannotated), the checker treats it as potentially device-compatible and routes through the device-only check path (LABEL_83). This can produce error 3542 (D overrides H) for an implicit-host function, which would otherwise only see errors 3545/3546.
- No error suppression for overrides. Unlike attribute application, where relaxed mode suppresses error 3481, relaxed mode does NOT suppress any of the six override errors. All six fire at severity 8 in both modes. The flag dword_106BFF0 modulates the code path taken to reach the error decision, not the severity or suppression of the error itself.
Additional Override Checks (Non-CUDA)
After the CUDA execution space checks, sub_432280 continues with standard EDG override validation:
| Error | Condition | Meaning |
|---|---|---|
| 1788 | Base has [[nodiscard]], derived does not | Missing [[nodiscard]] on override |
| 1789 | Derived has [[nodiscard]], base does not | Extraneous [[nodiscard]] on override |
| 1850 | Overriding a final virtual function | Override of final function |
| 2935 | Derived has requires-clause, base does not | Requires-clause mismatch |
| 2936 | Base has requires-clause, derived does not | Requires-clause mismatch |
These are standard C++ checks unrelated to CUDA execution spaces.
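For reference, the five conditions in the table can be written as independent predicates. This is a sketch only: the traits struct and function name are hypothetical, and the check order and the possibility of reporting several errors for one override are assumptions (the table does not state either).

```cpp
#include <vector>

// Hypothetical per-function summary; field names are illustrative.
struct VirtualFnTraits {
    bool has_nodiscard;
    bool is_final;
    bool has_requires_clause;
};

// Returns the error numbers from the table that apply to this base/derived pair.
inline std::vector<int> standard_override_errors(const VirtualFnTraits& base,
                                                 const VirtualFnTraits& derived) {
    std::vector<int> errors;
    if (base.has_nodiscard && !derived.has_nodiscard) errors.push_back(1788);
    if (derived.has_nodiscard && !base.has_nodiscard) errors.push_back(1789);
    if (base.is_final) errors.push_back(1850);
    if (derived.has_requires_clause && !base.has_requires_clause) errors.push_back(2935);
    if (base.has_requires_clause && !derived.has_requires_clause) errors.push_back(2936);
    return errors;
}
```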
Example: Override Interactions
// Example 1: Legal same-space override
struct Base {
    __device__ virtual void f();
};
struct Derived : Base {
    __device__ void f() override;  // Legal: D overrides D
};

// Example 2: Error 3542 -- D overrides H
struct Base2 {
    virtual void f();              // Implicit __host__
};
struct Derived2 : Base2 {
    __device__ void f() override;  // ERROR 3542 (20085)
};
// error #20085-D: execution space mismatch: overridden entity (Base2::f)
//   is a __host__ function, but overriding entity (Derived2::f)
//   is a __device__ function

// Example 3: Error 3546 -- H overrides HD
struct Base3 {
    __host__ __device__ virtual void f();
};
struct Derived3 : Base3 {
    void f() override;             // ERROR 3546 (20089)
};
// error #20089-D: execution space mismatch: overridden entity (Base3::f)
//   is a __host__ __device__ function, but overriding entity (Derived3::f)
//   is a __host__ function

// Example 4: Legal constexpr override with propagation
struct Base4 {
    __device__ virtual int g();
};
struct Derived4 : Base4 {
    constexpr int g() override;    // Legal: implicitly HD, propagates |= 0x20
};
// Derived4::g now has byte_182 |= 0x20 (device_annotation)
// and is included in device IL compilation.

// Example 5: Error 3547 -- D overrides HD
struct Base5 {
    __host__ __device__ virtual void h();
};
struct Derived5 : Base5 {
    __device__ void h() override;  // ERROR 3547 (20090)
};
Function Map
| Address | Identity | Lines | Source |
|---|---|---|---|
| sub_432280 | record_virtual_function_override | 437 | class_decl.c |
| sub_5CEE70 | find_attribute (attribute list lookup by kind) | ~30 | attribute.c |
| sub_4F4F10 | emit_diag_with_entity_pair (severity, error, loc, base, derived) | ~100 | error.c |
| sub_4F2930 | internal_error (assertion failure) | ~20 | error.c |
| sub_41A6E0 | dump_override_entry (debug trace helper) | ~40 | class_decl.c |
| sub_41D010 | add_to_override_list | ~20 | class_decl.c |
| sub_5E20D0 | allocate_override_entry (40-byte node) | ~15 | mem.c |
| sub_432130 | resolve_indeterminate_exception_specification | ~60 | class_decl.c |
Override Entry Structure
Each recorded override is stored as a 40-byte linked list node:
Override entry (40 bytes):
+0x00 (0): next pointer
+0x08 (8): base_class_symbol (entity in base class vtable)
+0x10 (16): derived_class_entity (overriding function entity)
+0x18 (24): flags (0 initially, set during processing)
+0x20 (32): covariant_return_adjustment (pointer or NULL)
The override list is managed via:
- qword_E7FE98: list head (most recent entry)
- qword_E7FEA0: free list head (recycled 40-byte entries)
- qword_E7FE90: allocation counter
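The node layout and the three globals suggest a conventional intrusive list with a recycling free list. The following is a minimal sketch under several assumptions: all C++ names are hypothetical stand-ins for the qword_* globals, the field types (in particular a 64-bit flags slot) are inferred only from the offsets, `new` stands in for whatever allocator sub_5E20D0 actually uses, and the counter is assumed to count fresh allocations rather than reuses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Field types are assumptions; the source gives only offsets and roles.
struct OverrideEntry {
    OverrideEntry* next;                // +0x00
    void* base_class_symbol;            // +0x08
    void* derived_class_entity;         // +0x10
    std::uint64_t flags;                // +0x18 (slot width assumed)
    void* covariant_return_adjustment;  // +0x20
};
static_assert(sizeof(void*) == 8, "sketch assumes a 64-bit target");
static_assert(offsetof(OverrideEntry, flags) == 0x18, "flags at +0x18");
static_assert(offsetof(OverrideEntry, covariant_return_adjustment) == 0x20,
              "adjustment at +0x20");
static_assert(sizeof(OverrideEntry) == 40, "40-byte node");

static OverrideEntry* g_list_head = nullptr;  // stands in for qword_E7FE98
static OverrideEntry* g_free_list = nullptr;  // stands in for qword_E7FEA0
static std::size_t g_alloc_count = 0;         // stands in for qword_E7FE90

// Takes a node from the free list if possible, otherwise allocates one,
// then pushes it at the head so the head is always the most recent entry.
inline OverrideEntry* new_override_entry(void* base, void* derived) {
    OverrideEntry* e = g_free_list;
    if (e != nullptr) {
        g_free_list = e->next;
    } else {
        e = new OverrideEntry;
        ++g_alloc_count;
    }
    e->base_class_symbol = base;
    e->derived_class_entity = derived;
    e->flags = 0;  // "0 initially, set during processing"
    e->covariant_return_adjustment = nullptr;
    e->next = g_list_head;
    g_list_head = e;
    return e;
}

// Unlinks an entry from the active list and recycles it onto the free list.
inline void remove_override_entry(OverrideEntry* e) {
    assert(e != nullptr);
    for (OverrideEntry** link = &g_list_head; *link != nullptr; link = &(*link)->next) {
        if (*link == e) {
            *link = e->next;
            e->next = g_free_list;
            g_free_list = e;
            return;
        }
    }
}
```

Recycling through a free list keeps node addresses stable and avoids repeated allocator calls when overrides are recorded and removed in quick succession, which would be consistent with the "removing: " trace message described below.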
When debug tracing is enabled (dword_126EFCC > 3), the function prints "newly created: ", "existing entry: ", "after modification: ", and "removing: " to stderr via fwrite, followed by calls to sub_41A6E0 to dump the entry contents.
Cross-References
- Execution Spaces -- bitfield layout at entity +182, attribute application handlers, conflict matrix
- Cross-Space Call Validation -- call-graph enforcement, the implicitly-HD bypass
- CUDA Error Catalog -- error numbering scheme, diagnostic tag suppression system
- Global Variables -- dword_106BFF0 and other flags
- Entity Node Layout -- full byte map of the entity structure including +176, +177, +182
- __global__ Function Constraints -- why __global__ functions cannot be virtual