cudafe++ v13.0 -- Reverse Engineering Reference

cudafe++ is NVIDIA's CUDA frontend compiler -- the first stage of the CUDA compilation pipeline. It is built on the Edison Design Group (EDG) C++ Front End v6.6, a commercial compiler frontend licensed by compiler vendors worldwide. NVIDIA ships cudafe++ as a statically-linked, stripped ELF binary inside every CUDA Toolkit installation. This binary accepts .cu source files, parses them as C++ with CUDA extensions, separates device code from host code, and produces two outputs: an EDG Intermediate Language (IL) stream consumed by cicc (the NVIDIA PTX code generator), and a transformed .int.c host file consumed by the system C++ compiler (gcc, clang, or cl.exe).

This wiki documents the complete internals of the cudafe++ binary from CUDA Toolkit 13.0, reverse-engineered through static analysis (IDA Pro + Hex-Rays decompilation) of all 6,483 functions. The goal is reimplementation-grade documentation: every page should give a senior compiler engineer enough information to build equivalent functionality from scratch.

Binary Identity

Property	Value
Binary	`cudafe++` from CUDA Toolkit 13.0
Format	ELF 64-bit LSB executable, x86-64, statically linked, stripped
File size	8,910,936 bytes (8.5 MB)
EDG base	Edison Design Group C++ Front End v6.6
Build path	`/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/`
Total functions	6,483
Functions mapped to source	2,209 (34%)

Segment Layout

Section	Start	End	Size	Description
`.text`	`0x403300`	`0x829722`	4,351,010 bytes (4.15 MB)	Executable code
`.rodata`	`0x829740`	`0xAA3FA3`	2,599,011 bytes (2.48 MB)	Read-only data (string tables, jump tables, constants)
`.data`	`0xD46480`	`0xE7EFF0`	1,280,880 bytes (1.22 MB)	Initialized global variables
`.bss`	`0xE7F000`	`0x12D6F20`	4,554,528 bytes (4.34 MB)	Zero-initialized globals
`.eh_frame`	`0xCB1210`	`0xD3F398`	582,024 bytes	Exception handling unwind tables
`.data.rel.ro`	`0xD428C0`	`0xD45E00`	13,632 bytes	Relocation-read-only (vtables, GOT-relative)
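Each Size value in the table is simply End minus Start. The following is a minimal Python sketch that recomputes the sizes from the addresses listed above; the dictionary and function names are illustrative and do not come from the binary.

```python
# Section bounds (start, end) copied from the Segment Layout table.
SECTIONS = {
    ".text": (0x403300, 0x829722),
    ".rodata": (0x829740, 0xAA3FA3),
    ".data": (0xD46480, 0xE7EFF0),
    ".bss": (0xE7F000, 0x12D6F20),
    ".eh_frame": (0xCB1210, 0xD3F398),
    ".data.rel.ro": (0xD428C0, 0xD45E00),
}

def section_size(name):
    """Return the size in bytes of a section as end minus start."""
    start, end = SECTIONS[name]
    return end - start

for name in SECTIONS:
    size = section_size(name)
    print(f"{name:14s} {size:>10,d} bytes  ({size / 2**20:.2f} MiB)")
```

Running the same subtraction over any other address range in this wiki is a quick way to cross-check a quoted size.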

Role in the CUDA Toolchain

  input.cu
     |
     v
 cudafe++  ──────── THIS BINARY ────────
     |                                   |
     v                                   v
  device.gpu  (EDG IL)             input.int.c  (transformed host C++)
     |                                   |
     v                                   v
   cicc                            gcc / clang / cl.exe
     |                                   |
     v                                   v
  device.ptx                       host.o
     |                                   |
     v                                   v
   ptxas                              ld
     |                                   |
     v                                   v
  device.cubin ──────────────────> final executable

cudafe++ is a source-to-source compiler. It never generates machine code directly. Its job is to take a single .cu translation unit, understand which code is device (__device__, __global__) and which is host, then:

For the device track: Emit EDG IL -- a typed, scope-linked intermediate representation containing every declaration, type, expression, and statement. This IL is consumed by cicc, which lowers it through LLVM to PTX assembly.
For the host track: Emit a .int.c file -- valid C++ source where device function bodies are suppressed inside #if 0/#endif, __global__ kernels are replaced by __wrapper__device_stub_<name>() forwarding functions, and CUDA runtime registration boilerplate is appended.

The binary runs as a single-threaded, single-pass-per-stage pipeline with 8 stages: pre-init, CLI parsing (276 flags), one-time init (38 subsystem initializers), TU state reset, frontend parse (EDG parser + CUDA extensions), 5-pass IL finalization, backend .int.c emission, and exit. See Pipeline Overview for the full stage diagram.

Source Attribution

The binary embeds __FILE__ strings from the EDG build system, revealing the original source file structure. From these strings plus address-range analysis of decompiled code, 52 .c source files and 13 .h header files have been identified:

Category	Files	Functions Mapped	Description
EDG core parser	15 `.c`	~800	Lexer, expression/declaration parser, statement handling
EDG type system	6 `.c`	~350	Type representation, checking, conversion
EDG templates	5 `.c`	~300	Template parsing, instantiation, deduction
EDG IL subsystem	8 `.c`	~250	IL node types, allocation, walking, display, comparison
EDG infrastructure	12 `.c`	~400	Memory management, error handling, name mangling, scope management
EDG code generation	3 `.c`	~150	Backend `.int.c` emission, ASM handling
NVIDIA additions	3 `.c`	~110	CUDA transforms, attribute validation, lambda wrappers
Headers	13 `.h`	(inline)	Shared constants, struct layouts, macro definitions

The NVIDIA-specific source files are:

nv_transforms.c (~34 functions, ~14 KB of .text): The heart of CUDA support. Implements device/host-device lambda wrapper template generation (__nv_dl_wrapper_t, __nv_hdl_wrapper_t, __nv_hdl_create_wrapper_t), CUDA attribute validation (__launch_bounds__, __cluster_dims__, __block_size__, __maxnreg__), host reference array emission (.nvHRKI/.nvHRDE/.nvHRCE ELF sections), lambda preamble injection (sub_6BCC20), and array capture helper generation.
nv_transforms.h: Header with NVIDIA-specific declarations, type trait template names, and bitmask table definitions.
3 modified EDG files: cmd_line.c (CUDA CLI flags spliced into EDG's flag table), fe_init.c (CUDA-specific initialization at stage 3), and cp_gen_be.c (device stub generation, lambda wrapper emission, registration table output in the backend).

Key Discoveries

Execution Space Bitfield

Every entity node in the EDG IL carries CUDA execution-space information at byte offset +182 (relative to the entity node base). The bitfield encoding:

Bit	Mask	Meaning
4-5	`0x30`	Execution space: 0=none, 1=`__host__`, 2=`__device__`, 3=`__host__ __device__`
6	`0x40`	Device/global flag (set for `__device__` and `__global__` functions)
7	`0x80`	`__global__` kernel flag

This bitfield is checked throughout the pipeline -- in cross-space call validation, device/host code separation, the keep-in-IL predicate, and backend stub generation.
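The encoding in the table can be decoded with a few mask operations. Below is a hedged Python sketch: the offset and masks come from the table, while the class name, function names, and the idea of reading from a raw byte image of an entity node are assumptions made for illustration.

```python
from dataclasses import dataclass

EXEC_SPACE_OFFSET = 182  # byte offset within the entity node (from the text)
SPACE_NAMES = {0: "none", 1: "__host__", 2: "__device__", 3: "__host__ __device__"}

@dataclass
class ExecSpace:
    space: str              # decoded from bits 4-5 (mask 0x30)
    device_or_global: bool  # bit 6 (mask 0x40)
    is_kernel: bool         # bit 7 (mask 0x80)

def decode_exec_space(byte_value):
    """Decode the execution-space byte according to the table above."""
    return ExecSpace(
        space=SPACE_NAMES[(byte_value & 0x30) >> 4],
        device_or_global=bool(byte_value & 0x40),
        is_kernel=bool(byte_value & 0x80),
    )

def exec_space_of(entity_bytes):
    """Read the byte at +182 from a raw entity-node image and decode it."""
    return decode_exec_space(entity_bytes[EXEC_SPACE_OFFSET])
```

Any byte values fed to this decoder in experiments are test inputs only; which combinations the frontend actually produces for a given declaration is not established by this sketch.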

Lambda Wrapper Template Injection

CUDA extended lambdas (__device__ and __host__ __device__ lambdas) cannot be passed directly across the host/device boundary. cudafe++ solves this by injecting a library of template wrapper structs into the compilation at backend time. The master emitter sub_6BCC20 (nv_emit_lambda_preamble) generates all __nv_* templates in a single function call, driven by two 1024-bit bitmasks that record which capture counts were actually needed during parsing:

unk_1286980: Device lambda capture counts (bit N = need __nv_dl_wrapper_t for N captures)
unk_1286900: Host-device lambda capture counts (need __nv_hdl_wrapper_t for N captures)

Only the required specializations are emitted, keeping the generated code minimal.
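The capture-count bookkeeping can be pictured as a fixed-size bitset. The sketch below is an illustration in Python, not the decompiled logic: the 1024-bit width is from the text, but the least-significant-bit-first ordering within each byte and all function names are assumptions.

```python
MASK_BITS = 1024  # per the text: each mask is 1024 bits (128 bytes)

def new_mask():
    """Create an all-zero capture-count mask."""
    return bytearray(MASK_BITS // 8)

def mark_capture_count(mask, n):
    """Record that a wrapper for n captures is needed (sets bit n)."""
    if not 0 <= n < MASK_BITS:
        raise ValueError("capture count out of range")
    mask[n // 8] |= 1 << (n % 8)  # LSB-first bit order is an assumption

def needed_capture_counts(mask):
    """Return every n whose bit is set, in ascending order."""
    return [n for n in range(MASK_BITS) if mask[n // 8] & (1 << (n % 8))]
```

An emitter driven by such a mask would iterate over `needed_capture_counts` and generate one wrapper specialization per set bit, which matches the "only what was needed" behavior described above.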

CUDA Error Catalog

The binary contains 3,795 diagnostic messages in the EDG error table. Of these, 338 are CUDA-specific (error numbers in the 20000+ range and the 3500-3800 range). These cover:

Execution space violations (calling __device__ from __host__ and vice versa)
__global__ function constraints (no return value, no variadic args, no virtual)
Lambda restrictions (35+ distinct error categories for extended lambda misuse)
Attribute conflicts (__launch_bounds__ + __maxnreg__ mutual exclusion)
RDC mode restrictions (user-defined copy constructors in kernel arguments)
Architecture feature gates (feature X requires SM_YY or higher)

IL Entry Kind System

The EDG IL uses 85 defined entry kinds (0-84), each representing a distinct node type in the typed, scope-linked IL graph. Key node types include: routine (288 bytes, functions/methods), variable (232 bytes), type (176 bytes, 22 sub-kinds), expr_node (72 bytes, 36 sub-kinds), statement (80 bytes, 26 sub-kinds), and scope (288 bytes, 9 sub-kinds). All nodes live in a region-based arena allocator with 64 KB blocks. See IL Overview for the complete entry kind table.
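To make the region-based allocation concrete, here is a minimal bump-allocator sketch in Python. Only the 64 KB block size is taken from the text; the 8-byte alignment, the class and method names, and the (block index, offset) return convention are assumptions, and the real allocator in the binary may differ in every detail.

```python
BLOCK_SIZE = 64 * 1024  # block size stated in the text

class Region:
    """Bump allocator: carve nodes out of fixed-size blocks, free all at once."""

    def __init__(self):
        self.blocks = []           # list of bytearray blocks
        self.offset = BLOCK_SIZE   # forces a new block on the first allocation

    def alloc(self, size, align=8):  # 8-byte alignment is an assumption
        if size > BLOCK_SIZE:
            raise ValueError("request larger than one block")
        start = (self.offset + align - 1) & ~(align - 1)
        if start + size > BLOCK_SIZE:
            # Current block is exhausted: open a fresh one.
            self.blocks.append(bytearray(BLOCK_SIZE))
            start = 0
        self.offset = start + size
        return len(self.blocks) - 1, start  # (block index, offset in block)

    def release(self):
        """Drop every block in the region in one step."""
        self.blocks.clear()
        self.offset = BLOCK_SIZE
```

The appeal of this design for a compiler frontend is that individual nodes are never freed; a whole region is discarded when its contents are no longer needed.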

CLI Flag Inventory

cudafe++ accepts 276 command-line flags parsed in sub_459630 (cmd_line.c). These control:

Language mode and C++ standard version (__cplusplus value)
Host compiler identity (MSVC, GCC, Clang) and version
CUDA-specific modes: extended lambdas, RDC, JIT, architecture target
Diagnostic suppression and promotion
Include paths and macro definitions
Output format and timing

Flags are passed from nvcc via the -Xcudafe forwarding mechanism. Many flags are undocumented EDG internals.

Wiki Structure

This wiki is organized into 12 sections covering the binary from top-level pipeline down to individual data structures.

Overview

Function Map -- address-to-identity table for all 2,209 mapped functions
Binary Layout -- segment map, memory regions, address space organization
Methodology -- RE tools, approach, confidence scoring

Compilation Pipeline

The 8-stage pipeline from main() at 0x408950 through exit. Covers initialization, CLI parsing, EDG frontend invocation, 5-pass IL finalization, backend .int.c emission, and exit code mapping.

CUDA Execution Model

How cudafe++ handles __device__, __host__, and __global__ execution spaces. Device/host code separation, cross-space call validation, kernel stub generation, RDC (relocatable device code) mode, JIT mode, and SM architecture feature gating.

CUDA Attributes

The internal attribute system: __global__ function constraints, __launch_bounds__ / __cluster_dims__ / __block_size__ / __maxnreg__ validation, __grid_constant__ parameter handling, __managed__ variable support, and minor attributes (__nv_pure__, __nv_register_params__).

Lambda Transformations

Extended lambda support architecture: device lambda wrapper (__nv_dl_wrapper_t), host-device lambda wrapper (__nv_hdl_wrapper_t / __nv_hdl_create_wrapper_t), capture handling (field types, array wrappers for up to 8D), preamble injection (sub_6BCC20), and the 35+ lambda restriction error categories.

EDG Intermediate Language

The 85-entry-kind IL format: node allocation (region-based arena), tree walking (5-callback traversal), device code selection (keep-in-IL predicate), display (debug dump), and comparison/copy operations.

Host Output Generation

The .int.c file format, CUDA runtime boilerplate (__nv_managed_rt initialization, crt/host_runtime.h inclusion), host reference arrays (.nvHRKI/.nvHRDE/.nvHRCE ELF sections for device symbol registration), and CRC32-based module ID generation.

EDG Frontend Internals

The stock EDG 6.6 subsystems: lexer/tokenizer (357 token kinds), expression parser, declaration parser, overload resolution, template engine (instantiation worklist), CUDA-specific template restrictions, constexpr interpreter, Itanium ABI name mangling with CUDA extensions, and the type system (176-byte type node, 22 type kinds).

Error & Diagnostic System

The 3,795-entry diagnostic table, CUDA-specific error catalog (338 entries), format specifier system (%t/%s/%n/%sq/%p/%d), and SARIF output / pragma control.

Data Structures

Byte-level layouts for the core IL node types: entity node (execution/memory space at +182), scope entry (784 bytes), translation unit descriptor (424 bytes), type node (176 bytes, 22 kinds), and template instance record (128 bytes).

Configuration

CLI flag inventory (276 flags by category), EDG build configuration (compile-time constants baked into the binary), architecture detection (--nv_arch and SM version mapping), and experimental feature flags.

Reference

EDG source file map (52 .c + 13 .h), global variable index, token kind table (357 types), full error message catalog, and virtual override mismatch matrix.

Navigating This Wiki

If you want to understand the compilation pipeline: Start with Pipeline Overview, then follow the stage-by-stage links.

If you want to understand CUDA-specific behavior: Start with the CUDA Execution Model section. The execution spaces page explains the fundamental bitfield encoding that everything else depends on.

If you want to understand lambda transformations: Start with the Lambda Transformations overview. Lambda support is the most complex NVIDIA addition and involves template injection, capture-count bitmasks, and 5 distinct wrapper template families.

If you want to understand the IL format: Start with IL Overview for the 85 entry kinds, then Keep-in-IL for how device code is selected.

If you want to look up a specific function: The Function Map provides address-to-identity mappings for all 2,209 identified functions. The EDG Source File Map shows which source file each address range belongs to.

Data Sources

This wiki is derived from:

6,202 Hex-Rays decompiled C pseudocode files -- one per function with recognizable control flow
6,342 x86-64 disassembly files -- full instruction-level coverage
9.5 MB strings database with cross-references to every function that uses each string
161 MB cross-reference database -- complete caller/callee and data-reference mappings
7.7 MB call graph in JSON and DOT format
6,483 control flow graphs with basic block boundaries
247 MB IDA Pro database (.i64)

All analysis was performed on the binary shipped with CUDA Toolkit 13.0, obtained from NVIDIA's public distribution channels.

Function Map

Every function in the cudafe++ binary that triggers an EDG assertion encodes three pieces of data in the assertion string: the source file path, the line number, and the enclosing function name. These strings survive in .rodata and cross-reference back to the compiled functions, providing a ground-truth mapping from binary address to EDG source file. This page catalogs that mapping for all 52 .c source files and 13 .h header files identified in the CUDA 13.0 build of cudafe++ (EDG 6.6).

The mapping was produced by extracting all string literals matching /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/*.c and *.h from the binary's .rodata section, then tracing their cross-references to determine which functions load each path. A function that references attribute.c in an assertion string was compiled from attribute.c. Functions that reference no source path at all (the "unmapped" pool) are either too small to contain assertions, are inlined from headers, or belong to the statically-linked C++ runtime.
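The extraction step described above can be sketched in Python. This is an illustration under stated assumptions: the strings database is taken to be a list of records with a `value` field and an `xrefs` list whose items carry a `func` field (the shape implied by the jq command in the Data Source section), and the function name is hypothetical.

```python
from collections import defaultdict
import posixpath

SRC_PREFIX = (
    "/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/"
    "compiler/drivers/compiler/edg/EDG_6.6/src/"
)

def map_functions_to_sources(strings):
    """Group referencing functions by the source file named in each path string.

    `strings` is assumed to be a list of
    {"value": str, "xrefs": [{"func": str}, ...]} records.
    """
    by_file = defaultdict(set)
    for entry in strings:
        value = entry["value"]
        if value.startswith(SRC_PREFIX) and value.endswith((".c", ".h")):
            # A set collapses repeated references from the same function.
            by_file[posixpath.basename(value)].update(
                xref["func"] for xref in entry["xrefs"]
            )
    return by_file
```

Using a set per file is what turns raw cross-reference counts into unique-function counts, which is the distinction the totals later on this page rely on.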

Coverage Summary

Category	Functions	Percentage
Mapped via `.c` file paths	2,129	32.8%
Mapped via `.h` file paths only	80	1.2%
Total mapped	2,209	34.1%
Unmapped in EDG region (`0x403300`--`0x7E0000`)	2,906	44.8%
C++ runtime / demangler (`0x7E0000`--`0x829722`)	1,085	16.7%
PLT stubs + init (`0x402A18`--`0x403300`)	283	4.4%
Total functions in binary	6,483	100%
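The percentages and the grand total in this table can be re-derived from the raw counts. The short Python check below uses only numbers from the table; the dictionary keys are shorthand labels chosen for this example.

```python
TOTAL_FUNCTIONS = 6483

# Function counts copied from the Coverage Summary table.
COUNTS = {
    "mapped_c": 2129,
    "mapped_h_only": 80,
    "unmapped_edg": 2906,
    "cxx_runtime": 1085,
    "plt_and_init": 283,
}

def percentage(count, total=TOTAL_FUNCTIONS):
    """Share of all functions, rounded to one decimal place like the table."""
    return round(100 * count / total, 1)

total_mapped = COUNTS["mapped_c"] + COUNTS["mapped_h_only"]
for name, count in COUNTS.items():
    print(f"{name:14s} {count:5d}  {percentage(count):4.1f}%")
print(f"{'total_mapped':14s} {total_mapped:5d}  {percentage(total_mapped):4.1f}%")
```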

The 2,906 unmapped functions in the EDG region include inlined header expansions (e.g., util.h vector/hash helpers, types.h type queries), small leaf functions below the assertion threshold, switch-table dispatch fragments, and functions from translation units compiled without assertions enabled (notably il_to_str.c display routines and parts of floating.c).

Binary Layout

The EDG .text region (0x403300--0x7E0000) has a three-part structure:

Assert stub region (0x403300--0x408B40): 235 small __noreturn functions, one per assertion site. Each encodes a source file path, line number, and function name, then calls sub_4F2930 (the internal error handler). These stubs are sorted by source file name -- the linker grouped them from all 52 .c files into one contiguous block. 200 stubs map to .c files; the remaining 35 are from .h files inlined into .c compilation units.
Constructor region (0x408B40--0x409350): 15 C++ static constructor functions (ctor_001 through ctor_015) that initialize global tables at program startup.
Main body region (0x409350--0x7DFFF0): The bulk of the compiler. Source files are laid out roughly in alphabetical order by filename, a consequence of the linker processing object files in directory-listing order. The ordering holds from attribute.c at 0x409350 and class_decl.c at 0x419280 through types.c at 0x7A4940; the last two files, modules.c at 0x7C0C60 and floating.c at 0x7D0EB0, are placed after types.c, out of alphabetical order.

Source File Address Table

The table below lists all 52 .c source files sorted by their main body start address. "Total Funcs" counts all functions referencing the file (stubs + main body). "Stubs" counts assert stubs in 0x403300--0x408B40. "Main Funcs" counts functions in the main body region.

#	Source File	Origin	Total Funcs	Stubs	Main Funcs	Main Body Start	Main Body End	Sweep
1	`attribute.c`	EDG	177	7	170	`0x409350`	`0x418F80`	P1.01
2	`class_decl.c`	EDG	273	9	264	`0x419280`	`0x447930`	P1.01--02
3	`cmd_line.c`	EDG	44	1	43	`0x44B250`	`0x459630`	P1.02--03
4	`const_ints.c`	EDG	4	1	3	`0x461C20`	`0x4659A0`	P1.03
5	`cp_gen_be.c`	EDG	226	25	201	`0x466F90`	`0x489000`	P1.03--04
6	`debug.c`	EDG	2	0	2	`0x48A1B0`	`0x48A1B0`	P1.04
7	`decl_inits.c`	EDG	196	4	192	`0x48B3F0`	`0x4A1540`	P1.04--05
8	`decl_spec.c`	EDG	88	3	85	`0x4A1BF0`	`0x4B37F0`	P1.05
9	`declarator.c`	EDG	64	0	64	`0x4B3970`	`0x4C00A0`	P1.05
10	`decls.c`	EDG	207	5	202	`0x4C0910`	`0x4E8C40`	P1.05--06
11	`disambig.c`	EDG	5	1	4	`0x4E9E70`	`0x4EC690`	P1.06
12	`error.c`	EDG	51	1	50	`0x4EDCD0`	`0x4F8F80`	P1.06
13	`expr.c`	EDG	538	10	528	`0x4F9870`	`0x5565E0`	P1.07--08
14	`exprutil.c`	EDG	299	13	286	`0x558720`	`0x583540`	P1.08--09
15	`extasm.c`	EDG	7	0	7	`0x584CA0`	`0x585850`	P1.09
16	`fe_init.c`	EDG	6	1	5	`0x585B10`	`0x5863A0`	P1.09
17	`fe_wrapup.c`	EDG	2	0	2	`0x588D40`	`0x588F90`	P1.09
18	`float_pt.c`	EDG	79	0	79	`0x589550`	`0x594150`	P1.09--10
19	`folding.c`	EDG	139	9	130	`0x594B30`	`0x5A4FD0`	P1.10
20	`func_def.c`	EDG	56	1	55	`0x5A51B0`	`0x5AAB80`	P1.10
21	`host_envir.c`	EDG	19	2	17	`0x5AD540`	`0x5B1E70`	P1.10
22	`il.c`	EDG	358	16	342	`0x5B28F0`	`0x5DFAD0`	P1.10--11d
23	`il_alloc.c`	EDG	38	1	37	`0x5E0600`	`0x5E8300`	P1.11a--11e
24	`il_to_str.c`	EDG	83	1	82	`0x5F7FD0`	`0x6039E0`	P1.11f--12
25	`il_walk.c`	EDG	27	1	26	`0x603FE0`	`0x620190`	P1.12
26	`interpret.c`	EDG	216	5	211	`0x620CE0`	`0x65DE10`	P1.12--13
27	`layout.c`	EDG	21	2	19	`0x65EA50`	`0x665A60`	P1.13
28	`lexical.c`	EDG	140	5	135	`0x666720`	`0x689130`	P1.13--14
29	`literals.c`	EDG	21	0	21	`0x68ACC0`	`0x68F2B0`	P1.14
30	`lookup.c`	EDG	71	2	69	`0x68FAB0`	`0x69BE80`	P1.14
31	`lower_name.c`	EDG	179	11	168	`0x69C980`	`0x6AB280`	P1.14--15
32	`macro.c`	EDG	43	1	42	`0x6AB6E0`	`0x6B5C10`	P1.15
33	`mem_manage.c`	EDG	9	2	7	`0x6B6DD0`	`0x6BA230`	P1.15
34	`nv_transforms.c`	NVIDIA	1	0	1	`0x6BE300`	`0x6BE300`	P1.15
35	`overload.c`	EDG	284	3	281	`0x6BE4A0`	`0x6EF7A0`	P1.15--16
36	`pch.c`	EDG	23	3	20	`0x6F2790`	`0x6F5DA0`	P1.16
37	`pragma.c`	EDG	28	0	28	`0x6F61B0`	`0x6F8320`	P1.16
38	`preproc.c`	EDG	10	0	10	`0x6F9B00`	`0x6FC940`	P1.16
39	`scope_stk.c`	EDG	186	6	180	`0x6FE160`	`0x7106B0`	P1.16--17
40	`src_seq.c`	EDG	57	1	56	`0x710F10`	`0x718720`	P1.17
41	`statements.c`	EDG	83	1	82	`0x719300`	`0x726A50`	P1.17
42	`symbol_ref.c`	EDG	42	2	40	`0x726F20`	`0x72CEA0`	P1.17
43	`symbol_tbl.c`	EDG	175	8	167	`0x72D950`	`0x74B8D0`	P1.17--18
44	`sys_predef.c`	EDG	35	1	34	`0x74C690`	`0x751470`	P1.18
45	`target.c`	EDG	11	0	11	`0x7525F0`	`0x752DF0`	P1.18
46	`templates.c`	EDG	455	12	443	`0x7530C0`	`0x794D30`	P1.18
47	`trans_copy.c`	EDG	2	0	2	`0x796BA0`	`0x796BA0`	P1.18
48	`trans_corresp.c`	EDG	88	6	82	`0x796E60`	`0x7A3420`	P1.18--19
49	`trans_unit.c`	EDG	10	0	10	`0x7A3BB0`	`0x7A4690`	P1.19
50	`types.c`	EDG	88	5	83	`0x7A4940`	`0x7C02A0`	P1.19
51	`modules.c`	EDG	22	3	19	`0x7C0C60`	`0x7C2560`	P1.19
52	`floating.c`	EDG	50	9	41	`0x7D0EB0`	`0x7D59B0`	P1.19

Totals: 5,338 cross-references across 52 .c files, resolving to 2,129 unique functions. With .h file references added, 2,209 unique functions are mapped.
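A table like this is most useful as an address-to-file lookup. The Python sketch below shows one way to do that with a binary search over a few rows copied from the table; treating "Main Body End" as an inclusive bound is an assumption, and the function name is illustrative.

```python
import bisect

# (main body start, main body end, file) for a few rows of the table above.
RANGES = sorted([
    (0x409350, 0x418F80, "attribute.c"),
    (0x419280, 0x447930, "class_decl.c"),
    (0x4F9870, 0x5565E0, "expr.c"),
    (0x6BE4A0, 0x6EF7A0, "overload.c"),
    (0x7530C0, 0x794D30, "templates.c"),
])
STARTS = [start for start, _, _ in RANGES]

def source_file_for(addr):
    """Return the file whose main-body range contains addr, else None."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i >= 0 and RANGES[i][0] <= addr <= RANGES[i][1]:
        return RANGES[i][2]
    return None
```

Addresses that fall between two confirmed ranges return `None`, which corresponds to the unmapped gaps discussed later on this page. Loading all 52 rows only requires extending `RANGES`.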

Largest Source Files by Function Count

Source File	Main Body Funcs	Approximate Code Size
`expr.c`	528	~373 KB (`0x4F9870`--`0x5565E0`)
`templates.c`	443	~282 KB (`0x7530C0`--`0x794D30`)
`il.c`	342	~185 KB (`0x5B28F0`--`0x5DFAD0`)
`exprutil.c`	286	~175 KB (`0x558720`--`0x583540`)
`overload.c`	281	~200 KB (`0x6BE4A0`--`0x6EF7A0`)
`class_decl.c`	264	~187 KB (`0x419280`--`0x447930`)
`interpret.c`	211	~241 KB (`0x620CE0`--`0x65DE10`)
`decls.c`	202	~165 KB (`0x4C0910`--`0x4E8C40`)
`cp_gen_be.c`	201	~141 KB (`0x466F90`--`0x489000`)
`decl_inits.c`	192	~91 KB (`0x48B3F0`--`0x4A1540`)

Header File Cross-References

Thirteen .h header files appear in assertion strings. These are headers that contain non-trivial inline functions or macros that expand to assertion-bearing code. When a function compiled from decls.c triggers an assertion whose __FILE__ is types.h, that assertion was inlined from types.h into the decls.c compilation unit.

#	Header File	Xrefs	Stubs	Main Funcs	Address Range	Inlined Into
1	`decls.h`	1	0	1	`0x4E08F0`	`decls.c`
2	`float_type.h`	63	0	63	`0x7D1C90`--`0x7DEB90`	`floating.c`
3	`il.h`	5	2	3	`0x52ABC0`--`0x6011F0`	`expr.c`, `il.c`, `il_to_str.c`
4	`lexical.h`	1	0	1	`0x68F2B0`	`lexical.c` / `literals.c` boundary
5	`mem_manage.h`	4	0	4	`0x4EDCD0`	`error.c`
6	`modules.h`	5	0	5	`0x7C1100`--`0x7C2560`	`modules.c`
7	`nv_transforms.h`	3	0	3	`0x432280`--`0x719D20`	`class_decl.c`, `cp_gen_be.c`, `src_seq.c`
8	`overload.h`	1	0	1	`0x6C9E40`	`overload.c`
9	`scope_stk.h`	4	0	4	`0x503D90`--`0x574DD0`	`expr.c`, `exprutil.c`
10	`symbol_tbl.h`	2	1	1	`0x7377D0`	`symbol_tbl.c`
11	`types.h`	17	4	13	`0x469260`--`0x7B05E0`	Many files (scattered type queries)
12	`util.h`	124	10	114	`0x430E10`--`0x7C2B10`	All major `.c` files
13	`walk_entry.h`	51	0	51	`0x604170`--`0x618660`	`il_walk.c`

Notable Header Patterns

util.h is the most widely included header, with 124 cross-references (114 in main body) spanning nearly the entire EDG .text region from 0x430E10 to 0x7C2B10. It provides generic container helpers (dynamic arrays, hash tables, sorted sets) used by every major subsystem. The compiler expanded these helpers in each compilation unit that includes the header, creating many small util.h-attributed functions scattered across the binary.

float_type.h is concentrated in a single 52 KB block at 0x7D1C90--0x7DEB90, immediately after floating.c. It contains 63 template instantiations for IEEE 754 floating-point type operations (comparison, conversion, arithmetic) for each target floating-point width. These templates were instantiated in the floating.c compilation unit.

walk_entry.h contributes 51 functions in the tight range 0x604170--0x618660, all within the il_walk.c region. These are the per-entry-kind callback dispatch functions generated by preprocessor macros in the IL walker header.

nv_transforms.h is NVIDIA-specific. Its 3 cross-references appear in class_decl.c (sub_432280 at 0x432280), cp_gen_be.c (sub_47ECC0 at 0x47ECC0), and src_seq.c (sub_719D20 at 0x719D20). These are the integration points where NVIDIA's CUDA transform hooks are called from standard EDG code paths -- class definition processing, backend code generation, and source sequence ordering.

NVIDIA-Specific Files

`nv_transforms.c`

The only NVIDIA-authored .c file in the EDG source tree. Despite having only 1 mapped function via __FILE__ (sub_6BE300 at 0x6BE300), the sweep analysis of the 0x6BAE70--0x6BE4A0 range identified approximately 40 functions compiled from this file. The discrepancy exists because nv_transforms.c uses NVIDIA's own assertion macros (not EDG's standard internal_error path), so most functions do not reference the EDG-style __FILE__ string.

Functions confirmed in the nv_transforms.c region:

Address	Identity	Purpose
`0x6BAE70`	`nv_init_transforms`	Zero all NVIDIA transform state at startup
`0x6BAF70`	`alloc_mem_block`	64 KB memory block allocator for NV region pools
`0x6BB290`	`reset_mem_state`	Emergency OOM recovery -- clear memory tracking
`0x6BB350`	`init_memory_regions`	Bootstrap region 0 and region 1 with initial blocks
`0x6BB790`	`emit_device_lambda_wrapper`	Generate `__nv_dl_wrapper_t<>` specialization
`0x6BCC20`	`emit_lambda_preamble`	Inject lambda wrapper preamble declarations
`0x6BD490`	`emit_host_device_lambda_wrapper`	Generate `__nv_hdl_wrapper_t<>` specialization
`0x6BE300`	(mapped function)	Single function with EDG-style `__FILE__` reference

Key infrastructure in this file:

__nv_dl_wrapper_t<> / __nv_hdl_wrapper_t<> struct template generation
Host reference array emission (.nvHRKE, .nvHRKI, .nvHRDE, .nvHRDI, .nvHRCE, .nvHRCI)
Capture count bitmask tables: unk_1286980 (device) and unk_1286900 (host-device), 128 bytes each
Lambda-to-closure entity mapping via hash table at qword_12868F0

`nv_transforms.h`

NVIDIA's hook header, #include-d from three EDG source files. It declares the functions that bridge standard EDG processing to NVIDIA's CUDA transform layer. The three inclusion sites represent the three points where EDG's standard C++ frontend cedes control to NVIDIA-specific logic:

class_decl.c (sub_432280 at 0x432280): Called during class definition processing to apply CUDA execution-space attributes to closure types and validate lambda capture constraints.
cp_gen_be.c (sub_47ECC0 at 0x47ECC0): Called during backend code generation to emit CUDA-specific output constructs (device stubs, host reference arrays, registration calls).
src_seq.c (sub_719D20 at 0x719D20): Called during source sequence processing to inject NVIDIA preamble declarations and wrapper type definitions into the correct position in the declaration order.

Unmapped Regions (Gap Analysis)

Several address ranges within the EDG .text region contain functions that could not be mapped to any source file via __FILE__ strings. The major gaps and their probable contents:

Gap Range	Size	Probable Content	Evidence
`0x408B40`--`0x409350`	~2 KB	Static constructors (`ctor_001`--`ctor_015`)	No source path; global table initializers
`0x447930`--`0x44B250`	~14 KB	`class_decl.c` / `cmd_line.c` boundary helpers	Between confirmed ranges
`0x459630`--`0x461C20`	~34 KB	`cmd_line.c` tail + `const_ints.c` preamble	Unmapped option handlers
`0x5E8300`--`0x5F7FD0`	~63 KB	IL display routines (`il_to_str.c` early body)	No assertions (display-only code)
`0x665A60`--`0x666720`	~3 KB	`layout.c` / `lexical.c` boundary	Small gap between confirmed ranges
`0x689130`--`0x68ACC0`	~7 KB	`lexical.c` tail + `literals.c` preamble	Token/literal conversion helpers
`0x6AB280`--`0x6AB6E0`	~1 KB	`lower_name.c` / `macro.c` boundary	Mangling helpers
`0x6BA230`--`0x6BAE70`	~3 KB	`mem_manage.c` / `nv_transforms.c` boundary	Memory infrastructure
`0x6EF7A0`--`0x6F2790`	~12 KB	`overload.c` / `pch.c` boundary	Overload resolution helpers
`0x6FC940`--`0x6FE160`	~6 KB	`preproc.c` / `scope_stk.c` boundary	Preprocessor tail
`0x751470`--`0x7525F0`	~4 KB	`sys_predef.c` / `target.c` boundary	Predefined macro infrastructure
`0x7A4690`--`0x7A4940`	~1 KB	`trans_unit.c` / `types.c` boundary	Translation unit helpers
`0x7C2560`--`0x7D0EB0`	~59 KB	Type-name mangling / encoding for output	Between `modules.c` and `floating.c`
`0x7D1C90`--`0x7DEB90`	~52 KB	`float_type.h` template instantiations	Confirmed via `.h` path strings
`0x7DFFF0`--`0x82A000`	~304 KB	C++ runtime, demangler, soft-float, EH	Statically-linked libstdc++/libgcc

The largest unmapped gap within EDG code is the IL display region at 0x5E8300--0x5F7FD0 (0xFCD0 bytes, about 63 KB). These functions were compiled from il_to_str.c but contain no assertions because the display/dump subsystem was built without assertion macros; it is purely diagnostic code that formats IL trees to stdout.

The float_type.h block at 0x7D1C90--0x7DEB90 (52 KB) is technically mapped via .h cross-references but has no .c file attribution because the template instantiations carry only the header's __FILE__ path.
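Gap ranges of this kind fall out mechanically from the confirmed ranges: each gap runs from one file's main body end to the next file's main body start. A minimal Python sketch, using three rows from the Source File Address Table (the function name and tuple layout are illustrative):

```python
def find_gaps(ranges, min_size=1):
    """Yield (gap_start, gap_end, size) between consecutive (start, end, name) ranges."""
    ordered = sorted(ranges)
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        size = next_start - prev_end
        if size >= min_size:
            yield prev_end, next_start, size

# Three adjacent files from the Source File Address Table.
ROWS = [
    (0x5E0600, 0x5E8300, "il_alloc.c"),
    (0x5F7FD0, 0x6039E0, "il_to_str.c"),
    (0x603FE0, 0x620190, "il_walk.c"),
]

for start, end, size in find_gaps(ROWS):
    print(f"{start:#x}--{end:#x}  {size:,d} bytes")
```

Raising `min_size` filters out the small inter-file boundaries and leaves only the gaps large enough to hide whole subsystems.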

Alphabetical Ordering Observation

The files are laid out in the binary in rough alphabetical order, consistent with a build system that compiles object files in directory-listing order and a linker that processes them sequentially:

0x409350  attribute.c      (a)
0x419280  class_decl.c     (c)
0x44B250  cmd_line.c       (c)
0x461C20  const_ints.c     (c)
0x466F90  cp_gen_be.c      (c)
0x48A1B0  debug.c          (d)
0x48B3F0  decl_inits.c     (d)
0x4A1BF0  decl_spec.c      (d)
0x4B3970  declarator.c     (d)
0x4C0910  decls.c          (d)
0x4E9E70  disambig.c       (d)
0x4EDCD0  error.c          (e)
0x4F9870  expr.c           (e)
0x558720  exprutil.c       (e)
0x584CA0  extasm.c         (e)
0x585B10  fe_init.c        (f)
0x588D40  fe_wrapup.c      (f)
0x589550  float_pt.c       (f)
0x594B30  folding.c        (f)
0x5A51B0  func_def.c       (f)
0x5AD540  host_envir.c     (h)
0x5B28F0  il.c             (i)
0x5E0600  il_alloc.c       (i)
0x5F7FD0  il_to_str.c      (i)
0x603FE0  il_walk.c        (i)
0x620CE0  interpret.c      (i)
0x65EA50  layout.c         (l)
0x666720  lexical.c        (l)
0x68ACC0  literals.c       (l)
0x68FAB0  lookup.c         (l)
0x69C980  lower_name.c     (l)
0x6AB6E0  macro.c          (m)
0x6B6DD0  mem_manage.c     (m)
0x6BAE70  nv_transforms.c  (n)  [region start; mapped func at 0x6BE300]
0x6BE4A0  overload.c       (o)
0x6F2790  pch.c            (p)
0x6F61B0  pragma.c         (p)
0x6F9B00  preproc.c        (p)
0x6FE160  scope_stk.c      (s)
0x710F10  src_seq.c        (s)
0x719300  statements.c     (s)
0x726F20  symbol_ref.c     (s)
0x72D950  symbol_tbl.c     (s)
0x74C690  sys_predef.c     (s)
0x7525F0  target.c         (t)
0x7530C0  templates.c      (t)
0x796BA0  trans_copy.c     (t)
0x796E60  trans_corresp.c  (t)
0x7A3BB0  trans_unit.c     (t)
0x7A4940  types.c          (t)
0x7C0C60  modules.c        (m)  [breaks alphabetical order]
0x7D0EB0  floating.c       (f)  [breaks alphabetical order]

Two files break the alphabetical pattern: modules.c at 0x7C0C60 and floating.c at 0x7D0EB0. Both appear after types.c instead of in their expected positions (between mem_manage.c and nv_transforms.c for modules.c, between float_pt.c and folding.c for floating.c). This suggests these two files are compiled as separate translation units outside the main EDG source directory, or are added to the link line after the alphabetically-sorted EDG objects.
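The out-of-order files can be found automatically by walking the list in address order and flagging any name that sorts before the highest name already seen. The Python sketch below applies this to the tail of the listing; the function name is hypothetical and plain byte-wise string comparison is assumed to be an adequate stand-in for directory-listing order.

```python
def out_of_order(files):
    """Return names that sort before some file already seen at a lower address."""
    breaks, highest = [], ""
    for name in files:
        if name < highest:
            breaks.append(name)
        else:
            highest = name
    return breaks

# Tail of the address-ordered listing above.
TAIL = [
    "target.c", "templates.c", "trans_copy.c", "trans_corresp.c",
    "trans_unit.c", "types.c", "modules.c", "floating.c",
]
print(out_of_order(TAIL))
```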

Data Source

All mappings were extracted from the binary's .rodata string table. The extraction command:

jq '[.[] | select(.value | test("/dvs/p4/.*\\.c$")) |
  {file: (.value | split("/") | last),
   xrefs: [.xrefs[].func] | length}
] | sort_by(.file)' cudafe++_strings.json

The full build path for every source file is:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/<filename>

Address ranges were verified against the 20 sweep reports (P1.01 through P1.20) produced during the binary analysis phase.

Binary Layout

cudafe++ ships as a single statically-linked, stripped ELF 64-bit x86-64 executable. Static linking pulls in the entirety of libstdc++ (locale facets, iostream, exception handling), Berkeley SoftFloat 3e (half/quad-precision arithmetic), and glibc CRT startup code. The resulting 8.5 MB binary has no external shared library dependencies -- it runs identically on any Linux x86-64 host regardless of installed C++ runtime version.

This page documents the complete segment and section layout, the internal organization of each major section, and the key data structures located within each region. All addresses are virtual addresses from the ELF load image.

ELF Header

Property	Value
Format	ELF 64-bit LSB executable
Architecture	x86-64 (AMD64)
Linking	Statically linked
Stripped	Yes (no debug symbols, no .symtab)
File size	8,910,936 bytes (8.5 MB)
Entry point	`0x40918C` (`_start`, glibc CRT)
Main	`0x408950`

Complete Section Table

Section	Start	End	Size (bytes)	Size (human)	Permissions	Purpose
LOAD (ELF hdr)	`0x400000`	`0x402A18`	10,776	10.5 KB	r-x	ELF headers and program header table
`.init`	`0x402A18`	`0x402A30`	24	24 B	r-x	Initialization stub (calls `init_proc`)
`.plt`	`0x402A30`	`0x403300`	2,256	2.2 KB	r-x	Procedure Linkage Table (141 entries)
`.text`	`0x403300`	`0x829722`	4,351,010	4.15 MB	r-x	All executable code
`.fini`	`0x829724`	`0x829732`	14	14 B	r-x	Finalization stub (empty body)
`.rodata`	`0x829740`	`0xAA3FA3`	2,599,011	2.48 MB	r--	Read-only data
`.eh_frame_hdr`	`0xAA3FA4`	`0xAB0350`	50,092	48.9 KB	r--	Exception frame header index
`.eh_frame`	`0xCB1210`	`0xD3F398`	582,024	568.4 KB	rw-	Exception unwind tables (CFI)
`.gcc_except_table`	`0xD3F398`	`0xD42854`	13,500	13.2 KB	rw-	GCC LSDA exception handler tables
`.ctors`	`0xD42858`	`0xD428B0`	88	88 B	rw-	Constructor table (9 function pointers + 2 sentinels)
`.dtors`	`0xD428B0`	`0xD428C0`	16	16 B	rw-	Destructor table (2 sentinels, empty)
`.data.rel.ro`	`0xD428C0`	`0xD45E00`	13,632	13.3 KB	rw-	Vtables and relocation-read-only data
`.got`	`0xD45FC0`	`0xD45FF8`	56	56 B	rw-	Global Offset Table
`.got.plt`	`0xD46000`	`0xD46478`	1,144	1.1 KB	rw-	GOT for PLT entries
`.data`	`0xD46480`	`0xE7EFF0`	1,280,880	1.22 MB	rw-	Initialized globals
`.bss`	`0xE7F000`	`0x12D6F20`	4,554,528	4.34 MB	rw-	Zero-initialized globals
`.tls`	`0x12D6F20`	`0x12D6F38`	24	24 B	---	Thread-local storage (exception state)
`extern`	`0x12D6F38`	`0x12D73A8`	1,136	1.1 KB	---	External symbol stubs

Total virtual address span: 0x12D73A8 - 0x400000 = 0xED73A8 bytes (14.8 MB). The highest mapped address, 0x12D73A8, sits 18.8 MB above address zero.
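The Size column can be re-derived from the Start and End columns. The following is a minimal Python sketch using a handful of rows copied from the table; the helper names are illustrative and are not taken from the extraction script.

```python
# Rows copied from the section table above: (name, start, end).
# End addresses are treated as exclusive, which is what the Size column implies.
SECTIONS = [
    (".plt",    0x402A30, 0x403300),
    (".text",   0x403300, 0x829722),
    (".rodata", 0x829740, 0xAA3FA3),
    (".ctors",  0xD42858, 0xD428B0),
    (".data",   0xD46480, 0xE7EFF0),
    (".bss",    0xE7F000, 0x12D6F20),
]

IMAGE_BASE = 0x400000   # start of the first LOAD row
IMAGE_END = 0x12D73A8   # end of the `extern` row


def section_sizes(sections):
    """Return {name: size_in_bytes}, computed as end - start."""
    return {name: end - start for name, start, end in sections}


def image_span(base=IMAGE_BASE, end=IMAGE_END):
    """Distance in bytes from the lowest to the highest mapped address."""
    return end - base


sizes = section_sizes(SECTIONS)
for name, size in sizes.items():
    print(f"{name:8s} {size:>10,d} bytes")
print(f"span     {image_span():>10,d} bytes")
```

Running the same subtraction over every row is a cheap way to catch transcription errors when the table is updated for a new toolkit release.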

.text -- Executable Code (4.15 MB)

The .text section contains all 6,483 functions in the binary. It divides into four distinct regions, laid out contiguously by the linker:

0x403300                                                          0x829722
|-- assert stubs --|-- ctors --|---- EDG main body ----|-- C++ runtime ----|
0x403300    0x408B40  0x409350                  0x7DF400          0x829722
   22 KB      2 KB              3.84 MB                  304 KB

Assert Stub Region (0x403300 -- 0x408B40, 22 KB)

Contains 235 small __noreturn functions, each encoding a single assertion site. Every stub loads three arguments -- the source file path string, the line number (an integer immediate), and the function name string -- then calls sub_4F2930 (the internal_error handler in error.c). These stubs are called from the bodies of larger functions when an impossible condition is detected.

The linker groups all stubs from all 52 .c source files into this contiguous block, sorted approximately by source file name. Of the 235 stubs:

200 map to .c source files (e.g., attribute.c:10897 at 0x403300, cp_gen_be.c:22342 at 0x4036F6)
35 map to .h header files inlined into .c compilation units (e.g., types.h at 0x40345C)

Each stub is exactly 29 bytes: a lea for the file path, a mov for the line number, a lea for the function name, then a call to sub_4F2930.

Constructor Region (0x408B40 -- 0x409350, 2 KB)

Contains 9 C++ global constructor functions (ctor_001 through ctor_009) registered in the .ctors table. These run before main() via __libc_start_main's init callback at 0x829640. The constructors, in execution order:

Constructor	Address	Identity	What It Initializes
`ctor_001`	`0x408B40`	EDG diagnostic list	Doubly-linked list at `E7FE40..E7FE68` (self-referencing empty sentinel)
`ctor_002`	`0x408B90`	Stream state table	13 qwords at `126ED80..126EDE0` (output channel array including `126EDF0` = stderr `FILE*`)
`ctor_003`	`0x408C20`	EDG internal caches	`ios_base::Init` + 7 doubly-linked lists at `12C6A40`, `12868C0..1286780` (symbol/type caches)
`ctor_004`	`0x408E50`	Emergency exception pool	72,704-byte malloc pool at `12D4870`, free-list at `12D4868`, with pthread mutex
`ctor_005`	`0x408ED0`	Locale once-flags (set 1)	8 flags at `12D6A68..12D6AA0`
`ctor_006`	`0x408F50`	Locale once-flags (set 2)	8 flags at `12D6AF0..12D6B28`
`ctor_007`	`0x408FD0`	Locale once-flags (set 3)	12 flags at `12D6D28..12D6D80`
`ctor_008`	`0x409090`	Locale once-flags (set 4)	12 flags at `12D6DE8..12D6E40`
`ctor_009`	`0x409150`	Stream buffer destructors	`__cxa_atexit` for `basic_streambuf<char>` and `basic_streambuf<wchar_t>`

Constructors 4--9 belong to statically-linked libstdc++. Only constructors 1--3 initialize EDG/NVIDIA state.

EDG Main Body (0x409350 -- 0x7DF400, 3.84 MB)

The core of the compiler. Contains 5,115 functions compiled from 52 .c source files: 49 from EDG plus 3 NVIDIA-specific files. Functions are laid out in approximate alphabetical order by source file name -- the linker processed object files in directory-listing order:

0x409350   attribute.c     (170 functions)
0x419280   class_decl.c    (264 functions)
0x44B250   cmd_line.c      (43 functions)
0x461C20   const_ints.c    (3 functions)
0x466F90   cp_gen_be.c     (201 functions)
  ...
0x6BE300   nv_transforms.c (1 mapped function, NVIDIA)
0x6BE4A0   overload.c      (281 functions)
  ...
0x7A4940   types.c
0x7C0C60   modules.c
0x7D0EB0   floating.c
  ~0x7DF400  end of EDG code

The 52 source files break down by subsystem:

Subsystem	Files	Functions	Description
Parser	15 `.c`	~800	Lexer, expression/declaration parser, statements
Type system	6 `.c`	~350	Type representation, checking, conversion
Templates	5 `.c`	~300	Parsing, instantiation, deduction
IL subsystem	8 `.c`	~250	Node types, allocation, walking, display, comparison
Infrastructure	12 `.c`	~400	Memory, errors, name mangling, scope management
Code generation	3 `.c`	~150	Backend `.int.c` emission
NVIDIA additions	3 `.c`	~110	CUDA transforms, attribute validation, lambda wrappers

See Function Map for the complete address-to-source-file table.

C++ Runtime Region (0x7DF400 -- 0x829722, 304 KB)

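Statically-linked library code with no EDG source attribution. Contains approximately 900 functions from three sources: Berkeley SoftFloat, libstdc++/libsupc++, and the glibc CRT startup code. The CUDA-aware name demangler is also described below, although it physically sits in the EDG tail region: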

Berkeley SoftFloat 3e (0x7E0D30 -- 0x7E4150, ~80 functions). IEEE 754 arithmetic for half-precision (float16), extended precision (float80), and quad-precision (float128). Operations: add, sub, mul, div, sqrt, comparisons, int/float conversions. Global state at 12D4820 (exception flags) and 12D4821 (rounding mode). Used by the EDG floating.c subsystem for constant folding of non-native float types.

libstdc++ / libsupc++ (0x7E42E0 -- 0x829600, ~800 functions). The C++ runtime:

operator new/operator delete with new-handler retry loop (0x7E42E0)
Exception handling: __cxa_throw (0x823050), __cxa_begin_catch (0x822EB0), __cxa_allocate_exception (0x7E4750), std::terminate (0x8231A0)
Emergency exception pool: 72,704-byte fallback allocator for OOM during exception handling (0x7E45C0)
iostream initialization: ios_base::Init constructor/destructor (0x7E5650/0x7E5F20) setting up cout/cin/cerr + wide variants
Full locale system: 600+ functions implementing ctype, num_get, num_put, numpunct, collate, time_get/put, money_get/put, moneypunct, messages, and codecvt facets for both char and wchar_t

CUDA-aware name demangler (at 0x7CABB0, technically in the EDG tail region). NVIDIA's custom Itanium ABI demangler with extensions for CUDA lambda wrapper templates. Recognizes mangled prefixes: "Unvdl" for __nv_dl_wrapper_t<>, "Unvdtl" for __nv_dl_wrapper_t<> with trailing return, and "Unvhdl" for __nv_hdl_wrapper_t<>.

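CRT startup (0x40918C and 0x829640 -- 0x829722). _start at 0x40918C calls __libc_start_main(main@0x408950, init@0x829640, fini@0x8296D0). The table walker at 0x8296E0 iterates backwards through function pointers starting at off_D428A0. That address is the last pointer slot of the .ctors table (0xD42858 + 8-byte sentinel + 8 x 8 bytes), which matches the behaviour of GCC's __do_global_ctors_aux rather than a .fini_array processor.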

.rodata -- Read-Only Data (2.48 MB)

The .rodata section at 0x829740 -- 0xAA3FA3 holds all constant data: string literals, jump tables, error message templates, IL metadata tables, and format strings. Major structures:

Error Message Table (off_88FAA0)

The EDG diagnostic system's message template table. An array of 3,795 const char* pointers, indexed by error code 0--3794:

off_88FAA0[0]    = ""                           // error 0: unused
off_88FAA0[1]    = "last line of file ends ..."  // error 1
  ...
off_88FAA0[3794] = "..."                         // error 3794

Each pointer references a NUL-terminated format string elsewhere in .rodata containing % fill-in specifiers (%t = type, %s = string, %n = name, %sq = quoted string, %p = position, %d = decimal). Error codes above 3456 are CUDA-specific (338 entries covering execution space violations, lambda restrictions, architecture feature gates). See Diagnostic Overview.
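As an illustration of how the table could be consumed, the sketch below classifies an error code as CUDA-specific and lists the fill-in specifiers found in a template. It covers only the six specifiers named above. The sample template is invented for the example and is not a string from the binary, and the function names are hypothetical.

```python
import re

LAST_CODE = 3794         # highest index in off_88FAA0
CUDA_FIRST_CODE = 3457   # "codes above 3456 are CUDA-specific"

# "sq" is listed before the single letters so "%sq" is not read as "%s" + "q".
FILL_IN = re.compile(r"%(sq|[tsnpd])")

FILL_IN_KIND = {
    "t": "type",
    "s": "string",
    "n": "name",
    "sq": "quoted string",
    "p": "position",
    "d": "decimal",
}


def is_cuda_specific(code: int) -> bool:
    """True for codes in the CUDA-specific tail of the message table."""
    return CUDA_FIRST_CODE <= code <= LAST_CODE


def fill_ins(template: str):
    """Return the fill-in kinds in the order they appear in the template."""
    return [FILL_IN_KIND[m.group(1)] for m in FILL_IN.finditer(template)]


# Hypothetical template, used only to exercise the tokenizer.
sample = "function %sq declared at %p has type %t"
print(fill_ins(sample))
print(sum(is_cuda_specific(c) for c in range(LAST_CODE + 1)))
```

The count printed on the last line should agree with the 338 CUDA-specific entries stated above.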

IL Entry Kind Name Table (off_E6DD80)

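Maps the 85 entry_kind enum values (0--84) to human-readable strings. The pointer array itself sits at 0xE6DD80, which lies inside .data (0xD46480 -- 0xE7EFF0); only the name strings it points to are in .rodata. Used by the IL display subsystem (il_to_str.c) for debug output: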

off_E6DD80[0]  = "scope"
off_E6DD80[6]  = "type"
off_E6DD80[11] = "routine"
off_E6DD80[23] = "variable"
  ...
off_E6DD80[84] = "last"       // sentinel

The il_one_time_init function (sub_5CF7F0) validates at startup that this table ends with the "last" sentinel, catching version mismatches between the table and the enum.
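The sentinel check can be pictured as follows. This is a sketch of the idea, not a transcription of sub_5CF7F0: only the five names shown above are real, and every other slot is filled with a placeholder because the remaining names are elided on this page.

```python
ENTRY_KIND_COUNT = 85   # enum values 0..84
LAST_KIND = 84          # index of the "last" sentinel


def build_kind_names():
    """Build a stand-in for off_E6DD80 with placeholders for elided entries."""
    names = [f"<kind {i}>" for i in range(ENTRY_KIND_COUNT)]
    names[0] = "scope"
    names[6] = "type"
    names[11] = "routine"
    names[23] = "variable"
    names[LAST_KIND] = "last"
    return names


def check_kind_table(names, last_kind=LAST_KIND):
    """Raise if the table and the enum disagree about where the sentinel is."""
    if len(names) <= last_kind or names[last_kind] != "last":
        raise RuntimeError("entry-kind name table is out of sync with the enum")


table = build_kind_names()
check_kind_table(table)   # passes: the sentinel is where the enum expects it
```

If an enum value is added without a matching table entry, the sentinel lands at the wrong index and the check fails at startup instead of producing mislabeled debug output later.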

EDG Source File Path Strings

Approximately 65 string literals of the form /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/<file>.<ext>. These are __FILE__ expansions embedded in assertion macros. Each is referenced by the corresponding assert stub in the 0x403300 region.

Jump Tables

Switch-statement jump tables for the major dispatch functions. The largest are:

Expression parser dispatch (~120 case targets)
Declaration specifier dispatch (~80 case targets)
IL walker entry-kind dispatch (~85 case targets)
Backend code generation dispatch (~90 case targets)

Format Strings

Printf-style format strings for the .int.c backend emitter. These include CUDA runtime boilerplate templates ("#include \"crt/host_runtime.h\"", "static __nv_managed_rt ...", "void __device_stub__...") and IL display format strings.

.data -- Initialized Globals (1.22 MB)

The .data section at 0xD46480 -- 0xE7EFF0 holds all initialized global variables. Major structures, ordered by address:

Attribute Descriptor Table (off_D46820)

The master attribute dispatch table, starting at 0xD46820 and extending to approximately 0xD47A60. Each entry is 32 bytes and describes one EDG/CUDA attribute kind: kind code (1 byte), flags (1 byte), name string pointer, validation function pointer, and application function pointer. See Attribute System Overview.
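A 32-byte entry with those five fields could be decoded as shown below. The field order and the six padding bytes after the two one-byte fields are assumptions chosen so the three pointers are 8-byte aligned; the exact offsets are not stated on this page, and the sample pointer values are made up.

```python
import struct
from typing import NamedTuple

# Assumed layout: kind (u8), flags (u8), 6 pad bytes, then three 64-bit pointers.
ENTRY = struct.Struct("<BB6xQQQ")

TABLE_START = 0xD46820
TABLE_END_APPROX = 0xD47A60   # "approximately", per the text above


class AttrEntry(NamedTuple):
    kind: int
    flags: int
    name_ptr: int
    validate_fn: int
    apply_fn: int


def parse_table(blob: bytes):
    """Decode consecutive 32-byte entries; len(blob) must be a multiple of 32."""
    return [AttrEntry(*fields) for fields in ENTRY.iter_unpack(blob)]


# Two hypothetical entries, packed and decoded again to exercise the layout.
blob = ENTRY.pack(1, 0x10, 0x8A0000, 0x40A000, 0x40B000)
blob += ENTRY.pack(2, 0x00, 0x8A0010, 0x40A100, 0)
for entry in parse_table(blob):
    print(entry)

print((TABLE_END_APPROX - TABLE_START) // ENTRY.size, "entries (approximate)")
```

With the approximate end address given above, the table holds roughly 146 entries.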

Diagnostic Fill-in Tables (off_D481E0)

Named-label fill-in descriptors for the diagnostic system. Maps fill-in label strings to format specifier dispatch codes. Located at 0xD481E0.

Keyword Tables

The EDG keyword registration system stores keyword-to-token-ID mappings. Initialized during fe_translation_unit_init (sub_5863A0) with 200+ C/C++ keywords (from auto through co_yield), 60+ type trait intrinsics (__is_class, __has_trivial_copy, etc.), and CUDA extension keywords (__device__, __global__, __shared__, __constant__, __managed__, __launch_bounds__, __grid_constant__).

Error Severity Override Table

Maps error codes to their overridden severity levels. Populated by --diag_suppress, --diag_warning, --diag_error CLI flags.

libstdc++ Vtables (0xD428C0 -- 0xD45E00, in .data.rel.ro)

The .data.rel.ro section holds vtables for all statically-linked C++ classes. Key vtables:

Address	Class
`off_D42C00`	`__gnu_cxx::__concurrence_lock_error`
`off_D42C28`	`__gnu_cxx::__concurrence_unlock_error`
`off_D42CD8`	`std::bad_alloc`
`off_D45740`	`std::basic_istream<char>`
`off_D457C0`	`std::basic_istream<wchar_t>`
`off_D45860`	`std::basic_ostream<char>`
`off_D458E0`	`std::basic_ostream<wchar_t>`
`off_D45A28`	`std::basic_streambuf<char>`
`off_D45A78`	`std::basic_streambuf<wchar_t>`

Exception Handler Pointers (0xE7EExx)

Located at the tail of .data:

Address	Type	Identity
`off_E7EEB0`	`qword`	atexit target: `basic_streambuf<wchar_t>` object
`off_E7EEB8`	`qword`	atexit target: `basic_streambuf<char>` object
`off_E7EEC0`	`qword`	`std::unexpected_handler` pointer
`off_E7EEC8`	`qword`	`std::terminate_handler` pointer

EDG Diagnostic List Head (0xE7FE40)

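A 40-byte doubly-linked list structure at 0xE7FE40..0xE7FE68. Its address lies just past the end of .data (0xE7EFF0), so it physically resides in .bss. It starts out zeroed, and ctor_001 then initializes it as an empty self-referencing sentinel (both forward and backward pointers point to the list head). Used to chain diagnostic records during compilation.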

.bss -- Zero-Initialized Globals (4.34 MB)

The .bss section at 0xE7F000 -- 0x12D6F20 is the largest section by virtual size. It contains all zero-initialized global state for both the EDG compiler and the statically-linked runtime. The .bss occupies no space in the ELF file on disk -- it is allocated and zeroed by the OS loader.

The 4.34 MB .bss divides into two logical regions, with the separate 24-byte .tls block following directly after it:

EDG Compiler State (0xE7F000 -- 0x1290000, ~4.1 MB)

The bulk of .bss holds the EDG frontend's global state. Major structures:

Scope stack and symbol tables (~1.5 MB). The EDG scope stack (scope_stk.c) maintains nested scope contexts during parsing. Each scope entry is 784 bytes. The scope stack globals, various hash tables for name lookup, and the associated symbol table arrays consume the largest contiguous blocks.

IL region tracking (~800 KB). Region indices, region-to-scope mappings (qword_126EB90), region memory tables (qword_126EC88), and IL entry list heads. The region counter at dword_126EC80 tracks active regions. Each function definition creates a new region.

Translation unit state (~400 KB). The TU descriptor itself is dynamically allocated (424 bytes), but the per-TU global variables -- source file table, include stack, macro state, conditional compilation depth -- live in .bss. sub_7A4860 (reset_tu_state) zeroes these between compilations.

Parser state (~600 KB). Token lookahead buffers, declaration nesting depth, template argument stacks, expression evaluation context. The lexer maintains character classification tables and identifier hash buckets.

Error and diagnostic state (~200 KB). Error count (qword_126ED90), warning count (qword_126ED98), error limit (qword_126ED60), diagnostic suppression bitmaps, and the stream state table at 126ED80..126EDE0 (13 qwords including the stderr FILE* at qword_126EDF0).

Configuration flags (~100 KB). The 0x106xxxx region contains hundreds of dword flags set by CLI parsing and used throughout compilation. Examples:

Address	Type	Identity
`dword_106B640`	int	Keep-in-IL guard flag
`dword_106B4B0`	int	Catastrophic error re-entry guard
`dword_106B4BC`	int	Warnings-as-errors recursion guard
`dword_106B9E8`	int	TU stack depth
`dword_106BA08`	int	TU-copy mode flag
`dword_106BBB8`	int	Output format (0=text, 1=SARIF)
`dword_106BCD4`	int	Predefined macro file mode
`dword_106C088`	int	Warnings-are-errors mode
`dword_106C188`	int	`wchar_t` keyword enabled
`dword_106C254`	int	Skip backend (errors present)
`dword_106C2C0`	int	GPU mode flag
`dword_1065928`	int	Internal error re-entry guard

Lambda capture bitmasks (~256 bytes). Two 1024-bit bitmasks recording which lambda capture counts were used during parsing:

Address	Size	Identity
`unk_1286900`	128 bytes	Host-device lambda capture counts
`unk_1286980`	128 bytes	Device lambda capture counts

Bit N set means a lambda with N captures was encountered, triggering emission of the corresponding __nv_dl_wrapper_t or __nv_hdl_wrapper_t specialization in the backend.
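The bitmask bookkeeping can be sketched as below. Bit N is assumed to live at byte N >> 3, bit N & 7, which on little-endian x86-64 is the same bit an array of 64-bit words would use; the actual indexing in the binary is not shown on this page, and the function names are hypothetical.

```python
MASK_BYTES = 128              # 1024 bits, as in unk_1286900 / unk_1286980
MASK_BITS = MASK_BYTES * 8


def new_mask() -> bytearray:
    """A zeroed mask, matching the initial .bss state."""
    return bytearray(MASK_BYTES)


def mark_capture_count(mask: bytearray, n: int) -> None:
    """Record that a lambda with n captures was seen."""
    if not 0 <= n < MASK_BITS:
        raise ValueError(f"capture count {n} outside 0..{MASK_BITS - 1}")
    mask[n >> 3] |= 1 << (n & 7)


def used_capture_counts(mask: bytearray):
    """Capture counts that need a wrapper specialization, in ascending order."""
    return [n for n in range(len(mask) * 8) if mask[n >> 3] & (1 << (n & 7))]


device_mask = new_mask()
for count in (0, 3, 3, 17):   # duplicates are harmless: the bit is already set
    mark_capture_count(device_mask, count)
print(used_capture_counts(device_mask))
```

A backend pass would iterate the set bits once and emit one specialization per distinct capture count, regardless of how many lambdas shared that count.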

IL walker callbacks (5 function pointers at qword_126FB68..126FB88). The five IL tree-walk callback slots: entry filter, entry replace, pre-walk check, string callback, and entry callback. Swapped in and out by different IL traversal passes.

libstdc++ Runtime State (0x1290000 -- 0x12D6F20, ~280 KB)

SoftFloat globals (16 bytes). Exception flags at byte_12D4820, rounding mode at byte_12D4821.

Emergency exception pool (24 bytes of metadata). Free-list head (qword_12D4868), base address (qword_12D4870), capacity (qword_12D4878 = 72,704 bytes). The pool itself is heap-allocated at startup by ctor_004.

Locale system (~2 KB). The "C" locale singleton (unk_12D5E60), global locale impl pointer (qword_12D5E70), classic locale impl pointer (qword_12D5E78), character classification tables (12D5BE0..12D5D50), locale ID counter (dword_12D5E58), and pthread_once control variables.

iostream objects (~2 KB). The six standard stream objects and their backing file buffers:

Address	Identity
`0x12D6000`	`std::cerr`
`0x12D6060`	`std::cin`
`0x12D60C0`	`std::cout`
`0x12D5EE0`	`std::wcerr`
`0x12D5F40`	`std::wcin`
`0x12D5FA0`	`std::wcout`

Each stream object is backed by a basic_filebuf at a known offset (e.g., cout's filebuf at 0x12D67E0).

Demangler caches (40 bytes). Template argument cache at qword_12C7B40/12C7B48/12C7B50 (capacity/count/buffer pointer, grows by 500 entries via realloc). Block-scope suppress flag at dword_12C6A24.

EDG internal lists (7 x 48 bytes). Seven doubly-linked list structures at 12868C0..1286780, initialized by ctor_003. They serve as symbol/scope/type caches with destructor sub_6BD820. Note that these addresses lie below 0x1290000, inside the EDG compiler state region.

Thread-Local Storage (0x12D6F20 -- 0x12D6F38, 24 bytes)

The .tls section holds exactly 24 bytes of thread-local data. This is the __cxa_eh_globals structure (accessed via __readfsqword(0) - 16):

struct __cxa_eh_globals {
    void     *caught_exception_stack;   // +0x00: linked list of caught exceptions
    uint32_t  uncaughtExceptions;       // +0x08: count of in-flight exceptions
};

Despite cudafe++ being single-threaded, the TLS infrastructure exists because libstdc++ exception handling unconditionally uses TLS offsets compiled into the static library.

.ctors / .dtors -- Constructor/Destructor Tables

The .ctors section at 0xD42858 is 88 bytes: a -1 sentinel (8 bytes), 9 constructor function pointers (72 bytes), and a 0 terminator (8 bytes). The 9 constructors are ctor_001 through ctor_009 documented above.

The .dtors section at 0xD428B0 is 16 bytes: a -1 sentinel and a 0 terminator. No destructors are registered -- all cleanup is done via __cxa_atexit handlers registered during construction.
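The table framing described above (a -1 sentinel, the pointers, a 0 terminator) can be checked with a small round-trip. The constructor addresses come from the constructor table earlier on this page; storing them in ctor_001..ctor_009 order is an assumption made for the example, since the in-memory order is not stated here.

```python
import struct

# Addresses of ctor_001..ctor_009 as listed in the constructor table.
CTOR_ADDRESSES = [
    0x408B40, 0x408B90, 0x408C20, 0x408E50, 0x408ED0,
    0x408F50, 0x408FD0, 0x409090, 0x409150,
]

SENTINEL = 0xFFFFFFFFFFFFFFFF   # -1 as an unsigned 64-bit value


def build_table(pointers):
    """Serialize: -1 sentinel, little-endian 64-bit pointers, 0 terminator."""
    body = b"".join(struct.pack("<Q", p) for p in pointers)
    return struct.pack("<Q", SENTINEL) + body + struct.pack("<Q", 0)


def parse_table(blob: bytes):
    """Validate the framing words and return the function pointers."""
    if len(blob) < 16 or len(blob) % 8:
        raise ValueError("table must hold at least two 8-byte words")
    words = struct.unpack(f"<{len(blob) // 8}Q", blob)
    if words[0] != SENTINEL or words[-1] != 0:
        raise ValueError("missing -1 sentinel or 0 terminator")
    return list(words[1:-1])


ctors = build_table(CTOR_ADDRESSES)
dtors = build_table([])
print(len(ctors), len(dtors))
```

The two lengths printed correspond to the 88-byte .ctors and 16-byte .dtors sections.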

.eh_frame / .gcc_except_table -- Exception Handling

The .eh_frame section (582 KB) contains DWARF Call Frame Information (CFI) records for stack unwinding during C++ exception propagation. The .gcc_except_table section (13.2 KB) contains GCC Language-Specific Data Area (LSDA) records that map program counters to catch handlers and cleanup functions.

The .eh_frame_hdr section (48.9 KB) is a binary search index into .eh_frame, enabling O(log n) lookup of unwind information by instruction pointer during exception throw.

These sections exist because libstdc++ exception handling requires them. cudafe++ itself rarely throws exceptions -- the EDG frontend uses longjmp-based error recovery. However, the statically-linked libstdc++ code (particularly operator new and locale initialization) uses C++ exceptions internally.
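The O(log n) lookup that .eh_frame_hdr enables amounts to a binary search over entries sorted by function start address. The sketch below shows only that search step. The function start addresses are taken from this page, while the FDE addresses paired with them are hypothetical placeholders inside the .eh_frame range.

```python
from bisect import bisect_right

# (function_start, fde_address), sorted by function_start.
# FDE addresses are invented for illustration.
SEARCH_TABLE = [
    (0x7E42E0, 0xCB1300),   # operator new
    (0x7E4750, 0xCB1340),   # __cxa_allocate_exception
    (0x822EB0, 0xCB1380),   # __cxa_begin_catch
    (0x823050, 0xCB13C0),   # __cxa_throw
]


def find_fde(table, pc):
    """Return the FDE address of the last entry whose start is <= pc, or None."""
    starts = [start for start, _ in table]
    index = bisect_right(starts, pc) - 1
    if index < 0:
        return None
    return table[index][1]


print(hex(find_fde(SEARCH_TABLE, 0x823064)))   # a pc inside __cxa_throw
print(find_fde(SEARCH_TABLE, 0x7E4000))        # below the first entry
```

A real unwinder must still confirm that the pc falls within the address range recorded in the selected FDE, because the search table stores only start addresses.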

.plt / .got.plt -- PLT Stubs

The .plt section (2.2 KB, 141 entries) and .got.plt (1.1 KB) implement lazy binding for the 141 libc functions that cudafe++ imports despite static linking. These are glibc internal symbols resolved at load time. The PLT stubs are the standard x86-64 two-instruction pattern: indirect jump through GOT, then fallback to the dynamic linker (which never executes since the binary is statically linked -- the GOT is pre-resolved by the static linker).

Static Libraries Linked

The binary statically links four library components:

Library	Functions	.text Range	Purpose
libstdc++ (locale)	~600	`0x7EA800` -- `0x829600`	Full locale facet implementations
libstdc++ (iostream/exception)	~60	`0x7E42E0` -- `0x7EA800`	Streams, exceptions, operator new
Berkeley SoftFloat 3e	~80	`0x7E0D30` -- `0x7E4150`	float16/float80/float128 arithmetic
glibc CRT	~10	`0x40918C`, `0x829640` -- `0x829722`	`_start`, init, fini

No shared libraries are loaded at runtime. The binary is fully self-contained.

Virtual Address Space Map

0x400000 +-----------------------+
         | ELF headers           |  10.5 KB
0x402A18 | .init                 |  24 B
0x402A30 | .plt                  |  2.2 KB
0x403300 | .text                 |  4.15 MB
         |   assert stubs        |    22 KB    (0x403300 - 0x408B40)
         |   constructors        |    2 KB     (0x408B40 - 0x409350)
         |   EDG main body       |    3.84 MB  (0x409350 - 0x7DF400)
         |   C++ runtime         |    304 KB   (0x7DF400 - 0x829722)
0x829722 | padding               |
0x829724 | .fini                 |  14 B
0x829740 | .rodata               |  2.48 MB
         |   error table         |    30 KB    (off_88FAA0)
         |   string literals     |    ~2 MB
         |   IL kind name strings|    <1 KB    (pointer array off_E6DD80 is in .data)
         |   jump tables         |    ~400 KB
0xAA3FA4 | .eh_frame_hdr         |  48.9 KB
         |       [gap]           |
0xCB1210 | .eh_frame             |  568 KB
0xD3F398 | .gcc_except_table     |  13.2 KB
0xD42858 | .ctors                |  88 B
0xD428B0 | .dtors                |  16 B
0xD428C0 | .data.rel.ro          |  13.3 KB   (vtables)
0xD45E00 |       [padding/GOT]   |
0xD46480 | .data                 |  1.22 MB
         |   attribute table     |    ~5 KB    (off_D46820)
         |   keyword tables      |    variable
         |   handler pointers    |    (at 0xE7EExx)
0xE7EFF0 |       [padding]       |
0xE7F000 | .bss                  |  4.34 MB
         |   diagnostic list     |    (at 0xE7FE40)
         |   EDG compiler state  |    ~4.1 MB  (0xE7F000 - 0x1290000)
         |   libstdc++ state     |    ~280 KB  (0x1290000 - 0x12D6F20)
0x12D6F20| .tls                  |  24 B
0x12D6F38| extern                |  1.1 KB
0x12D73A8+-----------------------+

Key Observations

The .bss dominates. At 4.34 MB, the .bss is the largest section -- larger than .text. This reflects the EDG frontend's design: hundreds of global variables hold parser state, scope stacks, symbol tables, and IL region metadata. A reimplementation should strongly consider replacing these globals with a context struct passed through the call chain.

Static linking adds 304 KB of mostly unused code. The C++ runtime region (0x7DF400 -- 0x829722) contains approximately 900 functions, the majority of which (600+ locale facet methods) are never called by cudafe++. The locale system is pulled in transitively through iostream initialization. A reimplementation that avoids std::cout/std::cerr could eliminate the locale portion entirely.

The EDG code is tightly packed. The 3.84 MB EDG main body (0x7DF400 - 0x409350 = 0x3D60B0 bytes) has almost no inter-function padding. Functions from the same source file are contiguous, and the alphabetical ordering by filename holds across most of the range; the tail, where types.c, modules.c, and floating.c appear in that order, is the visible exception. This makes address-to-source-file attribution reliable.

The binary is position-dependent. No PIE (Position-Independent Executable) flag is set. All code references use absolute addressing. The .got is minimal (56 bytes / 7 entries) -- almost all data references are direct.

Methodology

This page documents the reverse engineering methodology used to produce every page in this wiki. The goal is full transparency: a reader should be able to reproduce any finding by following the same techniques against the same binary. Every claim in the wiki traces back to one of four evidence categories (CONFIRMED, HIGH, MEDIUM, LOW), and this page defines exactly what each level means, what tools produced the raw data, and how that data was refined into the structured documentation that follows.

Toolchain

Component	Version	Role
IDA Pro	9.0 (64-bit)	Interactive disassembler and database host
Hex-Rays	x86-64 decompiler (IDA 9.0 bundled)	Pseudocode generation for all 6,483 functions
IDAPython	3.x (IDA-embedded)	Scripted extraction via `analyze_cudafe++.py` (531 lines)
Target binary	`cudafe++` from CUDA Toolkit 13.0	ELF 64-bit, statically linked, stripped, 8,910,936 bytes
IDA database	`cudafe++.i64`	247 MB analysis state (all function boundaries, xrefs, type info, decompilation caches)

The binary was loaded into IDA Pro 9.0 with default x86-64 analysis settings. IDA's auto-analysis resolved all code/data boundaries, generated function boundaries for 6,483 functions, and identified 52,489 string literals. The Hex-Rays decompiler was invoked on all 6,483 functions; the IDAPython extraction log reports 6,343 successful decompilations (the remaining 140 failures are exception personality routines, SoftFloat leaf functions, and tiny thunks where Hex-Rays cannot reconstruct a valid C AST). However, due to function-name collisions in the output filenames (multiple sub_XXXXXX entries mapping to the same sanitized name after / replacement), the actual decompiled output directory contains 6,202 unique .c files -- the number used throughout this wiki.

Extraction Script

All raw data was exported from the IDA database in a single automated pass using analyze_cudafe++.py, an IDAPython script that runs inside IDA's scripting environment. The script produces 12 output artifacts:

Artifact	File	Records	Size	Description
String table	`cudafe++_strings.json`	52,489 strings	9.2 MB	Every string literal with address, type, and all cross-references
Function table	`cudafe++_functions.json`	6,483 functions	12 MB	Address, size, instruction count, callers, callees per function
Import table	`cudafe++_imports.json`	142 imports	16 KB	Imported PLT symbols (glibc wrappers in static binary)
Segment table	`cudafe++_segments.json`	26 segments	3.3 KB	ELF section addresses, sizes, types, permissions
Cross-reference table	`cudafe++_xrefs.json`	1,243,258 xrefs	154 MB	Every code and data xref with source function attribution
Comment table	`cudafe++_comments.json`	22,911 comments	2.0 MB	All IDA comments (regular + repeatable)
Name table	`cudafe++_names.json`	54,771 names	3.5 MB	All named locations (IDA auto-names + user-defined)
Call graph	`cudafe++_callgraph.json` + `.dot`	67,756 edges	7.4 MB	Complete inter-procedural call graph (5,057 unique callers, 5,382 unique callees)
`.rodata` dump	`cudafe++_rodata.bin`	2,599,011 bytes	2.5 MB	Raw bytes of the read-only data section
Disassembly	`disasm/<func>_<addr>.asm`	6,342 files	86 MB	Per-function annotated disassembly with hex bytes
CFG graphs	`graphs/<func>_<addr>.json` + `.dot`	12,684 files	184 MB	Per-function basic-block graph with instructions and edges (JSON + DOT)
Decompiled code	`decompiled/<func>_<addr>.c`	6,202 files	38 MB	Hex-Rays pseudocode per function

Script Architecture

The script is structured as a main() function that calls idaapi.auto_wait() to block until IDA's auto-analysis completes, then executes 12 extraction passes in a fixed order. Output is written to four directories: the root output directory (JSON databases), graphs/ (per-function CFGs), disasm/ (per-function disassembly), and decompiled/ (per-function pseudocode). Directories are created if they do not exist.

The 12 passes, in execution order:

export_all_strings() -- Enumerates idautils.Strings(), then for each string walks XrefsTo(string_ea) to record every function that references it. Each string entry captures the address, string value, string type code, and a list of xref records ({from_addr, func_name, xref_type}). This is the foundation for source attribution (see below).
export_all_functions() -- For each function in idautils.Functions(), records start/end address, size, instruction count (via idc.is_code() on each head), library flag (FUNC_LIB), thunk flag (FUNC_THUNK), and builds caller/callee lists. Callers are found via XrefsTo(func_start); callees via XrefsFrom(head) filtered to call-type xrefs (fl_CF = type 16, fl_CN = type 17).
export_imports() -- Enumerates all imported modules via idaapi.get_import_module_qty() and idaapi.enum_import_names(). Records module name, symbol name, address, and ordinal for each of the 142 glibc imports.
export_segments() -- Iterates idautils.Segments() to record each ELF section's name, start/end address, size, type code, and permission bits.

export_xrefs() -- Full enumeration of all cross-references from every instruction head in every function. For each xref, records source address, source function, target address, target function (if any), and xref type code. Produces the 1,243,258-record xref table. The six xref type codes in the output:

Type	Code	Count	Meaning
`dr_O`	1	29,631	Data offset reference
`dr_W`	2	11,488	Data write reference
`dr_R`	3	42,364	Data read reference
`fl_CN`	17	67,756	Code near call
`fl_JN`	19	189,364	Code near jump
`fl_F`	21	902,655	Ordinary flow (fall-through to the next instruction)

export_comments() -- Walks every instruction head in the database via idautils.Heads(), extracting both regular comments (idc.get_cmt(ea, 0)) and repeatable comments (idc.get_cmt(ea, 1)).
export_names() -- Iterates idautils.Names() to export all named locations (function names, data labels, IDA auto-generated names).
extract_rodata() -- Reads the raw bytes of the .rodata segment via ida_bytes.get_bytes() and writes them to a binary file. Used for offline string scanning and jump table analysis.
export_callgraph() -- Builds the 67,756-edge call graph by iterating every function and scanning its instruction heads for outgoing call xrefs (fl_CN, fl_CF). Output in both JSON (array of {from, from_addr, to, to_addr} edge records) and Graphviz DOT format (67,759 lines).
export_complete_disassembly() -- Per-function disassembly files. For each function, iterates all instruction heads within the function's address range, generating hex byte dumps alongside disassembly text via idc.generate_disasm_line(). Each file includes a header with function name, address range, and byte size.
export_function_graphs() -- Per-function control flow graphs via idaapi.FlowChart(). For each basic block: block ID, start/end address, size, and full instruction listing. Block-to-block edges (fall-through and branch targets) are extracted via block.succs(). Output as both JSON (structured blocks + edges) and DOT (for Graphviz visualization).
export_decompilation() -- Calls idaapi.init_hexrays_plugin() to initialize the Hex-Rays decompiler, then iterates all functions and calls idaapi.decompile(func_ea). On success, the pseudocode string (str(cfunc)) is written to a .c file with a header comment containing the function name and address. Failures are silently caught by a broad except Exception clause and skipped.

The script is invoked via IDA's headless batch mode or interactive scripting console. It does not call qexit() at the end, allowing the IDA database to remain open for further interactive analysis after extraction. Total extraction time is approximately 30-45 minutes on a workstation-class machine, dominated by the 6,483 decompilation calls in pass 12.
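The per-type totals reported for pass 5 can be reproduced offline from the exported xref records. The sketch below runs outside IDA and uses the numeric cross-reference constants from the IDA SDK (xref.hpp). The record keys `from`, `to`, and `type` are assumptions about the JSON shape, and the three sample records are invented.

```python
from collections import Counter

# Numeric xref type codes as defined by the IDA SDK.
XREF_NAME = {
    1: "dr_O", 2: "dr_W", 3: "dr_R",
    16: "fl_CF", 17: "fl_CN", 18: "fl_JF", 19: "fl_JN", 21: "fl_F",
}
CALL_TYPES = {16, 17}   # far call, near call

# Per-code counts as reported in the pass 5 table.
REPORTED = {1: 29631, 2: 11488, 3: 42364, 17: 67756, 19: 189364, 21: 902655}


def tally(xrefs):
    """Count xref records per symbolic type name."""
    return Counter(XREF_NAME.get(x["type"], f"unknown({x['type']})") for x in xrefs)


def call_edges(xrefs):
    """Keep only call-type records as (caller_addr, callee_addr) pairs."""
    return [(x["from"], x["to"]) for x in xrefs if x["type"] in CALL_TYPES]


# Invented records: one call into the assert handler, one jump, one data read.
sample = [
    {"from": 0x40330F, "to": 0x4F2930, "type": 17},
    {"from": 0x40DFE4, "to": 0x40E010, "type": 19},
    {"from": 0x40DFD0, "to": 0x126ED90, "type": 3},
]
print(tally(sample))
print(call_edges(sample))
print(sum(REPORTED.values()))
```

The final sum should equal the 1,243,258 records in the xref table, which is a quick consistency check on the export.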

Source Attribution Technique

The single most powerful technique in this analysis is source attribution via __FILE__ strings. The EDG C++ frontend uses C-style assertions throughout its codebase. When an assertion fires, the handler receives the source file path, line number, and function name as compile-time constants produced by the __FILE__ and __LINE__ macros and the __func__ predefined identifier. Because the binary is stripped (no .symtab), these assertion strings are the only surviving link to the original source tree.

The Assert Handler

The central assert handler is sub_4F2930, located in error.c. It is a __noreturn function that formats and emits an internal compiler error message, then terminates the process. A total of 2,139 functions in the binary call sub_4F2930, with 5,178 total call sites (many functions have multiple assertion points throughout their bodies).

The highest-density callers are the 235 assert stubs in the region 0x403300--0x408B40. Each stub is exactly 29 bytes: three register loads (source file path via lea rdi, line number via mov esi, function name via lea rdx) followed by a call to sub_4F2930:

sub_403300:         ; assert stub for is_aliasable (attribute.c:10897)
  lea  rdi, aAttributeC    ; "/dvs/p4/.../EDG_6.6/src/attribute.c"
  mov  esi, 10897           ; line number (integer, not string)
  lea  rdx, aIsAliasable   ; "is_aliasable"
  call sub_4F2930           ; internal_error(__FILE__, __LINE__, __func__)

Of the 235 stubs, 200 reference .c file paths and 35 reference .h file paths (inlined assertions from header files). The stubs are sorted approximately by source file name within the stub region -- the linker grouped them from all 52 .c compilation units into one contiguous block.

Beyond the dedicated stubs, 1,904 additional functions contain inline assertion checks: the lea rdi, <file_path> instruction appears within the function body at the assertion site, not in a separate stub. These inline assertions provide the same source-file attribution as the stubs.

The Attribution Chain

The attribution chain works in three steps:

String discovery. Extract all strings matching the EDG build path prefix /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/. This yields one string per source file, each cross-referenced by the assert stubs that load it.
Xref tracing. For each assert stub, follow XrefsTo() to find which main-body functions call it. A function at 0x40DFD0 that calls the attribute.c:5108 stub was compiled from attribute.c. This attributes the caller to the source file.
Range extension. Assertions are sparse -- not every function contains one. Once a set of functions in a contiguous address range are attributed to the same source file, the entire range is assigned to that file. This works because the linker places all object code from a single .c file contiguously, and the files are arranged roughly alphabetically by filename.

This technique attributed 2,209 functions (34.1% of the binary) to specific source files. The remaining 4,274 functions fall into three categories: C++ runtime code (1,085 functions from libstdc++/glibc, identifiable by address range), PLT/init stubs (283 functions), and unmapped EDG functions (2,906 functions that contain no assertions and cannot be confidently attributed).
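The range-extension step can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the analysis: `extend_ranges` is a hypothetical helper, and every address except 0x40DFD0 is made up. It fills a gap only when the confirmed functions on both sides agree on the source file, and leaves the gap unattributed when a file boundary falls inside it.

```python
from bisect import bisect_left

def extend_ranges(functions, confirmed):
    """Fill attribution gaps between confirmed neighbours that agree.

    functions: sorted list of function start addresses.
    confirmed: dict mapping address -> source file (assertion-attributed).
    """
    result = dict(confirmed)
    anchors = sorted(confirmed)
    for lo, hi in zip(anchors, anchors[1:]):
        if confirmed[lo] != confirmed[hi]:
            continue  # a file boundary lies in this gap; leave it unattributed
        first = bisect_left(functions, lo)
        last = bisect_left(functions, hi)
        for addr in functions[first:last]:
            result.setdefault(addr, confirmed[lo])
    return result

# Hypothetical addresses, for illustration only.
funcs = [0x40DFD0, 0x40E100, 0x40E230, 0x40E400, 0x40E5A0]
known = {
    0x40DFD0: "attribute.c",
    0x40E230: "attribute.c",
    0x40E5A0: "class_decl.c",
}
attributed = extend_ranges(funcs, known)
for addr in funcs:
    print(hex(addr), attributed.get(addr, "(unattributed)"))
```

The conservative rule (both neighbours must agree) is a design choice of this sketch; a looser variant would assign boundary gaps to the nearer file at the cost of more LOW-confidence attributions.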

Build Path

The full build path embedded in the binary is:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/

This reveals the NVIDIA internal Perforce depot structure (/dvs/p4/), the release branch (r13.0), and the EDG version (EDG_6.6). It confirms the binary was built from EDG C++ Front End version 6.6, licensed from Edison Design Group.
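As a rough sketch of the string-discovery step, the following Python scans a byte blob for NUL-terminated strings that start with the build path and collects the file names. The `find_source_paths` helper and the sample blob are illustrative assumptions; a real run would read the `.rodata` dump instead.

```python
import re
from collections import Counter

PREFIX = (b"/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/"
          b"compiler/drivers/compiler/edg/EDG_6.6/src/")

def find_source_paths(blob):
    """Return the sorted, de-duplicated file names that follow PREFIX."""
    pattern = re.compile(re.escape(PREFIX) + rb"([A-Za-z0-9_]+\.[ch])\x00")
    return sorted({m.group(1).decode("ascii") for m in pattern.finditer(blob)})

# Synthetic stand-in for a .rodata dump (assumption, not real binary data).
sample = (PREFIX + b"attribute.c\x00" + b"junk\x00"
          + PREFIX + b"nv_transforms.h\x00"
          + PREFIX + b"attribute.c\x00")

names = find_source_paths(sample)
by_ext = Counter(name.rsplit(".", 1)[1] for name in names)
print(names, dict(by_ext))
```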

Confidence Levels

Every identification in the raw sweep reports and wiki pages carries one of four confidence levels:

Level	Tag	Criteria	Example
CONFIRMED	Direct match	The function's identity is proven by an assertion string that encodes the exact function name, source file, and line number. No ambiguity.	`sub_403300` loads `"is_aliasable"` + `"attribute.c"` + the integer line number 10897 -- it is the assertion stub for `is_aliasable()` in `attribute.c` at line 10897.
HIGH	String + callgraph	The function references a distinctive string (error message, format string, keyword literal) AND its position in the call graph is consistent with a single plausible identity.	`sub_459630` references 276 CLI flag strings and is called from `main()` at the position where command-line processing occurs -- identified as `proc_command_line()`.
MEDIUM	Pattern + context	The function matches a known EDG pattern (struct layout access, IL node walking, type query) and its address falls within the expected source file range, but no string or assertion directly confirms the identity.	A function at `0x5B3000` accesses the IL node kind field at the expected struct offset and falls within the `il.c` address range -- likely an IL accessor, but the specific function name is inferred.
LOW	Address proximity	The function's address falls within a source file's range, but no internal evidence (strings, struct accesses, callees) distinguishes it from neighboring functions. The attribution is based solely on the linker's contiguous placement of object code.	A small leaf function at `0x5B2F80` sits between two `il.c`-attributed functions -- probably from `il.c`, but it could be an inlined header function.

In practice, approximately 34% of functions are CONFIRMED (via assert strings), ~20% are HIGH (via distinctive strings or unique callgraph positions), ~25% are MEDIUM, and ~21% are LOW or unattributed.
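The four levels can be read as a decision ladder over the available evidence. The sketch below is one possible encoding; the `Evidence` fields and the `confidence` function are hypothetical names chosen for illustration, and the ladder simply applies the criteria from the table in order.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    assert_string: bool = False       # assertion names function, file, and line
    distinctive_string: bool = False  # error message, format string, keyword
    callgraph_unique: bool = False    # only one plausible identity in the graph
    edg_pattern: bool = False         # struct access / IL walk / type query shape
    in_file_range: bool = False       # address lies inside a source file's range

def confidence(ev):
    """Return the strongest level whose criteria the evidence satisfies."""
    if ev.assert_string:
        return "CONFIRMED"
    if ev.distinctive_string and ev.callgraph_unique:
        return "HIGH"
    if ev.edg_pattern and ev.in_file_range:
        return "MEDIUM"
    if ev.in_file_range:
        return "LOW"
    return "UNATTRIBUTED"

print(confidence(Evidence(assert_string=True)))
print(confidence(Evidence(edg_pattern=True, in_file_range=True)))
```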

Call Graph Analysis

The complete call graph contains 67,756 edges connecting the 6,483 functions. This graph is the primary tool for understanding system architecture -- which subsystems call which, where the hot paths are, and how NVIDIA's additions integrate with the EDG base.

Hub Identification

Hub functions -- those with exceptionally high in-degree (many callers) or out-degree (many callees) -- reveal the architectural spine of the compiler:

Hub Type	Function	Description	Degree
Top callee	`sub_4F2930`	`internal_error` handler	2,139 calling functions, 5,178 call sites (including all 235 assert stubs)
Top callee	Type query functions (104 total)	`is_class_or_struct_or_union_type`, etc.	407 call sites for top query
Top caller	`sub_7A40A0`	`process_translation_unit`	Calls into parser, IL, type system
Top caller	`sub_459630`	`proc_command_line` (4,105-line monster)	Touches 276 flag variables
Top caller	`sub_585DB0`	`fe_one_time_init`	38 subsystem initializer calls
Cross-module bridge	`sub_6BCC20`	Lambda preamble injection (NVIDIA)	Called from EDG statement handlers
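Hub detection reduces to counting in-degree and out-degree over the edge list. The following sketch assumes the call graph is available as (caller, callee) pairs; the tiny edge list is a toy sample, not the exported data, and `degree_tables` is a hypothetical helper.

```python
from collections import Counter

def degree_tables(edges):
    """Count distinct callers per callee and distinct callees per caller."""
    in_deg, out_deg = Counter(), Counter()
    for caller, callee in set(edges):  # collapse repeated call sites
        out_deg[caller] += 1
        in_deg[callee] += 1
    return in_deg, out_deg

# Toy edge list for illustration only.
edges = [
    ("sub_403300", "sub_4F2930"),
    ("sub_40DFD0", "sub_4F2930"),
    ("sub_40DFD0", "sub_4F2930"),  # second call site in the same function
    ("sub_408950", "sub_459630"),
    ("sub_408950", "sub_585DB0"),
    ("sub_408950", "sub_7A40A0"),
]
in_deg, out_deg = degree_tables(edges)
top_callee = in_deg.most_common(1)[0]
top_caller = out_deg.most_common(1)[0]
print(top_callee, top_caller)
```

Collapsing duplicate edges counts calling functions rather than call sites; dropping the `set()` gives the call-site view instead.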

Graph Structure

The call graph exhibits a layered structure typical of compiler frontends:

Entry layer. main() at 0x408950 calls exactly 8 stage functions in sequence.
Stage layer. Each stage function (init, CLI, parse, wrapup, backend) fans out to dozens of subsystem entry points.
Core layer. The parser (expr.c, decls.c, statements.c) calls into the type system (types.c, exprutil.c), IL builder (il.c, il_alloc.c), and name lookup (lookup.c, scope_stk.c).
Leaf layer. Memory management (mem_manage.c), error reporting (error.c), and type queries form the bottom of the call hierarchy, referenced from almost every subsystem.

NVIDIA's nv_transforms.c sits as a lateral extension at the core layer: it is called from class_decl.c, cp_gen_be.c, and statements.c (via nv_transforms.h inlines), but does not itself call back into the EDG parser. This clean separation suggests NVIDIA modifies the EDG source minimally, preferring to hook into existing EDG extension points rather than fork the core.

String-Based Discovery

The binary contains 52,489 strings in .rodata. These strings are the second most important evidence source after the assertion paths. Major categories:

Category	Approximate Count	Usage
EDG assertion paths (`/dvs/p4/...`)	65 (52 `.c` + 13 `.h`)	Source attribution
CUDA keyword strings	~300	Keyword table initialization, CLI flag names
Error message templates	~3,800	Diagnostic emission (`off_88FAA0` error table, 3,795 entries)
C/C++ keyword strings	~200	Lexer token recognition
Format strings (`%s`, `%d`, etc.)	~500	Output formatting in `.int.c` emission and diagnostics
IL kind names	~200	IL node type display (`off_E6DD80` table)
Type name fragments	~400	Mangling output, type display
CUDA architecture names (`sm_XX`)	~50	Architecture feature gating
Internal EDG config strings	~200	Build configuration, feature flags

String Mining Techniques

Three string mining techniques are used throughout the analysis:

Error message tracing. CUDA-specific error messages (e.g., "calling a __host__ function from a __device__ function is not allowed") are grepped from the string table, their xrefs traced to the emitting function, and the emitting function's callers analyzed to understand the validation logic that triggers the error.
Keyword enumeration. The keyword initialization function (sub_5863A0) loads 200+ string constants in sequence. By reading the strings in load order, the complete CUDA keyword vocabulary is recovered -- including internal-only keywords not documented in the CUDA C++ Programming Guide.
Format string analysis. Format strings in the backend (cp_gen_be.c) reveal the exact syntax of .int.c output. A string like "static void __device_stub__%s(" tells us the precise naming convention for device stub wrapper functions.
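Error message tracing is a two-hop lookup: string to referencing function, then function to callers. A minimal sketch over plain dictionaries follows; the table shapes, the addresses, and the `sub_` names are all hypothetical, and only the two message texts are taken from this page.

```python
def trace_message(needle, strings, xrefs_to, callers_of):
    """Map each function that references a matching string to its callers."""
    hits = {}
    for addr, text in strings.items():
        if needle in text:
            for func in xrefs_to.get(addr, ()):
                hits[func] = sorted(callers_of.get(func, ()))
    return hits

# Toy tables: every address and sub_ name below is made up.
strings = {
    0x900000: "calling a __host__ function from a __device__ function is not allowed",
    0x900100: "static void __device_stub__%s(",
}
xrefs_to = {0x900000: ["sub_AAAA00"], 0x900100: ["sub_BBBB00"]}
callers_of = {"sub_AAAA00": ["sub_CCCC00", "sub_AAAB00"]}

found = trace_message("__host__ function", strings, xrefs_to, callers_of)
print(found)
```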

Decompilation Quality

Hex-Rays produces readable pseudocode for the vast majority of functions, but several systematic limitations affect the analysis:

Control Flow Artifacts

Hex-Rays occasionally introduces control flow constructs that do not exist in the original source. The most prominent example is the while(1) loop in main() (sub_408950): the decompiler wraps the entire function body in an infinite loop because a setjmp-based error recovery mechanism creates a backward edge in the CFG. In reality, main() executes linearly and returns -- the while(1) is a decompiler artifact, not a real loop.

Similar artifacts appear in functions with complex switch statements (EDG uses computed gotos for performance), where Hex-Rays may produce nested if-else chains instead of the flat dispatch table the original code uses.

Lost Preprocessor Logic

The original EDG source makes heavy use of preprocessor conditionals (#if CUDA_SUPPORT, #ifdef FRONT_END_CPFE, etc.). The compiled binary contains only the taken branch -- the preprocessor evaluated all conditions at build time. This means the decompiled code shows the CUDA-enabled configuration only; any host-only or non-CUDA EDG behavior is invisible.

Similarly, C macros that wrap common patterns (assertion macros, IL access macros, type query macros) are fully expanded in the binary. The decompiled output shows the expanded form -- a sequence of struct field accesses and conditional jumps -- rather than the concise macro invocation the original source used.

Unnamed Variables

The binary is stripped. All local variable names are lost. Hex-Rays assigns synthetic names (v1, v2, a1, a2) based on register allocation and stack slot positions. Function parameters are named a1 through aN in declaration order. During analysis, meaningful names are sometimes manually applied in the IDA database, but most decompiled output uses the synthetic names.

Structure field accesses appear as byte-offset expressions (*((_BYTE *)a1 + 182)) rather than named fields (entity->execution_space). Reconstructing the structure layouts from these offset patterns is a core part of the analysis -- see the Entity Node Layout page for the most extensively reconstructed structure.
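A first pass at layout reconstruction can be automated by harvesting these offset expressions from the pseudocode. The sketch below is an assumption-laden illustration: the regular expression covers only the simple `*((_TYPE *)base + N)` form with decimal indices, and the snippet is invented rather than copied from the binary. Note that Hex-Rays scales the index by the cast width, so `*((_DWORD *)a1 + 4)` is byte offset 16.

```python
import re
from collections import defaultdict

WIDTH = {"_BYTE": 1, "_WORD": 2, "_DWORD": 4, "_QWORD": 8}
ACCESS = re.compile(r"\*\(\((_BYTE|_WORD|_DWORD|_QWORD) \*\)(\w+) \+ (\d+)\)")

def collect_accesses(pseudocode):
    """Return {base: sorted [(byte_offset, width), ...]} for each pointer."""
    fields = defaultdict(set)
    for cast, base, index in ACCESS.findall(pseudocode):
        width = WIDTH[cast]
        fields[base].add((int(index) * width, width))  # index is in units of width
    return {base: sorted(items) for base, items in fields.items()}

# Invented pseudocode in Hex-Rays style, for illustration only.
snippet = """
if ( (*((_BYTE *)a1 + 182) & 0x40) != 0 )
    v3 = *((_QWORD *)a1 + 1);
v4 = *((_DWORD *)a1 + 4);
"""
layout = collect_accesses(snippet)
print(layout)
```

Merging the per-function results for the same structure type, then checking that no two fields overlap, is the cross-validation step; that part still needs an analyst to decide which `a1` parameters refer to the same struct.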

Decompilation Failures

The IDAPython extraction log reports 6,343 successful decompilations out of 6,483 attempts (140 failures). Due to filename collisions in the output directory (functions with identical sanitized names at different addresses overwrite each other), the actual output directory contains 6,202 unique .c files. The 281 "missing" files break down as:

Category	Count	Reason
Hex-Rays decompilation failure	~140	Exception personality routines, SoftFloat leaf functions, tiny thunks, irreducible CFG
Filename collisions (overwritten)	~141	Multiple functions with the same IDA name (after `/` to `_` sanitization) write to the same output path

The 140 true decompilation failures are concentrated in the C++ runtime region (0x7DF400--0x829722), particularly in the libstdc++ locale facet implementations (complex template instantiations with deeply nested virtual dispatch) and Berkeley SoftFloat 3e functions (pure arithmetic with non-standard calling conventions). For these functions, analysis relies on the raw disassembly output in disasm/ instead.
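The collision mechanism is easy to reproduce. The sketch below uses hypothetical function names and addresses to show how replacing `/` with `_` can map two names onto one output path, and how appending the address to the file stem would avoid the overwrite. It also re-checks the arithmetic behind the 281 missing files.

```python
def output_path(name, address=None):
    """Sanitize an IDA name into a file name; optionally append the address."""
    stem = name.replace("/", "_")
    if address is not None:
        stem = f"{stem}_{address:X}"
    return stem + ".c"

def find_collisions(functions, with_address=False):
    """Return {path: [addresses]} for paths written by more than one function."""
    paths = {}
    for address, name in functions:
        key = output_path(name, address if with_address else None)
        paths.setdefault(key, []).append(address)
    return {path: addrs for path, addrs in paths.items() if len(addrs) > 1}

# Hypothetical names and addresses, for illustration only.
functions = [(0x401000, "ns/helper"), (0x402000, "ns_helper"), (0x403000, "other")]
clobbered = find_collisions(functions)
safe = find_collisions(functions, with_address=True)

missing = 6483 - 6202  # total functions minus files on disk
print(clobbered, safe, missing, missing == 140 + 141)
```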

Phase 1: Address-Range Sweeps

The first phase of analysis consists of 20 address-range sweeps that collectively cover the entire .text section from 0x403000 to 0x82A000. Each sweep examines a contiguous address range of 136--384 KB, documenting every function within that range.

Sweep Index

Sweep	Address Range	Size	Primary Source Files	Key Findings
P1.01	`0x403000`--`0x425000`	136 KB	`attribute.c`, `class_decl.c`	Assert stub region, CUDA attribute handlers
P1.02	`0x425000`--`0x450000`	172 KB	`class_decl.c`, `cmd_line.c`	Virtual override checking, execution space propagation
P1.03	`0x450000`--`0x478000`	160 KB	`cmd_line.c`, `const_ints.c`, `cp_gen_be.c`	4,105-line CLI parser, 276 flags
P1.04	`0x478000`--`0x4A0000`	160 KB	`cp_gen_be.c`, `decl_inits.c`	Backend `.int.c` emission, device stub generation
P1.05	`0x4A0000`--`0x4C8000`	160 KB	`decl_inits.c`, `decl_spec.c`, `declarator.c`, `decls.c`	Declaration parsing pipeline
P1.06	`0x4C8000`--`0x4F8000`	192 KB	`decls.c`, `disambig.c`, `error.c`	Error table (`off_88FAA0`, 3,795 entries)
P1.07	`0x4F8000`--`0x530000`	224 KB	`expr.c`	Expression parser (528 functions)
P1.08	`0x530000`--`0x560000`	192 KB	`expr.c`, `exprutil.c`	Expression utilities, operator overloads
P1.09	`0x560000`--`0x598000`	224 KB	`exprutil.c`, `extasm.c`, `fe_init.c`, `fe_wrapup.c`	Initialization chain, 5-pass wrapup
P1.10	`0x598000`--`0x5C8000`	192 KB	`float_pt.c`, `folding.c`, `func_def.c`, `host_envir.c`	Constant folding, timing infrastructure
P1.11a--f	`0x5C8000`--`0x5F8000`	192 KB	`il.c`, `il_alloc.c`	IL node creation, arena allocator
P1.12	`0x5F8000`--`0x628000`	192 KB	`il_to_str.c`, `il_walk.c`, `interpret.c`	IL display, tree walking, constexpr
P1.13	`0x628000`--`0x668000`	256 KB	`interpret.c`, `layout.c`, `lexical.c`	Constexpr interpreter, struct layout, lexer
P1.14	`0x668000`--`0x6A8000`	256 KB	`lexical.c`, `literals.c`, `lookup.c`, `lower_name.c`	Name lookup, name mangling
P1.15	`0x6A8000`--`0x6D0000`	160 KB	`lower_name.c`, `macro.c`, `mem_manage.c`, `nv_transforms.c`, `overload.c`	NVIDIA transforms, memory management
P1.16	`0x6D0000`--`0x708000`	224 KB	`overload.c`, `pch.c`, `pragma.c`, `preproc.c`, `scope_stk.c`	Overload resolution, scope stack
P1.17	`0x708000`--`0x740000`	224 KB	`scope_stk.c`, `src_seq.c`, `statements.c`, `symbol_ref.c`, `symbol_tbl.c`	Statement parsing, symbol table
P1.18	`0x740000`--`0x7A0000`	384 KB	`symbol_tbl.c`, `sys_predef.c`, `templates.c`	Template engine (443 functions)
P1.19	`0x7A0000`--`0x7E0000`	256 KB	`trans_unit.c`, `types.c`, `modules.c`, `trans_corresp.c`	Type system, TU processing
P1.20	`0x7E0000`--`0x82A000`	296 KB	(C++ runtime)	libstdc++, SoftFloat, CRT, demangler

The P1.11 sweep was subdivided into six sub-sweeps (11a through 11f) because the il.c region is dense and complex, containing the core IL node creation and manipulation functions that are referenced from nearly every other source file.
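The size column of the sweep index follows directly from the address bounds. As a quick check, the sketch below lists the 21 boundaries from the table (each sweep ends where the next begins, so contiguity is built in) and derives each sweep's size in KB.

```python
# Sweep boundaries taken from the Sweep Index table, P1.01 through P1.20.
BOUNDS = [
    0x403000, 0x425000, 0x450000, 0x478000, 0x4A0000, 0x4C8000, 0x4F8000,
    0x530000, 0x560000, 0x598000, 0x5C8000, 0x5F8000, 0x628000, 0x668000,
    0x6A8000, 0x6D0000, 0x708000, 0x740000, 0x7A0000, 0x7E0000, 0x82A000,
]

# Size of each sweep in KB (every boundary is 4 KB aligned, so this is exact).
sizes_kb = [(end - start) // 1024 for start, end in zip(BOUNDS, BOUNDS[1:])]

for number, size in enumerate(sizes_kb, start=1):
    print(f"P1.{number:02d}: {size} KB")
print("total:", sum(sizes_kb), "KB")
```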

Sweep Report Format

Each sweep report follows a consistent format:

================================================================================
P1.XX SWEEP: Address range 0xNNNNNN - 0xMMMMMM
================================================================================
Range: 0xNNNNNN - 0xMMMMMM
Functions found: N
EDG source files:
  - file.c (assert stub range, main body range)
  ...

### 0xAAAAAA -- sub_AAAAAA (NN bytes / NN lines)
**Identity**: function_name (source_file.c:NNNN)
**Confidence**: CONFIRMED / HIGH / MEDIUM / LOW
**EDG Source**: source_file.c
**Notes**: Additional observations about behavior, callers, callees

Every function in the sweep range gets an entry. Functions are documented in address order. The identity field records the inferred function name and source location. The confidence field uses the four-level system defined above. Notes capture anything unusual -- unexpected callers, CUDA-specific behavior, undocumented error codes, or connections to other subsystems.
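Because the entries are regular, a report can be turned back into structured records with two regular expressions. The parser below is a sketch that assumes the entry layout shown above; the sample entry is illustrative, and its line count is a placeholder rather than a measured value.

```python
import re

HEADER = re.compile(
    r"^### (0x[0-9A-Fa-f]+) -- (\S+) \((\d+) bytes / (\d+) lines\)$")
FIELD = re.compile(r"^\*\*(Identity|Confidence|EDG Source|Notes)\*\*: (.*)$")

def parse_report(text):
    """Return one dict per function entry found in a sweep report."""
    entries = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            entries.append({
                "address": int(header.group(1), 16),
                "name": header.group(2),
                "bytes": int(header.group(3)),
                "lines": int(header.group(4)),
            })
            continue
        field = FIELD.match(line)
        if field and entries:
            entries[-1][field.group(1)] = field.group(2)
    return entries

# Illustrative entry; the "5 lines" figure is a placeholder.
sample = """
### 0x403300 -- sub_403300 (29 bytes / 5 lines)
**Identity**: assert stub for is_aliasable (attribute.c:10897)
**Confidence**: CONFIRMED
**EDG Source**: attribute.c
**Notes**: Calls sub_4F2930 (noreturn)
"""
records = parse_report(sample)
print(records[0]["name"], records[0]["Confidence"])
```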

Phase 2: Targeted Deep Dives

After the Phase 1 sweep establishes the complete function map and identifies all source files, Phase 2 produces the detailed wiki pages. Each wiki page corresponds to one W-series work report that focuses on a specific subsystem or topic.

Deep Dive Methodology

Each W-series report follows a consistent process:

Scope definition. Identify the set of functions relevant to the topic. For example, W012 (Execution Spaces) requires the CUDA attribute application handlers in attribute.c, the execution space checking functions in nv_transforms.c, and the virtual override validator in class_decl.c.
Decompilation review. Read the full Hex-Rays pseudocode for every function in scope. For complex functions, also review the raw disassembly to catch decompiler artifacts.
String evidence collection. Grep the string table for all strings referenced by the in-scope functions. Error messages reveal validation rules; format strings reveal output patterns; keyword strings reveal accepted syntax.
Call graph traversal. Starting from the in-scope functions, walk callers and callees to understand the full data flow. Who calls apply_nv_global_attr? What does it call? How does data arrive and where does it go?
Struct layout reconstruction. When decompiled code accesses struct fields via byte offsets, reconstruct the field layout by collecting all access patterns across all functions that touch the same struct. Cross-validate offsets across multiple functions.
Pseudocode reconstruction. Translate the Hex-Rays output into readable C-like pseudocode with meaningful variable names, proper control flow, and comments explaining the logic. This reconstructed pseudocode appears in the wiki pages.
Cross-reference synthesis. Link findings to other wiki pages and W-series reports. Every page should situate itself within the overall architecture.

W-Series Report Index

As of this writing, 29 W-series reports have been produced, each backing one or more wiki pages:

Report	Topic	Wiki Page(s)
W001	Index page	`index.md`
W002	Function map	`function-map.md`
W003	Binary layout	`binary-layout.md`
W004	Methodology	`methodology.md` (this page)
W005	Pipeline overview	`pipeline/overview.md`
W006	Entry point	`pipeline/entry.md`
W010	Backend code gen	`pipeline/backend.md`
W012	Execution spaces	`cuda/execution-spaces.md`
W014	Cross-space validation	`cuda/cross-space-validation.md`
W015	Device/host separation	`cuda/device-host-separation.md`
W016	Kernel stubs	`cuda/kernel-stubs.md`
W020	Attribute system	`attributes/overview.md`
W021	__global__ constraints	`attributes/global-function.md`
W026	Lambda overview	`lambda/overview.md`
W027	Device wrapper	`lambda/device-wrapper.md`
W028	Host-device wrapper	`lambda/host-device-wrapper.md`
W029	Capture handling	`lambda/capture-handling.md`
W032	IL overview	`il/overview.md`
W033	IL allocation	`il/allocation.md`
W035	Keep-in-IL	`il/keep-in-il.md`
W038	.int.c format	`output/int-c-format.md`
W042	EDG overview	`edg/overview.md`
W047	Template engine	`edg/template-engine.md`
W052	Diagnostics overview	`diagnostics/overview.md`
W053	CUDA errors	`diagnostics/cuda-errors.md`
W056	Entity node layout	`structs/entity-node.md`
W061	CLI flags	`config/cli-flags.md`
W065	EDG source map	`reference/edg-source-map.md`
W066	Global variables	`reference/global-variables.md`

Numerical Summary

Metric	Value
Binary file size	8,910,936 bytes (8.5 MB)
Total functions in binary	6,483
Decompiled functions (log-reported)	6,343
Decompiled files (actual on disk)	6,202
Disassembly files	6,342
CFG files (JSON + DOT)	12,684
Functions attributed to source files	2,209 (34.1%)
Functions calling `sub_4F2930` (assert handler)	2,139
Total call sites to `sub_4F2930`	5,178
Assert stubs (`0x403300`--`0x408B40`)	235
Source files identified (`.c`)	52
Header files identified (`.h`)	13
EDG build-path strings in `.rodata`	65
String literals extracted	52,489
Cross-references extracted	1,243,258
Call graph edges	67,756 (5,057 callers, 5,382 callees)
Named locations	54,771
IDA comments	22,911
Imported glibc symbols	142
ELF segments	26
`.rodata` raw dump	2,599,011 bytes
IDA database (`.i64`)	247 MB
Phase 1 sweep reports	28 files (20 ranges + 8 sub-sweeps), 38,221 lines
Phase 2 deep-dive reports (W-series)	29
Wiki pages	55
Error table entries (`off_88FAA0`)	3,795
CLI flags documented	276
Total exported data	~500 MB

Limitations and Caveats

What This Analysis Cannot Determine

Preprocessor-disabled code. Any EDG code behind #if 0, #ifndef CUDA_SUPPORT, or similar guards was compiled out. The binary reflects only the CUDA-enabled, Linux x86-64, EDG 6.6 configuration. Other EDG frontend features (e.g., Fortran support, Windows target, older C++ standards) are not present.
Inlined function boundaries. When the compiler inlines a function, its code merges with the caller. The binary may contain hundreds of inlined instances of small EDG utility functions (type queries, IL accessors) that are invisible as separate entities. The 6,483 function count represents only the non-inlined functions.
Original variable names. All local and most global variable names are lost. The wiki uses reconstructed names based on semantics (e.g., execution_space_byte for *((_BYTE *)entity + 182)), but these are analyst-assigned, not original.
Exact source line mapping. While assertion strings encode line numbers, each number marks the assertion site, not the start of the enclosing function. The analyst can determine that is_aliasable in attribute.c has an assertion at line 10897, but cannot determine the start line of is_aliasable itself.
NVIDIA-internal documentation. Any design documents, code comments, commit messages, or internal wikis that informed the original development are unavailable. All behavioral descriptions in this wiki are inferred from the binary alone.

Reproducibility

Every finding in this wiki can be reproduced by:

Obtaining cudafe++ from CUDA Toolkit 13.0 (the release branch r13.0 appears in the build path embedded in the binary).
Loading it into IDA Pro 9.0 (64-bit) with default x86-64 analysis settings. Wait for auto-analysis to complete (5-10 minutes).
Running analyze_cudafe++.py via File > Script File to extract all raw data (30-45 minutes).
Querying the exported JSON files with jq to trace cross-reference chains, string lookups, and callgraph paths.
Reading the decompiled .c files and raw .asm files for behavioral analysis.
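The same callgraph queries can be run from Python instead of jq. The sketch below assumes a hypothetical export schema (a JSON array of objects with `caller` and `callee` keys); the real file layout may differ. It finds the shortest call path from main() to the keyword registration function using the three edges this page describes.

```python
import json
from collections import deque

def shortest_call_path(edges, start, goal):
    """Breadth-first search over caller -> callee edges; None if unreachable."""
    graph = {}
    for edge in edges:
        graph.setdefault(edge["caller"], []).append(edge["callee"])
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Assumed schema; only the three edges named on this page are included.
raw = """[
  {"caller": "sub_408950", "callee": "sub_585DB0"},
  {"caller": "sub_585DB0", "callee": "sub_585EE0"},
  {"caller": "sub_585EE0", "callee": "sub_5863A0"}
]"""
edges = json.loads(raw)
path = shortest_call_path(edges, "sub_408950", "sub_5863A0")
print(" -> ".join(path))
```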

No proprietary tools beyond IDA Pro + Hex-Rays are required. The analysis does not depend on NVIDIA source code access, NDA-protected documentation, or insider knowledge. Every claim is derived from the publicly distributed binary.

Pipeline Overview

cudafe++ is a source-to-source compiler. It reads a .cu file, parses it as C++ with CUDA extensions using a modified EDG 6.6 frontend, then emits a transformed .int.c file where device code is suppressed and host-side stubs replace kernel launch sites. The entire binary is a single-threaded, single-pass-per-stage pipeline controlled from main() at 0x408950.

Pipeline Diagram

  input.cu
     |
     v
 [1] fe_pre_init          sub_585D60   fe_init.c
     9 subsystem pre-initializers
     |
     v
     * sub_5AF350(v7) ---- capture "Total compilation time" start
     |
     v
 [2] proc_command_line     sub_459630   cmd_line.c
     276 CLI flags parsed, mode selection
     |
     v
 [3] fe_one_time_init      sub_585DB0   fe_init.c
     38 subsystem initializers + keyword registration
     |--- fe_init_part_1 (sub_585EE0): per-unit inits, output file open
     |--- keyword_init + fe_translation_unit_init (sub_5863A0)
     |
     v
     * sub_5AF350(v8) ---- capture "Front end time" start
     |
     v
 [4] reset_tu_state        sub_7A4860   trans_unit.c
     Zero all TU globals
     |
     v
 [5] process_trans_unit    sub_7A40A0   trans_unit.c
     Allocate 424-byte TU descriptor, parse source,
     build EDG IL tree, CUDA attribute propagation
     |
     v
 [6] fe_wrapup             sub_588F90   fe_wrapup.c
     5-pass IL finalization: needed-flags, keep-in-IL marking,
     dead entity elimination, scope cleanup
     |
     v
     * sub_5AF350(v9) ---- capture "Front end time" end
     * sub_5AF390("Front end time", v8, v9)
     |
     v
     * sub_5AF350(v10) --- capture "Back end time" start
     |
     v
 [7] Backend entry         sub_489000   cp_gen_be.c
     Walk source sequence, emit .int.c, device stubs,
     lambda wrappers, registration tables
     |
     v
     * sub_5AF350(v11) --- capture "Back end time" end
     * sub_5AF390("Back end time", v10, v11)
     |
     v
     * sub_5AF350(v12) --- capture "Total compilation time" end
     * sub_5AF390("Total compilation time", v7, v12)
     |
     v
 [8] exit_with_status      sub_5AF1D0   host_envir.c
     Map internal status to exit code, terminate

     |----- "Front end time" covers stages 4-6 ----------|
     |----- "Back end time" covers stage 7 ---------------|
     |----- "Total compilation time" covers stages 2-7 ---|

Call Hierarchy from main()

The decompiled main() at 0x408950 calls the pipeline stages in this exact order:

void main(int argc, char **argv, char **envp)
{
    sub_585D60(argc, argv, envp);      // [1] fe_pre_init
    sub_5AF350(v7);                     //     capture_time (total start)
    sub_459630(argc, argv);             // [2] proc_command_line
    // [stack limit adjustment via setrlimit]
    sub_585DB0();                       // [3] fe_one_time_init
    if (dword_106C0A4)
        sub_5AF350(v8);                 //     capture_time (frontend start)
    sub_7A4860();                       // [4] reset_tu_state
    sub_7A40A0(qword_126EEE0);         // [5] process_translation_unit
    sub_588F90(v5, 1);                  // [6] fe_wrapup
    if (dword_106C0A4) {
        sub_5AF350(v9);
        sub_5AF390("Front end time", v8, v9);
    }
    // --- error gate, then exit/backend blocks (the loop never iterates) ---
    if (qword_126ED90) {               //     errors present?
        dword_106C254 = 1;             //     skip backend
    }
    while (1) {
        sub_6B8B20(0);                  //     reset file state
        sub_589530();                   //     write signoff + cleanup
        // exit code computation
        if (dword_106C0A4)
            sub_5AF390("Total compilation time", ...);
        sub_5AF1D0(exit_code);          // [8] exit
        // --- if dword_106C254 == 0, backend runs ---
        if (!dword_106C254) {
            if (dword_106C0A4)
                sub_5AF350(v10);        //     capture_time (backend start)
            sub_489000();               // [7] process_file_scope_entities
            if (dword_106C0A4) {
                sub_5AF350(v11);
                sub_5AF390("Back end time", v10, v11);
            }
        }
    }
}

The while(1) loop with sub_5AF1D0 (which calls exit() / abort()) never actually iterates -- the call to sub_5AF1D0 is __noreturn. The compiler just arranged the basic blocks this way: the backend stage at label LABEL_16 falls through from a goto at the top of the loop when dword_106C254 == 0 (no errors).
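Read linearly, the control flow is simpler than the decompiled listing suggests. The following Python model is a sketch of that linear reading, not a transcription of the binary: `run_pipeline` is a hypothetical function, `has_errors` stands in for dword_106C254, and stage numbers follow the pipeline diagram. It records which stages fall inside each timing window.

```python
def run_pipeline(has_errors):
    """Return (executed stage numbers, {timer name: stages it covers})."""
    executed = []
    timers = {
        "Front end time": [],
        "Back end time": [],
        "Total compilation time": [],
    }
    active = set()

    def stage(number):
        executed.append(number)
        for name in active:
            timers[name].append(number)

    stage(1)                                 # fe_pre_init
    active.add("Total compilation time")
    stage(2)                                 # proc_command_line
    stage(3)                                 # fe_one_time_init
    active.add("Front end time")
    stage(4)                                 # reset_tu_state
    stage(5)                                 # process_translation_unit
    stage(6)                                 # fe_wrapup
    active.discard("Front end time")
    if not has_errors:                       # models dword_106C254 == 0
        active.add("Back end time")
        stage(7)                             # backend
        active.discard("Back end time")
    active.discard("Total compilation time")
    stage(8)                                 # exit_with_status (noreturn)
    return executed, timers

ok_stages, ok_timers = run_pipeline(has_errors=False)
err_stages, err_timers = run_pipeline(has_errors=True)
print(ok_stages, ok_timers)
print(err_stages, err_timers)
```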

Stage Details

Stage 1: fe_pre_init -- `sub_585D60` (0x585D60)

Source: fe_init.c

Performs absolute minimum initialization before anything else can run. Called with the raw argc, argv, envp from the OS.

Call	Address	Identity	Purpose
1	`sub_48B3C0`	error_handling_init	Zero error counters
2	`sub_6BB290`	source_file_mgr_init	File descriptor table setup
3	`sub_5B1E70`	scope_symbol_pre_init	Scope stack index = -1
4	`sub_752C90`	type_system_pre_init	Type table allocation
5	`sub_45EB40`	cmd_line_pre_init	Register CLI flag table
6	`sub_4ED530`	declaration_pre_init	Declaration state zeroing
7	`sub_6F6020`	il_pre_init	IL node allocator setup
8	`sub_7A48B0`	tu_tracking_pre_init	Zero all TU globals
9	`sub_7C00F0`	template_pre_init	Template engine state

Sets dword_126C5E4 = -1 (current scope index = "none") and dword_126C5C8 = -1 (secondary scope index = "none").

Data flow: No input beyond process args. Output: global state zeroed and ready for CLI parsing.

Stage 2: proc_command_line -- `sub_459630` (0x459630)

Source: cmd_line.c (4105 decompiled lines)

Parses all 276 CLI flags. Populates global configuration variables that control every subsequent stage. Key outputs:

Global	Address	Meaning
`dword_126EFB4`	`0x126EFB4`	Language mode: 1=K&R C, 2=C++
`dword_126EF68`	`0x126EF68`	C++ standard version (`__cplusplus` value)
`dword_106C0A4`	`0x106C0A4`	Timing enabled (print stage durations)
`dword_126E1D8`	`0x126E1D8`	MSVC host compiler
`dword_126E1F8`	`0x126E1F8`	GNU/GCC host compiler
`dword_126E1E8`	`0x126E1E8`	Clang host compiler
`dword_106BF38`	`0x106BF38`	Extended lambda mode
`qword_126EEE0`	`0x126EEE0`	Output filename (or `"-"` for stdout)
`qword_106BA00`	`0x106BA00`	Primary source filename
`dword_106C29C`	`0x106C29C`	Preprocessing-only mode
`dword_106C064`	`0x106C064`	Stack limit adjustment flag

The parser builds four hash tables for macro defines (qword_106C248), include paths (qword_106C240), and system includes (qword_106C238, qword_106C228). It also suppresses a default set of diagnostic numbers (1257, 1373, 1374, 1375, 1633, 2330, 111, 185, 175).

Data flow: Input: argv. Output: more than 150 global configuration variables populated.

Stage 3: fe_one_time_init -- `sub_585DB0` (0x585DB0)

Source: fe_init.c

The heaviest initialization stage. Calls 38 subsystem initializers in dependency order, then validates the function pointer dispatch table (a sentinel check: off_D560C0 must equal the address of nullsub_6). After validation, calls sub_585EE0 (fe_init_part_1) which:

Records compilation timestamp via time()/ctime() into byte_106B5C0
Runs 26 per-compilation-unit initializers
Opens the output file (qword_106C280 = stdout or file)
Writes the output file header via sub_5AEDB0
Calls the keyword registration function sub_5863A0 which registers 200+ C/C++ keywords plus NVIDIA CUDA-specific type traits (__nv_is_extended_device_lambda_closure_type, etc.)

38 subsystem initializers (in call order):

#	Address	Subsystem
1	`sub_752DF0`	types
2	`sub_5B1D40`	scopes
3	`sub_447430`	errors
4	`sub_4B37F0`	preprocessor
5	`sub_4E8ED0`	declarations
6	`sub_4C0840`	attributes
7	`sub_4A1B60`	names
8	`sub_4E9CF0`	declarations (part 2)
9	`sub_4ED710`	declarations (part 3)
10	`sub_510C30`	statements
11	`sub_56DC90`	expression utilities
12	`sub_5A5160`	expressions
13	`sub_603B00`	parser
14	`sub_5CF7F0`	classes
15	`sub_65DC50`	overload resolution
16	`sub_69C8B0`	templates
17	`sub_665A00`	template instantiation
18	`sub_689550`	exception handling
19	`sub_68F640`	implicit conversions
20	`sub_6B6510`	IL
21	`sub_6BAE70`	source file manager
22	`sub_6F5FC0`	IL walking
23	`sub_6F8300`	IL (part 2)
24	`sub_6FDFF0`	lowering
25	`sub_726DC0`	name mangling
26	`sub_72D410`	name mangling (part 2)
27	`sub_74B9A0`	type checking
28	`sub_710B70`	IL (part 3)
29	`sub_76D630`	code generation
30	`nullsub_11`	debug (no-op)
31	`sub_7A4690`	allocation
32	`sub_7A3920`	memory pools
33	`sub_6A0E90`	templates (part 2)
34	`sub_418F80`	diagnostics
35	`sub_5859C0`	extended asm
36	`sub_751540`	types (part 2)
37	`sub_7C25F0`	templates (part 3)
38	`sub_7DF400`	CUDA-specific init

Data flow: Input: populated config globals. Output: all subsystems initialized, keyword table built, output file open.

Stage 4: reset_tu_state -- `sub_7A4860` (0x7A4860)

Source: trans_unit.c

Zeroes all translation unit tracking globals to prepare for processing:

qword_106BA10 = 0;   // current_translation_unit
qword_106B9F0 = 0;   // primary_translation_unit
qword_12C7A90 = 0;   // tu_chain_tail
dword_106B9F8 = 0;   // has_module_info
qword_106BA18 = 0;   // tu_stack_top
dword_106B9E8 = 0;   // tu_stack_depth

Data flow: No input. Output: TU state clean-slated.

Stage 5: process_translation_unit -- `sub_7A40A0` (0x7A40A0)

Source: trans_unit.c

The main frontend workhorse. This single call parses the entire .cu source file into the EDG intermediate language. Workflow:

Debug trace: "Processing translation unit %s"
Clean up any previous TU state (sub_7A3A50)
Reset error state (sub_5EAEC0)
Allocate 424-byte TU descriptor via sub_6BA0D0
Initialize TU scope state (offsets 24..192 via sub_7046E0)
Set as primary TU (qword_106B9F0) if first
Link into TU chain
Call sub_586240 -- parse the source file (this enters the EDG parser, which handles all of C++ plus CUDA extensions: __device__, __host__, __global__, __shared__, __managed__, etc.)
Depending on mode:
- Module compilation: sub_6FDDF0
- Standard compilation: sub_6F4AD0 (header-unit) + sub_4E8A60 (standard)
Post-processing: sub_588E90 (translation_unit_wrapup -- scope closure, template wrapup, IL output)
Debug trace: "Done processing translation unit %s"

At the end of this stage, the EDG IL tree is fully built. Every declaration, type, expression, and statement from the source has been parsed into IL nodes. CUDA execution-space attributes (__device__, __host__, __global__) have been recorded on entity nodes at byte offset +182 (bit 6 = device/global, bits 4-5 = execution space).
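The bit positions quoted above can be pulled apart with two shifts and masks. This sketch only extracts the fields; it deliberately does not interpret what each value of the two-bit field means, since that encoding is not stated here. The function name and the result keys are hypothetical.

```python
def decode_exec_space_byte(value):
    """Split the byte at entity offset +182 into the fields named above."""
    return {
        "bit6": (value >> 6) & 0x1,        # device/global flag
        "space_bits": (value >> 4) & 0x3,  # bits 4-5, execution space field
    }

for raw in (0x00, 0x10, 0x30, 0x40, 0x50):
    print(hex(raw), decode_exec_space_byte(raw))
```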

Data flow: Input: source filename from qword_126EEE0. Output: complete EDG IL tree anchored at qword_106BA10 (TU descriptor), source sequence list at *(qword_106BA10 + 8).

Stage 6: fe_wrapup -- `sub_588F90` (0x588F90)

Source: fe_wrapup.c

Five-pass finalization over all translation units. Each pass iterates the TU chain (qword_106B9F0). Passes 2-4 are per-TU error-gated (skip TUs with qword_126ED90 != 0); passes 1 and 5 run unconditionally.

Pass	Function	Purpose	Error-gated?
1	`sub_588C60`	Per-file IL wrapup: template/exception cleanup, IL tree walk (`sub_706710`), IL finalize (`sub_706F40`), destroy temporaries	No
2	`sub_707040`	Needed-flags computation: determine which entities must be preserved for backend consumption	Per-TU skip
3	`sub_610420(23)`	Keep-in-IL marking: mark entities for device code preservation with guard flag `dword_106B640`	Per-TU skip
4	`sub_5CCA40` + `sub_5CC410` + `sub_5CCBF0`	Dead entity elimination (C++ gate on `sub_5CCA40`): clear unneeded instantiation flags, remove dead function bodies, remove unneeded IL entries	Per-TU skip
5	`sub_588D40`	Statement finalization, scope assertions, IL output + template output	No

Between Pass 1 and Pass 2, if no errors have occurred, sub_796C00 runs cross-TU entity marking.
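
The gating rule in the pass table can be sketched as a nested loop. The `struct tu` record, its `error_count` field (standing in for the per-TU view of qword_126ED90), and the `passes_run` bookkeeping are all hypothetical; the real passes call the functions listed in the table, and the cross-TU marking step between passes 1 and 2 is omitted.

```c
#include <stddef.h>

/* Hypothetical, simplified TU record: only what the gating logic needs. */
struct tu {
    struct tu *next;
    long       error_count;  /* stand-in for the per-TU error state */
    unsigned   passes_run;   /* bit N set when pass N ran (illustration only) */
};

/* Passes 2-4 skip TUs that have errors; passes 1 and 5 always run. */
static int pass_is_error_gated(int pass)
{
    return pass >= 2 && pass <= 4;
}

static void run_wrapup_passes(struct tu *chain)
{
    for (int pass = 1; pass <= 5; ++pass) {
        for (struct tu *t = chain; t != NULL; t = t->next) {
            if (pass_is_error_gated(pass) && t->error_count != 0)
                continue;                 /* per-TU skip */
            t->passes_run |= 1u << pass;  /* real code calls the pass function here */
        }
    }
}
```

Iterating passes in the outer loop matches the description that each pass walks the whole TU chain before the next pass starts.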

Post-pass operations:

Cross-TU consistency (sub_796BA0, error-gated)
Scope renumbering (sub_707480 double-loop)
Template validation (sub_765480)
File index cleanup (sub_6B8B20 for indices 2..dword_126EC80)
Output flush + close three output files (IDs 1513, 1514, 1515)
Memory statistics: sums 10 space_used() callbacks
State teardown

Data flow: Input: fully built IL tree. Output: finalized IL with dead entities eliminated and device-needed entities marked. The source sequence list (qword_1065748) is the ordered list of top-level declarations the backend will walk.

Stage 7: Backend Code Generation -- `sub_489000` (0x489000)

Source: cp_gen_be.c (723 decompiled lines, the largest single function in the backend)

This is the host-side C++ code generator. It walks the EDG source sequence and emits the .int.c file that the host compiler (gcc/cl.exe/clang) will compile. The backend is gated by dword_106C254: if set to 1 (errors occurred), stage 7 is skipped entirely.

Initialization:

Zeros output state: dword_1065834 (indent level), stream handle, counters
Clears four hash tables of roughly 512 KB each (memset of 0x7FFE0 = 524,256 bytes per table)
Sets up gen_be_info callback table (xmmword_1065760..10657B0)
Creates output file: <input>.int.c (or stdout for "-")

Boilerplate emission:

#pragma GCC diagnostic push/pop blocks for suppressing host compiler warnings
__nv_managed_rt initialization boilerplate (for __managed__ variables)
Lambda type-trait macro definitions

Main processing loop:

Walks qword_1065748 (global source sequence list)
For each entry: dispatches to sub_47ECC0 (gen_template/process_source_sequence)
Kind 57 entries are pragma interleavings (handled inline)

CUDA-specific transformations performed:

Device stub generation: For __global__ kernels, emit __wrapper__device_stub_<name>() forwarding, wrap original body in #if 0/#endif
Device-only suppression: Device-only declarations wrapped in #if 0/#endif
Lambda wrappers: __nv_dl_wrapper_t<> for device lambdas, __nv_hdl_create_wrapper_t<> for host-device lambdas
Runtime header injection: #include "crt/host_runtime.h" at first CUDA entity
Registration tables: sub_6BCF80 called 6 times for device/host/managed/constant combinations
Anonymous namespace: _NV_ANON_NAMESPACE macro for unique global symbols

Trailer:

Empty-file guard: int __dummy_to_avoid_empty_file;
Re-inclusion of original source via #include "<original_file>"
#undef _NV_ANON_NAMESPACE

Data flow: Input: finalized source sequence from stage 6. Output: .int.c file on disk.

Stage 8: exit_with_status -- `sub_5AF1D0` (0x5AF1D0)

Source: host_envir.c

Maps internal compilation status to process exit codes:

Internal Status	Meaning	Exit Code	Action
3, 4, 5	Success	0	`exit(0)`
8	Errors reported (assigned in `main()` when `qword_126ED90` is nonzero)	2	`exit(2)`
9, 10	Errors that terminate compilation	4	`exit(4)` + `"Compilation terminated."`
11	Internal error	--	`abort()` + `"Compilation aborted."`

In SARIF mode (dword_106BBB8), text messages are suppressed but exit codes remain the same.
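
The status table reduces to a single switch. In this sketch the function name and the `STATUS_ABORTS` sentinel are hypothetical. Unlisted status values are treated like 11, following the "(other)" row of the Exit Code Mapping table on the Entry Point page.

```c
/* Sentinel meaning "the process aborts instead of exiting" (hypothetical). */
#define STATUS_ABORTS (-1)

/* Maps an internal status to the process exit code described in the table. */
static int process_exit_code(int internal_status)
{
    switch (internal_status) {
    case 3: case 4: case 5:
        return 0;
    case 8:
        return 2;
    case 9: case 10:
        return 4;
    default:                  /* 11 and any unlisted value */
        return STATUS_ABORTS;
    }
}
```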

Key Global Variables Controlling Flow

Variable	Address	Type	Role
`dword_106C254`	`0x106C254`	int	Skip-backend flag. Set to 1 when `qword_126ED90` (error count) is nonzero after frontend. Prevents stage 7 from running.
`dword_106C0A4`	`0x106C0A4`	int	Timing flag. When set, `sub_5AF350`/`sub_5AF390` bracket each phase with CPU + wall-clock timestamps.
`dword_126EFB4`	`0x126EFB4`	int	Language mode. 1=K&R C, 2=C++. Controls C++ class finalization in pass 4 of `fe_wrapup`, keyword set selection, and backend behavior. In CUDA mode, always 2.
`qword_126ED90`	`0x126ED90`	qword	Error count. Checked after stages 5-6 to decide whether to run backend. Nonzero skips needed-flags, keep-in-IL marking, and dead entity elimination passes in fe_wrapup.
`qword_126EEE0`	`0x126EEE0`	char*	Source (input) filename. Set during CLI parsing and passed to `sub_7A40A0` for TU naming. Used by backend to construct `.int.c` path.
`dword_1065850`	`0x1065850`	int	Device stub mode. Toggled during backend generation: 1 = currently emitting device stub code (changes parameter types, suppresses bodies).
`dword_106C064`	`0x106C064`	int	Stack limit flag. When set, main adjusts RLIMIT_STACK to max before entering frontend (deep recursion in parser/template engine).

Timing Regions

When dword_106C0A4 is set (via --timing or equivalent flag), three timing regions are printed:

Front end time                     12.34 (CPU)     15.67 (elapsed)
Back end time                       3.45 (CPU)      4.56 (elapsed)
Total compilation time             15.79 (CPU)     20.23 (elapsed)

Format string: "%-30s %10.2f (CPU) %10.2f (elapsed)\n"

The timing is implemented via sub_5AF350 (capture_time: records clock() as CPU milliseconds and time() as wall seconds) and sub_5AF390 (report_timing: computes deltas and prints).
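
A portable approximation of the two helpers is shown below. The `struct timestamp` layout and the `format_timing` name are assumptions (the real timestamp record is not documented here); only the format string and the use of clock() and time() come from the analysis.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical layout of one captured timestamp. */
struct timestamp {
    double cpu_ms;   /* clock() converted to milliseconds */
    time_t wall_s;   /* time() in whole seconds */
};

static void capture_time(struct timestamp *ts)
{
    ts->cpu_ms = (double)clock() * 1000.0 / CLOCKS_PER_SEC;
    ts->wall_s = time(NULL);
}

/* Formats one region line into buf and returns snprintf's result. */
static int format_timing(char *buf, size_t n, const char *label,
                         const struct timestamp *start,
                         const struct timestamp *end)
{
    double cpu  = (end->cpu_ms - start->cpu_ms) / 1000.0;   /* back to seconds */
    double wall = difftime(end->wall_s, start->wall_s);
    return snprintf(buf, n, "%-30s %10.2f (CPU) %10.2f (elapsed)\n",
                    label, cpu, wall);
}
```

A caller would bracket a region with two capture_time() calls and pass both records to format_timing(), mirroring how main() pairs sub_5AF350 with sub_5AF390.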

Region	Start	End	Covers
Front end	After `sub_585DB0` (fe_one_time_init)	After `sub_588F90` (fe_wrapup)	Stages 4-6: TU reset, parse, IL build, wrapup
Back end	Before `sub_489000`	After `sub_489000`	Stage 7: .int.c generation
Total	After `sub_585D60` (fe_pre_init), before `sub_459630` (CLI)	Before `sub_5AF1D0` (exit)	Stages 2-8: CLI parsing through exit

Error Recovery Loop

The main() function contains a while(1) loop that appears to support re-compilation (the TU processing infrastructure has a dword_106BA08 "is_recompilation" flag and sub_7A40A0 checks an a2 recompilation parameter). In practice, for the standard CUDA compilation flow, this loop executes exactly once: sub_5AF1D0 is __noreturn and terminates the process.

The loop body:

sub_6B8B20(0) -- reset file state for the source file manager
sub_589530() -- write output signoff (sub_5AEE00) + close source manager (sub_6B8DE0)
Compute exit code from qword_126ED90 (errors) and qword_126ED88 (additional status)
Print total timing if enabled
Restore stack limit if it was raised
sub_5AF1D0(exit_code) -- terminate

Cross-References

Entry Point & Initialization -- detailed breakdown of stages 1-3
CLI Processing -- all 276 flags parsed in stage 2
Frontend Invocation -- stage 5 (parse + IL build) in depth
Frontend Wrapup -- 5-pass architecture of stage 6
Backend Code Generation -- stage 7 (.int.c emission) in depth
Timing & Exit -- stage 8 and exit code mapping
Device/Host Separation -- how the backend filters device vs host code
Kernel Stub Generation -- __wrapper__device_stub_ pattern
Extended Lambda Overview -- lambda wrapper generation in backend
.int.c File Format -- structure of the backend output

Entry Point & Initialization

main() at 0x408950 is a 488-byte __noreturn function that orchestrates the entire cudafe++ compilation pipeline. It takes the standard POSIX signature (int argc, char **argv, char **envp), performs two phases of subsystem initialization, optionally raises the process stack limit, then runs the frontend, backend, and exit sequence in a linearized loop that executes exactly once. The function has 22 direct callees (including getrlimit, setrlimit, and library calls) and never returns -- sub_5AF1D0 at the bottom of the loop calls exit() or abort().

Key Facts

Property	Value
Address	`0x408950`
Size	488 bytes
Source file	`fe_init.c` / `host_envir.c` (initialization); `fe_wrapup.c` (finalization)
Signature	`void __noreturn main(int argc, char **argv, char **envp)`
Direct callees	22 (9 pre-init + CLI + heavy-init + 5 pipeline stages + timing/exit helpers)
Stack frame	`0x88` bytes (136 bytes: 6 timing stamps + rlimit struct + alignment)
Attribute	`__noreturn` -- the `while(1)` loop terminates via `sub_5AF1D0` which calls `exit()`/`abort()`

Annotated Decompilation

void __noreturn main(int argc, char **argv, char **envp)
{
    rlim_t original_stack;
    bool stack_was_raised = false;         // becomes true only if setrlimit() succeeds
    uint8_t exit_code;
    struct rlimit rlimits;
    timestamp_t t_total_start, t_fe_start, t_fe_end, t_be_start, t_be_end, t_total_end;

    // --- Redirect diagnostic output to stderr ---
    s = stderr;                            // 0x126EDF0 alias
    qword_126EDF0 = stderr;                // diagnostic stream

    // === PHASE 1: Pre-initialization (9 subsystem calls) ===
    sub_585D60(argc, argv, envp);          // fe_pre_init

    // --- Capture total compilation start time ---
    sub_5AF350(&t_total_start);            // capture_time

    // === PHASE 2: Command-line parsing ===
    sub_459630(argc, argv);                // proc_command_line (276 flags)

    // === Stack limit adjustment ===
    if (dword_106C064                      // --modify_stack_limit (default: ON)
        && !getrlimit(RLIMIT_STACK, &rlimits))
    {
        original_stack = rlimits.rlim_cur;
        rlimits.rlim_cur = rlimits.rlim_max;  // raise to hard limit
        stack_was_raised = (setrlimit(RLIMIT_STACK, &rlimits) == 0);
    }

    // === PHASE 3: Heavy initialization (38 subsystem calls + validation) ===
    sub_585DB0();                          // fe_one_time_init
    //   └─ sub_585EE0()  fe_init_part_1  (33 per-unit inits, output file, keywords)

    if (dword_106C0A4)                     // --timing enabled?
        sub_5AF350(&t_fe_start);           // capture frontend start

    // === PHASE 4: Translation unit setup ===
    sub_7A4860();                          // reset_tu_state (zero 6 TU globals)

    // === PHASE 5: Frontend parse + IL build ===
    sub_7A40A0(qword_126EEE0);            // process_translation_unit

    // === PHASE 6: Frontend wrapup (5-pass IL finalization) ===
    sub_588F90(qword_126EEE0, 1);         // fe_wrapup

    if (dword_106C0A4) {
        sub_5AF350(&t_fe_end);
        sub_5AF390("Front end time", &t_fe_start, &t_fe_end);
    }

    // --- Error gate: skip backend if frontend had errors ---
    if (!qword_126ED90) goto backend;     // no errors → run backend
    dword_106C254 = 1;                    // skip-backend flag

    // === Linearized exit loop (executes once) ===
    while (1) {
        exit_code = 8;                    // default: kept only when errors were reported
        sub_6B8B20(0);                    // reset file state
        sub_589530();                     // write signoff + close source mgr

        if (!qword_126ED90)               // re-check after wrapup
            exit_code = qword_126ED88 ? 5 : 3;  // success codes

        if (dword_106C0A4) {
            sub_5AF350(&t_total_end);
            sub_5AF390("Total compilation time", &t_total_start, &t_total_end);
        }

        if (stack_was_raised) {           // restore original stack limit
            rlimits.rlim_cur = original_stack;
            setrlimit(RLIMIT_STACK, &rlimits);
        }

        sub_5AF1D0(exit_code);            // __noreturn: exit() or abort()

    backend:
        if (!dword_106C254) {             // backend not skipped
            if (dword_106C0A4)
                sub_5AF350(&t_be_start);
            sub_489000();                 // process_file_scope_entities (backend)
            if (dword_106C0A4) {
                sub_5AF350(&t_be_end);
                sub_5AF390("Back end time", &t_be_start, &t_be_end);
            }
        }
    }
}

The while(1) never actually loops. The call to sub_5AF1D0 is __noreturn (it calls exit() or abort() internally), so control never reaches the second iteration. The compiler arranged the basic blocks this way because the backend code at backend: is reached via a goto from the error-gate check, placing it logically "after" the exit call in the CFG.
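
Written without the goto, the same control flow is a straight line. The sketch below is a structured reading of the decompilation, not the binary's actual shape: the globals are local stand-ins, timing, signoff and rlimit restoration are left out, and the function returns the status where the real code calls sub_5AF1D0 and never returns.

```c
/* Hypothetical stand-ins for the globals and stages named in the decompilation. */
static long error_count;    /* qword_126ED90 */
static long extra_status;   /* qword_126ED88 */
static int  backend_ran;    /* records whether the backend stage executed */

static void run_backend(void)   /* stands in for sub_489000 */
{
    backend_ran = 1;
}

/* Returns the internal status that main() would hand to sub_5AF1D0. */
static int finish_compilation(void)
{
    if (error_count == 0)
        run_backend();          /* the "backend:" block, reached via goto */

    /* Body of the while(1) loop, executed exactly once. */
    int status = 8;
    if (error_count == 0)
        status = extra_status ? 5 : 3;
    return status;
}
```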

Phase 1: fe_pre_init -- `sub_585D60` (0x585D60)

The first thing main() does after redirecting stderr is call sub_585D60, which performs the absolute minimum initialization needed before command-line parsing can proceed. This function lives in fe_init.c and makes 9 sequential calls to subsystem pre-initializers, plus two inline global assignments.

Pre-Init Call Table

#	Address	Identity	Source	Purpose
1	`sub_48B3C0`	error_pre_init	error.c	Zero 4 error-tracking globals: `qword_1065870`=0, `qword_1065868`=0, `dword_1065860`=-1, `qword_1065858`=0
2	`sub_6BB290`	source_file_mgr_pre_init	srcfile.c	Zero 10 file descriptor table globals: file chain head, file count, file hash, include stack
3	`sub_5B1E70`	host_envir_early_init	host_envir.c	Heaviest pre-init call. Signal handlers, locale, CWD capture, env vars. See below.
4	`sub_752C90`	type_system_pre_init	type.c	Set `dword_126E4A8`=-1 (dialect version unset), call `sub_7515D0` (type table alloc), set host compiler defaults (`qword_126E1F0`=70300 = GCC 7.3.0 default), init 3 type comparison descriptor pools via `sub_465510`
5	`sub_45EB40`	cmd_line_pre_init	cmd_line.c	Zero the 272-flag was-set bitmap (`byte_E7FF40`, 0x110 bytes), set `dword_E7FF20`=1 (skip argv[0]), initialize ~350 global config variables to defaults. Notable: `dword_106C064`=1 (stack limit adjustment ON by default)
6	`sub_4ED530`	declaration_pre_init	decls.c	Set `stderr` into two global stream pointers, zero error/warning counters (`qword_126ED80..qword_126EDE0`), set diagnostic defaults (`byte_126ED69`=5, `byte_126ED68`=8, `qword_126ED60`=100 max errors), clear 15.2KB diagnostic severity table (`byte_1067920`, 0x3B50 bytes)
7	`sub_6F6020`	il_pre_init	il.c	Zero 3 globals: `dword_12C6C8C`=0 (PCH event counter), `qword_12C6EC0`=0, `qword_12C6EB8`=0
--	(inline)	scope_index_init	fe_init.c	`dword_126C5E4 = -1` (current scope stack index = "none"), `dword_126C5C8 = -1` (secondary scope index = "none")
8	`sub_7A48B0`	tu_tracking_pre_init	trans_unit.c	Zero 13 TU tracking globals: source filename, compilation mode flags, TU stack pointers, PCH state
9	`sub_7C00F0`	template_pre_init	template.c	Single assignment: `dword_106BA20 = 0` (template nesting depth = 0)

host_envir_early_init (sub_5B1E70) Detail

This is the most substantial pre-init call. It initializes the host environment interface layer from host_envir.c:

Signal handlers (one-time, guarded by dword_E6E120):

Signal	Handler	Behavior
SIGINT (2)	`handler` at `0x5AF2C0`	Write newline to stderr, call `sub_5AF2B0(9)` which writes signoff then `exit(4)`
SIGTERM (15)	`handler` at `0x5AF2C0`	Same as SIGINT
SIGXCPU (24)	`sub_5AF270`	Print `"Internal error: CPU time limit exceeded.\n"`, call `sub_5AF1D0(11)` which calls `abort()`
SIGXFSZ (25)	`SIG_IGN`	Ignored (prevents crash on large output files)

After signal setup, dword_E6E120 is set to 0 so handlers are registered only once.
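
A one-time registration guarded by a flag might look like the sketch below. Function names are hypothetical, and the handlers are simplified: the binary writes a signoff and calls exit(4) or routes through sub_5AF1D0(11), whereas this sketch uses _exit() and abort() directly to stay async-signal-safe.

```c
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static int handlers_pending = 1;   /* plays the role of dword_E6E120 */

static void on_interrupt(int sig)
{
    (void)sig;
    (void)write(STDERR_FILENO, "\n", 1);
    _exit(4);                      /* simplified: no signoff is written here */
}

static void on_cpu_limit(int sig)
{
    static const char msg[] = "Internal error: CPU time limit exceeded.\n";
    (void)sig;
    (void)write(STDERR_FILENO, msg, sizeof msg - 1);
    abort();
}

/* Returns 1 if this call installed the handlers, 0 if they were already set. */
static int install_signal_handlers_once(void)
{
    if (!handlers_pending)
        return 0;
    signal(SIGINT,  on_interrupt);
    signal(SIGTERM, on_interrupt);
    signal(SIGXCPU, on_cpu_limit);
    signal(SIGXFSZ, SIG_IGN);      /* ignore file-size-limit signals */
    handlers_pending = 0;
    return 1;
}
```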

Locale: Calls newlocale(LC_NUMERIC_MASK, "C", 0) (newlocale takes a category mask rather than a category constant) and then uselocale() to force the C locale for numeric output. If either call fails, asserts with "could not set LC_NUMERIC locale" at host_envir.c:264.

Working directory: Iteratively calls getcwd() with a growing buffer (starting at 256 bytes, expanding by 256 on ERANGE) until it fits, then copies the result into qword_126EEA0 via permanent allocation.
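
The growing-buffer loop is a standard POSIX idiom. In this sketch, `capture_cwd` is a hypothetical name and plain malloc/realloc stands in for EDG's permanent allocator. It returns the buffer itself rather than copying it into a separate allocation.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns a heap-allocated copy of the current directory, or NULL on failure. */
static char *capture_cwd(void)
{
    size_t size = 256;                 /* initial buffer size from the text */
    char *buf = malloc(size);

    while (buf != NULL) {
        if (getcwd(buf, size) != NULL)
            return buf;                /* path fit into the buffer */
        if (errno != ERANGE)
            break;                     /* real failure, not "buffer too small" */
        size += 256;                   /* grow by 256 bytes and retry */
        char *grown = realloc(buf, size);
        if (grown == NULL)
            break;
        buf = grown;
    }
    free(buf);
    return NULL;
}
```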

Environment variables:

EDG_BASE -- read into qword_126EE38 (base path for EDG data files; empty string if unset)
EDG_SUPPRESS_ASSERTION_LINE_NUMBER -- if set and not "0", sets dword_126ED40 = 1 (suppress line numbers in internal assertion messages)
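
Both environment rules are simple predicates over the result of getenv(). The helper names below are hypothetical. Treating an empty but set value as enabled is an assumption drawn from the literal rule "set and not "0"".

```c
#include <stdlib.h>
#include <string.h>

/* Rule described for EDG_SUPPRESS_ASSERTION_LINE_NUMBER: set and not "0". */
static int env_value_enables(const char *value)
{
    return value != NULL && strcmp(value, "0") != 0;
}

/* EDG_BASE falls back to the empty string when the variable is unset. */
static const char *edg_base_or_empty(const char *value)
{
    return value != NULL ? value : "";
}
```

Typical use would be `env_value_enables(getenv("EDG_SUPPRESS_ASSERTION_LINE_NUMBER"))` and `edg_base_or_empty(getenv("EDG_BASE"))`.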

CPU time limit: Calls getrlimit(RLIMIT_CPU) then setrlimit() with rlim_cur = RLIM_INFINITY to disable the CPU time limit.

Global zeroing: Zeros ~50 host-environment globals including file descriptors, path buffers, platform flags, output filename pointers.

Language mode: Sets dword_126EFB4 = 2 (default to C++ mode -- this is later overridden by CLI parsing if -x c is specified).

Sentinel validation: Checks off_E6E0E0 against the string "last" to verify that the predef_macro_mode_names table was properly initialized at link time. On mismatch, asserts with "predef_macro_mode_names not initialized properly" at host_envir.c:6927.

Stack Limit Adjustment

Between CLI parsing and heavy initialization, main() conditionally raises the process stack limit:

if (dword_106C064 && !getrlimit(RLIMIT_STACK, &rlimits)) {
    original_stack = rlimits.rlim_cur;
    rlimits.rlim_cur = rlimits.rlim_max;   // raise soft to hard limit
    stack_was_raised = (setrlimit(RLIMIT_STACK, &rlimits) == 0);
}

The flag dword_106C064 is set to 1 by default in sub_45EB40 (cmd_line_pre_init) and can be disabled via the --modify_stack_limit=false CLI flag. The purpose is to prevent stack overflow during deep recursion in the C++ parser, template instantiation engine, and constexpr interpreter. After compilation completes (just before exit), main() restores the original rlim_cur value.

Phase 3: fe_one_time_init -- `sub_585DB0` (0x585DB0)

This is the heaviest initialization stage. It zeroes the token state (qword_126DD38 -- 6 bytes packed as a dword + word), optionally calls sub_5AF330 for profiling init if dword_106BD4C is set, then makes 38 sequential calls to subsystem one-time initializers.

One-Time Init Call Table

#	Address	Identity	Source file
1	`sub_752DF0`	type_one_time_init	type.c
2	`sub_5B1D40`	scope_one_time_init	scope.c
3	`sub_447430`	error_one_time_init	error.c
4	`sub_4B37F0`	preprocessor_one_time_init	preproc.c
5	`sub_4E8ED0`	declaration_one_time_init	decls.c
6	`sub_4C0840`	attribute_one_time_init	attribute.c
7	`sub_4A1B60`	name_one_time_init	lookup.c
8	`sub_4E9CF0`	declaration_one_time_init_2	decl_spec.c
9	`sub_4ED710`	declaration_one_time_init_3	declarator.c
10	`sub_510C30`	statement_one_time_init	stmt.c
11	`sub_56DC90`	exprutil_one_time_init	exprutil.c
12	`sub_5A5160`	expression_one_time_init	expr.c
13	`sub_603B00`	parser_one_time_init	parse.c
14	`sub_5CF7F0`	class_one_time_init	class_decl.c
15	`sub_65DC50`	overload_one_time_init	overload.c
16	`sub_69C8B0`	template_one_time_init	template.c
17	`sub_665A00`	instantiation_one_time_init	instantiate.c
18	`sub_689550`	exception_one_time_init	except.c
19	`sub_68F640`	conversion_one_time_init	convert.c
20	`sub_6B6510`	il_one_time_init	il.c
21	`sub_6BAE70`	srcfile_one_time_init	srcfile.c
22	`sub_6F5FC0`	il_walk_one_time_init	il_walk.c
23	`sub_6F8300`	il_one_time_init_2	il.c
24	`sub_6FDFF0`	lower_one_time_init	lower_il.c
25	`sub_726DC0`	mangling_one_time_init	lower_name.c
26	`sub_72D410`	mangling_one_time_init_2	lower_name.c
27	`sub_74B9A0`	typecheck_one_time_init	typecheck.c
28	`sub_710B70`	il_one_time_init_3	il.c
29	`sub_76D630`	codegen_one_time_init	cp_gen_be.c
30	`nullsub_11`	debug_one_time_init	debug.c (no-op)
31	`sub_7A4690`	allocation_one_time_init	il_alloc.c
32	`sub_7A3920`	pool_one_time_init	il_alloc.c
33	`sub_6A0E90`	template_one_time_init_2	template.c
34	`sub_418F80`	diagnostics_one_time_init	diag.c
35	`sub_5859C0`	extasm_one_time_init	extasm.c
36	`sub_751540`	type_one_time_init_2	type.c
37	`sub_7C25F0`	template_one_time_init_3	template.c
38	`sub_7DF400`	cuda_one_time_init	nv_transforms.c

The call order reflects dependency constraints: types before scopes, scopes before declarations, declarations before expressions, expressions before the parser, etc. Template initialization is split across three calls (#16, #33, #37) because different phases of template support depend on different subsystems being initialized first.

Function Pointer Table Validation

After all 38 initializers complete, sub_585DB0 performs a critical integrity check:

if (funcs_6F71AE || off_D560C0 != nullsub_6)
    sub_4F21C0("function_pointers is incorrectly initialized");

This validates two conditions:

funcs_6F71AE must be zero. This global acts as a "dirty flag" -- if any initializer wrote a nonzero value here, the table was not properly zeroed during static initialization.
off_D560C0 must point to nullsub_6 (0x585B00). The address off_D560C0 is the last entry in a function pointer dispatch table; 0xD560C0 falls inside the `.data` range of the segment layout, not `.rodata`. The empty function nullsub_6 acts as a sentinel -- its known address is compared against the table's last slot to verify that the table was correctly populated at link time. If the linker reordered or dropped entries, the sentinel would not match.
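
The two conditions amount to a dirty-flag test plus a last-slot comparison. The table contents, handler, and function names in this sketch are hypothetical; only the shape of the check follows the decompiled condition.

```c
#include <stddef.h>

typedef void (*table_fn)(void);

static int handler_calls;

static void sentinel_noop(void) {}                  /* plays the role of nullsub_6 */
static void some_handler(void) { ++handler_calls; } /* placeholder table entry */

/* Hypothetical dispatch table whose final slot must hold the sentinel. */
static table_fn dispatch_table[] = { some_handler, some_handler, sentinel_noop };

/* Returns 1 when the dirty flag is clear and the last slot is the sentinel. */
static int table_is_intact(const table_fn *table, size_t count, int dirty_flag)
{
    return dirty_flag == 0 && count > 0 && table[count - 1] == sentinel_noop;
}
```

Comparing only the last slot is cheap, and it catches a table that is truncated or shifted, which is the failure the sentinel is meant to detect.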

If either check fails, sub_4F21C0 emits a fatal diagnostic ("function_pointers is incorrectly initialized"). In the decompilation, control then continues into sub_585EE0 (fe_init_part_1). This appears to be a fallthrough out of the panic handler rather than a deliberate call: the error is non-recoverable and will likely cause crashes later, yet the code appears to continue.

On the normal path, where both checks pass, sub_585DB0 does not call sub_585EE0; it returns the return value of sub_7DF400() directly. The invocation of fe_init_part_1 in a normal compilation must therefore occur elsewhere in the pipeline, likely from within one of the subsystem initializers or from sub_7A40A0.

fe_init_part_1 -- `sub_585EE0` (0x585EE0)

This function performs per-compilation-unit initialization. It is identified by the debug trace string "fe_init_part_1" at level 5 and an assertion path fe_init.c:2007. Its responsibilities:

Compilation Timestamp

time(&timer);
char *t = ctime(&timer);
if (!t) t = "Sun Jan 01 00:00:00 1900\n";
if (strlen(t) > 127)
    assert("fe_init.c", 2007, "fe_init_part_1");  // buffer overflow guard
strcpy(byte_106B5C0, t);   // 128-byte timestamp buffer
dword_126EE48 = 1;         // init-complete flag

Per-Unit Initializer Call Table

After the timestamp, sub_585EE0 calls 33 per-compilation-unit initializers:

#	Address	Identity
1	`sub_4ED7C0`	declaration_unit_init
2	`nullsub_7`	(no-op placeholder)
3	`sub_65DC20`	overload_unit_init
4	`sub_6BB350`	srcfile_unit_init
5	`sub_5B22E0`	scope_unit_init
6	`sub_603B30`	parser_unit_init
7	`sub_5D0170`	class_unit_init
8	`sub_61EBD0`	expression_unit_init
9	`sub_68A0D0`	exception_unit_init
10	`sub_74BFF0`	typecheck_unit_init
11	`sub_710DE0`	il_unit_init
12	`sub_4E8F10`	declaration_unit_init_2
13	`sub_4C0860`	attribute_unit_init
14	`nullsub_2`	(no-op placeholder)
15	`sub_4474D0`	error_unit_init
16	`sub_665A60`	instantiation_unit_init
17	`sub_4E9D10`	decl_spec_unit_init
18	`sub_76D780`	codegen_unit_init
19	`sub_7C0300`	template_unit_init
20	`sub_7A3980`	pool_unit_init
21	`sub_56DEE0`	exprutil_unit_init
22	`nullsub_10`	(no-op placeholder)
23	`sub_6B6890`	il_unit_init_2
24	`sub_726EE0`	mangling_unit_init
25	`sub_6F5DA0`	il_walk_unit_init
26	`sub_6F8320`	il_unit_init_3
27	`sub_6FE130`	lower_unit_init
28	`sub_752FC0`	type_unit_init
29	`sub_4660B0`	folding_unit_init
30	`sub_5943E0`	float_unit_init
31	`sub_6A0F40`	template_unit_init_2
32	`sub_4190B0`	diagnostics_unit_init
33	`sub_7C2640`	template_unit_init_3

Compilation Mode Flags

After the per-unit initializers, sub_585EE0 copies global configuration values (set during CLI parsing) into the compilation-mode descriptor at 0x126EB88:

Field	Address	Source	Meaning
`byte_126EB88`	`0x126EB88`	`dword_126E498`	Dialect flags
`byte_126EBB0`	`0x126EBB0`	`dword_126EFB4 == 1`	K&R C mode
`dword_126EBA8`	`0x126EBA8`	`dword_126EFB4 != 2`	Not-C++ flag
`dword_126EBAC`	`0x126EBAC`	`dword_126EF68`	C standard version
`byte_126EBB8`	`0x126EBB8`	`dword_126EFB0`	Strict C mode
`byte_126EBB9`	`0x126EBB9`	`dword_126EFAC`	EDG GNU-compat extensions
`byte_126EBBA`	`0x126EBBA`	`dword_126EFA4`	Clang extensions enabled
`xmmword_126EBC0`	`0x126EBC0`	`qword_126EF90`	Clang + GNU version thresholds (16 bytes packed)

Output File Setup

if (dword_106C298) {                          // output enabled
    if (qword_106C278)                        // output path specified
        qword_106C280 = sub_4F48F0(qword_106C278, 0, 0, 16, 1513);  // open file (ID 1513)
    else
        qword_106C280 = stdout;               // default to stdout
}
sub_5AEDB0();                                 // write output header

The output file ID 1513 is one of three output file slots used during compilation (1513, 1514, 1515).

Initialization Summary

The total initialization sequence before parsing begins involves 80+ subsystem init calls across three layers:

main()
 ├─ sub_585D60()  fe_pre_init           9 subsystem pre-inits
 │   ├─ sub_48B3C0   error              4 globals zeroed
 │   ├─ sub_6BB290   srcfile            10 globals zeroed
 │   ├─ sub_5B1E70   host_envir         signals, locale, CWD, env vars, ~50 globals
 │   ├─ sub_752C90   types              type table alloc, compiler defaults
 │   ├─ sub_45EB40   cmd_line           272-flag bitmap, ~350 config defaults
 │   ├─ sub_4ED530   declarations       error counters, diagnostic severity table (15KB)
 │   ├─ sub_6F6020   il                 3 globals zeroed
 │   ├─ [inline]     scope indices      dword_126C5E4 = dword_126C5C8 = -1
 │   ├─ sub_7A48B0   tu_tracking        13 globals zeroed
 │   └─ sub_7C00F0   templates          1 global zeroed
 │
 ├─ sub_459630()  proc_command_line     276 flags → ~150 config globals
 │
 ├─ [RLIMIT_STACK adjustment]           raise soft limit to hard limit
 │
 └─ sub_585DB0()  fe_one_time_init      38 subsystem one-time inits
     ├─ token state zeroing             qword_126DD38 = 0 (6 bytes)
     ├─ 38 subsystem calls              types → scopes → errors → ... → CUDA
     ├─ sentinel check                  funcs_6F71AE == 0 && off_D560C0 == nullsub_6
     └─ sub_585EE0()  fe_init_part_1    (on error path, or called from subsystem)
         ├─ compilation timestamp       byte_106B5C0 via ctime()
         ├─ 33 per-unit inits           declarations → overload → ... → templates
         ├─ compilation mode flags      copy CLI config into descriptor struct
         ├─ output file open            stdout or file (ID 1513)
         └─ sub_5AEDB0()               write output header

Global State Set Before Parsing

By the time sub_7A40A0 (process_translation_unit) is called, the following critical globals have been established:

Global	Address	Value	Set by
`dword_126EFB4`	`0x126EFB4`	2 (C++)	`sub_5B1E70` default, may be overridden by CLI
`dword_126EF68`	`0x126EF68`	C/C++ standard version	CLI parsing
`dword_106C064`	`0x106C064`	1 (stack limit ON)	`sub_45EB40` default
`dword_106C0A4`	`0x106C0A4`	0 or 1	CLI `--timing` flag
`qword_126EEE0`	`0x126EEE0`	source filename	CLI parsing
`qword_106C280`	`0x106C280`	output FILE*	`sub_585EE0`
`qword_126EDF0`	`0x126EDF0`	stderr	`main()` + `sub_4ED530`
`dword_126EE48`	`0x126EE48`	1	`sub_585EE0` (init-complete flag)
`byte_106B5C0`	`0x106B5C0`	ctime string	`sub_585EE0` (compilation timestamp)
`dword_126C5E4`	`0x126C5E4`	-1 then updated	`sub_585D60` then scope init
`qword_126F120`	`0x126F120`	C locale handle	`sub_5B1E70`
`qword_126EEA0`	`0x126EEA0`	CWD string copy	`sub_5B1E70`

The Error Gate

The transition from frontend to backend is controlled by a simple error check:

if (!qword_126ED90)          // qword_126ED90 = error count from frontend
    goto backend_label;      // no errors → run backend
dword_106C254 = 1;           // errors → set skip-backend flag

When dword_106C254 == 1, the backend stage (sub_489000) is skipped entirely. The process still writes a signoff trailer and exits with a nonzero status code. This means a cudafe++ compilation with frontend errors produces no .int.c output file -- the backend never runs.

Exit Code Mapping

The exit function sub_5AF1D0 at 0x5AF1D0 maps internal status codes to process exit codes:

Internal Code	Meaning	Process Exit	Message
3, 4, 5	Success (various)	`exit(0)`	(none)
8	Errors reported (assigned in `main()` when `qword_126ED90` is nonzero)	`exit(2)`	(none)
9, 10	Errors that terminate compilation	`exit(4)`	`"Compilation terminated.\n"`
11	Internal error	`abort()`	`"Compilation aborted.\n"`
(other)	Unknown/fatal	`abort()`	(none)

In SARIF mode (dword_106BBB8 set), the text messages ("Compilation terminated.", "Compilation aborted.") are suppressed, but exit codes remain identical.

Cross-References

Pipeline Overview -- complete 8-stage pipeline diagram
CLI Processing -- detailed breakdown of sub_459630 and all 276 flags
Frontend Invocation -- sub_7A40A0 (process_translation_unit) internals
Frontend Wrapup -- 5-pass architecture of sub_588F90
Backend Code Generation -- sub_489000 (.int.c emission)
Timing & Exit -- sub_5AF350/sub_5AF390/sub_5AF1D0 details
EDG Overview -- EDG 6.6 source tree and NVIDIA modifications
EDG Lexer -- keyword registration performed during sub_5863A0

CLI Processing

proc_command_line (sub_459630) at 0x459630 is a 21,773-byte function (4,105 decompiled lines, 296 callees) in cmd_line.c that parses the entire cudafe++ command line. It registers 276 flags into a flat lookup table, iterates argv with prefix-matching against that table, dispatches each matched flag through a 275-case switch statement, then resolves language dialect settings and opens output files. This function is the second stage of the pipeline, called directly from main() at 0x408950 before any heavy initialization.

cudafe++ is rarely invoked by hand. NVIDIA's compiler driver nvcc decomposes its own options into the appropriate low-level flags, and additional flags can be forwarded with -Xcudafe <flag>. The full flag inventory is in CLI Flag Inventory; this page documents the implementation mechanics of the parsing system itself.

Key Facts

Property	Value
Address	`0x459630`
Binary size	21,773 bytes
Decompiled lines	4,105
Source file	`cmd_line.c`
Signature	`int64_t proc_command_line(int argc, char** argv)`
Direct callees	296
Flag table base	`dword_E80060`
Flag table entry size	40 bytes
Flag table capacity	552 entries (overflow panics via `sub_40351D`)
Registered flags	276
Switch cases	275 (case IDs 1--275)
Default-suppressed diagnostics	9 (1257, 1373, 1374, 1375, 1633, 2330, 111, 185, 175)

Flag Table Layout

The flag table is a contiguous array starting at dword_E80060. Each of the 552 slots occupies 40 bytes. The current entry count is tracked in dword_E80058.

Offset   Field            Type       Access pattern
------   -----            ----       --------------
+0       case_id          int32      dword_E80060[idx * 10]
+8       name             char*      qword_E80068[idx * 5]
+16      short_char       int16      word_E80070[idx * 20]    (low byte = char, high byte = 1)
+17      is_valid         int8       (high byte of short_char word, always 1)
+18      takes_value      int8       byte_E80072[idx * 40]
+19      visible          int8       (part of dword_E80080[idx * 10] at +32)
+20      is_boolean       int8       byte_E80073[idx * 40]    (note: 0xE80073 is base + 19, not + 20)
+24      name_length      int64      qword_E80078[idx * 5]    (precomputed strlen)
+32      mode_flag        int32      dword_E80080[idx * 10]

The flag-was-set bitmap at byte_E7FF40 spans 0x110 bytes (272 one-byte slots). When a flag is matched during parsing, the corresponding byte is set to 1 to record that the user explicitly provided it. The bitmap is zeroed by cmd_line_pre_init (sub_45EB40) before every compilation.
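
For reimplementation purposes, the layout table can be expressed as a C struct. This sketch assumes an LP64 target (8-byte pointers with natural alignment) and takes the field names and the +19/+20 byte assignments from the table as-is. The struct name and the capacity and bitmap macros are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define FLAG_TABLE_CAPACITY 552     /* slots in the table at dword_E80060 */
#define FLAG_WAS_SET_SLOTS  0x110   /* bytes in the map at byte_E7FF40 */

/* One 40-byte flag table entry; padding is implied by natural alignment. */
struct flag_entry {
    int32_t     case_id;       /* +0  switch dispatch ID */
    const char *name;          /* +8  flag name without dashes */
    char        short_char;    /* +16 single-character alias, 0 if none */
    int8_t      is_valid;      /* +17 always 1 for registered entries */
    int8_t      takes_value;   /* +18 */
    int8_t      visible;       /* +19 as named in the layout table */
    int8_t      is_boolean;    /* +20 as named in the layout table */
    int64_t     name_length;   /* +24 precomputed strlen(name) */
    int32_t     mode_flag;     /* +32 */
};
```

On such a target, `sizeof(struct flag_entry)` is 40, which matches the `idx * 10` (int32), `idx * 5` (qword), and `idx * 40` (byte) index scaling seen in the access patterns.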

Registration: sub_452010 (init_command_line_flags)

sub_452010 at 0x452010 is a 30,133-byte function (3,849 decompiled lines) that populates the entire flag table. It is called once, at line 280 of proc_command_line, before the parsing loop begins.

register_command_flag (sub_451F80)

Each flag is registered through sub_451F80 (25 lines), called approximately 275 times from sub_452010:

void register_command_flag(
    int    case_id,       // dispatch ID for the switch (1-275)
    char*  name,          // flag name without dashes ("preprocess", "timing", etc.)
    char   short_opt,     // single-char alias ('E', '#', etc.), 0 for none
    char   takes_value,   // 1 if the flag requires =<value>
    int    mode_flag,     // visibility/classification (mode vs. action)
    char   enabled        // whether the flag is active (1 = registered, 0 = disabled)
);

The function writes into the next free slot at index dword_E80058, precomputes strlen(name) into name_length, always sets the is_valid byte to 1, then increments the counter. If the counter reaches 552, it panics via sub_40351D -- the table is statically sized.
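
The registration step can be sketched as follows. The slot struct is deliberately simplified rather than the 40-byte binary layout, and abort() stands in for the sub_40351D panic. Checking capacity before the write is an assumption, since the exact position of the check relative to the increment is not shown here.

```c
#include <stdlib.h>
#include <string.h>

#define FLAG_CAPACITY 552

/* Simplified slot; field order does not mirror the binary layout. */
struct flag_slot {
    int         case_id;
    const char *name;
    char        short_opt;
    char        is_valid;
    char        takes_value;
    char        enabled;
    size_t      name_length;
    int         mode_flag;
};

static struct flag_slot flag_table[FLAG_CAPACITY];
static int flag_count;             /* plays the role of dword_E80058 */

static void register_flag(int case_id, const char *name, char short_opt,
                          char takes_value, int mode_flag, char enabled)
{
    if (flag_count >= FLAG_CAPACITY)
        abort();                   /* stand-in for the table-overflow panic */

    struct flag_slot *slot = &flag_table[flag_count++];
    slot->case_id     = case_id;
    slot->name        = name;
    slot->short_opt   = short_opt;
    slot->is_valid    = 1;         /* always set on registration */
    slot->takes_value = takes_value;
    slot->enabled     = enabled;
    slot->name_length = strlen(name);   /* precomputed for prefix matching */
    slot->mode_flag   = mode_flag;
}
```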

Paired Toggle Registration

Approximately half of all flags are boolean toggles registered as pairs: --flag and --no_flag share the same case_id but differ in which value they write. Pairs are registered in two ways:

Two sequential register_command_flag calls -- both point to the same case_id; the parsing loop determines whether the matched name starts with no_ and sets the target global to 0 or 1 accordingly.
Inline table population -- seven additional paired flags (relaxed_abstract_checking, concepts, colors, keep_restrict_in_signatures, check_unicode_security, old_id_chars, add_match_notes) are written directly into the array without going through register_command_flag.
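The no_ prefix test used for the first registration style can be sketched like this. The helper names `toggle_value` and `apply_toggle` are hypothetical, and the sketch assumes that the value written depends only on whether the matched name begins with `no_`.

```c
#include <string.h>

/* Returns the value a matched toggle name writes:
 * 0 for names beginning with "no_", 1 for everything else. */
int toggle_value(const char *matched_name)
{
    return strncmp(matched_name, "no_", 3) != 0;
}

/* Applies a matched toggle name to its target global. */
void apply_toggle(const char *matched_name, int *target)
{
    *target = toggle_value(matched_name);
}
```

Because both names of a pair share one case_id, a single switch case can serve `--flag` and `--no_flag` by calling a helper of this shape.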

Parsing Loop

Before parsing, proc_command_line performs five sequential setup steps (flag registration is the second), then enters the main argv iteration.

Pre-Loop Setup

Step 1:  Initialize qword_126DD38, qword_126EDE8 (token state / source position)
Step 2:  Call sub_452010() -- register all 276 flags
Step 3:  Allocate 4 hash tables (16-byte header + 256-byte data each):
           qword_106C248  macro define/alias map
           qword_106C240  include path list
           qword_106C238  system include map
           qword_106C228  additional system include map
Step 4:  Suppress 9 diagnostic numbers by default:
           1257, 1373, 1374, 1375, 1633, 2330, 111, 185, 175
         Each via sub_4ED400(number, suppress_severity, 1)
Step 5:  Set dword_E7FF20 = 1 (argv index, skipping argv[0])

The default-suppressed diagnostics are EDG warnings that NVIDIA considers noise for CUDA compilation. Diagnostics 111 ("statement is unreachable"), 185 ("pointless comparison of unsigned integer with zero"), and 175 ("subscript out of range") are common false positives in template-heavy CUDA code.

argv Iteration

The loop processes argv[dword_E7FF20] through argv[argc-1]. For each argument:

Dash detection -- if the argument does not start with -, it is treated as the input filename (stored in qword_126EEE0). Only one non-flag argument is expected.
Short flag matching -- for single-dash arguments (-X), the parser scans the flag table for an entry whose short_char matches. If the flag takes_value, the next argv element is consumed as the value.
Long flag matching -- for double-dash arguments (--flag-name), the parser calls parse_flag_name_value (sub_451EC0) to split on =:

// sub_451EC0: split "--name=value" into name and value
// Respects backslash escapes and quoted strings
// If no '=' found: *name_out = src, *value_out = NULL
void parse_flag_name_value(char* src, char** name_out, char** value_out);
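A splitter with the documented contract might look like the sketch below. It is not a transcription of sub_451EC0: the name `split_name_value` is hypothetical, and the in-place NUL termination, the treatment of a backslash as escaping exactly one following character, and the use of double quotes only are all assumptions.

```c
#include <stddef.h>

/* Splits src in place at the first '=' that is neither backslash-escaped
 * nor inside double quotes. If no such '=' exists, *value_out is NULL. */
void split_name_value(char *src, char **name_out, char **value_out)
{
    int in_quotes = 0;

    *name_out  = src;
    *value_out = NULL;
    for (char *p = src; *p; p++) {
        if (*p == '\\' && p[1] != '\0') {
            p++;                     /* skip the escaped character */
        } else if (*p == '"') {
            in_quotes = !in_quotes;  /* toggle quoted state */
        } else if (*p == '=' && !in_quotes) {
            *p = '\0';               /* terminate the name portion */
            *value_out = p + 1;
            return;
        }
    }
}
```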

The name portion is then matched against the flag table using strncmp with each entry's precomputed name_length. The parser iterates all entries and counts exact and prefix matches:

Exact match (length equals name_length and strncmp returns 0) -- dispatches immediately.
Unique prefix match (only one entry's name starts with the given prefix) -- dispatches to that entry.
Ambiguous prefix (multiple entries match the prefix) -- emits error 923 ("ambiguous command-line option").
No match -- the argument is silently ignored or treated as input.
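The four outcomes above can be sketched as a single lookup function. The type `flag_name`, the function `lookup_flag`, and the two negative return codes are hypothetical; the real parser uses the precomputed name_length field instead of calling strlen on each entry.

```c
#include <string.h>

enum { MATCH_NONE = -1, MATCH_AMBIGUOUS = -2 };

typedef struct {
    const char *name;
    int         case_id;
} flag_name;

/* Returns the case_id for an exact or unique-prefix match,
 * MATCH_AMBIGUOUS if several entries share the prefix,
 * and MATCH_NONE if nothing matches. */
int lookup_flag(const flag_name *table, int count, const char *name)
{
    size_t len = strlen(name);
    int prefix_hits = 0;
    int prefix_id = MATCH_NONE;

    for (int i = 0; i < count; i++) {
        if (strncmp(name, table[i].name, len) != 0)
            continue;                      /* not even a prefix */
        if (strlen(table[i].name) == len)
            return table[i].case_id;       /* exact match wins outright */
        prefix_hits++;
        prefix_id = table[i].case_id;
    }
    if (prefix_hits > 1)
        return MATCH_AMBIGUOUS;            /* caller would emit error 923 */
    return prefix_id;                      /* unique prefix or MATCH_NONE */
}
```

Returning on the exact match before looking at the prefix count keeps a flag such as `c` usable even when longer names begin with the same letter.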

Conflict Detection

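Before the main loop, check_conflicting_flags (sub_451E80, 15 lines) validates that mutually exclusive flags were not specified together. It checks byte_E7FFF2 || byte_E80031 || byte_E80032 || byte_E80033, identified here as flags 3, 193, 194, and 195. Measured from the bitmap base byte_E7FF40, however, these bytes sit at offsets 178, 241, 242, and 243, which under case_id indexing would be c99, c11, c17, and c23, so the mapping to case IDs is uncertain. If any conflict is detected, it emits error 1027 via sub_4F8480.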

The Dispatch Switch (275 Cases)

After a flag is matched, its case_id indexes into a giant switch statement occupying the bulk of proc_command_line. The following sections document the most important cases grouped by function.

Preprocessor Control (Cases 3--9)

Case	Flag	Global(s)	Behavior
3	`no_line_commands`	`dword_106C29C`=1, `dword_106C294`=1, `dword_106C288`=0	Suppress `#line` in preprocessor output
4	`preprocess`	`dword_106C29C`=1, `dword_106C294`=1, `dword_106C288`=1	Preprocessor-only mode (output to stdout)
5	`comments`	(flag bitmap)	Preserve comments in preprocessor output
6	`old_line_commands`	(flag bitmap)	Use old-style `# N "file"` line directives
8	`dependencies`	(multiple)	Dependencies output mode (preprocessor-only + dependency emission)
9	`trace_includes`	(flag bitmap)	Print each `#include` as it is opened

Compilation Mode (Cases 14, 20--26)

Case	Flag	Global	Behavior
14	`no_code_gen`	`dword_106C254 = 1`	Parse-only mode -- sets the skip-backend flag, preventing `process_file_scope_entities` from running
20	`timing`	`dword_106C0A4 = 1`	Enable compilation phase timing. `main()` checks this flag to decide whether to call `sub_5AF350`/`sub_5AF390` for "Front end time", "Back end time", "Total compilation time"
21	`version`	(stdout)	Print the version banner and continue (does not exit). Banner includes: `"cudafe: NVIDIA (R) Cuda Language Front End"`, `"Portions Copyright (c) 2005, 2024-YYYY NVIDIA Corporation"`, `"Portions Copyright (c) 1988-2018, 2024 Edison Design Group Inc."`, `"Based on Edison Design Group C/C++ Front End, version 6.6"`, `"Cuda compilation tools, release 13.0, V13.0.88"`
22	`no_warnings`	`byte_126ED69 = 7`	Set diagnostic severity threshold to error-only (suppress all warnings and remarks)
23	`promote_warnings`	`byte_126ED68 = 5`	Promote all warnings to errors
24	`remarks`	`byte_126ED69 = 4`	Lower threshold to include remark-level diagnostics
25	`c`	calls `sub_44C4F0(0)`	Force C language mode (overrides default C++ if currently in C++ mode)
26	`c++`	calls `sub_44C4F0(2)`	Force C++ language mode

Diagnostic Control (Cases 39--44)

Cases 39--43 (diag_suppress, diag_remark, diag_warning, diag_error, diag_once) share the same value-parsing logic:

1. Read the value string (after '=')
2. Strip leading/trailing whitespace
3. Split on commas
4. For each token:
   a. Parse as integer (diagnostic number)
   b. Call sub_4ED400(number, severity, 1)

The severity values map to:

Suppress = skip entirely
Remark = informational (level 4)
Warning = default warning (level 5)
Error = hard error (level 7)
Once = emit on first occurrence only
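The shared value-parsing steps can be sketched as below. The function `apply_diag_list` is hypothetical, the recording stub stands in for sub_4ED400 so the example has no compiler state to mutate, and stopping at the first non-numeric token is an assumption.

```c
#include <ctype.h>
#include <stdlib.h>

#define MAX_RECORDED 32

/* Stand-in for sub_4ED400: records calls instead of changing state. */
int recorded_number[MAX_RECORDED];
int recorded_severity[MAX_RECORDED];
int recorded_count;

static void set_diagnostic_severity_stub(int number, int severity, int flag)
{
    (void)flag;
    if (recorded_count < MAX_RECORDED) {
        recorded_number[recorded_count]   = number;
        recorded_severity[recorded_count] = severity;
        recorded_count++;
    }
}

/* Parses a list such as " 111, 185 ,175" and applies one severity
 * to every diagnostic number in it. */
void apply_diag_list(const char *value, int severity)
{
    const char *p = value;

    while (*p) {
        while (*p == ',' || isspace((unsigned char)*p))
            p++;                           /* skip separators and padding */
        if (!*p)
            break;
        char *end;
        long number = strtol(p, &end, 10);
        if (end == p)
            break;                         /* not a number: stop (assumed) */
        set_diagnostic_severity_stub((int)number, severity, 1);
        p = end;
    }
}
```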

Case 44 (display_error_number / no_display_error_number) toggles whether error codes appear in diagnostic messages.

CUDA-Specific Flags (Cases 45--91)

Output File Paths

Case	Flag	Global	Description
45	`gen_c_file_name`	`qword_106BF20`	Path for the generated `.int.c` file
85	`gen_device_file_name`	(has_arg global)	Device-side output file name
86	`stub_file_name`	(has_arg global)	Stub file output path
87	`module_id_file_name`	(has_arg global)	Module ID file path
88	`tile_bc_file_name`	(has_arg global)	Tile bitcode file path

Data Model (Cases 65--66, 90--91)

Case	Flag	Behavior
65	`force-lp64`	LP64 model: pointer size=8, long size=8, specific type encodings for 64-bit
66	`force-llp64`	LLP64 model (Windows): pointer size=8, long size=4
90	`m32`	ILP32 model: all type sizes set for 32-bit (pointer=4, long=4, etc.)
91	`m64`	64-bit mode (default on Linux x86-64)
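For reference, the three data models differ only in the widths of `long` and pointers. The sketch below uses the conventional definitions of ILP32, LP64, and LLP64; the `data_model` struct and `make_data_model` function are hypothetical and do not mirror the globals that cudafe++ actually writes.

```c
typedef struct {
    int int_size;
    int long_size;
    int long_long_size;
    int pointer_size;
} data_model;

typedef enum { MODEL_ILP32, MODEL_LP64, MODEL_LLP64 } model_kind;

/* Returns type sizes in bytes for the requested data model. */
data_model make_data_model(model_kind kind)
{
    data_model m = { 4, 4, 8, 4 };          /* ILP32 baseline */

    if (kind == MODEL_LP64) {
        m.long_size    = 8;                 /* long and pointers widen */
        m.pointer_size = 8;
    } else if (kind == MODEL_LLP64) {
        m.pointer_size = 8;                 /* long stays at 4 bytes */
    }
    return m;
}
```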

Device Compilation Control

Case	Flag	Global	Description
46	`msvc_target_version`	`dword_126E1D4`	MSVC version for compatibility emulation
47	`host-stub-linkage-explicit`	boolean	Use explicit linkage on generated host stubs
48	`static-host-stub`	boolean	Generate static (internal linkage) host stubs
49	`device-hidden-visibility`	boolean	Apply hidden visibility to device symbols
52	`no-device-int128`	boolean	Disable `__int128` type support on device
53	`no-device-float128`	boolean	Disable `__float128` type support on device
54	`fe-inlining`	`dword_106C068 = 1`	Enable frontend inlining pass
55	`modify-stack-limit`	`dword_106C064`	Whether `main()` raises the process stack limit via `setrlimit`. Default is ON. Value parsed as integer: nonzero enables, zero disables.
71	`keep-device-functions`	boolean	Do not strip unused device functions
72	`device-syntax-only`	boolean	Device-side syntax check without code generation
77	`device-c`	boolean	Relocatable device code (RDC) mode
82	`debug_mode`	`dword_106BFC4`=1, `dword_106BFC0`=1, `dword_106BFBC`=1	Full debug mode (sets three debug globals simultaneously)
89	`tile-only`	boolean	Tile-only compilation mode

Template Instantiation (Case 16)

The instantiate flag takes a string value and sets dword_106C094:

Value	`dword_106C094`	Meaning
`"none"`	0	No implicit instantiation
`"all"`	1	Instantiate all referenced templates
`"used"`	2	Instantiate only used templates
`"local"`	3	Local instantiation only
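The string-to-mode mapping in the table can be sketched as a small lookup. The function name `parse_instantiate_mode` is hypothetical, and returning -1 for an unrecognized string is an assumption, since the handling of invalid values is not documented here.

```c
#include <string.h>

/* Returns 0..3 per the table above, or -1 for an unrecognized string. */
int parse_instantiate_mode(const char *value)
{
    static const char *const names[] = { "none", "all", "used", "local" };

    for (int i = 0; i < 4; i++)
        if (strcmp(value, names[i]) == 0)
            return i;                /* array index equals the stored mode */
    return -1;
}
```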

Include and Macro Arguments (Cases 29--31, 167)

Cases 29 (include_directory / -I) and 167 (sys_include) append entries to linked lists via sub_4595D0:

// sub_4595D0: append_to_linked_list
// Allocates a 24-byte node: {next_ptr, string_ptr, int_field}
// Appends to singly-linked list with head/tail pointers
void append_to_linked_list(list_head*, char* string, int type);

A special case: -I- (the literal string "-") sets a flag for stdin include mode rather than appending to the path list. It calls sub_5AD0A0 for the actual path registration.

Case 30 (define_macro / -D) builds a linked list of macro definitions via sub_4595D0. Case 31 (undefine_macro / -U) allocates the same 24-byte node but marks the int_field as 1 to indicate undefine.
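A tail-append list with the documented 24-byte node shape can be sketched as follows, assuming an LP64 target. The names `arg_node`, `arg_list`, and `append_arg` are hypothetical, and `malloc` stands in for whatever allocator the binary uses.

```c
#include <stdlib.h>

typedef struct arg_node {
    struct arg_node *next;     /* +0  */
    const char      *string;   /* +8  */
    int              kind;     /* +16 e.g. 0 = define, 1 = undefine */
} arg_node;                    /* 24 bytes on LP64 */

typedef struct {
    arg_node *head;
    arg_node *tail;
} arg_list;

/* Appends at the tail so command-line order is preserved.
 * Returns 0 on success, -1 if allocation fails. */
int append_arg(arg_list *list, const char *string, int kind)
{
    arg_node *n = malloc(sizeof *n);

    if (!n)
        return -1;
    n->next   = NULL;
    n->string = string;
    n->kind   = kind;
    if (list->tail)
        list->tail->next = n;  /* link behind the current tail */
    else
        list->head = n;        /* first node in the list */
    list->tail = n;
    return 0;
}
```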

Language Standard Selection (Cases 7, 178--179, 204, 228, 240--252)

These cases set dword_126EF68 -- the internal value of __cplusplus or __STDC_VERSION__:

Case(s)	Flag	`dword_126EF68`	Standard
228	`c++98`	199711	C++98/03
204	`c++11`	201103	C++11
240	`c++14`	201402	C++14
246	`c++17`	201703	C++17
251	`c++20`	202002	C++20
252	`c++23`	202302	C++23
178	`c99`	199901	C99 (calls `set_c_mode`)
179	`pre-c99`	199000	Pre-C99
241	`c11`	201112	C11
242	`c17`	201710	C17
243	`c23`	202311	C23
7	`old_c`	(K&R)	K&R C via `sub_44C4F0(1)`

SM Architecture Target (Case 245)

case 245:  // --target=<sm_arch>
    dword_126E4A8 = sub_7525E0(value_string);

sub_7525E0 parses the SM architecture string (e.g., "sm_90", "sm_100") and returns the internal architecture code stored in dword_126E4A8. This value gates which CUDA features are available during compilation (see Architecture Feature Gating).

Host Compiler Compatibility (Cases 95--98, 182--188)

Case	Flag	Globals	Behavior
182	`gcc / no_gcc`	`dword_126EFA8`, `dword_126EFB0`	Enable/disable GCC compatibility mode + GNU extensions
184	`gnu_version`	`qword_126EF98`	GCC version number (default: 80100 = GCC 8.1.0). Parsed as integer.
187	`clang / no_clang`	`dword_126EFA4`	Enable/disable Clang compatibility mode
188	`clang_version`	`qword_126EF90`	Clang version number (default: 90100 = Clang 9.1.0)
95	`pgc++`	boolean	PGI C++ compiler mode
96	`icc`	boolean	Intel ICC mode
97	`icc_version`	(has_arg)	Intel ICC version number
98	`icx`	boolean	Intel ICX (oneAPI DPC++) mode

Raw Flag Manipulation (Case 193)

case 193:  // --set_flag=<name>=<value>  or  --clear_flag=<name>
    // Looks up <name> in off_D47CE0 (a name-to-address lookup table)
    // Sets the corresponding global to <value> (integer)

This is a backdoor for nvcc to set arbitrary internal globals by name, used for flags that do not have dedicated case_id entries.

Output Mode (Case 274)

case 274:  // --output_mode=text  or  --output_mode=sarif
    if (strcmp(value, "text") == 0)
        output_mode = 0;     // plain text diagnostics (default)
    else if (strcmp(value, "sarif") == 0)
        output_mode = 1;     // SARIF JSON diagnostics

SARIF (Static Analysis Results Interchange Format) output is used by IDE integrations and CI pipelines. When enabled, diagnostic messages are emitted as structured JSON instead of traditional file:line: error: format.

Dump Options (Case 273)

case 273:  // --dump_command_options
    // Iterates the entire flag table
    // For each entry where is_valid == 1:
    //   printf("--%s ", name);
    // Then exits

This is a diagnostic/debug mode that prints every registered flag name and exits. Used by nvcc to discover the cudafe++ flag namespace.

Post-Parsing: Dialect Resolution

After the argv loop exits, proc_command_line enters a massive dialect resolution block (approximately 800 lines). This phase reconciles the various mode flags into a consistent configuration.

Input Filename Extraction

The last non-flag argv element is the input filename, stored in qword_126EEE0. This pointer is later passed to process_translation_unit (sub_7A40A0) in stage 5 of the pipeline.

Memory Region Initialization

Eleven memory regions (numbered 1--11) are initialized with default configurations. These correspond to CUDA memory spaces (global, shared, constant, local, texture, etc.) and are used by the frontend to track address space qualifiers.

GCC/Clang Feature Resolution

The resolver checks GCC version thresholds to decide which extensions to enable:

GCC version thresholds (encoded as major*10000 + minor*100 + patch):
  40299 (0x9D6B)  -- GCC 4.2.99 boundary
  40599 (0x9E97)  -- GCC 4.5.99 boundary
  40699 (0x9EFB)  -- GCC 4.6.99 boundary
  etc.

For each threshold, specific feature flags are conditionally enabled. For example, if GCC version >= 40599, rvalue references and variadic templates are enabled even if the language standard is technically C++03. This emulates how GCC provides extensions ahead of standards.
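The version encoding behind these thresholds is major*10000 + minor*100 + patch, which is how 8.1.0 becomes 80100. The macro and helper below are a hypothetical sketch of that arithmetic, not names recovered from the binary.

```c
/* major*10000 + minor*100 + patch, so GCC 8.1.0 becomes 80100. */
#define GNU_VERSION(major, minor, patch) \
    ((long)(major) * 10000L + (long)(minor) * 100L + (long)(patch))

/* True when the emulated compiler version meets a threshold. */
int gnu_version_at_least(long version, long threshold)
{
    return version >= threshold;
}
```

With this encoding, a boundary such as 40599 sits just below 4.6.0 (40600), so any 4.6 or later release passes the comparison.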

C++ Standard Feature Cascade

Based on the value of dword_126EF68 (__cplusplus), the resolver enables feature flags in a cascade:

199711 (C++98):  base features only
201103 (C++11):  + lambdas, rvalue_refs, auto_type, nullptr,
                   variadic_templates, unrestricted_unions,
                   delegating_constructors, user_defined_literals, ...
201402 (C++14):  + digit_separators, generic lambdas, relaxed_constexpr
201703 (C++17):  + exc_spec_in_func_type, aligned_new, if_constexpr,
                   structured_bindings, fold_expressions, ...
202002 (C++20):  + concepts, modules, coroutines, consteval, ...
202302 (C++23):  + deducing_this, multidimensional_subscript, ...
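The cascade can be sketched as a chain of threshold checks in which each standard adds to what the earlier ones enabled. The `feature_set` struct and `resolve_features` function are hypothetical, and only a handful of the features listed above are modeled.

```c
typedef struct {
    int lambdas, rvalue_refs, variadic_templates;   /* C++11 */
    int digit_separators, relaxed_constexpr;        /* C++14 */
    int if_constexpr, structured_bindings;          /* C++17 */
    int concepts, coroutines;                       /* C++20 */
    int deducing_this;                              /* C++23 */
} feature_set;

/* Enables every feature group whose standard is at or below cplusplus. */
feature_set resolve_features(long cplusplus)
{
    feature_set f = {0};

    if (cplusplus >= 201103L)
        f.lambdas = f.rvalue_refs = f.variadic_templates = 1;
    if (cplusplus >= 201402L)
        f.digit_separators = f.relaxed_constexpr = 1;
    if (cplusplus >= 201703L)
        f.if_constexpr = f.structured_bindings = 1;
    if (cplusplus >= 202002L)
        f.concepts = f.coroutines = 1;
    if (cplusplus >= 202302L)
        f.deducing_this = 1;
    return f;
}
```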

Conflict Validation

Post-dialect resolution performs consistency checks:

If both gcc and clang modes are enabled, GCC takes precedence
If cfront_2.1 or cfront_3.0 is set alongside modern C++ features, features are silently disabled
If no_exceptions is set but coroutines is requested, coroutines are disabled (they require exceptions)

Output File Opening

After all flags are resolved:

The output .int.c file is opened (path from case 45/gen_c_file_name, or stdout if path is "-")
The error output file is opened if --error_output was specified (case 35)
The listing file is opened if --list was specified (case 33)

Default Diagnostic Severity Overrides

Nine diagnostic numbers are suppressed by default before any user --diag_suppress flags are processed:

Diagnostic	EDG meaning	Why suppressed
1257	(C++11 narrowing conversion in aggregate init)	Common in CUDA kernel argument forwarding
1373	(nonstandard extension used: zero-sized array in struct)	Used in CUDA runtime headers
1374	(nonstandard extension used: struct with no members)	Empty base optimization patterns
1375	(nonstandard extension used: unnamed struct/union)	Windows SDK compatibility
1633	(inline function linkage conflict)	Host/device function linkage edge cases
2330	(implicit narrowing conversion)	Template-heavy CUDA code triggers false positives
111	statement is unreachable	`__builtin_unreachable()` and device code control flow
185	pointless comparison of unsigned integer with zero	Generic template code comparing unsigned with zero
175	subscript out of range	Static analysis false positives in device intrinsics

Users can override these defaults with explicit --diag_error=111 (or similar) on the command line, since user-specified severity always wins.

Key Helper Functions

Function	Address	Lines	Identity	Role
`sub_451E80`	`0x451E80`	15	`check_conflicting_flags`	Validates mutually exclusive flags (3/193/194/195)
`sub_451EC0`	`0x451EC0`	57	`parse_flag_name_value`	Splits `--name=value` on `=`, respecting quotes and backslash escapes
`sub_451F80`	`0x451F80`	25	`register_command_flag`	Inserts one entry into the flag table
`sub_452010`	`0x452010`	3,849	`init_command_line_flags`	Registers all 276 flags (called once from `proc_command_line`)
`sub_4595D0`	`0x4595D0`	21	`append_to_linked_list`	Allocates 24-byte node, appends to `-D`/`-I` argument lists
`sub_45EB40`	`0x45EB40`	470	`default_init`	Zeros 350 global config variables + flag-was-set bitmap
`sub_44C4F0`	`0x44C4F0`	--	`set_c_mode`	Sets language mode: 0=C, 1=K&R, 2=C++
`sub_44C460`	`0x44C460`	--	`parse_integer_arg`	Parses string argument as integer (used by `error_limit`, etc.)
`sub_4ED400`	`0x4ED400`	--	`set_diagnostic_severity`	Sets severity for a single diagnostic number

Key Global Variables

Variable	Address	Type	Set by	Description
`dword_E80058`	`0xE80058`	int32	`register_command_flag`	Current flag table entry count (max 552)
`dword_E80060`	`0xE80060`	array	`register_command_flag`	Flag table base (40 bytes/entry)
`byte_E7FF40`	`0xE7FF40`	byte[272]	Parsing loop	Flag-was-set bitmap
`dword_E7FF20`	`0xE7FF20`	int32	`default_init`	Current argv index (initialized to 1)
`qword_126EEE0`	`0x126EEE0`	char*	Post-parse	Input source filename
`dword_106C254`	`0x106C254`	int32	Case 14	Skip-backend flag (`--no_code_gen`)
`dword_106C0A4`	`0x106C0A4`	int32	Case 20	Timing enabled (`--timing`)
`dword_126EF68`	`0x126EF68`	int32	Standard flags	`__cplusplus` / `__STDC_VERSION__` value
`dword_126EFB4`	`0x126EFB4`	int32	Mode flags	Language mode (0=unset, 1=C, 2=C++)
`dword_126EFA8`	`0x126EFA8`	int32	Case 182	GCC compatibility mode enabled
`dword_126EFA4`	`0x126EFA4`	int32	Case 187	Clang compatibility mode enabled
`qword_126EF98`	`0x126EF98`	int64	Case 184	GCC version (default 80100 = 8.1.0)
`qword_126EF90`	`0x126EF90`	int64	Case 188	Clang version (default 90100 = 9.1.0)
`dword_126EFB0`	`0x126EFB0`	int32	Case 182	GNU extensions enabled
`dword_106C064`	`0x106C064`	int32	Case 55	Modify stack limit (default 1)
`dword_126E4A8`	`0x126E4A8`	int32	Case 245	Target SM architecture code
`dword_106C094`	`0x106C094`	int32	Case 16	Template instantiation mode (0--3)
`byte_126ED69`	`0x126ED69`	int8	Cases 22/24	Diagnostic severity threshold
`byte_126ED68`	`0x126ED68`	int8	Case 23	Warning promotion threshold
`qword_106BF20`	`0x106BF20`	char*	Case 45	Output `.int.c` file path
`qword_106C248`	`0x106C248`	void*	Pre-loop	Macro define/alias hash table
`qword_106C240`	`0x106C240`	void*	Pre-loop	Include path hash table
`qword_106C238`	`0x106C238`	void*	Pre-loop	System include map hash table

Annotated Parsing Flow

int64_t proc_command_line(int argc, char** argv)
{
    // --- Phase 1: Global state init ---
    qword_126DD38 = 0;                         // zero token state
    qword_126EDE8 = 0;                         // zero source position

    // --- Phase 2: Register all flags ---
    init_command_line_flags();                  // sub_452010: 3849 lines, 276 flags

    // --- Phase 3: Allocate hash tables ---
    qword_106C248 = alloc_hash_table();        // macro defines/aliases
    qword_106C240 = alloc_hash_table();        // include paths
    qword_106C238 = alloc_hash_table();        // system includes
    qword_106C228 = alloc_hash_table();        // additional system includes

    // --- Phase 4: Default diagnostic suppressions ---
    set_diagnostic_severity(1257, SUPPRESS, 1);
    set_diagnostic_severity(1373, SUPPRESS, 1);
    set_diagnostic_severity(1374, SUPPRESS, 1);
    set_diagnostic_severity(1375, SUPPRESS, 1);
    set_diagnostic_severity(1633, SUPPRESS, 1);
    set_diagnostic_severity(2330, SUPPRESS, 1);
    set_diagnostic_severity(111,  SUPPRESS, 1);
    set_diagnostic_severity(185,  SUPPRESS, 1);
    set_diagnostic_severity(175,  SUPPRESS, 1);

    // --- Phase 5: Main parsing loop ---
    for (int i = 1; i < argc; i++) {
        char* arg = argv[i];
        if (arg[0] != '-') {
            qword_126EEE0 = arg;               // input filename
            continue;
        }

        // Split --name=value (long-flag path; short -X flags match on short_char)
        char *name, *value;
        parse_flag_name_value(arg + 2, &name, &value);  // sub_451EC0

        // Match against flag table
        size_t name_len = strlen(name);
        int match_count = 0;
        int matched_id = -1;
        for (int f = 0; f < dword_E80058; f++) {
            if (strncmp(name, flag_table[f].name, name_len) == 0) {
                matched_id = flag_table[f].case_id;
                if (name_len == flag_table[f].name_length) {
                    match_count = 1;           // exact match overrides prefix hits
                    break;
                }
                match_count++;                 // prefix match
            }
        }

        if (match_count > 1) {
            error(923);  // "ambiguous command-line option"
            continue;
        }
        if (match_count == 0)
            continue;                          // no match: nothing to dispatch

        byte_E7FF40[matched_id] = 1;           // mark flag as set

        switch (matched_id) {
            case 3:   /* no_line_commands */ ...
            case 4:   /* preprocess */      ...
            ...
            case 274: /* output_mode */     ...
            case 275: /* incognito */       ...
        }
    }

    // --- Phase 6: Post-parsing dialect resolution ---
    // ~800 lines: resolve gcc/clang versions, cascade C++ features,
    // validate consistency, open output files

    // --- Phase 7: Memory region init (1-11) ---
    // Initialize CUDA memory space descriptors

    return 0;
}

Case 21 (--version / -v) prints the following banner to stdout (does not exit):

cudafe: NVIDIA (R) Cuda Language Front End
Portions Copyright (c) 2005, 2024-YYYY NVIDIA Corporation
Portions Copyright (c) 1988-2018, 2024 Edison Design Group Inc.
Based on Edison Design Group C/C++ Front End, version 6.6 (BUILD_DATE BUILD_TIME)
Cuda compilation tools, release 13.0, V13.0.88

Case 92 (--Version / -V) prints a different copyright format and then calls exit(1). This variant is used for machine-parseable version queries.

Relationship to Pipeline

proc_command_line is called as stage 2 of the pipeline, after fe_pre_init (sub_585D60) has initialized signal handlers, locale, working directory, and default config:

main()
  |-- sub_585D60()           [1] fe_pre_init (10 subsystem pre-initializers)
  |-- sub_5AF350()               capture_time (total start)
  |-- sub_459630(argc, argv) [2] proc_command_line  <-- THIS FUNCTION
  |-- setrlimit()                conditional stack raise (gated by dword_106C064)
  |-- sub_585DB0()           [3] fe_one_time_init (38 subsystem initializers)
  ...

By the time proc_command_line returns, every global configuration variable is set to its final value. The subsequent fe_one_time_init phase reads these globals to configure keyword tables, type system parameters, and per-translation-unit state.
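The conditional stack raise shown in the call sequence could look like the sketch below. Only the use of `setrlimit` and the gating on dword_106C064 come from this document; the function name `maybe_raise_stack_limit` and the choice to raise the soft limit to the hard limit are assumptions.

```c
#include <sys/resource.h>

/* Raises the soft stack limit to the hard limit when enabled is nonzero.
 * Returns 0 on success or when disabled, -1 if a system call fails. */
int maybe_raise_stack_limit(int enabled)
{
    struct rlimit rl;

    if (!enabled)
        return 0;                          /* --modify-stack-limit=0 */
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;             /* soft limit up to hard limit */
    return setrlimit(RLIMIT_STACK, &rl);
}
```

Placing the call after proc_command_line is what lets the `modify-stack-limit` flag take effect before the deeply recursive parser runs.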

Frontend Invocation

process_translation_unit (sub_7A40A0, 1267 bytes at 0x7A40A0, from EDG source file trans_unit.c) is the main frontend workhorse -- stage 5 of the pipeline. Called once from main(), it orchestrates the entire transformation from .cu source text to a fully-built EDG IL tree. The function allocates a 424-byte translation unit descriptor, opens the source file via the lexer, drives the C++ parser to completion, runs semantic analysis on the parsed declarations, and finally performs per-TU wrapup (stop-token verification, class linkage checking, module finalization). By the time it returns, every declaration, type, expression, and statement from the source has been parsed into IL nodes, CUDA execution-space attributes have been resolved, and the TU is linked into the global TU chain ready for the 5-pass fe_wrapup stage.

Key Facts

Property	Value
Function	`sub_7A40A0` (`process_translation_unit`)
Binary address	`0x7A40A0`
Binary size	1267 bytes
EDG source	`trans_unit.c`
Confidence	DEFINITE (source path and function name embedded in assertion calls at trans_unit.c lines 696, 725, and 556)
Signature	`int process_translation_unit(char *filename, int is_recompilation, void *module_info)`
Direct callees	27
Debug trace entry	`"Processing translation unit %s\n"`
Debug trace exit	`"Done processing translation unit %s\n"`
TU descriptor size	424 bytes (allocated via `sub_6BA0D0`)
TU stack entry size	16 bytes (`[0]=next`, `[8]=tu_ptr`)

Annotated Decompilation

int process_translation_unit(char *filename,       // source file path
                              int is_recompilation, // nonzero on error-retry pass
                              void *module_info)    // non-NULL for C++20 module TUs
{
    bool is_primary = (module_info == NULL);

    // --- Debug trace on entry ---
    if (debug_verbosity > 0 || (debug_enabled && trace_category("trans_unit")))
        fprintf(stderr, "Processing translation unit %s\n", filename);

    // --- Module-mode state validation ---
    // If this is a primary TU (no module_info) but we've already seen a module TU,
    // that's an internal consistency error.
    if (is_recompilation)
        goto skip_validation;
    if (!is_primary) {
skip_validation:
        if (module_info)
            has_seen_module_tu = 1;                          // dword_12C7A88
        goto proceed;
    }
    if (has_seen_module_tu)
        assertion_failure("trans_unit.c", 696, "process_translation_unit", 0, 0);

proceed:
    // --- Save previous TU state if any ---
    if (current_translation_unit)                            // qword_106BA10
        save_translation_unit_state(current_translation_unit);  // sub_7A3A50

    // --- Reset per-TU compilation state ---
    current_source_position = 0;                             // qword_126DD38
    is_recompilation_flag = is_recompilation;                 // dword_106BA08
    current_filename = filename;                             // qword_106BA00
    has_module_info = (module_info != NULL);                  // dword_106B9F8

    // --- Initialize error/parser state ---
    reset_error_state();                                     // sub_5EAEC0
    if (is_recompilation)
        fe_init_part_1();                                    // sub_585EE0

    // ==========================================================
    //  PHASE 1: Allocate and initialize TU descriptor (424 bytes)
    // ==========================================================
    registration_complete = 1;                               // dword_12C7A8C
    tu_descriptor *tu = allocate_storage(424);               // sub_6BA0D0
    tu->next_tu = NULL;                                      // [0]
    ++tu_count;                                              // qword_12C7A78
    tu->storage_buffer = allocate_storage(per_tu_storage_size); // [16], sub_6BA0D0
    tu->tu_name = NULL;                                      // [8]
    init_scope_state((char *)tu + 24);                       // sub_7046E0, byte offsets [24..192]
    tu->field_192 = 0;
    tu->field_352 = 0;
    tu->field_184 = 0;
    memset(&tu->scope_decl_area, 0, ...);                    // [200..360] zeroed
    tu->field_360 = 0;
    tu->field_368 = 0;
    tu->field_376 = 0;
    tu->flags = 0x0100;                                      // [392] = "initialized"
    tu->error_severity_count = 0;                            // [408]
    tu->field_416 = 0;

    // --- Copy registered variable defaults into per-TU storage ---
    for (reg = registered_variable_list; reg; reg = reg->next) {
        if (reg->offset_in_tu)
            *(uint64_t *)((char *)tu + reg->offset_in_tu) = reg->variable_value;  // byte offset; 8-byte store assumed
    }

    // --- Set module info pointer and primary flag ---
    tu->module_info_ptr = module_info;                       // [376]
    tu->is_primary = is_primary;                             // [392] byte 0

    // ==========================================================
    //  PHASE 2: Link TU into global chains
    // ==========================================================

    // --- Set as primary TU if this is the first ---
    if (primary_translation_unit == NULL) {                   // qword_106B9F0
        primary_translation_unit = tu;
        if (!is_recompilation)
            assertion_failure("trans_unit.c", 725, "process_translation_unit", 0, 0);
    }

    // --- Push onto TU stack ---
    current_translation_unit = tu;                           // qword_106BA10
    // (stack entry allocated from free list or via permanent alloc)
    stack_entry = alloc_stack_entry();                        // 16 bytes
    stack_entry->tu_ptr = tu;
    stack_entry->next = tu_stack_top;
    if (tu != primary_translation_unit)
        ++tu_stack_depth;                                    // dword_106B9E8
    tu_stack_top = stack_entry;                               // qword_106BA18

    // --- Append to TU linked list ---
    if (tu_chain_tail)                                       // qword_12C7A90
        tu_chain_tail->next_tu = tu;
    tu_chain_tail = tu;

    // ==========================================================
    //  PHASE 3: Source file setup + parse
    // ==========================================================

    if (module_info) {
        // --- Module compilation path ---
        // Extract header info from module descriptor
        module_id = module_info[7];
        module_info[2] = tu;                                 // back-link TU into module
        current_module_id = module_id;                       // qword_106C0B0
        // ... copy include paths, source paths from module descriptor ...
        source_dir = intern_directory_path(filename, 1);     // sub_5ADC60
        set_include_paths(source_dir, &include_list, &sys_list); // sub_5AD120
        fe_translation_unit_init(source_dir, &include_list); // sub_5863A0
        import_module = module_info[3];
        tu->error_severity_count = current_error_severity;   // [408]
        set_module_id(import_module);                        // sub_5AF7F0
        if (preprocessing_only)                              // dword_106C29C
            goto compile;
        goto compile_module;
    }

    // --- Standard (non-module) path ---
    fe_translation_unit_init(0, 0);                          // sub_5863A0
    tu->error_severity_count = current_error_severity;
    if (preprocessing_only)
        goto compile;

    // --- PCH header processing (optional) ---
    if (pch_enabled && !pch_skip_flag) {                     // dword_106BF18, dword_106B6AC
        setup_pch_source();                                  // sub_5861C0
        precompiled_header_processing();                     // sub_6F4AD0
    }

compile:
    // --- Main compilation: parse + build IL ---
    compile_primary_source();                                // sub_586240
    semantic_analysis();                                     // sub_4E8A60 (standard path)
    goto wrapup;

compile_module:
    compile_primary_source();                                // sub_586240
    module_compilation();                                    // sub_6FDDF0 (module path)

wrapup:
    // ==========================================================
    //  PHASE 4: Per-TU wrapup + stack pop
    // ==========================================================
    translation_unit_wrapup();                               // sub_588E90

    // --- Pop TU stack (inlined pop_translation_unit_stack) ---
    top = tu_stack_top;
    popped_tu = top->tu_ptr;
    if (popped_tu != current_translation_unit)
        assertion_failure("trans_unit.c", 556,
                          "pop_translation_unit_stack", 0, 0);
    if (popped_tu != primary_translation_unit)
        --tu_stack_depth;
    tu_stack_top = top->next;
    // (return stack entry to free list)
    if (tu_stack_top)
        switch_translation_unit(tu_stack_top->tu_ptr);       // sub_7A3D60

    // --- Debug trace on exit ---
    if (debug_verbosity > 0 || (debug_enabled && trace_category("trans_unit")))
        fprintf(stderr, "Done processing translation unit %s\n", filename);
}

Execution Flow

process_translation_unit (sub_7A40A0)
  |
  |-- [1] Debug trace: "Processing translation unit %s"
  |-- [2] Module-state validation (assert at trans_unit.c:696)
  |-- [3] Save previous TU state (sub_7A3A50)
  |-- [4] Reset error state (sub_5EAEC0)
  |-- [5] If recompilation: re-run fe_init_part_1 (sub_585EE0)
  |
  |-- [6] Allocate 424-byte TU descriptor (sub_6BA0D0)
  |       |-- Allocate per-TU storage buffer (sub_6BA0D0(per_tu_storage_size))
  |       |-- Initialize scope state at [24..192] (sub_7046E0)
  |       |-- Zero remaining fields [192..416]
  |       |-- Copy registered variable defaults
  |       |-- Set module_info_ptr [376] and flags [392]
  |
  |-- [7] Set as primary TU if first (assert at trans_unit.c:725)
  |-- [8] Push onto TU stack, link into TU chain
  |
  |-- [9] Module path? (module_info != NULL)
  |       |-- YES: Extract module header info
  |       |        sub_5ADC60 (intern_directory_path)
  |       |        sub_5AD120 (set_include_paths)
  |       |        sub_5863A0 (fe_translation_unit_init)
  |       |        sub_5AF7F0 (set_module_id)
  |       |
  |       |-- NO:  sub_5863A0 (fe_translation_unit_init) with NULL args
  |                sub_5861C0 + sub_6F4AD0 (PCH processing, if enabled)
  |
  |-- [10] sub_586240  -- compile_primary_source (parser entry)
  |
  |-- [11] Post-parse semantic analysis:
  |        |-- Module path: sub_6FDDF0 (module_compilation)
  |        |-- Standard path: sub_4E8A60 (translation_unit / semantic analysis)
  |
  |-- [12] sub_588E90 -- translation_unit_wrapup
  |
  |-- [13] Pop TU stack (assert at trans_unit.c:556)
  |-- [14] Debug trace: "Done processing translation unit %s"

Phase 1: Error State Reset -- `sub_5EAEC0`

Before any parsing begins, sub_5EAEC0 resets the parser's error recovery state. This is a tiny function (22 bytes) that configures the error-recovery token scan depth based on whether this is a recompilation pass:

void reset_error_state(void) {
    if (is_recompilation) {            // dword_106BA08
        error_scan_depth = 8;          // dword_126F68C -- shallower scan on retry
        error_scan_mode = 0;           // dword_126F688
        error_recovery_kind = 16;
    } else {
        error_recovery_kind = 24;      // full recovery on first pass
    }
    error_token_limit = error_recovery_kind;  // dword_126F694
    error_count_local = 0;                     // dword_126F690
}

The different error_recovery_kind values (16 vs 24) control how aggressively the parser attempts to resynchronize after encountering a syntax error. On recompilation (error-retry), the compiler uses a smaller recovery window to avoid cascading errors.

Phase 2: TU Descriptor Allocation

The 424-byte TU descriptor is the central data structure tracking a single translation unit's state during compilation. It is allocated from EDG's permanent storage pool via sub_6BA0D0 and linked into two separate data structures: the TU linked list and the TU stack.

Translation Unit Descriptor Layout (424 bytes)

Offset	Size	Field	Description
0	8	`next_tu`	Singly-linked list pointer: chains all TUs in processing order. `qword_106B9F0` (primary TU) is the head; `qword_12C7A90` is the tail.
8	8	`tu_name`	Initially NULL. Set later by the parser to the TU's internal identifier.
16	8	`storage_buffer`	Pointer to a dynamically-sized buffer holding per-TU copies of all registered global variables. Size = `qword_12C7A98` (accumulated during `f_register_trans_unit_variable` calls).
24-192	168	`scope_state`	Initialized by `sub_7046E0`. Contains the TU's scope stack snapshot: file scope descriptor, scope nesting state, using-directive lists. Saved/restored during TU switching by `sub_7A3A50`/`sub_7A3D60`.
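184	8	`source_file_entry`	Set to `*(qword_126DDF0 + 64)` after the source file is opened -- the file descriptor from the source file manager. This offset lies inside the 24-192 range listed for `scope_state`, so the two rows overlap in the last qword of that range.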
192	8	(cleared)	Zero-initialized.
200-352	152	`scope_decl_area`	Bulk-zeroed via `memset`. Holds scope-level declaration state that accumulates during parsing. The zero-init ensures clean state for a new TU.
352	8	(cleared)	Zero-initialized.
360-376	24	`additional_state`	Three qwords, all zeroed. Purpose unclear; possibly reserved for future EDG versions.
376	8	`module_info_ptr`	Pointer to the C++20 module descriptor (`a3` parameter). NULL for standard compilation. When set, the TU participates in modular compilation.
392	2	`flags`	Byte 0: `is_primary` (1 if this is the first TU, 0 otherwise). Byte 1: initialized marker (always 1 = `0x100` in the word).
408	4	`error_severity_count`	Snapshot of `dword_126EC90` at TU creation time. Compared during wrapup to detect new errors introduced during this TU's compilation.
416	8	(cleared)	Zero-initialized.
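The layout table can be cross-checked by writing it down as a C struct and letting the compiler verify the offsets. The following is a minimal sketch under stated assumptions, not the EDG definition: pointer fields are modeled as `uint64_t` so the layout does not depend on the host pointer width, the `unknown_*` padding members are hypothetical fillers for bytes the table does not describe, and the 360-376 region is treated as two qwords so that `module_info_ptr` lands at the documented offset 376.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of the 424-byte TU descriptor.
 * Field names follow the table; the unknown_* members exist only to
 * place the documented fields at their documented offsets. */
struct tu_descriptor {
    uint64_t next_tu;               /*   0: next TU in processing order      */
    uint64_t tu_name;               /*   8: NULL until set by the parser     */
    uint64_t storage_buffer;        /*  16: per-TU registered-variable copy  */
    uint8_t  scope_state[160];      /*  24: scope snapshot, up to 184        */
    uint64_t source_file_entry;     /* 184: last qword of the 24..192 range  */
    uint64_t cleared_192;           /* 192: zero-initialized                 */
    uint8_t  scope_decl_area[152];  /* 200: bulk-zeroed, up to 352           */
    uint64_t cleared_352;           /* 352: zero-initialized                 */
    uint64_t additional_state[2];   /* 360: assumed two qwords (up to 376)   */
    uint64_t module_info_ptr;       /* 376: NULL for standard compilation    */
    uint8_t  unknown_384[8];        /* 384: not described in the table       */
    uint16_t flags;                 /* 392: byte 0 is_primary, byte 1 marker */
    uint8_t  unknown_394[14];       /* 394: not described in the table       */
    uint32_t error_severity_count;  /* 408: snapshot of dword_126EC90        */
    uint8_t  unknown_412[4];        /* 412: not described in the table       */
    uint64_t cleared_416;           /* 416: zero-initialized                 */
};

/* Compile-time checks against the documented layout. */
_Static_assert(sizeof(struct tu_descriptor) == 424, "descriptor size");
_Static_assert(offsetof(struct tu_descriptor, storage_buffer) == 16, "storage_buffer");
_Static_assert(offsetof(struct tu_descriptor, module_info_ptr) == 376, "module_info_ptr");
_Static_assert(offsetof(struct tu_descriptor, flags) == 392, "flags");

/* Byte 0 of the flags word is the is_primary flag (low byte on x86-64). */
int tu_is_primary(const struct tu_descriptor *tu)
{
    return (tu->flags & 0x00FFu) != 0;
}
```

A freshly created primary TU would carry `flags == 0x0101` in this model: the low byte is `is_primary` and the high byte is the always-set initialized marker.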

Registered Variable Mechanism

EDG's multi-TU infrastructure requires certain global variables to be saved and restored when switching between translation units (e.g., during relocatable device code compilation). The mechanism works as follows:

Registration phase (during initialization, before any TU processing): Subsystem initializers call f_register_trans_unit_variable (sub_7A3C00) to register global variables that need per-TU state. Each registration creates a 40-byte entry:

Offset	Size	Field
0	8	`next` -- linked list pointer
8	8	`variable_address` -- pointer to the global variable
16	8	`variable_name` -- debug name string (e.g., `"is_recompilation"`)
24	8	`prior_accumulated_size` -- offset into per-TU storage buffer
32	8	`field_offset_in_tu` -- if nonzero, the offset within the TU descriptor where the default value lives

Accumulated size tracking: Each registration pads the variable's size to 8-byte alignment and adds it to qword_12C7A98 (per-TU storage size). The linked list head is qword_12C7AA8, tail is qword_12C7AA0.
TU creation: When a TU descriptor is allocated, a storage buffer of per_tu_storage_size bytes is allocated alongside it at offset [16]. Default values from the field_offset_in_tu entries are copied into the TU descriptor's own fields.
TU switching: save_translation_unit_state (sub_7A3A50) iterates the registered variable list, copying each variable's current value from its global address into the outgoing TU's storage buffer. switch_translation_unit (sub_7A3D60) does the reverse: copies from the incoming TU's storage buffer back to the global addresses.

Three core variables are always registered (by sub_7A4690):

Variable	Address	Size	Name
`is_recompilation`	`dword_106BA08`	4	`"is_recompilation"`
`current_filename`	`qword_106BA00`	8	`"current_filename"`
`has_module_info`	`dword_106B9F8`	4	`"has_module_info"`

Additional variables are registered by other subsystem initializers (trans_corresp registers 3 more via sub_7A3920).
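The register/save/restore cycle described in this section can be sketched in a few lines of C. This is an illustration under assumptions, not the decompiled code: the function names are hypothetical, and the `size` member is an extra field added so the sketch knows how many bytes to copy (the 40-byte entry documented above has no such field, and how the binary recovers the size is not established here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One registration record. The first five members mirror the 40-byte
 * entry described above; `size` is a hypothetical addition. */
struct reg_entry {
    struct reg_entry *next;
    void             *variable_address;
    const char       *variable_name;
    size_t            prior_accumulated_size;  /* offset into the TU buffer */
    size_t            field_offset_in_tu;      /* unused in this sketch */
    size_t            size;                    /* assumption: bytes to copy */
};

struct reg_entry *reg_list_head;   /* plays the role of qword_12C7AA8 */
struct reg_entry *reg_list_tail;   /* plays the role of qword_12C7AA0 */
size_t per_tu_storage_size;        /* plays the role of qword_12C7A98 */

/* Append a registration and reserve an 8-byte-aligned slot for it. */
void register_tu_variable(void *addr, const char *name, size_t size)
{
    struct reg_entry *e = calloc(1, sizeof *e);
    assert(e != NULL);
    e->variable_address = addr;
    e->variable_name = name;
    e->size = size;
    e->prior_accumulated_size = per_tu_storage_size;
    per_tu_storage_size += (size + 7u) & ~(size_t)7u;
    if (reg_list_tail)
        reg_list_tail->next = e;
    else
        reg_list_head = e;
    reg_list_tail = e;
}

/* Outgoing TU: copy every registered global into the TU's buffer. */
void save_tu_state(unsigned char *storage)
{
    for (struct reg_entry *e = reg_list_head; e; e = e->next)
        memcpy(storage + e->prior_accumulated_size, e->variable_address, e->size);
}

/* Incoming TU: copy the buffer contents back into the globals. */
void restore_tu_state(const unsigned char *storage)
{
    for (struct reg_entry *e = reg_list_head; e; e = e->next)
        memcpy(e->variable_address, storage + e->prior_accumulated_size, e->size);
}
```

Each TU owns one buffer of `per_tu_storage_size` bytes, so switching from TU A to TU B amounts to `save_tu_state(A's buffer)` followed by `restore_tu_state(B's buffer)`.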

Phase 3: TU Linking and Stack Management

TU Linked List

Translation units are linked in processing order through the next_tu field at offset [0]:

qword_106B9F0 (primary_translation_unit)
  |
  v
  TU_0 --[next_tu]--> TU_1 --[next_tu]--> TU_2 --[next_tu]--> NULL
                                                       ^
                                                       |
                                          qword_12C7A90 (tu_chain_tail)

qword_106B9F0 always points to the first (primary) TU. qword_12C7A90 always points to the last. The chain is walked by fe_wrapup (sub_588F90) during its 5-pass finalization.

TU Stack

The TU stack tracks the active compilation context. Each stack entry is a 16-byte structure:

Offset	Size	Field
0	8	`next` -- points to the entry below on the stack
8	8	`tu_ptr` -- pointer to the TU descriptor

Stack entries are allocated from a free list (qword_12C7AB8); when the free list is empty, a new 16-byte block is allocated via sub_6B7340 (permanent allocator).

qword_106BA18 (tu_stack_top)
  |
  v
  entry_N: [next=entry_N-1, tu_ptr=current_tu]
  entry_N-1: [next=entry_N-2, tu_ptr=prev_tu]
  ...
  entry_0: [next=NULL, tu_ptr=primary_tu]

The stack depth counter dword_106B9E8 tracks how many non-primary TUs are stacked. It is incremented on push and decremented on pop, in both cases only when the TU involved is not the primary TU.

The pop operation at the end of process_translation_unit includes an assertion (at trans_unit.c:556) verifying that the top-of-stack TU matches current_translation_unit. This guards against mismatched push/pop sequences, which would corrupt the multi-TU state:

if (stack_top->tu_ptr != current_translation_unit)
    assertion_failure("trans_unit.c", 556, "pop_translation_unit_stack", 0, 0);
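The push and pop behavior described above (free-list recycling, depth counting for non-primary TUs only, and the consistency assertion on pop) can be sketched as follows. The function names, the stand-in `struct tu`, and the choice to update `current_translation_unit` directly are assumptions made for illustration; in the binary the state switch is performed by `sub_7A3D60`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct tu { int id; };                 /* stand-in for the 424-byte descriptor */

struct stack_entry {                   /* 16 bytes on x86-64 */
    struct stack_entry *next;          /* entry below on the stack */
    struct tu          *tu_ptr;
};

struct tu          *primary_translation_unit;
struct tu          *current_translation_unit;
struct stack_entry *tu_stack_top;
struct stack_entry *stack_entry_free_list;
int                 tu_stack_depth;    /* counts non-primary TUs only */

void push_tu(struct tu *t)
{
    struct stack_entry *e = stack_entry_free_list;
    if (e) {
        stack_entry_free_list = e->next;       /* recycle a freed entry */
    } else {
        e = malloc(sizeof *e);                 /* stand-in for sub_6B7340 */
        assert(e != NULL);
    }
    e->tu_ptr = t;
    e->next = tu_stack_top;
    tu_stack_top = e;
    if (t != primary_translation_unit)
        ++tu_stack_depth;
    current_translation_unit = t;              /* simplification */
}

void pop_tu(void)
{
    struct stack_entry *top = tu_stack_top;
    /* Mirrors the trans_unit.c:556 check: top must be the current TU. */
    assert(top != NULL && top->tu_ptr == current_translation_unit);
    if (top->tu_ptr != primary_translation_unit)
        --tu_stack_depth;
    tu_stack_top = top->next;
    top->next = stack_entry_free_list;         /* return entry to free list */
    stack_entry_free_list = top;
    if (tu_stack_top)
        current_translation_unit = tu_stack_top->tu_ptr;
}
```

Because the primary TU is excluded from the count, a stack holding only the primary TU has depth 0, and the depth returns to 0 once every secondary TU has been popped.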

Phase 4: Source File Setup

The source file setup differs between standard compilation and C++20 module compilation.

Standard Path (module_info == NULL)

sub_5863A0 (fe_translation_unit_init / keyword_init, 1113 lines, fe_init.c): The largest initialization function in the binary. Performs five tasks in sequence:
- Token state reset: Zeros qword_126DD38 (6-byte source position) and qword_126EDE8 (mirror).
- Per-TU subsystem reinit: Calls 15+ subsystem re-initializers to prepare for a new compilation unit (source file manager, scope system, preprocessor, diagnostics, etc.).
- Keyword registration: Registers 200+ C/C++ keywords via sub_7463B0 (enter_keyword), including all C89/C99/C11/C23 keywords, C++ keywords through C++26, GNU extensions, MSVC extensions, Clang extensions, 60+ type traits, and three NVIDIA CUDA-specific type trait keywords (__nv_is_extended_device_lambda_closure_type, __nv_is_extended_host_device_lambda_closure_type, __nv_is_extended_device_lambda_with_preserved_return_type). Keyword registration is version-gated by the language mode (dword_126EFB4) and C++ standard version (dword_126EF68).
- File scope creation: Calls sub_7047C0(0) to push the initial file scope onto the scope stack.
- C++ builtins: For C++ mode, registers namespace std, operator new/operator delete allocation functions, std::align_val_t.
PCH processing (optional, if dword_106BF18 is set): Calls sub_5861C0 to open the source file with minimal setup (same as sub_586240 but without the recompilation logic), followed by sub_6F4AD0 (precompiled_header_processing, 721 lines, pch.c) which searches for an applicable .pch file, validates memory allocation history, and restores saved variable state from the precompiled header.

Module Path (module_info != NULL)

When compiling a C++20 module unit, the module descriptor (passed as a3) provides pre-computed configuration:

module_info[2] = tu;              // back-link TU into module descriptor
qword_106C0B0 = module_info[7];  // module identifier
qword_126EE98 = module_info[4];  // include path list
qword_126EE78 = module_info[6];  // system include path list
qword_126EE90 = module_info[5];  // additional path list

The module path then calls:

sub_5ADC60(filename, 1) -- intern the source directory path (cached allocation)
sub_5AD120(source_dir, &include_list, &sys_list) -- configure include search paths from the module descriptor
sub_5863A0(source_dir, &include_list) -- fe_translation_unit_init with module-specific paths
sub_5AF7F0(module_info[3]) -- set the module identifier for this TU (asserts not already set)

Phase 5: Compilation Driver -- `sub_586240`

sub_586240 (fe_init.c, 63 lines) is the compilation driver that opens the source file and launches the parser. It is called for both standard and module compilation paths.

void compile_primary_source(void) {
    // If recompilation: reset file-scope scope pointer
    if (is_recompilation)
        *(uint64_t *)&xmmword_126EB60 = 0;

    // Allocate mutable copy of filename for the lexer
    char *fn_copy = temp_allocate(strlen(current_filename) + 1);  // sub_5E0460
    strcpy(fn_copy, current_filename);

    // --- Open source file and push onto input stack ---
    open_file_and_push_input_stack(fn_copy, 0, 0, 0, 0, 0, 0, 0, 0, 0);  // sub_66E6E0

    // Record source file descriptor in TU
    current_tu->source_file_entry = *(source_file_descriptor + 64);  // [184]

    // --- Scope handling ---
    if (!pch_mode) {                                         // dword_106B690
        init_global_scope_flag = 1;                          // dword_126C708
        global_scope_decl_list = global_decl_chain;          // qword_126C710
        finalize_scope();                                    // sub_66E920
    }
    open_scope(1, 0);                                        // sub_6702F0

    // --- PCH recompilation metadata ---
    if (is_recompilation) {
        // Allocate 4-byte version marker (3550774 = "6.6\0")
        char *ver = temp_allocate(4);
        *(uint32_t *)ver = 3550774;                          // EDG 6.6 version tag
        edg_version_ptr = ver;                               // qword_126EB78
        // Copy compilation timestamp
        char *ts = temp_allocate(strlen(byte_106B5C0) + 1);  // +1 for the NUL that strcpy writes
        compilation_timestamp_copy = strcpy(ts, byte_106B5C0); // qword_126EB80
        dialect_version_snapshot = dialect_version;           // dword_126EBF8
    }

    // --- PCH header loading ---
    if (pch_mode) {
        load_precompiled_header(byte_106B5C0);               // sub_6B5C10
        pch_header_loaded = 1;                               // dword_106B6B0
    }
}
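The version marker stored in the recompilation branch is simply the four bytes of the string "6.6" (including its terminating NUL) read as a little-endian 32-bit integer. The helper below, whose name is hypothetical, makes the arithmetic explicit and works regardless of host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four bytes into the value a little-endian 32-bit load would yield. */
uint32_t pack_tag_le(const char tag[4])
{
    return  (uint32_t)(unsigned char)tag[0]
         | ((uint32_t)(unsigned char)tag[1] << 8)
         | ((uint32_t)(unsigned char)tag[2] << 16)
         | ((uint32_t)(unsigned char)tag[3] << 24);
}

/* '6' = 0x36, '.' = 0x2E, so "6.6\0" packs to 0x00362E36 = 3550774. */
void check_edg_version_tag(void)
{
    assert(pack_tag_le("6.6") == 3550774u);
}
```

Reading the constant this way is useful when scanning other EDG-based binaries: a different front-end version would show up as a different packed constant at the same store.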

Parser Entry: `sub_66E6E0` (open_file_and_push_input_stack)

sub_66E6E0 (lexical.c, 95 lines) is the gateway from file-level compilation into the EDG lexer/parser. It takes 10 parameters controlling how the source file is opened:

Parameter	Position	Typical Value	Meaning
`filename`	a1	source path	Path to the `.cu` file
`include_mode`	a2	0	0 = primary source, nonzero = `#include`
`search_type`	a3	0	0 = absolute path, nonzero = search include dirs
`is_system`	a4	0	System header flag
`guard_flag`	a5	0	Include guard checking mode
`is_pragma`	a6	0	Pragma-include flag
`embed_mode`	a7	0	`#embed` processing flag
`line_adjust`	a8	0	Line number adjustment
`recovery`	a9	0	Error recovery mode
`result_out`	a10	0	Output: set to 1 if file was skipped (guard)

The function delegates to sub_66CBD0 which resolves the file path, opens the file handle, and creates the file descriptor. Then sub_66DFF0 pushes the opened file onto the lexer's input stack, making it the active source for tokenization. The lexer reads from this stack via get_next_token (sub_676860, 1995 lines).

At debug verbosity > 3, it prints: "open_file_and_push_input_stack: skipping guarded include file %s\n" when an include guard causes the file to be skipped.

Phase 6: Semantic Analysis -- `sub_4E8A60`

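sub_4E8A60 (translation_unit, decls.c, 77 lines) runs once compile_primary_source has opened the source file. It drives the top-level declaration loop, consuming tokens until EOF and performing semantic analysis on each declaration as it is processed. This function is called only on the standard (non-module) compilation path.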

void translation_unit(void) {
    // PCH mode: additional scope finalization
    if (pch_mode)
        finalize_pch_scope();                                // sub_6FC900
    if (global_decl_chain)
        process_pending_declarations();                      // sub_6FDD60

    // --- Prime the lexer: fetch the first token ---
    declaration_processing_active = 1;                       // dword_126C704
    get_next_token();                                        // sub_676860
    declaration_processing_active = 0;

    // Header-unit stop detection
    if (header_unit_mode)
        finalize_header_unit();                              // sub_6F4A10

    // --- Top-level declaration loop ---
    // Repeatedly processes declarations until token 9 (EOF) is reached.
    // For C++ (dword_126EFB4 == 2) at C++11 or later (dword_126EF68 > 201102):
    //   calls sub_6FBCD0 (deferred template processing)
    //   then sub_4E6F80(1, 0) (process next declaration)
    while (current_token != 9) {  // 9 = EOF token
        if (is_cpp && (cpp_version > 201102 || has_cpp20_features))
            process_deferred_templates();                    // sub_6FBCD0
        if (declaration_enabled)
            process_declaration(1, 0);                       // sub_4E6F80
    }

    // --- Post-parse validation ---
    if (!header_unit_mode) {
        if (is_cpp && (cpp_version > 201102 || has_cpp20_features))
            process_deferred_templates();                    // sub_6FBCD0 final pass
        finalize_module_interface();                         // sub_6F81D0
    } else {
        // Header-unit mode assertion: stop position must be found
        assertion_failure("decls.c", 23975, "translation_unit",
                          "translation_unit:", "header stop position not found");
    }
}

The C++ standard version check (dword_126EF68 > 201102) gates deferred template processing. C++11 defines __cplusplus as 201103, so the comparison > 201102 is the decompiler's rendering of >= 201103 and is true for C++11 and every later standard, not only C++14 and later. When the gate passes, sub_6FBCD0 handles deferred template processing between declaration groups.
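The gate used in the declaration loop can be isolated as a small predicate. This is a sketch with a hypothetical function name and parameter names; the only facts taken from the analysis are that language mode 2 means C++ and that the version is compared against 201102. Since C++11's `__cplusplus` value is 201103, the predicate already holds for C++11.

```c
#include <assert.h>
#include <stdbool.h>

enum { LANG_MODE_CPP = 2 };   /* value of dword_126EFB4 for C++ */

/* True when the declaration loop would call sub_6FBCD0. */
bool wants_deferred_template_pass(int language_mode,
                                  long std_version,
                                  bool has_cpp20_features)
{
    bool is_cpp = (language_mode == LANG_MODE_CPP);
    return is_cpp && (std_version > 201102L || has_cpp20_features);
}
```

With this predicate, C++98 (199711) fails the version test unless the separate feature flag is set, while C++11 (201103) and later pass it.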

Phase 7: Translation Unit Wrapup -- `sub_588E90`

sub_588E90 (translation_unit_wrapup, fe_wrapup.c, 36 lines) performs per-TU finalization after parsing and semantic analysis are complete. It is the last step before the TU stack is popped.

void translation_unit_wrapup(void) {
    if (debug_enabled)
        trace_enter(1, "translation_unit_wrapup");

    // [1] Stop-token verification
    check_all_stop_token_entries_are_reset(                  // sub_675DA0
        file_scope_stop_tokens + 8);                         // qword_126DB48 + 8

    // [2] Class linkage checking (conditional)
    if (!preprocessing_only) {
        if (rdc_enabled || rdc_alt_enabled)                  // dword_106C2BC, dword_106C2B8
            check_class_linkage();                           // sub_446F80
    }

    // [3] Module import finalization
    finalize_module_imports();                                // sub_7C24D0

    // [4] IL output
    complete_scope();                                        // sub_709250

    // [5] Close file scope
    close_file_scope(1);                                     // sub_7047C0

    // [6] Module correspondence finalization (non-preprocessing)
    if (!preprocessing_only)
        process_verification_list();                         // sub_7A2FE0

    // [7] Write compilation unit boundary
    make_module_id(0);                                       // sub_5AF830

    // [8] Namespace cleanup (C++ only, not on a recompilation pass, non-preprocessing)
    if (is_cpp && !is_recompilation && !preprocessing_only)
        namespace_cleanup();                                 // sub_76C910

    if (debug_enabled)
        trace_leave();                                       // sub_48AFD0
}

sub_675DA0: check_all_stop_token_entries_are_reset

Iterates all 357 entries in the stop-token array. If any nonzero entry is found, logs "stop_tokens[\"%s\"] != 0\n" (using off_E6D240 as the token name table) and asserts with "stop token array not all zero" at lexical.c:17680. This catches lexer state corruption where a stop-token (used during error recovery and tentative parsing) was not properly cleared.
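A minimal sketch of this check follows. The entry type (one byte per token), the function name, and the decision to return a count instead of asserting are assumptions for illustration; only the entry count and the log format come from the analysis above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define STOP_TOKEN_COUNT 357

/* Scan the stop-token array and log every entry that was left set.
 * `names` stands in for the token name table (off_E6D240) and may be NULL.
 * Returns the number of nonzero entries; the binary asserts instead. */
size_t count_unreset_stop_tokens(const unsigned char *stop_tokens,
                                 const char *const *names)
{
    size_t bad = 0;
    for (size_t i = 0; i < STOP_TOKEN_COUNT; ++i) {
        if (stop_tokens[i] != 0) {
            fprintf(stderr, "stop_tokens[\"%s\"] != 0\n",
                    names ? names[i] : "?");
            ++bad;
        }
    }
    return bad;
}
```

A caller that wants the binary's behavior would follow a nonzero result with a fatal "stop token array not all zero" diagnostic.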

sub_446F80: check_class_linkage

Called only when relocatable device code (RDC) compilation is enabled (dword_106C2BC or dword_106C2B8). Iterates file-scope type entities looking for class/struct/union types (kind 9-11) and scoped enums (kind 2, bit 3 of +145) that need external linkage for cross-TU visibility. For qualifying types, calls sub_41F800 (make_class_externally_linked) to set the linkage bits at offset +80 to 0x20 (external linkage flag). The function performs a two-pass scan:

Pass 1: Identify types needing external linkage. Checks whether the type is used by externally-visible definitions, has nested types with external linkage requirements, or has member functions with non-inline definitions.
Pass 2: If any types were promoted, propagates linkage to member functions and nested class template instantiations via sub_41FD90.

sub_7A2FE0: process_verification_list (Module Finalization)

sub_7A2FE0 (trans_corresp.c, 69 lines) processes the deferred correspondence verification list for multi-TU compilation. This is the mechanism EDG uses to verify that declarations shared across translation units are structurally compatible (One Definition Rule checking for RDC).

void process_verification_list(void) {
    if (is_recompilation || error_count != saved_error_count)
        goto skip;  // skip if new errors appeared

    correspondence_active = 1;                               // dword_106B9E4
    source_seq = *(current_tu + 8);                          // TU source sequence

    prepare_correspondence(source_seq);                      // sub_79FE00
    verify_correspondence(source_seq);                       // sub_7A2CC0

    // Process pending verification items
    while (pending_list) {                                   // qword_12C7790
        pending_list_snapshot = pending_list;
        pending_list = NULL;
        for (item = pending_list_snapshot; item; item = next) {
            next = item->next;
            switch (item->kind) {                            // byte at [8]
                case 0:  break;                              // no-op
                case 2:  verify_typedef_correspondence(item->data);           break;  // sub_7986A0
                case 6:  verify_friend_correspondence(item->data);            break;  // sub_7A1830
                case 7:  verify_nested_class_correspondence(item->data);      break;  // sub_798960
                case 8:  verify_enum_member_correspondence(item->data);       break;  // sub_798770
                case 11: verify_member_function_correspondence(item->data);   break;  // sub_7A1DB0
                case 28: verify_using_declaration_correspondence(item->data); break;  // sub_7982C0
                case 58: verify_base_class_correspondence(item->data);        break;  // sub_7A27B0
                default: assertion_failure("trans_corresp.c", 7709, ...);
            }
            // Return item to free list
            item->next = corresp_free_list;
            corresp_free_list = item;
        }
    }

    correspondence_active = 0;
    correspondence_complete = 1;                             // dword_106B9E0

skip:
    correspondence_complete = 1;
}

The nonzero kind codes correspond to EDG declaration kinds: typedef (2), friend (6), nested class (7), enum member (8), member function (11), using declaration (28), base class (58). Kind 0 is a no-op, and any other value triggers the assertion at trans_corresp.c:7709.
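The outer `while` loop exists because a verification handler can queue further items while the list is being processed. Taking a snapshot and clearing the global head before iterating lets new items accumulate safely for the next round. The sketch below shows that pattern in isolation; the handler is hypothetical (it re-queues whatever `data` points to for kind 7) and stands in for the real verify_* routines.

```c
#include <assert.h>
#include <stddef.h>

struct item {
    struct item  *next;
    unsigned char kind;
    void         *data;
};

struct item *pending_list;        /* plays the role of qword_12C7790 */
struct item *corresp_free_list;
int          processed_count;

void enqueue(struct item *it)
{
    it->next = pending_list;
    pending_list = it;
}

/* Hypothetical handler: a kind-7 item queues the item its data points to. */
static void handle(struct item *it)
{
    ++processed_count;
    if (it->kind == 7 && it->data != NULL)
        enqueue((struct item *)it->data);
}

void drain_pending(void)
{
    while (pending_list) {
        struct item *snapshot = pending_list;
        pending_list = NULL;               /* handlers may refill this */
        for (struct item *it = snapshot, *next; it; it = next) {
            next = it->next;               /* save before recycling */
            handle(it);
            it->next = corresp_free_list;  /* return item to free list */
            corresp_free_list = it;
        }
    }
}
```

The loop terminates once a full round completes without any handler queuing new work.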

Module vs Standard Compilation Path

The control flow diverges based on dword_106C29C (preprocessing-only mode) and the presence of module_info:

                          module_info?
                         /            \
                       YES             NO
                        |               |
                  sub_5ADC60         sub_5863A0(0,0)
                  sub_5AD120              |
                  sub_5863A0       PCH enabled?
                  sub_5AF7F0        /        \
                        |         YES         NO
                        |          |           |
                        |     sub_5861C0       |
                        |     sub_6F4AD0       |
                        |          |           |
                        +-----+----+-----+-----+
                              |          |
                         sub_586240  sub_586240
                              |          |
                      preprocessing_only?
                        /            \
                      YES             NO
                       |               |
                  sub_6FDDF0      sub_4E8A60
                  (module comp)   (standard comp)
                       |               |
                       +-------+-------+
                               |
                         sub_588E90
                    (translation_unit_wrapup)

Note: sub_6FDDF0 is the module compilation driver (59 lines, lower_il.c). It enters a loop calling sub_676860 (get_next_token) until EOF (token 9), processing module import/export declarations. Between module units, it calls sub_66EA70 to close the current input source and advance to the next module partition.

Global State Variables

Translation Unit Tracking

Variable	Address	Type	Description
`current_translation_unit`	`qword_106BA10`	`tu_descriptor*`	Points to the TU currently being compiled. Set during TU creation and switching.
`primary_translation_unit`	`qword_106B9F0`	`tu_descriptor*`	Points to the first TU. Set exactly once. Never changes after that.
`tu_chain_tail`	`qword_12C7A90`	`tu_descriptor*`	Tail of the TU linked list. Used for O(1) append of new TUs.
`tu_stack_top`	`qword_106BA18`	`stack_entry*`	Top of the TU stack. Each entry is a 16-byte `{next, tu_ptr}` node.
`tu_stack_depth`	`dword_106B9E8`	`int`	Number of non-primary TUs on the stack. Incremented on push, decremented on pop.
`current_filename`	`qword_106BA00`	`char*`	Path of the `.cu` file being compiled. Per-TU variable (saved/restored on switch).
`is_recompilation`	`dword_106BA08`	`int`	Nonzero during error-retry recompilation pass. Per-TU variable.
`has_module_info`	`dword_106B9F8`	`int`	1 if the current TU is a C++20 module unit. Per-TU variable.

Registration Infrastructure

Variable	Address	Type	Description
`registered_variable_list_head`	`qword_12C7AA8`	`reg_entry*`	Head of the registered variable linked list. Built during initialization.
`registered_variable_list_tail`	`qword_12C7AA0`	`reg_entry*`	Tail of the registered variable list. Used for O(1) append.
`per_tu_storage_size`	`qword_12C7A98`	`size_t`	Accumulated size of all registered variables (8-byte aligned). Determines the storage buffer size at TU descriptor offset [16].
`registration_complete`	`dword_12C7A8C`	`int`	Set to 1 at the start of `process_translation_unit`. After this, no more variables can be registered.
`has_seen_module_tu`	`dword_12C7A88`	`int`	Set to 1 when a module-info TU is processed. Guards against mixing module and non-module TUs.
`stack_entry_free_list`	`qword_12C7AB8`	`stack_entry*`	Free list for recycling 16-byte TU stack entries.

Statistics Counters

Variable	Address	Description
`tu_count`	`qword_12C7A78`	Total TU descriptors allocated (424 bytes each)
`stack_entry_count`	`qword_12C7A80`	Total stack entries allocated (16 bytes each)
`registration_count`	`qword_12C7A68`	Total variable registration entries (40 bytes each)
`corresp_count`	`qword_12C7A70`	Total correspondence entries (24 bytes each)

These counters are reported by sub_7A45A0 (print_trans_unit_statistics), which prints formatted memory usage:

trans. unit corresps          N x 24 bytes
translation units             N x 424 bytes
trans. unit stack entry       N x 16 bytes
variable registration         N x 40 bytes

Assertions

process_translation_unit contains three assertion checks (the third sits in the inlined pop_translation_unit_stack), each producing a fatal diagnostic via sub_4F2930:

Line	Condition	Message	Meaning
`trans_unit.c:696`	Primary TU (no module_info) but `has_seen_module_tu` is set	(none)	Cannot process a non-module TU after a module TU has been seen
`trans_unit.c:725`	`primary_translation_unit` is set but `is_recompilation` is false	(none)	First TU must be on the initial compilation pass, not a retry
`trans_unit.c:556`	Stack top's TU pointer does not match `current_translation_unit`	(none)	TU stack push/pop mismatch -- corrupted compilation state

Callee Reference Table

Address	Identity	Source	Role in Pipeline
`sub_48A7E0`	`trace_category`	`error.c`	Check if debug category `"trans_unit"` is enabled
`sub_5EAEC0`	`reset_error_state`	`parse.c`	Reset parser error recovery state
`sub_585EE0`	`fe_init_part_1`	`fe_init.c`	Re-run per-unit init on recompilation
`sub_6BA0D0`	`allocate_storage`	`il_alloc.c`	Permanent storage allocator (424-byte TU, per-TU buffer)
`sub_7046E0`	`init_scope_state`	`scope_stk.c`	Initialize scope fields at TU descriptor [24..192]
`sub_6B7340`	`permanent_alloc`	`il_alloc.c`	Allocate 16-byte TU stack entry
`sub_7A3A50`	`save_translation_unit_state`	`trans_unit.c`	Save current TU's registered variables and scope state
`sub_7A3D60`	`switch_translation_unit`	`trans_unit.c`	Restore a TU's state (inverse of save)
`sub_5ADC60`	`intern_directory_path`	`host_envir.c`	Cache directory path string (module path)
`sub_5AD120`	`set_include_paths`	`host_envir.c`	Configure include search paths from module descriptor
`sub_5863A0`	`fe_translation_unit_init`	`fe_init.c`	Per-TU init + keyword registration (1113 lines)
`sub_5AF7F0`	`set_module_id`	`host_envir.c`	Set module identifier for current TU
`sub_5861C0`	`setup_pch_source`	`fe_init.c`	Open source file for PCH mode
`sub_6F4AD0`	`precompiled_header_processing`	`pch.c`	Find/load applicable PCH file (721 lines)
`sub_586240`	`compile_primary_source`	`fe_init.c`	Open source, launch parser, build IL
`sub_66E6E0`	`open_file_and_push_input_stack`	`lexical.c`	Open source file, push onto lexer input stack (10 params)
`sub_676860`	`get_next_token`	`lexical.c`	Main tokenizer (1995 lines)
`sub_6702F0`	`open_scope`	`scope_stk.c`	Push a new scope onto the scope stack
`sub_6FDDF0`	`module_compilation`	`lower_il.c`	Module compilation driver (EOF-driven loop)
`sub_4E8A60`	`translation_unit`	`decls.c`	Standard compilation: semantic analysis + declaration loop
`sub_588E90`	`translation_unit_wrapup`	`fe_wrapup.c`	Per-TU finalization (8 sub-steps)
`sub_675DA0`	`check_all_stop_token_entries_are_reset`	`lexical.c`	Verify all 357 stop-tokens are cleared
`sub_446F80`	`check_class_linkage`	`class_decl.c`	RDC: promote class types to external linkage
`sub_7C24D0`	`finalize_module_imports`	`modules.c`	C++20 module import finalization
`sub_709250`	`complete_scope`	`il.c`	IL scope completion
`sub_7047C0`	`close_file_scope`	`scope_stk.c`	Pop file scope, activate using-directives
`sub_7A2FE0`	`process_verification_list`	`trans_corresp.c`	ODR verification for multi-TU (RDC)
`sub_76C910`	`namespace_cleanup`	`cp_gen_be.c`	C++ namespace state cleanup
`sub_4F2930`	`assertion_failure`	`error.c`	Fatal assertion handler (prints source path + line)

Cross-References

Pipeline Overview -- complete 8-stage pipeline diagram showing where process_translation_unit fits
Entry Point & Initialization -- stages 1-3 that execute before this function
Frontend Wrapup -- the 5-pass fe_wrapup (stage 6) that runs after this function
Backend Code Generation -- stage 7 that consumes the IL tree built here
CLI Processing -- all 276 flags that configure the compilation mode
Timing & Exit -- exit code mapping and timing infrastructure
EDG Overview -- EDG 6.6 source tree and NVIDIA modifications
Execution Spaces -- how __device__/__host__/__global__ attributes are recorded during parsing
Device/Host Separation -- how the backend filters device vs host code from the IL tree

Frontend Wrapup

fe_wrapup (sub_588F90, 1433 bytes at 0x588F90, from fe_wrapup.c:776) is the sixth stage of the cudafe++ pipeline. It runs after the parser has built the complete EDG IL tree and before the backend emits the .int.c file. The function performs five sequential passes over the translation unit chain, each pass iterating the linked list rooted at qword_106B9F0. After the five passes, it runs a series of post-pass operations: cross-TU consistency checks, graph optimization, template validation, memory statistics reporting, and global state teardown. The function has 51 direct callees.

The five passes transform the raw IL tree into a finalized, pruned representation: Pass 1 cleans up parsing artifacts, Pass 2 computes which entities are needed, Pass 3 marks entities that must be preserved in the IL for device compilation, Pass 4 eliminates everything not marked, and Pass 5 serializes the result and validates scope consistency. The entire sequence is the bridge between the parser's "everything parsed" state and the backend's "only what matters" input.

Key Facts

Property	Value
Function	`sub_588F90` (`fe_wrapup`)
Binary address	`0x588F90`
Binary size	1433 bytes
EDG source	`fe_wrapup.c`, line 776
Direct callees	51
Debug trace name	`"fe_wrapup"` (level 1 via `sub_48AE00`)
Assertion	`"bad translation unit in fe_wrapup"` if `dword_106BA08 == 0`
Error check	`qword_126ED90` -- passes 2-4 skip TUs with errors
Language gate	`dword_126EFB4 == 2` gates C++-only operations in pass 4

Architecture Overview

sub_588F90 (fe_wrapup)
  |
  |-- Preamble: debug trace, assertion, C++ wrapup, diagnostic hooks
  |
  |-- Pass 1: per-TU basic declaration processing (sub_588C60)
  |-- Pass 2: template/inline instantiation + needed-flags (sub_707040)
  |       |-- gated by !qword_126ED90 (skip error TUs)
  |       |-- preceded by cross-TU marking (sub_796C00) on first run
  |-- Pass 3: keep-in-IL marking for device code (sub_610420 with arg 23)
  |       |-- sets dword_106B640=1 guard, clears after
  |-- Pass 4: constant folding + CUDA transforms + dead entity elimination
  |       |-- sub_5CCA40 (C++ only), sub_5CC410, sub_5CCBF0
  |-- Pass 5: per-TU final cleanup (sub_588D40)
  |
  |-- Post-pass: cross-TU consistency (sub_796BA0)
  |-- Post-pass: scope renumbering (sub_707480 double-loop)
  |-- Post-pass: template validation (sub_765480)
  |-- Post-pass: final main-TU cleanup (sub_588D40)
  |-- Post-pass: file index processing (sub_6B8B20 loop)
  |-- Post-pass: output flush (sub_5F7DF0)
  |-- Post-pass: close output files (sub_4F7B10 x3)
  |-- Post-pass: memory statistics (10 subsystem counters -> sub_6B95C0)
  |-- Post-pass: debug dumps (sub_702DC0, sub_6C6570)
  |-- Post-pass: final teardown (sub_5E1D00, sub_4ED0E0, zero 5 globals)

Translation Unit Chain

All five passes iterate the same linked list structure. Each translation unit descriptor is a 424-byte allocation. The first qword of each descriptor is the next pointer, forming a singly-linked list. The head is qword_106B9F0 (the primary TU). For standard single-file CUDA compilation, there is typically one primary TU and zero secondary TUs, but the multi-TU infrastructure exists for module compilation and precompiled headers.

Before processing each TU, sub_7A3D60 (set_current_translation_unit) is called to switch global state to point at that TU. This updates qword_106BA10 (current TU descriptor), which is then used by all subsystems to find the current scope, IL root, file info, and error state.

The file scope IL node -- the root of the IL tree for a TU -- is at *(qword_106BA10 + 8).

The iteration pattern shared by all passes:

// Walk secondary TUs (linked from primary)
node = *(qword **)qword_106B9F0;     // first secondary TU
while (node) {
    sub_7A3D60(node);                 // set node as current TU
    // ... pass-specific work on *(qword_106BA10 + 8) ...
    node = *(qword **)node;           // follow next pointer at +0
}
// Then process primary TU
sub_7A3D60(qword_106B9F0);
// ... pass-specific work on main TU ...

Preamble

Before the five passes begin, fe_wrapup performs:

Debug trace: If dword_126EFC8 (debug mode) is set, logs "fe_wrapup" at level 1 via sub_48AE00.
Set current TU: Calls sub_7A3D60(qword_106B9F0) to select the primary TU.
Assertion: Checks dword_106BA08 != 0 -- the "full compilation mode" flag. If false, triggers a fatal assertion: "bad translation unit in fe_wrapup". This flag is set during TU initialization; its absence here indicates a corrupted pipeline state.
C++ template wrapup: If dword_126EFB4 == 2 (C++ mode), calls sub_78A9D0 (template_and_inline_entity_wrapup). This performs cross-TU template instantiation setup, walking all TUs and their pending instantiation lists.
No-op hook: Calls nullsub_5 -- a disabled debug hook in the exprutil address range (0x56DC80). Likely a compile-time-disabled expression validation point.
CUDA diagnostics: If dword_106C268 is set, calls sub_6B3260 (CUDA-specific diagnostic processing).
Source sequence debug: If debug mode and the "source_file_for_seq_info" flag is active, calls sub_5B9580 to dump source file sequence information.

Pass 1: Basic Declaration Processing

Function: sub_588C60 (file_scope_il_wrapup) Address: 0x588C60 Per-TU: Yes (iterates all secondary TUs, then processes the primary TU) Error-gated: No -- runs unconditionally

This pass performs initial cleanup on each translation unit's IL tree. It runs unconditionally on every TU, regardless of error status, because the cleanup operations (template state release, exception spec finalization) are safe and necessary even after errors.

Operations per TU:

Step	Function	Purpose
1	`sub_7C2690`	Template cleanup -- release deferred template instantiation state
2	`sub_68A0C0`	Exception handling cleanup -- finalize exception specifications, resolve pending catch-block types
3	`sub_446F80`	Diagnostic finalization (conditional: only if `dword_106C2BC` or `dword_106C2B8` is set, and `dword_106C29C` is clear -- i.e., not preprocessing-only mode)
4	`sub_706710`	IL tree walk with parameters `(root, 0, scope_list, 1, 0, 0)` -- traverses the full IL tree performing bookkeeping: arg 2=0 means initial walk, arg 4=1 enables scope processing, arg 3 passes the TU scope list at `qword_106BA10 + 24`
5	`sub_706F40`	IL finalize -- post-walk finalization of the IL root node, marks it as ready for lowering
6	`sub_5BD350`	Destroy temporaries (C++ only, `dword_126EFB4 == 2`) -- cleans up temporary objects from expression evaluation
7	(inline loop)	Clear deferred declaration flags (C++ only, `dword_126EE50 == 0`): iterates the declaration chain at `*(root + 280)`, and for each declaration where bit 2 of byte `+81` is set and `sub_5CA6F0` returns true, clears the pointer at `+40` and clears bit 2 of byte `+81`. This removes deferred-initialization markers from declarations whose initialization has completed.
8	`sub_65D9A0`	Overload resolution cleanup -- releases candidate sets and viability data
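
Step 7 is the only step implemented inline rather than as a call, so a standalone model may help. The following is a minimal sketch, not the decompiled code: the `decl` struct, its field names, the `next` link, and `init_is_complete` (a stand-in for `sub_5CA6F0`) are all hypothetical. Only the bit position (bit 2 of byte `+81`) and the two clearing actions come from the table above.

```c
#include <stddef.h>
#include <stdint.h>

#define DECL_DEFERRED 0x04   /* bit 2 of the flags byte (byte +81 in the binary) */

/* Hypothetical declaration node; field names and the next link are assumptions. */
typedef struct decl {
    struct decl *next;
    void *deferred;          /* models the pointer at +40 */
    uint8_t flags;           /* models byte +81 */
    int init_complete;       /* input for the stand-in predicate below */
} decl;

/* Stand-in for sub_5CA6F0; the real predicate's logic is not documented here. */
static int init_is_complete(const decl *d) { return d->init_complete; }

/* Clear the deferred marker on every declaration whose initialization finished.
   Returns the number of declarations that were updated. */
static int clear_deferred_flags(decl *head)
{
    int cleared = 0;
    for (decl *d = head; d != NULL; d = d->next) {
        if ((d->flags & DECL_DEFERRED) && init_is_complete(d)) {
            d->deferred = NULL;
            d->flags &= (uint8_t)~DECL_DEFERRED;
            cleared++;
        }
    }
    return cleared;
}
```

Declarations that still have the bit set after this loop are those for which the predicate returned false.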

After all secondary TUs are processed, the primary TU itself gets the same treatment:

for (tu = *primary_tu; tu != NULL; tu = *tu) {
    set_current_tu(tu);               // sub_7A3D60
    file_scope_il_wrapup();           // sub_588C60
}
set_current_tu(primary_tu);
file_scope_il_wrapup();               // for the primary TU itself

Cross-TU Marking (Between Pass 1 and Pass 2)

Before Pass 2 begins, if no errors have occurred (!qword_126ED90), sub_796C00 (mark_secondary_trans_unit_IL_entities_used_from_primary_as_needed) is called. This function:

Calls sub_60E4F0 with callbacks sub_796A60 (IL walk visitor) and sub_796A20 (scope visitor) to walk the primary TU's IL and mark the secondary-TU entities it references as needed.
Iterates the file table (dword_126EC80 entries starting at index 2), and for each valid file scope that is not bit-2 flagged in byte -8 and has a non-zero scope kind byte +28, calls sub_610200 with the same visitor callbacks.
Runs the walk twice (controlled by a counter: first pass with callback sub_796A60, second with NULL). The two-pass design appears intended to reach references transitively: the first pass discovers direct references, and the second propagates through chains of indirect references.

Pass 2: Template/Inline Instantiation and Needed-Flags

Function: sub_707040 (set_needed_flags_at_end_of_file_scope) Address: 0x707040 Per-TU: Yes, but skips TUs with errors (qword_126ED90 check) Source: scope_stk.c:8090

This pass determines which entities are "needed" -- must be preserved in the IL for backend consumption. It is the EDG "needed flags" computation, which decides based on linkage, usage, and language rules whether each declaration must survive to the output.

The function operates on a file scope IL node and walks four declaration lists at different offsets:

Offset	List	Entity Kind	Processing
`+168`	Nested scopes	Namespace/class scopes	Recursively calls `sub_707040` on each scope's IL root at `*(entry + 120)`, skipping entries with bit 0 of byte `+116` set (extern linkage marker)
`+104`	Type declarations	Classes (kind 9-11 at byte `+132`)	Calls `sub_61D7F0(entry, 6)` to set needed flag; recursively processes the class scope at `*(*(entry+152) + 128)` if non-null and bit 5 of byte `+29` is clear
`+112`	Variable declarations	Variables/objects	Complex multi-condition evaluation (see below)
`+144`	Routine declarations	Functions/methods	Checks template body availability at `*(entry+240)` and `*(*(entry+240)+8)`, bit 2 of byte `+186` (not-needed marker), and entity class at byte `+164`; marks via `sub_61CE20(entry, 0xB)`; preserves and restores bit 5 of byte `+177` across the call

Variable needed-flag logic

For each variable in the +112 list, the algorithm checks (in order of precedence):

If bit 3 of byte +80 is set (external/imported), skip the remaining checks and mark the variable as needed unconditionally via sub_61CE20(entry, 7).
Check sub_7A7850(*(entry+112)) -- if referenced, mark as needed.
Check sub_7A7890(*(entry+112)) -- if used, mark as needed.
Otherwise evaluate:
- Byte +162 bit 4 set and full compilation mode: check linkage class at byte +128 (1=external) and base type completeness via sub_75C1F0.
- Byte +128 == 0 (no linkage) or byte +169 == 2: check initializer pointer at +224 and constexpr flags at byte +164.
- Internal/external linkage with specific storage class: check definition pointer at +200, storage class byte +128, and flag patterns in bytes +160, +161.

At the start of file scope processing, dword_106B640 is set to 1. At the end, after optionally calling sub_6FE8C0 (C++ scope merging), it is cleared to 0.

Debug trace: prints "Start/End of set_needed_flags_at_end_of_file_scope" when the "needed_flags" debug flag is active.

Pass 3: Keep-in-IL Marking (Device Code Selection)

Function: sub_610420 (mark_to_keep_in_il) Address: 0x610420 Per-TU: Yes, skips error TUs Source: il_walk.c:1959 Argument: 23 (the file-scope walk mode)

This is the critical CUDA-specific pass. It determines which entities must be preserved in the intermediate language for device code compilation by cicc. The guard flag dword_106B640 is set to 1 before the call and cleared to 0 after; because the function asserts that the guard is nonzero, any call made outside this guarded window aborts.

The keep-in-IL bit is bit 7 (0x80) of the byte at (entity_pointer - 8). Testing uses signed comparison: *(entry - 8) < 0 means "marked for keeping."
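
The sign-bit test can be illustrated with a small standalone model. This is a sketch under stated assumptions: the helper names (`entity_header`, `is_kept`, `set_keep`, `clear_keep`) are hypothetical and do not correspond to functions in the binary. Only the header position (8 bytes before the entity) and the 0x80 mask come from the text.

```c
#include <stdint.h>

#define KEEP_IN_IL 0x80u   /* bit 7 of the header byte at (entity - 8) */

/* Return a pointer to the header flags byte that sits 8 bytes before an entity. */
static uint8_t *entity_header(void *entity)
{
    return (uint8_t *)entity - 8;
}

/* Sign-bit test: viewed as a signed byte on a two's-complement target,
   bit 7 is the sign, so "< 0" is equivalent to "(byte & 0x80) != 0". */
static int is_kept(void *entity)
{
    return (int8_t)*entity_header(entity) < 0;
}

/* Set the mark with OR 0x80; clear it with AND 0x7F, as the passes do. */
static void set_keep(void *entity)   { *entity_header(entity) |= KEEP_IN_IL; }
static void clear_keep(void *entity) { *entity_header(entity) &= 0x7F; }
```

The signed comparison is what a compiler typically emits for a bit-7 test on a `char`, which is why the decompiled conditions read as `< 0` and `>= 0` rather than as mask operations.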

Operation

Save/restore state: Saves and restores 9 global callback/state variables (qword_126FB88 through dword_126FB60), installing sub_617310 (prune_keep_in_il_walk) as the walk prune callback at qword_126FB78. All other callback slots are zeroed. The flag at dword_126FB58 is set to (byte_at_a1_minus_8 & 2) != 0, that is, bit 1 of the scope node's header byte.
File scope walk: When a2 == 23 and scope kind byte *(a1+28) is 0 (file scope), clears bit 7 of byte *(a1-8) via AND 0x7F. Then calls sub_6115E0(a1, 23) -- the recursive walk_tree_and_set_keep_in_il traversal on the file scope root.
C++ companion walk: For C++ mode (dword_126EFB4 == 2), calls sub_6175F0(a1) to walk scopes and mark out-of-line definitions and friend declarations.
Guard assertion: Asserts dword_106B640 != 0. If the guard was cleared during the walk, fires a fatal assertion at il_walk.c:1959 with function name mark_to_keep_in_il.
Pending entity lists: Iterates the deferred entity list at qword_126EBA0, calling sub_6115E0(entity, 55) for each entry with bit 2 set in byte *(entity[1] + 187) (the "deferred instantiation needed" flag).

45 category-specific walks: Iterates 45 global lists (the sum of the counts in the table below), each containing entities of a specific IL category. Each list is walked with a category-specific tag argument:

Global range	Tags	Count
`qword_126E610` -- `qword_126E770`	1--23	23 lists
`qword_126E7B0` -- `qword_126E7E0`	27--30	4 lists
`qword_126E810` -- `qword_126E8A0`	33--42	10 lists
`qword_126E8E0` -- `qword_126E900`	46--48	3 lists
`qword_126E9B0`, `qword_126E9D0`, `qword_126E9E0`, `qword_126E9F0`	59, 61, 62, 63	4 lists
`qword_126EA80`	72	1 list

These lists follow a reverse-linked structure where the back-pointer is at *(list_entry - 16), not at offset +0. Each entity's tag tells sub_6115E0 what kind of entity it is processing, which affects how the keep_in_il mark propagates to dependents.
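
The addresses and tags in the table follow a regular pattern: every listed head sits at 0x126E600 plus 16 times its tag. The sketch below encodes that observation and sums the tag ranges. The base constant and the function names are inferred from the table for illustration, not taken from EDG source, and the pattern is only checked against the rows shown.

```c
#include <stdint.h>

/* Observed pattern in the table above, not a documented EDG constant:
   the list head for tag t sits at 0x126E600 + 16 * t. */
#define LIST_TABLE_BASE 0x126E600u

static uint32_t list_head_address(unsigned tag)
{
    return LIST_TABLE_BASE + 16u * tag;
}

/* Inclusive tag ranges taken from the table. */
static const unsigned tag_ranges[][2] = {
    {1, 23}, {27, 30}, {33, 42}, {46, 48}, {59, 59}, {61, 63}, {72, 72}
};

/* Count how many lists the ranges cover. */
static unsigned count_walked_lists(void)
{
    unsigned total = 0;
    for (unsigned i = 0; i < sizeof tag_ranges / sizeof tag_ranges[0]; i++)
        total += tag_ranges[i][1] - tag_ranges[i][0] + 1;
    return total;
}
```

Summing the ranges listed in the table gives 45 lists. The gaps between ranges (for example tags 24 to 26) correspond to categories that this walk does not visit.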

Using-declaration fixed-point: Processes namespace member entries at *(root + 256) via sub_6170C0(member, is_file_scope, &changed) in a loop that repeats until changed == 0. The is_file_scope flag is derived from *(a1+28) being 2 or 17.
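
The repeat-until-stable structure can be sketched as follows. This is a toy model: the `member` struct, its `depends_on` field, and `visit_member` (a stand-in for `sub_6170C0`) are hypothetical. Only the loop shape, a full sweep repeated until no call reports a change, comes from the description above.

```c
#include <stddef.h>

/* Hypothetical namespace-member node; the real layout is not described here. */
typedef struct member {
    struct member *next;
    struct member *depends_on;  /* illustrative: kept only once this one is kept */
    int kept;
} member;

/* Stand-in for sub_6170C0: marks a member and reports whether anything changed. */
static void visit_member(member *m, int is_file_scope, int *changed)
{
    (void)is_file_scope;        /* unused in this toy model */
    if (!m->kept && m->depends_on && m->depends_on->kept) {
        m->kept = 1;
        *changed = 1;
    }
}

/* Repeat full sweeps until one sweep marks nothing new. Returns the sweep count. */
static int mark_until_stable(member *head, int is_file_scope)
{
    int sweeps = 0, changed;
    do {
        changed = 0;
        for (member *m = head; m != NULL; m = m->next)
            visit_member(m, is_file_scope, &changed);
        sweeps++;
    } while (changed);
    return sweeps;
}
```

A fixed-point loop is needed whenever marking one entry can make an earlier entry in the same list eligible, since a single forward sweep would miss it.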

Hidden name resolution: If *(a1+264) is non-NULL, walks hidden name entries. Each entry has a linked list at entry[1] with per-entry kind at *(entry + 16) (byte). Five kinds are handled:

Kind	Name	Action
`0x35`	Instantiation	Walk via `sub_6170C0` on `*(entry[3] + 8)`
`0x33`	Function template	Conditional marking based on scope type and entity mark
`0x34`	Variable template	Same as 0x33 with `v111 = entry[3]`
`0x36`	Alias template	Same as 0x33 with `v110 = entry[3]`
`6`	Class/struct	Special handling: checks typedef chain at byte `+132 == 12` with non-null source at `+8`; marks via `sub_6115E0(entity, 6)` for file-scope entries

For each marked hidden name entry, the keep_in_il bit at *(entry - 8) is set via OR with 0x80.

Context restore: Restores all saved function pointers and state variables.

Debug trace: "Beginning/Ending file scope keep_in_il walk" when the "needed_flags" flag is active.

For full details on the keep-in-IL mechanism, see Keep-in-IL.

Pass 4: Constant Folding, CUDA Transforms, and Dead Entity Elimination

Per-TU: Yes, skips error TUs C++ only: The sub_5CCA40 call is gated by dword_126EFB4 == 2

This pass has three sub-stages per TU. The first (sub_5CCA40) clears flags to prevent unnecessary work. The second (sub_5CC410) removes function bodies. The third (sub_5CCBF0) removes entire IL entries.

Stage 4a: Clear Unneeded Instantiation Flags -- `sub_5CCA40`

Address: 0x5CCA40 Source: il.c:29450 (clear_instantiation_required_on_unneeded_entities) C++ only: Asserts dword_126EFB4 == 2

Walks the same four declaration lists as Pass 2 (nested scopes at +168, types at +104, routines at +144, and for non-file scopes variables at +112). For routines that are not marked for keeping but have instantiation-required flags set, calls sub_78A380(entity, 0, 2) to clear the instantiation-required bit. This prevents the template engine from instantiating definitions that will be eliminated in the next sub-stage.

The conditions for clearing a routine's instantiation-required flag are:

Byte +80 bit 3 clear (not an external/imported entity)
Byte +179 bit 4 clear (not a special instantiation)
Byte +179 bits 1-2 == 0b10 (has "instantiation required" set) OR (dword_126E204 is set AND byte +176 bit 7 is set)
Non-null template pointer at *(entity + 0) (has a source template)
Byte +176 bit 1 clear (not already processed)

For non-file scopes (byte +28 of scope is nonzero), additionally processes variables in the +112 list with an analogous pattern: byte +162 bit 6 clear, (v8 & 0xB0) == 0x10 (the mask selects bits 4, 5, and 7, so bit 4 must be set with bits 5 and 7 clear), and a non-null pointer at *(entry + 0).
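
The routine conditions listed above can be collected into one predicate. The sketch below is illustrative: `routine_view` and both function names are hypothetical, and reading "bits 1-2 == 0b10" as the two-bit field `(b179 >> 1) & 3` having value 2 is an assumption about the notation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the routine fields named above; member names are
   illustrative and only the byte offsets in the comments come from the text. */
typedef struct routine_view {
    const void *source_template; /* *(entity + 0) */
    uint8_t b80;                 /* byte +80  */
    uint8_t b176;                /* byte +176 */
    uint8_t b179;                /* byte +179 */
} routine_view;

/* Returns nonzero when every listed condition holds. */
static int should_clear_instantiation_required(const routine_view *r,
                                               int flag_126E204)
{
    int required = ((r->b179 >> 1) & 3) == 2 ||
                   (flag_126E204 && (r->b176 & 0x80));
    return !(r->b80 & 0x08)      /* bit 3 clear: not external/imported */
        && !(r->b179 & 0x10)     /* bit 4 clear: not a special instantiation */
        && required
        && r->source_template != NULL
        && !(r->b176 & 0x02);    /* bit 1 clear: not already processed */
}

/* Variable-side mask test: (flags & 0xB0) == 0x10 holds when bit 4 is set
   and bits 5 and 7 are clear; bit 6 is outside the mask. */
static int variable_pattern_matches(uint8_t b162)
{
    return (b162 & 0xB0) == 0x10;
}
```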

Stage 4b: Eliminate Unneeded Function Bodies -- `sub_5CC410`

Address: 0x5CC410 Source: il.c:29231 (eliminate_bodies_of_unneeded_functions) Gate: dword_126E55C != 0 (deferred class members exist)

Iterates the scope table (qword_126EB98, 16-byte entries: {qword scope_ptr, int file_index, pad}). The iteration runs from index 1 through dword_126EC78. For each entry:

Checks that the file reference at qword_126EC88[file_index] is non-null.
Checks TU ownership:
- Primary TU (qword_106BA10 == qword_106B9F0): checks ((*(scope_ptr - 8) >> 1) ^ 1) & 1 -- bit 1 of the pre-header flags byte must be clear.
- Secondary TU: checks qword_126DFE0[*(scope_ptr + 24)] == qword_106BA10 -- the scope's file index maps to the current TU.
Verifies scope kind byte +28 == 17 (class/namespace scope).
Checks the keep-in-il mark: bit 2 of byte *(scope_ptr + 187) must be clear (not needed) AND the scope file entry has bit 0 of byte +29 set (eligible for elimination).
If all checks pass, calls sub_5CAB40 to remove the function body from the scope.
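
The shift-xor-mask expression in the primary-TU check is a common decompiler rendering of a single-bit test. The sketch below shows the expression next to an equivalent readable form; both function names are hypothetical.

```c
#include <stdint.h>

/* Expression as it appears in the decompilation: shift bit 1 down, invert, mask. */
static int bit1_clear_decompiled(uint8_t header_flags)
{
    return ((header_flags >> 1) ^ 1) & 1;
}

/* Equivalent, more readable form: true when bit 1 (mask 0x02) is clear. */
static int bit1_clear(uint8_t header_flags)
{
    return (header_flags & 0x02) == 0;
}
```

The two forms agree for every byte value, so a reimplementation can use the mask form directly.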

In C++ mode (dword_126EFB4 == 2), also calls sub_6FFBA0 to reorganize namespace-level declarations after body removal.

Debug trace: "eliminate_bodies_of_unneeded_functions" at level 3.

Stage 4c: Eliminate Unneeded IL Entries -- `sub_5CCBF0`

Address: 0x5CCBF0 Source: il.c:29598 (eliminate_unneeded_il_entries) Gate: dword_126E55C != 0

The heaviest sub-stage. First calls sub_703C30(a1) to get a scope summary structure (7-element qword array stored at v2), asserting the result is non-null. Then walks four entity lists, removing entries whose keep-in-IL mark (bit 7 of byte at entity - 8) is clear:

List	Offset	Entity Type	Removal actions
Variables	`+112`	Variable declarations	Unlink from list; for C++, call `sub_7B0B60` on type pointers at `+112` and `+216` with callback `sub_5C71B0` (id 147) to clean up associated type metadata
Routines	`+144`	Function/method declarations	Unlink from list; same `sub_7B0B60` type cleanup on type pointers at `+144` and `+248`; set bit 5 of byte `+87` in the routine supplement at `*(entity+152)`
Types	`+104`	Type declarations	Unlink from list; for class entities (kind 9-11 at byte `+132`), call `sub_5CB920` (C++ member cleanup) then `sub_5E2D70` (scope deallocation); set bit 5 of byte `+87` in the entity supplement
Hidden names	`+272`	Hidden name entries	Unlink unmarked entries from list

After variable/routine/type processing, the tail pointers are stored into v2[5], v2[6], and v2[4] respectively (the scope summary structure).

For file-scope nodes (byte +28 == 0), additionally calls sub_5CC570 (eliminate unneeded scope orphaned list entries) after variable processing, and sub_718720 (scope-level cleanup) after type/hidden-name processing.

After list processing, walks qword_126EBE0 (a global deferred entity chain) and removes entries where *(entry - 8) >= 0 (bit 7 clear = not marked).
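
The unlink-and-track-tail pattern shared by the list walks can be sketched as follows. This is a generic singly-linked-list model, not the decompiled routine: `il_entry` and `prune_unmarked` are hypothetical names, and the `kept` field stands in for the bit-7 test on the byte at `entity - 8`.

```c
#include <stddef.h>

/* Hypothetical list node; "kept" stands in for the bit-7 test on the header byte. */
typedef struct il_entry {
    struct il_entry *next;
    int kept;
} il_entry;

/* Remove every entry whose mark is clear. Updates *head in place and returns the
   new tail (NULL if the list becomes empty), mirroring the tail pointers that
   the text says are stored into the scope summary. */
static il_entry *prune_unmarked(il_entry **head)
{
    il_entry **link = head;
    il_entry *tail = NULL;
    while (*link != NULL) {
        il_entry *e = *link;
        if (e->kept) {
            tail = e;
            link = &e->next;
        } else {
            *link = e->next;   /* unlink; storage is not freed in this sketch */
            e->next = NULL;
        }
    }
    return tail;
}
```

Walking with a pointer to the link, rather than a pointer to the node, removes the need for a special case when the head itself is eliminated.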

String arithmetic in debug output

The diagnostic output uses a pointer arithmetic trick: "TARG_VERT_TAB_CHAR" + 17 evaluates to "R", so the format string "%semoving variable " produces either "Removing variable ..." (when the entity is being removed) or "Not removing variable ..." (when kept).
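
The trick can be reproduced in a few lines of C. In this sketch `format_removal` is a hypothetical helper, and the `"Not r"` literal is inferred from the "Not removing" output rather than quoted from the binary. The indexing form `&"..."[17]` is equivalent to `"..." + 17` and avoids a compiler warning about adding an integer to a string literal.

```c
#include <stdio.h>
#include <string.h>

/* Build the trace line the way the text describes: the prefix is either the
   final "R" of an unrelated literal or a separate "Not r" string. */
static void format_removal(char *out, size_t cap, int removing, const char *name)
{
    /* Index 17 is the last character of the 18-character literal. */
    const char *prefix = removing ? &"TARG_VERT_TAB_CHAR"[17] : "Not r";
    snprintf(out, cap, "%semoving variable %s", prefix, name);
}
```

A linker that merges string tails can make the one-character string "R" share storage with the end of a longer literal, which is one plausible way this pattern ends up in a decompilation.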

Deferred-Class-Members Flag

Pass 4 checks dword_126E55C after each TU's stage 4a. This flag indicates whether there are deferred class member definitions that need processing. If no errors occurred and the flag is set, stages 4b and 4c run. If errors are present during the per-TU loop, the flag is simply cleared to 0 and stages 4b/4c are skipped for that TU.

Pass 5: Per-TU Final Cleanup

Function: sub_588D40 (file_scope_il_wrapup_part_3) Address: 0x588D40 Source: fe_wrapup.c:559 Per-TU: Yes (all TUs, no error skip)

This pass performs final statement-level processing and scope validation, then optionally re-runs the Pass 2-4 sequence for the main compilation unit.

Operations

Statement finalization: sub_5BAD30 -- finalizes statement-level IL nodes (label resolution, goto target binding, fall-through analysis).
Scope stack assertion (C++ with dword_106BA08): Verifies that *(qword_126C5E8 + 784 * dword_126C5E4 + 496) == qword_126E4C0. The scope stack is an array of 784-byte entries at qword_126C5E8, indexed by dword_126C5E4 (current depth). The assertion checks that the scope pointer at offset +496 of the current entry matches the expected file scope entity (qword_126E4C0). On mismatch, triggers a fatal assertion at fe_wrapup.c:559 with function name file_scope_il_wrapup_part_3.
Scope cleanup: For C++ mode, calls sub_5C9E10(0) -- finalizes class scope processing, resolves deferred member access checks.
IL output: sub_709250 -- serializes the IL tree to the IL output stream. This produces the internal representation that the backend reads, not the final .int.c file.
Template output: sub_7C2560 -- serializes template instantiation information to the output.
Mirrored 3-pass sequence (only when dword_106BA08 -- full compilation mode): Re-runs passes 2-4 on the main TU's file scope node. This handles entities that were discovered or modified during the per-TU passes. The re-run is necessary because secondary TU processing may have added new cross-references to the primary TU's entities:
- sub_707040(file_scope) (needed flags) -- if errors appear (qword_126ED90), clears dword_126E55C and skips remaining
- sub_610420(file_scope, 23) with dword_106B640 = 1/0 guard -- again skips the remaining stages if errors are present
- sub_5CCA40(file_scope) (clear instantiation flags, C++ only)
- sub_5CC410() + sub_5CCBF0(file_scope) (eliminate, if dword_126E55C)
Source file state: sub_6B9580 -- updates source file tracking counters.
Diagnostic flush: sub_4F4030 -- flushes pending diagnostic messages for this TU.
File scope cleanup: sub_6B9340(dword_126EC90) -- closes file scope state, passing the current error count for this file.
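
The scope stack assertion in the list above is plain array indexing over fixed-size records. A minimal sketch, assuming a byte-addressed buffer in place of the real stack: the function names are hypothetical, while the entry size (784) and field offset (+496) come from the text.

```c
#include <stddef.h>
#include <string.h>

enum { SCOPE_ENTRY_SIZE = 784, SCOPE_PTR_OFFSET = 496 };

/* Read the pointer-sized field at +496 of scope stack entry `depth`.
   `stack_base` models qword_126C5E8 and `depth` models dword_126C5E4. */
static void *scope_entry_pointer(const unsigned char *stack_base, int depth)
{
    void *p;
    memcpy(&p, stack_base + (size_t)SCOPE_ENTRY_SIZE * (size_t)depth
                          + SCOPE_PTR_OFFSET, sizeof p);
    return p;
}

/* The check from the text: the current entry must reference the file scope. */
static int scope_stack_is_consistent(const unsigned char *stack_base, int depth,
                                     const void *expected_file_scope)
{
    return scope_entry_pointer(stack_base, depth) == expected_file_scope;
}
```

In the binary a failed comparison triggers the fatal assertion at fe_wrapup.c:559; the sketch returns a boolean instead so that it can be exercised in isolation.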

Post-Pass Operations

After all five passes complete, fe_wrapup performs a series of global operations that are not per-TU.

Cross-TU IL Consistency -- `sub_796BA0`

Address: 0x796BA0 Source: trans_copy.c:3003 (copy_secondary_trans_unit_IL_to_primary)

Called only when there are no errors (!qword_126ED90), the multi-TU flag is clear (!dword_106C2B4), and there are secondary TUs (*(qword_106B9F0) != 0). In the current binary, this function always triggers a fatal assertion at trans_copy.c:3003 -- the multi-TU IL copy infrastructure is compiled but disabled, likely reserved for future C++ module compilation support. The function traces "copy_secondary_trans_unit_IL_to_primary" before aborting.

Scope Renumbering -- `sub_707480`

Address: 0x707480 Source: scope_stk.c

Invoked from a double loop in fe_wrapup that runs when dword_126C5A0 (scope renumbering flag) is set and dword_126EC78 > 0 (scope count is positive):

unsigned pass = 1;
do {
    for (int idx = 1; idx < dword_126EC78; idx++)
        sub_707480(idx, pass);
    if (!pass) break;
    pass = 0;
} while (dword_126EC78 > 0);
dword_126C5A0 = 0;

For each scope entry at qword_126EB98 + 16 * idx:

Extracts the scope pointer at +0 and file index at +8
Checks non-null scope pointer, valid file reference in qword_126EC88[file_index]
Verifies scope kind byte +28 == 17 (class/namespace scope)
In pass=1: skips entries where byte +176 of the entity at *(scope+32) is non-negative
Checks bit 1 of byte *(scope-8) is clear and bit 0 of byte +29 is clear
In C++ mode with bit 5 of byte *(*(scope+32) + 186) set: calls sub_6FFBA0 to reorganize scope members
Calls sub_6FE2A0(scope, 0, 1) to renumber the scope's declaration entries

After the double-loop, clears dword_126C5A0 = 0.

Template Validation -- `sub_765480`

Address: 0x765480 Source: templates.c:19822 (remove_unneeded_instantiations)

Called unless dword_106C094 == 1 (minimal compilation mode). Walks the instantiation pending list at qword_12C7740 (linked via offset +8) and removes template instantiations that are no longer needed:

Referent kind (byte `+80`)	Entity kind	Action
9	Class template instantiation	If function body exists and is unreferenced (or `dword_106C094 == 2`), call `sub_5CE710` to eliminate class definition
7	Function template instantiation	Same check with `dword_106C094 != 2` guard
10-11	Variable/alias template	Call `sub_5BBC70` to find underlying function, then `sub_5CAB40` to remove body

Each entry has: offset +16 = template entity pointer, offset +24 = referent entity, offset +80 = flags byte.

Final Main-TU Cleanup

Calls sub_588D40 one more time on the main translation unit (not iterating the chain). This ensures the primary TU gets the same final cleanup treatment as secondary TUs.

File Index Processing

If the primary TU has secondary TUs (*(qword_106B9F0) != 0), iterates the file table starting at index 2 through dword_126EC80:

for (int idx = 2; idx <= dword_126EC80; idx++) {
    if (!qword_126EC88[idx] || *(byte *)(qword_126EB90[idx] + 28))
        continue;
    sub_6B8B20(idx);
}

sub_6B8B20 resets the file state for each valid file index whose scope kind byte (+28) is zero, updating the source file manager's tracking structures.

Output Flush and File Close

Conditional flush: If dword_106C250 is set and no errors, calls sub_5F7DF0(0) -- flushes the IL output stream.

Close three output files via sub_4F7B10:

Call	File pointer	ID	Identity
`sub_4F7B10(&qword_106C280, 1513)`	Primary output	1513	Main `.int.c` output (or stdout)
`sub_4F7B10(&qword_106C260, 1514)`	Secondary output	1514	Module interface or IL dump
`sub_4F7B10(&qword_106C258, 1515)`	Tertiary output	1515	Template instantiation log

sub_4F7B10 checks if the file pointer is non-null, zeroes it, calls sub_5AEAD0 (fclose wrapper), and on error triggers diagnostic sub_4F7AA0 with the given ID.
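
The close sequence described above maps onto standard C I/O as follows. This is a sketch, not the decompiled routine: `close_output_file` and `report_close_error` are hypothetical stand-ins for `sub_4F7B10` and `sub_4F7AA0`, and plain `fclose` replaces the `sub_5AEAD0` wrapper.

```c
#include <stdio.h>

/* Stand-in for the diagnostic routine; it only records the last reported ID
   so that the behaviour can be observed. */
static int last_close_error_id = 0;
static void report_close_error(int diag_id) { last_close_error_id = diag_id; }

/* Close *slot if it is open. The slot is zeroed before fclose, matching the
   order given in the text. Returns 1 if a file was closed, 0 otherwise. */
static int close_output_file(FILE **slot, int diag_id)
{
    FILE *f = *slot;
    if (f == NULL)
        return 0;
    *slot = NULL;
    if (fclose(f) != 0)
        report_close_error(diag_id);
    return 1;
}
```

Zeroing the slot first makes the helper safe to call twice on the same slot, which matters when teardown paths overlap.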

Memory Statistics Reporting

Triggered when any of these conditions hold:

dword_106BC80 is set (always-report-stats flag)
dword_126EFCC > 0 (verbosity level > 0)
Debug mode (dword_126EFC8) with "space_used" flag active

Sums the return values of 10 subsystem space_used functions:

#	Function	Address	Subsystem	Report Header
1	`sub_74A980`	`0x74A980`	Symbol table	`"Symbol table use:"`
2	`sub_6B6280`	`0x6B6280`	Macro table	`"Macro table use:"`
3	`sub_4ED970`	`0x4ED970`	Error/diagnostic table	`"Error table use:"`
4	`sub_6887C0`	`0x6887C0`	Conversion table	(conversion/cast subsystem)
5	`sub_4E8F60`	`0x4E8F60`	Declaration table	(declarations subsystem)
6	`sub_56D8C0`	`0x56D8C0`	Expression table	`"Expression table use:"`
7	`sub_5CEA80`	`0x5CEA80`	IL table	(IL node/class subsystem)
8	`sub_726C80`	`0x726C80`	Mangling table	(name mangling subsystem)
9	`sub_6FDF00`	`0x6FDF00`	Lowering table	(IL lowering subsystem)
10	`sub_419150`	`0x419150`	Diagnostic table	(diagnostic output subsystem)

Each function prints its own detailed allocation table to stderr in a standardized format with columns Table / Number / Each / Total, tracks "lost" entries (allocated count minus free-list traversal count), and returns its total byte count.

The cumulative sum is passed to sub_6B95C0 (print_memory_management_statistics at 0x6B95C0), which prints the grand total accounting report:

Memory management table use:
                    Table   Number     Each    Total
             text buffers      NNN       40     NNNN
                    Total                       NNNN

Allocated space in all categories:
           Total of above                    NNNNNNN
    Skipped for alignment                       NNNN
       File mapped memory                          0
          Mapped from PCH                          0  (included in previous line)
      Mapped IL file size                          0
               Not listed                      NNNNN
               Total used                    NNNNNNN
  Avail in used mem blocks                      NNNN
 Avail in freed mem blocks                         0
             Max mem alloc                   NNNNNNN

The "Not listed" entry is computed as qword_1280700 + qword_1280708 - qword_12806F8 - total_above -- it captures memory allocated by subsystems that do not have their own space_used reporter.

Debug Dumps

If debug mode (dword_126EFC8) is active:

"scope_stack" flag: calls sub_702DC0 -- dumps the entire scope stack to stderr, showing all active scopes with their indices, kinds, and entity counts.
"viability" flag: calls sub_6C6570 -- dumps overload viability information, showing candidate sets and resolution decisions.

Final Teardown

IL allocator check -- sub_5E1D00 (check_local_constant_use at il_alloc.c:1177): Copies qword_126EFB8 to qword_126EDE8 (restores the IL source position to a baseline). Asserts qword_126F680 == 0 -- no pending local constants should remain after wrapup. If nonzero, fires a fatal assertion.
Reset global state (five globals zeroed, with one cleanup call interleaved):
- qword_126DB48 = 0 -- pending entity pointer (scope tracking)
- Call sub_4ED0E0() -- declaration subsystem cleanup (releases declaration pools)
- dword_126EE48 = 0 -- init-complete flag (cleared, marking end of frontend processing)
- qword_106BA10 = 0 -- current TU descriptor (no active TU)
- qword_12C7768 = 0 -- template state pointer 1
- qword_12C7770 = 0 -- template state pointer 2
Timing: If debug mode is active, calls sub_48AFD0, which prints the trace timing footer for the fe_wrapup section.

Error Gating Summary

Each pass has a distinct error-gating pattern. The conditions below are verified against the decompiled sub_588F90:

Pass	Error behavior	Decompiled condition
Pass 1 (`sub_588C60`)	No gate -- always runs. Cleanup operations (template release, exception spec finalization) are safe and necessary even after errors.	None. Unconditional iteration of all secondary TUs followed by primary.
Cross-TU (`sub_796C00`)	Skipped entirely if any errors occurred. This prevents cross-TU marking from propagating errors between units.	`if (!qword_126ED90) sub_796C00();` (line 67-68 of decompiled)
Pass 2 (`sub_707040`)	Per-TU skip. Inside the TU iteration loop, each TU is independently gated: if errors exist when that TU is selected, it is skipped but subsequent TUs may still run.	`sub_7A3D60(tu); if (!qword_126ED90) sub_707040(*(qword_106BA10 + 8));` (lines 77-84)
Pass 3 (`sub_610420`)	Per-TU skip. Same per-TU gating as Pass 2. When a TU is skipped, `dword_106B640` is never set to 1, so the guard flag remains 0.	`sub_7A3D60(tu); if (!qword_126ED90) { dword_106B640 = 1; sub_610420(..., 23); dword_106B640 = 0; }` (lines 97-108)
Pass 4 (`sub_5CCA40` etc.)	Per-TU skip. On error for a TU: `dword_126E55C` is cleared to 0, which prevents stages 4b (`sub_5CC410`) and 4c (`sub_5CCBF0`) from running for that TU. Stage 4a (`sub_5CCA40`) is additionally gated by `dword_126EFB4 == 2` (C++ only).	`sub_7A3D60(tu); if (!qword_126ED90) { ... if (dword_126E55C) { sub_5CC410(); sub_5CCBF0(v8); } } else { dword_126E55C = 0; }` (lines 120-137)
Pass 5 (`sub_588D40`)	No gate on the per-TU iteration -- always runs. However, the internal mirrored 2-3-4 re-run within `sub_588D40` is individually error-gated at each stage.	Unconditional iteration. Internal re-run checks `qword_126ED90` before each of `sub_707040`, `sub_610420`, `sub_5CCA40`.
Post-passes	`sub_796BA0` requires `!qword_126ED90 && !dword_106C2B4 && *(qword_106B9F0) != 0`. `sub_5F7DF0` requires `dword_106C250 && !qword_126ED90`. All others run unconditionally.	Line 158: `if (!qword_126ED90 && !dword_106C2B4 && *v4) sub_796BA0();` Line 213: `if (dword_106C250 && !qword_126ED90) sub_5F7DF0(0);`

Data Flow Summary

Input	Description
`qword_106B9F0`	TU chain head -- linked list of all translation units
`*(qword_106BA10 + 8)`	File scope IL root node -- the IL tree for each TU
`qword_126ED90`	Error flag -- nonzero means compilation errors occurred
`dword_126EFB4`	Language mode -- 2 for C++, gates pass 4 and template operations
`dword_106BA08`	Full compilation mode flag -- gates Pass 5's mirrored sequence

Output	Description
Finalized IL tree	Entities marked for keeping preserved; all others eliminated
`dword_106B640`	Keep-in-IL walk guard flag -- 0 at completion
`dword_126E55C`	Deferred class members flag -- 0 after processing
Closed output files	Three output streams (IDs 1513-1515) flushed and closed
Zeroed globals	`qword_106BA10`, `dword_126EE48`, `qword_126DB48`, template state -- all cleared

Function Map

Address	Identity	Source file	Role in fe_wrapup
`sub_588F90`	`fe_wrapup`	`fe_wrapup.c:776`	Top-level entry, called from `main()`
`sub_588C60`	`file_scope_il_wrapup`	`fe_wrapup.c`	Pass 1: template/exception cleanup, IL walk, IL finalize
`sub_588D40`	`file_scope_il_wrapup_part_3`	`fe_wrapup.c:559`	Pass 5: statement finalization, scope assertion, IL/template output
`sub_588E90`	`translation_unit_wrapup`	`fe_wrapup.c`	Called from `process_translation_unit`, not directly from fe_wrapup
`sub_707040`	`set_needed_flags_at_end_of_file_scope`	`scope_stk.c:8090`	Pass 2: compute needed-flags on all entity lists
`sub_610420`	`mark_to_keep_in_il`	`il_walk.c:1959`	Pass 3: mark entities for device code preservation
`sub_5CCA40`	`clear_instantiation_required_on_unneeded_entities`	`il.c:29450`	Pass 4a: prevent unnecessary template instantiation
`sub_5CC410`	`eliminate_bodies_of_unneeded_functions`	`il.c:29231`	Pass 4b: remove dead function bodies
`sub_5CCBF0`	`eliminate_unneeded_il_entries`	`il.c:29598`	Pass 4c: remove dead entities from IL lists
`sub_796C00`	`mark_secondary_trans_unit_IL_entities_used_from_primary_as_needed`	`trans_copy.c` (inferred from address adjacency to `sub_796BA0`)	Between Pass 1 and 2: cross-TU reference marking
`sub_796BA0`	`copy_secondary_trans_unit_IL_to_primary`	`trans_copy.c:3003`	Post-pass: dead in CUDA build (always asserts)
`sub_707480`	scope renumber (inferred)	`scope_stk.c`	Post-pass: renumber scope declarations
`sub_765480`	`remove_unneeded_instantiations`	`templates.c:19822`	Post-pass: prune template instantiation list
`sub_6B95C0`	`print_memory_management_statistics`	memory mgmt	Post-pass: grand total memory report
`sub_5E1D00`	`check_local_constant_use`	`il_alloc.c:1177`	Post-pass: assert no pending local constants
`sub_7A3D60`	`set_current_translation_unit`	`trans_unit.c`	Called before every per-TU operation
`sub_706710`	IL tree walk	IL subsystem	Pass 1 via `sub_588C60`
`sub_706F40`	IL finalize	IL subsystem	Pass 1 via `sub_588C60`
`sub_6115E0`	`walk_tree_and_set_keep_in_il`	`il_walk.c`	Pass 3: recursive keep_in_il walker
`sub_6170C0`	namespace member walk	`il_walk.c`	Pass 3: using-declaration fixed-point
`sub_6175F0`	C++ companion walk	`il_walk.c`	Pass 3: out-of-line definitions
`sub_617310`	`prune_keep_in_il_walk`	`il_walk.c`	Pass 3: installed as walk prune callback
`sub_5BD350`	destroy temporaries	IL subsystem	Pass 1: C++ temporary cleanup
`sub_7C2690`	template cleanup	template engine	Pass 1: release deferred template state
`sub_68A0C0`	exception cleanup	exception handling	Pass 1: finalize exception specs
`sub_78A9D0`	`template_and_inline_entity_wrapup`	C++ support	Preamble: C++ pre-wrapup
`sub_78A380`	clear instantiation-required flag	template engine	Pass 4a via `sub_5CCA40`
`sub_5CAB40`	eliminate function body	IL subsystem	Pass 4b via `sub_5CC410`, post-pass via `sub_765480`
`sub_5CE710`	eliminate class definition	IL subsystem	Post-pass via `sub_765480`
`sub_5CB920`	C++ member cleanup	class subsystem	Pass 4c via `sub_5CCBF0`
`sub_5E2D70`	scope deallocation	scope subsystem	Pass 4c via `sub_5CCBF0`
`sub_5CC570`	eliminate scope orphaned entries	IL subsystem	Pass 4c via `sub_5CCBF0`
`sub_718720`	scope-level cleanup	scope subsystem	Pass 4c via `sub_5CCBF0`
`sub_703C30`	get scope summary	scope subsystem	Pass 4c via `sub_5CCBF0`
`sub_7B0B60`	walk type tree	type subsystem	Pass 4c: type metadata cleanup
`sub_5C71B0`	type cleanup callback	type subsystem	Pass 4c: invoked via `sub_7B0B60` with id 147
`sub_6FE2A0`	renumber scope entries	scope subsystem	Post-pass via `sub_707480`
`sub_6FFBA0`	reorganize scope members	scope subsystem	Pass 4b, scope renumbering
`sub_6FE8C0`	C++ scope merge	scope subsystem	Pass 2: merge declaration/scope lists
`sub_4F7B10`	close output file	file I/O	Post-pass: close 3 files
`sub_5F7DF0`	flush IL output	IL output	Post-pass: conditional flush
`sub_6B8B20`	process file entry	source file mgr	Post-pass: file index loop
`sub_4ED0E0`	declaration cleanup	declarations	State teardown
`sub_709250`	IL output	IL output	Pass 5: serialize IL tree
`sub_7C2560`	template output	template engine	Pass 5: serialize template info
`sub_5BAD30`	statement finalization	statement subsystem	Pass 5: finalize statement-level nodes
`sub_5C9E10`	class scope finalization	class subsystem	Pass 5: C++ scope cleanup
`sub_6B9580`	source file state update	source file mgr	Pass 5: update file tracking
`sub_4F4030`	diagnostic flush	diagnostics	Pass 5: flush pending messages
`sub_6B9340`	file scope close	source file mgr	Pass 5: close file scope with error count
`sub_702DC0`	scope stack dump	scope subsystem	Post-pass: debug dump
`sub_6C6570`	viability dump	overload resolution	Post-pass: debug dump
`sub_48AE00`	debug trace enter	debug subsystem	Preamble, Pass 4b/4c
`sub_48AFD0`	debug trace exit/timing	debug subsystem	Final: print timing
`sub_48A7E0`	debug flag check	debug subsystem	Multiple: check named trace flags

Diagnostic Strings

String	Location	When emitted
`"fe_wrapup"`	`sub_588F90` preamble	Debug trace at function entry
`"bad translation unit in fe_wrapup"`	`sub_588F90` preamble	Fatal assertion when `dword_106BA08 == 0`
`"source_file_for_seq_info"`	`sub_588F90` preamble	Debug flag name for source sequence dump
`"Start of set_needed_flags_at_end_of_file_scope"`	`sub_707040` entry	Pass 2 debug trace
`"End of set_needed_flags_at_end_of_file_scope"`	`sub_707040` exit	Pass 2 debug trace
`"needed_flags"`	`sub_707040`, `sub_610420`	Debug flag name for needed-flags diagnostics
`"bad scope kind"`	`sub_707040`	Fatal assertion when scope kind is not 0, 3, or 6
`"variable_needed_even_if_unreferenced"`	`sub_707040`	Assertion function name at `scope_stk.c:7999/8001`
`"Beginning file scope keep_in_il walk"`	`sub_610420` entry	Pass 3 debug trace
`"Ending file scope keep_in_il walk"`	`sub_610420` exit	Pass 3 debug trace
`"mark_to_keep_in_il"`	`sub_610420`	Fatal assertion function name at `il_walk.c:1959`
`"file_scope_il_wrapup_part_3"`	`sub_588D40`	Assertion function name at `fe_wrapup.c:559`
`"clear_instantiation_required_on_unneeded_entities"`	`sub_5CCA40`	Assertion function name at `il.c:29450`
`"eliminate_bodies_of_unneeded_functions"`	`sub_5CC410`	Debug trace at level 3
`"eliminate_unneeded_il_entries"`	`sub_5CCBF0`	Debug trace at level 3
`"Removing variable ..."`	`sub_5CCBF0`	Verbose output when removing a variable entity
`"Not removing variable ..."`	`sub_5CCBF0`	Verbose output when keeping a variable entity
`"Removing routine ..."`	`sub_5CCBF0`	Verbose output when removing a function entity
`"Not removing routine ..."`	`sub_5CCBF0`	Verbose output when keeping a function entity
`"Removing hidden name entry for ..."`	`sub_5CCBF0`	Verbose output during hidden name cleanup
`"check_local_constant_use"`	`sub_5E1D00`	Assertion function name at `il_alloc.c:1177`
`"copy_secondary_trans_unit_IL_to_primary"`	`sub_796BA0`	Debug trace + fatal assertion at `trans_copy.c:3003/3008`
`"remove_unneeded_instantiations"`	`sub_765480`	Assertion function name at `templates.c:19822/19848`
`"scope_stack"`	`sub_588F90` post-pass	Debug flag name for scope stack dump
`"viability"`	`sub_588F90` post-pass	Debug flag name for viability analysis dump
`"space_used"`	`sub_588F90` post-pass	Debug flag name for memory statistics
`"dump_elim"`	`sub_5CCBF0`, `sub_5CC410`	Debug flag name for entity removal details
`"Memory management table use:"`	`sub_6B95C0`	Memory statistics report header
`"Symbol table use:"`	`sub_74A980`	Symbol table statistics header
`"Macro table use:"`	`sub_6B6280`	Macro table statistics header
`"Error table use:"`	`sub_4ED970`	Error table statistics header
`"Expression table use:"`	`sub_56D8C0`	Expression table statistics header

Key Global Variables

Variable	Address	Role in fe_wrapup
`qword_106B9F0`	`0x106B9F0`	TU chain head. Iterated by all 5 passes.
`qword_106BA10`	`0x106BA10`	Current TU descriptor. Switched by `sub_7A3D60` before each TU.
`qword_126ED90`	`0x126ED90`	Error flag. Passes 2-4 skip TUs when nonzero.
`dword_126EFB4`	`0x126EFB4`	Language mode. 2 = C++. Gates `sub_5CCA40`, `sub_78A9D0`, template operations.
`dword_106BA08`	`0x106BA08`	Full compilation flag. Gates preamble assertion and Pass 5's mirrored sequence.
`dword_106B640`	`0x106B640`	IL emission guard. Set to 1 during Pass 2 (file scope entry) and Pass 3 (caller). Asserted by `sub_610420`. Cleared to 0 at the end.
`dword_126E55C`	`0x126E55C`	Deferred class members flag. When set, enables Passes 4b and 4c. Cleared on error exit.
`dword_126C5A0`	`0x126C5A0`	Scope renumbering flag. When set, enables post-pass `sub_707480` double-loop. Cleared after.
`dword_126EC78`	`0x126EC78`	Scope count. Controls iteration bounds for `sub_707480` and `sub_5CC410`.
`qword_126EB98`	`0x126EB98`	Scope table base. 16-byte entries: `{qword scope_ptr, int file_index, pad}`.
`dword_126EC80`	`0x126EC80`	File table entry count. Controls file index processing loop.
`qword_126EC88`	`0x126EC88`	File table (name/scope pointers). Indexed by file ID.
`qword_126EB90`	`0x126EB90`	File table (info entries). Indexed by file ID.
`dword_106C094`	`0x106C094`	Compilation mode. Value 1 skips `sub_765480` (`remove_unneeded_instantiations`, the post-pass pruning of the template instantiation list).
`dword_106C250`	`0x106C250`	Output flush flag. When set with no errors, calls `sub_5F7DF0(0)`.
`dword_106C268`	`0x106C268`	CUDA diagnostics flag. Gates `sub_6B3260` in preamble.
`dword_106C2B4`	`0x106C2B4`	Cross-TU copy disabled. When set, skips `sub_796BA0`.
`dword_126EFC8`	`0x126EFC8`	Debug/trace mode. Enables trace output and debug dumps throughout.
`dword_126EFCC`	`0x126EFCC`	Diagnostic verbosity level. Level > 0 enables memory stats, > 2 enables dump_elim.
`dword_106BC80`	`0x106BC80`	Always-report-stats flag. Forces memory statistics regardless of verbosity.
`dword_126EE48`	`0x126EE48`	Init-complete flag. Set to 1 during `fe_init_part_1`, cleared to 0 during teardown.
`qword_126DB48`	`0x126DB48`	Scope tracking pointer. Cleared during teardown.
`qword_12C7768`	`0x12C7768`	Template state pointer 1. Cleared during teardown.
`qword_12C7770`	`0x12C7770`	Template state pointer 2. Cleared during teardown.
`qword_126E4C0`	`0x126E4C0`	Expected file scope entity. Compared in Pass 5 scope assertion.
`qword_126C5E8`	`0x126C5E8`	Scope stack base pointer. Array of 784-byte entries.
`dword_126C5E4`	`0x126C5E4`	Current scope stack depth index.
`dword_126E204`	`0x126E204`	Template mode flag. Affects instantiation-required clearing in Pass 4a.
`qword_126EBA0`	`0x126EBA0`	Deferred entity list head. Walked in Pass 3.
`qword_126EBE0`	`0x126EBE0`	Global deferred entity chain. Cleaned in Pass 4c.
`qword_12C7740`	`0x12C7740`	Template instantiation pending list. Walked by `sub_765480`.
`qword_126DFE0`	`0x126DFE0`	File-index-to-TU mapping table. Used for TU ownership checks.

Cross-References

Pipeline Overview -- fe_wrapup is stage 6 in the 8-stage pipeline
Keep-in-IL -- detailed coverage of the device code selection mechanism (Pass 3)
IL Overview -- the IL data structures walked by all five passes
Backend Code Generation -- stage 7, consumes the finalized IL produced by fe_wrapup
Entry Point & Initialization -- the main() function that calls sub_588F90
Frontend Invocation -- stage 5, builds the IL tree that fe_wrapup finalizes
Timing & Exit -- fe_wrapup completion marks the end of "Front end time"
Device/Host Separation -- the keep_in_il mechanism's relationship to device code isolation

Backend Code Generation

The backend is the final stage of the cudafe++ pipeline (stage 7 in the overview). It lives in a single function, process_file_scope_entities (sub_489000, 723 decompiled lines, 4520 bytes), whose job is to walk the EDG source sequence produced by the frontend and emit a .int.c file that the host C++ compiler (gcc, clang, or cl.exe) can compile. The function resides in cp_gen_be.c at EDG source lines around 19916-26628, and it delegates per-entity code generation to gen_template (sub_47ECC0, 1917 decompiled lines), which dispatches on entity kind to specialized generators for variables, types, routines, namespaces, and templates.

The backend is gated by the skip-backend flag (dword_106C254): if set to 1 (errors occurred during the frontend), main() never calls sub_489000 and proceeds directly to exit.

Key Facts

Property	Value
Function	`sub_489000` (`process_file_scope_entities`)
Binary address	`0x489000`
Binary size	4520 bytes (723 decompiled lines)
EDG source	`cp_gen_be.c`
Callees	~140 distinct call targets
Output	`.int.c` file (or stdout when filename is `"-"`)
Main dispatcher	`sub_47ECC0` (`gen_template`, 1917 lines)
Host reference emitter	`sub_6BCF80` (`nv_emit_host_reference_array`)
Module ID writer	`sub_5B0180` (`write_module_id_to_file`)
Skip-backend flag	`dword_106C254`
Backend timing label	`"Back end time"`

Output Primitives

All output to the .int.c file passes through a small set of character-level emitters. Understanding these is essential for reading the decompiled backend code, since every line of generated C/C++ is assembled from these calls:

Function	Address	Identity	Behavior
`sub_467D60`	`0x467D60`	`emit_newline`	Writes `\n` via `putc(10, stream)`. Increments `dword_1065820` (line counter). Resets `dword_106581C` (column counter) and `dword_1065830` to 0. Calls `sub_403730` (write error abort) on failure.
`sub_467DA0`	`0x467DA0`	`emit_line_directive`	Checks `dword_1065818` (needs-line-directive flag). If the current source position (`qword_1065810`) differs from the output line counter, calls `sub_467EB0` to emit a `#line N "file"` directive. Resets `dword_1065818` to 0. Handles close-range line gaps (within 5 lines) by emitting blank lines instead of a `#line` directive.
`sub_467E50`	`0x467E50`	`emit_string`	If `dword_1065818` is set, calls `emit_line_directive` first. Writes each character of the string via `putc`. Increments `dword_106581C` by the string length.
`sub_467EB0`	`0x467EB0`	`emit_line_number`	Emits `#line N "file"` or `# N "file"` (short form when `dword_106C28C` or MSVC EDG-native mode is set). Constructs the directive in a stack buffer starting with `#line `, appends the decimal line number, then the quoted filename via `sub_5B1940`. Sets `dword_1065820` to the target line number. Resets column counters.
`sub_468150`	`0x468150`	`emit_char`	If `dword_1065818` is set, calls `emit_line_directive` first. Writes a single character via `putc`. Increments `dword_106581C` by 1.
`sub_468190`	`0x468190`	`emit_raw_string`	Like `emit_string` but without `strlen` -- walks the string character by character, incrementing `dword_106581C` per character. Calls `emit_line_directive` first if `dword_1065818` is set.
`sub_468270`	`0x468270`	`emit_decimal`	Writes an unsigned integer as decimal digits. Has fast paths for 1-5 digit numbers (manual digit extraction via division by powers of 10). Falls back to `sub_465480` (sprintf-style) for larger numbers. Calls `emit_line_directive` first if needed.
`sub_46BC80`	`0x46BC80`	`emit_line_start`	If the column counter is nonzero, first emits a newline. Increments `dword_1065834` (indent level). Calls `emit_line_directive` if needed. Then writes the string character by character. Used for the first token on a new line (e.g., `#define`, `#ifdef`).
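
The fast-path/slow-path split described for `emit_decimal` can be illustrated with a minimal sketch. The function name `format_decimal`, the output buffer, and the use of `sprintf` for the fallback are assumptions made for illustration; the binary writes through `putc` and falls back to `sub_465480`, whose exact behavior is not reproduced here.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: formats `value` into `out` (at least 21 bytes)
   and returns the number of characters written, excluding the NUL. */
static size_t format_decimal(unsigned long value, char *out)
{
    size_t n = 0;

    if (value < 100000UL) {
        /* Fast path: at most five digits, extracted by dividing by
           descending powers of ten and skipping leading zeros. */
        static const unsigned long divisors[] = { 10000UL, 1000UL, 100UL, 10UL, 1UL };
        int started = 0;
        for (size_t i = 0; i < sizeof divisors / sizeof divisors[0]; i++) {
            unsigned long digit = value / divisors[i];
            value %= divisors[i];
            /* Always emit the final (units) digit so that 0 prints as "0". */
            if (digit != 0 || started || divisors[i] == 1UL) {
                out[n++] = (char)('0' + digit);
                started = 1;
            }
        }
        out[n] = '\0';
        return n;
    }

    /* Slow path: six or more digits go through the C library. */
    return (size_t)sprintf(out, "%lu", value);
}
```

Line numbers in typical translation units stay below 100,000, which is presumably why a hand-rolled path for short numbers is worth having in a routine called for every `#line` directive.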

Output State Variables

Variable	Address	Type	Role
`stream`	`0x106583x`	`FILE*`	Output file handle for `.int.c`
`dword_1065834`	`0x1065834`	int	Indent level counter. Incremented by `emit_line_start`, decremented after each directive block. Not used for actual indentation emission -- tracks logical nesting depth for `#line` management.
`dword_1065820`	`0x1065820`	int	Output line counter. Tracks the current line number in the generated `.int.c` file. Incremented by every `\n` written.
`dword_106581C`	`0x106581C`	int	Output column counter. Tracks the current column position. Reset to 0 after each newline.
`dword_1065830`	`0x1065830`	int	Column counter after last newline (secondary tracking). Reset to 0 with `dword_106581C`.
`dword_1065818`	`0x1065818`	int	Needs-line-directive flag. Set to 1 when the source position changes. Checked by every output primitive; when set, a `#line` directive is emitted before the next output.
`qword_1065810`	`0x1065810`	qword	Current source position (line number from the original `.cu` file). Updated when processing each entity.
`qword_1065828`	`0x1065828`	qword	Current source file index. Compared against new file references to decide whether to emit a `#line` with filename.
`qword_126EDE8`	`0x126EDE8`	qword	Mirror of `qword_1065810`. Updated in parallel; used by other subsystems to query current position.
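
How the primitives and state variables interact can be modeled in a few lines. The sketch below is a simplified model, not a transcription of the binary: `struct out_state`, the in-memory buffer that replaces the `FILE*`, and the choice to treat the 5-line gap as an inclusive, forward-only bound are all assumptions. Member comments name the global each field stands in for.

```c
#include <stdio.h>
#include <string.h>

struct out_state {
    char        buf[512];   /* stands in for the FILE* stream           */
    size_t      len;
    long        out_line;   /* dword_1065820: output line counter       */
    int         column;     /* dword_106581C: output column counter     */
    int         needs_line; /* dword_1065818: needs-line-directive flag */
    long        src_line;   /* qword_1065810: current source line       */
    const char *src_file;   /* file name used in #line directives       */
};

static void put_raw(struct out_state *s, char c)
{
    if (s->len + 1 < sizeof s->buf) {
        s->buf[s->len++] = c;
        s->buf[s->len] = '\0';
    }
}

static void emit_newline(struct out_state *s)
{
    put_raw(s, '\n');
    s->out_line++;
    s->column = 0;
}

static void emit_line_directive(struct out_state *s)
{
    long gap = s->src_line - s->out_line;
    s->needs_line = 0;                 /* clear first: no re-entry */
    if (gap == 0)
        return;                        /* already in sync */
    if (gap > 0 && gap <= 5) {         /* small gap: pad with blank lines */
        while (s->out_line < s->src_line)
            emit_newline(s);
        return;
    }
    char text[128];
    int n = snprintf(text, sizeof text, "#line %ld \"%s\"\n",
                     s->src_line, s->src_file);
    for (int i = 0; i < n && i < (int)sizeof text - 1; i++)
        put_raw(s, text[i]);
    s->out_line = s->src_line;         /* the directive resets the counter */
    s->column = 0;
}

static void emit_string(struct out_state *s, const char *str)
{
    if (s->needs_line)
        emit_line_directive(s);
    for (const char *p = str; *p; p++)
        put_raw(s, *p);
    s->column += (int)strlen(str);
}
```

Because every primitive checks the flag before writing, callers only need to update the source position and set the flag; the directive (or blank-line padding) is produced lazily by whichever primitive writes next.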

Execution Flow

The backend proceeds through seven sequential phases within sub_489000:

sub_489000 (process_file_scope_entities)
  |
  |-- Phase 1: State initialization (40+ globals zeroed, 4 buffers cleared)
  |-- Phase 2: Output file opening (.int.c or stdout)
  |-- Phase 3: Boilerplate emission (GCC diagnostics, managed runtime, lambda macros)
  |-- Phase 4: Main entity loop (walk source sequence, dispatch to gen_template)
  |-- Phase 5: Empty file guard + scope unwind (sub_466C10)
  |-- [optional] Breakpoint placeholders (qword_1065840 list)
  |-- Phase 6: File trailer (#line, _NV_ANON_NAMESPACE, #include, #undef)
  |-- Phase 7: Host reference arrays (sub_6BCF80 x 6, conditional on dword_106BFD0/BFCC)
  |
  +-- sub_4F7B10: close output file (ID 1701)

Phase 1: State Initialization

The function begins by zeroing approximately 40 global variables and clearing four large buffers. This ensures no state leaks between compilation units (relevant in the recompilation loop, though in practice sub_489000 runs exactly once).

Scalar Zeroing

The first 20 lines of the decompiled function zero individual globals:

dword_1065834 = 0;   // indent level
dword_1065830 = 0;   // column after newline
stream        = 0;   // FILE* handle
qword_126EDE8 = 0;   // current source position (low 6 bytes)
qword_1065828 = 0;   // current file index
dword_1065820 = 0;   // output line counter
dword_106581C = 0;   // output column counter
dword_1065818 = 0;   // needs-line-directive flag
qword_1065748 = 0;   // source sequence cursor
qword_1065740 = 0;   // alternate source sequence cursor
qword_126C5D0 = 0;   // (template instantiation tracking)
dword_106573C = 0;
dword_1065734 = 0;
dword_1065730 = 0;
dword_106572C = 0;
qword_1065708 = 0;   // scope stack head
qword_1065720 = 0;   // scope free list
qword_1065700 = 0;   // scope pool head
dword_10656FC = 0;   // current access specifier
// ... additional counters, flags, sequence pointers

Additional globals zeroed later (after the callback setup):

dword_1065758 = 0;   dword_1065754 = 0;   dword_1065750 = 0;
dword_10656F8 = 0;   dword_10656F4 = 0;
qword_1065718 = 0;   qword_1065710 = 0;
dword_1065728 = 0;   qword_F05708  = 0;

Buffer Clearing

Four memset calls clear hash tables / lookup buffers:

Buffer Base	Size (hex)	Size (decimal)	Description
`unk_FE5700`	`0x7FFE0`	524,256 bytes (~512 KB)	Entity lookup hash table
`unk_F65720`	`0x7FFE0`	524,256 bytes (~512 KB)	Type lookup hash table
`qword_E85720`	`0x7FFE0`	524,256 bytes (~512 KB)	Declaration tracking table
`xmmword_F05720`	`0x5FFE8`	393,192 bytes (~384 KB)	Scope/name resolution table

Total: 1,965,960 bytes (approximately 1.87 MB) of memory zeroed at backend entry.
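
The total can be checked by summing the four sizes from the table. The sketch below does so and converts to megabytes using the 1 MB = 1,048,576 bytes convention that the byte counts elsewhere in this reference follow; the array and function names are illustrative only.

```c
#include <stddef.h>

/* Sizes copied from the buffer table. */
static const size_t backend_buffer_sizes[] = {
    0x7FFE0,  /* unk_FE5700:     entity lookup hash table     */
    0x7FFE0,  /* unk_F65720:     type lookup hash table       */
    0x7FFE0,  /* qword_E85720:   declaration tracking table   */
    0x5FFE8,  /* xmmword_F05720: scope/name resolution table  */
};

/* Sum of all four memset sizes: 3 * 524,256 + 393,192 = 1,965,960 bytes. */
static size_t total_cleared_bytes(void)
{
    size_t total = 0;
    for (size_t i = 0; i < sizeof backend_buffer_sizes / sizeof backend_buffer_sizes[0]; i++)
        total += backend_buffer_sizes[i];
    return total;
}

/* Converts a byte count to megabytes of 1,048,576 bytes each. */
static double bytes_to_mb(size_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```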

Callback Table Setup

After zeroing, the function initializes two tables of function pointers:

gen_be_info callbacks (6 entries at xmmword_1065760..10657B0):

sub_5F9040(&xmmword_1065760);    // clear the table first
xmmword_1065760 = off_83BD60;    // callback 0: expression gen
xmmword_1065778 = off_83BD68;    // callback 1: type gen
xmmword_1065788 = off_83BD70;    // callback 2: declaration gen
xmmword_10657A0 = off_83BD78;    // callback 3: statement gen
xmmword_10657B0 = qword_83BD80;  // callback 4: scope gen

These pointers are loaded from read-only data via SSE (_mm_loadh_ps), packing two 8-byte function pointers per 16-byte XMM value.

Direct callback assignments (4 entries):

Variable	Address	Value	Identity
`qword_10657C0`	`0x10657C0`	`sub_46BEE0`	gen_statement_expression (only set when not in MSVC `__declspec` mode)
`qword_10657C8`	`0x10657C8`	`loc_469200`	gen_type_operator_expression
`qword_10657D0`	`0x10657D0`	`sub_466F40`	gen_be_helper_1
`qword_10657D8`	`0x10657D8`	`sub_4686C0`	gen_be_helper_2

Host Compiler Version Detection

A block of conditionals determines warning suppression behavior based on the host compiler version:

byte_10657F0 = 1;                        // always set
byte_10657F1 = byte_126EBB0;             // copy verbose-line-dir flag
if (dword_126EFB4 == 2                   // C++ language mode (2 = C++)
    || dword_126EF68 <= 199900)          // or standard version value <= 199900 (below C99's 199901)
{
    byte_10657F4 = (dword_126EFB0 != 0); // copy flag
} else {
    byte_10657F4 = 1;                    // force on for newer standards
}

The byte_1065803 flag is set to 1 when MSVC mode (dword_126E1D8) is active or when the GNU/Clang version in qword_126E1F0 falls in a narrow window: the check computes qword_126E1F0 - 40500 and accepts a difference of at most 2, i.e., encoded versions 40500 through 40502.

Scope Stack Allocation

A dynamic scope tracking structure is allocated (or resized if it exists from a prior run):

v0 = qword_10656E8;
if (v0) {
    // header exists from a prior run: resize to 16 * (count + 1) bytes
    sub_6B74D0(*v0, 16 * (v0[1] + 1));
} else {
    // allocate fresh: 16-byte header
    v0 = sub_6B7340(16);
    qword_10656E8 = v0;
}
// allocate 1024-byte data block, zero it, attach to header
v2 = sub_6B7340(1024);
// zero 1024 bytes in 16-byte steps (64 slots of 16 bytes each)
*v0 = v2;
v0[1] = 63;   // highest slot index: 16 * (63 + 1) = 1024 bytes

This creates a 64-slot lookup table of 16-byte entries for tracking entity references during code generation. The stored value 63 matches the resize formula 16 * (count + 1), so the header appears to record the highest slot index rather than the slot count.
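
A minimal sketch of the header-plus-data-block shape described above, assuming a 16-byte slot made of two 8-byte fields. The type names, the field names `key` and `value`, and the use of `malloc`/`calloc` in place of `sub_6B7340` are hypothetical; only the sizes (16-byte header, 1024-byte block, stored value 63) come from the decompiled code.

```c
#include <stdint.h>
#include <stdlib.h>

struct ref_slot {            /* hypothetical contents of one 16-byte slot */
    uint64_t key;
    uint64_t value;
};

struct ref_table {           /* header: data pointer plus highest index */
    struct ref_slot *slots;
    uint64_t         max_index;
};

/* Allocates the header and a zeroed 64-slot data block.
   Returns NULL if either allocation fails. */
static struct ref_table *ref_table_create(void)
{
    struct ref_table *t = malloc(sizeof *t);
    if (!t)
        return NULL;
    t->slots = calloc(64, sizeof(struct ref_slot));  /* 64 * 16 = 1024 bytes */
    if (!t->slots) {
        free(t);
        return NULL;
    }
    t->max_index = 63;
    return t;
}

static void ref_table_destroy(struct ref_table *t)
{
    if (t) {
        free(t->slots);
        free(t);
    }
}
```

Storing the highest index instead of the count is a common choice for power-of-two tables, since the same value can double as a bit mask; whether the binary uses it that way is not established here.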

Phase 2: Output File Opening

The function opens the output .int.c file. Two paths are possible:

Stdout mode: If the output filename (qword_126EEE0) equals "-", the function sets stream = stdout.

// strcmp(qword_126EEE0, "-")
if (filename_is_dash) {
    stream = stdout;
}

File mode: Otherwise, the function constructs the output path by appending .int.c to the base filename (stripping the original extension):

v55 = qword_106BF20;                       // pre-set output path (CLI override)
if (!v55)
    v55 = sub_5ADD90(qword_126EEE0, ".int.c");  // derive_name: strip ext, add ".int.c"
stream = sub_4F48F0(v55, 0, 0, 0, 1701);   // open_output_file (file ID 1701)

The sub_5ADD90 function (derive_name) finds the last . in the filename, strips the extension, and appends .int.c. It handles multi-byte UTF-8 characters correctly when scanning for the dot position. The constant 1701 is the identifier under which the file management subsystem tracks this output file; the same ID is used when sub_4F7B10 closes the file at the end of the backend.

After opening the file, sub_5B9A20 is called to initialize the output stream state, and sub_467EB0 emits the initial #line 1 directive.
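
The name derivation can be sketched as follows. This is an illustrative reimplementation, not the decompiled body of `sub_5ADD90`: restricting the dot search to the part after the last `/` and returning a heap-allocated string are assumptions. A byte-wise scan is safe for UTF-8 input because the byte `0x2E` (`.`) never occurs inside a multi-byte UTF-8 sequence.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a malloc'ed copy of `path` with its extension replaced by
   `suffix`, or NULL on allocation failure. The caller frees the result. */
static char *derive_name(const char *path, const char *suffix)
{
    /* Assumption: only dots in the final path component count. */
    const char *slash = strrchr(path, '/');
    const char *base  = slash ? slash + 1 : path;
    const char *dot   = strrchr(base, '.');

    size_t stem_len   = dot ? (size_t)(dot - path) : strlen(path);
    size_t suffix_len = strlen(suffix);

    char *out = malloc(stem_len + suffix_len + 1);
    if (!out)
        return NULL;
    memcpy(out, path, stem_len);
    memcpy(out + stem_len, suffix, suffix_len + 1);  /* copies the NUL too */
    return out;
}
```

With this sketch, `derive_name("kernel.cu", ".int.c")` yields `kernel.int.c`, and a name without an extension simply gains the suffix.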

Phase 3: Boilerplate Emission

Before processing any user declarations, the backend emits several blocks of boilerplate that the host compiler needs. The exact output depends on the host compiler identity (Clang, GCC, MSVC) and the CUDA mode.

GCC Diagnostic Suppressions

Multiple #pragma GCC diagnostic directives suppress host compiler warnings that would be spurious for generated code:

// Conditional on Clang version > 30599 (0x7787) or GNU version > 40799 (0x9F5F)
#pragma GCC diagnostic ignored "-Wunused-local-typedefs"

// Conditional on dword_126EFA8 (attribute mode) && dword_106C07C
#pragma GCC diagnostic ignored "-Wattributes"

// Clang or recent GNU/Clang:
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-function"

// Clang-specific additional suppressions:
#pragma GCC diagnostic ignored "-Wunused-private-field"
#pragma GCC diagnostic ignored "-Wunused-parameter"

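The version thresholds use the encoded host compiler version from qword_126EF90 (Clang version) and qword_126E1F0 (GCC/Clang combined version). The decimal values are consistent with a major*10000 + minor*100 + patch encoding, so a test such as `> 30599` selects Clang 3.6 and later, and `> 40799` selects GCC 4.8 and later:

Hex constant	Decimal	Reads as (assuming that encoding)
`0x7787`	30,599	Clang 3.5.99
`0x9D07`	40,199	GCC/Clang 4.1.99
`0x9E97`	40,599	GCC/Clang 4.5.99
`0x9F5F`	40,799	GCC/Clang 4.7.99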
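
The threshold constants are easier to read once decoded. The sketch below assumes the conventional major*10000 + minor*100 + patch encoding, which the binary does not state explicitly but which the constants fit; the type and function names are illustrative. The second function shows the usual single-comparison form of a "base with tolerance" range check like the 40500 window used for byte_1065803.

```c
/* Decoded form of an encoded host compiler version (assumed encoding). */
struct host_version {
    int major;
    int minor;
    int patch;
};

static struct host_version decode_host_version(long encoded)
{
    struct host_version v;
    v.major = (int)(encoded / 10000);
    v.minor = (int)(encoded / 100 % 100);
    v.patch = (int)(encoded % 100);
    return v;
}

/* Returns nonzero when base <= encoded <= base + tolerance.
   A value below `base` wraps to a large unsigned number and fails
   the comparison, so one test covers both bounds. */
static int in_version_window(long encoded, long base, unsigned long tolerance)
{
    return (unsigned long)(encoded - base) <= tolerance;
}
```

Under this reading, each "greater than x.y.99" threshold is simply a way of writing "at least the next minor release" without depending on patch levels.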

Managed Runtime Boilerplate

A block of C code is emitted unconditionally for __managed__ variable support:

static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) {
    __nv_fatbinhandle_for_managed_rt = in;
}
static char __nv_init_managed_rt_with_module(void **);

Followed by the inline initialization helper:

__attribute__((unused))                    // added when dword_106BF6C (alt host mode) is set
static inline void __nv_init_managed_rt(void) {
    __nv_inited_managed_rt = (__nv_inited_managed_rt
        ? __nv_inited_managed_rt
        : __nv_init_managed_rt_with_module(__nv_fatbinhandle_for_managed_rt));
}

This boilerplate is surrounded by a #pragma GCC diagnostic push / pop pair to suppress warnings about unused variables/functions in the boilerplate itself.

After the pop, additional #pragma GCC diagnostic ignored directives may be emitted for the remainder of the file (outside the push/pop scope), depending on compiler version.

Lambda Detection Macros

When extended lambda mode (dword_106BF38) is NOT active, three stub macro definitions are emitted:

#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false

Followed by a self-checking #if defined block:

#if defined(__nv_is_extended_device_lambda_closure_type) \
 && defined(__nv_is_extended_host_device_lambda_closure_type) \
 && defined(__nv_is_extended_device_lambda_with_preserved_return_type)
#endif

When extended lambda mode IS active, these macros are not emitted -- the frontend's keyword registration has already defined them as built-in type traits recognized by the parser. The empty #if defined / #endif block serves as a guard that downstream tools can detect.

Phase 4: Main Entity Loop

This is the core of the backend. The source sequence cursor qword_1065748 is initialized from the file scope IL node's declaration list at offset +256: qword_1065748 = *(*(xmmword_126EB60 + 8) + 256), where the high qword of xmmword_126EB60 points to the file scope root (set during fe_wrapup). The cursor walks this linked list of top-level declarations in the order they appeared in the source file. For each entry, it dispatches based on the entry's kind field at offset +16.

Source Sequence Entry Structure

Each source sequence entry has this layout:

Offset	Size	Field	Description
+0	8	`next`	Pointer to next entry in the linked list
+8	1	`sub_kind`	Sub-classification within the kind
+9	1	`skip_flag`	If nonzero, entry has already been processed
+16	1	`kind`	Entry kind (see dispatch table below)
+24	8	`entity`	Pointer to the EDG entity node for this declaration
+32	8	`source_position`	Source file/line encoding
+48	8	`pragma_text`	For pragma entries: pointer to raw pragma string
+56	8	`stdc_kind / pragma_data`	STDC pragma kind or additional pragma metadata
+57	1	`stdc_value`	STDC pragma value (ON/OFF/DEFAULT)
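
The table can be transcribed into a C struct whose offsets are checked at compile time. This is a sketch under stated assumptions: the struct and padding member names are invented, pointers are written as `uint64_t` to pin the x86-64 layout regardless of host, the ranges the table does not describe are filled with opaque bytes, and the +56/+57 overlap is modeled as a union. The total size of 64 bytes is simply what this layout produces, not a recovered fact.

```c
#include <stddef.h>
#include <stdint.h>

struct src_seq_entry {
    uint64_t next;             /* +0:  next entry in the list          */
    uint8_t  sub_kind;         /* +8:  sub-classification              */
    uint8_t  skip_flag;        /* +9:  nonzero = already processed     */
    uint8_t  pad_10[6];        /* +10: not described in the table      */
    uint8_t  kind;             /* +16: entry kind                      */
    uint8_t  pad_17[7];        /* +17: not described in the table      */
    uint64_t entity;           /* +24: pointer to the entity node      */
    uint64_t source_position;  /* +32: source file/line encoding       */
    uint8_t  unknown_40[8];    /* +40: not described in the table      */
    uint64_t pragma_text;      /* +48: raw pragma string pointer       */
    union {
        uint64_t pragma_data;  /* +56: additional pragma metadata      */
        struct {
            uint8_t stdc_kind;   /* +56: STDC pragma kind              */
            uint8_t stdc_value;  /* +57: STDC pragma value             */
        } stdc;
    } tail;
};

_Static_assert(offsetof(struct src_seq_entry, sub_kind) == 8, "sub_kind at +8");
_Static_assert(offsetof(struct src_seq_entry, skip_flag) == 9, "skip_flag at +9");
_Static_assert(offsetof(struct src_seq_entry, kind) == 16, "kind at +16");
_Static_assert(offsetof(struct src_seq_entry, entity) == 24, "entity at +24");
_Static_assert(offsetof(struct src_seq_entry, source_position) == 32, "position at +32");
_Static_assert(offsetof(struct src_seq_entry, pragma_text) == 48, "pragma_text at +48");
_Static_assert(offsetof(struct src_seq_entry, tail) == 56, "tail at +56");
```

Expressing the layout this way turns raw accesses such as `*(byte*)(i + 16)` in the decompiled code into named field reads, and the static assertions catch any drift if members are added later.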

Dual-Cursor Iteration

The loop uses two cursors -- qword_1065748 (primary) and qword_1065740 (alternate) -- to handle pragma interleavings. When the primary cursor encounters a kind-53 entry (a continuation marker), it records that entry in the alternate cursor and follows the pointer at offset +8 of the marker's entity into the continuation chain. This mechanism handles the case where pragmas are interleaved between parts of a single declaration:

for (i = qword_1065748; i != NULL; ) {
    if (entry_kind(i) == 53) {          // continuation marker
        // save as alternate, follow continuation chain
        alt_cursor = i;
        i = *(i->entity + 8);          // follow entity's next pointer
        continue;
    }
    if (entry_kind(i) == 57) {          // pragma interleave
        entity = i->entity;
        // advance past continuation markers to find the next real entry
        for (i = i->next; i && entry_kind(i) == 53; ) {
            alt_cursor = i;
            i = *(i->entity + 8);
        }
        // handle the pragma inline (see below)
        ...
    } else {
        // non-pragma entity: commit the cursor, then dispatch to
        // gen_template, which reads and advances qword_1065748
        qword_1065748 = i;
        sub_47ECC0(0);
        i = qword_1065748;              // resume from the advanced cursor
    }
}

When the primary cursor is exhausted and an alternate cursor exists, the primary takes the alternate's next pointer and continues. This ensures correct ordering when pragmas split a declaration sequence.
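
The save-and-resume behavior of the two cursors can be isolated in a small model. The sketch below is a simplification under stated assumptions: `struct seq_entry`, its `nested` field (standing in for the pointer read from the marker's entity at offset +8), and the `tag` field used to record visit order are all hypothetical, and pragma handling is omitted.

```c
#include <stddef.h>

enum { KIND_CONTINUATION = 53 };

struct seq_entry {
    struct seq_entry *next;    /* next entry in the same list              */
    struct seq_entry *nested;  /* hypothetical: chain entered via a marker */
    int               kind;
    char              tag;     /* label recorded when the entry is visited */
};

/* Walks the sequence and writes the tag of every non-marker entry into
   `out` (capacity `cap`, NUL-terminated). Returns the number of tags. */
static size_t walk_sequence(struct seq_entry *head, char *out, size_t cap)
{
    struct seq_entry *cur = head;   /* primary cursor   */
    struct seq_entry *alt = NULL;   /* alternate cursor */
    size_t n = 0;

    if (cap == 0)
        return 0;

    for (;;) {
        if (cur == NULL) {
            if (alt == NULL)
                break;              /* both cursors exhausted */
            cur = alt->next;        /* resume after the saved marker */
            alt = NULL;
            continue;
        }
        if (cur->kind == KIND_CONTINUATION) {
            alt = cur;              /* remember where to come back to */
            cur = cur->nested;      /* descend into the continuation  */
            continue;
        }
        if (n + 1 < cap)
            out[n++] = cur->tag;
        cur = cur->next;
    }
    out[n] = '\0';
    return n;
}
```

For a list A, marker, B where the marker leads to a nested chain X, Y, this walk visits A, X, Y, B. A single alternate slot is enough in this model because only one level of continuation is tracked at a time.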

Full Main Loop Pseudocode

The following pseudocode is derived from the decompiled sub_489000 (lines 288-558) and shows the complete dispatch logic. The variable v12 tracks whether any non-pragma entity was emitted (used by the empty file guard in Phase 5). The variable v14 saves/restores byte_10657FB across pragma handling.

// Initialize source sequence cursor from file scope node
qword_1065748 = *(xmmword_126EB60_high + 256);  // source sequence list head
byte_10656F0 = (dword_126EFB4 != 2) + 2;        // linkage: 3=C++, 2=C
sub_466E60(...);                                  // init output state
v12 = 0;                                          // no entities emitted yet

while (1) {
    v14 = byte_10657FB;                           // save pragma-in-progress flag

    i = qword_1065748;                            // primary cursor
    alt = qword_1065740;                          // alternate cursor
    modified_primary = false;
    modified_alt = false;

    while (i != NULL) {
        kind = *(byte*)(i + 16);

        if (kind == 57) {
            // --- Pragma interleave ---
            entity = *(qword*)(i + 24);
            // Walk past continuation markers (kind 53)
            for (i = *(qword*)i; i != NULL; ) {
                if (*(byte*)(i + 16) != 53) break;
                alt = i;
                modified_alt = true;
                i = *(qword*)(*(qword*)(i + 24) + 8);  // follow entity next
            }
            if (i == NULL && alt != NULL) {
                i = *(qword*)alt;
                alt = NULL;
                modified_alt = true;
            }
            modified_primary = true;

            if (*(byte*)(entity + 9))              // skip_flag set?
                continue;                          // already processed

            // Commit cursor state
            qword_1065748 = i;
            if (modified_alt) qword_1065740 = alt;
            byte_10657FB = 1;                      // mark pragma context

            // Set source position from pragma entity
            dword_1065818 = 1;                     // needs line directive
            qword_1065810 = *(qword*)(entity + 32);
            qword_126EDE8 = *(qword*)(entity + 32);

            sub_kind = *(byte*)(entity + 8);
            switch (sub_kind) {
                case 26:  // STDC pragma
                    emit_line_start("#pragma ");
                    emit_raw("STDC ");
                    switch (*(byte*)(entity + 56)) {
                        case 1: emit_raw("FP_CONTRACT ");    break;
                        case 2: emit_raw("FENV_ACCESS ");    break;
                        case 3: emit_raw("CX_LIMITED_RANGE "); break;
                        default: assertion("gen_stdc_pragma: bad kind");
                    }
                    switch (*(byte*)(entity + 57)) {
                        case 1: emit_raw("OFF");     break;
                        case 2: emit_raw("ON");      break;
                        case 3: emit_raw("DEFAULT"); break;
                        default: assertion("gen_stdc_pragma: bad value");
                    }
                    emit_newline();
                    break;

                case 21:  // Line directive pragma
                    emit_line_start("#line ");
                    byte_10657F9 = 1;
                    sub_5FCAF0(*(qword*)(entity + 56), 0, &xmmword_1065760);
                    byte_10657F9 = 0;
                    emit_newline();
                    break;

                default:  // Generic pragma (including sub_kind 19)
                    if (!*(qword*)(entity + 48))
                        assertion("gen_pragma: NULL pragma_text");
                    emit_line_start("#pragma ");
                    emit_raw(*(char**)(entity + 48));
                    emit_newline();
                    if (sub_kind == 19)
                        dword_10656F8 = *(int*)(entity + 56);  // track #pragma pack
                    break;
            }
            byte_10657FB = v14;                    // restore saved flag
            continue;                              // next iteration
        }

        // --- Non-pragma entity ---
        if (modified_primary) qword_1065748 = i;
        if (modified_alt)     qword_1065740 = alt;

        if (kind == 53) {
            // Continuation marker: switch to alternate cursor
            alt = i;
            modified_alt = true;
            i = *(qword*)(*(qword*)(i + 24) + 8);
            continue;
        }

        if (kind == 52)  // end_of_construct: should never appear at top level
            sub_4F2930("cp_gen_be.c", 26628,
                       "process_file_scope_entities",
                       "Top-level end-of-construct entry", 0);

        v12 = 1;                                   // mark: entity emitted
        sub_47ECC0(0);                             // gen_template(recursion_level=0)
        i = qword_1065748;                         // gen_template advanced the global cursor; resume there
    }

    // Exhausted primary cursor; check for pending alternate
    if (i == NULL && alt != NULL) {
        i = *(qword*)alt;
        alt = NULL;
        // ... continue outer loop
    } else {
        break;  // done
    }
}

// Final cursor cleanup
if (modified_primary) qword_1065748 = 0;
if (modified_alt)     qword_1065740 = alt;

Entity Kind Dispatch

For non-pragma entries (kind != 57), the loop calls sub_47ECC0(0) (gen_template with recursion level 0), which reads the current entity from qword_1065748 and dispatches based on the entity's kind:

Kind	Name	Handler
2	`variable_decl`	`sub_484A40` (`gen_variable_decl`) or inline
6	`type_decl`	`sub_4864F0` (`gen_type_decl`)
7	`parameter_decl`	`sub_484A40`
8	`field_decl`	Inline field handler
11	`routine_decl`	`sub_47BFD0` (`gen_routine_decl`, 1831 lines)
28	`namespace`	Inline namespace handler (recursive `sub_47ECC0(0)`)
29	`using_decl`	Inline using-declaration handler
42	`asm_decl`	`__asm(...)` generation
51	`indirect`	Unwrap and re-dispatch
52	`end_of_construct`	Assertion (kind 52 triggers `sub_4F2930` diagnostic)
54	`instantiation`	Template instantiation directive
58	`template`	Template definition
66	`alias_decl`	Alias declaration (`using X = Y`)
67	`concept_decl`	Concept handling
83	`deduction_guide`	Deduction guide

Inline Pragma Handling

Kind 57 entries are pragma interleavings that appear between declarations. The backend handles three sub-kinds inline within sub_489000:

Sub-kind 26: STDC Pragma

Emits #pragma STDC <kind> <value>:

// Read pragma kind from offset +56
switch (stdc_kind) {
    case 1:  emit("FP_CONTRACT ");    break;
    case 2:  emit("FENV_ACCESS ");    break;
    case 3:  emit("CX_LIMITED_RANGE "); break;
    default: assertion_failure("gen_stdc_pragma: bad kind");
}
// Read pragma value from offset +57
switch (stdc_value) {
    case 1:  emit("OFF");     break;
    case 2:  emit("ON");      break;
    case 3:  emit("DEFAULT"); break;
    default: assertion_failure("gen_stdc_pragma: bad value");
}

The #pragma keyword is emitted character-by-character from a hardcoded string at address 0x838441 ("#pragma "), followed by "STDC " from address 0x83847B.
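The two switches above can be condensed into a small table-driven formatter. This is only a sketch: `format_stdc_pragma` is a hypothetical name, it builds the text in a caller-supplied buffer instead of emitting it character by character, and it returns -1 where the backend would raise an assertion.

```c
#include <stdio.h>

/* Sketch only: builds "#pragma STDC <kind> <value>" into buf.
 * Returns 0 on success, -1 on a bad kind/value or a too-small buffer
 * (the real backend asserts on bad kind/value instead). */
int format_stdc_pragma(char *buf, size_t n, int stdc_kind, int stdc_value)
{
    static const char *const kinds[]  = { "FP_CONTRACT", "FENV_ACCESS",
                                          "CX_LIMITED_RANGE" };
    static const char *const values[] = { "OFF", "ON", "DEFAULT" };

    if (stdc_kind < 1 || stdc_kind > 3 || stdc_value < 1 || stdc_value > 3)
        return -1;

    int len = snprintf(buf, n, "#pragma STDC %s %s",
                       kinds[stdc_kind - 1], values[stdc_value - 1]);
    return (len < 0 || (size_t)len >= n) ? -1 : 0;
}
```

With kind 1 and value 2 the buffer holds `#pragma STDC FP_CONTRACT ON`.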

Sub-kind 21: Raw Pragma (Line Directive)

Calls sub_5FCAF0 to emit a preprocessor line directive using the pragma's data. The byte_10657F9 flag is set to 1 during emission and reset to 0 afterward, temporarily changing the line-directive emission format.

Sub-kind 19 (or other): Generic Pragma

For all other pragma sub-kinds, the backend reads the raw pragma text from offset +48 and emits it character by character after a #pragma prefix:

if (!entity->pragma_text)
    assertion_failure("gen_pragma: NULL pragma_text");
emit("#pragma ");
emit_raw_string(entity->pragma_text);
emit_newline();

For sub-kind 19 specifically, the function also records the pragma data in dword_10656F8, tracking #pragma pack state.

Linkage Specification

The variable byte_10656F0 tracks the current linkage specification:

Value	Meaning
2	`extern "C"` linkage
3	`extern "C++"` linkage

Set at initialization: byte_10656F0 = (dword_126EFB4 != 2) + 2. As written, this expression evaluates to 2 (extern "C") when dword_126EFB4 == 2 (CUDA mode) and to 3 (extern "C++") for any other mode. This controls how the backend wraps declarations that need explicit linkage changes.

Phase 5: Empty File Guard

After the main loop completes, the function checks whether any entities were actually emitted:

if (!v12 && dword_126EFB4 != 2) {
    sub_467E50("int __dummy_to_avoid_empty_file;");
    sub_467D60();  // newline
}

The variable v12 tracks whether sub_47ECC0 was called at least once (set to 1 when any non-pragma entity is processed). If no entities were processed AND the mode is not CUDA (dword_126EFB4 != 2), a dummy variable declaration is emitted to prevent the host compiler from rejecting an empty translation unit. In CUDA mode, the file always has content due to the managed runtime boilerplate.

Phase 6: File Trailer

After all entities and the empty-file guard, the function emits a structured trailer. The call to sub_466C10 performs scope stack unwinding -- it pops any remaining scope entries, restoring entity attributes that were temporarily modified during code generation (specifically, bits in byte +82 and +134 of entity nodes).

#line Reset

Two #line 1 "<original_file>" directives bracket the trailer, resetting the host compiler's notion of the current source location back to the original .cu file:

sub_46BC80("#");
if (!dword_126E1F8)      // not GNU mode: use long form
    sub_467E50("line");
sub_467E50(" 1 \"");
filename = sub_5AF450(qword_106BF88);   // get original filename
sub_467E50(filename);
sub_468150(34);           // closing quote '"'
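The emission sequence above amounts to choosing between the short GNU form and the long form of the directive. A minimal sketch, assuming a hypothetical helper `format_line_reset` that writes into a buffer rather than through the backend's output functions:

```c
#include <stdio.h>

/* Sketch: gnu_mode selects the short "# 1" form, otherwise "#line 1".
 * Returns 0 on success, -1 if buf is too small. */
int format_line_reset(char *buf, size_t n, int gnu_mode, const char *filename)
{
    int len = snprintf(buf, n, "#%s 1 \"%s\"",
                       gnu_mode ? "" : "line", filename);
    return (len < 0 || (size_t)len >= n) ? -1 : 0;
}
```

The filename here is passed in directly; in the binary it comes from sub_5AF450(qword_106BF88).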

_NV_ANON_NAMESPACE Macro

The anonymous namespace support macro is emitted:

#define _NV_ANON_NAMESPACE <unique_id>

The unique identifier is generated by sub_6BC7E0 (get_anonymous_namespace_name), which returns "_GLOBAL__N_<filename>" -- a mangled name that ensures anonymous namespace entities from different translation units do not collide in the final linked binary.

This is followed by a guard block:

#ifdef _NV_ANON_NAMESPACE
#endif

The #ifdef/#endif block appears to be a deliberate no-op that downstream tools (nvcc's driver) can detect to confirm the file was processed by cudafe++.

MSVC Pack Reset

In MSVC host compiler mode (dword_126E1D8), a #pragma pack() is emitted to reset the packing alignment to the compiler default:

if (dword_126E1D8) {
    sub_46BC80("#pragma pack()");
    sub_467D60();
}

Source Re-inclusion

The original source file is re-included via #include:

#include "<original_file>"

This is the mechanism by which the host compiler sees the original source code: the .int.c file first declares all the generated stubs and boilerplate, then #includes the original file. The EDG frontend has already parsed the original file and knows which declarations are host-visible; the re-inclusion lets the host compiler process them with the stubs already in scope.

A final #line 1 directive follows, and then:

#undef _NV_ANON_NAMESPACE

This cleans up the macro so it does not leak into subsequent compilation units.

Phase 7: Host Reference Arrays

The final emission step generates CUDA host reference arrays via sub_6BCF80 (nv_emit_host_reference_array). These arrays are placed in special ELF sections that the CUDA runtime linker uses to discover device symbols at launch time.

The function is called 6 times with different flag combinations:

// Signature: nv_emit_host_reference_array(emit_fn, is_kernel, is_device, is_internal)

sub_6BCF80(sub_467E50, 1, 0, 1);  // kernel,   internal  -> .nvHRKI
sub_6BCF80(sub_467E50, 1, 0, 0);  // kernel,   external  -> .nvHRKE
sub_6BCF80(sub_467E50, 0, 1, 1);  // device,   internal  -> .nvHRDI
sub_6BCF80(sub_467E50, 0, 1, 0);  // device,   external  -> .nvHRDE
sub_6BCF80(sub_467E50, 0, 0, 1);  // constant, internal  -> .nvHRCI
sub_6BCF80(sub_467E50, 0, 0, 0);  // constant, external  -> .nvHRCE

Section	Array Name	Symbol Type	Linkage
`.nvHRKI`	`hostRefKernelArrayInternalLinkage`	`__global__` kernel	Internal (anonymous namespace)
`.nvHRKE`	`hostRefKernelArrayExternalLinkage`	`__global__` kernel	External
`.nvHRDI`	`hostRefDeviceArrayInternalLinkage`	`__device__` variable	Internal
`.nvHRDE`	`hostRefDeviceArrayExternalLinkage`	`__device__` variable	External
`.nvHRCI`	`hostRefConstantArrayInternalLinkage`	`__constant__` variable	Internal
`.nvHRCE`	`hostRefConstantArrayExternalLinkage`	`__constant__` variable	External
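The flag triple maps onto the section names in a regular way, which can be sketched as follows. `host_ref_section` is a hypothetical helper for illustration; the binary does not necessarily contain a function of this shape.

```c
/* Hypothetical helper mirroring the (is_kernel, is_device, is_internal)
 * flag triple; returns the ELF section name from the table above. */
const char *host_ref_section(int is_kernel, int is_device, int is_internal)
{
    if (is_kernel)
        return is_internal ? ".nvHRKI" : ".nvHRKE";
    if (is_device)
        return is_internal ? ".nvHRDI" : ".nvHRDE";
    /* Neither flag set: the call selects __constant__ variables. */
    return is_internal ? ".nvHRCI" : ".nvHRCE";
}
```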

Each array entry encodes a device symbol's mangled name as a byte array:

extern "C" {
    extern __attribute__((section(".nvHRKE")))
           __attribute__((weak))
    const unsigned char hostRefKernelArrayExternalLinkage[] = {
        0x5f, 0x5a, ... /* mangled name bytes */ 0x00
    };
}
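To make the byte-array encoding concrete, the sketch below turns a single mangled name into the initializer text. `encode_name_bytes` is a hypothetical name, and the exact spacing, as well as how multiple names share one array, are assumptions not confirmed by the decompilation.

```c
#include <stdio.h>
#include <string.h>

/* Sketch: writes "0x5f, 0x5a, ..., 0x00" for the given name into buf.
 * Returns the number of bytes encoded (including the terminator),
 * or -1 if buf is too small. */
int encode_name_bytes(char *buf, size_t n, const char *name)
{
    size_t count = strlen(name) + 1;          /* include the trailing NUL */
    size_t used = 0;

    for (size_t i = 0; i < count; i++) {
        int len = snprintf(buf + used, n - used, "%s0x%02x",
                           i ? ", " : "", (unsigned char)name[i]);
        if (len < 0 || (size_t)len >= n - used)
            return -1;
        used += (size_t)len;
    }
    return (int)count;
}
```

For the two-character prefix `_Z` this yields `0x5f, 0x5a, 0x00`, matching the leading bytes shown in the listing above.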

The 6 global lists from which these symbols are collected reside at:

Address	Contents
`unk_1286780`	Device-external symbols
`unk_12867C0`	Device-internal symbols
`unk_1286800`	Constant-external symbols
`unk_1286840`	Constant-internal symbols
`unk_1286880`	Kernel-external symbols
`unk_12868C0`	Kernel-internal symbols

This phase is conditional: it only executes when dword_106BFD0 (CUDA device registration) or dword_106BFCC (CUDA constant registration) is nonzero.

Module ID Output

Before the host reference arrays, if dword_106BFB8 is set, sub_5B0180 (write_module_id_to_file) writes the CRC32-based module identifier to a separate file. This ID is used by the CUDA runtime to match device code fatbinaries with their host-side registration code.

Breakpoint Placeholders (Between Phase 5 and Phase 6)

After the empty file guard and scope unwinding (sub_466C10) but before the file trailer, if the breakpoint placeholder list (qword_1065840) is non-empty, the backend emits debug breakpoint functions:

static __attribute__((used)) void __nv_breakpoint_placeholder<N>_<name>(void) {
    exit(0);
}

The placeholder list is a linked list where each node contains:

Offset	Field
+0	`next` pointer
+8	Source position (start)
+16	Source position (end)
+24	Name string (or NULL)

Each placeholder is numbered sequentially (starting from 0). The __attribute__((used)) keeps the host compiler from discarding these otherwise unreferenced static functions, and the exit(0) body ensures the function has a concrete implementation that a debugger can set a breakpoint on. The underscore separates the numeric index from the name.
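The list walk and numbering can be sketched as below. The struct field names, the `emit_placeholders` function, and the handling of a NULL name (emitted as an empty string) are assumptions for illustration; only the offsets and the output shape come from the analysis above.

```c
#include <stdio.h>

/* Node layout from the table above (field names are illustrative). */
struct placeholder {
    struct placeholder *next;   /* +0  */
    unsigned long pos_start;    /* +8  */
    unsigned long pos_end;      /* +16 */
    const char *name;           /* +24, may be NULL */
};

/* Emits one stub per node and returns how many were written. */
int emit_placeholders(FILE *out, const struct placeholder *head)
{
    int n = 0;
    for (const struct placeholder *p = head; p != NULL; p = p->next, n++) {
        fprintf(out,
                "static __attribute__((used)) void "
                "__nv_breakpoint_placeholder%d_%s(void) { exit(0); }\n",
                n, p->name ? p->name : "");   /* NULL handling is assumed */
    }
    return n;
}
```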

Complete .int.c File Structure

Putting all phases together, the output .int.c file has this structure:

#line 1 "<input>.cu"                          // initial line directive
#pragma GCC diagnostic ignored "-Wunused-local-typedefs"
#pragma GCC diagnostic ignored "-Wattributes"
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-function"
// ... additional suppressions for Clang

// --- managed runtime boilerplate ---
static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) { ... }
static char __nv_init_managed_rt_with_module(void **);
static inline void __nv_init_managed_rt(void) { ... }

#pragma GCC diagnostic pop
#pragma GCC diagnostic ignored "-Wunused-variable"

// --- lambda detection macros (when not in extended lambda mode) ---
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
#if defined(...) && defined(...) && defined(...)
#endif

// --- main entity output ---
// [user declarations, type definitions, function stubs, etc.]
// [device-only code wrapped in #if 0 / #endif]
// [__global__ kernels -> __wrapper__device_stub_ forwarding]
// [pragmas interleaved at original positions]

// --- empty file guard (non-CUDA mode only) ---
int __dummy_to_avoid_empty_file;

// --- breakpoint placeholders (if any) ---
static __attribute__((used)) void __nv_breakpoint_placeholder0_name(void) { exit(0); }

// --- file trailer ---
#line 1 "<input>.cu"
#define _NV_ANON_NAMESPACE _GLOBAL__N_<input>
#ifdef _NV_ANON_NAMESPACE
#endif
#pragma pack()                                // MSVC only
#line 1 "<input>.cu"
#include "<input>.cu"                         // re-include original source
#line 1 "<input>.cu"
#undef _NV_ANON_NAMESPACE

// --- host reference arrays (if CUDA registration active) ---
extern "C" { extern __attribute__((section(".nvHRKI"))) ... }
extern "C" { extern __attribute__((section(".nvHRKE"))) ... }
extern "C" { extern __attribute__((section(".nvHRDI"))) ... }
extern "C" { extern __attribute__((section(".nvHRDE"))) ... }
extern "C" { extern __attribute__((section(".nvHRCI"))) ... }
extern "C" { extern __attribute__((section(".nvHRCE"))) ... }

Key Global Variables

Variable	Address	Type	Role
`stream`	output state	`FILE*`	Output file handle
`dword_1065834`	`0x1065834`	int	Indent/nesting level
`dword_1065820`	`0x1065820`	int	Output line counter
`dword_106581C`	`0x106581C`	int	Output column counter
`dword_1065818`	`0x1065818`	int	Needs-line-directive flag
`qword_1065810`	`0x1065810`	qword	Current source position
`qword_1065828`	`0x1065828`	qword	Current source file index
`qword_1065748`	`0x1065748`	qword	Source sequence cursor (primary)
`qword_1065740`	`0x1065740`	qword	Source sequence cursor (alternate)
`dword_1065850`	`0x1065850`	int	Device stub mode toggle
`byte_10656F0`	`0x10656F0`	byte	Current linkage spec (2=C, 3=C++)
`dword_10656F8`	`0x10656F8`	int	Current `#pragma pack` state
`qword_1065708`	`0x1065708`	qword	Scope stack head
`qword_1065700`	`0x1065700`	qword	Scope pool head
`qword_1065720`	`0x1065720`	qword	Scope free list
`dword_106BF38`	`0x106BF38`	int	Extended lambda mode
`dword_106BFB8`	`0x106BFB8`	int	Emit module ID flag
`dword_106BFD0`	`0x106BFD0`	int	CUDA device registration flag
`dword_106BFCC`	`0x106BFCC`	int	CUDA constant registration flag
`dword_106BF6C`	`0x106BF6C`	int	Alternative host compiler mode
`dword_126EFB4`	`0x126EFB4`	int	Compiler mode (2 = CUDA)
`dword_126E1D8`	`0x126E1D8`	int	MSVC host compiler flag
`dword_126E1F8`	`0x126E1F8`	int	GNU/GCC host compiler flag
`dword_126E1E8`	`0x126E1E8`	int	Clang host compiler flag
`qword_126E1F0`	`0x126E1F0`	qword	GCC/Clang version number
`dword_126EF68`	`0x126EF68`	int	C++ standard version (`__cplusplus`)

Cross-References

Pipeline Overview -- where stage 7 fits in the full compilation flow
Frontend Wrapup -- stage 6, produces the finalized IL that the backend consumes
.int.c File Format -- detailed structure of the backend output file
Managed Memory Boilerplate -- the __nv_managed_rt initialization pattern
Host Reference Arrays -- .nvHRKI/.nvHRDE section format
Module ID -- CRC32 module identification
Device/Host Separation -- how the backend filters device vs host code
Kernel Stub Generation -- __wrapper__device_stub_ pattern in gen_routine_decl
Extended Lambda Overview -- lambda wrapper generation
Lambda Preamble Injection -- sub_6BCC20 emission in gen_template

Timing & Exit

The timing and exit subsystem lives in host_envir.c and handles three responsibilities: measuring CPU and wall-clock time for compilation phases, formatting the compilation summary (error/warning counts), and mapping internal status codes to process exit codes. All functions write to qword_126EDF0 (the diagnostic output stream, initialized to stderr in main()).

Key Facts

Property	Value
Source file	`host_envir.c` (EDG 6.6)
Timing functions	`sub_5AF350` (capture_time), `sub_5AF390` (report_timing)
Exit function	`sub_5AF1D0` (exit_with_status), 145 bytes, `__noreturn`
Signoff function	`sub_5AEE00` (write_signoff), `sub_589530` (write_signoff + free_mem_blocks)
Timing enable flag	`dword_106C0A4` at `0x106C0A4`, set by CLI flag `--timing` (case 20)
Diagnostic stream	`qword_126EDF0` at `0x126EDF0` (stderr)
SARIF mode flag	`dword_106BBB8` at `0x106BBB8`

Timing Infrastructure

capture_time -- `sub_5AF350` (0x5AF350)

A 48-byte function that samples both CPU time and wall-clock time into a 16-byte timestamp structure.

// Annotated decompilation
void capture_time(timestamp_t *out)    // sub_5AF350
{
    out->cpu_ms  = (int)((double)(int)clock() * 1000.0 / 1e6);  // [0]: CPU milliseconds
    out->wall_s  = time(NULL);                                    // [1]: wall-clock seconds
}

Timestamp structure layout (16 bytes, two 64-bit fields):

Offset	Size	Type	Content
+0	8	`int64_t`	CPU time in milliseconds: `clock() * 1000 / CLOCKS_PER_SEC`
+8	8	`time_t`	Wall-clock time via `time(0)` (epoch seconds)

The CPU time computation clock() * 1000.0 / 1000000.0 normalizes the clock() return value (microseconds on Linux where CLOCKS_PER_SEC = 1000000) to milliseconds, then truncates to integer. This means CPU time resolution is 1 ms.
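The truncation behavior is easy to check in isolation. The helper below is a hypothetical restatement of the arithmetic with the tick rate passed as a parameter; unlike the decompiled code, it does not first narrow the clock() value to int.

```c
#include <stdint.h>
#include <time.h>

/* Converts raw clock() ticks to whole milliseconds, truncating. */
int64_t ticks_to_ms(clock_t ticks, long ticks_per_sec)
{
    return (int64_t)((double)ticks * 1000.0 / (double)ticks_per_sec);
}
```

With a tick rate of 1000000, 1999 ticks (1.999 ms) truncates to 1, and anything under 1000 ticks reports 0.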

report_timing -- `sub_5AF390` (0x5AF390)

Computes deltas between two timestamps and prints a formatted timing line.

// Annotated decompilation
void report_timing(const char *label,       // sub_5AF390
                   timestamp_t *start,
                   timestamp_t *end)
{
    double elapsed = difftime(end->wall_s, start->wall_s);    // wall seconds
    double cpu_sec = (double)(end->cpu_ms - start->cpu_ms) / 1000.0;  // CPU seconds

    fprintf(qword_126EDF0,
            "%-30s %10.2f (CPU) %10.2f (elapsed)\n",
            label, cpu_sec, elapsed);
}

The decompiled code contains explicit unsigned-to-double conversion handling for 64-bit values (the v6 & 1 | (v6 >> 1) pattern followed by doubling). This is the compiler's standard idiom for converting unsigned 64-bit integers to double on x86-64 when the value might exceed INT64_MAX. In practice, clock() millisecond values fit comfortably in signed 64-bit range, so this path is never taken.
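For readers unfamiliar with the idiom, the following sketch shows what the `v & 1 | (v >> 1)` pattern computes. `u64_to_double` is a hypothetical name, and the code assumes the usual two's-complement wraparound when casting to int64_t (as on GCC and Clang).

```c
#include <stdint.h>

/* Converts an unsigned 64-bit value to double using only a signed
 * conversion, as compilers commonly expand it on x86-64. */
double u64_to_double(uint64_t v)
{
    if ((int64_t)v >= 0)
        return (double)(int64_t)v;            /* fast path: fits in int64 */

    /* Halve the value, folding the lost low bit back in as a sticky bit
     * so the final rounding is unaffected. */
    uint64_t half = (v >> 1) | (v & 1);
    double d = (double)(int64_t)half;
    return d + d;                             /* scale back up by two */
}
```

Only values with the top bit set take the slow path, which is why millisecond counters never reach it.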

Output format: "%-30s %10.2f (CPU) %10.2f (elapsed)\n"

Front end time                     12.34 (CPU)     15.67 (elapsed)
Back end time                       3.45 (CPU)      4.56 (elapsed)
Total compilation time             15.79 (CPU)     20.23 (elapsed)

The label is left-justified in a 30-character field. CPU and elapsed times are right-justified in 10-character fields with 2 decimal places.
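The field widths can be verified with a small formatter built around the same format string. `format_timing_line` is a hypothetical wrapper that writes into a buffer instead of the diagnostic stream.

```c
#include <stdio.h>

/* Sketch: formats one timing line into buf using the documented format.
 * Returns the snprintf result (length of the full line). */
int format_timing_line(char *buf, size_t n, const char *label,
                       double cpu_sec, double elapsed_sec)
{
    return snprintf(buf, n, "%-30s %10.2f (CPU) %10.2f (elapsed)\n",
                    label, cpu_sec, elapsed_sec);
}
```

For any label of 30 characters or fewer and times below 10,000,000 seconds, every line is exactly 69 characters long including the newline, so consecutive reports line up in columns.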

Timing Flag Activation

The timing flag dword_106C0A4 is registered in the CLI flag table as flag ID 20:

// In sub_452010 (register_internal_flags)
sub_451F80(20, "timing", 35, 0, 0, 1);
//         ^id  ^name    ^case ^  ^ ^undocumented

When --timing is passed on the command line, the CLI parser (sub_459630) hits case 20 in its switch statement, which sets dword_106C0A4 = 1. The flag defaults to 0 (disabled), set explicitly in sub_45EB40 (cmd_line_pre_init).

Timing Brackets in main()

main() at 0x408950 allocates six 16-byte timestamp slots on its stack frame:

Variable	Stack offset	Purpose
`v7`	`[rsp+0x00]`	Total compilation start
`v8`	`[rsp+0x10]`	Frontend start
`v9`	`[rsp+0x20]`	Frontend end
`v10`	`[rsp+0x30]`	Backend start
`v11`	`[rsp+0x40]`	Backend end
`v12`	`[rsp+0x50]`	Total compilation end

Three timing regions are measured:

Region 1: Frontend

Captured after sub_585DB0 (fe_one_time_init) and reported after sub_588F90 (fe_wrapup). Of pipeline stages 3-6, it covers TU state reset, source parsing + IL build, and the 5-pass wrapup; the heavy initialization in fe_one_time_init finishes before the start timestamp is taken and is therefore not included.

if (dword_106C0A4)
    capture_time(&t_fe_start);      // v8

reset_tu_state();                   // sub_7A4860
process_translation_unit(filename); // sub_7A40A0
fe_wrapup(filename, 1);            // sub_588F90

if (dword_106C0A4) {
    capture_time(&t_fe_end);        // v9
    report_timing("Front end time", &t_fe_start, &t_fe_end);
}

Region 2: Backend

Captured around sub_489000 (process_file_scope_entities). Only executed when dword_106C254 == 0 (no frontend errors).

if (!dword_106C254) {
    if (dword_106C0A4)
        capture_time(&t_be_start);  // v10

    process_file_scope_entities();  // sub_489000

    if (dword_106C0A4) {
        capture_time(&t_be_end);    // v11
        report_timing("Back end time", &t_be_start, &t_be_end);
    }
}

Region 3: Total

Starts before CLI parsing (sub_459630) and ends just before exit. Always uses v7 (captured once at the very beginning) as the start timestamp.

capture_time(&t_total_start);       // v7 -- captured before CLI parsing

// ... entire compilation ...

if (dword_106C0A4) {
    capture_time(&t_total_end);     // v12
    report_timing("Total compilation time", &t_total_start, &t_total_end);
}

Note that the "Total compilation time" region begins before command-line parsing and includes the CLI parsing overhead, all initialization, frontend, backend, and signoff. The "Front end time" region does NOT include CLI parsing or pre-init -- it starts after fe_one_time_init.

Compilation Summary -- write_signoff

sub_5AEE00 (0x5AEE00) -- write_signoff

This 490-byte function writes the compilation summary trailer to the diagnostic stream. It has two completely separate code paths: SARIF mode and text mode.

SARIF Mode (`dword_106BBB8 == 1`)

Closes the SARIF JSON document started by sub_5AEDB0 (write_init):

fwrite("]}]}\n", 1, 5, qword_126EDF0);

This closes the results array, the run object, the runs array, and the top-level SARIF document. If dword_106BBB8 is set but not equal to 1, the function hits an assertion: write_signoff at host_envir.c:2203.

Text Mode (`dword_106BBB8 == 0`)

The text-mode path assembles a human-readable summary from four counters:

Global	Address	Meaning
`qword_126ED90`	`0x126ED90`	Error count
`qword_126ED98`	`0x126ED98`	Warning count
`qword_126EDB0`	`0x126EDB0`	Suppressed error count
`qword_126EDB8`	`0x126EDB8`	Suppressed warning count

The function uses EDG's message catalog (sub_4F2D60) for all translatable strings:

Message ID	Purpose	Likely content
1742	Error (singular)	`"error"`
1743	Errors (plural)	`"errors"`
1744	Warning (singular)	`"warning"`
1745	Warnings (plural)	`"warnings"`
1746	Conjunction	`"and"`
1747	Source file indicator (format)	`"in compilation of \"%s\""`
1748	Generated indicator	`"generated"`
3234	Suppressed intro	`"of which"`
3235	Suppressed verb	`"were suppressed"` / `"was suppressed"`

Output assembly logic (simplified pseudocode):

void write_text_signoff(void)     // text-mode path of sub_5AEE00
{
    int64_t errors   = qword_126ED90;
    int64_t warnings = qword_126ED98;
    int64_t total    = errors + warnings;

    // Debug: module declaration count (only if dword_126EFC8 + "module_report")
    if (dword_126EFC8 && is_debug_enabled("module_report") && qword_106B9C8)
        fprintf(stream, "%lu modules declarations processed (%lu failed).\n",
                qword_106B9C8, qword_106B9C0);

    if (total == 0)
        return;                   // nothing to report

    int64_t suppressed_warn = qword_126EDB8;
    int64_t suppressed_total = suppressed_warn + qword_126EDB0;
    int64_t displayed = total - suppressed_total;

    // --- Print displayed counts ---
    if (total != suppressed_total) {      // there ARE unsuppressed diagnostics
        if (errors)
            fprintf(stream, "%lu %s", errors, msg(errors != 1 ? 1743 : 1742));
        if (errors && warnings)
            fprintf(stream, " %s ", msg(1746));   // " and "
        if (warnings)
            fprintf(stream, "%lu %s", warnings, msg(warnings != 1 ? 1745 : 1744));
    }

    // --- Print suppressed counts ---
    if (suppressed_total > 0) {
        // Invariant: no suppressed warnings may be recorded on this path.
        // A nonzero suppressed_warn trips the assertion below; only
        // suppressed errors are reported.
        if (suppressed_warn)
            assert(0);  // host_envir.c:2141, "write_text_signoff"

        if (displayed) {
            fprintf(stream, " (%s ", msg(3234));        // " (of which "
            fprintf(stream, "%lu %s %s",
                    suppressed_total,
                    msg(3235),                          // "was/were suppressed"
                    msg(suppressed_total == 1 ? 1742 : 1743));
            fputc(')', stream);                         // close paren
        } else {
            // All diagnostics were suppressed -- just print suppressed count
            fprintf(stream, "%lu %s %s",
                    suppressed_total,
                    msg(3235),
                    msg(suppressed_total == 1 ? 1742 : 1743));
        }
    }

    // --- Print source filename ---
    fputc(' ', stream);
    if (qword_126EEE0 && *qword_126EEE0 && strcmp(qword_126EEE0, "-") != 0) {
        char *display_name = qword_106C040 ? qword_106C040 : qword_126EEE0;
        char *basename = normalize_path(display_name) + 32;  // sub_5AC020 returns
                                                              // buffer, basename at +32
        fprintf(stream, msg(1747), basename);  // "in compilation of \"%s\""
    } else {
        fputs(msg(1748), stream);              // "generated" (stdout mode)
    }
    fputc('\n', stream);
}

Example output:

2 errors and 1 warning in compilation of "kernel.cu"

3 errors (of which 1 was suppressed error) in compilation of "main.cu"
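The displayed-count branch of the pseudocode can be sketched as a standalone formatter. This is a simplified illustration: `format_counts` is a hypothetical name, the suppressed-count and filename parts are omitted, and the English words are the "likely content" guesses from the message table rather than confirmed catalog strings.

```c
#include <stdio.h>

/* Sketch of the displayed-count portion only; the literals stand in for
 * message catalog entries 1742-1746. Returns the snprintf result. */
int format_counts(char *buf, size_t n,
                  unsigned long errors, unsigned long warnings)
{
    const char *e = errors   != 1 ? "errors"   : "error";
    const char *w = warnings != 1 ? "warnings" : "warning";

    if (errors && warnings)
        return snprintf(buf, n, "%lu %s and %lu %s", errors, e, warnings, w);
    if (errors)
        return snprintf(buf, n, "%lu %s", errors, e);
    if (warnings)
        return snprintf(buf, n, "%lu %s", warnings, w);
    return snprintf(buf, n, "%s", "");        /* nothing to report */
}
```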

sub_589530 (0x589530) -- write_signoff + free_mem_blocks

A thin wrapper (13 bytes) called from main()'s exit path. Performs two operations:

void fe_finish(void)          // sub_589530
{
    write_signoff();          // sub_5AEE00 -- print summary
    free_mem_blocks();        // sub_6B8DE0 -- release all frontend memory pools
}

sub_6B8DE0 (free_mem_blocks) is the master memory deallocation function from mem_manage.c (assertion at line 1438, function name "free_mem_blocks"). It operates in two modes depending on the global dword_1280728:

Pool allocator mode (dword_1280728 set): Walks three linked lists of allocated memory blocks:

Current block at qword_1280720: freed first, looked up in the free-block hash chain at qword_1280748, then the block descriptor itself is freed.
Hash table at qword_126EC88 with dword_126EC80 buckets: each bucket is a singly-linked list of block descriptors. Blocks with nonzero size (field [4]) are freed; blocks with zero size trigger the mem_manage.c:1438 assertion (invariant: a complete block must have a recorded size).
Overflow list at qword_1280730: same walk-and-free logic.

Each block deallocation decrements qword_1280718 (total allocated bytes) and optionally updates qword_1280710 (low-water mark). At debug level > 4, each free prints: "free_complete_block: freeing block of size %lu\n".

Non-pool mode (dword_1280728 == 0): Iterates source file entries via sub_6B8B20(N) for each entry N from dword_126EC80 down to 0, then walks the permanent allocation array at qword_126EC58, calling sub_5B0500 for each (which wraps munmap or free).

Exit Handling

exit_with_status -- `sub_5AF1D0` (0x5AF1D0)

A 145-byte __noreturn function that maps internal compilation status codes to POSIX exit codes. This is the only exit point for normal compilation flow -- every path through main() ends here.

// Full annotated decompilation
__noreturn void exit_with_status(uint8_t status)  // sub_5AF1D0
{
    // --- Text-mode messages (suppressed in SARIF mode) ---
    if (!dword_106BBB8) {           // not SARIF mode
        if (status == 9 || status == 10) {
            fwrite("Compilation terminated.\n", 1, 0x18, qword_126EDF0);
            exit(4);                // goto LABEL_8
        }
        if (status == 11) {
            fwrite("Compilation aborted.\n", 1, 0x15, qword_126EDF0);
            fflush(qword_126EDF0);
            abort();                // goto LABEL_10
        }
    }

    // --- Exit code mapping (both text and SARIF modes) ---
    switch (status) {
        case 3:
        case 4:
        case 5:  exit(0);          // success
        case 8:  exit(2);          // warnings only
        case 9:
        case 10: exit(4);          // errors (SARIF mode reaches here)
        default: fflush(qword_126EDF0);
                 abort();          // internal error (11, or any unknown)
    }
}

Status-to-exit-code mapping:

Internal Status	Meaning	Text Output	Exit Code	Termination
3	Clean success (no warnings, no additional status)	(none)	0	`exit(0)`
4	Success variant	(none)	0	`exit(0)`
5	Success with additional status (`qword_126ED88 != 0`)	(none)	0	`exit(0)`
8	Nonzero `qword_126ED90` (the error counter) at exit	(none)	2	`exit(2)`
9	Errors	`"Compilation terminated.\n"`	4	`exit(4)`
10	Errors (variant)	`"Compilation terminated.\n"`	4	`exit(4)`
11	Internal error / fatal	`"Compilation aborted.\n"`	(n/a)	`abort()`

In SARIF mode (dword_106BBB8 != 0), the text messages "Compilation terminated." and "Compilation aborted." are suppressed. The exit codes remain the same -- the function falls through to the switch which dispatches identically.

The default case handles status 11 and any unexpected status value by calling abort() after flushing the diagnostic stream. This typically produces a core dump for debugging, subject to the process's core-size limit.

Control flow note

The code structure looks unusual because the decompiler linearizes a two-phase dispatch. First, text-mode messages are emitted for statuses 9/10 and 11 (with early exit(4) or abort() respectively). If SARIF mode is active OR status is not 9/10/11, execution falls through to the switch statement. This means statuses 9/10 reach exit(4) via two different paths depending on SARIF mode, but the exit code is always 4.
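Stripped of the message output, the mapping reduces to a pure function. The sketch below uses a hypothetical name, `exit_code_for_status`, and returns -1 in place of the abort() call so that the mapping can be exercised without terminating the process.

```c
/* Returns the process exit code for a status, or -1 where the real
 * function calls abort() instead of exit(). */
int exit_code_for_status(unsigned char status)
{
    switch (status) {
    case 3:
    case 4:
    case 5:  return 0;
    case 8:  return 2;
    case 9:
    case 10: return 4;
    default: return -1;   /* status 11 or unknown: abort() */
    }
}
```

The result is the same in text and SARIF mode; only the accompanying message differs.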

Exit Code Determination in main()

The exit code passed to sub_5AF1D0 is computed in main() based on two global counters:

// From main() at 0x408950
uint8_t exit_code = 8;              // default (v6 = 8): kept when qword_126ED90 != 0

sub_6B8B20(0);                      // reset file state
sub_589530();                       // write_signoff + free_mem_blocks

if (!qword_126ED90)                 // no errors?
    exit_code = qword_126ED88 ? 5 : 3;   // success codes

// ... timing, stack restore ...
exit_with_status(exit_code);

Decision tree:

qword_126ED90 != 0  (errors present)
  └── exit_code = 8  →  exit(2)
      NOTE: This is counterintuitive. When errors exist, exit_code
      keeps its default of 8, which maps to exit(2), not exit(4).
      In the code shown, main() only ever passes 3, 5, or 8, so
      exit(4) is reached only through statuses 9/10 supplied by
      other callers, such as terminate_compilation(9) in the
      signal handler.

qword_126ED90 == 0  (no errors)
  ├── qword_126ED88 != 0  →  exit_code = 5  →  exit(0)  (success w/ status)
  └── qword_126ED88 == 0  →  exit_code = 3  →  exit(0)  (clean success)

The variable qword_126ED88 at 0x126ED88 is initialized to 0 in sub_4ED530 (declaration_pre_init) and sub_4ED7C0. It appears to track whether any notable conditions occurred during compilation that are not errors or warnings -- possibly informational remarks or specific compiler actions taken. When nonzero, the exit code changes from 3 to 5, but both map to exit(0).
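The selection logic in main() can be restated as a function of the two counters. `status_from_counters` and its parameter names are hypothetical; the second parameter stands for qword_126ED88, whose exact meaning is not established.

```c
#include <stdint.h>

/* Mirrors main()'s choice of status: 8 by default, replaced by 5 or 3
 * only when the error counter (qword_126ED90) is zero. */
uint8_t status_from_counters(uint64_t error_count, uint64_t other_count)
{
    if (error_count != 0)
        return 8;
    return other_count ? 5 : 3;
}
```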

Stack Limit Restoration

Before calling exit_with_status, main() restores the process stack limit if it was raised during initialization:

if (stack_was_raised) {
    rlimits.rlim_cur = original_stack;   // restore saved soft limit
    setrlimit(RLIMIT_STACK, &rlimits);
}

The boolean stack_was_raised (stored in rbp, variable v4) is set during startup when dword_106C064 (the --modify_stack_limit flag, default ON) causes main() to raise RLIMIT_STACK from its soft limit to the hard limit. This restoration is a defensive measure -- it ensures any child processes spawned during cleanup (or signal handlers) inherit a normal stack size.
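The raise-then-restore pattern can be sketched with the POSIX resource-limit calls. The function names are hypothetical, and skipping the raise when the soft limit already equals the hard limit is an assumption about the original's behavior.

```c
#include <sys/resource.h>

/* Raises the soft stack limit to the hard limit. Stores the previous soft
 * limit in *saved_soft and returns 1 if the limit was changed, else 0. */
int raise_stack_limit(rlim_t *saved_soft)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == rl.rlim_max)
        return 0;
    *saved_soft = rl.rlim_cur;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_STACK, &rl) == 0;
}

/* Restores the soft limit saved by raise_stack_limit().
 * Returns 0 on success, -1 on failure. */
int restore_stack_limit(rlim_t saved_soft)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    rl.rlim_cur = saved_soft;
    return setrlimit(RLIMIT_STACK, &rl);
}
```

The return value of raise_stack_limit plays the role of stack_was_raised: the restore call is made only when it is nonzero.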

Signal-Driven Exit Paths

Three additional paths reach exit_with_status:

SIGINT / SIGTERM Handler -- `handler` (0x5AF2C0)

Registered in sub_5B1E70 (host_envir_early_init) for signals 2 (SIGINT) and 15 (SIGTERM). The registration is one-shot, guarded by dword_E6E120 (cleared to 0 once registration has run). SIGINT registration is conditional: the code first calls signal(SIGINT, SIG_IGN) and checks the return value. If the previous disposition was already SIG_IGN (meaning the parent process, typically nvcc, has set the child to ignore interrupts), it stays ignored. Otherwise, the custom handler is installed. SIGTERM gets the handler unconditionally.

__noreturn void handler(void)           // 0x5AF2C0
{
    fputc('\n', qword_126EDF0);         // newline to stderr
    terminate_compilation(9);           // sub_5AF2B0
}

terminate_compilation -- `sub_5AF2B0` (0x5AF2B0)

Bridge function: writes signoff then exits.

__noreturn void terminate_compilation(uint8_t status)  // sub_5AF2B0
{
    write_signoff();                    // sub_5AEE00
    exit_with_status(status);           // sub_5AF1D0
}

When called from handler, status is 9 (errors), which produces "Compilation terminated.\n" followed by exit(4).

SIGXCPU Handler -- `sub_5AF270` (0x5AF270)

Registered for signal 24 (SIGXCPU):

__noreturn void cpu_time_limit_handler(void)  // sub_5AF270
{
    fputc('\n', qword_126EDF0);
    fwrite("Internal error: CPU time limit exceeded.\n", 1, 0x29, qword_126EDF0);
    exit_with_status(11);               // sub_5AF1D0 → abort()
}

This handler fires if the process receives SIGXCPU despite sub_5B1E70 having set RLIMIT_CPU to RLIM_INFINITY at startup. A SIGXCPU could still arrive if an external resource manager (e.g., batch scheduler) overrides the limit after initialization. Status 11 causes abort() with a core dump.

SIGXFSZ

Set to SIG_IGN in sub_5B1E70 (signal(25, SIG_IGN)). This prevents the process from being killed when writing a .int.c file that exceeds the process's file-size resource limit (RLIMIT_FSIZE). Without this, a large compilation output could raise an unhandled SIGXFSZ (25), whose default action terminates the process with a core dump.

SARIF Output Bookends

The SARIF JSON output is bracketed by two functions:

Function	Address	When Called	Output
`sub_5AEDB0` (write_init)	`0x5AEDB0`	During `fe_init_part_1` (stage 3)	`{"version":"2.1.0","$schema":"...","runs":[{"tool":{"driver":{"name":"EDG CPFE","version":"6.6",...}},"columnKind":"unicodeCodePoints","results":[`
`sub_5AEE00` (write_signoff)	`0x5AEE00`	During `sub_589530` (pre-exit)	`]}]}` + newline

The tool metadata identifies the frontend as "EDG CPFE" version "6.6" from "Edison Design Group", with fullName "Edison Design Group C/C++ Front End - 6.6" and informationUri "https://edg.com/c". The column kind is "unicodeCodePoints" (not byte offsets). Individual diagnostics are appended to the results array by the error subsystem between these two calls.

The write_init function (sub_5AEDB0) has the same assertion guard as write_signoff: if dword_106BBB8 is set but not equal to 1, it triggers an assertion at host_envir.c:2017 ("write_init"). Both assertions enforce the invariant that SARIF mode is exactly 0 or 1, never any other value.

Profiling Init -- `sub_5AF330` (0x5AF330)

A separate but related mechanism. During sub_585DB0 (fe_one_time_init), if dword_106BD4C is set, sub_5AF330 is called:

int profiling_init(void)             // sub_5AF330
{
    int was_initialized = dword_126F110;
    if (!dword_126F110)
        dword_126F110 = 1;           // mark as initialized
    return was_initialized;          // 0 on first call, 1 on subsequent
}

This is a one-shot initializer for a profiling subsystem distinct from the --timing flag. The dword_106BD4C gate is set by a different CLI flag and controls a more granular, per-function profiling infrastructure (used by the EDG debug trace system, not the phase-level timing brackets). The dword_126F110 flag prevents double-initialization if fe_one_time_init is called more than once.

Signal Handler Registration Detail

The full signal setup in sub_5B1E70 (host_envir_early_init):

if (dword_E6E120) {                              // one-shot guard (starts nonzero)
    if (signal(SIGINT, SIG_IGN) != SIG_IGN)       // was SIGINT not already ignored?
        signal(SIGINT, handler);                   //   install interrupt handler
    signal(SIGTERM, handler);                      // always install
    signal(SIGXFSZ, SIG_IGN);                     // ignore file-size limit signals
    signal(SIGXCPU, sub_5AF270);                  // CPU time limit → abort
    dword_E6E120 = 0;                             // prevent re-registration
}

Signal	Number	Handler	Behavior
SIGINT	2	`handler` (0x5AF2C0)	Conditional: only if not inherited as `SIG_IGN`. Writes newline, calls `terminate_compilation(9)`.
SIGTERM	15	`handler` (0x5AF2C0)	Always installed. Same handler as SIGINT.
SIGXFSZ	25	`SIG_IGN`	Ignored. Prevents crash on large `.int.c` output.
SIGXCPU	24	`sub_5AF270` (0x5AF270)	Prints `"Internal error: CPU time limit exceeded.\n"`, then `exit_with_status(11)` (abort).

After signal setup, sub_5B1E70 also disables the CPU time limit by setting RLIMIT_CPU soft limit to RLIM_INFINITY:

getrlimit(RLIMIT_CPU, &rlimits);
rlimits.rlim_cur = RLIM_INFINITY;    // -1 = unlimited
setrlimit(RLIMIT_CPU, &rlimits);

This prevents normal compilations from hitting SIGXCPU. The handler at sub_5AF270 is a safety net for cases where an external resource manager re-imposes the limit after initialization.

Complete Exit Sequence

The full sequence from compilation completion to process termination:

1.  sub_6B8B20(0)           Reset source file manager state
2.  sub_589530()            Write signoff + free memory
    ├── sub_5AEE00()        Print error/warning summary (or close SARIF JSON)
    └── sub_6B8DE0()        Free all frontend memory pools
3.  Compute exit_code       Based on qword_126ED90, qword_126ED88
4.  [If timing enabled]
    ├── sub_5AF350(v12)     Capture total end timestamp
    └── sub_5AF390(...)     Print "Total compilation time"
5.  [If stack was raised]
    └── setrlimit(...)      Restore original stack soft limit
6.  sub_5AF1D0(exit_code)   Map status → exit code, terminate
    ├── 3,4,5 → exit(0)
    ├── 8     → exit(2)
    ├── 9,10  → exit(4) + "Compilation terminated."
    └── 11    → abort()  + "Compilation aborted."
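
The status-to-exit-code mapping in step 6 can be sketched as a pure function. This is a minimal illustration and not the decompiled body of sub_5AF1D0: the names map_status_to_exit_code, status_message, EXIT_VIA_ABORT, and EXIT_UNDOCUMENTED are hypothetical, and status values not listed above are deliberately left unmapped.

```c
#include <stddef.h>

/* Hypothetical sentinels for this sketch; they are not values from the binary. */
#define EXIT_VIA_ABORT    (-1)   /* status 11: the process calls abort() */
#define EXIT_UNDOCUMENTED (-2)   /* status not covered by the table above */

/* Maps an internal completion status to the exit code listed in step 6. */
static int map_status_to_exit_code(int status)
{
    switch (status) {
    case 3: case 4: case 5:
        return 0;                 /* success, with or without warnings */
    case 8:
        return 2;
    case 9: case 10:
        return 4;                 /* preceded by "Compilation terminated." */
    case 11:
        return EXIT_VIA_ABORT;    /* preceded by "Compilation aborted." */
    default:
        return EXIT_UNDOCUMENTED; /* assumption: not described in this page */
    }
}

/* Returns the message printed before termination, or NULL if none is listed.
 * The trailing newline is omitted here. */
static const char *status_message(int status)
{
    if (status == 9 || status == 10)
        return "Compilation terminated.";
    if (status == 11)
        return "Compilation aborted.";
    return NULL;
}
```

Keeping the mapping separate from the call to exit() or abort() makes the table easy to check against the listing without terminating the test process.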

Global Variable Reference

Variable	Address	Size	Role
`dword_106C0A4`	`0x106C0A4`	4	Timing enable flag. CLI flag 20 (`--timing`).
`dword_106BBB8`	`0x106BBB8`	4	SARIF output mode. 0=text, 1=SARIF JSON.
`qword_126EDF0`	`0x126EDF0`	8	Diagnostic output FILE* (stderr).
`qword_126ED90`	`0x126ED90`	8	Total error count.
`qword_126ED98`	`0x126ED98`	8	Total warning count.
`qword_126ED88`	`0x126ED88`	8	Additional status (nonzero changes the completion status from 3 to 5; both map to `exit(0)`).
`qword_126EDB0`	`0x126EDB0`	8	Suppressed error count.
`qword_126EDB8`	`0x126EDB8`	8	Suppressed warning count.
`qword_126EEE0`	`0x126EEE0`	8	Output filename (for source display in signoff).
`qword_106C040`	`0x106C040`	8	Display filename override (used if set, else falls back to `qword_126EEE0`).
`dword_106C254`	`0x106C254`	4	Skip-backend flag. Set to 1 when errors detected after frontend.
`dword_106C064`	`0x106C064`	4	Stack limit adjustment flag (`--modify_stack_limit`, default ON).
`dword_E6E120`	`0xE6E120`	4	One-shot guard for signal handler registration in `sub_5B1E70`.
`dword_126F110`	`0x126F110`	4	Profiling initialized flag. Set to 1 by `sub_5AF330`.
`dword_106BD4C`	`0x106BD4C`	4	Profiling gate flag. When set, `fe_one_time_init` calls `sub_5AF330`.
`qword_106B9C8`	`0x106B9C8`	8	Module declarations processed count (for debug `module_report`).
`qword_106B9C0`	`0x106B9C0`	8	Module declarations failed count.
`dword_1280728`	`0x1280728`	4	Memory manager mode flag. Controls pool vs non-pool deallocation in `sub_6B8DE0`.

Cross-References

Pipeline Overview -- placement of timing/exit in the 8-stage pipeline
Entry Point & Initialization -- main() structure, signal handler registration, stack limit setup
CLI Processing -- flag 20 (--timing) registration and parsing
Backend Code Generation -- the "Back end time" measurement target
SARIF & Pragma Control -- SARIF JSON format details
Diagnostic Overview -- error/warning counting infrastructure

Execution Spaces

Every CUDA function lives in one or more execution spaces that govern where the function can run (host CPU, device GPU, or both) and what it can call. cudafe++ encodes execution space as a single-byte bitfield at offset +182 of the entity (routine) node. This byte is the most frequently tested field in CUDA-specific code paths -- it drives attribute application, redeclaration compatibility, virtual override checking, call-graph validation, IL marking, and code generation selection. Understanding this byte is prerequisite to understanding nearly every CUDA-specific subsystem in cudafe++.

The three CUDA execution-space keywords (__host__, __device__, __global__) are parsed as EDG attributes with internal kind codes 'V' (86), 'W' (87), and 'X' (88) respectively. The attribute dispatch table in apply_one_attribute (sub_413240) routes each kind to a dedicated handler that validates constraints and sets the bitfield. Functions without any explicit annotation default to __host__.

Key Facts

Property	Value
Source file	`attribute.c` (handlers), `class_decl.c` (redecl/override), `nv_transforms.h` (inline predicates)
Bitfield location	Entity node byte at offset `+182`
`__global__` handler	`sub_40E1F0` / `sub_40E7F0` (`apply_nv_global_attr`, two variants)
`__device__` handler	`sub_40EB80` (`apply_nv_device_attr`)
`__host__` handler	`sub_4108E0` (`apply_nv_host_attr`)
Virtual override checker	`sub_432280` (`record_virtual_function_override`)
Execution space mask table	`dword_E7C760[]` (indexed by space enum)
Mask lookup	`sub_6BCF60` (`nv_check_execution_space_mask`)
Annotation helper	`sub_41A1F0` (validates HD annotations on types)
Relaxed mode flag	`dword_106BFF0` (permits otherwise-illegal space combinations)
main() entity pointer	`qword_126EB70` (compared during attribute application)

The Execution Space Bitfield (Entity + 182)

Byte offset +182 within a routine entity node encodes the execution space as a bitfield. Individual bits carry distinct meanings:

Byte at entity+182:

  bit 0  (0x01)   device_capable     Function can execute on device
  bit 1  (0x02)   device_explicit    __device__ was explicitly written
  bit 2  (0x04)   host_capable       Function can execute on host
  bit 3  (0x08)   (reserved)
  bit 4  (0x10)   host_explicit      __host__ was explicitly written
  bit 5  (0x20)   device_annotation  Secondary device flag (used in HD detection)
  bit 6  (0x40)   global_kernel      Function is a __global__ kernel
  bit 7  (0x80)   hd_combined        Combined __host__ __device__ flag

Combined Patterns

The attribute handlers do not set individual bits -- they OR entire patterns into the byte. Each CUDA keyword produces a characteristic bitmask:

Keyword	OR mask	Resulting byte	Bit breakdown
`__global__`	`0x61`	`0xE1`	device_capable + device_annotation + global_kernel + bit 7 (always set)
`__device__`	`0x23`	`0x23`	device_capable + device_explicit + device_annotation
`__host__`	`0x15`	`0x15`	device_capable + host_capable + host_explicit
`__host__ __device__`	`0x23 \| 0x15`	`0x37`	device_capable + device_explicit + host_capable + host_explicit + device_annotation
(no annotation)	none	`0x00`	Implicit `__host__` -- bits remain zero

The 0x80 bit is set unconditionally by the __global__ handler. After the |= 0x61 operation (which sets bit 6), the handler reads the byte back and checks (byte & 0x40) != 0. Since bit 6 was just set, this is always true, so |= 0x80 always executes. Despite the field name hd_combined in some tooling, the bit functions as a "has global annotation" marker in practice.
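
The OR patterns in the table can be reproduced with a few lines of C. This is a sketch under the bit layout documented above; the enumerator and function names (ES_*, or_host, or_device, or_global) are hypothetical labels, not symbols recovered from the binary.

```c
#include <stdint.h>

/* Bit names follow the layout of entity+182; identifiers are hypothetical. */
enum {
    ES_DEVICE_CAPABLE    = 0x01,
    ES_DEVICE_EXPLICIT   = 0x02,
    ES_HOST_CAPABLE      = 0x04,
    ES_HOST_EXPLICIT     = 0x10,
    ES_DEVICE_ANNOTATION = 0x20,
    ES_GLOBAL_KERNEL     = 0x40,
    ES_BIT7              = 0x80
};

/* __host__: OR in 0x15. */
static uint8_t or_host(uint8_t byte)
{
    return (uint8_t)(byte | ES_DEVICE_CAPABLE | ES_HOST_CAPABLE | ES_HOST_EXPLICIT);
}

/* __device__: OR in 0x23. */
static uint8_t or_device(uint8_t byte)
{
    return (uint8_t)(byte | ES_DEVICE_CAPABLE | ES_DEVICE_EXPLICIT | ES_DEVICE_ANNOTATION);
}

/* __global__: OR in 0x61, then set bit 7 because bit 6 is now always set. */
static uint8_t or_global(uint8_t byte)
{
    byte = (uint8_t)(byte | ES_DEVICE_CAPABLE | ES_DEVICE_ANNOTATION | ES_GLOBAL_KERNEL);
    if (byte & ES_GLOBAL_KERNEL)          /* always true after the OR above */
        byte = (uint8_t)(byte | ES_BIT7);
    return byte;
}
```

Starting from 0x00, or_global yields 0xE1, and applying or_host followed by or_device (in either order) yields 0x37, matching the "Resulting byte" column.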

Why device_capable (bit 0) Appears in __host__

The __host__ mask 0x15 includes bit 0 (device_capable). This is not an error. Bit 0 acts as a "has execution space annotation" marker rather than a strict "runs on device" flag. The actual device-only vs host-only distinction is determined by the two-bit extraction at bits 4-5 (the 0x30 mask), described below.

Execution Space Classification (0x30 Mask)

The critical two-bit extraction byte & 0x30 classifies a routine into one of four categories:

(byte & 0x30):
  0x00  ->  no explicit annotation (implicit __host__)
  0x10  ->  __host__ only
  0x20  ->  __device__ only
  0x30  ->  __host__ __device__

This extraction is the basis of nv_is_device_only_routine, an inline predicate defined in nv_transforms.h (line 367). The full check from the decompiled binary is:

// nv_is_device_only_routine (inlined from nv_transforms.h:367)
// entity_sym: the symbol table entry for the routine
// entity_sym+88 -> associated routine entity

__int64 entity = *(entity_sym + 88);
if (!entity)
    internal_error("nv_transforms.h", 367, "nv_is_device_only_routine");

char byte = *(char*)(entity + 182);
bool is_device_only = ((byte & 0x30) == 0x20) && ((byte & 0x60) == 0x20);

The double-check (byte & 0x60) == 0x20 ensures the function is device-only and NOT a __global__ kernel (which would have bit 6 set, making byte & 0x60 == 0x60). This predicate is used in:

check_void_return_okay (sub_719D20): suppress missing-return warnings for device-only functions
record_virtual_function_override (sub_432280): drive virtual override execution space propagation
Cross-space call validation: determine whether a call crosses execution space boundaries
IL keep-in-il marking: identify device-reachable code

The 0x60 Mask (Kernel vs Device)

A secondary extraction byte & 0x60 distinguishes kernels from plain device functions:

(byte & 0x60):
  0x00  ->  no device annotation
  0x20  ->  device annotation without __global__ (__device__ or __host__ __device__)
  0x40  ->  __global__ only (should not occur in isolation)
  0x60  ->  __global__ (which implies __device__)
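
The two extractions can be combined in a small classifier. This is a sketch that assumes only the mask values documented above; the enum and function names are hypothetical.

```c
#include <stdint.h>

/* Possible values of (byte & 0x30); enumerator names are hypothetical. */
enum host_device_class {
    HD_CLASS_IMPLICIT_HOST = 0x00,
    HD_CLASS_HOST          = 0x10,
    HD_CLASS_DEVICE        = 0x20,
    HD_CLASS_HOST_DEVICE   = 0x30
};

/* First extraction: host/device classification from bits 4-5. */
static enum host_device_class classify_0x30(uint8_t byte)
{
    return (enum host_device_class)(byte & 0x30);
}

/* Second extraction: bits 5-6 both set means a __global__ kernel. */
static int is_kernel_0x60(uint8_t byte)
{
    return (byte & 0x60) == 0x60;
}

/* Device-only requires both checks: the 0x30 mask alone cannot tell a
 * kernel (0xE1) from a plain __device__ function (0x23). */
static int is_device_only(uint8_t byte)
{
    return classify_0x30(byte) == HD_CLASS_DEVICE && (byte & 0x60) == 0x20;
}
```

For a kernel byte of 0xE1, classify_0x30 returns the same class as a plain __device__ function, which is why the second mask is needed to exclude kernels.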

nv_is_device_only_routine Truth Table

The predicate is inlined from nv_transforms.h:367 and appears in multiple call sites. Its internal_error guard string "nv_is_device_only_routine" appears in sub_432280 at the source path EDG_6.6/src/nv_transforms.h. The complete truth table for all execution space combinations:

Execution space	byte+182	byte & 0x30	byte & 0x60	Result
(none, implicit `__host__`)	`0x00`	`0x00`	`0x00`	false
`__host__`	`0x15`	`0x10`	`0x00`	false
`__device__`	`0x23`	`0x20`	`0x20`	true
`__host__ __device__`	`0x37`	`0x30`	`0x20`	false
`__global__`	`0xE1`	`0x20`	`0x60`	false

The __global__ case is the key distinction: byte & 0x30 yields 0x20 (same as __device__), but byte & 0x60 yields 0x60 (not 0x20), so the predicate correctly rejects kernels.

// Full pseudocode for nv_is_device_only_routine
// Inlined at every call site; not a standalone function in the binary.
//
// Input: sym -- a symbol table entry (not the entity itself)
// Output: true if the routine is __device__ only (not __host__, not __global__)
bool nv_is_device_only_routine(symbol *sym) {
    entity *e = sym->entity;            // sym + 88
    if (!e)
        internal_error("nv_transforms.h", 367, "nv_is_device_only_routine");

    char byte = e->byte_182;
    // First check:  bits 4-5 == 0x20 -> has __device__, no __host__
    // Second check: bits 5-6 == 0x20 -> has __device__, no __global__
    return ((byte & 0x30) == 0x20) && ((byte & 0x60) == 0x20);
}

Complete Redeclaration Matrix

The matrix below documents every possible pair of (existing annotation, newly-applied annotation) and the result. Each cell is derived from the three attribute handler functions. "Relaxed" means the outcome changes when dword_106BFF0 is set.

Existing \ Applying	`__host__`	`__device__`	`__global__`
(none) `0x00`	`0x15` -- OK	`0x23` -- OK	`0xE1` -- OK
`__host__` `0x15`	`0x15` -- idempotent	`0x37` -- OK (HD)	error 3481 (always: handler checks `byte & 0x10` unconditionally)
`__device__` `0x23`	`0x37` -- OK (HD)	`0x23` -- idempotent	error 3481 (relaxed: OK)
`__global__` `0xE1`	error 3481 (always)	error 3481 (relaxed: OK)	`0xE1` -- idempotent
`__host__ __device__` `0x37`	`0x37` -- idempotent	`0x37` -- idempotent	error 3481 (always: `byte & 0x10` fires)

The __global__ column always errors when the existing annotation includes __host__ (bit 4 = 0x10), because the __global__ handler's condition (v5 & 0x10) != 0 is not guarded by the relaxed-mode flag. The __device__ column errors on existing __global__ only when relaxed mode is off, because the __device__ handler guards its check with !dword_106BFF0.

Note that __global__'s byte value is 0xE1 (not 0x61) because the 0x80 bit is always set after __global__ is applied, as documented above.
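
The matrix can be condensed into one predicate that mirrors the three conflict conditions quoted from the handlers. This is a sketch, not decompiled code: reports_3481, enum es_keyword, and the relaxed parameter (standing in for dword_106BFF0) are hypothetical names, and the function models only the 3481 decision, not the other validation steps.

```c
#include <stdint.h>

enum es_keyword { KW_HOST, KW_DEVICE, KW_GLOBAL };  /* hypothetical */

/* Returns 1 if applying `kw` to a routine whose entity+182 byte is
 * `existing` would report error 3481, 0 otherwise. `relaxed` models
 * dword_106BFF0 being nonzero. */
static int reports_3481(uint8_t existing, enum es_keyword kw, int relaxed)
{
    switch (kw) {
    case KW_GLOBAL:
        /* already __device__ (suppressed in relaxed mode), or already
         * explicit __host__ (never suppressed) */
        return (!relaxed && (existing & 0x60) == 0x20) || (existing & 0x10) != 0;
    case KW_DEVICE:
        /* already __global__, suppressed in relaxed mode */
        return !relaxed && (existing & 0x40) != 0;
    case KW_HOST:
        /* already __global__, never suppressed */
        return (existing & 0x40) != 0;
    }
    return 0;
}
```

For example, reports_3481(0x23, KW_GLOBAL, 1) is 0 (the relaxed-mode exception), while reports_3481(0x15, KW_GLOBAL, 1) stays 1 because the 0x10 test ignores the flag.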

Attribute Application Functions

apply_nv_global_attr (sub_40E1F0 / sub_40E7F0)

Two nearly identical entry points exist. Both apply __global__ to a function entity. The variant at sub_40E7F0 uses a do-while loop for parameter iteration instead of a for loop, but the validation logic is identical. Both variants may exist because EDG generates different code paths for attribute-on-declaration vs attribute-on-definition.

The function performs extensive validation before setting the bitmask:

// Pseudocode for apply_nv_global_attr (sub_40E1F0)
entity *apply_nv_global_attr(attr_node *a1, entity *a2, char target_kind) {
    if (target_kind != 11)      // only applies to functions
        return a2;

    // Check constexpr lambda with wrong linkage
    if ((a2->qword_184 & 0x800001000000) == 0x800000000000) {
        char *name = get_entity_name(a2, 0);
        error(3469, a1->source_loc, "__global__", name);
        return a2;
    }

    // Static member check
    if ((signed char)a2->byte_176 < 0 && !(a2->byte_81 & 0x04))
        warning(3507, a1->source_loc, "__global__");

    // operator() check
    if (a2->byte_166 == 5)
        error(3644, a1->source_loc);

    // Return type must not be auto/decltype(auto) (skip cv-qualifier wrappers)
    type *ret = a2->return_type;    // +144
    while (ret->kind == 12)         // 12 = cv-qualifier wrapper
        ret = ret->next;            // +144
    if (ret->prototype->exception_spec)     // +152 -> +56
        error(3647, a1->source_loc);        // auto/decltype(auto) return

    // Execution space conflict check (single condition with ||)
    char es = a2->byte_182;
    if ((!dword_106BFF0 && (es & 0x60) == 0x20) || (es & 0x10) != 0)
        error(3481, a1->source_loc);
        // Left branch: already __device__ (not relaxed mode) -> conflict
        // Right branch: already __host__ explicit (unconditional) -> conflict

    // Return type must be void (non-constexpr path)
    if (!(a2->byte_179 & 0x10)) {      // not constexpr
        if (a2->byte_191 & 0x01)       // lambda
            error(3506, a1->source_loc);
        else if (!is_void_return(a2))
            error(3505, a1->source_loc);
    }

    // Variadic check
    // ... skip to prototype, check bit 0 of proto+16
    if (proto_flags & 0x01)
        error(3503, a1->source_loc);

    // >>> SET THE BITMASK <<<
    a2->byte_182 |= 0x61;    // bits 0,5,6: device_capable + device_annotation + global_kernel

    // Local function check
    if (a2->byte_81 & 0x04)
        error(3688, a1->source_loc);

    // main() check
    if (a2 == qword_126EB70 && (a2->byte_182 & 0x20))
        error(3538, a1->source_loc);

    // Always set bit 7 after __global__: the check reads the byte AFTER |= 0x61,
    // so bit 6 is always set, making this unconditional.
    if (a2->byte_182 & 0x40)
        a2->byte_182 |= 0x80;

    // Parameter default-init check (device-side warning)
    // ... iterate parameters, warn 3669 if missing defaults
    return a2;
}

apply_nv_device_attr (sub_40EB80)

Handles both variables (target_kind == 7) and functions (target_kind == 11). For variables, it sets the memory space bitfield at +148 (bit 0 = __device__). For functions, it sets the execution space.

// Variable path (target_kind == 7):
a2->byte_148 |= 0x01;              // __device__ memory space
if (((a2->byte_148 & 0x02) != 0) + ((a2->byte_148 & 0x04) != 0) == 2)
    error(3481, ...);               // both __shared__ (bit 1) AND __constant__ (bit 2) set
if ((signed char)a2->byte_161 < 0)
    error(3482, ...);               // thread_local
if (a2->byte_81 & 0x04)
    error(3485, ...);               // local variable

// Function path (target_kind == 11):
// Same constexpr-lambda check as __global__
if (!dword_106BFF0 && (a2->byte_182 & 0x40))
    error(3481, ...);               // already __global__, now __device__
a2->byte_182 |= 0x23;              // device_capable + device_explicit + device_annotation
if ((a2->byte_81 & 0x04) && (a2->byte_182 & 0x40))
    error(3688, ...);               // local function with __global__
if (a2 == qword_126EB70 && (a2->byte_182 & 0x20))
    error(3538, ...);               // __device__ on main()

apply_nv_host_attr (sub_4108E0)

The simplest of the three. Only applies to functions (target_kind 11). Fewer validation checks than __global__ or __device__.

// Function path (target_kind == 11):
// Same constexpr-lambda check
if (a2->byte_182 & 0x40)
    error(3481, ...);           // already __global__, now __host__
a2->byte_182 |= 0x15;          // device_capable + host_capable + host_explicit
if ((a2->byte_81 & 0x04) && (a2->byte_182 & 0x40))
    error(3688, ...);           // local function that is also __global__
if (a2 == qword_126EB70 && (a2->byte_182 & 0x20))
    error(3538, ...);           // main() that also carries device_annotation (bit 5)

Default Execution Space

Functions without any explicit annotation have byte +182 == 0x00. This is treated as implicit __host__:

The 0x30 mask yields 0x00, which the cross-space validator treats identically to 0x10 (explicit __host__)
The function is compiled for the host side only
It is excluded from device IL during the keep-in-il pass

In JIT compilation mode (--default-device), the default flips to __device__. This changes which functions are kept in device IL without requiring explicit annotations.
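
One way to picture the default rule is a helper that resolves the 0x30 extraction to an effective space. This is only a sketch of the behavior described above: effective_space_bits and its default_device parameter are hypothetical, and how the binary actually implements the --default-device switch is not shown on this page.

```c
#include <stdint.h>

/* Resolves (byte & 0x30) to an effective host/device classification.
 * An unannotated routine (0x00) is treated as __host__ (0x10) unless
 * `default_device` is nonzero, in which case it is treated as
 * __device__ (0x20). Assumption: this models the documented behavior,
 * not a specific function in the binary. */
static uint8_t effective_space_bits(uint8_t byte, int default_device)
{
    uint8_t mask = (uint8_t)(byte & 0x30);
    if (mask == 0x00)
        return default_device ? 0x20 : 0x10;
    return mask;
}
```

Explicit annotations are unaffected: a byte of 0x37 resolves to 0x30 regardless of the flag.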

Execution Space Conflict Detection

The attribute handlers enforce a mutual-exclusion matrix. When a second execution space attribute is applied to a function that already has one, the handler reports any conflict as error 3481:

Already set	Applying	Result
(none)	`__host__`	`0x15` -- accepted
(none)	`__device__`	`0x23` -- accepted
(none)	`__global__`	`0xE1` -- accepted
`__host__` (0x15)	`__device__`	`0x37` -- accepted (HD)
`__device__` (0x23)	`__host__`	`0x37` -- accepted (HD)
`__host__` (0x15)	`__global__`	error 3481 (always -- `byte & 0x10` is unconditional)
`__device__` (0x23)	`__global__`	error 3481 (unless `dword_106BFF0`)
`__global__` (0xE1)	`__host__`	error 3481 (always)
`__global__` (0xE1)	`__device__`	error 3481 (unless `dword_106BFF0`)
`__host__` (0x15)	`__host__`	idempotent OR, no error
`__device__` (0x23)	`__device__`	idempotent OR, no error
`__global__` (0xE1)	`__global__`	idempotent OR, no error

The relaxed mode flag dword_106BFF0 suppresses only the __device__/__global__ conflicts: when it is set, applying __global__ to an existing __device__ routine, or __device__ to an existing __global__ routine, no longer produces error 3481. This flag corresponds to --expt-relaxed-constexpr or a similar permissive compilation mode. The relaxed flag does NOT affect the __host__ -> __global__ or __global__ -> __host__ paths -- these always error because the __global__ handler checks byte & 0x10 unconditionally, and the __host__ handler checks byte & 0x40 unconditionally.

Virtual Function Override Checking (sub_432280)

When a derived class overrides a virtual function, cudafe++ must verify execution space compatibility. This check is embedded in record_virtual_function_override (sub_432280, 437 lines, from class_decl.c).

nv_is_device_only_routine Inline Check

The function first tests whether the overriding function has the __device__ flag at +177 bit 4 (0x10). If so, and the overridden function does NOT have this flag, execution space propagation occurs:

// Propagation logic (simplified from sub_432280, lines 70-94)
if (overriding->byte_177 & 0x10) {     // overriding is __device__
    if (!(overridden->byte_177 & 0x10)) {   // overridden is NOT __device__
        char es = overridden->byte_182;
        if ((es & 0x30) != 0x20) {          // overridden is not device-only
            overriding->byte_182 |= 0x10;   // propagate __host__ flag
        }
        if (es & 0x20) {                    // overridden has device_annotation
            overriding->byte_182 |= 0x20;   // propagate device_annotation
        }
    }
}

Six Virtual Override Mismatch Errors (3542-3547)

When the overriding function is NOT __device__, the checker looks up execution space attributes using sub_5CEE70 (attribute kind 87 = __device__, kind 86 = __host__). Based on which attributes are found on the overriding function and the execution space of the overridden function, one of six errors is emitted:

Error	Overriding has	Overridden space (`byte & 0x30`)	Meaning
3542	`__device__` only	`0x00` or `0x10` (host/implicit)	Device override of host virtual
3543	`__device__` + `__host__`	`0x00` (no annotation)	HD override of implicit-host virtual
3544	`__device__` + `__host__`	`0x20` (device-only)	HD override of device-only virtual
3545	no `__device__`	`0x20` (device-only)	Host override of device-only virtual
3546	no `__device__`	`0x30` (HD)	Host override of HD virtual
3547	`__device__` only	`0x30` (HD), relaxed mode	Device override of HD virtual (relaxed)

The errors are emitted via sub_4F4F10 with severity 8 (hard error). The dword_106BFF0 relaxed mode flag modulates certain paths: in relaxed mode, some combinations that would otherwise error are accepted or downgraded.

Decision Logic

// Pseudocode for override mismatch detection (sub_432280, lines 95-188)
char es = overridden->byte_182;
char mask_30 = es & 0x30;
bool has_device_bit = (es & 0x20) != 0;  // device_annotation (bit 5) of the overridden function
bool is_hd = (mask_30 == 0x30);

bool has_device_attr = has_attribute(overriding, 87 /*__device__*/);
bool has_host_attr   = has_attribute(overriding, 86 /*__host__*/);

if (has_device_attr) {
    if (has_host_attr) {
        // Overriding is __host__ __device__
        if (has_device_bit)
            error = 3544;   // HD overrides device-only
        else if (mask_30 != 0x20)
            error = 3543;   // HD overrides implicit-host
    } else {
        // Overriding is __device__ only
        if (!has_device_bit)
            error = 3542;   // device overrides host
        if (is_hd && relaxed_mode)
            error = 3547;   // device overrides HD (relaxed)
    }
} else {
    // Overriding has no __device__
    if (mask_30 == 0x20)
        error = 3545;       // host overrides device-only
    else if (mask_30 == 0x30)
        error = 3546;       // host overrides HD
}

__global__ Function Constraints

The __global__ handler enforces the strictest constraints of any execution space. A kernel function must satisfy all of the following:

Constraint	Check	Error
Must be a function (not variable/type)	`target_kind == 11`	silently ignored if not
Not a constexpr lambda with wrong linkage	`(qword_184 & 0x800001000000) != 0x800000000000`	3469
Not a static member function	`(signed char)byte_176 >= 0 \|\| (byte_81 & 0x04)`	3507
Not `operator()`	`byte_166 != 5`	3644
Return type not `auto`/`decltype(auto)`	no exception spec at proto+56	3647
No conflicting execution space	see conflict matrix above	3481
Return type is `void` (non-constexpr)	`is_void_return(a2)`	3505 / 3506
Not variadic	`!(proto_flags & 0x01)`	3503
Not a local function	`!(byte_81 & 0x04)`	3688
Not `main()`	`a2 != qword_126EB70`	3538
Parameters have default init (device-side)	walk parameter list	3669 (warning)

Execution Space Annotation Helper (sub_41A1F0)

This function validates that type arguments used in __host__ __device__ or __device__ template contexts are well-formed. It traverses the type chain (following cv-qualifier wrappers where kind == 12), emitting diagnostics:

Error 3597: Type nesting depth exceeds 7 levels
Error 3598: Type is not device-callable (fails sub_550E50 check)
Error 3599: Type lacks appropriate constructor/destructor for device context

The third argument (a3) selects the annotation string: when a3 == 0, the string is "__host__ __device__"; when a3 != 0, it is "__device__".

Attribute Dispatch (apply_one_attribute)

The central dispatcher sub_413240 (apply_one_attribute, 585 lines) routes attribute kinds to their handlers via a switch statement:

Kind byte	Decimal	Attribute	Handler
`'V'`	86	`__host__`	`sub_4108E0`
`'W'`	87	`__device__`	`sub_40EB80`
`'X'`	88	`__global__`	`sub_40E1F0` or `sub_40E7F0`

Attribute display names are resolved by sub_40A310 (attribute_display_name), which maps the kind byte back to the human-readable CUDA keyword string for use in diagnostic messages.

Execution Space Mask Table (dword_E7C760)

A lookup table at dword_E7C760 stores precomputed bitmasks indexed by execution space enum value. The function sub_6BCF60 (nv_check_execution_space_mask) performs return a1 & dword_E7C760[a2], allowing fast bitwise checks of whether a given entity's execution space matches a target space category. This table is used throughout cross-space validation and IL marking.
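
The lookup itself is a single AND against a table entry. The sketch below shows the shape of that operation only: the table contents of dword_E7C760 are not documented here, so the index enum and the mask values are placeholders invented for illustration (borrowed from the entity+182 bit layout), and check_space_mask is a hypothetical name.

```c
/* Placeholder indices and masks -- NOT the real contents of dword_E7C760. */
enum space_index { SPACE_HOST, SPACE_DEVICE, SPACE_GLOBAL, SPACE_COUNT };

static const unsigned space_mask_table[SPACE_COUNT] = {
    0x10,   /* placeholder: host_explicit bit      */
    0x20,   /* placeholder: device_annotation bit  */
    0x40    /* placeholder: global_kernel bit      */
};

/* Same shape as the documented body: return a1 & table[a2]. */
static unsigned check_space_mask(unsigned bits, enum space_index index)
{
    return bits & space_mask_table[index];
}
```

A nonzero result means the entity's bits intersect the selected mask; callers would typically test the result against zero.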

Diagnostics Reference

Error	Severity	Meaning
3469	error	Execution space attribute on constexpr lambda with wrong linkage
3481	error	Conflicting execution spaces
3482	error	`__device__` variable with `thread_local` storage
3485	error	`__device__` attribute on local variable
3503	error	`__global__` function cannot be variadic
3505	error	`__global__` return type must be `void` (non-constexpr path)
3506	error	`__global__` return type must be `void` (constexpr/lambda path)
3507	warning	`__global__` on static member function
3538	error	Execution space attribute on `main()`
3542	error	Virtual override: `__device__` overrides host
3543	error	Virtual override: `__host__ __device__` overrides implicit-host
3544	error	Virtual override: `__host__ __device__` overrides device-only
3545	error	Virtual override: host overrides device-only
3546	error	Virtual override: host overrides `__host__ __device__`
3547	error	Virtual override: `__device__` overrides HD (relaxed mode)
3577	error	`__device__` variable with `constexpr` and conflicting memory space
3597	error	Type nesting too deep for execution space annotation
3598	error	Type not callable in target execution space
3599	error	Type lacks device-compatible constructor/destructor
3644	error	`__global__` on `operator()`
3647	error	`__global__` return type cannot be `auto`/`decltype(auto)`
3669	warning	`__global__` parameter without default initializer (device-side)
3688	error	Execution space attribute on local function

Function Map

Address	Identity	Lines	Source
`sub_40A310`	`attribute_display_name`	83	attribute.c
`sub_40E1F0`	`apply_nv_global_attr` (variant 1)	89	attribute.c
`sub_40E7F0`	`apply_nv_global_attr` (variant 2)	86	attribute.c
`sub_40EB80`	`apply_nv_device_attr`	100	attribute.c
`sub_4108E0`	`apply_nv_host_attr`	31	attribute.c
`sub_413240`	`apply_one_attribute` (dispatch)	585	attribute.c
`sub_41A1F0`	execution space annotation helper	82	class_decl.c
`sub_432280`	`record_virtual_function_override`	437	class_decl.c
`sub_6BCF60`	`nv_check_execution_space_mask`	7	nv_transforms.c
`sub_719D20`	`check_void_return_okay`	271	statements.c

Cross-References

Memory Spaces -- variable-side __device__/__shared__/__constant__ at entity+148
Cross-Space Validation -- call-graph enforcement of execution space rules
Device/Host Separation -- IL marking driven by execution space
Kernel Stubs -- host-side stub generation for __global__ functions
Entity Node Layout -- full byte map of the entity structure
Virtual Override Matrix -- detailed 6-error mismatch table
JIT Mode -- --default-device flag that changes implicit execution space

Memory Spaces

Every CUDA variable that resides in GPU memory belongs to one of four memory spaces: __device__ (global memory), __shared__ (per-block scratchpad), __constant__ (read-only broadcast memory), or __managed__ (unified memory). cudafe++ encodes memory space as a two-byte bitfield at offsets +148 and +149 of the variable entity node. These two bytes are the variable-side analog of the execution space byte at +182 used for functions -- the two systems are complementary but independent.

The memory space bitfield passes through three processing stages. First, attribute handlers in attribute.c set the appropriate bits and enforce mutual exclusion constraints (no __shared__ + __constant__, no thread_local, no __grid_constant__ conflict). Second, declaration processing in decls.c applies additional validation: VLA restrictions for __shared__, constexpr and external-linkage restrictions for __constant__/__device__, and structured binding constraints for all spaces. Third, symbol reference recording in symbol_ref.c checks whether host code illegally accesses device-side variables at reference time.

Memory spaces apply exclusively to variables (entity kind 7). __shared__ and __constant__ have no function-side meaning -- only __device__ (kind 'W', 87) doubles as a function execution space attribute.

Key Facts

Property	Value
Memory space offset	Entity node byte `+148` (3-bit bitfield)
Extended space offset	Entity node byte `+149` (1 bit for `__managed__`)
`__device__` handler	`sub_40EB80` (`apply_nv_device_attr`, 100 lines, `attribute.c`)
`__managed__` handler	`sub_40E0D0` (`apply_nv_managed_attr`, 47 lines, `attribute.c:10523`)
`__shared__` handler	Kind `'Z'` (90), not individually decompiled; sets `+148 \|= 0x02`
`__constant__` handler	Kind `'['` (91), not individually decompiled; sets `+148 \|= 0x04`
Declaration processor	`sub_4DEC90` (`variable_declaration`, 1098 lines, `decls.c`)
Variable declaration	`sub_4CA6C0` (`decl_variable`, 1090 lines, `decls.c:7730`)
Variable fixup	`sub_4CC150` (`cuda_variable_fixup`, 120 lines, `decls.c`)
Defined-variable check	`sub_4DC200` (`mark_defined_variable`, 26 lines, `decls.c`)
Cross-space reference checker	`sub_72A650` / `sub_72B510` (`record_symbol_reference_full`, `symbol_ref.c`)
Device-var-in-host checker	`sub_6BCF10` (`nv_check_device_variable_in_host`, `nv_transforms.c`)
Post-validation	`sub_6BC890` (`nv_validate_cuda_attributes`, 161 lines, `nv_transforms.c`)
Attribute kind codes	`'W'`=87 (`__device__`), `'Z'`=90 (`__shared__`), `'['`=91 (`__constant__`), `'f'`=102 (`__managed__`)

The Memory Space Bitfield (Entity +148 / +149)

Byte +148: Primary Memory Space

Byte at entity+148:

  bit 0  (0x01)   __device__       Variable in device global memory
  bit 1  (0x02)   __shared__       Variable in per-block shared memory
  bit 2  (0x04)   __constant__     Variable in constant memory
  bit 3  (0x08)   type_member      Set when variable inherits space from type context
  bit 4  (0x10)   device_at_file   __device__ at file scope (no enclosing function)
  bit 7  (0x80)   weak_odr         Set by apply_nv_weak_odr_attr (sub_40AD80)

Bits 3 and 4 are set by decl_variable (sub_4CA6C0) during declaration processing, not by the memory space attribute handlers; bit 7 is set separately by apply_nv_weak_odr_attr (sub_40AD80). Bit 3 is set via *(_BYTE *)(v33 + 148) |= 8u when the variable inherits its memory space from a type context (such as a static member of a class with a device annotation). Bit 4 is set via *(_BYTE *)(v43 + 148) = v73 | 0x10 when a __device__ variable is declared at file scope (dword_126C5D8 == -1, meaning no enclosing function).

Byte +149: Extended Memory Space

Byte at entity+149:

  bit 0  (0x01)   __managed__    Unified memory (host + device accessible)
  bits 1-7        (reserved)

Word-Level Access

Some validation code reads bytes +148 and +149 together as a 16-bit word. The __grid_constant__ conflict check in apply_nv_managed_attr tests:

// sub_40E0D0, line 26 (apply_nv_managed_attr)
if ( (a2[164] & 4) != 0 && (*((_WORD *)a2 + 74) & 0x102) != 0 )

Here (_WORD *)(a2 + 148) (offset 74 in 16-bit units) is tested against 0x0102. In little-endian layout, 0x0102 means byte +148 bit 1 (__shared__) OR byte +149 bit 0 (__managed__). This catches the case where a __grid_constant__ parameter also carries __shared__ or __managed__.
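The word-level test can be reproduced on two plain bytes. The following is a minimal sketch in C, assuming a little-endian load as on x86-64; the helper names (space_word, shared_or_managed) and the enum constants are illustrative and do not appear in the binary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative masks for the two memory-space bytes (names are hypothetical). */
enum {
    SPACE_DEVICE   = 0x01,  /* byte +148, bit 0 */
    SPACE_SHARED   = 0x02,  /* byte +148, bit 1 */
    SPACE_CONSTANT = 0x04,  /* byte +148, bit 2 */
    SPACE_MANAGED  = 0x01   /* byte +149, bit 0 */
};

/* Combine bytes +148 and +149 the way a little-endian 16-bit load would:
 * +148 becomes the low byte, +149 the high byte. */
static uint16_t space_word(uint8_t b148, uint8_t b149)
{
    return (uint16_t)(b148 | ((uint16_t)b149 << 8));
}

/* Mirrors the (word & 0x0102) != 0 test: __shared__ or __managed__ present. */
static bool shared_or_managed(uint8_t b148, uint8_t b149)
{
    return (space_word(b148, b149) & 0x0102) != 0;
}
```

Under these assumptions a variable carrying only __device__ and __constant__ (byte +148 = 0x05, byte +149 = 0x00) does not trip the mask, while any variable with the __shared__ or __managed__ bit does.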

Mutual Exclusion

In valid CUDA programs, at most one of __device__, __shared__, and __constant__ should be set. However, __managed__ always implies __device__ -- the handler sets both +149 bit 0 and +148 bit 0. The validation logic permits __device__ + __managed__ but rejects combinations like __shared__ + __constant__.

The mutual exclusion check appears identically in both apply_nv_managed_attr and apply_nv_device_attr:

// From sub_40EB80 (apply_nv_device_attr), variable path:
v9 = *(_BYTE *)(a2 + 148) | 1;     // set __device__ bit
*(_BYTE *)(a2 + 148) = v9;
if ( ((v9 & 2) != 0) + ((v9 & 4) != 0) == 2 )
    sub_4F81B0(3481, a1 + 56);      // error: conflicting spaces

The expression ((v9 & 2) != 0) + ((v9 & 4) != 0) == 2 is true only when both __shared__ (bit 1) and __constant__ (bit 2) are set simultaneously. This means:

__device__ + __shared__ is allowed (the bits coexist)
__device__ + __constant__ is allowed
__shared__ + __constant__ triggers error 3481
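The three cases above can be checked mechanically. This is a small sketch of the same predicate in standalone C; the function names are hypothetical, and the second form is simply an equivalent mask comparison, not something observed in the binary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the decompiled test:
 * ((v & 2) != 0) + ((v & 4) != 0) == 2
 * True only when __shared__ (0x02) and __constant__ (0x04) are both set. */
static bool shared_constant_conflict(uint8_t b148)
{
    return ((b148 & 0x02) != 0) + ((b148 & 0x04) != 0) == 2;
}

/* Equivalent formulation using a single mask comparison. */
static bool shared_constant_conflict_mask(uint8_t b148)
{
    return (b148 & 0x06) == 0x06;
}
```

Because bit 0 never participates in the test, setting __device__ before or after the check makes no difference to the result.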

Attribute Handlers

apply_nv_managed_attr -- sub_40E0D0

The __managed__ handler is the simplest and most thoroughly documented. It demonstrates the full validation pattern that all memory space handlers share.

Entry point: Called from apply_one_attribute (sub_413240) when attribute kind is 'f' (102).

Decompiled logic (47 lines, attribute.c:10523):

// sub_40E0D0 -- apply_nv_managed_attr
// a1: attribute node, a2: entity node, a3: entity kind

// Gate: only applies to variables
if ( a3 != 7 )
    internal_error("attribute.c", 10523, "apply_nv_managed_attr");

// Step 1: Set managed flag AND device flag
v3 = a2[148];           // save old memory space byte
a2[149] |= 1;           // set __managed__ bit
a2[148] = v3 | 1;       // set __device__ bit (managed implies device)

// Step 2: Mutual exclusion check
if ( ((v3 & 2) != 0) + ((v3 & 4) != 0) == 2 )
    error(3481, ...);    // __shared__ + __constant__ conflict

// Step 3: Thread-local check
if ( (char)a2[161] < 0 )
    error(3482, ...);    // __managed__ on thread_local

// Step 4: Local variable check
if ( (a2[81] & 4) != 0 )
    error(3485, ...);    // __managed__ on local variable

// Step 5: __grid_constant__ conflict
if ( (a2[164] & 4) != 0 && (*(WORD*)(a2 + 148) & 0x102) != 0 )
{
    // Determine which space string to display
    v4 = a2[148];
    v5 = "__constant__";
    if ( (v4 & 4) == 0 ) {
        v5 = "__managed__";
        if ( (a2[149] & 1) == 0 ) {
            v5 = "__shared__";
            if ( (v4 & 2) == 0 ) {
                v5 = "__device__";
                if ( (v4 & 1) == 0 )
                    v5 = "";
            }
        }
    }
    error(3577, ..., v5);   // incompatible with __grid_constant__
}

The space-name selection cascade (__constant__ > __managed__ > __shared__ > __device__ > empty) is used in error messages to show which memory space conflicts with __grid_constant__. The cascade tests bits in priority order, matching the most "restrictive" space first.
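For reference, the cascade can be written as a flat sequence of tests. This is a sketch, not the decompiled code: the function name space_display_name is hypothetical and the two bytes are passed as separate arguments instead of being read from an entity node.

```c
#include <stdint.h>

/* Priority order taken from the nested ifs in sub_40E0D0:
 * __constant__, then __managed__, then __shared__, then __device__, else "". */
static const char *space_display_name(uint8_t b148, uint8_t b149)
{
    if (b148 & 0x04) return "__constant__";
    if (b149 & 0x01) return "__managed__";
    if (b148 & 0x02) return "__shared__";
    if (b148 & 0x01) return "__device__";
    return "";
}
```

Since __managed__ also sets the __device__ bit, a managed variable (byte +148 = 0x01, byte +149 = 0x01) is reported as "__managed__" because that test comes first.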

apply_nv_device_attr -- sub_40EB80

The __device__ handler is dual-purpose: it handles both variables (a3 == 7) and functions (a3 == 11).

Entry point: Called from apply_one_attribute when attribute kind is 'W' (87).

Variable path (entity kind 7):

// sub_40EB80, variable branch
*(_BYTE *)(a2 + 148) |= 1;          // set __device__ bit

// Validation (identical to __managed__):
// 1. Error 3481 if __shared__ + __constant__ both set
// 2. Error 3482 if thread_local (byte +161 bit 7)
// 3. Error 3485 if local variable (byte +81 bit 2)
// 4. Error 3577 if __grid_constant__ conflict

Function path (entity kind 11):

// sub_40EB80, function branch
// Check: not an implicitly-deleted function
if ( (*(_QWORD *)(a2 + 184) & 0x800001000000LL) != 0x800000000000LL
     || (*(_BYTE *)(a2 + 176) & 2) != 0 )
{
    // Conflict with __global__
    if ( !dword_106BFF0 && (*(_BYTE *)(a2 + 182) & 0x40) != 0 )
        error(3481, ...);

    *(_BYTE *)(a2 + 182) |= 0x23;    // set device execution space

    // Local function with __global__ conflict
    if ( (*(_BYTE *)(a2 + 81) & 4) != 0 && (*(_BYTE *)(a2 + 182) & 0x40) != 0 )
        error(3688, ...);

    // __device__ on main()
    if ( a2 == qword_126EB70 && (*(_BYTE *)(a2 + 182) & 0x20) != 0 )
        warning(3538, ...);
}
else
{
    // Implicitly-deleted function: emit diagnostic 3469 and leave +182 unchanged
    v14 = get_entity_display_name(a2);
    error(3469, ..., "__device__", v14);
}

// Check function parameters for missing default initializers
// (error 3669 for parameters without defaults in device context)

The function path is documented in Execution Spaces -- here we focus on the variable path.

__shared__ and __constant__ Handlers

The __shared__ and __constant__ attribute handlers are dispatched through apply_one_attribute (sub_413240) when attribute kind codes 'Z' (90) and '[' (91) are encountered. Their variable-path logic mirrors __device__ and __managed__:

Step	`__shared__` ('Z')	`__constant__` ('[')
Set memory space bit	`byte +148 |= 0x02`	`byte +148 |= 0x04`
Mutual exclusion (3481)	Check `__constant__` bit (bit 2)	Check `__shared__` bit (bit 1)
Thread-local check (3482)	Yes	Yes
Local variable check (3485)	Yes	Yes
`__grid_constant__` conflict (3577)	Yes	Yes

The __shared__ and __constant__ keywords apply only to variables (kind 7). Unlike __device__, they do not have a function-path branch -- there is no __shared__ or __constant__ function execution space.
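The mapping from attribute kind code to memory space bits can be summarized in one switch. This is a sketch under the assumption that each handler only ORs its own bits into bytes +148/+149; the function name is hypothetical, and the return value packs byte +148 into the low byte and byte +149 into the high byte.

```c
#include <stdint.h>

/* Returns the bits a handler ORs into the 16-bit word at +148
 * (low byte = +148, high byte = +149), or 0 for any other kind. */
static uint16_t space_bits_for_attr_kind(char kind)
{
    switch (kind) {
    case 'W': return 0x0001;  /* 87:  __device__ */
    case 'Z': return 0x0002;  /* 90:  __shared__ */
    case '[': return 0x0004;  /* 91:  __constant__ */
    case 'f': return 0x0101;  /* 102: __managed__ (also sets __device__) */
    default:  return 0;
    }
}
```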

Variable Declaration Processing

sub_4DEC90 -- variable_declaration

The top-level declaration processor (decls.c) performs additional CUDA-specific validation after attribute handlers have set the memory space bits. This function is 1098 lines and handles both normal variable declarations and static data member definitions.

CUDA-specific checks in variable_declaration:

Error	Condition	Description
149	Memory space attribute at illegal scope	CUDA storage class at namespace scope (specific scenarios)
892	`auto` with `__constant__`	`auto`-typed `__constant__` variable
893	`auto` with CUDA attribute	`auto`-typed variable with other CUDA memory space
3510	`__shared__` with VLA	`__shared__` variable with variable-length array type
3566	`__constant__` + constexpr + auto	`__constant__` constexpr with auto deduction
3567	CUDA variable with VLA	CUDA memory-space variable with VLA type
3568	`__constant__` + constexpr	`__constant__` combined with constexpr
3578	CUDA attribute in discarded branch	CUDA attribute on variable in constexpr-if discarded branch
3579	CUDA attribute + structured binding	CUDA attribute at namespace scope with structured binding
3580	CUDA attribute on VLA	CUDA attribute on variable-length array

Memory space string selection (used in error messages):

// sub_4DEC90, line ~357: selecting display name for the memory space
v50 = "__constant__";
if ( (v49 & 4) == 0 ) {
    v50 = "__managed__";
    if ( (*(_BYTE *)(v15 + 149) & 1) == 0 ) {
        v50 = "__host__ __device__" + 9;   // pointer arithmetic: = "__device__"
        if ( (v49 & 2) != 0 )
            v50 = "__shared__";
    }
}

The string "__device__" is produced by taking the string "__host__ __device__" and advancing by 9 bytes, skipping past "__host__ ". This is a binary-level optimization -- the compiler shares string storage between the combined "__host__ __device__" literal and the standalone "__device__" reference.
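The offset is easy to verify: "__host__ " (eight characters plus one space) is nine bytes long. A minimal sketch in C, with hypothetical names:

```c
/* One stored literal serves both names. */
static const char hd_name[] = "__host__ __device__";

/* Skipping the 9 bytes of "__host__ " lands on the "__device__" suffix. */
static const char *device_name(void)
{
    return hd_name + 9;
}
```

Both pointers refer into the same storage, so no separate copy of "__device__" is needed for this reference.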

sub_4CA6C0 -- decl_variable

The core variable declaration function (1090 lines, decls.c:7730) handles CUDA memory space propagation during symbol table entry creation. Key behaviors:

Storage class mapping: When declaration state byte at offset +269 equals 5, it indicates a CUDA memory space storage class. The function performs a scope walk to determine the correct namespace scope for the variable. If a prior declaration exists at the same scope (dword_126C5DC == dword_126C5B4), the CUDA storage class is reset to allow redeclaration.

Scope walk: Traverses the scope chain (784-byte scope entries at qword_126C5E8, indexed by dword_126C5E4) upward through class scopes (scope_kind 4) and template scopes (bit 0x20 at scope entry +9), until reaching a non-class, non-template scope. This determines whether the variable is at namespace scope, class scope, or block scope.

Error 3483 -- memory space in non-device function: When a variable with the __device__ memory space bit set (+148 bit 0) is declared inside a function body, and the enclosing routine is not device-only ((+182 & 0x30) != 0x20), the function emits error 3483 with the storage kind and space name. Variables whose storage class value is 1 are exempt:

// From sub_4CA6C0, ~line 886-910
if (!at_namespace_scope) {
    char space = entity->byte_148;
    if (storage_class != 1 && (space & 0x01)) {
        routine_descriptor = qword_126C5D0;
        if (routine_descriptor) {
            entity_ptr = *(routine_descriptor + 32);
            if (entity_ptr && (entity_ptr[182] & 0x30) != 0x20) {
                const char *name = get_space_name(entity);  // priority cascade
                const char *kind = (storage_class == 2) ? "a static" : "an automatic";
                error(3483, source_loc, kind, name);
            }
        }
    }
}

File-scope device flag: When a __device__ variable is at file scope (dword_126C5D8 == -1), the function sets bit 4 of +148:

if ((entity->byte_148 & 0x01) && dword_126C5D8 == -1)
    entity->byte_148 |= 0x10;   // bit 4: device_at_file_scope

Redeclaration checking: When a variable is redeclared, the function compares memory space encoding at offset +136 (the attribute byte) between the existing and new entity. Error 1306 is emitted for mismatched CUDA memory spaces.

Memory space propagation: Calls sub_4C4750 (set_variable_attributes) for final attribute propagation, and sub_4CA480 (check_variable_redeclaration) for prior-declaration compatibility.

sub_4DC200 -- mark_defined_variable

Post-declaration validation for device-memory variables with external linkage (26 lines):

// sub_4DC200 -- mark_defined_variable (decompiled)
void mark_defined_variable(entity_t *a1, int a2) {
    if (a1[164] & 0x10) {   // already marked as defined
        if (!dword_106BFD0                    // cross-space checking not overridden
            && (a1[148] & 3) == 1             // __device__ set, __shared__ NOT set
            && !is_compiler_generated(a1)     // not compiler-generated
            && (a1[80] & 0x70) != 0x10)       // not anonymous
        {
            warning(3648, a1 + 64);           // external linkage warning
        }
    } else if (!a2 && (*(byte*)(*(qword*)a1 + 81) & 2)) {
        error(1655, ...);   // tentative definition of constexpr
    } else {
        // Same 3648 check on first definition
        if (!dword_106BFD0 && (a1[148] & 3) == 1 && ...)
            warning(3648, a1 + 64);
        a1[164] |= 0x10;   // mark as defined
    }
}

The condition (a1[148] & 3) == 1 tests that bit 0 (__device__) is set AND bit 1 (__shared__) is NOT set. This catches __device__ variables (including __device__ __constant__ and __device__ __managed__, since those have bit 0 set) but excludes any variable with the __shared__ bit set. The check is NOT about __constant__ alone -- a pure __constant__ variable (only bit 2 set, value 0x04) yields (0x04 & 3) == 0 and fails the test. The p1.06 report's characterization of warning 3648 as "constant with external linkage" is misleading; the actual condition is "device-accessible (non-shared) variable with external linkage."
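The behaviour of the mask on the relevant byte values can be tabulated with a one-line predicate. This is an illustrative sketch; the function name is hypothetical and only the space-bit part of the 3648 condition is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Space-bit portion of the 3648 condition: bit 0 set, bit 1 clear.
 * Bit 2 (__constant__) is masked out and has no influence. */
static bool passes_3648_space_test(uint8_t b148)
{
    return (b148 & 3) == 1;
}
```

With this predicate, 0x01 (__device__) and 0x05 (__device__ __constant__) pass, while 0x04 (__constant__ only), 0x02 (__shared__ only) and 0x03 (__device__ __shared__) do not.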

sub_4CC150 -- cuda_variable_fixup

Called from variable_declaration after CUDA constexpr-if detection. This function:

Manipulates variable entity fields at offset +148 (memory space) and +162 (visibility flags)
Adjusts scope chains using the 784-byte scope entry array
Creates new type entries for CUDA-specific variable rewriting

Bit Assignment Resolution

Two sweep reports provided conflicting bit assignments for byte +148:

Source	bit 0	bit 1	bit 2
p1.01 (attribute.c handlers)	`__device__`	`__shared__`	`__constant__`
p1.06 (decls.c)	`__constant__`	`__shared__`	`__managed__`

The decompiled code resolves this definitively in favor of the p1.01 assignment. Two independent functions confirm it:

sub_40E0D0 (apply_nv_managed_attr) sets a2[149] |= 1 (managed at +149) and a2[148] = v3 | 1 (device at +148 bit 0). The subsequent conflict check tests (v3 & 2) for __shared__ and (v3 & 4) for __constant__.
sub_40EB80 (apply_nv_device_attr) sets *(_BYTE *)(a2 + 148) |= 1 (device at +148 bit 0), then uses the identical conflict test ((v9 & 2) != 0) + ((v9 & 4) != 0) == 2.

The canonical encoding is:

Byte +148:  bit 0 = __device__,  bit 1 = __shared__,  bit 2 = __constant__
Byte +149:  bit 0 = __managed__

The p1.06 report's alternative encoding is an analysis error, caused by mark_defined_variable (sub_4DC200) testing (+148 & 3) == 1 in the context of warning 3648. That test checks for __device__ set (bit 0) without __shared__ (bit 1) -- not for __constant__ at bit 0. The diagnostic was then characterized as "constant with external linkage" based on the message text rather than the actual bit test.

Validation Constraints

__managed__ Constraints

__managed__ has the strictest requirements among memory space annotations. All five checks occur in apply_nv_managed_attr (sub_40E0D0):

Constraint	Binary test	Error	Description
Variables only	`a3 != 7`	internal_error	`__managed__` can only apply to variables, not functions or types
No shared+constant	`((old & 2) != 0) + ((old & 4) != 0) == 2`	3481	Both `__shared__` and `__constant__` already set
Not thread-local	`(signed char)byte+161 < 0`	3482	Bit 7 of `+161` = thread_local storage
Not reference/local	`byte+81 & 4`	3485	Bit 2 of `+81` = reference type or local variable
Not grid_constant	`byte+164 & 4` and word `+148 & 0x0102`	3577	`__grid_constant__` parameter with managed or shared space

The __managed__ keyword requires compute capability >= 3.0. This is verified at compilation time via version threshold comparisons (qword_126EF90 > 0x78B3, where 0x78B3 = 30899 in the CUDA version encoding scheme). The specific error code for architecture-too-low is not captured in the decompiled attribute handler.

__shared__ Constraints

__shared__ variables have restrictions enforced across multiple functions:

Constraint	Where	Error	Description
No VLA type	`sub_4DEC90`	3510	`__shared__` variable cannot have variable-length array type
No VLA (general)	`sub_4DEC90`	3580	CUDA memory-space attribute on variable-length array
Not thread-local	Attribute handler	3482	`__shared__` on `thread_local` variable
Not local (non-block)	Attribute handler	3485	Cannot appear on local variables outside device function scope
No grid_constant	Attribute handler	3577	Incompatible with `__grid_constant__` parameter

__constant__ Constraints

__constant__ carries additional restrictions related to constexpr and type:

Constraint	Where	Error	Description
No constexpr	`sub_4DEC90`	3568	`__constant__` combined with constexpr (when managed+device bits also set)
No constexpr+auto	`sub_4DEC90`	3566	`__constant__` constexpr with auto deduction
No VLA type	`sub_4DEC90`	3567	CUDA memory-space variable with VLA type
Not thread-local	Attribute handler	3482	`__constant__` on `thread_local` variable
Not local	Attribute handler	3485	Cannot appear on local variables
No grid_constant	Attribute handler	3577	Incompatible with `__grid_constant__` parameter

Note: Warning 3648 (external linkage) is emitted by sub_4DC200, but its condition tests (byte+148 & 3) == 1, which checks for __device__ set without __shared__ -- not specifically __constant__. The check applies to any device-accessible non-shared variable, including __device__, __device__ __constant__, and __device__ __managed__.

Cross-Space Variable Access Checking

When host code references a device-side variable, the symbol reference recorder emits diagnostics. This checking occurs in record_symbol_reference_full (sub_72A650 / sub_72B510, symbol_ref.c) and is gated by global flags dword_106BFD0 and dword_106BFCC.

Gate Logic

1. Is cross-space checking enabled?
   → dword_106BFD0 != 0 OR dword_106BFCC != 0

2. Is the referenced entity a variable (kind == 7)?
   → Yes: proceed to nv_check_device_var_ref_in_host
   → No (kind 10/11/20 -- function): check nv_check_host_var_ref_in_device

3. Get current routine from scope stack (dword_126C5D8)
4. Check routine execution space at +182 (0x30 mask):
   → 0x00 or 0x10 (host): emit device-var-in-host errors
   → 0x20 (device): emit host-var-in-device errors

Device Variable Referenced from Host Code

The nv_check_device_var_ref_in_host path (assert string at symbol_ref.c:2347) checks memory space bits and produces specific errors based on which space the variable occupies:

Error	Condition	Description
3548	Variable has `__shared__` or `__constant__` (byte+148 bits 1-2)	Reference to `__shared__` / `__constant__` variable from host code
3549	Variable has `__constant__` and reference is in initializer context (ref_kind bit 4)	Initializer referencing device memory variable from host
3550	Variable has `__shared__` and reference is a write (ref_kind bit 1)	Write to `__shared__` variable from host code
3486	Via `sub_6BCF10` -- complex linkage check (`+176 & 0x200000000002000`, `+166 == 5`, `+168 in [1,4]`)	Illegal device variable reference from host (operator function context)
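The first three rows of the table can be expressed as independent bit tests. The sketch below is a reading of the table, not decompiled code: it assumes zero-based bit numbering for ref_kind (bit 4 = 0x10, bit 1 = 0x02), reports every condition that holds without modelling the order or exclusivity used by the binary, and omits 3486 because its linkage check depends on fields not modelled here. All names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical result flags: one bit per diagnostic whose condition holds. */
enum {
    DIAG_3548 = 1 << 0,  /* __shared__ / __constant__ referenced from host */
    DIAG_3549 = 1 << 1,  /* __constant__ referenced in an initializer context */
    DIAG_3550 = 1 << 2   /* write to a __shared__ variable */
};

/* b148: memory space byte of the referenced variable.
 * ref_kind: reference-kind flags (bit meanings assumed, see lead-in). */
static unsigned host_ref_diagnostics(uint8_t b148, unsigned ref_kind)
{
    unsigned out = 0;
    if (b148 & 0x06)                        out |= DIAG_3548;
    if ((b148 & 0x04) && (ref_kind & 0x10)) out |= DIAG_3549;
    if ((b148 & 0x02) && (ref_kind & 0x02)) out |= DIAG_3550;
    return out;
}
```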

Host Variable Referenced from Device Code

The nv_check_host_var_ref_in_device path (assert string at symbol_ref.c:2390) handles the reverse direction:

Error	Condition	Description
3623	Device-only function referenced outside device context	Use of `__device__`-only function outside the bodies of device functions

The error 3623 has two context strings:

"outside the bodies of device functions" -- general case
"from a constexpr or consteval __device__ function" -- constexpr context

Relaxation: dword_106BF40

When dword_106BF40 is set (corresponding to --expt-relaxed-constexpr), and the current routine is device-only ((+182 & 0x30) == 0x20) with +177 bit 1 set (explicit __device__), cross-space variable access checks are suppressed. This allows constexpr device functions to reference host variables during constant evaluation.
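The relaxation condition combines one global flag with two entity bytes. A sketch of the predicate, with the global passed as a parameter and a hypothetical function name:

```c
#include <stdbool.h>
#include <stdint.h>

/* relaxed_flag: stands in for dword_106BF40.
 * b182 / b177:  bytes +182 and +177 of the current routine entity. */
static bool relaxed_suppresses_checks(int relaxed_flag, uint8_t b182, uint8_t b177)
{
    return relaxed_flag != 0
        && (b182 & 0x30) == 0x20   /* device-only routine */
        && (b177 & 0x02) != 0;     /* explicit __device__ annotation */
}
```

All three conditions must hold; a __host__ __device__ routine ((b182 & 0x30) == 0x30) is not covered by this particular bypass even when the flag is set.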

Host Reference Arrays

When the backend emits host-side code, __device__ and __constant__ variables and __global__ kernels are registered in ELF section arrays so the CUDA runtime can discover them at load time. The emission function sub_6BCF80 (nv_emit_host_reference_array) writes entries into six separate sections:

Section	Array Name	Memory Space	Linkage
`.nvHRDE`	`hostRefDeviceArrayExternalLinkage`	`__device__`	External
`.nvHRDI`	`hostRefDeviceArrayInternalLinkage`	`__device__`	Internal
`.nvHRCE`	`hostRefConstantArrayExternalLinkage`	`__constant__`	External
`.nvHRCI`	`hostRefConstantArrayInternalLinkage`	`__constant__`	Internal
`.nvHRKE`	`hostRefKernelArrayExternalLinkage`	`__global__` (kernel)	External
`.nvHRKI`	`hostRefKernelArrayInternalLinkage`	`__global__` (kernel)	Internal

Each array entry contains the mangled name of the device symbol as a byte array:

extern "C" {
    extern __attribute__((section(".nvHRDE")))
    __attribute__((weak))
    const unsigned char hostRefDeviceArrayExternalLinkage[] = {
        /* mangled name bytes */ 0x0
    };
}

Six global lists (at addresses unk_1286780 through unk_12868C0) accumulate symbols during compilation, one per section type. Note that __shared__ variables do NOT get host reference arrays -- they have no host-visible address.
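The section naming follows a regular pattern (symbol class crossed with linkage), which a lookup table captures directly. This is an illustrative sketch; the enum and function names are hypothetical, and how the binary actually selects among its six lists is not shown here.

```c
#include <stdbool.h>

/* Hypothetical classification of a host-referenced symbol. */
typedef enum { REF_DEVICE = 0, REF_CONSTANT = 1, REF_KERNEL = 2 } host_ref_class_t;

/* Returns the ELF section name for a symbol class and linkage.
 * Column 0 = internal linkage, column 1 = external linkage. */
static const char *host_ref_section(host_ref_class_t cls, bool external)
{
    static const char *const names[3][2] = {
        { ".nvHRDI", ".nvHRDE" },  /* __device__ variables */
        { ".nvHRCI", ".nvHRCE" },  /* __constant__ variables */
        { ".nvHRKI", ".nvHRKE" },  /* __global__ kernels */
    };
    return names[cls][external ? 1 : 0];
}
```

There is deliberately no entry for __shared__, matching the absence of a host reference array for that space.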

Redeclaration Compatibility

When a variable is redeclared, decl_variable (sub_4CA6C0) compares the memory space bits between the prior declaration and the new one. Error 1306 is emitted for mismatched CUDA memory spaces:

Error 1306: CUDA memory space mismatch on redeclaration

The comparison tests byte +148 of both the existing entity and the new declaration's computed attributes. The CUDA memory space acts as an implicit storage class -- storage class value 5 in the declaration state (offset 269) indicates a CUDA-specific storage class that requires special scope-walking behavior.

String Table Usage

The memory space keywords appear in the binary's string table and are referenced by error message formatting code:

String	Usage
`"__constant__"`	Error messages for `__constant__` constraints, space name display
`"__managed__"`	Error messages for `__managed__` constraints
`"__device__"`	Obtained via `"__host__ __device__" + 9` (pointer arithmetic), or direct literal
`"__shared__"`	Error messages for `__shared__` constraints
`"__host__ __device__"`	Combined string; `+9` yields `"__device__"`

The pointer-arithmetic trick for "__device__" appears in both sub_4DEC90 (variable_declaration) and error message formatting throughout the attribute handlers. It saves binary space by reusing the combined "__host__ __device__" string constant.

Error Code Summary

Attribute Application Errors

Error	Severity	Description
3481	Error	Conflicting CUDA memory spaces (`__shared__` + `__constant__` simultaneously)
3482	Error	CUDA memory space attribute on `thread_local` variable
3485	Error	CUDA memory space attribute on local variable
3577	Error	Memory space incompatible with `__grid_constant__` parameter

Declaration Processing Errors

Error	Severity	Description
149	Error	Illegal CUDA storage class at namespace scope
892	Error	`auto` type with `__constant__` variable
893	Error	`auto` type with CUDA memory space variable
1306	Error	CUDA memory space mismatch on redeclaration
3483	Error	Memory space qualifier on automatic/static variable in non-device function
3510	Error	`__shared__` variable with variable-length array
3566	Error	`__constant__` with constexpr and auto deduction
3567	Error	CUDA variable with VLA type
3568	Error	`__constant__` combined with constexpr
3578	Error	CUDA attribute in constexpr-if discarded branch
3579	Error	CUDA attribute at namespace scope with structured binding
3580	Error	CUDA attribute on variable-length array
3648	Warning	Device-accessible (non-shared) variable with external linkage

Cross-Space Reference Errors

Error	Severity	Description
3486	Error	Illegal device variable reference from host (operator function context)
3548	Error	Reference to `__shared__` / `__constant__` variable from host code
3549	Error	Initializer referencing device memory variable from host
3550	Error	Write to `__shared__` variable from host code
3623	Error	Use of `__device__`-only function outside device context

Global State Variables

Variable	Type	Description
`dword_126EFA8`	int	CUDA mode flag (nonzero when compiling CUDA)
`dword_126EFB4`	int	CUDA dialect (2 = CUDA C++)
`dword_126EFAC`	int	Extended CUDA features flag
`dword_126EFA4`	int	CUDA version-check control
`qword_126EF98`	int64	CUDA version threshold (hex: `0x9E97` = 40599, `0x9D6C`, etc.)
`qword_126EF90`	int64	CUDA version threshold (hex: `0x78B3` = 30899 for compute_30)
`dword_106BFD0`	int	Enable cross-space reference checking (primary)
`dword_106BFCC`	int	Enable cross-space reference checking (secondary)
`dword_106BF40`	int	Allow `__device__` function refs in host (`--expt-relaxed-constexpr`)
`dword_106BFF0`	int	Relaxed execution space mode (permits otherwise-illegal combos)
`qword_126EB70`	ptr	Entity pointer for `main()` (prevents `__device__` on main)
`qword_126C5E8`	ptr	Scope stack base pointer (784-byte entries)
`dword_126C5E4`	int	Current scope stack top index
`dword_126C5D8`	int	Current function scope index (-1 if none)

Function Map

Address	Identity	Size	Source
`sub_40AD80`	`apply_nv_weak_odr_attr`	0.2 KB	`attribute.c:10497`
`sub_40E0D0`	`apply_nv_managed_attr`	0.4 KB	`attribute.c:10523`
`sub_40E1F0`	`apply_nv_global_attr` (variant 1)	0.9 KB	`attribute.c`
`sub_40E7F0`	`apply_nv_global_attr` (variant 2)	0.9 KB	`attribute.c`
`sub_40EB80`	`apply_nv_device_attr`	1.0 KB	`attribute.c`
`sub_4108E0`	`apply_nv_host_attr`	0.3 KB	`attribute.c`
`sub_413240`	`apply_one_attribute` (dispatch)	5.9 KB	`attribute.c`
`sub_413ED0`	`apply_attributes_to_entity`	4.9 KB	`attribute.c`
`sub_40A310`	`attribute_display_name`	0.6 KB	`attribute.c:1307`
`sub_4CA6C0`	`decl_variable`	11 KB	`decls.c:7730`
`sub_4CC150`	`cuda_variable_fixup`	1.2 KB	`decls.c:20654`
`sub_4DC200`	`mark_defined_variable`	0.3 KB	`decls.c`
`sub_4DEC90`	`variable_declaration`	11 KB	`decls.c:12956`
`sub_6BC890`	`nv_validate_cuda_attributes`	1.6 KB	`nv_transforms.c`
`sub_6BCF10`	`nv_check_device_variable_in_host`	0.2 KB	`nv_transforms.c`
`sub_6BCF80`	`nv_emit_host_reference_array`	0.8 KB	`nv_transforms.c`
`sub_72A650`	`record_symbol_reference_full` (6-arg)	6.6 KB	`symbol_ref.c`
`sub_72B510`	`record_symbol_reference_full` (4-arg)	7.3 KB	`symbol_ref.c`

Cross-Space Call Validation

CUDA's execution model partitions code into host (CPU) and device (GPU) worlds. A function in one execution space cannot directly call a function in the other -- a __host__ function cannot call a __device__ function, and vice versa. cudafe++ enforces these rules at two points during compilation: at explicit call sites in expressions (expr.c) and at symbol reference recording time (symbol_ref.c). Together these checks cover both direct function calls and indirect references -- variable accesses, implicit constructor/destructor invocations, and template-instantiated calls. The validation produces 12 distinct calling error messages (6 normal + 6 constexpr-with-suggestion variants), plus 4 variable access errors and 1 device-only function reference error.

Key Facts

Property	Value
Source files	`expr.c` (call site checks), `symbol_ref.c` (reference-time checks), `class_decl.c` (type hierarchy walk), `nv_transforms.c` (helpers)
Call-site checker	`sub_505720` (`check_cross_execution_space_call`, 4.0 KB)
Template variant	`sub_505B40` (`check_cross_space_call_in_template`, 2.7 KB)
Reference checker	`sub_72A650` (`record_symbol_reference_full`, 6-arg, 659 lines)
Reference checker (short)	`sub_72B510` (`record_symbol_reference_full`, 4-arg, 732 lines)
Type hierarchy walker	`sub_41A1F0` (annotation helper, walks nested types for HD violations)
Type hierarchy entry	`sub_41A3E0` (validates lambda/class HD annotation, calls `sub_41A1F0`)
Space name helper	`sub_6BC6B0` (`get_entity_display_name`, 49 lines)
Trivial-device-copyable	`sub_6BC680` (`is_device_or_extended_device_lambda`, 16 lines)
Device ref expression walker	`sub_6BE330` (`nv_scan_expression_for_device_refs`, 89 lines)
Diagnostic emission	`sub_4F7450` (multi-arg diagnostic), `sub_4F8090` (type+entity diagnostic)
Calling errors	3462, 3463, 3464, 3465, 3508
Variable access errors	3548, 3549, 3550, 3486
Device-only function ref	3623
Type annotation errors	3593, 3594, 3597, 3598, 3599, 3615, 3635, 3691
Cross-space enable flag	`dword_106BFD0` (primary), `dword_106BFCC` (secondary)
Device ref relaxation	`dword_106BF40` (allow `__device__` function refs in host)
Relaxed constexpr flag	`dword_126EFB0` (also referenced as CLI flag 104)

Execution Space Recall

The execution space is encoded at byte offset +182 of the entity (routine) node. The two-bit extraction byte & 0x30 classifies the routine:

`byte & 0x30`	Space	Meaning
`0x00`	(none)	Implicit `__host__`
`0x10`	`__host__`	Explicit host-only
`0x20`	`__device__`	Device-only
`0x30`	`__host__ __device__`	Both spaces

The 0x60 mask distinguishes __global__ kernels: (byte & 0x60) == 0x20 means plain __device__, while byte & 0x40 set means __global__.
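The two masks can be exercised on sample byte values. The sketch below uses hypothetical names and takes the byte at +182 as a plain argument; it only restates the table and the 0x60 test, nothing more.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    EXEC_IMPLICIT_HOST,  /* 0x00 */
    EXEC_HOST,           /* 0x10 */
    EXEC_DEVICE,         /* 0x20 */
    EXEC_HOST_DEVICE     /* 0x30 */
} exec_space_t;

/* Classify a routine from byte +182 using the 0x30 mask. */
static exec_space_t classify_exec_space(uint8_t b182)
{
    switch (b182 & 0x30) {
    case 0x10: return EXEC_HOST;
    case 0x20: return EXEC_DEVICE;
    case 0x30: return EXEC_HOST_DEVICE;
    default:   return EXEC_IMPLICIT_HOST;
    }
}

/* __global__ bit. */
static bool is_global_kernel(uint8_t b182)
{
    return (b182 & 0x40) != 0;
}

/* Device bit set and __global__ bit clear; bit 4 (host) is ignored by this mask. */
static bool device_bit_without_global(uint8_t b182)
{
    return (b182 & 0x60) == 0x20;
}
```

Note that the 0x60 test ignores bit 4, so it holds for both 0x20 and 0x30; it has to be combined with the 0x30 test to separate device-only from __host__ __device__ routines.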

Additional flags at byte +177 encode secondary space information:

Bit	Mask	Meaning
0	`0x01`	`__host__` annotation present
1	`0x02`	`__device__` annotation present
2	`0x04`	constexpr device
4	`0x10`	implicitly HD / `__forceinline__` relaxation

The +177 & 0x10 bit is the critical bypass: when set, the function is treated as implicitly __host__ __device__ and exempt from cross-space checks. This covers constexpr functions (which are implicitly HD since CUDA 7.5) and __forceinline__ functions (which the compiler may allow to be instantiated in either space).

The Implicitly-HD Bypass

Before any cross-space error is emitted, both the caller and callee are tested for the implicitly-HD condition. The exact binary test is:

// Implicitly-HD check (appears in both sub_505720 and sub_505B40)
// entity: pointer to routine entity node

bool is_implicitly_hd(int64_t entity) {
    // Check 1: bit 0x10 at +177 (constexpr/forceinline HD)
    if ((*(uint8_t*)(entity + 177) & 0x10) != 0)
        return true;

    // Check 2: deleted function with specific annotation combo
    // +184 is an 8-byte extended flags field
    // 0x800000000000 = deleted bit, 0x1000000 = explicit annotation
    // If deleted but NOT explicitly annotated, AND byte+176 bit 1 is clear:
    if ((*(uint64_t*)(entity + 184) & 0x800001000000LL) == 0x800000000000LL
        && (*(uint8_t*)(entity + 176) & 2) == 0)
        return true;

    return false;
}

This means:

constexpr functions -- the +177 & 0x10 bit is set during attribute processing, making them callable from both host and device code without explicit annotation.
__forceinline__ functions -- same bit, allowing cross-space inlining.
Implicitly-deleted functions -- defaulted special members (constructors, destructors, assignment operators) that are deleted due to non-copyable members. These get a pass because they will never actually be called.

If either the caller or the callee is implicitly HD, the cross-space check returns immediately without error.

Call-Site Validation: sub_505720

check_cross_execution_space_call is called during expression scanning in scan_expr_full whenever a function call expression is processed. It takes three parameters:

// sub_505720 -- check_cross_execution_space_call
// a1: entity pointer of the callee function (may be NULL)
// a2: bool -- if true, this is a "must be callable" context (__global__ launch)
// a3: source location pointer for diagnostics
// returns: char (nonzero if diagnostic was emitted)
char check_cross_execution_space_call(int64_t callee, bool must_callable, uint64_t *src_loc);

Algorithm

The function follows a multi-stage gate structure. At each gate, an early return can skip the check entirely:

Gate 1 -- Class scope suppression. If we are inside a class definition scope (dword_126C5C8 != -1) and the current scope has device-scope flags set (scope_entry[6] & 0x06), AND we are inside a type node context (dword_106B670 != -1, type_entry[5] & 0x08), the check is suppressed. This allows member function declarations inside device classes to reference host functions without error -- the actual check happens when the member is instantiated/defined.

Gate 2 -- Diagnostic suppression scope. If the current scope entry has diagnostic-suppression bit 1 of byte +14 set (scope_entry[14] & 0x02), checks are suppressed. This covers SFINAE contexts and decltype evaluation.

Gate 3 -- Concept/requires context. If the current context pointer (qword_106B970) is non-null and byte +17 has bit 1 set (strict-mode or concept context), checks are suppressed.

Gate 4 -- No enclosing function. If dword_126C5D8 == -1 (no enclosing function scope), the caller space defaults to host-only (v7=0, v8=1) -- meaning we are at file scope, which is implicitly host.

Gate 5 -- Extract caller space. The enclosing function entity is retrieved from the scope stack at qword_126C5E8 + 784 * dword_126C5D8 + 224. Its execution space is extracted:

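v7 = (caller[182] & 0x60) == 0x20 -- caller has the __device__ bit (0x20) set and the __global__ bit (0x40) clear (together with v8 this identifies a __host__ __device__ caller)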
v8 = (caller[182] & 0x30) != 0x20 -- caller is NOT device-only
v5 = (caller[-8] & 0x10) != 0 -- caller has secondary device mark (the -8 offset reads a flags byte 8 bytes before the entity, in the preceding allocation header)
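The Gate 4 and Gate 5 lookups can be sketched in C as follows. This is a minimal illustration, not the decompiled code: the helper names (enclosing_routine, extract_caller_space), the caller_space struct, and the byte-array model of the scope stack are assumptions made for the example. Only the 784-byte stride, the +224 and +182 offsets, the file-scope defaults, and the two mask tests come from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum {
    SCOPE_ENTRY_SIZE  = 784,  /* stride of one scope stack entry */
    SCOPE_ROUTINE_OFF = 224,  /* offset of the routine entity pointer */
    ENTITY_SPACE_OFF  = 182   /* execution-space byte inside the entity */
};

/* Hypothetical helper: returns NULL when there is no enclosing function
   (index -1), otherwise the entity pointer stored in that scope entry. */
const uint8_t *enclosing_routine(const uint8_t *scope_stack, int32_t func_scope_idx)
{
    const uint8_t *entity = NULL;
    if (func_scope_idx == -1)
        return NULL;
    assert(func_scope_idx >= 0);
    memcpy(&entity,
           scope_stack + (size_t)SCOPE_ENTRY_SIZE * (size_t)func_scope_idx
                       + SCOPE_ROUTINE_OFF,
           sizeof entity);
    return entity;
}

typedef struct { int v7; int v8; } caller_space;  /* hypothetical result type */

caller_space extract_caller_space(const uint8_t *caller)
{
    caller_space s = { 0, 1 };  /* Gate 4 defaults when at file scope */
    if (caller != NULL) {
        s.v7 = (caller[ENTITY_SPACE_OFF] & 0x60) == 0x20;
        s.v8 = (caller[ENTITY_SPACE_OFF] & 0x30) != 0x20;
    }
    return s;
}
```

The memcpy read avoids assuming that the pointer slot is naturally aligned inside the byte buffer used for the model.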

Gate 6 -- Caller implicitly HD. The caller is tested for implicitly-HD status. If true, return immediately.

Gate 7 -- Callee implicitly HD. The callee (parameter a1) is tested for implicitly-HD status. If true, return immediately.

Gate 8 -- No caller entity or secondary device. If no caller entity exists or the secondary device flag is set, skip to the __global__ check.

Error Decision Logic

After passing all gates, the function computes which error to emit based on caller/callee space combination:

// Pseudocode for the error decision tree

bool callee_is_not_device = (callee[182] & 0x30) != 0x20;   // v3
bool callee_is_host_only  = (callee[182] & 0x60) == 0x20;   // v4
bool callee_is_global     = (callee[182] & 0x40) != 0;       // v11 in some paths
bool caller_is_host_only  = (caller[182] & 0x60) == 0x20;    // v7
bool caller_not_device    = (caller[182] & 0x30) != 0x20;    // v8
bool has_forceinline      = (caller[181] & 0x20) != 0;

if (caller_is_host_only && caller_not_device) {
    // Caller is __host__ __device__ (both flags set)
    if (has_forceinline || callee_is_not_device || !callee_is_host_only)
        goto global_check;

    // HD caller calling host-only callee
    if (!is_device_or_extended_lambda(callee)) {
        char *caller_name = get_entity_display_name(caller, 0);
        char *callee_name = get_entity_display_name(callee, 1);
        int errcode = 3462 + ((callee[177] & 0x02) != 0);  // 3462 or 3463
        emit_diagnostic(errcode, src_loc, callee_name, caller_name);
    }
} else if (caller_not_device) {
    // Caller is host-only, callee is device-only
    if (has_forceinline || callee_is_not_device || !callee_is_host_only)
        goto global_check;

    // Check relaxed-constexpr bypass
    if ((callee[177] & 0x02) != 0 && dword_106BF40) {
        // Callee has __device__ annotation AND relaxation flag is set
        if (must_callable && !callee_is_global)
            goto global_check;  // suppress for __global__ must-call context
        // else suppress entirely
    }

    // Check constexpr-device bypass
    if ((callee[177] & 0x04) != 0)
        goto global_check;  // constexpr device functions get a pass

    // Host caller calling device-only callee
    char *caller_name = get_entity_display_name(caller, 0);
    char *callee_name = get_entity_display_name(callee, 1);
    int errcode = 3465 - ((callee[177] & 0x02) == 0);  // 3464 or 3465
    emit_diagnostic(errcode, src_loc, callee_name, caller_name);
}

global_check:
if (must_callable && !callee_is_global) {
    // must_callable is true but callee is not __global__
    // (this path is for __global__ launch checks)
    // no error here -- fall through
} else if (!must_callable && callee_is_global) {
    // __global__ function called from wrong context
    if (callee_is_host_only) {
        // __global__ called from host-only -- "cannot be called from host"
        emit_diagnostic(3508, src_loc, "host", "cannot");
    } else {
        // __global__ called from __device__ context
        emit_diagnostic(3508, src_loc, "__device__", "cannot");
    }
} else if (must_callable || !callee_is_global) {
    return;  // no __global__ issue
} else {
    emit_diagnostic(3508, src_loc, "__global__", "must");
}

Error 3462 vs 3463 (Device-from-Host Direction)

The distinction between errors 3462 and 3463 is the +177 & 0x02 bit on the callee -- whether it has an explicit __device__ annotation:

3462: __device__ function called from __host__ context. The callee has no explicit __device__ annotation (it was implicitly device-only).
3463: Same violation, but the callee has explicit __device__ annotation. The error message includes an additional note about the __host__ __device__ context.

The computation: 3462 + ((callee[177] & 0x02) != 0) yields 3462 when the bit is clear, 3463 when set.

Error 3464 vs 3465 (Host-from-Device Direction)

Similarly for the reverse direction:

3464: __host__ function called from __device__ context. The callee does NOT have an explicit __device__ annotation (bit clear, so 1 is subtracted).
3465: Same violation, but the callee has an explicit __device__ annotation (bit set, nothing is subtracted).

The computation: 3465 - ((callee[177] & 0x02) == 0) yields 3464 when the bit is clear, 3465 when set.
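Both selection formulas can be checked in isolation. The sketch below is an illustration only: the function names are invented for the example, and the argument is assumed to be the callee's flag byte at +177, of which only bit 0x02 matters.

```c
#include <assert.h>
#include <stdint.h>

/* 3462 + (bit set): used for the 3462/3463 pair. */
int hd_caller_error(uint8_t callee_byte_177)
{
    return 3462 + ((callee_byte_177 & 0x02) != 0);
}

/* 3465 - (bit clear): used for the 3464/3465 pair. */
int host_caller_error(uint8_t callee_byte_177)
{
    return 3465 - ((callee_byte_177 & 0x02) == 0);
}

/* In both pairs the lower code goes with a clear bit. */
void check_error_pairs(void)
{
    assert(hd_caller_error(0x00) == 3462);
    assert(hd_caller_error(0x02) == 3463);
    assert(host_caller_error(0x00) == 3464);
    assert(host_caller_error(0x02) == 3465);
}
```

Written this way, it is easy to see that the addition and subtraction forms are equivalent: each maps a clear bit to the lower code of its pair and a set bit to the higher one.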

Error 3508 (__global__ Misuse)

Error 3508 is a parameterized error with two string arguments: the context string and the verb. The combinations are:

Context	Verb	Meaning
`"host"`	`"cannot"`	`__global__` function cannot be called from `__host__` code directly (must use `<<<>>>`)
`"__device__"`	`"cannot"`	`__global__` function cannot be called from `__device__` code
`"__host__ __device__" + 9` = `"__device__"`	`"cannot"`	Same, from HD context with device focus
`"__global__"`	`"must"`	A `__global__` function must be called with `<<<>>>` syntax
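The third row relies on plain C pointer arithmetic on a string literal: skipping the 9 characters of "__host__ " (8 characters plus the space) lands on the "__device__" suffix, so one literal serves both spellings. A minimal sketch, with HD_SPACE and device_suffix as hypothetical names:

```c
#include <assert.h>
#include <string.h>

const char HD_SPACE[] = "__host__ __device__";

/* Returns a pointer into HD_SPACE that starts at "__device__". */
const char *device_suffix(void)
{
    assert(strlen("__host__ ") == 9);  /* 8 characters plus one space */
    return HD_SPACE + 9;
}
```

No copy is made; the returned pointer aliases the tail of the original literal.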

Template Variant: sub_505B40

check_cross_space_call_in_template performs the same validation but is called during template instantiation rather than initial expression scanning. It differs from sub_505720 in the following ways:

Guard on dword_126C5C4 == -1: only runs when no nested class scope is active. If dword_126C5C4 != -1, the entire function is skipped -- template instantiation inside nested class definitions defers cross-space checks.
Additional scope guards: checks scope_entry[4] != 12 (not a namespace scope) and (*(qword_106B970 + 17) & 0x40) == 0 (not in a concept context). These prevent false positives during dependent name resolution.
No return value: returns void instead of char. It only emits diagnostics; it does not report whether a diagnostic was emitted.
Error code selection: uses 3463 - ((callee[177] & 0x02) == 0) for the HD-caller case (yielding 3462 or 3463), and 3465 - ((callee[177] & 0x02) == 0) for the host-caller case (yielding 3464 or 3465). Both forms select the same codes as sub_505720; only the arithmetic is written differently.
No must_callable parameter: the template variant does not handle the must/cannot distinction for __global__. It always emits 3508 with "__global__" and "must" if the callee is __global__.

Complete Calling Error Matrix

The following matrix shows which errors fire for each caller/callee space combination:

Caller \ Callee	`__host__`	`__device__`	`__host__ __device__`	`__global__`
`__host__` (explicit)	OK	3464 or 3465	OK	3508 (`"must"`)
`__device__`	3462 or 3463	OK	OK	3508 (`"cannot"`)
`__host__ __device__`	OK	3462 or 3463	OK	3508
(no annotation) = host	OK	3464 or 3465	OK	3508 (`"must"`)
`__global__`	OK	OK	OK	3508 (`"cannot"`)

Entries marked "OK" pass the cross-space check without error. The specific error (3462 vs 3463, 3464 vs 3465) depends on whether the callee has the +177 & 0x02 bit (explicit __device__ annotation).

Bypass Conditions (No Error Despite Mismatch)

Even when the matrix says an error should fire, the following conditions suppress it:

Caller or callee is implicitly HD (+177 & 0x10): constexpr functions, __forceinline__ functions, implicitly-deleted special members.
Caller has __forceinline__ relaxation (+181 & 0x20): the caller has a __forceinline__ attribute that relaxes cross-space restrictions.
Callee is a device lambda that passes trivial-device-copyable check (sub_6BC680 returns true): extended lambda optimization.
Callee has constexpr-device flag (+177 & 0x04): constexpr functions marked for device use.
dword_106BF40 is set and callee has explicit __device__ (+177 & 0x02): the --expt-relaxed-constexpr or similar flag allows device function references from host code.
Current scope has diagnostic suppression (scope_entry[14] & 0x02): SFINAE context.
Concept/requires context (qword_106B970 + 17 & 0x40).

The 12 Calling Error Messages

cudafe++ emits 6 base error messages for cross-space call violations. Each has a variant that adds a --expt-relaxed-constexpr suggestion when the callee is a constexpr function, yielding 12 total messages:

Error	Direction	Context	Suggestion?
3462	device called from host	Callee lacks explicit `__device__`	No
3463	device called from HD	Callee has explicit `__device__` (HD context note)	No
3464	host called from device	Callee lacks explicit `__device__` (bit clear, 1 subtracted from 3465)	No
3465	host called from device	Callee has explicit `__device__`	No
3508	`__global__` context error	Parameterized: `"must"` / `"cannot"` + space string	No
3462+constexpr	device called from host	constexpr callee	Yes: `--expt-relaxed-constexpr`
3463+constexpr	device called from HD	constexpr callee	Yes
3464+constexpr	host called from device	constexpr callee	Yes
3465+constexpr	host called from device	constexpr callee	Yes
3508+constexpr	`__global__` context	constexpr callee	Yes

The constexpr suggestion variants are selected by the relaxed-constexpr flag state. When dword_106BF40 (the --expt-relaxed-constexpr relaxation flag) is NOT set and the callee has constexpr annotations, the error message includes a note suggesting the flag to resolve the issue.

Variable Access Validation: symbol_ref.c

The record_symbol_reference_full functions (sub_72A650 / sub_72B510) enforce cross-space rules at the symbol reference level. This is a different check point than the call-site checker -- it catches variable accesses and implicit function references that are not explicit function calls.

Reference Kind Bitmask (Parameter a1)

The first parameter encodes the kind of reference being made:

Bit	Mask	Meaning
0	`0x01`	Address reference (`&var`)
1	`0x02`	Write reference (assignment target)
2	`0x04`	Non-modifying reference (read)
3	`0x08`	Direct use
4	`0x10`	Initializer
5	`0x20`	Potential modification
6	`0x40`	Move reference
10	`0x400`	Template argument
13	`0x2000`	ODR-use
15	`0x8000`	Negative offset
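The table maps naturally onto a set of bit flags. The sketch below is an assumption-laden illustration: the REF_* names are invented, and only the mask values come from the table. It also decomposes the 0x12040 mask that the device-variable check later on this page applies to the reference kind: that mask covers the move bit, the ODR-use bit, and one further bit (0x10000) that the table does not list.

```c
#include <assert.h>

/* Hypothetical names for the documented reference-kind bits. */
enum ref_kind_bits {
    REF_ADDRESS       = 0x0001,
    REF_WRITE         = 0x0002,
    REF_READ          = 0x0004,
    REF_DIRECT_USE    = 0x0008,
    REF_INITIALIZER   = 0x0010,
    REF_POTENTIAL_MOD = 0x0020,
    REF_MOVE          = 0x0040,
    REF_TEMPLATE_ARG  = 0x0400,
    REF_ODR_USE       = 0x2000,
    REF_NEG_OFFSET    = 0x8000
};

#define REF_TRANSPARENT_MASK 0x12040u

/* Nonzero when any bit of the 0x12040 mask is present. */
int is_transparent_reference(unsigned ref_kind)
{
    return (ref_kind & REF_TRANSPARENT_MASK) != 0;
}

void check_transparent_mask(void)
{
    /* 0x10000 (bit 16) has no entry in the table above. */
    assert(REF_TRANSPARENT_MASK == (REF_MOVE | REF_ODR_USE | 0x10000u));
}
```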

Device Variable in Host Code (Errors 3548-3550)

When the referenced symbol is a variable (kind == 7), record_symbol_reference_full checks whether a __shared__, __constant__, or __managed__ variable is accessed from host code:

// nv_check_device_var_ref_in_host (inlined in sub_72A650, lines ~330-374)
// var_info: *(entity + 88) -- pointer to variable descriptor

uint64_t caller = 0;
if (dword_126C5D8 != -1)  // enclosing function scope exists
    caller = *(qword_126C5E8 + 784 * dword_126C5D8 + 224);

uint8_t mem_flags = *(var_info + 148);     // memory space flags
const char *space_name;
if (mem_flags & 0x02)
    space_name = "__shared__";
else if (mem_flags & 0x04)
    space_name = "__constant__";
else
    space_name = "";  // no specific space (managed or other)

// Exemption: managed variables with bit 0x100 set are OK
if ((*(uint16_t*)(var_info + 148) & 0x0101) == 0x0101)
    return;  // managed + exemption flag

// Only check if: has device memory annotation, there is a caller,
// caller is NOT device-only, caller is not implicitly-HD
if ((ref_kind & 0x12040) == 0       // not a transparent reference
    && (mem_flags & 0x07) != 0       // has device memory annotation
    && caller != 0
    && (*(caller + 182) & 0x30) != 0x20   // caller NOT device-only
    && (*(caller + 177) & 0x10) == 0      // caller NOT implicitly HD
    && !is_implicitly_hd(caller))          // extended implicit-HD check
{
    if (ref_kind & 0x08)  // direct use
        emit_diag(3548, src_loc, space_name, entity);  // "reference to __shared__"

    if (ref_kind & 0x10)  // initializer
        emit_diag(3549, src_loc, space_name, entity);  // "initializer for __constant__"

    if ((mem_flags & 0x02) && (ref_kind & 0x20))  // __shared__ + potential modification
        emit_diag(3550, src_loc, space_name, entity);  // "write to __shared__"
}

Error	Condition	Message
3548	Direct use of `__shared__`/`__constant__` variable from host	Reference to device memory variable from host code
3549	Initializer referencing `__shared__`/`__constant__` from host	Cannot initialize from host
3550	Write to `__shared__` variable from host	Cannot write to shared memory from host

Device-Only Function Reference (Error 3623)

For function-type symbols (kind 10 or 11, or concept kind 20), the check validates that __device__-only functions are not referenced from host code:

// nv_check_device_function_ref_in_host (inlined in sub_72A650, lines ~382-454)
// entity: the function being referenced
// entity + 88 -> routine info (for kind 10/11)
// entity + 88 -> +192 for concepts (kind 20)

int64_t routine_info = ...;  // resolve through type chain
if (routine_info == 0)
    return;

// Only check if: has device annotation, is device-only,
// has no implicit-HD flags
if ((*(routine_info + 191) & 0x01) == 0     // not a coroutine exemption
    || (*(routine_info + 182) & 0x30) != 0x20  // not device-only
    || (*(routine_info + 177) & 0x15) != 0)    // has HD/host/constexpr flags
    return;

// Check if already exempted by extended flags
if (is_implicitly_hd(routine_info))
    return;

// Determine caller context
// Default context string; also used when no caller routine is found
const char *context = "outside the bodies of device functions";
int64_t caller_routine = 0;
if (dword_126C5D8 != -1) {
    caller_routine = *(qword_126C5E8 + 784 * dword_126C5D8 + 224);
} else if (dword_126C5B8) {
    // Walk scope stack to find enclosing try block
    int64_t entry = 0;  // declared outside the loop: it is read after the loop ends
    int scope_idx = dword_126C5E4;
    while (scope_idx != -1) {
        entry = qword_126C5E8 + 784 * scope_idx;
        if (*(int32_t*)(entry + 408) != -1)  // has try block
            break;
        scope_idx = *(int32_t*)(entry + 560);  // parent scope
    }
    if (scope_idx == -1) return;
    caller_routine = *(entry + 224);
}

if (caller_routine == 0) goto emit_outside;
if (is_implicitly_hd(caller_routine)) return;

if ((*(caller_routine + 182) & 0x30) == 0x20) {
    // Caller is __device__-only
    if ((*(caller_routine + 177) & 0x05) == 0)
        return;  // no constexpr/consteval markers
    context = "from a constexpr or consteval __device__ function";
}
// otherwise the default "outside the bodies of device functions" applies

emit_outside:
const char *name = *(routine_info + 8);  // function name
if (!name) name = "";
emit_diagnostic(3623, src_loc, name, context);

Error 3623 has two context strings:

"outside the bodies of device functions" -- the reference is from file scope or host code
"from a constexpr or consteval __device__ function" -- the reference is from a constexpr/consteval device function that cannot actually call the target

The dword_106BFD0 / dword_106BFCC Gate

Both record_symbol_reference_full variants gate the cross-space device-reference scan (sub_6BE330) with:

if (dword_106BFD0 || dword_106BFCC) {
    // Cross-space reference checking is enabled
    if (!qword_126C5D0                                    // no current routine descriptor
        || *(qword_126C5D0 + 32) == 0                    // no routine entity
        || (*(*(qword_126C5D0 + 32) + 182) & 0x30) != 0x20  // not device-only
        || (dword_106BF40 && (*(*(qword_126C5D0 + 32) + 177) & 0x02) != 0))
    {
        // Call sub_6BE330 to walk expression tree for device references
        nv_scan_expression_for_device_refs(entity);
    }
}

The scan is skipped only when the current routine IS __device__-only -- device code referencing other device symbols is always valid. The dword_106BF40 term is an exception to that skip: as the condition is written, if the flag is set AND the routine has an explicit __device__ annotation (+177 & 0x02), the scan runs even though the routine is device-only.
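The gate condition can be isolated as a pure predicate, which makes each disjunct easy to exercise. This is a sketch that follows the condition exactly as written in the snippet above. The routine_view struct and the function name are assumptions for the example: a NULL pointer stands in for a null qword_126C5D0, and has_entity for a non-null pointer at +32.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened view of the current routine descriptor. */
typedef struct {
    bool    has_entity;  /* *(qword_126C5D0 + 32) != 0 */
    uint8_t byte_177;    /* entity flag byte at +177 */
    uint8_t byte_182;    /* execution-space byte at +182 */
} routine_view;

/* checking_enabled models (dword_106BFD0 || dword_106BFCC);
   relax_flag models dword_106BF40. */
bool should_scan_for_device_refs(bool checking_enabled, bool relax_flag,
                                 const routine_view *current)
{
    if (!checking_enabled)
        return false;
    if (current == NULL || !current->has_entity)
        return true;                       /* no routine to exempt */
    if ((current->byte_182 & 0x30) != 0x20)
        return true;                       /* not device-only */
    return relax_flag && (current->byte_177 & 0x02) != 0;
}
```

Under this reading, a device-only routine is exempt from the scan unless the relaxation flag and the explicit-annotation bit are both set.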

Type Hierarchy Walk: sub_41A1F0 / sub_41A3E0

The type hierarchy walkers handle a different class of violation: when a __host__ __device__ or __device__ annotation is applied to a class or lambda whose member types contain HD-incompatible nested types. These functions live in class_decl.c and are called during class completion.

sub_41A3E0 (Entry Point)

This function validates a complete type annotation context. It receives a lambda/class info structure and checks multiple conditions:

// sub_41A3E0 -- validate_type_hd_annotation
// a1: type annotation context structure
//   +8:  entity pointer
//   +32: flags byte (bit 0 = has_host, bit 3 = has_conflict, bit 4 = has_device,
//                     bit 5 = has_virtual)
//   +36: source location
// a2: 0 = __host__ __device__, nonzero = __device__ only
// a3: enable additional nested check (for OptiX path)

char *space_name = (a2 == 0) ? "__host__ __device__" : "__device__";

// Error 3615: duplicate HD annotation conflict
if (a2 == 0 && (flags & 0x01))
    emit_diag(3615, src_loc);

// Error 3593: conflict between __host__ and __device__ on type
if (flags & 0x08) {
    if (entity && entity[163] < 0) {  // entity has device-negative flag
        if ((flags & 0x18) != 0x18)
            goto check_members;
        emit_diag(3635, src_loc);  // both __host__ and __device__ + conflict
    } else {
        emit_diag(3593, src_loc, space_name);
    }
}

// Error 3594: virtual function in __device__ context
if (flags & 0x20 || ...)
    emit_diag(3594, src_loc, space_name);

check_members:
// Recurse into member types
walk_type_for_hd_violations(type_entry, src_loc, a2);  // sub_41A1F0

// Error 3691: nested OptiX check
if (a3 && (flags & 0x10))
    emit_diag(3691, src_loc, space_name);

sub_41A1F0 (Recursive Type Walker)

This function walks the type hierarchy to find nested violations. It uses sub_7A8370 (is-array-type check) and sub_7A9310 (get-array-element-type) to traverse through arrays, and walks through cv-qualified type wrappers (kind == 12) by following the +144 pointer chain.

// sub_41A1F0 -- walk_type_for_hd_violations (recursive)
// a1: type node pointer
// a2: source location pointer
// a3: 0 = HD mode, nonzero = device-only mode

char *space_name = (a3) ? "__device__" : "__host__ __device__";

if (a1 == 0 || !is_valid_type(a1)) {  // test for null before inspecting the node
    // Base case: no type to check, or check passed at top level
    goto label_20;
}

int depth = 0;
int64_t current = a1;
do {
    if (!is_array_type(current)) {  // sub_7A8370
        // Not an array -- check this type for violations
        if (depth > 7)
            emit_diag(3597, src_loc, space_name, a1);  // nesting depth exceeded

        // Walk through cv-qualified wrappers
        while (*(current + 132) == 12)  // cv-qual kind
            current = *(current + 144);  // underlying type

        // Guard: skip if in nested class scope
        if (dword_126C5C4 != -1)
            return;
        if ((scope_entry[6] & 0x06) != 0)
            return;
        if (scope_entry[4] == 12)  // namespace scope
            goto walk_callback;

        // Error 3598: type not valid in device context
        if (!check_type_valid_for_space(30, current, 0))  // sub_550E50
            emit_diag(3598, src_loc, space_name, current);

        // Error 3599: type has problematic member
        int64_t display = get_type_display_name(current);  // sub_5BD540
        if (!check_member_compat(60, display, current))  // sub_510860
            emit_diag(3599, src_loc, space_name, current);

        goto label_20;
    }
    ++depth;
    current = get_array_element_type(current);  // sub_7A9310
} while (current != 0);

label_20:
// Final phase: walk_tree with callback sub_41B420
if (dword_126C5C4 != -1) return;
if ((scope_entry[6] & 0x06) != 0) return;
if (scope_entry[4] == 12) return;

// Save/restore diagnostic state
saved_state = qword_126EDE8;
qword_126EDE8 = *src_loc;
dword_E7FE78 = 0;
walk_tree(a1, sub_41B420, 792);  // sub_7B0B60 with callback
qword_126EDE8 = saved_state;

The callback sub_41B420 is used in the tree walk to check each nested type member. This is the same callback used for OptiX extended lambda body validation, applied to validate that all types referenced within the annotated scope are compatible with the target execution space.
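The array-unwrapping part of the walker can be shown on a toy type graph. This is a sketch under stated assumptions: type_node, strip_array_layers, and nesting_too_deep are invented names, the is_array field stands in for the sub_7A8370 test, and the element pointer for sub_7A9310. Only the counting loop and the "depth > 7" threshold for error 3597 come from the pseudocode above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an EDG type node. */
typedef struct type_node {
    int is_array;                     /* models the is-array-type check */
    const struct type_node *element;  /* models get-array-element-type */
} type_node;

/* Peel array layers, counting how many were removed. Returns the first
   non-array node, or NULL if the chain ends. */
const type_node *strip_array_layers(const type_node *t, int *depth_out)
{
    int depth = 0;
    assert(depth_out != NULL);
    while (t != NULL && t->is_array) {
        ++depth;
        t = t->element;
    }
    *depth_out = depth;
    return t;
}

/* Error 3597 fires when more than 7 layers were peeled. */
int nesting_too_deep(int depth)
{
    return depth > 7;
}
```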

Type Annotation Errors

Error	Condition	Message
3593	Conflict between `__host__` and `__device__` on extended lambda/type	Cannot apply both annotations
3594	Virtual function in `__device__` or HD context	Virtual dispatch not supported on device
3597	Type nesting depth exceeds 7 levels in HD validation	Type hierarchy too deep for device
3598	Nested type not valid in device context	Type `X` cannot be used in `__device__` code
3599	Nested type member incompatible with device execution	Member of type `X` is not device-compatible
3615	Duplicate `__host__ __device__` annotation	Already annotated as HD
3635	Both `__host__` and `__device__` annotations with negative device flag	Conflicting explicit annotations
3691	Nested OptiX annotation conflict	OptiX extended lambda nested check failure

Global State Variables

Global	Type	Purpose
`qword_126C5E8`	`int64_t`	Scope stack base pointer (array of 784-byte entries)
`dword_126C5E4`	`int32_t`	Current scope stack top index
`dword_126C5D8`	`int32_t`	Current function scope index (-1 if none)
`dword_126C5C8`	`int32_t`	Class scope index (-1 if none)
`dword_126C5C4`	`int32_t`	Nested class scope (-1 if none)
`dword_126C5B8`	`int32_t`	Is-member-of-template flag
`qword_126C5D0`	`int64_t`	Current routine descriptor pointer
`qword_106B970`	`int64_t`	Current compilation context
`dword_106BFD0`	`int32_t`	Enable cross-space reference checking (primary)
`dword_106BFCC`	`int32_t`	Enable cross-space reference checking (secondary)
`dword_106BF40`	`int32_t`	Allow `__device__` function references in host
`dword_106B670`	`int32_t`	Current type node context index (-1 if none)
`qword_106B678`	`int64_t`	Type node table base pointer
`dword_E7FE78`	`int32_t`	Diagnostic state flag (cleared during type walks)
`qword_126EDE8`	`int64_t`	Saved diagnostic source position

Function Map

Address	Size	Identity	Source
`sub_41A1F0`	~0.5 KB	`walk_type_for_hd_violations`	`class_decl.c`
`sub_41A3E0`	~0.5 KB	`validate_type_hd_annotation`	`class_decl.c`
`sub_41B420`	(callback)	Type walk callback for device compat	`class_decl.c`
`sub_4F7450`	~0.3 KB	`emit_diag_multi_arg` (cross-space diagnostics)	`expr.c`
`sub_505720`	4.0 KB	`check_cross_execution_space_call`	`expr.c`
`sub_505AA0`	0.8 KB	`get_execution_space_string`	`expr.c`
`sub_505B40`	2.7 KB	`check_cross_space_call_in_template`	`expr.c`
`sub_6BC680`	0.1 KB	`is_device_or_extended_device_lambda`	`nv_transforms.c`
`sub_6BC6B0`	0.5 KB	`get_entity_display_name`	`nv_transforms.c`
`sub_6BE330`	0.9 KB	`nv_scan_expression_for_device_refs`	`nv_transforms.c`
`sub_72A650`	6.6 KB	`record_symbol_reference_full` (6-arg)	`symbol_ref.c`
`sub_72B510`	7.3 KB	`record_symbol_reference_full` (4-arg)	`symbol_ref.c`

Cross-References

Execution Spaces -- the +182 byte encoding and attribute handlers
Device/Host Separation -- how validated code is split into device and host IL
Kernel Stubs -- __global__ function wrapper generation
Entity Node -- byte offsets +176, +177, +182, +184
Diagnostics Overview -- error emission pipeline
Lambda Overview -- extended lambda HD annotation validation

Device/Host Separation

A single .cu file contains both host and device code intermixed. Conventional wisdom assumes cudafe++ splits them with two compilation passes -- one for host, one for device. That assumption is wrong. cudafe++ uses a single-pass, tag-and-filter architecture: the EDG frontend builds one unified IL tree from the entire translation unit, every entity gets execution-space bits written into its node, and then two separate output paths filter the tagged IL -- one path emits the .int.c host file, the other emits the device IL for cicc. There is no re-parse, no second invocation of the frontend.

This page documents the global variables that control the split, the IL-marking walk that selects device-reachable entries, the host-output filtering logic that suppresses device-only entities, and the output files produced.

Key Facts

Property	Value
Architecture	Single-pass: parse once, tag with execution-space bits, filter at output time
Language mode flag	`dword_126EFB4` -- language mode (`1` = C, `2` = C++)
Host compiler identity	`dword_126EFA4` -- clang mode; `dword_126EFA8` -- gcc mode
Device stub mode	`dword_1065850` -- toggled per-entity in `sub_47BFD0` (`gen_routine_decl`)
Device-only filter	`sub_46B3F0` -- returns 0 for device-only entities when generating host output
Keep-in-IL entry point	`sub_610420` (`mark_to_keep_in_il`), 892 lines
Keep-in-IL worker	`sub_6115E0` (`walk_tree_and_set_keep_in_il`), 4649 lines
Prune callback	`sub_617310` (`prune_keep_in_il_walk`), 127 lines
Host output entry point	`sub_489000` (`process_file_scope_entities`)
Host sequence dispatcher	`sub_47ECC0` (`gen_template` / top-level source sequence processor), 1917 lines
Routine declaration	`sub_47BFD0` (`gen_routine_decl`), 1831 lines
Host output file	`<input>.int.c` (transformed C++ for host compiler)
Device output file	Named via `--gen_device_file_name` CLI flag (binary IL for cicc)
Module ID file	Named via `--module_id_file_name` CLI flag
Stub file	Named via `--stub_file_name` CLI flag

Why Single-Pass Matters

Old NVIDIA documentation and third-party descriptions sometimes describe a "two-pass" compilation model where cudafe++ runs once to extract device code and once to extract host code. This is not what the binary does. The evidence:

One frontend invocation. sub_489000 (process_file_scope_entities) is called once. It walks the source sequence list (qword_1065748) a single time, dispatching each entity through sub_47ECC0.
No re-parse. The EDG frontend builds the IL tree in memory once. The keep-in-IL walk (sub_610420) runs during fe_wrapup pass 3, marking device-reachable entries with bit 7 of the prefix byte. The host backend then emits .int.c from the same IL tree, filtering based on execution-space bits.
dword_126EFB4 is a language mode, not a pass counter. Its value 2 means "C++ mode," not "second pass." It never changes between device and host output phases.
The device IL is a byte-level binary dump of marked entries, not the output of a separate code-generation pass. The host output is a text-mode C++ file produced by the gen_* family of functions.

The practical implication: every CUDA entity exists once in memory with its execution-space tag at entity+182. The tag drives all downstream decisions -- what goes into device IL, what appears in host .int.c, what gets wrapped in #if 0, and what gets a kernel stub.

Control Globals

dword_126EFB4 -- Language Mode

Value	Meaning
`0`	Unset / not initialized
`1`	C mode
`2`	C++ mode

Set during CLI processing (sub_45C200, case 228/240/246/251/252 for C++ standard versions). In CUDA compilation this is always 2 because .cu files are compiled as C++. The keep-in-IL logic at sub_610420 checks dword_126EFB4 == 2 to decide whether to run the secondary routine-definition marking pass (sub_6175F0).

dword_126EFA4 -- Clang Mode / Device Code Mode

This global has different semantics depending on context. In CLI processing (case 187), it records whether clang host compiler mode is active. In the template instantiation system, it acts as a device-code mode flag (1 = device code path, 0 = host stubs). The dual use reflects the fact that cudafe++ reuses the same global in different phases.

dword_126EFA8 -- GCC Mode / GPU Compilation Mode

Set when gcc host compiler mode is active. In template-related code paths, a nonzero value indicates GPU compilation mode is enabled.

dword_1065850 -- Device Stub Mode Toggle

This global flag controls how __global__ kernel bodies are emitted. It is toggled inside gen_routine_decl (sub_47BFD0). The toggle mechanism is a self-inverting flip that causes gen_routine_decl to process each __global__ kernel TWICE. Because the toggle fires at the TOP of the function (before body emission), the first call (0->1) emits the static stub definition, and the recursive call (1->0) emits the forwarding body.

Toggle Pseudocode (from sub_47BFD0, decompiled line 551-553)

// v3 = entity pointer, v8 = is_friend flag
uint64_t flags = *(uint64_t*)(entity + 176);    // 8-byte flags field

// Bitmask 0x40000002000000 combines the __global__ attribute bit (0x40000000000000)
// and a definition/linkage flag (0x2000000) from the entity's flags field at +176.
if ((flags & 0x40000002000000) == 0x40000002000000 && !is_friend)
    dword_1065850 = (dword_1065850 == 0);   // flip: 0->1 or 1->0

This toggle fires at the TOP of gen_routine_decl, before either stub variant is emitted. Because the function calls itself recursively at the end (decompiled line 1821: return sub_47BFD0(v152, a2)), the toggle fires again on re-entry, resetting the flag.
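The 64-bit mask is easier to read once it is split back into the two byte-level tests it encodes. On a little-endian target such as x86-64, an 8-byte load at +176 places byte +179 in bits 24-31 and byte +182 in bits 48-55, so 0x40000002000000 is the same test as (byte_182 & 0x40) together with (byte_179 & 0x02). The sketch below demonstrates the equivalence; the function names are hypothetical and the entity is modelled as a plain byte array.

```c
#include <assert.h>
#include <stdint.h>

#define STUB_TOGGLE_MASK UINT64_C(0x40000002000000)

/* Assemble the 8 bytes at entity+176 as a little-endian value, which is
   what an x86-64 qword load produces. */
uint64_t load_flags_at_176(const uint8_t *entity)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i)
        v = (v << 8) | entity[176 + i];
    return v;
}

/* The test as the decompiler shows it. */
int toggle_fires_qword(const uint8_t *entity, int is_friend)
{
    return (load_flags_at_176(entity) & STUB_TOGGLE_MASK) == STUB_TOGGLE_MASK
        && !is_friend;
}

/* The same test expressed on individual bytes. */
int toggle_fires_bytes(const uint8_t *entity, int is_friend)
{
    return (entity[182] & 0x40) != 0 && (entity[179] & 0x02) != 0 && !is_friend;
}

void check_mask_layout(void)
{
    /* +182 is byte 6 of the qword (shift 48); +179 is byte 3 (shift 24). */
    assert(STUB_TOGGLE_MASK == ((UINT64_C(0x40) << 48) | (UINT64_C(0x02) << 24)));
}
```

The byte-level form lines up with the body-emission code, which tests byte_182 & 0x40 and byte_179 & 0x02 separately.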

Body Emission Decision (decompiled line 1421-1432)

The actual stub body selection happens later in the function, based on the CURRENT value of dword_1065850 (which has already been toggled):

if ((entity->byte_182 & 0x40) != 0) {       // has __global__ annotation
    char has_body = entity->byte_179 & 0x02;  // has a definition

    if (dword_1065850) {
        // First call (toggle 0->1): emit static stub with cudaLaunchKernel placeholder
        if (!is_specialization && has_body) {
            emit("{ ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}");
        }
    } else if (has_body) {
        // Recursive call (toggle 1->0): emit forwarding stub
        emit("{");
        emit_scope_qualifier(entity);
        emit("__wrapper__device_stub_");
        emit(entity->name);
        emit_template_args_if_needed(entity);
        emit_parameter_forwarding(entity);
        emit(");return;}");
    }
    // Both invocations: wrap original body in #if 0 / #endif
}

Self-Recursion (decompiled line 1817-1821)

After the first call emits the static stub, the function checks whether dword_1065850 is nonzero (the toggle set it to 1). If so, it restores the source sequence pointer and calls itself:

if (dword_1065850) {
    qword_1065748 = saved_source_sequence;
    return sub_47BFD0(context, a2);   // recursive self-call
}

The recursive invocation toggles dword_1065850 back to 0, emits the forwarding body, and returns without further recursion (since dword_1065850 == 0 at the self-recursion check).
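The toggle-then-recurse control flow can be reduced to a few lines. The following is a toy model, not the real gen_routine_decl: stub_mode stands in for dword_1065850, the emission log and all names are invented for the example, and the entity is assumed to be a non-friend __global__ definition so that the toggle always fires.

```c
#include <assert.h>
#include <stddef.h>

enum emission { EMIT_STATIC_STUB, EMIT_FORWARDING_STUB };

int stub_mode = 0;             /* stands in for dword_1065850 */
enum emission emitted[4];      /* hypothetical log of what was emitted */
size_t emitted_count = 0;

static void record(enum emission what)
{
    assert(emitted_count < sizeof emitted / sizeof emitted[0]);
    emitted[emitted_count++] = what;
}

/* Toy model of the kernel path: toggle at the top, pick the body from the
   current flag value, and recurse once while the flag is set. */
void gen_kernel_decl(void)
{
    stub_mode = (stub_mode == 0);      /* flip: 0->1 or 1->0 */
    record(stub_mode ? EMIT_STATIC_STUB : EMIT_FORWARDING_STUB);
    if (stub_mode)
        gen_kernel_decl();             /* second pass resets the flag */
}
```

A single outer call produces exactly two emissions and leaves the flag at 0, which is why the recursion terminates after one level.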

The flag is also set in sub_47ECC0 when processing template instantiation directives (source sequence kind 54): if the entity has byte_182 & 0x40 (the __global__ bit) and CUDA language mode is active, dword_1065850 is set to 1 before emitting the instantiation directive.

dword_126EBA8 -- Language Standard Mode

Value 1 indicates C language standard mode. The device-only filtering function sub_46B3F0 references this to determine whether EBA (EDG binary archive) mode applies.

Host-Output Filtering: sub_46B3F0

This compact function (39 lines decompiled) is the gatekeeper that determines whether an entity should be emitted in the host .int.c output. It is called from sub_47ECC0 at the point where the host backend decides whether to emit a type/variable declaration or wrap it in #if 0.

Decompiled Logic

// sub_46B3F0 -- returns 0 to suppress (device-only), nonzero to emit
uint64_t sub_46B3F0(entry *a1, entry *a2) {
    char kind = a1->byte_132;

    // Classes, structs, unions (kind 9-11): always check device-only
    if ((unsigned char)(kind - 9) <= 2)
        goto check_device_flag;

    // Enums (kind 2): check if scoped enum is device-only
    if (kind == 2) {
        if ((a1->byte_145 & 0x08) == 0)  // not an enum definition
            return 1;                      // emit it
        goto check_device_flag;
    }

    // Typedefs (kind 12): check underlying type kind
    if (kind == 12) {
        char underlying = a1->byte_160;
        if (underlying > 10)
            return 0;
        // Magic bitmask: 0x71D = 0b11100011101
        // Bits set for kinds 0,2,3,4,8,9,10 -> emit
        return (0x71DULL >> underlying) & 1;
    }

    return 1;  // everything else: emit

check_device_flag:
    int is_device;
    if (a2)
        is_device = a2->byte_49 & 1;
    else
        is_device = a1->byte_135 >> 7;

    if (!is_device)
        return 0;   // not device-related, suppress? (inverted logic)

    // Device entity: check if it should still be emitted
    return dword_126EBA8           // C mode -> emit anyway
        || (unsigned char)(kind - 9) > 2  // not a class/struct/union -> emit
        || *(a1->ptr_152 + 89) != 1;  // scope check
}

The function uses a bitmask trick (0x71D >> underlying_kind) to quickly determine which typedef underlying types pass the filter. The bit pattern 0b11100011101 selects kinds 0 (void/basic), 2 (enum), 3 (parameter), 4 (pointer), 8 (field), 9 (class), and 10 (struct).
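The shift-and-mask lookup is a compact replacement for a switch over the underlying kind. A minimal sketch, with a hypothetical function name; the lower-bound guard is an addition for the example, since shifting by a negative count is undefined in C:

```c
#include <assert.h>

/* Returns 1 when the typedef's underlying kind passes the filter. */
int typedef_underlying_kind_passes(int underlying)
{
    if (underlying < 0 || underlying > 10)
        return 0;
    return (int)((0x71DULL >> underlying) & 1);
}

void check_mask_value(void)
{
    /* Bits 0, 2, 3, 4, 8, 9 and 10 make up 0x71D. */
    assert(0x71D == ((1 << 0) | (1 << 2) | (1 << 3) | (1 << 4) |
                     (1 << 8) | (1 << 9) | (1 << 10)));
}
```

Kinds 1, 5, 6, and 7 fall in the gaps of the mask and are rejected, as is anything above 10.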

Where It Is Called

In sub_47ECC0 (the master source-sequence dispatcher), when processing type declarations (kind 6):

case 6:  // type_decl
    sub_4864F0(recursion_level, &continuation, kind_byte);
    if (!recursion_level && !sub_46B3F0(type_entry, scope_entry)) {
        // Entity is device-only in host context
        // Wrap in #if 0 / #endif
    }

This is the mechanism that makes device-only classes, structs, and enums invisible to the host compiler. They still exist in the IL tree (and participate in the keep-in-IL walk for device output), but their text representation is suppressed in .int.c.

Device-Only Suppression in Host Output

When sub_46B3F0 returns 0 for an entity, or when the execution-space check in gen_routine_decl identifies a device-only function, the host backend wraps the declaration in preprocessor guards:

#if 0
__device__ void device_only_function() {
    // ... original body ...
}
#endif

This pattern appears in three locations:

Type declarations -- sub_47ECC0 wraps device-only types via sub_46B3F0 check.
Routine declarations -- sub_47BFD0 checks entity->byte_81 & 0x04 (has device scope) combined with execution-space bits at entity+182. When a function is device-only and the current output track is host, the function body is suppressed.
Lambda bodies -- sub_47B890 (gen_lambda) wraps device lambda bodies in #if 0 / #endif and emits __nv_dl_wrapper_t wrapper types instead.

The nv_is_device_only_routine Check

The inline predicate from nv_transforms.h:367 is the canonical way to test if a routine lives exclusively in device space:

bool nv_is_device_only_routine(entity *e) {
    char byte = e->byte_182;
    return ((byte & 0x30) == 0x20)    // device annotation, no host
        && ((byte & 0x60) == 0x20);   // device, not __global__
}

The double-mask check combines two conditions:

(byte & 0x30) == 0x20: has __device__ but not __host__ (bits 4-5)
(byte & 0x60) == 0x20: has __device__ but not __global__ (bits 5-6)

A __global__ function fails the second test because bit 6 is set ((byte & 0x60) == 0x60). This matters because __global__ functions ARE emitted in host output -- as stubs that call __wrapper__device_stub_<name>.
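The predicate is easy to exercise against the byte values the text describes. This is a sketch under the assumption that bit 4 (0x10) is __host__, bit 5 (0x20) is __device__, and bit 6 (0x40) is __global__, as implied above; the enum names and is_device_only are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit assignments for the byte at entity+182, read from the text. */
enum {
    EXEC_HOST   = 0x10,  /* bit 4 */
    EXEC_DEVICE = 0x20,  /* bit 5 */
    EXEC_GLOBAL = 0x40   /* bit 6 */
};

/* Same two masked comparisons as nv_is_device_only_routine. */
static bool is_device_only(uint8_t exec_byte)
{
    return (exec_byte & 0x30) == 0x20     /* __device__ without __host__   */
        && (exec_byte & 0x60) == 0x20;    /* __device__ without __global__ */
}
```

Taken together, the two comparisons are equivalent to (exec_byte & 0x70) == 0x20: only a byte with the device bit set and both the host and global bits clear is classified as device-only.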

The Keep-in-IL Walk (Device Code Selection)

The keep-in-IL mechanism runs during fe_wrapup pass 3 and selects which IL entries belong to the device output. The full details are documented in the Keep-in-IL page; this section covers the aspects relevant to device/host separation.

Call Chain

sub_610420 (mark_to_keep_in_il)
  |
  +-- installs pre_walk_check = sub_617310 (prune_keep_in_il_walk)
  +-- walks file-scope IL via sub_6115E0 (walk_tree_and_set_keep_in_il)
  |     |
  |     +-- for each child entry:
  |           *(child - 8) |= 0x80    // set bit 7 = keep_in_il
  |           recurse into child
  |
  +-- if dword_126EFB4 == 2 (C++ mode):
  |     sub_6175F0 (walk_scope_and_mark_routine_definitions)
  |
  +-- iterates 45+ global entry-kind linked lists
  +-- processes using-declarations (fixed-point loop)

The Keep Bit

Every IL entry has an 8-byte prefix. Bit 7 (0x80) of the byte at entry_ptr - 8 is the keep-in-IL flag:

Byte at (entry_ptr - 8):
  bit 0  (0x01)  is_file_scope
  bit 1  (0x02)  is_in_secondary_il
  bit 2  (0x04)  current_il_region
  bits 3-6       reserved
  bit 7  (0x80)  keep_in_il          <<<< THE DEVICE CODE MARKER

The sign bit doubles as the flag, enabling a fast test: *(signed char*)(entry - 8) < 0 means "keep." The recursive worker sub_6115E0 sets this bit on every reachable sub-entry by ORing 0x80 into the prefix byte and recursing.
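A small sketch of the prefix-byte access pattern follows. The helper names are hypothetical, and the layout is simplified to a flat byte buffer in which the entry pointer sits 8 bytes past the start of the allocation.

```c
#include <stdint.h>

#define KEEP_IN_IL 0x80u

/* Hypothetical: the entry pointer is 8 bytes into its backing storage,
 * so the flag byte lives at entry - 8. */
static void *entry_from_storage(uint8_t *storage)
{
    return storage + 8;
}

static void set_keep(void *entry)
{
    *((uint8_t *)entry - 8) |= KEEP_IN_IL;
}

static int is_kept(const void *entry)
{
    /* Sign-bit test: with bit 7 set, the byte is negative as signed char. */
    return *((const signed char *)entry - 8) < 0;
}
```

Setting the keep bit leaves the low flag bits untouched, so a prefix byte of 0x07 becomes 0x87 and is_kept changes from 0 to 1.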

Transitive Closure

The walk implements a transitive closure: if a __device__ function references a type, that type gets marked, which transitively marks its member types, base classes, template parameters, and any routines they reference. The prune callback (sub_617310) prevents infinite loops by returning 1 (skip) when an entry already has bit 7 set.
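The marking scheme can be illustrated on a toy graph. This is a sketch only: struct node, prune, and mark_reachable are hypothetical stand-ins, and the real walker dispatches on entry kind rather than following a flat reference array.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_REFS 4

/* Hypothetical miniature IL node: a flag byte plus outgoing references. */
struct node {
    uint8_t flags;                 /* bit 7 = keep */
    struct node *refs[MAX_REFS];   /* unused slots are NULL */
};

/* Returns 1 (skip) when the node is already marked, like the prune callback. */
static int prune(const struct node *n)
{
    return (n->flags & 0x80) != 0;
}

static void mark_reachable(struct node *n)
{
    if (n == NULL || prune(n))
        return;
    n->flags |= 0x80;              /* mark before recursing so cycles terminate */
    for (size_t i = 0; i < MAX_REFS && n->refs[i] != NULL; ++i)
        mark_reachable(n->refs[i]);
}
```

With a cycle a -> b -> c -> a and an unreferenced node d, calling mark_reachable on a marks a, b, and c and leaves d unmarked, which is the property the elimination pass relies on.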

Additional "keep definition" flags exist for deeper marking:

Entity	Field	Bit	Effect
Type (class/struct)	`entry + 162`	bit 7 (0x80)	Retain full class body, not just forward decl
Routine	`entry + 187`	bit 2 (0x04)	Retain function body

Seed Entries

The walk starts from entities already tagged with execution-space bits. These seeds include:

Functions with __device__ or __global__ at entity+182
Variables with __shared__, __constant__, or __managed__ memory space attributes
Extended device/host-device lambdas

Everything reachable from a seed gets the keep bit. Everything without the keep bit is eliminated from the device IL by the elimination pass (sub_5CCBF0).

__host__ __device__ Functions

Functions annotated with both __host__ and __device__ have bits 4 and 5 set in entity+182, producing (byte & 0x30) == 0x30. These functions participate in BOTH output paths:

Host output (.int.c): nv_is_device_only_routine returns false for the function (bit 4 is set alongside bit 5, so (byte & 0x30) is 0x30 rather than 0x20), so it is not treated as device-only. The function body is emitted normally -- no #if 0 wrapping, no stub substitution.
Device IL: The keep-in-IL walk marks the function and all its dependencies because it has device-capable bits set. The full function body is retained in the device IL.

This dual inclusion is why __host__ __device__ functions must be valid C++ in both execution contexts. EDG parses them once; the same IL tree is then rendered as .int.c text for the host compiler and serialized as binary IL for cicc.

Template Instantiation Interaction

When sub_47ECC0 processes a template instantiation directive (source sequence kind 54) for a __host__ __device__ template, it does NOT set dword_1065850. The stub mode toggle only activates for entities with byte_182 & 0x40 (the __global__ kernel bit). Host-device functions get their bodies emitted directly in both tracks.

Output Files

cudafe++ produces up to four output files from a single compilation:

1. Host C++ File (.int.c)

Generated by sub_489000 (process_file_scope_entities). The filename is derived from the input: <input>.int.c, or stdout if the output name is "-".

Contents:

Pragma boilerplate (#pragma GCC diagnostic ignored ...)
Managed runtime initialization (__nv_init_managed_rt, __nv_fatbinhandle_for_managed_rt)
Lambda macro definitions (__nv_is_extended_device_lambda_closure_type, etc.)
#include "crt/host_runtime.h" (injected when first CUDA-tagged type is encountered)
All host-visible declarations with device-only entities wrapped in #if 0
Kernel functions replaced with forwarding stubs to __wrapper__device_stub_<name>
Registration tables (sub_6BCF80 called 6 times for device/host x managed/constant combinations)
Anonymous namespace macro (_NV_ANON_NAMESPACE)
Original source re-inclusion (#include "<original_file>")

2. Device IL File

Named via --gen_device_file_name CLI flag (flag index 85). Contains the binary IL for all entries that passed the keep-in-IL walk. This file is consumed by cicc (the CUDA IL compiler).

3. Module ID File

Named via --module_id_file_name CLI flag (flag index 87). Contains the CRC32-based unique identifier for this compilation unit, computed by make_module_id (sub_5B5500). Used to prevent ODR violations across separate compilation units in RDC mode.

4. Stub File

Named via --stub_file_name CLI flag (flag index 86). Contains the __wrapper__device_stub_<name> function definitions that bridge host-side kernel launch calls to the CUDA runtime.

Kernel Stub Generation

For __global__ kernel functions, the host output replaces the original body with two stub forms. The toggle dword_1065850 flips 0->1 at the top of gen_routine_decl, so the static definition is emitted first, followed by the forwarding body from the recursive call:

// Output 1 (dword_1065850 == 1 after toggle, emitted first):
static void __wrapper__device_stub_kernel_name(params) {
    ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
#if 0
<original body>
#endif

// Output 2 (dword_1065850 == 0 after toggle, emitted by recursive call):
void kernel_name(params) {
    <scope>::__wrapper__device_stub_kernel_name(params);
    return;
}
#if 0
<original body>
#endif

The static stub provides the definition of __wrapper__device_stub_<name> that the forwarding body calls. The cudaLaunchKernel(0, 0, 0, 0, 0, 0) placeholder creates a linker dependency on the CUDA runtime without performing an actual kernel launch.

For template kernels, the forwarding stub includes explicit template arguments: __wrapper__device_stub_kernel_name<T1, T2, ...>(params). For full details see Kernel Stubs.

Architectural Diagram

                        .cu source
                            |
                     EDG Frontend (parse once)
                            |
                     Unified IL Tree
                    (all entities tagged
                     at entity+182)
                            |
              +-------------+-------------+
              |                           |
        fe_wrapup pass 3           Backend (sub_489000)
     mark_to_keep_in_il            walks source sequence
      (sub_610420)                       |
              |                    sub_47ECC0 per entity
        set bit 7 on                     |
        device-reachable          +------+------+
        entries                   |             |
              |              sub_46B3F0    sub_47BFD0
        Device IL output    returns 0?    __global__?
        (binary, for cicc)       |             |
                            #if 0/endif   stub body
                            wrap it       replacement
                                  |             |
                                  +------+------+
                                         |
                                   .int.c output
                                 (text C++ for host
                                  compiler)

Function Map

Address	Name	Lines	Role
`sub_489000`	`process_file_scope_entities`	723	Backend entry point, `.int.c` emission
`sub_47ECC0`	`gen_template` (source sequence dispatcher)	1917	Dispatches each entity; calls `sub_46B3F0` for type filtering
`sub_47BFD0`	`gen_routine_decl`	1831	Routine declaration/definition; toggles `dword_1065850`
`sub_46B3F0`	device-only type filter	39	Returns 0 for device-only entities in host output
`sub_610420`	`mark_to_keep_in_il`	892	Top-level device IL marking entry point
`sub_6115E0`	`walk_tree_and_set_keep_in_il`	4649	Recursive worker that sets bit 7 on reachable entries
`sub_617310`	`prune_keep_in_il_walk`	127	Pre-walk callback; skips already-marked entries
`sub_6175F0`	`walk_scope_and_mark_routine_definitions`	634	Additional pass for C++ routine definitions
`sub_47B890`	`gen_lambda`	336	Lambda wrapper generation; `#if 0` for device lambda bodies
`sub_4864F0`	`gen_type_decl`	751	Type declaration emission; host runtime injection
`sub_5CCBF0`	`eliminate_unneeded_il_entries`	345	Elimination pass (removes entries without keep bit)

Cross-References

Execution Spaces -- byte +182 bitfield encoding for __host__/__device__/__global__; the nv_is_device_only_routine predicate that drives host-output filtering
Kernel Stubs -- detailed stub generation logic: the static cudaLaunchKernel body (emitted first) and the forwarding body (emitted by the recursive call)
Keep-in-IL -- full documentation of the device code marking walk, the keep bit at entry_ptr - 8, and the transitive closure algorithm
Memory Spaces -- variable-side __device__/__shared__/__constant__ at entity+148; these are the seed entries for the keep-in-IL walk
.int.c File Format -- structure of the generated host translation file
Entity Node Layout -- full byte map of the entity structure including offset +176 (flags field) and +182 (execution space byte)

Kernel Stub Generation

When cudafe++ generates the .int.c host translation of a CUDA source file, every __global__ kernel function undergoes a critical transformation: the original kernel body is suppressed and replaced with a device stub -- a lightweight host-callable wrapper that delegates to cudaLaunchKernel. This mechanism is how CUDA kernel launch syntax (kernel<<<grid, block>>>(args)) ultimately becomes a regular C++ function call that the host compiler can process. The stub generation logic lives primarily in gen_routine_decl (sub_47BFD0), a 1,831-line function in cp_gen_be.c that is the central code generator for all C++ function declarations and definitions. A secondary function, gen_bare_name (sub_473F10), handles the character-by-character emission of the __wrapper__device_stub_ prefix into function names.

The stub mechanism operates in two passes controlled by a global toggle, dword_1065850 (the device_stub_mode flag). The toggle fires at the top of gen_routine_decl, BEFORE the body-selection logic runs. Because the toggle is dword_1065850 = (dword_1065850 == 0), it flips 0->1 on the first invocation. This means:

First invocation (toggle 0->1): dword_1065850 == 1 at decision points -> emits the static declaration with cudaLaunchKernel placeholder body, then recurses.
Recursive invocation (toggle 1->0): dword_1065850 == 0 at decision points -> emits the forwarding body that calls __wrapper__device_stub_<name>.

Both invocations wrap the original kernel body in #if 0 / #endif so the host compiler never sees device code.

Key Facts

Property	Value
Source file	`cp_gen_be.c` (EDG 6.6 backend code generator)
Main generator	`sub_47BFD0` (`gen_routine_decl`, 1831 lines)
Bare name emitter	`sub_473F10` (`gen_bare_name`, 671 lines)
Stub prefix string	`"__wrapper__device_stub_"` at `0x839420`
Specialization prefix	`"__specialization_"` at `0x839960`
cudaLaunchKernel body	`"{ ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}"` at `0x839CB8`
Device-only dummy (ctor/dtor)	`"{int *volatile ___ = 0;"` at `0x839A3E` + `"::free(___);}"` at `0x839A72`
Device-only dummy (global)	`"{int volatile ___ = 1;"` at `0x839A56` + `"::exit(___);}"` at `0x839A80`
Stub mode flag	`dword_1065850` (global toggle)
Static template stub CLI flag	`-static-global-template-stub=true`
Parameter list generator	`sub_478900` (`gen_parameter_list`)
Scope qualifier emitter	`sub_474D60` (recursive namespace path)
Parameter name emitter	`sub_474BB0` (emit entity name for forwarding)

The Device Stub Mode Toggle

The entire stub generation mechanism hinges on a single global variable, dword_1065850. This flag acts as a modal switch: when set, all subsequent code generation for __global__ functions produces the static stub variant rather than the forwarding body.

Toggle Logic

The toggle occurs in gen_routine_decl at the point where the function's CUDA flags are inspected. The critical line from the decompiled binary:

// sub_47BFD0, around decompiled line 553
// v3 = routine entity pointer, v8 = is_friend flag

__int64 flags = *(_QWORD *)(v3 + 176);

if ((flags & 0x40000002000000) == 0x40000002000000 && v8 != 1)
    dword_1065850 = dword_1065850 == 0;   // toggle: 0->1 or 1->0

The bitmask 0x40000002000000 combines two bits of the entity's 8-byte flags field at offset +176. Read as a little-endian quadword, bit 54 is bit 6 (0x40) of byte +182, the __global__ attribute, and bit 25 is bit 1 (0x02) of byte +179, the has-body flag that the forwarding-body condition also tests. The condition requires BOTH bits set and the declaration must NOT be a friend declaration (v8 != 1). The toggle expression dword_1065850 == 0 flips the flag: if it was 0, it becomes 1; if it was 1, it becomes 0.
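Because the decompiler shows a single 64-bit load where other listings test individual bytes, it helps to map mask bits back to byte offsets. The sketch below assumes a little-endian load, as on x86-64; mask_byte is a hypothetical helper.

```c
#include <stdint.h>

/* For a little-endian 64-bit load at some base offset, returns the part of
 * `mask` that applies to the byte at base + index (index 0..7). */
static uint8_t mask_byte(uint64_t mask, unsigned index)
{
    return (uint8_t)(mask >> (8u * index));
}
```

For the mask 0x40000002000000 loaded from +176, index 6 (offset +182) yields 0x40 and index 3 (offset +179) yields 0x02; every other index yields 0.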

This means gen_routine_decl is called twice for every __global__ kernel. Crucially, the toggle fires at the TOP of the function, BEFORE the body emission logic:

First call (dword_1065850 == 0 at entry -> toggled to 1): All subsequent decision points see dword_1065850 == 1. Emits the static stub with cudaLaunchKernel placeholder body. Then recurses.
Recursive call (dword_1065850 == 1 at entry -> toggled to 0): All subsequent decision points see dword_1065850 == 0. Emits the forwarding stub body. Does NOT recurse (the flag is 0 at the end).

The self-recursion that drives the second call is explicit at the end of gen_routine_decl:

// sub_47BFD0, decompiled line 1817-1821
if (dword_1065850) {
    qword_1065748 = (int64_t)v163;  // restore source sequence pointer
    return sub_47BFD0(v152, a2);     // recursive self-call
}

After emitting the static stub (first call), the self-recursion check at line 1817 fires because dword_1065850 == 1. The function restores the source sequence state and calls itself. In the recursive call, the toggle fires again (1->0), and the forwarding body is emitted with dword_1065850 == 0. At the end of the recursive call, dword_1065850 == 0, so no further recursion occurs.
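The control flow can be modeled with a few lines of C. This is a toy model, not the decompiled function: stub_mode stands in for dword_1065850, and the trace buffer replaces real output emission.

```c
#include <string.h>

static int stub_mode;        /* stands in for dword_1065850 */
static char trace[64];       /* records the order of emitted forms */

static void gen_routine_decl_model(int is_kernel_definition)
{
    if (is_kernel_definition)
        stub_mode = (stub_mode == 0);      /* toggle at the top */

    if (is_kernel_definition)
        strcat(trace, stub_mode ? "static-stub;" : "forwarding;");
    else
        strcat(trace, "plain;");

    if (stub_mode)                          /* tail self-recursion */
        gen_routine_decl_model(is_kernel_definition);
}
```

Starting from stub_mode == 0, one call for a kernel definition records "static-stub;forwarding;" and leaves stub_mode at 0, so the next routine starts from a clean state.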

Stub Generation: The Forwarding Body

When dword_1065850 == 0 and the entity has __global__ annotation (byte +182 & 0x40) with a body (byte +179 & 0x02), gen_routine_decl emits a forwarding body instead of the original kernel implementation. This is the output produced by the recursive (second) invocation.

Step-by-Step Emission

The forwarding body is assembled from multiple sub_468190 (emit raw string) calls:

// Condition: (byte[182] & 0x40) != 0 && (byte[179] & 2) != 0 && dword_1065850 == 0

// 1. Open brace
sub_468190("{");

// 2. Scope qualification (if kernel is in a namespace)
scope = *(v3 + 40);  // entity's enclosing scope
if (scope && byte_at(scope + 28) == 3) {       // scope kind 3 = namespace
    sub_474D60(*(scope + 32));   // recursively emit namespace::namespace::...
    sub_468190("::");
}

// 3. Emit "__wrapper__device_stub_" prefix
sub_468190("__wrapper__device_stub_");

// 4. Emit the original function name
sub_468190(*(char **)(v3 + 8));  // entity name string at offset +8

Template Argument Emission

After the function name, template arguments must be forwarded. The logic branches on whether the function is an explicit template specialization (v153) or a non-template member of a template class:

Case A: Explicit specialization (v153 != 0) -- uses the template argument list at entity offset +224:

v135 = *(v3 + 224);  // template_args linked list
if (v135) {
    putc('<', stream);  // emit '<'
    do {
        arg_kind = byte_at(v135 + 8);
        if (arg_kind == 0) {
            // Type argument: emit type specifier + declarator
            sub_5FE8B0(v135[4], ...);   // gen_type_specifier
            sub_5FB270(v135[4], ...);   // gen_declarator
        } else if (arg_kind == 1) {
            // Value argument (non-type template param)
            sub_5FCAF0(v135[4], 1, ...); // gen_constant
        } else {
            // Template-template argument
            sub_472730(v135[4], ...);    // gen_template_arg
        }
        v135 = *v135;          // next in linked list
        separator = v135 ? ',' : '>';
        putc(separator, stream);
    } while (v135);
}

Case B: Non-specialization -- template parameters from the enclosing class template are forwarded:

// v162 = template parameter info from enclosing scope
v92 = v162[1];  // template parameter list
if (v92 && (byte_at(v92 + 113) & 2) == 0) {
    sub_467E50("<");
    do {
        param_kind = byte_at(v92 + 112);
        if (param_kind == 1) {
            // type parameter -- emit the type
            sub_5FE8B0(*(v92 + 120), ...);
            sub_5FB270(*(v92 + 120), ...);
        } else if (param_kind == 2) {
            // non-type parameter -- emit constant
            sub_5FCAF0(*(v92 + 120), 1, ...);
        } else {
            // template-template parameter
            sub_472730(*(v92 + 120), ...);
        }
        if (byte_at(v92 + 113) & 1)
            sub_467E50("...");   // parameter pack expansion
        v92 = *(v92 + 104);     // next parameter
        emit(v92 ? "," : ">");
    } while (v92);
}

Parameter Forwarding

After the name and template arguments, the forwarding call's actual arguments are emitted:

// 5. Emit parameter forwarding: "(param1, param2, ...)"
sub_468150(40);  // '('
param = *(v167 + 40);  // first parameter entity from definition scope
if (param) {
    for (separator = ""; ; separator = ",") {
        sub_468190(separator);
        sub_474BB0(param, 7);  // emit parameter name
        if (byte_at(param + 166) & 0x40) {
            sub_468190("...");  // variadic parameter pack expansion
        }
        param = *(param + 104);  // next parameter in list
        if (!param) break;
    }
}
sub_468190(");");

// 6. Emit return statement and closing brace
sub_468190("return;}");

Complete Output Example

For a kernel:

namespace my_ns {
template<typename T>
__global__ void my_kernel(T* data, int n) { /* device code */ }
}

The forwarding body (emitted during the recursive call with dword_1065850 == 0) produces:

template<typename T>
void my_ns::my_kernel(T* data, int n) {
    my_ns::__wrapper__device_stub_my_kernel<T>(data, n);
    return;
}
#if 0
/* original kernel body here */
#endif

Note: __host__ is NOT emitted in the forwarding body. The __global__ attribute is stripped and no explicit execution space appears. The function appears as a plain C++ function in .int.c.
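The emission steps can be combined into a single string builder to check the resulting text. This is a sketch with hypothetical names; the scope, template arguments, and parameter names are passed in as ready-made strings instead of being derived from entity structures.

```c
#include <stdio.h>

/* Builds "{scope::__wrapper__device_stub_name<targs>(p1,p2);return;}".
 * Returns the length written, or -1 if the buffer is too small. */
static int format_forwarding_body(char *out, size_t cap, const char *scope,
                                  const char *name, const char *targs,
                                  const char *const *params, size_t nparams)
{
    size_t used;
    int n = snprintf(out, cap, "{%s%s__wrapper__device_stub_%s%s(",
                     scope ? scope : "", scope ? "::" : "", name, targs);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    used = (size_t)n;

    for (size_t i = 0; i < nparams; ++i) {
        /* Empty separator before the first parameter, "," afterwards. */
        n = snprintf(out + used, cap - used, "%s%s", i ? "," : "", params[i]);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, cap - used, ");return;}");
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

For scope "my_ns", name "my_kernel", template arguments "<T>", and parameters data and n, the builder produces {my_ns::__wrapper__device_stub_my_kernel<T>(data,n);return;}, the unformatted equivalent of the body shown in the example.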

Stub Generation: The Static cudaLaunchKernel Placeholder

When dword_1065850 == 1 (the first invocation, after the toggle), the function declaration is rewritten with a different storage class and body. Despite being called "pass 2" conceptually (it produces the definition that the forwarding body calls), it is emitted FIRST in the output because the toggle sets the flag before any body emission logic runs.

Declaration Modifiers

When dword_1065850 is set, gen_routine_decl forces the storage class to static and optionally prepends the __specialization_ prefix:

// sub_47BFD0, decompiled lines 897-903
if (dword_1065850) {
    v164 = 2;                    // force storage class = static
    v23 = "static";
    if (v153)                    // if template specialization
        sub_467E50("__specialization_");
    goto emit_storage_class;     // -> sub_467E50("static"); sub_468150(' ');
}

The __specialization_ prefix is emitted BEFORE static for template specializations, with no separator between the two strings. The output therefore contains the single token __specialization_static, as in __specialization_static void __wrapper__device_stub_kernel(...), which distinguishes specialization stubs from primary template stubs.

Name Emission via gen_bare_name

In stub mode, gen_bare_name (sub_473F10) prepends the wrapper prefix character-by-character. The relevant code path:

// sub_473F10, decompiled lines 130-144
if (byte_at(v2 + 182) & 0x40 && dword_1065850) {
    // Emit line directive if pending
    if (dword_1065818)
        sub_467DA0();

    // Character-by-character emission of "__wrapper__device_stub_"
    v25 = "_wrapper__device_stub_";   // note: starts at second char
    v26 = 95;                          // first char: '_' (0x5F = 95)
    do {
        ++v25;
        putc(v26, stream);
        v26 = *(v25 - 1);
        ++dword_106581C;
    } while ((char)v26);
}

The technique is notable: the loop's string pointer starts at the second character of the prefix (the literal "_wrapper__device_stub_"), and the first underscore (_, ASCII 95) is loaded separately as the initial character. The do/while loop then walks the string pointer forward, emitting each character via putc and incrementing the column counter (dword_106581C). This assembles the full 23-character __wrapper__device_stub_ prefix before the actual function name is emitted.
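The loop shape can be checked by running it against a buffer. The sketch below keeps the pointer and character handling of the listing but writes into a caller-supplied array instead of calling putc; emit_stub_prefix is a hypothetical name.

```c
#include <stddef.h>

/* Reproduces the do/while from the listing. Returns the number of
 * characters emitted, or 0 if the buffer cannot hold the prefix. */
static size_t emit_stub_prefix(char *out, size_t cap)
{
    const char *p = "_wrapper__device_stub_";  /* prefix minus its first '_' */
    int ch = '_';                              /* first character, preloaded */
    size_t column = 0;

    if (cap < 24)                  /* 23 characters plus terminator */
        return 0;
    do {
        ++p;
        out[column] = (char)ch;    /* putc(v26, stream) in the listing */
        ch = *(p - 1);
        ++column;                  /* ++dword_106581C in the listing */
    } while ((char)ch);
    out[column] = '\0';
    return column;
}
```

The loop emits the preloaded underscore first and then each character of the literal in order, so the buffer ends up holding __wrapper__device_stub_ and the column counter advances by 23.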

cudaLaunchKernel Placeholder Body

For non-specialization __global__ kernels in stub mode, the body is a single-line placeholder:

// sub_47BFD0, decompiled lines 1424-1429
if (dword_1065850) {
    if (!v153 && v90) {    // not a specialization AND has __global__ body
        sub_468190("{ ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}");
        goto suppress_original;
    }
}

The call ::cudaLaunchKernel(0, 0, 0, 0, 0, 0) is never actually executed at runtime. It exists solely to create a linker dependency on the CUDA runtime library, ensuring that cudaLaunchKernel is linked even though the real launch is performed through the CUDA driver API. The six zero arguments match the signature cudaError_t cudaLaunchKernel(const void*, dim3, dim3, void**, size_t, cudaStream_t).

Complete Output Example (Static Stub)

For the same kernel above, the static stub (emitted first, with dword_1065850 == 1) produces:

template<typename T>
static void __wrapper__device_stub_my_kernel(T* data, int n) {
    ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}

Dummy Bodies for Non-Kernel Device Functions

Not all CUDA-annotated functions are __global__ kernels. Device-only functions (constructors, destructors, and plain __device__ functions) that have definitions also need host-side bodies to prevent host compiler errors. These receive dummy bodies designed to suppress optimizer warnings while remaining syntactically valid.

Condition for Dummy Body Emission

The dummy body path activates in the ELSE branch of the __global__ check -- that is, for non-kernel device functions. The condition from the decompiled code (lines 1603-1606):

// This path is reached when (byte[182] & 0x40) == 0 -- entity is NOT __global__
// The flags field at offset +176 is an 8-byte bitfield encoding linkage/definition state.

uint64_t flags = *(uint64_t*)(entity + 176);
if ((flags & 0x30000000000500) != 0x20000000000000)  // NOT a device-only entity with definition
    goto emit_original_body;                          // skip dummy, emit normally

if (!dword_106BFDC || (entity->byte_81 & 4) != 0)   // whole-program flag check
{
    // Emit dummy body for device-only function visible in host output
}

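The bitmask 0x30000000000500 extracts the device-annotation and definition bits from the 8-byte flags field. Read as a little-endian quadword, it covers bits 4-5 (0x30) of byte +182 and bits 0 and 2 (0x05) of byte +177. The target value 0x20000000000000 therefore requires (byte +182 & 0x30) == 0x20, the same __device__-without-__host__ test used by nv_is_device_only_routine, with both byte +177 bits clear. This selects entities that have device annotation set but no host-side definition -- exactly the functions that need a dummy body to satisfy the host compiler.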
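The comparison can be tested against hand-built flag values. This is a sketch: the macro and function names are hypothetical, and the sample values only set the bits discussed here.

```c
#include <stdint.h>

#define DUMMY_MASK   0x30000000000500ULL
#define DUMMY_TARGET 0x20000000000000ULL

/* Nonzero when the flags quadword at entity+176 selects the dummy-body path. */
static int takes_dummy_path(uint64_t flags)
{
    return (flags & DUMMY_MASK) == DUMMY_TARGET;
}
```

A flags value with only the device bit set (0x20 shifted left by 48) takes the dummy path; adding the host bit, or either of the masked low bits 0x100 and 0x400, sends the function to the normal body emission instead.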

Constructor/Destructor Dummy (definition_kind 1 or 2)

For constructors (definition_kind == 1) and destructors (definition_kind == 2), the dummy body declares a volatile null pointer and passes it to ::free:

// sub_47BFD0, decompiled lines 1611-1651
if ((unsigned char)(byte[166] - 1) <= 1) {
    sub_468190("{int *volatile ___ = 0;");
    // ... emit (void)param; for each parameter ...
    sub_468190("::free(___);}");
}

Output:

{int *volatile ___ = 0;(void)param1;(void)param2;::free(___);}

The volatile qualifier forces the compiler to treat ___ as an opaque value rather than a known null constant, so the ::free call cannot be optimized away. At runtime ::free receives a null pointer, which is a no-op, but the call establishes a dependency on the C library and prevents dead code elimination of the entire body.

__global__ / Regular Device Function Dummy (definition_kind >= 3)

For non-constructor/destructor device functions, a different pattern is used:

else {
    sub_468190("{int volatile ___ = 1;");
    // ... emit (void)param; for each parameter ...
    sub_468190("::exit(___);}");
}

Output:

{int volatile ___ = 1;(void)param1;(void)param2;::exit(___);}

Because ::exit never returns, the host compiler's control-flow analysis does not see a path that falls off the end of the function, which suppresses missing-return-value warnings for non-void functions.

Parameter Usage Emission

Between the opening and closing statements, each named parameter is referenced with (void)param; to suppress unused-parameter warnings. The loop walks the parameter list:

for (kk = *(v167 + 40); kk; kk = *(kk + 104)) {
    if (*(kk + 8) && !(byte_at(kk + 166) & 0x40)) {  // has name, not a pack
        // For aggregate types with GNU host compiler: complex cast chain
        if (!dword_1065750 && dword_126E1F8
            && is_aggregate_type(*(kk + 112))
            && has_nontrivial_dtor(*(kk + 112))) {
            sub_468190("(void)");
            sub_468190("reinterpret_cast<void *>(&(const_cast<char &>");
            sub_468190("(reinterpret_cast<const volatile char &>(");
            sub_474BB0(kk, 7);  // parameter name
            sub_468190("))))");
        } else {
            sub_468190("(void)");
            sub_474BB0(kk, 7);  // parameter name
        }
        sub_468150(';');
    }
}

The complex reinterpret_cast chain for aggregate types with non-trivial destructors avoids triggering GCC/Clang warnings about taking the address of a parameter that might be passed in registers.
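Concatenating the emitted fragments shows the exact statement text for each branch. The sketch below uses a hypothetical format_param_use helper and takes the branch decision as a plain flag instead of evaluating the type predicates.

```c
#include <stdio.h>

/* Writes the parameter-use statement for one named parameter.
 * Returns the snprintf result (length of the full statement). */
static int format_param_use(char *out, size_t cap, const char *name,
                            int needs_cast_chain)
{
    if (needs_cast_chain)
        return snprintf(out, cap,
            "(void)reinterpret_cast<void *>(&(const_cast<char &>"
            "(reinterpret_cast<const volatile char &>(%s))));", name);
    return snprintf(out, cap, "(void)%s;", name);
}
```

For a parameter named p, the simple branch yields (void)p; and the cast-chain branch yields (void)reinterpret_cast<void *>(&(const_cast<char &>(reinterpret_cast<const volatile char &>(p))));, with the four closing parentheses balancing the four opened by the fragments.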

The #if 0 / #endif Suppression

After the stub body is emitted, the original kernel body is wrapped in preprocessor guards to hide it from the host compiler:

// sub_47BFD0, decompiled lines 1598-1601
sub_46BC80("#if 0");       // emit "#if 0\n"
--dword_1065834;           // decrease indent level
sub_467D60();              // emit newline

// ... then emit the original body via:
dword_1065850_saved = dword_1065850;
dword_1065850 = 0;                    // temporarily disable stub mode
sub_47AEF0(*(v167 + 80), 0);         // gen_statement_full: emit original body
dword_1065850 = dword_1065850_saved;  // restore stub mode
sub_466C10();                          // finalize

// ... then emit #endif
putc('#', stream);
// character-by-character emission of "#endif\n"

The function temporarily disables stub mode (dword_1065850 = 0) while emitting the original body so that any nested constructs are generated normally. After the body, #endif is emitted and stub mode is restored.

For definitions (when v112 == 0), a trailing ; is appended after #endif to satisfy host compilers that may expect a statement terminator.

The -static-global-template-stub Flag

The CLI flag -static-global-template-stub=true controls how template __global__ functions are stubbed. When enabled, template kernel stubs receive static linkage, which avoids ODR violations when the same template kernel is instantiated in multiple translation units during whole-program compilation (-rdc=false).

The flag produces two diagnostic messages when it encounters problematic patterns:

Extern template kernel: "when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false")" -- An extern template kernel cannot receive a static stub because the definitions would conflict across TUs.
Missing definition: "when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit" -- The static stub requires a local definition to replace.

Both diagnostics recommend either switching to -rdc=true (separate compilation) or explicitly setting -static-global-template-stub=false.

Diagnostic Push/Pop Around Stubs

Before emitting device stub declarations, gen_routine_decl wraps the output in compiler-specific diagnostic suppression to prevent spurious warnings:

For GCC/Clang hosts (dword_126E1F8 set, version > 0x9E97 = 40599, i.e. newer than 4.5.99 if the version is encoded as major*10000 + minor*100 + patch):

sub_467E50("\n#pragma GCC diagnostic push\n");
sub_467E50("#pragma GCC diagnostic ignored \"-Wunused-parameter\"\n");
// ... stub emission ...
sub_467E50("\n#pragma GCC diagnostic pop\n");

For MSVC hosts (dword_126E1D8 set):

sub_467E50("\n__pragma(warning(push))\n");
sub_467E50("__pragma(warning(disable : 4100))\n");  // unreferenced formal parameter
// ... stub emission ...
sub_467E50("\n__pragma(warning(pop))\n");

For static template specialization stubs, an additional warning is suppressed:

GCC/Clang: #pragma GCC diagnostic ignored "-Wunused-function" (warning 4505 on MSVC: "unreferenced local function has been removed")

Deferred Function List for Whole-Program Mode

When dword_106BFBC (a whole-program compilation flag) is set and dword_106BFDC is clear, instead of emitting a dummy body immediately, gen_routine_decl adds the function to a deferred list:

// sub_47BFD0, decompiled lines 1713-1745
v117 = sub_6B7340(32);          // allocate 32-byte node
v117[0] = qword_1065840;        // link to previous head
v117[1] = source_start;         // source position start
v117[2] = source_end;           // source position end
if (has_name)
    v117[3] = strdup(name);     // copy of function name
else
    v117[3] = NULL;
qword_1065840 = v117;           // push onto list head

This deferred list (qword_1065840) is later consumed during the breakpoint placeholder generation phase in process_file_scope_entities (sub_489000), where each deferred entry produces a static __attribute__((used)) void __nv_breakpoint_placeholder<N>_<name>(void) { exit(0); } function.
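The node layout and push operation correspond to an ordinary singly linked stack. The following sketch uses hypothetical names (struct deferred_fn, push_deferred) and replaces strdup with an explicit malloc and memcpy so it does not depend on POSIX.

```c
#include <stdlib.h>
#include <string.h>

/* Four pointer-sized fields, mirroring v117[0..3] in the listing. */
struct deferred_fn {
    struct deferred_fn *next;
    long source_start;
    long source_end;
    char *name;                 /* owned copy, or NULL when unnamed */
};

/* Pushes a node and returns the new head; returns the old head on
 * allocation failure. */
static struct deferred_fn *push_deferred(struct deferred_fn *head,
                                         long start, long end,
                                         const char *name)
{
    struct deferred_fn *node = malloc(sizeof *node);
    if (node == NULL)
        return head;
    node->next = head;
    node->source_start = start;
    node->source_end = end;
    node->name = NULL;
    if (name != NULL) {
        size_t len = strlen(name) + 1;
        node->name = malloc(len);
        if (node->name != NULL)
            memcpy(node->name, name, len);
    }
    return node;
}
```

Because each node is pushed at the head, a consumer that walks the list from the head sees the functions in reverse order of insertion.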

Function Map

Address	Name	Role
`sub_47BFD0`	`gen_routine_decl`	Main stub generator; 1831 lines; handles all function declarations
`sub_473F10`	`gen_bare_name`	Character-by-character name emission with `__wrapper__device_stub_` prefix
`sub_474BB0`	`gen_entity_name`	Parameter name emission for forwarding calls
`sub_474D60`	`gen_scope_qualifier`	Recursive namespace path emission (`ns1::ns2::`)
`sub_478900`	`gen_parameter_list`	Parameter list with type transformation in stub mode
`sub_478D70`	`gen_function_declarator_with_scope`	Full function declarator with cv-qualifiers and ref-qualifiers
`sub_47AEF0`	`gen_statement_full`	Statement generator used for emitting original body inside `#if 0`
`sub_47ECC0`	`gen_template` / `process_source_sequence`	Top-level dispatch; also sets `dword_1065850` for instantiation directives
`sub_46BC80`	(emit #if directive)	Emits `#if 0` / `#if 1` preprocessor lines
`sub_467E50`	(emit string)	Primary string emission to output stream
`sub_468190`	(emit raw string)	Raw string emission (no line directive)
`sub_489000`	`process_file_scope_entities`	Backend entry point; consumes deferred function list

Concrete Example: Simple Kernel Stub Output

Given this input CUDA source:

__global__ void add_one(int *data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] += 1;
}

cudafe++ generates the following in the .int.c host translation file. The toggle fires at the top of gen_routine_decl (0->1), so the static stub definition is emitted FIRST, followed by the forwarding body from the recursive call.

Output 1: Static Stub Definition (first call, dword_1065850 == 1 after toggle)

The static stub provides the linker symbol that the forwarding body calls. Diagnostic pragmas wrap the declaration to suppress unused-parameter warnings:

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-parameter"
static void __wrapper__device_stub_add_one(int *data, int n) {
    ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
#if 0
/* Original kernel body -- hidden from host compiler */
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] += 1;
}
#endif
#pragma GCC diagnostic pop

The static storage class is forced by the check at decompiled lines 897-903. The __wrapper__device_stub_ prefix is emitted by gen_bare_name (sub_473F10). The cudaLaunchKernel placeholder body comes from the string literal at 0x839CB8.

Output 2: Forwarding Body (recursive call, dword_1065850 == 0 after toggle)

After the static stub is emitted and gen_routine_decl recurses, the forwarding body replaces the original kernel body. The __global__ attribute is stripped (kernels become regular host functions in .int.c):

void add_one(int *data, int n) {__wrapper__device_stub_add_one(data, n);return;}
#if 0
/* Original kernel body -- hidden from host compiler (emitted again) */
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] += 1;
}
#endif

The forwarding body is assembled character-by-character:

{ -- open brace
Scope qualifier (none for file-scope kernels; ns:: for namespaced ones)
__wrapper__device_stub_ -- the stub prefix from string at 0x839420
add_one -- the original function name from entity + 8
(data, n) -- parameter names forwarded (no types, just names via sub_474BB0)
);return;} -- close the forwarding call and return

The original body appears in #if 0 in both outputs because both code paths reach the same LABEL_457 -> sub_46BC80("#if 0") emission point.
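The assembly order listed above can be reproduced with a small string builder. This is a minimal sketch under stated assumptions: build_forwarding_body is a hypothetical helper, and it formats into a caller-supplied buffer, whereas the binary emits the pieces directly to the output stream.

```c
#include <stdio.h>
#include <string.h>

/* Build "{<scope>__wrapper__device_stub_<name>(<p0>, <p1>, ...);return;}".
 * scope may be NULL for file-scope kernels. Returns 0 on success and
 * -1 if the result does not fit in out. Hypothetical helper. */
static int build_forwarding_body(char *out, size_t cap, const char *scope,
                                 const char *name,
                                 const char *const *params, size_t nparams)
{
    size_t used;
    int n = snprintf(out, cap, "{%s__wrapper__device_stub_%s(",
                     scope ? scope : "", name);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    used = (size_t)n;

    /* Forward parameter names only; types are not repeated. */
    for (size_t i = 0; i < nparams; i++) {
        n = snprintf(out + used, cap - used, "%s%s", i ? ", " : "", params[i]);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }

    /* Close the call and return. */
    n = snprintf(out + used, cap - used, ");return;}");
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    return 0;
}
```

Called with name "add_one" and parameters "data" and "n", the helper yields the same body text as the add_one example.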

Template Kernel Example

For a template kernel:

template<typename T>
__global__ void scale(T *data, T factor, int n) { /* ... */ }

// explicit instantiation
template __global__ void scale<float>(float *, float, int);

Output 1 (first call, dword_1065850 == 1) produces a specialization stub:

__specialization_static void __wrapper__device_stub_scale(float *data, float factor, int n) {
    ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}

Output 2 (recursive call, dword_1065850 == 0) produces a forwarding stub with template arguments:

template<typename T>
void scale(T *data, T factor, int n) {__wrapper__device_stub_scale<T>(data, factor, n);return;}

The __specialization_ prefix is emitted only when the entity is a template specialization (v153 != 0) and dword_1065850 is set (decompiled lines 901-902).

Device-Only Function Example

For a non-kernel __device__ function with a body:

__device__ int device_helper(int x, int y) {
    return x + y;
}

The host output uses a dummy body instead of a forwarding stub (since there is no __wrapper__device_stub_ target for non-kernel functions):

__attribute__((unused)) int device_helper(int x, int y) {int volatile ___ = 1;(void)x;(void)y;::exit(___);}
#if 0
{
    return x + y;
}
#endif

The __attribute__((unused)) prefix is emitted when the function's execution space is device-only ((byte_182 & 0x70) == 0x20) and dword_126E1F8 (GCC host compiler mode) is set (decompiled lines 905-906).
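The dummy body follows a fixed pattern: a volatile flag, one (void) cast per parameter, and a call to ::exit. The sketch below rebuilds that text from a parameter list. Both append and build_dummy_body are hypothetical helpers written for illustration; the binary's emission routine is not structured this way.

```c
#include <string.h>

/* Append src to the NUL-terminated string in out if it fits.
 * Returns 0 on success, -1 on overflow. Hypothetical helper. */
static int append(char *out, size_t cap, const char *src)
{
    size_t used = strlen(out), need = strlen(src);
    if (used + need + 1 > cap)
        return -1;
    memcpy(out + used, src, need + 1);
    return 0;
}

/* Build "{int volatile ___ = 1;(void)p0;(void)p1;...::exit(___);}". */
static int build_dummy_body(char *out, size_t cap,
                            const char *const *params, size_t nparams)
{
    if (cap == 0)
        return -1;
    out[0] = '\0';
    if (append(out, cap, "{int volatile ___ = 1;") != 0)
        return -1;
    for (size_t i = 0; i < nparams; i++) {
        /* Each parameter is referenced once so it does not look unused. */
        if (append(out, cap, "(void)") != 0 ||
            append(out, cap, params[i]) != 0 ||
            append(out, cap, ";") != 0)
            return -1;
    }
    return append(out, cap, "::exit(___);}");
}
```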

Cross-References

Execution Spaces -- byte +182 bitfield that drives the __global__ check; complete redeclaration matrix
Device/Host Separation -- IL marking that determines which functions need stubs; the dword_1065850 toggle lifecycle
RDC Mode -- separate compilation mode that affects stub linkage
.int.c File Format -- overall structure of the generated host file
CUDA Runtime Boilerplate -- managed memory initialization emitted alongside stubs

RDC Mode

CUDA supports two compilation models that fundamentally change how cudafe++ processes device code: whole-program mode (-rdc=false, the default) and separate compilation mode (-rdc=true, also called Relocatable Device Code). The mode switch affects error checking, stub linkage, module ID generation, anonymous namespace mangling, and -- when multiple translation units are involved -- triggers EDG's cross-TU correspondence machinery for structural type verification.

From cudafe++'s perspective, the distinction maps to a single CLI flag (--device-c, flag index 77) and a handful of global booleans that gate code paths throughout the binary. This page documents what changes between the two modes, how module IDs are generated, how cross-TU IL correspondence works, and how host stub linkage is controlled.

Key Facts

Property	Value
RDC CLI flag	`--device-c` (flag index 77, no argument)
Whole-program mode flag	`dword_106BFBC` (also set by `--debug_mode`)
Module ID cache	`qword_126F0C0` (cached string, computed once)
Module ID generator	`sub_5AF830` (`make_module_id`, ~450 lines)
Module ID setter	`sub_5AF7F0` (`set_module_id`)
Module ID getter	`sub_5AF820` (`get_module_id`)
Module ID file writer	`sub_5B0180` (`write_module_id_to_file`)
Module ID file flag	`--gen_module_id_file` (flag 83)
Module ID file path	`--module_id_file_name` (flag 87)
Cross-TU IL copier	`sub_796BA0` (`copy_secondary_trans_unit_IL_to_primary`, trans_copy.c)
Cross-TU usage marker	`sub_796C00` (`mark_secondary_IL_entities_used_from_primary`)
Class correspondence	`sub_7A00D0` (`verify_class_type_correspondence`, 703 lines)
TU processing entry	`sub_7A40A0` (`process_translation_unit`)
TU switch	`sub_7A3D60` (`switch_translation_unit`)
Host stub linkage flag	`--host-stub-linkage-explicit` (flag 47)
Static host stub flag	`--static-host-stub` (flag 48)
Static template stub flag	`--static-global-template-stub` (set_flag mechanism)
EDG source files	`host_envir.c` (module ID), `trans_copy.c`, `trans_corresp.c`, `trans_unit.c`

Whole-Program Mode (-rdc=false)

Whole-program mode is the default. All device code for a given translation unit must be defined within that single .cu file. No external device symbols are allowed. Each TU's device code is therefore compiled as a complete program, and nvlink is not required for device code linking.

Constraints Enforced

Five diagnostics are specific to whole-program mode or are closely tied to the internal-linkage consequences of non-RDC compilation:

1. Inline device/constant/managed variables must have internal linkage.

An inline __device__/__constant__/__managed__ variable must have
internal linkage when the program is compiled in whole program
mode (-rdc=false)

In whole-program mode, the device runtime has no linker step to resolve external inline variables across TUs. An inline __device__ variable with external linkage would need cross-TU deduplication that only nvlink can provide. The frontend forces static (or anonymous-namespace) linkage, emitting an error if the variable has external linkage.

2. Extern __global__ function templates are forbidden (with -static-global-template-stub=true).

when "-static-global-template-stub=true", extern __global__ function
template is not supported in whole program compilation mode ("-rdc=false").
To resolve the issue, either use separate compilation mode ("-rdc=true"),
or explicitly set "-static-global-template-stub=false" (but see nvcc
documentation about downsides of turning it off)

The -static-global-template-stub flag causes template kernel stubs to receive static linkage to avoid ODR violations when the same template is instantiated in multiple host-side compilation units. An extern template declaration conflicts with this because the extern stub expects an external definition while the static stub forces a local one. The diagnostic tag for this is extern_kernel_template.

3. __global__ template instantiations must have local definitions (with -static-global-template-stub=true).

when "-static-global-template-stub=true" in whole program compilation
mode ("-rdc=false"), a __global__ function template instantiation or
specialization (%sq) must have a definition in the current translation
unit.

A static stub requires a definition in the same TU. If the instantiation point references a template defined in another header without an explicit instantiation, the stub has no body to emit. The diagnostic tag is template_global_no_def.

Both template-related diagnostics recommend either switching to -rdc=true or setting -static-global-template-stub=false. The 4 usage contexts in the binary for -static-global-template-stub all appear in error message strings (at addresses 0x88E588 and 0x88E6E0).

4. Kernel launch from __device__ or __global__ functions requires separate compilation.

kernel launch from __device__ or __global__ functions requires
separate compilation mode

Dynamic parallelism -- launching a kernel from device code (a __device__ or __global__ function calling <<<...>>>) -- requires the device linker (nvlink) to resolve cross-module kernel references. In whole-program mode, no device linking occurs, so the construct is illegal. The diagnostic tag is device_launch_no_sepcomp.

5. Address of internal linkage device function (bug mitigation).

address of internal linkage device function (%sq) was taken
(nv bug 2001144). mitigation: no mitigation required if the
address is not used for comparison, or if the target function
is not a CUDA C++ builtin. Otherwise, write a wrapper function
to call the builtin, and take the address of the wrapper
function instead

This diagnostic fires in whole-program mode when code takes the address of a static __device__ function. Because device functions with internal linkage get module-ID-based name mangling, their addresses may differ across compilations or across TUs even when they refer to the "same" function. The warning documents a known NVIDIA bug (2001144) and provides a workaround: wrap the builtin in a non-internal function and take the wrapper's address instead. This diagnostic has no associated tag name -- it is emitted unconditionally when the condition is detected.

Deferred Function List

When dword_106BFBC (whole-program mode) is set and dword_106BFDC (skip-device-only) is clear, gen_routine_decl (sub_47BFD0) adds device-only functions to a deferred linked list (qword_1065840) rather than emitting dummy bodies inline. Each list node is 32 bytes:

Offset	Field
+0	`next` pointer
+8	Source position (start)
+16	Source position (end)
+24	Name string (strdup'd, or NULL)

This list is consumed during the breakpoint placeholder phase in process_file_scope_entities (sub_489000), where each entry produces a static __attribute__((used)) void __nv_breakpoint_placeholder<N>_<name>(void) { exit(0); } function for debugger support.
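The node layout and head insertion can be sketched as a plain singly linked list. The struct and function names here are hypothetical, the source positions are modeled as opaque 64-bit values (only their 8-byte size comes from the layout above), and malloc stands in for the binary's own allocator sub_6B7340.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* 32 bytes on an LP64 target, matching the offsets in the table above. */
struct deferred_fn {
    struct deferred_fn *next;  /* +0:  previous list head             */
    uint64_t src_start;        /* +8:  source position (start)        */
    uint64_t src_end;          /* +16: source position (end)          */
    char *name;                /* +24: private copy of name, or NULL  */
};

/* Push a new node onto the list and return the new head.
 * On allocation failure the list is returned unchanged. */
static struct deferred_fn *deferred_push(struct deferred_fn *head,
                                         uint64_t start, uint64_t end,
                                         const char *name)
{
    struct deferred_fn *node = malloc(sizeof *node);
    if (!node)
        return head;
    node->next = head;
    node->src_start = start;
    node->src_end = end;
    node->name = NULL;
    if (name) {
        size_t len = strlen(name) + 1;
        node->name = malloc(len);       /* portable stand-in for strdup */
        if (node->name)
            memcpy(node->name, name, len);
    }
    return node;
}
```

Because nodes are pushed at the head, a consumer walking from the head visits functions in reverse order of registration.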

Separate Compilation Mode (-rdc=true)

When nvcc passes --device-c (flag index 77) to cudafe++, separate compilation mode is activated. Compared with whole-program mode, it:

Allows __device__, __constant__, and __managed__ variables to have external linkage
Permits extern __global__ template functions
Enables dynamic parallelism (kernel launches from device code)
Requires nvlink to resolve device-side cross-TU references
Generates a module ID that uniquely identifies each compilation unit for runtime registration

In this mode, the host stubs are generated with external linkage (by default) so the host linker can resolve cross-TU kernel calls. The module ID is embedded in the registration code to match host stubs with their corresponding device fatbinary segments.

Multi-TU Processing in EDG

When multiple translation units are compiled in a single cudafe++ invocation (as happens during RDC compilation with nvcc), the EDG frontend processes them sequentially using a stack-based TU management system:

Global	Purpose
`qword_106BA10`	Current translation unit pointer
`qword_106B9F0`	Primary (first) translation unit
`qword_106BA18`	TU stack top
`dword_106B9E8`	TU stack depth (excluding primary)

process_translation_unit (sub_7A40A0, trans_unit.c) is the main entry point called from main() for each source file:

Allocates a 424-byte TU descriptor via sub_6BA0D0
Initializes scope state and copies registered variable defaults
Sets the primary TU pointer (qword_106B9F0) for the first file
Links the TU into the processing chain
Opens the source file and sets up include paths
Runs the parser (sub_586240)
Dispatches to standard compilation (sub_4E8A60) or module compilation (sub_6FDDF0)
Calls finalization (sub_588E90)
Pops the TU from the stack

switch_translation_unit (sub_7A3D60, trans_unit.c, line 514) saves/restores per-TU state when the frontend needs to reference entities from a different TU:

Asserts qword_106BA10 != 0 (current TU exists)
If target differs from current: saves current TU via sub_7A3A50
Restores target TU state via memcpy from per-TU buffer
Sets qword_106BA10 = target
Restores scope chain: xmmword_126EB60, qword_126EB70, etc.
Recomputes file scope indices via sub_704490

Per-TU state is registered through f_register_trans_unit_variable (sub_7A3C00, trans_unit.c, line 227), which accumulates variables into a linked list (qword_12C7AA8). Each registration record is 40 bytes with fields for the variable pointer, name, prior size, and buffer offset. The total per-TU buffer size is tracked in qword_12C7A98.

Three core variables are always registered (sub_7A4690):

dword_106BA08 (is_recompilation), 4 bytes
qword_106BA00 (current_filename), 8 bytes
dword_106B9F8 (has_module_info), 4 bytes
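The registration and save/restore scheme described above can be approximated as follows. This is a sketch under assumptions: the record field order, the function names, and the use of malloc are all hypothetical; only the idea of a linked list of registered variables, each with a size and an offset into a per-TU buffer, comes from the text.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One registration record: five 8-byte fields, 40 bytes on LP64.
 * The field order is an assumption. */
struct tu_var {
    struct tu_var *next;
    void *addr;          /* address of the global being tracked   */
    const char *name;    /* diagnostic name                       */
    size_t size;         /* size of the variable in bytes         */
    size_t offset;       /* position inside the per-TU buffer     */
};

static struct tu_var *tu_vars;  /* plays the role of qword_12C7AA8 */
static size_t tu_buf_size;      /* plays the role of qword_12C7A98 */

/* Register a global so it is saved/restored on every TU switch. */
static int register_tu_var(void *addr, const char *name, size_t size)
{
    struct tu_var *v = malloc(sizeof *v);
    if (!v)
        return -1;
    v->addr = addr;
    v->name = name;
    v->size = size;
    v->offset = tu_buf_size;
    v->next = tu_vars;
    tu_vars = v;
    tu_buf_size += size;
    return 0;
}

/* Copy every registered global into the current TU's buffer. */
static void save_tu_state(unsigned char *buf)
{
    for (struct tu_var *v = tu_vars; v; v = v->next)
        memcpy(buf + v->offset, v->addr, v->size);
}

/* Copy the target TU's buffer back into the globals. */
static void restore_tu_state(const unsigned char *buf)
{
    for (struct tu_var *v = tu_vars; v; v = v->next)
        memcpy(v->addr, buf + v->offset, v->size);
}
```

A TU switch in this model is a save into the outgoing TU's buffer followed by a restore from the incoming TU's buffer; each buffer must be at least tu_buf_size bytes.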

Module ID Generation

Every compilation unit in CUDA needs a unique identifier to associate host-side registration code with the correct device fatbinary. This identifier -- the module ID -- is generated by make_module_id (sub_5AF830, host_envir.c, ~450 lines) and cached in qword_126F0C0.

Algorithm

The module ID generator has three source modes, tried in order:

Mode 1: Module ID file. If qword_106BF80 (set by --module_id_file_name) is non-NULL, the entire contents of the specified file are read and used as the module ID. This allows build systems to inject deterministic identifiers.

Mode 2: Explicit numeric token. If the caller provides a non-NULL string argument (nptr), it is parsed via strtoul. If the parse succeeds, the numeric value is used directly. If the parse fails (the string is not a pure integer), the string itself is CRC32-hashed and the hash is used.

Mode 3: Default computation. The default path builds the ID from several components:

Calls stat() on the source file to obtain mtime
Formats ctime() of the modification time
Reads getpid() for the current process ID
Collects qword_106C038 (command-line options hash input)
Computes the CRC32 hash of the options string
Takes the output filename, strips it to basename
If the source filename exceeds 8 characters, replaces it with its CRC32 hex representation

The final string is assembled in the format:

{options_crc}_{output_name_len}_{output_name}_{source_or_crc}[_{extra}][_{pid}]

All non-alphanumeric characters in the result are replaced with underscores. The string is allocated permanently and cached in qword_126F0C0.
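The assembly and sanitization step can be sketched as below. The function name assemble_module_id is hypothetical, the components are taken as ready-made strings because the numeric formatting of the CRC is not stated here, and the optional _{extra} and _{pid} suffixes are left out.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Join the mandatory components as
 *   {options_crc}_{output_name_len}_{output_name}_{source_or_crc}
 * then replace every non-alphanumeric character with '_'.
 * Returns 0 on success, -1 if out is too small. Hypothetical helper. */
static int assemble_module_id(char *out, size_t cap, const char *options_crc,
                              const char *output_name,
                              const char *source_or_crc)
{
    int n = snprintf(out, cap, "%s_%zu_%s_%s", options_crc,
                     strlen(output_name), output_name, source_or_crc);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    for (char *p = out; *p; p++) {
        if (!isalnum((unsigned char)*p))
            *p = '_';   /* separators stay '_'; dots, slashes etc. become '_' */
    }
    return 0;
}
```

The length field is computed before sanitization, so it still reflects the original output name even after dots are rewritten.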

Debug tracing (gated by dword_126EFC8) emits:

make_module_id: str1 = %s, str2 = %s, pid = %ld
make_module_id: final string = %s

CRC32 Implementation

The function contains an inline CRC32 implementation that appears three times (for the options hash, the source filename, and the extra string). All three copies use the same algorithm:

Polynomial: 0xEDB88320 (standard reflected CRC-32)
Initial value: 0xFFFFFFFF
Processing: bit-by-bit, 8 iterations per byte
Final XOR: implicit via the reflected algorithm

The triple inlining suggests the CRC32 was originally a macro or small inline function that the compiler expanded at each call site. The polynomial 0xEDB88320 is the bit-reversed form of the standard CRC-32 polynomial 0x04C11DB7; together with the 0xFFFFFFFF initial value, this is consistent with the ubiquitous CRC-32/ISO-HDLC parameter set.
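For reference, a bit-by-bit reflected CRC-32 with these parameters looks like the sketch below. The function name is hypothetical, and the final inversion follows the CRC-32/ISO-HDLC definition; whether the inlined copies in the binary apply the same final step is an assumption, not something established above.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise reflected CRC-32: polynomial 0xEDB88320, initial value
 * 0xFFFFFFFF, eight shift/xor iterations per input byte. */
static uint32_t crc32_bitwise(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, else zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    /* Final inversion per CRC-32/ISO-HDLC (assumed for the binary). */
    return crc ^ 0xFFFFFFFFu;
}
```

With the final inversion in place, the ASCII string "123456789" hashes to 0xCBF43926, the published check value for CRC-32/ISO-HDLC.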

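PID Incorporation

In the default computation (Mode 3), the value returned by getpid() supplies the optional trailing _{pid} component of the format above and is printed as pid = %ld in the debug trace. Because the PID is runtime state, IDs built on this path can differ between otherwise identical builds.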

Module ID File Output

When --gen_module_id_file (flag 83) is set, write_module_id_to_file (sub_5B0180) generates the module ID via sub_5AF830(0) and writes it to the file specified by qword_106BF80 (--module_id_file_name, flag 87). If the filename is not set, it emits "module id filename not specified". If the write fails, it emits "error writing module id to file".

In the backend output phase, if dword_106BFB8 (emit-symbol-table flag) is set, sub_5B0180 is also called to write the module ID before the host reference arrays are emitted.

Entity-Based Module ID Selection

An alternative module ID source is available through use_variable_or_routine_for_module_id_if_needed (sub_5CF030, il.c, line 31969, ~65 lines). Rather than computing a hash from file metadata, this function selects a representative entity (variable or function) from the current TU whose mangled name can serve as a stable identifier. The selection criteria are strict:

Entity kind must be 7 (variable) or 11 (routine), tested via ((kind - 7) & 0xFB) == 0
Must have a definition (for variables: offset +169 != 0; for routines: has a body)
Must not be a class member
Must not be in an unnamed namespace
Must have storage class == 0 (no explicit static, extern, or register)
Must not be template-related or marked with special compilation flags
For routines: must not have explicit specialization, return type must not be a builtin

The selected entity is stored in qword_126F140 with its kind byte in byte_126F138 (7 for variable, 11 for routine). This entity's name is then fed into sub_5AF830 to produce the final module ID string. The entity-based approach provides a more deterministic ID than the PID-based default, since it is derived from source content rather than runtime state.
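The kind test in the first criterion is a compact way to accept exactly two values. A minimal sketch, assuming the kind is an unsigned byte (the helper name is hypothetical): subtracting 7 maps kind 7 to 0 and kind 11 to 4, and the 0xFB mask ignores bit 2, so both results compare equal to zero while every other byte value leaves at least one masked bit set.

```c
#include <stdint.h>

/* Returns nonzero for entity kind 7 (variable) or 11 (routine).
 * (kind - 7) is 0 for 7 and 4 for 11; 0xFB clears bit 2 (value 4). */
static int is_variable_or_routine(uint8_t kind)
{
    return ((uint8_t)(kind - 7) & 0xFB) == 0;
}
```

Note the explicit parentheses around the mask operation: in C, == binds more tightly than &, so the unparenthesized form would compare 0xFB with 0 first.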

Anonymous Namespace Mangling

The module ID directly controls how anonymous namespaces are mangled in the .int.c output. The function sub_6BC7E0 (in nv_transforms.c) constructs the anonymous namespace identifier:

// sub_6BC7E0 implementation:
if (qword_1286A00)                      // cached?
    return qword_1286A00;
module_id = sub_5AF830(0);              // get or compute module ID
buf = malloc(strlen(module_id) + 12);   // "_GLOBAL__N_" = 11 chars + NUL
strcpy(buf, "_GLOBAL__N_");
strcat(buf, module_id);
qword_1286A00 = buf;                    // cache for reuse
return buf;

This _GLOBAL__N_<module_id> string is emitted in the .int.c trailer as:

#define _NV_ANON_NAMESPACE _GLOBAL__N_<module_id>
#ifdef _NV_ANON_NAMESPACE
#endif
#include "<source_file>"
#undef _NV_ANON_NAMESPACE

The #define gives anonymous namespace entities a stable, unique mangled name that is consistent between the device and host compilation paths. The #ifdef/#endif guard is defensive -- it tests that the macro was defined (it always is at this point). The #include re-includes the original source file with the macro defined, allowing the host compiler to see the anonymous namespace entities with their module-ID-qualified names. The #undef cleans up to avoid polluting later inclusions.

The anonymous namespace hash also appears during host reference array name construction. For static or anonymous-namespace device entities, the scoped name prefix builder (sub_6BD2F0) inserts _GLOBAL__N_<module_id> as the namespace component, ensuring the mangled name in the .nvHR* section uniquely identifies the entity even across TUs with the same anonymous namespace structure.

Usage in Output

The module ID is used in three places, two inside the generated .int.c output and one alongside it:

Anonymous namespace mangling: sub_6BC7E0 constructs _GLOBAL__N_<module_id> for anonymous-namespace symbols in device code, producing unique mangled names per TU.
Registration boilerplate: The __cudaRegisterFatBinary call passes the module ID to the CUDA runtime, which uses it to match host registration with the correct device fatbinary.
Module ID file: When requested, the ID is written to a separate file for consumption by the build system or nvlink.

Cross-TU IL Correspondence

When multiple TUs are processed in a single cudafe++ invocation, the same C++ types, templates, and declarations may appear in multiple TUs. EDG's correspondence system verifies structural equivalence and establishes canonical entries to avoid duplicate definitions in the merged output.

trans_copy.c: IL Copying Between TUs

The trans_copy.c file contains a single function at address 0x796BA0:

copy_secondary_trans_unit_IL_to_primary -- Copies IL entries from secondary translation units into the primary TU's IL tree. Called after all TUs have been parsed, during the fe_wrapup finalization phase (specifically, after the 5-pass multi-TU iteration). This function ensures that device-reachable IL entries from secondary TUs are available in the primary TU's output scope.

A closely related function exists at 0x796C00:

mark_secondary_IL_entities_used_from_primary (sub_796C00) -- Called during fe_wrapup pass 2 (IL lowering), before the TU iteration loop that applies sub_707040 to each TU's file-scope IL. This function marks IL entities in secondary TUs that are referenced from the primary TU, ensuring they survive any dead-code elimination in later passes.

trans_corresp.c: Structural Equivalence Checking

The trans_corresp.c file (address range 0x796E60--0x7A3420, 88 functions) implements the full cross-TU correspondence verification system. The core functions:

verify_class_type_correspondence (sub_7A00D0, 703 lines) is the centerpiece. It performs a deep structural comparison of two class types from different TUs:

Base class comparison via sub_7A27B0 (verify_base_class_correspondence) -- iterates base class lists, comparing virtual/non-virtual status, accessibility, and type identity
Friend declaration comparison via sub_7A1830 (verify_friend_declaration_correspondence) -- walks friend lists checking structural equivalence
Member function comparison via sub_7A1DB0 (verify_member_function_correspondence, 411 lines) -- compares function signatures, attributes, constexpr status, and virtual overrides
Nested type comparison via sub_798960 (equiv_member_constants) -- verifies nested class/enum/typedef correspondence
Template parameter comparison via sub_7B2260 -- validates template parameter lists match structurally
Using declaration comparison -- dispatches by kind: 36 = alias, 6/11 = using declaration, 7/58 = namespace using declaration

If any comparison fails, the function delegates to sub_797180 to emit a diagnostic (error codes 1795/1796), then falls through to f_set_no_trans_unit_corresp (5 variants at sub_797B50-sub_7981A0 for different entity kinds).

The type node layout used by the correspondence system:

Offset +132: type kind (9=struct, 10=class, 11=union)
Offset +144: referenced type / next pointer
Offset +152: class info pointer
Offset +161: flags byte (bits for anonymous, elaborated, template, local)
Class info at +128: scope block with members at indexed offsets [12], [13], [14], [18], [22]

Supporting verification functions:

Address	Name	Scope
`sub_7A0E10`	`verify_enum_type_correspondence`	Enum underlying type and enumerator list
`sub_7A1230`	`verify_function_type_correspondence`	Parameter and return type
`sub_7A1390`	`verify_type_correspondence`	Dispatcher to class/enum/function variants
`sub_7A1460`	`set_type_correspondence`	Links two types as corresponding
`sub_7A1CC0`	`verify_nested_class_body_correspondence`	Nested class scope comparison
`sub_7A2C10`	`verify_template_parameter_correspondence`	Template parameter list
`sub_7A3140`	`check_decl_correspondence_with_body`	Declaration with definition
`sub_7A3420`	`check_decl_correspondence_without_body`	Declaration-only case
`sub_7A38A0`	`check_decl_correspondence`	Dispatcher (with/without body)
`sub_7A38D0`	`same_source_position`	Source position comparison
`sub_7999C0`	`find_template_correspondence`	Cross-TU template entity matching (601 lines)
`sub_79A5A0`	`determine_correspondence`	General correspondence determination
`sub_79B8D0`	`mark_canonical_instantiation`	Updates instantiation canonical status
`sub_79C1A0`	`get_canonical_entry_of`	Returns canonical entity for a TU entry
`sub_79D080`	`establish_instantiation_correspondences`	Links instantiations across TUs
`sub_79DFC0`	`set_type_corresp`	Sets type correspondence
`sub_79E760`	`find_routine_correspondence`	Cross-TU function matching
`sub_79F320`	`find_namespace_correspondence`	Cross-TU namespace matching

Correspondence Lifecycle

The correspondence system uses three hash tables (qword_12C7800, qword_12C7880, qword_12C7900, each 0x70 bytes / 14 slots) plus linked lists to track established correspondences. The lifecycle:

Registration (sub_7A3920): Registers three global variables (dword_106B9E4, dword_106B9E0, qword_12C7798) for per-TU save/restore
Initialization (sub_7A3980): Zeroes all correspondence hash tables and list pointers
Discovery during parsing: As the secondary TU is parsed, types/functions that match primary-TU entities are identified through name and scope comparison
Verification: verify_class_type_correspondence and its siblings perform deep structural comparison
Linkage: set_type_correspondence (sub_7A1460) and f_set_trans_unit_corresp (sub_79C400, 511 lines) connect matching entities
Canonicalization: canonical_ranking (sub_796E60) determines which TU's entity is the canonical representative; mark_canonical_instantiation (sub_79B8D0) updates instantiation records

The correspondence allocation uses 24-byte nodes from a free list (qword_12C7AB0) managed by alloc_trans_unit_corresp (sub_7A3B50) and free_trans_unit_corresp (sub_7A3BB0). The free function decrements a refcount at offset +16; when it reaches 1, the node returns to the free list.

Integration with fe_wrapup

The cross-TU correspondence system hooks into the 5-pass multi-TU architecture in fe_wrapup (sub_588E90):

Pass	Action	Cross-TU Role
1	Per-file IL wrapup (`sub_588C60`)	Iterates TU chain, prepares file scope IL
2	IL lowering (`sub_707040`)	Calls `sub_796C00` (mark secondary IL) before loop
3	IL emission (`sub_610420`, arg 23)	Marks device-reachable entries per TU
4	C++ class finalization	Deferred member processing
5	Per-file part 3 (`sub_588D40`)	Final per-TU cleanup
Post	Cleanup	Calls `sub_796BA0` (copy secondary IL to primary)

After all five passes complete, sub_796BA0 copies remaining secondary-TU IL into the primary TU's tree, and scope renumbering fixes up any index conflicts.

Host Reference Arrays and Linkage Splitting

The six .nvHR* ELF sections emitted in the .int.c output trailer encode device symbol names for CUDA runtime discovery. These arrays are split along two axes: symbol type (kernel, device variable, constant variable) and linkage (external, internal). The split is critical for RDC: external-linkage symbols are globally resolvable by nvlink across all TUs, while internal-linkage symbols are TU-local and require module-ID-based prefixing to avoid collisions.

Section	Array Name	Symbol Type	Linkage
`.nvHRKE`	`hostRefKernelArrayExternalLinkage`	`__global__` kernel	external
`.nvHRKI`	`hostRefKernelArrayInternalLinkage`	`__global__` kernel	internal
`.nvHRDE`	`hostRefDeviceArrayExternalLinkage`	`__device__` variable	external
`.nvHRDI`	`hostRefDeviceArrayInternalLinkage`	`__device__` variable	internal
`.nvHRCE`	`hostRefConstantArrayExternalLinkage`	`__constant__` variable	external
`.nvHRCI`	`hostRefConstantArrayInternalLinkage`	`__constant__` variable	internal

The emission is driven by 6 calls to nv_emit_host_reference_array (sub_6BCF80, 79 lines, nv_transforms.c) with parameters (emit_callback, is_kernel, is_device, is_internal_linkage):

// From sub_489000 (process_file_scope_entities), backend output phase:
if (dword_106BFD0 || dword_106BFCC) {
    sub_6BCF80(sub_467E50, 1, 0, 1);  // kernel, internal
    sub_6BCF80(sub_467E50, 1, 0, 0);  // kernel, external
    sub_6BCF80(sub_467E50, 0, 1, 1);  // device, internal
    sub_6BCF80(sub_467E50, 0, 1, 0);  // device, external
    sub_6BCF80(sub_467E50, 0, 0, 1);  // constant, internal
    sub_6BCF80(sub_467E50, 0, 0, 0);  // constant, external
}

Each call iterates a separate global list that was populated during the entity walk:

List Address	Content
`unk_1286880`	kernel external
`unk_12868C0`	kernel internal
`unk_1286780`	device external
`unk_12867C0`	device internal
`unk_1286800`	constant external
`unk_1286840`	constant internal
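The mapping from the three boolean arguments to a section can be written as a small lookup. This is a sketch: host_ref_section is a hypothetical helper, and the flag interpretation (kernel first, then device, otherwise constant) is read off the six call sites shown above.

```c
#include <stddef.h>

/* Map the (is_kernel, is_device, is_internal) triple used in the six
 * sub_6BCF80 calls to the corresponding .nvHR* section name. */
static const char *host_ref_section(int is_kernel, int is_device,
                                    int is_internal)
{
    if (is_kernel)
        return is_internal ? ".nvHRKI" : ".nvHRKE";
    if (is_device)
        return is_internal ? ".nvHRDI" : ".nvHRDE";
    /* Neither kernel nor device: __constant__ variable. */
    return is_internal ? ".nvHRCI" : ".nvHRCE";
}
```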

Entity registration into these lists is performed by nv_get_full_nv_static_prefix (sub_6BE300, 370 lines, nv_transforms.c:2164). This function examines each device-annotated entity and routes it to the appropriate list based on its execution space bits (at entity offset +182) and linkage (internal linkage = static or anonymous namespace, determined by flags at entity offset +80).

For internal linkage entities, the function builds a scoped name prefix:

Recursively constructs the scope path via sub_6BD2F0 (nv_build_scoped_name_prefix)
For anonymous namespaces, inserts the _GLOBAL__N_<module_id> prefix (via qword_1286A00)
Hashes the full path with format_string_to_sso (sub_6BD1C0)
Constructs the prefix: off_E7C768 + len + "_" + filename + "_"
Caches the prefix in qword_1286760 for reuse
Appends "_" and the entity's mangled name

For external linkage entities, the path is simpler: the :: scope-qualified name is used directly without module-ID-based prefixing.

The generated output for each array has this shape (shown for the external-linkage kernel array):

extern "C" {
    extern __attribute__((section(".nvHRKE")))
           __attribute__((weak))
    const unsigned char hostRefKernelArrayExternalLinkage[] = {
        0x5f, 0x5a, /* ... mangled name bytes ... */ 0x00
    };
}

The __attribute__((weak)) allows multiple TUs to define the same array without linker errors -- the CUDA runtime reads whichever copy survives.
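The initializer bytes are the characters of the symbol name followed by a terminating zero. The sketch below formats one name that way; format_name_bytes is a hypothetical helper, and how the emitter lays out several names within one array is not covered here.

```c
#include <stdio.h>
#include <string.h>

/* Format name as "0x5f, 0x5a, ..., 0x00" (including the NUL byte).
 * Returns 0 on success, -1 if out is too small. Hypothetical helper. */
static int format_name_bytes(char *out, size_t cap, const char *name)
{
    size_t used = 0;
    const unsigned char *p = (const unsigned char *)name;
    for (;; p++) {
        int n = snprintf(out + used, cap - used, "%s0x%02x",
                         used ? ", " : "", (unsigned)*p);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
        if (*p == '\0')   /* stop after the terminator has been emitted */
            break;
    }
    return 0;
}
```

For the two-character prefix "_Z" this produces 0x5f, 0x5a, 0x00, matching the leading bytes in the example above.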

Host Stub Linkage Flags

Three CLI flags control the linkage of generated host stubs:

--host-stub-linkage-explicit (Flag 47)

When set, host stubs are emitted with explicit linkage specifiers rather than relying on the default linkage of the surrounding context. This ensures that the stub's linkage matches what nvcc/nvlink expects regardless of the source file's linkage context (e.g., inside an anonymous namespace or extern "C" block).

--static-host-stub (Flag 48)

Forces all generated host stubs (__wrapper__device_stub_*) to have static linkage. This is used in single-TU compilation where the stubs do not need to be visible to other object files. It prevents symbol conflicts when the same kernel name appears in multiple compilation units that are linked together.

--static-global-template-stub (set_flag Mechanism)

Unlike the direct CLI flags above, -static-global-template-stub is set through the generic --set_flag mechanism (flag 193), which looks up the name in the off_D47CE0 table and stores the value. It has 4 usage contexts in the binary, all in error message strings.

When enabled (=true), template __global__ function stubs receive static linkage. This prevents ODR violations in whole-program mode when the same template kernel is instantiated in multiple host-side TUs. The tradeoff is that extern template kernels and out-of-TU instantiations become illegal (see the constraints in the whole-program section above).

Output Differences Between Modes

Output Aspect	Whole-Program (-rdc=false)	Separate Compilation (-rdc=true)
Host stub linkage	Can be `static` (with flags 47/48)	External (default)
Template stub linkage	`static` (with -static-global-template-stub)	External
Module ID generation	Generated but less critical	Required for registration matching
Module ID file	Optional	Typically generated
Device code embedding	Inline fatbinary in host object	Relocatable device object (.rdc)
nvlink requirement	No	Yes (resolves device symbols)
Dynamic parallelism	Forbidden	Allowed
Extern device variables	Forbidden	Allowed
Anonymous namespace hash	Used for device symbol uniqueness	Used for device symbol uniqueness
Deferred function list	Active (breakpoint placeholders)	Behavior depends on `dword_106BFDC`
Cross-TU correspondence	N/A (single TU)	Active when multi-TU invocation

Global Variables

Address	Size	Name	Purpose
`dword_106BFBC`	4	`whole_program_mode`	Whole-program mode; also set by `--debug_mode` (flag 82, which sets `dword_106BFC4=1`, `dword_106BFC0=1`, `dword_106BFBC=1`)
`dword_106BFDC`	4	`skip_device_only`	Disables deferred function list accumulation
`dword_106BFB8`	4	`emit_symbol_table`	Emit symbol table + module ID to file
`dword_106BFD0`	4	`device_registration`	Device registration / cross-space reference checking
`dword_106BFCC`	4	`constant_registration`	Constant registration flag
`qword_126F0C0`	8	`cached_module_id`	Cached module ID string
`qword_106BF80`	8	`module_id_file_path`	Module ID file path (from `--module_id_file_name`)
`qword_106BA10`	8	`current_translation_unit`	Pointer to current TU descriptor
`qword_106B9F0`	8	`primary_translation_unit`	Pointer to first TU (primary)
`qword_106BA18`	8	`translation_unit_stack`	Top of TU stack
`dword_106B9E8`	4	`tu_stack_depth`	TU stack depth (excluding primary)
`qword_12C7AA8`	8	`registered_variable_list_head`	Per-TU variable registration list
`qword_12C7A98`	8	`per_tu_storage_size`	Total per-TU buffer size
`qword_12C7AB0`	8	`corresp_free_list`	Correspondence node free list
`qword_12C7AB8`	8	`stack_entry_free_list`	TU stack entry free list
`qword_1065840`	8	`deferred_function_list`	Breakpoint placeholder linked list head

Function Map

Address	Name	Source File	Lines	Role
`sub_5AF830`	`make_module_id`	host_envir.c	~450	CRC32-based unique TU identifier
`sub_5AF7F0`	`set_module_id`	host_envir.c	~10	Setter for cached module ID
`sub_5AF820`	`get_module_id`	host_envir.c	~3	Getter for cached module ID
`sub_5B0180`	`write_module_id_to_file`	host_envir.c	~30	Writes module ID to file
`sub_5CF030`	`use_variable_or_routine_for_module_id_if_needed`	il.c:31969	~65	Selects representative entity for ID
`sub_6BC7E0`	(anon namespace hash)	nv_transforms.c	~20	Generates `_GLOBAL__N_<module_id>`
`sub_6BCF80`	`nv_emit_host_reference_array`	nv_transforms.c	79	Emits `.nvHR*` ELF section with symbol names
`sub_6BD2F0`	`nv_build_scoped_name_prefix`	nv_transforms.c	~95	Recursive scope-qualified name builder
`sub_6BE300`	`nv_get_full_nv_static_prefix`	nv_transforms.c:2164	~370	Scoped name + host ref array registration
`sub_796BA0`	`copy_secondary_trans_unit_IL_to_primary`	trans_copy.c	~50	Copies secondary TU IL to primary
`sub_796C00`	`mark_secondary_IL_entities_used_from_primary`	--	--	Marks secondary IL referenced from primary
`sub_796E60`	`canonical_ranking`	trans_corresp.c	--	Determines canonical TU entry
`sub_7975D0`	`may_have_correspondence`	trans_corresp.c	--	Quick correspondence eligibility check
`sub_797990`	`f_change_canonical_entry`	trans_corresp.c	--	Updates canonical representative
`sub_7983A0`	`f_same_name`	trans_corresp.c	--	Cross-TU symbol name comparison
`sub_79C400`	`f_set_trans_unit_corresp`	trans_corresp.c	511	Establishes entity correspondence
`sub_7A00D0`	`verify_class_type_correspondence`	trans_corresp.c	703	Deep class structural comparison
`sub_7A0E10`	`verify_enum_type_correspondence`	trans_corresp.c	--	Enum comparison
`sub_7A1230`	`verify_function_type_correspondence`	trans_corresp.c	--	Function type comparison
`sub_7A1460`	`set_type_correspondence`	trans_corresp.c	--	Links corresponding types
`sub_7A1DB0`	`verify_member_function_correspondence`	trans_corresp.c	411	Member function comparison
`sub_7A27B0`	`verify_base_class_correspondence`	trans_corresp.c	--	Base class list comparison
`sub_7A3920`	`register_trans_corresp_variables`	trans_corresp.c	--	Registers per-TU state variables
`sub_7A3980`	`init_trans_corresp_state`	trans_corresp.c	--	Zeroes all correspondence state
`sub_7A3A50`	`save_translation_unit_state`	trans_unit.c	--	Saves current TU state to buffer
`sub_7A3C00`	`f_register_trans_unit_variable`	trans_unit.c	--	Registers a per-TU variable
`sub_7A3CF0`	`fix_up_translation_unit`	trans_unit.c	--	Finalizes TU state
`sub_7A3D60`	`switch_translation_unit`	trans_unit.c	--	Saves/restores TU context
`sub_7A3EF0`	`push_translation_unit_stack`	trans_unit.c	--	Pushes TU onto stack
`sub_7A3F70`	`pop_translation_unit_stack`	trans_unit.c	--	Pops TU from stack
`sub_7A40A0`	`process_translation_unit`	trans_unit.c	--	Main TU processing entry point
`sub_7A4690`	`register_builtin_trans_unit_variables`	trans_unit.c	--	Registers 3 core per-TU vars

Cross-References

Kernel Stub Generation -- -static-global-template-stub details and the stub toggle mechanism
Device/Host Separation -- How the single-pass tag-and-filter architecture works
.int.c File Format -- Anonymous namespace mangling and module ID in output
Backend Code Generation -- Module ID output phase
Host Reference Arrays -- .nvHR* section format and runtime discovery
CLI Flag Inventory -- Flag indices 47, 48, 77, 83, 87
CUDA Error Catalog -- Category 11 (RDC / whole-program diagnostics)
EDG 6.6 Overview -- Cross-TU correspondence section
Template Engine -- Template instantiation deduplication across TUs
Global Variable Index -- All globals referenced here

JIT Mode

JIT mode is a compilation mode where cudafe++ produces device code only -- no host .int.c file, no kernel stubs, no CUDA runtime registration tables. The output is a standalone device IL payload suitable for runtime compilation via NVRTC (nvrtcCompileProgram) or direct loading through the CUDA Driver API (cuModuleLoadData, cuModuleLoadDataEx). Because there is no host compiler invocation downstream, anything that belongs exclusively to the host side is illegal: explicit __host__ functions, unannotated functions (which default to __host__), namespace-scope variables without memory-space qualifiers, non-const class static data members, and lambda closures inferred to have __host__ execution space.

The --default-device flag inverts the annotation default -- unannotated entities become __device__ instead of __host__, allowing C++ code written without CUDA annotations to compile directly for the GPU. This is the recommended workaround for all four unannotated-entity diagnostics.

Key Facts

Property	Value
Compilation output	Device IL only (no `.int.c`, no stubs, no registration)
Host output suppression	`--gen_c_file_name` (flag 45) not supplied by driver
Device output path	`--gen_device_file_name` (flag 85)
Default execution space (normal)	`__host__` (entity+182 byte == `0x00`)
Default execution space (JIT + `--default-device`)	`__device__` (entity+182 byte == `0x23`)
Annotation override flag	`--default-device` (passed to cudafe++ by NVRTC or nvcc)
RDC mode flag	`--device-c` (flag 77) -- relocatable device code; orthogonal to JIT
JIT diagnostic count	5 error messages (1 explicit-host + 4 unannotated-entity)
Diagnostic tag suffix	All five tags end with `_in_jit`
NVRTC integration	NVRTC calls cudafe++ with JIT-appropriate flags internally
Driver API consumers	`cuModuleLoadData`, `cuModuleLoadDataEx`, `cuLinkAddData`

How JIT Mode Is Activated

cudafe++ is never invoked directly by application code. In the standard offline compilation pipeline, nvcc invokes cudafe++ with both --gen_c_file_name (flag 45, the host .int.c path) and --gen_device_file_name (flag 85, the device IL path). Both outputs are generated from a single frontend invocation -- cudafe++ uses a single-pass architecture internally (see Device/Host Separation).

In JIT mode, the driving tool -- typically NVRTC -- invokes cudafe++ with only the device-side output path. The host-output file name (--gen_c_file_name) is not provided, so no .int.c file is generated. The absence of a host output target is what structurally makes this "JIT mode": without a host file, there is no host compiler to feed, and therefore no host-side constructs can be tolerated.
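The structural condition above can be sketched as a predicate over the argument list. This is an illustration only, not the check the binary performs: the struct and function names are hypothetical, and only the two flag spellings come from the text.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical summary of which output paths the driving tool supplied.
struct OutputTargets {
    bool has_host_file = false;    // --gen_c_file_name (flag 45) seen
    bool has_device_file = false;  // --gen_device_file_name (flag 85) seen
};

// Scan an argument vector for the two output-path flags.
// Flag values (the file names) are ignored by this sketch.
inline OutputTargets scan_output_flags(const std::vector<std::string>& args) {
    OutputTargets t;
    for (const std::string& a : args) {
        if (a == "--gen_c_file_name") t.has_host_file = true;
        if (a == "--gen_device_file_name") t.has_device_file = true;
    }
    return t;
}

// Device output requested, host output absent: the structural shape of JIT mode.
inline bool looks_like_jit_invocation(const OutputTargets& t) {
    return t.has_device_file && !t.has_host_file;
}
```

An argument list that carries only --gen_device_file_name satisfies the predicate; the offline nvcc pipeline, which supplies both paths, does not.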

Activation Conditions

JIT mode is not a single user-facing CLI flag. It is an internal compilation state activated by the combination of flags that the driving tool (nvcc or NVRTC) sets when invoking cudafe++:

NVRTC invocation. NVRTC always invokes cudafe++ in JIT mode. NVRTC compiles CUDA C++ source to PTX at application runtime. There is no host compiler, no host object file, and no linking -- the output is pure device code.
nvcc --ptx or --cubin without host compilation. When nvcc is asked to produce only PTX or cubin output (no host object), it may invoke cudafe++ with the JIT mode configuration to skip host-side generation entirely.
Architecture target combined with device-only flags. The internal JIT state is set when the target configuration (--target, flag 245 -> dword_126E4A8) is combined with device-only compilation flags (e.g., --device-syntax-only, flag 72).

The practical effect: when JIT mode is active, the entire implicit-host-annotation system becomes a source of errors rather than a convenience. Every function without __device__ or __global__ defaults to __host__, and host entities are illegal.

NVRTC Runtime Compilation Path

NVRTC (libnvrtc.so / nvrtc64_*.dll) is NVIDIA's runtime compilation library. Application code calls nvrtcCreateProgram with CUDA C++ source text, then nvrtcCompileProgram to compile it. Internally, NVRTC embeds a complete CUDA compilation pipeline including cudafe++ and cicc, invoking them with JIT-appropriate flags:

Application
    |
    v
nvrtcCompileProgram(prog, numOptions, options)
    |
    v
cudafe++ --target <sm_code> --gen_device_file_name <tmpfile> [--default-device] ...
    |                    (no --gen_c_file_name => JIT mode)
    v
cicc <tmpfile> --> PTX
    |
    v
ptxas / cuModuleLoadData --> device binary (cubin)

The user-facing NVRTC options (--gpu-architecture=compute_90, --device-debug, etc.) are translated by the NVRTC library into internal cudafe++ and cicc flags. The --default-device flag is passed through when the user includes it in the NVRTC options array.

CUDA Driver API Consumption

The PTX or cubin produced by the JIT pipeline is consumed by the CUDA Driver API:

cuModuleLoadData / cuModuleLoadDataEx: Load a compiled module (PTX or cubin) into the current context. The driver JIT-compiles PTX to native binary at load time.
cuLinkAddData / cuLinkComplete: Link multiple compiled objects into a single module (JIT linking for RDC workflows).
cuModuleGetFunction: Retrieve a __global__ kernel handle from the loaded module for launch via cuLaunchKernel.

Because JIT-compiled code has no host-side registration (no __cudaRegisterFunction calls, no fatbin embedding), the Driver API is the only path to launch kernels from JIT-compiled modules. The CUDA Runtime API launch syntax (<<<>>>) is not available for JIT-compiled kernels -- the application must use cuLaunchKernel explicitly.

The --default-device Flag

In normal (offline) compilation, functions and namespace-scope variables without explicit CUDA annotations default to __host__. This default makes sense when both host and device outputs are generated: the unannotated entities go into the host .int.c file and are compiled by the host compiler.

In JIT mode, this default is counterproductive. Most code intended for JIT compilation targets the GPU, and requiring explicit __device__ on every function and variable is verbose and incompatible with header-only libraries written for standard C++.

The --default-device flag changes the default:

Entity type	Default without `--default-device`	Default with `--default-device`
Unannotated function	`__host__` (entity+182 == `0x00`)	`__device__` (entity+182 == `0x23`)
Namespace-scope variable (no memory space)	Host variable	`__device__` variable (entity+148 bit 0 set)
Non-const class static data member	Host variable	`__device__` variable
Lambda closure class (namespace scope)	`__host__` inferred space	`__device__` inferred space
Explicitly `__host__` function	`__host__` (unchanged)	`__host__` (unchanged -- always error in JIT)
Explicitly `__device__` function	`__device__` (unchanged)	`__device__` (unchanged)
`__global__` kernel	`__global__` (unchanged)	`__global__` (unchanged)

Entities with explicit annotations are unaffected. Only entities that would otherwise receive the implicit __host__ default are redirected to __device__.

Interaction with Entity+182

The execution-space bitfield at entity+182 (documented in Execution Spaces) is set during attribute application. Without --default-device, an unannotated function has byte 0x00 at entity+182 -- the 0x30 mask extracts 0x00, which is treated as implicit __host__. With --default-device active, the frontend treats unannotated functions as if __device__ had been applied, setting the byte at entity+182 to 0x23 (the standard __device__ OR mask: device_capable | device_explicit | device_annotation).

This means the downstream subsystems -- keep-in-IL marking, cross-space validation, device-only filtering -- all see a properly-annotated __device__ entity and process it identically to an explicitly annotated one. The flag does not add a "JIT mode" code path through every subsystem; it simply changes the default annotation, and the existing execution-space machinery handles the rest.
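The byte arithmetic described above can be sketched as follows. The two OR masks and the 0x30 classification mask come from the text; the enum, the function names, and the reading of 0x10, 0x20, and 0x30 as explicit-host, device, and host-device are assumptions derived arithmetically from those masks. __global__ is not modeled.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint8_t kDeviceOrMask = 0x23;  // OR mask applied for __device__
constexpr std::uint8_t kHostOrMask   = 0x15;  // OR mask applied for __host__
constexpr std::uint8_t kSpaceMask    = 0x30;  // classification mask on entity+182

// Hypothetical labels for the four values the 0x30 mask can extract.
enum class SpaceClass { ImplicitHost, ExplicitHost, Device, HostDevice };

constexpr SpaceClass classify(std::uint8_t byte182) {
    return (byte182 & kSpaceMask) == 0x00 ? SpaceClass::ImplicitHost
         : (byte182 & kSpaceMask) == 0x10 ? SpaceClass::ExplicitHost
         : (byte182 & kSpaceMask) == 0x20 ? SpaceClass::Device
                                          : SpaceClass::HostDevice;
}

// With --default-device, only entities that would be implicit __host__
// receive the __device__ OR mask; annotated entities pass through unchanged.
constexpr std::uint8_t apply_default_space(std::uint8_t byte182, bool default_device) {
    return (default_device && (byte182 & kSpaceMask) == 0x00)
               ? static_cast<std::uint8_t>(byte182 | kDeviceOrMask)
               : byte182;
}

static_assert(classify(apply_default_space(0x00, false)) == SpaceClass::ImplicitHost, "");
static_assert(apply_default_space(0x00, true) == kDeviceOrMask, "");
static_assert(classify(apply_default_space(kHostOrMask, true)) == SpaceClass::ExplicitHost, "");
```

Because the redirect happens once, at the point where the default is chosen, every later consumer of the byte sees an ordinary __device__ value and needs no JIT-specific branch.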

How to Pass the Flag

In normal nvcc workflows, --default-device is passed through -Xcudafe:

nvcc -Xcudafe --default-device source.cu

In NVRTC workflows, the flag is passed via the nvrtcCompileProgram options array:

const char *opts[] = {"--default-device"};
nvrtcCompileProgram(prog, 1, opts);

JIT Mode Diagnostics

Five error messages enforce JIT mode restrictions. All five are emitted during semantic analysis when the frontend encounters an entity that cannot exist in a device-only compilation. The messages are self-documenting: four of the five include an explicit suggestion to use --default-device.

Diagnostic 1: Explicit __host__ Function

Tag: no_host_in_jit

Message:

A function explicitly marked as a __host__ function is not allowed in JIT mode

Trigger: The function declaration carries an explicit __host__ annotation (entity+182 has bit 4 set via the 0x15 OR mask from apply_nv_host_attr at sub_4108E0). This is unconditionally illegal in JIT mode -- there is no device-side representation of a host-only function, and JIT mode produces no host output.

No --default-device suggestion: This is the only JIT diagnostic that does not suggest --default-device. The flag only affects unannotated entities. An explicit __host__ annotation overrides the default. The fix must be a source code change: remove __host__, change it to __device__, or change it to __host__ __device__.

Example:

// JIT mode: error no_host_in_jit
__host__ void setup() { /* ... */ }

// Fix options:
__device__ void setup() { /* ... */ }
__host__ __device__ void setup() { /* ... */ }  // if needed in both contexts

Diagnostic 2: Unannotated Function

Tag: unannotated_function_in_jit

Message:

A function without execution space annotations (__host__/__device__/__global__)
is considered a host function, and host functions are not allowed in JIT mode.
Consider using -default-device flag to process unannotated functions as __device__
functions in JIT mode

Trigger: A function entity has (entity+182 & 0x30) == 0x00 -- no explicit execution-space annotation. By default this means implicit __host__, which is illegal in JIT mode.

Fix: Either add __device__ to the function declaration, or compile with --default-device.

Example:

// JIT mode without --default-device: error unannotated_function_in_jit
int compute(int x) { return x * x; }

// Fix 1: explicit annotation
__device__ int compute(int x) { return x * x; }

// Fix 2: compile with --default-device (function becomes implicitly __device__)

Diagnostic 3: Unannotated Namespace-Scope Variable

Tag: unannotated_variable_in_jit

Message:

A namespace scope variable without memory space annotations
(__device__/__constant__/__shared__/__managed__) is considered a host variable,
and host variables are not allowed in JIT mode. Consider using -default-device flag
to process unannotated namespace scope variables as __device__ variables in JIT mode

Trigger: A variable declared at namespace scope (including global scope and anonymous namespaces) lacks a CUDA memory-space annotation. In normal compilation, such variables live in host memory. In JIT mode, host memory is inaccessible.

The check applies to the memory-space bitfield at entity+148, not the execution-space bitfield at entity+182. Without any annotation, none of the memory-space bits (__device__ bit 0, __shared__ bit 1, __constant__ bit 2, __managed__ bit 3) are set.

Scope note: This check targets namespace-scope variables only. Local variables inside __device__ or __global__ functions are not subject to this check -- they live on the device stack or in registers.
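A minimal sketch of the memory-space test, assuming the bit positions listed above. The constant and function names are hypothetical, and the predicate is a simplification of whatever the frontend actually evaluates.

```cpp
#include <cassert>
#include <cstdint>

// Memory-space bits at entity+148, in the order given in the text.
constexpr std::uint8_t kMemDevice   = 0x01;  // __device__,   bit 0
constexpr std::uint8_t kMemShared   = 0x02;  // __shared__,   bit 1
constexpr std::uint8_t kMemConstant = 0x04;  // __constant__, bit 2
constexpr std::uint8_t kMemManaged  = 0x08;  // __managed__,  bit 3
constexpr std::uint8_t kMemAny = kMemDevice | kMemShared | kMemConstant | kMemManaged;

// No memory-space bit set means the variable lives in host memory.
constexpr bool is_host_variable(std::uint8_t byte148) {
    return (byte148 & kMemAny) == 0;
}

// The diagnostic applies only to namespace-scope variables, and only when
// --default-device has not redirected the default to __device__.
constexpr bool reports_unannotated_variable(std::uint8_t byte148,
                                            bool namespace_scope,
                                            bool default_device) {
    return namespace_scope && !default_device && is_host_variable(byte148);
}

static_assert(reports_unannotated_variable(0x00, true, false), "");
static_assert(!reports_unannotated_variable(kMemConstant, true, false), "");
static_assert(!reports_unannotated_variable(0x00, false, false), "");
```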

Fix: Add a memory-space annotation, or compile with --default-device.

Example:

// JIT mode without --default-device: error unannotated_variable_in_jit
int table[256] = { /* ... */ };

// Fix 1: mutable device memory
__device__ int table[256] = { /* ... */ };

// Fix 2: read-only data
__constant__ int table[256] = { /* ... */ };

Diagnostic 4: Non-Const Class Static Data Member

Tag: unannotated_static_data_member_in_jit

Message:

A class static data member with non-const type is considered a host variable,
and host variables are not allowed in JIT mode. Consider using -default-device flag
to process such data members as __device__ variables in JIT mode

Trigger: A class or struct has a static data member whose type is not const-qualified. Static data members are allocated at namespace scope (not per-instance), so they are subject to the same host-variable prohibition as namespace-scope variables.

Why non-const only: const and constexpr static members with compile-time-constant initializers can be folded into device code by cicc without requiring an actual global variable in host memory. Non-const static members require mutable storage that must be explicitly placed in device memory.

Example:

struct Config {
    // JIT mode without --default-device: error unannotated_static_data_member_in_jit
    static int max_iterations;

    // OK: const with constant initializer (compile-time folding)
    static const int default_value = 42;

    // OK: constexpr (compile-time constant)
    static constexpr float pi = 3.14159f;
};

// Fix: explicit annotation
struct Config {
    __device__ static int max_iterations;
};

Diagnostic 5: Lambda Closure Class with Inferred __host__ Space

Tag: host_closure_class_in_jit

Message:

The execution space for the lambda closure class members was inferred to be __host__
(based on context). This is not allowed in JIT mode. Consider using -default-device
to infer __device__ execution space for namespace scope lambda closure classes.

Trigger: A lambda expression at namespace scope (or in a context where the enclosing function has implicit __host__ space) produces a closure class whose execution space is inferred to be __host__. The lambda was not explicitly annotated with __device__, and the enclosing context is host-only, so cudafe++'s execution-space inference assigns __host__ to the closure class members.

This diagnostic interacts with the extended lambda system (documented in Extended Lambda Overview). In normal compilation, a namespace-scope lambda without annotations is host-only and gets a closure type compiled for the CPU. In JIT mode, that closure type has no valid compilation target.

Fix: Either annotate the lambda with __device__ (requires extended lambdas: --expt-extended-lambda), or pass --default-device to change the inference to __device__.

Example:

// JIT mode without --default-device: error host_closure_class_in_jit
auto fn = [](int x) { return x * 2; };

// Fix 1: explicit annotation (requires --expt-extended-lambda)
auto fn = [] __device__ (int x) { return x * 2; };

// Fix 2: compile with --default-device

Diagnostic Summary

Tag	Entity type	`--default-device` suggested	Suppressible
`no_host_in_jit`	Explicit `__host__` function	No	Yes (via `--diag_suppress`)
`unannotated_function_in_jit`	Function with no annotation	Yes	Yes
`unannotated_variable_in_jit`	Namespace-scope variable, no annotation	Yes	Yes
`unannotated_static_data_member_in_jit`	Non-const static data member	Yes	Yes
`host_closure_class_in_jit`	Lambda closure inferred `__host__`	Yes	Yes

All five diagnostics use the standard cudafe++ diagnostic system. They can be controlled via CLI flags or source pragmas:

--diag_suppress=unannotated_function_in_jit
--diag_warning=no_host_in_jit
#pragma nv_diag_suppress unannotated_variable_in_jit

Warning: Suppressing these diagnostics silences the messages but does not change the underlying problem. The entities still have host execution space and will be absent from the device IL output, leading to link errors or runtime failures when the module is loaded.
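The summary table can be condensed into a small lookup. This is a sketch of the documented behavior, not of the frontend's control flow: the enum and function are hypothetical, and only the five tag strings and the effect of --default-device come from the text.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical classification of the five entity shapes the JIT checks target.
enum class JitEntity {
    ExplicitHostFunction,
    UnannotatedFunction,
    UnannotatedNamespaceVariable,
    NonConstStaticDataMember,
    HostInferredClosureClass
};

// Returns the diagnostic tag reported in JIT mode, or nullptr when
// --default-device turns the entity into a __device__ entity instead.
inline const char* jit_diagnostic_tag(JitEntity kind, bool default_device) {
    switch (kind) {
    case JitEntity::ExplicitHostFunction:
        return "no_host_in_jit";  // explicit __host__ is never redirected
    case JitEntity::UnannotatedFunction:
        return default_device ? nullptr : "unannotated_function_in_jit";
    case JitEntity::UnannotatedNamespaceVariable:
        return default_device ? nullptr : "unannotated_variable_in_jit";
    case JitEntity::NonConstStaticDataMember:
        return default_device ? nullptr : "unannotated_static_data_member_in_jit";
    case JitEntity::HostInferredClosureClass:
        return default_device ? nullptr : "host_closure_class_in_jit";
    }
    return nullptr;  // not reached for valid enumerators
}
```

The asymmetry is visible in the first case: it is the only one whose result ignores the default_device argument, which matches the "No" in the table's suggestion column.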

Architecture: JIT Mode vs Normal Mode

Aspect	Normal (offline) mode	JIT mode
Driver tool	nvcc	NVRTC (or nvcc with `--ptx` / `--cubin`)
Host output (`.int.c`)	Generated via `sub_489000`	Not generated
Device IL output	Generated via keep-in-IL walk	Generated via keep-in-IL walk (identical)
Kernel stubs	`__wrapper__device_stub_` in `.int.c`	Not needed
Registration code	`__cudaRegisterFunction` / `__cudaRegisterVar`	Not emitted
Fatbin embedding	Embedded in host object	Not applicable
Default unannotated space	`__host__`	`__host__` (error) or `__device__` (with `--default-device`)
Kernel launch mechanism	`<<<>>>` -> `cudaLaunchKernel` (Runtime API)	`cuLaunchKernel` (Driver API)
Module loading	Automatic (CUDA runtime startup)	Manual (`cuModuleLoadData`)
Link model	Static linking with host object	JIT linking (`cuLinkAddData`) or direct load

Single-Pass Architecture Impact

cudafe++ uses a single-pass architecture: the EDG frontend parses the source once, builds a unified IL tree, and tags every entity with execution-space bits at entity+182. In normal mode, two output filters run on this tree -- one for the host .int.c file (driven by sub_489000 -> sub_47ECC0), one for the device IL (driven by the keep-in-IL walk at sub_610420). In JIT mode, only the device IL output path runs. The host output path is simply never invoked because no host output was requested.

This means JIT mode does not require a fundamentally different code path through the frontend. Parsing, semantic analysis, template instantiation, and IL construction all proceed identically. The difference manifests at two points:

Diagnostic emission during semantic analysis. The five JIT diagnostics fire when the frontend detects entities that would be host-only. In normal mode, these entities are silently accepted because they will appear in the host output.
Output generation. The backend skips host-file emission entirely. The keep-in-IL walk runs as usual, marking device-reachable entries with bit 7 of the prefix byte (entry_ptr - 8). The device IL writer produces the binary output. No stub generation (gen_routine_decl stub path), no registration table emission, no .int.c formatting.
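The keep-in-IL mark mentioned above amounts to one bit operation on a byte located before the entry. The sketch below assumes only what the text states (bit 7 of the byte at entry_ptr - 8); the helper names are hypothetical, and the rest of the prefix layout is not modeled.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned char kKeepInIlBit = 0x80;  // bit 7 of the prefix byte
constexpr int kPrefixOffset = 8;              // prefix byte sits at entry_ptr - 8

// Mark an IL entry as device-reachable by setting bit 7 of its prefix byte.
inline void mark_keep_in_il(unsigned char* entry_ptr) {
    unsigned char* prefix = entry_ptr - kPrefixOffset;
    *prefix = static_cast<unsigned char>(*prefix | kKeepInIlBit);
}

// Test whether an IL entry carries the keep-in-IL mark.
inline bool is_kept_in_il(const unsigned char* entry_ptr) {
    return (*(entry_ptr - kPrefixOffset) & kKeepInIlBit) != 0;
}
```

A device IL writer built on this would walk the entries and emit only those for which is_kept_in_il returns true, which is why the same walk serves both normal and JIT mode.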

Interaction with Other Modes

RDC (Relocatable Device Code)

JIT mode is orthogonal to RDC (--device-c, flag 77). RDC controls whether device code is compiled for separate linking (enabling cross-TU __device__ function calls and extern __device__ variables), while JIT mode controls whether host output is produced. Both can be active simultaneously -- for example, NVRTC with --relocatable-device-code=true compiles device code for separate device linking without any host output.

When RDC is combined with JIT mode, NVRTC compiles each source file to relocatable device code, and the driver-API linker (cuLinkAddData, cuLinkComplete) resolves cross-references at load time. Without RDC, all device code must be self-contained within a single translation unit.

Extended Lambdas

Extended lambdas (--expt-extended-lambda, controlled by dword_106BF38) interact with JIT mode through the lambda closure class inference. The host_closure_class_in_jit diagnostic targets the case where a lambda's closure is inferred as host-side. With --default-device, the inference changes to device-side, resolving the conflict. Extended lambda capture rules still apply in JIT mode: captures must be trivially device-copyable, a lambda may have at most 1023 captures, and array captures are limited to 7 dimensions.

Relaxed Constexpr

Relaxed constexpr mode (--expt-relaxed-constexpr, flag 104, sets dword_106BFF0) makes constexpr functions implicitly __host__ __device__. In JIT mode, this resolves many unannotated-function errors because constexpr functions gain the __device__ annotation implicitly via the HD bypass (entity+177 bit 4). However, non-constexpr unannotated functions still trigger unannotated_function_in_jit unless --default-device is also active.

Practical Patterns

Pattern 1: Minimal JIT Kernel

// Source passed to nvrtcCreateProgram -- no --default-device needed
extern "C" __global__ void add(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

No annotations needed beyond __global__ on the kernel. All code within the kernel body is implicitly device code. The extern "C" prevents name mangling so the kernel can be found by cuModuleGetFunction.

Pattern 2: JIT-Compiling Library Code with --default-device

// Header-only math library, no CUDA annotations
template <typename T>
T clamp(T val, T lo, T hi) {
    return val < lo ? lo : (val > hi ? hi : val);
}

__global__ void kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = clamp(data[i], 0.0f, 1.0f);
}

Without --default-device, clamp triggers unannotated_function_in_jit. With --default-device, clamp is implicitly __device__ and compiles cleanly.

Pattern 3: Guarding Host Code with Preprocessor

// Use __CUDACC_RTC__ to guard host-only code
#ifndef __CUDACC_RTC__
__host__ void cpu_fallback(float* data, int n) {
    for (int i = 0; i < n; i++) data[i] *= 2.0f;
}
#endif

__global__ void gpu_process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__CUDACC_RTC__ is predefined by NVRTC. Code guarded by #ifndef __CUDACC_RTC__ is invisible to the JIT compiler, avoiding no_host_in_jit errors.

Pattern 4: Static Data Members in JIT

struct Constants {
    static constexpr int BLOCK_SIZE = 256;        // OK: constexpr, folded at compile time
    static const float EPSILON;                    // Error without --default-device (non-constexpr const)
};

#ifdef __CUDACC_RTC__
__device__
#endif
const float Constants::EPSILON = 1e-6f;            // Annotated for JIT mode

Function Map

Address	Name	Lines	Role
`sub_459630`	`proc_command_line`	4105	CLI parser; processes `--default-device` and `--device-c` flags
`sub_452010`	`init_command_line_flags`	3849	Registers all flags including `default-device`
`sub_610420`	`mark_to_keep_in_il`	892	Device IL marking (runs identically in JIT and normal mode)
`sub_489000`	`process_file_scope_entities`	723	Host `.int.c` backend (skipped entirely in JIT mode)
`sub_47ECC0`	`gen_template`	1917	Source-sequence dispatcher; host output path (skipped in JIT)
`sub_40EB80`	`apply_nv_device_attr`	100	Sets `__device__` bits; entity+182 OR `0x23` (function), entity+148 OR `0x01` (variable)
`sub_4108E0`	`apply_nv_host_attr`	31	Sets `__host__` bits; entity+182 OR `0x15`

Cross-References

Execution Spaces -- entity+182 bitfield, __host__/__device__/__global__ OR masks, 0x30 mask classification
Device/Host Separation -- single-pass architecture, keep-in-IL walk, host/device output file generation
Cross-Space Validation -- execution-space call checking (still applies in JIT mode for HD entities)
CUDA Error Catalog -- Category 10 (JIT Mode), all five diagnostic messages with tag names
CLI Flag Inventory -- flag table, --gen_device_file_name (85), --gen_c_file_name (45), --device-c (77)
Architecture Feature Gating -- --target SM code (dword_126E4A8) and feature thresholds
Extended Lambda Overview -- lambda closure class execution-space inference, wrapper types
Kernel Stubs -- __wrapper__device_stub_ mechanism (absent in JIT mode)
RDC Mode -- relocatable device code, separate compilation for device-side linking

Architecture Feature Gating

cudafe++ enforces architecture-dependent feature gates that prevent use of CUDA constructs on hardware that cannot support them. These gates operate at three distinct layers: compile-time SM version checks against dword_126E4A8 during semantic analysis, string-embedded diagnostic messages with architecture names baked into .rodata, and host-compiler version gating controlling which GCC/Clang-specific #pragma directives and language constructs appear in the generated .int.c output. A separate mechanism, the --db debug system, provides runtime tracing that can expose architecture checks as they execute. This page documents all three layers, the global variables involved, every discovered threshold constant, and the complete data flow from nvcc invocation to feature gate evaluation.

Key Facts

Property	Value
SM version storage	`dword_126E4A8` (`sm_architecture`, set by `--target` / case 245)
SM version TU-level copy	`dword_126EBF8` (`target_config_index`, copied during TU init in `sub_586240`)
Architecture parser stub	`sub_7525E0` (6-byte stub returning `-1`; actual parsing done by nvcc)
Post-parse initializer	`sub_7525F0` (`set_target_configuration`, `target.c:299`)
Type table initializer	`sub_7515D0` (sets 100+ type-size/alignment globals, called from `sub_7525F0`)
GCC version global	`qword_126EF98` (default `80100` = GCC 8.1.0, set by `--gnu_version` case 184)
Clang version global	`qword_126EF90` (default `90100` = Clang 9.1.0, set by `--clang_version` case 188)
GCC host dialect flag	`dword_126E1F8` (host compiler identified as GCC)
Clang host dialect flag	`dword_126E1E8` (host compiler identified as Clang)
Host GCC version copy	`qword_126E1F0` (copied from `qword_126EF98` during dialect init)
Host Clang version copy	`qword_126E1E0` (copied from `qword_126EF90` during dialect init)
`--nv_arch` error string	`"invalid or no value specified with --nv_arch flag"` at `0x8884F0`
Debug option parser	`sub_48A390` (`proc_debug_option`, 238 lines, `debug.c`)
Debug trace linked list	`qword_1065870` (head pointer)
Invalid arch sentinel	`-1` (`0xFFFFFFFF`)
Feature threshold count	17 CUDA features across 7 SM versions (20, 30, 52, 60, 70, 80, 90/90a)
Host compiler threshold count	19 version constants across GCC 3.0 through GCC 14.0
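The two defaults in the table (80100 for GCC 8.1.0, 90100 for Clang 9.1.0) are consistent with a packed decimal encoding of major * 10000 + minor * 100 + patch. The sketch below assumes that encoding; it is inferred from those two values only, and the type and function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical decoded form of a packed host-compiler version.
struct HostCompilerVersion {
    int major_part;
    int minor_part;
    int patch_part;
};

// Pack a version as major * 10000 + minor * 100 + patch (assumed encoding).
constexpr long encode_version(int major_part, int minor_part, int patch_part) {
    return major_part * 10000L + minor_part * 100L + patch_part;
}

// Unpack a version value such as the one stored in qword_126EF98.
constexpr HostCompilerVersion decode_version(long packed) {
    return HostCompilerVersion{static_cast<int>(packed / 10000),
                               static_cast<int>(packed / 100 % 100),
                               static_cast<int>(packed % 100)};
}

static_assert(encode_version(8, 1, 0) == 80100, "GCC default from the table");
static_assert(encode_version(9, 1, 0) == 90100, "Clang default from the table");
static_assert(decode_version(80100).minor_part == 1, "");
```

Under this encoding a host-compiler gate is a plain integer comparison, for example packed >= encode_version(14, 0, 0) for a feature that needs GCC 14.0 or newer.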

Layer 1: SM Architecture Input

How the Architecture Reaches cudafe++

cudafe++ never parses architecture strings directly from the user. The driver (nvcc) translates user-facing flags like --gpu-architecture=sm_90 into an internal numeric code and passes it via the --target flag when spawning the cudafe++ process. Inside cudafe++, the --target flag is registered as CLI flag 245 and handled in proc_command_line (sub_459630).

The handler calls sub_7525E0, which in the CUDA Toolkit 13.0 binary is a 6-byte stub:

; sub_7525E0 -- architecture parser stub
; Address: 0x7525E0, Size: 6 bytes
mov     eax, 0FFFFFFFFh    ; return -1 unconditionally
retn

As disassembled, this stub always returns -1 (the invalid-architecture sentinel), which taken literally would make case 245 fail on every invocation. nvcc does place the architecture code in the argument string that sub_7525E0 receives, so the real parsing logic is presumably inlined by the compiler or resolved through a different mechanism at link time; this has not been traced. The handler stores the result in dword_126E4A8:

// proc_command_line (sub_459630), case 245
v80 = sub_7525E0(qword_E7FF28, v23, v20, v30);  // parse SM code from arg string
dword_126E4A8 = v80;                              // store in sm_architecture
if (v80 == -1) {
    sub_4F8420(2664);  // emit error 2664
    // error string: "invalid or no value specified with --nv_arch flag"
    sub_4F2930("cmd_line.c", 12219, "proc_command_line", 0, 0);
    // assert_fail -- reached only if the error handler returns
}
sub_7525F0(v80);  // set_target_configuration

Error 2664 fires when the architecture value is -1. The error string at 0x8884F0 references --nv_arch (the nvcc-facing name for this flag). This string has no direct xrefs in the IDA analysis, meaning it is loaded indirectly through the error message table (off_88FAA0). The --nv_arch name in the error message is a user-facing alias; internally cudafe++ processes it as --target (flag 245).

set_target_configuration (sub_7525F0)

After storing the SM version, sub_7525F0 performs post-parse initialization. This function lives in target.c:299:

// sub_7525F0 -- set_target_configuration
__int64 __fastcall sub_7525F0(int a1)
{
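    if ((unsigned int)(a1 + 1) > 1)  // true unless a1 is -1 or 0
        assert_fail("set_target_configuration", 299);
    sub_7515D0();           // initialize type table for target platform
    qword_126E1B0 = "lib";  // library search path prefix
}

The guard (unsigned int)(a1 + 1) > 1 is an unsigned comparison. As transcribed, it is false only for a1 == -1 (which wraps to 0 when incremented) and a1 == 0 (which becomes 1); every other value, including an SM code such as 70 or 90, makes the condition true and reaches assert_fail. That is the opposite of a check that rejects only -1, so either the decompiled condition is inverted relative to the binary, or -1 and 0 really are the only valid target configuration indices. The second reading would be consistent with sub_7525E0 returning -1, but it has not been confirmed against the disassembly. Either way this is an internal sanity check, not user-facing validation.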
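
The wrap-around arithmetic in this guard is easy to misread, so the following minimal sketch evaluates the same expression for a few inputs. The name guard_fires is hypothetical and does not come from the binary; the addition is done in unsigned arithmetic to avoid signed overflow at INT_MAX.

```c
#include <stdbool.h>

/* Hypothetical helper mirroring the decompiled condition
 * (unsigned int)(a1 + 1) > 1 from sub_7525F0.
 * Returns true when the assertion path would be taken. */
static bool guard_fires(int a1)
{
    /* -1 converts to UINT_MAX, and UINT_MAX + 1u wraps to 0. */
    return ((unsigned int)a1 + 1u) > 1u;
}
```

With this expression, guard_fires(-1) and guard_fires(0) are false, while guard_fires(1), guard_fires(70), and guard_fires(-2) are all true.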

Type Table Initialization (sub_7515D0)

The sub_7515D0 function, called from set_target_configuration, initializes over 100 global variables describing the target platform's type sizes, alignments, and numeric limits. This establishes the data model for CUDA device code:

// sub_7515D0 -- target type initialization (excerpt)
// Sets LP64 data model with CUDA-specific type properties
dword_126E328 = 8;     // sizeof(long)
dword_126E338 = 4;     // sizeof(int)
dword_126E2FC = 16;    // sizeof(long double)
dword_126E308 = 16;    // alignof(long double)
dword_126E2B8 = 8;     // sizeof(pointer)
dword_126E2AC = 8;     // alignof(pointer)
dword_126E420 = 2;     // sizeof(wchar_t)
dword_126E4A0 = 8;     // target vector width
dword_126E258 = 53;    // double mantissa bits
dword_126E250 = 1024;  // double max exponent
dword_126E254 = -1021; // double min exponent
dword_126E234 = 113;   // __float128 mantissa bits
dword_126E22C = 0x4000; // __float128 max exponent
dword_126E230 = -16381; // __float128 min exponent
// ... ~80 more assignments ...

The function unconditionally returns -1, which is not used by the caller.

SM Version Propagation

During translation unit initialization (sub_586240, called from fe_translation_unit_init), the SM version is copied into a TU-level global:

// sub_586240, line 54 in decompiled output
dword_126EBF8 = dword_126E4A8;  // target_config_index = sm_architecture

After this point, architecture checks throughout the compiler read either dword_126E4A8 (the CLI-level global) or dword_126EBF8 (the TU-level copy). Both contain the same integer SM version code. The dual-variable pattern exists because EDG's architecture supports multi-TU compilation where each TU could theoretically target a different architecture (though CUDA compilation always uses a single target per cudafe++ invocation).

Layer 2: CUDA Feature Thresholds

cudafe++ checks the SM architecture version at semantic analysis time to gate CUDA-specific features. When a feature is used on an architecture below its minimum requirement, the compiler emits a diagnostic error or warning. All thresholds below were extracted from error strings embedded in the binary's .rodata section and confirmed through cross-reference with diagnostic tag names.

Complete Feature Threshold Table

Feature	Min Architecture	Diagnostic Tag	Error String
Virtual base classes	compute_20	`use_of_virtual_base_on_compute_1x`	`Use of a virtual base (%t) requires the compute_20 or higher architecture`
Device variadic functions	compute_30	`device_function_has_ellipsis`	`__device__ or __host__ __device__ function with ellipsis requires compute_30 or higher architecture`
`__managed__` variables	compute_30	`unsupported_arch_for_managed_capability`	`__managed__ variables require architecture compute_30 or higher`
`alloca()` in device code	compute_52	`alloca_unsupported_for_lower_than_arch52`	`alloca() is not supported for architectures lower than compute_52`
Atomic scope argument	sm_60	(inline)	`atomic operations' scope argument is supported on architecture sm_60 or above. Fall back to use membar.`
Atomic f64 add/sub	sm_60	(inline)	`atomic add and sub for 64-bit float is supported on architecture sm_60 or above.`
`__nv_atomic_*` functions	sm_60	(inline)	`__nv_atomic_* functions are not supported on arch < sm_60.`
`__grid_constant__`	compute_70	`grid_constant_unsupported_arch`	`__grid_constant__ annotation is only allowed for architecture compute_70 or later`
Atomic memory order	sm_70	(inline)	`atomic operations' argument of memory order is supported on architecture sm_70 or above. Fall back to use membar.`
128-bit atomic load/store	sm_70	(inline)	`128-bit atomic load and store are supported on architecture sm_70 or above.`
16-bit atomic CAS	sm_70	(inline)	`16-bit atomic compare-and-exchange is supported on architecture sm_70 or above.`
`__nv_register_params__`	compute_80	`register_params_unsupported_arch`	`__nv_register_params__ is only supported for compute_80 or later architecture`
`__wgmma_mma_async`	sm_90a	`wgmma_mma_async_not_enabled`	`__wgmma_mma_async builtins are only available for sm_90a`
Atomic cluster scope	sm_90	(inline)	`atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead.`
Atomic cluster scope (load/store)	sm_90	(inline)	`atomic load and store's scope of cluster is supported on architecture sm_90 or above. Using device scope instead.`
128-bit atomic exch/CAS	sm_90	`nv_atomic_exch_cas_b128_not_supported`	`128-bit atomic exchange or compare-and-exchange is supported on architecture sm_90 or above.`

GPU-Architecture-Gated Attributes (No Specific SM in String)

Several features check the architecture but their error strings do not embed a specific SM version number. Instead, they use the generic phrase "this GPU architecture", meaning the threshold is encoded in the comparison logic rather than the diagnostic text:

Feature	Diagnostic Tag	Error String
`__cluster_dims__`	`cluster_dims_unsupported`	`__cluster_dims__ is not supported for this GPU architecture`
`max_blocks_per_cluster`	`max_blocks_per_cluster_unsupported`	`cannot specify max blocks per cluster for this GPU architecture`
`__block_size__`	`block_size_unsupported`	`__block_size__ is not supported for this GPU architecture`
`__managed__` (config)	`unsupported_configuration_for_managed_capability`	`__managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system)`

These features are gated by the same dword_126E4A8 comparison mechanism as the features in the main table, but their exact SM threshold values would require tracing the specific comparison instructions in the semantic analysis functions.

Diagnostic Behavior: Errors vs Warnings vs Demotions

Architecture gate violations produce three distinct behaviors depending on the feature class:

Hard errors -- Compilation halts. Features that fundamentally cannot work on the target architecture:

__managed__ below compute_30 -- No unified memory hardware support
__grid_constant__ below compute_70 -- No hardware constant propagation mechanism
__nv_register_params__ below compute_80 -- Register parameter ABI not available
__wgmma_mma_async below sm_90a -- No warp-group MMA hardware
alloca() below compute_52 -- No dynamic stack allocation support on device
Virtual base classes below compute_20 -- No vtable support on earliest GPU architectures

Fallback warnings -- Compilation continues with degraded behavior. The compiler generates functionally correct but potentially less performant code:

Atomic scope arguments on pre-sm_60 -- Falls back to membar-based synchronization
Atomic memory order on pre-sm_70 -- Falls back to membar-based ordering
64-bit float atomics on pre-sm_60 -- Falls back to CAS loop emulation

Scope demotion warnings -- Informational diagnostics about automatic scope narrowing:

Cluster scope atomics on pre-sm_90 -- Automatically demotes to device scope and reports it in the warning ("Using device scope instead")

compute_XX vs sm_XX Naming

Error strings use two naming conventions that reflect CUDA's split between virtual and physical architectures:

compute_XX -- Virtual architecture. Checked at PTX generation time. Features gated by compute_XX are relevant to the intermediate PTX representation and are independent of the specific GPU die. Examples: __managed__ (requires unified memory ISA support), alloca() (requires dynamic stack frame instructions).
sm_XX -- Physical architecture. Checked at SASS generation time. Features gated by sm_XX are tied to specific hardware capabilities of a GPU die. Examples: 128-bit atomics (require specific load/store unit widths), cluster scope (requires the SM 9.0 thread block cluster hardware).

In practice, cudafe++ stores a single integer in dword_126E4A8 and the distinction is purely semantic -- both forms gate against the same numeric value. The value is a compute capability number (e.g., 70 for Volta, 90 for Hopper).

The sm_90a suffix (with the a accelerator flag) is a special case used exclusively for __wgmma_mma_async builtins. This variant requires the Hopper accelerated architecture, which is distinct from the base sm_90. The a suffix is encoded in the SM integer value passed to cudafe++ by nvcc.

__wgmma_mma_async Detail

The warp-group matrix multiply-accumulate builtin has the most granular validation of any architecture-gated feature. Beyond the sm_90a architecture check, cudafe++ also validates:

Check	Diagnostic Tag	Error String
Architecture gate	`wgmma_mma_async_not_enabled`	`__wgmma_mma_async builtins are only available for sm_90a`
Shape validation	`wgmma_mma_async_bad_shape`	`The shape %s is not supported for __wgmma_mma_async builtin`
A operand type	`wgmma_mma_async_bad_A_type`	(type mismatch diagnostic)
B operand type	`wgmma_mma_async_bad_B_type`	(type mismatch diagnostic)
Missing arguments	`wgmma_mma_async_missing_args`	`The 'A' or 'B' argument to __wgmma_mma_async call is missing`
Non-constant args	`wgmma_mma_async_nonconstant_arg`	`Non-constant argument to __wgmma_mma_async call`

The validation function is identified as check_wgmma_mma_async (string at 0x888CAC). Four type-specific builtin variants are registered: __wgmma_mma_async_f16, __wgmma_mma_async_bf16, __wgmma_mma_async_tf32, and __wgmma_mma_async_f8.

nv_register_params Detail

The register parameter attribute has four distinct checks, only one of which is an architecture gate:

Check	Diagnostic Tag	Error String
Feature enable flag	`register_params_not_enabled`	`__nv_register_params__ support is not enabled`
Architecture gate	`register_params_unsupported_arch`	`__nv_register_params__ is only supported for compute_80 or later architecture`
Function type check	`register_params_unsupported_function`	`__nv_register_params__ is not allowed on a %s function`
Ellipsis check	`register_params_ellipsis_function`	(variadic function diagnostic)

The attribute handler is apply_nv_register_params_attr (string at 0x830C78).

SM Version to Feature Summary

SM Version	Features Introduced	Feature Count
compute_20	Virtual base classes in device code	1
compute_30	`__managed__` variables, device variadic functions	2
compute_52	`alloca()` in device code	1
sm_60	Atomic scope argument, 64-bit float atomics, `__nv_atomic_*` API	3
sm_70	`__grid_constant__`, 128-bit atomic load/store, atomic memory order, 16-bit CAS	4
compute_80	`__nv_register_params__`	1
sm_90 / sm_90a	`__wgmma_mma_async`, thread block clusters, 128-bit atomic exchange/CAS, cluster scope atomics (general and load/store diagnostics)	5

Notably absent from cudafe++ error strings are features like cooperative groups (sm_60+), tensor cores (sm_70+), and dynamic parallelism (sm_35+). These are checked at runtime or by the PTX assembler (ptxas) rather than the language frontend.

Layer 3: Host Compiler Version Gating

cudafe++ generates .int.c output that must compile cleanly under the host C++ compiler (GCC, Clang, or MSVC). Because different host compiler versions support different warning pragmas, attributes, and language features, cudafe++ gates its output based on the host compiler version stored in qword_126EF98 (GCC) and qword_126EF90 (Clang). Additionally, several C++ language feature flags in the EDG frontend are conditionally enabled based on host compiler version to match the behavior the user expects from their host compiler.

Version Encoding

Both GCC and Clang versions are encoded as a single integer: major * 10000 + minor * 100 + patch. For example, GCC 8.1.0 is encoded as 80100. The compiler tests these values against threshold constants (shown in hexadecimal below) using strict > comparisons. Every threshold ends in a 99 patch level, so a gate such as > 40299 (GCC 4.2.99) is equivalent to >= 40300, that is, "GCC 4.3 or later."
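
The encoding and the strict comparison can be sketched in a few lines of C. The function names encode_version and passes_gate are hypothetical; only the formula and the > comparison come from the analysis above.

```c
#include <assert.h>
#include <stdbool.h>

/* major * 10000 + minor * 100 + patch, as described above. */
static long encode_version(int major, int minor, int patch)
{
    assert(major >= 0 && minor >= 0 && minor < 100
           && patch >= 0 && patch < 100);
    return (long)major * 10000L + (long)minor * 100L + (long)patch;
}

/* Hypothetical wrapper around the strict '>' test used by the gates. */
static bool passes_gate(long version, long threshold)
{
    return version > threshold;
}
```

For example, encode_version(4, 2, 99) equals 0x9D6B, so GCC 4.2.99 fails the variadic-template gate while encode_version(4, 3, 0) passes it.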

Complete Threshold Table

Hex Constant	Decimal	Encoded Version	Effective Gate	Occurrence Count
`0x752F`	29,999	2.99.99	GCC/Clang >= 3.0	1 (dialect resolution)
`0x75F7`	30,199	3.01.99	GCC/Clang >= 3.2	low
`0x76BF`	30,399	3.03.99	GCC/Clang >= 3.4	low (cuda_compat_flag gate)
`0x7787`	30,599	3.05.99	Clang >= 3.6	medium (`-Wunused-local-typedefs`)
`0x78B3`	30,899	3.08.99	Clang >= 3.9	low
`0x9C3F`	39,999	3.99.99	GCC >= 4.0	medium (dword_106BDD8 + Clang gate)
`0x9D07`	40,199	4.01.99	GCC >= 4.2	medium (`-Wunused-variable` file-level)
`0x9D6B`	40,299	4.02.99	GCC >= 4.3	medium (variadic templates)
`0x9DCF`	40,399	4.03.99	GCC >= 4.4	low (dialect resolution)
`0x9E33`	40,499	4.04.99	GCC >= 4.5	low (dialect resolution)
`0x9E97`	40,599	4.05.99	GCC >= 4.6	medium (diagnostic push/pop)
`0x9EFB`	40,699	4.06.99	GCC >= 4.7	low (feature flag gating)
`0x9F5F`	40,799	4.07.99	GCC >= 4.8	medium (`-Wunused-local-typedefs`)
`0xEA5F`	59,999	5.99.99	GCC >= 6.0	22 files (C++14/17 features)
`0xEB27`	60,199	6.01.99	GCC >= 6.2	low (`HasFuncPtrConv` gate)
`0x1116F`	69,999	6.99.99	GCC >= 7.0	medium (dword_106BDD8 + feature flags)
`0x15F8F`	89,999	8.99.99	GCC/Clang >= 9.0	medium (C++17/20 features)
`0x1D4BF`	119,999	11.99.99	GCC/Clang >= 12.0	8 files
`0x1FBCF`	129,999	12.99.99	GCC >= 13.0	13 files
`0x222DF`	139,999	13.99.99	GCC >= 14.0	5 files
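
The "Effective Gate" column can be derived mechanically from each constant. The sketch below assumes, as holds for every row in the table, that the threshold ends in a 99 patch level; struct gate and effective_gate are hypothetical names used only for illustration.

```c
/* Lowest major.minor version that satisfies 'version > threshold'. */
struct gate {
    int major;
    int minor;
};

static struct gate effective_gate(long threshold)
{
    long first = threshold + 1;   /* smallest encoded value that passes */
    struct gate g;
    g.major = (int)(first / 10000);
    g.minor = (int)((first / 100) % 100);
    return g;
}
```

Applied to 0x9D6B (40,299) this yields 4.3, and applied to 0x222DF (139,999) it yields 14.0, matching the table.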

How Thresholds Are Used

The thresholds serve three purposes:

1. Diagnostic pragma emission. The .int.c output includes #pragma GCC diagnostic directives to suppress host compiler warnings about CUDA-generated code. Different GCC/Clang versions introduced different warning flags, so the pragmas are conditionally emitted:

// From sub_489000 (backend boilerplate emission)
// -Wunused-local-typedefs: GCC 4.8+ (0x9F5F) or Clang 3.6+ (0x7787)
if ((dword_126E1E8 && qword_126EF90 > 0x7787)
    || (!dword_106BF6C && !dword_106BF68
        && dword_126E1F8 && qword_126E1F0 > 0x9F5F))
{
    emit("#pragma GCC diagnostic ignored \"-Wunused-local-typedefs\"");
}

// Push/pop block for managed RT: GCC 4.6+ (0x9E97) or Clang
if (dword_126E1E8 || (!dword_106BF6C && dword_126E1F8 && qword_126E1F0 > 0x9E97))
{
    emit("#pragma GCC diagnostic push");
    emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");
    emit("#pragma GCC diagnostic ignored \"-Wunused-function\"");
    // ... managed runtime boilerplate ...
    emit("#pragma GCC diagnostic pop");
}

// File-level -Wunused-variable: GCC 4.2+ (0x9D07) or Clang
if (dword_126E1E8 || (dword_126E1F8 && qword_126E1F0 > 0x9D07))
    emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");

2. C++ feature gating during dialect resolution. The post-parsing dialect resolution in proc_command_line and the sub_44B6B0 dialect setup function use qword_126EF98 thresholds to decide which C++ language features to enable. Examples from the decompiled code:

// sub_44B6B0 -- dialect resolution, ~400 lines
// GCC 4.3+ (0x9D6B): enable variadic templates
if (qword_126EF98 > 0x9D6B)
    dword_106BE1C = 1;  // variadic_templates

// GCC 4.7+ (0x9EFB): enable list initialization under certain conditions
if (qword_126EF98 > 0x9EFB && dword_106BE1C && (!byte_E7FFF1 || dword_106C10C))
    dword_106BE10 = 1;

// GCC 6.0+ (0xEA5F) or Clang: enable C++14/17 features
if (dword_126EFA4 || (dword_126EFA8 && qword_126EF98 > 0xEA5F))
    // Enable feature (Clang always, GCC only 6.0+)

3. CUDA compatibility mode. A special flag dword_E7FF10 (cuda_compat_flag) is set when dword_126EFAC && qword_126EF98 <= 0x76BF -- that is, when extended features are enabled but the GCC version is 3.3.99 or below. This activates a legacy compatibility path for very old host compilers that lack modern C++ support.

The 0xEA5F (59999) Threshold -- The Most Pervasive Gate

The threshold 0xEA5F (GCC 6.0) is the most widely used version constant in the binary, appearing in 22 decompiled functions. It gates the C++14/17 feature set boundary. GCC 6 was the first GCC release series to default to C++14 (-std=gnu++14); complete C++14 language support had already arrived in GCC 5, and GCC 6 added early, experimental C++17 features.

The typical usage pattern is:

// Pattern: "Clang (any version) OR GCC 6.0+"
if (dword_126EFA4 || (dword_126EFA8 && qword_126EF98 > 0xEA5F))
    // Enable C++14/17 feature

// Pattern: "GNU extensions but not Clang, GCC 6.0+"
if (dword_126EFAC && !dword_126EFA4 && qword_126EF98 > 0xEA5F)
    // Enable GNU-specific extended feature

Functions using this threshold include: declaration processing (sub_40D900), attribute application (sub_413ED0), class declaration (sub_431590), dialect resolution (sub_44B6B0), initializer processing (sub_48C710, sub_4B6760), backend code generation (sub_4688C0), expression canonicalization (sub_4CA6C0, sub_4D2B70), IL walking (sub_54AED0), scope management (sub_59C9B0, sub_59AF40), type processing (sub_5D1350), overload resolution (sub_662670, sub_666720), and template specialization (sub_6A3B00).

Version-Gated Feature Flag: dword_106BDD8

One particular feature flag (dword_106BDD8) is set during dialect resolution based on a compound version check:

// sub_44B6B0, decompiled line ~228-231
// v4 = (dword_126EFA4 != 0), i.e., is_clang_mode
if ((dword_126EFAC && !v4 && qword_126EF98 > 0x1116F)  // GNU ext, not Clang, GCC >= 7.0
    || (v4 && qword_126EF90 > 0x9C3F))                   // or Clang >= 4.0
{
    dword_106BDD8 = 1;
}

This flag is referenced in 7 decompiled functions (sub_430920, sub_42FE50, sub_447930, sub_44AAC0, sub_44B6B0, sub_45EB40, sub_724630). The W066 global variables report identifies it as optix_mode, but the decompiled code sets it purely from compiler version thresholds during dialect resolution, not from any --emit-optix-ir CLI flag, so that name is probably a misidentification based on the context in which the flag was first encountered. It more likely controls a C++17 language feature (possibly structured bindings) that requires GCC 7.0+ or Clang 4.0+ support. The flag also gates behavior in attribute validation (sub_42FE50), where it interacts with dword_106B670 to control feature availability.

Dialect Initialization Flow

The host compiler version globals are initialized in proc_command_line and propagated to the dialect system during TU initialization:

proc_command_line (CLI parsing, sub_459630):
  case 184 (--gnu_version=X):   qword_126EF98 = X   // GCC version
  case 188 (--clang_version=X): qword_126EF90 = X   // Clang version
  case 182 (--gcc):             dword_126EFA8 = 1    // GCC mode flag
  case 187 (--clang):           dword_126EFA4 = 1    // Clang mode flag

dialect_init (sub_44B6B0, called during setup):
  // ~400 lines of version-threshold-based feature flag resolution
  // Sets 30+ EDG feature flags based on gcc_version, clang_version,
  // cpp_standard_version, and extension mode flags

target dialect (sub_752A80, select_cp_gen_be_target_dialect):
  if (dword_126EFA8):                           // GCC mode
    dword_126E1F8 = 1                           // host_dialect_gnu
    qword_126E1F0 = qword_126EF98              // host_gcc_version
  if (dword_126EFA4):                           // Clang mode
    dword_126E1E8 = 1                           // host_dialect_clang
    qword_126E1E0 = qword_126EF90              // host_clang_version

The defaults for unspecified versions are qword_126EF98 = 80100 (GCC 8.1.0) and qword_126EF90 = 90100 (Clang 9.1.0), set during default_init (sub_45EB40).

The --db Debug Mechanism

The --db flag (CLI case 37) activates EDG's internal debug tracing system by calling sub_48A390 (proc_debug_option). The mechanism is not part of architecture gating, but its globals (dword_126EFC8, dword_126EFCC) sit next to the host-version globals in the 0x126EFxx range, and debug tracing can expose architecture checks as they execute.

Connection Between --db and Architecture

The --db flag does not set or modify any architecture-related globals. Its connection to the architecture system is observational: when debug tracing is enabled, the compiler emits trace output at key decision points throughout compilation, including the semantic analysis functions that evaluate architecture thresholds. Enabling --db=5 (verbosity level 5) causes the compiler to log IL entry kinds, template instantiation steps, and scope transitions, which provides visibility into when and why architecture gates fire.

The CLI dispatch for --db:

// proc_command_line (sub_459630), case 37
case 37:  // --db=<string>
    if (sub_48A390(qword_E7FF28))  // proc_debug_option
        goto error;                // returns nonzero on parse failure
    dword_106C2A0 = dword_126EFCC; // save the verbosity level just parsed

After proc_debug_option returns, dword_106C2A0 captures the current value of dword_126EFCC, which holds the debug verbosity level. This reference labels dword_106C2A0 as error_count_baseline; because the value copied is a verbosity level, not an error count, treat that label as provisional.

proc_debug_option (sub_48A390)

This 238-line function (debug.c) parses debug control strings. On entry, it unconditionally sets dword_126EFC8 = 1 (debug tracing enabled), then dispatches based on the first character of the input:

// sub_48A390 entry
dword_126EFC8 = 1;  // enable debug tracing
v3 = (unsigned __int8)*nptr;
if ((unsigned int)(v3 - 48) <= 9) {    // first char is a digit '0'..'9'
    dword_126EFCC = strtol(nptr, 0, 10); // set verbosity level
    return 0;
}

The full parsing grammar:

Input Format	Parsed As	Action
`"5"` (numeric only)	Verbosity level	Sets `dword_126EFCC = 5`
`"name=3"`	Name with level	Adds trace node: action=1, level=3
`"name+=3"`	Additive trace	Adds trace node: action=2, level=3
`"name-=3"`	Subtractive trace	Adds trace node: action=3, level=3
`"name=3!"`	Permanent trace	Adds trace node: action=1, level=3, permanent=1
`"#name"`	Hash removal	Removes matching node from trace list
`"-name"`	Dash removal	Removes matching node from trace list
`"a,b=2,c=3"`	Comma-separated	Processes each entry independently
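
The grammar in the table can be illustrated with a small parser for a single entry (the caller would split on commas first). This is a sketch under stated assumptions, not a reconstruction of sub_48A390: the struct db_request type, the ACT_* constants, the 64-byte name limit, and the use of action 0 for a bare numeric level are all hypothetical.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* ACT_LEVEL is an invention of this sketch; 1..4 follow the node's action_type. */
enum { ACT_LEVEL = 0, ACT_SET = 1, ACT_ADD = 2, ACT_SUB = 3, ACT_REMOVE = 4 };

struct db_request {
    char name[64];   /* entity name; empty for a bare verbosity level */
    int  action;     /* one of ACT_* */
    int  level;
    int  permanent;  /* 1 when the entry ends in '!' */
};

/* Parses one entry; returns 0 on success, nonzero on a malformed entry. */
static int parse_db_entry(const char *s, struct db_request *out)
{
    memset(out, 0, sizeof *out);
    if (isdigit((unsigned char)s[0])) {              /* "5" */
        out->action = ACT_LEVEL;
        out->level = (int)strtol(s, NULL, 10);
        return 0;
    }
    if (s[0] == '#' || s[0] == '-') {                /* "#name" or "-name" */
        if (s[1] == '\0' || strlen(s + 1) >= sizeof out->name)
            return 1;
        out->action = ACT_REMOVE;
        strcpy(out->name, s + 1);
        return 0;
    }
    const char *eq = strchr(s, '=');
    if (eq == NULL || eq == s)
        return 1;                                    /* no "name=" part */
    size_t n = (size_t)(eq - s);
    out->action = ACT_SET;
    if (n > 1 && eq[-1] == '+') { out->action = ACT_ADD; n--; }
    else if (n > 1 && eq[-1] == '-') { out->action = ACT_SUB; n--; }
    if (n >= sizeof out->name)
        return 1;
    memcpy(out->name, s, n);                         /* buffer was zeroed above */
    char *end;
    out->level = (int)strtol(eq + 1, &end, 10);
    if (end == eq + 1)
        return 1;                                    /* no digits after '=' */
    if (*end == '!') { out->permanent = 1; end++; }
    return *end != '\0';                             /* reject trailing junk */
}
```

Parsing "name+=3" with this sketch yields action 2, level 3, and the name "name"; "name=3!" additionally sets permanent to 1.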

Debug Trace Node Structure

Debug trace requests are stored as a singly-linked list rooted at qword_1065870. Each node is 28 bytes, allocated via sub_6B7340 (the IL allocator):

struct debug_trace_node {           // 28 bytes (32 allocated)
    struct debug_trace_node* next;  // +0:  linked list link
    char*  name_string;             // +8:  entity name to trace (heap copy)
    int32  action_type;             // +16: 1=set, 2=add, 3=subtract, 4=remove
    int32  level;                   // +20: trace level (integer)
    int32  permanent;               // +24: 1=survives reset, 0=cleared on reset
};
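
The "28 bytes (32 allocated)" annotation follows from ordinary LP64 layout rules, which the sketch below checks with offsetof. It assumes an x86-64 Linux target with 8-byte pointers; field_bytes is a hypothetical helper, and the struct simply restates the layout above using standard fixed-width types.

```c
#include <stddef.h>
#include <stdint.h>

struct debug_trace_node {
    struct debug_trace_node *next;   /* +0  */
    char    *name_string;            /* +8  */
    int32_t  action_type;            /* +16 */
    int32_t  level;                  /* +20 */
    int32_t  permanent;              /* +24 */
};

/* Bytes occupied by the fields, excluding trailing alignment padding. */
static size_t field_bytes(void)
{
    return offsetof(struct debug_trace_node, permanent) + sizeof(int32_t);
}
```

On an LP64 target field_bytes() is 28, while sizeof(struct debug_trace_node) is 32 because the struct is padded to a multiple of its 8-byte pointer alignment.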

When proc_debug_option encounters its own name in the trace list (the self-referential check !strcmp(src, "proc_debug_option")), it prints the entire trace state to stderr:

if (qword_1065870 && (v2 & 1) != 0) {
    node = qword_1065870;  // start at the list head
    do {
        fprintf(s, "debug request for: %s\n", node->name_string);
        fprintf(s, "action=%d,  level=%d\n", node->action_type, node->level);
        node = node->next;
    } while (node);
}

Debug Verbosity Levels

The dword_126EFCC verbosity level controls trace output granularity across the entire compiler:

Level	Effect
0	No debug output (default)
1-2	Basic trace: function entry/exit markers
3	Detailed trace: includes entity names, scope indices
4	Very detailed: IL entry kinds, overload candidate lists
5+	Full trace: IL tree walking with `"Walking IL tree, entry kind = ..."`

db_name (CLI case 190)

The --db_name flag (case 190) calls a separate function sub_48AD80 to register a debug name filter. Unlike --db which enables global tracing, --db_name restricts trace output to entities matching the specified name pattern. If sub_48AD80 fails (returns nonzero), error 570 is emitted.

Three-Layer Checking Model

Layer 1: Compile-Time Semantic Checks (cudafe++ Frontend)

These are the primary gates. During semantic analysis, cudafe++ reads dword_126E4A8 and compares it against threshold constants. Violations emit diagnostic errors through the standard error system (diagnostic IDs in the 3000+ range, displayed as 20000-series via the +16543 offset formula). These checks are unconditional -- they fire regardless of whether the code would actually execute at runtime.

Enforcement point: Declaration processing, type checking, attribute application, and CUDA-specific semantic validation passes.

Examples:

__managed__ variable declaration with dword_126E4A8 < 30 triggers unsupported_arch_for_managed_capability
__grid_constant__ parameter with dword_126E4A8 < 70 triggers grid_constant_unsupported_arch
__wgmma_mma_async call on non-sm_90a triggers wgmma_mma_async_not_enabled
Virtual base class with dword_126E4A8 < 20 triggers use_of_virtual_base_on_compute_1x
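
The examples above all share one shape: compare the SM integer against a fixed minimum and report a diagnostic tag when it falls short. The sketch below expresses that shape as a table lookup. This is an illustrative assumption only; the binary most likely performs these comparisons inline at each check site, and struct arch_gate, gates, and check_gate are hypothetical names. The tags and minimums are taken from the threshold table.

```c
#include <stddef.h>

struct arch_gate {
    const char *tag;     /* diagnostic tag from the threshold table */
    int         min_sm;  /* minimum value of the SM integer */
};

static const struct arch_gate gates[] = {
    { "use_of_virtual_base_on_compute_1x",        20 },
    { "unsupported_arch_for_managed_capability",  30 },
    { "alloca_unsupported_for_lower_than_arch52", 52 },
    { "grid_constant_unsupported_arch",           70 },
    { "register_params_unsupported_arch",         80 },
};

/* Returns the tag to report, or NULL when the feature is allowed.
 * sm_architecture stands in for the value held in dword_126E4A8. */
static const char *check_gate(const struct arch_gate *g, int sm_architecture)
{
    return sm_architecture < g->min_sm ? g->tag : NULL;
}
```

For instance, check_gate(&gates[3], 61) returns the grid_constant_unsupported_arch tag, while check_gate(&gates[3], 70) returns NULL.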

Layer 2: String-Embedded Diagnostic Formatting

Error strings with architecture names baked into .rodata are the user-visible feedback for Layer 1 checks. The diagnostic system loads these strings and prints them without substituting the current architecture value.

The architecture name in the string (e.g., "compute_70", "sm_90a") is a literal constant, not a formatted parameter -- the compiler does not interpolate the actual target architecture into these messages. This means the error messages always state the minimum required architecture, not what the user actually specified. The only exception is the virtual base error which uses %t (a type formatter) to include the base class name, not the architecture.

Layer 3: Host Compiler Version Gating

This layer does not check GPU architecture at all -- instead, it gates the output format of the generated .int.c file based on the host C++ compiler's version. The thresholds ensure that GCC/Clang-specific pragmas, attributes, and language constructs in the generated code are compatible with the actual host compiler that will consume the output.

Enforcement point: Backend code generation (sub_489000 and related functions in cp_gen_be.c).

Impact: Incorrect host compiler version gating does not cause compilation failure -- it may produce warnings from the host compiler due to unrecognized pragmas, or miss warning suppression directives that would silence spurious diagnostics.

Interaction Between Layers

nvcc (driver)
  |
  | --target=<sm_code>  --gnu_version=<ver>  --clang_version=<ver>
  v
cudafe++ process
  |
  +-- CLI parsing (proc_command_line)
  |     dword_126E4A8 = sm_code         (SM architecture)
  |     qword_126EF98 = gcc_version     (host GCC version)
  |     qword_126EF90 = clang_version   (host Clang version)
  |
  +-- set_target_configuration (sub_7525F0)
  |     sub_7515D0()  -- type table init (100+ globals)
  |
  +-- dialect_resolution (sub_44B6B0)
  |     30+ feature flags set based on version thresholds
  |     dword_126E1F8 / dword_126E1E8  -- host dialect set (in sub_752A80)
  |     qword_126E1F0 / qword_126E1E0  -- host version copies (in sub_752A80)
  |
  +-- TU init (sub_586240)
  |     dword_126EBF8 = dword_126E4A8   (SM version copy)
  |
  +-- [Layer 1] Semantic analysis
  |     Compare dword_126E4A8 against SM thresholds
  |     Emit CUDA-specific errors for unsupported features
  |
  +-- [Layer 2] Diagnostic formatting
  |     Load error string with baked-in architecture name
  |     Format and display error to user
  |
  +-- [Layer 3] .int.c code generation
  |     Compare qword_126E1F0 / qword_126E1E0 against host thresholds
  |     Emit appropriate #pragma directives
  |     Generate host-compiler-compatible boilerplate
  |
  v
Host Compiler (gcc / clang / cl.exe)

Layers 1 and 2 operate during the frontend phase and can halt compilation. Layer 3 operates during the backend phase and only affects the format of the generated output file.

Global Variable Summary

Address	Size	Name	Role
`dword_126E4A8`	4	`sm_architecture`	Target SM version from `--target` (case 245). Sentinel: `-1`.
`dword_126EBF8`	4	`target_config_index`	TU-level copy of `dword_126E4A8`, set in `sub_586240`.
`qword_126EF98`	8	`gcc_version`	GCC compatibility version. Default `80100`. Set by `--gnu_version` (case 184).
`qword_126EF90`	8	`clang_version`	Clang compatibility version. Default `90100`. Set by `--clang_version` (case 188).
`dword_126EFA8`	4	`gcc_extensions`	GCC mode enabled. Set by `--gcc` (case 182).
`dword_126EFA4`	4	`clang_extensions`	Clang mode enabled. Set by `--clang` (case 187).
`dword_126EFAC`	4	`extended_features`	Extended features / GNU compat mode.
`dword_126EFB0`	4	`gnu_extensions_enabled`	GNU extensions active.
`dword_126E1F8`	4	`host_dialect_gnu`	Host compiler is GCC/GNU. Set during dialect init.
`dword_126E1E8`	4	`host_dialect_clang`	Host compiler is Clang. Set during dialect init.
`qword_126E1F0`	8	`host_gcc_version`	Host GCC version, copied from `qword_126EF98`.
`qword_126E1E0`	8	`host_clang_version`	Host Clang version, copied from `qword_126EF90`.
`dword_126EFC8`	4	`debug_trace_enabled`	Debug tracing active. Set unconditionally by `--db`.
`dword_126EFCC`	4	`debug_verbosity`	Debug output level. >2=detailed, >4=IL walk trace.
`dword_E7FF10`	4	`cuda_compat_flag`	Legacy compat: `dword_126EFAC && qword_126EF98 <= 0x76BF`.
`dword_106BDD8`	4	`version_gated_feature`	Set when GCC >= 7.0 or Clang >= 4.0. Referenced in 7 functions.
`dword_106C2A0`	4	`error_count_baseline`	Saved from `dword_126EFCC` after `--db` processing.
`qword_1065870`	8	`debug_trace_list`	Head of debug trace request linked list.
`dword_126E4A0`	4	`target_vector_width`	Set to 8 by `sub_7515D0`.

Cross-References

CLI Flag Inventory -- --target, --gnu_version, --clang_version, --db flag details
Architecture Detection -- --target flag and SM version parsing details
CUDA Error Catalog -- Complete diagnostic messages for each feature gate
.int.c File Format -- Host compiler pragma emission details
Backend Code Generation -- GCC/Clang version threshold usage in output
Global Variable Index -- Full address-level documentation
Execution Spaces -- Execution space bitfield and attribute handlers
__managed__ Variables -- Managed variable attribute and SM 30 gate
__grid_constant__ -- Grid constant attribute and SM 70 gate

Attribute System Overview

cudafe++ processes CUDA attributes through NVIDIA's customization of the EDG 6.6 attribute subsystem. EDG provides a general-purpose attribute infrastructure in attribute.c (approximately 11,500 lines of source, spanning addresses 0x409350--0x418F80 in the binary) that handles C++11 [[...]] attributes, GNU __attribute__((...)), MSVC __declspec, and alignas. NVIDIA extends this infrastructure by injecting 14 CUDA-specific attribute kinds into EDG's attribute kind enumeration, registering CUDA-specific handler callbacks, and adding a post-declaration validation pass that enforces cross-attribute consistency rules (e.g., __launch_bounds__ requires __global__).

The attribute system operates in four phases: scanning (lexer recognizes attribute syntax and builds attribute node lists), lookup (maps attribute names to descriptors via a hash table), application (dispatches to per-attribute handler functions that modify entity nodes), and validation (post-declaration consistency checks). CUDA attributes participate in all four phases, using the same node structures and dispatch mechanisms as standard C++/GNU attributes.

CUDA Attribute Kind Enum

Every attribute node carries a kind byte at offset +8. For standard C++/GNU attributes, EDG assigns kinds from its built-in descriptor table (byte_82C0E0 in the .rodata segment). For CUDA attributes, NVIDIA reserves a block of kind values in the ASCII printable range. The function attribute_display_name (sub_40A310, from attribute.c:1307) contains the authoritative switch table that maps kind values to human-readable names:

Kind	Hex	ASCII	Display Name	Category	Handler
86	0x56	`'V'`	`__host__`	Execution space	`sub_4108E0`
87	0x57	`'W'`	`__device__`	Execution space	`sub_40EB80`
88	0x58	`'X'`	`__global__`	Execution space	`sub_40E1F0` / `sub_40E7F0`
89	0x59	`'Y'`	`__tile_global__`	Execution space	(internal)
90	0x5A	`'Z'`	`__shared__`	Memory space	`sub_40E0D0` (shared path)
91	0x5B	`'['`	`__constant__`	Memory space	`sub_40E0D0` (constant path)
92	0x5C	`'\'`	`__launch_bounds__`	Launch config	`sub_411C80`
93	0x5D	`']'`	`__maxnreg__`	Launch config	`sub_410F70`
94	0x5E	`'^'`	`__local_maxnreg__`	Launch config	`sub_411090`
95	0x5F	`'_'`	`__tile_builtin__`	Internal	(internal)
102	0x66	`'f'`	`__managed__`	Memory space	`sub_40E0D0` (managed path)
107	0x6B	`'k'`	`__cluster_dims__`	Launch config	`sub_4115F0`
108	0x6C	`'l'`	`__block_size__`	Launch config	`sub_4109E0`
110	0x6E	`'n'`	`__nv_pure__`	Optimization	(internal)

The kind values are not contiguous. Kinds 86--95 form a dense block for the original CUDA attributes. Kinds 102, 107, 108, and 110 were added later (managed memory in CUDA 6.0, cluster dimensions in CUDA 11.8, block size and nv_pure more recently), occupying gaps in the ASCII range.
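For a reimplementation, the kind table can be written down as a plain C enum. The sketch below copies the values from the table above; the enumerator names and the `is_cuda_attr_kind` helper are hypothetical and do not come from the binary.

```c
#include <assert.h>
#include <stdint.h>

/* Kind values are taken from the table above; enumerator names are hypothetical. */
enum cuda_attr_kind {
    CUDA_ATTR_HOST          = 'V',   /* 86  */
    CUDA_ATTR_DEVICE        = 'W',   /* 87  */
    CUDA_ATTR_GLOBAL        = 'X',   /* 88  */
    CUDA_ATTR_TILE_GLOBAL   = 'Y',   /* 89  */
    CUDA_ATTR_SHARED        = 'Z',   /* 90  */
    CUDA_ATTR_CONSTANT      = '[',   /* 91  */
    CUDA_ATTR_LAUNCH_BOUNDS = '\\',  /* 92  */
    CUDA_ATTR_MAXNREG       = ']',   /* 93  */
    CUDA_ATTR_LOCAL_MAXNREG = '^',   /* 94  */
    CUDA_ATTR_TILE_BUILTIN  = '_',   /* 95  */
    CUDA_ATTR_MANAGED       = 'f',   /* 102 */
    CUDA_ATTR_CLUSTER_DIMS  = 'k',   /* 107 */
    CUDA_ATTR_BLOCK_SIZE    = 'l',   /* 108 */
    CUDA_ATTR_NV_PURE       = 'n'    /* 110 */
};

/* Compile-time check that the dense block spans 86..95 in ASCII. */
static_assert(CUDA_ATTR_HOST == 86 && CUDA_ATTR_TILE_BUILTIN == 95,
              "dense CUDA kind block is 86..95");

/* Returns 1 when `kind` is one of the 14 CUDA kinds listed above, else 0. */
int is_cuda_attr_kind(uint8_t kind) {
    if (kind >= 86 && kind <= 95)
        return 1;
    return kind == 102 || kind == 107 || kind == 108 || kind == 110;
}
```

The helper deliberately rejects the unlisted values between the sparse kinds (for example 96 or 109), since the table does not say what those values mean.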

attribute_display_name (`sub_40A310`)

This function serves dual duty: it formats the display name for diagnostic messages, and its switch table is the canonical enumeration of all CUDA attribute kinds. The logic:

// sub_40A310 -- attribute_display_name (attribute.c:1307)
// a1: pointer to attribute node
const char* attribute_display_name(attr_node_t* a1) {
    const char* name = a1->name;           // +16
    const char* ns   = a1->namespace_str;  // +24

    // If scoped (namespace::name), format "namespace::name"
    if (ns) {
        size_t ns_len = strlen(ns);
        assert(ns_len + strlen(name) + 3 <= 204);  // buffer byte_E7FB80
        sprintf(byte_E7FB80, "%s::%s", ns, name);
        name = intern_string(byte_E7FB80);  // sub_5E0700
    }

    // Override with CUDA display name based on kind byte
    switch (a1->kind) {  // byte at +8
        case 'V': return "__host__";
        case 'W': return "__device__";
        case 'X': return "__global__";
        case 'Y': return "__tile_global__";
        case 'Z': return "__shared__";
        case '[': return "__constant__";
        case '\\': return "__launch_bounds__";
        case ']': return "__maxnreg__";
        case '^': return "__local_maxnreg__";
        case '_': return "__tile_builtin__";
        case 'f': return "__managed__";
        case 'k': return "__cluster_dims__";
        case 'l': return "__block_size__";
        case 'n': return "__nv_pure__";
        default:  return name ? name : "";
    }
}

The 204-byte static buffer byte_E7FB80 is shared across calls (not thread-safe, but cudafe++ is single-threaded per translation unit). The intern_string call (sub_5E0700) ensures the formatted "namespace::name" string is deduplicated into EDG's permanent string pool.

Attribute Node Structure

Every attribute is represented by a 72-byte IL node (entry kind 0x48 = attribute). The node layout:

struct attr_node_t {               // 72 bytes, IL entry kind 0x48
    attr_node_t*  next;            // +0   next attribute in list
    uint8_t       kind;            // +8   attribute kind byte (CUDA: 'V'..'n')
    uint8_t       source_mode;     // +9   1=C++11, 2=GNU, 3=MSVC, 4=alignas, 5=clang
    uint8_t       target_kind;     // +10  what entity type this targets
    uint8_t       flags;           // +11  bit 0=applies_to_params
                                   //      bit 1=skip_arg_check
                                   //      bit 4=scoped attribute
                                   //      bit 7=unknown/unrecognized
    uint32_t      _pad;            // +12  (alignment)
    const char*   name;            // +16  attribute name string
    const char*   namespace_str;   // +24  namespace (NULL for unscoped)
    arg_node_t*   arguments;       // +32  argument list head
    void*         source_pos;      // +40  source position info
    void*         decl_context;    // +48  declaration context / scope
    void*         src_loc_1;       // +56  source location
    void*         src_loc_2;       // +64  secondary source location
};

For CUDA attributes, the kind byte at offset +8 is the discriminator. When get_attr_descr_for_attribute (sub_40FDB0) resolves an attribute name, it writes the corresponding kind value from the descriptor table (byte_82C0E0) into this field. All subsequent dispatch operates on this byte alone.

The source_mode byte at +9 indicates the syntactic form the user wrote. CUDA attributes like __host__ are parsed as GNU-style attributes (source_mode = 2), because cudafe++ defines them via __attribute__((...)) internally.
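The offsets in the layout above can be checked mechanically. The following sketch restates the node as a compilable C struct and asserts the documented offsets at compile time. It assumes an LP64 target (8-byte pointers), which matches the x86-64 binary; the `ATTR_FLAG_*` names and the use of `void*` for the argument list are illustrative choices, not recovered symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct attr_node {
    struct attr_node *next;          /* +0  */
    uint8_t           kind;          /* +8  */
    uint8_t           source_mode;   /* +9  */
    uint8_t           target_kind;   /* +10 */
    uint8_t           flags;         /* +11 */
    uint32_t          pad;           /* +12 */
    const char       *name;          /* +16 */
    const char       *namespace_str; /* +24 */
    void             *arguments;     /* +32 (arg_node_t* in the text) */
    void             *source_pos;    /* +40 */
    void             *decl_context;  /* +48 */
    void             *src_loc_1;     /* +56 */
    void             *src_loc_2;     /* +64 */
} attr_node;

static_assert(sizeof(void *) == 8, "sketch assumes an LP64 target");
static_assert(offsetof(attr_node, kind) == 8, "kind byte at +8");
static_assert(offsetof(attr_node, flags) == 11, "flags byte at +11");
static_assert(offsetof(attr_node, name) == 16, "name at +16");
static_assert(offsetof(attr_node, src_loc_2) == 64, "src_loc_2 at +64");
static_assert(sizeof(attr_node) == 72, "node is 72 bytes");

/* Flag bits documented for the byte at +11 (names are hypothetical). */
#define ATTR_FLAG_APPLIES_TO_PARAMS 0x01
#define ATTR_FLAG_SKIP_ARG_CHECK    0x02
#define ATTR_FLAG_SCOPED            0x10
#define ATTR_FLAG_UNKNOWN           0x80

/* Returns 1 when the node was written as a scoped attribute. */
int attr_is_scoped(const attr_node *a) {
    assert(a != NULL);
    return (a->flags & ATTR_FLAG_SCOPED) != 0;
}
```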

Attribute Descriptor Table and Name Lookup

Master Descriptor Table (`off_D46820`)

The attribute descriptor table is a static array at off_D46820, extending to unk_D47A60. Both addresses fall inside the .data segment (0xD46480--0xE7EFF0 in the segment layout), not .rodata. Each entry is 32 bytes and encodes:

Attribute name string
Kind byte (written to attr_node_t.kind on match)
Handler function pointer (the apply_* callback)
Mode/version condition string (e.g., 'g' for GCC-only, 'l' for Clang-only)
Target applicability mask

Initialization: `init_attr_name_map` (`sub_418F80`)

At startup, init_attr_name_map iterates the descriptor table, validates each name is at most 100 characters, and inserts it into the hash table qword_E7FB60 (created via sub_7425C0). This hash table enables O(1) lookup of attribute names during parsing.

// sub_418F80 -- init_attr_name_map (attribute.c:1524)
void init_attr_name_map(void) {
    attr_name_map = create_hash_table();  // qword_E7FB60
    for (attr_descr* d = off_D46820; d < unk_D47A60; d++) {
        assert(strlen(d->name) <= 100);
        insert_into_hash_table(attr_name_map, d->name, d);
    }
    // Also initializes dword_E7F078 and processes config if dword_106BF18 set
}

A companion function init_attr_token_map (sub_419070) creates a second hash table qword_E7F038 that maps attribute tokens to their descriptors, used during lexer-level attribute recognition.

Name Normalization: `sub_40A250`

Before looking up an attribute name, EDG strips __ prefixes and suffixes. The function at sub_40A250 checks whether the name starts with "__" and ends with "__", strips them, and looks up the bare name in qword_E7FB60. This means __host__, __attribute__((host)), and host all resolve to the same descriptor. The stripping respects the current language standard (dword_126EFB4) and C++ version (dword_126EF68).
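The stripping step can be sketched in a few lines of C. This is a minimal illustration, not the decompiled body of sub_40A250: it omits the hash lookup and the language-standard checks, and the function name, the output-buffer interface, and the decision to leave a bare `"____"` untouched are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies `name` into `out`, dropping one leading and one trailing "__"
 * when both are present. Returns 0 on success, -1 if `out` is too small. */
int strip_double_underscores(const char *name, char *out, size_t out_size) {
    assert(name != NULL && out != NULL);
    size_t len = strlen(name);
    const char *start = name;

    /* Require at least one character between the markers (assumption). */
    if (len > 4 && strncmp(name, "__", 2) == 0 &&
        strncmp(name + len - 2, "__", 2) == 0) {
        start = name + 2;
        len -= 4;
    }
    if (len + 1 > out_size)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

With this helper, `__host__` and `host` both produce the key `host`, while a name with only a leading `__` is passed through unchanged.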

Central Dispatch: `get_attr_descr_for_attribute` (`sub_40FDB0`)

This 227-line function is the central attribute resolution path. Given an attribute node with a name, it:

Looks up the name in the hash table
Checks mode compatibility (GCC mode via dword_126EFA8, Clang mode via dword_126EFA4, MSVC mode via dword_106BF68/dword_106BF58)
Checks namespace match ("gnu", "__gnu__", "clang") via cond_matches_attr_mode (sub_40C4C0)
Evaluates version-conditional availability via in_attr_cond_range (sub_40D620)
Writes the kind byte from the matched descriptor into attr_node_t.kind
Returns the descriptor entry (which carries the handler function pointer)

The mode condition strings use a compact encoding: 'g'=GCC, 'l'=Clang, 's'=Sun, 'c'=C++, 'm'=MSVC; 'x'=extension, '+'=positive match, '!'=boundary marker.
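As a reading aid, the character alphabet above can be captured in a small classifier. The sketch only maps each documented character to its meaning and counts characters outside the alphabet; it does not attempt to parse a full condition string, because the grammar of those strings is not described here. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Maps one mode-condition character to its documented meaning,
 * or NULL when the character is not part of the alphabet above. */
const char *attr_mode_char_meaning(char c) {
    switch (c) {
        case 'g': return "GCC";
        case 'l': return "Clang";
        case 's': return "Sun";
        case 'c': return "C++";
        case 'm': return "MSVC";
        case 'x': return "extension";
        case '+': return "positive match";
        case '!': return "boundary marker";
        default:  return NULL;
    }
}

/* Counts characters in `cond` that are not in the documented alphabet. */
size_t count_unknown_mode_chars(const char *cond) {
    assert(cond != NULL);
    size_t unknown = 0;
    for (; *cond != '\0'; cond++) {
        if (attr_mode_char_meaning(*cond) == NULL)
            unknown++;
    }
    return unknown;
}
```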

Attribute Application Pipeline

Phase 1: Scanning

The lexer recognizes attribute syntax and calls into the scanning functions:

Function	Address	Role
`scan_std_attribute_group`	`sub_412650`	Parses `[[...]]` C++11 and `__attribute__((...))` GNU attributes
`scan_gnu_attribute_groups`	`sub_412F20`	Handles `__attribute__((...))` specifically
`scan_attributes_list`	`sub_4124A0`	Iterates token stream building attribute node lists
`parse_attribute_argument_clause`	`sub_40C8B0`	Parses attribute argument expressions
`get_balanced_token`	`sub_40C6C0`	Handles balanced parentheses/brackets in arguments

Scanning produces a linked list of attr_node_t nodes. At this stage, the kind byte is unset; only the name and namespace_str fields are populated.

Phase 2: Lookup and Kind Assignment

When the parser reaches a declaration, get_attr_descr_for_attribute resolves each attribute name to a descriptor and writes the kind byte. For CUDA attributes, this assigns values in the 'V'--'n' range.

Phase 3: Application -- `apply_one_attribute` (`sub_413240`)

The central dispatcher is a 585-line function containing a switch on the kind byte. For each CUDA kind, it calls the corresponding handler:

// sub_413240 -- apply_one_attribute (attribute.c, main dispatch)
// 585 lines, giant switch on attribute kind
void apply_one_attribute(attr_node_t* attr, entity_t* entity, int target_kind) {
    switch (attr->kind) {
        case 'V':  apply_nv_host_attr(attr, entity, target_kind);           break;
        case 'W':  apply_nv_device_attr(attr, entity, target_kind);         break;
        case 'X':  apply_nv_global_attr(attr, entity, target_kind);         break;
        case 'Z':  apply_nv_shared_attr(attr, entity, target_kind);         break;
        case '[':  apply_nv_constant_attr(attr, entity, target_kind);       break;
        case '\\': apply_nv_launch_bounds_attr(attr, entity, target_kind);  break;
        case ']':  apply_nv_maxnreg_attr(attr, entity, target_kind);        break;
        case '^':  apply_nv_local_maxnreg_attr(attr, entity, target_kind);  break;
        case 'f':  apply_nv_managed_attr(attr, entity, target_kind);        break;
        case 'k':  apply_nv_cluster_dims_attr(attr, entity, target_kind);   break;
        case 'l':  apply_nv_block_size_attr(attr, entity, target_kind);     break;
        // ... standard attributes handled similarly ...
    }
}

The outer iteration is apply_attributes_to_entity (sub_413ED0, 492 lines), which walks the attribute list, calls apply_one_attribute for each, and handles deferred attributes, attribute merging, and ordering constraints.
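The list-walk-plus-dispatch shape can be illustrated with a reduced model. The sketch below keeps only the `next` pointer and kind byte of the attribute node and a single execution-space byte on the entity, and it covers only the three execution-space kinds. Deferred attributes, merging, and ordering constraints are left out, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct attr   { struct attr *next; uint8_t kind; } attr;
typedef struct entity { uint8_t exec_space; } entity;   /* models entity+182 */
typedef void (*attr_handler)(const attr *, entity *);

/* OR-masks as documented for the three execution-space handlers. */
static void on_host(const attr *a, entity *e)   { (void)a; e->exec_space |= 0x15; }
static void on_device(const attr *a, entity *e) { (void)a; e->exec_space |= 0x23; }
static void on_global(const attr *a, entity *e) { (void)a; e->exec_space |= 0x61; }

/* Maps a kind byte to its handler; NULL for kinds this sketch ignores. */
static attr_handler handler_for(uint8_t kind) {
    switch (kind) {
        case 'V': return on_host;
        case 'W': return on_device;
        case 'X': return on_global;
        default:  return NULL;
    }
}

/* Walks the list in order and returns how many attributes had a handler. */
int apply_attribute_list(const attr *list, entity *e) {
    assert(e != NULL);
    int applied = 0;
    for (const attr *a = list; a != NULL; a = a->next) {
        attr_handler h = handler_for(a->kind);
        if (h != NULL) {
            h(a, e);
            applied++;
        }
    }
    return applied;
}
```

Applying `__host__` followed by `__device__` in this model leaves `0x15 | 0x23 = 0x37` in the execution-space byte.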

Phase 4: Post-Declaration Validation -- `sub_6BC890`

After all attributes on a declaration are applied, sub_6BC890 (nv_validate_cuda_attributes, from nv_transforms.c) performs cross-attribute consistency checking. This function validates that combinations of CUDA attributes are legal:

// sub_6BC890 -- nv_validate_cuda_attributes (nv_transforms.c)
// a1: entity (function), a2: diagnostic location
void nv_validate_cuda_attributes(entity_t* fn, source_loc_t* loc) {
    if (!fn || (fn->byte_177 & 0x10))  // skip if null or already validated
        return;

    uint8_t exec_space = fn->byte_182;  // CUDA execution space bits
    launch_config_t* lc = fn->launch_config;  // entity+256

    // Check 1: parameters with rvalue-reference in __global__ functions
    // Walks parameter list, emits error 3702 for ref-qualified params

    // Check 2: __nv_register_params__ on __host__-only or __global__
    if (fn->byte_183 & 0x08) {
        if (exec_space & 0x40)       // __global__
            emit_error(3661, "__global__");
        else if ((exec_space & 0x30) == 0x20)  // __host__ only (no __device__)
            emit_error(3661, "__host__");
    }

    // Check 3: __launch_bounds__ without __global__
    if (lc && !(exec_space & 0x40)) {
        if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor)
            emit_error(3534, "__launch_bounds__");
    }

    // Check 4: __cluster_dims__ / __block_size__ without __global__
    if (lc && !(exec_space & 0x40) && (fn->byte_183 & 0x40 || lc->cluster_dim_x > 0)) {
        const char* name = (lc->block_size_x > 0) ? "__block_size__" : "__cluster_dims__";
        emit_error(3534, name);
    }

    // Check 5: cluster block product exceeds maxBlocksPerClusterSize
    if (lc && lc->cluster_dim_x > 0 && lc->maxBlocksPerClusterSize > 0) {
        if (lc->maxBlocksPerClusterSize <
            lc->cluster_dim_x * lc->cluster_dim_y * lc->cluster_dim_z) {
            emit_error(3707, ...);
        }
    }

    // Check 6: __maxnreg__ without __global__
    if (lc && lc->maxnreg >= 0 && !(exec_space & 0x40))
        emit_error(3715, "__maxnreg__");

    // Check 7: __launch_bounds__ + __maxnreg__ conflict
    if (lc && lc->maxThreadsPerBlock && lc->maxnreg >= 0)
        emit_error(3719, "__launch_bounds__ and __maxnreg__");

    // Check 8: __global__ without __launch_bounds__
    if ((exec_space & 0x40) && (!lc || (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor)))
        emit_warning(3695);  // "no __launch_bounds__ specified for __global__ function"
}

Error Codes in Validation

Error	Severity	Message
3534	7 (error)	`"%s" attribute is not allowed on a non-__global__ function`
3661	7 (error)	`__nv_register_params__ is not allowed on a %s function`
3695	4 (warning)	`no __launch_bounds__ specified for __global__ function`
3702	7 (error)	Parameter with rvalue reference in `__global__` function
3707	7 (error)	`total number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit`
3715	7 (error)	`__maxnreg__ is not allowed on a non-__global__ function`
3719	7 (error)	`__launch_bounds__ and __maxnreg__ may not be used on the same declaration`

Per-Attribute Handler Function Table

Each CUDA attribute has a dedicated apply_* function registered in the descriptor table. These functions modify entity node fields (execution space bits, memory space bits, launch configuration) and emit diagnostics for invalid usage.

Attribute	Handler	Address	Lines	Entity Fields Modified
`__host__`	`apply_nv_host_attr`	`sub_4108E0`	31	`entity+182 |= 0x15`
`__device__`	`apply_nv_device_attr`	`sub_40EB80`	100	Functions: `entity+182 |= 0x23`; Variables: `entity+148 |= 0x01`
`__global__`	`apply_nv_global_attr`	`sub_40E1F0`	89	`entity+182 |= 0x61`
`__global__` (variant 2)	`apply_nv_global_attr`	`sub_40E7F0`	86	Same as above (alternate entry point)
`__shared__`	(via device attr path)	--	--	`entity+148 |= 0x02`
`__constant__`	(via device attr path)	--	--	`entity+148 |= 0x04`
`__managed__`	`apply_nv_managed_attr`	`sub_40E0D0`	47	`entity+148 |= 0x01`, `entity+149 |= 0x01`
`__launch_bounds__`	`apply_nv_launch_bounds_attr`	`sub_411C80`	98	`entity+256` -> launch config `+0`, `+8`, `+16`
`__maxnreg__`	`apply_nv_maxnreg_attr`	`sub_410F70`	67	`entity+256` -> launch config `+32`
`__local_maxnreg__`	`apply_nv_local_maxnreg_attr`	`sub_411090`	67	`entity+256` -> launch config `+36`
`__cluster_dims__`	`apply_nv_cluster_dims_attr`	`sub_4115F0`	145	`entity+256` -> launch config `+20`, `+24`, `+28`
`__block_size__`	`apply_nv_block_size_attr`	`sub_4109E0`	265	`entity+256` -> launch config `+40`..`+52`
`__nv_register_params__`	`apply_nv_register_params_attr`	`sub_40B0A0`	38	`entity+183 |= 0x08`

Attribute Registration (`sub_6B5E50`)

The function sub_6B5E50 (160 lines, in the nv_transforms.c / mem_manage.c area) registers NVIDIA-specific pseudo-attributes into EDG's keyword and macro systems at startup. It operates after EDG's standard keyword initialization but before parsing begins.

The registration creates macro-like definitions that the lexer expands before attribute processing. The function:

Allocates attribute definition nodes via sub_6BA0D0 (EDG's node allocator)
Looks up existing definitions via sub_734430 (hash table search) -- if a definition already exists, it chains the new handler onto it via sub_6AC190
Creates new keyword entries via sub_749600 if no prior definition exists
Registers __nv_register_params__ as a 40-byte attribute definition node (kind marker 8961) with chain linkage
Registers __noinline__ as a 30-byte attribute definition node (kind marker 6401), including the "oinline))" suffix for __attribute__((__noinline__)) expansion
Conditionally registers ARM SME attributes (__arm_in, __arm_inout, __arm_out, __arm_preserves, __arm_streaming, __arm_streaming_compatible) via sub_6ACCB0 when Clang version >= 180000 and ARM target flags are set
Registers _Pragma as an operator-like keyword for _Pragma("...") processing

If any registration fails (the existing entry cannot be extended), it emits internal error 1338 with the attribute name and calls sub_6B6280 (fatal error handler).

Entity Node: CUDA Attribute Fields

CUDA attributes modify specific byte fields in entity nodes. The key fields for a reimplementation:

Execution Space (`entity+182`)

Bit 0 (0x01): CUDA-annotated       set by all three handlers (common to 0x15, 0x23, 0x61)
Bit 1 (0x02): __device__           set by apply_nv_device_attr
Bit 2 (0x04): __host__             set by apply_nv_host_attr
Bit 4 (0x10): __host__ explicit    set by apply_nv_host_attr
Bit 5 (0x20): device annotation    set by apply_nv_device_attr and apply_nv_global_attr
Bit 6 (0x40): __global__           set by apply_nv_global_attr
Bit 7 (0x80): __host__ __device__  set when both specified

Handlers use OR-masks: __host__ sets 0x15 (bits 0+2+4), __device__ sets 0x23 (bits 0+1+5), __global__ sets 0x61 (bits 0+5+6). The overlap at bit 0 means all execution-space-annotated functions have bit 0 set, which serves as a quick "has CUDA annotation" predicate.
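The mask arithmetic is easy to verify in isolation. The sketch below defines the three OR-masks exactly as stated and two predicates built on them; the constant and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* OR-masks applied to the execution-space byte at entity+182. */
enum {
    EXEC_HOST_MASK   = 0x15,  /* bits 0 + 2 + 4 */
    EXEC_DEVICE_MASK = 0x23,  /* bits 0 + 1 + 5 */
    EXEC_GLOBAL_MASK = 0x61   /* bits 0 + 5 + 6 */
};

/* Bit 0 is the only bit shared by all three masks. */
static_assert((EXEC_HOST_MASK & EXEC_DEVICE_MASK & EXEC_GLOBAL_MASK) == 0x01,
              "bit 0 is common to every execution-space mask");

/* Quick "has any CUDA execution-space annotation" predicate. */
int has_cuda_annotation(uint8_t exec_space) {
    return (exec_space & 0x01) != 0;
}

/* True when the __global__ bit (0x40) is set. */
int is_kernel(uint8_t exec_space) {
    return (exec_space & 0x40) != 0;
}
```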

Memory Space (`entity+148`)

Bit 0 (0x01): __device__           device memory
Bit 1 (0x02): __shared__           shared memory
Bit 2 (0x04): __constant__         constant memory

Extended Memory Space (`entity+149`)

Bit 0 (0x01): __managed__          managed (unified) memory

Launch Configuration (`entity+256`)

A pointer to a separately allocated launch_config_t structure (created by sub_5E52F0):

struct launch_config_t {
    uint64_t  maxThreadsPerBlock;          // +0   from __launch_bounds__(N, ...)
    uint64_t  minBlocksPerMultiprocessor;  // +8   from __launch_bounds__(N, M, ...)
    int32_t   maxBlocksPerClusterSize;     // +16  from __launch_bounds__(N, M, K)
    int32_t   cluster_dim_x;               // +20  from __cluster_dims__(X, ...)
    int32_t   cluster_dim_y;               // +24  from __cluster_dims__(X, Y, ...)
    int32_t   cluster_dim_z;               // +28  from __cluster_dims__(X, Y, Z)
    int32_t   maxnreg;                     // +32  from __maxnreg__(N)
    int32_t   local_maxnreg;               // +36  from __local_maxnreg__(N)
    int32_t   block_size_x;                // +40  from __block_size__(X, ...)
    int32_t   block_size_y;                // +44  from __block_size__(X, Y, ...)
    int32_t   block_size_z;                // +48  from __block_size__(X, Y, Z, ...)
    uint8_t   flags;                      // +52  bit 0=cluster_dims_set
                                          //      bit 1=block_size_set
};

This structure is allocated lazily: it is created only when a launch configuration attribute is first applied to a function. The allocation function sub_5E52F0 returns a structure that is zero-initialized except for maxnreg and local_maxnreg, which are set to -1 (sentinel for "unset").
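A lazy allocator with these semantics might look like the sketch below. It uses `calloc` in place of EDG's own allocator and a one-field stand-in for the entity, so `kernel_entity` and `get_or_create_launch_config` are hypothetical names; only the field offsets and the -1 sentinels come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct launch_config {
    uint64_t max_threads_per_block;      /* +0  */
    uint64_t min_blocks_per_mp;          /* +8  */
    int32_t  max_blocks_per_cluster;     /* +16 */
    int32_t  cluster_dim_x;              /* +20 */
    int32_t  cluster_dim_y;              /* +24 */
    int32_t  cluster_dim_z;              /* +28 */
    int32_t  maxnreg;                    /* +32 */
    int32_t  local_maxnreg;              /* +36 */
    int32_t  block_size_x;               /* +40 */
    int32_t  block_size_y;               /* +44 */
    int32_t  block_size_z;               /* +48 */
    uint8_t  flags;                      /* +52 */
} launch_config;

static_assert(offsetof(launch_config, max_blocks_per_cluster) == 16, "+16");
static_assert(offsetof(launch_config, maxnreg) == 32, "+32");
static_assert(offsetof(launch_config, flags) == 52, "+52");

/* Stand-in for the entity node: only the pointer stored at entity+256. */
typedef struct { launch_config *lc; } kernel_entity;

/* Returns the entity's launch config, allocating it on first use.
 * Returns NULL only if allocation fails. */
launch_config *get_or_create_launch_config(kernel_entity *fn) {
    assert(fn != NULL);
    if (fn->lc == NULL) {
        launch_config *lc = calloc(1, sizeof *lc);  /* zero-initialized */
        if (lc == NULL)
            return NULL;
        lc->maxnreg = -1;        /* sentinel: __maxnreg__ not specified */
        lc->local_maxnreg = -1;  /* sentinel: __local_maxnreg__ not specified */
        fn->lc = lc;
    }
    return fn->lc;
}
```

The -1 sentinel is what lets a validator test `maxnreg >= 0` to decide whether `__maxnreg__` was written at all, since 0 would otherwise be indistinguishable from "unset".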

Attribute Processing Global State

Global	Address	Purpose
`qword_E7FB60`	`0xE7FB60`	Attribute name hash table (created by `init_attr_name_map`)
`qword_E7F038`	`0xE7F038`	Attribute token hash table (created by `init_attr_token_map`)
`byte_E7FB80`	`0xE7FB80`	204-byte static buffer for formatted attribute display names
`off_D46820`	`0xD46820`	Master attribute descriptor table (32 bytes per entry, extends to `0xD47A60`)
`qword_E7F070`	`0xE7F070`	Visibility stack (for `__attribute__((visibility(...)))` nesting)
`qword_E7F048`	`0xE7F048`	Alias/ifunc free list head
`qword_E7F058`/`E7F050`	`0xE7F058`/`0xE7F050`	Alias chain list head/tail
`dword_E7F080`	`0xE7F080`	Attribute processing flags
`dword_E7F078`	`0xE7F078`	Extended attribute config flag

The function reset_attribute_processing_state (sub_4190B0) zeroes all of these at the start of each translation unit.

Function Map

Address	Identity	Source	Confidence
`sub_40A250`	`strip_double_underscores_and_lookup`	`attribute.c`	HIGH
`sub_40A310`	`attribute_display_name`	`attribute.c:1307`	HIGH
`sub_40C4C0`	`cond_matches_attr_mode`	`attribute.c`	HIGH
`sub_40C6C0`	`get_balanced_token`	`attribute.c`	HIGH
`sub_40C8B0`	`parse_attribute_argument_clause`	`attribute.c`	HIGH
`sub_40D620`	`in_attr_cond_range`	`attribute.c`	HIGH
`sub_40E0D0`	`apply_nv_managed_attr`	`attribute.c:10523`	HIGH
`sub_40E1F0`	`apply_nv_global_attr` (variant 1)	`attribute.c`	HIGH
`sub_40E7F0`	`apply_nv_global_attr` (variant 2)	`attribute.c`	HIGH
`sub_40EB80`	`apply_nv_device_attr`	`attribute.c`	HIGH
`sub_40FDB0`	`get_attr_descr_for_attribute`	`attribute.c:1902`	HIGH
`sub_4108E0`	`apply_nv_host_attr`	`attribute.c`	HIGH
`sub_4109E0`	`apply_nv_block_size_attr`	`attribute.c`	HIGH
`sub_410F70`	`apply_nv_maxnreg_attr`	`attribute.c`	HIGH
`sub_411090`	`apply_nv_local_maxnreg_attr`	`attribute.c`	HIGH
`sub_4115F0`	`apply_nv_cluster_dims_attr`	`attribute.c`	HIGH
`sub_411C80`	`apply_nv_launch_bounds_attr`	`attribute.c`	HIGH
`sub_412650`	`scan_std_attribute_group`	`attribute.c:2914`	HIGH
`sub_413240`	`apply_one_attribute`	`attribute.c`	HIGH
`sub_413ED0`	`apply_attributes_to_entity`	`attribute.c`	HIGH
`sub_418F80`	`init_attr_name_map`	`attribute.c:1524`	HIGH
`sub_419070`	`init_attr_token_map`	`attribute.c`	HIGH
`sub_4190B0`	`reset_attribute_processing_state`	`attribute.c`	HIGH
`sub_6B5E50`	`process_nv_register_params` / attribute registration	`nv_transforms.c`	HIGH
`sub_6BC890`	`nv_validate_cuda_attributes`	`nv_transforms.c`	VERY HIGH

Cross-References

__global__ Function Constraints -- detailed validation rules for __global__
Launch Configuration Attributes -- __launch_bounds__, __cluster_dims__, __block_size__
__grid_constant__ -- grid-constant parameter attribute
__managed__ Variables -- managed memory attribute
Minor CUDA Attributes -- __noinline__, __forceinline__, __nv_register_params__, __nv_pure__
Entity Node Layout -- full entity structure with CUDA field offsets
CUDA Execution Spaces -- how execution space bits drive code generation
CUDA Memory Spaces -- memory space bitfield semantics

__global__ Function Constraints

The __global__ attribute designates a CUDA kernel -- a function that executes on the GPU and is callable from host code via the <<<...>>> launch syntax. Of all CUDA execution space attributes, __global__ imposes the most constraints. cudafe++ enforces these constraints across three separate validation passes: attribute application (when __global__ is first applied to an entity), post-declaration validation (after all attributes on a declaration are resolved), and semantic analysis (during template instantiation, redeclaration merging, and lambda processing). This page documents all constraint checks, their implementation in the binary, the entity node fields they inspect, and the diagnostics they emit.

Key Facts

Property	Value
Source files	`attribute.c` (apply handler), `nv_transforms.c` (post-validation), `class_decl.c` (redeclaration, lambda), `decls.c` (template packs)
Apply handler (variant 1)	`sub_40E1F0` (89 lines)
Apply handler (variant 2)	`sub_40E7F0` (86 lines)
Post-validation	`sub_6BC890` (`nv_validate_cuda_attributes`, 161 lines)
Attribute kind byte	`0x58` = `'X'`
OR mask applied	`entity+182 |= 0x61` (bits 0 + 5 + 6)
HD combined flag	`entity+182 |= 0x80` (set when `__global__` applied to function already marked `__host__`)
Total constraint checks	37 distinct error conditions
Entity fields read	`+81`, `+144`, `+148`, `+152`, `+166`, `+176`, `+179`, `+182`, `+183`, `+184`, `+191`
Relaxed mode flag	`dword_106BFF0` (suppresses certain conflict checks)
main() entity pointer	`qword_126EB70` (compared to detect `__global__ main`)

Two Variants of `apply_nv_global_attr`

Two nearly identical functions implement the __global__ application logic. Both perform the same 11 validation checks and apply the same 0x61 bitmask. The difference is purely structural: for the parameter default-init iteration, sub_40E1F0 uses a for loop that breaks when it reaches a null pointer, while sub_40E7F0 uses a do-while loop with an explicit null check and early return. Both exist because EDG's attribute subsystem may route through different call paths depending on whether the attribute appears on a declaration or a definition.

// Pseudocode for apply_nv_global_attr (sub_40E1F0 / sub_40E7F0)
// a1: attribute node, a2: entity node, a3: target kind
entity_t* apply_nv_global_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {

    // Gate: only applies to functions (kind 11)
    if (a3 != 11)
        return a2;

    // ---- Phase 1: Linkage / constexpr lambda check ----
    // Bits 47 and 24 of the 48-bit field at +184
    if ((a2->qword_184 & 0x800001000000) == 0x800000000000) {
        // Constexpr lambda with internal linkage but no local flag
        char* name = get_entity_display_name(a2, 0);  // sub_6BC6B0
        emit_error(3469, a1->src_loc, "__global__", name);
        return a2;   // bail out, do not apply __global__
    }

    // ---- Phase 2: Structural constraints ----

    // 2a. Static member function check
    if ((signed char)a2->byte_176 < 0 && !(a2->byte_81 & 0x04))
        emit_warning(3507, a1->src_loc, "__global__");  // severity 5

    // 2b. operator() check
    if (a2->byte_166 == 5)
        emit_error(3644, a1->src_loc);  // severity 7

    // 2c. Exception specification check (uses return type chain)
    type_t* ret = a2->type_chain;  // entity+144
    while (ret->kind == 12)        // skip cv-qualifier wrappers
        ret = ret->referenced;     // type+144
    if (ret->prototype->exception_spec)  // proto+152 -> +56
        emit_error(3647, a1->src_loc);   // auto/decltype(auto) return

    // 2d. Execution space conflict
    uint8_t es = a2->byte_182;
    if (!relaxed_mode && (es & 0x60) == 0x20)  // already __device__ only
        emit_error(3481, a1->src_loc);
    if (es & 0x10)                              // already __host__ explicit
        emit_error(3481, a1->src_loc);

    // 2e. Return type must be void
    if (!(a2->byte_179 & 0x10)) {  // not constexpr
        if (a2->byte_191 & 0x01)   // is lambda
            emit_error(3506, a1->src_loc);
        else {
            type_t* base = skip_typedefs(a2->type_chain);  // sub_7A68F0
            if (!is_void_type(base->referenced))            // sub_7A6E90
                emit_error(3505, a1->src_loc);
        }
    }

    // 2f. Variadic (ellipsis) check
    type_t* proto_type = a2->type_chain;  // +144
    while (proto_type->kind == 12)
        proto_type = proto_type->referenced;
    if (proto_type->prototype->flags_16 & 0x01)  // bit 0 of proto+16
        emit_error(3503, a1->src_loc);

    // ---- Phase 3: Apply the bitmask ----
    a2->byte_182 |= 0x61;   // device_capable + device_annotation + global_kernel

    // ---- Phase 4: Additional checks (after bitmask set) ----

    // 4a. Local function (constexpr local)
    if (a2->byte_81 & 0x04)
        emit_error(3688, a1->src_loc);

    // 4b. main() function check
    if (a2 == main_entity && (a2->byte_182 & 0x20))
        emit_error(3538, a1->src_loc);

    // ---- Phase 5: Parameter iteration (__grid_constant__ warning) ----
    if (a1->flags & 0x01) {  // attr_node+11 bit 0: applies to parameters
        // Walk parameter list from prototype
        proto_type = a2->type_chain;
        while (proto_type->kind == 12)
            proto_type = proto_type->referenced;
        param_t* param = *proto_type->prototype->param_list;  // deref +152

        source_loc_t loc = a1->src_loc;  // +56
        for (; param != NULL; param = param->next) {
            // Peel cv-qualifier wrappers
            type_t* ptype = param->type;  // param[1]
            while (ptype->kind == 12)
                ptype = ptype->referenced;

            // Check: is type a __grid_constant__ candidate?
            // has_grid_constant_flag is sub_7A6B60: tests byte+133 bit 5 (0x20)
            if (!has_grid_constant_flag(ptype) && scope_index == -1) {
                // Scope table entries are 784 bytes each
                scope_t* scope = (scope_t*)(scope_table_base + 784 * scope_table_index);
                if ((scope->flags_6 & 0x06) == 0 && scope->kind_4 != 12) {
                    type_t* ptype2 = param->type;
                    while (ptype2->kind == 12)
                        ptype2 = ptype2->referenced;
                    if (!ptype2->default_init)  // type+120 == NULL
                        emit_error(3669, &loc);
                }
            }
        }
    }

    // ---- Phase 6: HD combined flag ----
    if (a2->byte_182 & 0x40)       // __global__ now set
        a2->byte_182 |= 0x80;      // mark as combined HD

    return a2;
}

Execution Order Detail

The 0x61 bitmask is applied before the local-function (3688) and main() (3538) checks but after all structural checks (3507, 3644, 3647, 3481, 3505/3506, 3503). This means the bitmask is set even when errors are emitted -- cudafe++ continues processing after errors to collect as many diagnostics as possible in a single compilation pass.

The constexpr-lambda check at the top (error 3469) is the only check that causes an early return. If the function is a constexpr lambda with wrong linkage, the bitmask is NOT set and no further validation is performed.
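The early-return condition is a two-bit test on the field at entity+184: bit 47 must be set while bit 24 is clear. A minimal sketch of that predicate follows; the macro and function names are hypothetical, and the bits are named by position only because their exact semantics are described above only loosely.

```c
#include <assert.h>
#include <stdint.h>

#define ENTITY184_BIT_47 (UINT64_C(1) << 47)
#define ENTITY184_BIT_24 (UINT64_C(1) << 24)

/* The combined mask matches the constant used in the handler. */
static_assert((ENTITY184_BIT_47 | ENTITY184_BIT_24) == UINT64_C(0x800001000000),
              "mask must equal 0x800001000000");

/* Returns 1 when the handler would emit error 3469 and bail out
 * without applying the 0x61 bitmask: bit 47 set and bit 24 clear. */
int global_attr_rejected_early(uint64_t qword_184) {
    uint64_t mask = ENTITY184_BIT_47 | ENTITY184_BIT_24;
    return (qword_184 & mask) == ENTITY184_BIT_47;
}
```

Other bits in the field do not affect the result, since they are masked off before the comparison.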

Validation Error Catalog

The 37 validation errors are organized by the phase in which they are checked and by semantic category. Error codes below are cudafe++ internal diagnostic numbers; severity values match the sub_4F41C0 severity parameter (5 = warning, 7 = error, 8 = hard error).

Category 1: Return Type

Error	Severity	Check	Message
3505	7	`!is_void_type(skip_typedefs(entity+144)->referenced)`	`a __global__ function must have a void return type`
3506	7	`entity+191 & 0x01` (lambda) and non-void	`a __global__ function must not have a deduced return type`
3647	7	`entity+152 -> +56 != NULL` (exception spec present on return proto)	auto/decltype(auto) deduced return type

Errors 3505 and 3506 are mutually exclusive paths guarded by the byte+179 & 0x10 constexpr flag. When the function is not constexpr, the handler checks whether it is a lambda (3506 path, which checks byte+191 bit 0) or a regular function (3505 path, which resolves through skip_typedefs via sub_7A68F0 and tests is_void_type via sub_7A6E90). The skip_typedefs function follows the type chain while type->kind == 12 (cv-qualifier wrapper) and (type->byte_161 & 0x7F) == 0 (no qualifier flags). The is_void_type function follows the same chain and returns kind == 1 (void).
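The chain walk can be modeled on a three-field type node. In the sketch below, kind 12 (wrapper) and kind 1 (void) come from the description above; the reduced node layout and the `skip_wrappers` name are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TYPE_KIND_VOID = 1, TYPE_KIND_WRAPPER = 12 };

/* Reduced type node: kind byte, qualifier flags (models byte_161),
 * and the wrapped type (models the `referenced` link). */
typedef struct type_node {
    uint8_t                 kind;
    uint8_t                 qual_flags;
    const struct type_node *referenced;
} type_node;

/* Follows wrapper nodes that carry no qualifier flags. */
const type_node *skip_wrappers(const type_node *t) {
    while (t != NULL && t->kind == TYPE_KIND_WRAPPER &&
           (t->qual_flags & 0x7F) == 0) {
        t = t->referenced;
    }
    return t;
}

/* Returns 1 when the chain bottoms out at a void node. */
int is_void_type(const type_node *t) {
    t = skip_wrappers(t);
    return t != NULL && t->kind == TYPE_KIND_VOID;
}
```

Note the explicit parentheses around the mask test: in C, `==` binds tighter than `&`, so `x & 0x7F == 0` would evaluate as `x & (0x7F == 0)`.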

Error 3647 is checked independently of 3505/3506. The check examines the exception specification pointer at prototype offset +56. In EDG's type system, auto and decltype(auto) return types are represented with a non-null exception specification node on the return type's prototype -- this is a repurposed field that indicates the return type is deduced.

Category 2: Parameters

Error	Severity	Check	Message
3503	8	`proto+16 & 0x01` (has ellipsis)	`a __global__ function cannot have ellipsis`
3702	7	`param_flags & 0x02` (rvalue ref)	`a __global__ function cannot have a parameter with rvalue reference type`
--	7	Parameter with `__restrict__` on reference type	`a __global__ function cannot have a parameter with __restrict__ qualified reference type`
--	7	Parameter of type `va_list`	`A __global__ function or function template cannot have a parameter with va_list type`
--	7	Parameter of type `std::initializer_list`	`a __global__ function or function template cannot have a parameter with type std::initializer_list`
--	7	Oversized alignment on win32	`cannot pass a parameter with a too large explicit alignment to a __global__ function on win32 platforms`
3669	8	Device-scope parameter without default init	`__grid_constant__` parameter warning (device-side check)

Error 3503 (ellipsis) is checked in the apply handler by testing bit 0 of the function prototype's flags word at offset +16. This bit indicates the parameter list ends with ....

Error 3702 (rvalue reference) is checked in the post-validation pass (sub_6BC890), not in the apply handler. The post-validator walks the parameter list and checks byte offset +32 (bit 1) of each parameter node.

The __restrict__ reference, va_list, initializer_list, and win32 alignment checks are scattered across separate validation functions in nv_transforms.c and are triggered during declaration processing rather than during attribute application.

Error 3669 is checked in the apply handler's parameter iteration loop. It walks each parameter, resolves through cv-qualifier wrappers, and tests whether sub_7A6B60 returns false (meaning the parameter type has bit 5 of byte+133 clear -- not a __grid_constant__ type) AND the scope lookup produces a non-array, non-qualifier type without a default initializer at type+120.

Category 3: Modifiers

Error	Severity	Check	Message
3507	5	`(signed char)byte_176 < 0 && !(byte_81 & 0x04)`	`A __global__ function or function template cannot be marked constexpr` (warning for static member)
3688	8	`byte_81 & 0x04` (local function)	`A __global__ function or function template cannot be marked constexpr` (constexpr local)
3481	8	Execution space conflict (see matrix)	Conflicting CUDA execution spaces
--	7	Function is consteval	`A __global__ function or function template cannot be marked consteval`
3644	7	`byte_166 == 5` (operator function kind)	`An operator function cannot be a __global__ function`
--	7	Defined in friend declaration	`A __global__ function or function template cannot be defined in a friend declaration`
--	7	Exception specification present	`An exception specification is not allowed for a __global__ function or function template`
--	7	Declared in inline unnamed namespace	`A __global__ function or function template cannot be declared within an inline unnamed namespace`
3538	7	`a2 == qword_126EB70` (is `main()`)	`function main cannot be marked __device__ or __global__`

Error 3507 deserves special attention. The decompiled code shows:

if ((signed char)a2->byte_176 < 0 && !(a2->byte_81 & 0x04))
    emit_warning(3507, ...);

The signed char cast means byte_176 >= 0x80 (bit 7 set = static member function). The !(byte_81 & 0x04) condition ensures it is NOT a local function. The emitter uses severity 5 (warning via sub_4F8DB0), meaning this is a warning, not an error -- NVIDIA chose to warn rather than reject __global__ on static members, though the official documentation says it is not allowed. The displayed string is "A __global__ function or function template cannot be marked constexpr" with "__global__" as the attribute name parameter, though the actual semantic is "static member function" per the field being checked.
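The equivalence between the signed-char comparison and a bit-7 test can be checked with a small sketch. The function name `warns_3507` is hypothetical; the two byte arguments stand in for entity+176 and entity+81.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the decompiled 3507 condition holds:
 * bit 7 of byte_176 set, and the local-function bit of byte_81 clear. */
static bool warns_3507(uint8_t byte_176, uint8_t byte_81) {
    bool bit7_set = (int8_t)byte_176 < 0;     /* same as (byte_176 & 0x80) != 0 */
    bool is_local = (byte_81 & 0x04) != 0;
    return bit7_set && !is_local;
}
```

Compilers typically emit the signed comparison for a bit-7 test because it needs only a sign check on the loaded byte, which is why Hex-Rays shows a cast instead of a mask.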

Error 3644 checks entity+166 == 5. This field stores the "operator function kind" enum value, where 5 corresponds to operator(). This prevents lambda call operators or functors from being directly marked __global__.

Error 3688 is checked after the bitmask is set (byte_182 |= 0x61). It tests byte_81 & 0x04, which indicates a local (block-scope) function. The handler emits with severity 8 (via sub_4F81B0, hard error).

Error 3538 compares the entity pointer against qword_126EB70, which holds the entity pointer for main() (set during initial declaration processing). The condition also requires byte_182 & 0x20 (device annotation bit set), which is always true after |= 0x61.

Category 4: Template Constraints

Error	Severity	Check	Message
--	7	Pack parameter is not last template parameter	`Pack template parameter must be the last template parameter for a variadic __global__ function template`
--	7	Multiple pack parameters	`Multiple pack parameters are not allowed for a variadic __global__ function template`

These checks are performed during template declaration processing in decls.c, not in the apply handler. They constrain variadic __global__ function templates: CUDA requires that pack parameters appear last (so the runtime can enumerate kernel arguments), and only a single pack is permitted (the CUDA launch infrastructure cannot handle multiple parameter packs).

Category 5: Redeclaration

Error	Severity	Check	Message
--	7	Previously `__global__`, now no execution space	`a __global__ function(%no1) redeclared without __global__`
--	7	Previously `__global__`, now `__host__`	`a __global__ function(%no1) redeclared with __host__`
--	7	Previously `__global__`, now `__device__`	`a __global__ function(%no1) redeclared with __device__`
--	7	Previously `__global__`, now `__host__ __device__`	`a __global__ function(%no1) redeclared with __host__ __device__`

These four variants are mirrored by three messages for the reverse direction:

a __device__ function(%no1) redeclared with __global__
a __host__ function(%no1) redeclared with __global__
a __host__ __device__ function(%no1) redeclared with __global__

Redeclaration checks occur during declaration merging in class_decl.c. When a function is redeclared and the execution space of the new declaration does not match the original, cudafe++ emits one of these errors. The %no1 format specifier inserts the function name. These checks run independently of the apply_nv_global_attr handler -- they operate on the merged entity after both attribute sets have been processed.

Category 6: Constexpr Lambda Linkage

Error	Severity	Check	Message
3469	5	`(qword_184 & 0x800001000000) == 0x800000000000`	`__global__` on constexpr lambda with wrong linkage

This is the first check in the apply handler and the only one that causes early return. The 48-bit field at entity+184 encodes template and linkage properties. Bit 47 (0x800000000000) indicates internal linkage or a similar constraint, while bit 24 (0x000001000000) indicates a local entity. When bit 47 is set but bit 24 is clear, the entity is a constexpr lambda that cannot legally receive __global__. The handler calls sub_6BC6B0 (get_entity_display_name) to format the entity name for the diagnostic message, then returns without setting the bitmask.
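The two-bit mask comparison can be written out as a standalone predicate. This is a sketch under the bit meanings stated above; the function and constant names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINKAGE_BIT47 0x800000000000ULL  /* bit 47, as described in the text */
#define LINKAGE_BIT24 0x000001000000ULL  /* bit 24, as described in the text */

/* True when bit 47 is set and bit 24 is clear, i.e. the 3469 early-return case. */
static bool hits_3469(uint64_t qword_184) {
    return (qword_184 & (LINKAGE_BIT47 | LINKAGE_BIT24)) == LINKAGE_BIT47;
}
```

Masking both bits and comparing against only one of them tests "set and clear" in a single comparison, which matches the shape of the decompiled expression.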

Category 7: Post-Validation (sub_6BC890)

These checks run after all attributes on a declaration have been applied, in the nv_validate_cuda_attributes function:

Error	Severity	Check	Message
3702	7	Parameter with rvalue reference flag (bit 1 at param+32)	`a __global__ function cannot have a parameter with rvalue reference type`
3661	7	`__nv_register_params__` on `__global__`	`__nv_register_params__ is not allowed on a __global__ function`
3534	7	`__launch_bounds__` on non-`__global__`	`%s attribute is not allowed on a non-__global__ function`
3707	7	maxBlocksPerCluster < cluster product	`total number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit`
3715	7	`__maxnreg__` on non-`__global__`	`__maxnreg__ is not allowed on a non-__global__ function`
3719	7	Both `__launch_bounds__` and `__maxnreg__`	`__launch_bounds__ and __maxnreg__ may not be used on the same declaration`
3695	4	`__global__` without `__launch_bounds__`	`no __launch_bounds__ specified for __global__ function` (warning)

Error 3695 is a severity-4 diagnostic (informational warning). It fires when a __global__ function has no associated launch configuration, encouraging developers to specify __launch_bounds__ for optimal register allocation. It is the only severity-4 diagnostic among these constraints; 3507 and 3469 are also non-fatal, but they are emitted as severity-5 warnings.

Entity Node Field Reference

The apply handler reads and writes specific fields within the entity node. Complete field semantics:

Offset	Size	Field Name	Role in `__global__` Validation
`+81`	1 byte	`local_flags`	Bit 2 (0x04): function is local (block-scope). Checked for 3688 and as exemption for 3507.
`+144`	8 bytes	`type_chain`	Pointer to return type. Followed through `kind==12` cv-qualifier wrappers.
`+152`	8 bytes	`prototype`	Function prototype pointer. At prototype+16: flags (bit 0 = ellipsis). At prototype+56: exception spec pointer. At prototype+0: parameter list head (double deref for first param).
`+166`	1 byte	`operator_kind`	Value 5 = `operator()`. Checked for 3644.
`+176`	1 byte	`member_flags`	Bit 7 (0x80, checked as `signed char < 0`): static member function. Checked for 3507.
`+179`	1 byte	`constexpr_flags`	Bit 4 (0x10): function is constexpr. Guards 3505/3506 check (skipped if constexpr).
`+182`	1 byte	`execution_space`	The primary execution space bitfield. `|= 0x61` sets global kernel. Read for conflict checks (0x60, 0x10 masks).
`+183`	1 byte	`extended_cuda`	Bit 3 (0x08): `__nv_register_params__`. Checked in post-validation. Bit 6 (0x40): `__cluster_dims__` set.
`+184`	8 bytes	`linkage_template`	48-bit field encoding template/linkage flags. Only lower 48 bits used; mask `0x800001000000` checks constexpr lambda linkage.
`+191`	1 byte	`lambda_flags`	Bit 0 (0x01): entity is a lambda. Routes to 3506 instead of 3505 for void-return check.
`+256`	8 bytes	`launch_config`	Pointer to launch configuration struct (56 bytes). NULL if no launch attributes applied. Read in post-validation.

The 0x61 Bitmask

The OR mask 0x61 sets three bits in the execution space byte:

0x61 = 0b01100001

  bit 0 (0x01):  device_capable     -- function can run on device
  bit 5 (0x20):  device_annotation  -- has explicit device-side annotation
  bit 6 (0x40):  global_kernel      -- function is a __global__ kernel

Bit 0 is shared with __device__ (0x23) and __host__ (0x15). It serves as a "has CUDA annotation" predicate -- any entity with bit 0 set has been explicitly annotated with at least one execution space keyword. This enables fast if (byte_182 & 0x01) checks throughout the codebase.

Bit 5 is shared with __device__. A __global__ function is considered device-annotated because kernel code executes on the GPU.

Bit 6 is unique to __global__. The mask byte_182 & 0x40 is the canonical predicate for "is this a kernel function?" used in dozens of locations throughout the binary.
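The three predicates described above can be sketched as one-line helpers. The enum and function names are hypothetical; the bit values are the ones given in the text.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    ES_ANNOTATED   = 0x01,  /* bit 0: shared by __host__, __device__, __global__ */
    ES_DEVICE_ANNO = 0x20,  /* bit 5: shared by __device__ and __global__ */
    ES_KERNEL      = 0x40   /* bit 6: unique to __global__ */
};

static bool has_cuda_annotation(uint8_t byte_182) { return (byte_182 & ES_ANNOTATED) != 0; }
static bool is_device_annotated(uint8_t byte_182) { return (byte_182 & ES_DEVICE_ANNO) != 0; }
static bool is_kernel(uint8_t byte_182)           { return (byte_182 & ES_KERNEL) != 0; }
```

Applied to the three masks quoted above, 0x23 and 0x61 are device-annotated, 0x15 is not, and only 0x61 is a kernel.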

HD Combined Flag (0x80)

After setting 0x61, the handler checks whether bit 6 (0x40, global kernel) is now set. If so, it ORs 0x80 into the byte. This bit means "combined host+device" and is set as a secondary effect. The logic at the end of the function:

if (a2->byte_182 & 0x40)       // always true here: bit 6 was just set via |= 0x61
    a2->byte_182 |= 0x80;      // mark as combined host+device

This means every __global__ function ends up with byte_182 & 0x80 set, which marks it as "combined" in the execution space classification. This is semantically correct: a kernel has both a host-side stub (for launching) and device-side code (for execution).
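Putting the two steps together, the net effect on the execution space byte can be sketched as below. The function name is hypothetical, and the sketch models only the two OR operations described here, not the conflict checks that precede them.

```c
#include <stdint.h>

/* Model of the bitmask update: OR in 0x61, then OR in 0x80 when bit 6 is set. */
static uint8_t apply_global_bits(uint8_t byte_182) {
    byte_182 = (uint8_t)(byte_182 | 0x61);
    if (byte_182 & 0x40)
        byte_182 = (uint8_t)(byte_182 | 0x80);
    return byte_182;
}
```

Starting from an unannotated entity (0x00), the result is 0xE1, and applying the update a second time leaves the byte unchanged.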

Parameter Iteration for __grid_constant__

The final section of the apply handler iterates the function's parameter list to check for parameters that should be annotated __grid_constant__. This check only runs when attr_node->flags bit 0 (a1+11 & 0x01) is set, indicating the attribute application context includes parameter-level processing.

The iteration follows this structure:

// Navigate to function prototype
type_t* proto_type = entity->type_chain;     // +144
while (proto_type->kind == 12)               // skip cv-qualifiers
    proto_type = proto_type->referenced;      // +144

// Get parameter list head (double dereference)
param_t** param_list = proto_type->prototype->param_head;  // proto+152 -> deref
param_t* param = *param_list;                               // deref again

for (; param != NULL; param = param->next) {
    // Navigate to unqualified parameter type
    type_t* ptype = param[1];    // param->type (offset 8)
    while (ptype->kind == 12)
        ptype = ptype->referenced;

    // sub_7A6B60: checks byte+133 bit 5 (0x20) -- "has __grid_constant__"
    bool has_gc = (ptype->byte_133 & 0x20) != 0;

    if (!has_gc && dword_126C5C4 == -1) {
        // Scope table lookup
        int64_t scope = qword_126C5E8 + 784 * dword_126C5E4;
        uint8_t scope_flags = scope->byte_6;
        uint8_t scope_kind = scope->byte_4;

        // Skip if scope has qualifier flags or is a cv-qualified scope
        if ((scope_flags & 0x06) == 0 && scope_kind != 12) {
            // Re-navigate to unqualified type
            type_t* ptype2 = param[1];
            while (ptype2->kind == 12)
                ptype2 = ptype2->referenced;

            // Check for default initializer
            if (ptype2->qword_120 == 0)
                emit_error(3669, &saved_source_loc);
        }
    }
}

The scope table lookup uses a 784-byte scope structure (at qword_126C5E8 indexed by dword_126C5E4) to determine whether the current context is device-side. The dword_126C5C4 == -1 check verifies we are in device compilation mode. This entire parameter iteration is a device-side warning mechanism: it alerts developers when a kernel parameter lacks a default initializer in a context where __grid_constant__ would be appropriate.

Post-Declaration Validation (sub_6BC890)

After all attributes on a declaration are applied, nv_validate_cuda_attributes (sub_6BC890, 161 lines) performs cross-attribute consistency checks. For __global__ functions, this function enforces:

Rvalue Reference Parameters (3702)

// Walk parameter list
type_t* ret = entity->type_chain;
while (ret->kind == 12)
    ret = ret->referenced;
param_t* param = **((param_t***)ret + 19);  // proto -> param list

while (param) {
    if (param->byte_32 & 0x02)  // rvalue reference flag
        emit_error(3702, source_loc);
    param = param->next;
}

This check scans all parameters for the rvalue reference flag (bit 1 at parameter node offset +32). Kernel functions cannot accept rvalue references because kernel launch involves copying arguments through the CUDA runtime, which does not support move semantics across the host-device boundary.

__nv_register_params__ Conflict (3661)

if (entity->byte_183 & 0x08) {  // __nv_register_params__ set
    if (entity->byte_182 & 0x40)
        emit_error(3661, ..., "__global__");
    else if ((entity->byte_182 & 0x30) == 0x20)
        emit_error(3661, ..., "__host__");
}

The __nv_register_params__ attribute (bit 3 of byte+183) is incompatible with __global__ because kernel parameter passing uses a fixed ABI that cannot be overridden.

Launch Configuration Without __global__ (3534)

launch_config_t* lc = entity->launch_config;  // +256
if (lc && !(entity->byte_182 & 0x40)) {
    if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor)
        emit_error(3534, ..., "__launch_bounds__");
}

The __launch_bounds__, __cluster_dims__, and __block_size__ attributes require __global__. If a non-kernel function has any of these, error 3534 fires.

Cluster Dimension Product Check (3707)

if (lc->cluster_dim_x > 0 && lc->maxBlocksPerCluster > 0) {
    uint64_t product = lc->cluster_dim_x * lc->cluster_dim_y * lc->cluster_dim_z;
    if (lc->maxBlocksPerCluster < product)
        emit_error(3707, ...);
}
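The comparison above multiplies three int32 fields before widening. A standalone version of the same test, with the widening done before the multiplication, might look like the sketch below. The function name is hypothetical, and it assumes cluster_dim_y and cluster_dim_z are set (positive) whenever cluster_dim_x is.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the 3707 condition holds: the cluster product exceeds
 * maxBlocksPerCluster. Unset fields (-1) skip the check, as in the text. */
static bool cluster_exceeds_limit(int32_t x, int32_t y, int32_t z,
                                  int32_t max_blocks_per_cluster) {
    if (x <= 0 || max_blocks_per_cluster <= 0)
        return false;                                    /* sentinel or zero: no check */
    uint64_t product = (uint64_t)x * (uint64_t)y * (uint64_t)z;  /* widen first */
    return (uint64_t)max_blocks_per_cluster < product;
}
```

For example, a 2x2x2 cluster with a limit of 4 trips the check, while 2x2x1 with the same limit does not.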

__launch_bounds__ and __maxnreg__ Conflict (3719)

if (lc->maxThreadsPerBlock && lc->maxnreg >= 0)
    emit_error(3719, ..., "__launch_bounds__ and __maxnreg__");

These two attributes provide contradictory register pressure hints and cannot coexist.

Missing __launch_bounds__ Warning (3695)

if ((entity->byte_182 & 0x40) &&
    (!lc || (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor)))
    emit_warning(3695);

Severity 4 (advisory). Encourages developers to annotate kernels with __launch_bounds__ for optimal register allocation.

Execution Space Conflict Matrix

When __global__ is applied to a function that already has an execution space annotation, the handler checks for conflicts using two conditions:

// Condition 1: already __device__ only (without relaxed mode)
if (!dword_106BFF0 && (byte_182 & 0x60) == 0x20)
    error(3481);

// Condition 2: already __host__ explicit
if (byte_182 & 0x10)
    error(3481);

Current `byte_182`	Applying `__global__`	`(byte & 0x60) == 0x20`	`byte & 0x10`	Result
`0x00` (none)	`|= 0x61` -> `0x61`	false	false	accepted
`0x23` (`__device__`)		true	false	error 3481 (unless relaxed)
`0x15` (`__host__`)		false	true	error 3481
`0x37` (`__host__ __device__`)		true	true	error 3481
`0x61` (`__global__`)		false	false	accepted -- idempotent bitmask

In relaxed mode (dword_106BFF0 != 0), the first condition is suppressed, allowing __device__ + __global__ combinations. The second condition (explicit __host__) is never relaxed.
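The two conditions can be evaluated mechanically for any starting byte. The sketch below is a direct transcription of the conditions shown in this section; the function name and the `relaxed` parameter (standing in for dword_106BFF0 != 0) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when applying __global__ to an entity with this byte_182 would
 * raise error 3481, per the two conditions described in the text. */
static bool global_conflicts(uint8_t byte_182, bool relaxed) {
    if (!relaxed && (byte_182 & 0x60) == 0x20)
        return true;               /* condition 1: device-annotated, not a kernel */
    if (byte_182 & 0x10)
        return true;               /* condition 2: explicit __host__, never relaxed */
    return false;
}
```

Because 0x61 & 0x60 is 0x60 rather than 0x20, a byte that already carries the kernel bit does not trigger the first condition.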

Helper Functions

Address	Identity	Lines	Purpose
`sub_6BC6B0`	`get_entity_display_name`	49	Formats entity name for diagnostic messages. Handles demangling, strips leading `::`.
`sub_7A68F0`	`skip_typedefs`	19	Follows type chain through `kind==12` wrappers while `(byte_161 & 0x7F) == 0`.
`sub_7A6E90`	`is_void_type`	16	Follows type chain through `kind==12`, returns `kind == 1`.
`sub_7A6B60`	`has_grid_constant_flag`	9	Follows type chain through `kind==12`, returns `byte_133 & 0x20`.
`sub_4F7510`	`emit_error_with_names`	66	Emits error with two string arguments (attribute name + entity name).
`sub_4F8DB0`	`emit_warning_with_name`	38	Emits warning (severity 5) with one string argument.
`sub_4F8200`	`emit_error_basic`	10	Emits error with severity + code + source location.
`sub_4F81B0`	`emit_error_minimal`	10	Emits error (severity 8) with code + source location.
`sub_4F8490`	`emit_error_with_extra`	38	Emits error with one supplementary argument.

Additional __global__ Constraints (Outside Apply Handler)

Beyond the apply handler and post-validation, several other subsystems enforce __global__-specific rules. These checks occur during template instantiation, lambda processing, and declaration merging:

Template Argument Type Restrictions

CUDA restricts which types can appear as template arguments in __global__ function template instantiations:

Host-local types (defined inside a __host__ function) cannot be used
Private/protected class members cannot be used (unless the class is local to a __device__/__global__ function)
Unnamed types cannot be used (unless local to a __device__/__global__ function)
Lambda closure types cannot be used (unless the lambda is defined in a __device__/__global__ function, or is an extended lambda with --extended-lambda)
Texture/surface variables cannot be used as non-type template arguments
Private/protected template template arguments from class scope cannot be used

Static Global Template Stub

In whole-program compilation mode (-rdc=false) with -static-global-template-stub=true:

Extern __global__ function templates are not supported
__global__ function template instantiations must have definitions in the current TU

Device-Side Restrictions

Functions marked __global__ (or __device__) are subject to additional restrictions during semantic analysis:

address of label extension is not supported
ASM operands may specify only one constraint letter
Certain ASM constraint letters are forbidden
Texture/surface variables cannot have their address taken or be indirected
Anonymous union member variables at global/namespace scope cannot be directly accessed
Function-scope static variables require a memory space specifier
Dynamic initialization of function-scope static variables is not supported

Function Map

Address	Identity	Lines	Source File
`sub_40E1F0`	`apply_nv_global_attr` (variant 1)	89	`attribute.c`
`sub_40E7F0`	`apply_nv_global_attr` (variant 2)	86	`attribute.c`
`sub_6BC890`	`nv_validate_cuda_attributes`	161	`nv_transforms.c`
`sub_6BC6B0`	`get_entity_display_name`	49	`nv_transforms.c`
`sub_7A68F0`	`skip_typedefs`	19	`types.c`
`sub_7A6E90`	`is_void_type`	16	`types.c`
`sub_7A6B60`	`has_grid_constant_flag`	9	`types.c`
`sub_4F7510`	`emit_error_with_names`	66	`error.c`
`sub_4F8DB0`	`emit_warning_with_name`	38	`error.c`
`sub_4F8200`	`emit_error_basic`	10	`error.c`
`sub_4F81B0`	`emit_error_minimal`	10	`error.c`
`sub_4F8490`	`emit_error_with_extra`	38	`error.c`
`sub_413240`	`apply_one_attribute` (dispatch)	585	`attribute.c`

Global Variables

Global	Address	Purpose
`dword_106BFF0`	`0x106BFF0`	Relaxed mode flag. When set, suppresses `__device__` + `__global__` conflict (3481).
`qword_126EB70`	`0x126EB70`	Pointer to the entity node for `main()`. Compared during 3538 check.
`dword_126C5C4`	`0x126C5C4`	Scope index sentinel (`-1` = device compilation mode). Guards 3669 parameter check.
`dword_126C5E4`	`0x126C5E4`	Current scope table index.
`qword_126C5E8`	`0x126C5E8`	Scope table base pointer. Each entry is 784 bytes.

Cross-References

Execution Spaces -- bitfield layout, conflict matrix, virtual override checking
Attribute System Overview -- dispatch table, attribute node structure, application pipeline
__grid_constant__ -- the parameter attribute that interacts with the 3669 check
Launch Configuration Attributes -- __launch_bounds__, __cluster_dims__, __block_size__ (post-validation errors 3534, 3707, 3715, 3719, 3695)
Entity Node Layout -- full byte map with all CUDA fields
Kernel Stubs -- host-side stub generation triggered by byte_182 & 0x40
CUDA Template Restrictions -- template argument type restrictions for __global__ instantiations
Diagnostics Overview -- error emission functions and severity levels

Launch Configuration Attributes

cudafe++ supports five attributes that control CUDA kernel launch parameters: __launch_bounds__, __cluster_dims__, __block_size__, __maxnreg__, and __local_maxnreg__. All five store their values into a shared 56-byte launch configuration struct pointed to by entity+256. The struct is lazily allocated on first use by sub_5E52F0 and initialized with sentinel values (-1 for all int32 fields, 0 for the two leading int64 fields, flags cleared). Each attribute handler parses its arguments through a shared constant-expression evaluation pipeline (sub_461640 for value extraction, sub_461980 for sign checking), validates positivity and 32-bit range, then writes results into specific offsets of the struct. A post-declaration validation pass (sub_6BC890 in nv_transforms.c) enforces cross-attribute constraints: launch config attributes require __global__, cluster dimensions must not exceed __launch_bounds__, and __maxnreg__ is mutually exclusive with __launch_bounds__.

Key Facts

Property	Value
Source files	`attribute.c` (apply handlers), `nv_transforms.c` (post-validation)
`__launch_bounds__` handler	`sub_411C80` (98 lines)
`__cluster_dims__` handler	`sub_4115F0` (145 lines)
`__block_size__` handler	`sub_4109E0` (265 lines)
`__maxnreg__` handler	`sub_410F70` (67 lines)
`__local_maxnreg__` handler	`sub_411090` (67 lines)
Post-validation	`sub_6BC890` (`nv_validate_cuda_attributes`, 161 lines)
Struct allocator	`sub_5E52F0` (42 lines)
Constant value extractor	`sub_461640` (`const_expr_get_value`, 53 lines)
Constant sign checker	`sub_461980` (`const_expr_sign_compare`, 97 lines)
Dependent-type check	`sub_7BE9E0` (`is_dependent_type`)
Entity field	`entity+256` -- pointer to `launch_config_t` (56 bytes, NULL if no launch attrs)
Entity extended flags	`entity+183` bit 6 (0x40): cluster_dims intent (set by zero-argument `__cluster_dims__`)
Total error codes	17 distinct diagnostics across all five attributes and post-validation

Attribute Kind Codes

Each CUDA attribute carries a kind byte at attr_node+8. The five launch config attributes use these values from the attribute_display_name (sub_40A310) switch table:

Kind	Hex	ASCII	Attribute	Handler
92	0x5C	`'\'`	`__launch_bounds__`	`sub_411C80`
93	0x5D	`']'`	`__maxnreg__`	`sub_410F70`
94	0x5E	`'^'`	`__local_maxnreg__`	`sub_411090`
107	0x6B	`'k'`	`__cluster_dims__`	`sub_4115F0`
108	0x6C	`'l'`	`__block_size__`	`sub_4109E0`

Kinds 92--94 are part of the original dense block (86--95). Kinds 107 and 108 were added later for cluster/Hopper-era features, occupying gaps in the ASCII range.

Launch Configuration Struct Layout

The struct is allocated by sub_5E52F0 and returned with a 16-byte offset from the raw allocation base. All handlers access the struct through the pointer stored at entity+256. The allocator initializes all int32 fields to -1 (sentinel for "not set") and zeroes the two leading int64 fields and the flags byte.

struct launch_config_t {                  // 56 bytes (offsets from entity+256 pointer)
    int64_t  maxThreadsPerBlock;          // +0   from __launch_bounds__ arg 1 (init: 0)
    int64_t  minBlocksPerMultiprocessor;  // +8   from __launch_bounds__ arg 2 (init: 0)
    int32_t  maxBlocksPerCluster;         // +16  from __launch_bounds__ arg 3 (init: -1)
    int32_t  cluster_dim_x;               // +20  from __cluster_dims__ / __block_size__ (init: -1)
    int32_t  cluster_dim_y;               // +24  from __cluster_dims__ / __block_size__ (init: -1)
    int32_t  cluster_dim_z;               // +28  from __cluster_dims__ / __block_size__ (init: -1)
    int32_t  maxnreg;                     // +32  from __maxnreg__ (init: -1)
    int32_t  local_maxnreg;               // +36  from __local_maxnreg__ (init: -1)
    int32_t  block_size_x;                // +40  from __block_size__ (init: -1)
    int32_t  block_size_y;                // +44  from __block_size__ (init: -1)
    int32_t  block_size_z;                // +48  from __block_size__ (init: -1)
    uint8_t  flags;                       // +52  bit 0: cluster_dims_set
                                          //      bit 1: block_size_set
    // +53..+55: padding
};

The struct packs integer fields of mixed widths. The first two fields (maxThreadsPerBlock and minBlocksPerMultiprocessor) are 64-bit to accommodate the full range of CUDA launch bounds values. The cluster dimensions, block sizes, and register counts are 32-bit because individual values cannot exceed hardware limits. The flags byte at offset +52 records which dimension-setting attributes have been applied, enabling mutual exclusion enforcement between __cluster_dims__ and __block_size__.
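The offsets in the layout can be cross-checked by declaring the struct and letting the compiler verify them. This sketch assumes a typical x86-64 ABI with natural alignment; the `_sketch` type name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Compilable mirror of the documented layout, for offset checking only. */
typedef struct launch_config_sketch {
    int64_t maxThreadsPerBlock;
    int64_t minBlocksPerMultiprocessor;
    int32_t maxBlocksPerCluster;
    int32_t cluster_dim_x, cluster_dim_y, cluster_dim_z;
    int32_t maxnreg;
    int32_t local_maxnreg;
    int32_t block_size_x, block_size_y, block_size_z;
    uint8_t flags;   /* bit 0: cluster_dims_set, bit 1: block_size_set */
} launch_config_sketch;

_Static_assert(offsetof(launch_config_sketch, maxBlocksPerCluster) == 16, "+16");
_Static_assert(offsetof(launch_config_sketch, cluster_dim_x) == 20, "+20");
_Static_assert(offsetof(launch_config_sketch, maxnreg) == 32, "+32");
_Static_assert(offsetof(launch_config_sketch, block_size_x) == 40, "+40");
_Static_assert(offsetof(launch_config_sketch, flags) == 52, "+52");
_Static_assert(sizeof(launch_config_sketch) == 56, "56 bytes with tail padding");
```

The three trailing padding bytes come from the struct's 8-byte alignment, which rounds 53 bytes of fields up to 56.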

Allocator: sub_5E52F0

The allocator performs arena allocation via sub_6B7D60, then initializes every field:

// sub_5E52F0 -- allocate_launch_config
launch_config_t* allocate_launch_config() {
    void* raw = arena_alloc(pool_id, launch_config_pool_size + 56);
    char* base = pool_base + raw;

    if (!abi_mode) {              // dword_106BA08 == 0
        ++alloc_counter_prefix;
        base += 8;
        *(int64_t*)(base - 8) = 0;   // 8-byte ABI prefix
    }

    ++alloc_counter_main;

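    // Zero the 16-byte header and the first int64 field
    *(int64_t*)(base + 0)  = 0;       // header word (returned-16)
    *(int64_t*)(base + 8)  = 0;       // header word (returned-8); byte 8 is rewritten below
    *(int64_t*)(base + 16) = 0;       // returned+0: maxThreadsPerBlock = 0
    // Note: returned+8 (minBlocksPerMultiprocessor) corresponds to base+24;
    // no store to that offset appears in this reconstruction.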

    // Initialize all int32 fields to -1 (sentinel = "not set")
    *(int32_t*)(base + 32) = -1;      // returned+16: maxBlocksPerCluster
    *(int32_t*)(base + 36) = -1;      // returned+20: cluster_dim_x
    *(int32_t*)(base + 40) = -1;      // returned+24: cluster_dim_y
    *(int32_t*)(base + 44) = -1;      // returned+28: cluster_dim_z
    *(int32_t*)(base + 48) = -1;      // returned+32: maxnreg
    *(int32_t*)(base + 52) = -1;      // returned+36: local_maxnreg
    *(int32_t*)(base + 56) = -1;      // returned+40: block_size_x
    *(int32_t*)(base + 60) = -1;      // returned+44: block_size_y
    *(int32_t*)(base + 64) = -1;      // returned+48: block_size_z
    base[68] &= 0xFC;                // returned+52: clear flags bits 0 and 1

    // Set internal flags byte combining ABI mode, device mode, marker
    base[8] = (8 * (device_flag & 1)) & 0x7F
            | (2 * (!abi_mode))       & 0x7E
            | 1;

    return (launch_config_t*)(base + 16);   // return with 16-byte offset
}

The sentinel value -1 (0xFFFFFFFF as unsigned, -1 as signed) is semantically meaningful throughout: handlers and the post-validator test field >= 0 or field > 0 to determine whether a field has been set. A value of -1 always fails both tests, so unset fields are correctly treated as absent. The two leading int64 fields use 0 as their sentinel since they store __launch_bounds__ arguments where zero means "not specified."
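The two "is set" tests behave as follows for the sentinel and for ordinary values. The helper and macro names are hypothetical; only the comparisons come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define LC_UNSET (-1)   /* sentinel written by the allocator into int32 fields */

/* The two forms of "field has been set" test described in the text. */
static bool lc_field_is_set(int32_t v)      { return v >= 0; }
static bool lc_field_is_positive(int32_t v) { return v > 0; }
```

A field still holding LC_UNSET fails both tests, while a field explicitly set to zero passes the first and fails the second, which is why the choice between `>= 0` and `> 0` matters for attributes that accept zero.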

Constant-Expression Evaluation Pipeline

All five attribute handlers share the same two-function pipeline for parsing attribute argument values from EDG's internal 128-bit constant representation.

sub_461980 -- const_expr_sign_compare

Compares a constant expression's value against a 64-bit threshold. Returns +1 if the expression value is greater, -1 if less, 0 if equal. The comparison operates on the 128-bit extended-precision value stored at offsets +152 through +166 (eight 16-bit words) of the expression node.

// sub_461980 -- const_expr_sign_compare(expr_node, threshold)
// Returns: +1 if expr > threshold, -1 if expr < threshold, 0 if equal
int32_t const_expr_sign_compare(expr_node_t* expr, int64_t threshold) {
    // Decompose threshold into eight 16-bit words with sign extension
    uint16_t thresh_words[8];
    // ... sign-extension propagation through all 8 words ...
    uint16_t threshold_high = thresh_words[0];   // most-significant word

    // Navigate to base type, skipping cv-qualifier wrappers (kind == 12)
    type_t* type = expr->type_chain;    // expr+112
    while (type->kind_132 == 12)
        type = type->referenced;        // type+144

    // Determine signedness from base type
    bool is_signed = (type->kind_132 == 2
                      && is_signed_type_table[type->subkind_144]);

    if (is_signed && (expr->word_152 & 0x8000)) {
        // Negative expression value
        if (!(threshold_high & 0x8000))
            return -1;    // negative < non-negative
    } else if (!is_signed) {
        if (threshold_high & 0x8000)
            return 1;     // non-negative > negative threshold
    }

    // Word-by-word comparison from most-significant to least
    // expr+152 (MSW) through expr+166 (LSW) vs threshold words
    for (int i = 0; i < 8; i++) {
        if (expr->words[152 + 2*i] > thresh_words[i]) return 1;
        if (expr->words[152 + 2*i] < thresh_words[i]) return -1;
    }
    return 0;  // equal
}

The handlers call const_expr_sign_compare(expr, 0) to check positivity:

A return value <= 0 means the argument is non-positive (used by __cluster_dims__, __block_size__, __maxnreg__, __local_maxnreg__)
A return value < 0 means the argument is strictly negative (used by __launch_bounds__ arg 3, where zero is allowed)
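The elided sign-extension step and the word-wise comparison can be illustrated with a simplified model. This sketch treats both operands as signed 128-bit two's complement values stored most-significant word first; it does not model the type-dependent signedness lookup in the real function, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Split a signed 64-bit threshold into eight 16-bit words, MSW first,
 * filling the upper four words with the sign. */
static void split_threshold(int64_t t, uint16_t out[8]) {
    uint16_t fill = (t < 0) ? 0xFFFF : 0x0000;
    uint64_t u = (uint64_t)t;
    for (int i = 0; i < 4; i++)
        out[i] = fill;
    for (int i = 0; i < 4; i++)
        out[4 + i] = (uint16_t)(u >> (48 - 16 * i));
}

/* Compare two signed 128-bit values: +1 if a > b, -1 if a < b, 0 if equal. */
static int compare_words_signed(const uint16_t a[8], const uint16_t b[8]) {
    bool a_neg = (a[0] & 0x8000) != 0;
    bool b_neg = (b[0] & 0x8000) != 0;
    if (a_neg != b_neg)
        return a_neg ? -1 : 1;        /* differing signs decide immediately */
    for (int i = 0; i < 8; i++) {     /* same sign: unsigned word order is correct */
        if (a[i] > b[i]) return 1;
        if (a[i] < b[i]) return -1;
    }
    return 0;
}
```

With a threshold of 0, the result is positive for any positive value, zero for zero, and negative for any negative value, which is what the positivity checks above rely on.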

sub_461640 -- const_expr_get_value

Extracts a uint64_t value from a constant expression node's 128-bit representation. Sets an overflow flag if the value does not fit in 64 bits (accounting for sign).

// sub_461640 -- const_expr_get_value(expr_node, *overflow_flag)
// Returns: uint64_t value; *overflow_flag = 1 if truncation occurred
uint64_t const_expr_get_value(expr_node_t* expr, int32_t* overflow) {
    // Navigate to base type
    type_t* type = expr->type_chain;    // expr+112
    while (type->kind_132 == 12)
        type = type->referenced;

    uint16_t sign_word = expr->word_152;    // most-significant of 128-bit value
    bool is_signed = (type->kind_132 == 2
                      && is_signed_type_table[type->subkind_144]);

    int16_t expected_high;
    if (is_signed) {
        *overflow = 0;
        expected_high = -(sign_word >> 15);     // -1 if negative, 0 if positive
    } else {
        *overflow = 0;
        expected_high = 0;
    }

    // Verify that the upper 64 bits match the expected sign-extension pattern
    bool has_overflow = (sign_word != (uint16_t)expected_high);
    if (expr->word_154 != (uint16_t)expected_high) has_overflow = true;
    if (expr->word_156 != (uint16_t)expected_high) has_overflow = true;
    if (expr->word_158 != (uint16_t)expected_high) has_overflow = true;

    // Reconstruct 64-bit value from the lower four 16-bit words
    uint64_t result = ((uint64_t)expr->word_160 << 48)
                    | ((uint64_t)expr->word_162 << 32)
                    | ((uint64_t)expr->word_164 << 16)
                    | ((uint64_t)expr->word_166);

    if (!is_signed) {
        if (has_overflow) { *overflow = 1; }
        return result;
    }
    // Signed: verify sign bit consistency
    if (((uint16_t)expected_high) != (uint16_t)(result >> 63)
        || has_overflow
        || (int16_t)sign_word < 0) {
        *overflow = 1;
    }
    return result;
}

The overflow flag is used by all handlers with a consistent check pattern:

int32_t overflow;
uint64_t val = const_expr_get_value(expr, &overflow);
if (overflow || val > 0x7FFFFFFF)
    emit_error(OVERFLOW_ERROR_CODE, src_loc);
else
    launch_config->field = (int32_t)val;

Template-Dependent Argument Bailout

Before evaluating constant expressions, all five handlers walk the attribute argument list checking for template-dependent types via sub_7BE9E0 (is_dependent_type). The walk follows a linked list of argument nodes (head at attr_node+32), where each node has:

Offset	Field	Description
`+0`	`next`	Next argument node in list
`+10`	`kind`	Argument kind: 3 = type-qualified, 4 = expression, 5 = indirect expression
`+32`	`expr`	Expression/type pointer (accessed as `node[4]` in decompiled code)

If any argument has a dependent type, the handler returns immediately without modifying the entity. This defers attribute processing to template instantiation time, when concrete values are available:

// Common bailout pattern (appears in all 5 handlers)
arg_node_t* walk = attr_node->arg_list;            // attr_node+32
while (walk) {
    switch (walk->kind_10) {
        case 3:   // type-qualified argument
            if (walk->expr->kind_148 == 12)         // cv-qualifier wrapper (expr is walk+32)
                return entity;                      // dependent -- bail
            break;
        case 4:   // expression argument
            if (is_dependent_type(walk->expr))      // sub_7BE9E0
                return entity;
            // The decompiler re-tests kind == 5 here before falling into
            // case 5; that test cannot succeed inside this arm.
            break;
        case 5:   // indirect expression
            if (is_dependent_type(*walk->expr))     // one extra dereference
                return entity;
            break;
        default:
            break;
    }
    walk = walk->next;
}
// All args are concrete -- proceed with evaluation
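
The shape of the walk can be shown with a small compilable model. This is a sketch under stated assumptions: `arg_node` is a hypothetical, simplified node, and its `dependent` field stands in for whatever `is_dependent_type` (sub_7BE9E0) would report for the node's expression or type. It does not reproduce the real node layout.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified argument node: only the fields the walk needs. */
typedef struct arg_node {
    struct arg_node *next;   /* next argument, NULL at end of list */
    int kind;                /* 3, 4, or 5 as in the table above */
    bool dependent;          /* stand-in for the is_dependent_type() result */
} arg_node;

/* Returns true when attribute processing should be deferred to
 * instantiation time because some argument is template-dependent. */
static bool any_arg_dependent(const arg_node *head) {
    for (const arg_node *n = head; n != NULL; n = n->next) {
        /* Only kinds 3..5 are inspected; other kinds are skipped. */
        if (n->kind >= 3 && n->kind <= 5 && n->dependent)
            return true;
    }
    return false;
}
```

A handler built on this model would return the entity untouched whenever `any_arg_dependent` is true and only then move on to constant evaluation.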

launch_bounds (sub_411C80)

Syntax: __launch_bounds__(maxThreadsPerBlock [, minBlocksPerMultiprocessor [, maxBlocksPerCluster]])

Accepts 1 to 3 arguments. Registered at kind byte 0x5C ('\\').

// sub_411C80 -- apply_nv_launch_bounds_attr (attribute.c, 98 lines)
// a1: attribute node, a2: entity node
entity_t* apply_nv_launch_bounds(attr_node_t* attr, entity_t* entity) {

    // ---- Error 3535: launch_bounds on local function ----
    // Note: does NOT return early -- continues to store values
    if (entity->byte_81 & 0x04)
        emit_error_with_name(7, 3535, attr->src_loc, "__launch_bounds__");

    // ---- Parse argument list ----
    arg_list_t* args = attr->arg_list;    // attr+32
    if (!args)
        return entity;

    // ---- Allocate launch config if needed ----
    launch_config_t* lc = entity->launch_config;   // entity+256
    if (!lc) {
        lc = allocate_launch_config();              // sub_5E52F0
        entity->launch_config = lc;
    }

    // ---- Arg 1: maxThreadsPerBlock (required, stored as int64) ----
    // Copied directly from constant expression value -- no sign/overflow check
    lc->maxThreadsPerBlock = args->const_value;     // +0, int64

    // ---- Arg 2: minBlocksPerMultiprocessor (optional, stored as int64) ----
    arg_node_t* arg2_node = args->next;             // *args: second argument node
    if (!arg2_node)
        return entity;

    arg_node_t* arg3_node = arg2_node->next;        // *arg2_node: third argument node, or NULL
    lc->minBlocksPerMultiprocessor = arg2_node->const_value;  // arg2_node[4] -> +8, int64, raw copy

    // ---- Check for arg 3 existence ----
    if (!arg3_node)
        goto process_arg3;

    // ---- Template-dependent bailout for arg 3 ----
    // Args 1 and 2 have already been stored at this point
    arg_node_t* walk = attr->arg_list;              // attr+32
    if (!walk)
        goto process_arg3;
    // ... dependent type walk (same pattern as documented above) ...
    // If any arg is dependent, return entity without processing arg 3

process_arg3:
    // ---- Arg 3: maxBlocksPerCluster (optional, int32, uses full pipeline) ----
    // Guard the NULL case reached from the "no arg 3" jump above
    expr_node_t* expr3 = arg3_node ? arg3_node->const_value : NULL;
    if (!expr3)
        return entity;

    if (const_expr_sign_compare(expr3, 0) < 0) {
        // Error 3705: negative maxBlocksPerCluster
        emit_error(7, 3705, attr->src_loc);
    } else {
        int32_t overflow;
        uint64_t val = const_expr_get_value(expr3, &overflow);
        if (overflow || val > 0x7FFFFFFF) {
            // Error 3706: overflow
            emit_error(7, 3706, attr->src_loc);
        } else if (val != 0) {
            lc->maxBlocksPerCluster = (int32_t)val;   // +16
        }
        // val == 0: not stored, sentinel -1 remains (means "use default")
    }

    return entity;
}

Argument Semantics

Arg	Field	Offset	Type	Validation	Description
1 (required)	`maxThreadsPerBlock`	+0	int64	None -- raw copy	Maximum threads per block. Guides register allocation in `ptxas`.
2 (optional)	`minBlocksPerMultiprocessor`	+8	int64	None -- raw copy	Minimum resident blocks per SM. Guides occupancy optimization.
3 (optional)	`maxBlocksPerCluster`	+16	int32	`sign_compare < 0` (3705), overflow (3706)	Maximum blocks per cluster (CUDA 11.8+).

Critical Implementation Details

First two args bypass the sign/overflow pipeline. Arguments 1 and 2 are copied directly from the constant expression node's value field as 64-bit quantities. They do not pass through const_expr_sign_compare or const_expr_get_value. This means negative or excessively large values for maxThreadsPerBlock and minBlocksPerMultiprocessor are accepted at parse time -- downstream consumers (ptxas) are responsible for rejecting them.

Third argument uses the strict pipeline. Only argument 3 (maxBlocksPerCluster) passes through both const_expr_sign_compare and const_expr_get_value with the overflow check. This argument was added later (CUDA 11.8 cluster launch) and uses the newer, stricter validation pattern.

Zero is acceptable for arg 3. The sign check uses const_expr_sign_compare(expr, 0) < 0 (strictly negative), not <= 0. A zero value passes the sign check but is not written (else if (val != 0) guard), leaving the sentinel -1 in place. This means zero effectively means "use default."
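
The zero-handling rule is easy to misread, so here is a minimal sketch of just the store logic. It operates on an already-evaluated 64-bit constant rather than an expression node; `store_max_blocks_per_cluster` and the `DIAG_*` names are hypothetical, and only the diagnostic numbers come from the text above.

```c
#include <stdint.h>

/* Diagnostic numbers taken from the text; 0 means "no diagnostic". */
enum { DIAG_NONE = 0, DIAG_NEGATIVE = 3705, DIAG_OVERFLOW = 3706 };

/* Models only the arg-3 store logic on an already-evaluated constant.
 * *field is expected to hold the allocator's -1 sentinel on entry. */
static int store_max_blocks_per_cluster(int64_t value, int32_t *field) {
    if (value < 0)                     /* strictly negative: zero passes */
        return DIAG_NEGATIVE;
    if (value > 0x7FFFFFFF)
        return DIAG_OVERFLOW;
    if (value != 0)
        *field = (int32_t)value;       /* zero leaves the sentinel untouched */
    return DIAG_NONE;
}
```

Passing 0 returns `DIAG_NONE` and leaves the field at -1, matching the "use default" reading described above.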

Error 3535 does not abort. The local-function check fires but does NOT return early. Processing continues, arguments are stored, and the launch config struct is populated even after emitting the error. This is consistent with cudafe++'s design of collecting as many diagnostics as possible in a single compilation pass.

cluster_dims (sub_4115F0)

Syntax: __cluster_dims__(x [, y [, z]]) or __cluster_dims__()

Accepts 0 to 3 arguments. Missing dimensions default to 1. Sets bit 0 of the flags byte at launch_config+52. Registered at kind byte 0x6B ('k').

// sub_4115F0 -- apply_nv_cluster_dims_attr (attribute.c, 145 lines)
entity_t* apply_nv_cluster_dims(attr_node_t* attr, entity_t* entity) {

    arg_list_t* args = attr->arg_list;    // attr+32

    // ---- No-argument form: set intent flag only ----
    if (args->kind_10 == 0) {             // no arguments present
        entity->byte_183 |= 0x40;        // cluster_dims intent flag
        return entity;
    }

    // ---- Extract argument expressions (up to 3) ----
    expr_node_t* expr_x = args->value;
    arg_node_t* child1 = args->first_child;
    expr_node_t* expr_y = child1 ? child1->value : NULL;
    expr_node_t* expr_z = NULL;
    if (child1 && child1->first_child)
        expr_z = child1->first_child->value;

    // ---- Template-dependent bailout ----
    // ... same walk pattern as __launch_bounds__ ...

    // ---- Allocate launch config if needed ----
    launch_config_t* lc = entity->launch_config;
    if (!lc) {
        lc = allocate_launch_config();
        entity->launch_config = lc;
    }

    // ---- Conflict check: __block_size__ already set cluster dims ----
    if (lc->flags & 0x02) {               // bit 1 = block_size_set
        emit_error(7, 3791, attr->src_loc);
        lc = entity->launch_config;       // reload after error emit
    }

    // ---- Set cluster_dims flag ----
    lc->flags |= 0x01;                    // bit 0 = cluster_dims_set

    // ---- Arg 1: cluster_dim_x ----
    if (!expr_x) {
        lc->cluster_dim_x = 1;            // +20, default
    } else if (const_expr_sign_compare(expr_x, 0) <= 0) {
        emit_error_with_name(7, 3685, attr->src_loc, "__cluster_dims__");
        lc = entity->launch_config;       // reload
    } else {
        int32_t overflow;
        uint64_t val = const_expr_get_value(expr_x, &overflow);
        if (overflow || val > 0x7FFFFFFF)
            emit_error(7, 3686, attr->src_loc);
        else
            lc->cluster_dim_x = (int32_t)val;
    }

    // ---- Arg 2: cluster_dim_y (defaults to 1) ----
    if (!expr_y) {
        lc->cluster_dim_y = 1;            // +24
    } else {
        // Same sign_compare/get_value/3685/3686 pattern
        // Stores at lc->cluster_dim_y (+24)
    }

    // ---- Arg 3: cluster_dim_z (defaults to 1) ----
    if (!expr_z) {
        lc->cluster_dim_z = 1;            // +28
    } else {
        // Same pattern, stores at lc->cluster_dim_z (+28)
    }

    return entity;
}

Key Observations

Zero-argument form. When __cluster_dims__() is written with no arguments, the handler does not allocate the launch config struct. It sets entity+183 |= 0x40 (the "cluster_dims intent" flag) and returns. This intent flag is checked during post-validation to detect __cluster_dims__ on non-__global__ functions (error 3534) even when no dimensions were specified.

Conflict check with block_size. Before storing dimensions, the handler checks lc->flags & 0x02 (bit 1 = block_size_set). If __block_size__ was already applied, error 3791 fires. Crucially, the handler does NOT return early after this error -- it continues to set the flag and attempt to store values. The reverse conflict (applying __block_size__ after __cluster_dims__) is checked in sub_4109E0 with the same error code, testing lc->flags & 0x01.

Strict positivity (zero rejected). All three dimensions use const_expr_sign_compare(expr, 0) <= 0, rejecting zero. Error 3685 fires with the attribute name "__cluster_dims__" as a format argument. Error 3686 fires for values exceeding 0x7FFFFFFF.

Defaults to 1. Unspecified dimensions default to 1, not 0. A cluster dimension of 1 means "no clustering in that dimension" -- the neutral value. The default is written explicitly (lc->cluster_dim_x = 1), overwriting the -1 sentinel from allocation.
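
The interaction between defaults, rejected values, and the -1 sentinel can be sketched as follows. This is an illustrative model, not the handler itself: `cluster_cfg` and `apply_cluster_dims` are hypothetical, arguments are plain 64-bit values rather than expression nodes, and the zero-argument form (which only sets the intent flag) is deliberately left out.

```c
#include <stdint.h>

/* Hypothetical subset of the launch config: cluster dims plus flags. */
typedef struct {
    int32_t cluster_dim[3];  /* x, y, z; -1 is the allocator sentinel */
    uint8_t flags;           /* bit 0 = cluster_dims_set */
} cluster_cfg;

/* argc is 1..3; dimensions at index >= argc were omitted in the source.
 * Returns the number of rejected arguments (3685 or 3686 in the text). */
static int apply_cluster_dims(cluster_cfg *cfg, int argc, const int64_t *argv) {
    int rejected = 0;
    cfg->flags |= 0x01;
    for (int i = 0; i < 3; i++) {
        if (i >= argc)
            cfg->cluster_dim[i] = 1;              /* explicit default */
        else if (argv[i] <= 0 || argv[i] > 0x7FFFFFFF)
            rejected++;                           /* sentinel stays in place */
        else
            cfg->cluster_dim[i] = (int32_t)argv[i];
    }
    return rejected;
}
```

Note the asymmetry this exposes: an omitted dimension becomes 1, while a rejected dimension keeps -1.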

block_size (sub_4109E0)

Syntax: __block_size__(bx [, by [, bz [, cx [, cy [, cz]]]]])

Accepts up to 6 arguments: three block dimensions followed by three optional cluster dimensions. Registered at kind byte 0x6C ('l'). At 265 lines, this is the largest launch config handler.

// sub_4109E0 -- apply_nv_block_size_attr (attribute.c, 265 lines)
entity_t* apply_nv_block_size(attr_node_t* attr, entity_t* entity) {

    // ---- Parse up to 6 argument expressions ----
    arg_list_t* args = attr->arg_list;
    expr_node_t* block_x  = args->value;          // arg 1
    expr_node_t* block_y  = NULL;                  // arg 2
    expr_node_t* block_z  = NULL;                  // arg 3
    expr_node_t* cluster_x = NULL;                 // arg 4
    expr_node_t* cluster_y = NULL;                 // arg 5
    expr_node_t* cluster_z = NULL;                 // arg 6
    // ... linked-list traversal to extract args 2-6 ...

    // ---- Template-dependent bailout ----
    // ... same walk pattern ...

    // ---- Allocate launch config ----
    launch_config_t* lc = entity->launch_config;
    if (!lc) {
        lc = allocate_launch_config();
        entity->launch_config = lc;
    }

    // ---- Block dimensions: args 1-3 ----
    // Each uses: sign_compare <= 0 -> error 3788
    //            get_value overflow or > 0x7FFFFFFF -> error 3789
    //            else store at +40/+44/+48
    //            missing args default to 1

    // block_size_x (+40):
    if (!block_x)
        lc->block_size_x = 1;
    else
        validate_positive_int32(block_x, &lc->block_size_x, 3788, 3789, attr);

    // block_size_y (+44): same pattern, default 1
    // block_size_z (+48): same pattern, default 1

    // ---- Cluster dimensions: args 4-6 (only if arg 4 present) ----
    if (!cluster_x) {
        // No cluster dims from __block_size__
        lc->flags &= ~0x02;           // clear bit 1 (no cluster dims from block_size)

        if (!(lc->flags & 0x01)) {    // cluster_dims NOT already set
            // Write default cluster dims
            lc->cluster_dim_x = 1;     // +20
            lc->cluster_dim_y = 1;     // +24
            lc->cluster_dim_z = 1;     // +28
        }
        return entity;
    }

    // ---- Conflict check: cluster_dims already set ----
    if (lc->flags & 0x01) {           // bit 0 = cluster_dims_set
        emit_error(7, 3791, attr->src_loc);
        lc = entity->launch_config;
    }

    // ---- Set block_size flag ----
    lc->flags |= 0x02;                // bit 1 = block_size_set

    if (lc->flags & 0x01)             // cluster_dims_set -> conflict, bail
        return entity;

    // ---- Parse cluster dims from args 4-6 ----
    // Uses error 3788 for non-positive, 3789 for overflow
    // (same codes as block dims, with "__block_size__" as attr name)
    // Stores at +20/+24/+28, defaults to 1 if absent

    return entity;
}

Key Observations

Dual-purpose attribute. __block_size__ combines block dimensions and cluster dimensions in a single attribute. Arguments 1-3 specify the thread block shape (stored at +40/+44/+48); arguments 4-6 specify the cluster shape (stored at +20/+24/+28). This is NVIDIA's older, combined syntax, compared to the newer separate __cluster_dims__ attribute.

Shared cluster fields. Both __block_size__ and __cluster_dims__ write to the same offsets (+20/+24/+28). The flags byte (bit 0 for cluster_dims, bit 1 for block_size) provides mutual exclusion via error 3791.

Block size fields are separate from launch_bounds. The block dimensions from __block_size__ go to +40/+44/+48, distinct from __launch_bounds__'s maxThreadsPerBlock at +0. The __block_size__ attribute specifies exact dimensions; __launch_bounds__ specifies an upper bound. Both can coexist on the same function.

Defaulting behavior when no cluster args. When only 3 arguments are provided (block dims only), the handler checks whether __cluster_dims__ was already applied (flags & 0x01). If not, it writes default cluster dims of (1, 1, 1) to +20/+24/+28. If __cluster_dims__ was already applied, it leaves the existing cluster dim values untouched.
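
The flag protocol shared by the two handlers can be condensed into a small model. This is a sketch under assumptions: `cfg_t`, `mark_cluster_dims`, and `mark_block_size` are hypothetical names, and dimension validation is omitted so that only the bit 0 / bit 1 bookkeeping and the (1, 1, 1) defaulting remain.

```c
#include <stdbool.h>
#include <stdint.h>

enum { FLAG_CLUSTER_DIMS = 0x01, FLAG_BLOCK_SIZE = 0x02 };

typedef struct {
    int32_t cluster_dim[3];  /* +20/+24/+28 in the text; -1 sentinel */
    uint8_t flags;
} cfg_t;

/* __cluster_dims__ side: returns true when error 3791 would fire. */
static bool mark_cluster_dims(cfg_t *c) {
    bool conflict = (c->flags & FLAG_BLOCK_SIZE) != 0;
    c->flags |= FLAG_CLUSTER_DIMS;
    return conflict;
}

/* __block_size__ side. has_cluster_args is true when argument 4 exists.
 * Returns true when error 3791 would fire. */
static bool mark_block_size(cfg_t *c, bool has_cluster_args) {
    if (!has_cluster_args) {
        c->flags &= (uint8_t)~FLAG_BLOCK_SIZE;
        if (!(c->flags & FLAG_CLUSTER_DIMS))      /* nobody set cluster dims */
            c->cluster_dim[0] = c->cluster_dim[1] = c->cluster_dim[2] = 1;
        return false;
    }
    bool conflict = (c->flags & FLAG_CLUSTER_DIMS) != 0;
    c->flags |= FLAG_BLOCK_SIZE;
    return conflict;
}
```

In this model a three-argument __block_size__ never conflicts with __cluster_dims__; the conflict only arises when __block_size__ carries its own cluster arguments.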

Error 3788/3789. These are the __block_size__-specific equivalents of __cluster_dims__'s 3685/3686. Both use strict positivity (<= 0), rejecting zero.

maxnreg (sub_410F70)

Syntax: __maxnreg__(N)

Accepts exactly 1 argument. Stores at launch_config+32. Registered at kind byte 0x5D (']').

// sub_410F70 -- apply_nv_maxnreg_attr (attribute.c, 67 lines)
entity_t* apply_nv_maxnreg(attr_node_t* attr, entity_t* entity) {
    arg_list_t* args = attr->arg_list;       // attr+32
    if (!args)
        return entity;

    // ---- Template-dependent bailout ----
    // ... same walk pattern ...

    // ---- Allocate launch config ----
    if (!entity->launch_config)
        entity->launch_config = allocate_launch_config();

    // ---- Parse the single argument ----
    expr_node_t* expr = args->const_value;   // argument expression
    if (!expr)
        return entity;

    if (const_expr_sign_compare(expr, 0) <= 0) {
        emit_error(7, 3717, attr->src_loc);       // non-positive register count
    } else {
        int32_t overflow;
        uint64_t val = const_expr_get_value(expr, &overflow);
        if (overflow || val > 0x7FFFFFFF)
            emit_error(7, 3718, attr->src_loc);    // register count too large
        else
            entity->launch_config->maxnreg = (int32_t)val;   // +32
    }

    return entity;
}

The maxnreg field defaults to -1 from the allocator. A value >= 0 in post-validation unambiguously means the attribute was applied with a valid value (since zero would be caught by the <= 0 check here, the minimum valid value is 1).

Post-Validation Conflict

The __maxnreg__ handler does not check for conflicts with __launch_bounds__ at application time. The mutual exclusion is enforced in post-validation (sub_6BC890), which emits error 3719 when both maxThreadsPerBlock != 0 and maxnreg >= 0. This design allows the apply handlers to be called in any order.
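
Why deferring the check makes application order irrelevant can be seen in a two-field model. The names below (`regs_cfg`, `apply_launch_bounds`, `apply_maxnreg`, `conflict_3719`) are hypothetical, and the sketch assumes the absent-attribute values implied by the text: 0 for maxThreadsPerBlock and -1 for maxnreg.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-field slice of the launch config. */
typedef struct {
    int64_t max_threads_per_block;  /* assumed 0 when __launch_bounds__ is absent */
    int32_t maxnreg;                /* -1 when __maxnreg__ is absent */
} regs_cfg;

/* The apply steps only record values; neither looks at the other field. */
static void apply_launch_bounds(regs_cfg *c, int64_t n) { c->max_threads_per_block = n; }
static void apply_maxnreg(regs_cfg *c, int32_t n)       { c->maxnreg = n; }

/* Post-validation predicate for error 3719, evaluated once at the end. */
static bool conflict_3719(const regs_cfg *c) {
    return c->max_threads_per_block != 0 && c->maxnreg >= 0;
}
```

Because the predicate reads only final state, applying the two attributes in either order yields the same verdict.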

local_maxnreg (sub_411090)

Syntax: __local_maxnreg__(N)

Structurally identical to __maxnreg__. Stores at launch_config+36. Registered at kind byte 0x5E ('^').

// sub_411090 -- apply_nv_local_maxnreg_attr (attribute.c, 67 lines)
entity_t* apply_nv_local_maxnreg(attr_node_t* attr, entity_t* entity) {
    // ... identical structure to __maxnreg__ ...

    if (const_expr_sign_compare(expr, 0) <= 0) {
        emit_error(7, 3786, attr->src_loc);        // error 3786: non-positive
    } else {
        int32_t overflow;
        uint64_t val = const_expr_get_value(expr, &overflow);
        if (overflow || val > 0x7FFFFFFF)
            emit_error(7, 3787, attr->src_loc);     // error 3787: too large
        else
            entity->launch_config->local_maxnreg = (int32_t)val;   // +36
    }

    return entity;
}

The __local_maxnreg__ attribute limits register usage within a specific device function scope rather than at the kernel level. It uses a separate struct field (+36 vs +32) so both can coexist. The post-validator does NOT check local_maxnreg for __global__-only enforcement -- __local_maxnreg__ is more permissive than __maxnreg__ and may appear on __device__ functions.

Post-Declaration Validation (sub_6BC890)

After all attributes on a declaration have been applied, nv_validate_cuda_attributes (sub_6BC890, 160 lines, in nv_transforms.c) performs cross-attribute consistency checks. This function is called from the declaration processing pipeline and operates on the completed entity node. Multiple errors can be emitted from a single validation pass -- cudafe++ does not short-circuit after the first error.

// sub_6BC890 -- nv_validate_cuda_attributes (nv_transforms.c, 160 lines)
// a1: entity pointer, a2: source location for diagnostics
void nv_validate_cuda_attributes(entity_t* entity, source_loc_t* loc) {

    if (!entity || (entity->byte_177 & 0x10))
        return;      // null or suppressed entity

    // ---- Phase 1: Parameter validation (rvalue refs, error 3702) ----
    // Walks parameter list checking for rvalue reference flag
    // [documented on __global__ page]

    // ---- Phase 2: __nv_register_params__ check (error 3661) ----
    // [documented on __global__ page]

    // ---- Phase 3: Launch config attribute checks ----
    launch_config_t* lc = entity->launch_config;   // entity+256
    uint8_t es = entity->byte_182;                  // execution space

    if (!lc)
        goto check_global_advisory;

    if (es & 0x40)                                  // is __global__
        goto cross_attribute_checks;

    // ==== Error 3534: launch config on non-__global__ ====

    // 3534 for __launch_bounds__
    if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor) {
        emit_error_with_name(7, 3534, &global_loc, "__launch_bounds__");
        lc = entity->launch_config;                 // reload after emit
    }

    // 3534 for __cluster_dims__ or __block_size__
    if ((entity->byte_183 & 0x40) || lc->cluster_dim_x >= 0) {
        const char* name = (lc->block_size_x > 0) ? "__block_size__"
                                                    : "__cluster_dims__";
        emit_error_with_name(7, 3534, &global_loc, name);
        lc = entity->launch_config;
        if (!lc)
            goto check_global_advisory;
    }

cross_attribute_checks:
    // ==== Error 3707: cluster size exceeds maxBlocksPerCluster ====
    if (lc->cluster_dim_x > 0) {
        if (lc->maxBlocksPerCluster > 0) {
            uint64_t cluster_product = (int64_t)lc->cluster_dim_x
                                     * (int64_t)lc->cluster_dim_y
                                     * (int64_t)lc->cluster_dim_z;
            if ((uint64_t)lc->maxBlocksPerCluster < cluster_product) {
                const char* name = (lc->block_size_x > 0) ? "__block_size__"
                                                           : "__cluster_dims__";
                emit_error_with_name(7, 3707, &global_loc, name);
                lc = entity->launch_config;
                if (!lc)
                    goto check_maxnreg;
            }
        }
    }

    // ==== Error 3719: __launch_bounds__ + __maxnreg__ conflict ====
    if (lc->maxnreg >= 0) {
        if (!(es & 0x40)) {
            // ==== Error 3715: __maxnreg__ on non-__global__ ====
            emit_error_with_name(7, 3715, &global_loc, "__maxnreg__");
            lc = entity->launch_config;
            if (lc)
                goto check_maxnreg_conflict;
            goto check_global_advisory;
        }

check_maxnreg_conflict:
        if (!lc->maxThreadsPerBlock) {
            // No __launch_bounds__ -- no 3719 conflict
            // (on the non-__global__ path, 3715 has already been emitted)
            goto check_global_advisory;
        }
        // Both __launch_bounds__ and __maxnreg__ present
        emit_error_with_name(7, 3719, &global_loc,
                             "__launch_bounds__ and __maxnreg__");
    }

check_maxnreg:

check_global_advisory:
    // ==== Warning 3695: __global__ without __launch_bounds__ ====
    if (!(es & 0x40))
        return;                  // not __global__, no advisory needed

    lc = entity->launch_config;
    if (!lc) {
        emit_warning(4, 3695, &kernel_decl_loc);
        return;
    }

    if (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor) {
        // Launch config exists but no __launch_bounds__ values set
        // (struct was allocated by __cluster_dims__ or __block_size__)
        emit_warning(4, 3695, &kernel_decl_loc);
    }
}

Validation Logic Detail

Error 3534 -- Launch config on non-global. Tests entity->byte_182 & 0x40 (the __global__ bit). If clear, any non-default values in the launch config struct trigger error 3534. The error message uses %s with the specific attribute name. Notably, the check for __cluster_dims__ or __block_size__ tests lc->cluster_dim_x >= 0 (which is true when any cluster dim handler has run, since they write non-negative values). It also checks the intent flag (entity->byte_183 & 0x40) for the zero-argument __cluster_dims__() form.

Error 3707 -- Cluster product exceeds maxBlocksPerCluster. Computes cluster_dim_x * cluster_dim_y * cluster_dim_z using signed 64-bit arithmetic and compares against maxBlocksPerCluster. The multiplication uses the actual stored dimension values. The error message names whichever attribute set the cluster dims ("__block_size__" if block_size_x > 0, otherwise "__cluster_dims__"). This is a compile-time consistency check: if the programmer specifies both a cluster shape and a maximum cluster block count, the shape must fit.
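
The comparison can be sketched as a standalone predicate. `cluster_exceeds_limit` is a hypothetical name, and the sketch performs the multiplication in unsigned 64-bit arithmetic to stay free of signed-overflow undefined behavior in C; for the same stored values this yields the same 64-bit pattern as the signed multiply described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when error 3707 would fire for the given stored values.
 * Non-positive dx or limit models the -1 sentinel and disables the check. */
static bool cluster_exceeds_limit(int32_t dx, int32_t dy, int32_t dz,
                                  int32_t max_blocks_per_cluster) {
    if (dx <= 0 || max_blocks_per_cluster <= 0)
        return false;
    /* Widen each factor first so the product is not truncated to 32 bits. */
    uint64_t product = (uint64_t)(int64_t)dx
                     * (uint64_t)(int64_t)dy
                     * (uint64_t)(int64_t)dz;
    return (uint64_t)max_blocks_per_cluster < product;
}
```

Widening matters even for modest inputs: 65536 x 65536 x 1 already exceeds the 32-bit range, so a 32-bit multiply would hide the violation.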

Error 3715 -- maxnreg on non-global. Separate from the general 3534 check. While 3534 covers __launch_bounds__/__cluster_dims__/__block_size__, __maxnreg__ uses its own code because it appears in a different branch of the validation logic.

Error 3719 -- launch_bounds + maxnreg conflict. These two attributes provide contradictory register allocation hints: __launch_bounds__ asks the compiler to choose registers based on occupancy targets; __maxnreg__ overrides with a hard limit. Detected by lc->maxThreadsPerBlock != 0 && lc->maxnreg >= 0.

Warning 3695 -- Missing launch_bounds advisory. Severity 4 (informational). Fires when a __global__ function has no __launch_bounds__ annotation. Tests both lc == NULL (no launch config at all) and maxThreadsPerBlock == 0 && minBlocksPerMultiprocessor == 0 (struct exists but was allocated by other attrs). Not an error; can be suppressed.

Error Catalog

Apply-Time Errors

Error	Sev	Attribute	Condition	Sign test	Emit function
3535	7	`__launch_bounds__`	`entity+81 & 0x04` (local function)	--	`sub_4F79D0`
3685	7	`__cluster_dims__`	`sign_compare(expr, 0) <= 0`	`<= 0` (zero rejected)	`sub_4F79D0`
3686	7	`__cluster_dims__`	`overflow \|\| val > 0x7FFFFFFF`	--	`sub_4F8200`
3705	7	`__launch_bounds__` (arg 3)	`sign_compare(expr, 0) < 0`	`< 0` (zero allowed)	`sub_4F8200`
3706	7	`__launch_bounds__` (arg 3)	`overflow \|\| val > 0x7FFFFFFF`	--	`sub_4F8200`
3717	7	`__maxnreg__`	`sign_compare(expr, 0) <= 0`	`<= 0`	`sub_4F8200`
3718	7	`__maxnreg__`	`overflow \|\| val > 0x7FFFFFFF`	--	`sub_4F8200`
3786	7	`__local_maxnreg__`	`sign_compare(expr, 0) <= 0`	`<= 0`	`sub_4F8200`
3787	7	`__local_maxnreg__`	`overflow \|\| val > 0x7FFFFFFF`	--	`sub_4F8200`
3788	7	`__block_size__`	`sign_compare(expr, 0) <= 0`	`<= 0`	`sub_4F79D0`
3789	7	`__block_size__`	`overflow \|\| val > 0x7FFFFFFF`	--	`sub_4F8200`
3791	7	`__cluster_dims__` / `__block_size__`	`flags & opposite_bit`	--	`sub_4F8200`

Post-Validation Errors

Error	Sev	Condition	Emit function
3534	7	Launch config attrs on non-`__global__`	`sub_4F79D0`
3695	4	`__global__` without `__launch_bounds__`	`sub_4F8200`
3707	7	`maxBlocksPerCluster < cluster_x * cluster_y * cluster_z`	`sub_4F79D0`
3715	7	`maxnreg >= 0` on non-`__global__`	`sub_4F79D0`
3719	7	`maxThreadsPerBlock != 0 && maxnreg >= 0`	`sub_4F79D0`

Sign-Test Summary

Attribute	Non-positive error	Overflow error	Sign test	Zero allowed?
`__launch_bounds__` arg 1-2	(none)	(none)	No check	Yes
`__launch_bounds__` arg 3	3705	3706	`< 0`	Yes (not stored)
`__cluster_dims__`	3685	3686	`<= 0`	No
`__block_size__`	3788	3789	`<= 0`	No
`__maxnreg__`	3717	3718	`<= 0`	No
`__local_maxnreg__`	3786	3787	`<= 0`	No
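
The summary table lends itself to a data-driven encoding, which is convenient when writing regression tests against a reimplementation. The sketch below is an assumption-laden illustration: `sign_rule`, `SIGN_RULES`, and `predicted_diag` are hypothetical, the row labels are just lookup keys, and the unchecked `__launch_bounds__` args 1-2 row is omitted because it has no diagnostics.

```c
#include <stddef.h>
#include <string.h>

/* One row of the summary table. */
typedef struct {
    const char *attr;
    int nonpositive_err;
    int overflow_err;
    int zero_allowed;
} sign_rule;

static const sign_rule SIGN_RULES[] = {
    { "__launch_bounds__ arg 3", 3705, 3706, 1 },
    { "__cluster_dims__",        3685, 3686, 0 },
    { "__block_size__",          3788, 3789, 0 },
    { "__maxnreg__",             3717, 3718, 0 },
    { "__local_maxnreg__",       3786, 3787, 0 },
};

/* Returns the diagnostic the table predicts for `value`, or 0 if accepted.
 * Returns -1 for an attribute name that is not in the table. */
static int predicted_diag(const char *attr, long long value) {
    for (size_t i = 0; i < sizeof SIGN_RULES / sizeof SIGN_RULES[0]; i++) {
        const sign_rule *r = &SIGN_RULES[i];
        if (strcmp(r->attr, attr) != 0)
            continue;
        if (value < 0 || (value == 0 && !r->zero_allowed))
            return r->nonpositive_err;
        if (value > 0x7FFFFFFFLL)
            return r->overflow_err;
        return 0;
    }
    return -1;
}
```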

Attribute Interaction Matrix

	`__launch_bounds__`	`__cluster_dims__`	`__block_size__`	`__maxnreg__`	`__local_maxnreg__`
`__launch_bounds__`	--	OK	OK	3719	OK
`__cluster_dims__`	OK	--	3791	OK	OK
`__block_size__`	OK	3791	--	OK	OK
`__maxnreg__`	3719	OK	OK	--	OK
`__local_maxnreg__`	OK	OK	OK	OK	--

Additional constraints:

All attributes except __local_maxnreg__ require __global__ execution space (error 3534 / 3715)
__launch_bounds__ arg 3 must be >= cluster product when cluster dims are set (error 3707)
__launch_bounds__ is also rejected on local functions at application time (error 3535)
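
For quick lookups, the pairwise matrix can be written as a symmetric table. This is a sketch only: `attr_id`, `CONFLICT`, and `pair_conflict` are hypothetical names, and the table records which diagnostic a pair can produce, not the conditions under which it fires (for example, 3791 depends on __block_size__ carrying cluster arguments).

```c
enum attr_id { A_LAUNCH_BOUNDS, A_CLUSTER_DIMS, A_BLOCK_SIZE, A_MAXNREG,
               A_LOCAL_MAXNREG, A_COUNT };

/* 0 = compatible; otherwise the diagnostic number from the matrix. */
static const int CONFLICT[A_COUNT][A_COUNT] = {
    /* launch_bounds */ {    0,    0,    0, 3719, 0 },
    /* cluster_dims  */ {    0,    0, 3791,    0, 0 },
    /* block_size    */ {    0, 3791,    0,    0, 0 },
    /* maxnreg       */ { 3719,    0,    0,    0, 0 },
    /* local_maxnreg */ {    0,    0,    0,    0, 0 },
};

/* Order of the two attributes does not matter: the table is symmetric. */
static int pair_conflict(enum attr_id a, enum attr_id b) {
    return CONFLICT[a][b];
}
```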

Entity Node Field Reference

Offset	Size	Field	Role in Launch Config
`+81`	1 byte	`local_flags`	Bit 2 (0x04): local function. Checked by `sub_411C80` for error 3535.
`+177`	1 byte	`suppress_flags`	Bit 4 (0x10): entity suppressed. Post-validation skips if set.
`+182`	1 byte	`execution_space`	Bit 6 (0x40): `__global__`. Checked by `sub_6BC890` for 3534, 3695, 3715.
`+183`	1 byte	`extended_cuda`	Bit 6 (0x40): cluster_dims intent (set by zero-arg `__cluster_dims__`).
`+256`	8 bytes	`launch_config`	Pointer to `launch_config_t` (56 bytes). NULL if no launch config attrs.
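
The field offsets quoted throughout this page fit a 56-byte struct on a typical x86-64 ABI. The declaration below is a reconstruction for illustration, assuming the flags field is a single byte at +52 followed by padding; `launch_config_model` is a hypothetical name, and the compile-time assertions check only that this model agrees with the documented offsets.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int64_t maxThreadsPerBlock;          /* +0  */
    int64_t minBlocksPerMultiprocessor;  /* +8  */
    int32_t maxBlocksPerCluster;         /* +16 */
    int32_t cluster_dim_x;               /* +20 */
    int32_t cluster_dim_y;               /* +24 */
    int32_t cluster_dim_z;               /* +28 */
    int32_t maxnreg;                     /* +32 */
    int32_t local_maxnreg;               /* +36 */
    int32_t block_size_x;                /* +40 */
    int32_t block_size_y;                /* +44 */
    int32_t block_size_z;                /* +48 */
    uint8_t flags;                       /* +52; bit 0 cluster_dims, bit 1 block_size */
} launch_config_model;

_Static_assert(offsetof(launch_config_model, maxBlocksPerCluster) == 16, "+16");
_Static_assert(offsetof(launch_config_model, cluster_dim_x) == 20, "+20");
_Static_assert(offsetof(launch_config_model, maxnreg) == 32, "+32");
_Static_assert(offsetof(launch_config_model, local_maxnreg) == 36, "+36");
_Static_assert(offsetof(launch_config_model, block_size_x) == 40, "+40");
_Static_assert(offsetof(launch_config_model, flags) == 52, "+52");
_Static_assert(sizeof(launch_config_model) == 56, "56 bytes");
```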

Error Emission Functions

Address	Identity	Signature	Used for
`sub_4F79D0`	`emit_error_with_name`	`(severity, code, loc, name_str)`	3535, 3685, 3534, 3707, 3715, 3719, 3788
`sub_4F8200`	`emit_error_basic`	`(severity, code, loc)`	3686, 3705, 3706, 3717, 3718, 3786, 3787, 3789, 3791, 3695

sub_4F79D0 passes a format string argument (the attribute name) into the diagnostic message via %s. sub_4F8200 emits a fixed-format message with no string interpolation. Warning 3695 uses severity 4 through sub_4F8200; all other diagnostics use severity 7.

Function Map

Address	Identity	Lines	Source File
`sub_411C80`	`apply_nv_launch_bounds_attr`	98	`attribute.c`
`sub_4115F0`	`apply_nv_cluster_dims_attr`	145	`attribute.c`
`sub_4109E0`	`apply_nv_block_size_attr`	265	`attribute.c`
`sub_410F70`	`apply_nv_maxnreg_attr`	67	`attribute.c`
`sub_411090`	`apply_nv_local_maxnreg_attr`	67	`attribute.c`
`sub_6BC890`	`nv_validate_cuda_attributes`	160	`nv_transforms.c`
`sub_5E52F0`	`allocate_launch_config`	42	`il.c` (IL allocation)
`sub_461640`	`const_expr_get_value`	53	`const_expr.c`
`sub_461980`	`const_expr_sign_compare`	97	`const_expr.c`
`sub_7BE9E0`	`is_dependent_type`	15	`template.c`
`sub_4F79D0`	`emit_error_with_name`	--	`error.c`
`sub_4F8200`	`emit_error_basic`	--	`error.c`

Global Variables

Address	Name	Purpose
`qword_126EDE8`	`global_source_loc`	Default source location used in post-validation error emission
`qword_126DD38`	`kernel_decl_loc`	Source location for kernel declaration (used in 3695 advisory)
`dword_126EC90`	`il_pool_id`	Arena allocator pool ID for launch config allocation
`dword_126F694`	`launch_config_size`	Size parameter for arena allocator
`dword_126F690`	`pool_base`	Base pointer of the IL arena pool
`dword_106BA08`	`abi_mode`	ABI compatibility flag; when 0, allocator adds 8-byte prefix
`dword_126E5FC`	`device_flag`	Device compilation mode; bit 0 affects launch config flags byte
`byte_E6D1B0`	`is_signed_type_table`	Lookup table indexed by type subkind; true if type is signed integer

Cross-References

Attribute System Overview -- dispatch table, attribute node structure, kind enum
__global__ Function Constraints -- the __global__ attribute that launch config attributes require
grid_constant -- parameter attribute that interacts with kernel parameter checks
Minor Attributes -- __nv_register_params__, __noinline__, __forceinline__
Entity Node Layout -- full byte map of entity node with +256 pointer
Execution Spaces -- byte_182 bitfield layout and __global__ predicate
Diagnostics Overview -- error emission functions, severity levels

grid_constant

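The __grid_constant__ attribute marks a const-qualified __global__ function parameter as read-only for the lifetime of the grid. According to NVIDIA's CUDA C++ Programming Guide, kernel parameters are already passed to the device through constant memory; what the annotation changes is address-taking. Without it, taking the address of a kernel parameter ordinarily makes the compiler copy the parameter into thread-local memory and use the address of the copy. With it, the compiler uses the generic address of the parameter itself, so all threads in the grid see the same object. The attribute was introduced in CUDA 11.7 and requires compute capability 7.0 or later (Volta+).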

cudafe++ enforces 8 validation checks on __grid_constant__ parameters, distributed across three phases: attribute application (checking type constraints -- const qualification, no reference types, SM version), post-declaration validation (checking that the annotation appears only on __global__ function parameters), and redeclaration/template merging (checking consistency of annotations between declarations). A ninth related check (error 3669) in the __global__ apply handler issues an advisory when a kernel parameter lacks a default initializer in device compilation mode, suggesting that __grid_constant__ would be appropriate.

Key Facts

Property	Value
Internal keyword	`grid_constant` (stored at `0x82bf0f`), displayed as `__grid_constant__` (at `0x82bf1d`)
Attribute category	Optimization (parameter-level)
Minimum architecture	compute_70 (Volta), gated by `dword_126E4A8 >= 70`
Entity node flag	`entity+164` bit 2 (`0x04`) -- set on the parameter entity during attribute application
Type node flag	`type+133` bit 5 (`0x20`) -- checked by `sub_7A6B60` (type chain query)
Parameter node flag	`param+32` bit 1 (`0x02`) -- checked during post-declaration validation in `sub_6BC890`
Total diagnostics	8 unique error strings + 1 related advisory (3669) + 1 memory space conflict (3577)
Diagnostic tag prefix	`grid_constant_*` (8 tags in `.rodata` at `0x84810f`--`0x857770`)
Message string block	`0x88d8b0`--`0x88dbe8` (contiguous block in `.rodata`)
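
The "offset plus bit" notation in the table maps directly onto byte tests. The helpers below are a hypothetical sketch for readers reimplementing these checks over raw node memory: the offsets and masks are copied from the table, while `node_flag`, `node_set_flag`, and the constant names are invented here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets and masks copied from the table above. */
enum { ENTITY_GRID_CONSTANT_OFF = 164, ENTITY_GRID_CONSTANT_BIT = 0x04,
       TYPE_GRID_CONSTANT_OFF   = 133, TYPE_GRID_CONSTANT_BIT   = 0x20,
       PARAM_GRID_CONSTANT_OFF  = 32,  PARAM_GRID_CONSTANT_BIT  = 0x02 };

/* Tests one flag bit in a node viewed as a raw byte array. */
static bool node_flag(const uint8_t *node, size_t off, uint8_t mask) {
    return (node[off] & mask) != 0;
}

/* Sets one flag bit without disturbing the other bits of that byte. */
static void node_set_flag(uint8_t *node, size_t off, uint8_t mask) {
    node[off] |= mask;
}
```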

Why grid_constant Exists

A parameter annotated __grid_constant__ tells the CUDA runtime and compiler four things:

1. The parameter value is identical for every thread in the grid. This is inherently true for all kernel parameters -- they are passed by value through the kernel launch API -- but the annotation makes this guarantee explicit and mechanically exploitable.

2. The parameter can be addressed in place. Kernel parameters are already delivered through constant memory, but when code takes the address of an unannotated parameter, the compiler ordinarily copies it into thread-local memory and hands out the address of the copy. With __grid_constant__, the compiler uses the generic address of the parameter itself. This provides:

One object per grid: every thread in the grid sees the same address for the parameter.
No thread-local copy: taking the address of a large by-value struct no longer duplicates it into each thread's local memory.
Possible performance gain: NVIDIA's documentation notes that avoiding the local copy may improve performance.

3. The parameter must be const-qualified. Since the value is shared across the grid and cached in constant memory, writes would be nonsensical. The hardware constant memory is read-only from the kernel's perspective. cudafe++ enforces this at the type level.

4. The parameter must not be a reference type. References to host memory are meaningless on the device. Kernel parameters are already copied to the device by the CUDA runtime. A reference would dangle because it would point into host address space. Even a reference to device memory is not valid here -- __grid_constant__ parameters must be values, not indirections.

SM_70+ Requirement Rationale

The compute_70 (Volta) minimum exists because Volta significantly rearchitected the constant memory subsystem. Pre-Volta GPUs (Maxwell, Pascal) have a more restricted constant memory subsystem with a fixed 64 KB window per kernel. Volta introduced:

Larger effective constant memory through improved caching
Per-thread-block constant buffer indexing
Hardware support for grid-wide parameter broadcasting with the new parameter cache architecture

The compiler lowers __grid_constant__ parameters to ld.const (constant-space load) PTX instructions, which rely on the Volta constant memory architecture to function correctly. On pre-Volta hardware, the constant memory hardware cannot serve this use case.

Where Validation Happens

The __grid_constant__ validation logic is spread across multiple compilation phases because the checks require different kinds of information. The type-level checks (const, reference) can be performed as soon as the attribute is applied. The context check (must be on a __global__ parameter) requires the function's execution space to be resolved. The redeclaration checks require both the old and new declarations to be available.

Phase 1: Attribute Application

Checks 1 (const), 2 (reference), and 4 (architecture) execute during attribute application, when the __grid_constant__ attribute handler runs. This handler is registered in EDG's attribute descriptor table under the kind byte for __grid_constant__. It receives the attribute node, the entity node, and the target kind. The handler inspects the parameter's type node to verify const-qualification and absence of reference semantics, and checks dword_126E4A8 against the threshold value 70.

Phase 2: Post-Declaration Validation

Check 3 (must be on __global__ parameter) executes in nv_validate_cuda_attributes (sub_6BC890). This function runs after all attributes on a declaration have been applied and resolved. It walks the function's parameter list and checks whether any parameter carries the __grid_constant__ flag (param+32 bit 1) on a non-__global__ function.

Phase 3: Redeclaration/Template Merging

Checks 5--8 (consistency across redeclarations, template redeclarations, specializations, and explicit instantiations) execute during the declaration merging passes in class_decl.c, decls.c, and template.c. These passes compare the entity+164 bit 2 flag on corresponding parameters of the old and new declarations.

Validation Check 1: const-Qualified Type

Property	Value
Tag	`grid_constant_not_const` (at `0x848146`)
Message	`a parameter annotated with __grid_constant__ must have const-qualified type` (at `0x88d8b0`)
Severity	error
Phase	Attribute application

The parameter's type must carry the const qualifier. The check peels through the type chain, following cv-qualifier wrapper nodes (kind == 12) to reach the underlying type, then verifies the const flag is present.

The type-level check works on the same type chain navigation pattern used throughout EDG's type system:

// Conceptual logic (from the __grid_constant__ attribute handler)
type_t* ptype = param->type;
while (ptype->kind == 12)        // skip cv-qualifier wrapper nodes
    ptype = ptype->referenced;   // follow chain at type+144

if (!(ptype->cv_quals & CONST_FLAG))
    emit_error("grid_constant_not_const", param->src_loc);

If the user writes:

__global__ void kernel(__grid_constant__ int x) { ... }

cudafe++ emits grid_constant_not_const because int is not const-qualified. The correct form is:

__global__ void kernel(__grid_constant__ const int x) { ... }

Validation Check 2: No Reference Type

Property	Value
Tag	`grid_constant_reference_type` (at `0x84815e`)
Message	`a parameter annotated with __grid_constant__ must not have reference type` (at `0x88d900`)
Severity	error
Phase	Attribute application

The parameter must not be a reference (& or &&). This check fires independently of the const check -- both can fire on the same parameter.

In EDG's type system, reference types have kind == 7 (lvalue reference) or kind == 19 (rvalue reference). The check walks the type chain through cv-qualifier wrappers and tests the final type kind:

type_t* ptype = param->type;
while (ptype->kind == 12)
    ptype = ptype->referenced;

if (ptype->kind == 7 || ptype->kind == 19)   // lvalue ref or rvalue ref
    emit_error("grid_constant_reference_type", param->src_loc);

Example that triggers this error:

__global__ void kernel(__grid_constant__ const int& x) { ... }

The rationale is that kernel parameters are copied across the host-device boundary by the CUDA runtime. A reference to host memory would be invalid on the device, and a reference to device memory does not participate in the kernel launch parameter copying mechanism. The __grid_constant__ attribute specifically requests constant-memory placement of the parameter value -- a reference has no value to place.
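Checks 1 and 2 can be modeled together to show that they fire independently. The following is a minimal sketch, not the handler's actual code: MockType, peel_wrappers, and check_grid_constant_type are hypothetical names, the is_const field stands in for the const bit of the real type node, and kind value 2 in the usage below is an arbitrary placeholder for "neither wrapper nor reference".

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an EDG type node; only the fields used here.
struct MockType {
    int kind;                    // 12 = wrapper node, 7 = lvalue ref, 19 = rvalue ref
    bool is_const;               // stands in for the const bit of the real node
    const MockType* referenced;  // next node in the chain (type+144 in the binary)
};

// Skip wrapper nodes (kind == 12) to reach the underlying type.
inline const MockType* peel_wrappers(const MockType* t) {
    while (t->kind == 12)
        t = t->referenced;
    return t;
}

// Run checks 1 and 2 independently; return the diagnostic tags that fire.
inline std::vector<std::string> check_grid_constant_type(const MockType* param_type) {
    std::vector<std::string> tags;
    const MockType* t = peel_wrappers(param_type);
    if (!t->is_const)
        tags.push_back("grid_constant_not_const");
    if (t->kind == 7 || t->kind == 19)
        tags.push_back("grid_constant_reference_type");
    return tags;
}
```

In this model a reference node without the const bit reports both tags, matching the statement that the two checks are independent.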

Validation Check 3: Only on __global__ Parameters

Property	Value
Tag	`grid_constant_non_kernel` (at `0x84812d`)
Message	`__grid_constant__ annotation is only allowed on a parameter of a __global__ function` (at `0x88db38`)
Error code	3702
Severity	7 (standard error)
Phase	Post-declaration validation (`sub_6BC890`)

This check enforces that __grid_constant__ only appears on parameters of __global__ (kernel) functions. Parameters of __device__ or __host__ __device__ functions do not participate in the kernel launch mechanism and have no grid-wide constant memory optimization path.

The check executes in nv_validate_cuda_attributes (sub_6BC890, 161 lines, nv_transforms.c). The validator navigates from the function entity to its parameter list, then walks each parameter testing for the __grid_constant__ flag. The reconstructed pseudocode:

// From nv_validate_cuda_attributes (sub_6BC890)
// a1: function entity node
// a2: pointer to source location for diagnostics

void nv_validate_cuda_attributes(entity_t* a1, source_loc_t* a2) {

    if (!a1 || (a1->byte_177 & 0x10))
        return;   // null entity or suppressed

    type_t* type_chain = a1->type_chain;   // entity+144
    uint8_t exec_space = a1->byte_182;     // execution space bitfield

    // Skip parameter walk under certain execution space conditions
    if (!type_chain || ((exec_space & 0x30) == 0x20 &&
                        (exec_space & 0x60) != 0x20))
        goto skip_param_walk;

    // Navigate through cv-qualifier wrappers to reach the function type
    while (type_chain->kind == 12)
        type_chain = type_chain->referenced;   // type+144

    // Get parameter list from prototype (double dereference)
    param_t* param = **(param_t***)((char*)type_chain + 152);   // byte offset +152

    // Walk each parameter
    while (param) {
        if (param->byte_32 & 0x02) {
            // __grid_constant__ flag is set on a non-__global__ parameter
            emit_error(7, 3702, a2);   // grid_constant_non_kernel
        }
        param = param->next;
    }

    // ... (continues with __launch_bounds__ validation below)
}

The param->byte_32 & 0x02 test checks bit 1 of the parameter node's byte at offset +32. This bit is the __grid_constant__ flag on the parameter node (distinct from the entity+164 flag) -- it is set by the __grid_constant__ attribute application handler when the attribute is first applied, and checked here to verify the containing function is actually a kernel.

The error fires for any execution space that is NOT __global__. The conditional skip at the top of the function, ((exec_space & 0x30) == 0x20 && (exec_space & 0x60) != 0x20), is a pre-filter that handles certain host-side function configurations -- it does NOT suppress the parameter walk for __global__ functions (which have bit 6 = 0x40 set).
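The pre-filter's bit arithmetic is easier to reason about when isolated. The sketch below restates the condition from the pseudocode as a standalone predicate and gives an equivalent per-bit form; the function names are hypothetical, and the sketch makes no assumption about which execution-space attributes produce which bit patterns in entity byte +182.

```cpp
#include <cstdint>

// True when the pre-filter would skip the parameter walk for this
// execution-space byte (the condition copied from the pseudocode).
inline bool prefilter_skips_walk(std::uint8_t exec_space) {
    return (exec_space & 0x30) == 0x20 && (exec_space & 0x60) != 0x20;
}

// The same predicate written per bit: bit 5 set, bit 4 clear, bit 6 set.
inline bool prefilter_skips_walk_bits(std::uint8_t exec_space) {
    bool bit4 = (exec_space & 0x10) != 0;
    bool bit5 = (exec_space & 0x20) != 0;
    bool bit6 = (exec_space & 0x40) != 0;
    return bit5 && !bit4 && bit6;
}
```

Read literally, a byte with only bit 6 set (0x40) is walked, while a byte with bits 5 and 6 set and bit 4 clear (0x60) is skipped.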

Validation Check 4: compute_70+ Architecture

Property	Value
Tag	`grid_constant_unsupported_arch` (at `0x857770`)
Message	`__grid_constant__ annotation is only allowed for architecture compute_70 or later` (at `0x88db90`)
Severity	error
Phase	Attribute application

The target architecture, stored in dword_126E4A8 (set by the --target CLI flag via case 245 in proc_command_line), must be >= 70. The architecture code is an integer representation: sm_70 maps to 70, sm_80 to 80, sm_90 to 90, etc.

// Architecture gate in the __grid_constant__ attribute handler
if (dword_126E4A8 < 70)
    emit_error("grid_constant_unsupported_arch", param->src_loc);

If the user compiles with -arch=compute_60 or lower and uses __grid_constant__, this error fires. The check is a straightforward integer comparison -- no bitmask, no table lookup.

The architecture value reaches cudafe++ through nvcc, which translates user-facing flags like --gpu-architecture=sm_70 into the internal numeric code and passes it via the --target flag. Inside cudafe++, sub_7525E0 (a 6-byte stub returning -1) nominally parses this value, but the actual number is injected by nvcc into the argument string. See Architecture Feature Gating for the full data flow.
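As an illustration of the data flow, the following sketch converts an architecture name to the integer code and applies the gate. It is a hypothetical model: arch_code_from_name is not nvcc's or cudafe++'s actual parsing routine, and only grid_constant_arch_ok mirrors the documented comparison against 70.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: "sm_70" or "compute_70" -> 70; returns -1 if no
// digits follow the first underscore. Not the real nvcc translation.
inline int arch_code_from_name(const std::string& name) {
    std::size_t pos = name.find('_');
    if (pos == std::string::npos)
        return -1;
    int code = 0;
    bool any_digit = false;
    for (std::size_t i = pos + 1; i < name.size(); ++i) {
        char c = name[i];
        if (c < '0' || c > '9')
            break;                      // stop at the first non-digit
        code = code * 10 + (c - '0');
        any_digit = true;
    }
    return any_digit ? code : -1;
}

// Mirrors the documented gate: the target code must be >= 70.
inline bool grid_constant_arch_ok(int target_arch_code) {
    return target_arch_code >= 70;
}
```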

Validation Checks 5--8: Redeclaration Consistency

The four redeclaration consistency checks share the same algorithmic structure but apply to different declaration contexts. They all enforce the invariant that __grid_constant__ annotations must match between declarations: if the first declaration annotates a parameter with __grid_constant__, every subsequent declaration (redeclaration, template redeclaration, specialization, explicit instantiation) must also annotate the corresponding parameter, and vice versa.

Why These Checks Exist

The __grid_constant__ attribute affects the kernel's ABI -- specifically, how the CUDA runtime passes the parameter at launch time. If one translation unit sees a declaration with __grid_constant__ and another sees a declaration without it, they would generate incompatible kernel launch code. In RDC (relocatable device code) mode, where kernels can be declared in one TU and defined in another, this mismatch would cause silent data corruption at runtime. The compiler catches it at declaration merging time to prevent this.

Check 5: Function Redeclaration

Property	Value
Tag	`grid_constant_incompat_redecl` (at `0x84810f`)
Message	`incompatible __grid_constant__ annotation for parameter %s in function redeclaration (see previous declaration %p)` (at `0x88d950`)
Phase	Redeclaration merging (`class_decl.c`)

When a __global__ function is redeclared, cudafe++ compares the entity+164 bit 2 (0x04) flag on each parameter between the existing and new declarations. If the flags differ for any parameter at the same position, the error fires.

// Redeclaration consistency check (conceptual, in class_decl.c)
param_t* old_param = get_params(old_decl);
param_t* new_param = get_params(new_decl);

while (old_param && new_param) {
    bool old_gc = (old_param->entity->byte_164 & 0x04) != 0;
    bool new_gc = (new_param->entity->byte_164 & 0x04) != 0;

    if (old_gc != new_gc)
        emit_error("grid_constant_incompat_redecl",
                   new_param->name, old_decl->src_loc);

    old_param = old_param->next;
    new_param = new_param->next;
}

Example:

__global__ void kernel(__grid_constant__ const int x);
__global__ void kernel(const int x);  // ERROR: grid_constant_incompat_redecl

The %s in the message is expanded to the parameter name, and %p is expanded to a source location reference pointing at the previous declaration.
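The positional comparison can be exercised outside the compiler with mock parameter lists. This is a sketch under stated assumptions: MockParam and mismatched_params are hypothetical, the grid_constant field stands in for entity+164 bit 2, and the function returns the names that would be substituted for %s instead of emitting diagnostics.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a parameter and its entity+164 bit 2 flag.
struct MockParam {
    std::string name;
    bool grid_constant;
};

// Compare annotations position by position, as the conceptual loop does,
// and collect the names of parameters whose flags differ.
inline std::vector<std::string> mismatched_params(
        const std::vector<MockParam>& old_decl,
        const std::vector<MockParam>& new_decl) {
    std::vector<std::string> mismatches;
    std::size_t n = old_decl.size() < new_decl.size() ? old_decl.size()
                                                      : new_decl.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (old_decl[i].grid_constant != new_decl[i].grid_constant)
            mismatches.push_back(new_decl[i].name);
    }
    return mismatches;
}
```

Like the conceptual loop, the sketch stops at the shorter list; parameter-count mismatches are assumed to be diagnosed elsewhere.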

Check 6: Function Template Redeclaration

Property	Value
Tag	`grid_constant_incompat_templ_redecl` (at `0x857748`)
Message	`incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p)` (at `0x88d9c8`)
Phase	Template redeclaration merging (`class_decl.c`)

Same logic as check 5, but for function template redeclarations. Template redeclaration merging occurs in a separate code path from regular function redeclaration because template entities have additional metadata (template parameter lists, partial specialization chains) that must be reconciled.

template<typename T>
__global__ void kernel(__grid_constant__ const T x);

template<typename T>
__global__ void kernel(const T x);  // ERROR: grid_constant_incompat_templ_redecl

Check 7: Template Specialization

Property	Value
Tag	`grid_constant_incompat_specialization` (at `0x857720`)
Message	`incompatible __grid_constant__ annotation for parameter %s in function specialization (see previous declaration %p)` (at `0x88da48`)
Phase	Template specialization processing

When a function template specialization's __grid_constant__ annotations disagree with the primary template, this error fires. The specialization must preserve the __grid_constant__ annotation from the primary template because the compiler may have already committed to constant-memory parameter placement based on the primary template's declaration.

template<typename T>
__global__ void kernel(__grid_constant__ const T x);

template<>
__global__ void kernel<int>(const int x);  // ERROR: grid_constant_incompat_specialization

A specialization that omits the annotation would require a different ABI for that particular instantiation, which the kernel launch infrastructure cannot accommodate on a per-specialization basis.

Check 8: Explicit Instantiation Directive

Property	Value
Tag	`grid_constant_incompat_instantiation_directive` (at `0x8576f0`)
Message	`incompatible __grid_constant__ annotation for parameter %s in instantiation directive (see previous declaration %p)` (at `0x88dac0`)
Phase	Explicit instantiation processing

This mirrors the specialization check but applies to explicit instantiation definitions and declarations (template void ... and extern template void ..., respectively).

template<typename T>
__global__ void kernel(__grid_constant__ const T x) { ... }

template __global__ void kernel<int>(const int x);
// ERROR: grid_constant_incompat_instantiation_directive

The instantiation directive must match the primary template's __grid_constant__ annotation for each parameter.

Memory Space Conflict Check (Error 3577)

While not one of the 8 __grid_constant__ validation checks, error 3577 provides a guard in the reverse direction. When apply_nv_managed_attr (sub_40E0D0) or apply_nv_device_attr (sub_40EB80) applies a memory space attribute to a variable, the handler checks whether the entity has the __grid_constant__ flag set at entity+164 bit 2. If so, and the entity also carries the __shared__ or __managed__ bit, error 3577 is emitted with the name of the conflicting memory space.

The check is identical in both handlers. Here is the reconstructed pseudocode from apply_nv_managed_attr (sub_40E0D0):

// From apply_nv_managed_attr (sub_40E0D0, attribute.c:10523)
// a1: attribute node, a2: entity node, a3: target kind (must be 7 = variable)

entity_t* apply_nv_managed_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {

    // Gate: variables only
    if (a3 != 7)
        internal_error("apply_nv_managed_attr", "attribute.c", 10523);

    // Apply memory space flags
    uint8_t old_memspace = a2->byte_148;
    a2->byte_149 |= 0x01;        // set __managed__ flag
    a2->byte_148 = old_memspace | 0x01;  // set __device__ flag (managed implies device)

    // Check for conflicting memory space combinations
    if (((old_memspace & 0x02) != 0) + ((old_memspace & 0x04) != 0) == 2)
        emit_error(3481, a1->src_loc);   // both __shared__ and __constant__ set

    if ((signed char)a2->byte_161 < 0)
        emit_error(3482, a1->src_loc);   // thread_local conflict

    if (a2->byte_81 & 0x04)
        emit_error(3485, a1->src_loc);   // local variable conflict

    // Grid constant conflict check
    if ((a2->byte_164 & 0x04) != 0      // has __grid_constant__ flag
        && (*(uint16_t*)((char*)a2 + 148) & 0x0102) != 0)  // __shared__ OR __managed__ (byte offset +148)
    {
        // Determine which memory space to report in the diagnostic
        uint8_t mem = a2->byte_148;
        const char* space;
        if      (mem & 0x04)         space = "__constant__";
        else if (a2->byte_149 & 0x01) space = "__managed__";
        else if (mem & 0x02)         space = "__shared__";
        else if (mem & 0x01)         space = "__device__";
        else                         space = "";

        emit_error_with_string(3577, a1->src_loc, space);
    }

    return a2;
}

The 0x0102 mask on the 16-bit word at a2 + 148 checks two bits: bit 1 of byte +148 (__shared__, value 0x02) and bit 0 of byte +149 (__managed__, value 0x01 shifted left by 8 bits = 0x0100, because byte +149 is the high byte of the little-endian word on x86-64). This means the conflict check fires specifically when a __grid_constant__ parameter also has __shared__ or __managed__ -- these memory spaces are incompatible with constant memory placement.

The priority order for the diagnostic message (__constant__ > __managed__ > __shared__ > __device__) determines which memory space name appears in the error output when multiple conflicting spaces are present simultaneously.
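The mask test and the name cascade can be reproduced from the individual bytes. The sketch below is a hypothetical model: the function names are invented, the three byte arguments stand in for entity+148, entity+149, and entity+164, and the 16-bit word is assembled explicitly instead of being read through a pointer cast.

```cpp
#include <cstdint>
#include <string>

// Models the 3577 trigger: __grid_constant__ flag set AND
// (__shared__ bit in byte +148 OR __managed__ bit in byte +149).
inline bool grid_constant_memspace_conflict(std::uint8_t byte148,
                                            std::uint8_t byte149,
                                            std::uint8_t byte164) {
    // Little-endian 16-bit word: byte +148 is low, byte +149 is high.
    std::uint16_t word = static_cast<std::uint16_t>(byte148 | (byte149 << 8));
    return (byte164 & 0x04) != 0 && (word & 0x0102) != 0;
}

// Models the priority cascade that picks the reported space name.
inline std::string reported_space(std::uint8_t byte148, std::uint8_t byte149) {
    if (byte148 & 0x04) return "__constant__";
    if (byte149 & 0x01) return "__managed__";
    if (byte148 & 0x02) return "__shared__";
    if (byte148 & 0x01) return "__device__";
    return "";
}
```

Read against the pseudocode above, apply_nv_managed_attr sets byte +149 bit 0 before the test, so within that handler the mask is already satisfied and the entity+164 flag alone decides whether 3577 fires.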

The apply_nv_device_attr handler (sub_40EB80) performs the identical check in its variable-handling branch (when a3 == 7):

// From apply_nv_device_attr (sub_40EB80), variable branch
if (a3 == 7) {
    a2->byte_148 |= 0x01;         // set __device__ flag

    // ... shared/constant conflict, thread_local, local variable checks ...

    // Identical grid_constant conflict check
    if ((a2->byte_164 & 0x04) != 0 && (*(uint16_t*)((char*)a2 + 148) & 0x0102) != 0) {
        // Same priority cascade for space name
        // ...
        emit_error_with_string(3577, a1->src_loc, space);
    }
    return a2;
}

Entity Node Fields

Three distinct locations in entity/type/parameter nodes carry __grid_constant__ state:

entity+164 bit 2 (0x04): Grid Constant Declaration Flag

Set during attribute application when a parameter is declared __grid_constant__. This is the "declaration-side" flag that records the programmer's intent. Used by:

Memory space conflict check (error 3577) in apply_nv_managed_attr and apply_nv_device_attr
Redeclaration consistency checks (checks 5--8)

type+133 bit 5 (0x20): Type-Level Flag

A flag on the type node (not the entity node) checked by sub_7A6B60. This function follows the type chain through cv-qualifier wrappers (kind == 12) and tests byte+133 & 0x20:

// sub_7A6B60 (types.c)
// In the broader EDG type system, this function checks bit 5 of the
// type's flag byte. For CUDA parameter types, this bit indicates
// __grid_constant__ annotation. The same bit is also used as the
// dependent-type flag in template contexts (hence 299 callers in the binary).
bool type_has_flag_0x20(type_t* type) {
    while (type->kind == 12)       // skip cv-qualifier wrappers
        type = type->referenced;   // follow type chain at +144
    return (type->byte_133 & 0x20) != 0;
}

This query is used by the __global__ apply handler's parameter iteration to detect parameters that are already annotated with __grid_constant__, suppressing the error 3669 advisory for those parameters.

param+32 bit 1 (0x02): Parameter Node Flag

A flag on the parameter node itself, checked during post-declaration validation (sub_6BC890). The validator walks the parameter list and tests each parameter's byte at offset +32 for bit 1. If set on a parameter of a non-__global__ function, error 3702 (grid_constant_non_kernel) is emitted.

The three flags serve different purposes: the entity flag records the declaration intent and is used for cross-declaration consistency checks, the type flag enables efficient type-level queries during attribute application, and the parameter flag enables the post-validation pass to scan parameter lists without resolving entity or type chains.

Parameter Iteration in the __global__ Apply Handler

The apply_nv_global_attr handlers (sub_40E1F0 and sub_40E7F0) contain a parameter iteration loop that interacts with __grid_constant__. This loop checks each kernel parameter for types that should be __grid_constant__ but are not annotated as such. When found in device compilation mode (dword_126C5C4 == -1), error 3669 is emitted as an advisory.

// From apply_nv_global_attr (sub_40E1F0), Phase 5: parameter iteration
// This section runs only when attr_node+11 bit 0 is set (applies to parameters)

if (a1->byte_11 & 0x01) {

    // Navigate to function prototype through cv-qualifier chain
    type_t* proto_type = entity->type_chain;     // entity+144
    while (proto_type->kind == 12)
        proto_type = proto_type->referenced;

    // Get parameter list head (double dereference from prototype+152)
    param_t* param = **(param_t***)((char*)proto_type + 152);   // byte offset +152
    source_loc_t saved_loc = a1->src_loc;        // attr_node+56

    for (; param != NULL; param = param->next) {
        // Peel cv-qualifier wrappers from parameter type
        type_t* ptype = param->type;             // param[1] (offset 8)
        while (ptype->kind == 12)
            ptype = ptype->referenced;

        // sub_7A6B60: returns true if type+133 bit 5 is set
        // (parameter is already __grid_constant__)
        if (!sub_7A6B60(ptype) && dword_126C5C4 == -1) {

            // Scope table lookup (784-byte entries); treat the entry as raw bytes
            const uint8_t* scope = (const uint8_t*)(qword_126C5E8 + 784LL * dword_126C5E4);

            // Skip if scope has qualifier flags or is a cv-qualified scope
            if ((scope[6] & 0x06) == 0 && scope[4] != 12) {

                // Re-navigate to unqualified type
                type_t* ptype2 = param->type;
                while (ptype2->kind == 12)
                    ptype2 = ptype2->referenced;

                // If no default initializer, suggest __grid_constant__
                if (ptype2->qword_120 == 0)
                    emit_error(3669, &saved_loc);
            }
        }
    }
}

The logic: for each parameter in a __global__ function, if the parameter type does NOT already have the __grid_constant__ flag AND we are in device compilation mode AND the current scope is not a cv-qualified context AND the parameter type lacks a default initializer (the type+120 pointer is null), then emit error 3669 as an advisory. The advisory nudges kernel authors to add __grid_constant__ annotations for better performance.

The scope table lookup (qword_126C5E8 indexed by dword_126C5E4, 784-byte entries) determines whether the current compilation context is device-side. The dword_126C5C4 == -1 sentinel explicitly indicates device compilation mode. Together these two conditions ensure the advisory only fires when processing the device-side compilation of a kernel, not during host-side stub generation.
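The four conditions can be collapsed into a single predicate for reference. This is a sketch only: AdvisoryInputs and should_emit_3669 are hypothetical names, and each field stands in for the binary location named in its comment.

```cpp
#include <cstdint>

// Hypothetical snapshot of the inputs the loop consults per parameter.
struct AdvisoryInputs {
    bool type_flag_0x20;           // result of the type+133 bit 5 query
    int mode_sentinel;             // models dword_126C5C4
    std::uint8_t scope_byte_6;     // byte +6 of the current scope entry
    std::uint8_t scope_byte_4;     // byte +4 of the current scope entry
    bool has_default_initializer;  // models type+120 != 0
};

// True when the loop described above would emit error 3669.
inline bool should_emit_3669(const AdvisoryInputs& in) {
    if (in.type_flag_0x20)
        return false;              // flag already present on the type
    if (in.mode_sentinel != -1)
        return false;              // not in device compilation mode
    if ((in.scope_byte_6 & 0x06) != 0 || in.scope_byte_4 == 12)
        return false;              // scope filter rejects this context
    return !in.has_default_initializer;
}
```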

Keyword Registration

The __grid_constant__ keyword is registered during fe_translation_unit_init (sub_5863A0), alongside other CUDA extension keywords (__device__, __global__, __shared__, __constant__, __managed__, __launch_bounds__). The registration inserts both grid_constant (bare form, for attribute name lookup) and __grid_constant__ (double-underscore form, for lexer recognition) into EDG's keyword-to-token-ID mapping.

The attribute name lookup function (sub_40A250) strips leading and trailing double underscores before searching the attribute name hash table (qword_E7FB60), so __grid_constant__ resolves to the same descriptor entry as the bare grid_constant form.
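The normalization step can be sketched as a small string helper. This is a hypothetical model of the behavior described for sub_40A250, not its decompiled code; in particular, stripping only when both the leading and trailing "__" are present is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical model: "__grid_constant__" -> "grid_constant".
// Assumption: underscores are stripped only when both the prefix and the
// suffix are present and do not overlap.
inline std::string strip_double_underscores(const std::string& name) {
    const std::size_t n = name.size();
    if (n >= 4 && name.compare(0, 2, "__") == 0 &&
        name.compare(n - 2, 2, "__") == 0) {
        return name.substr(2, n - 4);
    }
    return name;
}
```

Both spellings then map to the same key before the hash table search.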

Diagnostic Tag Summary

Tag	Error Code	Message	Phase
`grid_constant_not_const`	--	`a parameter annotated with __grid_constant__ must have const-qualified type`	Application
`grid_constant_reference_type`	--	`a parameter annotated with __grid_constant__ must not have reference type`	Application
`grid_constant_non_kernel`	3702	`__grid_constant__ annotation is only allowed on a parameter of a __global__ function`	Post-validation
`grid_constant_unsupported_arch`	--	`__grid_constant__ annotation is only allowed for architecture compute_70 or later`	Application
`grid_constant_incompat_redecl`	--	`incompatible __grid_constant__ annotation for parameter %s in function redeclaration (see previous declaration %p)`	Redeclaration
`grid_constant_incompat_templ_redecl`	--	`incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p)`	Template redecl
`grid_constant_incompat_specialization`	--	`incompatible __grid_constant__ annotation for parameter %s in function specialization (see previous declaration %p)`	Specialization
`grid_constant_incompat_instantiation_directive`	--	`incompatible __grid_constant__ annotation for parameter %s in instantiation directive (see previous declaration %p)`	Instantiation

Error codes for checks 1, 2, 4--8 are not individually mapped in the decompiled code available for this analysis. Error 3702 (check 3) is confirmed from the post-validation function sub_6BC890. Error 3577 (memory space conflict) is confirmed from sub_40E0D0 and sub_40EB80.

Function Map

Address	Identity	Lines	Source File	Role
`sub_7A6B60`	type flag query (`byte_133 & 0x20`)	9	`types.c`	Follows type chain, returns grid_constant / dependent flag
`sub_40E0D0`	`apply_nv_managed_attr`	47	`attribute.c:10523`	Memory space conflict check (3577) for `__managed__`
`sub_40EB80`	`apply_nv_device_attr`	100	`attribute.c`	Memory space conflict check (3577) for `__device__`
`sub_6BC890`	`nv_validate_cuda_attributes`	161	`nv_transforms.c`	Post-validation: param walk for 3702 (`grid_constant_non_kernel`)
`sub_40E1F0`	`apply_nv_global_attr` (variant 1)	89	`attribute.c`	Parameter iteration with grid_constant flag check (3669 advisory)
`sub_40E7F0`	`apply_nv_global_attr` (variant 2)	86	`attribute.c`	Same parameter iteration (alternate call path, `do-while` loop)
`sub_5863A0`	`fe_translation_unit_init`	--	`fe_init.c`	Registers `__grid_constant__` keyword
`sub_40A250`	attribute name lookup	--	`attribute.c`	Strips `__` prefix/suffix, searches hash table

Global Variables

Global	Address	Purpose
`dword_126E4A8`	`0x126E4A8`	Target SM architecture code (from `--target`). Must be >= 70 for `__grid_constant__`.
`dword_126C5C4`	`0x126C5C4`	Scope index sentinel. `-1` = device compilation mode. Guards 3669 advisory check.
`dword_126C5E4`	`0x126C5E4`	Current scope table index. Used in 3669 scope lookup.
`qword_126C5E8`	`0x126C5E8`	Scope table base pointer (784-byte entries). Used in 3669 scope lookup.

Cross-References

Attribute System Overview -- attribute node structure, dispatch pipeline, kind byte enumeration
__global__ Function Constraints -- parameter iteration for __grid_constant__ advisory (error 3669), full apply handler pseudocode
Entity Node Layout -- entity+164 bit 2 (grid_constant flag), param+32 bit 1
CUDA Error Catalog -- all 8 grid_constant_* diagnostic tags
CLI Flag Inventory -- --target flag setting dword_126E4A8
Architecture Feature Gating -- SM version gating mechanism, dword_126E4A8 data flow
CUDA Memory Spaces -- constant memory semantics, error 3577 conflict
RDC Mode -- why redeclaration consistency matters across translation units

__managed__ Variables

The __managed__ attribute declares a variable in CUDA Unified Memory -- a memory region accessible from both host (CPU) and device (GPU) code, with the CUDA runtime handling page migration transparently. Unlike __device__ variables (accessible only from device code without explicit cudaMemcpy), managed variables can be read and written by both the host and device using the same pointer. The hardware and driver cooperate to migrate pages on demand between CPU and GPU memory, so neither the programmer nor the compiler needs to issue explicit copies.

The constraint set on __managed__ reflects two fundamental realities. First, unified memory is a runtime feature: the compiler cannot resolve managed addresses at compile time, so every host-side access must be gated behind a lazy initialization call that registers the variable with the CUDA runtime's unified memory subsystem. Second, unified memory requires hardware support: managed memory is supported only on the Kepler architecture (compute capability 3.0) and later, and it builds on the UVA (Unified Virtual Addressing) infrastructure. These two realities drive the entire implementation -- the attribute handler sets both a managed flag and a device flag (because managed memory is device-global memory with extra runtime semantics), the validation chain rejects memory spaces and qualifiers that conflict with runtime writability, and the code generator wraps every host-side access in a comma-operator expression that forces lazy initialization.

Key Facts

Property	Value
Attribute kind byte	`0x66` = `'f'` (102)
Handler function	`sub_40E0D0` (`apply_nv_managed_attr`, 47 lines, `attribute.c:10523`)
Entity node flags set	`entity+149` bit 0 (`__managed__`) AND `entity+148` bit 0 (`__device__`)
Detection bitmask	`(*(_WORD *)(entity + 148) & 0x101) == 0x101`
Minimum architecture	compute_30 (Kepler) -- `dword_126E4A8 >= 30`
Applies to	Variables only (entity kind 7)
Diagnostic codes	3481, 3482, 3485, 3577 (attribute application); arch/config errors (declaration processing)
Managed RT boilerplate emitter	`sub_489000` (`process_file_scope_entities`, line 218)
Access wrapper emitters	`sub_4768F0` (`gen_name_ref`), `sub_484940` (`gen_variable_name`)
Managed access prefix string	`0x839570` (65 bytes)
Managed RT static block string	`0x83AAC8` (243 bytes)
Managed RT init function string	`0x83ABC0` (210 bytes)

Semantic Meaning

A __managed__ variable occupies a single virtual address that is valid on both host and device. The CUDA runtime allocates the variable through cudaMallocManaged during module initialization and registers it so the driver can track page ownership. When a kernel accesses the variable, the GPU's page fault handler migrates the page from CPU memory (if needed). When host code accesses it after a kernel launch, the runtime ensures the GPU has finished writing and the page is migrated back to CPU-accessible memory.

This is fundamentally different from the other three memory spaces:

Space	Accessibility	Migration	Lifetime
`__device__`	Device only (host needs `cudaMemcpy`)	Manual	Program lifetime
`__shared__`	Device only, per-thread-block	None (on-chip SRAM)	Block lifetime
`__constant__`	Device read-only (host writes via `cudaMemcpyToSymbol`)	Manual	Program lifetime
`__managed__`	Host and device, same pointer	Automatic (page faults)	Program lifetime

Because managed memory is fundamentally device global memory with runtime-managed migration, the __managed__ handler always sets the __device__ bit alongside the __managed__ bit. This is not redundant -- it ensures that all code paths that check for "device-accessible variable" (error 3483 scope checks, external linkage warning 3648, cross-space reference validation) treat managed variables correctly. A managed variable IS a device variable; it just happens to also be host-accessible through the runtime's page migration.
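The two-bit encoding and the 0x101 detection mask can be modeled with a pair of bytes. The sketch below is hypothetical: MockEntity, apply_managed, and is_managed are invented names, and the two fields stand in for entity+148 and entity+149.

```cpp
#include <cstdint>

// Hypothetical stand-in for the two memory-space bytes of an entity node.
struct MockEntity {
    std::uint8_t byte148 = 0;  // bit 0 = __device__
    std::uint8_t byte149 = 0;  // bit 0 = __managed__
};

// Models the flag updates made by the __managed__ handler.
inline void apply_managed(MockEntity& e) {
    e.byte149 |= 0x01;  // __managed__
    e.byte148 |= 0x01;  // __device__ (managed implies device)
}

// Models the detection mask: both bits must be set in the 16-bit word
// formed by byte +148 (low) and byte +149 (high).
inline bool is_managed(const MockEntity& e) {
    std::uint16_t word =
        static_cast<std::uint16_t>(e.byte148 | (e.byte149 << 8));
    return (word & 0x0101) == 0x0101;
}
```

A plain __device__ variable sets only the low bit, so it fails the equality test even though it passes any check that looks at the __device__ bit alone.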

Why the Constraints Exist

Each validation check enforced by the handler exists for a specific hardware or semantic reason:

Variables only (kind 7): Unified memory is a storage concept. Functions do not reside in managed memory -- they have execution spaces, not memory spaces.
Cannot be __shared__ or __constant__: These are mutually exclusive memory spaces that occupy different physical hardware. __shared__ is per-block on-chip SRAM with no concept of host accessibility. __constant__ is a read-only cached region with no write path from device code. Managed memory is global DRAM with page migration. They cannot coexist.
Cannot be thread_local: Thread-local storage uses thread-specific addressing (TLS segments) which is a host-side concept incompatible with CUDA's execution model. A managed variable must have a single global address visible to all threads on both host and device.
Cannot be a local variable or reference type: Managed variables require runtime registration with the CUDA driver during module loading. Local variables are stack-allocated with lifetimes that cannot be tracked by the runtime. References cannot cross address spaces -- a reference to a managed variable on the host would hold a CPU virtual address that is meaningless on the device.
Requires compute_30+: The diagnostic text ties __managed__ to compute capability 3.0 (Kepler) or higher, the minimum target on which the CUDA runtime supports managed memory. Unified Virtual Addressing (UVA) alone is not sufficient: it gives host and device pointers a shared address range, but transparent migration additionally needs the managed-memory runtime, which targets below compute_30 do not have.
Incompatible with __grid_constant__: Grid-constant parameters are loaded into constant memory at kernel launch. A managed variable's value is determined by its current page state, which can change between kernel launches. The two semantics are contradictory.

Attribute Application: apply_nv_managed_attr

sub_40E0D0 -- Full Pseudocode

The __managed__ attribute handler is the simplest of the four memory space handlers and demonstrates the complete validation template. It is called from apply_one_attribute (sub_413240) when the attribute kind byte is 'f' (102).

// sub_40E0D0 -- apply_nv_managed_attr (attribute.c:10523)
// a1: attribute node pointer (attribute_node_t*)
// a2: entity node pointer (entity_t*)
// a3: entity kind (uint8_t)
// returns: entity node pointer (passthrough)

entity_t* apply_nv_managed_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {

    // ===== Gate: variables only =====
    // Entity kind 7 = variable. Any other kind (function=11, type=6, etc.)
    // is an internal error -- the dispatch table should never route
    // __managed__ to a non-variable entity.
    if (a3 != 7)
        internal_error("attribute.c", 10523, "apply_nv_managed_attr", 0, 0);

    // ===== Step 1: Set managed + device flags =====
    // Save current memory space byte for later checks.
    // Managed memory IS device global memory, so both flags must be set.
    uint8_t old_space = a2->byte_148;
    a2->byte_149 |= 0x01;         // set __managed__ flag
    a2->byte_148 = old_space | 1;  // set __device__ flag

    // ===== Step 2: Mutual exclusion -- shared + constant =====
    // The expression ((x & 2) != 0) + ((x & 4) != 0) == 2 is true
    // only when BOTH __shared__ (bit 1) and __constant__ (bit 2) are set.
    // This catches an impossible three-way conflict, NOT managed+shared
    // or managed+constant individually. The individual conflicts
    // (__managed__ + __shared__, __managed__ + __constant__) are caught
    // by the __grid_constant__ check or by subsequent declaration processing.
    if (((old_space & 2) != 0) + ((old_space & 4) != 0) == 2)
        emit_error(3481, a1->source_loc);  // "conflicting CUDA memory spaces"

    // ===== Step 3: Thread-local check =====
    // Byte +161 bit 7 (sign bit when read as signed char) indicates
    // thread_local storage duration. Managed variables must have
    // static storage duration with a single global address.
    if ((signed char)a2->byte_161 < 0)
        emit_error(3482, a1->source_loc);  // "CUDA memory space on thread_local"

    // ===== Step 4: Local variable / reference type check =====
    // Byte +81 bit 2 indicates the entity is declared in a local scope
    // (block scope, function parameter, or reference type).
    // Managed variables require file-scope lifetime for runtime registration.
    if (a2->byte_81 & 0x04)
        emit_error(3485, a1->source_loc);  // "CUDA memory space on local/ref"

    // ===== Step 5: __grid_constant__ conflict =====
    // Byte +164 bit 2 is the __grid_constant__ flag on the parameter entity.
    // If set, check whether this entity also has a conflicting memory space.
    // The 16-bit word read at +148 with mask 0x0102 catches:
    //   byte +148 bit 1 (0x02) = __shared__
    //   byte +149 bit 0 (0x01, as 0x100 in word) = __managed__
    // (Little-endian: word = byte_149 << 8 | byte_148)
    // (char*) cast: the offset is in bytes, not in entity_t units
    if ((a2->byte_164 & 0x04) && (*(uint16_t*)((char*)a2 + 148) & 0x0102)) {

        // Build error message: select most restrictive space name
        uint8_t space = a2->byte_148;
        const char* name = "__constant__";
        if (!(space & 0x04)) {
            name = "__managed__";
            if (!(a2->byte_149 & 0x01)) {
                name = "__shared__";
                if (!(space & 0x02)) {
                    name = "__device__";
                    if (!(space & 0x01))
                        name = "";
                }
            }
        }
        emit_error_with_name(3577, a1->source_loc, name);
        // "memory space %s incompatible with __grid_constant__"
    }

    return a2;
}

Entity Node Fields Modified

Offset	Field	Bits Set	Meaning
`+148`	`memory_space`	bit 0 (`0x01`)	`__device__` -- variable lives in device global memory
`+149`	`extended_space`	bit 0 (`0x01`)	`__managed__` -- variable is in unified memory

Entity Node Fields Read (Validation)

Offset	Field	Mask	Meaning
`+148`	`memory_space`	`0x02`	`__shared__` flag (mutual exclusion check)
`+148`	`memory_space`	`0x04`	`__constant__` flag (mutual exclusion check)
`+161`	`storage_flags`	bit 7 (sign)	`thread_local` storage duration
`+81`	`scope_flags`	`0x04`	Local scope / reference type indicator
`+164`	`cuda_flags`	`0x04`	`__grid_constant__` parameter flag
`+148:149`	`space_word`	`0x0102`	Combined `__shared__` OR `__managed__` (grid_constant conflict)
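The flag updates and checks listed in the two tables above can be exercised with a small mock. This is a sketch under stated assumptions, not the binary's code: mock_entity, err_log, and mock_apply_managed are hypothetical names, and the struct holds only the five bytes the handler touches instead of the real entity layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the entity node: only the bytes the handler uses. */
typedef struct {
    uint8_t scope_flags;    /* models +81  : 0x04 = local scope / reference type */
    uint8_t memory_space;   /* models +148 : 0x01 device, 0x02 shared, 0x04 constant */
    uint8_t extended_space; /* models +149 : 0x01 managed */
    uint8_t storage_flags;  /* models +161 : 0x80 thread_local */
    uint8_t cuda_flags;     /* models +164 : 0x04 __grid_constant__ */
} mock_entity;

enum { MAX_ERRS = 8 };
typedef struct { int codes[MAX_ERRS]; int count; } err_log;

/* Append one diagnostic number to the log. */
static void record(err_log *log, int code) {
    assert(log->count < MAX_ERRS);
    log->codes[log->count++] = code;
}

/* Return 1 if the given diagnostic number was recorded. */
static int has_error(const err_log *log, int code) {
    for (int i = 0; i < log->count; ++i)
        if (log->codes[i] == code) return 1;
    return 0;
}

/* Mirrors the order of operations in the pseudocode for sub_40E0D0. */
static void mock_apply_managed(mock_entity *e, err_log *log) {
    uint8_t old_space = e->memory_space;
    e->extended_space |= 0x01;            /* __managed__ */
    e->memory_space = old_space | 0x01;   /* __device__  */

    if ((old_space & 0x02) && (old_space & 0x04))
        record(log, 3481);                /* shared AND constant both present */
    if (e->storage_flags & 0x80)
        record(log, 3482);                /* thread_local */
    if (e->scope_flags & 0x04)
        record(log, 3485);                /* local scope / reference */
    /* Word mask 0x0102: shared bit in the low byte or managed bit in the high byte. */
    if ((e->cuda_flags & 0x04) &&
        ((e->memory_space & 0x02) || (e->extended_space & 0x01)))
        record(log, 3577);                /* __grid_constant__ conflict */
}
```

Because the managed bit has just been set when the last check runs, the 0x0102 test always succeeds in this handler, so in the mock the 3577 path depends only on the __grid_constant__ flag.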

Comparison with apply_nv_device_attr (sub_40EB80)

The __device__ handler's variable path (entity kind 7) is structurally identical to apply_nv_managed_attr, minus the byte_149 |= 1 step. Both handlers:

Set byte_148 |= 0x01 (device memory space)
Check error 3481 (shared + constant mutual exclusion)
Check error 3482 (thread_local)
Check error 3485 (local variable)
Check error 3577 (grid_constant conflict)

The __device__ handler also has a function path (kind 11) for setting execution space bits. __managed__ has no function path because managed memory is a storage concept, not an execution concept.

Architecture Gating

The compute_30 requirement for __managed__ is enforced during declaration processing, not in the attribute handler itself. The attribute handler (sub_40E0D0) sets the bitfield flags unconditionally; the architecture check happens later when the declaration is fully processed.

Two diagnostic tags cover managed architecture gating:

Tag	Message	Condition
`unsupported_arch_for_managed_capability`	`__managed__ variables require architecture compute_30 or higher`	`dword_126E4A8 < 30`
`unsupported_configuration_for_managed_capability`	`__managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system)`	Configuration-specific flag check

The architecture check uses the global dword_126E4A8, which stores the SM version number from the --gpu-architecture flag. The value 30 corresponds to sm_30 (Kepler), the lowest architecture for which the diagnostic permits __managed__ variables. The configuration check covers edge cases like 32-bit compilation mode or unsupported operating systems where the CUDA runtime's managed memory subsystem is unavailable.

Managed Runtime Boilerplate

Every .int.c file emitted by cudafe++ contains a block of managed runtime initialization code, emitted unconditionally by sub_489000 (process_file_scope_entities) at line 218. The block appears whether or not the translation unit contains any __managed__ variables. When it contains none, no access wrapper is generated, the initializer is never called, and the cost is limited to a few unused static symbols.

Static Declarations

Four declarations are emitted as a single string literal from 0x83AAC8 (243 bytes):

// Emitted verbatim by sub_489000, line 218
static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) {
    __nv_fatbinhandle_for_managed_rt = in;
}
static char __nv_init_managed_rt_with_module(void **);

Each symbol serves a specific role in the initialization chain:

Symbol	Type	Role
`__nv_inited_managed_rt`	`static char`	Guard flag: 0 = uninitialized, nonzero = initialized
`__nv_fatbinhandle_for_managed_rt`	`static void**`	Cached fatbinary handle, populated during `__cudaRegisterFatBinary`
`__nv_save_fatbinhandle_for_managed_rt`	`static void(void**)`	Callback that stores the fatbin handle -- called at program startup
`__nv_init_managed_rt_with_module`	`static char(void**)`	Forward declaration -- defined later by `crt/host_runtime.h`

The forward declaration of __nv_init_managed_rt_with_module is critical: this function is provided by the CUDA runtime headers and performs the actual cudaRegisterManagedVariable calls. By forward-declaring it here, the managed runtime boilerplate can reference it before the runtime header is #included later in the .int.c file.

Lazy Initialization Function

Emitted immediately after the static block (string at 0x83ABC0, 210 bytes):

// sub_489000, lines 221-224
// Conditional prefix:
if (dword_106BF6C)  // alternative host compiler mode
    emit("__attribute__((unused)) ");

// Function body:
static inline void __nv_init_managed_rt(void) {
    __nv_inited_managed_rt = (
        __nv_inited_managed_rt
            ? __nv_inited_managed_rt
            : __nv_init_managed_rt_with_module(
                  __nv_fatbinhandle_for_managed_rt)
    );
}

The ternary is a lazy-init idiom. On first call, __nv_inited_managed_rt is 0 (falsy), so the false branch executes __nv_init_managed_rt_with_module, which registers all managed variables in the translation unit and returns nonzero. The result is stored back into __nv_inited_managed_rt, so subsequent calls short-circuit through the true branch and return the existing nonzero value without re-initializing.

The __attribute__((unused)) prefix is conditionally added when dword_106BF6C (alternative host compiler mode) is set. This suppresses -Wunused-function warnings on host compilers that may not see any call sites for this function if no managed variables exist in the translation unit.
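The guard-and-store ternary can be tried in isolation. The sketch below is illustrative only: inited_flag, init_with_module_stub, and module_init_calls are hypothetical stand-ins for __nv_inited_managed_rt, __nv_init_managed_rt_with_module, and an instrumentation counter. The stub is assumed to return nonzero on success, as the text describes for the real function.

```c
#include <assert.h>

static int  module_init_calls = 0;  /* demo-only counter, not part of the emitted code */
static char inited_flag = 0;        /* plays the role of __nv_inited_managed_rt */

/* Stand-in for __nv_init_managed_rt_with_module: assumed to return nonzero. */
static char init_with_module_stub(void) {
    ++module_init_calls;
    return 1;
}

/* Same shape as the emitted __nv_init_managed_rt body. */
static void lazy_init(void) {
    inited_flag = (inited_flag
                       ? inited_flag               /* already initialized: keep value */
                       : init_with_module_stub()); /* first call: run the real init   */
}
```

Calling lazy_init repeatedly runs the stub once. If the stub returned 0, the flag would stay 0 and the next call would retry, which is a property of the idiom itself, independent of what the real runtime function returns.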

Runtime Registration Sequence

The full initialization flow spans the compilation and runtime startup pipeline:

Compile time (cudafe++ emits into .int.c):
  1. __nv_save_fatbinhandle_for_managed_rt() -- defined, stores fatbin handle
  2. __nv_init_managed_rt_with_module()      -- forward-declared only
  3. __nv_init_managed_rt()                  -- defined, lazy init wrapper
  4. #include "crt/host_runtime.h"           -- provides _with_module() definition

Program startup:
  5. __cudaRegisterFatBinary() calls __nv_save_fatbinhandle_for_managed_rt()
     to cache the fatbin handle for this translation unit

First managed variable access:
  6. Comma-operator wrapper calls __nv_init_managed_rt()
  7. Guard flag is 0, so __nv_init_managed_rt_with_module() executes
  8. __nv_init_managed_rt_with_module() calls cudaRegisterManagedVariable()
     for every __managed__ variable in the translation unit
  9. Guard flag set to nonzero, preventing re-initialization

Subsequent accesses:
  10. Comma-operator wrapper calls __nv_init_managed_rt()
  11. Guard flag is nonzero, ternary short-circuits, no runtime call

Host Access Transformation: The Comma-Operator Pattern

When cudafe++ generates the .int.c host-side code and encounters a reference to a __managed__ variable, it wraps the access in a comma-operator expression. This is the core mechanism that ensures the CUDA managed memory runtime is initialized before any managed variable is touched on the host.

Detection

Two backend emitter functions detect managed variables using the same 16-bit bitmask test:

// Used by both sub_4768F0 (gen_name_ref) and sub_484940 (gen_variable_name)
if ((*(_WORD*)(entity + 148) & 0x101) == 0x101)

In little-endian layout, the 16-bit word at offset 148 spans bytes +148 (low) and +149 (high). The mask 0x101 tests:

Bit 0 of byte +148 (0x01): __device__ flag
Bit 0 of byte +149 (0x100 in the word): __managed__ flag

Both bits are always set together by apply_nv_managed_attr, so this test is equivalent to "is this a managed variable?"
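As a minimal sketch of the word test, the helper below composes the 16-bit value from two bytes explicitly, so the result does not depend on the endianness of the machine running the example. The names space_word and is_managed and the 256-byte buffer used to exercise them are assumptions made for illustration; only the offsets and the 0x0101 mask come from the analysis above.

```c
#include <assert.h>
#include <stdint.h>

enum { SPACE_OFFSET = 148 };  /* byte offset of the memory-space byte in the entity */

/* Build the little-endian word: byte +148 is the low half, byte +149 the high half. */
static uint16_t space_word(const uint8_t *entity) {
    return (uint16_t)(entity[SPACE_OFFSET] | (entity[SPACE_OFFSET + 1] << 8));
}

/* True only when both the __device__ bit and the __managed__ bit are set. */
static int is_managed(const uint8_t *entity) {
    return (space_word(entity) & 0x0101) == 0x0101;
}
```

Comparing against 0x0101 rather than testing for nonzero matters: a plain __device__ variable sets only the low bit and must not be wrapped.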

Transformed Output

For a managed variable named managed_var, the emitter produces:

(*( (__nv_inited_managed_rt ? (void)0 : __nv_init_managed_rt()), (managed_var)))

The prefix string lives at 0x839570 (65 bytes):

"(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), ("

After emitting the variable name, the suffix ))) closes the expression.

Why This Works: Anatomy of the Expression

Reading from inside out:

(*( (__nv_inited_managed_rt ? (void)0 : __nv_init_managed_rt()), (managed_var)))
     ^--- ternary: lazy init guard ----^                          ^--- value ---^
     ^--- comma operator: init side-effect, then yield value --------------------------^
^--- dereference: access the managed variable's storage ---------------------------------^

Ternary __nv_inited_managed_rt ? (void)0 : __nv_init_managed_rt() -- The guard flag is checked. If nonzero (already initialized), the expression evaluates to (void)0, which generates no code. If zero (first access), __nv_init_managed_rt() is called, which performs CUDA runtime registration and sets the guard flag to nonzero.
Comma operator (init_expr, (managed_var)) -- The C comma operator evaluates its left operand for side effects only, discards the result, then evaluates and returns its right operand. This guarantees the initialization side-effect is sequenced before the variable access, per C/C++ sequencing rules (C11 6.5.17, C++17 [expr.comma]).
Outer dereference *(...) -- The outer * dereferences the result. After runtime registration, the managed variable's symbol resolves to the unified memory pointer that the CUDA runtime allocated via cudaMallocManaged. The dereference yields the actual variable value.

The entire expression is parenthesized to be safely usable in any expression context -- assignments, function arguments, member access, etc.
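The expression shape can be reproduced in plain C to confirm that it behaves as an assignable object. This is a sketch under assumptions: rt_inited and rt_init stand in for __nv_inited_managed_rt and __nv_init_managed_rt, the MANAGED macro is hypothetical, and the host-side symbol is modeled as an ordinary pointer to local storage, not as memory obtained from the CUDA runtime.

```c
#include <assert.h>

static int  init_calls = 0;        /* demo-only counter */
static char rt_inited = 0;         /* stands in for __nv_inited_managed_rt */
static int  backing_storage = 0;   /* hypothetical memory the pointer refers to */
static int *managed_var = &backing_storage;  /* host-side symbol modeled as a pointer */

/* Stands in for __nv_init_managed_rt: performs one-time setup and sets the guard. */
static void rt_init(void) {
    ++init_calls;
    rt_inited = 1;
}

/* Same shape as the emitted wrapper: guard, comma, then dereference. */
#define MANAGED(name) (*( (rt_inited ? (void)0 : rt_init()), (name)))

/* The dereference yields an lvalue, so the wrapper works on either side of '='. */
static int bump_managed(void) {
    MANAGED(managed_var) = 41;
    MANAGED(managed_var) += 1;
    return MANAGED(managed_var);
}
```

Both arms of the conditional have type void, which C permits, and the comma operator passes the pointer through unchanged, so only the first of the three accesses pays for initialization.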

Two Emitter Paths

The access transformation is applied by two separate functions, covering different name resolution contexts:

sub_484940 (gen_variable_name, 52 lines) -- handles direct variable name emission. Simpler structure: check the 0x101 bitmask, emit prefix, emit the name (handling three sub-cases: thread-local via this, anonymous via sub_483A80, or regular via sub_472730), emit suffix.

// sub_484940 -- gen_variable_name (pseudocode)
void gen_variable_name(entity_t* a1) {
    bool needs_suffix = false;

    // Check: is this a __managed__ variable?
    if ((*(uint16_t*)((char*)a1 + 148) & 0x101) == 0x101) {  // byte offset, hence the cast
        needs_suffix = true;
        emit("(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (");
    }

    // Emit variable name (three cases)
    if (a1->byte_163 & 0x80)
        emit("this");                       // thread-local proxy
    else if (a1->byte_165 & 0x04)
        emit_anonymous_name(a1);            // compiler-generated name
    else
        gen_expression_or_name(a1, 7);      // regular name emission

    if (needs_suffix)
        emit(")))");
}

sub_4768F0 (gen_name_ref, 237 lines) -- handles qualified name references with :: scope resolution, template arguments, __super:: qualifier, and member access. The managed wrapping applies an additional gate: the kind argument must be 7 (entity is a variable; a3 in the decompilation) and the fourth parameter must be zero (v7 in the decompilation, named nested below), meaning no enclosing context already handles initialization.

// sub_4768F0 -- gen_name_ref, managed wrapping (lines 160-163, 231-236)
int gen_name_ref(context_t* ctx, entity_t* entity, uint8_t kind, int nested) {

    bool needs_suffix = false;

    if (!nested && kind == 7
        && (*(uint16_t*)((char*)entity + 148) & 0x101) == 0x101) {
        needs_suffix = true;
        emit("(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (");
    }

    // ... 200+ lines of qualified name emission ...
    // handles ::, template<>, __super::, member access paths

    if (needs_suffix) {
        emit(")))");
        return 1;
    }
    // ...
}

Host-Side Exemption in Cross-Space Checking

Managed variables receive a special exemption in the cross-space reference validation performed by record_symbol_reference_full (sub_72A650). When host code references a __device__ variable, the checker would normally emit error 3548. But managed variables are specifically exempted:

// Inlined in sub_72A650, cross-space variable reference check
if ((*(uint16_t*)(var_info + 148) & 0x0101) == 0x0101)
    return;  // managed variable -- host access is legal

This uses the same 0x0101 bitmask to detect managed variables. The exemption exists because managed variables are explicitly designed for host access -- that is their entire purpose. Without this exemption, every host-side __managed__ variable access would trigger a spurious "reference to device variable from host code" error.

Managed Variables and constexpr

The declaration processor sub_4DEC90 (variable_declaration) validates memory-space attributes against constexpr. The two dedicated errors target __constant__, not __managed__:

Error	Condition	Description
3568	`__constant__` + constexpr	`__constant__` combined with constexpr (prevents runtime initialization)
3566	`__constant__` + constexpr + auto	`__constant__` constexpr with auto deduction

Managed variables enter this code only through the space-name selection used when constructing error messages. The selection follows the same constant > managed > shared > device priority as the attribute handler, though without the empty-string fallback:

// sub_4DEC90, line ~357 -- selecting display name for error messages
const char* space_name = "__constant__";
if (!(space & 0x04)) {
    space_name = "__managed__";
    if (!(*(uint8_t*)(entity + 149) & 0x01)) {
        space_name = "__host__ __device__" + 9;  // pointer trick: "__device__"
        if (space & 0x02)
            space_name = "__shared__";
    }
}

The string "__device__" is obtained by taking "__host__ __device__" and advancing by 9 bytes, skipping the "__host__ " prefix. This reflects string tail sharing in the compiled binary: the toolchain that built cudafe++ stores the standalone "__device__" as a suffix of the combined literal rather than as a separate string.
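The cascade and the suffix trick can be checked with a few lines of standalone C. The function name space_display_name and its two-byte parameter form are assumptions for illustration; the bit masks and the priority order are taken from the pseudocode above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* space models byte +148, extended models byte +149 of the entity. */
static const char *space_display_name(uint8_t space, uint8_t extended) {
    static const char combined[] = "__host__ __device__";
    const char *name = "__constant__";
    if (!(space & 0x04)) {
        name = "__managed__";
        if (!(extended & 0x01)) {
            /* Skip the 9-byte "__host__ " prefix to reuse the tail of the literal. */
            name = combined + strlen("__host__ ");
            if (space & 0x02)
                name = "__shared__";
        }
    }
    return name;
}
```

A __constant__ bit wins over everything else, and a variable with neither constant, managed, nor shared set falls through to the "__device__" suffix.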

Error 3648: External Linkage Warning

The post-definition check in sub_4DC200 (mark_defined_variable) warns when a device-accessible variable has external linkage. This affects managed variables because they always have the __device__ bit set:

// sub_4DC200 -- mark_defined_variable
// Condition for warning 3648:
if ((entity->byte_148 & 3) == 1    // __device__ set AND __shared__ NOT set
    && !is_compiler_generated(entity)
    && (entity->byte_80 & 0x70) != 0x10)  // not anonymous
{
    warning(3648, entity->source_loc);
}

The bit test (byte_148 & 3) == 1 checks that bit 0 (__device__) is set and bit 1 (__shared__) is NOT set. This catches:

__device__ variables (0x01): yes, (0x01 & 3) == 1
__managed__ variables (0x01 at +148, 0x01 at +149): yes, (0x01 & 3) == 1
__device__ __constant__ (0x05): yes, (0x05 & 3) == 1
__shared__ (0x02): no, (0x02 & 3) == 2
__constant__ alone (0x04): no, (0x04 & 3) == 0

Managed variables therefore trigger this warning if they have external linkage and are not compiler-generated.
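The five cases listed above can be verified mechanically. The helper name device_not_shared is hypothetical, and the sketch covers only the bit test, not the compiler-generated or anonymous-entity conditions.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 0 (__device__) must be set and bit 1 (__shared__) must be clear. */
static int device_not_shared(uint8_t memory_space) {
    return (memory_space & 3) == 1;
}
```

The __managed__ bit lives in the neighbouring byte at +149, so a managed variable presents 0x01 here and is indistinguishable from a plain __device__ variable for the purposes of this check.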

Diagnostic Summary

Error	Phase	Condition	Message
3481	Attribute application	`__shared__` AND `__constant__` both set	Conflicting CUDA memory spaces
3482	Attribute application	`thread_local` storage duration	CUDA memory space on thread_local variable
3485	Attribute application	Local scope or reference type	CUDA memory space on local variable
3577	Attribute application	`__grid_constant__` + managed/shared	Memory space incompatible with `__grid_constant__`
3648	Post-definition	External linkage on device-accessible (non-shared) var	External linkage warning
(arch)	Declaration processing	`dword_126E4A8 < 30`	`__managed__` requires compute_30 or higher
(config)	Declaration processing	Unsupported OS/bitness	`__managed__` not supported for this configuration

Function Map

Address	Name	Lines	Role
`sub_40E0D0`	`apply_nv_managed_attr`	47	Attribute handler -- sets flags, validates
`sub_40EB80`	`apply_nv_device_attr`	100	Device handler (variable path is structurally identical)
`sub_413240`	`apply_one_attribute`	585	Dispatch -- routes kind `'f'` to `sub_40E0D0`
`sub_489000`	`process_file_scope_entities`	723	Emits managed RT boilerplate into `.int.c`
`sub_4768F0`	`gen_name_ref`	237	Access wrapper -- qualified name path
`sub_484940`	`gen_variable_name`	52	Access wrapper -- direct name path
`sub_4DEC90`	`variable_declaration`	1098	Declaration processing, constexpr/VLA checks
`sub_4DC200`	`mark_defined_variable`	26	External linkage warning (error 3648)
`sub_72A650`	`record_symbol_reference_full`	~400	Cross-space check with managed exemption
`sub_6BC890`	`nv_validate_cuda_attributes`	161	Post-declaration cross-attribute validation

Cross-References

Memory Spaces -- bitfield encoding at entity +148/+149, all four memory space handlers
Attribute System Overview -- dispatch table, attribute kind enum, application pipeline
__grid_constant__ -- error 3577 conflict with managed
Architecture Feature Gating -- compute_30 gate for __managed__
CUDA Runtime Boilerplate -- managed RT emission, lambda stubs, __cudaPushCallConfiguration
Cross-Space Validation -- managed exemption in host access checks
Entity Node Layout -- byte +148/+149 field definitions

Minor CUDA Attributes

cudafe++ defines several CUDA-specific attributes beyond the core execution-space, memory-space, and launch-configuration families. These attributes serve diverse purposes: optimization hints for the downstream compiler, parameter passing strategy selection, inline control that bridges the EDG front-end with cicc's code generation, and internal annotations for tile/cooperative infrastructure. Most are undocumented by NVIDIA. This page covers each in detail: what the attribute does, why it exists, how cudafe++ validates and stores it, and where the flags end up in the entity node.

Attribute Summary

Kind	Hex	ASCII	Display Name	Category	Handler / Flag
110	0x6E	`'n'`	`__nv_pure__`	Optimization	entity+183 (via IL propagation)
--	--	--	`__nv_register_params__`	ABI	`sub_40B0A0` (38 lines), entity+183 bit 3
--	--	--	`__forceinline__`	Inline control	entity+177 bit 4
--	--	--	`__noinline__`	Inline control	`sub_40F5F0` / `sub_40F6F0`, entity+179 bit 5, entity+180 bit 7
--	--	--	`__inline_hint__`	Inline control	entity+179 bit 4
89	0x59	`'Y'`	`__tile_global__`	Internal	(no handler observed)
95	0x5F	`'_'`	`__tile_builtin__`	Internal	(no handler observed)
94	0x5E	`'^'`	`__local_maxnreg__`	Launch config	`sub_411090` (67 lines)
108	0x6C	`'l'`	`__block_size__`	Launch config	`sub_4109E0` (265 lines)

Note: __nv_register_params__, __forceinline__, __noinline__, and __inline_hint__ do not have CUDA attribute kind codes. They are processed through different paths (EDG's standard attribute system, pragma-like registration at startup, or direct flag manipulation). Only __nv_pure__, __tile_global__, __tile_builtin__, __local_maxnreg__, and __block_size__ have dedicated CUDA kind bytes in the attribute_display_name switch table.

`__nv_pure__` (Kind 0x6E = `'n'`)

Purpose

__nv_pure__ marks a function as having no observable side effects: given the same inputs, it always returns the same result and does not modify any state visible to the caller. This is an optimization hint for cicc (the CUDA compiler backend). A pure function can be:

Common-subexpression eliminated (CSE): if f(x) appears twice in the same basic block, the second call can be replaced by the first call's result.
Hoisted out of loops: if f(x) is invariant across loop iterations, it can be computed once before the loop (LICM -- loop-invariant code motion).
Dead-code eliminated: if the result of f(x) is never used and the function has no side effects, the call can be removed entirely.
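To make the loop-hoisting case concrete, the sketch below writes out by hand what an optimizer is permitted to do once it knows a callee has no side effects. It is plain C for illustration, not cicc output, and scale, sum_naive, and sum_hoisted are hypothetical names.

```c
#include <assert.h>

/* No side effects and a result that depends only on the argument. */
static int scale(int k) {
    return k * 3 + 1;
}

/* As written by the programmer: scale(k) is evaluated on every iteration. */
static int sum_naive(const int *v, int n, int k) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += v[i] * scale(k);
    return s;
}

/* After loop-invariant code motion: the call is evaluated once before the loop. */
static int sum_hoisted(const int *v, int n, int k) {
    const int f = scale(k);
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += v[i] * f;
    return s;
}
```

The two functions return the same value for every input only because scale is pure. If it wrote to memory or read state that the loop modifies, the hoisted form would change program behavior, which is why the compiler needs the annotation before performing the rewrite on an opaque callee.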

This is semantically equivalent to GCC's __attribute__((pure)) and LLVM's readonly function attribute, but expressed through NVIDIA's internal attribute system rather than the standard GNU attribute path. The choice of a separate internal attribute rather than reusing the GNU pure attribute reflects cudafe++'s design of routing all CUDA-specific semantics through its own kind-byte dispatch, keeping the NVIDIA optimization pipeline cleanly separated from EDG's standard attribute handling.

Binary Encoding

In the attribute kind enum, __nv_pure__ has kind value 110 (0x6E, ASCII 'n'). This is the highest kind value in the CUDA attribute range, added later than the original dense block (86--95).

The attribute_display_name switch (sub_40A310) maps it:

case 'n': return "__nv_pure__";

Application Behavior

In the apply_one_attribute constraint checker (sub_413240), kind 'n' has the following entry:

case 'n':
    if (target_kind == 28)   // target is a namespace-level entity
        goto LABEL_21;       // -> pass through (no per-entity modification)
    goto LABEL_8;            // -> attribute doesn't apply to this target

The handler does not modify any entity node fields directly. Unlike __host__ or __device__, which set bitmask flags at entity+182, __nv_pure__ propagates through the attribute node list itself. The attribute node with kind 0x6E remains attached to the entity's attribute chain and is consumed later by the .int.c output generator (sub_5565E0 and related functions), which emits the __nv_pure__ attribute into the intermediate C output. In the IL code generator, kind 0x6E shares handling with __launch_bounds__ (0x5C):

case 0x5C:
case 0x6E:
    a2->kind_field = 25;    // IL node type for "function attribute"
    sub_540560(0, 0, a2, a4, ...);  // emit attribute to .int.c
    break;

cicc then reads the __nv_pure__ annotation from the .int.c output and applies the corresponding LLVM-level optimization attributes (readonly, willreturn, etc.) to the function in the NVVM IR.

Why It Exists

CUDA device code has optimization opportunities that GCC's pure does not capture. Device functions execute in a constrained environment (no system calls, no I/O, deterministic memory model), which makes purity easier to verify and more valuable to exploit. By providing __nv_pure__ as a separate internal attribute, NVIDIA can:

Gate it behind CUDA mode (it only appears in device compilation flows).
Attach it to internal runtime functions (__shfl_sync, math intrinsics, etc.) that NVIDIA knows are pure but that cannot carry GCC pure through the host compilation path.
Avoid interactions with EDG's GNU attribute conflict checking, which has its own rules for pure vs const vs noreturn.

String Evidence

The string table contains exactly one reference to __nv_pure__ at address 0x829848, and a diagnostic tag nv_pure at 0x88cc08. The low reference count confirms this is an internal optimization attribute not exposed to user code through documented CUDA APIs.

`__nv_register_params__` (Entity+183 bit 3)

Purpose

__nv_register_params__ tells cicc to pass a function's parameters in registers instead of through memory. The validation described below accepts the attribute only on pure __device__ functions, so it never applies to __global__ entry points themselves. For comparison, CUDA kernel parameters are loaded via ld.param instructions, which access a dedicated constant memory bank visible to the kernel launch mechanism. That path works well when parameter counts are large (the constant memory bank is 4 KB per kernel), but for small parameter counts, passing values directly in registers avoids the latency of the load path.

Register parameter passing eliminates the constant-bank load latency (typically 4--8 cycles on modern architectures) and removes potential bank conflicts when multiple warps read the same parameters. The trade-off is that it consumes registers from the limited register file, which can reduce occupancy if the kernel already uses many registers.

Requirements

The attribute has four validation checks, enforced across two separate locations:

Enablement flag (dword_106C028): a compiler internal flag that must be set. If not set, the handler emits error 3659 with the message "__nv_register_params__ support is not enabled". This flag is controlled by an internal nvcc option, not exposed to users.
Architecture check (implied by error string): the string "__nv_register_params__ is only supported for compute_80 or later architecture" exists in the binary at 0x88cb80. This check is performed outside the apply handler, in the post-validation or downstream pipeline.
Function type restriction (implied by error string): the string "__nv_register_params__ is not allowed on a %s function" at 0x88cbd0 shows that certain execution spaces are rejected, with %s filled by the offending space name. The post-validation in sub_6BC890 checks: if entity+183 & 0x08 is set (register_params flag) and either the execution space at entity+182 is __global__ (bit 6) or the function is not a pure __device__ function, it emits error 3661 with the relevant space name.
Ellipsis (variadic) check: the apply handler (sub_40B0A0) traverses the function's return type chain to reach the prototype, then checks prototype+16 & 0x01 (the variadic flag). If set, it emits error 3662 with the message "__nv_register_params__ is not allowed on a function with ellipsis". Variadic functions cannot use register parameter passing because the parameter count is not known at compile time.

Apply Handler: `sub_40B0A0` (38 lines)

// sub_40B0A0 -- apply_nv_register_params_attr (attribute.c:10537)
entity_t* apply_nv_register_params_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {
    assert(a3 == 11);  // functions only

    bool enabled = true;
    if (!dword_106C028) {       // enablement flag not set
        emit_error(7, 3659, a1->src_loc);  // "support is not enabled"
        enabled = false;
    }

    if (!a2) return a2;

    // Walk return type chain to get function prototype
    type_t* ret_type = a2->type_at_144;
    if (!ret_type) goto set_flag;

    while (ret_type->kind == 12)     // skip cv-qualifier wrappers
        ret_type = ret_type->next;   // +144

    // Check variadic flag
    if (ret_type->prototype->flags_16 & 0x01) {
        emit_error(7, 3662, a1->src_loc);  // "not allowed on variadic"
        return a2;
    }

set_flag:
    if (enabled)
        a2->byte_183 |= 0x08;  // set register_params bit
    return a2;
}

The flag is stored at entity+183 bit 3 (0x08), the same byte that holds the cluster_dims intent flag (bit 6, 0x40). These two flags coexist without conflict because they serve orthogonal purposes.
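The control flow of the apply handler can be exercised with mock types. Everything here is a hypothetical model: mock_proto, mock_type, mock_func, and the bitmask return value are inventions for the example, and the kind value 7 used for a non-wrapper type in the test is arbitrary. Only the skip-kind-12 loop, the variadic bit, and the 0x08 flag follow the pseudocode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t flags; } mock_proto;   /* bit 0 models the variadic flag */

typedef struct mock_type {
    uint8_t           kind;    /* 12 models the wrapper kind the handler skips */
    struct mock_type *next;    /* next type in the chain */
    mock_proto       *proto;   /* valid on the non-wrapper node */
} mock_type;

typedef struct { mock_type *type; uint8_t byte_183; } mock_func;

/* Returns a bitmask of diagnostics: 1 = 3659 (not enabled), 2 = 3662 (ellipsis). */
static unsigned mock_apply_register_params(mock_func *f, int feature_enabled) {
    unsigned diags = 0;
    if (!feature_enabled)
        diags |= 1u;                 /* reported, but processing continues */
    if (f == NULL)
        return diags;

    mock_type *t = f->type;
    if (t != NULL) {
        while (t->kind == 12)        /* walk past wrapper nodes */
            t = t->next;
        if (t->proto->flags & 0x01)
            return diags | 2u;       /* variadic: flag is never set */
    }
    if (feature_enabled)
        f->byte_183 |= 0x08;         /* register_params bit */
    return diags;
}
```

The model makes one property of the pseudocode easy to see: a disabled feature and a variadic prototype are independent checks, so a single declaration can raise both diagnostics.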

Post-Declaration Validation

In sub_6BC890 (nv_validate_cuda_attributes), if entity+183 & 0x08 is set:

if (entity->byte_183 & 0x08) {
    uint8_t es = entity->byte_182;
    if (es & 0x40) {                    // __global__ function
        emit_error(7, 3661, src, "__global__");
    } else if ((es & 0x30) != 0x20) {   // not pure __device__
        emit_error(7, 3661, src, "__host__");
    }
    // else: pure __device__ function -- register_params is valid
}

This means __nv_register_params__ is only valid on __device__ functions (not __global__, not __host__, not __host__ __device__). Kernel functions (__global__) have their own parameter passing ABI dictated by the CUDA runtime, and host functions use the host ABI.
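The execution-space test reduces to a small pure function. In the sketch below, register_params_conflict is a hypothetical name, and reading 0x10 as the host bit and 0x20 as the device bit is an assumption inferred from the (es & 0x30) != 0x20 comparison, not a field definition taken from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the space name reported with error 3661, or NULL when the use is valid. */
static const char *register_params_conflict(uint8_t byte_183, uint8_t exec_space) {
    if (!(byte_183 & 0x08))
        return NULL;                 /* attribute not present: nothing to check */
    if (exec_space & 0x40)
        return "__global__";         /* kernels keep the launch ABI */
    if ((exec_space & 0x30) != 0x20)
        return "__host__";           /* anything other than device-only */
    return NULL;                     /* pure __device__ function */
}
```

Note that a __host__ __device__ function (0x30 under the assumed encoding) is reported with the name "__host__", matching the single string the pseudocode passes for every non-device-only case.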

Registration at Startup

The function sub_6B5E50 (called during compiler initialization) registers __nv_register_params__ as a preprocessor macro expansion. It looks up the name via sub_734430, and if not found, creates a new macro definition node and registers it in the symbol table via sub_749600. The macro body is a 40-byte token sequence that, when expanded, produces the __attribute__((__nv_register_params__)) syntax that EDG's attribute parser can consume. This macro-based registration is why __nv_register_params__ does not have a CUDA kind byte -- it enters the attribute system through the standard GNU __attribute__ path, not through the CUDA attribute descriptor table.

The same startup function also registers __noinline__ through a similar mechanism and, conditionally, _Pragma (when dword_106C0E0 is set).

Inline Control Attributes

cudafe++ provides three inline control attributes that interact with EDG's inline heuristic system. These attributes do not have CUDA kind bytes; they are processed through EDG's standard attribute infrastructure and NVIDIA's own flag-setting paths.

Entity Node Fields

entity+177  cuda_flags (byte):
    bit 4 (0x10) = __forceinline__

entity+179  more_cuda_flags (byte):
    bit 4 (0x10) = __inline_hint__
    bit 5 (0x20) = __noinline__ (EDG internal noinline)

entity+180  function_attrs (byte):
    bit 7 (0x80) = __noinline__ (GNU attribute form)

`__forceinline__`

__forceinline__ requests that the compiler always inline the function, overriding cost-based heuristics. It is stored at entity+177 bit 4 (0x10). This bit is checked during cross-execution-space call validation (sub_505720): a __forceinline__ function is treated as implicitly host-device, meaning it suppresses cross-space call errors. The logic in the cross-space checker:

if (entity->byte_177 & 0x10)    // __forceinline__
    // treat as implicitly __host__ __device__

This relaxation exists because __forceinline__ functions are expected to be inlined at the call site, so their execution space becomes the caller's execution space. There is no separate call to resolve, hence no cross-space violation.

In the .int.c output, __forceinline__ is emitted so that cicc can apply it during NVVM IR generation. cicc translates it to LLVM's alwaysinline attribute.

`__noinline__`

__noinline__ prevents the compiler from inlining a function, regardless of heuristics. It has two separate handlers because it can arrive through two syntactic paths:

Path 1: EDG internal form (sub_40F5F0, 51 lines)

This handler is invoked when __noinline__ is recognized as a CUDA-specific attribute (source_mode 3 or with the scoped-attribute bit set). It sets entity+179 |= 0x20. In C mode (dword_126EFB4 == 2), it additionally creates an ABI annotation node by calling sub_5E5130 and linking it to the function's prototype exception-spec chain at prototype+56. This ABI node carries flags 0x19 and signals to the code generator that the noinline directive should be preserved across compilation boundaries.

// sub_40F5F0 -- apply_noinline_attr (EDG internal path)
if (target_kind == 11) {  // function
    if (attr->kind) {
        entity->byte_179 |= 0x20;         // noinline flag
        if (attr->source_mode == 3 && dword_126EFB4 == 2) {
            // Create ABI annotation for C mode
            extract_func_type(entity+144, &ft_out);
            if (!ft_out->prototype->abi_info) {
                abi_node_t* n = alloc_abi_node();
                *n |= 0x19;
                ft_out->prototype->abi_info = n;
            }
        }
    }
    return entity;
}
// else: emit error 1835 (wrong target) or 2470 (alignas context)

Path 2: GNU attribute form (sub_40F6F0, 37 lines)

This handler is invoked when __noinline__ arrives through the __attribute__((__noinline__)) GNU attribute path. It sets a different bit: entity+180 |= 0x80. This separation allows the compiler to distinguish between the CUDA-specific noinline directive and the GNU portable one, although in practice both prevent inlining.

Additionally, the handler calls sub_5CEE70(28, entity->attr_chain) to record the noinline directive for device-side compilation when all four of the following conditions hold: byte+176 bit 7 is set (static member), the attribute's source_mode is 2 or its flags have bit 4 (0x10) set (GNU/Clang origin), byte+81 bit 2 is set (local), and byte+187 bit 0 is clear.

// sub_40F6F0 -- apply_noinline_attr (GNU form)
if (target_kind == 11) {
    entity->byte_180 |= 0x80;
    if ((signed char)entity->byte_176 < 0
        && (attr->source_mode == 2 || (attr->flags & 0x10))
        && (entity->byte_81 & 0x04)
        && !(entity->byte_187 & 0x01)) {
        sub_5CEE70(28, entity->attr_chain);
    }
} else {
    // emit error 1835/2470 with appropriate severity
}

`__inline_hint__`

__inline_hint__ is an internal NVIDIA attribute that provides a non-binding suggestion to the compiler's inlining heuristics. Unlike __forceinline__, which mandates inlining, __inline_hint__ merely biases the cost model in favor of inlining. It is stored at entity+179 bit 4 (0x10).

The attribute is registered through the same startup mechanism as __nv_register_params__ in sub_6B5E50, and its handler apply_nv_inline_hint_attr (referenced at address 0x40A999 within sub_40A8A0) sets the flag. The diagnostic tag nv_inline_hint exists at 0x82bf2f in the string table, suggesting diagnostic messages exist for conflicts.

Mutual Exclusion

__forceinline__ and __noinline__ are mutually exclusive. The diagnostic system includes 2 messages for inline hint conflicts (identified in the W053 error report). When both are applied to the same function, the compiler emits a diagnostic. However, __inline_hint__ can coexist with either, as it is merely a suggestion that the other directives override.

The mutual exclusion is enforced through the constraint checker in apply_one_attribute (sub_413240) and through post-validation checks. The constraint string for the 'r' (routine/function) constraint class includes property codes m (for member/constexpr) and v (for virtual), with + and - qualifiers controlling whether the attribute is allowed or forbidden. Error codes 1835--1843 and 1858--1871 cover the various conflict scenarios.
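In terms of the stored bits, the conflict amounts to the predicate below. This is only a sketch of the end state: the binary enforces the rule during attribute application rather than by inspecting the entity afterwards, and the struct and function names here are hypothetical.

```cpp
#include <cstdint>

// Hypothetical view of the three entity bytes that carry inline-control bits.
struct InlineBytes {
    std::uint8_t byte_177;  // bit 4 (0x10): __forceinline__
    std::uint8_t byte_179;  // bit 4 (0x10): __inline_hint__, bit 5 (0x20): __noinline__ (EDG path)
    std::uint8_t byte_180;  // bit 7 (0x80): __noinline__ (GNU path)
};

inline bool has_forceinline(const InlineBytes& e) { return (e.byte_177 & 0x10) != 0; }
inline bool has_inline_hint(const InlineBytes& e) { return (e.byte_179 & 0x10) != 0; }

// Either noinline path counts.
inline bool has_noinline(const InlineBytes& e) {
    return (e.byte_179 & 0x20) != 0 || (e.byte_180 & 0x80) != 0;
}

// True when the mutually exclusive pair is present; __inline_hint__ never contributes.
inline bool inline_conflict(const InlineBytes& e) {
    return has_forceinline(e) && has_noinline(e);
}
```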

IL Output

In the .int.c output, the inline control attributes are emitted as standard GNU __attribute__ annotations:

// emitted for __noinline__:
__attribute__((noinline))

// emitted for __forceinline__:
__attribute__((always_inline))

cicc reads these and translates them to LLVM's noinline and alwaysinline function attributes respectively.

`__tile_global__` (Kind 0x59 = `'Y'`)

Purpose

__tile_global__ is an internal execution-space attribute that appears in the attribute_display_name switch table but has no user-facing documentation. Its kind value (89, 'Y') places it in the original dense block of CUDA attributes between __global__ (88, 'X') and __shared__ (90, 'Z').

The name strongly suggests this attribute is related to NVIDIA's tile-based cooperative group infrastructure or the Tensor Memory Accelerator (TMA) programming model, where "tile global" would denote a function that operates on a tile of global memory. In the cooperative groups model, tiled partitions allow threads to cooperatively access contiguous memory regions, and a __tile_global__ function might be the kernel entry point for such a tiled execution pattern.

Binary Evidence

The attribute is defined in the kind enum (the attribute_display_name switch case), but no handler function has been identified in the binary. In the apply_one_attribute dispatcher (sub_413240), there is no case for kind 'Y'. This means:

The attribute can be parsed and stored in an attribute node.
It has a display name for diagnostics.
It does not modify entity node fields through the standard apply pipeline.

This is consistent with the attribute being consumed downstream by cicc or another tool in the compilation pipeline, rather than requiring cudafe++ to perform validation beyond basic parsing. Alternatively, it may be a reserved placeholder for future functionality.

`__tile_builtin__` (Kind 0x5F = `'_'`)

Purpose

__tile_builtin__ is another internal attribute in the CUDA kind enum, with kind value 95 (0x5F, ASCII '_'). Its kind value is the last in the original dense block (86--95).

The name suggests this attribute marks functions that are tile-level builtins -- compiler intrinsics that implement tile-based operations. These would be functions like cooperative_groups::tiled_partition::shfl(), cooperative_groups::tiled_partition::ballot(), or TMA copy intrinsics, which are compiled by cudafe++ as ordinary function calls but need special handling by cicc for efficient code generation.

Binary Evidence

Like __tile_global__, __tile_builtin__ has no handler in the apply_one_attribute dispatcher. It appears only in the attribute_display_name switch table. The attribute node with kind 0x5F passes through cudafe++ without entity node modification and is consumed by the downstream compiler.

The pairing of __tile_global__ (Y) and __tile_builtin__ (_) suggests a two-part infrastructure:

__tile_global__ marks kernel-level entry points for tiled execution.
__tile_builtin__ marks the intrinsic operations available within that tiled execution context.

`__local_maxnreg__` (Kind 0x5E = `'^'`)

Purpose

__local_maxnreg__ sets a per-function register limit, as opposed to __maxnreg__ which is per-kernel. The distinction matters for __device__ helper functions called from kernels: __maxnreg__ can only be applied to __global__ functions, but __local_maxnreg__ can be applied to any device function. This allows fine-grained register pressure tuning at the function level without requiring the entire kernel to be constrained.

When cicc compiles a __device__ function with __local_maxnreg__, it sets the target register limit for that specific function during register allocation, potentially spilling more aggressively to local memory. The surrounding kernel can use a different register budget.

Apply Handler: `sub_411090` (67 lines)

The handler is structurally identical to sub_410F70 (__maxnreg__), differing only in the offset within the launch config struct where it stores the value:

// sub_411090 -- apply_nv_local_maxnreg_attr
// arg: the attribute's constant-expression argument (extraction elided)
entity_t* apply_nv_local_maxnreg_attr(attr_node_t* attr, entity_t* entity, ...) {
    // Allocate launch config struct if needed
    if (!entity->launch_config)
        entity->launch_config = allocate_launch_config();  // sub_5E52F0

    // Skip if template-dependent argument
    if (is_dependent_type(arg))
        return entity;

    // Validate: must be positive
    if (const_expr_sign_compare(arg, 0) <= 0) {   // sub_461980
        emit_error(7, 3786, attr->src_loc);        // non-positive value
        return entity;
    }

    // Validate: must fit in int32
    int64_t val = const_expr_get_value(arg);        // sub_461640
    if (val > INT32_MAX) {
        emit_error(7, 3787, attr->src_loc);         // value too large
        return entity;
    }

    entity->launch_config->local_maxnreg = (int32_t)val;  // offset +36
    return entity;
}

Post-Validation Difference from `__maxnreg__`

In sub_6BC890, __maxnreg__ (stored at launch_config+32) is validated to require __global__ (error 3715: "__maxnreg__ is only valid on __global__ functions"). __local_maxnreg__ has no such check in post-validation. This is intentional: it is designed to work on __device__ functions as well. The post-validation function only checks the maxnreg field (offset +32) for the __global__ requirement; the local_maxnreg field (offset +36) is left unchecked.
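The asymmetry can be sketched as follows. The struct is a hypothetical two-field excerpt of launch_config_t, and treating "set" as "greater than zero" is an assumption made for illustration; only the offsets, the 0x40 __global__ bit, and error 3715 come from the analysis above.

```cpp
#include <cstdint>

// Hypothetical excerpt of launch_config_t; only the two offsets are from the text.
struct LaunchConfigRegs {
    std::int32_t maxnreg;        // launch_config+32
    std::int32_t local_maxnreg;  // launch_config+36
};

// Returns 3715 when __maxnreg__ is set on a non-__global__ function, else 0.
// local_maxnreg is deliberately never inspected.
inline int post_validate_regs(const LaunchConfigRegs& lc, std::uint8_t exec_space) {
    const bool is_global = (exec_space & 0x40) != 0;
    if (lc.maxnreg > 0 && !is_global)  // assumption: "set" means > 0
        return 3715;
    return 0;
}
```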

Diagnostics

Error	Message	Condition
3786	Non-positive `__local_maxnreg__` value	`const_expr_sign_compare(arg, 0) <= 0`
3787	`__local_maxnreg__` value too large	Value exceeds int32 range
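The two range checks reduce to a few lines once the constant has been evaluated. A minimal sketch, assuming the argument is already available as a 64-bit integer; the function name is hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Returns 0 on success, otherwise the error number from the table above.
inline int validate_local_maxnreg(std::int64_t value) {
    if (value <= 0)
        return 3786;  // non-positive
    if (value > std::numeric_limits<std::int32_t>::max())
        return 3787;  // does not fit in int32
    return 0;
}
```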

`__block_size__` (Kind 0x6C = `'l'`)

Purpose

__block_size__ specifies the thread block dimensions (and optionally cluster dimensions) for a kernel at compile time. Unlike __launch_bounds__, which provides hints for the compiler's register allocator, __block_size__ declares the actual block geometry. This enables the compiler to optimize based on known block dimensions: unrolling loops by the block dimension, computing shared memory bank conflict patterns at compile time, and statically determining the number of warps.

Apply Handler: `sub_4109E0` (265 lines)

This is the largest of the launch config attribute handlers. It accepts up to 6 arguments: three block dimensions (x, y, z) and three cluster dimensions (x, y, z).

// sub_4109E0 -- apply_nv_block_size_attr (simplified)
entity_t* apply_nv_block_size_attr(attr_node_t* attr, entity_t* entity, ...) {
    // Allocate launch config struct if needed
    if (!entity->launch_config)
        entity->launch_config = allocate_launch_config();

    launch_config_t* lc = entity->launch_config;

    // Parse block dimensions (arguments 1-3)
    // Each: validate positive, validate fits in int32
    for (int i = 0; i < 3 && arg_exists; i++) {
        if (const_expr_sign_compare(arg, 0) <= 0)
            emit_error(7, 3788, attr->src_loc);   // non-positive
        else {
            int64_t val = const_expr_get_value(arg);
            if (val > INT32_MAX)
                emit_error(7, 3789, attr->src_loc);  // too large
            else
                lc->block_size[i] = (int32_t)val;  // +40, +44, +48
        }
    }

    // Parse optional cluster dimensions (arguments 4-6)
    if (cluster_args_present) {
        // Check for conflict with prior __cluster_dims__
        if (lc->flags & 0x01)
            emit_error(7, 3791, attr->src_loc);  // conflict, but no early return

        for (int i = 0; i < 3 && arg_exists; i++) {
            // same positive/range validation
            lc->cluster_dim[i] = (int32_t)val;  // +20, +24, +28
        }
    } else if (!(lc->flags & 0x01)) {
        // Default cluster dims to (1,1,1) when no cluster args
        // and no prior __cluster_dims__
        lc->cluster_dim[0] = 1;
        lc->cluster_dim[1] = 1;
        lc->cluster_dim[2] = 1;
    }

    lc->flags |= 0x02;   // mark block_size_set
    return entity;
}

Conflict with `__cluster_dims__`

__block_size__ and __cluster_dims__ have a bidirectional conflict. Each handler checks the other's flag:

__block_size__ checks flags & 0x01 (cluster_dims_set) before writing cluster dims: error 3791.
__cluster_dims__ checks flags & 0x02 (block_size_set) before writing cluster dims: error 3791.

However, neither handler returns early on this conflict. Both continue to set their respective flag bits, so after conflict the flags byte can be 0x03 (both bits set). The error diagnostic is emitted but the compilation continues.
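The end state after a conflict can be reproduced with a toy model of the flags byte. This sketch keeps only the flag logic described above; the struct, the error counter, and the function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical model of the launch_config flags byte plus an error counter.
struct ConflictModel {
    std::uint8_t flags = 0;  // 0x01 = cluster_dims_set, 0x02 = block_size_set
    int errors_3791 = 0;
};

// __block_size__ with cluster arguments: diagnose, then set the flag anyway.
inline void apply_block_size_with_cluster(ConflictModel& m) {
    if (m.flags & 0x01) ++m.errors_3791;
    m.flags |= 0x02;
}

// __cluster_dims__: the mirror-image check.
inline void apply_cluster_dims(ConflictModel& m) {
    if (m.flags & 0x02) ++m.errors_3791;
    m.flags |= 0x01;
}
```

Applying the two handlers in either order leaves flags at 0x03 with exactly one 3791 diagnostic recorded.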

Diagnostics

Error	Message	Condition
3788	Non-positive `__block_size__` dimension	`const_expr_sign_compare(arg, 0) <= 0`
3789	`__block_size__` dimension too large	Value exceeds int32 range
3791	Conflicting `__cluster_dims__` and `__block_size__`	Both attributes applied to same entity

Global State and Registration

Startup Registration (`sub_6B5E50`)

The function sub_6B5E50 runs during compiler initialization and registers three names as preprocessor macro definitions:

__nv_register_params__: looked up via sub_734430; if not found, creates a new macro via sub_749600 and associates it with a 40-byte token sequence. The token body encodes the magic value 8961 (0x2301) as a prefix, followed by attribute argument tokens. If the symbol already exists (the macro was predefined), it appends the token body to the existing definition's expansion via sub_6AC190.
__noinline__: registered with the same mechanism. The token body contains the string "oinline))" as a suffix (the decompiled code shows strcpy((char*)(v11+20), "oinline))");), which reconstructs the full __attribute__((__noinline__)) expansion.
_Pragma: conditionally registered if dword_106C0E0 is set. The _Pragma macro registration enables MSVC-compatible pragma handling in certain compilation modes.

Additionally, if Clang compatibility mode is active (dword_126EFA4 set, qword_126EF90 > 0x2BF1F, i.e. a version number above 179999, which corresponds to Clang >= 18.0, and specific extension flags are enabled), the function registers Arm SME attribute macros (__arm_in, __arm_inout, __arm_out, __arm_preserves, __arm_streaming, __arm_streaming_compatible).
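The create-or-append behaviour described for the macro registrations can be sketched with an ordinary map. This is an illustration only: MacroTable and register_or_append are hypothetical stand-ins, macro bodies are modelled as plain strings rather than token sequences, and the real lookup, creation, and append steps go through sub_734430, sub_749600, and sub_6AC190.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the macro symbol table.
using MacroTable = std::unordered_map<std::string, std::string>;

// Creates the macro if absent, otherwise appends the body to the existing
// expansion. Returns true when a new definition was created.
inline bool register_or_append(MacroTable& table, const std::string& name,
                               const std::string& body) {
    auto it = table.find(name);
    if (it == table.end()) {
        table.emplace(name, body);
        return true;
    }
    it->second += body;
    return false;
}
```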

Entity Node Field Summary

entity+177  bit 4 (0x10): __forceinline__
entity+179  bit 4 (0x10): __inline_hint__
entity+179  bit 5 (0x20): __noinline__ (EDG path)
entity+180  bit 7 (0x80): __noinline__ (GNU path)
entity+181  bit 5 (0x20): __forceinline__ relaxation flag
entity+182  [byte]:       execution space (see overview)
entity+183  bit 3 (0x08): __nv_register_params__
entity+183  bit 6 (0x40): __cluster_dims__ intent
entity+256  [pointer]:    launch_config_t* (for __local_maxnreg__, __block_size__)

Function Map

Address	Size	Identity	Source
`sub_40A310`	83 lines	`attribute_display_name`	`attribute.c:1307`
`sub_40A8A0`	23 lines	`apply_nv_inline_hint_attr` (contains)	`attribute.c`
`sub_40B0A0`	38 lines	`apply_nv_register_params_attr`	`attribute.c:10537`
`sub_40F5F0`	51 lines	`apply_noinline_attr` (EDG path)	`attribute.c`
`sub_40F6F0`	37 lines	`apply_noinline_attr` (GNU path)	`attribute.c`
`sub_40F7B0`	61 lines	`apply_noinline_scoped_attr`	`attribute.c`
`sub_4109E0`	265 lines	`apply_nv_block_size_attr`	`attribute.c`
`sub_411090`	67 lines	`apply_nv_local_maxnreg_attr`	`attribute.c`
`sub_413240`	585 lines	`apply_one_attribute` (dispatch)	`attribute.c`
`sub_6B5E50`	160 lines	Startup registration	`nv_transforms.c` adjacent
`sub_6BC890`	160 lines	`nv_validate_cuda_attributes`	`nv_transforms.c`

Diagnostic Tag Index

Error	Diagnostic Tag	Attribute
3659	`register_params_not_enabled`	`__nv_register_params__`
3661	`register_params_unsupported_function`	`__nv_register_params__`
3662	`register_params_ellipsis_function`	`__nv_register_params__`
--	`register_params_unsupported_arch`	`__nv_register_params__`
3786	`local_maxnreg_negative`	`__local_maxnreg__`
3787	`local_maxnreg_too_large`	`__local_maxnreg__`
3788	`block_size_must_be_positive`	`__block_size__`
3789	(block_size dimension overflow)	`__block_size__`
3791	`conflict_between_cluster_dim_and_block_size`	`__block_size__` / `__cluster_dims__`
1835	(attribute on wrong target)	`__noinline__`
2470	(attribute in alignas context)	`__noinline__`

Cross-References

Attribute System Overview -- kind enum, descriptor table, application pipeline
Launch Configuration Attributes -- shared launch_config_t struct, __launch_bounds__, __maxnreg__, __cluster_dims__
__global__ Function Constraints -- post-validation checks in sub_6BC890
Entity Node Layout -- entity+177, +179, +180, +182, +183 field definitions
Cross-Space Validation -- __forceinline__ relaxation in cross-space calling
Architecture Feature Gating -- __nv_register_params__ compute_80 requirement

Extended Lambda Overview

Extended lambdas are the most complex NVIDIA addition to the EDG frontend. Standard C++ lambdas produce closure classes with host linkage only -- they cannot appear in __global__ kernel launches or __device__ function calls because the closure type has no device-side instantiation. The --extended-lambda flag (dword_106BF38) enables a transformation pipeline that wraps each annotated lambda in a device-visible template struct, making the closure class callable across the host/device boundary.

Two wrapper types exist. __nv_dl_wrapper_t handles device-only lambdas (annotated __device__). __nv_hdl_wrapper_t handles host-device lambdas (annotated __host__ __device__). The wrappers are parameterized template structs that store captured variables as typed fields, providing the device compiler with a concrete, instantiatable type for each lambda's captures. The wrapper templates do not exist in any header file -- they are synthesized as raw C++ text and injected into the compilation stream by the backend code generator.

Key Facts

Property	Value
Enable flag	`dword_106BF38` (`--extended-lambda` / `--expt-extended-lambda`)
Source files	`class_decl.c` (scan), `nv_transforms.c` (emit), `cp_gen_be.c` (gen)
Device wrapper type	`__nv_dl_wrapper_t<Tag, CapturedVarTypePack...>`
Host-device wrapper type	`__nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, NeverThrows, Tag, OpFunc, CapturedVarTypePack...>`
Device bitmap	`unk_1286980` (128 bytes, 1024 bits)
Host-device bitmap	`unk_1286900` (128 bytes, 1024 bits)
Max captures supported	1023 per wrapper type (bitmap bits 0-1023, one bit per capture count)
`lambda_info` allocator	`sub_5E92A0`
Preamble injection marker	Type named `__nv_lambda_preheader_injection`

End-to-End Flow

The extended lambda system spans the entire cudafe++ pipeline -- from parsing through backend emission. Five major functions form the chain:

  FRONTEND (class_decl.c)              BACKEND (cp_gen_be.c + nv_transforms.c)
  ========================             ========================================

  sub_447930 scan_lambda               sub_47ECC0 gen_template (dispatcher)
       |                                    |
       +-- detect annotations               +-- sees __nv_lambda_preheader_injection
       |   (bits at lambda+25)              |
       +-- validate constraints             +-- sub_4864F0 gen_type_decl
       |   (35+ error codes)                |       triggers preamble emission
       |                                    |
       +-- record capture count             +-- sub_6BCC20 nv_emit_lambda_preamble
           in bitmap                        |       emits ALL __nv_* templates
                                            |
                                            +-- sub_47B890 gen_lambda
                                                    emits per-lambda wrapper call

Stage 1: scan_lambda (`sub_447930`, 2113 lines)

The frontend entry point for all lambda expressions. Called from the expression parser when it encounters [. For extended lambdas, this function performs three critical operations:

Execution space detection -- Walks up the scope stack looking for scope_kind == 17 (function body). Reads execution space byte at offset +182: bit 4 = __host__, bit 5 = __device__ (matching the (es & 0x30) == 0x20 pure-__device__ test in sub_6BC890). Sets can_be_host and can_be_device flags.
Annotation processing -- Parses the __nv_parent specifier (NVIDIA extension for closure-to-parent linkage) and __host__/__device__ attribute annotations on the lambda expression itself. Sets decision bits at lambda_info + 25.
Validation -- When dword_106BF38 is set, validates that the lambda's execution space is compatible with its enclosing context. Emits errors 3592-3634 and 3689-3690 for violations. Records the capture count in the appropriate bitmap via sub_6BCBF0.

Stage 2: Annotation Detection (Decision Bits)

The scan_lambda function sets bits at lambda_info + 25 that control all downstream behavior:

Bit	Mask	Meaning	Set when
bit 3	`0x08`	Device lambda wrapper needed	Lambda has `__device__` annotation
bit 4	`0x10`	Host-device lambda wrapper needed	Lambda has `__host__ __device__`
bit 5	`0x20`	Has `__nv_parent`	`__nv_parent` pragma parsed in capture list

Additional flags at lambda_info + 24:

Bit	Mask	Meaning
bit 4	`0x10`	Capture-default is `=`
bit 5	`0x20`	Capture-default is `&`

And at lambda_info + 25 lower bits:

Bit	Mask	Meaning
bit 0	`0x01`	Is generic lambda
bit 1	`0x02`	Has `__host__` execution space
bit 2	`0x04`	Has `__device__` execution space

Stage 3: Preamble Trigger (`sub_4864F0`, gen_type_decl)

During backend code generation, sub_47ECC0 (the master source sequence dispatcher) encounters a type declaration whose name matches __nv_lambda_preheader_injection. This sentinel type is never used by user code -- it exists solely as a trigger. When matched:

The backend emits #line 1 "nvcc_internal_extended_lambda_implementation".
It calls sub_6BCC20 (nv_emit_lambda_preamble) to inject the entire __nv_* template library.
It wraps the trigger type in #if 0 / #endif so it never reaches the host compiler.
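The three steps can be sketched as a single function writing to a string buffer. This is a simplified illustration: the binary emits through a void(*)(const char*) callback, the preamble body here is a placeholder comment, and handle_type_decl and emit_preamble_stub are hypothetical names.

```cpp
#include <string>

// Placeholder for sub_6BCC20; the real function emits the whole template library.
inline void emit_preamble_stub(std::string& out) {
    out += "/* __nv_* template library goes here */\n";
}

// Performs the three steps in the listed order when the sentinel name matches.
// Returns true if the declaration was consumed as the trigger.
inline bool handle_type_decl(const std::string& type_name,
                             const std::string& decl_text, std::string& out) {
    if (type_name != "__nv_lambda_preheader_injection")
        return false;
    out += "#line 1 \"nvcc_internal_extended_lambda_implementation\"\n";
    emit_preamble_stub(out);
    out += "#if 0\n" + decl_text + "\n#endif\n";  // hide the sentinel from the host compiler
    return true;
}
```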

Stage 4: Preamble Emission (`sub_6BCC20`, 244 lines)

This is the single point where all CUDA lambda support templates enter the compilation. It takes a void(*emit)(const char*) callback and emits raw C++ source text. The exact emission order, verified against the decompiled binary, is:

__NV_LAMBDA_WRAPPER_HELPER macro, __nvdl_remove_ref (with T&, T&&, T(&)(Args...) specializations), and __nvdl_remove_const trait helpers
__nv_dl_tag template (device lambda tag type)
Array capture helpers via sub_6BC290 (__nv_lambda_array_wrapper primary + dimension 2-8 specializations, __nv_lambda_field_type primary + array/const-array specializations)
Primary __nv_dl_wrapper_t with static_assert + zero-capture __nv_dl_wrapper_t<Tag> specialization (emitted as a single string literal)
__nv_dl_trailing_return_tag definition + its zero-capture wrapper specialization with __builtin_unreachable() body (emitted as two consecutive string literals)
Device bitmap scan -- iterates unk_1286980 (1024 bits). For each set bit N > 0, calls sub_6BB790(N, emit) to generate two __nv_dl_wrapper_t specializations (standard tag + trailing-return tag) for N captures
__nv_hdl_helper class (anonymous namespace, with fp_copier, fp_deleter, fp_caller, fp_noobject_caller static members + out-of-line definitions)
Primary __nv_hdl_wrapper_t with static_assert
Host-device bitmap scan -- iterates unk_1286900 (1024 bits). For each set bit N (including 0), emits four wrapper specializations per N: sub_6BBB10(0, N) (non-mutable, HasFuncPtrConv=false), sub_6BBEE0(0, N) (mutable, HasFuncPtrConv=false), sub_6BBB10(1, N) (non-mutable, HasFuncPtrConv=true), sub_6BBEE0(1, N) (mutable, HasFuncPtrConv=true)
__nv_hdl_helper_trait_outer with const and non-const operator() specializations, plus conditionally (when dword_126E270 is set for C++17 noexcept-in-type-system) const noexcept and non-const noexcept specializations -- all inside the same struct, closed by \n};
__nv_hdl_create_wrapper_t factory
Type trait helpers: __nv_lambda_trait_remove_const, __nv_lambda_trait_remove_volatile, __nv_lambda_trait_remove_cv (composed from the first two)
__nv_extended_device_lambda_trait_helper + #define __nv_is_extended_device_lambda_closure_type(X) (emitted together in one string)
__nv_lambda_trait_remove_dl_wrapper (unwraps device lambda wrapper to get inner tag)
__nv_extended_device_lambda_with_trailing_return_trait_helper + #define __nv_is_extended_device_lambda_with_preserved_return_type(X) (emitted together)
__nv_extended_host_device_lambda_trait_helper + #define __nv_is_extended_host_device_lambda_closure_type(X) (emitted together)

Note: each SFINAE trait and its corresponding detection macro are emitted as a single a1() call in the decompiled code, not as separate steps. The device bitmap scan skips bit 0 (zero-capture handled by step 4's specialization), but the host-device bitmap scan processes bit 0 (zero-capture host-device wrappers require distinct HasFuncPtrConv specializations).
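The host-device scan (the step that iterates unk_1286900) can be sketched as a planner that lists which specializations would be produced, in order. HdlSpec and plan_hdl_specializations are hypothetical names; the per-N call order follows the list above.

```cpp
#include <cstdint>
#include <vector>

// One emitted __nv_hdl_wrapper_t specialization (hypothetical record type).
struct HdlSpec {
    bool is_mutable;
    bool has_func_ptr_conv;
    unsigned captures;
};

// Walks a 1024-bit bitmap (16 x uint64) and lists the four specializations
// produced for every set bit N, including N == 0.
inline std::vector<HdlSpec> plan_hdl_specializations(const std::uint64_t (&bitmap)[16]) {
    std::vector<HdlSpec> plan;
    for (unsigned n = 0; n < 1024; ++n) {
        if (!((bitmap[n >> 6] >> (n & 63)) & 1))
            continue;
        plan.push_back({false, false, n});  // sub_6BBB10(0, N)
        plan.push_back({true,  false, n});  // sub_6BBEE0(0, N)
        plan.push_back({false, true,  n});  // sub_6BBB10(1, N)
        plan.push_back({true,  true,  n});  // sub_6BBEE0(1, N)
    }
    return plan;
}
```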

Stage 5: Per-Lambda Wrapper Emission (`sub_47B890`, gen_lambda, 336 lines)

For each lambda expression in the translation unit, the backend emits the wrapper call. The decision depends on the bits at lambda_info + 25:

Device lambda (bit 3 set, byte[25] & 0x08):

__nv_dl_wrapper_t< /* closure type tag */ >(/* captured values */)

The original lambda body is wrapped in #if 0 / #endif so it is invisible to the host compiler. The device compiler sees the wrapper struct which provides the captured values as typed fields.

Host-device lambda (bit 4 set, byte[25] & 0x10):

__nv_hdl_create_wrapper_t<IsMutable, HasFuncPtrConv, Tag, CaptureTypes...>
    ::__nv_hdl_create_wrapper( /* lambda expression */, capture_args... )

The lambda expression is emitted inline as the first argument (binds to Lambda &&lam in the factory). The factory internally calls std::move(lam) when heap-allocating. Unlike the device lambda path, the original lambda body is NOT wrapped in #if 0 -- it must be visible to both host and device compilers.

Neither bit set (plain lambda or (byte[25] & 0x06) == 0x02):

Standard lambda emission with no wrapping. If (byte[25] & 0x06) == 0x02, emits an empty body placeholder { } with the real body in #if 0 / #endif.
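The three-way decision can be sketched as a classifier over byte 25. The order in which the two wrapper bits are tested is an assumption (the analysis does not state a priority), and the enum and function names are hypothetical.

```cpp
#include <cstdint>

enum class LambdaEmission { DeviceWrapper, HostDeviceWrapper, PlainEmptyBody, Plain };

// Assumption: the device-wrapper bit is tested before the host-device bit.
inline LambdaEmission classify_lambda(std::uint8_t byte25) {
    if (byte25 & 0x08) return LambdaEmission::DeviceWrapper;
    if (byte25 & 0x10) return LambdaEmission::HostDeviceWrapper;
    if ((byte25 & 0x06) == 0x02) return LambdaEmission::PlainEmptyBody;
    return LambdaEmission::Plain;
}
```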

Bitmap System

Rather than generating all 1024 possible capture-count specializations for each wrapper type, cudafe++ tracks which capture counts were actually used during frontend parsing. This is a critical compile-time optimization.

Bitmap Layout

unk_1286980 (device lambda bitmap):
  128 bytes = 16 x uint64 = 1024 bits
  Bit N set  =>  __nv_dl_wrapper_t specialization for N captures is needed

unk_1286900 (host-device lambda bitmap):
  128 bytes = 16 x uint64 = 1024 bits
  Bit N set  =>  __nv_hdl_wrapper_t specializations for N captures are needed

Bitmap Operations

Function	Address	Operation
`nv_reset_capture_bitmasks`	`sub_6BCBC0`	Zeroes both 128-byte bitmaps. Called before each translation unit.
`nv_record_capture_count`	`sub_6BCBF0`	Sets bit `capture_count` in the appropriate bitmap. `a1 == 0` targets device, `a1 != 0` targets host-device. Implementation: `result[a2 >> 6] |= 1LL << a2`.
Scan in `sub_6BCC20`	inline	Iterates each uint64 word, shifts right to test each bit, calls the wrapper emitter for each set bit.

The scan loop in sub_6BCC20 processes 64 bits at a time:

uint64_t *ptr = (uint64_t *)&unk_1286980;
unsigned int idx = 0;
unsigned int limit;                  // declared here so both loop conditions can see it
do {
    uint64_t word = *ptr;
    limit = idx + 64;
    do {
        if (idx != 0 && (word & 1))
            emit_device_lambda_wrapper(idx, callback);  // sub_6BB790
        ++idx;
        word >>= 1;
    } while (limit != idx);
    ++ptr;
} while (limit != 1024);

Note that bit 0 is never passed to sub_6BB790 -- the zero-capture case is covered by the __nv_dl_wrapper_t<Tag> specialization that is emitted together with the primary template.
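A self-contained model of the record and scan operations is sketched below. CaptureBitmaps and both function names are hypothetical stand-ins for the two globals and for sub_6BCBF0 and the inline scan. The shift count is masked explicitly because a 64-bit shift by 64 or more is undefined in C++, whereas the decompiled `1LL << a2` relies on the x86 shift instruction masking the count.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for unk_1286980 (device) and unk_1286900 (host-device).
struct CaptureBitmaps {
    std::uint64_t device[16] = {};
    std::uint64_t host_device[16] = {};
};

// Sets bit `count` (expected to be below 1024) in the selected bitmap.
inline void record_capture_count(CaptureBitmaps& b, bool host_device, unsigned count) {
    std::uint64_t* words = host_device ? b.host_device : b.device;
    words[count >> 6] |= std::uint64_t{1} << (count & 63);
}

// Returns the capture counts that would reach sub_6BB790, skipping bit 0.
inline std::vector<unsigned> device_counts_to_emit(const CaptureBitmaps& b) {
    std::vector<unsigned> out;
    for (unsigned idx = 1; idx < 1024; ++idx)
        if ((b.device[idx >> 6] >> (idx & 63)) & 1)
            out.push_back(idx);
    return out;
}
```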

The __nv_parent Pragma

__nv_parent is an NVIDIA-specific capture-list extension that provides closure-to-parent class linkage. It appears in the lambda capture list as a special identifier:

auto lam = [__nv_parent = ParentClass, x, y]() __device__ { /* ... */ };

Processing in scan_lambda

During capture list parsing (Phase 3 of sub_447930, around line 584):

The parser checks for a token matching the string "__nv_parent" at address 0x82e284.
If found, calls sub_52FB70 to resolve the parent class by name lookup.
Sets lambda_info + 25 |= 0x20 (bit 5 = has __nv_parent).
Stores the resolved parent class pointer at lambda_info + 32.
If __nv_parent is specified more than once, emits error 3590.
If __nv_parent is specified without __device__, emits error 3634.

The __nv_parent class reference is used during device code generation to establish the relationship between the lambda's closure type and its enclosing class, which is necessary for the device compiler to properly resolve member accesses through the closure.
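The two diagnostics attached to __nv_parent can be sketched over a simplified capture list. Representing captures as plain identifier strings, and reporting 3590 once per repeated occurrence, are assumptions made for this illustration; check_nv_parent is a hypothetical name.

```cpp
#include <string>
#include <vector>

// Returns the error numbers the rules above would produce for a capture list
// given as identifier strings (a simplification of the real token stream).
inline std::vector<int> check_nv_parent(const std::vector<std::string>& captures,
                                        bool has_device_annotation) {
    std::vector<int> errors;
    int seen = 0;
    for (const std::string& name : captures) {
        if (name != "__nv_parent")
            continue;
        if (++seen > 1)
            errors.push_back(3590);  // specified more than once
    }
    if (seen > 0 && !has_device_annotation)
        errors.push_back(3634);      // __nv_parent without __device__
    return errors;
}
```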

lambda_info Structure

Allocated by sub_5E92A0. This is the per-lambda metadata node created during scan_lambda and consumed during backend generation.

Offset	Size	Field	Description
+0	8	`captured_variable_list`	Head of linked list of capture entries
+8	8	`closure_class_type_node`	Pointer to the closure class type in the IL
+16	8	`call_operator_symbol`	Pointer to the `operator()` routine entity
+24	1	`flags_byte_1`	bit 0 = has captures, bit 3 = `__host__`, bit 4 = `__device__`, bit 5 = has `__nv_parent`, bit 6 = is opaque, bit 7 = constexpr const
+25	1	`flags_byte_2`	bit 0 = is generic, bit 1 = `__host__` exec space, bit 2 = `__device__` exec space, bit 3 = device wrapper needed, bit 4 = host-device wrapper needed, bit 5 = has `__nv_parent`
+32	8	`__nv_parent_class`	Parent class pointer (NVIDIA extension)
+40	4	`lambda_number`	Unique lambda index within scope
+44	4	`source_location`	Source position of lambda expression
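The table can be rendered as a C++ struct with compile-time offset checks. This is a hypothetical transcription: pointer fields are written as 64-bit integers and the gap at +26..+31 is filled with explicit padding so that the offsets hold regardless of host alignment rules. The table does not describe that gap or the total node size.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical rendering of the lambda_info table.
struct LambdaInfoLayout {
    std::uint64_t captured_variable_list;   // +0
    std::uint64_t closure_class_type_node;  // +8
    std::uint64_t call_operator_symbol;     // +16
    std::uint8_t  flags_byte_1;             // +24
    std::uint8_t  flags_byte_2;             // +25
    std::uint8_t  pad_26[6];                // +26..+31 (not described in the table)
    std::uint64_t nv_parent_class;          // +32
    std::uint32_t lambda_number;            // +40
    std::uint32_t source_location;          // +44
};

static_assert(offsetof(LambdaInfoLayout, flags_byte_2) == 25, "flags_byte_2 offset");
static_assert(offsetof(LambdaInfoLayout, nv_parent_class) == 32, "nv_parent_class offset");
static_assert(offsetof(LambdaInfoLayout, source_location) == 44, "source_location offset");
```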

Key Functions

Address	Name (recovered)	Source	Lines	Role
`sub_447930`	`scan_lambda`	`class_decl.c`	2113	Frontend: parse lambda, validate constraints, record capture count
`sub_42FE50`	`scan_lambda_capture_list`	`class_decl.c`	524	Frontend: parse `[...]` capture list, handle `__nv_parent`
`sub_42EE00`	`make_field_for_lambda_capture`	`class_decl.c`	551	Frontend: create closure class fields for captures
`sub_42D710`	`scan_lambda_capture_list` (inner)	`class_decl.c`	1025	Frontend: process individual capture entries
`sub_42F910`	`field_for_lambda_capture`	`class_decl.c`	~200	Frontend: resolve capture field via hash lookup
`sub_436DF0`	Lambda template decl helper	`class_decl.c`	65	Frontend: propagate execution space to call operator template
`sub_6BCC20`	`nv_emit_lambda_preamble`	`nv_transforms.c`	244	Backend: emit ALL `__nv_*` template infrastructure
`sub_6BB790`	`emit_device_lambda_wrapper_specialization`	`nv_transforms.c`	191	Backend: emit `__nv_dl_wrapper_t<Tag, F1..FN>` for N captures
`sub_6BBB10`	`emit_host_device_lambda_wrapper` (const)	`nv_transforms.c`	238	Backend: emit `__nv_hdl_wrapper_t` non-mutable variant
`sub_6BBEE0`	`emit_host_device_lambda_wrapper` (mutable)	`nv_transforms.c`	236	Backend: emit `__nv_hdl_wrapper_t` mutable variant
`sub_6BC290`	`emit_array_capture_helpers`	`nv_transforms.c`	183	Backend: emit `__nv_lambda_array_wrapper` for dim 2-8
`sub_6BCBC0`	`nv_reset_capture_bitmasks`	`nv_transforms.c`	9	Init: zero both 128-byte bitmaps
`sub_6BCBF0`	`nv_record_capture_count`	`nv_transforms.c`	13	Record: set bit in device or host-device bitmap
`sub_6BCDD0`	`nv_find_parent_lambda_function`	`nv_transforms.c`	33	Query: find enclosing host/device function for nested lambda
`sub_6BC680`	`is_device_or_extended_device_lambda`	`nv_transforms.c`	16	Query: test if entity qualifies as device lambda
`sub_47B890`	`gen_lambda`	`cp_gen_be.c`	336	Backend: emit per-lambda wrapper construction call
`sub_4864F0`	`gen_type_decl`	`cp_gen_be.c`	751	Backend: detect preamble trigger, invoke emission
`sub_47ECC0`	`gen_template` (dispatcher)	`cp_gen_be.c`	1917	Backend: master source sequence dispatcher
`sub_489000`	`process_file_scope_entities`	`cp_gen_be.c`	723	Backend: entry point, emits lambda macro defines in boilerplate

Global State

Variable	Address	Purpose
`dword_106BF38`	`0x106BF38`	Extended lambda mode flag (`--extended-lambda`)
`dword_106BF40`	`0x106BF40`	Lambda host-device mode flag
`unk_1286980`	`0x1286980`	Device lambda capture-count bitmap (128 bytes)
`unk_1286900`	`0x1286900`	Host-device lambda capture-count bitmap (128 bytes)
`qword_12868F0`	`0x12868F0`	Entity-to-closure mapping hash table
`dword_126E270`	`0x126E270`	C++17 noexcept-in-type-system flag (controls noexcept wrapper variants)
`qword_E7FEC8`	`0xE7FEC8`	Lambda hash table (Robin Hood, 16 bytes/slot, 1024 entries)
`ptr` (E7FE40 area)	`0xE7FE40`	Red-black tree root for lambda numbering per source position
`dword_E7FE48`	`0xE7FE48`	Red-black tree sentinel node
`dword_E85700`	`0xE85700`	`host_runtime.h` already included flag
`dword_106BDD8`	`0x106BDD8`	OptiX mode flag (triggers error 3689 on incompatible lambdas)

Concrete End-to-End Example

Consider a user writing this CUDA code with --extended-lambda:

// user.cu
#include <cstdio>
__global__ void kernel(int *out) {
    int scale = 2;
    auto f = [=] __device__ (int x) { return x * scale; };
    out[threadIdx.x] = f(threadIdx.x);
}

Here is the transformation at each stage.

Stage 1: scan_lambda detects the lambda

The frontend parser encounters [=] __device__ (int x) { ... }. sub_447930 runs:

Finds __device__ annotation on the lambda expression.
Sets lambda_info + 25 |= 0x08 (bit 3: device wrapper needed) and lambda_info + 25 |= 0x04 (bit 2: has __device__ exec space).
Sets lambda_info + 24 |= 0x10 (bit 4: capture-default is =).
Counts one capture (scale). Calls sub_6BCBF0(0, 1) to set bit 1 in the device bitmap unk_1286980.
Creates a closure class (compiler-generated name like __lambda_17_16) with one field of type int for the captured scale.

Stage 2: Preamble injection

When the backend encounters the sentinel type __nv_lambda_preheader_injection, sub_6BCC20 emits the template library. Because bit 1 is set in the device bitmap, it calls sub_6BB790(1, emit) which generates a one-capture specialization:

template <typename Tag, typename F1>
struct __nv_dl_wrapper_t<Tag, F1> {
    typename __nv_lambda_field_type<F1>::type f1;
    __nv_dl_wrapper_t(Tag, F1 in1) : f1(in1) { }
    template <typename...U1>
    int operator()(U1...) { return 0; }
};

template <typename U, U func, typename Return, unsigned Id, typename F1>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, F1> {
    typename __nv_lambda_field_type<F1>::type f1;
    __nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>, F1 in1)
        : f1(in1) { }
    template <typename...U1>
    Return operator()(U1...) { __builtin_unreachable(); }
};

Stage 3: Per-lambda wrapper emission

sub_47B890 (gen_lambda) reads byte[25] & 0x08 (device lambda flag is set) and emits the wrapper construction call. The lambda body is hidden from the host compiler:

// Output in .int.c (what the host compiler sees):
__nv_dl_wrapper_t< __nv_dl_tag<
    __NV_LAMBDA_WRAPPER_HELPER(&__lambda_17_16::operator(),
                               &__lambda_17_16::operator()), 0u>,
    int>(
    __nv_dl_tag<
        __NV_LAMBDA_WRAPPER_HELPER(&__lambda_17_16::operator(),
                                   &__lambda_17_16::operator()), 0u>{},
    scale)
#if 0
[=] __device__ (int x) { return x * scale; }
#endif

The __NV_LAMBDA_WRAPPER_HELPER(X, Y) macro expands to decltype(X), Y, supplying the tag's first two template arguments: the function pointer type U (a type parameter) and the pointer itself, func (a non-type parameter).
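
The way a `decltype(X), Y` macro feeds a three-parameter tag can be checked in isolation. The following is a sketch under demo names: `DEMO_LAMBDA_WRAPPER_HELPER`, `demo_dl_tag`, `DemoClosure`, and `demo_tag_traits` are hypothetical stand-ins and are not emitted by cudafe++. It only illustrates that the tag type alone carries enough information to reach the closure's `operator()`.

```cpp
#include <type_traits>

// Same shape as the emitted macro, under a demo name.
#define DEMO_LAMBDA_WRAPPER_HELPER(X, Y) decltype(X), Y

// Same parameter shape as the emitted tag: type, pointer of that type, ID.
template <typename U, U func, unsigned Id>
struct demo_dl_tag { };

// Stand-in for a compiler-generated closure class with one capture.
struct DemoClosure {
    int scale;
    int operator()(int x) const { return x * scale; }
};

using Tag0 = demo_dl_tag<
    DEMO_LAMBDA_WRAPPER_HELPER(&DemoClosure::operator(), &DemoClosure::operator()), 0u>;
using Tag1 = demo_dl_tag<
    DEMO_LAMBDA_WRAPPER_HELPER(&DemoClosure::operator(), &DemoClosure::operator()), 1u>;

// Identical operator types still yield distinct tags because the ID differs.
static_assert(!std::is_same<Tag0, Tag1>::value, "ID must disambiguate tags");

// Recover the encoded member pointer and ID from a tag type.
template <typename Tag> struct demo_tag_traits;

template <typename U, U func, unsigned Id>
struct demo_tag_traits<demo_dl_tag<U, func, Id> > {
    static constexpr unsigned id = Id;

    // Invoke the encoded operator() on a closure object.
    template <typename Closure, typename... Args>
    static auto call(const Closure& c, Args... args)
        -> decltype((c.*func)(args...)) {
        return (c.*func)(args...);
    }
};
```

For example, `demo_tag_traits<Tag0>::call(DemoClosure{2}, 21)` dispatches through the member pointer stored in the tag, which is the same idea the device side relies on when it bypasses the wrapper's dummy `operator()`.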

What each compiler sees

Host compiler sees a __nv_dl_wrapper_t<Tag, int> struct with field f1 holding the captured scale. The operator() returns int(0) (never actually called on host). The original lambda body is inside #if 0.

Device compiler sees the same wrapper struct but resolves the tag's encoded function pointer &__lambda_17_16::operator() to call the actual lambda body. The wrapper's f1 field provides the captured scale value.

Architecture: Text Template Approach

NVIDIA's lambda support uses a raw text emission pattern rather than constructing AST nodes. The template infrastructure is generated as C++ source text strings, passed through a callback function:

emit("template <typename Tag,typename...CapturedVarTypePack>\n"
     "struct __nv_dl_wrapper_t {\n"
     "static_assert(sizeof...(CapturedVarTypePack) == 0,"
     "\"nvcc internal error: unexpected number of captures!\");\n"
     "};\n");

This text is emitted to the .int.c output file and subsequently parsed by the host compiler. The device compiler receives the same text through a parallel path. This design is architecturally simpler than building proper AST nodes for the wrapper templates, at the cost of the templates existing only as generated text rather than first-class IL entities.

The preamble injection point is controlled by a sentinel type declaration: when the backend encounters a type named __nv_lambda_preheader_injection, it emits the entire template library and wraps the sentinel in #if 0. This guarantees the templates appear exactly once, before any lambda expression that references them, regardless of declaration ordering in the user's source.

Device Lambda Wrapper -- __nv_dl_wrapper_t template structure in detail
Host-Device Lambda Wrapper -- __nv_hdl_wrapper_t type-erased design
Capture Handling -- __nv_lambda_field_type, __nv_lambda_array_wrapper
Preamble Injection -- sub_6BCC20 emission pipeline step by step
Lambda Restrictions -- 35+ error categories and validation rules

Device Lambda Wrapper (`__nv_dl_wrapper_t`)

When a C++ lambda is annotated __device__ inside CUDA code compiled with --extended-lambda, the closure class that the frontend creates has host linkage only -- it cannot be instantiated on the device. The device lambda wrapper system solves this by replacing the lambda expression at the call site with a construction of __nv_dl_wrapper_t<Tag, F1, ..., FN>, a template struct whose type parameters encode the lambda's identity (via Tag) and whose fields store the captured variables in device-accessible storage. The wrapper struct has a dummy operator() that never executes real code on the device side -- its purpose is purely to carry captured state across the host/device boundary. The actual device-side call is dispatched through the tag type, which encodes a function pointer to the lambda's operator() as a non-type template parameter.

Two tag types exist. __nv_dl_tag is the standard tag for lambdas with auto-deduced return types. __nv_dl_trailing_return_tag handles lambdas with explicit trailing return types, preserving the user-specified return type through the wrapper. Both tag types carry the lambda's operator() function pointer and a unique ID as template parameters.

The wrapper template does not exist in any header file. It is synthesized as raw C++ text by sub_6BB790 (emit_device_lambda_wrapper_specialization) in nv_transforms.c and injected into the compilation stream during preamble emission. Only the capture counts actually used in the translation unit are emitted, controlled by a 1024-bit bitmap at unk_1286980.

Key Facts

Property	Value
Wrapper type	`__nv_dl_wrapper_t<Tag, CapturedVarTypePack...>`
Standard tag	`__nv_dl_tag<U, func, unsigned>`
Trailing-return tag	`__nv_dl_trailing_return_tag<U, func, Return, unsigned>`
Specialization emitter	`sub_6BB790` (`emit_device_lambda_wrapper_specialization`, 191 lines)
Per-lambda emission	`sub_47B890` (`gen_lambda`, 336 lines, `cp_gen_be.c`)
Preamble master emitter	`sub_6BCC20` (`nv_emit_lambda_preamble`, 244 lines)
Capture bitmap	`unk_1286980` (128 bytes = 1024 bits, device lambda)
Bitmap setter	`sub_6BCBF0` (`nv_record_capture_count`, 13 lines)
Max supported captures	1023 (bitmap bits 0-1023 index capture counts 0-1023)
Source file	`nv_transforms.c` (specialization emitter), `cp_gen_be.c` (per-lambda call)
Field type trait	`__nv_lambda_field_type<T>`

Primary Template and Zero-Capture Specialization

The primary template is a static_assert trap: any instantiation with a non-empty capture pack that has no matching explicit specialization triggers a compilation error. The zero-capture specialization (Tag only, no F parameters) provides a trivial constructor and a dummy operator() returning 0.

This code is emitted verbatim as a single string literal from sub_6BCC20:

// Exact binary string (emitted as a single a1() call in sub_6BCC20):
template <typename Tag,typename...CapturedVarTypePack>
struct __nv_dl_wrapper_t {
static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures!");
};
template <typename Tag>
struct __nv_dl_wrapper_t<Tag> {
__nv_dl_wrapper_t(Tag) { }
template <typename...U1>
int operator()(U1...) { return 0; }
};

Note: there is no space after the comma in Tag,typename... and no indentation; this is the literal text injected into the .int.c output.

The primary template's static_assert acts as a safety net: if the frontend records a capture count of N but fails to emit the corresponding N-capture specialization, the host compiler will produce a diagnostic rather than silently generating broken code. The zero-capture specialization's operator() returns int(0) -- this value is never used at runtime because the device compiler dispatches through the tag's encoded function pointer, not through the wrapper's operator().
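
The trap-plus-specialization pattern can be reproduced outside the compiler. This is a sketch with hypothetical demo names (`demo_dl_wrapper`, `demo_site_tag`); the one-capture specialization is written by hand here, whereas in cudafe++ it would come from sub_6BB790, and the field uses `F1` directly instead of the `__nv_lambda_field_type` indirection.

```cpp
// Primary template: a trap for capture counts with no specialization.
template <typename Tag, typename... CapturedVarTypePack>
struct demo_dl_wrapper {
    static_assert(sizeof...(CapturedVarTypePack) == 0,
                  "demo: unexpected number of captures!");
};

// Zero-capture specialization: trivial constructor, dummy operator().
template <typename Tag>
struct demo_dl_wrapper<Tag> {
    demo_dl_wrapper(Tag) { }
    template <typename... U1>
    int operator()(U1...) { return 0; }
};

// Hand-written one-capture specialization (simplified field type).
template <typename Tag, typename F1>
struct demo_dl_wrapper<Tag, F1> {
    F1 f1;
    demo_dl_wrapper(Tag, F1 in1) : f1(in1) { }
    template <typename... U1>
    int operator()(U1...) { return 0; }
};

struct demo_site_tag { };  // stand-in for a per-lambda tag type

// The dummy operator() accepts any arguments and returns 0.
inline int demo_zero_capture_call() {
    demo_dl_wrapper<demo_site_tag> w(demo_site_tag{});
    return w(1, 2.0);
}

// The capture is carried as a plain field on the wrapper.
inline int demo_one_capture_field() {
    demo_dl_wrapper<demo_site_tag, int> w(demo_site_tag{}, 7);
    return w.f1;
}

// Uncommenting the next line selects the primary template and fails with
// the static_assert message, because no two-capture specialization exists:
// demo_dl_wrapper<demo_site_tag, int, int> broken(demo_site_tag{}, 1, 2);
```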

Tag Types

`__nv_dl_tag`

The standard device lambda tag. Three template parameters encode the lambda identity. Exact binary string:

template <typename U, U func, unsigned>
struct __nv_dl_tag { };

The string is "\ntemplate <typename U, U func, unsigned>\nstruct __nv_dl_tag { };\n" -- note the leading newline.

Parameter	Role
`U`	Type of the lambda's `operator()` (deduced via `decltype`)
`func`	Non-type template parameter: pointer to the lambda's `operator()`
`unsigned`	Unnamed parameter: unique ID disambiguating lambdas with identical operator types

The __NV_LAMBDA_WRAPPER_HELPER(X, Y) macro (emitted at preamble start) expands to decltype(X), Y, providing the U, func pair for the tag. The full macro and helper text, emitted as the first a1() call, is:

#define __NV_LAMBDA_WRAPPER_HELPER(X, Y) decltype(X), Y
template <typename T>
struct __nvdl_remove_ref { typedef T type; };

template<typename T>
struct __nvdl_remove_ref<T&> { typedef T type; };

template<typename T>
struct __nvdl_remove_ref<T&&> { typedef T type; };

template <typename T, typename... Args>
struct __nvdl_remove_ref<T(&)(Args...)> {
  typedef T(*type)(Args...);
};

template <typename T>
struct __nvdl_remove_const { typedef T type; };

template <typename T>
struct __nvdl_remove_const<T const> { typedef T type; };

The __nvdl_remove_ref specialization for function references (T(&)(Args...)) is notable: it converts a function reference type to a function pointer type (T(*)(Args...)). This handles the case where a lambda captures a function by reference -- the wrapper field needs a copyable function pointer, not a reference.
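
The effect of the function-reference specialization can be verified with a small stand-alone copy of the trait. This sketch renames the trait to `demo_remove_ref` to avoid reserved identifiers; `demo_twice`, `demo_stored_fn`, and `demo_call_stored` are hypothetical helpers added for illustration.

```cpp
#include <type_traits>

// Stand-alone copy of the reference-stripping trait, under a demo name.
template <typename T>
struct demo_remove_ref { typedef T type; };

template <typename T>
struct demo_remove_ref<T&> { typedef T type; };

template <typename T>
struct demo_remove_ref<T&&> { typedef T type; };

// More specialized than T& for function references: yields a pointer type.
template <typename T, typename... Args>
struct demo_remove_ref<T(&)(Args...)> {
    typedef T (*type)(Args...);
};

inline int demo_twice(int x) { return 2 * x; }

// Type a wrapper field would get for a function captured by reference.
typedef demo_remove_ref<int (&)(int)>::type demo_stored_fn;

static_assert(std::is_same<demo_stored_fn, int (*)(int)>::value,
              "function reference must map to a function pointer");
static_assert(std::is_same<demo_remove_ref<int&>::type, int>::value,
              "ordinary references are simply stripped");

// Copy a function reference into the pointer-typed field and call it.
inline int demo_call_stored(int x) {
    int (&ref)(int) = demo_twice;
    demo_stored_fn copy = ref;  // function-to-pointer conversion
    return copy(x);
}
```

Without the extra specialization, `T&` would match with `T = int(int)`, and a field of function type cannot be declared, which is why the pointer form is needed.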

`__nv_dl_trailing_return_tag`

For lambdas with explicit trailing return types (-> ReturnType), a separate tag preserves the return type:

template <typename U, U func, typename Return, unsigned>
struct __nv_dl_trailing_return_tag { };

The additional Return parameter carries the user-specified return type. This is necessary because the wrapper's operator() must return this type rather than int, and the body uses __builtin_unreachable() to satisfy the compiler without generating actual return-value code.

Trailing-Return Zero-Capture Specialization

The zero-capture variant for trailing-return lambdas uses __builtin_unreachable() instead of return 0. The exact binary text (emitted as two consecutive a1() calls):

template <typename U, U func, typename Return, unsigned>
struct __nv_dl_trailing_return_tag { };

template <typename U, U func, typename Return, unsigned Id>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id> > {
  __nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>) { }

  template <typename...U1> Return operator()(U1...) { __builtin_unreachable(); }
};

Note: the __nv_dl_trailing_return_tag definition and its zero-capture wrapper specialization are emitted together (two strings in immediate succession: the first ends at { before __builtin_unreachable, the second contains __builtin_unreachable(); }\n}; \n\n -- note the trailing space before the newlines).

The __builtin_unreachable() tells the compiler this code path is never taken, so no return value needs to be materialized. This is safe because the wrapper's operator() is never called on the device side -- the device compiler resolves the call through the tag's encoded function pointer directly.

Per-Capture-Count Specialization Generator (`sub_6BB790`)

The function sub_6BB790 generates partial specializations of __nv_dl_wrapper_t for a specific capture count N. It takes two arguments: the capture count (unsigned int a1) and an emit callback (void(*a2)(const char*)). For each N, it emits two struct specializations: one for __nv_dl_tag and one for __nv_dl_trailing_return_tag.

Generated Template Structure (N captures)

For a lambda capturing N variables, sub_6BB790(N, emit) produces:

// Standard tag specialization
template <typename Tag, typename F1, typename F2, ..., typename FN>
struct __nv_dl_wrapper_t<Tag, F1, F2, ..., FN> {
    typename __nv_lambda_field_type<F1>::type f1;
    typename __nv_lambda_field_type<F2>::type f2;
    ...
    typename __nv_lambda_field_type<FN>::type fN;

    __nv_dl_wrapper_t(Tag, F1 in1, F2 in2, ..., FN inN)
        : f1(in1), f2(in2), ..., fN(inN) { }

    template <typename...U1>
    int operator()(U1...) { return 0; }
};

// Trailing-return tag specialization
template <typename U, U func, typename Return, unsigned Id,
          typename F1, typename F2, ..., typename FN>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>,
                          F1, F2, ..., FN> {
    typename __nv_lambda_field_type<F1>::type f1;
    typename __nv_lambda_field_type<F2>::type f2;
    ...
    typename __nv_lambda_field_type<FN>::type fN;

    __nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>,
                      F1 in1, F2 in2, ..., FN inN)
        : f1(in1), f2(in2), ..., fN(inN) { }

    template <typename...U1>
    Return operator()(U1...) { __builtin_unreachable(); }
};

`__nv_lambda_field_type` Indirection

Each field is declared as typename __nv_lambda_field_type<Fi>::type fi rather than Fi fi. This indirection allows the lambda infrastructure to intercept array types (which cannot be passed by value to the wrapper constructor or copy-initialized as ordinary members) and replace them with __nv_lambda_array_wrapper instances that perform element-by-element copying. The primary template is an identity transform:

template <typename T>
struct __nv_lambda_field_type {
    typedef T type;
};

Specializations for array types (emitted by sub_6BC290) map T[D1]...[DN] to __nv_lambda_array_wrapper<T[D1]...[DN]>, and const T[D1]...[DN] to const __nv_lambda_array_wrapper<T[D1]...[DN]>.
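
The indirection can be illustrated with a one-dimensional array. This is a sketch only: `demo_field_type`, `demo_array_wrapper`, and `demo_one_capture` are hypothetical names, the constructor takes `const F1&` (an assumption made so that an array argument does not decay to a pointer), and the multi-dimensional and const cases handled by sub_6BC290 are omitted.

```cpp
#include <cstddef>
#include <type_traits>

// Array holder that copies element by element (1-D only in this sketch).
template <typename T> struct demo_array_wrapper;

template <typename T, std::size_t N>
struct demo_array_wrapper<T[N]> {
    T arr[N];
    explicit demo_array_wrapper(const T (&in)[N]) {
        for (std::size_t i = 0; i < N; ++i) arr[i] = in[i];
    }
};

// Identity for ordinary types.
template <typename T>
struct demo_field_type { typedef T type; };

// Arrays are redirected to the wrapper.
template <typename T, std::size_t N>
struct demo_field_type<T[N]> { typedef demo_array_wrapper<T[N]> type; };

// Simplified one-capture holder using the indirection for its field.
template <typename F1>
struct demo_one_capture {
    typename demo_field_type<F1>::type f1;
    explicit demo_one_capture(const F1& in1) : f1(in1) { }
};

// The wrapper holds a snapshot: later writes to the source are not seen.
inline int demo_snapshot_first_element() {
    int src[3] = {1, 2, 3};
    demo_one_capture<int[3]> w(src);
    src[0] = 99;
    return w.f1.arr[0];
}
```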

Emission Mechanics

The decompiled sub_6BB790 shows that emission is entirely sprintf-based: C++ source text is built in a 1064-byte stack buffer (v29[1064]) and each fragment is passed through the emit callback. The function has two major branches:

Branch 1: a1 == 0 (zero captures) -- Dead code. This path would emit __nv_dl_wrapper_t(Tag,) : with a trailing comma and an empty initializer list, which is syntactically invalid C++. It is never reached because the bitmap scan loop in sub_6BCC20 skips bit 0 (if (v2 && (v3 & 1) != 0)). The zero-capture case is instead covered by the __nv_dl_wrapper_t<Tag> partial specialization, which sub_6BCC20 emits unconditionally as a string literal alongside the primary template.

Branch 2: a1 > 0 (N captures) -- Generates the N-ary specializations through seven sequential loops:

Loop 1:  Emit template parameter list    ", typename F1, ..., typename FN"
Loop 2:  Emit partial specialization      ", F1, ..., FN"
Loop 3:  Emit field declarations          "typename __nv_lambda_field_type<Fi>::type fi;\n"
Loop 4:  Emit constructor parameters      "F1 in1, F2 in2, ..., FN inN"
Loop 5:  Emit initializer list            "f1(in1), f2(in2), ..., fN(inN)"
         Emit operator() with "return 0"
         Then repeat Loops 1-5 for __nv_dl_trailing_return_tag variant
Loop 6:  Same parameter/field emission for trailing-return variant
Loop 7:  Same initializer list for trailing-return variant
         Emit operator() with __builtin_unreachable()

Each loop uses sprintf(v29, "...", index) for numbered parameters and a2(v29) to emit the fragment. The first element in each comma-separated list is handled specially (no leading comma), with subsequent elements prefixed by ", ".

Key string literals used by sub_6BB790 (extracted from binary):

String	Purpose
`"\ntemplate <typename Tag"`	Opens template parameter list
`", typename F%u"`	Each additional type parameter
`">\nstruct __nv_dl_wrapper_t<Tag"`	Opens partial specialization
`", F%u"`	Each type argument in specialization
`"typename __nv_lambda_field_type<F%u>::type f%u;\n"`	Field declaration
`"__nv_dl_wrapper_t(Tag,"`	Constructor declaration (standard tag)
`"F%u in%u"`	Constructor parameter
`"f%u(in%u)"`	Initializer list entry
`" { }\ntemplate <typename...U1>\nint operator()(U1...) { return 0; }\n};\n"`	Standard operator()
`"__nv_dl_trailing_return_tag<U, func, Return, Id>"`	Trailing-return tag name
`" { }\ntemplate <typename...U1>\nReturn operator()(U1...) "`	Trailing-return operator()
`"{ __builtin_unreachable(); }\n};\n\n"`	Unreachable body

Concrete Example: 2 Captures

For a lambda capturing two variables, sub_6BB790(2, emit) produces:

template <typename Tag, typename F1, typename F2>
struct __nv_dl_wrapper_t<Tag, F1, F2> {
    typename __nv_lambda_field_type<F1>::type f1;
    typename __nv_lambda_field_type<F2>::type f2;
    __nv_dl_wrapper_t(Tag, F1 in1, F2 in2) : f1(in1), f2(in2) { }
    template <typename...U1>
    int operator()(U1...) { return 0; }
};

template <typename U, U func, typename Return, unsigned Id,
          typename F1, typename F2>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>,
                          F1, F2> {
    typename __nv_lambda_field_type<F1>::type f1;
    typename __nv_lambda_field_type<F2>::type f2;
    __nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>,
                      F1 in1, F2 in2) : f1(in1), f2(in2) { }
    template <typename...U1>
    Return operator()(U1...) { __builtin_unreachable(); }
};
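
The loop structure and format strings listed above can be sketched as a small text generator for the standard-tag variant. This is an approximation, not the decompiled function: the name `emit_standard_specialization`, the `std::function` callback (the binary passes a plain function pointer), and the glue strings `"> {\n"` and `") : "` are assumptions, and whitespace differs from the pretty-printed example above.

```cpp
#include <cstdio>
#include <functional>
#include <string>

// Callback type for this sketch; the binary uses void(*)(const char*).
using Emit = std::function<void(const char*)>;

// Emit the standard-tag specialization for n >= 1 captures.
void emit_standard_specialization(unsigned n, const Emit& emit) {
    char buf[128];
    emit("\ntemplate <typename Tag");
    for (unsigned i = 1; i <= n; ++i) {   // template parameter list
        std::snprintf(buf, sizeof buf, ", typename F%u", i);
        emit(buf);
    }
    emit(">\nstruct __nv_dl_wrapper_t<Tag");
    for (unsigned i = 1; i <= n; ++i) {   // partial specialization arguments
        std::snprintf(buf, sizeof buf, ", F%u", i);
        emit(buf);
    }
    emit("> {\n");                        // assumed glue text
    for (unsigned i = 1; i <= n; ++i) {   // field declarations
        std::snprintf(buf, sizeof buf,
                      "typename __nv_lambda_field_type<F%u>::type f%u;\n", i, i);
        emit(buf);
    }
    emit("__nv_dl_wrapper_t(Tag,");
    for (unsigned i = 1; i <= n; ++i) {   // constructor parameters
        if (i > 1) emit(", ");            // first element has no leading comma
        std::snprintf(buf, sizeof buf, "F%u in%u", i, i);
        emit(buf);
    }
    emit(") : ");                         // assumed glue text
    for (unsigned i = 1; i <= n; ++i) {   // member initializer list
        if (i > 1) emit(", ");
        std::snprintf(buf, sizeof buf, "f%u(in%u)", i, i);
        emit(buf);
    }
    emit(" { }\ntemplate <typename...U1>\nint operator()(U1...) { return 0; }\n};\n");
}

// Convenience helper: collect all fragments into one string.
std::string render_standard_specialization(unsigned n) {
    std::string out;
    emit_standard_specialization(n, [&out](const char* s) { out += s; });
    return out;
}
```

Calling `render_standard_specialization(2)` yields text equivalent to the first struct in the two-capture example; the trailing-return variant would follow the same pattern with the tag name and `operator()` strings swapped.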

Per-Lambda Wrapper Emission (`sub_47B890`)

The backend code generator sub_47B890 (gen_lambda in cp_gen_be.c) handles the per-lambda transformation at each lambda expression's usage site. It reads the decision bits at lambda_info + 25 and emits a wrapper construction call that replaces the lambda expression in the output .int.c file.

Device Lambda Path (bit 3 set: `byte[25] & 0x08`)

When the device lambda flag is set, the emitter produces a wrapper construction expression followed by a #if 0 block that hides the original lambda body from the host compiler:

// sub_47B890, decompiled lines 46-58
if ((v2 & 8) != 0) {
    sub_467E50("__nv_dl_wrapper_t< ");   // open wrapper type
    sub_475820(a1);                       // emit tag type (closure class)
    sub_46E640(a1);                       // emit capture type list
    sub_467E50(">( ");                    // close template args, open ctor
    sub_475820(a1);                       // emit tag constructor arg
    sub_467E50("{} ");                    // empty-brace tag construction
    sub_46E550(*a1);                      // emit captured value expressions
    sub_467E50(") ");                     // close ctor call
    sub_46BC80("#if 0");                  // suppress original lambda
    --dword_1065834;                      // adjust nesting depth
    sub_467D60();                         // newline
}

The generated output for a device lambda with two captures looks like:

__nv_dl_wrapper_t< __nv_dl_tag<decltype(&ClosureType::operator()),
    &ClosureType::operator(), 0u>, int, float>(
    __nv_dl_tag<decltype(&ClosureType::operator()),
    &ClosureType::operator(), 0u>{}, x, y)
#if 0
// original lambda body hidden from host compiler
[x, y]() __device__ { /* ... */ }
#endif

The #if 0 suppression ensures the host compiler never attempts to parse the device lambda body, which may contain device-only intrinsics and constructs. The device compiler sees the wrapper struct and resolves the call through the tag type's encoded function pointer.
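
The text assembled at a lambda site can be modeled as plain string concatenation. This sketch is an assumption-laden simplification: `render_device_lambda_site` and `demo_site` are hypothetical names, the leading-comma convention for capture types and values is inferred from the sample output above, and the real emitter writes through sub_467E50 and sub_46BC80 rather than building a `std::string`.

```cpp
#include <string>
#include <vector>

// Build the replacement text for one device lambda expression.
// `tag` is the already-rendered tag type, for example "__nv_dl_tag<...>".
inline std::string render_device_lambda_site(
        const std::string& tag,
        const std::vector<std::string>& capture_types,
        const std::vector<std::string>& capture_values,
        const std::string& original_lambda) {
    std::string out = "__nv_dl_wrapper_t< " + tag;
    for (const std::string& t : capture_types) out += ", " + t;   // type list
    out += ">( " + tag + "{} ";                                   // tag ctor arg
    for (const std::string& v : capture_values) out += ", " + v;  // captured values
    out += ") \n#if 0\n";                                         // hide original
    out += original_lambda;
    out += "\n#endif\n";
    return out;
}

// Fixed example with two captures, using "T" as a placeholder tag.
inline std::string demo_site() {
    std::vector<std::string> types;
    types.push_back("int");
    types.push_back("float");
    std::vector<std::string> values;
    values.push_back("x");
    values.push_back("y");
    return render_device_lambda_site("T", types, values,
                                     "[x, y]() __device__ { }");
}
```

The key property is that the wrapper construction precedes the `#if 0`, so the host compiler parses only the wrapper expression while the original lambda text stays in the file for reference.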

Body Suppression for Host-Only Pass (bit pattern `(byte[25] & 0x06) == 0x02`)

A separate suppression path handles lambdas where the body should not be compiled on the current pass. In this case, the emitter outputs an empty body { } and wraps the real body in #if 0 / #endif:

// sub_47B890, decompiled lines 290-306
if ((*(_BYTE *)(a1 + 25) & 6) == 2) {
    sub_467D60();             // newline
    sub_468190("{ }");        // empty body placeholder
    sub_46BC80("#if 0");      // start suppression
    --dword_1065834;
    sub_467D60();
}
// ... emit original body under #if 0 ...
sub_47AEF0(body, 0);         // emit body (invisible due to #if 0)
if ((*(_BYTE *)(a1 + 25) & 6) == 2) {
    sub_46BC80("#endif");     // end suppression
    --dword_1065834;
    sub_467D60();
    dword_1065820 = 0;
    qword_1065828 = 0;
}

After the body emission completes, the device lambda path also emits a matching #endif to close the #if 0 block opened at the wrapper call:

// sub_47B890, decompiled lines 312-320
if ((v29 & 8) != 0) {          // device lambda
    sub_46BC80("#endif");       // close #if 0 from wrapper call
    --dword_1065834;
    sub_467D60();
    dword_1065820 = 0;
    qword_1065828 = 0;
}

Host-Device Lambda Path (bit 4 set: `byte[25] & 0x10`)

Host-device lambdas take a different path through __nv_hdl_create_wrapper_t rather than __nv_dl_wrapper_t. This is covered in the Host-Device Lambda Wrapper page.

Bitmap-Driven Emission

Only capture counts that were actually used during frontend parsing get specializations emitted. The scan loop in sub_6BCC20 processes the 128-byte bitmap at unk_1286980 as an array of 16 uint64_t values:

uint64_t *ptr = (uint64_t *)&unk_1286980;
unsigned int idx = 0;
do {
    uint64_t word = *ptr;
    unsigned int limit = idx + 64;
    do {
        if (idx != 0 && (word & 1))       // skip bit 0 (handled by primary)
            sub_6BB790(idx, callback);     // emit N-capture specialization
        ++idx;
        word >>= 1;
    } while (limit != idx);
    ++ptr;
} while (limit != 1024);

Bit 0 is skipped because the zero-capture case is already covered by the __nv_dl_wrapper_t<Tag> partial specialization (emitted unconditionally as a string literal alongside the primary template). For each remaining set bit N, sub_6BB790(N, emit) produces two structs (standard tag and trailing-return tag), so a translation unit using lambdas with 1, 3, and 5 captures emits exactly 6 wrapper struct specializations rather than the 2046 (1023 capture counts x 2 tag variants) that exhaustive generation would produce.
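
The record-then-scan protocol can be sketched end to end. The names `CaptureBitmaps`, `record_capture_count`, and `counts_to_emit` are hypothetical; the argument order of the recorder is inferred from the `sub_6BCBF0(0, 1)` call in the overview example, and the bit layout (bit `N % 64` of word `N / 64`) is assumed so that it matches the scan loop above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Two 128-byte bitmaps, one bit per capture count.
struct CaptureBitmaps {
    std::uint64_t device[16] = {};       // stands in for unk_1286980
    std::uint64_t host_device[16] = {};  // stands in for unk_1286900
};

// Frontend side: remember that a lambda with `count` captures was seen.
inline void record_capture_count(CaptureBitmaps& maps, bool host_device,
                                 unsigned count) {
    assert(count < 1024 && "bitmap holds bits 0..1023");
    std::uint64_t* words = host_device ? maps.host_device : maps.device;
    words[count / 64] |= std::uint64_t(1) << (count % 64);
}

// Backend side: list the capture counts that need a specialization.
// Bit 0 is skipped, matching the scan loop in the preamble emitter.
inline std::vector<unsigned> counts_to_emit(const std::uint64_t (&words)[16]) {
    std::vector<unsigned> out;
    for (unsigned idx = 1; idx < 1024; ++idx) {
        if ((words[idx / 64] >> (idx % 64)) & 1u) out.push_back(idx);
    }
    return out;
}

// Example: lambdas with 0, 1, 3, and 5 captures were seen in the TU.
inline std::vector<unsigned> demo_counts() {
    CaptureBitmaps maps;
    record_capture_count(maps, false, 0);
    record_capture_count(maps, false, 1);
    record_capture_count(maps, false, 3);
    record_capture_count(maps, false, 5);
    record_capture_count(maps, false, 3);  // duplicates are harmless
    return counts_to_emit(maps.device);
}
```

Because the bitmap records only which counts occurred, not how many lambdas used each one, the emitted preamble size depends on the number of distinct capture counts in the translation unit.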

Detection Traits

After all wrapper specializations are emitted, sub_6BCC20 emits SFINAE trait templates that allow compile-time detection of device-lambda wrapper types. These are emitted AFTER the host-device wrapper infrastructure (steps 7-12 in the emission sequence), not immediately after the device bitmap scan. Each trait + its #define macro is emitted as a single a1() call:

// Emitted as one string (step 13 in sub_6BCC20):
template <typename T>
struct __nv_extended_device_lambda_trait_helper {
  static const bool value = false;
};
template <typename T1, typename...Pack>
struct __nv_extended_device_lambda_trait_helper<__nv_dl_wrapper_t<T1, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_device_lambda_closure_type(X) __nv_extended_device_lambda_trait_helper< typename __nv_lambda_trait_remove_cv<X>::type>::value

Note: in the binary, the #define is a single line (no backslash continuation). The 2-space indentation on static const bool matches the binary exactly.

An unwrapper trait strips the wrapper to recover the inner tag type (step 14 in emission):

template<typename T> struct __nv_lambda_trait_remove_dl_wrapper { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_dl_wrapper< __nv_dl_wrapper_t<T> > { typedef T type; };

A separate trait detects whether a wrapper uses a trailing-return tag (step 15 in emission):

template <typename T>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper {
  static const bool value = false;
};
template <typename U, U func, typename Return, unsigned Id, typename...Pack>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper<__nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) __nv_extended_device_lambda_with_trailing_return_trait_helper< typename __nv_lambda_trait_remove_cv<X>::type >::value

Note: the emission order in sub_6BCC20 is: device trait (step 13), then __nv_lambda_trait_remove_dl_wrapper (step 14), then trailing-return trait (step 15), then host-device trait (step 16). The unwrapper appears between the two detection traits, not after both of them.

These traits and macros enable the CUDA runtime headers and device compiler to distinguish wrapped device lambdas from ordinary closure types at compile time, which is necessary for proper template argument deduction in kernel launch expressions.
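
The detection pattern can be exercised with stand-in types. In this sketch every `demo_` name is hypothetical, `std::remove_cv` substitutes for `__nv_lambda_trait_remove_cv` (whose definition is not shown in this section), and the wrapper is an empty placeholder rather than the emitted struct.

```cpp
#include <type_traits>

// Empty placeholders with the same template shapes as the emitted types.
template <typename Tag, typename... Pack>
struct demo_dl_wrapper { };

template <typename U, U func, typename Return, unsigned Id>
struct demo_trailing_tag { };

struct demo_plain_tag { };

// Device-lambda detection: true only for demo_dl_wrapper instantiations.
template <typename T>
struct demo_is_dl { static const bool value = false; };

template <typename T1, typename... Pack>
struct demo_is_dl<demo_dl_wrapper<T1, Pack...> > {
    static const bool value = true;
};

// Trailing-return detection: true only when the tag is demo_trailing_tag.
template <typename T>
struct demo_has_trailing { static const bool value = false; };

template <typename U, U func, typename Return, unsigned Id, typename... Pack>
struct demo_has_trailing<
    demo_dl_wrapper<demo_trailing_tag<U, func, Return, Id>, Pack...> > {
    static const bool value = true;
};

// Macro form that strips cv-qualifiers first, like the emitted #define.
#define DEMO_IS_DEVICE_LAMBDA(X) \
    demo_is_dl<typename std::remove_cv<X>::type>::value

inline int demo_identity(int x) { return x; }

// Typedefs keep commas out of the macro argument.
typedef demo_dl_wrapper<demo_plain_tag, int> DemoPlain;
typedef demo_dl_wrapper<
    demo_trailing_tag<int (*)(int), &demo_identity, long, 0u>, int> DemoTrailing;
```

With these definitions, `DEMO_IS_DEVICE_LAMBDA(const DemoPlain)` is true while `DEMO_IS_DEVICE_LAMBDA(int)` is false, and only `DemoTrailing` satisfies `demo_has_trailing`.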

Function Map

Address	Name (recovered)	Source	Lines	Role
`sub_6BB790`	`emit_device_lambda_wrapper_specialization`	`nv_transforms.c`	191	Emit `__nv_dl_wrapper_t<Tag, F1..FN>` for N captures (both tag variants)
`sub_6BCC20`	`nv_emit_lambda_preamble`	`nv_transforms.c`	244	Master emitter: primary template, zero-capture, bitmap scan, traits
`sub_6BCBF0`	`nv_record_capture_count`	`nv_transforms.c`	13	Set bit N in device or host-device bitmap
`sub_6BCBC0`	`nv_reset_capture_bitmasks`	`nv_transforms.c`	9	Zero both 128-byte bitmaps before each TU
`sub_47B890`	`gen_lambda`	`cp_gen_be.c`	336	Per-lambda wrapper call emission in `.int.c` output
`sub_467E50`	`emit_string`	`cp_gen_be.c`	--	Low-level string emitter to output buffer
`sub_46BC80`	`emit_preprocessor_directive`	`cp_gen_be.c`	--	Emit `#if 0` / `#endif` suppression blocks
`sub_475820`	`emit_closure_tag_type`	`cp_gen_be.c`	--	Emit tag type for wrapper construction
`sub_46E640`	`emit_capture_type_list`	`cp_gen_be.c`	--	Emit template argument list of capture types
`sub_46E550`	`emit_capture_value_list`	`cp_gen_be.c`	--	Emit constructor arguments (captured values)
`sub_6BC290`	`emit_array_capture_helpers`	`nv_transforms.c`	183	Emit `__nv_lambda_array_wrapper` for dim 2-8

Global State

Variable	Address	Purpose
`unk_1286980`	`0x1286980`	Device lambda capture-count bitmap (128 bytes, 1024 bits)
`dword_106BF38`	`0x106BF38`	`--extended-lambda` mode flag (enables entire system)
`dword_1065834`	`0x1065834`	Preprocessor nesting depth (decremented after each `#if 0` and `#endif` directive is emitted)
`dword_1065820`	`0x1065820`	Output state flag (reset after `#endif` emission)
`qword_1065828`	`0x1065828`	Output state pointer (reset after `#endif` emission)

Extended Lambda Overview -- End-to-end flow through the five pipeline stages
Host-Device Lambda Wrapper -- __nv_hdl_wrapper_t type-erased design
Capture Handling -- __nv_lambda_field_type, __nv_lambda_array_wrapper for array captures
Preamble Injection -- sub_6BCC20 full emission sequence
Lambda Restrictions -- Validation rules and error codes
Kernel Stub Generation -- Parallel #if 0 suppression pattern for __global__ functions

Host-Device Lambda Wrapper

The __nv_hdl_wrapper_t template is cudafe++'s type-erased wrapper for __host__ __device__ extended lambdas. Unlike the device-only __nv_dl_wrapper_t, which is a plain struct of captured fields, the host-device wrapper must operate on both the host (through the host compiler) and the device (through cicc). This dual requirement forces a fundamentally different design: the wrapper uses void*-based type erasure with a manager<Lambda> inner struct that provides do_copy, do_call, and do_delete operations as static functions. The Lambda type is known only inside the constructor -- after construction, all operations go through the type-erased function pointer table stored in __nv_hdl_helper.

A second, lightweight path exists for lambdas that have no captures and can convert to a raw function pointer. When HasFuncPtrConv=true, the wrapper skips heap allocation entirely and stores the lambda directly as a function pointer via fp_noobject_caller, providing an operator __opfunc_t*() conversion operator.

Both paths are generated as raw C++ source text by two nearly-identical emitter functions in nv_transforms.c: sub_6BBB10 (non-mutable, IsMutable=false, const operator()) and sub_6BBEE0 (mutable, IsMutable=true, non-const operator()). For each capture count N observed during frontend parsing, the preamble emitter (sub_6BCC20) calls each function twice -- once with HasFuncPtrConv=0 and once with HasFuncPtrConv=1 -- producing four partial specializations per capture count: (non-mutable, no-fptr), (mutable, no-fptr), (non-mutable, fptr), (mutable, fptr).
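
The heap-allocating, type-erased path described above can be sketched in ordinary C++. This is a simplified model, not the emitted text: `demo_hdl_helper`, `demo_hdl_wrapper`, and `demo_tag_a` are hypothetical names, and the sketch omits the `IsMutable`, `HasFuncPtrConv`, and `NeverThrows` parameters, the captured-field members, and the anonymous namespace.

```cpp
// Per-tag static function pointer table (the role of __nv_hdl_helper).
template <typename Tag, typename R, typename... Args>
struct demo_hdl_helper {
    typedef void* (*fp_copier_t)(void*);
    typedef R (*fp_caller_t)(void*, Args...);
    typedef void (*fp_deleter_t)(void*);
    static fp_copier_t fp_copier;
    static fp_caller_t fp_caller;
    static fp_deleter_t fp_deleter;
};
template <typename Tag, typename R, typename... Args>
typename demo_hdl_helper<Tag, R, Args...>::fp_copier_t
    demo_hdl_helper<Tag, R, Args...>::fp_copier = nullptr;
template <typename Tag, typename R, typename... Args>
typename demo_hdl_helper<Tag, R, Args...>::fp_caller_t
    demo_hdl_helper<Tag, R, Args...>::fp_caller = nullptr;
template <typename Tag, typename R, typename... Args>
typename demo_hdl_helper<Tag, R, Args...>::fp_deleter_t
    demo_hdl_helper<Tag, R, Args...>::fp_deleter = nullptr;

template <typename Tag, typename OpFunc> class demo_hdl_wrapper;

template <typename Tag, typename R, typename... Args>
class demo_hdl_wrapper<Tag, R(Args...)> {
    typedef demo_hdl_helper<Tag, R, Args...> helper;
    void* data_;  // type-erased heap copy of the lambda

    // The only place where the concrete Lambda type is visible.
    template <typename Lambda>
    struct manager {
        static void* do_copy(void* p) { return new Lambda(*static_cast<Lambda*>(p)); }
        static R do_call(void* p, Args... a) { return (*static_cast<Lambda*>(p))(a...); }
        static void do_delete(void* p) { delete static_cast<Lambda*>(p); }
    };

public:
    template <typename Lambda>
    explicit demo_hdl_wrapper(const Lambda& l) : data_(new Lambda(l)) {
        helper::fp_copier = &manager<Lambda>::do_copy;
        helper::fp_caller = &manager<Lambda>::do_call;
        helper::fp_deleter = &manager<Lambda>::do_delete;
    }
    demo_hdl_wrapper(const demo_hdl_wrapper& o) : data_(helper::fp_copier(o.data_)) { }
    demo_hdl_wrapper& operator=(const demo_hdl_wrapper&) = delete;
    ~demo_hdl_wrapper() { helper::fp_deleter(data_); }

    R operator()(Args... a) const { return helper::fp_caller(data_, a...); }
};

struct demo_tag_a { };  // one tag per lambda site keeps the tables separate

// Wrap a capturing lambda, copy the wrapper, and call through the copy.
inline int demo_hdl_roundtrip() {
    int scale = 3;
    auto lam = [scale](int x) { return x * scale; };
    demo_hdl_wrapper<demo_tag_a, int(int)> w(lam);
    demo_hdl_wrapper<demo_tag_a, int(int)> copy(w);
    return copy(14);
}
```

The copy constructor, call operator, and destructor never name the lambda type; they rely on the table filled in by the constructor, which is why each lambda site needs its own tag and therefore its own set of static pointers.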

Key Facts

Property	Value
Full template signature	`__nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, NeverThrows, Tag, OpFunc, Captures...>`
Source file	`nv_transforms.c` (EDG 6.6)
Non-mutable emitter	`sub_6BBB10` (238 lines, `IsMutable=false`)
Mutable emitter	`sub_6BBEE0` (236 lines, `IsMutable=true`)
Helper class	`__nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>` (anonymous namespace)
Factory	`__nv_hdl_create_wrapper_t<IsMutable, HasFuncPtrConv, Tag, CaptureArgs...>`
Trait deduction	`__nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>`
Bitmap	`unk_1286900` (128 bytes, 1024 bits)
Primary template static_assert	`"nvcc internal error: unexpected number of captures in __host__ __device__ lambda!"`
Specializations per capture count	4 (2 mutability x 2 HasFuncPtrConv); `sub_6BCC20` makes 4 emitter calls per capture count, each emitting one specialization
Noexcept variants	Additional 2 trait specializations when `dword_126E270` is set (C++17)

Template Parameters

template <bool IsMutable,       // false = const operator(), true = non-const
          bool HasFuncPtrConv,  // true = captureless, function pointer path
          bool NeverThrows,     // maps to noexcept(NeverThrows)
          typename Tag,         // unique tag type per lambda site
          typename OpFunc,      // operator() signature as R(Args...)
          typename... CapturedVarTypePack>  // captured variable types F1..FN
struct __nv_hdl_wrapper_t;

Parameter	Role
`IsMutable`	Controls whether `operator()` is `const`. `false` for lambdas without `mutable` keyword (the common case), `true` for `mutable` lambdas. Emitted as `"false,"` by `sub_6BBB10` and `"true,"` by `sub_6BBEE0`.
`HasFuncPtrConv`	`true` when the lambda has no captures and can be implicitly converted to a function pointer. Enables the lightweight `fp_noobject_caller` path instead of heap allocation. Passed as `a1` to the emitter functions.
`NeverThrows`	Propagated to `noexcept(NeverThrows)` on `operator()`. Set to `true` only when `dword_126E270` is active (C++17 noexcept-in-type-system) and the lambda's `operator()` is declared `noexcept`.
`Tag`	A unique type tag generated per lambda call site, used to give each `__nv_hdl_helper` instantiation its own static function pointer storage. Same tag system as device lambdas.
`OpFunc`	The lambda's call signature decomposed as `OpFuncR(OpFuncArgs...)`. Used to type the function pointers in `__nv_hdl_helper` and the wrapper's `operator()`.
`CapturedVarTypePack`	`F1, F2, ..., FN` -- one type per captured variable. Each becomes a field `typename __nv_lambda_field_type<Fi>::type fi` in the wrapper struct.

The __nv_hdl_helper Class

Before any __nv_hdl_wrapper_t specialization is emitted, sub_6BCC20 emits the __nv_hdl_helper class inside an anonymous namespace. This class holds the static function pointers that enable type erasure -- the Lambda type is known when the constructor assigns the pointers, but the operator(), copy constructor, and destructor access them without knowing the concrete Lambda type.

// Exact binary string (emitted as a single a1() call):
namespace {template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
struct __nv_hdl_helper {
  typedef void * (*fp_copier_t)(void *);
  typedef OpFuncR (*fp_caller_t)(void *, OpFuncArgs...);
  typedef void (*fp_deleter_t) (void *);
  typedef OpFuncR (*fp_noobject_caller_t)(OpFuncArgs...);
  static fp_copier_t fp_copier;
  static fp_deleter_t fp_deleter;
  static fp_caller_t fp_caller;
  static fp_noobject_caller_t fp_noobject_caller;
};

template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier_t __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier;

template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_deleter_t __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_deleter;

template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_caller_t __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_caller;
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_noobject_caller_t __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_noobject_caller;
}

Note three details in the binary that differ from a hand-written version: (1) namespace {template has no newline between the opening brace and template, (2) fp_deleter_t has a space before (void *) that the other typedefs lack: typedef void (*fp_deleter_t) (void *), (3) the blank line between fp_caller and fp_noobject_caller out-of-line definitions is missing -- they are separated by only one newline.

The anonymous namespace is critical: it gives each translation unit its own copy of the static function pointers, preventing ODR violations when multiple TUs use the same lambda tag type. The Tag parameter ensures that different lambda call sites within the same TU get independent function pointer storage even if they share the same OpFuncR(OpFuncArgs...) signature.
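
The effect of the Tag parameter can be illustrated with a reduced stand-in for the helper class. This is a sketch only: `helper_sketch`, `TagA`, `TagB`, and the two free functions are hypothetical names introduced for illustration, and only one of the four static pointers is modelled.

```cpp
#include <cassert>

namespace {
// Reduced stand-in for __nv_hdl_helper: one static function pointer per
// (Tag, signature) instantiation. Names here are illustrative assumptions.
template <typename Tag, typename R, typename... Args>
struct helper_sketch {
    typedef R (*fp_noobject_caller_t)(Args...);
    static fp_noobject_caller_t fp_noobject_caller;
};

template <typename Tag, typename R, typename... Args>
typename helper_sketch<Tag, R, Args...>::fp_noobject_caller_t
    helper_sketch<Tag, R, Args...>::fp_noobject_caller;
}  // namespace

// Hypothetical tag types standing in for the per-site lambda tags.
struct TagA {};
struct TagB {};

inline int add_one(int x) { return x + 1; }
inline int times_two(int x) { return x * 2; }

// Registers two callables that share the signature int(int) under different
// tags and reports whether both registrations survive side by side.
inline bool tags_keep_storage_separate() {
    helper_sketch<TagA, int, int>::fp_noobject_caller = &add_one;
    helper_sketch<TagB, int, int>::fp_noobject_caller = &times_two;
    return helper_sketch<TagA, int, int>::fp_noobject_caller(10) == 11 &&
           helper_sketch<TagB, int, int>::fp_noobject_caller(10) == 20;
}
```

If both registrations used the same tag, the second assignment would overwrite the first, which is the failure mode the per-site tag is meant to prevent.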

Function Pointer Roles

Pointer	Type	Set by	Used by	Purpose
`fp_copier`	`void *(*)(void *)`	Constructor (capturing path)	Copy constructor	Heap-allocates a new `Lambda` copy from the `void*` buffer
`fp_caller`	`OpFuncR (*)(void *, OpFuncArgs...)`	Constructor (capturing path)	`operator()`	Casts `void*` back to `Lambda*` and invokes it
`fp_deleter`	`void (*)(void *)`	Constructor (capturing path)	Destructor	Casts `void*` to `Lambda*` and `delete`s it
`fp_noobject_caller`	`OpFuncR(*)(OpFuncArgs...)`	Constructor (non-capturing path)	`operator()` + conversion operator	Stores the lambda directly as a function pointer

Type-Erasure Mechanism

The following diagram shows how a void* data pointer and the manager<Lambda> static functions work together to erase the concrete lambda type:

Construction (concrete Lambda type known):
============================================

  __nv_hdl_wrapper_t ctor(Tag{}, Lambda &&lam, F1 in1, ...)
       |
       |-- data = new Lambda(std::move(lam))          // heap-allocate
       |
       |-- __nv_hdl_helper<Tag,...>::fp_copier         // ASSIGN function pointers
       |       = &manager<Lambda>::do_copy             //   (Lambda type captured here)
       |-- __nv_hdl_helper<Tag,...>::fp_deleter
       |       = &manager<Lambda>::do_delete
       |-- __nv_hdl_helper<Tag,...>::fp_caller
       |       = &manager<Lambda>::do_call

After construction (Lambda type erased):
============================================

  __nv_hdl_wrapper_t
  +----------------------------+
  | f1, f2, ..., fN            |   captured variable fields (typed)
  | void *data ----------------+---> heap: Lambda object
  +----------------------------+
                                     (concrete type unknown here)
  operator()(args...):
       fp_caller(data, args...)
           |
           v
       manager<Lambda>::do_call(void *buf, args...)
           auto ptr = static_cast<Lambda*>(buf);
           return (*ptr)(args...);

  Copy ctor:
       data = fp_copier(in.data)
           |
           v
       manager<Lambda>::do_copy(void *buf)
           return new Lambda(*static_cast<Lambda*>(buf));

  Move ctor:
       data = in.data;  in.data = 0;     // pointer steal

  Destructor:
       fp_deleter(data)
           |
           v
       manager<Lambda>::do_delete(void *buf)
           delete static_cast<Lambda*>(buf);

The Tag template parameter is critical: it ensures each lambda call site gets its own set of __nv_hdl_helper static function pointers. Without Tag, two different lambdas with the same OpFuncR(OpFuncArgs...) signature would share the same function pointers, and the second constructor call would overwrite the first's fp_caller/fp_copier/fp_deleter.

The Capturing Path (HasFuncPtrConv=false)

When HasFuncPtrConv=false (the a1=0 path in the emitter), the wrapper uses heap allocation for type erasure. This is the full-weight path for lambdas that capture state.

Reconstructed Template (N captures, non-mutable)

The following is the complete C++ output reconstructed from sub_6BBB10 with a1=0 (HasFuncPtrConv=false) and a2=N captures:

template <bool NeverThrows, typename Tag, typename OpFuncR,
          typename... OpFuncArgs, typename F1, typename F2, /* ...FN */>
struct __nv_hdl_wrapper_t<false, false, NeverThrows, Tag,
                           OpFuncR(OpFuncArgs...), F1, F2, /* ...FN */> {
    // --- Captured fields ---
    typename __nv_lambda_field_type<F1>::type f1;
    typename __nv_lambda_field_type<F2>::type f2;
    // ...
    typename __nv_lambda_field_type<FN>::type fN;

    typedef OpFuncR(__opfunc_t)(OpFuncArgs...);

    // --- Data member for type-erased lambda ---
    void *data;

    // --- Type erasure manager ---
    template <typename Lambda>
    struct manager {
        static void *do_copy(void *buf) {
            auto ptr = static_cast<Lambda *>(buf);
            return static_cast<void *>(new Lambda(*ptr));
        };
        static OpFuncR do_call(void *buf, OpFuncArgs... args) {
            auto ptr = static_cast<Lambda *>(buf);
            return (*ptr)(std::forward<OpFuncArgs>(args)...);
        };
        static void do_delete(void *buf) {
            auto ptr = static_cast<Lambda *>(buf);
            delete ptr;
        }
    };

    // --- Constructor: heap-allocate Lambda, register function pointers ---
    template <typename Lambda>
    __nv_hdl_wrapper_t(Tag, Lambda &&lam, F1 in1, F2 in2, /* ...FN inN */)
        : f1(in1), f2(in2), /* ...fN(inN), */
          data(static_cast<void *>(new Lambda(std::move(lam)))) {
        __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier
            = &manager<Lambda>::do_copy;
        __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_deleter
            = &manager<Lambda>::do_delete;
        __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_caller
            = &manager<Lambda>::do_call;
    }

    // --- Call operator: delegate through type-erased fp_caller ---
    // Binary emits: "OpFuncR operator() (OpFuncArgs... args) " + "const " + "noexcept(NeverThrows) "
    OpFuncR operator() (OpFuncArgs... args) const noexcept(NeverThrows) {
        return __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>
            ::fp_caller(data, std::forward<OpFuncArgs>(args)...);
    }

    // --- Copy constructor: delegate through fp_copier ---
    __nv_hdl_wrapper_t(const __nv_hdl_wrapper_t &in)
        : f1(in.f1), f2(in.f2), /* ...fN(in.fN), */
          data(__nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>
               ::fp_copier(in.data)) { }

    // --- Move constructor: steal void* pointer ---
    __nv_hdl_wrapper_t(__nv_hdl_wrapper_t &&in)
        : f1(std::move(in.f1)), f2(std::move(in.f2)), /* ...fN(std::move(in.fN)), */
          data(in.data) { in.data = 0; }

    // --- Destructor: delegate through fp_deleter ---
    ~__nv_hdl_wrapper_t(void) {
        __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_deleter(data);
    }

    // --- Copy assignment: deleted ---
    __nv_hdl_wrapper_t & operator=(const __nv_hdl_wrapper_t &in) = delete;
};

Key Design Decisions

Heap allocation in constructor. The lambda is std::moved into a heap-allocated copy via new Lambda(std::move(lam)). This erases the concrete type -- the wrapper only holds a void* afterward. The manager<Lambda> static methods are assigned to the __nv_hdl_helper static function pointers during construction, preserving the type information as function pointer values rather than as template parameters.

Static function pointers instead of vtable. Rather than using virtual functions, the wrapper stores the type-erasure operations in static function pointers on __nv_hdl_helper. This is an unconventional choice -- it means all wrappers with the same Tag share the same function pointer storage. This works because within a single translation unit, each tag corresponds to exactly one lambda closure type. The approach avoids vtable overhead (no virtual destructor, no vptr in the wrapper) at the cost of not being safe across multiple lambda types sharing a tag.

Move constructor steals pointer. The move constructor copies the void* data pointer and sets the source to 0 (null). The destructor unconditionally calls fp_deleter(data), so a null data pointer after move must be handled by the deleter. Since delete on a null pointer is a no-op in C++, the moved-from wrapper's destructor call is safe.

Copy assignment is deleted. Only copy construction and move construction are supported. This avoids the complexity of managing the void* lifetime during assignment (which would require deleting the old data and copying the new).
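
The ownership rules above can be exercised with a reduced model of the capturing path. This is a sketch under stated assumptions, not the emitted text: `erased_ops`, `wrapper_sketch`, `Adder`, `SiteTag`, and the `g_live_objects` counter are hypothetical names, the captured fields f1..fN are omitted, and `Adder` stands in for a capturing closure type.

```cpp
#include <cassert>
#include <utility>

// Static function-pointer table keyed by Tag, in the spirit of __nv_hdl_helper.
template <typename Tag, typename R, typename... Args>
struct erased_ops {
    typedef void *(*copier_t)(void *);
    typedef void (*deleter_t)(void *);
    typedef R (*caller_t)(void *, Args...);
    static copier_t fp_copier;
    static deleter_t fp_deleter;
    static caller_t fp_caller;
};
template <typename Tag, typename R, typename... Args>
typename erased_ops<Tag, R, Args...>::copier_t erased_ops<Tag, R, Args...>::fp_copier;
template <typename Tag, typename R, typename... Args>
typename erased_ops<Tag, R, Args...>::deleter_t erased_ops<Tag, R, Args...>::fp_deleter;
template <typename Tag, typename R, typename... Args>
typename erased_ops<Tag, R, Args...>::caller_t erased_ops<Tag, R, Args...>::fp_caller;

template <typename Tag, typename R, typename... Args>
struct wrapper_sketch {
    typedef erased_ops<Tag, R, Args...> ops;
    void *data;  // type-erased heap object

    template <typename Lambda>
    struct manager {
        static void *do_copy(void *buf) { return new Lambda(*static_cast<Lambda *>(buf)); }
        static R do_call(void *buf, Args... args) {
            return (*static_cast<Lambda *>(buf))(std::forward<Args>(args)...);
        }
        static void do_delete(void *buf) { delete static_cast<Lambda *>(buf); }
    };

    // The concrete type is known only here; it is recorded as pointer values.
    template <typename Lambda>
    wrapper_sketch(Tag, Lambda &&lam) : data(new Lambda(std::move(lam))) {
        ops::fp_copier = &manager<Lambda>::do_copy;
        ops::fp_deleter = &manager<Lambda>::do_delete;
        ops::fp_caller = &manager<Lambda>::do_call;
    }
    wrapper_sketch(const wrapper_sketch &in) : data(ops::fp_copier(in.data)) {}
    wrapper_sketch(wrapper_sketch &&in) : data(in.data) { in.data = 0; }
    ~wrapper_sketch() { ops::fp_deleter(data); }  // delete of null is a no-op
    wrapper_sketch &operator=(const wrapper_sketch &) = delete;

    R operator()(Args... args) const {
        return ops::fp_caller(data, std::forward<Args>(args)...);
    }
};

static int g_live_objects = 0;  // counts live Adder instances

// Hypothetical callable standing in for a capturing lambda's closure type.
struct Adder {
    int n;
    explicit Adder(int v) : n(v) { ++g_live_objects; }
    Adder(const Adder &o) : n(o.n) { ++g_live_objects; }
    Adder(Adder &&o) : n(o.n) { ++g_live_objects; }
    ~Adder() { --g_live_objects; }
    int operator()(int x) const { return x + n; }
};

struct SiteTag {};

// Builds a wrapper, copies it, moves from it, and calls both survivors.
inline int run_lifecycle() {
    wrapper_sketch<SiteTag, int, int> w(SiteTag{}, Adder(5));
    wrapper_sketch<SiteTag, int, int> copy(w);
    wrapper_sketch<SiteTag, int, int> moved(std::move(w));
    return copy(1) + moved(2);
}
```

The moved-from `w` holds a null `data` pointer when it is destroyed, so its `fp_deleter` call deletes nothing; the live-object counter returning to zero reflects that each heap object is freed exactly once.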

Zero-Capture Specialization

When a2=0 (no captures), the emitter skips the field declarations and the field portions of the member initializer lists. The wrapper degenerates to holding only void* data with no fN fields. The constructor takes only (Tag, Lambda&&) with no capture arguments. The copy and move constructors handle only the data member.

The Lightweight Path (HasFuncPtrConv=true)

When HasFuncPtrConv=true (the a1=1 path), the lambda has no captures and can be implicitly converted to a raw function pointer. The emitter produces a drastically simpler wrapper:

template <bool NeverThrows, typename Tag, typename OpFuncR,
          typename... OpFuncArgs>
struct __nv_hdl_wrapper_t<false, true, NeverThrows, Tag,
                           OpFuncR(OpFuncArgs...)> {
    typedef OpFuncR(__opfunc_t)(OpFuncArgs...);

    // --- Constructor: store lambda as function pointer ---
    template <typename Lambda>
    __nv_hdl_wrapper_t(Tag, Lambda &&lam)
     { __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_noobject_caller = lam; }

    // --- Call operator: invoke through stored function pointer ---
    // Binary: "OpFuncR operator() (OpFuncArgs... args) " + "const " + "noexcept(NeverThrows) "
    OpFuncR operator() (OpFuncArgs... args) const noexcept(NeverThrows) {
        return __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>
            ::fp_noobject_caller(std::forward<OpFuncArgs>(args)...);
    }

    // --- Function pointer conversion operator ---
    // Binary: "operator __opfunc_t * () const { ... }"
    operator __opfunc_t * () const {
        return __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_noobject_caller;
    }

    // --- Copy assignment: deleted ---
    __nv_hdl_wrapper_t & operator=(const __nv_hdl_wrapper_t &in) = delete;
};

No void* data member. No manager struct. No heap allocation. No copy constructor, move constructor, or destructor (the compiler-generated defaults suffice). The lambda is stored directly as a function pointer in fp_noobject_caller, and the wrapper provides an implicit conversion to __opfunc_t* -- the raw function pointer type matching the lambda's signature.

This path is selected when gen_lambda (sub_47B890) detects that the lambda has no capture list (*(_QWORD *)a1 == 0, the capture head pointer is null) and the lambda does not use capture-default = (bit 4 at byte[24] is clear). Additional conditions involving dword_126EFAC, dword_126EFA4, and qword_126EF98 (a version threshold at 0xEB27 = 60199, likely a CUDA toolkit version) gate this detection, suggesting the function-pointer conversion path was added in a specific toolkit release.
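
The lightweight path relies on the standard rule that a captureless closure converts implicitly to a function pointer. A reduced sketch of that wrapper shape follows; `noobject_helper`, `light_wrapper`, and `SquareTag` are hypothetical names, and the `IsMutable` and `NeverThrows` parameters are omitted for brevity.

```cpp
#include <cassert>
#include <utility>

namespace {
// Stand-in for the fp_noobject_caller slot of __nv_hdl_helper.
template <typename Tag, typename R, typename... Args>
struct noobject_helper {
    typedef R (*fp_noobject_caller_t)(Args...);
    static fp_noobject_caller_t fp_noobject_caller;
};
template <typename Tag, typename R, typename... Args>
typename noobject_helper<Tag, R, Args...>::fp_noobject_caller_t
    noobject_helper<Tag, R, Args...>::fp_noobject_caller;
}  // namespace

template <typename Tag, typename OpFunc>
struct light_wrapper;  // primary left undefined in this sketch

template <typename Tag, typename R, typename... Args>
struct light_wrapper<Tag, R(Args...)> {
    typedef R(opfunc_t)(Args...);

    template <typename Lambda>
    light_wrapper(Tag, Lambda &&lam) {
        // A captureless closure converts implicitly to a function pointer.
        noobject_helper<Tag, R, Args...>::fp_noobject_caller = lam;
    }

    R operator()(Args... args) const {
        return noobject_helper<Tag, R, Args...>::fp_noobject_caller(
            std::forward<Args>(args)...);
    }

    // Conversion back to the raw function pointer type.
    operator opfunc_t *() const {
        return noobject_helper<Tag, R, Args...>::fp_noobject_caller;
    }
};

struct SquareTag {};

// Calls the same captureless lambda through the conversion operator and
// through operator(), and returns the sum of both results.
inline int call_through_pointer(int x) {
    light_wrapper<SquareTag, int(int)> w(SquareTag{}, [](int v) { return v * v; });
    int (*raw)(int) = w;
    return raw(x) + w(x);
}
```

The wrapper object itself carries no state in this sketch; everything it needs lives in the static slot selected by the tag.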

Mutable vs Non-Mutable (sub_6BBB10 vs sub_6BBEE0)

The two emitter functions are structurally identical. The sole differences:

Aspect	`sub_6BBB10` (non-mutable)	`sub_6BBEE0` (mutable)
First template bool emitted	`"false,"`	`"true,"`
`operator()` qualifier	`a3("const ")` before `noexcept`	No `"const "` emission
Binary difference	Line 190: emits `"const "`	Line 188: skips to `noexcept`

In the decompiled binary, the two functions are 238 and 236 lines respectively. The 2-line difference is exactly the a3("const ") call present in sub_6BBB10 but absent from sub_6BBEE0.

For a mutable lambda, the C++ standard says operator() is non-const, allowing the lambda body to modify captured-by-value variables. The wrapper faithfully propagates this: sub_6BBEE0 generates operator() without the const qualifier. In the capturing path, this means the do_call function pointer invokes a non-const Lambda, which is sound because the lambda is heap-allocated and accessed through a mutable void*.

Emitter Call Matrix

sub_6BCC20 emits all four combinations for each set bit N in the host-device bitmap:

sub_6BBB10(0, N, emit);  // IsMutable=false, HasFuncPtrConv=false
sub_6BBEE0(0, N, emit);  // IsMutable=true,  HasFuncPtrConv=false
sub_6BBB10(1, N, emit);  // IsMutable=false, HasFuncPtrConv=true
sub_6BBEE0(1, N, emit);  // IsMutable=true,  HasFuncPtrConv=true

This produces four partial specializations per set bitmap bit N. The NeverThrows parameter remains a template parameter (not a partial-specialization value), handled at instantiation time. Note in the decompiled binary that the fourth call uses v9 (which holds v6 before the post-increment): v9 = v6++; ... sub_6BBEE0(1, v9, a1); -- all four calls use the same capture count N.

The __nv_hdl_helper_trait_outer Deduction Helper

After the per-capture-count specializations, sub_6BCC20 emits a trait class that deduces the wrapper return type from the lambda's operator() signature:

template <bool IsMutable, bool HasFuncPtrConv, typename ...CaptureArgs>
struct __nv_hdl_helper_trait_outer {
    // Primary: extract operator() signature via decltype(&Lambda::operator())
    template <typename Tag, typename Lambda>
    struct __nv_hdl_helper_trait
        : public __nv_hdl_helper_trait<Tag, decltype(&Lambda::operator())> { };

    // Specialization for const operator() (non-mutable lambda):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

    // Specialization for non-const operator() (mutable lambda):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...)> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

    // C++17 noexcept variants (only when dword_126E270 is set):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const noexcept> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) noexcept> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };
};

The trick here is that the primary __nv_hdl_helper_trait inherits from a specialization keyed on decltype(&Lambda::operator()). The compiler deduces the member function pointer type of operator(), which pattern-matches against one of the specializations (two without the C++17 noexcept variants, four with them). The non-noexcept specializations pass NeverThrows=false; the noexcept specializations pass NeverThrows=true. This is how the NeverThrows template parameter gets its value -- through trait deduction, not through an explicit argument.

The C++17 noexcept variants are gated on dword_126E270. In C++17, noexcept became part of the type system, so R(C::*)(Args...) noexcept is a distinct type from R(C::*)(Args...). Without the additional specializations, the compiler would fail to match noexcept member function pointers.
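
The deduction pattern can be reproduced in isolation. The sketch below uses hypothetical names (`call_sig`, `lambda_sig`, `deduced_mutable`, `deduced_never_throws`) and a `__cplusplus` check as a stand-in for the `dword_126E270` gate; it is an illustration of the technique, not the emitted trait.

```cpp
#include <cassert>
#include <type_traits>

// Matches the member-function-pointer type of a closure's operator().
template <typename MemFn>
struct call_sig;  // primary left undefined

template <typename C, typename R, typename... Args>
struct call_sig<R (C::*)(Args...) const> {
    typedef R type(Args...);
    static const bool is_mutable = false;
    static const bool never_throws = false;
};

template <typename C, typename R, typename... Args>
struct call_sig<R (C::*)(Args...)> {
    typedef R type(Args...);
    static const bool is_mutable = true;
    static const bool never_throws = false;
};

#if __cplusplus >= 201703L
// Only meaningful once noexcept is part of the function type.
template <typename C, typename R, typename... Args>
struct call_sig<R (C::*)(Args...) const noexcept> {
    typedef R type(Args...);
    static const bool is_mutable = false;
    static const bool never_throws = true;
};

template <typename C, typename R, typename... Args>
struct call_sig<R (C::*)(Args...) noexcept> {
    typedef R type(Args...);
    static const bool is_mutable = true;
    static const bool never_throws = true;
};
#endif

// Inherits from whichever specialization matches decltype(&Lambda::operator()).
template <typename Lambda>
struct lambda_sig : call_sig<decltype(&Lambda::operator())> {};

template <typename Lambda>
inline bool deduced_mutable(const Lambda &) {
    return lambda_sig<Lambda>::is_mutable;
}

template <typename Lambda>
inline bool deduced_never_throws(const Lambda &) {
    return lambda_sig<Lambda>::never_throws;
}
```

Before C++17 a `noexcept` closure's `operator()` has the same pointer type as a throwing one, so it matches a non-noexcept specialization and the deduced flag is false. That is consistent with the noexcept specializations being emitted only when the C++17 flag is set.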

In the decompiled sub_6BCC20, the emission is split into three a1() calls: (1) the base struct with the const and non-const specializations (ending with the }; that closes the non-const specialization), (2) conditionally (if (dword_126E270)) the const noexcept and noexcept specializations, and (3) a1("\n};") to close the outer struct. The closing brace of __nv_hdl_helper_trait_outer is therefore always emitted, while the noexcept specializations inside it are conditional. In non-C++17 mode, the emitted text between the }; of the non-const specialization and the end of the outer struct is just \n};.

The __nv_hdl_create_wrapper_t Factory

The factory struct ties everything together. It provides a single static method, and the backend emits a call to that method at each host-device lambda usage site:

template <bool IsMutable, bool HasFuncPtrConv,
          typename Tag, typename... CaptureArgs>
struct __nv_hdl_create_wrapper_t {
    template <typename Lambda>
    static auto __nv_hdl_create_wrapper(Lambda &&lam, CaptureArgs... args)
        -> decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...))
    {
        typedef decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...)) container_type;
        return container_type(Tag{}, std::move(lam), args...);
    }
};

The trailing return type uses decltype to invoke the trait chain and deduce the exact __nv_hdl_wrapper_t specialization. The body constructs that deduced type with Tag{} (a value-initialized tag), the moved lambda, and the capture arguments.

Backend Emission at Lambda Call Site

When gen_lambda (sub_47B890) encounters a host-device lambda (bit 4 set at byte[25]), it emits the factory call in two phases:

Phase 1 (before lambda body): Opens the factory call with template arguments and the method name:

__nv_hdl_create_wrapper_t< IsMutable, HasFuncPtrConv, Tag, CaptureTypes... >
    ::__nv_hdl_create_wrapper(

Phase 2 (after lambda body): The lambda expression is emitted as the first argument to __nv_hdl_create_wrapper, then the captured value expressions are appended as trailing arguments, followed by the closing ):

    /* lambda expression emitted inline */,
    capture_arg1, capture_arg2, ... )

This differs from the device lambda path where the original lambda body is wrapped in #if 0 / #endif. In the host-device path, the lambda is passed by rvalue reference to the factory method, which moves it into a heap-allocated copy for type erasure. The captured values are passed separately (via sub_46E550 at line 323 of the decompiled binary) so the wrapper can store them as typed fields alongside the void* data.

The IsMutable decision comes from byte[24] & 0x02 (mutable keyword present). The HasFuncPtrConv decision involves nested conditions, all gated on the capture list head being null (*(_QWORD *)a1 == 0):

HasFuncPtrConv = false;  // default
if (capture_list_head == NULL) {
    if (dword_126EFAC && !dword_126EFA4 && qword_126EF98 <= 0xEB27) {
        HasFuncPtrConv = true;   // forced true for old toolkit versions
    } else {
        // General path: true iff no capture-default '='
        HasFuncPtrConv = !(byte[24] & 0x10);
    }
}

When dword_126EFAC is set and dword_126EFA4 is clear, the version value in qword_126EF98 (likely a toolkit version, as noted above) is compared against 0xEB27 (60199). At or below this threshold, HasFuncPtrConv is unconditionally true. Above the threshold, control falls through to the general path, which checks whether the lambda has a capture-default = (bit 4 at byte[24]): if there is no = default, the lambda is captureless and can convert to a function pointer.

This logic is at sub_47B890 lines 62-77 of the decompiled binary.
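
The decision can be restated as a small pure function, which makes the individual cases easy to check. This is a sketch under assumptions: the struct and field names are hypothetical, the signedness of `qword_126EF98` is assumed, and the meaning of the two gate flags is not known beyond how they are tested.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of the per-lambda state read by gen_lambda.
struct LambdaSiteState {
    bool capture_list_empty;  // models *(_QWORD *)a1 == 0
    std::uint8_t flags24;     // models byte[24]
};

// Hypothetical view of the three globals consulted by the check.
struct GlobalGates {
    int gate_126EFAC;
    int gate_126EFA4;
    std::int64_t version_126EF98;  // signedness is an assumption
};

const std::uint8_t kMutableBit = 0x02;         // mutable keyword present
const std::uint8_t kCaptureDefaultBit = 0x10;  // capture-default '=' present
const std::int64_t kVersionThreshold = 0xEB27; // 60199

inline bool is_mutable(const LambdaSiteState &s) {
    return (s.flags24 & kMutableBit) != 0;
}

inline bool has_func_ptr_conv(const LambdaSiteState &s, const GlobalGates &g) {
    if (!s.capture_list_empty) {
        return false;  // any explicit capture disables the conversion path
    }
    if (g.gate_126EFAC && !g.gate_126EFA4 &&
        g.version_126EF98 <= kVersionThreshold) {
        return true;   // forced true at or below the threshold
    }
    return (s.flags24 & kCaptureDefaultBit) == 0;  // general path
}
```

Written this way, the threshold branch is visibly the only case in which a lambda with a capture-default = still takes the function-pointer path.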

SFINAE Detection Traits

At the end of the preamble, sub_6BCC20 emits a detection trait and macro for identifying host-device lambda wrappers:

// Exact binary string (step 16 in sub_6BCC20, emitted as a single a1() call):
template <typename>
struct __nv_extended_host_device_lambda_trait_helper {
  static const bool value = false;
};
template <bool B1, bool B2, bool B3, typename T1, typename T2, typename...Pack>
struct __nv_extended_host_device_lambda_trait_helper<__nv_hdl_wrapper_t<B1, B2, B3, T1, T2, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_host_device_lambda_closure_type(X)  __nv_extended_host_device_lambda_trait_helper< typename __nv_lambda_trait_remove_cv<X>::type>::value

Note: binary has typename...Pack (no space), Pack...> > (space between angle brackets -- pre-C++11 syntax), two spaces before __nv_extended_host_device_lambda_trait_helper in the macro, and 2-space indentation on static const bool.

This allows compile-time detection of whether a type is a host-device lambda wrapper, used internally by the CUDA runtime headers and by nvcc to apply special handling to extended host-device lambda closure types.
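
A reduced model shows how the detection trait behaves. The names below (`wrapper_standin`, `is_hdl_wrapper`, `IS_HDL_WRAPPER_SKETCH`, `DemoTag`, `DemoWrapper`) are hypothetical, and `std::remove_cv` is used as a substitute for `__nv_lambda_trait_remove_cv`, whose definition is not shown here.

```cpp
#include <cassert>
#include <type_traits>

// Stand-in with the same parameter shape as __nv_hdl_wrapper_t.
template <bool, bool, bool, typename, typename, typename...>
struct wrapper_standin {};

// Primary: anything that is not a wrapper instantiation.
template <typename>
struct is_hdl_wrapper {
    static const bool value = false;
};

// Partial specialization: matches every wrapper_standin instantiation.
template <bool B1, bool B2, bool B3, typename T1, typename T2, typename... Pack>
struct is_hdl_wrapper<wrapper_standin<B1, B2, B3, T1, T2, Pack...> > {
    static const bool value = true;
};

// cv-qualifiers are stripped first so const-qualified wrappers still match.
#define IS_HDL_WRAPPER_SKETCH(X) \
    is_hdl_wrapper<typename std::remove_cv<X>::type>::value

struct DemoTag {};

// A typedef keeps template-argument commas out of the macro argument.
typedef wrapper_standin<false, false, false, DemoTag, int(int), int> DemoWrapper;
```

Passing a type that contains top-level commas directly to a function-like macro would split it into several macro arguments, which is why the sketch goes through a typedef.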

Emission Sequence in sub_6BCC20

The host-device wrapper infrastructure is emitted in steps 7-12 of the 20-step preamble emission sequence:

Step	Content	Function
7	`__nv_hdl_helper` class (anonymous namespace, 4 static function pointer members + out-of-line definitions)	`sub_6BCC20` inline
8	Primary `__nv_hdl_wrapper_t` with `static_assert` (catches unexpected capture counts)	`sub_6BCC20` inline
9	Per-capture-count specializations: for each bit N set in `unk_1286900`, emit 4 calls: `sub_6BBB10(0,N)`, `sub_6BBEE0(0,N)`, `sub_6BBB10(1,N)`, `sub_6BBEE0(1,N)`	`sub_6BBB10`, `sub_6BBEE0`
10	`__nv_hdl_helper_trait_outer` deduction helper (2 or 4 trait specializations depending on C++17)	`sub_6BCC20` inline
11	C++17 noexcept trait variants (conditional on `dword_126E270`)	`sub_6BCC20` inline
12	`__nv_hdl_create_wrapper_t` factory	`sub_6BCC20` inline

The bitmap scan loop for host-device wrappers differs from the device-lambda loop in one important way: bit 0 IS emitted. The device-lambda loop skips bit 0 (the zero-capture case is handled by the primary template), but the host-device loop processes every set bit including 0. This is because the zero-capture host-device wrapper still requires distinct specializations for the HasFuncPtrConv=true and HasFuncPtrConv=false paths.

// sub_6BCC20, host-device bitmap scan (decompiled)
v5 = (unsigned __int64 *)&unk_1286900;
v6 = 0;
do {
    v7 = *v5;
    v8 = v6 + 64;
    do {
        while ((v7 & 1) == 0) {    // skip unset bits
            ++v6;
            v7 >>= 1;
            if (v6 == v8) goto LABEL_13;
        }
        sub_6BBB10(0, v6, a1);     // non-mutable, HasFuncPtrConv=false
        sub_6BBEE0(0, v6, a1);     // mutable,     HasFuncPtrConv=false
        sub_6BBB10(1, v6, a1);     // non-mutable, HasFuncPtrConv=true
        v9 = v6++;
        v7 >>= 1;
        sub_6BBEE0(1, v9, a1);     // mutable,     HasFuncPtrConv=true
    } while (v6 != v8);
LABEL_13:
    ++v5;
} while (v6 != 1024);
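
For readability, the same scan can be written as structured loops. This is a sketch, not a claim about the original source: the type and function names are hypothetical, and the bit layout in the setter (word index n / 64, least significant bit first) is inferred from the scan loop above rather than from the decompiled setter.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

const unsigned kBitmapBits = 1024;
const unsigned kWordBits = 64;
const unsigned kWordCount = kBitmapBits / kWordBits;  // 16 words = 128 bytes

struct CaptureBitmap {
    std::uint64_t words[kWordCount];
};

inline void reset(CaptureBitmap &bm) {
    for (unsigned w = 0; w < kWordCount; ++w) bm.words[w] = 0;
}

// Assumed layout: capture count n maps to bit (n % 64) of word (n / 64).
inline void record_capture_count(CaptureBitmap &bm, unsigned n) {
    bm.words[n / kWordBits] |= std::uint64_t(1) << (n % kWordBits);
}

// One planned emitter call: which emitter, which a1 value, which N.
struct EmitCall {
    bool is_mutable;
    bool has_func_ptr_conv;
    unsigned capture_count;
};

// Returns every set bit index in ascending order, bit 0 included.
inline std::vector<unsigned> set_capture_counts(const CaptureBitmap &bm) {
    std::vector<unsigned> out;
    for (unsigned w = 0; w < kWordCount; ++w) {
        std::uint64_t bits = bm.words[w];
        for (unsigned b = 0; bits != 0; ++b, bits >>= 1) {
            if (bits & 1) out.push_back(w * kWordBits + b);
        }
    }
    return out;
}

// Expands each set bit into the four calls, in the order used by the loop.
inline std::vector<EmitCall> plan_emitter_calls(const CaptureBitmap &bm) {
    std::vector<EmitCall> plan;
    const std::vector<unsigned> counts = set_capture_counts(bm);
    for (unsigned i = 0; i < counts.size(); ++i) {
        const EmitCall calls[4] = {
            {false, false, counts[i]},  // non-mutable, a1 = 0
            {true, false, counts[i]},   // mutable,     a1 = 0
            {false, true, counts[i]},   // non-mutable, a1 = 1
            {true, true, counts[i]},    // mutable,     a1 = 1
        };
        plan.insert(plan.end(), calls, calls + 4);
    }
    return plan;
}
```

Unlike the decompiled loop, the inner loop here stops as soon as a word has no remaining set bits; the set of visited indices is the same.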

Comparison with Device Lambda Wrapper

Aspect	`__nv_dl_wrapper_t`	`__nv_hdl_wrapper_t`
Type erasure	None -- concrete fields only	`void*` + `manager<Lambda>` function pointers
Heap allocation	Never	Yes (capturing path) or never (HasFuncPtrConv path)
Copy semantics	Trivially copyable aggregate	Custom copy ctor via `fp_copier`; copy assign deleted
Move semantics	Default	Custom move ctor stealing `void*`; moved-from nulled
Destructor	Trivial	Calls `fp_deleter(data)`
`operator()` body	`return 0;` / `__builtin_unreachable()` (placeholder)	Delegates through `fp_caller` or `fp_noobject_caller`
Function pointer conversion	Not supported	`operator __opfunc_t * ()` when HasFuncPtrConv=true
Specializations per N	2 (standard tag + trailing-return tag)	4 (2 mutability x 2 HasFuncPtrConv)
Template params (partial spec)	`Tag, F1..FN`	`IsMutable, HasFuncPtrConv, NeverThrows, Tag, OpFuncR(OpFuncArgs...), F1..FN`

The host-device wrapper is fundamentally more complex because it must produce a callable object that works on both host and device. The device-only wrapper can use placeholder operator bodies (return 0) because the device compiler sees the original lambda body through a different mechanism. The host-device wrapper must actually call the lambda through the type-erased function pointer table.

Concrete Example: Host-Device Lambda with One Capture

User code:

auto add_n = [n] __host__ __device__ (int x) { return x + n; };
int result = add_n(42);

This lambda has one capture (n, by value), is not mutable (default), and cannot convert to a function pointer (it captures). The frontend sets bit 4 at byte[25] (host-device wrapper needed) and calls sub_6BCBF0(1, 1) to set bit 1 in the host-device bitmap unk_1286900.

During preamble emission, sub_6BCC20 sees bit 1 set and emits four specializations via sub_6BBB10(0,1), sub_6BBEE0(0,1), sub_6BBB10(1,1), sub_6BBEE0(1,1). The relevant one for this lambda (non-mutable, capturing) is from sub_6BBB10(0,1).

At the lambda call site, gen_lambda emits:

__nv_hdl_create_wrapper_t< false, false, __nv_dl_tag<...>, int >
    ::__nv_hdl_create_wrapper(
        [n] __host__ __device__ (int x) { return x + n; },
        n )

The factory method deduces the wrapper type via __nv_hdl_helper_trait_outer and constructs:

__nv_hdl_wrapper_t<false, false, false, Tag, int(int), int>

At runtime on the host: the constructor heap-allocates the lambda, stores n as field f1, and sets the fp_caller/fp_copier/fp_deleter static function pointers. Calling add_n(42) invokes fp_caller(data, 42) which casts void* back to the lambda type and calls operator()(42).

At runtime on the device: the same wrapper struct is memcpy'd to device memory. The device compiler sees the wrapper's fields and operator() which delegates through the function pointer table, resolving to the lambda body.

Emitter Function Signature

Both sub_6BBB10 and sub_6BBEE0 share the same prototype:

__int64 __fastcall sub_6BBB10(int a1, unsigned int a2,
                               void (__fastcall *a3)(const char *));

Parameter	Role
`a1`	`HasFuncPtrConv` flag. `0` = full type-erased path. `1` = lightweight function pointer path.
`a2`	Number of captured variables (0 to 1023).
`a3`	Emit callback. Called with C++ source text fragments that are concatenated to form the output.

The functions use a 1080-byte stack buffer (v28[1080]) for sprintf formatting of per-capture template parameters and field declarations. The buffer is large enough for field names up to F1023 / f1023 / in1023 with surrounding template syntax.
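
The callback-driven style can be sketched as follows. The exact fragment boundaries and whitespace used by the binary are not reproduced here; the function names, the `g_output` sink, and the fragment strings are illustrative assumptions, with only the F1..FN / f1..fN naming and the field declaration form taken from the reconstructed template.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Same shape as the a3 parameter: a callback receiving source text fragments.
typedef void (*emit_fn)(const char *);

static std::string g_output;  // hypothetical sink that concatenates fragments

inline void append_to_output(const char *text) { g_output += text; }

// Emits one ", typename Fi" fragment per capture, formatted into a stack buffer.
inline void emit_capture_type_params(unsigned capture_count, emit_fn emit) {
    char buf[1080];
    for (unsigned i = 1; i <= capture_count; ++i) {
        std::snprintf(buf, sizeof buf, ", typename F%u", i);
        emit(buf);
    }
}

// Emits one field declaration per capture, using __nv_lambda_field_type.
inline void emit_capture_fields(unsigned capture_count, emit_fn emit) {
    char buf[1080];
    for (unsigned i = 1; i <= capture_count; ++i) {
        std::snprintf(buf, sizeof buf,
                      "typename __nv_lambda_field_type<F%u>::type f%u;\n", i, i);
        emit(buf);
    }
}
```

The sketch uses `snprintf` with the buffer size rather than `sprintf`; with a maximum index of 1023 each fragment is far below the 1080-byte capacity either way.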

Key Functions

Address	Name	Lines	Role
`sub_6BBB10`	`emit_hdl_wrapper_nonmutable`	238	Emit `__nv_hdl_wrapper_t<false, ...>` specialization
`sub_6BBEE0`	`emit_hdl_wrapper_mutable`	236	Emit `__nv_hdl_wrapper_t<true, ...>` specialization
`sub_6BCC20`	`nv_emit_lambda_preamble`	244	Master emitter; calls each wrapper emitter twice for each set bitmap bit
`sub_47B890`	`gen_lambda`	336	Per-lambda site emission of `__nv_hdl_create_wrapper_t::__nv_hdl_create_wrapper(...)` call
`sub_6BCBF0`	`nv_record_capture_count`	13	Sets bit in `unk_1286900` bitmap during frontend scan
`sub_6BCBC0`	`nv_reset_capture_bitmasks`	9	Zeroes both bitmaps before each TU

Global State

Variable	Address	Purpose
`unk_1286900`	`0x1286900`	Host-device lambda capture-count bitmap (128 bytes, 1024 bits)
`dword_126E270`	`0x126E270`	C++17 noexcept-in-type-system flag; gates noexcept trait variants
`dword_126EFAC`	`0x126EFAC`	Influences HasFuncPtrConv deduction in `gen_lambda`
`dword_126EFA4`	`0x126EFA4`	Secondary gate for HasFuncPtrConv path
`qword_126EF98`	`0x126EF98`	Toolkit version threshold for HasFuncPtrConv (compared against `0xEB27`)
`dword_106BF38`	`0x106BF38`	Extended lambda mode flag (`--extended-lambda`)

Cross-References

Extended Lambda Overview -- end-to-end pipeline and bitmap system
Device Lambda Wrapper -- __nv_dl_wrapper_t simpler aggregate approach
Capture Handling -- __nv_lambda_field_type and array capture helpers
Preamble Injection -- sub_6BCC20 full emission sequence
Lambda Restrictions -- validation rules and error codes

Capture Handling

C++ lambdas capture variables by creating closure-class fields -- one field per captured entity. For scalars this is straightforward: the closure stores a copy (or reference) of the variable. Arrays present a problem: a closure can hold a by-value array member, but a C-style array cannot be passed by value or used to copy-initialize an ordinary class field, so the wrapper cannot simply declare an array-typed field and initialize it from a constructor argument. CUDA extended lambdas compound the problem: the wrapper template that carries captures across the host/device boundary needs a uniform way to express every field's type, including multi-dimensional arrays and const-qualified variants. cudafe++ solves this with two injected template families: __nv_lambda_field_type<T> (a type trait that maps each captured variable's declared type to a storable type) and __nv_lambda_array_wrapper<T[D1]...[DN]> (a wrapper struct that holds a deep copy of an N-dimensional array with element-by-element copy in its constructor).

A separate subsystem handles the backend code generator's emission of capture type declarations and capture value expressions for each lambda. nv_gen_extended_lambda_capture_types (sub_46E640) walks the capture list and emits decltype-based template arguments wrapped in __nvdl_remove_ref / __nvdl_remove_const / __nv_lambda_trait_remove_cv. sub_46E550 emits the corresponding capture values (variable names, this, *this, or init-capture expressions).

All of this is driven by a bitmap system that tracks which capture counts were actually used, so cudafe++ only emits the wrapper specializations that a given translation unit requires.

Key Facts

Property	Value
Field type trait	`__nv_lambda_field_type<T>`
Array wrapper	`__nv_lambda_array_wrapper<T[D1]...[DN]>`
Supported array dims	1D (identity) through 8D (7 specializations generated, for ranks 2-8)
Array helper emitter	`sub_6BC290` (`emit_array_capture_helpers`) in `nv_transforms.c`
Capture type emitter	`sub_46E640` (`nv_gen_extended_lambda_capture_types`) in `cp_gen_be.c`
Capture value emitter	`sub_46E550` in `cp_gen_be.c`
Device bitmap	`unk_1286980` (128 bytes = 1024 bits)
Host-device bitmap	`unk_1286900` (128 bytes = 1024 bits)
Bitmap initializer	`sub_6BCBC0` (`nv_reset_capture_bitmasks`)
Bitmap setter	`sub_6BCBF0` (`nv_record_capture_count`)

__nv_lambda_field_type

This is the type trait that maps every captured variable's declared type to a type suitable for storage in a wrapper struct field. For scalar types (and anything that is not an array), it is the identity:

template <typename T>
struct __nv_lambda_field_type {
    typedef T type;
};

For array types, it maps to the corresponding __nv_lambda_array_wrapper specialization. cudafe++ generates partial specializations for dimensions 2 through 8, each in both non-const and const variants.

Generated Specializations (Example: 3D)

// Non-const array
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_field_type<T [D1][D2][D3]> {
    typedef __nv_lambda_array_wrapper<T [D1][D2][D3]> type;
};

// Const array
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_field_type<const T [D1][D2][D3]> {
    typedef const __nv_lambda_array_wrapper<T [D1][D2][D3]> type;
};

For 1D arrays (T[D1]), no specialization is generated. The primary template handles them -- 1D arrays decay to pointers in standard capture, so this is the identity case. The explicit specializations cover dimensions 2 through 8 (template parameter lists with D1 through D2...D7 respectively).

Why Ranks 2-8

The loop in sub_6BC290 runs with counter v1 from 2 to 8 inclusive (while (v1 != 9)). Rank 1 is handled by the primary template. Rank 9+ triggers the static_assert in the unspecialized __nv_lambda_array_wrapper primary template. This bounds the maximum supported array dimensionality for lambda capture at 7D -- an extremely generous limit (standard CUDA kernels rarely exceed 3D arrays).

__nv_lambda_array_wrapper<T[D1]...[DN]>

The array wrapper is a struct that owns a copy of an N-dimensional C-style array. Since an array passed to the wrapper's constructor decays to a pointer and cannot directly initialize an array member, this wrapper provides the deep-copy semantics that CUDA extended lambdas need.

Primary Template (Trap)

The unspecialized primary template contains only a static_assert that always fires:

template <typename T>
struct __nv_lambda_array_wrapper {
    static_assert(sizeof(T) == 0,
        "nvcc internal error: unexpected failure in capturing array variable");
};

This catches any array dimensionality that falls outside the range [2, 8]. Since sizeof(T) is never zero for a real type, the assertion always fails if the primary template is instantiated.

Generated Specializations

For each rank N from 2 through 8, sub_6BC290 generates a partial specialization:

// Example: rank 3
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_array_wrapper<T [D1][D2][D3]> {
    T arr[D1][D2][D3];
    __nv_lambda_array_wrapper(const T in[D1][D2][D3]) {
        for(size_t i1 = 0; i1 < D1; ++i1)
        for(size_t i2 = 0; i2 < D2; ++i2)
        for(size_t i3 = 0; i3 < D3; ++i3)
            arr[i1][i2][i3] = in[i1][i2][i3];
    }
};

The constructor takes a const T in[D1]...[DN] parameter and performs element-by-element copy via nested for-loops. Each loop variable is named i1 through iN and iterates from 0 to D1 through DN respectively. The assignment arr[i1]...[iN] = in[i1]...[iN] copies each element.
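
One detail of the constructor signature is worth spelling out: the language adjusts a parameter declared as const T in[D1]...[DN] to a pointer to the first sub-array, so the leading extent D1 is not part of the parameter's type. The sketch below is illustrative only (sum_first_two_rows and demo_sum are hypothetical names, not something cudafe++ emits). It checks the adjustment at compile time and shows that a caller can pass an array with a different leading extent.

```cpp
#include <cassert>
#include <type_traits>

// A function type written with an array parameter is identical to the one
// written with the adjusted pointer parameter.
static_assert(std::is_same<void(const int[2][3]), void(const int (*)[3])>::value,
              "array parameters are adjusted to pointers");

// Hypothetical helper: the [2] in the parameter is documentation only.
inline int sum_first_two_rows(const int in[2][3]) {
    assert(in != nullptr);  // `in` really is a pointer here
    int total = 0;
    for (int i1 = 0; i1 < 2; ++i1)
        for (int i2 = 0; i2 < 3; ++i2)
            total += in[i1][i2];
    return total;
}

// Passing a 5x3 array compiles, because only the inner extent is checked.
inline int demo_sum() {
    int five_rows[5][3] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}, {13, 14, 15}};
    return sum_first_two_rows(five_rows);
}
```

In the generated wrappers the loop bounds come from the partial specialization's template parameters, so the loops always match the arr member even though the parameter type carries only the inner extents.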

Reconstructed Output for Rank 4

What sub_6BC290 actually emits for a 4-dimensional array (directly from the decompiled string fragments):

template<typename T, size_t D1, size_t D2, size_t D3, size_t D4>
struct __nv_lambda_array_wrapper<T [D1][D2][D3][D4]> {
    T arr[D1][D2][D3][D4];
    __nv_lambda_array_wrapper(const T in[D1][D2][D3][D4]) {
        for(size_t i1 = 0; i1  < D1; ++i1)
        for(size_t i2 = 0; i2  < D2; ++i2)
        for(size_t i3 = 0; i3  < D3; ++i3)
        for(size_t i4 = 0; i4  < D4; ++i4)
            arr[i1][i2][i3][i4] = in[i1][i2][i3][i4];
    }
};

Note the double space before < in the for condition -- it is present in the actual emitted code (visible in the decompiled sprintf format string "for(size_t i%u = 0; i%u  < D%u; ++i%u)").

sub_6BC290: emit_array_capture_helpers

Address 0x6BC290, 183 decompiled lines, in nv_transforms.c. Takes a single argument: void (*a1)(const char *), the text emission callback.

Algorithm

The function has two major loops, each iterating rank from 2 to 8.

Loop 1 -- Array wrapper specializations:

for rank = 2 to 8:
    emit "template<typename T"
    for d = 1 to rank-1:
        emit ", size_t D{d}"
    emit ">\nstruct __nv_lambda_array_wrapper<T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> {T arr"
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit ";\n__nv_lambda_array_wrapper(const T in"
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit ") {"
    for d = 1 to rank-1:
        emit "\nfor(size_t i{d} = 0; i{d}  < D{d}; ++i{d})"
    emit " arr"
    for d = 1 to rank-1:
        emit "[i{d}]"
    emit " = in"
    for d = 1 to rank-1:
        emit "[i{d}]"
    emit ";\n}\n};\n"

Loop 2 -- Field type specializations:

First emits the primary __nv_lambda_field_type:

emit "template <typename T>\nstruct __nv_lambda_field_type {\ntypedef T type;};"

Then for each rank from 2 to 8, emits two specializations (non-const and const):

for rank = 2 to 8:
    // Non-const specialization
    emit "template<typename T"
    for d = 1 to rank-1:
        emit ", size_t D{d}"
    emit ">\nstruct __nv_lambda_field_type<T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> {\ntypedef __nv_lambda_array_wrapper<T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> type;\n};\n"

    // Const specialization
    emit "template<typename T"
    for d = 1 to rank-1:
        emit ", size_t D{d}"
    emit ">\nstruct __nv_lambda_field_type<const T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> {\ntypedef const __nv_lambda_array_wrapper<T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> type;\n};\n"
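
The wrapper loop above can be sketched in C++ as follows. This is a minimal reconstruction under stated assumptions, not the decompiled function: make_array_wrapper_text and its ndims parameter are hypothetical, std::string stands in for the sprintf stack buffers, and the function takes the number of dimensions to emit directly rather than modelling the decompiled loop counter.

```cpp
#include <cassert>
#include <string>

// Hypothetical reconstruction of the text of one array-wrapper specialization.
// `ndims` is the number of [D..] dimensions in the emitted specialization.
inline std::string make_array_wrapper_text(unsigned ndims) {
    assert(ndims >= 1);
    std::string params, dims, loops, idx;
    for (unsigned d = 1; d <= ndims; ++d) {
        const std::string n = std::to_string(d);
        params += ", size_t D" + n;
        dims += "[D" + n + "]";
        // Two spaces before '<', matching the emitted text described above.
        loops += "\nfor(size_t i" + n + " = 0; i" + n + "  < D" + n + "; ++i" + n + ")";
        idx += "[i" + n + "]";
    }
    return "template<typename T" + params + ">\nstruct __nv_lambda_array_wrapper<T " + dims +
           "> {T arr" + dims + ";\n__nv_lambda_array_wrapper(const T in" + dims + ") {" +
           loops + " arr" + idx + " = in" + idx + ";\n}\n};\n";
}
```

Building the four repeated fragments (template parameters, extents, loop headers, subscripts) in a single pass keeps them in step with each other; the decompiled code instead re-runs a small loop for each fragment.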

Stack Usage

Two stack buffers: v33[1024] for the for-loop lines (the sprintf format includes four %u substitutions) and s[1064] for the dimension fragments (smaller format: "%s%u%s" with prefix/suffix).

Emission Order in Preamble

sub_6BC290 is called from sub_6BCC20 (nv_emit_lambda_preamble) at step 3, after __nvdl_remove_ref/__nvdl_remove_const trait helpers and __nv_dl_tag, but before the primary __nv_dl_wrapper_t definition. This ordering is critical: __nv_dl_wrapper_t field declarations reference __nv_lambda_field_type, which in turn references __nv_lambda_array_wrapper, so both must be defined first.

Capture Type Emission (sub_46E640)

Address 0x46E640, approximately 400 decompiled lines, in cp_gen_be.c. Confirmed identity: nv_gen_extended_lambda_capture_types (assert string at line 17368 of cp_gen_be.c).

This function emits the template type arguments that appear in a wrapper struct instantiation. For a device lambda wrapper __nv_dl_wrapper_t<Tag, F1, F2, ..., FN>, this function generates the F1 through FN types. Each type must precisely match the declared type of the captured variable, with references and top-level const stripped.

Input

Takes __int64 **a1 -- a pointer to the lambda info structure. The capture list is a linked list starting at *a1 (offset +0 of the lambda info). Each capture entry is a node with:

Offset	Size	Field
+0	8	`next` pointer (linked list)
+8	8	`variable_entity` -- pointer to the captured variable's entity node
+24	8	`init_capture_scope` -- scope for init-capture expressions
+32	1	`flags_byte_1` -- bit 0 = init-capture, bit 3 = `this` capture (vs `*this`), bit 7 = has braces/parens
+33	1	`flags_byte_2` -- bit 0 = paren-init (vs brace-init)

The variable entity at offset +8 has:

Offset +8: name string (null if *this capture)
Offset +163: sign bit (bit 7) -- if set, this is a *this or this capture

Algorithm: Three Capture Kinds

The function walks the capture list and for each entry, dispatches on two conditions: the init-capture flag (i[4] & 1) and the *this flag (byte at entity+163 sign bit).

Case 1: Regular variable capture ((i[4] & 1) == 0 and entity+163 >= 0)

Emits:

, typename __nvdl_remove_ref<decltype(varname)>::type

Where varname is the string at entity+8. This strips reference qualification from the variable's type. The decltype(varname) ensures the type is deduced from the actual declaration, not from any decay.

Case 2: this / *this capture ((i[4] & 1) == 0 and entity+163 < 0)

Two sub-cases, depending on whether the lambda captures the this pointer ([this]) or a copy of the object ([*this], available since C++17):

If i[4] & 8 (this pointer):

, decltype(this) const

Otherwise (*this by copy):

, typename __nvdl_remove_const<typename __nvdl_remove_ref<decltype(*this) > ::type> :: type

A trailing const is appended only when the lambda is not mutable: the check reads (byte)a1[3] & 2 and appends const when that bit is clear.

Case 3: Init-capture ((i[4] & 1) != 0)

Emits:

, typename __nv_lambda_trait_remove_cv<typename __nvdl_remove_ref<decltype({expr})>::type>::type

Where {expr} is the init-capture expression, emitted by calling sub_46D910 (the expression code generator). The expression is wrapped in {...} (brace-init) or (...) (paren-init) depending on byte+33 bit 0. The additional __nv_lambda_trait_remove_cv wrapper strips top-level const and volatile from the deduced type.
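
The three-way dispatch can be summarized in a short sketch. Everything here is an assumption made for illustration: CaptureSketch and capture_type_arg are hypothetical names, the struct is a flattened view of the node fields rather than the real layout, and the function returns strings instead of writing to the output stream. The const/mutable adjustment and the brace or paren wrapping of init-capture expressions are deliberately left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical, flattened view of one capture entry. Field names are
// illustrative; the comments give the offsets described above.
struct CaptureSketch {
    std::string text;      // variable name, or init-capture expression text
    bool is_init;          // flags byte at +32, bit 0
    bool is_this_family;   // sign bit of the byte at entity+163
    bool is_this_pointer;  // flags byte at +32, mask 8
};

// Returns the ", <type>" fragment for one capture, using the literal strings
// quoted in the three cases.
inline std::string capture_type_arg(const CaptureSketch& c) {
    if (c.is_init) {
        assert(!c.text.empty());  // an init-capture always has an expression
        return ", typename __nv_lambda_trait_remove_cv<typename __nvdl_remove_ref<decltype(" +
               c.text + ")>::type>::type";
    }
    if (c.is_this_family) {
        if (c.is_this_pointer) return ", decltype(this) const";
        return ", typename __nvdl_remove_const<typename __nvdl_remove_ref<decltype(*this) > "
               "::type> :: type";
    }
    assert(!c.text.empty());  // a regular capture always has a name
    return ", " "typename __nvdl_remove_ref<decltype(" + c.text + ")>::type";
}
```

Checking the init-capture flag first mirrors the order of the cases: the entity+163 test only matters once the entry is known not to be an init-capture.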

GCC Diagnostic Guards

When dword_126E1E8 is set (indicating the host compiler is GCC-based), the init-capture path wraps the decltype expression in pragma guards:

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunevaluated-expression"
decltype({expr})
#pragma GCC diagnostic pop

This suppresses GCC warnings about using decltype on expressions that are not evaluated. The flag dword_126E1E8 is likely set when the target host compiler is GCC rather than MSVC or Clang.

Character-by-Character Emission

The decompiled code reveals that sub_46E640 does not use sub_467E50 (emit string) for all output. For short constant strings like ", ", "typename __nvdl_remove_ref<decltype(", etc., it emits character-by-character via putc(ch, stream) with a manual loop. This is a common pattern in EDG's code generator where inline string emission avoids function-call overhead for fixed text.

The character counter dword_106581C tracks the column position for line-wrapping decisions. Each emission path increments it by the string length.

Capture Value Emission (sub_46E550)

Address 0x46E550, 60 decompiled lines, in cp_gen_be.c. This function emits the actual values passed to the wrapper constructor -- the runtime expressions that initialize each captured field.

Algorithm

Walks the same capture linked list. For each entry, emits , followed by:

Condition	Output
Regular variable (`(byte+32 & 1) == 0`, entity+163 >= 0)	Variable name string from entity+8
`this` pointer (`byte+32 & 8`, entity+163 < 0)	`this`
`*this` by copy (`(byte+32 & 8) == 0`, entity+163 < 0)	`*this`
Init-capture (`byte+32 & 1`)	The init-capture expression via `sub_46D910`

For init-captures, the expression is wrapped in (...) or {...} based on bit 0 of byte+33:

Bit 0 set: paren-init (expr)
Bit 0 clear: brace-init {expr}
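
The table reduces to a five-way choice per capture. The sketch below is a hedged illustration: CaptureKind and capture_value_arg are hypothetical names, the caller is assumed to have already classified the capture from the flag bytes, and the exact spacing of the separator is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification of one capture entry.
enum class CaptureKind { Variable, ThisPointer, StarThis, InitParen, InitBrace };

// Returns the ", <value>" fragment for one capture. `text` is the variable
// name or the init-capture expression; it is ignored for this / *this.
inline std::string capture_value_arg(CaptureKind kind, const std::string& text = "") {
    switch (kind) {
        case CaptureKind::ThisPointer: return ", this";
        case CaptureKind::StarThis:    return ", *this";
        case CaptureKind::InitParen:   return ", (" + text + ")";  // bit 0 of byte+33 set
        case CaptureKind::InitBrace:   return ", {" + text + "}";  // bit 0 of byte+33 clear
        case CaptureKind::Variable:    break;
    }
    assert(!text.empty());  // a regular capture always has a name
    return ", " + text;
}
```

Concatenating the fragments for captures x and y yields ", x, y", which follows the tag argument in the wrapper's constructor call.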

Relationship to Type Emission

sub_46E550 and sub_46E640 are called in sequence by the per-lambda wrapper emitter (sub_47B890, gen_lambda). The type emission produces the template type parameters; the value emission produces the constructor arguments. Together they construct an expression like:

__nv_dl_wrapper_t<
    __nv_dl_tag<decltype(&Closure::operator()), &Closure::operator(), 42>,
    typename __nvdl_remove_ref<decltype(x)>::type,
    typename __nvdl_remove_ref<decltype(y)>::type
>(tag, x, y)

Bitmap System

Rather than generating wrapper specializations for all possible capture counts (0 through 1023), cudafe++ maintains two 1024-bit bitmaps that record which counts were actually observed during frontend parsing. During preamble emission, only the specializations for set bits are generated.

Memory Layout

unk_1286980 (device lambda bitmap):
    Address: 0x1286980
    Size:    128 bytes = 16 x uint64_t = 1024 bits
    Bit N:   __nv_dl_wrapper_t specialization for N captures needed

unk_1286900 (host-device lambda bitmap):
    Address: 0x1286900
    Size:    128 bytes = 16 x uint64_t = 1024 bits
    Bit N:   __nv_hdl_wrapper_t specializations for N captures needed

sub_6BCBC0: nv_reset_capture_bitmasks

Address 0x6BCBC0, 9 decompiled lines. Called before each translation unit.

memset(&unk_1286980, 0, 0x80);   // Clear device bitmap (128 bytes)
memset(&unk_1286900, 0, 0x80);   // Clear host-device bitmap (128 bytes)

sub_6BCBF0: nv_record_capture_count

Address 0x6BCBF0, 13 decompiled lines. Called from scan_lambda (sub_447930) after counting captures.

_QWORD *result = &unk_1286900;          // Default: host-device bitmap
if (!a1)
    result = &unk_1286980;              // a1 == 0: device bitmap
result[a2 >> 6] |= 1LL << a2;          // Set bit a2

Parameters:

a1 (int): Bitmap selector. 0 = device, non-zero = host-device.
a2 (unsigned): Capture count (0-1023).

The bit-set logic: a2 >> 6 selects the uint64_t word (divides by 64), and 1LL << a2 sets the appropriate bit within that word. In C a shift count of 64 or more is undefined, but the compiled x86-64 shift instruction uses only the low 6 bits of the count, so the effective bit index is a2 mod 64 and stays consistent with the word index.

Note the mapping inversion: a1 == 0 maps to unk_1286980 (device), while a1 != 0 maps to unk_1286900 (host-device). This is counterintuitive but confirmed by the decompiled code.
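
A portable sketch of the same bookkeeping follows. The names CaptureBitmaps, record_capture_count, and has_count are hypothetical, and the two arrays stand in for the globals at 0x1286980 and 0x1286900. Unlike the decompiled code, the sketch masks the shift count explicitly so that it does not rely on hardware behaviour.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two 128-byte global bitmaps.
struct CaptureBitmaps {
    std::uint64_t device[16] = {};
    std::uint64_t host_device[16] = {};
};

// selector == 0 picks the device bitmap, anything else the host-device one,
// matching the inverted mapping noted above.
inline void record_capture_count(CaptureBitmaps& b, int selector, unsigned count) {
    assert(count < 1024);
    std::uint64_t* words = selector ? b.host_device : b.device;
    words[count >> 6] |= std::uint64_t{1} << (count & 63);
}

// Reads back a single bit from one bitmap.
inline bool has_count(const std::uint64_t (&words)[16], unsigned count) {
    assert(count < 1024);
    return ((words[count >> 6] >> (count & 63)) & 1u) != 0;
}
```

For example, recording count 65 in the host-device bitmap sets bit 1 of word 1, since 65 >> 6 is 1 and 65 & 63 is 1.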

Bitmap Scan in nv_emit_lambda_preamble

The scan loop in sub_6BCC20 processes each bitmap as 16 uint64_t words:

// Device lambda bitmap scan
uint64_t *ptr = (uint64_t *)&unk_1286980;
unsigned int idx = 0;
do {
    uint64_t word = *ptr;
    unsigned int limit = idx + 64;
    do {
        if (idx != 0 && (word & 1))
            sub_6BB790(idx, callback);   // emit_device_lambda_wrapper_specialization
        ++idx;
        word >>= 1;
    } while (limit != idx);
    ++ptr;
} while (limit != 1024);

// Host-device lambda bitmap scan
ptr = (uint64_t *)&unk_1286900;
idx = 0;
do {
    uint64_t word = *ptr;
    unsigned int limit = idx + 64;
    do {
        while ((word & 1) == 0) {    // Skip unset bits
            ++idx;
            word >>= 1;
            if (idx == limit) goto next_word;
        }
        sub_6BBB10(0, idx, callback);    // Non-mutable, HasFuncPtrConv=false
        sub_6BBEE0(0, idx, callback);    // Non-mutable, HasFuncPtrConv=true
        sub_6BBB10(1, idx, callback);    // Mutable, HasFuncPtrConv=false
        sub_6BBEE0(1, idx++, callback);  // Mutable, HasFuncPtrConv=true
        word >>= 1;
    } while (idx != limit);
next_word:
    ++ptr;
} while (idx != 1024);

Key differences between the two scans:

The device scan skips bit 0 (if (idx != 0 && ...)). The zero-capture case is handled by the primary template and its explicit <Tag> specialization already emitted as static text.
The host-device scan does not skip bit 0 -- zero-capture host-device lambdas (stateless lambdas with __host__ __device__) still need wrapper specializations because the host-device wrapper has function-pointer-conversion variants.
Each set bit in the host-device bitmap triggers four emitter calls (non-mutable/mutable x HasFuncPtrConv false/true), compared to one call per bit for device lambdas.
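
The difference in emitter-call counts can be checked with a small model of the two scans. This is a sketch under assumptions: for_each_count, device_emitter_calls, and host_device_emitter_calls are hypothetical names, and a single flat loop replaces the nested word/bit loops of the decompiled code (the visiting order is the same).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Visits every set bit in ascending order, optionally skipping bit 0.
inline void for_each_count(const std::uint64_t (&words)[16], bool skip_zero,
                           const std::function<void(unsigned)>& visit) {
    assert(static_cast<bool>(visit));  // a callable must be supplied
    for (unsigned idx = 0; idx < 1024; ++idx) {
        const bool set = ((words[idx / 64] >> (idx % 64)) & 1u) != 0;
        if (set && !(skip_zero && idx == 0)) visit(idx);
    }
}

// Device scan: one emitter call per set bit, bit 0 skipped.
inline unsigned device_emitter_calls(const std::uint64_t (&words)[16]) {
    unsigned calls = 0;
    for_each_count(words, true, [&](unsigned) { calls += 1; });
    return calls;
}

// Host-device scan: four emitter calls per set bit, bit 0 included.
inline unsigned host_device_emitter_calls(const std::uint64_t (&words)[16]) {
    unsigned calls = 0;
    for_each_count(words, false, [&](unsigned) { calls += 4; });
    return calls;
}
```

With bits 0, 1, and 3 set in the device bitmap the model reports 2 calls; with bits 0 and 2 set in the host-device bitmap it reports 8.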

How Fields Use __nv_lambda_field_type

When sub_6BB790 (emit_device_lambda_wrapper_specialization) generates a wrapper struct for N captures, each field is declared as:

typename __nv_lambda_field_type<F1>::type f1;
typename __nv_lambda_field_type<F2>::type f2;
// ... through fN

This indirection through __nv_lambda_field_type means:

If F1 is int, the field type is int (identity via primary template).
If F1 is float[3][4], the field type is __nv_lambda_array_wrapper<float[3][4]>, which stores a deep copy.
If F1 is const double[2][2], the field type is const __nv_lambda_array_wrapper<double[2][2]>.

The constructor mirrors this pattern:

__nv_dl_wrapper_t(Tag, F1 in1, F2 in2, ..., FN inN)
    : f1(in1), f2(in2), ..., fN(inN) { }

For array captures, the f1(in1) initialization invokes __nv_lambda_array_wrapper's constructor, which performs the element-by-element copy. For scalar captures, it is a trivial copy/move.
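
The field-type indirection can be exercised with a cut-down, compilable model. The names in namespace sketch (array_wrapper, field_type, wrapper2, tag) are hypothetical stand-ins for the generated templates, only the two-dimensional specialization is written out, and the primary array_wrapper is left undefined instead of carrying the static_assert trap.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

namespace sketch {
template <typename T>
struct array_wrapper;  // primary left undefined in this sketch

template <typename T, std::size_t D1, std::size_t D2>
struct array_wrapper<T[D1][D2]> {
    T arr[D1][D2];
    array_wrapper(const T in[D1][D2]) {
        for (std::size_t i1 = 0; i1 < D1; ++i1)
            for (std::size_t i2 = 0; i2 < D2; ++i2)
                arr[i1][i2] = in[i1][i2];
    }
};

template <typename T>
struct field_type { typedef T type; };  // identity for non-array types

template <typename T, std::size_t D1, std::size_t D2>
struct field_type<T[D1][D2]> { typedef array_wrapper<T[D1][D2]> type; };

struct tag {};

template <typename Tag, typename F1, typename F2>
struct wrapper2 {
    typename field_type<F1>::type f1;
    typename field_type<F2>::type f2;
    wrapper2(Tag, F1 in1, F2 in2) : f1(in1), f2(in2) {}
};
}  // namespace sketch

static_assert(std::is_same<sketch::field_type<int>::type, int>::value, "identity case");
static_assert(std::is_same<sketch::field_type<float[3][4]>::type,
                           sketch::array_wrapper<float[3][4]>>::value, "array case");

// Builds a wrapper, then changes the source array to show the copy is independent.
inline float copied_corner() {
    float m[3][4] = {};
    m[2][3] = 7.5f;
    sketch::wrapper2<sketch::tag, int, float[3][4]> w(sketch::tag{}, 42, m);
    m[2][3] = 0.0f;
    assert(w.f1 == 42);
    return w.f2.arr[2][3];
}
```

Although F2 is float[3][4], the constructor parameter in2 is a pointer after adjustment; the extents needed for the copy come from the array_wrapper specialization selected through field_type.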

End-to-End Example

Given user code:

int x = 42;
float matrix[3][4];
auto lam = [x, matrix]() __device__ { /* use x and matrix */ };

cudafe++ produces:

Frontend (scan_lambda): Counts 2 captures. Calls sub_6BCBF0(0, 2) to set bit 2 in the device bitmap.
Preamble emission (sub_6BCC20): Scans the device bitmap, finds bit 2 set. Calls sub_6BB790(2, emit) which generates:

template <typename Tag, typename F1, typename F2>
struct __nv_dl_wrapper_t<Tag, F1, F2> {
    typename __nv_lambda_field_type<F1>::type f1;
    typename __nv_lambda_field_type<F2>::type f2;
    __nv_dl_wrapper_t(Tag, F1 in1, F2 in2) : f1(in1), f2(in2) { }
    template <typename...U1>
    int operator()(U1...) { return 0; }
};

Per-lambda emission (sub_47B890 calling sub_46E640 and sub_46E550):

__nv_dl_wrapper_t<
    __nv_dl_tag<decltype(&ClosureType::operator()), &ClosureType::operator(), 0>,
    typename __nvdl_remove_ref<decltype(x)>::type,        // int
    typename __nvdl_remove_ref<decltype(matrix)>::type     // float[3][4]
>(tag, x, matrix)

Template instantiation: The host compiler instantiates the wrapper. F1 = int so __nv_lambda_field_type<int>::type = int (identity). F2 = float[3][4] so __nv_lambda_field_type<float[3][4]>::type = __nv_lambda_array_wrapper<float[3][4]>, which triggers the rank-2 specialization with its nested double for-loop constructor.

Function Map

Address	Name (recovered)	Source	Lines	Role
`sub_6BC290`	`emit_array_capture_helpers`	`nv_transforms.c`	183	Emit `__nv_lambda_array_wrapper` (ranks 2-8) and `__nv_lambda_field_type` specializations
`sub_6BCBC0`	`nv_reset_capture_bitmasks`	`nv_transforms.c`	9	Zero both 128-byte bitmaps at translation unit start
`sub_6BCBF0`	`nv_record_capture_count`	`nv_transforms.c`	13	Set bit N in device or host-device bitmap
`sub_6BCC20`	`nv_emit_lambda_preamble`	`nv_transforms.c`	244	Master emitter -- scans bitmaps, calls all sub-emitters
`sub_6BB790`	`emit_device_lambda_wrapper_specialization`	`nv_transforms.c`	191	Emit `__nv_dl_wrapper_t<Tag, F1..FN>` for N captures
`sub_46E640`	`nv_gen_extended_lambda_capture_types`	`cp_gen_be.c`	~400	Emit `decltype`-based template type args for each capture
`sub_46E550`	(capture value emitter)	`cp_gen_be.c`	~60	Emit variable names / `this` / `*this` / init-capture exprs
`sub_46D910`	(expression code generator)	`cp_gen_be.c`	--	Called by both `sub_46E640` and `sub_46E550` for init-captures
`sub_467E50`	(emit string to output)	`cp_gen_be.c`	--	String emission helper used by code generator
`sub_467DA0`	(column tracking helper)	`cp_gen_be.c`	--	Called when `dword_1065818` is set for line-length management

Global State

Variable	Address	Size	Purpose
`unk_1286980`	`0x1286980`	128 bytes	Device lambda capture-count bitmap
`unk_1286900`	`0x1286900`	128 bytes	Host-device lambda capture-count bitmap
`dword_106581C`	`0x106581C`	4 bytes	Column counter for output line tracking
`dword_1065818`	`0x1065818`	4 bytes	Line-length management enabled flag
`dword_126E1E8`	`0x126E1E8`	4 bytes	GCC-compatible host compiler flag (enables diagnostic pragmas)
`stream`	(global)	8 bytes	Output FILE* for code generation

Extended Lambda Overview -- end-to-end lambda pipeline and lambda_info structure
Device Lambda Wrapper -- __nv_dl_wrapper_t template anatomy
Host-Device Lambda Wrapper -- __nv_hdl_wrapper_t type-erased design
Preamble Injection -- sub_6BCC20 emission sequence in full detail
Lambda Restrictions -- validation errors for malformed captures

Preamble Injection

The entire CUDA extended lambda template library -- every __nv_dl_wrapper_t, every __nv_hdl_wrapper_t, every trait helper and detection macro -- enters the compilation through a single function: sub_6BCC20 (nv_emit_lambda_preamble). This 244-line function in nv_transforms.c accepts a void(*emit)(const char*) callback and produces raw C++ source text that is injected into the .int.c output stream. The preamble is emitted exactly once per translation unit, triggered by a sentinel type declaration named __nv_lambda_preheader_injection. The trigger mechanism lives in sub_4864F0 (gen_type_decl in cp_gen_be.c), which string-compares each type declaration's name against the sentinel marker, emits a synthetic #line directive, and then calls the master emitter.

The preamble contains 20 logical emission steps, ranging from simple type traits (4 lines each) to bitmap-driven loops that generate hundreds of template specializations. The design is driven by a critical optimization: rather than emitting all 1024 possible capture-count specializations for each wrapper type, cudafe++ maintains two 1024-bit bitmaps (unk_1286980 for device lambdas, unk_1286900 for host-device lambdas) that track which capture counts were actually used during frontend parsing. The preamble emitter scans these bitmaps and generates only the specializations that the translation unit requires.

Key Facts

Property	Value
Master emitter	`sub_6BCC20` (`nv_emit_lambda_preamble`, 244 lines, `nv_transforms.c`)
Trigger function	`sub_4864F0` (`gen_type_decl`, 751 lines, `cp_gen_be.c`)
Emit callback (typical)	`sub_467E50` (raw text output to `.int.c` stream)
Sentinel type name	`__nv_lambda_preheader_injection`
Synthetic source file	`"nvcc_internal_extended_lambda_implementation"`
Enable flag	`dword_106BF38` (`--extended-lambda` / `--expt-extended-lambda`)
Device bitmap	`unk_1286980` (128 bytes = 16 x `uint64` = 1024 bits)
Host-device bitmap	`unk_1286900` (128 bytes = 16 x `uint64` = 1024 bits)
C++17 noexcept gate	`dword_126E270` (controls noexcept trait variants)
One-shot guarantee	Once emitted, the sentinel type is wrapped in `#if 0` / `#endif`
Max capture count	1023 (1024 bit indices, 0..1023)
Array dimension range	2D through 8D (7 specializations per wrapper)

Trigger Mechanism: sub_4864F0 (gen_type_decl)

The preamble is not emitted eagerly at the start of compilation. Instead, the EDG frontend inserts a synthetic type declaration named __nv_lambda_preheader_injection into the IL at the point where the lambda template library is needed. During backend code generation, sub_4864F0 (the type declaration emitter in cp_gen_be.c) encounters this declaration and performs the following sequence:

// sub_4864F0, decompiled lines 200-242
// Check: is this a type tagged with the preheader marker? (bit at v4-8 & 0x10)
if ((*(_BYTE *)(v4 - 8) & 0x10) != 0)
{
    if (dword_106BF38)                           // --extended-lambda enabled?
    {
        v18 = *(_QWORD *)(v4 + 8);              // get type name pointer
        if (v18)
        {
            // Compare name against "__nv_lambda_preheader_injection" (31 chars + NUL)
            v30 = "__nv_lambda_preheader_injection";
            v31 = 32;                             // comparison length
            do {
                if (!v31) break;
                v29 = *(_BYTE *)v18++ == *v30++;
                --v31;
            } while (v29);

            if (v29)                              // name matched
            {
                if (dword_106581C)                // pending newline needed
                    sub_467D60();                 // emit newline

                // Emit #line directive pointing to synthetic source file
                v32 = "#line";
                if (dword_126E1DC)                // shorthand mode
                    v32 = "#";
                sub_467E50(v32);
                sub_467E50(" 1 \"nvcc_internal_extended_lambda_implementation\"");

                if (dword_106581C)
                    sub_467D60();

                // THE CRITICAL CALL: emit entire lambda template library
                sub_6BCC20(sub_467E50);

                dword_1065820 = 0;                // reset line tracking state
                qword_1065828 = 0;
            }
        }
    }
    // Suppress the sentinel type from host compiler output
    sub_46BC80("#if 0");
    --dword_1065834;
    sub_467D60();
}

Trigger Conditions

Three conditions must all be true for preamble emission:

Marker bit set -- The type declaration node has bit 0x10 set at offset -8 (the IL node header flags). This bit marks NVIDIA-injected synthetic declarations.
Extended lambda mode active -- dword_106BF38 is nonzero, meaning --extended-lambda (or --expt-extended-lambda) was passed to nvcc.
Name matches sentinel -- The type's name at offset +8 is byte-equal to "__nv_lambda_preheader_injection" (31 characters plus the terminating NUL, 32 bytes in total; the comparison loop runs up to 32 iterations).
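
The three conditions combine into a single predicate. The sketch below uses hypothetical names (kSentinel, name_matches_sentinel, should_emit_preamble) and takes the header flag byte, the enable flag, and the name pointer as plain parameters, whereas the binary reads them from the IL node and a global. It models only the decision to emit, not the #line or #if 0 output.

```cpp
// Sentinel name; the array includes the terminating NUL.
static constexpr char kSentinel[] = "__nv_lambda_preheader_injection";
static_assert(sizeof(kSentinel) == 32, "31 characters plus the terminating NUL");

// Compares up to 32 bytes, stopping at the first mismatch. Because the NUL is
// part of the comparison, a longer name with the same prefix does not match.
constexpr bool name_matches_sentinel(const char* name) {
    for (unsigned i = 0; i < sizeof(kSentinel); ++i) {
        if (name[i] != kSentinel[i]) return false;
    }
    return true;
}

// Hypothetical flattened form of the trigger check.
constexpr bool should_emit_preamble(unsigned char header_flags,
                                    int extended_lambda_enabled,
                                    const char* type_name) {
    return (header_flags & 0x10) != 0      // marker bit in the node header
        && extended_lambda_enabled != 0    // extended lambda mode active
        && type_name != nullptr            // declaration has a name
        && name_matches_sentinel(type_name);
}
```

The loop never reads past the end of a shorter name: the name's own NUL differs from the sentinel character at that position, so the comparison stops there.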

Synthetic Source File Context

Before calling sub_6BCC20, the trigger emits:

#line 1 "nvcc_internal_extended_lambda_implementation"

This #line directive serves two purposes: it changes the apparent source file for any diagnostics emitted during template parsing, and it provides a recognizable marker in the generated .int.c file for debugging. All lambda template infrastructure appears to originate from "nvcc_internal_extended_lambda_implementation" rather than from the user's source file. The dword_126E1DC flag selects between #line and the shorthand # form for the line directive.

One-Shot Guarantee and Sentinel Suppression

After the preamble is emitted, the sentinel type declaration is wrapped in #if 0 / #endif. The #if 0 is emitted immediately after the preamble call (line 239: sub_46BC80("#if 0")). The matching #endif is emitted later when sub_4864F0 reaches the closing path for this declaration type (lines 736-745):

else if ((*(_BYTE *)(v4 - 8) & 0x10) != 0)
{
    if (dword_106581C)
        sub_467D60();
    ++dword_1065834;
    sub_468190("#endif");
    --dword_1065834;
    sub_467D60();
    dword_1065820 = 0;
    qword_1065828 = 0;
}

The sentinel type __nv_lambda_preheader_injection never reaches the host compiler's type system -- it exists solely as a positional marker in the IL. Because the EDG frontend inserts exactly one such declaration per translation unit, and the backend processes declarations sequentially, the preamble is guaranteed to be emitted exactly once.

After emission, dword_1065820 (output line counter) and qword_1065828 (output state pointer) are reset to zero, ensuring subsequent #line directives correctly track the user's source file.

Master Emitter: sub_6BCC20

The function signature:

__int64 __fastcall sub_6BCC20(void (__fastcall *a1)(const char *));

The single parameter a1 is an output callback. In production, this is always sub_467E50 -- the function that writes raw text to the .int.c output stream. Every a1("...") call appends the given string literal to the output. The function has no other state parameters; all needed state (bitmaps, C++17 flag) is read from globals.

The 20 emission steps run in a fixed order, and most are unconditional. Steps 6 and 9 contain bitmap-scanning loops that call sub-emitters only for the capture counts registered during frontend parsing, and step 11 is gated on the C++17 noexcept flag.

Step 1: Type Removal Traits and Wrapper Helper Macro

The first a1(...) call emits the largest single string literal in the function -- three foundational metaprogramming utilities:

#define __NV_LAMBDA_WRAPPER_HELPER(X, Y) decltype(X), Y

template <typename T>
struct __nvdl_remove_ref { typedef T type; };

template<typename T>
struct __nvdl_remove_ref<T&> { typedef T type; };

template<typename T>
struct __nvdl_remove_ref<T&&> { typedef T type; };

template <typename T, typename... Args>
struct __nvdl_remove_ref<T(&)(Args...)> {
  typedef T(*type)(Args...);
};

template <typename T>
struct __nvdl_remove_const { typedef T type; };

template <typename T>
struct __nvdl_remove_const<T const> { typedef T type; };

__NV_LAMBDA_WRAPPER_HELPER(X, Y) expands to decltype(X), Y. It provides the <U, func> pair for tag type construction from a single expression. At each lambda wrapper call site, the per-lambda emitter (sub_47B890) generates __NV_LAMBDA_WRAPPER_HELPER(&Closure::operator(), &Closure::operator()), which expands to decltype(&Closure::operator()), &Closure::operator().

__nvdl_remove_ref strips lvalue and rvalue references, with a special case for function references (T(&)(Args...) -> T(*)(Args...)). __nvdl_remove_const strips top-level const. Both are used during capture type emission to normalize captured variable types before passing them as template arguments to wrapper structs.
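
The behaviour of the two traits can be confirmed with compile-time checks. The templates below are copies of the emitted definitions under illustrative names in a hypothetical namespace sketch (the real names are reserved identifiers), so this is a sketch of their semantics rather than the emitted text.

```cpp
#include <type_traits>

namespace sketch {
template <typename T> struct remove_ref { typedef T type; };
template <typename T> struct remove_ref<T&> { typedef T type; };
template <typename T> struct remove_ref<T&&> { typedef T type; };
// Function references become function pointers.
template <typename T, typename... Args>
struct remove_ref<T (&)(Args...)> { typedef T (*type)(Args...); };

template <typename T> struct remove_const { typedef T type; };
template <typename T> struct remove_const<T const> { typedef T type; };
}  // namespace sketch

// References are stripped, but const survives remove_ref on its own.
static_assert(std::is_same<sketch::remove_ref<int&>::type, int>::value, "");
static_assert(std::is_same<sketch::remove_ref<const int&>::type, const int>::value, "");
// Composing the two traits yields the unqualified type.
static_assert(std::is_same<
    sketch::remove_const<sketch::remove_ref<const int&>::type>::type, int>::value, "");
// The function-reference special case.
static_assert(std::is_same<
    sketch::remove_ref<int (&)(int, int)>::type, int (*)(int, int)>::value, "");
```

The function-reference specialization is more specialized than the plain T& one, so it wins partial ordering when both match.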

Step 2: Device Lambda Tag

template <typename U, U func, unsigned>
struct __nv_dl_tag { };

The device lambda tag type. U is the type of the lambda's operator(), func is a non-type template parameter holding the pointer to that operator, and the unsigned disambiguates lambdas with identical operator types at different call sites within the same TU.
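
A small compile-time check shows how the helper macro from step 1 feeds the tag and why the trailing unsigned matters. The names SKETCH_WRAPPER_HELPER, dl_tag, and Closure are hypothetical stand-ins; Closure plays the role of a lambda's closure type.

```cpp
#include <type_traits>

// Same shape as the helper macro shown in step 1, under an illustrative name.
#define SKETCH_WRAPPER_HELPER(X, Y) decltype(X), Y

namespace sketch {
template <typename U, U func, unsigned>
struct dl_tag {};

// Stand-in for a closure type with a const call operator.
struct Closure {
    void operator()() const {}
};

// Two tags for the same operator() that differ only in the counter.
typedef dl_tag<SKETCH_WRAPPER_HELPER(&Closure::operator(), &Closure::operator()), 0> tag0;
typedef dl_tag<SKETCH_WRAPPER_HELPER(&Closure::operator(), &Closure::operator()), 1> tag1;

// The macro form is equivalent to spelling out both arguments by hand.
typedef dl_tag<decltype(&Closure::operator()), &Closure::operator(), 0> tag0_spelled_out;
}  // namespace sketch

static_assert(std::is_same<sketch::tag0, sketch::tag0_spelled_out>::value,
              "macro expands to the <U, func> pair");
static_assert(!std::is_same<sketch::tag0, sketch::tag1>::value,
              "the unsigned parameter yields distinct tag types");
```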

Step 3: Array Capture Helpers (sub_6BC290)

sub_6BCC20 calls sub_6BC290(a1), which emits the __nv_lambda_array_wrapper and __nv_lambda_field_type infrastructure for C-style array captures. This is a separate 183-line function that generates templates for array dimensions 2 through 8.

Three template families are emitted:

Primary template (static_assert trap):

template <typename T>
struct __nv_lambda_array_wrapper {
    static_assert(sizeof(T) == 0,
        "nvcc internal error: unexpected failure in capturing array variable");
};

Per-dimension partial specializations (dimensions 2-8). For each dimension D from 2 to 8, sub_6BC290 generates a partial specialization with D size_t template parameters and a nested-for-loop constructor:

// Example: 3D (v1 = 3)
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_array_wrapper<T [D1][D2][D3]> {
    T arr[D1][D2][D3];
    __nv_lambda_array_wrapper(const T in[D1][D2][D3]) {
        for(size_t i1 = 0; i1 < D1; ++i1)
        for(size_t i2 = 0; i2 < D2; ++i2)
        for(size_t i3 = 0; i3 < D3; ++i3)
            arr[i1][i2][i3] = in[i1][i2][i3];
    }
};

Field type trait specializations:

template <typename T>
struct __nv_lambda_field_type { typedef T type; };

// For each dimension count N from 2 to 8:
template<typename T, size_t D1, ..., size_t DN>
struct __nv_lambda_field_type<T [D1]...[DN]> {
    typedef __nv_lambda_array_wrapper<T [D1]...[DN]> type;
};

template<typename T, size_t D1, ..., size_t DN>
struct __nv_lambda_field_type<const T [D1]...[DN]> {
    typedef const __nv_lambda_array_wrapper<T [D1]...[DN]> type;
};

The loop structure in sub_6BC290 uses two stack buffers: v33[1024] for the nested-for-loop lines (each sprintf call formats four copies of the loop index variable) and s[1064] for dimension parameters and array subscript expressions. The outer loop runs from v1 = 2 to v1 = 8 (inclusive, 7 iterations). 1D arrays do not need a wrapper -- they can be captured directly. Arrays of 9+ dimensions are unsupported (the primary template's static_assert fires).

See Capture Handling for detailed documentation.

Step 4: Primary `__nv_dl_wrapper_t` and Zero-Capture Specialization

template <typename Tag, typename...CapturedVarTypePack>
struct __nv_dl_wrapper_t {
    static_assert(sizeof...(CapturedVarTypePack) == 0,
                  "nvcc internal error: unexpected number of captures!");
};

template <typename Tag>
struct __nv_dl_wrapper_t<Tag> {
    __nv_dl_wrapper_t(Tag) { }
    template <typename...U1>
    int operator()(U1...) { return 0; }
};

The primary template traps any instantiation with a non-zero capture count that lacks a matching specialization. The zero-capture specialization provides a trivial constructor and a dummy operator() returning int(0). This return value is never used at runtime -- the device compiler dispatches through the tag's encoded function pointer.
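
The interplay between the trapping primary template and the zero-capture specialization can be reproduced in a few lines. dl_wrapper, tag, and call_zero_capture_wrapper are hypothetical names used only for this sketch.

```cpp
#include <cassert>

namespace sketch {
// Primary template: any instantiation with captures that lacks a matching
// specialization fails the static_assert.
template <typename Tag, typename... Captured>
struct dl_wrapper {
    static_assert(sizeof...(Captured) == 0, "unexpected number of captures");
};

// Zero-capture specialization: trivial constructor, dummy call operator.
template <typename Tag>
struct dl_wrapper<Tag> {
    dl_wrapper(Tag) {}
    template <typename... U>
    int operator()(U...) { return 0; }
};

struct tag {};
}  // namespace sketch

// The dummy operator() accepts any argument list and always returns 0.
inline int call_zero_capture_wrapper() {
    sketch::dl_wrapper<sketch::tag> w{sketch::tag{}};
    const int result = w(1, 2.0, 'c');
    assert(result == 0);
    return result;
}
```

Instantiating sketch::dl_wrapper<sketch::tag, int> in this model would hit the static_assert, which is the role the bitmap-driven specializations of step 6 fill in the real preamble.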

Step 5: Trailing-Return Tag and Base Specialization

template <typename U, U func, typename Return, unsigned>
struct __nv_dl_trailing_return_tag { };

template <typename U, U func, typename Return, unsigned Id>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id> > {
    __nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>) { }

    template <typename...U1> Return operator()(U1...) {
        __builtin_unreachable();
    }
};

For lambdas with explicit trailing return types (-> ReturnType), the tag carries the Return type as a template parameter. The operator() returns Return instead of int, with __builtin_unreachable() satisfying the compiler without generating actual return-value code.

The trailing-return tag and its zero-capture specialization are emitted as two separate a1(...) calls. The __builtin_unreachable() body is split: a1("__builtin_unreachable(); }\n}; \n\n").

Step 6: Device Lambda Bitmap Scan

Scans unk_1286980 (the device lambda bitmap, 1024 bits) and calls sub_6BB790 for each set bit with index greater than zero:

// Decompiled from sub_6BCC20
v1 = (unsigned __int64 *)&unk_1286980;
v2 = 0;
do {
    v3 = *v1;                          // load 64-bit word
    v4 = v2 + 64;                      // word boundary
    do {
        if (v2 && (v3 & 1) != 0)       // skip bit 0, emit for set bits
            sub_6BB790(v2, a1);         // emit_device_lambda_wrapper_specialization
        ++v2;
        v3 >>= 1;
    } while (v4 != v2);
    ++v1;
} while (v4 != 1024);

Bit 0 is explicitly skipped (if (v2 && ...)). The zero-capture case is handled by the specializations in steps 4 and 5.

For each set bit N > 0, sub_6BB790(N, a1) emits two __nv_dl_wrapper_t partial specializations: one for __nv_dl_tag and one for __nv_dl_trailing_return_tag, each with N typed fields, a constructor taking N parameters, and an initializer list binding inK to fK. See Device Lambda Wrapper for full emitter logic.

This bitmap-driven approach is the critical compile-time optimization. A translation unit using lambdas with capture counts 1, 3, and 5 emits exactly 6 struct specializations rather than 2046 (1023 counts x 2 tag variants).
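
The count in that example can be reproduced with a portable restatement of the scan. The sketch below is an illustration under stated assumptions: the bitmap is passed in as a parameter rather than read from unk_1286980, and the helper names (mark, device_counts_to_emit, device_specialization_total) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the 1024-bit device bitmap: 16 x 64-bit words.
using CaptureBitmap = std::uint64_t[16];

// Sets the bit for one capture count (illustration only).
void mark(CaptureBitmap& bitmap, unsigned count) {
    assert(count < 1024);
    bitmap[count / 64] |= std::uint64_t{1} << (count % 64);
}

// Returns the capture counts the device scan would emit specializations for.
// Bit 0 is skipped, matching the `if (v2 && ...)` test in the decompiled loop.
std::vector<unsigned> device_counts_to_emit(const CaptureBitmap& bitmap) {
    std::vector<unsigned> counts;
    for (unsigned index = 0; index < 1024; ++index) {
        const bool set = ((bitmap[index / 64] >> (index % 64)) & 1u) != 0;
        if (index != 0 && set)
            counts.push_back(index);
    }
    return counts;
}

// Each emitted count yields two specializations: one for __nv_dl_tag and one
// for __nv_dl_trailing_return_tag.
std::size_t device_specialization_total(const CaptureBitmap& bitmap) {
    return 2 * device_counts_to_emit(bitmap).size();
}
```

Marking counts 1, 3, and 5 (and optionally 0) gives a total of 6, in line with the figure above.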

Step 7: Host-Device Helper Class (`__nv_hdl_helper`)

Emitted inside an anonymous namespace:

namespace {
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
struct __nv_hdl_helper {
    typedef void * (*fp_copier_t)(void *);
    typedef OpFuncR (*fp_caller_t)(void *, OpFuncArgs...);
    typedef void (*fp_deleter_t)(void *);
    typedef OpFuncR (*fp_noobject_caller_t)(OpFuncArgs...);

    static fp_copier_t fp_copier;
    static fp_deleter_t fp_deleter;
    static fp_caller_t fp_caller;
    static fp_noobject_caller_t fp_noobject_caller;
};

// Out-of-line static member definitions (4 members):
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier_t
    __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier;

// ... (fp_deleter, fp_caller, fp_noobject_caller follow the same pattern)
}

The anonymous namespace prevents ODR violations across TUs. The Tag parameter isolates function pointer storage per lambda site even when call signatures are identical. The entire struct definition plus all four out-of-line member definitions are emitted as a single a1(...) call.

Pointer	Purpose
`fp_copier`	Heap-copies a `Lambda` from `void*` (used by copy constructor)
`fp_caller`	Casts `void*` to `Lambda*` and invokes `operator()`
`fp_deleter`	Casts `void*` to `Lambda*` and `delete`s it
`fp_noobject_caller`	Stores captureless lambda as raw function pointer
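
The per-tag isolation of the static pointers can be seen in a reduced model. The sketch below keeps only one of the four pointers, and the names helper, tag_a, tag_b, twice, negated, and demo_sum are hypothetical stand-ins rather than anything emitted by cudafe++.

```cpp
#include <cassert>

namespace {
// Reduced stand-in for __nv_hdl_helper with a single static pointer.
template <typename Tag, typename R, typename... Args>
struct helper {
    typedef R (*fp_noobject_caller_t)(Args...);
    static fp_noobject_caller_t fp_noobject_caller;
};

// Out-of-line definition, following the same pattern as the emitted text.
template <typename Tag, typename R, typename... Args>
typename helper<Tag, R, Args...>::fp_noobject_caller_t
    helper<Tag, R, Args...>::fp_noobject_caller;

struct tag_a {};  // stands in for one lambda site
struct tag_b {};  // stands in for a second site with the same signature

int twice(int x)   { return 2 * x; }
int negated(int x) { return -x; }

using helper_a = helper<tag_a, int, int>;
using helper_b = helper<tag_b, int, int>;

// Same call signature, different tags: each instantiation has its own slot.
int demo_sum(int x) {
    helper_a::fp_noobject_caller = &twice;
    helper_b::fp_noobject_caller = &negated;
    assert(helper_a::fp_noobject_caller != helper_b::fp_noobject_caller);
    return helper_a::fp_noobject_caller(x) + helper_b::fp_noobject_caller(x);
}
}  // namespace
```

Without the Tag parameter both sites would share one instantiation, and the second assignment would overwrite the first.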

Step 8: Primary `__nv_hdl_wrapper_t`

template <bool IsMutable, bool HasFuncPtrConv, bool NeverThrows,
          typename Tag, typename OpFunc, typename...CapturedVarTypePack>
struct __nv_hdl_wrapper_t {
    static_assert(sizeof...(CapturedVarTypePack) == 0,
        "nvcc internal error: unexpected number of captures "
        "in __host__ __device__ lambda!");
};

Same safety-net pattern as the device wrapper.

Step 9: Host-Device Lambda Bitmap Scan

Scans unk_1286900 (the host-device bitmap, 1024 bits). Unlike the device scan, this loop does not skip bit 0 -- the zero-capture host-device case still requires distinct specializations for HasFuncPtrConv=true vs HasFuncPtrConv=false.

For each set bit N, four specialization calls are made:

v5 = (unsigned __int64 *)&unk_1286900;
v6 = 0;
do {
    v7 = *v5;
    v8 = v6 + 64;
    do {
        while ((v7 & 1) == 0) {        // fast-skip unset bits
            ++v6;
            v7 >>= 1;
            if (v6 == v8) goto LABEL_13;
        }
        sub_6BBB10(0, v6, a1);         // IsMutable=false, HasFuncPtrConv=false
        sub_6BBEE0(0, v6, a1);         // IsMutable=true,  HasFuncPtrConv=false
        sub_6BBB10(1, v6, a1);         // IsMutable=false, HasFuncPtrConv=true
        v9 = v6++;
        v7 >>= 1;
        sub_6BBEE0(1, v9, a1);        // IsMutable=true,  HasFuncPtrConv=true
    } while (v6 != v8);
LABEL_13:
    ++v5;
} while (v6 != 1024);

Note the ordering asymmetry in the fourth call: sub_6BBEE0(1, v9, a1) uses the pre-increment value v9 because v6 has already been incremented by the v9 = v6++ expression.

The inner while ((v7 & 1) == 0) loop provides fast skipping over consecutive unset bits without executing four function calls per zero bit. This is an optimization compared to the device scan loop.

Call	`a1`	`a2`	IsMutable	HasFuncPtrConv	`operator()` qualifier
`sub_6BBB10(0, N, emit)`	0	N	false	false	`const noexcept(NeverThrows)`
`sub_6BBEE0(0, N, emit)`	0	N	true	false	`noexcept(NeverThrows)` (no const)
`sub_6BBB10(1, N, emit)`	1	N	false	true	`const noexcept(NeverThrows)`
`sub_6BBEE0(1, N, emit)`	1	N	true	true	`noexcept(NeverThrows)` (no const)

The sole difference between sub_6BBB10 and sub_6BBEE0 is that sub_6BBB10 emits "false," for IsMutable and adds a3("const ") before the noexcept qualifier on operator(), while sub_6BBEE0 emits "true," and omits the const. They are otherwise structurally identical -- 238 vs 236 lines, the 2-line difference being exactly the a3("const ") call.
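
The resulting emission order can be restated without the goto-based control flow. This is a sketch for illustration: the HdlSpecialization record and hdl_emission_order function are hypothetical, and the bitmap is passed in rather than read from unk_1286900.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One emitted __nv_hdl_wrapper_t specialization, identified by the two
// boolean template arguments and the capture count.
struct HdlSpecialization {
    bool is_mutable;
    bool has_func_ptr_conv;
    unsigned capture_count;
};

// Lists specializations in the order of the four calls in the decompiled
// loop. Unlike the device scan, bit 0 is not skipped.
std::vector<HdlSpecialization> hdl_emission_order(const std::uint64_t (&bitmap)[16]) {
    std::vector<HdlSpecialization> order;
    for (unsigned n = 0; n < 1024; ++n) {
        if (((bitmap[n / 64] >> (n % 64)) & 1u) == 0)
            continue;                        // unset bit: nothing emitted
        order.push_back({false, false, n});  // sub_6BBB10(0, n, emit)
        order.push_back({true,  false, n});  // sub_6BBEE0(0, n, emit)
        order.push_back({false, true,  n});  // sub_6BBB10(1, n, emit)
        order.push_back({true,  true,  n});  // sub_6BBEE0(1, n, emit)
    }
    assert(order.size() % 4 == 0);           // always four per set bit
    return order;
}
```

With bits 0 and 2 set, the function returns eight entries: four for capture count 0 followed by four for capture count 2.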

See Host-Device Lambda Wrapper for the complete internal structure of each specialization.

Step 10: `__nv_hdl_helper_trait_outer` (Base Specializations)

The deduction helper trait that extracts the wrapper type from a lambda's operator() signature:

template <bool IsMutable, bool HasFuncPtrConv, typename ...CaptureArgs>
struct __nv_hdl_helper_trait_outer {
    template <typename Tag, typename Lambda>
    struct __nv_hdl_helper_trait
        : public __nv_hdl_helper_trait<Tag, decltype(&Lambda::operator())> { };

    // Match const operator() (non-mutable lambda):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

    // Match non-const operator() (mutable lambda):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...)> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

The primary __nv_hdl_helper_trait inherits from a specialization on decltype(&Lambda::operator()). The compiler deduces the member function pointer type and pattern-matches against the const or non-const specialization. Both produce NeverThrows=false.

This block is emitted without a closing }; -- the noexcept variants (step 11) are conditionally appended before the closing brace.

Step 11: C++17 Noexcept Trait Variants (Conditional)

Gated on dword_126E270:

if (dword_126E270)
    a1(/* noexcept trait specializations */);
a1("\n};");  // close __nv_hdl_helper_trait_outer

When C++17 noexcept-in-type-system is active, two additional __nv_hdl_helper_trait specializations are emitted:

    // Match const noexcept operator():
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const noexcept> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

    // Match non-const noexcept operator():
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) noexcept> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

The noexcept specializations produce NeverThrows=true. In C++17, R(C::*)(Args...) const noexcept is a distinct type from R(C::*)(Args...) const, so without these specializations, noexcept lambdas would fail to match and the trait chain would break.
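
The type distinction can be checked directly with a standalone trait. The sketch below assumes a C++17 (or later) compiler; call_operator_info, Probe, info_for, and the three lambdas are hypothetical names that only model the matching behavior of __nv_hdl_helper_trait.

```cpp
#include <type_traits>

// Hypothetical trait: reports whether a call operator's
// member-function-pointer type is non-const and/or noexcept.
template <typename F> struct call_operator_info;

template <typename C, typename R, typename... Args>
struct call_operator_info<R (C::*)(Args...) const> {
    static constexpr bool is_mutable = false;
    static constexpr bool never_throws = false;
};
template <typename C, typename R, typename... Args>
struct call_operator_info<R (C::*)(Args...)> {
    static constexpr bool is_mutable = true;
    static constexpr bool never_throws = false;
};
// The next two are distinct specializations only under C++17 or later.
template <typename C, typename R, typename... Args>
struct call_operator_info<R (C::*)(Args...) const noexcept> {
    static constexpr bool is_mutable = false;
    static constexpr bool never_throws = true;
};
template <typename C, typename R, typename... Args>
struct call_operator_info<R (C::*)(Args...) noexcept> {
    static constexpr bool is_mutable = true;
    static constexpr bool never_throws = true;
};

struct Probe;  // any class type works for the comparison below
static_assert(!std::is_same<int (Probe::*)(int) const,
                            int (Probe::*)(int) const noexcept>::value,
              "noexcept is part of the function type in C++17");

const auto plain_lambda    = [](int x) { return x; };
const auto noexcept_lambda = [](int x) noexcept { return x; };
const auto mutable_lambda  = [n = 0](int x) mutable noexcept { return n += x; };

template <typename Lambda>
using info_for = call_operator_info<decltype(&Lambda::operator())>;

static_assert(!info_for<decltype(plain_lambda)>::never_throws, "");
static_assert(info_for<decltype(noexcept_lambda)>::never_throws, "");
static_assert(info_for<decltype(mutable_lambda)>::is_mutable, "");
```

Removing the two noexcept specializations leaves info_for incomplete for noexcept_lambda, which is the failure mode described above.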

Step 12: `__nv_hdl_create_wrapper_t` Factory

template<bool IsMutable, bool HasFuncPtrConv, typename Tag,
         typename...CaptureArgs>
struct __nv_hdl_create_wrapper_t {
    template <typename Lambda>
    static auto __nv_hdl_create_wrapper(Lambda &&lam, CaptureArgs... args)
        -> decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...))
    {
        typedef decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...)) container_type;
        return container_type(Tag{}, std::move(lam), args...);
    }
};

This factory is the entry point called at each host-device lambda usage site. The trailing return type chains through the trait hierarchy to deduce the exact __nv_hdl_wrapper_t specialization. The body constructs the deduced wrapper with Tag{}, the moved lambda, and the capture arguments.

Step 13: CV-Removal Traits

template<typename T> struct __nv_lambda_trait_remove_const { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_const<T const> { typedef T type; };

template<typename T> struct __nv_lambda_trait_remove_volatile { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_volatile<T volatile> { typedef T type; };

template<typename T> struct __nv_lambda_trait_remove_cv {
    typedef typename __nv_lambda_trait_remove_const<
        typename __nv_lambda_trait_remove_volatile<T>::type>::type type;
};

These are distinct from the __nvdl_remove_ref/__nvdl_remove_const emitted in step 1. The step-1 traits are used during capture type normalization at wrapper call sites. The step-13 traits are used by the detection macros (steps 14, 16, and 17) to strip CV qualifiers before testing whether a type is an extended lambda wrapper.

Step 14: Device Lambda Detection Trait

template <typename T>
struct __nv_extended_device_lambda_trait_helper {
    static const bool value = false;
};

template <typename T1, typename...Pack>
struct __nv_extended_device_lambda_trait_helper<__nv_dl_wrapper_t<T1, Pack...> > {
    static const bool value = true;
};

#define __nv_is_extended_device_lambda_closure_type(X) \
    __nv_extended_device_lambda_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

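Detection of device lambda wrappers by partial-specialization matching (no SFINAE is involved): the primary template yields false, and the specialization on __nv_dl_wrapper_t<T1, Pack...> yields true. The macro strips CV qualifiers first, ensuring const __nv_dl_wrapper_t<...> is also detected. Used by CUDA runtime headers for conditional behavior on extended lambda types.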
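
The effect of stripping CV qualifiers before matching can be demonstrated with stand-in templates. The names below (wrapper_t, some_tag, strip_const, strip_volatile, strip_cv, is_wrapper) are shortened, hypothetical counterparts of the emitted ones, used only to keep the sketch self-contained.

```cpp
// Stand-in for __nv_dl_wrapper_t and an arbitrary tag type.
template <typename Tag, typename... Captures> struct wrapper_t {};
struct some_tag {};

// Counterparts of the step-13 CV-removal traits.
template <typename T> struct strip_const          { typedef T type; };
template <typename T> struct strip_const<T const> { typedef T type; };
template <typename T> struct strip_volatile             { typedef T type; };
template <typename T> struct strip_volatile<T volatile> { typedef T type; };
template <typename T> struct strip_cv {
    typedef typename strip_const<typename strip_volatile<T>::type>::type type;
};

// Counterpart of the detection helper: primary says no, specialization says yes.
template <typename T> struct is_wrapper { static const bool value = false; };
template <typename Tag, typename... Pack>
struct is_wrapper<wrapper_t<Tag, Pack...> > { static const bool value = true; };

// Without stripping, a const-qualified wrapper falls through to the primary.
static_assert(!is_wrapper<const wrapper_t<some_tag, int> >::value, "");
// With stripping first, as the macro does, the wrapper is detected.
static_assert(is_wrapper<strip_cv<const wrapper_t<some_tag, int> >::type>::value, "");
// Unrelated types are rejected either way.
static_assert(!is_wrapper<strip_cv<const int>::type>::value, "");
```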

Step 15: Device Lambda Wrapper Unwrapper

template<typename T> struct __nv_lambda_trait_remove_dl_wrapper { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_dl_wrapper<__nv_dl_wrapper_t<T> > {
    typedef T type;
};

Extracts the inner tag type from a zero-capture device lambda wrapper. Only matches __nv_dl_wrapper_t<T> with a single template parameter (the tag). Used to access __nv_dl_tag or __nv_dl_trailing_return_tag for device function dispatch resolution.

Step 16: Trailing-Return Device Lambda Detection

template <typename T>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper {
    static const bool value = false;
};

template <typename U, U func, typename Return, unsigned Id, typename...Pack>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper<
    __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, Pack...> > {
    static const bool value = true;
};

#define __nv_is_extended_device_lambda_with_preserved_return_type(X) \
    __nv_extended_device_lambda_with_trailing_return_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

Detects whether a device lambda wrapper uses the trailing-return tag variant. Needed because trailing-return lambdas require different handling during device compilation -- the return type is explicit and must be preserved, rather than deduced.

Step 17: Host-Device Lambda Detection Trait

The final emission:

template <typename>
struct __nv_extended_host_device_lambda_trait_helper {
    static const bool value = false;
};

template <bool B1, bool B2, bool B3, typename T1, typename T2, typename...Pack>
struct __nv_extended_host_device_lambda_trait_helper<
    __nv_hdl_wrapper_t<B1, B2, B3, T1, T2, Pack...> > {
    static const bool value = true;
};

#define __nv_is_extended_host_device_lambda_closure_type(X) \
    __nv_extended_host_device_lambda_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

Detects any __nv_hdl_wrapper_t instantiation. The partial specialization matches all six template parameters (B1=IsMutable, B2=HasFuncPtrConv, B3=NeverThrows, T1=Tag, T2=OpFunc, Pack=captures).

sub_6BCC20 returns the result of this final a1(...) call.

Bitmap Infrastructure

Registration: sub_6BCBF0 (nv_record_capture_count)

During frontend parsing, scan_lambda (sub_447930) records each lambda's capture count:

__int64 __fastcall sub_6BCBF0(int a1, unsigned int a2)
{
    unsigned __int64 *result;
    if (a1)
        result = (unsigned __int64 *)&unk_1286900;  // host-device bitmap
    else
        result = (unsigned __int64 *)&unk_1286980;  // device bitmap
    result[a2 >> 6] |= 1ULL << a2;
    return (__int64)result;
}

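The function selects the bitmap based on a1 (0 = device, nonzero = host-device), computes the word index as a2 >> 6 (divide by 64), and sets bit a2 mod 64 of that word via bitwise OR. The decompiled shift 1ULL << a2 carries no explicit mask; an x86-64 64-bit shift uses only the low 6 bits of its count, which supplies the modulo. No synchronization is needed because the frontend is single-threaded.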

Reset: sub_6BCBC0 (nv_reset_capture_bitmasks)

Before each translation unit, both bitmaps are zeroed:

void sub_6BCBC0(void)
{
    memset(&unk_1286980, 0, 128);  // device bitmap
    memset(&unk_1286900, 0, 128);  // host-device bitmap
}
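
The two routines above can be restated portably as follows. This is a sketch under stated assumptions: the two arrays are hypothetical stand-ins for unk_1286980 and unk_1286900, the function names are invented, and the bit index is masked explicitly instead of relying on hardware shift behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for the device and host-device bitmaps.
static std::uint64_t device_bitmap[16];
static std::uint64_t host_device_bitmap[16];

// Restatement of the registration routine with an explicit bit index.
void record_capture_count(bool host_device, unsigned count) {
    assert(count < 1024);  // 16 words x 64 bits
    std::uint64_t* bitmap = host_device ? host_device_bitmap : device_bitmap;
    bitmap[count >> 6] |= std::uint64_t{1} << (count & 63);
}

// Restatement of the reset routine: clear both 128-byte bitmaps.
void reset_capture_bitmaps() {
    std::memset(device_bitmap, 0, sizeof device_bitmap);
    std::memset(host_device_bitmap, 0, sizeof host_device_bitmap);
}

// Query helper added for illustration; the original has no such function.
bool is_recorded(bool host_device, unsigned count) {
    assert(count < 1024);
    const std::uint64_t* bitmap = host_device ? host_device_bitmap : device_bitmap;
    return ((bitmap[count >> 6] >> (count & 63)) & 1u) != 0;
}
```

Recording a device lambda with 65 captures sets bit 1 of word 1 in the device bitmap and leaves the host-device bitmap untouched.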

Scan Algorithm Differences

Aspect	Device scan (step 6)	Host-device scan (step 9)
Bitmap	`unk_1286980`	`unk_1286900`
Bit 0	Skipped (`if (v2 && ...)`)	Processed
Skip strategy	Tests every bit individually	Inner `while` fast-skips consecutive zeros
Calls per set bit	1 (`sub_6BB790`)	4 (`sub_6BBB10` x2 + `sub_6BBEE0` x2)
Specializations per set bit	2 (standard + trailing-return)	4 (IsMutable x HasFuncPtrConv)

The device scan skips bit 0 because the zero-capture case is handled by the always-emitted zero-capture specializations (steps 4 and 5). The host-device scan processes bit 0 because the zero-capture case requires explicit specializations for the HasFuncPtrConv and IsMutable dimensions -- the always-emitted primary template contains only a static_assert trap.

Complete Emission Order Summary

Step	Content	Emitter	Templates Produced
1	Ref/const removal traits	inline string	`__NV_LAMBDA_WRAPPER_HELPER`, `__nvdl_remove_ref`, `__nvdl_remove_const`
2	Device tag	inline string	`__nv_dl_tag`
3	Array helpers	`sub_6BC290`	`__nv_lambda_array_wrapper` (dim 2-8), `__nv_lambda_field_type` specializations
4	Device wrapper primary	inline string	`__nv_dl_wrapper_t` primary + zero-capture
5	Trailing-return tag	inline string	`__nv_dl_trailing_return_tag` + zero-capture specialization
6	Device bitmap scan	loop + `sub_6BB790`	N-capture `__nv_dl_wrapper_t` (2 per set bit N > 0)
7	HD helper	inline string	`__nv_hdl_helper` (anonymous namespace, 4 static FPs)
8	HD wrapper primary	inline string	`__nv_hdl_wrapper_t` primary with static_assert
9	HD bitmap scan	loop + `sub_6BBB10` x2 + `sub_6BBEE0` x2	N-capture `__nv_hdl_wrapper_t` (4 per set bit)
10	Trait outer	inline string	`__nv_hdl_helper_trait_outer` (const + non-const specializations)
11	C++17 noexcept	conditional inline	Noexcept `__nv_hdl_helper_trait` specializations
12	Factory	inline string	`__nv_hdl_create_wrapper_t`
13	CV traits	inline string	`__nv_lambda_trait_remove_const/volatile/cv`
14	Device detection	inline string	`__nv_extended_device_lambda_trait_helper` + macro
15	Wrapper unwrap	inline string	`__nv_lambda_trait_remove_dl_wrapper`
16	Trailing-return detection	inline string	`__nv_extended_device_lambda_with_trailing_return_trait_helper` + macro
17	HD detection	inline string	`__nv_extended_host_device_lambda_trait_helper` + macro

Output Size Characteristics

The preamble size depends on the number of distinct capture counts used:

Component	Fixed/Variable	Approximate Size
Steps 1-2, 4-5 (fixed templates)	Fixed	~1.5 KB
Step 3 (array helpers, dim 2-8)	Fixed	~4 KB
Step 6 (device, per capture count)	Variable	~0.8 KB per count
Steps 7-8 (HD helper + primary)	Fixed	~1.5 KB
Step 9 (HD, per capture count)	Variable	~6 KB per count (4 specializations)
Steps 10-17 (traits, macros)	Fixed	~3 KB

A typical translation unit with 3-5 distinct capture counts produces approximately 30-50 KB of injected C++ text.
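
As a rough cross-check of that range, the table's approximate figures can be summed. The helper below is hypothetical and works in tenths of a KB to stay in integer arithmetic; it assumes the first table row excludes step 3, which is listed separately, and that the same number of distinct counts appears in both bitmaps unless stated otherwise.

```cpp
#include <cassert>

// Rough preamble size in tenths of a KB, from the approximate table values:
// fixed parts 1.5 + 4 + 1.5 + 3 KB, device 0.8 KB per count, host-device
// 6 KB per count. Not part of cudafe++; an estimate only.
unsigned estimate_preamble_tenths_kb(unsigned device_counts, unsigned hd_counts) {
    assert(device_counts <= 1023 && hd_counts <= 1024);
    const unsigned fixed = 15 + 40 + 15 + 30;
    return fixed + 8 * device_counts + 60 * hd_counts;
}
```

Under these assumptions, three distinct counts in each bitmap give about 30.4 KB and five give about 44 KB; the host-device specializations dominate the variable part.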

Design Rationale

Text Emission vs AST Construction

The preamble is emitted as raw C++ source text rather than constructed as AST nodes in the EDG IL. This trades correctness-by-construction for implementation simplicity:

Avoids IL complexity. Constructing proper AST nodes for template partial specializations, static member definitions, anonymous namespaces, and macros would require deep integration with the EDG IL construction API.
Matches output format. The .int.c file is plain C++ text consumed by the host compiler. Since the templates must eventually become text, generating them as text from the start eliminates a serialize-deserialize round trip.
Self-documenting. The emitted text is directly readable in the .int.c file. grep for __nv_dl_wrapper_t to see exactly what was produced.

The cost is that the templates exist only as generated text, not as first-class IL entities. They cannot be analyzed or transformed by other EDG passes. This is acceptable because the preamble templates are infrastructure -- they are never the target of user-facing diagnostics or transformations.

Why Bitmaps Instead of Lists

The 1024-bit bitmap offers constant-time set (O(1) via shift-and-OR) and linear-time scan (O(1024) = effectively constant for a fixed-size structure). The bitmap has zero dynamic allocation, fits in two cache lines (128 bytes), and the scan loop compiles to simple shift-and-test instructions. Alternative representations (sorted lists, hash sets) would add allocation overhead and complexity for negligible benefit given the fixed 128-byte size.

Why Bit 0 Is Skipped for Device but Not Host-Device

The device lambda zero-capture case is fully handled by the primary template's zero-capture specialization (step 4), which is always emitted. No per-capture-count specialization is needed because the zero-capture wrapper has no fields, no constructor parameters, and no specialization-specific behavior.

The host-device zero-capture case requires distinct specializations for HasFuncPtrConv=true (lightweight function pointer path) and HasFuncPtrConv=false (heap-allocated type erasure path). These paths have fundamentally different internal structure. The always-emitted primary template contains only a static_assert trap, not a working implementation, so bit 0 must be processed to generate the actual zero-capture specializations.

Function Map

Address	Name (recovered)	Source	Lines	Role
`sub_6BCC20`	`nv_emit_lambda_preamble`	`nv_transforms.c`	244	Master emitter: 17-step template injection pipeline
`sub_4864F0`	`gen_type_decl`	`cp_gen_be.c`	751	Trigger: detects sentinel, emits `#line`, calls master emitter
`sub_467E50`	`emit_string`	`cp_gen_be.c`	~29	Output callback: writes string char-by-char via `putc()`
`sub_467D60`	`emit_newline`	`cp_gen_be.c`	~15	Emits `\n`, increments line counter
`sub_6BC290`	`emit_array_capture_helpers`	`nv_transforms.c`	183	Step 3: `__nv_lambda_array_wrapper` for dim 2-8
`sub_6BB790`	`emit_device_lambda_wrapper_specialization`	`nv_transforms.c`	191	Step 6: N-capture `__nv_dl_wrapper_t` (both tag variants)
`sub_6BBB10`	`emit_hdl_wrapper_nonmutable`	`nv_transforms.c`	238	Step 9: `__nv_hdl_wrapper_t<false,...>` specialization
`sub_6BBEE0`	`emit_hdl_wrapper_mutable`	`nv_transforms.c`	236	Step 9: `__nv_hdl_wrapper_t<true,...>` specialization
`sub_6BCBF0`	`nv_record_capture_count`	`nv_transforms.c`	13	Sets bit N in device or HD bitmap
`sub_6BCBC0`	`nv_reset_capture_bitmasks`	`nv_transforms.c`	9	Zeroes both 128-byte bitmaps before each TU
`sub_46BC80`	`emit_preprocessor_directive`	`cp_gen_be.c`	--	Emits `#if 0` / `#endif` suppression blocks

Global State

Variable	Address	Type	Purpose
`unk_1286980`	`0x1286980`	`uint64_t[16]`	Device lambda capture-count bitmap (1024 bits)
`unk_1286900`	`0x1286900`	`uint64_t[16]`	Host-device lambda capture-count bitmap (1024 bits)
`dword_106BF38`	`0x106BF38`	`int32`	`--extended-lambda` mode flag
`dword_126E270`	`0x126E270`	`int32`	C++17 noexcept-in-type-system flag
`dword_126E1DC`	`0x126E1DC`	`int32`	EDG native mode flag (`#` vs `#line` format)
`dword_106581C`	`0x106581C`	`int32`	Output column counter
`dword_1065820`	`0x1065820`	`int32`	Output line counter (reset after preamble)
`qword_1065828`	`0x1065828`	`int64`	Output state pointer (reset after preamble)
`dword_1065818`	`0x1065818`	`int32`	Pending indentation flag
`dword_1065834`	`0x1065834`	`int32`	Preprocessor nesting depth counter

Extended Lambda Overview -- end-to-end pipeline architecture and bitmap system
Device Lambda Wrapper -- __nv_dl_wrapper_t template structure, sub_6BB790 emitter
Host-Device Lambda Wrapper -- __nv_hdl_wrapper_t type erasure design, sub_6BBB10/sub_6BBEE0
Capture Handling -- __nv_lambda_field_type, __nv_lambda_array_wrapper, sub_6BC290
Lambda Restrictions -- validation rules and error codes

Lambda Restrictions

Extended lambdas are CUDA's most constraint-heavy feature. Before a lambda can be wrapped in __nv_dl_wrapper_t or __nv_hdl_wrapper_t for device transfer, cudafe++ must verify that the closure type is serializable: no reference captures (device memory cannot hold host-side pointers), no function-local types in the public interface (device compiler has no access to them), no unnamed parent classes (the wrapper tag requires a mangleable name), and dozens of other structural invariants. The restriction checker runs as Phase 4 of scan_lambda (sub_447930, lines 626--866 of the 2113-line function) and continues through per-capture validation in make_field_for_lambda_capture (sub_42EE00) and recursive type walks in sub_41A3E0 / sub_41A1F0. Together, these functions enforce 39 distinct diagnostic tags covering 35+ error categories and approximately 45 unique error code call sites.

All restrictions apply only when dword_106BF38 (--extended-lambda / --expt-extended-lambda) is set and the lambda has an explicit __device__ or __host__ __device__ annotation. Standard C++ lambdas and lambdas defined inside __device__ / __global__ function bodies are exempt.

Key Facts

Property	Value
Primary validator	`sub_447930` (`scan_lambda`, Phase 4, ~240 lines within 2113-line function)
Per-capture validator	`sub_42EE00` (`make_field_for_lambda_capture`, 551 lines)
Type hierarchy walker	`sub_41A3E0` (`validate_type_hd_annotation`, 75 lines)
Array/element checker	`sub_41A1F0` (`walk_type_for_hd_violations`, 81 lines)
Type walk callback	`sub_41B420` (33 lines, issues errors 3603/3604/3606/3607/3610/3611)
Diagnostic tag count	39 unique tags for extended lambda errors
Error code range	3592--3635, 3689--3691
Error severity	All severity 7 (error), except 3612 (warning) and 3590 (error)
Enable flag	`dword_106BF38` (`--extended-lambda`)
OptiX gate	`dword_106BDD8` && `dword_106B670` (triggers 3689)

Restriction Categories

The tables below list every restriction enforced by the extended lambda validator, organized by the phase of validation in which each check occurs. The Error column gives the internal error index (displayed to users as 20000-series with the renumbering formula code + 16543). The Tag column gives the diagnostic tag name usable with --diag_suppress / #pragma nv_diag_suppress.
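
For orientation, the renumbering can be written as a one-line mapping. The function name is hypothetical, and the sketch assumes the quoted offset applies uniformly to the codes listed in this section.

```cpp
// Maps an internal error index to the displayed number using the
// offset quoted above (code + 16543).
constexpr int displayed_diagnostic_number(int internal_index) {
    return internal_index + 16543;
}

// Example: internal index 3593 would be displayed as 20136.
static_assert(displayed_diagnostic_number(3593) == 20136, "");
```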

Category 1: Capture Restrictions

These checks run in two phases. The per-lambda checks (3593, 3595) occur in scan_lambda Phase 4 and in sub_4F9F20 (capture count finalization). The per-capture checks (3596--3599, 3616) run inside make_field_for_lambda_capture (sub_42EE00), which calls sub_41A1F0 for array dimension and constructibility analysis.

Error	Tag	Restriction	Enforcement Location
3593	`extended_lambda_reference_capture`	Reference capture (`[&]` or `[&x]`) is prohibited. Device memory cannot hold host-side references. Fires when `capture_default == &` and `capture_mode == &` on the same lambda (byte+24 bits 4 and 5 both set).	`sub_447930` Phase 4, line ~825
3595	`extended_lambda_too_many_captures`	Maximum 1023 captures. The bitmap system uses 1024 bits (128 bytes) per wrapper type; bit 0 is reserved for the zero-capture primary template, so the usable range is 1--1023. Capture count `> 0x3FE` triggers this error.	`sub_4F9F20` line ~616
3596	`extended_lambda_init_capture_array`	Init-captures with array type are not supported. The init-capture's type node is checked for kind 3 (array type) with element kind 1 and sub-kind 21.	`sub_42EE00` line ~508
3597	`extended_lambda_array_capture_rank`	Arrays with more than 7 dimensions cannot be captured. The walker `sub_41A1F0` counts array nesting depth via `sub_7A8370` (is_array_type) and `sub_7A9310` (get_element_type). If depth > 7, error fires. The limit matches the generated `__nv_lambda_array_wrapper` specializations (dims 2--8, plus dim 1 as identity).	`sub_41A1F0` lines ~29, ~54
3598	`extended_lambda_array_capture_default_constructible`	Array element type must be default-constructible on the host. After unwinding CV-qualifiers (kind 12 loop), calls `sub_550E50(30, element_type, 0)` to check default-constructibility. Failure emits this error.	`sub_41A1F0` line ~40
3599	`extended_lambda_array_capture_assignable`	Array element type must be copy-assignable on the host. Calls `sub_5BD540` to get the assignment operator, then `sub_510860(60, ...)` to verify it is callable. Failure emits this error.	`sub_41A1F0` lines ~42--44
3616	`extended_lambda_pack_capture`	Cannot capture an element of a parameter pack. After calling `sub_41A1F0` for type validation, `sub_7A8C00` checks whether the capture type involves a pack expansion; if so, this error fires.	`sub_42EE00` line ~517
3610	`extended_lambda_init_capture_initlist`	Init-captures with `std::initializer_list` type are prohibited. The type walk callback `sub_41B420` checks kind and class identity.	`sub_41B420` / `sub_4907A0`
3602	`extended_lambda_capture_in_constexpr_if`	An extended lambda cannot first-capture a variable inside a `constexpr if` branch. The capture must be visible outside the discarded branch.	`sub_447930` Phase 6
3614	`extended_lambda_hd_init_capture`	Init-captures are completely prohibited for `__host__ __device__` lambdas. When byte+25 bit 4 is set (HD wrapper) and the lambda has any captures, this error fires and the HD bits are cleared.	`sub_447930` line ~1710
--	`this_addr_capture_ext_lambda`	Implicit capture of `this` in an extended lambda triggers a warning. Separate from the errors above; fires during capture list processing.	`sub_42FE50` / `sub_42D710`
--	(no tag)	`*this` capture requires either `__device__`-only or definition inside `__device__`/`__global__` function, unless enabled by language dialect.	`sub_42FE50`

Category 2: Type Restrictions

Type restrictions enforce that every type visible in the lambda's public interface (captures, parameters, return type, and parent function template arguments) is accessible to the device compiler. Three contexts are checked, each with two sub-checks (function-local types and private/protected class member types). Additionally, the parent function's template arguments are checked for private/protected template members.

Error	Tag	Context	Restriction
3603	`extended_lambda_capture_local_type`	Capture variable type	A type local to a function cannot appear in the type of a captured variable.
3604	`extended_lambda_capture_private_type`	Capture variable type	A private or protected class member type cannot appear in the type of a captured variable.
3606	`extended_lambda_call_operator_local_type`	`operator()` signature	A function-local type cannot appear in the return or parameter types of the lambda's `operator()`.
3607	`extended_lambda_call_operator_private_type`	`operator()` signature	A private/protected class member type cannot appear in the `operator()` return or parameter types.
3610	`extended_lambda_parent_local_type`	Parent template args	A function-local type cannot appear in the template arguments of the enclosing parent function or any parent classes.
3611	`extended_lambda_parent_private_type`	Parent template args	A private/protected class member type cannot appear in the template arguments of the enclosing parent function or parent classes.
3635	`extended_lambda_parent_private_template_arg`	Parent template args	A template that is itself a private/protected class member cannot be used as a template argument of the enclosing parent.

Type Walk Dispatch via `dword_E7FE78`

The callback sub_41B420 uses a global discriminator dword_E7FE78 to select between the three contexts. Each context corresponds to a different value:

`dword_E7FE78`	Context	Local-type error	Private-type error
0	Capture variable type	3603	3604
1	`operator()` signature	3606	3607
2	Parent template args	3610	3611

The error code in sub_41B420 is selected by a conditional, not by a single arithmetic formula. When dword_E7FE78 == 0, the default code is used unchanged: 3603 for function-local types and 3604 for private/protected types. For any nonzero value the code is 4 * (dword_E7FE78 != 1) + base, where base is 3606 (local) or 3607 (private). With dword_E7FE78 == 1 the multiplier term is 4*0 = 0, yielding 3606 / 3607; with dword_E7FE78 == 2 it is 4*1 = 4, yielding 3610 / 3611. The decompiled code shows:

// For function-local type check:
v2 = 3603;
if (dword_E7FE78)
    v2 = 4 * (unsigned int)(dword_E7FE78 != 1) + 3606;
// dword_E7FE78=0 -> 3603
// dword_E7FE78=1 -> 4*0 + 3606 = 3606
// dword_E7FE78=2 -> 4*1 + 3606 = 3610

// For private/protected type check:
v4 = 3604;
if (dword_E7FE78)
    v4 = 4 * (unsigned int)(dword_E7FE78 != 1) + 3607;
// dword_E7FE78=0 -> 3604
// dword_E7FE78=1 -> 4*0 + 3607 = 3607
// dword_E7FE78=2 -> 4*1 + 3607 = 3611

The tree walk itself is invoked via sub_7B0B60(type_node, sub_41B420, error_base). The error_base parameter (792 or 795) is stored in a global and used by the walker to control recursion behavior, not error selection.
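
The error selection performed by the callback can be restated as a small pure function. This is a sketch for checking the table against the decompiled conditional; WalkContext and select_type_error are hypothetical names, and the context is passed as a parameter instead of being read from dword_E7FE78.

```cpp
#include <cassert>

// Context values taken from the dword_E7FE78 table above.
enum class WalkContext { Capture = 0, CallOperator = 1, ParentTemplateArgs = 2 };

// `private_type` selects between the function-local check (false) and the
// private/protected check (true).
int select_type_error(WalkContext ctx, bool private_type) {
    const int context = static_cast<int>(ctx);
    assert(context >= 0 && context <= 2);
    int code = private_type ? 3604 : 3603;  // default for context 0
    if (context != 0)
        code = 4 * (context != 1) + (private_type ? 3607 : 3606);
    return code;
}
```

Evaluating all six combinations reproduces the table: 3603/3604, 3606/3607, and 3610/3611.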

Category 3: Enclosing Parent Function Restrictions

The parent function (the function in whose body the extended lambda is defined) must satisfy several naming and linkage constraints. These exist because the device compiler must be able to instantiate the wrapper template at a globally-unique mangled name derived from the parent function's signature.

Error	Tag	Restriction	Rationale
3605	`extended_lambda_enclosing_function_local`	Parent function must not be defined inside another function (local function).	Nested function bodies have no externally-visible mangling; the wrapper tag would be unresolvable.
3608	`extended_lambda_cant_take_function_address`	Parent function must allow its address to be taken. Checks `entity+80 bits 0-1` for address-taken capability.	The wrapper tag encodes a function pointer to the parent's `operator()`. If address-of is forbidden (e.g., deleted functions), the tag is ill-formed.
3609	`extended_lambda_parent_class_unnamed`	Parent function cannot be a member of an unnamed class. Walks the scope chain checking `entity+8` (name pointer) for null.	Unnamed classes have no mangled name, making the wrapper tag unresolvable.
3601	`extended_lambda_parent_non_extern`	On Windows only: parent function must have external linkage. Internal or no linkage is prohibited.	Windows COFF requires external linkage for cross-TU symbol resolution. On Linux ELF this restriction does not apply. Checks `entity+81 bit 2` (has_qualified_scope) and `entity+8` (name).
3608	`extended_lambda_inaccessible_parent`	Parent function cannot have private or protected access within its class. Checks `entity+80 bits 0-1` (access specifier).	Private/protected member functions are not visible to the device compiler's separate compilation pass.
3592	`extended_lambda_enclosing_function_deducible`	Parent function must not have a deduced return type (`auto` return). Checks `entity+81 bit 0` (is_deprecated flag used as deducible marker).	Deduced return types are resolved lazily; the wrapper template needs a concrete type.
3600	(no dedicated tag)	Parent function cannot be `= delete`d or `= default`ed. Checks `entity+166` for values 1 or 2 (deleted, defaulted).	A deleted/defaulted function has no body, so the lambda cannot exist.
3613	(no dedicated tag)	Parent function cannot have a `noexcept` specification. Checks `entity+191 bit 0`.	Exception specifications interact with the wrapper's `NeverThrows` template parameter in ways that cannot be validated at frontend time.
3615	`extended_lambda_enclosing_function_not_found`	The validator (`sub_41A3E0`) could not locate the enclosing function. Fires when the type annotation context byte has bit 0 set but the host-device validation context `a2 == 0`.	Internal consistency check; should not occur in well-formed code.

Category 4: Template Parameter Restrictions

The parent function's template parameter list must satisfy naming and variadic constraints to ensure the wrapper tag type can be uniquely instantiated.

Error	Tag	Restriction
--	`extended_lambda_parent_template_param_unnamed`	Every template parameter of the enclosing parent function must be named. Anonymous template parameters (`template <typename>`) prevent the wrapper from referencing the parameter in its tag type. Checked per-parameter during scope walk.
--	`extended_lambda_nest_parent_template_param_unnamed`	Same restriction applied to nested parent scopes (enclosing class templates, enclosing function templates above the immediate parent).
--	`extended_lambda_multiple_parameter_packs`	The parent template function can have at most one variadic parameter pack, and it must be the last parameter. Multiple packs or non-trailing packs prevent the device compiler from deducing the wrapper specialization.

Category 5: Nesting and Context Restrictions

Error	Tag	Restriction	Rationale
--	`extended_lambda_enclosing_function_generic_lambda`	An extended lambda cannot be defined inside a generic lambda expression. Generic lambdas have template `operator()` which makes the closure type non-deducible for wrapper tag generation.	Generic lambdas produce dependent types that the wrapper system cannot resolve.
--	`extended_lambda_enclosing_function_hd_lambda`	An extended lambda cannot be defined inside another extended `__host__ __device__` lambda.	The wrapper for the outer HD lambda would need to capture the inner wrapper, creating a recursive type dependency.
--	`extended_host_device_generic_lambda`	A `__host__ __device__` extended lambda cannot be a generic lambda (i.e., with `auto` parameters).	The HD wrapper uses type erasure with concrete function pointer types. Generic lambdas would require polymorphic function pointers, which the type erasure scheme cannot express.
--	`extended_lambda_inaccessible_ancestor`	An extended lambda cannot be defined inside a class that has private or protected access within another class.	The wrapper tag must be visible to both host and device compilation passes. A privately-nested class is not accessible from the translation-unit scope where the wrapper template is instantiated.
--	`extended_lambda_inside_constexpr_if`	An extended lambda cannot be defined inside the `if` or `else` block of a `constexpr if` statement (platform/dialect dependent).	Discarded `constexpr if` branches may eliminate the lambda entirely, but the preamble has already been committed. Restriction prevents dangling wrapper specializations.
3590	`extended_lambda_multiple_parent`	Cannot specify `__nv_parent` more than once in a single lambda's capture list.	`__nv_parent` stores a single parent class pointer at `lambda_info + 32`; only one slot exists.
3634	(no dedicated tag)	`__nv_parent` requires the lambda to be `__device__` annotated. If `__nv_parent` is specified without `__device__` execution space, this error fires. Additionally validates that the enclosing scope has `__host__` but not `__device__` execution space (bits at entity+182).	`__nv_parent` is used to link the device closure to its enclosing class for member access. This is only meaningful in device execution context.

Category 6: Specifier and Annotation Restrictions

Error	Tag	Restriction
3612	`extended_lambda_disallowed`	`__host__` or `__device__` annotation on a lambda when `--extended-lambda` is not enabled. This is a warning, not an error. The flag must be explicitly passed on the command line.
3620	`extended_lambda_constexpr`	The `constexpr` specifier is not allowed on an extended lambda's `operator()`. Also applies to `consteval`. Two separate emit calls: one for `"constexpr"` and one for `"consteval"`.
3621	(no dedicated tag)	The `operator()` function for a lambda cannot be explicitly annotated with execution space annotations (`__host__`/`__device__`/`__global__`). The annotations are derived from the closure class, not the operator. Fires when `entity+182 bits 1-2` are set on the call operator.
3689	(no dedicated tag)	OptiX mode incompatibility. When both `dword_106BDD8` (OptiX) and `dword_106B670` (a secondary OptiX flag) are set, and the lambda body at `qword_106B678 + 176*dword_106B670 + 5` has bit 3 set, this error fires. OptiX has stricter lambda body requirements than standard CUDA.
3690	`extended_lambda_discriminator`	Lambda numbering lookup failure in the red-black tree (`ptr` / `dword_E7FE48`). The tree maps source positions to lambda indices for unique wrapper tag generation. If the tree search fails, the wrapper cannot be uniquely identified.
3691	(no dedicated tag)	Extended lambda with `__host__ __device__` annotation where the type annotation byte has bit 4 set (HD init-capture validation context). Issued by `sub_41A3E0` as a final post-check.

Category 7: Enclosing Scope Miscellaneous

Error	Tag	Restriction
3617	`extended_lambda_no_parent_func`	No enclosing function could be found for the extended lambda. `sub_6BCDD0` (`nv_find_parent_lambda_function`) walked the scope chain and returned null. The lambda may be at file scope, which is not a valid context for an extended lambda.
3618	`extended_lambda_illegal_parent`	Ambiguous overload when resolving the enclosing function. `sub_6BCDD0` found multiple candidate functions. Emitted via `sub_4F6E50` with three operands (location, space string, function name).
3619	(no dedicated tag)	Secondary ambiguity variant. Same as 3618 but fires on a different branch (the `v291[0]` check rather than `v287[0]`), indicating the ambiguity was detected through a different resolution path.
3601	(duplicate)	Lambda defined in unnamed namespace (`entity+81 bit 2` set and `entity+8` name pointer is null). The wrapper tag requires a named scope.
3605	(duplicate)	Non-trivially-copyable type in capture scope. When `entity+80 bits 0-1` indicate non-trivial copy semantics, the capture cannot be transferred to device memory.

Validation Architecture

Phase 4 of scan_lambda: Per-Lambda Validation

After parsing the capture list and annotations (Phases 1--3), scan_lambda enters the extended lambda validation block. This block is guarded by dword_106BF38 (extended lambda mode) and the annotation bits at lambda_info + 25. The validation proceeds as:

sub_447930 (scan_lambda), Phase 4 entry:
  |
  +-- Call sub_6BCDD0 (nv_find_parent_lambda_function)
  |     Returns: parent function node, sets is_device/is_template flags
  |
  +-- If parent == NULL:  emit error 3617
  +-- If ambiguous:       emit error 3618 or 3619
  |
  +-- Validate parent function properties:
  |     entity+81 bit 0  -> error 3592 (deprecated/deducible)
  |     entity+191 bit 0 -> error 3613 (noexcept spec)
  |     entity+166 == 1|2 -> error 3600 (deleted/defaulted)
  |     entity+81 bit 2  -> unnamed scope check -> error 3601
  |     entity+80 bits 0-1 -> address-taken / access check -> error 3608
  |
  +-- Walk parent scope chain for unnamed classes:
  |     entity+8 == NULL -> error 3609
  |     Non-trivial copy   -> error 3605
  |
  +-- Check capture-default conflicts:
  |     byte+24 bits 4+5 both set -> error 3593 (& and = conflict)
  |
  +-- OptiX gate: dword_106BDD8 -> error 3689
  |
  +-- Lambda numbering via red-black tree:
        Lookup failure -> error 3690
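A subset of the parent-function checks in the flow above can be sketched in C. This is an illustrative model under stated assumptions: struct parent_entity and its field names are hypothetical stand-ins for the bytes at the cited entity offsets, the function returns only the first matching diagnostic (the real validator may emit more than one), and the ambiguity errors 3618/3619 and the 3608 check are not modeled.

```c
#include <stddef.h>

/* Hypothetical, simplified view of the entity fields inspected in Phase 4.
 * Field names are illustrative; the comments give the offsets cited above. */
struct parent_entity {
    const char   *name;          /* entity+8   (name pointer) */
    unsigned char flags81;       /* entity+81  (bit 0, bit 2 used here) */
    unsigned char delete_state;  /* entity+166 (1 or 2 = deleted/defaulted) */
    unsigned char flags191;      /* entity+191 (bit 0 = noexcept spec) */
};

/* Returns the first diagnostic number that applies, or 0 if none does.
 * A NULL parent models sub_6BCDD0 finding no enclosing function. */
int check_parent_function(const struct parent_entity *p)
{
    if (p == NULL)
        return 3617;                               /* no parent function */
    if (p->flags81 & 0x01)
        return 3592;                               /* deduced return type */
    if (p->flags191 & 0x01)
        return 3613;                               /* noexcept specification */
    if (p->delete_state == 1 || p->delete_state == 2)
        return 3600;                               /* deleted or defaulted */
    if ((p->flags81 & 0x04) && p->name == NULL)
        return 3601;                               /* unnamed scope */
    return 0;
}
```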

Per-Capture Validation: sub_42EE00

For each captured variable, make_field_for_lambda_capture runs targeted checks:

sub_42EE00 (make_field_for_lambda_capture):
  |
  +-- If byte+25 bit 3 set (device wrapper):
  |     |
  |     +-- Check init-capture for array type
  |     |     type_node+48 == 3 && sub_kind == 21 -> error 3596
  |     |
  |     +-- Call sub_41A1F0 (walk_type_for_hd_violations)
  |     |     Counts array dimensions, checks element type
  |     |     dim > 7 -> error 3597
  |     |     Not default-constructible -> error 3598
  |     |     Not assignable -> error 3599
  |     |
  |     +-- Check for pack expansion
  |           sub_7A8C00 returns true -> error 3616
  |
  +-- (Later) If byte+25 bit 4 set (HD wrapper):
        |
        +-- Call sub_7B0B60 with sub_41B420 callback
              Walks entire type tree, fires 3603/3604 for
              function-local and private/protected types

Type Hierarchy Walker: sub_41A3E0 / sub_41A1F0

sub_41A3E0 is the outer wrapper that validates the per-capture annotation context. sub_41A1F0 performs the recursive array dimension walk and element-type validation.

sub_41A3E0 (validate_type_hd_annotation):
  |
  +-- Determine context string: "__device__" or "__host__ __device__"
  |     Based on a2 parameter (0 = HD, nonzero = device-only)
  |
  +-- Check annotation byte (a1+32):
  |     bit 0 set && a2==0 -> error 3615
  |     bit 3 set -> check parent visibility:
  |       entity+163 < 0 (private) -> check bit pattern
  |       Both bits 3+4 set with private parent -> error 3635
  |       Otherwise -> error 3593
  |     bit 5 set -> error 3594 (private/protected access)
  |
  +-- Unwrap CV-qualifiers on element type (kind==12 loop)
  |
  +-- Call sub_41A1F0 (walk_type_for_hd_violations):
  |     Recursive array walker:
  |       v6 = dimension counter
  |       Loop: while sub_7A8370(type) returns true
  |         increment v6, follow sub_7A9310 to element type
  |       If v6 > 7: error 3597
  |       Unwrap CV (kind==12 loop)
  |       If not in dependent context (dword_126C5C4 == -1):
  |         Check scope flags (byte+6 bits 1-2)
  |         sub_550E50(30, type, 0) -> error 3598 (not default-constructible)
  |         sub_5BD540 + sub_510860(60, ...) -> error 3599 (not assignable)
  |       Call sub_7B0B60(type, sub_41B420, 792) for deep type walk
  |
  +-- If a3 (third parameter) set:
        Check bit 4 of annotation byte -> error 3691
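The array-rank portion of the walker can be illustrated with a toy type graph. This is a sketch under assumptions: struct toy_type and both function names are hypothetical, the kind test stands in for sub_7A8370, and following the element pointer stands in for sub_7A9310. CV-qualifier unwrapping and the constructibility checks are omitted.

```c
#include <stddef.h>

enum toy_kind { TOY_SCALAR, TOY_ARRAY };

/* Hypothetical stand-in for an EDG type node. */
struct toy_type {
    enum toy_kind          kind;
    const struct toy_type *element;  /* meaningful only when kind == TOY_ARRAY */
};

/* Counts nested array dimensions by repeatedly stepping to the element type. */
int count_array_rank(const struct toy_type *t)
{
    int rank = 0;
    while (t != NULL && t->kind == TOY_ARRAY) {
        rank++;
        t = t->element;
    }
    return rank;
}

/* Returns 3597 when the rank exceeds 7, otherwise 0. */
int check_array_capture_rank(const struct toy_type *t)
{
    return count_array_rank(t) > 7 ? 3597 : 0;
}
```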

sub_41B420: Type Walk Callback

This compact callback (33 lines decompiled) is invoked by sub_7B0B60 for every type node in the capture's type tree. It checks two properties:

Function-local type -- entity+81 bit 0 set: the type is defined inside a function body. Error selection uses dword_E7FE78 to pick between capture context (3603), operator() context (3606), and parent template-arg context (3610).
Private/protected member type -- entity+81 bit 2 set AND entity+80 bits 0-1 in range [1,2] (private or protected access specifier). Error selection parallels the local-type case: 3604, 3607, or 3611 depending on dword_E7FE78.

Special case: when entity+132 == 9 (template parameter dependent type) AND entity+152 points to a class with byte+86 bit 0 set AND entity+72 is non-null, the function-local check is suppressed. This handles template parameters that are not themselves local but instantiate with local types -- the error is deferred to instantiation time.
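The two property tests can be written as a pure function over the two entity bytes. This is a minimal sketch with assumed names: classify_type_node, TW_LOCAL_TYPE and TW_PRIVATE_TYPE are hypothetical, and the template-parameter special case described above is deliberately left out.

```c
/* Hypothetical result flags for the two properties checked by the callback. */
enum {
    TW_LOCAL_TYPE   = 1,  /* function-local type */
    TW_PRIVATE_TYPE = 2   /* private or protected member type */
};

/* byte80 and byte81 model the bytes at entity+80 and entity+81.
 * Returns a bitwise OR of TW_* flags; 0 means neither property holds. */
int classify_type_node(unsigned char byte80, unsigned char byte81)
{
    int result = 0;
    unsigned access = byte80 & 0x03;   /* bits 0-1: access specifier */

    if (byte81 & 0x01)                 /* bit 0: defined inside a function body */
        result |= TW_LOCAL_TYPE;
    if ((byte81 & 0x04) && (access == 1 || access == 2))
        result |= TW_PRIVATE_TYPE;     /* bit 2 set and access in [1,2] */
    return result;
}
```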

Diagnostic Tag Reference

Complete list of all 39 extended lambda diagnostic tags, sorted alphabetically. All tags can be used with --diag_suppress, --diag_warning, --diag_error on the command line, and with #pragma nv_diag_suppress, #pragma nv_diag_warning, #pragma nv_diag_error in source.

Tag	Category
`extended_host_device_generic_lambda`	Nesting
`extended_lambda_array_capture_assignable`	Capture
`extended_lambda_array_capture_default_constructible`	Capture
`extended_lambda_array_capture_rank`	Capture
`extended_lambda_call_operator_local_type`	Type
`extended_lambda_call_operator_private_type`	Type
`extended_lambda_cant_take_function_address`	Parent
`extended_lambda_capture_in_constexpr_if`	Capture
`extended_lambda_capture_local_type`	Type
`extended_lambda_capture_private_type`	Type
`extended_lambda_constexpr`	Specifier
`extended_lambda_disallowed`	Specifier
`extended_lambda_discriminator`	Internal
`extended_lambda_enclosing_function_deducible`	Parent
`extended_lambda_enclosing_function_generic_lambda`	Nesting
`extended_lambda_enclosing_function_hd_lambda`	Nesting
`extended_lambda_enclosing_function_local`	Parent
`extended_lambda_enclosing_function_not_found`	Parent
`extended_lambda_hd_init_capture`	Capture
`extended_lambda_illegal_parent`	Parent
`extended_lambda_inaccessible_ancestor`	Nesting
`extended_lambda_inaccessible_parent`	Parent
`extended_lambda_init_capture_array`	Capture
`extended_lambda_init_capture_initlist`	Capture
`extended_lambda_inside_constexpr_if`	Nesting
`extended_lambda_multiple_parameter_packs`	Template
`extended_lambda_multiple_parent`	Nesting
`extended_lambda_nest_parent_template_param_unnamed`	Template
`extended_lambda_no_parent_func`	Parent
`extended_lambda_pack_capture`	Capture
`extended_lambda_parent_class_unnamed`	Parent
`extended_lambda_parent_local_type`	Type
`extended_lambda_parent_non_extern`	Parent
`extended_lambda_parent_private_template_arg`	Type
`extended_lambda_parent_private_type`	Type
`extended_lambda_parent_template_param_unnamed`	Template
`extended_lambda_reference_capture`	Capture
`extended_lambda_too_many_captures`	Capture
`this_addr_capture_ext_lambda`	Capture

Bitmap Interaction

The capture count limit of 1023 derives from the bitmap architecture. Each wrapper type (device and host-device) uses a 128-byte bitmap (unk_1286980 / unk_1286900) storing 1024 bits. The bitmap setter sub_6BCBF0 performs:

result[capture_count >> 6] |= 1LL << capture_count;

Bit 0 is never emitted as a wrapper specialization (the zero-capture case uses the primary template). Bits 1--1023 map to generated partial specializations. The error check at capture count > 0x3FE (1022) fires before the bitmap set operation, so the effective maximum is 1023 captures. A count of 1024 or more would index word 16 (capture_count >> 6), one past the end of the 16-word (128-byte) bitmap, though in practice the error check prevents this.
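A portable sketch of the bitmap operation follows. The names record_capture_count, capture_count_recorded and CAPTURE_BITMAP_WORDS are hypothetical. Unlike the decompiled expression, the sketch masks the shift count with & 63 and rejects out-of-range counts explicitly; shifting a 64-bit value by 64 or more is undefined in C, and the assumption here is that the original relies on the x86-64 shift instruction masking the count.

```c
#include <stdint.h>

#define CAPTURE_BITMAP_WORDS 16u   /* 128 bytes / 8 bytes per word = 1024 bits */

/* Sets the bit for `count` captures. Returns 0 on success, -1 if out of range. */
int record_capture_count(uint64_t bitmap[CAPTURE_BITMAP_WORDS], unsigned count)
{
    if (count >= CAPTURE_BITMAP_WORDS * 64u)
        return -1;                                   /* would index past word 15 */
    bitmap[count >> 6] |= (uint64_t)1 << (count & 63u);
    return 0;
}

/* Returns 1 if the bit for `count` is set, 0 otherwise. */
int capture_count_recorded(const uint64_t bitmap[CAPTURE_BITMAP_WORDS], unsigned count)
{
    if (count >= CAPTURE_BITMAP_WORDS * 64u)
        return 0;
    return (int)((bitmap[count >> 6] >> (count & 63u)) & 1u);
}
```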

Operator() Annotation Derivation

Error 3621 enforces a fundamental design rule: the operator() function of an extended lambda must not carry explicit execution space annotations. Instead, the execution space is derived from the closure class. During scan_lambda Phase 5 (decl_call_operator_for_lambda), the code sets the call operator's execution space from lambda_info + 25:

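// Propagate device/host from lambda_info to call operator
// lambda+25 bit 2 -> operator+182 bit 4 (mask 0x10); other operator bits preserved
byte[operator+182] = ((4 * byte[lambda+25]) & 0x10) | (byte[operator+182] & 0xEF);
// lambda+25 bit 1 -> operator+182 bit 5 (mask 0x20); other operator bits preserved
byte[operator+182] = ((16 * byte[lambda+25]) & 0x20) | (byte[operator+182] & 0xDF);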

If the call operator already has execution space bits set (from explicit annotation by the user), error 3621 fires. The rationale is that the wrapper template's tag type already encodes the execution space; having the operator carry its own annotations would create an inconsistency that the device compiler cannot resolve.
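The bit arithmetic of the propagation step can be checked in isolation. The function below is a sketch with a hypothetical name (propagate_exec_space); it takes the two bytes as plain arguments and makes no claim about which execution space each bit represents.

```c
#include <stdint.h>

/* op_flags models the byte at operator+182, lambda_flags the byte at lambda+25.
 * Returns the updated operator byte. */
uint8_t propagate_exec_space(uint8_t op_flags, uint8_t lambda_flags)
{
    /* Multiply by 4 (shift left 2): lambda bit 2 lands on bit 4 (0x10).
     * Bit 4 of the operator byte is cleared first via the 0xEF mask. */
    op_flags = (uint8_t)(((4 * lambda_flags) & 0x10) | (op_flags & 0xEF));

    /* Multiply by 16 (shift left 4): lambda bit 1 lands on bit 5 (0x20).
     * Bit 5 of the operator byte is cleared first via the 0xDF mask. */
    op_flags = (uint8_t)(((16 * lambda_flags) & 0x20) | (op_flags & 0xDF));

    return op_flags;
}
```

Both target bits are overwritten unconditionally, which is why any execution space bits already present on the operator have to be diagnosed before this step.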

Key Functions

Address	Name (recovered)	Lines	Role
`sub_447930`	`scan_lambda`	2113	Master lambda parser; Phase 4 = restriction validator
`sub_42EE00`	`make_field_for_lambda_capture`	551	Per-capture field creator with device-lambda validation
`sub_41A3E0`	`validate_type_hd_annotation`	75	Outer type annotation checker (errors 3593/3594/3615/3635/3691)
`sub_41A1F0`	`walk_type_for_hd_violations`	81	Recursive array dim / element-type validator (3597/3598/3599)
`sub_41B420`	(type walk callback)	33	Issues 3603/3604/3606/3607/3610/3611 via `dword_E7FE78` dispatch
`sub_6BCDD0`	`nv_find_parent_lambda_function`	33	Scope chain walk to find enclosing host/device function
`sub_6BCBF0`	`nv_record_capture_count`	13	Set bit in device or host-device bitmap
`sub_4F9F20`	(capture count finalizer)	~620	Checks capture count > 0x3FE, calls bitmap setter
`sub_7B0B60`	(tree walker)	--	Recursive type tree traversal, calls callback for each node
`sub_7A8370`	(is_array_type)	--	Returns nonzero if type node is an array type
`sub_7A9310`	(get_element_type)	--	Returns the element type of an array type node
`sub_550E50`	(check_default_constructible)	--	`sub_550E50(30, type, 0)` tests default-constructibility
`sub_510860`	(check_callable)	--	`sub_510860(60, op, type)` tests if operator is callable

Global State

Variable	Address	Purpose
`dword_106BF38`	`0x106BF38`	Extended lambda mode flag (`--extended-lambda`)
`dword_106BDD8`	`0x106BDD8`	OptiX mode flag
`dword_106B670`	`0x106B670`	Secondary OptiX lambda flag
`qword_106B678`	`0x106B678`	OptiX lambda body array base pointer
`dword_E7FE78`	`0xE7FE78`	Type walk context discriminator (0=capture, 1=operator, 2=parent)
`ptr`	(stack)	Red-black tree root for lambda numbering per source position
`dword_E7FE48`	`0xE7FE48`	Red-black tree sentinel node
`dword_126C5C4`	`0x126C5C4`	Dependent scope index (-1 = not in dependent context)
`dword_126EFAC`	`0x126EFAC`	CUDA mode flag
`dword_126EFA4`	`0x126EFA4`	GCC extensions flag
`qword_126EF98`	`0x126EF98`	GCC compatibility version

Extended Lambda Overview -- end-to-end pipeline, annotation bits, lambda_info layout
Device Lambda Wrapper -- __nv_dl_wrapper_t template structure
Host-Device Lambda Wrapper -- __nv_hdl_wrapper_t type-erased design
Capture Handling -- __nv_lambda_field_type, __nv_lambda_array_wrapper
Preamble Injection -- sub_6BCC20 emission pipeline
CUDA Error Catalog -- complete error index with message templates
Cross-Space Call Validation -- execution space checking infrastructure

IL Overview

The Intermediate Language (IL) is EDG's central data structure -- a typed, scope-linked graph of every declaration, type, expression, statement, and template in the translation unit. cudafe++ (EDG 6.6) builds the IL during parsing, walks it for CUDA device/host separation, and emits it as the .int.c output. The IL never touches disk: IL_SHOULD_BE_WRITTEN_TO_FILE=0 forces in-memory-only operation. All IL nodes live in a region-based arena allocator, organized into file-scope (region 1) and per-function (region N) memory pools.

The IL is versioned as IL_VERSION_NUMBER="6.6" and carries the compile-time flag ALL_TEMPLATE_INFO_IN_IL=1, meaning template definitions, specializations, and instantiation directives are fully represented in the IL graph rather than deferred to a separate template database.

Key Configuration Constants

Constant	Value	Meaning
`IL_VERSION_NUMBER`	`"6.6"`	IL format version, matches EDG version
`IL_SHOULD_BE_WRITTEN_TO_FILE`	`0`	IL is never serialized to disk
`ALL_TEMPLATE_INFO_IN_IL`	`1`	Full template data in IL graph
`IL_FILE_SUFFIX`	(string)	Suffix for IL file names if serialization were enabled
`sizeof_il_entry` sentinel	`9999`	Validated at init time (guard value in `qword_E6C580`)

IL Entry Kind System

Every IL node carries an entry_kind byte that identifies its type. The name table off_E6DD80 (aliased as il_entry_kind_names at off_E6E020) maps these bytes to human-readable strings. The il_one_time_init function (sub_5CF7F0) validates that this table ends with a "last" sentinel.

There are 85 defined entry kind values (0-84). Some are primary node types with their own linked lists; others are auxiliary records displayed inline by their parent.

Complete il_entry_kind Table

Kind	Hex	Name	Bytes	Display	Notes
0	0x00	`none`	--	--	Null/invalid sentinel
1	0x01	`source_file_entry`	80	Case 1	File name, line ranges, include flags
2	0x02	`constant`	184	Case 2	16 sub-kinds (ck_*)
3	0x03	`param_type`	80	Case 3	Parameter type in function signature
4	0x04	`routine_type_supplement`	64	Inline	Embedded in routine type node
5	0x05	`routine_type_extra`	--	Inline	Additional routine type data
6	0x06	`type`	176	Case 6	22 sub-kinds (tk_*)
7	0x07	`variable`	232	Case 7	Variables, parameters, structured bindings
8	0x08	`field`	176	Case 8	Class/struct/union members
9	0x09	`exception_specification`	16	Case 9	noexcept, throw() specs
10	0x0A	`exception_spec_type`	24	Case 0xA	Type in exception specification
11	0x0B	`routine`	288	Case 0xB	Functions, methods, constructors, destructors
12	0x0C	`label`	128	Case 0xC	Goto labels, break/continue targets
13	0x0D	`expr_node`	72	Case 0xD	36 sub-kinds (enk_*)
14	0x0E	(reserved)	--	Inline	Skipped in display
15	0x0F	(reserved)	--	Inline	Skipped in display
16	0x10	`switch_case_entry`	56	Case 0x10	Case value + range for switch
17	0x11	`switch_info`	24	Case 0x11	Switch statement descriptor
18	0x12	`handler`	40	Case 0x12	try/catch handler entry
19	0x13	`try_supplement`	32	Inline	Try block extra info
20	0x14	`asm_supplement`	--	Inline	Inline asm statement data
21	0x15	`statement`	80	Case 0x15	26 sub-kinds (stmk_*)
22	0x16	`object_lifetime`	64	Case 0x16	Destruction ordering
23	0x17	`scope`	288	Case 0x17	9 sub-kinds (sck_*)
24	0x18	`base_class`	112	Case 0x18	Inheritance record
25	0x19	`string_text`	1*	--	Raw string literal bytes
26	0x1A	`other_text`	1*	--	Compiler version, misc text
27	0x1B	`template_parameter`	136	Case 0x1B	Template param with supplement
28	0x1C	`namespace`	128	Case 0x1C	Namespace declarations
29	0x1D	`using_declaration`	80	Case 0x1D	Using declarations/directives
30	0x1E	`dynamic_init`	104	Case 0x1E	9 sub-kinds (dik_*)
31	0x1F	`local_static_variable_init`	40	Case 0x1F	Static local init records
32	0x20	`vla_dimension`	48	Case 0x20	Variable-length array bound
33	0x21	`overriding_virtual_func`	40	Case 0x21	Virtual override info
34	0x22	(reserved)	--	Inline	Skipped in display
35	0x23	`derivation_path`	24	Case 0x23	Base-class derivation step
36	0x24	`base_class_derivation`	32	--	Derivation detail record
37	0x25	(reserved)	--	Inline	Skipped in display
38	0x26	(reserved)	--	Inline	Skipped in display
39	0x27	`class_info`	208	Case 0x27	Class type supplement
40	0x28	(reserved)	--	--	Skipped in display
41	0x29	`constructor_init`	48	Case 0x29	Ctor member/base initializer
42	0x2A	`asm_entry`	152	Case 0x2A	Inline assembly block
43	0x2B	`asm_operand`	--	Case 0x2B	Asm constraint + expression
44	0x2C	`asm_clobber`	--	Case 0x2C	Asm clobber register
45	0x2D	(reserved)	--	Inline	Skipped in display
46	0x2E	(reserved)	--	Inline	Skipped in display
47	0x2F	(reserved)	--	Inline	Skipped in display
48	0x30	(reserved)	--	Inline	Skipped in display
49	0x31	`element_position`	24	--	Designator element position
50	0x32	`source_sequence_entry`	32	Case 0x32	Declaration ordering
51	0x33	`full_entity_decl_info`	56	Case 0x33	Full declaration info
52	0x34	`instantiation_directive`	40	Case 0x34	Explicit instantiation
53	0x35	`src_seq_sublist`	24	Case 0x35	Source sequence sub-list
54	0x36	`explicit_instantiation_decl`	--	Case 0x36	extern template
55	0x37	`orphaned_entities`	56	Case 0x37	Entities without parent scope
56	0x38	`hidden_name`	32	Case 0x38	Hidden name entry
57	0x39	`pragma`	64	Case 0x39	Pragma records (43 kinds)
58	0x3A	`template`	208	Case 0x3A	Template declaration
59	0x3B	`template_decl`	40	Case 0x3B	Template declaration head
60	0x3C	`requires_clause`	16	Case 0x3C	C++20 requires clause
61	0x3D	`template_param`	136	Case 0x3D	Template parameter entry
62	0x3E	`name_reference`	40	Case 0x3E	Name lookup reference
63	0x3F	`name_qualifier`	40	Case 0x3F	Qualified name qualifier
64	0x40	`seq_number_lookup`	32	Case 0x40	Sequence number index
65	0x41	`local_expr_node_ref`	--	Case 0x41	Local expression reference
66	0x42	`static_assert`	24	Case 0x42	Static assertion
67	0x43	`linkage_spec`	32	Case 0x43	extern "C"/"C++" block
68	0x44	`scope_ref`	32	Case 0x44	Scope back-reference
69	0x45	(reserved)	--	Inline	Skipped in display
70	0x46	`lambda`	--	Case 0x46	Lambda expression
71	0x47	`lambda_capture`	--	Case 0x47	Lambda capture entry
72	0x48	`attribute`	72	Case 0x48	C++11/GNU attribute
73	0x49	`attribute_argument`	40	Case 0x49	Attribute argument
74	0x4A	`attribute_group`	8	Case 0x4A	Attribute group
75	0x4B	(reserved)	--	Inline	Skipped in display
76	0x4C	(reserved)	--	Inline	Skipped in display
77	0x4D	(reserved)	--	Inline	Skipped in display
78	0x4E	(reserved)	--	Inline	Skipped in display
79	0x4F	`template_info`	--	Case 0x4F	Template instantiation info
80	0x50	`subobject_path`	24	Case 0x50	Address constant sub-path
81	0x51	(reserved)	--	Inline	Skipped in display
82	0x52	`module_info`	--	Case 0x52	C++20 module metadata
83	0x53	`module_decl`	--	Case 0x53	Module declaration
84	0x54	`last`	--	--	Sentinel for table validation

Inline entries (kinds 4, 5, 14, 15, 19, 20, 27, 34, 37, 38, 40, 45-48, 69, 75-78, 81) are displayed as part of their parent node rather than as standalone IL entries. The display dispatcher (sub_5F4930) returns immediately for these kinds.

IL Header Structure

The IL header lives in the BSS segment at 0x126EB60 and is printed by display_il_header_and_file_scope (sub_5F76B0). It records translation-unit-level metadata:

struct il_header {                        // at xmmword_126EB60
    il_entry*   primary_source_file;      // +0x00  head of source file list
    scope*      primary_scope;            // +0x08  file-scope root
    routine*    main_routine;             // +0x10  main() if present
    char*       compiler_version;         // +0x18  "6.6" version string
    char*       time_of_compilation;      // +0x20  build timestamp
    uint8_t     plain_chars_are_signed;   // +0x28  signedness of plain char
    uint32_t    source_language;          // +0x2C  0=C++, 1=C (dword_126EBA8)
    uint32_t    std_version;             // +0x30  e.g. 201703 (dword_126EBAC)
    uint8_t     pcc_compatibility_mode;   // +0x34  PCC compat flag
    uint8_t     enum_type_is_integral;    // +0x35
    uint32_t    default_max_member_align; // +0x38
    uint8_t     gcc_mode;                // +0x3C  GCC compatibility
    uint8_t     gpp_mode;                // +0x3D  G++ compatibility
    uint32_t    gnu_version;             // +0x40  e.g. 40201
    uint8_t     short_enums;             // +0x44
    uint8_t     default_nocommon;         // +0x45
    uint8_t     UCN_identifiers_used;     // +0x46
    uint8_t     vla_used;                // +0x47
    uint8_t     any_templates_seen;       // +0x48
    uint8_t     prototype_instantiations_in_il;     // +0x49
    uint8_t     il_has_all_prototype_instantiations; // +0x4A
    uint8_t     il_has_C_semantics;       // +0x4B
    uint8_t     nontag_types_used_in_exception_or_rtti; // +0x4C
    il_entry*   seq_number_lookup_entries; // +0x50
    uint32_t    target_configuration_index; // +0x58
};

The source_language field selects the display string "sl_Cplusplus" or "sl_C". When source_language == 1 (C mode) and std_version > 199900, the routine display additionally prints C99 pragma state fields (fp_contract, fenv_access, cx_limited_range).
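The two display decisions described above reduce to simple predicates. This is a sketch only: the function names are hypothetical, and the values are passed in as arguments rather than read from the header.

```c
/* Returns the display string for the source_language field (0 = C++, 1 = C). */
const char *source_language_name(unsigned source_language)
{
    return source_language == 1 ? "sl_C" : "sl_Cplusplus";
}

/* Returns nonzero when the routine display would also print the C99 pragma
 * state fields (fp_contract, fenv_access, cx_limited_range). */
int prints_c99_pragma_state(unsigned source_language, unsigned std_version)
{
    return source_language == 1 && std_version > 199900;
}
```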

Memory Region System

IL entries are allocated in numbered memory regions managed by a bump allocator (sub_6B7D60):

Region	Purpose	Lifetime	Globals
1	File scope	Entire translation unit	`dword_126EC90` (region ID), `dword_126F690`/`dword_126F694` (base offset / prefix size)
2..N	Per-function scope	Duration of function body processing	`dword_126EB40` (current region), `dword_126F688`/`dword_126F68C` (base offset / prefix size)

Region 1 contains all file-scope declarations: types, global variables, function declarations, namespaces, templates. Regions 2+ are allocated one per function definition and hold that function's local variables, statements, expressions, labels, and temporaries. The region table at qword_126EC88 maps region indices to their memory, while qword_126EB90 maps region indices to their associated scope entries. dword_126EC80 tracks the total number of regions.

The allocator selects file-scope vs function-scope by comparing dword_126EB40 == dword_126EC90. When equal, the node goes into region 1; otherwise it goes into the current function region. Some node types force a specific region:

Labels (alloc_label at sub_5E5CA0): Assert that the current region is NOT file scope
Templates (alloc_template at sub_5E8D20): Always file-scope only
Sequence number lookups (sub_5E9170): Force region 1 by temporarily setting TU-copy mode

The display system (sub_5F7DF0) iterates all regions:

// File scope
printf("Intermediate language for memory region 1 (file scope):");
walk_file_scope_il(display_il_entry, ...);   // sub_60E4F0

// Per-function regions
for (long r = 2; r <= region_count; r++) {   // long, to match the %ld format below
    scope* s = scope_table[r];
    routine* fn = s->assoc_routine;
    printf("Intermediate language for memory region %ld (function \"%s\"):",
           r, fn->name);
    walk_routine_scope_il(r, display_il_entry, ...);  // sub_610200
}

IL Entry Prefix

Every IL node has a prefix of one to three qwords preceding the node body. The prefix size depends on allocation mode: 24 bytes (3 qwords) in normal file-scope mode, 16 bytes (2 qwords) in TU-copy mode, and 8 bytes (1 qword) for function-scope allocations. The allocator (sub_6B7D60) allocates a contiguous block and its caller returns a pointer past the prefix, so the prefix occupies negative offsets from the returned node pointer.

Normal file-scope mode (dword_106BA08 == 0, dword_126F694 = 24):

Raw allocation layout (normal file-scope, 24-byte prefix):
Offset  Size  Field
------  ----  -----
+0      8     translation_unit_copy_address   (qword, zeroed in normal mode)
+8      8     next_in_list                    (qword, linked list pointer)
+16     8     prefix flags qword              (flags byte at +16, 7 bytes padding)
+24     ...   node body starts here           (returned pointer)

Node pointer perspective (ptr = raw + 24):
ptr - 24  = TU copy address   (8 bytes at raw+0)
ptr - 16  = next pointer       (8 bytes at raw+8)
ptr - 8   = prefix flags byte  (8 bytes at raw+16, flags in low byte)
ptr + 0   = first byte of node body

TU-copy mode (dword_106BA08 != 0, dword_126F694 = 16):

Raw allocation layout (TU-copy mode, 16-byte prefix):
+0      8     next_in_list                    (no TU copy slot)
+8      8     prefix flags qword
+16     ...   node body starts here           (returned pointer)

Function-scope allocations (dword_126F68C = 8):

Raw allocation layout (function-scope, 8-byte prefix):
+0      8     prefix flags qword              (no TU copy, no next_in_list slot)
+8      ...   node body starts here           (returned pointer)

The prefix flags byte is at ptr - 8 from the returned node pointer (in all modes). The next_in_list pointer at ptr - 16 is the linked list link used by the IL walker to traverse all entries of a given kind (file-scope only). The translation_unit_copy_address at ptr - 24 stores the original address when a node is copied between translation units; it is zeroed in normal mode and absent in TU-copy and function-scope modes.

The keep_in_il test throughout cudafe++ uses *(signed char*)(entry - 8) < 0 to check bit 7 of the prefix flags byte -- this works because the flags byte is always at offset -8 from the node pointer regardless of allocation mode.

Prefix Flags Byte

The prefix flags byte (at offset -8 from the returned node pointer) encodes scope and language information:

Bit	Mask	Name	Meaning
0	0x01	`allocated`	Always set on allocation
1	0x02	`file_scope`	Set when `!dword_106BA08` (not in TU-copy mode)
2	0x04	`is_in_secondary_il`	Entry came from secondary translation unit
3	0x08	`language_flag`	Copies `dword_126E5FC & 1` (C++ vs C mode indicator)
7	0x80	`keep_in_il`	CUDA-critical: marks entry for device IL output

Bit 7 (keep_in_il) is the mechanism by which cudafe++ selects device-relevant declarations. The mark_to_keep_in_il pass in il_walk.c sets this bit on all entries that are needed for device compilation. See Device/Host Separation and keep-in-il for details.

Sub-Kind Systems

Most primary IL entry kinds use a secondary kind byte to discriminate between variants. These sub-kind enums are the core classification taxonomy of the IL.

Type Kinds (tk_*)

The type kind byte lives at offset +132 in the type node body. 22 values, dispatched by set_type_kind (sub_5E2E80) and displayed by display_type (sub_5F06B0):

Value	Name	Supplement	Size	Notes
0	`tk_error`	--	--	Error/placeholder type
1	`tk_void`	--	--	void
2	`tk_integer`	integer_type_supplement	32	int, char, bool, enum, wchar_t, char8/16/32_t
3	`tk_float`	--	--	float, double, long double
4	`tk_complex`	--	--	_Complex float/double/ldouble
5	`tk_imaginary`	--	--	_Imaginary (C99)
6	`tk_pointer`	--	--	Pointer, reference, rvalue reference
7	`tk_routine`	routine_type_supplement	64	Function type (return + params)
8	`tk_array`	--	--	Fixed and variable-length arrays
9	`tk_class`	class_type_supplement	208	class types
10	`tk_struct`	class_type_supplement	208	struct types
11	`tk_union`	class_type_supplement	208	union types
12	`tk_typeref`	typeref_type_supplement	56	typedef, using, decltype, typeof
13	`tk_ptr_to_member`	--	--	Pointer-to-member
14	`tk_template_param`	templ_param_supplement	40	Template type parameter
15	`tk_vector`	--	--	SIMD vector type
16	`tk_scalable_vector`	--	--	Scalable vector (SVE)
17	`tk_nullptr`	--	--	std::nullptr_t
18	`tk_mfp8`	--	--	8-bit floating point
19	`tk_scalable_vector_count`	--	--	Scalable vector predicate
20	(auto/decltype_auto)	--	--	Placeholder types
21	(typeof_unqual/typeof_type)	--	--	C23 typeof

The display function references off_A6FE40 (22 string entries) for type kind names. The typeref sub-kind table at off_A6F640 has 28 entries covering typedef aliases, decltype expressions, auto, and concept-constrained placeholders.

Constant Kinds (ck_*)

The constant kind byte lives at offset +148 in the constant node. 16 values, dispatched by display_constant (sub_5F2720):

Value	Name	Notes
0	`ck_error`	Error placeholder
1	`ck_integer`	Integer value (arbitrary precision via `sub_602F20`)
2	`ck_string`	String/character literal (char kind + length + raw bytes)
3	`ck_float`	Floating-point constant
4	`ck_complex`	Complex constant (real + imaginary)
5	`ck_imaginary`	Imaginary constant
6	`ck_address`	Address constant with 7 address sub-kinds (abk_*)
7	`ck_ptr_to_member`	Pointer-to-member constant
8	`ck_label_difference`	GNU label address difference
9	`ck_dynamic_init`	Dynamically initialized constant
10	`ck_aggregate`	Aggregate initializer (linked list of sub-constants)
11	`ck_init_repeat`	Repeated initializer (constant + count)
12	`ck_template_param`	Template parameter constant with 15 sub-kinds (tpck_*)
13	`ck_designator`	Designated initializer
14	`ck_void`	Void constant
15	`ck_reflection`	Reflection entity reference

Address constant sub-kinds (abk_*): abk_routine, abk_variable, abk_constant, abk_temporary, abk_uuidof, abk_typeid, abk_label.

Template parameter constant sub-kinds (tpck_*), 14 of the 15 listed by name: tpck_param, tpck_expression, tpck_member, tpck_unknown_function, tpck_address, tpck_sizeof, tpck_datasizeof, tpck_alignof, tpck_uuidof, tpck_typeid, tpck_noexcept, tpck_template_ref, tpck_integer_pack, tpck_destructor.

Expression Node Kinds (enk_*)

The expression kind byte lives at offset +24 in the expression node. 36 values, dispatched by display_expr_node (sub_5ECFE0):

Value	Name	Notes
0	`enk_error`	Error expression
1	`enk_operation`	Binary/unary/ternary operation (120 operator sub-kinds via eok_*)
2	`enk_constant`	Constant reference
3	`enk_variable`	Variable reference
4	`enk_field`	Field access
5	`enk_temp_init`	Temporary initialization
6	`enk_lambda`	Lambda expression
7	`enk_new_delete`	new/delete expression (56-byte supplement)
8	`enk_throw`	throw expression (24-byte supplement)
9	`enk_condition`	Conditional expression (32-byte supplement)
10	`enk_object_lifetime`	Object lifetime management
11	`enk_typeid`	typeid expression
12	`enk_sizeof`	sizeof expression
13	`enk_sizeof_pack`	sizeof...(pack)
14	`enk_alignof`	alignof expression
15	`enk_datasizeof`	NVIDIA __datasizeof extension
16	`enk_address_of_ellipsis`	Address of variadic parameter
17	`enk_statement`	Statement expression (GCC extension)
18	`enk_reuse_value`	Reused value reference
19	`enk_routine`	Function reference
20	`enk_type_operand`	Type as operand (e.g., in sizeof)
21	`enk_builtin_operation`	Compiler builtin (indexed via `off_E6C5A0`)
22	`enk_param_ref`	Parameter reference
23	`enk_braced_init_list`	C++11 braced init list
24	`enk_c11_generic`	C11 _Generic selection
25	`enk_builtin_choose_expr`	GCC __builtin_choose_expr
26	`enk_yield`	C++20 co_yield
27	`enk_await`	C++20 co_await
28	`enk_fold_expression`	C++17 fold expression
29	`enk_initializer`	Initializer expression
30	`enk_concept_id`	C++20 concept-id
31	`enk_requires`	C++20 requires expression
32	`enk_compound_req`	Compound requirement
33	`enk_nested_req`	Nested requirement
34	`enk_const_eval_deferred`	Deferred constexpr evaluation
35	`enk_template_name`	Template name expression

The enk_operation kind (value 1) carries an additional operation.kind byte dispatched through off_A6F840 (120 entries, the eok_* enum) and an operation.type_kind byte from off_A6FE40 (22 entries).

Expression Operation Kinds (eok_*)

The 120 operation kinds cover the C++ operators along with the complex, vector, and variadic extensions. Key groups:

Category	Operations
Arithmetic	`eok_add`, `eok_subtract`, `eok_multiply`, `eok_divide`, `eok_remainder`, `eok_negate`, `eok_unary_plus`
Bitwise	`eok_and`, `eok_or`, `eok_xor`, `eok_complement`, `eok_shiftl`, `eok_shiftr`
Comparison	`eok_eq`, `eok_ne`, `eok_lt`, `eok_gt`, `eok_le`, `eok_ge`, `eok_spaceship`
Logical	`eok_land`, `eok_lor`, `eok_not`
Assignment	`eok_assign`, `eok_add_assign`, `eok_subtract_assign`, `eok_multiply_assign`, etc.
Pointer	`eok_indirect`, `eok_address_of`, `eok_padd`, `eok_psubtract`, `eok_pdiff`, `eok_subscript`
Member access	`eok_dot_field`, `eok_points_to_field`, `eok_dot_static`, `eok_points_to_static`, `eok_pm_field`, `eok_points_to_pm_call`
Casts	`eok_cast`, `eok_lvalue_cast`, `eok_ref_cast`, `eok_dynamic_cast`, `eok_bool_cast`, `eok_base_class_cast`, `eok_derived_class_cast`
Calls	`eok_call`, `eok_dot_member_call`, `eok_points_to_member_call`, `eok_dot_pm_call`, `eok_points_to_pm_func_ptr`
Increment	`eok_pre_incr`, `eok_pre_decr`, `eok_post_incr`, `eok_post_decr`
Complex	`eok_real_part`, `eok_imag_part`, `eok_xconj`
Vector	`eok_vector_fill`, `eok_vector_eq`, `eok_vector_ne`, `eok_vector_lt`, `eok_vector_gt`, `eok_vector_le`, `eok_vector_ge`, `eok_vector_subscript`, `eok_vector_question`, `eok_vector_land`, `eok_vector_lor`, `eok_vector_not`
Control	`eok_comma`, `eok_question`, `eok_parens`, `eok_lvalue`, `eok_lvalue_adjust`, `eok_noexcept`
Variadic	`eok_va_start`, `eok_va_end`, `eok_va_arg`, `eok_va_copy`, `eok_va_start_single_operand`
Virtual	`eok_virtual_function_ptr`, `eok_dot_vacuous_destructor_call`, `eok_points_to_vacuous_destructor_call`
Misc	`eok_array_to_pointer`, `eok_reference_to`, `eok_ref_indirect`, `eok_ref_dynamic_cast`, `eok_pm_base_class_cast`, `eok_pm_derived_class_cast`, `eok_class_rvalue_adjust`

Statement Kinds (stmk_*)

The statement kind byte lives at offset +32 in the statement node. 26 values:

Value	Name	Supplement	Notes
0	`stmk_expr`	--	Expression statement
1	`stmk_if`	--	if statement
2	`stmk_constexpr_if`	24 bytes	if constexpr (C++17)
3	`stmk_if_consteval`	--	if consteval (C++23)
4	`stmk_if_not_consteval`	--	if !consteval (C++23)
5	`stmk_while`	--	while loop
6	`stmk_goto`	--	goto statement
7	`stmk_label`	--	Label statement
8	`stmk_return`	--	return statement
9	`stmk_coroutine`	128 bytes	C++20 coroutine body (full coroutine descriptor)
10	`stmk_coroutine_return`	--	co_return statement
11	`stmk_block`	32 bytes	Compound statement / block
12	`stmk_end_test_while`	--	do-while loop
13	`stmk_for`	24 bytes	for loop
14	`stmk_range_based_for`	--	C++11 range-for (iterator, begin, end, incr)
15	`stmk_switch_case`	--	case label
16	`stmk_switch`	24 bytes	switch statement
17	`stmk_init`	--	Declaration with initializer
18	`stmk_asm`	--	Inline assembly
19	`stmk_try_block`	32 bytes	try block
20	`stmk_decl`	--	Declaration statement
21	`stmk_set_vla_size`	--	VLA size computation
22	`stmk_vla_decl`	--	VLA declaration
23	`stmk_assigned_goto`	--	GCC computed goto
24	`stmk_empty`	--	Empty statement
25	`stmk_stmt_expr_result`	--	GCC statement expression result

The coroutine statement (kind 9) carries the largest supplement at 128 bytes, containing traits, handle, promise, initial/final suspend calls, unhandled_exception call, get_return_object call, new/delete routines, and parameter copies. A preserved typo in the EDG source reads "paramter_copies" (missing 'e'), confirming genuine EDG lineage.

Scope Kinds (sck_*)

The scope kind byte lives at offset +28 in the scope node. 9 observed values:

Value	Name	Notes
0	`sck_file`	File scope (translation unit root)
1	`sck_func_prototype`	Function prototype scope
2	`sck_block`	Block scope (compound statement)
3	`sck_namespace`	Namespace scope
6	`sck_class_struct_union`	Class/struct/union scope
8	`sck_template_declaration`	Template declaration scope
15	`sck_condition`	Condition scope (if/while/for condition variable)
16	`sck_enum`	Enum scope (C++11 scoped enums)
17	`sck_function`	Function body scope (has routine ptr, parameters, ctor inits)

Scope kinds determine which child lists are displayed. The bitmask (1 << kind) & 0x20044 (bits 2, 6, 17 = block, class/struct/union, function) and (1 << kind) & 0x9 (bits 0, 3 = file, namespace) control whether namespaces, using_declarations, and using_directives lists appear.
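The bitmask tests can be sketched in C as follows. The enum values and the two masks come from the text above; the macro and function names are hypothetical, and which list each mask gates is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Scope kind values observed in the binary (see the table above). */
enum scope_kind {
    sck_file                 = 0,
    sck_func_prototype       = 1,
    sck_block                = 2,
    sck_namespace            = 3,
    sck_class_struct_union   = 6,
    sck_template_declaration = 8,
    sck_condition            = 15,
    sck_enum                 = 16,
    sck_function             = 17
};

/* Masks quoted in the text; the macro names are hypothetical. */
#define SCK_MASK_BLOCK_CLASS_FUNCTION UINT32_C(0x20044) /* bits 2, 6, 17 */
#define SCK_MASK_FILE_NAMESPACE       UINT32_C(0x9)     /* bits 0, 3 */

/* True when the bit for `kind` is set in `mask`. */
static inline bool scope_kind_in_mask(unsigned kind, uint32_t mask) {
    assert(kind < 32); /* shifting a 32-bit value by 32 or more is undefined */
    return ((UINT32_C(1) << kind) & mask) != 0;
}
```

A single shift-and-mask replaces a chain of equality comparisons, which is why this pattern shows up frequently in decompiled switch-like code.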

Dynamic Init Kinds (dik_*)

The dynamic init kind byte lives at offset +48. 9 values:

Value	Name	Notes
0	`dik_none`	No initialization
1	`dik_zero`	Zero initialization
2	`dik_constant`	Constant initializer
3	`dik_expression`	Expression initializer
4	`dik_class_result_via_ctor`	Class value via constructor call
5	`dik_constructor`	Constructor call (routine + args)
6	`dik_nonconstant_aggregate`	Non-constant aggregate init
7	`dik_bitwise_copy`	Bitwise copy from source
8	`dik_lambda`	Lambda initialization

Common IL Node Header

All primary IL node types (type, variable, field, routine, scope, namespace, template, etc.) share a 96-byte common header copied from a template at xmmword_126F6A0..126F6F0. This header is initialized by init_il_alloc (sub_5EAD80) and contains:

Source correspondence (source_corresp) block: name, position, parent scope, access specifier, linkage, flags
The display function display_source_corresp (sub_5EDF40) prints these fields for every entity type

Key source correspondence fields (printed for all entities):

name and unmangled_name_or_mangled_encoding
decl_position (line + column)
name_references list
is_class_member + access (from off_A6F760: public/protected/private/none)
parent_scope and enclosing_routine
name_linkage (from off_E6E040: none/internal/external/C/C++)
Flags: referenced, needed, is_local_to_function, marked_as_gnu_extension, externalized, maybe_unused, is_deprecated_or_unavailable

Initialization and Reset

The IL subsystem initializes in two phases:

One-Time Init (sub_5CF7F0)

Called once at program startup. Validates that 7 name-table arrays end with "last" sentinels:

Table	Address	Content
`il_entry_kind_names`	`off_E6E020`	85 IL entry kind names
`db_storage_class_names`	`off_E6CD78`	Storage class enum names
`db_special_function_kinds`	`off_E6D228`	Special function kind names
`db_operator_names`	`off_E6CD20`	Operator kind names
`name_linkage_kind_names`	`off_E6E060`	Linkage kind names
`decl_modifier_names`	`off_E6CD88`	Declaration modifier names
`pragma_ids`	`off_E6CF38`	Pragma identifier names

Also validates the unsigned_int_kind_of table (byte_E6D1AD == 111, ASCII 'o') and initializes 60+ allocation pools via sub_7A3C00 (pool_init), with element sizes ranging from 1 to 1344 bytes.
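A sentinel check of this kind might look like the sketch below. It assumes the expected entry count (including the sentinel) is known to the caller; how the binary derives that count is not documented here, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical check: the final entry of a name table must be the
 * string "last". `count` is the expected number of entries, sentinel
 * included (an assumption of this sketch). */
static inline bool name_table_ends_with_last(const char *const *table, size_t count) {
    assert(table != NULL && count > 0);
    return strcmp(table[count - 1], "last") == 0;
}
```

A check like this catches an enum that gained a value without its name table being extended, since the sentinel would then sit at the wrong index.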

Per-TU Init (sub_5CFE20)

Called at the start of each translation unit compilation. Zeroes all pool heads, then allocates the constant-sharing hash table (16,312 bytes = 2,039 eight-byte buckets at qword_126F228) and the character-type hash table (3,240 bytes at qword_126F2F8). Sets the sharing mode flags (byte_126E558..126E55A) to 3. Tail-calls sub_5EAF00 to reset float constant caches.

Secondary Pool Reset (sub_5D0170)

Resets ~80 transient globals in the 126F680..126F978 range between template instantiation passes. Pure state zeroing, no allocation.

Constant Sharing

IL constants are deduplicated via the 2,039-bucket hash table at qword_126F228. Before sharing a constant, alloc_shareable_constant (sub_5D2390) calls constant_is_shareable (sub_5D2210), which rejects aggregate constants (kind 10), template parameter constants (kind 12), and string literals when string sharing is disabled (dword_126E1C0).

On a cache hit, the existing constant is relinked to the front of its bucket chain. On a miss, a new 184-byte constant is allocated and inserted. Statistics are tracked: total allocations (qword_126F208), comparisons (qword_126F200), region hits (qword_126F218), global hits (qword_126F220), and new buckets (qword_126F210).
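The hit path (relink to the front of the bucket chain) can be sketched as below. Everything except the bucket count is an assumption made for illustration: the struct layout, the modulo hash, and the function names are hypothetical, and a single integer stands in for the real 184-byte constant and its deep comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHARE_BUCKETS 2039 /* 16,312 bytes / 8-byte bucket pointers */

typedef struct shared_const {
    struct shared_const *next; /* hypothetical chain link */
    uint64_t value;            /* stand-in for the constant payload */
} shared_const;

static shared_const *share_buckets[SHARE_BUCKETS];

/* Hypothetical hash; the real hash function is not documented here. */
static inline size_t share_bucket_of(uint64_t value) {
    return (size_t)(value % SHARE_BUCKETS);
}

/* Miss path: link a new constant at the front of its bucket. */
static inline void share_insert_front(shared_const *c) {
    assert(c != NULL);
    shared_const **head = &share_buckets[share_bucket_of(c->value)];
    c->next = *head;
    *head = c;
}

/* Hit path: find a matching constant and move it to the front of its
 * chain. Returns NULL on a miss. */
static inline shared_const *share_lookup(uint64_t value) {
    shared_const **head = &share_buckets[share_bucket_of(value)];
    for (shared_const **link = head; *link != NULL; link = &(*link)->next) {
        shared_const *c = *link;
        if (c->value == value) {
            if (link != head) {
                *link = c->next; /* unlink from current position */
                c->next = *head; /* relink at the front */
                *head = c;
            }
            return c;
        }
    }
    return NULL;
}
```

Move-to-front keeps recently reused constants near the head of the chain, which shortens later searches when the same constants recur.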

CUDA Extensions to IL

NVIDIA adds several CUDA-specific fields to standard EDG IL nodes:

Routine flags (bytes 182-183): nvvm_intrinsic, global (__global__), device (__device__), host (__host__)
Variable flags: shared (__shared__), constant (__constant__), device (__device__), managed (__managed__)
keep_in_il bit (prefix byte bit 7): The mechanism for device/host code separation
Lambda entries (kinds 0x46, 0x47): Extended lambda wrapper support

These extensions are what make cudafe++ the CUDA-aware C++ frontend rather than a stock EDG compiler.

Function Map

Address	Function	Source	Notes
`sub_5CF7F0`	`il_one_time_init`	il.c	Validates tables, inits 60+ pools
`sub_5CFE20`	`il_init` / `il_reset`	il.c	Per-TU initialization
`sub_5D0170`	`il_reset_secondary_pools`	il.c	Template instantiation reset
`sub_5D01F0`	`il_rebuild_entry_index`	il.c	Build entry pointer index
`sub_5D02F0`	`il_invalidate_entry_index`	il.c	Clear entry index
`sub_5D0750`	`compare_expressions`	il.c	Deep structural equality
`sub_5D1350`	`compare_constants`	il.c	Constant comparison (525 lines)
`sub_5D1FE0`	`compare_dynamic_inits`	il.c	Dynamic init comparison
`sub_5D2210`	`constant_is_shareable`	il.c	Shareability predicate
`sub_5D2390`	`alloc_shareable_constant`	il.c	Hash-table dedup allocation
`sub_5D2F90`	`i_copy_expr_tree`	il.c	Deep expression tree copy
`sub_5D3B90`	`i_copy_constant_full`	il.c	Deep constant copy
`sub_5D47A0`	`i_copy_dynamic_init`	il.c	Deep dynamic init copy
`sub_5E2E80`	`set_type_kind`	il_alloc.c	Type kind dispatch (22 kinds)
`sub_5E3D40`	`alloc_type`	il_alloc.c	176-byte type node
`sub_5E4D20`	`alloc_variable`	il_alloc.c	232-byte variable node
`sub_5E4F70`	`alloc_field`	il_alloc.c	176-byte field node
`sub_5E53D0`	`alloc_routine`	il_alloc.c	288-byte routine node
`sub_5E5CA0`	`alloc_label`	il_alloc.c	128-byte label node
`sub_5E5F00`	`set_expr_node_kind`	il_alloc.c	Expression kind dispatch
`sub_5E62E0`	`alloc_expr_node`	il_alloc.c	72-byte expression node
`sub_5E6E20`	`set_statement_kind`	il_alloc.c	Statement kind dispatch
`sub_5E7060`	`alloc_statement`	il_alloc.c	80-byte statement node
`sub_5E7D80`	`alloc_scope`	il_alloc.c	288-byte scope node
`sub_5E7A70`	`alloc_namespace`	il_alloc.c	128-byte namespace node
`sub_5E8D20`	`alloc_template`	il_alloc.c	208-byte template node
`sub_5E99D0`	`dump_il_table_statistics`	il_alloc.c	Print allocation stats
`sub_5EAD80`	`init_il_alloc`	il_alloc.c	Initialize common header template
`sub_5F4930`	`display_il_entry`	il_to_str.c	Main display dispatcher (~1,686 lines)
`sub_5F76B0`	`display_il_header_and_file_scope`	il_to_str.c	IL header + region 1
`sub_5F7DF0`	`display_il_file`	il_to_str.c	Top-level display entry point
`sub_60E4F0`	`walk_file_scope_il`	il_walk.c	File-scope tree walker
`sub_610200`	`walk_routine_scope_il`	il_walk.c	Per-function tree walker

Cross-References

IL Allocation -- Arena allocator details, node sizes, free lists
IL Walking -- Tree traversal framework with 5 callback slots
keep-in-il -- Device code selection via bit 7
IL Display -- Debug dump format and output
IL Comparison & Copy -- Expression/constant comparison and deep copy
Device/Host Separation -- CUDA IL marking
Type System -- 22 type kinds in detail

IL Allocation

Every IL node in cudafe++ is allocated through a region-based bump allocator implemented in il_alloc.c (EDG 6.6 source at /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/il_alloc.c). The allocator manages 70+ distinct IL entry types across two memory region categories -- file-scope (persistent for the entire translation unit) and per-function-scope (transient, freed after each function body is processed). Free-lists recycle high-churn node types to reduce region pressure. The allocation subsystem occupies address range 0x5E0600-0x5EAF00 in the binary, 0xA900 = 43,264 bytes (about 42 KB) of compiled code covering 100+ functions.

Key Facts

Property	Value
Source file	`il_alloc.c` (EDG 6.6)
Address range	`0x5E0600`-`0x5EAF00`
Core allocator	`sub_6B7D60` (`region_alloc(region_id, size)`)
File-scope allocator	`sub_5E03D0` (`alloc_in_file_scope_region`)
Dual-region allocator	`sub_5E02E0` (`alloc_in_region`)
Scratch-region allocator	`sub_5E0460` (`alloc_in_scratch_region`)
Stats dump	`sub_5E99D0` (`dump_il_table_stats`), 340 lines
Init function	`sub_5EAD80` (`init_il_alloc`)
Reset watermarks	`sub_5EAEC0` (`reset_region_offsets`)
Clear free-lists	`sub_5EAF00` (`clear_free_lists`)
Node types tracked	70+ (each with per-type counter)
Free-list types	6 (template_arg, constant_list, expr_node, constant, param_type, source_seq_entry)

Region-Based Bump Allocator

The core allocation primitive is sub_6B7D60 (region_alloc), a bump allocator that takes a region ID and requested size, and returns a pointer to the allocated block within the region's memory. The caller then writes prefix fields and returns a pointer past the prefix to the node body.

region_alloc Pseudocode

// sub_6B7D60 -- region_alloc(region_id, requested_size)
// Returns pointer to start of allocated block within the region.
void* region_alloc(int region_id, int64_t requested_size) {
    // Step 1: Round the request up to a multiple of 8 (minimum 8);
    // the capacity check below adds a further 8-byte margin
    int64_t aligned_size = requested_size;
    if (requested_size == 0) {
        aligned_size = 8;              // minimum allocation
    } else if (requested_size & 7) {
        aligned_size = (requested_size + 7) & ~7;   // round up to 8
    }
    int64_t check_size = aligned_size + 8;           // capacity check includes margin

    // Step 2: Get current region block
    mem_block_t* block = region_table[region_id];    // qword_126EC88[region_id]
    char* alloc_ptr = (char*)block->next_free;       // block[2] = bump pointer (char* so byte arithmetic is valid)

    // Step 3: Check if current block has enough space
    if (block->end - alloc_ptr < check_size) {
        // Not enough space -- try free-list or allocate new block
        bool is_reuse = block->is_reusable;          // block byte +40
        int64_t block_size;
        if (is_reuse) {
            block_size = 2048;                       // small reuse block
        } else {
            flush_region(region_table[region_id]);   // sub_6B68D0
            block_size = 0x10000;                    // 64KB default
        }

        // Search free-list (qword_1280730) for a suitable block
        block = find_free_block(aligned_size + 56, block_size);
        if (!block) {
            // Allocate fresh block from heap
            if (block_size < aligned_size + 56)
                block_size = aligned_size + 56;
            block_size = (block_size + 7) & ~7;      // align to 8
            block = malloc(block_size);
            if (!block) fatal_error(4);              // out of memory
            block->capacity = block_size;
            block->end = (char*)block + block_size;
            block->next_free = (char*)block + 48;    // skip 48-byte header (6 qwords)
        }

        // Link new block into region's block chain
        block->is_reusable = 0;
        alloc_ptr = block->next_free;
        block->next = region_table[region_id];
        region_table[region_id] = block;
    }

    // Step 4: Bump the pointer
    total_allocated += aligned_size;                 // qword_1280700
    block->next_free = (char*)alloc_ptr + aligned_size;
    alignment_waste += aligned_size - requested_size; // qword_12806F8
    per_region_total[region_id] += aligned_size;     // qword_126EC50[region_id]

    return alloc_ptr;
}
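The size rounding and capacity check at the heart of the pseudocode can be isolated into a small compilable sketch. It models a single block only: block chaining, the free-list search, and the statistics counters are omitted, and the names bump_block, round_request, and bump_alloc are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char *next_free; /* bump pointer */
    char *end;       /* one past the last usable byte */
} bump_block;

/* Round a request up to a multiple of 8, with a minimum of 8 bytes. */
static inline int64_t round_request(int64_t requested) {
    assert(requested >= 0);
    if (requested == 0)
        return 8;
    return (requested + 7) & ~(int64_t)7;
}

/* Allocate from one block. Returns NULL when the block cannot satisfy
 * the request plus the 8-byte margin; the real allocator would chain
 * in a new block at that point instead. */
static inline void *bump_alloc(bump_block *b, int64_t requested) {
    int64_t size = round_request(requested);
    if (b->end - b->next_free < size + 8)
        return NULL;
    void *p = b->next_free;
    b->next_free += size;
    return p;
}
```

Note that the margin affects only the capacity check: the bump pointer advances by the rounded size, not by the rounded size plus 8.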

Region Architecture

dword_126EC90  = file_scope_region_id   (region 1, persistent)
dword_126EB40  = current_region_id      (file-scope or per-function)
dword_126F690  = file-scope base offset (typically 0)
dword_126F694  = file-scope prefix size (24 normal, 16 TU-copy)
dword_126F688  = function-scope base offset
dword_126F68C  = function-scope prefix size (8)
qword_126EC88  = region_table           (region index -> memory block)
qword_126EB90  = scope_table            (region index -> scope entry)
dword_126EC80  = total_region_count

Region selection uses a simple identity test: when dword_126EB40 == dword_126EC90, the current scope is file-scope and nodes go into region 1. When the values differ, the current scope is a function body, and nodes go into the current function's region. Some allocators force a specific behavior:

File-scope only: alloc_in_file_scope_region (sub_5E03D0) always uses dword_126EC90
Dual-region: alloc_in_region (sub_5E02E0) branches on the identity test
Scratch region: alloc_in_scratch_region (sub_5E0460) temporarily sets TU-copy mode, allocates from region 1, and restores state
Same-region-as: Used by alloc_class_list_entry (sub_5E2410) and alloc_based_type_list_member (sub_5E29C0) -- inspects the prefix byte of an existing node to determine which region it lives in, then allocates the new node in that same region
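The identity test and the resulting prefix size can be modeled as follows. This is a sketch under stated assumptions: the struct gathers three globals into one hypothetical alloc_state, and the function names are invented for illustration. The scratch-region and same-region-as behaviors are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  file_scope_region_id; /* models dword_126EC90 */
    int  current_region_id;    /* models dword_126EB40 */
    bool tu_copy_mode;         /* models dword_106BA08 != 0 */
} alloc_state;

/* The identity test: current region equals the file-scope region. */
static inline bool in_file_scope(const alloc_state *s) {
    assert(s != NULL);
    return s->current_region_id == s->file_scope_region_id;
}

/* Dual-region selection, with an override for file-scope-only allocators. */
static inline int select_region(const alloc_state *s, bool force_file_scope) {
    if (force_file_scope || in_file_scope(s))
        return s->file_scope_region_id;
    return s->current_region_id;
}

/* Prefix size for a node placed in `region_id`: 24 or 16 bytes in the
 * file-scope region, 8 bytes in a function region. */
static inline int prefix_size_for(const alloc_state *s, int region_id) {
    assert(s != NULL);
    if (region_id != s->file_scope_region_id)
        return 8;
    return s->tu_copy_mode ? 16 : 24;
}
```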

Allocation Protocol

Every IL node allocator follows a consistent protocol. The prefix size varies by mode: 24 bytes for file-scope (normal), 16 bytes for file-scope (TU-copy mode), and 8 bytes for function-scope.

File-scope allocation (normal mode, dword_126F694 = 24):

 1. if (dword_126EFC8) trace_enter(5, "alloc_<name>")
 2. raw = region_alloc(file_scope_region, entry_size + 24)
 3. ptr = raw + dword_126F690                       // base offset (typically 0)
 4. *(ptr+0)  = 0                                   // zero TU copy address slot (8 bytes)
    ++qword_126F7C0                                 // TU copy addr counter
 5. *(ptr+8)  = 0                                   // zero the next-in-list pointer (8 bytes)
    ++qword_126F750                                 // orphan pointer counter
 6. ++qword_126F7D8                                 // IL entry prefix counter
 7. *(ptr+16) = flags_byte:                         // prefix flags (8-byte qword, flags in low byte)
        bit 0 = 1                                   // allocated
        bit 1 = 1                                   // file_scope (not TU-copy)
        bit 3 = dword_126E5FC & 1                   // language flag (C++ vs C)
 8. node = ptr + 24                                 // skip 24-byte prefix
 9. ++qword_126F8xx                                 // per-type counter
10. initialize type-specific fields
11. copy 96-byte common header from template globals
12. if (dword_126EFC8) trace_leave()
13. return node

Function-scope allocation (dword_126F68C = 8):

 1. raw = region_alloc(current_region, entry_size + 8)
 2. ptr = raw + dword_126F688                       // function-scope base offset
 3. *(ptr+0) = flags_byte:                          // prefix flags (8-byte qword)
        bit 1 = !dword_106BA08                      // file_scope flag
        bit 3 = dword_126E5FC & 1                   // language flag
    (no TU copy slot, no next-in-list slot)
 4. node = ptr + 8                                  // skip 8-byte prefix
 5. return node

The returned pointer skips the prefix, so all field offsets documented in the IL are relative to this returned pointer. The prefix flags byte is always at node - 8 regardless of allocation mode. The next-in-list link (file-scope only) is at node - 16, and the TU-copy address (normal file-scope only) is at node - 24.
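The prefix-writing steps of both protocols can be sketched in C. The sketch assumes a base offset of 0 (the typical value noted above), leaves out tracing and counters, and follows the function-scope listing literally, which shows only bits 1 and 3 being written. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Normal file-scope mode: zero the TU-copy slot, the next-in-list slot,
 * and the flags qword, then set allocated | file_scope | language bit.
 * Returns the node pointer, 24 bytes past `raw`. */
static inline void *write_file_scope_prefix(void *raw, unsigned language_bit) {
    unsigned char *p = raw;
    assert(raw != NULL);
    memset(p, 0, 24);
    p[16] = (unsigned char)(0x01u | 0x02u | ((language_bit & 1u) << 3));
    return p + 24;
}

/* Function-scope mode: a single flags qword, no TU-copy or list slots.
 * Returns the node pointer, 8 bytes past `raw`. */
static inline void *write_function_scope_prefix(void *raw, int tu_copy_mode,
                                                unsigned language_bit) {
    unsigned char *p = raw;
    assert(raw != NULL);
    memset(p, 0, 8);
    p[0] = (unsigned char)((tu_copy_mode ? 0u : 0x02u) | ((language_bit & 1u) << 3));
    return p + 8;
}
```

In both cases the flags byte ends up at node - 8, which is what lets the keep_in_il test ignore the allocation mode.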

Common IL Header Template

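Primary IL nodes begin with a 96-byte common header, copied from six __m128i template globals initialized by init_il_alloc (sub_5EAD80); nodes smaller than 96 bytes, such as the 72-byte expr_node, necessarily carry less. The listing also shows qword_126F700, an 8-byte value at +96 that immediately follows the six-xmmword template: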

xmmword_126F6A0  [+0..+15]    16 bytes, zeroed
xmmword_126F6B0  [+16..+31]   16 bytes (high qword zeroed)
xmmword_126F6C0  [+32..+47]   16 bytes, zeroed
xmmword_126F6D0  [+48..+63]   16 bytes, zeroed
xmmword_126F6E0  [+64..+79]   16 bytes (from qword_126EFB8 = source position)
xmmword_126F6F0  [+80..+95]   16 bytes (low word = 4, high qword = 0)
qword_126F700    [+96..+103]  8 bytes (current source file reference)

This template captures the current source position and language state at the moment of allocation. The template is refreshed when the parser advances through source positions, so each newly-allocated node carries the file/line/column of the construct it represents.

IL Entry Prefix

Every IL entry has a variable-size raw prefix preceding the node body. The prefix is 24 bytes in normal file-scope mode, 16 bytes in TU-copy file-scope mode, and 8 bytes in function-scope mode.

Normal file-scope (24-byte prefix, ptr = raw + 24):
+0   [8 bytes]  TU copy   ptr - 24   translation_unit_copy_address
+8   [8 bytes]  next      ptr - 16   next_in_list link
+16  [8 bytes]  flags     ptr - 8    prefix flags byte (+ 7 padding)
+24  [...]      body      ptr + 0    node-specific fields

TU-copy file-scope (16-byte prefix, ptr = raw + 16):
+0   [8 bytes]  next      ptr - 16   next_in_list link
+8   [8 bytes]  flags     ptr - 8    prefix flags byte (+ 7 padding)
+16  [...]      body      ptr + 0    node-specific fields

Function-scope (8-byte prefix, ptr = raw + 8):
+0   [8 bytes]  flags     ptr - 8    prefix flags byte (+ 7 padding)
+8   [...]      body      ptr + 0    node-specific fields

Prefix Flags Byte

Bit	Mask	Name	Set When
0	`0x01`	`allocated`	Always set on fresh allocation
1	`0x02`	`file_scope`	`!dword_106BA08` (not in TU-copy mode)
2	`0x04`	`is_in_secondary_il`	Entry from secondary translation unit
3	`0x08`	`language_flag`	`dword_126E5FC & 1` (C++ mode indicator)
7	`0x80`	`keep_in_il`	Set by device code marking pass

Bit 7 is the CUDA-critical keep_in_il flag used to select device-relevant declarations. See Keep-in-IL for the marking algorithm. The flags byte is always at entry - 8 regardless of allocation mode, and the sign-bit position allows a fast test: *(signed char*)(entry - 8) < 0 means "keep this entry."

Some allocators preserve bit 7 across free-list recycling (notably alloc_local_constant at sub_5E1A80 and alloc_derivation_step at sub_5E1EE0), ensuring that the keep-in-il status is not lost when a node is reclaimed and reissued.
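One way to express that preservation is a mask-and-merge on the flags byte, sketched below. The exact instruction sequence used by the two allocators is not shown in this reference, so the function and its name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Combine freshly computed prefix flags with the keep_in_il bit (0x80)
 * carried over from the recycled node's previous life. */
static inline uint8_t recycled_prefix_flags(uint8_t old_flags, uint8_t fresh_flags) {
    assert((fresh_flags & 0x80u) == 0); /* fresh flags never include keep_in_il */
    return (uint8_t)((old_flags & 0x80u) | fresh_flags);
}
```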

Complete Node Size Table

The stats dump function sub_5E99D0 prints the allocation table with exact names and per-unit sizes for all 70+ IL entry types. Sizes listed are the per-entry allocation unit in bytes, excluding the prefix: the allocators pass entry_size plus the prefix size (24, 16, or 8) to region_alloc.

Primary IL Nodes

IL Entry Type	Size (bytes)	Counter Global	Allocator	Region
type	176	`qword_126F8E0`	`sub_5E3D40`	file-scope
variable	232	`qword_126F8C0`	`sub_5E4D20`	dual (kind-dependent)
routine	288	`qword_126F8A8`	`sub_5E53D0`	file-scope
expr_node	72	`qword_126F880`	`sub_5E62E0`	dual + free-list
statement	80	`qword_126F818`	`sub_5E7060`	dual
scope	288	`qword_126F7E8`	`sub_5E7D80`	dual
constant	184	`qword_126F968`	`sub_5E11C0`	dual
field	176	`qword_126F8B0`	`sub_5E4F70`	file-scope
label	128	`qword_126F888`	`sub_5E5CA0`	function-scope only
asm_entry	152	`qword_126F890`	`sub_5E57B0`	dual
namespace	128	`qword_126F7F8`	`sub_5E7A70`	file-scope
template	208	`qword_126F720`	`sub_5E8D20`	file-scope
template_parameters	136	`qword_126F728`	`sub_5E8A90`	file-scope
template_arg	64	`qword_126F900`	`sub_5E2190`	file-scope + free-list

Type Supplements

Auxiliary structures allocated alongside type nodes by set_type_kind (sub_5E2E80):

Supplement	Size	Counter	For Type Kinds
integer_type_supplement	32	`qword_126F8E8`	`tk_integer` (2)
routine_type_supplement	64	`qword_126F958`	`tk_routine` (7)
class_type_supplement	208	`qword_126F948`	`tk_class` (9), `tk_struct` (10), `tk_union` (11)
typeref_type_supplement	56	`qword_126F8F0`	`tk_typeref` (12)
templ_param_supplement	40	`qword_126F8F8`	`tk_template_param` (14)
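Read as a lookup, the table maps a type kind to a supplement size. The sketch below encodes only the rows shown above; the function name and the TK_COUNT constant are hypothetical, and kinds with no listed supplement return 0.

```c
#include <assert.h>
#include <stddef.h>

enum { TK_COUNT = 22 }; /* type kinds 0..21 */

/* Supplement size in bytes for a type kind, per the table above. */
static inline size_t type_supplement_size(unsigned type_kind) {
    assert(type_kind < TK_COUNT);
    switch (type_kind) {
    case 2:  return 32;   /* tk_integer: integer_type_supplement */
    case 7:  return 64;   /* tk_routine: routine_type_supplement */
    case 9:               /* tk_class */
    case 10:              /* tk_struct */
    case 11: return 208;  /* tk_union: class_type_supplement */
    case 12: return 56;   /* tk_typeref: typeref_type_supplement */
    case 14: return 40;   /* tk_template_param: templ_param_supplement */
    default: return 0;    /* no supplement listed */
    }
}
```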

Expression Supplements

Allocated inline by set_expr_node_kind (sub_5E5F00) for expression kinds that need extra storage:

Supplement	Size	Counter	For Expression Kind
new/delete supplement	56	`qword_126F868`	`enk_new_delete` (7)
throw supplement	24	`qword_126F860`	`enk_throw` (8)
condition supplement	32	`qword_126F858`	`enk_condition` (9)

Statement Supplements

Allocated inline by set_statement_kind (sub_5E6E20):

Supplement	Size	Counter	For Statement Kind
constexpr_if	24	`qword_126F798`	`stmk_constexpr_if` (2)
block	32	`qword_126F830`	`stmk_block` (11)
for_loop	24	`qword_126F820`	`stmk_for` (13)
try supplement	32	`qword_126F838`	`stmk_try_block` (19)
switch_stmt_descr	24	`qword_126F848`	`stmk_switch` (16)
coroutine_descr	128	`qword_126F828`	`stmk_coroutine` (9)

Linked-List Entry Types

Entry Type	Size	Counter	Notes
class_list_entry	16	`qword_126F940`	Region-aware (`sub_5E2410`) or simple (`sub_5E26A0`)
routine_list_entry	16	`qword_126F938`	`sub_5E2750`
variable_list_entry	16	`qword_126F930`	`sub_5E2800`
constant_list_entry	16	`qword_126F928`	Free-list recycled (`sub_5E28B0`)
IL_entity_list_entry	24	`qword_126F7B8`	`sub_5E94F0`
based_type_list_member	24	`qword_126F950`	Region-aware (`sub_5E29C0`)

Inheritance and Virtual Dispatch

Entry Type	Size	Counter	Allocator
base_class	112	`qword_126F908`	`sub_5E2300`
base_class_derivation	32	`qword_126F910`	`sub_5E1FD0`
derivation_step	24	`qword_126F918`	`sub_5E1EE0`
overriding_virtual_func	40	`qword_126F920`	`sub_5E20D0`

Variable and Routine Auxiliaries

Entry Type	Size	Counter	Allocator
dynamic_init	104	`qword_126F8D8`	`sub_5E4650`
local_static_var_init	40	`qword_126F8D0`	`sub_5E4870`
vla_dimension	48	`qword_126F8C8`	`sub_5E49C0`
variable_template_info	24	`qword_126F8B8`	`sub_5E4C70`
exception_specification	16	`qword_126F8A0`	`sub_5E5130`
exception_spec_type	24	`qword_126F898`	`sub_5E51D0`
param_type	80	`qword_126F960`	`sub_5E1D40` (free-list recycled)
constructor_init	48	`qword_126F810`	`sub_5E7410`
handler	40	`qword_126F840`	`sub_5E6B90`
switch_case_entry	56	`qword_126F850`	`sub_5E6A60`

Scope and Source Tracking

Entry Type	Size	Counter	Allocator
source_sequence_entry	32	`qword_126F780`	`sub_5E8300` (free-list recycled)
src-seq_secondary_decl	56	`qword_126F778`	`sub_5E8480`
src-seq_end_of_construct	24	`qword_126F770`	`sub_5E85B0`
src-seq_sublist	24	`qword_126F768`	`sub_5E86C0`
local-scope-ref	32	`qword_126F7E0`	`sub_5E80A0`
object_lifetime	64	`qword_126F800`	`sub_5E7800` (free-list recycled)
static_assertion	24	`qword_126F788`	`sub_5E81B0`

Templates, Names, and Pragmas

Entry Type	Size	Counter	Allocator
template_decl	40	`qword_126F738`	`sub_5E8C60`
requires_clause	16	`qword_126F730`	`sub_5E8BB0`
name_reference	40	`qword_126F718`	`sub_5E90B0`
name_qualifier	40	`qword_126F710`	`sub_5E8FC0`
element_position	24	`qword_126F708`	`sub_5E8EB0`
pragma	64	`qword_126F808`	`sub_5E7570`
using-decl	80	`qword_126F7F0`	`sub_5E7BF0`
instantiation_directive	40	`qword_126F758`	`sub_5E8770`
linkage_spec_block	32	`qword_126F760`	`sub_5E8830`
hidden_name	32	`qword_126F740`	`sub_5E8980`

Attributes and Miscellaneous

Entry Type	Size	Counter	Allocator
attribute	72	`qword_126F7B0`	`sub_5E9600`
attribute_arg	40	`qword_126F7A8`	`sub_5E96F0`
attribute_group	8	`qword_126F7A0`	`sub_5E97C0`
source_file	80	`qword_126F970`	`sub_5E08D0`
seq_number_lookup_entry	32	`qword_126F7C8`	`sub_5E9170`
subobject_path	24	`qword_126F790`	`sub_5E0A30`
orphaned_list_header	56	`qword_126F748`	`sub_5E0800`

Bookkeeping Counters (No Separate Allocator)

Counter	Size	Global	Meaning
string_literal_text	1	`qword_126F7D0`	Raw string literal bytes (accumulated)
fs_orphan_pointers	8	`qword_126F750`	File-scope orphan pointer slots
trans_unit_copy_addr	8	`qword_126F7C0`	TU-copy address slots written
IL_entry_prefix	4	`qword_126F7D8`	Total prefix flags bytes written

Free-List Recycling

Seven node types use free-list recycling to avoid allocating fresh memory for high-churn entries: five through global free-list heads and two through per-scope lists. Each free list is a singly-linked list with the link pointer embedded in the node itself.

Active Free-Lists

Node Type	Free-List Head	Link Offset	Alloc Function	Free Function
template_arg (64B)	`qword_126F670`	+0	`sub_5E2190`	`sub_5E22D0` (`free_template_arg_list`)
constant_list_entry (16B)	`qword_126F668`	+0	`sub_5E28B0`	`sub_5E2990` (`return_constant_list_entries_to_free_list`)
expr_node (72B)	`qword_126E4B0`	+64	`sub_5E62E0`	(kind set to 36 = `ek_reclaimed`)
constant (184B)	`qword_126E4B8`	+104	`sub_5E1A80` (`alloc_local_constant`)	`sub_5E1B70` (`free_local_constant`)
param_type (80B)	`qword_126F678`	+0	`sub_5E1D40` (`alloc_param_type`)	`sub_5E1EB0` (`free_param_type_list`)
source_seq_entry (32B)	scope+328	--	`sub_5E8300`	(per-scope recycling)
object_lifetime (64B)	scope+512	+56	`sub_5E7800`	(per-scope recycling)

Expression Node Recycling

Expression nodes use the most sophisticated free-list protocol. The allocator (sub_5E62E0) checks qword_126E4B0 before allocating fresh memory:

// Pseudocode for alloc_expr_node
if (expr_free_list != NULL) {
    node = expr_free_list;
    assert(node->kind == 36);        // ek_reclaimed sentinel
    expr_free_list = *(node + 64);   // link at offset +64
    // reuse node (preserves bit 7 of prefix)
} else {
    node = region_alloc(region_id, 72);
    // full prefix initialization
}
set_expr_node_kind(node, requested_kind);
++total_expr_count;
++fs_expr_count;
update_rescan_counter(&rescan_expr_count);

When expression nodes are freed, their kind byte at offset +24 is set to 36 (ek_reclaimed), and their link pointer at offset +64 chains them into the free list. The stats dump walks this free list to count available recycled nodes, printing them as "(avail. fs expr node)".
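The release side of this protocol can be sketched as follows. This is a minimal illustration, not the decompiled routine: the struct and its field names are hypothetical, and only the kind offset (+24), the link offset (+64), and the sentinel value 36 are taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { EK_RECLAIMED = 36 };            /* sentinel kind value from the text */

/* Hypothetical layout: only offsets +24 (kind) and +64 (link) come from the text. */
typedef struct expr_node {
    uint8_t           pad0[24];
    uint8_t           kind;            /* offset +24 */
    uint8_t           pad1[39];
    struct expr_node *free_link;       /* offset +64 */
} expr_node;

_Static_assert(offsetof(expr_node, kind) == 24, "kind byte at +24");
_Static_assert(offsetof(expr_node, free_link) == 64, "link at +64");

static expr_node *expr_free_list;      /* stands in for qword_126E4B0 */

/* Mark the node reclaimed and push it onto the head of the free list. */
static void free_expr_node(expr_node *n) {
    n->kind = EK_RECLAIMED;
    n->free_link = expr_free_list;
    expr_free_list = n;
}

/* Walk the free list the way the stats dump is described as doing. */
static size_t count_avail_expr_nodes(void) {
    size_t n = 0;
    for (expr_node *p = expr_free_list; p != NULL; p = p->free_link) {
        assert(p->kind == EK_RECLAIMED);   /* every listed node carries the sentinel */
        ++n;
    }
    return n;
}
```

Because the link lives inside the node, the list needs no side storage; the cost is that a reclaimed node's last 8 bytes are overwritten.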

A source-tracking variant alloc_expr_node_with_source_tracking (sub_5E66B0) wraps the allocation in save_source_correspondence/restore_source_correspondence calls (sub_5B8910/sub_5B89C0). For non-same-region allocations, this variant uses alloc_permanent(72) instead of the dual-region allocator because the free list cannot safely cross region boundaries.

Constant Recycling

Local constants use a separate free-list (qword_126E4B8) with the link at offset +104. The free_local_constant function (sub_5E1B70) validates the node is in-use (bit 0 of prefix) before unlinking. The check_local_constant_use assertion function (sub_5E1D00) verifies qword_126F680 == 0 at function boundaries, ensuring all borrowed constants have been returned.
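The borrow-and-return discipline can be illustrated with a small sketch. The struct, the use of malloc for fresh nodes, and the exact points where the outstanding counter is incremented and decremented are assumptions made for illustration; the text only states that bit 0 of the prefix marks in-use nodes and that qword_126F680 must be zero at function boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, trimmed node: the real constant is 184 bytes with its link at +104. */
typedef struct local_constant {
    unsigned               prefix_flags;   /* bit 0 = in use (per the text) */
    struct local_constant *free_link;
} local_constant;

static local_constant *constant_free_list; /* stands in for qword_126E4B8 */
static long            borrowed_constants; /* stands in for qword_126F680 */

/* Reuse a recycled node if one exists, otherwise allocate a fresh one. */
static local_constant *alloc_local_constant(void) {
    local_constant *c = constant_free_list;
    if (c != NULL)
        constant_free_list = c->free_link;
    else if ((c = malloc(sizeof *c)) == NULL)
        return NULL;
    c->prefix_flags = 1u;                  /* mark in use */
    c->free_link = NULL;
    ++borrowed_constants;
    return c;
}

/* Validate the in-use bit, then push the node back onto the free list. */
static void free_local_constant(local_constant *c) {
    assert(c->prefix_flags & 1u);
    c->prefix_flags &= ~1u;
    c->free_link = constant_free_list;
    constant_free_list = c;
    --borrowed_constants;
}

/* Function-boundary check: every borrowed constant must have been returned. */
static void check_local_constant_use(void) {
    assert(borrowed_constants == 0);
}
```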

The duplicate_constant_to_other_region function (sub_5E1BB0) handles the case where a constant must be copied from one region to another. When source and destination are the same region, it works in-place. When they differ, it allocates 184 bytes in the target region, copies contents via sub_5BA500, frees the original to the free list, and applies post-copy fixups (sub_5B9DE0, sub_5D39A0).

set_type_kind -- Type Kind Dispatch

set_type_kind (sub_5E2E80, confirmed at il_alloc.c:2334) writes the type kind byte at offset +132 of the type node and allocates any required type supplement. It handles 22 type kinds (0x00-0x15):

Kind	Name	Action
0	`tk_error`	No-op
1	`tk_void`	No-op
2	`tk_integer`	Allocates 32-byte `integer_type_supplement`, sets default access=5
3	`tk_float`	Sets format byte = 2
4	`tk_complex`	Sets format byte = 2
5	`tk_imaginary`	Sets format byte = 2
6	`tk_pointer`	Zeroes 2 payload fields
7	`tk_routine`	Allocates 64-byte `routine_type_supplement`, initializes calling convention and parameter bitfields
8	`tk_array`	Zeroes size and flags fields
9	`tk_class`	Allocates 208-byte `class_type_supplement`, stores kind at +100
10	`tk_struct`	Same as class
11	`tk_union`	Same as class
12	`tk_typeref`	Allocates 56-byte `typeref_type_supplement`
13	`tk_ptr_to_member`	Zeroes fields
14	`tk_template_param`	Allocates 40-byte `templ_param_supplement`
15	`tk_vector`	Zeroes fields
16	`tk_scalable_vector`	Zeroes fields
17-21	Pack/special types	No-op or zeroes
default	--	`internal_error("set_type_kind: bad type kind")`

The class type supplement (208 bytes) is the largest supplement. init_class_type_supplement_fields (sub_5E2D70) initializes it with defaults: access=1, virtual_function_table_index=-1, and zeroed member lists. The companion function init_class_type_supplement (sub_5E2C70) accesses the supplement through the type node's pointer at offset +152.

A combined function init_type_fields_and_set_kind (sub_5E3590, 317 lines) copies the 96-byte template header and then runs the same switch as set_type_kind inline. This is used by alloc_type (sub_5E3D40) to avoid a separate function call.
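As a compact restatement of the table, the supplement sizes can be expressed as a lookup. The function names below are hypothetical helpers, not symbols from the binary; the sizes and kind numbers are those listed in the set_type_kind table.

```c
#include <assert.h>
#include <stddef.h>

/* Supplement size in bytes that set_type_kind is described as allocating,
   or 0 when the table lists no allocation for that kind. */
static size_t type_supplement_size(int kind) {
    switch (kind) {
    case 2:  return 32;    /* tk_integer: integer_type_supplement */
    case 7:  return 64;    /* tk_routine: routine_type_supplement */
    case 9:                /* tk_class */
    case 10:               /* tk_struct */
    case 11: return 208;   /* tk_union: class_type_supplement */
    case 12: return 56;    /* tk_typeref: typeref_type_supplement */
    case 14: return 40;    /* tk_template_param: templ_param_supplement */
    default: return 0;     /* no supplement listed */
    }
}

/* Kinds outside 0x00-0x15 reach the internal_error default case. */
static int is_valid_type_kind(int kind) {
    return kind >= 0 && kind <= 21;
}
```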

set_expr_node_kind -- Expression Kind Dispatch

set_expr_node_kind (sub_5E5F00, confirmed at il_alloc.c:3932) writes the expression kind byte at offset +24 and zeroes offset +8. It handles 36 expression kinds (0-35):

Kind	Name	Action
0	`enk_error`	No-op
1	`enk_operation`	Sets operation bytes (0x78=120, 0x15=21, 0, 0), zeroes 2 qwords
2-6	`enk_constant`..`enk_lambda`	Zeroes 2 qword operand fields
7	`enk_new_delete`	Allocates 56-byte supplement via permanent alloc
8	`enk_throw`	Allocates 24-byte supplement
9	`enk_condition`	Allocates 32-byte supplement
10	`enk_object_lifetime`	Zeroes 2 qwords
11,25,32	Address-of variants	1 qword + flag
12-15	Cast variants	Sets word=1, 1 qword
16	`enk_address_of_ellipsis`	No-op
17,18,22,23,29,33,35	Simple operand	1 qword
19	`enk_routine`	3 qwords
20	`enk_type_operand`	2 qwords
21	`enk_builtin_operation`	Sets byte=117 (0x75), 1 qword
24,26,27,30,31	Complex operand	2 qwords
28	`enk_fold_expression`	1 qword + 1 dword
34	`enk_const_eval_deferred`	1 qword + 1 dword
default	--	`internal_error("set_expr_node_kind: bad kind")`

The reinit_expr_node_kind function (sub_5E60E0) performs the same dispatch but additionally resets header fields (flag bits and source position from qword_126EFB8) before the kind switch. This is used when an existing expression node is repurposed without reallocation.

set_statement_kind -- Statement Kind Dispatch

set_statement_kind (sub_5E6E20, confirmed at il_alloc.c:4513) writes the statement kind byte at offset +32 and zeroes offset +40. It handles 26 statement kinds (0x00-0x19):

Kind	Name	Supplement
0	`stmk_expr`	1 qword (expression pointer)
1	`stmk_if`	2 qwords (condition + body)
2	`stmk_constexpr_if`	Allocates 24 bytes
3,4	`stmk_if_consteval`	2 qwords
5	`stmk_while`	1 qword
6,7	`stmk_goto`/`stmk_label`	2 qwords
8	`stmk_return`	1 qword
9	`stmk_coroutine`	1 qword (links to 128-byte coroutine_descr)
10,23,24,25	Various	No-op
11	`stmk_block`	Allocates 32 bytes, stores source pos, sets priority
12	`stmk_end_test_while`	1 qword
13	`stmk_for`	Allocates 24 bytes
14	`stmk_range_based_for`	2 qwords
15	`stmk_switch_case`	2 qwords
16	`stmk_switch`	Allocates 24 bytes
17	`stmk_init`	1 qword
18	`stmk_asm`	1 qword + flag
19	`stmk_try_block`	Allocates 32 bytes
20	`stmk_decl`	1 qword
21,22	VLA statements	1 qword
default	--	`internal_error("set_statement_kind: bad kind")`

set_constant_kind -- Constant Kind Dispatch

set_constant_kind (sub_5E0C60, confirmed at il_alloc.c:952) writes the constant kind byte at offset +148 and initializes the variant-specific union fields. 16 constant kinds (0-15):

Kind	Name	Action
0	`ck_error`	Zeroes variant fields
1	`ck_integer`	Calls `init_target_int` (`sub_461260`)
2	`ck_string`	Zeroes string fields
3	`ck_float`	Zeroes float fields
4	`ck_address`	Allocates 32-byte sub-node in file-scope region
5	`ck_complex`	Zeroes complex fields
6	`ck_imaginary`	Zeroes imaginary fields
7	`ck_ptr_to_member`	Zeroes 2 fields
8	`ck_label_difference`	Zeroes 2 fields
9	`ck_dynamic_init`	Zeroes
10	`ck_aggregate`	Zeroes aggregate list head
11	`ck_init_repeat`	Zeroes repeat fields
12	`ck_template_param`	Zeroes, dispatches to `set_template_param_constant_kind`
13	`ck_designator`	Zeroes
14	`ck_void`	Zeroes
15	`ck_reflection`	Zeroes
default	--	`internal_error("set_constant_kind: bad kind")`

The template parameter constant kind has its own sub-dispatch (sub_5E0B40, il_alloc.c:768) handling 14 sub-kinds (tpck_*), each zeroing variant fields at offsets +160, +168, +176. It validates the parent constant kind is 12 (ck_template_param) before proceeding.

Additional Kind Dispatchers

set_routine_special_kind

sub_5E5280 (confirmed at il_alloc.c:3065) sets the routine special kind byte at offset +166. 8 values (0-7):

Kind	Action
0	Sets word at +168 to 0
1-4	No-op
5	Zeroes byte at +168
6-7	Zeroes qword at +168
default	`internal_error("set_routine_special_kind: bad kind")`

set_dynamic_init_kind

sub_5E45C0 (confirmed at il_alloc.c:2506) sets the dynamic init kind at offset +48. 10 values (0-9) controlling what fields are initialized in the dynamic initialization variant union.

Statistics Dump

dump_il_table_stats (sub_5E99D0) prints a formatted table of all IL allocation counters. It is invoked when tracing is enabled or on explicit request. The output format:

IL table use:
   Table                    Number    Each     Total
   -----                    ------    ----     -----
   source file                  42      80      3360
   constant                   1847     184    339848
   type                        923     176    162448
   variable                    412     232     95584
   routine                     287     288     82656
   expr node                 12847      72    924984
   statement                  5923      80    473840
   scope                       312     288     89856
   ...
   Total                                     2172576

The function iterates all 70+ counters, multiplies count by per-unit size, accumulates a running total, and adds the passed argument a1 (typically the raw region overhead) for the final sum. It also walks the expr_node free list (qword_126E4B0) to count available recycled nodes, printing them separately as "(avail. fs expr node)".
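The accumulation described above amounts to a count-times-size sum plus the passed-in overhead. The sketch below is an assumption-laden illustration: the row struct, function name, and column widths are invented, and only the arithmetic and the header text follow the sample output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical row: one counter with its per-unit size. */
typedef struct {
    const char   *name;
    unsigned long count;
    unsigned long each;
} il_table_row;

/* Prints each row, accumulates count * each, and adds the caller's overhead
   (the role the text assigns to argument a1). Returns the final total. */
static unsigned long dump_il_table_stats_sketch(FILE *out, const il_table_row *rows,
                                                size_t n, unsigned long overhead) {
    unsigned long total = overhead;
    fprintf(out, "IL table use:\n");
    fprintf(out, "   %-22s %8s %7s %9s\n", "Table", "Number", "Each", "Total");
    for (size_t i = 0; i < n; ++i) {
        unsigned long bytes = rows[i].count * rows[i].each;
        total += bytes;
        fprintf(out, "   %-22s %8lu %7lu %9lu\n",
                rows[i].name, rows[i].count, rows[i].each, bytes);
    }
    fprintf(out, "   %-22s %8s %7s %9lu\n", "Total", "", "", total);
    return total;
}
```

Fed the eight rows shown in the sample output with zero overhead, the sum is 2172576, matching the sample's Total line.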

The counter globals are contiguous in BSS from qword_126F680 through qword_126F970, with 8-byte spacing (qword counters). The full ordered list of counters is documented in the Complete Node Size Table above.

Initialization and Reset

init_il_alloc (`sub_5EAD80`)

Called once at compiler startup. Responsibilities:

Zeroes the 96-byte common header template (xmmword_126F6A0-xmmword_126F6F0)
Sets the source position portion of the template from qword_126EFB8
Computes the language mode byte: byte_126E5F8 = (dword_126EFB4 != 2) + 2 (C++ mode detection)
Registers 6 allocator state variables with sub_7A3C00 (saveable state for region offset save/restore across compilation phases)
Optionally calls sub_6F5D00 if dword_106BF18 is set (debug initialization)

reset_region_offsets (`sub_5EAEC0`)

Resets the bump allocator watermarks. Called at region boundaries:

dword_126F690 = 0;               // base offset reset
if (dword_106BA08) {              // TU-copy mode
    dword_126F68C = 8;            // function-scope watermark
    dword_126F688 = 0;            // function-scope base
    dword_126F694 = 16;           // file-scope watermark
} else {
    dword_126F694 = 24;           // file-scope watermark (extra 8 for TU copy addr)
}

The different initial watermark values (16 vs 24) reflect the prefix size in each mode: normal mode uses a 24-byte prefix (8 TU-copy + 8 next-link + 8 flags), while TU-copy mode uses a 16-byte prefix (8 next-link + 8 flags, no TU-copy slot). Function-scope allocations use dword_126F68C = 8 (8-byte prefix: flags only).
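The prefix arithmetic can be restated as a small helper. This is a sketch of the slot accounting exactly as described in the paragraph above; the function name and parameters are hypothetical.

```c
#include <assert.h>

enum { PREFIX_SLOT = 8 };   /* every prefix slot is one 8-byte word */

/* Initial watermark (prefix size in bytes) for a freshly reset region. */
static unsigned prefix_size(int file_scope, int tu_copy_mode) {
    if (!file_scope)
        return PREFIX_SLOT;                 /* function scope: flags only */
    unsigned size = PREFIX_SLOT             /* next-link */
                  + PREFIX_SLOT;            /* flags */
    if (!tu_copy_mode)
        size += PREFIX_SLOT;                /* TU-copy address slot */
    return size;
}
```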

clear_free_lists (`sub_5EAF00`)

Zeroes all 5 global free-list heads:

qword_126F678 = 0;    // param_type free list
qword_126F670 = 0;    // template_arg free list
qword_126E4B8 = 0;    // local_constant free list
qword_126E4B0 = 0;    // expr_node free list
qword_126F668 = 0;    // constant_list_entry free list

Called at function-scope exit to prevent dangling pointers into freed regions.

String Allocation

Two specialized allocators handle string storage in regions:

copy_string_to_region (`sub_5E0600`, `il_alloc.c:548`)

char* copy_string_to_region(int region_id, const char* str) {
    size_t len = strlen(str);
    char* buf;
    if (region_id == 0)
        buf = heap_alloc(len + 1);                // general heap
    else if (region_id == file_scope_region)
        buf = region_alloc(file_scope, len + 1);   // file-scope region
    else if (region_id == -1)
        buf = persistent_alloc(len + 1);           // persistent heap
    else
        internal_error("copy_string_to_region");
    return strcpy(buf, str);
}

copy_string_of_length_to_region (`sub_5E0700`, `il_alloc.c:572`)

Same three-way dispatch but takes an explicit length parameter and uses strncpy with explicit null termination: result[len] = 0.
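The bounded copy with explicit termination can be sketched as below. For brevity malloc stands in for all three region allocators, which is an assumption; the function name is also illustrative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copies exactly len bytes of str into fresh storage and terminates it.
   strncpy alone would leave the buffer unterminated when str is longer
   than len, hence the explicit result[len] = 0. */
static char *copy_string_of_length(const char *str, size_t len) {
    char *result = malloc(len + 1);    /* stand-in for the region allocators */
    if (result == NULL)
        return NULL;
    strncpy(result, str, len);
    result[len] = 0;
    return result;
}
```

The caller owns the returned buffer in this sketch; in the region-based original, lifetime is tied to the region instead.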

Special Allocation Patterns

Labels -- Function-Scope Assertion

alloc_label (sub_5E5CA0) asserts that dword_126EB40 != dword_126EC90 (must be in function scope). Labels cannot exist at file scope -- they are always allocated in a function's region:

assert(current_region != file_scope_region);   // il_alloc.c:3588

Variables -- Kind-Dependent Region

alloc_variable (sub_5E4D20) uses the variable's linkage kind to select the allocation strategy. When kind > 2, it uses the dual-region allocator (sub_5E02E0), which can place the node in the current function region. Otherwise (kind <= 2, the non-local variables such as global, extern, and static), it allocates directly in the file-scope region. This ensures that local variables live in function regions while globals persist in the file-scope region.


GNU Supplement for Routines

alloc_gnu_supplement_for_routine (sub_5E56D0, il_alloc.c:3412) asserts that no supplement already exists (*(routine+240) == 0), then allocates a 40-byte supplement and stores the pointer at routine+240. This is for GCC-extension attributes on functions (visibility, alias, constructor/destructor priority).

Pragma -- 43 Kinds

alloc_pragma (sub_5E7570, il_alloc.c:4781) uses the same-region-as pattern (handling null, non-file-scope, scratch, and same-region-as cases) and dispatches a switch covering 43 pragma kinds (0-42). Most kinds are no-op; kinds 19, 21, 26, 28, 29 have small payload fields.

Scope -- Routine Association

alloc_scope (sub_5E7D80) validates that if assoc_routine (argument a3) is non-null, the scope kind must be 17 (sck_function). Violation triggers internal_error("assoc_routine is non-NULL") at il_alloc.c:4946. After kind dispatch, it zeroes 26 qword fields (offsets 80-280) and sets *(result+240) = -1 as a sentinel.

Global Variable Map

Address	Name	Purpose
`dword_126EC90`	`file_scope_region_id`	Region 1 identifier
`dword_126EB40`	`current_region_id`	Active allocation region
`dword_106BA08`	`tu_copy_mode`	TU-copy mode flag (affects prefix layout)
`dword_126EFC8`	`tracing_enabled`	When set, brackets alloc calls with trace_enter/leave
`qword_126EFB8`	`null_source_position`	Default source position for new nodes
`qword_126F700`	`current_source_file`	Current source file reference
`qword_106B9B0`	`compilation_context`	Active compilation context pointer
`dword_126E5FC`	`source_file_flags`	Bit 0 = C++ mode indicator
`byte_126E5F8`	`language_std_byte`	Language standard (controls routine type init)
`dword_106BFF0`	`uses_exceptions`	Exception model flag (set in routine alloc)

IL Tree Walking

The IL tree walking framework is the backbone of every operation that must visit the complete IL graph: debug display, device code marking, IL serialization, and IL copying for template instantiation. The framework lives in il_walk.c (with entry-kind dispatch logic auto-generated from walk_entry.h). It provides a generic, callback-driven traversal engine consisting of two core functions: walk_file_scope_il (sub_60E4F0), which orchestrates the top-level iteration over all global entry-kind lists, and walk_entry_and_subtree (sub_604170), which recursively descends into a single entry's children according to the IL schema. Five global function-pointer slots allow each client to customize the walk's behavior without modifying the walker itself.

The framework follows a strict separation of traversal and action. The walker knows how to navigate the IL graph; the callbacks decide what to do at each node. This design enables the same walker to serve four fundamentally different purposes: pretty-printing, transitive-closure marking, pointer remapping during copy, and entry filtering during serialization.

Key Facts

Property	Value
Source file	`il_walk.c` (EDG 6.6)
Header (auto-generated dispatch)	`walk_entry.h`
Assert path	`/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/il_walk.c`
Top-level file-scope walker	`sub_60E4F0` (`walk_file_scope_il`), 2043 lines
Recursive entry walker	`sub_604170` (`walk_entry_and_subtree`), 7763 lines / 42KB
Routine-scope walker	`sub_610200` (`walk_routine_scope_il`), 108 lines
Hash table reset	`sub_603B30` (`clear_walk_hash_table`), 23 lines
Anonymous union lookup	`sub_603FE0` (`find_parent_var_of_anon_union_type`), 127 lines
Entry kinds covered	85 (switch cases 0--84)
Recursive self-calls	~330 (in `walk_entry_and_subtree`)
Callback slots	5 global function pointers

Callback Slot Architecture

The five callback slots are stored as global function pointers. Before any walk, the caller saves all five values, installs its own set, and restores the originals on exit. This save/restore discipline makes walks re-entrant -- a callback can itself trigger a nested walk with different callbacks.

Address	Slot Name	Signature	Purpose
`qword_126FB88`	`entry_callback`	`entry_ptr(entry_ptr, entry_kind)`	Called for each entry visited; may return a replacement pointer
`qword_126FB80`	`string_callback`	`void(char_ptr, string_kind, byte_length)`	Called for each string field; `string_kind` is 24 (id_name), 25 (string_text), or 26 (other_text); `byte_length` is `strlen+1` for kinds 24/26, field-based for kind 25
`qword_126FB78`	`pre_walk_check`	`int(entry_ptr, entry_kind)`	Called before descending into an entry; returns nonzero to skip the subtree
`qword_126FB70`	`entry_replace`	`entry_ptr(entry_ptr, entry_kind)`	Called to remap an entry pointer (used during IL copy to translate old pointers to new ones)
`qword_126FB68`	`entry_filter`	`entry_ptr(entry_ptr, entry_kind)`	Called on linked-list heads to filter entries; returning NULL removes the entry from the list

The pre_walk_check slot is the only one whose return value gates descent into a subtree: nonzero means "already handled, skip this subtree." (entry_filter also affects traversal, but only by dropping individual list entries.) The keep-in-il pass uses pre_walk_check to avoid revisiting already-marked entries, which prevents infinite recursion on cyclic references. The entry_replace slot is used during IL copy operations to translate pointers from the source IL region to the destination region.

Walk State Globals

In addition to the five callback slots, four state variables track the walker's context:

Address	Name	Description
`dword_126FB5C`	`is_file_scope_walk`	1 during `walk_file_scope_il`, 0 during `walk_routine_scope_il`
`dword_126FB58`	`is_secondary_il`	1 if the current scope belongs to the secondary IL region
`dword_106B644`	`current_il_region`	Toggles per IL region; used to stamp bit 2 of entry flags
`dword_126FB60`	`walk_mode_flags`	Bitmask controlling walk behavior (e.g., strip template info)

All four are saved and restored alongside the callback slots, making the entire walk context atomically swappable.
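The save/install/restore discipline is what allows a callback to start a nested walk. A trimmed sketch, assuming a hypothetical context struct that bundles only one callback slot and two state variables (the real walker keeps five slots and four state variables as separate globals):

```c
#include <assert.h>
#include <stddef.h>

typedef void *(*entry_cb_t)(void *entry, int kind);

/* Hypothetical bundle standing in for the walker's separate globals. */
typedef struct {
    entry_cb_t entry_callback;
    int        is_file_scope_walk;
    int        walk_mode_flags;
} walk_context;

static walk_context g_walk;            /* the "global slots" */
static int inner_calls, outer_restored;

/* Save the live context, install the caller's, visit, then restore. */
static void walk_with(entry_cb_t cb, int flags, void *root) {
    walk_context saved = g_walk;
    g_walk.entry_callback = cb;
    g_walk.walk_mode_flags = flags;
    g_walk.is_file_scope_walk = 1;
    if (g_walk.entry_callback != NULL)
        g_walk.entry_callback(root, 23);   /* callback may nest another walk */
    g_walk = saved;
}

static void *inner_cb(void *e, int kind) {
    (void)kind;
    ++inner_calls;
    return e;
}

/* Starts a nested walk, then checks that its own context came back intact. */
static void *outer_cb(void *e, int kind) {
    (void)kind;
    walk_with(inner_cb, 2, e);
    outer_restored = (g_walk.entry_callback == outer_cb &&
                      g_walk.walk_mode_flags == 1);
    return e;
}
```

Copying the whole context in one assignment is a simplification; the binary saves and restores each global individually, as the listings in the next section show.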

walk_file_scope_il (sub_60E4F0)

This is the central traversal entry point. Every operation that needs to visit the entire file-scope IL calls this function with its desired callbacks. It takes six arguments:

void walk_file_scope_il(
    entry_callback_t   a1,   // entry visitor (qword_126FB88)
    string_callback_t  a2,   // string visitor (qword_126FB80)
    entry_replace_t    a3,   // pointer remapper (qword_126FB70)
    entry_filter_t     a4,   // list filter (qword_126FB68)
    pre_walk_check_t   a5,   // pre-visit gate (qword_126FB78)
    int                a6    // walk_mode_flags (dword_126FB60)
);

Initialization

The function begins by saving all five callback slots and all four walk state variables, then installs the caller's values:

// Save current state
saved_entry_cb      = qword_126FB88;
saved_string_cb     = qword_126FB80;
saved_pre_walk      = qword_126FB78;
saved_entry_replace = qword_126FB70;
saved_entry_filter  = qword_126FB68;
saved_is_file_scope = dword_126FB5C;
saved_is_secondary  = dword_126FB58;
saved_il_region     = dword_106B644;
saved_mode_flags    = dword_126FB60;

// Install new callbacks
qword_126FB88 = a1;
qword_126FB80 = a2;
qword_126FB70 = a3;
qword_126FB68 = a4;
qword_126FB78 = a5;
dword_126FB60 = a6;
dword_126FB5C = 1;  // mark as file-scope walk

An assertion fires if pre_walk_check is NULL and the primary scope is in secondary IL (bit 1 of flags byte set):

if (!a5 && is_secondary)
    assert_fail("il_walk.c", 270, "walk_file_scope_il");

This prevents unguarded walks into secondary IL regions, which would produce incorrect results because secondary entries need canonical-entry delegation.

Walk Order

The function visits IL entries in a fixed, deterministic order. This order is significant for serialization (the IL binary format expects entries in this exact sequence) and for display (the --dump_il output follows this structure).

Phase 1: Primary scope (kind 23)

primary_scope = xmmword_126EB60[1];  // second qword of IL header

// If entry_replace callback exists, remap the scope pointer first
if (entry_replace)
    primary_scope = entry_replace(primary_scope, 23);

// Determine IL region flags from the scope's prefix byte
is_secondary_il = (*(primary_scope - 8) & 0x02) != 0;
current_il_region = ((*(primary_scope - 8) >> 2) ^ 1) & 1;

walk_entry_and_subtree(primary_scope, 23);

The scope entry (kind 23) is walked first because it is the root of the scope tree. Walking the scope recursively visits all nested scopes and their member lists.
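The two bit manipulations in Phase 1 can be isolated as helpers. The function names are hypothetical; the masks and shifts are copied from the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 1 of the prefix byte: entry belongs to the secondary IL region. */
static int prefix_is_secondary_il(uint8_t flags) {
    return (flags & 0x02) != 0;
}

/* The walk's region value is the inverse of the root scope's bit 2. */
static int prefix_initial_il_region(uint8_t flags) {
    return ((flags >> 2) & 1) ^ 1;
}
```

Since current_il_region starts as the inverse of the root's bit 2, the root scope never matches the "already stamped" test in the default entry protocol and is therefore always walked.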

Phase 2: Source file entries (kind 1)

for (entry = xmmword_126EB60[0]; entry; entry = entry->child_file) {
    if (entry_filter && !entry_filter(entry, 1))
        continue;  // filtered out
    walk_entry_and_subtree(entry, 1);
}

Source file entries form a linked list via offset +56 (child_file). Each entry holds the file name, full path, and name_as_written strings.

Phase 3: main_routine pointer and string entries

Before walking strings, the function remaps the main_routine pointer from the IL header:

// main_routine (qword_126EB70, IL header + 0x10)
if (entry_replace) {
    il_header.main_routine = entry_replace(il_header.main_routine, 11);
    // Also remap compiler_version through entry_replace
    compiler_version = entry_replace(compiler_version, 26);
}

Then two string entries from the IL header are walked as "other text" (kind 26):

// compiler_version (qword_126EB78, IL header + 0x18)
if (compiler_version) {
    if (trace_verbosity > 4)
        fprintf(s, "Walking IL tree, string entry kind = %s\n", "other text");
    if (string_callback)
        string_callback(compiler_version, 26, strlen(compiler_version) + 1);
}

// time_of_compilation (qword_126EB80, IL header + 0x20) -- same pattern
if (entry_replace)
    time_of_compilation = entry_replace(time_of_compilation, 26);
if (time_of_compilation) {
    if (string_callback)
        string_callback(time_of_compilation, 26, strlen(time_of_compilation) + 1);
}

Strings are walked with kind 26 (other_text) and the string callback receives the raw character pointer, the kind, and the length including the null terminator.

Phase 4: Orphaned entities list (kind 55)

for (entry = qword_126EBA0; entry; entry = entry->next) {
    if (entry_filter && !entry_filter(entry, 55))
        continue;
    walk_entry_and_subtree(entry, 55);
}

Kind 55 entries are orphaned entities -- declarations that lost their parent scope (e.g., after template instantiation cleanup). They are stored in a separate linked list headed at qword_126EBA0.

Phase 5: Global entry-kind lists (kinds 1--72)

The bulk of the walk iterates 45 global linked lists, one per entry kind. Each list head is stored at a fixed address in the 0x126E610--0x126EA80 range, with 16-byte spacing. The complete walk order, verified from the decompiled sub_60E4F0:

#	Global Address	Kind	Entry Kind Name
1	`qword_126E610`	1	`source_file_entry`
2	`qword_126E620`	2	`constant`
3	`qword_126E630`	3	`param_type`
4	`qword_126E640`	4	`routine_type_supplement`
5	`qword_126E650`	5	`routine_type_extra`
6	`qword_126E660`	6	`type`
7	`qword_126E670`	7	`variable`
8	`qword_126E680`	8	`field`
9	`qword_126E690`	9	`exception_specification`
10	`qword_126E6A0`	10	`exception_spec_type`
11	`qword_126E6B0`	11	`routine`
12	`qword_126E6C0`	12	`label`
13	`qword_126E6D0`	13	`expr_node`
14	`qword_126E6E0`	14	(reserved)
15	`qword_126E6F0`	15	(reserved)
16	`qword_126E700`	16	`switch_case_entry`
17	`qword_126E710`	17	`switch_info`
18	`qword_126E720`	18	`handler`
19	`qword_126E730`	19	`try_supplement`
20	`qword_126E740`	20	`asm_supplement`
21	`qword_126E750`	21	`statement`
22	`qword_126E760`	22	`object_lifetime`
23	`qword_126E770`	23	`scope`
24	`qword_126E7B0`	27	`template_parameter`
25	`qword_126E7C0`	28	`namespace`
26	`qword_126E7D0`	29	`using_declaration`
27	`qword_126E7E0`	30	`dynamic_init`
28	`qword_126E810`	33	`overriding_virtual_func`
29	`qword_126E820`	34	(reserved)
30	`qword_126E830`	35	`derivation_path`
31	`qword_126E840`	36	`base_class_derivation`
32	`qword_126E850`	37	(reserved)
33	`qword_126E860`	38	(reserved)
34	`qword_126E870`	39	`class_info`
35	`qword_126E880`	40	(reserved)
36	`qword_126E890`	41	`constructor_init`
37	`qword_126E8A0`	42	`asm_entry`
38	`qword_126E8E0`	46	`lambda`
39	`qword_126E8F0`	47	`lambda_capture`
40	`qword_126E900`	48	`attribute`
41	`qword_126E9D0`	61	`template_param`
42	`qword_126E9B0`	59	`template_decl`
43	`qword_126E9E0`	62	`name_reference`
44	`qword_126E9F0`	63	`name_qualifier`
45	`qword_126EA80`	72	`attribute` (C++11)

Note the gaps in the walk order: kinds 24-26 (base_class, string_text, other_text), 31-32 (local_static_variable_init, vla_dimension), 43-45 (asm_operand, asm_clobber, reserved), and 49-58 (element_position through hidden_name) are skipped, and kinds 60 and 64-71 do not appear in the table either (kind 64 is walked separately in Phase 6). These entry kinds are either embedded inline within parent entries, accessed only through the recursive descent of walk_entry_and_subtree, or have no file-scope lists. Also note that kinds 59 and 61 appear out of order (61 before 59); this is verified in the binary.

For each non-empty list, the walk applies the entry_replace callback (if present) to each entry before descending, and follows the next pointer (at offset -16 in the raw allocation, which is the next_in_list link in the entry prefix).
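Every address in the table is consistent with a simple stride formula, which can be sketched as follows. The formula is inferred from the 45 listed rows only; the base 0x126E600 is an extrapolation and is not claimed to be a list head itself, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the global list head for an entry kind, assuming the
   16-byte stride observed in the table holds for that kind. */
static uint64_t kind_list_head_address(unsigned kind) {
    return UINT64_C(0x126E600) + UINT64_C(16) * kind;
}
```

For example, kind 59 maps to 0x126E9B0 and kind 61 to 0x126E9D0, so the out-of-order visit of these two kinds is a property of the walk sequence, not of the memory layout.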

Phase 6: Special trailing lists

Three additional lists are walked after the main kind-indexed sequence:

// seq_number_lookup entries (kind 64) at qword_126EBE8
for (entry = qword_126EBE8; entry; entry = entry->next) {
    if (entry_filter) ...
    walk_entry_and_subtree(entry, 64);
}

// External declarations (kind 6) at qword_126EBE0
// -- uses entry_filter with kind 6 and follows offset +104 links

// Kind 83 entries at qword_126EC00
for (entry = qword_126EC00; entry; entry = entry->next) {
    if (entry_filter) ...
    walk_entry_and_subtree(entry, 83);
}

Cleanup

After all phases complete, the function restores all saved state:

dword_126FB5C = saved_is_file_scope;
dword_126FB58 = saved_is_secondary;
dword_106B644 = saved_il_region;
dword_126FB60 = saved_mode_flags;
qword_126FB88 = saved_entry_cb;
qword_126FB80 = saved_string_cb;
qword_126FB78 = saved_pre_walk;
qword_126FB70 = saved_entry_replace;
qword_126FB68 = saved_entry_filter;

If tracing is active (dword_126EFC8), the function emits trace-leave via sub_48AFD0.

walk_entry_and_subtree (sub_604170)

This is the recursive engine -- the second-largest function in the entire cudafe++ binary at 7763 lines / 42KB of decompiled code. It takes an entry pointer and its kind, then recursively walks every child entry according to the IL schema.

Entry Protocol

Before descending into any entry, the function executes a two-path check:

while (true) {
    if (pre_walk_check != NULL) {
        // Callback path: delegate decision to the callback
        if (pre_walk_check(entry, entry_kind))
            return;  // callback says skip
    } else {
        // Default path: check flags
        flags = *(entry - 8);

        // If not file-scope walk and entry has file-scope bit: skip
        if (!is_file_scope_walk && (flags & 0x01))
            return;

        // If entry's il_region bit matches current_il_region: skip
        if (((flags & 0x04) != 0) == current_il_region)
            return;

        // Stamp the entry's il_region bit to match current region
        *(entry - 8) = (4 * (current_il_region & 1)) | (flags & 0xFB);
    }

    // Trace output at verbosity > 4
    if (trace_verbosity > 4)
        fprintf(s, "Walking IL tree, entry kind = %s\n",
                il_entry_kind_names[entry_kind]);

    // Dispatch on entry kind
    switch ((char)entry_kind) { ... }
}

The while(true) loop structure exists because certain cases (particularly linked-list tails) use continue to re-enter the check with a new entry, avoiding redundant function-call overhead for tail chains.

The default-path flags check serves two purposes:

Scope isolation: File-scope entries encountered during a routine-scope walk are skipped (they belong to the outer walk).
Region tracking: The current_il_region toggle prevents visiting the same entry twice within a single walk -- once stamped, an entry's bit 2 matches current_il_region, and the equality check causes the walker to skip it.
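The default-path check behaves like a one-bit visited mark whose meaning flips between walks. A minimal sketch, with a hypothetical function name and the flag byte passed by pointer instead of being read from entry - 8:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 and stamps the entry if it should be walked, 0 if it is skipped. */
static int should_visit(uint8_t *flags, int is_file_scope_walk, int current_il_region) {
    /* Scope isolation: file-scope entries are skipped in routine-scope walks. */
    if (!is_file_scope_walk && (*flags & 0x01))
        return 0;
    /* Region tracking: bit 2 already equal to the current region means "seen". */
    if (((*flags & 0x04) != 0) == current_il_region)
        return 0;
    /* Stamp bit 2 with the current region, leaving the other bits untouched. */
    *flags = (uint8_t)((4 * (current_il_region & 1)) | (*flags & 0xFB));
    return 1;
}
```

Because the next walk uses the opposite region value, no separate pass is needed to clear the marks: entries stamped in one walk read as unvisited in the next.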

Entry Kind Dispatch

The giant switch covers all 85 entry kinds. Each case knows the exact layout of that entry type and recursively calls walk_entry_and_subtree on every child pointer. Three of the five callbacks are invoked from inside the cases:

entry_replace: Called on each child pointer before recursion, potentially replacing it with a remapped pointer.
string_callback: Called on string fields (file names, identifier text), receiving the string pointer, the string kind (24, 25, or 26), and the byte length (strlen+1, including the null terminator, for kinds 24 and 26).
entry_filter: Called on linked-list head pointers, returning NULL to remove the entry from the list.

Coverage by Entry Kind

The following table shows the major entry kinds and what the walker visits for each:

Kind	Name	Children Walked
1	`source_file_entry`	`file_name` (string, kind 26 at `[0]`), `full_name` (string, kind 26 at `[1]`), `name_as_written` (string, kind 26 at `[2]`), child file list (kind 1, linked via offset +56 at `[5]`), associated entry at `[6]` (kind 1), module info at `[8]` (kind 82)
2	`constant`	Type refs at `[14]`/`[15]` (kind 6), expression at `[16]` (kind 13); sub-switch on `constant_kind` byte at +148 (see below)
3	`parameter`	Type (kind 6), declared_type (kind 6), default_arg_expr (kind 13), attributes (kind 72)
6	`type`	Base type (kind 6), member field list (kind 8), template info (kind 58), scope (kind 23), base class list (kind 24), class_info supplement (kind 39)
7	`variable`	Type (kind 6), initializer expression (kind 13), attributes (kind 72), declared_type (kind 6)
8	`field`	Next field (kind 8), type (kind 6), bit_size_constant (kind 2)
9	`exception_spec`	Type list (kind 10), noexcept expression (kind 13)
11	`routine`	Return type (kind 6), parameter list (kind 3), body scope (kind 23), template info (kind 58), exception spec (kind 9), attributes (kind 72)
12	`label`	Break label (kind 12), continue label (kind 12)
13	`expression`	Sub-expressions (kind 13), operand entries, type references (kind 6); sub-switch on expression operator covers ~120 operator kinds
16	`switch_case`	Statement (kind 21), case_value constant (kind 2)
17	`switch_info`	Case list (kind 16), default case (kind 16), sorted case array
18	`catch_entry`	Parameter (kind 7), statement (kind 21), dynamic_init expression (kind 13)
21	`statement`	Sub-statements (kind 21), expressions (kind 13), labels (kind 12); sub-switch on statement kind
22	`object_lifetime`	Variable (kind 7), lifetime scope boundary
23	`scope`	Variables list (kind 7), routines list (kind 11), types list (kind 6), nested scopes (kind 23), namespaces (kind 28), using-declarations (kind 29), hidden names (kind 56), labels (kind 12)
24	`base_class`	Next (kind 24), type (kind 6), derived_class (kind 6), offset expression
27	`template_parameter`	Default value, constraint expression (kind 60), template param supplement
28	`namespace`	Associated scope (kind 23), flags
29	`using_declaration`	Target entity, position, access specifier
30	`dynamic_init`	Expression (kind 13), associated variable (kind 7)
39	`class_info`	Constructor initializer list (kind 41), friend list, base class list (kind 24)
41	`constructor_init`	Next (kind 41), member/base expression, initializer expression
55	`orphaned_entities`	Entity list, scope reference
58	`template`	Template parameter list (kind 61), body, specializations list
72	`attribute`	Attribute arguments (kind 73), next attribute (kind 72)
80	`subobject_path`	Linked list (kind 80), each entry walked recursively

Constant Entry Sub-Switch (Case 2)

The constant entry handler is one of the most complex cases. After walking two type references ([14], [15] as kind 6) and one expression ([16] as kind 13), it dispatches on the constant_kind byte at entry + 148:

// Walk shared fields first
walk(a1[14], 6);   // type
walk(a1[15], 6);   // declared_type
walk(a1[16], 13);  // associated expression

// Strip template info if walk_mode_flags set
if (walk_mode_flags)
    a1[17] = 0;

switch (constant_kind) {
    case 0:  /* ck_error */
    case 1:  /* ck_integer */
    case 3:  /* ck_float */
    case 5:  /* ck_imaginary */
    case 14: /* ck_void */
        break;  // leaf constants, no children

    case 2:  /* ck_string */
        // Walk string data at [20] as string_text (kind 25)
        // Length comes from [19] (not strlen -- may have embedded NULs)
        if (string_callback)
            string_callback(a1[20], 25, a1[19]);
        break;

    case 4:  /* ck_complex */
        walk(a1[19], 27);   // template_parameter (real/imaginary parts)
        break;

    case 6:  /* ck_address -- 7 sub-kinds at entry+152 */
        switch (address_sub_kind) {
            case 0: entry_replace(a1[20], 11);  break;  // routine
            case 1: entry_replace(a1[20], 7);   break;  // variable
            case 2: case 3:
                walk(a1[20], 2);                break;  // constant (recurse)
            case 4: entry_replace(a1[20], 6);   break;  // type (typeid)
            case 5: walk(a1[20], 6);            break;  // type (uuidof, recurse)
            case 6: entry_replace(a1[20], 12);  break;  // label
            default: error("bad address const kind");
        }
        // Then walk subobject_path list at [22] (kind 80)
        break;

    case 7:  /* ck_ptr_to_member */
        entry_replace(a1[19], 36);   // derivation_path
        walk(a1[20], 62);            // name_reference
        // Conditional: if a1[21] & 2, replace [22] as routine(11)
        //             else replace [22] as field(8)
        break;

    case 8:  /* ck_label_difference */
        walk(a1[20], 2);             // constant (recurse)
        break;

    case 9:  /* ck_dynamic_init */
        walk(a1[19], 30);            // dynamic_init entry
        break;

    case 10: /* ck_aggregate */
        // Linked list of constants at [19], each via offset +104
        for each constant in list: walk(entry, 2);
        entry_replace(a1[20], 2);    // tail constant
        break;

    case 11: /* ck_init_repeat */
        walk(a1[19], 2);            // repeated constant
        break;

    case 12: /* ck_template_param -- 15 sub-kinds at entry+152 */
        // Another sub-switch with cases 0-13 + default error
        break;

    case 13: /* ck_designator */
        walk(a1[20], 2);            // constant value
        break;

    case 15: /* ck_reflection */
        // Walk [20] with kind from entry+152 byte
        break;
}

When walk_mode_flags is nonzero, the handler zeroes a1[17], which strips template parameter constant info during IL binary output. This template-stripping behavior is controlled by argument a6 of walk_file_scope_il.
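The pseudocode mixes word indices (a1[N]) with byte offsets (entry + N). Assuming a1 is an array of 8-byte words, as Hex-Rays typically renders such pointers, the two notations convert as in this sketch. The helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset of a1[index] when a1 is viewed as an array of 8-byte words. */
static size_t word_index_to_offset(size_t index)
{
    return index * sizeof(uint64_t);
}

/* Word index that contains a given byte offset. */
static size_t offset_to_word_index(size_t offset)
{
    return offset / sizeof(uint64_t);
}

/* Position of a byte offset inside its containing word. */
static size_t offset_within_word(size_t offset)
{
    return offset % sizeof(uint64_t);
}
```

Under that assumption, a1[14] is entry + 112, the constant_kind byte at +148 is byte 4 of a1[18], and a1[19] starts at entry + 152. The last value is the same address as the sub-kind byte read at entry + 152 for the address and template-parameter cases, which suggests the per-kind payload is a union overlaid on that slot.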

String Entry Handling

String fields within entries are walked with three distinct string kind values:

String Kind	Value	Display Name	Used For	Length Source
`id_name`	24	`"id name"`	Identifier names (variable, function, field names)	`strlen(str) + 1`
`string_text`	25	`"string text"`	String literal content (for `ck_string` constants)	Constant's length field `[19]`
`other_text`	26	`"other text"`	File names, compiler version, compilation time, asm text	`strlen(str) + 1`

The string_text kind (25) is special: its length comes from the enclosing constant entry's [19] field rather than strlen, because C/C++ string literals may contain embedded null bytes. All other string kinds use strlen(str) + 1.
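The length rule can be summarized in a short helper. This is a sketch with hypothetical names (string_entry_length, the SK_ enumerators); whether the constant's length field counts a trailing terminator is not established here and is left to the caller.

```c
#include <stddef.h>
#include <string.h>

enum string_kind {
    SK_ID_NAME     = 24,
    SK_STRING_TEXT = 25,
    SK_OTHER_TEXT  = 26
};

/* Returns the byte count passed to string_callback for a string entry.
 * constant_length is only consulted for SK_STRING_TEXT. */
static size_t string_entry_length(enum string_kind kind, const char *str,
                                  size_t constant_length)
{
    if (kind == SK_STRING_TEXT)
        return constant_length;      /* may span embedded NUL bytes */
    return strlen(str) + 1;          /* NUL-terminated, terminator included */
}
```

For a literal such as "ab\0cd", strlen stops at the embedded NUL and reports 2, so only the stored length can cover the full content.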

Error Strings

The function contains diagnostic strings from walk_entry.h that fire on unexpected sub-kind values:

String	Line	Triggers When
`"walk_entry_and_subtree: bad address const kind"`	883	Unknown `address_constant_kind` in constant entry (kind 2, sub-kind 6)
`"walk_entry_and_subtree: bad template param constant kind"`	1035	Unknown `template_param_constant_kind` in constant entry (kind 2, sub-kind 12)
`"walk_entry_and_subtree: bad constant kind"`	1051	Unknown `constant_kind` in constant entry (kind 2)

All three errors name walk_entry.h as the source file and walk_entry_and_subtree as the function, which indicates that the dispatch code is compiled from that header rather than written directly in il_walk.c.

walk_routine_scope_il (sub_610200)

The routine-scope counterpart of walk_file_scope_il. It takes a routine index and walks that routine's scope chain:

void walk_routine_scope_il(int routine_index, ...) {
    // Same 5-callback + 4-state save/restore pattern
    // Trace: "walk_routine_scope_il"
    // Assert: il_walk.c, line 376

    dword_126FB5C = 0;  // NOT file-scope walk

    scope = qword_126EB90[routine_index];  // routine_scope_array
    while (scope) {
        walk_entry_and_subtree(scope, 23);
        if (entry_replace)
            scope = entry_replace(scope, 23);
        scope = scope->next;
    }
}

The key difference from walk_file_scope_il is that is_file_scope_walk is set to 0, which changes the entry protocol in walk_entry_and_subtree: entries with the file-scope bit set in their flags byte are skipped, because they belong to the file-scope IL and should not be processed during a routine-scope walk.

Callers and Use Cases

The walk framework serves four distinct purposes. Each caller installs a different callback configuration.

IL Display

The --dump_il debug output uses the walk framework with display_il_entry (sub_5F4930) as the entry callback:

// sub_5F76B0 (display_il_header)
walk_file_scope_il(
    display_il_entry,    // a1: entry callback = sub_5F4930
    NULL,                // a2: no string callback
    NULL,                // a3: no replace
    NULL,                // a4: no filter
    NULL,                // a5: no pre-walk check
    0                    // a6: no special flags
);

With all callbacks NULL except entry_callback, the walker visits every entry in walk order and calls display_il_entry on each, which dispatches on entry kind to print formatted field dumps. The pre_walk_check is NULL, so the default flags-based skip logic applies -- the current_il_region toggle prevents double-visiting.

Keep-in-IL Marking

The device code selection pass (mark_to_keep_in_il, sub_610420) installs the prune callback as pre_walk_check and NULL for everything else:

// sub_610420 (mark_to_keep_in_il)
qword_126FB88 = NULL;                    // no entry callback
qword_126FB80 = NULL;                    // no string callback
qword_126FB78 = prune_keep_in_il_walk;   // sub_617310
qword_126FB70 = NULL;                    // no replace
qword_126FB68 = NULL;                    // no filter

The prune_keep_in_il_walk callback (sub_617310) sets bit 7 (0x80) of each entry's flags byte and returns 1 for already-marked entries (preventing infinite recursion). The actual subtree walk is handled by a specialized copy of walk_entry_and_subtree (sub_6115E0, walk_tree_and_set_keep_in_il, 4649 lines) that directly sets the keep bit on every reachable child rather than using callbacks. See Keep-in-IL for the full mechanism.

IL Serialization

IL binary output (in builds where IL_SHOULD_BE_WRITTEN_TO_FILE is enabled, or for device IL output) uses all five callback slots:

entry_callback: Records each entry's position in the output stream
string_callback: Serializes string data with length prefix
entry_replace: Translates IL pointers to output-stream offsets
entry_filter: Skips entries that should not appear in the output (e.g., entries without keep_in_il for device IL)
pre_walk_check: Prevents re-serializing entries already written

IL Copy (Template Instantiation)

When EDG instantiates a template, it copies the template's IL subtree into a new region. The copy operation uses entry_replace to remap all pointers from the source region to the destination:

entry_replace: For each child pointer, allocates a new entry in the destination region, copies the source entry's contents, and returns the new pointer
string_callback: Copies string data into the destination region
pre_walk_check: Tracks which entries have already been copied (using the visited-set hash table at qword_126FB50)

Hash Table for Visited Set

The walk framework includes a visited-set hash table for cycle detection and deduplication:

Address	Name	Description
`qword_126FB50`	`hash_table_array`	Pointer to hash table bucket array
`dword_126FB48`	`hash_table_count`	Number of entries in hash table
`qword_126FB40`	`visited_set`	Pointer to visited-set data
`dword_126FB30`	`visited_count`	Number of visited entries

The hash table is reset by sub_603B30 (clear_walk_hash_table) before each walk operation. It uses open addressing and is primarily employed during IL copy operations to map source entry pointers to their destination counterparts.
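To make the source-to-destination mapping concrete, here is a minimal open-addressing pointer map. It is a sketch only: the fixed capacity, the shift-and-mask hash, and linear probing are assumptions chosen for brevity, and the names ptr_map, ptr_map_get, and ptr_map_put are hypothetical. The binary's actual probe sequence and growth policy are not documented here.

```c
#include <stddef.h>
#include <stdint.h>

#define MAP_SLOTS 64u   /* power of two; illustrative capacity */

/* A NULL key marks an empty slot, so a zeroed map is an empty map. */
struct ptr_map {
    const void *key[MAP_SLOTS];
    void       *value[MAP_SLOTS];
};

static size_t slot_for(const void *p)
{
    /* Drop the low alignment bits, then mask to the table size. */
    return (size_t)(((uintptr_t)p >> 3) & (MAP_SLOTS - 1));
}

/* Returns the destination mapped to src, or NULL if src is unseen. */
static void *ptr_map_get(const struct ptr_map *m, const void *src)
{
    size_t i = slot_for(src);
    for (size_t n = 0; n < MAP_SLOTS; n++, i = (i + 1) & (MAP_SLOTS - 1)) {
        if (m->key[i] == NULL)
            return NULL;
        if (m->key[i] == src)
            return m->value[i];
    }
    return NULL;
}

/* Inserts or updates a mapping; returns 0 only when the table is full. */
static int ptr_map_put(struct ptr_map *m, const void *src, void *dst)
{
    size_t i = slot_for(src);
    for (size_t n = 0; n < MAP_SLOTS; n++, i = (i + 1) & (MAP_SLOTS - 1)) {
        if (m->key[i] == NULL || m->key[i] == src) {
            m->key[i] = src;
            m->value[i] = dst;
            return 1;
        }
    }
    return 0;
}
```

A copy pass would call ptr_map_get before allocating: a hit returns the existing destination entry, which both deduplicates shared children and terminates cycles.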

Helper Functions

Several helper functions support the walk framework:

Address	Identity	Lines	Purpose
`sub_603B30`	`clear_walk_hash_table`	23	Zeros the visited-set hash table (`qword_126FB50`, `dword_126FB48`)
`sub_603FE0`	`find_parent_var_of_anon_union_type`	127	Searches scope member lists for the variable that owns an anonymous union type
`sub_603BB0`	`find_var_in_nested_scopes`	333	Recursively searches nested scopes for a variable (deeply unrolled, 8+ levels)
`sub_603B00`	(trivial getter)	9	Walk-state accessor
`sub_610200`	`walk_routine_scope_il`	108	Routine-scope walker (counterpart to `walk_file_scope_il`)

Keep-in-IL Specialized Walkers

The keep-in-il pass uses parallel implementations of the walk framework that bypass the callback mechanism for performance:

Address	Identity	Lines	Purpose
`sub_6115E0`	`walk_tree_and_set_keep_in_il`	4649	File-scope variant -- sets bit 7 directly on every reachable entry
`sub_618660`	`walk_entry_and_set_keep_in_il_routine_scope`	3728	Routine-scope variant
`sub_61CE20`--`sub_620190`	(keep-in-il helpers)	various	Per-kind helpers for template args, exception specs, array bounds, expressions, statements

These specialized walkers are structurally identical to walk_entry_and_subtree but replace callback invocations with direct *(entry - 8) |= 0x80 operations. They exist as separate functions rather than callback-based walks, most likely because keep-in-il marking runs on every CUDA compilation: removing the function-pointer indirection across ~330 recursive calls avoids per-call overhead on a hot path.

Global Entry-Kind List Layout

The per-kind linked lists are stored in a contiguous global array starting at 0x126E600, with 16-byte stride. The formula 0x126E600 + kind * 0x10 gives the list head for most entry kinds up to kind 72. The complete walk order with all 51 lists (45 from Phase 5, 3 from Phase 6, plus orphaned entities, source files, and seq_number_lookup) is documented in the Phase 5 table above.
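The address formula is easy to check against the list heads quoted on this page. The helper below is a sketch with a hypothetical name. It only computes addresses and applies to the kinds stored in the contiguous array, not the trailing lists.

```c
#include <stdint.h>

#define IL_LIST_BASE   0x126E600u   /* start of the per-kind list array */
#define IL_LIST_STRIDE 0x10u        /* 16 bytes per kind */

/* Address of the list head for an entry kind in the contiguous array. */
static uint32_t list_head_address(unsigned kind)
{
    return IL_LIST_BASE + kind * IL_LIST_STRIDE;
}
```

Kind 6 yields 0x126E660, matching the main type list, and kinds 1 and 72 yield 0x126E610 and 0x126EA80, the two ends of the range walked by mark_to_keep_in_il.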

The three trailing lists (Phase 6) are stored outside the contiguous array at separate addresses in the IL header/footer region:

Address	Kind	Purpose	Next-Pointer Strategy
`qword_126EBE8`	64	Sequence number lookup entries	Standard `next_in_list` at node prefix
`qword_126EBE0`	6	External declarations (type list)	Type-specific next at offset +104
`qword_126EC00`	83	Module declarations (C++20)	Standard `next_in_list` at node prefix

The external declarations list (qword_126EBE0) is notable: it walks entries as kind 6 (type) but uses a different linked-list strategy (offset +104 rather than the standard prefix next pointer). This is because the external declarations list is a secondary index over type entries that are also present in the main type list at qword_126E660.

Walk Order Diagram

walk_file_scope_il(callbacks...)
  |
  +-- [save 5 callbacks + 4 state vars]
  +-- [install caller's callbacks]
  |
  +-- Phase 1: walk_entry_and_subtree(primary_scope, 23)
  |     |
  |     +-- Recursively visits all nested scopes,
  |         their member lists (vars, routines, types),
  |         and all subtrees
  |
  +-- Phase 2: source_file list (kind 1)
  |     +-- for each file: walk(file, 1)
  |         +-- walks file_name, full_name, child files
  |
  +-- Phase 3: main_routine + string entries
  |     +-- entry_replace(main_routine, 11)
  |     +-- string_callback(compiler_version, 26, len)
  |     +-- string_callback(time_of_compilation, 26, len)
  |
  +-- Phase 4: orphaned_entities list (kind 55)
  |     +-- for each orphan: walk(orphan, 55)
  |
  +-- Phase 5: global lists (kinds 1, 2, 3, ..., 72)
  |     +-- for each kind:
  |           for each entry in list:
  |             entry_replace(entry, kind)
  |             walk(entry, kind)
  |
  +-- Phase 6: trailing lists (kinds 64, 6-ext, 83)
  |
  +-- [restore saved state]

Diagnostic Strings

String	Source	Condition
`"walk_file_scope_il"`	`sub_60E4F0`	Trace enter (`dword_126EFC8` nonzero)
`"walk_routine_scope_il"`	`sub_610200`	Trace enter
`"Walking IL tree, entry kind = %s\n"`	`sub_604170`	`dword_126EFCC > 4`
`"Walking IL tree, string entry kind = %s\n"`	`sub_604170` / `sub_60E4F0`	`dword_126EFCC > 4`
`"walk_entry_and_subtree: bad address const kind"`	`sub_604170`	Unknown address constant sub-kind (`walk_entry.h:883`)
`"walk_entry_and_subtree: bad template param constant kind"`	`sub_604170`	Unknown template param constant sub-kind (`walk_entry.h:1035`)
`"walk_entry_and_subtree: bad constant kind"`	`sub_604170`	Unknown constant kind (`walk_entry.h:1051`)
`"find_parent_var_of_anon_union_type"`	`sub_603FE0`	Assert at lines 511, 523
`"find_parent_var_of_anon_union_type: var not found"`	`sub_603FE0`	Variable lookup failed

Function Map

Address	Identity	Confidence	Lines	EDG Source
`sub_60E4F0`	`walk_file_scope_il`	99%	2043	`il_walk.c:270`
`sub_604170`	`walk_entry_and_subtree`	99%	7763	`il_walk.c` / `walk_entry.h`
`sub_610200`	`walk_routine_scope_il`	98%	108	`il_walk.c:376`
`sub_603B30`	`clear_walk_hash_table`	85%	23	`il_walk.c`
`sub_603FE0`	`find_parent_var_of_anon_union_type`	99%	127	`il_walk.c:511`
`sub_603BB0`	`find_var_in_nested_scopes`	85%	333	`il_walk.c`
`sub_603B00`	(trivial walk-state accessor)	80%	9	`il_walk.c`
`sub_6115E0`	`walk_tree_and_set_keep_in_il`	98%	4649	`il_walk.c`
`sub_618660`	`walk_entry_and_set_keep_in_il_routine_scope`	88%	3728	`il_walk.c`
`sub_61CE20`	(keep-in-il helper: template args)	80%	100	`il_walk.c`
`sub_61D0C0`	(keep-in-il helper: exception spec)	80%	108	`il_walk.c`
`sub_61D330`	(keep-in-il helper: array bound)	80%	97	`il_walk.c`
`sub_61D570`	(keep-in-il helper: overriding virtual)	80%	120	`il_walk.c`
`sub_61D7F0`	(keep-in-il helper: base class)	80%	69	`il_walk.c`
`sub_61D9B0`	(keep-in-il helper: attributes)	80%	202	`il_walk.c`
`sub_61DEC0`	(keep-in-il helper: using-decl)	80%	101	`il_walk.c`
`sub_61E160`	(keep-in-il helper: object lifetime)	80%	76	`il_walk.c`
`sub_61E370`	(keep-in-il helper: expressions)	80%	369	`il_walk.c`
`sub_61ECF0`	(keep-in-il helper: statements)	80%	466	`il_walk.c`
`sub_61F420`	(keep-in-il helper: additional exprs)	80%	631	`il_walk.c`
`sub_61FEA0`	(keep-in-il helper: decl sequence)	80%	173	`il_walk.c`

Cross-References

IL Overview -- entry kind table, IL header structure
IL Allocation -- entry prefix layout, flags byte definition
Keep-in-IL -- device code marking pass using this framework
IL Display -- display_il_entry callback
Pipeline Overview -- when walks are triggered
Device/Host Separation -- higher-level context

Keep-in-IL (Device Code Selection)

cudafe++ compiles a single .cu translation unit that contains both host and device code. After the EDG frontend builds the complete IL tree, cudafe++ must split the two worlds: host-side declarations feed into the .int.c output, while device-side declarations feed into the binary IL emitted for cicc. The keep-in-il mechanism performs this split. It is a transitive-closure walk that starts from known device entities (functions with __device__/__global__ attributes, __shared__/__constant__/__managed__ variables) and recursively marks every IL entry they reference. Entries that survive the mark phase are written to the device IL; entries without the mark are stripped by the elimination pass.

The entire mechanism lives in il_walk.c (the mark/walk side) and il.c (the elimination side). It runs as pass 3 of fe_wrapup, after IL lowering (pass 2) and before C++ class finalization (pass 4).

Key Facts

Property	Value
Source file	`il_walk.c` (mark), `il.c` (eliminate)
Mark entry point	`sub_610420` (`mark_to_keep_in_il`), 892 lines
Recursive worker	`sub_6115E0` (`walk_tree_and_set_keep_in_il`), 4649 lines / 23KB
Prune callback	`sub_617310` (`prune_keep_in_il_walk`), 127 lines
Elimination entry point	`sub_5CCBF0` (`eliminate_unneeded_il_entries`), 345 lines
Template cleanup	`sub_5CCA40` (`clear_instantiation_required_on_unneeded_entities`), 86 lines
Body removal	`sub_5CC410` (`eliminate_bodies_of_unneeded_functions`), ~200 lines
Trigger	`fe_wrapup` pass 3, argument 23 (scope entry kind)
Guard flag	`dword_106B640` (set=1 before walk, cleared=0 after)
Key bit	Bit 7 (0x80) of byte at `entry_ptr - 8`

The Keep-in-IL Bit

Every IL entry is preceded by an 8-byte prefix. The byte at offset -8 from the entry pointer contains per-entry flags:

Byte at (entry_ptr - 8):
  bit 0  (0x01)  is_file_scope          Entry belongs to file-scope IL region
  bit 1  (0x02)  is_in_secondary_il     Entry is in the secondary IL (second TU)
  bit 2  (0x04)  current_il_region      Toggles per IL region (0 or 1)
  bits 3-6       (reserved)
  bit 7  (0x80)  keep_in_il             DEVICE CODE MARKER

The sign bit of this byte doubles as the keep-in-il flag. This allows a fast check: *(signed char*)(entry - 8) < 0 means "keep this entry." The elimination pass exploits this: it tests *(char*)(entry - 8) >= 0 to identify entries to remove.
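A small sketch of the sign-bit test follows. The helper names are hypothetical, and the 16-byte buffer in the usage note merely stands in for an allocated entry with its 8-byte prefix. The explicit signed char matters in portable C because the signedness of plain char is implementation-defined; on x86-64 Linux it is signed, which is why the decompiled comparison works.

```c
#include <stdbool.h>
#include <stdint.h>

#define IL_KEEP_IN_IL 0x80   /* bit 7 of the flags byte at entry - 8 */

/* True when the keep_in_il bit is set: the byte reads as negative. */
static bool is_kept(const void *entry)
{
    return *((const signed char *)entry - 8) < 0;
}

static void set_keep(void *entry)
{
    *((uint8_t *)entry - 8) |= IL_KEEP_IN_IL;
}

static void clear_keep(void *entry)
{
    *((uint8_t *)entry - 8) &= (uint8_t)~IL_KEEP_IN_IL;
}
```

With unsigned char block[16] = {0} and entry = block + 8, calling set_keep(entry) stores 0x80 into block[0] and is_kept(entry) then returns true.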

Two additional "keep definition" flags exist on specific entity types:

Entity kind	Field	Bit	Meaning
Type (kind 6, class/struct)	`entry + 162`	bit 7 (0x80)	`keep_definition_in_il` -- retain full class body
Routine (kind 11)	`entry + 187`	bit 2 (0x04)	`keep_definition_in_il` -- retain function body

The keep_definition_in_il flag is stronger than the base keep_in_il flag. A type marked with only keep_in_il may be emitted as a forward declaration; one marked with keep_definition_in_il retains its full member list, base classes, and nested types.

Pipeline Context

fe_wrapup (sub_588F90)
  |
  +-- Pass 1:  sub_588C60   per-file IL wrapup
  +-- Pass 2:  sub_707040   IL lowering
  +-- Pass 3:  dword_106B640 = 1                          <<<< guard ON
  |            sub_610420(file_scope, 23)                  <<<< MARK PHASE
  |            dword_106B640 = 0                           <<<< guard OFF
  +-- Pass 4:  sub_5CCA40   clear_instantiation_required   <<<< CLEANUP
  |            sub_5CC410   eliminate_bodies_of_unneeded    (if deferred)
  |            sub_5CCBF0   eliminate_unneeded_il_entries   (if deferred)
  +-- Pass 5:  sub_588D40   per-file part_3

The guard flag dword_106B640 is an assertion variable: mark_to_keep_in_il asserts that it is nonzero during the walk. If the assertion fails, it triggers internal_error at il_walk.c:1959. This prevents accidental invocation of the marking logic outside the designated pass window.

Mark Phase: mark_to_keep_in_il (sub_610420)

This is the top-level entry point for device code marking. It orchestrates the entire transitive-closure walk.

Callback Installation

The function saves all five IL tree-walk callback slots and installs a specialized configuration:

// Save existing callbacks
saved_entry_cb      = qword_126FB88;    // entry_callback
saved_string_cb     = qword_126FB80;    // string_callback
saved_pre_walk      = qword_126FB78;    // pre_walk_check
saved_entry_replace = qword_126FB70;    // entry_replace
saved_entry_filter  = qword_126FB68;    // entry_filter

// Install keep-in-il configuration
qword_126FB88 = NULL;                   // no entry callback
qword_126FB80 = NULL;                   // no string callback
qword_126FB78 = prune_keep_in_il_walk;  // sub_617310
qword_126FB70 = NULL;                   // no replacement
qword_126FB68 = NULL;                   // no filter
dword_126FB60 = 0;                      // walk_mode_flags = 0
dword_126FB58 = (flags_byte & 0x02) != 0;  // is_secondary_il

All callbacks are restored on exit, making the function safely re-entrant from nested invocations (the prune callback can recursively call mark_to_keep_in_il for canonical entries in secondary IL).
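The save/install/restore discipline is what makes nested invocations safe. The sketch below models the five slots as one struct so that a single assignment saves or restores them; the struct, the shared callback signature, and the names g_walk, prune_stub, and mark_sketch are simplifications introduced for illustration, not the binary's layout.

```c
#include <stddef.h>

/* Simplified: all five slots share one signature in this sketch. */
typedef int (*walk_cb)(void *entry, int kind);

struct walk_state {
    walk_cb entry_callback;
    walk_cb string_callback;
    walk_cb pre_walk_check;
    walk_cb entry_replace;
    walk_cb entry_filter;
};

static struct walk_state g_walk;   /* stands in for the five global slots */
static int g_installed_ok;         /* counts levels that saw the right config */

static int prune_stub(void *entry, int kind)
{
    (void)entry;
    (void)kind;
    return 1;
}

/* Saves the active configuration, installs the marking configuration,
 * optionally nests, and restores the saved configuration on exit. */
static void mark_sketch(int depth)
{
    struct walk_state saved = g_walk;
    g_walk = (struct walk_state){ .pre_walk_check = prune_stub };

    g_installed_ok += (g_walk.pre_walk_check == prune_stub &&
                       g_walk.entry_callback == NULL);
    if (depth > 0)
        mark_sketch(depth - 1);    /* nested invocation, as from the prune callback */

    g_walk = saved;
}
```

Each level restores exactly what it found, so an outer walk that was interrupted by a nested mark resumes with its own callbacks intact.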

Scope Initialization

For scope entries (kind 23), the function handles two cases:

Scope already has keep_in_il set (byte at entry + 28 is nonzero): Call walk_tree_and_set_keep_in_il directly. The scope was previously identified as device-relevant.
Fresh scope (byte at entry + 28 is zero): Clear bit 7 of the entry's flags byte, then walk. This is the file-scope entry point where the walk begins with the keep bit initially cleared, allowing the recursive walk to set it transitively.

if (entry_kind == 23) {             // scope
    if (*(entry + 28) != 0) {
        walk_tree_and_set_keep_in_il(entry, 23);
    } else {
        *(entry - 8) &= 0x7F;      // clear keep_in_il
        // Debug: "Beginning file scope keep_in_il walk"
        walk_tree_and_set_keep_in_il(entry, 23);
        if (dword_126EFB4 == 2)     // C++ mode
            walk_scope_and_mark_routine_definitions(entry);  // sub_6175F0
    }
}

Global Entry-Kind List Walk

After processing the scope, mark_to_keep_in_il iterates all 45+ global entry-kind linked lists. These lists at 0x126E610--0x126EA80 hold every file-scope entity indexed by entry kind. The function visits each list and calls walk_tree_and_set_keep_in_il on every entry:

// Orphaned scope list (kind 55) -- only entries with keep_definition flag
for (entry = qword_126EBA0; entry; entry = entry->next) {
    if (entry->routine_byte_187 & 0x04)   // keep_definition_in_il set
        walk_tree_and_set_keep_in_il(entry, 55);
}

// Source files (kind 1), constants (kind 2), parameters (kind 3), ...
// through kind 72; the skipped kinds listed below are omitted here for brevity
for (int kind = 1; kind <= 72; kind++) {
    for (entry = global_list[kind]; entry; entry = entry->next)
        walk_tree_and_set_keep_in_il(entry, kind);
}

The iteration order mirrors walk_file_scope_il (sub_60E4F0), processing kinds 1 through 72 with some gaps (kinds 24--26, 31--32, 43--45, 49--58, 60, 64--71 are skipped because those lists are empty or handled differently).
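The gap list can be cross-checked against the list count. This sketch encodes the skipped ranges from the paragraph above and counts the kinds that remain; the names are hypothetical.

```c
#include <stddef.h>

struct kind_range {
    int first;
    int last;
};

/* Kinds whose global lists are not walked, as listed in the text. */
static const struct kind_range skipped[] = {
    {24, 26}, {31, 32}, {43, 45}, {49, 58}, {60, 60}, {64, 71}
};

static int is_skipped(int kind)
{
    for (size_t i = 0; i < sizeof skipped / sizeof skipped[0]; i++)
        if (kind >= skipped[i].first && kind <= skipped[i].last)
            return 1;
    return 0;
}

/* Number of kinds in 1..72 whose lists are actually walked. */
static int count_walked_kinds(void)
{
    int count = 0;
    for (int kind = 1; kind <= 72; kind++)
        if (!is_skipped(kind))
            count++;
    return count;
}
```

The ranges cover 27 kinds, which leaves 45 walked lists out of 72, in line with the "45+" figure quoted for the global list walk.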

Using-Declaration Fixup

After the main walk, the function processes using-declarations attached to scopes. This is a fixed-point loop that repeats until no new entities are marked:

do {
    changed = 0;
    process_using_decl_list(scope->using_decls, is_class_scope, &changed);
} while (changed);

For each scope region (iterated via entry + 264), it walks the using-declaration chain and handles six declaration kinds:

Using-decl kind byte	Name	Action
`0x33`	Simple using	If target entity is marked, mark the using-decl
`0x34`	Using with namespace	If target entity is marked, mark using-decl + namespace
`0x35`	Nested scope	Recurse via `sub_6170C0`
`0x36`	Using with template	If target entity is marked, mark using-decl + template
`6`	Type alias (typedef)	Special: if typedef of a class/struct with `has_definition` flag, and the underlying class is marked, mark the typedef too
`66`	Using-everything	Force-mark unconditionally, set `changed = 1`

The typedef case (kind 6) deserves attention. When a typedef aliases a marked class, the typedef entry gets marked so that device code can reference the class through its alias name. The check verifies entry + 132 == 12 (typedef type kind), the underlying type is a class/struct/union (kinds 9--11), and the has_definition flag (entry + 161, bit 2) is set.
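The reason for iterating to a fixed point is that marking one declaration can make another eligible, regardless of list order. The sketch below reduces this to a single rule (mark a declaration once its target is marked); the entity struct and function names are hypothetical, and the per-kind actions from the table are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a using-declaration and its target. */
struct entity {
    bool marked;
    struct entity *target;
};

/* One sweep: mark every declaration whose target is already marked. */
static void process_list(struct entity *decls, size_t n, bool *changed)
{
    for (size_t i = 0; i < n; i++) {
        if (!decls[i].marked && decls[i].target && decls[i].target->marked) {
            decls[i].marked = true;
            *changed = true;
        }
    }
}

/* Repeats the sweep until nothing changes; returns the number of sweeps. */
static int mark_until_stable(struct entity *decls, size_t n)
{
    int passes = 0;
    bool changed;
    do {
        changed = false;
        process_list(decls, n, &changed);
        passes++;
    } while (changed);
    return passes;
}
```

If declaration 0 targets declaration 1 and declaration 1 targets an already-marked entity, the first sweep marks only declaration 1, the second marks declaration 0, and a third sweep confirms stability.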

Recursive Worker: walk_tree_and_set_keep_in_il (sub_6115E0)

This 23KB function is structurally identical to the generic walk_entry_and_subtree (sub_604170) but specialized: instead of invoking callbacks, it directly sets the keep_in_il bit on every reachable sub-entry and recurses.

The function dispatches on entry kind (approximately 80 cases) and for each child pointer it encounters, performs:

if (child != NULL) {
    *(child - 8) |= 0x80;                    // set keep_in_il
    walk_tree_and_set_keep_in_il(child, child_kind);  // recurse
}

Key entry kinds and what they transitively mark:

Entry kind	ID	Children marked
`source_file`	1	file_name, full_name, child files
`constant`	2	type, string data, address target
`parameter`	3	type, declared_type, default_arg_expr, attributes
`type`	6	base_type, member fields, template info, scope, base classes
`variable`	7	type, initializer expression, attributes
`field`	8	next field, type, bit_size_constant
`routine`	11	return_type, parameters, body, template info, exception specs
`expression`	13	sub-expressions, operands, type references
`statement`	21	sub-statements, expressions, labels
`scope`	23	all member lists (variables, routines, types, nested scopes)
`template_parameter`	27	default values, constraints
`namespace`	28	associated scope

The function also handles cross-references in template instantiations: when it encounters a template specialization, it follows the primary template pointer and marks the template definition too. This ensures that if device code uses vector<int>, the vector template itself is retained.

Pre-Walk Check Integration

Before recursing into any entry, the walk checks the pre_walk_check callback (qword_126FB78), which is set to prune_keep_in_il_walk. This callback returns 1 (skip) if the entry is already marked, preventing infinite recursion on cyclic references (classes referencing their own members) and avoiding redundant work.
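The interaction between the mark bit and the prune check amounts to a depth-first reachability walk that uses the bit as its visited set. The sketch below shows this on a toy graph with a cycle. The il_node struct, the fixed child array, and the function names are hypothetical, and the real walker dispatches on entry kind instead of iterating a uniform child array.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4

/* Toy node: a flags byte plus a few child pointers. */
struct il_node {
    uint8_t flags;
    struct il_node *child[MAX_CHILDREN];
};

/* Returns 1 (skip) if already marked; otherwise marks and returns 0. */
static int prune(struct il_node *n)
{
    if (n->flags & 0x80)
        return 1;
    n->flags |= 0x80;
    return 0;
}

/* Marks everything reachable from n; returns the number of new marks. */
static int mark_reachable(struct il_node *n)
{
    if (n == NULL || prune(n))
        return 0;
    int count = 1;
    for (size_t i = 0; i < MAX_CHILDREN; i++)
        count += mark_reachable(n->child[i]);
    return count;
}
```

With two nodes that point at each other, the walk marks both and stops when it returns to the first, while an unrelated third node stays unmarked.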

Prune Callback: prune_keep_in_il_walk (sub_617310)

This callback is installed as the pre_walk_check during the keep-in-il walk. It runs before the walker descends into each entry.

Decision Logic

int prune_keep_in_il_walk(entry_ptr, entry_kind) {
    signed char flags = *(signed char *)(entry_ptr - 8);  // signed so bit 7 reads as negative

    // Case 1: Secondary IL mismatch -- delegate to canonical
    if (is_secondary_il && !(flags & 0x02)) {
        canonical = lookup_canonical(entry_ptr, entry_kind);  // sub_5B9EE0
        if (dword_126EE48) {   // CUDA mode
            if (canonical && canonical->assoc_entry) {
                target = *canonical->assoc_entry;
                if (target != entry_ptr && (*(target - 8) & 0x02))
                    mark_to_keep_in_il(target, entry_kind);  // recurse
            }
        }
        return 1;  // skip this entry (handled via canonical)
    }

    // Case 2: Already marked -- skip
    if (flags < 0)   // bit 7 set = signed negative
        return 1;

    // Case 3: Type with class/struct/union definition -- mark definition too
    // (unsigned compare: true only for type kinds 9..11)
    if (entry_kind == 6 && (unsigned char)(*(entry_ptr + 132) - 9) <= 2) {
        if (is_local || is_imported || !has_name || has_definition)
            set_keep_definition_on_type(entry_ptr);  // sub_6111C0
    }

    // Set the keep_in_il bit
    *(entry_ptr - 8) |= 0x80;

    // Debug output
    if (trace_active && trace_filter("needed_flags", entry_ptr, entry_kind)) {
        switch (entry_kind) {
            case 6:  fprintf(s, "Setting keep_in_il on type ");  break;
            case 7:  fprintf(s, "Setting keep_in_il on var  ");  break;
            case 11: fprintf(s, "Setting keep_in_il on rout ");  break;
            case 28: fprintf(s, "Setting keep_in_il on namespace "); break;
        }
    }

    // Case 4: Variable/routine in non-guard mode -- check class membership
    if (!dword_106B640) {
        if (!(*(entry_ptr + 82) & 0x10)) {
            canonical = lookup_canonical(entry_ptr, entry_kind);
            // Assert canonical exists (il_walk.c:1885)
            if (*(canonical + 81) & 0x04) {   // is class member
                class_type = **(canonical + 40 + 32);
                walk_tree_and_set_keep_in_il(class_type, 6);
                set_keep_definition_on_type(class_type);
            }
        }
        return 1;
    }

    // Handle canonical entry in secondary IL (CUDA mode)
    canonical = lookup_canonical(entry_ptr, entry_kind);
    if (dword_126EE48 && canonical) {
        assoc = *(canonical + 32);
        if (assoc) {
            target = *assoc;
            if (target != entry_ptr && (*(target - 8) & 0x02))
                mark_to_keep_in_il(target, entry_kind);
        }
    }

    return 0;  // continue walking into this entry's children
}

The callback's return value controls the walk: returning 1 tells the walker to skip the subtree (entry already processed or delegated to canonical), returning 0 tells it to descend into children.

Secondary IL Handling

When cudafe++ processes multiple translation units (e.g., through #include chains that bring in separate compilation units), it maintains primary and secondary IL regions. The secondary IL flag (bit 1 of the flags byte) distinguishes them. The prune callback handles cross-region references by looking up the canonical (primary) version of each entry via sub_5B9EE0 and recursively marking that version instead. This ensures the device IL output contains the primary definitions, not secondary duplicates.

Keep-Definition Logic

For Types (sub_6111C0 / sub_611300)

When a class/struct/union type needs its definition kept (not just a forward declaration), set_keep_definition_on_type performs:

void set_keep_definition_on_type(entry) {
    // Debug: "Setting keep_definition_in_il on <type>"
    *(entry + 162) |= 0x80;           // set keep_definition bit

    // If already marked keep_in_il, clear and re-walk
    // (definition requires deeper traversal than reference)
    if (*(entry - 8) & 0x80) {
        *(entry - 8) &= ~0x80;        // clear keep_in_il
        mark_to_keep_in_il(entry, 6);  // re-walk with full traversal
    }

    // For class/struct: also clear/re-walk the associated scope
    if (entry_kind is class/struct/union) {
        scope = entry->associated_scope;
        *(scope - 8) &= ~0x80;
        // Follow canonical type chain
    }
}

The clear-and-re-walk pattern is important: when an entity was initially marked via a shallow reference (e.g., a pointer to the class), only the type entry itself was marked. When the definition is later needed (e.g., the device code accesses a member), the keep bit is cleared and the walk restarts, this time descending into all members, base classes, and nested types.
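
The following is a simplified sketch of why the bit must be cleared before re-walking: a marking routine that stops at already-marked entries would otherwise never reach the members. The `type_entry` struct and its plain `int` fields are illustrative assumptions standing in for the real flag bits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified type entry; field names are illustrative. */
typedef struct type_entry {
    int keep_in_il;                 /* stands in for bit 7 of the flags byte */
    int keep_definition;            /* stands in for byte +162, bit 7 */
    struct type_entry *members[4];  /* member types, NULL-terminated */
} type_entry;

static void mark(type_entry *t) {
    if (t == NULL || t->keep_in_il)
        return;                     /* already marked: the walk stops here */
    t->keep_in_il = 1;
    if (!t->keep_definition)
        return;                     /* shallow reference: mark the entry only */
    for (size_t i = 0; i < 4 && t->members[i] != NULL; i++)
        mark(t->members[i]);        /* definition kept: descend into members */
}

static void set_keep_definition(type_entry *t) {
    assert(t != NULL);
    t->keep_definition = 1;
    if (t->keep_in_il) {
        t->keep_in_il = 0;          /* clear so mark() does not short-circuit */
        mark(t);                    /* re-walk, this time into the members */
    }
}
```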

For Routines (sub_6113F0 / sub_6181E0)

void set_keep_definition_on_routine(entry) {
    // Debug: "Setting keep_definition_in_il on rout <name>"
    *(entry + 187) |= 0x04;          // set keep_definition bit

    // If template specialization: also mark the primary template
    if (*(entry + 177) & 0x20) {
        primary = lookup_primary_template(entry);  // sub_5BBCC0
        mark_to_keep_in_il(primary, 11);
    }

    // Special member handling (kinds 1 and 2: constructor / destructor)
    if (special_member_kind == 1 || special_member_kind == 2) {
        // Recurse on associated class type's ctor/dtor
    }
}

Scope-Level Routine Walk: sub_6175F0

In C++ mode (dword_126EFB4 == 2), after the main mark pass, mark_to_keep_in_il calls sub_6175F0 on the file scope. This function performs an additional sweep through all scope hierarchies to ensure routine definitions are correctly retained:

For each class/struct scope with keep_in_il set: recurse into the class scope
For each namespace (non-alias): recurse into the namespace scope
For routines in class scopes with external linkage: if marked but not keep_definition, call set_keep_definition_on_routine

This handles the case where a class method is referenced by device code through a virtual call or template instantiation, requiring the full function body to be available in the device IL.

Elimination Phase

After the mark phase completes, three functions strip unmarked entities from the IL.

clear_instantiation_required_on_unneeded_entities (sub_5CCA40)

Runs in pass 4 of fe_wrapup, C++ mode only. Prevents unnecessary template instantiations from being triggered during IL output.

The function recursively walks the scope tree and for each routine with template instantiation flags, checks whether the instantiation is still needed:

void clear_instantiation_required_on_unneeded_entities(scope) {
    assert(dword_126EFB4 == 2);  // C++ only

    // Recurse into child scopes (skip namespace aliases)
    for (child = scope->nested_namespaces; child; child = child->next) {
        if (!(child->flags & 0x01))    // not an alias
            clear_instantiation_required_on_unneeded_entities(child->associated_scope);
    }

    // Recurse into class scopes
    for (type = scope->types_list; type; type = type->next) {
        if (is_class_struct_union(type) && !is_anonymous(type))
            clear_instantiation_required_on_unneeded_entities(type->type_extra->scope_entry);
    }

    // Clear instantiation_required on unneeded routines
    for (rout = scope->routines_list; rout; rout = rout->next) {
        if (!(rout->flags_80 & 0x08)           // not suppressed
            && !(rout->flags_179 & 0x10)        // not already cleared
            && ((rout->flags_179 & 6) == 2 || (dword_126E204 && rout->flags_176 < 0))
            && rout->source_corresp != NULL
            && !(rout->flags_176 & 0x02))       // not locally defined
        {
            clear_instantiation_required(rout->name, 0, 2);  // sub_78A380
        }
    }

    // For non-file scopes: also clear on variable templates
    if (scope->scope_kind != 0) {
        for (var = scope->variables_list; var; var = var->next) {
            if (!(var->flags_80 & 0x08)
                && !(var->flags_162 & 0x40)
                && (var->flags_162 & 0xB0) == 0x10
                && var->name != NULL)
            {
                clear_instantiation_required(var->name, 0, 2);
            }
        }
    }
}

eliminate_bodies_of_unneeded_functions (sub_5CC410)

Walks the IL table (qword_126EB98) and removes function bodies for routines that were not marked with keep_definition_in_il:

void eliminate_bodies_of_unneeded_functions() {
    for (idx = 1; idx <= dword_126EC78; idx++) {
        scope = qword_126EC88[idx];
        if (!scope) continue;
        if (scope not in current TU) continue;
        if (scope_kind != 17) continue;     // 17 = function body

        routine = scope->owning_entity;
        if (routine->keep_definition_in_il)  // byte+187 & 0x04
            continue;
        if (!(routine->flags_29 & 0x01))
            continue;

        remove_function_body(routine);       // sub_5CAB40
    }
}

eliminate_unneeded_il_entries (sub_5CCBF0)

The main elimination pass. Walks the scope tree and removes all unmarked entities from the IL linked lists.

void eliminate_unneeded_il_entries(scope) {
    emit_info = get_emit_info(scope);       // sub_703C30
    assert(emit_info != NULL);              // il.c:29598

    // Recurse into child scopes (skip namespace aliases)
    for (child = scope->nested_namespaces; child; child = child->next) {
        if (!(child->flags & 0x01))
            eliminate_unneeded_il_entries(child->associated_scope);
    }

    // --- Eliminate variables ---
    prev = NULL;
    for (var = scope->variables_list; var; var = next) {
        next = var->next_in_list;           // offset +104
        if (*(signed char*)(var - 8) < 0) { // keep_in_il set
            prev = var;                     // keep in list
        } else {
            // Unlink from list
            if (prev) prev->next_in_list = next;
            else scope->variables_list = next;
            var->next_in_list = NULL;
            // C++ mode: walk expression trees to clear hidden names
            if (cpp_mode) {
                walk_tree(var->expr_tree, clear_hidden_name_cb, 147);
                walk_tree(var->alt_tree, clear_hidden_name_cb, 147);
            }
        }
    }
    emit_info[5] = prev;   // last kept variable

    // File scope: also clean orphaned scope list
    if (scope->scope_kind == 0)
        eliminate_unneeded_scope_orphaned_list_entries();  // sub_5CC570

    // --- Eliminate routines (same pattern as variables) ---
    prev = NULL;
    for (rout = scope->routines_list; rout; rout = next) {
        next = rout->next_in_list;
        if (*(signed char*)(rout - 8) < 0) {
            prev = rout;
        } else {
            // Unlink + clear hidden names
        }
    }
    emit_info[6] = prev;

    // Clear global variable reference if unmarked
    if (qword_126EB70 && *(signed char*)(qword_126EB70 - 8) >= 0)
        qword_126EB70 = NULL;

    // --- Eliminate types ---
    prev = NULL;
    for (type = scope->types_list; type; type = next) {
        next = type->next_in_list;

        // Follow typedef chains to find the real type for sign-bit check
        real = type;
        if (real->type_kind == 12 && !real->name) {  // anonymous typedef
            do { real = real->base_type; }
            while (real->type_kind == 12 && !real->name);
        }

        if (*(signed char*)(real - 8) < 0) {   // marked
            prev = type;
            if (is_class_struct_union(type))
                eliminate_unneeded_class_definitions(type);  // sub_5CC1B0
        } else {
            // Unlink + process eliminated class members
            if (is_class_struct_union(type) && cpp_mode)
                process_members_of_eliminated_class(type);
            type->base_type = NULL;
            clear_type_extra_member_lists(type->type_extra);
            type->type_extra->flags |= 0x20;  // mark as eliminated
        }
    }
    emit_info[4] = prev;

    // --- Eliminate hidden names ---
    // (same sign-bit check, unlink unmarked entries from scope->hidden_names)

    // File scope: emit orphaned scopes
    if (scope->scope_kind == 0)
        emit_orphaned_scopes(scope);        // sub_718720

    // Clean external declarations list
    for (ext = qword_126EBE0; ext; ext = next) {
        next = ext->next;
        if (*(signed char*)(ext - 8) >= 0) {
            // Unlink from external declarations list
        }
    }
}
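
The unlink pattern shared by the variable, routine, and type loops above can be sketched in isolation. The `il_node` struct is hypothetical, with a plain `kept` field standing in for the sign-bit test at entry - 8; the returned pointer corresponds to the "last kept" value stored into the emit_info slots.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; `kept` stands in for the sign bit at entry - 8. */
typedef struct il_node {
    int kept;
    struct il_node *next;
} il_node;

/* Unlinks every unmarked node from *head and returns the last kept node
   (NULL if nothing survives). */
static il_node *prune_list(il_node **head) {
    il_node *prev = NULL, *next;
    assert(head != NULL);
    for (il_node *n = *head; n != NULL; n = next) {
        next = n->next;             /* save the successor before unlinking */
        if (n->kept) {
            prev = n;               /* stays in the list */
        } else {
            if (prev) prev->next = next;
            else      *head = next; /* removing the current head */
            n->next = NULL;
        }
    }
    return prev;
}
```

Saving `next` before touching the node is what lets the loop clear the removed node's link without losing its place in the list.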

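The debug output for eliminated vs. retained entities relies on a pointer-arithmetic trick: "TARG_VERT_TAB_CHAR" + 17 points at the last character of that 18-character literal, so it evaluates to the one-character string "R". The message prefix is presumably chosen between that "R" and a "Not r" alternative, so the output reads either "Removing variable <name>" (eliminated) or "Not removing variable <name>" (retained).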
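
A small sketch makes the pointer arithmetic concrete. The `"%semoving variable %s"` format string and the `format_elim_trace` helper are assumptions made for illustration; only the two resulting messages and the `"TARG_VERT_TAB_CHAR" + 17` expression come from the binary.

```c
#include <assert.h>
#include <stdio.h>

/* Builds the trace line into buf and returns the snprintf result.
   The format string is an illustrative assumption. */
static int format_elim_trace(char *buf, size_t n, int removed, const char *name) {
    /* &literal[17] is the same address as literal + 17: the final 'R' of the
       18-character literal, followed by its terminating NUL, i.e. "R". */
    const char *capital_r = &"TARG_VERT_TAB_CHAR"[17];
    assert(buf != NULL && name != NULL);
    return snprintf(buf, n, "%semoving variable %s",
                    removed ? capital_r : "Not r", name);
}
```

Calling `format_elim_trace(buf, sizeof buf, 1, "x")` produces the "Removing ..." form, and passing 0 produces the "Not removing ..." form.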

Global State

Address	Name	Description
`dword_106B640`	`keep_in_il_walk_active`	Assertion guard; 1 during pass 3, 0 otherwise
`dword_126EFB4`	`cpp_mode`	2 = C++ mode (enables class/template processing)
`dword_126EFC8`	`trace_active`	Nonzero enables diagnostic output
`dword_126EFCC`	`trace_verbosity`	Higher = more output (>2 prints elimination details)
`dword_126EE48`	`cuda_mode`	Nonzero enables CUDA-specific canonical entry handling
`dword_126E204`	`template_compat_flag`	Affects template instantiation clearing criteria
`qword_126EBA0`	`orphaned_scope_list`	File-scope orphaned scopes (kind 55 list)
`qword_126EB70`	`global_variable_ref`	Cleared if its entry is unmarked
`qword_126EBE0`	`external_decl_list`	External declarations; unmarked ones removed
`qword_126EB98`	`il_table`	Array of IL scope pointers, indexed by il_table_index
`qword_126FB78`	`pre_walk_check`	Callback slot; set to `prune_keep_in_il_walk` during mark
`qword_126FB88`	`entry_callback`	Callback slot; NULL during mark phase
`dword_126FB58`	`is_secondary_il`	Walk state: 1 if currently in secondary IL region
`dword_126FB5C`	`is_file_scope_walk`	Walk state: 1 during file-scope walk
`dword_106B644`	`current_il_region`	Walk state: toggles per IL region
`dword_126FB60`	`walk_mode_flags`	Walk state: 0 during keep-in-il walk

Diagnostic Strings

String	Source	Condition
`"Beginning file scope keep_in_il walk"`	`sub_610420`	`trace_active && trace_category("needed_flags")`
`"Ending file scope keep_in_il walk"`	`sub_610420`	Same
`"Setting keep_in_il on type "`	`sub_617310`	`trace_active && trace_filter("needed_flags", entry, 6)`
`"Setting keep_in_il on var "`	`sub_617310`	Same, kind 7
`"Setting keep_in_il on rout "`	`sub_617310`	Same, kind 11
`"Setting keep_in_il on namespace "`	`sub_617310`	Same, kind 28
`"Setting keep_definition_in_il on "`	`sub_6111C0`	Trace active
`"Setting keep_definition_in_il on rout "`	`sub_6113F0`	Trace active
`"Removing variable <name>"`	`sub_5CCBF0`	`trace_verbosity > 2` or `trace_filter("dump_elim")`
`"Not removing variable <name>"`	`sub_5CCBF0`	Same (for kept entries)
`"Removing routine <name>"`	`sub_5CCBF0`	Same
`"Removing <type>"`	`sub_5CCBF0`	Same
`"eliminate_unneeded_il_entries"`	`sub_5CCBF0`	`trace_active` (level 3 trace enter/exit)

Function Map

Address	Identity	Confidence	Lines	EDG Source
`sub_610420`	`mark_to_keep_in_il`	99%	892	`il_walk.c:1959`
`sub_6115E0`	`walk_tree_and_set_keep_in_il`	98%	4649	`il_walk.c`
`sub_617310`	`prune_keep_in_il_walk`	99%	127	`il_walk.c:1885`
`sub_6111C0`	`set_keep_definition_on_type`	95%	63	`il_walk.c`
`sub_611300`	`set_keep_definition_on_type_simple`	92%	48	`il_walk.c`
`sub_6113F0`	`set_keep_definition_on_routine`	95%	81	`il_walk.c`
`sub_6181E0`	`set_keep_definition_on_routine_unconditional`	90%	69	`il_walk.c`
`sub_6170C0`	`process_using_decl_list`	92%	154	`il_walk.c`
`sub_6175F0`	`walk_scope_and_mark_routine_definitions`	90%	634	`il_walk.c`
`sub_616EE0`	`mark_virtual_function_types_to_keep`	85%	88	`il_walk.c`
`sub_618370`	`walk_and_set_keep_in_il_helper`	80%	119	`il_walk.c`
`sub_618660`	`walk_entry_and_set_keep_in_il_routine_scope`	88%	3728	`il_walk.c`
`sub_5CCBF0`	`eliminate_unneeded_il_entries`	100%	345	`il.c:29598`
`sub_5CCA40`	`clear_instantiation_required_on_unneeded_entities`	100%	86	`il.c:29450`
`sub_5CC410`	`eliminate_bodies_of_unneeded_functions`	100%	~200	`il.c:29231`
`sub_5CC1B0`	`eliminate_unneeded_class_definitions`	100%	~200	`il.c`
`sub_5CC570`	`eliminate_unneeded_scope_orphaned_list_entries`	100%	~200	`il.c:29398`
`sub_5CB920`	`process_members_of_eliminated_class_definition`	100%	~300	`il.c:29097`
`sub_5B9EE0`	`lookup_canonical_entry`	--	--	`il_walk.c`
`sub_78A380`	`clear_instantiation_required`	--	--	`template.c`

Cross-References

Pipeline Overview -- overall compilation flow
IL Overview -- entry kinds, header, regions
IL Tree Walking -- generic walker with 5 callbacks
Device/Host Separation -- higher-level splitting strategy
Execution Spaces -- how entities get device/host attributes

IL Display

The IL display subsystem produces a human-readable text dump of the entire Intermediate Language graph. It is compiled from EDG's il_to_str.c (source path /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/il_to_str.c, confirmed by an assertion at line 6175 in form_float_constant). The display code occupies address range 0x5EB290--0x603A00 in the binary (roughly 98 KB), with the main dispatch functions at 0x5EC600--0x5F7FD0 and formatting helpers continuing through 0x6039E0.

Activation is via the il_display CLI flag (flag index 10 in the boolean flag table), which triggers display_il_file after the frontend completes parsing. Output goes to stdout through a replaceable callback slot (qword_126F980), which defaults to sub_5EB290, a wrapper around fputs(s, stdout). When active, every IL entry in every memory region is printed with labeled fields, 25-column-aligned formatting, and scope/address annotations.
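
The callback indirection can be sketched as a function-pointer slot with a stdout default. All names below (`il_output`, `il_puts`, `write_capture`) are hypothetical; the binary only exposes the slot as qword_126F980, and the capture sink is an invented example of what redirecting the slot could look like.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names; the binary only exposes the slot as qword_126F980. */
typedef int (*il_output_fn)(const char *s);

/* Default sink, matching the documented fputs(s, stdout) behaviour. */
static int write_stdout(const char *s) { return fputs(s, stdout); }

static il_output_fn il_output = write_stdout;

/* Display code would send all text through the slot. */
static void il_puts(const char *s) {
    assert(il_output != NULL);
    il_output(s);
}

/* Alternative sink for illustration: append to an in-memory buffer,
   silently truncating once the buffer is full. */
static char capture_buf[256];
static int write_capture(const char *s) {
    size_t used = strlen(capture_buf);
    size_t room = sizeof capture_buf - used - 1;
    strncat(capture_buf, s, room);
    return 0;
}
```

Assigning `il_output = write_capture;` before calling `il_puts` redirects the text into `capture_buf` without changing any display code.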

Key Facts

Property	Value
Source file	`il_to_str.c` (EDG 6.6)
Address range	`0x5EB290`--`0x6039E0`
Top-level entry point	`sub_5F7DF0` (`display_il_file`), 56 lines
Header + file-scope	`sub_5F76B0` (`display_il_header`), 174 lines
Main dispatcher	`sub_5F4930` (`display_il_entry`), 1,686 lines
Single-entity display	`sub_5F7D50` (`display_single_entity`), 38 lines
CLI flag	`il_display` (index 10, boolean)
Output callback	`qword_126F980` (function pointer, default `sub_5EB290` = `fputs(s, stdout)`)
Display-active flag	`byte_126FA16` (set to 1 during display)
Scope context flag	`dword_126FA30` (1 = file-scope region, 0 = function-scope)
Entry kind name table	`off_E6DD80` (~84 entries, indexed by entry kind byte)

Top-Level Control Flow

display_il_file (sub_5F7DF0)
│
│  printf("Display of IL file \"%s\", produced by the compilation of \"%s\"\n",
│         il_file_name, source_file_name)
│
├── display_il_header (sub_5F76B0)
│   │  dword_126FA30 = 1                              // file-scope mode
│   │  puts("\n\nIntermediate language for memory region 1 (file scope):")
│   │  puts("\nil_header:")
│   │  ... 30+ header fields ...
│   │
│   └── walk_file_scope_il(display_il_entry, ...)      // sub_60E4F0
│       └── display_il_entry (sub_5F4930)              // callback per entity
│
└── for region = 2 .. dword_126EC80:
        dword_126FA30 = 0                              // function-scope mode
        // lookup function name from scope table
        printf("\n\nIntermediate language for memory region %ld (function \"%s\"):\n",
               region, function_name)
        walk_routine_scope_il(region, display_il_entry, ...)   // sub_610200
        └── display_il_entry (sub_5F4930)              // callback per entity

Memory region 1 is always file-scope (global declarations, types, templates). Regions 2+ correspond to individual function bodies. The scope table at qword_126EB90 maps each region index to its owning scope entry; the display code checks scope.kind == 17 (sck_function) and extracts the routine name for the banner.

IL Header Fields

display_il_header (sub_5F76B0) prints the translation-unit-level metadata stored in BSS at 0x126EB60--0x126EBF8:

Field	Type	Notes
`primary_source_file`	IL pointer	Source file entry for the main `.cu` file
`primary_scope`	IL pointer	File-scope scope entry
`main_routine`	IL pointer	`main()` routine entry, if present
`compiler_version`	string	EDG compiler version string
`time_of_compilation`	string	Build timestamp
`plain_chars_are_signed`	bool	Default `char` signedness
`source_language`	enum	`sl_Cplusplus` (0) or `sl_C` (1), from `dword_126EBA8`
`std_version`	integer	C/C++ standard version (e.g., 201703 for C++17), from `dword_126EBAC`
`pcc_compatibility_mode`	bool	PCC compatibility
`enum_type_is_integral`	bool	Whether enum underlying type is integral
`default_max_member_alignment`	integer	Default structure packing alignment
`gcc_mode`	bool	GCC compatibility mode
`gpp_mode`	bool	G++ compatibility mode
`gnu_version`	integer	GNU compatibility version number
`short_enums`	bool	`-fshort-enums` behavior
`default_nocommon`	bool	Default `-fno-common`
`UCN_identifiers_used`	bool	Universal character names in identifiers
`vla_used`	bool	Variable-length arrays present
`any_templates_seen`	bool	Whether any templates were parsed
`prototype_instantiations_in_il`	bool	Template prototype instantiations included
`il_has_all_prototype_instantiations`	bool	All prototypes included (ALL_TEMPLATE_INFO_IN_IL=1)
`il_has_C_semantics`	bool	C-language semantics active
`nontag_types_used_in_exception_or_rtti`	bool	Non-tag types in EH/RTTI
`seq_number_lookup_entries`	integer	Count of source sequence entries
`target_configuration_index`	integer	Target configuration selector

After printing the header, display_il_header calls walk_file_scope_il (sub_60E4F0) with display_il_entry as the per-entity callback. This iterates every IL entry in file-scope region 1.

The Main Dispatcher: display_il_entry

display_il_entry (sub_5F4930, 1,686 lines) is the central per-entity display function. It receives an entry pointer and an entry kind byte, and dispatches to the appropriate per-kind display function.

Transparent (Inline) Kinds

The first switch handles kinds that are displayed inline by their parent and should not appear as standalone entries. These return immediately without output:

Transparent kinds (early return):
  4   routine_type_supplement      15  (reserved)
  5   routine_type_extra           19  try_supplement
  14  (reserved)                   20  asm_supplement
  27  template_parameter_suppl     34  (reserved)
  37  (reserved)                   38  (reserved)
  46  (reserved)                   47  (reserved)
  48  (reserved)                   75  (reserved)
  76  (reserved)                   77  (reserved)
  78  (reserved)                   81  (reserved)
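
The early-return filter amounts to a membership test over the kinds listed above. A minimal sketch, with a hypothetical function name, might look like this:

```c
#include <assert.h>

/* Returns 1 for entry kinds the dispatcher skips with an early return. */
static int is_transparent_kind(unsigned kind) {
    switch (kind) {
    case 4:  case 5:    /* routine_type_supplement, routine_type_extra */
    case 14: case 15:   /* reserved */
    case 19: case 20:   /* try_supplement, asm_supplement */
    case 27:            /* template_parameter_suppl */
    case 34: case 37: case 38:
    case 46: case 47: case 48:
    case 75: case 76: case 77: case 78:
    case 81:            /* reserved */
        return 1;
    default:
        return 0;
    }
}
```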

Entry Header Line

For non-transparent kinds, the dispatcher prints a scope-annotated header:

file-scope type@7f3a4b200100
func-scope variable@7f3a4b300200

The scope prefix comes from dword_126FA30 (1 = "file-scope", 0 = "func-scope"). The kind name is looked up from off_E6DD80[kind_byte]. The address is the raw entry pointer value. For entries in function-scope regions while dword_126FA30 == 1, a warning "**NON FILE SCOPE PTR**" is emitted.
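
A sketch of how such a header line could be assembled is shown below. The `"%s %s@%llx"` format is an assumption that merely reproduces the two sample lines; the actual format string in the binary is not shown here, and `format_entry_header` is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>

/* Formats "<scope> <kind>@<address>" and returns the snprintf result.
   The format string is an illustrative assumption. */
static int format_entry_header(char *buf, size_t n, int file_scope_mode,
                               const char *kind_name, unsigned long long addr) {
    assert(buf != NULL && kind_name != NULL);
    return snprintf(buf, n, "%s %s@%llx",
                    file_scope_mode ? "file-scope" : "func-scope",
                    kind_name, addr);
}
```

Here `file_scope_mode` plays the role of dword_126FA30 and `kind_name` the role of the off_E6DD80 lookup result.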

Dispatch Table

The second switch dispatches to specialized display functions:

Kind	Hex	Name	Display function	Lines
1	0x01	`source_file_entry`	inline in dispatcher	~40
2	0x02	`constant`	`sub_5F2720` (`display_constant`)	605
3	0x03	`param_type`	inline in dispatcher	~30
6	0x06	`type`	`sub_5F06B0` (`display_type`)	1,033
7	0x07	`variable`	`sub_5EE500` (`display_variable`)	614
8	0x08	`field`	inline in dispatcher	~80
9	0x09	`exception_specification`	inline	~20
10	0x0A	`exception_spec_type`	inline	~10
11	0x0B	`routine`	`sub_5EF1A0` (`display_routine`)	1,160
12	0x0C	`label`	inline	~30
13	0x0D	`expr_node`	`sub_5ECFE0` (`display_expr_node`)	534
16	0x10	`switch_case_entry`	inline	~15
17	0x11	`switch_info`	inline	~10
18	0x12	`handler`	inline	~15
21	0x15	`statement`	`sub_5EC600` (`display_statement`)	328
22	0x16	`object_lifetime`	inline	~20
23	0x17	`scope`	`sub_5F2140` (`display_scope`)	177
28	0x1C	`namespace`	inline	~20
29	0x1D	`using_declaration`	inline	~20
30	0x1E	`dynamic_init`	`sub_5F37F0` (`display_dynamic_init`)	248
31	0x1F	`local_static_variable_init`	inline	~15
32	0x20	`vla_dimension`	inline	~10
33	0x21	`overriding_virtual_func`	inline	~15
35	0x23	`derivation_path`	inline	~10
36	0x24	`base_class`	inline	~25
39	0x27	`class_info`	`sub_5F4030` (`display_class_supplement`)	366
41	0x29	`constructor_init`	inline	~15
42	0x2A	`asm_entry`	inline	~25
43	0x2B	`asm_operand`	inline	~15
44	0x2C	`asm_clobber`	inline	~10
50	0x32	`source_sequence_entry`	inline	~15
51	0x33	`full_entity_decl_info`	inline	~15
52	0x34	`instantiation_directive`	inline	~10
53	0x35	`src_seq_sublist`	inline	~10
54	0x36	`explicit_instantiation_decl`	inline	~10
55	0x37	`orphaned_entities`	inline	~10
56	0x38	`hidden_name`	inline	~10
57	0x39	`pragma`	inline	~20
58	0x3A	`template`	inline	~20
59	0x3B	`template_decl`	inline	~15
60	0x3C	`requires_clause`	inline	~10
61	0x3D	`template_param`	inline	~15
62	0x3E	`name_reference`	`sub_5EBC60` (`display_name_reference`)	84
63	0x3F	`name_qualifier`	inline	~15
64	0x40	`seq_number_lookup`	inline	~10
65	0x41	`local_expr_node_ref`	inline	~10
66	0x42	`static_assert`	inline	~10
67	0x43	`linkage_spec`	inline	~10
68	0x44	`scope_ref`	inline	~10
70	0x46	`lambda`	inline	~15
71	0x47	`lambda_capture`	inline	~15
72	0x48	`attribute`	inline	~20
73	0x49	`attribute_argument`	inline	~10
74	0x4A	`attribute_group`	inline	~10
79	0x4F	`template_info`	inline	~15
80	0x50	`subobject_path`	inline	~10
82	0x52	`module_info`	inline	~10
83	0x53	`module_decl`	inline	~10

Per-Kind Display Functions

source_file_entry (Kind 1)

Displayed inline in the dispatcher. Fields:

Field	Type	Notes
`file_name`	string	Short file name
`full_name`	string	Full path
`name_as_written`	string	As-written in `#include`
`first_seq_number`	integer	First source sequence number in this file
`last_seq_number`	integer	Last source sequence number
`first_line_number`	integer	First line number
`child_files`	IL pointer list	Included files
`is_implicit_include`	bool	Implicitly included
`is_include_file`	bool	Is an `#include`d file (not the primary TU)
`top_level_file`	bool	Top-level compilation unit

source_corresp (Shared Prefix)

All named entities (variable, routine, type, field, label, namespace, template_param) share a source_corresp sub-record, printed by display_source_corresp (sub_5EDF40, 170 lines). This is the first thing displayed for each such entity:

source_corresp:
  name:                    foo
  unmangled_name_or_mangled_encoding: _Z3foov
  decl_position.seq:       42
  decl_position.column:    5
  name_references:         name_reference@7f3a...
  is_class_member:         TRUE
  access:                  public
  parent_scope:            file-scope scope@7f3a...
  enclosing_routine:       NULL
  referenced:              TRUE
  needed:                  TRUE
  name_linkage:            external

Fields displayed by display_source_corresp:

Field	Type	Lookup table
`name`	string	Direct string
`unmangled_name_or_mangled_encoding`	string	Direct string
`decl_position`	position	seq + column sub-fields
`name_references`	IL pointer	name_reference entry
`is_class_member`	bool	--
`access`	enum	`off_A6F760` (4 entries: public/protected/private/none)
`parent_scope`	IL pointer	Scope entry
`enclosing_routine`	IL pointer	Routine entry
`referenced`	bool	--
`needed`	bool	--
`is_local_to_function`	bool	--
`parent_via_local_scope_ref`	IL pointer	--
`name_linkage`	enum	`off_E6E040` (none/internal/external/C/C++)
`has_associated_pragma`	bool	--
`is_decl_after_first_in_comma_list`	bool	--
`copied_from_secondary_trans_unit`	bool	--
`same_name_as_external_entity_in_secondary_trans_unit`	bool	--
`member_of_unknown_base`	bool	--
`qualified_unknown_base_member`	bool	--
`marked_as_gnu_extension`	bool	--
`is_deprecated_or_unavailable`	bool	--
`externalized`	bool	--
`maybe_unused`	bool	`[[maybe_unused]]` attribute
`source_sequence_entry`	IL pointer	--
`attributes`	IL pointer	Attribute list
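
The column alignment visible in the sample output can be approximated with a padded label. This sketch assumes, based only on the sample lines, that the label plus colon is left-justified in 24 columns and followed by one space, so that values start at column 25 and over-long labels still get a single separator. `format_field` is a hypothetical helper, not a function recovered from the binary.

```c
#include <assert.h>
#include <stdio.h>

/* Formats one "label: value" field (without the leading indent) and
   returns the snprintf result. The layout rule is inferred from the
   sample output, not from the binary's format strings. */
static int format_field(char *buf, size_t n, const char *label, const char *value) {
    char key[64];
    assert(buf != NULL && label != NULL && value != NULL);
    snprintf(key, sizeof key, "%s:", label);       /* e.g. "name:" */
    return snprintf(buf, n, "%-24s %s", key, value);
}
```

With this rule, `name` is padded so that `foo` begins at column 25, while `unmangled_name_or_mangled_encoding` overflows the pad and is followed by exactly one space, matching the sample.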

type (Kind 6)

display_type (sub_5F06B0, 1,033 lines) handles all 22 type kinds. After calling display_source_corresp, it prints common type fields then switches on the type kind byte at offset +132:

Common type fields:

Field	Type / Lookup table
`next`	IL pointer
`based_types`	Linked list, kind from `off_A6F420` (6 entries)
`size`	Integer
`alignment`	Integer
`incomplete`	bool
`used_in_exception_or_rtti`	bool
`declared_in_function_prototype`	bool
`alignment_set_explicitly`	bool
`variables_are_implicitly_referenced`	bool
`may_alias`	bool
`autonomous_primary_tag_decl`	bool
`is_builtin_va_list`	bool
`is_builtin_va_list_from_cstdarg`	bool
`has_gnu_abi_tag_attribute`	bool
`in_gnu_abi_tag_namespace`	bool
`type_kind`	Enum from `off_A6FE40` (22 entries)

Type kind switch (offset +132):

Kind	Name	Key sub-fields
2	integer	`int_kind` (via `sub_5F9110`), `explicitly_signed`, `wchar_t_type`, `char8_t_type`, `char16_t_type`, `char32_t_type`, `bool_type`; for enums: `is_scoped_enum`, `packed`, `originally_unnamed`, `is_template_enum`, `ELF_visibility`, `base_type`, `assoc_template`
3/4/5	float/double/ldouble	`float_kind` (via `sub_5F93D0`)
6	pointer	`type_pointed_to`, `is_reference`, `is_rvalue_reference`
7	function	`return_type`, `param_type_list`, `assoc_routine`, `has_ellipsis`, `prototyped`, `trailing_return_type`, `value_returned_by_cctor`, `does_not_return`, `result_should_be_used`, `is_const`, `explicit_calling_convention`, `calling_convention` (from `off_E6CDA0`), `this_class`, `qualifiers`, `ref_qualifiers`, `prototype_scope`, `exception_specification`
8	array	`element_type`, `qualifiers`, `is_static`, `is_variable_size_array`, `is_vla`, `element_count`, `bound_constant`
9/10/11	class/struct/union	`field_list`, `extra_info` (class supplement via `sub_5F4030`), `final`, `abstract`, `any_virtual_base_classes`, `any_virtual_functions`, `originally_unnamed`, `is_template_class`, `is_specialized`, `is_empty_class`, `is_packed`, `max_member_alignment`
12	typeref	`typeref_type`, `template_arg_list`, `assoc_template`, `typeref_kind` (from `off_A6F640`, 28 entries), `qualifiers`, `predeclared`, `has_variably_modified_type`, `is_nonreal`
13	member pointer	`class_of_which_a_member`, `type`
14	template param	`kind` (tptk_param/tptk_member/tptk_unknown), `is_pack`, `is_generic_param`, `is_auto_param`, `class_type`, `coordinates`
15	vector	`element_type`, `size_constant`, `is_boolean_vector`, `vector_kind`
16	tuple	`element_type`, `tuple_elements`

variable (Kind 7)

display_variable (sub_5EE500, 614 lines) is one of the most field-heavy display functions. After display_source_corresp, it prints:

Field	Lookup table / Notes
`next`	IL pointer
`type`	IL pointer
`storage_class`	`off_A6FE00` (7 entries: none/auto/register/static/extern/mutable/thread_local)
`declared_storage_class`	Same table
`asm_name` or `reg`	`off_A6F480` (53 register kind entries)
`alignment`	Integer
`ELF_visibility`	`off_A6F720` (5 entries)
`init_priority`	Integer
`cleanup_routine`	IL pointer
`container` / `bindings`	Selected by bits at offset +162
`section`	String (ELF section name)
`aliased_variable`	IL pointer
`declared_type`	IL pointer
`template_info`	IL pointer

CUDA-specific variable fields:

Field	Notes
`shared`	`__shared__` memory space
`constant`	`__constant__` memory space
`device`	`__device__` memory space

Boolean flags (approximately 50 flags spanning bytes 144--208):

is_weak, is_weakref, is_gnu_alias, has_gnu_used_attribute, has_gnu_abi_tag_attribute, is_not_common, is_common, has_internal_linkage_attribute, asm_name_is_valid, used, address_taken, is_parameter, is_parameter_pack, is_pack_element, is_enhanced_for_iterator, initializer_in_class, constant_valued, is_thread_local, extends_lifetime, is_template_param_object, compiler_generated, is_in_class_specialization, is_handler_param, is_this_parameter, referenced_non_locally, modified_within_try_block, is_template_variable, is_prototype_instantiation, is_nonreal, is_specialized, specialized_with_old_syntax, explicit_instantiation, class_explicitly_instantiated, explicit_do_not_instantiate, param_value_has_been_changed, param_used_more_than_once, is_anonymous_parent_object, is_member_constant, is_constexpr, declared_constinit, is_inline, suppress_inline_definition, superseded_external, has_variably_modified_type, is_vla, is_compound_literal, has_explicit_initializer, has_parenthesized_initializer, has_direct_braced_initializer, has_flexible_array_initializer, declared_with_auto_type_specifier, declared_with_decltype_auto, declared_with_class_template_placeholder

routine (Kind 11)

display_routine (sub_5EF1A0, ~1,160 lines) is the single largest per-kind display function. After display_source_corresp:

Field	Lookup table / Notes
`next`	IL pointer
`type`	IL pointer (function type)
`function_def_number`	Integer
`memory_region`	Integer (region index for function body)
`storage_class`	`off_A6FE00` (7 entries)
`declared_storage_class`	Same table
`special_kind`	`off_A6FC00` (13 entries: none/constructor/destructor/conversion/operator/lambda_call_operator/...)
`opname_kind`	`off_A6FC80` (47 entries)
`builtin_function_kind`	Integer
`ELF_visibility`	`off_A6F720`
`virtual_function_number`	Integer
`constexpr_intrinsic_number`	Integer
`section`	String
`aliased_routine`	IL pointer
`inline_partner`	IL pointer
`ctor_priority` / `dtor_priority`	Integer
`asm_name`	String
`declared_type`	IL pointer
`generating_using_decl`	IL pointer
`befriending_classes`	IL pointer list
`assoc_template`	IL pointer
`template_arg_list`	Via `display_template_arg_list`

CUDA-specific routine flags (byte 182):

Flag	Bit	Meaning
`nvvm_intrinsic`	bit 4	NVVM intrinsic function
`device`	bit 5	`__device__` execution space
`global`	bit 6	`__global__` execution space
`host`	bit 4 (byte 183)	`__host__` execution space
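
Decoding these bits from a raw routine entry could look like the sketch below. It assumes the offsets 182 and 183 are byte offsets from the start of the routine entry; the `exec_space` struct and the constant names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bit masks taken from the table above. */
enum {
    ROUT_NVVM_INTRINSIC = 1 << 4,   /* byte 182 */
    ROUT_DEVICE         = 1 << 5,   /* byte 182 */
    ROUT_GLOBAL         = 1 << 6,   /* byte 182 */
    ROUT_HOST           = 1 << 4    /* byte 183 */
};

typedef struct {
    int nvvm_intrinsic, device, global, host;
} exec_space;

/* `routine` points at the start of a routine entry; only bytes 182 and 183
   are read. Treating the offsets as byte offsets is an assumption. */
static exec_space decode_exec_space(const unsigned char *routine) {
    exec_space s;
    assert(routine != NULL);
    s.nvvm_intrinsic = (routine[182] & ROUT_NVVM_INTRINSIC) != 0;
    s.device         = (routine[182] & ROUT_DEVICE) != 0;
    s.global         = (routine[182] & ROUT_GLOBAL) != 0;
    s.host           = (routine[183] & ROUT_HOST) != 0;
    return s;
}
```

Under this encoding, a `__host__ __device__` function would have both bit 5 of byte 182 and bit 4 of byte 183 set.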

C99-specific fields (displayed when dword_126EBA8 == 1 and std_version > 199900):

fp_contract, fenv_access, cx_limited_range -- pragma state values from off_A6F460 (4 entries).

Boolean flags (approximately 70 flags spanning bytes 176--191):

address_taken, is_virtual, overrides_base_member, pure_virtual, final, override, covariant_return_virtual_override, is_inline, is_declared_constexpr, is_constexpr, is_constexpr_intrinsic, compiler_generated, defined, called, is_explicit_constructor, is_explicit_conversion_function, is_trivial_default_constructor, is_trivial_copy_function, is_trivial_destructor, is_initializer_list_ctor, is_delegating_ctor, is_inheriting_ctor, assignment_to_this_done, is_prototype_instantiation, is_template_function, is_specialized, specialized_with_old_syntax, explicit_instantiation, class_explicitly_instantiated, explicit_do_not_instantiate, has_nodiscard_attribute, never_throws, is_in_class_specialization, never_inline, is_pure, is_initialization_routine, is_finalization_routine, is_weak, is_weakref, is_gnu_alias, is_ifunc, has_gnu_used_attribute, has_gnu_abi_tag_attribute, in_gnu_abi_tag_namespace, allocates_memory, no_instrument_function, no_check_memory_usage, always_inline, gnu_c89_inline, implicit_alias, has_internal_linkage_attribute, contains_try_block, contains_local_class_type, superseded_external, defined_in_friend_decl, contains_statement_expression, inline_in_class_definition, is_lambda_body, is_defaulted, is_deleted, contains_local_static_variable, is_raw_literal_operator, is_tls_init_routine, has_deducible_return_type, has_deduced_return_type, contains_generic_lambda, is_coroutine, is_top_level_in_mem_region, friend_defined_in_instantiation, is_ineligible, definition_needed, defined_outside_of_parent, trailing_requires_clause

expr_node (Kind 13)

display_expr_node (sub_5ECFE0, 534 lines) handles 36 expression node kinds. Common expression fields are printed first:

Field	Notes
`type`	IL pointer (expression type)
`orig_lvalue_type`	IL pointer
`next`	IL pointer
`is_lvalue`	bool
`is_xvalue`	bool
`result_is_not_used`	bool
`is_pack_expansion`	bool
`is_parenthesized`	bool
`compiler_generated`	bool
`volatile_fetch`	bool
`do_not_interpret`	bool
`type_definition_needed`	bool

Expression kind switch (offset +24):

Kind	Name	Key sub-fields
0	`enk_error`	(none)
1	`enk_operation`	`operation.kind` from `off_A6F840` (120 operator kinds), `operation.type_kind` from `off_A6FE40` (22 type kinds), 20+ boolean flags for cast semantics, ADL suppression, virtual call properties, evaluation order
2	`enk_constant`	Constant value reference
3	`enk_variable`	Variable reference
4	`enk_field`	Field access
5	`enk_temp_init`	Temporary initialization
6	`enk_lambda`	Lambda expression
7	`enk_new_delete`	`is_new`, `placement_new`, `aligned_version`, `array_delete`, `global_new_or_delete`, `deducible_type`, `type`, `routine`, `arg`, `dynamic_init`, `number_of_elements`
8	`enk_throw`	Throw expression
9	`enk_condition`	Conditional expression
10	`enk_object_lifetime`	Object lifetime tracking
11	`enk_typeid`	`typeid` expression
12	`enk_sizeof`	`sizeof` expression
13	`enk_sizeof_pack`	`sizeof...` (pack)
14	`enk_alignof`	`alignof` expression
15	`enk_datasizeof`	`__datasizeof`
16	`enk_address_of_ellipsis`	Address of `...`
17	`enk_statement`	Statement expression
18	`enk_reuse_value`	Value reuse
19	`enk_routine`	Function reference
20	`enk_type_operand`	Type operand
21	`enk_builtin_operation`	Built-in op from `off_E6C5A0`
22	`enk_param_ref`	Parameter reference
23	`enk_braced_init_list`	Braced initializer
24	`enk_c11_generic`	`_Generic` selection
25	`enk_builtin_choose_expr`	`__builtin_choose_expr`
26	`enk_yield`	`co_yield`
27	`enk_await`	`co_await`
28	`enk_fold`	Fold expression
29	`enk_initializer`	Initializer
30	`enk_concept_id`	Concept ID
31	`enk_requires`	`requires` expression
32	`enk_compound_req`	Compound requirement
33	`enk_nested_req`	Nested requirement
34	`enk_const_eval_deferred`	Consteval deferred
35	`enk_template_name`	Template name

Every expression case ends with dump_source_position("position", ...) to print the source location.

statement (Kind 21)

display_statement (sub_5EC600, 328 lines) handles 26 statement kinds. Common fields first:

Field	Notes
`position`	Source position
`next`	IL pointer
`parent`	IL pointer (enclosing scope/block)
`attributes`	IL pointer
`has_associated_pragma`	bool
`is_initialization_guard`	bool
`is_lowering_boilerplate`	bool
`is_fallthrough_statement`	bool
`is_likely`	bool
`is_unlikely`	bool

Statement kind switch (offset +32):

Kind	Name	Key sub-fields
0	`stmk_expr`	Expression statement
1	`stmk_if`	if
2	`stmk_constexpr_if`	`if constexpr`
3	`stmk_if_consteval`	`if consteval` (C++23)
4	`stmk_if_not_consteval`	`if !consteval`
5	`stmk_while`	while loop
6	`stmk_goto`	goto
7	`stmk_label`	label
8	`stmk_return`	return
9	`stmk_coroutine`	Coroutine body (see below)
10	`stmk_coroutine_return`	Coroutine return
11	`stmk_block`	Block/compound: `statements`, `final_position`, `assoc_scope`, `lifetime`, `end_of_block_reachable`, `is_statement_expression`
12	`stmk_end_test_while`	do-while
13	`stmk_for`	for loop
14	`stmk_range_based_for`	Range-for: `iterator`, `range`, `begin`, `end`, `ne_call_expr`, `incr_call_expr`
15	`stmk_switch_case`	switch case
16	`stmk_switch`	switch
17	`stmk_init`	Initialization
18	`stmk_asm`	Inline assembly
19	`stmk_try_block`	try block
20	`stmk_decl`	Declaration
21	`stmk_set_vla_size`	VLA size
22	`stmk_vla_decl`	VLA declaration
23	`stmk_assigned_goto`	Computed goto
24	`stmk_empty`	Empty statement
25	`stmk_stmt_expr_result`	Statement expression result

Coroutine statement (case 9) displays the full C++20 coroutine lowering structure:

traits, handle, promise, init_await_resume, this_param_copy,
paramter_copies, final_suspend_label, initial_suspend_call,
final_suspend_call, unhandled_exception_call, get_return_object_call,
new_routine, delete_routine, ...

The field name "paramter_copies" (missing the first 'e' of "parameter") is a typo preserved verbatim from the EDG source. It indicates that the display strings originate from Edison Design Group's own il_to_str.c; an independent reimplementation would be unlikely to reproduce the misspelling.

scope (Kind 23)

display_scope (sub_5F2140, 177 lines) handles 9 scope kinds:

Kind	Name	Extra fields
0	`sck_file`	Top-level file scope
1	`sck_func_prototype`	Function prototype scope
2	`sck_block`	`assoc_handler`
3	`sck_namespace`	`assoc_namespace`
6	`sck_class_struct_union`	`assoc_type`
8	`sck_template_declaration`	Template declaration scope
15	`sck_condition`	`assoc_statement`
16	`sck_enum`	`assoc_type`
17	`sck_function`	`routine.ptr`, `parameters`, `constructor_inits`, `lifetime_of_local_static_vars`, `this_param_variable`, `return_value_variable`

Common scope fields: next, parent, kind

Boolean flags: do_not_free_memory_region, is_constexpr_routine, is_stmt_expr_block, is_placeholder_scope, needed_walk_done

Child entity lists: assoc_block, lifetime, constants, types, variables, nonstatic_variables, labels, routines, asm_entries, scopes

Conditional lists (controlled by bitmask tests on scope kind):

// Bitmask 0x20044 = bits 2+6+17 = sck_block + sck_class_struct_union + sck_function
// Bitmask 0x9     = bits 0+3    = sck_file + sck_namespace

if ((1LL << kind) & 0x20044) {
    // display: namespaces, using_declarations, using_directives
}
if ((1LL << kind) & 0x9) {
    // display: namespaces, using_declarations, using_directives
}
// Also: dynamic_inits, local_static_variable_inits (function/block scopes)
//       expr_node_refs, scope_refs, vla_dimensions (function scope + C mode)
//       pragmas, hidden_names, templates, source_sequence_list, src_seq_sublist_list

constant (Kind 2)

display_constant (sub_5F2720, 605 lines) handles 16 constant kinds. After display_source_corresp, common fields include next, type, orig_type, expr, and approximately 25 boolean flags.

Constant kind switch (offset +148):

Kind	Name	Key sub-fields
0	`ck_error`	(none)
1	`ck_integer`	Integer value via `sub_602F20`
2	`ck_string`	`character_kind` (char/wchar_t/char8_t/char16_t/char32_t), `length`, `literal_kind` (see below)
3	`ck_float`	Float value via `sub_5FCAF0`
4	`ck_complex`	Complex value
5	`ck_imaginary`	Imaginary value
6	`ck_address`	Sub-kind: abk_routine/variable/constant/temporary/uuidof/typeid/label; `subobject_path`, `offset`
7	`ck_ptr_to_member`	`casting_base_class`, `name_reference`, `cast_to_base`, `is_function_ptr`
8	`ck_label_difference`	`from_address`, `to_address`
9	`ck_dynamic_init`	`dynamic_init` pointer
10	`ck_aggregate`	`first_constant`, `last_constant`, `has_dynamic_init_component`
11	`ck_init_repeat`	`constant`, `count`, `multidimensional_aggr_tail_not_repeated`
12	`ck_template_param`	Sub-kinds: tpck_param/expression/member/unknown_function/address/sizeof/datasizeof/alignof/uuidof/typeid/noexcept/template_ref/integer_pack/destructor
13	`ck_designator`	`is_field_designator`, `is_generic`, `uses_direct_init_syntax`
14	`ck_void`	(none)
15	`ck_reflection`	`entity`, `local_scope_number`

dynamic_init (Kind 30)

display_dynamic_init (sub_5F37F0, 248 lines) handles 9 dynamic initialization kinds:

Kind	Name	Key sub-fields
0	`dik_none`	(none)
1	`dik_zero`	Zero-initialization
2	`dik_constant`	Constant initialization
3	`dik_expression`	Expression initialization
4	`dik_class_result_via_ctor`	Class result through constructor
5	`dik_constructor`	`routine`, `args`, `is_copy_constructor_with_implied_source`, `is_implicit_copy_for_copy_initialization`, `value_initialization`
6	`dik_nonconstant_aggregate`	Non-constant aggregate
7	`dik_bitwise_copy`	`source`
8	`dik_lambda`	`lambda`, `constant`, `non_constant`

Common fields: next, variable, destructor, lifetime, next_in_destruction_list, unordered, init_expr_lifetime, and approximately 20 boolean flags including static_temp, follows_an_exec_statement, inside_conditional_expression, has_temporary_lifetime, is_constructor_init, is_freeing_of_storage_on_exception, overlaps_temps_in_inner_lifetime, is_reused_value, is_creation_of_initializer_list_object, master_entry.

class_info (Kind 39)

display_class_type_supplement (sub_5F4030, 366 lines) is not dispatched directly from the kind table but called by display_type when the type kind is class/struct/union (kinds 9/10/11). It prints the class supplement record:

Field	Notes
`base_classes`	IL pointer list
`direct_base_classes`	IL pointer list
`preorder_base_classes`	IL pointer list
`primary_base_class`	IL pointer
`size_without_virtual_base_classes`	Integer
`alignment_without_virtual_base_classes`	Integer
`highest_virtual_function_number`	Integer
`virtual_function_info_offset`	Integer
`virtual_function_info_base_class`	IL pointer
`ELF_visibility`	`off_A6F720`
`is_lambda_closure_class`	bool
`is_generic_lambda_closure_class`	bool
`has_lambda_conversion_function`	bool
`is_initializer_list`	bool
`has_initializer_list_ctor`	bool
`has_anonymous_union_member`	bool
`anonymous_union_kind`	enum (auk_none/auk_variable/auk_field)
`is_va_list_tag`	bool
`has_nodiscard_attribute`	bool
`has_field_initializer`	bool
`removed_from_il`	bool
`contains_error`	bool
`befriending_classes`	Linked list (checks kind bytes 9/10/11 for class/struct/union)
`friend_routines`	IL pointer list
`friend_classes`	IL pointer list
`assoc_scope`	IL pointer
`assoc_template`	IL pointer
`template_arg_list`	Via `display_template_arg_list`
`lambda_parent.variable` / `.field` / `.routine`	Selected by bits in byte 86
`proxy_of_type`	IL pointer

Formatting Infrastructure

25-Column Field Labels

dump_field_label (sub_5EB2A0, 22 lines) is the universal field label formatter. It prints "field_name:" then pads with spaces to column 25. If the label plus colon exceeds 24 characters, it prints a newline first to avoid misalignment:

storage_class:           static
alignment:               16
is_constexpr:            TRUE

This produces the consistent columnar output visible in all IL dumps.
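
The padding rule can be sketched in C as follows. This is a minimal sketch, not the decompiled routine: format_field_label and VALUE_COLUMN are hypothetical names, the output goes to a buffer rather than through the output callback, and the handling of over-long labels (label on its own line, padding restarted on the next line) is an assumption about what "prints a newline first" means.

```c
#include <stdio.h>
#include <string.h>

enum { VALUE_COLUMN = 25 };   /* column where the value starts, per the text above */

/* Hypothetical stand-in for dump_field_label: writes "name:" plus padding
   into out and returns the number of characters written. */
static int format_field_label(char *out, size_t cap, const char *name)
{
    size_t label_len = strlen(name) + 1;          /* name plus the colon */

    if (label_len > VALUE_COLUMN - 1) {
        /* Assumption: an over-long label keeps its own line and the
           value column restarts on the following line. */
        return snprintf(out, cap, "%s:\n%*s", name, VALUE_COLUMN, "");
    }
    /* Pad with spaces so the value begins at column 25. */
    return snprintf(out, cap, "%s:%*s", name,
                    (int)(VALUE_COLUMN - label_len), "");
}
```

For a short label such as "is_constexpr" the result is always 25 characters wide, so a value appended directly after it lines up with every other field.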

Boolean Fields

dump_field_bool (sub_5EB450, 25 lines) prints a label and "TRUE" or "FALSE":

is_virtual:              TRUE
pure_virtual:            FALSE

Source Position Fields

dump_source_position (sub_5EB4E0, 82 lines) prints position as two sub-fields when the position is non-zero (seq != 0 or column != 0):

position.seq:            42
position.column:         5

It reads a 32-bit sequence number from *position and a 16-bit column from *(position + 4).
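
A minimal sketch of this layout and of the two-line output, assuming the 25-column label padding described earlier. The struct and function names are hypothetical; only the field widths and offsets come from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Layout from the text: 32-bit sequence number at +0, 16-bit column at +4.
   The type and function names are hypothetical. */
typedef struct {
    uint32_t seq;
    uint16_t column;
} source_position;

static int format_source_position(char *out, size_t cap, const char *label,
                                  const source_position *pos)
{
    char seq_label[64], col_label[64];

    assert(out != NULL && cap > 0);
    out[0] = '\0';
    if (pos->seq == 0 && pos->column == 0)
        return 0;                          /* all-zero positions are not printed */

    snprintf(seq_label, sizeof seq_label, "%s.seq:", label);
    snprintf(col_label, sizeof col_label, "%s.column:", label);
    /* %-25s left-justifies each label and pads it to the value column. */
    return snprintf(out, cap, "%-25s%u\n%-25s%u\n",
                    seq_label, (unsigned)pos->seq,
                    col_label, (unsigned)pos->column);
}
```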

IL Pointer Annotations

dump_il_entity_pointer (sub_5EB8B0, 99 lines) is the most comprehensive pointer formatter. For each IL entity pointer, it prints:

Scope prefix: "file-scope" or "func-scope" (from bit 0 of the entry prefix byte at entry_ptr - 8)
Kind name: from off_E6DD80[kind_byte]
Hex address: @%lx
Entity name (kind-dependent):
- Kinds with name at offset +8 (bitmask 0x2000000010001984): prints the name string
- Kind 12 (label): prints "label " prefix + name
- Kind 6 (type): calls qualified name formatter
- Kind 2 (constant): calls type display
- Kind 0x40 (seq_number_lookup): prints qualified name from offset +0
- Kind with bit 36 set: prints qualified name from offset +40, plus "in" context from +56

primary_source_file:     file-scope source_file_entry@7f3a4b100020 "test.cu"
main_routine:            file-scope routine@7f3a4b200100 "main"

The variant dump_il_string_pointer (sub_5EB670) prints the same format but includes the string value from the pointed-to entry. A scope mismatch (e.g., function-scope pointer found during file-scope display) triggers a "**NON FILE SCOPE PTR**" warning.
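
The name-at-offset-+8 test and the scope bit lookup can be sketched as below. The mask value and the entry_ptr - 8 prefix byte come from the description above; the helper names are hypothetical, and the sketch does not say which bit value means file scope because the text does not specify it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask quoted in the text: one bit per IL entry kind whose name string
   lives at offset +8. */
#define NAME_AT_OFFSET_8_MASK 0x2000000010001984ULL

/* Hypothetical helper: true when 'kind' is selected by the mask. */
static bool kind_has_name_at_offset_8(unsigned kind)
{
    if (kind >= 64)              /* shifting a 64-bit value by >= 64 is undefined */
        return false;
    return ((NAME_AT_OFFSET_8_MASK >> kind) & 1u) != 0;
}

/* Bit 0 of the prefix byte stored 8 bytes before the entry selects the
   scope prefix string. */
static unsigned scope_flag_of(const void *entry_ptr)
{
    return ((const unsigned char *)entry_ptr)[-8] & 1u;
}
```

Decoding the mask gives bits 2, 7, 8, 11, 12, 28, and 61. Kinds 7 and 11 match the variable and routine cases of display_il_entry.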

Entity List Display

display_entity_list (sub_5EC450, 87 lines) walks a linked list of entity pointers and prints each with scope/kind/address annotations:

entities:                file-scope variable@7f3a... "x"
                         func-scope variable@7f3a... "y"

It follows the next link at offset 0 of each list node until NULL.

String Literal Display

dump_string_value (sub_5EB300, 41 lines) prints string values with proper escape handling:

NULL pointers print "NULL"
Non-printable characters are printed as octal escapes (\OOO)
Backslash and double-quote are backslash-escaped (\\, \")
The octal mask width is controlled by dword_126E49C (CHAR_BIT equivalent, typically 8)

file_name:               "test.cu"
full_name:               "/home/user/project/test.cu"
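
The escaping rules listed above can be sketched as follows. This is an illustration under stated assumptions, not the decompiled code: format_string_value is a hypothetical name, the output goes to a caller-supplied buffer, non-NULL strings are assumed to be wrapped in double quotes (as in the sample output), and CHAR_BIT stands in for the value held in dword_126E49C.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical stand-in for dump_string_value: renders s into out and
   returns the number of characters written (excluding the terminator).
   Input that does not fit is silently truncated. */
static size_t format_string_value(char *out, size_t cap, const char *s)
{
    size_t n = 0;
    unsigned mask = (1u << CHAR_BIT) - 1u;   /* plays the role of dword_126E49C */

    assert(out != NULL && cap >= 8);
    if (s == NULL)
        return (size_t)snprintf(out, cap, "NULL");

    n += (size_t)snprintf(out + n, cap - n, "\"");
    /* Each character needs at most 4 bytes; keep room for the closing quote. */
    for (; *s != '\0' && n + 5 < cap; ++s) {
        unsigned c = (unsigned char)*s & mask;
        if (c == '\\' || c == '"')
            n += (size_t)snprintf(out + n, cap - n, "\\%c", (int)c);
        else if (isprint((int)c))
            n += (size_t)snprintf(out + n, cap - n, "%c", (int)c);
        else
            n += (size_t)snprintf(out + n, cap - n, "\\%03o", c);
    }
    n += (size_t)snprintf(out + n, cap - n, "\"");
    return n;
}
```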

Float Constant Formatting

form_float_constant (sub_5F7FD0, 302 lines) handles float-to-string conversion with EDG-specific formatting. An assertion at line 6175 guards against buffer overflow (63-byte limit).

Float kind suffixes:

Kind	Suffix	Type
0	(none)	double
2	`f`/`F`	float
3	`f32x`	extended float32
5	`f64x`	extended float64
6	`l`/`L`	long double
7	`w`	float128/wide
8	`q`	quad
9	`bf16`	bfloat16
10	`f16`	float16
11	`f32`	float32
12	`f64`	float64
13	`f128`	float128
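
The suffix table maps directly onto a switch. The sketch below uses a hypothetical function name, picks the lowercase spelling where the table lists both cases, and returns NULL for kind values the table does not list (1 and 4).

```c
#include <stddef.h>

/* Suffix for a float kind value, following the table above. Returns ""
   for kind 0 (plain double) and NULL for values the table does not list. */
static const char *float_kind_suffix(int kind)
{
    switch (kind) {
    case 0:  return "";
    case 2:  return "f";      /* the table also allows uppercase "F" */
    case 3:  return "f32x";
    case 5:  return "f64x";
    case 6:  return "l";      /* the table also allows uppercase "L" */
    case 7:  return "w";
    case 8:  return "q";
    case 9:  return "bf16";
    case 10: return "f16";
    case 11: return "f32";
    case 12: return "f64";
    case 13: return "f128";
    default: return NULL;
    }
}
```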

Special value handling:

NaN: __builtin_nanf(""), __builtin_nan(""), etc. (when compiler version > 30299)
Infinity: __builtin_huge_valf() or (__extension__ 0x1.0p<exp>f)
Division form: (f/0.0f) or (f/(0,0.0f)) (C++ vs C modes, selected by dword_126E1D8/dword_126E1E8)
User-defined literals: (funcname("string_value")) form

Data Tables Referenced

The display subsystem relies on more than 20 enum-to-string lookup tables. The off_A6Fxxx tables lie in the .rodata segment; the off_E6xxxx tables fall inside the .data address range:

Address	Name	Entries	Used by
`off_A6F000`	`attr_arg_kind_names`	6	Attribute argument display
`off_A6F040`	`attr_location_names`	24	Attribute display
`off_A6F100`	`attr_family_names`	5	Attribute display
`off_A6F140`	`attr_kind_names`	86	Attribute display
`off_A6F3F0`	`class_kind_labels`	3	`befriending_classes` display
`off_A6F420`	`based_type_kind_names`	6	`display_type` based_types
`off_A6F460`	`pragma_state_names`	4	fp_contract/fenv_access/cx_limited_range
`off_A6F480`	`register_kind_names`	53	`display_variable` reg field
`off_A6F640`	`typeref_kind_names`	28	`display_type` typeref
`off_A6F720`	`elf_visibility_kind_names`	5	ELF visibility (all entity types)
`off_A6F760`	`access_specifier_names`	4	public/protected/private/none
`off_A6F840`	`expr_operator_kind_names`	120	`display_expr_node` operations
`off_A6FC00`	`special_function_kind_names`	13	`display_routine` special_kind
`off_A6FC80`	`operator_name_kind_names`	47	`display_routine` opname_kind
`off_A6FE00`	`storage_class_names`	7	Storage class (variable + routine)
`off_A6FE40`	`type_kind_names`	22	Type kind (all type displays)
`off_E6C5A0`	`builtin_operation_names`	varies	`display_expr_node` builtins
`off_E6CDA0`	`calling_convention_names`	varies	`display_type` calling conventions
`off_E6CDE0`	`pragma_kind_names`	varies	Pragma display
`off_E6CF40`	`asm_clobber_reg_names`	varies	Asm clobber display
`off_E6D240`	`token_kind_names`	varies	Fold expression / attribute_arg tokens
`off_E6DD80`	`il_entry_kind_names`	~84	All display functions (entry kind)
`off_E6E040`	`linkage_kind_names`	varies	Name linkage (source_corresp)

All tables use the same bounds-checking pattern:

const char *name = "**BAD STORAGE CLASS**";
if ((unsigned char)value <= 6u)
    name = storage_class_names[value];
puts(name);

Out-of-range values produce "**BAD <KIND>**" sentinel strings, which serve as diagnostic markers for corrupted IL.

Global State

Address	Name	Type	Purpose
`dword_126FA30`	`is_file_scope_region`	int	1 during file-scope display, 0 during function-scope
`qword_126F980`	`output_callbacks`	function ptr	Output function (default: `sub_5EB290` = `fputs(s, stdout)`)
`byte_126FA16`	`display_active`	byte	Set to 1 during display, prevents re-entrant calls
`byte_126FA11`	`pcc_compat_shadow`	byte	Shadow of PCC compatibility mode during display
`dword_126EBA8`	`source_language`	int	0 = C++, 1 = C
`dword_126EBAC`	`std_version`	int	C/C++ standard version number
`dword_126EC80`	`total_region_count`	int	Number of memory regions (1 = file scope only)
`qword_126EC88`	`region_table`	pointer array	Region index to memory block mapping
`qword_126EB90`	`scope_table`	pointer array	Region index to scope entry mapping
`qword_126EEE0`	`source_file_name`	string ptr	Name of the source file being compiled

Helper Functions (0x5F8000--0x6039E0)

The display subsystem includes approximately 50 additional helper functions in the address range beyond the main dispatchers:

Address	Lines	Identity	Purpose
`sub_5F85E0`	78	`display_bool_field`	Boolean TRUE/FALSE output
`sub_5F8760`	97	`display_flags_word`	Flags word display
`sub_5F8910`	88	`display_type_qualifiers`	const/volatile/restrict qualifier flags
`sub_5F8A80`	49	`display_storage_class`	Storage class enum
`sub_5F8BD0`	139	`display_access_specifier`	Access with indentation
`sub_5F8DF0`	103	`display_linkage_kind`	Linkage kind enum
`sub_5F9040`	28	`init_output_context`	Initialize display callback state
`sub_5F9110`	149	`display_int_type_kind`	Integer type kind name
`sub_5F93D0`	70	`display_float_type_kind`	Float type kind name
`sub_5F9500`	70	`display_int_type_size`	Integer type size name
`sub_5F9650`	99	`display_qualifier_flags`	Full qualifier flags
`sub_5F9820`	18	`display_ref_qualifier`	`&` or `&&`
`sub_5F9860`	91	`display_calling_convention`	Calling convention from `off_E6CDA0`
`sub_5F99A0`	115	`display_attribute_target`	Attribute target kind
`sub_5F9BC0`	20	`display_asm_keyword`	`"asm"` or `"volatile"`
`sub_5F9C10`	26	`display_elaborated_type`	Elaborated type specifier
`sub_5F9CA0`	50	`display_struct_layout`	Structure layout padding mode
`sub_5F9D80`	89	`display_member_alignment`	Member alignment field
`sub_5F9F70`	57	`display_template_kind`	Template kind name
`sub_5FA0D0`	283	`display_template_arg_list`	Full template argument list
`sub_5FA660`	127	`display_constraint_expr`	Constraint expression (C++20)
`sub_5FA8F0`	118	`display_deduction_guide`	Deduction guide info
`sub_5FAB70`	333	`display_capture_list`	Lambda capture list
`sub_5FB270`	556	`display_expr_operator_name`	Expression operator name (120 kinds)
`sub_5FBCD0`	571	`display_expr_details`	Operator-specific expression details
`sub_5FCAF0`	1,319	`display_float_constant`	Float/complex/imaginary formatting
`sub_5FE7C0`	55	`display_expr_flag`	Expression flag display
`sub_5FE8B0`	1,659	`display_expr_operator`	Expression operator details (2nd largest)
`sub_600740`	72	`display_for_range`	Range-based for details
`sub_600870`	171	`display_coroutine_info`	Coroutine info (C++20)
`sub_600BF0`	19	`display_designated_init`	Designated initializer
`sub_600C50`	107	`display_attribute_entry`	Attribute entry
`sub_600E00`	55	`display_asm_operand`	Asm operand display
`sub_600EF0`	76	`display_asm_statement`	Asm statement details
`sub_600FF0`	29	`display_gcc_builtin_kind`	GCC built-in kind
`sub_601070`	87	`display_pragma_info`	Pragma info
`sub_6011F0`	155	`display_declspec_attribute`	`__declspec` attribute
`sub_601460`	92	`display_thread_local`	Thread-local info
`sub_6015A0`	73	`display_module_info`	Module info (C++20)
`sub_6016F0`	197	`display_concept_requires`	Concept/requires expression
`sub_601B10`	48	`display_pack_expansion`	Pack expansion info
`sub_601BE0`	50	`display_structured_binding`	Structured binding (C++17)
`sub_601CB0`	562	`display_additional_expr`	Additional expression info
`sub_6027D0`	144	`display_deduced_class`	Deduced class info
`sub_6029B0`	190	`display_decl_sequence`	Declaration sequence entry
`sub_602DC0`	74	`display_enum_underlying`	Enum underlying type
`sub_602F20`	306	`display_integer_constant`	Integer constant formatting
`sub_603670`	134	`display_vendor_attribute`	Vendor attribute details
`sub_6038F0`	26	`display_cleanup_handler`	Cleanup handler
`sub_6039E0`	78	`display_sequence_entry`	Last function in il_to_str region

The "paramter_copies" Typo

The coroutine statement display (case 9 in display_statement) prints the field label "paramter_copies", which is missing the first 'e' of "parameter". The typo is present in the compiled binary's string table and originates from Edison Design Group's source code. It serves as strong provenance evidence: a clean-room reimplementation would be very unlikely to reproduce this exact spelling error, which indicates that cudafe++ links genuine EDG il_to_str.c object code.

Complete Call Graph

display_il_file (sub_5F7DF0) ─── TOP LEVEL
├── display_il_header (sub_5F76B0)
│   ├── init_output_context (sub_5F9040)
│   ├── dump_il_entity_pointer (sub_5EB8B0) ×30+ for header fields
│   ├── dump_field_bool (sub_5EB450) ×15+ for header booleans
│   ├── dump_string (sub_5EB790)
│   └── walk_file_scope_il (sub_60E4F0)
│       └── display_il_entry (sub_5F4930) ─── callback per entity
│
└── [loop over regions 2..N]
    └── walk_routine_scope_il (sub_610200)
        └── display_il_entry (sub_5F4930) ─── callback per entity

display_il_entry (sub_5F4930) ─── MAIN DISPATCHER
├── display_source_corresp (sub_5EDF40) ─── shared by named entities
├── display_statement (sub_5EC600) ─── case 0x15
│   ├── display_coroutine_info (sub_600870)
│   └── display_for_range (sub_600740)
├── display_expr_node (sub_5ECFE0) ─── case 0x0D
│   ├── display_expr_operator (sub_5FE8B0)
│   ├── display_expr_operator_name (sub_5FB270)
│   └── display_expr_details (sub_5FBCD0)
├── display_variable (sub_5EE500) ─── case 0x07
│   └── display_init_kind (sub_5EBB50)
├── display_routine (sub_5EF1A0) ─── case 0x0B
│   └── display_template_arg_list (sub_5EBF60 / sub_5FA0D0)
├── display_type (sub_5F06B0) ─── case 0x06
│   ├── display_class_type_supplement (sub_5F4030)
│   ├── display_int_type_kind (sub_5F9110)
│   └── display_float_type_kind (sub_5F93D0)
├── display_scope (sub_5F2140) ─── case 0x17
├── display_constant (sub_5F2720) ─── case 0x02
│   ├── display_integer_constant (sub_602F20)
│   └── display_float_constant (sub_5FCAF0)
├── display_dynamic_init (sub_5F37F0) ─── case 0x1E
├── display_name_reference (sub_5EBC60) ─── case 0x3E
└── display_entity_list (sub_5EC450) ─── multiple cases

display_single_entity (sub_5F7D50) ─── TARGETED DISPLAY
├── entity_lookup (sub_73D400)
├── resolve_entity (sub_7377D0)
├── get_entity_kind (sub_5C64C0)
├── init_output_context (sub_5F9040)
└── display_il_entry (sub_5F4930)

Relationship to Other Subsystems

The IL display subsystem is read-only: it never modifies the IL graph. It shares the same entry walker functions used by the IL Tree Walking framework (walk_file_scope_il = sub_60E4F0, walk_routine_scope_il = sub_610200) and the Keep-in-IL mark phase, but passes display_il_entry as the callback instead of a transformation function.

The IL Allocation subsystem provides dump_il_table_stats (sub_5E99D0), which dumps allocation counters rather than IL content -- a complementary diagnostic activated separately.

The field names printed by the display functions, together with the offsets they read, serve as ground truth for the IL Overview entry kind table and the Entity Node Layout documentation.

IL Comparison & Deep Copy

The IL comparison and deep copy engines are two tightly coupled subsystems in EDG's il.c that serve template instantiation, constant sharing, and overload resolution. The comparison engine determines structural equivalence between two IL expression trees or constant nodes -- needed when the compiler must decide whether two template arguments are "the same" or whether a constant has already been allocated. The deep copy engine clones expression trees while optionally substituting actual arguments for template parameters -- the core mechanism behind template instantiation. Both subsystems are recursive tree walkers dispatched by node-kind switches, and both operate on the same IL node layout described in IL Overview.

These two engines span the address range 0x5D0750--0x5DFAD0 in the binary (about 61 KB end to end; the four sub-ranges listed here account for roughly 34 KB of it). The comparison engine occupies 0x5D0750--0x5D2160, constant sharing infrastructure sits at 0x5D2170--0x5D2D80, the expression copy engine fills 0x5D2DE0--0x5D5550, and the template parameter substitution dispatcher extends from 0x5DC000 to 0x5DFAD0.

Key Facts

Property	Value
Source file	`il.c` (EDG 6.6)
Assert path	`/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/il.c`
Comparison engine	`sub_5D0750` (`compare_expressions`), 588 lines
Constant comparison	`sub_5D1350` (`compare_constants`), 525 lines
Dynamic init comparison	`sub_5D1FE0` (`compare_dynamic_inits`), ~80 lines
Constant sharing allocator	`sub_5D2390` (`alloc_shareable_constant`), ~200 lines
Expression tree copier	`sub_5D2F90` (`i_copy_expr_tree`), 424 lines
Constant deep copier	`sub_5D3B90` (`i_copy_constant_full`), 305 lines
Template substitution dispatcher	`sub_5DC000` (`copy_template_param_expr`), 1416 lines
Template constant dispatcher	`sub_5DE290` (`copy_template_param_con`), 819 lines
Constant sharing hash buckets	2039
Recursion depth counter	`dword_126F1D0` (incremented/decremented around `compare_expressions`)

Part 1: The Comparison Engine

Why It Exists

Three front-end subsystems need structural equality testing on IL trees:

Template argument deduction. When the compiler deduces template arguments from a function call, it must compare the deduced value against a previously deduced value for the same parameter. Two independently constructed expression trees representing sizeof(int) must compare as equal even though they are distinct heap allocations.
Constant sharing. Identical constants across the translation unit are deduplicated into a single canonical node in file-scope memory. The comparison engine is the hash table's equality predicate -- after two constants hash to the same bucket, compare_constants determines whether they are structurally identical.
Overload resolution. When the compiler checks whether two function template specializations have equivalent signatures, it compares their template argument expressions for equivalence.
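
The constant-sharing use case, where a comparison function acts as the equality predicate of a bucketed hash table, can be sketched as below. Only the bucket count (2039, from the Key Facts table) comes from this document. The node type, the hash function, and all names are hypothetical placeholders, and the toy equality test stands in for compare_constants.

```c
#include <stddef.h>
#include <stdlib.h>

enum { SHARE_BUCKETS = 2039 };     /* bucket count from the Key Facts table */

/* Toy constant: integer constants only, chained per bucket. */
typedef struct toy_constant {
    struct toy_constant *next_in_bucket;
    long value;
} toy_constant;

static toy_constant *share_table[SHARE_BUCKETS];

/* Placeholder hash; the real hash function is not described here. */
static size_t toy_hash(long value)
{
    return (size_t)((unsigned long)value % SHARE_BUCKETS);
}

/* Equality predicate, standing in for compare_constants. */
static int toy_equal(const toy_constant *a, const toy_constant *b)
{
    return a == b || a->value == b->value;
}

/* Returns the canonical node for 'value', allocating one on first use. */
static toy_constant *share_constant(long value)
{
    toy_constant probe = { NULL, value };
    size_t h = toy_hash(value);
    toy_constant *c;

    for (c = share_table[h]; c != NULL; c = c->next_in_bucket)
        if (toy_equal(c, &probe))
            return c;                  /* already shared: reuse it */

    c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->value = value;
    c->next_in_bucket = share_table[h];
    share_table[h] = c;
    return c;
}
```

Two requests for the same value return the same pointer, while a colliding value in the same bucket gets its own node because the predicate rejects it.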

compare_expressions (sub_5D0750)

This is the main entry point. It takes two expression-node pointers and a flags word, and returns 1 (match) or 0 (mismatch). It uses a 36-case switch on the expression-node kind byte (offset +24 in the node layout).

compare_expressions(expr_a, expr_b, flags) -> bool:

    if expr_a == expr_b:
        return TRUE                          // pointer identity short-circuit

    if expr_a->kind != expr_b->kind:
        return FALSE                         // different node types never match

    recursion_depth++                        // dword_126F1D0

    switch expr_a->kind:

        case 0 (null):
            result = FALSE                   // two null nodes are never "equal"

        case 1 (operation):
            if expr_a->op_code != expr_b->op_code:
                result = FALSE
            else:
                // compare each operand in the linked list pairwise
                result = compare_operand_lists(expr_a->operands, expr_b->operands, flags)
                if result:
                    result = equiv_types(expr_a->result_type, expr_b->result_type)

        case 2 (constant reference):
            result = compare_constants(expr_a->constant, expr_b->constant, flags)

        case 3 (entity reference):
            // first try pointer equality on the referenced entity
            if expr_a->entity == expr_b->entity:
                result = TRUE
            elif sharing_enabled and same_sharing_symbol(expr_a, expr_b):
                result = TRUE
            else:
                // deep entity comparison via equiv_types + compare_template_variables
                result = equiv_types(expr_a->entity->type, expr_b->entity->type)

        case 4, 19 (type reference):
            result = (expr_a->type_ptr == expr_b->type_ptr)
                  or equiv_types(expr_a->type_ptr, expr_b->type_ptr)

        case 5, 18 (dynamic init):
            result = compare_dynamic_inits(expr_a->init, expr_b->init, flags)

        case 6 (source position):
            result = (expr_a->offset == expr_b->offset)

        case 7 (full expression info):
            result = compare_flags(expr_a, expr_b)
                 and equiv_types(...)
                 and compare_expressions(expr_a->sub_expr, expr_b->sub_expr, flags)

        case 8 (template arguments):
            // element-by-element comparison of arg lists
            result = compare_template_arg_lists(expr_a->args, expr_b->args, flags)

        case 10, 33 (sub-expression wrapper):
            result = compare_expressions(expr_a->inner, expr_b->inner, flags)

        case 11, 32 (unary with boolean):
            result = (expr_a->bool_field == expr_b->bool_field)
                 and compare_expressions(expr_a->inner, expr_b->inner, flags)

        case 12, 14, 15 (typed value):
            result = (expr_a->value_byte == expr_b->value_byte)
                 and compare_type_or_value(...)

        case 13 (two-byte key):
            result = (expr_a->key_word == expr_b->key_word)

        case 16 (always-equal sentinel):
            result = TRUE

        case 17, 22, 35 (opaque pointer):
            result = (expr_a->ptr == expr_b->ptr)

        case 20 (pointer with fallback):
            result = (expr_a->ptr == expr_b->ptr)
                  or deep_compare_via_sub_7B2260(...)

        case 21 (keyed sub-expression):
            result = (expr_a->key == expr_b->key)
                 and compare_expressions(expr_a->inner, expr_b->inner, flags)

        case 23 (simple sub-expression):
            result = compare_expressions(expr_a->inner, expr_b->inner, flags)

        case 24 (nested expression pair):
            result = compare_pair(expr_a, expr_b, flags)

        case 25 (lambda/closure):
            result = chase_closure_ptrs_and_compare(...)

        case 28 (attributed expression):
            result = (expr_a->attr_flags == expr_b->attr_flags)
                 and compare_expressions(expr_a->inner, expr_b->inner, flags)

        case 30 (template specialization):
            if expr_a->hash != expr_b->hash:
                result = FALSE
            else:
                result = compare_template_specializations(expr_a, expr_b)

        case 31 (function template args):
            result = compare_each_arg_type(expr_a->args, expr_b->args)

        default:
            internal_error("compare_expressions: bad expr kind")

    recursion_depth--
    return result
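
The overall shape of this routine (identity short-circuit, kind check, per-kind switch, depth counter around the recursive part) can be shown on a toy node type. The two-kind node layout and every name below are hypothetical; the sketch illustrates the control flow only and does not reproduce the real 36-case switch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy expression node; kinds and field names are hypothetical. */
typedef enum { TOY_CONSTANT, TOY_OPERATION } toy_kind;

typedef struct toy_expr {
    toy_kind kind;
    int op_code;                 /* TOY_OPERATION only */
    long value;                  /* TOY_CONSTANT only */
    struct toy_expr *operands;   /* first operand (TOY_OPERATION only) */
    struct toy_expr *next;       /* next sibling operand */
} toy_expr;

static int compare_depth;        /* plays the role of dword_126F1D0 */

static bool toy_compare(const toy_expr *a, const toy_expr *b)
{
    bool result = false;

    if (a == b)
        return true;                         /* pointer identity short-circuit */
    if (a == NULL || b == NULL || a->kind != b->kind)
        return false;

    compare_depth++;
    switch (a->kind) {
    case TOY_CONSTANT:
        result = (a->value == b->value);
        break;
    case TOY_OPERATION:
        result = (a->op_code == b->op_code);
        /* Compare operand lists pairwise; both lists must end together. */
        for (const toy_expr *x = a->operands, *y = b->operands; result;
             x = x->next, y = y->next) {
            if (x == NULL || y == NULL) {
                result = (x == y);
                break;
            }
            result = toy_compare(x, y);
        }
        break;
    }
    compare_depth--;
    return result;
}
```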

Flags interpretation. The third argument flags is a bitmask:

Bit	Mask	Meaning
0	`0x01`	Strict mode -- entity references must be pointer-identical, not just structurally equivalent
1	`0x02`	Check constraints -- compare template constraints alongside types
2	`0x04`	Allow specialization -- used by the equivalence wrapper (`sub_5D1320`) when comparing for specialization matching
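
For reference, the three bits can be written as a C enum. The bit values come from the table; the enumerator names and the entities_match helper are hypothetical and only illustrate how a case might consult the strict bit.

```c
/* Bit values from the table above; the enumerator names are hypothetical. */
enum compare_flags {
    CMP_STRICT               = 0x01,  /* entity references must be pointer-identical */
    CMP_CHECK_CONSTRAINTS    = 0x02,  /* also compare template constraints */
    CMP_ALLOW_SPECIALIZATION = 0x04   /* specialization matching (sub_5D1320 wrapper) */
};

/* Hypothetical helper: decides an entity-reference match given the flags
   and the outcome of a structural comparison done elsewhere. */
static int entities_match(const void *a, const void *b, unsigned flags,
                          int structurally_equal)
{
    if (a == b)
        return 1;                  /* identical pointers always match */
    if (flags & CMP_STRICT)
        return 0;                  /* strict mode rejects merely equivalent entities */
    return structurally_equal;
}
```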

Recursion depth counter. The global dword_126F1D0 is incremented once the pointer-identity and kind checks have passed and decremented before the function returns. It does not limit recursion depth; it exists so that diagnostic routines (guarded by dword_126EFC8) can print indented traces via sub_5C4B70 (dump_expr_tree).

compare_constants (sub_5D1350)

Constants are the most structurally complex IL nodes. A single constant node is 184 bytes and carries a constant_kind byte at offset +148 that selects among 16 primary kinds, some of which contain nested sub-kinds. The comparison function uses an outer switch on constant_kind and inner switches for aggregate and template-parameter sub-kinds.

compare_constants(const_a, const_b, flags) -> bool:

    if const_a == const_b:
        return TRUE

    if const_a->constant_kind != const_b->constant_kind:
        return FALSE

    switch const_a->constant_kind:

        case 0, 14 (trivial kinds):
            return TRUE

        case 1 (integer):
            return compare_integer_values(const_a->value, const_b->value)
               and (const_a->flags == const_b->flags)

        case 2 (string literal):
            return memcmp(const_a->bytes, const_b->bytes, const_a->length) == 0

        case 3, 5 (float):
            return compare_float_value(const_a->value, const_b->value)

        case 4 (complex):
            return compare_float_value(const_a->real, const_b->real)
               and compare_float_value(const_a->imag, const_b->imag)

        case 6 (address constant):
            // nested switch on address_kind (offset +152), 6 sub-kinds:
            switch const_a->address_kind:
                case 0, 1: pointer equality or compare_entities(...)
                case 2:    recursive type comparison
                case 3, 6: pointer equality at offset +160
                case 5:    type comparison via deep_compare
            // address sub-kind 2 (and outer case 13) reassign the operands and
            // loop via while(2) instead of recursing (manual tail-call optimization)

        case 7 (template argument):
            return compare_template_arg(...)

        case 8 (pair of constants):
            return compare_constants(const_a->first, const_b->first, flags)
               and compare_constants(const_a->second, const_b->second, flags)

        case 9 (dynamic init):
            return compare_dynamic_inits(const_a->init, const_b->init, flags)

        case 10 (aggregate):
            // walk linked lists of sub-constants in lockstep
            a_elem = const_a->first_element
            b_elem = const_b->first_element
            while a_elem and b_elem:
                if not compare_constants(a_elem, b_elem, flags): return FALSE
                a_elem = a_elem->next; b_elem = b_elem->next
            return (a_elem == NULL and b_elem == NULL)

        case 11 (constant + scope):
            return compare_constants(const_a->sub, const_b->sub, flags)
               and (const_a->scope_byte == const_b->scope_byte)

        case 12 (template parameter constant):
            // deeply nested -- 14 sub-kinds at offset +152:
            switch const_a->template_param_kind:
                case 0:  pack parameter comparison
                case 1:  compare_expressions on embedded expr
                case 2:  compare_types via sub_5B3080
                case 3:  compare_types + type + flags
                case 4, 12: recursive compare_constants
                case 5-10: type equality + sub_5BFB80 comparison
                case 11: type + template argument list
                case 13: type equality only

        case 13 (entity ref constant):
            return (const_a->flags == const_b->flags)
               and (pointer_equal_or_sharing_match(const_a->entity, const_b->entity))

        case 15 (literal value):
            return (const_a->value_ptr == const_b->value_ptr)

        default:
            internal_error("compare_constants: bad constant kind")

Manual tail-call optimization. For cases 6 (address constant with type sub-kind) and 13 (entity ref), the function uses while(2) (an infinite loop that reassigns the operands and continues from the top of the comparison) instead of making a recursive call. This avoids stack growth when comparing chains of address constants, which can be deeply nested in pointer-to-member types.
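The loop-instead-of-recursion pattern can be illustrated with a minimal Python sketch. The `Node` class and its `kind`, `payload`, and `inner` fields are hypothetical stand-ins and do not reflect the real 184-byte layout; only the control flow mirrors the description above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    kind: int                       # hypothetical stand-in for constant_kind
    payload: int                    # hypothetical per-node data
    inner: Optional["Node"] = None  # next link in the chain, if any

def compare_chain(a: Optional[Node], b: Optional[Node]) -> bool:
    """Compare two chains without recursing on the tail."""
    while True:                     # plays the role of the decompiler's while(2)
        if a is b:
            return True             # same node, or both chains ended together
        if a is None or b is None:
            return False            # one chain is longer than the other
        if a.kind != b.kind or a.payload != b.payload:
            return False
        # Instead of `return compare_chain(a.inner, b.inner)`,
        # reassign the operands and restart from the top.
        a, b = a.inner, b.inner

def make_chain(n: int) -> Optional[Node]:
    """Build a chain of n nodes (illustrative helper)."""
    head = None
    for i in range(n):
        head = Node(kind=6, payload=i, inner=head)
    return head
```

Because the tail step is a reassignment rather than a call, the comparison depth is bounded by the loop and not by the call stack, so chains far longer than Python's default recursion limit compare without error.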

compare_dynamic_inits (sub_5D1FE0)

Dynamic initializers represent runtime initialization expressions (constructors, aggregate init, etc.). The comparison function dispatches on the init kind byte at offset +48:

compare_dynamic_inits(init_a, init_b, flags) -> bool:

    if init_a->kind != init_b->kind:
        return FALSE
    if init_a->flags != init_b->flags:
        return FALSE
    // entity fields at +8, +16 compared with sharing-aware equality

    switch init_a->kind:
        case 0, 1: return TRUE (after header match)
        case 2, 6: return compare_constants(init_a->constant, init_b->constant)
        case 3, 4: return compare_expressions(init_a->expr, init_b->expr)
        case 5:    return compare_entity_ref(...)
                       and compare_sub_exprs(...)

Why It Exists

Without deduplication, every occurrence of the integer constant 42 in a translation unit would allocate a separate 184-byte constant node in the IL. For large programs (especially heavy template users), this wastes significant memory. The constant sharing system maintains a hash table of canonical constants: when a new constant is about to be allocated, alloc_shareable_constant first checks whether an identical constant already exists in the hash table. If so, the existing node is returned; if not, a new canonical copy is created in the file-scope region and inserted into the table.

Shareability Predicate

Not all constants can be shared. The predicate constant_is_shareable (sub_5D2210) checks several blocking conditions:

constant_is_shareable(constant) -> bool:

    if not sharing_enabled (dword_126EE48):
        return FALSE

    if constant has parent:
        // parent must be type 2 (constant); checks sharing flag 0x40 at byte+81
        // and calls compare_constants on the parent's value
        return parent_is_shareable(...)

    // blocking conditions for parentless constants:
    if constant->associated_entry != NULL:  return FALSE   // already bound to an entry
    if constant->extra_data != 0:           return FALSE   // has auxiliary data
    if constant->flags & 0x02:              return FALSE   // flag bit 1 blocks sharing

    switch constant->constant_kind:
        case 2 (string):    return string_sharing_enabled (dword_126E1C0)
        case 6 (address):   return TRUE unless has extra payload at +176
                             or address_subkind==4 with data
        case 7 (template):  return (constant->extra_ptr == NULL)
        case 10 (aggregate): return FALSE   // aggregates never shared
        case 12 (template param): return FALSE   // template params never shared
        default:            return TRUE

The rationale for excluding aggregates and template parameters: aggregate constants contain linked lists of sub-constants that would require recursive sharing checks, and template parameter constants are inherently unique to their instantiation context.
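The blocking conditions for parentless constants can be sketched as follows. This is a simplified model under stated assumptions: the `Constant` dataclass and its field names are hypothetical, the parent path is omitted, and the address-constant payload checks (kind 6) are not modelled, so kind 6 falls through to the default here.

```python
from dataclasses import dataclass
from typing import Optional

KIND_STRING, KIND_TEMPLATE = 2, 7
KIND_AGGREGATE, KIND_TEMPLATE_PARAM = 10, 12

@dataclass
class Constant:
    constant_kind: int
    flags: int = 0
    associated_entry: Optional[object] = None   # hypothetical field name
    extra_data: int = 0                         # hypothetical field name
    extra_ptr: Optional[object] = None          # hypothetical field name

def is_shareable(c: Constant,
                 sharing_enabled: bool = True,
                 string_sharing_enabled: bool = True) -> bool:
    if not sharing_enabled:
        return False
    # Blocking conditions that apply to every kind.
    if c.associated_entry is not None or c.extra_data != 0 or c.flags & 0x02:
        return False
    # Kind-specific rules.
    if c.constant_kind == KIND_STRING:
        return string_sharing_enabled
    if c.constant_kind == KIND_TEMPLATE:
        return c.extra_ptr is None
    if c.constant_kind in (KIND_AGGREGATE, KIND_TEMPLATE_PARAM):
        return False
    return True
```

For example, `is_shareable(Constant(constant_kind=1))` accepts a plain integer constant, while any aggregate is rejected regardless of its other fields.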

Hash Table Structure

The hash table is allocated during il_init (sub_5CFE20) as a 16,312-byte block (stored at qword_126F228), yielding 2039 bucket slots (16312 / 8 = 2039). Each bucket is a pointer to the head of a singly-linked chain of constant nodes.

Hash Table Layout (qword_126F228):

    +--------+--------+--------+     +--------+
    | slot 0 | slot 1 | slot 2 | ... |slot 2038|
    +--------+--------+--------+     +--------+
        |        |        |
        v        v        v
      const -> const -> NULL    (singly-linked chains)
       |
       v
      const -> NULL

Why 2039? The number 2039 is prime. A prime table size helps the modular-reduction step (hash % 2039) spread keys evenly across buckets even when the hash values share common factors or follow regular strides. The compiled code computes the modulus through an optimized multiply-and-shift sequence (multiply by 0x121456F, then shift) rather than a hardware division instruction.
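The arithmetic can be checked with a short Python sketch. It confirms the slot count and primality, then shows the general multiply-and-shift technique for a 32-bit input. The `MAGIC` and `SHIFT` values below are derived in the sketch from the textbook reciprocal construction; they are an assumption for illustration and are not claimed to match the constant or shift used in the binary.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for a four-digit number."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

BUCKETS = 16312 // 8                  # table bytes / 8-byte pointer slots

# Reciprocal construction for inputs below 2**32:
# 2**11 = 2048 is the first power of two >= 2039, so shift by 32 + 11.
SHIFT = 32 + 11
MAGIC = -(-(1 << SHIFT) // BUCKETS)   # ceil(2**SHIFT / 2039)

def mod_buckets(x: int) -> int:
    """x % 2039 for 0 <= x < 2**32 via multiply, shift, multiply, subtract."""
    q = (x * MAGIC) >> SHIFT          # equals x // 2039 over the stated range
    return x - q * BUCKETS
```

The quotient estimate is exact over the stated range because the rounding error in `MAGIC` is smaller than 2**11, which is the standard correctness condition for this construction.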

alloc_shareable_constant (sub_5D2390)

This is the entry point for all constant allocation when sharing is enabled. It implements a hash-table lookup with MRU (most recently used) reordering of the chain:

alloc_shareable_constant(local_constant) -> constant*:

    total_alloc_count++                              // qword_126F208

    if not sharing_enabled or not constant_is_shareable(local_constant):
        return alloc_constant(local_constant)        // fallback to non-shared alloc

    if local_constant has parent:
        // parent's shared pointer is already the canonical copy
        assert parent->type == 2
        return parent->shared_ptr

    // ---- hash table lookup ----
    hash = compute_constant_hash(local_constant)     // sub_5BE150
    bucket_index = hash % 2039
    bucket_ptr = &hash_table[bucket_index]

    prev = NULL
    curr = *bucket_ptr
    while curr != NULL:
        comparison_count++                           // qword_126F200
        if compare_constants(curr, local_constant, 0):
            // ---- HIT: MRU reorder ----
            if prev != NULL:
                // unlink curr from current position
                prev->next = curr->next
                // move curr to front of chain
                curr->next = *bucket_ptr
                *bucket_ptr = curr
            if curr is in same region:
                region_hit_count++                   // qword_126F218
            else:
                global_hit_count++                   // qword_126F220
            return curr
        prev = curr
        curr = curr->next

    // ---- MISS: allocate new canonical constant ----
    new_bucket_count++                               // qword_126F210
    new_constant = alloc_in_file_scope(184)          // sub_5E11C0 or sub_5E1620
    memcpy(new_constant, local_constant, 184)        // 11 x SSE + 8-byte tail
    clear_sharing_flags(new_constant)
    fixup_constant_references(new_constant)          // sub_5D39A0
    // link at head of chain
    new_constant->next = *bucket_ptr
    *bucket_ptr = new_constant
    return new_constant

MRU optimization rationale. When a hash bucket chain contains many constants (collisions), recently matched constants are likely to be matched again soon (temporal locality from template instantiation expanding the same types repeatedly). Moving the matched node to the front of the chain means a repeated lookup of the same constant succeeds on the first comparison instead of walking up to n chain entries.
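A minimal Python sketch of a chained hash table with move-to-front on hit is given here. The `Entry` and `SharingTable` names are hypothetical, Python's built-in `hash` and `==` stand in for compute_constant_hash and compare_constants, and region tracking is left out.

```python
from typing import List, Optional

BUCKETS = 2039

class Entry:
    def __init__(self, value):
        self.value = value
        self.next: Optional["Entry"] = None

class SharingTable:
    def __init__(self):
        self.slots: List[Optional[Entry]] = [None] * BUCKETS
        self.comparisons = 0
        self.misses = 0

    def intern(self, value) -> Entry:
        idx = hash(value) % BUCKETS
        prev, curr = None, self.slots[idx]
        while curr is not None:
            self.comparisons += 1
            if curr.value == value:
                if prev is not None:            # hit that is not already at the head
                    prev.next = curr.next       # unlink from current position
                    curr.next = self.slots[idx]
                    self.slots[idx] = curr      # move to front of chain
                return curr
            prev, curr = curr, curr.next
        self.misses += 1
        node = Entry(value)                     # miss: new canonical entry
        node.next = self.slots[idx]             # link at head of chain
        self.slots[idx] = node
        return node
```

With the integer keys 0, 2039, and 4078 (which all land in bucket 0), a lookup of 0 first walks three entries; after the move-to-front, the next lookup of 0 costs one comparison.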

Statistics counters. The sharing system maintains five counters for profiling:

Address	Counter	Meaning
`qword_126F200`	comparisons	Total `compare_constants` calls during sharing
`qword_126F208`	total_allocs	Total calls to `alloc_shareable_constant`
`qword_126F210`	new_buckets	Number of cache misses (new canonical entries)
`qword_126F218`	region_hits	Sharing hits where the existing constant is in the same region
`qword_126F220`	global_hits	Sharing hits where the existing constant is in a different region

String Constant Interning (sub_5DBAB0)

String literals receive a separate interning pass through intern_string_constant at 0x5DBAB0. This function reuses the same 2039-bucket hash table (qword_126F228) but with string-specific comparison logic:

intern_string_constant(string, context_a, context_b) -> constant*:

    hash = compute_constant_hash(string)
    bucket_index = hash % 2039

    // linear chain search with exact match (flag=1)
    for each entry in chain:
        if compare_constants(entry, local_constant, 1):   // strict mode
            move_to_front(entry)                           // MRU
            return entry

    // miss: allocate new string constant in file-scope region
    new = alloc_constant_with_source_sequence(ck_string)
    memcpy(new, local_constant, 184)
    new->string_data = alloc_string_storage(strlen(string) + 1)
    strcpy(new->string_data, string)
    clear_sharing_flags(new)
    fixup_constant_references(new)
    link_at_chain_head(bucket_index, new)
    free_local_constant(local_constant)
    return new

fixup_constant_references (sub_5D39A0)

After a constant is copied into the shared region, some of its internal pointers may still reference nodes in the source (non-shared) region. fixup_constant_references walks the constant's internal structure and redirects these dangling references:

If the constant's associated IL entry is not in the shared region, the back-pointer at offset +128 is cleared.
For template parameter constants (kind 12), sub-kinds 1 and 5-10 may embed expression trees at offsets +160/+168. If these expressions are not in the shared region, they are deep-copied via copy_expr_tree or reattached via attach_to_region.
For literal value constants (kind 15) with expression sub-kind 13, the constant kind is rewritten based on the expression's kind (expr kind 2 becomes const kind 2, etc.), effectively inlining the expression into the constant.

Part 3: The Deep Copy Engine

Why It Exists

Template instantiation requires cloning expression trees from template definitions while replacing template parameter references with the actual arguments provided at the instantiation site. This is not a simple memcpy -- every node in the tree must be visited, its pointers updated to reference the new region's copies, and template parameter nodes must be intercepted and replaced with substituted values. The deep copy engine provides this transformation.

Default argument expansion also uses the copy engine: when a function call omits an argument that has a default, the default's expression tree is cloned from the function declaration into the call site.

i_copy_expr_tree (sub_5D2F90)

The central expression copier. It takes an expression node, a flags word, and a substitution-list context, then returns a freshly allocated deep copy.

i_copy_expr_tree(src_expr, flags, subst_list) -> expr_node*:

    // shallow clone: allocate new node, copy fixed fields
    dest = allocate_expr_node_clone(src_expr)      // sub_5C28B0

    switch src_expr->kind:

        case 0  (null):           // no children to copy
        case 3  (entity ref):     // entity pointer is shared, not copied
        case 4  (type ref):       // type pointer is shared
        case 16 (sentinel):       // no data
        case 19 (template ref):   // entity pointer is shared
        case 20 (type constant):  // type pointer is shared
        case 22 (opaque ptr):     // shallow only
        case 30 (template spec):  // shallow only
            break                 // nothing beyond the shallow clone

        case 1 (operation):
            // recursively copy the operand linked list
            dest->operands = i_copy_list_of_expr_trees(src_expr->operands, flags, subst)

        case 2 (constant reference):
            // deep-copy the constant node
            dest->constant = i_copy_constant_full(src_expr->constant, NULL, flags, subst)

        case 5 (dynamic init):
            dest->init = i_copy_dynamic_init(src_expr->init, flags, subst)

        case 6 (call expression):
            // walk argument list, copy each argument expression
            dest->args = copy_arg_list(src_expr->args, flags, subst)

        case 7 (full expression info):
            // copy 6 sub-fields (type, scope, sub-expression, etc.)
            copy_full_expr_children(dest, src_expr, flags, subst)

        case 8 (template arguments):
            dest->type_list = copy_type_list(src_expr->type_list, flags)

        case 9 (pack expansion):
            dest->type = copy_type(src_expr->type)
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)

        case 10 (object lifetime):
            push_lifetime_scope()
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            attach_lifetime(dest)

        case 11, 23, 32, 33 (sub-expression list):
            dest->list = i_copy_list_of_expr_trees(src_expr->list, flags, subst)

        case 12, 14, 15 (typed value):
            // conditional copy based on value byte
            if src_expr->value_byte matches copy-condition:
                dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)

        case 13 (two-byte key):
            // no children beyond the key value

        case 17 (entity reference, copyable):
            if flags & 0x80:   // copy_entities mode
                dest->entity = alloc_constant_from_entity(src_expr->entity)

        case 18 (substitution slot):
            // look up in subst_list for replacement
            dest = resolve_from_substitution_list(subst_list, src_expr->index)

        case 21, 26, 27 (expression + list):
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            dest->list = i_copy_list_of_expr_trees(src_expr->list, flags, subst)

        case 24 (list + pointer):
            dest->list = i_copy_list_of_expr_trees(src_expr->list, flags, subst)
            dest->ptr = copy_pointer_target(src_expr->ptr)

        case 25 (expression + flags):
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            dest->flags |= propagated_flags

        case 28 (attributed expression):
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            // attribute flags are already copied in the shallow clone

        case 31 (expression + extracted pointer):
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            dest->extra = extract_pointer(src_expr)

        case 34 (constexpr fold):
            dest = copy_constexpr_fold(src_expr)    // sub_65AE50

        default:
            internal_error("i_copy_expr_tree: bad expr kind")

    // ---- post-copy entity resolution (LABEL_11) ----
    if flags & 0x10:   // resolve_refs mode
        for kinds 2, 3, 7, 19:
            resolve_entity_ref(dest)               // sub_5B3030

    return dest

Flags interpretation for the copy engine:

Bit	Mask	Meaning
4	`0x10`	`resolve_refs` -- after copying, resolve entity references through the symbol table
7	`0x80`	`copy_entities` -- copy entity nodes themselves (not just references to them)
12	`0x1000`	`mark_instantiated` -- stamp copied nodes with the instantiation flag
14	`0x4000`	`preserve_source_pos` -- carry source-position annotations from source to copy

i_copy_list_of_expr_trees (sub_5D38C0)

A helper that walks a linked list of expression nodes (connected via the next pointer at offset +16), copies each via i_copy_expr_tree, and links the copies into a new list:

i_copy_list_of_expr_trees(head, flags, subst) -> expr_node*:

    result_head = NULL
    result_tail = NULL

    curr = head
    while curr != NULL:
        copy = i_copy_expr_tree(curr, flags, subst)
        if result_head == NULL:
            result_head = copy
        else:
            result_tail->next = copy
        result_tail = copy
        curr = curr->next

    return result_head
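The head/tail list-building pattern, together with the mutual recursion between the node copier and the list copier, can be sketched in Python. The `Expr` class is a hypothetical reduction of an expression node to a kind, a child list, and a sibling link; flags and the substitution list are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(eq=False)
class Expr:
    kind: int
    operands: Optional["Expr"] = None   # head of the child list
    next: Optional["Expr"] = None       # sibling link

def copy_expr(src: Expr) -> Expr:
    dest = Expr(kind=src.kind)          # shallow clone of the fixed fields
    if src.operands is not None:
        dest.operands = copy_list(src.operands)
    return dest

def copy_list(head: Optional[Expr]) -> Optional[Expr]:
    result_head = result_tail = None
    curr = head
    while curr is not None:
        copy = copy_expr(curr)          # sibling link of the copy starts as None
        if result_head is None:
            result_head = copy
        else:
            result_tail.next = copy     # append in source order
        result_tail = copy
        curr = curr.next
    return result_head

def kinds(head: Optional[Expr]) -> List[int]:
    """Collect the kind of each node along the sibling chain."""
    out = []
    while head is not None:
        out.append(head.kind)
        head = head.next
    return out
```

Keeping a tail pointer makes each append constant-time and preserves the original ordering without a final reversal pass.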

i_copy_constant_full (sub_5D3B90)

The constant copier handles the 184-byte constant node and its recursive sub-structure. It maintains a substitution list to avoid duplicating shared type definitions across the copy tree.

i_copy_constant_full(src, dest_or_null, flags, subst_list) -> constant*:

    if dest_or_null:
        dest = dest_or_null                 // copy in place
    else:
        dest = alloc_constant_node()        // sub_5E11C0

    memcpy(dest, src, 184)                  // 11 x SSE + 8-byte tail
    clear_sharing_flag(dest, bit 2 at [5].byte[3])
    clear_sharing_flag(dest, bit 5 at [9].byte[3])

    switch dest->constant_kind:

        case 10 (aggregate):
            // walk linked list of sub-constants, deep-copy each
            for each element in dest->element_list:
                element = i_copy_constant_full(element, NULL, flags, subst_list)
                relink(element)

        case 11 (constant + scope):
            dest->sub_constant = i_copy_constant_full(
                dest->sub_constant, NULL, flags, subst_list)

        case 9 (dynamic init):
            dest->init = i_copy_dynamic_init(dest->init, flags, subst_list)

        case 6 (address constant with type definition):
            // substitution-list management: look up whether this type
            // has already been copied in this tree; if so, reuse the copy
            existing = lookup_in_subst_list(subst_list, src->type_def)
            if existing:
                dest->type_def = existing
            else:
                dest->type_def = copy_type(src->type_def)
                add_to_subst_list(subst_list, src->type_def, dest->type_def)

        case 12 (template parameter constant):
            switch dest->template_param_kind:
                case 0, 2, 3, 13: // no extra copy needed
                case 1:           dest->value_expr = copy_expr_tree(dest->value_expr)
                case 4, 12:       dest->inner = i_copy_constant_full(dest->inner, ...)
                case 5-10:        dest->extra_expr = copy_expr_tree(dest->extra_expr)
                case 11:          dest->type = copy_type(dest->type)
                                  dest->arg_list = copy_template_arg_list(dest->arg_list)

    fixup_constant_references(dest)         // sub_5D39A0
    return alloc_shareable_constant(dest)   // sub_5D2390 -- may deduplicate

Substitution list purpose. When copying an expression tree that references the same type definition in multiple sub-expressions (e.g., two occurrences of decltype(x) in a single template), the substitution list ensures both references point to the same copied type node, preserving the sharing relationship from the original tree.
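A small Python sketch of this idea follows, assuming the substitution list behaves like an association list keyed by node identity. `TypeDef` and `copy_type_once` are hypothetical names introduced for illustration.

```python
class TypeDef:
    def __init__(self, name: str):
        self.name = name

def copy_type_once(src: TypeDef, subst: list) -> TypeDef:
    """Return the copy already recorded for src, or create and record one."""
    for original, copied in subst:      # linear scan keyed by identity
        if original is src:
            return copied
    copied = TypeDef(src.name)
    subst.append((src, copied))
    return copied
```

Copying the same source type twice through one list yields the same copied node, so two references that were shared in the original tree remain shared in the copy.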

Public Wrappers

The i_copy_* functions are internal -- they take a substitution-list parameter that must be cleaned up after use. Public wrappers handle this lifecycle:

Wrapper	Address	Internal function	Purpose
`copy_expr_tree`	`sub_5D3940`	`i_copy_expr_tree`	Expression deep copy with auto-cleanup
`copy_constant_full`	`sub_5D4300`	`i_copy_constant_full`	Constant deep copy with auto-cleanup
`copy_dynamic_init`	`sub_5D4CF0`	`i_copy_dynamic_init`	Dynamic init deep copy with auto-cleanup
`copy_constant`	`sub_5D4D50`	`i_copy_constant_full`	Simple constant copy (flags=0)

Each wrapper allocates a local substitution list, calls the internal function, then appends the list entries to the global free list at qword_126F1E0.

Part 4: Template Parameter Substitution

Why It Exists

The deep copy engine (Part 3) performs a mechanical tree clone -- it duplicates structure but does not transform content. Template instantiation requires more: when the copier encounters a node referencing template parameter T, it must replace that node with the actual type argument (e.g., int). When it encounters sizeof(T), it must evaluate the expression with T=int and produce the constant 4. The template parameter substitution engine is the transformation layer that sits on top of the copy engine, intercepting template-parameter nodes and performing the substitution.
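The sizeof(T) example can be made concrete with a toy substitution pass in Python. The tuple-based expression encoding, the `SIZES` table (assumed sizes for a typical x86-64 ABI), and the `substitute` function are all hypothetical; they illustrate substitute-then-fold and are not the IL representation.

```python
SIZES = {"char": 1, "int": 4, "double": 8}   # assumed type sizes

def substitute(expr: tuple, bindings: dict) -> tuple:
    tag = expr[0]
    if tag == "param":                   # ("param", "T") -> bound type
        return ("type", bindings[expr[1]])
    if tag == "sizeof":                  # ("sizeof", operand)
        operand = substitute(expr[1], bindings)
        if operand[0] == "type":
            return ("const", SIZES[operand[1]])   # fold to a constant
        return ("sizeof", operand)
    if tag == "add":                     # ("add", lhs, rhs)
        lhs = substitute(expr[1], bindings)
        rhs = substitute(expr[2], bindings)
        if lhs[0] == "const" and rhs[0] == "const":
            return ("const", lhs[1] + rhs[1])
        return ("add", lhs, rhs)
    return expr                          # constants and concrete types pass through
```

With `{"T": "int"}` as the bindings, `("sizeof", ("param", "T"))` becomes `("const", 4)`, which matches the behaviour described above.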

copy_template_param_expr (sub_5DC000)

This is the central dispatcher for expression-level template substitution. At 1416 lines and 7872 bytes of compiled code, it is the largest single function in the comparison/copy subsystem. It takes up to 10 arguments:

copy_template_param_expr(
    expr_node*   a1,     // expression to substitute
    template_ctx a2,     // template argument context
    template_ctx a3,     // secondary context (for nested templates)
    type*        a4,     // expected result type
    scope*       a5,     // current scope
    int          a6,     // flags
    int*         a7,     // error_flag (output: set to 1 on failure)
    diag_info*   a8,     // diagnostics context
    constant*    a9,     // scratch constant (pre-allocated workspace)
    constant**   a10     // output constant pointer
) -> expr_node*          // substituted expression, or NULL (use a9/a10)

The function dispatches on expr->kind and, for operation nodes, further dispatches on the operation code:

copy_template_param_expr(expr, tctx, ...):

    switch expr->kind:

        case 0 (empty):
            return expr                       // pass through unchanged

        case 1 (operation):
            switch expr->op_code:

                case 116 (type expression):
                    return copy_template_param_type_expr(expr, tctx, ...)

                case 5 (cast):
                    // substitute the cast's target type
                    new_type = copy_template_param_type(expr->target_type, tctx)
                    // recursively substitute the operand
                    new_operand = copy_template_param_expr(expr->operands, tctx, ...)
                    return build_cast_node(new_type, new_operand)

                case 0, 25, 28, 29, 53-57, 71, 72, 87, 88, 103 (unary/simple binary):
                    // substitute operand(s) recursively
                    new_ops = substitute_operand_list(expr->operands, tctx)
                    return build_operation_node(expr->op_code, new_ops, new_type)

                case 26, 27, 39-43, 58-63 (binary with type check):
                    // substitute both operands
                    lhs = copy_template_param_expr(expr->operands[0], tctx, ...)
                    rhs = copy_template_param_expr(expr->operands[1], tctx, ...)
                    // post-substitution type promotion:
                    do_conversions_on_operands_of_copied_template_expr(op, &lhs, &rhs)
                    return build_operation_node(op, lhs, rhs, result_type)

                case 39 (ternary / conditional):
                    cond  = copy_template_param_expr(operands[0], ...)
                    true_ = copy_template_param_expr(operands[1], ...)
                    false_= copy_template_param_expr(operands[2], ...)
                    return build_conditional(cond, true_, false_)

                case 44, 45 (imaginary):
                    internal_error("imaginary operators not implemented")

        case 2 (constant reference):
            return copy_template_param_con(expr->constant, tctx, ...)

        case 3 (variable / entity reference):
            // look up the variable in the substitution context
            subst = find_substitution(tctx, expr->entity)
            if subst found:
                // check type compatibility between expected and actual
                if is_pointer_compatible(expected_type, subst->type):
                    return build_value_from_constant(subst)
                else:
                    apply_type_conversion(subst, expected_type)
            return expr   // no substitution needed

        case 5 (function call in constant context):
            // dispatch on call sub-kind
            switch expr->call_subkind:
                case 1: substitute type + validate value
                case 2: delegate to copy_template_param_con

        case 19 (template parameter reference):
            // same logic as case 3 for template parameters
            subst = find_substitution(tctx, expr->template_param)
            return build_substituted_expr(subst)

        case 20 (type constant):
            new_type = copy_template_param_type(expr->type, tctx)
            return build_type_constant_expr(new_type)

        case 21 (builtin operation):
            return copy_template_param_builtin_operation(expr, tctx, ...)
            // asserts no error in process

        case 22 (type reference):
            new_type = copy_template_param_type(expr->type, tctx)
            return build_type_ref_expr(new_type)

        case 23 (expression wrapper):
            inner = copy_template_param_expr(expr->inner, tctx, ...)
            return wrap(inner)

        case 30 (pack expansion):
            return expand_pack(expr, tctx, ...)

        case 31 (dependent entity):
            // complex hash-map based instantiation tracking
            // with get_with_hash / vector_insert / entity list processing
            return instantiate_dependent_entity(expr, tctx, ...)

Post-substitution type conversions. The internal helper do_conversions_on_operands_of_copied_template_expr (at il.c line 18885) handles arithmetic promotions that must occur after template parameter substitution. For example, T + U where T=int and U=double requires promoting the int operand to double -- this promotion is not present in the template definition's expression tree because the types are unknown there. The function handles:

Shift operators (ops 53-54): promote the result type to the promoted LHS type.
Comparison operators (ops 58-63): compute the common type and apply usual arithmetic conversions.
Arithmetic operators (default): compute common type via sub_5657C0 and insert implicit conversion nodes.
Imaginary operators (ops 44-45): explicitly not implemented (triggers internal_error).
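The T + U example can be sketched in Python. The ranking table is a deliberately simplified assumption (the real usual arithmetic conversions also cover signedness and integer promotion), and the `(type, expr)` pair encoding and `cast` node shape are hypothetical.

```python
RANK = {"int": 0, "long": 1, "float": 2, "double": 3}   # simplified ranking

def common_type(a: str, b: str) -> str:
    return a if RANK[a] >= RANK[b] else b

def convert_operands(lhs: tuple, rhs: tuple):
    """lhs and rhs are (type, expr) pairs; wrap the narrower one in a cast node."""
    target = common_type(lhs[0], rhs[0])

    def widen(op: tuple) -> tuple:
        if op[0] == target:
            return op                                   # already the common type
        return (target, ("cast", target, op[1]))        # insert implicit conversion

    return widen(lhs), widen(rhs), target
```

For T=int and U=double, the left operand gains a cast to double and the right operand is left untouched, which is the conversion node that cannot exist in the template definition's tree.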

copy_template_param_con (sub_5DE290)

The constant-level substitution dispatcher. At 819 lines, it handles the case where a constant node in a template definition contains a reference to a template parameter:

copy_template_param_con(constant, tctx, expected_type, scope, flags,
                        error_flag, diag, scratch) -> constant*:

    switch constant->constant_kind:

        case 12 (template parameter constant):
            // this is the core case -- the constant IS a template parameter
            switch constant->template_param_kind:

                case 0 (value parameter):
                    // look up the bound value in the template argument list
                    binding = lookup_template_arg(tctx, constant->param_index)
                    if binding is a pack:
                        return expand_pack_element(binding, ...)
                    return binding->value_constant

                case 1 (expression parameter):
                    // try overload resolution first
                    result = copy_template_param_con_overload_resolution(...)
                    if result: return result
                    // fall back to full expression-level substitution
                    return copy_template_param_expr(constant->expr, tctx, ...)

                case 2 (non-member entity parameter):
                    return copy_template_param_unknown_entity_con(constant, FALSE, ...)

                case 3 (member entity parameter):
                    return copy_template_param_unknown_entity_con(constant, TRUE, ...)

                case 4 (nested constant parameter):
                    return copy_template_param_con(constant->inner, tctx, ...)

                case 5-10 (scalar value parameters: sizeof, alignof, etc.):
                    // look up the substitution via sub_5BFB80
                    // perform type equality check
                    // apply type conversions if needed
                    return substituted_scalar_constant(...)

                case 11 (entity + argument pack):
                    // entity substitution with argument list processing
                    return substitute_entity_with_args(...)

                case 12 (nested recursive):
                    return copy_template_param_con(constant->inner, tctx, ...)

        case 6 (address constant):
            switch constant->address_kind:
                case 3 (function call):
                    // substitute callee type + each argument recursively
                    callee_type = copy_template_param_type(constant->callee_type, tctx)
                    for each arg in constant->args:
                        arg = copy_template_param_con(arg, tctx, ...)
                    return build_call_constant(callee_type, args)
                default:
                    if is_dependent_type(constant->type):
                        return deep_copy_constant(constant)
                    // handle address-space attribute patterns

        case 15 (expression constant):
            switch constant->expr_constant_kind:
                case 46 (strip_template_arg):
                    // dispatch on template argument type:
                    //   0 = type argument -> type substitution
                    //   1 = value argument -> value substitution
                    //   2 = template argument -> template substitution
                case 6:  return type_substitution(...)
                case 13: return non_type_param_substitution(...)
                case 2:  return recursive copy_template_param_con(inner, ...)

        default:
            internal_error("copy_template_param_con: unexpected kind")

copy_template_param_con_with_substitution (sub_5DFAD0)

The top-level entry point for template constant substitution, called from the template instantiation driver. It manages the IL region switch (moving allocation to file-scope for the duration of instantiation), handles the initial overload-resolution check, and performs post-substitution type normalization:

copy_template_param_con_with_substitution(constant, template_args, scope,
                                          expected_type, access, flags,
                                          error_flag, scratch):

    saved_region = current_region
    switch_to_file_scope_region()            // with debug trace

    local_scratch = alloc_local_constant()

    // ---- special case: overload resolution for expression parameters ----
    if constant->constant_kind == 12 and constant->template_param_kind == 1:
        overload_info = lookup_overload_candidate(constant)
        if overload_info:
            result = copy_template_param_con_overload_resolution(
                         constant, overload_info, tctx, ...)
            if result: goto post_process

    // ---- validate expected type ----
    if expected_type is pointer_type:
        validate_pointer_binding(expected_type)

    // ---- main substitution ----
    result = copy_template_param_expr(constant->expr, tctx, ...)
    // or: result = copy_template_param_con(constant, tctx, ...)
    // depending on whether the constant embeds an expression

    post_process:
    // ---- post-substitution type normalization ----
    if result->type is pointer_type:
        validate_binding(result)
        result = try_implicit_conversion(result)
    elif result->type is array_type:
        result = try_implicit_conversion(result)
        result = array_to_pointer_decay(result)
    elif result->type is function_type:
        result = try_implicit_conversion(result)
        result = function_to_pointer_decay(result)
    else:
        result = general_conversion(result)

    // ---- handle deferred instantiation ----
    if is_deferred_instantiation(result):
        copy_deferred_data_into_scratch(result, scratch)

    restore_region(saved_region)
    free_local_constant(local_scratch)
    return result

Supporting Functions

Function	Address	Lines	Purpose
`copy_template_param_type_expr`	`sub_5DDEB0`	82	Handles op=116 type expressions within template substitution; extracts and substitutes the type, checks dependent-type status
`copy_template_param_expr_list`	`sub_5DE010`	77	Iterates an expression linked list, calling `copy_template_param_expr` on each element; shares a single scratch constant across all iterations
`copy_template_param_value_expr`	`sub_5DE1A0`	55	Single-expression variant; passes the expression's own type as the expected type
`copy_template_param_con_overload_resolution`	`sub_5DF6A0`	180	Attempts overload resolution during template substitution when the template parameter refers to a set of overloaded functions; validates result type compatibility
`copy_template_param_unknown_entity_con`	`sub_5DB420`	213	Handles entity constants where the entity kind is not known until substitution time (using declarations, namespace aliases, variables, templates, types)

Part 5: Data Flow Between the Subsystems

The four subsystems interact in a specific calling pattern during template instantiation:

Template Instantiation Driver
  |
  +-> copy_template_param_con_with_substitution (entry point)
        |
        +-> copy_template_param_expr (expression-level dispatch)
        |     |
        |     +-> copy_template_param_con (constant-level dispatch)
        |     |     |
        |     |     +-> copy_template_param_unknown_entity_con
        |     |     +-> copy_template_param_con_overload_resolution
        |     |     +-> [recursive: copy_template_param_expr]
        |     |     +-> [recursive: copy_template_param_con]
        |     |
        |     +-> copy_template_param_type (type-level, in type.c)
        |     +-> copy_template_param_type_expr
        |     +-> copy_template_param_expr_list
        |     +-> copy_template_param_builtin_operation
        |
        +-> alloc_shareable_constant (deduplication on output)
              |
              +-> compare_constants (hash table equality check)
              +-> fixup_constant_references

The comparison engine is not called during the copy itself -- it is called only at the end, when the newly constructed constants are passed through alloc_shareable_constant for deduplication. This means the copy engine may temporarily create duplicate constants that are later merged by the sharing infrastructure. The design separates concerns: the copy engine focuses on correctness (producing a valid substituted tree), while the sharing engine focuses on efficiency (deduplicating identical results).
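The copy-then-deduplicate split can be illustrated with a toy interning table. This is a minimal sketch, not the recovered implementation: `toy_constant`, `toy_hash`, `toy_equal`, and `toy_alloc_shareable` are hypothetical names, the hash function is invented, and only the bucket count (2039) comes from the analysis.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BUCKETS 2039  /* bucket count reported for the constant sharing table */

/* Toy constant: hypothetical, far simpler than an EDG constant node. */
struct toy_constant {
    int kind;
    int64_t value;
    struct toy_constant *next;   /* hash chain link */
};

static struct toy_constant *table[BUCKETS];

/* Structural equality, standing in for compare_constants. */
int toy_equal(const struct toy_constant *a, const struct toy_constant *b) {
    return a->kind == b->kind && a->value == b->value;
}

/* Invented hash; the real hash function is not described here. */
unsigned toy_hash(const struct toy_constant *c) {
    uint64_t h = (uint64_t)c->value * 31u + (uint64_t)c->kind;
    return (unsigned)(h % BUCKETS);
}

/* Returns the canonical node for *scratch: an existing structurally
 * equal node if one is already in the table, otherwise a fresh copy. */
struct toy_constant *toy_alloc_shareable(const struct toy_constant *scratch) {
    assert(scratch != NULL);
    unsigned b = toy_hash(scratch);
    for (struct toy_constant *p = table[b]; p; p = p->next)
        if (toy_equal(p, scratch))
            return p;                 /* duplicate merged here */
    struct toy_constant *n = malloc(sizeof *n);
    if (!n)
        return NULL;
    *n = *scratch;
    n->next = table[b];
    table[b] = n;
    return n;
}
```

The copy engine would fill a scratch node without consulting the table; equality is only evaluated inside `toy_alloc_shareable`, which is why duplicates can exist briefly before they are merged.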

Part 6: Initialization and Reset

il_one_time_init (sub_5CF7F0)

Called once at program startup. Validates that each of seven name-table arrays ends with the "last" sentinel string, checks the sizeof_il_entry guard value (9999), and initializes 60+ allocation pools via pool_init (sub_7A3C00) with element sizes ranging from 1 byte to 1344 bytes. Conditionally initializes C++-mode pools (guarded by dword_106BF68 || dword_106BF58).

il_init (sub_5CFE20)

Called at the start of each translation unit. Zeroes all global pool heads, allocates and zeroes the two hash tables:

Character type table: 3240 bytes at qword_126F2F8 (5 character types x 81 slots = 405 entries, 8 bytes each).
Constant sharing table: 16312 bytes at qword_126F228 (2039 buckets, 8 bytes each).

Sets the three sharing mode bytes (byte_126E558, byte_126E559, byte_126E55A) to 3 (all sharing enabled), and tail-calls il_init_float_constants (sub_5EAF00).
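The table sizes follow directly from the entry counts. The sketch below reproduces the arithmetic and the zero-initialization; `struct il_tables` and the function names are hypothetical (the binary uses separate globals), and the 8-byte entry width is taken from the sizes quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum {
    CHAR_TYPE_KINDS = 5,     /* character types */
    SLOTS_PER_KIND  = 81,    /* slots per character type */
    SHARING_BUCKETS = 2039,  /* constant sharing buckets */
    ENTRY_BYTES     = 8,     /* one pointer-sized entry on x86-64 */
    SHARE_ALL       = 3      /* value written to the sharing mode bytes */
};

/* Hypothetical container for the per-TU state described above. */
struct il_tables {
    uint64_t *char_type_table;
    uint64_t *constant_sharing_table;
    unsigned char sharing_mode[3];
};

size_t char_type_table_bytes(void) {
    return (size_t)CHAR_TYPE_KINDS * SLOTS_PER_KIND * ENTRY_BYTES;
}

size_t sharing_table_bytes(void) {
    return (size_t)SHARING_BUCKETS * ENTRY_BYTES;
}

/* Allocates both tables zero-filled and enables all sharing modes.
 * Returns 0 on success, -1 if either allocation fails. */
int il_tables_init(struct il_tables *t) {
    assert(t != NULL);
    t->char_type_table = calloc(1, char_type_table_bytes());
    t->constant_sharing_table = calloc(1, sharing_table_bytes());
    if (!t->char_type_table || !t->constant_sharing_table) {
        free(t->char_type_table);
        free(t->constant_sharing_table);
        return -1;
    }
    for (int i = 0; i < 3; i++)
        t->sharing_mode[i] = SHARE_ALL;
    return 0;
}
```

`char_type_table_bytes()` evaluates to 5 x 81 x 8 = 3240 and `sharing_table_bytes()` to 2039 x 8 = 16312, matching the allocation sizes seen in the decompilation.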

il_reset_secondary_pools (sub_5D0170)

Zeroes ~80 qword globals in the 0x126F680--0x126F978 range. These are transient counters, list heads, and cached type pointers used during template instantiation. Called separately from il_init, suggesting it resets state between instantiation passes within the same translation unit.

Address Map

Address	Function	Lines	Role
`0x5CF7F0`	`il_one_time_init`	~200	One-time startup validation + pool init
`0x5CFE20`	`il_init`	~100	Per-TU hash table allocation + state reset
`0x5D0170`	`il_reset_secondary_pools`	~40	Reset instantiation-transient state
`0x5D0750`	`compare_expressions`	588	Expression tree structural equality
`0x5D1320`	`compare_expressions_for_equivalence`	~10	Thin wrapper (flags=4)
`0x5D1350`	`compare_constants`	525	Constant structural equality, 16 kinds
`0x5D1FE0`	`compare_dynamic_inits`	~80	Dynamic init comparison
`0x5D2160`	`compare_constants_default`	~5	Thin wrapper (flags=0)
`0x5D2170`	`expr_tree_contains_template_param_constant`	~50	Template param presence check
`0x5D2210`	`constant_is_shareable`	~100	Shareability predicate
`0x5D2390`	`alloc_shareable_constant`	~200	Hash table deduplication allocator
`0x5D2890`	`alloc_il_entry_from_constant`	~20	Wraps constant in IL entry
`0x5D2F90`	`i_copy_expr_tree`	424	Expression tree deep copy (35-case switch)
`0x5D38C0`	`i_copy_list_of_expr_trees`	~40	Linked-list copy helper
`0x5D3940`	`copy_expr_tree`	~30	Public wrapper with cleanup
`0x5D39A0`	`fixup_constant_references`	~80	Post-copy pointer fixup
`0x5D3B90`	`i_copy_constant_full`	305	Constant deep copy (16-kind switch)
`0x5D4300`	`copy_constant_full`	~20	Public wrapper with cleanup
`0x5D47A0`	`i_copy_dynamic_init`	~150	Dynamic init deep copy
`0x5D4C00`	`copy_lambda_capture`	~60	Lambda capture list copy
`0x5D4DB0`	`alloc_constant`	~150	Non-shared constant allocation with kind-specific cleanup
`0x5DBAB0`	`intern_string_constant`	~92	String literal interning via hash table
`0x5DC000`	`copy_template_param_expr`	1416	Template substitution -- expression dispatcher
`0x5DDEB0`	`copy_template_param_type_expr`	82	Template substitution -- type expressions
`0x5DE010`	`copy_template_param_expr_list`	77	Template substitution -- expression list
`0x5DE1A0`	`copy_template_param_value_expr`	55	Template substitution -- single value expr
`0x5DE290`	`copy_template_param_con`	819	Template substitution -- constant dispatcher
`0x5DF6A0`	`copy_template_param_con_overload_resolution`	180	Template substitution -- overload resolution
`0x5DFAD0`	`copy_template_param_con_with_substitution`	288	Template substitution -- top-level entry

.int.c File Format

When cudafe++ processes a CUDA source file, the backend code generator emits a transformed C++ translation called the .int.c file (short for "intermediate C"). This is the host-side output that the downstream host compiler (GCC, Clang, or MSVC) will compile. The file preserves all host-visible declarations from the original source but replaces device code with stubs, injects CUDA runtime boilerplate, and appends registration tables and anonymous namespace support. The entire emission is driven by process_file_scope_entities (sub_489000), a 723-line function in cp_gen_be.c that serves as the backend entry point. It initializes output state, opens the output stream, emits a fixed sequence of preamble sections, walks the EDG intermediate language source sequence to generate the transformed C++ body, then appends a fixed trailer with _NV_ANON_NAMESPACE handling, #pragma pack() for MSVC, and CUDA host reference arrays.

Key Facts

Property	Value
Backend entry point	`sub_489000` (`process_file_scope_entities`, 723 lines)
EDG source file	`cp_gen_be.c` (lines 19916-26628)
Default output name	`<input>.int.c` (via `sub_5ADD90` string concatenation)
Output override global	`qword_106BF20` (set by CLI flag `gen_c_file_name`, case 45)
Stdout sentinel	`"-"` (output filename compared character-by-character)
Output stream global	`stream` (FILE pointer at fixed address)
Line counter	`dword_1065820` (incremented on every `\n`)
Column counter	`dword_106581C` (character position within current line)
Indent level	`dword_1065834` (decremented with `--` around directive blocks)
Needs-line-directive flag	`dword_1065818` (triggers `#line` emission before next output)
Source sequence cursor	`qword_1065748` (current IL entry being processed)
Device stub mode toggle	`dword_1065850` (0=normal, 1=generating `__wrapper__device_stub_`)
Empty file guard string	`"int __dummy_to_avoid_empty_file;"` at `0x83AED8`
Anon namespace macro string	`"_NV_ANON_NAMESPACE"` at `0x83AF45`
Managed RT boilerplate	inline `static` functions for `__managed__` variable support

Output File Naming

The output filename is determined by three inputs, checked in order:

// sub_489000, decompiled lines 153-177
char *input_name = qword_126EEE0;   // source filename from CLI

// 1. Check for stdout mode
if (strcmp(input_name, "-") == 0) {
    stream = stdout;
}
else {
    // 2. Check for explicit output name override
    char *output_name = qword_106BF20;
    if (!output_name)
        // 3. Default: append ".int.c" to input filename
        output_name = sub_5ADD90(input_name, ".int.c");

    stream = sub_4F48F0(output_name, 0, 0, 0, 1701);  // open for writing
}

The - sentinel enables piping cudafe++ output to stdout for debugging or toolchain integration. The qword_106BF20 override is set by the gen_c_file_name CLI option (case 45 in the CLI parser at sub_459630), allowing nvcc to specify an explicit output path. The default .int.c suffix means a file kernel.cu produces kernel.cu.int.c.
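The selection order can be restated as a small standalone function. This is a sketch that mirrors the decompiled snippet above, not the binary's code: `choose_output_name` and its return codes are hypothetical, and the caller-supplied buffer replaces the allocation done by sub_5ADD90.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { OUT_FILE = 0, OUT_STDOUT = 1, OUT_TOO_LONG = -1 };

/* Decides where the .int.c output goes.
 *   input_name    : source filename from the command line
 *   override_name : explicit output path, or NULL if none was given
 * On OUT_FILE the chosen path is written to buf. */
int choose_output_name(const char *input_name, const char *override_name,
                       char *buf, size_t size) {
    assert(input_name != NULL && buf != NULL && size > 0);
    buf[0] = '\0';

    /* 1. "-" selects stdout and no file is opened. */
    if (strcmp(input_name, "-") == 0)
        return OUT_STDOUT;

    /* 2. An explicit override wins; 3. otherwise append ".int.c". */
    int n = override_name
        ? snprintf(buf, size, "%s", override_name)
        : snprintf(buf, size, "%s.int.c", input_name);

    return (n < 0 || (size_t)n >= size) ? OUT_TOO_LONG : OUT_FILE;
}
```

With `input_name = "kernel.cu"` and no override, the buffer receives `kernel.cu.int.c`; the suffix is appended to the full input name rather than replacing the `.cu` extension.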

Complete .int.c File Structure

A fully-generated .int.c file follows this fixed section ordering, top to bottom:

+------------------------------------------------------------------+
| 1. #line directive (initial source position)                     |
+------------------------------------------------------------------+
| 2. #pragma GCC diagnostic ignored "-Wunused-local-typedefs"      |
|    #pragma GCC diagnostic ignored "-Wattributes"                 |
+------------------------------------------------------------------+
| 3. #pragma GCC diagnostic push                                   |
|    #pragma GCC diagnostic ignored "-Wunused-variable"            |
|    #pragma GCC diagnostic ignored "-Wunused-function"            |
+------------------------------------------------------------------+
| 4. Managed runtime boilerplate                                   |
|    (static __nv_inited_managed_rt, __nv_init_managed_rt, etc.)   |
+------------------------------------------------------------------+
| 5. #pragma GCC diagnostic pop                                    |
+------------------------------------------------------------------+
| 6. #pragma GCC diagnostic ignored "-Wunused-variable"            |
|    #pragma GCC diagnostic ignored "-Wunused-private-field"       |
|    #pragma GCC diagnostic ignored "-Wunused-parameter"           |
+------------------------------------------------------------------+
| 7. Extended lambda macro definitions (or #define false stubs)    |
+------------------------------------------------------------------+
| 8. MAIN BODY: transformed C++ from source sequence walk          |
|    - #include "crt/host_runtime.h" (injected at first CUDA type) |
|    - Device stubs for __global__ kernels                         |
|    - #if 0 / #endif around device-only code                     |
|    - All host-visible declarations, types, functions             |
+------------------------------------------------------------------+
| 9. Empty file guard (if no entities generated)                   |
+------------------------------------------------------------------+
| 10. Breakpoint placeholders (if deferred list non-empty)         |
+------------------------------------------------------------------+
| 11. _NV_ANON_NAMESPACE define / include / undef trick            |
+------------------------------------------------------------------+
| 12. #pragma pack() (MSVC only)                                   |
+------------------------------------------------------------------+
| 13. Module ID file output (if dword_106BFB8 set)                 |
+------------------------------------------------------------------+
| 14. Host reference arrays (.nvHRKI, .nvHRDE, etc.)               |
+------------------------------------------------------------------+

Section 1: Initial #line Directive

After opening the output stream, sub_489000 emits a #line directive via sub_46D1A0 to establish the initial source mapping. This directive points the host compiler's diagnostic messages back to the original .cu file:

// sub_489000, decompiled lines 283-287
sub_46D1A0(v10, v11);  // emit #line <number> "<filename>"

The #line directive format depends on the host compiler. For GCC/Clang hosts (dword_126E1F8 set), the line keyword is omitted (producing # 1 "file.cu"). For MSVC hosts (dword_126E1D8 set), the full #line 1 "file.cu" form is used. This pattern recurs throughout the file wherever source position changes.

Sections 2-6: Diagnostic Suppressions

The preamble contains a layered set of #pragma GCC diagnostic directives that suppress warnings the host compiler would otherwise emit on the generated code. The exact set depends on which host compiler is active and its version.

Suppression Decisions

The conditions controlling each suppression are checked against host compiler identification globals:

Global	Meaning
`dword_126E1E8`	Host is Clang
`dword_126E1F8`	Host is GCC (including Clang in GCC-compat mode)
`dword_126E1D8`	Host is MSVC
`qword_126EF90`	Clang version number
`qword_126E1F0`	GCC/Clang version number
`dword_106BF6C`	Alternative host compiler mode
`dword_106BF68`	Secondary host compiler flag

-Wunused-local-typedefs

Emitted early, outside any push/pop scope:

// sub_489000, decompiled lines 182-187
if ((dword_126E1E8 && qword_126EF90 > 0x7787)   // Clang > 30599
    || (!dword_106BF6C && !dword_106BF68
        && dword_126E1F8 && qword_126E1F0 > 0x9F5F))  // GCC > 40799
{
    emit("#pragma GCC diagnostic ignored \"-Wunused-local-typedefs\"");
}

This targets GCC 4.8+ (version > 40799) and Clang 3.6+ (version > 30599), the releases in which each compiler gained its unused-local-typedef warning. CUDA template machinery frequently creates local typedefs that are used only by device code (suppressed in #if 0 blocks), triggering spurious warnings.
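The thresholds are easier to read once decoded. The sketch below assumes the version globals use a `major*10000 + minor*100 + patch` encoding; that encoding is inferred from the constants, not confirmed by the binary. `struct host_info` and both function names are hypothetical stand-ins for the globals in the table above.

```c
#include <assert.h>

/* Assumed encoding: major*10000 + minor*100 + patch (4.8.0 -> 40800). */
long encode_version(int major, int minor, int patch) {
    assert(minor >= 0 && minor < 100 && patch >= 0 && patch < 100);
    return (long)major * 10000 + (long)minor * 100 + patch;
}

/* Hypothetical bundle of the host-identification globals. */
struct host_info {
    int is_clang;        /* models dword_126E1E8 */
    int is_gcc_like;     /* models dword_126E1F8 */
    int alt_mode;        /* models dword_106BF6C */
    int secondary_flag;  /* models dword_106BF68 */
    long clang_version;  /* models qword_126EF90 */
    long gcc_version;    /* models qword_126E1F0 */
};

/* Same condition as the decompiled check for -Wunused-local-typedefs. */
int wants_unused_local_typedefs_pragma(const struct host_info *h) {
    return (h->is_clang && h->clang_version > 0x7787)
        || (!h->alt_mode && !h->secondary_flag
            && h->is_gcc_like && h->gcc_version > 0x9F5F);
}
```

Under that encoding, 0x7787 is 30599 (3.5.99) and 0x9F5F is 40799 (4.7.99), so the strict `>` comparisons admit Clang 3.6.0 and GCC 4.8.0 as the first qualifying releases.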

-Wattributes

// sub_489000, decompiled lines 188-189
if (dword_126EFA8 && dword_106C07C)
    emit("\n#pragma GCC diagnostic ignored \"-Wattributes\"\n");

Suppresses warnings about unknown or ignored __attribute__ annotations. Emitted when CUDA-specific attribute processing is active (dword_126EFA8) and a secondary flag (dword_106C07C) indicates the host compiler would reject CUDA-specific attributes.

Push/Pop Block with -Wunused-variable and -Wunused-function

The managed runtime boilerplate (section 4) is wrapped in a diagnostic push/pop block:

// sub_489000, decompiled lines 190-234
emit("#pragma GCC diagnostic push");
emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");
emit("#pragma GCC diagnostic ignored \"-Wunused-function\"");

// ... managed runtime boilerplate here ...

emit("#pragma GCC diagnostic pop");

The push/pop scope isolates these suppressions to the managed runtime code. The pragmas are emitted only when the host is Clang (dword_126E1E8) or GCC newer than 4.5.99 (qword_126E1F0 > 0x9E97, i.e. > 40599); the boilerplate between them is emitted either way. The managed runtime functions are static and may be unused in translation units without __managed__ variables, which is what would otherwise trigger the two warnings.

Post-Pop File-Level Suppressions

After the pop, additional file-scoped suppressions are emitted that remain active for the rest of the file:

// sub_489000, decompiled lines 243-250
emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"\n");

if (dword_126E1E8) {  // Clang only
    emit("#pragma GCC diagnostic ignored \"-Wunused-private-field\"\n");
    emit("#pragma GCC diagnostic ignored \"-Wunused-parameter\"\n");
}

The -Wunused-private-field and -Wunused-parameter suppressions are Clang-specific. GCC does not have -Wunused-private-field, and GCC's -Wunused-parameter behavior differs.

Summary of All Suppressions

Warning	Scope	Host Compiler	Version Threshold
`-Wunused-local-typedefs`	File-level	Clang, GCC	Clang > 30599, GCC > 40799
`-Wattributes`	File-level	GCC/Clang	When CUDA attrs active
`-Wunused-variable`	Push/pop block	Clang, GCC > 40599	Around managed RT only
`-Wunused-function`	Push/pop block	Clang, GCC > 40599	Around managed RT only
`-Wunused-variable`	File-level	Clang, GCC >= 40199	Rest of file
`-Wunused-private-field`	File-level	Clang only	Always
`-Wunused-parameter`	File-level	Clang only	Always

Section 7: Extended Lambda Macros

When extended lambda mode is NOT active (dword_106BF38 == 0), three stub macros are defined:

// sub_489000, decompiled lines 259-264
emit("#define __nv_is_extended_device_lambda_closure_type(X) false\n");
emit("#define __nv_is_extended_host_device_lambda_closure_type(X) false\n");
emit("#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false\n");
emit("#if defined(__nv_is_extended_device_lambda_closure_type)"
     " && defined(__nv_is_extended_host_device_lambda_closure_type)"
     "&& defined(__nv_is_extended_device_lambda_with_preserved_return_type)\n"
     "#endif\n");

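These macros are consumed by crt/host_runtime.h to conditionally compile lambda wrapper infrastructure. When extended lambdas are disabled, all three evaluate to false, causing the runtime header to skip lambda wrapper code. The #if defined(...) && defined(...) block that immediately follows has an empty body, so it cannot produce a diagnostic by itself: if a macro were missing, the condition would simply evaluate to false. Its only visible effect is to reference all three macros in a preprocessor expression, plausibly so that the host compiler treats them as used (for example under -Wunused-macros); that motive is an inference, not something the binary states.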

When extended lambda mode IS active (dword_106BF38 != 0), these defines are skipped entirely. The lambda preamble injection system (via sub_6BCC20) provides the real implementations later in the main body.

Section 8: Main Body -- Source Sequence Walk

The main body is generated by iterating the global source sequence list (qword_1065748), which is a linked list of EDG IL entries representing every top-level declaration in the translation unit. For each entry, the backend dispatches to sub_47ECC0 (gen_template / process_source_sequence), which handles all declaration kinds:

// sub_489000, decompiled lines 288-316 (simplified)
while (qword_1065748) {
    entry = qword_1065748;
    kind = entry->kind;  // byte at offset +16

    if (kind == 57) {
        // Pragma interleaving -- handled inline
        handle_pragma(entry);
    } else if (kind == 52) {
        // End-of-construct -- should not appear at top level
        fatal_error("Top-level end-of-construct entry");
    } else {
        entities_generated = 1;
        sub_47ECC0(0);  // gen_template at recursion level 0
    }
}

During this walk, several CUDA-specific injections occur:

#include "crt/host_runtime.h" -- injected by sub_4864F0 (gen_type_decl) or sub_47ECC0 when the first CUDA-tagged entity at global scope is encountered. The flag dword_E85700 prevents duplicate inclusion.
Device stub pairs -- __global__ kernel functions trigger two calls to gen_routine_decl (sub_47BFD0): first the forwarding body, then the static cudaLaunchKernel placeholder, controlled by the dword_1065850 toggle.
#if 0 / #endif guards -- device-only declarations are wrapped in preprocessor guards to hide them from the host compiler.
Interleaved pragmas -- source sequence entries of kind 57 represent #pragma directives from the original source (including #pragma pack, #pragma STDC, and user pragmas), which are re-emitted at their original positions.
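The dispatch in the walk can be modelled on a plain linked list. This is a sketch under simplifying assumptions: `struct seq_entry` keeps only a kind byte and a next pointer, the cursor is advanced by the loop itself (in the binary each handler presumably advances qword_1065748), and the counters in `struct walk_stats` are hypothetical. Only the kind values 52 and 57 come from the decompilation.

```c
#include <assert.h>
#include <stddef.h>

enum { KIND_END_OF_CONSTRUCT = 52, KIND_PRAGMA = 57 };

/* Simplified source-sequence entry (hypothetical layout). */
struct seq_entry {
    unsigned char kind;
    struct seq_entry *next;
};

struct walk_stats {
    int pragmas;             /* kind-57 entries re-emitted in place */
    int declarations;        /* entries handed to the declaration generator */
    int entities_generated;  /* stays 0 if only pragmas were seen */
    int malformed;           /* set if a top-level end-of-construct appears */
};

/* Walks the list once and classifies each entry the way the top-level
 * loop does; stops at the first malformed entry. */
struct walk_stats walk_source_sequence(const struct seq_entry *cursor) {
    struct walk_stats s = {0, 0, 0, 0};
    while (cursor) {
        if (cursor->kind == KIND_PRAGMA) {
            s.pragmas++;
        } else if (cursor->kind == KIND_END_OF_CONSTRUCT) {
            s.malformed = 1;   /* the binary raises a fatal error here */
            break;
        } else {
            s.entities_generated = 1;
            s.declarations++;
        }
        cursor = cursor->next;
    }
    return s;
}
```

A sequence consisting only of pragma entries leaves `entities_generated` at 0, which is the condition the empty file guard tests afterwards.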

Section 9: Empty File Guard

If the source sequence walk produced no entities (the entities_generated flag, v12 in the decompilation, is still 0) and the compilation is not in pure CUDA mode (dword_126EFB4 != 2), a dummy declaration is emitted to prevent the host compiler from rejecting an empty translation unit:

// sub_489000, decompiled lines 565-569
if (!entities_generated && dword_126EFB4 != 2) {
    emit("int __dummy_to_avoid_empty_file;");
    newline();
}

ISO C requires a translation unit to contain at least one external declaration, and some host compilers (notably older GCC versions) warn or error on a completely empty file. The int __dummy_to_avoid_empty_file; line is a minimal valid C/C++ declaration that avoids this.

Section 10: Breakpoint Placeholders

When the deferred function list (qword_1065840) is non-empty, the backend emits one breakpoint placeholder function per entry. These are used for debugger support in whole-program compilation mode:

// sub_489000, decompiled lines 573-651 (simplified)
node = qword_1065840;  // linked list of deferred functions
index = 0;
while (node) {
    emit("static __attribute__((used)) void __nv_breakpoint_placeholder");
    emit_decimal(index);
    putc('_', stream);
    if (node->name)
        emit(node->name);
    emit("(void) ");

    // Set source position from node
    set_source_position(node->source_start);
    emit("{ ");
    set_source_position(node->source_end);
    emit("exit(0); }");

    node = node->next;
    index++;
}

Each placeholder has the form static __attribute__((used)) void __nv_breakpoint_placeholderN_funcname(void) { exit(0); }. The __attribute__((used)) tells the host compiler to emit each function even though nothing in the translation unit references it; without it, an unreferenced static function would normally be discarded. The debugger uses their addresses to set breakpoints on device functions that have been stripped from the host binary.

The deferred list is populated by gen_routine_decl when dword_106BFBC (whole-program mode) is set and dword_106BFDC is clear -- device-only functions that need host-side breakpoint anchors are pushed onto this list rather than receiving dummy bodies inline.
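The text of one placeholder can be reproduced with a single format string. This is a sketch only: `format_breakpoint_placeholder` is a hypothetical helper, and it leaves out the source-position updates that the real emitter interleaves between the signature and the body.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats placeholder number `index` for function `name` into buf.
 * A NULL name yields an empty suffix after the underscore, as in the
 * loop above. Returns the length written, or -1 if buf is too small. */
int format_breakpoint_placeholder(char *buf, size_t size,
                                  unsigned index, const char *name) {
    assert(buf != NULL && size > 0);
    int n = snprintf(buf, size,
        "static __attribute__((used)) void "
        "__nv_breakpoint_placeholder%u_%s(void) { exit(0); }",
        index, name ? name : "");
    if (n < 0 || (size_t)n >= size)
        return -1;
    return (int)strlen(buf);
}
```

For index 0 and name `my_kernel` this yields `static __attribute__((used)) void __nv_breakpoint_placeholder0_my_kernel(void) { exit(0); }`.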

Section 11: _NV_ANON_NAMESPACE Trick

The trailer contains a five-step sequence (plus an MSVC-only #pragma pack() reset) that handles C++ anonymous namespace mangling for CUDA. Anonymous namespaces in C++ create translation-unit-local symbols, but CUDA device code requires globally unique symbol names (because device code from multiple TUs is linked together by the device linker). The _NV_ANON_NAMESPACE mechanism assigns a globally unique identifier to each TU's anonymous namespace.

Step-by-Step Emission

// sub_489000, decompiled lines 654-710

// Step 1: #line back to original source
emit("#");
if (!dword_126E1F8)  // MSVC: include "line" keyword
    emit("line");
emit(" 1 \"");
emit(path_transform(qword_106BF88));  // original source file path
emit("\"");
newline();

// Step 2: #define _NV_ANON_NAMESPACE <hash>
emit("#define ");
emit("_NV_ANON_NAMESPACE");
emit(" ");
emit(sub_6BC7E0());   // generate unique hash string
newline();

// Step 3: #ifdef / #endif (empty conditional; emits no code of its own)
emit("#ifdef ");
emit("_NV_ANON_NAMESPACE");
newline();
emit("#endif");
newline();

// Step 3b: #pragma pack() for MSVC
if (dword_126E1D8) {   // MSVC host
    emit("#pragma pack()");
    newline();
}

// Step 4: #include "<original_file>"
emit("#");
if (!dword_126E1F8)
    emit("line");
emit(" 1 \"");
emit(path_transform(qword_106BF88));
emit("\"");
newline();
emit("#include ");
emit("\"");
emit(path_transform(qword_106BF88));
emit("\"");
newline();

// Step 5: Reset #line and #undef
emit("#");
if (!dword_126E1F8)
    emit("line");
emit(" 1 \"");
emit(path_transform(qword_106BF88));
emit("\"");
newline();
emit("#undef ");
emit("_NV_ANON_NAMESPACE");
newline();

The Hash Generator (sub_6BC7E0)

The _NV_ANON_NAMESPACE value is produced by sub_6BC7E0, which constructs the string _GLOBAL__N_ followed by the module ID hash:

// sub_6BC7E0 (20 lines)
if (cached_result)
    return cached_result;

char *module_id = sub_5AF830(0);   // compute CRC32-based module ID
size_t len = strlen(module_id);
char *result = allocate(len + 12);
strcpy(result, "_GLOBAL__N_");
strcpy(result + 11, module_id);
cached_result = result;
return result;

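The module ID (sub_5AF830) is a CRC32-based hash incorporating the source filename, compiler options, file modification time, and process ID. This produces values like _GLOBAL__N_1a2b3c4d5e6f7890 (an illustrative value). Because the result is cached, it is stable for the duration of one compilation and distinct enough to avoid collisions between TUs. If the modification time and process ID really do feed the hash, the identifier will differ between otherwise identical builds, so it should not be relied on for bit-reproducible output.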
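To make "CRC32-based" concrete, the sketch below computes a standard reflected CRC-32 over a seed string and appends it to the `_GLOBAL__N_` prefix. It is an illustration only: the seed contents, the 8-hex-digit suffix, and the names `crc32_bytes` and `make_anon_namespace_name` are assumptions, and the way the binary actually combines its inputs is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Standard reflected CRC-32 (polynomial 0xEDB88320), computed bit by
 * bit for brevity rather than with a lookup table. */
uint32_t crc32_bytes(const void *data, size_t len) {
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Builds "_GLOBAL__N_" followed by the CRC of `seed` as 8 hex digits.
 * What goes into `seed` is an assumption of this sketch.
 * Returns the length written, or -1 if buf is too small. */
int make_anon_namespace_name(const char *seed, char *buf, size_t size) {
    assert(seed != NULL && buf != NULL && size > 0);
    unsigned id = (unsigned)crc32_bytes(seed, strlen(seed));
    int n = snprintf(buf, size, "_GLOBAL__N_%08x", id);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

The prefix is 11 characters long, which is why sub_6BC7E0 allocates `len + 12` bytes (prefix, module ID, terminating NUL) and copies the module ID to offset 11.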

Why the Define/Include/Undef Sequence

The three-step define/include/undef pattern serves a specific purpose:

#define _NV_ANON_NAMESPACE <hash> -- establishes the macro before the source file is re-included.
#include "<original_file>" -- re-includes the original .cu source. During this second inclusion, any code inside anonymous namespaces that uses _NV_ANON_NAMESPACE gets the unique hash substituted, producing globally unique symbol names for device code.
#undef _NV_ANON_NAMESPACE -- cleans up the macro after inclusion.

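The #ifdef _NV_ANON_NAMESPACE / #endif block between define and include has an empty body, so it does not verify anything at compile time: if the macro were undefined, the block would simply be skipped without a diagnostic. Its only effect is to reference the macro once before the re-inclusion, plausibly so that the host compiler treats it as used; that motive is an inference from the emitted text.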

This mechanism works in conjunction with the EDG frontend's anonymous namespace handling. When the frontend encounters namespace { ... } containing device code, it generates references to _NV_ANON_NAMESPACE that become concrete identifiers during the re-inclusion pass. The name mangling in the demangler (sub_7CA140, sub_7C5650, sub_7C4E80) also uses _NV_ANON_NAMESPACE to produce consistent mangled names.
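Putting the steps together, the trailer text for a given source path can be sketched as one formatted write. `emit_anon_namespace_trailer` is a hypothetical helper: it skips the path transformation applied to qword_106BF88, takes the host flags as plain parameters, and terminates every directive with a newline.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the define / include / undef trailer into buf.
 *   gcc_like : nonzero for "# N" directives, zero for "#line N"
 *   msvc     : nonzero to emit the "#pragma pack()" reset
 * Returns the length written, or -1 if buf is too small. */
int emit_anon_namespace_trailer(char *buf, size_t size,
                                const char *src, const char *anon_name,
                                int gcc_like, int msvc) {
    assert(buf != NULL && src != NULL && anon_name != NULL && size > 0);
    const char *line = gcc_like ? "#" : "#line";
    int n = snprintf(buf, size,
        "%s 1 \"%s\"\n"                       /* step 1 */
        "#define _NV_ANON_NAMESPACE %s\n"     /* step 2 */
        "#ifdef _NV_ANON_NAMESPACE\n"         /* step 3 */
        "#endif\n"
        "%s"                                  /* step 3b (MSVC only) */
        "%s 1 \"%s\"\n"                       /* step 4 */
        "#include \"%s\"\n"
        "%s 1 \"%s\"\n"                       /* step 5 */
        "#undef _NV_ANON_NAMESPACE\n",
        line, src,
        anon_name,
        msvc ? "#pragma pack()\n" : "",
        line, src,
        src,
        line, src);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return (int)strlen(buf);
}
```

Called with `gcc_like = 1` and `msvc = 0`, the helper produces eight lines: the `# 1` directive, the `#define`, the empty `#ifdef`/`#endif` pair, a second `# 1` directive, the `#include`, a third `# 1` directive, and the `#undef`.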

Section 12: #pragma pack() for MSVC

When the host compiler is MSVC (dword_126E1D8 set), a bare #pragma pack() is emitted to reset the packing alignment to the compiler default:

// sub_489000, decompiled lines 676-681
if (dword_126E1D8) {
    emit("#pragma pack()");
    newline();
}

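This reset ensures that any #pragma pack(N) directives from the original source or from included CUDA headers do not leak into the code that follows in the same translation unit, which here is the re-included source file. In the emitted order the reset sits between the #ifdef/#endif block and the #include (Step 3b above), not after the #undef. No equivalent reset is emitted for GCC/Clang hosts, so this emission is MSVC-specific.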

Sections 13-14: Module ID and Host Reference Arrays

The final two sections are conditional:

Module ID output (sub_5B0180): When dword_106BFB8 is set, the module ID string (the same CRC32-based hash from sub_5AF830) is written to a separate file. This ID is used by the CUDA runtime to match host-side registration code with the device fatbinary.

Host reference arrays (sub_6BCF80): When dword_106BFD0 (device registration) or dword_106BFCC (constant registration) is set, six calls to sub_6BCF80 emit ELF section declarations for host reference arrays:

// sub_489000, decompiled lines 713-721
// nv_emit_host_reference_array(emit_fn, is_kernel, is_device, is_internal)
sub_6BCF80(emit_callback, 1, 0, 1);  // kernel,   internal  -> .nvHRKI
sub_6BCF80(emit_callback, 1, 0, 0);  // kernel,   external  -> .nvHRKE
sub_6BCF80(emit_callback, 0, 1, 1);  // device,   internal  -> .nvHRDI
sub_6BCF80(emit_callback, 0, 1, 0);  // device,   external  -> .nvHRDE
sub_6BCF80(emit_callback, 0, 0, 1);  // constant, internal  -> .nvHRCI
sub_6BCF80(emit_callback, 0, 0, 0);  // constant, external  -> .nvHRCE

These produce extern "C" declarations with __attribute__((section(".nvHRXX"))) annotations, where XX is one of KE, KI, DE, DI, CE, CI (Kernel/Device/Constant + External/Internal). The arrays contain mangled names of device symbols, enabling the CUDA runtime to locate and register them at program startup.
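The mapping from the three flag arguments to a section name is mechanical and can be sketched as below. `host_ref_section` is a hypothetical helper; only the flag order and the six section names come from the call list above.

```c
#include <assert.h>
#include <string.h>

/* Builds the section name for one (is_kernel, is_device, is_internal)
 * combination, in the argument order used by the sub_6BCF80 calls.
 * `out` must have room for 8 bytes (".nvHR" + 2 letters + NUL). */
void host_ref_section(int is_kernel, int is_device, int is_internal,
                      char out[8]) {
    assert(out != NULL);
    assert(!(is_kernel && is_device));   /* combination never used */
    memcpy(out, ".nvHR", 5);
    out[5] = is_kernel ? 'K' : (is_device ? 'D' : 'C');
    out[6] = is_internal ? 'I' : 'E';
    out[7] = '\0';
}
```

With neither `is_kernel` nor `is_device` set, the entry is treated as a constant, which matches the last two calls in the list.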

Complete Example

For a source file kernel.cu containing a single __global__ kernel function and a host function, the generated kernel.cu.int.c looks approximately like this:

# 1 "kernel.cu"
#pragma GCC diagnostic ignored "-Wunused-local-typedefs"
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-function"
static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) {
    __nv_fatbinhandle_for_managed_rt = in;
}
static char __nv_init_managed_rt_with_module(void **);
static inline void __nv_init_managed_rt(void) {
    __nv_inited_managed_rt = (__nv_inited_managed_rt
        ? __nv_inited_managed_rt
        : __nv_init_managed_rt_with_module(
            __nv_fatbinhandle_for_managed_rt));
}
#pragma GCC diagnostic pop
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-private-field"
#pragma GCC diagnostic ignored "-Wunused-parameter"
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
#if defined(__nv_is_extended_device_lambda_closure_type) \
 && defined(__nv_is_extended_host_device_lambda_closure_type) \
 && defined(__nv_is_extended_device_lambda_with_preserved_return_type)
#endif

/* === main body begins here === */
#include "crt/host_runtime.h"

# 5 "kernel.cu"
void host_function(int *data, int n) {
    for (int i = 0; i < n; i++) data[i] *= 2;
}
# 10 "kernel.cu"
void my_kernel(float *data, int n) {
    ::__wrapper__device_stub_my_kernel(data, n);
    return;
}
#if 0
/* original __global__ kernel body suppressed */
#endif
static void __wrapper__device_stub_my_kernel(float *data, int n) {
    ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
/* === main body ends === */

# 1 "kernel.cu"
#define _NV_ANON_NAMESPACE _GLOBAL__N_a1b2c3d4e5f67890
#ifdef _NV_ANON_NAMESPACE
#endif
# 1 "kernel.cu"
#include "kernel.cu"
# 1 "kernel.cu"
#undef _NV_ANON_NAMESPACE

Initialization State

Before emitting any output, sub_489000 zeroes all output-related global state and clears four large hash tables with memset: three of 0x7FFE0 bytes (524,256, just under 512 KB) and one of 0x5FFE8 bytes (393,192, just under 384 KB). It also sets up a function pointer table (xmmword_1065760 through xmmword_10657B0) containing code generation callbacks:

// sub_489000, decompiled lines 62-97 (summarized)
dword_1065834 = 0;         // indent level
stream = NULL;             // output file handle
dword_1065820 = 0;         // line counter
dword_106581C = 0;         // column counter
dword_1065818 = 0;         // needs-line-directive
qword_1065748 = 0;         // source sequence cursor
qword_1065740 = 0;         // alternate cursor
dword_1065850 = 0;         // device stub mode

// Clear four hash tables (three of 524,256 bytes, one of 393,192 bytes)
memset(&unk_FE5700, 0, 0x7FFE0);   // 524,256 bytes
memset(&unk_F65720, 0, 0x7FFE0);
memset(qword_E85720, 0, 0x7FFE0);
memset(&xmmword_F05720, 0, 0x5FFE8);  // 393,192 bytes (smaller)

// Callback setup
if (!dword_126DFF0)                     // not MSVC mode
    qword_10657C0 = sub_46BEE0;        // gen_be callback
qword_10657C8 = loc_469200;            // line directive callback
qword_10657D0 = sub_466F40;            // output callback
qword_10657D8 = sub_4686C0;            // error callback

#line Directive Protocol

Throughout the file, #line directives maintain the mapping between generated output and original source positions. The emission protocol differs by host compiler:

Host Compiler	#line Format	Example
GCC / Clang	`# <line> "<file>"`	`# 42 "kernel.cu"`
MSVC	`#line <line> "<file>"`	`#line 42 "kernel.cu"`

The dword_1065818 flag (needs_line_directive) is set whenever the current source position changes. Before emitting the next declaration or statement, sub_467DA0 checks this flag and emits a #line directive if needed, then clears the flag. The source position is tracked in two globals: qword_1065810 (pending position) and qword_126EDE8 (current position).
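The flag-and-flush protocol can be sketched as follows. `struct line_state` and both function names are hypothetical, and the two position globals are collapsed into a single pending file/line pair for brevity.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of the line-directive state. */
struct line_state {
    int gcc_like;         /* models dword_126E1F8: omit the "line" keyword */
    int needs_directive;  /* models dword_1065818 */
    int line;             /* pending source line */
    const char *file;     /* pending source file */
};

/* Records a new source position and marks a directive as pending. */
void note_source_position(struct line_state *s, const char *file, int line) {
    assert(s != NULL && file != NULL);
    s->file = file;
    s->line = line;
    s->needs_directive = 1;
}

/* Emits the pending directive into buf, if any, and clears the flag.
 * Returns the length written, 0 if nothing was pending, or -1 if buf
 * is too small (the flag then stays set). */
int flush_line_directive(struct line_state *s, char *buf, size_t size) {
    assert(s != NULL && buf != NULL && size > 0);
    buf[0] = '\0';
    if (!s->needs_directive)
        return 0;
    int n = snprintf(buf, size, "%s %d \"%s\"\n",
                     s->gcc_like ? "#" : "#line", s->line, s->file);
    if (n < 0 || (size_t)n >= size)
        return -1;
    s->needs_directive = 0;
    return (int)strlen(buf);
}
```

Calling the flush twice in a row emits the directive only once, which is the property that keeps redundant #line lines out of the output when several declarations share a source position.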

Function Map

Address	Name	Role
`sub_489000`	`process_file_scope_entities`	Backend entry point; orchestrates entire .int.c emission
`sub_47ECC0`	`gen_template` / `process_source_sequence`	Walks source sequence, dispatches all declaration kinds
`sub_47BFD0`	`gen_routine_decl`	Function declaration/definition generator; kernel stub logic
`sub_4864F0`	`gen_type_decl`	Type declaration generator; injects `#include "crt/host_runtime.h"`
`sub_484A40`	`gen_variable_decl`	Variable declaration generator; managed memory registration
`sub_467E50`	(emit string)	Primary string emission to output stream
`sub_468190`	(emit raw string)	Raw string emission without line directive check
`sub_46BC80`	(emit directive)	Emits `#if` / `#endif` preprocessor lines
`sub_467DA0`	(emit line directive)	Conditionally emits `#line` when `dword_1065818` is set
`sub_467D60`	(emit newline)	Emits newline and flushes pending line directive
`sub_46CF20`	(emit source position)	Sets source position for next `#line` directive
`sub_5ADD90`	(string concat)	Concatenates input filename with `.int.c` extension
`sub_4F48F0`	(file open)	Opens output file for writing (mode 1701)
`sub_6BC7E0`	(anon namespace hash)	Generates `_GLOBAL__N_<module_id>` string
`sub_5AF830`	`make_module_id`	CRC32-based unique TU identifier
`sub_5B0180`	`write_module_id_to_file`	Writes module ID to separate file
`sub_6BCF80`	`nv_emit_host_reference_array`	Emits `.nvHRKE`/`.nvHRDI`/etc. ELF sections
`sub_4F7B10`	(file close)	Closes output stream (mode 1701)

Cross-References

Kernel Stub Generation -- detailed stub mechanism using dword_1065850 toggle
Device/Host Separation -- how device-only code gets #if 0 guards
CUDA Runtime Boilerplate -- managed memory initialization functions
Host Reference Arrays -- .nvHRKI/.nvHRDE section format
Module ID & Registration -- CRC32 hash computation details
Pipeline Overview -- where backend generation fits in the 7-stage pipeline
Extended Lambda Overview -- lambda macro definitions and preamble injection

CUDA Runtime Boilerplate

Every .int.c file emitted by cudafe++ contains a fixed block of CUDA runtime initialization code, injected unconditionally before the main body. This boilerplate implements lazy initialization of the CUDA managed memory runtime and defines macro stubs for the extended lambda detection system. The managed runtime block is always emitted regardless of whether the translation unit uses __managed__ variables -- the static flag __nv_inited_managed_rt ensures the runtime is initialized at most once, and the static linkage prevents symbol conflicts across translation units. The lambda detection macros provide a compile-time protocol between cudafe++ and crt/host_runtime.h: the runtime header inspects these macros to decide whether to compile lambda wrapper infrastructure.

Key Facts

Property	Value
Emitter function	`sub_489000` (`process_file_scope_entities`, line 218)
Managed RT string address	`0x83AAC8` (243 bytes)
Init function string address	`0x83ABC0` (210 bytes)
Managed access wrapper string	`0x839570` (65 bytes)
Access wrapper emitters	`sub_4768F0` (`gen_name_ref`, xref at `0x476DCF`), `sub_484940` (`gen_variable_name`, xref at `0x484A08`)
Lambda stub macros string	`0x83AD10`, `0x83AD50`, `0x83AD98`
Lambda existence check string	`0x83ADE8` (194 bytes)
Extended lambda mode flag	`dword_106BF38` (`extended_lambda_mode`)
Alternative host flag	`dword_106BF6C` (`alternative_host_compiler_mode`)
`__cudaPushCallConfiguration` lookup	`sub_511D40` (`scan_expr_full`), string at `0x899213`
Push config error message	`0x88CA48`, error code 3654
Managed variable detection	`(*(_WORD *)(entity + 148) & 0x101) == 0x101`
EDG source file	`cp_gen_be.c`

Managed Memory Runtime Initialization

Static Variables Block

The first emission at line 218 of sub_489000 outputs four declarations as a single string literal:

static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) {
    __nv_fatbinhandle_for_managed_rt = in;
}
static char __nv_init_managed_rt_with_module(void **);

These are emitted verbatim from a single string at 0x83AAC8:

"static char __nv_inited_managed_rt = 0; static void **__nv_fatbinhandle_for_managed_rt;
 static void __nv_save_fatbinhandle_for_managed_rt(void **in)
 {__nv_fatbinhandle_for_managed_rt = in;} static char __nv_init_managed_rt_with_module(void **);"

Each component serves a specific role:

Symbol	Type	Purpose
`__nv_inited_managed_rt`	`static char`	Guard flag: 0 = not initialized, nonzero = initialized
`__nv_fatbinhandle_for_managed_rt`	`static void**`	Cached fatbinary handle, set during `__cudaRegisterFatBinary`
`__nv_save_fatbinhandle_for_managed_rt`	`static void (void**)`	Stores the fatbin handle for later use by the init function
`__nv_init_managed_rt_with_module`	`static char (void**)`	Forward declaration -- defined by `crt/host_runtime.h`

The forward declaration of __nv_init_managed_rt_with_module is critical: this function is provided by the CUDA runtime headers (crt/host_runtime.h) and performs the actual CUDA runtime API calls to register managed variables with the unified memory system. By forward-declaring it here, the managed runtime boilerplate can reference it before the header is #included later in the file.

Lazy Initialization Function

Immediately after the static block, sub_489000 emits the __nv_init_managed_rt inline function. The emission has a conditional prefix:

// sub_489000, decompiled lines 221-224
if (dword_106BF6C)   // alternative host compiler mode
    emit("__attribute__((unused)) ");

emit(" static inline void __nv_init_managed_rt(void) {"
     " __nv_inited_managed_rt = (__nv_inited_managed_rt"
     " ? __nv_inited_managed_rt"
     "                 : __nv_init_managed_rt_with_module("
     "__nv_fatbinhandle_for_managed_rt));}");

When dword_106BF6C (alternative host compiler mode) is set, the function is prefixed with __attribute__((unused)) to suppress "defined but not used" warnings on host compilers that do not understand CUDA semantics.

The emitted function, reformatted for readability:

static inline void __nv_init_managed_rt(void) {
    __nv_inited_managed_rt = (
        __nv_inited_managed_rt
            ? __nv_inited_managed_rt
            : __nv_init_managed_rt_with_module(
                  __nv_fatbinhandle_for_managed_rt)
    );
}

This is a lazy initialization pattern. On first call, __nv_inited_managed_rt is 0 (falsy), so the ternary takes the false branch and calls __nv_init_managed_rt_with_module. That function performs CUDA runtime registration and returns a nonzero value which is stored back into __nv_inited_managed_rt. On subsequent calls, the ternary short-circuits and returns the existing value without re-initializing. The function is static inline to allow the host compiler to inline it at every managed variable access site, and static to avoid symbol collisions across translation units.
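
To see the pattern run outside of a CUDA build, here is a small host-only model. All names are illustrative stand-ins for the `__nv_`-prefixed symbols, the init function is a mock (the real one comes from crt/host_runtime.h), and `init_calls` is instrumentation added for this sketch only.

```cpp
#include <cassert>

// Illustrative stand-ins for the emitted statics.
static char inited_managed_rt = 0;
static void** fatbinhandle_for_managed_rt = nullptr;
static int init_calls = 0;  // instrumentation only, not part of the emitted code

// Mock of __nv_init_managed_rt_with_module.
static char init_managed_rt_with_module(void** handle) {
    (void)handle;      // the real function registers managed variables here
    ++init_calls;
    return 1;          // nonzero marks the runtime as initialized
}

// Same ternary shape as the emitted __nv_init_managed_rt.
static inline void init_managed_rt() {
    inited_managed_rt = (inited_managed_rt
                             ? inited_managed_rt
                             : init_managed_rt_with_module(fatbinhandle_for_managed_rt));
}
```

Calling `init_managed_rt()` any number of times leaves `init_calls` at 1, provided the mock returns a nonzero value as the real function is described to do.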

Runtime Registration Flow

The complete managed memory initialization sequence spans the compilation pipeline:

1. cudafe++ emits __nv_save_fatbinhandle_for_managed_rt() definition
2. cudafe++ emits forward decl of __nv_init_managed_rt_with_module()
3. cudafe++ emits __nv_init_managed_rt() with lazy init pattern
4. #include "crt/host_runtime.h" provides __nv_init_managed_rt_with_module()
5. __cudaRegisterFatBinary() calls __nv_save_fatbinhandle_for_managed_rt()
   to cache the fatbin handle
6. First access to any __managed__ variable triggers __nv_init_managed_rt()
7. __nv_init_managed_rt_with_module() calls __cudaRegisterManagedVariable()
   for every __managed__ variable in the TU

Managed Variable Access Transformation

When the backend encounters a reference to a __managed__ variable during code generation, it wraps the access in a comma-operator expression that forces lazy initialization. This transformation is performed by two functions:

sub_4768F0 (gen_name_ref, xref at 0x476DCF) -- handles qualified name references
sub_484940 (gen_variable_name, xref at 0x484A08) -- handles direct variable name emission

Detection Condition

Both functions detect __managed__ variables using the same bitfield test:

// sub_484940, decompiled line 11
if ((*(_WORD *)(entity + 148) & 0x101) == 0x101)

This tests two bits simultaneously as a 16-bit word read at offset 148:

Byte	Bit	Mask	Meaning
`+148`	bit 0	`0x01`	`__device__` memory space
`+149`	bit 0	`0x01` (reads as `0x100` in word)	`__managed__` flag

The combined mask 0x101 matches when both __device__ and __managed__ are set. The __managed__ attribute handler (sub_40E0D0, apply_nv_managed_attr) always sets both bits: __managed__ implies the variable resides in device global memory (__device__), with the additional unified-memory semantics.
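
The two-bit test can be reproduced against a mock entity buffer. The offset and mask come from the text above; the helper name and the byte buffer are assumptions made for illustration. The word is assembled byte by byte so the sketch gives the same answer as the little-endian read on x86-64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Offset and mask are taken from the text; the buffer is a mock entity node.
constexpr std::size_t kMemorySpaceOffset = 148;
constexpr std::uint16_t kDeviceAndManaged = 0x101;

// Builds the 16-bit word at +148 (low byte) / +149 (high byte) and tests
// both flag bits at once.
inline bool is_managed_variable(const unsigned char* entity) {
    assert(entity != nullptr);
    const std::uint16_t word = static_cast<std::uint16_t>(
        entity[kMemorySpaceOffset] | (entity[kMemorySpaceOffset + 1] << 8));
    return (word & kDeviceAndManaged) == kDeviceAndManaged;
}
```

Setting only byte +148 or only byte +149 leaves the predicate false; it becomes true only when both low bits are set.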

Emitted Wrapper

When the condition matches, the emitter outputs a prefix string from 0x839570:

(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (

After the variable name is emitted normally, the suffix ))) closes the expression. The complete transformed access for a managed variable managed_var becomes:

(*( (__nv_inited_managed_rt ? (void)0 : __nv_init_managed_rt()), (managed_var)))

Breaking down the expression:

Outer *(...) -- dereferences the result (the managed variable is accessed through a pointer after initialization)
Comma operator (init_expr, (managed_var)) -- evaluates the init expression for its side effect, then yields the variable
Ternary __nv_inited_managed_rt ? (void)0 : __nv_init_managed_rt() -- lazy init guard: if already initialized, the ternary evaluates to (void)0 (no-op). Otherwise, calls __nv_init_managed_rt() which performs runtime registration

This pattern guarantees that any access to any __managed__ variable triggers runtime initialization exactly once, regardless of access order. The comma operator ensures the initialization is a sequenced side effect evaluated before the variable access.
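
The sequencing argument can be checked with a host-only model of the wrapper. The names below are illustrative, and the sketch assumes the host side sees the managed variable as a pointer, which is what the outer dereference suggests; it is not taken from the generated output itself.

```cpp
#include <cassert>

static char inited = 0;       // stands in for __nv_inited_managed_rt
static int init_calls = 0;    // instrumentation only

static void init_rt() {       // stands in for __nv_init_managed_rt()
    ++init_calls;
    inited = 1;
}

static int storage = 0;
static int* managed_var = &storage;  // assumption: host side sees a pointer

// Same shape as the emitted wrapper, with the illustrative names above.
#define MANAGED_ACCESS(v) (*( (inited ? (void)0 : init_rt()), (v)))

static int touch_twice() {
    MANAGED_ACCESS(managed_var) = 7;         // first access runs init_rt()
    return MANAGED_ACCESS(managed_var) + 1;  // second access skips it
}
```

The wrapped expression is an lvalue, so it works on both sides of an assignment, and the comma operator sequences the guard before the pointer is read.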

sub_4768F0 (gen_name_ref) -- Qualified Access Path

The name reference generator at sub_4768F0 handles the more complex case where the variable access includes scope qualification (::, template arguments, member access):

// sub_4768F0, decompiled lines 160-163
if (!v7 && a3 == 7 && (*(_WORD *)(v9 + 148) & 0x101) == 0x101) {
    v13 = 1;  // flag: need closing )))
    emit("(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (");
    // ... emit qualified name with scope resolution ...
}

The condition a3 == 7 indicates the entity is a variable (IL entry kind 7). The !v7 check (v7 = a4, the fourth parameter) gates on whether the access is from a context that already handles initialization. The v13 flag tracks whether the closing ))) needs to be emitted after the complete name expression:

// sub_4768F0, decompiled lines 231-236
if (v13) {
    emit(")))");
    return 1;
}

sub_484940 (gen_variable_name) -- Direct Access Path

The direct variable name emitter at sub_484940 follows the same pattern but with a simpler structure:

// sub_484940, decompiled lines 10-15
v1 = 0;
if ((*(_WORD *)(a1 + 148) & 0x101) == 0x101) {
    v1 = 1;  // flag: need closing )))
    emit("(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (");
}

// ... emit variable name (possibly anonymous, templated, etc.) ...

if (v1) {
    emit(")))");
    return;
}

This function handles three variable name forms:

Thread-local variables (byte +163 bit 7 set) -- emits "this" string (4 characters via inline loop)
Anonymous variables (byte +165 bit 2 set) -- dispatches to sub_483A80 for generated name emission
Regular variables -- dispatches to sub_472730 (gen_expression_or_name, mode 7)

The managed wrapper is applied around all three forms.
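
Viewed purely as text generation, both emitters bracket the generated name with the same prefix and suffix. A minimal sketch follows, where `emit_variable_name` is a hypothetical helper that returns a string instead of writing to the output stream:

```cpp
#include <cassert>
#include <string>

// Prefix text as quoted in the document (string at 0x839570).
static const char kManagedPrefix[] =
    "(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (";

// Hypothetical helper: wraps an already-generated name when is_managed is set.
inline std::string emit_variable_name(const std::string& name, bool is_managed) {
    assert(!name.empty());
    std::string out;
    if (is_managed) out += kManagedPrefix;
    out += name;
    if (is_managed) out += ")))";  // closes the three parentheses left open
    return out;
}
```

For a non-managed variable the name passes through unchanged, which matches the flag-controlled suffix emission in both decompiled functions.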

__cudaPushCallConfiguration Lookup

When cudafe++ processes a CUDA kernel launch expression (kernel<<<grid, block, shmem, stream>>>(args...)), the frontend must locate the __cudaPushCallConfiguration runtime function to lower the <<<>>> syntax into standard C++ function calls. This lookup occurs in sub_511D40 (scan_expr_full), the 80KB expression scanner.

Lookup Mechanism

At case 0x48 (decimal 72, the token for kernel launch <<<), the scanner performs a name lookup:

// sub_511D40, decompiled lines 1999-2006
sub_72EEF0("__cudaPushCallConfiguration", 0x1B);   // inject name into scope
v206 = sub_698940(v255, 0);                         // lookup the declaration

if (!v206 || *(_BYTE *)(v206 + 80) != 11) {        // not found or not a function
    sub_4F8200(0x0B, 3654, &qword_126DD38);         // emit error 3654
}

The lookup calls sub_72EEF0 to insert the identifier __cudaPushCallConfiguration (27 bytes, 0x1B) into the current scope context, then sub_698940 performs the actual name resolution. If the declaration is not found (!v206) or the entity at offset +80 is not a function (kind != 11), error 3654 is emitted.

Error 3654

The error string at 0x88CA48:

unable to find __cudaPushCallConfiguration declaration.
CUDA toolkit installation may be corrupt.

This error indicates that the CUDA runtime headers have not been properly included or that the toolkit installation is broken. The __cudaPushCallConfiguration function is declared in crt/device_runtime.h (included transitively through crt/host_runtime.h), so this error should only appear if the include paths are misconfigured.

The error is emitted with severity 0x0B (11), which maps to a fatal error -- compilation cannot continue without this function because every kernel launch depends on it.

Kernel Launch Lowering

After successful lookup, the scanner builds an AST node representing the lowered kernel launch. The <<<grid, block, shmem, stream>>> syntax is transformed into:

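// Conceptual lowering:
if (__cudaPushCallConfiguration(grid, block, shmem, stream) == 0) {
    // configuration pushed successfully; the call goes to the host-side stub
    kernel(args...);
}
// a nonzero result means the push failed and the stub is not called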

Error 3655 (emitted at line 2019) handles the case where the call configuration push succeeds syntactically but the stream argument is missing in contexts that require it. The string for this is "explicit stream argument not provided in kernel launch".
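
A host-only model of the lowered launch is sketched below. It assumes the ternary shape commonly seen in nvcc-generated host code, where the stub is called only when the push returns zero; that shape is an assumption here, not something read from this binary. `Dim3`, `mock_push_call_configuration`, and `kernel_stub` are mocks.

```cpp
#include <cassert>

struct Dim3 { unsigned x = 1, y = 1, z = 1; };  // stand-in for dim3

static int push_result = 0;   // test knob: value the mock push returns
static int launches = 0;      // counts calls to the mock stub

// Mock of the runtime entry point; the real one is __cudaPushCallConfiguration.
static unsigned mock_push_call_configuration(Dim3, Dim3, unsigned long, void*) {
    return static_cast<unsigned>(push_result);
}

static void kernel_stub(int) { ++launches; }  // stands in for the host stub

// Lowered form of kernel<<<grid, block>>>(arg): call the stub only if push == 0.
static void launch(Dim3 grid, Dim3 block, int arg) {
    (mock_push_call_configuration(grid, block, 0, nullptr)) ? (void)0
                                                            : kernel_stub(arg);
}
```

With `push_result` left at 0 the stub runs; with any nonzero value the launch is skipped.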

Lambda Detection Macros

Default Stub Macros (No Extended Lambdas)

When dword_106BF38 (extended_lambda_mode) is 0, sub_489000 emits three macro definitions that evaluate to false, followed by an existence check:

// sub_489000, decompiled lines 259-264
emit("#define __nv_is_extended_device_lambda_closure_type(X) false\n");
emit("#define __nv_is_extended_host_device_lambda_closure_type(X) false\n");
emit("#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false\n");
emit("#if defined(__nv_is_extended_device_lambda_closure_type)"
     " && defined(__nv_is_extended_host_device_lambda_closure_type)"
     "&& defined(__nv_is_extended_device_lambda_with_preserved_return_type)\n"
     "#endif\n");

Verbatim emitted code:

#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
#if defined(__nv_is_extended_device_lambda_closure_type) && defined(__nv_is_extended_host_device_lambda_closure_type)&& defined(__nv_is_extended_device_lambda_with_preserved_return_type)
#endif

Note the missing space before && in the second conjunction -- this is exactly how the string appears in the binary at 0x83ADE8. The #if defined(...) block is an existence check with an empty body: the #endif follows immediately, so the block contributes nothing to the preprocessed output whether or not the condition holds. It only tests that all three macro names are defined at this point, before crt/host_runtime.h is included.

These macros are consumed by crt/host_runtime.h to conditionally compile lambda wrapper infrastructure. When all three evaluate to false, the runtime header skips device lambda wrapper template instantiation, host-device lambda wrapper instantiation, and trailing-return-type lambda handling.

Trait-Based Macros (Extended Lambdas Active)

When dword_106BF38 is nonzero (--extended-lambda or --expt-extended-lambda CLI flag), the stub macros are NOT emitted. Instead, the lambda preamble emitter sub_6BCC20 (nv_emit_lambda_preamble) provides trait-based implementations later in the file body. The decision is made at line 256 of sub_489000:

// sub_489000, decompiled lines 251-264
if (dword_106BF38)        // extended lambdas enabled?
    goto LABEL_38;        // skip stub macros, jump to next section
// else: emit stubs
emit("#define __nv_is_extended_device_lambda_closure_type(X) false\n");
// ...

The trait-based implementations emitted by sub_6BCC20 use template specialization rather than preprocessor macros. Each macro is #define'd to invoke a type trait helper:

Device lambda detection (string at 0xA82CF8):

template <typename T>
struct __nv_extended_device_lambda_trait_helper {
  static const bool value = false;
};
template <typename T1, typename...Pack>
struct __nv_extended_device_lambda_trait_helper<__nv_dl_wrapper_t<T1, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_device_lambda_closure_type(X) \
    __nv_extended_device_lambda_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

Preserved return type detection (string at 0xA82F68):

template <typename T>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper {
  static const bool value = false;
};
template <typename U, U func, typename Return, unsigned Id, typename...Pack>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper<
    __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) \
    __nv_extended_device_lambda_with_trailing_return_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type >::value

Host-device lambda detection (string at 0xA831B0):

template <typename>
struct __nv_extended_host_device_lambda_trait_helper {
  static const bool value = false;
};
template <bool B1, bool B2, bool B3, typename T1, typename T2, typename...Pack>
struct __nv_extended_host_device_lambda_trait_helper<
    __nv_hdl_wrapper_t<B1, B2, B3, T1, T2, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_host_device_lambda_closure_type(X) \
    __nv_extended_host_device_lambda_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

All three trait helpers follow the same pattern: a primary template with value = false, a partial specialization matching the corresponding wrapper type with value = true, and a macro that instantiates the trait after stripping cv-qualifiers via __nv_lambda_trait_remove_cv. The cv-stripping is necessary because lambda closure types may be captured as const references.
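
The same primary-template / partial-specialization / macro arrangement can be tried in isolation. This is a sketch with hypothetical names: `demo_wrapper_t` stands in for `__nv_dl_wrapper_t`, and `std::remove_cv` plays the role of `__nv_lambda_trait_remove_cv`.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical wrapper standing in for __nv_dl_wrapper_t.
template <typename Tag, typename... Captures>
struct demo_wrapper_t {};

// Primary template: not a wrapper.
template <typename T>
struct demo_trait_helper {
    static const bool value = false;
};

// Partial specialization: matches any instantiation of the wrapper.
template <typename Tag, typename... Captures>
struct demo_trait_helper<demo_wrapper_t<Tag, Captures...> > {
    static const bool value = true;
};

// Strip cv-qualifiers first, then ask the trait.
#define DEMO_IS_WRAPPER(X) \
    demo_trait_helper<typename std::remove_cv<X>::type>::value

struct tag {};
typedef demo_wrapper_t<tag, int> wrapped_int;  // alias avoids a comma in the macro argument
static_assert(DEMO_IS_WRAPPER(const wrapped_int), "cv-qualified wrapper is detected");
static_assert(!DEMO_IS_WRAPPER(int), "plain types fall through to the primary template");
```

Without the cv-stripping step, `const wrapped_int` would not match the partial specialization and would fall through to the `false` primary template. As with any function-like macro, a type containing a top-level comma has to be passed through an alias.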

Macro Registration in the Frontend

The three macro names are registered as built-in identifiers by sub_5863A0 (a frontend initialization function), which calls sub_7463B0 to register each name with a unique identifier code:

// sub_5863A0, decompiled lines 976-978
sub_7463B0(328, "__nv_is_extended_device_lambda_closure_type");
sub_7463B0(329, "__nv_is_extended_host_device_lambda_closure_type");
sub_7463B0(330, "__nv_is_extended_device_lambda_with_preserved_return_type");

These registrations (IDs 328, 329, 330) make the names known to the EDG lexer before any source code is parsed, ensuring they can be resolved during preprocessing even if no header has defined them yet.

Diagnostic Suppression Scope

The managed runtime boilerplate is wrapped in a #pragma GCC diagnostic push / pop block to isolate its warning suppressions:

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-function"

/* managed runtime declarations */

#pragma GCC diagnostic pop

The push/pop is emitted only when the host compiler supports it: Clang (dword_126E1E8 set), or GCC version > 40599 (qword_126E1F0 > 0x9E97 and dword_106BF6C not set). The suppressions are necessary because __nv_inited_managed_rt and __nv_init_managed_rt are static symbols that may never be referenced in translation units without __managed__ variables, causing -Wunused-variable and -Wunused-function warnings.
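
The support check can be written as a small predicate. This sketch follows the condition as stated in the paragraph above; the struct and function names are hypothetical, and any additional role of the "host is GCC" flag is deliberately left out because the text does not spell it out.

```cpp
#include <cassert>

// Hypothetical view of the host-compiler globals named in the text.
struct HostCompiler {
    bool is_clang;          // dword_126E1E8
    long long version;      // qword_126E1F0
    bool alt_host_mode;     // dword_106BF6C
};

// Clang always qualifies; otherwise the version must exceed 40599 (0x9E97)
// and alternative-host mode must be off.
inline bool emits_diagnostic_push_pop(const HostCompiler& h) {
    assert(h.version >= 0);
    if (h.is_clang) return true;
    return h.version > 0x9E97 && !h.alt_host_mode;
}
```

If the version number is encoded as major*10000 + minor*100 + patch (an assumption), the threshold 40599 corresponds to "newer than any 4.5.x release", which lines up with GCC 4.6 being the first release to support `#pragma GCC diagnostic push`.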

Global State Dependencies

Global	Type	Meaning	Effect on Emission
`dword_106BF38`	int	`extended_lambda_mode`	0: emit false stubs. Nonzero: skip stubs, `sub_6BCC20` provides traits
`dword_106BF6C`	int	`alternative_host_compiler_mode`	Adds `__attribute__((unused))` to `__nv_init_managed_rt`
`dword_126E1E8`	int	Host is Clang	Controls push/pop and extra suppressions
`dword_126E1F8`	int	Host is GCC	Controls push/pop version threshold
`qword_126E1F0`	int64	GCC/Clang version number	> 0x9E97 (40599) for push/pop support

Function Map

Address	Name	Role
`sub_489000`	`process_file_scope_entities`	Emits managed RT block and lambda macros
`sub_4768F0`	`gen_name_ref`	Wraps qualified managed variable accesses
`sub_484940`	`gen_variable_name`	Wraps direct managed variable accesses
`sub_511D40`	`scan_expr_full`	Looks up `__cudaPushCallConfiguration` for `<<<>>>` lowering
`sub_6BCC20`	`nv_emit_lambda_preamble`	Emits trait-based lambda detection macros
`sub_5863A0`	(frontend init)	Registers lambda macro names as built-in identifiers
`sub_467E50`	(emit string)	Primary string emission to output stream
`sub_72EEF0`	(inject identifier)	Inserts `__cudaPushCallConfiguration` into scope for lookup
`sub_698940`	(name lookup)	Resolves identifier to entity declaration
`sub_4F8200`	(emit error)	Error emission with severity and error code

Cross-References

.int.c File Format -- complete file structure showing where runtime boilerplate sits
Device Lambda Wrapper -- __nv_dl_wrapper_t matched by trait macros
Host-Device Lambda Wrapper -- __nv_hdl_wrapper_t matched by trait macros
Preamble Injection -- sub_6BCC20 emission of trait templates
Entity Node Layout -- byte +148/+149 memory space bitfield
__managed__ Variables -- attribute handler setting the 0x101 bits
Kernel Stub Generation -- device stub side of kernel launch lowering
Host Reference Arrays -- registration tables that reference managed variables

Host Reference Arrays

When cudafe++ splits a CUDA source file into device and host halves, the host-side .int.c output is compiled by a standard C++ compiler (GCC, Clang, or MSVC) that has no concept of device symbols. The CUDA runtime, however, needs to know which __global__ kernels, __device__ variables, and __constant__ variables exist so it can register them at program startup. cudafe++ solves this by emitting host reference arrays -- static byte arrays containing the mangled names of device symbols, placed into specially-named ELF sections that downstream tools (the fatbinary linker and crt/host_runtime.h registration code) read to enumerate device entities. The mechanism exists because the host compiler's symbol table contains only host-side symbols; the .nvHR* sections provide the complementary device-side symbol directory that the CUDA runtime needs to build the host-device binding table.

The arrays are emitted at the very end of the .int.c file, after the #undef _NV_ANON_NAMESPACE cleanup, by six calls to nv_emit_host_reference_array (sub_6BCF80, 79 lines, nv_transforms.c). Each call handles one combination of symbol type (kernel, device variable, constant variable) and linkage class (external, internal). The split by linkage is critical for RDC (relocatable device code) compilation: external-linkage symbols are globally visible across translation units and resolved by nvlink, while internal-linkage symbols (from static declarations or anonymous namespaces) are TU-local and must carry module-ID-based name prefixes to avoid collisions.

Key Facts

Property	Value
Emission function	`sub_6BCF80` (`nv_emit_host_reference_array`, 79 lines)
EDG source file	`nv_transforms.c`
Caller	`sub_489000` (`process_file_scope_entities`, lines 713--721)
Guard condition	`dword_106BFD0 || dword_106BFCC` (device or constant registration enabled)
Emit callback	`sub_467E50` (primary string emitter to output stream)
Registration function	`sub_6BE300` (`nv_get_full_nv_static_prefix`, 370 lines, `nv_transforms.c:2164`)
Scope prefix builder	`sub_6BD2F0` (`nv_build_scoped_name_prefix`, 95 lines)
Expression walker	`sub_6BE330` (`nv_scan_expression_for_device_refs`, 89 lines)
List data structure	`std::list<std::string>`-like containers at 6 global addresses
Static prefix cache	`qword_1286760`
Anonymous namespace name	`qword_1286A00` (format: `_GLOBAL__N_<module_id>`)
Prefix format string	`"%s%lu_%s_"`, with the fixed prefix string at `off_E7C768` passed as the first argument
Assert guard	`nv_transforms.c:2164`, `"nv_get_full_nv_static_prefix"`

The Six Sections

The arrays are organized into 6 ELF sections along two axes: symbol type (3 values) and linkage (2 values):

Section	Array Name	Symbol Type	Linkage	Global List Address
`.nvHRKE`	`hostRefKernelArrayExternalLinkage`	`__global__` kernel	External	`unk_1286880`
`.nvHRKI`	`hostRefKernelArrayInternalLinkage`	`__global__` kernel	Internal	`unk_12868C0`
`.nvHRDE`	`hostRefDeviceArrayExternalLinkage`	`__device__` variable	External	`unk_1286780`
`.nvHRDI`	`hostRefDeviceArrayInternalLinkage`	`__device__` variable	Internal	`unk_12867C0`
`.nvHRCE`	`hostRefConstantArrayExternalLinkage`	`__constant__` variable	External	`unk_1286800`
`.nvHRCI`	`hostRefConstantArrayInternalLinkage`	`__constant__` variable	Internal	`unk_1286840`

The section name encoding is: .nvHR (host reference) + one letter for symbol type (K=kernel, D=device, C=constant) + one letter for linkage (E=external, I=internal).

Note that __shared__ variables are not included -- they have no host-visible address and exist only within a kernel's execution lifetime.

Emission Architecture

Invocation from the Backend

The backend entry point sub_489000 (process_file_scope_entities) calls sub_6BCF80 six times at the very end of .int.c generation (decompiled lines 713--721). The calls are guarded by two flags: dword_106BFD0 (device registration mode) and dword_106BFCC (constant registration mode). If neither is set, no arrays are emitted.

// sub_489000 trailer, decompiled lines 713-721
if (dword_106BFD0 || dword_106BFCC) {
    // nv_emit_host_reference_array(emit_fn, is_kernel, is_device, is_internal)
    sub_6BCF80(sub_467E50, 1, 0, 1);  // kernel,   internal  -> .nvHRKI
    sub_6BCF80(sub_467E50, 1, 0, 0);  // kernel,   external  -> .nvHRKE
    sub_6BCF80(sub_467E50, 0, 1, 1);  // device,   internal  -> .nvHRDI
    sub_6BCF80(sub_467E50, 0, 1, 0);  // device,   external  -> .nvHRDE
    sub_6BCF80(sub_467E50, 0, 0, 1);  // constant, internal  -> .nvHRCI
    sub_6BCF80(sub_467E50, 0, 0, 0);  // constant, external  -> .nvHRCE
}

The function signature is:

void nv_emit_host_reference_array(
    void (*emit)(const char *),  // a1: string emission callback
    int is_kernel,               // a2: 1 = kernel, 0 = variable
    int is_device,               // a3: 1 = __device__, 0 = __constant__ (only when is_kernel=0)
    int is_internal              // a4: 1 = internal linkage, 0 = external linkage
);

The flag decoding for selecting which global list, section name, and array name to use works as follows:

if is_kernel (a2 != 0):
    if is_internal (a4 != 0):  list = unk_12868C0, section = ".nvHRKI", name = "hostRefKernelArrayInternalLinkage"
    else:                       list = unk_1286880, section = ".nvHRKE", name = "hostRefKernelArrayExternalLinkage"
else if is_internal (a4 != 0):
    if is_device (a3 != 0):    list = unk_12867C0, section = ".nvHRDI", name = "hostRefDeviceArrayInternalLinkage"
    else:                       list = unk_1286840, section = ".nvHRCI", name = "hostRefConstantArrayInternalLinkage"
else:
    if is_device (a3 != 0):    list = unk_1286780, section = ".nvHRDE", name = "hostRefDeviceArrayExternalLinkage"
    else:                       list = unk_1286800, section = ".nvHRCE", name = "hostRefConstantArrayExternalLinkage"

Note the precedence: the kernel flag is checked first. When is_kernel=1, the is_device flag is ignored entirely -- kernels are always kernels regardless of is_device.
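
The mapping from the three flags to a section and array name can be expressed compactly. The sketch below produces the same six results as the decoding above but builds the names from their parts; `HostRefTarget` and `select_target` are hypothetical, and the global list addresses are omitted.

```cpp
#include <cassert>
#include <string>

struct HostRefTarget {
    std::string section;
    std::string array_name;
};

// Kernel takes precedence: is_device is ignored when is_kernel is set.
inline HostRefTarget select_target(bool is_kernel, bool is_device, bool is_internal) {
    const char* kind = is_kernel ? "Kernel" : (is_device ? "Device" : "Constant");
    const char* link = is_internal ? "Internal" : "External";
    HostRefTarget t;
    t.section = std::string(".nvHR") + kind[0] + link[0];  // e.g. ".nvHRKI"
    t.array_name = std::string("hostRef") + kind + "Array" + link + "Linkage";
    return t;
}
```

For example, `select_target(true, true, false)` still yields `.nvHRKE`, showing that the device flag has no effect once the kernel flag is set.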

Emission Output Format

For each section, sub_6BCF80 emits a single array declaration:

extern "C" {
extern __attribute__((section(".nvHRKE")))
       __attribute__((weak))
const unsigned char hostRefKernelArrayExternalLinkage[] = {
/* _Z8myKernelPfi */
0x5f,0x5a,0x38,0x6d,0x79,0x4b,0x65,0x72,0x6e,0x65,0x6c,0x50,0x66,0x69,0x0,
/* _Z11otherKernelPd */
0x5f,0x5a,0x31,0x31,0x6f,0x74,0x68,0x65,0x72,0x4b,0x65,0x72,0x6e,0x65,0x6c,0x50,0x64,0x0,
0x0};
}

Key details about the emitted C:

extern "C" wrapping ensures no C++ name mangling is applied to the array itself. The section name in the ELF binary is the sole identifier.
__attribute__((section(".nvHRXX"))) places the array in a named ELF section that downstream tools scan by name.
__attribute__((weak)) allows multiple translation units to define the same array name without causing linker errors. When multiple TUs each emit their own hostRefKernelArrayExternalLinkage, the linker keeps one copy. This is safe because the CUDA runtime reads the section contents, not the symbol -- it concatenates all .nvHRKE section contributions from all object files.
const unsigned char[] encodes each mangled name as individual hex bytes, not as a string literal. This avoids any issues with embedded NUL bytes or special characters in mangled names.
Each symbol name is preceded by a /* mangled_name */ comment for human readability.
Each name is terminated by 0x0 (NUL byte).
If the list is empty (no symbols of that type/linkage), the array contains a single 0x0 sentinel.

The iteration traverses a doubly-linked list rooted at the global list variable. From the decompiled code:

// Decompiled iteration in sub_6BCF80, lines 56-73
for (node = list[3]; list + 1 != node; node = next_node(node)) {
    emit("/* ");
    emit(*(char **)(node + 32));   // mangled name string
    emit(" */\n");
    size_t len = *(size_t *)(node + 40);  // string length
    for (size_t j = 0; j < len; j++) {
        char byte = *(char *)(*(char **)(node + 32) + j);
        snprintf(buf, 128, "0x%x,", byte);
        emit(buf);
    }
    emit("0x0,");  // NUL terminator for this name
}

Each node in the linked list stores:

+32: pointer to the mangled name string
+40: length of the mangled name

The list structure itself is a std::list<std::string>-compatible container where list[3] (offset +24) points to the first data node and list + 1 (offset +8) is the sentinel/end node.
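
The emission loop can be approximated with standard containers. In this sketch a `std::vector<std::string>` stands in for the global name list and the function returns the text instead of calling the emit callback; the function name and the newline placement after each name are choices made here for readability.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Sketch of the emission loop; a vector stands in for the global name list.
inline std::string emit_host_reference_array(const std::string& section,
                                             const std::string& array_name,
                                             const std::vector<std::string>& names) {
    std::string out = "extern \"C\" {\nextern __attribute__((section(\"" + section +
                      "\"))) __attribute__((weak)) const unsigned char " +
                      array_name + "[] = {\n";
    for (const std::string& name : names) {
        out += "/* " + name + " */\n";
        for (unsigned char c : name) {   // unsigned avoids sign-extended hex
            char buf[8];
            std::snprintf(buf, sizeof buf, "0x%x,", static_cast<unsigned>(c));
            out += buf;
        }
        out += "0x0,\n";                 // per-name NUL terminator
    }
    out += "0x0};\n}\n";                 // final sentinel and closing brace
    return out;
}
```

Passing an empty list produces an array whose only element is the `0x0` sentinel, matching the empty-list case described above.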

Symbol Registration Pipeline

The host reference arrays are the output of a two-phase pipeline: (1) symbol collection during compilation, and (2) array emission at the end of the backend pass.

Phase 1: Collection During Compilation

As cudafe++ processes the AST, it encounters declarations marked with __global__, __device__, or __constant__. Each such entity must be registered in the appropriate global list so it appears in the host reference array. This registration is performed by two cooperating functions:

nv_scan_expression_for_device_refs (sub_6BE330, 89 lines) recursively walks expression trees looking for references to device-annotated entities. It dispatches on expression kind:

Expression Kind	Handling
7 (variable reference)	Checks `__global__` bit, registers if device-annotated
11 (function reference)	Checks function attributes, registers if `__global__`
15 (member access)	Recurses on the member
16 (pointer dereference)	Recurses on the operand
17 (expression list)	Recurses on each element
20 (call expression)	Checks the callee
24 (cast expression)	Recurses on the operand

When the walker finds a device entity, it tail-calls into nv_get_full_nv_static_prefix.

nv_get_full_nv_static_prefix (sub_6BE300, 370 lines) is the master registration function. It determines the symbol's linkage class and constructs the name that goes into the host reference array. The function begins with two early-exit checks:

if (!entity) return;
if ((entity[182] & 0x40) == 0) return;  // not __global__

Byte +182 of the entity node carries execution space bits. Bit 6 (0x40) indicates __global__. Byte +179 carries additional flags where bits 0x12 indicate device/constant annotation. Byte +80 bits 0x70 encode the linkage class: 0x10 = internal (static/anonymous), 0x30 = external.
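
The field tests described here can be exercised on a mock byte buffer. The offsets and masks are the ones given in the text; the enum, the function names, and the `Other` fallback for values the text does not cover are assumptions of this sketch.

```cpp
#include <cassert>

enum class Linkage { Internal, External, Other };

// Byte +182 bit 6 marks __global__.
inline bool is_global_function(const unsigned char* entity) {
    assert(entity != nullptr);
    return (entity[182] & 0x40) != 0;
}

// Byte +80 masked with 0x70 gives the linkage class.
inline Linkage linkage_class(const unsigned char* entity) {
    assert(entity != nullptr);
    switch (entity[80] & 0x70) {
        case 0x10: return Linkage::Internal;  // static / anonymous namespace
        case 0x30: return Linkage::External;
        default:   return Linkage::Other;     // values the text does not cover
    }
}
```

The mask isolates bits 4 to 6, so unrelated low bits in byte +80 do not affect the classification.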

The function then splits into two paths based on linkage:

Internal Linkage Path

For static functions, anonymous-namespace entities, or entities with forced internal linkage, the name must include a TU-unique prefix to prevent collisions across translation units:

Scope prefix construction (sub_6BD2F0): Recursively walks the entity's enclosing scopes (byte +28 == 3 indicates "has parent scope"). For each scope level, the scope name is extracted from +32 -> +8 (the scope's identifier string). For anonymous namespaces (where the scope name pointer is NULL), the function substitutes _GLOBAL__N_<module_id>, constructing and caching this string in qword_1286A00.
Hash computation (sub_6BD1C0): The scope-qualified name is hashed using vsnprintf with the format string at address 0x82D326 (decimal 8573734; likely "%s%lu" or similar) and a 32-byte buffer. This produces a deterministic hash of the scope path.
Static prefix construction: The full prefix is assembled as:
```
snprintf(buf, size, "%s%lu_%s_", off_E7C768, strlen(module_id), module_id)
```
where off_E7C768 is a fixed prefix string (likely "__nv_static_" or similar) and module_id comes from sub_5AF830 (the CRC32-based module identifier). The result is cached in qword_1286760 so it is computed only once per TU.
Name assembly: The prefix, a "_" separator, and the entity's mangled name (from entity +8) are concatenated.
List insertion: The assembled name is pushed into the internal-linkage list (unk_12868C0 for kernels, unk_12867C0 for device variables, unk_1286840 for constants) via a std::list::push_back-equivalent call.
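The static prefix step can be reproduced with the same snprintf format. In this sketch `build_static_prefix` is a hypothetical helper and the marker string is passed in by the caller, because the actual contents of off_E7C768 are only a guess ("__nv_static_" or similar).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats <marker><strlen(module_id)>_<module_id>_ into buf.
   Returns the length written, or -1 if buf is too small. */
static int build_static_prefix(char *buf, size_t size,
                               const char *marker, const char *module_id)
{
    int n = snprintf(buf, size, "%s%lu_%s_", marker,
                     (unsigned long)strlen(module_id), module_id);
    return (n >= 0 && (size_t)n < size) ? n : -1;
}
```

With the assumed marker "__nv_static_" and a module ID of "abc", the helper yields "__nv_static_3_abc_". The binary computes this once and keeps the pointer in qword_1286760.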

External Linkage Path

For entities with default (external) linkage, the path is simpler:

A "::" scope prefix is prepended (string at address 10998575, two bytes).
If the entity has a parent scope (byte +28 == 3 at the scope entry), the scope-qualified name is built by recursing through parent scopes, concatenating "::" separators and hashing each level with sub_6BD1C0.
The entity's mangled name (from entity +8) is appended directly.
The result is pushed into the external-linkage list (unk_1286880 for kernels, unk_1286780 for device variables, unk_1286800 for constants).

Phase 2: Emission (Backend Trailer)

After the entire source file has been processed and all entity walks have populated the 6 global lists, the backend trailer calls sub_6BCF80 six times. Each call drains one list and emits the corresponding ELF section declaration. The emission is always performed for all 6 sections, even if some lists are empty (producing arrays with only a 0x0 sentinel).

Internal vs. External Linkage Split

The split into internal and external linkage sections serves two distinct purposes:

Whole-Program Mode (`-rdc=false`)

In whole-program (non-RDC) mode, all device code from a single TU is embedded directly in the host object file as a fatbinary. The host reference arrays tell crt/host_runtime.h's __cudaRegisterFatBinary machinery which symbols exist in the fatbinary so it can register them with the CUDA driver at program startup.

Internal-linkage symbols require the TU-unique prefix to avoid name collisions if two TUs define identically-named static __global__ kernels. The prefix incorporates the CRC32-based module ID produced by sub_5AF830 to ensure uniqueness.

Separate Compilation Mode (`-rdc=true`)

In RDC mode, device code is compiled to relocatable device objects (.rdc files) that nvlink links together. External-linkage device symbols must be globally resolvable across TUs. The .nvHRKE/.nvHRDE/.nvHRCE sections provide the symbol directory that nvlink uses to match device symbols with their host-side registration entries.

Internal-linkage symbols in RDC mode remain TU-local. They carry module-ID prefixes and are placed in the *I sections, which nvlink processes separately. The split ensures that nvlink does not attempt to deduplicate or cross-reference symbols that were intentionally given internal linkage.

Downstream Consumption

Host Compiler

GCC/Clang/MSVC compiles the .int.c file and sees the extern "C" array declarations with __attribute__((section(...))). The host compiler places each array into the named ELF section (or PE section on Windows). Because the arrays are const unsigned char[] with weak linkage, they impose no runtime overhead and can be safely deduplicated by the linker.

Fatbinary Linker (fatbinary / nvlink)

The fatbinary linker reads the .nvHR* sections from each object file to discover which device symbols need registration. For each entry in the byte arrays, it extracts the mangled name (scanning for 0x0 terminators) and matches it against the device code in the fatbinary or relocatable device object.
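The scan described above amounts to walking a packed list of NUL-terminated names that ends with an empty name. The consumer's real code is not part of this analysis, so `walk_host_ref_array` and its callback type are hypothetical; the sketch only shows the array format being read back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*name_callback)(const char *name, void *ctx);

/* Visits each NUL-terminated name until the lone 0x0 sentinel.
   size bounds the scan so a malformed array cannot be overrun.
   Returns the number of names visited. */
static size_t walk_host_ref_array(const unsigned char *data, size_t size,
                                  name_callback cb, void *ctx)
{
    size_t count = 0, pos = 0;
    while (pos < size && data[pos] != 0) {
        const unsigned char *end = memchr(data + pos, 0, size - pos);
        if (end == NULL)
            break;                        /* unterminated entry: stop */
        if (cb != NULL)
            cb((const char *)(data + pos), ctx);
        count++;
        pos = (size_t)(end - data) + 1;   /* step past the terminator */
    }
    return count;
}
```

An array holding only the 0x0 sentinel yields zero names, which matches the case where a list was empty at emission time.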

CUDA Runtime (`crt/host_runtime.h`)

At program startup, the CUDA runtime's __cudaRegisterLinkedBinary function (or __cudaRegisterFatBinary in whole-program mode) walks the .nvHR* sections to register each kernel, device variable, and constant named in them.

This registration enables the host-side API (cudaLaunchKernel, cudaMemcpyToSymbol, etc.) to resolve device symbols by name at runtime.

Supporting Data Structures

Global List Nodes

Each of the 6 global lists (unk_1286780 through unk_12868C0) is a std::list<std::string>-compatible doubly-linked list. The list head structure occupies 48 bytes (3 pointers + metadata):

Offset	Field	Description
+0	`allocator`	Allocator state
+8	`sentinel`	Sentinel/end node address (comparison target for iteration end)
+16	`size`	Number of entries
+24	`first`	Pointer to first data node

Each data node stores:

Offset	Field	Description
+0	`prev`	Previous node pointer
+8	`next`	Next node pointer
+16	`data_start`	Start of string data area
+32	`str_ptr`	Pointer to mangled name character data
+40	`str_len`	Length of the mangled name

The strings use SSO (Small String Optimization): if the mangled name is 15 bytes or shorter, the character data is stored inline starting at offset +16; otherwise str_ptr at +32 points to a heap allocation and offset +16 stores the heap capacity.

Static Prefix Cache

qword_1286760 caches the internal-linkage prefix string computed by nv_get_full_nv_static_prefix. The format is:

<off_E7C768><module_id_length>_<module_id>_

Where off_E7C768 is a fixed string (the NVIDIA static prefix marker), the module ID comes from sub_5AF830 (CRC32-based), and the underscores separate the components. This prefix is allocated once via sub_5E03D0 and reused for all internal-linkage entities in the TU.

Anonymous Namespace Name Cache

qword_1286A00 caches the anonymous namespace identifier, constructed as _GLOBAL__N_<module_id>. This follows the Itanium ABI convention for anonymous namespace mangling but uses the CUDA module ID instead of a random hash. It is allocated once by sub_6BD2F0 and reused for all entities in anonymous namespaces.

Scope-Qualified Name Builder

sub_6BD2F0 (nv_build_scoped_name_prefix) recursively constructs scope-qualified names for internal-linkage entities:

void nv_build_scoped_name_prefix(char **scope_name, scope_entry *parent, string *result) {
    // Recurse to parent scope first
    if (parent && parent->kind == 3)  // byte +28 == 3
        nv_build_scoped_name_prefix(parent->parent->name, parent->parent->scope, result);

    char *name = *scope_name;
    if (!name)
        name = get_or_create_anon_namespace_name();  // _GLOBAL__N_<module_id>

    // Build: hash(name) via vsnprintf with format at 8573734, 32-byte buffer
    // Append to result string
    format_string_to_sso(&tmp, vsnprintf, 32, 8573734, name_len);
    string_append(result, tmp);
}

The recursion visits ancestor scopes from outermost to innermost, concatenating hashed scope names. This produces a deterministic, collision-resistant path that uniquely identifies the entity's position in the namespace hierarchy.

Host Reference Trie

During compilation, cudafe++ maintains a trie (prefix tree) structure for deduplicating host reference entries. This trie is stored alongside the linear lists and prevents the same symbol from being registered twice if it is referenced from multiple points in the source.

The trie is cleaned up at the end of compilation by:

sub_6BD530 (nv_free_host_ref_tree, 257 lines) -- deeply recursive tree destructor with 9 levels of inlined recursion
sub_6BD820 (nv_free_host_ref_list, 34 lines) -- iterates the linked list, calling nv_free_host_ref_tree for each node's tree, then frees the node

Each trie node structure:

Offset	Field	Description
+0	`next`	Next sibling pointer
+8	(reserved)	Alignment/flags
+16	`child_chain`	First child in chain
+24	`child_tree`	Child subtree pointer
+32	`data_ptr`	Pointer to name data (or `+48` if inline)
+40	`data_len`	Length of name data
+48	`inline_data`	Inline storage for short names

If data_ptr == &node[48] (the inline data area), no separate allocation was made; otherwise data_ptr points to a heap-allocated string that nv_free_host_ref_tree frees separately.

Complete Emission Example

For a source file containing:

__global__ void myKernel(float *data, int n) { /* ... */ }
__device__ int d_counter;
static __constant__ float c_table[256];

The .int.c trailer emits:

extern "C" {
extern __attribute__ ((section (".nvHRKI"))) __attribute__((weak)) const unsigned char hostRefKernelArrayInternalLinkage[] = {
0x0};
}

extern "C" {
extern __attribute__ ((section (".nvHRKE"))) __attribute__((weak)) const unsigned char hostRefKernelArrayExternalLinkage[] = {
/* _Z8myKernelPfi */
0x5f,0x5a,0x38,0x6d,0x79,0x4b,0x65,0x72,0x6e,0x65,0x6c,0x50,0x66,0x69,0x0,
0x0};
}

extern "C" {
extern __attribute__ ((section (".nvHRDI"))) __attribute__((weak)) const unsigned char hostRefDeviceArrayInternalLinkage[] = {
0x0};
}

extern "C" {
extern __attribute__ ((section (".nvHRDE"))) __attribute__((weak)) const unsigned char hostRefDeviceArrayExternalLinkage[] = {
/* _Z9d_counter */
0x5f,0x5a,0x39,0x64,0x5f,0x63,0x6f,0x75,0x6e,0x74,0x65,0x72,0x0,
0x0};
}

extern "C" {
extern __attribute__ ((section (".nvHRCI"))) __attribute__((weak)) const unsigned char hostRefConstantArrayInternalLinkage[] = {
/* __nv_static_42_kernel_cu_c_table */
0x5f,0x5f,0x6e,0x76,0x5f,...,0x0,
0x0};
}

extern "C" {
extern __attribute__ ((section (".nvHRCE"))) __attribute__((weak)) const unsigned char hostRefConstantArrayExternalLinkage[] = {
0x0};
}

Note how c_table (declared static __constant__) appears in the internal-linkage .nvHRCI section with its module-ID-prefixed name, while myKernel (external linkage by default) appears in .nvHRKE with its standard Itanium-ABI mangled name.
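The per-entry byte encoding in the example is mechanical: a comment with the name, then each character as a hex literal, then the 0x0 terminator. The helper below is a sketch of that formatting only; `format_host_ref_entry` is a hypothetical name and it writes into a caller-supplied buffer rather than through the emission callback the binary uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats one array entry for `name` into out.
   Returns the number of characters written, or -1 if out is too small. */
static int format_host_ref_entry(char *out, size_t size, const char *name)
{
    int n = snprintf(out, size, "/* %s */\n", name);
    if (n < 0 || (size_t)n >= size)
        return -1;
    size_t used = (size_t)n;
    const unsigned char *p = (const unsigned char *)name;
    for (;;) {
        n = snprintf(out + used, size - used, "0x%x,", (unsigned)*p);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
        if (*p++ == 0)
            break;                 /* the 0x0 terminator was just written */
    }
    if (used + 1 >= size)
        return -1;
    out[used++] = '\n';
    out[used] = '\0';
    return (int)used;
}
```

Calling it with "_Z8myKernelPfi" reproduces the comment and byte line shown for .nvHRKE above; the closing `0x0};` sentinel is emitted once per array, not per entry.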

Function Map

Address	Name	Source	Lines	Role
`sub_6BCF80`	`nv_emit_host_reference_array`	nv_transforms.c	79	Selects section/list by flags, emits array declaration
`sub_6BE300`	`nv_get_full_nv_static_prefix`	nv_transforms.c:2164	370	Master registration: determines linkage, builds name, inserts into list
`sub_6BE330`	`nv_scan_expression_for_device_refs`	nv_transforms.c	89	Recursive expression walker that finds device entity references
`sub_6BD2F0`	`nv_build_scoped_name_prefix`	nv_transforms.c	95	Recursive scope-qualified name builder for internal-linkage entities
`sub_6BD1C0`	`format_string_to_sso`	nv_transforms.c	48	Formats via `vsnprintf` into std::string SSO buffer
`sub_6BD530`	`nv_free_host_ref_tree`	nv_transforms.c	257	Recursive deep-free of deduplication trie
`sub_6BD820`	`nv_free_host_ref_list`	nv_transforms.c	34	Frees linked list of host reference entries
`sub_6BCF10`	`nv_check_device_variable_in_host`	nv_transforms.c	16	Validates device variable not improperly referenced from host
`sub_5AF830`	`make_module_id`	host_envir.c	~450	CRC32-based TU identifier used in internal-linkage prefixes
`sub_489000`	`process_file_scope_entities`	cp_gen_be.c	723	Backend entry point; calls `sub_6BCF80` x6 in trailer
`sub_467E50`	(emit string)	cp_gen_be.c	--	Primary string emission callback passed to `sub_6BCF80`

Global Variables

Address	Type	Name	Purpose
`unk_1286780`	list	device external list	Accumulates `__device__` external-linkage symbol names
`unk_12867C0`	list	device internal list	Accumulates `__device__` internal-linkage symbol names
`unk_1286800`	list	constant external list	Accumulates `__constant__` external-linkage symbol names
`unk_1286840`	list	constant internal list	Accumulates `__constant__` internal-linkage symbol names
`unk_1286880`	list	kernel external list	Accumulates `__global__` external-linkage symbol names
`unk_12868C0`	list	kernel internal list	Accumulates `__global__` internal-linkage symbol names
`qword_1286760`	char*	static prefix cache	Cached internal-linkage prefix string (computed once per TU)
`qword_1286A00`	char*	anon namespace name	Cached `_GLOBAL__N_<module_id>` string
`dword_106BFD0`	int	device registration flag	Enables device symbol registration (guard for emission)
`dword_106BFCC`	int	constant registration flag	Enables constant symbol registration (guard for emission)

Cross-References

.int.c File Format -- complete file structure showing where host reference arrays sit (sections 13--14)
CUDA Runtime Boilerplate -- managed memory initialization that references registered symbols
Module ID & Registration -- CRC32 hash computation used in internal-linkage prefixes
RDC Mode -- how the internal/external split interacts with separate compilation
Memory Spaces -- __device__ / __constant__ / __shared__ attribute encoding
Name Mangling -- nv_get_full_nv_static_prefix and Itanium ABI encoding
Backend Code Generation -- Phase 7 host reference array emission
CLI Flag Inventory -- flags controlling device/constant registration

Module ID & Registration

When CUDA programs are compiled with separate compilation (-rdc=true), each .cu translation unit is compiled independently and later linked by nvlink. The host-side registration code emitted by cudafe++ must associate its __cudaRegisterFatBinary call with the correct device fatbinary, and anonymous namespace device symbols must receive globally unique mangled names. The module ID is a string identifier computed by make_module_id (sub_5AF830, host_envir.c, ~450 lines) that provides this uniqueness. It is derived from a CRC32 hash of the compiler options and source filename, combined with the output filename and process ID. Once computed, the module ID is cached in qword_126F0C0 and referenced throughout the backend code generator -- in _NV_ANON_NAMESPACE construction, _GLOBAL__N_ mangling, _INTERNAL prefixing, host reference array scoped names, and the module ID file written for nvlink consumption.

Key Facts

Property	Value
Generator function	`sub_5AF830` (`make_module_id`, ~450 lines, host_envir.c)
Setter	`sub_5AF7F0` (`set_module_id`, host_envir.c, line 3387 assertion)
Getter	`sub_5AF820` (`get_module_id`, host_envir.c)
File writer	`sub_5B0180` (`write_module_id_to_file`, host_envir.c)
Entity-based selector	`sub_5CF030` (`use_variable_or_routine_for_module_id_if_needed`, il.c, line 31969)
Anon namespace constructor	`sub_6BC7E0` (nv_transforms.c, ~20 lines)
Cached module ID global	`qword_126F0C0` (8 bytes, initially NULL)
Selected entity global	`qword_126F140` (8 bytes, IL entity pointer)
Selected entity kind	`byte_126F138` (1 byte, 7=variable or 11=routine)
Module ID file path global	`qword_106BF80` (set by `--module_id_file_name`, flag 87)
Generate-module-ID-file flag	`--gen_module_id_file` (flag 83, no argument)
Module ID file path flag	`--module_id_file_name` (flag 87, has argument)
Options hash input global	`qword_106C038` (string, command-line options to hash)
Output filename global	`qword_106C040` (display filename override)
Emit-symbol-table flag	`dword_106BFB8` (triggers `write_module_id_to_file` in backend)
CRC32 polynomial	`0xEDB88320` (CRC-32/ISO-HDLC, reflected)
CRC32 initial value	`0xFFFFFFFF`
Debug trace topic	`"module_id"` (gated by `dword_126EFC8`)
Debug format strings	`"make_module_id: str1 = %s, str2 = %s, pid = %ld\n"` at `0xA5DA48`
	`"make_module_id: final string = %s\n"` at `0xA5DA80`

Algorithm Overview

The module ID generator has three source modes, tried in priority order. The result is always cached in qword_126F0C0 -- the function returns immediately if the cache is populated.

Mode 1: Module ID File

If qword_106BF80 (set by the --module_id_file_name CLI flag) is non-NULL and dword_106BFB8 is clear, the function opens the specified file, reads its entire contents into a heap-allocated buffer, null-terminates it, and uses that as the module ID verbatim. This allows build systems to inject deterministic, reproducible identifiers from external sources (e.g., a content hash of the source file computed by the build system).

// sub_5AF830, mode 1: read module ID from file
if (!dword_106BFB8 && qword_106BF80) {
    FILE *f = open_file(qword_106BF80, "r");  // sub_4F4870
    if (!f) fatal("unable to open module id file for reading");

    fseek(f, 0, SEEK_END);
    size_t len = ftell(f);
    rewind(f);

    char *buf = allocate(len + 1);             // sub_6B7340
    if (fread(buf, 1, len, f) != len)
        fatal("unable to read module id from file");

    buf[len] = '\0';
    fclose(f);
    qword_126F0C0 = buf;
    return buf;
}

Mode 2: Explicit Token (Caller-Provided String)

If the caller passes a non-NULL first argument (src), the function enters the default computation path using that string as the source filename component. When a secondary string argument (nptr) is provided instead (used by use_variable_or_routine_for_module_id_if_needed), it is first parsed with strtoul. If the parse succeeds (the entire string was consumed as a number), the numeric value is formatted as an 8-digit hex string. If the parse fails (the string is not purely numeric), the string is CRC32-hashed and the hash is used as the hex token. In the nptr case the working directory (qword_126EEA0) is used as an extra component and the PID is obtained via getpid().
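The parse-or-hash rule for the token can be sketched as follows. `token_to_hex` is a hypothetical helper, `crc32_string` is a compact restatement of the bitwise CRC-32 described in this section, and masking the value to 32 bits is a choice made here so the token always fits 8 digits; the binary's behavior for larger values is not documented above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Bitwise reflected CRC-32, polynomial 0xEDB88320, init and final XOR 0xFFFFFFFF. */
static uint32_t crc32_string(const char *s)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (; *s != '\0'; s++) {
        crc ^= (unsigned char)*s;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Writes the 8-digit hex token for nptr into out (9 bytes including NUL). */
static void token_to_hex(const char *nptr, char out[9])
{
    char *end = NULL;
    unsigned long value = strtoul(nptr, &end, 0);
    if (end <= nptr || *end != '\0')
        value = crc32_string(nptr);          /* not a pure number: hash it */
    snprintf(out, 9, "%08lx", value & 0xFFFFFFFFul);   /* mask: assumption */
}
```

Because strtoul is called with base 0, "255" and "0x10" both count as numeric tokens, while a string such as "abc" or "12ab" falls through to the hash.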

Mode 3: Default Computation (stat + ctime + getpid)

When no caller-provided string is available, the function stat()s the output file. If the stat succeeds and the file is a regular file (S_IFREG), the modification time (st_mtime) is converted to a string via ctime(), and the PID is obtained via getpid(). If the stat fails or the result is not a regular file, only the PID is used, with the compilation timestamp string (qword_126EB80) as the source component.
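A minimal POSIX sketch of this fallback logic is shown below. `default_source` and its parameters are hypothetical; the fallback string stands in for the compilation timestamp at qword_126EB80.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Fills src with ctime(st_mtime) when path is a regular file, otherwise
   with fallback. Returns 1 if the mtime was used, 0 if the fallback was.
   *pid receives getpid() in both cases. */
static int default_source(const char *path, const char *fallback,
                          char *src, size_t size, long *pid)
{
    struct stat st;
    *pid = (long)getpid();
    if (stat(path, &st) == 0 && S_ISREG(st.st_mode)) {
        const char *text = ctime(&st.st_mtime);
        if (text != NULL) {
            snprintf(src, size, "%s", text);   /* includes ctime's trailing '\n' */
            return 1;
        }
    }
    snprintf(src, size, "%s", fallback);
    return 0;
}
```

The spaces, colons, and trailing newline that ctime() produces are harmless here, since the later sanitization pass replaces every non-alphanumeric character with an underscore.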

Complete Generation Pseudocode

function make_module_id(src_arg):
    // Check cache
    if qword_126F0C0 != NULL:
        return qword_126F0C0

    // Mode 1: read from file
    if !dword_106BFB8 AND qword_106BF80 != NULL:
        return read_file_contents(qword_106BF80)

    // Determine the output filename base
    if dword_126EE48:                    // multi-TU mode
        output_name = **(qword_106BA10 + 184)   // from TU descriptor
    else:
        output_name = xmmword_126EB60[0]         // primary source file
    if qword_106C040 != NULL:
        output_name = qword_106C040              // display name override

    // Determine source string and extra string
    pid = 0
    extra = NULL

    if src_arg != NULL:
        src = src_arg
        // skip nptr processing, fall through to assembly

    else if nptr != NULL:                // caller-provided numeric token
        (value, endptr) = strtoul(nptr, 0)
        if endptr <= nptr OR *endptr != '\0':
            value = crc32(nptr)          // not a pure number, hash it
        src = sprintf("%08lx", value)
        pid = getpid()
        extra = qword_126EEA0           // working directory

    else:                                // default: stat the output file
        if stat(output_name) succeeds AND is regular file:
            mtime = stat.st_mtime
            src = ctime(mtime)
            pid = getpid()
            extra = qword_126EEA0
        else:
            pid = getpid()
            src = qword_126EB80         // compilation timestamp
            extra = qword_126EEA0

    // --- Assemble the module ID string ---

    // Step 1: CRC32 of command-line options
    if qword_106C038 != NULL:
        options_crc = crc32(qword_106C038)
        options_hex = sprintf("_%08lx", options_crc)
    else:
        options_hex = sprintf("_%08lx", 0)

    // Step 2: source name compression
    name_len = strlen(src) + (extra ? strlen(extra) + 1 : 0)
    if name_len > 8:
        // Source name too long -- replace with CRC32
        combined_crc = crc32(src)
        if extra:
            combined_crc = crc32_continue(combined_crc, extra)
        src = sprintf("%08lx", combined_crc)
        // extra is consumed into the hash, set to NULL
        extra = NULL

    // Step 3: PID suffix
    if pid != 0:
        pid_suffix = sprintf("_%ld", pid)
    else:
        pid_suffix = ""

    // Step 4: extract basename of output file
    basename = strip_directory_prefix(output_name)   // sub_5AC1F0
    basename_len = strlen(basename)

    // Step 5: concatenate all components
    result = options_hex + "_" + basename_len + "_" + basename + "_" + src
    if extra:
        result += "_" + extra
    if pid != 0 AND nptr == NULL:
        result += pid_suffix

    // Step 6: sanitize -- replace all non-alphanumeric with '_'
    for each character c in result:
        if !isalnum(c):
            c = '_'

    // Cache and return
    qword_126F0C0 = result
    return result

Module ID Format

The final module ID string follows this structure:

_{options_crc}_{basename_len}_{basename}_{source_or_crc}[_{extra}][_{pid}]

All non-alphanumeric characters are replaced with underscores after assembly. A concrete example for a file kernel.cu compiled with nvcc -arch=sm_89 -rdc=true:

_a1b2c3d4_9_kernel_cu_5e6f7890_1234
 |        | |         |        |
 |        | |         |        +-- PID (getpid())
 |        | |         +----------- CRC32 of source name (> 8 chars compressed)
 |        | +--------------------- output basename ("kernel.cu", dot -> "_")
 |        +----------------------- basename length (9, "kernel.cu")
 +-------------------------------- CRC32 of options string

The leading underscore comes from the options_hex format ("_%08lx"). All dots, slashes, dashes, and other non-alphanumeric characters are uniformly replaced with underscores, making the result safe for use as a C identifier suffix.
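The assembly and sanitization steps can be sketched with plain snprintf calls. `assemble_module_id` is a hypothetical helper that takes already-computed components; the caller decides whether to pass an extra string (NULL to omit) and a PID (0 to omit), so the mode-specific conditions from the pseudocode are left outside this function.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Builds _{crc}_{len}_{basename}_{src}[_{extra}][_{pid}] and then replaces
   every non-alphanumeric character with '_'. Returns 0 on success, -1 if
   out is too small. */
static int assemble_module_id(char *out, size_t size, unsigned long options_crc,
                              const char *basename, const char *src,
                              const char *extra, long pid)
{
    int n = snprintf(out, size, "_%08lx_%lu_%s_%s", options_crc,
                     (unsigned long)strlen(basename), basename, src);
    if (n < 0 || (size_t)n >= size)
        return -1;
    size_t used = (size_t)n;
    if (extra != NULL) {
        n = snprintf(out + used, size - used, "_%s", extra);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    if (pid != 0) {
        n = snprintf(out + used, size - used, "_%ld", pid);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
    }
    for (char *p = out; *p != '\0'; p++)      /* sanitize in place */
        if (!isalnum((unsigned char)*p))
            *p = '_';
    return 0;
}
```

Feeding it the components of the example above (options CRC 0xa1b2c3d4, basename "kernel.cu", source token "5e6f7890", PID 1234) reproduces the string in the diagram.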

CRC32 Implementation

The function contains an inline CRC32 implementation that appears three times in the decompiled output: once for the options string, once for the source name (continuing into the extra string), and once for the explicit token nptr. The three copies are structurally identical in the binary, differing only in their input and output operands, which indicates the compiler inlined a shared helper (likely a static inline function or macro) at each call site.

The algorithm is the standard bit-by-bit reflected CRC-32 used by ISO 3309, ITU-T V.42, Ethernet, PNG, and zlib. The polynomial 0xEDB88320 is the bit-reversed form of the generator polynomial 0x04C11DB7.

CRC32 Pseudocode

function crc32(data: byte_string) -> uint32:
    crc = 0xFFFFFFFF                    // initialization vector

    for each byte in data:
        for bit_index in 0..7:
            // XOR the lowest bit of crc with the current data bit
            if ((crc ^ (byte >> bit_index)) & 1) != 0:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc = crc >> 1

    return crc ^ 0xFFFFFFFF             // final inversion
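A runnable C rendering of the pseudocode is given below as a sketch. The function names are hypothetical. It XORs the whole byte into the state before the eight shifts, which is equivalent to the per-bit comparison in the pseudocode, and it separates update from finalization so that a state can be carried from one string into the next, as `crc32_continue` does in the generation pseudocode.

```c
#include <assert.h>
#include <stdint.h>

#define CRC32_INIT 0xFFFFFFFFu
#define CRC32_POLY 0xEDB88320u

/* Feeds a NUL-terminated string into a running, non-finalized CRC state. */
static uint32_t crc32_update(uint32_t state, const char *s)
{
    for (; *s != '\0'; s++) {
        state ^= (unsigned char)*s;
        for (int bit = 0; bit < 8; bit++)
            state = (state & 1u) ? (state >> 1) ^ CRC32_POLY : state >> 1;
    }
    return state;
}

/* Applies the final inversion. */
static uint32_t crc32_finish(uint32_t state)
{
    return state ^ 0xFFFFFFFFu;
}

/* One-shot CRC-32 of a single string. */
static uint32_t crc32_string(const char *s)
{
    return crc32_finish(crc32_update(CRC32_INIT, s));
}
```

The standard check value for this CRC variant is 0xCBF43926 for the input "123456789", and hashing "1234" followed by "56789" through `crc32_update` gives the same result as hashing the concatenation.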

CRC32 Decompiled (Single Instance)

This is one of the three identical inline copies from sub_5AF830, processing the options string at qword_106C038:

// sub_5AF830, lines 121-165 (options CRC32)
uint64_t crc = 0xFFFFFFFF;
uint8_t *ptr = (uint8_t *)qword_106C038;

if (ptr) {
    while (*ptr) {
        uint8_t byte = *ptr;
        while (1) {
            ++ptr;
            // Bit 0
            uint64_t tmp = crc >> 1;
            if (((uint8_t)crc ^ byte) & 1) tmp ^= 0xEDB88320;
            // Bit 1
            uint64_t tmp2 = tmp >> 1;
            if (((uint8_t)tmp ^ (byte >> 1)) & 1) tmp2 ^= 0xEDB88320;
            // Bit 2
            uint64_t tmp3 = tmp2 >> 1;
            if (((uint8_t)tmp2 ^ (byte >> 2)) & 1) tmp3 ^= 0xEDB88320;
            // Bit 3
            uint64_t tmp4 = tmp3 >> 1;
            if (((uint8_t)tmp3 ^ (byte >> 3)) & 1) tmp4 ^= 0xEDB88320;
            // Bit 4
            uint64_t tmp5 = tmp4 >> 1;
            if (((uint8_t)tmp4 ^ (byte >> 4)) & 1) tmp5 ^= 0xEDB88320;
            // Bit 5
            uint64_t tmp6 = tmp5 >> 1;
            if (((uint8_t)tmp5 ^ (byte >> 5)) & 1) tmp6 ^= 0xEDB88320;
            // Bit 6
            uint64_t tmp7 = tmp6 >> 1;
            if (((uint8_t)tmp6 ^ (byte >> 6)) & 1) tmp7 ^= 0xEDB88320;
            // Bit 7
            crc = tmp7 >> 1;
            if ((((uint8_t)tmp7 ^ (byte >> 7)) & 1) == 0)
                break;
            byte = *ptr;
            crc ^= 0xEDB88320;
            if (!*ptr) goto done;
        }
    }
done:
    sprintf(options_hex, "_%08lx", crc ^ 0xFFFFFFFF);
}

The code processes one byte at a time without a lookup table: each of the eight steps shifts the CRC right by one bit and conditionally XORs the polynomial. The compiler fully unrolled what was presumably a counted for (int i = 0; i < 8; i++) loop into 8 sequential shift-and-XOR blocks. The final XOR with 0xFFFFFFFF is the standard CRC-32 finalization step. The three copies in the function differ only in which input string they process and which output variable receives the result.

Why Three Inline Copies

The CRC32 code appears at three locations within sub_5AF830:

Copy	Input	Output	Purpose
1 (lines 121-164)	`qword_106C038` (options string)	`options_hex`	Hash compiler flags into the module ID prefix
2 (lines 186-273)	`src` + `extra` (source + extra strings)	`src` (overwritten with hex)	Compress long source filenames (> 8 chars) into a fixed-width hash
3 (lines 361-407)	`nptr` (explicit token string)	`v67`	Hash non-numeric caller-provided tokens

Copy 2 is a two-pass CRC: the first pass (2a) hashes the source filename string, and the second pass (2b) continues the same CRC state into the extra string (working directory), producing a single combined hash. This is why the code between the two passes checks if (extra_len != 0) before starting the second one.

The original C source almost certainly had a single crc32_string() helper function (or macro) that the compiler inlined at each call site during optimization. The EDG front-end codebase uses similar inline expansion patterns elsewhere (e.g., the 9 copies of UTF-8 decoding logic in the same file).

Module ID Source Modes -- Decision Tree

make_module_id(src)
    |
    +-- qword_126F0C0 set? --> return cached
    |
    +-- File mode available?
    |   (qword_106BF80 != NULL && !dword_106BFB8)
    |   YES --> read file, cache, return
    |
    +-- Caller provided src argument?
    |   YES --> use src as source component, no PID
    |
    +-- nptr set (explicit token)?
    |   YES --> strtoul(nptr)
    |           |
    |           +-- parse OK? --> use numeric value
    |           +-- parse fail? --> CRC32 hash nptr
    |           extra = working_directory
    |           pid = getpid()
    |
    +-- Default (no src, no nptr)
        stat(output_file)
        |
        +-- stat OK && regular file?
        |   src = ctime(st_mtime)
        |   pid = getpid()
        |   extra = working_directory
        |
        +-- stat fail
            src = qword_126EB80 (compilation timestamp)
            pid = getpid()
            extra = working_directory

Entity-Based Module ID Selection

An alternative entry path into the module ID system is use_variable_or_routine_for_module_id_if_needed (sub_5CF030, il.c, line 31969, ~65 lines). Instead of computing a hash from file metadata, this function selects a representative entity (variable or function) from the current translation unit whose mangled name serves as a stable identifier. The mangled name is then passed to sub_5AF830 as the src argument.

Selection Criteria

The function is invoked during IL processing. It first checks sub_5AF820 (get_module_id) -- if a module ID is already cached, it returns immediately. Otherwise, it evaluates the candidate entity:

// sub_5CF030, simplified
char *use_variable_or_routine_for_module_id_if_needed(entity, kind) {
    if (get_module_id())
        return get_module_id();      // already computed

    if (qword_126F140) {
        // Already selected an entity, extract its name
        assert(dword_106BF10 || dword_106BEF8);  // il.c:32064
        goto extract_name;
    }

    // Validate entity kind: must be 7 (variable) or 11 (routine)
    assert(entity && ((kind - 7) & 0xFB) == 0);   // il.c:31969

    // Check if entity is unsuitable (member of TU scope, etc.)
    if (entity->scope == primary_scope
        || (entity->flags_81 & 0x04)       // unnamed namespace
        || (entity->scope && entity->scope->kind == 3))
    {
        // Skip: entity in primary scope, unnamed namespace, or class scope
        ...
        return NULL;
    }

    if (kind == 7) {   // Variable
        // Must have: no storage class, has definition, not template-related,
        // not inline, not constexpr, not thread-local
        if (entity->storage_class == 0
            && entity->has_definition          // offset +169
            && !(entity->flags_162 & 0x10)     // not explicit specialization
            && !(entity->flags_164 & 0x10)     // not partial specialization
            && entity->flags_148 >= 0          // not extern template
            && !(entity->flags_160 & 0x08)     // not inline variable
            && entity->flags_165 >= 0)         // not constexpr
        {
            qword_126F140 = entity;
            byte_126F138 = 7;
        }
    }
    else {   // Routine (kind == 11)
        // Must have: no specialization, no builtin return type,
        // no template parameters, not defaulted/deleted
        if (!entity->flags_164
            && entity->flags_176 >= 0          // not defaulted
            && !(entity->flags_179 & 0x02)     // not deleted
            && !(entity->flags_180 & 0x38)     // not template-related
            && !(entity->flags_184 & 0x20))    // not consteval
        {
            // Additional checks: return type not builtin, not coroutine
            if (!is_builtin_type(entity->return_type)
                && !is_generic_function(entity)
                && !is_concept_function(entity->return_type_entry))
            {
                qword_126F140 = entity;
                byte_126F138 = 11;
            }
        }
    }

extract_name:
    // Get the entity's mangled name
    char *name;
    if (byte_126F138 == 7) {
        // Variable: check unnamed namespace, use mangled or lowered name
        if ((entity->flags_81 & 0x04) || (entity->scope && entity->scope->kind == 3))
            name = get_lowered_name();      // sub_6A70C0
        else
            name = entity->name;            // offset +8
    } else {
        // Routine: similar checks, use name or lowered name
        assert(byte_126F138 == 11);         // il.c:32079
        if (dword_126EFB4 == 2)             // C++20 mode
            name = get_mangled_name();      // sub_6A76C0
        else
            name = entity->name;
    }

    assert(name != NULL);                   // il.c:32086
    return make_module_id(name);            // sub_5AF830(name)
}

The strict filtering ensures the selected entity is one whose mangled name is deterministic across compilations of the same source. Template instantiations, inline variables, and unnamed namespace entities are excluded because their names may vary or conflict.
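The kind check in the first assertion, ((kind - 7) & 0xFB) == 0, is a compact way of accepting exactly two values. The sketch below isolates it; `is_variable_or_routine` is a hypothetical name, and treating the subtraction as 8-bit arithmetic is an assumption based on the byte-wide mask.

```c
#include <assert.h>
#include <stdint.h>

/* After subtracting 7, only bit 2 (0x04) may remain set, so the 8-bit
   difference must be 0 or 4, which means kind is 7 or 11. */
static int is_variable_or_routine(uint8_t kind)
{
    return (((uint8_t)(kind - 7)) & 0xFB) == 0;
}
```

Kinds below 7 wrap around to large 8-bit values, which always have bits set under the 0xFB mask, so they are rejected along with every other kind.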

set_module_id and get_module_id

The module ID cache has a setter/getter pair for use by external callers that compute the ID through other means:

// sub_5AF7F0 -- set_module_id (host_envir.c, line 3387)
void set_module_id(char *id) {
    assert(qword_126F0C0 == NULL);   // "set_module_id" -- must not be set already
    qword_126F0C0 = id;
}

// sub_5AF820 -- get_module_id (host_envir.c)
char *get_module_id(void) {
    return qword_126F0C0;
}

The setter asserts that the module ID has not been previously set. This is a safety guard: the module ID must be computed exactly once per compilation. Any attempt to set it twice indicates a logic error in the pipeline.

write_module_id_to_file

The write_module_id_to_file function (sub_5B0180, host_envir.c, ~30 lines) is called during the backend output phase when dword_106BFB8 (emit-symbol-table flag) is set. It generates the module ID (via sub_5AF830(0)) and writes the raw string to a file:

// sub_5B0180 -- write_module_id_to_file
void write_module_id_to_file(void) {
    char *id = make_module_id(NULL);       // sub_5AF830(0)
    char *path = qword_106BF80;            // module ID file path

    if (!path)
        fatal("module id filename not specified");

    FILE *f = open_file_for_writing(path); // sub_4F48F0
    size_t len = strlen(id);

    if (fwrite(id, 1, len, f) != len)
        fatal("error writing module id to file");

    fclose(f);
}

The module ID file is a plain text file containing nothing but the module ID string (no newline, no header). This file is consumed by the fatbinary linker (fatbinary) and nvlink during the device linking phase.

Downstream Consumers

The module ID is referenced in seven distinct locations across the cudafe++ binary:

1. Anonymous Namespace Mangling (sub_6BC7E0)

Constructs the _GLOBAL__N_<module_id> string used as the _NV_ANON_NAMESPACE macro value in the .int.c trailer:

// sub_6BC7E0 (nv_transforms.c, ~20 lines)
if (qword_1286A00)                      // cached?
    return qword_1286A00;

char *id = make_module_id(NULL);        // sub_5AF830(0)
char *buf = allocate(strlen(id) + 12);  // "_GLOBAL__N_" = 11 chars + NUL
strcpy(buf, "_GLOBAL__N_");
strcpy(buf + 11, id);
qword_1286A00 = buf;                   // cache for reuse
return buf;

This string appears in the .int.c output as:

#define _NV_ANON_NAMESPACE _GLOBAL__N_a1b2c3d4e5f67890
#ifdef _NV_ANON_NAMESPACE
#endif
#include "kernel.cu"
#undef _NV_ANON_NAMESPACE

2. Scoped Name Prefix Builder (sub_6BD2F0)

The recursive nv_build_scoped_name_prefix function uses the same _GLOBAL__N_<module_id> string when building scope-qualified names for internal-linkage device entities in host reference arrays. If the entity is in an anonymous namespace and qword_1286A00 is not yet computed, it calls sub_5AF830(0) directly to generate the module ID.

3. Internal Linkage Prefix (sub_69DAA0)

Constructs _INTERNAL<module_id> for internal-linkage entities during name lowering:

// sub_69DAA0 (lower_name.c context)
char *id = make_module_id(NULL);
char *buf = allocate(strlen(id) + 10);
strcpy(buf, "_INTERNAL");              // first 8 bytes "_INTERNA" stored as qword 0x414E5245544E495F (little-endian)
strcpy(buf + 9, id);
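The 64-bit immediate in the listing can be checked by decoding it byte by byte, lowest byte first, which is what a little-endian store writes to memory. The sketch below does that and also mirrors the prefix concatenation. Both helper names are hypothetical, and malloc stands in for the binary's own allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Decode a 64-bit immediate into the 8 bytes a little-endian store would
 * write to memory (lowest byte first), then NUL-terminate. */
static void decode_le_qword(uint64_t imm, char out[9]) {
    for (int i = 0; i < 8; i++)
        out[i] = (char)((imm >> (8 * i)) & 0xFF);
    out[8] = '\0';
}

/* Hypothetical helper mirroring the prefix + module ID concatenation.
 * The caller frees the result. */
static char *make_prefixed_name(const char *prefix, const char *id) {
    assert(prefix != NULL && id != NULL);
    size_t plen = strlen(prefix), ilen = strlen(id);
    char *buf = malloc(plen + ilen + 1);
    if (!buf)
        return NULL;
    memcpy(buf, prefix, plen);
    memcpy(buf + plen, id, ilen + 1);   /* copies the NUL as well */
    return buf;
}
```

The decoded qword covers only "_INTERNA"; the trailing "L" and the NUL do not fit in eight bytes, which is consistent with the strlen(id) + 10 allocation size (9 prefix characters plus the terminator).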

4. Unnamed Namespace Naming (sub_69ED40, give_unnamed_namespace_a_name)

When the name lowering pass encounters an unnamed (anonymous) namespace entity, it calls sub_5AF830(0) to obtain the module ID and constructs a _GLOBAL__N_<module_id> name for the namespace. The function is confirmed as give_unnamed_namespace_a_name from assert strings at lower_name.c lines 7880 and 7889.

5. Frontend Wrapup (sub_588E90)

The translation_unit_wrapup function (sub_588E90, fe_wrapup.c) calls sub_5AF830(0) unconditionally during frontend finalization. This ensures the module ID is computed and cached before the backend code generator needs it, even if no earlier consumer triggered computation.

6. Entity-Based Selection (sub_5CF030)

As described above, use_variable_or_routine_for_module_id_if_needed selects a representative entity and passes its mangled name to sub_5AF830, which then uses the name as the src component instead of file metadata.

7. Module ID File Output (sub_5B0180)

Writes the raw module ID string to a file for consumption by fatbinary and nvlink.

Integration with the Compilation Pipeline

The module ID is requested at multiple points during compilation, but only the first call computes it; every later call returns the cached value:

Pipeline stage                    Module ID action
--------------------------------------------------------------
CLI parsing                       Flags 83/87 set qword_106BF80
                                  Options string stored in qword_106C038
Frontend processing               sub_5CF030 may select entity-based ID
Frontend wrapup (sub_588E90)      sub_5AF830(0) ensures ID is computed
Backend output (sub_489000)       sub_6BC7E0 uses ID for _NV_ANON_NAMESPACE
                                  sub_6BCF80 uses ID in host reference arrays
                                  sub_5B0180 writes ID to file (if dword_106BFB8)

The --gen_module_id_file flag (83) controls whether a module ID file is generated at all. The --module_id_file_name flag (87) specifies its path. Both are set by nvcc when invoking cudafe++ with -rdc=true.
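The compute-once behavior explains why entity-based selection has to happen before frontend wrapup: whichever call arrives first fixes the ID. A minimal sketch of that caching pattern follows. The names module_id, compute_id, and compute_calls are hypothetical, and the placeholder strings do not reflect the real ID format.

```c
#include <assert.h>
#include <stddef.h>

static const char *cached_id;      /* stands in for qword_126F0C0 */
static int compute_calls;          /* counts real computations, for illustration */

/* Hypothetical stand-in for the actual ID computation. */
static const char *compute_id(const char *src) {
    compute_calls++;
    return src ? src : "file_based_id";
}

/* The first call computes and caches; later calls ignore `src`. */
static const char *module_id(const char *src) {
    if (cached_id == NULL)
        cached_id = compute_id(src);
    assert(cached_id != NULL);
    return cached_id;
}
```

In this sketch, calling module_id("entity_name") and then module_id(NULL) returns the same pointer both times and runs compute_id only once.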

PID Incorporation

The getpid() call ensures that concurrent compilations of the same source file produce different module IDs. Without the PID, two parallel nvcc invocations compiling the same .cu file with the same flags would generate identical module IDs, causing runtime registration collisions when the resulting objects are linked together. The PID is appended as the final underscore-separated component and is only included in modes 2 and 3 (not when the caller provides a src argument directly, and not when the module ID is read from a file). This means reproducible builds require mode 1 (file-based) or entity-based selection.

Global Variables

Address	Size	Name	Description
`qword_126F0C0`	8	`cached_module_id`	Cached module ID string (computed once, never freed)
`qword_106BF80`	8	`module_id_file_path`	Path from `--module_id_file_name` (flag 87)
`qword_106C038`	8	`options_hash_input`	Command-line options string for CRC32 hashing
`qword_106C040`	8	`display_filename`	Output filename override (used as basename source)
`qword_126F140`	8	`selected_entity`	Entity chosen by `use_variable_or_routine_for_module_id_if_needed`
`byte_126F138`	1	`selected_entity_kind`	Kind of selected entity (7=variable, 11=routine)
`dword_106BFB8`	4	`emit_symbol_table`	Flag: write module ID file + symbol table in backend
`qword_1286A00`	8	`cached_anon_namespace_hash`	Cached `_GLOBAL__N_<module_id>` string
`qword_126EEA0`	8	`working_directory`	Current working directory (set during `host_envir_early_init`)
`qword_126EB80`	8	`compilation_timestamp`	`ctime()` of compilation start (IL header)
`dword_126EFC8`	4	`debug_trace_flag`	Enables debug trace output to FILE `s`

Function Map

Address	Name	Source File	Lines	Role
`sub_5AF830`	`make_module_id`	host_envir.c	~450	CRC32-based unique TU identifier generator
`sub_5AF7F0`	`set_module_id`	host_envir.c	~10	Setter with assert guard (may be called at most once)
`sub_5AF820`	`get_module_id`	host_envir.c	~3	Returns `qword_126F0C0`
`sub_5B0180`	`write_module_id_to_file`	host_envir.c	~30	Writes module ID to file for nvlink
`sub_5CF030`	`use_variable_or_routine_for_module_id_if_needed`	il.c:31969	~65	Selects representative entity for stable ID
`sub_6BC7E0`	(anon namespace hash)	nv_transforms.c	~20	Constructs `_GLOBAL__N_<module_id>`
`sub_6BD2F0`	`nv_build_scoped_name_prefix`	nv_transforms.c	~95	Recursive scope-qualified name builder
`sub_69DAA0`	(internal linkage prefix)	lower_name.c	~60	Constructs `_INTERNAL<module_id>` prefix
`sub_69ED40`	`give_unnamed_namespace_a_name`	lower_name.c:7880	~80	Names anonymous namespaces with module ID
`sub_588E90`	`translation_unit_wrapup`	fe_wrapup.c	~30	Ensures module ID is computed during wrapup

Cross-References

.int.c File Format -- _NV_ANON_NAMESPACE trailer section that consumes the module ID
CUDA Runtime Boilerplate -- managed memory registration that uses the fatbinary handle
Host Reference Arrays -- .nvHR* sections where scoped names include the module ID
RDC Mode -- separate compilation mode that requires module IDs for cross-TU linking
CLI Flag Inventory -- flags 83 (gen_module_id_file) and 87 (module_id_file_name)
Backend Code Generation -- output phase where write_module_id_to_file is called
Frontend Wrapup -- translation_unit_wrapup triggers early module ID computation

EDG 6.6 Overview

cudafe++ is built on top of Edison Design Group's (EDG) commercial C++ frontend, version 6.6. EDG provides the complete C++ language implementation -- lexer, preprocessor, parser, semantic analysis, type system, template instantiation, overload resolution, constant evaluation, and Itanium ABI name mangling. NVIDIA licenses this frontend and compiles it from source with CUDA-specific modifications injected at three distinct integration levels: a dedicated NVIDIA source file (nv_transforms.c), surgical modifications to EDG source files that call into NVIDIA headers, and a large layer of CUDA property-query leaf functions that permeate every compilation phase.

The build path embedded in the binary is:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/

Source Tree

The binary contains debug path references to 52 .c files and 13 .h files. Together these constitute the entire EDG frontend plus NVIDIA's single dedicated source file.

Source Files (.c)

#	File	Pipeline role
1	`attribute.c`	C++11/GNU/CUDA attribute parsing and validation
2	`class_decl.c`	Class/struct/union declaration processing, lambda scanning
3	`cmd_line.c`	Command-line argument parsing (276 flags)
4	`const_ints.c`	Compile-time integer constant evaluation
5	`cp_gen_be.c`	Backend -- `.int.c` code generation, source sequence walking
6	`debug.c`	Debug output and IL dump infrastructure
7	`decl_inits.c`	Declaration initializer processing
8	`decl_spec.c`	Declaration specifier parsing (storage class, type qualifiers)
9	`declarator.c`	Declarator parsing (pointers, arrays, function signatures)
10	`decls.c`	General declaration processing
11	`disambig.c`	Syntactic disambiguation (expression vs. declaration)
12	`error.c`	Diagnostic message formatting and emission (3,795 messages)
13	`expr.c`	Expression parsing and semantic analysis
14	`exprutil.c`	Expression utility functions (coercion, evaluation)
15	`extasm.c`	Extended inline assembly parsing
16	`fe_init.c`	Frontend initialization (36 subsystem init routines)
17	`fe_wrapup.c`	Frontend finalization (5-pass wrapup sequence)
18	`float_pt.c`	Floating-point literal parsing
19	`floating.c`	IEEE 754 constant folding (arbitrary precision)
20	`folding.c`	General constant folding
21	`func_def.c`	Function definition processing
22	`host_envir.c`	Host environment interface (file I/O, exit, signals)
23	`il.c`	IL node creation, linking, and management
24	`il_alloc.c`	IL arena allocator (region-based, 64KB blocks)
25	`il_to_str.c`	IL-to-string conversion for debug display
26	`il_walk.c`	IL tree walking with 5 callback functions
27	`interpret.c`	Constexpr interpreter (compile-time evaluation engine)
28	`layout.c`	Struct/class memory layout computation
29	`lexical.c`	Lexer / tokenizer (357 token kinds)
30	`literals.c`	String and numeric literal processing
31	`lookup.c`	Name lookup (unqualified, qualified, ADL)
32	`lower_name.c`	Itanium ABI name mangling
33	`macro.c`	Preprocessor macro expansion
34	`mem_manage.c`	Internal memory management (arena allocator, tracking)
35	`modules.c`	C++20 module support (mostly stubs in CUDA build)
36	`nv_transforms.c`	NVIDIA-authored -- CUDA AST transforms, lambda wrappers
37	`overload.c`	C++ overload resolution
38	`pch.c`	Precompiled header support
39	`pragma.c`	Pragma processing (43 pragma kinds)
40	`preproc.c`	Preprocessor directives (#include, #ifdef, etc.)
41	`scope_stk.c`	Scope stack management
42	`src_seq.c`	Source sequence (declaration ordering for emission)
43	`statements.c`	Statement parsing and semantic analysis
44	`symbol_ref.c`	Symbol reference tracking
45	`symbol_tbl.c`	Symbol table operations (hash-based lookup)
46	`sys_predef.c`	System predefinitions (built-in types, macros)
47	`target.c`	Target configuration (data model, ABI)
48	`templates.c`	Template instantiation, specialization, deduction
49	`trans_copy.c`	Translation unit IL deep copy
50	`trans_corresp.c`	Cross-TU type correspondence verification (RDC)
51	`trans_unit.c`	Translation unit lifecycle (the main entry point)
52	`types.c`	C++ type system (22 type kinds, queries, construction)

Header Files (.h)

#	File	Contents
1	`decls.h`	Declaration node structure definitions
2	`float_type.h`	Floating-point type descriptors
3	`il.h`	IL entry kind enums, node structure definitions
4	`lexical.h`	Token kind enums, lexer state
5	`mem_manage.h`	Memory allocator interface
6	`modules.h`	Module system declarations
7	`nv_transforms.h`	NVIDIA-authored -- CUDA transform API, called from EDG files
8	`overload.h`	Overload resolution structures
9	`scope_stk.h`	Scope stack interface
10	`symbol_tbl.h`	Symbol table interface
11	`types.h`	Type node structure, type kind enum
12	`util.h`	General utility macros and inline functions
13	`walk_entry.h`	IL walking callback signatures

Code Breakdown

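The roughly 6,300 functions that could be categorized break down as follows. The total includes the statically linked C++ runtime as well as EDG code, and the percentages are relative to this categorized total: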

Category	Functions	% of binary	Description
Attributed to source files	~2,200	~35%	Matched to one of the 52 `.c` files via assert strings, source path references, or address-range mapping
Unmapped EDG functions	~2,900	~46%	EDG code without source file attribution (inlined, optimized, or from headers)
C++ runtime / ABI	~1,200	~19%	Itanium ABI runtime, exception handling, `std::` library, operator new/delete

Top 10 Source Files by Function Count

Rank	File	Functions	Primary responsibility
1	`expr.c`	~195	Expression parsing, operator semantics, implicit conversions
2	`il.c`	~185	IL node creation, entry kind dispatch, node linking
3	`templates.c`	~172	Template instantiation worklist, SFINAE, deduction
4	`exprutil.c`	~154	Expression coercion, arithmetic conversions, lvalue analysis
5	`symbol_tbl.c`	~102	Symbol table hash operations, scope chain walking
6	`overload.c`	~100	Candidate set construction, ICS ranking, best viable function
7	`class_decl.c`	~90	Class body parsing, member declarations, lambda scanning
8	`attribute.c`	~83	Attribute parsing, CUDA attribute validation dispatch
9	`cp_gen_be.c`	~81	Backend emission, `.int.c` generation, device stub writing
10	`scope_stk.c`	~72	Scope push/pop, scope kind management, lookup context

Architecture: Classic Frontend Pipeline

EDG implements a textbook multi-pass compiler frontend. cudafe++ drives it in a single-threaded, sequential pipeline from main() at 0x408950:

  source.cu
     |
     v
  +-----------+     lexical.c, macro.c, preproc.c, literals.c
  |  Lexer /  |     357 token kinds, trigraph handling, raw string
  |  Preproc  |     adjustment, __CUDA_ARCH__ macro injection
  +-----------+
     |  token stream
     v
  +-----------+     expr.c, declarator.c, decl_spec.c, statements.c,
  |  Parser   |     class_decl.c, disambig.c, func_def.c, extasm.c
  |           |     Recursive-descent with disambiguation
  +-----------+
     |  parse tree
     v
  +-----------+     overload.c, exprutil.c, lookup.c, templates.c,
  | Semantic  |     types.c, attribute.c, const_ints.c, folding.c
  | Analysis  |     Type checking, overload resolution, template
  |           |     instantiation, constexpr evaluation
  +-----------+
     |  annotated AST
     v
  +-----------+     il.c, il_alloc.c, il_walk.c, scope_stk.c,
  |  IL Build |     symbol_tbl.c, src_seq.c, trans_unit.c
  |           |     Scope-linked graph of all declarations, types,
  |           |     expressions, statements, templates
  +-----------+
     |  IL graph
     v
  +-----------+     fe_wrapup.c, lower_name.c, trans_corresp.c
  |  Wrapup   |     5-pass finalization: dead code marking,
  |           |     name lowering, cross-TU correspondence (RDC)
  +-----------+
     |  finalized IL
     v
  +-----------+     cp_gen_be.c, nv_transforms.c, host_envir.c
  |  Backend  |     Walk source sequence, emit .int.c file,
  | Emission  |     inject CUDA stubs, lambda wrappers, host
  |           |     reference arrays, managed variable boilerplate
  +-----------+
     |
     v
  output.int.c

The process_translation_unit function (sub_7A40A0 in trans_unit.c) is the main entry point for compilation. It allocates a 424-byte TU descriptor, opens the source file, and orchestrates the parse-to-IL sequence. For the main compilation path, it calls:

sub_586240 -- parse the translation unit (drives lexer + parser)
sub_4E8A60 -- standard compilation finalization (IL completion)
sub_588F90 -- fe_wrapup (5-pass IL finalization)
sub_489000 -- backend entry (.int.c emission, "Back end time")

NVIDIA Modifications

NVIDIA's CUDA integration is organized in three layers, from most isolated to most pervasive.

Level 1: NVIDIA-Authored Source (`nv_transforms.c` + `nv_transforms.h`)

A single dedicated NVIDIA source file at address range 0x6BAE70--0x6BE4A0, containing approximately 34 functions in ~14KB of code. This file implements all CUDA-specific AST transformations:

Function	Address	Purpose
`nv_init_transforms`	`0x6BAE70`	Zero all NVIDIA transform state at startup
`emit_device_lambda_wrapper`	`0x6BB790`	Generate `__nv_dl_wrapper_t<Tag, F1..FN>` partial specialization
`emit_hdl_wrapper` (non-mutable)	`0x6BBB10`	Generate `__nv_hdl_wrapper_t<false, ...>` type-erased wrapper
`emit_hdl_wrapper` (mutable)	`0x6BBEE0`	Same as above but `operator()` is non-const
`emit_array_capture_helpers`	`0x6BC290`	Generate `__nv_lambda_array_wrapper` for 2D-8D arrays
`nv_validate_cuda_attributes`	`0x6BC890`	Validate `__launch_bounds__`, `__cluster_dims__`, `__maxnreg__`
`nv_reset_capture_bitmasks`	`0x6BCBC0`	Zero device/host-device capture bitmasks per TU
`nv_record_capture_count`	`0x6BCBF0`	Set bit N in capture bitmap for wrapper generation
`nv_emit_lambda_preamble`	`0x6BCC20`	Master emitter: inject all `__nv_*` templates into compilation
`nv_find_parent_lambda_function`	`0x6BCDD0`	Walk scope chain for enclosing device/global function
`nv_emit_host_reference_array`	`0x6BCF80`	Generate `.nvHRKE`/`.nvHRDI`/etc. ELF section arrays
`nv_get_full_nv_static_prefix`	`0x6BE300`	Build scoped name + register entity in host ref arrays

The companion header nv_transforms.h declares the API surface that EDG source files call into. This is the primary NVIDIA integration point: EDG code reaches nv_transforms.c only through the functions declared in this header.

Key data structures managed by nv_transforms.c:

Global	Size	Purpose
`unk_1286980`	128 bytes (1024 bits)	Device lambda capture-count bitmap
`unk_1286900`	128 bytes (1024 bits)	Host-device lambda capture-count bitmap
`qword_12868F0`	pointer	Entity-to-closure ID hash table
`qword_1286A00`	pointer	Cached anonymous namespace name (`_GLOBAL__N_<module_id>`)
`qword_1286760`	pointer	Cached static name prefix string
`unk_1286780`--`unk_12868C0`	6 lists	Host reference array symbol lists (one per section type)
`dword_126E270`	4 bytes	C++17 noexcept-in-type-system flag
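The two 128-byte capture-count bitmaps in the table above can be modeled as plain byte arrays. The sketch below shows one way to reset, set, and test a bit. The function names are illustrative, and the bit ordering (least significant bit first within each byte) is an assumption, not something recovered from the binary.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CAPTURE_BITMAP_BYTES 128            /* 1024 bits */

static uint8_t device_capture_bitmap[CAPTURE_BITMAP_BYTES];

static void reset_capture_bitmap(uint8_t *bm) {
    memset(bm, 0, CAPTURE_BITMAP_BYTES);
}

/* Set bit n. LSB-first ordering within each byte is an assumption. */
static void record_capture_count(uint8_t *bm, unsigned n) {
    assert(n < CAPTURE_BITMAP_BYTES * 8);
    bm[n / 8] |= (uint8_t)(1u << (n % 8));
}

static int has_capture_count(const uint8_t *bm, unsigned n) {
    assert(n < CAPTURE_BITMAP_BYTES * 8);
    return (bm[n / 8] >> (n % 8)) & 1;
}
```

A bitmap of this shape lets the preamble emitter generate wrapper specializations only for capture counts that were actually recorded.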

Level 2: NVIDIA-Modified EDG Files

Three EDG source files contain direct calls into nv_transforms.h functions, making them the "NVIDIA-aware" EDG files:

cp_gen_be.c -- The backend code generator. When it encounters a type named __nv_lambda_preheader_injection during source sequence walking, it calls nv_emit_lambda_preamble (sub_6BCC20) to inject the entire __nv_* template library. It also calls NVIDIA functions for host reference array emission, managed variable boilerplate, and device stub generation.

class_decl.c -- The class/struct declaration processor. The scan_lambda function (sub_447930, 2,113 lines) detects __host__/__device__ annotations on lambda expressions, validates CUDA-specific constraints (35+ error codes in range 3592--3690), and records capture counts in the bitmaps via nv_record_capture_count.

statements.c -- The statement parser. Calls NVIDIA transform functions for statement-level CUDA validation, such as checking that __syncthreads() is not called in divergent control flow within __global__ functions.

Level 3: CUDA Property Query Layer

The most pervasive integration layer consists of 104 small leaf functions clustered at addresses 0x7A6000--0x7AA000 (within types.c). These are type-system query functions that answer questions like "is this type a __device__ pointer?", "does this class have __shared__ storage?", "is this a kernel function type?".

Each follows a canonical pattern:

bool is_<property>_type(type_node *t) {
    while (t->kind == 12)       // 12 = tk_typedef
        t = t->referenced_type; // strip typedef layers
    return <check on underlying type>;
}
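The pattern above can be made concrete with a simplified node type. In this sketch the struct models only the two fields the pattern touches and does not reproduce the real 176-byte layout. skip_typedefs is a hypothetical helper name, and the kind values are the ones used in this document (1 = void, 8 = array, 12 = typedef).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { TK_VOID = 1, TK_ARRAY = 8, TK_TYPEDEF = 12 };

/* Simplified stand-in for type_node; real offsets are not reproduced. */
typedef struct type_node {
    int kind;
    struct type_node *referenced_type;
} type_node;

/* Strip typedef layers until a non-typedef type is reached. */
static const type_node *skip_typedefs(const type_node *t) {
    assert(t != NULL);
    while (t->kind == TK_TYPEDEF)
        t = t->referenced_type;
    return t;
}

static bool is_array_type(const type_node *t) {
    return skip_typedefs(t)->kind == TK_ARRAY;
}

static bool is_void_type(const type_node *t) {
    return skip_typedefs(t)->kind == TK_VOID;
}
```

A typedef of a typedef of an array therefore still answers true to is_array_type, which is why callers never need to strip aliases themselves.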

These 104 accessors account for 3,648 total call sites across the binary. The most frequently called accessors, by call-site count:

Address	Call sites	Identity	Returns
`0x7A8A30`	407	`is_class_or_struct_or_union_type`	kind in {9, 10, 11}
`0x7A9910`	389	`type_pointed_to`	`ptr->referenced_type` (kind == 6)
`0x7A9E70`	319	`get_cv_qualifiers`	accumulated cv-qual bits (& 0x7F)
`0x7A6B60`	299	`is_dependent_type`	bit 5 of byte +133
`0x7A7630`	243	`is_object_pointer_type`	kind == 6 && !(bit 0 of +152)
`0x7A8370`	221	`is_array_type`	kind == 8
`0x7A7B30`	199	`is_member_pointer_or_ref`	kind == 6 && (bit 0 of +152)
`0x7A6AC0`	185	`is_reference_type`	kind == 7
`0x7A8DC0`	169	`is_function_type`	kind == 14
`0x7A6E90`	140	`is_void_type`	kind == 1

CUDA integration is pervasive because these tiny accessors are called from every phase of compilation -- the parser checks execution space during declaration, semantic analysis validates cross-space calls, the type system queries CUDA qualifiers during overload resolution, and the backend reads them during IL emission. There is no isolated "CUDA layer"; the CUDA awareness is distributed across the entire frontend through these leaf functions.

Type Kind Constants

The type query functions operate on a type_node structure (176 bytes, IL entry kind 6). The kind field at offset +132 encodes:

kind	Name	Description
0	`tk_none`	Null/invalid
1	`tk_void`	`void`
2	`tk_integer`	All integer types including `bool`, `char`, enums
3	`tk_float`	`float`
4	`tk_double`	`double`
5	`tk_long_double`	`long double`
6	`tk_pointer`	Pointer types (object and member)
7	`tk_reference`	Lvalue reference (`T&`)
8	`tk_array`	Array types (`T[]`, `T[N]`)
9	`tk_struct`	`struct`
10	`tk_class`	`class`
11	`tk_union`	`union`
12	`tk_typedef`	Typedef alias (stripped by all query functions)
13	`tk_pointer_to_member`	Pointer-to-member (`T C::*`)
14	`tk_function`	Function type
15	`tk_bitfield`	Bit-field
16	`tk_pack_expansion`	Parameter pack expansion
17	`tk_pack_expansion`	Alternate pack expansion form
18	`tk_auto`	`auto` / `decltype(auto)` placeholder
19	`tk_rvalue_reference`	Rvalue reference (`T&&`)
20	`tk_nullptr_t`	`std::nullptr_t`

Memory Management

EDG uses a custom region-based arena allocator implemented in mem_manage.c (address range 0x6B5E40--0x6BA230). Key characteristics:

Block size: 64KB (0x10000) per block
Region model: Multiple numbered regions (file-scope = region 1, per-function = region N)
Free list recycling: Freed blocks go to qword_1280730 for reuse before new allocation
Trim threshold: Blocks with more than 1,887 unused bytes are split; remainder goes to free list
Tracking: All allocations recorded for watermark monitoring (qword_1280718 = total, qword_1280710 = peak)
Dual mode: Malloc-based (mode 0) or mmap-based (mode 1), selected by dword_1280728 from CLI flag

Block structure (48+ bytes header per 64KB block):

Offset	Type	Field
+0	`void*`	Next pointer (block chain)
+8	`void*`	Current allocation pointer
+16	`void*`	High-water mark within block
+24	`void*`	End-of-block pointer
+32	`int64`	Block total size (0 if sub-block)
+40	`byte`	Trimmed flag
+48	--	Start of usable data
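The header table maps onto a 48-byte C struct. The sketch below uses fixed-width integers in place of the 64-bit pointers so the layout checks hold on any host. Field names are illustrative, the padding bytes are inferred from the +48 data start, and the bump-allocation routine (including its 8-byte rounding and high-water update) is an assumption about how such a block is typically used, not decompiled code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct mem_block {
    uint64_t next;          /* +0  next block in chain               */
    uint64_t alloc_ptr;     /* +8  current allocation pointer        */
    uint64_t high_water;    /* +16 high-water mark within block      */
    uint64_t end_ptr;       /* +24 end-of-block pointer              */
    int64_t  total_size;    /* +32 block total size (0 if sub-block) */
    uint8_t  trimmed;       /* +40 trimmed flag                      */
    uint8_t  pad[7];        /* +41 padding up to the data start      */
} mem_block;

/* Bump-allocate `size` bytes (rounded up to 8) from a block. Returns the
 * address handed out, or 0 if the block cannot satisfy the request. */
static uint64_t block_alloc(mem_block *b, uint64_t size) {
    assert(b != NULL && b->alloc_ptr <= b->end_ptr);
    uint64_t rounded = (size + 7) & ~(uint64_t)7;
    if (b->end_ptr - b->alloc_ptr < rounded)
        return 0;
    uint64_t p = b->alloc_ptr;
    b->alloc_ptr += rounded;
    if (b->alloc_ptr > b->high_water)
        b->high_water = b->alloc_ptr;
    return p;
}
```

With a 64KB block, the first allocation lands at block base + 48, immediately after the header.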

The free_fe function (sub_6BA230, 533 lines) handles deallocation of front-end objects. It tracks them in a hash table that uses open addressing with linear probing, which lets it deduplicate entries.

C++20 Modules (Stubs)

The modules.c file (address range 0x7C0C60--0x7C2560) contains approximately 20 functions implementing the C++20 module import/export interface. CUDA does not support C++20 modules, so many of these functions are stubs that return 0:

has_pending_template_definition_from_module -- returns 0
has_pending_template_specializations_from_module -- returns 0
Seven additional stub functions at 0x7C2350--0x7C2410 -- all return 0

The non-stub functions handle the binary module interface file format (magic header {0x9A, 0x13, 0x37, 0x7D}) and basic module name matching, likely preserved from the EDG baseline for future CUDA module support.
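A check for the magic bytes quoted above might look like the following. The function name is hypothetical, and the sketch assumes the four bytes sit at offset 0 of the module interface file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const unsigned char MODULE_MAGIC[4] = { 0x9A, 0x13, 0x37, 0x7D };

/* Returns 1 if the buffer begins with the 4-byte magic, 0 otherwise.
 * Placement of the magic at offset 0 is an assumption. */
static int has_module_magic(const unsigned char *buf, size_t len) {
    assert(buf != NULL);
    return len >= sizeof MODULE_MAGIC &&
           memcmp(buf, MODULE_MAGIC, sizeof MODULE_MAGIC) == 0;
}
```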

Cross-TU Correspondence (RDC Mode)

When compiling with Relocatable Device Code (--rdc), multiple translation units are processed sequentially. The trans_corresp.c file (address range 0x7A00D0--0x7A38A0) implements structural equivalence checking between types from different TUs:

verify_class_type_correspondence (sub_7A00D0, 703 lines) -- Deep comparison of class types: base classes, friend declarations, member functions, nested types, template parameters
verify_enum_type_correspondence (sub_7A0E10) -- Enum underlying type and enumerator list comparison
verify_function_type_correspondence (sub_7A1230) -- Parameter list and return type comparison
set_type_correspondence (sub_7A1460) -- Links two corresponding types across TUs

The trans_unit.c file manages TU lifecycle with a stack-based model:

Global	Purpose
`qword_106BA10`	Current translation unit pointer
`qword_106B9F0`	Primary (first) translation unit
`qword_106BA18`	TU stack top
`dword_106B9E8`	TU stack depth (excluding primary)

process_translation_unit (sub_7A40A0) allocates a 424-byte TU descriptor and drives the parse-to-completion sequence. switch_translation_unit (sub_7A3D60) saves/restores per-TU state (registered variables, scope stack, file scope) when switching between TUs during RDC compilation.

Cross-References

Pipeline Overview -- How EDG stages map to the 7-stage pipeline
IL Overview -- The 85 entry kinds that EDG produces
Extended Lambda Overview -- The nv_transforms.c lambda pipeline in detail
Type System -- Deep dive on 22 type kinds and class layout
Template Engine -- Template instantiation worklist
Name Mangling -- Itanium ABI encoding with CUDA extensions
Lexer -- Tokenizer and keyword registration
Overload Resolution -- Candidate evaluation and ICS ranking
Diagnostics Overview -- The 3,795 error message system

Lexer & Tokenizer

The lexer in cudafe++ is EDG 6.6's lexical.c implementation -- a hand-coded, state-machine-driven tokenizer that converts raw source bytes into a stream of 357 distinct token kinds. It spans approximately 185 functions across the address range 0x668330--0x689130 and constitutes one of the densest subsystems in the binary. The design is a classic multi-layered scanner: a byte-level character scanner (sub_679800, 907 lines) feeds into a token acquisition engine (sub_6810F0, 3,811 lines), which in turn is wrapped by a cache-aware token delivery function (sub_676860, 1,995 lines). CUDA keyword recognition is injected at the get_token_main level, gated on dword_106C2C0 (GPU compilation mode flag).

The lexer does not use generated tables from tools like flex. Instead, every character-class test, keyword match, and operator scan is written as explicit C switch/if chains, compiled into dense jump tables by the optimizer. This produces extremely large functions -- get_token_main alone has approximately 300 local variables in its decompiled form -- but eliminates the overhead of table-driven DFA transitions for a language as context-sensitive as C++.

Key Facts

Property	Value
Source file	`lexical.c` (~185 functions)
Address range	`0x668330`--`0x689130`
Token kinds	357 (indexed from `off_E6D240` name table)
Primary scanner	`sub_679800` (`scan_token`, 907 lines)
Token acquisition	`sub_6810F0` (`get_token_main`, 3,811 lines, ~300 locals)
Cache + delivery	`sub_676860` (`get_next_token`, 1,995 lines)
Numeric literal scanner	`sub_672390` (`scan_numeric_literal`, 1,571 lines)
Keyword registration	`sub_5863A0` (`keyword_init`, in `fe_init.c`, 200+ keywords)
Universal char scanner	`sub_6711E0` (`scan_universal_character`, 278 lines)
Template arg scanner	`sub_67DC90` (`scan_template_argument_list`, 1,078 lines)
Token cache entry size	80--112 bytes (8 cache entry kinds)
Scope entry size	784 bytes (at `qword_126C5E8`)
GPU mode gate	`dword_106C2C0`
Current token global	`word_126DD58`

Architecture

The lexer is organized as three function layers stacked above the raw input buffer, each calling into the one below it:

Parser (expr.c, decls.c, statements.c)
  │
  ▼
get_next_token (sub_676860)         ← Cache management, macro rescan
  │
  ▼
get_token_main (sub_6810F0)         ← Keyword classification, CUDA gates
  │
  ▼
scan_token (sub_679800)             ← Character-level scanning
  │
  ▼
Input buffer (qword_126DDA0)        ← Raw bytes from source file

The parser never calls the character-level scanner directly. All token consumption flows through get_next_token, which checks the token cache and rescan lists before falling through to get_token_main. This layering allows the lexer to support lookahead, backtracking, macro expansion replay, and template argument rescanning without modifying the core scanner.
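The benefit of routing every request through the delivery layer can be seen in a toy model. The sketch below collapses get_token_main into the scanner for brevity and treats each input byte as one token. All names other than get_next_token and scan_token are hypothetical, and the pushback list is only a simplified stand-in for the rescan list at qword_1270150.

```c
#include <assert.h>
#include <stddef.h>

typedef struct cached_token {
    struct cached_token *next;
    int code;
} cached_token;

static cached_token *rescan_list;      /* simplified rescan list */
static const char *input;              /* simplified input buffer */

/* Toy scanner: one token per byte, 0 at end of input. */
static int scan_token(void) {
    return *input ? (unsigned char)*input++ : 0;
}

/* Push a token back so the next get_next_token() call returns it first. */
static void push_back(cached_token *t) {
    assert(t != NULL);
    t->next = rescan_list;
    rescan_list = t;
}

/* Delivery layer: drain pushed-back tokens before touching the scanner. */
static int get_next_token(void) {
    if (rescan_list) {
        cached_token *t = rescan_list;
        rescan_list = t->next;
        return t->code;
    }
    return scan_token();
}
```

The scanner never learns that a token was replayed, so lookahead and backtracking stay entirely within the delivery layer.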

Token System

The 357 Token Kinds

Every token produced by the lexer carries a 16-bit token code stored in word_126DD58. The complete set of 357 token kinds is indexed through the name table at off_E6D240, which maps each token code to its string representation. The stop-token table at qword_126DB48 + 8 contains 357 boolean entries used by the error recovery scanner to identify synchronization points.

Token codes are assigned in blocks:

Range	Category	Examples
1--51	Operators and punctuation	`+`, `-`, `*`, `/`, `(`, `)`, `{`, `}`, `::`, `->`
52--76	Alternative tokens / digraphs	`and`, `or`, `not`, `<%`, `%>`, `<:`, `:>`
77--108	C89 keywords	`auto`(77), `break`(78), `case`(79), `char`(80), `while`(108)
109--131	C99/C11 keywords	`restrict`(119), `_Bool`(120), `_Complex`(121), `_Imaginary`(122)
132--136	MSVC keywords	`__declspec`(132), `__int8`(133), `__int16`(134), `__int32`(135), `__int64`(136)
137--199	C++ keywords	`catch`(150), `class`(151), `template`(160), `decltype`(185), `typeof`(189)
200--206	Compiler internal	Internal token kinds for the preprocessor
207--330	Type traits	`__is_class`(207), `__has_trivial_copy`, ..., NVIDIA-specific traits at 328--330
331--356	Extended types / recent additions	`_Float32`(331)--`_Float128`(335), C++23/26 features
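The block assignment in the table can be expressed as a simple range classifier. The enum and function names below are illustrative only. Code 0 does not appear in the table, so the sketch reports it as unlisted and does not guess at its meaning.

```c
#include <assert.h>

typedef enum {
    CAT_UNLISTED,       /* code 0: not covered by the table */
    CAT_OPERATOR,       /* 1--51    */
    CAT_ALT_TOKEN,      /* 52--76   */
    CAT_C89_KW,         /* 77--108  */
    CAT_C99_C11_KW,     /* 109--131 */
    CAT_MSVC_KW,        /* 132--136 */
    CAT_CXX_KW,         /* 137--199 */
    CAT_INTERNAL,       /* 200--206 */
    CAT_TYPE_TRAIT,     /* 207--330 */
    CAT_EXTENDED        /* 331--356 */
} token_category;

/* Map a token code to its block in the table above. */
static token_category classify_token(int code) {
    assert(code >= 0 && code <= 356);
    if (code == 0)   return CAT_UNLISTED;
    if (code <= 51)  return CAT_OPERATOR;
    if (code <= 76)  return CAT_ALT_TOKEN;
    if (code <= 108) return CAT_C89_KW;
    if (code <= 131) return CAT_C99_C11_KW;
    if (code <= 136) return CAT_MSVC_KW;
    if (code <= 199) return CAT_CXX_KW;
    if (code <= 206) return CAT_INTERNAL;
    if (code <= 330) return CAT_TYPE_TRAIT;
    return CAT_EXTENDED;
}
```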

CUDA-Specific Token Kinds

Three NVIDIA type-trait keywords occupy dedicated token codes registered during keyword_init:

Token Code	Keyword	Purpose
328	`__nv_is_extended_device_lambda_closure_type`	Tests if type is a device lambda
329	`__nv_is_extended_host_device_lambda_closure_type`	Tests if type is a host-device lambda
330	`__nv_is_extended_device_lambda_with_preserved_return_type`	Tests if device lambda preserves return type

These are registered as standard type-trait keywords and participate in the same token classification path as the 60+ standard __is_xxx/__has_xxx traits.

Token State Globals

When a token is produced, the following globals are populated:

Address	Name	Type	Description
`word_126DD58`	`current_token_code`	WORD	16-bit token kind (0--356)
`qword_126DD38`	`current_source_position`	QWORD	Encoded file/line/column
`qword_126DD48`	`token_text_ptr`	QWORD	Pointer to identifier/literal text
`src`	`token_start_position`	char*	Start of token in input buffer
`n`	`token_text_length`	size_t	Length of token text
`dword_126DF90`	`token_flags_1`	DWORD	Classification flags
`dword_126DF8C`	`token_flags_2`	DWORD	Additional flags
`qword_126DF80`	`token_extra_data`	QWORD	Context-dependent payload
`xmmword_106C380`--`106C3B0`	`identifier_lookup_result`	4 x 128-bit	SSE-packed lookup result for identifiers (64 bytes)

The 64-byte identifier lookup result is stored in four consecutive 128-bit global slots (xmmword_106C380 through xmmword_106C3B0) by the identifier classification path. When a scanned identifier is also a keyword, the lookup result contains the keyword's token code, scope information, and classification flags. The compiled code uses movaps/movups instructions to read and write this packed state in bulk, 16 bytes at a time through SSE registers.

Token Cache

The token cache provides the lookahead, backtracking, and macro-expansion replay capabilities required by C++ parsing. Tokens are stored in a linked list of cache entries that can be consumed, rewound, and re-scanned.

Cache Entry Layout (80--112 bytes)

Offset	Size	Field	Description
`+0`	8	`next`	Next entry in cache linked list
`+8`	8	`source_position`	Encoded source location
`+16`	2	`token_code`	Token kind (0--356)
`+18`	1	`cache_entry_kind`	Discriminator for payload type (see below)
`+20`	4	`flags`	Token flags
`+24`	4	`extra_flags`	Additional flags
`+32`	8	`extra_data`	Context-dependent data
`+40`..	varies	`payload`	Kind-specific payload data

Cache Entry Kinds

Kind	Value	Payload	Description
identifier	1	Name pointer + lookup result	Identifier token with pre-resolved scope lookup
macro_def	2	Macro definition pointer	Macro definition for re-expansion (calls `sub_5BA500`)
pragma	3	Pragma data	Preprocessor pragma for deferred processing
pp_number	4	Number text	Preprocessing number (not yet classified as int/float)
(reserved)	5	--	Not observed in use
string	6	String data + encoding	String literal token
(reserved)	7	--	Not observed in use
concatenated_string	8	Concatenated string data	Wide or multi-piece concatenated string literal
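
The two tables above can be restated as a C declaration. The following is a minimal sketch, not the recovered EDG definition: the struct and enum names are hypothetical, an LP64 target (8-byte pointers) is assumed, and the 40-byte payload is simply the size that makes the smallest observed entry come out at 80 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; only the offsets and kind values come from the tables. */
enum cache_entry_kind {
    CEK_IDENTIFIER          = 1,
    CEK_MACRO_DEF           = 2,
    CEK_PRAGMA              = 3,
    CEK_PP_NUMBER           = 4,
    CEK_STRING              = 6,
    CEK_CONCATENATED_STRING = 8
};

struct token_cache_entry {
    struct token_cache_entry *next;   /* +0  (assumes 8-byte pointers)     */
    uint64_t source_position;         /* +8                                */
    uint16_t token_code;              /* +16                               */
    uint8_t  cache_entry_kind;        /* +18                               */
    uint32_t flags;                   /* +20 (one padding byte at +19)     */
    uint32_t extra_flags;             /* +24                               */
    uint64_t extra_data;              /* +32 (four padding bytes at +28)   */
    unsigned char payload[40];        /* +40, kind-specific; size assumed  */
};

/* Returns 1 for the kinds listed as in use, 0 for reserved or unknown values. */
static int cache_entry_kind_is_observed(int kind) {
    switch (kind) {
    case CEK_IDENTIFIER: case CEK_MACRO_DEF: case CEK_PRAGMA:
    case CEK_PP_NUMBER:  case CEK_STRING:    case CEK_CONCATENATED_STRING:
        return 1;
    default:
        return 0;   /* 5 and 7 are reserved */
    }
}
```

Natural alignment on x86-64 reproduces the documented offsets without any packing attributes, which suggests (but does not prove) that the original is an ordinary unpacked struct.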

Cache Management Globals

Address	Name	Description
`qword_1270150`	`cached_token_rescan_list`	Head of list of tokens to re-scan (pushed back for lookahead)
`qword_1270128`	`reusable_cache_stack`	Stack of reusable cache entry blocks
`qword_1270148`	`free_token_list`	Free list for recycling cache entries
`qword_1270140`	`macro_definition_chain`	Active macro definition chain
`dword_126DB74`	`has_cached_tokens`	Boolean flag: nonzero when cache is non-empty

Cache Operations

Address	Identity	Description
`sub_669650`	`copy_tokens_from_cache`	Copies cached preprocessor tokens for macro re-expansion (assert at `lexical.c:3417`)
`sub_669D00`	`allocate_token_cache_entry`	Allocates from free list at `qword_1270118`
`sub_669EB0`	`create_cached_token_node`	Creates and initializes token cache node
`sub_66A000`	`append_to_token_cache`	Appends token to cache list, maintains tail pointer
`sub_66A140`	`push_token_to_rescan_list`	Pushes token onto rescan stack at `qword_1270150`
`sub_66A2C0`	`free_single_cache_entry`	Returns cache entry to free list
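
The allocate / push / free cycle described by these operations follows the usual free-list pattern. The sketch below illustrates that pattern only; the node type, the function names, and the two static variables are hypothetical stand-ins for the cache globals, and the real entries carry far more state.

```c
#include <stdlib.h>

struct cache_node {
    struct cache_node *next;
    int token_code;
};

static struct cache_node *free_list;    /* stand-in for the entry free list   */
static struct cache_node *rescan_list;  /* stand-in for the rescan list head  */

/* Take an entry from the free list if possible, otherwise allocate one. */
static struct cache_node *alloc_node(int token_code) {
    struct cache_node *n = free_list;
    if (n != NULL)
        free_list = n->next;
    else
        n = malloc(sizeof *n);
    if (n != NULL) {
        n->next = NULL;
        n->token_code = token_code;
    }
    return n;
}

/* Push a token back so that it is the next one delivered (LIFO order). */
static void push_rescan(struct cache_node *n) {
    n->next = rescan_list;
    rescan_list = n;
}

/* Pop the most recently pushed token, or NULL if the list is empty. */
static struct cache_node *pop_rescan(void) {
    struct cache_node *n = rescan_list;
    if (n != NULL)
        rescan_list = n->next;
    return n;
}

/* Return a consumed entry to the free list for reuse. */
static void release_node(struct cache_node *n) {
    n->next = free_list;
    free_list = n;
}
```

Recycling entries this way keeps backtracking cheap: a rewind followed by a re-scan touches only list pointers and does not go back to the allocator.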

Layer 1: scan_token (sub_679800)

scan_token is the character-level scanner. It reads raw bytes from the input buffer at qword_126DDA0, classifies them, and produces a single token. The function is 907 lines and dispatches on the first byte of each token.

Character Dispatch

The scanner reads the byte at the current input position and enters one of the following paths:

First Byte	Action
`0x00` (NUL)	Control byte processing (8 embedded control types, see below)
`0x09` (TAB), `0x0B` (VT), `0x0C` (FF), `0x20` (space)	Whitespace -- advance and retry
`a`--`z`, `A`--`Z`, `_`	Identifier or keyword scanning
`0`--`9`	Numeric literal scanning (decimal, hex, octal, binary)
`'`	Character literal scanning
`"`	String literal scanning
`/`	Comment (`//` or `/* */`) or division operator
`.`	Dot operator, or float literal if followed by digit
`<`	Less-than, `<=`, `<<`, `<<=`, `<=>`, or template bracket
`>`	Greater-than, `>=`, `>>`, `>>=`, or template bracket
`+`, `-`, `*`, `%`, `^`, `~`, `!`, `=`, `&`, `\|`	Operator scanning (single or compound)
`(`, `)`, `[`, `]`, `{`, `}`, `;`, `,`, `?`, `@`	Single-character tokens
`#`	Preprocessor directive or stringification operator
`\`	Universal character name (`\uXXXX`, `\UXXXXXXXX`) or line continuation

Embedded Control Bytes (NUL Dispatch)

The input buffer uses embedded NUL bytes (0x00) as in-band control markers. When the scanner encounters a NUL, it reads the next byte as a control type code:

Control Type	Value	Action
Newline marker	1	End of line -- calls `sub_6702F0` (`refill_buffer`) to read next source line
(reserved)	2	--
Macro position	3	Macro expansion position marker -- calls `sub_66A770` to update position tracking
End of directive	4	Marks end of a preprocessor directive
EOF (primary)	5	End of current source file -- pops file stack
Stale position	6	Invalid position marker -- emits diagnostic 1192 or 861
Continuation	7	Backslash-newline continuation was here
EOF (secondary)	8	Secondary EOF marker for nested includes

This in-band signaling approach avoids the cost of checking buffer boundaries on every character read. The refill_buffer function (sub_6702F0, 792 lines) places these marker bytes at the end of each source line, so the scanner can detect line endings and EOF without comparing the input pointer against a limit.
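
A minimal sketch of the idea, assuming only the control values from the table above: the loop never compares its pointer against a buffer limit and relies entirely on a terminating NUL marker. The function name, the enum names, and the choice to count visible bytes are hypothetical and exist only to make the example checkable.

```c
#include <stddef.h>

/* Control type values as listed in the table; names are hypothetical. */
enum control_type {
    CT_NEWLINE       = 1,
    CT_MACRO_POS     = 3,
    CT_END_DIRECTIVE = 4,
    CT_EOF_PRIMARY   = 5,
    CT_STALE         = 6,
    CT_CONTINUATION  = 7,
    CT_EOF_SECONDARY = 8
};

/* Counts non-blank bytes up to the first terminating marker.
 * The buffer MUST end with a NUL + control pair; no length is passed. */
static size_t scan_line(const unsigned char *p, int *control_out) {
    size_t visible = 0;
    for (;;) {
        unsigned char c = *p++;
        if (c != 0x00) {               /* ordinary byte: no bounds check */
            if (c != ' ' && c != '\t')
                visible++;
            continue;
        }
        int control = *p++;            /* byte after NUL selects the action */
        if (control == CT_CONTINUATION || control == CT_MACRO_POS)
            continue;                  /* position-only markers: keep going */
        *control_out = control;        /* newline, EOF, ...: stop here */
        return visible;
    }
}
```

The cost of the scheme is that a genuine NUL byte in the source text has to be handled specially by whoever fills the buffer; this sketch does not model that case.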

Input Buffer System

Address	Name	Description
`qword_126DDA0`	`current_input_position`	Read pointer into the input buffer
`qword_126DDD8`	`input_buffer_base`	Start of the allocated input buffer
`qword_126DDD0`	`input_buffer_end`	End of the allocated input buffer
`qword_126DDF0`	`file_stack`	Stack of open source files (for `#include`)
`qword_127FBA8`	`current_file_handle`	`FILE*` for the current source file
`dword_127FBA0`	`eof_flag`	Set when current file reaches EOF
`dword_127FB9C`	`multibyte_encoding_mode`	Values >1 enable multibyte character decoding via `sub_5B09B0`
`dword_126DDA8`	`source_line_counter`	Lines read from current source file
`dword_126DDBC`	`output_line_counter`	Lines emitted to preprocessed output

Buffer Refill: read_next_source_line (sub_66F4E0)

sub_66F4E0 (735 lines) reads the next line from the source file into the input buffer. It calls getc() for single-byte mode or sub_5B09B0 for multibyte mode (controlled by dword_127FB9C > 1). The function:

Reads characters one at a time until newline or EOF
Handles backslash-newline line splicing (joining continuation lines)
Places control byte markers at newline positions (type 1) and EOF (type 5/8)
Updates the line counter at dword_126DDA8
Manages trigraph warnings (diagnostic 1750) through the companion function sub_6702F0

Layer 2: get_token_main (sub_6810F0)

get_token_main is the largest function in the lexer at 3,811 decompiled lines with approximately 300 local variables. It wraps scan_token and performs the complete token classification pipeline: keyword recognition, CUDA keyword gating, template parameter detection, operator overload name lookup, access specifier tracking, and namespace scope management.

Token Classification Pipeline

After scan_token produces a raw token, get_token_main performs these classification steps:

scan_token produces raw token
  │
  ├── Identifier?
  │     ├── Look up in keyword table
  │     │     ├── Standard C/C++ keyword → set token_code to keyword kind
  │     │     ├── CUDA keyword (dword_106C2C0 != 0) → set token_code
  │     │     ├── Type trait keyword → set token_code (207-356)
  │     │     └── Not a keyword → classify as identifier token
  │     │
  │     ├── Check template parameter context
  │     │     └── If inside template<>, classify as type-name or non-type
  │     │
  │     └── Entity lookup for context-sensitive classification
  │           ├── typedef name → classify as TYPE_NAME token
  │           ├── class/struct name → classify as CLASS_NAME
  │           ├── enum name → classify as ENUM_NAME
  │           ├── namespace name → classify as NAMESPACE_NAME
  │           └── template name → classify as TEMPLATE_NAME
  │
  ├── Numeric literal?
  │     └── Route to scan_numeric_literal (sub_672390)
  │
  ├── String/character literal?
  │     └── Handle encoding prefix (L, u8, u, U, R)
  │
  └── Operator/punctuation?
        ├── Check for template angle bracket context
        ├── Handle digraphs/alternative tokens
        └── Produce operator token code

CUDA Keyword Detection

CUDA keyword handling is gated on dword_106C2C0 (GPU mode). When this flag is nonzero, get_token_main recognizes CUDA-specific identifiers and routes them to the CUDA attribute processing path:

// Pseudocode from get_token_main
if (token_is_identifier) {
    // ... standard keyword lookup ...

    if (dword_106C2C0 != 0) {  // GPU mode active
        // Check for __device__, __host__, __global__,
        // __shared__, __constant__, __managed__,
        // __launch_bounds__, __grid_constant__
        // Route to CUDA attribute handlers
        if (dword_106BA08) {   // CUDA attribute processing enabled
            sub_74DC30(...);   // CUDA attribute resolution
            sub_74E240(...);   // CUDA attribute application
        }
    }
}

The GPU mode flag dword_106C2C0 is also checked during:

Attribute token processing in sub_686350 (handle_attribute_token, 584 lines)
Deferred diagnostic emission in sub_668660 (severity override via byte_126ED55)
Entity visibility computation in sub_669130
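
The gating can be pictured as a lookup that is short-circuited by the mode flag. This is a hedged sketch: the function name and the linear string comparison are illustrative assumptions (the binary resolves keywords through its symbol table), and the `gpu_mode` parameter stands in for dword_106C2C0.

```c
#include <stddef.h>
#include <string.h>

/* CUDA identifiers named in the pseudocode above. */
static const char *const cuda_keywords[] = {
    "__device__", "__host__", "__global__", "__shared__",
    "__constant__", "__managed__", "__launch_bounds__", "__grid_constant__"
};

/* Returns 1 only when GPU mode is active AND the identifier is a CUDA keyword. */
static int is_cuda_keyword(const char *ident, int gpu_mode) {
    if (!gpu_mode)
        return 0;                      /* host-only mode: plain identifier */
    for (size_t i = 0; i < sizeof cuda_keywords / sizeof cuda_keywords[0]; i++) {
        if (strcmp(ident, cuda_keywords[i]) == 0)
            return 1;
    }
    return 0;
}
```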

C++ Standard Version Gating

Throughout get_token_main, keyword classification is gated on the C++ standard version stored in dword_126EF68:

Version Value	Standard	Keywords and Features Enabled
201102	C++11	`constexpr`, `decltype`, `nullptr`, `char16_t`, `char32_t`, `static_assert`
201402	C++14	binary literals, digit separators
201703	C++17	`if constexpr`, structured bindings
202002	C++20	`char8_t`, `concept`, `requires`, `co_yield`, `co_return`, `co_await`, `consteval`, `constinit`
202302	C++23	`typeof`, `typeof_unqual`, extended digit separators

The language mode at dword_126EFB4 controls broader dialect selection:

Value	Mode	Effect
1	GNU/default	GNU extensions enabled, alternative tokens recognized
2	C++	C++ keywords registered; MSVC keywords (`__declspec`, `__int8`--`__int64`) are gated separately on dword_126EFB0

Context-Sensitive Token Classification

C++ requires the lexer to classify identifiers based on declaration context. The functions supporting this classification:

Address	Identity	Description
`sub_668C90`	`classify_identifier_entity`	Dispatches on entity kind: typedef(3), class(4,5), function(7,9), template(19-22)
`sub_668E00`	`resolve_entity_through_alias`	Walks typedef/using chains (kind=3 with `+104` flag, kind=16 → `**[+88]`)
`sub_668F80`	`get_resolved_entity_type`	Resolves entity to underlying type through alias chains
`sub_668900`	`handle_token_identifier_type_check`	Determines if token is identifier vs typename vs template
`sub_666720`	`select_dual_lookup_symbol`	Selects between two candidate symbols in dual-scope lookup (372 lines)

Entity classification reads the entity_kind byte at offset +80 of entity nodes:

switch (entity->kind) {    // offset +80
    case 3:                // typedef
        return TYPE_NAME;
    case 4: case 5:        // class / struct
        return CLASS_NAME;
    case 6:                // enum
        return ENUM_NAME;
    case 7:                // function
        return IDENTIFIER;
    case 9: case 10:       // namespace / namespace alias
        return NAMESPACE_NAME;
    case 19: case 20: case 21: case 22:  // template kinds
        return TEMPLATE_NAME;
    case 16:               // using declaration
        return resolve_through_using(entity);
    case 24:               // namespace alias (resolved)
        return NAMESPACE_NAME;
}

Layer 3: get_next_token (sub_676860)

get_next_token (1,995 lines) is the token delivery function called by the parser. It manages the token cache, handles macro expansion replay, and calls get_token_main only when no cached tokens are available.

Token Delivery Flow

get_next_token (sub_676860)
  │
  ├── Check cached_token_rescan_list (qword_1270150)
  │     └── If non-empty: pop token, dispatch on cache_entry_kind
  │           ├── kind 1 (identifier): load xmmword_106C380..106C3B0
  │           ├── kind 2 (macro_def): call sub_5BA500 (macro expansion)
  │           ├── kind 3 (pragma): process deferred pragma
  │           ├── kind 4 (pp_number): return as-is
  │           ├── kind 6 (string): return string token
  │           └── kind 8 (concatenated_string): return concatenated string
  │
  ├── Check reusable_cache_stack (qword_1270128)
  │     └── If non-empty: pop and return cached token
  │           (assert: "get_token_from_reusable_cache_stack" at 4450, 4469)
  │
  ├── Check pending_macro_arg (qword_106B8A0)
  │     └── If set: process macro argument token
  │
  └── Fall through to get_token_main (sub_6810F0)
        └── Full token acquisition from source
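
The priority order in the flow above reduces to a chain of emptiness checks. The sketch below models only that ordering; the struct, enum, and function names are hypothetical, and each pointer merely stands in for the corresponding global (qword_1270150, qword_1270128, qword_106B8A0).

```c
#include <stddef.h>

enum token_source {
    SRC_RESCAN_LIST,        /* cached_token_rescan_list is non-empty */
    SRC_REUSABLE_STACK,     /* reusable_cache_stack is non-empty     */
    SRC_PENDING_MACRO_ARG,  /* pending_macro_arg is set              */
    SRC_GET_TOKEN_MAIN      /* nothing cached: scan from source      */
};

struct lexer_queues {
    const void *rescan_list;
    const void *reusable_cache_stack;
    const void *pending_macro_arg;
};

/* Picks the first non-empty source, in the order shown in the flow diagram. */
static enum token_source select_token_source(const struct lexer_queues *q) {
    if (q->rescan_list != NULL)
        return SRC_RESCAN_LIST;
    if (q->reusable_cache_stack != NULL)
        return SRC_REUSABLE_STACK;
    if (q->pending_macro_arg != NULL)
        return SRC_PENDING_MACRO_ARG;
    return SRC_GET_TOKEN_MAIN;
}
```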

The function sets the following globals on every token delivery:

word_126DD58 = token code
qword_126DD38 = source position
dword_126DF90 = token flags 1
dword_126DF8C = token flags 2
qword_126DF80 = extra data

CUDA Attribute Token Interception

When CUDA attribute processing is enabled (dword_106BA08 != 0), get_next_token intercepts identifier tokens and routes them through CUDA attribute resolution via sub_74DC30 and sub_74E240. This allows CUDA execution-space attributes (__device__, __host__, __global__) to be recognized at the token level rather than requiring full declaration parsing.

Numeric Literal Scanner: scan_numeric_literal (sub_672390)

The numeric literal scanner is 1,571 lines and handles every numeric literal format defined by C89 through C++23.

Literal Prefix Dispatch

scan_numeric_literal
  │
  ├── First char '0':
  │     ├── 0x/0X → hex literal (isxdigit validation)
  │     ├── 0b/0B → binary literal (C++14)
  │     ├── 0[0-7] → octal literal
  │     └── 0 alone → decimal zero
  │
  ├── First char '1'-'9':
  │     └── decimal literal
  │
  └── After integer part:
        ├── '.' → floating-point literal
        ├── 'e'/'E' → decimal float exponent
        ├── 'p'/'P' → hex float exponent
        └── suffix → type suffix parsing
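
The first-character branch of this dispatch can be sketched as a small classifier. The function and enum names are hypothetical, and the sketch covers only the integer-base decision, not the float or suffix paths.

```c
/* Numeric base selected by the literal's prefix. */
enum literal_base {
    BASE_BINARY  = 2,
    BASE_OCTAL   = 8,
    BASE_DECIMAL = 10,
    BASE_HEX     = 16
};

/* s must point at the first digit of a NUL-terminated literal spelling. */
static enum literal_base classify_integer_prefix(const char *s) {
    if (s[0] != '0')
        return BASE_DECIMAL;                    /* '1'..'9' */
    if (s[1] == 'x' || s[1] == 'X')
        return BASE_HEX;                        /* 0x / 0X  */
    if (s[1] == 'b' || s[1] == 'B')
        return BASE_BINARY;                     /* 0b / 0B  */
    if (s[1] >= '0' && s[1] <= '7')
        return BASE_OCTAL;                      /* 0[0-7]   */
    return BASE_DECIMAL;                        /* lone 0, or 0 before '.', 'e', a suffix */
}
```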

C++14 Digit Separators

Digit separators (' characters within numeric literals) are handled through a two-flag system:

Address	Name	Purpose
`dword_126EEFC`	`cpp14_digit_separators_enabled`	Master enable for digit separator support
`dword_126DB58`	`digit_separator_seen`	Set when a separator is encountered in the current literal

When dword_126EEFC is enabled, the scanner accepts ' between digits:

// Digit separator handling in scan_numeric_literal
while (isdigit(*pos) || (*pos == '\'' && dword_126EEFC)) {
    if (*pos == '\'') {
        dword_126DB58 = 1;  // mark separator seen
        pos++;
        if (!isdigit(*pos))
            emit_diagnostic(2629);  // separator not followed by digit
        continue;
    }
    // process digit...
    pos++;              // advance past the digit
}

C++23 extended digit separators (for binary, octal, hex) are gated on dword_126EF68 > 202302:

if (dword_126EF68 > 202302) {
    // C++23: allow digit separators in binary/octal/hex
} else {
    emit_diagnostic(2628);  // C++23 feature used in earlier mode
}

Integer Suffix Parsing

sub_6748A0 (convert_integer_suffix, 137 lines) parses the following suffixes:

Suffix	Type
(none)	`int` (or promoted per value)
`u` / `U`	`unsigned int`
`l` / `L`	`long`
`ll` / `LL`	`long long`
`ul` / `UL`	`unsigned long`
`ull` / `ULL`	`unsigned long long`
`z` / `Z`	signed counterpart of `size_t` (C++23)
`uz` / `UZ`	`size_t` (C++23)

sub_674BB0 (determine_numeric_literal_type, 400 lines) applies the C++ promotion rules based on the literal value and suffix to determine the final type.
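
Suffix parsing of this kind can be sketched as a scan that accepts at most one unsigned marker and at most one length marker, in either order. This is an illustration under stated assumptions, not the logic recovered from sub_6748A0: the struct, its field encoding, and the function name are hypothetical.

```c
/* Parsed suffix: length is 0 (none), 1 (l), 2 (ll), or 3 (z). */
struct int_suffix {
    int is_unsigned;
    int length;
    int valid;
};

/* s points at the first suffix character of a NUL-terminated spelling. */
static struct int_suffix parse_integer_suffix(const char *s) {
    struct int_suffix r = { 0, 0, 1 };
    int seen_u = 0, seen_len = 0;
    while (*s != '\0') {
        char c = *s;
        if ((c == 'u' || c == 'U') && !seen_u) {
            seen_u = 1;
            r.is_unsigned = 1;
            s++;
        } else if ((c == 'l' || c == 'L') && !seen_len) {
            seen_len = 1;
            if (s[1] == c) {           /* "ll" or "LL"; mixed case "lL" is rejected */
                r.length = 2;
                s += 2;
            } else {
                r.length = 1;
                s++;
            }
        } else if ((c == 'z' || c == 'Z') && !seen_len) {
            seen_len = 1;
            r.length = 3;
            s++;
        } else {
            r.valid = 0;               /* duplicate marker or unknown character */
            break;
        }
    }
    return r;
}
```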

Floating-Point Literal Handling

Address	Identity	Description
`sub_675390`	`scan_float_exponent`	Scans `e`/`E`/`p`/`P` exponent suffix (57 lines)
`sub_6754B0`	`convert_float_literal`	Converts float literal string to value (338 lines)

Float suffixes: f/F (float), l/L (long double), none (double).

Universal Character Names: scan_universal_character (sub_6711E0)

sub_6711E0 (278 lines, assert at lexical.c:12384) scans \uXXXX and \UXXXXXXXX universal character names in identifiers and string/character literals.

void scan_universal_character(char *input, uint32_t *result) {
    int width;
    if (input[1] == 'u')
        width = 4;    // \uXXXX
    else
        width = 8;    // \UXXXXXXXX
    input += 2;       // skip the backslash and the u/U marker

    uint32_t value = 0;
    for (int i = 0; i < width; i++) {
        unsigned char c = (unsigned char)*input++;
        if (!isxdigit(c)) {
            // emit error diagnostic
            return;
        }
        int digit;
        if (c >= '0' && c <= '9')
            digit = c - 48;      // '0' = 48
        else if (islower(c))
            digit = c - 87;      // 'a' = 97, 97-87 = 10
        else
            digit = c - 55;      // 'A' = 65, 65-55 = 10
        value = (value << 4) | digit;   // accumulate one hex nibble
    }
    *result = value;
}

sub_671870 (validate_universal_character_value, 62 lines) performs range checking after scanning: surrogate pair values (0xD800--0xDFFF) are rejected, and values outside the valid Unicode range (> 0x10FFFF) produce an error.
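
The range check described above amounts to two comparisons. A minimal sketch, with a hypothetical function name and a boolean return in place of the diagnostics the real function emits:

```c
#include <stdint.h>

/* Returns 1 if cp may be denoted by a universal character name, else 0. */
static int is_valid_universal_character(uint32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* surrogate code points are rejected */
    if (cp > 0x10FFFF)
        return 0;                      /* beyond the Unicode code space      */
    return 1;
}
```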

The feature is controlled by dword_106BCC4 (universal characters enabled) and dword_106BD4C (extended character mode).

Keyword Registration: keyword_init (sub_5863A0)

sub_5863A0 (1,113 lines, in fe_init.c) registers all C/C++ keywords with the symbol table during frontend initialization. It calls sub_7463B0 (enter_keyword) once per keyword, passing the token ID and string representation. GNU double-underscore variants are registered via sub_585B10, and alternative tokens via sub_749600.

Keyword Categories and Version Gating

Keywords are registered conditionally based on language mode and standard version:

keyword_init (sub_5863A0)
  │
  ├── C89 core (always registered)
  │     auto(77), break(78), case(79), char(80), continue(82),
  │     default(83), do(84), double(85), else(86), enum(87),
  │     extern(88), float(89), for(90), goto(91), if(92),
  │     int(93), long(94), register(95), return(96), short(97),
  │     sizeof(99), static(100), struct(101), switch(102),
  │     typedef(103), union(104), unsigned(105), void(106), while(108)
  │
  ├── C99 (gated on C99+ mode)
  │     _Bool(120), _Complex(121), _Imaginary(122), restrict(119)
  │
  ├── C11 (gated on C11+ mode)
  │     _Generic(262), _Atomic(263), _Alignof(247), _Alignas(248),
  │     _Thread_local(194), _Static_assert(184), _Noreturn(260)
  │
  ├── C23 (gated on C23 mode)
  │     bool, true, false, alignof, alignas, static_assert,
  │     thread_local, typeof(189), typeof_unqual(190)
  │
  ├── C++ core (gated on C++ mode: dword_126EFB4 == 2)
  │     catch(150), class(151), friend(153), inline(154),
  │     mutable(174), operator(156), new(155), delete(152),
  │     private(157), protected(158), public(159), template(160),
  │     this(161), throw(162), try(163), virtual(164),
  │     namespace(175), using(179), typename(183), typeid(178),
  │     const_cast(166), dynamic_cast(167), static_cast(177),
  │     reinterpret_cast(176)
  │
  ├── C++ alternative tokens (gated on C++ mode)
  │     and(52), and_eq(64), bitand(33), bitor(51), compl(37),
  │     not(38), not_eq(48), or(53), or_eq(66), xor(50), xor_eq(65)
  │
  ├── C++ modern keywords (gated on standard version)
  │     C++11: constexpr(244), decltype(185), nullptr(237),
  │            char16_t(126), char32_t(127)
  │     C++20: char8_t(128), consteval(245), constinit(246),
  │            co_yield(267), co_return(268), co_await(269),
  │            concept(295), requires(294)
  │     C++23: typeof(189), typeof_unqual(190)
  │
  ├── GNU extensions (gated on dword_126EFA8)
  │     __extension__(187), __auto_type(186), __attribute(142),
  │     __builtin_offsetof(117), __builtin_types_compatible_p(143),
  │     __builtin_shufflevector(258), __builtin_convertvector(259),
  │     __builtin_complex(261), __builtin_has_attribute(296),
  │     __builtin_addressof(271), __builtin_bit_cast(297),
  │     __int128(239), __bases(249), __direct_bases(250),
  │     _Float32(331), _Float32x(332), _Float64(333),
  │     _Float64x(334), _Float128(335)
  │
  ├── MSVC extensions (gated on dword_126EFB0)
  │     __declspec(132), __int8(133), __int16(134),
  │     __int32(135), __int64(136)
  │
  ├── Clang extensions (gated on Clang version at qword_126EF90)
  │     _Nullable(264), _Nonnull(265), _Null_unspecified(266)
  │
  ├── Type traits (60+, gated by standard version)
  │     __is_class(207), __is_enum, __is_union, __has_trivial_copy,
  │     __has_virtual_destructor, ... through token code 327
  │
  ├── NVIDIA CUDA type traits (gated on GPU mode)
  │     __nv_is_extended_device_lambda_closure_type(328),
  │     __nv_is_extended_host_device_lambda_closure_type(329),
  │     __nv_is_extended_device_lambda_with_preserved_return_type(330)
  │
  └── EDG internal keywords (always registered)
        __edg_type__(272), __edg_size_type__(277),
        __edg_ptrdiff_type__(278), __edg_bool_type__(279),
        __edg_wchar_type__(280), __edg_opnd__(282),
        __edg_throw__(281), __edg_is_deducible(304),
        __edg_vector_type__(273), __edg_neon_vector_type__(274)

Version gating globals used during keyword registration:

Address	Name	Values
`dword_126EFB4`	`language_mode`	1 = K&R C / GNU default, 2 = C++
`dword_126EF68`	`cpp_standard_version`	199900, 201102, 201402, 201703, 202002, 202302
`qword_126EF98`	`gnu_version`	e.g., `0x9FC3` = 40899, which reads as a GCC 4.8.99 threshold if versions are encoded as major*10000 + minor*100 + patch
`qword_126EF90`	`clang_version`	e.g., `0x15F8F`, `0x1D4BF`
`dword_126EFA8`	`gnu_extensions_enabled`	Boolean
`dword_126EFA4`	`extensions_enabled`	Boolean (Clang compat)
`dword_126EFAC`	`c_language_mode`	Boolean: C vs C++
`dword_126EFB0`	`microsoft_extensions_enabled`	Boolean
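
Version-gated registration can be sketched as a table of (name, token code, minimum version) entries filtered by the active standard. The token codes below are taken from the registration tree above; the table layout, the `>=` comparison, and the function name are assumptions made for illustration, and the real code calls enter_keyword per keyword instead of consulting a table.

```c
#include <stddef.h>
#include <string.h>

struct keyword_spec {
    const char *name;
    int token_code;
    long min_cpp_version;   /* compared against cpp_standard_version */
};

static const struct keyword_spec gated_keywords[] = {
    { "constexpr", 244, 201102 },
    { "decltype",  185, 201102 },
    { "nullptr",   237, 201102 },
    { "consteval", 245, 202002 },
    { "constinit", 246, 202002 },
    { "requires",  294, 202002 },
    { "concept",   295, 202002 }
};

/* Returns the token code if ident is a keyword in the given standard, else -1. */
static int keyword_token_code(const char *ident, long cpp_standard_version) {
    size_t n = sizeof gated_keywords / sizeof gated_keywords[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(ident, gated_keywords[i].name) == 0)
            return cpp_standard_version >= gated_keywords[i].min_cpp_version
                       ? gated_keywords[i].token_code
                       : -1;           /* known spelling, but not yet a keyword */
    }
    return -1;                         /* ordinary identifier */
}
```

Under this model an identifier such as `concept` stays an ordinary identifier in C++17 mode, which matches the behaviour expected of a conforming frontend.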

String and Character Literal Scanning

Character Literal Scanning

Address	Identity	Lines	Description
`sub_66CB30`	`scan_character_literal_prefix`	34	Detects encoding prefix (`L`, `u`, `U`, `u8`)
`sub_66CBD0`	`scan_character_literal`	111	Scans `'x'` / `L'x'` / `u'x'` / `U'x'` / `u8'x'` literals

String Literal Scanning

Address	Identity	Lines	Description
`sub_66C550`	`scan_string_literal`	356	Scans quoted string literals with escape sequences
`sub_676080`	`scan_raw_string_literal`	391	Scans `R"delimiter(content)delimiter"` raw strings
`sub_66E6E0`	`scan_identifier_suffix`	94	Checks for user-defined literal suffixes (C++11)
`sub_66E920`	`is_valid_ud_suffix`	51	Validates user-defined literal suffix names
`sub_6892F0`	`string_literal_concatenation_check`	107	Checks adjacent string literal tokens for concatenation
`sub_689550`	`process_user_defined_literal`	332	Handles C++11 UDL operator lookup

Encoding Prefixes

The lexer recognizes five string literal encodings (the unprefixed form plus four prefixes), each producing a different string literal type:

Prefix	Token	Character Type	Width
(none)	`"..."`	`char`	1 byte
`L`	`L"..."`	`wchar_t`	4 bytes (Linux)
`u8`	`u8"..."`	`char8_t` (C++20) / `char`	1 byte
`u`	`u"..."`	`char16_t`	2 bytes
`U`	`U"..."`	`char32_t`	4 bytes
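
Prefix detection has one ordering subtlety: `u8` must be tested before `u`. The sketch below shows that ordering and returns the element width from the table above. The function and table names are hypothetical, the 4-byte `wchar_t` width assumes Linux as noted in the table, and a trailing `R` is accepted so that raw strings share the same path.

```c
#include <stddef.h>
#include <string.h>

struct encoding_prefix {
    const char *prefix;
    int width;              /* bytes per character element */
};

static const struct encoding_prefix encoding_prefixes[] = {
    { "u8", 1 },            /* must precede "u" */
    { "u",  2 },
    { "U",  4 },
    { "L",  4 }             /* wchar_t on Linux */
};

/* Returns the element width of a string literal spelling, or 0 if tok
 * does not start a string literal. */
static int string_literal_char_width(const char *tok) {
    size_t count = sizeof encoding_prefixes / sizeof encoding_prefixes[0];
    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(encoding_prefixes[i].prefix);
        if (strncmp(tok, encoding_prefixes[i].prefix, n) != 0)
            continue;
        if (tok[n] == '"' || (tok[n] == 'R' && tok[n + 1] == '"'))
            return encoding_prefixes[i].width;
    }
    if (tok[0] == '"' || (tok[0] == 'R' && tok[1] == '"'))
        return 1;           /* unprefixed: plain char */
    return 0;
}
```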

Scope Entry Layout

The lexer interacts heavily with the scope system. Scope entries are 784-byte records stored in an array at qword_126C5E8, indexed by dword_126C5E4 (current scope index).

Offset	Size	Field	Description
`+0`	4	`name_hash`	Hash of scope name for lookup
`+4`	1	`scope_kind`	Kind code (12 = file scope, see below)
`+6`	1	`scope_flags`	Bit flags: bit 5 = inline namespace
`+7`	1	`access_flags`	Bit 0 = in class context
`+10`	1	`extra_flags`	Bit 0 = module scope
`+12`	1	`template_flags`	Bit 0 = in template argument scan, bit 4 = has concepts
`+24`	8	`symbol_chain_or_hash_ptr`	Head of symbol chain or hash table
`+32`	8	`hash_table_ptr`	Hash table for O(1) lookup in large scopes
`+192`	8	`lazy_load_scope_ptr`	Pointer for lazy symbol loading (calls `sub_7C1900`)
`+208`	4	`scope_depth`	Nesting depth counter
`+376`	8	`parent_template_info`	Template context for template scope entries
`+416`	8	`module_info`	C++20 module partition data
`+632`	8	`class_info_ptr`	Pointer to class descriptor for class scopes

Scope-related globals:

Address	Name	Description
`dword_126C5E4`	`current_scope_index`	Index into scope table
`dword_126C5C4`	`class_scope_index`	Innermost class scope (-1 if none)
`dword_126C5C8`	`namespace_scope_index`	Innermost namespace scope (-1 if none)
`dword_126C5DC`	`file_scope_index`	File (global) scope index
`xmmword_126C520`	`entity_kind_to_language_mode_map`	32-entry table mapping entity kinds to required language modes
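
Because only some of the 784 bytes are mapped, offset-based accessors are a safer way to model the scope table than a full struct. The following sketch assumes nothing beyond the entry size and the offsets and bit positions listed above; all function and constant names are hypothetical.

```c
#include <stddef.h>

enum {
    SCOPE_ENTRY_SIZE            = 784,
    SCOPE_KIND_OFFSET           = 4,
    SCOPE_FLAGS_OFFSET          = 6,
    SCOPE_TEMPLATE_FLAGS_OFFSET = 12
};

/* Address of entry `index` in a table laid out like the array at qword_126C5E8. */
static const unsigned char *scope_entry(const unsigned char *table, int index) {
    return table + (size_t)index * SCOPE_ENTRY_SIZE;
}

static int scope_kind(const unsigned char *table, int index) {
    return scope_entry(table, index)[SCOPE_KIND_OFFSET];
}

/* scope_flags bit 5: inline namespace. */
static int scope_is_inline_namespace(const unsigned char *table, int index) {
    return (scope_entry(table, index)[SCOPE_FLAGS_OFFSET] >> 5) & 1;
}

/* template_flags bit 0: currently inside a template argument scan. */
static int scope_in_template_arg_scan(const unsigned char *table, int index) {
    return scope_entry(table, index)[SCOPE_TEMPLATE_FLAGS_OFFSET] & 1;
}
```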

Lexer State Stack

The lexer supports push/pop of its entire state for speculative parsing and template argument scanning.

Address	Identity	Lines	Description
`sub_688320`	`push_lexical_state`	137	Pushes current lexer state onto `qword_126DB40` stack
`sub_668330`	`pop_lexical_state_stack_full`	166	Pops state, restores stop-token table, macro chains (assert at `lexical.c:17808`)

State stack nodes are 80-byte linked-list entries:

Offset	Size	Field
`+0`	8	`next` (previous state)
`+8`	8	`cached_tokens`
`+16`	8	`source_position`
`+24`--`+72`	56	`token_cache_state` (saved cache pointers and flags)

The push/pop mechanism is used for:

Template argument list scanning (sub_67DC90, 1,078 lines)
Speculative parsing in disambiguation contexts
Macro expansion state save/restore

Template Argument Scanning: scan_template_argument_list (sub_67DC90)

sub_67DC90 (1,078 lines, assert at lexical.c:19918) scans template argument lists (<...>). This is one of the most complex lexer functions because of the >> ambiguity: in vector<vector<int>>, the closing >> must be split into two > tokens to close two template argument lists.

The scanner:

Pushes lexer state and sets template argument scanning mode (scope entry offset +12, bit 0)
Scans tokens while tracking nesting depth of <> pairs
Handles nested template-ids recursively
Creates token cache entries for deferred parsing
Uses the scope system to classify identifiers within template arguments
Disambiguates >> as either right-shift or double template close

The entity kind checks at offsets +80 (values 19--22) identify template entities for recursive template-id scanning.
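
The effect of splitting `>>` can be shown with a character-level depth counter: once each `>` is treated individually, two nested lists close on consecutive characters. This is a deliberately simplified sketch with a hypothetical function name. It treats every `<` outside parentheses as an opening bracket, which the real scanner cannot do; that is exactly why it consults the scope system.

```c
#include <stddef.h>

/* s[0] must be '<'. Returns the index of the '>' that closes it, or -1. */
static long find_template_close(const char *s) {
    int angle_depth = 0;
    int paren_depth = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        char c = s[i];
        if (c == '(') {
            paren_depth++;
        } else if (c == ')') {
            if (paren_depth > 0)
                paren_depth--;
        } else if (paren_depth > 0) {
            continue;                  /* inside (): < and > are operators */
        } else if (c == '<') {
            angle_depth++;
        } else if (c == '>') {
            /* Each '>' of a ">>" pair closes one list. */
            if (--angle_depth == 0)
                return (long)i;
        }
    }
    return -1;                         /* unterminated argument list */
}
```

For the argument list of `vector<vector<int>>`, the first `>` of the pair closes the inner list and the second closes the outer one, while `<(a >> b)>` keeps its shift operator because the pair sits inside parentheses.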

Preprocessor Integration

The lexer handles several preprocessor-related responsibilities:

Source Position Tracking

Address	Identity	Lines	Description
`sub_66D100`	`set_source_position`	282	Converts raw input position to file/line/column (called from dozens of locations)
`sub_66D5E0`	`emit_output_line`	491	Emits source text and `#line` directives to preprocessed output
`sub_66B1F0`	`emit_preprocessed_output`	231	Outputs `#line` directives via `qword_106C280` (output `FILE*`)

Macro Expansion Support

Address	Identity	Lines	Description
`sub_66A770`	`lookup_macro_at_position`	41	Scans macro chain (`qword_126DD80`) for macro enclosing given position
`sub_66A7F0`	`create_macro_expansion_record`	44	Allocates macro expansion tracking node
`sub_66A890`	`push_macro_expansion`	41	Pushes new expansion onto active stack
`sub_66A940`	`pop_macro_expansion`	28	Pops expansion from stack
`sub_66A9D0`	`is_in_macro_expansion`	12	Returns whether currently inside macro expansion
`sub_66A9F0`	`get_macro_expansion_depth`	17	Returns nesting depth of macro expansions
`sub_66A310`	`invalidate_macro_node`	56	Clears macro definition when it goes out of scope
`sub_66A5E0`	`free_macro_definition_chain`	91	Walks and frees macro chain via `qword_126DD70` / `qword_126DDE0`

Include File Handling

Address	Identity	Lines	Description
`sub_66BB50`	`open_source_file`	332	Opens include files via `sub_4F4970` (fopen wrapper), creates file tracking nodes
`sub_66EA70`	`open_next_input_file`	364	Opens next input source after current file ends, manages include-stack unwinding
`sub_67BAB0`	`scan_header_name`	110	Scans `<filename>` or `"filename"` for `#include` directives

Token Pasting and Stringification

Address	Identity	Lines	Description
`sub_67D1E0`	`handle_token_pasting`	117	Implements `##` preprocessor operator
`sub_67D440`	`stringify_token`	251	Implements `#` preprocessor operator
`sub_67D050`	`check_token_paste_validity`	57	Validates token paste produces a valid token
`sub_67D900`	`expand_macro_argument`	204	Expands a single macro argument during substitution

Operator Scanning

Multi-character operators are scanned by a set of dedicated functions in the 0x67ABB0--0x67BAB0 range. The scanner reads the first operator character and dispatches to the appropriate function to check for compound operators:

First Char	Possible Tokens
`<`	`<`, `<=`, `<<`, `<<=`, `<=>`, `<%` (digraph `{`), `<:` (digraph `[`)
`>`	`>`, `>=`, `>>`, `>>=`
`+`	`+`, `++`, `+=`
`-`	`-`, `--`, `-=`, `->`, `->*`
`*`	`*`, `*=`
`&`	`&`, `&&`, `&=`
`\|`	`\|`, `\|\|`, `\|=`
`=`	`=`, `==`
`!`	`!`, `!=`
`:`	`:`, `::`
`.`	`.`, `...`, `.*`
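
Compound-operator scanning follows the longest-match ("maximal munch") rule. A minimal sketch for the `<` and `-` rows of the table, using a candidate list ordered longest first; the table-driven form and the names are assumptions, since the binary uses dedicated per-character functions.

```c
#include <stddef.h>
#include <string.h>

/* Candidates for '<' and '-' from the table above, longest first. */
static const char *const operator_candidates[] = {
    "<<=", "<=>", "->*",
    "<<", "<=", "<%", "<:", "->", "--", "-=",
    "<", "-"
};

/* Returns the length of the longest operator at the start of s, or 0. */
static size_t match_operator(const char *s) {
    size_t count = sizeof operator_candidates / sizeof operator_candidates[0];
    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(operator_candidates[i]);
        if (strncmp(s, operator_candidates[i], n) == 0)
            return n;                  /* first hit is the longest match */
    }
    return 0;
}
```

Ordering the candidates by decreasing length means the first hit is the longest, so no backtracking is needed. Template contexts later override this rule for `>>`.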

Template Angle Bracket Disambiguation

sub_67CB70 (handle_template_angle_brackets, 263 lines) handles the critical disambiguation of < and > in template contexts. In template argument lists, < opens and > closes, but in expressions, they are comparison operators. The function uses scope context information and the current parsing state (from the 784-byte scope entries) to make the determination.

Error Recovery

Address	Identity	Lines	Description
`sub_6887C0`	`skip_to_token`	317	Error recovery: skips tokens until finding a synchronization point (`;`, `}`, etc.)
`sub_6886F0`	`expect_token`	31	Checks current token matches expected kind, emits diagnostic on mismatch
`sub_688560`	`peek_next_token`	44	Looks ahead at next token without consuming it

The stop-token table at qword_126DB48 + 8 (357 entries) controls which token kinds are valid synchronization points for error recovery.
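
Recovery driven by such a table reduces to skipping until a token whose table entry is set. The sketch below assumes one byte per token kind and works on an in-memory token array; both choices, along with the function name and the example token codes used to exercise it, are hypothetical.

```c
#include <stddef.h>

enum { TOKEN_KIND_COUNT = 357 };       /* token codes 0..356 */

/* Returns the index of the first token at or after `start` whose kind is
 * marked in `stop`, or `count` if no synchronization point is found. */
static size_t skip_to_stop_token(const unsigned short *tokens, size_t count,
                                 size_t start,
                                 const unsigned char stop[TOKEN_KIND_COUNT]) {
    size_t i = start;
    while (i < count) {
        unsigned short kind = tokens[i];
        if (kind < TOKEN_KIND_COUNT && stop[kind])
            break;                     /* synchronization point reached */
        i++;
    }
    return i;
}
```

Keeping the set of synchronization tokens in data rather than code lets callers widen or narrow it per context, which fits the table being saved and restored together with the lexer state.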

Built-in Type and Attribute Handling

Address	Identity	Lines	Description
`sub_685AB0`	`handle_builtin_type_token`	289	Processes built-in type keywords (`int`, `float`, etc.) into type tokens
`sub_685F10`	`process_decltype_token`	212	Handles `decltype()` expression in token stream
`sub_686350`	`handle_attribute_token`	584	Processes `[[attribute]]` and `__attribute__((x))` syntax, including CUDA attributes
`sub_686F40`	`process_asm_or_extension_keyword`	244	Handles `asm`, `__asm__`, and extension keywords

Diagnostic Strings

String	Source	Condition
`"pop_lexical_state_stack_full"`	`sub_668330`	Assert at `lexical.c:17808`
`"copy_tokens_from_cache"`	`sub_669650`	Assert at `lexical.c:3417`
`"scan_universal_character"`	`sub_6711E0`	Assert at `lexical.c:12384`
`"get_token_from_cached_token_rescan_list"`	`sub_676860`	Assert at `lexical.c:4302`
`"get_token_from_reusable_cache_stack"`	`sub_676860`	Assert at `lexical.c:4450`, `4469`
`"scan_template_argument_list"`	`sub_67DC90`	Assert at `lexical.c:19918`
`"select_dual_lookup_symbol"`	`sub_666720`	Assert at `lexical.c:22477`
`"keyword_init"`	`sub_5863A0`	Assert at `fe_init.c:1597`
`"fe_translation_unit_init"`	`sub_5863A0`	Assert at `fe_init.c:2373`

Diagnostic Code	Context	Meaning
870	Character literal scanning	Invalid character in literal
912	`select_dual_lookup_symbol`	Ambiguous lookup result
1192	Control byte type 6	Stale source position marker
861	Control byte type 6	Invalid position reference
1665	`check_deferred_diagnostics`	Deferred macro-related warning
1750	`refill_buffer`	Trigraph sequence warning
2628	Numeric literal scanner	C++23 digit separator used in earlier mode
2629	Numeric literal scanner	Digit separator not followed by digit

Function Map

Address	Identity	Confidence	Lines	EDG Source
`sub_5863A0`	`keyword_init` / `fe_translation_unit_init`	98%	1,113	`fe_init.c:1597`
`sub_666720`	`select_dual_lookup_symbol`	HIGH	372	`lexical.c:22477`
`sub_668330`	`pop_lexical_state_stack_full`	HIGH	166	`lexical.c:17808`
`sub_668660`	`check_deferred_diagnostics`	MEDIUM	104	`lexical.c`
`sub_6688A0`	`get_scope_from_entity`	HIGH	32	`lexical.c`
`sub_668C90`	`classify_identifier_entity`	MEDIUM	89	`lexical.c`
`sub_668E00`	`resolve_entity_through_alias`	MEDIUM	88	`lexical.c`
`sub_669650`	`copy_tokens_from_cache`	HIGH	385	`lexical.c:3417`
`sub_669D00`	`allocate_token_cache_entry`	MEDIUM	119	`lexical.c`
`sub_66A000`	`append_to_token_cache`	MEDIUM	88	`lexical.c`
`sub_66A140`	`push_token_to_rescan_list`	MEDIUM	46	`lexical.c`
`sub_66A3F0`	`create_source_region_node`	MEDIUM	84	`lexical.c`
`sub_66A5E0`	`free_macro_definition_chain`	MEDIUM	91	`lexical.c`
`sub_66A770`	`lookup_macro_at_position`	MEDIUM	41	`lexical.c`
`sub_66A890`	`push_macro_expansion`	MEDIUM	41	`lexical.c`
`sub_66AA50`	`process_preprocessor_directive`	MEDIUM	380	`lexical.c`
`sub_66B1F0`	`emit_preprocessed_output`	MEDIUM	231	`lexical.c`
`sub_66B910`	`skip_whitespace_and_comments`	MEDIUM	105	`lexical.c`
`sub_66BB50`	`open_source_file`	HIGH	332	`lexical.c`
`sub_66C550`	`scan_string_literal`	MEDIUM	356	`lexical.c`
`sub_66CBD0`	`scan_character_literal`	MEDIUM	111	`lexical.c`
`sub_66D100`	`set_source_position`	HIGH	282	`lexical.c`
`sub_66D5E0`	`emit_output_line`	HIGH	491	`lexical.c`
`sub_66DFF0`	`scan_pp_number`	MEDIUM	268	`lexical.c`
`sub_66EA70`	`open_next_input_file`	MEDIUM	364	`lexical.c`
`sub_66F4E0`	`read_next_source_line`	HIGH	735	`lexical.c`
`sub_6702F0`	`refill_buffer`	HIGH	792	`lexical.c`
`sub_6711E0`	`scan_universal_character`	HIGH	278	`lexical.c:12384`
`sub_671870`	`validate_universal_character_value`	MEDIUM	62	`lexical.c`
`sub_6719B0`	`scan_identifier_or_keyword`	HIGH	400	`lexical.c`
`sub_672390`	`scan_numeric_literal`	HIGH	1,571	`lexical.c`
`sub_6748A0`	`convert_integer_suffix`	MEDIUM	137	`lexical.c`
`sub_674BB0`	`determine_numeric_literal_type`	MEDIUM	400	`lexical.c`
`sub_675390`	`scan_float_exponent`	MEDIUM	57	`lexical.c`
`sub_6754B0`	`convert_float_literal`	MEDIUM	338	`lexical.c`
`sub_676080`	`scan_raw_string_literal`	MEDIUM-HIGH	391	`lexical.c`
`sub_676860`	`get_next_token`	HIGHEST	1,995	`lexical.c:4302`
`sub_679800`	`scan_token`	HIGH	907	`lexical.c`
`sub_67BAB0`	`scan_header_name`	MEDIUM	110	`lexical.c`
`sub_67CB70`	`handle_template_angle_brackets`	MEDIUM	263	`lexical.c`
`sub_67D050`	`check_token_paste_validity`	LOW	57	`lexical.c`
`sub_67D1E0`	`handle_token_pasting`	MEDIUM	117	`lexical.c`
`sub_67D440`	`stringify_token`	MEDIUM	251	`lexical.c`
`sub_67D900`	`expand_macro_argument`	MEDIUM	204	`lexical.c`
`sub_67DC90`	`scan_template_argument_list`	HIGH	1,078	`lexical.c:19918`
`sub_67F2E0`	`create_template_argument_cache`	MEDIUM	184	`lexical.c`
`sub_67F740`	`rescan_template_arguments`	MEDIUM-HIGH	583	`lexical.c`
`sub_680670`	`resolve_dependent_template_id`	MEDIUM	240	`lexical.c`
`sub_680AE0`	`handle_dependent_name_context`	MEDIUM	235	`lexical.c`
`sub_6810F0`	`get_token_main`	HIGHEST	3,811	`lexical.c`
`sub_685AB0`	`handle_builtin_type_token`	MEDIUM	289	`lexical.c`
`sub_685F10`	`process_decltype_token`	MEDIUM	212	`lexical.c`
`sub_686350`	`handle_attribute_token`	MEDIUM-HIGH	584	`lexical.c`
`sub_686F40`	`process_asm_or_extension_keyword`	MEDIUM	244	`lexical.c`
`sub_687F30`	`setup_lexer_for_parsing_mode`	MEDIUM	216	`lexical.c`
`sub_688320`	`push_lexical_state`	MEDIUM	137	`lexical.c`
`sub_688560`	`peek_next_token`	MEDIUM	44	`lexical.c`
`sub_6886F0`	`expect_token`	MEDIUM	31	`lexical.c`
`sub_6887C0`	`skip_to_token`	MEDIUM	317	`lexical.c`

Cross-References

Pipeline Overview -- keyword registration during sub_5863A0
Entry Point & Initialization -- frontend init calls keyword_init
Template Engine -- template argument scanning at lexer level
Type System -- entity kind classification used by lexer
Token Kind Table -- full 357-entry token table
Scope Entry -- 784-byte scope entry structure
Entity Node Layout -- entity node offsets used by identifier classification
Global Variable Index -- all global addresses referenced here
Attribute System Overview -- CUDA attribute handling at token level

Expression Parser

The expression parser is the largest subsystem in cudafe++. It lives in EDG 6.6's expr.c, which compiles to approximately 385KB of code (address range 0x4F8000--0x556600) containing roughly 320 functions. The central function scan_expr_full (sub_511D40) alone occupies 80KB -- approximately 2,000 decompiled lines with over 300 local variables. EDG uses a hand-written recursive descent parser, not a generated one (no yacc/bison). Each C++ operator precedence level has its own scanning function, and the call chain follows the precedence hierarchy: assignment, conditional, logical-or, logical-and, bitwise-or, bitwise-xor, bitwise-and, equality, relational, shift, additive, multiplicative, pointer-to-member, unary, postfix, primary.

CUDA-specific extensions are woven directly into this subsystem: cross-execution-space call validation at every function call site, remapping of GCC __sync_fetch_and_* builtins to NVIDIA __nv_atomic_fetch_* intrinsics, and constexpr-if gating of literal evaluation based on compilation mode.

Key Facts

Property	Value
Source file	`expr.c` (~320 functions) + `exprutil.c` (~90 functions)
Address range	`0x4F8000`--`0x556600` (expr.c), `0x558720`--`0x55FE10` (exprutil.c)
Total code size	~385KB
Central dispatcher	`sub_511D40` (`scan_expr_full`, 80KB, ~2,000 lines, 300+ locals)
Ternary handler	`sub_526E30` (`scan_conditional_operator`, 48KB)
Function call handler	`sub_545F00` (`scan_function_call`, 2,490 lines)
New-expression handler	`sub_54AED0` (`scan_new_operator`, 2,333 lines)
Identifier handler	`sub_5512B0` (`scan_identifier`, 1,406 lines)
Template rescan	`sub_5565E0` (`rescan_expr_with_substitution_internal`, 1,558 lines)
Atomic builtin remapper	`sub_537BF0` (`adjust_sync_atomic_builtin`, 1,108 lines, NVIDIA-specific)
Cross-space validation	`sub_505720` (`check_cross_execution_space_call`, 4KB)
Current token global	`word_126DD58` (16-bit token kind)
Expression context	`qword_106B970` (current scope/context pointer)
Trace flag	`dword_126EFC8` (debug trace), `dword_126EFCC` (verbosity level)
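
The size figures can be cross-checked against the address ranges with a few lines of arithmetic. This is only a sanity check, and it assumes the ranges in the table are half-open (start inclusive, end exclusive). Under that assumption the expr.c range alone comes to 386,560 bytes, in line with the ~385KB figure, and the exprutil.c range adds another 30,448 bytes.

```python
# Address ranges copied from the Key Facts table.
# Treating them as half-open [start, end) is an assumption.
RANGES = {
    "expr.c": (0x4F8000, 0x556600),
    "exprutil.c": (0x558720, 0x55FE10),
}

# Size of each range in bytes.
sizes = {name: end - start for name, (start, end) in RANGES.items()}

for name, size in sizes.items():
    print(f"{name}: {size:,} bytes ({size / 1000:.0f} KB, {size / 1024:.0f} KiB)")
```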

Architecture

Recursive Descent, No Generator

EDG's expression parser is entirely hand-written C. There are no parser tables, no DFA state machines, and no grammar transformation output. Each operator precedence level maps to one or more scan_* functions that call down the precedence chain via direct function calls. The parser is effectively a family of mutually recursive functions whose call graph encodes the C++ grammar.

The top-level entry point is scan_expr_full, which serves a dual role: (1) it contains the primary-expression scanner as a massive switch on token kind, and (2) after scanning a primary expression, it enters a post-scan binary-operator dispatch loop that routes to the correct precedence-level handler based on the next operator token.

scan_expr_full (sub_511D40)
  │
  ├─ [token switch] ─────────► Primary expressions
  │     case 1   → scan_identifier (sub_5512B0)
  │     case 2,3 → scan_numeric_literal (sub_5632C0)
  │     case 27  → scan_cast_or_expr (sub_544290)
  │     case 161 → scan_new_operator (sub_54AED0)
  │     case 162 → scan_throw_operator (sub_5211B0)
  │     ... (100+ token cases)
  │
  ├─ [postfix loop] ──────────► Postfix operators
  │     ()   → scan_function_call (sub_545F00)
  │     []   → scan_subscript_operator (sub_540560)
  │     .->  → scan_field_selection_operator (sub_5303E0)
  │     ++-- → scan_postfix_incr_decr (sub_510D70)
  │
  └─ [binary dispatch] ───────► Binary operators by precedence
        prec 64 → scan_simple_assignment_operator (sub_53FD70)
                  scan_compound_assignment_operator (sub_536E80)
        prec 60 → scan_conditional_operator (sub_526E30)
        prec 59 → scan_logical_operator (sub_526040)     [&&]
        prec 58 → scan_logical_operator (sub_526040)     [||]
        prec 57 → scan_comma_operator (sub_529720)
        ...     → scan_bit_operator (sub_525BC0)         [| ^ &]
        ...     → scan_eq_operator (sub_524ED0)          [== !=]
        ...     → scan_add_operator (sub_523EB0)         [+ -]
        ...     → scan_mult_operator (sub_5238C0)        [* / %]
        ...     → scan_shift_operator (sub_524960)       [<< >>]
        ...     → scan_ptr_to_member_operator (sub_522650) [.* ->*]

Precedence Levels

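The parser assigns numeric precedence levels internally, passed as the a3 (third) parameter to scan_expr_full. For most of the recovered levels a higher value means tighter binding, but the numbering is EDG's internal code and does not follow the C++ grammar exactly: the conditional operator (60) is numbered above `||` and `&&`, and assignment (64) is numbered above the bitwise operators, although both bind more loosely than that in the language grammar.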

Level	Operators	Handler
57	`,` (comma)	`scan_comma_operator`
58	`\|\|`	`scan_logical_operator`
59	`&&`	`scan_logical_operator`
60	`? :` (conditional)	`scan_conditional_operator`
61	`\|`	`scan_bit_operator`
62	`^`	`scan_bit_operator`
63	`&`	`scan_bit_operator`
64	`=` `+=` `-=` ...	`scan_simple_assignment_operator` / `scan_compound_assignment_operator`

When scan_expr_full encounters a binary operator token whose precedence is lower than the current precedence parameter, it stops consuming operators and returns, leaving that operator for the caller that was invoked at a lower level. This is precedence climbing, the usual way a hand-written recursive descent parser handles binary operators: each level scans its operands at the next-higher precedence.
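
The precedence gate can be illustrated with a minimal precedence-climbing parser. This is a sketch of the general technique, not a reconstruction of EDG's code: the `Parser` class, the `tokenize` helper, and the precedence values in `BINARY_PREC` are all hypothetical and were chosen only to keep the example small.

```python
import re

# Hypothetical precedence levels for illustration; these are NOT EDG's internal codes.
BINARY_PREC = {"+": 10, "-": 10, "*": 20, "/": 20}


def tokenize(text):
    """Split a string into integer literals, operators, and parentheses."""
    return re.findall(r"\d+|[-+*/()]", text)


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def scan_primary(self):
        tok = self.advance()
        if tok == "(":
            node = self.scan_expr(0)          # precedence resets inside parentheses
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        return int(tok)

    def scan_expr(self, precedence):
        left = self.scan_primary()
        while True:
            op = self.peek()
            op_prec = BINARY_PREC.get(op)
            # The gate: an operator that binds less tightly than our level
            # is left for the caller to consume.
            if op_prec is None or op_prec < precedence:
                return left
            self.advance()
            # Left associativity: the right operand is scanned one level higher.
            right = self.scan_expr(op_prec + 1)
            left = (op, left, right)


def parse(text):
    return Parser(text).scan_expr(0)
```

Calling `parse("1+2*3")` builds a tree in which `*` is nested under `+`, because the recursive call for the right operand of `+` runs at a level that still admits `*` but would return on another `+`.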

scan_expr_full -- The Central Dispatcher

scan_expr_full (sub_511D40, 80KB) is the largest function in the entire cudafe++ binary. Its structure follows this pattern:

function scan_expr_full(result, scan_info, precedence, flags, ...) {
    // 1. Trace entry
    if (debug_trace_flag)
        trace_enter(4, "scan_expr_full")
    if (debug_verbosity > 3)
        fprintf(trace_stream, "precedence level = %d\n", precedence)

    // 2. Extract context flags from current scope
    context = current_scope          // qword_106B970
    in_cuda_extension = (context[20] & 0x08) != 0
    in_pack_expansion = context[21] & 0x01
    saved_pending_expr = pending_expression   // qword_106B968
    pending_expression = 0

    // 3. Handle template rescan context
    if (in_template_context) {
        if (context.flags == TEMPLATE_ONLY_DEPENDENT)
            init_expr_stack_entry(...)
            // Mark as template-argument context
    }

    // 4. Handle forced-parenthesized-expression flag
    if (flags & 0x08)
        goto scan_cast_or_expr       // sub_544290

    // 5. Check for decltype token (185)
    if (current_token == 185 && dialect == C++)
        call sub_6810F0(...)         // re-classify through lexer

    // 6. MASTER TOKEN SWITCH -- dispatch on word_126DD58
    switch (current_token) {
        case 1:   // identifier
            // Special-case: check if identifier is a hidden type trait
            if (identifier_is("__is_pointer"))  { set_token(320); scan_unary_type_trait(); break; }
            if (identifier_is("__is_invocable")) { set_token(225); scan_call_like_builtin(); break; }
            if (identifier_is("__is_signed"))   { set_token(324); scan_unary_type_trait(); break; }
            // Default: full identifier scan
            scan_identifier(result, flags, precedence, ...)
            break;

        case 2, 3, 123, 124, 125:  // numeric, char, utf literals
            // Context-sensitive literal handling:
            //   - Check constexpr-if context (execution-space dependent)
            //   - Route to appropriate literal scanner
            if (is_constexpr_if_context)
                value = compute_constexpr_literal()
                scan_constexpr_literal_result(value, result)
            else
                scan_numeric_literal(literal_data, result)  // sub_5632C0
            break;

        case 4, 5, 6, 181, 182:  // string literals
            scan_string_literal(literal_data, result)       // sub_5632C0
            // Vector deprecation check for CUDA
            if ((cuda_mode || cuda_device_mode) && has_vector_literal_flag)
                result.flags |= VECTOR_DEPRECATED
            break;

        case 7:   // postfix-string-context (interpolated strings)
            check_postfix_string_context(...)
            scan_string_expression(literal_data, result)    // sub_563580
            break;

        case 27:  // left-paren '('
            scan_cast_or_expr(result, scratch, flags)       // sub_544290
            // Disambiguates: C-cast, grouped expr, GNU statement expr, fold expr
            break;

        case 31, 32:  // prefix ++ / --
            scan_prefix_incr_decr(result, ...)              // sub_516080
            break;

        case 33:  // & (address-of)
            scan_ampersand_operator(result, ...)            // sub_516720
            break;

        case 34:  // * (indirection)
            scan_indirection_operator(result, ...)          // sub_517270
            break;

        case 35, 36, 37, 38:  // unary + - ~ !
            scan_arith_prefix_operator(result, ...)         // sub_517680
            break;

        case 77:  // lambda expression '['
            scan_lambda_expression(result, ...)             // sub_5BBA60
            break;

        case 99, 284:  // sizeof
            scan_sizeof_operator(result, ...)               // sub_517BD0
            break;

        case 109: // _Generic
            scan_type_generic_operator(result, ...)         // inlined
            break;

        case 152: // requires
            scan_requires_expression(result, ...)           // sub_52CFF0
            break;

        case 155: // new (in C++ concept context path)
            scan_new_operator(result, ...)                  // sub_54AED0
            break;

        case 161: // new-expression
            scan_class_new_expression(result, ...)          // sub_6C9940/sub_6C9C50
            break;

        case 162: // throw
            scan_throw_operator(result, ...)                // sub_5211B0
            break;

        case 166: // const_cast
            scan_const_cast_operator(result, ...)           // sub_520280
            break;

        case 167: // static_cast
            scan_static_cast_operator(result, ...)          // sub_51F670
            break;

        case 176: // reinterpret_cast
            scan_reinterpret_cast_operator(result, ...)     // sub_5209A0
            break;

        case 177: // dynamic_cast
            scan_named_cast_operator(result, ...)           // sub_53D590
            break;

        case 178: // typeid
            scan_typeid_operator(result, ...)               // sub_535370
            break;

        case 185: // decltype
            scan_decltype_operator(result, ...)             // sub_52A3B0
            break;

        case 195 ... 356:  // type traits (__is_class, __is_enum, etc.)
            scan_unary_type_trait_helper(result, ...)       // sub_51A690
            // or
            scan_binary_type_trait_helper(result, ...)      // sub_51B650
            break;

        case 243: // noexcept
            scan_noexcept_operator(result, ...)             // sub_51D910
            break;

        case 267: // co_yield
            // Coroutine yield expression handling
            scan_braced_init_list_full(result, ...)         // sub_5360D0
            add_await_to_operand(result, ...)               // sub_50B630
            break;

        case 269: // co_await
            // Recursive scan of operand, then wrap with await semantics
            scan_expr_full(result, info, precedence, flags | AWAIT)
            add_await_to_operand(result, ...)               // sub_50B630
            break;

        case 297: // __builtin_bit_cast
            scan_builtin_bit_cast(result, ...)              // sub_51CC60
            break;

        // ... approximately 100 additional cases
    }

    // 7. POST-SCAN BINARY OPERATOR DISPATCH LOOP
    //    After scanning a primary/prefix expression, check for binary operators
    while (true) {
        op = current_token
        op_prec = get_binary_op_precedence(op)
        if (op_prec < precedence)
            break    // operator binds less tightly than our level

        switch (op) {
            case '?':  scan_conditional_operator(result, info, flags)   // sub_526E30
            case '=':  scan_simple_assignment_operator(result, ...)     // sub_53FD70
            case '+=': scan_compound_assignment_operator(result, ...)   // sub_536E80
            case '||': scan_logical_operator(result, info, ...)        // sub_526040
            case '&&': scan_logical_operator(result, info, ...)        // sub_526040
            case '|':  scan_bit_operator(result, ...)                  // sub_525BC0
            case '^':  scan_bit_operator(result, ...)
            case '&':  scan_bit_operator(result, ...)
            case '==': scan_eq_operator(result, ...)                   // sub_524ED0
            case '!=': scan_eq_operator(result, ...)
            case '<':  scan_rel_operator(result, ...)                  // sub_543A90
            case '+':  scan_add_operator(result, ...)                  // sub_523EB0
            case '-':  scan_add_operator(result, ...)
            case '*':  scan_mult_operator(result, ...)                 // sub_5238C0
            case '/':  scan_mult_operator(result, ...)
            case '%':  scan_mult_operator(result, ...)
            case '<<': scan_shift_operator(result, ...)                // sub_524960
            case '>>': scan_shift_operator(result, ...)
            case '.*': scan_ptr_to_member_operator(result, ...)        // sub_522650
            case '->*': scan_ptr_to_member_operator(result, ...)
            case ',':  scan_comma_operator(result, ...)                // sub_529720
            // Postfix operators (not precedence-gated):
            case '(':  scan_function_call(result, ...)                 // sub_545F00
            case '[':  scan_subscript_operator(result, ...)            // sub_540560
            case '.':  scan_field_selection_operator(result, ...)      // sub_5303E0
            case '->': scan_field_selection_operator(result, ...)
            case '++': scan_postfix_incr_decr(result, ...)             // sub_510D70
            case '--': scan_postfix_incr_decr(result, ...)
        }
    }

    // 8. Restore saved state and return
    pending_expression = saved_pending_expr
    if (debug_trace_flag)
        trace_exit(...)
    return result
}

Token Dispatch Map (Complete)

The master switch in scan_expr_full covers approximately 120 distinct token cases. The full dispatch table:

Token Code(s)	Expression Form	Handler
1	Identifier (with `__is_pointer`/`__is_signed` detection)	`scan_identifier` (`sub_5512B0`)
2, 3	Integer / floating-point literal	`scan_numeric_literal` (`sub_5632C0`)
4, 5, 6, 181, 182	String literal (narrow, wide, UTF-8/16/32)	`scan_string_literal` (`sub_5632C0`)
7	Postfix string context	`sub_563580`
8	Literal operator call	`make_func_operand_for_literal_operator_call` (`sub_4FFFB0`)
18, 80--136, 165, 180, 183	Type keywords in expression context	`scan_type_returning_type_trait_operator` / `scan_identifier`
25	`__extension__`	`scan_expr_splicer` (`sub_52FD70`) or `scan_statement_expression` (`sub_4F9F20`)
27	`(`	`scan_cast_or_expr` (`sub_544290`) -- disambiguates cast/group/fold/stmt-expr
31, 32	`++` / `--` (prefix)	`scan_prefix_incr_decr` (`sub_516080`)
33	`&` (address-of)	`scan_ampersand_operator` (`sub_516720`)
34	`*` (indirection)	`scan_indirection_operator` (`sub_517270`)
35--38	`+` `-` `~` `!` (unary)	`scan_arith_prefix_operator` (`sub_517680`)
50	`__builtin_expect`	`bound_function_in_cast` (`sub_503F70`)
77	`[` (lambda)	`scan_lambda_expression` (`sub_5BBA60`)
99, 284	`sizeof`	`scan_sizeof_operator` (`sub_517BD0`)
109	`_Generic`	`scan_type_generic_operator` (inlined)
111, 247	`alignof` / `_Alignof`	`scan_alignof_operator` (`sub_519300`)
112	`__intaddr`	`scan_intaddr_operator` (`sub_520EE0`)
113	`va_start`	`scan_va_start_operator` (`sub_51E8A0`)
114	`va_arg`	`scan_va_arg_operator` (`sub_51DFA0`)
115	`va_end`	`scan_va_end_operator` (`sub_51E4A0`)
116	`va_copy`	`scan_va_copy_operator` (`sub_51E670`)
117	`offsetof`	`scan_offsetof` (`sub_555530`)
123	`char` literal	`scan_utf_char_literal` (`sub_5659D0`)
124	`wchar_t` literal	`scan_wchar_literal` (`sub_5658D0`)
125	UTF character literal	`scan_wide_char_literal` (`sub_565950`)
138--141	`__FUNCTION__`/`__PRETTY_FUNCTION__`/`__func__`	`setup_function_name_literal` (`sub_50AC80`)
143	`__builtin_types_compatible_p`	`scan_builtin_operation_args_list` (`sub_534920`)
144, 145	`__real__` / `__imag__`	`scan_complex_projection` (`sub_51D210`)
146	`typeid` (execution-space variant)	`scan_typeid_operator` (`sub_535370`)
152	`requires` (C++20)	`scan_requires_expression` (`sub_52CFF0`)
155	Concept expression	`scan_new_operator` path (`sub_54AED0`)
161	`new`	`scan_class_new_expression` (`sub_6C9940`)
162	`throw`	`scan_throw_operator` (`sub_5211B0`)
166	`const_cast`	`scan_const_cast_operator` (`sub_520280`)
167	`static_cast`	`scan_static_cast_operator` (`sub_51F670`)
176	`reinterpret_cast`	`scan_reinterpret_cast_operator` (`sub_5209A0`)
177	`dynamic_cast`	`scan_named_cast_operator` (`sub_53D590`)
178	`typeid`	`scan_typeid_operator` (`sub_535370`)
185	`decltype`	`scan_decltype_operator` (`sub_52A3B0`)
188	`wchar_t` literal (alt)	`sub_5BCDE0`
189	`typeof`	`scan_typeof_operator` (`sub_52B540`)
195--206	Unary type traits	`scan_unary_type_trait_helper` (`sub_51A690`)
207--292	Binary type traits	`scan_binary_type_trait_helper` (`sub_51B650`)
225, 226	`__is_invocable` / `__is_nothrow_invocable`	`dispatch_call_like_builtin` (`sub_535080`)
227--235	Builtin operations	`sub_535080` / `sub_51BC10` / `sub_51B0C0`
237	`__builtin_constant_p`	`sub_5BC7E0`
243	`noexcept` (operator)	`scan_noexcept_operator` (`sub_51D910`)
251--256	Builtin atomic operations	`check_operand_is_pointer` (`sub_5338B0`/`sub_533B80`)
257, 258	Fold expression tokens	`scan_builtin_shuffle` (`sub_53E480`)
259	`__builtin_convertvector`	`scan_builtin_convertvector` (`sub_521950`)
261	`__builtin_complex`	`scan_builtin_complex` (`sub_521DB0`)
262	`__builtin_choose_expr`	`scan_c11_generic_selection` (`sub_554400`)
267	`co_yield`	Braced-init-list + coroutine `add_await_to_operand` (`sub_50B630`)
269	`co_await`	Recursive `scan_expr_full` + `add_await_to_operand`
270	`__builtin_launder`	`sub_51B0C0(60, ...)`
271	`__builtin_addressof`	`scan_builtin_addressof` (`sub_519CF0`)
294	Pack expansion	`scan_requires_expr` (`sub_542D90`)
296	`__has_attribute`	`scan_builtin_has_attribute` (`sub_51C780`)
297	`__builtin_bit_cast`	`scan_builtin_bit_cast` (`sub_51CC60`)
300, 301	`__is_pointer_interconvertible_with_class`	`sub_51BE60`
302, 303	`__is_corresponding_member`	`sub_51C270`
304	`__edg_is_deducible`	`sub_51B360`
306, 307	`__builtin_source_location`	`sub_5BC720` / `sub_534920`
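
When working with the table programmatically, a two-tier lookup is convenient: exact token codes first, then ranges. The sketch below copies a handful of rows from the table above; the data layout, the function names `handler_for` and `reclassify_identifier`, and the rule that exact entries win over ranges (used here to resolve overlaps such as 243 falling inside 207--292) are assumptions made for illustration, not something recovered from the binary.

```python
# A few rows copied from the dispatch table; the lookup structure is hypothetical.
EXACT = {
    1: "scan_identifier",
    27: "scan_cast_or_expr",
    77: "scan_lambda_expression",
    162: "scan_throw_operator",
    185: "scan_decltype_operator",
    243: "scan_noexcept_operator",
}

RANGES = [
    (35, 38, "scan_arith_prefix_operator"),
    (195, 206, "scan_unary_type_trait_helper"),
    (207, 292, "scan_binary_type_trait_helper"),
]

# Identifier spellings that the pseudocode re-tags as type-trait tokens.
HIDDEN_TRAITS = {"__is_pointer": 320, "__is_invocable": 225, "__is_signed": 324}


def handler_for(token_kind):
    """Return the handler name for a token kind, or None if it is not listed here."""
    if token_kind in EXACT:               # assumption: exact rows take priority
        return EXACT[token_kind]
    for low, high, name in RANGES:
        if low <= token_kind <= high:
            return name
    return None


def reclassify_identifier(spelling):
    """Map an identifier spelling to its token kind (1 = plain identifier)."""
    return HIDDEN_TRAITS.get(spelling, 1)
```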

scan_conditional_operator -- Ternary `? :`

scan_conditional_operator (sub_526E30, 48KB) is the second-largest expression-scanning function. The ternary operator is notoriously complex because it must unify two branches whose types may be completely unrelated. The function handles:

Type unification between branches: determines the common type of the true and false expressions. This involves the usual arithmetic conversions for numeric types, pointer-to-derived to pointer-to-base conversions, null pointer conversions, and user-defined conversion sequences.
Lvalue conditional expressions (GCC extension): when both branches are lvalues of the same type, the result is itself an lvalue.
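Void branches: if both branches are void expressions, the result type is void. A single void branch is only well-formed in standard C++ when it is a throw expression, which is covered by the next case.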
Throw in branches: a throw expression in one branch causes the result to take the type of the other branch.
Constexpr evaluation: when the condition is a constant expression, only one branch is semantically evaluated (the other is discarded).
Reference binding: determines whether the result is an lvalue reference, rvalue reference, or prvalue.
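Overload resolution for class-type operands: `?:` cannot itself be overloaded in C++, but when the branches have class type, user-defined conversion sequences are resolved to find a common type.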

function scan_conditional_operator(context, result, flags) {
    // 1. The condition has already been scanned -- it is in 'result'
    //    We are positioned at the '?' token

    // 2. Save expression stack state
    saved_stack = save_expr_stack()

    // 3. Scan true branch (between ? and :)
    //    Note: precedence resets -- assignment expressions allowed here
    init_expr_stack_entry(...)
    scan_expr_full(true_result, info, ASSIGNMENT_PREC, flags)

    // 4. Expect and consume ':'
    expect_token(':')

    // 5. Scan false branch
    scan_expr_full(false_result, info, ASSIGNMENT_PREC, flags)

    // 6. Type unification of true_result and false_result
    true_type  = get_type(true_result)
    false_type = get_type(false_result)

    if (both_void(true_type, false_type))
        result_type = void
    else if (is_throw(true_result))
        result_type = false_type
    else if (is_throw(false_result))
        result_type = true_type
    else if (arithmetic_types(true_type, false_type))
        result_type = usual_arithmetic_conversions(true_type, false_type)
    else if (same_class_lvalues(true_result, false_result))
        result_type = common_lvalue_type(true_type, false_type)  // GCC ext
    else if (pointer_types(true_type, false_type))
        result_type = composite_pointer_type(true_type, false_type)
    else
        // Try user-defined conversions (overload resolution)
        result_type = resolve_via_conversion_sequences(true_type, false_type)

    // 7. Apply cv-qualification merging
    result_type = merge_cv_qualifications(true_type, false_type, result_type)

    // 8. Build result expression node
    build_conditional_expr_node(result, condition, true_result, false_result, result_type)

    // 9. Restore stack
    restore_expr_stack(saved_stack)
}

The complexity arises from the 15+ different type-pair combinations (arithmetic-arithmetic, pointer-pointer, pointer-null, class-class with conversions, void-void, throw-anything, lvalue-lvalue GCC extension) that each require different conversion logic.
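
The order of the checks in step 6 matters, and a toy model makes it easier to see. The sketch below mirrors only the first few branches of the pseudocode. The `Operand` class, the `ARITH_RANK` table, and the `unify` function are hypothetical, and the rank-based step is a deliberately crude stand-in for the usual arithmetic conversions (no promotions, no signedness).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operand:
    type: str                 # toy type name, e.g. "int", "double", "void"
    is_throw: bool = False    # a throw expression has type void


# Hypothetical ranking used in place of the real arithmetic conversion rules.
ARITH_RANK = {"int": 1, "long": 2, "float": 3, "double": 4}


def unify(true_op, false_op):
    """Return the toy result type of `cond ? true_op : false_op`, or None."""
    if true_op.type == "void" and false_op.type == "void":
        return "void"
    if true_op.is_throw:
        return false_op.type
    if false_op.is_throw:
        return true_op.type
    if true_op.type in ARITH_RANK and false_op.type in ARITH_RANK:
        # Pick the higher-ranked type, standing in for the arithmetic conversions.
        return max(true_op.type, false_op.type, key=ARITH_RANK.__getitem__)
    if true_op.type == false_op.type:
        return true_op.type
    # The real function goes on to composite pointer types and
    # user-defined conversion sequences; the toy model gives up here.
    return None
```

Checking the void-void case before the throw cases means that two throw expressions unify to void, while a throw paired with a non-void branch takes the other branch's type.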

scan_function_call -- All Call Forms

scan_function_call (sub_545F00, 2,490 lines) handles every form of function call expression. It is invoked from the postfix operator dispatch in scan_expr_full when a ( follows a primary expression, and also from various specialized paths.

The function handles:

Regular function calls with overload resolution
Builtin function calls -- GCC/Clang __builtin_* with special semantics
Pseudo-calls to builtins -- va_start, __builtin_va_start, etc.
GNU __builtin_classify_type -- compile-time type classification
SFINAE context -- failed overload resolution suppresses errors instead of aborting
Template argument deduction for function templates at call sites
CUDA atomic builtin remapping -- delegates to adjust_sync_atomic_builtin (see below)

function scan_function_call(callee_operand, flags, context, ...) {
    // 1. Classify the callee
    operand_kind = get_operand_kind(callee_operand)
    assert(operand_kind is valid)  // "scan_function_call: bad operand kind"

    // 2. Scan argument list
    scan_call_arguments(arg_list, ...)   // sub_545760

    // 3. Branch on callee kind
    if (is_builtin_function(callee_operand)) {
        // Check if this is a special builtin
        if (is_sync_atomic_builtin(callee_operand)) {
            // CUDA-specific: remap __sync_fetch_and_* → __nv_atomic_fetch_*
            result = adjust_sync_atomic_builtin(callee, args, ...)  // sub_537BF0
            return result
        }

        // check_builtin_function_for_call: validate args for builtins
        check_builtin_function_for_call(callee, arg_list, ...)

        // scan_builtin_pseudo_call: for builtins with special evaluation
        if (is_pseudo_call_builtin(callee))
            return scan_builtin_pseudo_call(callee, arg_list, ...)
    }

    // 4. Overload resolution
    if (has_overload_candidates(callee_operand)) {
        best = perform_overload_resolution(callee, arg_list, ...)
        if (best == AMBIGUOUS)
            emit_error(...)
        if (best == NO_MATCH && in_sfinae_context)
            return SFINAE_FAILURE
        callee = best.function
    }

    // 5. Template argument deduction (if callee is a function template)
    if (is_function_template(callee)) {
        deduced = deduce_template_args(callee, arg_list, ...)
        if (deduction_failed && in_sfinae_context)
            return SFINAE_FAILURE
        callee = instantiate_template(callee, deduced)
    }

    // 6. CUDA cross-execution-space check
    if (cuda_mode)
        check_cross_execution_space_call(callee, ...)  // sub_505720

    // 7. Apply implicit conversions to arguments
    for each (arg, param) in zip(arg_list, callee.params):
        convert_arg_to_param_type(arg, param)

    // 8. Build call expression node
    build_call_expression(result, callee, arg_list, return_type)
}

scan_call_arguments (`sub_545760`, 332 lines)

The argument scanner called from scan_function_call:

function scan_call_arguments(arg_list_out, ...) {
    // assert "scan_call_arguments"
    // Loop: scan comma-separated expressions until ')'
    while (current_token != ')') {
        scan_expr_full(arg, info, ASSIGNMENT_PREC, flags)
        append(arg_list_out, arg)
        if (current_token == ',')
            consume(',')
        else
            break
    }
    // Handle default arguments for missing trailing params
    // Handle parameter pack expansion
}
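
The loop shape above can be exercised on a flat token list. In this sketch each argument is scanned by a stand-in, `scan_assignment_expr`, which simply collects tokens up to the next top-level `,` or `)`. That is the observable effect of scanning at assignment precedence: a comma at nesting depth zero ends the argument instead of being consumed as the comma operator. Both function signatures and the token-list representation are assumptions for illustration.

```python
def scan_assignment_expr(tokens, pos):
    """Collect one argument: tokens up to a top-level ',' or ')'."""
    depth = 0
    start = pos
    while pos < len(tokens):
        tok = tokens[pos]
        if depth == 0 and tok in (",", ")"):
            break
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
        pos += 1
    return tokens[start:pos], pos


def scan_call_arguments(tokens, pos=0):
    """tokens[pos] is the first token after '('. Returns (args, position after ')')."""
    args = []
    while pos < len(tokens) and tokens[pos] != ")":
        arg, pos = scan_assignment_expr(tokens, pos)
        args.append(arg)
        if pos < len(tokens) and tokens[pos] == ",":
            pos += 1          # consume the separator and scan the next argument
        else:
            break
    if pos >= len(tokens) or tokens[pos] != ")":
        raise SyntaxError("expected ')' after argument list")
    return args, pos + 1
```

For the call `f(a, g(b, c), d)` the token list after the opening parenthesis yields three arguments, with the inner comma staying inside the second one.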

scan_new_operator -- All `new` Forms

scan_new_operator (sub_54AED0, 2,333 lines) implements the complete C++ new expression. The function name strings embedded in the binary confirm the following sub-operations:

Sub-operation	Embedded Assert String
Entry point	`"scan_new_operator"`
Rescan in template	`"rescan_new_operator_expr"`
Token validation	`"scan_new_operator: expected new or gcnew"`
Token extraction	`"get_new_operator_token"`
Type parsing	`"scan_new_type"`
Paren-as-braced fallback	`"scan_paren_expr_list_as_braced_list"`
Array size deduction	`"deduce_new_array_size"`
Deallocation lookup	`"determine_deletion_for_new"`
Paren initializer	`"prep_new_object_init_paren_initializer"`
Brace initializer	`"prep_new_object_init_braced_initializer"`
No initializer	`"prep_new_object_init_no_initializer"`
Non-POD error	`"scan_new_operator: non-POD class has neither actual nor assumed ctor"`

The function processes all forms:

function scan_new_operator(result, flags, context, ...) {
    // Determine scope: ::new (global) vs. new (class-scope)
    is_global = check_and_consume("::")

    // Parse optional placement arguments: new(placement_args)
    if (current_token == '(')
        placement_args = scan_expression_list(...)

    // Parse the allocated type: new Type
    type = scan_new_type(...)

    // Parse optional array dimension: new Type[size]
    if (current_token == '[')
        array_size = scan_expression(...)

    // Parse optional initializer
    if (current_token == '(')
        init = prep_new_object_init_paren_initializer(type, ...)
    else if (current_token == '{')
        init = prep_new_object_init_braced_initializer(type, ...)
    else
        init = prep_new_object_init_no_initializer(type, ...)

    // An omitted array bound is deduced from the initializer,
    // so this step can only run once the initializer is known
    if (can_deduce_size)
        deduce_new_array_size(type, init)

    // Look up matching operator new
    new_fn = lookup_operator_new(type, placement_args, is_global, ...)

    // Look up matching operator delete (for exception cleanup)
    determine_deletion_for_new(new_fn, type, placement_args, ...)

    // For template-dependent types, defer to rescan at instantiation
    if (is_dependent_type(type))
        record_for_rescan(...)

    // Build new-expression node
    build_new_expr(result, new_fn, type, init, placement_args, array_size)
}
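
The syntactic pieces that the pseudocode walks through can be shown with a toy recognizer over a token list. This is a simplified sketch: `parse_new` and its result dictionary are hypothetical, the allocated type is assumed to be a single token, and a `(` directly after `new` is always treated as a placement list, which sidesteps the real ambiguity with a parenthesized type.

```python
def parse_new(tokens):
    """Classify the parts of a tokenized new-expression (toy grammar)."""
    form = {"global": False, "placement": None, "type": None,
            "array_size": None, "init": "none"}
    pos = 0

    if tokens[pos] == "::":                  # ::new selects the global operator
        form["global"] = True
        pos += 1
    if tokens[pos] != "new":
        raise SyntaxError("expected 'new'")
    pos += 1

    if pos < len(tokens) and tokens[pos] == "(":   # assumed to be placement arguments
        close = tokens.index(")", pos)
        form["placement"] = tokens[pos + 1:close]
        pos = close + 1

    form["type"] = tokens[pos]               # single-token type, by assumption
    pos += 1

    if pos < len(tokens) and tokens[pos] == "[":
        close = tokens.index("]", pos)
        form["array_size"] = tokens[pos + 1:close]
        pos = close + 1

    if pos < len(tokens):                    # which initializer form follows, if any
        form["init"] = {"(": "paren", "{": "brace"}.get(tokens[pos], "none")
    return form
```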

scan_identifier -- Name Resolution in Expression Context

scan_identifier (sub_5512B0, 1,406 lines) handles the case where the current token is an identifier in expression context. This is far more complex than a simple name lookup because identifiers in C++ can resolve to variables, functions, enumerators, type names (triggering functional-notation casts), anonymous union members, or preprocessing constants.

The function contains assert strings revealing its sub-operations:

Assert String	Purpose
`"scan_identifier"`	Entry point
`"scan_identifier: in preprocessing expr"`	Identifier in `#if` context evaluates to 0 or 1
`"anonymous_parent_variable_of"`	Navigate to parent variable of anonymous union member
`"anonymous_parent_variable_of: bad symbol kind on list"`	Error path for malformed anonymous union chain
`"make_anonymous_union_field_operand"`	Construct operand for anonymous union member access
`"get_with_hash"`	Hash-based lookup for cached resolution results

function scan_identifier(result, flags, precedence, ...) {
    // 1. Preprocessing-expression context
    //    In #if, undefined identifiers evaluate to 0
    if (in_preprocessing_expression) {
        // "scan_identifier: in preprocessing expr"
        result = make_integer_constant(0)
        return
    }

    // 2. Look up identifier in current scope
    lookup_result = scope_lookup(current_identifier, current_scope)

    // 3. If identifier resolves to a type name → functional-notation cast
    if (is_type_entity(lookup_result)) {
        scan_functional_notation_type_conversion(type, result, ...)  // sub_54E7C0
        return
    }

    // 4. If identifier is an anonymous union member
    if (is_anonymous_union_member(lookup_result)) {
        // Walk up to find the named parent variable
        parent = anonymous_parent_variable_of(lookup_result)
        result = make_anonymous_union_field_operand(parent, lookup_result)
        return
    }

    // 5. If identifier is a function (possibly overloaded)
    if (is_function_entity(lookup_result)) {
        result = make_func_operand(lookup_result)
        // Lambda capture check
        if (in_lambda_scope)
            check_var_for_lambda_capture(lookup_result, ...)
        return
    }

    // 6. Variable reference
    result = make_var_operand(lookup_result)

    // 7. Lambda capture analysis
    if (in_lambda_scope)
        check_var_for_lambda_capture(lookup_result, ...)

    // 8. Cross-execution-space reference check (CUDA)
    if (cuda_mode)
        check_cross_execution_space_reference(lookup_result, ...)
}
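
The decision order in scan_identifier can be condensed into a small classifier. The sketch below models only the branching, not the lookup itself. The `Symbol` class, the plain-dict scope, and the tuple results are hypothetical stand-ins, and only the helper name `anonymous_parent_variable_of` is taken from the assert strings above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Symbol:
    name: str
    kind: str                            # "type", "function", "variable", or "field"
    parent: Optional["Symbol"] = None    # enclosing anonymous-union object, if any


def anonymous_parent_variable_of(sym):
    """Walk outward until reaching a symbol that has no anonymous parent."""
    while sym.parent is not None:
        sym = sym.parent
    return sym


def classify_identifier(name, scope, in_preprocessing_expr=False):
    """Return a tuple describing which operand form the identifier produces."""
    if in_preprocessing_expr:
        return ("constant", 0)                       # step 1: #if context
    sym = scope[name]                                # step 2: lookup (toy dict)
    if sym.kind == "type":
        return ("functional_cast", sym.name)         # step 3
    if sym.kind == "field" and sym.parent is not None:
        owner = anonymous_parent_variable_of(sym)    # step 4
        return ("anon_union_field", owner.name, sym.name)
    if sym.kind == "function":
        return ("func_operand", sym.name)            # step 5
    return ("var_operand", sym.name)                 # step 6
```

The preprocessing check comes first because in that context no scope lookup happens at all, which is why the toy function accepts names that are missing from `scope` when `in_preprocessing_expr` is set.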

CUDA-Specific: Cross-Execution-Space Call Validation

Two functions implement the CUDA execution space enforcement that prevents illegal calls between __host__ and __device__ code:

check_cross_execution_space_call (`sub_505720`)

Called from scan_function_call and other call sites. The function extracts execution space information from bit-packed flags at entity offset +182:

function check_cross_execution_space_call(callee, is_must_check, diag_ctx) {
    // Extract callee's execution space from entity flags
    if (callee != NULL) {
        is_not_device_only = (callee[182] & 0x30) != 0x20  // bits 4-5
        is_host_only       = (callee[182] & 0x60) == 0x20  // bits 5-6
        is_global          = (callee[182] & 0x40) != 0      // bit 6
    }

    // Early exits for special contexts
    if (compilation_chain == -1)    return   // not in compilation
    if (CU has CUDA flags cleared)  return   // not a CUDA compilation unit
    if (in_SFINAE_context)          return   // errors suppressed

    // Get caller's execution space from enclosing function
    enclosing_fn = CU_table[enclosing_CU_index].function  // at +224
    if (enclosing_fn != NULL) {
        caller_host_only = (enclosing_fn[182] & 0x60) == 0x20
        caller_not_device_only = (enclosing_fn[182] & 0x30) != 0x20
    } else {
        // Top-level code: treated as __host__
        caller_host_only = 0
        caller_not_device_only = 1
    }

    // Check for implicitly HD (constexpr or __host__ __device__ by inference)
    if (callee[177] & 0x10)  return   // callee is implicitly HD
    if (callee has deleted+explicit HD flags)  return

    // The actual cross-space check matrix:
    // caller=host,  callee=device  → error 3462 or 3463
    // caller=device, callee=host   → error 3464 or 3465
    // callee=__global__            → error 3508

    if (caller_not_device_only && caller_host_only) {
        // Caller is __host__ only
        if (callee is __device__ only) {
            if (is_trivial_device_copyable(callee))  // sub_6BC680
                return  // allow
            space1 = get_execution_space_name(enclosing_fn, 0)  // sub_6BC6B0
            space2 = get_execution_space_name(callee, 1)
            emit_error(3462 + has_explicit_host, ...)
        }
    } else if (!caller_not_device_only) {
        // Caller is __device__ only
        if (callee is __host__ only)
            emit_error(3464 + has_explicit_device, ...)
    }

    if (callee is __global__) {
        emit_error(3508, is_must_check ? "must" : "cannot", ...)
    }
}

The bit encoding at entity offset +182:

Bits	Mask	Meaning
4--5	`& 0x30`	`__device__` flag: `0x20` = device-only
5--6	`& 0x60`	`__host__` flag: `0x20` = host-only
6	`& 0x40`	`__global__` flag
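
The three predicates from the pseudocode can be evaluated directly on the flag byte. The following is a minimal Python sketch under the assumption that the byte at entity offset +182 has already been read; the function name `decode_space_bits` is illustrative and not recovered from the binary, while the key names follow the pseudocode variables.

```python
def decode_space_bits(flag_byte: int) -> dict:
    """Evaluate the documented masks on the byte at entity offset +182."""
    return {
        # (byte & 0x30) != 0x20: anything except "bit 5 set, bit 4 clear"
        "is_not_device_only": (flag_byte & 0x30) != 0x20,
        # (byte & 0x60) == 0x20: bit 5 set, bit 6 clear
        "is_host_only": (flag_byte & 0x60) == 0x20,
        # bit 6 tested on its own
        "is_global": (flag_byte & 0x40) != 0,
    }
```

Both two-bit masks include bit 5, so the first two predicates are not independent: a byte value of `0x20` makes `is_not_device_only` false and `is_host_only` true at the same time.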

Error codes issued:

Code	Meaning
3462	`__device__` function called from `__host__` context
3463	Variant of 3462 with `__host__` annotation note
3464	`__host__` function called from `__device__` context
3465	Variant of 3464 with `__device__` annotation note
3508	`__global__` function called from wrong context

check_cross_space_call_in_template (`sub_505B40`)

A simplified variant (2.7KB) used during template instantiation. The logic mirrors check_cross_execution_space_call but operates when dword_126C5C4 == -1 (template instantiation depth guard). It does not take the is_must_check parameter and always checks both directions.

See the Execution Spaces page for full details on the CUDA execution model.

CUDA-Specific: adjust_sync_atomic_builtin

adjust_sync_atomic_builtin (sub_537BF0, 1,108 lines) is the largest NVIDIA-specific function in the expression parser. It transforms GCC-style __sync_fetch_and_* atomic builtins into NVIDIA's own __nv_atomic_fetch_* intrinsics.

Why This Remapping Exists

CUDA inherits GCC's __sync_fetch_and_* builtin family from the host-side C/C++ dialect, but NVIDIA's GPU ISA (PTX) uses a different instruction encoding for atomic operations. PTX atomics come in type-specific variants, so the backend needs the operand type to select the correct instruction. Rather than teaching the backend to decompose generic __sync_* builtins, NVIDIA front-loads the transformation in the parser, mapping each builtin to a type-suffixed __nv_atomic_fetch_* intrinsic that directly corresponds to a PTX atomic instruction.

The type suffix ensures correct instruction selection:

Suffix	Type Category	PTX Atomic Type
`_s`	Signed integer	`.s32`, `.s64`
`_u`	Unsigned integer	`.u32`, `.u64`
`_f`	Floating-point	`.f32`, `.f64`

Remapping Table

GCC Builtin	NVIDIA Intrinsic (base)
`__sync_fetch_and_add`	`__nv_atomic_fetch_add`
`__sync_fetch_and_sub`	`__nv_atomic_fetch_sub`
`__sync_fetch_and_and`	`__nv_atomic_fetch_and`
`__sync_fetch_and_xor`	`__nv_atomic_fetch_xor`
`__sync_fetch_and_or`	`__nv_atomic_fetch_or`
`__sync_fetch_and_max`	`__nv_atomic_fetch_max`
`__sync_fetch_and_min`	`__nv_atomic_fetch_min`

Pseudocode

function adjust_sync_atomic_builtin(callee, args, arg_list, builtin_info, result_ptr) {
    // assert "adjust_sync_atomic_builtin" at line 6073

    original_entity = get_builtin_entity(callee)   // sub_568F30
    assert(original_entity != NULL)

    // Check arg count -- if extra args and first arg is not pointer type
    if (builtin_info.extra_arg_count && callee[8] != 1) {
        // Reset and emit diagnostic 3768 (wrong arg type for atomic)
        original_entity = NULL
        if (validate_arg_types(...))
            emit_error(3768, diag_ctx)
        return original_entity
    }

    // Walk argument list to find the pointee type (type of *ptr)
    if (args == NULL) {
        // Use declared arg count from builtin info
        arg_index = builtin_info.declared_arg_count
        // ... validate, may emit error 3769 or 1645
    } else {
        // Navigate to the relevant argument node
        // Extract the pointee type by unwinding cv-qualifiers
        arg_type = get_init_component_type(args)
        pointee = unwrap_cv_qualifiers(arg_type)  // while type_kind == 12
    }

    // Determine the type suffix based on pointee type
    if (is_integer_type(pointee)) {
        if (is_signed(pointee))
            suffix = "_s"    // signed
        else
            suffix = "_u"    // unsigned
    } else if (is_float_type(pointee)) {
        suffix = "_f"        // floating-point
    } else {
        // Not a supported atomic type
        if (validate_arg_types(...))
            emit_error(1645 or 852, diag_ctx)
        return original_entity
    }

    // Construct the NVIDIA intrinsic name
    // Map __sync_fetch_and_OP → __nv_atomic_fetch_OP + suffix
    base_name = map_sync_to_nv(original_entity.name)
    // e.g., "__sync_fetch_and_add" → "__nv_atomic_fetch_add"
    full_name = base_name + suffix
    // e.g., "__nv_atomic_fetch_add_s" for signed int

    // Look up or create the NVIDIA intrinsic entity
    nv_entity = lookup_nv_intrinsic(full_name)

    // Replace the callee with the NVIDIA intrinsic
    *result_ptr = nv_entity

    return original_entity
}

The function validates that the pointee type is one of the supported atomic types. If the user passes a pointer to an unsupported type (e.g., a struct), it falls through to emit diagnostic 1645 ("argument type not supported for atomic operation") or 852 (a more specific variant when the __sync function has explicit type constraints).
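
The name construction step can be sketched in a few lines of Python. The base names and suffixes come from the remapping and suffix tables above; the function name `nv_intrinsic_name` and the type-category strings are hypothetical, and returning `None` merely stands in for the diagnostic path.

```python
# Base-name mapping, copied from the remapping table.
SYNC_TO_NV = {
    "__sync_fetch_and_add": "__nv_atomic_fetch_add",
    "__sync_fetch_and_sub": "__nv_atomic_fetch_sub",
    "__sync_fetch_and_and": "__nv_atomic_fetch_and",
    "__sync_fetch_and_xor": "__nv_atomic_fetch_xor",
    "__sync_fetch_and_or": "__nv_atomic_fetch_or",
    "__sync_fetch_and_max": "__nv_atomic_fetch_max",
    "__sync_fetch_and_min": "__nv_atomic_fetch_min",
}

# Suffix per pointee type category, copied from the suffix table.
TYPE_SUFFIX = {"signed": "_s", "unsigned": "_u", "float": "_f"}


def nv_intrinsic_name(sync_name, type_category):
    """Return the suffixed intrinsic name, or None for unsupported input."""
    base = SYNC_TO_NV.get(sync_name)
    suffix = TYPE_SUFFIX.get(type_category)
    if base is None or suffix is None:
        return None  # the real function would emit a diagnostic here
    return base + suffix
```

For example, `nv_intrinsic_name("__sync_fetch_and_add", "signed")` yields `"__nv_atomic_fetch_add_s"`, matching the example in the pseudocode comments.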

Template Expression Rescanning

When a template is instantiated, expression trees from the template definition are re-evaluated with concrete template argument substitutions. This is handled by rescan_expr_with_substitution_internal (sub_5565E0, 1,558 lines), the third-largest function in the expression parser.

The function dispatches on expression kind (not token kind -- these are IL expression nodes, not source tokens) and recursively rescans each sub-expression with substitutions applied:

Assert String	Purpose
`"rescan_expr_with_substitution_internal"`	Entry point
`"operator_token_for_builtin_operator"`	Maps operator codes to tokens for rescan
`"operator_token_for_expr_rescan"`	Alternate operator-to-token mapping
`"invalid expr kind in expr rescan"`	Unreachable default case
`"rescan_braced_init_list"`	Rescans `{init-list}` nodes
`"make_operand_for_rescanned_identifier"`	Rebuilds identifier operands after substitution
`"symbol_for_template_param_unknown_entity_rescan"`	Handles dependent names during rescan
`"scan_rel_operator"`	Rescans relational operators (for comparison rewriting)

The key insight is that during template definition parsing, the parser builds a partially-evaluated expression tree where template-dependent parts are stored as opaque nodes. During instantiation, this function walks that tree, substitutes concrete types/values, and re-runs the semantic analysis that was deferred.
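
The walk-substitute-reanalyze idea can be illustrated with a toy model. This Python sketch uses tuples as expression nodes and constant folding as a stand-in for the deferred semantic analysis; the node kinds (`"const"`, `"param"`, `"add"`, `"mul"`) and the function name `rescan` are hypothetical and far simpler than the real IL.

```python
def rescan(node, bindings):
    """Rebuild an expression tree with template parameters substituted."""
    kind = node[0]
    if kind == "const":
        return node
    if kind == "param":
        # Opaque template-dependent leaf: replace with the concrete value.
        return ("const", bindings[node[1]])
    if kind in ("add", "mul"):
        lhs = rescan(node[1], bindings)
        rhs = rescan(node[2], bindings)
        # Deferred analysis, modelled here as folding the now-concrete operands.
        value = lhs[1] + rhs[1] if kind == "add" else lhs[1] * rhs[1]
        return ("const", value)
    # Mirrors the unreachable default case in the dispatcher.
    raise ValueError("invalid expr kind in expr rescan")
```

Rescanning the tree for `N + 2 * N` with `{"N": 4}` produces a single constant node holding 12.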

Supporting Infrastructure

Diagnostic Emission (30+ wrapper functions, `0x4F8000`--`0x4F8F80`)

The expression parser uses a family of thin diagnostic wrapper functions at the beginning of the address range. Each wraps the core pattern: create_diag(code) -> add_arg(type/entity/string) -> emit(diag). The variants differ only in argument count and types:

Function	Identity	Arguments
`sub_4F8090`	`emit_diag_with_type_and_entity`	Type arg + entity arg
`sub_4F8160`	`emit_diag_1arg`	Single argument
`sub_4F8220`	`emit_diag_with_2_type_args`	Two type arguments
`sub_4F8320`	`emit_diag_with_entity_and_type`	Entity first, type second
`sub_4F8B20`	`issue_incomplete_type_diag`	Incomplete type diagnostic (assert confirmed)
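
The shared create, add-argument, emit pattern is easy to picture as code. The Python sketch below is only a model: the dictionary-based diagnostic record and the `sink` list are assumptions made for illustration, and only the wrapper names are taken from the table.

```python
def create_diag(code):
    """Start a diagnostic record for the given code."""
    return {"code": code, "args": []}


def add_arg(diag, kind, value):
    """Attach one typed argument (type, entity, or string)."""
    diag["args"].append((kind, value))


def emit(diag, sink):
    """Hand the finished diagnostic to the output sink."""
    sink.append(diag)


def emit_diag_with_type_and_entity(code, type_name, entity_name, sink):
    # Wrapper variant: type argument first, entity argument second.
    diag = create_diag(code)
    add_arg(diag, "type", type_name)
    add_arg(diag, "entity", entity_name)
    emit(diag, sink)


def emit_diag_with_entity_and_type(code, entity_name, type_name, sink):
    # Same pattern with the argument order swapped.
    diag = create_diag(code)
    add_arg(diag, "entity", entity_name)
    add_arg(diag, "type", type_name)
    emit(diag, sink)
```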

Expression Stack (`exprutil.c`, `0x558720`+)

The expression parser maintains a stack of expression contexts via qword_106B970. Each stack entry (the "current context") holds compilation mode flags, scope depth, CUDA execution space state, and template context bits. Key operations:

Function	Identity	Purpose
`sub_55D0D0`	`save_expr_stack`	Saves current expression stack state
`sub_55D100`	`init_expr_stack_entry`	Creates new stack frame
`sub_55DB50`	`pop_expr_stack`	Restores previous frame
`sub_55E490`	`set_operand_kind`	Sets the operand classification
`sub_55C180`	`alloc_ref_entry`	Allocates reference-entry for tracking
`sub_55C830`	`free_init_component`	Frees initializer component node

Comparison Rewriting (C++20, `0x501020`--`0x508DC0`)

The C++20 three-way comparison operator (<=>) triggers rewriting of traditional comparison expressions. complete_comparison_rewrite (sub_505E80, 6.9KB) rewrites a < b into (a <=> b) < 0 when a spaceship operator exists. It uses a recursion counter at qword_106B510 limited to 100 to prevent infinite rewrite loops. Related functions:

Function	Identity
`sub_501020`	`determine_defaulted_spaceship_return_type`
`sub_5015D0`	`synthesize_defaulted_comparison_body`
`sub_501B00`	`check_comparison_category_type`
`sub_505E10`	`token_for_rel_op` -- maps operator kinds to tokens (16->43, 17->44, 32->45, 33->46)
`sub_505E80`	`complete_comparison_rewrite` -- core rewrite engine
`sub_506430`	`check_defaulted_eq_properties`
`sub_5068F0`	`check_defaulted_secondary_comp`
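
The role of the recursion counter can be shown with a small model. In this Python sketch, `spaceship_result` is a hypothetical map from an operand type to the type its `<=>` returns; rewriting `a < b` into `(a <=> b) < 0` leaves a new `< 0` comparison on that result type, which may itself need rewriting. The limit of 100 and diagnostic 2982 come from the text; when exactly the real counter is incremented is an assumption.

```python
MAX_REWRITE_DEPTH = 100  # limit documented for qword_106B510


class RewriteLimitExceeded(Exception):
    """Stands in for diagnostic 2982."""


def count_rewrites(operand_type, spaceship_result, depth=0):
    """Count how many times `x < y` is rewritten before a plain comparison remains."""
    if depth >= MAX_REWRITE_DEPTH:
        raise RewriteLimitExceeded(2982)
    result_type = spaceship_result.get(operand_type)
    if result_type is None:
        return 0  # no <=> available: ordinary comparison, nothing to rewrite
    # a < b becomes (a <=> b) < 0; the new `< 0` compares result_type.
    return 1 + count_rewrites(result_type, spaceship_result, depth + 1)
```

A type whose `<=>` returns its own type would rewrite forever; the guard turns that into an error instead of unbounded recursion.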

Range-Based For Loop Desugaring (`0x50C510`, 16.8KB)

fill_in_range_based_for_loop_constructs (sub_50C510) generates the desugared components of for (auto x : range):

// Source:     for (auto x : range_expr) body
// Desugared:  {
//               auto && __range = range_expr;
//               auto __begin = begin(__range);
//               auto __end = end(__range);
//               for (; __begin != __end; ++__begin) {
//                 auto x = *__begin;
//                 body
//               }
//             }

The function calls sub_6EF7A0 (overload resolution) to look up begin() and end() via ADL, and emits error 2285 when no suitable begin/end is found.

Key Global Variables

Address	Name	Type	Description
`word_126DD58`	`current_token_code`	WORD	Current token kind (0--356)
`qword_126DD38`	`current_source_position`	QWORD	Encoded file/line/column
`qword_106B970`	`current_scope`	QWORD	Expression context stack pointer
`qword_106B968`	`pending_expression`	QWORD	Pending expression accumulator
`dword_126EFC8`	`debug_trace_flag`	DWORD	Nonzero enables trace output
`dword_126EFCC`	`debug_verbosity`	DWORD	Trace verbosity level (>3 prints precedence)
`dword_126EFB4`	`language_dialect`	DWORD	1=C, 2=C++
`qword_126EF98`	`standard_version`	QWORD	Language standard version level
`dword_126EFA8`	`in_template_context`	DWORD	Nonzero during template parsing
`dword_126EFA4`	`strict_mode`	DWORD	Strict conformance mode flag
`dword_126EFAC`	`extended_features`	DWORD	Extended features enabled
`xmmword_106C380`	`identifier_lookup_result`	128-bit	SSE-packed identifier lookup (64 bytes total, 4 xmmwords)
`qword_106B510`	`comparison_rewrite_depth`	QWORD	Recursion counter for C++20 comparison rewriting (max 100)
`dword_106C2C0`	`gpu_compilation_mode`	DWORD	Nonzero during GPU compilation
`qword_126C5E8`	`compilation_unit_table`	QWORD	Base of CU array (784-byte stride)
`dword_126C5E4`	`current_CU_index`	DWORD	Index into compilation unit table
`dword_126C5D8`	`enclosing_function_CU_index`	DWORD	CU index of enclosing function
`dword_126C5C4`	`template_instantiation_depth`	DWORD	-1 = not in template instantiation
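
Several of these globals combine into simple address arithmetic. The sketch below, in Python, shows how an entry in the 784-byte-stride table would be located and where the enclosing-function slot used by check_cross_execution_space_call (offset +224) falls inside it. The helper names are hypothetical, and `table_base` is the runtime pointer value stored in `qword_126C5E8`, not the address of the global itself.

```python
ENTRY_STRIDE = 784               # bytes per table entry
ENCLOSING_FUNCTION_OFFSET = 224  # slot read by check_cross_execution_space_call


def entry_address(table_base, index):
    """Address of the index-th entry in the table."""
    return table_base + index * ENTRY_STRIDE


def enclosing_function_slot(table_base, index):
    """Address of the function pointer inside the index-th entry."""
    return entry_address(table_base, index) + ENCLOSING_FUNCTION_OFFSET
```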

Diagnostic Codes

The expression parser emits approximately 50 distinct diagnostic codes:

Code	Meaning
57	Pointer-to-member on non-class type
58	Pointer-to-member on incomplete type
60	Pointer-to-member on wrong class type
165	Wrong argument count for builtin
244	Type access violation in member selection
529	Pointer-to-member in concept context
852	Unsupported type for atomic operation (typed variant)
1022	Inaccessible member in selection
1032	Invalid `_Generic` controlling expression
1036	Unsupported predefined function name
1436	`__builtin_types_compatible_p` not available
1543	`__builtin_source_location` not available
1596	Invalid literal operator call
1645	Argument type not supported for atomic operation
1733	`new`-expression in module context
1763	GNU statement expression not available
1777	Statement expression in constexpr context
2285	No `begin`/`end` for range-based for
2669	`co_yield` outside coroutine
2747	`co_yield` not in function scope
2866	Statement expression in constexpr context
2896	Statement expression in template instantiation
2982	Comparison rewrite recursion limit exceeded
3462	`__device__` function called from `__host__` context
3463	Variant of 3462 with `__host__` annotation note
3464	`__host__` function called from `__device__` context
3465	Variant of 3464 with `__device__` annotation note
3508	`__global__` function called from wrong context
3768	Wrong argument type for atomic builtin (extra arg)
3769	Wrong argument type for atomic builtin (declared arg)

Function Index

Complete listing of confirmed functions in the expression parser, grouped by subsystem:

Core Expression Scanning

Address	Size	Identity	Confidence
`sub_511D40`	80KB	`scan_expr_full`	DEFINITE
`sub_526E30`	48KB	`scan_conditional_operator`	DEFINITE
`sub_545F00`	16KB	`scan_function_call`	DEFINITE
`sub_54AED0`	15KB	`scan_new_operator`	DEFINITE
`sub_5512B0`	9KB	`scan_identifier`	DEFINITE
`sub_544290`	6KB	`scan_cast_or_expr`	DEFINITE
`sub_5565E0`	10KB	`rescan_expr_with_substitution_internal`	DEFINITE
`sub_529720`	12KB	`scan_comma_operator`	DEFINITE
`sub_526040`	15KB	`scan_logical_operator`	DEFINITE
`sub_543A90`	1.4KB	`scan_rel_operator`	DEFINITE
`sub_540160`	1.2KB	`apply_one_fold_operator`	DEFINITE
`sub_543FA0`	1KB	`assemble_fold_expression_operand`	DEFINITE

Unary Operators

Address	Size	Identity	Confidence
`sub_516080`	7.6KB	`scan_prefix_incr_decr`	DEFINITE
`sub_516720`	13KB	`scan_ampersand_operator`	DEFINITE
`sub_517270`	4.4KB	`scan_indirection_operator`	DEFINITE
`sub_517680`	5.1KB	`scan_arith_prefix_operator`	DEFINITE
`sub_517BD0`	26KB	`scan_sizeof_operator`	DEFINITE
`sub_519300`	9.4KB	`scan_alignof_operator`	DEFINITE
`sub_519CF0`	6.1KB	`scan_builtin_addressof`	DEFINITE
`sub_510D70`	8.2KB	`scan_postfix_incr_decr`	DEFINITE

Binary Operators

Address	Size	Identity	Confidence
`sub_5238C0`	5.4KB	`scan_mult_operator`	DEFINITE
`sub_523EB0`	10.6KB	`scan_add_operator`	DEFINITE
`sub_524960`	5.8KB	`scan_shift_operator`	DEFINITE
`sub_524ED0`	5.6KB	`scan_eq_operator`	DEFINITE
`sub_525BC0`	4.7KB	`scan_bit_operator`	DEFINITE
`sub_525450`	8.6KB	`scan_gnu_min_max_operator`	DEFINITE
`sub_522650`	19.8KB	`scan_ptr_to_member_operator`	DEFINITE

Assignment

Address	Size	Identity	Confidence
`sub_53FD70`	1.1KB	`scan_simple_assignment_operator`	DEFINITE
`sub_536E80`	3.1KB	`scan_compound_assignment_operator`	DEFINITE
`sub_508770`	4.7KB	`process_simple_assignment`	DEFINITE

Member Access

Address	Size	Identity	Confidence
`sub_5303E0`	15KB	`scan_field_selection_operator`	DEFINITE
`sub_4FEB60`	4.5KB	`make_field_selection_operand`	DEFINITE
`sub_4FEF00`	4.6KB	`do_field_selection_operation`	DEFINITE
`sub_540560`	3.1KB	`scan_subscript_operator`	DEFINITE

Cast Operators

Address	Size	Identity	Confidence
`sub_51EE00`	8.3KB	`scan_new_style_cast`	DEFINITE
`sub_51F670`	13.5KB	`scan_static_cast_operator`	DEFINITE
`sub_520280`	8.8KB	`scan_const_cast_operator`	DEFINITE
`sub_5209A0`	4.9KB	`scan_reinterpret_cast_operator`	DEFINITE
`sub_53C690`	3.6KB	`scan_named_cast_operator`	HIGH

Type Traits

Address	Size	Identity	Confidence
`sub_51A690`	12KB	`scan_unary_type_trait_helper`	DEFINITE
`sub_51B650`	7.2KB	`scan_binary_type_trait_helper`	DEFINITE
`sub_535080`	0.2KB	`dispatch_call_like_builtin`	MEDIUM
`sub_534B60`	1.8KB	`scan_call_like_builtin_operation`	DEFINITE
`sub_549700`	2.2KB	`compute_is_invocable`	DEFINITE
`sub_550E50`	1.3KB	`compute_is_constructible`	DEFINITE
`sub_510410`	2.1KB	`compute_is_convertible`	DEFINITE
`sub_510860`	2.3KB	`compute_is_assignable`	DEFINITE

CUDA-Specific

Address	Size	Identity	Confidence
`sub_505720`	4KB	`check_cross_execution_space_call`	DEFINITE
`sub_505B40`	2.7KB	`check_cross_space_call_in_template`	DEFINITE
`sub_537BF0`	7KB	`adjust_sync_atomic_builtin`	DEFINITE
`sub_520EE0`	2.7KB	`scan_intaddr_operator`	DEFINITE

Initializers and Braced-Init-Lists

Address	Size	Identity	Confidence
`sub_5360D0`	4.7KB	`parse_braced_init_list_full`	DEFINITE
`sub_5392B0`	0.2KB	`complete_braced_init_list_parsing`	DEFINITE
`sub_539340`	1KB	`scan_braced_init_list_cast`	DEFINITE
`sub_539670`	0.4KB	`get_braced_init_list`	DEFINITE
`sub_541000`	2KB	`scan_member_constant_initializer_expression`	DEFINITE
`sub_541DC0`	5.5KB	`prescan_initializer_for_auto_type_deduction`	DEFINITE

Coroutines

Address	Size	Identity	Confidence
`sub_50B630`	10KB	`add_await_to_operand`	DEFINITE
`sub_50C070`	1.8KB	`check_coroutine_context`	HIGH
`sub_50E080`	4.5KB	`make_coroutine_result_expression`	DEFINITE

C++20 Concepts and Requires

Address	Size	Identity	Confidence
`sub_52CFF0`	13.5KB	`scan_requires_expression`	DEFINITE
`sub_542D90`	3.8KB	`scan_requires_expr`	DEFINITE
`sub_52EB60`	8.6KB	`scan_requires_clause`	DEFINITE

Declaration Parser

C++ declaration parsing is the most ambiguity-ridden phase of front-end compilation. A statement like T(x); is simultaneously a valid function-style cast (expression) and a variable declaration with redundant parentheses. EDG 6.6 in cudafe++ resolves this by splitting the work into two stages: a prescanning/disambiguation phase (disambig.c) that probes ahead in the token stream to classify ambiguous constructs, followed by committed parsing across four tightly-coupled source files -- decl_spec.c (declaration specifiers), declarator.c (declarator syntax), decls.c (symbol table insertion and semantic validation), and decl_inits.c (initializer processing). CUDA adds a fifth axis of complexity: every declaration may carry execution space attributes (__device__, __host__, __global__) and memory space qualifiers (__shared__, __constant__, __managed__), which are parsed as attribute category 4 and must be separated from standard C++ attributes before semantic analysis.

The core pipeline processes approximately 22,000 lines of decompiled logic across six major functions, each exceeding 1,000 lines. The design is a classic recursive-descent parser with significant state carried in stack-allocated structures (128-byte decl_spec accumulators packed as __m128i arrays) and global scope chain state (784-byte entries in the scope table at qword_126C5E8).

Key Facts

Property	Value
Source files	`decl_spec.c`, `declarator.c`, `decls.c`, `decl_inits.c`, `disambig.c`
Address range	`0x4A0000`--`0x4F8000` (~360 KB of code, ~530 functions)
Central dispatcher	`sub_4ACF80` (`decl_specifiers`, 4,761 lines)
Declarator entry	`sub_4B7BC0` (`declarator`, 284 lines)
Function declarator	`sub_4B8190` (`function_declarator`, 3,144 lines)
Recursive declarator	`sub_4BC950` (`r_declarator`, 2,578 lines)
Function declaration	`sub_4CE420` (`decl_routine`, 2,858 lines)
Variable declaration	`sub_4CA6C0` (`decl_variable`, 1,090 lines)
Top-level variable entry	`sub_4DEC90` (`variable_declaration`, 1,098 lines)
Disambiguation	`sub_4EA560` (`prescan_declaration`, ~400 lines)
Scope entry size	784 bytes (at `qword_126C5E8`)
Decl specifier accumulator	128 bytes (`__m128i` array, stack-allocated)
CUDA mode flag	`dword_126EFA8` (bool), dialect in `dword_126EFB4` (2 = C++)
Current token global	`word_126DD58`
Token advance	`sub_676860` (`get_next_token`)

Architecture

The declaration parsing pipeline operates as a five-stage waterfall. Each stage narrows the interpretation of the token stream until a fully-resolved declaration is inserted into the symbol table:

Token Stream (from lexer)
  │
  ▼
STAGE 1: Disambiguation (disambig.c)
  │  prescan_declaration ─── lookahead to classify ambiguous constructs
  │  prescan_gnu_attribute ── skip __attribute__((...)) blocks
  │  find_for_loop_separator ── distinguish for-init from expression
  │
  ▼
STAGE 2: Declaration Specifiers (decl_spec.c)
  │  decl_specifiers ─── 4,761-line switch dispatching on token kind
  │  ├── storage class: auto, register, static, extern, typedef
  │  ├── type specifiers: int, char, void, class/struct/enum, typename
  │  ├── cv-qualifiers: const, volatile, restrict
  │  ├── function specifiers: inline, virtual, explicit, constexpr, consteval
  │  ├── CUDA attributes: __device__, __host__, __global__ (category 4)
  │  └── class_specifier / enum_specifier (recursive for definitions)
  │
  ▼
STAGE 3: Declarator (declarator.c)
  │  declarator ─── coordinates pointer/array/function declarators
  │  ├── pointer_declarator ── *, &, &&, ::*
  │  ├── r_declarator ── recursive descent on declarator-id
  │  ├── array_declarator ── [expression], []
  │  ├── function_declarator ── (params) cv-qualifiers -> trailing-return noexcept
  │  └── scan_declarator_attributes ── separates CUDA attrs from standard
  │
  ▼
STAGE 4: Declaration Processing (decls.c)
  │  decl_routine ─── function/method declarations (2,858 lines)
  │  decl_variable ── variable declarations with CUDA memory space
  │  variable_declaration ── top-level entry with CUDA error emission
  │  find_linked_symbol ── redeclaration detection
  │  id_linkage ── linkage determination (internal/external/none)
  │
  ▼
STAGE 5: Initializer Processing (decl_inits.c)
     ctor_inits_for_inheriting_ctor ── inheriting constructors
     dtor_initializer ── destructor init lists
     check_for_missing_initializer_full ── missing initializer diagnostics

Stage 1: Disambiguation (disambig.c)

The Problem

C++ has a famous syntactic ambiguity: many token sequences can be parsed as either declarations or expressions. The canonical example:

T(x);          // declaration of variable x of type T?  or  function-style cast of x to T?
T(x)(y);       // declaration of function x returning T?  or  call to T(x) with argument y?
T * x;         // declaration of pointer-to-T named x?  or  multiplication of T and x?

The C++ standard resolves these with the "if it can be a declaration, it is a declaration" rule. EDG implements this by prescanning: before committing to a parse, the parser saves the lexer state, probes ahead through the token stream to determine whether the construct is a declaration, then restores the lexer state and dispatches to the appropriate parser.
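
The save, probe, restore pattern can be sketched with a toy lexer. This Python example only recognizes the single shape `T ( x ) ;` and treats it as a declaration when `T` is a known type name; the `Lexer` class, `prescan_is_declaration`, and the `type_names` set are hypothetical stand-ins for the real lexer state and symbol lookup.

```python
class Lexer:
    """Toy token cursor with save/restore, standing in for the real lexer state."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def advance(self):
        # Return the current token (None past the end) and move forward.
        tok = self.tokens[self.pos] if self.pos < len(self.tokens) else None
        self.pos += 1
        return tok

    def save(self):
        return self.pos

    def restore(self, state):
        self.pos = state


def prescan_is_declaration(lexer, type_names):
    """Probe for the shape `T ( x ) ;` without consuming any tokens."""
    state = lexer.save()
    try:
        if lexer.advance() not in type_names:
            return False  # T is not a type: parse as an expression
        if lexer.advance() != "(":
            return False
        name = lexer.advance()
        if name is None or not name.isidentifier():
            return False
        return lexer.advance() == ")" and lexer.advance() == ";"
    finally:
        lexer.restore(state)  # always rewind, whatever the verdict
```

Because the rewind happens in a `finally` block, the caller sees the lexer at the same position regardless of which path the probe took.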

prescan_declaration (sub_4EA560)

This is the top-level disambiguation entry point, called when the parser encounters an ambiguous construct at statement or declaration level. It operates in a non-destructive lookahead mode: it consumes tokens tentatively, classifies the construct, then rewinds.

prescan_declaration(flags):
    save_lexer_state()
    
    # Compute CUDA-aware skip mode
    if (flags & 0x800) == 0:     # not in template context
        skip_mode = 16385         # 0x4001: standard prescan
    else:
        skip_mode = 67125249      # 0x4004001: template-aware prescan
    
    # In C++ mode, use cuda_skip_token for identifier classification
    if dword_126EFB4 == 2:       # C++ dialect (1 = C, 2 = C++)
        while not at_end_of_tentative_scan():
            token = current_token()
            if is_cuda_keyword(token):
                cuda_skip_token(skip_mode)   # sub_6810F0
            else:
                advance_token()              # sub_676860
            classify_declaration_vs_expression()
    
    restore_lexer_state()
    return classification  # DECLARATION or EXPRESSION

The skip_mode is a bitmask encoding which token classes to recognize during prescanning. The wider mask (67125249, i.e. 0x4004001: the standard 0x4001 plus bit 26) is selected when bit 0x800 of flags is set. Inside the scan loop, cuda_skip_token receives this mask and skips CUDA execution-space keywords, so that __device__ int x; is correctly classified as a declaration even though __device__ is not a standard C++ keyword.
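
Since the decompiler prints these masks in decimal, it helps to decompose them into bit positions. The Python helper below is a generic utility (the name `set_bits` is not from the binary) applied to the two decimal skip_mode constants in the pseudocode.

```python
def set_bits(value):
    """Return the positions of all set bits, lowest first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


STANDARD_PRESCAN = 16385      # decimal constant from the pseudocode
TEMPLATE_PRESCAN = 67125249   # decimal constant from the pseudocode

standard_bits = set_bits(STANDARD_PRESCAN)   # bits 0 and 14
template_bits = set_bits(TEMPLATE_PRESCAN)   # bits 0, 14 and 26
```

Under this decomposition the second constant equals the first with one additional bit (bit 26, `0x4000000`) set.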

prescan_gnu_attribute (sub_4E9E70)

Attributes complicate disambiguation because __attribute__((foo)) can appear almost anywhere in a declaration. This function skips over balanced GNU attribute sequences during prescanning:

prescan_gnu_attribute():
    assert current_token == 142     # GNU __attribute__ token
    while current_token == 142:
        advance_token()             # consume __attribute__
        match_balanced_parens()     # skip ((...))
        
        # CUDA extension: check if identifier is CUDA keyword
        if dword_126EFB4 == 2:      # C++ dialect (1 = C, 2 = C++)
            if BYTE1(xmmword_106C390) & 2:  # CUDA extension flag
                cuda_skip_token(...)

find_for_loop_separator (sub_4EC690)

A special-purpose disambiguator for for loops. In for(init; cond; incr), the parser must find the semicolons that separate the three clauses. This is non-trivial because the init clause can contain declarations with complex types, nested parentheses, and template angle brackets.

find_for_loop_separator():
    create_disambiguation_checkpoint()  # sub_67B4F0
    paren_depth = 0
    while true:
        token = current_token()
        if token == '(':
            paren_depth++
        elif token == ')':
            if paren_depth == 0:
                break
            paren_depth--
        elif token == ';' and paren_depth == 0:
            restore_checkpoint()
            return SEMICOLON_FOUND   # 0x4B = 75
        elif token == EOF:
            restore_checkpoint()
            return EOF               # 9
        advance_token()              # move on to the next token
    restore_checkpoint()
    return NOT_FOUND                 # 0
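
A runnable version of this scan over a plain token list looks as follows. This Python sketch assumes the tokens after the opening parenthesis of the `for` are given as strings, with `"EOF"` as a hypothetical end marker; the checkpoint and restore calls are not modelled because iterating over a list does not consume it. The return codes are the ones listed in the pseudocode.

```python
SEMICOLON_FOUND = 75  # 0x4B
EOF_REACHED = 9
NOT_FOUND = 0


def find_for_loop_separator(tokens):
    """Look for a top-level ';' before the parenthesis that closes the for header."""
    paren_depth = 0
    for token in tokens:
        if token == "(":
            paren_depth += 1
        elif token == ")":
            if paren_depth == 0:
                return NOT_FOUND  # reached the closing paren of the header
            paren_depth -= 1
        elif token == ";" and paren_depth == 0:
            return SEMICOLON_FOUND
        elif token == "EOF":
            return EOF_REACHED
    return EOF_REACHED  # ran out of tokens without an explicit marker
```

A classic three-clause header returns `SEMICOLON_FOUND`, while a range-based header such as `x : v )` returns `NOT_FOUND` because the closing parenthesis is reached first.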

Stage 2: Declaration Specifiers (decl_spec.c)

decl_specifiers (sub_4ACF80) -- The Central Dispatcher

This is the most complex function in the declaration parser: 4,761 decompiled lines, a while(2) loop containing a giant switch on token kinds, processing every specifier in a C++ declaration. It handles storage classes, type specifiers, cv-qualifiers, function specifiers, and CUDA attributes, accumulating results into a 128-byte stack structure.

Input Parameter: Context Flags

The a1 parameter encodes the parsing context as a bitmask:

Bit	Mask	Context
2	`0x4`	Inside class member declaration
3	`0x8`	Inside function parameter list
4	`0x10`	At block scope
6	`0x40`	Inside template parameter list
14	`0x4000`	Friend declaration
15	`0x8000`	At class scope
18	`0x40000`	In-declaration (re-entrant)
20	`0x100000`	Constexpr lambda context
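
When reading decompiled call sites, it is convenient to decode a context bitmask back into names. The Python sketch below uses the masks and descriptions from the table; the function name `describe_context` and the idea of reporting leftover bits are illustrative additions.

```python
CONTEXT_FLAGS = {
    0x4: "class member declaration",
    0x8: "function parameter list",
    0x10: "block scope",
    0x40: "template parameter list",
    0x4000: "friend declaration",
    0x8000: "class scope",
    0x40000: "in-declaration (re-entrant)",
    0x100000: "constexpr lambda context",
}


def describe_context(flags):
    """Return (names of documented bits that are set, remaining undocumented bits)."""
    names = [name for mask, name in CONTEXT_FLAGS.items() if flags & mask]
    known = sum(CONTEXT_FLAGS)  # the masks are disjoint, so sum equals bitwise OR
    return names, flags & ~known
```

Any nonzero second element signals bits that the table does not yet document.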

The Accumulator Structure

Results are accumulated into a stack-allocated structure (parameter a2) laid out as:

Offset	Size	Field	Description
`+8`	4	`specifier_flags`	Bitmask of specifiers seen
`+32`	8	`source_position`	Position of first specifier
`+120`	4	`flags`	Parsing state flags
`+132`	4	`context`	Context discriminator
`+200`	8	`attribute_list`	Linked list of parsed attributes
`+208`	8	`attribute_list_alt`	Secondary attribute list (CUDA exec space)
`+228`	4	`modifiers`	Accumulated modifier bits
`+272`	8	`type_ptr`	Resolved type pointer
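
Given a raw memory dump of the structure, the documented fields can be pulled out by offset. This Python sketch uses the standard `struct` module and assumes little-endian layout (the binary is x86-64) and unsigned fields; the function name `read_accumulator` is hypothetical, and the buffer must cover at least offset +280.

```python
import struct

# name -> (offset, struct format); 4-byte fields as "<I", 8-byte fields as "<Q"
ACCUMULATOR_FIELDS = {
    "specifier_flags": (8, "<I"),
    "source_position": (32, "<Q"),
    "flags": (120, "<I"),
    "context": (132, "<I"),
    "attribute_list": (200, "<Q"),
    "attribute_list_alt": (208, "<Q"),
    "modifiers": (228, "<I"),
    "type_ptr": (272, "<Q"),
}


def read_accumulator(buf):
    """Decode the documented fields from a bytes-like dump of the structure."""
    return {
        name: struct.unpack_from(fmt, buf, offset)[0]
        for name, (offset, fmt) in ACCUMULATOR_FIELDS.items()
    }
```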

Pseudocode

decl_specifiers(context_flags, accumulator, type_chain, ...):
    debug_trace(3, "decl_specifiers")
    
    spec_bits = 0        # accumulated specifier combination flags
    error_flag = 0
    
    while true:  # while(2) in decompilation
        token = word_126DD58    # current token
        
        switch token:
        
        # ── Storage class specifiers ──
        case TOKEN_AUTO:         # 77
        case TOKEN_REGISTER:     # 119
        case TOKEN_STATIC:       # 99
        case TOKEN_EXTERN:       # 88
        case TOKEN_TYPEDEF:      # 103
            process_storage_class_specifier(
                auto_flag, ..., context_flags, accumulator,
                prev_scope, &spec_bits, &result, &type_out, &error_flag
            )
            continue
        
        # ── Type specifiers (keywords) ──
        case TOKEN_VOID .. TOKEN_DOUBLE:       # 81-119 range
        case TOKEN_SIGNED:
        case TOKEN_UNSIGNED:
        case TOKEN_CHAR:
        case TOKEN_INT:
        case TOKEN_FLOAT:
        case TOKEN_DOUBLE:
            # Validate combination with existing specifiers
            if spec_bits & CONFLICTING_TYPE_MASK:
                emit_error(84)     # conflicting type specifiers
            spec_bits |= type_specifier_bit(token)
            advance_token()
            continue
        
        # ── cv-qualifiers ──
        case TOKEN_CONST:        # 263
        case TOKEN_VOLATILE:     # 264
        case TOKEN_RESTRICT:     # 265, 266
            accumulator.modifiers |= cv_bit(token)
            advance_token()
            continue
        
        # ── Function specifiers ──
        case TOKEN_INLINE:
            spec_bits |= INLINE_BIT
            advance_token()
            continue
        
        case TOKEN_VIRTUAL:
            spec_bits |= VIRTUAL_BIT
            advance_token()
            continue
        
        case TOKEN_EXPLICIT:
            spec_bits |= EXPLICIT_BIT
            advance_token()
            continue
        
        # ── C++11/17/20 specifiers ──
        case TOKEN_CONSTEXPR:
            spec_bits |= CONSTEXPR_BIT
            if context_flags & 0x100000:    # constexpr lambda
                emit_error(1570)
            advance_token()
            continue
        
        case TOKEN_CONSTEVAL:
            spec_bits |= CONSTEVAL_BIT
            advance_token()
            continue
        
        case TOKEN_CONSTINIT:
            spec_bits |= CONSTINIT_BIT
            advance_token()
            continue
        
        case TOKEN_THREAD_LOCAL:
            spec_bits |= THREAD_LOCAL_BIT
            advance_token()
            continue
        
        # ── Class/struct/union/enum definitions ──
        case TOKEN_CLASS:        # 151
        case TOKEN_STRUCT:
        case TOKEN_UNION:
            class_specifier(scope, context_flags, ..., &result, &error_flag)
            continue
        
        case TOKEN_ENUM:
            enum_specifier(scope, context_flags, ..., &result, &error_flag)
            continue
        
        # ── typename specifier ──
        case TOKEN_TYPENAME:     # 183
            typename_specifier(&type_out, accumulator, context_flag, ...)
            continue
        
        # ── Identifier (type name or constructor) ──
        case TOKEN_IDENTIFIER:   # 1
            # This is the declaration/expression ambiguity hotspot
            if try_interpret_as_type_name(accumulator):    # sub_4C4F80
                continue
            if is_constructor_decl(enclosing_class):       # sub_4AC970
                continue
            # Not a type name — fall through to end of specifiers
            break
        
        # ── GNU __attribute__ / __declspec ──
        case TOKEN_ATTRIBUTE:    # 142
            parse_attribute_list(accumulator)
            # CUDA: execution space attributes separated here
            continue
        
        # ── typeof / decltype ──
        case TOKEN_TYPEOF:       # 189
        case TOKEN_DECLTYPE:     # 185
            parse_typeof_or_decltype(accumulator)
            continue
        
        # ── End of specifiers ──
        case TOKEN_SEMICOLON:    # 55
        default:
            break  # exit while loop
    
    # Post-processing: validate specifier combinations
    if spec_bits == 0 and no_type_found:
        emit_error(79)   # missing type specifier
    
    # CUDA: check execution space context
    if dword_126EFB4 == 2:   # CUDA C++ mode
        validate_cuda_execution_space(accumulator, context_flags)
        if invalid_cuda_context:
            emit_error(3537)  # execution space attribute in wrong context

Token Classification Map

The switch in decl_specifiers handles the following token kinds:

Token Code	Keyword	Category
1	identifier	Type name or constructor check
77	`auto`	Storage class (C++03) / placeholder type (C++11)
88	`extern`	Storage class
99	`static`	Storage class
103	`typedef`	Storage class
119	`register`	Storage class
80--108	C type keywords	Type specifiers
142	`__attribute__`	GNU attribute
151	`class`	Class specifier
183	`typename`	Typename specifier
185	`decltype`	Decltype specifier
189	`typeof`	GNU typeof
263--266	`const`, `volatile`, `restrict`, `__restrict`	CV-qualifiers

process_storage_class_specifier (sub_4A31A0)

Validates and records a storage class specifier. C++ allows at most one storage class specifier per declaration, except that thread_local may be combined with static or extern.

process_storage_class_specifier(auto_flag, ..., context_flags, decl_info,
                                 prev_scope, spec_bits, result, type_out, error_flag):
    # Flag bits in context_flags:
    #   1=function, 4=class, 8=extern, 0x10=static, 0x200=register,
    #   0x4000=friend, 0x8000=at class scope, 0x100000=constexpr lambda

    if *spec_bits & STORAGE_CLASS_MASK:
        emit_error(80)     # duplicate storage class
        return
    
    if conflicting_with_previous_specifier:
        emit_error(81)     # conflicting storage class
        return
    
    switch current_storage_class:
        case EXTERN:
            if at_block_scope and not_cpp_mode:
                emit_error(85)
            if at_file_scope and not_standard_mode:
                emit_error(149)
            decl_info.linkage_byte = 3    # external linkage
        
        case STATIC:
            if in_class_definition and cpp_mode:
                emit_error(328)
        
        case REGISTER:
            emit_error(481)   # deprecated
        
        case AUTO:
            if dword_126EF4C:     # auto parameter support enabled
                # C++20: auto in parameter list = abbreviated template
                create_placeholder_type()    # sub_5BBA60
            else:
                emit_error(1598)  # auto type in invalid context
    
    *spec_bits |= storage_class_bit
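
The duplicate check above can be modelled as a bitmask test. The following is a minimal sketch only: the bit values, the `Spec` enum, and `add_storage_class` are hypothetical names chosen for illustration, and the sketch assumes that thread_local is excluded from the storage-class mask so that it can accompany static or extern. It also folds the separate "conflicting" case (error 81) into the duplicate case (error 80).

```python
from enum import IntFlag


class Spec(IntFlag):
    # Hypothetical bit assignments; the binary's real spec_bits layout is not shown here.
    EXTERN = 0x01
    STATIC = 0x02
    REGISTER = 0x04
    TYPEDEF = 0x08
    THREAD_LOCAL = 0x10


# Assumption: thread_local is left out of the mask because C++ permits it
# alongside static or extern.
STORAGE_CLASS_MASK = Spec.EXTERN | Spec.STATIC | Spec.REGISTER | Spec.TYPEDEF

ERR_DUPLICATE_STORAGE_CLASS = 80  # error number taken from the pseudocode above


def add_storage_class(spec_bits, new_bit, errors):
    """Return the updated bits; record error 80 if a storage class is already set."""
    if (new_bit & STORAGE_CLASS_MASK) and (spec_bits & STORAGE_CLASS_MASK):
        errors.append(ERR_DUPLICATE_STORAGE_CLASS)
        return spec_bits
    return spec_bits | new_bit


errors = []
bits = add_storage_class(Spec(0), Spec.STATIC, errors)      # static
bits = add_storage_class(bits, Spec.THREAD_LOCAL, errors)   # static thread_local: accepted
rejected = add_storage_class(bits, Spec.EXTERN, errors)     # static ... extern: rejected
```

In this model a second storage class leaves the accumulated bits unchanged, mirroring the early `return` in the pseudocode.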

class_specifier (sub_4A57C0, 2,179 lines)

Parses class/struct/union specifiers including the full class body. This function manages scope entry/exit, base class lists, member declarations, access specifiers, and CUDA execution space propagation.

Key operations:

Calls scan_tag_name (sub_4A38A0, 1,216 lines) to parse the class name, handling qualified names and template parameters
Calls check_for_class_modifiers (sub_4A3610) to detect final/__final
Manages the scope stack: pushes a class scope (kind 6 or 7) at qword_126C5E8 + 784 * scope_index
Sets CUDA execution space flags at scope entry offset +182 (bit 0x20) for device-side class definitions
Issues error 2407 for enum definitions in prohibited CUDA execution contexts

enum_specifier (sub_4AA2F0, 1,437 lines)

Parses enum, enum class, and enum struct specifiers, including:

Underlying type (enum E : int)
Opaque enum declarations (enum class E : int;)
Scoped vs. unscoped enum semantics
Calls scan_enumerator_list (sub_4A89F0, 950 lines) for the enumerator body

Specifier Validation Functions

After decl_specifiers accumulates all specifiers, several validation functions check that the combination is legal:

Function	Address	Lines	Purpose
`check_use_of_constexpr`	`sub_4A22B0`	153	Validates `constexpr` on functions and variables
`check_use_of_consteval`	`sub_4A1BF0`	104	Validates `consteval` on functions only
`check_use_of_constinit`	`sub_4A1EC0`	77	Validates `constinit` on variables with static or thread storage duration
`check_use_of_thread_local`	`sub_4A2000`	111	Validates `thread_local` placement
`check_explicit_specifier`	`sub_4A1DF0`	45	Validates `explicit` on constructors/conversions
`check_gnu_c_auto_type`	`sub_4A2580`	52	Validates GNU `__auto_type`

Each follows the same pattern: examine the accumulated specifier bits and the entity kind at offset +80 of the declaration node, and emit a targeted error if the combination is illegal. For example, check_use_of_consteval:

check_use_of_consteval(decl_info):
    entity = decl_info[0]
    kind = entity[+80]       # symbol kind
    
    if kind != FUNCTION (10) and kind != MEMBER_FUNCTION (11):
        emit_error(2926)      # consteval on non-function
        entity[+177] &= 0xF9 # clear bits 1 and 2 (mask 0x06), which include the consteval bit
        return
    
    func_kind = entity[+166]
    if func_kind == DESTRUCTOR (2):
        emit_error(2927)      # consteval on destructor
        entity[+177] &= 0xF9
        return
    
    if func_kind == CONSTRUCTOR (1):
        if type_has_virtual_base(entity[+88]):
            emit_error(2928)  # consteval on ctor with virtual base
            entity[+177] &= 0xF9
            return
    
    if func_kind == CONVERSION (5):
        if certain_conversion_conditions:
            emit_error(2959)  # consteval on certain conversions
            entity[+177] &= 0xF9
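
The effect of the `0xF9` mask can be checked with a few lines of arithmetic. This sketch assumes the bit positions given in the function entity layout later in this document (bit 1 = constexpr, bit 2 = consteval, counted from zero); the constant and function names are illustrative, not taken from the binary.

```python
CONSTEXPR_BIT = 1 << 1   # 0x02, bit 1 of the byte at +177 (assumed zero-based)
CONSTEVAL_BIT = 1 << 2   # 0x04, bit 2 of the byte at +177
CLEAR_MASK = 0xF9        # mask applied by check_use_of_consteval


def clear_after_error(flags):
    """Apply the same AND mask the pseudocode applies after emitting an error."""
    return flags & CLEAR_MASK


# 0xF9 is the one-byte complement of the two bits combined.
mask_matches = CLEAR_MASK == (0xFF & ~(CONSTEXPR_BIT | CONSTEVAL_BIT))

# Bits 0, 1 and 2 set: only bit 0 survives the mask.
cleared = clear_after_error(0b0000_0111)
```

Under that assumption the mask clears the constexpr bit together with the consteval bit, and every other bit in the byte is preserved.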

Stage 3: Declarator Parsing (declarator.c)

Architecture

Declarator parsing uses inside-out construction: the C++ declarator syntax places the declared name in the center, with type constructors radiating outward (pointers to the left, arrays and function parameters to the right). The parser builds a derived-type chain that is later unwound against the base type from decl_specifiers to produce the final type.

Declarator syntax (C++ grammar):
    declarator := pointer-declarator
    pointer-declarator := {*, &, &&, C::*} cv-qualifiers* direct-declarator
    direct-declarator := declarator-id | ( declarator ) | direct-declarator ( params ) | direct-declarator [ expr ]
    declarator-id := qualified-name | unqualified-name

The parser coordinates five specialized sub-parsers:

Function	Address	Lines	Role
`declarator`	`sub_4B7BC0`	284	Top-level entry: dispatches to pointer/r_declarator
`r_declarator`	`sub_4BC950`	2,578	Recursive descent on direct-declarator
`pointer_declarator`	`sub_4B72A0`	440	`*`, `&`, `&&`, `C::*` with cv-qualifiers
`array_declarator`	`sub_4B6760`	518	`[expr]` and `[]`
`function_declarator`	`sub_4B8190`	3,144	`(params) cv-quals ref-qual noexcept -> ret`

scan_declarator_attributes (sub_4B3970) -- CUDA Attribute Separation

This is the critical function that separates CUDA execution space attributes from standard C++ attributes on declarators. In standard C++, attributes apply to the entity being declared. CUDA adds a parallel attribute dimension -- execution space -- that must be routed to a separate storage location.

The function iterates through the attribute list and dispatches on each attribute's category byte at offset +9:

scan_declarator_attributes(decl_info, attr_accumulator):
    attr_list = decl_info[+200]    # primary attribute list
    
    for each attr in attr_list:
        category = attr[+9]         # attribute category byte
        kind = attr[+8]             # attribute kind
        placement = attr[+10]       # where in declaration it appeared
        
        switch category:
            case 1:  # TYPE attribute (alignas, etc.)
                # Keep on primary list, set placement
                attr[+10] = 10      # after type specifier
                
            case 2:  # DECLARATION attribute ([[nodiscard]], etc.)
                if attr[+11] & 0x10:
                    # CUDA/vendor declaration attribute
                    route_to_vendor_list(attr)
                else:
                    # Standard declaration attribute
                    attr[+10] = 12  # before declarator
                
            case 3:  # STATEMENT attribute ([[fallthrough]], etc.)
                if decl_info[+131] & 8:  # class-key context
                    handle_class_key_stmt_attr(attr)
                
            case 4:  # CUDA EXECUTION SPACE attribute
                # __device__, __host__, __global__
                # Move to SECONDARY attribute list
                move_to_list(attr, decl_info[+184])
                
                # Error if misplaced
                if wrong_position:
                    emit_error(1847)  # attribute in wrong position
    
    # Mark all processed attributes
    for each attr in processed:
        attr[+11] |= 1    # set "consumed" flag

The separation into primary (offset +200) and secondary (offset +184) attribute lists is essential: downstream code (decl_routine, decl_variable) reads execution space from the secondary list and standard attributes from the primary list. This prevents CUDA execution space from interfering with standard attribute processing like [[nodiscard]] or [[deprecated]].
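
The routing described above amounts to partitioning one list into two. The sketch below is a simplified model, not the binary's data structures: `Attr` and `separate_attributes` are hypothetical names, only the category values 1 to 4 come from the text, and the placement bookkeeping and error 1847 are omitted.

```python
from dataclasses import dataclass

# Category byte values as described in the text.
TYPE, DECLARATION, STATEMENT, EXEC_SPACE = 1, 2, 3, 4


@dataclass
class Attr:
    name: str
    category: int
    consumed: bool = False


def separate_attributes(attrs):
    """Split attributes into (primary, secondary) lists, preserving source order."""
    primary, secondary = [], []
    for attr in attrs:
        if attr.category == EXEC_SPACE:
            secondary.append(attr)   # execution space goes to the secondary list
        else:
            primary.append(attr)     # everything else stays on the primary list
        attr.consumed = True         # models setting bit 0 of the flags byte
    return primary, secondary


attrs = [
    Attr("nodiscard", DECLARATION),
    Attr("__device__", EXEC_SPACE),
    Attr("alignas", TYPE),
    Attr("__host__", EXEC_SPACE),
]
primary, secondary = separate_attributes(attrs)
```

Keeping relative order within each list matters in this model because later stages may report diagnostics in source order.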

function_declarator (sub_4B8190, 3,144 lines)

The largest of the declarator sub-parsers listed above. It handles the complete C++ function declarator grammar, including cv- and ref-qualifiers (const, volatile, &, &&), C++11/17 noexcept specifications, C++11 trailing return types, C++20 trailing requires clauses, and C++23 deducing this.

function_declarator(decl_info, context_flags):
    debug_trace(3, "function_declarator")
    
    # Parse parameter list
    expect_token('(')
    param_list = parse_parameter_list()
    expect_token(')')
    
    # C++ member function qualifiers
    cv_quals = 0
    while is_cv_qualifier(current_token):
        cv_quals |= cv_bit(current_token)
        advance_token()
    
    # Ref-qualifier (& or &&)
    ref_qual = NONE
    if current_token == '&':
        ref_qual = LVALUE_REF
        advance_token()
    elif current_token == '&&':
        ref_qual = RVALUE_REF
        advance_token()
    
    # Exception specification
    except_spec = NONE
    if current_token == TOKEN_THROW:
        except_spec = parse_throw_spec()
    elif current_token == TOKEN_NOEXCEPT:
        except_spec = parse_noexcept_spec()
    
    # C++11 trailing return type
    trailing_return = NULL
    if current_token == TOKEN_ARROW:   # ->
        advance_token()
        trailing_return = parse_type()
    
    # C++20 trailing requires clause
    requires_clause = NULL
    if current_token == TOKEN_REQUIRES:
        requires_clause = scan_trailing_requires_clause()
    
    # C++23 deducing this
    if has_explicit_this_parameter(param_list):
        mark_deducing_this()
    
    # Build function type node
    func_type = add_to_derived_type_list(
        FUNCTION_TYPE,
        param_list, cv_quals, ref_qual,
        except_spec, trailing_return, requires_clause
    )
    
    return func_type
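
The ordering of the trailer matters: cv-qualifiers, then the ref-qualifier, then the exception specification, then the trailing return type. The sketch below illustrates that ordering on a pre-split token list. It is a toy under stated assumptions: `parse_function_trailer` is a hypothetical helper, `throw(...)` and requires clauses are left out, and everything after `->` is treated as the return type.

```python
def parse_function_trailer(tokens):
    """Consume the tokens that follow ')' in a function declarator, in grammar order."""
    out = {"cv": [], "ref": None, "noexcept": False, "trailing_return": None}
    pos = 0
    # 1. cv-qualifiers (any number)
    while pos < len(tokens) and tokens[pos] in ("const", "volatile"):
        out["cv"].append(tokens[pos])
        pos += 1
    # 2. optional ref-qualifier
    if pos < len(tokens) and tokens[pos] in ("&", "&&"):
        out["ref"] = tokens[pos]
        pos += 1
    # 3. optional exception specification (only bare noexcept in this sketch)
    if pos < len(tokens) and tokens[pos] == "noexcept":
        out["noexcept"] = True
        pos += 1
    # 4. optional trailing return type (simplification: rest of the tokens)
    if pos < len(tokens) and tokens[pos] == "->":
        out["trailing_return"] = " ".join(tokens[pos + 1:])
        pos = len(tokens)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r} in function trailer")
    return out


result = parse_function_trailer(["const", "&&", "noexcept", "->", "int"])
```

A trailer written out of order, such as `noexcept const`, is rejected by this model because `const` is no longer expected once the exception specification has been consumed.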

Derived Type Construction

add_to_derived_type_list (sub_4B4CF0, 600 lines) is the type-chain builder. Each declarator modifier (pointer, reference, array, function) appends a new node to a linked list. After parsing completes, form_declared_type (sub_4B4870) walks this chain bottom-up, applying each modifier to the base type to produce the final declared type.

For a declaration like const int *(*fp)(double):

Base type: const int
Derived chain (from the declared name outward): [pointer] → [function(double)] → [pointer]
Unwound: pointer to function(double) returning pointer to const int
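
The inside-out construction can be reproduced with a small model. In this sketch the chain is ordered from the declared name outward, and the modifier farthest from the name is applied to the base type first. The classes and the `unwind_chain` and `describe` helpers are hypothetical; they are not the node layout used by add_to_derived_type_list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Base:
    name: str


@dataclass(frozen=True)
class Derived:
    kind: str          # "pointer", "function", or "array"
    inner: object      # pointee, return type, or element type
    detail: str = ""   # parameter list or array bound


def unwind_chain(base, chain):
    """Apply modifiers to the base type, starting with the one farthest from the name."""
    result = base
    for kind, detail in reversed(chain):
        result = Derived(kind, result, detail)
    return result


def describe(t):
    """Render a type as an English phrase."""
    if isinstance(t, Base):
        return t.name
    if t.kind == "pointer":
        return "pointer to " + describe(t.inner)
    if t.kind == "function":
        return f"function({t.detail}) returning " + describe(t.inner)
    if t.kind == "array":
        return f"array[{t.detail}] of " + describe(t.inner)
    raise ValueError(t.kind)


# const int *(*fp)(double): reading outward from fp gives pointer, function, pointer.
fp_chain = [("pointer", ""), ("function", "double"), ("pointer", "")]
fp_type = unwind_chain(Base("const int"), fp_chain)
print(describe(fp_type))
```

For this declaration the model describes `fp` as a pointer to a function taking `double` and returning a pointer to `const int`.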

Stage 4: Declaration Processing (decls.c)

decl_variable (sub_4CA6C0, 1,090 lines)

Processes variable declarations after specifiers and declarator have been parsed. This is where CUDA memory space qualifiers are applied and the variable entity is inserted into the symbol table.

CUDA Memory Space Bits

Variable entries carry a CUDA memory space bitmask at offset +148:

Bit	Mask	Memory Space	Meaning
0	`0x01`	`__constant__`	Device-side constant memory
1	`0x02`	`__shared__`	On-chip shared memory (one copy per thread block)
2	`0x04`	`__managed__`	Unified memory (host + device accessible)
4	`0x10`	`__device__`	Device global memory

These bits are set from the declaration state object (parameter a2), which carries the parsed CUDA attribute at offset +240:

decl_variable(decl_specs, decl_state, storage_class, out_entity, out_flags):
    debug_trace(3, "decl_variable")
    assert(decl_state != NULL)              # decls.c:7730
    
    # Look up existing variable in scope
    existing = lookup_variable_in_scope(    # sub_4C84B0
        scope, name, type_info
    )
    
    # Create new variable entity
    var_entity = create_variable_entry(     # sub_5C9840
        name, type, storage_class
    )
    
    # Apply CUDA memory space from declaration state
    if dword_126EFA8:                       # CUDA mode enabled
        cuda_attr_ptr = decl_state[+240]
        if cuda_attr_ptr != NULL:
            # Extract memory space from attribute
            space = extract_memory_space(cuda_attr_ptr)
            var_entity[+148] = space        # set memory space bits
            
            # Scope walk: determine if variable is at namespace scope
            # or inside a function (affects valid memory space combinations)
            scope_idx = dword_126C5E4       # current scope index
            scope_base = qword_126C5E8      # scope table base
            while scope_idx > 0:
                scope_entry = scope_base + 784 * scope_idx
                scope_kind = scope_entry[+4]
                if scope_kind == 4:          # class scope — walk up
                    scope_idx = scope_entry[+256]  # parent scope
                    continue
                break
            
            # Template scope check
            if scope_entry[+9] & 0x20:       # is_template_scope
                handle_template_variable()
    
    # Check redeclaration compatibility
    if existing != NULL:
        old_space = existing[+148]
        new_space = var_entity[+148]
        if old_space != new_space:
            # Determine which string to use for error message
            if new_space & 0x04:
                space_name = "__managed__"
            elif new_space & 0x01:
                space_name = "__constant__"
            elif new_space & 0x02:
                space_name = "__shared__"
            elif new_space & 0x10:
                space_name = "__device__"
            emit_error(1306)      # CUDA memory space mismatch on redeclaration
    
    # Anonymous type check
    if type_is_anonymous(var_entity):
        emit_error(891)           # anonymous type in variable declaration
    
    # Apply remaining attributes
    set_variable_attributes(var_entity)     # sub_4C4750
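
The bitmask and the name-selection priority used for the redeclaration diagnostic can be written compactly. This is a sketch under the assumption that the bit assignments in the table above are accurate; `MemSpace`, `space_name`, and `redeclaration_error` are illustrative names, not symbols from the binary.

```python
from enum import IntFlag


class MemSpace(IntFlag):
    # Bit values from the memory space table above.
    CONSTANT = 0x01
    SHARED = 0x02
    MANAGED = 0x04
    DEVICE = 0x10


# Order in which the pseudocode tests the bits when choosing a name.
_NAME_PRIORITY = [
    (MemSpace.MANAGED, "__managed__"),
    (MemSpace.CONSTANT, "__constant__"),
    (MemSpace.SHARED, "__shared__"),
    (MemSpace.DEVICE, "__device__"),
]


def space_name(bits):
    """Return the first matching memory space name, or None if no bit is set."""
    for flag, name in _NAME_PRIORITY:
        if bits & flag:
            return name
    return None


def redeclaration_error(old_bits, new_bits):
    """Return 1306 when the memory space differs between declarations, else None."""
    return 1306 if old_bits != new_bits else None


name = space_name(MemSpace.DEVICE | MemSpace.MANAGED)
error = redeclaration_error(MemSpace.DEVICE, MemSpace.SHARED)
```

Because `__managed__` is tested first, a variable carrying both the managed and device bits is reported as `__managed__` in this model.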

variable_declaration (sub_4DEC90, 1,098 lines) -- Top-Level Entry

This is the outermost entry point for processing a variable declaration. It wraps decl_variable with CUDA-specific validation, constexpr/constinit checks, and static data member definition handling.

CUDA-Specific Error Emission

The function contains a dense block of CUDA error checks for variable declarations:

variable_declaration(decl_info, ...):
    # Early CUDA checks
    check_constexpr_variable_init(decl_info)    # sub_4DAC80
    
    # CUDA memory space string selection for error messages
    mem_space_bits = entity[+148]
    byte_149 = entity[+149]
    
    if mem_space_bits & 0x04:     # __managed__
        # No __managed__-specific string needed here
        pass
    
    # Build human-readable attribute name for diagnostics
    if byte_149 & 1:
        space_str = "__constant__"
    elif (mem_space_bits & 4) == 0:
        space_str = "__managed__"
        if (byte_149 & 1) == 0:
            space_str = "__device__"
            if mem_space_bits & 2:
                space_str = "__shared__"
    
    # CUDA variable constraint errors
    if is_shared_variable:
        if is_variable_length_array:
            emit_error(3510)      # __shared__ variable with VLA
    
    if is_constant_variable:
        if is_constexpr:
            emit_error(3568)      # __constant__ combined with constexpr
        if is_volatile:
            emit_error(3566)      # __constant__ combined with volatile
        if is_vla:
            emit_error(3567)      # __constant__ with VLA
    
    if has_cuda_attribute:
        if in_constexpr_if_discarded_branch:
            emit_error(3578)      # CUDA attribute in discarded branch
        if at_namespace_scope and is_structured_binding:
            emit_error(3579)      # CUDA attribute on structured binding
        if is_variable_length_array:
            emit_error(3580)      # CUDA attribute on VLA
    
    # Dispatch to decl_variable or define_static_data_member
    if is_static_member_definition:
        define_static_data_member(...)
    else:
        decl_variable(decl_specs, decl_state, storage_class, ...)
    
    # Post-declaration CUDA fixup
    cuda_variable_fixup(entity)     # sub_4CC150
    mark_defined_variable(entity)   # sub_4DC200

Complete CUDA Variable Error Table

Error	Condition	Message Summary
149	Illegal CUDA storage class at namespace scope	Storage class not allowed here
891	Anonymous type in variable declaration	Anonymous type cannot be used
892	`auto`-typed CUDA variable (variant)	auto not allowed with CUDA qualifier
893	`auto`-typed CUDA variable	auto not allowed with CUDA qualifier
1306	Memory space mismatch on redeclaration	Conflicting CUDA memory space
1655	Tentative definition of constexpr variable	Missing initializer
3483	(CUDA variable context error)	CUDA attribute context mismatch
3510	`__shared__` variable with VLA	Variable-length arrays not allowed in `__shared__`
3566	`__constant__` with `volatile`	volatile incompatible with `__constant__`
3567	`__constant__` with VLA	Variable-length arrays not allowed in `__constant__`
3568	`__constant__` with `constexpr`	constexpr incompatible with `__constant__`
3578	CUDA attribute in `constexpr if` discarded branch	CUDA attribute in dead code
3579	CUDA attribute on structured binding at namespace scope	Structured binding cannot have CUDA attribute
3580	CUDA attribute on VLA	Variable-length arrays not allowed with CUDA attribute
3648	`__constant__` with external linkage	External `__constant__` not allowed

decl_routine (sub_4CE420, 2,858 lines)

The largest function in the declaration processing stage. It handles function and method declarations, integrating CUDA calling convention validation, attribute consistency checking, and template interaction.

Parameters

Parameter	Offset	Description
`a1`	--	`decl_specifiers` accumulator (`__m128i*`)
`a2`	--	Declaration state object
`a3`	--	Function info (offset `+64` = flags, `+80` = prior type)
`a4`	--	SRK flags bitmask
`a5`--`a8`	--	Output pointers and context

SRK Flag Bits

The a4 parameter carries "scan result kind" flags that describe what was parsed:

Bit	Mask	Meaning
0	`0x01`	`SRK_DECLARATION` -- forward declaration
1	`0x02`	`SRK_DEFINITION` -- has function body
7	`0x80`	`SRK_IMPLICIT` -- compiler-generated
8	`0x100`	`SRK_CONSTEXPR` -- constexpr function

Function Entity Layout

After processing, a function entity contains:

Offset	Size	Field	Description
`+80`	1	`entity_kind`	10 = function, 11 = member function
`+88`	8	`descriptor`	Pointer to function descriptor
`+144`	8	`type`	Function type pointer
`+164`	1	`defined_flag`	Set when definition is seen
`+166`	1	`function_kind`	1=ctor, 2=dtor, 5=conversion, 7=deduction guide
`+168`	8	`template_info`	Template instantiation info
`+177`	1	`attribute_flags`	bit 1=constexpr, bit 2=consteval
`+188`	1	`cuda_flags_1`	CUDA calling convention
`+189`	1	`cuda_flags_2`	CUDA execution space
`+192`	8	`parameter_list`	Head of parameter linked list
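
When inspecting a memory dump, the layout above translates directly into fixed-offset reads. The following sketch decodes a few of the documented fields from a synthetic buffer. It assumes little-endian encoding (the binary is x86-64) and zero-based bit numbering for `attribute_flags`; the 200-byte buffer size and the helper name are arbitrary choices for illustration.

```python
import struct

FUNCTION, MEMBER_FUNCTION = 10, 11
CONSTEXPR_BIT, CONSTEVAL_BIT = 0x02, 0x04   # bits 1 and 2 of attribute_flags (assumed zero-based)


def read_function_entity(buf):
    """Decode selected fields of a function entity from raw bytes."""
    flags = buf[177]
    return {
        "entity_kind": buf[80],
        "type": struct.unpack_from("<Q", buf, 144)[0],
        "function_kind": buf[166],
        "is_constexpr": bool(flags & CONSTEXPR_BIT),
        "is_consteval": bool(flags & CONSTEVAL_BIT),
        "parameter_list": struct.unpack_from("<Q", buf, 192)[0],
    }


# Synthetic image: a consteval constructor whose type pointer is 0x1234.
image = bytearray(200)
image[80] = MEMBER_FUNCTION
image[166] = 1                      # function_kind 1 = constructor
image[177] = CONSTEVAL_BIT
struct.pack_into("<Q", image, 144, 0x1234)

entity = read_function_entity(image)
```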

Pseudocode

decl_routine(decl_specs, decl_state, func_info, srk_flags, ...):
    debug_trace(3, "decl_routine")
    
    # Assertions
    assert func_info != NULL                    # decls.c:10057
    assert storage_class is valid               # decls.c:10059
    assert srk_flags & SRK_DECLARATION          # decls.c:10061
    assert func_type is routine type            # decls.c:10063
    if srk_flags & SRK_DEFINITION:
        assert body follows                     # decls.c:10068
    if srk_flags & SRK_IMPLICIT:
        assert compiler-generated context       # decls.c:10149
    
    # CUDA calling convention check
    if dword_126EFB4 == 2:                      # CUDA C++ mode
        check_cuda_calling_convention(          # sub_4C6AB0
            func_type, decl_specs
        )
        check_cuda_attribute_consistency(       # sub_4C6D50
            decl_state
        )
    
    # Look up existing declaration
    existing = find_linked_symbol(name, scope)
    
    if existing != NULL:
        # Redeclaration checks
        if existing.calling_convention != new_calling_convention:
            emit_error(948)         # calling convention mismatch
        
        if has_cuda_attribute(existing) and has_cuda_attribute(new):
            if not compatible_cuda_attributes(existing, new):
                emit_error(1430)    # function attribute mismatch
    
    # CUDA-specific restrictions
    if has_global_attribute:
        if return_type is auto:
            emit_error(1158)        # auto return type with __global__
    
    if is_deduction_guide:
        if has_any_cuda_attribute:
            emit_error(2885)        # CUDA attribute on deduction guide
    
    if is_explicit_instantiation:
        if conflicting_template_attributes:
            emit_error(1034)        # explicit instantiation conflict
    
    # Process CUDA attributes on the function
    process_cuda_attributes(decl_state)         # sub_42A250
    remove_cuda_trailing_return(decl_state)     # sub_42A210
    
    # Canonicalize trailing return type in CUDA mode
    if dword_126EFB4 == 2:
        canonicalize_return_type(func_type)      # sub_5DBCB0
    
    # Symbol table insertion
    entity = create_function_entity(name, func_type, storage_class)
    
    # Set defined flag
    assert entity.defined_flag is correct       # decls.c:10417
    
    # OpenMP variant handling (if active)
    if dword_106B4B8:                           # omp_declare_variant_active
        create_omp_variant_name("$$OMP_VARIANT%06d", variant_id)

CUDA Attribute Integration

Attribute Category System

EDG classifies attributes using a category byte at offset +9 in the attribute node:

Category	Value	Meaning	Examples
Type	1	Applies to the type	`alignas`, `__aligned__`
Declaration	2	Applies to the declaration	`[[nodiscard]]`, `[[deprecated]]`
Statement	3	Applies to a statement	`[[fallthrough]]`, `[[likely]]`
Execution space	4	CUDA execution space	`__device__`, `__host__`, `__global__`

Category 4 is NVIDIA's addition to EDG's attribute system. Standard EDG uses categories 1-3. CUDA execution space attributes are recognized by the lexer as identifiers, classified as CUDA keywords by get_token_main (sub_6810F0) when dword_106C2C0 (GPU mode) is active, and converted to attribute nodes with category 4 during attribute parsing.

Attribute Node Layout

Offset	Size	Field	Description
`+0`	8	`next`	Next attribute in linked list
`+8`	1	`kind`	Attribute kind (0 when cleared/consumed)
`+9`	1	`category`	1=type, 2=decl, 3=stmt, 4=exec-space
`+10`	1	`placement`	Where in declaration it appeared (10=after type, 12=before declarator)
`+11`	1	`flags`	bit 0 = consumed, bit 4 = CUDA/vendor
`+16`	8	`payload`	Attribute-specific data

Execution Space Propagation

When a CUDA execution space attribute is parsed, it flows through three processing points:

decl_specifiers (sub_4ACF80): CUDA attributes are recognized as token 142 (attribute) and parsed into the attribute list. The attribute parser sets category 4 for execution space attributes.
scan_declarator_attributes (sub_4B3970): Separates category-4 attributes from the primary attribute list and moves them to the secondary list at offset +184 of the declaration info structure.
decl_routine / decl_variable: Reads execution space from the secondary attribute list and applies it to the function/variable entity. For functions, the execution space goes to offsets +188/+189 of the entity. For variables, the memory space goes to offset +148.

warn_on_cuda_execution_space_attributes (sub_4A8990)

A safety valve that catches execution space attributes in places where they should not appear (e.g., on type definitions that are not function or variable declarations):

warn_on_cuda_execution_space_attributes(attr_list):
    warned = false
    for each attr in attr_list:
        category = attr[+9]
        if category == 1 or category == 4:     # type or exec-space
            if not warned:
                emit_error(1882)               # invalid exec space attr
                warned = true
            attr[+8] = 0                       # clear kind (suppress further processing)

Scope Chain and Context Tracking

The declaration parser relies heavily on the scope chain stored in the global scope table. Every declaration must be inserted at the correct scope, and many validation checks depend on whether the current scope is namespace-scope, class-scope, block-scope, or template-scope.

Scope Entry Layout (784 bytes)

Offset	Size	Field	Description
`+4`	1	`scope_kind`	2=namespace, 4=class, 6=function, 8=nested block, 10=block, 12=template, 15/17=special
`+6`	1	`flags_1`	bit 1=extern, bit 2=inline namespace, bit 7=pending class flag
`+7`	1	`flags_2`	bit 1=has using directives
`+9`	1	`template_flags`	bit 5=is template scope, bit 1-3=template kind
`+12`	4	`scope_flags`	bit 2-3=scope modifier
`+182`	1	`cuda_flags`	bit 5 (`0x20`)=CUDA device-side scope
`+192`	8	`first_entity`	Head of entity linked list
`+216`	8	`type_pointer`	Associated type (for class scopes)
`+224`	8	`namespace_ptr`	Associated namespace
`+256`	4	`parent_scope`	Index of parent scope in table
`+368`	8	`source_begin`	Source position where scope begins
`+376`	8	`associated_entity`	Entity that opened this scope
`+408`	4	`parent_scope_idx`	Alternate parent scope index

Scope Table Globals

Address	Name	Description
`qword_126C5E8`	`scope_table_base`	Array of 784-byte scope entries
`dword_126C5E4`	`current_scope_index`	Index into scope table
`dword_126C5DC`	`current_scope_id`	Current scope identifier
`dword_126C5B4`	`namespace_scope_id`	Nearest enclosing namespace scope
`dword_126C5BC`	`class_scope_depth`	Nesting depth of class scopes
`dword_126C5C4`	`lambda_scope_id`	Current lambda scope (-1 if none)
`dword_126C5C8`	`template_scope_id`	Current template scope (-1 if none)

Scope Walk for CUDA Memory Space

When processing a CUDA variable declaration, the parser walks up the scope chain to determine if the variable is at namespace scope (where __device__/__constant__/__managed__ are valid) or inside a function body (where __shared__ is additionally valid):

determine_cuda_variable_scope(var_entity):
    scope_idx = dword_126C5E4
    scope_base = qword_126C5E8
    
    while scope_idx > 0:
        entry = scope_base + 784 * scope_idx
        kind = entry[+4]
        
        if kind == 4:                  # class scope
            # Walk through class scopes to find enclosing namespace/function
            scope_idx = entry[+256]    # parent scope
            continue
        
        if kind == 2:                  # namespace scope
            # Variable is at namespace scope
            # Valid spaces: __device__, __constant__, __managed__
            return NAMESPACE_SCOPE
        
        if kind == 6 or kind == 10:    # function or block scope
            # Variable is inside a function body
            # Valid spaces: __shared__, __device__, __constant__, __managed__
            return FUNCTION_SCOPE
        
        scope_idx = entry[+256]
    
    return FILE_SCOPE
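
The walk can be exercised on a toy scope table. In this sketch each scope is a `(kind, parent_index)` pair instead of a 784-byte record, index 0 stands for the file scope, and every parent index is assumed to be smaller than the child's index so the loop terminates. `classify_scope` and the placeholder kind 0 for the file scope are hypothetical.

```python
# scope_kind values from the scope entry layout.
NAMESPACE, CLASS, FUNCTION, BLOCK = 2, 4, 6, 10


def classify_scope(scopes, idx):
    """Walk parent links from idx and report the first namespace or function-like scope."""
    while idx > 0:
        kind, parent = scopes[idx]
        if kind == NAMESPACE:
            return "namespace"
        if kind in (FUNCTION, BLOCK):
            return "function"
        idx = parent   # class scopes and any other kind: keep walking up
    return "file"


scopes = [
    (0, 0),            # 0: file scope (placeholder kind)
    (NAMESPACE, 0),    # 1: namespace ns
    (CLASS, 1),        # 2: class ns::C
    (FUNCTION, 2),     # 3: member function body
    (BLOCK, 3),        # 4: nested block
]

in_block = classify_scope(scopes, 4)
in_class = classify_scope(scopes, 2)
```

A static data member declared in `ns::C` resolves to namespace scope in this model, because the class scope is skipped on the way up.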

Linkage Determination

id_linkage (sub_4C3380, 310 lines)

Determines whether an identifier has internal, external, or no linkage. This is called during decl_variable and decl_routine to set the linkage byte on the entity.

id_linkage(entity, storage_class, scope):
    debug_trace(3, "id_linkage")
    
    kind = entity[+80]       # entity kind
    
    linkage = UNSET
    
    # C++ linkage rules
    if dword_126EFB4 == 2:    # C++ mode
        if storage_class == STATIC:
            linkage = INTERNAL    # 0x10
        elif storage_class == EXTERN:
            linkage = EXTERNAL    # 0x20
        elif scope_kind == NAMESPACE and kind == FUNCTION:
            linkage = EXTERNAL
        elif scope_kind == NAMESPACE and kind == VARIABLE:
            if is_const_qualified and not explicitly_extern:
                linkage = INTERNAL
            else:
                linkage = EXTERNAL
        elif scope_kind == BLOCK:
            linkage = NONE        # 0x00
    
    # C linkage rules (simpler); also the fallback when no C++ rule matched
    if linkage == UNSET:
        if storage_class == STATIC:
            linkage = INTERNAL
        elif scope_kind == FILE:
            linkage = EXTERNAL
        else:
            linkage = NONE
    
    # Debug output, emitted before returning
    debug_print(linkage_string)   # "internal" / "external" / "none"
    return linkage
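
The rules above can be checked against familiar declarations. The sketch below restates them as a pure function over string-valued inputs; `linkage_of` and its parameters are illustrative, and the model ignores everything the pseudocode does not mention (unnamed namespaces, inline variables, templates).

```python
def linkage_of(kind, storage_class, scope_kind, cpp_mode=True, is_const=False):
    """Return "internal", "external", or "none" following the rules sketched above."""
    if cpp_mode:
        if storage_class == "static":
            return "internal"
        if storage_class == "extern":
            return "external"
        if scope_kind == "namespace":
            if kind == "function":
                return "external"
            if kind == "variable":
                # const at namespace scope is internal unless declared extern,
                # and the extern case was already handled above.
                return "internal" if is_const else "external"
        if scope_kind == "block":
            return "none"
    # C rules, also used when no C++ rule matched.
    if storage_class == "static":
        return "internal"
    if scope_kind == "file":
        return "external"
    return "none"


examples = {
    "static int a;": linkage_of("variable", "static", "namespace"),
    "const int b = 1;": linkage_of("variable", None, "namespace", is_const=True),
    "extern const int c;": linkage_of("variable", "extern", "namespace", is_const=True),
    "int d;": linkage_of("variable", None, "namespace"),
    "void f() { int e; }": linkage_of("variable", None, "block"),
}
```

The second and third entries show the C++-specific rule: a namespace-scope `const` variable is internal by default and becomes external only when declared `extern`.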

find_linked_symbol (sub_4C1CC0, 608 lines)

The redeclaration detection engine. When a new declaration is processed, this function searches the current and enclosing scopes for a previously-declared symbol with the same name and compatible linkage:

find_linked_symbol(name, scope, entity_kind):
    debug_trace(3, "find_linked_symbol")
    
    # Look up in symbol table
    existing = symbol_lookup(name, scope)    # sub_698940
    
    if existing == NULL:
        return NULL
    
    # For functions: handle overload sets
    if entity_kind == FUNCTION:
        # Walk overload set checking for compatible signature
        for each overload in existing.overload_set:
            if types_match(overload.type, new_type):
                return overload
        return NULL    # new overload, not redeclaration
    
    # For variables: check linkage compatibility
    if entity_kind == VARIABLE:
        if existing.linkage == new_linkage:
            return existing
        # Special case: extern at block scope refers to
        # namespace-scope variable with same name
        if new_storage_class == EXTERN and scope_kind == BLOCK:
            return walk_to_namespace_scope_and_search(name)
    
    return NULL
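
The function branch distinguishes a redeclaration from a new overload by signature. A minimal sketch of that distinction follows, with `types_match` modelled as plain tuple equality on parameter types; this is a deliberate simplification, and the dictionary layout and helper name are hypothetical.

```python
def find_redeclared_function(overload_set, new_params):
    """Return the overload whose parameter types match, or None for a new overload."""
    for candidate in overload_set:
        if candidate["params"] == new_params:
            return candidate
    return None


overloads = [
    {"name": "f", "params": ("int",)},
    {"name": "f", "params": ("double",)},
]

redeclared = find_redeclared_function(overloads, ("double",))   # matches the second entry
new_overload = find_redeclared_function(overloads, ("char",))   # no match: new overload
```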

Constructor and Destructor Initialization (decl_inits.c)

ctor_inits_for_inheriting_ctor (sub_4A0310, 746 lines)

Builds the initialization sequence for inheriting constructors (C++11 using Base::Base;). The function iterates virtual base member lists to find matching base constructors and constructs the initialization order:

ctor_inits_for_inheriting_ctor(decl_info):
    class_type = decl_info[+40][+32]    # enclosing class type
    member_list = class_type[+152]       # member list
    
    # Iterate virtual bases
    for each member in member_list:
        if member[+80] == 8:             # base class member kind
            base_type = resolve_base_type(member)
            base_ctor = find_base_constructor(base_type)
            
            if decl_info[+178] & 0x40:   # inheriting-ctor redirection
                # Walk class hierarchy via offset+216 link
                while has_redirect(current):
                    current = current[+216]
                base_ctor = find_redirect_target(current)
            
            # Check accessibility
            check_base_ctor_accessibility(base_ctor)   # sub_48B3F0
            
            # Build init entry
            init_entry = allocate_init_entry()          # sub_6BA0D0
            init_entry.target = base_ctor
            append_to_init_list(init_entry)

dtor_initializer (sub_4A0EC0, 339 lines)

Builds the destructor initialization (destruction) list for a class. The destruction order is the reverse of construction order -- members are destroyed in reverse declaration order, then base classes in reverse order:

dtor_initializer(decl_info):
    debug_trace(3, "dtor_initializer")       # decl_inits.c:10153
    
    class_type = decl_info[5][+32]           # [5] is presumably a qword index (byte offset +40, as in sub_4A0310)
    member_list = class_type[+152]
    
    # Check for delegating constructor
    if decl_info[22] & 2:
        return    # delegating ctor, no separate dtor init needed
    
    # Pass 1: members with flag (offset[10] & 2)
    for each member in member_list:
        if member[10] & 2:
            if class_type[+132] != 11:       # not union
                dtor = resolve_member_destructor(member)
                entry = allocate_init_entry()
                entry.destructor = dtor
    
    # Pass 2: members with (offset[10] & 3) == 1
    for each member in member_list:
        if (member[10] & 3) == 1:
            dtor = resolve_member_destructor(member)
            entry = allocate_init_entry()
            entry.destructor = dtor
    
    # Base class destructors (reverse order)
    base_list = class_type[+96]
    for each base in reverse(base_list):
        dtor = resolve_base_destructor(base)   # sub_737270
        entry = allocate_init_entry()
        entry.destructor = dtor
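
The ordering rule above can be checked with a small model. This is a minimal sketch, not recovered code: `ClassModel`, `construction_order`, and `destruction_order` are hypothetical names, the two member passes are collapsed into a single list, and virtual bases are ignored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClassModel:
    """Toy stand-in for the class_type record; field names are hypothetical."""
    bases: List[str] = field(default_factory=list)    # declaration order
    members: List[str] = field(default_factory=list)  # declaration order


def construction_order(cls: ClassModel) -> List[str]:
    # Bases are constructed first, then members, both in declaration order.
    return cls.bases + cls.members


def destruction_order(cls: ClassModel) -> List[str]:
    # Members in reverse declaration order, then bases in reverse order.
    return list(reversed(cls.members)) + list(reversed(cls.bases))


widget = ClassModel(bases=["Base1", "Base2"], members=["a", "b", "c"])
# The destruction list is exactly the construction list read backwards.
assert destruction_order(widget) == list(reversed(construction_order(widget)))
```

Building the list in two reversed segments, rather than reversing one combined list, matches the pseudocode's shape: members are handled before the base list is walked.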

check_for_missing_initializer_full (sub_4A1540, 248 lines)

Checks whether a variable declaration is missing a required initializer:

check_for_missing_initializer_full(entity, type, unused, deferred_error):
    kind = entity[+80]       # 7=variable, 9=static member
    
    # VLA check
    if is_variable_length_array(type):
        emit_error(252)       # VLA cannot have initializer
    
    # const check (C++ mode)
    if dword_126EFB4 == 2:    # C++ mode
        if is_const_qualified(type) and not has_initializer(entity):
            if not is_extern(entity):
                emit_error(257)   # const object requires initializer
    
    # Abstract class check
    if type[+160] & 2:        # abstract class flag
        if (type[+132] & 0xFB) == 8:  # array of abstract (mask applied before the compare)
            emit_error(812)   # array of abstract class
        else:
            emit_error(516)   # abstract class cannot be instantiated
    
    # constexpr check
    if entity has constexpr flag:
        if not has_initializer(entity):
            emit_error(517)   # constexpr variable requires initializer
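
The checks above can be restated as a small function that returns the diagnostics it would emit. This is a sketch under assumptions: `Decl` and `missing_initializer_errors` are hypothetical names, the flags are flattened into booleans, and the error numbers are copied from the pseudocode rather than verified independently.

```python
from dataclasses import dataclass
from typing import List

TYPE_KIND_ARRAY = 8  # value taken from the pseudocode's `== 8` test


@dataclass
class Decl:
    """Hypothetical flattened view of the entity/type pair."""
    type_kind: int = 0
    is_vla: bool = False
    is_const: bool = False
    is_extern: bool = False
    is_abstract: bool = False
    is_constexpr: bool = False
    has_initializer: bool = False


def missing_initializer_errors(d: Decl, cpp_mode: bool = True) -> List[int]:
    """Collect error numbers in the order the pseudocode tests them."""
    errors: List[int] = []
    if d.is_vla:
        errors.append(252)
    if cpp_mode and d.is_const and not d.has_initializer and not d.is_extern:
        errors.append(257)
    if d.is_abstract:
        # Mask 0xFB clears bit 2, so kinds 8 and 12 both satisfy the test.
        is_array = (d.type_kind & 0xFB) == TYPE_KIND_ARRAY
        errors.append(812 if is_array else 516)
    if d.is_constexpr and not d.has_initializer:
        errors.append(517)
    return errors
```

Returning a list instead of emitting immediately is a choice made for testability; the pseudocode does not show whether the real function stops after the first diagnostic.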

CUDA Mode Control Globals

The declaration parser is gated on several CUDA mode flags that control which code paths are active:

Address	Name	Type	Description
`dword_126EFA8`	`is_cuda_compilation`	bool	Master CUDA mode flag
`dword_126EFB4`	`cuda_dialect`	int	0=none, 1=C, 2=C++
`dword_126EFAC`	`extended_cuda_features`	bool	Additional CUDA extensions enabled
`dword_126EFA4`	`cuda_host_compilation`	bool	Compiling host-side code
`dword_126EFB0`	`cuda_relaxed_constexpr`	bool	Allow `constexpr` on device functions
`dword_106C17C`	`constexpr_cuda_enabled`	bool	CUDA constexpr compatibility mode
`qword_126EF98`	`cuda_version_threshold_1`	int64	Version gate (0x9E97 = 40599 = CUDA 12.x)
`qword_126EF90`	`cuda_version_threshold_2`	int64	Version gate (0x78B3 = 30899 = CUDA 11.x)
`dword_126EF68`	`cpp_standard_version`	int	C++ standard version as a `__cplusplus`-style value (201103, 201402, ...)
`dword_126EF64`	`cpp_extensions_enabled`	bool	Language extensions active

CUDA Version Gating

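Several CUDA-specific code paths are guarded by integer version thresholds. The encoding is not settled: major * 1000 + minor * 10 + patch does not reproduce the constants below, whereas major * 10000 + minor * 100 + patch decodes 40599 as 4.5.99 and 30899 as 3.8.99, which would point to a host-compiler emulation version rather than a CUDA release. Treat the CUDA 11.x/12.x labels as tentative: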

// CUDA 11.x and later: enable extended constexpr
if qword_126EF90 > 0x78B3:     // 30899 → CUDA version >= 11.x
    enable_extended_constexpr()

// CUDA 12.x and later: enable managed memory attributes
if qword_126EF98 > 0x9E97:     // 40599 → CUDA version >= 12.x
    enable_managed_attributes()

// Recent CUDA: enable namespace-scope CUDA variable checks
if qword_126EF98 > 0x1116F:    // 70000+ → very recent CUDA
    enable_strict_namespace_checks()
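
The constants in these gates are easy to sanity-check by decoding them. The helper below is a hypothetical sketch (`decode` and `gate_open` are not names from the binary); it converts the hex literals to decimal and splits them under two candidate field widths. Neither encoding is confirmed by the binary itself.

```python
from typing import Tuple

# Hex literals as they appear in the comparisons above.
THRESHOLD_EF90 = 0x78B3
THRESHOLD_EF98 = 0x9E97
THRESHOLD_STRICT = 0x1116F


def decode(value: int, major_weight: int, minor_weight: int) -> Tuple[int, int, int]:
    """Split an integer version into (major, minor, patch) for the given weights."""
    major, rest = divmod(value, major_weight)
    minor, patch = divmod(rest, minor_weight)
    return major, minor, patch


def gate_open(version: int, threshold: int) -> bool:
    # The gates use a strict '>', so the first passing value is threshold + 1.
    return version > threshold


# Candidate encoding 1: major * 1000 + minor * 10 + patch
as_1000 = decode(THRESHOLD_EF98, 1000, 10)
# Candidate encoding 2: major * 10000 + minor * 100 + patch
as_10000 = decode(THRESHOLD_EF98, 10000, 100)
```

Because every gate is a strict greater-than test against a value ending in 99 (or 0x1116F = 69999), each one opens at the next round number: 30900, 40600, and 70000.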

Function Map

decl_spec.c (0x4A1BF0--0x4B37F0)

Address	Identity	Lines	Description
`sub_4A1BF0`	`check_use_of_consteval`	104	Validate `consteval` specifier
`sub_4A1DF0`	`check_explicit_specifier`	45	Validate `explicit` specifier
`sub_4A1EC0`	`check_use_of_constinit`	77	Validate `constinit` specifier
`sub_4A2000`	`check_use_of_thread_local`	111	Validate `thread_local` specifier
`sub_4A22B0`	`check_use_of_constexpr`	153	Validate `constexpr` specifier
`sub_4A2580`	`check_gnu_c_auto_type`	52	Validate GNU `__auto_type`
`sub_4A2630`	`scan_edg_vector_type`	203	Parse vector type syntax
`sub_4A2B80`	`is_function_declaration_ahead`	162	Lookahead: function declaration?
`sub_4A2E40`	`process_auto_parameter`	153	C++20 auto parameters
`sub_4A31A0`	`process_storage_class_specifier`	223	Storage class validation
`sub_4A3610`	`check_for_class_modifiers`	139	Detect `final`/`__final`
`sub_4A38A0`	`scan_tag_name`	1,216	Parse class/enum name
`sub_4A4FD0`	`set_name_linkage_for_type`	41	Set type linkage
`sub_4A5140`	`update_membership_of_class`	173	Update class scope info
`sub_4A5510`	`attach_tag_attributes`	143	Attach attributes to types
`sub_4A57C0`	`class_specifier`	2,179	Parse class/struct/union definition
`sub_4A8990`	`warn_on_cuda_execution_space_attributes`	33	CUDA exec space warning
`sub_4A89F0`	`scan_enumerator_list`	950	Parse enum body
`sub_4AA2F0`	`enum_specifier`	1,437	Parse enum specifier
`sub_4AC550`	`typename_specifier`	197	Parse `typename T::type`
`sub_4AC970`	`is_constructor_decl`	225	Detect constructor declaration
`sub_4ACE00`	`enclosing_class_type`	43	Get enclosing class from scope
`sub_4ACF80`	`decl_specifiers`	4,761	Central specifier dispatcher
`sub_4B37F0`	`decl_spec_one_time_init`	40	Module initialization

declarator.c (0x4B3920--0x4C00A0)

Address	Identity	Lines	Description
`sub_4B3970`	`scan_declarator_attributes`	297	Separate CUDA exec-space attrs
`sub_4B3E80`	`scan_trailing_requires_clause`	136	C++20 requires clause
`sub_4B4230`	`check_for_restrict_qualifier_on_derived_type`	124	Restrict validation
`sub_4B4870`	`form_declared_type`	53	Combine base type + derived chain
`sub_4B4990`	`report_bad_return_type_qualifier`	89	cv-qual on return type
`sub_4B4CF0`	`add_to_derived_type_list`	600	Build derived type chain
`sub_4B5A70`	`delayed_scan_of_exception_spec`	211	Deferred exception spec
`sub_4B6760`	`array_declarator`	518	Parse `[expr]`
`sub_4B72A0`	`pointer_declarator`	440	Parse `*`, `&`, `&&`, `::*`
`sub_4B7BC0`	`declarator`	284	Top-level declarator entry
`sub_4B8190`	`function_declarator`	3,144	Parse function signature
`sub_4BC7F0`	`scan_requires_expr_parameters`	61	C++20 requires-expr params
`sub_4BC950`	`r_declarator`	2,578	Recursive descent declarator
`sub_4C00A0`	`scan_lambda_declarator`	414	Lambda declarator

decls.c (0x4C0840--0x4F0000)

Address	Identity	Lines	Description
`sub_4C0910`	`incompatible_types_are_SVR4_compatible`	77	SVR4 ABI compat check
`sub_4C0B10`	`set_default_calling_convention`	112	Calling convention setup
`sub_4C0CB0`	`record_overload`	91	Record function overload
`sub_4C0E90`	`set_linkage_for_class_members`	107	Propagate class linkage
`sub_4C10E0`	`set_linkage_environment`	138	Linkage environment setup
`sub_4C15D0`	`check_use_of_placeholder_type`	175	Validate `auto`/`decltype(auto)`
`sub_4C1CC0`	`find_linked_symbol`	608	Redeclaration detection
`sub_4C3380`	`id_linkage`	310	Linkage determination
`sub_4C3A80`	`qualified_name_redecl_sym`	320	Qualified redeclaration
`sub_4CA6C0`	`decl_variable`	1,090	Variable declaration processing
`sub_4CC150`	`cuda_variable_fixup`	120	CUDA post-decl variable fixup
`sub_4CE420`	`decl_routine`	2,858	Function declaration processing
`sub_4DAC80`	`check_constexpr_variable_init`	60	CUDA constexpr check
`sub_4DB440`	`process_asm_block`	200	Inline assembly declaration
`sub_4DC200`	`mark_defined_variable`	26	CUDA constexpr linkage
`sub_4DD710`	`check_trailing_return_type`	80	Auto type deduction check
`sub_4DEC90`	`variable_declaration`	1,098	Top-level variable entry

disambig.c (0x4E9E70--0x4EC690)

Address	Identity	Lines	Description
`sub_4E9E70`	`prescan_gnu_attribute`	98	Skip `__attribute__` in prescan
`sub_4EA560`	`prescan_declaration`	400	Top-level disambiguation
`sub_4EB270`	`prescan_declarator`	200	Prescan declarator tokens
`sub_4EC690`	`find_for_loop_separator`	100	Find `;` in for-init

decl_inits.c (0x4A0310--0x4A1BE0)

Address	Identity	Lines	Description
`sub_4A0310`	`ctor_inits_for_inheriting_ctor`	746	Inheriting ctor init list
`sub_4A0EC0`	`dtor_initializer`	339	Destructor init list
`sub_4A1540`	`check_for_missing_initializer_full`	248	Missing init diagnostic
`sub_4A1B60`	`decl_inits_init`	11	Module initialization
`sub_4A1BB0`	`decl_inits_reset`	9	Module reset

Cross-References

Lexer -- token production, word_126DD58, sub_676860 (get_next_token)
Template Engine -- template scope interaction during declarator parsing
CUDA Template Restrictions -- __global__ template argument validation, executed after decl_routine
Name Mangling -- mangled name generation for declared entities
Overload Resolution -- overload set construction during find_linked_symbol
Constexpr Interpreter -- invoked during check_use_of_constexpr for validation

Overload Resolution

The overload resolution engine in cudafe++ is EDG 6.6's implementation of the C++ overload resolution algorithm (ISO C++ [over.match]). It lives in overload.c -- approximately 100 functions spanning address range 0x6BE4A0--0x6EF7A0 (roughly 200KB of compiled code). Overload resolution is one of the most complex subsystems in any C++ compiler because it sits at the intersection of nearly every other language feature: implicit conversions, user-defined conversions, template argument deduction, SFINAE, partial ordering, reference binding, list initialization, copy elision, and operator overloading each contribute decision branches to the algorithm. EDG implements the standard three-phase architecture -- candidate collection, viability checking, best-viable selection -- with NVIDIA-specific extensions for CUDA execution-space filtering.

Key Facts

Property	Value
Source file	`overload.c` (~100 functions)
Address range	`0x6BE4A0`--`0x6EF7A0`
Total code size	~200KB
Main selection entry	`sub_6E6400` (`select_overloaded_function`, 1,483 lines, 20 parameters)
Operator dispatch	`sub_6EF7A0` (`select_overloaded_operator`, 2,174 lines)
Viability checker	`sub_6E2040` (`determine_function_viability`, 2,120 lines)
Candidate evaluator	`sub_6C4C00` (candidate evaluation, 1,044 lines)
Main driver	`sub_6CE6E0` (overload resolution driver, 1,246 lines)
Built-in candidates	`sub_6CD010` (built-in operator candidates, 752 lines)
Candidate iterator	`sub_6E4FA0` (`try_overloaded_function_match`, 633 lines)
Conversion scoring	`sub_6BEE10` (`standard_conversion_sequence`, 375 lines)
ICS comparison	`sub_6CBC40` (implicit conversion sequence comparison, 345 lines)
Qualification compare	`sub_6BE6C0` (`compare_qualification_conversions`, 127 lines)
Copy constructor select	`sub_6DBEA0` (`select_overloaded_copy_constructor`, 625 lines)
Default constructor select	`sub_6E9080` (`select_overloaded_default_constructor`, 358 lines)
Assignment operator select	`sub_6DD600` (`select_overloaded_assignment_operator`, 492 lines)
CTAD entry	`sub_6E8300` (`deduce_class_template_args`, 285 lines)
List initializer	`sub_6D7C80` (`prep_list_initializer`, 2,119 lines)
Overload set traversal	`sub_6BA230` (iterate overload set)
Overload debug trace	`dword_126EFC8` (enable), `qword_106B988` (output stream)
CUDA extensions flag	`byte_126E349`
Language mode	`dword_126EFB4` (2 = C++)
Standard version	`dword_126EF68` (201103 = C++11, 201703 = C++17, 202302 = C++23)

Why Overload Resolution Is Hard

Overload resolution is not a simple "find the best match" operation. The C++ standard defines it as a partial ordering problem over implicit conversion sequences, where each sequence is itself a multi-step chain of type transformations. The key sources of complexity:

Implicit conversion sequences (ICS). Each argument-to-parameter match produces an ICS consisting of up to three steps: a standard conversion (lvalue-to-rvalue, array-to-pointer, etc.), optionally a user-defined conversion (constructor or conversion function), then another standard conversion. Ranking two ICSs against each other requires comparing each step independently.
User-defined conversions. When no standard conversion exists, the compiler must search for converting constructors on the target type AND conversion operators on the source type, then perform a nested overload resolution among those candidates. This creates recursive invocations of the overload engine.
Template argument deduction. Function templates produce candidates only after deduction succeeds. Deduction may fail (SFINAE), producing no candidate. Successfully deduced candidates participate in a separate tie-breaking rule: non-template functions are preferred over template specializations, and "more specialized" templates are preferred over "less specialized" ones ([over.match.best] p2.5).
Partial ordering. When comparing two function templates that are both viable, the compiler must determine which is "more specialized" by attempting deduction in both directions (templates.c handles this). The result feeds back into overload ranking.
Operator overloading. Built-in operators (like + on int) compete with user-defined operator+. The compiler synthesizes "built-in candidate functions" representing every valid built-in operator signature, adds them to the candidate set alongside user-defined operators, and runs the same best-viable algorithm on the combined set.
Special contexts. Copy-initialization vs. direct-initialization, list-initialization, reference binding, and conditional-operator type determination each have their own overload resolution sub-procedures with modified candidate sets and ranking rules.

Architecture: Three-Phase Pipeline

                          PHASE 1                    PHASE 2                   PHASE 3
                     Candidate Collection        Viability Check         Best-Viable Selection
                    ┌───────────────────┐     ┌──────────────────┐     ┌──────────────────────┐
                    │                   │     │                  │     │                      │
  f(args...) ──────►│ Name lookup       │────►│ For each cand:   │────►│ Pairwise comparison  │
                    │ ADL (arg-dep.)    │     │  - param count   │     │ of viable candidates │
                    │ Using-declarations│     │  - conversions   │     │ via ICS ranking      │──► winner
                    │ Template deduction│     │  - constraints   │     │                      │
                    │ Built-in synth    │     │  - access check  │     │ Tie-breakers:        │──► or ambiguity
                    │                   │     │                  │     │  - non-template pref │
                    └───────────────────┘     └──────────────────┘     │  - partial ordering  │──► or no match
                                                                      │  - cv-qual ranking   │
                                                                      └──────────────────────┘

Phase 1: Candidate Collection

Candidates are collected into an overload set -- a linked list of entries allocated via sub_6BA0D0 and iterated via sub_6BA230. The overload set is built by the caller before invoking select_overloaded_function. Sources of candidates include:

Name lookup results. All declarations visible by name at the call site, including base class members and using-declarations.
Argument-dependent lookup (ADL). Additional functions found by searching the associated namespaces of the argument types (Koenig lookup). These are added to the set by the name lookup machinery before overload resolution begins.
Template specializations. For each function template in the name lookup result, template argument deduction is attempted. If deduction succeeds, the resulting specialization is added as a candidate. If deduction fails, the template is silently dropped (SFINAE).
Built-in operator candidates. For operator expressions, sub_6CD010 synthesizes candidate functions representing every valid built-in operator signature for the given operand types. These synthetic candidates use single-character type classification codes to match operand patterns.

Phase 2: Viability Checking

determine_function_viability (sub_6E2040, 2,120 lines) is the core viability checker. For each candidate function, it determines whether all arguments can be implicitly converted to the corresponding parameter types.

determine_function_viability (sub_6E2040, 2120 lines)
    Input:  candidate function F, argument list A[0..n-1]
    Output: viability flag, per-argument conversion summaries

    // Guard: SFINAE context handling
    if (in_sfinae_context)
        push_diagnostic_suppression()

    // PASS 1: Basic eligibility
    if (F is deleted)
        return NOT_VIABLE
    if (F is template && deduction_failed)
        return NOT_VIABLE
    if (F has fewer params than args && !F.is_variadic)
        return NOT_VIABLE
    if (F has more params than args && excess params lack defaults)
        return NOT_VIABLE

    // Handle implicit 'this' parameter for member functions
    if (F is non-static member function) {
        this_match = selector_match_with_this_param(
            object_operand, F.this_param_type)       // sub_6D0A80
        if (this_match == FAILED)
            return NOT_VIABLE
    }

    // PASS 2: Per-argument conversion check
    for i in 0..n-1:
        log("determine_function_viability: arg %d", i)

        param_type = F.params[i].type
        arg_type   = A[i].type

        // Compute implicit conversion sequence
        ics = compute_standard_conversion_sequence(   // sub_6BEE10
                  arg_type, param_type, context_flags)

        if (ics == NO_CONVERSION) {
            // Try user-defined conversion
            ics = try_user_defined_conversion(
                      arg_type, param_type)           // sub_6BF610
            if (ics == NO_CONVERSION)
                return NOT_VIABLE
        }

        // Check narrowing for list-initialization
        if (context == LIST_INIT && ics.is_narrowing)
            return NOT_VIABLE

        // Record per-argument match summary
        summaries[i] = ics

    log("(pass 2)")  // second pass through for detailed scoring

    // All arguments convertible -- candidate is viable
    return VIABLE, summaries[]

The function implements a two-pass approach visible in the debug trace output: pass 1 performs a quick rejection check (parameter count, deleted status, deduction success), and pass 2 computes the full conversion sequence for each argument. The per-argument summaries are stored in a 48-byte structure (set_arg_summary_for_user_conversion at sub_6BE990 initializes these).
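
Pass 1 is cheap enough to model directly. The following is a minimal sketch of the quick-rejection logic only, assuming a flattened `Signature` record; the class and the function name `passes_quick_rejection` are hypothetical, and the treatment of deleted functions simply mirrors the pseudocode above.

```python
from dataclasses import dataclass


@dataclass
class Signature:
    """Hypothetical summary of a candidate's parameter list."""
    num_params: int
    num_defaults: int = 0        # trailing parameters that have default arguments
    is_variadic: bool = False
    is_deleted: bool = False
    deduction_failed: bool = False


def passes_quick_rejection(sig: Signature, num_args: int) -> bool:
    """Return True if the candidate survives pass 1 for a call with num_args arguments."""
    if sig.is_deleted or sig.deduction_failed:
        return False
    # Too many arguments is only acceptable for a variadic function.
    if num_args > sig.num_params and not sig.is_variadic:
        return False
    # Too few arguments is only acceptable if defaults cover the remainder.
    if num_args < sig.num_params - sig.num_defaults:
        return False
    return True
```

Only candidates that survive this filter need the per-argument conversion work of pass 2, which is where most of the cost sits.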

Phase 3: Best-Viable Selection

select_overloaded_function (sub_6E6400, 1,483 lines, 20 parameters) performs the final selection. It is the master entry point for overload resolution -- called from the expression parser, from CTAD, and from special member function selection.

select_overloaded_function (sub_6E6400, 1483 lines, 20 params)
    Input:  overload_set, arg_list, context_flags, ...
    Output: best_function or AMBIGUOUS or NO_MATCH

    log("Entering select_overloaded_function with ...")

    // Early exit: dependent type arguments => defer to instantiation time
    if (selector_type_is_dependent)
        return DEPENDENT

    // Step 1: Iterate candidates and check viability
    viable_set = []
    try_overloaded_function_match(                    // sub_6E4FA0
        overload_set, arg_list, &viable_set, ...)

    if (viable_set is empty)
        return NO_MATCH

    if (viable_set has exactly 1 candidate)
        return viable_set[0]

    // Step 2: Pairwise comparison of viable candidates
    //   For each pair (F1, F2), compare their ICS for each argument
    best = viable_set[0]
    ambiguous = false

    for each candidate C in viable_set[1..]:
        cmp = compare_candidates(best, C)
        //   compare_candidates calls compare_conversion_sequences (sub_6BFF70)
        //   for each argument position, and applies tie-breakers

        if (cmp == C_IS_BETTER)
            best = C
            ambiguous = false
        else if (cmp == NEITHER_BETTER)
            ambiguous = true

    // Step 3: Verify best is strictly better than ALL others
    if (ambiguous) {
        // Final check: is there a single candidate that beats all?
        for each candidate C in viable_set:
            if (C != best) {
                cmp = compare_candidates(best, C)
                if (cmp != BEST_IS_BETTER)
                    return AMBIGUOUS
            }
    }

    return best
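
The selection loop can be sketched as a tournament followed by a verification pass. This is an illustration, not the recovered control flow: `select_best`, `better_by_rank`, and `AMBIGUOUS` are hypothetical names, and unlike the pseudocode above (which re-checks only when the ambiguity flag is set) this version always verifies the winner against every other candidate, which is the conservative reading of [over.match.best].

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
AMBIGUOUS = object()  # sentinel for "no unique best candidate"


def select_best(viable: List[T], better: Callable[[T, T], bool]):
    """Return the unique best candidate, AMBIGUOUS, or None when nothing is viable."""
    if not viable:
        return None
    # Tournament: keep whichever candidate beats the current leader.
    best = viable[0]
    for cand in viable[1:]:
        if better(cand, best):
            best = cand
    # Verification: the leader must beat every other candidate outright.
    for cand in viable:
        if cand is not best and not better(best, cand):
            return AMBIGUOUS
    return best


def better_by_rank(a: Sequence[int], b: Sequence[int]) -> bool:
    """Toy ordering: no argument worse, at least one strictly better (lower rank wins)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
```

The verification pass matters because the "better than" relation is not a total order: a leader that survived the tournament may still be incomparable with a candidate it never met.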

Candidate Comparison Rules

The pairwise comparison between two viable candidates F1 and F2 follows [over.match.best]. The result is one of: F1-better, F2-better, or indistinguishable.

compare_candidates(F1, F2):
    // Rule 1: Compare implicit conversion sequences argument-by-argument
    f1_better_count = 0
    f2_better_count = 0

    for i in 0..n-1:
        cmp = compare_conversion_sequences(           // sub_6BFF70
                  F1.ics[i], F2.ics[i])
        if (cmp == F1_BETTER) f1_better_count++
        if (cmp == F2_BETTER) f2_better_count++

    if (f1_better_count > 0 && f2_better_count > 0)
        return NEITHER_BETTER   // each wins on some argument; tie-breakers do not apply
    if (f1_better_count > 0)
        return F1_IS_BETTER
    if (f2_better_count > 0)
        return F2_IS_BETTER

    // Rule 2: Non-template preferred over template
    if (F1 is non-template && F2 is template)
        return F1_IS_BETTER
    if (F2 is non-template && F1 is template)
        return F2_IS_BETTER

    // Rule 3: More-specialized template preferred
    if (both are templates) {
        partial = partial_ordering(F1.template, F2.template)
        if (partial == F1_MORE_SPECIALIZED)
            return F1_IS_BETTER
        if (partial == F2_MORE_SPECIALIZED)
            return F2_IS_BETTER
    }

    // Rule 4: Compare qualification conversions
    cmp_qual = compare_qualification_conversions(     // sub_6BE6C0
                   F1.qual_info, F2.qual_info)
    if (cmp_qual != 0)
        return cmp_qual

    return NEITHER_BETTER
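
A runnable reduction of the first two rules is shown below. It is a sketch under assumptions: `Candidate` is a hypothetical record holding one integer rank per argument (lower is better), partial ordering and the qualification comparison are omitted, and a pair in which each side wins on some argument is reported as `NEITHER_BETTER` without consulting the tie-breakers.

```python
from dataclasses import dataclass
from typing import List

F1_IS_BETTER, F2_IS_BETTER, NEITHER_BETTER = 1, -1, 0


@dataclass
class Candidate:
    ranks: List[int]           # per-argument conversion rank, lower is better
    is_template: bool = False


def compare_candidates(f1: Candidate, f2: Candidate) -> int:
    # Rule 1: count the argument positions each side wins.
    f1_wins = sum(a < b for a, b in zip(f1.ranks, f2.ranks))
    f2_wins = sum(b < a for a, b in zip(f1.ranks, f2.ranks))
    if f1_wins and f2_wins:
        return NEITHER_BETTER          # mixed result: no tie-breaker can rescue it
    if f1_wins:
        return F1_IS_BETTER
    if f2_wins:
        return F2_IS_BETTER
    # Rule 2: all positions tie, so a non-template beats a template.
    if f1.is_template != f2.is_template:
        return F2_IS_BETTER if f1.is_template else F1_IS_BETTER
    return NEITHER_BETTER
```

For example, `compare_candidates(Candidate([1, 2]), Candidate([1, 3]))` yields `F1_IS_BETTER` because the second argument needs only a promotion for the first candidate.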

Implicit Conversion Sequence (ICS) Model

An ICS is the sequence of transformations needed to convert an argument type to a parameter type. EDG computes and stores ICS information in a compact structure.

Standard Conversion Sequence

standard_conversion_sequence (sub_6BEE10, 375 lines) computes the standard-conversion component of an ICS. It produces a conversion rank used in comparison.

Rank	Description	Examples	Priority
Exact Match	No conversion needed	`int` to `int`, lvalue-to-rvalue	1 (best)
Promotion	Integer/float promotion	`short` to `int`, `float` to `double`	2
Conversion	Standard conversion	`int` to `double`, derived-to-base	3
User-Defined	User conversion + std conversion	`Foo` to `Bar` via constructor	4
Ellipsis	Match via `...` parameter	Any type to variadic	5 (worst)

Within the same rank, additional criteria refine the comparison:

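Qualification adjustment. Adding a qualifier (T to const T) is worse than the identity conversion T to T. compare_qualification_conversions (sub_6BE6C0) encodes cv-qualification as a bitmask (const = 0x20, volatile = 0x40, restrict = 0x80) and compares subset relationships: a conversion that adds a proper subset of the other's qualifiers ranks better.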
Derived-to-base distance. Conversion through a shorter inheritance chain is better. Checked via sub_7AB300.
Reference binding. Binding to T&& is preferred over binding to const T& when the argument is an rvalue.
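
The rank ordering and the bitmask subset test can be written down compactly. In this sketch the `Rank` values follow the priority column of the table and the mask constants are the ones quoted above; the function name `compare_qualifiers` and its -1/0/1 return convention are assumptions made for illustration.

```python
from enum import IntEnum


class Rank(IntEnum):
    """Priority values from the table; a lower value is a better match."""
    EXACT_MATCH = 1
    PROMOTION = 2
    CONVERSION = 3
    USER_DEFINED = 4
    ELLIPSIS = 5


CONST, VOLATILE, RESTRICT = 0x20, 0x40, 0x80


def compare_qualifiers(q1: int, q2: int) -> int:
    """Return -1 if q1 is a proper subset of q2 (q1 better), 1 for the reverse,
    and 0 when the sets are equal or neither contains the other."""
    if q1 == q2:
        return 0
    if (q1 & q2) == q1:
        return -1
    if (q1 & q2) == q2:
        return 1
    return 0
```

Using an `IntEnum` means the worst rank across several arguments is simply `max(...)`, and incomparable qualifier sets such as `CONST` versus `VOLATILE` fall out as 0 without a special case.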

User-Defined Conversion Sequence

When no standard conversion exists, try_conversion_function_match_full (sub_6D0F50, 1,085 lines) searches for a user-defined conversion path. It considers:

Converting constructors on the target type (non-explicit constructors that accept the source type).
Conversion functions on the source type (operator T() members).

For each candidate conversion, it checks:

try_conversion_function_match_full (sub_6D0F50, 1085 lines)
    Input:  source_class_type, dest_type, context_flags
    Output: selected conversion function/constructor, or AMBIGUOUS, or NONE

    log("considering conversion functions for [%lu.%d]")

    if (source is not class type)
        error("try_conversion_function_match_full: source not class")

    // Iterate conversion function candidates of source class
    for each conv_func in source_class.conversion_functions:  // via sub_6BA230
        return_type = conv_func.return_type
        if (return_type is compatible with dest_type) {
            // Check standard conversion from return_type to dest_type
            post_ics = compute_standard_conversion_sequence(
                           return_type, dest_type)
            if (post_ics != NO_CONVERSION)
                add_to_viable(conv_func, post_ics)
        }

    // Also check for converting constructors on dest type
    conversion_from_class_possible(                  // sub_6D28C0/6D2ED0
        source_class_type, dest_type, &viable_set)

    // Select best among viable user-defined conversions
    if (viable_set has 1 candidate)
        return viable_set[0]
    if (viable_set has multiple candidates)
        return best-of or AMBIGUOUS

    return NONE

The conversion_from_class_possible functions (sub_6D28C0, 252 lines; sub_6D2ED0, 293 lines) emit full debug traces with entry/exit messages:

Entering conversion_from_class_possible, dest_type = <type>
Candidate functions list: ...
Leaving conversion_from_class_possible: <result>

The Main Overload Resolution Driver

sub_6CE6E0 (1,246 lines) is the central driver function -- "THE MONSTER" -- that coordinates the overload resolution pipeline. It is called from determine_selector_match_level and from the candidate evaluation logic, acting as the type-comparison and scoring backbone that feeds the higher-level selection functions.

overload_resolution_driver (sub_6CE6E0, 1246 lines)
    // This function performs the detailed type comparison and conversion
    // sequence computation that determines how well a candidate matches.
    //
    // It is called per-candidate, per-argument-position from the viability
    // checker and the candidate evaluator.

    // 1. Quick identity check
    if (arg_type == param_type)
        return EXACT_MATCH

    // 2. Chase typedef chains to canonical types
    arg_canon  = canonical_type(arg_type)
    param_canon = canonical_type(param_type)

    // 3. Apply lvalue-to-rvalue conversion
    if (param expects rvalue && arg is lvalue)
        apply lvalue_to_rvalue conversion, record in ICS

    // 4. Apply array-to-pointer / function-to-pointer decay
    if (arg is array)      convert to pointer-to-element
    if (arg is function)   convert to pointer-to-function

    // 5. Check for standard conversions (integral promotion, float promotion,
    //    integral conversion, floating conversion, pointer conversion,
    //    pointer-to-member conversion, boolean conversion)
    std_conv = find_applicable_standard_conversion(arg_canon, param_canon)
    if (std_conv != NONE)
        return std_conv with rank

    // 6. Check for qualification conversion (add const/volatile)
    qual_conv = check_qualification_conversion(arg_canon, param_canon)
    if (qual_conv)
        return EXACT_MATCH with qual adjustment

    // 7. Check derived-to-base conversion
    if (is_class(arg_canon) && is_class(param_canon)) {
        if (is_derived_from(arg_canon, param_canon))
            return CONVERSION_RANK with derived-to-base marker
    }

    // 8. No standard conversion found
    return NO_CONVERSION
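
The decision ladder above can be exercised on a toy type system. The sketch below is a deliberate simplification, not the driver's logic: types are plain strings, the promotion and arithmetic sets are hand-written assumptions, inheritance is a single-parent dictionary, and qualification, decay, and reference handling are left out. The names `classify`, `PROMOTABLE_TO_INT`, and `ARITHMETIC` are hypothetical.

```python
from typing import Dict, Optional

EXACT_MATCH, PROMOTION, CONVERSION = 1, 2, 3
NO_CONVERSION = None

PROMOTABLE_TO_INT = {"bool", "char", "signed char", "unsigned char",
                     "short", "unsigned short"}
ARITHMETIC = PROMOTABLE_TO_INT | {"int", "unsigned", "long", "float", "double"}


def classify(arg: str, param: str,
             bases: Optional[Dict[str, str]] = None) -> Optional[int]:
    """Rank the standard conversion from arg to param, or return NO_CONVERSION."""
    bases = bases or {}
    if arg == param:                               # step 1: identity
        return EXACT_MATCH
    if arg in PROMOTABLE_TO_INT and param == "int":
        return PROMOTION                           # integral promotion
    if arg == "float" and param == "double":
        return PROMOTION                           # floating-point promotion
    if arg in ARITHMETIC and param in ARITHMETIC:
        return CONVERSION                          # other arithmetic conversions
    # Derived-to-base: walk the (single-inheritance) parent chain.
    current = bases.get(arg)
    while current is not None:
        if current == param:
            return CONVERSION
        current = bases.get(current)
    return NO_CONVERSION
```

A call such as `classify("Derived", "Base", {"Derived": "Middle", "Middle": "Base"})` reaches the final loop and reports a conversion-rank match through two inheritance steps.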

Candidate Evaluation Function

sub_6C4C00 (1,044 lines) is the candidate evaluation function -- it scores each candidate by computing the full set of implicit conversion sequences across all arguments and produces the data that compare_candidates uses.

evaluate_candidate (sub_6C4C00, 1044 lines)
    Input:  candidate F, argument list args[], match_context
    Output: per-argument ICS array, overall viability

    for each argument position i:
        // Compute the implicit conversion sequence
        ics = overload_resolution_driver(             // sub_6CE6E0
                  args[i].type, F.params[i].type, flags)

        if (ics == NO_CONVERSION) {
            // Try user-defined conversion
            udc = try_user_defined_conversion(args[i].type, F.params[i].type)
            if (udc == NONE)
                mark F as non-viable for position i
                return NON_VIABLE
            ics = user_defined_ics(udc)
        }

        // Record the ICS for this position
        F.arg_summaries[i] = ics

    // Compute overall match quality
    F.match_level = worst(F.arg_summaries[0..n-1])
    return VIABLE
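
The "worst of the per-argument summaries" rule reduces to a maximum over integer ranks. This sketch assumes ranks are integers where lower is better and takes the per-argument classifier as a parameter; `evaluate_candidate` here is a simplified stand-in that skips the user-defined conversion fallback, and `toy_classify` with its lookup table is purely illustrative.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def evaluate_candidate(
    arg_types: Sequence[str],
    param_types: Sequence[str],
    classify: Callable[[str, str], Optional[int]],
) -> Optional[Tuple[int, List[int]]]:
    """Return (match_level, per-argument ranks), or None if any argument fails."""
    summaries: List[int] = []
    for arg, param in zip(arg_types, param_types):
        rank = classify(arg, param)
        if rank is None:
            return None                 # one unconvertible argument sinks the candidate
        summaries.append(rank)
    # Overall quality is the worst (numerically largest) per-argument rank.
    match_level = max(summaries, default=1)
    return match_level, summaries


# Illustrative classifier: 1 = exact, 2 = promotion, 3 = conversion.
TOY_RANKS = {("int", "int"): 1, ("short", "int"): 2, ("int", "double"): 3}


def toy_classify(arg: str, param: str) -> Optional[int]:
    return TOY_RANKS.get((arg, param))
```

Keeping the per-argument list alongside the aggregate is what lets a later pairwise comparison look at individual positions instead of only the summary value.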

Candidate Iteration

try_overloaded_function_match (sub_6E4FA0, 633 lines, and variant sub_6E5B20, 367 lines) iterates the overload set and calls determine_function_viability for each candidate.

try_overloaded_function_match (sub_6E4FA0, 633 lines)
    Input:  overload_set, arg_list, context
    Output: viable_candidates[]

    log("try_overloaded_function_match")

    // Traverse the overload set
    cursor = overload_set.head
    while (cursor != NULL):                           // via sub_6BA230

        candidate = cursor.function
        log("try_overloaded_function_match: considering %s",
            candidate.name)                           // via sub_5B72C0

        // Set up traversal symbol for template deduction
        set_overload_set_traversal_symbol(cursor)

        // Check viability
        viable = determine_function_viability(        // sub_6E2040
                     candidate, arg_list, context)

        if (viable) {
            add candidate to viable_candidates[]
            record conversion summaries
        }

        cursor = cursor.next

Operator Overloading

Operator overloading resolution follows a specialized path because it must consider both user-defined operators AND synthesized built-in operator candidates.

Entry Point: select_overloaded_operator

sub_6EF7A0 (2,174 lines) is the master entry point for operator overloading. It is called from the expression parser whenever an operator expression involves a class-type operand.

select_overloaded_operator / check_for_operator_overloading
    (sub_6EF7A0, 2174 lines)
    Input:  operator_kind, lhs_operand, rhs_operand (if binary), context
    Output: selected function (user-defined or built-in), or use-builtin flag

    log("Entering check_for_operator_overloading")

    // Guard: dependent operands => defer
    if (lhs is dependent || rhs is dependent)
        log("check_for_operator_overloading: dep operand")
        return DEPENDENT

    // Step 1: Collect user-defined operator candidates
    //   Search member operators of lhs class
    //   Search non-member operators via name lookup + ADL
    user_candidates = collect_user_operator_candidates(
                          operator_kind, lhs, rhs)

    // Step 2: Generate built-in operator candidates
    builtin_candidates = generate_builtin_candidates(   // sub_6CD010
                             operator_kind, lhs.type, rhs.type)

    // Step 3: Combine candidate sets
    combined = user_candidates + builtin_candidates

    // Step 4: Run standard overload resolution on combined set
    result = select_overloaded_function(                 // sub_6E6400
                 combined, [lhs, rhs], OPERATOR_CONTEXT)

    if (result is a built-in candidate) {
        // Adjust operands for built-in semantics
        adjust_operand_for_builtin_operator(             // sub_6E0E50
            lhs, rhs, operator_kind)
        return USE_BUILTIN
    }

    log("Leaving f_check_for_operator_overloading")
    return result.function

Built-in Operator Candidate Generation

sub_6CD010 (752 lines) generates synthetic candidate functions representing built-in operators. It uses a type classification code scheme where each type category is encoded as a single character.

Type Classification Codes

Code	Meaning	Query Function
`A` / `a`	Arithmetic type	`sub_7A7590` (`is_arithmetic`)
`B`	Boolean type	is_bool
`b`	Boolean-equivalent	is_pointer/bool
`C`	Class type	`sub_7A8A30` (`is_class`)
`D` / `I` / `i`	Integer/integral type	`sub_7A71E0` (`is_integral`)
`E`	Enum type	`sub_7A70F0` (`is_enum`)
`F`	Pointer-to-function	is_function_pointer
`H`	Handle type (CLI)	is_handle
`M`	Pointer-to-member	`sub_7A8D90` (`is_member_pointer`)
`N`	nullptr_t	is_nullptr
`O`	Pointer-to-object	is_object_pointer
`P`	Pointer (any)	is_pointer
`S`	Scoped enum	is_scoped_enum
`h`	Handle-to-CLI-array	is_handle_array
`n`	Non-bool arithmetic	is_non_bool_arithmetic

The function matches_type_code (sub_6BECA0) dispatches on these codes to check whether an operand matches a candidate pattern. The function name_for_type_code (sub_6BE4A0, 67 lines) converts codes to human-readable strings for diagnostics (e.g., A becomes "arithmetic").

Candidate Pattern Matching

try_builtin_operands_match (sub_6ED2A0, 812 lines) matches operands against built-in operator patterns. The patterns are encoded as strings like "A;P" where each character is a type code and ; separates operand positions.

try_builtin_operands_match (sub_6ED2A0, 812 lines)
    Input:  operator_kind, pattern_string, operand types
    Output: match result

    log("try_builtin_operands_match: considering %s", pattern_string)

    for i in 0..num_operands-1:
        code = pattern_string[i]    // after skipping separators
        log("try_builtin_operands_match: operand %d", i)

        if (!matches_type_code(operand[i].type, code))
            log("try_builtin_operands_match: ran off pattern")
            return NO_MATCH

    return MATCH with conversion cost
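
The pattern walk above can be illustrated with a small, self-contained sketch. Everything here is hypothetical scaffolding: the type_kind enum, the reduced set of codes, and the helper operands_match_pattern are invented for illustration, and the real sub_6ED2A0 inspects EDG type nodes and also computes conversion costs, which this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type categories; the real binary queries type nodes instead. */
typedef enum { TK_INT, TK_FLOAT, TK_BOOL, TK_ENUM, TK_CLASS,
               TK_OBJ_PTR, TK_FUNC_PTR } type_kind;

/* Models a subset of the classification codes from the table above. */
static bool matches_type_code(type_kind t, char code) {
    switch (code) {
    case 'A': case 'a': return t == TK_INT || t == TK_FLOAT || t == TK_BOOL;
    case 'B':           return t == TK_BOOL;
    case 'C':           return t == TK_CLASS;
    case 'D': case 'I': case 'i': return t == TK_INT || t == TK_BOOL;
    case 'E':           return t == TK_ENUM;
    case 'F':           return t == TK_FUNC_PTR;
    case 'O':           return t == TK_OBJ_PTR;
    case 'P':           return t == TK_OBJ_PTR || t == TK_FUNC_PTR;
    case 'n':           return t == TK_INT || t == TK_FLOAT;
    default:            return false;   /* code not modelled in this sketch */
    }
}

/* True when every operand matches its position in a pattern such as "A;P". */
static bool operands_match_pattern(const char *pattern,
                                   const type_kind *ops, size_t n) {
    assert(pattern != NULL && ops != NULL);
    const char *p = pattern;
    for (size_t i = 0; i < n; i++) {
        while (*p == ';') p++;               /* skip operand separators */
        if (*p == '\0') return false;        /* fewer codes than operands */
        if (!matches_type_code(ops[i], *p)) return false;
        p++;
    }
    while (*p == ';') p++;
    return *p == '\0';                       /* pattern fully consumed */
}
```

With this sketch, an (int, object pointer) operand pair matches "A;P", while a (class, object pointer) pair does not.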

try_conversions_for_builtin_operator (sub_6EE340, 1,058 lines) contains a large switch over operator kinds that selects the appropriate type pattern tables. It checks dword_126EF68 for C++17 features (>= 201703) and dword_126EFB4 for language mode.

Special Member Function Selection

Overload resolution for special member functions uses dedicated entry points that share the same underlying machinery but provide specialized candidate sets and matching rules.

Copy/Move Constructor Selection

select_overloaded_copy_constructor (sub_6DBEA0, 625 lines)
    Input:  class_type, source_operand, context_flags
    Output: selected constructor symbol, or NULL

    log("Entering select_overloaded_copy_constructor, class_type = %s",
        class_type.name)

    // Iterate all constructors of the class
    for each ctor in class_type.constructors:         // via sub_6BA230
        log("select_overloaded_copy_constructor: considering %s",
            ctor.name)                                // via sub_5B72C0

        // Check copy parameter match
        match = determine_copy_param_match(           // sub_6DBAC0
                    ctor, source_operand)

        // determine_copy_param_match calls:
        //   sub_6CE6E0 (type comparison)
        //   sub_6BE5D0 (value category check)
        //   sub_6DB6E0 (deduce_one_parameter for template ctors)

        if (match.viable) {
            if (current_best == NULL || match better than best_match) {
                current_best = ctor
                best_match = match
                ambiguous = false
            }
            // Check for ambiguity: a different ctor ties the best match
            else if (match equal to best_match)
                ambiguous = true
        }

    log("Leaving select_overloaded_copy_constructor, cctor_sym = %s",
        current_best.name)
    return current_best

The value category check (sub_6BE5D0, copy_function_not_callable_because_of_arg_value_category, 39 lines) is critical for C++11 move semantics: it rejects copy constructors when the source is an rvalue and a move constructor is available, and vice versa.
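
The best-candidate bookkeeping in the loop above can be sketched as follows. This is a simplification under stated assumptions: the candidate struct and select_best are hypothetical, and match quality is collapsed into a single integer rank, whereas real overload resolution compares conversion sequences pairwise and only yields a partial order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical candidate record: a lower rank means a better match. */
typedef struct {
    const char *name;
    bool viable;
    int rank;
} candidate;

/* Returns the unique best viable candidate, or NULL when no candidate is
   viable or the best rank is shared by two or more candidates. */
static const candidate *select_best(const candidate *c, size_t n,
                                    bool *ambiguous) {
    assert(ambiguous != NULL);
    const candidate *best = NULL;
    *ambiguous = false;
    for (size_t i = 0; i < n; i++) {
        if (!c[i].viable) continue;
        if (best == NULL || c[i].rank < best->rank) {
            best = &c[i];
            *ambiguous = false;   /* a strictly better match clears old ties */
        } else if (c[i].rank == best->rank) {
            *ambiguous = true;    /* a different candidate ties the best */
        }
    }
    return *ambiguous ? NULL : best;
}
```

Tracking the best match separately from the ambiguity flag means a later, strictly better candidate can still resolve an earlier tie.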

Default Constructor Selection

select_overloaded_default_constructor (sub_6E9080, 358 lines)
    Input:  class_type
    Output: selected constructor symbol

    log("Entering select_overloaded_default_constructor, class_type = %s",
        class_type.name)

    // Collect zero-argument constructors
    // Check for default arguments (a 1-param ctor with default is a default ctor)
    // Run standard overload resolution with empty argument list

    log("Leaving select_overloaded_default_constructor, ctor_sym = %s",
        result.name)
    return result

Assignment Operator Selection

select_overloaded_assignment_operator (sub_6DD600, 492 lines)
    Input:  class_type, rhs_operand
    Output: selected assignment operator symbol

    log("Entering select_overloaded_assignment_operator, class_type = %s",
        class_type.name)

    // Iterate assignment operator candidates
    for each assign_op in class_type.assignment_operators:  // via sub_6BA230
        log("select_overloaded_assignment_operator: considering %s",
            assign_op.name)

        // Check parameter match (similar to copy constructor)
        // ...

    log("Leaving select_overloaded_assignment_operator, assign_sym = %s",
        result.name)
    return result

Copy Elision

Copy elision is handled by handle_elided_copy_constructor_no_guard (two variants: sub_6DCD60, 166 lines, and sub_6DD180, 169 lines). Where elision is permitted but not guaranteed (all elision before C++17, and cases such as NRVO since then), the compiler must still verify that the copy/move constructor would be callable -- the constructor is selected via select_overloaded_copy_constructor but never actually invoked. Under C++17 guaranteed elision of prvalues, the language no longer requires such a constructor to exist. The wrapper arg_copy_can_be_done_via_constructor (sub_6DCC00, 55 lines) performs the callability check.

List Initialization

prep_list_initializer (sub_6D7C80, 2,119 lines) implements C++11 brace-enclosed initializer list resolution. It is one of the largest functions in overload.c, reflecting the combinatorial complexity of list initialization.

prep_list_initializer (sub_6D7C80, 2,119 lines)
    Input:  init_list (braced expression list), target_type, context
    Output: converted initializer expression

    // The algorithm (per [dcl.init.list]):
    //
    // 1. If T has an initializer_list<X> constructor and the braced-init-list
    //    can be converted to initializer_list<X>, use that constructor.
    //
    // 2. If T is an aggregate, perform aggregate initialization.
    //
    // 3. If T has constructors, overload resolution selects a constructor
    //    with the elements of the braced-init-list as arguments.
    //
    // 4. If T is a reference, bind to a temporary or element.
    //
    // At each step, check for narrowing conversions (C++11 requirement).

    // Gate: C++11 required
    if (dword_126EF68 < 201103)   // std_version < C++11
        return LEGACY_PATH

    // Step 1: Check for initializer_list constructor
    init_list_ctor = find_initializer_list_constructor(    // sub_6DFEC0
                         target_type, element_type)
    if (init_list_ctor) {
        init_list_obj = make_initializer_list_object(      // sub_6DFEC0
                            init_list, element_type)
        return set_up_for_constructor_call(init_list_ctor,
                                           init_list_obj)
    }

    // Step 2: Aggregate initialization (recursive for nested braces)
    if (is_aggregate(target_type)) {
        for each element in init_list:
            // Recursively call prep_list_initializer for nested braces
            prep_list_initializer(element, member_type, ...)  // recursive
        return aggregate_init_expr
    }

    // Step 3: Constructor overload resolution
    result = select_overloaded_function(                   // sub_6E6400
                 target_type.constructors, init_list.elements, LIST_INIT)

    // Step 4: Check for narrowing
    check_narrowing_conversions(init_list, result)

    return result

The find_initializer_list_constructor / make_initializer_list_object function (sub_6DFEC0, 692 lines) handles std::initializer_list<T> construction. It iterates constructors to find one taking initializer_list<T> and sets up the backing array via set_overload_set_traversal_symbol.

Class Template Argument Deduction (CTAD)

C++17 CTAD is implemented by deduce_class_template_args (sub_6E8300, 285 lines). CTAD works by synthesizing a set of "deduction guides" -- function-like entities derived from the class template's constructors -- and running overload resolution on them.

deduce_class_template_args (sub_6E8300, 285 lines)
    Input:  class_template, constructor_arguments, context
    Output: deduced template arguments

    // Step 1: Generate implicit deduction guides from constructors
    //   For each constructor C(P1, P2, ...) of class template T<A, B, ...>:
    //     Create guide: T(P1, P2, ...) -> T<deduced-A, deduced-B, ...>

    // Step 2: Add explicit deduction guides (user-provided)

    // Step 3: Run overload resolution among all guides
    selected_guide = select_overloaded_function(          // sub_6E6400
                         deduction_guides, constructor_args, CTAD_CONTEXT)

    // Step 4: Extract deduced template arguments from selected guide
    return selected_guide.deduced_args

CTAD delegates entirely to select_overloaded_function for the actual resolution -- the deduction guides are treated as ordinary function candidates with synthesized parameter types.

Auto Type Deduction

deduce_auto_type (sub_6DB010, 314 lines) implements C++11 auto type deduction, which is structurally similar to template argument deduction. It handles the special case of auto x = {1, 2, 3} where the deduced type is std::initializer_list<int>.

Conversion Infrastructure

Reference Binding

prep_reference_initializer_operand (sub_6D47B0, 1,121 lines) handles reference initialization, which has its own overload-resolution sub-algorithm for selecting the correct binding path:

Direct binding. If the initializer is an lvalue of the right type (or derived), bind directly.
Conversion-through-temporary. If a user-defined conversion exists, create a temporary and bind the reference to it.
Direct reference binding check. conversion_for_direct_reference_binding_possible (sub_6D4610, 49 lines) checks whether direct binding is possible.

Operand Conversion

After overload resolution selects a function, the arguments must be physically converted to match the parameter types:

Function	Lines	Role
`sub_6D6650` (`user_convert_operand`)	427	Applies user-defined conversion (constructor call or conversion function call)
`sub_6E1430` (`convert_operand_into_temp`)	418	Creates a temporary and converts operand into it
`sub_6E1C40` (`prep_argument` variant 1)	69	Prepares argument for function call
`sub_6E1E40` (`prep_argument` variant 2)	69	Simplified argument preparation
`sub_6EB1C0` (`adjust_overloaded_function_call_arguments`)	249	Post-resolution argument adjustment
`sub_6E0E50` (`adjust_operand_for_builtin_operator`)	199	Adjusts operands for built-in operator semantics

The high-level call setup function select_and_prepare_to_call_overloaded_function (sub_6EB550, 392 lines) combines overload resolution with argument preparation in a single entry point.

Dynamic Initialization

determine_dynamic_init_for_class_init (sub_6DEBC0, 679 lines) determines whether a class object initialization requires a runtime (dynamic) initialization routine rather than static initialization. It checks whether the constructor is trivial, whether the initializer is a constant expression, and whether the target requires dynamic dispatch.

Conditional Operator

conditional_operator_conversion_possible (sub_6EBFC0, 326 lines) handles the special overload resolution for the ternary conditional operator (? :), which has unique type-determination rules involving common type computation between the second and third operands.

Ambiguity Diagnostics

When overload resolution fails due to ambiguity, dedicated diagnostic functions produce the error messages:

Function	Lines	Role
`sub_6D7040` (`diagnose_overload_ambiguity` standalone)	191	Formats and emits ambiguity diagnostic with candidate list
`sub_6D35E0` (`user_defined_conversion_possible` with diagnosis)	399	Handles ambiguity in user-defined conversion resolution

The diagnostic output uses sub_4F59D0, sub_4F5C10, sub_4F5CF0, and sub_4F5D50 for type-to-string formatting, producing messages in the format:

ambiguous overload for 'operator+(A, B)':
  candidate: operator+(int, int)
  candidate: operator+(A::operator int(), int)

Missing Sentinel Warning

warn_if_missing_sentinel (sub_6E9C60, 1,170 lines) is a large function that checks for missing sentinel arguments (NULL terminators) in variadic function calls. It references multiple CUDA extension flags (byte_126E349, byte_126E358, byte_126E3C0, byte_126E3C1, byte_126E481) because CUDA functions have different variadic conventions.

CUDA Execution Space Interaction

CUDA introduces an additional dimension to overload resolution: execution space compatibility. In standard C++, any visible function is a candidate. In CUDA, a candidate from the wrong execution space may be excluded or penalized.

How Execution Spaces Affect Candidates

The CUDA execution space interaction with overload resolution happens at two levels:

Level 1: Post-resolution validation (expr.c). After overload resolution selects the best viable function, check_cross_execution_space_call (sub_505720, 4KB) validates that the selected function is callable from the current execution context. If the call is illegal (e.g., calling a __device__-only function from __host__ code), one of errors 3462--3465 is emitted. This check runs AFTER overload resolution, not during candidate filtering.

Level 2: Overload-internal CUDA awareness (overload.c). Within overload.c itself, the CUDA extensions flag byte_126E349 gates CUDA-specific behavior in several functions:

try_conversion_function_match_full (sub_6D0F50): Checks byte_126E349 when evaluating whether a conversion function is viable. In CUDA mode, conversion functions from the wrong execution space may be excluded from consideration during the user-defined conversion search.
warn_if_missing_sentinel (sub_6E9C60): Uses byte_126E349 and byte_126E358 to adjust sentinel checking behavior for CUDA-annotated variadic functions.

The key architectural decision is that CUDA does NOT filter candidates during Phase 1 (candidate collection) or Phase 2 (viability checking) of overload resolution proper. Instead, execution-space validation is a separate pass that runs after the standard C++ overload algorithm completes. This preserves EDG's clean separation between the standard-conforming overload engine and NVIDIA's CUDA extensions.

Cross-Space Validation

The execution space is encoded in the entity node at offset +182 as a bitfield:

Bit Pattern	Meaning
`(byte & 0x30) == 0x20`	`__device__` only
`(byte & 0x30) == 0x10`	`__host__` only
`(byte & 0x60) == 0x40`	`__global__`
`(byte & 0x30) == 0x30`	`__host__ __device__`

The cross-space checker (sub_505720) compares the caller's execution space with the callee's and emits:

Error	Condition
3462	`__device__` called from `__host__`
3463	Variant of 3462 for HD context
3464	`__host__` called from `__device__`
3465	Variant of 3464 with `__device__` note
3508	`__global__` called from wrong context

A template-instantiation variant (sub_505B40, check_cross_space_call_in_template) performs the same checks during template instantiation.
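
As a rough illustration of the caller/callee comparison, the sketch below maps a pair of execution spaces to an error number from the table above. The SPACE_* constants are sketch-local values, not the encoding at offset +182, and the function cross_space_error is hypothetical. Mapping an HD caller to 3463 is an assumed reading of "variant for HD context"; errors 3465 and 3508 are not modelled because their exact conditions are not stated here.

```c
#include <assert.h>

/* Sketch-local execution-space bits; NOT the entity-node encoding. */
enum { SPACE_HOST = 1, SPACE_DEVICE = 2 };

/* Returns 0 when the call is accepted by this sketch, else an error number. */
static int cross_space_error(unsigned caller, unsigned callee) {
    const unsigned both = SPACE_HOST | SPACE_DEVICE;
    assert(caller != 0 && caller <= both);
    assert(callee != 0 && callee <= both);

    if (callee == both)
        return 0;                    /* __host__ __device__ callee */
    if (callee == SPACE_DEVICE) {
        if (caller == SPACE_HOST)
            return 3462;             /* __device__ called from __host__ */
        if (caller == both)
            return 3463;             /* assumed: same violation, HD caller */
        return 0;                    /* device -> device */
    }
    /* callee is __host__ only */
    if (caller == SPACE_DEVICE)
        return 3464;                 /* __host__ called from __device__ */
    return 0;                        /* host -> host; HD -> host not modelled */
}
```

Because the check is a pure function of the two spaces, it can run after overload resolution has already chosen the callee, which matches the post-resolution placement described above.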

Debug Tracing

Overload resolution includes extensive debug tracing controlled by dword_126EFC8. When enabled, functions emit trace output via sub_48AFD0 / sub_48AE00 to the stream at qword_106B988:

Entering select_overloaded_function with ...
  try_overloaded_function_match: considering foo(int)
    determine_function_viability: arg 0
    (pass 2)
  try_overloaded_function_match: considering foo(double)
    determine_function_viability: arg 0
    (pass 2)
  comparing candidates: foo(int) vs foo(double)
Leaving select_overloaded_function: foo(int)

The trace format [%lu.%d] is used in conversion function matching to identify candidates by internal ID.

Overload Set Management

Overload sets are managed via two key functions in the memory management subsystem (sub_6BA0D0 and sub_6BA230), supported by three overload.c helpers:

Function	Role
`sub_6BA0D0`	Allocate a new overload set entry
`sub_6BA230`	Iterate/traverse an overload set (linked list walk)
`sub_6EC650`	Overload set traversal utility (212 lines)
`sub_6ECA20`	Overload set construction from multiple sources (137 lines)
`sub_6ECCE0`	Overload set initialization wrapper (23 lines)

The linked-list representation means candidate iteration is O(n) per traversal, but overload sets are typically small (< 100 candidates), so this is not a performance concern.
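
A minimal sketch of such a singly linked overload set is shown below. The overload_entry layout and the three helper names are hypothetical; the actual node layout used by sub_6BA0D0 and sub_6BA230 is not documented in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical overload-set node: one candidate symbol per entry. */
typedef struct overload_entry {
    const char *symbol_name;
    struct overload_entry *next;
} overload_entry;

/* Allocates a new entry and pushes it at the head of the set. */
static overload_entry *overload_set_add(overload_entry *head, const char *name) {
    overload_entry *e = malloc(sizeof *e);
    assert(e != NULL);
    e->symbol_name = name;
    e->next = head;
    return e;
}

/* O(n) walk: returns the first entry with the given name, or NULL. */
static const overload_entry *overload_set_find(const overload_entry *head,
                                               const char *name) {
    for (const overload_entry *e = head; e != NULL; e = e->next)
        if (strcmp(e->symbol_name, name) == 0)
            return e;
    return NULL;
}

/* Releases every entry in the set. */
static void overload_set_free(overload_entry *head) {
    while (head != NULL) {
        overload_entry *next = head->next;
        free(head);
        head = next;
    }
}
```

Every lookup walks the list from the head, which is the O(n) cost noted above.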

Complete Function Map

Address	Size (lines)	Identity	Confidence
`0x6BE4A0`	67	`name_for_type_code`	VERY HIGH
`0x6BE5D0`	39	`copy_function_not_callable_because_of_arg_value_category`	VERY HIGH
`0x6BE6C0`	127	`compare_qualification_conversions`	HIGH
`0x6BE990`	68	`set_arg_summary_for_user_conversion`	VERY HIGH
`0x6BEAF0`	30	`set_explicit_flag_on_param_list`	HIGH
`0x6BEB60`	69	`find_conversion_function`	VERY HIGH
`0x6BECA0`	70	`matches_type_code`	VERY HIGH
`0x6BEE10`	375	`standard_conversion_sequence`	HIGH
`0x6BF610`	80	`check_user_defined_conversion`	HIGH
`0x6BF710`	163	`evaluate_conversion_for_argument`	HIGH
`0x6BFA50`	129	`process_builtin_operator_candidate`	HIGH
`0x6BFD00`	67	`name_for_overloaded_operator`	HIGH
`0x6BFE40`	48	`check_ambiguous_conversion`	HIGH
`0x6BFF70`	100	`compare_conversion_sequences`	HIGH
`0x6C4C00`	1,044	candidate evaluation	HIGH
`0x6C5C90`	386	candidate scoring/ranking	MEDIUM
`0x6C8B70`	418	argument conversion computation	MEDIUM
`0x6C92B0`	383	template argument deduction for overloads	MEDIUM
`0x6CBC40`	345	implicit conversion sequence comparison	MEDIUM
`0x6CD010`	752	built-in operator candidate generation	HIGH
`0x6CE010`	226	operator overload candidate setup	MEDIUM
`0x6CE6E0`	1,246	overload resolution driver ("THE MONSTER")	HIGH
`0x6D03D0`	170	`determine_selector_match_level` (6-param)	HIGH
`0x6D0790`	132	`determine_selector_match_level` (4-param)	HIGH
`0x6D0A80`	225	`selector_match_with_this_param`	HIGH
`0x6D0F50`	1,085	`try_conversion_function_match_full`	HIGH
`0x6D28C0`	252	`conversion_from_class_possible` (9-param)	HIGH
`0x6D2ED0`	293	`conversion_from_class_possible` (10-param)	HIGH
`0x6D35E0`	399	`user_defined_conversion_possible` / `diagnose_overload_ambiguity`	HIGH
`0x6D3DC0`	360	`conversion_possible`	HIGH
`0x6D4610`	49	`conversion_for_direct_reference_binding_possible`	HIGH
`0x6D47B0`	1,121	`prep_reference_initializer_operand`	HIGH
`0x6D61F0`	176	reference init helper	MEDIUM
`0x6D6650`	427	`user_convert_operand` / `set_up_for_conversion_function_call`	HIGH
`0x6D7040`	191	`diagnose_overload_ambiguity` (standalone)	HIGH
`0x6D7410`	239	`prep_conversion_operand`	HIGH
`0x6D79E0`	93	conversion operand wrapper	MEDIUM
`0x6D7C80`	2,119	`prep_list_initializer`	HIGH
`0x6DACA0`	154	list init parameter deduction helper	MEDIUM
`0x6DB010`	314	`deduce_auto_type`	HIGH
`0x6DB6E0`	236	`deduce_one_parameter`	HIGH
`0x6DBAC0`	175	`determine_copy_param_match`	HIGH
`0x6DBEA0`	625	`select_overloaded_copy_constructor`	HIGH
`0x6DCC00`	55	`arg_copy_can_be_done_via_constructor`	HIGH
`0x6DCD60`	166	`handle_elided_copy_constructor_no_guard` (variant 1)	HIGH
`0x6DD180`	169	`handle_elided_copy_constructor_no_guard` (variant 2)	HIGH
`0x6DD600`	492	`select_overloaded_assignment_operator`	HIGH
`0x6DE110`	31	`actualize_class_object_from_braced_init_list_for_bitwise_copy`	HIGH
`0x6DE1D0`	75	`full_adjust_class_object_type`	HIGH
`0x6DE320`	111	`set_up_for_constructor_call`	HIGH
`0x6DE5A0`	174	`temp_init_from_operand_full`	HIGH
`0x6DE9E0`	7	`temp_init_from_operand` (wrapper)	HIGH
`0x6DE9F0`	114	`find_top_temporary`	HIGH
`0x6DEBC0`	679	`determine_dynamic_init_for_class_init`	HIGH
`0x6DF8C0`	107	conversion with dynamic init wrapper	MEDIUM
`0x6DFBF0`	92	convert and determine dynamic init helper	MEDIUM
`0x6DFEC0`	692	`make_initializer_list_object` / `find_initializer_list_constructor`	HIGH
`0x6E0E50`	199	`adjust_operand_for_builtin_operator`	HIGH
`0x6E1250`	79	argument preparation helper	MEDIUM
`0x6E1430`	418	`convert_operand_into_temp`	HIGH
`0x6E1C40`	69	`prep_argument` (5-param)	HIGH
`0x6E1E40`	69	`prep_argument` (4-param)	HIGH
`0x6E2040`	2,120	`determine_function_viability`	HIGH
`0x6E4FA0`	633	`try_overloaded_function_match` (variant 1)	HIGH
`0x6E5B20`	367	`try_overloaded_function_match` (variant 2)	HIGH
`0x6E61D0`	121	overload match wrapper	MEDIUM
`0x6E6400`	1,483	`select_overloaded_function` (20 params)	HIGH
`0x6E8300`	285	`deduce_class_template_args` (CTAD)	HIGH
`0x6E8890`	199	type comparison for overload	MEDIUM
`0x6E8E20`	93	overload candidate evaluation helper	MEDIUM
`0x6E9080`	358	`select_overloaded_default_constructor`	HIGH
`0x6E9750`	281	argument list builder	MEDIUM
`0x6E9C60`	1,170	`warn_if_missing_sentinel`	HIGH
`0x6EAF90`	105	`node_for_arg_of_overloaded_function_call`	HIGH
`0x6EB1C0`	249	`adjust_overloaded_function_call_arguments`	HIGH
`0x6EB550`	392	`select_and_prepare_to_call_overloaded_function`	HIGH
`0x6EBFC0`	326	`conditional_operator_conversion_possible`	HIGH
`0x6EC650`	212	overload set iterator	MEDIUM
`0x6ECA20`	137	overload set builder	MEDIUM
`0x6ECCE0`	23	overload set init wrapper	LOW
`0x6ECD70`	160	util.h insert operation	MEDIUM
`0x6ECFB0`	193	util.h insert variant	MEDIUM
`0x6ED2A0`	812	`try_builtin_operands_match`	HIGH
`0x6EE340`	1,058	`try_conversions_for_builtin_operator`	HIGH
`0x6EF7A0`	2,174	`select_overloaded_operator` / `check_for_operator_overloading`	HIGH

Key Globals

Global	Usage
`dword_126EFB4`	Language mode (2 = C++)
`dword_126EF68`	Language standard version (201103/201703/202302)
`dword_126EFA4`	GNU extensions enabled
`dword_126EFAC`	Extended mode flag
`dword_126EFC8`	Debug trace enabled (controls overload trace output)
`dword_126EFCC`	Debug output level
`qword_106B988`	Overload debug output stream
`qword_106B990`	Overload debug output stream (alternate)
`qword_12C6B30`	Overload candidate list
`byte_126E349`	CUDA extensions flag
`byte_126E358`	Extension flag (likely `__CUDA_ARCH__`-related)
`dword_106BEA8`	Overload configuration flag
`dword_106BEC0`	Overload configuration flag
`dword_106C2A8`	Used by selector match level
`dword_106C2B8`	Operator-related flag
`dword_106C2BC`	Operator mode flag
`dword_106C104`	Operator configuration
`dword_106C124`	Operator configuration
`dword_106C140`	Operator configuration
`dword_106C16C`	Operator configuration
`dword_126C5C4`	Template nesting depth
`dword_126C5E4`	Scope stack depth
`qword_126C5E8`	Scope stack base

Template Engine

The template engine in cudafe++ is EDG 6.6's implementation of C++ template instantiation, argument deduction, partial specialization ordering, and the worklist-driven fixpoint loop that produces all needed template instantiations at translation-unit end. It lives primarily in templates.c (172 functions at 0x7530C0--0x794D30) with supporting cross-TU correspondence logic in trans_corresp.c (36 functions at 0x796E60--0x79F9E0).

Template instantiation in a C++ compiler is fundamentally a deferred operation: the compiler parses template definitions, records their bodies in a declaration cache, and only instantiates when a concrete use forces it. EDG implements this with two pending worklists -- one for class templates, one for function/variable templates -- that accumulate entries during parsing and are drained by a fixpoint loop at the end of each translation unit. This page documents the complete instantiation pipeline from "entity added to worklist" through "instantiated body emitted into IL."

Key Facts

Property	Value
Source file	`templates.c` (172 functions), `trans_corresp.c` (36 functions)
Address range	`0x7530C0`--`0x794D30` (templates), `0x796E60`--`0x79F9E0` (correspondence)
Fixpoint entry point	`sub_78A9D0` (`template_and_inline_entity_wrapup`), 136 lines
Worklist walker	`sub_78A7F0` (`do_any_needed_instantiations`), 72 lines
Should-instantiate gate	`sub_774620` (`should_be_instantiated`), 326 lines
Function instantiation	`sub_775E00` (`instantiate_template_function_full`), 839 lines
Class instantiation	`sub_777CE0` (`f_instantiate_template_class`), 516 lines
Variable instantiation	`sub_774C30` (`instantiate_template_variable`), 751 lines
Pending function/variable list	`qword_12C7740` (linked list head)
Pending class list	`qword_12C7758` (linked list head)
Function depth limit	`qword_12C76E0` (max 255 = `0xFF`)
Class depth limit	Per-type counter at type entry `+56`, via `qword_106BD10`
Pending counter	`sub_75D740` (increment) / `sub_75D7C0` (decrement)
SSE state save	4 xmmword registers for functions, 12 for classes
Instantiation modes	`"none"` / `"all"` / `"used"` / `"local"`
Fixpoint flag	`dword_12C771C` (set=1 when new work discovered, loop restarts)

Instantiation Entry Structure

Each pending instantiation is represented as a linked-list node. The function/variable worklist uses entries with the following layout:

Offset	Size	Field	Description
`+0`	8	`entity`	Primary symbol pointer
`+8`	8	`next`	Next entry in pending list
`+16`	8	`inst_info`	Instantiation info record (must be non-null)
`+24`	8	`master_instance`	Canonical template symbol
`+32`	8	`actual_decl`	Declaration in the instantiation context
`+40`	8	`cached_decl`	Cached declaration (for kind 7 / function-local)
`+64`	8	`body_flags`	Deferred/deleted function flags
`+72`	8	`pre_computed_result`	Result from prior instantiation attempt
`+80`	1	`flags`	Status bitfield (see below)

Flags Byte at `+80`

Bit	Mask	Name	Meaning
0	`0x01`	`instantiated`	Entity has been instantiated
1	`0x02`	`not_needed`	Entity was determined to not need instantiation
3	`0x08`	`explicit_instantiation`	From explicit `template` declaration
4	`0x10`	`suppress_auto`	Auto-instantiation suppressed (extern template)
5	`0x20`	`excluded`	Entity excluded from instantiation set
7	`0x80`	`can_be_instantiated_checked`	Pre-check already performed

Flags Byte at `+28` (on `inst_info` at `+16`)

Bit	Mask	Name	Meaning
0	`0x01`	`blocked`	Instantiation blocked (dependency cycle)
3	`0x08`	`debug_checked`	Already checked by debug tracing path
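
The documented offsets can be written down as a C struct and checked at compile time. This is a sketch for an LP64 target such as x86-64: the names inst_entry and unknown_48 are hypothetical, bytes +48 through +63 are left as opaque padding because they are not described above, and entry_is_finished is an illustrative helper rather than a function from the binary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Pending function/variable instantiation entry (layout from the table). */
typedef struct inst_entry {
    void              *entity;              /* +0  primary symbol       */
    struct inst_entry *next;                /* +8  next pending entry   */
    void              *inst_info;           /* +16 instantiation info   */
    void              *master_instance;     /* +24 canonical template   */
    void              *actual_decl;         /* +32                      */
    void              *cached_decl;         /* +40                      */
    uint8_t            unknown_48[16];      /* +48..+63 not documented  */
    uint64_t           body_flags;          /* +64 deferred/deleted     */
    void              *pre_computed_result; /* +72                      */
    uint8_t            flags;               /* +80 status bitfield      */
} inst_entry;

/* Masks for the flags byte at +80. */
enum {
    ENTRY_INSTANTIATED   = 0x01,
    ENTRY_NOT_NEEDED     = 0x02,
    ENTRY_EXPLICIT       = 0x08,
    ENTRY_SUPPRESS_AUTO  = 0x10,
    ENTRY_EXCLUDED       = 0x20,
    ENTRY_CAN_BE_CHECKED = 0x80
};

_Static_assert(offsetof(inst_entry, next) == 8, "next at +8");
_Static_assert(offsetof(inst_entry, inst_info) == 16, "inst_info at +16");
_Static_assert(offsetof(inst_entry, master_instance) == 24, "master at +24");
_Static_assert(offsetof(inst_entry, body_flags) == 64, "body_flags at +64");
_Static_assert(offsetof(inst_entry, pre_computed_result) == 72, "result at +72");
_Static_assert(offsetof(inst_entry, flags) == 80, "flags at +80");

/* Illustrative helper: entry is either instantiated or marked not needed. */
static bool entry_is_finished(const inst_entry *e) {
    assert(e != NULL);
    return (e->flags & (ENTRY_INSTANTIATED | ENTRY_NOT_NEEDED)) != 0;
}
```

The static assertions fail to compile if a field drifts from its documented offset, which is useful when refining the layout against the decompilation.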

The Fixpoint Loop: template_and_inline_entity_wrapup

sub_78A9D0 is the top-level entry point, called at the end of each translation unit from fe_wrapup. It implements a fixpoint loop that keeps running until no new instantiations are discovered.

template_and_inline_entity_wrapup (sub_78A9D0)
  |
  +-- Assert: qword_106BA18 == 0  (not nested in another TU)
  +-- Check: dword_126EFB4 == 2   (full compilation mode)
  |
  +-- FOR EACH translation_unit IN qword_106B9F0 linked list:
  |     |
  |     +-- sub_7A3EF0: set up TU context (switch active TU)
  |     |
  |     +-- PHASE 1: Process pending class instantiations
  |     |   Walk qword_12C7758 list:
  |     |     For each class entry:
  |     |       if sub_7A6B60 (is_dependent_type) == false
  |     |          AND sub_7A8A30 (is_class_or_struct_type) == true:
  |     |            f_instantiate_template_class(entry)
  |     |
  |     +-- PHASE 2: Enable instantiation mode
  |     |   dword_12C7730 = 1
  |     |
  |     +-- PHASE 3: Process pending function/variable instantiations
  |     |   do_any_needed_instantiations()
  |     |
  |     +-- sub_7A3F70: tear down TU context
  |
  +-- PHASE 4: Check for newly-needed instantiations
  |   if dword_12C771C != 0:
  |     dword_12C771C = 0
  |     LOOP BACK to top          <<<< FIXPOINT
  |
  +-- Check dword_12C7718 for additional pass

The fixpoint is necessary because instantiating one template may trigger references to other uninstantiated templates. For example, instantiating std::vector<Foo> may require instantiating std::allocator<Foo>, Foo's copy constructor, comparison operators, and so on. The loop re-runs until dword_12C771C (the "new instantiations needed" flag) remains zero through an entire pass.
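
The shape of this loop can be shown with a toy model. In the sketch below, each item may reveal one further item when it is processed, and a global flag plays the role of dword_12C771C. The item struct, process_item, and run_to_fixpoint are all hypothetical names, and the model ignores translation units and the class/function split.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy work item: processing it may reveal one further item (dep), or -1. */
typedef struct {
    int  dep;
    bool done;
    bool pending;
} item;

static bool new_work_flag;   /* stand-in for dword_12C771C */

static void process_item(item *items, int n, int i) {
    items[i].done = true;
    int d = items[i].dep;
    assert(d < n);
    if (d >= 0 && !items[d].pending) {
        items[d].pending = true;   /* newly discovered instantiation */
        new_work_flag = true;      /* forces another pass */
    }
}

/* Repeats full passes until one completes without discovering new work.
   Returns the number of passes performed. */
static int run_to_fixpoint(item *items, int n) {
    assert(items != NULL && n >= 0);
    int passes = 0;
    do {
        new_work_flag = false;
        for (int i = 0; i < n; i++)
            if (items[i].pending && !items[i].done)
                process_item(items, n, i);
        passes++;
    } while (new_work_flag);
    return passes;
}
```

Work discovered at a later index is picked up within the same pass, while work discovered at an earlier index waits for the next pass. Either way the loop ends only after a pass that discovers nothing new.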

Class-Before-Function Ordering

Classes are instantiated first (Phase 1) because function template instantiations may depend on complete class types. A function template body that accesses T::value_type requires T to be fully instantiated before the function body can be parsed. The two-phase design avoids forward-reference failures during function body replay.

Worklist Walker: do_any_needed_instantiations

sub_78A7F0 walks the pending function/variable instantiation list and processes each entry that passes the should_be_instantiated gate.

void do_any_needed_instantiations(void) {
    entry_t *v0 = qword_12C7740;          // pending list head
    while (v0) {
        if (v0->flags & 0x02) {            // not_needed (0x02): skip
            v0 = v0->next;
            continue;
        }
        inst_info_t *v2 = v0->inst_info;   // offset +16, must be non-null
        if (!(v2->flags & 0x08)) {         // not debug-checked
            if (dword_126EFC8)             // debug tracing enabled
                sub_756B40(v0);            // f_is_static_or_inline check
        }
        if (v2->flags & 0x01) {            // blocked
            v0 = v0->next;
            continue;
        }
        if (v0->flags >= 0) {             // bit 7 not set (not pre-checked)
            sub_7574B0(v0);               // f_entity_can_be_instantiated
        }
        if (should_be_instantiated(v0, 1)) {
            instantiate_template_function_full(v0, 1);
        }
        v0 = v0->next;                    // offset +8
    }
}

The walk is a simple linear traversal. New entries appended during instantiation will be visited on the current pass if they appear after the current position, or on the next fixpoint iteration otherwise.

Debug tracing output: when dword_126EFC8 is nonzero, the walker emits "do_any_needed_instantiations, checking: " followed by the entity name for each entry it considers.

Decision Gate: should_be_instantiated

sub_774620 is the critical decision function that determines whether a pending template entity actually requires instantiation. It implements a chain of rejection checks -- an entity must pass all of them to be instantiated.

int should_be_instantiated(entry_t *a1, int a2) {
    // 1. Blocked? (inst_info flags byte at +28, bit 0)
    if (a1->inst_info->flags_28 & 0x01)    return 0;

    // 2. Excluded from the instantiation set? (flags byte at +80, bit 5)
    if (a1->flags_80 & 0x20)    return 0;

    // 3. Already instantiated and not explicit?
    if ((a1->flags_80 & 0x01) && !(a1->flags_80 & 0x08))
        return 0;

    // 4. Has valid master instance?
    if (!a1->master_instance)   return 0;    // offset +24

    // 5. Entity kind filter (function-specific)
    int kind = get_entity_kind(a1->master_instance);
    switch (kind) {
        case 10: case 11:   // class member function
        case 17:            // lambda
        case 9:             // namespace-scope function
        case 7:             // variable template
            break;          // eligible
        default:
            return 0;       // not a function/variable entity
    }

    // 6. Implicit include needed?
    if (needs_implicit_include(a1))
        do_implicit_include_if_needed(a1);    // sub_754A70

    // 7. Depth limit check
    if (get_depth(a1) > *qword_106BD10)
        return 0;

    // 8. Depth warning (diagnostic 489/490)
    if (approaching_depth_limit(a1))
        emit_warning(489);  // or 490

    return 1;
}

The depth limit at qword_106BD10 is the configurable maximum instantiation nesting depth. When exceeded, the entity is silently skipped. When approaching the limit, warnings 489 and 490 are emitted to alert the developer.

Function Instantiation: instantiate_template_function_full

sub_775E00 (839 lines) is the workhorse for instantiating function templates. It saves global parser state, replays the cached function body through the parser with substituted template arguments, and restores state afterward.

SSE State Save/Restore

The function saves and restores four 128-bit globals (xmmword_106C380--xmmword_106C3B0; the xmmword_ prefix is IDA's naming for 16-byte memory objects, not CPU registers) that hold critical parser/scope state. These locations store packed parser context (scope indices, token positions, flags) that must be preserved across instantiation because the parser is stateful and is re-entered to replay the template body:

Save on entry:
    saved_state[0] = xmmword_106C380    // parser scope context
    saved_state[1] = xmmword_106C390    // token stream state
    saved_state[2] = xmmword_106C3A0    // scope nesting info
    saved_state[3] = xmmword_106C3B0    // auxiliary flags

Restore on exit (always, even on error):
    xmmword_106C380 = saved_state[0]
    xmmword_106C390 = saved_state[1]
    xmmword_106C3A0 = saved_state[2]
    xmmword_106C3B0 = saved_state[3]

The SSE involvement in the state save/restore is purely a compiler optimization -- the generated code uses movaps/movups instructions to copy 64 bytes of state as four 16-byte transfers rather than eight 8-byte ones. The data itself is ordinary integer/pointer fields that the compiler happens to copy in 128-bit chunks.
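
In source form, a save/restore of this kind is probably just a struct copy around the re-entrant call. The sketch below assumes a hypothetical 64-byte parser_state struct standing in for the four xmmword globals; the field layout and the names g_parser_state, instantiate_with_saved_state, and clobbering_body are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64 bytes of parser context: four 128-bit chunks in the binary. */
typedef struct {
    uint64_t words[8];
} parser_state;

_Static_assert(sizeof(parser_state) == 64, "four 16-byte chunks");

static parser_state g_parser_state;   /* stand-in for xmmword_106C380.. */

/* Runs body() with the global parser state saved and then restored,
   whatever body() returns. */
static int instantiate_with_saved_state(int (*body)(void)) {
    assert(body != NULL);
    parser_state saved = g_parser_state;  /* struct copy; may become 4 vector moves */
    int rc = body();                      /* may overwrite g_parser_state */
    g_parser_state = saved;               /* restore on success and on error */
    return rc;
}

/* Example body that trashes the global state and reports failure. */
static int clobbering_body(void) {
    memset(&g_parser_state, 0xAB, sizeof g_parser_state);
    return -1;
}
```

An optimizing x86-64 compiler will commonly lower the two struct assignments to 16-byte vector loads and stores, which would explain why the decompiler shows xmmword-sized accesses.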

Instantiation Flow

instantiate_template_function_full (sub_775E00)
  |
  +-- Save 4 SSE registers (parser state)
  |
  +-- Check pre-existing result: a1[9] (offset +72)
  |   If result exists:
  |     Load associated translation unit
  |     GOTO restore
  |
  +-- Fresh instantiation:
  |   |
  |   +-- Check implicit include needed
  |   +-- Resolve actual declaration via find_corresponding_instance
  |   +-- For class members (kind 20): handle member function templates
  |   |
  |   +-- Depth limit check:
  |   |   if qword_12C76E0 >= 0xFF (255):
  |   |     emit error, GOTO restore
  |   |   qword_12C76E0++
  |   |
  |   +-- Constraint satisfaction check:
  |   |   sub_7C2370 / sub_7C23B0 (C++20 requires-clause)
  |   |
  |   +-- Handle deferred/deleted functions (offset +64 flags)
  |   |
  |   +-- Set up substitution context: sub_709DE0
  |   |   Binds template parameters to concrete arguments
  |   |
  |   +-- Replay cached function body: sub_5A88B0
  |   |   Re-parses the saved token stream with substituted types
  |   |
  |   +-- Emit into IL: sub_676860
  |   |   Processes tokens until end marker (token kind 9)
  |   |
  |   +-- Update canonical entry: sub_79F1D0
  |   |   Links instantiation to cross-TU correspondence table
  |   |
  |   +-- qword_12C76E0--  (decrement depth)
  |
  +-- Restore 4 SSE registers

Depth Counter: qword_12C76E0

This global counter tracks the current nesting depth of function template instantiations. The hard limit is 255 (0xFF). Each fresh instantiation in instantiate_template_function_full increments the counter before replaying the function body and decrements it afterward. When the counter has already reached 255, the function emits an error and abandons that instantiation, jumping straight to the state restore.

The 255 limit is a safety valve against infinite recursive template instantiation (e.g., template<int N> void f() { f<N+1>(); }). Annex B of the C++ standard ([implimits]) recommends a minimum of 1,024 recursively nested template instantiations, but those quantities are guidelines rather than conformance requirements, and EDG uses 255 here. The 255 value is hard-coded in this path; the separate limit in qword_106BD10 is the one that may be configurable via a CLI flag.
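A minimal sketch of the check-then-increment pattern follows. It assumes a function whose body always requests one more nested instantiation. g_depth stands in for qword_12C76E0; g_max_seen and g_errors are hypothetical instrumentation that the binary does not have.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the documented hard limit on qword_12C76E0.
constexpr std::uint64_t kMaxFunctionDepth = 0xFF;

static std::uint64_t g_depth = 0;     // stand-in for qword_12C76E0
static std::uint64_t g_max_seen = 0;  // instrumentation only (hypothetical)
static int g_errors = 0;              // counts "depth exceeded" diagnostics

// Hypothetical instantiation that always triggers one more nested
// instantiation, like: template<int N> void f() { f<N + 1>(); }
static bool instantiate_nested() {
    if (g_depth >= kMaxFunctionDepth) {  // check before incrementing
        ++g_errors;                      // emit error, skip the body
        return false;
    }
    ++g_depth;
    if (g_depth > g_max_seen) g_max_seen = g_depth;
    bool ok = instantiate_nested();      // body replay triggers recursion
    --g_depth;                           // always undone on the way out
    return ok;
}
```

Because the check happens before the increment, the counter never exceeds 255, and it returns to zero once the failed chain unwinds.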

Class Instantiation: f_instantiate_template_class

sub_777CE0 (516 lines) instantiates class templates. It is structurally similar to the function instantiation path but saves significantly more state (12 SSE registers vs. 4) because class instantiation involves deeper parser state perturbation -- class bodies contain member declarations, nested types, and member function definitions.

SSE State Save/Restore (12 Registers)

Save on entry:
    saved[0]  = xmmword_106C380
    saved[1]  = xmmword_106C390
    saved[2]  = xmmword_106C3A0
    saved[3]  = xmmword_106C3B0
    saved[4]  = xmmword_106C3C0
    saved[5]  = xmmword_106C3D0
    saved[6]  = xmmword_106C3E0
    saved[7]  = xmmword_106C3F0
    saved[8]  = xmmword_106C400
    saved[9]  = xmmword_106C410
    saved[10] = xmmword_106C420
    saved[11] = xmmword_106C430

Restore on exit:
    (reverse order, same 12 registers)

The additional 8 registers (beyond the 4 used by function instantiation) capture the extended scope stack state, class body parsing context, base class list, member template processing state, and access specifier tracking that class body parsing requires.

Class Type Entry Layout

Class instantiation operates on a type entry with the following relevant fields:

Offset	Size	Field	Description
`+56`	8	`instantiation_depth_counter`	Per-type nesting counter, compared against the limit in `qword_106BD10`
`+72`	8	`containing_template_decl`	The template declaration this specialization came from
`+88`	8	`scope_name_info`	Scope and name resolution data
`+96`	8	`class_body_info`	Pointer to cached class body tokens
`+104`	8	`base_class_list`	Linked list of base class entries
`+120`	8	`namespace_lookup_info`	Namespace and extern template info
`+132`	1	`kind`	Type kind: 9=struct, 10=class, 11=union, 12=alias
`+144`	8	`canonical_type`	Pointer to canonical type entry (follow kind==12 chain)
`+152`	8	`parent_scope`	Enclosing scope entry
`+160`	4	`attribute_flags`	Attribute bits
`+176`	1	`template_flags`	bit 0 = primary template, bit 7 = inline
`+192`	8	`template_argument_list`	Substituted template argument list
`+200`	8	`member_template_list`	Linked list of member templates
`+296`	8	`associated_constraint`	C++20 constraint expression
`+298`	1	`extra_flags`	Additional status bits

Instantiation Flow

f_instantiate_template_class (sub_777CE0)
  |
  +-- Walk to canonical type entry: follow kind==12 chain at +144
  +-- Get class symbol: sub_72F640
  |
  +-- Check extern template constraints: sub_7C2370/sub_7C23B0
  |
  +-- Save 12 SSE registers
  |
  +-- Depth limit check:
  |   if type_entry[+56] >= *qword_106BD10:
  |     emit error, GOTO restore
  |   type_entry[+56]++
  |
  +-- Set up substitution context: sub_709DE0
  |
  +-- Handle base class list:
  |   sub_415BE0 (parse base-specifier-list)
  |   sub_4A5510 (validate base classes)
  |
  +-- Parse class body from declaration cache
  |   Replay saved tokens with substituted types
  |
  +-- Process member templates:
  |   Loop on member_template_list (offset +200)
  |   sub_7856E0 for each member template
  |
  +-- Perform deferred access checks:
  |   sub_744F60 (perform_deferred_access_checks_at_depth)
  |
  +-- type_entry[+56]--  (decrement depth)
  |
  +-- Restore 12 SSE registers

Per-Type Depth Limit

Unlike function instantiation (which uses a single global counter qword_12C76E0 with a hard limit of 255), class instantiation uses a per-type counter stored at offset +56 of the type entry. The limit is still read from qword_106BD10. This per-type design prevents one deeply-nested class hierarchy from consuming the entire depth budget -- each class type tracks its own instantiation nesting independently.
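The per-type counter and the alias-chain walk can be sketched as below. TypeEntry is a hypothetical, heavily reduced model: only the three fields the flow touches are present, and the offsets in the comments are taken from the layout table rather than reproduced in the struct layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified type entry with only the fields used below.
struct TypeEntry {
    std::uint64_t depth = 0;         // +56: per-type nesting counter
    std::uint8_t kind = 10;          // +132: 9=struct 10=class 11=union 12=alias
    TypeEntry* canonical = nullptr;  // +144: next entry in the alias chain
};

// Follow kind==12 links until a non-alias entry is reached.
static TypeEntry* walk_to_canonical(TypeEntry* t) {
    while (t->kind == 12 && t->canonical != nullptr) t = t->canonical;
    return t;
}

// Returns false, leaving the counter untouched, when this type's
// counter has already reached the configurable limit.
static bool enter_class_instantiation(TypeEntry& t, std::uint64_t limit) {
    if (t.depth >= limit) return false;
    ++t.depth;
    return true;
}

static void leave_class_instantiation(TypeEntry& t) {
    assert(t.depth > 0);
    --t.depth;
}
```

With a limit of 2, a third nested entry into the same type is refused while a different type can still be entered, which is the independence the per-type design provides.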

Variable Instantiation: instantiate_template_variable

sub_774C30 (751 lines) handles variable template instantiation. Variable templates (C++14) are less common than function or class templates but follow the same pattern: extract master instance, set up substitution, replay cached declaration.

Instantiation Flow

instantiate_template_variable (sub_774C30)
  |
  +-- Extract master instance: a1[3]=symbol, a1[4]=decl
  |
  +-- Look up declaration type:
  |   Switch on kind: 4/5, 6, 9/10, 19-22
  |
  +-- Find declaration cache: offset +216 or +264
  |
  +-- Depth limit check: qword_106BD10
  |
  +-- Set up substitution context: sub_709DE0
  |
  +-- Create declaration state:
  |   memset(v77, 0, 0x1D8)    // 472 bytes = declaration state
  |   v77[0]  = symbol
  |   v77[3]  = source position
  |   v77[6]  = type
  |   v77[15] = flags
  |   v77[19] = self-pointer
  |   v77[33] = additional flags
  |   v77[35] = initializer
  |   v77[36] = IL tree
  |
  +-- Perform type substitution: sub_764AE0 (scan_template_declaration)
  |
  +-- Handle constexpr/constinit evaluation
  |
  +-- Handle deferred access checks
  |
  +-- Update canonical entry
  |
  +-- For kind==7 (function-local variable templates):
      Special handling via sub_5C9600, copy attributes from prototype

The declaration state structure is 472 bytes (0x1D8), stack-allocated and zero-initialized. This is the same structure used by the main declaration parser -- variable template instantiation reuses the declaration parsing infrastructure with pre-populated fields.
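The zero-then-populate setup can be sketched as follows, assuming the v77[i] indices in the flow refer to 8-byte slots (consistent with a1[9] being offset +72 elsewhere on this page). DeclState, DeclSlot and init_decl_state are hypothetical names; the meaning of the unnamed slots is not known.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// 0x1D8 bytes viewed as 8-byte slots: 472 / 8 = 59 slots.
constexpr std::size_t kDeclStateBytes = 0x1D8;
constexpr std::size_t kDeclStateSlots = kDeclStateBytes / sizeof(std::uint64_t);

struct DeclState {
    std::uint64_t slot[kDeclStateSlots];
};
static_assert(sizeof(DeclState) == 472, "declaration state is 472 bytes");

// Slot indices named after the fields listed in the flow.
enum DeclSlot : std::size_t {
    kSymbol = 0, kSourcePos = 3, kType = 6, kFlags = 15,
    kSelfPtr = 19, kExtraFlags = 33, kInitializer = 35, kIlTree = 36
};

// Zero the whole block, then pre-populate the fields that are known
// before the cached declaration is replayed.
static void init_decl_state(DeclState& s, std::uint64_t symbol,
                            std::uint64_t source_pos) {
    std::memset(&s, 0, sizeof s);
    s.slot[kSymbol] = symbol;
    s.slot[kSourcePos] = source_pos;
    s.slot[kSelfPtr] = reinterpret_cast<std::uintptr_t>(&s);
}
```

Every slot that is not explicitly populated stays zero, so the declaration parser sees the same "empty" defaults it would see for an ordinary declaration.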

Pending Counter Management

Two small functions manage a pending-instantiation counter that tracks how many instantiations are in flight. This counter is used for progress reporting and infinite-loop detection.

increment_pending_instantiations (sub_75D740)

Called when a new template entity is added to the pending worklist. Increments the counter and checks against a maximum threshold via too_many_pending_instantiations (sub_75D6A0).

decrement_pending_instantiations (sub_75D7C0)

Called when an instantiation completes (successfully or by rejection). Decrements the counter.

The counter itself is not directly visible in the sweep report but is inferred from the call pattern: the increment function is called from code paths that add entries to qword_12C7740 or qword_12C7758, and the decrement is called at the end of each instantiate_template_function_full / f_instantiate_template_class / instantiate_template_variable invocation.
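The inferred call pattern amounts to a guarded counter. The sketch below is an assumption-heavy illustration: the real threshold is not documented here, so kMaxPending is a made-up value, and the function names only echo the identified ones.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical threshold; the real maximum is not documented above.
constexpr std::uint64_t kMaxPending = 4;

static std::uint64_t g_pending = 0;  // the inferred pending counter

static bool too_many_pending() { return g_pending > kMaxPending; }

// Called when an entity is appended to a pending worklist.
// Returns false once the threshold has been crossed.
static bool increment_pending() {
    ++g_pending;
    return !too_many_pending();
}

// Called when an instantiation finishes, whether it succeeded or not.
static void decrement_pending() {
    assert(g_pending > 0);
    --g_pending;
}
```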

Instantiation Modes

The template engine supports four instantiation modes, controlled by CLI flags that set dword_12C7730 and related configuration globals:

Mode	`dword_12C7730`	Behavior
`"none"`	0	No automatic instantiation. Only explicit `template` declarations trigger instantiation. Used for precompiled headers.
`"used"`	1	Instantiate templates that are actually used (ODR-referenced). This is the default mode. The `should_be_instantiated` function checks usage flags.
`"all"`	2	Instantiate all templates that have been declared, whether or not they are used. Used for template library precompilation.
`"local"`	3	Instantiate only templates with internal linkage. Extern templates are skipped. Used for split compilation models.

The mode transitions during compilation:

During parsing: dword_12C7730 = 0 (collection only, no instantiation)
At wrapup entry: dword_12C7730 = 1 (enable "used" mode)
During fixpoint: mode may escalate to "all" if dword_12C7718 is set

The precompile mode (dword_106C094 == 3) skips the fixpoint loop entirely and records template entities for later instantiation in the consuming translation unit.
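One plausible reading of the four table rows as a predicate is sketched below. The enum values match the dword_12C7730 column; EntityInfo and wants_instantiation are hypothetical, and treating explicit requests as honored in every mode is an assumption, not something recovered from the binary.

```cpp
#include <cassert>

// Values match the dword_12C7730 column of the table.
enum class InstMode : int { None = 0, Used = 1, All = 2, Local = 3 };

// Hypothetical per-entity facts the decision depends on.
struct EntityInfo {
    bool explicitly_requested;  // named in an explicit instantiation
    bool used;                  // referenced in a way that needs a definition
    bool internal_linkage;      // static or otherwise TU-local
};

// Assumed decision rule derived from the mode descriptions.
static bool wants_instantiation(InstMode mode, const EntityInfo& e) {
    if (e.explicitly_requested) return true;
    switch (mode) {
        case InstMode::None:  return false;
        case InstMode::Used:  return e.used;
        case InstMode::All:   return true;
        case InstMode::Local: return e.internal_linkage;
    }
    return false;
}
```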

Substitution Engine: copy_type_with_substitution

sub_76D860 (1,229 lines) is the core type substitution function. It takes a type node and a set of template-parameter-to-argument bindings, and produces a new type with all template parameters replaced by their concrete values.

copy_type_with_substitution(type, bindings) -> type
  |
  +-- Dispatch on type->kind:
  |
  +-- Simple types (int, float, void): return type unchanged
  |
  +-- Pointer type (kind 6):
  |   new_pointee = copy_type_with_substitution(type->pointee, bindings)
  |   return make_pointer_type(new_pointee)
  |
  +-- Reference types (kind 7, 19):
  |   new_referent = copy_type_with_substitution(type->referent, bindings)
  |   return make_reference_type(new_referent, type->is_rvalue)
  |
  +-- Array type (kind 8):
  |   new_element = copy_type_with_substitution(type->element, bindings)
  |   new_size = substitute_expression(type->size_expr, bindings)
  |   return make_array_type(new_element, new_size)
  |
  +-- Function type (kind 14):
  |   new_return = copy_type_with_substitution(type->return_type, bindings)
  |   new_params = [substitute each parameter type]
  |   return make_function_type(new_return, new_params, type->cv_quals)
  |
  +-- Template parameter type:
  |   Look up parameter in bindings
  |   return concrete argument type
  |
  +-- Template-id type:
  |   new_args = copy_template_arg_list_with_substitution(type->args, bindings)
  |   return find_or_instantiate_template_class(type->template, new_args)
  |
  +-- Pack expansion (kind 16, 17):
  |   Expand pack with all elements from the binding
  |   return list of substituted types
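The recursive shape of the dispatch above can be shown on a miniature type representation. This is a sketch under stated assumptions: Type, Bindings, make, substitute and show are hypothetical, only simple, pointer, reference and template-parameter types are modelled, and the binary's kind numbers are not reproduced.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical miniature type representation.
struct Type;
using TypePtr = std::shared_ptr<const Type>;

struct Type {
    enum Kind { Simple, Pointer, Reference, Param } kind;
    std::string name;  // Simple: type name, Param: parameter name
    TypePtr inner;     // Pointer/Reference: pointee or referent
};

using Bindings = std::map<std::string, TypePtr>;

static TypePtr make(Type::Kind k, std::string name, TypePtr inner = nullptr) {
    return std::make_shared<Type>(Type{k, std::move(name), std::move(inner)});
}

// Recursively rebuild the type, replacing bound template parameters.
static TypePtr substitute(const TypePtr& t, const Bindings& b) {
    switch (t->kind) {
        case Type::Simple:
            return t;                               // nothing to replace
        case Type::Param: {
            auto it = b.find(t->name);
            return it != b.end() ? it->second : t;  // unbound: keep as-is
        }
        case Type::Pointer:
        case Type::Reference:
            return make(t->kind, "", substitute(t->inner, b));
    }
    return t;
}

// Render a type as text for inspection.
static std::string show(const TypePtr& t) {
    switch (t->kind) {
        case Type::Simple:
        case Type::Param:     return t->name;
        case Type::Pointer:   return show(t->inner) + "*";
        case Type::Reference: return show(t->inner) + "&";
    }
    return "?";
}
```

Substituting T = int into the pattern T*& rebuilds the reference and pointer nodes around the bound argument, while a simple type is returned unchanged without copying.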

Supporting substitution functions:

Address	Identity	Description
`sub_77BA10`	`copy_parent_type_with_substitution`	Substitutes in enclosing class context
`sub_77BFE0`	`copy_template_with_substitution`	Substitutes within template declarations
`sub_77FDE0`	`copy_template_arg_list_with_substitution`	Substitutes within argument lists (612 lines)
`sub_780B80`	`copy_template_class_reference_with_substitution`	Handles class template references
`sub_78B600`	`copy_template_variable_with_substitution`	Handles variable template references
`sub_793DF0`	`substitute_template_param_list`	Walks parameter list with substitution (741 lines)

Template Argument Deduction

The deduction subsystem determines template argument values from function call arguments. Key functions:

Address	Identity	Lines	Description
`sub_77CEE0`	`matches_template_type`	788	Core deduction: matches actual type against template parameter pattern. Implements [temp.deduct].
`sub_77CA90`	`matches_template_type_for_class_type`	--	Class-specific variant with additional base class traversal
`sub_77C720`	`matches_template_arg_list`	--	Matches a sequence of template arguments
`sub_77C510`	`matches_template_template_param`	--	Matches template template parameters
`sub_77C240`	`template_template_arg_matches_param`	--	Template template argument compatibility check
`sub_77E9F0`	`matches_template_constant`	--	Matches non-type template arguments (constant expressions)
`sub_77E310`	`parameter_is_more_specialized`	330	Partial ordering rule: determines which parameter is more specialized
`sub_780FC0`	`all_templ_params_have_values`	332	Post-deduction check: verifies all parameters received values
`sub_781660`	`wrapup_template_argument_deduction`	--	Finalizes deduction, applies default arguments
`sub_781C40`	`matches_partial_specialization`	316	Tests actual arguments against a partial specialization
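The core idea behind matches_template_type, structural matching of an actual type against a parameter pattern while recording bindings, can be sketched as follows. Ty, Deduced, ty, same and deduce are hypothetical names; the sketch covers only simple, pointer and parameter types and ignores cv-qualifiers, conversions and every other rule in [temp.deduct].

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Ty;
using TyPtr = std::shared_ptr<const Ty>;
struct Ty {
    enum Kind { Simple, Pointer, Param } kind;
    std::string name;  // Simple: type name, Param: parameter name
    TyPtr inner;       // Pointer: pointee
};
using Deduced = std::map<std::string, TyPtr>;

static TyPtr ty(Ty::Kind k, std::string n, TyPtr in = nullptr) {
    return std::make_shared<Ty>(Ty{k, std::move(n), std::move(in)});
}

// Structural equality of two types.
static bool same(const TyPtr& a, const TyPtr& b) {
    if (a->kind != b->kind || a->name != b->name) return false;
    if (!a->inner || !b->inner) return !a->inner && !b->inner;
    return same(a->inner, b->inner);
}

// Match `actual` against `pattern`, recording parameter values in `out`.
static bool deduce(const TyPtr& pattern, const TyPtr& actual, Deduced& out) {
    if (pattern->kind == Ty::Param) {
        auto it = out.find(pattern->name);
        if (it == out.end()) { out[pattern->name] = actual; return true; }
        return same(it->second, actual);  // repeated use must agree
    }
    if (pattern->kind != actual->kind) return false;
    if (pattern->kind == Ty::Simple) return pattern->name == actual->name;
    return deduce(pattern->inner, actual->inner, out);  // Pointer case
}
```

Matching the pattern T* against int* binds T to int; matching it against plain int fails, and a second use of an already-bound parameter succeeds only if the new argument is structurally identical.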

Partial Specialization Ordering

When multiple partial specializations match, the engine must select the "most specialized" one. This implements C++ [temp.class.order] and [temp.func.order]:

check_partial_specializations (sub_774470)
  |
  +-- For each partial specialization of the template:
  |   matches_partial_specialization(actual_args, partial_spec)
  |   If matches: add to candidates list
  |     add_to_partial_order_candidates_list (sub_773E40)
  |
  +-- If multiple candidates:
  |   partial_ord (sub_75D2A0)
  |     Pairwise comparison using parameter_is_more_specialized
  |     Select most specialized, or emit ambiguity error
  |
  +-- Return winning specialization (or primary template if no match)

For function templates, ordering uses compare_function_templates (sub_7730D0, 665 lines) which implements the more complex function template partial ordering rules.
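The language-level behavior that this selection logic has to reproduce can be seen in a small standard C++ example. Pick is a hypothetical name; the example shows only the observable outcome of partial ordering, not how the binary computes it.

```cpp
#include <cassert>

// Primary template plus two partial specializations of increasing
// specificity.
template <typename T, typename U>
struct Pick { static constexpr int which = 0; };          // primary

template <typename T, typename U>
struct Pick<T*, U> { static constexpr int which = 1; };   // first arg is a pointer

template <typename T, typename U>
struct Pick<T*, U*> { static constexpr int which = 2; };  // both are pointers

static_assert(Pick<int, int>::which == 0, "no partial specialization matches");
static_assert(Pick<int*, int>::which == 1, "only <T*, U> matches");
static_assert(Pick<int*, int*>::which == 2, "<T*, U*> is more specialized");
```

For Pick<int*, int*> both partial specializations are candidates, and the pairwise comparison selects <T*, U*> because every argument list it accepts is also accepted by <T*, U> but not the reverse.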

Template Declaration Infrastructure

The declaration side handles parsing template<...> prefixes and setting up template entities:

Address	Identity	Lines	Description
`sub_786260`	`template_declaration`	2,487	Main entry point for all template declarations. Handles primary, explicit specialization, partial specialization, and friend templates.
`sub_782690`	`class_template_declaration`	2,280	Class-specific template declaration processing
`sub_78D600`	`template_or_specialization_declaration_full`	2,034	Unified handler routing to class, function, or variable paths
`sub_764AE0`	`scan_template_declaration`	412	Parses the `template<...>` prefix
`sub_779D80`	`scan_template_param_list`	626	Parses template parameter lists
`sub_77AAB0`	`scan_lambda_template_param_list`	--	C++20 lambda template parameter parsing
`sub_770790`	`make_template_function`	914	Creates function template entity
`sub_753870`	`make_template_variable`	--	Creates variable template entity
`sub_756310`	`set_up_template_decl`	--	Template declaration state initialization

Explicit Instantiation

Explicit instantiation (template class Foo<int>; or template void f<int>();) is handled by a dedicated path:

explicit_instantiation (sub_791C70, 105 lines)
  |
  +-- Parse 'extern' flag: a2 & 1 = is_extern_instantiation
  +-- Save compilation mode (dword_106C094)
  |
  +-- Determine instantiation kind:
  |   extern:              kind = 16
  |   non-extern, no inline: kind = 15
  |   non-extern, inline:   kind = 18
  |
  +-- For precompiled header mode: mark scope entry
  |
  +-- instantiation_directive (sub_7908E0, 626 lines):
  |   |
  |   +-- Initialize target scope entry (memset 472 bytes)
  |   +-- Check CUDA device-code instantiation pragmas
  |   +-- Parse declaration:
  |   |   For classes:    sub_789EF0 (update_instantiation_flags)
  |   |   For functions:  sub_78D0E0 (find_matching_template_instance)
  |   |                   then sub_7897C0 (update_instantiation_flags)
  |   |   For variables:  similar path
  |   +-- Handle instantiation attributes (dllexport/visibility)
  |   +-- Clean up parser state
  |
  +-- Handle deferred access checks: sub_744F60
  +-- Restore compilation mode

update_instantiation_flags (sub_7897C0, 351 lines) sets the appropriate instantiation-required bits on the template entity after matching an explicit instantiation directive. It checks compilation mode, CUDA device/host targeting, and adjusts flags accordingly.
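The kind selection at the top of explicit_instantiation reduces to a two-input decision. The sketch below uses the kind codes from the flow and the documented convention that bit 0 of a2 marks an extern instantiation; the is_inline parameter and all names are hypothetical, since the source of the inline flag is not identified here.

```cpp
#include <cassert>

// Kind codes as listed in the flow above.
enum InstantiationKind : int {
    kExplicit = 15,        // non-extern, no inline
    kExtern = 16,          // extern template
    kExplicitInline = 18   // non-extern, inline
};

// `flags` models a2 (bit 0 = extern); `is_inline` is an assumed input.
static InstantiationKind select_kind(unsigned flags, bool is_inline) {
    const bool is_extern = (flags & 1u) != 0;
    if (is_extern) return kExtern;
    return is_inline ? kExplicitInline : kExplicit;
}
```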

CUDA Integration Points

The template engine interacts with CUDA through several mechanisms:

Device/host filtering in should_be_instantiated: The function checks CUDA execution space attributes via sub_756840 (sym_can_be_instantiated) to determine if a template entity should be instantiated for the current compilation target (device or host).
Instantiation directives: CUDA-specific #pragma directives can trigger or suppress template instantiation for device code. The instantiation_directive function checks for these at dword_126EFA8 (GPU mode) and dword_126EFA4 (device-code flag).
Namespace injection: CUDA-specific symbols are entered into cuda::std via enter_symbol_for_namespace_cuda_std (sub_749330) and std::meta via enter_symbol_for_namespace_std_meta (sub_7493C0, C++26 reflection support).
Target dialect selection: select_cp_gen_be_target_dialect (sub_752A80) determines whether template instantiations are generated for the device path (the IL stream that cicc later lowers to PTX) or for the host path, based on dword_126EFA8 (GPU mode) and dword_126EFA4 (device vs. host).

Cross-TU Correspondence

When compiling with RDC mode or multiple translation units, the same template may be instantiated in different TUs. The trans_corresp.c file (0x796E60--0x79F9E0) handles deduplication and canonical entry selection:

Address	Identity	Description
`sub_796E60`	`canonical_ranking`	Determines which of two TU entries is canonical
`sub_7975D0`	`may_have_correspondence`	Checks if cross-TU correspondence is possible
`sub_7999C0`	`find_template_correspondence`	Finds corresponding template across TUs (601 lines)
`sub_79A5A0`	`determine_correspondence`	Establishes correspondence relationship
`sub_79B8D0`	`mark_canonical_instantiation`	Marks the canonical version of an instantiation
`sub_79C400`	`f_set_trans_unit_corresp`	Sets up cross-TU correspondence (511 lines)
`sub_79D080`	`establish_instantiation_correspondences`	Links instantiation results across TUs
`sub_79EE80`--`sub_79F1D0`	`update_canonical_entry` (3 variants)	Updates canonical representative after instantiation
`sub_79F9E0`	`record_instantiation`	Records an instantiation for cross-TU tracking

The correspondence system ensures that when std::vector<int> is instantiated in TU1 and TU2, both produce structurally equivalent IL, and only one canonical version is emitted to the output.

Global State

Address	Name	Description
`qword_12C7740`	`pending_instantiation_list`	Head of pending function/variable instantiation linked list
`qword_12C7758`	`pending_class_instantiation_list`	Head of pending class instantiation linked list
`dword_12C7730`	`instantiation_mode_active`	Current instantiation mode (0=none, 1=used, 2=all, 3=local)
`dword_12C771C`	`new_instantiations_needed`	Fixpoint flag: set to 1 when new work discovered
`dword_12C7718`	`additional_pass_needed`	Secondary fixpoint flag for extra passes
`qword_12C76E0`	`instantiation_depth_counter`	Current function template nesting depth (max 0xFF)
`qword_106BD10`	`max_instantiation_depth_limit`	Configurable depth limit (read by class and function paths)
`xmmword_106C380`--`106C3B0`	`parser_state_save_area`	4 SSE registers saved by function instantiation
`xmmword_106C380`--`106C430`	`parser_state_save_area_full`	12 SSE registers saved by class instantiation
`dword_106C094`	`compilation_mode`	0=none, 1=normal, 3=precompile
`dword_126EFB4`	`compilation_phase`	2=full compilation (required for fixpoint loop)
`qword_106B9F0`	`translation_unit_list_head`	Linked list of TUs for per-TU fixpoint iteration
`qword_106BA18`	`tu_stack_top`	Must be 0 (not nested) when fixpoint starts
`dword_126EFC8`	`debug_tracing_enabled`	Nonzero enables trace output for instantiation
`dword_126EFA8`	`gpu_mode`	Nonzero when compiling CUDA code
`dword_126EFA4`	`device_code`	1=device-side compilation, 0=host stubs
`word_126DD58`	`current_token_kind`	Parser state: current token (9=END)
`qword_126DD38`	`source_position`	Parser state: current source location
`qword_126C5E8`	`scope_table_base`	Array of 784-byte scope entries
`dword_126C5E4`	`current_scope_index`	Index into scope table

Diagnostic Strings

String	Source	Condition
`"do_any_needed_instantiations, checking: "`	`sub_78A7F0`	`dword_126EFC8 != 0` (debug tracing)
`"template_and_inline_entity_wrapup"`	`sub_78A9D0`	Assert string
`"should_be_instantiated"`	`sub_774620`	Assert string at `templates.c:36894`
`"instantiate_template_function_full"`	`sub_775E00`	Assert string at `templates.c:7359`
`"f_instantiate_template_class"`	`sub_777CE0`	Assert string at `templates.c:5277`
`"instantiate_template_variable"`	`sub_774C30`	Assert string at `templates.c:7814`
`"check_template_nesting_depth"`	`sub_7533E0`	Assert string
`"instantiation_directive"`	`sub_7908E0`	Assert string at `templates.c:41682`
`"explicit_instantiation"`	`sub_791C70`	Assert string at `templates.c:42231`
`"template_arg_is_dependent"`	`sub_7530C0`	Assert string at `templates.c:8897`

Function Map

Address	Identity	Confidence	Lines	EDG Source
`sub_78A9D0`	`template_and_inline_entity_wrapup`	100%	136	`templates.c:40084`
`sub_78A7F0`	`do_any_needed_instantiations`	100%	72	`templates.c:39760`
`sub_774620`	`should_be_instantiated`	95%	326	`templates.c:36894`
`sub_775E00`	`instantiate_template_function_full`	95%	839	`templates.c:7359`
`sub_777CE0`	`f_instantiate_template_class`	95%	516	`templates.c:5277`
`sub_774C30`	`instantiate_template_variable`	95%	751	`templates.c:7814`
`sub_75D740`	`increment_pending_instantiations`	95%	--	`templates.c`
`sub_75D7C0`	`decrement_pending_instantiations`	95%	--	`templates.c`
`sub_75D6A0`	`too_many_pending_instantiations`	95%	--	`templates.c`
`sub_7574B0`	`f_entity_can_be_instantiated`	95%	--	`templates.c:37066`
`sub_756B40`	`f_is_static_or_inline_template_entity`	95%	--	`templates.c`
`sub_756840`	`sym_can_be_instantiated`	95%	--	`templates.c`
`sub_754A70`	`do_implicit_include_if_needed`	95%	--	`templates.c`
`sub_76D860`	`copy_type_with_substitution`	95%	1229	`templates.c`
`sub_77FDE0`	`copy_template_arg_list_with_substitution`	95%	612	`templates.c`
`sub_793DF0`	`substitute_template_param_list`	95%	741	`templates.c`
`sub_77CEE0`	`matches_template_type`	95%	788	`templates.c`
`sub_780FC0`	`all_templ_params_have_values`	95%	332	`templates.c`
`sub_781C40`	`matches_partial_specialization`	95%	316	`templates.c`
`sub_774470`	`check_partial_specializations`	95%	58	`templates.c`
`sub_773E40`	`add_to_partial_order_candidates_list`	95%	306	`templates.c`
`sub_75D2A0`	`partial_ord`	95%	--	`templates.c`
`sub_7730D0`	`compare_function_templates`	95%	665	`templates.c`
`sub_786260`	`template_declaration`	95%	2487	`templates.c`
`sub_782690`	`class_template_declaration`	95%	2280	`templates.c`
`sub_78D600`	`template_or_specialization_declaration_full`	95%	2034	`templates.c`
`sub_764AE0`	`scan_template_declaration`	95%	412	`templates.c`
`sub_779D80`	`scan_template_param_list`	95%	626	`templates.c`
`sub_770790`	`make_template_function`	95%	914	`templates.c`
`sub_771D50`	`find_template_function`	95%	470	`templates.c`
`sub_7621A0`	`find_template_class`	95%	519	`templates.c`
`sub_78AC50`	`find_template_variable`	95%	528	`templates.c`
`sub_7908E0`	`instantiation_directive`	95%	626	`templates.c:41682`
`sub_791C70`	`explicit_instantiation`	95%	105	`templates.c:42231`
`sub_7897C0`	`update_instantiation_flags`	90%	351	`templates.c`
`sub_7770E0`	`update_instantiation_required_flag`	95%	434	`templates.c`
`sub_78D0E0`	`find_matching_template_instance`	95%	--	`templates.c`
`sub_709DE0`	`set_up_substitution_context`	--	--	(likely `templates.c`)
`sub_744F60`	`perform_deferred_access_checks_at_depth`	95%	--	`symbol_tbl.c`
`sub_7530C0`	`template_arg_is_dependent`	95%	--	`templates.c:8897`
`sub_762C80`	`template_arg_list_is_dependent_full`	95%	839	`templates.c`
`sub_75EF10`	`equiv_template_arg_lists`	95%	493	`templates.c`
`sub_7931B0`	`make_template_implicit_deduction_guide`	95%	433	`templates.c`
`sub_794D30`	`ctad`	95%	990	`templates.c`
`sub_796E60`	`canonical_ranking`	95%	--	`trans_corresp.c`
`sub_7999C0`	`find_template_correspondence`	95%	601	`trans_corresp.c`
`sub_79C400`	`f_set_trans_unit_corresp`	95%	511	`trans_corresp.c`
`sub_79F1D0`	`update_canonical_entry`	95%	--	`trans_corresp.c`
`sub_79F9E0`	`record_instantiation`	95%	--	`trans_corresp.c`

Cross-References

EDG 6.6 Overview -- Architecture and NVIDIA modification layers
CUDA Template Restrictions -- CUDA-specific template constraints
Type System -- Type kinds and class layout referenced during substitution
Keep-in-IL -- Device code selection interacts with instantiation results
Pipeline Overview -- Where template wrapup fits in the compilation pipeline
Template Instance Record -- Data structure for instantiation entries
Scope Entry -- 784-byte scope structure used during instantiation
Diagnostics Overview -- Warning 489/490 for depth limits

CUDA Template Restrictions

CUDA's split-compilation model imposes restrictions on C++ templates that have no counterpart in standard C++. When a __global__ function template is instantiated, cudafe++ generates a host-side stub whose mangled name must exactly match what the device compiler (cicc) independently produces. This agreement is only possible if both compilers can derive the complete mangled name from the template's signature and arguments. Types that are invisible to one side -- host-local types, unnamed types, private class members, certain lambda closures -- break this invariant and are therefore rejected. The same constraints apply to variable templates used in device contexts, and additional structural restrictions prevent variadic __global__ templates from producing ambiguous mangled names. This page documents all 24 CUDA-specific template restriction errors across 8 categories, the implementation functions that enforce them, and the __NV_name_expr mechanism that relies on these guarantees.

Key Facts

Property	Value
Source file	`cp_gen_be.c` (EDG 6.6 backend code generator)
Access checker	`sub_469F80` (`template_arg_is_accessible`, 144 lines)
Cache engine	`sub_469480` (`cache_access_result_for`, 670 lines)
Arg list walker	`sub_46A230` (walks template arg lists, 182 lines)
Pre-unnamed check	`sub_46A5B0` (`arg_before_unnamed_template_param_arg`, 396 lines)
Scope resolver	`sub_469F30` (resolves scope via hash lookup, 23 lines)
Callback for scope walk	`sub_46ACC0` (passed as callback into `sub_61FE60`)
Cache hash table	`xmmword_F05720` (384 KB, 16,382-entry table, 24 bytes per slot)
Entity lookup table	`unk_FE5700` (512 KB, used by `sub_469F30`)
Free list head	`qword_F05708` (recycled cache entries)
Total restriction errors	24 across 8 categories

Why These Restrictions Exist

The CUDA compilation model splits a single .cu source file into two compilation paths:

Host path: cudafe++ generates a .int.c file containing host stubs. The host compiler (gcc, clang, MSVC) compiles these stubs and produces a host object file. Each __global__ function template instantiation becomes a __wrapper__device_stub_ function.
Device path: The same source is compiled by cicc into PTX. The device compiler independently instantiates the same templates and produces the device-side function bodies.

When the program starts, the registration code emitted with the host stubs passes each kernel's mangled name to the CUDA runtime, which uses that name to match host stubs to device functions. Both compilers must therefore produce identical mangled names for every __global__ template instantiation. This is only possible when all template arguments are types that both compilers can see, name, and mangle identically. A host-only local type, for example, exists only in the host compiler's scope -- cicc cannot see it and cannot produce a matching mangled name. The restrictions documented below enforce this invariant.

The same logic applies to __device__/__constant__ variable templates, which must also match across the host/device boundary for registration and symbol lookup.

Category A: __global__ Declaration Restrictions (8 errors)

These errors prevent __global__ function templates from using C++ features that would prevent host stub generation or violate kernel ABI constraints.

Tag	Message	Reason
`global_function_constexpr`	`A __global__ function or function template cannot be marked constexpr`	Kernels are not evaluated at compile time; `constexpr` is meaningless for device launch.
`global_function_consteval`	`A __global__ function or function template cannot be marked consteval`	`consteval` requires compile-time evaluation, incompatible with runtime kernel launch.
`global_class_decl`	`A __global__ function or function template cannot be a member function`	Kernels have no `this` pointer; the launch ABI has no slot for an object reference.
`global_friend_definition`	`A __global__ function or function template cannot be defined in a friend declaration`	Friend definitions have limited visibility, conflicting with the requirement for a globally-linkable stub.
`global_exception_spec`	`An exception specification is not allowed for a __global__ function or function template`	GPU hardware has no exception unwinding mechanism.
`global_function_in_unnamed_inline_ns`	`A __global__ function or function template cannot be declared within an inline unnamed namespace`	Unnamed namespaces produce TU-local linkage, but kernel stubs must have external linkage for runtime registration.
`global_function_with_initializer_list`	`a __global__ function or function template cannot have a parameter with type std::initializer_list`	`std::initializer_list` holds a pointer to backing storage that cannot be transparently transferred to device memory.
`global_va_list_type`	`A __global__ function or function template cannot have a parameter with va_list type`	Variadic argument lists require stack-based access that does not exist on GPU hardware.

These checks occur during attribute application in apply_nv_global_attr (sub_40E1F0 / sub_40E7F0) and in the post-validation pass nv_validate_cuda_attributes (sub_6BC890). The checks apply equally to non-template __global__ functions and __global__ function templates.

Category B: Variadic __global__ Template Constraints (2 errors)

Standard C++ allows a function template to declare multiple parameter packs and does not require a pack to be the last template parameter, as long as the parameters that follow can be deduced or have default arguments. CUDA restricts this for __global__ templates because the host stub ABI requires unambiguous argument layout.

Tag	Message
`global_function_pack_not_last`	`Pack template parameter must be the last template parameter for a variadic __global__ function template`
`global_function_multiple_packs`	`Multiple pack parameters are not allowed for a variadic __global__ function template`

Rationale

The kernel launch wrapper (<<<grid, block>>>) must marshal each argument into a contiguous parameter buffer. For a variadic template like template<typename... Ts> __global__ void kernel(Ts... args), the compiler generates the buffer layout at instantiation time. If the pack is not last, or if multiple packs are present, the positional mapping between template parameters and launch arguments becomes ambiguous -- the compiler cannot determine which arguments belong to which pack without full deduction context that may not be available at stub generation time.

Example

// OK: single pack, last position
template<typename T, typename... Ts>
__global__ void kernel(T first, Ts... rest);

// Error: pack not last
template<typename... Ts, typename T>
__global__ void kernel(Ts... args, T last);  // global_function_pack_not_last

// Error: multiple packs
template<typename... Ts, typename... Us>
__global__ void kernel(Ts... a, Us... b);    // global_function_multiple_packs
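The two structural checks can be expressed as a scan over the template parameter list. The sketch below is hypothetical: check_global_packs is not a function from the binary, the parameter list is reduced to one is-pack flag per parameter, and reporting the multiple-packs error before the not-last error is an assumption about ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One flag per template parameter: true when the parameter is a pack.
// Returns the applicable error tag, or an empty string when the list
// is acceptable for a variadic __global__ function template.
static std::string check_global_packs(const std::vector<bool>& is_pack) {
    std::size_t packs = 0;
    std::size_t last_pack = 0;
    for (std::size_t i = 0; i < is_pack.size(); ++i) {
        if (is_pack[i]) { ++packs; last_pack = i; }
    }
    if (packs > 1) return "global_function_multiple_packs";
    if (packs == 1 && last_pack + 1 != is_pack.size())
        return "global_function_pack_not_last";
    return "";
}
```

Applied to the three declarations above, the flag lists {false, true}, {true, false} and {true, true} yield no error, global_function_pack_not_last and global_function_multiple_packs respectively.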

Category C: Template Argument Visibility for __global__ (6 errors)

These are the core name-mangling restrictions. Every type used as a template argument to a __global__ function template instantiation must be visible and nameable by both the host and device compilers.

C.1: Host-local types

Tag	Message
`global_func_local_template_arg`	`A type defined inside a __host__ function (%t) cannot be used in the template argument type of a __global__ function template instantiation`

A type defined inside a __host__ function exists only within that function's scope. The device compiler never sees it and cannot produce a matching mangled name.

void host_function() {
    struct LocalType { int x; };
    kernel<LocalType><<<1,1>>>();  // error: host-local type
}

C.2: Private/protected class members

Tag	Message
`global_private_type_arg`	`A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the class is local to a __device__ or __global__ function`

Private/protected nested types are accessible only through the enclosing class's access control. While C++ allows friend access and member function access to these types, the device compiler processes templates independently and may not have the same access context. The exception for types local to __device__/__global__ functions reflects that both compilers see device function bodies.

class Outer {
    struct Inner { int x; };     // private
    friend void launch();
};

void launch() {
    kernel<Outer::Inner><<<1,1>>>();  // error: private type
}

C.3: Unnamed types

Tag	Message
`global_unnamed_type_arg`	`An unnamed type (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the type is local to a __device__ or __global__ function`

Unnamed types (anonymous structs, unnamed enums) have no canonical name. Itanium ABI mangling for unnamed types relies on positional encoding within the enclosing scope, which may differ between host and device compilers if they process the enclosing scope differently. Types local to __device__/__global__ functions are exempt because the device compiler processes those scopes identically.

enum { A, B, C };                     // unnamed enum
kernel<decltype(A)><<<1,1>>>();       // error: unnamed type

C.4: Lambda closures

Tag	Message
`global_lambda_template_arg`	`The closure type for a lambda (%t%s) cannot be used in the template argument type of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the flag '-extended-lambda' is specified and the lambda is an extended lambda (a __device__ or __host__ __device__ lambda defined within a __host__ or __host__ __device__ function)`

Lambda closures are compiler-generated anonymous types. Without --extended-lambda, there is no protocol for both compilers to agree on the closure type's mangled representation. With the flag enabled, the extended lambda mechanism establishes a naming convention for lambdas annotated with __device__ or __host__ __device__, enabling cross-compiler name agreement.

auto f = [](int x){ return x*2; };
kernel<decltype(f)><<<1,1>>>();       // error unless extended lambda

C.5: Private/protected template template arguments

Tag	Message
`global_private_template_arg`	`A template that is defined inside a class and has private or protected access cannot be used in the template template argument of a __global__ function template instantiation`

The same access-control problem as C.2, but for template template parameters. A private class template used as a template template argument cannot be guaranteed visible in the device compiler's independent instantiation context.

C.6: Texture/surface non-type arguments

Message
`A texture or surface variable cannot be used in the non-type template argument of a __device__, __host__ __device__ or __global__ function template instantiation`

Texture and surface objects have special hardware semantics. Their runtime addresses are not fixed at compile time (they are bound through the texture subsystem), so they cannot serve as non-type template arguments whose values must be known to produce a deterministic mangled name.

Implementation: The Access Checking Pipeline

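The template argument restriction checks are implemented in a three-function pipeline within cp_gen_be.c (sub_469F80, sub_469480, and sub_46A230). A fourth routine in the same area, sub_46A5B0, is covered at the end of this section.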

sub_469F80 — template_arg_is_accessible

This is the primary entry point. It dispatches on the template argument kind (byte at arg+8):

int template_arg_is_accessible(arg_t *arg, int scope_depth, char check_scope, int *cache_miss) {
    arg->flags_25 |= 0x10;               // mark: currently checking
    int kind = arg->kind;                 // offset +8
    
    switch (kind) {
    case 0:  // type argument
        type = arg->value;               // offset +32
        result = cache_access_result_for(type, 6, scope_depth, cache_miss);
        if (!result && (check_scope & 1)) {
            // walk through typedef chains (type_kind == 12)
            while (type->kind == 12)
                type = type->canonical;   // offset +144
            result = cache_access_result_for(type, 6, scope_depth, cache_miss);
            if (!result) {
                sub_469F30(&type_holder, 0);   // resolve via entity lookup
                result = (type_holder != original_type);
            }
        }
        break;
        
    case 1:  // template argument (template template parameter)
        entity = arg->value;             // offset +32
        // Check class accessibility via derivation chain
        if (entity->base_class) {        // offset +128
            // Use IL walker sub_61FE60 with callback sub_46ACC0
            sub_61EC40(visitor_state);
            visitor_state[0] = sub_46ACC0;   // the callback
            sub_61FE60(entity->base_class, visitor_state);
            result = (visitor_state->found == 0);
        }
        break;
        
    case 2:  // non-type argument
        result = cache_access_result_for(arg->value, 58, scope_depth, cache_miss);
        break;
        
    default:
        __assert_fail("template_arg_is_accessible", "cp_gen_be.c", 2448);
    }
    
    arg->flags_25 &= ~0x10;              // clear: done checking
    return result;
}

The flags_25 |= 0x10 / &= ~0x10 pattern is a recursion guard: it marks the argument as "currently being checked" to prevent infinite loops through mutually-referential template arguments.
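The guard can be illustrated in isolation. The sketch below is a hypothetical reduction, not the decompiled logic: `ArgNode`, `refers_to`, and the choice to treat an argument that is already on the check path as raising no objection are all assumptions. Only the 0x10 bit and the set/clear bracketing come from the function above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a template argument node.
struct ArgNode {
    std::uint8_t flags_25 = 0;
    bool accessible = true;        // stand-in for the real per-kind check
    ArgNode *refers_to = nullptr;  // another argument this one depends on
};

const std::uint8_t kBeingChecked = 0x10;

// Returns false if this argument, or anything it depends on, is inaccessible.
// A node that is already being checked is skipped, which breaks cycles.
bool is_accessible(ArgNode *arg) {
    assert(arg != nullptr);
    if (arg->flags_25 & kBeingChecked)
        return true;                         // already on the check path
    arg->flags_25 |= kBeingChecked;          // mark: currently checking
    bool result = arg->accessible;
    if (result && arg->refers_to)
        result = is_accessible(arg->refers_to);
    arg->flags_25 &= static_cast<std::uint8_t>(~kBeingChecked);  // clear
    return result;
}
```

Because the bit is cleared on every exit path, the flag byte is back to its original value once the outermost call returns, so the guard leaves no residue between independent checks.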

sub_469480 — cache_access_result_for

This function caches the result of access checking for a given entity to avoid redundant computation. The cache is a hash table at xmmword_F05720 with 16,383 buckets (0x3FFF), each 24 bytes wide.

Cache entry layout (24 bytes):

Offset	Size	Field	Description
+0	8	`next`	Pointer to next entry in chain (collision list)
+8	8	`entity`	Entity pointer being cached
+16	4	`scope_id`	Scope identifier from `qword_1065708` chain
+20	1	`result`	Cached access result (1 = accessible, 0 = not)
+21	1	`arg_kind`	Template argument kind that was checked
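The table can be mirrored as a C++ struct to confirm that the documented offsets are self-consistent. This is a sketch under stated assumptions: the struct and function names are hypothetical, 64-bit pointers are assumed, and `find_cached` models only the entity-plus-scope match described for the hit path, not the scope-stack walk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical C++ mirror of the 24-byte cache entry.
struct AccessCacheEntry {
    AccessCacheEntry *next;  // +0  next entry in the collision chain
    const void *entity;      // +8  entity whose result is cached
    std::uint32_t scope_id;  // +16 scope in which the result was computed
    std::uint8_t result;     // +20 1 = accessible, 0 = not
    std::uint8_t arg_kind;   // +21 argument kind that was checked
};                           // 2 bytes of tail padding bring the size to 24

static_assert(sizeof(AccessCacheEntry) == 24, "layout must match the table");
static_assert(offsetof(AccessCacheEntry, scope_id) == 16, "scope_id at +16");
static_assert(offsetof(AccessCacheEntry, result) == 20, "result at +20");
static_assert(offsetof(AccessCacheEntry, arg_kind) == 21, "arg_kind at +21");

// Walk one collision chain. A hit requires both the entity and the scope
// to match; anything else is reported as a miss (nullptr).
const AccessCacheEntry *find_cached(const AccessCacheEntry *head,
                                    const void *entity,
                                    std::uint32_t scope_id) {
    for (const AccessCacheEntry *e = head; e != nullptr; e = e->next)
        if (e->entity == entity && e->scope_id == scope_id)
            return e;
    return nullptr;
}
```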

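Hash function: The entity pointer is right-shifted by 6 bits, truncated to 32 bits, and reduced modulo 0x3FFF. The compiler has lowered the division to a multiply-and-shift sequence with the magic constant 262161:

unsigned n      = (unsigned)(entity >> 6);
unsigned hi     = (unsigned)((n * 262161ULL) >> 32);   // high word of the product
unsigned q      = (hi + ((n - hi) >> 1)) >> 13;        // q == n / 0x3FFF
unsigned bucket = n - 0x3FFF * q;                      // bucket == n % 0x3FFF
char *slot = (char *)&xmmword_F05720 + 24 * bucket;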
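The multiply-by-262161 sequence is the usual compiler lowering of an unsigned remainder by the constant 0x3FFF. The sketch below writes it next to the plain `%` form so the two can be compared. The function names `bucket_magic` and `bucket_plain` are illustrative and do not appear in the binary.

```cpp
#include <cassert>
#include <cstdint>

// Remainder of (entity >> 6) by 0x3FFF via multiply-and-shift.
std::uint32_t bucket_magic(std::uint64_t entity) {
    std::uint32_t n  = static_cast<std::uint32_t>(entity >> 6);
    std::uint32_t hi = static_cast<std::uint32_t>((n * 262161ULL) >> 32);
    std::uint32_t q  = (hi + ((n - hi) >> 1)) >> 13;  // quotient n / 0x3FFF
    return n - 0x3FFFu * q;                           // remainder
}

// Reference: the same computation written with the % operator.
std::uint32_t bucket_plain(std::uint64_t entity) {
    return static_cast<std::uint32_t>(entity >> 6) % 0x3FFFu;
}
```

The total shift is 32 + 1 + 13 = 46 bits, and (2^32 + 262161) * 0x3FFF exceeds 2^46 by only 16,367, which is small enough for the quotient to be exact over the full 32-bit input range.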

Cache hit path: If slot->entity == entity and the scope matches, return the cached result immediately. The function walks the qword_1065708 chain (the scope stack) to verify that the cached result was computed in a compatible scope context.

Cache eviction: When a cached entry's scope no longer matches (the scope stack has changed since caching), the entry is moved to the free list (qword_F05708). New entries are allocated from the free list or via sub_6B7340 (24-byte allocation).

Fallback (cache miss): On cache miss, the function performs the actual accessibility analysis:

For type arguments (kind 6): resolves typedefs, checks if the type is a class/struct/enum with access restrictions. Uses sub_5F9C10 to resolve through elaborated type specifiers. Checks entity->access_bits at +80 (bits 0-1: 0=public, 1=protected, 2=private).
For non-type arguments (kind 58): checks the entity's accessibility directly.
For class/struct types (kinds 9-11): walks the class's template argument list recursively via sub_469F80.
For dependent types (kind 14): recursively checks the base type.
For function types (kind 7) and pointer-to-member types (kind 13): recursively checks the return type, parameter types, and pointed-to class.

After computing the result, it is stored in the cache for future lookups.

sub_46A230 — Template Arg List Walker

This function walks a template instantiation's argument list and checks each argument for accessibility. It uses the entity lookup hash table at unk_FE5700 to find cached resolution results.

__int64 walk_template_args(__int64 hash_table, unsigned __int64 type) {
    // Resolve through typedef chains
    while (type->kind == 12)
        type = type->canonical;           // offset +144
    
    // Hash the type pointer into a bucket (32-byte entries)
    _QWORD *bucket = hash_table + 32 * ((type >> 6) % 0x3FFF);
    
    // Walk the bucket chain. The advance step is not visible in this
    // excerpt; bucket[0] is assumed to be the chain link, as in the
    // access-check cache.
    for (; bucket && bucket[1]; bucket = (_QWORD *)bucket[0]) {
        entry = bucket[1];                // the entity entry
        
        // Check if this entry matches our type
        if (entry->canonical != type && !sub_7B2260(entry->canonical, type, 0))
            continue;
        
        // Scope compatibility check
        if (bucket[2] && bucket[2] != qword_126C5D0)
            continue;
        
        // For template entities (kind 10), walk their argument lists
        if (entry->kind == 10) {
            for (arg = *entry->template_args; arg; arg = arg->next) {
                if (arg->flags_25 & 0x10)     // already being checked
                    goto next_entry;
                if (!template_arg_is_accessible(arg, 0, 0, &miss))
                    return 0;                 // not found
            }
        }
        
        // Access check on the entity itself
        if (entry->access_bits != 0)      // private/protected
            if (!sub_467780(entry, 1, 0)) // check access
                return 0;                 // not found
        
        // Record the current scope context in bucket[3]
        bucket[3] = qword_10657E8;
        return entry;
        
    next_entry: ;
    }
    return 0;
}

The walker handles three argument kinds:

Kind 0 (type): Checks the type entity's accessibility and, for class templates (kind 12 with subkind 10), recursively walks nested template arguments.
Kind 1 (template): Checks the template entity's class ancestry.
Kind 2 (non-type): Resolves the non-type argument's scope via sub_5F9BC0.

sub_46A5B0 — arg_before_unnamed_template_param_arg

This function handles the generation of template arguments that appear before unnamed template parameter arguments. It determines the positional index of each argument relative to the template parameter list and calls the appropriate code-generation routine. The assert at line 4795 guards against an unexpected argument kind (must be 0, 1, or 2; kind 3 is a pack expansion sentinel).

Category D: Variable Template Parallel Restrictions (5 errors)

Variable templates (template<typename T> __device__ T var = ...) used in device contexts carry the same restrictions as __global__ function templates. The diagnostics mirror Category C.1 through C.5; there is no variable-template counterpart to the texture/surface check (C.6):

Tag	Message
`variable_template_private_type_arg`	`A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a variable template instantiation, unless the class is local to a __device__ or __global__ function`
`variable_template_private_template_arg`	(private template template arg in variable template)
`variable_template_unnamed_type_template_arg`	`An unnamed type (%t) cannot be used in the template argument type of a variable template template instantiation, unless the type is local to a __device__ or __global__ function`
`variable_template_func_local_template_arg`	`A type defined inside a __host__ function (%t) cannot be used in the template argument type of a variable template template instantiation`
`variable_template_lambda_template_arg`	`The closure type for a lambda (%t%s) cannot be used in the template argument type of a variable template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the lambda is an 'extended lambda' and the flag --extended-lambda is specified`

The implementation shares the same cache_access_result_for / template_arg_is_accessible pipeline described in the Category C implementation section. The only difference is the error tag and message string emitted on failure.

Why Variable Templates Need the Same Restrictions

Variable templates instantiated with __device__, __constant__, or __managed__ memory space are registered by the CUDA runtime using their mangled names. The host-side .int.c file contains registration arrays (emitted in .nvHRDE, .nvHRDI, .nvHRCE, .nvHRCI sections) whose entries are byte arrays encoding mangled variable names. The device compiler independently mangles the same variable template instantiation. Both must produce identical names, so the same visibility constraints apply.

Category E: Static Global Template Stub (2 errors)

In whole-program compilation mode (-rdc=false) with -static-global-template-stub=true, template __global__ functions receive static linkage on their host stubs. This prevents ODR violations when the same template kernel is instantiated in multiple translation units. Two scenarios are incompatible with this mode:

Tag	Message
`extern_kernel_template`	`when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false"). To resolve the issue, either use separate compilation mode ("-rdc=true"), or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off)`
`template_global_no_def`	`when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit. To resolve this issue, either use separate compilation mode ("-rdc=true"), or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off)`

The Problem

An extern template kernel declaration says "this template instantiation exists elsewhere." But if the stub is static, there is no way for the linker to resolve the extern reference to a stub in another TU, because static symbols are TU-local. Similarly, a template instantiation without a definition in the current TU cannot have a static stub generated for it, because there is no body to inline.

Resolution Paths

Both diagnostics suggest the same two alternatives:

Switch to -rdc=true (separate compilation): each TU gets its own device object, and cross-TU kernel references are resolved by the device linker (nvlink).
Set -static-global-template-stub=false: stubs get external linkage, allowing cross-TU references at the cost of potential ODR violations if the same template is instantiated in multiple TUs.

Category F: Local Type Prevents Host Launch (1 error)

Tag	Message
`local_type_used_in_global_function`	`a local type %t (defined in %sq1) used in global function %sq2 template argument, the global function cannot be launched from host code.`

This is a warning-level diagnostic, not a hard error. It fires when a type local to a function (but not a __host__-function-local type, which would be Category C.1) is used as a template argument. The kernel can still be instantiated and called from device code, but the host-side launch path is blocked because the local type is not visible to the host stub generator.

This diagnostic differs from global_func_local_template_arg in severity and scope: it is a soft warning that the kernel "cannot be launched from host code," rather than a hard error that rejects the instantiation entirely.

Category G: __grid_constant__ in Instantiation Directives (1 error)

Tag	Message
`grid_constant_incompat_templ_redecl`	`incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p)`

When a function template is redeclared, the __grid_constant__ annotations on its parameters must match the original declaration. This is enforced because __grid_constant__ affects the ABI: a parameter marked __grid_constant__ is placed in constant memory and accessed through a different addressing mode. If a redeclaration omits the annotation, the host stub and device function would disagree on parameter layout.

The related diagnostic grid_constant_incompat_instantiation_directive applies to explicit instantiation directives (template __global__ void kernel<int>(...)) and is documented in the grid_constant page.

Category H: Kernel Launches from System File Templates (1 error)

Message
`kernel launches from templates are not allowed in system files`

This error fires when a <<<...>>> kernel launch expression appears inside a template function defined in a system header file. System headers are files marked with #pragma system_header or located in system include paths (e.g., the CUDA toolkit's include/ directory).

The restriction exists because system headers are processed with relaxed diagnostics. Kernel launch expressions inside template functions in system headers would be instantiated in user code contexts, but the launch transformation (replacing <<<...>>> with a launch-configuration call, __cudaPushCallConfiguration in current toolkits rather than the legacy cudaConfigureCall, followed by a stub call) operates during the system header's processing pass, where diagnostic state may be suppressed. Rather than risk silent miscompilation, the compiler rejects this pattern outright.

The __NV_name_expr Mechanism (6 errors)

NVRTC (NVIDIA's runtime compilation library) provides a mechanism to obtain the mangled name of a __global__ function or __device__/__constant__ variable at compile time. This mechanism is exposed through the __CUDACC_RTC__name_expr intrinsic, which the frontend processes during lowered name lookup.

Purpose

NVRTC compiles CUDA code at runtime, producing PTX that is loaded into the driver. The host application needs to look up compiled kernels and device variables by name via cuModuleGetFunction / cuModuleGetGlobal. The __NV_name_expr mechanism bridges this gap: the user provides a C++ name expression (e.g., my_kernel<int> or my_device_var<float>), and the compiler returns the corresponding mangled name (e.g., _Z9my_kernelIiEvv).
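The example mangled name can be checked on the host with the Itanium ABI demangler that ships with the GCC and Clang runtimes. This is only a verification aid: `abi::__cxa_demangle` is not part of cudafe++ or NVRTC, and the wrapper name `demangle` is an illustrative choice.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium-ABI name with the GCC/Clang runtime helper.
// Returns an empty string if the input cannot be demangled.
std::string demangle(const char *mangled) {
    int status = 0;
    char *text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    std::string out = (status == 0 && text != nullptr) ? text : "";
    std::free(text);  // the buffer is malloc-allocated by the demangler
    return out;
}
```

Reading `_Z9my_kernelIiEvv` left to right: `_Z` is the prefix, `9my_kernel` is the length-prefixed name, `IiE` is the template argument list `<int>`, the first `v` is the `void` return type (encoded for function template instantiations), and the final `v` is the empty parameter list.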

The 6 Errors

Tag	Message
`name_expr_parsing`	`Error in parsing name expression for lowered name lookup. Input name expression was: %sq`
`name_expr_extra_tokens`	`Extra tokens found after parsing name expression for lowered name lookup. Input name expression was: %sq`
`name_expr_internal_error`	`Internal error in parsing name expression for lowered name lookup. Input name expression was: %sq`
`name_expr_non_global_routine`	`Name expression cannot form address of a non-__global__ function. Input name expression was: %sq`
`name_expr_non_device_variable`	`Name expression cannot form address of a variable that is not a __device__/__constant__ variable. Input name expression was: %sq`
`name_expr_not_routine_or_variable`	`Name expression must form address of a __global__ function or the address of a __device__/__constant__ variable. Input name expression was: %sq`

Processing Pipeline

Parsing: The name expression is parsed as a C++ id-expression. If parsing fails, name_expr_parsing is emitted. If tokens remain after a successful parse, name_expr_extra_tokens fires.
Lookup: The parsed expression is resolved via standard C++ name lookup (qualified or unqualified, with template argument deduction if needed).
Validation: The resolved entity is checked:
- If it is a function, it must be __global__ (has the __global__ execution space byte set). Otherwise: name_expr_non_global_routine.
- If it is a variable, it must be __device__ or __constant__ (memory space bits at entity+148). Otherwise: name_expr_non_device_variable.
- If it is neither a function nor a variable: name_expr_not_routine_or_variable.
Mangling: If validation passes, the entity is mangled using the Itanium ABI mangler (in lower_name.c) and the resulting string is recorded for NVRTC output.
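The validation step above reduces to a small decision function. The sketch below is a hypothetical simplification: `ResolvedEntity` and its boolean fields stand in for the execution-space and memory-space bits of the real entity node, and only the three diagnostic tags are taken from the table.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified view of a resolved entity.
enum class EntityKind { Function, Variable, Other };

struct ResolvedEntity {
    EntityKind kind;
    bool is_global_function;     // function carries __global__
    bool is_device_or_constant;  // variable is __device__ or __constant__
};

// Returns the diagnostic tag for a rejected entity, or an empty string
// if the entity may proceed to mangling.
std::string validate_name_expr(const ResolvedEntity &e) {
    switch (e.kind) {
    case EntityKind::Function:
        return e.is_global_function ? "" : "name_expr_non_global_routine";
    case EntityKind::Variable:
        return e.is_device_or_constant ? "" : "name_expr_non_device_variable";
    case EntityKind::Other:
        break;
    }
    return "name_expr_not_routine_or_variable";
}
```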

Connection to Template Restrictions

The __NV_name_expr mechanism relies on every template argument being mangleable. All of the Category C restrictions directly support this: if a template argument type cannot be mangled (because it is unnamed, local, private, etc.), the name expression lookup would produce a mangled name that does not match the device-side mangling. The restrictions are enforced at template instantiation time, before any name expression lookup occurs, so that invalid instantiations never reach the mangling stage.

Data Structures

Template Argument Node (arg_t)

The template argument node is a linked-list entry used by sub_469F80 and sub_46A230:

Offset	Size	Field	Description
+0	8	`next`	Next argument in the list
+8	1	`kind`	Argument kind: 0=type, 1=template, 2=non-type, 3=pack expansion
+24	1	`flags_24`	Bit 0: is pack expansion
+25	1	`flags_25`	Bit 4 (0x10): currently being checked (recursion guard)
+32	8	`value`	Pointer to the type/entity/expression
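For reimplementation work it can help to pin the documented offsets down in a struct. In the sketch below the undocumented byte ranges are kept as opaque padding; the names `ArgT`, `unknown_9`, `unknown_26`, and `count_kind` are hypothetical, and 64-bit pointers are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of arg_t with the documented fields at their offsets.
struct ArgT {
    ArgT *next;                  // +0
    std::uint8_t kind;           // +8  0=type, 1=template, 2=non-type, 3=pack expansion
    std::uint8_t unknown_9[15];  // +9..+23, not documented
    std::uint8_t flags_24;       // +24 bit 0: is pack expansion
    std::uint8_t flags_25;       // +25 bit 4 (0x10): recursion guard
    std::uint8_t unknown_26[6];  // +26..+31, not documented
    void *value;                 // +32
};

static_assert(offsetof(ArgT, kind) == 8, "kind at +8");
static_assert(offsetof(ArgT, flags_24) == 24, "flags_24 at +24");
static_assert(offsetof(ArgT, flags_25) == 25, "flags_25 at +25");
static_assert(offsetof(ArgT, value) == 32, "value at +32");

// Count the arguments of a given kind in a linked argument list.
std::size_t count_kind(const ArgT *head, std::uint8_t kind) {
    std::size_t n = 0;
    for (const ArgT *a = head; a != nullptr; a = a->next)
        if (a->kind == kind)
            ++n;
    return n;
}
```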

Entity Node (type/symbol)

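Relevant fields for accessibility checking. The table merges fields observed on different node kinds, which is presumably why the 8-byte fields at +128 and +144 overlap the 1-byte fields at +132 and +148; not every field is valid for a given node: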

Offset	Size	Field	Description
+8	8	`name_entry`	Name string pointer (or next scope for unnamed)
+24	8	`alt_name`	Alternative name (for flag bit 3 at +81)
+40	8	`scope_info`	Scope information; `+32` from this is the enclosing class/namespace
+80	1	`access_bits`	Bits 0-1: access specifier (0=public, 1=protected, 2=private)
+81	1	`entity_flags`	Bit 2 (0x04): is template specialization; bit 6 (0x40): is anonymous
+128	8	`base_class`	Base class pointer (for class entities)
+132	1	`type_kind`	Type kind: 6/8=pointer/ref, 7=function, 9-11=class/struct/enum, 12=typedef, 13=pointer-to-member, 14=dependent
+144	8	`canonical`	Canonical type (for typedefs: the underlying type)
+148	1	`subtype_kind`	Subkind (for type_kind 12: 10=template-id, 12=elaborated)
+152	8	`type_info`	Type-specific data (template args, function params, etc.)
+160	1	`template_kind`	For template entities: template kind
+161	1	`visibility`	Bit 7 (0x80): private visibility (negative char value)
+162	2	`extra_flags`	Bit 7 (0x80) + bit 9 (0x200): cached accessibility state

Diagnostic Summary

All 24 diagnostics (23 errors and 1 warning), sorted by category:

#	Category	Tag	Severity
1	A	`global_function_constexpr`	error
2	A	`global_function_consteval`	error
3	A	`global_class_decl`	error
4	A	`global_friend_definition`	error
5	A	`global_exception_spec`	error
6	A	`global_function_in_unnamed_inline_ns`	error
7	A	`global_function_with_initializer_list`	error
8	A	`global_va_list_type`	error
9	B	`global_function_pack_not_last`	error
10	B	`global_function_multiple_packs`	error
11	C	`global_func_local_template_arg`	error
12	C	`global_private_type_arg`	error
13	C	`global_unnamed_type_arg`	error
14	C	`global_lambda_template_arg`	error
15	C	`global_private_template_arg`	error
16	C	(texture/surface non-type arg)	error
17	D	`variable_template_private_type_arg`	error
18	D	`variable_template_private_template_arg`	error
19	D	`variable_template_unnamed_type_template_arg`	error
20	D	`variable_template_func_local_template_arg`	error
21	D	`variable_template_lambda_template_arg`	error
22	E	`extern_kernel_template`	error
23	E	`template_global_no_def`	error
24	F	`local_type_used_in_global_function`	warning

Category G (grid_constant_incompat_templ_redecl) and Category H (kernel launches from templates...) are counted separately as they span the template/non-template boundary.

Function Map

Address	Identity	Lines	Role
`sub_469F80`	`template_arg_is_accessible`	144	Primary access checker -- dispatches on arg kind
`sub_469480`	`cache_access_result_for`	670	Hash-cached accessibility analysis
`sub_46A230`	(walks template arg lists)	182	Iterates entity lookup table for arg lists
`sub_46A5B0`	`arg_before_unnamed_template_param_arg`	396	Handles args before unnamed template params
`sub_469F30`	(scope resolve helper)	23	Resolves scope via `cache_access_result_for` + entity lookup
`sub_46ACC0`	(scope walk callback)	--	Callback passed to IL walker `sub_61FE60`
`sub_467780`	(access check)	--	Checks C++ access control (public/protected/private)
`sub_466F40`	(output callback)	--	Code generation output callback
`sub_5BFC70`	(pack expansion resolver)	--	Resolves pack expansion nodes (kind 3)
`sub_5F9BC0`	(scope resolver)	--	Resolves entity scope chain
`sub_5F9C10`	(elaborated type resolver)	--	Resolves elaborated type specifiers
`sub_7B2260`	(type equivalence)	--	Checks structural type equivalence
`sub_61EC40`	(init visitor)	27	Initializes IL tree visitor state
`sub_61FE60`	(walk expression tree)	17	Walks expression tree with callback

Global Variables

Global	Address	Description
`xmmword_F05720`	`0xF05720`	Access check cache hash table (384 KB, 16,383 entries x 24 bytes)
`qword_F05708`	`0xF05708`	Free list head for recycled cache entries
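`qword_F05730`	`0xF05730`	Scope ID storage for the cache (4 bytes per entry); the address is the table base + 16, which matches the `scope_id` field of the first entry, so it is likely a view into the same 24-byte entries rather than a separate array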
`unk_FE5700`	`0xFE5700`	Entity lookup hash table (512 KB)
`qword_1065708`	`0x1065708`	Scope stack head (linked list of scope entries)
`qword_126C5D0`	`0x126C5D0`	Global scope sentinel
`qword_10657E8`	`0x10657E8`	Current scope context for entity resolution
`dword_1065848`	`0x1065848`	Extended lambda mode flag
`dword_1065850`	`0x1065850`	Device stub mode flag

Cross-References

Template Engine -- instantiation worklist, fixpoint loop, and the should_be_instantiated gate
__global__ Function Attributes -- attribute application and post-validation checks
Kernel Stub Generation -- host stub emission, -static-global-template-stub flag
__grid_constant__ -- parameter annotation compatibility in template redeclarations
CUDA Diagnostics -- complete error catalog with all 24+ messages
Lambda Device Wrapper -- extended lambda mechanism for closure type template args
Execution Spaces -- host/device/global space model
Backend Pipeline -- initialization of hash tables used by the access checker
Int-C Format -- how the .int.c output encodes device symbol registration arrays

Constexpr Interpreter

The constexpr interpreter is the compile-time expression evaluation engine inside cudafe++. It lives in EDG 6.6's interpret.c (69 functions at 0x620CE0--0x65DE10, approximately 33,000 decompiled lines) and implements a virtual machine that executes arbitrary C++ expressions during compilation. Its central function, do_constexpr_expression (sub_634740), is the single largest function in the entire cudafe++ binary: 11,205 decompiled lines, 63KB of machine code, 128 unique callees, and 28 self-recursive call sites.

The interpreter exists because C++ constexpr evaluation requires the compiler to act as an execution engine. Since C++11, constexpr has grown from simple return-expression functions to a Turing-complete subset of C++ that includes loops, branches, dynamic memory allocation (C++20), virtual dispatch, exception-like control flow, and -- as of C++26 -- compile-time reflection. The interpreter must evaluate all of these constructs faithfully, track object lifetimes, detect undefined behavior, and convert results back into IL constants.

Key Facts

Property	Value
Source file	`interpret.c` (69 functions, ~33,000 decompiled lines)
Address range	`0x620CE0`--`0x65DE10`
Main evaluator	`sub_634740` (`do_constexpr_expression`), 11,205 lines, 63KB
Builtin evaluator	`sub_651150` (`do_constexpr_builtin_function`), 5,032 lines
Loop evaluator	`sub_644580` (`do_constexpr_range_based_for_statement`), 2,836 lines
Constructor evaluator	`sub_6480F0` (`do_constexpr_ctor`), 1,659 lines
Call dispatcher	`sub_657560` (`do_constexpr_call`), 1,445 lines
Top-level entry	`sub_65AE50` (`interpret_expr`)
Materialization	`sub_631110` (`copy_interpreter_object_to_constant`), 1,444 lines
Value extraction	`sub_64B580` (`extract_value_from_constant`), 2,299 lines
Arena block size	64KB (`0x10000`)
Large alloc threshold	1,024 bytes (`0x400`)
Max type size	64MB (`0x4000000`)
Uninitialized marker	`0xDB` fill pattern
Self-recursive calls	28 (in `do_constexpr_expression`)
Confirmed assert IDs	38 functions with assert strings
C++26 reflection	8 `std::meta::*` functions

Architecture Overview

The interpreter is structured as a tree-walking evaluator with arena-based memory, memoization caching, and a call stack that mirrors C++ function invocation. The rest of the compiler invokes it through interpret_expr, which sets up interpreter state, calls the recursive evaluator, and converts the result back to an IL constant.

  AST expression node
        |
        v
  +-----------------+
  | interpret_expr  |  sub_65AE50 — allocates state, arena, hash table
  +-----------------+
        |
        v
  +---------------------------+
  | do_constexpr_expression   |  sub_634740 — the 11,205-line evaluator
  |                           |  dispatches on expression-kind code
  |  +-- arithmetic ops       |  cases 40-45: +, -, *, /, %
  |  +-- comparisons          |  cases 49-51: <, >, ==, !=, <=, >=
  |  +-- member access        |  cases 3-4: . and ->
  |  +-- type conversions     |  case 5: cast sub-switch (20+ type pairs)
  |  +-- pointer arithmetic   |  cases 46-48, 50: ptr+int, ptr-ptr
  |  +-- function calls ------+---> do_constexpr_call (sub_657560)
  |  +-- constructors   ------+---> do_constexpr_ctor (sub_6480F0)
  |  +-- builtins       ------+---> do_constexpr_builtin_function (sub_651150)
  |  +-- loops          ------+---> do_constexpr_range_based_for (sub_644580)
  |  +-- statements     ------+---> do_constexpr_statement (sub_647850)
  |  +-- dynamic_cast         |  inline within main evaluator
  |  +-- typeid               |  inline within main evaluator
  |  +-- offsetof             |  inline within main evaluator
  |  +-- bit_cast             |  calls translate_*_bytes functions
  +---------------------------+
        |
        v
  +-------------------------------------+
  | copy_interpreter_object_to_constant |  sub_631110 — materializes result
  +-------------------------------------+  back into IL constant nodes
        |
        v
  IL constant (returned to compiler)

Entry Points

The interpreter has multiple entry points, each called from a different compilation phase:

Entry	Address	Lines	Called from
`interpret_expr`	`sub_65AE50`	572	General constexpr evaluation (primary)
Entry for expression lowering	`sub_65A290`	311	Expression lowering phase (`sub_6E2040`)
Entry for expression trees	`sub_65A8C0`	274	Expression handling (`sub_5BB4C0`, `sub_5C3760`)
`interpret_dynamic_sub_initializers`	`sub_65CFA0`	67	Aggregate initialization
Misc entries	`sub_65BAB0`--`sub_65D150`	150-470	Template instantiation, `static_assert`, enum values

All entry points follow the same pattern: allocate the interpreter state object, initialize the arena and hash table, call do_constexpr_expression, then extract and convert the result.

Interpreter State Object

The interpreter state is a structure passed as the first argument (a1) to every evaluator function. It contains the evaluation stack, heap tracking, memoization cache, and diagnostic context.

Offset	Size	Field	Description
`+0`	8	`hash_table`	Pointer to variable-to-value hash table
`+8`	8	`hash_capacity`	Hash table capacity mask (low 32) / entry count (high 32)
`+16`	8	`stack_top`	Current stack allocation pointer
`+24`	8	`stack_base`	Base of current arena block
`+32`	8	`heap_list`	Head of heap allocation chain (large objects)
`+40`	4	`scope_depth`	Current scope nesting counter
`+56`	8	`hash_aux_1`	Auxiliary hash table pointer
`+64`	8	`hash_aux_2`	Auxiliary hash table capacity
`+72`	8	`call_chain`	Current call stack chain (for recursion tracking)
`+88`	8	`diag_context_1`	Diagnostic context pointer
`+96`	8	`diag_context_2`	Source location for error reporting
`+112`	8	`diag_context_3`	Additional diagnostic metadata
`+132`	1	`flags_1`	Mode flags (bit 0 = strict mode)
`+133`	1	`flags_2`	Additional mode flags

Memory Model

The interpreter uses a dual-tier memory system: an arena allocator for small objects and direct heap allocation for large ones.

Arena Allocator

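Arena blocks are 64KB (0x10000 bytes) each. The state object's stack_base field (+24) points at the current block, and each block links to the previously allocated one through its next_block pointer (+0):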

Block layout:
  +------------------+
  | next_block (+0)  |---> previous block (or null)
  | alloc_ptr  (+8)  |---> current bump position
  | capacity   (+16) |---> end of usable space
  | base       (+24) |---> start of block data
  +------------------+
  | usable space     |  64KB of object storage
  | ...              |
  +------------------+

Allocation follows a bump-pointer pattern:

void *arena_alloc(interp_state *state, size_t size) {
    size = ALIGN_UP(size, 8);
    // Bytes still free in the current 64KB block
    size_t remaining = 0x10000 - (size_t)(state->stack_top - state->stack_base);
    if (remaining < size) {
        // Allocate new 64KB block, link to chain
        arena_block *new_block = sub_622D20();
        new_block->next = state->stack_base;
        state->stack_base = new_block;
        state->stack_top = (char *)new_block + HEADER_SIZE;  // skip block header
    }
    void *result = state->stack_top;
    state->stack_top += size;
    return result;
}
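
The bump-pointer scheme above can be exercised in isolation. The sketch below is an assumption-laden stand-in rather than the binary's allocator: it uses `malloc` in place of sub_622D20, keeps the four header fields from the block diagram, and introduces the hypothetical names `arena`, `arena_new_block`, and `arena_free_all`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define ARENA_BLOCK_SIZE 0x10000u   /* 64KB of usable space per block */

/* Header fields mirror the block layout diagram. */
typedef struct arena_block {
    struct arena_block *next;       /* previously allocated block, or NULL */
    unsigned char *alloc_ptr;       /* current bump position               */
    unsigned char *capacity;        /* end of usable space                 */
    unsigned char *base;            /* start of block data                 */
} arena_block;

typedef struct arena {
    arena_block *head;              /* current block */
    size_t block_count;
} arena;

static arena_block *arena_new_block(arena_block *prev) {
    arena_block *b = malloc(sizeof *b + ARENA_BLOCK_SIZE);
    if (!b) return NULL;
    b->next = prev;
    b->base = (unsigned char *)(b + 1);     /* data follows the header */
    b->alloc_ptr = b->base;
    b->capacity = b->base + ARENA_BLOCK_SIZE;
    return b;
}

static void *arena_alloc(arena *a, size_t size) {
    assert(a != NULL);
    size = (size + 7u) & ~(size_t)7u;       /* round up to 8 bytes */
    if (size > ARENA_BLOCK_SIZE) return NULL;  /* large objects go elsewhere */
    if (!a->head || (size_t)(a->head->capacity - a->head->alloc_ptr) < size) {
        arena_block *b = arena_new_block(a->head);
        if (!b) return NULL;
        a->head = b;
        a->block_count++;
    }
    void *result = a->head->alloc_ptr;
    a->head->alloc_ptr += size;             /* bump */
    return result;
}

/* Release every block by walking the next links. */
static void arena_free_all(arena *a) {
    while (a->head) {
        arena_block *prev = a->head->next;
        free(a->head);
        a->head = prev;
    }
    a->block_count = 0;
}
```

Freeing is all-or-nothing, which is the usual trade-off of an arena: individual objects are never returned, and the whole chain is dropped at once.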

Large Object Heap

Objects larger than 1,024 bytes (0x400) bypass the arena and are allocated individually via sub_6B7340 (the compiler's general-purpose allocator). These allocations are linked into the heap_list chain (interpreter state offset +32) so they can be freed when the interpreter scope exits.

Every interpreter object has a header preceding the value bytes:

  offset -10  [-10]  bitmap byte 2 (validity tracking)
  offset  -9  [ -9]  bitmap byte 1 (initialization tracking)
  offset  -8  [ -8]  type pointer (8 bytes, points to type_node)
  offset   0  [  0]  value bytes start here
              ...     value data (size depends on type)

New objects are initialized with value bytes filled to 0xDB (decimal 219), which serves as an uninitialized-memory sentinel. Any read from an object whose bytes still contain 0xDB triggers error 2700 (access to uninitialized object).
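
A small model of the header-plus-sentinel scheme makes the negative offsets concrete. This is a sketch under stated assumptions: it allocates with `malloc` instead of the arena, pads the header to 16 bytes (a choice made here for alignment, not something the source states), and treats an object as untouched only when every value byte still holds 0xDB. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define UNINIT_SENTINEL 0xDB
#define HEADER_BYTES 16   /* assumption: header padded so value bytes stay aligned */

/* Allocates an object and returns a pointer to its value bytes. */
static unsigned char *object_create(const void *type_node, size_t value_size) {
    assert(sizeof type_node <= 8);
    unsigned char *raw = malloc(HEADER_BYTES + value_size);
    if (!raw) return NULL;
    unsigned char *value = raw + HEADER_BYTES;
    memset(raw, 0, HEADER_BYTES);
    memcpy(value - 8, &type_node, sizeof type_node);  /* type pointer at -8 */
    value[-9] = 0;                                    /* bitmap byte 1       */
    value[-10] = 0;                                   /* bitmap byte 2       */
    memset(value, UNINIT_SENTINEL, value_size);       /* poison value bytes  */
    return value;
}

/* Reads the type pointer back from offset -8. */
static const void *object_type(const unsigned char *value) {
    const void *type_node;
    memcpy(&type_node, value - 8, sizeof type_node);
    return type_node;
}

/* Returns 1 when every value byte still carries the sentinel. */
static int object_is_untouched(const unsigned char *value, size_t value_size) {
    for (size_t i = 0; i < value_size; i++)
        if (value[i] != UNINIT_SENTINEL) return 0;
    return 1;
}

static void object_destroy(unsigned char *value) {
    free(value - HEADER_BYTES);
}
```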

Constexpr Value Representation

Values in the interpreter use a type-dependent representation:

Type category	`kind` byte	Value size	Representation
`void`	0	0	Flag `0x40` set, no value bytes
`pointer`	1	0	Stored as reference metadata, not inline bytes
`integral`	2	16 bytes	Two 64-bit words (supports `__int128`)
`float`	3	16 bytes	IEEE 754 value in first 4/8 bytes, padded
`double`	4	16 bytes	IEEE 754 value in first 8 bytes, padded
`complex`	5	32 bytes	Real + imaginary parts
`class/struct`	6	32 bytes	Reference to interpreter object
`union`	7	32 bytes	Reference to interpreter object
`array`	8	N * elem_size	Recursive: element count times element size
`class` (variants)	9, 10, 11	Cached	Looked up in type-to-size hash table
`typedef`	12	(follow)	Chase to underlying type
`enum`	13	16 bytes	Same as integral
`nullptr_t`	19	32 bytes	Null pointer representation
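
The size rules in the table are recursive for arrays and typedefs, which a short function can illustrate. The sketch below assumes a hypothetical three-field `type_sketch` in place of the real type node, and it deliberately does not model kinds 9, 10, and 11, which the table says are resolved through a cached lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a type node. */
typedef struct type_sketch {
    int kind;                          /* kind byte from the table            */
    size_t elem_count;                 /* arrays (kind 8) only                */
    const struct type_sketch *inner;   /* element type (8) or underlying (12) */
} type_sketch;

#define SIZE_UNKNOWN ((size_t)-1)

/* Returns the value size in bytes, or SIZE_UNKNOWN for kinds not modeled. */
static size_t value_size(const type_sketch *t) {
    assert(t != NULL);
    switch (t->kind) {
    case 0: case 1:                       /* void, pointer: no inline bytes   */
        return 0;
    case 2: case 3: case 4: case 13:      /* integral, float, double, enum    */
        return 16;
    case 5: case 6: case 7: case 19:      /* complex, class, union, nullptr_t */
        return 32;
    case 8: {                             /* array: count times element size  */
        size_t elem = value_size(t->inner);
        return elem == SIZE_UNKNOWN ? elem : t->elem_count * elem;
    }
    case 12:                              /* typedef: chase underlying type   */
        return value_size(t->inner);
    default:                              /* 9/10/11 use the cached table     */
        return SIZE_UNKNOWN;
    }
}
```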

The reference representation for pointers and class objects uses 32 bytes (two __m128i values). The flag byte at offset +8 within a reference encodes:

Bit	Meaning
0	Has concrete object backing
1	Past-the-end pointer (one past array)
2	Has allocation chain (from `constexpr new`)
3	Has subobject path (member/base offset chain)
4	Has bitfield information
5	Is dangling (object lifetime ended)
6	Is const-qualified

Memoization Hash Table

The interpreter maintains a hash table that maps type pointers to precomputed value sizes, avoiding redundant recursive size computations for class types:

Global	Purpose
`qword_126FEC0`	Hash table base pointer
`qword_126FEC8`	Capacity mask (low 32 bits) / entry count (high 32 bits)

Each entry is 16 bytes: 8-byte key (type pointer), 4-byte size value, 4-byte padding. Collision resolution uses linear probing with a bitmask. The table grows (via sub_620760) when load factor exceeds 50%.
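
The table shape described here (16-byte entries, mask-based linear probing, growth at 50% load) can be sketched as follows. The hash function, the doubling growth policy, and all `memo_*` names are assumptions made for illustration; the source does not state how sub_620760 resizes or how keys are hashed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct memo_entry {      /* 16 bytes on LP64: key, size, padding */
    const void *key;             /* type pointer, NULL marks an empty slot */
    uint32_t size;
    uint32_t pad;
} memo_entry;

typedef struct memo_table {
    memo_entry *slots;
    uint32_t mask;               /* capacity - 1, capacity is a power of two */
    uint32_t count;
} memo_table;

static uint32_t memo_hash(const void *key) {   /* hypothetical hash */
    uintptr_t v = (uintptr_t)key;
    return (uint32_t)((v >> 3) ^ (v >> 16));
}

static int memo_init(memo_table *t, uint32_t capacity) {
    assert(capacity && (capacity & (capacity - 1)) == 0);
    t->slots = calloc(capacity, sizeof(memo_entry));
    t->mask = capacity - 1;
    t->count = 0;
    return t->slots != NULL;
}

/* Linear probing: returns the slot holding key, or the first empty slot. */
static memo_entry *memo_probe(memo_table *t, const void *key) {
    uint32_t i = memo_hash(key) & t->mask;
    while (t->slots[i].key && t->slots[i].key != key)
        i = (i + 1) & t->mask;
    return &t->slots[i];
}

/* Doubles the capacity and reinserts every live entry. */
static int memo_grow(memo_table *t) {
    memo_table bigger;
    if (!memo_init(&bigger, (t->mask + 1) * 2)) return 0;
    for (uint32_t i = 0; i <= t->mask; i++) {
        if (t->slots[i].key) {
            *memo_probe(&bigger, t->slots[i].key) = t->slots[i];
            bigger.count++;
        }
    }
    free(t->slots);
    *t = bigger;
    return 1;
}

static int memo_insert(memo_table *t, const void *key, uint32_t size) {
    assert(key != NULL);
    memo_entry *e = memo_probe(t, key);
    if (!e->key) {
        if ((t->count + 1) * 2 > t->mask + 1) {   /* would pass 50% load */
            if (!memo_grow(t)) return 0;
            e = memo_probe(t, key);
        }
        e->key = key;
        t->count++;
    }
    e->size = size;
    return 1;
}

static int memo_lookup(memo_table *t, const void *key, uint32_t *size_out) {
    memo_entry *e = memo_probe(t, key);
    if (!e->key) return 0;
    *size_out = e->size;
    return 1;
}
```

Keeping the load at or below one half guarantees that a probe sequence always reaches an empty slot, so lookups for absent keys terminate.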

Constexpr Allocation Tracking (C++20)

C++20 introduced constexpr dynamic memory allocation (new/delete in constexpr contexts). The interpreter tracks these through a global allocation chain:

Global	Purpose
`qword_126FBC0`	Free list head
`qword_126FBB8`	Outstanding allocation count

When std::allocator<T>::allocate() is called during constexpr (sub_62B100), the interpreter allocates from its arena, sets bit 2 in the object's flag byte, and links the allocation into the chain. std::allocator<T>::deallocate() (sub_62B470) validates that the freed pointer was actually allocated by std::allocator::allocate() and unlinks it. At the end of constexpr evaluation, any remaining allocations indicate a bug in the evaluated code (memory leaked during constant evaluation).
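
The allocate / validate-on-deallocate / leak-check cycle can be modeled with a singly linked chain and a counter. This sketch is illustrative only: it backs allocations with `malloc` rather than the interpreter arena, and the names `alloc_tracker`, `tracked_allocate`, `tracked_deallocate`, and `evaluation_leaked` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct alloc_record {
    struct alloc_record *next;
    void *object;
} alloc_record;

typedef struct alloc_tracker {
    alloc_record *chain;     /* head of the allocation chain   */
    size_t outstanding;      /* allocations not yet released   */
} alloc_tracker;

/* Allocates storage and links a record for it into the chain. */
static void *tracked_allocate(alloc_tracker *t, size_t bytes) {
    assert(t != NULL);
    alloc_record *r = malloc(sizeof *r);
    if (!r) return NULL;
    r->object = malloc(bytes ? bytes : 1);
    if (!r->object) { free(r); return NULL; }
    r->next = t->chain;
    t->chain = r;
    t->outstanding++;
    return r->object;
}

/* Returns 1 on success, 0 if p was never handed out by tracked_allocate. */
static int tracked_deallocate(alloc_tracker *t, void *p) {
    for (alloc_record **link = &t->chain; *link; link = &(*link)->next) {
        if ((*link)->object == p) {
            alloc_record *dead = *link;
            *link = dead->next;          /* unlink from the chain */
            free(dead->object);
            free(dead);
            t->outstanding--;
            return 1;
        }
    }
    return 0;                            /* foreign pointer: reject */
}

/* Nonzero when allocations are still live at the end of evaluation. */
static int evaluation_leaked(const alloc_tracker *t) {
    return t->outstanding != 0;
}
```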

The Main Evaluator: do_constexpr_expression

sub_634740 is the heart of the interpreter. It takes four parameters:

// Returns 1 on success, 0 on failure
int do_constexpr_expression(
    interp_state *a1,       // interpreter state
    expr_node    *a2,       // AST expression node to evaluate
    value_buf    *a3,       // output value buffer (32 bytes)
    address_t    *a4        // "home" address for reference tracking
);

The function body is organized as a nested switch statement. The outer switch dispatches on the expression category at *(a2+24), and several cases contain inner switches for further dispatch.

Outer Switch: Expression Categories

int do_constexpr_expression(interp_state *a1, expr_node *a2,
                            value_buf *a3, address_t *a4) {
    int category = *(a2 + 24);    // expression category code
    switch (category) {
    case 0:   // ---- void expression ----
        a3->flags = 0x40;         // mark as void
        return 1;

    case 1:   // ---- operator expression ----
        return eval_operator(a1, a2, a3, a4);    // inner switch on *(a2+40)

    case 10:  // ---- sub-expression wrapper ----
        return do_constexpr_expression(a1, *(a2+40), a3, a4);  // recurse

    case 11:  // ---- typeid expression ----
        return do_constexpr_typeid(a1, a2, a3);  // inline

    case 17:  // ---- statement expression (GNU extension) ----
        return do_constexpr_statement(a1, *(a2+40), a3);  // sub_647850

    case 18:  // ---- variable lookup ----
        return lookup_variable(a1, a2, a3, a4);  // hash table at a1+0

    case 19:  // ---- function / static variable reference ----
        return resolve_static_ref(a1, a2, a3);

    case 21:  // ---- special expressions ----
        return eval_special(a1, a2, a3, a4);     // inner switch on *(a2+40)

    default:
        emit_error(2721);  // "expression is not a constant expression"
        return 0;
    }
}

Inner Switch: Operator Codes (case 1)

The operator expression case dispatches on the operator code at *(a2+40). This is the largest sub-switch, covering 100+ cases:

int eval_operator(interp_state *a1, expr_node *a2,
                  value_buf *a3, address_t *a4) {
    int opcode = *(a2 + 40);
    switch (opcode) {

    // ---- Assignment ----
    case 0: case 1:
        // Evaluate RHS, store to LHS address
        if (!do_constexpr_expression(a1, rhs, &rval, NULL)) return 0;
        if (!do_constexpr_expression(a1, lhs, &lval, NULL)) return 0;
        assign_value(lval.address, &rval, type);
        *a3 = lval;
        return 1;

    // ---- Member access (. and ->) ----
    case 3: case 4:
        // Evaluate base object, compute member offset
        if (!do_constexpr_expression(a1, base_expr, &base, NULL)) return 0;
        member_offset = compute_member_offset(member_decl, base.type);
        a3->address = base.address + member_offset;
        return 1;

    // ---- Type conversion (cast) ----
    case 5:
        return eval_conversion(a1, a2, a3, a4);  // 20+ type-pair sub-switch

    // ---- Parenthesized expression ----
    case 9:
        return do_constexpr_expression(a1, *(a2+48), a3, a4);  // recurse

    // ---- Pointer increment/decrement ----
    case 22: case 23:
        if (!do_constexpr_expression(a1, operand, &val, NULL)) return 0;
        // Validate pointer is within array bounds
        pos = get_runtime_array_pos(val.address);
        if (pos < 0 || pos >= array_size) {
            emit_error(2692);  // array bounds violation
            return 0;
        }
        val.address += (opcode == 22) ? elem_size : -elem_size;
        *a3 = val;
        return 1;

    // ---- Unary negation / bitwise complement ----
    case 26:
        if (!do_constexpr_expression(a1, operand, &val, NULL)) return 0;
        if (is_integer_type(val.type))
            a3->int_val = ~val.int_val;  // or -val.int_val
        else if (is_float_type(val.type))
            a3->float_val = -val.float_val;
        return 1;

    // ---- Arithmetic binary operators ----
    case 40:  // addition
    case 41:  // subtraction
    case 42:  // multiplication
    case 43:  // division
    case 44:  // modulo
    case 45:  // (additional arithmetic)
        if (!do_constexpr_expression(a1, lhs, &left, NULL)) return 0;
        if (!do_constexpr_expression(a1, rhs, &right, NULL)) return 0;
        if (opcode == 43 && right.int_val == 0) {
            emit_error(61);    // division by zero = UB
            return 0;
        }
        result = apply_arithmetic(opcode, left, right, type);
        if (check_overflow(result, type)) {
            emit_error(2708);  // arithmetic overflow
            return 0;
        }
        *a3 = result;
        return 1;

    // ---- Pointer arithmetic ----
    case 46:  // pointer + integer
    case 47:  // integer + pointer
    case 48:  // pointer - integer
        if (!do_constexpr_expression(a1, ptr_expr, &ptr, NULL)) return 0;
        if (!do_constexpr_expression(a1, int_expr, &idx, NULL)) return 0;
        // pointer - integer (case 48) moves backwards through the array
        delta = (opcode == 48) ? -idx.int_val : idx.int_val;
        // Validate result stays within allocation bounds
        new_pos = get_runtime_array_pos(ptr.address) + delta;
        if (new_pos < 0 || new_pos > array_size) {  // past-the-end is valid
            emit_error(2735);  // pointer arithmetic underflow/overflow
            return 0;
        }
        a3->address = ptr.address + delta * elem_size;
        return 1;

    // ---- Comparison operators ----
    case 49: case 50: case 51:
        if (!do_constexpr_expression(a1, lhs, &left, NULL)) return 0;
        if (!do_constexpr_expression(a1, rhs, &right, NULL)) return 0;
        // Pointer comparison: validate same complete object
        if (is_pointer(left.type) && !same_complete_object(left, right)) {
            emit_error(2734);  // invalid pointer comparison
            return 0;
        }
        a3->int_val = apply_comparison(opcode, left, right);
        return 1;

    // ---- Compound assignment (+=, -=, etc.) ----
    case 74: case 75:
        // Evaluate LHS address, compute new value, store back
        ...

    // ---- Shift operators ----
    case 80: case 81: case 82: case 83:
        // Left shift, right shift (arithmetic and logical)
        ...

    // ---- Array subscript ----
    case 87: case 88: case 89: case 90: case 91:
        if (!do_constexpr_expression(a1, base, &arr, NULL)) return 0;
        if (!do_constexpr_expression(a1, index, &idx, NULL)) return 0;
        if (idx.int_val < 0 || idx.int_val >= array_dimension) {
            emit_error(2692);  // array bounds violation
            return 0;
        }
        a3->address = arr.address + idx.int_val * elem_size;
        return 1;

    // ---- Pointer-to-member dereference (.* and ->*) ----
    case 92: case 93:
        ...

    // ---- sizeof ----
    case 94:
        a3->int_val = compute_sizeof(operand_type);
        return 1;

    // ---- Comma operator ----
    case 103:
        if (!do_constexpr_expression(a1, lhs, &discard, NULL)) return 0;  // evaluate + discard
        return do_constexpr_expression(a1, rhs, a3, a4);   // return RHS

    default:
        emit_error(2721);  // not a constant expression
        return 0;
    }
}

Type Conversion Sub-Switch (operator case 5)

Type conversions are one of the most complex parts of the evaluator. The sub-switch dispatches on source/target type pairs and handles overflow detection:

int eval_conversion(interp_state *a1, expr_node *a2,
                    value_buf *a3, address_t *a4) {
    type_node *src_type = source_type(a2);
    type_node *dst_type = target_type(a2);
    int src_kind = src_type->kind;  // offset +132
    int dst_kind = dst_type->kind;

    // Evaluate the operand first
    value_buf operand;
    if (!do_constexpr_expression(a1, *(a2+48), &operand, NULL))
        return 0;

    // Dispatch on type pair
    if (src_kind == 2 && dst_kind == 2) {
        // int -> int: check truncation
        if (!fits_in_target(operand.int_val, dst_type)) {
            emit_error(2707);  // integer overflow in conversion
            return 0;
        }
        a3->int_val = truncate_to(operand.int_val, dst_type);
    }
    else if (src_kind == 2 && (dst_kind == 3 || dst_kind == 4)) {
        // int -> float/double
        a3->float_val = (double)operand.int_val;
    }
    else if ((src_kind == 3 || src_kind == 4) && dst_kind == 2) {
        // float/double -> int: check overflow
        if (operand.float_val > INT_MAX || operand.float_val < INT_MIN) {
            emit_error(2728);  // floating-point conversion overflow
            return 0;
        }
        a3->int_val = (int64_t)operand.float_val;
    }
    else if (src_kind == 1 && dst_kind == 2) {
        // pointer -> int (reinterpret_cast)
        if (!cuda_allows_reinterpret_cast()) {  // dword_106C2C0
            emit_error(2727);  // invalid conversion
            return 0;
        }
    }
    else if (src_kind == 6 && dst_kind == 6) {
        // class -> class (derived-to-base or base-to-derived)
        a3->address = adjust_pointer_for_base(operand.address, src_type, dst_type);
    }
    else if (src_kind == 19 && dst_kind == 1) {
        // nullptr_t -> pointer
        a3->address = 0;
        a3->flags &= ~0x01;  // null pointer: no concrete object backing (bit 0 clear)
    }
    // ... 15+ additional type pairs ...
    return 1;
}
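
The two range checks in the conversion code (integer narrowing and floating-point to integer) can be written out explicitly. This is a sketch with assumptions: `fits_in_target` here takes a bit width and signedness instead of a type node, and `double_fits_int64` checks against the 64-bit range rather than the `INT_MAX` / `INT_MIN` bounds that appear in the pseudocode.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if value is representable in an integer type with the
   given width (1..64 bits) and signedness. */
static int fits_in_target(int64_t value, unsigned bits, int is_signed) {
    assert(bits >= 1 && bits <= 64);
    if (is_signed) {
        if (bits == 64) return 1;
        int64_t max = ((int64_t)1 << (bits - 1)) - 1;
        int64_t min = -max - 1;
        return value >= min && value <= max;
    }
    if (value < 0) return 0;              /* unsigned targets reject negatives */
    if (bits >= 63) return 1;             /* any non-negative int64_t fits     */
    return value <= (int64_t)(((uint64_t)1 << bits) - 1);
}

/* Returns 1 if truncating v toward zero yields a value in int64_t range.
   NaN fails both comparisons and is therefore rejected. */
static int double_fits_int64(double v) {
    return v >= -9223372036854775808.0 && v < 9223372036854775808.0;
}
```

The upper bound is a strict comparison against 2^63 because `INT64_MAX` itself is not exactly representable as a double.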

Variable Lookup (case 18)

When the evaluator encounters a variable reference, it looks up the variable's current value in the interpreter's hash table:

int lookup_variable(interp_state *a1, expr_node *a2,
                    value_buf *a3, address_t *a4) {
    void *var_key = get_variable_entity(a2);
    uint64_t *table = a1->hash_table;                 // offset +0
    uint32_t  mask  = (uint32_t)a1->hash_capacity;    // offset +8, low 32 bits only

    // Linear-probing hash lookup; each slot is a (key, value address) pair
    uint32_t idx = hash(var_key) & mask;
    while (table[idx * 2] != 0) {
        if (table[idx * 2] == (uint64_t)var_key) {
            // Found: load value from stored address
            void *value_addr = (void *)table[idx * 2 + 1];
            load_value(a3, value_addr, get_type(a2));
            return 1;
        }
        idx = (idx + 1) & mask;
    }
    // Variable not in scope -> likely a static/global constexpr
    return resolve_static_ref(a1, a2, a3);
}

Function Call Dispatch: do_constexpr_call

sub_657560 (1,445 lines) handles all function call evaluation during constexpr. It is the central dispatcher that routes calls to the appropriate evaluator based on the callee kind.

int do_constexpr_call(interp_state *a1, expr_node *call_expr,
                      value_buf *result, address_t *home) {
    // 1. Resolve the callee
    func_info callee;
    if (!eval_constexpr_callee(a1, call_expr, &callee))  // sub_643FE0
        return 0;

    // 2. Check recursion depth
    int depth = count_call_chain(a1->call_chain);
    if (depth > MAX_CONSTEXPR_DEPTH) {
        emit_error(2701);  // constexpr evaluation exceeded depth limit
        return 0;
    }

    // 3. Dispatch by callee kind
    if (callee.is_builtin) {
        // Route to builtin evaluator
        return do_constexpr_builtin_function(       // sub_651150
            a1, callee.descriptor, args, result, &success);
    }

    if (callee.is_constructor) {
        // Route to constructor evaluator
        return do_constexpr_ctor(a1, callee, args,  // sub_6480F0
                                 result, home);
    }

    if (callee.is_destructor) {
        // Route to destructor evaluator (two variants)
        return do_constexpr_dtor(a1, callee, args,  // sub_64EFE0 or sub_64FB10
                                 result);
    }

    if (callee.is_virtual) {
        // Virtual dispatch: resolve through vtable
        func_info resolved = resolve_virtual_call(callee, this_obj);
        if (!resolved.is_constexpr) {
            emit_error(269);  // virtual function is not constexpr
            return 0;
        }
        callee = resolved;
    }

    // 4. Check that function body is available
    if (!callee.has_body) {
        emit_error(2823);  // constexpr function not defined
        return 0;
    }

    // 5. Push call frame
    call_frame frame;
    frame.prev = a1->call_chain;
    a1->call_chain = &frame;
    frame.func = callee.entity;

    // 6. Bind arguments to parameters
    int ok = 0;  // stays 0 if argument evaluation fails and we jump to cleanup
    for (int i = 0; i < callee.param_count; i++) {
        value_buf arg_val;
        if (!do_constexpr_expression(a1, args[i], &arg_val, NULL))
            goto cleanup;
        bind_parameter(a1, callee.params[i], &arg_val);
    }

    // 7. Evaluate function body
    ok = do_constexpr_statement(a1, callee.body, result);  // sub_647850

    // 8. Pop call frame, clean up allocations
cleanup:
    a1->call_chain = frame.prev;
    release_allocation_chain(a1, &frame);  // sub_633EC0
    return ok;
}

Callee Resolution: eval_constexpr_callee

sub_643FE0 (305 lines) resolves the callee expression of a function call. It handles direct calls, virtual dispatch (vtable lookup), and pointer-to-member-function calls. For virtual calls, it resolves overrides by walking the vtable of the most-derived type of the object being called on.

Recursion Depth Tracking

The interpreter tracks call depth through the call_chain linked list at offset +72 in the interpreter state. Each do_constexpr_call invocation pushes a frame; each return pops it. The chain is also used for diagnostic output -- when a constexpr evaluation fails, the error message includes the call stack showing how the offending expression was reached.
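
The push/pop discipline on the call chain can be illustrated with frames that live on the host stack. The sketch below is hypothetical throughout: `MAX_DEPTH_SKETCH` is an arbitrary limit chosen for the example (the real limit is not stated here), and `eval_factorial` merely stands in for a recursive constexpr function being evaluated.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH_SKETCH 512   /* hypothetical limit, not the binary's value */

typedef struct call_frame {
    struct call_frame *prev;   /* caller's frame, or NULL at the top level */
    const char *func_name;     /* kept for diagnostic call-stack output    */
} call_frame;

typedef struct interp_sketch {
    call_frame *call_chain;    /* innermost active frame */
} interp_sketch;

/* Depth is the length of the chain from the innermost frame outward. */
static int count_call_chain(const call_frame *f) {
    int depth = 0;
    for (; f; f = f->prev) depth++;
    return depth;
}

/* Evaluates n! recursively. Returns 1 on success, 0 if the depth
   limit is hit. The frame is popped on every exit path. */
static int eval_factorial(interp_sketch *st, unsigned n,
                          unsigned long long *out) {
    assert(out != NULL);
    if (count_call_chain(st->call_chain) >= MAX_DEPTH_SKETCH)
        return 0;                              /* depth limit exceeded */

    call_frame frame = { st->call_chain, "factorial" };
    st->call_chain = &frame;                   /* push */

    int ok = 1;
    if (n <= 1) {
        *out = 1;
    } else {
        unsigned long long sub = 0;
        ok = eval_factorial(st, n - 1, &sub);
        if (ok) *out = n * sub;
    }

    st->call_chain = frame.prev;               /* pop */
    return ok;
}
```

Because each frame records its caller, walking `prev` links from the failing frame yields the call stack in innermost-first order, which is the shape a diagnostic needs.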

Constructor Evaluation: do_constexpr_ctor

sub_6480F0 (1,659 lines) evaluates constructor calls during constexpr. It implements the full C++ construction sequence:

int do_constexpr_ctor(interp_state *a1, func_info *ctor,
                      expr_node **args, value_buf *result,
                      address_t *target_addr) {
    class_type *cls = ctor->parent_class;

    // 1. Initialize virtual base classes (if most-derived)
    for (vbase in cls->virtual_bases) {
        address_t vbase_addr = target_addr + vbase.offset;
        if (vbase.has_initializer) {
            if (!do_constexpr_expression(a1, vbase.init, &val, &vbase_addr))
                return 0;
        } else {
            init_subobject_to_zero(vbase_addr, vbase.type);  // sub_62C030
        }
    }

    // 2. Initialize non-virtual base classes
    for (base in cls->bases) {
        address_t base_addr = target_addr + base.offset;
        if (base.has_ctor_call) {
            if (!do_constexpr_ctor(a1, base.ctor, base.args,
                                   &val, &base_addr))
                return 0;
        }
    }

    // 3. Initialize data members (in declaration order)
    for (member in cls->members) {
        address_t mem_addr = target_addr + member.offset;
        if (member.has_mem_initializer) {
            // From constructor's member-initializer-list
            if (!do_constexpr_expression(a1, member.init, &val, &mem_addr))
                return 0;
        } else if (member.has_default_initializer) {
            // From in-class default member initializer
            if (!do_constexpr_expression(a1, member.default_init,
                                         &val, &mem_addr))
                return 0;
        } else {
            // Default-initialize (zero for trivial types)
            init_subobject_to_zero(mem_addr, member.type);
        }
    }

    // 4. Execute constructor body (if non-trivial)
    if (ctor->has_body) {
        if (!do_constexpr_statement(a1, ctor->body, result))
            return 0;
    }

    // 5. Handle delegating constructors
    if (ctor->is_delegating) {
        return do_constexpr_ctor(a1, ctor->delegate_target,
                                 args, result, target_addr);
    }

    // 6. For trivial copy/move, use memcpy optimization
    if (ctor->is_trivial_copy) {
        copy_interpreter_subobject(target_addr, source_addr, cls);  // sub_6337D0
        return 1;
    }

    return 1;
}

Loop Evaluation: do_constexpr_range_based_for_statement

sub_644580 (2,836 lines) evaluates all loop constructs during constexpr: for, while, do-while, and range-based for. It is self-recursive for nested loops.

int do_constexpr_range_based_for_statement(
        interp_state *a1, stmt_node *loop, value_buf *result) {

    // --- Range-based for ---
    if (loop->kind == RANGE_FOR) {
        // 1. Evaluate range expression: auto&& __range = <expr>
        value_buf range_val;
        if (!do_constexpr_expression(a1, loop->range_expr, &range_val, NULL))
            return 0;

        // 2. Evaluate begin() and end()
        value_buf begin_val, end_val;
        if (!do_constexpr_call(a1, loop->begin_call, &begin_val, NULL))
            return 0;
        if (!do_constexpr_call(a1, loop->end_call, &end_val, NULL))
            return 0;

        // 3. Loop: while (begin != end)
        while (true) {
            // Evaluate condition: begin != end
            value_buf cond;
            if (!do_constexpr_expression(a1, loop->condition, &cond, NULL))
                return 0;
            if (!cond.int_val)
                break;  // loop finished

            // Bind loop variable: auto x = *begin
            value_buf elem;
            if (!do_constexpr_expression(a1, loop->deref_expr, &elem, NULL))
                return 0;
            bind_parameter(a1, loop->loop_var, &elem);

            // Execute loop body (the binary routes this through the
            // loop-body helper sub_6593C0, shown here as a statement call)
            int body_result = do_constexpr_statement(
                a1, loop->body, result);

            if (body_result == BREAK)   break;
            if (body_result == RETURN)  return body_result;
            // CONTINUE falls through to increment

            // Increment iterator: ++begin
            if (!do_constexpr_expression(a1, loop->increment, &begin_val, NULL))
                return 0;

            // Destroy loop variable for this iteration
            cleanup_iteration(a1, loop->loop_var);  // sub_658CE0
        }
        return 1;
    }

    // --- Traditional for/while/do-while ---
    if (loop->kind == FOR_LOOP) {
        // Initialize
        if (loop->init_stmt)
            do_constexpr_statement(a1, loop->init_stmt, NULL);

        while (true) {
            // Condition
            if (loop->condition) {
                value_buf cond;
                do_constexpr_expression(a1, loop->condition, &cond, NULL);
                if (!cond.int_val) break;
            }
            // Body
            int r = do_constexpr_statement(a1, loop->body, result);
            if (r == BREAK)  break;
            if (r == RETURN) return r;
            // Increment
            if (loop->increment)
                do_constexpr_expression(a1, loop->increment, NULL, NULL);
        }
    }
    return 1;
}

The loop body evaluation is delegated to sub_6593C0 (816 lines), which handles per-iteration variable binding, break/continue/return propagation, and destruction of loop-scoped temporaries.

Statement Evaluation: do_constexpr_statement

sub_647850 (509 lines) evaluates compound statements, declarations, branches, and switch statements during constexpr:

int do_constexpr_statement(interp_state *a1, stmt_node *stmt,
                           value_buf *result) {
    switch (stmt->kind) {
    case COMPOUND:
        // Push scope, evaluate each sub-statement, pop scope
        a1->scope_depth++;
        for (s in stmt->children) {
            int r = do_constexpr_statement(a1, s, result);
            if (r == RETURN || r == BREAK || r == CONTINUE)
                { a1->scope_depth--; return r; }
        }
        a1->scope_depth--;
        return OK;

    case DECLARATION:
        // Allocate interpreter storage, evaluate initializer
        return do_constexpr_init_variable(a1, stmt->decl, result);

    case IF_STMT: {
        value_buf cond;
        do_constexpr_expression(a1, stmt->condition, &cond, NULL);
        if (cond.int_val)
            return do_constexpr_statement(a1, stmt->then_branch, result);
        else if (stmt->else_branch)
            return do_constexpr_statement(a1, stmt->else_branch, result);
        return OK;
    }

    case SWITCH_STMT: {
        value_buf switch_val;
        do_constexpr_expression(a1, stmt->condition, &switch_val, NULL);
        // Find matching case label
        case_label = find_case(stmt->cases, switch_val.int_val);
        // A break inside the switch body ends the switch, not an enclosing loop
        int r = do_constexpr_statement(a1, case_label->body, result);
        return (r == BREAK) ? OK : r;
    }

    case RETURN_STMT:
        if (stmt->return_expr)
            do_constexpr_expression(a1, stmt->return_expr, result, NULL);
        return RETURN;

    case FOR_STMT: case WHILE_STMT: case DO_STMT: case RANGE_FOR:
        return do_constexpr_range_based_for_statement(  // sub_644580
            a1, stmt, result);

    case BREAK_STMT:    return BREAK;
    case CONTINUE_STMT: return CONTINUE;

    case TRY_STMT:
        // try/catch in constexpr (C++26 direction, partially supported)
        ...
    }
}
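
The status-code protocol (a statement reports OK, BREAK, CONTINUE, or RETURN, and each enclosing construct decides what to consume) is easier to see in a toy interpreter. Everything in this sketch is hypothetical: the statement kinds, the `env` structure, and the `exec` function are invented for illustration and do not correspond to the binary's statement nodes.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ST_OK, ST_BREAK, ST_CONTINUE, ST_RETURN } status;

/* Hypothetical toy statement kinds. */
typedef enum {
    S_COMPOUND, S_WHILE_TRUE, S_INCR, S_ADD_TO_SUM,
    S_BREAK_IF_GE, S_CONTINUE_IF_ODD, S_RETURN_IF_EQ
} stmt_kind;

typedef struct stmt {
    stmt_kind kind;
    int operand;                    /* threshold for the conditional kinds */
    const struct stmt *children;    /* sub-statements or loop body         */
    size_t child_count;
} stmt;

typedef struct env { int i; int sum; int scope_depth; } env;

static status exec(env *e, const stmt *s) {
    assert(s != NULL);
    switch (s->kind) {
    case S_COMPOUND: {
        status r = ST_OK;
        e->scope_depth++;                         /* push scope */
        /* Stop at the first non-OK status and hand it to the parent. */
        for (size_t k = 0; k < s->child_count && r == ST_OK; k++)
            r = exec(e, &s->children[k]);
        e->scope_depth--;                         /* pop on every path */
        return r;
    }
    case S_WHILE_TRUE:
        for (;;) {
            status r = exec(e, &s->children[0]);
            if (r == ST_BREAK)  return ST_OK;     /* the loop consumes break */
            if (r == ST_RETURN) return ST_RETURN; /* return keeps unwinding  */
            /* ST_OK and ST_CONTINUE both start the next iteration */
        }
    case S_INCR:            e->i++;           return ST_OK;
    case S_ADD_TO_SUM:      e->sum += e->i;   return ST_OK;
    case S_BREAK_IF_GE:     return e->i >= s->operand ? ST_BREAK : ST_OK;
    case S_CONTINUE_IF_ODD: return (e->i & 1) ? ST_CONTINUE : ST_OK;
    case S_RETURN_IF_EQ:    return e->i == s->operand ? ST_RETURN : ST_OK;
    }
    return ST_OK;
}
```

A compound statement never interprets BREAK, CONTINUE, or RETURN itself. It only restores its scope depth and forwards the status, which is why the loop is the first construct that can absorb a break.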

Builtin Function Evaluation: do_constexpr_builtin_function

sub_651150 (5,032 lines) evaluates compiler intrinsics and __builtin_* functions at compile time. It dispatches on the builtin function ID (a 16-bit value at *(a2+168)), using a sparse comparison tree rather than a dense switch table.

int do_constexpr_builtin_function(
        interp_state *a1,
        func_desc    *a2,       // function descriptor
        value_buf   **a3,       // argument array
        value_buf    *a4,       // result buffer
        int          *a5) {     // success/failure output

    uint16_t builtin_id = *(a2 + 168);

    // --- Arithmetic overflow detection ---
    // __builtin_add_overflow, __builtin_sub_overflow, __builtin_mul_overflow
    if (builtin_id == BUILTIN_ADD_OVERFLOW) {
        int64_t a = a3[0]->int_val, b = a3[1]->int_val;
        bool overflow;
        int64_t result = checked_add(a, b, &overflow);
        a3[2]->int_val = result;     // write to output parameter
        a4->int_val = overflow ? 1 : 0;
        return 1;
    }

    // --- Bit manipulation ---
    // __builtin_clz, __builtin_ctz, __builtin_popcount, __builtin_parity
    if (builtin_id == BUILTIN_CLZ) {
        uint64_t val = a3[0]->int_val;
        if (val == 0) { emit_error(61); return 0; }  // UB: clz(0)
        a4->int_val = __builtin_clzll(val);
        return 1;
    }
    if (builtin_id == BUILTIN_POPCOUNT) {
        a4->int_val = __builtin_popcountll(a3[0]->int_val);
        return 1;
    }
    if (builtin_id == BUILTIN_BSWAP32) {
        a4->int_val = __builtin_bswap32((uint32_t)a3[0]->int_val);
        return 1;
    }

    // --- String operations ---
    // __builtin_strlen, __builtin_strcmp, __builtin_memcmp,
    // __builtin_strchr, __builtin_memchr
    if (builtin_id == BUILTIN_STRLEN) {
        char *str = get_interpreter_string(a1, a3[0]);
        a4->int_val = strlen(str);
        return 1;
    }
    if (builtin_id == BUILTIN_STRCMP) {
        char *s1 = get_interpreter_string(a1, a3[0]);
        char *s2 = get_interpreter_string(a1, a3[1]);
        a4->int_val = strcmp(s1, s2);
        return 1;
    }

    // --- Floating-point classification ---
    // __builtin_isnan, __builtin_isinf, __builtin_isfinite,
    // __builtin_fpclassify, __builtin_huge_val, __builtin_nan
    if (builtin_id == BUILTIN_ISNAN) {
        a4->int_val = isnan(a3[0]->float_val) ? 1 : 0;
        return 1;
    }
    if (builtin_id == BUILTIN_NAN) {
        char *tag = get_interpreter_string(a1, a3[0]);
        a4->float_val = nan(tag);
        return 1;
    }

    // --- C++20/23 bit operations ---
    // std::bit_cast (via __builtin_bit_cast)
    if (builtin_id == BUILTIN_BIT_CAST) {
        // Serialize source object to target-format bytes
        translate_interpreter_object_to_target_bytes(  // sub_62A490
            a1, a3[0], byte_buffer);
        // Deserialize into destination type
        translate_target_bytes_to_interpreter_object(  // sub_62C670
            a1, byte_buffer, a4, dst_type);
        return 1;
    }

    // --- Type traits ---
    // __is_constant_evaluated()
    if (builtin_id == BUILTIN_IS_CONSTANT_EVALUATED) {
        a4->int_val = 1;  // always true inside constexpr evaluator
        return 1;
    }

    // --- Memory operations ---
    // __builtin_memcpy, __builtin_memmove
    if (builtin_id == BUILTIN_MEMCPY) {
        // Copy N bytes between interpreter objects
        copy_interpreter_bytes(a3[0]->address, a3[1]->address,
                               a3[2]->int_val);
        *a4 = *a3[0];  // return dest pointer
        return 1;
    }

    // ... 50+ additional builtin categories ...

    emit_error(2721);  // builtin not evaluable at compile time
    return 0;
}

Builtin Categories Summary

Category	Examples	Count
Arithmetic overflow	`__builtin_add_overflow`, `__builtin_mul_overflow`	3
Bit manipulation	`__builtin_clz`, `__builtin_ctz`, `__builtin_popcount`, `__builtin_bswap`	8+
String operations	`__builtin_strlen`, `__builtin_strcmp`, `__builtin_memcmp`, `__builtin_strchr`	6+
Math/FP classify	`__builtin_isnan`, `__builtin_isinf`, `__builtin_huge_val`, `__builtin_nan`	8+
Type queries	`__is_constant_evaluated`, `__has_unique_object_representations`	4+
Memory operations	`__builtin_memcpy`, `__builtin_memmove`	3+
C++20/23 `<bit>`	`std::bit_cast`, `std::bit_ceil`, `std::bit_floor`, `std::countl_zero`	8+
Atomic (limited)	Constexpr-evaluable atomic subset	2+

Destructor Evaluation

Two functions handle constexpr destructor calls, splitting responsibilities:

do_constexpr_dtor variant 1 (sub_64EFE0, 503 lines) -- Evaluates the destructor body itself. Runs the user-written destructor code, then destroys members in reverse declaration order.

do_constexpr_dtor variant 2 / perform_destructions (sub_64FB10, 877 lines) -- Handles the full destruction sequence, including base class destructors and array element destruction. The same function body also carries perform_destructions, the post-evaluation cleanup that destroys all constexpr-created objects when their scope ends.

Materialization: Interpreter Objects to IL Constants

After constexpr evaluation completes, the interpreter's internal objects must be converted back into IL constant nodes that the rest of the compiler can consume.

copy_interpreter_object_to_constant

sub_631110 (1,444 lines) traverses the interpreter's memory representation of an object and builds the corresponding IL constant tree:

il_node *copy_interpreter_object_to_constant(
        interp_state *a1, address_t obj_addr, type_node *type) {
    int kind = type->kind;

    switch (kind) {
    case 2: case 13:  // integer, enum
        return make_integer_constant(load_int(obj_addr), type);

    case 3: case 4:   // float, double
        return make_float_constant(load_float(obj_addr), type);

    case 1:           // pointer
        if (is_null_pointer(obj_addr))
            return make_null_pointer_constant(type);
        // Non-null: build address expression with relocation
        return make_address_constant(
            translate_interpreter_offset(obj_addr),  // inline helper
            type);

    case 6: case 9: case 10: case 11: {  // class/struct (6) and class variants (9-11)
        il_node *result = make_aggregate_constant(type);
        // Recursively convert each member
        for (member in get_members(type)) {
            address_t mem_addr = obj_addr + member.offset;
            il_node *mem_val = copy_interpreter_object_to_constant(
                a1, mem_addr, member.type);
            add_member_to_aggregate(result, mem_val);
        }
        return result;
    }

    case 8: {         // array
        il_node *result = make_array_constant(type);
        for (int i = 0; i < array_dimension(type); i++) {
            address_t elem_addr = obj_addr + i * elem_size;
            il_node *elem = copy_interpreter_object_to_constant(
                a1, elem_addr, elem_type);
            add_element_to_array(result, elem);
        }
        return result;
    }
    }
}

This function also contains get_reflection_string_entry and translate_interpreter_offset as inlined helpers -- the former handles C++26 reflection string extraction, and the latter converts interpreter memory addresses into IL address expressions with proper relocations.

extract_value_from_constant (reverse direction)

sub_64B580 (2,299 lines) performs the inverse: given an IL constant node (from a previously evaluated constexpr), it extracts the value into the interpreter's internal representation. This is used when a constexpr function references another constexpr variable whose value was already computed.

`__builtin_bit_cast` Support

Two functions implement the byte-level serialization needed for std::bit_cast:

translate_interpreter_object_to_target_bytes (sub_62A490, 461 lines) -- Serializes an interpreter object to a target-format byte sequence. Must handle endianness conversion, padding bytes, and bitfield layout according to the target ABI.

translate_target_bytes_to_interpreter_object (sub_62C670, 529 lines) -- Deserializes target-format bytes back into an interpreter object. Validates that the source bytes represent a valid value for the destination type (e.g., no trap representations for bool).
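The round trip described above can be illustrated with a minimal sketch. It assumes a little-endian target and a one-byte bool whose only valid values are 0 and 1. All three function names are hypothetical and are not taken from the binary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: serialize a 32-bit value into little-endian target
   bytes, independent of the host's own byte order. */
void u32_to_target_bytes(uint32_t value, unsigned char out[4])
{
    assert(out != NULL);
    for (size_t i = 0; i < 4; i++)
        out[i] = (unsigned char)((value >> (8 * i)) & 0xFFu);
}

/* Hypothetical helper: rebuild the value from little-endian target bytes. */
uint32_t u32_from_target_bytes(const unsigned char in[4])
{
    assert(in != NULL);
    uint32_t value = 0;
    for (size_t i = 0; i < 4; i++)
        value |= (uint32_t)in[i] << (8 * i);
    return value;
}

/* Hypothetical helper: accept a byte as a bool only if it is 0 or 1.
   Returns true and stores the value on success, false on rejection. */
bool bool_from_target_byte(unsigned char byte, bool *out)
{
    assert(out != NULL);
    if (byte > 1)
        return false;           /* not a valid bool representation */
    *out = (byte == 1);
    return true;
}
```

The validation step matters only in the deserializing direction: any byte pattern is a valid integer, but a destination type such as bool accepts only a subset of patterns.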

C++20 Constexpr Memory Support

std::allocator::allocate

sub_62B100 (do_constexpr_std_allocator_allocate, 177 lines) -- Handles new expressions in constexpr context. Allocates from the interpreter arena, sets the allocation-chain flag (bit 2), and links the allocation into the tracking chain.

std::allocator::deallocate

sub_62B470 (do_constexpr_std_allocator_deallocate, 195 lines) -- Handles delete in constexpr context. Validates the pointer was allocated by std::allocator::allocate() by searching the allocation chain (qword_126FBC0 / qword_126FBB8).
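The allocate/deallocate pairing can be pictured as a singly linked tracking chain. The sketch below is an illustration only: it uses malloc in place of the interpreter arena, and the record layout and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record for one tracked constexpr allocation. */
typedef struct alloc_record {
    void *ptr;
    struct alloc_record *next;
} alloc_record;

/* Allocate `size` bytes and push a record onto the chain at *head.
   Returns NULL if either allocation fails. */
void *tracked_allocate(alloc_record **head, size_t size)
{
    assert(head != NULL);
    alloc_record *rec = malloc(sizeof *rec);
    void *ptr = malloc(size ? size : 1);
    if (rec == NULL || ptr == NULL) {
        free(rec);
        free(ptr);
        return NULL;
    }
    rec->ptr = ptr;
    rec->next = *head;
    *head = rec;
    return ptr;
}

/* Free `ptr` only if the chain contains it. Returns false otherwise,
   which is where an interpreter would report a diagnostic. */
bool tracked_deallocate(alloc_record **head, void *ptr)
{
    assert(head != NULL);
    for (alloc_record **link = head; *link != NULL; link = &(*link)->next) {
        if ((*link)->ptr == ptr) {
            alloc_record *rec = *link;
            *link = rec->next;      /* unlink before freeing */
            free(rec->ptr);
            free(rec);
            return true;
        }
    }
    return false;
}
```

Searching the chain before freeing is what lets the deallocation path reject pointers that did not come from the matching allocation call.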

std::construct_at

sub_64F920 (do_constexpr_std_construct_at, 108 lines) -- Handles std::construct_at() (C++20). Validates the target pointer, then delegates to do_constexpr_ctor for actual construction.

C++26 Reflection Support

EDG 6.6 includes experimental support for the P2996 compile-time reflection proposal. Eight dedicated functions implement std::meta::* operations:

Function	Address	Lines	Reflection operation
`do_constexpr_std_meta_substitute`	`sub_628510`	526	`std::meta::substitute()` -- template argument substitution
`do_constexpr_std_meta_enumerators_of`	`sub_62EB00`	342	`std::meta::enumerators_of()` -- enum value list
`do_constexpr_std_meta_subobjects_of`	`sub_62F0B0`	434	`std::meta::subobjects_of()` -- all subobjects
`do_constexpr_std_meta_bases_of`	`sub_62F7B0`	339	`std::meta::bases_of()` -- base class list
`do_constexpr_std_meta_nonstatic_data_members_of`	`sub_62FD30`	308	`std::meta::nonstatic_data_members_of()`
`do_constexpr_std_meta_static_data_members_of`	`sub_630280`	308	`std::meta::static_data_members_of()`
`do_constexpr_std_meta_members_of`	`sub_6307E0`	590	`std::meta::members_of()` -- all members
`do_constexpr_std_meta_define_class`	`sub_65DE10`	553	`std::meta::define_class()` -- class synthesis

These functions operate on "infovecs" -- information vectors created by make_infovec (sub_62E1B0, 241 lines) that encode reflection metadata as interpreter-internal objects. The get_interpreter_string and get_interpreter_string_length helpers (inlined within sub_65DE10) extract string values from these infovecs for operations that take string parameters (member names, type names).

The define_class operation is particularly notable: it allows constexpr code to synthesize entirely new class types at compile time, a capability that goes beyond simple introspection.

CUDA-Specific Constexpr Behavior

The interpreter checks several global flags to relax standard constexpr restrictions for CUDA device code:

Global	Purpose
`dword_106C2C0`	Controls `reinterpret_cast` semantics in device constexpr
`dword_106C1D8`	Controls pointer dereference behavior (likely `--expt-relaxed-constexpr`)
`dword_106C1E0`	Controls `typeid` availability in device constexpr
`dword_126EFAC`	CUDA mode flag (enables/disables constexpr relaxations)
`dword_126EFA4`	Secondary CUDA mode flag (combined with `dword_126EFAC` for fine control)

Standard C++ forbids reinterpret_cast, typeid, and certain pointer operations in constexpr contexts. CUDA relaxes these restrictions because GPU programming patterns frequently require type punning and address manipulation that the standard deems non-constant. When these flags are set, the interpreter suppresses the corresponding error codes and evaluates the expression as if it were permitted.

Language Version Gates

Global	Check	Meaning
`qword_126EF98`	`> 0x222DF` (139,999)	C++20 features enabled (standard 202002)
`qword_126EF98`	`> 0x15F8F` (89,999)	C++14 features enabled (standard 201402)
`dword_126EFB4`	`== 2`	Full C++20+ compilation mode
`dword_126EF68`	`>= 202001`	C++20 constexpr dynamic allocation enabled

These version checks gate features like constexpr new/delete (C++20), constexpr dynamic_cast and typeid (C++20), and constexpr virtual dispatch (C++20).

Error Codes

The interpreter emits detailed diagnostics when constexpr evaluation fails. Each error code identifies a specific category of failure:

Error	Meaning
61	Undefined behavior detected (division by zero, `clz(0)`, etc.)
269	Virtual function called is not constexpr
286	Pure virtual function called
2691	Invalid pointer comparison direction
2692	Array bounds violation
2700	Access to uninitialized object
2701	Constexpr evaluation exceeded depth limit
2707	Integer overflow in type conversion
2708	Arithmetic overflow in computation
2721	Expression is not a constant expression
2725	Type too large for constexpr evaluation (> 64MB)
2727	Invalid type conversion in constexpr
2728	Floating-point conversion overflow
2734	Invalid pointer comparison (different complete objects)
2735	Pointer arithmetic out of bounds
2751	Null pointer dereference
2760	Pointer-to-member dereference failure
2766	Null pointer arithmetic
2808	Class too large for constexpr representation
2823	Constexpr function body not available
2879	`offsetof` on invalid member
2921	Direct value return failure
2938	Virtual base class offset not found
2955	Statement expression evaluation failure
2993	Object lifetime violation
2999	Variable-length array in constexpr
3007	Pointer-to-member comparison failure
3024	Dynamic initialization order issue
3248	Member access on uninitialized object
3312	Object representation mismatch (`bit_cast`)

Supporting Functions

Value Management

Function	Address	Lines	Purpose
`f_value_bytes_for_type`	`sub_628DE0`	843	Compute interpreter storage size for a type
`init_subobject_to_zero`	`sub_62C030`	284	Zero-initialize a constexpr subobject
`mark_mutable_members_not_initialized`	`sub_62D0F0`	203	Mark mutable members after copy
Copy scalar value	`sub_62B8A0`	61	Assign scalar value to interpreter object
Load value	`sub_64EA30`	293	Load value from interpreter object into buffer
Check initialized	`sub_62BF60`	55	Validate interpreter object is initialized

Object Addressing

Function	Address	Lines	Purpose
`find_subobject_for_interpreter_address`	`sub_629D30`	334	Map address to subobject identity
`obj_type_at_address`	`sub_62A210`	133	Most-derived type at an address
`get_runtime_array_pos`	`sub_6341C0`	224	Array element index for a pointer
`last_subobject_path_link`	`sub_6345D0`	21	Tail of subobject path chain
`get_trailing_subobject_path_entry`	`sub_634630`	82	Trailing subobject for virtual bases
Copy subobject	`sub_6337D0`	379	Copy subobject between interpreter addresses
Validate subobject path	`sub_62B980`	314	Recursive validation of class hierarchy traversal

Condition and Allocation

Function	Address	Lines	Purpose
`do_constexpr_condition`	`sub_658EE0`	302	Evaluate if/while/for condition
`do_constexpr_condition_alloc`	`sub_62D810`	187	Allocate storage for condition result
`do_constexpr_init_variable`	`sub_6509E0`	427	Initialize local variable in constexpr
Allocate value slot	`sub_62D4F0`	183	Allocate and init a value slot in arena
Release allocation chain	`sub_633EC0`	157	Free tracked constexpr allocations

Dynamic Initialization and Lambdas

Function	Address	Lines	Purpose
`do_constexpr_dynamic_init`	`sub_64A040`	1,111	Dynamic initialization of constexpr variables
`do_constexpr_lambda`	(within `sub_64A040`)	--	Lambda capture evaluation
`do_array_constructor_copy`	(within `sub_64A040`)	--	Array construction via copy ctor

Debug and Diagnostics

Function	Address	Lines	Purpose
Format constexpr value	`sub_632E80`	268	Format value for error messages
Dump constexpr value	`sub_6333E0`	166	`fprintf`-based debug dump

Complete Function Map

Address	Lines	Identity	Confidence
`sub_628180`	237	Init/entry wrapper	MEDIUM
`sub_628510`	526	`do_constexpr_std_meta_substitute`	HIGH (95%)
`sub_628DE0`	843	`f_value_bytes_for_type`	VERY HIGH (99%)
`sub_629D30`	334	`find_subobject_for_interpreter_address`	VERY HIGH (99%)
`sub_62A210`	133	`obj_type_at_address`	VERY HIGH (99%)
`sub_62A490`	461	`translate_interpreter_object_to_target_bytes`	VERY HIGH (99%)
`sub_62AD90`	194	Allocate interpreter value storage	HIGH (85%)
`sub_62B100`	177	`do_constexpr_std_allocator_allocate`	VERY HIGH (99%)
`sub_62B470`	195	`do_constexpr_std_allocator_deallocate`	VERY HIGH (99%)
`sub_62B8A0`	61	Copy scalar value	HIGH (85%)
`sub_62B980`	314	Validate/traverse subobject path	HIGH (80%)
`sub_62BF60`	55	Validate initialization state	HIGH (85%)
`sub_62C030`	284	`init_subobject_to_zero`	VERY HIGH (99%)
`sub_62C670`	529	`translate_target_bytes_to_interpreter_object`	VERY HIGH (99%)
`sub_62D0F0`	203	`mark_mutable_members_not_initialized`	VERY HIGH (99%)
`sub_62D4F0`	183	Allocate constexpr value slot	HIGH (80%)
`sub_62D810`	187	`do_constexpr_condition_alloc`	VERY HIGH (99%)
`sub_62DB00`	132	Get value type size (wrapper)	HIGH (80%)
`sub_62DD10`	242	Builtin dispatch helper	MEDIUM (70%)
`sub_62E1B0`	241	`make_infovec`	VERY HIGH (99%)
`sub_62E670`	276	Init/entry wrapper	MEDIUM (60%)
`sub_62EB00`	342	`do_constexpr_std_meta_enumerators_of`	VERY HIGH (99%)
`sub_62F0B0`	434	`do_constexpr_std_meta_subobjects_of`	VERY HIGH (99%)
`sub_62F7B0`	339	`do_constexpr_std_meta_bases_of`	VERY HIGH (99%)
`sub_62FD30`	308	`do_constexpr_std_meta_nonstatic_data_members_of`	VERY HIGH (99%)
`sub_630280`	308	`do_constexpr_std_meta_static_data_members_of`	VERY HIGH (99%)
`sub_6307E0`	590	`do_constexpr_std_meta_members_of`	VERY HIGH (99%)
`sub_631110`	1,444	`copy_interpreter_object_to_constant`	VERY HIGH (99%)
`sub_632CB0`	36	Create reflection string object	MEDIUM (70%)
`sub_632D80`	64	`get_reflection_string_entry` helper	HIGH (85%)
`sub_632E80`	268	Format constexpr value for diagnostics	MEDIUM (65%)
`sub_6333E0`	166	Dump constexpr value (debug)	MEDIUM (65%)
`sub_6337D0`	379	Copy interpreter subobject	HIGH (85%)
`sub_633EC0`	157	Release allocation chain	HIGH (80%)
`sub_6341C0`	224	`get_runtime_array_pos`	VERY HIGH (99%)
`sub_6345D0`	21	`last_subobject_path_link`	VERY HIGH (99%)
`sub_634630`	82	`get_trailing_subobject_path_entry`	VERY HIGH (99%)
`sub_634740`	11,205	`do_constexpr_expression`	ABSOLUTE (100%)
`sub_643C50`	202	Prepare constexpr callee	HIGH (85%)
`sub_643FE0`	305	`eval_constexpr_callee`	VERY HIGH (99%)
`sub_644580`	2,836	`do_constexpr_range_based_for_statement`	VERY HIGH (99%)
`sub_647850`	509	`do_constexpr_statement`	HIGH (90%)
`sub_6480F0`	1,659	`do_constexpr_ctor`	VERY HIGH (99%)
`sub_64A040`	1,111	`do_constexpr_dynamic_init` / `do_constexpr_lambda`	VERY HIGH (99%)
`sub_64B580`	2,299	`extract_value_from_constant`	VERY HIGH (99%)
`sub_64DFA0`	86	Destructor chain walker	HIGH (80%)
`sub_64E170`	404	Perform destruction sequence	HIGH (85%)
`sub_64E9E0`	26	Predicate / flag check	MEDIUM (65%)
`sub_64EA30`	293	Load value from interpreter object	HIGH (85%)
`sub_64EFE0`	503	`do_constexpr_dtor` (variant 1)	VERY HIGH (99%)
`sub_64F8F0`	14	Trivial forwarding wrapper	MEDIUM (60%)
`sub_64F920`	108	`do_constexpr_std_construct_at`	VERY HIGH (99%)
`sub_64FB10`	877	`do_constexpr_dtor` (v2) / `perform_destructions`	VERY HIGH (99%)
`sub_6509E0`	427	`do_constexpr_init_variable`	VERY HIGH (99%)
`sub_651150`	5,032	`do_constexpr_builtin_function`	ABSOLUTE (100%)
`sub_657560`	1,445	`do_constexpr_call`	VERY HIGH (99%)
`sub_658CE0`	134	Loop iteration cleanup	HIGH (80%)
`sub_658EE0`	302	`do_constexpr_condition`	VERY HIGH (99%)
`sub_6593C0`	816	Loop body evaluator	HIGH (85%)
`sub_65A290`	311	Entry from expression lowering	MEDIUM (70%)
`sub_65A8C0`	274	Entry from expression trees	MEDIUM (70%)
`sub_65AE50`	572	`interpret_expr` (primary entry)	VERY HIGH (99%)
`sub_65BAB0`--`sub_65D150`	150-470	Misc entry points	MEDIUM (70%)
`sub_65CFA0`	67	`interpret_dynamic_sub_initializers`	VERY HIGH (99%)
`sub_65D9A0`--`sub_65DD20`	7-68	Small utility/accessor functions	MEDIUM (65%)
`sub_65DE10`	553	`do_constexpr_std_meta_define_class`	VERY HIGH (99%)

Cross-References

EDG 6.6 Overview -- Position of interpret.c in the source tree
Type System -- The 22 type kinds that the interpreter evaluates
Template Engine -- Constexpr evaluation during template instantiation
IL Overview -- IL constant nodes that materialization produces
Diagnostics Overview -- Error message system for constexpr failures
Pipeline Overview -- Where constexpr evaluation sits in the compilation pipeline

Name Mangling

The name mangling subsystem in cudafe++ implements the Itanium C++ ABI name mangling specification, with NVIDIA-specific extensions for CUDA device lambda wrappers and host reference array registration. The mangling pipeline lives in lower_name.c (60+ functions spanning 0x69C980--0x6AB280) and produces the _Z prefixed symbols that appear in .int.c output and PTX. A separate CUDA-aware demangler at sub_7CABB0 (930 lines, statically linked, not EDG code) reverses the process with extensions for three NVIDIA vendor-specific mangled prefixes: Unvdl, Unvdtl, and Unvhdl. The glue between mangling and CUDA execution spaces is nv_get_full_nv_static_prefix in nv_transforms.c, which constructs scoped static prefixes for __global__ template stubs destined for host reference arrays.

Key Facts

Property	Value
Source file	`lower_name.c` (60+ functions), `nv_transforms.c` (prefix builder)
Address range	`0x69C980`--`0x6AB280` (mangling), `0x6BE300` (static prefix)
Demangler	`sub_7CABB0` (930 lines, NVIDIA custom, not EDG)
ABI standard	Itanium C++ ABI (IA-64), extended with NVIDIA vendor types
Operator name table	`sub_69C980` (`mangled_operator_name`), 47 entries
Entity mangler	`sub_6A1F00` (`mangle_entity_name`), ~1000 lines
Expression mangler	`sub_6A8B10` (`mangled_expression`), ~700 lines
Scalable vector mangler	`sub_69CF10` (`mangled_scalable_vector_name`), 170 lines
Static prefix builder	`sub_6BE300` (`nv_get_full_nv_static_prefix`), 370 lines
Output buffer	`qword_127FCC0` (dynamic buffer with capacity tracking)
Demangling mode flag	`qword_126ED90` (non-zero = demangling/diagnostic mode)
Compressed mangling flag	`dword_106BC7C` (ABI version control)
ABI version selector	`qword_126EF98` (selects vendor-specific vs standard codes)

Architecture Overview

Name mangling occurs at two distinct points in the cudafe++ pipeline:

Forward mangling (IL lowering): EDG's lower_name.c converts entity nodes into Itanium ABI mangled names during the IL-to-text code generation phase. The entry point is mangle_entity_name (sub_6A1F00), which dispatches through 60+ helper functions to handle every C++ construct -- namespaces, classes, templates, operators, expressions, lambdas, and vendor-extended types.
Reverse demangling (diagnostics): A statically linked demangler at sub_7CABB0 converts mangled names back to human-readable form for error messages and debug output. This demangler is not EDG code -- it is NVIDIA's custom implementation that wraps the standard Itanium ABI demangling algorithm with CUDA-specific extensions for device lambda wrapper types.

Entity Node (IL)
  |
  +-- sub_69FF70 (check_mangling_special_cases)
  |     Checks: extern "C", linkage name override, builtin
  |     If special case handled, done.
  |
  +-- sub_6A1F00 (mangle_entity_name)          ~1000 lines
  |     |
  |     +-- sub_69C980 (mangled_operator_name)  47 operators
  |     +-- sub_69E740 (mangle_type_encoding)   type dispatch
  |     +-- sub_6A3B00 (mangle_function_encoding)
  |     +-- sub_6A41A0 (mangle_declaration)
  |     +-- sub_6A4920 (mangle_template_parameter)
  |     +-- sub_6A5DC0 (mangle_abi_tags)        B<tag> encoding
  |     +-- sub_6A6AF0 (mangle_template_args)
  |     +-- sub_6A78B0 (mangle_complete_type)
  |     +-- sub_6A8390 (mangled_nested_name_component)
  |     +-- sub_6A85E0 (mangled_entity_reference)
  |     +-- sub_6A8B10 (mangled_expression)     ~700 lines
  |     +-- sub_6AB280 (mangled_encoding_for_sizeof)
  |
  +-- Output buffer: qword_127FCC0
        [reserved, capacity, write_pos, unused, buffer_ptr]

Operator Name Table (sub_69C980)

mangled_operator_name at 0x69C980 is a pure lookup function: it takes an operator kind byte and an arity flag, and returns a pointer to the Itanium ABI mangled operator code (two characters for the standard operators). The function covers 47 operator kinds, including C++20 co_await, the conditional operator ?:, and two vendor-extended min/max operators.

Assert: "mangled_operator_name: bad kind" at lower_name.c:11557.

Four operators are context-sensitive -- their mangled code depends on whether the usage is unary (arity a2==1) or binary:

Kind	Unary	Binary	C++ Operator
5	`ps`	`pl`	`+`
6	`ng`	`mi`	`-`
7	`de`	`ml`	`*`
11	`ad`	`an`	`&`
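A lookup for just these four kinds might look like the following sketch. The function name is hypothetical and the kind numbers are taken from the table above. Treating every arity other than 1 as binary is an assumption, since the text only states the a2==1 test for the unary case.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the two-character code for the four arity-sensitive operator
   kinds, or NULL for any other kind in the 1..47 range. */
const char *arity_sensitive_operator_code(int kind, int arity)
{
    assert(kind >= 1 && kind <= 47);    /* mirrors the "bad kind" assert */
    int unary = (arity == 1);
    switch (kind) {
    case 5:  return unary ? "ps" : "pl";    /* +x  vs  a + b */
    case 6:  return unary ? "ng" : "mi";    /* -x  vs  a - b */
    case 7:  return unary ? "de" : "ml";    /* *p  vs  a * b */
    case 11: return unary ? "ad" : "an";    /* &x  vs  a & b */
    default: return NULL;
    }
}
```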

Complete Operator Kind Table

Kind	Code	Operator	Kind	Code	Operator
1	`nw`	`new`	26	`ls`	`<<`
2	`dl`	`delete`	27	`rs`	`>>`
5	`ps`/`pl`	`+` (unary/binary)	28	`rS`	`>>=`
6	`ng`/`mi`	`-` (unary/binary)	29	`lS`	`<<=`
7	`de`/`ml`	`*` (unary/binary)	30	`eq`	`==`
9	`rm`	`%`	31	`ne`	`!=`
11	`ad`/`an`	`&` (unary/binary)	32	`le`	`<=`
12	`or`	`\|`	33	`ge`	`>=`
13	`co`	`~`	34	`ss`	`<=>`
14	`nt`	`!`	37	`pp`	`++`
16	`lt`	`<`	40	`pm`	`->*`
17	`gt`	`>`	41	`pt`	`->`
24	`aN`	`&=`	42	`cl`	`()`
43	`ix`	`[]`	44	`qu`	`?:`
45	`v23min`	vendor `min`	46	`v23max`	vendor `max`
47	`aw`	`co_await` (C++20)

Kinds 3, 4, 8, 10, 15, 18--23, 25, 35--36, 38--39 return pointers to .rodata string constants (unk_A7C560 etc.) that encode the remaining standard operators, among them dv, eo, aS, pL, mI, mL, dV, eO, aa, oo, mm, cm.

Note kinds 45 and 46: these are vendor-extended operators using the Itanium ABI form v<digit><source-name>, where <digit> is the operand count and <source-name> is a length-prefixed identifier. v23min therefore decomposes as v, then 2 (a binary operator), then 3min (the three-character name "min"); v23max decomposes the same way. The digits 2 and 3 are separate fields, not a single length prefix of 23.
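In the Itanium ABI grammar, a vendor extended operator is written v<digit><source-name>, with the digit giving the operand count and the source name carrying its own decimal length prefix. The sketch below formats such codes. The function name and signature are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats a vendor extended operator as v<digit><length><name>, where
   <digit> is the operand count. Returns 0 on success, -1 if the result
   does not fit in `out`. */
int format_vendor_operator(char *out, size_t out_size,
                           int operand_count, const char *name)
{
    assert(out != NULL && name != NULL);
    assert(operand_count >= 0 && operand_count <= 9);   /* single digit */
    int n = snprintf(out, out_size, "v%d%zu%s",
                     operand_count, strlen(name), name);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

With operand count 2 and name "min" this yields v23min. With operand count 1 it reproduces the v18__real__ and v17alignof codes that appear elsewhere in lower_name.c.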

Entity Name Mangling (sub_6A1F00)

mangle_entity_name at 0x6A1F00 is the master mangling function. It produces the complete Itanium ABI mangled name for any entity node. At roughly 1000 decompiled lines, it handles every C++ entity kind through a multi-level dispatch.

Demangling Mode Early Exit

The function begins with a demangling-mode check:

if (qword_126ED90) {          // demangling / diagnostic mode
    emit_char(1, output);     // '?'
    emit_string("?", output);
    return;
}

When qword_126ED90 is non-zero, the function emits "?" and returns immediately. This mode is used during diagnostic output when the compiler needs a placeholder rather than a real mangled name.

Pre-dispatch: Special Cases (sub_69FF70)

Before the main dispatch, sub_69FF70 (check_mangling_special_cases, 447 lines at 0x69FF70) screens for entities that bypass normal mangling:

Linkage name override: If the entity has an explicit asm("name") or [[gnu::alias("name")]], the override name is used directly.
extern "C" linkage: Returns the unmangled source name.
Builtin entities: Special-cased to avoid generating bogus mangled names.

Main Dispatch Structure

After special-case screening, mangle_entity_name dispatches on the entity kind byte at entity node offset +132:

Entity Kind	Handler	Encoding
Regular function	`sub_6A3B00` (`mangle_function_encoding`)	`_Z<encoding>`
Regular variable	Direct type mangling	`_Z<name><type>`
Namespace member	`sub_6A0740` (`mangle_namespace_prefix`)	`N<qual>..E`
Class member	`sub_6A0A80` (`mangle_class_prefix`)	`N<class><name>E`
Template specialization	`sub_6A6AF0` (`mangle_template_args`)	`I<args>E`
Operator function	`sub_69C980` (`mangled_operator_name`)	operator codes
Constructor/destructor	`sub_69FE30`	`C1`/`C2`/`C3`/`D0`/`D1`/`D2`
Lambda closure	Lambda-specific path	`Ul<sig>E<disc>_`
Local entity	`sub_69F830` (`mangle_local_name`)	`Z<func>E<entity>`
Special (vtable etc.)	`sub_69FBC0` (`mangle_special_name`)	`TV`/`TI`/`GV` etc.
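To make the N...E nested-name encoding concrete, the sketch below builds the mangled name of a namespace-scope variable from its qualified-name components, each emitted as an Itanium <source-name> (decimal length followed by the identifier). The function is a hypothetical illustration, not a reconstruction of sub_6A0740.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "_ZN" + <len><name> for each component + "E" into `out`.
   Returns 0 on success, -1 if `out` is too small. */
int mangle_nested_name(char *out, size_t out_size,
                       const char *const *components, size_t count)
{
    assert(out != NULL && components != NULL && out_size > 0);
    int n = snprintf(out, out_size, "_ZN");
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    size_t pos = (size_t)n;
    for (size_t i = 0; i < count; i++) {
        n = snprintf(out + pos, out_size - pos, "%zu%s",
                     strlen(components[i]), components[i]);
        if (n < 0 || (size_t)n >= out_size - pos)
            return -1;
        pos += (size_t)n;
    }
    n = snprintf(out + pos, out_size - pos, "E");
    return (n >= 0 && (size_t)n < out_size - pos) ? 0 : -1;
}
```

For the components foo and bar this produces _ZN3foo3barE, the mangled name of a variable foo::bar. A function would additionally append its parameter type encoding after the closing E.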

Type Encoding Subpipeline

Type mangling is handled by sub_69E740 (mangle_type_encoding, 177 lines at 0x69E740), which dispatches on type kind to produce Itanium ABI type codes:

Type	Code	Type	Code
`void`	`v`	`bool`	`b`
`char`	`c`	`signed char`	`a`
`unsigned char`	`h`	`short`	`s`
`unsigned short`	`t`	`int`	`i`
`unsigned int`	`j`	`long`	`l`
`unsigned long`	`m`	`long long`	`x`
`unsigned long long`	`y`	`float`	`f`
`double`	`d`	`long double`	`e`
`__int128`	`n`	`unsigned __int128`	`o`
`wchar_t`	`w`	`char8_t`	`Du`
`char16_t`	`Ds`	`char32_t`	`Di`
`_Float16`	`DF16_`	`__float128`	`g`
`std::nullptr_t`	`Dn`	`auto`	`Da`
`decltype(auto)`	`Dc`

Pointer and reference types are encoded with prefix qualifiers: P (pointer), R (lvalue reference), O (rvalue reference). CV-qualifiers use K (const), V (volatile), r (restrict).
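The prefix scheme composes from the outside in: a pointer to const int is P, then K, then i. A minimal sketch follows, with a hypothetical modifier enum and function name. The caller supplies modifiers outermost first and is responsible for the ABI's r, V, K ordering when several CV-qualifiers apply at the same level.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical modifier tags for the prefix letters described above. */
typedef enum {
    MOD_POINTER, MOD_LVALUE_REF, MOD_RVALUE_REF,
    MOD_CONST, MOD_VOLATILE, MOD_RESTRICT
} type_modifier;

static char modifier_code(type_modifier m)
{
    switch (m) {
    case MOD_POINTER:    return 'P';
    case MOD_LVALUE_REF: return 'R';
    case MOD_RVALUE_REF: return 'O';
    case MOD_CONST:      return 'K';
    case MOD_VOLATILE:   return 'V';
    case MOD_RESTRICT:   return 'r';
    }
    return '?';     /* unreachable for valid enum values */
}

/* Writes the modifiers (outermost first) followed by the base type code.
   Returns 0 on success, -1 if `out` is too small. */
int mangle_modified_type(char *out, size_t out_size,
                         const type_modifier *mods, size_t count,
                         const char *base_code)
{
    assert(out != NULL && base_code != NULL);
    assert(count == 0 || mods != NULL);
    size_t base_len = strlen(base_code);
    if (out_size < count + base_len + 1)
        return -1;
    for (size_t i = 0; i < count; i++)
        out[i] = modifier_code(mods[i]);
    memcpy(out + count, base_code, base_len + 1);   /* copies the NUL too */
    return 0;
}
```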

The builtin type mangler at sub_6A13A0 (396 lines) includes CUDA-specific type detection through dword_106C2C0 (GPU mode flag) to handle CUDA-extended types.

Substitution Mechanism

The Itanium ABI uses substitution sequences (S_, S0_, S1_, ...) to compress repeated type references. The substitution infrastructure in lower_name.c centers on:

sub_69F0D0 (mangle_substitution_check): Checks whether a type/name component has already been emitted and should use a substitution reference.
sub_69F150 (mangle_with_substitution, 87 lines): Handles S_ encoding, including the well-known substitutions Sa (std::allocator), Sb (std::basic_string), Ss (std::string), Si (std::istream), So (std::ostream), Sd (std::iostream).
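The S_, S0_, S1_ sequence follows the Itanium ABI <seq-id> rule: the first candidate is S_, and candidate n (for n >= 1) is S, then n-1 written in base 36 with digits 0-9 and uppercase A-Z, then an underscore. A sketch of that formatting, with a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>

/* Formats the substitution reference for the candidate at `index`
   (0-based order of first appearance). Returns 0 on success, -1 if
   `out` is too small. */
int format_substitution(char *out, size_t out_size, unsigned index)
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    char rev[16];           /* base-36 digits, least significant first */
    size_t len = 0;

    assert(out != NULL);
    if (index > 0) {
        unsigned value = index - 1;
        do {
            rev[len++] = digits[value % 36];
            value /= 36;
        } while (value != 0);
    }
    if (out_size < len + 3)     /* 'S' + digits + '_' + NUL */
        return -1;

    size_t pos = 0;
    out[pos++] = 'S';
    while (len > 0)
        out[pos++] = rev[--len];
    out[pos++] = '_';
    out[pos] = '\0';
    return 0;
}
```

Index 11 becomes SA_ and index 37 becomes S10_, which shows why a decimal formatter cannot be reused here.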

Template Argument Mangling

Template arguments are enclosed in I...E and handled by:

sub_69ED40 (mangle_template_args, 86 lines): Iterates the template argument list, emitting I prefix and E suffix.
sub_69EEE0 (mangle_template_arg, 109 lines): Mangles individual template arguments, dispatching between type arguments (direct type encoding), non-type arguments (expression or literal encoding), and template template arguments.
sub_6A4920 (mangle_template_parameter, 277 lines): Encodes template parameter references (T_, T0_, T1_, ...).
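Template parameter references use a similar off-by-one scheme, but with a decimal index: the first parameter is T_, the second T0_, the third T1_. The sketch below covers only the single-level case, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Formats a reference to the template parameter at 0-based `index`.
   Returns 0 on success, -1 if `out` is too small. */
int format_template_param(char *out, size_t out_size, unsigned index)
{
    assert(out != NULL);
    int n = (index == 0)
        ? snprintf(out, out_size, "T_")
        : snprintf(out, out_size, "T%u_", index - 1);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```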

ABI Tag Mangling (sub_6A5DC0)

sub_6A5DC0 (643 lines at 0x6A5DC0) handles [[gnu::abi_tag("...")]] attribute propagation per the Itanium ABI extensions. ABI tags are encoded as B<length><tag> suffixes and must be propagated through template instantiations and inline namespaces (e.g., std::__cxx11::basic_string with tag cxx11). This is one of the more complex mangling functions due to the transitive nature of tag propagation.
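The suffix itself is simple even though propagation is not. As a sketch, appending one tag to an already-mangled name component could look like this. The function name is hypothetical, and the transitive propagation logic is deliberately out of scope.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Appends B<length><tag> to the NUL-terminated string in `name`.
   Returns 0 on success; on overflow returns -1 and leaves `name` unchanged. */
int append_abi_tag(char *name, size_t name_size, const char *tag)
{
    assert(name != NULL && tag != NULL);
    size_t used = strlen(name);
    assert(used < name_size);
    int n = snprintf(name + used, name_size - used, "B%zu%s",
                     strlen(tag), tag);
    if (n < 0 || (size_t)n >= name_size - used) {
        name[used] = '\0';      /* restore the original string */
        return -1;
    }
    return 0;
}
```

Applied to the component 3bar with tag cxx11, the result is 3barB5cxx11, as seen in names such as _ZN3foo3barB5cxx11Ev.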

Constructor/Destructor Encoding (sub_69FE30)

Constructors and destructors use the Itanium ABI's multi-variant encoding:

Code	Meaning
`C1`	Complete object constructor
`C2`	Base object constructor
`C3`	Complete object allocating constructor
`D0`	Deleting destructor
`D1`	Complete object destructor
`D2`	Base object destructor

Special Name Mangling (sub_69FBC0)

sub_69FBC0 (125 lines) produces mangled names for compiler-generated symbols:

Prefix	Symbol
`_ZTV`	Virtual table
`_ZTT`	VTT structure (construction vtable index)
`_ZTI`	`typeinfo` structure
`_ZTS`	`typeinfo` name string
`_ZGV`	Guard variable for static initialization
`_ZTH`	Thread-local initialization function
`_ZTW`	Thread-local wrapper function

Expression Mangling (sub_6A8B10)

mangled_expression at 0x6A8B10 is the second-largest function in lower_name.c at roughly 700 decompiled lines. It produces the Itanium ABI encoding for arbitrary C++ expressions appearing in template arguments, noexcept specifications, and decltype contexts.

Assert: "mangled_encoding_for_expression_full" at lower_name.c:6870, "mangled_expr_operator_name: bad operator" at lower_name.c:11873, "mangled_call_operation" at lower_name.c:6132.

Expression Kind Dispatch

The function first calls sub_69E740 to classify the expression node, then dispatches on the expression kind byte at node offset +24:

Kind	Description	ABI Encoding
0	Error/unknown expression	`?` (demangling mode only)
1	Operator expression	Dispatches on operator byte at `+40`
2	Literal value	`L<type><value>E`
3	Entity reference	`L_Z<encoding>E` or substitution
4	Template parameter	`T_`/`T0_` etc.
5	sizeof/alignof/typeid/noexcept	Delegated to `sub_6AB280`
6	Cast expression	`sc`/`dc`/`rc`/`cv` prefix
7	Call expression	`cl<callee><args>E` or `cp<args>E`
8	Member access	`dt`/`pt` prefix
9	Conditional expression	`qu<cond><then><else>`
10	Pack expansion	`sp<pattern>`
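As an example of the literal form L<type><value>E, the sketch below encodes an int literal. The Itanium ABI writes negative numbers with an n prefix rather than a minus sign. The function name is hypothetical, and only the int type code i is handled.

```c
#include <assert.h>
#include <stdio.h>

/* Encodes an `int` literal as L<type><value>E with type code 'i'.
   Returns 0 on success, -1 if `out` is too small. */
int mangle_int_literal(char *out, size_t out_size, int value)
{
    assert(out != NULL);
    /* Compute the magnitude in unsigned arithmetic so INT_MIN is safe. */
    unsigned magnitude = (value < 0) ? 0u - (unsigned)value
                                     : (unsigned)value;
    int n = snprintf(out, out_size, "Li%s%uE",
                     (value < 0) ? "n" : "", magnitude);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```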

Operator Sub-dispatch (Kind 1)

When the expression is an operator expression, the function reads the operator byte at node offset +40 and performs a large switch covering 100+ cases. For standard binary and unary operators, it calls sub_69C980 (mangled_operator_name) to get the two-character ABI code, then recursively processes operands. Notable special cases:

Cast operators (kinds 0x05--0x13): Dispatches between sc (static_cast), dc (dynamic_cast), rc (reinterpret_cast), and cv (C-style cast) based on cast flags at node offset +25 and +42. The compressed mangling flag dword_106BC7C forces cv for all casts when set.
Vendor extensions (0x21, 0x22): __real__ and __imag__ complex number operations, encoded as v18__real__ and v18__imag__ using the vendor-extended operator format.
Increment/decrement (kinds 0x23--0x26): Pre/post increment (pp) and decrement (mm). In the Itanium ABI grammar the prefix forms carry the underscore (pp_, mm_), while the postfix forms use the bare pp and mm codes.
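Call expressions (kinds 0x69--0x6D, 0x16--0x17): Dispatches to mangled_call_operation, which determines the callee encoding and emits the cl (call) prefix or the cp prefix. In the Itanium ABI, cp marks a call whose callee name is parenthesized, which suppresses argument-dependent lookup.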

sizeof/alignof/typeid/noexcept (sub_6AB280)

mangled_encoding_for_sizeof at 0x6AB280 (130 lines) handles the sizeof-family of operators:

ABI Code	Operator	Variant
`sz`	`sizeof(expr)`	Expression operand
`st`	`sizeof(type)`	Type operand
`az`	`alignof(expr)`	Expression operand
`at`	`alignof(type)`	Type operand
`te`	`typeid(expr)`	Expression operand
`ti`	`typeid(type)`	Type operand
`nx`	`noexcept(expr)`	Expression operand

For older ABI versions (controlled by dword_106BC7C and qword_126EF98), the function emits vendor-specific codes v17alignof and v18alignofe instead of the standard at/az codes.

Scalable Vector Name Mangling (sub_69CF10)

mangled_scalable_vector_name at 0x69CF10 (170 lines) returns mangled names for ARM SVE and RISC-V V extension scalable vector types. EDG supports these types natively, and they must be mangled using the vendor-specific Itanium ABI encoding.

Assert: "mangled_scalable_vector_name" at lower_name.c:10473 and lower_name.c:10440.

The function dispatches on the type node's kind byte at offset +132:

Dispatch Logic

Kind 12 (elaborated type): Unwraps through the elaboration chain (offset +144 points to the underlying type).
Kind 3 (typedef/alias): Dispatches on subkind at offset +144:
- Subkind 1: svint variants (signed integer vectors)
- Subkind 2: svfloat variants (floating-point vectors)
- Subkind 4: svbool variants (predicate vectors)
- Subkind 9: svcount variants
Kind 18 (mfloat8): mfloat8x types for ML inference.
Kind 2 (plain vector): Dispatches on element type byte at offset +144, handling 8 element widths (cases 1--8).

Each type category has 4 mangling variants selected by the a2 parameter (values 1--4), corresponding to different vector widths or tuple sizes (e.g., svint8_t, svint8x2_t, svint8x3_t, svint8x4_t). The actual mangled strings are stored in .rodata pointer tables (off_A7E950 through off_A7EA18).

There is also special handling for svboolx4_t via sub_7A7220, which detects the specific boolean-tuple-of-4 predicate type and returns a dedicated mangling string.

Mangling Output Buffer

All mangling functions write into a shared output buffer managed through qword_127FCC0. The buffer structure:

Offset	Size	Field	Description
`+0`	8	`reserved`	Not used during mangling
`+8`	8	`capacity`	Allocated buffer size
`+16`	8	`write_pos`	Current write position (length of mangled name so far)
`+24`	8	`unused`	Reserved
`+32`	8	`buffer_ptr`	Pointer to character buffer

Key buffer operations:

sub_69D850 (append_char_to_buffer): Appends a single character, calls sub_6B9B20 to grow the buffer if write_pos + 1 > capacity.
sub_69D530 (append_string): Appends a string to the buffer.
sub_69D580 (append_number): Appends a base-36 encoded number.
sub_6B9B20 (ensure_output_buffer_space): Grows the buffer (doubles capacity).
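The append-and-grow behaviour of these helpers can be sketched as follows. The struct mirrors the field order of the layout table for a 64-bit target. The function names, the initial capacity of 16, and the use of realloc are assumptions of this sketch rather than recovered details.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Field order follows the layout table above (8 bytes each on LP64). */
typedef struct {
    size_t reserved;
    size_t capacity;
    size_t write_pos;
    size_t unused;
    char  *buffer_ptr;
} mangle_buffer;

/* Doubles the capacity until `extra` more bytes fit. */
int ensure_space(mangle_buffer *buf, size_t extra)
{
    assert(buf != NULL);
    size_t needed = buf->write_pos + extra;
    if (needed <= buf->capacity)
        return 0;
    size_t new_cap = buf->capacity ? buf->capacity : 16;  /* assumed start */
    while (new_cap < needed)
        new_cap *= 2;
    char *grown = realloc(buf->buffer_ptr, new_cap);
    if (grown == NULL)
        return -1;              /* old buffer is still valid */
    buf->buffer_ptr = grown;
    buf->capacity = new_cap;
    return 0;
}

int append_char(mangle_buffer *buf, char c)
{
    if (ensure_space(buf, 1) != 0)
        return -1;
    buf->buffer_ptr[buf->write_pos++] = c;
    return 0;
}

int append_string(mangle_buffer *buf, const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len == 0)
        return 0;
    if (ensure_space(buf, len) != 0)
        return -1;
    memcpy(buf->buffer_ptr + buf->write_pos, s, len);
    buf->write_pos += len;
    return 0;
}
```

Because write_pos doubles as the current name length, the buffer does not need a terminating NUL until the finished name is handed off.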

The sub_69DAA0 function (mangle_number, 63 lines) writes numbers in base-36 encoding, which the Itanium ABI requires for substitution sequence IDs (digits 0-9 followed by uppercase A-Z). Discriminators, by contrast, are written in decimal in the ABI grammar.

Mangling Type Marks

The mangling pipeline uses a mark-and-sweep mechanism to track which types have been referenced during signature mangling (needed for substitution sequence generation):

sub_69CCB0 (set_signature_mark, 76 lines): Marks types in a function signature for mangling. Handles function types (a2=7) and template functions (a2=11) by calling sub_5CF440 for type traversal.
sub_69CE10 (ttt_mark_entry, 36 lines): Sets or clears the mangling mark on individual type entities. Uses bit 7 of byte at entity offset +81. The direction (mark vs unmark) is controlled by dword_127FC70.
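The per-entity mark is a single-bit toggle. The sketch below treats the entity as a raw byte array and uses the offset (+81) and bit (7) given above. The function names are hypothetical, and the direction is passed as a parameter instead of being read from dword_127FC70.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    MARK_BYTE_OFFSET = 81,      /* entity offset holding the mark */
    MARK_BIT         = 0x80     /* bit 7 */
};

/* Sets (mark == true) or clears (mark == false) the mangling mark,
   leaving the other seven bits of the byte untouched. */
void set_mangling_mark(uint8_t *entity, bool mark)
{
    assert(entity != NULL);
    uint8_t byte = entity[MARK_BYTE_OFFSET];
    if (mark)
        byte = (uint8_t)(byte | MARK_BIT);
    else
        byte = (uint8_t)(byte & ~MARK_BIT);
    entity[MARK_BYTE_OFFSET] = byte;
}

bool has_mangling_mark(const uint8_t *entity)
{
    assert(entity != NULL);
    return (entity[MARK_BYTE_OFFSET] & MARK_BIT) != 0;
}
```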

CUDA Demangler Extensions (sub_7CABB0)

The CUDA-aware demangler at sub_7CABB0 (930 decompiled lines at 0x7CABB0) is a statically linked NVIDIA implementation, not part of EDG. It implements a full Itanium ABI C++ name demangler with three NVIDIA vendor-type extensions for CUDA lambda wrappers.

Function Signature

unsigned char* sub_7CABB0(
    unsigned char *mangled_name,   // a1: input cursor into mangled name
    int64_t       qualifier_out,   // a2: output qualifier struct (24 bytes)
    char          flags,           // a3: behavior flags
    int64_t       output_ctx       // a4: output buffer context
);

Output Buffer Context (a4)

Offset	Size	Field	Description
`+0`	8	`buffer_ptr`	Output character buffer
`+8`	8	`write_pos`	Current output position
`+16`	8	`capacity`	Buffer capacity
`+24`	4	`error_flag`	Set to 1 on buffer overflow
`+28`	4	`overflow`	Redundant overflow indicator
`+32`	8	`suppress_level`	When >0, output is suppressed (for dry-run parsing)
`+48`	8	`error_count`	Cumulative parse error counter
`+64`	8	`skip_template`	When set, suppresses template argument output

Qualifier Output (a2)

Offset	Size	Field	Description
`+0`	4	`has_template_args`	Set to 1 when template arguments were parsed
`+4`	4	`cv_qualifiers`	bit 0=const, bit 1=volatile, bit 2=restrict
`+8`	4	`ref_qualifier`	0=none, 1=lvalue `&`, 2=rvalue `&&`
`+16`	8	`template_depth`	Template nesting depth

Flags (a3)

Bit	Meaning
0	Static-from mode: wraps output in `[static from ...]...[C++]`
1	Suppress-scope mode: increments suppress level

Parsing Dispatch

The demangler handles these Itanium ABI top-level prefixes:

Prefix Byte	ASCII	ABI Meaning	Handler
`0x42`	`B`	EDG block-scope static entity	Block-scope handler (offset + length)
`0x4E`	`N`	Nested name (qualified)	`sub_7CA440` (nested-name parser)
`0x5A`	`Z`	Local entity	`sub_7CEAE0` (encoding parser) + local suffix
`0x53`	`S`	Substitution	`sub_7CD7B0` (substitution resolver)
`0x53 0x74`	`St`	`std::` prefix	Emits `std::` + `sub_7CD0B0` (unqualified-name)
other		Unqualified name	`sub_7CD0B0` (unqualified-name parser)

After parsing the name, the function checks for I (template argument list, 0x49) and dispatches to sub_7C9D30 (template-args parser). A template argument cache at qword_12C7B48/12C7B40/12C7B50 stores parsed entries using a dynamic array that grows by 500 entries via malloc/realloc.
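A fixed-increment growth policy like the one described for the template argument cache can be sketched as below. The struct and function names are hypothetical, and only the 500-entry step comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { CACHE_GROWTH = 500 };    /* growth step stated above */

typedef struct {
    void  **entries;
    size_t  count;
    size_t  capacity;
} arg_cache;

/* Appends one entry, growing the array by CACHE_GROWTH slots when full.
   Returns 0 on success, -1 on allocation failure. */
int arg_cache_push(arg_cache *cache, void *entry)
{
    assert(cache != NULL);
    if (cache->count == cache->capacity) {
        size_t new_cap = cache->capacity + CACHE_GROWTH;
        /* realloc(NULL, n) behaves like malloc(n) for the first block. */
        void **grown = realloc(cache->entries,
                               new_cap * sizeof *cache->entries);
        if (grown == NULL)
            return -1;          /* old array is still valid */
        cache->entries = grown;
        cache->capacity = new_cap;
    }
    cache->entries[cache->count++] = entry;
    return 0;
}
```

Linear growth keeps memory overhead bounded at the cost of more frequent reallocations than a doubling scheme, which is a reasonable trade for names that rarely carry thousands of template arguments.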

CUDA Vendor Type Extensions

The key NVIDIA extensions are triggered when the demangler encounters the vendor-extended type prefix U followed by nv (bytes 0x55 0x6E 0x76). Three patterns are recognized:

Unvdl -- Device Lambda Wrapper

Pattern: Unvdl<arity><encoding><type>...

Input:  "Unvdl" + <numeric_arity> + <function_encoding> + <captured_types>...
Output: "__nv_dl_wrapper_t<__nv_dl_tag<(& :: <scope>), <arity>, <type1>, ...> >"

Decoded step by step:

Emit __nv_dl_wrapper_t<
Emit __nv_dl_tag<
Parse numeric arity via sub_7C3180, subtract 2 to get actual capture count
Parse one type (sub_7CE590) for the wrapped function type
Emit ,( + & :: + recursively demangle scope (calling sub_7CABB0 with flags=2)
Emit ),
Parse remaining captured types (count from step 3)
Emit > >

Unvdtl -- Trailing Return Device Lambda

Pattern: Unvdtl<arity><return_type><encoding><captured_types>...

Input:  "Unvdtl" + <arity> + <type> + <func_encoding> + <captured_types>...
Output: "__nv_dl_wrapper_t<__nv_dl_trailing_return_tag<...>, <return_type>, ...>"

Same as Unvdl except:

Emit __nv_dl_wrapper_t<
Emit __nv_dl_trailing_return_tag< (instead of __nv_dl_tag<)
After the scope demangling, parse an additional return type via sub_7CE590
Parse a function type via sub_7CE5D0 (adds 1 to result pointer for the E terminator)
Then parse remaining captured types

Unvhdl -- Host-Device Lambda Wrapper

Pattern: Unvhdl<bool1><bool2><bool3><arity><encoding><captured_types>...

Input:  "Unvhdl" + <IsMutable> + <HasFuncPtrConv> + <NeverThrows> + <arity> + ...
Output: "__nv_hdl_wrapper_t<true/false, true/false, true/false,
          __nv_dl_tag<(& :: <scope>), <arity>, <type1>, ...> >"

The three boolean template parameters are decoded first:

Parse a numeric value via sub_7C3180 for IsMutable (first boolean): if the value is 2 (the encoding of false), emit false,; otherwise emit true,
Repeat for HasFuncPtrConv (second boolean)
Repeat for NeverThrows (third boolean)
Then proceed identically to Unvdl (emit __nv_dl_tag<, parse captures, etc.), but with a local flag (v68 in the decompilation) set to 1 to mark the host-device variant

The boolean encoding convention: 2 encodes false, any other value (typically 0 or 1) encodes true. This is the reverse of the usual convention and matches the internal encoding used by nv_transforms.c when generating the mangled names.
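
The convention is easy to get backwards, so a small sketch may help. The function names are hypothetical, and the helper concatenates the three literals exactly as the steps above emit them (`true,` or `false,`); any whitespace the real demangler inserts between them is not modeled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* 2 encodes false; every other value encodes true. */
static const char *nv_hdl_bool_text(long encoded) {
    return encoded == 2 ? "false," : "true,";
}

/* Format IsMutable, HasFuncPtrConv, NeverThrows in that order.
 * Returns the snprintf result (characters that would be written). */
static int nv_hdl_format_bools(const long encoded[3], char *out, size_t out_size) {
    assert(encoded != NULL && out != NULL);
    return snprintf(out, out_size, "%s%s%s",
                    nv_hdl_bool_text(encoded[0]),
                    nv_hdl_bool_text(encoded[1]),
                    nv_hdl_bool_text(encoded[2]));
}
```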

Block-Scope Static Handling

When the input starts with B (ASCII 0x42), the demangler handles EDG's block-scope static entity encoding:

If flags bit 0 is set and suppress_level is 0: emit [static from
Parse an optional negative sign (n) followed by a decimal length
Skip ahead by that length (the block-scope name)
If suppress_level is 0: emit ] followed by [C++] (the closing bracket and C++ marker)
If flags bit 0 is not set: decrement suppress_level

Instance Suffix

After parsing the main name, if the next character is _ followed by digits (or __ followed by digits), the demangler parses an instance discriminator and emits an (instance N) suffix in the output, where N = parsed_value + 2.
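
A hedged sketch of the suffix rule follows. The function name and return convention are hypothetical, and two details are assumptions: the number is read as decimal, and nothing after the digits is consumed (whether the `__` form expects a closing `_` is not established here).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Parse "_<digits>" or "__<digits>" at p and write "(instance N)" to out,
 * with N = parsed value + 2. Returns the number of bytes consumed,
 * or 0 if no instance suffix is present (out is left untouched). */
static size_t parse_instance_suffix(const char *p, char *out, size_t out_size) {
    assert(p != NULL && out != NULL);
    const char *q = p;
    if (*q != '_')
        return 0;
    q++;
    if (*q == '_')                       /* "__<digits>" form */
        q++;
    if (!isdigit((unsigned char)*q))
        return 0;
    unsigned long value = 0;
    while (isdigit((unsigned char)*q))
        value = value * 10 + (unsigned long)(*q++ - '0');
    snprintf(out, out_size, "(instance %lu)", value + 2);
    return (size_t)(q - p);
}
```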

Default Argument Suffix

For local entities (after Z...E), the discriminator prefix d triggers special handling:

d_ or d<number>_: emits [default argument N (from end)]:: where N = parsed_value + 2
dn<number>_: negative-index variant

Call Graph

The demangler calls into specialized sub-parsers:

Address	Function	Purpose
`sub_7CA440`	Nested-name parser	Handles `N...E` qualified names
`sub_7CEAE0`	Encoding parser	Top-level `<encoding>` production
`sub_7CD0B0`	Unqualified-name parser	`<source-name>` and operator names
`sub_7CD7B0`	Substitution resolver	`S_`/`S0_` back-references
`sub_7C9D30`	Template-args parser	`I<args>E`
`sub_7CE590`	Type parser	Full type demangling
`sub_7CE5D0`	Function-type parser	Function signature types
`sub_7C3180`	Numeric literal parser	Decimal number extraction
`sub_7C30C0`	Arity emitter	Outputs numeric arity values
`sub_7C2FB0`	String emitter	Emits literal strings to output buffer
`sub_7C3030`	Signed number parser	Handles negative numbers

Static Prefix for __global__ Templates (sub_6BE300)

nv_get_full_nv_static_prefix at 0x6BE300 (370 lines) in nv_transforms.c constructs unique prefix strings for __global__ function templates with static/internal linkage. These prefixes are used to register device symbols in host reference arrays (the .nvHR* ELF sections that the CUDA runtime uses for symbol discovery).

Assert: "nv_get_full_nv_static_prefix" at nv_transforms.c:2164.

Entry Conditions

The function checks two conditions on the entity node:

Bit 0x40 at entity offset +182 must be set (marks __global__ functions)
A name string at entity offset +8 must be non-null

Internal vs External Linkage Paths

The function takes different paths based on entity linkage:

Internal linkage (bits 0x12 at offset +179 set, or storage class 0x10 at offset +80):

Build scoped name prefix via sub_6BD2F0 (nv_build_scoped_name_prefix), which recursively walks the scope chain (offset +40 -> parent scope at offset +28) to build Namespace1::Namespace2:: style prefixes. Anonymous namespaces insert _GLOBAL__N_<filename>.
Hash the entity name via sub_6BD1C0 (format_string_to_sso) using vsnprintf with a format string at address 0x82D326 (8573734 in decimal).
Build the full prefix string using snprintf:

snprintf(qword_1286760, n, "%s%lu_%s_", off_E7C768, strlen(filename), filename);

Where off_E7C768 is a global prefix string (likely "_nv_static_"), the %lu is the filename length, and %s is the filename from sub_5AF830(0). The result is cached in qword_1286760 for reuse across entities in the same translation unit.

Concatenate prefix + "_" separator + entity scoped name
Register the full string in qword_12868C0 (kernel internal-linkage host reference list)
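
To show what the `"%s%lu_%s_"` format in step 3 produces, here is a small sketch. The function name is hypothetical, the base string `"_nv_static_"` used in the usage note is the guess given above rather than a confirmed value, and the caching in `qword_1286760` is not modeled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<base><strlen(filename)>_<filename>_" into out.
 * Returns the snprintf result (length that would be written). */
static int build_static_prefix(char *out, size_t out_size,
                               const char *base, const char *filename) {
    assert(out != NULL && base != NULL && filename != NULL);
    return snprintf(out, out_size, "%s%lu_%s_",
                    base, (unsigned long)strlen(filename), filename);
}
```

With a base of `"_nv_static_"` and a file named `kernel.cu`, the result is `_nv_static_9_kernel.cu_`. The length prefix keeps the file name unambiguous even when it contains underscores itself.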

External linkage:

Build name with " ::" scope prefix (the leading space is intentional -- it matches the demangler output format)
Walk scope chain via sub_6BD2F0 if the entity has a parent scope with kind 3 (namespace)
Hash the entity name via sub_6BD1C0
Append "_" separator
Register in qword_1286880 (kernel external-linkage host reference list)

Host Reference Arrays

The prefixes generated by this function end up in six global lists, one per combination of {kernel, device, constant} x {external, internal} linkage:

Global	Section	Array Name
`unk_1286780`	`.nvHRDE`	`hostRefDeviceArrayExternalLinkage`
`unk_12867C0`	`.nvHRDI`	`hostRefDeviceArrayInternalLinkage`
`unk_1286800`	`.nvHRCE`	`hostRefConstantArrayExternalLinkage`
`unk_1286840`	`.nvHRCI`	`hostRefConstantArrayInternalLinkage`
`unk_1286880`	`.nvHRKE`	`hostRefKernelArrayExternalLinkage`
`unk_12868C0`	`.nvHRKI`	`hostRefKernelArrayInternalLinkage`

These are emitted by sub_6BCF80 (nv_emit_host_reference_array) as weak extern "C" byte arrays in the specified ELF sections.

Type Mangling Subsystem (0x7C3000--0x7D0E00)

A separate type mangling subsystem exists in the 0x7C3000--0x7D0E00 range, used for diagnostic output and type encoding (distinct from the lower_name.c mangling used for symbol generation). Key functions:

Address	Function	Lines	Description
`sub_7C3480`	`encode_operator_name`	716	Operator name encoding for diagnostics
`sub_7C5650`	`encode_type_for_mangling`	794	Full type encoding dispatcher
`sub_7C6290`	`encode_expression`	2519	Largest function -- expression encoding
`sub_7C8BE0`	`encode_special_expression`	674	Special expression forms
`sub_7CBB90`	`encode_builtin_type`	1314	All builtin type mappings
`sub_7CEAE0`	`encode_template_args`	1417	Template argument encoding
`sub_7CFFC0`	`encode_nullptr`	484	nullptr-related type encoding

The encode_expression function at sub_7C6290 (2519 lines) is the largest single function in the entire type mangling subsystem and handles the full range of C++ expressions including dynamic_cast, const_cast, reinterpret_cast, safe_cast, static_cast, subscript, and throw.

Nested Name Components (sub_6A8390)

mangled_nested_name_component at 0x6A8390 (101 lines) handles the intermediate components within N...E nested name encodings. It emits ABI substitution codes:

dn: Destructor name
co: Coercion operator
sr: Unresolved scope resolution
L_ZN: Local scope nested name
D1Ev: Destructor suffix (D1 = complete object destructor, E closes the nested name, v = empty parameter list)

When in compressed mode (dword_106BC7C set), the function checks for std:: namespace via sub_7BE9E0 (is_std_namespace) and uses shortened forms.

Entity Reference Mangling (sub_6A85E0)

mangled_entity_reference at 0x6A85E0 (197 lines) is the central dispatch for mangling entity references within expressions. It handles:

Qualified scope resolution (bit 2 at entity offset +81)
Address-of expressions (ad prefix)
Compressed vs full mangling paths
Class member vs free-function encoding

Assert: "mangled_entity_reference" at lower_name.c:4183.

Mangling Discriminators (sub_69DBE0)

mangle_discriminator at 0x69DBE0 (72 lines) writes discriminators for local entities. Itanium ABI uses _ for discriminator 0, _<number>_ for higher discriminators, where the number is encoded in base-36.

Global State Summary

Global	Type	Purpose
`qword_127FCC0`	Buffer*	Primary mangling output buffer
`qword_126ED90`	qword	Demangling/diagnostic mode flag
`dword_106BC7C`	dword	Compressed/vendor-ABI mode flag
`qword_126EF98`	qword	ABI version selector
`dword_127FC70`	dword	Mark/unmark direction for type marks
`qword_1286760`	char*	Cached static prefix string
`qword_1286A00`	char*	Cached anonymous namespace name
`dword_12C6A24`	dword	Block-scope suppress level (demangler)
`qword_12C7B48`	qword	Template argument cache index
`qword_12C7B40`	qword	Template argument cache capacity
`qword_12C7B50`	qword	Template argument cache pointer
`off_E7C768`	char*	Static prefix base string

Function Address Map

Address	Size	Identity	Confidence
`0x69C830`	24	`init_lower_name`	LOW
`0x69C980`	168	`mangled_operator_name`	HIGH
`0x69CCB0`	76	`set_signature_mark`	HIGH
`0x69CE10`	36	`ttt_mark_entry`	HIGH
`0x69CF10`	170	`mangled_scalable_vector_name`	HIGH
`0x69D530`	--	`append_string`	MEDIUM
`0x69D580`	--	`append_number`	MEDIUM
`0x69D850`	--	`append_char_to_buffer`	MEDIUM
`0x69DAA0`	63	`mangle_number`	MEDIUM
`0x69DBE0`	72	`mangle_discriminator`	MEDIUM
`0x69E380`	116	`mangle_cv_qualifiers`	MEDIUM
`0x69E5F0`	79	`mangle_ref_qualifier`	MEDIUM
`0x69E740`	177	`mangle_type_encoding`	MEDIUM-HIGH
`0x69EA40`	150	`mangle_function_type`	MEDIUM
`0x69ED40`	86	`mangle_template_args`	MEDIUM
`0x69EEE0`	109	`mangle_template_arg`	MEDIUM
`0x69F0D0`	28	`mangle_substitution_check`	LOW
`0x69F150`	87	`mangle_with_substitution`	MEDIUM
`0x69F320`	78	`mangle_nested_name`	MEDIUM
`0x69F830`	54	`mangle_local_name`	MEDIUM
`0x69F930`	60	`mangle_unscoped_name`	MEDIUM
`0x69FA90`	58	`mangle_source_name`	MEDIUM
`0x69FBC0`	125	`mangle_special_name`	MEDIUM
`0x69FE30`	78	`mangle_constructor_destructor`	MEDIUM
`0x69FF70`	447	`check_mangling_special_cases`	MEDIUM-HIGH
`0x6A0740`	189	`mangle_namespace_prefix`	MEDIUM
`0x6A0A80`	88	`mangle_class_prefix`	MEDIUM
`0x6A0FB0`	245	`mangle_pointer_type`	MEDIUM
`0x6A13A0`	396	`mangle_builtin_type`	MEDIUM-HIGH
`0x6A1C80`	97	`mangle_expression`	MEDIUM
`0x6A1F00`	~1000	`mangle_entity_name`	HIGH
`0x6A4920`	277	`mangle_template_parameter`	MEDIUM
`0x6A5DC0`	643	`mangle_abi_tags`	MEDIUM-HIGH
`0x6A78B0`	297	`mangle_complete_type`	MEDIUM
`0x6A7F20`	232	`mangle_initializer`	MEDIUM
`0x6A8390`	101	`mangled_nested_name_component`	HIGH
`0x6A85E0`	197	`mangled_entity_reference`	HIGH
`0x6A8B10`	~700	`mangled_expression`	HIGH
`0x6AA030`	30	`mangled_expression_list`	HIGH
`0x6AB280`	130	`mangled_encoding_for_sizeof`	HIGH
`0x6BE300`	370	`nv_get_full_nv_static_prefix`	VERY HIGH
`0x7CABB0`	930	CUDA demangler (top-level)	HIGH

Type System

The type system in cudafe++ is EDG 6.6's implementation of the C++ type representation, query, construction, comparison, and layout infrastructure. It lives primarily in types.c (250+ functions at 0x7A4940--0x7C02A0) with type allocation in il_alloc.c (0x5E2E80--0x5E45C0), type construction helpers in il.c (0x5D64F0--0x5D6DB0), and class layout computation in layout.c (0x65EA50--0x665B50).

Every C++ entity -- variable, function parameter, expression result, template argument -- carries a type pointer. EDG represents types as 176-byte heap-allocated nodes organized by a type_kind discriminant, with supplementary structures for complex kinds (classes, functions, integers, typedefs, template parameters). Type identity in the IL is pointer-based: two types are the "same type" if and only if they resolve to the same canonical node after chasing typedef chains. This page documents the complete type node architecture, the 22 type kinds, the 130 leaf query functions, the MRU-cached type construction pipeline, and the Itanium ABI class layout engine.

Key Facts

Property	Value
Source file	`types.c` (250+ functions), `il_alloc.c` (allocators), `il.c` (construction), `layout.c` (class layout)
Address range	`0x7A4940`--`0x7C02A0` (types.c), `0x5E2E80`--`0x5E45C0` (alloc), `0x5D64F0`--`0x5D6DB0` (il.c), `0x65EA50`--`0x665B50` (layout)
Type node size	176 bytes (raw allocation includes 16-byte IL prefix)
Type kind count	22 values (`0x00`--`0x15`)
Leaf query functions	130 at `0x7A6260`--`0x7A9F90` (4,448 total call sites across binary)
Class layout entry	`sub_662670` (`do_class_layout`), 2,548 lines
Type allocator	`sub_5E3D40` (`alloc_type`), 176-byte bump allocation
Kind dispatch	`sub_5E2E80` (`set_type_kind`), 22-way switch
Qualified type cache	`sub_5D64F0` (`f_make_qualified_type`), MRU linked list at type `+112`
Type comparison	`sub_7AA150` (`types_are_identical`), 636 lines
Top query by callers	`is_class_or_struct_or_union_type` at `0x7A8A30` (407 call sites)
Type counter global	`qword_126F8E0` (incremented on every `alloc_type`)
Void type singleton	`qword_126E5E0`

Type Node Layout (176 Bytes)

Every type in the IL is a 176-byte node allocated by alloc_type (sub_5E3D40). The allocator prepends a 16-byte IL prefix (8-byte TU-copy address + 8-byte next pointer), so the pointer returned to callers points at offset +16 of the raw allocation. All offsets below are relative to the returned pointer.

Offset	Size	Field	Description
`+0`	96	common header	Copied from `xmmword_126F6A0..126F6F0` at allocation time
`+0`	8	`source_corresp`	Source position info
`+8`	1	`prefix_flags`	IL entry prefix: bit 0 = allocated, bit 1 = file-scope, bit 3 = language
`+112`	8	`qualified_chain`	Head of MRU linked list of cv-qualified variants
`+120`	4	`size_info`	Type size in target units (for constexpr value computation)
`+128`	4	`alignment`	Type alignment
`+132`	1	`type_kind`	Discriminant byte: 0--21 (22 values)
`+133`	1	`type_flags_1`	Bit 5 = is_dependent
`+134`	1	`elaboration_flags`	Low 2 bits = elaboration specifier kind
`+136`	1	`type_flags_3`	Bit 2 = bitfield flag, bit 5 = unqualified strip flag
`+144`	8	`referenced_type`	Points to base/element/return type (kind-dependent). For pointers: pointed-to type. For arrays: element type. For typedefs: underlying type
`+145`	1	`integer_subkind`	(overlaps `+144` byte 1; valid when kind==2) Bit 3 = scoped enum, bit 4 = bit-int capable
`+146`	1	`integer_flags`	(overlaps `+144` byte 2; valid when kind==2) Bit 2 = `_BitInt`
`+152`	8	`supplement_ptr`	Pointer to kind-specific supplement, or member-pointer class type (kind==6 with member bit, kind==13)
`+153`	1	`array_flags`	(overlaps `+152` byte 1; valid when kind==8) Bit 0 = dependent, bit 1 = VLA, bit 5 = star-modified
`+160`	8	`secondary_data`	Array bound (kind==8) / attribute info (kind==12) / enum underlying type (kind==2)
`+161`	1	`qualifier_or_class_flags`	Typeref: cv-qualifier bits (kind==12). Class: bit 0 = local, bit 4 = template, bit 5 = anonymous (kind==9/10/11)
`+163`	1	`class_flags_2`	(valid when kind==9/10/11) Bit 0 = empty class
`+164`	1	`feature_usage`	Copied to `byte_12C7AFC` by `record_type_features_used`

Note: Fields at offsets +144--+164 form a union-like region. Different type kinds interpret these bytes differently. The overlap is intentional -- a pointer type uses +152 for the class pointer while an array type uses +153 for VLA flags, and so on.
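
The union-like region can be modeled in C with byte-array overlays, which makes the overlapping offsets explicit. This is a sketch under assumptions: the struct name, the padding members, and the accessor names are hypothetical, only fields listed in the table are modeled, the struct spans 176 bytes from the returned pointer, and an LP64 target is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct type_node {
    uint8_t  header[112];               /* +0   common header, not modeled */
    struct type_node *qualified_chain;  /* +112 */
    uint32_t size_info;                 /* +120 */
    uint32_t pad_124;
    uint32_t alignment;                 /* +128 */
    uint8_t  type_kind;                 /* +132 */
    uint8_t  type_flags_1;              /* +133 */
    uint8_t  elaboration_flags;         /* +134 */
    uint8_t  pad_135;
    uint8_t  type_flags_3;              /* +136 */
    uint8_t  pad_137[7];
    union { struct type_node *referenced_type; uint8_t bytes[8]; } u144;
    union { void *supplement_ptr;              uint8_t bytes[8]; } u152;
    union { uint64_t secondary_data;           uint8_t bytes[8]; } u160;
    uint8_t  tail[8];                   /* +168..+175 */
} type_node;

_Static_assert(offsetof(type_node, type_kind) == 132, "type_kind at +132");
_Static_assert(offsetof(type_node, u144) == 144, "referenced_type at +144");
_Static_assert(offsetof(type_node, u152) == 152, "supplement_ptr at +152");
_Static_assert(offsetof(type_node, u160) == 160, "secondary_data at +160");
_Static_assert(sizeof(type_node) == 176, "node is 176 bytes");

/* +145, bit 3: scoped enum (valid only when kind == 2). */
static int integer_is_scoped_enum(const type_node *t) {
    assert(t->type_kind == 2);
    return (t->u144.bytes[1] >> 3) & 1;
}

/* +153, bit 1: VLA (valid only when kind == 8). */
static int array_is_vla(const type_node *t) {
    assert(t->type_kind == 8);
    return (t->u152.bytes[1] >> 1) & 1;
}
```

Indexing the overlay by byte keeps the accessors independent of host byte order, since the table gives the flag positions as absolute byte offsets.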

The type_kind byte at +132 is the single most frequently read field in the entire binary. Every type query function begins by checking it, and the canonical typedef-chase pattern reads it in a tight loop.

Type Kind Enumeration (22 Values)

EDG uses 22 type kind values (tk_*), each with optional supplementary allocations for kind-specific metadata.

Value	Name	Supplement	Supplement Size	Description
0	`tk_none`	--	--	Sentinel / uninitialized
1	`tk_void`	--	--	`void` type
2	`tk_integer`	`integer_type_supplement`	32 B	All integer types: `bool`, `char`, `short`, `int`, `long`, `long long`, `__int128`, `_BitInt(N)`, and enumerations. Subkind at `+145` discriminates
3	`tk_float`	--	--	`float` (format byte at `+144` = 2)
4	`tk_double`	--	--	`double`
5	`tk_long_double`	--	--	`long double`, `__float128`, `_Float16`, `__bf16`
6	`tk_pointer`	--	--	Pointer to T. Bit 0 of `+152` distinguishes member pointers from object pointers
7	`tk_routine`	`routine_type_supplement`	64 B	Function type. Supplement holds parameter list, calling convention, `this`-class pointer, exception specification
8	`tk_array`	--	--	Array of T. Bound at `+160`, element type at `+144`
9	`tk_struct`	`class_type_supplement`	208 B	`struct` type
10	`tk_class`	`class_type_supplement`	208 B	`class` type
11	`tk_union`	`class_type_supplement`	208 B	`union` type
12	`tk_typeref`	`typeref_type_supplement`	56 B	Typedef / elaborated type. References the underlying type at `+144`. This is the "chase me" kind
13	`tk_pointer_to_member`	--	--	Pointer-to-member. Member type at `+144`, class type at `+152`
14	`tk_template_param`	`templ_param_supplement`	40 B	Unresolved template type parameter
15	`tk_typeof`	--	--	`typeof` / `__typeof__` expression type
16	`tk_decltype`	--	--	`decltype(expr)` type
17	`tk_pack_expansion`	--	--	Parameter pack expansion
18	`tk_auto`	--	--	`auto` / `decltype(auto)` placeholder
19	`tk_rvalue_reference`	--	--	Rvalue reference `T&&`
20	`tk_nullptr_t`	--	--	`std::nullptr_t`
21	`tk_reserved`	--	--	Reserved / unused (handled as no-op in `set_type_kind`)

The Integer Type Supplement (32 Bytes)

Integer types (kind 2) carry a 32-byte supplement allocated by set_type_kind and tracked by qword_126F8E8. This supplement discriminates the enormous variety of C++ integer types -- bool, char, signed char, unsigned char, wchar_t, char8_t, char16_t, char32_t, short, int, long, long long, __int128, _BitInt(N), and all scoped/unscoped enumerations.

The integer subkind value (at byte +145 of the parent type node) encodes:

Value	Type Category
1--10	Standard integer types (`bool` through `unsigned long long`)
11	`_BitInt` / extended integer
12	`__int128` / extended

Signedness is determined by a lookup table at byte_E6D1B0, indexed by the integer subkind value.

The Routine Type Supplement (64 Bytes)

Function types (kind 7) carry a 64-byte supplement tracked by qword_126F958. Key fields:

Offset (in supplement)	Size	Field
`+0`	8	Parameter type list head
`+8`	8	Exception specification
`+16`	4	Calling convention / noexcept flags
`+32`	16	Bitfield struct (ABI attributes, variadic flag)
`+40`	8	`this`-class pointer (for member functions)

The Class Type Supplement (208 Bytes)

Class/struct/union types (kinds 9/10/11) carry a 208-byte supplement tracked by qword_126F948. This is the largest supplement and contains the full class metadata:

Offset (in supplement)	Size	Field
`+0`	8	Scope pointer (member declarations)
`+8`	8	Base class list head
`+16`	8	Virtual function table pointer
`+40`	4	Initialized to 1 by `init_class_type_supplement_fields`
`+86`	1	Bit 0 = has virtual bases, bit 3 = has user conversion
`+88`	1	Bit 5 = has flexible array / VLA member
`+100`	4	Class kind (9=struct, 10=class, 11=union)
`+128`	8	Scope block pointer
`+176`	4	Initialized to -1 (sentinel)

The Typeref Supplement (56 Bytes)

Typedef types (kind 12) carry a 56-byte supplement tracked by qword_126F8F0. A typeref wraps another type, creating the alias chain that all query functions must chase. The supplement holds the typedef declaration entity, elaborated type specifier information, and attribute data.

The Typedef Chase Pattern

The most pervasive code pattern in the entire binary is the typedef chase loop. Because C++ types may be wrapped in arbitrarily many typedef layers (typedef int myint; typedef myint myint2;), every function that inspects a type property must first resolve through all typedef indirections to reach the underlying canonical type.

The canonical pattern appears in every one of the 130 leaf query functions:

// Canonical typedef chase — appears 130+ times in types.c
type_t *skip_typedefs(type_t *type) {
    while (type->type_kind == 12)   // 12 = tk_typeref
        type = type->referenced_type;  // offset +144
    return type;
}

bool is_class_or_struct_or_union_type(type_t *type) {
    type = skip_typedefs(type);
    int kind = type->type_kind;     // offset +132
    return kind == 9 || kind == 10 || kind == 11;
}

In x86-64 machine code, this compiles to a tight four-instruction loop:

.loop:
    cmp  byte [rdi+132], 12       ; type->type_kind == tk_typeref?
    jne  .done
    mov  rdi, [rdi+144]           ; type = type->referenced_type
    jmp  .loop
.done:

Why 130 Separate Functions?

A natural question: why does EDG have 130 individual query functions instead of a single get_type_kind() accessor? The answer is the EDG compilation model. Each function in types.c is a public API entry point that other source files (parse.c, lower.c, templates.c, etc.) can call without including the full type-system header. This provides:

Encapsulation. Callers never see the type_kind enum values or internal layout. They call is_class_or_struct_or_union_type() instead of checking kind == 9 || kind == 10 || kind == 11.
Binary stability. If EDG adds a new type kind or renumbers existing ones, only types.c needs recompilation. All callers are insulated.
Fast-path optimization. Each leaf function is tiny (10--30 bytes of machine code), fits in a single cache line, and branches on at most 2--3 constants. The branch predictor handles these trivially.
Semantic naming. is_arithmetic_type() is self-documenting where kind >= 2 && kind <= 5 is not. This matters in a 2.5M-line codebase.

Query Function Catalog (Top 30 by Caller Count)

Address	Callers	Function	Returns
`0x7A8A30`	407	`is_class_or_struct_or_union_type`	kind in {9,10,11}
`0x7A9910`	389	`type_pointed_to`	`ptr->referenced_type` (kind==6)
`0x7A9E70`	319	`get_cv_qualifiers`	Accumulated cv-qualifier bits (`& 0x7F`)
`0x7A6B60`	299	`is_dependent_type`	Bit 5 of byte `+133`
`0x7A7630`	243	`is_object_pointer_type`	kind==6 and not member pointer
`0x7A8370`	221	`is_array_type`	kind==8
`0x7A7B30`	199	`is_member_pointer_or_ref`	kind==6 with member bit
`0x7A6AC0`	185	`is_reference_type`	kind==7
`0x7A8DC0`	169	`is_function_type`	kind==14
`0x7A6E90`	140	`is_void_type`	kind==1
`0x7A7C40`	132	`is_trivially_copy_constructible`	Recursive triviality check
`0x7A9350`	126	`array_element_type` (deep)	Strips arrays+typedefs to element
`0x7A7010`	85	`is_enum_type`	kind==2 with scoped check
`0x7A71B0`	82	`is_integer_type`	kind==2
`0x7A8020`	77	`type_size_and_alignment`	Computes sizeof/alignof
`0x7A7810`	77	`is_member_pointer_flag`	kind==6, bit 0 of `+152`
`0x7A8270`	77	`get_mangled_type_encoding`	Type encoding for name mangling
`0x7A8D90`	76	`is_pointer_to_member_type`	kind==13
`0x7A73F0`	70	`is_long_double_type`	kind==5
`0x7A7950`	68	`is_member_ptr_with_both_bits`	kind==6, bits 0 and 1 of `+152`
`0x7A70F0`	62	`is_scoped_enum_type`	kind==2, bit 3 of `+145`
`0x7A6EF0`	56	`is_rvalue_reference_type`	kind==19 (rvalue reference `T&&`)
`0x7A9310`	51	`array_element_type` (shallow)	One-level array to element
`0x7A6B90`	46	`is_simple_function_type`	kind==8, specific flag pattern
`0x7A7220`	43	`is_bit_int_type`	kind==2, bit 2 of `+146`
`0x7A7300`	42	`is_floating_point_type`	kind in {3,4,5}
`0x7A7750`	40	`is_non_member_ptr_type`	kind==6, no member bit
`0x7A6EC0`	39	`is_nullptr_t_type`	kind==20
`0x7A99D0`	37	`pm_member_type`	kind==13, extracts member type at `+144`
`0x7A8F10`	34	`is_unresolved_function_type`	kind==14, constraint check

Total: 128 unique query functions, 4,448 call sites, average 34.75 callers per function.

Typedef Stripping Variants

Six specialized typedef-stripping functions exist, each stopping at a different boundary:

Address	Function	Behavior
`0x7A68F0`	`skip_typedefs`	Strips all typedef layers, preserves cv-qualifiers
`0x7A6930`	`skip_named_typedefs`	Strips typedefs that have no name
`0x7A6970`	`skip_to_attributed_typedef`	Stops at typedef with attribute flag set
`0x7A69C0`	`skip_typedefs_and_attributes`	Strips both typedef and attribute layers
`0x7A6A10`	`skip_to_elaborated_typedef`	Stops at typedef with elaborated-type-specifier flag
`0x7A6A70`	`skip_non_attributed_typedefs`	Stops at typedef with any attribute bits

These variants exist because C++ semantics sometimes care about intermediate typedef layers. For example, [[deprecated]] typedef int bad_int; attaches the attribute to the typedef itself, not to int. A function checking for deprecation must stop at the attributed typedef layer rather than chasing through to int.
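
The difference between a full chase and a boundary-aware chase can be sketched with a toy node type. Everything here is illustrative: `toy_type`, its `has_attribute` field, and the two function names are hypothetical stand-ins, and the real functions test flag bits in the type node and typeref supplement rather than a dedicated field.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_TK_INTEGER = 2, TOY_TK_TYPEREF = 12 };

typedef struct toy_type {
    int kind;                               /* type kind discriminant */
    int has_attribute;                      /* hypothetical: typedef carries an attribute */
    const struct toy_type *referenced_type; /* underlying type for typerefs */
} toy_type;

/* Strip every typedef layer. */
static const toy_type *toy_skip_typedefs(const toy_type *t) {
    while (t->kind == TOY_TK_TYPEREF)
        t = t->referenced_type;
    return t;
}

/* Strip typedef layers, but stop at the first one carrying an attribute. */
static const toy_type *toy_skip_to_attributed_typedef(const toy_type *t) {
    while (t->kind == TOY_TK_TYPEREF && !t->has_attribute)
        t = t->referenced_type;
    return t;
}
```

For `[[deprecated]] typedef int bad_int; typedef bad_int alias;`, the first function resolves `alias` to `int`, while the second stops at `bad_int`, which is the layer a deprecation check needs to see.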

Duplicate Query Functions

Several functions are exact binary duplicates with distinct addresses:

0x7A7630 = 0x7A7670 = 0x7A7750 (is_non_member_pointer / is_object_pointer_type)
0x7A7B00 = 0x7A7B70 (is_pointer_type)
0x7A78D0 = 0x7A7910 (is_non_const_ref)

These duplicates exist because EDG uses distinct function names for semantic clarity even when the implementation is identical. The function-level linker does not merge them because they have distinct symbols with different ABI meanings: callers of is_object_pointer_type() and is_non_member_pointer_type() conceptually ask different questions even though the current answer is the same. If a future C++ revision changed pointer semantics, only one function would need updating.

Type Allocation

Type nodes are allocated by alloc_type (sub_5E3D40), which follows the standard IL allocation pattern used by all node allocators in il_alloc.c:

type_t *alloc_type(int type_kind) {
    // 1. Optional debug trace
    if (dword_126EFC8)
        trace_enter("alloc_type");

    // 2. Bump-allocate 176 bytes from the current region
    void *raw = region_alloc(dword_126EC90, 176);

    // 3. Set up IL prefix (16 bytes before the returned pointer)
    // raw[0..7] = TU-copy address (0 if not in copy mode)
    // raw[8..15] = next pointer (0)
    if (!dword_106BA08) {
        ++qword_126F7C0;         // orphan prefix count
        *(raw + 0) = 0;          // TU-copy addr
    }
    ++qword_126F750;             // IL entry count
    *(raw + 8) = 0;              // next pointer

    // 4. Set prefix flags byte
    byte flags = 1;              // bit 0 = allocated
    if (!dword_106BA08)
        flags |= 2;             // bit 1 = file-scope
    if (dword_126E5FC & 1)
        flags |= 8;             // bit 3 = C++ mode
    *(raw + 8) = flags;

    // 5. Increment type counter
    ++qword_126F8E0;

    // 6. Copy 96-byte common IL header
    type_t *result = raw + 16;
    memcpy(result, &xmmword_126F6A0, 96);

    // 7. Dispatch to set_type_kind
    set_type_kind(result, type_kind);

    // 8. Optional debug trace
    if (dword_126EFC8)
        trace_leave();

    return result;
}

set_type_kind: The 22-Way Dispatch

set_type_kind (sub_5E2E80) writes the kind byte and allocates any required supplement structure:

void set_type_kind(type_t *type, int kind) {
    type->type_kind = kind;      // byte at +132

    switch (kind) {
    case 0:  case 1:             // tk_none, tk_void
    case 17: case 18:            // tk_pack_expansion, tk_auto
    case 19: case 20: case 21:   // tk_rvalue_reference, tk_nullptr_t, tk_reserved
        break;                   // no supplement needed

    case 2:                      // tk_integer
        type->referenced_type = 5;  // default integer subkind
        type->supplement_ptr = alloc_permanent(32);
        ++qword_126F8E8;        // integer supplement counter
        // Store source position at supplement+16
        break;

    case 3: case 4: case 5:     // tk_float, tk_double, tk_long_double
        type->referenced_type = 2;  // format byte
        break;

    case 6:                      // tk_pointer
        type->supplement_ptr = 0;   // zero class-pointer field
        type->secondary_data = 0;
        break;

    case 7:                      // tk_routine (function type)
        type->supplement_ptr = alloc_permanent(64);
        ++qword_126F958;        // routine supplement counter
        // Initialize calling convention bitfield at supplement+32
        break;

    case 8:                      // tk_array
        // Zero size and flags fields
        break;

    case 9: case 10: case 11:   // tk_struct, tk_class, tk_union
        type->supplement_ptr = alloc_permanent(208);
        ++qword_126F948;        // class supplement counter
        init_class_type_supplement_fields(type->supplement_ptr);
        type->supplement_ptr->class_kind = kind;  // at supplement+100
        break;

    case 12:                     // tk_typeref
        type->supplement_ptr = alloc_permanent(56);
        ++qword_126F8F0;        // typeref supplement counter
        break;

    case 13:                     // tk_pointer_to_member
        // Zero member/class fields
        break;

    case 14:                     // tk_template_param
        type->supplement_ptr = alloc_permanent(40);
        ++qword_126F8F8;        // template param supplement counter
        break;

    case 15: case 16:           // tk_typeof, tk_decltype
        // Zero expression pointer fields
        break;

    default:
        internal_error("set_type_kind: bad type kind");
    }
}

Qualified Type Construction: The MRU Cache

When the compiler needs a const int given an int, it calls f_make_qualified_type (sub_5D64F0). This function is called extremely frequently -- every variable declaration, function parameter, and expression type computation may need cv-qualified variants. EDG optimizes this with a move-to-front (MRU) linked list cache on each type node.

type_t *f_make_qualified_type(type_t *base_type, int qualifiers) {
    // qualifiers bitmask: bit 0 = const, bit 1 = volatile,
    //                     bit 2 = restrict, bits 3-6 = address space

    // 1. Array special case: cv-qualify the element type, not the array
    if (base_type->type_kind == 8) {   // array
        type_t *elem = base_type->referenced_type;
        type_t *qual_elem = f_make_qualified_type(elem, qualifiers);
        return rebuild_array_type(base_type, qual_elem);
    }

    // 2. Strip existing qualifiers that already match
    int existing = get_cv_qualifiers(base_type) & 0x7F;
    int needed = qualifiers & ~existing;
    if (needed == 0)
        return base_type;           // already qualified as requested

    // 3. Search the MRU cache at base_type->qualified_chain (+112)
    type_t *prev = NULL;
    type_t *cur = base_type->qualified_chain;
    while (cur) {
        if (cur->type_kind == 12 &&             // must be typeref
            (cur->class_flags_1 & 0x7F) == qualifiers) {
            // Cache hit -- move to front if not already there
            if (prev) {
                prev->next = cur->next;
                cur->next = base_type->qualified_chain;
                base_type->qualified_chain = cur;
            }
            return cur;
        }
        prev = cur;
        cur = cur->next;
    }

    // 4. Cache miss -- allocate new qualified type
    type_t *qual = alloc_type(12);              // tk_typeref
    qual->referenced_type = base_type;          // +144 = underlying type
    qual->class_flags_1 = qualifiers & 0x7F;    // +161 = qualifier bits
    setup_type_node(qual);                       // sub_5B3DE0

    // 5. Insert at head of cache list
    qual->next = base_type->qualified_chain;
    base_type->qualified_chain = qual;

    return qual;
}

The MRU optimization is critical because type construction is highly skewed: const T is needed far more often than volatile const restrict T. By moving the most recently matched qualified variant to the front of the chain, subsequent lookups for the same qualification find it immediately.

The same MRU pattern appears in ptr_to_member_type_full (sub_5DB220), which caches pointer-to-member types on the member type's qualification chain at +112.

CV-Qualifier Bitmask

Bit	Mask	Qualifier
0	`0x01`	`const`
1	`0x02`	`volatile`
2	`0x04`	`__restrict`
3--6	`0x78`	Address space qualifier (CUDA/OpenCL)

The 7-bit mask (& 0x7F) at offset +161 of a typeref node encodes the full cv-qualification. get_cv_qualifiers (sub_7A9E70, 319 callers) accumulates these bits by chasing the typedef chain:

int get_cv_qualifiers(type_t *type) {
    int quals = 0;
    while (type->type_kind == 12) {     // chase typedefs
        quals |= type->class_flags_1 & 0x7F;
        type = type->referenced_type;
    }
    return quals;
}
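
To make the bit assignments concrete, here is a minimal sketch that renders a 7-bit qualifier mask as text. The enum names, the `describe_qualifiers` helper, and the `as=N` notation for the address-space field are illustrative assumptions, not names recovered from the binary.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Bit assignments follow the table above; the names are illustrative. */
enum {
    QUAL_CONST      = 0x01,
    QUAL_VOLATILE   = 0x02,
    QUAL_RESTRICT   = 0x04,
    QUAL_ADDR_MASK  = 0x78,   /* bits 3-6 */
    QUAL_ADDR_SHIFT = 3
};

/* Render a qualifier mask as text, e.g. "const volatile as=2". */
static void describe_qualifiers(unsigned mask, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    mask &= 0x7Fu;                      /* same masking as the +161 field */
    unsigned addr_space = (mask & QUAL_ADDR_MASK) >> QUAL_ADDR_SHIFT;

    snprintf(out, out_size, "%s%s%s",
             (mask & QUAL_CONST)    ? "const "      : "",
             (mask & QUAL_VOLATILE) ? "volatile "   : "",
             (mask & QUAL_RESTRICT) ? "__restrict " : "");

    size_t len = strlen(out);
    if (addr_space != 0)
        snprintf(out + len, out_size - len, "as=%u", addr_space);
    else if (len > 0 && out[len - 1] == ' ')
        out[len - 1] = '\0';            /* drop the trailing space */
}
```

With this helper, a mask of 0x13 (const, volatile, address-space field 2) renders as `const volatile as=2`.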

Type Comparison

sub_7AA150 (types_are_identical, 636 lines) is the main structural type comparison function. It handles all 22 type kinds with recursive descent into component types. The algorithm:

Chase typedefs on both operands to reach canonical types.
If pointer-equal after chasing, return true (the common fast path).
If kinds differ, return false.
Dispatch on kind:
- Integer (kind 2): Compare integer subkind values.
- Pointer (kind 6): Recursively compare pointed-to types.
- Array (kind 8): Compare bounds and recursively compare element types.
- Function (kind 7): Compare return type, then parameter-by-parameter.
- Class (kind 9/10/11): Pointer equality only (nominal typing).
- Template param (kind 14): Compare parameter index and depth.
- Pointer-to-member (kind 13): Compare both class and member types.

The comparison is structural for most types but nominal for classes. Two distinct struct Foo definitions in different scopes are different types even if they have identical members.
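
The structural-versus-nominal split can be sketched on a toy type node. Everything below (`toy_type`, its fields, `toy_types_identical`) is hypothetical and covers only four kinds; it also ignores qualifier bits on typeref nodes, which the real comparison would have to account for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Kind numbers follow the list above; the node layout is hypothetical. */
enum { KIND_INT = 2, KIND_POINTER = 6, KIND_ARRAY = 8,
       KIND_CLASS = 9, KIND_TYPEREF = 12 };

typedef struct toy_type {
    int  kind;
    int  int_subkind;                /* KIND_INT only */
    long bound;                      /* KIND_ARRAY only */
    const struct toy_type *target;   /* pointee, element, or typedef target */
} toy_type;

static const toy_type *skip_typerefs(const toy_type *t)
{
    while (t->kind == KIND_TYPEREF)
        t = t->target;
    return t;
}

static bool toy_types_identical(const toy_type *a, const toy_type *b)
{
    assert(a != NULL && b != NULL);
    a = skip_typerefs(a);
    b = skip_typerefs(b);
    if (a == b)
        return true;                 /* fast path: same node */
    if (a->kind != b->kind)
        return false;
    switch (a->kind) {
    case KIND_INT:
        return a->int_subkind == b->int_subkind;
    case KIND_POINTER:
        return toy_types_identical(a->target, b->target);
    case KIND_ARRAY:
        return a->bound == b->bound &&
               toy_types_identical(a->target, b->target);
    case KIND_CLASS:
        return false;                /* nominal: distinct nodes never match */
    default:
        return false;                /* kinds not modelled in this sketch */
    }
}
```

Two separately allocated pointer-to-int nodes compare equal here, while two separately allocated class nodes do not, which mirrors the rule that distinct `struct Foo` definitions are distinct types.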

Cross-TU Type Correspondence

For relocatable device code (RDC) compilation, cudafe++ must match types across translation units. sub_7B2260 (types_are_equivalent_for_correspondence, 688 lines) performs a deep structural comparison that tolerates certain cross-TU differences (different typedef layers, different source positions) while requiring identical essential structure.

Type Construction Functions

Beyond f_make_qualified_type, several other type construction functions exist. Only some of them cache their results; the rest allocate a fresh node on every call:

Address	Function	Creates	Cache Location
`0x5D64F0`	`f_make_qualified_type`	`const T`, `volatile T`, etc.	Type `+112` chain
`0x5D6770`	`make_vector_type`	`__attribute__((vector_size(N)))` T	Allocated fresh
`0x5D68E0`	`character_type`	`char[N]` string literal types	Hash table at `qword_126F2F8` (81-slot per kind)
`0x5DB220`	`ptr_to_member_type_full`	`T Class::*`	Member type `+112` chain (MRU)
`0x7AB9B0`	`construct_function_type`	`R(Args...)`	Allocated fresh (423 lines)
`0x7A6320`	`make_cv_combined_type`	Combines cv-quals from two types	Allocated fresh

Character Type Cache

String literal types (char[5], wchar_t[12], etc.) are extremely common in C++ programs. character_type (sub_5D68E0) uses a hash-table cache at qword_126F2F8 with 81 slots per character kind (5 kinds: char, wchar_t, char8_t, char16_t, char32_t), covering array sizes 0 through 80. For sizes exceeding 80, no caching is performed and a fresh array type is allocated every time.
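
A minimal sketch of the size-bounded cache follows. It assumes the table is indexed directly by character kind and array length, which is one plausible reading of "81 slots per kind covering sizes 0 through 80"; `toy_array_type`, `char_type_cache`, and `toy_character_type` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { CHAR_KINDS = 5, CACHED_SIZES = 81 };   /* lengths 0..80 are cached */

typedef struct toy_array_type {
    int    char_kind;    /* 0..4: char, wchar_t, char8_t, char16_t, char32_t */
    size_t length;
} toy_array_type;

static toy_array_type *char_type_cache[CHAR_KINDS][CACHED_SIZES];

/* Return the array type for (char_kind, length), reusing a cached node
   when the length is small enough. Returns NULL on allocation failure. */
static toy_array_type *toy_character_type(int char_kind, size_t length)
{
    assert(char_kind >= 0 && char_kind < CHAR_KINDS);

    toy_array_type **slot =
        (length < CACHED_SIZES) ? &char_type_cache[char_kind][length] : NULL;
    if (slot != NULL && *slot != NULL)
        return *slot;                     /* cache hit */

    toy_array_type *type = malloc(sizeof *type);
    if (type == NULL)
        return NULL;
    type->char_kind = char_kind;
    type->length = length;
    if (slot != NULL)
        *slot = type;                     /* remember small sizes only */
    return type;
}
```

Requesting `char[5]` twice returns the same node; requesting `char[200]` twice returns two distinct nodes, matching the "no caching above 80" behaviour described above.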

Class Layout: do_class_layout

sub_662670 (do_class_layout, 2,548 lines) is the most complex function in the type system. It implements the Itanium C++ ABI class layout algorithm with GNU extensions, MSVC compatibility mode, and CUDA-specific adjustments. It is called exactly once per class definition from sub_442680 (class definition processing).

What do_class_layout Computes

For each class/struct/union, the function determines:

sizeof: Total class size including padding.
alignof: Required alignment, incorporating alignas, __attribute__((aligned)), and #pragma pack.
Member offsets: Byte offset of each non-static data member.
Base class offsets: Byte offset of each non-virtual base class subobject.
Virtual base offsets: Byte offset of each virtual base class subobject (stored in the vtable).
Vtable pointer placement: Where _vptr is placed (offset 0 for primary base, elsewhere for secondary).
Empty base optimization (EBO): Whether empty base classes can share address with data members.
Bit-field packing: How bit-fields are packed into allocation units.
Tail padding reuse: Whether derived classes can place members in base class tail padding (non-POD only).

Pseudocode: Itanium ABI Layout

void do_class_layout(type_t *class_type) {
    class_info_t *info = class_type->supplement_ptr;
    int sizeof_val = 0;
    int alignof_val = 1;
    int dsize = 0;          // data size (excludes tail padding)

    // PHASE 1: Lay out non-virtual base classes
    for (base_t *base = info->base_list; base; base = base->next) {
        if (base->is_virtual)
            continue;       // defer virtual bases

        int base_size = base->type->size_info;
        int base_align = base->type->alignment;

        // Empty base optimization
        if (is_empty_class(base->type)) {
            int offset = 0;
            while (empty_base_conflict(class_type, base->type, offset))
                offset += base_align;
            set_base_class_offset(base, offset);
            // sizeof may not increase for empty bases
        } else {
            // Align dsize up to base alignment
            dsize = ALIGN_UP(dsize, base_align);
            set_base_class_offset(base, dsize);
            dsize += base_size;
        }

        alignof_val = MAX(alignof_val, base_align);
    }

    // PHASE 2: Place vptr if needed
    if (class_has_virtual_functions(class_type) &&
        !has_primary_base_with_vptr(class_type)) {
        // vptr at current offset (usually 0)
        dsize = ALIGN_UP(dsize, POINTER_ALIGN);
        dsize += POINTER_SIZE;
        alignof_val = MAX(alignof_val, POINTER_ALIGN);
    }

    // PHASE 3: Lay out non-static data members
    for (field_t *field = info->first_field; field; field = field->next) {
        int field_align = alignment_of_field_full(field);
        int field_size = field->type->size_info;

        if (field->is_bitfield) {
            align_offsets_for_bit_field(field, &dsize, &alignof_val);
            continue;
        }

        dsize = ALIGN_UP(dsize, field_align);

        // Warn if field lands in tail padding of a base class
        warn_if_offset_in_tail_padding(class_type, dsize, field);

        field->offset = dsize;
        dsize += field_size;
        alignof_val = MAX(alignof_val, field_align);
    }

    // PHASE 4: Lay out virtual base classes
    for (base_t *base = info->base_list; base; base = base->next) {
        if (!base->is_virtual)
            continue;

        int base_align = base->type->alignment;
        if (is_empty_class(base->type)) {
            int offset = sizeof_val;
            while (subobject_conflict(class_type, base->type, offset))
                offset += base_align;
            set_virtual_base_class_offset(base, offset);
        } else {
            sizeof_val = ALIGN_UP(sizeof_val > dsize ? sizeof_val : dsize,
                                  base_align);
            set_virtual_base_class_offset(base, sizeof_val);
            sizeof_val += base->type->size_info;
        }
    }

    // PHASE 5: Finalize
    sizeof_val = MAX(sizeof_val, dsize);
    sizeof_val = ALIGN_UP(sizeof_val, alignof_val);
    if (sizeof_val == 0)
        sizeof_val = 1;     // C++ requires sizeof >= 1

    compute_empty_class_bit(class_type);
    trailing_base_does_not_affect_gnu_size(class_type);
    check_explicit_alignment(class_type);

    class_type->size_info = sizeof_val;
    class_type->alignment = alignof_val;

    // Debug: dump_layout() if debug flag set
    if (dword_126EFC8)
        dump_layout(class_type);
}
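
The pseudocode leaves ALIGN_UP undefined. The sketch below gives one conventional definition and runs Phases 3 and 5 for a class with no bases and no vptr, so the offset arithmetic can be checked by hand. `align_up`, `toy_field`, and `toy_layout` are illustrative names, and the power-of-two restriction is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

typedef struct toy_field {
    size_t size;
    size_t align;
    size_t offset;    /* filled in by toy_layout */
} toy_field;

/* Lay out plain data members in declaration order.
   Returns sizeof and stores alignof through out_align. */
static size_t toy_layout(toy_field *fields, size_t count, size_t *out_align)
{
    size_t dsize = 0;
    size_t align = 1;
    for (size_t i = 0; i < count; i++) {
        dsize = align_up(dsize, fields[i].align);
        fields[i].offset = dsize;
        dsize += fields[i].size;
        if (fields[i].align > align)
            align = fields[i].align;
    }
    size_t size = align_up(dsize, align);   /* tail padding */
    if (size == 0)
        size = 1;                           /* sizeof is never 0 */
    *out_align = align;
    return size;
}
```

For members `char`, `double`, `int` (sizes 1, 8, 4 with matching alignments) this yields offsets 0, 8, 16, a data size of 20, and a final size of 24 after rounding up to the class alignment of 8.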

Key Sub-Functions

Address	Function	Purpose
`0x65EA50`	`trailing_base_does_not_affect_gnu_size`	Checks if trailing empty base affects GNU-compatible size vs dsize
`0x65EE70`	`empty_base_conflict`	Self-recursive: detects two empty bases of same type at same address
`0x65F410`	`increment_field_offsets`	Advances offset counters; warns about tail-padding overlap
`0x65F9F0`	`last_user_field_of`	Finds last user-declared (non-compiler-generated) field
`0x65FC20`	`subobject_conflict`	Generalizes empty_base_conflict to all subobjects
`0x6610B0`	`set_base_class_offsets`	Assigns offsets to non-virtual base class subobjects
`0x6614A0`	`set_virtual_base_class_offset`	Assigns offsets to virtual base class subobjects
`0x6621E0`	`alignment_of_field_full`	Computes field alignment considering packed, aligned, pragma pack

Empty Base Optimization

The EBO is one of the most subtle parts of C++ layout. The C++ standard requires that two distinct subobjects of the same type have different addresses. But empty base classes (no data members, no virtual functions, all bases empty) can be placed at offset 0 without consuming space -- unless another subobject of the same type already occupies that address.

empty_base_conflict (sub_65EE70, 240 lines) is self-recursive: it walks the entire base class hierarchy checking for address collisions. When a conflict is detected, the layout engine advances the offset by the base's alignment until no conflict exists.

Alignment Computation

alignment_of_field_full (sub_6621E0, 193 lines) computes the effective alignment of a data member considering all alignment modifiers in priority order:

Natural alignment of the field's type.
__attribute__((aligned(N))) -- increases alignment.
__attribute__((packed)) -- reduces alignment to 1.
#pragma pack(N) -- caps alignment at N.
__declspec(align(N)) -- MSVC mode alignment.

These modifiers interact rather than apply independently. For example, #pragma pack(4) on a struct with a double member reduces the double's alignment from 8 to 4, but __attribute__((aligned(16))) on the same member overrides the pack and raises the alignment to 16.
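
The #pragma pack(4) case can be observed with an ordinary host compiler. The sketch below exercises GCC or Clang, not cudafe++ itself, and deliberately covers only the pack half of the example: the aligned(16) interaction is left out because it is not verified here. The struct and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

struct natural_layout {
    char   tag;
    double value;      /* placed at its natural alignment */
};

#pragma pack(push, 4)
struct packed4_layout {
    char   tag;
    double value;      /* alignment capped at 4 by the pragma */
};
#pragma pack(pop)

/* Offset of the double member with and without the pragma. */
static size_t natural_value_offset(void)
{
    return offsetof(struct natural_layout, value);
}

static size_t packed4_value_offset(void)
{
    return offsetof(struct packed4_layout, value);
}

/* Under pack(4) the double follows the char after 3 bytes of padding. */
static void check_pack_effect(void)
{
    assert(packed4_value_offset() == 4);
    assert(sizeof(struct packed4_layout) == 12);
    assert(natural_value_offset() >= packed4_value_offset());
}
```

On x86-64 the unpacked struct places `value` at offset 8, so the pragma saves four bytes of padding at the cost of a potentially misaligned double.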

Type Trait Evaluation

sub_7BDCB0 (evaluate_type_trait, 510 lines) implements the compiler built-in type traits: __is_trivially_copyable, __is_constructible, __has_unique_object_representations, __is_aggregate, __is_empty, etc. These are dispatched via a switch on trait ID and return boolean results by inspecting the class type supplement flags and calling recursive property checks.

Type Deduction

sub_7B9670 (deduce_template_argument_type, 459 lines) handles template argument deduction from function arguments to template parameters. This is separate from the template engine's substitute_in_type (sub_7BCDE0, 800 lines), which performs the reverse operation: given concrete template arguments, produce the substituted type.

Global Type Singletons

Several frequently-used types are cached as global pointers to avoid repeated allocation:

Global	Type
`qword_126E5E0`	`void` type
`qword_126F2F0`	`void` type (duplicate reference)
`qword_126F1A0`	`std::source_location::__impl` (cached on first use)

Statistics Tracking

Every type-related allocation increments a per-kind counter. print_trans_unit_statistics (sub_7A45A0) dumps these counters via fprintf:

Counter	What it counts	Per-entry size
`qword_126F8E0`	Type nodes allocated	176 B
`qword_126F8E8`	Integer type supplements	32 B
`qword_126F958`	Routine type supplements	64 B
`qword_126F948`	Class type supplements	208 B
`qword_126F8F0`	Typeref supplements	56 B
`qword_126F8F8`	Template param supplements	40 B
`qword_126F280`	Pointer-to-member types constructed	--

CUDA-Specific Type Extensions

Address Space Qualifiers

CUDA's __shared__, __constant__, and __device__ memory spaces are represented as address-space qualifiers in the cv-qualifier bitmask (bits 3--6 at +161). The attribute kind values {1, 6, 11, 12} (bitmask 0x1842) are checked in compare_attribute_specifiers (sub_7A5E10) to detect incompatible address-space qualified typedefs.

Feature Usage Tracking

record_type_features_used (sub_7A4F10) records GPU feature requirements based on types encountered:

_BitInt types (integer subkind 11/12): sets bit 0 of byte_12C7AFC
__float128 / __bf16 types: sets bit 2
Bit-fields: sets bit 1
Class types: copies feature bits from +164

This information feeds into architecture gating, ensuring that code using _BitInt(128) targets a GPU architecture that supports it.

Constexpr Type Size Limits

The constexpr interpreter (sub_628DE0, f_value_bytes_for_type) enforces a 64 MB limit (0x4000000 bytes) on types used in constexpr evaluation. This prevents compile-time memory exhaustion from expressions like constexpr std::array<char, 1'000'000'000> x{};.

Function Map

Address	Lines	Function	Source
`0x5D64F0`	340	`f_make_qualified_type`	il.c
`0x5DB220`	63	`ptr_to_member_type_full`	il.c
`0x5E2E80`	--	`set_type_kind`	il_alloc.c
`0x5E3D40`	--	`alloc_type`	il_alloc.c
`0x65EA50`	105	`trailing_base_does_not_affect_gnu_size`	layout.c
`0x65EE70`	240	`empty_base_conflict`	layout.c
`0x65FC20`	271	`subobject_conflict`	layout.c
`0x6610B0`	196	`set_base_class_offsets`	layout.c
`0x6614A0`	204	`set_virtual_base_class_offset`	layout.c
`0x6621E0`	193	`alignment_of_field_full`	layout.c
`0x662670`	2548	`do_class_layout`	layout.c
`0x7A4B40`	--	`ttt_is_type_with_no_name_linkage`	types.c
`0x7A4F10`	--	`record_type_features_used`	types.c
`0x7A5E10`	--	`compare_attribute_specifiers`	types.c
`0x7A6260`	--	`type_has_flexible_array_or_vla`	types.c
`0x7A6320`	--	`make_cv_combined_type`	types.c
`0x7A68F0`--`0x7A9F90`	--	130 leaf query functions	types.c
`0x7AA150`	636	`types_are_identical`	types.c
`0x7AB9B0`	423	`construct_function_type`	types.c
`0x7AE680`	541	`adjust_type_for_templates`	types.c
`0x7B2260`	688	`types_are_equivalent_for_correspondence`	types.c
`0x7B3400`	905	`standard_conversion_sequence`	types.c
`0x7B5210`	441	`require_complete_type`	types.c
`0x7B6350`	1107	`compute_type_layout`	types.c
`0x7B7750`	784	`compute_class_properties`	types.c
`0x7B9670`	459	`deduce_template_argument_type`	types.c
`0x7BDCB0`	510	`evaluate_type_trait`	types.c
`0x7BF630`	348	`format_type_for_diagnostic`	types.c
`0x7C02A0`	--	`compatible_ms_bit_field_container_types`	types.c

Diagnostic System Overview

The cudafe++ diagnostic system is a 7-stage pipeline rooted in EDG 6.6's error.c. It manages 3,795 error message templates, 9 severity levels, per-error suppression tracking, #pragma diagnostic overrides, and two output formats (text and SARIF JSON). The most-connected function in the entire binary -- sub_4F2930 (assertion handler) with 5,185 call sites -- feeds into this system, making error handling the single largest cross-cutting concern in cudafe++.

Error Table

The error message template table lives at off_88FAA0: an array of 3,795 const char* pointers indexed by error code (0--3794).

Range	Count	Origin	Display Format
0--3456	3,457	Standard EDG 6.6	`#N-D`
3457--3794	338	NVIDIA CUDA extensions	`#(N+16543)-D` (20000--20337-D series)

The renumbering logic in construct_text_message (sub_4EF9D0):

int display_code = error_code;
if (display_code > 3456)
    display_code = error_code + 16543;   // 3457 → 20000, 3794 → 20337
sprintf(buf, "%d", display_code);

The -D suffix is appended only when severity <= 7 (soft errors and everything below). Diagnostics with severity > 7 (hard error, catastrophic, command-line error, internal error) omit the suffix:

const char *suffix = "-D";
if (severity > 7)
    suffix = "";
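
Combining the renumbering and the suffix rule gives a small self-contained formatter. This is a sketch: `display_code_for`, `format_error_tag`, and the enum names are hypothetical, and the leading `#` follows the display format column of the table above.

```c
#include <assert.h>
#include <stdio.h>

enum { LAST_EDG_CODE = 3456, LAST_CODE = 3794, CUDA_DISPLAY_BIAS = 16543 };

/* Map an internal error code to the number shown to the user. */
static int display_code_for(int error_code)
{
    assert(error_code >= 0 && error_code <= LAST_CODE);
    return error_code > LAST_EDG_CODE ? error_code + CUDA_DISPLAY_BIAS
                                      : error_code;
}

/* Write a tag such as "#20000-D" into out.
   Returns the snprintf result, or -1 for an out-of-range code. */
static int format_error_tag(int error_code, int severity,
                            char *out, size_t out_size)
{
    assert(out != NULL);
    if (error_code < 0 || error_code > LAST_CODE)
        return -1;
    const char *suffix = severity > 7 ? "" : "-D";
    return snprintf(out, out_size, "#%d%s",
                    display_code_for(error_code), suffix);
}
```

Code 3457 at warning severity produces `#20000-D`, while code 3794 at catastrophic severity produces `#20337` with no suffix.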

Any access with error code > 3794 triggers sub_4F2D30 (error_text), which fires an assertion: "error_text: invalid error code" (error.c:911).

Severity Levels

Nine severity values are stored as a single byte at offset 180 of the diagnostic record:

Value	Name	Display String (lowercase)	Display String (uppercase)	Colorization	Exit Behavior
2	note	`"note"`	`"Note"`	cyan (code 4)	continues
4	remark	`"remark"`	`"Remark"`	cyan (code 4)	continues
5	warning	`"warning"`	`"Warning"`	magenta (code 3)	continues
6	command-line warning	`"command-line warning"`	`"Command-line warning"`	magenta (code 3)	continues
7	error (soft)	`"error"`	`"Error"`	red (code 2)	continues, counted
8	error (hard)	`"error"`	`"Error"`	red (code 2)	continues, counted, not suppressible by pragma
9	catastrophic	`"catastrophic error"`	`"Catastrophic error"`	red (code 2)	immediate exit(4)
10	command-line error	`"command-line error"`	`"Command-line error"`	red (code 2)	immediate exit(4)
11	internal error	`"internal error"`	`"Internal error"`	red (code 2)	immediate exit(11) via abort path

Uppercase display strings are used when dword_106BCD4 is set, indicating the diagnostic originates from a predefined macro file context (e.g., "In predefined macro file: Error #...").

The special string "nv_diag_remark" at offset +8 yields "remark" -- an NVIDIA-specific annotation kind for CUDA diagnostic remarks.

Severity Byte Arrays

Three parallel byte arrays, indexed as [4 * error_code], track per-error severity state:

Array	Address	Purpose
`byte_1067920`	`0x1067920`	Default severity -- the compile-time severity assigned to each error code
`byte_1067921`	`0x1067921`	Current severity -- the effective severity after `#pragma` overrides
`byte_1067922`	`0x1067922`	Tracking flags -- bit 0: first-time guard, bit 1: already-emitted, bit 2: has pragma override

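The three base addresses are consecutive bytes and all three arrays share the 4-byte stride, so together they form a single table of 4-byte records, one per error code: byte 0 holds the default severity, byte 1 the current severity, and byte 2 the tracking flags (the role of byte 3 is not identified here). This layout lets the pragma override system (sub_4F30A0) read and modify all per-error state for a code from one slot.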
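
Since the three base addresses are consecutive (0x1067920, 0x1067921, 0x1067922), one way to read the layout is as a single array of 4-byte records. The sketch below models that reading; the struct, its field names, and `restore_default_severity` are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { ERROR_CODE_COUNT = 3795 };

/* One 4-byte record per error code; field names are illustrative. */
typedef struct severity_slot {
    uint8_t default_severity;   /* seen as byte_1067920[4 * code] */
    uint8_t current_severity;   /* seen as byte_1067921[4 * code] */
    uint8_t tracking_flags;     /* seen as byte_1067922[4 * code] */
    uint8_t unused;             /* fourth byte: role not identified */
} severity_slot;

static severity_slot severity_table[ERROR_CODE_COUNT];

/* Restore the compile-time default, as pragma action 35 is described to do. */
static void restore_default_severity(int error_code)
{
    assert(error_code >= 0 && error_code < ERROR_CODE_COUNT);
    severity_table[error_code].current_severity =
        severity_table[error_code].default_severity;
}

/* View the same storage as the flat byte array the decompiler shows. */
static uint8_t *raw_severity_bytes(void)
{
    return (uint8_t *)severity_table;
}
```

Under this model, `raw_severity_bytes()[4 * code + 1]` and `severity_table[code].current_severity` name the same byte, which is how the decompiled `[4 * error_code]` indexing off three adjacent bases arises.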

7-Stage Diagnostic Pipeline

  caller emits error
       |
       v
  [1] create_diagnostic_entry     sub_4F40C0
       Allocate ~200-byte record, set error_code + severity
       |
       v
  [2] check_for_overridden_severity   sub_4F30A0
       Walk #pragma diagnostic stack, apply push/pop overrides
       |
       v
  [3] check_severity              sub_4F1330      ← 62 callers, 77 callees
       Central dispatch: suppress/promote, error limit, output routing
       |
       ├─── text path ──────────────────────────────────────┐
       |                                                     |
       v                                                     v
  [4] write_message_to_buffer    sub_4EF620       [6] write_sarif_message_json  sub_4EF8A0
       Expand %XY format specifiers from template             JSON-escape + wrap
       |                                                     |
       v                                                     v
  [5] construct_text_message     sub_4EF9D0 (6.5 KB)       SARIF JSON buffer → stderr
       file:line prefix, severity label, word wrap,
       caret lines, template context, include stack
       |
       v
  [4a] process_fill_in           sub_4EDCD0 (1,202 lines)
       Expand %T/%n/%s/%p/%d/%u/%t/%r specifiers
       |
       v
       output → stderr or redirect file

Stage 1: create_diagnostic_entry (`sub_4F40C0`)

Allocates a diagnostic record via sub_4EC940 and initializes it:

record = allocate_diagnostic_record();
record->kind = 0;              // primary diagnostic
record->error_code = a1;       // offset 176
if (severity <= 7)
    check_for_overridden_severity(a1, &severity, position);
record->severity = severity;   // offset 180
// resolve source position → file, line, column
// link into global diagnostic chain (qword_106BA10)

The wrapper sub_4F41C0 sets dword_106B4A8 (file index mode) to -1 for command-line and fatal severities (6, 9, 10, 11), disabling file-index tracking for diagnostics that have no meaningful source location.

Sub-diagnostics are created by sub_4F5A70 with kind=2, linked to their parent's sub-diagnostic chain at offsets 40/48 of the parent record.

Stage 2: check_for_overridden_severity (`sub_4F30A0`)

Walks the #pragma diagnostic stack stored in qword_1067820. Each stack entry is a 24-byte record containing a source position, a pragma action code, and an optional error code target.

Pragma action codes and their effect on severity:

Code	Pragma	Effect
30	`ignored`	Set severity to 3 (suppress)
31	`remark`	Set severity to 4
32	`warning`	Set severity to 5
33	`error`	Set severity to 7
35	`default`	Restore from `byte_1067920[4 * error_code]`
36	push/pop marker	Scope boundary for push/pop tracking

The function uses binary search (bsearch with comparator sub_4ECD20) to find the nearest pragma entry that applies at the current source position, then walks backward through the stack to resolve nested push/pop scopes.
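
A simplified sketch of the lookup is shown below. It assumes entries are sorted by a numeric source position and replaces the library bsearch call with a hand-written upper-bound search, because finding the nearest preceding entry needs more than an exact match. Push/pop scope resolution (action 36) is skipped rather than modelled, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct pragma_entry {
    unsigned long position;   /* assumed: increasing source sequence number */
    int action;               /* 30..36, per the table above */
    int error_code;
} pragma_entry;

/* Index of the last entry with position <= pos, or -1 if there is none. */
static long find_last_at_or_before(const pragma_entry *entries, size_t count,
                                   unsigned long pos)
{
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (entries[mid].position <= pos)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (long)lo - 1;
}

/* Severity selected by a pragma action, or -1 if the action sets none. */
static int severity_for_action(int action, int default_severity)
{
    switch (action) {
    case 30: return 3;                  /* ignored */
    case 31: return 4;                  /* remark */
    case 32: return 5;                  /* warning */
    case 33: return 7;                  /* error */
    case 35: return default_severity;   /* default */
    default: return -1;                 /* 36 (marker) and unknown codes */
    }
}

/* Walk backward from the nearest entry to the first one naming error_code. */
static int effective_severity(const pragma_entry *entries, size_t count,
                              unsigned long pos, int error_code,
                              int default_severity)
{
    assert(entries != NULL || count == 0);
    for (long i = find_last_at_or_before(entries, count, pos); i >= 0; i--) {
        if (entries[i].error_code != error_code)
            continue;
        int severity = severity_for_action(entries[i].action, default_severity);
        if (severity >= 0)
            return severity;
    }
    return default_severity;
}
```

The binary search bounds the backward walk to entries that precede the diagnostic, so pragmas appearing later in the file never affect it.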

Stage 3: check_severity (`sub_4F1330`)

The central dispatch function (601 decompiled lines, 62 callers, 77 callees). This is the most complex function in the error subsystem.

Complete decision tree pseudocode (derived from the decompiled sub_4F1330):

void check_severity(diagnostic_record *record) {
    dword_1065938 = 0;                    // reset caret-position cache
    uint8_t min_sev = byte_126ED69;       // minimum severity threshold

    // ── Gate 1: Minimum severity filter ──
    if (min_sev > record->severity) {
        if (min_sev == 3)                 // severity 3 = suppress sentinel
            ASSERT_FAIL("check_severity", error.c:3859);
        goto count_and_exit;              // silently discard
    }

    // ── Gate 2: System-header / suppress-all threshold raise ──
    if (is_system_header(record->source_sequence_number)) {
        min_sev = 8;                      // only hard errors and above pass
    } else if (qword_106BCD8) {           // suppress-all-but-fatal mode
        min_sev = 7;                      // treat as error-level floor
    } else if (min_sev == 3) {
        ASSERT_FAIL("check_severity", error.c:3859);
    }

    if (record->severity < min_sev)
        goto count_and_exit;

    // ── Gate 3: Per-error tracking flags ──
    uint8_t *flags = &byte_1067922[4 * record->error_code];
    if (record->severity <= 7) {          // suppressible severities only
        uint8_t old = *flags;
        *flags |= 2;                      // mark as emitted
        if ((old & 1) && (old & 2))       // first-time guard + already-emitted
            goto suppressed;              // skip: already seen in this scope
    } else {
        *flags |= 2;                      // hard errors: always mark, never skip
    }

    // ── Gate 4: Pragma diagnostic check ──
    if (dword_126C5E4 != -1) {            // scope tracking active
        if (check_pragma_diagnostic(record->error_code,
                                     record->severity,
                                     &record->source_seq)) {
suppressed:
            // Update error/warning counters even when suppressed
            uint8_t sev = record->severity;
            if (sev > 7 || sev >= byte_126ED68)
                sev = 7;                          // at or above promotion threshold: count as error
            update_suppressed_counts(sev, &qword_126EDC8);
            goto count_and_exit;
        }
        // Record pragma scope if applicable
        if (in_template_scope() || has_special_scope_flags())
            record_pragma_diagnostic(record->error_code, record->severity);
    }

    // ── Gate 5: Suppress-all-but-fatal redirect ──
    if (qword_106BCD8 && !dword_106BCD4 && record->error_code != 992) {
        emit_error_992();                 // replace with fatal error 992
        return;                           // guard against catastrophic loop
    }

    // ── Severity promotion: warning → error threshold ──
    uint8_t effective_sev = record->severity;
    if (effective_sev <= 7 && effective_sev >= byte_126ED68) {
        effective_sev = 7;                // promote to error
        if (dword_126C5C8 == -1) {
            update_counts(7, &qword_126ED80);
            if (!dword_126ED78)           // no further counting needed
                goto skip_extra_counts;
            goto update_additional_counts;
        }
    }
    // Otherwise: already above 7, or below the promotion threshold
    update_counts(effective_sev, &qword_126ED80);
    // Unsigned compare: true when effective_sev is outside 9..11 (non-fatal)
    if (dword_126ED78 && (uint8_t)(effective_sev - 9) > 2)
        goto update_additional_counts;
    goto skip_extra_counts;

update_additional_counts:
    // Extra counting enabled by dword_126ED78 (body not shown in this listing)

skip_extra_counts:
    if (qword_126EDC0)
        update_counts(effective_sev, qword_126EDC0);

    // ── Allocate output buffers (first use) ──
    if (!qword_106B488) {
        qword_106B488 = alloc_buffer(0x400);
        qword_106B480 = alloc_buffer(0x80);
    }
    reset_buffer(qword_106B488);
    reset_buffer(qword_106B480);

    // ── Catastrophic loop detection ──
    if (record->severity == 9) {
        if (dword_106B4B0) {              // already processing catastrophic
            fprintf(stderr, "%s\n", "Loop in catastrophic error processing.");
            emergency_exit(9);            // never returns
        }
        dword_106B4B0 = 1;               // set catastrophic guard
        if (record->error_code == 3709 || !dword_126ED48)
            goto emit_message;
    } else if (record->severity == 11 || record->error_code == 3709) {
        goto emit_message;                // internal error or warnings-as-errors
    }

    // ── Template context expansion ──
    int context_count = 0;
    for (int scope = dword_126C5E4; scope > 0; scope--) {
        context_count += format_scope_context(scope);
    }
    // Include-file context
    if (dword_126EE48 && qword_106B9F0 && has_include_context()) {
        file_info = lookup_source_file(record->source_seq);
        if (file_info != current_file) {
            context_count++;
            // Emit error 1063/1064 (include-stack context)
            create_sub_diagnostic(record, (context_count != 1) ? 1064 : 1063);
        }
    }
    // Context elision (when context_limit is set)
    if (dword_126ED58 > 0 && dword_126ED58 + 1 < context_count) {
        // Emit error 1150: "%d context lines elided"
    }

emit_message:
    // ── Output routing ──
    reset_buffer(qword_106B488);
    if (dword_106BBB8 == 1) {
        // SARIF JSON path
        write_sarif_json(record);         // → qword_106B478
        fputs(sarif_buffer, stderr);
        fflush(stderr);
    } else {
        construct_text_message(record);   // → sub_4EF9D0
    }

    // ── Termination for fatal severities ──
    if (record->severity >= 9 && record->severity <= 11) {
        cleanup();                        // flush output, close files
        emergency_exit(record->severity); // sub_5AF2B0 maps severity to the exit code
        // unreachable
    }

    // ── Error limit enforcement ──
    if (qword_126ED90 + qword_126ED98 >= qword_126ED60) {
        fprintf(stderr, "%s\n", "Error limit reached.");
        if (qword_106C260)                // raw listing file
            fwrite("C \"\" 0 0 error limit reached\n", 1, 29, listing);
        cleanup();
        emergency_exit(9);               // exit(catastrophic)
    }

    // ── Warnings-as-errors promotion ──
    if (record->severity == 5 && dword_106C088 && !dword_106B4BC) {
        uint8_t saved_min = byte_126ED69;
        byte_126ED69 = 4;                // temporarily lower threshold
        dword_106C088 = 0;               // disable the mode while emitting 3709
        dword_106B4BC = 1;               // one-shot guard: stays set afterwards
        emit_diagnostic(4, 3709, ...);   // "warnings treated as errors"
        byte_126ED69 = saved_min;        // restore threshold
        dword_106C088 = 1;               // restore mode
    }

    // ── File index update ──
    if (dword_106B4A8 != -1)
        update_file_index(record);

count_and_exit:
    return;
}

Key decision points explained:

Minimum severity gate:

The global byte_126ED69 is the minimum severity threshold -- diagnostics below this level are silently discarded. When the threshold is 3 (the "suppress" sentinel), an assertion fires, which prevents the threshold from ever being set to the suppress level directly.

System-header filtering:

When a diagnostic originates from a system header (detected by sub_5B9B60), the minimum-severity threshold for that diagnostic is raised to 8. Only hard errors and fatal severities pass; anything at severity 7 or below is discarded. This applies equally to CUDA system headers.

Per-error tracking:

Bit 0 of the tracking flags (byte_1067922[4 * error_code]) acts as a first-time guard and bit 1 records that the error has been emitted. For suppressible severities (<= 7), a diagnostic whose bit 0 and bit 1 were both already set on entry is routed to the suppressed path instead of being emitted again. Hard errors (severity > 7) set bit 1 but are never skipped.

Suppress-all-but-fatal mode:

When qword_106BCD8 is set and the error is not error 992 (the fatal sentinel), check_severity replaces the current diagnostic with error 992 and re-enters.

Catastrophic loop detection:

The re-entry guard dword_106B4B0 prevents infinite recursion when a catastrophic error triggers another catastrophic error during its own processing. The message "Loop in catastrophic error processing." is printed directly to stderr followed by emergency_exit(9).

Error limit enforcement:

The sum of qword_126ED90 (total errors) and qword_126ED98 (total warnings) is checked against qword_126ED60 (error limit). When the sum reaches or exceeds the limit, the compiler writes the limit message and exits with catastrophic status. The raw listing file also receives a machine-readable C "" 0 0 error limit reached line.

Warnings-as-errors promotion:

When dword_106C088 (warnings-are-errors mode) is set and the guard dword_106B4BC is still clear, a warning (severity 5) triggers error 3709 ("warnings treated as errors") as a follow-up diagnostic. The implementation temporarily lowers the minimum severity threshold to 4 (remark), disables warnings-as-errors mode, sets the recursion guard, emits the diagnostic, then restores the threshold and the mode flag. The guard is left set in the pseudocode above, so the error-3709 diagnostic neither triggers another error-3709 nor is emitted a second time.

Output routing:

if (dword_106BBB8 == 1)
    // SARIF JSON path: sub_4EF8A0 → qword_106B478
else
    sub_4EF9D0(record);   // text path → construct_text_message

Termination for fatal severities:

Severities 9 (catastrophic), 10 (command-line error), and 11 (internal error) all trigger cleanup via sub_66B5E0 followed by sub_5AF2B0(severity), which maps severity to the process exit code.

Stage 4: write_message_to_buffer (`sub_4EF620`)

Looks up the error template string from the table and expands format specifiers:

const char *template = off_88FAA0[error_code];   // error_code must be <= 3794

Format specifier syntax: %XY...Zn where:

X = specifier letter (T, d, n, p, r, s, t, u)
Y...Z = option characters (a-z, A-Z), max 29
n = trailing digit = fill-in index

Special forms:

%% = literal %
%[label] = named label fill-in, looked up in off_D481E0 table

Each specifier dispatches to process_fill_in (sub_4EDCD0) with the appropriate fill-in kind.
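
The specifier grammar can be sketched as a small parser. This is an illustration only: `fill_in_spec` and `parse_fill_in` are hypothetical, the `%%` and `%[label]` forms are not handled, and treating a missing trailing digit as index -1 is an assumption of the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum { MAX_OPTIONS = 29 };

typedef struct fill_in_spec {
    char letter;                       /* one of T d n p r s t u */
    char options[MAX_OPTIONS + 1];     /* option letters, NUL-terminated */
    int  index;                        /* trailing digit, or -1 if absent */
} fill_in_spec;

/* Parse one "%XY...Zn" specifier at the start of text.
   Returns the number of characters consumed, or 0 if text does not
   start with a specifier this sketch understands. */
static size_t parse_fill_in(const char *text, fill_in_spec *spec)
{
    assert(text != NULL && spec != NULL);
    if (text[0] != '%' || text[1] == '\0' ||
        strchr("Tdnprstu", text[1]) == NULL)
        return 0;

    spec->letter = text[1];
    size_t pos = 2;
    size_t count = 0;
    while (isalpha((unsigned char)text[pos])) {
        if (count == MAX_OPTIONS)
            return 0;                  /* too many option characters */
        spec->options[count++] = text[pos++];
    }
    spec->options[count] = '\0';

    spec->index = -1;
    if (isdigit((unsigned char)text[pos]))
        spec->index = text[pos++] - '0';
    return pos;
}
```

For the input `%nfd1 was declared`, the parser consumes five characters and reports letter `n`, options `fd`, and fill-in index 1.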

Stage 5: construct_text_message (`sub_4EF9D0`)

The largest function in the error subsystem at 1,464 decompiled lines (6.5 KB). Formats the complete diagnostic output.

Output format:

file(line): severity #code-D: message text

Variant formats:

"At end of source: ..." -- when line number is 0
"In predefined macro file: ..." -- when dword_106BCD4 is set
"Line N" -- when the file name is "-" (stdin)

Sub-diagnostic indentation:

Kind	Indent (chars)	Continuation indent
0 (primary)	0	10
2 (sub-diagnostic, same parent)	10	20
2 (sub-diagnostic, different parent)	12	22
3 (related)	1	11

Word wrapping:

The function wraps output text at dword_106B470 (terminal width) column boundaries. When colorization is disabled, it uses a simple space-scanning algorithm. When colorization is enabled (ESC byte 0x1B in the formatted string), it tracks visible character width separately from escape sequence bytes and wraps only on visual boundaries.
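
Tracking visible width separately from escape bytes can be sketched as follows. The sketch assumes the color sequences have the common ANSI form ESC `[` parameters final-byte; the text above only establishes that they begin with 0x1B, so the exact sequence shape and the `visible_width` name are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Count printable columns in s, skipping ANSI CSI escape sequences
   (ESC '[' ... final byte in the range 0x40-0x7E). Each remaining byte
   is counted as one column, so multi-byte characters are not handled. */
static size_t visible_width(const char *s)
{
    assert(s != NULL);
    size_t width = 0;
    while (*s != '\0') {
        if (s[0] == 0x1B && s[1] == '[') {
            s += 2;
            while (*s != '\0' && !(*s >= 0x40 && *s <= 0x7E))
                s++;                   /* parameter bytes */
            if (*s != '\0')
                s++;                   /* final byte, e.g. 'm' */
        } else {
            width++;
            s++;
        }
    }
    return width;
}
```

A wrapper would compare this width, rather than the raw byte count, against the terminal width before inserting a line break.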

Fill-in verification:

After output, the function iterates the fill-in linked list and asserts that every entry has used_flag == 1. An unused fill-in triggers: "construct_text_message: not all fill-ins used for error string: \"...\"" (error.c:4781).

Raw listing output:

When qword_106C260 (raw listing file) is open and the diagnostic is not a continuation (kind != 3), a machine-readable line is emitted:

S "filename" line column message\n

Where S is a single-character severity code: R (remark), W (warning), E (error), C (catastrophic/internal). Internal errors additionally prefix "(internal error) " before the message text.
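
A sketch of the raw listing line format follows. The mapping of severities 6, 8, and 10 onto W, E, and C is an assumption that extends the four letters named above, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Single-character severity code for the raw listing file.
   Severities 6, 8 and 10 are grouped with their neighbours by assumption. */
static char listing_severity_char(int severity)
{
    switch (severity) {
    case 4:                    return 'R';
    case 5: case 6:            return 'W';
    case 7: case 8:            return 'E';
    case 9: case 10: case 11:  return 'C';
    default:                   return '?';   /* not covered by this sketch */
    }
}

/* Format: S "filename" line column message\n
   Returns the snprintf result. */
static int format_listing_line(char *out, size_t out_size, int severity,
                               const char *file, unsigned line,
                               unsigned column, const char *message)
{
    assert(out != NULL && file != NULL && message != NULL);
    return snprintf(out, out_size, "%c \"%s\" %u %u %s%s\n",
                    listing_severity_char(severity), file, line, column,
                    severity == 11 ? "(internal error) " : "",
                    message);
}
```

A warning at line 3, column 7 of `a.cu` would be written as `W "a.cu" 3 7 unused variable` followed by a newline.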

Stage 4a: process_fill_in (`sub_4EDCD0`)

Expands a single format specifier by searching the diagnostic record's fill-in linked list (head at offset 184) for an entry matching the requested kind and index. 1,202 decompiled lines.

Fill-in kind dispatch (from ASCII code of specifier letter):

Letter	ASCII	Kind	Payload
`%T`	84	6 (type)	Type node pointer
`%d`	100	0 (decimal)	Integer value
`%n`	110	4 (entity name)	Entity node pointer + options
`%p`	112	2 (parameter)	Source position
`%r`	114	7	Byte + pointer
`%s`	115	3 (string)	String pointer
`%t`	116	5	(type variant)
`%u`	117	1 (unsigned)	Unsigned integer value
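The dispatch table translates directly into a switch, followed by a walk over the fill-in list. The sketch below is hedged: `struct fill_in` is a reduced stand-in for the real 40-byte entry, and interpreting the index as "the n-th entry of that kind, counting from 1" is an assumption, since the entry layout shows no explicit index field.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for a fill-in entry (payload omitted). */
struct fill_in {
    int             kind;
    unsigned char   used_flag;
    struct fill_in *next;
};

/* Specifier letter -> fill-in kind, per the table above; -1 if unknown. */
static int fill_in_kind_for_letter(char letter)
{
    switch (letter) {
    case 'd': return 0;
    case 'u': return 1;
    case 'p': return 2;
    case 's': return 3;
    case 'n': return 4;
    case 't': return 5;
    case 'T': return 6;
    case 'r': return 7;
    default:  return -1;
    }
}

/* Find the index-th (1-based, assumed) entry of `kind` and mark it used. */
static struct fill_in *find_fill_in(struct fill_in *head, int kind, int index)
{
    assert(index >= 1);
    for (struct fill_in *e = head; e != NULL; e = e->next) {
        if (e->kind == kind && --index == 0) {
            e->used_flag = 1;    /* checked later by the fill-in verification */
            return e;
        }
    }
    return NULL;
}
```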

Entity name options (for %n specifier):

Option	Meaning
`f`	Full qualification
`o`	Omit kind prefix
`p`	Omit parameters
`t`	Full with template arguments
`a`	Omit + show accessibility
`d`	Show declaration location
`T`	Show template specialization

Assertion Handler (`sub_4F2930`)

The most-connected function in the entire cudafe++ binary: 5,185 call sites. Declared __noreturn.

Signature:

void __noreturn assertion_handler(
    char *source_file,     // EDG source file path
    int   line_number,     // source line number
    const char *func_name, // enclosing function name
    const char *prefix,    // message prefix (or NULL)
    const char *message    // detail message (or NULL)
);

Message format (with prefix):

assertion failed: <prefix> <message> (<file>, line <line> in <func>)

Message format (without prefix):

assertion failed at: "<file>", line <line> in <func>

The function allocates a 0x400-byte buffer via sub_6B98A0, concatenates the message components using sub_6B9CD0 (buffer append), then calls sub_4F21C0 (internal_error). Because sub_4F21C0 is also __noreturn, the code after the call is dead -- the decompiler shows a loop structure with sprintf(v20, "%d", v8) that is never actually reached.

When dword_126ED40 (suppress assertion output) is set, the message text is replaced with "<suppressed>".
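The two message shapes can be reproduced with a short sketch. It uses a fixed caller-supplied buffer and snprintf instead of the growable buffer helpers (sub_6B98A0, sub_6B9CD0), omits the suppression flag, and the name `format_assertion` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the assertion text in either the prefixed or the unprefixed form.
 * Returns the snprintf result. */
static int format_assertion(char *buf, size_t n,
                            const char *file, int line, const char *func,
                            const char *prefix, const char *message)
{
    assert(buf != NULL && file != NULL && func != NULL);
    if (prefix != NULL)
        return snprintf(buf, n, "assertion failed: %s %s (%s, line %d in %s)",
                        prefix, message != NULL ? message : "",
                        file, line, func);
    return snprintf(buf, n, "assertion failed at: \"%s\", line %d in %s",
                    file, line, func);
}
```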

Internal Error Handler (`sub_4F21C0`)

Creates error 2656 with severity 11 (internal error), outputs it through the standard pipeline, then exits.

void __noreturn internal_error(const char *message) {
    if (dword_1065928) {                     // re-entry guard
        fprintf(stderr, "%s: %s\n", "Internal error loop", message);
        sub_5AF2B0(11);                      // emergency exit
    }
    dword_1065928 = 1;                       // set guard
    void *diag = sub_4F41C0(2656, &current_pos, 11);  // create diag record
    if (message)
        sub_4F2E90(diag, message);           // attach message as fill-in
    sub_4F1330(diag);                        // route through check_severity
    sub_5AF1D0(11);                          // cleanup + exit(11)
    sub_4F2240();                            // update file index (unreachable)
}

The re-entry guard dword_1065928 prevents infinite recursion: if internal_error is called while already processing an internal error (e.g., an assertion fires inside the error formatting code), it prints "Internal error loop: <message>" directly to stderr and exits immediately with code 11.

Exit Codes

Code	Condition	Trigger
0	Compilation succeeded	Normal exit via `sub_5AF1D0(0)`
2	Errors encountered	`total_errors > 0` at exit
4	Catastrophic error	Severity 9 or 10 reached
11	Internal error	Severity 11 (assertion failure)
abort	Double internal error	Re-entry in `sub_4F21C0` or catastrophic loop

The exit path flows through sub_5AF2B0, which maps the severity to the appropriate process exit code. Catastrophic loop detection ("Loop in catastrophic error processing.") calls sub_5AF2B0(9), which maps to exit code 4.
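The table rows for normal termination reduce to a short priority check. This sketch is an assumption about ordering (worst severity first, then the error count) and does not model the abort path; `exit_code_for` is a hypothetical name, not a recovered function.

```c
#include <assert.h>

/* Map the worst severity seen and the error count to a process exit code. */
static int exit_code_for(int worst_severity, unsigned long total_errors)
{
    assert(worst_severity >= 0 && worst_severity <= 11);
    if (worst_severity == 11)
        return 11;               /* internal error (assertion failure) */
    if (worst_severity >= 9)
        return 4;                /* catastrophic: severity 9 or 10 */
    if (total_errors > 0)
        return 2;                /* ordinary errors were reported */
    return 0;                    /* compilation succeeded */
}
```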

Diagnostic Record Layout

Each diagnostic record is approximately 200 bytes, allocated by sub_4EC940:

Offset	Size	Field	Description
0	4	`kind`	0=primary, 2=sub-diagnostic, 3=continuation
8	8	`next`	Linked list pointer (global chain)
16	8	`parent`	Parent diagnostic (for sub-diagnostics)
24	8	`related_list`	Related diagnostic chain
40	8	`sub_diagnostic_head`	First sub-diagnostic
48	8	`sub_diagnostic_tail`	Last sub-diagnostic
72	8	`context_head`	Template/include context chain
88	8	`related_info`	Related location info pointer
96	8	`source_sequence_number`	Position in source sequence
136	4	`file_index`	Index into source file table
140	2	`column_end`	End column for caret range
144	4	`line_delta`	Line offset for continuation
152	8	`file_name_string`	Canonical file path
160	8	`display_file_name`	Display-formatted file path
168	4	`column_number`	Column number
172	4	`caret_info`	Caret position data
176	4	`error_code`	Error code (0--3794)
180	1	`severity`	Severity level (2--11)
184	8	`fill_in_list_head`	First fill-in entry
192	8	`fill_in_list_tail`	Last fill-in entry
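The offsets in the table can be checked mechanically by writing them down as a C struct with compile-time assertions. The sketch below stores pointers as `uint64_t` so the layout holds on any host; the `unknown_*` and `pad_*` members are placeholders for bytes the table does not describe, and the 200-byte total is what these offsets imply rather than a separately confirmed size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t  kind;                    /*   0 */
    uint8_t  pad_004[4];
    uint64_t next;                    /*   8 */
    uint64_t parent;                  /*  16 */
    uint64_t related_list;            /*  24 */
    uint8_t  unknown_032[8];
    uint64_t sub_diagnostic_head;     /*  40 */
    uint64_t sub_diagnostic_tail;     /*  48 */
    uint8_t  unknown_056[16];
    uint64_t context_head;            /*  72 */
    uint8_t  unknown_080[8];
    uint64_t related_info;            /*  88 */
    uint64_t source_sequence_number;  /*  96 */
    uint8_t  unknown_104[32];
    int32_t  file_index;              /* 136 */
    uint16_t column_end;              /* 140 */
    uint8_t  pad_142[2];
    int32_t  line_delta;              /* 144 */
    uint8_t  pad_148[4];
    uint64_t file_name_string;        /* 152 */
    uint64_t display_file_name;       /* 160 */
    int32_t  column_number;           /* 168 */
    int32_t  caret_info;              /* 172 */
    int32_t  error_code;              /* 176 */
    uint8_t  severity;                /* 180 */
    uint8_t  pad_181[3];
    uint64_t fill_in_list_head;       /* 184 */
    uint64_t fill_in_list_tail;       /* 192 */
} diag_record;

/* Compile-time checks against the documented offsets. */
_Static_assert(offsetof(diag_record, sub_diagnostic_head) == 40, "offset 40");
_Static_assert(offsetof(diag_record, context_head) == 72, "offset 72");
_Static_assert(offsetof(diag_record, file_index) == 136, "offset 136");
_Static_assert(offsetof(diag_record, file_name_string) == 152, "offset 152");
_Static_assert(offsetof(diag_record, error_code) == 176, "offset 176");
_Static_assert(offsetof(diag_record, severity) == 180, "offset 180");
_Static_assert(offsetof(diag_record, fill_in_list_head) == 184, "offset 184");
_Static_assert(sizeof(diag_record) == 200, "total size implied by the table");
```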

Fill-In Entry Layout

Each fill-in entry is 40 bytes, allocated from a free-list pool (qword_106B490) or heap (sub_6B8070):

Offset	Size	Field	Description
0	4	`kind`	Fill-in kind (0--7, mapped from format specifier letter)
4	1	`used_flag`	Set to 1 when consumed during formatting
8	8	`next`	Next fill-in in linked list
16	8+	`payload`	Union: qword for most kinds; int+int for kind 4 (entity name)

Kind-specific initialization in alloc_fill_in_entry (sub_4F2DE0):

Kind 2 (parameter): payload = qword_126EFB8 (current source position)
Kind 4 (entity name): payload = 0, extra = 0xFFFFFFFF, flags = 0
Kind 7: byte + qword payload
Default: payload = 0
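The pool-then-heap allocation and the kind-specific initialization can be sketched as follows. The placement of `extra` and `flags` inside the payload area is an assumption (the source gives only "int+int"), kind 7 is left at its zeroed default, and `free_list` and `current_source_position` merely stand in for qword_106B490 and qword_126EFB8.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct fill_in {
    int32_t         kind;        /* offset 0 */
    uint8_t         used_flag;   /* offset 4 */
    struct fill_in *next;        /* offset 8 on LP64 */
    union {                      /* offset 16; 24 bytes in a 40-byte entry */
        uint64_t position;       /* kind 2 */
        struct {
            uint64_t value;
            uint32_t extra;      /* 0xFFFFFFFF for kind 4 (placement assumed) */
            uint32_t flags;
        } entity;                /* kind 4 */
        uint8_t raw[24];
    } payload;
} fill_in;

static fill_in *free_list;                 /* stand-in for qword_106B490 */
static uint64_t current_source_position;   /* stand-in for qword_126EFB8 */

/* Take an entry from the free list, or fall back to the heap. */
static fill_in *alloc_fill_in(int kind)
{
    assert(kind >= 0 && kind <= 7);
    fill_in *e = free_list;
    if (e != NULL)
        free_list = e->next;
    else if ((e = malloc(sizeof *e)) == NULL)
        return NULL;
    memset(e, 0, sizeof *e);               /* default: payload = 0 */
    e->kind = kind;
    if (kind == 2)
        e->payload.position = current_source_position;
    else if (kind == 4)
        e->payload.entity.extra = 0xFFFFFFFFu;
    return e;
}

/* Return an entry to the free list for reuse. */
static void release_fill_in(fill_in *e)
{
    assert(e != NULL);
    e->next = free_list;
    free_list = e;
}
```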

Colorization

Initialized by sub_4F2C10 (init_colorization, error.c:825):

Check NOCOLOR environment variable -- if set, disable colorization
Check sub_5AF770 (isatty) -- if stderr is not a terminal, disable
Read EDG_COLORS or GCC_COLORS environment variable
Default: "error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32"

Category codes used in escape sequences:

Code	Category	Default ANSI
1	reset	`\033[0m`
2	error	`\033[01;31m` (bold red)
3	warning	`\033[01;35m` (bold magenta)
4	note/remark	`\033[01;36m` (bold cyan)
5	locus	`\033[01m` (bold)
6	quote	`\033[01m` (bold)
7	range1	`\033[32m` (green)

Colorization is controlled by dword_126ECA0 (colorization requested) and dword_126ECA4 (colorization active). sub_4ECDD0 emits escape sequences to the output buffer, and sub_4F3E50 handles escape insertion during word-wrapped output.
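The relationship between the colon-separated color string and the ANSI sequences in the table can be shown with a small lookup. This is a sketch of the general GCC_COLORS-style format, not the parser in sub_4F2C10; `color_escape` is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up `category` in a "name=sgr:name=sgr" string and write the matching
 * escape sequence ESC[<sgr>m into out. Returns 1 if found and it fits. */
static int color_escape(const char *spec, const char *category,
                        char *out, size_t n)
{
    assert(spec != NULL && category != NULL && out != NULL);
    size_t clen = strlen(category);
    const char *p = spec;
    while (*p != '\0') {
        const char *end = strchr(p, ':');
        size_t len = end != NULL ? (size_t)(end - p) : strlen(p);
        /* Match "category=" exactly, so "range" does not match "range1". */
        if (len > clen && strncmp(p, category, clen) == 0 && p[clen] == '=') {
            int w = snprintf(out, n, "\033[%.*sm",
                             (int)(len - clen - 1), p + clen + 1);
            return w > 0 && (size_t)w < n;
        }
        p += len;
        if (*p == ':')
            ++p;
    }
    return 0;
}
```

Applied to the default string, `warning` yields ESC `[01;35m` and `range1` yields ESC `[32m`, matching the table.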

Key Global Variables

Variable	Address	Type	Purpose
`off_88FAA0`	`0x88FAA0`	`const char*[3795]`	Error message template table
`off_D481E0`	`0xD481E0`	struct[]	Named label fill-in table
`byte_1067920`	`0x1067920`	byte[4*3795]	Default severity per error
`byte_1067921`	`0x1067921`	byte[4*3795]	Current severity per error
`byte_1067922`	`0x1067922`	byte[4*3795]	Per-error tracking flags
`byte_126ED68`	`0x126ED68`	byte	Error promotion threshold
`byte_126ED69`	`0x126ED69`	byte	Minimum severity threshold
`qword_126ED60`	`0x126ED60`	qword	Error limit
`qword_126ED90`	`0x126ED90`	qword	Total error count
`qword_126ED98`	`0x126ED98`	qword	Total warning count
`dword_106B4B0`	`0x106B4B0`	int	Catastrophic error re-entry guard
`dword_106B4BC`	`0x106B4BC`	int	Warnings-as-errors recursion guard
`dword_106BBB8`	`0x106BBB8`	int	Output format (0=text, 1=SARIF)
`dword_106C088`	`0x106C088`	int	Warnings-are-errors mode
`dword_1065928`	`0x1065928`	int	Internal error re-entry guard
`qword_106BCD8`	`0x106BCD8`	qword	Suppress-all-but-fatal mode
`dword_106BCD4`	`0x106BCD4`	int	Predefined macro file mode
`qword_106B488`	`0x106B488`	qword	Message text buffer (0x400 initial)
`qword_106B480`	`0x106B480`	qword	Location prefix buffer (0x80 initial)
`qword_106B478`	`0x106B478`	qword	SARIF JSON buffer (0x400 initial)
`dword_106B470`	`0x106B470`	int	Terminal width for word wrapping
`qword_126EDF0`	`0x126EDF0`	FILE*	Error output stream (default stderr)
`qword_106C260`	`0x106C260`	FILE*	Raw listing output file
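The three per-error arrays start at consecutive byte addresses and each spans 4*3795 bytes, which suggests one interleaved table of 4-byte records. The sketch below models that reading; the record shape, the meaning of the fourth byte, and the helper names are assumptions rather than recovered facts.

```c
#include <assert.h>
#include <stdint.h>

#define ERROR_COUNT 3795

typedef struct {
    uint8_t default_severity;   /* byte_1067920 + 4 * code */
    uint8_t current_severity;   /* byte_1067921 + 4 * code */
    uint8_t flags;              /* byte_1067922 + 4 * code */
    uint8_t unknown;            /* fourth byte, not described in the table */
} error_entry;

static error_entry error_table[ERROR_COUNT];

/* Override the current severity, as a suppress/remark/warning/error option would. */
static void override_severity(int code, uint8_t severity)
{
    assert(code >= 0 && code < ERROR_COUNT);
    error_table[code].current_severity = severity;
}

/* Restore the default severity for an error code. */
static void restore_default_severity(int code)
{
    assert(code >= 0 && code < ERROR_COUNT);
    error_table[code].current_severity = error_table[code].default_severity;
}
```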

Function Map

Address	Name (Recovered)	EDG Source	Size	Role
`0x4EC940`	`allocate_diagnostic_record`	error.c	--	Pool allocator for diagnostic records
`0x4ECB10`	`write_sarif_physical_location`	error.c	--	SARIF location JSON fragment
`0x4ECDD0`	`emit_colorization_escape`	error.c	--	Emit ANSI escape to buffer
`0x4ED190`	`record_pragma_diagnostic`	error.c	--	Record pragma override in scope
`0x4ED240`	`check_pragma_diagnostic`	error.c	--	Check if error suppressed by pragma
`0x4EDCD0`	`process_fill_in`	error.c:4297	1,202 lines	Format specifier expansion
`0x4EF620`	`write_message_to_buffer`	error.c:4703	159 lines	Template string expansion
`0x4EF8A0`	`write_sarif_message_json`	error.c	79 lines	SARIF message JSON wrapper
`0x4EF9D0`	`construct_text_message`	error.c:3153	1,464 lines	Full text diagnostic formatter
`0x4F1330`	`check_severity`	error.c:3859	601 lines	Central severity dispatch
`0x4F2190`	`check_severity_thunk`	error.c	8 lines	Tail-call wrapper
`0x4F21A0`	`internal_error_variant`	error.c	9 lines	check_severity + exit(11)
`0x4F21C0`	`internal_error`	error.c	22 lines	Error 2656, severity 11, re-entry guard
`0x4F2240`	`update_file_index`	error.c	114 lines	LRU source-file index cache
`0x4F24B0`	`build_source_caret_line`	error.c	~100 lines	Source caret underline
`0x4F2930`	`assertion_handler`	error.c	101 lines	5,185 callers, `__noreturn`
`0x4F2C10`	`init_colorization`	error.c:825	43 lines	Parse EDG_COLORS/GCC_COLORS
`0x4F2D30`	`error_text_invalid_code`	error.c:911	12 lines	Assert on code > 3794
`0x4F2DE0`	`alloc_fill_in_entry`	error.c	41 lines	Pool allocator for fill-ins
`0x4F2E90`	`append_fill_in_string`	error.c	--	Attach string fill-in to diagnostic
`0x4F30A0`	`check_for_overridden_severity`	error.c:3803	~130 lines	Pragma diagnostic stack walk
`0x4F3480`	`format_assertion_message`	error.c	~100 lines	Multi-arg string builder
`0x4F3E50`	`emit_colorization_in_wrap`	error.c	--	Escape handling during word wrap
`0x4F40C0`	`create_diagnostic_entry`	error.c:5202	~50 lines	Base record creator
`0x4F41C0`	`create_diagnostic_entry_with_file_index`	error.c	13 lines	Wrapper with file-index mode
`0x4F5A70`	`create_sub_diagnostic`	error.c:5242	32 lines	kind=2 sub-diagnostic creator
`0x4F6C40`	`format_scope_context`	error.c	--	Extract instantiation context from scope

Call Graph

sub_4F2930 (assertion_handler)  [5,185 callers, __noreturn]
  └── sub_4F21C0 (internal_error)
        ├── sub_4F41C0 (create_diagnostic_entry_with_file_index, error=2656, sev=11)
        │     └── sub_4F40C0 (create_diagnostic_entry)
        │           └── sub_4F30A0 (check_for_overridden_severity)
        ├── sub_4F2E90 (append_fill_in_string)
        ├── sub_4F1330 (check_severity)  [62 callers, 77 callees]
        │     ├── sub_4ED240 (check_pragma_diagnostic)
        │     ├── sub_4EF9D0 (construct_text_message)
        │     │     ├── sub_4EF620 (write_message_to_buffer)
        │     │     │     └── sub_4EDCD0 (process_fill_in)
        │     │     ├── sub_4F24B0 (build_source_caret_line)
        │     │     └── sub_4F3E50 (emit_colorization_in_wrap)
        │     ├── sub_4EF8A0 (write_sarif_message_json)
        │     │     └── sub_4EF620 (write_message_to_buffer)
        │     ├── sub_4F5A70 (create_sub_diagnostic)
        │     ├── sub_4F2DE0 (alloc_fill_in_entry)
        │     ├── sub_4F6C40 (format_scope_context)
        │     ├── sub_66B5E0 (cleanup)
        │     └── sub_5AF2B0 (exit)
        ├── sub_5AF1D0 (cleanup + exit)
        └── sub_4F2240 (update_file_index)

CUDA Error Catalog

cudafe++ reserves error indices 3457--3794 for CUDA-specific diagnostics. These 338 slots are displayed to the user as error numbers 20000--20337 with a -D suffix (for suppressible severities), produced by the renumbering logic in construct_text_message (sub_4EF9D0): when the internal error code exceeds 3456, the display code is error_code + 16543. Of the 338 slots, approximately 210 carry unique error message templates; the remainder are reserved or share templates with parametric fill-ins (%s, %sq, %t, %n, %no). Every CUDA error can be suppressed, promoted, or demoted by its diagnostic tag name via --diag_suppress, --diag_warning, --diag_error, or the #pragma nv_diagnostic system.

This page is a searchable reference catalog organized by error category. For the diagnostic pipeline mechanics (severity levels, pragma stack, output formatting), see Diagnostic Overview.

Error Numbering Scheme

// construct_text_message (sub_4EF9D0), error.c:3153
int display_code = error_code;
if (display_code > 3456)
    display_code = error_code + 16543;   // 3457 -> 20000, 3794 -> 20337
sprintf(buf, "%d", display_code);

// Suffix: "-D" appended when severity <= 7 (note, remark, warning, soft error)
const char *suffix = (severity > 7) ? "" : "-D";

User-visible format: file(line): error #20042-D: calling a __device__ function from a __host__ function is not allowed

Mapping formula:

Direction	Formula
Display to internal	`internal = display - 16543` (for display >= 20000)
Internal to display	`display = internal + 16543` (for internal > 3456)
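Both directions of the mapping, together with the `-D` suffix rule, fit in a few lines. The function names below are hypothetical; the constants 3456, 16543, and 20000 and the severity threshold of 7 come from the text above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Internal index -> user-visible number (CUDA codes are shifted by 16543). */
static int internal_to_display(int internal)
{
    return internal > 3456 ? internal + 16543 : internal;
}

/* User-visible number -> internal index. */
static int display_to_internal(int display)
{
    return display >= 20000 ? display - 16543 : display;
}

/* Format the "#code" or "#code-D" token that appears in the diagnostic line. */
static int format_error_number(char *buf, size_t n, int internal, int severity)
{
    assert(buf != NULL && internal >= 0 && internal <= 3794);
    return snprintf(buf, n, "#%d%s", internal_to_display(internal),
                    severity > 7 ? "" : "-D");
}
```

For example, display number 20042 corresponds to internal index 3499, and non-CUDA codes such as 2656 pass through unchanged.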

Diagnostic Tag Names and Suppression

Each CUDA error has an associated diagnostic tag name -- a snake_case identifier that can be passed to --diag_suppress, --diag_warning, --diag_error, or --diag_default instead of the numeric code. The tag names are also accepted by #pragma nv_diag_suppress, #pragma nv_diag_warning, etc.

# Suppress a specific CUDA error by tag name (forwarded to cudafe++ via -Xcudafe)
nvcc -Xcudafe --diag_suppress=calling_a_constexpr__host__function_from_a__device__function

# Suppress by numeric display code
nvcc -Xcudafe --diag_suppress=20042

# In source code
#pragma nv_diag_suppress device_function_redeclared_with_host

The pragma actions understood by cudafe++:

Pragma	Internal Code	Effect
`nv_diag_suppress`	30	Set severity to 3 (suppressed)
`nv_diag_remark`	31	Set severity to 4 (remark)
`nv_diag_warning`	32	Set severity to 5 (warning)
`nv_diag_error`	33	Set severity to 7 (error)
`nv_diag_default`	35	Restore original severity
`nv_diag_once`	--	Emit only on first occurrence

Category 1: Cross-Space Calling (12 messages)

Cross-space call validation is the highest-frequency CUDA diagnostic category. The checker walks the call graph and emits an error whenever a function in one execution space calls a function in an incompatible space. Six variants cover non-constexpr calls; six more cover constexpr calls (which can be relaxed with --expt-relaxed-constexpr).

Standard Cross-Space Calls

Tag	Message Template
`unsafe_device_call`	`calling a __device__ function(%sq1) from a __host__ function(%sq2) is not allowed`
`unsafe_device_call`	`calling a __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed`
`unsafe_device_call`	`calling a __host__ function(%sq1) from a __device__ function(%sq2) is not allowed`
`unsafe_device_call`	`calling a __host__ function(%sq1) from a __global__ function(%sq2) is not allowed`
`unsafe_device_call`	`calling a __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed`
`unsafe_device_call`	`calling a __host__ function from a __host__ __device__ function is not allowed`

Constexpr Cross-Space Calls

These fire when --expt-relaxed-constexpr is not enabled. The message explicitly suggests the flag.

Tag	Message Template
`unsafe_device_call`	`calling a constexpr __device__ function(%sq1) from a __host__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.`
`unsafe_device_call`	`calling a constexpr __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.`
`unsafe_device_call`	`calling a constexpr __host__ function(%sq1) from a __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.`
`unsafe_device_call`	`calling a constexpr __host__ function(%sq1) from a __global__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.`
`unsafe_device_call`	`calling a constexpr __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.`
`unsafe_device_call`	`calling a constexpr __host__ function from a __host__ __device__ function is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.`

Implementation: Cross-space checks are performed by the call-graph walker in the CUDA validation pass. The checker compares the execution space byte at entity offset +182 of the callee against that of the caller. When the mask test fails, the appropriate variant is selected based on whether either function is constexpr and whether the template uses named fill-ins (%sq1, %sq2) or the anonymous form without them.

Category 2: Virtual Override Mismatch (6 messages)

When a derived class overrides a virtual function, the execution space of the override must match the base. Six combinations cover all mismatched pairs among __host__, __device__, and __host__ __device__.

Tag	Message Template
--	`execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ function`
--	`execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ __device__ function`
--	`execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __device__ function`
--	`execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __host__ __device__ function`
--	`execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __device__ function`
--	`execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __host__ function`

Implementation: The override checker (sub_432280, record_virtual_function_override) extracts the 0x30 mask from the execution space byte of both the base and derived function entities. If they differ, the appropriate pair is selected and emitted. The __global__ space is not included because __global__ functions cannot be virtual (see Category 4).

Category 3: Redeclaration Mismatch (12 messages)

When a function is redeclared with a different execution space annotation, cudafe++ either emits an error (incompatible combination) or a warning (compatible promotion to __host__ __device__).

Error-Level Redeclarations (4 messages)

Tag	Message Template
`device_function_redeclared_with_global`	`a __device__ function(%no1) redeclared with __global__`
`global_function_redeclared_with_device`	`a __global__ function(%no1) redeclared with __device__`
`global_function_redeclared_with_host`	`a __global__ function(%no1) redeclared with __host__`
`global_function_redeclared_with_host_device`	`a __global__ function(%no1) redeclared with __host__ __device__`

Warning-Level Redeclarations (Promoted to HD, 5 messages)

Tag	Message Template
`device_function_redeclared_with_host`	`a __device__ function(%no1) redeclared with __host__, hence treated as a __host__ __device__ function`
`device_function_redeclared_with_host_device`	`a __device__ function(%no1) redeclared with __host__ __device__, hence treated as a __host__ __device__ function`
`device_function_redeclared_without_device`	`a __device__ function(%no1) redeclared without __device__, hence treated as a __host__ __device__ function`
`host_function_redeclared_with_device`	`a __host__ function(%no1) redeclared with __device__, hence treated as a __host__ __device__ function`
`host_function_redeclared_with_host_device`	`a __host__ function(%no1) redeclared with __host__ __device__, hence treated as a __host__ __device__ function`

Global Redeclarations (3 messages)

Tag	Message Template
`global_function_redeclared_without_global`	`a __global__ function(%no1) redeclared without __global__`
`host_function_redeclared_with_global`	`a __host__ function(%no1) redeclared with __global__`
`host_device_function_redeclared_with_global`	`a __host__ __device__ function(%no1) redeclared with __global__`

Implementation: Redeclaration checking occurs in decl_routine (sub_4CE420) and check_cuda_attribute_consistency (sub_4C6D50). The checker compares the execution space byte from the prior declaration against the new declaration's attribute set. When bits differ, it selects the message based on which bits changed and whether the result is a compatible promotion.

Category 4: __global__ Function Constraints (37 messages)

__global__ (kernel) functions have the most extensive constraint set of any execution space. These errors enforce the CUDA programming model requirement that kernels have specific signatures, cannot be members, and cannot use certain C++ features.

Return Type and Signature

Tag	Message Template
`global_function_return_type`	`a __global__ function must have a void return type`
`global_function_deduced_return_type`	`a __global__ function must not have a deduced return type`
`global_function_has_ellipsis`	`a __global__ function cannot have ellipsis`
`global_rvalue_ref_type`	`a __global__ function cannot have a parameter with rvalue reference type`
`global_ref_param_restrict`	`a __global__ function cannot have a parameter with __restrict__ qualified reference type`
`global_va_list_type`	`A __global__ function or function template cannot have a parameter with va_list type`
`global_function_with_initializer_list`	`a __global__ function or function template cannot have a parameter with type std::initializer_list`
`global_param_align_too_big`	`cannot pass a parameter with a too large explicit alignment to a __global__ function on win32 platforms`

Declaration Context

Tag	Message Template
`global_class_decl`	`A __global__ function or function template cannot be a member function`
`global_friend_definition`	`A __global__ function or function template cannot be defined in a friend declaration`
`global_function_in_unnamed_inline_ns`	`A __global__ function or function template cannot be declared within an inline unnamed namespace`
`global_operator_function`	`An operator function cannot be a __global__ function`
`global_new_or_delete`	(internal -- __global__ on operator new/delete)
--	`function main cannot be marked __device__ or __global__`

C++ Feature Restrictions

Tag	Message Template
`global_function_constexpr`	`A __global__ function or function template cannot be marked constexpr`
`global_function_consteval`	`A __global__ function or function template cannot be marked consteval`
`global_function_inline`	(internal -- __global__ with inline)
`global_exception_spec`	`An exception specification is not allowed for a __global__ function or function template`

Template Argument Restrictions

Tag	Message Template
`global_private_type_arg`	`A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the class is local to a __device__ or __global__ function`
`global_private_template_arg`	`A template that is defined inside a class and has private or protected access cannot be used in the template template argument of a __global__ function template instantiation`
`global_unnamed_type_arg`	`An unnamed type (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the type is local to a __device__ or __global__ function`
`global_func_local_template_arg`	`A type defined inside a __host__ function (%t) cannot be used in the template argument type of a __global__ function template instantiation`
`global_lambda_template_arg`	`The closure type for a lambda (%t%s) cannot be used in the template argument type of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the flag '-extended-lambda' is specified and the lambda is an extended lambda (a __device__ or __host__ __device__ lambda defined within a __host__ or __host__ __device__ function)`
`local_type_used_in_global_function`	`a local type %t (defined in %sq1) used in global function %sq2 template argument, the global function cannot be launched from host code.`

Variadic Template Constraints

Tag	Message Template
`global_function_multiple_packs`	`Multiple pack parameters are not allowed for a variadic __global__ function template`
`global_function_pack_not_last`	`Pack template parameter must be the last template parameter for a variadic __global__ function template`

Variable Template Restrictions (parallel to kernel template)

Tag	Message Template
`variable_template_private_type_arg`	`A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a variable template instantiation, unless the class is local to a __device__ or __global__ function`
`variable_template_private_template_arg`	(private template template arg in variable template)
`variable_template_unnamed_type_template_arg`	`An unnamed type (%t) cannot be used in the template argument type of a variable template template instantiation, unless the type is local to a __device__ or __global__ function`
`variable_template_func_local_template_arg`	`A type defined inside a __host__ function (%t) cannot be used in the template argument type of a variable template template instantiation`
`variable_template_lambda_template_arg`	`The closure type for a lambda (%t%s) cannot be used in the template argument type of a variable template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the lambda is an 'extended lambda' and the flag --extended-lambda is specified`

Launch Configuration Attributes

Tag	Message Template
`bounds_attr_only_on_global_func`	`%s is only allowed on a __global__ function`
`maxnreg_attr_only_on_global_func`	(maxnreg only on global)
--	`The %s qualifiers cannot be applied to the same kernel`
--	`Multiple %s specifiers are not allowed`
--	`no __launch_bounds__ specified for __global__ function`
`cuda_specifier_twice_in_group`	(duplicate CUDA specifier on same declaration)

Category 5: Extended Lambda Restrictions (38 messages)

Extended lambdas (__device__ or __host__ __device__ lambdas defined within host code, enabled by --extended-lambda) are one of the most constraint-heavy features in CUDA. The restriction set enforces that the lambda's closure type can be serialized for device transfer.

Capture Restrictions

Tag	Message Template
`extended_lambda_reference_capture`	`An extended %s lambda cannot capture variables by reference`
`extended_lambda_pack_capture`	`An extended %s lambda cannot capture an element of a parameter pack`
`extended_lambda_too_many_captures`	`An extended %s lambda can only capture up to 1023 variables`
`extended_lambda_array_capture_rank`	`An extended %s lambda cannot capture an array variable (type: %t) with more than 7 dimensions`
`extended_lambda_array_capture_assignable`	`An extended %s lambda cannot capture an array variable whose element type (%t) is not assignable on the host`
`extended_lambda_array_capture_default_constructible`	`An extended %s lambda cannot capture an array variable whose element type (%t) is not default constructible on the host`
`extended_lambda_init_capture_array`	`An extended %s lambda cannot init-capture variables with array type`
`extended_lambda_init_capture_initlist`	`An extended %s lambda cannot have init-captures with type std::initializer_list`
`extended_lambda_capture_in_constexpr_if`	`An extended %s lambda cannot first-capture variable in constexpr-if context`
`this_addr_capture_ext_lambda`	`Implicit capture of 'this' in extended lambda expression`
`extended_lambda_hd_init_capture`	`init-captures are not allowed for extended __host__ __device__ lambdas`
--	`Unless enabled by language dialect, *this capture is only supported when the lambda is either __device__ only, or is defined within a __device__ or __global__ function`

Type Restrictions on Captures and Parameters

Tag	Message Template
`extended_lambda_capture_local_type`	`A type local to a function (%t) cannot be used in the type of a variable captured by an extended __device__ or __host__ __device__ lambda`
`extended_lambda_capture_private_type`	`A type that is a private or protected class member (%t) cannot be used in the type of a variable captured by an extended __device__ or __host__ __device__ lambda`
`extended_lambda_call_operator_local_type`	`A type local to a function (%t) cannot be used in the return or parameter types of the operator() of an extended __device__ or __host__ __device__ lambda`
`extended_lambda_call_operator_private_type`	`A type that is a private or protected class member (%t) cannot be used in the return or parameter types of the operator() of an extended __device__ or __host__ __device__ lambda`
`extended_lambda_parent_local_type`	`A type local to a function (%t) cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended __device__ or __host__ __device__ lambda`
`extended_lambda_parent_private_type`	`A type that is a private or protected class member (%t) cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended __device__ or __host__ __device__ lambda`
`extended_lambda_parent_private_template_arg`	`A template that is a private or protected class member cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended %s lambda`

Enclosing Parent Function Restrictions

Tag	Message Template
`extended_lambda_enclosing_function_local`	`The enclosing parent function (%sq2) for an extended %s1 lambda must not be defined inside another function`
`extended_lambda_inaccessible_parent`	`The enclosing parent function (%sq2) for an extended %s1 lambda cannot have private or protected access within its class`
`extended_lambda_enclosing_function_deducible`	`The enclosing parent function (%sq2) for an extended %s1 lambda must not have deduced return type`
`extended_lambda_cant_take_function_address`	`The enclosing parent function (%sq2) for an extended %s1 lambda must allow its address to be taken`
`extended_lambda_parent_non_extern`	`On Windows, the enclosing parent function (%sq2) for an extended %s1 lambda cannot have internal or no linkage`
`extended_lambda_parent_class_unnamed`	`The enclosing parent function (%sq2) for an extended %s1 lambda cannot be a member function of a class that is unnamed`
`extended_lambda_parent_template_param_unnamed`	`The enclosing parent function (%sq2) for an extended %s1 lambda cannot be in a template which has a unnamed parameter: %nd`
`extended_lambda_nest_parent_template_param_unnamed`	`The enclosing parent %n for an extended %s lambda cannot be a template which has a unnamed parameter`
`extended_lambda_multiple_parameter_packs`	`The enclosing parent template function (%sq2) for an extended %s1 lambda cannot have more than one variadic parameter, or it is not listed last in the template parameter list.`

Nesting and Context Restrictions

Tag	Message Template
`extended_lambda_enclosing_function_generic_lambda`	`An extended %s1 lambda cannot be defined inside a generic lambda expression(%sq2).`
`extended_lambda_enclosing_function_hd_lambda`	`An extended %s1 lambda cannot be defined inside an extended __host__ __device__ lambda expression(%sq2).` (note: double space before "lambda" is present in the binary)
`extended_lambda_inaccessible_ancestor`	`An extended %s1 lambda cannot be defined inside a class (%sq2) with private or protected access within another class`
`extended_lambda_inside_constexpr_if`	`For this host platform/dialect, an extended lambda cannot be defined inside the 'if' or 'else' block of a constexpr if statement`
`extended_lambda_multiple_parent`	`Cannot specify multiple __nv_parent directives in a lambda declaration`
`extended_host_device_generic_lambda`	`__host__ __device__ extended lambdas cannot be generic lambdas`
--	`If an extended %s lambda is defined within the body of one or more nested lambda expressions, each of these enclosing lambda expressions must be defined within the immediate or nested block scope of a function.`

Specifier and Annotation

Tag	Message Template
`extended_lambda_disallowed`	`__host__ or __device__ annotation on lambda requires --extended-lambda nvcc flag`
`extended_lambda_constexpr`	`The %s1 specifier is not allowed for an extended %s2 lambda`
--	`The operator() function for a lambda cannot be explicitly annotated with execution space annotations (__host__/__device__/__global__), the annotations are derived from its closure class`

Category 6: Device Code Restrictions (13 messages)

General restrictions that apply to any code executing on the GPU. These errors are emitted when C++ features unsupported by the NVPTX backend appear in __device__ or __global__ function bodies.

Tag	Message Template
`cuda_device_code_unsupported_operator`	`The operator '%s' is not allowed in device code`
`unsupported_type_in_device_code`	`%t %s1 a %s2, which is not supported in device code`
--	`device code does not support exception handling`
--	`device code does not support coroutines`
--	`operations on vector types are not supported in device code`
`undefined_device_entity`	`cannot use an entity undefined in device code`
`undefined_device_identifier`	`identifier %sq is undefined in device code`
`thread_local_in_device_code`	`cannot use thread_local specifier for variable declarations in device code`
`unrecognized_pragma_device_code`	`unrecognized #pragma in device code`
--	`zero-sized parameter type %t is not allowed in device code`
--	`zero-sized variable %sq is not allowed in device code`
--	`dynamic initialization is not supported for a function-scope static %s variable within a __device__/__global__ function`
--	`function-scope static variable within a __device__/__global__ function requires a memory space specifier`

Category 7: Kernel Launch (6 messages)

Errors related to <<<...>>> kernel launch syntax.

Tag	Message Template
`device_launch_no_sepcomp`	`kernel launch from __device__ or __global__ functions requires separate compilation mode`
`missing_api_for_device_side_launch`	`device-side kernel launch could not be processed as the required runtime APIs are not declared`
--	`explicit stream argument not provided in kernel launch`
--	`kernel launches from templates are not allowed in system files`
`device_side_launch_arg_with_user_provided_cctor`	`cannot pass an argument with a user-provided copy-constructor to a device-side kernel launch`
`device_side_launch_arg_with_user_provided_dtor`	`cannot pass an argument with a user-provided destructor to a device-side kernel launch`

Category 8: Memory Space and Variable Restrictions (15 messages)

Variable Access Across Spaces

Tag	Message Template
`device_var_read_in_host`	`a %s1 %n1 cannot be directly read in a host function`
`device_var_written_in_host`	`a %s1 %n1 cannot be directly written in a host function`
`device_var_address_taken_in_host`	`address of a %s1 %n1 cannot be directly taken in a host function`
`host_var_read_in_device`	`a host %n1 cannot be directly read in a device function`
`host_var_written_in_device`	`a host %n1 cannot be directly written in a device function`
`host_var_address_taken_in_device`	`address of a host %n1 cannot be directly taken in a device function`

Variable Declaration Restrictions

Tag	Message Template
`illegal_local_to_device_function`	`%s1 %sq2 variable declaration is not allowed inside a device function body`
`illegal_local_to_host_function`	`%s1 %sq2 variable declaration is not allowed inside a host function body`
--	`the __shared__ memory space specifier is not allowed for a variable declared by the for-range-declaration`
--	`__shared__ variables cannot have external linkage`
`device_variable_in_unnamed_inline_ns`	`A %s variable cannot be declared within an inline unnamed namespace`
--	`member variables of an anonymous union at global or namespace scope cannot be directly accessed in __device__ and __global__ functions`

Auto-Deduced Device References

Tag	Message Template
`auto_device_fn_ref`	`A non-constexpr __device__ function (%sq1) with "auto" deduced return type cannot be directly referenced %s2, except if the reference is absent when __CUDA_ARCH__ is undefined`
`device_var_constexpr`	(constexpr rules for __device__ variables)
`device_var_structured_binding`	(structured bindings on __device__ variables)

Category 9: __grid_constant__ (8 messages)

The __grid_constant__ annotation (compute_70+) marks a kernel parameter as read-only grid-wide. Errors enforce that the parameter is on a __global__ function, is const-qualified, and is not a reference type.

Tag	Message Template
`grid_constant_non_kernel`	`__grid_constant__ annotation is only allowed on a parameter of a __global__ function`
`grid_constant_not_const`	`a parameter annotated with __grid_constant__ must have const-qualified type`
`grid_constant_reference_type`	`a parameter annotated with __grid_constant__ must not have reference type`
`grid_constant_unsupported_arch`	`__grid_constant__ annotation is only allowed for architecture compute_70 or later`
`grid_constant_incompat_redecl`	`incompatible __grid_constant__ annotation for parameter %s in function redeclaration (see previous declaration %p)`
`grid_constant_incompat_templ_redecl`	`incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p)`
`grid_constant_incompat_specialization`	`incompatible __grid_constant__ annotation for parameter %s in function specialization (see previous declaration %p)`
`grid_constant_incompat_instantiation_directive`	`incompatible __grid_constant__ annotation for parameter %s in instantiation directive (see previous declaration %p)`

Category 10: JIT Mode (5 messages)

JIT mode (-dc for device-only compilation) restricts host constructs. These errors guide users toward the -default-device flag for unannotated declarations.

Tag	Message Template
`no_host_in_jit`	`A function explicitly marked as a __host__ function is not allowed in JIT mode`
`unannotated_function_in_jit`	`A function without execution space annotations (__host__/__device__/__global__) is considered a host function, and host functions are not allowed in JIT mode. Consider using -default-device flag to process unannotated functions as __device__ functions in JIT mode`
`unannotated_variable_in_jit`	`A namespace scope variable without memory space annotations (__device__/__constant__/__shared__/__managed__) is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process unannotated namespace scope variables as __device__ variables in JIT mode`
`unannotated_static_data_member_in_jit`	`A class static data member with non-const type is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process such data members as __device__ variables in JIT mode`
`host_closure_class_in_jit`	`The execution space for the lambda closure class members was inferred to be __host__ (based on context). This is not allowed in JIT mode. Consider using -default-device to infer __device__ execution space for namespace scope lambda closure classes.`

Category 11: RDC / Whole-Program Mode (4 messages)

Diagnostics related to relocatable device code (-rdc=true) and whole-program compilation (-rdc=false).

Tag	Message Template
--	`An inline __device__/__constant__/__managed__ variable must have internal linkage when the program is compiled in whole program mode (-rdc=false)`
`template_global_no_def`	`when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit. To resolve this issue, either use separate compilation mode ("-rdc=true"), or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off)`
`extern_kernel_template`	`when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false"). To resolve the issue, either use separate compilation mode ("-rdc=true"), or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off)`
--	`address of internal linkage device function (%sq) was taken (nv bug 2001144). mitigation: no mitigation required if the address is not used for comparison, or if the target function is not a CUDA C++ builtin. Otherwise, write a wrapper function to call the builtin, and take the address of the wrapper function instead`

Category 12: Atomics (26 messages)

CUDA atomics are lowered to PTX instructions with specific size, type, scope, and memory order constraints. These diagnostics enforce hardware limits.

Architecture and Type Constraints

Tag	Message Template
`nv_atomic_functions_not_supported_below_sm60`	`__nv_atomic_* functions are not supported on arch < sm_60.`
`nv_atomic_operation_not_in_device_function`	`atomic operations are not in a device function.`
`nv_atomic_function_no_args`	`atomic function requires at least one argument.`
`nv_atomic_function_address_taken`	`nv atomic function must be called directly.`
`invalid_nv_atomic_operation_size`	`atomic operations and, or, xor, add, sub, min and max are valid only on objects of size 4, or 8.`
`invalid_nv_atomic_cas_size`	`atomic CAS is valid only on objects of size 2, 4, 8 or 16 bytes.`
`invalid_nv_atomic_exch_size`	`atomic exchange is valid only on objects of size 4, 8 or 16 bytes.`
`invalid_data_size_for_nv_atomic_generic_function`	`generic nv atomic functions are valid only on objects of size 1, 2, 4, 8 and 16 bytes.`
`non_integral_type_for_non_generic_nv_atomic_function`	`non-generic nv atomic load, store, cas and exchange are valid only on integral types.`
`invalid_nv_atomic_operation_add_sub_size`	`atomic operations add and sub are not valid on signed integer of size 8.`
`nv_atomic_add_sub_f64_not_supported`	`atomic add and sub for 64-bit float is supported on architecture sm_60 or above.`
`invalid_nv_atomic_operation_max_min_float`	`atomic operations min and max are not supported on any floating-point types.`
`floating_type_for_logical_atomic_operation`	`For a logical atomic operation, the first argument cannot be any floating-point types.`
`nv_atomic_cas_b16_not_supported`	(16-bit CAS not supported)
`nv_atomic_exch_cas_b128_not_supported`	(128-bit exchange/CAS not supported)
`nv_atomic_load_store_b128_version_too_low`	(128-bit load/store requires newer arch)
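
The size limits quoted in the messages above can be summarized as a single predicate. The following is a minimal sketch, not the logic recovered from the binary: the enum `atomic_op_class` and the function `atomic_size_is_valid` are hypothetical names introduced for illustration, and the architecture-dependent rejections (16-bit CAS, 128-bit exchange/CAS/load/store) are deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation classes; the names are illustrative only. */
enum atomic_op_class {
    ATOMIC_ARITH_LOGICAL, /* and, or, xor, add, sub, min, max */
    ATOMIC_CAS,           /* compare-and-swap */
    ATOMIC_EXCH,          /* exchange */
    ATOMIC_GENERIC        /* generic nv atomic functions */
};

/* Returns true when an object of `size` bytes is accepted according to
 * the sizes listed in the message templates. */
bool atomic_size_is_valid(enum atomic_op_class op, size_t size)
{
    switch (op) {
    case ATOMIC_ARITH_LOGICAL:
        return size == 4 || size == 8;
    case ATOMIC_CAS:
        return size == 2 || size == 4 || size == 8 || size == 16;
    case ATOMIC_EXCH:
        return size == 4 || size == 8 || size == 16;
    case ATOMIC_GENERIC:
        return size == 1 || size == 2 || size == 4 || size == 8 || size == 16;
    }
    assert(!"unknown atomic_op_class");
    return false;
}
```

Keeping the four size sets side by side makes the asymmetry visible: a 2-byte object is acceptable for CAS but not for exchange, and only the generic functions accept a 1-byte object.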

Memory Order and Scope

Tag	Message Template
`nv_atomic_load_order_error`	`atomic load's memory order cannot be release or acq_rel.`
`nv_atomic_store_order_error`	`atomic store's memory order cannot be consume, acquire or acq_rel.`
`nv_atomic_operation_order_not_constant_int`	`atomic operation's memory order argument is not an integer literal.`
`nv_atomic_operation_scope_not_constant_int`	`atomic operation's scope argument is not an integer literal.`
`invalid_nv_atomic_memory_order_value`	(invalid memory order enum value)
`invalid_nv_atomic_thread_scope_value`	(invalid thread scope enum value)
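
The two order restrictions spelled out in the messages above (loads reject release and acq_rel; stores reject consume, acquire, and acq_rel) can be expressed as a pair of predicates. This is a sketch under assumptions: the enum `mem_order` and its enumerator values are hypothetical and do not reflect the integer encoding the binary expects.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical memory-order enum; the numeric values are placeholders. */
enum mem_order {
    ORDER_RELAXED,
    ORDER_CONSUME,
    ORDER_ACQUIRE,
    ORDER_RELEASE,
    ORDER_ACQ_REL,
    ORDER_SEQ_CST
};

/* An atomic load may not use release or acq_rel ordering. */
bool load_order_is_valid(enum mem_order o)
{
    assert(o >= ORDER_RELAXED && o <= ORDER_SEQ_CST);
    return o != ORDER_RELEASE && o != ORDER_ACQ_REL;
}

/* An atomic store may not use consume, acquire, or acq_rel ordering. */
bool store_order_is_valid(enum mem_order o)
{
    assert(o >= ORDER_RELAXED && o <= ORDER_SEQ_CST);
    return o != ORDER_CONSUME && o != ORDER_ACQUIRE && o != ORDER_ACQ_REL;
}
```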

Scope Fallback Warnings

Tag	Message Template
`nv_atomic_operations_scope_fallback_to_membar`	`atomic operations' scope argument is supported on architecture sm_60 or above. Fall back to use membar.`
`nv_atomic_operations_memory_order_fallback_to_membar`	`atomic operations' argument of memory order is supported on architecture sm_70 or above. Fall back to use membar.`
`nv_atomic_operations_scope_cluster_change_to_device`	`atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead.`
`nv_atomic_load_store_scope_cluster_change_to_device`	`atomic load and store's scope of cluster is supported on architecture sm_90 or above. Using device scope instead.`
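
Read together, the four warnings describe three architecture thresholds. The sketch below restates them as a decision function; it is an illustration built only from the message texts, and the names `thread_scope`, `atomic_lowering`, and `resolve_atomic_lowering` are hypothetical. How the binary actually represents the target architecture and the fallback is not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical scope enum; values are placeholders. */
enum thread_scope {
    SCOPE_THREAD,
    SCOPE_BLOCK,
    SCOPE_CLUSTER,
    SCOPE_DEVICE,
    SCOPE_SYSTEM
};

/* Hypothetical summary of what the warnings say will happen. */
struct atomic_lowering {
    enum thread_scope scope; /* effective scope after any downgrade */
    bool scope_uses_membar;  /* scope argument unsupported: membar fallback */
    bool order_uses_membar;  /* memory order unsupported: membar fallback */
};

/* `sm` is the numeric architecture, e.g. 70 for sm_70. */
struct atomic_lowering resolve_atomic_lowering(int sm, enum thread_scope requested)
{
    struct atomic_lowering r;
    assert(sm > 0);
    r.scope = requested;
    if (requested == SCOPE_CLUSTER && sm < 90)
        r.scope = SCOPE_DEVICE;     /* cluster scope needs sm_90 or above */
    r.scope_uses_membar = sm < 60;  /* scope argument needs sm_60 or above */
    r.order_uses_membar = sm < 70;  /* memory order needs sm_70 or above */
    return r;
}
```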

Category 13: ASM in Device Code (6 messages)

Inline assembly constraints are more restrictive in device code (NVPTX backend supports fewer constraint letters than x86).

Tag	Message Template
`asm_constraint_letter_not_allowed_in_device`	`asm constraint letter '%s' is not allowed inside a __device__/__global__ function`
--	`an asm operand may specify only one constraint letter in a __device__/__global__ function`
--	`The 'C' constraint can only be used for asm statements in device code`
--	`The cc clobber constraint is not supported in device code`
`cuda_xasm_strict_placeholder_format`	(strict placeholder format in CUDA asm)
`addr_of_label_in_device_func`	`address of label extension is not supported in __device__/__global__ functions`

Category 14: #pragma nv_abi (10 messages)

The #pragma nv_abi directive controls the calling convention for device functions, adjusting parameter passing to match PTX ABI requirements.

Tag	Message Template
`nv_abi_pragma_bad_format`	(malformed #pragma nv_abi)
`nv_abi_pragma_invalid_option`	`#pragma nv_abi contains an invalid option`
`nv_abi_pragma_missing_arg`	`#pragma nv_abi requires an argument`
`nv_abi_pragma_duplicate_arg`	`#pragma nv_abi contains a duplicate argument`
`nv_abi_pragma_not_constant`	`#pragma nv_abi argument must evaluate to an integral constant expression`
`nv_abi_pragma_not_positive_value`	`#pragma nv_abi argument value must be a positive value`
`nv_abi_pragma_overflow_value`	`#pragma nv_abi argument value exceeds the range of an integer`
`nv_abi_pragma_device_function`	`#pragma nv_abi must be applied to device functions`
`nv_abi_pragma_device_function_context`	`#pragma nv_abi is not supported inside a host function`
`nv_abi_pragma_next_construct`	`#pragma nv_abi must appear immediately before a function declaration, function definition, or an expression statement`

Category 15: __nv_register_params__ (4 messages)

The __nv_register_params__ attribute forces all parameters to be passed in registers (compute_80+).

Tag	Message Template
`register_params_not_enabled`	`__nv_register_params__ support is not enabled`
`register_params_unsupported_arch`	`__nv_register_params__ is only supported for compute_80 or later architecture`
`register_params_unsupported_function`	`__nv_register_params__ is not allowed on a %s function`
`register_params_ellipsis_function`	`__nv_register_params__ is not allowed on a function with ellipsis`

Category 16: __CUDACC_RTC__name_expr (6 messages)

The __CUDACC_RTC__name_expr intrinsic is used by NVRTC to form the mangled name of a __global__ function or __device__/__constant__ variable at compile time.

Tag	Message Template
`name_expr_parsing`	(error during name expression parsing)
`name_expr_non_global_routine`	`Name expression cannot form address of a non-__global__ function. Input name expression was: %sq`
`name_expr_non_device_variable`	`Name expression cannot form address of a variable that is not a __device__/__constant__ variable. Input name expression was: %sq`
`name_expr_not_routine_or_variable`	`Name expression must form address of a __global__ function or the address of a __device__/__constant__ variable. Input name expression was: %sq`
`name_expr_extra_tokens`	(extra tokens after name expression)
`name_expr_internal_error`	(internal error in name expression processing)

Category 17: Texture and Surface Variables (8 messages)

Texture and surface objects have special memory semantics. These errors enforce that they are not used in ways incompatible with the GPU texture subsystem.

Tag	Message Template
`texture_surface_variable_in_unnamed_inline_ns`	`A texture or surface variable cannot be declared within an inline unnamed namespace`
--	`A texture or surface variable cannot be used in the non-type template argument of a __device__, __host__ __device__ or __global__ function template instantiation`
`reference_to_text_surf_type_in_device_func`	`a reference to texture/surface type cannot be used in __device__/__global__ functions`
`reference_to_text_surf_var_in_device_func`	`taking reference of texture/surface variable not allowed in __device__/__global__ functions`
`addr_of_text_surf_var_in_device_func`	`cannot take address of texture/surface variable %sq in __device__/__global__ functions`
`addr_of_text_surf_expr_in_device_func`	`cannot take address of texture/surface expression in __device__/__global__ functions`
`indir_into_text_surf_var_in_device_func`	`indirection not allowed for accessing texture/surface through variable %sq in __device__/__global__ functions`
`indir_into_text_surf_expr_in_device_func`	`indirection not allowed for accessing texture/surface through expression in __device__/__global__ functions`

Category 18: __managed__ Variables (7 messages)

__managed__ unified-memory variables have significant restrictions because they must be accessible from both host and device.

Tag	Message Template
`managed_const_type_not_allowed`	`a __managed__ variable cannot have a const qualified type`
`managed_reference_type_not_allowed`	`a __managed__ variable cannot have a reference type`
`managed_cant_be_shared_constant`	`__managed__ variables cannot be marked __shared__ or __constant__`
`unsupported_arch_for_managed_capability`	`__managed__ variables require architecture compute_30 or higher`
`unsupported_configuration_for_managed_capability`	`__managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system)`
`decltype_of_managed_variable`	`A __managed__ variable cannot be used as an unparenthesized id-expression argument for decltype()`
--	(dynamic initialization restrictions for __managed__ variables)

Category 19: Device Function Signature Constraints (5 messages)

Restrictions on __device__ and __host__ __device__ functions that are distinct from __global__ constraints.

Tag	Message Template
`device_function_has_ellipsis`	`__device__ or __host__ __device__ function with ellipsis requires compute_30 or higher architecture`
`device_func_tex_arg`	(device function with texture argument restriction)
`no_host_device_initializer_list`	(std::initializer_list in __host__ __device__ context)
`no_host_device_move_forward`	(std::move/forward in __host__ __device__ context)
`no_strict_cuda_error`	(relaxed error checking mode)

Category 20: __wgmma_mma_async Builtins (4 messages)

Warp Group Matrix Multiply-Accumulate builtins (sm_90a+).

Tag	Message Template
`wgmma_mma_async_not_enabled`	`__wgmma_mma_async builtins are only available for sm_90a`
`wgmma_mma_async_nonconstant_arg`	`Non-constant argument to __wgmma_mma_async call`
`wgmma_mma_async_missing_args`	`The 'A' or 'B' argument to __wgmma_mma_async call is missing`
`wgmma_mma_async_bad_shape`	`The shape %s is not supported for __wgmma_mma_async builtin`

Category 21: __block_size__ / __cluster_dims__ (8 messages)

Architecture-dependent launch configuration attributes.

Tag	Message Template
`block_size_unsupported`	`__block_size__ is not supported for this GPU architecture`
`block_size_must_be_positive`	(block size values must be positive)
`cluster_dims_unsupported`	`__cluster_dims__ is not supported for this GPU architecture`
`cluster_dims_must_be_positive`	(__cluster_dims__ values must be positive)
`cluster_dims_too_large`	(__cluster_dims__ exceeds maximum)
`conflict_between_cluster_dim_and_block_size`	`cannot specify the second tuple in __block_size__ while __cluster_dims__ is present`
--	`cannot specify max blocks per cluster for this GPU architecture`
`shared_block_size_must_be_positive`	(shared block size must be positive)

Category 22: Inline Hint Conflicts (2 messages)

Tag	Message Template
--	`"__inline_hint__" and "__forceinline__" may not be used on the same declaration`
--	`"__inline_hint__" and "__noinline__" may not be used on the same declaration`

Category 23: Miscellaneous CUDA Errors

Remaining CUDA-specific diagnostics that do not fall into the above categories.

Tag	Message Template
`cuda_displaced_new_or_delete_operator`	(displaced new/delete in CUDA context)
`cuda_demote_unsupported_floating_point`	(unsupported floating-point type demoted)
`illegal_ucn_in_device_identifer`	`Universal character is not allowed in device entity name (%sq)`
`thread_local_for_device_vars`	(thread_local on device variables)
--	`__global__ function or function template cannot have a parameter with va_list type`
`global_qualifier_not_allowed`	(execution space qualifier not allowed here)

Complete Diagnostic Tag Index (286 tags)

The following table lists all 286 CUDA-specific diagnostic tag names extracted from the cudafe++ binary. Each tag can be used with --diag_suppress, --diag_warning, --diag_error, or #pragma nv_diag_suppress / nv_diag_warning / nv_diag_error.

Tags are organized alphabetically within functional groups.

Cross-Space / Execution Space

Tag Name
`unsafe_device_call`

Redeclaration

Tag Name
`device_function_redeclared_with_global`
`device_function_redeclared_with_host`
`device_function_redeclared_with_host_device`
`device_function_redeclared_without_device`
`global_function_redeclared_with_device`
`global_function_redeclared_with_host`
`global_function_redeclared_with_host_device`
`global_function_redeclared_without_global`
`host_device_function_redeclared_with_global`
`host_function_redeclared_with_device`
`host_function_redeclared_with_global`
`host_function_redeclared_with_host_device`

__global__ Constraints

Tag Name
`bounds_attr_only_on_global_func`
`cuda_specifier_twice_in_group`
`global_class_decl`
`global_exception_spec`
`global_friend_definition`
`global_func_local_template_arg`
`global_function_consteval`
`global_function_constexpr`
`global_function_deduced_return_type`
`global_function_has_ellipsis`
`global_function_in_unnamed_inline_ns`
`global_function_inline`
`global_function_multiple_packs`
`global_function_pack_not_last`
`global_function_return_type`
`global_function_with_initializer_list`
`global_lambda_template_arg`
`global_new_or_delete`
`global_operator_function`
`global_param_align_too_big`
`global_private_template_arg`
`global_private_type_arg`
`global_qualifier_not_allowed`
`global_ref_param_restrict`
`global_rvalue_ref_type`
`global_unnamed_type_arg`
`global_va_list_type`
`local_type_used_in_global_function`
`maxnreg_attr_only_on_global_func`
`missing_launch_bounds`
`template_global_no_def`

Extended Lambda

Tag Name
`extended_host_device_generic_lambda`
`extended_lambda_array_capture_assignable`
`extended_lambda_array_capture_default_constructible`
`extended_lambda_array_capture_rank`
`extended_lambda_call_operator_local_type`
`extended_lambda_call_operator_private_type`
`extended_lambda_cant_take_function_address`
`extended_lambda_capture_in_constexpr_if`
`extended_lambda_capture_local_type`
`extended_lambda_capture_private_type`
`extended_lambda_constexpr`
`extended_lambda_disallowed`
`extended_lambda_discriminator`
`extended_lambda_enclosing_function_deducible`
`extended_lambda_enclosing_function_generic_lambda`
`extended_lambda_enclosing_function_hd_lambda`
`extended_lambda_enclosing_function_local`
`extended_lambda_enclosing_function_not_found`
`extended_lambda_hd_init_capture`
`extended_lambda_illegal_parent`
`extended_lambda_inaccessible_ancestor`
`extended_lambda_inaccessible_parent`
`extended_lambda_init_capture_array`
`extended_lambda_init_capture_initlist`
`extended_lambda_inside_constexpr_if`
`extended_lambda_multiple_parameter_packs`
`extended_lambda_multiple_parent`
`extended_lambda_nest_parent_template_param_unnamed`
`extended_lambda_no_parent_func`
`extended_lambda_pack_capture`
`extended_lambda_parent_class_unnamed`
`extended_lambda_parent_local_type`
`extended_lambda_parent_non_extern`
`extended_lambda_parent_private_template_arg`
`extended_lambda_parent_private_type`
`extended_lambda_parent_template_param_unnamed`
`extended_lambda_reference_capture`
`extended_lambda_too_many_captures`
`this_addr_capture_ext_lambda`

Device Code

Tag Name
`addr_of_label_in_device_func`
`asm_constraint_letter_not_allowed_in_device`
`auto_device_fn_ref`
`cuda_device_code_unsupported_operator`
`cuda_xasm_strict_placeholder_format`
`illegal_ucn_in_device_identifer`
`no_strict_cuda_error`
`thread_local_in_device_code`
`undefined_device_entity`
`undefined_device_identifier`
`unrecognized_pragma_device_code`
`unsupported_type_in_device_code`

Device Function

Tag Name
`device_func_tex_arg`
`device_function_has_ellipsis`
`no_host_device_initializer_list`
`no_host_device_move_forward`

Kernel Launch

Tag Name
`device_launch_no_sepcomp`
`device_side_launch_arg_with_user_provided_cctor`
`device_side_launch_arg_with_user_provided_dtor`
`missing_api_for_device_side_launch`

Variable Access

Tag Name
`device_var_address_taken_in_host`
`device_var_constexpr`
`device_var_read_in_host`
`device_var_structured_binding`
`device_var_written_in_host`
`device_variable_in_unnamed_inline_ns`
`host_var_address_taken_in_device`
`host_var_read_in_device`
`host_var_written_in_device`
`illegal_local_to_device_function`
`illegal_local_to_host_function`

Variable Template

Tag Name
`variable_template_func_local_template_arg`
`variable_template_lambda_template_arg`
`variable_template_private_template_arg`
`variable_template_private_type_arg`
`variable_template_unnamed_type_template_arg`

__managed__

Tag Name
`decltype_of_managed_variable`
`managed_cant_be_shared_constant`
`managed_const_type_not_allowed`
`managed_reference_type_not_allowed`
`unsupported_arch_for_managed_capability`
`unsupported_configuration_for_managed_capability`

__grid_constant__

Tag Name
`grid_constant_incompat_instantiation_directive`
`grid_constant_incompat_redecl`
`grid_constant_incompat_specialization`
`grid_constant_incompat_templ_redecl`
`grid_constant_non_kernel`
`grid_constant_not_const`
`grid_constant_reference_type`
`grid_constant_unsupported_arch`

Atomics

Tag Name
`floating_type_for_logical_atomic_operation`
`invalid_data_size_for_nv_atomic_generic_function`
`invalid_nv_atomic_cas_size`
`invalid_nv_atomic_exch_size`
`invalid_nv_atomic_memory_order_value`
`invalid_nv_atomic_operation_add_sub_size`
`invalid_nv_atomic_operation_max_min_float`
`invalid_nv_atomic_operation_size`
`invalid_nv_atomic_thread_scope_value`
`non_integral_type_for_non_generic_nv_atomic_function`
`nv_atomic_add_sub_f64_not_supported`
`nv_atomic_cas_b16_not_supported`
`nv_atomic_exch_cas_b128_not_supported`
`nv_atomic_function_address_taken`
`nv_atomic_function_no_args`
`nv_atomic_functions_not_supported_below_sm60`
`nv_atomic_load_order_error`
`nv_atomic_load_store_b128_version_too_low`
`nv_atomic_load_store_scope_cluster_change_to_device`
`nv_atomic_operation_not_in_device_function`
`nv_atomic_operation_order_not_constant_int`
`nv_atomic_operation_scope_not_constant_int`
`nv_atomic_operations_memory_order_fallback_to_membar`
`nv_atomic_operations_scope_cluster_change_to_device`
`nv_atomic_operations_scope_fallback_to_membar`
`nv_atomic_store_order_error`

JIT Mode

Tag Name
`host_closure_class_in_jit`
`no_host_in_jit`
`unannotated_function_in_jit`
`unannotated_static_data_member_in_jit`
`unannotated_variable_in_jit`

RDC / Whole-Program

Tag Name
`extern_kernel_template`
`template_global_no_def`

#pragma nv_abi

Tag Name
`nv_abi_pragma_bad_format`
`nv_abi_pragma_device_function`
`nv_abi_pragma_device_function_context`
`nv_abi_pragma_duplicate_arg`
`nv_abi_pragma_invalid_option`
`nv_abi_pragma_missing_arg`
`nv_abi_pragma_next_construct`
`nv_abi_pragma_not_constant`
`nv_abi_pragma_not_positive_value`
`nv_abi_pragma_overflow_value`

__nv_register_params__

Tag Name
`register_params_ellipsis_function`
`register_params_not_enabled`
`register_params_unsupported_arch`
`register_params_unsupported_function`

name_expr

Tag Name
`name_expr_extra_tokens`
`name_expr_internal_error`
`name_expr_non_device_variable`
`name_expr_non_global_routine`
`name_expr_not_routine_or_variable`
`name_expr_parsing`

Texture / Surface

Tag Name
`addr_of_text_surf_expr_in_device_func`
`addr_of_text_surf_var_in_device_func`
`indir_into_text_surf_expr_in_device_func`
`indir_into_text_surf_var_in_device_func`
`reference_to_text_surf_type_in_device_func`
`reference_to_text_surf_var_in_device_func`
`texture_surface_variable_in_unnamed_inline_ns`

__wgmma_mma_async

Tag Name
`wgmma_mma_async_bad_shape`
`wgmma_mma_async_missing_args`
`wgmma_mma_async_nonconstant_arg`
`wgmma_mma_async_not_enabled`

__block_size__ / __cluster_dims__

Tag Name
`block_size_must_be_positive`
`block_size_unsupported`
`cluster_dims_must_be_positive`
`cluster_dims_too_large`
`cluster_dims_unsupported`
`conflict_between_cluster_dim_and_block_size`
`shared_block_size_must_be_positive`
`shared_block_size_too_large`

Miscellaneous

Tag Name
`cuda_demote_unsupported_floating_point`
`cuda_displaced_new_or_delete_operator`
`thread_local_for_device_vars`

Internal Representation

Each CUDA error message is stored as a const char* entry in the error template table at off_88FAA0. The diagnostic tag names are stored in a separate string-to-integer lookup table; the tag name resolver (sub_4ED240 and related functions) performs a binary search on this table to match tag strings against internal error codes.
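
A binary search over a name-sorted table is straightforward to reproduce with the standard `bsearch`. The following is a minimal sketch of the idea, not the decompiled resolver: the `tag_entry` struct, the `lookup_tag` function, the four sample rows, and their numeric codes are all hypothetical, since the real table layout and error numbers are not given here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical row shape: tag string paired with an internal error code. */
struct tag_entry {
    const char *name;
    int code;
};

/* Must stay sorted by name for bsearch; codes are placeholders. */
static const struct tag_entry tag_table[] = {
    { "device_launch_no_sepcomp", 1 },
    { "grid_constant_not_const",  2 },
    { "no_host_in_jit",           3 },
    { "unsafe_device_call",       4 },
};

/* bsearch comparator: key is a C string, elem is a table row. */
static int compare_tag(const void *key, const void *elem)
{
    const struct tag_entry *entry = (const struct tag_entry *)elem;
    return strcmp((const char *)key, entry->name);
}

/* Returns the code for `name`, or -1 when the tag is unknown. */
int lookup_tag(const char *name)
{
    const struct tag_entry *hit;
    assert(name != NULL);
    hit = bsearch(name, tag_table,
                  sizeof tag_table / sizeof tag_table[0],
                  sizeof tag_table[0], compare_tag);
    return hit ? hit->code : -1;
}
```

An unknown tag yields -1 in this sketch; what the binary reports for an unrecognized tag passed to --diag_suppress is outside the scope of this example.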

The format specifiers embedded in CUDA error messages use the same system as EDG base errors:

Specifier	Meaning	Example in CUDA messages
`%sq`	Quoted entity name	Function name in cross-space call
`%sq1`, `%sq2`	Indexed quoted names	Caller and callee in call errors
`%no1`	Entity name (omit kind)	Function name in redeclaration
`%n1`, `%n2`	Entity names	Override base/derived pair
`%nd`	Entity name with decl location	Template parameter
`%s`, `%s1`, `%s2`	String fill-in	Execution space keyword
`%t`	Type fill-in	Type name in template arg errors
`%p`	Source position	Previous declaration location

For full format specifier documentation, see Format Specifiers.

Format Specifiers

The cudafe++ diagnostic system uses a custom format specifier language -- not printf -- to expand parameterized error messages. The expansion engine is process_fill_in (sub_4EDCD0, 1,202 decompiled lines in error.c), called by write_message_to_buffer (sub_4EF620, 159 lines) during template string expansion. Each diagnostic record carries a linked list of typed fill-in entries that supply the actual values -- type nodes, entity pointers, strings, integers, source positions -- which the format engine renders into the final message text.

This page documents the specifier syntax, the fill-in kind system, entity-kind dispatch, suffix options, numeric indexing, and the labeled fill-in mechanism.

Specifier Syntax

When write_message_to_buffer walks an error template string (looked up from off_88FAA0[error_code]), it recognizes three format constructs:

Syntax	Meaning	Example
`%%`	Literal `%` character	`"100%% complete"`
`%XY...Zn`	Fill-in specifier: letter `X`, options `Y...Z`, index `n`	`%nfd2`, `%sq1`, `%t`
`%[label]`	Named label fill-in reference	`%[class_or_struct]`
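
The three constructs can be told apart by looking at the character that follows `%`. The sketch below shows that dispatch in isolation; the enum `construct` and the function `classify_construct` are hypothetical names, and the order of the checks is an assumption rather than something read from the decompilation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of what starts at a template position. */
enum construct {
    CONSTRUCT_LITERAL_CHAR, /* ordinary text, copied through */
    CONSTRUCT_PERCENT,      /* "%%" */
    CONSTRUCT_LABEL,        /* "%[label]" */
    CONSTRUCT_FILL_IN       /* "%XY...Zn" */
};

/* Classifies the construct beginning at p (a NUL-terminated template). */
enum construct classify_construct(const char *p)
{
    assert(p != NULL);
    if (p[0] != '%')
        return CONSTRUCT_LITERAL_CHAR;
    if (p[1] == '%')
        return CONSTRUCT_PERCENT;
    if (p[1] == '[')
        return CONSTRUCT_LABEL;
    return CONSTRUCT_FILL_IN;
}
```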

Positional Specifier Parsing

The parser (sub_4EF620, error.c:4703) processes %XY...Zn specifiers as follows:

// After seeing '%', read next char as specifier letter
char spec_letter = template[pos + 1];      // 'T', 'd', 'n', 'p', 'r', 's', 't', 'u'
pos += 2;

// Collect option characters (a-z, A-Z) into buffer, max 29
int opt_count = 0;
int fill_in_index;
char options[30];
while (true) {
    char c = template[pos];
    if (c >= '0' && c <= '9') {
        // Trailing digit = fill-in index (1-based)
        fill_in_index = c - '0';
        break;
    }
    if ((c & 0xDF) < 'A' || (c & 0xDF) > 'Z') {
        // Not a letter -- end of specifier, index defaults to 1
        fill_in_index = 1;
        break;
    }
    options[opt_count++] = c;
    if (opt_count > 29)
        assertion_handler("error.c", 4739,
            "write_message_to_buffer",
            "construct_text_message:",
            "too many option characters");
    pos++;
}
options[opt_count] = '\0';

process_fill_in(diagnostic_record, spec_letter, options, fill_in_index);

The maximum of 29 option characters is enforced by an assertion. In practice, specifiers use 0--3 option characters.
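
For experimentation outside the binary, the parsing rules described here (one specifier letter, up to 29 option letters, an optional single trailing digit that defaults to 1) can be packaged as a standalone function. This is a sketch, not the decompiled routine: `struct specifier`, `parse_specifier`, and `MAX_OPTIONS` are hypothetical names, the function returns -1 where the binary would hit its assertion, and it assumes the trailing digit is consumed as part of the specifier.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OPTIONS 29

/* Hypothetical parsed form of one "%XY...Zn" specifier. */
struct specifier {
    char letter;                   /* specifier letter, e.g. 'n' */
    char options[MAX_OPTIONS + 1]; /* option letters, NUL-terminated */
    int index;                     /* 1-based fill-in index */
    size_t length;                 /* characters consumed, including '%' */
};

/* Parses the specifier starting at tmpl[0] == '%'.
 * Returns 0 on success, -1 on a missing letter or too many options. */
int parse_specifier(const char *tmpl, struct specifier *out)
{
    size_t pos = 1;
    size_t n = 0;

    assert(tmpl != NULL && out != NULL && tmpl[0] == '%');
    out->letter = tmpl[pos];
    if (out->letter == '\0')
        return -1;
    pos++;
    out->index = 1; /* default when no trailing digit is present */

    for (;;) {
        char c = tmpl[pos];
        if (c >= '0' && c <= '9') {
            out->index = c - '0';
            pos++;
            break;
        }
        if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')))
            break; /* not a letter: specifier ends here */
        if (n == MAX_OPTIONS)
            return -1; /* the binary asserts "too many option characters" */
        out->options[n++] = c;
        pos++;
    }
    out->options[n] = '\0';
    out->length = pos;
    return 0;
}
```

Parsing `%nfd2` with this sketch yields letter `n`, options `fd`, and index 2, while `%t` yields letter `t`, no options, and the default index 1.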

Fill-In Kinds

The specifier letter maps to a fill-in kind value through a switch on (letter - 84) in process_fill_in (sub_4EDCD0, error.c:4297):

Letter	ASCII	`letter - 84`	Kind	Payload Type	Description
`%T`	84	0	6	Type node pointer	Type name, uppercase rendering (`"<int, float>"`)
`%d`	100	16	0	`int64`	Signed decimal integer
`%n`	110	26	4	Entity node pointer	Entity/symbol name with rich formatting
`%p`	112	28	2	Source position cookie	Source file + line reference
`%r`	114	30	7	byte + pointer	Template parameter reference
`%s`	115	31	3	`const char*`	Plain string
`%t`	116	32	5	Type node pointer	Type name, lowercase rendering (`"int"`)
`%u`	117	33	1	`uint64`	Unsigned decimal integer

Any other letter triggers the assertion: "process_fill_in: bad fill-in kind" (error.c:4297).
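
The table can be written out as the switch it describes. This is an illustrative sketch: `fill_in_kind_for_letter` is a hypothetical name, and it returns -1 where the binary would raise the "bad fill-in kind" assertion.

```c
#include <assert.h>

static_assert('T' == 84, "the switch below assumes an ASCII execution charset");

/* Returns the fill-in kind (0-7) for a specifier letter, or -1 for a
 * letter the table does not list. */
static int fill_in_kind_for_letter(char letter)
{
    switch (letter - 84) {
    case 0:  return 6;   /* %T: template argument list */
    case 16: return 0;   /* %d: signed decimal */
    case 26: return 4;   /* %n: entity name */
    case 28: return 2;   /* %p: source position */
    case 30: return 7;   /* %r: template parameter reference */
    case 31: return 3;   /* %s: string */
    case 32: return 5;   /* %t: type */
    case 33: return 1;   /* %u: unsigned decimal */
    default: return -1;
    }
}
```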

Usage Frequency Across 3,795 Templates

Measured across all error message templates in off_88FAA0:

Specifier	Occurrences	Typical Context
`%s`	~470	String fragments: attribute names, keyword text, flag names
`%t`	~241	Type names in mismatch diagnostics
`%sq`	~233	Quoted string fragments in CUDA cross-space messages
`%n`	~179	Entity names: function, variable, class, template
`%p`	~76	Source positions: "declared at line N of file.cu"
`%d`	~60	Numeric values: counts, limits, sizes
`%T`	~40	Type template parameter lists
`%u`	~20	Unsigned counts
`%r`	~10	Template parameter back-references

Fill-In Entry Layout

Each fill-in entry is a 40-byte node allocated from a pool (qword_106B490) or heap by alloc_fill_in_entry (sub_4F2DE0):

Offset	Size	Field	Description
0	4	`kind`	Fill-in kind (0--7, from specifier letter mapping)
4	1	`used_flag`	Set to 1 when consumed during expansion
5	3	(padding)	--
8	8	`next`	Next fill-in in linked list
16	8+	`payload`	Union, varies by kind (see below)

Payload Layout by Kind

Kind 0 (decimal, %d) / Kind 1 (unsigned, %u) / Kind 3 (string, %s) / Kind 5 (type, %t) / Kind 6 (type, %T):

Offset	Size	Field
16	8	`value` -- int64 for kind 0, uint64 for kind 1, `const char*` for kind 3, type node pointer for kind 5/6

Kind 2 (position, %p):

Offset	Size	Field
16	8	`position_cookie` -- initialized to `qword_126EFB8` (current source position) at allocation time

Kind 4 (entity name, %n):

Offset	Size	Field
16	8	`entity_ptr` -- pointer to entity node
24	4	`scope_index` -- initialized to `0xFFFFFFFF` (invalid)
28	1	`full_qualification_flag`
29	1	`original_name_flag`
30	1	`parameter_list_flag`
31	1	`template_function_flag`
32	1	`definition_flag`
33	1	`alternate_original_flag`
34	1	`template_only_flag`

Kind 7 (%r):

Offset	Size	Field
16	1	`param_byte`
17	7	(padding)
24	8	`template_scope_ptr`
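
The offsets in the tables above can be checked by writing the entry as a C struct. The field and type names below are assumptions chosen to match the tables, not recovered symbols, and the static assertions hold only on an LP64 target such as x86-64 Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fill_in_entry {
    uint32_t kind;                     /* offset 0: fill-in kind, 0-7 */
    uint8_t  used_flag;                /* offset 4: set once consumed */
    /* 3 bytes of compiler-inserted padding */
    struct fill_in_entry *next;        /* offset 8 */
    union {                            /* offset 16 */
        int64_t     signed_value;      /* kind 0 */
        uint64_t    unsigned_value;    /* kind 1 */
        uint64_t    position_cookie;   /* kind 2 */
        const char *string;            /* kind 3 */
        void       *type_node;         /* kinds 5 and 6 */
        struct {                       /* kind 4 */
            void    *entity_ptr;
            uint32_t scope_index;
            uint8_t  flags[7];         /* offsets 28-34, one byte per flag */
        } entity;
        struct {                       /* kind 7 */
            uint8_t  param_byte;
            void    *template_scope_ptr;
        } tparam;
    } payload;
};

static_assert(sizeof(struct fill_in_entry) == 40, "40-byte node");
static_assert(offsetof(struct fill_in_entry, next) == 8, "next at 8");
static_assert(offsetof(struct fill_in_entry, payload) == 16, "payload at 16");
static_assert(offsetof(struct fill_in_entry, payload.entity.scope_index) == 24,
              "scope_index at 24");
static_assert(offsetof(struct fill_in_entry, payload.entity.flags) == 28,
              "flag bytes start at 28");
static_assert(offsetof(struct fill_in_entry, payload.tparam.template_scope_ptr) == 24,
              "template_scope_ptr at 24");
```

The kind 4 payload is the largest member at 19 bytes, which alignment rounds up to 24, giving the 40-byte total.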

Fill-In Linked List

Fill-in entries attach to the diagnostic record as a singly-linked list:

Head pointer: diagnostic record offset 184 (fill_in_list_head)
Tail pointer: diagnostic record offset 192 (fill_in_list_tail)

When process_fill_in searches for a matching entry, it walks the list from head, looking for the first entry where node->kind == requested_kind. If the specifier includes an index (e.g., %t2), it skips index - 1 matching entries before consuming the target:

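// The decompiler types this as const __m128i *; it is a fill-in entry pointer
struct fill_in_entry *node = *(struct fill_in_entry **)((char *)diagnostic + 184);   // fill_in_list_head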
if (!node)
    goto fill_in_not_found;

while (node->kind != requested_kind || --index > 0) {
    node = node->next;                        // offset 8
    if (!node)
        goto fill_in_not_found;
}

node->used_flag = 1;                          // mark consumed (offset 4)
// proceed with kind-specific rendering

If no matching entry is found, process_fill_in triggers an assertion with a diagnostic message identifying the missing fill-in: "specified fill-in (%X, N) not found for error string: \"...\"" (error.c:4317).

After all format specifiers have been expanded, construct_text_message (sub_4EF9D0) iterates the entire fill-in list and asserts that every entry has used_flag == 1. An unconsumed fill-in triggers: "construct_text_message: not all fill-ins used for error string: \"...\"" (error.c:4781).
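
The lookup and the post-expansion check can be reproduced on a reduced node type. This is a sketch under stated assumptions: `struct fill_in`, `find_fill_in`, and `all_fill_ins_used` are hypothetical names, and the functions return NULL or 0 where the binary would assert.

```c
#include <assert.h>
#include <stddef.h>

struct fill_in {
    int kind;
    int used_flag;
    struct fill_in *next;
};

/* Find the index-th entry of the requested kind and mark it consumed.
 * The pre-decrement makes index 0 behave like index 1. */
static struct fill_in *find_fill_in(struct fill_in *head, int kind, int index)
{
    struct fill_in *node = head;
    while (node && (node->kind != kind || --index > 0))
        node = node->next;
    if (node)
        node->used_flag = 1;
    return node;   /* NULL corresponds to the "not found" assertion */
}

/* Returns 1 when every entry in the list has been consumed. */
static int all_fill_ins_used(const struct fill_in *head)
{
    for (; head; head = head->next)
        if (!head->used_flag)
            return 0;
    return 1;
}
```

For a list holding a type, a string, and a second type, `find_fill_in(head, 5, 2)` skips the first type entry and returns the third node, which is how `%t2` resolves.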

Numeric Indexing

When a template string must reference multiple fill-ins of the same kind, a trailing digit selects which one:

Specifier	Meaning
`%t`	First type fill-in (index 1, default)
`%t1`	First type fill-in (index 1, explicit)
`%t2`	Second type fill-in (index 2)
`%n1`	First entity name fill-in
`%n2`	Second entity name fill-in
`%sq1`	First string fill-in, quoted
`%sq2`	Second string fill-in, quoted

The index is a single digit 0--9. Index 0 behaves identically to index 1 (the counter is pre-decremented before comparison). In practice, most templates use indices 1 and 2; a few use up to 3.

Real template example (CUDA cross-space call, error 3499):

calling a __device__ function(%sq1) from a __host__ function(%sq2) is not allowed

Here %sq1 and %sq2 are both kind 3 (string) with option q (quoted), selecting the first and second string fill-ins respectively. The caller attaches two string fill-ins -- the called function's name and the calling function's name.

Suffix Options

String Options (`%s`)

The %s specifier accepts only one option character: q for quoted output.

Form	Rendering
`%s`	Raw string: `foo`
`%sq`	Quoted string: `"foo"`

The q option wraps the string in double-quote characters (") and applies colorization if enabled (quote category, code 6 = bold). Any other option character on %s triggers: "process_fill_in: bad option" (error.c:4364).

Multiple q characters are permitted syntactically (the parser loops over all option chars validating each is q) but have no additional effect -- only one layer of quoting is applied.

Entity Name Options (`%n`)

The %n specifier accepts a rich set of option suffixes that control how an entity is rendered. Options are processed left-to-right, setting flags on the fill-in entry's flag bytes (offsets 28--34):

Option	Flag Byte	Effect
`f`	offset 28 (`full_qualification`)	Show fully-qualified name with namespace/class scope chain
`o`	offset 29 (`original_name`)	Omit the entity kind prefix (suppress "function ", "variable ", etc.)
`p`	offset 30 (`parameter_list`)	Show function parameter types in signature
`t`	offset 31 + offset 28	Show template arguments AND full qualification (sets both flags)
`a`	offset 29 + offset 33	Show original name AND alternate/accessibility info
`d`	offset 32 (`definition`)	Append declaration location: `" (declared at line N of file.cu)"`
`T`	offset 34 (`template_only`)	Show template specialization context: `" (from translation unit ...)"`

Options can be combined. Common combinations from the error template table:

Specifier	Rendering Example
`%n`	`function "foo"`
`%no`	`"foo"` (no kind prefix)
`%nf`	`function "ns::cls::foo"` (fully qualified)
`%nfd`	`function "ns::cls::foo" (declared at line 42 of bar.cu)`
`%nt`	`function "ns::cls::foo<int>"` (full + template args)
`%np`	`function "foo" [with parameters shown]`
`%nT`	`function "foo" (from translation unit bar.cu)`
`%na`	`"foo" based on template argument(s) ...`

No Options for Other Kinds

The %d, %u, %p, %t, %T, and %r specifiers reject all option characters:

if (*options != '\0')
    assertion_handler("error.c", 4372,
        "process_fill_in",
        "process_fill_in: bad option", NULL);

Kind-Specific Rendering

Kind 0 -- Signed Decimal (`%d`)

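Renders the 64-bit signed integer payload using snprintf(buf, 20, "%lli", value), then writes the result to the output buffer. A size of 20 leaves room for 19 characters plus the terminating NUL, which covers every non-negative int64_t value (at most 19 digits); a negative value with 19 digits, such as INT64_MIN, would lose its last digit.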
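
The effect of the size argument can be checked with standard `snprintf` behavior alone. The wrapper name `format_signed_decimal` is hypothetical; the sketch assumes a 64-bit `long long` and says nothing about whether such extreme values ever reach this path in practice.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Formats value with the size limit quoted above. Returns the length
 * snprintf would have needed; a result of 20 or more means the text left
 * in buf was truncated. buf must hold at least 20 bytes. */
static int format_signed_decimal(long long value, char *buf)
{
    assert(buf != NULL);
    return snprintf(buf, 20, "%lli", value);
}
```

`LLONG_MAX` needs 19 characters and fits. `LLONG_MIN` needs 20 (sign plus 19 digits), so the call returns 20 and only the first 19 characters are stored.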

Kind 1 -- Unsigned Decimal (`%u`)

Formats the payload through sub_4F63D0, which renders the unsigned 64-bit value into a dynamically-sized string buffer.

Kind 2 -- Source Position (`%p`)

Calls sub_4F6820 (form_source_position) with the position cookie from the fill-in payload. The rendering includes:

File name (via sub_5B15D0 for display formatting)
Line number
Contextual text supplied by the caller through three string arguments (prefix, suffix, end-of-source fallback)

The caller passes context strings like " (declared ", ")", "(at end of source)" to frame the position reference. When the position resolves to line 0 or the file is "-" (stdin), alternate formats are used.

Kind 3 -- String (`%s` / `%sq`)

Without the q option, writes the string pointer payload directly to the output buffer via strlen + sub_6B9CD0 (buffer append).

With the q option, wraps the string in double quotes with colorization:

if (colorization_active)
    emit_escape(buffer, 6);       // quote color (bold)
write_char(buffer, '"');
write_string(buffer, payload);
if (colorization_active)
    emit_escape(buffer, 1);       // reset
write_char(buffer, '"');

Kind 5 -- Type, Lowercase (`%t`)

Renders the type node through the type formatting subsystem. The rendering pipeline:

Set byte_10678FA = 1 (name lookup kind = type display mode)
Write opening "
Call sub_600740 (format type for display) with the type node and the entity formatter callback table (qword_1067860)
Write closing "
Check via sub_7BE9C0 if the type has an "aka" (also-known-as) desugared form
If yes, append ' (aka "desugared_type")' -- comparing the rendered forms to avoid redundant output when they are identical

The aka check compares the rendered text of the original type against the desugared type. If they produce identical strings (same length, same content via strncmp), the aka suffix is suppressed by truncating the buffer back to the pre-aka position.

Kind 6 -- Type, Uppercase (`%T`)

Renders a type template argument list in angle brackets:

write_string(buffer, "\"<");
// Walk the template argument linked list
for (arg = payload; arg != NULL; arg = arg->next) {
    if (arg->kind != 3)   // skip pack expansion markers
        format_template_argument(arg, &entity_formatter);
    if (arg->next && arg->next->kind != 3)
        write_string(buffer, ", ");
}
write_string(buffer, ">\"");

Template argument entries with kind == 3 (at byte offset +8) are pack-expansion markers and are skipped during rendering.

Kind 7 -- Template Parameter Reference (`%r`)

Renders a template parameter by looking up the parameter entity through sub_5B9EE0 (entity lookup by scope + index). If found and non-null, renders via sub_4F3970 (unqualified entity name). Otherwise, falls back to sub_6011F0 (generic template parameter formatting).

Entity Kind Dispatch (`%n`)

When processing %n specifiers, process_fill_in reads the entity kind byte at offset 80 of the entity node and dispatches to kind-specific rendering logic. The function first resolves through projection indirection: if entity_kind == 16 (typedef), it follows the pointer at entity->info_ptr->pointed_to; if entity_kind == 24 (resolved namespace alias), it follows entity->info_ptr.

The dispatch handles 25 entity kind values (0--24, with gaps at 14/15/16/24 handled as special cases):

Entity Kind	Value	Kind Label String	Index in `off_88FAA0`	Rendering Logic
keyword	0	(none -- literal `"keyword"`)	--	Write `keyword "`, then the keyword's name string from `entity->name_sym->name`
concept	1	(from table)	1462	Simple: write kind label + quoted name
constant template parameter	2	`"constant"` or `"nontype"`	--	Check template parameter subkind: type_kind 14 with subkind 2 = `"nontype"`, else `"constant"`
template parameter	3	(from table)	1464 or 1465	Check whether the template parameter is a type parameter (type_kind != 14) → index 1465, else 1464
class	4	(from table, CUDA-aware)	1466--1468	CUDA mode: `1467` or `1468` (class vs struct); non-CUDA: `1466`
struct	5	(same as class)	1466--1468	Same dispatch as class, differentiated by `v46 != 5`
enum	6	(from table)	1472	Simple: write kind label + quoted name
variable	7	`"variable"` or `"handler parameter"`	1474 or 1475	Check handler-parameter flag (offset 163, bit 0x40). If set: `"handler parameter"` (index 1474). If variable is a structured binding (offset 162, bit 1): use index 2937. Otherwise: `"variable"` (index 1475) with optional template context
field	8	`"field"` or `"member"`	1480 or 1481	CUDA C++ mode: `"member"` (index 1480); C mode: `"field"` (index 1481)
member	9	`"member"`	1480	Always `"member"` with optional template context from scope chain
function	10	`"function"` or `"deduction guide"`	1478 or 2892	Check linkage kind (offset 166 == 7): deduction guide → index 2892. Otherwise `"function"` (1478). Walk qualified type chain to strip cv-qualifiers
function overload	11	(same as function)	1478 or 2892	Same dispatch as function (case 10), merged in the switch
namespace	12	(from table)	1463	Simple: write kind label + quoted name
label	13	(none)	--	Write quoted name only, no kind prefix, no type info
typedef (indirect variable)	14	`"variable"`	1475	Dereferences through `entity->info_ptr->pointed_to` and renders as variable
typedef (indirect function)	15	`"function"`	1478	Dereferences through `entity->info_ptr`, extracts function entity + routine info
typedef	16	--	--	Assertion: `"form_symbol_summary: projection of projection kind"` (error.c:2020). Should have been resolved before dispatch
using declaration	17	(from table)	1479	Simple: write kind label + quoted name
parameter	18	`"parameter"`	1473	Simple: write `"parameter"` + quoted name with type info
class (anonymous/unnamed)	19	(from table)	1469--1471 or 1889	Multiple sub-cases: anonymous class bit 0x40 → index 1469; class-template with bit 0x02 → index 1470; deduction_guide bit → index 1889; else index 1471
function template	20	`"function template"`	1485 (lambda) or kind label	Lambda function (offset 189, bit 0x20): index 1485 with scope entity. Otherwise: `"function template"` with type and parameter info
variable template	21	(from table)	2750	Simple: write kind label + quoted name
alias template	22	(from table)	3050	Simple: write kind label + quoted name
concept template	23	(from table)	1482	Simple: write kind label + quoted name
resolved namespace alias	24	--	--	Assertion: `"form_symbol_summary: projection of projection kind"` (same as kind 16). Should have been resolved

Any entity kind value outside 0--24 (excluding the gaps that trigger assertions) hits the default case: "form_symbol_summary: unsupported symbol kind" (error.c:2023).

Entity Rendering Pipeline

For entity kinds that produce a fully-formatted name (most non-trivial cases), the rendering proceeds through these stages:

1. Write entity kind label string (e.g., "function ")
   └── sub_6B9EA0(buffer, kind_label_string)
   └── sub_6B9CD0(buffer, " ", 1)

2. Open quote
   └── Optional colorization: sub_4ECDD0(buffer, 6)   // quote color
   └── sub_6B9CD0(buffer, "\"", 1)

3. Render type prefix (if has_type_info and full_qualification)
   └── sub_5FE8B0(type_node, 0, 1, 0, 0, &entity_formatter)

4. Render qualified or unqualified name
   ├── With template context:  sub_737A00(entity, &entity_formatter)
   └── Without template context: sub_4F3970(entity)

5. Render function parameters (if applicable)
   ├── Full parameter types: sub_5FB270(type, 0, 0, &entity_formatter)
   └── Simple type suffix:   sub_6016F0(type, &entity_formatter)

6. Close quote
   └── sub_6B9CD0(buffer, "\"", 1)
   └── Optional colorization: sub_4ECDD0(buffer, 1)   // reset

7. Append accessibility info (if 'a' option)
   └── " based on template argument(s) "
   └── sub_5FA660(template_arg_list, 0, &entity_formatter)

8. Append declaration location (if 'd' option)
   └── sub_4F6820(position, diag, " (declared ", ")", "(at end of source)")

9. Append translation unit info (if 'T' option)
   └── " (from translation unit <filename>)"

The original_name flag (o option) suppresses steps 1 and 3, rendering only the bare quoted name without a kind prefix or type qualification. The full_qualification flag (f option) enables step 3 and uses sub_737A00 for fully-qualified name rendering in step 4. The parameter_list flag (p option) forces step 5 to include full parameter-type rendering.
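
A reduced model of the pipeline shows how the `o` and `d` options shape the output. It covers only steps 1, 2, 4, 6, and 8; `render_entity` and `struct name_options` are hypothetical names, the caller supplies an already-qualified name, and colorization is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct name_options {
    int original_name;   /* 'o' option: omit the kind label */
    int definition;      /* 'd' option: append the declaration location */
};

/* Writes the rendered entity into out and returns snprintf's length. */
static int render_entity(char *out, size_t cap, const char *kind_label,
                         const char *qualified_name, const char *file,
                         unsigned line, struct name_options opt)
{
    char where[256] = "";

    assert(out && kind_label && qualified_name && file);
    if (opt.definition)
        snprintf(where, sizeof where, " (declared at line %u of %s)", line, file);
    return snprintf(out, cap, "%s%s\"%s\"%s",
                    opt.original_name ? "" : kind_label,   /* step 1 */
                    opt.original_name ? "" : " ",
                    qualified_name,                        /* steps 2, 4, 6 */
                    where);                                /* step 8 */
}
```

With the label `function`, the name `ns::cls::foo`, and the `d` flag set, this produces the `%nfd` rendering shown in the options table; with only `o` set it produces the bare quoted name.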

Template Context in Entity Names

When dword_126E274 (show template arguments) is non-zero and the entity has template context, the renderer can walk up the template scope chain:

Access the entity's routine info (for functions: offset 88 → offset 192 → offset 16)
Check for the instantiated-from entity (offset 104 of scope info, guarded by !(offset_176 & 1))
If found, use the instantiated-from entity as the display target
For class templates (entity_kind == 20): walk the template parameter chain, rendering <param1, param2, ...> with pack-expansion markers (...) for variadic parameters

CUDA-Specific Entity Rendering

Several entity kinds have CUDA-aware rendering paths:

Class/struct (kinds 4/5): When dword_126EFB4 == 2 (CUDA C++ mode) and the entity has an anonymous flag (offset 161, bit 0x80), rendering jumps to the anonymous-class handler (kind 19) instead
Field (kind 8): In CUDA C++ mode, the kind label is "member" (index 1480); in C mode, it is "field" (index 1481)
Class/struct label selection: In CUDA C++ mode, the kind label index is 1467 or 1468 depending on whether the entity is a class or a struct; in non-CUDA mode, it is always 1466

Labeled Fill-Ins (`%[label]`)

The %[label] syntax references a named fill-in from the label table at off_D481E0. This mechanism allows error templates to include conditional text fragments that vary based on language mode or compilation context.

Label Table Structure

off_D481E0 is an array of 24-byte entries (two pointers followed by two 32-bit indices per entry):

Offset	Size	Field	Description
0	8	`name`	Label name string (e.g., `"class_or_struct"`)
8	8	`condition_ptr`	Pointer to condition flag (dword)
16	4	`true_index`	String table index when `*condition_ptr != 0`
20	4	`false_index`	String table index when `*condition_ptr == 0`

Label Lookup Algorithm

// write_message_to_buffer, error.c:4714
char *label_start = template + pos + 2;      // skip "%["
char *label_end = strchr(template + pos + 1, ']');
if (!label_end)
    assertion_handler("error.c", 4714, "write_message_to_buffer", NULL, NULL);

size_t label_len = label_end - label_start;

// Walk off_D481E0 table
struct label_entry *entry = off_D481E0;
while (entry->name) {
    if (strncmp(entry->name, label_start, label_len) == 0) {
        // Found matching label
        int string_index;
        if (*entry->condition_ptr)
            string_index = entry->true_index;
        else
            string_index = entry->false_index;

        if (string_index > 3794)
            error_text_invalid_code();     // sub_4F2D30

        // Expand the referenced string directly into the buffer
        const char *text = off_88FAA0[string_index];
        write_to_buffer(buffer, text, strlen(text));
        pos = (label_end - template) + 1;   // resume just past ']'
        break;
    }
    entry++;   // advance by 24 bytes
}

if (!entry->name) {
    // Label not found -- fatal
    fprintf(stderr, "missing fill-in label: %.*s\n", (int)label_len, label_start);
    assertion_handler("error.c", 430,
        "get_label_fill_in_entry",
        "get_label_fill_in_entry: no label fill-in found", NULL);
}

The label table entries reference string indices in the same off_88FAA0 table used for error messages. This allows a single error template to produce different text depending on compilation mode -- for example, using "class" vs "struct" based on a language-mode flag, or "virtual" vs "" based on a feature flag.

The label text is written directly to the output buffer without further format specifier processing -- labels cannot contain nested % specifiers.
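
The table-driven selection can be modeled with a small stand-alone table. This is a sketch: `resolve_label` is a hypothetical name, the indices used with it are placeholders rather than values from the binary, and it returns -1 where the binary would assert. As in the listing above, only the first `len` characters are compared.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct label_entry {
    const char *name;        /* NULL terminates the table */
    const int  *condition;   /* flag consulted at expansion time */
    int true_index;          /* message index when *condition != 0 */
    int false_index;         /* message index when *condition == 0 */
};

/* Returns the message-table index selected for the label, or -1 if no
 * entry matches. label points just past "%[" and len excludes the ']'. */
static int resolve_label(const struct label_entry *table,
                         const char *label, size_t len)
{
    for (; table->name; table++)
        if (strncmp(table->name, label, len) == 0)
            return *table->condition ? table->true_index
                                     : table->false_index;
    return -1;
}
```

Because the condition is read through a pointer on every lookup, the same template can expand differently if the flag changes between diagnostics.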

Output Buffer

All rendering targets the global message text buffer at qword_106B488:

Initial allocation: 0x400 bytes (1 KB) via sub_6B98A0
Dynamic growth: sub_6B9B20 doubles the buffer when capacity is exceeded
String append: sub_6B9CD0(buffer, data, length) -- the workhorse write function
String write: sub_6B9EA0(buffer, string) -- convenience wrapper (calls strlen + sub_6B9CD0)
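
The buffer behavior listed above amounts to a conventional growable byte buffer. The sketch below uses hypothetical names (`struct text_buffer`, `buffer_init`, `buffer_append`, `buffer_write_string`) and assumes capacity doubles repeatedly until the pending write fits; the exact growth policy of sub_6B9B20 beyond "doubles" is not established here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct text_buffer {
    char  *data;
    size_t capacity;
    size_t length;
};

static void buffer_init(struct text_buffer *b)
{
    b->capacity = 0x400;                 /* 1 KB initial allocation */
    b->length = 0;
    b->data = malloc(b->capacity);
    if (!b->data)
        abort();
}

/* Append n bytes, doubling the capacity until they fit. */
static void buffer_append(struct text_buffer *b, const char *src, size_t n)
{
    if (b->length + n > b->capacity) {
        size_t cap = b->capacity;
        while (b->length + n > cap)
            cap *= 2;
        char *p = realloc(b->data, cap);
        if (!p)
            abort();
        b->data = p;
        b->capacity = cap;
    }
    memcpy(b->data + b->length, src, n);
    b->length += n;
    assert(b->length <= b->capacity);
}

/* Convenience wrapper: strlen followed by buffer_append. */
static void buffer_write_string(struct text_buffer *b, const char *s)
{
    buffer_append(b, s, strlen(s));
}
```

Appending 3,000 bytes in 10-byte pieces grows the buffer from 1,024 to 2,048 and then to 4,096 bytes.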

The entity display callback infrastructure at qword_1067860 allows the type/name formatting subsystem to write to the same buffer through an indirect call:

Variable	Address	Purpose
`qword_1067860`	`0x1067860`	Entity formatter callback (set to `sub_5B29C0`)
`qword_1067870`	`0x1067870`	Entity formatter output buffer (set to `qword_106B488`)
`byte_10678F1`	`0x10678F1`	C mode flag (`dword_126EFB4 == 1`)
`byte_10678F4`	`0x10678F4`	Pre-C++11 flag
`byte_10678FA`	`0x10678FA`	Name lookup kind (saved/restored around type rendering)
`byte_10678FE`	`0x10678FE`	Entity display flags (saved/restored around `%n` processing)
`byte_1067902`	`0x1067902`	Type desugaring mode flag (saved/restored around `%t` aka rendering)

Colorization Interaction

When dword_126ECA4 (colorization active) is non-zero, the format engine inserts ANSI escape sequences around quoted names and type references:

Context	Color Code	ANSI Sequence	Visual
Opening quote (`"`)	6 (quote)	`\033[01m`	Bold
Closing quote (`"`)	1 (reset)	`\033[0m`	Normal
Type rendering context	(inherited)	--	Inherits from diagnostic severity color

The escape sequences are emitted by sub_4ECDD0(buffer, color_code). The color codes correspond to the categories parsed from EDG_COLORS / GCC_COLORS environment variables during initialization.

Function Map

Address	Name (Recovered)	Size	Role
`0x4EDCD0`	`process_fill_in`	1,202 lines	Core format specifier expansion
`0x4EF620`	`write_message_to_buffer`	159 lines	Template string walker, `%` parser
`0x4F2DE0`	`alloc_fill_in_entry`	41 lines	Pool allocator for 40-byte fill-in nodes
`0x4F2D30`	`error_text_invalid_code`	12 lines	Assert on invalid error code (> 3794)
`0x4F2930`	`assertion_handler`	101 lines	`__noreturn`, 5,185 callers
`0x4F3480`	`format_assertion_message`	~100 lines	Multi-arg string builder for assertion text
`0x4F6820`	`form_source_position`	~130 lines	Render `%p` source position with file + line
`0x4F3970`	`format_entity_unqualified`	--	Render unqualified entity name
`0x4F39E0`	`format_entity_with_template`	--	Render entity with template args + accessibility
`0x737A00`	`format_qualified_name`	--	Render fully-qualified name through scope chain
`0x5FE8B0`	`format_type_with_qualifiers`	--	Render type with cv-qualifiers for `%n` prefix
`0x5FB270`	`format_function_parameters`	--	Render function parameter type list
`0x6016F0`	`format_simple_type`	--	Render simple type suffix
`0x600740`	`format_type_for_display`	--	Render type for `%t` specifier
`0x7BE9C0`	`has_desugared_type`	--	Check if type has an "aka" form
`0x5FA660`	`format_template_argument_list`	--	Render template argument list for `%n` `a` option
`0x5FA0D0`	`format_template_argument`	--	Render single template argument for `%T`
`0x5B9EE0`	`lookup_entity_by_scope`	--	Entity lookup for `%r` template parameter
`0x4F63D0`	`format_unsigned_decimal`	--	Render unsigned integer for `%u`
`0x6B9CD0`	`buffer_append`	--	Write bytes to dynamic buffer
`0x6B9EA0`	`buffer_write_string`	--	Write null-terminated string to buffer
`0x4ECDD0`	`emit_colorization_escape`	--	Emit ANSI escape sequence

Cross-References

Diagnostic Overview -- 7-stage pipeline, severity levels, diagnostic record layout
CUDA Error Catalog -- all 338 CUDA-specific error templates with specifier usage
SARIF & Pragma Control -- SARIF JSON output and #pragma nv_diagnostic system

SARIF Output & Pragma Diagnostic Control

cudafe++ supports two diagnostic output formats -- traditional text (default) and SARIF v2.1.0 JSON -- controlled by the --output_mode flag (flag index 274, stored in dword_106BBB8). Alongside the output format, the pragma diagnostic system allows per-error severity overrides at arbitrary source positions through #pragma nv_diag_* directives, which record a stack of severity modifications binary-searched at emission time. A companion colorization subsystem adds ANSI escape sequences to text-mode output, governed by environment variables and terminal detection. This page covers the internals of all three subsystems.

For the diagnostic pipeline architecture, severity levels, and error message formatting, see Diagnostic Overview. For the CUDA error catalog and tag-name suppression, see CUDA Errors.

SARIF Output Mode

Activation

SARIF mode is activated by passing --output_mode sarif on the command line. The flag handler (case 274 in the CLI parser at sub_454160) performs a simple string comparison:

// sub_454160, case 274
if (strcmp(arg, "text") == 0)
    dword_106BBB8 = 0;        // text mode (default)
else if (strcmp(arg, "sarif") == 0)
    dword_106BBB8 = 1;        // SARIF JSON mode
else
    error("unrecognized output mode (must be one of text, sarif): %s", arg);

When dword_106BBB8 == 1, three changes take effect globally:

write_init (sub_5AEDB0) emits the SARIF JSON header instead of nothing
check_severity (sub_4F1330) routes each diagnostic through the SARIF JSON builder instead of construct_text_message
write_signoff (sub_5AEE00) emits ]}]}\n instead of the error/warning summary line

All other pipeline behavior -- severity computation, pragma overrides, error counting, exit codes -- is identical in both modes. Exit codes in SARIF mode skip the text messages ("Compilation terminated.", "Compilation aborted.") but use the same numeric values (0, 2, 4, 11).

SARIF Header (`sub_5AEDB0`)

write_init is called once at the start of compilation. In SARIF mode, it writes the JSON envelope to qword_126EDF0 (the diagnostic output stream, typically stderr):

{
  "version": "2.1.0",
  "$schema": "https://raw.githubusercontent.com/oasis-tcs/sarif-spec/master/Schemata/sarif-schema-2.1.0.json",
  "runs": [{
    "tool": {
      "driver": {
        "name": "EDG CPFE",
        "version": "6.6",
        "organization": "Edison Design Group",
        "fullName": "Edison Design Group C/C++ Front End - 6.6",
        "informationUri": "https://edg.com/c"
      }
    },
    "columnKind": "unicodeCodePoints",
    "results": [

The version strings ("6.6") are hardcoded in the binary via two %s format arguments that both resolve to the static string "6.6". The results array is opened but not closed, along with the enclosing run object and runs array -- each diagnostic result is appended as the compilation proceeds, and all three are closed by write_signoff.

An assertion guards the mode value: if dword_106BBB8 is neither 0 nor 1, the function fires sub_4F2930 with "write_init" at host_envir.c:2017.

SARIF Result Object

Each diagnostic emitted through check_severity (sub_4F1330) produces one JSON result object. The construction happens inline within check_severity at LABEL_91, building the JSON into the SARIF buffer qword_106B478:

{
  "ruleId": "EC<error_code>",
  "level": "<severity_string>",
  "message": {"text": "<expanded_message>"},
  "locations": [{"physicalLocation": <location_object>}],
  "relatedLocations": [<related_location_objects>]
}

Comma handling: When qword_126ED90 + qword_126ED98 > 1 (more than one diagnostic has been emitted), a comma is prepended before the opening { to maintain valid JSON array syntax.

Rule ID Format

The rule ID is always "EC" followed by the internal error code (0--3794), not the display code:

sub_6B9CD0(sarif_buf, "\"ruleId\":", 9);
sub_6B9CD0(sarif_buf, "\"EC", 3);
sprintf(s, "%lu", (unsigned long)*(uint32_t*)(record + 176));  // internal error code
sub_6B9CD0(sarif_buf, s, strlen(s));
sub_6B9CD0(sarif_buf, "\"", 1);

For a CUDA error with internal code 3499 (display code 20042), the rule ID is "EC3499", not "EC20042". This matches the text-mode format, which uses "EC%lu" with the same internal code in construct_text_message.

Level Mapping

The level field is derived from the diagnostic severity byte at record offset 180. When severity <= byte_126ED68 (the error-promotion threshold) and severity <= 7, it is promoted to "error" before level selection. The mapping:

Severity	`level` String	SARIF Standard?
4 (remark)	`"remark"`	Non-standard extension
5 (warning)	`"warning"`	Standard
7 (error, soft)	`"error"`	Standard
8 (error, hard)	`"error"`	Standard
9 (catastrophic)	`"catastrophe"`	Non-standard extension
11 (internal)	`"internal_error"`	Non-standard extension

Any other severity value triggers the assertion at error.c:4886:

sub_4F2930(..., "write_sarif_level",
    "determine_severity_code: bad severity", 0);

Notes (severity 2) and command-line diagnostics (severity 6, 10) never reach the SARIF level mapper -- notes are suppressed below the minimum severity gate, and command-line diagnostics bypass the SARIF path entirely.
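
The mapping table can be expressed as a single switch. This sketch uses the hypothetical name `sarif_level`, omits the promotion step that runs before level selection, and returns NULL where the binary would raise the "bad severity" assertion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Maps a diagnostic severity to the SARIF "level" string from the table,
 * or NULL for severities the table does not list. */
static const char *sarif_level(int severity)
{
    switch (severity) {
    case 4:  return "remark";          /* non-standard extension */
    case 5:  return "warning";
    case 7:                            /* soft error */
    case 8:  return "error";           /* hard error */
    case 9:  return "catastrophe";     /* non-standard extension */
    case 11: return "internal_error";  /* non-standard extension */
    default: return NULL;
    }
}
```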

Message Object (`sub_4EF8A0`)

The message text is produced by write_sarif_message_json (sub_4EF8A0), which wraps the expanded error template in a JSON {"text":"..."} object:

Appends {"text":" to the SARIF buffer
Calls write_message_to_buffer (sub_4EF620) to expand the error template with fill-in values into qword_106B488
Null-terminates the message buffer
JSON-escapes the message: iterates each character, prepending \ before any " (0x22) or \ (0x5C) character
Appends "} to close the message object

The escaping is minimal -- only double-quote and backslash are escaped. Control characters (newlines, tabs) are not escaped, relying on the fact that EDG error messages do not contain embedded newlines.
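
The escaping rule is small enough to restate as code. `json_escape_minimal` is a hypothetical name; the sketch copies into a caller-supplied buffer rather than the SARIF buffer and, as described above, deliberately leaves control characters untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies src into dst, prefixing '"' and '\\' with a backslash.
 * Returns the escaped length, or (size_t)-1 if dst is too small. */
static size_t json_escape_minimal(const char *src, char *dst, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return (size_t)-1;
    for (; *src; src++) {
        if (*src == '"' || *src == '\\') {
            if (n + 1 >= cap)
                return (size_t)-1;
            dst[n++] = '\\';
        }
        if (n + 1 >= cap)              /* keep one byte for the terminator */
            return (size_t)-1;
        dst[n++] = *src;
    }
    dst[n] = '\0';
    return n;
}
```

A message containing a tab or newline passes through unchanged, which is why valid output depends on the templates not containing such characters.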

Physical Location (`sub_4ECB10`)

When the diagnostic record has a valid file index (offset 136 != 0), a locations array is emitted containing one physical location object:

{
  "physicalLocation": {
    "artifactLocation": {"uri": "file://<canonical_path>"},
    "region": {"startLine": <line>, "startColumn": <column>}
  }
}

The function sub_4ECB10 (write_sarif_physical_location):

Calls sub_5B97A0 to resolve the source-position cookie at record offset 136 into file path, line number, and column number
Calls sub_5B1060 to canonicalize the file path
Emits the artifactLocation with a file:// URI prefix
Emits startLine unconditionally
Emits startColumn only when the column value is non-zero (the v4 check: if (v4))

The startColumn conditional emission means that diagnostics without column information (e.g., command-line errors) produce location objects with only startLine.

Sub-diagnostics (linked at record offset 72, the sub_diagnostic_head pointer) are serialized into the relatedLocations array:

if (record->sub_diagnostic_head) {
    append(",\"relatedLocations\":[");
    int first = 1;
    for (sub = record->sub_diagnostic_head; sub; sub = sub->next) {
        sub->parent = record;          // back-link at offset 16
        append("{\"message\":");
        write_sarif_message_json(sub);  // expand sub-diagnostic message
        if (sub->file_index)
            write_sarif_physical_location(sub);
        append("}");
        if (!first)
            append(",");               // note: comma AFTER closing }
        first = 0;
    }
    append("]");
}

Each related location has its own message object and an optional physicalLocation. The comma is placed after the closing brace of each entry except the first, yielding [{...}{...},{...},] -- this is a bug in the JSON generation that produces malformed output whenever there are two or more related locations: the separator between the first and second entries is missing, and the final entry is followed by a trailing comma.
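
The separator logic from the loop above can be replayed on its own to see exactly what it emits. In this sketch `emit_related_locations` is a hypothetical name and `{}` stands in for each serialized entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Writes the array text for count entries into out, using the same
 * comma placement as the loop above. */
static void emit_related_locations(int count, char *out, size_t cap)
{
    int first = 1;

    assert(count >= 0 && cap >= 3 * (size_t)count + 3);
    strcpy(out, "[");
    for (int i = 0; i < count; i++) {
        strcat(out, "{}");
        if (!first)
            strcat(out, ",");   /* comma follows the entry, never precedes it */
        first = 0;
    }
    strcat(out, "]");
}
```

One entry gives `[{}]`, which is valid. Two entries give `[{}{},]` and three give `[{}{},{},]`, neither of which parses as JSON.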

write_signoff closes the JSON structure:

if (dword_106BBB8 == 1) {
    fwrite("]}]}\n", 1, 5, qword_126EDF0);
    return;
}

This closes: results array (]), the run object (}), the runs array (]), and the top-level object (}), followed by a newline.

In text mode, write_signoff instead prints the error/warning summary (e.g., "3 errors, 2 warnings detected in file.cu"), using message-table lookups via sub_4F2D60 with IDs 1742--1748 and 3234--3235 for pluralization.

Complete SARIF Output Example

{"version":"2.1.0","$schema":"https://raw.githubusercontent.com/oasis-tcs/sarif-spec/master/Schemata/sarif-schema-2.1.0.json","runs":[{"tool":{"driver":{"name":"EDG CPFE","version":"6.6","organization":"Edison Design Group","fullName":"Edison Design Group C/C++ Front End - 6.6","informationUri":"https://edg.com/c"}},"columnKind":"unicodeCodePoints","results":[{"ruleId":"EC3499","level":"error","message":{"text":"calling a __device__ function(\"foo\") from a __host__ function(\"main\") is not allowed"},"locations":[{"physicalLocation":{"artifactLocation":{"uri":"file:///path/to/test.cu"},"region":{"startLine":10,"startColumn":5}}}]}]}]}

Pragma Diagnostic Control

Pragma Actions

cudafe++ processes #pragma nv_diag_* directives through the preprocessor, which records them as pragma action entries on a global stack. Six action codes are defined:

Code	Pragma Directive	Severity Effect	Internal Name
30	`#pragma nv_diag_suppress`	Set severity to 3 (suppressed)	`ignored`
31	`#pragma nv_diag_remark`	Set severity to 4 (remark)	`remark`
32	`#pragma nv_diag_warning`	Set severity to 5 (warning)	`warning`
33	`#pragma nv_diag_error`	Set severity to 7 (error)	`error`
35	`#pragma nv_diag_default`	Restore from `byte_1067920[4 * error_code]`	`default`
36	`#pragma nv_diag_push` / `pop`	Scope boundary marker	push/pop

Note the gap: action code 34 is not used. Actions 30--33 modify severity, 35 restores the compile-time default, and 36 provides push/pop scoping to allow localized overrides.

The pragmas accept either a numeric error code or a diagnostic tag name:

#pragma nv_diag_suppress 20042              // by display code
#pragma nv_diag_suppress calling_a_constexpr__host__function  // by tag name

Display codes >= 20000 are converted to internal codes by sub_4ED170:

int internal_code = (display_code > 19999) ? display_code - 16543 : display_code;
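
As a worked example of the conversion, display code 20042 maps to internal code 20042 - 16543 = 3499, while codes below 20000 pass through unchanged. A small sketch wrapping the expression above (the function name display_to_internal and the enum constant are illustrative labels, not symbols from the binary):

```c
/* Offset taken from the decompiled expression in sub_4ED170. */
enum { DISPLAY_OFFSET = 16543 };

/* Display codes >= 20000 are shifted down; everything else is unchanged. */
static int display_to_internal(int display_code)
{
    return display_code > 19999 ? display_code - DISPLAY_OFFSET
                                : display_code;
}
```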

Pragma Stack (`qword_1067820`)

The pragma stack is a dynamically-growing array of 24-byte records stored at qword_1067820. The array is managed as a sorted-by-position sequence to enable binary search.

Each 24-byte stack entry has the following layout:

Offset	Size	Field	Description
0	4	`position_cookie`	Source position (sequence number)
4	2	`column`	Column number within the line
8	1	`action_code`	Pragma action (30--36)
9	1	`flags`	Bit 0: is push/pop with saved index
16	8	`error_code` or `saved_index`	Target error code, or -1/saved push index for scope markers

The array header (pointed to by qword_1067820) contains:

Offset	Size	Field
0	8	Pointer to entry array base
8	8	Array capacity
16	8	Entry count
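
The two layouts above can be written down as C structs. This is a sketch under the assumption of an LP64 target (8-byte pointers); the type and field names are illustrative, and the explicit padding members simply cover the byte ranges the tables leave undocumented.

```c
#include <stddef.h>
#include <stdint.h>

/* One 24-byte pragma stack record (names are hypothetical). */
typedef struct {
    uint32_t position_cookie;            /* +0  source position sequence number */
    uint16_t column;                     /* +4  column within the line */
    uint8_t  pad0[2];                    /* +6  undocumented */
    uint8_t  action_code;                /* +8  pragma action, 30..36 */
    uint8_t  flags;                      /* +9  bit 0: carries a saved index */
    uint8_t  pad1[6];                    /* +10 undocumented */
    int64_t  error_code_or_saved_index;  /* +16 */
} pragma_entry;

/* Array header pointed to by qword_1067820. */
typedef struct {
    pragma_entry *base;      /* +0  entry array base */
    uint64_t      capacity;  /* +8  allocated entries */
    uint64_t      count;     /* +16 entries in use */
} pragma_stack;

_Static_assert(sizeof(pragma_entry) == 24, "entry must be 24 bytes");
_Static_assert(offsetof(pragma_entry, action_code) == 8, "action at +8");
_Static_assert(offsetof(pragma_entry, error_code_or_saved_index) == 16,
               "payload at +16");
```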

Recording Pragma Entries (`sub_4ED190`)

When the preprocessor encounters a #pragma nv_diag_* directive, record_pragma_diagnostic (sub_4ED190) creates a new stack entry:

void record_pragma_diagnostic(uint error_code, uint8_t severity, uint *position) {
    // Hash: (column+1) * (position+1) * error_code * (severity+1)
    // column is the uint16_t at byte offset 4 of the position record
    uint64_t hash = (*((uint16_t*)position + 2) + 1) * (*position + 1)
                    * error_code * (severity + 1);
    uint64_t bucket = hash % 983;     // 0x3D7

    entry = allocate(32);
    entry->error_code_field = error_code;    // offset 8
    entry->severity = severity;              // offset 12
    entry->position = *position;             // offset 16
    entry->saved_index = 0xFFFFFFFF;         // offset 24 = -1

    // Insert at head of hash chain
    entry->next = hash_table[bucket];        // qword_1065960
    hash_table[bucket] = entry;
}

This function serves double duty: it records the pragma entry for the per-diagnostic suppression hash table (qword_1065960, 983 buckets) used by check_pragma_diagnostic (sub_4ED240), and it simultaneously records the entry on the position-sorted pragma stack.

The function also sets bit 2 of the per-error flags byte (byte_1067922[4 * error_code] |= 4) to mark that this error code has at least one pragma override, which enables the fast-path check in check_for_overridden_severity.
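
To make the bucket arithmetic concrete, here is a sketch of the hash computation on its own. The src_pos type and function name are hypothetical, and performing the multiplication in 64-bit arithmetic is an assumption; the binary may use narrower intermediates.

```c
#include <stdint.h>

enum { PRAGMA_HASH_BUCKETS = 983 };   /* 0x3D7, as used by the modulo */

/* Hypothetical view of a source position: cookie at +0, column at +4. */
typedef struct {
    uint32_t cookie;
    uint16_t column;
} src_pos;

/* (column+1) * (cookie+1) * error_code * (severity+1), reduced mod 983. */
static uint64_t pragma_hash_bucket(uint32_t error_code, uint8_t severity,
                                   const src_pos *pos)
{
    uint64_t h = ((uint64_t)pos->column + 1)
               * ((uint64_t)pos->cookie + 1)
               * (uint64_t)error_code
               * ((uint64_t)severity + 1);
    return h % PRAGMA_HASH_BUCKETS;
}
```

For column 4, cookie 9, error code 3499, and severity 3, the product is 5 * 10 * 3499 * 4 = 699800, which lands in bucket 887. Because error_code is a plain factor, error code 0 always hashes to bucket 0.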

Per-Diagnostic Suppression Check (`sub_4ED240`)

check_pragma_diagnostic (sub_4ED240) is the fast-path check called from check_severity to determine whether a specific diagnostic at a specific source position should be suppressed. It operates on the hash table rather than the sorted stack:

bool check_pragma_diagnostic(uint error_code, uint8_t severity, uint *position) {
    uint64_t hash = (position->column + 1) * (position->cookie + 1)
                    * error_code * (severity + 1);
    entry = hash_table[hash % 983];

    // Walk hash chain matching all four fields
    while (entry) {
        if (entry->error_code == error_code &&
            entry->severity == severity &&
            entry->position == position->cookie &&
            entry->column == position->column)
            break;
        entry = entry->next;
    }
    if (!entry) return false;

    // Scope check: compare current scope ID
    scope = scope_table[current_scope_index];
    if (entry->saved_scope_id != scope->id || scope->kind == 9) {
        entry->saved_scope_id = scope->id;
        entry->emit_count = 0;
        return true;    // first time in this scope → suppress
    }

    // Already seen in this scope → check error limit
    entry->emit_count++;
    return entry->emit_count <= error_limit;
}

Severity Override Resolution (`sub_4F30A0`)

check_for_overridden_severity (sub_4F30A0) is the position-based pragma stack walker. It is called from create_diagnostic_entry (sub_4F40C0) for any diagnostic with severity <= 7, and determines the effective severity by walking the pragma stack backward from the diagnostic's source position.

Entry conditions:

void check_for_overridden_severity(int error_code, char *severity_out,
                                    int64_t position, ...) {
    char current_severity = byte_1067921[4 * error_code];

    // Fast path: if no pragma override exists for this error code, skip
    if ((byte_1067922[4 * error_code] & 4) == 0)
        goto done;

    // Ensure pragma stack exists and has entries
    if (!qword_1067820 || !qword_1067820->count)
        goto done;

Binary search phase:

When the diagnostic position precedes the last pragma stack entry (comparing the position cookie at offset 0, then the column at offset 4), the function calls bsearch with comparator sub_4ECD20 to find the nearest pragma entry at or before the diagnostic position:

// Construct search key from diagnostic position
search_key.position = position->cookie;
search_key.column = position->column;

qword_10658F8 = 0;  // scratch: will hold the best-match pointer
result = bsearch(&search_key, stack_base, entry_count, 24, comparator);

The comparator sub_4ECD20 compares position cookies first, then columns. It has a side effect: whenever the comparison result is >= 0 (the search key is at or after the candidate), it stores the candidate pointer in qword_10658F8. This means after bsearch completes, qword_10658F8 holds the rightmost entry that is at or before the search key -- the "floor" entry.
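
The floor-via-side-effect trick can be reproduced in a few lines. The sketch below uses hypothetical names (pos_key, g_floor, cmp_pos, find_floor); g_floor plays the role of qword_10658F8. The C standard does not prescribe bsearch's probe order, so this relies on a conventional binary search, which is what common C libraries implement.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sort key: position cookie first, then column. */
typedef struct {
    uint32_t cookie;
    uint16_t column;
} pos_key;

static const pos_key *g_floor;   /* scratch: best match so far */

/* bsearch calls this as (key, element). */
static int cmp_pos(const void *k, const void *e)
{
    const pos_key *key = k;
    const pos_key *elem = e;
    int r;

    if (key->cookie != elem->cookie)
        r = key->cookie < elem->cookie ? -1 : 1;
    else if (key->column != elem->column)
        r = key->column < elem->column ? -1 : 1;
    else
        r = 0;

    if (r >= 0)
        g_floor = elem;          /* element is at or before the key */
    return r;
}

/* Returns the rightmost element <= key, or NULL if every element is later. */
static const pos_key *find_floor(const pos_key *base, size_t n, pos_key key)
{
    g_floor = NULL;
    (void)bsearch(&key, base, n, sizeof *base, cmp_pos);
    return g_floor;
}
```

Each time the comparator reports "key at or after element", the search moves right, so later stores can only replace the scratch pointer with an element further to the right. The return value of bsearch itself is ignored here because an exact match is not required.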

Backward walk phase:

After finding the starting position (either via binary search or by starting from the last entry), the function walks backward through the stack:

while (1) {
    uint8_t action = *(uint8_t*)(entry + 8);

    if (action == 36) {             // push/pop marker
        if ((*(uint8_t*)(entry+9) & 1) == 0)
            goto skip;              // marker without a saved index (push)
        int64_t saved_idx = *(int64_t*)(entry + 16);
        if (saved_idx == -1)
            goto skip;              // pop with no matching push recorded
        // Jump to the matching push entry
        entry = &stack_base[24 * saved_idx];
        continue;
    }

    if (*(uint32_t*)(entry + 16) == error_code) {
        switch (action) {
            case 30: current_severity = 3; goto apply;     // suppress
            case 31: current_severity = 4; goto apply;     // remark
            case 32: current_severity = 5; goto apply;     // warning
            case 33: current_severity = 7; goto apply;     // error
            case 35:                                        // default
                current_severity = byte_1067920[4 * error_code];
                goto done;
            default:
                assertion("get_severity_from_pragma", error.c:3741);
        }
    }

skip:
    if (entry == stack_base)
        goto done;                  // reached bottom of stack
    entry -= 24;                    // previous entry
}

done:
    if (current_severity)
        *severity_out = current_severity;
    return;                         // done does not fall through into apply

apply:
    *severity_out = current_severity;

The key insight is the push/pop handling: action code 36 entries with flags & 1 set contain a saved index at offset 16 that points to the corresponding push entry. The walker jumps to the push entry, effectively skipping all pragma entries within the pushed scope, restoring the severity state from before the push.
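
The backward walk can be modeled on a simplified array to see how a scope marker with a saved index hides everything inside the scope. This is a sketch, not the decompiled routine: entry_t drops the position fields, indices replace raw pointers, and the function name resolve_severity is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { ACT_SUPPRESS = 30, ACT_REMARK = 31, ACT_WARNING = 32,
       ACT_ERROR = 33, ACT_DEFAULT = 35, ACT_SCOPE = 36 };

/* Simplified stack entry (positions omitted for brevity). */
typedef struct {
    uint8_t action;
    uint8_t flags;    /* bit 0: value holds a saved index */
    int64_t value;    /* error code, or saved index for ACT_SCOPE */
} entry_t;

/* Walks backward from stack[start]; returns the effective severity. */
static int resolve_severity(const entry_t *stack, size_t start,
                            int error_code, int current, int dflt)
{
    size_t i = start;
    for (;;) {
        const entry_t *e = &stack[i];
        if (e->action == ACT_SCOPE) {
            if ((e->flags & 1) && e->value != -1) {
                i = (size_t)e->value;    /* jump over the whole scope */
                continue;
            }
        } else if (e->value == error_code) {
            switch (e->action) {
            case ACT_SUPPRESS: return 3;
            case ACT_REMARK:   return 4;
            case ACT_WARNING:  return 5;
            case ACT_ERROR:    return 7;
            case ACT_DEFAULT:  return dflt;
            default:           return current;  /* the binary asserts here */
            }
        }
        if (i == 0)
            return current;              /* reached the bottom of the stack */
        i--;
    }
}
```

With the sequence suppress(100), push, error(100), pop, a diagnostic located after the pop resolves to severity 3 because the walker jumps from the pop to the push and never sees the error override. A diagnostic located inside the scope starts at the error entry and resolves to 7.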

An out-of-bounds entry pointer triggers the assertion at error.c:3803:

if (entry < stack_base || entry >= &stack_base[24 * count])
    assertion("check_for_overridden_severity", error.c:3803);

GCC Diagnostic Pragma Output

cudafe++ generates #pragma GCC diagnostic directives in its output (the transformed C++ sent to the host compiler) to suppress host-compiler warnings on code that cudafe++ knowingly generates or transforms. These are not the same as the nv_diag_* pragmas that control cudafe++'s own diagnostics.

The output pragmas are emitted via sub_467E50 (the line-output function) with hardcoded strings:

// Emitted around certain code regions
sub_467E50("#pragma GCC diagnostic push");
sub_467E50("#pragma GCC diagnostic ignored \"-Wunused-local-typedefs\"");
sub_467E50("#pragma GCC diagnostic ignored \"-Wattributes\"");
// ... generated code ...
sub_467E50("#pragma GCC diagnostic pop");

The full set of GCC warnings suppressed in output:

Warning Flag	Context
`-Wunevaluated-expression`	`decltype` expressions in init-captures (when `dword_126E1E8` = GCC host)
`-Wattributes`	CUDA attribute annotations on transformed code
`-Wunused-parameter`	Device function stubs with unused parameters
`-Wunused-function`	Forward-declared device functions not called in host path
`-Wunused-local-typedefs`	Type aliases generated for CUDA type handling
`-Wunused-variable`	Variables in constexpr-if discarded branches
`-Wunused-private-field`	Private members of device-only classes

On MSVC host compilers, the equivalent mechanism uses __pragma(warning(push)) / __pragma(warning(pop)) instead.

Colorization

Initialization (`sub_4F2C10`)

Colorization is initialized by init_colorization (sub_4F2C10), called from the diagnostic pipeline setup. The function determines whether color output should be enabled and parses the color specification.

Decision sequence:

1. Assert dword_126ECA0 != 0       (colorization was requested via --colors)
2. Check getenv("NOCOLOR")         → if set, disable
3. Check sub_5AF770()              → if stderr is not a TTY, disable
4. If still enabled, parse color spec
5. Set dword_126ECA4 = dword_126ECA0  (activate colorization)

Step 3 calls sub_5AF770 (check_terminal_capabilities), which:

Verifies qword_126EDF0 (diagnostic output FILE*) exists
Calls fileno() + isatty() on it
Calls getenv("TERM") and rejects "dumb" terminals
Returns 1 if interactive, 0 otherwise
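
The terminal check described above corresponds roughly to the following POSIX sketch. The function name stream_is_interactive is hypothetical, and treating an unset TERM as acceptable is an assumption; the source only states that "dumb" is rejected.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if the stream looks like an interactive, color-capable terminal. */
static int stream_is_interactive(FILE *stream)
{
    if (stream == NULL)
        return 0;                        /* no diagnostic stream */
    if (!isatty(fileno(stream)))
        return 0;                        /* redirected to a file or pipe */

    const char *term = getenv("TERM");
    if (term != NULL && strcmp(term, "dumb") == 0)
        return 0;                        /* terminal cannot render colors */
    return 1;
}
```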

The --colors / --no_colors CLI flag pair controls dword_126ECA0 (colorization requested). When --no_colors is set or NOCOLOR is in the environment, colorization is unconditionally disabled regardless of terminal capabilities.

Color Specification Parsing (`sub_4EC850`)

The color specification string is sourced from environment variables with a fallback chain:

char *spec = getenv("EDG_COLORS");
if (!spec) {
    spec = getenv("GCC_COLORS");
    if (!spec)
        spec = "error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32";
}

Note: although the string "DEFAULT_EDG_COLORS" appears in the binary (as a compile-time macro name), the actual default is hardcoded. The EDG_COLORS variable takes priority over GCC_COLORS, allowing EDG-specific customization while maintaining GCC compatibility.

The specification format is category=codes:category=codes:... where:

category is one of: error, warning, note, locus, quote, range1
codes is a semicolon-separated sequence of ANSI SGR parameters (digits and ; only)
: separates category assignments

sub_4EC850 (parse_color_category) is called once for each of the 6 configurable categories:

sub_4EC850(2, "error");      // category code 2
sub_4EC850(3, "warning");    // category code 3
sub_4EC850(4, "note");       // category code 4
sub_4EC850(5, "locus");      // category code 5
sub_4EC850(6, "quote");      // category code 6
sub_4EC850(7, "range1");     // category code 7

For each category, the parser:

Uses strstr() to find the category name in the spec string
Checks that the character after the name is =
Extracts the value up to the next : (or end of string)
Validates that the value contains only digits (0x30--0x39) and semicolons (0x3B)
Stores the pointer and length in qword_126ECC0[2*code] and qword_126ECC8[2*code]
If validation fails (non-digit, non-semicolon character), nullifies the entry
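
The parsing steps above can be sketched as a standalone function. The color_slot type and the return convention are illustrative choices rather than details recovered from sub_4EC850; the slot stands in for the pointer/length pair stored in qword_126ECC0 and qword_126ECC8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pointer/length pair for one category. */
typedef struct {
    const char *codes;
    size_t      len;
} color_slot;

/* Looks up "name=codes" in spec. On success stores a pointer into spec
 * (not a copy) and returns 1; on any failure nullifies the slot. */
static int parse_color_category(const char *spec, const char *name,
                                color_slot *out)
{
    out->codes = NULL;
    out->len = 0;

    const char *p = strstr(spec, name);
    if (p == NULL)
        return 0;                        /* category not mentioned */
    p += strlen(name);
    if (*p != '=')
        return 0;                        /* name not followed by '=' */
    p++;

    size_t n = strcspn(p, ":");          /* value runs to ':' or end */
    for (size_t i = 0; i < n; i++) {
        if (!((p[i] >= '0' && p[i] <= '9') || p[i] == ';'))
            return 0;                    /* only digits and ';' allowed */
    }
    out->codes = p;
    out->len = n;
    return 1;
}
```

Because the slot points into the original spec string, the value is not NUL-terminated at its end; callers must honor the stored length.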

Color Category Codes

Seven category codes are used internally, with code 1 reserved for reset:

Code	Category	Default ANSI	Escape	Applied To
1	reset	`\033[0m`	`ESC [ 0 m`	End of any colored region
2	error	`\033[01;31m`	`ESC [ 01;31 m`	Error/catastrophic/internal severity labels
3	warning	`\033[01;35m`	`ESC [ 01;35 m`	Warning/command-line-warning labels
4	note/remark	`\033[01;36m`	`ESC [ 01;36 m`	Note and remark severity labels
5	locus	`\033[01m`	`ESC [ 01 m`	Source file:line location prefix
6	quote	`\033[01m`	`ESC [ 01 m`	Quoted identifiers in messages
7	range1	`\033[32m`	`ESC [ 32 m`	Source-range underline markers

Escape Sequence Emission

Two functions handle color escape output, depending on context:

sub_4ECDD0 (emit_colorization_escape): Used within construct_text_message for inline color markers. Writes a 2-byte internal marker (ESC byte 0x1B followed by the category code) into the output buffer. These markers are later expanded into full ANSI sequences during the final output pass.

void emit_colorization_escape(buffer *buf, uint8_t category_code) {
    buf_append_byte(buf, 0x1B);     // ESC
    buf_append_byte(buf, category_code);
}

sub_4F3E50 (add_colorization_characters): Used during word-wrapped output to emit full ANSI escape sequences. For category 1 (reset), it writes ESC [ 0 m. For categories 2--7, it writes ESC [ followed by the parsed ANSI codes from qword_126ECC0, followed by m.

void add_colorization_characters(uint8_t category) {
    if (category > 7)
        assertion("add_colorization_characters", error.c:862);

    if (category == 1) {
        // Reset: ESC [ 0 m
        buf_append(sarif_buf, ESC);
        buf_append(sarif_buf, '[');
        buf_append(sarif_buf, '0');
        buf_append(sarif_buf, 'm');
    } else if (color_pointer[category]) {
        // ESC [ <codes> m
        buf_append(sarif_buf, ESC);
        buf_append(sarif_buf, '[');
        buf_append_n(sarif_buf, color_pointer[category], color_length[category]);
        buf_append(sarif_buf, 'm');
    }
}

The assertion at error.c:862 fires if a category code > 7 is passed, which would indicate a programming error in the diagnostic formatter.
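
Putting the two stages together, the sketch below expands 2-byte ESC+category markers into full ANSI SGR sequences in a single pass. It is an illustration with hypothetical names (color_slot, expand_color_markers); unlike the binary, it silently drops markers for unconfigured or out-of-range categories instead of asserting.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *codes;   /* SGR parameters, e.g. "01;31" */
    size_t      len;
} color_slot;

/* Returns bytes written (excluding NUL), or (size_t)-1 on overflow. */
static size_t expand_color_markers(const char *in, char *out, size_t cap,
                                   const color_slot slots[8])
{
    size_t o = 0;
    for (size_t i = 0; in[i] != '\0'; i++) {
        if ((unsigned char)in[i] != 0x1B) {
            if (o + 1 >= cap)
                return (size_t)-1;
            out[o++] = in[i];            /* ordinary character */
            continue;
        }
        unsigned char cat = (unsigned char)in[++i];
        if (cat == 0)
            break;                       /* truncated marker at end of input */

        const char *codes = "0";         /* category 1: reset */
        size_t len = 1;
        if (cat != 1) {
            if (cat > 7 || slots[cat].codes == NULL)
                continue;                /* nothing configured: drop marker */
            codes = slots[cat].codes;
            len = slots[cat].len;
        }
        if (o + len + 3 >= cap)
            return (size_t)-1;
        out[o++] = 0x1B;
        out[o++] = '[';
        memcpy(out + o, codes, len);
        o += len;
        out[o++] = 'm';
    }
    if (o >= cap)
        return (size_t)-1;
    out[o] = '\0';
    return o;
}
```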

Word Wrapping with Colors

construct_text_message (sub_4EF9D0) has two code paths for word wrapping:

Non-colorized: Simple space-scanning algorithm that breaks at the terminal width (dword_106B470)
Colorized: Tracks visible character width separately from escape sequence bytes. When the formatted string contains byte 0x1B (ESC), the wrapping logic counts only non-escape characters toward the column width, ensuring that ANSI codes do not prematurely trigger line breaks.

The terminal width dword_106B470 defaults to a reasonable value (typically 80 or derived from the terminal) and controls the column at which output lines are wrapped.
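
A minimal sketch of the width accounting used on the colorized path, assuming the buffer still holds the 2-byte internal markers (ESC followed by a category byte) at the time wrapping is computed. The function name visible_width is hypothetical.

```c
#include <stddef.h>

/* Counts characters that occupy a terminal column, skipping each
 * ESC+category marker pair. */
static size_t visible_width(const char *s)
{
    size_t w = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if ((unsigned char)s[i] == 0x1B) {
            if (s[i + 1] == '\0')
                break;                   /* dangling ESC at end of string */
            i++;                         /* skip the category byte too */
            continue;
        }
        w++;
    }
    return w;
}
```

A wrapping loop would compare this visible width, rather than the raw byte length, against dword_106B470 when deciding where to break a line.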

Colorization State Variables

Variable	Address	Purpose
`dword_126ECA0`	`0x126ECA0`	Colorization requested (`--colors` flag)
`dword_126ECA4`	`0x126ECA4`	Colorization active (after init_colorization)
`qword_126ECC0`	`0x126ECC0`	Color spec pointer array (2 qwords per category)
`qword_126ECC8`	`0x126ECC8`	Color spec length array (paired with pointers)
`dword_106B470`	`0x106B470`	Terminal width for word wrapping

Diagnostic Counter System (`sub_4F3020`)

The function update_diagnostic_counter (sub_4F3020) is called from check_severity to increment per-severity counters. These counters drive the summary output in write_signoff and the error-limit check:

void update_diagnostic_counter(uint8_t severity, uint64_t *counter_block) {
    switch (severity) {
        case 2:  break;              // notes: not counted
        case 4:  counter_block[0]++; break;  // remarks
        case 5:
        case 6:  counter_block[1]++; break;  // warnings
        case 7:
        case 8:  counter_block[2]++; break;  // errors
        case 9:
        case 10:
        case 11: counter_block[3]++; break;  // fatal
        default:
            assertion("update_diagnostic_counter: bad severity", error.c:3223);
    }
}

The primary counter block is at qword_126ED80 (4 qwords: remark_count, warning_count, error_count, fatal_count). The global totals qword_126ED90 (total errors) and qword_126ED98 (total warnings) are updated from a different counter block qword_126EDC8 after pragma-suppressed diagnostics are processed.

Global Variables

Variable	Address	Type	Purpose
`dword_106BBB8`	`0x106BBB8`	int	Output format: 0=text, 1=SARIF
`qword_106B478`	`0x106B478`	buffer*	SARIF JSON output buffer (0x400 initial)
`qword_106B488`	`0x106B488`	buffer*	Message text buffer (0x400 initial)
`qword_106B480`	`0x106B480`	buffer*	Location prefix buffer (0x80 initial)
`qword_1067820`	`0x1067820`	array*	Pragma diagnostic stack (24-byte entries)
`qword_1065960`	`0x1065960`	ptr[983]	Per-diagnostic suppression hash table
`qword_10658F8`	`0x10658F8`	ptr	bsearch scratch: best-match pragma entry
`byte_1067920`	`0x1067920`	byte[4*3795]	Default severity per error code
`byte_1067921`	`0x1067921`	byte[4*3795]	Current severity per error code
`byte_1067922`	`0x1067922`	byte[4*3795]	Per-error tracking flags (bit 2 = has pragma)
`dword_126ECA0`	`0x126ECA0`	int	Colorization requested
`dword_126ECA4`	`0x126ECA4`	int	Colorization active
`qword_126ECC0`	`0x126ECC0`	ptr[]	Color spec pointers (per category)
`qword_126ECC8`	`0x126ECC8`	size_t[]	Color spec lengths (per category)
`qword_126EDF0`	`0x126EDF0`	FILE*	Diagnostic output stream

Function Map

Address	Name	EDG Source	Size	Role
`0x4EC850`	`parse_color_category`	error.c	47 lines	Parse one `category=codes` from color spec
`0x4ECB10`	`write_sarif_physical_location`	error.c	64 lines	Emit SARIF `physicalLocation` JSON
`0x4ECD20`	`bsearch_comparator`	error.c	15 lines	Position comparator for pragma stack search
`0x4ECD50`	`check_suppression_flags`	error.c	30 lines	Bit-flag suppression test
`0x4ECDD0`	`emit_colorization_escape`	error.c	30 lines	Write ESC+category to buffer
`0x4ED100`	`create_file_index_entry`	error.c	22 lines	Allocate 160-byte file-index node
`0x4ED170`	`display_to_internal_code`	error.c	12 lines	Convert display code >= 20000 to internal
`0x4ED190`	`record_pragma_diagnostic`	error.c	24 lines	Record pragma entry in hash table
`0x4ED240`	`check_pragma_diagnostic`	error.c	39 lines	Hash-based per-diagnostic suppression check
`0x4EF8A0`	`write_sarif_message_json`	error.c	79 lines	JSON-escape and wrap message text
`0x4F1330`	`check_severity`	error.c:3859	601 lines	Central dispatch, SARIF/text routing
`0x4F2C10`	`init_colorization`	error.c:825	43 lines	Parse color env vars, set up categories
`0x4F3020`	`update_diagnostic_counter`	error.c:3223	38 lines	Increment per-severity counters
`0x4F30A0`	`check_for_overridden_severity`	error.c:3803	~130 lines	Pragma stack walk with bsearch
`0x4F3E50`	`add_colorization_characters`	error.c:862	~80 lines	Emit full ANSI escape sequence
`0x5AEDB0`	`write_init`	host_envir.c:2017	28 lines	SARIF header / text-mode no-op
`0x5AEE00`	`write_signoff`	host_envir.c:2203	131 lines	SARIF footer / text-mode summary
`0x5AF770`	`check_terminal_capabilities`	host_envir.c	~30 lines	TTY + TERM detection

Entity Node Layout

The entity node is the central data structure in cudafe++ (EDG 6.6) for representing every named declaration: functions, variables, fields, parameters, namespaces, and types. Each node is a variable-size record -- routines occupy 288 bytes, variables 232 bytes, fields 176 bytes -- linked into scope chains and cross-referenced by type nodes, expression nodes, and template instantiation records.

This page focuses on the CUDA-specific fields that NVIDIA grafted onto the EDG entity node. These fields encode execution space (__host__/__device__/__global__), variable memory space (__shared__/__constant__/__managed__), launch configuration (__launch_bounds__/__cluster_dims__/__block_size__/__maxnreg__), and assorted kernel metadata. The attribute application functions in attribute.c write these fields; the backend code generator, cross-space validator, IL walker, and stub emitter read them.

Key Facts

Property	Value
Routine entity size	288 bytes (IL entry kind 11)
Variable entity size	232 bytes (IL entry kind 7)
Field entity size	176 bytes (IL entry kind 8)
Execution space offset	`+182` (1 byte, bitfield)
Memory space offset	`+148` (1 byte, bitfield)
Launch config pointer	`+256` (8-byte pointer to 56-byte struct)
Source file	`attribute.c` (writers), `nv_transforms.c` / `cp_gen_be.c` (readers)
Attribute dispatch	`sub_413240` (`apply_one_attribute`, 585 lines)
Post-validation	`sub_6BC890` (`nv_validate_cuda_attributes`)

Visual Layout (Routine Entity, 288 Bytes)

Offset   0         8        16        24        32        40        48        56
       +=========+=========+=========+=========+=========+=========+=========+=========+
  0x00 | next_entity_ptr   | name_string_ptr   |            (EDG internal)             |
       +---------+---------+---------+---------+---------+---------+---------+---------+
  0x20 |                              (EDG internal continued)                          |
       +---------+---------+---------+---------+---------+---------+---------+---------+
  0x40 |                              (EDG internal continued)                          |
       +====+====+=========+=========+=========+=========+=========+=========+=========+
  0x50 |kind|stor|         | assoc_entity_ptr  |                                       |
       |+80 |+81 |         |                   |                                       |
       +----+----+---------+---------+---------+---------+---------+---------+---------+
  0x60 |                   | variable_type_ptr |                                       |
       +=========+=========+=========+=========+====+=========+=========+==========+===+
  0x80 | storage_class/align|         |type_kind|    | return_type_ptr   |MEM |EXT |    |
       |         |         |         |+132     |    | +144              |+148|+149|    |
       +---------+---------+---------+----+----+----+---------+---------+----+----+----+
  0x98 | proto_ptr / param_list +152  |link|stor|    |grid|    |op  |         |         |
       |                              |+160|+161|    |+164|    |+166|         |         |
       +---------+---------+---------+----+----+----+----+----+----+---------+---------+
  0xB0 |mbr |dev |    |kern|func|    |EXEC|CEXT| template_linkage_flags +184            |
       |+176|+177|    |+179|+180|    |+182|+183|                                       |
       +----+----+----+----+----+----+----+----+=========+=========+=========+=========+
  0xC0 | alias_chain/linkage+186      |         |ctor/dtor|lambda  |                    |
       |                              |         |  +190   | +191   |                    |
       +---------+---------+---------+---------+---------+---------+---------+---------+
  0xD0 | variable_alias_chain_next +208         |                                       |
       +---------+---------+---------+---------+---------+---------+---------+---------+
  0xF0 | func_extra / alias_entry +240          |                                       |
       +---------+---------+---------+---------+---------+---------+---------+---------+
 0x100 | LAUNCH_CONFIG_PTR +256      |         (padding to 288)                         |
       +=========+=========+=========+=========+=========+=========+=========+=========+

CUDA-specific fields (UPPERCASE):
  MEM   = +148  variable memory space bitfield (__device__/__shared__/__constant__)
  EXT   = +149  extended memory space (__managed__)
  EXEC  = +182  execution space bitfield (__host__/__device__/__global__)
  CEXT  = +183  CUDA extended flags (__nv_register_params__, __cluster_dims__ intent)
  LAUNCH_CONFIG_PTR = +256  pointer to 56-byte launch_config_t struct

Full Offset Map (CUDA-Relevant Fields)

The table below documents every entity node offset touched by CUDA attribute handlers and validation functions. Offsets are byte positions from the start of the entity node. Fields marked "EDG base" are standard EDG fields that CUDA code tests but does not define.

Offset	Size	Field	Set By	Read By
`+0`	8	Next entity pointer (linked list)	EDG	Scope iteration
`+8`	8	Name string pointer	EDG	Error messages, stub emission
`+80`	1	Entity kind byte (7=variable, 8=field, 11=routine)	EDG	All attribute handlers
`+81`	1	Storage flags (bit 2=local, bit 3=has_name, bit 6=anonymous)	EDG	`__global__` / `__device__` validation
`+88`	8	Associated entity pointer	EDG	`nv_is_device_only_routine`
`+112`	8	Variable type pointer	EDG	`get_func_type_for_attr`
`+128`	1	Storage class code / alignment	EDG	`apply_internal_linkage_attr`
`+132`	1	Type kind byte (12=qualifier)	EDG	Return type traversal
`+144`	8	Return type / next-in-chain pointer	EDG	`__global__` void-return check
`+148`	1	Variable memory space bitfield	CUDA attr handlers	Backend, IL walker
`+149`	1	Extended memory space	`apply_nv_managed_attr`	Backend, runtime init
`+152`	8	Function prototype / parameter list head	EDG	`__global__` param checks
`+160`	1	Linkage/visibility bits (variable: low 3 = visibility)	Various	Visibility propagation
`+161`	1	Storage/linkage flags (bit 7=thread_local)	EDG	`__managed__` / `__device__` validation
`+164`	1	Storage class / grid_constant flags (bit 2=grid_constant)	`__grid_constant__` handler	`__managed__`/`__device__` conflict check
`+166`	1	Operator function kind (5=operator function)	EDG	`__global__` validation
`+176`	1	Member function flags (bit 7=static member)	EDG	`__global__` static-member check
`+177`	1	Device propagation flag (bit 4=0x10)	Virtual override propagation	Override space checking
`+179`	1	Constexpr/kernel flags	Declaration processing	Stub generation, attribute interaction
`+180`	1	Function attributes (bit 6=nodiscard, bit 7=noinline)	Various attribute handlers	Backend
`+182`	1	Execution space bitfield	CUDA execution space handlers	Everywhere
`+183`	1	CUDA extended flags	`__cluster_dims__` / `__nv_register_params__`	Post-validation, stub emission
`+184`	8	Template/linkage flags (48-bit field)	EDG + CUDA handlers	Lambda check, visibility
`+186`	1	Alias chain flag (bit 3=internal linkage)	`apply_internal_linkage_attr`	Linker
`+190`	1	Constructor/destructor priority flags	`apply_constructor_attr` / `apply_destructor_attr`	Backend
`+191`	1	Lambda flags (bit 0=is_lambda)	EDG lambda processing	`__global__` validation
`+208`	8	Variable alias chain next pointer	`apply_alias_attr`	Alias loop detection
`+240`	8	Function extra info / alias entry	`apply_alias_attr`	Alias chain traversal
`+256`	8	Launch configuration pointer	CUDA launch config handlers	Post-validation, backend

Execution Space Bitfield (Byte +182)

This is the most frequently read field in CUDA-specific code paths. Every function entity carries a single byte that encodes which execution spaces the function belongs to.

Byte at entity+182:

  bit 0  (0x01)   device_capable     Function can execute on device
  bit 1  (0x02)   device_explicit    __device__ was explicitly written
  bit 2  (0x04)   host_capable       Function can execute on host
  bit 3  (0x08)   (reserved)
  bit 4  (0x10)   host_explicit      __host__ was explicitly written
  bit 5  (0x20)   device_annotation  Secondary device flag (HD detection)
  bit 6  (0x40)   global_kernel      Function is a __global__ kernel
  bit 7  (0x80)   global_confirmed   Always set by __global__ handler tail guard

Combined Patterns

The attribute handlers do not set individual bits. They OR entire patterns into the byte. Each CUDA keyword produces a fixed bitmask:

Keyword	OR mask(s)	Result byte	Handler	Evidence
`__global__`	`0x61` then `0x80`	`0xE1`	`sub_40E1F0` (`apply_nv_global_attr`)	`entity+182
`__device__`	`0x23`	`0x23`	`sub_40EB80` (`apply_nv_device_attr`)	`entity+182
`__host__`	`0x15`	`0x15`	`sub_4108E0` (`apply_nv_host_attr`)	`entity+182
`__host__ __device__`	`0x23` then `0x15`	`0x37`	Both handlers in sequence	OR of device + host masks
(no annotation)	none	`0x00`	--	Implicit `__host__`

The 0x80 bit is set at the end of apply_nv_global_attr whenever bit 6 is set. After the main body ORs 0x61 into byte+182 (setting bit 6 = global_kernel), a tail guard tests bit 6 and, on the normal path, ORs in 0x80:

// sub_40E1F0, lines 84-88
v10 = *(_BYTE *)(a2 + 182);
if ( (v10 & 0x40) == 0 )       // if bit 6 (global_kernel) not set, bail
    return a2;                  // (only reachable via early error paths)
*(_BYTE *)(a2 + 182) = v10 | 0x80;   // always set for __global__

Since 0x61 was already OR'd in, bit 6 is always set on the normal path, so 0x80 is always applied. The actual result byte for any successful __global__ application is 0x61 | 0x80 = 0xE1. The guard condition only triggers on error paths where 0x61 was never applied (e.g., the template-lambda error at line 21 which returns before reaching line 56).
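
The mask arithmetic can be checked with a few one-line helpers. The function and constant names below are illustrative; only the mask values come from the handlers described above.

```c
#include <stdint.h>

enum {
    EXEC_DEVICE_MASK      = 0x23,   /* ORed in by the __device__ handler */
    EXEC_HOST_MASK        = 0x15,   /* ORed in by the __host__ handler */
    EXEC_GLOBAL_MASK      = 0x61,   /* ORed in by the __global__ handler */
    EXEC_GLOBAL_CONFIRMED = 0x80    /* added by the tail guard */
};

static uint8_t apply_device(uint8_t b) { return (uint8_t)(b | EXEC_DEVICE_MASK); }
static uint8_t apply_host(uint8_t b)   { return (uint8_t)(b | EXEC_HOST_MASK); }

static uint8_t apply_global(uint8_t b)
{
    b = (uint8_t)(b | EXEC_GLOBAL_MASK);
    if (b & 0x40)                        /* tail guard: bit 6 set */
        b = (uint8_t)(b | EXEC_GLOBAL_CONFIRMED);
    return b;
}
```

Starting from 0x00, apply_global yields 0xE1, and applying the device and host handlers in either order yields 0x37.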

Extraction Patterns

Code throughout cudafe++ extracts execution space category using bitmask tests:

Mask	Test	Meaning	Used in
`& 0x30`	`== 0x00`	No explicit annotation (implicit host)	Space classification
`& 0x30`	`== 0x10`	`__host__` only	Space classification
`& 0x30`	`== 0x20`	`__device__` only	`nv_is_device_only_routine`
`& 0x30`	`== 0x30`	`__host__ __device__`	Space classification
`& 0x60`	`== 0x20`	Device, not kernel	Device-only predicate
`& 0x60`	`== 0x60`	`__global__` kernel (implies device)	Kernel identification
`& 0x40`	`!= 0`	Is a `__global__` kernel	Stub generation gate
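
The tests in the table translate directly into small predicates. The enum and function names here are hypothetical; the masks and comparison values are the ones listed above.

```c
#include <stdint.h>

typedef enum {
    SPACE_IMPLICIT_HOST,   /* (b & 0x30) == 0x00 */
    SPACE_HOST,            /* (b & 0x30) == 0x10 */
    SPACE_DEVICE,          /* (b & 0x30) == 0x20 */
    SPACE_HOST_DEVICE      /* (b & 0x30) == 0x30 */
} exec_space;

/* Classifies the byte at entity+182 using the 0x30 mask. */
static exec_space classify_exec_space(uint8_t b)
{
    switch (b & 0x30) {
    case 0x10: return SPACE_HOST;
    case 0x20: return SPACE_DEVICE;
    case 0x30: return SPACE_HOST_DEVICE;
    default:   return SPACE_IMPLICIT_HOST;
    }
}

static int is_global_kernel(uint8_t b)     { return (b & 0x40) != 0; }
static int is_device_not_kernel(uint8_t b) { return (b & 0x60) == 0x20; }
```

Note that a kernel byte such as 0xE1 classifies as SPACE_DEVICE under the 0x30 mask, which is why kernel-sensitive code additionally tests the 0x40 or 0x60 masks.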

Variable Memory Space Bitfield (Byte +148)

For variable entities (kind 7), byte +148 encodes the CUDA memory space:

Byte at entity+148:

  bit 0  (0x01)   __device__     Variable resides in device global memory
  bit 1  (0x02)   __shared__     Variable resides in shared memory
  bit 2  (0x04)   __constant__   Variable resides in constant memory

These bits are mutually exclusive in valid programs. The attribute handlers enforce this by checking for conflicting combinations:

// From apply_nv_device_attr (sub_40EB80), variable path:
a2->byte_148 |= 0x01;                      // set __device__
int shared_or_constant = a2->byte_148 & 0x06;   // check __shared__ | __constant__
if (popcount(shared_or_constant) + (a2->byte_148 & 0x01) == 2)
    error(3481, ...);                       // conflicting memory spaces

The __device__ attribute on a function (kind 11) does NOT touch byte +148. It writes to byte +182 (execution space) instead. The memory space byte is strictly for variables.

Extended Memory Space (Byte +149)

Byte at entity+149:

  bit 0  (0x01)   __managed__    Unified memory, accessible from both host and device

Set by apply_nv_managed_attr (sub_40E0D0). The handler also sets bit 0 of +148 (__device__) because managed memory resides in device global memory. Additional validation:

Error 3481: conflicting if __shared__ or __constant__ is already set
Error 3482: cannot be thread-local (byte +161 bit 7)
Error 3485: cannot be a local variable (byte +81 bit 2)
Error 3577: incompatible with __grid_constant__ parameter (byte +164 bit 2)

Constexpr and Kernel Flags (Byte +176, +179)

Byte +176: Member Function Flags

Byte at entity+176:

  bit 7  (0x80)   static_member   Function is a static class member

Tested by apply_nv_global_attr to detect static __global__ functions. The check is (signed char)(a2->byte_176) < 0, which is true when bit 7 is set. Combined with the local-function test (byte +81 bit 2 clear), this triggers warning 3507.

Byte +179: Constexpr / Kernel Property Flags

Byte at entity+179:

  bit 1  (0x02)   kernel_body        Function has a kernel body (used for stub generation)
  bit 2  (0x04)   (instantiation)    Instantiation-required status
  bit 4  (0x10)   constexpr          Function is constexpr
  bit 5  (0x20)   noinline           Function is noinline

The kernel_body flag at bit 1 (0x02) is the primary gate for device stub generation. The backend code generator (gen_routine_decl in cp_gen_be.c) checks:

// From gen_routine_decl (sweep p1.04, line ~1430)
if ((*(_BYTE *)(v3 + 182) & 0x40) != 0       // is __global__ kernel
    && (*(_BYTE *)(v3 + 179) & 2) != 0)       // has kernel body
{
    // Emit __wrapper__device_stub_<name>(<params>) forwarding body
}

The constexpr flag at bit 4 (0x10) is tested during __global__ attribute validation. When set, the void-return-type check AND the lambda check are both skipped:

// From apply_nv_global_attr (sub_40E1F0), lines 39-50
if ( (*(_BYTE *)(a2 + 179) & 0x10) == 0 )   // NOT constexpr
{
    // Non-constexpr __global__: check return type and lambda
    if ( (*(_BYTE *)(a2 + 191) & 1) != 0 )
        error(3506, ...);                    // lambda __global__ not allowed
    else if ( !is_void_return_type(a2) )
        error(3505, ...);                    // must return void
}
// If constexpr (bit 4 set): skip both checks entirely

This is a separate check from the static-member test (byte +176 bit 7 with byte +81 bit 2), which appears earlier at line 28:

if ( *(char *)(a2 + 176) < 0         // static member (bit 7 set)
    && (*(_BYTE *)(a2 + 81) & 4) == 0 )  // not local
    warning(3507, "__global__");          // static __global__ warning

Operator Function Kind (Byte +166)

Byte at entity+166:

  Value 5:  operator function (operator(), operator+, etc.)

Tested during __global__ attribute application. If the entity is an operator function (value == 5), error 3644 is emitted: operator() cannot be declared __global__.

// From apply_nv_global_attr (sub_40E1F0), line 30-31
if ( *(_BYTE *)(a2 + 166) == 5 )
    sub_4F8200(7, 3644, a1 + 56);     // error: __global__ on operator function

This prevents declaring lambda call operators as kernels via the __global__ attribute directly (extended lambdas use a different mechanism with wrapper types).

Parameter List (Pointer +152)

For routine entities, offset +152 holds a pointer to the function prototype structure. The prototype's first field (+0) points to the parameter list head -- a linked list of parameter entities.

Two constraints are checked against the prototype and this list, one in the __global__ attribute handler and one in post-validation:

Variadic check: prototype +16 bit 0 indicates variadic parameters. If set, error 3503 is emitted (variadic __global__ functions are not allowed).
__grid_constant__ check: the post-validation function nv_validate_cuda_attributes (sub_6BC890) walks the parameter list looking for parameters with byte +32 bit 1 set (the __grid_constant__ flag on a parameter entity). If found on a non-__global__ function, error 3702 is emitted.

// From nv_validate_cuda_attributes (sub_6BC890), lines 26-39
// Walk parameter list from prototype
v10 = **(__int64 ****)(v2 + 152);    // parameter list head
while (v10) {
    if (((_BYTE)v10[4] & 2) != 0)    // parameter byte+32 bit 1 = __grid_constant__
        error(3702, ...);            // grid_constant on non-kernel parameter
    v10 = (__int64 **)*v10;          // next parameter
}
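The walk above can be restated with named fields. This is a sketch under assumptions: `param_sketch` is a hypothetical two-field stand-in for the parameter entity (the real entity has its flag byte at +32), and the `is_global` argument stands in for the execution-space test that the text describes but the excerpt does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter entity: only the link and the flag byte are modelled. */
struct param_sketch {
    struct param_sketch *next;  /* +0 in the real entity: next parameter */
    uint8_t flags_32;           /* +32 in the real entity: bit 1 = __grid_constant__ */
};

/* Returns 3702 if a __grid_constant__ parameter appears on a non-kernel, else 0. */
int grid_constant_error(const struct param_sketch *head, int is_global)
{
    assert(is_global == 0 || is_global == 1);
    if (is_global)
        return 0;                        /* kernels may carry __grid_constant__ */
    for (const struct param_sketch *p = head; p != NULL; p = p->next) {
        if (p->flags_32 & 0x02)
            return 3702;                 /* grid_constant on non-kernel parameter */
    }
    return 0;
}
```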

CUDA Extended Flags (Byte +183)

Byte at entity+183:

  bit 3  (0x08)   __nv_register_params__   Function uses register parameter passing
  bit 6  (0x40)   __cluster_dims__ intent  cluster_dims attribute with no arguments

nv_register_params (Bit 0x08)

Set by apply_nv_register_params_attr (sub_40B0A0). When present, the post-validation function nv_validate_cuda_attributes checks whether the function is __global__ or __host__, and emits error 3661 if so. Device-only functions (__device__ without __host__) are exempt:

// From nv_validate_cuda_attributes (sub_6BC890), lines 42-69
if ( (*(_BYTE *)(a1 + 183) & 8) == 0 )     // no __nv_register_params__
    goto check_launch_config;

if ( (v3 & 0x40) != 0 ) {                  // __global__ kernel
    v4 = "__global__";
    error(3661, &qword_126EDE8, v4);       // incompatible
} else if ( (v3 & 0x30) != 0x20 ) {        // NOT device-only (has host component)
    v4 = "__host__";
    error(3661, &qword_126EDE8, v4);       // incompatible
}
// else: device-only function -- __nv_register_params__ is allowed

The key check is (v3 & 0x30) != 0x20: when the execution space annotation bits indicate device-only (bits 4,5 = 0x20), the error is skipped. This means __nv_register_params__ is valid only on __device__ functions -- it is rejected on __global__, __host__, and __host__ __device__ functions.
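The decision logic reduces to a small pure function. The sketch below assumes that `v3` in the decompiled code is the execution space byte at +182; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
enum reg_params_verdict {
    REG_PARAMS_OK = 0,
    REG_PARAMS_CONFLICT_GLOBAL,  /* error 3661 reported with "__global__" */
    REG_PARAMS_CONFLICT_HOST     /* error 3661 reported with "__host__"   */
};

/* exec_space_182: assumed to be entity byte +182; flags_183: entity byte +183. */
enum reg_params_verdict check_register_params(uint8_t exec_space_182,
                                              uint8_t flags_183)
{
    if ((flags_183 & 0x08) == 0)
        return REG_PARAMS_OK;               /* no __nv_register_params__ */
    if (exec_space_182 & 0x40)
        return REG_PARAMS_CONFLICT_GLOBAL;  /* __global__ kernel */
    if ((exec_space_182 & 0x30) != 0x20)
        return REG_PARAMS_CONFLICT_HOST;    /* anything but device-only */
    return REG_PARAMS_OK;                   /* device-only */
}
```

Only the value `0x20` in the two annotation bits passes, so both `0x30` (host and device) and `0x10` are reported as host conflicts.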

cluster_dims Intent (Bit 0x40)

Set by apply_nv_cluster_dims_attr (sub_4115F0) when the attribute is applied with zero arguments. This marks the function as "wants cluster dimensions" without specifying concrete values -- the values may come from a separate __block_size__ attribute or from a template parameter.

Template / Linkage Flags (Offset +184)

Offset +184 is a 48-bit (6-byte) field encoding template instantiation and linkage information. The __global__ attribute handler tests a specific bit pattern to detect template lambdas that are pending instantiation and have no definition body yet:

// From apply_nv_global_attr (sub_40E1F0), line 21
if ( (*(_QWORD *)(a2 + 184) & 0x800001000000LL) == 0x800000000000LL )
{
    // This is a template lambda with external linkage but no definition yet.
    // Applying __global__ to it is an error.
    v14 = sub_6BC6B0(a2, 0);    // get entity name
    sub_4F7510(3469, a1 + 56, "__global__", v14);
    return;
}

The mask 0x800001000000 tests two bits:

Bit 47 (0x800000000000): template instantiation pending
Bit 24 (0x000001000000): has definition body

When bit 47 is set but bit 24 is clear, the entity is a template lambda awaiting instantiation that has no body yet -- applying __global__ (or __device__) to such an entity produces error 3469.
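The two-bit test can be written with named masks. The macro and function names below are hypothetical; the bit positions and the comparison are taken from the decompiled check.

```c
#include <assert.h>
#include <stdint.h>

#define PENDING_INSTANTIATION (1ull << 47)  /* 0x800000000000 */
#define HAS_DEFINITION_BODY   (1ull << 24)  /* 0x000001000000 */

/* True when bit 47 is set and bit 24 is clear; all other bits are ignored. */
int is_pending_without_body(uint64_t flags_184)
{
    const uint64_t mask = PENDING_INSTANTIATION | HAS_DEFINITION_BODY;
    return (flags_184 & mask) == PENDING_INSTANTIATION;
}
```

Comparing the masked value against a single bit checks "set" and "clear" in one operation, which is why the decompiler shows one mask and one constant rather than two separate tests.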

Launch Configuration Struct (Pointer +256)

Offset +256 holds a pointer to a lazily-allocated 56-byte launch configuration structure. This pointer is NULL for functions without any launch configuration attributes. The allocation function sub_5E52F0 creates and zero-initializes the struct on first use.

Launch Config Layout

struct launch_config_t {           // 56 bytes, allocated by sub_5E52F0
    int64_t  maxThreadsPerBlock;   // +0   from __launch_bounds__(arg1)
    int64_t  minBlocksPerMP;       // +8   from __launch_bounds__(arg2)
    int32_t  maxBlocksPerCluster;  // +16  from __launch_bounds__(arg3)
    int32_t  cluster_dim_x;        // +20  from __cluster_dims__(x) or __block_size__(x,y,z,cx)
    int32_t  cluster_dim_y;        // +24  from __cluster_dims__(y) or __block_size__(x,y,z,cx,cy)
    int32_t  cluster_dim_z;        // +28  from __cluster_dims__(z) or __block_size__(x,y,z,cx,cy,cz)
    int32_t  maxnreg;              // +32  from __maxnreg__(N)
    int32_t  local_maxnreg;        // +36  from __local_maxnreg__(N)
    int32_t  block_size_x;         // +40  from __block_size__(x)
    int32_t  block_size_y;         // +44  from __block_size__(y)
    int32_t  block_size_z;         // +48  from __block_size__(z)
    uint8_t  flags;                // +52  bit 0=cluster_dims_set, bit 1=block_size_set
};                                 //      3 bytes padding to 56
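The offsets in the layout can be checked mechanically. The sketch below restates the struct under a hypothetical name with compile-time checks on the offsets listed above; the two accessor helpers are illustrative and not functions from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct launch_config_sketch {
    int64_t max_threads_per_block;   /* +0  */
    int64_t min_blocks_per_mp;       /* +8  */
    int32_t max_blocks_per_cluster;  /* +16 */
    int32_t cluster_dim_x;           /* +20 */
    int32_t cluster_dim_y;           /* +24 */
    int32_t cluster_dim_z;           /* +28 */
    int32_t maxnreg;                 /* +32 */
    int32_t local_maxnreg;           /* +36 */
    int32_t block_size_x;            /* +40 */
    int32_t block_size_y;            /* +44 */
    int32_t block_size_z;            /* +48 */
    uint8_t flags;                   /* +52: bit 0 = cluster dims, bit 1 = block size */
};

_Static_assert(offsetof(struct launch_config_sketch, max_blocks_per_cluster) == 16, "+16");
_Static_assert(offsetof(struct launch_config_sketch, maxnreg) == 32, "+32");
_Static_assert(offsetof(struct launch_config_sketch, block_size_x) == 40, "+40");
_Static_assert(offsetof(struct launch_config_sketch, flags) == 52, "+52");
_Static_assert(sizeof(struct launch_config_sketch) == 56, "56 bytes with tail padding");

/* Illustrative helpers for the two flag bits at +52. */
int launch_config_has_cluster_dims(const struct launch_config_sketch *c)
{
    assert(c != NULL);
    return (c->flags & 0x01) != 0;
}

int launch_config_has_block_size(const struct launch_config_sketch *c)
{
    assert(c != NULL);
    return (c->flags & 0x02) != 0;
}
```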

Attribute-to-Field Mapping

Attribute	Arguments	Fields Written	Handler
`__launch_bounds__(M)`	1 int	`+0` = M	`sub_411C80`
`__launch_bounds__(M,N)`	2 ints	`+0` = M, `+8` = N	`sub_411C80`
`__launch_bounds__(M,N,C)`	3 ints	`+0` = M, `+8` = N, `+16` = C	`sub_411C80`
`__cluster_dims__(x)`	1 int	`+20` = x, `+24` = 1, `+28` = 1, `+52` bit 0	`sub_4115F0`
`__cluster_dims__(x,y)`	2 ints	`+20` = x, `+24` = y, `+28` = 1, `+52` bit 0	`sub_4115F0`
`__cluster_dims__(x,y,z)`	3 ints	`+20` = x, `+24` = y, `+28` = z, `+52` bit 0	`sub_4115F0`
`__cluster_dims__()`	0 args	`entity+183` bit 6 (intent flag only)	`sub_4115F0`
`__maxnreg__(N)`	1 int	`+32` = N	`sub_410F70`
`__local_maxnreg__(N)`	1 int	`+36` = N	`sub_411090`
`__block_size__(x,y,z)`	3 ints	`+40` = x, `+44` = y, `+48` = z, `+52` bit 1	`sub_4109E0`
`__block_size__(x,y,z,cx,cy,cz)`	6 ints	block + cluster dims, `+52` bits 0+1	`sub_4109E0`
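The `__cluster_dims__` rows of the table follow one pattern: omitted dimensions default to 1, and the zero-argument form only records intent on the entity. A minimal sketch of that pattern follows, assuming a hypothetical struct that gathers the affected fields in one place; it is not the decompiled body of `sub_4115F0`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical grouping of the fields that __cluster_dims__ writes. */
struct cluster_state_sketch {
    int32_t dim_x, dim_y, dim_z;  /* launch config +20, +24, +28 */
    uint8_t cfg_flags;            /* launch config +52, bit 0 = cluster_dims_set */
    uint8_t entity_flags_183;     /* entity +183, bit 6 = cluster_dims intent */
};

void apply_cluster_dims_sketch(struct cluster_state_sketch *s,
                               const int32_t *args, int argc)
{
    assert(s != NULL && argc >= 0 && argc <= 3);
    if (argc == 0) {
        s->entity_flags_183 |= 0x40;        /* intent flag only */
        return;
    }
    assert(args != NULL);
    s->dim_x = args[0];
    s->dim_y = (argc > 1) ? args[1] : 1;    /* missing y defaults to 1 */
    s->dim_z = (argc > 2) ? args[2] : 1;    /* missing z defaults to 1 */
    s->cfg_flags |= 0x01;
}
```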

Post-Validation Constraints

The function nv_validate_cuda_attributes (sub_6BC890) performs cross-attribute validation after all attributes have been applied. The key checks on the launch config struct:

1. __launch_bounds__ only on __global__:

// sub_6BC890, lines 45-51
v5 = *(_QWORD *)(a1 + 256);         // launch config pointer
if ( !v5 )  goto done;
if ( (v3 & 0x40) != 0 )             // if __global__, skip to next check
    goto check_cluster;
// Not __global__ but has launch_bounds values
if ( *(_QWORD *)v5 || *(_QWORD *)(v5 + 8) )
    error(3534, "__launch_bounds__");   // launch_bounds on non-kernel

2. __cluster_dims__/__block_size__ only on __global__:

// sub_6BC890, lines 81-87
if ( (*(_BYTE *)(a1 + 183) & 0x40) != 0    // cluster_dims intent
    || *(int *)(v5 + 20) >= 0 )             // cluster_dim_x set
{
    v11 = "__cluster_dims__";
    if ( *(int *)(v5 + 40) > 0 )
        v11 = "__block_size__";
    error(3534, v11);                       // not allowed on non-kernel
}

3. maxBlocksPerCluster vs cluster product:

// sub_6BC890, lines 101-114
v6 = *(int *)(v5 + 20);                    // cluster_dim_x
if ( (int)v6 > 0 ) {
    v7 = *(int *)(v5 + 16);                // maxBlocksPerCluster
    if ( (int)v7 > 0
        && v7 < *(int*)(v5 + 28) * *(int*)(v5 + 24) * v6 )
    {
        // maxBlocksPerCluster < cluster_dim_x * cluster_dim_y * cluster_dim_z
        error(3707, "__cluster_dims__");    // inconsistent values
    }
}
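The consistency rule behind error 3707 can be isolated as a predicate. The function name is hypothetical, and the product is widened to 64 bits here to avoid signed overflow, which is a deliberate difference from the 32-bit multiply in the decompiled code.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when maxBlocksPerCluster is set and smaller than x*y*z. */
int cluster_exceeds_max_blocks(int32_t max_blocks_per_cluster,
                               int32_t cx, int32_t cy, int32_t cz)
{
    if (cx <= 0 || max_blocks_per_cluster <= 0)
        return 0;                              /* either side unset: no check */
    int64_t product = (int64_t)cx * cy * cz;   /* widened; the binary uses int */
    return (int64_t)max_blocks_per_cluster < product;
}
```

For example, a 2x2x2 cluster with a limit of 4 is inconsistent, while the same cluster with a limit of 8 passes.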

4. __maxnreg__ only on __global__:

// sub_6BC890, lines 116-121
if ( *(int *)(v5 + 32) < 0 )               // maxnreg not set (sentinel -1)
    goto check_launch_maxnreg_conflict;
if ( (v9 & 0x40) == 0 )                    // not __global__
    error(3715, "__maxnreg__");             // maxnreg on non-kernel

5. __launch_bounds__ + __maxnreg__ conflict:

// sub_6BC890, lines 144-145
if ( *(_QWORD *)v5 )                       // maxThreadsPerBlock set
    error(3719, "__launch_bounds__ and __maxnreg__");

Entity Kind Reference

The entity kind byte at +80 determines which offsets are valid. CUDA attribute handlers gate on this value:

Kind	Value	CUDA offsets used	Handler examples
Variable	7	`+148`, `+149`, `+161`, `+164`	`__device__`, `__shared__`, `__constant__`, `__managed__`
Field	8	`+136`	`packed`, `aligned` (non-CUDA)
Routine	11	`+144`, `+152`, `+166`, `+176`, `+179`, `+182`, `+183`, `+184`, `+191`, `+256`	All execution space attrs, launch config

Cross-References

Execution Spaces -- deep dive on byte +182 semantics and the six virtual override mismatch errors
Attributes Overview -- attribute kind enum (86-108) and apply_one_attribute dispatch
IL Overview -- IL entry kinds 7 (variable), 8 (field), 11 (routine) node sizes
Scope Entry -- 784-byte scope structure that contains entity chains

Scope Entry

The scope entry is the 784-byte record that forms the elements of the scope stack, the central data structure in cudafe++ for tracking nested lexical scopes during C++ parsing and semantic analysis. The scope stack is a flat array at qword_126C5E8, indexed by dword_126C5E4 (current depth). Every time the parser enters a new scope -- file, block, function body, class definition, template declaration, namespace -- a new 784-byte entry is pushed onto this stack. When the scope closes, the entry is popped and all associated cleanup runs: symbol table housekeeping, using-directive deactivation, name collision discriminator assignment, template parameter restoration, and memory region disposal.

This page documents the scope stack entry layout, the scope kind enum, the key flag bytes, the CUDA-specific additions (device/host scope context), the template instantiation depth counters, and the major push/pop functions.

Key Facts

Property	Value
Entry size	784 bytes (constant, verified by "Stack entry size: %d\n" in debug statistics)
Stack base pointer	`qword_126C5E8` (global, array of 784-byte entries)
Current depth index	`dword_126C5E4` (global, 0-based index of topmost entry)
Function scope index	`dword_126C5D8` (-1 if not inside a function scope)
Class scope index	`dword_126C5C8` (-1 if not inside a class scope)
File scope index	`dword_126C5DC`
EDG source file	`scope_stk.c` (address range `0x6FE160`-`0x7106B0`, ~160 functions)
Push function	`sub_700560` (`push_scope_full`, 1476 lines, 13 parameters)
Pop function	`sub_7076A0` (`pop_scope`, 1142 lines)
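Index arithmetic	`784 * index` for byte offset; the reverse (byte offset to index) is an exact division by 784 = 16 * 49, done as a right shift by 4 combined with a multiply by `0x7D6343EB1A1F58D1`, the multiplicative inverse of 49 modulo 2^64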
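The index arithmetic in the last row can be reproduced directly. Since 49 * `0x7D6343EB1A1F58D1` equals 1 modulo 2^64, multiplying by the constant undoes a multiplication by 49, and the remaining factor of 16 is a shift. The helper names below are hypothetical, and the shift-then-multiply order is one valid arrangement; the exact instruction order in the binary is not claimed here.

```c
#include <assert.h>
#include <stdint.h>

#define SCOPE_ENTRY_SIZE 784u                   /* 16 * 49 */
#define INV_49_MOD_2_64  0x7D6343EB1A1F58D1ull  /* 49 * INV_49 == 1 (mod 2^64) */

/* Index -> byte offset into the scope stack array. */
uint64_t scope_offset_from_index(uint64_t index)
{
    return index * SCOPE_ENTRY_SIZE;
}

/* Byte offset -> index. Only valid for exact multiples of 784. */
uint64_t scope_index_from_offset(uint64_t byte_offset)
{
    assert(byte_offset % SCOPE_ENTRY_SIZE == 0);
    return (byte_offset >> 4) * INV_49_MOD_2_64;  /* divide by 16, then by 49 */
}
```

This is an exact-division idiom: it gives the right answer only when the offset is a true multiple of the entry size, which holds for the difference between two entry pointers.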

Scope Stack Global Variables

Global	Type	Meaning
`qword_126C5E8`	`void*`	Base pointer to the scope stack array
`dword_126C5E4`	`int32`	Current scope stack top index (0-based)
`dword_126C5D8`	`int32`	Current function scope index (-1 if none)
`dword_126C5DC`	`int32`	File scope index / secondary depth marker
`dword_126C5AC`	`int32`	Saved depth for template instantiation
`qword_126C5D0`	`void*`	Current routine descriptor pointer
`dword_126C5B8`	`int32`	`is_member_of_template` flag
`dword_126C5C8`	`int32`	Class scope index (-1 if none)
`dword_126C5C4`	`int32`	Nested class / lambda scope index (-1 if none)
`dword_126C5E0`	`int32`	Scope hash / identifier
`dword_126C5B4`	`int32`	Namespace scope index
`dword_126C5BC`	`int32`	Class scope depth counter
`qword_126C598`	`void*`	Pack expansion context pointer

Full Offset Map

The table documents every field observed in the 784-byte scope stack entry. These are scope stack entry fields, not IL scope node fields (the IL scope is a separate 288-byte structure pointed to from offset +192).

Offset	Size	Field	Evidence
`+0`	4	`scope_number`	Unique identifier for this scope instance; checked in pop_scope assertions
`+4`	1	`scope_kind`	Scope kind enum byte (see table below)
`+5`	1	`scope_flags_1`	General flags
`+6`	2	`scope_flags_2`	Bit 0 = void return flag; bit 1 = device scope context (NVIDIA addition); bit 2 = inline namespace; in some contexts bit 1 = `is_extern`, bit 5 = `inline_namespace`
`+7`	1	`access_flags`	Bit 0 = in class context; bit 1 = has using-directives; bit 4 = lambda body
`+8`	1	`scope_flags_4`	Template/class/reactivation bits; bit 5 (`0x20`) = `is_template_scope`
`+9`	1	`scope_flags_5`	Bit 0 = needs cleanup / scope pop control -- when set, triggers `sub_67B4E0()` cleanup of template instantiation artifacts before popping
`+10`	1	`scope_flags_6`	Bit 0 = `in_template_context`
`+11`	1	`in_template_dependent_context`	Sign bit (bit 7, `0x80`) of this byte
`+12`	1	`scope_flags_7`	Bit 0 = `in_template_arg_scan`; bit 2 = `suppress_diagnostics`; bit 4 = `has_concepts` / `void_return_warned`
`+13`	1	`scope_flags_8`	Bit 4 = `warned_no_return`
`+14`	1	`flags3`	Bit 2 = `in_device_code` (NVIDIA-specific, marks whether code in this scope is device code)
`+24`	8	`symbol_chain_or_hash_ptr`	Pointer to the name symbol chain or hash table for name lookup
`+32`	8	`hash_table_ptr`	Hash table pointer (when scope uses hashing for lookup)
`+32`-`+144`	112	Inline tail info	When `+24` is 0, this region contains inline tail pointers for entity lists: `+40` = variables tail, `+48` = types tail, `+56` = routines next, `+88` = asm tail, `+112` = namespaces tail, `+144` = templates tail
`+192`	8	`il_scope_ptr`	Pointer to the associated 288-byte IL scope node (the persistent representation that survives scope pop)
`+200`	8	`local_static_init_list`	List of local static variable initializers
`+208`	8	`vla_dimensions_list` / `scope_depth`	VLA dimension tracking (C mode); scope depth integer
`+216`	8	`class_type_ptr` / `tu_ptr`	For class scopes: pointer to the class type symbol. For template instantiation scopes: pointer to the translation unit descriptor
`+224`	8	`routine_descriptor`	Pointer to the current routine descriptor (set for function scopes)
`+232`	8	`namespace_entity`	For namespace scopes: pointer to the namespace entity
`+240`	4	`region_number`	Memory region number (-1 = invalid sentinel, set by `alloc_scope`)
`+256`	4	`parent_scope_index`	Index of the enclosing scope in the stack (reported at both +240 and +256 in different sweeps -- likely `+240` = region, `+256` = parent)
`+272`	8	`name_hiding_list`	Linked list of names hidden by declarations in this scope
`+296`	8	`local_vars_tail`	Tail pointer for the local variables list
`+368`	8	`source_begin`	Source position at scope entry
`+376`	8	`associated_entity` / `parent_template_info`	Associated entity pointer / template information pointer
`+384`	8	`template_argument_list`	Template argument list for instantiation scopes
`+408`	4	`try_block_index` / `enclosing_class_scope_index`	Try block index (-1 = none); in class contexts, index to enclosing class scope
`+416`	8	`module_info`	Module information pointer (C++20 modules support)
`+424`	4	`line_number`	Line number at scope open (for diagnostics)
`+496`	8	`root_object_lifetime`	Root of the object lifetime tree for this scope
`+512`	8	`freed_lifetime_list`	List of freed object lifetimes awaiting reuse
`+560`	4	`enclosing_scope_index`	Parent scope index for pop validation
`+576`	4	`template_instantiation_depth_counter`	Nested instantiation depth counter -- incremented on recursive template instantiation push, decremented on pop; when > 0, pop just decrements without actually popping the scope stack
`+580`	4	`orig_depth`	Original scope stack depth at time of template instantiation push; validated during pop
`+584`	4	`saved_scope_depth`	Saved scope depth; restored via `dword_126C5AC` on template instantiation pop
`+608`	8	`class_def_info_ptr`	Class definition information pointer
`+624`	8	`template_info_ptr`	Template information record pointer
`+632`	8	`template_parameter_list` / `class_info_ptr`	Template parameter list pointer
`+704`	8	`lambda_counter`	Lambda expression counter within this scope (int64)
`+720`	4	`fixup_counter`	Deferred fixup counter
`+728`	8	`has_been_completed`	Completion flag (int64 used as bool)
`+736`	8	`deferred_fixup_list_head`	Head of deferred fixup linked list
`+744`	8	`deferred_fixup_list_tail`	Tail of deferred fixup linked list

Scope Stack Kind Enum

The scope stack kind byte at +4 uses a different, larger enum than the IL scope kind (sck_*) at IL scope node +28. The scope stack enum includes additional entries for reactivation states, template instantiation context, and module scopes. The mapping is derived from scope_kind_to_string (sub_7000E0, 77 lines) which contains display string literals for each enum value, and from display_scope (sub_5F2140) in il_to_str.c.

Scope Stack Kind Values

Value	Name	Display String	Notes
0	`ssk_source_file`	`"source file"`	Top-level file scope. Maps to IL `sck_file` (0).
1	`ssk_func_prototype`	`"function prototype"`	Function prototype scope (parameter names). Maps to IL `sck_func_prototype` (1).
2	`ssk_block`	`"block"`	Block scope (compound statement). Maps to IL `sck_block` (2).
3	`ssk_alloc_namespace`	`"alloc_namespace"`	Namespace scope (first opening). Maps to IL `sck_namespace` (3).
4	`ssk_namespace_extension`	`"namespace extension"`	Namespace extension (reopened `namespace N { ... }`).
5	`ssk_namespace_reactivation`	`"namespace reactivation"`	Namespace scope reactivated for out-of-line definition.
6	`ssk_class_struct_union`	`"class/struct/union"`	Class/struct/union scope. Maps to IL `sck_class_struct_union` (6).
7	`ssk_class_reactivation`	`"class reactivation"`	Class scope reactivated for out-of-line member definition (e.g., `void MyClass::foo() { ... }`).
8	`ssk_template_declaration`	`"template declaration"`	Template declaration scope (`template<...>`). Maps to IL `sck_template_declaration` (8).
9	`ssk_template_instantiation`	`"template instantiation"`	Template instantiation scope (pushed by `push_template_instantiation_scope`).
10	`ssk_instantiation_context`	`"instantiation context"`	Instantiation context scope (tracks the chain of instantiation sites for diagnostics).
11	`ssk_module_decl_import`	`"module decl import"`	C++20 module declaration/import scope.
12	`ssk_module_isolation`	`"module isolation"`	C++20 module isolation scope (module purview boundary).
13	`ssk_pragma`	`"pragma"`	Pragma scope (for pragma-delimited regions).
14	`ssk_function_access`	`"function access"`	Function access scope.
15	`ssk_condition`	`"condition"`	Condition scope (if/while/for condition variable). Maps to IL `sck_condition` (15).
16	`ssk_enum`	`"enum"`	Scoped enum scope (C++11 `enum class`). Maps to IL `sck_enum` (16).
17	`ssk_function`	`"function"`	Function body scope (has routine pointer, parameters, ctor init list). Maps to IL `sck_function` (17).

Relationship to IL Scope Kinds

The IL scope node (288 bytes, allocated by alloc_scope at sub_5E7D80) uses a smaller sck_* enum at its +28 field. The scope stack entry at +192 points to the IL scope that persists after the stack entry is popped. Not all scope stack kinds produce a new IL scope -- the namespace extension kind (4), the reactivation kinds (5, 7), and the context kinds (9, 10) reuse existing IL scopes.

IL `sck_*`	Value	Corresponding stack kind(s)
`sck_file`	0	0
`sck_func_prototype`	1	1
`sck_block`	2	2
`sck_namespace`	3	3, 4, 5
`sck_class_struct_union`	6	6, 7
`sck_template_declaration`	8	8
`sck_condition`	15	15
`sck_enum`	16	16
`sck_function`	17	17
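The correspondence table reads naturally as a many-to-one mapping from stack kind to IL kind. The function below is a hypothetical restatement of the table, returning -1 for stack kinds that have no row in it (9 through 14).

```c
#include <assert.h>

/* Maps a scope stack kind (byte +4) to its IL sck_* value per the table,
 * or -1 when the table lists no corresponding IL kind. */
int il_kind_for_stack_kind(int stack_kind)
{
    switch (stack_kind) {
    case 0:  return 0;           /* sck_file */
    case 1:  return 1;           /* sck_func_prototype */
    case 2:  return 2;           /* sck_block */
    case 3:
    case 4:
    case 5:  return 3;           /* sck_namespace: first open, extension, reactivation */
    case 6:
    case 7:  return 6;           /* sck_class_struct_union: scope and reactivation */
    case 8:  return 8;           /* sck_template_declaration */
    case 15: return 15;          /* sck_condition */
    case 16: return 16;          /* sck_enum */
    case 17: return 17;          /* sck_function */
    default: return -1;          /* no row in the table */
    }
}
```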

CUDA-Specific Fields

NVIDIA added two device/host scope tracking bits to the scope entry, grafted onto the EDG base structure.

Byte +6, Bit 1: Device Scope Context

scope_entry+6, bit 1 (0x02):

  When set: code in this scope is compiled for the device execution space.
  When clear: code in this scope is compiled for the host.

This bit is tested by CUDA-specific code paths to determine whether the current compilation context targets device or host. It affects:

Whether __device__-only functions suppress certain diagnostics (e.g., missing return value warning at check_void_return_okay, sub_719D20)
Whether device-specific type validation applies
Severity overrides via byte_126ED55 (default diagnostic severity for device mode)

The bit is set when entering __device__ or __global__ function scopes and cleared when entering __host__ scopes. This allows mixed host/device compilation to track which context is active at any nesting depth.

Byte +14, Bit 2: In Device Code

scope_entry+14, bit 2 (0x04):

  Secondary device-code marker. Set when the parser is inside a device
  function body. Used in conjunction with dword_106C2C0 (CUDA device
  compilation mode flag).
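Both bits can be read from a raw scope stack buffer using the 784-byte stride. The sketch below treats the stack as a byte array and uses hypothetical helper names; only the entry size, the byte offsets, and the bit masks come from this page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SCOPE_ENTRY_BYTES = 784 };

/* Address of entry `index` inside a scope stack held as raw bytes. */
const uint8_t *scope_entry_at(const uint8_t *stack_base, int32_t index)
{
    assert(stack_base != NULL && index >= 0);
    return stack_base + (size_t)index * SCOPE_ENTRY_BYTES;
}

/* Byte +6, bit 1 (0x02): device scope context. */
int scope_is_device_context(const uint8_t *stack_base, int32_t index)
{
    return (scope_entry_at(stack_base, index)[6] & 0x02) != 0;
}

/* Byte +14, bit 2 (0x04): in device code. */
int scope_in_device_code(const uint8_t *stack_base, int32_t index)
{
    return (scope_entry_at(stack_base, index)[14] & 0x04) != 0;
}
```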

Template Instantiation Depth Counters

Three fields at offsets +576, +580, and +584 form the template instantiation depth tracking system. These fields enable the scope stack to handle nested template instantiations without fully pushing/popping scope entries at every nesting level.

Mechanism

When push_template_instantiation_scope (sub_709DE0) sets up a template instantiation, it writes the current scope stack depth into +580 (orig_depth) and the saved global depth into +584 (saved_scope_depth). The +576 counter starts at 0.

If the same template scope is re-entered for a nested instantiation (e.g., recursive template), +576 is incremented rather than pushing a full new scope entry. On pop, pop_template_instantiation_scope (sub_708EE0) checks +576:

if (scope_entry[576] > 0) {
    scope_entry[576]--;     // just decrement, don't pop
    return;
}
// Full pop: restore scope stack to orig_depth
pop_scopes_to(scope_entry[580]);
restore(dword_126C5AC, scope_entry[584]);

This optimization avoids deep scope stack growth during deeply recursive template instantiations (e.g., std::tuple<T1, T2, ..., TN> with large N).
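The counter mechanism can be modelled in isolation. In the sketch below, `scope_ctx_sketch` stands in for the globals `dword_126C5E4` and `dword_126C5AC`, and `inst_entry_sketch` for the three fields at +576, +580, and +584; all names and the single-step stack growth in the push are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for dword_126C5E4 (stack top) and dword_126C5AC (saved depth). */
struct scope_ctx_sketch { int32_t stack_top; int32_t saved_depth; };

/* Stand-ins for scope entry fields +576, +580, +584. */
struct inst_entry_sketch {
    int32_t depth_counter;
    int32_t orig_depth;
    int32_t saved_scope_depth;
};

/* First entry: record depths, counter starts at 0. */
void inst_push(struct scope_ctx_sketch *ctx, struct inst_entry_sketch *e)
{
    assert(ctx != NULL && e != NULL);
    e->depth_counter = 0;
    e->orig_depth = ctx->stack_top;
    e->saved_scope_depth = ctx->saved_depth;
    ctx->stack_top += 1;               /* the instantiation scope itself */
}

/* Nested re-entry of the same scope: count it, push nothing. */
void inst_reenter(struct inst_entry_sketch *e)
{
    assert(e != NULL);
    e->depth_counter += 1;
}

/* Returns 1 when a full pop happened, 0 when only the counter dropped. */
int inst_pop(struct scope_ctx_sketch *ctx, struct inst_entry_sketch *e)
{
    assert(ctx != NULL && e != NULL);
    if (e->depth_counter > 0) {
        e->depth_counter -= 1;
        return 0;
    }
    ctx->stack_top = e->orig_depth;            /* pop back to orig_depth */
    ctx->saved_depth = e->saved_scope_depth;   /* restore saved depth */
    return 1;
}
```

With one push followed by one re-entry, the first pop only decrements the counter and the second pop restores both depths.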

Validation

pop_template_instantiation_scope_with_check (sub_708E90) validates that +576 matches the expected depth before calling the actual pop. The assertion is at scope_stk.c line 5593. A mismatch triggers an internal error.

Push Scope: `push_scope_full` (sub_700560)

The core scope push function (1476 lines, 13 parameters, located at 0x700560). Called directly or via thin wrappers for each scope kind.

Parameters

The 13-parameter signature handles all scope kinds through a single entry point:

1. Scope kind
2. Associated entity pointer (class type, namespace entity, routine descriptor, etc.)
3. Region number
4. Additional flags
5-13. Kind-specific parameters (template info, reactivation data, etc.)

Key Operations

Stack growth: Increments dword_126C5E4. If the stack exceeds its allocation, it is reallocated (the base pointer qword_126C5E8 may change).
Entry initialization: Zeros the 784-byte entry, then sets:
- +0 = scope number (from a global counter)
- +4 = scope kind
- +240 = region number
- +192 = IL scope pointer (newly allocated via alloc_scope or reused from a reactivated entity)
- +560 = parent scope index
Kind-specific setup:
- File (0): Sets dword_126C5DC, initializes file-level state.
- Block (2): Links to enclosing function scope.
- Namespace (3, 4, 5): Sets +232 to namespace entity. For extensions (4), reuses existing IL scope. For reactivation (5), calls add_active_using_directives_for_scope.
- Class (6, 7): Sets +216 to class type pointer, dword_126C5C8 to current index. For reactivation (7), walks the class hierarchy to restore template context.
- Template declaration (8): Sets template-related bits in +8.
- Function (17): Sets dword_126C5D8, qword_126C5D0, +224.
Parent scope linkage: Calls set_parent_scope_on_push to establish the scope tree.
Memory region: Calls get_enclosing_memory_region to determine the memory arena for allocations within this scope.

Push Wrappers

Wrapper	Address	Lines	Target Kind
`push_scope` (thin)	`sub_704790`	7	Various
`push_scope_with_using_dirs`	`sub_7047C0`	29	Namespace + using
`push_template_scope`	`sub_704870`	7	Template declaration (8)
`push_block_reactivation_scope`	`sub_7048A0`	32	Block reactivation
`push_namespace_scope_full`	`sub_7024D0`	40	Namespace (3)
`push_function_scope`	`sub_704BB0`	13	Function (17)
`push_class_scope`	`sub_704C10`	17	Class (6)
`push_scope_for_compound_statement`	`sub_70C8A0`	64	Block (2)
`push_scope_for_condition`	`sub_70C950`	86	Condition (15)
`push_scope_for_init_statement`	`sub_70CAE0`	49	Block (2), C++17 init

Pop Scope: `pop_scope` (sub_7076A0)

The core scope pop function (1142 lines, at 0x7076A0). Complement to push_scope_full. Performs all scope cleanup in a specific order.

Cleanup Sequence

Debug trace: If dword_126EFC8 is set, prints "pop_scope: number = %ld, depth = %d".
Scope wrapup: Calls wrapup_scope (sub_706710, 381 lines) which:
- Iterates all symbols in the scope
- Runs end_of_scope_symbol_check (sub_705440, 781 lines) for consistency validation
- Emits needed definitions
- Reports unreferenced entities
Using-directive deactivation: Clears active using-directives for this scope via sub_6FEC10 (debug: clear using-directive).
Template parameter restoration: If leaving a template scope, calls restore_default_template_params (sub_6FEEE0) to undo template parameter symbol bindings.
Name collision discriminators: Assigns ABI discriminator values to entities with the same name in this scope via assign_discriminators_to_entities_list (sub_7036E0).
C99 inline definitions: Checks check_c99_inline_definition (sub_703AD0) for C99-mode inline function rules.
Module/pragma state: Adjusts STDC pragma state (byte_126E558/559/55A) and module context if applicable.
Stack decrement: Decrements dword_126C5E4. Restores previous scope's global state (function scope index, class scope index, etc.).
Memory region disposal: Frees the memory arena associated with this scope if the scope kind has one.

Pop Variants

Function	Address	Lines	Purpose
`pop_scope` (core)	`sub_7076A0`	1142	Full scope pop with all cleanup
`pop_scope_full`	`sub_70C440`	100	Wrapper calling core + name hiding cleanup
`pop_scope` (validation)	`sub_70C620`	62	Pop with object lifetime validation: asserts `"pop_scope: curr_object_lifetime is not that of"`, `"pop_scope: unexpected curr_object_lifetime"`

Template Instantiation Scope

The template instantiation scope push and pop are separate from the generic scope push/pop. They handle the complex process of binding template parameters to arguments, setting up the correct translation unit context, and managing pack expansions.

Push: `push_template_instantiation_scope` (sub_709DE0)

At 1281 lines, this is one of the largest functions in the scope_stk.c range; only push_scope_full (1476 lines) is larger among the functions documented here. Takes 8 parameters: template pointer, association info, and various flags.

Key operations:

Translation unit check: Calls sub_7418D0 to verify that the template being instantiated belongs to the same translation unit, or that cross-TU instantiation is explicitly allowed (flag & 0x1000). Failure triggers the assert "push_template_instantiation_scope: wrong translation unit".
Template parameter binding: Iterates the template parameter list and the instantiation argument list in parallel, creating bindings. For each template parameter:
- Type parameters: binds to the supplied type argument
- Non-type parameters: binds to the supplied expression/value
- Template template parameters: binds to the supplied template
Pack expansion: For variadic templates, handles parameter pack expansion. Creates pack instantiation descriptors via create_pack_instantiation_descr (sub_70CF50, 772 lines).
Scope entry setup: Writes +576 = 0, +580 = current depth, +584 = dword_126C5AC. Sets +216 to the translation unit pointer.
State save/restore: Saves dword_126C5B8 (is_member_of_template), dword_126C5D8 (function scope), qword_126C5D0 (routine descriptor).
Reactivation flags: Flag bits & 0x84000 control class template reactivation behavior. When set, the function enters class reactivation mode via sub_70BB60.

Pop: `pop_template_instantiation_scope` (sub_708EE0)

66 lines. Reverse of the push.

Reads +576 (depth counter). If > 0, decrements and returns early (nested instantiation shortcut).
If bit 0 of +9 is set (needs_cleanup), calls sub_67B4E0() to clean up template instantiation artifacts.
Pops scope entries back to orig_depth (+580) via sub_7076A0.
Restores dword_126C5AC from +584.
Calls sub_6FED20 (debug trace: set using-directive).

Function	Address	Lines	Role
`pop_template_instantiation_scope_wrapper`	`sub_708E70`	7	Thin wrapper passing through to `sub_708EE0`
`pop_template_instantiation_scope_with_check`	`sub_708E90`	14	Validates `+576` depth counter before calling `sub_708EE0`
`pop_template_instantiation_scope_variant`	`sub_709110`	71	Alternative pop with extra `+8` flag processing, returns `int64`
`pop_instantiation_scope_for_rescan`	`sub_709000`	54	Pop for template argument rescan case
`push_instantiation_scope_for_rescan`	`sub_70B900`	123	Push for template parameter rescanning
`push_instantiation_scope_for_templ_param_rescan`	`sub_70B7C0`	52	Push for template parameter rescan
`push_instantiation_scope_for_class`	`sub_70BB60`	131	Push for class template instantiation
`push_class_and_template_reactivation_scope_full`	`sub_7098B0`	261	Combined class + template reactivation

Using-Directive Activation

When entering a namespace scope that has active using namespace directives, those directives must be reactivated so that names from the nominated namespace are visible. The scope stack manages this through two functions:

add_active_using_directives_for_scope (sub_6FFCC0)

246 lines. Called during scope push when entering a namespace or block that may have inherited using-directives. Walks the using-directive list for the scope and calls add_active_using_directive_to_scope for each one.

Debug trace format: "adding using-dir at depth %d for namespace %s applies at %d".

Using-Directive Debug Traces

Function	Address	Lines	Trace
Debug: set using-directive	`sub_6FED20`	74	`"setting using-dir at depth %d for namespace %s applies at %d"`
Debug: clear using-directive	`sub_6FEC10`	34	`"clearing using-dir at depth %d for namespace %s applies at %d"`
Debug: using-dir set/clear	`sub_704490`	106	`"using_dir"`, `"setting"`, `"clearing"`

Name Collision Discriminators

When multiple local entities share the same name (e.g., two struct S in different blocks within the same function), the Itanium ABI requires a discriminator suffix in the mangled name. The scope stack manages this through:

Function	Address	Lines	Role
`get_name_collision_list` + `initialize_local_name_collision_table`	`sub_6FE760`	64	Manages the name collision table at `qword_12C6FE0`
`compute_local_name_collision_discriminator` + `distinct_lambda_signatures`	`sub_702FB0`	293	Computes ABI discriminator values for local entities; includes lambda signature discrimination logic
`cancel_name_collision_discriminator`	`sub_7034C0`	118	Cancels a previously assigned discriminator (7 assertion sites)
`assign_discriminators_to_entities_list`	`sub_7036E0`	46	Assigns ABI discriminators to a list of entities at scope exit
`set_parent_entity_for_closure_types`	`sub_703790`	91	Sets parent entity for lambda closure types (needed for correct mangling, 5 assertion sites)
`set_parent_routine_for_closure_types_in_default_args`	`sub_703920`	43	Sets parent routine for lambdas in default argument contexts

Class and Template Reactivation

When defining an out-of-line member function (void MyClass::foo() { ... }), the parser must reactivate the class scope so that class member names are visible. For class templates, this also requires reactivating the template instantiation scope.

reactivate_class_context (sub_7029D0 / sub_70BE50)

Two implementations exist:

sub_7029D0 (196 lines, in p1.16): Reactivates a class scope for out-of-line definition. Asserts "reactivate_class_context: class type has NULL assoc_info".
sub_70BE50 (130 lines, in p1.17): Additional variant that handles nesting, template flags, and scope_entry +8 bit manipulation.

push_class_and_template_reactivation_scope_full (sub_7098B0)

261 lines. Handles the combined case of class template reactivation. Reads symbol flags at offsets +80, +81, +161, +162. Processes "specified template decl info" at +64 of assoc_info. Detects member templates via bit 0x10 at +81. When dword_106BC58 is set, enters class reactivation mode with sub_70BB60.

reactivate_local_context (sub_702670 / sub_70C0F0)

sub_702670 (120 lines): Reactivates a previously saved local scope context. Calls push_scope_full.
sub_70C0F0 (50 lines): Asserts "reactivate_local_context".

Pack Expansion Support

The scope stack provides infrastructure for variadic template parameter pack expansion during instantiation.

Function	Address	Lines	Role
`create_pack_instantiation_descr`	`sub_70CF50`	772	Creates pack instantiation descriptors; handles sizeof..., fold expressions
`create_pack_instantiation_descr_helper`	`sub_70DD60`	212	Helper for pack descriptor creation
`cleanup_pack_instantiation_state`	`sub_70E130`	37	Cleans up pack expansion state
`end_potential_pack_expansion_context`	`sub_70E1D0`	392	Processes end of pack expansion; checks the language standard version via `dword_126EF68 > 199900`; uses `qword_126C598` (pack expansion context)
`find_template_arg_for_pack` + `get_enclosing_template_params_and_args`	`sub_6FE9B0`	140	Traverses scope stack to find template arguments for parameter packs

Scope Stack Query Functions

Function	Address	Lines	Role
`get_innermost_template_dependent_context`	`sub_6FE160`	72	Traverses scope stack to find innermost template-dependent scope
`get_outermost_template_dependent_context`	`sub_6FFA60`	54	Complement to innermost variant
`get_curr_template_params_and_args` (part 1)	`sub_70E7F0`	321	Retrieves current template parameters and arguments from scope stack
`get_curr_template_params_and_args` (full)	`sub_70F540`	1002	Full implementation with default argument handling and pack expansion
`is_in_template_context`	`sub_70EE10`	16	No-arg predicate, returns bool
`is_in_template_instantiation_scope`	`sub_70EDA0`	27	6-arg predicate, returns bool
`current_class_symbol_if_class_template`	`sub_704130`	84	Returns class symbol if inside a class template definition
`is_in_deprecated_context`	`sub_70F440`	43	Checks `scope_entry[83]` bit 4 and walks scope stack
`get_scope_depth`	`sub_70C600`	17	Returns current scope stack depth value
`get_template_scope_info_for_entity`	`sub_7106B0`	74	Last function in scope_stk.c range

Debug and Statistics

Scope Statistics Dump (sub_702DC0)

95 lines. Prints scope stack statistics when debug tracing is enabled. Output format:

Scope stack statistics
Stack entry size: 784
Max. stack depth: <N>

Followed by per-scope-kind counts using all scope kind display names.

Scope Entry Dump (sub_700260 / sub_7002D0)

sub_700260 (17 lines): Prints " scope %d" with scope kind name via scope_kind_to_string. Detects bad depth with "***BAD SCOPE DEPTH***".
sub_7002D0 (111 lines): Detailed dump using format "%s%3ld %3d " with associated type/symbol information.

End-of-Scope Processing

wrapup_scope (sub_706710)

381 lines. Major scope cleanup function called from pop_scope. Processes all symbols in the scope, emits needed definitions, runs end_of_scope_symbol_check. Debug traces: "wrapup_scope", "Wrapping up ", " scope".

end_of_scope_symbol_check (sub_705440)

781 lines, 6 assertion sites. The largest validation function. Checks:

Symbol-to-IL-entry parent-class consistency ("end_of_scope_symbol_check: sym/il-entry parent-class mismatch")
Parameter-to-routine association ("end_of_scope_symbol_check: parameter with no assoc routine")
Hash table statistics ("hash_stats", "Hash statistics for: ")

set_needed_flags_at_end_of_file_scope (sub_707040)

188 lines. Determines which entities need to be emitted at the end of the translation unit. Validates scope kind ("set_needed_flags_at_end_of_file_scope: bad scope kind"). Debug brackets: "Start of set_needed_flags_at_end_of_file_scope\n" / "End of set_needed_flags_at_end_of_file_scope\n".

finish_function_body_processing (sub_6FE2A0)

142 lines. Post-processes function bodies after the scope closes. Determines whether the function needs to be emitted ("routine_needed_even_if_unreferenced", "Not calling mark_as_needed for", "storage class is %s\n").

Cross-References

Entity Node Layout -- entity kind enum, execution space byte at +182
IL Overview -- IL scope kinds (sck_*), IL entry kind 23 (scope, 288 bytes)
IL Allocation -- alloc_scope allocator for 288-byte IL scope nodes
Template Instance Record -- template instantiation data structures
Translation Unit Descriptor -- TU pointer stored at scope entry +216

Translation Unit Descriptor

The translation unit descriptor is the 424-byte structure at the heart of cudafe++'s multi-TU compilation support. Every source file processed by the frontend -- whether via RDC separate compilation or C++20 module import -- gets its own TU descriptor. The descriptor holds pointers to the parser state, scope stack snapshot, error context, and IL tree root for that translation unit. When the frontend switches from one TU to another, it saves the entire set of per-TU global variables into the outgoing descriptor's storage buffer and restores the incoming descriptor's saved values, making TU switching look like a cooperative context switch for compiler state.

The descriptor is allocated from the region-based arena (sub_6BA0D0), linked into a global TU chain, and managed through a TU stack that tracks the active-TU history for nested TU switches (e.g., when processing an entity requires switching to its owning TU temporarily).

Key Facts

Property	Value
Size	424 bytes (confirmed by `print_trans_unit_statistics`: "translation units ... 424 bytes each")
Allocation	`sub_6BA0D0(424)` -- region-based arena, never individually freed
Source file	`trans_unit.c` (EDG 6.6, address range `0x7A3A50`-`0x7A48B0`, 14 functions identified)
Allocator	`sub_7A40A0` (`process_translation_unit`)
Save function	`sub_7A3A50` (`save_translation_unit_state`)
Restore function	`sub_7A3D60` (`switch_translation_unit`)
Fix-up function	`sub_7A3CF0` (`fix_up_translation_unit`)
Statistics	`sub_7A45A0` (`print_trans_unit_statistics`)
TU count global	`qword_12C7A78` (incremented on each allocation)
Active TU global	`qword_106BA10` (`current_translation_unit`)
Primary TU global	`qword_106B9F0` (`primary_translation_unit`)

Full Offset Map

The table below documents every field in the 424-byte TU descriptor. Offsets are byte positions from the start of the descriptor. Fields are identified from the initialization code in process_translation_unit (sub_7A40A0), the save/restore pair (sub_7A3A50/sub_7A3D60), and the fix-up function (sub_7A3CF0).

Offset	Size	Field	Set By	Read By
`+0`	8	`next_tu` -- linked list pointer to next TU in chain	`process_translation_unit` (via `qword_12C7A90`)	`fe_wrapup` TU iteration loop
`+8`	8	`prev_scope_state` -- saved scope pointer (xmmword_126EB60+8)	`save_translation_unit_state`	`switch_translation_unit`
`+16`	8	`storage_buffer` -- pointer to bulk registered-variable storage	`process_translation_unit` (allocates `sub_6BA0D0(per_tu_storage_size)`)	`save/switch_translation_unit`
`+24`	160	`file_scope_info` -- file scope state block (20 qwords, initialized by `sub_7046E0`)	`sub_7046E0` (zeroes 20 fields at offsets 0-152 within this block)	Scope stack operations, `sub_704490`
`+184`	8	(cleared to 0) -- first qword past the 160-byte file scope info block (+24 through +183)	`process_translation_unit`	--
`+192`	8	(cleared to 0) -- gap between scope info and registered-variable zone	`process_translation_unit`	--
`+200`	160	registered-variable direct fields -- zeroed bulk region (offsets +200 through +359)	`memset` in `process_translation_unit`; individual fields written by registered-variable initialization loop	`save/switch_translation_unit` via storage buffer
`+208`	8	`scope_stack_saved_1` -- saved `qword_126EB70` (scope stack depth marker)	`save_translation_unit_state` (a1[26])	`switch_translation_unit`
`+256`	8	`scope_stack_saved_2` -- saved `qword_126EBA0`	`save_translation_unit_state` (a1[32])	`switch_translation_unit`
`+320`	8	`scope_stack_saved_3` -- saved `qword_126EBE0`	`save_translation_unit_state` (a1[40])	`switch_translation_unit`
`+352`	8	(cleared to 0) -- end of registered-variable zone	`process_translation_unit`	--
`+360`	8	(cleared to 0) -- additional state word 1	`process_translation_unit`	--
`+368`	8	(cleared to 0) -- additional state word 2	`process_translation_unit`	--
`+376`	8	`module_info_ptr` -- pointer to module info structure (parameter `a3` of `process_translation_unit`)	`process_translation_unit`	Module import path, `a3[2]` back-link
`+384`	8	`il_state_ptr` -- shortcut pointer for IL state (1344-byte aggregate at `unk_126E600`), set via registered-variable mechanism with `offset_in_tu = 384`	Registered-variable init loop	IL operations
`+392`	2	`flags` -- bit field: byte 0 = `is_primary_tu` (1 if `a3 == NULL`), byte 1 = 0x01 (initialization sentinel, combined initial value = `0x0100`)	`process_translation_unit`	TU classification
`+394`	14	(padding / reserved)	--	--
`+408`	4	`error_severity_level` -- copied from `dword_126EC90` (current maximum error severity)	`process_translation_unit`	Error reporting, recovery decisions
`+412`	4	(padding)	--	--
`+416`	8	(cleared to 0) -- additional state	`process_translation_unit`	--

Layout Diagram

Translation Unit Descriptor (424 bytes)
===========================================

 +0    [next_tu          ] -----> next TU in chain (NULL for last)
 +8    [prev_scope_state ] -----> saved scope ptr (from xmmword_126EB60+8)
+16    [storage_buffer   ] -----> arena block for registered variable values
+24    [                                                              ]
       [  file_scope_info (160 bytes, 20 qwords)                      ]
       [  initialized by sub_7046E0: all fields zeroed                ]
       [  scope state snapshot for this TU's file scope               ]
+184   [  (first qword past scope info, cleared)                      ]
+192   [  (gap, cleared to 0)                                         ]
+200   [                                                              ]
       [  registered-variable direct fields (160 bytes, bulk zeroed)  ]
       [  includes scope stack snapshots at +208, +256, +320          ]
       [  individual fields set by registered-variable init loop      ]
+352   [  (cleared to 0)                                              ]
+360   [  (additional state, cleared)                                 ]
+368   [  (additional state, cleared)                                 ]
+376   [module_info_ptr   ] -----> module info (NULL for primary TU)
+384   [il_state_ptr      ] -----> shortcut to IL state in storage buffer
+392   [flags             ] 0x0100 initial; byte 0 = is_primary
+394   [  (reserved)      ]
+408   [error_severity    ] from dword_126EC90
+412   [  (pad)           ]
+416   [  (additional, 0) ]
+424   === end ===
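
As a cross-check of the offset map and diagram, the layout can be written as a C struct whose offsets are verified at compile time. This is only a sketch. It assumes an LP64 target (8-byte pointers, as in the x86-64 binary). The names cleared_184, cleared_192, state_360, state_368, reserved_394, pad_412, and state_416 are hypothetical placeholders for fields the analysis leaves unnamed, and the two 160-byte blocks are kept opaque.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the 424-byte TU descriptor (LP64 assumed). */
typedef struct tu_desc {
    struct tu_desc *next_tu;                 /* +0   */
    void           *prev_scope_state;        /* +8   */
    void           *storage_buffer;          /* +16  */
    uint8_t         file_scope_info[160];    /* +24  .. +183 */
    uint64_t        cleared_184;             /* +184 (placeholder name) */
    uint64_t        cleared_192;             /* +192 (placeholder name) */
    uint8_t         registered_fields[160];  /* +200 .. +359 */
    uint64_t        state_360;               /* +360 (placeholder name) */
    uint64_t        state_368;               /* +368 (placeholder name) */
    void           *module_info_ptr;         /* +376 */
    void           *il_state_ptr;            /* +384 */
    uint16_t        flags;                   /* +392 */
    uint8_t         reserved_394[14];        /* +394 .. +407 */
    uint32_t        error_severity_level;    /* +408 */
    uint32_t        pad_412;                 /* +412 (placeholder name) */
    uint64_t        state_416;               /* +416 (placeholder name) */
} tu_desc;

_Static_assert(sizeof(tu_desc) == 424, "descriptor must be 424 bytes");
_Static_assert(offsetof(tu_desc, module_info_ptr) == 376, "module_info_ptr at +376");
_Static_assert(offsetof(tu_desc, il_state_ptr) == 384, "il_state_ptr at +384");
_Static_assert(offsetof(tu_desc, flags) == 392, "flags at +392");

/* The decompiler indexes the descriptor as an array of qwords (a1[26],
   a1[32], a1[40]); this converts such an index to a byte offset. */
size_t tu_qword_offset(size_t index)
{
    return index * sizeof(uint64_t);
}
```

With this helper, the qword indices 26, 32, and 40 used by the save function map to byte offsets +208, +256, and +320, which all fall inside the registered_fields block.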

Initialization Sequence

The initialization in process_translation_unit proceeds in this order:

[+0] = 0 (next_tu pointer, not yet linked)
[+16] = sub_6BA0D0(qword_12C7A98) (allocate storage buffer, size = accumulated registered-variable total)
[+8] = 0 (prev_scope_state)
sub_7046E0(tu + 24) -- zero-initialize the 160-byte file scope info block
[+192] = 0, [+352] = 0, [+184] = 0 -- explicit clears around the bulk region
memset(aligned(tu + 200), 0, ...) -- bulk-zero the registered-variable direct fields from +200 to +360 (aligned to 8-byte boundary)
[+360] = 0, [+368] = 0, [+376] = 0 -- clear additional state
[+392] = 0x0100 (flags: initialized sentinel in high byte)
[+408] = 0, [+416] = 0
Registered-variable default-value loop: iterate qword_12C7AA8 (registered variable list) and for each entry with offset_in_tu != 0, write variable_address into tu_desc[offset_in_tu]
[+376] = a3 (module_info_ptr)
[+392] byte 0 = (a3 == NULL) (is_primary flag)
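
The two writes to +392 in the sequence above combine into a single 16-bit value. A minimal sketch, assuming the little-endian layout of x86-64 (byte 0 is the low byte); the function name initial_tu_flags is hypothetical and does not exist in the binary:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the 16-bit flags word at +392 after initialization. */
uint16_t initial_tu_flags(const void *module_info)
{
    uint16_t flags = 0x0100;   /* [+392] = 0x0100: sentinel in byte 1 */
    if (module_info == NULL)
        flags |= 0x0001;       /* byte 0 = is_primary when a3 == NULL */
    return flags;
}
```

Under these assumptions the primary TU ends up with 0x0101 and a TU created with module info keeps 0x0100.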

Lifecycle

Phase 1: Registration (Before Any TU Processing)

Before the first translation unit is processed, every EDG subsystem registers its per-TU global variables by calling f_register_trans_unit_variable (sub_7A3C00). This happens during frontend initialization, before dword_12C7A8C (registration_complete) is set to 1.

The three core variables are registered by register_builtin_trans_unit_variables (sub_7A4690):

// sub_7A4690 -- register_builtin_trans_unit_variables
f_register_trans_unit_variable(&dword_106BA08, 4, 0);   // is_recompilation
f_register_trans_unit_variable(&qword_106BA00, 8, 0);   // current_filename
f_register_trans_unit_variable(&dword_106B9F8, 4, 0);   // has_module_info

In total, approximately 217 calls to f_register_trans_unit_variable are made across all subsystems. Each call adds a 40-byte registration record to the linked list headed by qword_12C7AA8 and accumulates the variable size into qword_12C7A98 (the per-TU storage buffer size). The accumulated size determines how large the storage buffer allocation will be for each TU descriptor.

Phase 2: Allocation and Initialization

When process_translation_unit (sub_7A40A0) is called for each source file:

process_translation_unit(filename, is_recompilation, module_info_ptr)

If a current TU exists (qword_106BA10 != 0), save its state via save_translation_unit_state
Reset compilation state (sub_5EAEC0 -- error state reset)
If recompilation mode: reset parse state (sub_585EE0)
Set dword_12C7A8C = 1 (registration complete -- no more variable registrations allowed)
Allocate the 424-byte descriptor and its storage buffer
Initialize all fields (see sequence above)
Copy registered-variable defaults into the descriptor
Link into the TU chain

Phase 3: Linking

The descriptor is linked into two structures simultaneously:

TU Chain (singly-linked list via [+0]):

Head: qword_106B9F0 (primary_translation_unit) -- the first TU processed
Tail: qword_12C7A90 (tu_chain_tail) -- the most recently allocated TU
Linking: *tu_chain_tail = new_tu; tu_chain_tail = new_tu
Used by: fe_wrapup to iterate all TUs during the 5-pass post-processing
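
The tail-append works as a plain pointer store because next_tu sits at offset 0 of the descriptor, so writing through the tail pointer writes the tail's next field. A minimal sketch of the same idea; the tu_node and tu_chain types and the empty-chain branch are assumptions for illustration, since the text does not describe how the head is first set:

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for the descriptor: only the link at offset 0 matters. */
typedef struct tu_node {
    struct tu_node *next_tu;   /* offset 0, like [+0] in the descriptor */
    int id;
} tu_node;

typedef struct tu_chain {
    tu_node *head;             /* plays the role of qword_106B9F0 */
    tu_node *tail;             /* plays the role of qword_12C7A90 */
} tu_chain;

void tu_chain_append(tu_chain *chain, tu_node *new_tu)
{
    new_tu->next_tu = NULL;
    if (chain->head == NULL)
        chain->head = new_tu;            /* assumed: first TU becomes the head */
    else
        chain->tail->next_tu = new_tu;   /* same effect as *tu_chain_tail = new_tu */
    chain->tail = new_tu;                /* tu_chain_tail = new_tu */
}

int tu_chain_length(const tu_chain *chain)
{
    int n = 0;
    for (const tu_node *t = chain->head; t != NULL; t = t->next_tu)
        ++n;
    return n;
}
```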

TU Stack (singly-linked list of 16-byte stack entries):

Top: qword_106BA18 (translation_unit_stack_top)
Each entry: [+0] = next, [+8] = tu_descriptor_ptr
Free list: qword_12C7AB8 (stack entries are recycled, not freed)
Depth counter: dword_106B9E8 (counts non-primary TUs on the stack)

TU Chain:                    TU Stack:
                             qword_106BA18
primary_tu --> tu_2 --> tu_3    |
    ^                           v
    |                      [next|tu_3] --> [next|tu_2] --> [next|primary] --> NULL
qword_106B9F0                        each entry: 16 bytes

Phase 4: Active TU Tracking

The global qword_106BA10 always points to the currently active TU descriptor. All compiler state -- parser globals, scope stack, symbol tables, error context -- corresponds to this TU. Switching the active TU requires a full context switch through switch_translation_unit.

Phase 5: Processing (5 Passes in fe_wrapup)

After parsing completes, fe_wrapup (sub_588F90) iterates the TU chain and performs 5 passes over all TUs:

Pass 1 (file_scope_il_wrapup): per-TU scope cleanup, cross-TU entity marking
Pass 2 (set_needed_flags_at_end_of_file_scope): compute needed-flags for entities
Pass 3 (mark_to_keep_in_il): mark entities to keep in the IL tree
Pass 4 (three sub-stages): clear unneeded instantiation flags, eliminate unneeded function bodies, eliminate unneeded IL entries
Pass 5 (file_scope_il_wrapup_part_3): final cleanup, scope assertion, re-run of passes 2-4 for the primary TU

Each pass switches to the target TU via switch_translation_unit before processing.
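
The switch-then-process pattern for a single pass can be sketched as follows. This is an illustration, not the decompiled fe_wrapup: the tu_desc type is reduced to a link and a counter, switch_to is a stand-in for switch_translation_unit that only tracks the active pointer, and run_pass and count_visit are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct tu_desc {
    struct tu_desc *next_tu;   /* chain link, as at [+0] */
    int visits;                /* bookkeeping for this sketch only */
} tu_desc;

tu_desc *active_tu = NULL;     /* stand-in for qword_106BA10 */
int switch_count = 0;

/* Stand-in for switch_translation_unit: no-op when already active. */
void switch_to(tu_desc *target)
{
    if (active_tu == target)
        return;
    active_tu = target;
    ++switch_count;
}

typedef void (*tu_pass_fn)(tu_desc *);

/* Run one pass over every TU in the chain, switching before processing. */
void run_pass(tu_desc *primary, tu_pass_fn pass)
{
    for (tu_desc *t = primary; t != NULL; t = t->next_tu) {
        switch_to(t);
        pass(t);
    }
}

/* Example pass body: checks that the pass runs with its TU active. */
void count_visit(tu_desc *t)
{
    assert(t == active_tu);
    ++t->visits;
}
```

Calling run_pass once per pass visits every TU with the right state active; the sketch leaves out the pass-specific work and the pass 5 re-run for the primary TU.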

Phase 6: Pop and Cleanup

After sub_588E90 (translation_unit_wrapup) and the main compilation passes complete, the TU is popped from the stack. The pop is inlined in process_translation_unit and mirrors pop_translation_unit_stack (sub_7A3F70):

Assert: stack_top->tu_ptr == current_tu (stack integrity check)
Decrement dword_106B9E8 if popped TU is not the primary TU
Move stack entry to free list (qword_12C7AB8)
If a previous TU remains on the stack, switch to it via switch_translation_unit

Registered Variable Mechanism

The registered variable mechanism is the save/restore system that makes TU switching possible. It works in three phases (registration, save, and restore), followed by a one-time fix-up for the primary TU.

Registration Phase

During frontend initialization, each subsystem calls f_register_trans_unit_variable to declare global variables that contain per-TU state. Each call creates a 40-byte registration record:

Registered Variable Entry (40 bytes)
  [0]   8   next               linked list pointer
  [8]   8   variable_address   pointer to the global variable
  [16]  8   variable_size      number of bytes to save/restore
  [24]  8   offset_in_storage  byte offset within the TU storage buffer
  [32]  8   offset_in_tu       byte offset within the TU descriptor (0 = none)

Registration accumulates the total storage buffer size in qword_12C7A98. Each variable gets assigned a sequential offset within the buffer:

// f_register_trans_unit_variable (sub_7A3C00), simplified
void f_register_trans_unit_variable(void *var_ptr, size_t size, size_t offset_in_tu) {
    assert(!registration_complete);   // dword_12C7A8C must be 0
    assert(var_ptr != NULL);

    reg_entry *record = alloc(40);
    record->next = NULL;
    record->variable_address = var_ptr;
    record->variable_size = size;
    record->offset_in_storage = per_tu_storage_size;  // qword_12C7A98
    record->offset_in_tu = offset_in_tu;

    // append to linked list
    if (!list_head) list_head = record;     // qword_12C7AA8
    if (list_tail) list_tail->next = record;
    list_tail = record;                      // qword_12C7AA0

    // align size to 8 bytes, accumulate
    size_t aligned = (size + 7) & ~7;
    per_tu_storage_size += aligned;          // qword_12C7A98
}

The third parameter offset_in_tu is non-zero only for variables that need a direct shortcut pointer within the TU descriptor itself. For example, the 1344-byte IL state aggregate at unk_126E600 registers with offset_in_tu = 384, so the qword at descriptor offset +384 holds a pointer to that aggregate: to the saved copy in the storage buffer while the TU is switched out, and to the live global while the TU is active. Most variables pass 0 (no shortcut needed).
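
The offset assignment can be exercised with a small standalone model. This is a sketch under stated assumptions: malloc stands in for the binary's allocator, the globals reg_list_head, reg_list_tail, and per_tu_storage_size model qword_12C7AA8, qword_12C7AA0, and qword_12C7A98, the registration_complete guard is omitted, and register_var is a hypothetical name that returns the record for inspection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mirrors the 40-byte registration record (LP64 assumed). */
typedef struct reg_entry {
    struct reg_entry *next;
    void   *variable_address;
    size_t  variable_size;
    size_t  offset_in_storage;
    size_t  offset_in_tu;
} reg_entry;

_Static_assert(sizeof(reg_entry) == 40, "record is 40 bytes on LP64");

reg_entry *reg_list_head = NULL;
reg_entry *reg_list_tail = NULL;
size_t per_tu_storage_size = 0;

reg_entry *register_var(void *var_ptr, size_t size, size_t offset_in_tu)
{
    assert(var_ptr != NULL);

    reg_entry *record = malloc(sizeof *record);
    assert(record != NULL);
    record->next = NULL;
    record->variable_address = var_ptr;
    record->variable_size = size;
    record->offset_in_storage = per_tu_storage_size;  /* next free slot */
    record->offset_in_tu = offset_in_tu;

    /* Append to the list. */
    if (reg_list_head == NULL)
        reg_list_head = record;
    if (reg_list_tail != NULL)
        reg_list_tail->next = record;
    reg_list_tail = record;

    /* Round the size up to 8 bytes and accumulate. */
    per_tu_storage_size += (size + 7) & ~(size_t)7;
    return record;
}
```

Registering variables of 4, 8, and 4 bytes in that order yields storage offsets 0, 8, and 16 and a total of 24 bytes, because each 4-byte variable still consumes an 8-byte slot. The real offsets in the binary depend on the order in which the subsystems register.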

Save Phase (save_translation_unit_state)

When switching away from a TU, sub_7A3A50 saves the current state:

// save_translation_unit_state (sub_7A3A50), simplified
void save_translation_unit_state(tu_desc *tu) {
    char *storage = tu->storage_buffer;     // tu[2]

    // Iterate all registered variables
    for (reg = registered_variable_list_head; reg; reg = reg->next) {
        // Copy current global value into storage buffer
        void *dst = storage + reg->offset_in_storage;
        memcpy(dst, reg->variable_address, reg->variable_size);

        // If this variable has a direct field in the TU descriptor,
        // store a pointer to the saved copy there
        if (reg->offset_in_tu != 0) {
            *(void **)((char *)tu + reg->offset_in_tu) = dst;
        }
    }

    // Save scope stack state (explicit fields)
    tu->prev_scope_state = xmmword_126EB60_high;  // tu[1], per the offset map
    tu->scope_saved_1 = qword_126EB70;   // tu[26]
    tu->scope_saved_2 = qword_126EBA0;   // tu[32]
    tu->scope_saved_3 = qword_126EBE0;   // tu[40]

    // Save file scope indices via sub_704490
    if (dword_126C5E4 != -1) {
        sub_704490(dword_126C5E4, 0, 0);
        // Walk scope stack entries, clear scope-to-TU back-pointers
        for (entry = scope_top; entry; entry -= 784) {
            if (entry[+192])
                *(int *)(entry[+192] + 240) = -1;
            if (!entry[+4]) break;
        }
    }
}

Restore Phase (switch_translation_unit)

When switching to a different TU, sub_7A3D60 restores its state:

// switch_translation_unit (sub_7A3D60), simplified
void switch_translation_unit(tu_desc *target) {
    assert(current_tu != NULL);       // qword_106BA10

    if (current_tu == target) return; // no-op if already active

    save_translation_unit_state(current_tu);  // save outgoing
    current_tu = target;              // qword_106BA10 = target

    char *storage = target->storage_buffer;

    // Iterate all registered variables in the same order as save;
    // only the copy direction is reversed
    for (reg = registered_variable_list_head; reg; reg = reg->next) {
        // Copy saved value from storage buffer back to global;
        // memcpy returns its destination, i.e. the live global
        void *live = memcpy(reg->variable_address,
                            storage + reg->offset_in_storage,
                            reg->variable_size);

        // Update shortcut pointer if present: it now points at the global
        if (reg->offset_in_tu != 0) {
            *(void **)((char *)target + reg->offset_in_tu) = live;
        }
    }

    // Restore scope stack state
    xmmword_126EB60_high = target[1];  // prev_scope_state
    qword_126EB70 = target[26];
    qword_126EBA0 = target[32];
    qword_126EBE0 = target[40];

    // Rebuild file scope indices via sub_704490
    if (dword_126C5E4 != -1) {
        // Recompute scope-to-TU back-pointers
        for (entry = scope_top; entry; entry -= 784) {
            if (entry[+192])
                *(int *)(entry[+192] + 240) = index_formula;
            if (!entry[+4]) break;
        }
        sub_704490(dword_126C5E4, 1, computed_flag);
    }
}

The key asymmetry between save and restore: memcpy direction is reversed. Save copies global -> storage_buffer. Restore copies storage_buffer -> global. The shortcut pointer (offset_in_tu) semantics also differ: during save it points into the storage buffer; during restore it points back to the global variable.
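
The copy-out/copy-in behavior can be demonstrated with a reduced model. This sketch assumes two hypothetical per-TU globals (g_error_count, g_scope_depth) and a fixed 64-byte storage block per TU; it omits the shortcut pointers and the scope-stack fields, and every name in it is illustrative rather than taken from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct reg_var {
    void  *addr;     /* address of the live global */
    size_t size;     /* bytes to save/restore */
    size_t offset;   /* slot within the per-TU storage block */
} reg_var;

typedef struct mini_tu {
    unsigned char storage[64];   /* stand-in for the storage buffer */
} mini_tu;

/* Two hypothetical per-TU globals. */
int  g_error_count;
long g_scope_depth;

const reg_var g_registered[] = {
    { &g_error_count, sizeof g_error_count, 0 },
    { &g_scope_depth, sizeof g_scope_depth, 8 },
};
#define N_REGISTERED (sizeof g_registered / sizeof g_registered[0])

mini_tu *g_current_tu = NULL;

/* Save: global -> storage buffer. */
void mini_save(mini_tu *tu)
{
    for (size_t i = 0; i < N_REGISTERED; ++i)
        memcpy(tu->storage + g_registered[i].offset,
               g_registered[i].addr, g_registered[i].size);
}

/* Switch: save the outgoing TU, then storage buffer -> global. */
void mini_switch(mini_tu *target)
{
    assert(g_current_tu != NULL);
    if (g_current_tu == target)
        return;
    mini_save(g_current_tu);
    g_current_tu = target;
    for (size_t i = 0; i < N_REGISTERED; ++i)
        memcpy(g_registered[i].addr,
               target->storage + g_registered[i].offset,
               g_registered[i].size);
}
```

After switching away from a TU and back again, the globals hold exactly the values they had when that TU was last active, which is the property the rest of the frontend relies on.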

Fix-Up Phase

After the primary TU's registered-variable defaults are first copied into its descriptor, fix_up_translation_unit (sub_7A3CF0) performs a one-time pass that writes variable-address pointers into the TU descriptor's shortcut fields:

// fix_up_translation_unit (sub_7A3CF0)
void fix_up_translation_unit(tu_desc *primary) {
    assert(primary->next_tu == NULL);  // must be the first TU

    for (reg = registered_variable_list_head; reg; reg = reg->next) {
        if (reg->offset_in_tu != 0) {
            *(void **)((char *)primary + reg->offset_in_tu) =
                reg->variable_address;
        }
    }
}

This ensures the primary TU's shortcut pointers point directly to the live global variables rather than the storage buffer, since the primary TU's globals are the "real" values (not copies).

TU Stack Operations

The TU stack supports nested TU switches. This is needed when processing an entity declared in a different TU requires temporarily switching to that TU's context.

Push (push_translation_unit_stack)

// sub_7A3EF0 -- push_translation_unit_stack
void push_translation_unit_stack(tu_desc *tu) {
    // Allocate stack entry from free list or fresh allocation
    stack_entry *entry;
    if (stack_entry_free_list) {          // qword_12C7AB8
        entry = stack_entry_free_list;
        stack_entry_free_list = entry->next;
    } else {
        entry = alloc(16);               // sub_6B7340(16)
        ++stack_entry_count;              // qword_12C7A80
    }

    entry->tu_ptr = tu;                  // [+8]
    entry->next = tu_stack_top;          // [+0] = qword_106BA18

    // If pushing a different TU than current, switch to it
    if (current_tu != tu)
        switch_translation_unit(tu);

    // Track depth for non-primary TUs
    if (tu != primary_tu)
        ++tu_stack_depth;                // dword_106B9E8

    tu_stack_top = entry;                // qword_106BA18
}

Pop (pop_translation_unit_stack)

// sub_7A3F70 -- pop_translation_unit_stack
void pop_translation_unit_stack(void) {
    stack_entry *top = tu_stack_top;      // qword_106BA18

    // Integrity assertion: top-of-stack TU must match current TU
    assert(top->tu_ptr == current_tu);    // top[+8] == qword_106BA10

    if (top->tu_ptr != primary_tu)
        --tu_stack_depth;                 // dword_106B9E8

    // Pop: move top to free list, advance stack
    stack_entry *prev = top->next;
    top->next = stack_entry_free_list;   // return to free list
    stack_entry_free_list = top;
    tu_stack_top = prev;                 // qword_106BA18

    // If another TU remains, switch to it
    if (prev)
        switch_translation_unit(prev->tu_ptr);
}

Push Entity's TU (push_entity_translation_unit)

A convenience function sub_7A3FE0 pushes the TU that owns a given entity:

// sub_7A3FE0 -- push_entity_translation_unit
int push_entity_translation_unit(entity *ent) {
    if (ent->flags_81 & 0x20)  return 0;  // anonymous entity, no TU
    tu_desc *owning_tu = get_entity_tu(ent);  // sub_741960
    if (owning_tu == current_tu)  return 0;   // already in correct TU

    push_translation_unit_stack(owning_tu);
    return 1;  // caller must pop when done
}
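
Because the function returns 1 only when it actually pushed, callers have to pair each successful push with a pop. A sketch of that calling pattern with simplified stand-ins: the entity and tu_desc types are reduced to the fields needed here, the owner field replaces the sub_741960 lookup, the saved[] array replaces the real TU stack, and with_entity_tu and current_tu_id are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

typedef struct tu_desc { int id; } tu_desc;

typedef struct entity {
    unsigned char flags_81;   /* bit 0x20: entity has no owning TU */
    tu_desc *owner;           /* stand-in for the sub_741960 lookup */
} entity;

tu_desc *cur_tu = NULL;       /* stand-in for qword_106BA10 */
tu_desc *saved[8];            /* simplified stack of previously active TUs */
int depth = 0;

int push_entity_tu(entity *ent)
{
    if (ent->flags_81 & 0x20) return 0;   /* no TU to switch to */
    if (ent->owner == cur_tu)  return 0;  /* already in the right TU */
    assert(depth < 8);
    saved[depth++] = cur_tu;
    cur_tu = ent->owner;
    return 1;                             /* caller must pop */
}

void pop_tu(void)
{
    assert(depth > 0);
    cur_tu = saved[--depth];
}

/* Run fn with the entity's TU active, restoring the previous TU after. */
int with_entity_tu(entity *ent, int (*fn)(entity *))
{
    int pushed = push_entity_tu(ent);
    int result = fn(ent);
    if (pushed)
        pop_tu();
    return result;
}

/* Example callback: reports which TU is active during the call. */
int current_tu_id(entity *ent)
{
    (void)ent;
    return cur_tu->id;
}
```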

TU Stack Entry Layout

TU Stack Entry (16 bytes)
  [0]   8   next            next entry in stack (toward bottom) or free list
  [8]   8   tu_desc_ptr     pointer to the TU descriptor

Stack entries are recycled through a free list (qword_12C7AB8). They are allocated by sub_6B7340 (general storage, not arena) on first use and never deallocated -- only returned to the free list on pop.
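
The recycling behavior can be sketched in isolation. This model keeps only the entry management: malloc stands in for sub_6B7340, the TU switch and the non-primary depth counter are left out, and the names stk_push, stk_pop, stk_top, stk_free, and stk_allocated are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct stack_entry {
    struct stack_entry *next;   /* [+0] next entry or next free entry */
    void *tu_ptr;               /* [+8] TU descriptor */
} stack_entry;

stack_entry *stk_top = NULL;    /* models qword_106BA18 */
stack_entry *stk_free = NULL;   /* models qword_12C7AB8 */
long stk_allocated = 0;         /* models qword_12C7A80 */

void stk_push(void *tu)
{
    stack_entry *entry;
    if (stk_free != NULL) {             /* reuse a recycled entry */
        entry = stk_free;
        stk_free = entry->next;
    } else {                            /* otherwise allocate a fresh one */
        entry = malloc(sizeof *entry);
        assert(entry != NULL);
        ++stk_allocated;
    }
    entry->tu_ptr = tu;
    entry->next = stk_top;
    stk_top = entry;
}

void *stk_pop(void)
{
    stack_entry *top = stk_top;
    assert(top != NULL);
    void *tu = top->tu_ptr;
    stk_top = top->next;
    top->next = stk_free;               /* return entry to the free list */
    stk_free = top;
    return tu;
}
```

In this model the allocation counter only grows when the free list is empty, so it tracks the peak stack depth rather than the number of pushes.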

TU Correspondence (24 bytes)

When processing multiple TUs in RDC mode, the frontend must track structural equivalence between types and declarations across TUs. Each correspondence is a 24-byte node:

Trans Unit Correspondence (24 bytes)
  [0]   8   next            linked list pointer
  [8]   8   ptr             pointer to the corresponding entity/type
  [16]  4   refcount        reference count (freed when decremented to 1)
  [20]  1   flag            correspondence type flag

Allocation uses a free list (qword_12C7AB0) with fallback to arena allocation (sub_6BA0D0(24)). The reference-counted deallocation in free_trans_unit_corresp (sub_7A3BB0) asserts that refcount > 0 before decrementing, and only pushes the node onto the free list when the count reaches 1 (not 0 -- the last reference is the free-list entry itself).

Global Variables

TU State

Global	Type	Identity
`qword_106BA10`	`tu_desc*`	`current_translation_unit` -- always points to the active TU
`qword_106B9F0`	`tu_desc*`	`primary_translation_unit` -- the first TU processed (head of chain)
`qword_12C7A90`	`tu_desc*`	`tu_chain_tail` -- last TU in the linked list
`qword_106BA18`	`stack_entry*`	`translation_unit_stack_top` -- top of the TU stack
`dword_106B9E8`	`int32`	`tu_stack_depth` -- number of non-primary TUs on the stack
`qword_106BA00`	`char*`	`current_filename` -- source file name for the active TU
`dword_106BA08`	`int32`	`is_recompilation` -- 1 if this TU is being recompiled
`dword_106B9F8`	`int32`	`has_module_info` -- 1 if the active TU has module info

Registration Infrastructure

Global	Type	Identity
`qword_12C7AA8`	`reg_entry*`	`registered_variable_list_head`
`qword_12C7AA0`	`reg_entry*`	`registered_variable_list_tail`
`qword_12C7A98`	`size_t`	`per_tu_storage_size` -- accumulated total size of all registered variables (determines storage buffer allocation)
`dword_12C7A8C`	`int32`	`registration_complete` -- set to 1 when first TU is allocated; guards against late registration
`dword_12C7A88`	`int32`	`has_seen_module_tu` -- set to 1 when a TU with module info is processed

Free Lists and Counters

Global	Type	Identity
`qword_12C7AB8`	`stack_entry*`	`stack_entry_free_list`
`qword_12C7AB0`	`corresp*`	`corresp_free_list`
`qword_12C7A78`	`int64`	`tu_count` -- total TU descriptors allocated
`qword_12C7A80`	`int64`	`stack_entry_count` -- total stack entries allocated (not freed)
`qword_12C7A68`	`int64`	`registration_count` -- total registered variable entries
`qword_12C7A70`	`int64`	`corresp_count` -- total correspondence nodes allocated

Correspondence State

Global	Type	Identity
`dword_106B9E4`	`int32`	Correspondence state variable 1 (per-TU, registered)
`dword_106B9E0`	`int32`	Correspondence state variable 2 (per-TU, registered)
`qword_12C7798`	`int64`	Correspondence state variable 3 (per-TU, registered)
`qword_12C7800`	`qword[14]`	Correspondence hash table 1 (0x70 bytes)
`qword_12C7880`	`qword[14]`	Correspondence hash table 2 (0x70 bytes)
`qword_12C7900`	`qword[14]`	Correspondence hash table 3 (0x70 bytes)

Reset Functions

Two reset functions exist for different scopes:

reset_translation_unit_state (sub_7A4860) -- zeroes the 6 core runtime globals. Called during error recovery or frontend teardown:

qword_106BA10 = 0;  // current_tu
qword_106B9F0 = 0;  // primary_tu
qword_12C7A90 = 0;  // tu_chain_tail
dword_106B9F8 = 0;  // has_module_info
qword_106BA18 = 0;  // tu_stack_top
dword_106B9E8 = 0;  // tu_stack_depth

init_translation_unit_tracking (sub_7A48B0) -- zeroes all 13 tracking globals. Called during frontend initialization before any registrations:

qword_12C7AA8 = 0;  // registered_variable_list_head
qword_12C7AA0 = 0;  // registered_variable_list_tail
qword_12C7A98 = 0;  // per_tu_storage_size
dword_106BA08 = 0;  // is_recompilation
qword_106BA00 = 0;  // current_filename
qword_12C7AB8 = 0;  // stack_entry_free_list
qword_12C7AB0 = 0;  // corresp_free_list
qword_12C7A80 = 0;  // stack_entry_count
qword_12C7A78 = 0;  // tu_count
qword_12C7A68 = 0;  // registration_count
qword_12C7A70 = 0;  // corresp_count
dword_12C7A8C = 0;  // registration_complete
dword_12C7A88 = 0;  // has_seen_module_tu

Memory Statistics

The print_trans_unit_statistics function (sub_7A45A0) reports the allocation counts and total memory for the four structure types managed by the TU system:

Structure	Size	Counter	Storage
Trans unit correspondence	24 bytes	`qword_12C7A70`	Arena
Translation unit descriptor	424 bytes	`qword_12C7A78`	Arena (gen. storage)
TU stack entry	16 bytes	`qword_12C7A80`	General storage
Variable registration	40 bytes	`qword_12C7A68`	General storage

Function Map

Address	Identity	Confidence	Role
`sub_7A3A50`	`save_translation_unit_state`	HIGH	Save all per-TU globals to storage buffer
`sub_7A3B50`	`alloc_trans_unit_corresp`	HIGH	Allocate 24-byte correspondence node
`sub_7A3BB0`	`free_trans_unit_corresp`	HIGH	Reference-counted deallocation
`sub_7A3C00`	`f_register_trans_unit_variable`	DEFINITE	Register a global for per-TU save/restore
`sub_7A3CF0`	`fix_up_translation_unit`	DEFINITE	One-time shortcut pointer fix-up for primary TU
`sub_7A3D60`	`switch_translation_unit`	DEFINITE	Context-switch to a different TU
`sub_7A3EF0`	`push_translation_unit_stack`	HIGH	Push TU onto the stack
`sub_7A3F70`	`pop_translation_unit_stack`	DEFINITE	Pop TU from the stack
`sub_7A3FE0`	`push_entity_translation_unit`	MEDIUM-HIGH	Push the TU that owns a given entity
`sub_7A40A0`	`process_translation_unit`	DEFINITE	Main entry: allocate, init, parse, cleanup
`sub_7A45A0`	`print_trans_unit_statistics`	HIGH	Memory usage report for TU subsystem
`sub_7A4690`	`register_builtin_trans_unit_variables`	HIGH	Register the 3 core per-TU globals
`sub_7A4860`	`reset_translation_unit_state`	DEFINITE	Zero 6 runtime globals
`sub_7A48B0`	`init_translation_unit_tracking`	DEFINITE	Zero all 13 tracking globals

Assertions

The TU system contains 8 assertion sites (calls to sub_4F2930 with source path trans_unit.c):

Line	Function	Condition	Meaning
163	`free_trans_unit_corresp`	`refcount > 0`	Correspondence double-free
227	`f_register_trans_unit_variable`	`!registration_complete`	Variable registered after first TU allocated
230	`f_register_trans_unit_variable`	`var_ptr != NULL`	NULL pointer passed for registration
469	`fix_up_translation_unit`	`primary_tu->next == NULL`	Fix-up called on non-primary TU
514	`switch_translation_unit`	`current_tu != NULL`	Switch attempted with no active TU
556	`pop_translation_unit_stack`	`stack_top->tu == current_tu`	Stack/active TU mismatch
696	`process_translation_unit`	`!(a3==NULL && has_seen_module)`	Non-module TU after module TU
725	`process_translation_unit`	`is_recompilation` (when primary and first TU)	Primary TU must be a recompilation

Cross-References

RDC Mode -- multi-TU compilation: correspondence system, cross-TU IL copying, module ID generation
Frontend Wrapup -- the 5-pass post-processing architecture that iterates the TU chain
Scope Entry -- 784-byte scope stack entries saved/restored during TU switches
Entity Node -- entities carry a back-pointer to their owning TU (extracted via sub_741960)
IL Overview -- the IL tree rooted in each TU's file scope
Pipeline Overview -- where process_translation_unit sits in the full pipeline

Type Node

The type node is the fundamental type representation in EDG's intermediate language (IL). Every C++ type -- from int to const volatile std::vector<std::pair<int, float>>*& -- is represented as a 176-byte type node allocated by alloc_type (sub_5E3D40 in il_alloc.c). Type nodes form the backbone of the type system: every variable, field, routine, expression, and parameter carries a pointer to its type node. The 128 type-query leaf functions in types.c alone account for approximately 4,448 call sites.

The type node is a discriminated union. The type_kind byte at offset +132 selects one of 22 type kinds (0-21), and certain type kinds trigger allocation of a separate supplementary structure (class_type_supplement, routine_type_supplement, etc.) that hangs off the type node at offset +152. The base 176 bytes contain the common header shared with all IL entries (96 bytes), the type discriminator, qualifier flags, size/alignment, and type-kind-specific inline payload fields.

Key Facts

Property	Value
Allocation size	176 bytes (IL entry size)
Allocator	`sub_5E3D40` (`alloc_type`), `il_alloc.c`
Kind setter	`sub_5E2E80` (`set_type_kind`), 22 cases
In-place reinit	`sub_5E3590` (`init_type_fields`), no allocation
Counter global	`qword_126F8E0`
Stats label	`"type"` (176 bytes each)
Region	file-scope only (always `dword_126EC90`)
Type query functions	128 leaf functions in `types.c` (`0x7A4940`-`0x7C02A0`)
Most-called query	`is_class_or_struct_or_union_type` (`sub_7A8A30`, 407 call sites)
Source files	`il_alloc.c` (allocation), `types.c` (queries/construction)

Memory Layout

Raw Allocation vs Returned Pointer

Like all IL entries, the raw allocation includes a 16-byte prefix that is hidden from the returned pointer. The allocator returns raw + 16, so all field offsets documented below are relative to this returned pointer.

Raw allocation (192 bytes total):
  raw+0   [8 bytes]   TU copy address (zeroed, ptr[-2])
  raw+8   [8 bytes]   next-in-list link (zeroed, ptr[-1])
  raw+16  [176 bytes] type node body (ptr+0 onward)

Prefix flags byte at ptr[-8]:
  bit 0 (0x01)  allocated
  bit 1 (0x02)  file_scope
  bit 3 (0x08)  language_flag (C++ mode)
  bit 7 (0x80)  keep_in_il (CUDA device marking)
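
The pointer arithmetic implied by the prefix can be sketched as follows. This assumes the flags byte sits 8 bytes before the returned pointer, as stated above; the helper and constant names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { IL_PREFIX_SIZE = 16, TYPE_NODE_BODY_SIZE = 176 };

/* Prefix flag bits as listed above. */
enum {
    IL_FLAG_ALLOCATED  = 0x01,
    IL_FLAG_FILE_SCOPE = 0x02,
    IL_FLAG_LANGUAGE   = 0x08,
    IL_FLAG_KEEP_IN_IL = 0x80
};

/* The allocator hands out raw + 16; all documented offsets start here. */
void *il_body_from_raw(void *raw)
{
    assert(raw != NULL);
    return (uint8_t *)raw + IL_PREFIX_SIZE;
}

/* Inverse mapping, needed to reach the hidden prefix again. */
void *il_raw_from_body(void *body)
{
    assert(body != NULL);
    return (uint8_t *)body - IL_PREFIX_SIZE;
}

/* Read the prefix flags byte located 8 bytes before the returned pointer. */
uint8_t il_prefix_flags(const void *body)
{
    return *((const uint8_t *)body - 8);
}

/* keep_in_il marks entries that belong to CUDA device code. */
int il_keep_in_il(const void *body)
{
    return (il_prefix_flags(body) & IL_FLAG_KEEP_IN_IL) != 0;
}
```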

Complete Field Map

The 176 bytes of the type node body divide into three regions: the common IL header (bytes 0-95), the type discriminator and qualifier zone (bytes 96-135), and the type-kind-specific payload (bytes 136-175).

Offset  Size  Field                  Description
------  ----  -----                  -----------
+0      96    common_il_header       Shared with all IL entry types (see below)
+96     24    (continuation of       Source position, declaration metadata,
              common header area)    scope/name linkage -- varies by IL kind
+120    8     type_size              Computed size of this type in bytes
+128    4     alignment              Required alignment (bytes)
+132    1     type_kind              Type discriminator (0-21, see table below)
+133    1     type_flags_1           bit 5 = is_dependent
+134    1     type_qual_flags        bit 0 = const, bit 1 = volatile
                                    (cleared by sub_5E3580: *(a1+134) &= 0xFC)
+135    1     (padding/reserved)
+136    8     (reserved/varies)      Kind-dependent inline storage
+144    8     referenced_type        For tk_pointer/tk_reference/tk_typedef:
                                      -> pointed-to/referenced/aliased type
                                    For tk_pointer_to_member: -> class type
                                    For tk_function: return type enum (2=void)
                                    For tk_integer (enum): underlying type ptr
+145    1     enum_flags             For tk_integer:
                                      bit 3 = scoped_enum
                                      bit 4 = is_bit_int_capable
+146    1     extended_int_flags     For tk_integer:
                                      bit 2 = is_BitInt
+147    1     (padding)
+148    1     (varies by kind)       For tk_class/struct/union: access default
                                    set_type_kind initializes to 1
+149    1     kind_init_byte         Flags initialized during set_type_kind
+150    2     (cleared by init)      Zeroed by init_type_fields_and_set_kind
+152    8     supplement_ptr         For tk_class/struct/union: -> class_type_supplement
                                    For tk_routine: -> routine_type_supplement
                                    For tk_integer: -> integer_type_supplement
                                    For tk_typeref: -> typeref_type_supplement
                                    For tk_template_param: -> templ_param_supplement
                                    For tk_pointer: member_pointer_flag
                                      bit 0 = is_member_pointer
                                      bit 1 = extended_member_ptr
                                    For tk_array: bound expression pointer
+153    1     array_flags            For tk_array:
                                      bit 0 = dependent_bound
                                      bit 1 = is_VLA
                                      bit 5 = star_modifier
+154    6     (varies)
+160    8     typedef_attr_kind      For tk_typeref: attribute kind value
                                    For tk_array: numeric bound value
+161    1     class_flags_1          For tk_class/struct/union:
                                      bit 0 = is_local_class
                                      bit 2 = no_name_linkage
                                      bit 4 = is_template_class
                                      bit 5 = is_anonymous
                                      bit 7 = has_nested_types
+162    1     typedef_flags          For tk_typeref:
                                      bit 0 = is_elaborated
                                      bit 6 = is_attributed
                                      bit 7 = has_addr_space
+163    1     class_flags_2          For tk_class/struct/union:
                                      bit 0 = extern_template_inst
                                      bit 3 = alignment_set
                                      bit 4 = is_scoped (for union)
+164    2     feature_flags          Target feature requirements
                                    (copied to byte_12C7AFC by
                                     record_type_features_used)
+166    2     (reserved)
+168    4     alignment_attr         Explicit alignment / packed attribute
+172    4     (tail padding)
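
For tooling that inspects a dumped type node, the fixed-position fields can be read by offset. A minimal sketch, assuming the node body is available as a 176-byte buffer on a host with the same byte order as the x86-64 target; the accessor names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Offsets relative to the returned type node pointer (from the field map). */
enum {
    TN_OFF_TYPE_SIZE  = 120,
    TN_OFF_ALIGNMENT  = 128,
    TN_OFF_TYPE_KIND  = 132,
    TN_OFF_FLAGS_1    = 133,
    TN_OFF_QUAL_FLAGS = 134
};

/* memcpy avoids unaligned or aliasing-unsafe loads from the raw buffer. */
uint64_t tn_size(const uint8_t *node)
{
    uint64_t v;
    memcpy(&v, node + TN_OFF_TYPE_SIZE, sizeof v);
    return v;
}

uint32_t tn_alignment(const uint8_t *node)
{
    uint32_t v;
    memcpy(&v, node + TN_OFF_ALIGNMENT, sizeof v);
    return v;
}

uint8_t tn_kind(const uint8_t *node)      { return node[TN_OFF_TYPE_KIND]; }
int tn_is_dependent(const uint8_t *node)  { return (node[TN_OFF_FLAGS_1] >> 5) & 1; }
int tn_is_const(const uint8_t *node)      { return node[TN_OFF_QUAL_FLAGS] & 0x01; }
int tn_is_volatile(const uint8_t *node)   { return (node[TN_OFF_QUAL_FLAGS] >> 1) & 1; }
```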

Common IL Header (Bytes 0-95)

The first 96 bytes are copied verbatim from the template globals (xmmword_126F6A0 through xmmword_126F6F0) during allocation. This template captures the current source file, line, and column position, and is refreshed as the parser advances. The header contains:

xmmword_126F6A0  [+0..+15]    scope/class pointer, name pointer (zeroed)
xmmword_126F6B0  [+16..+31]   declaration metadata (high qword zeroed)
xmmword_126F6C0  [+32..+47]   reserved (zeroed)
xmmword_126F6D0  [+48..+63]   reserved (zeroed)
xmmword_126F6E0  [+64..+79]   source position (from qword_126EFB8)
xmmword_126F6F0  [+80..+95]   low word = 4 (access default), high zeroed
qword_126F700    [+96..+103]  current source file reference

The source position at bytes +64..+79 allows error messages and diagnostics to reference the exact declaration point for each type.

Type Kind Enumeration

The type kind byte at offset +132 holds one of 22 values. The set_type_kind function (sub_5E2E80, 279 lines, il_alloc.c:2334) dispatches on this value to initialize type-kind-specific fields and allocate supplement structures where needed.

Value	Name	C++ Constructs	Supplement	Payload
0	`tk_none`	Placeholder / uninitialized	None	no-op
1	`tk_void`	`void`	None	no-op
2	`tk_integer`	`bool`, `char`, `short`, `int`, `long`, `long long`, `__int128`, `_BitInt(N)`, all `unsigned` variants, `wchar_t`, `char8_t`, `char16_t`, `char32_t`, enumerations	32-byte `integer_type_supplement`	`+144`=5 (default)
3	`tk_float`	`float`, `_Float16`, `__bf16`	None	format byte = 2
4	`tk_double`	`double`	None	format byte = 2
5	`tk_long_double`	`long double`, `__float128`	None	format byte = 2
6	`tk_pointer`	`T*`, member pointers (bit 0 of `+152`)	None	2 payload fields zeroed
7	`tk_routine`	Function type `int(int, float)`	64-byte `routine_type_supplement`	calling convention, params init
8	`tk_array`	`T[N]`, `T[]`, VLAs	None	size+flags zeroed
9	`tk_struct`	`struct S`	208-byte `class_type_supplement`	kind stored at supplement+100
10	`tk_class`	`class C`	208-byte `class_type_supplement`	kind stored at supplement+100
11	`tk_union`	`union U`	208-byte `class_type_supplement`	kind stored at supplement+100
12	`tk_typeref`	`typedef`, `using` alias, elaborated type specifiers	56-byte `typeref_type_supplement`	--
13	`tk_typeof`	`typeof(expr)`, `__typeof__`	None	zeroed
14	`tk_template_param`	`typename T`, template type/non-type/template parameters	40-byte `templ_param_supplement`	--
15	`tk_decltype`	`decltype(expr)`	None	zeroed
16	`tk_pack_expansion`	`T...` (parameter pack expansion)	None	zeroed
17	`tk_pack_expansion_alt`	Alternate pack expansion form	None	no-op
18	`tk_auto`	`auto`, `decltype(auto)`	None	no-op
19	`tk_rvalue_reference`	`T&&` (rvalue reference)	None	no-op
20	`tk_nullptr_t`	`std::nullptr_t`	None	no-op
21	`tk_reserved_21`	Reserved / unused	None	no-op

Reconciling set_type_kind with types.c query functions: The set_type_kind dispatch and the types.c query function catalog appear to disagree. In set_type_kind, case 7 allocates a routine supplement, case 0xD/13 is typeof, and case 0xE/14 is template_param; in the catalog, is_reference_type tests kind==7, is_pointer_to_member_type tests kind==13, and is_function_type tests kind==14. The set_type_kind switch is the authoritative source for allocation behavior: it is a 279-line DEFINITE-confidence function with the embedded error string "set_type_kind: bad type kind". The types.c catalog was reconstructed from runtime query patterns and may reflect a different numbering, or type kind values may be reassigned after initial allocation. The table above follows the set_type_kind dispatch numbering. The types.c query mappings are documented in the query function catalog below for cross-reference.

set_type_kind Dispatch Summary

switch (type_kind) {
  case 0, 1, 17..21:   // tk_none, tk_void, alt-pack, auto, rvalue_ref, nullptr, reserved
    break;             // no-op: simple types with no extra state

  case 2:              // tk_integer
    type->+144 = 5;    // default integer subkind
    supplement = alloc_in_file_scope_region(32);   // integer_type_supplement
    ++qword_126F8E8;
    supplement->+16 = source_position;
    type->+152 = supplement;
    break;

  case 3, 4, 5:        // tk_float, tk_double, tk_long_double
    type->format_byte = 2;   // IEEE format indicator
    break;

  case 6:              // tk_pointer
    type->+144 = 0;    // pointed-to type (to be set later)
    type->+152 = 0;    // member-pointer flags (cleared)
    break;

  case 7:              // tk_routine
    supplement = alloc_in_file_scope_region(64);   // routine_type_supplement
    ++qword_126F958;
    init_bitfield_struct(supplement+32);            // calling convention defaults
    type->+152 = supplement;
    break;

  case 8:              // tk_array
    type->+120 = 0;    // array total size (unknown)
    type->+152 = 0;    // bound expression (none)
    type->+153 &= mask; // clear array flags
    type->+160 = 0;    // numeric bound (none)
    break;

  case 9, 10, 11:      // tk_struct, tk_class, tk_union
    supplement = alloc_in_file_scope_region(208);  // class_type_supplement
    ++qword_126F948;
    init_class_type_supplement_fields(supplement);
    supplement->+100 = type_kind;                  // remember which class flavor
    type->+152 = supplement;
    break;

  case 12 (0xC):       // tk_typeref (typedef / using alias)
    supplement = alloc_in_file_scope_region(56);   // typeref_type_supplement
    ++qword_126F8F0;
    type->+152 = supplement;
    break;

  case 13 (0xD):       // tk_typeof
    type->+144 = 0;
    type->+152 = 0;
    break;

  case 14 (0xE):       // tk_template_param
    supplement = alloc_in_file_scope_region(40);   // templ_param_supplement
    ++qword_126F8F8;
    type->+152 = supplement;
    break;

  case 15 (0xF):       // tk_decltype
    type->+144 = 0;
    type->+152 = 0;
    break;

  case 16 (0x10):      // tk_pack_expansion
    type->+144 = 0;
    type->+152 = 0;
    break;

  default:
    internal_error("set_type_kind: bad type kind");
}
type->+132 = type_kind;

Supplement Structures

Seven type kinds (2, 7, 9, 10, 11, 12, and 14) trigger allocation of a supplementary structure, drawn from five supplement types. The supplement pointer lives at type node offset +152 and points to a separately-allocated block in the file-scope region.
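
The kind-to-supplement mapping from the set_type_kind dispatch can be condensed into a lookup. This is a sketch of the mapping only, not of the allocator; the function names are hypothetical:

```c
#include <assert.h>

/* Supplement size in bytes for a type kind, following the set_type_kind
 * dispatch numbering: 0 when the kind has no supplement, -1 when the kind
 * is outside 0..21 (the "bad type kind" case). */
long supplement_size_for_kind(int kind)
{
    switch (kind) {
    case 2:                     return 32;   /* integer_type_supplement */
    case 7:                     return 64;   /* routine_type_supplement */
    case 9: case 10: case 11:   return 208;  /* class_type_supplement   */
    case 12:                    return 56;   /* typeref_type_supplement */
    case 14:                    return 40;   /* templ_param_supplement  */
    default:
        return (kind >= 0 && kind <= 21) ? 0 : -1;
    }
}

/* Count how many of the 22 kinds carry a supplement. */
int kinds_with_supplement(void)
{
    int n = 0;
    for (int k = 0; k <= 21; ++k) {
        long size = supplement_size_for_kind(k);
        assert(size >= 0);
        if (size > 0)
            ++n;
    }
    return n;
}
```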

class_type_supplement (208 bytes)

Allocated for tk_struct (kind 9), tk_class (kind 10), and tk_union (kind 11). This is the richest supplement, carrying the full class definition metadata. Initialized by init_class_type_supplement_fields (sub_5E2D70, 40 lines) and init_class_type_supplement (sub_5E2C70, 42 lines).

Offset	Size	Field	Description
`+0`	8	scope_ptr	Pointer to the class scope (288-byte scope node)
`+8`	8	base_class_list	Head of linked list of base class entries (112 bytes each)
`+16`	8	friend_decl_list	Head of friend declaration list
`+24`	8	member_list_head	Member entity list (routines, variables, nested types)
`+32`	8	nested_type_list	Nested type definitions
`+40`	4	default_access	1 = public (struct/union), 2 = private (class)
`+44`	4	(reserved)
`+48`	8	using_decl_list	Using declarations in class scope
`+56`	8	static_data_members	Static data member list
`+64`	8	template_info	Template instantiation info (if template class)
`+72`	8	virtual_base_list	Virtual base class chain
`+80`	4	(flags)
`+84`	2	(reserved)
`+86`	1	class_property_flags	bit 0 = has_virtual_bases, bit 3 = has_user_conversion
`+88`	1	extended_flags	bit 5 = has_flexible_array
`+96`	8	vtable_ptr	Virtual function table pointer
`+100`	4	class_kind	Copy of type_kind (9, 10, or 11)
`+104`	8	destructor_ptr	Pointer to destructor entity
`+112`	8	copy_ctor_ptr	Copy constructor entity
`+120`	8	move_ctor_ptr	Move constructor entity
`+128`	8	(scope chain)
`+136`	8	conversion_functions	User-defined conversion operator list
`+144`	8	befriending_classes	List of classes that befriend this class
`+152`	8	deduction_guides	Deduction guide list (C++17)
`+160`	8	(reserved)
`+168`	8	(reserved)
`+176`	4	vtable_index	Virtual function table index, initialized to -1 (0xFFFFFFFF)
`+180`	4	(padding)
`+184`	8	(reserved)
`+192`	8	(reserved)
`+200`	8	(reserved)

Counter: qword_126F948, stats label "class type supplement".

routine_type_supplement (64 bytes)

Allocated for tk_routine (kind 7) by set_type_kind. Encodes the function signature metadata.

Offset	Size	Field	Description
`+0`	8	param_type_list	Head of parameter type linked list (80 bytes each)
`+8`	8	return_type	Return type pointer
`+16`	8	exception_spec	Exception specification pointer (16 bytes)
`+24`	8	(reserved)
`+32`	4	calling_convention	Calling convention bitfield (initialized by set_type_kind)
`+36`	4	param_count	Number of parameters
`+40`	4	flags	Variadic, noexcept, trailing-return, etc.
`+44`	4	(reserved)
`+48`	8	(reserved)
`+56`	8	(reserved)

Counter: qword_126F958, stats label "routine type supplement".
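
The table translates directly into a C struct whose layout can be checked at compile time. A sketch, with pointer fields modeled as uint64_t so the layout matches the 64-bit target regardless of host; the reserved field names are placeholders:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the 64-byte routine_type_supplement from the table above. */
struct routine_type_supplement {
    uint64_t param_type_list;     /* +0  head of parameter type list   */
    uint64_t return_type;         /* +8  return type pointer           */
    uint64_t exception_spec;      /* +16 exception specification       */
    uint64_t reserved_24;         /* +24                               */
    uint32_t calling_convention;  /* +32 calling convention bitfield   */
    uint32_t param_count;         /* +36 number of parameters          */
    uint32_t flags;               /* +40 variadic, noexcept, ...       */
    uint32_t reserved_44;         /* +44                               */
    uint64_t reserved_48;         /* +48                               */
    uint64_t reserved_56;         /* +56                               */
};

/* Compile-time checks that the model agrees with the documented layout. */
static_assert(sizeof(struct routine_type_supplement) == 64,
              "supplement must be 64 bytes");
static_assert(offsetof(struct routine_type_supplement, calling_convention) == 32,
              "calling_convention must sit at +32");
static_assert(offsetof(struct routine_type_supplement, param_count) == 36,
              "param_count must sit at +36");
```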

Each parameter in the param_type_list is an 80-byte param_type node (allocated by alloc_param_type, sub_5E1D40, free-list recycled from qword_126F678). Parameter types form a singly-linked list through their +0 field.
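
Walking the parameter list is an ordinary singly-linked traversal through the +0 link. A minimal sketch, assuming only the +0 next pointer and the +8 declared-type pointer (the latter from the integration table later on this page); the remaining bytes are left opaque and the names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Partial model of a param_type node (80 bytes on an LP64 host). */
struct param_type {
    struct param_type *next;   /* +0  next parameter in the list        */
    void *type;                /* +8  parameter's declared type         */
    unsigned char rest[64];    /* +16 remaining fields, not modeled     */
};

/* Count parameters by following the +0 link until it is NULL. */
size_t count_params(const struct param_type *head)
{
    size_t n = 0;
    for (const struct param_type *p = head; p != NULL; p = p->next)
        ++n;
    return n;
}
```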

integer_type_supplement (32 bytes)

Allocated for tk_integer (kind 2). Represents the properties of integral and enumeration types.

Offset	Size	Field	Description
`+0`	4	integer_subkind	Subkind identifier (values 1-12, default 5)
`+4`	4	bit_width	Width in bits (for `_BitInt(N)`)
`+8`	4	signedness	0=unsigned, 1=signed (lookup via `byte_E6D1B0`)
`+12`	4	(reserved)
`+16`	8	source_position	Source position at allocation time
`+24`	8	underlying_type	For enums: pointer to the underlying integer type

Counter: qword_126F8E8, stats label "integer type supplement".

The integer_subkind field distinguishes between the various integer types. Known subkind values from type query analysis:

Subkind	Type
1-10	Standard integer types (`bool` through `long long`)
11	`_Float16` / extended
12	`__int128` / extended

typeref_type_supplement (56 bytes)

Allocated for tk_typeref (kind 12 = 0xC in set_type_kind). Links the typedef/using-alias to its referenced declaration and tracks elaborated type specifier properties.

Offset	Size	Field	Description
`+0`	8	referenced_decl	The declaration this typedef names
`+8`	8	original_type	The original type before typedef expansion
`+16`	8	scope_ptr	Scope in which the typedef was declared
`+24`	8	(reserved)
`+32`	8	attribute_info	Attribute specifier chain
`+40`	8	template_info	Template argument list (for alias templates)
`+48`	8	(reserved)

Counter: qword_126F8F0, stats label "typeref type supplement".

The elaborated type specifier kind is encoded in type_node+162:

bit 0: is_elaborated (uses struct/class/union/enum keyword)
bit 6: is_attributed (carries [[...]] attributes)
bit 7: has_addr_space (CUDA address space attribute)

The constant 0x18C2 (= bits {1,6,7,11,12}) is used as a bitmask in is_incomplete_type_deep (sub_7A6580) to identify the set of elaborated type specifier kinds.
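
A constant like this is typically used as a small-set membership test: shift by the value under test and check the low bit. The sketch below shows that idiom and confirms the bit decomposition; which value the binary actually indexes with is not established here, and the names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define MASK_0x18C2 0x18C2u

/* Returns 1 when `value` is one of the bits set in 0x18C2 ({1,6,7,11,12}). */
int in_mask_0x18C2(unsigned value)
{
    /* Guard the shift: the mask only has bits below 16. */
    return value < 16 && ((MASK_0x18C2 >> value) & 1u);
}

/* Rebuild the mask from its listed bit positions. */
uint32_t rebuild_mask(void)
{
    static const unsigned bits[] = {1, 6, 7, 11, 12};
    uint32_t m = 0;
    for (unsigned i = 0; i < sizeof bits / sizeof bits[0]; ++i)
        m |= 1u << bits[i];
    assert(m == MASK_0x18C2);
    return m;
}
```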

templ_param_supplement (40 bytes)

Allocated for tk_template_param (kind 14 = 0xE in set_type_kind). Represents a template type parameter (typename T), non-type template parameter, or template template parameter.

Offset	Size	Field	Description
`+0`	4	param_index	Zero-based index in the template parameter list
`+4`	4	param_depth	Nesting depth (0 for outermost template)
`+8`	4	param_kind	0=type, 1=non-type, 2=template-template
`+12`	4	(reserved)
`+16`	8	constraint	Associated constraint expression (C++20 concepts)
`+24`	8	default_arg	Default template argument (type or expression)
`+32`	8	(reserved)

Counter: qword_126F8F8, stats label "templ param supplement".

Type Qualifier Encoding

CV-qualifiers are not stored as separate type nodes (unlike some compiler designs). Instead, they are encoded as bit flags within the type node itself. The primary qualifier storage is at offset +134:

Byte at type+134 (type_qual_flags):
  bit 0 (0x01)   const
  bit 1 (0x02)   volatile

The function clear_type_qualifier_bits (sub_5E3580) performs *(a1+134) &= 0xFC to strip both const and volatile.
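
The effect of the 0xFC mask is easy to check in isolation: bits 0 and 1 are cleared and the remaining six bits of the byte are preserved. A sketch operating on a raw node buffer; the function name mirrors the one above, but the body is a model rather than decompiled code:

```c
#include <assert.h>
#include <stdint.h>

enum { TN_OFF_QUAL_FLAGS = 134, TN_QUAL_CONST = 0x01, TN_QUAL_VOLATILE = 0x02 };

/* Clear const and volatile; keep every other bit of the byte at +134. */
void clear_type_qualifier_bits(uint8_t *node)
{
    assert(node != 0);
    node[TN_OFF_QUAL_FLAGS] &= 0xFC;
}
```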

Additional qualifier information is accessed through the prefix flags byte at ptr[-8]:

bit 5 (0x20): __restrict qualifier (has_restrict_qualifier, sub_7A7850)
bit 6 (0x40): volatile qualifier duplicate (has_volatile_qualifier, sub_7A7890)

The function get_cv_qualifiers (sub_7A9E70, 319 call sites) accumulates cv-qualifier bits by walking through typedef chains, applying a & 0x7F mask to collect all qualifier bits from each layer.
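
The accumulation can be pictured as an OR over every layer of the typedef chain. The sketch below is a simplified model, not the decompiled function: it assumes each layer exposes one qualifier byte (the hypothetical qual_bits field) and that typedef layers are kind 12, as in the set_type_kind numbering:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a type node; field names are illustrative. */
struct type_model {
    uint8_t kind;                         /* 12 = typedef layer        */
    uint8_t qual_bits;                    /* per-layer qualifier byte  */
    const struct type_model *referenced;  /* aliased type for kind 12  */
};

/* OR together the low 7 bits of every layer, ending at the first
 * non-typedef type. */
unsigned collect_qualifiers(const struct type_model *t)
{
    unsigned quals = 0;
    while (t != NULL) {
        quals |= t->qual_bits & 0x7Fu;
        if (t->kind != 12)
            break;
        t = t->referenced;
    }
    return quals;
}
```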

Type Query Function Catalog

The types.c file (address range 0x7A4940-0x7C02A0) contains approximately 250 functions. Of these, 128 are tiny leaf functions that query type node properties. They follow a canonical pattern:

// Canonical type query pattern
bool is_<property>_type(type_node *type) {
    while (type->type_kind == 12)        // skip typedefs
        type = type->referenced_type;
    return type->type_kind == <expected>;  // or flag check
}
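
A concrete instance of the pattern, written against a reduced two-field model of the type node. The struct is a stand-in for illustration and the kind values follow the set_type_kind numbering (8 = array, 9-11 = class family, 12 = typedef):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced model: only the two fields the pattern touches. */
struct type_model {
    uint8_t type_kind;                         /* models byte +132  */
    const struct type_model *referenced_type;  /* models qword +144 */
};

/* Strip typedef layers (kind 12) until a real type is reached. */
const struct type_model *skip_typedef_layers(const struct type_model *t)
{
    while (t->type_kind == 12) {
        assert(t->referenced_type != NULL);  /* a typedef must alias something */
        t = t->referenced_type;
    }
    return t;
}

int is_array_type(const struct type_model *t)
{
    return skip_typedef_layers(t)->type_kind == 8;
}

int is_class_or_struct_or_union_type(const struct type_model *t)
{
    uint8_t k = skip_typedef_layers(t)->type_kind;
    return k >= 9 && k <= 11;
}
```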

Most-Referenced Query Functions

Sorted by call site count across the entire binary:

Callers	Function	Address	Test
407	`is_class_or_struct_or_union_type`	`0x7A8A30`	kind in {9, 10, 11}
389	`type_pointed_to`	`0x7A9910`	kind==6, return `+144`
319	`get_cv_qualifiers`	`0x7A9E70`	accumulate qualifier bits (& 0x7F)
299	`is_dependent_type`	`0x7A6B60`	bit 5 of byte `+133`
243	`is_object_pointer_type`	`0x7A7630`	kind==6 && !(bit 0 of `+152`)
221	`is_array_type`	`0x7A8370`	kind==8
199	`is_member_pointer_or_ref`	`0x7A7B30`	kind==6 && (bit 0 of `+152`)
185	`is_reference_type`	`0x7A6AC0`	kind==7
169	`is_function_type`	`0x7A8DC0`	kind==14
140	`is_void_type`	`0x7A6E90`	kind==1
126	`array_element_type` (deep)	`0x7A9350`	strips arrays+typedefs recursively
85	`is_enum_type`	`0x7A7010`	kind==2 (with scoped check)
82	`is_integer_type`	`0x7A71B0`	kind==2
77	`is_member_pointer_flag`	`0x7A7810`	kind==6, bit 0 of `+152`
76	`is_pointer_to_member_type`	`0x7A8D90`	kind==13
70	`is_long_double_type`	`0x7A73F0`	kind==5
62	`is_scoped_enum_type`	`0x7A70F0`	kind==2, bit 3 of `+145`
56	`is_rvalue_reference_type`	`0x7A6EF0`	kind==19

Typedef Stripping Functions

Six functions strip typedef layers with different stopping conditions:

Function	Address	Behavior
`skip_typedefs`	`0x7A68F0`	Strips all typedef layers, preserves cv-qualifiers
`skip_named_typedefs`	`0x7A6930`	Stops at unnamed typedefs
`skip_to_attributed_typedef`	`0x7A6970`	Stops at typedef with attribute flag
`skip_typedefs_and_attributes`	`0x7A69C0`	Strips both typedefs and attributed-typedefs
`skip_to_elaborated_typedef`	`0x7A6A10`	Stops at typedef with elaborated flag
`skip_non_attributed_typedefs`	`0x7A6A70`	Stops at typedef with any attribute bits

Compound Type Predicates

Function	Address	Type Kinds
`is_arithmetic_type`	`0x7A7560`	{2, 3, 4, 5}
`is_scalar_type`	`0x7A7BA0`	{2, 3, 4, 5, 6(non-member), 13, 19, 20}
`is_aggregate_type`	`0x7A8B40`	{8, 9, 10, 11}
`is_floating_point_type`	`0x7A7300`	{3, 4, 5}
`is_pack_or_auto_type`	`0x7A7420`	{16, 17, 18}
`is_pack_expansion_type`	`0x7A6BE0`	{16, 17}
`is_complete_type`	`0x7A6DA0`	Not void, not reference, not incomplete class
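
Kind-set predicates like these can be modeled as bitsets over the 22 kind values. The binary may well use chained comparisons instead; the bitset is a design choice for this sketch, typedef stripping is omitted, and the names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define KIND_BIT(k) (1u << (k))

/* Kind sets copied from the table above. */
enum {
    KINDS_ARITHMETIC   = KIND_BIT(2) | KIND_BIT(3) | KIND_BIT(4) | KIND_BIT(5),
    KINDS_FLOATING     = KIND_BIT(3) | KIND_BIT(4) | KIND_BIT(5),
    KINDS_AGGREGATE    = KIND_BIT(8) | KIND_BIT(9) | KIND_BIT(10) | KIND_BIT(11),
    KINDS_PACK_OR_AUTO = KIND_BIT(16) | KIND_BIT(17) | KIND_BIT(18),
    KINDS_PACK         = KIND_BIT(16) | KIND_BIT(17)
};

/* Membership test; kinds outside 0..21 never match. */
int kind_in_set(unsigned kind, uint32_t set)
{
    return kind <= 21 && ((set >> kind) & 1u);
}

int is_arithmetic_kind(unsigned kind) { return kind_in_set(kind, KINDS_ARITHMETIC); }
int is_floating_kind(unsigned kind)   { return kind_in_set(kind, KINDS_FLOATING); }
int is_aggregate_kind(unsigned kind)  { return kind_in_set(kind, KINDS_AGGREGATE); }
```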

Duplicate Functions

EDG uses distinct function names for semantic clarity even when the implementations are identical. The identical bodies were not folded at build time, so each name keeps its own address:

0x7A7630 == 0x7A7670 == 0x7A7750 (all: is_non_member_pointer / is_object_pointer_type)
0x7A7B00 == 0x7A7B70 (both: is_pointer_type)
0x7A78D0 == 0x7A7910 (both: is_non_const_ref)

Type Construction

alloc_type (`sub_5E3D40`)

The primary type allocation function. Takes a single argument: the type kind. Returns a pointer to a fully-initialized 176-byte type node with the appropriate supplement structure allocated and linked.

Protocol:

Trace enter (if dword_126EFC8 set)
Allocate the raw block via region_alloc in the file-scope region (16-byte prefix + 176-byte body, 192 bytes total)
Write 16-byte prefix (TU copy addr, next link, flags byte)
Increment qword_126F8E0 (type counter)
Copy 96-byte common IL header from template globals
Set default access to 1 at +148
Dispatch set_type_kind switch for the requested kind
Trace leave (if tracing)
Return raw + 16
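
The protocol can be approximated with the standard allocator. This is a sketch under several assumptions: calloc stands in for the region allocator, a static buffer stands in for the header template globals, the prefix flags written are "allocated" and "file_scope", and the per-kind set_type_kind work is reduced to storing the kind byte. All names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { PREFIX_SIZE = 16, BODY_SIZE = 176, HEADER_SIZE = 96 };

static uint8_t header_template[HEADER_SIZE];  /* models xmmword_126F6A0..126F6F0 */
static uint64_t type_count;                   /* models qword_126F8E0 */

uint64_t type_count_sketch(void) { return type_count; }

/* Returns a pointer to the 176-byte body, or NULL on allocation failure. */
uint8_t *alloc_type_sketch(uint8_t kind)
{
    assert(kind <= 21);
    uint8_t *raw = calloc(1, PREFIX_SIZE + BODY_SIZE);  /* prefix is zeroed */
    if (raw == NULL)
        return NULL;
    uint8_t *node = raw + PREFIX_SIZE;       /* callers only see raw + 16     */
    node[-8] = 0x01 | 0x02;                  /* assumed: allocated|file_scope */
    ++type_count;
    memcpy(node, header_template, HEADER_SIZE);  /* common IL header          */
    node[148] = 1;                           /* default access                */
    node[132] = kind;                        /* full dispatch omitted         */
    return node;
}

/* Release a node by stepping back over the hidden prefix. */
void free_type_sketch(uint8_t *node)
{
    if (node != NULL)
        free(node - PREFIX_SIZE);
}
```

The real allocator never frees individual nodes; free_type_sketch exists only so the example does not leak.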

init_type_fields (`sub_5E3590`)

Re-initializes an existing type node in-place without allocating new memory. Used when a type node needs to change kind after initial allocation (rare but occurs during template instantiation). Copies the template header and dispatches the same set_type_kind switch.

make_cv_combined_type (`sub_7A6320`)

Constructs a new type that combines cv-qualifiers from two source types. Recursively handles arrays (recurses on element type) and pointer-to-member (recurses on member type). Allocates a fresh type node via alloc_type, copies the base type via sub_5DA0A0, then applies the combined qualifiers via sub_5D64F0.

Type Comparison

types_are_identical (`sub_7AA150`)

The main type comparison function (636 lines). Handles all 22 type kinds with deep structural comparison. For class types, delegates to the class scope comparison infrastructure. For function types, compares parameter lists, return types, and calling conventions.

types_are_equivalent_for_correspondence (`sub_7B2260`)

A 688-line function used during multi-TU compilation (CUDA RDC mode). Compares types across translation units for structural equivalence, called from verify_class_type_correspondence (sub_7A00D0).

compatible_ms_bit_field_container_types (`sub_7C02A0`)

The last function in types.c. Checks if two integer types are compatible for MSVC bit-field container layout rules: both must be kind==2 (integer) with matching size at offset +120.

Pointer and Reference Encoding

Pointers use type kind 6 (tk_pointer), with member-pointer status distinguished by flag bits at offset +152:

tk_pointer (kind 6):
  +144  referenced_type   The pointed-to / referenced type
  +152  bit 0 = 0         Object pointer (T*)
        bit 0 = 1         Member pointer (T C::*)
        bit 1             Extended member pointer flag

The types.c query functions use the following kind tests for pointer/reference classification. Note that the kind values tested here follow the types.c query numbering (see the reconciliation note after the type kind table):

Query	Kind Test	+152 Test	Matches
`is_pointer_type`	kind==6	--	`T*`, `T C::*`
`is_object_pointer_type`	kind==6	!(bit 0)	`T*` only
`is_member_pointer_flag`	kind==6	bit 0	`T C::*` only
`is_reference_type`	kind==7	--	`T&` (lvalue reference)
`is_rvalue_reference_type`	kind==19	--	`T&&`
`is_pointer_to_member_type`	kind==13	--	`T C::*` (alternate encoding)

The pm_class_type (sub_7A9A10) and pm_member_type (sub_7A99D0) access +144 and +152 respectively for kind-13 nodes.
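
The kind-6 rows of the table reduce to a kind check plus one bit test at +152. A sketch over a raw 176-byte node buffer; typedef stripping is omitted for brevity and the function names are illustrative rather than recovered symbols:

```c
#include <assert.h>
#include <stdint.h>

enum { TN_OFF_KIND = 132, TN_OFF_PTR_FLAGS = 152, TK_POINTER = 6 };

/* Any kind-6 node: object pointers and flagged member pointers alike. */
int is_pointer_kind(const uint8_t *node)
{
    return node[TN_OFF_KIND] == TK_POINTER;
}

/* T* only: kind 6 with bit 0 of +152 clear. */
int is_object_pointer(const uint8_t *node)
{
    return is_pointer_kind(node) && !(node[TN_OFF_PTR_FLAGS] & 0x01);
}

/* T C::* only: kind 6 with bit 0 of +152 set. */
int is_member_pointer(const uint8_t *node)
{
    return is_pointer_kind(node) && (node[TN_OFF_PTR_FLAGS] & 0x01);
}
```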

Array Type Encoding

Array types (kind 8) store bounds inline in the type node:

tk_array (kind 8):
  +120  type_size          Total array size in bytes (0 if unknown)
  +128  alignment          Element alignment
  +144  element_type       Pointer to the element type node
  +152  bound_expr         Bound expression pointer (for VLAs and dependent)
  +153  array_flags:
          bit 0 = dependent_bound    (template-dependent array size)
          bit 1 = is_VLA             (C99 variable-length array)
          bit 5 = star_modifier      (C99 [*] syntax)
  +160  numeric_bound      Compile-time bound value (when not VLA/dependent)

The function identical_array_type_level (sub_7A4E10, types.c:6779) compares two array types by checking the VLA flag, dependent flag, and then either bound expressions (via sub_5D2160) or numeric bounds at +160.
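
The comparison order described above can be sketched as follows. The struct is a reduced model of the array fields, and pointer equality stands in for the bound-expression comparison done by sub_5D2160; both are assumptions, as is the exact order of the two flag checks:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced model of the array-specific fields of a type node. */
struct array_model {
    int is_vla;              /* models bit 1 of +153 */
    int dependent_bound;     /* models bit 0 of +153 */
    const void *bound_expr;  /* models +152          */
    uint64_t numeric_bound;  /* models +160          */
};

/* Returns 1 when the two array levels have matching bounds. */
int identical_array_level_sketch(const struct array_model *a,
                                 const struct array_model *b)
{
    assert(a != NULL && b != NULL);
    if (a->is_vla != b->is_vla)
        return 0;
    if (a->dependent_bound != b->dependent_bound)
        return 0;
    if (a->is_vla || a->dependent_bound)
        return a->bound_expr == b->bound_expr;  /* stand-in for sub_5D2160 */
    return a->numeric_bound == b->numeric_bound;
}
```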

Class Type Flags

Class types (kinds 9, 10, 11) carry two flag bytes at offsets +161 and +163 in the type node, plus property flags in the class_type_supplement at supplement offset +86:

type_node+161 (class_flags_1)

Bit	Mask	Field	Query Function
0	`0x01`	is_local_class	`is_local_class_type` (`0x7A8EE0`)
2	`0x04`	no_name_linkage	`ttt_is_type_with_no_name_linkage` (`0x7A4B40`)
4	`0x10`	is_template_class	`is_template_class_type` (`0x7A8EA0`)
5	`0x20`	is_anonymous	`is_non_anonymous_class_type` tests !(bit 5) (`0x7A8A90`)
7	`0x80`	has_nested_types	--

type_node+163 (class_flags_2)

Bit	Mask	Field	Query Function
0	`0x01`	extern_template_inst	`is_empty_class` checks this (`0x7A` range)
3	`0x08`	alignment_set	--
4	`0x10`	is_scoped	`is_scoped_union_type` (`0x7A8B00`)

class_type_supplement+86

Bit	Mask	Field	Query Function
0	`0x01`	has_virtual_bases	`class_has_virtual_bases` (`0x7A8BC0`)
3	`0x08`	has_user_conversion	`class_has_user_conversion` (`0x7A8C00`)

Type Size and Layout

type_size_and_alignment (sub_7A8020, 132 lines) computes the size and alignment of a type for ABI purposes. The computed size is stored at type_node offset +120 and alignment at +128.

For class types, the major layout computation is performed by compute_type_layout (sub_7B6350, 1107 lines), which handles:

Base class sub-object placement
Virtual base class offsets
Member field alignment and padding
Bit-field packing (with MSVC compatibility via compatible_ms_bit_field_container_types)
Empty base optimization

Integration with Other IL Nodes

Type nodes are referenced from virtually every other IL entity:

IL Node	Offset	Description
Variable entity (232B)	`+112`	Variable's declared type
Field entity (176B)	`+112`	Field's declared type
Routine entity (288B)	`+112`	Function's type (kind 7 with routine_type_supplement)
Expression node (72B)	`+16`	Expression result type
Parameter type (80B)	`+8`	Parameter's declared type
Constant (184B)	`+112`	Constant's type
Template argument (64B)	`+32`	Type argument value (when kind=0)

Allocation Statistics

In a typical CUDA compilation, the stats dump (sub_5E99D0) reports type node counts in the thousands. The type node and each supplement kind have their own stats label, per-entry size, and counter global:

type                    176 bytes each   (qword_126F8E0)
integer type supplement  32 bytes each   (qword_126F8E8)
routine type supplement  64 bytes each   (qword_126F958)
class type supplement   208 bytes each   (qword_126F948)
typeref type supplement  56 bytes each   (qword_126F8F0)
templ param supplement   40 bytes each   (qword_126F8F8)
param type               80 bytes each   (qword_126F960, free-list recycled)

Type nodes are always allocated in the file-scope region (persistent for the entire translation unit) because types must outlive any individual function body. This contrasts with expression nodes and statements which can be allocated in per-function regions and freed after each function is processed.

Template Instance Record

The template instance record is the 128-byte structure that represents a pending or completed template instantiation in cudafe++ (EDG 6.6). Every template entity that may require instantiation -- function templates, class member function templates, variable templates -- gets one of these records allocated by alloc_template_instance (sub_7416E0). The records are chained into a singly-linked worklist for function/variable templates (qword_12C7740). A separate worklist of type entries (not instance records) at qword_12C7758 tracks pending class template instantiations. A fixpoint loop at translation-unit end drains both lists, instantiating entities until no new work remains.

This page documents the instance record layout, the master instance info record, the two worklists, the depth-tracking mechanisms, the parser state save/restore during instantiation, and the fixpoint algorithm that ties everything together.

Key Facts

Property	Value
Instance record size	128 bytes (allocated by `sub_7416E0`, `alloc_template_instance`)
Master instance info size	32 bytes (allocated by `sub_7416A0`, `alloc_master_instance_info`)
Allocation counter (instances)	`qword_12C74F0` (incremented on each 128-byte allocation)
Allocation counter (master info)	`qword_12C74E8` (incremented on each 32-byte allocation)
Memory allocator	`sub_6BA0D0` (EDG arena allocator)
Function/variable worklist head	`qword_12C7740`
Function/variable worklist tail	`qword_12C7738`
Class worklist head	`qword_12C7758`
Fixpoint entry point	`sub_78A9D0` (`template_and_inline_entity_wrapup`)
Worklist walker	`sub_78A7F0` (`do_any_needed_instantiations`)
Decision gate	`sub_774620` (`should_be_instantiated`)
Source file	`templates.c` (EDG 6.6, path `edg/EDG_6.6/src/templates.c`)

Instance Record Layout (128 bytes)

Each record is allocated by alloc_template_instance (sub_7416E0) and zero-initialized. The allocator clears all 128 bytes, then initializes offsets +84 and +92 from qword_126EFB8 (the current source position context). The low nibble of byte +81 is explicitly masked to zero (*(_BYTE *)(result + 81) &= 0xF0).

Offset	Size	Field	Description
`+0`	8	`entity_primary`	Primary entity pointer (the instantiation's own symbol)
`+8`	8	`next`	Next entry in the pending worklist (singly-linked)
`+16`	8	`inst_info`	Pointer to 32-byte master instance info record (see below)
`+24`	8	`master_symbol`	Canonical template symbol -- the entity being instantiated from
`+32`	8	`actual_decl`	Declaration entity in the instantiation context
`+40`	8	`cached_decl`	Cached declaration for function-local templates (partial specialization lookup result)
`+48`	8	`referencing_namespace`	Namespace that triggered the instantiation (set by `determine_referencing_namespace`, `sub_75D5B0`)
`+56`	8	(reserved)	Zero-initialized, usage not observed
`+64`	8	`body_flags`	Deferred/deleted function body flags
`+72`	8	`pre_computed_result`	Result from a prior instantiation attempt (non-null skips re-instantiation)
`+80`	1	`flags`	Status bitfield (see flags table below)
`+81`	1	`flags2`	Secondary flags (bit 0 = on_worklist, bit 1 = warning_emitted)
`+84`	8	`source_position_1`	Source location context at entry creation (from `qword_126EFB8`)
`+92`	8	`source_position_2`	Second source location context at entry creation (from `qword_126EFB8`)
`+104`	8	(reserved)	Zero-initialized
`+112`	8	(reserved)	Zero-initialized
`+120`	8	(reserved)	Zero-initialized
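
The table above can be cross-checked with a C declaration. The following is a minimal sketch, not the EDG definition: the type name, the use of `uint64_t` in place of pointers, and the two padding members are assumptions made only to reproduce the documented offsets. Packing is forced to 1 because the two source-position fields at +84 and +92 are not 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the 128-byte instance record.
 * Field names follow the layout table; pad_82 and pad_100 are assumed
 * padding, and pointers are modeled as uint64_t so the layout is the
 * same on any host. */
#pragma pack(push, 1)
typedef struct template_instance {
    uint64_t entity_primary;        /* +0   */
    uint64_t next;                  /* +8   */
    uint64_t inst_info;             /* +16  */
    uint64_t master_symbol;         /* +24  */
    uint64_t actual_decl;           /* +32  */
    uint64_t cached_decl;           /* +40  */
    uint64_t referencing_namespace; /* +48  */
    uint64_t reserved_56;           /* +56  */
    uint64_t body_flags;            /* +64  */
    uint64_t pre_computed_result;   /* +72  */
    uint8_t  flags;                 /* +80  */
    uint8_t  flags2;                /* +81  */
    uint8_t  pad_82[2];             /* +82  (assumed padding) */
    uint64_t source_position_1;     /* +84  (not 8-byte aligned) */
    uint64_t source_position_2;     /* +92  */
    uint8_t  pad_100[4];            /* +100 (assumed padding) */
    uint64_t reserved_104;          /* +104 */
    uint64_t reserved_112;          /* +112 */
    uint64_t reserved_120;          /* +120 */
} template_instance_t;
#pragma pack(pop)

/* Compile-time checks against the documented offsets. */
_Static_assert(sizeof(template_instance_t) == 128, "record is 128 bytes");
_Static_assert(offsetof(template_instance_t, flags) == 80, "flags at +80");
_Static_assert(offsetof(template_instance_t, source_position_1) == 84,
               "source_position_1 at +84");
_Static_assert(offsetof(template_instance_t, reserved_104) == 104,
               "reserved block starts at +104");
```

If any offset in the table were off by even one byte, the static assertions would fail at compile time, which makes a declaration like this a cheap consistency check while annotating the decompilation.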

Allocator Pseudocode

// sub_7416E0 — alloc_template_instance
template_instance_t *alloc_template_instance(void) {
    if (debug_tracing_enabled)
        trace_enter(5, "alloc_template_instance");

    template_instance_t *rec = arena_alloc(128);   // sub_6BA0D0
    alloc_count_instances++;                        // qword_12C74F0

    // Zero all fields
    rec->entity_primary    = NULL;    // +0
    rec->next              = NULL;    // +8
    rec->inst_info         = NULL;    // +16
    rec->master_symbol     = NULL;    // +24
    rec->actual_decl       = NULL;    // +32
    rec->cached_decl       = NULL;    // +40
    rec->ref_namespace     = NULL;    // +48
    rec->reserved_56       = NULL;    // +56
    rec->body_flags        = NULL;    // +64
    rec->precomputed       = NULL;    // +72
    rec->flags             = 0;       // +80
    rec->flags2           &= 0xF0;   // +81: clear low nibble
    rec->source_pos_1      = current_source_context;  // +84 from qword_126EFB8
    rec->source_pos_2      = current_source_context;  // +92 from qword_126EFB8
    rec->reserved_104      = NULL;    // +104
    rec->reserved_112      = NULL;    // +112
    rec->reserved_120      = NULL;    // +120

    if (debug_tracing_enabled)
        trace_leave();
    return rec;
}

Flags Byte at `+80`

Six bits are used. The byte is written by update_instantiation_required_flag (sub_7770E0) and read by do_any_needed_instantiations (sub_78A7F0) and should_be_instantiated (sub_774620).

Bit	Mask	Name	Meaning
0	`0x01`	`instantiation_required`	Entity needs instantiation (set by `update_instantiation_required_flag`)
1	`0x02`	`not_needed`	Entity was determined not to need instantiation (skip on worklist walk)
3	`0x08`	`explicit_instantiation`	Explicit `template` declaration triggered this entry
4	`0x10`	`suppress_auto`	Auto-instantiation suppressed (from `extern template` declaration)
5	`0x20`	`excluded`	Entity excluded from instantiation set
7	`0x80`	`can_be_instantiated_checked`	Pre-check (`f_entity_can_be_instantiated`) already performed; skip redundant check

Flags Byte at `+81`

Bit	Mask	Name	Meaning
0	`0x01`	`on_worklist`	Entry has been linked into the pending worklist
1	`0x02`	`warning_emitted`	Depth-limit warning already emitted for this entry

The on_worklist bit at +81 bit 0 is the guard that prevents double-insertion into the linked list. When add_to_instantiations_required_list links a record into the list (at qword_12C7740/qword_12C7738), it checks this bit first and sets it afterward. If the bit is already set, the function takes the "already on worklist" path, which may set the new_instantiations_needed fixpoint flag (dword_12C771C = 1) instead.
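
For annotation work it helps to have the two flag bytes as named constants. The sketch below assumes hypothetical `TI_` and `TI2_` prefixes (the names after the prefix come from the tables above) and adds a small helper that counts the bits in the combined mask.

```c
#include <assert.h>
#include <stdint.h>

/* Flags byte at +80 (constant names are hypothetical). */
enum {
    TI_INSTANTIATION_REQUIRED      = 0x01,
    TI_NOT_NEEDED                  = 0x02,
    TI_EXPLICIT_INSTANTIATION      = 0x08,
    TI_SUPPRESS_AUTO               = 0x10,
    TI_EXCLUDED                    = 0x20,
    TI_CAN_BE_INSTANTIATED_CHECKED = 0x80,
    /* Union of every documented bit at +80. */
    TI_USED_MASK = 0x01 | 0x02 | 0x08 | 0x10 | 0x20 | 0x80
};

/* Flags byte at +81 (constant names are hypothetical). */
enum {
    TI2_ON_WORKLIST     = 0x01,
    TI2_WARNING_EMITTED = 0x02
};

/* Count the set bits in one byte. */
static int count_bits(uint8_t v) {
    int n = 0;
    while (v) {
        n += v & 1;
        v >>= 1;
    }
    return n;
}
```

The documented bits at +80 combine to 0xBB, leaving bits 2 and 6 (mask 0x44) unassigned in this analysis.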

Master Instance Info Record (32 bytes)

Each template entity (class, function, or variable template) has exactly one master instance info record, allocated by alloc_master_instance_info (sub_7416A0). This record is shared across all instantiations of the same template and is stored at the template's associated scope info (scope_assoc + 16). The link between a 128-byte instance record and its master info is at instance +16.

Offset	Size	Field	Description
`+0`	8	`next`	Next master info in chain (linked list)
`+8`	8	`back_pointer`	Pointer back to the template instance record that owns this info
`+16`	8	`associated_scope`	Pointer to the associated scope/translation-unit data
`+24`	4	`pending_count`	Number of pending instantiations of this template (incremented/decremented by `update_instantiation_required_flag`)
`+28`	1	`flags`	Status bits (low nibble cleared on allocation)

Master Info Flags Byte at `+28`

Bit	Mask	Name	Meaning
0	`0x01`	`blocked`	Instantiation blocked (dependency cycle or `extern template`)
2	`0x04`	`has_instances`	At least one instantiation has been completed
3	`0x08`	`debug_checked`	Already checked by debug tracing path
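
The master info layout can be written down the same way. This is a sketch under stated assumptions: the type and constant names are hypothetical, pointers are modeled as `uint64_t`, and the three trailing bytes are assumed padding that rounds the record up to the 32 bytes the allocator requests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the 32-byte master instance info record. */
typedef struct master_instance_info {
    uint64_t next;             /* +0  */
    uint64_t back_pointer;     /* +8  */
    uint64_t associated_scope; /* +16 */
    int32_t  pending_count;    /* +24 */
    uint8_t  flags;            /* +28 */
    uint8_t  pad_29[3];        /* +29 (assumed padding) */
} master_instance_info_t;

/* Flag bits at +28 (constant names are hypothetical). */
enum {
    MI_BLOCKED       = 0x01,
    MI_HAS_INSTANCES = 0x04,
    MI_DEBUG_CHECKED = 0x08
};

_Static_assert(sizeof(master_instance_info_t) == 32, "record is 32 bytes");
_Static_assert(offsetof(master_instance_info_t, pending_count) == 24,
               "pending_count at +24");
_Static_assert(offsetof(master_instance_info_t, flags) == 28, "flags at +28");
```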

Allocator Pseudocode

// sub_7416A0 — alloc_master_instance_info
master_instance_info_t *alloc_master_instance_info(void) {
    master_instance_info_t *info = arena_alloc(32);   // sub_6BA0D0
    alloc_count_master_info++;                         // qword_12C74E8

    info->next             = NULL;    // +0
    info->back_pointer     = NULL;    // +8
    info->associated_scope = NULL;    // +16
    info->pending_count    = 0;       // +24
    info->flags           &= 0xF0;   // +28: clear low nibble

    return info;
}

find_or_create_master_instance (sub_753550)

This function connects a 128-byte instance record to its shared master info. It looks up the template's scope association, checks whether a master info record already exists at scope_assoc + 16, and creates one if absent.

// sub_753550 — find_or_create_master_instance
void find_or_create_master_instance(template_instance_t *inst) {
    entity_t *sym = inst->master_symbol;            // inst[3], offset +24
    scope_t  *scope = resolve_template_scope(sym);  // sub_73DE50

    // Find the template's canonical entity
    entity_t *canonical;
    if (is_variable(sym))                           // ((kind - 7) & 0xFD) == 0
        canonical = *find_variable_correspondence(scope);   // sub_79AAA0
    else
        canonical = *find_function_correspondence(scope);   // sub_79FD80
    assert(canonical != NULL, "find_or_create_master_instance");

    scope_assoc_t *assoc = canonical->scope_assoc;  // offset +96
    master_instance_info_t *info = assoc->master_info;  // assoc + 16

    if (info == NULL) {
        // First instantiation of this template — allocate master info
        info = alloc_master_instance_info();         // sub_7416A0
        info->back_pointer = inst;                   // info[1] = inst

        if (sym != inst->actual_decl) {
            // Class members: add to secondary deferred list
            // qword_12C7750 / qword_12C7748
            append_to_deferred_list(info);
        }
        assoc->master_info = info;                   // assoc + 16

        if (debug_tracing_enabled) {
            trace("find_or_create_master_instance: symbol:");
            print_symbol(inst->master_symbol);
        }
    }

    inst->inst_info = info;                          // inst[2], offset +16
}

The Two Worklists

Template instantiation uses two separate worklists -- one for class templates, one for function/variable templates. This separation is fundamental to correctness: class templates must be instantiated before function templates within each fixpoint iteration, because function template bodies may reference members of class template instantiations.

Function/Variable Worklist (`qword_12C7740`)

Global	Purpose
`qword_12C7740`	Head of the singly-linked list
`qword_12C7738`	Tail pointer (for O(1) append)

Entries are 128-byte instance records linked through +8 (next pointer). New entries are appended at the tail by add_to_instantiations_required_list (the tail section of sub_7770E0).

Class Worklist (`qword_12C7758`)

This list holds type entries (not 128-byte instance records) that need class template instantiation. Entries are linked through offset +0 of the type entry. The list is populated by update_instantiation_flags (sub_789EF0) and drained by template_and_inline_entity_wrapup (sub_78A9D0).

Worklist Insertion: add_to_instantiations_required_list

The tail portion of update_instantiation_required_flag (sub_7770E0, starting at the label checking inst->flags2 & 0x01) implements worklist insertion:

// Tail of sub_7770E0 — add_to_instantiations_required_list
void add_to_instantiations_required_list(template_instance_t *inst) {
    if (inst->flags2 & 0x01) {
        // Already on the worklist — do not re-add.
        // But if instantiation mode is active and instantiation_required is set,
        // signal that a new fixpoint pass is needed.
        if (instantiation_mode_active
            && (inst->flags & 0x01)
            && inst->inst_info != NULL
            && !(inst->inst_info->flags & 0x01))     // not blocked
        {
            new_instantiations_needed = 1;            // dword_12C771C
            tu_ptr->needs_recheck = 1;                // TU + 393
        }
        return;
    }

    // Link into the function/variable worklist
    if (pending_list_head)                            // qword_12C7740
        pending_list_tail->next = inst;               // qword_12C7738->next
    else
        pending_list_head = inst;

    pending_list_tail = inst;
    inst->flags2 |= 0x01;                            // mark as on worklist

    // Verify correct translation unit
    tu_t *tu = trans_unit_for_symbol(inst->master_symbol);  // sub_741960
    assert(tu == current_tu,
           "add_to_instantiations_required_list: symbol for wrong translation unit");
}
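
The insertion logic reduces to a tail-append guarded by a one-bit membership flag. The following self-contained sketch shows just that pattern; `node_t`, `worklist_t`, and both function names are hypothetical, and the fixpoint signalling and translation-unit assertion from the pseudocode above are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    struct node  *next;    /* mirrors the +8 link of the instance record */
    unsigned char flags2;  /* bit 0 = on_worklist */
} node_t;

typedef struct {
    node_t *head;          /* plays the role of qword_12C7740 */
    node_t *tail;          /* plays the role of qword_12C7738 */
} worklist_t;

/* Append n unless it is already linked.
 * Returns true if the node was appended, false if the guard bit was set. */
static bool worklist_append(worklist_t *wl, node_t *n) {
    if (n->flags2 & 0x01)
        return false;             /* already on the worklist */
    n->next = NULL;
    if (wl->head)
        wl->tail->next = n;       /* O(1) append through the tail pointer */
    else
        wl->head = n;
    wl->tail = n;
    n->flags2 |= 0x01;            /* mark as on worklist */
    return true;
}

/* Count the nodes currently linked into the list. */
static size_t worklist_length(const worklist_t *wl) {
    size_t count = 0;
    for (const node_t *n = wl->head; n != NULL; n = n->next)
        count++;
    return count;
}
```

Because the guard bit lives in the record itself, membership is tested in constant time and a record can never appear twice in the chain, which would otherwise create a cycle in a singly-linked list.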

The Fixpoint Loop

The fixpoint loop is the algorithm that drives all template instantiation in cudafe++. It runs at the end of each translation unit, after parsing is complete, and iterates until no new instantiation work remains. The entry point is template_and_inline_entity_wrapup (sub_78A9D0).

Algorithm

template_and_inline_entity_wrapup (sub_78A9D0):

    assert(tu_stack_top == 0)          // qword_106BA18: not nested in another TU
    assert(compilation_phase == 2)     // dword_126EFB4: full compilation mode

    LOOP:
        FOR EACH translation_unit IN tu_list (qword_106B9F0):

            set_up_tu_context(tu)      // sub_7A3EF0

            // PHASE 1 — Class templates (from qword_12C7758)
            for entry in class_worklist:
                if is_dependent_type(entry)       continue
                if !is_class_or_struct(entry)     continue
                f_instantiate_template_class(entry)

            // PHASE 2 — Enable instantiation mode
            instantiation_mode_active = 1         // dword_12C7730

            // PHASE 3 — Function/variable templates
            do_any_needed_instantiations()        // sub_78A7F0

            tear_down_tu_context()                // sub_7A3F70

        // PHASE 4 — Check fixpoint condition
        new_instantiations_needed = 0             // dword_12C771C
        FOR EACH translation_unit IN tu_list:
            if tu->needs_recheck:                 // TU + 393
                tu->needs_recheck = 0
                set_up_tu_context(tu)
                do_any_needed_instantiations()
                // process inline entities
                tear_down_tu_context()
                additional_pass_needed = 1        // dword_12C7718

            if new_instantiations_needed:
                GOTO LOOP                         // restart fixpoint

The fixpoint is necessary because instantiating one template can trigger references to other not-yet-instantiated templates. For example, instantiating std::vector<Foo> may require instantiating std::allocator<Foo>, Foo's copy constructor, comparison operators, and any other templates used in std::vector's implementation. Each such reference may add a new entry to the worklist, which the next pass will discover and process. The loop terminates when a complete pass produces no new entries.
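
The termination argument is easiest to see on a toy model. The sketch below is not the EDG algorithm; it assumes a hypothetical `item_t` where "instantiating" an item marks at most one other item as required, and it repeats whole passes until a pass marks nothing new.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool required;  /* needs instantiation */
    bool done;      /* already instantiated */
    int  dep;       /* index of the item this one references, or -1 */
} item_t;

/* Run passes over items[0..n-1] until a pass marks nothing new.
 * Returns the number of passes executed. */
static int run_fixpoint(item_t *items, int n) {
    int passes = 0;
    bool changed;
    do {
        changed = false;
        passes++;
        for (int i = 0; i < n; i++) {
            if (!items[i].required || items[i].done)
                continue;
            items[i].done = true;            /* "instantiate" item i */
            int d = items[i].dep;
            if (d >= 0 && !items[d].required) {
                items[d].required = true;    /* newly discovered work */
                changed = true;              /* request another pass */
            }
        }
    } while (changed);
    return passes;
}
```

Each item can flip from not required to required only once, so the number of passes is bounded by the number of items plus one. When every dependency points to an earlier index, each pass discovers exactly one new item, which is the worst case for this model.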

Worklist Walker: do_any_needed_instantiations

sub_78A7F0 performs a linear walk over the function/variable worklist. For each entry, it applies a series of rejection filters, and if the entry passes all of them, dispatches to instantiate_template_function_full (sub_775E00).

// sub_78A7F0 — do_any_needed_instantiations
void do_any_needed_instantiations(void) {
    template_instance_t *entry = pending_list_head;   // qword_12C7740

    while (entry) {
        // 1. Already processed?
        if (entry->flags & 0x02) {                    // not_needed
            entry = entry->next;
            continue;
        }

        // 2. Get master instance info
        master_instance_info_t *info = entry->inst_info;  // offset +16
        assert(info != NULL, "do_any_needed_instantiations");

        // 3. Blocked by dependency?
        if (info->flags & 0x01) {                     // blocked
            entry = entry->next;
            continue;
        }

        // 4. Check if debug-verified
        if (!(info->flags & 0x08))                    // not debug_checked
            f_is_static_or_inline_template_entity(entry);  // sub_756B40

        // 5. Pre-check if not already done
        if (!(entry->flags & 0x80))                   // can_be_instantiated not checked
            f_entity_can_be_instantiated(entry);      // sub_7574B0

        // 6. Mode filter
        if (compilation_mode != 1                     // dword_106C094
            && !(entry->flags & 0x01))                // not instantiation_required
        {
            entry = entry->next;
            continue;
        }

        // 7. Blocked after pre-check?
        if (info->flags & 0x01) {
            entry = entry->next;
            continue;
        }

        // 8. Decision gate
        if (!should_be_instantiated(entry, 1)) {      // sub_774620
            entry = entry->next;
            continue;
        }

        // 9. Instantiate
        instantiate_template_function_full(entry, 1); // sub_775E00
        entry = entry->next;
    }
}

The walk is a simple forward traversal. Entries appended during instantiation (by new add_to_instantiations_required_list calls from within instantiated function bodies) that land after the current position will be visited on this same pass. Entries that land before the current position, or entries whose status changes after being skipped, are caught by the next fixpoint iteration.
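
The "visited on this same pass" behaviour follows from reading the next pointer only after the current entry has been processed. A minimal sketch, assuming a hypothetical `entry_t` whose `spawns` field stands in for a new entry that instantiating the current one would add:

```c
#include <assert.h>
#include <stddef.h>

typedef struct entry {
    struct entry *next;
    struct entry *spawns;      /* entry that processing this one appends, or NULL */
    int           visit_order; /* 0 = not visited yet */
} entry_t;

typedef struct {
    entry_t *head;
    entry_t *tail;
} list_t;

/* Append e at the tail of the list. */
static void append(list_t *l, entry_t *e) {
    e->next = NULL;
    if (l->head)
        l->tail->next = e;
    else
        l->head = e;
    l->tail = e;
}

/* One forward pass. Entries appended while walking are reached in the
 * same pass because e->next is read after the append has happened. */
static int walk_once(list_t *l) {
    int visited = 0;
    for (entry_t *e = l->head; e != NULL; e = e->next) {
        e->visit_order = ++visited;
        if (e->spawns != NULL) {
            append(l, e->spawns);
            e->spawns = NULL;
        }
    }
    return visited;
}
```

With a list `a -> b` where `a` spawns `c`, a single call visits `a`, `b`, and then `c`.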

Decision Gate: should_be_instantiated

sub_774620 (326 lines) is the final filter before instantiation. It implements a nine-step chain of checks. An entity must pass every step to be instantiated.

// sub_774620 — should_be_instantiated
bool should_be_instantiated(template_instance_t *inst, int check_implicit) {
    master_instance_info_t *info = inst->inst_info;     // +16
    assert(info != NULL, "should_be_instantiated");

    // 1. Blocked?
    if (info->flags & 0x01)          return false;

    // 2. Excluded?
    if (inst->flags & 0x20)          return false;      // excluded

    // 3. Explicit but not required?
    if (inst->flags & 0x08) {                           // explicit_instantiation
        if (!(inst->flags & 0x01))   return false;      // not marked required
    }

    // 4. Not required and not in normal mode?
    if (!(inst->flags & 0x01) && compilation_mode != 1)
        return false;

    // 5. Has valid master_symbol?
    entity_t *master = inst->master_symbol;             // +24
    if (!master)                     return false;

    // 6. Entity kind check
    //    Function: kind 10/11 (member), kind 9 (namespace-scope)
    //    Variable: kind 7
    //    class-local function: kind 17 (lambda)
    int kind = master->kind;                            // master + 80
    // ... kind-specific filtering ...

    // 7. Template body available?
    //    Check that the template has a cached body to replay
    if (!has_template_body(inst))     return false;

    // 8. Implicit include?
    if (check_implicit && implicit_include_enabled) {
        do_implicit_include_if_needed(inst);            // sub_754A70
        // re-check body availability after include
    }

    // 9. Depth limit warning (diagnostics 489/490)
    if (approaching_depth_limit(inst)) {
        if (!(inst->flags2 & 0x02)) {                   // warning not yet emitted
            emit_warning(489 or 490, inst->master_symbol);
            inst->flags2 |= 0x02;                       // mark warning emitted
        }
        return false;
    }

    return true;
}
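
Steps 1 through 4 depend only on the two flag bytes and the compilation mode, so they can be expressed as a pure predicate. The sketch below covers just that prefix of the chain; the function name and the decision to pass raw bytes are assumptions, and the later steps (entity kind, body availability, implicit include, depth warning) are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag-only prefix of the decision gate.
 * inst_flags:       byte at instance +80
 * info_flags:       byte at master info +28
 * compilation_mode: value of dword_106C094 */
static bool passes_flag_checks(uint8_t inst_flags, uint8_t info_flags,
                               int compilation_mode) {
    if (info_flags & 0x01)                            /* 1. blocked */
        return false;
    if (inst_flags & 0x20)                            /* 2. excluded */
        return false;
    if ((inst_flags & 0x08) && !(inst_flags & 0x01))  /* 3. explicit, not required */
        return false;
    if (!(inst_flags & 0x01) && compilation_mode != 1) /* 4. not required, wrong mode */
        return false;
    return true;
}
```

Note that an entry without the instantiation_required bit can still pass when the compilation mode is 1, provided it is not an explicit instantiation.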

Depth Tracking

Template instantiation depth is tracked at two levels -- a global counter for function templates and a per-type counter for class templates -- plus a pending-instantiation counter that detects runaway expansion.

Function Template Depth: `qword_12C76E0`

A single global counter incremented on entry to instantiate_template_function_full and decremented on exit. Hard limit: 255 (0xFF).

// Inside sub_775E00 (instantiate_template_function_full):
if (instantiation_depth >= 0xFF) {                // qword_12C76E0
    emit_fatal_error(/* depth exceeded */);
    goto restore_state;
}
instantiation_depth++;
// ... perform instantiation ...
instantiation_depth--;

The 255 limit is a safety valve against infinite recursive template metaprogramming. Consider:

template<int N>
struct factorial {
    static constexpr int value = N * factorial<N-1>::value;
};

The primary template above has no terminating specialization, so without a depth limit an instantiation such as factorial<256> would keep recursing past N = 0, each level re-entering the parser to process the template body. At depth 255, EDG aborts with a fatal error rather than risk a stack overflow. The C++ standard (Annex B) recommends that implementations support at least 1,024 recursively nested template instantiations, but EDG defaults to 255 as a practical limit -- configurable via qword_106BD10.
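
The guard itself is a plain increment/decrement pair around the recursive work. The following sketch models it with an ordinary recursive C function; `g_depth`, `MAX_FUNCTION_DEPTH`, and `instantiate_nested` are hypothetical names, and the fatal error is reduced to a return code.

```c
#include <assert.h>

#define MAX_FUNCTION_DEPTH 0xFF   /* limit used in the snippet above */

static long g_depth = 0;          /* stands in for qword_12C76E0 */

/* Simulate an instantiation that triggers levels_remaining further
 * nested instantiations. Returns 0 on success, -1 if the limit is hit. */
static int instantiate_nested(int levels_remaining) {
    if (g_depth >= MAX_FUNCTION_DEPTH)
        return -1;                /* depth exceeded: refuse to go deeper */
    g_depth++;
    int rc = 0;
    if (levels_remaining > 0)
        rc = instantiate_nested(levels_remaining - 1);
    g_depth--;                    /* always balanced, even on failure */
    return rc;
}
```

In this model a chain of 255 nested calls (depths 0 through 254) succeeds and the 256th is rejected. The counter returns to zero on both paths because every increment is paired with a decrement.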

Class Template Depth: Per-Type Counter at `type_entry + 56`

Each class type entry has its own depth counter at offset +56. The limit is read from qword_106BD10 (the same configurable limit, typically 500). This per-type design is critical: it prevents one deeply-nested class hierarchy from blocking all other class instantiations.

// Inside sub_777CE0 (f_instantiate_template_class):
uint32_t depth = type_entry->depth_counter;       // type_entry + 56
if (depth >= max_depth_limit) {                   // qword_106BD10
    emit_error(456, decl);
    type_entry->flags |= 0x01;                    // mark completed
    goto restore_state;
}
type_entry->depth_counter++;
// ... perform class instantiation ...
type_entry->depth_counter--;

Pending Instantiation Counter: increment/decrement/too_many

Three functions manage a per-type pending-instantiation counter that detects exponential expansion of template instantiation work.

increment_pending_instantiations (sub_75D740): dispatches on the entity kind byte at entity + 80 to locate the owning type entry, then increments the counter at type_entry + 56.

decrement_pending_instantiations (sub_75D7C0): mirror of the above, decrements.

too_many_pending_instantiations (sub_75D6A0): compares the counter against qword_106BD10. If the threshold is met, emits diagnostic 456 and returns true to abort the instantiation.

// sub_75D6A0 — too_many_pending_instantiations
bool too_many_pending_instantiations(entity_t *entity, entity_t *context,
                                     source_pos_t *pos) {
    type_entry_t *type = resolve_owning_type(entity);  // dispatch on entity->kind
    assert(type != NULL, "too_many_pending_instantiations");

    uint32_t count = type->pending_counter;            // type + 56
    if (count >= max_depth_limit) {                    // qword_106BD10
        emit_error(456, pos, context);
        return true;
    }
    return false;
}

The entity-kind dispatch is identical across all three functions:

Entity kind (byte `+80`)	Type entry field	Raw access
4, 5 (template member function)	`entity->scope_assoc->field_80`	`*(entity->assoc + 96)->offset_80`
6 (type alias template)	`entity->scope_assoc->field_32`	`*(entity->assoc + 96)->offset_32`
9, 10 (namespace function, class)	`entity->scope_assoc->field_56`	`*(entity->assoc + 96)->offset_56`
19-22 (class member types)	`entity->type_info`	`entity->offset_88`
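
The dispatch in the table can be summarized as a switch on the kind byte. This is a sketch only: the function name and the two sentinel values are hypothetical, and the returned number is the field offset inside the scope association record as listed above, not a dereferenced pointer.

```c
#include <assert.h>

enum {
    KIND_UNKNOWN = -1,  /* kind not covered by the table */
    KIND_DIRECT  = -2   /* kinds 19-22: type taken from entity + 88 */
};

/* Map an entity kind to the scope-association field offset that the
 * three pending-instantiation helpers consult for it. */
static int owning_type_field_offset(int kind) {
    switch (kind) {
    case 4: case 5:
        return 80;
    case 6:
        return 32;
    case 9: case 10:
        return 56;
    case 19: case 20: case 21: case 22:
        return KIND_DIRECT;
    default:
        return KIND_UNKNOWN;
    }
}
```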

Depth Limit Counter at `type_entry + 432`

Inside update_instantiation_required_flag (sub_7770E0), a secondary counter at type_entry + 432 (a 16-bit word) tracks how many times an entity's instantiation-required flag has been toggled. When the counter reaches 200, diagnostic 599 is emitted once as a warning. From that point on (count > 199) the function returns early, so the instantiation is skipped entirely. This catches oscillating patterns where two mutually-dependent templates keep adding and removing each other from the worklist.

// Inside sub_7770E0, compilation_mode == 1 path:
if (!setting_required && is_function_or_variable(master_symbol)) {
    int16_t toggle_count = *(int16_t *)(type_entry + 432);
    toggle_count++;
    *(int16_t *)(type_entry + 432) = toggle_count;
    if (toggle_count == 200)
        emit_warning(599, actual_decl);
    if (toggle_count > 199)
        return;  // stop oscillating
}
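
The two comparisons share one threshold: the equality test fires the warning exactly once, and the greater-than test cuts off every toggle from the 200th onward. A small simulation makes this concrete; `toggle_state_t`, `note_toggle`, and `simulate_toggles` are hypothetical names, and the warning is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int16_t toggle_count;  /* models the 16-bit word at type_entry + 432 */
    int     warnings;      /* how many times diagnostic 599 would fire */
} toggle_state_t;

/* Record one toggle. Returns true if the update may proceed,
 * false once the oscillation cutoff has been reached. */
static bool note_toggle(toggle_state_t *s) {
    s->toggle_count++;
    if (s->toggle_count == 200)
        s->warnings++;          /* warn exactly once */
    if (s->toggle_count > 199)
        return false;           /* stop oscillating */
    return true;
}

/* Apply n toggles (n is assumed to stay well below INT16_MAX) and
 * return how many of them were allowed to proceed. */
static int simulate_toggles(toggle_state_t *s, int n) {
    int allowed = 0;
    for (int i = 0; i < n; i++)
        if (note_toggle(s))
            allowed++;
    return allowed;
}
```

Out of 250 toggles in this model, the first 199 proceed, the 200th emits the single warning, and the 200th through 250th are all rejected.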

Parser State Save/Restore During Instantiation

Template instantiation re-enters the parser: the compiler replays the cached template body tokens with substituted types. This means the parser's global state -- scope indices, current token, source position, declaration context -- must be saved before instantiation and restored afterward. EDG uses movups/movaps SSE instructions to bulk-save/restore this state in 128-bit chunks.

Why SSE?

The global parser state variables are ordinary integers, pointers, and flags laid out at consecutive addresses. The compiler's register allocator (or manual optimization) packs adjacent globals into 128-bit SSE loads/stores, saving 4 or more individual mov instructions per save/restore. This is not a quirk of the architecture -- it is a deliberate performance optimization for a hot path. Template-heavy C++ codebases (Boost, STL, Eigen) can trigger thousands of instantiations, each requiring a state save/restore pair.

Function Instantiation: 4 SSE Registers

instantiate_template_function_full (sub_775E00) saves and restores 4 SSE registers covering 64 bytes of parser state at addresses 0x106C380--0x106C3B0.

Save on entry (before any parser re-entry):
    local[0] = xmmword_106C380      // 16 bytes: parser scope context
    local[1] = xmmword_106C390      // 16 bytes: token stream state
    local[2] = xmmword_106C3A0      // 16 bytes: scope nesting info
    local[3] = xmmword_106C3B0      // 16 bytes: auxiliary flags

Also saved as individual scalars:
    saved_source_pos     = qword_126DD38
    saved_source_col     = WORD2(qword_126DD38)
    saved_diag_pos       = qword_126EDE8
    saved_diag_col       = WORD2(qword_126EDE8)

Restore on exit (always, even on error path):
    xmmword_106C380 = local[0]
    xmmword_106C390 = local[1]
    xmmword_106C3A0 = local[2]
    xmmword_106C3B0 = local[3]
    qword_126DD38   = saved_source_pos
    qword_126EDE8   = saved_diag_pos
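
At source level this pattern is most likely nothing more than a struct copy into a local and back. The sketch below assumes a hypothetical 64-byte `parser_state_t` standing in for the four xmmword globals; a compiler targeting x86-64 is free to lower each assignment to 16-byte moves, although nothing in the C code requires it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64 bytes: the same footprint as four 16-byte slots. */
typedef struct {
    uint64_t words[8];
} parser_state_t;

_Static_assert(sizeof(parser_state_t) == 64, "4 x 16 bytes");

static parser_state_t g_parser_state;  /* stands in for 0x106C380..0x106C3BF */

/* Example body that overwrites all of the parser state. */
static void clobber_state(void) {
    memset(&g_parser_state, 0xAA, sizeof g_parser_state);
}

/* Save the global state, run the callback, restore the state. */
static void run_with_saved_state(void (*reenter_parser)(void)) {
    parser_state_t saved = g_parser_state;  /* bulk save (struct copy) */
    reenter_parser();
    g_parser_state = saved;                 /* bulk restore */
}
```

Calling `run_with_saved_state(clobber_state)` leaves `g_parser_state` exactly as it was before the call, no matter what the callback wrote.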

Class Instantiation: 11 + 11 SSE Registers (Conditional)

f_instantiate_template_class (sub_777CE0) saves substantially more state because class body parsing involves deeper parser perturbation -- member declarations, nested types, access specifiers, base class processing, and member template definitions all modify global parser state.

The save is conditional on the current token kind (word_126DD58). If the token kind is between 2 and 8 inclusive (meaning the parser is mid-expression or mid-declaration when the class instantiation is triggered), the additional declaration-state save executes:

Primary state block (always saved when token is 2--8): 11 SSE registers from xmmword_126DC60--xmmword_126DD00, covering 176 bytes of declaration parser state, plus qword_126DD10 (8 bytes).

Save:
    local[0]  = xmmword_126DC60     // declaration context
    local[1]  = xmmword_126DC70     // access specifier state
    local[2]  = xmmword_126DC80     // base class list context
    local[3]  = xmmword_126DC90     // member template tracking
    local[4]  = xmmword_126DCA0     // nested type state
    local[5]  = xmmword_126DCB0     // friend declaration context
    local[6]  = xmmword_126DCC0     // using declaration state
    local[7]  = xmmword_126DCD0     // default argument context
    local[8]  = xmmword_126DCE0     // static assertion state
    local[9]  = xmmword_126DCF0     // concept/requires state
    local[10] = xmmword_126DD00     // template parameter context
    saved_dd10 = qword_126DD10      // additional scalar

Extended state block (saved only when token kind == 8, class definition in progress): 11 more SSE registers from xmmword_126DBA0--xmmword_126DC40, covering 176 bytes, plus qword_126DC50 (8 bytes).

    local[11] = xmmword_126DBA0     // class body parse state
    local[12] = xmmword_126DBB0     // virtual function table context
    local[13] = xmmword_126DBC0     // constructor/destructor tracking
    local[14] = xmmword_126DBD0     // initializer list state
    local[15] = xmmword_126DBE0     // exception specification
    local[16] = xmmword_126DBF0     // noexcept evaluation context
    local[17] = xmmword_126DC00     // member initializer state
    local[18] = xmmword_126DC10     // default member init
    local[19] = xmmword_126DC20     // alignment tracking
    local[20] = unk_126DC30         // padding/layout state
    local[21] = xmmword_126DC40     // class completion state
    saved_dc50 = qword_126DC50      // additional scalar

The conditional save is a performance optimization: when the parser is in a simple context (token kind outside 2--8), the class instantiation only needs to save the 4 SSE registers from xmmword_106C380--xmmword_106C3B0 (same as function instantiation). The full 22-register save (11 primary + 11 extended) is only needed when a class instantiation is triggered mid-parse (e.g., during elaborated type specifier resolution or SFINAE evaluation).

Summary of State Save Areas

Instantiation Kind	Condition	SSE Registers	Bytes Saved	Address Range
Function	Always	4	64	`0x106C380`--`0x106C3B0`
Class (minimal)	token not 2--8	4	64	`0x106C380`--`0x106C3B0`
Class (mid-declaration)	token 2--8	4 + 11	64 + 184	`0x106C380`--`0x106C3B0` + `0x126DC60`--`0x126DD10`
Class (mid-class-body)	token == 8	4 + 11 + 11	64 + 184 + 184	All three ranges
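
Register counts and byte totals can be recomputed from the addresses alone. The sketch below assumes that each range on this page names the address of its first and last 16-byte slot, and that each of the two class blocks is followed by one 8-byte scalar; both helper names are hypothetical.

```c
#include <assert.h>

/* Number of 16-byte slots in a range given by the addresses of its
 * first and last slot (both inclusive). */
static unsigned slots_16(unsigned long first, unsigned long last) {
    return (unsigned)((last - first) / 16 + 1);
}

/* Bytes saved for a block: its 16-byte slots plus an optional
 * trailing 8-byte scalar. */
static unsigned block_bytes(unsigned long first, unsigned long last,
                            int has_trailing_qword) {
    return slots_16(first, last) * 16 + (has_trailing_qword ? 8u : 0u);
}
```

For example, `slots_16(0x106C380, 0x106C3B0)` gives the 4 slots of the function-instantiation block, and `block_bytes(0x126DC60, 0x126DD00, 1)` gives the 184 bytes of the primary class block.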

The update_instantiation_required_flag Function

sub_7770E0 (434 lines) is the central function that decides whether to add a template instance to the worklist. Its name in EDG source is update_instantiation_required_flag, confirmed by the assert string at templates.c:38863 and the debug trace "Setting instantiation_required flag to %s for (options=%d)".

This function is called whenever a template entity's instantiation status changes -- when a template is first referenced, when it is explicitly instantiated, when its definition becomes available, or when an extern template declaration is encountered.

Parameters

void update_instantiation_required_flag(
    template_instance_t *inst,     // a1: the instance record
    bool                 setting,  // a2 (int cast): true = mark required, false = unmark
    unsigned int         options   // a3: bitmask controlling behavior
);

Options Bitmask

Bit	Mask	Meaning
0	`0x01`	Force worklist addition even without inline body
1	`0x02`	Unmarking: decrement pending count and clear flags
2	`0x04`	Suppress `should_be_instantiated` check
3	`0x08`	Check for inline member of class template

High-Level Flow

update_instantiation_required_flag(inst, setting, options):

    // 1. Resolve owning type entry (for toggle counter)
    type_entry = resolve_type_from_actual_decl(inst->actual_decl)

    // 2. Check inline member status
    if is_function_or_variable(master_symbol):
        if is_inline_member:
            adjust options based on module/inline status

    // 3. Debug trace
    if debug_tracing_enabled:
        "Setting instantiation_required flag to TRUE/FALSE for (options=N)"
        print_symbol(inst->master_symbol)

    // 4. Ensure inst_info exists
    if inst_info is NULL:
        find_or_create_master_instance(inst)    // sub_753550

    // 5. If setting == true and entity != actual_decl:
    //    — Set inst->flags |= 0x01 (instantiation_required)
    //    — If inst_info exists, increment pending_count
    //    — Set referencing_namespace
    //    — Possibly call should_be_instantiated + instantiate immediately
    //    — Add to worklist via add_to_instantiations_required_list

    // 6. If setting == false and options & 0x02:
    //    — Decrement pending_count
    //    — Clear inst->flags bit 0
    //    — If count < 0: internal error (templates.c:38908)

    // 7. Worklist linkage (add_to_instantiations_required_list tail)
    //    — Check on_worklist bit
    //    — Append to qword_12C7740/qword_12C7738
    //    — Or set fixpoint flag if already on list

Global State Reference

Address	Name	Type	Description
`qword_12C7740`	`pending_instantiation_list`	`void*`	Head of function/variable worklist
`qword_12C7738`	`pending_instantiation_list_tail`	`void*`	Tail of function/variable worklist
`qword_12C7758`	`pending_class_list`	`void*`	Head of class template worklist
`qword_12C7750`	`deferred_master_info_list`	`void*`	Head of deferred master-info list
`qword_12C7748`	`deferred_master_info_list_tail`	`void*`	Tail of deferred master-info list
`dword_12C7730`	`instantiation_mode`	`int32`	0=none, 1=used, 2=all, 3=local
`dword_12C771C`	`new_instantiations_needed`	`int32`	Fixpoint flag (1 = restart loop)
`dword_12C7718`	`additional_pass_needed`	`int32`	Secondary fixpoint flag
`qword_12C76E0`	`function_depth_counter`	`int64`	Current function instantiation depth (max 255)
`qword_106BD10`	`max_depth_limit`	`int64`	Configurable depth limit (read by both function and class paths)
`qword_12C74F0`	`instance_alloc_count`	`int64`	Total 128-byte records allocated
`qword_12C74E8`	`master_info_alloc_count`	`int64`	Total 32-byte master-info records allocated
`qword_12C7708`	`inline_entity_list_head`	`void*`	Head of inline entity fixup list
`qword_12C7700`	`inline_entity_list_tail`	`void*`	Tail of inline entity fixup list

Diagnostic Messages

Number	Severity	Condition	Message Summary
456	Error	Depth counter >= max limit	Excessive template instantiation depth
489	Warning	Approaching depth limit (explicit instantiation)	Template instantiation depth nearing limit
490	Warning	Approaching depth limit (auto instantiation)	Template instantiation depth nearing limit
599	Warning	Toggle counter reaches 200	Instantiation flag oscillation detected
759	Error	Entity not visible at file scope	Template entity not accessible for instantiation

Function Map

Address	Identity	Confidence	Lines	Role
`sub_7416E0`	`alloc_template_instance`	95%	40	Allocate 128-byte instance record
`sub_7416A0`	`alloc_master_instance_info`	95%	16	Allocate 32-byte master info record
`sub_753550`	`find_or_create_master_instance`	95%	75	Link instance to shared master info
`sub_7770E0`	`update_instantiation_required_flag`	95%	434	Update flags, add to worklist
`sub_78A7F0`	`do_any_needed_instantiations`	100%	72	Walk function/variable worklist
`sub_78A9D0`	`template_and_inline_entity_wrapup`	100%	136	Fixpoint loop entry point
`sub_774620`	`should_be_instantiated`	95%	326	Decision gate
`sub_775E00`	`instantiate_template_function_full`	95%	839	Function template instantiation
`sub_777CE0`	`f_instantiate_template_class`	95%	516	Class template instantiation
`sub_774C30`	`instantiate_template_variable`	95%	751	Variable template instantiation
`sub_75D740`	`increment_pending_instantiations`	95%	--	Increment per-type depth counter
`sub_75D7C0`	`decrement_pending_instantiations`	95%	--	Decrement per-type depth counter
`sub_75D6A0`	`too_many_pending_instantiations`	95%	--	Check depth limit, emit diagnostic 456
`sub_75D5B0`	`determine_referencing_namespace`	95%	47	Find namespace that triggered instantiation
`sub_7574B0`	`f_entity_can_be_instantiated`	95%	--	Pre-check: body available, constraints satisfied
`sub_756B40`	`f_is_static_or_inline_template_entity`	95%	--	Check linkage for instantiation eligibility
`sub_789EF0`	`update_instantiation_flags`	95%	--	Update class instantiation flags, add to class worklist
`sub_72ED70`	`alloc_symbol_list_entry`	95%	39	Allocate 16-byte symbol list node (for inline entity list)

Cross-References

Template Engine -- the full instantiation pipeline, substitution engine, argument deduction, partial ordering
CUDA Template Restrictions -- CUDA-specific template argument accessibility checks
Scope Entry -- 784-byte scope stack entry, template instantiation depth counters at +576/+580/+584
Entity Node Layout -- entity kind byte at +80, execution space at +182
Translation Unit Descriptor -- TU linked list at qword_106B9F0, per-TU needs_recheck flag at +393

CLI Flag Inventory

Quick Reference: 20 Most Important CUDA-Specific Flags

Flag (via `-Xcudafe`)	nvcc Equivalent	ID	Effect
`--diag_suppress=N`	`--diag-suppress=N`	39	Suppress diagnostic number N (comma-separated)
`--diag_error=N`	`--diag-error=N`	42	Promote diagnostic N to error
`--diag_warning=N`	`--diag-warning=N`	41	Demote diagnostic N to warning
`--display_error_number`	--	44	Show `#NNNNN-D` error codes in output
`--target=smXX`	`--gpu-architecture=smXX`	245	Set SM architecture target (parsed via `sub_7525E0`)
`--relaxed_constexpr`	`--expt-relaxed-constexpr`	104	Allow constexpr cross-space calls
`--extended-lambda`	`--expt-extended-lambda`	106	Enable `__device__`/`__host__ __device__` lambdas in host code (`dword_106BF38`)
`--device-c`	`-rdc=true`	77	Relocatable device code (separate compilation)
`--keep-device-functions`	`--keep-device-functions`	71	Do not strip unused device functions
`--no_warnings`	`-w`	22	Suppress all warnings
`--promote_warnings`	`-W`	23	Promote all warnings to errors
`--error_limit=N`	--	32	Maximum errors before abort (default: unbounded)
`--force-lp64`	`-m64`	65	LP64 data model (pointer=8, long=8)
`--output_mode=sarif`	--	274	SARIF JSON diagnostic output
`--debug_mode`	`-G`	82	Full debug mode (sets 3 debug globals)
`--device-syntax-only`	--	72	Device-side syntax check without codegen
`--no-device-int128`	--	52	Disable `__int128` on device
`--zero_init_auto_vars`	--	81	Zero-initialize automatic variables
`--fe-inlining`	--	54	Enable frontend inlining
`--gen_c_file_name=path`	--	45	Set output `.int.c` file path

These are the flags most commonly passed through -Xcudafe for CUDA development. The full inventory of 276 flags follows below.

cudafe++ accepts 276 command-line flags registered in a flat table at dword_E80060. In normal use these flags are not typed by the user: NVIDIA's compiler driver nvcc decomposes its own options and invokes cudafe++ with the appropriate low-level flags. To reach a cudafe++ flag explicitly, pass it through nvcc -Xcudafe <flag>; nvcc strips the -Xcudafe prefix and forwards the remainder as a bare argument to the cudafe++ process.

The flag system is implemented in four functions within cmd_line.c:

Function	Address	Lines	Role
`register_command_flag`	`sub_451F80`	25	Insert one entry into the flag table
`init_command_line_flags`	`sub_452010`	3,849	Register all 276 flags (called once)
`proc_command_line`	`sub_459630`	4,105	Main parser: match argv against table, dispatch to 275-case switch
`default_init`	`sub_45EB40`	470	Initialize ~350 global config variables and zero the flag-was-set bitmap

Flag Table Structure

Each flag occupies a 40-byte entry in a contiguous array beginning at dword_E80060, with a maximum capacity of 552 entries (overflow triggers a panic via sub_40351D). The current count is tracked in dword_E80058.

struct flag_entry {                    // 40 bytes per entry, fields in address order
    int32_t   case_id;                 // +0   dword_E80060[idx*10]    -- switch dispatch ID
    char*     name;                    // +8   qword_E80068[idx*5]     -- long flag name string
    int8_t    short_char;              // +16  low byte of word_E80070[idx*20]  -- single-char alias (0 if none)
    int8_t    is_valid;                // +17  high byte of word_E80070[idx*20] -- always 1
    int8_t    takes_value;             // +18  byte_E80072[idx*40]     -- flag requires =<value> argument
    int8_t    is_boolean;              // +19  byte_E80073[idx*40]     -- flag is on/off toggle
    int64_t   name_length;             // +24  qword_E80078[idx*5]     -- strlen(name), precomputed
    int32_t   visible;                 // +32  dword_E80080[idx*10]    -- mode/action classification
};

The flag-was-set bitmap at byte_E7FF40 spans 0x110 bytes (272 one-byte flag slots). When a flag is matched during parsing, its slot is set to record that the user explicitly provided it. default_init zeroes this bitmap before every compilation.
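
The entry layout can be cross-checked by ordering the fields by their global addresses and letting the compiler verify the offsets. This is a sketch under stated assumptions: `flag_entry_layout`, `flag_entry_address`, and the macro names are illustrative, and the name pointer is modeled as a 64-bit integer so the layout does not depend on the host's pointer size.

```c
#include <stddef.h>
#include <stdint.h>

#define FLAG_TABLE_BASE 0xE80060u  /* dword_E80060 */
#define FLAG_ENTRY_SIZE 40u        /* bytes per entry */

/* Fields in address order, derived from the per-field globals above. */
struct flag_entry_layout {
    int32_t  case_id;      /* +0  dword_E80060 */
    uint64_t name;         /* +8  qword_E80068 (pointer in the target) */
    int8_t   short_char;   /* +16 */
    int8_t   is_valid;     /* +17 */
    int8_t   takes_value;  /* +18 byte_E80072 */
    int8_t   is_boolean;   /* +19 byte_E80073 */
    int64_t  name_length;  /* +24 qword_E80078 */
    int32_t  visible;      /* +32 dword_E80080 */
};

_Static_assert(sizeof(struct flag_entry_layout) == FLAG_ENTRY_SIZE, "40-byte entry");
_Static_assert(offsetof(struct flag_entry_layout, name) == 0x68 - 0x60, "name");
_Static_assert(offsetof(struct flag_entry_layout, takes_value) == 0x72 - 0x60, "takes_value");
_Static_assert(offsetof(struct flag_entry_layout, name_length) == 0x78 - 0x60, "name_length");
_Static_assert(offsetof(struct flag_entry_layout, visible) == 0x80 - 0x60, "visible");

/* Address of entry idx inside the target binary's flag table. */
static uint32_t flag_entry_address(uint32_t idx)
{
    return FLAG_TABLE_BASE + idx * FLAG_ENTRY_SIZE;
}
```

With this layout, entry 1 starts at 0xE80088, and a table filled to its 552-entry capacity would end at 0xE856A0.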

Registration Protocol

register_command_flag (sub_451F80) is called approximately 275 times from init_command_line_flags. Its prototype:

void register_command_flag(
    int    case_id,        // dispatch ID for the switch statement
    char*  name,           // flag name without the leading "--"
    char   short_opt,      // single-letter alias, 0 for none
    char   takes_value,    // 1 if the flag requires =<value>
    int    mode_flag,      // visibility / classification
    char   enabled         // whether the flag is active
);

Some flags are registered as paired toggles -- --flag and --no_flag share the same case_id but set the target global to 1 or 0 respectively. These pairs are registered either by two calls to register_command_flag or by inline table population within init_command_line_flags.
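
What one registration call records can be sketched as follows. This is an illustrative model rather than the decompiled body of sub_451F80: `flag_rec`, `flag_table`, `flag_count`, and `register_flag_sketch` are hypothetical names, and a full table returns -1 here where the binary is described as panicking.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FLAGS 552  /* table capacity stated in Flag Table Structure */

struct flag_rec {
    int         case_id;
    const char *name;
    char        short_opt;
    char        takes_value;
    int         mode_flag;
    char        enabled;
    size_t      name_length;  /* strlen(name), computed once at registration */
};

static struct flag_rec flag_table[MAX_FLAGS];
static int flag_count;

/* Append one entry and return its index, or -1 when the table is full. */
static int register_flag_sketch(int case_id, const char *name, char short_opt,
                                char takes_value, int mode_flag, char enabled)
{
    if (flag_count >= MAX_FLAGS)
        return -1;
    struct flag_rec *e = &flag_table[flag_count];
    e->case_id = case_id;
    e->name = name;
    e->short_opt = short_opt;
    e->takes_value = takes_value;
    e->mode_flag = mode_flag;
    e->enabled = enabled;
    e->name_length = strlen(name);
    return flag_count++;
}
```

A paired toggle is then simply two entries with the same `case_id`, for example `register_flag_sketch(11, "anachronisms", 0, 0, 1, 1)` followed by `register_flag_sketch(11, "no_anachronisms", 0, 0, 1, 1)`; the `mode_flag` and `enabled` values in that example are arbitrary.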

Parsing Flow

proc_command_line (sub_459630) is the master CLI parser. It:

Calls init_command_line_flags to populate the flag table (once)
Allocates four hash tables for accumulating -D, -I, system include, and macro alias arguments
Adjusts nine diagnostic severities by default via sub_4ED400: four are suppressed (severity 3: errors 1373, 1374, 1375, 2330) and five are demoted to remark (severity 4: errors 1257, 1633, 111, 185, 175)
Enters the main loop over argv:
- Scans for - prefix to identify flags
- Handles single-character short flags (-E, -C, etc.) and --flag-name long flags
- Handles --flag=value syntax via parse_flag_name_value (sub_451EC0)
- Matches flag names against the registered table using strncmp, bounded by each entry's precomputed name_length
- Dispatches to a giant switch(case_id) with 275 cases
Executes post-parsing dialect resolution (described below)
Opens output, error, and list files
Treats the remaining non-flag argv entry as the input filename
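
The long-flag matching step can be sketched as below. The three table entries are copied from the catalog for illustration; `flag_name`, `demo_flags`, and `match_long_flag` are hypothetical names, and requiring the argument's name length to equal `name_length` before calling `strncmp` is an assumption about how prefix collisions are avoided, not something confirmed from the binary.

```c
#include <stddef.h>
#include <string.h>

struct flag_name {
    int         case_id;
    const char *name;
    size_t      name_length;  /* precomputed strlen(name) */
};

/* Three entries from the catalog, for illustration only. */
static const struct flag_name demo_flags[] = {
    { 39, "diag_suppress", 13 },
    { 42, "diag_error",    10 },
    { 22, "no_warnings",   11 },
};

/* Match "--name" or "--name=value". Returns the case_id, or -1 if the
 * argument is not a known long flag. *value points at the text after '='
 * or is NULL when no value was given. */
static int match_long_flag(const char *arg, const char **value)
{
    *value = NULL;
    if (strncmp(arg, "--", 2) != 0)
        return -1;
    const char *name = arg + 2;
    const char *eq = strchr(name, '=');
    size_t len = eq ? (size_t)(eq - name) : strlen(name);
    for (size_t i = 0; i < sizeof demo_flags / sizeof demo_flags[0]; i++) {
        const struct flag_name *f = &demo_flags[i];
        if (len == f->name_length && strncmp(name, f->name, len) == 0) {
            if (eq)
                *value = eq + 1;
            return f->case_id;
        }
    }
    return -1;
}
```

The returned `case_id` is what the parser would then feed to its dispatch switch.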

The -Xcudafe Pass-Through

Users do not normally invoke cudafe++ directly. The intended usage path is:

nvcc --some-option -Xcudafe --diag_suppress=1234 source.cu

nvcc strips -Xcudafe and passes --diag_suppress=1234 directly to the cudafe++ process as an argv element. Multiple -Xcudafe arguments accumulate. Because every forwarded argument is explicitly introduced by -Xcudafe, it cannot be confused with nvcc's own options.

Certain nvcc flags like --expt-extended-lambda and --expt-relaxed-constexpr are translated by nvcc into the corresponding cudafe++ internal flags (--extended-lambda, --relaxed_constexpr) before invocation. Users do not need to know the internal names.
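
The pass-through behavior described above can be modeled in a few lines. This is a sketch of the observable effect only, not nvcc's implementation: `collect_xcudafe` is a hypothetical helper that gathers the argument following each -Xcudafe, in order.

```c
#include <string.h>

/* Collect the argument following each "-Xcudafe" into out[].
 * Returns how many arguments were collected (at most out_cap). */
static int collect_xcudafe(int argc, const char *const *argv,
                           const char **out, int out_cap)
{
    int n = 0;
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-Xcudafe") == 0 && i + 1 < argc) {
            if (n < out_cap)
                out[n++] = argv[i + 1];
            i++;  /* skip the forwarded argument itself */
        }
    }
    return n;
}
```

For the command line `nvcc --some-option -Xcudafe --diag_suppress=1234 source.cu`, the only collected argument is `--diag_suppress=1234`.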

Flag Catalog by Category

The 276 flags are grouped below by functional category. Each table lists:

ID -- the case_id used in the dispatch switch
Flag -- the --name as registered (paired flags shown as name / no_name)
Short -- single-character alias (dash required: -E, -C, etc.)
Arg -- whether the flag takes a =<value> argument
Effect -- what the flag does internally

Core EDG Flags (1--44)

These are standard Edison Design Group frontend options that predate NVIDIA's CUDA modifications.

ID	Flag	Short	Arg	Effect
1	`strict`	`-A`	no	Enable strict standards conformance mode
2	`strict_warnings`	`-a`	no	Strict mode with extra warnings
3	`no_line_commands`	`-P`	no	Suppress `#line` directives in preprocessor output
4	`preprocess`	`-E`	no	Preprocessor-only mode (output to stdout)
5	`comments`	`-C`	no	Preserve comments in preprocessor output
6	`old_line_commands`	--	no	Use old-style `# N "file"` line directives
7	`old_c`	`-K`	no	K&R C mode (calls `set_c_mode(1)`)
8	`dependencies`	`-M`	no	Output `#include` dependency list (preprocessor-only)
9	`trace_includes`	`-H`	no	Print each `#include` file as it is opened
10	`il_display`	--	no	Dump intermediate language after parsing
11	`anachronisms / no_anachronisms`	--	no	Allow/disallow anachronistic C++ constructs
12	`cfront_2.1`	`-b`	no	Cfront 2.1 compatibility mode
13	`cfront_3.0`	--	no	Cfront 3.0 compatibility mode
14	`no_code_gen`	`-n`	no	Parse only, skip code generation
15	`signed_chars / unsigned_chars`	`-s`	no	Default `char` signedness
16	`instantiate`	`-t`	yes	Template instantiation mode: `none`, `all`, `used`, `local`
17	`implicit_include / no_implicit_include`	`-B`	no	Enable/disable implicit inclusion of template definitions
18	`suppress_vtbl / force_vtbl`	--	no	Control virtual table emission
19	`dollar`	`-$`	no	Allow `$` in identifiers
20	`timing`	`-#`	no	Print compilation phase timing
21	`version`	`-v`	no	Print version banner and continue
22	`no_warnings`	`-w`	no	Suppress all warnings (sets severity threshold to error-only)
23	`promote_warnings`	`-W`	no	Promote warnings to errors
24	`remarks`	`-r`	no	Enable remark-level diagnostics
25	`c`	`-m`	no	Force C language mode
26	`c++`	`-p`	no	Force C++ language mode
27	`exceptions / no_exceptions`	`-x`	no	Enable/disable C++ exception handling
28	`no_use_before_set_warnings`	`-j`	no	Suppress "used before set" variable warnings
29	`include_directory`	`-I`	yes	Add include search path (handles `-` for stdin)
30	`define_macro`	`-D`	yes	Define preprocessor macro (builds linked list)
31	`undefine_macro`	`-U`	yes	Undefine preprocessor macro
32	`error_limit`	`-e`	yes	Maximum number of errors before abort
33	`list`	`-L`	yes	Generate listing file
34	`xref`	`-X`	yes	Generate cross-reference file
35	`error_output`	--	yes	Redirect error output to file
36	`output`	`-o`	yes	Set output file path
37	`db`	`-d`	yes	Load debug database
38	`time_limit`	--	yes	Set compilation time limit
39	`diag_suppress`	--	yes	Suppress diagnostic numbers (comma-separated list)
40	`diag_remark`	--	yes	Demote diagnostics to remark severity
41	`diag_warning`	--	yes	Set diagnostics to warning severity
42	`diag_error`	--	yes	Promote diagnostics to error severity
43	`diag_once`	--	yes	Emit diagnostic only on first occurrence
44	`display_error_number / no_display_error_number`	--	no	Show/hide error code numbers in output

NVIDIA CUDA-Specific Flags (45--89)

These flags are NVIDIA additions absent from stock EDG. They control CUDA compilation modes, device code generation, and host/device interaction.

ID	Flag	Arg	Effect
45	`gen_c_file_name`	yes	Set output `.int.c` file path (`qword_106BF20`)
46	`msvc_target_version`	yes	MSVC version for compatibility (`dword_126E1D4`)
47	`host-stub-linkage-explicit`	no	Use explicit linkage on host stubs
48	`static-host-stub`	no	Generate static host stubs
49	`device-hidden-visibility`	no	Apply hidden visibility to device symbols
50	`no-hidden-visibility-on-unnamed-ns`	no	Exempt unnamed namespaces from hidden visibility
51	`no-multiline-debug`	no	Disable multiline debug info
52	`no-device-int128`	no	Disable `__int128` on device
53	`no-device-float128`	no	Disable `__float128` on device
54	`fe-inlining`	no	Enable frontend inlining (`dword_106C068 = 1`)
55	`modify-stack-limit`	yes	Control stack limit modification (`dword_106C064`)
56	`fassociative-math`	no	Enable associative floating-point math
57	`orig_src_file_name`	yes	Original source file name (before preprocessing)
58	`orig_src_path_name`	yes	Original source path name (full path)
59	`frandom-seed`	yes	Random seed for reproducible output
60	`check-template-param-qual`	no	Check template parameter qualifications
61	`check-clock-call`	no	Validate `clock()` calls in device code
62	`check-ffs-call`	no	Validate `ffs()` calls in device code
63	`check-routine-address-taken`	no	Check when device routine address is taken
64	`check-memory-clobber`	no	Validate memory clobber in inline asm
65	`force-lp64`	no	LP64 data model: pointer=8, long=8
66	`force-llp64`	no	LLP64 data model: pointer=8, long=4
67	`pgi_llvm`	no	PGI/LLVM backend mode
68	`pgi_arch_ppc`	no	PGI PowerPC architecture
69	`pgi_arch_aarch64`	no	PGI AArch64 architecture
70	`pgi_version`	yes	PGI compiler version number
71	`keep-device-functions`	no	Do not strip unused device functions
72	`device-syntax-only`	no	Device-side syntax check without codegen
73	`device-time-trace`	no	Enable device compilation time tracing
74	`force_linkonce_to_weak`	no	Convert linkonce to weak linkage
75	`disable_host_implicit_call_check`	no	Skip implicit call validation on host
76	`no_strict_cuda_error`	no	Relax strict CUDA error checking
77	`device-c`	no	Relocatable device code (RDC) mode
78	`no-shadow-functions`	no	Disable function shadowing in device code
79	`disable_ext_lambda_cache`	no	Disable extended lambda capture cache
80	`no-constant-variable-inferencing`	no	Disable constexpr variable inference on device
81	`zero_init_auto_vars`	no	Zero-initialize automatic variables
82	`debug_mode`	no	Full debug mode (sets 3 debug globals to 1)
83	`gen_module_id_file`	no	Generate module ID file
84	`include_file_name`	yes	Forced include file name
85	`gen_device_file_name`	yes	Device-side output file name
86	`stub_file_name`	yes	Stub file output path
87	`module_id_file_name`	yes	Module ID file path
88	`tile_bc_file_name`	yes	Tile bitcode file path
89	`tile-only`	no	Tile-only compilation mode

Architecture and Host Compiler Flags (90--114)

These flags identify the target architecture and host compiler for compatibility emulation.

ID	Flag	Short	Arg	Effect
90	`m32`	--	no	32-bit mode: pointer=4, long=4, all types sized for ILP32
91	`m64`	--	no	64-bit mode (default on Linux x86-64)
92	`Version`	`-V`	no	Print version with different copyright format, then `exit(1)`
93	`compiler_bindir`	--	yes	Host compiler binary directory
94	`sdk_dir`	--	yes	SDK directory path
95	`pgc++`	--	no	PGI C++ compiler mode
96	`icc`	--	no	Intel ICC compiler mode
97	`icc_version`	--	yes	Intel ICC version number
98	`icx`	--	no	Intel ICX (oneAPI) compiler mode
99	`grco`	--	no	GRCO compiler mode
100	`allow_managed`	--	no	Allow `__managed__` variable declarations
101	`gen_system_templates_from_text`	--	no	Generate system templates from text
102	`no_host_device_initializer_list`	--	no	Disable HD `initializer_list` support
103	`no_host_device_move_forward`	--	no	Disable HD `std::move`/`std::forward`
104	`relaxed_constexpr`	--	no	Relaxed constexpr rules for device code (`--expt-relaxed-constexpr`)
105	`dont_suppress_host_wrappers`	--	no	Emit host wrapper functions unconditionally
106	`arm_cross_compiler`	--	no	ARM cross-compilation mode
107	`target_woa`	--	no	Windows on ARM target
108	`gen_div_approx_no_ftz`	--	no	Generate approximate division without flush-to-zero
109	`gen_div_approx_ftz`	--	no	Generate approximate division with flush-to-zero
110	`shared_address_immutable`	--	no	Shared memory addresses are immutable
111	`uumn`	--	no	Unnamed union member naming

C++ Language Feature Toggle Flags (115--275)

The largest group -- approximately 120 paired boolean toggles that control individual C++ language features. Most are inherited from EDG's configuration surface. Each pair shares a case_id and sets a global variable to 1 (--flag) or 0 (--no_flag).

Precompiled Headers (115--121)

ID	Flag	Arg	Effect
115	`unsigned_wchar_t`	no	`wchar_t` is unsigned
116	`create_pch`	yes	Create precompiled header file
117	`use_pch`	yes	Use existing precompiled header
118	`pch`	no	Enable PCH mode
119	`pch_messages / no_pch_messages`	no	Show/hide PCH status messages
120	`pch_verbose / no_pch_verbose`	no	Verbose PCH output
121	`pch_dir`	yes	PCH file directory

Core C++ Feature Toggles (122--171)

ID	Flag	Arg	Default
122	`restrict / no_restrict`	no	on
123	`long_lifetime_temps / short_lifetime_temps`	no	--
124	`wchar_t_keyword / no_wchar_t_keyword`	no	on
125	`pack_alignment`	yes	--
126	`alternative_tokens / no_alternative_tokens`	no	on
127	`svr4 / no_svr4`	no	--
128	`brief_diagnostics / no_brief_diagnostics`	no	--
129	`nonconst_ref_anachronism / no_nonconst_ref_anachronism`	no	--
130	`no_preproc_only`	no	--
131	`rtti / no_rtti`	no	on
132	`building_runtime`	no	--
133	`bool / no_bool`	no	on
134	`array_new_and_delete / no_array_new_and_delete`	no	--
135	`explicit / no_explicit`	no	--
136	`namespaces / no_namespaces`	no	on
137	`using_std / no_using_std`	no	--
138	`remove_unneeded_entities / no_remove_unneeded_entities`	no	on
139	`typename / no_typename`	no	--
140	`implicit_typename / no_implicit_typename`	no	on
141	`special_subscript_cost / no_special_subscript_cost`	no	--
143	`old_style_preprocessing`	no	--
144	`old_for_init / new_for_init`	no	--
145	`for_init_diff_warning / no_for_init_diff_warning`	no	--
146	`distinct_template_signatures / no_distinct_template_signatures`	no	--
147	`guiding_decls / no_guiding_decls`	no	on
148	`old_specializations / no_old_specializations`	no	on
149	`wrap_diagnostics / no_wrap_diagnostics`	no	--
150	`implicit_extern_c_type_conversion / no_implicit_extern_c_type_conversion`	no	--
151	`long_preserving_rules / no_long_preserving_rules`	no	--
152	`extern_inline / no_extern_inline`	no	--
153	`multibyte_chars / no_multibyte_chars`	no	--
154	`embedded_c++`	no	Embedded C++ mode
155	`vla / no_vla`	no	--
156	`enum_overloading / no_enum_overloading`	no	--
157	`nonstd_qualifier_deduction / no_nonstd_qualifier_deduction`	no	--
158	`late_tiebreaker / early_tiebreaker`	no	--
159	`preinclude`	yes	--
160	`preinclude_macros`	yes	--
161	`pending_instantiations`	yes	--
162	`const_string_literals / no_const_string_literals`	no	on
163	`class_name_injection / no_class_name_injection`	no	on
164	`arg_dep_lookup / no_arg_dep_lookup`	no	on
165	`friend_injection / no_friend_injection`	no	on
166	`nonstd_using_decl / no_nonstd_using_decl`	no	--
168	`designators / no_designators`	no	--
169	`extended_designators / no_extended_designators`	no	--
170	`variadic_macros / no_variadic_macros`	no	--
171	`extended_variadic_macros / no_extended_variadic_macros`	no	--

Include Paths and Module Support (167, 172, 256--265)

Note: These flags use non-contiguous IDs because sys_include and incl_suffixes are registered early, while the C++20 module flags use a separate ID range (256+).

ID	Flag	Arg	Effect
167	`sys_include`	yes	System include directory
172	`incl_suffixes`	yes	Include file suffix list (default `"::stdh:"`)
256	`modules_directory`	yes	C++20 modules directory
257	`ms_mod_file_map`	yes	MSVC module file mapping
258	`ms_header_unit`	yes	MSVC header unit
259	`ms_header_unit_quote`	yes	MSVC quoted header unit
260	`ms_header_unit_angle`	yes	MSVC angle-bracket header unit
261	`ms_mod_interface / no_ms_mod_interface`	no	MSVC module interface mode
262	`ms_internal_partition / no_ms_internal_partition`	no	MSVC internal partition mode
263	`ms_translate_include / no_ms_translate_include`	no	MSVC translate `#include` to `import`
264	`modules / no_modules`	no	Enable/disable C++20 modules
265	`module_import_diagnostics / no_module_import_diagnostics`	no	Module import diagnostic messages

Host Compiler and Language Feature Toggles (182--239)

Note: All IDs below are verified against the decompiled init_command_line_flags (sub_452010). Flags are registered by sub_451F80 (explicit call) or by inline array population. IDs are not sequential -- gaps exist where flags were removed or repurposed.

ID	Flag	Arg	Effect
182	`gcc / no_gcc`	no	GCC compatibility mode
183	`g++ / no_g++`	no	G++ mode (alias for GCC C++ mode)
184	`gnu_version`	yes	GCC version number (default 80100 = GCC 8.1.0)
185	`report_gnu_extensions`	no	Report use of GNU extensions
186	`short_enums / no_short_enums`	no	Use minimal-size enum representation
187	`clang / no_clang`	no	Clang compatibility mode
188	`clang_version`	yes	Clang version number (default 90100 = Clang 9.1.0)
189	`strict_gnu / no_strict_gnu`	no	Strict GNU mode
190	`db_name`	yes	Debug database name
191	`long_long`	no	Allow `long long` type
192	`context_limit`	yes	Maximum template instantiation context depth
193	`set_flag / clear_flag`	yes	Raw flag manipulation via `off_D47CE0` lookup table
194	`edg_base_dir`	yes	EDG base directory (error on invalid path)
195	`embedded_c / no_embedded_c`	no	Embedded C mode (not relevant to CUDA)
196	`thread_local_storage / no_thread_local_storage`	no	`thread_local` support
197	`trigraphs / no_trigraphs`	no	Trigraph processing (default on)
198	`nonstd_default_arg_deduction / no_nonstd_default_arg_deduction`	no	--
199	`stdc_zero_in_system_headers / no_stdc_zero_in_system_headers`	no	--
200	`template_typedefs_in_diagnostics / no_template_typedefs_in_diagnostics`	no	--
202	`uliterals / no_uliterals`	no	Unicode literals (`u""`, `U""`, `u8""`)
203	`type_traits_helpers / no_type_traits_helpers`	no	Intrinsic type traits
204	`c++11 / c++0x`	no	C++11 mode (sets `dword_126EF68` to 201103 or 199711)
205	`list_macros`	no	List all defined macros after preprocessing
206	`dump_configuration`	no	Dump full compiler configuration
207	`dump_legacy_as_target`	yes	Dump legacy configuration in target format
208	`signed_bit_fields / unsigned_bit_fields`	no	Default bit-field signedness
210	`check_concatenations / no_check_concatenations`	no	String literal concatenation checks
211	`unicode_source_kind`	yes	Source encoding: `UTF-8`=1, `UTF-16LE`=2, `UTF-16BE`=3, `none`=0
212	`lambdas / no_lambdas`	no	C++ lambda expressions
213	`rvalue_refs / no_rvalue_refs`	no	Rvalue references
214	`rvalue_ctor_is_copy_ctor / rvalue_ctor_is_not_copy_ctor`	no	Rvalue constructor treatment
215	`gen_move_operations / no_gen_move_operations`	no	Implicit move constructor/assignment (default on)
216	`auto_type / no_auto_type`	no	C++11 `auto` type deduction
217	`auto_storage / no_auto_storage`	no	`auto` as storage class (C++03 meaning)
218	`nonstd_instantiation_lookup / no_nonstd_instantiation_lookup`	no	--
219	`nullptr / no_nullptr`	no	`nullptr` keyword
220	`gcc89_inlining`	no	GNU89 `inline` semantics (GCC's pre-C99 inlining behavior)
221	`nonstd_gnu_keywords / no_nonstd_gnu_keywords`	no	GNU extension keywords
222	`default_nocommon_tentative_definitions / default_common_tentative_definitions`	no	Tentative definition linkage
223	`no_token_separators_in_pp_output`	no	--
224	`c23_typeof / no_c23_typeof`	no	C23 `typeof` operator
225	`c++11_sfinae / no_c++11_sfinae`	no	C++11 SFINAE rules
226	`c++11_sfinae_ignore_access / no_c++11_sfinae_ignore_access`	no	Ignore access checks in SFINAE
227	`variadic_templates / no_variadic_templates`	no	Parameter packs and pack expansion
228	`c++03`	no	C++03 mode (sets `dword_126EF68` to 199711)
229	`func_prototype_tags / no_func_prototype_tags`	no	--
230	`implicit_noexcept / no_implicit_noexcept`	no	Implicit `noexcept` on destructors
231	`unrestricted_unions / no_unrestricted_unions`	no	Unrestricted unions (C++11)
232	`max_depth_constexpr_call`	yes	Maximum `constexpr` recursion depth (default 200)
233	`max_cost_constexpr_call`	yes	Maximum `constexpr` evaluation cost (default 256)
234	`delegating_constructors / no_delegating_constructors`	no	--
235	`lossy_conversion_warning / no_lossy_conversion_warning`	no	--
236	`deprecated_string_conv / no_deprecated_string_conv`	no	Deprecated string literal to `char*` conversion
237	`user_defined_literals / no_user_defined_literals`	no	UDL support
238	`preserve_lvalues_with_same_type_casts / no_...`	no	--
239	`nonstd_anonymous_unions / no_nonstd_anonymous_unions`	no	--

Late C++/Architecture/Output Flags (240--275)

ID	Flag	Arg	Effect
240	`c++14`	no	C++14 mode (sets `dword_126EF68` to 201402)
241	`c11`	no	C11 mode (sets `dword_126EF68` to 201112)
242	`c17`	no	C17 mode (sets `dword_126EF68` to 201710)
243	`c23`	no	C23 mode (sets `dword_126EF68` to 202311)
244	`digit_separators / no_digit_separators`	no	C++14 digit separators (`1'000'000`)
245	`target`	yes	SM architecture string, parsed via `sub_7525E0` into `dword_126E4A8`
246	`c++17`	no	C++17 mode (sets `dword_126EF68` to 201703)
247	`utf8_char_literals / no_utf8_char_literals`	no	UTF-8 character literal support
248	`stricter_template_checking`	no	Additional template constraint checks
249	`exc_spec_in_func_type / no_exc_spec_in_func_type`	no	Exception spec as part of function type (C++17)
250	`aligned_new / no_aligned_new`	no	Aligned `operator new` (C++17)
251	`c++20`	no	C++20 mode (sets `dword_126EF68` to 202002)
252	`c++23`	no	C++23 mode (sets `dword_126EF68` to 202302)
253	`ms_std_preprocessor / no_ms_std_preprocessor`	no	MSVC standard preprocessor mode
268	`partial-link`	no	Partial linking mode
273	`dump_command_options`	no	Print all registered flag names
274	`output_mode`	yes	Output format: `text` (0) or `sarif` (1)
275	`incognito / no_incognito`	no	Incognito mode

Note: The 240-252 ID range mixes C/C++ standard selectors with individual feature toggles. The standard selection IDs are also cross-referenced in the Language Standard Selection section below.

Inline-Registered Paired Flags

Seven additional paired flags are registered through inline table population rather than calls to register_command_flag. They share the same entry structure but are populated directly into the array:

Flag	Effect
`relaxed_abstract_checking / no_relaxed_abstract_checking`	Relax abstract class checks
`concepts / no_concepts`	C++20 concepts support
`colors / no_colors`	Colorized diagnostic output
`keep_restrict_in_signatures / no_keep_restrict_in_signatures`	Preserve `restrict` in mangled names
`check_unicode_security / no_check_unicode_security`	Unicode security checks (homoglyph detection)
`old_id_chars / no_old_id_chars`	Legacy identifier character rules
`add_match_notes / no_add_match_notes`	Add notes about matching overloads

Language Standard Selection

Language standard flags set dword_126EF68 (the internal __cplusplus / __STDC_VERSION__ value) and trigger corresponding mode changes. The tables below list six C selectors and six C++ selectors:

C Standards

ID	Flag	`dword_126EF68` value	C standard
7	`old_c`	(K&R)	Pre-ANSI C via `set_c_mode(1)`
179	`c89`	198912	ANSI C / C89
178	`c99`	199901	C99
241	`c11`	201112	C11
242	`c17`	201710	C17
243	`c23`	202311	C23

C++ Standards

ID	Flag	`dword_126EF68` value	C++ standard
228	`c++03`	199711	C++98/03 (the conditional fallback in `--c++11`, ID 204, stores the same value and so acts as a `c++98` alias)
204	`c++11`	201103	C++11 (sets 199711 if `dword_E7FF14` is unset or C mode)
240	`c++14`	201402	C++14
246	`c++17`	201703	C++17
251	`c++20`	202002	C++20
252	`c++23`	202302	C++23

When a C++ standard is selected, the post-parsing dialect resolution logic automatically enables the corresponding feature flags. For example, selecting --c++11 (value 201103) enables lambdas, rvalue references, auto type deduction, nullptr, variadic templates, and other C++11 features. The resolution logic also interacts with GCC/Clang version thresholds to determine which extensions are available.
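
The two standard tables reduce to a small lookup from flag name to the value stored in dword_126EF68. The sketch below simply transcribes those rows; `std_flag`, `std_flags`, and `std_value_for_flag` are hypothetical names, and `old_c` is omitted because the table gives it no numeric value.

```c
#include <stddef.h>
#include <string.h>

struct std_flag {
    int         case_id;
    const char *name;
    long        value;   /* value stored in dword_126EF68 */
};

/* Rows transcribed from the C and C++ standard tables. */
static const struct std_flag std_flags[] = {
    { 179, "c89",   198912 },
    { 178, "c99",   199901 },
    { 241, "c11",   201112 },
    { 242, "c17",   201710 },
    { 243, "c23",   202311 },
    { 228, "c++03", 199711 },
    { 204, "c++11", 201103 },
    { 240, "c++14", 201402 },
    { 246, "c++17", 201703 },
    { 251, "c++20", 202002 },
    { 252, "c++23", 202302 },
};

/* Returns the standard value for a flag name, or -1 if it is not a
 * numeric standard selector. */
static long std_value_for_flag(const char *name)
{
    for (size_t i = 0; i < sizeof std_flags / sizeof std_flags[0]; i++) {
        if (strcmp(std_flags[i].name, name) == 0)
            return std_flags[i].value;
    }
    return -1;
}
```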

Diagnostic Control Flags

The five diag_* flags (IDs 39--43) accept comma-separated lists of diagnostic numbers. The parser strips whitespace, splits on commas, and calls sub_4ED400(number, severity, 1) for each number:

--diag_suppress=1234,5678       # suppress errors 1234 and 5678
--diag_warning=20001            # demote CUDA error 20001 to warning
--diag_error=111                # promote diagnostic 111 to error
--diag_remark=185               # demote diagnostic 185 to remark
--diag_once=175                 # emit diagnostic 175 only once
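
The list handling described above (strip whitespace, split on commas) can be sketched as follows. `parse_diag_list` is a hypothetical helper, not the decompiled routine; in the binary each parsed number would then be passed to sub_4ED400(number, severity, 1).

```c
#include <ctype.h>
#include <stdlib.h>

/* Parse a comma-separated list such as "1234, 5678" into out[].
 * Returns the count of numbers stored (at most out_cap). Parsing stops
 * at the first token that is not a decimal number. */
static int parse_diag_list(const char *list, long *out, int out_cap)
{
    int n = 0;
    const char *p = list;
    while (*p != '\0' && n < out_cap) {
        /* Skip separators and surrounding whitespace. */
        while (isspace((unsigned char)*p) || *p == ',')
            p++;
        if (*p == '\0')
            break;
        char *end;
        long num = strtol(p, &end, 10);
        if (end == p)
            break;  /* not a number */
        out[n++] = num;
        p = end;
    }
    return n;
}
```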

The error number system is documented in Diagnostic System Overview. Internal numbers above 3456 correspond to the 20000-series CUDA errors via the offset formula display_code = internal_code + 16543 (for example, internal 3457 is displayed as 20000).
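
The offset formula can be written as a pair of conversion helpers. The function and macro names are hypothetical, and leaving numbers at or below 3456 unchanged is an assumption based on the statement that only higher internal numbers map to the 20000 series.

```c
#define LAST_EDG_INTERNAL 3456   /* highest internal number shown as-is (assumed) */
#define CUDA_DIAG_OFFSET  16543  /* display_code = internal_code + 16543 */

/* Internal diagnostic number -> number shown to the user. */
static int display_code(int internal_code)
{
    if (internal_code > LAST_EDG_INTERNAL)
        return internal_code + CUDA_DIAG_OFFSET;
    return internal_code;
}

/* Number shown to the user -> internal diagnostic number. */
static int internal_code_from_display(int shown)
{
    if (shown > LAST_EDG_INTERNAL + CUDA_DIAG_OFFSET)
        return shown - CUDA_DIAG_OFFSET;
    return shown;
}
```

Under this mapping, `--diag_warning=20001` addresses internal diagnostic 3458.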

Post-Parsing Dialect Resolution

After the main parsing loop completes, proc_command_line executes a large block of dialect resolution logic that:

Resolves host compiler mode conflicts -- If both --gcc and --clang are set, or --cfront_2.1 is combined with modern modes, the resolution picks one and adjusts feature flags accordingly
Sets C++ feature flags from __cplusplus version -- Based on the value in dword_126EF68:
- 199711 (C++98/03): baseline features only
- 201103 (C++11): enables lambdas, rvalue refs, auto, nullptr, variadic templates, range-based for, delegating constructors, unrestricted unions, user-defined literals
- 201402 (C++14): adds digit separators, generic lambdas, relaxed constexpr
- 201703 (C++17): adds aligned new, exception spec in function type, structured bindings
- 202002 (C++20): adds concepts, modules, coroutines
- 202302 (C++23): adds latest features
Applies GCC version thresholds -- When in GCC compatibility mode, certain features are gated on the GCC version number stored in qword_126EF98 (default 80100 = GCC 8.1.0). Known thresholds:
- 40299 (0x9D6B): GCC 4.2
- 40599 (0x9E97): GCC 4.5
- 40699 (0x9EFB): GCC 4.6
- Higher versions enable progressively more features
Opens output files -- Error output, listing file, output file
Processes the input filename -- The remaining non-flag argv entry
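
Two parts of the resolution step lend themselves to a small sketch: the cumulative feature groups keyed on the __cplusplus value, and the integer encoding of GCC versions. All names here are hypothetical. The `>=` comparisons assume each standard adds to the previous one, and the `major*10000 + minor*100 + patch` encoding is inferred from 80100 meaning GCC 8.1.0.

```c
/* One bit per feature group listed above (contents abbreviated). */
enum {
    FEAT_CXX11 = 1 << 0,  /* lambdas, rvalue refs, auto, nullptr, ... */
    FEAT_CXX14 = 1 << 1,  /* digit separators, generic lambdas, ... */
    FEAT_CXX17 = 1 << 2,  /* aligned new, structured bindings, ... */
    FEAT_CXX20 = 1 << 3,  /* concepts, modules, coroutines */
    FEAT_CXX23 = 1 << 4   /* latest features */
};

/* Feature groups enabled for a given __cplusplus value. */
static unsigned feature_groups_for(long cplusplus)
{
    unsigned mask = 0;
    if (cplusplus >= 201103) mask |= FEAT_CXX11;
    if (cplusplus >= 201402) mask |= FEAT_CXX14;
    if (cplusplus >= 201703) mask |= FEAT_CXX17;
    if (cplusplus >= 202002) mask |= FEAT_CXX20;
    if (cplusplus >= 202302) mask |= FEAT_CXX23;
    return mask;
}

/* Encode a GCC version the way the defaults suggest: 8.1.0 -> 80100. */
static long gnu_version_encode(int major, int minor, int patch)
{
    return major * 10000L + minor * 100L + patch;
}
```

Read with this encoding, the threshold 40299 (0x9D6B) is "4.2.99", so a test of the form `version > 40299` would admit GCC 4.3.0 and later; whether the binary uses `>` or `>=` at each site is not stated here.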

Key Globals After Resolution

Global	Type	Content
`dword_126EF68`	`int32`	`__cplusplus` / `__STDC_VERSION__` value
`dword_126EFB4`	`int32`	Language mode: 0=unset, 1=C, 2=C++
`dword_126EFA8`	`int32`	GCC compatibility enabled
`dword_126EFA4`	`int32`	Clang compatibility enabled
`qword_126EF98`	`int64`	GCC version (default 80100)
`qword_126EF90`	`int64`	Clang version (default 90100)
`dword_126EFB0`	`int32`	GNU extensions enabled
`dword_126EFAC`	`int32`	Clang extensions enabled
`dword_126E4A8`	`int32`	SM architecture code (from `--target`)
`dword_126E1D4`	`int32`	MSVC target version

The set_flag / clear_flag Mechanism

Flag ID 193 (--set_flag / --clear_flag) provides a raw escape hatch. The argument is a flag name looked up in the off_D47CE0 table -- an array of {name, global_address} pairs. If the name is found, the corresponding global variable is set to the provided integer value (--set_flag=name=value) or cleared to 0 (--clear_flag=name). This mechanism allows nvcc to toggle internal EDG configuration flags that do not have dedicated CLI flag registrations.
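
The name-to-global lookup can be sketched as below. The table contents are placeholders: `demo_feature_a` and `demo_feature_b` are invented names standing in for whatever off_D47CE0 actually holds, and the helper names are hypothetical as well.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical globals standing in for internal EDG configuration flags. */
static int demo_feature_a;
static int demo_feature_b;

struct named_global {
    const char *name;
    int        *target;
};

/* Placeholder for the {name, global_address} table. */
static const struct named_global named_globals[] = {
    { "demo_feature_a", &demo_feature_a },
    { "demo_feature_b", &demo_feature_b },
};

static int *find_named_global(const char *name, size_t len)
{
    for (size_t i = 0; i < sizeof named_globals / sizeof named_globals[0]; i++) {
        const char *n = named_globals[i].name;
        if (strlen(n) == len && strncmp(n, name, len) == 0)
            return named_globals[i].target;
    }
    return NULL;
}

/* Handle the "name=value" part of --set_flag=name=value.
 * Returns 0 on success, -1 if malformed or the name is unknown. */
static int apply_set_flag(const char *arg)
{
    const char *eq = strchr(arg, '=');
    if (eq == NULL)
        return -1;
    int *g = find_named_global(arg, (size_t)(eq - arg));
    if (g == NULL)
        return -1;
    *g = (int)strtol(eq + 1, NULL, 10);
    return 0;
}

/* Handle the "name" part of --clear_flag=name. */
static int apply_clear_flag(const char *name)
{
    int *g = find_named_global(name, strlen(name));
    if (g == NULL)
        return -1;
    *g = 0;
    return 0;
}
```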

Default Values

default_init (sub_45EB40) runs before proc_command_line and initializes approximately 350 global configuration variables. Notable non-zero defaults:

Global	Default	Meaning
`dword_106C210`	1	Exceptions enabled
`dword_106C180`	1	RTTI enabled
`dword_106C178`	1	`bool` is keyword
`dword_106C194`	1	Namespaces enabled
`dword_106C19C`	1	Argument-dependent lookup enabled
`dword_106C1A0`	1	Class name injection enabled
`dword_106C1A4`	1	String literals are const
`dword_106C188`	1	`wchar_t` is keyword
`dword_106C18C`	1	Alternative tokens enabled
`dword_106C140`	1	Compound literals allowed
`dword_106C138`	1	Dependent name processing enabled
`dword_106C134`	1	Template parsing enabled
`dword_106C12C`	1	Friend injection enabled
`dword_106BDB8`	1	`restrict` enabled
`dword_106BDB0`	1	Remove unneeded entities enabled
`dword_106BD98`	1	Trigraphs enabled
`dword_106BD68`	1	Guiding declarations allowed
`dword_106BD58`	1	Old specializations allowed
`dword_106BD54`	1	Implicit typename enabled
`dword_106BE84`	1	Generate move operations enabled
`dword_106C064`	1	Stack limit modification enabled
`qword_106BD10`	200	Max constexpr recursion depth
`qword_106BD08`	256	Max constexpr evaluation cost
`qword_126EF98`	80100	Default GCC version (8.1.0)
`qword_126EF90`	90100	Default Clang version (9.1.0)
`qword_126EF78`	1926	MSVC version threshold
`qword_126EF70`	99999	Upper-bound sentinel (purpose not identified)

Conflict Detection

Before the main parsing loop, check_conflicting_flags (sub_451E80) verifies that flags 3, 193, 194, and 195 (no_line_commands, set_flag / clear_flag, edg_base_dir, and embedded_c per the catalog above) are not used in conflicting combinations. If a conflict is detected, error 1027 is emitted.

Version Banners

Two flags print version information:

--version (ID 21, -v):

cudafe: NVIDIA (R) Cuda Language Front End
Portions Copyright (c) 2005, 2024 NVIDIA Corporation
Portions Copyright (c) 1988-2018, 2024 Edison Design Group Inc.
Based on Edison Design Group C/C++ Front End, version 6.6
Cuda compilation tools, release 13.0, V13.0.88

--Version (ID 92, -V): Prints a different copyright format with full date/time stamp, then calls exit(1).

Cross-References

Pipeline Overview -- Stage 2 is proc_command_line
Diagnostic System Overview -- diag_suppress/diag_error flag handling
Architecture Detection -- --target flag and SM version parsing
Experimental Flags -- --set_flag/--clear_flag for internal feature gates
EDG 6.6 Overview -- cmd_line.c source file context

EDG Build Configuration

cudafe++ is built from Edison Design Group (EDG) C/C++ front end source code, version 6.6. At build time, NVIDIA sets approximately 750 compile-time constants that control every aspect of the front end's behavior -- from which backend generates output, to how the IL system operates, to what ABI conventions are followed. These constants are baked into the binary and cannot be changed at runtime. They represent the specific EDG configuration NVIDIA chose for CUDA compilation.

The function dump_configuration (sub_44CF30, 785 lines) prints all 747 constants as C preprocessor #define statements when invoked with --dump_configuration. Of these, 613 are defined and 134 are explicitly listed as "not defined." The output is written to qword_126EDF0 (the configuration output stream, typically stderr) in alphabetical order.

$ cudafe++ --dump_configuration
/* Configuration data for Edison Design Group C/C++ Front End */
/* version 6.6, built on Aug 20 2025 at 13:59:03. */

#define ABI_CHANGES_FOR_ARRAY_NEW_AND_DELETE 1
#define ABI_CHANGES_FOR_CONSTRUCTION_VTBLS 1
...
#define WRITE_SIGNOFF_MESSAGE 1

/* Legacy configuration: <unnamed> */
#define LEGACY_TARGET_CONFIGURATION_NAME NULL

The constants fall into seven categories: backend selection, IL system, internal checking, diagnostics, target platform model, compiler compatibility, and feature defaults.

Backend Selection

The EDG front end supports multiple backend code generators. NVIDIA configured cudafe++ for the C++ code generation backend (cp_gen_be), which means the front end's output is C++ source code -- not object code, not C, and not a serialized IL file.

Constant	Value	Meaning
`BACK_END_IS_CP_GEN_BE`	`1`	Backend generates C++ source (the `.int.c` output)
`BACK_END_IS_C_GEN_BE`	`0`	Not the C code generation backend
`BACK_END_SHOULD_BE_CALLED`	`1`	Backend phase is active (front end does not stop after parsing)
`CP_GEN_BE_TARGET_MATCHES_SOURCE_DIALECT`	`1`	Generated C++ targets the same dialect as the input
`GEN_CPP_FILE_SUFFIX`	`".int.c"`	Output file suffix for generated C++
`GEN_C_FILE_SUFFIX`	`".int.c"`	Output file suffix for generated C (same as C++, unused)

This is the central architectural fact about cudafe++. It is a source-to-source translator: CUDA C++ goes in, host-side C++ with device stubs comes out. The cp_gen_be backend walks the IL tree and emits syntactically valid C++ that the host compiler (gcc/clang/MSVC) can consume. The generated code preserves the original types, templates, and namespaces rather than lowering to a simpler representation.

The CP_GEN_BE_TARGET_MATCHES_SOURCE_DIALECT=1 setting means the backend does not down-level the output. If the input is C++17, the generated code uses C++17 constructs. This avoids the complexity of translating modern C++ features into older dialects.

Disabled Backend Features

Several backend capabilities are compiled out:

Constant	Value	Meaning
`GCC_IS_GENERATED_CODE_TARGET`	`0`	Output is not GCC-specific C
`CLANG_IS_GENERATED_CODE_TARGET`	`0`	Output is not Clang-specific C
`MSVC_IS_GENERATED_CODE_TARGET`	`0`	Output is not MSVC-specific C
`SUN_IS_GENERATED_CODE_TARGET`	`0`	Output is not Sun/Oracle compiler C
`MICROSOFT_DIALECT_IS_GENERATED_CODE_TARGET`	`0`	Output does not use Microsoft C++ extensions

None of the compiler-specific code generation targets are enabled. The cp_gen_be emits portable C++ that is syntactically valid across all major compilers. This is possible because CUDA's host compilation already controls dialect selection through its own flag forwarding to the host compiler.

IL System

The Intermediate Language (IL) system is the core data structure connecting the parser to the backend. NVIDIA's configuration makes a critical choice: the IL is never serialized to disk.

Constant	Value	Meaning
`IL_SHOULD_BE_WRITTEN_TO_FILE`	`0`	IL stays in memory -- never written to an IL file
`DO_IL_LOWERING`	`0`	No IL transformation passes before backend
`IL_WALK_NEEDED`	`1`	IL walker infrastructure is compiled in
`IL_VERSION_NUMBER`	`"6.6"`	IL format version, matches EDG version
`ALL_TEMPLATE_INFO_IN_IL`	`1`	Complete template metadata in the IL graph
`PROTOTYPE_INSTANTIATIONS_IN_IL`	`1`	Uninstantiated function prototypes preserved
`NEED_IL_DISPLAY`	`1`	IL display/dump routines compiled in
`NEED_NAME_MANGLING`	`1`	Name mangling infrastructure compiled in
`NEED_DECLARATIVE_WALK`	`0`	Declarative IL walker not needed

Why IL_SHOULD_BE_WRITTEN_TO_FILE=0 Matters

In a standard EDG deployment (like the Comeau C++ compiler or Intel ICC's older front end), the IL can be serialized to a binary file for separate backend processing. With IL_SHOULD_BE_WRITTEN_TO_FILE=0, NVIDIA eliminates the entire IL serialization path. The IL exists only as an in-memory graph during compilation:

The parser builds IL nodes in region-based arenas (file-scope region 1, per-function region N)
The IL walker traverses the graph to select device vs. host code
The cp_gen_be backend reads the IL graph directly and emits C++ source
The arenas are freed

This design means the IL_FILE_SUFFIX constant is left undefined -- there is no suffix because there is no file. The constants LARGE_IL_FILE_SUPPORT, USE_TEMPLATE_INFO_FILE, TEMPLATE_INFO_FILE_SUFFIX, INSTANTIATION_FILE_SUFFIX, and EXPORTED_TEMPLATE_FILE_SUFFIX are all similarly undefined.

Why DO_IL_LOWERING=0 Matters

IL lowering is an optional transformation pass that simplifies the IL before the backend processes it. In a lowering-enabled build, complex constructs (VLAs, complex numbers, class rvalue adjustments) are reduced to simpler forms. With DO_IL_LOWERING=0, NVIDIA bypasses almost all of this:

Constant	Value	Meaning
`DO_IL_LOWERING`	`0`	Master lowering switch is off
`LOWER_COMPLEX`	`0`	No lowering of `_Complex` types
`LOWER_VARIABLE_LENGTH_ARRAYS`	`0`	VLAs passed through as-is
`LOWER_CLASS_RVALUE_ADJUST`	`0`	No rvalue conversion lowering
`LOWER_FIXED_POINT`	`0`	No fixed-point lowering
`LOWER_IFUNC`	`0`	No indirect function lowering
`LOWER_STRING_LITERALS_TO_NON_CONST`	`0`	String literals keep const qualification
`LOWER_EXTERN_INLINE`	`1`	Exception: extern inline functions are lowered
`LOWERING_NORMALIZES_BOOLEAN_CONTROLLING_EXPRESSIONS`	`0`	No boolean normalization
`LOWERING_REMOVES_UNNEEDED_CONSTRUCTIONS_AND_DESTRUCTIONS`	`0`	No dead construction removal

The only lowering that remains active is LOWER_EXTERN_INLINE=1, which handles extern inline functions that need special treatment in the generated output. Everything else passes through the IL untransformed.

This makes sense for cudafe++'s role. As a source-to-source translator, it benefits from preserving the original code structure. The host compiler performs the actual lowering when it compiles the generated .int.c file.

Why IL_WALK_NEEDED=1 Matters

Despite no serialization and no lowering, the IL walk infrastructure is compiled in. This is because cudafe++ uses the IL walker for its primary CUDA-specific task: device/host code separation. The walker traverses the IL graph and marks each entity with execution space flags (__host__, __device__, __global__), then the backend selectively emits code based on which space is being generated.
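
A minimal sketch of the mark-then-select idea follows. The Entity and Space types and the select function are hypothetical and are not recovered from the binary. The sketch also ignores how kernels are represented on the host side (device stubs), and only assumes the standard CUDA rule that an unannotated function is a host function.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Space(Flag):
    """Execution-space marks; the names mirror the CUDA qualifiers."""
    HOST = auto()
    DEVICE = auto()
    GLOBAL = auto()


@dataclass
class Entity:
    """Hypothetical stand-in for an IL entity carrying space marks."""
    name: str
    space: Space = Space.HOST  # unannotated functions are host functions


def select(entities, wanted: Space):
    """Return the names of entities marked with any of the wanted spaces."""
    return [e.name for e in entities if e.space & wanted]


entities = [
    Entity("main"),
    Entity("helper", Space.HOST | Space.DEVICE),
    Entity("kernel", Space.GLOBAL),
]
```

With these entities, select(entities, Space.HOST) keeps main and helper, while select(entities, Space.DEVICE | Space.GLOBAL) keeps helper and kernel.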

Template Information Preservation

Constant	Value	Meaning
`ALL_TEMPLATE_INFO_IN_IL`	`1`	Full template definitions in the IL, not a separate database
`PROTOTYPE_INSTANTIATIONS_IN_IL`	`1`	Even uninstantiated prototypes kept
`RECORD_TEMPLATE_STRINGS`	`1`	Template argument strings preserved
`RECORD_HIDDEN_NAMES_IN_IL`	`1`	Names hidden by using declarations still recorded
`RECORD_UNRECOGNIZED_ATTRIBUTES`	`1`	Unknown `[[attributes]]` preserved in IL
`RECORD_RAW_ASM_OPERAND_DESCRIPTIONS`	`1`	Raw asm operand text kept
`KEEP_TEMPLATE_ARG_EXPR_THAT_CAUSES_INSTANTIATION`	`1`	Template argument expressions that trigger instantiation are retained

With ALL_TEMPLATE_INFO_IN_IL=1, template definitions, partial specializations, and instantiation directives live directly in the IL graph. This eliminates the need for a separate template information file (USE_TEMPLATE_INFO_FILE is undefined). Combined with PROTOTYPE_INSTANTIATIONS_IN_IL=1, the IL retains complete template metadata -- even for function templates that have been declared but not yet instantiated. This is essential for CUDA's device/host separation, where a template might be instantiated in different execution spaces.

Internal Checking

NVIDIA builds cudafe++ with assertions enabled. This produces a binary with extensive runtime self-checking.

Constant	Value	Meaning
`CHECKING`	`1`	Internal assertion macros are active
`DEBUG`	`1`	Debug-mode code paths are compiled in
`CHECK_SWITCH_DEFAULT_UNEXPECTED`	`1`	Default cases in switch statements trigger assertions
`EXPENSIVE_CHECKING`	`0`	Costly O(n) verification checks are disabled
`OVERWRITE_FREED_MEM_BLOCKS`	`0`	No memory poisoning on free
`EXIT_ON_INTERNAL_ERROR`	`0`	Internal errors do not call exit() directly
`ABORT_ON_INIT_COMPONENT_LEAKAGE`	`0`	No abort on init-time leaks
`TRACK_INTERPRETER_ALLOCATIONS`	`0`	constexpr interpreter does not track allocations

Assertion Infrastructure

With CHECKING=1, the internal assertion handler internal_error (sub_4F2930) is live. The binary contains 5,178 call sites across 2,139 functions that invoke this handler. Each call site passes the source file name, line number, function name, and a pair of diagnostic messages. When an assertion fires, the handler constructs error 2656 with severity level 11 (catastrophic) and reports it through the standard diagnostic infrastructure.

The DEBUG=1 setting enables additional code paths that perform intermediate consistency checks during parsing and IL construction. These checks are cheaper than the ones EXPENSIVE_CHECKING would add (that option is off) but still add measurable overhead to compilation time. NVIDIA presumably leaves both CHECKING and DEBUG on because cudafe++ is a critical toolchain component where silent corruption is far worse than a slightly slower compilation.

The CHECK_SWITCH_DEFAULT_UNEXPECTED=1 setting means that every switch statement in the EDG source that handles enumerated values will trigger an assertion if control reaches the default case. This catches missing case handling when new enum values are added.
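
The assertion pattern can be illustrated with a small Python sketch. Everything except the error number 2656 and severity 11 is hypothetical: the Diagnostic record, the message layout, the Kind enum, and the use of an exception in place of the real diagnostic reporting path are all stand-ins.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass
class Diagnostic:
    number: int
    severity: int
    text: str


class InternalError(Exception):
    """Stand-in for reporting through the diagnostic infrastructure."""
    def __init__(self, diagnostic: Diagnostic):
        super().__init__(diagnostic.text)
        self.diagnostic = diagnostic


def internal_error(file: str, line: int, function: str, message: str):
    """Build error 2656 at severity 11 (values from the analysis) and raise it."""
    text = f"{file}, line {line}: {function}: {message}"  # invented layout
    raise InternalError(Diagnostic(2656, 11, text))


class Kind(Enum):
    TYPE = 1
    VARIABLE = 2
    ROUTINE = 3  # imagine this value was added after describe() was written


def describe(kind: Kind) -> str:
    if kind is Kind.TYPE:
        return "type"
    elif kind is Kind.VARIABLE:
        return "variable"
    else:
        # Equivalent of an asserting default case in a C switch.
        internal_error("example.c", 42, "describe", "unexpected kind")
```

Here describe(Kind.ROUTINE) fails loudly instead of silently falling through, which is the behavior the asserting default case is meant to guarantee.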

Diagnostics Configuration

These constants control the default formatting and behavior of compiler error messages.

Constant	Value	Meaning
`DEFAULT_BRIEF_DIAGNOSTICS`	`0`	Full diagnostics by default (not one-line)
`DEFAULT_DISPLAY_ERROR_NUMBER`	`0`	Error numbers hidden by default
`COLUMN_NUMBER_IN_BRIEF_DIAGNOSTICS`	`1`	Column numbers included in brief-mode output
`DEFAULT_ENABLE_COLORIZED_DIAGNOSTICS`	`1`	ANSI color codes enabled by default
`MAX_ERROR_OUTPUT_LINE_LENGTH`	`79`	Diagnostic lines wrap at 79 characters
`DEFAULT_CONTEXT_LIMIT`	`10`	Maximum 10 lines of instantiation context shown
`DEFAULT_DISPLAY_ERROR_CONTEXT_ON_CATASTROPHE`	`1`	Show context even on fatal errors
`DEFAULT_ADD_MATCH_NOTES`	`1`	Add notes explaining overload/template resolution
`DEFAULT_DISPLAY_TEMPLATE_TYPEDEFS_IN_DIAGNOSTICS`	`0`	Use raw types, not typedef aliases, in messages
`DEFAULT_OUTPUT_MODE`	`om_text`	Default output is text, not SARIF JSON
`DEFAULT_MACRO_POSITIONS_IN_DIAGNOSTICS`	(undefined)	Macro expansion position tracking is off
`ERROR_SEVERITY_EXPLICIT_IN_ERROR_MESSAGES`	`1`	Severity word ("error"/"warning") always printed
`DIRECT_ERROR_OUTPUT_TO_STDOUT`	`0`	Errors go to stderr
`WRITE_SIGNOFF_MESSAGE`	`1`	Print summary line at compilation end

Color Configuration

The DEFAULT_EDG_COLORS constant encodes ANSI SGR (Select Graphic Rendition) color codes for diagnostic categories:

"error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32"

Category	SGR Code	Appearance
`error`	`01;31`	Bold red
`warning`	`01;35`	Bold magenta
`note`	`01;36`	Bold cyan
`locus`	`01`	Bold (default color)
`quote`	`01`	Bold (default color)
`range1`	`32`	Green (non-bold)

This matches GCC's default diagnostic color scheme, presumably so that cudafe++ diagnostics look visually consistent with the host GCC compiler's output.
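
The color string can be decoded mechanically. The sketch below parses the category=SGR pairs and wraps text in the corresponding escape sequence. The function names are hypothetical, and the reset sequence used here is the generic SGR reset, not necessarily the byte sequence cudafe++ emits.

```python
DEFAULT_EDG_COLORS = "error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32"


def parse_colors(spec: str) -> dict:
    """Split 'category=sgr:category=sgr' into a category -> SGR mapping."""
    colors = {}
    for entry in spec.split(":"):
        category, _, sgr = entry.partition("=")
        colors[category] = sgr
    return colors


def colorize(text: str, category: str, colors: dict) -> str:
    """Wrap text in the SGR sequence for category, if one is configured."""
    sgr = colors.get(category)
    if sgr is None:
        return text
    return f"\x1b[{sgr}m{text}\x1b[0m"
```

For example, colorize("error", "error", parse_colors(DEFAULT_EDG_COLORS)) produces the bold red sequence ESC[01;31m around the word, while an unknown category leaves the text unchanged.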

ABI Configuration

Constant	Value	Meaning
`ABI_COMPATIBILITY_VERSION`	`9999`	Maximum ABI compatibility level
`IA64_ABI`	`1`	Uses Itanium C++ ABI (standard on Linux)
`ABI_CHANGES_FOR_ARRAY_NEW_AND_DELETE`	`1`	Array new/delete ABI changes active
`ABI_CHANGES_FOR_CONSTRUCTION_VTBLS`	`1`	Construction vtable ABI changes active
`ABI_CHANGES_FOR_COVARIANT_VIRTUAL_FUNC_RETURN`	`1`	Covariant return ABI changes active
`ABI_CHANGES_FOR_PLACEMENT_DELETE`	`1`	Placement delete ABI changes active
`ABI_CHANGES_FOR_RTTI`	`1`	RTTI ABI changes active
`DRIVER_COMPATIBILITY_VERSION`	`9999`	Maximum driver-level compatibility

The ABI_COMPATIBILITY_VERSION=9999 is a sentinel meaning "accept all ABI changes." In EDG's versioning scheme, specific ABI compatibility versions can be set to match a particular compiler release (e.g., GCC 3.2's ABI). Setting it to 9999 means cudafe++ uses the latest ABI rules for every construct, which is appropriate because it generates source code, and the host compiler applies its own ABI rules when it compiles that source.

All five ABI_CHANGES_FOR_* constants are set to 1, meaning every ABI improvement EDG has made is active. These affect name mangling, vtable layout, and RTTI representation. Since cudafe++ emits C++ source rather than object code, these primarily affect name mangling output and the structure of compiler-generated entities.

Compiler Compatibility Layer

cudafe++ emulates GCC by default. These constants configure the compatibility surface.

Constant	Value	Meaning
`DEFAULT_GNU_COMPATIBILITY`	`1`	GCC compatibility mode is on by default
`DEFAULT_GNU_VERSION`	`80100`	Default GCC version = 8.1.0
`GNU_TARGET_VERSION_NUMBER`	`70300`	Target GCC version = 7.3.0
`DEFAULT_GNU_ABI_VERSION`	`30200`	Default GNU ABI version = 3.2.0
`DEFAULT_CLANG_COMPATIBILITY`	`0`	Clang compat off by default
`DEFAULT_CLANG_VERSION`	`90100`	Clang version if enabled = 9.1.0
`DEFAULT_MICROSOFT_COMPATIBILITY`	`0`	MSVC compat off by default
`DEFAULT_MICROSOFT_VERSION`	`1926`	MSVC version if enabled = 19.26 (VS 2019)
`MSVC_TARGET_VERSION_NUMBER`	`1926`	Same: MSVC 19.26 target
`GNU_EXTENSIONS_ALLOWED`	`1`	GNU extensions compiled into the parser
`GNU_X86_ASM_EXTENSIONS_ALLOWED`	`1`	GNU inline asm syntax supported
`GNU_X86_ATTRIBUTES_ALLOWED`	`1`	GNU `__attribute__` on x86 targets
`GNU_VECTOR_TYPES_ALLOWED`	`1`	GNU vector types (`__attribute__((vector_size(...)))`)
`GNU_VISIBILITY_ATTRIBUTE_ALLOWED`	`1`	`__attribute__((visibility(...)))` support
`GNU_INIT_PRIORITY_ATTRIBUTE_ALLOWED`	`1`	`__attribute__((init_priority(...)))` support
`MICROSOFT_EXTENSIONS_ALLOWED`	`0`	MSVC extensions not available
`SUN_EXTENSIONS_ALLOWED`	`0`	Sun/Oracle extensions not available

The DEFAULT_GNU_VERSION=80100 encodes GCC 8.1.0 as major*10000 + minor*100 + patch. This is the baseline GCC version cudafe++ emulates when no version is supplied on its command line. At runtime, nvcc overrides this with the actual detected host GCC version via --gnu_version=NNNNN.

The version numbers stored here serve as fallback defaults. They affect which GNU extensions and builtins are available, which warning behaviors are emulated, and how __GNUC__ / __GNUC_MINOR__ / __GNUC_PATCHLEVEL__ are defined for the preprocessor.
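
The encoding is simple enough to check by hand. The helper names below are hypothetical, and the macro mapping only restates the major/minor/patch split described above.

```python
def encode_gnu_version(major: int, minor: int, patch: int) -> int:
    """Pack a GCC version as major*10000 + minor*100 + patch."""
    return major * 10000 + minor * 100 + patch


def decode_gnu_version(value: int) -> tuple:
    """Unpack an encoded version into (major, minor, patch)."""
    return value // 10000, (value // 100) % 100, value % 100


def gnuc_macros(value: int) -> dict:
    """Derive the three GCC version macros from an encoded version."""
    major, minor, patch = decode_gnu_version(value)
    return {
        "__GNUC__": major,
        "__GNUC_MINOR__": minor,
        "__GNUC_PATCHLEVEL__": patch,
    }
```

Decoding the three constants from the table gives 80100 as (8, 1, 0), 70300 as (7, 3, 0), and 30200 as (3, 2, 0).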

Disabled Compatibility Modes

Constant	Value	Meaning
`CFRONT_2_1_OBJECT_CODE_COMPATIBILITY`	`0`	No AT&T cfront 2.1 compat
`CFRONT_3_0_OBJECT_CODE_COMPATIBILITY`	`0`	No AT&T cfront 3.0 compat
`CFRONT_GLOBAL_VS_MEMBER_NAME_LOOKUP_BUG`	`0`	No cfront name lookup bug emulation
`DEFAULT_SUN_COMPATIBILITY`	(undefined)	No Sun/Oracle compat
`CPPCLI_ENABLING_POSSIBLE`	`0`	C++/CLI (managed C++) disabled
`CPPCX_ENABLING_POSSIBLE`	`0`	C++/CX (WinRT extensions) disabled
`DEFAULT_UPC_MODE`	`0`	Unified Parallel C disabled
`DEFAULT_EMBEDDED_C_ENABLED`	`0`	Embedded C extensions disabled

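NVIDIA disables every legacy compatibility mode listed above. GCC compatibility is the only mode on by default; Clang compatibility is off by default but has a configured fallback version, and Microsoft extensions are not available in this build. This is consistent with CUDA's host compiler support matrix: GCC and Clang on Linux, MSVC on Windows. The cfront, Sun, UPC, and embedded C modes are EDG capabilities that NVIDIA does not need.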

Target Platform Model

The TARG_* constants describe the target architecture's data model. Since cudafe++ is a source-to-source translator for the host side, these model x86-64 Linux.

Data Type Sizes (bytes)

Type	Size	Alignment
`char`	1	1
`short`	2	2
`int`	4	4
`long`	8	8
`long long`	8	8
`__int128`	16	16
`pointer`	8	8
`float`	4	4
`double`	8	8
`long double`	16	16
`__float80`	16	16
`__float128`	16	16
`ptr-to-data-member`	8	8
`ptr-to-member-function`	16	8
`ptr-to-virtual-base`	8	8

This is the standard LP64 data model (long and pointer are 64-bit). TARG_ALL_POINTERS_SAME_SIZE=1 confirms there are no near/far pointer distinctions.
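
To see what the size and alignment table implies, the sketch below applies the usual natural-alignment layout rule for a simple struct of scalar members. The rule and the layout function are standard C layout arithmetic used here for illustration, not logic recovered from the binary; only the size and alignment values come from the table.

```python
# (size, alignment) in bytes, copied from the table above.
TARGET_TYPES = {
    "char": (1, 1),
    "short": (2, 2),
    "int": (4, 4),
    "long": (8, 8),
    "pointer": (8, 8),
    "double": (8, 8),
    "long double": (16, 16),
}


def round_up(value: int, align: int) -> int:
    return (value + align - 1) // align * align


def layout(fields):
    """Return ({field: offset}, total_size) for a list of (name, type) pairs."""
    offset, max_align, offsets = 0, 1, {}
    for name, type_name in fields:
        size, align = TARGET_TYPES[type_name]
        offset = round_up(offset, align)  # pad to the member's alignment
        offsets[name] = offset
        offset += size
        max_align = max(max_align, align)
    return offsets, round_up(offset, max_align)  # tail padding
```

For a struct with a char, a pointer, and an int in that order, layout places them at offsets 0, 8, and 16 and reports a total size of 24 bytes.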

Key Target Properties

Constant	Value	Meaning
`TARG_CHAR_BIT`	`8`	8 bits per byte
`TARG_HAS_SIGNED_CHARS`	`1`	`char` is signed by default
`TARG_HAS_IEEE_FLOATING_POINT`	`1`	IEEE 754 floating point
`TARG_SUPPORTS_X86_64`	`1`	x86-64 target support
`TARG_SUPPORTS_ARM64`	`0`	No ARM64 target support
`TARG_SUPPORTS_ARM32`	`0`	No ARM32 target support
`TARG_DEFAULT_NEW_ALIGNMENT`	`16`	`operator new` returns 16-byte-aligned storage
`TARG_IA64_ABI_USE_GUARD_ACQUIRE_RELEASE`	`1`	Thread-safe static local init guards
`TARG_CASE_SENSITIVE_EXTERNAL_NAMES`	`1`	Symbol names are case-sensitive
`TARG_EXTERNAL_NAMES_GET_UNDERSCORE_ADDED`	`0`	No leading underscore on symbols

The TARG_SUPPORTS_ARM64=0 and TARG_SUPPORTS_ARM32=0 confirm that this build of cudafe++ targets x86-64 Linux only. NVIDIA produces separate cudafe++ builds for other host platforms (ARM64 Linux, Windows).

Floating Point Model

Constant	Value	Meaning
`FP_USE_EMULATION`	`1`	Floating-point constant folding uses software emulation
`USE_SOFTFLOAT`	`1`	Software floating-point library linked
`APPROXIMATE_QUADMATH`	`1`	`__float128` operations use approximate arithmetic
`USE_QUADMATH_LIBRARY`	`0`	Not linked against libquadmath
`HOST_FP_VALUE_IS_128BIT`	`1`	Host FP value representation uses 128 bits
`FP_LONG_DOUBLE_IS_80BIT_EXTENDED`	`1`	`long double` is x87 80-bit extended precision
`FP_LONG_DOUBLE_IS_BINARY128`	`0`	`long double` is not IEEE binary128
`FLOAT80_ENABLING_POSSIBLE`	`1`	`__float80` type can be enabled
`FLOAT128_ENABLING_POSSIBLE`	`1`	`__float128` type can be enabled

The FP_USE_EMULATION=1 and USE_SOFTFLOAT=1 settings mean cudafe++ does not use the host CPU's floating-point unit for constant folding during compilation. Instead, it uses a software emulation library. This guarantees deterministic results regardless of the build machine's FPU behavior, rounding mode, or x87 precision settings. The APPROXIMATE_QUADMATH=1 indicates that __float128 constant folding uses an approximate (but portable) implementation rather than requiring libquadmath.

Memory and Host Configuration

Constant	Value	Meaning
`USE_MMAP_FOR_MEMORY_REGIONS`	`1`	IL memory regions use `mmap`
`USE_MMAP_FOR_MODULES`	`1`	C++ module storage uses `mmap`
`HOST_ALLOCATION_INCREMENT`	`65536`	Arena grows in 64 KB increments
`HOST_ALIGNMENT_REQUIRED`	`8`	Host requires 8-byte alignment
`HOST_IL_ENTRY_PREFIX_ALIGNMENT`	`8`	IL node prefix aligned to 8 bytes
`HOST_POINTER_ALIGNMENT`	`8`	Pointer alignment on host platform
`USE_FIXED_ADDRESS_FOR_MMAP`	`0`	No fixed mmap addresses
`NULL_POINTER_IS_ZERO`	`1`	Null pointer has all-zero bit pattern

The USE_MMAP_FOR_MEMORY_REGIONS=1 setting means the IL's region-based arena allocator uses mmap system calls (likely MAP_ANONYMOUS) rather than malloc. This gives EDG more control over memory layout and allows whole-region deallocation via munmap without fragmentation concerns. The 64 KB allocation increment (HOST_ALLOCATION_INCREMENT=65536) means each arena expansion maps a new 64 KB page-aligned chunk.
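
A rough model of such an arena, assuming a bump allocator over anonymous mappings: the Region class and its methods are hypothetical, oversized requests are simply rejected, and Python's mmap.mmap(-1, n) is used as a stand-in for an anonymous mmap call. Only the 64 KB increment and the 8-byte alignment come from the constants above.

```python
import mmap

HOST_ALLOCATION_INCREMENT = 65536  # from the build configuration
HOST_ALIGNMENT_REQUIRED = 8        # from the build configuration


class Region:
    """Hypothetical bump allocator that grows in fixed-size mapped chunks."""

    def __init__(self):
        self.chunks = []  # anonymous mappings owned by this region
        self.used = 0     # bytes consumed in the most recent chunk

    def alloc(self, size: int):
        """Reserve size bytes; returns (chunk index, offset in chunk)."""
        a = HOST_ALIGNMENT_REQUIRED
        size = (size + a - 1) // a * a
        if size > HOST_ALLOCATION_INCREMENT:
            raise ValueError("oversized requests are outside this sketch")
        if not self.chunks or self.used + size > HOST_ALLOCATION_INCREMENT:
            self.chunks.append(mmap.mmap(-1, HOST_ALLOCATION_INCREMENT))
            self.used = 0
        offset = self.used
        self.used += size
        return len(self.chunks) - 1, offset

    def free_all(self):
        """Release the whole region at once, as munmap would."""
        for chunk in self.chunks:
            chunk.close()
        self.chunks.clear()
        self.used = 0
```

The point of the design is visible in free_all: individual allocations are never released, and the whole region goes away in one step.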

Code Generation Controls

These constants affect what the cp_gen_be backend emits.

Constant	Value	Meaning
`GENERATE_SOURCE_SEQUENCE_LISTS`	`1`	Source sequence lists (instantiation ordering) generated
`GENERATE_LINKAGE_SPEC_BLOCKS`	`1`	`extern "C"` blocks preserved in output
`USING_DECLARATIONS_IN_GENERATED_CODE`	`1`	`using` declarations appear in output
`GENERATE_EH_TABLES`	`0`	No EH tables -- host compiler handles exceptions
`GENERATE_MICROSOFT_IF_EXISTS_ENTRIES`	`0`	No `__if_exists` / `__if_not_exists` output
`SUPPRESS_ARRAY_STATIC_IN_GENERATED_CODE`	`1`	`static` in array parameter declarations suppressed
`GCC_BUILTIN_VARARGS_IN_GENERATED_CODE`	`0`	No GCC `__builtin_va_*` in output
`USE_HEX_FP_CONSTANTS_IN_GENERATED_CODE`	`0`	No hex float literals in output
`ADD_BRACES_TO_AVOID_DANGLING_ELSE_IN_GENERATED_C`	`0`	No extra braces for dangling else
`DOING_SOURCE_ANALYSIS`	`1`	Source analysis mode (affects what is preserved)

The GENERATE_EH_TABLES=0 is significant. Exception handling tables are not generated because cudafe++ emits source code -- the host compiler is responsible for generating the actual EH tables when it compiles the .int.c output. Similarly, GCC_BUILTIN_VARARGS_IN_GENERATED_CODE=0 means the output uses standard <stdarg.h> varargs rather than GCC builtins, keeping the output compiler-portable.

Template and Instantiation Model

Constant	Value	Meaning
`AUTOMATIC_TEMPLATE_INSTANTIATION`	`0`	No automatic instantiation to separate files
`INSTANTIATION_BY_IMPLICIT_INCLUSION`	`1`	Template definitions found via implicit include
`INSTANTIATE_TEMPLATES_EVERYWHERE_USED`	`0`	Not every use triggers instantiation
`INSTANTIATE_EXTERN_INLINE`	`0`	Extern inline templates not instantiated eagerly
`INSTANTIATE_INLINE_VARIABLES`	`0`	Inline variables not instantiated eagerly
`INSTANTIATE_BEFORE_PCH_CREATION`	`0`	No instantiation before PCH
`DEFAULT_INSTANTIATION_MODE`	`tim_none`	No separate instantiation mode
`DEFAULT_MAX_PENDING_INSTANTIATIONS`	`200`	Maximum pending instantiations per TU
`MAX_TOTAL_PENDING_INSTANTIATIONS`	`256`	Hard cap on total pending
`MAX_UNUSED_ALL_MODE_INSTANTIATIONS`	`200`	Limit on unused instantiation entries
`DEFAULT_MAX_DEPTH_CONSTEXPR_CALL`	`256`	Maximum constexpr recursion depth
`DEFAULT_MAX_COST_CONSTEXPR_CALL`	`2000000`	Maximum constexpr evaluation cost

The AUTOMATIC_TEMPLATE_INSTANTIATION=0 and DEFAULT_INSTANTIATION_MODE=tim_none disable EDG's automatic template instantiation mechanism. This mechanism (where EDG writes instantiation requests to a file for later processing) is unnecessary because cudafe++ processes each translation unit in a single pass -- templates are instantiated inline as the parser encounters them, and the backend emits the instantiated code directly.

Feature Enablement Constants

The DEFAULT_* constants set the initial values of runtime-configurable features. These can be overridden by command-line flags, but they establish the baseline behavior when no flags are specified.

Enabled by Default

Constant	Value	Feature
`DEFAULT_GNU_COMPATIBILITY`	`1`	GCC compatibility mode
`DEFAULT_EXCEPTIONS_ENABLED`	`1`	C++ exception handling
`DEFAULT_RTTI_ENABLED`	`1`	Runtime type identification
`DEFAULT_BOOL_IS_KEYWORD`	`1`	`bool` is a keyword (not a typedef)
`DEFAULT_WCHAR_T_IS_KEYWORD`	`1`	`wchar_t` is a keyword
`DEFAULT_NAMESPACES_ENABLED`	`1`	Namespaces are supported
`DEFAULT_ARG_DEPENDENT_LOOKUP`	`1`	ADL (Koenig lookup) active
`DEFAULT_CLASS_NAME_INJECTION`	`1`	Class name injected into its own scope
`DEFAULT_EXPLICIT_KEYWORD_ENABLED`	`1`	`explicit` keyword recognized
`DEFAULT_EXTERN_INLINE_ALLOWED`	`1`	`extern inline` permitted
`DEFAULT_IMPLICIT_NOEXCEPT_ENABLED`	`1`	Implicit noexcept on dtors/deallocs
`DEFAULT_IMPLICIT_TYPENAME_ENABLED`	`1`	`typename` implicit in dependent contexts
`DEFAULT_TYPE_TRAITS_HELPERS_ENABLED`	`1`	Compiler intrinsic type traits
`DEFAULT_STRING_LITERALS_ARE_CONST`	`1`	String literals have const type
`DEFAULT_TYPE_INFO_IN_NAMESPACE_STD`	`1`	`type_info` in `std::`
`DEFAULT_C_AND_CPP_FUNCTION_TYPES_ARE_DISTINCT`	`1`	C and C++ function types differ
`DEFAULT_FRIEND_INJECTION`	`1`	Friend declarations inject names
`DEFAULT_DISTINCT_TEMPLATE_SIGNATURES`	`1`	Template signatures are distinct
`DEFAULT_ARRAY_NEW_AND_DELETE_ENABLED`	`1`	`operator new[]` / `operator delete[]`
`DEFAULT_CPP11_DEPENDENT_NAME_PROCESSING`	`1`	C++11-style dependent name processing
`DEFAULT_ENABLE_COLORIZED_DIAGNOSTICS`	`1`	ANSI color in diagnostics
`DEFAULT_CHECK_FOR_BYTE_ORDER_MARK`	`1`	UTF-8 BOM detection on
`DEFAULT_CHECK_PRINTF_SCANF_POSITIONAL_ARGS`	`1`	printf/scanf format checking
`DEFAULT_ALWAYS_FOLD_CALLS_TO_BUILTIN_CONSTANT_P`	`1`	`__builtin_constant_p` folded

Disabled by Default (Require Explicit Enabling)

Constant	Value	Feature
`DEFAULT_CPP_MODE`	`199711`	Default language standard is C++98
`DEFAULT_LAMBDAS_ENABLED`	`0`	Lambdas off (enabled by C++ version selection)
`DEFAULT_RVALUE_REFERENCES_ENABLED`	`0`	Rvalue refs off (enabled by C++ version)
`DEFAULT_VARIADIC_TEMPLATES_ENABLED`	`0`	Variadic templates off (enabled by C++ version)
`DEFAULT_NULLPTR_ENABLED`	`0`	`nullptr` off (enabled by C++ version)
`DEFAULT_RANGE_BASED_FOR_ENABLED`	`0`	Range-for off (enabled by C++ version)
`DEFAULT_AUTO_TYPE_SPECIFIER_ENABLED`	`0`	`auto` type deduction off (enabled by C++ version)
`DEFAULT_COMPOUND_LITERALS_ALLOWED`	`0`	C99 compound literals off
`DEFAULT_DESIGNATORS_ALLOWED`	`0`	C99/C++20 designated initializers off
`DEFAULT_C99_MODE`	`0`	Not in C99 mode
`DEFAULT_VLA_ENABLED`	`0`	Variable-length arrays off
`DEFAULT_CPP11_SFINAE_ENABLED`	`0`	C++11 SFINAE rules off (enabled by C++ version)
`DEFAULT_MODULES_ENABLED`	`0`	C++20 modules off
`DEFAULT_REFLECTION_ENABLED`	`0`	C++ reflection off
`DEFAULT_MICROSOFT_COMPATIBILITY`	`0`	MSVC compat off
`DEFAULT_CLANG_COMPATIBILITY`	`0`	Clang compat off
`DEFAULT_BRIEF_DIAGNOSTICS`	`0`	Full diagnostic output
`DEFAULT_DISPLAY_ERROR_NUMBER`	`0`	Error numbers hidden
`DEFAULT_INCOGNITO`	`0`	Not in incognito mode
`DEFAULT_REMOVE_UNNEEDED_ENTITIES`	`0`	Dead code not removed

The DEFAULT_CPP_MODE=199711 (C++98) looks surprising, but this is simply the EDG default. In practice, nvcc always passes an explicit --std=c++NN flag to cudafe++ that overrides this default, typically --std=c++17 in modern CUDA. The C++11/14/17/20 features listed as "disabled by default" are all enabled by the standard version selection code in proc_command_line.
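
The relationship between the language mode and the version-gated features can be sketched as follows. The mode values are the standard __cplusplus constants. The feature names and the single threshold test are illustrative assumptions and do not reproduce the actual logic in proc_command_line.

```python
# Standard __cplusplus values for each language revision.
CPP_MODE_VALUES = {
    "c++98": 199711,
    "c++11": 201103,
    "c++14": 201402,
    "c++17": 201703,
    "c++20": 202002,
}

# Features the table marks as "enabled by C++ version"; all date from C++11.
CPP11_FEATURES = (
    "lambdas",
    "rvalue_references",
    "variadic_templates",
    "nullptr",
    "range_based_for",
    "auto_type_specifier",
    "cpp11_sfinae",
)


def features_for_mode(cpp_mode: int) -> dict:
    """Return the on/off state of each C++11 feature for a given mode."""
    enabled = cpp_mode >= CPP_MODE_VALUES["c++11"]
    return {name: enabled for name in CPP11_FEATURES}
```

Under the build default of 199711 every entry is off, and selecting any mode from C++11 onward turns them all on.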

Predefined Macro Constants

These constants control which macros cudafe++ automatically defines for the preprocessor.

Constant	Value	Effect
`DEFINE_MACRO_WHEN_EXCEPTIONS_ENABLED`	`1`	`--exceptions` causes `#define __EXCEPTIONS`
`DEFINE_MACRO_WHEN_RTTI_ENABLED`	`1`	`--rtti` causes `#define __RTTI`
`DEFINE_MACRO_WHEN_BOOL_IS_KEYWORD`	`1`	`bool` keyword causes `#define _BOOL`
`DEFINE_MACRO_WHEN_WCHAR_T_IS_KEYWORD`	`1`	`wchar_t` keyword causes `#define _WCHAR_T`
`DEFINE_MACRO_WHEN_ARRAY_NEW_AND_DELETE_ENABLED`	`1`	Causes `#define __ARRAY_OPERATORS`
`DEFINE_MACRO_WHEN_PLACEMENT_DELETE_ENABLED`	`1`	Causes `#define __PLACEMENT_DELETE`
`DEFINE_MACRO_WHEN_VARIADIC_TEMPLATES_ENABLED`	`1`	Causes `#define __VARIADIC_TEMPLATES`
`DEFINE_MACRO_WHEN_CHAR16_T_AND_CHAR32_T_ARE_KEYWORDS`	`1`	Causes `#define __CHAR16_T_AND_CHAR32_T`
`DEFINE_MACRO_WHEN_LONG_LONG_IS_DISABLED`	`1`	Causes `#define __NO_LONG_LONG` when long long is off
`DEFINE_FEATURE_TEST_MACRO_OPERATORS_IN_ALL_MODES`	`1`	Feature test macros available in all modes
`MACRO_DEFINED_WHEN_IA64_ABI`	`"__EDG_IA64_ABI"`	Always defined (since `IA64_ABI=1`)
`MACRO_DEFINED_WHEN_TYPE_TRAITS_HELPERS_ENABLED`	`"__EDG_TYPE_TRAITS_ENABLED"`	Always defined (since type traits are on)

These macros allow header files to conditionally compile based on which compiler features are active. They are part of EDG's mechanism for compatibility with GCC's predefined macro surface -- GCC defines __EXCEPTIONS when exceptions are on, so cudafe++ does the same.
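
The table amounts to a mapping from feature state to macro names, which the sketch below makes explicit. The macro names come from the table; the feature keys, the function, and its signature are hypothetical.

```python
# Feature key (hypothetical) -> macro defined when that feature is enabled.
MACRO_FOR_FEATURE = {
    "exceptions": "__EXCEPTIONS",
    "rtti": "__RTTI",
    "bool_is_keyword": "_BOOL",
    "wchar_t_is_keyword": "_WCHAR_T",
    "array_new_and_delete": "__ARRAY_OPERATORS",
    "placement_delete": "__PLACEMENT_DELETE",
    "variadic_templates": "__VARIADIC_TEMPLATES",
    "char16_t_and_char32_t": "__CHAR16_T_AND_CHAR32_T",
}

# Defined unconditionally in this build, per the last two table rows.
ALWAYS_DEFINED = ("__EDG_IA64_ABI", "__EDG_TYPE_TRAITS_ENABLED")


def predefined_macros(features: dict, long_long_enabled: bool = True) -> set:
    """Return the set of macro names implied by the given feature state."""
    macros = set(ALWAYS_DEFINED)
    macros.update(m for f, m in MACRO_FOR_FEATURE.items() if features.get(f))
    if not long_long_enabled:
        macros.add("__NO_LONG_LONG")  # defined only when long long is off
    return macros
```

A header can then test for a name such as __EXCEPTIONS with #ifdef, exactly as it would under GCC.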

Miscellaneous Constants

Constant	Value	Meaning
`VERSION_NUMBER`	`"6.6"`	EDG front end version
`VERSION_NUMBER_FOR_MACRO`	`606`	Numeric form for `__EDG_VERSION__` macro
`DIRECTORY_SEPARATOR`	`'/'`	Unix path separator
`FILE_NAME_FOR_STDIN`	`"-"`	Standard Unix convention for stdin
`OBJECT_FILE_SUFFIX`	`".o"`	Unix object file suffix
`PCH_FILE_SUFFIX`	`".pch"`	Precompiled header suffix
`PREDEFINED_MACRO_FILE_NAME`	`"predefined_macros.txt"`	File with platform-defined macros
`DEFAULT_TMPDIR`	`"/tmp"`	Default temp directory
`DEFAULT_USR_INCLUDE`	`"/usr/include"`	Default system include path
`DEFAULT_EDG_BASE`	`""`	EDG base directory (empty = use argv[0] path)
`MAX_INCLUDE_FILES_OPEN_AT_ONCE`	`8`	Limit on simultaneously open include files
`MODULE_MAX_LINE_NUMBER`	`250000`	Maximum source lines per module
`COMPILE_MULTIPLE_SOURCE_FILES`	`0`	One source file per invocation
`COMPILE_MULTIPLE_TRANSLATION_UNITS`	`0`	One TU per invocation
`USING_DRIVER`	`0`	Not integrated into a driver binary
`EDG_WIN32`	`0`	Not a Windows build
`WINDOWS_PATHS_ALLOWED`	`0`	No backslash path separators

The VERSION_NUMBER="6.6" identifies this as EDG C/C++ front end version 6.6. VERSION_NUMBER_FOR_MACRO=606 becomes the __EDG_VERSION__ predefined macro, allowing header files to detect the exact EDG version (e.g., #if __EDG_VERSION__ >= 606).

The legacy configuration section at the bottom of the dump output reports LEGACY_TARGET_CONFIGURATION_NAME as NULL, meaning this build does not use a named legacy target configuration. In EDG's framework, named target configurations are used to preset constants for specific compilers (e.g., "gnu" or "microsoft"). NVIDIA's configuration is fully custom and does not map to any of EDG's predefined configurations.

Relationship Between Build Configuration and Runtime Flags

The build configuration constants and the runtime CLI flags form a two-layer system:

Build-time constants (CHECKING=1, BACK_END_IS_CP_GEN_BE=1, IL_SHOULD_BE_WRITTEN_TO_FILE=0) determine what code paths exist in the binary. If IL_SHOULD_BE_WRITTEN_TO_FILE=0, the IL serialization code is not compiled in -- no runtime flag can enable it.
DEFAULT_* constants set initial values for features that can be toggled at runtime. DEFAULT_EXCEPTIONS_ENABLED=1 means exceptions are on unless --no_exceptions is passed. These defaults are loaded by default_init (sub_45EB40) before command-line parsing.
*_ENABLING_POSSIBLE constants gate whether a feature can be toggled at all. COROUTINE_ENABLING_POSSIBLE=1 means the --coroutines / --no_coroutines flag pair is registered. REFLECTION_ENABLING_POSSIBLE=0 means the reflection flag pair is not even registered -- the feature cannot be turned on.

This layering means the build configuration determines the binary's permanent capabilities, while the CLI flags select among the enabled possibilities.
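
The three layers can be modeled in a few lines. This is a sketch under stated assumptions: the flag spelling --reflection is hypothetical (the analysis only says the reflection flag pair is not registered), the no_ prefix handling is simplified, and the functions merely mimic the roles of default_init, init_command_line_flags, and proc_command_line.

```python
# Layer 1 and 3: build-time constants, fixed once the binary is compiled.
BUILD_CONSTANTS = {
    "COROUTINE_ENABLING_POSSIBLE": 1,
    "REFLECTION_ENABLING_POSSIBLE": 0,
}

# Layer 2: DEFAULT_* values loaded before command-line parsing.
DEFAULTS = {"exceptions": 1}  # DEFAULT_EXCEPTIONS_ENABLED=1


def register_flags() -> set:
    """Register only the flag pairs the build configuration allows."""
    flags = {"exceptions"}
    if BUILD_CONSTANTS["COROUTINE_ENABLING_POSSIBLE"]:
        flags.add("coroutines")
    if BUILD_CONSTANTS["REFLECTION_ENABLING_POSSIBLE"]:
        flags.add("reflection")  # hypothetical flag name
    return flags


def configure(argv) -> dict:
    """Load defaults, then let registered flags override them."""
    state = dict(DEFAULTS)
    registered = register_flags()
    for arg in argv:
        name, value = arg[2:], 1
        if name.startswith("no_"):
            name, value = name[3:], 0
        if name not in registered:
            raise ValueError(f"unrecognized flag: {arg}")
        state[name] = value
    return state
```

In this model configure(["--no_exceptions"]) turns exceptions off, while configure(["--reflection"]) fails because the flag was never registered.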

Function Reference

Function	Address	Lines	Role
`dump_configuration`	`sub_44CF30`	785	Print all 747 constants as `#define` statements
`default_init`	`sub_45EB40`	470	Initialize approximately 350 config globals from `DEFAULT_*` values
`init_command_line_flags`	`sub_452010`	3,849	Register all CLI flags (gated by `*_ENABLING_POSSIBLE`)
`proc_command_line`	`sub_459630`	4,105	Parse flags and override `DEFAULT_*` settings

Architecture Detection

cudafe++ determines the target GPU architecture through a five-stage pipeline. nvcc translates the user-facing --gpu-architecture=sm_XX flag into an internal numeric index and passes it to cudafe++ via --target. The CLI parser stores the index in a global, set_target_configuration configures over 100 type-system globals for that target, and the TU initializer copies the index into per-translation-unit state, where feature gates read it throughout compilation. A parallel path, select_cp_gen_be_target_dialect, routes the backend to emit either device-side or host-side C++ based on a separate flag. This page documents the complete chain from nvcc invocation to the point where individual feature checks read the stored architecture value.

Key Facts

Property	Value
Target index global	`dword_126E4A8` (set by `--target`, CLI case 245)
Invalid sentinel	`-1` (`0xFFFFFFFF`)
Error on invalid target	Error 2664: `"invalid or no value specified with --nv_arch flag"`
Target parser stub	`sub_7525E0` (6 bytes, returns `-1` unconditionally)
Configuration function	`sub_7525F0` (`set_target_configuration`, `target.c:299`)
Type table initializer	`sub_7515D0` (100+ globals, called from `sub_7525F0`)
Configuration validator	`sub_7527B0` (`check_target_configuration`, `target.c:512-659`)
Field alignment initializer	`sub_752DF0` (`init_field_alignment_tables`, `target.c:825`)
Dialect selector	`sub_752A80` (`select_cp_gen_be_target_dialect`, `target.c:736`)
TU-level copy	`dword_126EBF8` (`target_configuration_index`, set in `sub_586240`)
GPU mode flag	`dword_126EFA8` (set by `--gcc`, case 182; gates dialect selection)
Device-side flag	`dword_126EFA4` (set by `--clang`, case 187; selects device vs host output)

The Full Propagation Chain

The architecture value flows through five distinct stages before it is available for feature gate checks. Each stage adds a layer of processing: parsing, validation, type model configuration, dialect routing, and per-TU state replication.

Stage 1: nvcc                         Stage 2: CLI parsing
  --gpu-architecture=sm_90    --->      case 245 (--target)
  translates to --target=<idx>          sub_7525E0(<arg>) -> dword_126E4A8
                                        if -1: error 2664, abort
                                            |
                                            v
Stage 3: Target init                   Stage 4: Dialect selection
  sub_7525F0(idx)                        sub_752A80()
    assert idx is -1 or 0                  if dword_126EFA8 (GPU mode):
    sub_7515D0()   -> 100+ type globals      if dword_126EFA4: device path
    qword_126E1B0 = "lib"                    else: host path
    sub_752DF0()   -> alignment tables
    sub_7527B0()   -> validation
                                            |
                                            v
Stage 5: TU initialization
  sub_586240()
    dword_126EBF8 = dword_126E4A8  (per-TU copy)
    version marker: "6.6\0"
    timestamp copy
                                            |
                                            v
Feature checks throughout compilation
  if (dword_126E4A8 < 70) { error("__grid_constant__ requires compute_70"); }
  if (dword_126E4A8 < 80) { error("__nv_register_params__ requires compute_80"); }
  ...

Stage 1: nvcc Translates the Architecture

Users specify the GPU architecture through nvcc:

nvcc --gpu-architecture=sm_90 source.cu

nvcc translates this into an internal numeric index and passes it to cudafe++ as --target=<index>. The value stored in dword_126E4A8 is NOT a raw SM number like 90 -- it is an index into EDG's target configuration table. nvcc performs the mapping from user-facing strings (sm_90, compute_80, etc.) to this index. cudafe++ never sees the sm_XX string directly.

The --target flag is registered as CLI flag 253 with the internal case_id 245 in the flag table:

// From sub_452010 (init_command_line_flags)
sub_451F80(245, "target", 0, 1, 1, 1);
//         ^id   ^name   ^no_short ^has_arg ^mode ^enabled

Stage 2: CLI Parsing (proc_command_line, case 245)

When proc_command_line (sub_459630) encounters --target, it dispatches to case 245:

// sub_459630, case 245 (decompiled)
case 245:
    v80 = sub_7525E0(qword_E7FF28, v23, v20, v30);
    dword_126E4A8 = v80;                   // store target index
    if (v80 == -1) {
        sub_4F8420(2664);                   // emit error 2664
        // "invalid or no value specified with --nv_arch flag"
        sub_4F2930("cmd_line.c", 12219,
                   "proc_command_line", 0, 0);  // assert-fail
    }
    sub_7525F0(v80);                        // set_target_configuration
    goto LABEL_136;                         // continue parsing

The error string references --nv_arch, which is the nvcc-facing name for this flag. Internally cudafe++ processes it as --target (case 245). The discrepancy exists because the error message is shared with nvcc's error reporting path.

The sub_7525E0 Stub

sub_7525E0 is the architecture parser function. In the CUDA Toolkit 13.0 binary, it is a 6-byte stub:

// sub_7525E0 -- 0x7525E0, 6 bytes
__int64 sub_7525E0()
{
    return 0xFFFFFFFFLL;  // always returns -1
}

; IDA disassembly
sub_7525E0:
    mov     eax, 0FFFFFFFFh
    retn

As decompiled, this stub returns the invalid sentinel -1 for every input. The call site (sub_7525E0(qword_E7FF28, v23, v20, v30)) passes four arguments, but the stub ignores all of them. Taken literally, case 245 would always store -1 in dword_126E4A8 and raise error 2664, so static analysis alone does not show how a valid index reaches the global. Two explanations are possible:

nvcc performs the actual parsing and passes the pre-resolved numeric index as the argument string. Under this reading sub_7525E0 was meant to convert that string (for example with strtol), and the body was eliminated during the build, leaving only the failure path.
The function is a placeholder that is replaced at link time by a different object file supplied when the toolchain is built.

Neither explanation is confirmed by the binary. The only behavior visible in the stub is the -1 return, which is the value that triggers error 2664.

Stage 3: set_target_configuration (sub_7525F0)

After the target index is stored, sub_7525F0 performs the post-parse initialization. This function lives in target.c:299:

// sub_7525F0 -- set_target_configuration
__int64 __fastcall sub_7525F0(int a1)
{
    // Guard: passes only for a1 == -1 or a1 == 0.
    // The sum is compared as an unsigned value:
    //   a1 == -1 -> 0                 -> 0 > 1 is false (passes)
    //   a1 ==  0 -> 1                 -> 1 > 1 is false (passes)
    //   a1 >=  1 -> 2 or more         -> true (assert fires)
    //   a1 <= -2 -> wraps to a huge unsigned value -> true (assert fires)
    if ((unsigned int)(a1 + 1) > 1)
        assert_fail("target.c", 299, "set_target_configuration", 0, 0);

    sub_7515D0();              // initialize type tables
    qword_126E1B0 = "lib";    // library search path prefix
    return -1;                 // return value unused
}

The unsigned comparison (unsigned int)(a1 + 1) > 1 accepts only the values 0 and -1 and asserts on everything else, including positive indices. The -1 case is caught earlier by the error 2664 check, which leaves 0 as the only value that passes both checks as decompiled. The guard is a sanity assertion rather than a functional check.

Type Table Initialization (sub_7515D0)

sub_7515D0 is the core of Stage 3. It sets over 100 global variables that define the target platform's data model. These globals control how the EDG front end sizes types, computes alignments, and evaluates constant expressions. The function hardcodes an LP64 data model with CUDA-specific properties:

// sub_7515D0 -- target type initialization (complete decompilation)
__int64 sub_7515D0()
{
    // === Integer type sizes (in bytes) ===
    dword_126E338 = 4;     // sizeof(int)
    dword_126E328 = 8;     // sizeof(long)
    dword_126E410 = 4;     // sizeof(short) [confirmed by cross-ref]
    dword_126E420 = 2;     // sizeof(wchar_t)

    // === Pointer properties ===
    dword_126E2B8 = 8;     // sizeof(pointer)
    dword_126E2AC = 8;     // alignof(pointer)
    dword_126E4A0 = 8;     // target bits-per-byte (CHAR_BIT)
    dword_126E29C = 8;     // sizeof(ptrdiff_t)

    // === Floating-point properties ===
    // float: 24-bit mantissa, exponent range [-125, 128]
    dword_126E264 = 24;    // float mantissa bits
    dword_126E25C = 128;   // float max exponent
    dword_126E260 = -125;  // float min exponent

    // double: 53-bit mantissa, exponent range [-1021, 1024]
    dword_126E258 = 53;    // double mantissa bits
    dword_126E250 = 1024;  // double max exponent
    dword_126E254 = -1021; // double min exponent

    // long double: 16 bytes, same as __float128
    dword_126E2FC = 16;    // sizeof(long double)
    dword_126E308 = 16;    // alignof(long double)

    // __float128: 113-bit mantissa, exponent range [-16381, 16384]
    dword_126E234 = 113;   // __float128 mantissa bits
    dword_126E22C = 0x4000; // __float128 max exponent (16384)
    dword_126E230 = -16381; // __float128 min exponent

    // 80-bit extended (x87): 64-bit mantissa, same exponent range as __float128
    dword_126E240 = 64;    // x87 extended mantissa bits
    dword_126E238 = 0x4000; // x87 extended max exponent
    dword_126E23C = -16381; // x87 extended min exponent
    dword_126E24C = 64;    // another extended format (IBM double-double?)
    dword_126E244 = 0x4000;
    dword_126E248 = -16381;

    // === Alignment properties ===
    dword_126E400 = 8;     // alignof(long long)
    dword_126E3F0 = 8;     // alignof(double)
    dword_126E35C = 8;     // alignof(long)
    dword_126E3E0 = 16;    // alignof(__int128) or max alignment
    dword_126E318 = 16;    // alignof(long double, repeated)
    dword_126E278 = 16;    // maximum natural alignment

    // === Endianness and signedness ===
    dword_126E4A4 = 1;     // little-endian
    dword_126E498 = 1;     // char is signed
    dword_126E368 = 1;     // int is 2's complement
    dword_126E384 = 1;     // enum underlying type signed

    // === Bit-field and struct layout ===
    dword_126E3A8 = -1;    // MSVC bit-field allocation mode (-1 = disabled)
    dword_126E2A8 = 0;     // no extra struct padding
    dword_126E2F0 = 0;     // field alignment override disabled
    dword_126E398 = 0;     // no special alignment for unnamed fields
    dword_126E298 = 0;     // no zero-length array as last field padding
    dword_126E288 = 1;     // field alloc order = declaration order
    dword_126E294 = 1;     // allow zero-sized objects
    dword_126E28C = 1;     // allow empty base optimization

    // === ABI flags ===
    dword_126E394 = 1;     // ELF-style name mangling
    dword_126E3AC = 1;     // Itanium ABI compliance
    dword_126E37C = 1;     // EH table generation enabled
    dword_126E3A0 = 0;     // no Windows SEH
    dword_126E36C = 1;     // thunks for virtual calls
    dword_126E380 = 1;     // covariant return types
    dword_126E39C = 0;     // no RTTI incompatibility workaround

    // === Integral type encoding (byte_126E4xx) ===
    byte_126E431 = 0;      // bool encoding index
    byte_126E430 = 2;      // char encoding index
    byte_126E480 = 4;      // char16_t encoding
    byte_126E470 = 6;      // char32_t encoding
    byte_126E490 = 5;      // wchar_t encoding
    byte_126E481 = 6;      // char8_t encoding

    // === Size_t properties ===
    byte_126E349 = 8;      // size_t byte width indicator
    qword_126E350 = -1;    // SIZE_MAX (0xFFFFFFFFFFFFFFFF for 64-bit)
    byte_126E348 = 7;      // size_t type encoding index

    // === String properties ===
    dword_126E49C = 8;     // host string char bit width
    dword_126E1BC = 1;     // feature flag (enabled)
    dword_126E494 = 1;     // null-terminated string assumption

    // === Replicated size values (qword versions) ===
    // These are 64-bit copies of the 32-bit size values above,
    // used for 64-bit arithmetic in constant evaluation
    qword_126E330 = 8;     // sizeof(long) as int64
    qword_126E340 = 4;     // sizeof(int) as int64
    qword_126E300 = 16;    // sizeof(long double) as int64
    qword_126E310 = 16;    // alignof(long double) as int64
    qword_126E418 = 4;     // sizeof(short) as int64
    qword_126E3E8 = 16;    // sizeof(__int128) as int64
    qword_126E408 = 8;     // sizeof(long long) as int64
    qword_126E320 = 16;    // alignof(something 16B) as int64
    qword_126E3F8 = 8;     // alignof(double) as int64
    qword_126E3D0 = 16;    // sizeof(max int) as int64
    qword_126E360 = 8;     // alignof(long) as int64
    qword_126E2C0 = 8;     // sizeof(pointer) as int64
    qword_126E2B0 = 16;    // alignof(pointer, packed) as int64
    qword_126E428 = 2;     // sizeof(wchar_t) as int64
    qword_126E2A0 = 8;     // sizeof(ptrdiff_t) as int64

    // === Miscellaneous ===
    qword_126E3B0 = 0;     // no custom va_list
    qword_126E3B8 = 0;     // no custom va_list secondary
    dword_126E3A4 = 0;     // bit-field container sizing disabled
    byte_126E2F6 = 4;      // unnamed struct alignment
    byte_126E2F5 = 4;      // unnamed union alignment
    byte_126E2F4 = 4;      // default minimum alignment
    byte_126E2F7 = 4;      // stack alignment
    byte_126E2F8 = 4;      // thread-local alignment
    byte_126E358 = 7;      // size_t type kind encoding
    dword_126E370 = 0;     // padding/zero
    dword_126E374 = 0;
    dword_126E378 = 1;     // 64-bit mode flag (LP64)
    dword_126E290 = 0;
    dword_126E388 = 0;
    dword_126E38C = 0;
    dword_126E390 = -1;    // special marker

    return -1;  // return value unused by caller
}

The function establishes the LP64 data model: sizeof(int)=4, sizeof(long)=8, sizeof(pointer)=8. This matches the CUDA device code ABI where device pointers are 64-bit. The dword_126E378 = 1 flag explicitly marks this as 64-bit mode.

CLI Overrides for the Data Model

Three CLI flags write some of the same type-model globals that sub_7515D0 sets:

Case 65 (--force-lp64): Enforces 64-bit pointer and long sizes:

case 65:
    dword_106C01C = 1;          // force-lp64 flag recorded
    qword_126E408 = 8;          // sizeof(long long) = 8
    dword_126E400 = 8;          // alignof(long long) = 8
    byte_126E349 = 8;           // size_t = 8 bytes
    byte_126E358 = 7;           // size_t type encoding

Case 66 (--force-llp64): Sets 32-bit pointer and long sizes (Windows-like):

case 66:
    dword_106C018 = 1;          // force-llp64 flag recorded
    qword_126E408 = 4;          // sizeof(long long) = 4
    dword_126E400 = 4;          // alignof(long long) = 4
    byte_126E349 = 10;          // size_t = different encoding
    byte_126E358 = 9;           // size_t type encoding

Case 90 (--m32): Sets the complete 32-bit (ILP32) data model:

case 90:
    dword_126E378 = 0;          // 32-bit mode (not LP64)
    qword_126E360 = 4;          // alignof(long), qword copy = 4
    dword_126E35C = 4;          // alignof(long) = 4
    qword_126E350 = 0xFFFFFFFF; // SIZE_MAX = 32-bit
    byte_126E349 = 6;           // size_t = 4 bytes
    byte_126E358 = 5;           // size_t type encoding
    qword_126E2C0 = 4;          // sizeof(pointer) = 4
    dword_126E2B8 = 4;          // sizeof(pointer, dword) = 4
    qword_126E2B0 = 8;          // alignof(pointer, packed) = 8
    dword_126E2AC = 4;          // alignof(pointer) = 4
    qword_126E2A0 = 4;          // sizeof(ptrdiff_t) = 4
    dword_126E29C = 4;          // sizeof(ptrdiff_t, dword) = 4
    qword_126E408 = 4;          // sizeof(long long) = 4
    dword_126E400 = 4;          // alignof(long long) = 4
    byte_126E2F4 = 4;           // default minimum alignment = 4

Because sub_7515D0 is called from sub_7525F0 (which runs during case 245), and cases 65, 66, and 90 execute before case 245, their writes are applied first and then overwritten by sub_7515D0's LP64 defaults. An override survives only for globals that sub_7515D0 does NOT touch. Every 126E-range global written in the three listings above (for example qword_126E408, dword_126E400, byte_126E349, byte_126E358) is also written by sub_7515D0, so the LP64 values take precedence and only the flag-record globals dword_106C01C and dword_106C018 keep their values. Consequently --m32, --force-lp64, and --force-llp64 are effectively no-ops for the type model when --target is also specified.

In practice, nvcc controls all of these flags coherently -- it does not pass conflicting combinations.

Configuration Validation (sub_7527B0)

After sub_7515D0 sets the type tables, sub_752DF0 (init_field_alignment_tables) populates alignment lookup tables and then calls sub_7527B0 (check_target_configuration). This function validates the consistency of the configured type model:

// sub_7527B0 -- check_target_configuration (pseudocode summary)
void check_target_configuration()
{
    // Validate char fits in 8 bytes
    compute_type_size(0, &size, &precision);
    if (size > 8) fatal("target char is too large");

    // Validate wchar_t size
    if (qword_126E488 > 8) fatal("target wchar_t is too large");

    // Validate char16_t: must be unsigned, at least 16 bits
    if (qword_126E478 > 8) fatal("target char16_t is too large");
    if (dword_126E4A0 * qword_126E478 <= 15)
        fatal("target char16_t is too small");
    if (is_unsigned[byte_126E480] == 0)
        assert_fail("target char16_t must be unsigned");

    // Validate char32_t: must be unsigned, at least 32 bits
    if (qword_126E468 > 8) fatal("target char32_t is too large");
    if (dword_126E4A0 * qword_126E468 <= 31)
        fatal("target char32_t is too small");
    if (is_unsigned[byte_126E470] == 0)
        assert_fail("target char32_t must be unsigned");

    // Validate size_t range
    compute_type_size(byte_126E349, &size, &precision);
    if (size * dword_126E4A0 > 64) size_bits = 64;
    if (qword_126E350 > max_for_bits(size_bits))
        fatal("targ_size_t_max is too large");

    // Validate largest integer type
    if (qword_126E3D8 > 16) fatal("targ_sizeof_largest_integer too large");
    if (qword_126E3D8 < qword_126E3F8)
        fatal("invalid targ_sizeof_largest_integer");

    // Validate INT_VALUE_PARTS
    if (16 * dword_126E4A0 != 128)
        fatal("invalid INT_VALUE_PARTS_PER_INTEGER_VALUE");

    // Validate host string char
    if (dword_126E49C > 8) fatal("targ_host_string_char_bit too large");

    // Validate pack alignment
    if (dword_126E284 < 1 || dword_126E284 > 255)
        fatal("invalid targ_minimum_pack_alignment");
    if (dword_126E284 > dword_126E280)
        fatal("invalid targ_maximum_pack_alignment");

    // Validate GNU IA-32 vector function integer sizes
    if (qword_126E428 != 2 || qword_126E418 != 4 || qword_126E3F8 != 8)
        assert_fail("invalid integer sizes for GNU IA-32 vector functions");

    // Validate MSVC bit-field allocation
    if (dword_126E3A4 && dword_126E3A8 != -1)
        fatal("targ_microsoft_bit_field_allocation must be -1 "
              "when targ_bit_field_container_size is TRUE");

    // Validate field allocation order
    if (!dword_126E3AC) assert_fail(...);
    if (!dword_126E288)
        fatal("targ_field_alloc_sequence_equals_decl_sequence must be TRUE");

    // Validate host/target endianness match
    if (dword_126E4A4 != dword_126EE40)
        fatal("unexpected host/target endian mismatch");

    // After validation, call dialect selector
    select_cp_gen_be_target_dialect();  // sub_752A80
}

The validator confirms that the type model is internally consistent. Most of these checks are sanity assertions that should never fire with the hardcoded LP64 values from sub_7515D0; they run at startup and guard against corruption or misconfiguration if the type globals are modified by other code paths (such as --m32 or --force-llp64).

Notably, check_target_configuration calls select_cp_gen_be_target_dialect (sub_752A80) as its last action. This means dialect selection happens after all type model validation is complete.

Field Alignment Tables (sub_752DF0)

init_field_alignment_tables populates two alignment lookup tables at qword_12C7640 and qword_12C7680. These tables map integer type kinds to their struct field alignment requirements. The function only fills the tables when dword_126E2F0 (field alignment override) is nonzero; in the default CUDA configuration, this field is set to 0 by sub_7515D0, so the alignment tables remain at their initialized-to-zero state.

When the tables are populated, they read alignment values from the dword_126E2CC-dword_126E2F0 range (which sub_7515D0 leaves at zero), meaning the alignment tables are effectively disabled for CUDA targets. The function also copies qword_126E3E8 (sizeof largest integer type) into qword_126E3D8 before calling the configuration validator.

Stage 4: Dialect Selection (sub_752A80)

select_cp_gen_be_target_dialect determines whether the backend generates device-side or host-side C++ output. It is called from check_target_configuration (sub_7527B0) after all type model validation passes:

// sub_752A80 -- select_cp_gen_be_target_dialect (complete decompilation)
__int64 sub_752A80()
{
    // Guard: no dialect should be set yet
    if (dword_126E1F8 || dword_126E1D0 || dword_126E1FC || dword_126E1E8)
        assert_fail("target.c", 736,
                    "select_cp_gen_be_target_dialect",
                    "Target dialect already set.", 0);

    if (dword_126EFA8) {           // GPU compilation mode enabled
        dword_126E1DC = 1;         // enable cp_gen backend
        dword_126E1EC = 1;         // enable backend output

        if (dword_126EFA4) {       // device-side compilation
            dword_126E1E8 = 1;     // set device target dialect
            qword_126E1E0 = qword_126EF90;  // copy Clang version
            return qword_126EF90;
        } else {                   // host-side compilation (stub generation)
            dword_126E1F8 = 1;     // set host target dialect
            qword_126E1F0 = qword_126EF98;  // copy GCC version
            return qword_126EF98;
        }
    }
    return result;  // non-GPU mode: no dialect set (result is never assigned in the decompiler output)
}

The guard at entry checks that no dialect has been previously set. This fires only if select_cp_gen_be_target_dialect is called twice, which is a programming error.

Dialect Global Roles

Global	Role	Set When
`dword_126EFA8`	GPU compilation mode active	`--gcc` flag (case 182) sets this to 1
`dword_126EFA4`	Device-side (vs host-side) compilation	`--clang` flag (case 187) sets this to 1
`dword_126E1DC`	cp_gen backend enabled	GPU mode active
`dword_126E1EC`	Backend output enabled	GPU mode active
`dword_126E1E8`	Device target dialect selected	Device-side compilation
`dword_126E1F8`	Host target dialect selected	Host-side compilation
`qword_126E1E0`	Device compiler version	Copied from `qword_126EF90` (Clang version)
`qword_126E1F0`	Host compiler version	Copied from `qword_126EF98` (GCC version)

The naming of dword_126EFA8 as "gcc mode" and dword_126EFA4 as "clang mode" is misleading. In CUDA compilation, dword_126EFA8 means "GPU compilation is active" (nvcc always passes --gcc) and dword_126EFA4 means "this is the device-side pass" (nvcc passes --clang for the device compilation pass, not for the host pass). The version numbers copied into qword_126E1E0 and qword_126E1F0 represent the host compiler's version for pragma compatibility, not the "Clang" or "GCC" version in any semantic sense.

Device vs Host Output Paths

cudafe++ is invoked twice per .cu file by nvcc:

Device pass (dword_126EFA4 = 1): cudafe++ processes the CUDA source and emits the device-side IL that cicc consumes to generate PTX. The dialect is set to "device" (dword_126E1E8 = 1) and the version number comes from qword_126EF90.
Host pass (dword_126EFA4 = 0): cudafe++ processes the same source and emits the host-side .int.c file with device stubs. The dialect is set to "host" (dword_126E1F8 = 1) and the version number comes from qword_126EF98.

The dialect selection determines which backend code paths execute during .int.c generation. Device-dialect mode generates PTX-compatible output; host-dialect mode generates host C++ with stub functions.

Stage 5: TU Initialization (sub_586240)

During translation unit initialization, sub_586240 copies the target index from the CLI-level global into per-TU state:

// sub_586240 -- fe_translation_unit_init_secondary (relevant excerpt)
if (dword_106BA08) {                       // is recompilation / secondary TU
    // ... version marker and timestamp setup ...
    v6 = allocate(4);
    *(int32_t *)v6 = 3550774;              // "6.6\0" -- EDG version marker
    qword_126EB78 = v6;                    // store version string pointer
    qword_126EB80 = strcpy(allocate(len), byte_106B5C0);  // timestamp
    dword_126EBF8 = dword_126E4A8;         // CRITICAL: copy target index
}

The copy dword_126EBF8 = dword_126E4A8 replicates the architecture index into the translation unit's state block. Both globals contain the same value in single-TU compilation (which is the only mode CUDA uses). The dual-variable pattern exists because EDG's multi-TU architecture theoretically supports per-TU target configurations, but CUDA compilation always uses a single target per cudafe++ invocation.

After this point, feature checks throughout the compiler read either dword_126E4A8 (the CLI-level global) or dword_126EBF8 (the TU-level copy). Both contain the same integer target index.

Feature Gate Mechanism

Individual features are gated by comparing dword_126E4A8 against threshold constants during semantic analysis. The pattern is consistent across all architecture-gated features:

// Pattern: hard error on unsupported architecture
if (dword_126E4A8 < THRESHOLD) {
    emit_error(DIAGNOSTIC_ID, location);
    // compilation continues or aborts depending on severity
}

Some features are gated by a dedicated global flag rather than by reading dword_126E4A8 directly. For example, __nv_register_params__ checks dword_106C028 (the "uumn" flag, set by CLI case 112) rather than comparing the architecture directly:

// sub_40B0A0 -- apply_nv_register_params_attr
if (!dword_106C028) {                    // feature not enabled
    emit_error(7, 3659, location);       // "not enabled" error
    v3 = 0;                              // mark as invalid
}

The architecture check for __nv_register_params__ is separate -- it uses diagnostic tag register_params_unsupported_arch (requiring compute_80+), which is evaluated in a different code path from the enable flag check.

Feature Flag vs Direct Comparison

The distinction between feature-flag gating and direct SM comparison is:

Direct comparison (dword_126E4A8 < N): Used for features where the threshold is baked into the comparison instruction. The threshold cannot be changed without recompiling cudafe++. Examples: __grid_constant__ (< 70), __managed__ (< 30), alloca() (< 52).
Feature flag (dword_XXXXXX == 0): Used for features that can be enabled/disabled independently of the architecture. The flag is set by a CLI option, and the architecture is checked separately. Example: __nv_register_params__ uses dword_106C028 for the enable check and a separate comparison for the architecture check.

Both patterns ultimately depend on the target index value, but the feature-flag pattern adds an extra level of indirection that allows nvcc to control feature availability through CLI flags rather than relying solely on the architecture number.

The --db Debug Mechanism

The --db flag (CLI case 37) activates EDG's internal debug tracing. While not directly part of the architecture detection chain, it shares adjacent globals (dword_126EFC8, dword_126EFCC) and can expose architecture checks as they execute.

The --db flag calls sub_48A390 (proc_debug_option, 238 lines, debug.c). On entry, it unconditionally enables tracing:

dword_126EFC8 = 1;  // debug tracing enabled

If the argument is a bare integer, it sets the verbosity level:

if (first_char is digit) {
    dword_126EFCC = strtol(arg, NULL, 10);  // verbosity level
    return 0;
}

Otherwise, it parses debug trace control entries (see Architecture Feature Gating for the full --db parsing grammar). After proc_debug_option returns, the CLI parser saves the verbosity level:

// proc_command_line, case 37
dword_106C2A0 = dword_126EFCC;  // save verbosity level

At higher verbosity levels (5+), the compiler logs IL tree walking with messages like "Walking IL tree, entry kind = ...", which provides visibility into when architecture gate checks fire during semantic analysis.

Complete Call Graph

main (sub_585EE0)
  |
  +-> proc_command_line (sub_459630)
  |     |
  |     +-> case 90 (--m32):     set ILP32 type properties
  |     +-> case 65 (--force-lp64): set LP64 overrides
  |     +-> case 66 (--force-llp64): set LLP64 overrides
  |     +-> case 245 (--target):
  |           |
  |           +-> sub_7525E0(<arg>)            // parse target index (stub: returns -1)
  |           +-> dword_126E4A8 = result       // store target index
  |           +-> if -1: emit error 2664       // invalid target
  |           +-> sub_7525F0(result)           // set_target_configuration
  |                 |
  |                 +-> sub_7515D0()           // initialize 100+ type globals (LP64)
  |                 +-> qword_126E1B0 = "lib"  // library prefix
  |                 |
  |                 +-> [implicit via sub_752DF0]:
  |                       +-> sub_752DF0()     // init_field_alignment_tables
  |                       +-> sub_7527B0()     // check_target_configuration
  |                             |
  |                             +-> [20+ consistency checks]
  |                             +-> sub_752A80()   // select_cp_gen_be_target_dialect
  |                                   |
  |                                   +-> if GPU mode && device:
  |                                   |     dword_126E1E8 = 1 (device dialect)
  |                                   |     qword_126E1E0 = qword_126EF90
  |                                   +-> if GPU mode && host:
  |                                         dword_126E1F8 = 1 (host dialect)
  |                                         qword_126E1F0 = qword_126EF98
  |
  +-> fe_translation_unit_init (sub_586240)
        |
        +-> dword_126EBF8 = dword_126E4A8      // copy target index to TU state
        +-> qword_126EB78 = "6.6\0"            // EDG version marker
        +-> qword_126EB80 = timestamp           // compilation timestamp

[After TU init, feature checks read dword_126E4A8 or dword_126EBF8]

Global Variable Summary

Target Architecture State

Address	Size	Name	Role
`dword_126E4A8`	4	`sm_architecture`	Target index from `--target`. Sentinel: `-1`.
`dword_126EBF8`	4	`target_configuration_index`	TU-level copy of `dword_126E4A8`.
`dword_126E378`	4	`is_64bit_mode`	1 = LP64 (64-bit), 0 = ILP32 (32-bit).
`dword_126E4A4`	4	`target_little_endian`	1 = little-endian.

Type Model (Sizes, set by sub_7515D0)

Address	Size	Name	LP64 Value
`dword_126E338` / `qword_126E340`	4/8	`sizeof_int`	4
`dword_126E328` / `qword_126E330`	4/8	`sizeof_long`	8
`dword_126E2B8` / `qword_126E2C0`	4/8	`sizeof_pointer`	8
`dword_126E29C` / `qword_126E2A0`	4/8	`sizeof_ptrdiff`	8
`dword_126E410` / `qword_126E418`	4/8	`sizeof_short`	4
`dword_126E400` / `qword_126E408`	4/8	`sizeof_long_long`	8
`dword_126E420` / `qword_126E428`	4/8	`sizeof_wchar`	2
`dword_126E2FC` / `qword_126E300`	4/8	`sizeof_long_double`	16
`dword_126E258`	4	`double_mantissa_bits`	53
`dword_126E264`	4	`float_mantissa_bits`	24
`dword_126E234`	4	`float128_mantissa_bits`	113

Type Model (Alignment, set by sub_7515D0)

Address	Size	Name	LP64 Value
`dword_126E2AC`	4	`alignof_pointer`	8
`dword_126E35C` / `qword_126E360`	4/8	`alignof_long`	8
`dword_126E308` / `qword_126E310`	4/8	`alignof_long_double`	16
`dword_126E3F0` / `qword_126E3F8`	4/8	`alignof_double`	8
`dword_126E278`	4	`max_natural_alignment`	16
`byte_126E2F4`	1	`default_min_alignment`	4

Dialect Selection State

Address	Size	Name	Role
`dword_126EFA8`	4	`gpu_mode_enabled`	GPU compilation active (set by `--gcc`)
`dword_126EFA4`	4	`is_device_compilation`	Device-side pass (set by `--clang`)
`dword_126E1DC`	4	`cp_gen_enabled`	cp_gen backend active
`dword_126E1EC`	4	`backend_output_enabled`	Backend output generation active
`dword_126E1E8`	4	`device_dialect_set`	Device target dialect selected
`dword_126E1F8`	4	`host_dialect_set`	Host target dialect selected
`qword_126E1E0`	8	`device_version`	Clang version copied for device dialect
`qword_126E1F0`	8	`host_version`	GCC version copied for host dialect
`qword_126E1B0`	8	`lib_prefix`	Library search prefix, set to `"lib"`

Feature Gate Globals

Address	Size	Name	Role
`dword_106C028`	4	`nv_register_params_enabled`	Enable flag for `__nv_register_params__` (set by `--uumn`, case 112)

Cross-References

CLI Flag Inventory -- --target (case 245), --m32 (case 90), --force-lp64 (case 65), --force-llp64 (case 66) flag details
Architecture Feature Gating -- SM version thresholds for CUDA features, host compiler version gating, --db debug mechanism
EDG Build Configuration -- Compile-time constants controlling backend selection and IL configuration
Pipeline Overview -- Where architecture detection fits in the compilation pipeline
CLI Processing -- proc_command_line dispatcher and flag table mechanics
Translation Unit Descriptor -- TU state block containing dword_126EBF8
Global Variable Index -- Full address-level documentation of all globals
Minor Attributes -- __nv_register_params__ attribute handler and dword_106C028 usage

Experimental and Version-Gated Flags

cudafe++ gates several categories of CUDA language features behind flags that nvcc manages automatically. Users interact with these through nvcc options like --expt-extended-lambda and --expt-relaxed-constexpr; nvcc translates these into the internal cudafe++ flags --extended-lambda and --relaxed_constexpr before invocation. A third category, C++ standard version gating, controls which language-level features affect the CUDA compilation pipeline. Two additional flag groups (--default-device, and the pair --no-device-int128/--no-device-float128) tune device code semantics without the "experimental" label.

This page documents the internal mechanism of each flag: the global variable it sets, every code path it unlocks, the diagnostics it suppresses or enables, and the compile-time cost of enabling it.

Flag Summary

nvcc Flag	cudafe++ Internal Flag	Flag ID	Global Variable	Default	Effect
`--expt-extended-lambda`	`--extended-lambda`	79*	`dword_106BF38`	0	Enable entire extended lambda wrapper infrastructure
`--expt-relaxed-constexpr`	`--relaxed_constexpr`	104	`dword_106BF40`	0	Allow constexpr cross-space calls
`-std=c++NN`	`--c++NN` / `set_flag`	--	`dword_126EF68`	199711	Gate C++ standard features
(JIT mode)	`--default-device`	**	--	0	Change unannotated default to `__device__`
`--no-device-int128`	`--no-device-int128`	52	--	0	Disable `__int128` in device code
`--no-device-float128`	`--no-device-float128`	53	--	0	Disable `__float128`/`_Float128` in device code

* The extended-lambda flag is listed here as flag 79, but that ID is not fully confirmed: some reports assign slot 79 to disable_ext_lambda_cache, which is a separate flag. The case_id parsed as extended-lambda lies in the CUDA-specific range 47--89, and the individual case within the grouped 47--53 block has not been fully disambiguated. The flag string "extended-lambda" is at binary address 0x836410, referenced from sub_452010 (init_command_line_flags).

** The --default-device flag is not in the standard numbered flag catalog (1--275). It is registered through one of the 7 inline-registered paired flags or the set_flag/clear_flag table (off_D47CE0). Its string literal appears in four JIT error messages in the binary.

`--extended-lambda` (dword_106BF38)

This is the single most impactful experimental flag in cudafe++. It enables the entire extended lambda subsystem -- approximately 40 functions in nv_transforms.c, 2,100 lines of lambda scanning in class_decl.c (sub_447930 sits at the end of that file's address range), 17 steps of preamble text emission, and per-lambda wrapper generation in the backend. Without it, CUDA lambdas annotated with __device__ or __host__ __device__ are rejected outright.

What It Enables

When dword_106BF38 != 0, the following subsystems activate:

1. Lambda scanning (sub_447930, 2,113 lines)

The 7-phase scan_lambda function performs full CUDA validation on every lambda expression. Phase 4 checks all 35+ restriction categories documented in the restrictions page. Without the flag, phase 4 early-exits and emits error 3612 instead.

2. Preamble injection (sub_4864F0 + sub_6BCC20)

When the backend encounters a type declaration for the sentinel __nv_lambda_preheader_injection, three conditions must all be true for the preamble to fire:

// sub_4864F0 trigger conditions:
if ((entity_bits[-8] & 0x10) != 0      // marker bit set
    && dword_106BF38 != 0               // --extended-lambda enabled
    && name_matches_sentinel)           // 30-byte name comparison
{
    sub_6BCC20(emit_func);              // emit ~10-50 KB of template text
}

The master emitter (sub_6BCC20) produces the complete lambda wrapper infrastructure as inline C++ text injected into the .int.c output. The 17-step emission sequence generates:

Step	Output	Purpose
1	`__NV_LAMBDA_WRAPPER_HELPER`, `__nvdl_remove_ref`, `__nvdl_remove_const`	Utility macros and type traits
2	`__nv_dl_tag`	Device lambda tag type
3	Array capture helpers (dim 2--8)	N-dimensional array forwarding via `sub_6BC290`
4	Primary `__nv_dl_wrapper_t` + zero-capture specialization	Device lambda wrapper template
5	`__nv_dl_trailing_return_tag` + zero-capture specialization	Trailing return type support
6	Device bitmap scan	One `sub_6BB790` call per set bit in `unk_1286980`
7	`__nv_hdl_helper` (anonymous namespace, 4 static function pointers)	Host-device lambda dispatch helper
8	Primary `__nv_hdl_wrapper_t` with `static_assert`	Host-device wrapper template
9	HD bitmap scan	Four calls per set bit in `unk_1286900` (const x mutable x 2 helpers)
10	`__nv_hdl_helper_trait_outer`	Deduction helper traits
11	C++17 `noexcept` variants	Conditional on `dword_126E270` (see C++ version gating)
12	`__nv_hdl_create_wrapper_t`	Factory for HD wrappers
13	`__nv_lambda_trait_remove_const/volatile/cv`	CV-qualifier removal traits
14	`__nv_extended_device_lambda_trait_helper` + detection macro	Device lambda type detection
15	`__nv_lambda_trait_remove_dl_wrapper`	Unwrapper trait
16	Trailing-return detection trait + macro	Type introspection
17	HD detection trait + macro	Host-device lambda type detection

3. 1024-bit capture bitmaps

Two bitmaps track which capture counts have been observed during parsing:

Bitmap	Address	Size	Bit Meaning
Device	`unk_1286980`	128 bytes (1024 bits)	Bit N = capture count N seen in a `__device__` lambda
Host-device	`unk_1286900`	128 bytes (1024 bits)	Bit N = capture count N seen in an HD lambda

sub_6BCBF0 registers a capture count by setting the corresponding bit. sub_6BCBC0 resets both bitmaps to zero between translation units. Each bitmap has room for capture counts 0 through 1023 (bit 0 is reserved for the primary template in the device path; the HD path uses bit 0). In practice the highest accepted count is 1022: error 3595 fires when the capture count exceeds 1022 (v33 > 0x3FE).
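The bitmap bookkeeping described above can be sketched in a few lines of C. This is a minimal illustration, not the decompiled code: the array names, the helper names, and the bit order within each byte are assumptions, and the limit check is folded into the recording helper for brevity even though the text attributes it to the lambda scanner.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CAPTURE_BITMAP_BYTES  128     /* 1024 bits, like unk_1286980 / unk_1286900 */
#define MAX_ACCEPTED_CAPTURES 0x3FE   /* counts above this correspond to error 3595 */

/* Hypothetical stand-ins for the two global bitmaps. */
static uint8_t device_bitmap[CAPTURE_BITMAP_BYTES];
static uint8_t host_device_bitmap[CAPTURE_BITMAP_BYTES];

/* Record one capture count. Returns false where the frontend would
 * report error 3595 instead of setting a bit. */
static bool record_capture_count(uint8_t *bitmap, unsigned count)
{
    if (count > MAX_ACCEPTED_CAPTURES)
        return false;
    /* Assumed layout: bit N lives in byte N/8, position N%8. */
    bitmap[count / 8] |= (uint8_t)(1u << (count % 8));
    return true;
}

/* Query whether a capture count has been recorded. */
static bool capture_count_seen(const uint8_t *bitmap, unsigned count)
{
    return count < CAPTURE_BITMAP_BYTES * 8
        && (bitmap[count / 8] & (1u << (count % 8))) != 0;
}

/* Clear both bitmaps, as is done between translation units. */
static void reset_capture_bitmaps(void)
{
    memset(device_bitmap, 0, sizeof device_bitmap);
    memset(host_device_bitmap, 0, sizeof host_device_bitmap);
}
```

Recording the same count twice is idempotent, which is why two lambdas with equal capture counts cost only one emitted specialization.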

4. Per-lambda wrapper generation (sub_47B890, 336 lines)

During backend code generation, gen_lambda produces the per-lambda wrapper specialization for each extended lambda encountered. This runs in the gen_template dispatcher (sub_47ECC0).

5. Extended lambda capture type generation (sub_46E640, ~400 lines)

nv_gen_extended_lambda_capture_types generates explicit type declarations for captured variables, enabling the closure type to be serialized across host/device boundaries.

What Happens Without It

When dword_106BF38 == 0, any lambda with __host__ or __device__ annotations triggers error 3612:

error #20155-D: __host__ or __device__ annotation on lambda requires --extended-lambda nvcc flag

Additionally, the .int.c header emits hardcoded false macros (from sub_489000):

#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false

These definitions ensure that code using the detection macros compiles without error but reports that no extended lambdas exist.

Compile-Time Cost

Enabling --extended-lambda has measurable compile-time impact:

Fixed overhead: ~10 KB of injected template text (steps 1--5, 7--8, 10--17) emitted for every translation unit, regardless of how many lambdas appear
Variable per capture count: ~0.8 KB per distinct device lambda capture count, ~6 KB per distinct HD capture count (the HD path emits 4 specializations per bit: const non-mutable, const mutable, non-const non-mutable, non-const mutable)
Typical TU with 3--5 distinct capture counts: 30--50 KB of additional .int.c text
Template instantiation load: The wrapper templates use deep SFINAE patterns; the host compiler (gcc/clang/MSVC) must instantiate these for every extended lambda in the TU
Lambda scanning: The 2,113-line scan_lambda function performs full restriction validation on every lambda expression, so scanning overhead grows linearly with the number of lambdas in the TU

The cost is proportional to the number of distinct capture counts, not the total number of lambdas. Two __device__ lambdas each capturing 3 variables share a single wrapper specialization.
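As a back-of-envelope check on the figures above, the emitted text size can be estimated from the number of distinct capture counts. The sketch below simply encodes the approximate per-item sizes quoted in this section; the function name and the choice of tenths of a KB as the unit are illustrative assumptions.

```c
/* Sizes are in tenths of a KB so the arithmetic stays in integers. */
enum {
    FIXED_PREAMBLE_TENTHS_KB   = 100, /* ~10 KB emitted for every TU */
    PER_DEVICE_COUNT_TENTHS_KB = 8,   /* ~0.8 KB per distinct device capture count */
    PER_HD_COUNT_TENTHS_KB     = 60   /* ~6 KB per distinct HD capture count */
};

/* Rough estimate of injected preamble text, in tenths of a KB. */
static unsigned estimate_preamble_tenths_kb(unsigned distinct_device_counts,
                                            unsigned distinct_hd_counts)
{
    return FIXED_PREAMBLE_TENTHS_KB
         + PER_DEVICE_COUNT_TENTHS_KB * distinct_device_counts
         + PER_HD_COUNT_TENTHS_KB * distinct_hd_counts;
}
```

With two distinct device counts and three distinct HD counts this gives 296 tenths, or about 29.6 KB, which sits at the low end of the 30--50 KB range quoted for a typical TU.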

All 35+ extended lambda error codes (3590--3691) are documented in lambda/restrictions.md. Key errors specific to the flag gate:

Error	Display	Tag	Condition
3612	20155-D	`extended_lambda_disallowed`	Lambda has `__host__`/`__device__` annotation but `dword_106BF38 == 0`
3595	20138-D	`extended_lambda_too_many_captures`	Capture count > 1022 (`v33 > 0x3FE`)
3590	20133-D	`extended_lambda_multiple_parent`	Multiple `__nv_parent` pragmas

`--expt-relaxed-constexpr` (dword_106BF40)

This flag relaxes cross-execution-space calling rules for constexpr functions. Without it, a constexpr __device__ function cannot be called from a __host__ function and vice versa, even though constexpr functions are evaluated at compile time on the host regardless of their execution space annotation.

Flag Registration

Registered as flag ID 104 (relaxed_constexpr) in the CUDA-specific flag range. The --expt-relaxed-constexpr nvcc flag is translated to --relaxed_constexpr before passing to cudafe++. The flag sets dword_106BF40 to 1.

Note: Despite the W066 report labeling this global lambda_host_device_mode, the decompiled code shows it is checked in two distinct contexts: cross-space call validation (sub_505720) and extended lambda device qualification (sub_6BC680). The variable name reflects its role in relaxing constexpr constraints, not lambda-specific behavior. It affects lambda behavior only in the specific case of is_device_or_extended_device_lambda (see below).

What It Relaxes

The flag modifies behavior in two code paths:

1. Cross-space call checking (sub_505720)

In check_cross_execution_space_call, when the caller is a __device__-only function and the callee has mask 0x02 (bit 1) set in the byte at offset +177 (explicit __device__ annotation), the checker tests dword_106BF40:

// sub_505720, caller is __device__ or __global__, callee is constexpr __host__:
if ((callee[177] & 0x02) != 0) {     // callee has explicit execution space
    if (dword_106BF40) {               // --expt-relaxed-constexpr
        // skip error, allow the call
        return;
    }
}

Without the flag, this path falls through to emit one of the 6 constexpr-specific cross-space errors.

2. Device lambda qualification (sub_6BC680)

In is_device_or_extended_device_lambda, when an entity has the __device__ annotation (byte +177, mask 0x02) but NOT the extended lambda bit (byte +177, mask 0x04), the function returns dword_106BF40 != 0:

// sub_6BC680 (decompiled):
bool is_device_or_extended_device_lambda(entity* a1) {
    if ((a1->byte_177 & 0x02) != 0) {       // has __device__
        if ((a1->byte_177 & 0x04) == 0) {    // NOT extended lambda
            return dword_106BF40 != 0;        // relaxed constexpr allows it
        }
        return true;
    }
    return false;
}

This means --expt-relaxed-constexpr allows certain __device__ lambdas to be treated as extended device lambdas even without the --extended-lambda flag, but only in the specific context of device lambda type checking.

The 6 Error Messages It Suppresses

When dword_106BF40 == 0 and a constexpr function call crosses execution spaces, one of these 6 error messages is emitted. Each message explicitly suggests the flag as a workaround:

#	Caller Space	Callee Space	Error Message
1	`__host__ __device__`	constexpr `__device__`	"calling a constexpr __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this."
2	`__host__`	constexpr `__device__`	"calling a constexpr __device__ function(%sq1) from a __host__ function(%sq2) is not allowed. ..."
3	`__host__ __device__`	constexpr `__host__`	"calling a constexpr __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. ..."
4	`__host__ __device__`	constexpr `__host__`	"calling a constexpr __host__ function from a __host__ __device__ function is not allowed. ..." (no entity names -- edge case for unresolved functions)
5	`__device__`	constexpr `__host__`	"calling a constexpr __host__ function(%sq1) from a __device__ function(%sq2) is not allowed. ..."
6	`__global__`	constexpr `__host__`	"calling a constexpr __host__ function(%sq1) from a __global__ function(%sq2) is not allowed. ..."

The %sq1 and %sq2 format specifiers are cudafe++'s diagnostic format for qualified entity names (see diagnostics/format-specifiers.md).

Why It Is Experimental

The flag is labeled "experimental" because enabling it can produce silent runtime errors or hide genuine mistakes:

A constexpr function has different behavior on host vs device due to #ifdef __CUDA_ARCH__ guards or host/device-specific intrinsics. The compiler evaluates constexpr functions on the host during compilation, but with the flag enabled, a constexpr __device__ function might be evaluated on the host where __CUDA_ARCH__ is not defined, producing a different constant value than the programmer expects for device code.
A constexpr __host__ function references host-only APIs (file I/O, system calls, host-specific math libraries). With relaxed constexpr, this function can be called from a __device__ context. If the call is not resolved at compile time (not actually evaluated as a constant expression), the linker or runtime will fail with an obscure error rather than the clear cudafe++ diagnostic.
The relaxation applies globally -- there is no per-function opt-in. Once enabled, all constexpr cross-space calls are permitted, making it impossible to catch genuinely incorrect calls alongside intentionally relaxed ones.

The related diagnostic tag is cl_relaxed_constexpr_requires_bool (at binary address 0x853640), which indicates there was at some point a stricter validation that the flag's value must be boolean.

Interaction with Other Globals

The dword_106BF40 flag interacts with the cross-space checking infrastructure controlled by dword_106BFD0 (device_registration) and dword_106BFCC (constant_registration). When dword_106BF40 is set AND the current routine is in device scope ((byte +182 & 0x30) == 0x20) AND the routine has the __device__ annotation (byte +177, bit 1, mask 0x02), the cross-space reference check in record_symbol_reference_full (sub_72A650/sub_72B510) skips the error entirely.
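The three-way condition can be written as a single predicate. The sketch below is an illustration under assumptions: the struct and function names are hypothetical, and only the two bytes named in the text are modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the two routine bytes named in the text. */
struct routine_flags {
    uint8_t byte_177;   /* execution-space annotation bits */
    uint8_t byte_182;   /* scope / space classification bits */
};

/* True when the cross-space reference error would be skipped. */
static bool skips_cross_space_reference_error(const struct routine_flags *r,
                                              int relaxed_constexpr /* dword_106BF40 */)
{
    /* Parenthesized explicitly: in C, == binds tighter than &. */
    bool in_device_scope       = (r->byte_182 & 0x30) == 0x20;
    bool has_device_annotation = (r->byte_177 & 0x02) != 0;
    return relaxed_constexpr != 0 && in_device_scope && has_device_annotation;
}
```

All three conditions must hold; clearing the flag or changing the scope bits to any value other than 0x20 restores the error path.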

C++ Standard Version Gating (dword_126EF68)

The global variable dword_126EF68 holds the C++ (or C) standard version as an integer matching the __cplusplus or __STDC_VERSION__ predefined macro value. This is set during CLI parsing and controls feature gating throughout the frontend.

Version Values

Standard	`dword_126EF68` Value	nvcc Flag
C++98/03	199711	`-std=c++03`
C++11	201103	`-std=c++11`
C++14	201402	`-std=c++14`
C++17	201703	`-std=c++17`
C++20	202002	`-std=c++20`
C++23	202302	`-std=c++23`

C standard values are also stored here when compiling C code:

Standard	`dword_126EF68` Value
K&R	(triggers `set_c_mode(1)` instead)
C89	198912
C99	199901
C11	201112
C17	201710
C23	202311
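The two tables above amount to a lookup from a standard name to the integer stored in dword_126EF68. A minimal sketch follows; the table and function names are hypothetical, the name strings are illustrative labels rather than flag spellings recovered from the binary, and K&R is deliberately absent because it takes the set_c_mode(1) path instead of storing a version.

```c
#include <stddef.h>
#include <string.h>

struct std_entry {
    const char *name;   /* illustrative label for the standard */
    long        value;  /* value stored in dword_126EF68 */
};

static const struct std_entry std_table[] = {
    { "c++03", 199711L }, { "c++11", 201103L }, { "c++14", 201402L },
    { "c++17", 201703L }, { "c++20", 202002L }, { "c++23", 202302L },
    { "c89",   198912L }, { "c99",   199901L }, { "c11",   201112L },
    { "c17",   201710L }, { "c23",   202311L },
};

/* Returns the version value for a standard name, or -1 if it is not listed. */
static long std_version_for(const char *name)
{
    for (size_t i = 0; i < sizeof std_table / sizeof std_table[0]; ++i) {
        if (strcmp(std_table[i].name, name) == 0)
            return std_table[i].value;
    }
    return -1;
}
```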

How Version Gating Works

Throughout the frontend, dword_126EF68 is compared against threshold values to enable or disable features. The comparison is typically >= or > against the version number, with occasional <= checks for legacy behavior (as in the sub_489000 example below). Examples from the binary:

List initialization (sub_6D7DE0, overload.c): The 2,119-line list initialization function checks dword_126EF68 >= 201103 before enabling C++11 brace-enclosed initializer semantics.

Operator overloading (sub_6E7310, overload.c): Checks dword_126EF68 >= 201703 for C++17 features like class template argument deduction in operator resolution.

Preprocessor directives (sub_6FEDD0, preproc.c): Checks dword_126EF68 >= 202301 for #elifdef/#elifndef support (C++23 feature).

Byte ordering in .int.c output (sub_489000): Sets byte_10657F4 based on:

if (dword_126EFB4 == 2)              // CUDA mode
    byte_10657F4 = (dword_126EFB0 != 0);
else if (dword_126EF68 <= 199900)    // pre-C99
    byte_10657F4 = (dword_126EFB0 != 0);
else
    byte_10657F4 = 1;
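The threshold checks in the first three examples above reduce to plain integer comparisons. The sketch below restates them as standalone predicates; the function names are hypothetical, and only the threshold values come from the text.

```c
#include <stdbool.h>

/* std_version plays the role of dword_126EF68. */

/* C++11 brace-enclosed initializer semantics (list initialization). */
static bool gate_brace_init(long std_version)
{
    return std_version >= 201103L;
}

/* C++17 features consulted during operator overload resolution. */
static bool gate_cxx17_overload(long std_version)
{
    return std_version >= 201703L;
}

/* #elifdef / #elifndef support in the preprocessor. */
static bool gate_elifdef(long std_version)
{
    return std_version >= 202301L;
}
```

Note that the 202301 threshold admits the C++23 value 202302 but rejects the C++20 value 202002.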

C++17 noexcept-in-Type-System (dword_126E270)

A key version-gated feature for CUDA is dword_126E270, the C++17 "noexcept is part of the type system" flag. This global is set when dword_126EF68 >= 201703 and controls whether the lambda preamble injection (step 11 in sub_6BCC20) emits noexcept specializations of __nv_hdl_helper_trait_outer:

// sub_6BCC20, step 11:
if (dword_126E270) {                   // C++17 noexcept in type system
    // Emit 2 additional trait specializations with NeverThrows=true
    // for noexcept-qualified function types
    emit_noexcept_trait_specialization(emit, /* const */ 0);
    emit_noexcept_trait_specialization(emit, /* non-const */ 1);
}
// Closing }; of __nv_hdl_helper_trait_outer emitted unconditionally after

Without these specializations, C++17 code using noexcept lambdas in host-device contexts would fail to match the wrapper traits, producing template deduction failures.

Version Interactions with CUDA

The C++ standard version interacts with CUDA semantics in several ways:

C++11 minimum: Most CUDA lambda features require >= 201103. Extended lambdas are only meaningful with C++11 lambda syntax.
C++14 generic lambdas: Generic __device__ lambdas (with auto parameters) are gated on >= 201402.
C++17 structured bindings and if constexpr: The extended lambda system interacts with if constexpr through restriction errors 3620/3621 (constexpr/consteval conflict in lambda operator()).
C++20 concepts: The template variant of cross-space checking (sub_505B40) has a concept-context guard that checks dword_126C5C4 (nested class scope), which is only meaningful with C++20 concepts.

`--default-device`

This flag is specific to JIT (device-only) compilation mode and changes the default execution space for unannotated entities from __host__ to __device__.

Mechanism

When enabled, the execution-space assignment logic modifies entity+182 to receive the __device__ OR mask (0x23) instead of the implicit host default (0x00). For variables, entity+148 bit 0 (__device__ memory space) is set.
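The assignment can be pictured with a toy entity record. This is a sketch under assumptions: the struct, the function name, and the split between the function and variable branches are hypothetical, and only the two offsets and masks named above come from the text.

```c
#include <stdint.h>

#define DEVICE_SPACE_OR_MASK 0x23u   /* OR-ed into the byte at entity+182 */
#define DEVICE_MEMORY_BIT    0x01u   /* bit 0 of the byte at entity+148 */

/* Hypothetical entity: a byte array large enough to reach offset +182. */
struct entity_sketch {
    uint8_t bytes[184];
};

/* Apply the default execution or memory space to an unannotated entity. */
static void assign_default_space(struct entity_sketch *e,
                                 int is_variable,
                                 int default_device)
{
    if (!default_device)
        return;                               /* implicit host default: +182 stays 0x00 */
    if (is_variable)
        e->bytes[148] |= DEVICE_MEMORY_BIT;   /* __device__ memory space */
    else
        e->bytes[182] |= DEVICE_SPACE_OR_MASK;
}
```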

JIT Mode Context

JIT mode activates when --gen_c_file_name (flag 45) is NOT provided -- there is no host output path, so the host backend never runs. This is the compilation mode used by NVRTC (the CUDA runtime compilation library) and the CUDA Driver API's runtime compilation facilities (cuModuleLoadData, cuLinkAddData).

Without --default-device, five JIT-specific diagnostics warn about unannotated entities:

Diagnostic Tag	Message Summary
`no_host_in_jit`	Explicit `__host__` not allowed in JIT mode (no `--default-device` suggestion)
`unannotated_function_in_jit`	Unannotated function considered host, not allowed in JIT
`unannotated_variable_in_jit`	Namespace-scope variable without memory space annotation
`unannotated_static_data_member_in_jit`	Non-const static data member considered host
`host_closure_class_in_jit`	Lambda closure class inferred as `__host__`

Four of the five messages explicitly suggest --default-device as a workaround. The exception is no_host_in_jit -- an explicit __host__ annotation cannot be overridden by a flag and requires a source code change.

The --default-device flag interacts with the extended lambda system (dword_106BF38): when both are active, namespace-scope lambda closure classes infer __device__ execution space instead of __host__, avoiding the host_closure_class_in_jit diagnostic.

See cuda/jit-mode.md for full JIT mode documentation.

`--no-device-int128` / `--no-device-float128`

These two flags (IDs 52 and 53) disable 128-bit integer and floating-point types in device code respectively.

Registration

Both are registered in sub_452010 as no-argument mode flags in the CUDA-specific range:

Flag	ID	Binary Address	Global Effect
`no-device-int128`	52	`0x836133`	Disables `__int128` type in device compilation
`no-device-float128`	53	`0x836144`	Disables `__float128`/`_Float128` in device compilation

Purpose

The EDG frontend supports __int128 (keyword ID 239 in the builtin keyword table) and _Float128 (keyword ID 335) as extended types. In device code, these types may not be supported by all GPU architectures or may have different semantics than on the host.

The flags belong to the grouped CUDA boolean flags (cases 47--53 in proc_command_line), alongside host-stub-linkage-explicit, static-host-stub, device-hidden-visibility, no-hidden-visibility-on-unnamed-ns, and no-multiline-debug.

Type feature tracking uses byte_12C7AFC as a usage flags byte: bit 0 tracks specific integer subtypes (kinds 11, 12), bit 2 tracks float128/bfloat16 usage. The dword_106C070 global serves as the float128 feature flag, and dword_106C06C controls bfloat16.

NVRTC has specific support strings for both types in the binary (int128 NVRTC, float128 NVRTC), confirming that the JIT compilation path handles the presence or absence of these types explicitly.

Interaction Matrix

The experimental flags interact with each other and with version gating:

Interaction	Behavior
`--extended-lambda` + C++17	Enables `noexcept` wrapper trait specializations (step 11 in preamble) via `dword_126E270`
`--extended-lambda` + `--expt-relaxed-constexpr`	A `__device__` lambda without the extended-lambda bit is treated as extended if `dword_106BF40` is set (via `sub_6BC680`)
`--extended-lambda` + JIT mode	Lambda closure class execution space inference changes; `--default-device` affects namespace-scope lambda inference
`--expt-relaxed-constexpr` + cross-space checking	Suppresses 6 specific constexpr cross-space errors; does NOT suppress the 6 non-constexpr variants
`--no-device-int128` / `--no-device-float128` + NVRTC	NVRTC-specific handling confirms both flags are respected in JIT compilation
C++20 + cross-space checking	Concept context guard in `sub_505B40` adds an additional bypass condition for template cross-space calls

Global Variable Reference

Address	Size	Semantic Name	Set By	Checked By
`dword_106BF38`	4	`extended_lambda_mode`	Flag 79* (`--extended-lambda`)	`sub_4864F0` (trigger), `sub_489000` (macros), `sub_447930` (scan_lambda)
`dword_106BF40`	4	`relaxed_constexpr_mode`	Flag 104 (`--relaxed_constexpr`)	`sub_505720` (cross-space call), `sub_6BC680` (device lambda test), `sub_72A650`/`sub_72B510` (symbol ref)
`dword_126EF68`	4	`cpp_standard_version`	CLI std selection	28+ functions across all subsystems
`dword_126E270`	4	`cpp17_noexcept_type`	Post-parsing dialect resolution	`sub_6BCC20` (preamble step 11)

Function Reference

Address	Lines	Identity	Source	Role
`sub_452010`	3,849	`init_command_line_flags`	`cmd_line.c`	Registers all 276 flags including experimental
`sub_459630`	4,105	`proc_command_line`	`cmd_line.c`	Parses flags, sets globals
`sub_447930`	2,113	`scan_lambda`	`class_decl.c`	Full lambda validation (uses `dword_106BF38`)
`sub_4864F0`	751	`gen_type_decl`	`cp_gen_be.c`	Preamble injection trigger (checks `dword_106BF38`)
`sub_6BCC20`	244	`nv_emit_lambda_preamble`	`nv_transforms.c`	Master preamble emitter (17 steps)
`sub_505720`	147	`check_cross_execution_space_call`	`expr.c`	Cross-space call checker (uses `dword_106BF40`)
`sub_505B40`	92	`check_cross_space_call_in_template`	`expr.c`	Template variant of cross-space checker
`sub_6BC680`	16	`is_device_or_extended_device_lambda`	`nv_transforms.c`	Device lambda test (uses `dword_106BF40`)
`sub_489000`	723	`process_file_scope_entities`	`cp_gen_be.c`	Backend entry; emits false macros when flag off
`sub_46E640`	~400	`nv_gen_extended_lambda_capture_types`	`cp_gen_be.c`	Capture type declarations for extended lambdas
`sub_6BCBF0`	13	`nv_record_capture_count`	`nv_transforms.c`	Bitmap bit-set for capture counts
`sub_6BCBC0`	~10	`nv_reset_capture_bitmaps`	`nv_transforms.c`	Reset both 1024-bit bitmaps

Cross-References

config/cli-flags.md -- complete flag catalog and registration protocol
lambda/overview.md -- extended lambda pipeline architecture
lambda/preamble-injection.md -- 17-step preamble emission detail
lambda/restrictions.md -- all 35+ lambda restriction error codes
cuda/cross-space-validation.md -- cross-space call checking and dword_106BF40 relaxation
cuda/jit-mode.md -- JIT mode, --default-device flag, and NVRTC
diagnostics/cuda-errors.md -- complete CUDA error catalog

EDG Source File Map

This page is the definitive reference table mapping all 52 .c source files and 13 .h header files from EDG 6.6 to their binary addresses in the cudafe++ CUDA 13.0 build. Every column is derived from the .rodata string cross-reference database and verified against the 20 sweep reports (P1.01 through P1.20).

For narrative discussion of these files and their roles in the compilation pipeline, see the Function Map and EDG Overview pages.

Build Path

All source files share the build prefix:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/

Coverage Summary

Metric	Count
`.c` files with mapped functions	52
`.h` files with mapped functions	13
Total source files	65
Functions mapped via `.c` paths	2,129
Functions mapped via `.h` paths only	80
Total mapped functions	2,209
Unmapped functions in EDG region (`0x403300`--`0x7E0000`)	~2,896
C++ runtime / demangler (`0x7E0000`--`0x829722`)	~1,085
PLT stubs + init (`0x402A18`--`0x403300`)	~283
Total functions in binary	~6,483
Mapping coverage	34.1%

The 34% mapping rate reflects the fact that only functions containing EDG internal_error assertions reference __FILE__ strings. Functions below the assertion threshold, display-only code compiled without assertions, inlined leaf functions, and the statically-linked C++ runtime are all invisible to this technique.
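The coverage figure in the summary table follows directly from the counts. A small worked check, using a hypothetical helper name:

```c
/* Mapping coverage in tenths of a percent, rounded to nearest.
 * The inputs used here are small enough that mapped * 1000 fits in 32 bits. */
static unsigned coverage_tenths_percent(unsigned mapped, unsigned total)
{
    return (mapped * 1000u + total / 2u) / total;
}
```

Feeding in 2,129 + 80 mapped functions out of 6,483 yields 341, i.e. the 34.1% shown in the table.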

Column Definitions

Column	Meaning
#	Row number, ordered by main body start address
Source File	Filename from the EDG source tree
Origin	`EDG` = standard Edison Design Group code; `NVIDIA` = NVIDIA-authored
Total Funcs	Unique functions referencing this file's `__FILE__` string (stubs + main)
Stubs	Assert wrapper functions in `0x403300`--`0x408B40`
Main Funcs	Functions in the main body region (at or after `0x409350`)
Main Body Start	Lowest xref address outside the stub region
Main Body End	Highest xref address outside the stub region
Code Size	`Main Body End - Main Body Start` in bytes; approximate (includes interleaved `.h` inlines and alignment padding)
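The Code Size column is a plain address difference. As a worked example (the helper name is hypothetical), the `attribute.c` and `cmd_line.c` rows of the table that follows can be reproduced like this:

```c
#include <stdint.h>

/* Code Size = Main Body End - Main Body Start, in bytes. */
static uint32_t code_size(uint32_t main_body_start, uint32_t main_body_end)
{
    return main_body_end - main_body_start;
}
```

For `attribute.c`, 0x418F80 - 0x409350 = 0xFC30 = 64,560 bytes; for `cmd_line.c`, 0x459630 - 0x44B250 = 0xE3E0 = 58,336 bytes.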

Source File Table -- 52 `.c` Files

Sorted by main body start address. This ordering reflects the binary layout, which is near-alphabetical with two exceptions noted below.

#	Source File	Origin	Total Funcs	Stubs	Main Funcs	Main Body Start	Main Body End	Code Size
1	`attribute.c`	EDG	177	7	170	`0x409350`	`0x418F80`	64,560
2	`class_decl.c`	EDG	273	9	264	`0x419280`	`0x447930`	190,160
3	`cmd_line.c`	EDG	44	1	43	`0x44B250`	`0x459630`	58,336
4	`const_ints.c`	EDG	4	1	3	`0x461C20`	`0x4659A0`	15,744
5	`cp_gen_be.c`	EDG	226	25	201	`0x466F90`	`0x489000`	139,376
6	`debug.c`	EDG	2	0	2	`0x48A1B0`	`0x48A1B0`	<1 KB
7	`decl_inits.c`	EDG	196	4	192	`0x48B3F0`	`0x4A1540`	90,448
8	`decl_spec.c`	EDG	88	3	85	`0x4A1BF0`	`0x4B37F0`	72,704
9	`declarator.c`	EDG	64	0	64	`0x4B3970`	`0x4C00A0`	50,480
10	`decls.c`	EDG	207	5	202	`0x4C0910`	`0x4E8C40`	164,656
11	`disambig.c`	EDG	5	1	4	`0x4E9E70`	`0x4EC690`	10,272
12	`error.c`	EDG	51	1	50	`0x4EDCD0`	`0x4F8F80`	45,744
13	`expr.c`	EDG	538	10	528	`0x4F9870`	`0x5565E0`	380,528
14	`exprutil.c`	EDG	299	13	286	`0x558720`	`0x583540`	175,648
15	`extasm.c`	EDG	7	0	7	`0x584CA0`	`0x585850`	2,992
16	`fe_init.c`	EDG	6	1	5	`0x585B10`	`0x5863A0`	2,192
17	`fe_wrapup.c`	EDG	2	0	2	`0x588D40`	`0x588F90`	592
18	`float_pt.c`	EDG	79	0	79	`0x589550`	`0x594150`	44,032
19	`folding.c`	EDG	139	9	130	`0x594B30`	`0x5A4FD0`	66,464
20	`func_def.c`	EDG	56	1	55	`0x5A51B0`	`0x5AAB80`	22,992
21	`host_envir.c`	EDG	19	2	17	`0x5AD540`	`0x5B1E70`	18,736
22	`il.c`	EDG	358	16	342	`0x5B28F0`	`0x5DFAD0`	184,800
23	`il_alloc.c`	EDG	38	1	37	`0x5E0600`	`0x5E8300`	31,488
24	`il_to_str.c`	EDG	83	1	82	`0x5F7FD0`	`0x6039E0`	47,632
25	`il_walk.c`	EDG	27	1	26	`0x603FE0`	`0x620190`	115,120
26	`interpret.c`	EDG	216	3	213	`0x620CE0`	`0x65DE10`	250,160
27	`layout.c`	EDG	21	2	19	`0x65EA50`	`0x665A60`	28,688
28	`lexical.c`	EDG	140	5	135	`0x666720`	`0x689130`	141,328
29	`literals.c`	EDG	21	0	21	`0x68ACC0`	`0x68F2B0`	17,904
30	`lookup.c`	EDG	71	2	69	`0x68FAB0`	`0x69BE80`	50,128
31	`lower_name.c`	EDG	179	11	168	`0x69C980`	`0x6AB280`	59,648
32	`macro.c`	EDG	43	1	42	`0x6AB6E0`	`0x6B5C10`	42,288
33	`mem_manage.c`	EDG	9	2	7	`0x6B6DD0`	`0x6BA230`	13,408
34	`nv_transforms.c`	NVIDIA	1	0	1	`0x6BE300`	`0x6BE300`	~22 KB¹
35	`overload.c`	EDG	284	3	281	`0x6BE4A0`	`0x6EF7A0`	201,472
36	`pch.c`	EDG	23	3	20	`0x6F2790`	`0x6F5DA0`	13,840
37	`pragma.c`	EDG	28	0	28	`0x6F61B0`	`0x6F8320`	8,560
38	`preproc.c`	EDG	10	0	10	`0x6F9B00`	`0x6FC940`	11,840
39	`scope_stk.c`	EDG	186	6	180	`0x6FE160`	`0x7106B0`	75,600
40	`src_seq.c`	EDG	57	1	56	`0x710F10`	`0x718720`	30,736
41	`statements.c`	EDG	83	1	82	`0x719300`	`0x726A50`	55,120
42	`symbol_ref.c`	EDG	42	2	40	`0x726F20`	`0x72CEA0`	24,448
43	`symbol_tbl.c`	EDG	175	8	167	`0x72D950`	`0x74B8D0`	122,688
44	`sys_predef.c`	EDG	35	1	34	`0x74C690`	`0x751470`	19,936
45	`target.c`	EDG	11	0	11	`0x7525F0`	`0x752DF0`	2,048
46	`templates.c`	EDG	455	12	443	`0x7530C0`	`0x794D30`	285,808
47	`trans_copy.c`	EDG	2	0	2	`0x796BA0`	`0x796BA0`	<1 KB
48	`trans_corresp.c`	EDG	88	6	82	`0x796E60`	`0x7A3420`	50,112
49	`trans_unit.c`	EDG	10	0	10	`0x7A3BB0`	`0x7A4690`	2,784
50	`types.c`	EDG	88	5	83	`0x7A4940`	`0x7C02A0`	112,480
51	`modules.c`	EDG	22	3	19	`0x7C0C60`	`0x7C2560`	6,400
52	`floating.c`	EDG	50	9	41	`0x7D0EB0`	`0x7D59B0`	19,200
	TOTALS		5,338	198	5,140	`0x409350`	`0x7D59B0`	~3.57 MB

Source File Table -- 13 `.h` Header Files

Header files appear in assertion strings when an inline function or macro defined in the header triggers an internal_error call. The function itself is compiled within the .c file's translation unit, but __FILE__ resolves to the header path. These functions are scattered across the binary, interleaved with the .c file that #include-d them.

#	Header File	Total Funcs	Stubs	Main Funcs	Min Address	Max Address	Primary Host
1	`decls.h`	1	0	1	`0x4E08F0`	`0x4E08F0`	`decls.c`
2	`float_type.h`	63	0	63	`0x7D1C90`	`0x7DEB90`	`floating.c`
3	`il.h`	5	2	3	`0x52ABC0`	`0x6011F0`	`expr.c`, `il.c`, `il_to_str.c`
4	`lexical.h`	1	0	1	`0x68F2B0`	`0x68F2B0`	`lexical.c` / `literals.c` boundary
5	`mem_manage.h`	4	0	4	`0x4EDCD0`	`0x4EDCD0`	`error.c`
6	`modules.h`	5	0	5	`0x7C1100`	`0x7C2560`	`modules.c`
7	`nv_transforms.h`	3	0	3	`0x432280`	`0x719D20`	`class_decl.c`, `cp_gen_be.c`, `src_seq.c`
8	`overload.h`	1	0	1	`0x6C9E40`	`0x6C9E40`	`overload.c`
9	`scope_stk.h`	4	0	4	`0x503D90`	`0x574DD0`	`expr.c`, `exprutil.c`
10	`symbol_tbl.h`	2	1	1	`0x7377D0`	`0x7377D0`	`symbol_tbl.c`
11	`types.h`	17	4	13	`0x469260`	`0x7B05E0`	Many `.c` files (scattered type queries)
12	`util.h`	124	10	114	`0x430E10`	`0x7C2B10`	All major `.c` files
13	`walk_entry.h`	51	0	51	`0x604170`	`0x618660`	`il_walk.c`
	TOTALS	281	17	264

Header Distribution Patterns

The 13 headers fall into three distinct patterns:

Localized headers -- functions cluster in a single .c file's address range:

float_type.h (63 funcs in 52 KB at 0x7D1C90--0x7DEB90, all within floating.c)
walk_entry.h (51 funcs in 81 KB at 0x604170--0x618660, all within il_walk.c)
modules.h (5 funcs in 5 KB at 0x7C1100--0x7C2560, all within modules.c)
decls.h, lexical.h, overload.h, symbol_tbl.h (1--2 funcs each, single site)
mem_manage.h (4 funcs, single site in error.c)

Moderately scattered headers -- functions appear in 2--3 .c files:

il.h (5 funcs across expr.c, il.c, il_to_str.c)
scope_stk.h (4 funcs across expr.c, exprutil.c)
nv_transforms.h (3 funcs across class_decl.c, cp_gen_be.c, src_seq.c)

Pervasive headers -- functions inlined into most .c files:

util.h (124 funcs spanning 0x430E10--0x7C2B10, nearly the entire EDG region)
types.h (17 funcs spanning 0x469260--0x7B05E0, scattered type queries)
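
The grouping above can be reproduced from the table's address and host columns. The sketch below is illustrative only: the rows are copied from the header table, while `span_kb`, `classify`, the `MANY` stand-in, and the cutoffs of one and three host files are assumptions chosen to match the three patterns, not values recovered from the binary.

```python
# Rows: (header, function count, min address, max address, host .c file count).
# MANY is a hypothetical stand-in for headers whose host list reads "many" or "all".
MANY = 99
HEADERS = [
    ("float_type.h", 63, 0x7D1C90, 0x7DEB90, 1),
    ("modules.h", 5, 0x7C1100, 0x7C2560, 1),
    ("il.h", 5, 0x52ABC0, 0x6011F0, 3),
    ("scope_stk.h", 4, 0x503D90, 0x574DD0, 2),
    ("util.h", 124, 0x430E10, 0x7C2B10, MANY),
    ("types.h", 17, 0x469260, 0x7B05E0, MANY),
]


def span_kb(lo, hi):
    """Distance between the first and last function address, in KB."""
    return (hi - lo) / 1024


def classify(hosts):
    """Bucket a header by how many .c files host its inlined functions."""
    if hosts == 1:
        return "localized"
    if hosts <= 3:
        return "moderately scattered"
    return "pervasive"


for name, funcs, lo, hi, hosts in HEADERS:
    print(f"{name:14s} {funcs:4d} funcs  {span_kb(lo, hi):8.1f} KB  {classify(hosts)}")
```

The span is only a rough indicator: a header with one function has a span of zero, so the host-file count is what drives the classification here.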

Assert Stub Region

The region 0x403300--0x408B40 contains 198 small __noreturn functions. Each encodes a single assertion site: the source file path, line number, and enclosing function name. When the assertion condition fails, the stub calls sub_4F2930 (EDG's internal_error handler) and does not return. Every stub is 29 bytes.

Stub Distribution by Source File

Source File	Stub Count	Source File	Stub Count
`cp_gen_be.c`	25	`macro.c`	1
`il.c`	16	`mem_manage.c`	2
`exprutil.c`	13	`modules.c`	3
`templates.c`	12	`overload.c`	3
`lower_name.c`	11	`pch.c`	3
`expr.c`	10	`preproc.c`	0
`class_decl.c`	9	`scope_stk.c`	6
`folding.c`	9	`src_seq.c`	1
`floating.c`	9	`statements.c`	1
`attribute.c`	7	`symbol_ref.c`	2
`symbol_tbl.c`	8	`symbol_tbl.h`	1
`trans_corresp.c`	6	`sys_predef.c`	1
`lexical.c`	5	`target.c`	0
`decls.c`	5	`trans_copy.c`	0
`types.c`	5	`trans_unit.c`	0
`decl_inits.c`	4	`types.h`	4
`interpret.c`	3	`util.h`	10
`decl_spec.c`	3	`il.h`	2
`host_envir.c`	2	`debug.c`	0
`layout.c`	2	`extasm.c`	0
`lookup.c`	2	`fe_wrapup.c`	0
`cmd_line.c`	1	`float_pt.c`	0
`const_ints.c`	1	`declarator.c`	0
`disambig.c`	1	`pragma.c`	0
`error.c`	1	`nv_transforms.c`	0
`fe_init.c`	1	`literals.c`	0
`func_def.c`	1
`il_alloc.c`	1
`il_to_str.c`	1
`il_walk.c`	1

After the stubs, addresses 0x408B40--0x409350 contain 15 C++ static constructor functions (ctor_001 through ctor_015) that initialize global tables at program startup. These have no source file attribution.

Gap Analysis -- Unmapped Regions

The following address ranges within the EDG .text region contain functions that could not be mapped to any source file via __FILE__ strings. Each gap represents functions that either lack assertions entirely, use non-EDG assertion macros, or are compiler-generated (vtable thunks, exception handlers, template instantiation artifacts).

#	Gap Range	Size	Between	Probable Content
1	`0x408B40`--`0x409350`	2 KB	stubs / `attribute.c`	Static constructors (`ctor_001`--`ctor_015`)
2	`0x447930`--`0x44B250`	13 KB	`class_decl.c` / `cmd_line.c`	Boundary helpers, small inlines
3	`0x459630`--`0x461C20`	34 KB	`cmd_line.c` / `const_ints.c`	Unmapped option handlers, flag tables
4	`0x4659A0`--`0x466F90`	6 KB	`const_ints.c` / `cp_gen_be.c`	Constant integer helpers
5	`0x489000`--`0x48A1B0`	5 KB	`cp_gen_be.c` / `debug.c`	Backend emission tail
6	`0x48A1B0`--`0x48B3F0`	5 KB	`debug.c` / `decl_inits.c`	Debug infrastructure
7	`0x5E8300`--`0x5F7FD0`	87 KB	`il_alloc.c` / `il_to_str.c`	IL display routines (no assertions)
8	`0x620190`--`0x620CE0`	3 KB	`il_walk.c` / `interpret.c`	Walk epilogue
9	`0x65DE10`--`0x65EA50`	3 KB	`interpret.c` / `layout.c`	Interpreter tail
10	`0x665A60`--`0x666720`	3 KB	`layout.c` / `lexical.c`	Layout/lexer boundary
11	`0x689130`--`0x68ACC0`	7 KB	`lexical.c` / `literals.c`	Token conversion helpers
12	`0x6AB280`--`0x6AB6E0`	1 KB	`lower_name.c` / `macro.c`	Mangling helpers
13	`0x6BA230`--`0x6BAE70`	3 KB	`mem_manage.c` / `nv_transforms.c`	Memory infrastructure
14	`0x6EF7A0`--`0x6F2790`	12 KB	`overload.c` / `pch.c`	Overload resolution tail
15	`0x6FC940`--`0x6FE160`	6 KB	`preproc.c` / `scope_stk.c`	Preprocessor tail functions
16	`0x751470`--`0x7525F0`	7 KB	`sys_predef.c` / `target.c`	Predefined macro infrastructure
17	`0x7A4690`--`0x7A4940`	1 KB	`trans_unit.c` / `types.c`	TU helpers
18	`0x7C2560`--`0x7D0EB0`	59 KB	`modules.c` / `floating.c`	Type-name encoding, module helpers
19	`0x7D59B0`--`0x7DEB90`	37 KB	`floating.c` tail	`float_type.h` template instantiations
20	`0x7DFFF0`--`0x82A000`	304 KB	post-EDG	C++ runtime, demangler, soft-float, EH
	Total unmapped	~582 KB

The largest unmapped gap within EDG code proper is the IL display region at 0x5E8300--0x5F7FD0 (87 KB). These functions were compiled from il_to_str.c but contain no assertions because the display/dump subsystem was built without assertion macros -- it is purely diagnostic code that formats IL trees to stdout.
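
Gap detection itself is a simple scan over sorted address ranges. The following is a minimal sketch, assuming each mapped region is available as a `(name, start, end)` tuple; `find_gaps` is a hypothetical helper, and the end of attribute.c is taken to be the start of class_decl.c as drawn in the region map.

```python
def find_gaps(ranges):
    """Return (start, end, size_bytes) for each hole between consecutive ranges."""
    ordered = sorted(ranges, key=lambda r: r[1])
    gaps = []
    for (_, _, prev_end), (_, next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:
            gaps.append((prev_end, next_start, next_start - prev_end))
    return gaps


# Two adjacent mapped regions; the hole between them is gap #1 in the table.
mapped = [
    ("assert stubs", 0x403300, 0x408B40),
    ("attribute.c", 0x409350, 0x419280),
]

for start, end, size in find_gaps(mapped):
    print(f"{start:#x}--{end:#x}  {size} bytes")
```

Feeding the full list of per-file ranges into the same function would enumerate every row of the gap table in one pass.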

Alphabetical Layout Observation

Source files are laid out in the binary in near-alphabetical order by filename, a consequence of the build system compiling .c files in directory-listing order and the linker processing them sequentially. The sequence is strictly alphabetical from attribute.c through types.c (rows 1--50).

Two files break this pattern:

File	Expected Position	Actual Position	Offset
`modules.c`	Between `mem_manage.c` and `nv_transforms.c` (#33--#34)	After `types.c` (#51, at `0x7C0C60`)	+18 rows late
`floating.c`	Between `float_pt.c` and `folding.c` (#18--#19)	After `modules.c` (#52, at `0x7D0EB0`)	+34 rows late

Both files appear after the main alphabetical sequence, placed at the very end of the EDG region. The most likely explanation is that modules.c and floating.c are compiled as separate translation units outside the main EDG build directory -- perhaps in a subdirectory or a secondary build target -- and are appended to the link line after the alphabetically-sorted main objects. The modules.c file implements C++20 module support (mostly stubs in the CUDA build), and floating.c implements arbitrary-precision IEEE 754 arithmetic -- both are semi-independent subsystems that could plausibly be compiled separately.

Note that floating.c is followed immediately by its private header float_type.h (63 template instantiations at 0x7D1C90--0x7DEB90), confirming they share a compilation unit.
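
The alphabetical check is easy to automate once the files are listed in address order. This is a sketch under stated assumptions: `out_of_order` is a hypothetical helper, and `layout` is an abbreviated subset of the region map rather than all 52 rows. It only flags files that sort before their immediate predecessor, which is enough to catch both outliers.

```python
def out_of_order(files):
    """Return names that sort before the file placed immediately ahead of them."""
    return [cur for prev, cur in zip(files, files[1:]) if cur < prev]


# Abbreviated address-order listing taken from the region map.
layout = [
    "attribute.c", "class_decl.c", "cmd_line.c", "float_pt.c", "folding.c",
    "mem_manage.c", "nv_transforms.c", "trans_unit.c", "types.c",
    "modules.c", "floating.c",
]

print(out_of_order(layout))
```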

Binary Region Map

0x402A18 +--------------------------+
         | PLT stubs / init (283)   |  3 KB
0x403300 +--------------------------+
         | Assert stubs (198)       |  22 KB
0x408B40 +--------------------------+
         | Constructors (15)        |  2 KB
0x409350 +--------------------------+
         | attribute.c              |  65 KB
0x419280 | class_decl.c             | 190 KB
         | cmd_line.c               |  58 KB
         | const_ints.c             |  16 KB
         | cp_gen_be.c              | 139 KB
         | debug.c                  |  <1 KB
         | decl_inits.c             |  90 KB
         | decl_spec.c              |  73 KB
         | declarator.c             |  50 KB
         | decls.c                  | 165 KB
         | disambig.c               |  10 KB
         | error.c                  |  46 KB
         | expr.c                   | 381 KB
         | exprutil.c               | 176 KB
         | extasm.c                 |   3 KB
         | fe_init.c                |   2 KB
         | fe_wrapup.c              |  <1 KB
         | float_pt.c               |  44 KB
         | folding.c                |  66 KB
         | func_def.c               |  23 KB
         | host_envir.c             |  19 KB
         | il.c                     | 185 KB
         | il_alloc.c               |  31 KB
         | [il display gap]         |  87 KB  (unmapped)
         | il_to_str.c              |  48 KB
         | il_walk.c                | 115 KB
         | interpret.c              | 250 KB
         | layout.c                 |  29 KB
         | lexical.c                | 141 KB
         | literals.c               |  18 KB
         | lookup.c                 |  50 KB
         | lower_name.c             |  60 KB
         | macro.c                  |  42 KB
         | mem_manage.c             |  13 KB
         | nv_transforms.c          |  22 KB
         | overload.c               | 201 KB
         | pch.c                    |  14 KB
         | pragma.c                 |   9 KB
         | preproc.c                |  12 KB
         | scope_stk.c              |  76 KB
         | src_seq.c                |  31 KB
         | statements.c             |  55 KB
         | symbol_ref.c             |  24 KB
         | symbol_tbl.c             | 123 KB
         | sys_predef.c             |  20 KB
         | target.c                 |   2 KB
         | templates.c              | 286 KB
         | trans_copy.c             |  <1 KB
         | trans_corresp.c          |  50 KB
         | trans_unit.c             |   3 KB
         | types.c                  | 112 KB
         |  --- alphabetical break ---
         | modules.c                |   6 KB
         |  --- gap (59 KB) ---
0x7D0EB0 | floating.c               |  19 KB
         | float_type.h inlines     |  52 KB
0x7DFFF0 +--------------------------+
         | C++ runtime / demangler  | 304 KB
0x82A000 +--------------------------+

Reproduction

To regenerate the source file list from the strings database:

jq '[.[] | select(.value | test("/dvs/p4/.*\\.[ch]$")) |
  {file: (.value | split("/") | last),
   xrefs: (.xrefs | length)}
] | group_by(.file) |
  map({file: .[0].file,
       total_xrefs: (map(.xrefs) | add)}) |
  sort_by(.file)' cudafe++_strings.json

To extract address ranges per file:

import json
from collections import defaultdict

# Load the strings database exported from the disassembler.
with open('cudafe++_strings.json') as f:
    data = json.load(f)

# Map each source/header basename to the addresses that reference its __FILE__ string.
files = defaultdict(list)
for entry in data:
    val = entry.get('value', '')
    # Keep only EDG build-tree paths that end in .c or .h.
    if '/dvs/p4/' not in val:
        continue
    if not val.endswith(('.c', '.h')):
        continue
    fname = val.split('/')[-1]
    for xref in entry.get('xrefs', []):
        files[fname].append(int(xref['from'], 16))

# Report the lowest and highest referencing address per file.
for fname in sorted(files):
    addrs = sorted(files[fname])
    print(f"{fname:25s}  {hex(addrs[0]):>12s} - {hex(addrs[-1]):>12s}"
          f"  ({len(addrs)} xrefs)")

nv_transforms.c has only 1 function with an EDG-style __FILE__ reference, but sweep analysis confirms ~40 functions in the 0x6BAE70--0x6BE4A0 region (~22 KB). Most use NVIDIA's own assertion macros instead of EDG's internal_error path.

Global Variable Index

cudafe++ v13.0 uses more than 400 global variables scattered across the .bss and .data segments. These variables fall into clear functional categories: compilation mode selectors, error/diagnostic state, I/O handles, CUDA-specific flags, translation unit management, scope tracking, IL allocation, lexer state, template instantiation, lambda transforms, and memory management. Every address listed below was confirmed through binary analysis of the x86-64 Linux ELF (sha256 6a69...). This page serves as the canonical cross-reference for all other wiki articles.

The variables cluster into three address regions: 0x106xxxx (NVIDIA-added configuration flags, typically set during CLI processing), 0x126xxxx (EDG core compiler state, used throughout parsing, IL generation, and code emission), and 0x12Cxxxx / 0x128xxxx (template instantiation, lambda transform, and arena allocator state). A few tables sit below these clusters at 0xE6xxxx--0xE8xxxx; per the Segment Layout table, addresses under 0xE7EFF0 fall in the initialized .data segment and those from 0xE7F000 upward in .bss, while the error message table at off_88FAA0 lives in read-only .rodata.
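
Because the names on this page embed their addresses, the owning ELF section can be looked up directly. The sketch below assumes the section bounds from the Segment Layout table earlier in this reference; `section_of` is a hypothetical helper, and the set of accepted name prefixes is an assumption about the naming scheme used in these tables.

```python
import re

# Section bounds from the Segment Layout table (start inclusive, end exclusive).
SECTIONS = [
    (".text", 0x403300, 0x829722),
    (".rodata", 0x829740, 0xAA3FA3),
    (".eh_frame", 0xCB1210, 0xD3F398),
    (".data.rel.ro", 0xD428C0, 0xD45E00),
    (".data", 0xD46480, 0xE7EFF0),
    (".bss", 0xE7F000, 0x12D6F20),
]


def section_of(name):
    """Map an address-bearing name such as 'dword_126EFB4' to its ELF section."""
    match = re.fullmatch(r"(?:byte|word|dword|qword|off|sub)_([0-9A-Fa-f]+)", name)
    if match is None:
        raise ValueError(f"not an address-bearing name: {name}")
    addr = int(match.group(1), 16)
    for section, start, end in SECTIONS:
        if start <= addr < end:
            return section
    return None  # address falls outside every listed section


for sample in ("dword_126EFB4", "dword_1065850", "dword_E7C760", "off_88FAA0"):
    print(sample, section_of(sample))
```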

Compilation Mode and Language Standard

These globals control the fundamental compilation dialect -- C vs C++, which standard version, which vendor extensions are active, and whether the compiler is in CUDA mode.

Address	Size	Name	Description
`dword_126EFB4`	4	`language_mode`	Master dialect selector. `1` = C, `2` = C++. Checked in virtually every subsystem. In some contexts (p1.12) interpreted as `device_il_mode` when value is 2.
`dword_126EF68`	4	`cpp_standard_version`	`__cplusplus` value. `199711` = C++98, `201103` = C++11, `201402` = C++14, `201703` = C++17, `202002` = C++20, `202302` = C++23. For C mode: `199000` (pre-C99), `199901` (C99), `201112` (C11), `201710` (C17), `202311` (C23).
`dword_126EFAC`	4	`extended_features`	EDG extended features / GNU compatibility mode flag. Also used as CUDA mode indicator in several paths.
`dword_126EFA8`	4	`gcc_extensions`	GCC extensions mode (`1` = enabled). Also used as GPU compilation mode flag in device/host separation.
`dword_126EFA4`	4	`clang_extensions`	Clang extensions mode. Dual-use: also serves as device-code-mode flag during device/host separation (`1` = compiling device side).
`dword_126EFB0`	4	`gnu_extensions_enabled`	GNU extensions active (set alongside `dword_126EFA8`). Also used as `strict_c_mode` and `relaxed_constexpr` in some paths.
`qword_126EF98`	8	`gcc_version`	GCC compatibility version, encoded as `major*10000 + minor*100 + patch`. Default `80100` (GCC 8.1.0). Compared as hex thresholds (e.g., `0x9E97` = 40599).
`qword_126EF90`	8	`clang_version`	Clang compatibility version. Default `90100`. Used for feature gating (compared against `0x78B3`, `0x15F8F`, `0x1D4BF`).
`qword_126EF78`	8	`msvc_version`	MSVC compatibility version. Default `1926`.
`qword_126EF70`	8	`version_threshold_max`	Upper version bound. Default `99999`.
`dword_126EF64`	4	`cpp_extensions_enabled`	C extension level (nonstandard extensions).
`dword_126EF80`	4	`feature_flag_80`	Miscellaneous feature flag, default `1`.
`dword_126EF48`	4	`auto_parameter_mode`	Auto parameter support flag (inverse of input).
`dword_126EF4C`	4	`auto_parameter_support`	Auto-parameter enabled (C++20 auto function params).
`dword_126EEFC`	4	`digit_separators_enabled`	C++14 digit separator (`'`) support.
`dword_126EF0C`	4	`feature_flag_0C`	Miscellaneous feature flag, default `1`.
`dword_126E4A8`	4	`sm_architecture`	Target SM architecture version (set by `--nv_arch` / case 245).
`dword_126E498`	4	`signed_chars`	Whether plain `char` is signed.
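
The hex thresholds in the version rows are easier to read once decoded. A minimal sketch, assuming the `major*10000 + minor*100 + patch` encoding given for `gcc_version` also applies to `clang_version` (the default of `90100` suggests it, but the table does not say so); `decode_version` is a hypothetical helper.

```python
def decode_version(encoded):
    """Split major*10000 + minor*100 + patch back into (major, minor, patch)."""
    major, rest = divmod(encoded, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# Defaults and comparison thresholds quoted in the table above.
for raw in (80100, 0x9E97, 90100, 0x78B3, 0x15F8F, 0x1D4BF):
    print(f"{raw:>6d}  ->  {decode_version(raw)}")
```

Read this way, a threshold such as `0x9E97` (40599) decodes to 4.5.99, so a greater-than comparison against it amounts to "GCC 4.6 or newer".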

CUDA-Specific Flags

Flags controlling CUDA-specific behavior: device code generation, extended lambdas, relaxed constexpr, OptiX mode.

Address	Size	Name	Description
`dword_1065850`	4	`device_stub_mode`	Device stub mode toggle. Toggled by expression `dword_1065850 = (dword_1065850 == 0)` in `gen_routine_decl`. `0` = forwarding body pass, `1` = static stub pass.
`dword_106BF38`	4	`extended_lambda_mode`	NVIDIA extended lambdas enabled (`--expt-extended-lambda`). Gates the lambda wrapper generation pipeline.
`dword_106BF40`	4	`lambda_host_device_mode`	Lambda host-device mode flag. Controls whether `__device__` function references are allowed in host code.
`dword_106BF34`	4	`lambda_validation_skip`	Skip lambda validation checks.
`dword_106BFDC`	4	`skip_device_only`	Skip device-only code generation. When clear, deferred function list accumulates at `qword_1065840`.
`dword_106BFF0`	4	`relaxed_attribute_mode`	NVIDIA relaxed override mode. Controls permissive `__host__`/`__device__` attribute mismatch handling. Default `1` in CLI defaults.
`dword_106BFBC`	4	`whole_program_mode`	Whole-program mode (affects deferred function list behavior).
`dword_106BFD0`	4	`device_registration`	Enable CUDA device registration / cross-space reference checking.
`dword_106BFCC`	4	`constant_registration`	Enable CUDA constant registration / another cross-space check flag.
`dword_106BFB8`	4	`emit_symbol_table`	Emit symbol table in output.
`dword_106BF6C`	4	`alt_host_compiler_mode`	Alternative host compiler mode.
`dword_106BF68`	4	`host_compiler_flag`	Host compiler attribute support flag. Also `dword_106BF58`.
`dword_106BDD8`	4	`optix_mode`	OptiX compilation mode flag.
`dword_106B670`	4	`optix_kernel_index`	OptiX kernel index (combined with `dword_106BDD8` for error 3689).
`qword_106B678`	8	`optix_kernel_table`	OptiX kernel info table pointer.
`dword_106C2C0`	4	`gpu_mode`	GPU/device compilation mode. Controls `reinterpret_cast` semantics, pointer dereference, and keyword detection in device context.
`dword_106C1D8`	4	`relaxed_constexpr_ptr`	Controls pointer dereference in device constexpr (`--expt-relaxed-constexpr` related).
`dword_106C1E0`	4	`device_typeid`	Controls `typeid` availability in device constexpr context.
`dword_106C1F4`	4	`device_class_lookup`	CUDA device class member lookup flag.
`dword_E7C760`	4[6]	`exec_space_table`	Execution space bitmask table (6 entries). `a1 & dword_E7C760[a2]` tests space compatibility.
`dword_106B640`	4	`keep_in_il_active`	Assertion guard: set to `1` before `keep_in_il` walk, cleared to `0` after.
`dword_E85700`	4	`host_runtime_included`	Flag: `host_runtime.h` already included in `.int.c` output.
`dword_126E270`	4	`cpp17_noexcept_type`	C++17 noexcept-in-type-system flag. Gates noexcept variant emission for lambda wrappers.
`qword_106BF80`	8	`module_id_file`	Module-ID file path pointer (for CRC32 calculation). Same global as `module_id_file_path` under I/O and File Management.
`qword_1065840`	8	`deferred_function_list`	Linked list of deferred functions (used when `dword_106BFDC` is clear).
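
The `device_stub_mode` toggle is a plain boolean flip. The sketch below models only the expression quoted in the table; the Python variable, `toggle_stub_mode`, and the loop are illustrative stand-ins, and nothing here is meant to describe when `gen_routine_decl` actually performs the flip.

```python
device_stub_mode = 0  # stand-in for dword_1065850

PASS_NAMES = {0: "forwarding body pass", 1: "static stub pass"}


def toggle_stub_mode():
    """Mirror `dword_1065850 = (dword_1065850 == 0)`: 0 becomes 1, nonzero becomes 0."""
    global device_stub_mode
    device_stub_mode = int(device_stub_mode == 0)
    return device_stub_mode


for _ in range(3):
    print(device_stub_mode, PASS_NAMES[device_stub_mode])
    toggle_stub_mode()
```

Note that the comparison form normalizes the flag: any nonzero value collapses to `0` on the next toggle, so the variable can only ever hold `0` or `1` afterwards.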

Error and Diagnostic State

The diagnostic subsystem uses a set of globals to track error/warning counts, severity thresholds, output format, and per-error suppression state.

Address	Size	Name	Description
`qword_126ED90`	8	`error_count`	Total errors emitted. Also used as error-recovery-mode flag (nonzero = in recovery).
`qword_126ED98`	8	`warning_count`	Total warnings emitted.
`qword_126EDF0`	8	`error_output_stream`	`FILE*` for diagnostic output. Default `stderr`. Initialized during `ctor_002`.
`qword_126EDE8`	8	`current_source_position`	Current source position for error reporting. Mirrored from `qword_1065810`.
`qword_126ED60`	8	`error_limit`	Maximum error count before abort.
`byte_126ED69`	1	`min_severity_threshold`	Minimum severity for diagnostic output (default threshold).
`byte_126ED68`	1	`error_promotion_threshold`	Severity at or above which warnings become errors.
`dword_126ED40`	4	`suppress_assertion_output`	Suppress assertion output flag.
`dword_126ED48`	4	`no_catastrophic_on_error`	Disable catastrophic error on internal assertion.
`dword_126ED50`	4	`no_caret_diagnostics`	Disable caret (^) diagnostics.
`dword_126ED58`	4	`max_context_lines`	Maximum source context lines in diagnostics.
`dword_126ED78`	4	`has_error_in_scope`	Error occurred in current scope.
`dword_126ED44`	4	`name_lookup_kind`	Name lookup kind for diagnostic formatting.
`byte_126ED55`	1	`device_severity_override`	Default severity for device-mode diagnostics.
`byte_126ED56`	1	`warning_level_control`	Warning level control byte.
`dword_106BBB8`	4	`output_format`	Output format selector. `0` = plaintext, `1` = SARIF JSON.
`dword_106C088`	4	`warnings_are_errors`	Treat warnings as errors (`-Werror` equivalent).
`dword_126ECA0`	4	`colorization_requested`	Color output requested.
`dword_126ECA4`	4	`colorization_active`	Color output currently active (after TTY detection).
`off_88FAA0`	8[3795]	`error_message_table`	Array of 3,795 `const char*` pointers indexed by error code.
`byte_1067920`	1[3795]	`default_severity_table`	Default severity for each error code.
`byte_1067921`	1[3795]	`current_severity_table`	Current (possibly pragma-modified) severity.
`byte_1067922`	4[3795]	`per_error_flags`	Per-error tracking: bit 0 = first occurrence, other bits = suppression state.
`off_D481E0`	--	`label_fill_in_table`	Diagnostic label fill-in table (`{name, cond_index, default_index}` entries).
`qword_106B488`	8	`message_text_buffer`	Growable message text buffer (initial 0x400 bytes via `sub_6B98A0`).
`qword_106B480`	8	`location_prefix_buffer`	Location prefix buffer (initial 0x80 bytes).
`qword_106B478`	8	`sarif_json_buffer`	SARIF JSON output buffer (initial 0x400 bytes).
`dword_106B470`	4	`terminal_width`	Terminal width for word wrapping.
`dword_106B4A0`	4	`fill_in_alloc_count`	Fill-in entry allocation counter.
`qword_106B490`	8	`fill_in_free_list`	Free list for 40-byte fill-in entries.
`dword_106B4B0`	4	`catastrophic_error_guard`	Re-entry guard for catastrophic error processing.
`dword_1065928`	4	`assertion_reentry_guard`	Re-entry guard for assertion handler.
`qword_1067860`	8	`entity_formatter_callback`	Entity name formatting callback (`sub_5B29C0`).
`qword_1067870`	8	`entity_formatter_buffer`	Entity formatter output buffer.
`byte_10678F1`	1	`diag_is_c_mode`	Diagnostic C mode flag (`dword_126EFB4 == 1`).
`byte_10678F4`	1	`diag_is_pre_cpp11`	Diagnostic pre-C++11 flag.
`byte_10678FA`	1	`diag_name_lookup_kind`	Name lookup kind for entity display.
`qword_106BCD8`	8	`suppress_all_but_fatal`	When set, suppress all errors except 992 (fatal).
`dword_106BCD4`	4	`predefined_macro_file_mode`	Predefined macro file mode (affects error case).
`qword_10658F8`	8	`pragma_scratch_buffer`	Scratch buffer for pragma bsearch operations.
`dword_106B4BC`	4	`werror_emitted_guard`	Prevents recursion in warnings-as-errors emission.
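
For `error_message_table`, the slot holding a given message pointer follows from the base address and the 8-byte pointer size. This is a sketch under one loudly stated assumption: that the table is indexed directly by error code starting at zero, with no bias. `message_slot` is a hypothetical helper.

```python
ERROR_TABLE_BASE = 0x88FAA0   # off_88FAA0
ERROR_TABLE_ENTRIES = 3795    # number of `const char*` entries
POINTER_SIZE = 8              # pointer width on x86-64


def message_slot(error_code):
    """Address of the table slot for error_code (assumes zero-based indexing)."""
    if not 0 <= error_code < ERROR_TABLE_ENTRIES:
        raise IndexError(f"error code {error_code} is outside the table")
    return ERROR_TABLE_BASE + POINTER_SIZE * error_code


# First address past the table, useful for bounding it in a disassembler.
table_end = ERROR_TABLE_BASE + POINTER_SIZE * ERROR_TABLE_ENTRIES

print(hex(message_slot(992)), hex(table_end))
```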

I/O and File Management

Globals controlling input/output filenames, streams, include paths, and preprocessor output.

Address	Size	Name	Description
`qword_126EEE0`	8	`input_filename`	Current output/source filename (write-protected name). Compared against `"-"` for stdout mode.
`qword_106BF20`	8	`output_filename_override`	Output C file path (set by `--gen_c_file_name` / case 45).
`qword_106C040`	8	`output_filename_alt`	Alternative output filename (used in signoff).
`qword_106C280`	8	`output_file`	`FILE*` for `.int.c` output (stdout or file).
`qword_126EE98`	8	`include_path_list`	Include search path linked list head.
`qword_126F100`	8	`include_path_free_list`	Free list for recycled search path nodes.
`qword_126F0E8`	8	`path_normalize_buffer`	Growable buffer for path normalization (0x100 initial).
`dword_126EE58`	4	`backslash_as_separator`	Backslash as path separator (Windows mode).
`dword_126EE54`	4	`windows_drive_letter`	Recognize Windows drive-letter paths.
`dword_126EEE8`	4	`bom_detection_enabled`	Byte-order mark detection enabled.
`dword_126F110`	4	`once_guard`	One-time initialization guard for source file processing.
`qword_126F0C0`	8	`cached_module_id`	Cached module ID string (CRC32-based).
`qword_106BF80`	8	`module_id_file_path`	Module-ID file path for external ID override.
`qword_106C038`	8	`options_hash_input`	Command-line options hash input for module ID.
`qword_106C248`	8	`macro_alias_map`	Hash table: macro define/alias mappings.
`qword_106C240`	8	`include_path_map`	Include path list for CLI processing.
`qword_106C238`	8	`sys_include_map`	System include path map.
`qword_106C228`	8	`sys_include_map_2`	Additional system include map.
`dword_106C29C`	4	`preprocess_mode`	Preprocessing-only mode (`1` = active). Set by CLI cases 3 and 4.
`dword_106C294`	4	`no_line_commands`	Suppress `#line` directives in output.
`dword_106C288`	4	`preprocess_output_mode`	Preprocess output: `0` = suppress, `1` = emit preprocessed text.
`dword_106C254`	4	`skip_backend`	Skip backend code generation entirely.

Scope Stack

The scope stack is an array of 784-byte entries at qword_126C5E8, indexed by dword_126C5E4. It tracks the nested scope hierarchy (file, namespace, class, function, block, template).

Address	Size	Name	Description
`qword_126C5E8`	8	`scope_table_base`	Base pointer to scope stack array. Each entry is 784 bytes.
`dword_126C5E4`	4	`current_scope_index`	Current top-of-stack index.
`dword_126C5DC`	4	`saved_scope_index`	Saved scope index (for enum processing, lambda nesting).
`dword_126C5D8`	4	`function_scope_index`	Enclosing function scope index (`-1` if none).
`dword_126C5C8`	4	`template_scope_index`	Template scope index (`-1` if not in template).
`dword_126C5C4`	4	`class_scope_index`	Class/nested-class scope index (`-1` if none). Also used as `friend_scope_index` in some paths.
`dword_126C5BC`	4	`lambda_body_flag`	Lambda body processing flag / template declaration flag.
`dword_126C5B8`	4	`class_nesting_depth`	Class nesting depth / `is_member_of_template` flag.
`dword_126C5B4`	4	`block_scope_counter`	Block scope counter / namespace scope parameter.
`dword_126C5AC`	4	`saved_depth_template`	Saved scope depth for template instantiation restore.
`dword_126C5E0`	4	`scope_hash`	Scope hash/identifier.
`dword_126C5A4`	4	`nesting_scope_index`	Nesting scope index.
`dword_126C5A0`	4	`scope_misc_flag`	Miscellaneous scope flag.
`dword_126C5C0`	4	`instantiation_scope_index`	Instantiation scope index.
`qword_126C5D0`	8	`current_routine_ptr`	Current enclosing function/routine descriptor pointer. Used for execution space checks (offset `+32` -> byte `+177` bit 2 for device, byte `+182 & 0x30` for space mask).
`qword_126C598`	8	`pack_expansion_context`	Pack expansion context pointer (C++17).
`qword_126C590`	8	`symbol_hash_table`	Robin Hood hash table for symbol lookup within scope.
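
Indexing into the scope stack is base-plus-stride arithmetic. The sketch below assumes only what the table states (784-byte entries, a base pointer, and an index where `-1` means "none"); the helper names and the sample base value are hypothetical, and treating index 0 as the outermost scope is an assumption.

```python
SCOPE_ENTRY_SIZE = 784  # bytes per scope stack entry


def scope_entry_address(table_base, index):
    """Address of scope entry `index`, given the pointer stored in qword_126C5E8."""
    if index < 0:
        raise ValueError("negative index means no such scope")
    return table_base + SCOPE_ENTRY_SIZE * index


def enclosing_entries(table_base, current_index):
    """Entry addresses from the top of the stack down to index 0."""
    return [scope_entry_address(table_base, i) for i in range(current_index, -1, -1)]


# Hypothetical runtime value of the base pointer, for illustration only.
base = 0x7F0000001000
for addr in enclosing_entries(base, 2):
    print(hex(addr))
```

The same arithmetic applies to the other index globals in this table (`function_scope_index`, `class_scope_index`, and so on), provided the `-1` sentinel is checked first.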

Lexer and Token State

The lexer maintains its current token, source position, and preprocessor state in these globals.

Address	Size	Name	Description
`word_126DD58`	2	`current_token`	Current token kind (357 possible values). Key values: `7` = identifier, `33` = comma, `55` = semicolon, `56` = `=`, `67` = equals, `73` = CUDA token, `76` = `*`, `142` = `__attribute__`, `161` = `this`, `187` = requires clause.
`qword_126DD38`	8	`token_source_position`	Source position of current token.
`qword_126DD48`	8	`token_text_ptr`	Pointer to current identifier/literal text.
`dword_126DF90`	4	`token_flags_1`	Token flags / current declaration counter.
`dword_126DF8C`	4	`token_flags_2`	Secondary token flags.
`qword_126DF80`	8	`token_extra_data`	Token extra data pointer.
`dword_126DB74`	4	`has_cached_tokens`	Cached token state flag.
`dword_126DB58`	4	`digit_separator_seen`	C++14 digit separator seen during number scanning.
`qword_126DDA0`	8	`input_position`	Current position in input buffer.
`qword_126DDD8`	8	`input_buffer_base`	Input buffer base address.
`qword_126DDD0`	8	`input_buffer_end`	Input buffer end address.
`dword_126DDA8`	4	`line_counter`	Current line number in input.
`dword_126DDBC`	4	`source_line_number`	Source line number (for `#line` directive tracking).
`qword_126DD80`	8	`active_macro_chain`	Active macro expansion chain head.
`qword_126DD60`	8	`macro_expansion_marker`	Macro expansion position marker.
`dword_126DD30`	4	`in_directive_flag`	Currently processing preprocessor directive.
`qword_126DD18`	8	`current_macro_node`	Current macro being expanded.
`qword_126DD70`	8	`macro_tracking_1`	Macro position tracking state.
`qword_126DDE0`	8	`macro_tracking_2`	Secondary macro tracking state.
`qword_126DDF0`	8	`file_stack`	Include file stack (for `#include` nesting).
`dword_126DDE8`	4	`preproc_state_1`	Preprocessor state variable.
`dword_126E49C`	4	`preproc_state_2`	Preprocessor state variable.
`qword_126DB40`	8	`lexical_state_stack`	Lexical state save/restore stack (linked list of 80-byte nodes).
`qword_126DB48`	8	`stop_token_table`	Stop token table: 357 entries at offset `+8`, indexed by token kind.
`qword_126DD98`	8	`raw_string_state`	Raw string literal tracking state.
`dword_126EF00`	4	`raw_string_flag`	Raw string literal processing flag.
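`qword_126DDD8`	8	`raw_string_base`	Raw string buffer base. Same address as `input_buffer_base` above (dual use).
`qword_126DDD0`	8	`raw_string_end`	Raw string buffer end. Same address as `input_buffer_end` above (dual use).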

Preprocessor and Macro System

Address	Size	Name	Description
`qword_1270140`	8	`macro_definition_chain`	Macro definition chain head.
`qword_1270148`	8	`free_token_list`	Free list for recycled token nodes.
`qword_1270150`	8	`cached_token_list`	Cached token list head (for rescan).
`qword_1270128`	8	`reusable_cache_stack`	Reusable macro cache stack.
`qword_106B8A0`	8	`pending_macro_arg`	Pending macro argument pointer.
`dword_106B718`	4	`suppress_pragma_mode`	Suppress pragma processing mode.
`dword_106B720`	4	`preprocessing_mode`	Preprocessor-only mode active.
`dword_106B6EC`	4	`line_numbering_state`	Line numbering state for `#line` output.
`qword_106B740`	8	`pragma_binding_table`	Pragma binding table (0x158 bytes initial).
`qword_106B730`	8	`pragma_alloc_pool_1`	Pragma allocation pool.
`qword_106B738`	8	`pragma_alloc_pool_2`	Pragma allocation pool (secondary).
`qword_106B890`	8	`pragma_name_hash_1`	Pragma name hash table.
`qword_106B8A8`	8	`pragma_name_hash_2`	Pragma name hash table (secondary).
`off_E6CDE0`	--	`pragma_id_table`	Pragma ID-to-name mapping table.
`byte_126E558`	1	`stdc_cx_limited_range`	`#pragma STDC CX_LIMITED_RANGE` state. Default `3`.
`byte_126E559`	1	`stdc_fenv_access`	`#pragma STDC FENV_ACCESS` state. Default `3`.
`byte_126E55A`	1	`stdc_fp_contract`	`#pragma STDC FP_CONTRACT` state. Default `3`.
`dword_126EE48`	4	`macro_expansion_tracking`	Macro expansion tracking / secondary IL enabled flag. Set to `1` during init-complete. Also controls shareable-constants feature.

Translation Unit State

These globals track the current translation unit, TU list, and per-TU save/restore mechanism.

Address	Size	Name	Description
`qword_106BA10`	8	`current_tu`	Pointer to current translation unit descriptor (424 bytes).
`qword_106B9F0`	8	`primary_tu`	Pointer to first (primary) translation unit.
`qword_12C7A90`	8	`tu_chain_tail`	Tail of translation unit linked list.
`qword_106BA18`	8	`tu_stack`	Translation unit stack (for nested TU processing).
`dword_106B9E8`	4	`tu_stack_depth`	TU stack depth (excluding primary).
`dword_106BA08`	4	`is_recompilation`	Recompilation / secondary-TU flag. `0` = primary TU, `1` = secondary TU. Affects IL entity flag bits.
`qword_106BA00`	8	`current_filename`	Current filename string pointer.
`dword_106B9F8`	4	`has_module_info`	TU has module information.
`qword_12C7A98`	8	`per_tu_storage_size`	Total per-TU variable buffer size.
`qword_12C7AA8`	8	`registered_var_list_head`	Registered per-TU variable list head.
`qword_12C7AA0`	8	`registered_var_list_tail`	Registered per-TU variable list tail.
`qword_12C7AB8`	8	`stack_entry_free_list`	TU stack entry free list.
`qword_12C7AB0`	8	`corresp_free_list`	TU correspondence structure free list.
`dword_12C7A8C`	4	`registration_complete`	Variable registration complete flag.
`dword_12C7A88`	4	`has_seen_module_tu`	Has seen a module TU.
`qword_12C7A70`	8	`corresp_count`	TU correspondence allocation counter.
`qword_12C7A78`	8	`tu_count`	Translation unit allocation counter.
`qword_12C7A80`	8	`stack_entry_count`	Stack entry allocation counter.
`qword_12C7A68`	8	`registration_count`	Variable registration allocation counter.

IL (Intermediate Language) State

The IL subsystem uses arena-allocated regions for entities. Two primary regions exist: file-scope and function-scope.

Address	Size	Name	Description
`dword_126EC90`	4	`file_scope_region_id`	File-scope IL region ID. Persistent for the entire TU.
`dword_126EB40`	4	`current_region_id`	Current allocation region ID (file-scope or function-scope).
`dword_126EC80`	4	`max_region_id`	Maximum allocated region ID.
`qword_126EB60`	16	`il_header`	IL header (SSE-width, used for expression copy).
`qword_126EB70`	8	`main_routine`	Main routine entity (`main()` function). Sign-bit used as elimination marker.
`qword_126EB78`	8	`compiler_version_string`	Compiler version string pointer.
`qword_126EB80`	8	`compilation_timestamp`	Compilation timestamp string.
`byte_126EB88`	1	`plain_chars_signed`	Plain chars are signed flag (IL header field).
`qword_126EB90`	8	`routine_scope_array`	Array indexed by routine number. Also per-region metadata.
`qword_126EB98`	8	`function_def_table`	Function definition table (16 bytes per entry, indexed 1..`dword_126EC78`).
`qword_126EBA0`	8	`orphaned_scope_list`	Orphaned scope list head (for dead code elimination).
`dword_126EBA8`	4	`source_language`	Source language (`0` = C++, `1` = C).
`dword_126EBAC`	4	`std_version_il`	Standard version for IL header.
`byte_126EBB0`	1	`pcc_compatibility_mode`	PCC compatibility mode.
`byte_126EBB1`	1	`enum_type_is_integral`	Enum underlying type is integral.
`dword_126EBB4`	4	`max_member_alignment`	Default maximum member alignment.
`byte_126EBB8`	1	`il_gcc_mode`	IL GCC mode.
`byte_126EBB9`	1	`il_gpp_mode`	IL G++ mode.
`byte_126EBD5`	1	`any_templates_seen`	Any templates encountered.
`byte_126EBD6`	1	`proto_instantiations_in_il`	Prototype instantiations present in IL.
`byte_126EBD7`	1	`il_all_proto_instantiations`	IL has all prototype instantiations.
`byte_126EBD8`	1	`il_c_semantics`	IL has C semantics.
`qword_126EBE0`	8	`deferred_instantiation_list`	Deferred/external declaration list head.
`qword_126EBE8`	8	`seq_number_entries`	Sequence number lookup entries (for IL index build).
`dword_126EBF8`	4	`target_config_index`	Target configuration index.
`dword_126EC78`	4	`routine_counter`	Current routine / entity counter.
`dword_126EC7C`	4	`entity_buffer_capacity`	Entity buffer capacity (grows by 2048).
`qword_126EC88`	8	`region_block_chains`	Array of block chains indexed by region ID.
`qword_126EC50`	8	`region_size_tracking`	Array of region size tracking.
`qword_126EC58`	8	`large_alloc_array`	Large-allocation (mmap) array.
`dword_126E5FC`	4	`file_scope_constant_flag`	Source-file-info flags (bit 0 = constant region flag).
`byte_126E5F8`	1	`il_language_byte`	Language standard byte for routine-type init.
`qword_126EFB8`	8	`null_source_position`	Default/null source position struct.
`qword_126F700`	8	`current_source_file_ref`	Current source file reference for IL entities.

IL Entity Kind Lists

The IL maintains per-kind linked lists for file-scope entities (kinds 1 through 72+).

Address	Size	Name	Description
`qword_126E610`	8	`kind_1_list`	Source file entries (kind 1).
`qword_126E620`	8	`kind_2_list`	Constant entries (kind 2).
`qword_126E630`	8	`kind_3_list`	Parameter entries (kind 3).
...		...	Continues through all 72+ entry kinds.
`qword_126EA80`	8	`kind_72_list`	Last numbered kind list (kind 72).
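
The listed endpoints are consistent with a uniform 16-byte stride between list heads (`0x126E610`, `0x126E620`, `0x126E630`, and kind 72 at `0x126EA80`). The sketch below computes the presumed address for any numbered kind. The function name is hypothetical, and the stride for the elided rows is an assumption inferred from the rows shown, not something read out of the binary.

```python
KIND_LIST_BASE = 0x126E610    # kind_1_list, from the table above
KIND_LIST_STRIDE = 16         # assumption: inferred from the spacing of kinds 1..3

def kind_list_address(kind: int) -> int:
    """Return the presumed address of the list head for a numbered IL entry kind."""
    if not 1 <= kind <= 72:
        raise ValueError("only kinds 1..72 are documented as numbered lists")
    return KIND_LIST_BASE + KIND_LIST_STRIDE * (kind - 1)
```

With this stride, kind 72 lands exactly on the `kind_72_list` address given in the table, which is a useful sanity check when renaming the intermediate globals.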

IL Allocation Counters

Each IL entity type has a dedicated allocation counter used for memory statistics reporting.

Address	Size	Name	Description
`qword_126F680`	8	`local_constant_count`	Local constant allocation count. Asserted zero at region boundaries.
`qword_126F748`	8	`orphan_ptr_count`	Orphan pointer allocation count.
`qword_126F750`	8	`entity_prefix_count`	Entity prefix allocation count.
`qword_126F790`	8	`source_corresp_count`	Source correspondence allocation count.
`qword_126F7C0`	8	`gen_alloc_header_count`	Gen-alloc header count (TU copy addresses).
`qword_126F7D0`	8	`string_bytes_count`	String literal bytes counter.
`qword_126F7D8`	8	`il_entry_prefix_count`	IL entry prefix allocation count.
`qword_126F8A0`	8	`exception_spec_count`	Exception specification entry count (16 bytes).
`qword_126F898`	8	`exception_spec_type_count`	Exception spec type count (24 bytes).
`qword_126F890`	8	`asm_entry_count`	ASM entry count (152 bytes).
`qword_126F8A8`	8	`routine_count`	Routine entry count (288 bytes).
`qword_126F8B0`	8	`field_count`	Field entry count (176 bytes).
`qword_126F8B8`	8	`var_template_count`	Variable template entry count (24 bytes).
`qword_126F8C0`	8	`variable_count`	Variable entry count (232 bytes).
`qword_126F8C8`	8	`vla_dim_count`	VLA dimension entry count (48 bytes).
`qword_126F8D0`	8	`local_static_init_count`	Local static init count (40 bytes).
`qword_126F8D8`	8	`dynamic_init_count`	Dynamic init entry count (104 bytes).
`qword_126F8E0`	8	`type_count`	Type entry count (176 bytes).
`qword_126F8E8`	8	`enum_supplement_count`	Enum type supplement count.
`qword_126F8F0`	8	`typeref_supplement_count`	Typeref type supplement count (56 bytes).
`qword_126F8F8`	8	`misc_supplement_count`	Misc type supplement count.
`qword_126F900`	8	`template_arg_count`	Template argument count (64 bytes).
`qword_126F908`	8	`base_class_count`	Base class count (112 bytes).
`qword_126F910`	8	`base_class_deriv_count`	Base class derivation count (32 bytes).
`qword_126F918`	8	`derivation_step_count`	Derivation step count (24 bytes).
`qword_126F920`	8	`overriding_count`	Overriding entry count (40 bytes).
`qword_126F928`	8	`constant_list_count`	Constant list entry count (16 bytes).
`qword_126F930`	8	`variable_list_count`	Variable list entry count (16 bytes).
`qword_126F938`	8	`routine_list_count`	Routine list entry count (16 bytes).
`qword_126F940`	8	`class_list_count`	Class list entry count (16 bytes).
`qword_126F948`	8	`class_supplement_count`	Class type supplement count.
`qword_126F950`	8	`based_type_member_count`	Based type list member count (24 bytes).
`qword_126F958`	8	`routine_supplement_count`	Routine type supplement count (64 bytes).
`qword_126F960`	8	`param_type_count`	Parameter type entry count (80 bytes).
`qword_126F968`	8	`constant_alloc_count`	Constant allocation count (184 bytes).
`qword_126F970`	8	`source_file_count`	Source file entry count.

IL Free Lists

Arena allocators recycle nodes through per-type free lists.

Address	Size	Name	Description
`qword_126E4B8`	8	`constant_free_list`	Constants (linked via offset `+104`).
`qword_126E4B0`	8	`expr_node_free_list`	Expression nodes (linked via offset `+64`).
`qword_126F678`	8	`param_type_free_list`	Parameter type entries (linked via offset `+0`).
`qword_126F670`	8	`template_arg_free_list`	Template argument entries (linked via offset `+0`).
`qword_126F668`	8	`constant_list_free_list`	Constant list entries (linked via offset `+0`).

IL Pools and Region Allocator

Address	Size	Name	Description
`qword_126F600`	104	`type_node_pool_1`	Type node pool (104-byte entries).
`qword_126F580`	104	`type_node_pool_2`	Secondary type node pool.
`qword_126F500`	104	`conditional_pool_1`	Conditional pool (guarded by `dword_106BF68 || dword_106BF58`).
`qword_126F480`	104	`conditional_pool_2`	Conditional pool (secondary).
`qword_126F400`	112	`expr_pool_1`	Expression/statement node pool (112 bytes).
`qword_126F380`	112	`expr_pool_2`	Expression pool (secondary).
`qword_126F300`	112	`expr_pool_3`	Expression pool (tertiary).
`unk_126E600`	1344	`scope_pool`	Scope table pool (1344 bytes, 384 initial count).
`qword_126E580`	96	`common_header_pool`	Common IL header pool (96 bytes).
`dword_126F690`	4	`region_prefix_offset`	Region allocation prefix offset (0 or 8).
`dword_126F694`	4	`region_prefix_size`	Region allocation prefix size (16 or 24).
`dword_126F688`	4	`alt_prefix_offset`	Alternate region prefix offset.
`dword_126F68C`	4	`alt_prefix_size`	Alternate region prefix size (8).

Address	Size	Name	Description
`qword_126F128`	8	`constant_hash_table`	Hash table for constant sharing/dedup.
`qword_126F130`	8	`next_constant_index`	Next constant index (monotonically increasing).
`qword_126F228`	8	`shareable_constant_hash`	Shareable constant hash table (2039 buckets).
`qword_126F200`	8	`hash_comparisons`	Hash comparison count (statistics).
`qword_126F208`	8	`hash_searches`	Hash search count.
`qword_126F210`	8	`hash_new_buckets`	New hash bucket count.
`qword_126F218`	8	`hash_region_hits`	Region hit count.
`qword_126F220`	8	`hash_global_hits`	Global hit count.
`qword_126F280`	8	`member_ptr_type_count`	Member-pointer / qualified type allocation counter.
`qword_126F2F8`	3240	`char_string_type_cache`	Character string type cache (405 entries = 3240/8). Indexed by byte offset `648 * char_kind + 8 * length`.
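
The index expression for `char_string_type_cache`, read as a byte offset of 648 times the character kind plus 8 times the length, implies a two-dimensional layout of 8-byte slots. The sketch below works out that shape from the sizes in the table. The zero-based bounds and the function name are assumptions made for illustration.

```python
ENTRY_SIZE = 8        # one cached type pointer per slot
ROW_BYTES = 648       # byte stride per character kind, from the index expression
TOTAL_BYTES = 3240    # table size from the Size column

SLOTS_PER_KIND = ROW_BYTES // ENTRY_SIZE   # 81 length slots per character kind
CHAR_KINDS = TOTAL_BYTES // ROW_BYTES      # 5 character kinds

def cache_slot(char_kind: int, length: int) -> int:
    """Return the qword slot index for (char_kind, length); bounds are assumed zero-based."""
    if not 0 <= char_kind < CHAR_KINDS:
        raise ValueError("char_kind out of range")
    if not 0 <= length < SLOTS_PER_KIND:
        raise ValueError("length not covered by the cache")
    byte_offset = ROW_BYTES * char_kind + ENTRY_SIZE * length
    return byte_offset // ENTRY_SIZE
```

Under this reading, 5 kinds times 81 slots gives the 405 entries stated in the table.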

Cached Type Nodes

Address	Size	Name	Description
`qword_126F2F0`	8	`cached_void_type`	Lazy-init cached void type node.
`qword_126F2E0`	8	`cached_size_t_type`	Lazy-init cached size_t type (for array memcpy).
`qword_126F2D0`	8	`cached_wchar_type`	Cached `wchar_t` type.
`qword_126F2C8`	8	`cached_char16_type`	Cached `char16_t` type.
`qword_126F2C0`	8	`cached_char32_type`	Cached `char32_t` type.
`qword_126F2B8`	8	`cached_char8_type`	Cached `char8_t` type (C++20).
`qword_126F610`	8	`cached_char16_variant`	Cached `char16_t` variant type.
`qword_106B660`	8	`cached_void_fn_type`	Cached void function type (C++ mode).
`qword_126E5E0`	8	`global_char_type`	Global `char` type. Used with qualifier `1` = `const` for `const char*`.

Template Instantiation

Address	Size	Name	Description
`qword_12C7740`	8	`pending_instantiation_list`	Pending function/variable instantiation worklist head.
`qword_12C7758`	8	`pending_class_list`	Pending class instantiation list.
`qword_12C76E0`	8	`instantiation_depth`	Current instantiation depth counter (max `0xFF` = 255).
`qword_106BD10`	8	`max_instantiation_depth`	Maximum template instantiation depth limit. Default `200`.
`qword_106BD08`	8	`max_constexpr_cost`	Maximum constexpr evaluation cost. Default `256`.
`dword_12C7730`	4	`instantiation_mode_active`	Instantiation mode active flag.
`dword_12C771C`	4	`new_instantiations_needed`	Fixpoint flag: new instantiations generated in current pass.
`dword_12C7718`	4	`additional_pass_needed`	Additional instantiation pass needed flag.
`dword_106C094`	4	`compilation_mode`	Compilation mode: `0` = none, `1` = normal, `2` = used-only, `3` = precompile.
`dword_106C09C`	4	`extended_language_mode`	Extended language mode.
`qword_12C7B48`	8	`template_arg_cache`	Template argument cache.
`qword_12C7B40`	8	`template_arg_cache_2`	Template argument cache (secondary).
`qword_12C7B50`	8	`template_arg_cache_3`	Template argument cache (tertiary).
`qword_12C7800`	112[3]	`template_hash_tables`	Three template hash tables (0x70 bytes each = 14 slots).

Lambda Transform State

NVIDIA's extended lambda system uses bitmaps and linked lists to track device and host-device lambda closures.

Address	Size	Name	Description
`unk_1286980`	128	`device_lambda_bitmap`	Device lambda capture count bitmap (1024 bits). One bit per closure class index.
`unk_1286900`	128	`host_device_lambda_bitmap`	Host-device lambda capture count bitmap (1024 bits).
`qword_12868F0`	8	`entity_closure_map`	Entity-to-closure mapping hash table (via `sub_742670`).
`qword_1286A00`	8	`cached_anon_namespace_name`	Cached anonymous namespace name (`_GLOBAL__N_<filename>`).
`qword_1286760`	8	`cached_static_prefix`	Cached static prefix string for mangled names.
`byte_1286A20`	256K	`name_format_buffer`	256KB buffer for name formatting.

Lambda Registration Lists

Six linked lists track device/constant/kernel entities with internal/external linkage for .int.c registration emission.

Address	Size	Name	Description
`unk_1286780`	--	`device_external_list`	Device entities with external linkage.
`unk_12867C0`	--	`device_internal_list`	Device entities with internal linkage.
`unk_1286800`	--	`constant_external_list`	Constant entities with external linkage.
`unk_1286840`	--	`constant_internal_list`	Constant entities with internal linkage.
`unk_1286880`	--	`kernel_external_list`	Kernel entities with external linkage.
`unk_12868C0`	--	`kernel_internal_list`	Kernel entities with internal linkage.

IL Tree Walking

The walk_tree subsystem uses global callback pointers for its 5-callback traversal model.

Address	Size	Name	Description
`qword_126FB88`	8	`entry_callback`	Called for each IL entry during walk.
`qword_126FB80`	8	`string_callback`	Called for each string encountered.
`qword_126FB78`	8	`pre_walk_check`	Pre-walk filter: if returns nonzero, skip subtree.
`qword_126FB70`	8	`entry_replace`	Entry replacement callback.
`qword_126FB68`	8	`entry_filter`	Linked-list entry filter callback.
`dword_126FB5C`	4	`is_file_scope_walk`	`1` = walking file-scope IL.
`dword_126FB58`	4	`is_secondary_il`	`1` = current scope is in secondary IL region.
`dword_126FB60`	4	`walk_mode_flags`	Walk mode flags (template stripping, etc.).
`dword_106B644`	4	`current_il_region`	Current IL region (0 or 1; toggles bit 2 of entry flags).
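
The callback model can be illustrated with a small recursive walker. This is a sketch only: `Node` is a hypothetical stand-in for an IL entry, only three of the five callbacks are modeled, and the order in which the callbacks fire is an assumption. The one behavior taken from the table is that a nonzero result from `pre_walk_check` skips the subtree.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    """Hypothetical stand-in for an IL entry with child entries."""
    name: str
    children: List["Node"] = field(default_factory=list)

@dataclass
class WalkCallbacks:
    entry_callback: Optional[Callable[[Node], None]] = None
    pre_walk_check: Optional[Callable[[Node], int]] = None
    entry_replace: Optional[Callable[[Node], Node]] = None

def walk(node: Node, cb: WalkCallbacks) -> Node:
    """Walk a tree, honoring the pre-walk filter and replacement callbacks."""
    if cb.pre_walk_check and cb.pre_walk_check(node):
        return node                      # nonzero result: skip this subtree
    if cb.entry_replace:
        node = cb.entry_replace(node)    # assumed to run before the visit
    if cb.entry_callback:
        cb.entry_callback(node)
    node.children = [walk(child, cb) for child in node.children]
    return node
```

Keeping the callbacks in one object mirrors the way the binary keeps them in adjacent globals: a caller installs the set it needs, runs the walk, and the walker itself stays generic.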

IL Walk Visited-Set

Address	Size	Name	Description
`dword_126FB30`	4	`visited_count`	Count of visited entries in current walk.
`qword_126FB40`	8	`visited_set`	Visited-entry set pointer.
`dword_126FB48`	4	`hash_table_count`	Hash table entry count for visited set.
`qword_126FB50`	8	`hash_table_array`	Hash table array for visited set.

IL Display

Address	Size	Name	Description
`qword_126F980`	8	`display_output_context`	IL-to-string output callback/context.
`dword_126FA30`	4	`is_file_scope_display`	`1` = displaying file-scope region.
`byte_126FA16`	1	`display_active`	IL display currently active flag.
`byte_126FA11`	1	`pcc_mode_shadow`	PCC compatibility mode shadow for display.
`qword_126FA40`	--	`display_string_buffer`	Display string buffer (raw literal prefix, etc.).

Constexpr Evaluator

Address	Size	Name	Description
`qword_126FDE0`	8	`eval_node_free_list`	Evaluation node free list (0x10000-byte arena blocks).
`qword_126FDE8`	8	`eval_nesting_depth`	Evaluation nesting depth counter.
`qword_126FE00`	8[11]	`hash_bucket_free_lists`	Hash bucket free lists by popcount size class (11 buckets).
`qword_126FE60`	8[11]	`value_node_free_lists`	Value node free lists by popcount size class (11 buckets).
`qword_126FBC0`	8	`variant_path_free_list`	Variant path node free list.
`qword_126FBB8`	8	`variant_path_count`	Variant path allocation count.
`qword_126FBC8`	8	`variant_path_limit`	Variant path limit.
`qword_126FBD0`	8	`variant_path_table`	Variant path table pointer.
`qword_126FEC0`	8	`constexpr_class_hash_table`	Class type hash table base for constexpr.
`qword_126FEC8`	8	`constexpr_class_hash_info`	Low 32 = capacity mask, high 32 = entry count.
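
The packed layout of `constexpr_class_hash_info` can be unpacked with two mask-and-shift operations. A minimal sketch, with a hypothetical function name:

```python
def split_hash_info(qword: int) -> tuple:
    """Split the packed 64-bit value into (capacity_mask, entry_count)."""
    capacity_mask = qword & 0xFFFFFFFF          # low 32 bits
    entry_count = (qword >> 32) & 0xFFFFFFFF    # high 32 bits
    return capacity_mask, entry_count
```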

Backend Code Generation (cp_gen_be.c)

Address	Size	Name	Description
`dword_1065834`	4	`indent_level`	Current indentation depth in output.
`dword_1065820`	4	`output_line_number`	Output line counter.
`dword_106581C`	4	`output_column`	Output column counter (chars since last newline).
`dword_1065830`	4	`output_column_alt`	Alternate column counter.
`dword_1065818`	4	`needs_line_directive`	Needs `#line` directive flag.
`qword_1065810`	8	`output_source_position`	Current source position for `#line` directives.
`qword_1065748`	8	`source_sequence_ptr`	Current source sequence entry pointer.
`qword_1065740`	8	`source_sequence_alt`	Secondary source sequence pointer (nested scope iteration).
`byte_10656F0`	1	`current_linkage_spec`	Current linkage spec: `2` = `extern "C"`, `3` = `extern "C++"`.
`qword_1065708`	8	`output_scope_stack`	Output scope stack pointer (linked list).
`qword_1065870`	8	`debug_trace_list`	Debug trace request linked list.

Expression Parsing State

Address	Size	Name	Description
`qword_106B970`	8	`expr_stack_top`	Current expression stack top pointer. Primary context object for expression parsing. Checked at offset `+17` (flags), `+18`, `+19` (bit flags), `+48`, `+120`.
`qword_106B968`	8	`expr_stack_prev`	Previous expression stack entry (push/pop).
`qword_106B580`	8	`saved_expr_context`	Saved expression context (for nested evaluation).
`qword_106B510`	8	`rewrite_loop_counter`	Rewrite loop counter (limited to 100 to prevent infinite loops).
`dword_126EF08`	4	`requires_expr_enabled`	Requires-expression enabled (C++20).

Overload Resolution

Address	Size	Name	Description
`qword_E7FE98`	8	`override_pending_list`	Virtual function override pending list head (40-byte entries).
`qword_E7FEA0`	8	`override_free_list`	Override entry free list.
`qword_E7FE88`	8	`covariant_free_list`	Covariant override free list.
`qword_E7FEC8`	8	`lambda_hash_table`	Lambda closure class hash table pointer.
`qword_E7FED0`	8	`template_member_hash`	Template member hash table pointer.
`dword_E7FE48`	4	`rbtree_sentinel`	Red-black tree sentinel node (for lambda numbering).
`qword_E7FE58`	8	`rbtree_left_sentinel`	Red-black tree left sentinel (= `&dword_E7FE48`).
`qword_E7FE60`	8	`rbtree_right_sentinel`	Red-black tree right sentinel (= `&dword_E7FE48`).
`qword_E7FE68`	8	`rbtree_size`	Red-black tree entry count.

Attribute System

Address	Size	Name	Description
`off_D46820`	32/entry	`attribute_descriptor_table`	Attribute descriptor table. ~146 entries (`(0xD47A60 - 0xD46820) / 32`), stride 32 bytes. Runs to `unk_D47A60`.
`qword_E7FB60`	8	`attribute_hash_table`	Attribute name hash table (Robin Hood lookup via `sub_742670`).
`qword_E7F038`	8	`attribute_hash_table_2`	Secondary attribute hash table.
`byte_E7FB80`	204	`scoped_attr_buffer`	Buffer for scoped attribute name formatting (`"namespace::name"`).
`byte_82C0E0`	--	`attribute_kind_table`	Attribute kind descriptor table (indexed by attribute kind).
`dword_E7F078`	4	`attr_init_flag`	Attribute subsystem initialization flag.
`dword_E7F080`	4	`attr_flags`	Attribute system flags.
`qword_E7F070`	8	`visibility_stack`	Visibility stack linked list.
`qword_E7F068`	8	`visibility_state`	Current visibility state.
`qword_E7F048`	8	`alias_ifunc_free_list`	Free list for alias/ifunc entries.
`qword_E7F058`	8	`alias_list_head`	Alias entry linked list head.
`qword_E7F050`	8	`alias_list_next`	Alias entry linked list next.
`dword_106BF18`	4	`extended_attr_config`	Extended attribute configuration flag. Gates additional initialization.

Control Flow Tracking

Address	Size	Name	Description
`qword_12C7110`	8	`cf_descriptor_free_list`	Control flow descriptor free list.
`qword_12C7118`	8	`cf_active_list_tail`	Active control flow list tail.
`qword_12C7120`	8	`cf_active_list_head`	Active control flow list head.

Cross-Reference System

Address	Size	Name	Description
`qword_106C258`	8	`xref_output_file`	Cross-reference output file handle. When nonzero, enables xref emission.
`qword_12C7160`	8	`xref_callback`	Cross-reference callback (`sub_726F10`).
`dword_12C7148`	4	`xref_enabled`	Cross-reference generation enabled.
`byte_12C71FA`	1	`xref_flag_a`	Cross-reference flag A.
`byte_12C71FE`	1	`xref_flag_b`	Cross-reference flag B. Default `1`.

Object Lifetime Stack

Address	Size	Name	Description
`qword_126E4C0`	8	`curr_object_lifetime`	Top of object lifetime stack. Used for destructor ordering and scope cleanup.

Timing and Debug

Address	Size	Name	Description
`dword_106C0A4`	4	`timing_enabled`	Timing/profiling enabled flag.
`dword_126EFC8`	4	`debug_trace`	Debug tracing active. When set, calls `sub_48AE00`/`sub_48AFD0` trace hooks.
`dword_126EFCC`	4	`debug_verbosity`	Debug verbosity level. `>2` = detailed, `>3` = very detailed, `>4` = IL walk trace.
`byte_106B5C0`	128	`compilation_timestamp`	Compilation timestamp string (from `ctime()`).
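
The verbosity thresholds for `debug_verbosity` read as cumulative tiers. The sketch below lists which tiers a given level would enable. The tier labels are hypothetical names for the descriptions in the table, and treating the tiers as cumulative is an assumption.

```python
# (threshold, label) pairs; a tier is active when debug_verbosity > threshold
VERBOSITY_TIERS = [
    (2, "detailed"),
    (3, "very_detailed"),
    (4, "il_walk_trace"),
]

def enabled_tiers(debug_verbosity: int) -> list:
    """Return the labels of every trace tier enabled at this verbosity level."""
    return [label for threshold, label in VERBOSITY_TIERS
            if debug_verbosity > threshold]
```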

Memory Allocator (Arena/Pool System)

Address	Size	Name	Description
`qword_1280730`	8	`block_free_list`	Recycled 0x10000-byte block free list.
`qword_1280718`	8	`total_memory_allocated`	Total memory allocated (watermark).
`qword_1280710`	8	`peak_memory_allocated`	Peak memory allocated.
`qword_1280708`	8	`tracked_alloc_total`	Tracked allocation total.
`qword_1280720`	8	`free_fe_hash_table`	Hash table for `free_fe` tracked allocations.
`qword_1280748`	8	`alloc_tracking_list`	Linked list of allocation tracking records.
`dword_1280728`	4	`mmap_mode`	Allocation mode flag. `0` = malloc-based, `1` = mmap-based. Set from `dword_106BF18`.
`dword_1280750`	4	`tracking_record_count`	Tracking record count (inline up to 1023, then heap).
`unk_1280760`	--	`tracking_record_array`	Inline tracking record array.

IL Copy Remap

Address	Size	Name	Description
`qword_126F1E0`	8	`copy_remap_free_list`	Copy remap entry free list (24 bytes each).
`qword_126F1D8`	8	`copy_remap_count`	Copy remap entry count.
`qword_126F1D0`	8	`copy_recursion_depth`	Copy recursion depth counter.
`qword_126F1F8`	8	`copy_remap_stat_count`	Copy remap statistics count.
`qword_126F140`	8	`selected_entity`	Selected entity for copy/comparison.
`byte_126F138`	1	`selected_entity_kind`	Kind of selected entity (7 or 11).

IL Deferred Reordering Batch

Address	Size	Name	Description
`qword_126F170`	8	`reorder_batch`	Batch reordering array (24-byte records: entity, placeholder, source_sequence).
`qword_126F158`	8	`reorder_ptr_array`	Pointer array for batch reordering.
`qword_126F150`	8	`reorder_batch_limit`	Batch size limit (100 entries).
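
A 24-byte record with three fields fits three 8-byte values on x86-64. The sketch below decodes a raw dump of the batch array under that assumption. The field widths, their order in memory, and the function name are assumptions based only on the field list in the table.

```python
import struct

# Assumed layout: entity, placeholder, source_sequence as little-endian 8-byte values
REORDER_RECORD = struct.Struct("<QQQ")

def unpack_reorder_batch(raw: bytes) -> list:
    """Decode a dump of the batch array into (entity, placeholder, source_sequence) tuples."""
    if len(raw) % REORDER_RECORD.size:
        raise ValueError("dump length is not a multiple of the 24-byte record size")
    return [REORDER_RECORD.unpack_from(raw, offset)
            for offset in range(0, len(raw), REORDER_RECORD.size)]
```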

CLI Processing State

Address	Size	Name	Description
`dword_E80058`	4	`flag_count`	Current registered CLI flag count (panics at 552 via `sub_40351D`).
`dword_E7FF20`	4	`argv_index`	Current argv parsing index (starts at 1).
`byte_E7FF40`	272	`flag_was_set_bitmap`	272-byte bitmap: which CLI flags were explicitly set.
`dword_E7FF14`	4	`language_already_set`	Guard against switching language mode after initial set.
`dword_E7FF10`	4	`cuda_compat_flag`	CUDA compatibility flag (set based on `dword_126EFAC && qword_126EF98 <= 0x76BF`).
`off_D47CE0`	--	`set_flag_lookup_table`	Lookup table for `--set_flag` CLI option (name-to-address mapping).
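
The condition recorded for `cuda_compat_flag` can be restated as a small predicate. In decimal, `0x76BF` is 30399. The sketch below only restates the expression from the table; the parameter names reuse the IDA global names, and no meaning is assumed for either input beyond what the table gives.

```python
COMPAT_THRESHOLD = 0x76BF   # 30399 in decimal

def cuda_compat_flag(dword_126EFAC: int, qword_126EF98: int) -> int:
    """Evaluate `dword_126EFAC && qword_126EF98 <= 0x76BF` as a 0/1 flag."""
    return int(bool(dword_126EFAC) and qword_126EF98 <= COMPAT_THRESHOLD)
```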

EDG Feature Flags (0x106Bxxx-0x106Cxxx Region)

These flags control individual C/C++ language features. Set during CLI processing and standard-version initialization.

Address	Size	Name	Description
`dword_106C210`	4	`exceptions_enabled`	Exception handling enabled. Default `1`.
`dword_106C180`	4	`rtti_enabled`	RTTI enabled. Default `1`.
`dword_106C164`	4	`templates_enabled`	Templates enabled.
`dword_106C1B8`	4	`template_arg_context`	Template argument context flag.
`dword_106C194`	4	`namespaces_enabled`	Namespaces enabled. Default `1`.
`dword_106C19C`	4	`arg_dep_lookup`	Argument-dependent lookup. Default `1`.
`dword_106C178`	4	`bool_keyword`	`bool` keyword enabled. Default `1`.
`dword_106C188`	4	`wchar_t_keyword`	`wchar_t` keyword enabled. Default `1`.
`dword_106C18C`	4	`alternative_tokens`	Alternative tokens enabled. Default `1`.
`dword_106C1A0`	4	`class_name_injection`	Class name injection. Default `1`.
`dword_106C1A4`	4	`const_string_literals`	Const string literals. Default `1`.
`dword_106C134`	4	`parse_templates`	Parse templates. Default `1`.
`dword_106C138`	4	`dep_name`	Dependent name processing. Default `1`.
`dword_106C12C`	4	`friend_injection`	Friend injection. Default `1`.
`dword_106C128`	4	`adl_related`	ADL related feature. Default `1`.
`dword_106C124`	4	`module_visibility`	Module-level visibility. Default `1`.
`dword_106C140`	4	`compound_literals`	Compound literals. Default `1`.
`dword_106C13C`	4	`base_assign_default`	Base assign op is default. Default `1`.
`dword_106C10C`	4	`deferred_instantiation`	Deferred instantiation flag.
`dword_106C0E4`	4	`exceptions_feature`	Exceptions feature flag (version-dependent).
`dword_106C064`	4	`modify_stack_limit`	Modify stack limit. Default `1`.
`dword_106C068`	4	`fe_inlining`	Frontend inlining enabled.
`dword_106C0A0`	4	`feature_A0`	Miscellaneous feature flag. Default `1`.
`dword_106C098`	4	`feature_98`	Miscellaneous feature flag. Default `1`.
`dword_106C0FC`	4	`feature_FC`	Miscellaneous feature flag. Default `1`.
`dword_106C154`	4	`feature_154`	Miscellaneous feature flag. Default `1`.
`dword_106C208`	4	`constexpr_if_discard`	Constexpr-if discarded-statement handling.
`dword_106C1F0`	4	`cpp_mode_feature`	C++ mode feature flag.
`dword_106C2A4`	4	`feature_2A4`	Default `1`.
`dword_106C214`	4	`feature_214`	Default `1`.
`dword_106C2BC`	4	`modules_enabled`	C++20 modules enabled.
`dword_106C2B8`	4	`module_partitions`	Module partitions enabled.
`dword_106BDB8`	4	`restrict_enabled`	`restrict` keyword enabled. Default `1`.
`dword_106BDB0`	4	`remove_unneeded_entities`	Remove unneeded entities. Default `1`.
`dword_106BD98`	4	`trigraphs_enabled`	Trigraph support. Default `1`.
`dword_106BD68`	4	`guiding_decls`	Guiding declarations. Default `1`.
`dword_106BD58`	4	`old_specializations`	Old-style specializations. Default `1`.
`dword_106BD54`	4	`implicit_typename`	Implicit typename. Default `1`.
`dword_106BEA0`	4	`rtti_config`	RTTI configuration flag.
`dword_106BE84`	4	`gen_move_operations`	Generate move operations. Default `1`.
`dword_106BC08`	4	`nodiscard_enabled`	`[[nodiscard]]` enabled.
`dword_106BC64`	4	`visibility_support`	Visibility support enabled.
`dword_106BDF0`	4	`gnu_attr_groups`	GNU attribute groups enabled.
`dword_106BDF4`	4	`msvc_declspec`	MSVC `__declspec` enabled.
`dword_106BCBC`	4	`template_features`	Template features flag.
`dword_106BFC4`	4	`debug_mode_1`	Debug mode flag 1 (set by `--debug_mode`).
`dword_106BFC0`	4	`debug_mode_2`	Debug mode flag 2.
`dword_106BFBC`	4	`debug_mode_3`	Debug mode flag 3.
`qword_106BCE0`	8	`include_suffix_default`	Include suffix default string (`"::stdh:"`).
`qword_106BC70`	8	`version_threshold`	Feature version threshold. Default `30200`.

Host Compiler Target Configuration

Address	Size	Name	Description
`dword_126E1D4`	4	`msvc_target_version`	MSVC target version (`1200` = VC6, `1400` = VS2005, etc.).
`dword_126E1D8`	4	`is_msvc_host`	Is MSVC host compiler.
`dword_126E1DC`	4	`is_edg_native`	EDG native mode.
`dword_126E1E8`	4	`is_clang_host`	Is Clang host compiler.
`dword_126E1F8`	4	`is_gnu_host`	Is GNU/GCC host compiler.
`qword_126E1F0`	8	`gnu_host_version`	GCC/Clang host version number.
`qword_126E1E0`	8	`clang_host_version`	Clang host version number.
`dword_126E1EC`	4	`backend_enabled`	Backend generation enabled.
`dword_126E1BC`	4	`host_feature_flag`	Host feature flag. Default `1`.
`dword_126DFF0`	4	`msvc_declspec_mode`	MSVC `__declspec` mode enabled.
`qword_126E1B0`	8	`library_prefix`	Library search path prefix (`"lib"`).
`dword_126E200`	4	`constexpr_init_flag`	Constexpr initialization flag.
`dword_126E204`	4	`instantiation_flag`	Instantiation control flag.
`dword_126E224`	4	`parameter_flag`	Parameter handling flag.

Type System Lookup Tables (Read-Only)

Address	Size	Name	Description
`byte_E6D1B0`	256	`signedness_table`	Type-code-to-signedness lookup table.
`byte_E6D1AD`	1	`unsigned_int_kind_sentinel`	Must equal `111` (`'o'`) -- sentinel validation.
`byte_A668A0`	256	`type_kind_properties`	Type kind property table. Bit 1 = callable, bit 4 = aggregate.
`off_E6E020`	--	`il_entry_kind_names`	IL entry kind name table (last = `"last"`, sentinel = 9999).
`off_E6CD78`	--	`db_storage_class_names`	Storage class name table (last = `"last"`).
`off_E6D228`	--	`db_special_function_kinds`	Special function kind name table.
`off_E6CD20`	--	`db_operator_names`	Operator name table.
`off_E6E060`	--	`name_linkage_kind_names`	Name linkage kind names.
`off_E6CD88`	--	`decl_modifier_names`	Declaration modifier names.
`off_E6CF38`	--	`pragma_ids`	Pragma ID table.
`qword_E6C580`	8	`sizeof_il_entry_sentinel`	Must equal `9999` -- sizeof IL entry validation.
`off_E6DD80`	--	`il_entry_kind_display_names`	IL entry kind display names (indexed by kind byte).
`off_E6E040`	--	`linkage_kind_display_names`	Linkage kind display names (none/internal/external/C/C++).
`off_E6E140`	--	`feature_init_table`	Feature initialization table (used with `dword_106BF18`).

IL Display Tables (Read-Only)

Address	Size	Name	Description
`off_A6F840`	8[120]	`builtin_op_names`	Builtin operation kind names (120 entries).
`off_A6FE40`	8[22]	`type_kind_names`	Type kind names (22 entries: void, bool, int, float, ...).
`off_A6F760`	8[4]	`access_specifier_names`	Access specifier names (public/protected/private/none).
`off_A6FE00`	8[7]	`storage_class_display_names`	Storage class display names (7: none/auto/register/static/extern/mutable/thread_local).
`off_A6F480`	--	`register_kind_names`	Register kind names.
`off_A6FC00`	--	`special_kind_names`	Special function kind names (lambda call operator, etc.).
`off_A6FC80`	--	`opname_kind_names`	Operator name kind names.
`off_A6F640`	--	`typeref_kind_names`	Typeref kind names.
`off_A6F420`	--	`based_type_kind_names`	Based type kind names.
`off_A6F3F0`	--	`class_kind_names`	Class/struct/union kind names.
`off_E6C5A0`	--	`builtin_op_table`	Builtin operation reference table.

PCH and Serialization

Address	Size	Name	Description
`dword_106B690`	4	`pch_mode`	Precompiled header mode.
`dword_106B6B0`	4	`pch_loaded`	PCH loaded flag.
`qword_12C6BA0`	8	`pch_string_buffer_1`	PCH string buffer.
`qword_12C6BA8`	8	`pch_string_buffer_2`	PCH string buffer (secondary).
`qword_12C6EA0`	8	`pch_write_state`	PCH binary write state.
`qword_12C6EA8`	8	`pch_misc_state`	PCH miscellaneous state.
`dword_12C6C88`	4	`pch_config_flag`	PCH configuration flag.
`byte_12C6EE0`	1	`pch_byte_flag`	PCH byte flag.
`dword_12C6C8C`	4	`saved_var_list_count`	Saved variable list count (PCH).
`qword_12C6CA0`	8	`saved_var_lists`	Saved variable list array (PCH).

Inline and Linkage Tracking

Address	Size	Name	Description
`qword_12C6FC8`	8	`inline_def_tracking_1`	Inline definition tracking.
`qword_12C6FD0`	8	`inline_def_tracking_2`	Inline definition tracking (secondary).
`qword_12C6FD8`	8	`inline_def_tracking_3`	Inline definition tracking (tertiary).
`qword_12C6FB8`	8	`linkage_stack_1`	Linkage stack.
`qword_12C6FC0`	8	`linkage_stack_2`	Linkage stack (secondary).
`qword_12C6FE0`	8	`mangling_discriminator`	ABI mangling discriminator tracking.
`qword_12C70E8`	8	`misc_tracking`	Miscellaneous definition tracking.

Miscellaneous

Address	Size	Name	Description
`qword_126E4C0`	8	`curr_object_lifetime`	Top of object lifetime stack.
`qword_106B9B0`	8	`active_compilation_ctx`	Active compilation context pointer.
`dword_126E280`	4	`max_object_size`	Maximum object size (for vector/array validation).
`dword_106B4B8`	4	`omp_declare_variant`	OpenMP `declare variant` active flag.
`dword_106BC7C`	4	`compressed_mangling`	Compressed name mangling mode.
`dword_106BD4C`	4	`profiling_flag`	Profiling / performance measurement flag.
`dword_106BCFC`	4	`traditional_enum`	Traditional (unscoped) enum mode.
`dword_106BBD4`	4	`char16_variant_flag`	`char16_t` variant selection flag.
`dword_106BD74`	4	`sharing_mode_config`	IL sharing mode configuration.
`dword_126E1C0`	4	`string_sharing_enabled`	String sharing enabled in IL.
`byte_126E1C4`	1	`basic_char_type`	Basic char type code (for `sub_5BBDF0`).
`dword_106BD8C`	4	`svr4_mode`	SVR4 ABI mode.
`byte_126E349`	1	`cuda_extensions_byte`	CUDA extensions flag (byte-sized).
`byte_126E358`	1	`arch_extension_byte`	Extension flag (possibly `__CUDA_ARCH__`).
`byte_126E3C0`	1	`extension_byte_C0`	Extension flag byte.
`byte_126E3C1`	1	`extension_byte_C1`	Extension flag byte.
`byte_126E481`	1	`extension_byte_481`	Extension flag byte.
`dword_126F248`	4	`il_index_valid`	IL index valid flag (`1` = index built).
`qword_126F240`	8	`il_index_capacity`	IL index array capacity.
`qword_126EBF0`	8	`il_index_count`	IL index entry count.
`qword_126F230`	8	`il_index_aux`	IL index auxiliary pointer.
`dword_12C6A24`	4	`block_scope_suppress`	Block-scope suppress level.
`dword_127FC70`	4	`mark_direction`	Mark/unmark direction for entity traversal.
`dword_127FBA0`	4	`eof_flag`	Input EOF flag.
`qword_127FBA8`	8	`file_handle`	Current input file handle.
`dword_127FB9C`	4	`multibyte_mode`	Multibyte character mode (`>1` = active).
`qword_126E440`	8[6]	`char_type_widths`	Character type width table (indexed by char kind: 1,2,4 bytes).
`qword_126E580`	8[11]	`special_type_entries`	Special type entries (11 entries).
`qword_126DE00`	--	`operator_name_table`	Operator name string table.
`off_E6E0E0`	--	`predef_macro_mode_names`	Predefined macro mode name table (sentinel = `"last"`).
`qword_126EEA0`	8	`predef_macro_state`	Predefined macro initialization state.
`dword_106BBA8`	4	`c23_features`	C23 features flag (`#elifdef`/`#elifndef`).
`dword_106C2B0`	4	`preproc_feature_flag`	Preprocessor feature flag.
`dword_106BEF8`	4	`pch_config_2`	PCH configuration flag (secondary).

GCC Pragma State

Address	Size	Name	Description
`qword_12C6F60`	8	`gcc_pragma_stack_1`	GCC pragma push/pop stack.
`qword_12C6F68`	8	`gcc_pragma_stack_2`	GCC pragma stack (secondary).
`qword_12C6F78`	8	`gcc_pragma_state`	GCC pragma state.
`qword_12C6F98`	8	`gcc_pragma_misc`	GCC pragma miscellaneous state.

Integer Range Tables (SSE-width)

Address	Size	Name	Description
`xmmword_126E0E0`	16	`integer_upper_bounds`	Upper bounds for integer kinds (populated during init).
`xmmword_126E000`	16	`integer_lower_bounds`	Lower bounds for integer kinds.

IL Common Header Template

The 96-byte (6 x 16 bytes) template copied into every new IL entity:

Address	Size	Name
`xmmword_126F6A0`	16	IL header template word 0
`xmmword_126F6B0`	16	IL header template word 1
`xmmword_126F6C0`	16	IL header template word 2
`xmmword_126F6D0`	16	IL header template word 3
`xmmword_126F6E0`	16	IL header template word 4
`xmmword_126F6F0`	16	IL header template word 5

Address Region Summary

Region	Range	Count	Purpose
`.rodata`	`0x82xxxx`--`0xA7xxxx`	~30	Constant tables (attribute descriptors, operation names, type kind names)
`.data`	`0xD46xxx`--`0xD48xxx`	~10	Attribute descriptor table, CLI flag lookup
`.data`	`0xE6xxxx`--`0xE7Exxx`	~40	IL metadata tables (entry kind names, type properties, signedness, pragma IDs)
`.rodata`	`0x88xxxx`	1	Error message template table (3795 entries)
`.bss`	`0x106Bxxx`--`0x106Cxxx`	~120	NVIDIA-added CLI flags, feature toggles, CUDA configuration
`.bss`	`0x1065xxx`	~20	Backend code generator state (output position, stub mode)
`.bss`	`0x1067xxx`	~10	Diagnostic per-error tracking, entity formatter
`.bss`	`0x126xxxx`	~200	EDG core state (scope stack, lexer, IL, error counters, source position)
`.bss`	`0x1270xxx`	~10	Preprocessor macro chains
`.bss`	`0x1280xxx`	~15	Arena allocator tracking
`.bss`	`0x1286xxx`	~10	Lambda transform state, registration lists
`.bss`	`0x12C6xxx`--`0x12C7xxx`	~40	PCH, template instantiation, TU management
`.bss`	`0xE7xxxx`	~30	Attribute system, override tracking, red-black tree
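
When assigning a global to a region, the section can be checked mechanically against the Segment Layout table at the start of this reference. The sketch below does that lookup. The function name is hypothetical; the bounds are copied from that table, with each End value treated as exclusive (Start plus Size).

```python
# (section, start, end) taken from the Segment Layout table; end is exclusive
SECTIONS = [
    (".text",        0x403300, 0x829722),
    (".rodata",      0x829740, 0xAA3FA3),
    (".eh_frame",    0xCB1210, 0xD3F398),
    (".data.rel.ro", 0xD428C0, 0xD45E00),
    (".data",        0xD46480, 0xE7EFF0),
    (".bss",         0xE7F000, 0x12D6F20),
]

def section_of(address: int):
    """Return the name of the section containing the address, or None."""
    for name, start, end in SECTIONS:
        if start <= address < end:
            return name
    return None
```

For example, `section_of(0xA6F840)` places the builtin operation name table in `.rodata`, while `section_of(0x126E610)` places the IL kind lists in `.bss`.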

Token Kind Table

Every token produced by cudafe++'s lexer carries a 16-bit token kind stored in the global `word_126DD58`. There are exactly 357 token kinds, numbered 0 through 356. Their names are indexed from a string pointer table at `off_E6D240`, which by address falls in the `.data` section. A parallel 357-entry byte array at `byte_E6C0E0` maps each token kind to an operator-name index; the `initialize_opname_kinds` routine (`sub_588BB0`) uses it to populate the operator name display table at `qword_126DE00`. A boolean stop-token table at `qword_126DB48 + 8` (357 entries) marks which token kinds are valid synchronization points for error recovery in `skip_to_token` (`sub_6887C0`).
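
The role of the stop-token table in error recovery can be sketched as a scan that discards tokens until it reaches a kind marked as a synchronization point. This is an illustration of the idea, not a reconstruction of `sub_6887C0`: the function name, the token stream representation, and the choice of which kinds are marked are all hypothetical.

```python
from typing import Iterable, Optional, Sequence

TOKEN_KIND_COUNT = 357   # token kinds 0..356

def skip_to_stop_token(kinds: Iterable[int],
                       is_stop: Sequence[bool]) -> Optional[int]:
    """Consume token kinds until one is marked in the stop table; return it, or None."""
    for kind in kinds:
        if not 0 <= kind < TOKEN_KIND_COUNT:
            raise ValueError("token kind outside 0..356")
        if is_stop[kind]:
            return kind       # resynchronize the parser here
    return None               # stream exhausted without a stop token
```

A table lookup keeps the recovery loop independent of the grammar: callers change which tokens count as synchronization points by editing the table rather than the loop.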

Token kind assignment follows a block scheme established by the EDG 6.6 frontend: identifiers, literals, operators, and punctuation occupy the lowest range, followed by alternative tokens (C++ digraphs and named operators), C89 keywords, C99/C11 extensions, MSVC keywords, core C++ keywords, compiler internals, type-trait intrinsics, and finally the newest C++23/26 and extended-type additions at the top. CUDA-specific additions from NVIDIA occupy three dedicated slots (328--330) immediately after the type-trait block, plus additional entries in the extended range. This ordering reflects the historical accretion of the C and C++ standards: each new standard appended its keywords at the end rather than filling gaps.

Key Facts

Property	Value
Total token kinds	357 (indices 0--356)
Name table	`off_E6D240` (357 string pointers; address lies in the `.data` range)
Operator-to-name map	`byte_E6C0E0` (357-byte index array)
Operator name display table	`qword_126DE00` (48 string pointers, populated by `sub_588BB0`)
Stop-token table	`qword_126DB48 + 8` (357 boolean entries)
Current token global	`word_126DD58` (WORD)
Keyword registration function	`sub_5863A0` (`keyword_init`, 1,113 lines, `fe_init.c`)
Keyword entry function	`sub_7463B0` (`enter_keyword`)
GNU variant registration	`sub_585B10` (`enter_gnu_keyword`)
Alternative token entry	`sub_749600` (registers named operator alternative)

Token Kind Ranges

Range	Count	Category	Description
0	1	Special	End-of-file / no-token sentinel
1--31	31	Literals, operators, and punctuation	Identifier and literal tokens (1--5), then core operators (`+`, `-`, `*`, etc.) and delimiters (`(`, `)`, `{`, `}`, `;`)
32--51	20	Operators (continued)	Compound and remaining operators (`--`, `==`, `<<`, `>>`, `+=`, `&&`, `::`)
52--76	25	Alternative tokens / digraphs	C++ named operators (`and`, `or`, `and_eq`), digraphs (`<%`, `%>`, `<:`, `:>`), and further operators (`->*`, `.*`, `...`, `<=>`, `#`, `##`)
77--108	32	C89 keywords	All keywords from ANSI C89/ISO C90
109--131	23	C99/C11 keywords	`restrict`, `_Bool`, `_Complex`, `_Imaginary`, character types
132--136	5	MSVC keywords	`__declspec`, `__int8`--`__int64`
137--199	63	C++ keywords	Core C++ keywords plus C++11/14/17/20/23 additions
200--206	7	Compiler internal	Preprocessor and internal token kinds
207--327	121	Type trait intrinsics	`__is_xxx` / `__has_xxx` compiler intrinsic keywords
328--330	3	NVIDIA CUDA type traits	NVIDIA-specific lambda type-trait intrinsics
331--356	26	Extended types / recent additions	`_Float32`--`_Float128`, C++23/26 features, scalable vector types
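
The range table can be turned into a small lookup, which is handy when annotating decompiled switch statements over word_126DD58. The sketch below is illustrative only: the struct, function names, and category labels are assumptions of this page, not symbols recovered from the binary. Only the numeric bounds are taken from the table.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative names; only the numeric bounds come from the range table. */
typedef struct {
    unsigned first;
    unsigned last;
    const char *category;
} token_range;

static const token_range token_ranges[] = {
    {   0,   0, "special" },
    {   1,  31, "operators and punctuation" },
    {  32,  51, "operators (continued)" },
    {  52,  76, "alternative tokens / digraphs" },
    {  77, 108, "C89 keywords" },
    { 109, 131, "C99/C11 keywords" },
    { 132, 136, "MSVC keywords" },
    { 137, 199, "C++ keywords" },
    { 200, 206, "compiler internal" },
    { 207, 327, "type trait intrinsics" },
    { 328, 330, "NVIDIA CUDA type traits" },
    { 331, 356, "extended types / recent additions" },
};

/* Returns the category label for a token kind, or NULL when kind > 356. */
const char *token_kind_category(unsigned kind)
{
    size_t n = sizeof token_ranges / sizeof token_ranges[0];
    for (size_t i = 0; i < n; i++) {
        if (kind >= token_ranges[i].first && kind <= token_ranges[i].last)
            return token_ranges[i].category;
    }
    return NULL;
}

/* Checks that the ranges tile the kind space from 0 upward with no gaps
 * or overlaps, and returns the number of kinds they cover. */
unsigned token_kind_total(void)
{
    size_t n = sizeof token_ranges / sizeof token_ranges[0];
    unsigned next = 0;
    for (size_t i = 0; i < n; i++) {
        assert(token_ranges[i].first == next);
        assert(token_ranges[i].last >= token_ranges[i].first);
        next = token_ranges[i].last + 1;
    }
    return next;
}
```

token_kind_total() doubles as a consistency check on the table: the twelve ranges are contiguous and sum to the 357 kinds stated in Key Facts.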

Complete Token Table

Operators and Punctuation (0--51)

These tokens are produced directly by the character-level scanner sub_679800 (scan_token). Multi-character operators are resolved by dedicated scanning functions in the 0x67ABB0--0x67BAB0 range.

Kind	Name	C/C++ Construct	Notes
0	`<eof>`	End of file	Sentinel / no-token marker
1	`<identifier>`	Identifier	Any non-keyword identifier
2	`<integer literal>`	Integer constant	Decimal, hex, octal, or binary
3	`<floating literal>`	Floating-point constant	Float, double, or long double
4	`<character literal>`	Character constant	`'x'`, includes wide/u8/u16/u32
5	`<string literal>`	String literal	`"..."`, includes wide/u8/u16/u32/raw
6	`;`	Semicolon	Statement terminator
7	`(`	Left parenthesis	Grouping, function call
8	`)`	Right parenthesis	Closes grouping, function call
9	`,`	Comma	Separator, comma operator
10	`=`	Assignment	`a = b`
11	`{`	Left brace	Block/initializer open
12	`}`	Right brace	Block/initializer close
13	`+`	Plus	Addition, unary plus
14	`-`	Minus	Subtraction, unary minus
15	`*`	Star	Multiplication, pointer dereference, pointer declarator
16	`/`	Slash	Division
17	`<`	Less-than	Comparison, template open bracket
18	`>`	Greater-than	Comparison, template close bracket
19	`&`	Ampersand	Bitwise AND, address-of, reference declarator
20	`?`	Question mark	Ternary conditional
21	`:`	Colon	Label, ternary, bit-field width
22	`~`	Tilde	Bitwise complement, destructor
23	`%`	Percent	Modulo
24	`^`	Caret	Bitwise XOR
25	`[`	Left bracket	Array subscript, attributes `[[`
26	`.`	Dot	Member access
27	`]`	Right bracket	Closes array subscript
28	`!`	Exclamation	Logical NOT
29	`\|`	Pipe	Bitwise OR
30	`->`	Arrow	Pointer member access
31	`++`	Increment	Pre/post increment
32	`--`	Decrement	Pre/post decrement
33	`==`	Equal	Equality comparison; also `bitand` alt-token for `&`
34	`!=`	Not-equal	Inequality comparison
35	`<=`	Less-or-equal	Comparison
36	`>=`	Greater-or-equal	Comparison
37	`<<`	Left shift	Also `compl` alt-token for `~`
38	`>>`	Right shift	Also `not` alt-token for `!`
39	`+=`	Add-assign	Compound assignment
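40	`-=`	Subtract-assign	Compound assignment
41	`*=`	Multiply-assign	Compound assignment
42	`/=`	Divide-assign	Compound assignment
43	`%=`	Modulo-assign	Compound assignment
44	`<<=`	Left-shift-assign	Compound assignment
45	`>>=`	Right-shift-assign	Compound assignment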
46	`&&`	Logical AND	Also rvalue-reference declarator (C++11)
47	`\|\|`	Logical OR	--
48	`^=`	XOR-assign	Also `not_eq` alt-token for `!=`
49	`&=`	AND-assign	Compound assignment
50	`\|=`	OR-assign	Also `xor` alt-token for `^`
51	`::`	Scope resolution	Also `bitor` alt-token for `\|`

Alternative Tokens and Digraphs (52--76)

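C++ alternative tokens (ISO 14882 clause 5.5) and C/C++ digraphs, together with several operators that are numbered in this range (`->*`, `.*`, `...`, `<=>`) and the preprocessor operators `#` and `##`. The alternative tokens are registered during keyword_init (sub_5863A0) via sub_749600 when in C++ mode (dword_126EFB4 == 2).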

Kind	Name	Equivalent	Notes
52	`and`	`&&`	Logical AND
53	`or`	`\|\|`	Logical OR
54	`->*`	`->*`	Pointer-to-member via pointer
55	`.*`	`.*`	Pointer-to-member via object
56	`...`	`...`	Ellipsis (variadic)
57	`<=>`	`<=>`	Three-way comparison (C++20)
58	`#`	`#`	Preprocessor stringification
59	`##`	`##`	Preprocessor token paste
60	`<%`	`{`	Digraph for left brace
61	`%>`	`}`	Digraph for right brace
62	`<:`	`[`	Digraph for left bracket
63	`:>`	`]`	Digraph for right bracket
64	`and_eq`	`&=`	Bitwise AND-assign
65	`xor_eq`	`^=`	Bitwise XOR-assign
66	`or_eq`	`\|=`	Bitwise OR-assign
67	`%:`	`#`	Digraph for hash
68	`%:%:`	`##`	Digraph for token paste
69--76	(reserved)	--	Reserved for future alternative tokens
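
As a reading aid, the digraph rows of this table reduce to a fixed mapping from token kind to primary spelling. The function below is a hypothetical helper written for this page (its name does not come from the binary); the kind numbers and spellings are copied from the table.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the primary spelling for a digraph token kind (60--63, 67, 68
 * in the table above), or NULL for any other valid kind. */
const char *digraph_primary_spelling(unsigned kind)
{
    assert(kind <= 356);   /* token kinds run from 0 through 356 */
    switch (kind) {
    case 60: return "{";   /* <%   */
    case 61: return "}";   /* %>   */
    case 62: return "[";   /* <:   */
    case 63: return "]";   /* :>   */
    case 67: return "#";   /* %:   */
    case 68: return "##";  /* %:%: */
    default: return NULL;  /* not a digraph */
    }
}
```

Named alternatives such as `and` (52) or `and_eq` (64) are deliberately left out: they are keyword-like spellings rather than digraphs.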

C89 Keywords (77--108)

Registered unconditionally. These form the base keyword set present in every compilation mode.

Kind	Name	C/C++ Construct
77	`auto`	Storage class (C89); type deduction (C++11)
78	`break`	Loop/switch exit
79	`case`	Switch case label
80	`char`	Character type
81	`const`	Const qualifier
82	`continue`	Loop continuation
83	`default`	Switch default label; defaulted function (C++11)
84	`do`	Do-while loop
85	`double`	Double-precision float
86	`else`	If-else branch
87	`enum`	Enumeration
88	`extern`	External linkage
89	`float`	Single-precision float
90	`for`	For loop
91	`goto`	Unconditional jump
92	`if`	Conditional
93	`int`	Integer type
94	`long`	Long integer modifier
95	`register`	Register storage hint (deprecated in C++11, removed in C++17)
96	`return`	Function return
97	`short`	Short integer modifier
98	`signed`	Signed integer modifier
99	`sizeof`	Size query operator
100	`static`	Static storage / internal linkage
101	`struct`	Structure
102	`switch`	Multi-way branch
103	`typedef`	Type alias (C-style)
104	`union`	Union type
105	`unsigned`	Unsigned integer modifier
106	`void`	Void type
107	`volatile`	Volatile qualifier
108	`while`	While loop

C99/C11/C23 Keywords (109--131)

Gated on the C standard version at dword_126EF68 (values: 199901 = C99, 201112 = C11, 202311 = C23).

Kind	Name	Standard	C/C++ Construct
109	`inline`	C99	Inline function hint (already C++ keyword at 154)
110--118	(reserved)	--	--
119	`restrict`	C99	Pointer restrict qualifier
120	`_Bool`	C99	Boolean type (C-style)
121	`_Complex`	C99	Complex number type
122	`_Imaginary`	C99	Imaginary number type
123--125	(reserved)	--	--
126	`char16_t`	C++11/C23	16-bit character type
127	`char32_t`	C++11/C23	32-bit character type
128	`char8_t`	C++20/C23	UTF-8 character type
129--131	(reserved)	--	--

MSVC Keywords (132--136)

Gated on dword_126EFB0 (Microsoft extensions enabled, language mode 2/MSVC).

Kind	Name	MSVC Construct
132	`__declspec`	MSVC declaration specifier
133	`__int8`	8-bit integer type
134	`__int16`	16-bit integer type
135	`__int32`	32-bit integer type
136	`__int64`	64-bit integer type

C++ Core Keywords (137--199)

Gated on C++ mode (dword_126EFB4 == 2). Some keywords within this range were added in C++11 through C++23 and have additional standard-version gates.

Kind	Name	Standard	C/C++ Construct
137	`bool`	C++98	Boolean type
138	`true`	C++98	Boolean literal
139	`false`	C++98	Boolean literal
140	`wchar_t`	C++98	Wide character type
141	(reserved)	--	--
142	`__attribute`	GNU	GCC attribute syntax
143	`__builtin_types_compatible_p`	GNU	GCC type compatibility test
144--149	(reserved)	--	--
150	`catch`	C++98	Exception handler
151	`class`	C++98	Class definition
152	`delete`	C++98	Deallocation; deleted function (C++11)
153	`friend`	C++98	Friend declaration
154	`inline`	C++98	Inline function/variable
155	`new`	C++98	Allocation expression
156	`operator`	C++98	Operator overload
157	`private`	C++98	Access specifier
158	`protected`	C++98	Access specifier
159	`public`	C++98	Access specifier
160	`template`	C++98	Template declaration
161	`this`	C++98	Current object pointer
162	`throw`	C++98	Throw expression
163	`try`	C++98	Try block
164	`virtual`	C++98	Virtual function/base
165	(reserved)	--	--
166	`const_cast`	C++98	Const cast expression
167	`dynamic_cast`	C++98	Dynamic cast expression
168	(reserved)	--	--
169	`export`	C++98/20	Export declaration (original C++98, revived for modules in C++20)
170	`export`	C++20	Module export (alternate registration slot)
171--173	(reserved)	--	--
174	`mutable`	C++98	Mutable data member
175	`namespace`	C++98	Namespace declaration
176	`reinterpret_cast`	C++98	Reinterpret cast expression
177	`static_cast`	C++98	Static cast expression
178	`typeid`	C++98	Runtime type identification
179	`using`	C++98	Using declaration/directive
180--182	(reserved)	--	--
183	`typename`	C++98	Dependent type name
184	`static_assert`	C++11	Static assertion; also `_Static_assert` in C11
185	`decltype`	C++11	Decltype specifier
186	`__auto_type`	GNU	GCC auto type extension
187	`__extension__`	GNU	GCC extension marker (suppress warnings)
188	(reserved)	--	--
189	`typeof`	C23/GNU	Type-of expression
190	`typeof_unqual`	C23	Unqualified type-of expression
191--193	(reserved)	--	--
194	`thread_local`	C++11	Thread-local storage; also `_Thread_local` in C11
195--199	(reserved)	--	--

Compiler Internal Tokens (200--206)

These tokens are used internally by the preprocessor and the token cache. They never appear in user-visible diagnostics.

Kind	Name	Purpose
200	`<pp-number>`	Preprocessing number (not yet classified as integer or float)
201	`<header-name>`	Include file name (`<file>` or `"file"`)
202	`<newline>`	Logical newline token (preprocessor directive boundary)
203	`<whitespace>`	Whitespace token (preprocessing mode only)
204	`<placemarker>`	Token-paste placeholder (empty argument in `##`)
205	`<pragma>`	Pragma token (deferred for later processing)
206	`<end-of-directive>`	End of preprocessor directive

Type Trait Intrinsics (207--327)

These are compiler intrinsic keywords that implement the C++ type traits (from <type_traits>) without requiring template instantiation. They are registered during keyword_init with C++ standard version gating -- earlier traits (C++11) are always available in C++ mode, while newer traits (C++20, C++23, C++26) require the corresponding standard version at dword_126EF68. Some traits are MSVC-specific (gated on dword_126EFB0) or Clang-specific (gated on qword_126EF90).

The complete list of type-trait intrinsics, in token-kind order within each sub-category (several non-trait keywords share the range and are marked as such):

Unary Type Predicates

Kind	Name	Standard	Tests Whether...
207	`__is_class`	C++11	Type is a class (not union)
208	`__is_enum`	C++11	Type is an enumeration
209	`__is_union`	C++11	Type is a union
210	`__is_pod`	C++11	Type is POD (plain old data)
211	`__is_empty`	C++11	Type has no non-static data members
212	`__is_polymorphic`	C++11	Type has at least one virtual function
213	`__is_abstract`	C++11	Type has at least one pure virtual function
214	`__is_literal_type`	C++11	Type is a literal type (deprecated C++17)
215	`__is_standard_layout`	C++11	Type is standard-layout
216	`__is_trivial`	C++11	Type is trivially copyable and has trivial default constructor
217	`__is_trivially_copyable`	C++11	Type is trivially copyable
218	`__is_final`	C++14	Class is marked `final`
219	`__is_aggregate`	C++17	Type is an aggregate
220	`__has_virtual_destructor`	C++11	Type has a virtual destructor
221	`__has_trivial_constructor`	C++11	Type has a trivial default constructor
222	`__has_trivial_copy`	C++11	Type has a trivial copy constructor
223	`__has_trivial_assign`	C++11	Type has a trivial copy assignment
224	`__has_trivial_destructor`	C++11	Type has a trivial destructor
225	`__has_nothrow_constructor`	C++11	Default constructor is noexcept
226	`__has_nothrow_copy`	C++11	Copy constructor is noexcept
227	`__has_nothrow_assign`	C++11	Copy assignment is noexcept
228	`__has_trivial_move_constructor`	C++11	Type has a trivial move constructor
229	`__has_trivial_move_assign`	C++11	Type has a trivial move assignment
230	`__has_nothrow_move_assign`	C++11	Move assignment is noexcept
231	`__has_unique_object_representations`	C++17	Type has unique object representations
232	`__is_signed`	C++11	Type is a signed arithmetic type
233	`__is_unsigned`	C++11	Type is an unsigned arithmetic type
234	`__is_integral`	C++11	Type is an integral type
235	`__is_floating_point`	C++11	Type is a floating-point type
236	`__is_arithmetic`	C++11	Type is an arithmetic type
237	`nullptr`	C++11	Null pointer literal (not a trait; shares range)
238	`__is_fundamental`	C++11	Type is a fundamental type
239	`__int128`	GNU	128-bit integer type (not a trait; shares range)
240	`__is_scalar`	C++11	Type is a scalar type
241	`__is_object`	C++11	Type is an object type
242	`__is_compound`	C++11	Type is a compound type
243	`__is_reference`	C++11	Type is an lvalue or rvalue reference
244	`constexpr`	C++11	Constexpr specifier (not a trait; shares range)
245	`consteval`	C++20	Consteval specifier (not a trait; shares range)
246	`constinit`	C++20	Constinit specifier (not a trait; shares range)
247	`_Alignof`	C11	Alignment query (C11 spelling)
248	`_Alignas`	C11	Alignment specifier (C11 spelling)
249	`__bases`	GCC	All base classes, direct and indirect (GCC extension)
250	`__direct_bases`	GCC	Direct base classes only (GCC extension)
251	`__builtin_arm_ldrex`	Clang	ARM load-exclusive intrinsic
252	`__builtin_arm_ldaex`	Clang	ARM load-acquire-exclusive intrinsic
253	`__builtin_arm_addg`	Clang	ARM MTE add-tag intrinsic
254	`__builtin_arm_irg`	Clang	ARM MTE insert-random-tag intrinsic
255	`__builtin_arm_ldg`	Clang	ARM MTE load-tag intrinsic
256	`__is_member_pointer`	C++11	Type is a pointer to member
257	`__is_member_function_pointer`	C++11	Type is a pointer to member function
258	`__builtin_shufflevector`	Clang	Clang vector shuffle intrinsic
259	`__builtin_convertvector`	Clang	Clang vector conversion intrinsic
260	`_Noreturn`	C11	No-return function specifier
261	`__builtin_complex`	GNU	GCC complex number construction
262	`_Generic`	C11	Generic selection expression
263	`_Atomic`	C11	Atomic type qualifier/specifier
264	`_Nullable`	Clang	Nullable pointer qualifier
265	`_Nonnull`	Clang	Non-null pointer qualifier
266	`_Null_unspecified`	Clang	Null-unspecified pointer qualifier
267	`co_yield`	C++20	Coroutine yield expression
268	`co_return`	C++20	Coroutine return statement
269	`co_await`	C++20	Coroutine await expression
270	`__is_member_object_pointer`	C++11	Type is a pointer to data member
271	`__builtin_addressof`	GNU	Address-of without operator overload

EDG Internal Keywords (272--283)

These are not user-facing keywords. They are injected by the EDG frontend into synthesized declarations for built-in types, throw specifications, and vector types.

Kind	Name	Purpose
272	`__edg_type__`	EDG internal type placeholder
273	`__edg_vector_type__`	SIMD vector type (GCC `__attribute__((vector_size))` lowering)
274	`__edg_neon_vector_type__`	ARM NEON vector type
275	`__edg_scalable_vector_type__`	ARM SVE scalable vector type
276	`__edg_neon_polyvector_type__`	ARM NEON polynomial vector type
277	`__edg_size_type__`	Placeholder for `size_t` before it is typedef'd
278	`__edg_ptrdiff_type__`	Placeholder for `ptrdiff_t` before it is typedef'd
279	`__edg_bool_type__`	Placeholder for `bool` / `_Bool`
280	`__edg_wchar_type__`	Placeholder for `wchar_t`
281	`__edg_throw__`	Throw specification in synthesized declarations
282	`__edg_opnd__`	Operand reference in synthesized expressions
283	(reserved)	--

More Type Predicates and Binary Traits (284--327)

Kind	Name	Standard	Tests Whether...
284	`__is_const`	C++11	Type is const-qualified
285	`__is_volatile`	C++11	Type is volatile-qualified
286	`__is_void`	C++11	Type is `void`
287	`__is_array`	C++11	Type is an array
288	`__is_pointer`	C++11	Type is a pointer
289	`__is_lvalue_reference`	C++11	Type is an lvalue reference
290	`__is_rvalue_reference`	C++11	Type is an rvalue reference
291	`__is_function`	C++11	Type is a function type
292	`__is_constructible`	C++11	Type is constructible from given args
293	`__is_nothrow_constructible`	C++11	Construction is noexcept
294	`requires`	C++20	Requires expression/clause
295	`concept`	C++20	Concept definition
296	`__builtin_has_attribute`	GNU	Tests if declaration has given attribute
297	`__builtin_bit_cast`	C++20	Bit cast intrinsic (`std::bit_cast` implementation)
298	`__is_assignable`	C++11	Type is assignable from given type
299	`__is_nothrow_assignable`	C++11	Assignment is noexcept
300	`__is_trivially_constructible`	C++11	Construction is trivial
301	`__is_trivially_assignable`	C++11	Assignment is trivial
302	`__is_destructible`	C++11	Type is destructible
303	`__is_nothrow_destructible`	C++11	Destruction is noexcept
304	`__edg_is_deducible`	EDG	EDG internal: template argument is deducible
305	`__is_trivially_destructible`	C++11	Destruction is trivial
306	`__is_base_of`	C++11	First type is base of second (binary trait)
307	`__is_convertible`	C++11	First type is convertible to second (binary trait)
308	`__is_same`	C++11	Two types are the same (binary trait)
309	`__is_trivially_copy_assignable`	C++11	Copy assignment is trivial
310	`__is_assignable_no_precondition_check`	EDG	Assignable without precondition validation
311	`__is_same_as`	Clang	Alias for `__is_same` (Clang compatibility)
312	`__is_referenceable`	C++11	Type can be referenced
313	`__is_bounded_array`	C++20	Type is a bounded array
314	`__is_unbounded_array`	C++20	Type is an unbounded array
315	`__is_scoped_enum`	C++23	Type is a scoped enumeration
316	`__is_literal`	C++11	Alias for `__is_literal_type`
317	`__is_complete_type`	EDG	Type is complete (not forward-declared)
318	`__is_nothrow_convertible`	C++20	Conversion is noexcept (binary trait)
319	`__is_convertible_to`	MSVC	MSVC alias for `__is_convertible`
320	`__is_invocable`	C++17	Callable with given arguments
321	`__is_nothrow_invocable`	C++17	Call is noexcept
322	`__is_trivially_equality_comparable`	Clang	Bitwise equality is equivalent
323	`__is_layout_compatible`	C++20	Types have compatible layouts
324	`__is_pointer_interconvertible_base_of`	C++20	Pointer-interconvertible base (binary trait)
325	`__is_corresponding_member`	C++20	Corresponding members in layout-compatible types
326	`__is_pointer_interconvertible_with_class`	C++20	Member pointer is interconvertible with class pointer
327	`__is_trivially_relocatable`	C++26	Type can be trivially relocated

NVIDIA CUDA Type Traits (328--330)

Three NVIDIA-specific type-trait intrinsics occupy dedicated token kinds. These are registered during keyword_init when GPU mode is active (dword_106C2C0 != 0) and participate in the same token classification pipeline as all other type traits. They are used internally by the CUDA frontend to detect extended lambda closure types during device/host separation.

Kind	Name	Purpose
328	`__nv_is_extended_device_lambda_closure_type`	Tests whether a type is the closure type of an extended device lambda. Used during device code generation to identify lambda closures that require special treatment (wrapper function generation, address-space conversion).
329	`__nv_is_extended_host_device_lambda_closure_type`	Tests whether a type is the closure type of an extended host-device lambda (`__host__ __device__`). These lambdas require dual code generation paths and wrapper functions for both host and device.
330	`__nv_is_extended_device_lambda_with_preserved_return_type`	Tests whether a device lambda has an explicitly specified (preserved) return type rather than a deduced one. Affects how the compiler generates the wrapper function return type.

When extended lambdas are disabled, these traits are predefined as macros expanding to false:

// Fallback definitions in preprocessor preamble:
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false

Extended Types and Recent Additions (331--356)

These are the newest token kinds, added for extended floating-point types (ISO/IEC TS 18661-3) and recent C++23/26 features.

Kind	Name	Standard	C/C++ Construct
331	`_Float32`	TS 18661-3	32-bit IEEE 754 float
332	`_Float32x`	TS 18661-3	Extended 32-bit float
333	`_Float64`	TS 18661-3	64-bit IEEE 754 float
334	`_Float64x`	TS 18661-3	Extended 64-bit float
335	`_Float128`	TS 18661-3	128-bit IEEE 754 float
336--340	(reserved)	--	--
341--356	(recent additions)	C++23/26	Reserved for MSVC C++/CLI traits (`__is_ref_class`, `__is_value_class`, `__is_interface_class`, `__is_delegate`, `__is_sealed`, `__has_finalizer`, `__has_copy`, `__has_assign`, `__is_simple_value_class`, `__is_ref_array`, `__is_valid_winrt_type`, `__is_win_class`, `__is_win_interface`) and additional future extensions

Token Cache

The token cache provides lookahead, backtracking, and macro-expansion replay for C++ parsing. Tokens are stored in a linked list of cache entries, each 80--112 bytes depending on payload.

Cache Entry Layout

Offset	Size	Field	Description
`+0`	8	`next`	Next entry in linked list
`+8`	8	`source_position`	Encoded file/line/column
`+16`	2	`token_code`	Token kind (0--356)
`+18`	1	`cache_entry_kind`	Payload discriminator (see table below)
`+20`	4	`flags`	Token classification flags
`+24`	4	`extra_flags`	Additional flags
`+32`	8	`extra_data`	Context-dependent data
`+40`..	varies	`payload`	Kind-specific data (40--72 bytes)
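
The layout table maps onto a C struct whose offsets fall out of natural x86-64 alignment, with no packing pragma needed. The sketch below is an assumption-laden reconstruction: the type and field names are borrowed from the table, the payload is modelled as an opaque 72-byte array (the largest size quoted above), and the bytes at +19 and +28..+31 are treated as ordinary padding because the table does not account for them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The offsets below only hold on an LP64 target such as x86-64. */
_Static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");

typedef struct token_cache_entry {
    struct token_cache_entry *next;   /* +0  next entry in linked list     */
    uint64_t source_position;         /* +8  encoded file/line/column      */
    uint16_t token_code;              /* +16 token kind (0--356)           */
    uint8_t  cache_entry_kind;        /* +18 payload discriminator         */
    uint32_t flags;                   /* +20 (one padding byte at +19)     */
    uint32_t extra_flags;             /* +24                               */
    uint64_t extra_data;              /* +32 (four padding bytes at +28)   */
    unsigned char payload[72];        /* +40 opaque, kind-specific bytes   */
} token_cache_entry;

/* Asserts that the compiler's layout matches the documented offsets. */
void check_token_cache_entry_layout(void)
{
    assert(offsetof(token_cache_entry, next) == 0);
    assert(offsetof(token_cache_entry, source_position) == 8);
    assert(offsetof(token_cache_entry, token_code) == 16);
    assert(offsetof(token_cache_entry, cache_entry_kind) == 18);
    assert(offsetof(token_cache_entry, flags) == 20);
    assert(offsetof(token_cache_entry, extra_flags) == 24);
    assert(offsetof(token_cache_entry, extra_data) == 32);
    assert(offsetof(token_cache_entry, payload) == 40);
    assert(sizeof(token_cache_entry) == 112);   /* upper end of 80--112 */
}
```

With the full 72-byte payload the struct is 112 bytes, which matches the upper bound of the entry size range; smaller payloads would correspond to shorter allocations that this fixed-size sketch does not model.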

Cache Entry Kinds

Eight discriminator values (1--8) select the payload interpretation at offset +40; values 5 and 7 were not observed in use:

Kind	Value	Payload Content	Size	Description
identifier	1	Name pointer + 64-byte lookup result	72	Identifier with pre-resolved scope/symbol lookup. The 64-byte lookup result mirrors `xmmword_106C380`--`106C3B0`.
macro_def	2	Macro definition pointer	8	Reference to a macro definition for re-expansion. Dispatched to `sub_5BA500`.
pragma	3	Pragma data	varies	Preprocessor pragma deferred for later processing
pp_number	4	Number text pointer	8	Preprocessing number not yet classified as integer or float
(reserved)	5	--	--	Not observed in use
string	6	String data + encoding byte	varies	String literal with encoding prefix information
(reserved)	7	--	--	Not observed in use
concatenated_string	8	Concatenated string data	varies	Wide or multi-piece concatenated string literal

Cache Management Globals

Address	Name	Description
`qword_1270150`	`cached_token_rescan_list`	Head of list of tokens to re-scan (pushed back for lookahead)
`qword_1270128`	`reusable_cache_stack`	Stack of reusable cache entry blocks
`qword_1270148`	`free_token_list`	Free list for recycling cache entries
`qword_1270140`	`macro_definition_chain`	Active macro definition chain
`qword_1270118`	`cache_entry_free_list`	Free list for `allocate_token_cache_entry`
`dword_126DB74`	`has_cached_tokens`	Boolean: nonzero when cache is non-empty

Cache Operations

Address	Identity	Lines	Description
`sub_669650`	`copy_tokens_from_cache`	385	Copies cached preprocessor tokens for macro re-expansion (assert at `lexical.c:3417`)
`sub_669D00`	`allocate_token_cache_entry`	119	Allocates from free list at `qword_1270118`, initializes fields
`sub_669EB0`	`create_cached_token_node`	83	Creates and initializes cache node with source position
`sub_66A000`	`append_to_token_cache`	88	Appends token to cache list, maintains tail pointer
`sub_66A140`	`push_token_to_rescan_list`	46	Pushes token onto rescan stack at `qword_1270150`
`sub_66A2C0`	`free_single_cache_entry`	18	Returns cache entry to free list
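
The allocate/free pair follows a conventional intrusive free-list pattern. The sketch below illustrates that pattern only. It is not a transcription of sub_669D00 or sub_66A2C0: the reduced entry struct, the function names, the malloc fallback, and the assumption that freed entries return to the same list that the allocator draws from are all choices made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reduced stand-in for a cache entry; the real entry is 80--112 bytes. */
typedef struct cache_entry {
    struct cache_entry *next;
    unsigned short token_code;
    unsigned char kind;
} cache_entry;

/* Plays the role of the free-list head (qword_1270118 in the binary). */
static cache_entry *entry_free_list;

/* Pops a recycled entry if one is available, otherwise allocates a fresh
 * one. The entry is zeroed before its fields are set. */
cache_entry *allocate_entry(unsigned short token_code, unsigned char kind)
{
    cache_entry *e = entry_free_list;
    if (e != NULL) {
        entry_free_list = e->next;
    } else {
        e = malloc(sizeof *e);   /* assumption: the binary uses its arena */
        if (e == NULL)
            return NULL;
    }
    memset(e, 0, sizeof *e);
    e->token_code = token_code;
    e->kind = kind;
    return e;
}

/* Pushes an entry back onto the free list for later reuse. */
void free_entry(cache_entry *e)
{
    assert(e != NULL);
    e->next = entry_free_list;
    entry_free_list = e;
}
```

Because the list is LIFO, an entry freed immediately before an allocation is the one handed back, which keeps recently touched memory hot during lookahead and backtracking.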

Keyword Registration

All keywords are registered during frontend initialization by sub_5863A0 (keyword_init / fe_translation_unit_init, 1,113 lines, in fe_init.c). The function calls sub_7463B0 (enter_keyword) for each keyword, passing the numeric token kind and the keyword string. GNU double-underscore variants (e.g., __asm and __asm__ for asm) are registered via sub_585B10 (enter_gnu_keyword), which automatically generates both __name and __name__ forms from a single root. Alternative tokens are registered via sub_749600.

Version Gating Architecture

Registration is conditional on a set of global configuration flags established during CLI processing:

Address	Name	Controls	Values
`dword_126EFB4`	`language_mode`	C vs C++ dialect	`1` = C (GNU default), `2` = C++
`dword_126EF68`	`cpp_standard_version`	Standard version level	`199711` (C++98), `201103` (C++11), `201402` (C++14), `201703` (C++17), `202002` (C++20), `202302` (C++23)
`dword_126EFAC`	`c_language_mode`	C mode flag	Boolean
`dword_126EFB0`	`microsoft_extensions`	MSVC keywords	Boolean
`dword_126EFA8`	`gnu_extensions`	GCC keywords	Boolean
`dword_126EFA4`	`clang_extensions`	Clang keywords	Boolean
`qword_126EF98`	`gnu_version`	GCC version threshold	Encoded as a decimal integer: e.g., `0x9FC3` = 40899, which reads as GCC 4.8.99 if the encoding is major * 10000 + minor * 100 + patch
`qword_126EF90`	`clang_version`	Clang version threshold	Encoded: e.g., `0x15F8F`, `0x1D4BF`
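
The version thresholds are easier to read once converted to decimal. The sketch below assumes the encoding is major * 10000 + minor * 100 + patch, which is inferred from the constants themselves (they all end in 99 or 9999 in decimal) rather than confirmed from the binary. The type and function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    unsigned major;
    unsigned minor;
    unsigned patch;
} compiler_version;

/* Splits an encoded threshold into its parts, assuming the
 * major*10000 + minor*100 + patch scheme described above. */
compiler_version decode_compiler_version(unsigned long encoded)
{
    compiler_version v;
    v.major = (unsigned)(encoded / 10000);
    v.minor = (unsigned)(encoded / 100 % 100);
    v.patch = (unsigned)(encoded % 100);
    return v;
}

/* Inverse of decode_compiler_version under the same assumption. */
unsigned long encode_compiler_version(unsigned major, unsigned minor,
                                      unsigned patch)
{
    assert(minor < 100 && patch < 100);
    return (unsigned long)major * 10000 + minor * 100 + patch;
}
```

Under this reading, `0x9FC3` is 40899 (4.8.99), `0x15F8F` is 89999 (8.99.99), and `0x1D4BF` is 119999 (11.99.99), so a comparison such as `gnu_version > 0x9FC3` would mean "newer than any 4.8.x release".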

Registration Pattern

The pseudocode below shows the version-gated registration pattern reconstructed from sub_5863A0:

void keyword_init(void) {
    // C89 keywords -- always registered
    enter_keyword(77, "auto");
    enter_keyword(78, "break");
    enter_keyword(79, "case");
    // ... all C89 keywords ...
    enter_keyword(108, "while");

    // C99 keywords -- gated on C99+ standard
    if (c_standard_version >= 199901) {
        enter_keyword(119, "restrict");
        enter_keyword(120, "_Bool");
        enter_keyword(121, "_Complex");
        enter_keyword(122, "_Imaginary");
    }

    // C11 keywords
    if (c_standard_version >= 201112) {
        enter_keyword(184, "_Static_assert");
        enter_keyword(247, "_Alignof");
        enter_keyword(248, "_Alignas");
        enter_keyword(260, "_Noreturn");
        enter_keyword(262, "_Generic");
        enter_keyword(263, "_Atomic");
        enter_keyword(194, "_Thread_local");
    }

    // C++ mode keywords
    if (language_mode == 2) {  // C++ mode
        enter_keyword(137, "bool");
        enter_keyword(138, "true");
        enter_keyword(139, "false");
        enter_keyword(140, "wchar_t");
        enter_keyword(150, "catch");
        enter_keyword(151, "class");
        // ... all C++ core keywords ...
        enter_keyword(183, "typename");

        // Alternative tokens (C++ only)
        enter_alt_token(52, "and", /*len*/3);
        enter_alt_token(53, "or", 2);
        enter_alt_token(64, "and_eq", 6);
        // ... all alternative tokens ...

        // C++11 keywords
        if (cpp_standard_version >= 201103) {
            enter_keyword(244, "constexpr");
            enter_keyword(185, "decltype");
            enter_keyword(237, "nullptr");
            enter_keyword(126, "char16_t");
            enter_keyword(127, "char32_t");
            enter_keyword(184, "static_assert");
            enter_keyword(194, "thread_local");
        }

        // C++20 keywords
        if (cpp_standard_version >= 202002) {
            enter_keyword(245, "consteval");
            enter_keyword(246, "constinit");
            enter_keyword(267, "co_yield");
            enter_keyword(268, "co_return");
            enter_keyword(269, "co_await");
            enter_keyword(294, "requires");
            enter_keyword(295, "concept");
        }
    }

    // GNU extensions -- gated on gnu_extensions flag
    if (gnu_extensions) {
        enter_gnu_keyword(187, "__extension__");
        enter_gnu_keyword(186, "__auto_type");
        enter_gnu_keyword(142, "__attribute");
        enter_keyword(117, "__builtin_offsetof");
        enter_keyword(143, "__builtin_types_compatible_p");
        enter_keyword(239, "__int128");
        // ... all GNU extensions ...
    }

    // MSVC extensions
    if (microsoft_extensions) {
        enter_keyword(132, "__declspec");
        enter_keyword(133, "__int8");
        enter_keyword(134, "__int16");
        enter_keyword(135, "__int32");
        enter_keyword(136, "__int64");
    }

    // Type traits (C++11+, ~60 traits)
    if (language_mode == 2) {
        enter_keyword(207, "__is_class");
        enter_keyword(208, "__is_enum");
        // ... all type traits through 327 ...
    }

    // CUDA type traits (GPU mode)
    if (gpu_mode) {
        enter_keyword(328, "__nv_is_extended_device_lambda_closure_type");
        enter_keyword(329, "__nv_is_extended_host_device_lambda_closure_type");
        enter_keyword(330, "__nv_is_extended_device_lambda_with_preserved_return_type");
    }

    // Extended float types (GNU)
    if (gnu_extensions) {
        enter_keyword(331, "_Float32");
        enter_keyword(332, "_Float32x");
        enter_keyword(333, "_Float64");
        enter_keyword(334, "_Float64x");
        enter_keyword(335, "_Float128");
    }

    // Post-keyword init: scope setup, builtin registration
    // ...
}

GNU Double-Underscore Registration

sub_585B10 (enter_gnu_keyword, assert at fe_init.c:698) implements the pattern where a single keyword name is registered in two or three forms:

If name starts with _: registers name as-is and __name__ (e.g., _Bool stays, plus ___Bool__ if applicable)
Otherwise: registers __name and __name__ (e.g., asm produces __asm and __asm__)

The function uses a stack buffer of 49 characters maximum (name + 5 <= 0x31), prepends __ (encoded as 0x5F5F in little-endian), copies the name, and appends __ with a null terminator. Both variants call sub_7463B0 (enter_keyword) with the same token kind.
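
The buffer handling described above can be sketched as follows. This covers only the branch for a root that does not start with an underscore, and it makes several assumptions: the callback type stands in for enter_keyword (sub_7463B0), the function returns -1 on an oversized name where the binary appears to assert instead, and the recorder at the end exists purely so the sketch can be exercised without a keyword table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { GNU_KW_BUF_SIZE = 0x31 };   /* 49 bytes, per the bound quoted above */

/* Hypothetical callback type standing in for enter_keyword (sub_7463B0). */
typedef void (*enter_keyword_fn)(int token_kind, const char *spelling);

/* Registers __name and __name__ under the same token kind.
 * Returns 0 on success, -1 if the spellings would not fit the buffer. */
int enter_gnu_keyword_sketch(int token_kind, const char *name,
                             enter_keyword_fn enter)
{
    char buf[GNU_KW_BUF_SIZE];
    size_t len;

    assert(name != NULL && enter != NULL);
    len = strlen(name);
    if (len + 5 > (size_t)GNU_KW_BUF_SIZE)   /* "__" + name + "__" + NUL */
        return -1;

    buf[0] = '_';
    buf[1] = '_';
    memcpy(buf + 2, name, len);
    buf[2 + len] = '\0';
    enter(token_kind, buf);                  /* first form: __name */

    buf[2 + len] = '_';
    buf[3 + len] = '_';
    buf[4 + len] = '\0';
    enter(token_kind, buf);                  /* second form: __name__ */
    return 0;
}

/* Minimal recorder used in place of a real keyword table. */
static char recorded_spellings[2][GNU_KW_BUF_SIZE];
static int recorded_kinds[2];
static int recorded_count;

void record_keyword(int token_kind, const char *spelling)
{
    if (recorded_count < 2) {
        strcpy(recorded_spellings[recorded_count], spelling);
        recorded_kinds[recorded_count] = token_kind;
        recorded_count++;
    }
}
```

Calling enter_gnu_keyword_sketch(kind, "asm", record_keyword) records `__asm` followed by `__asm__`, both with the same kind. The longest root that fits is 44 characters, since 44 + 5 equals the 49-byte buffer.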

Operator Name Table

The operator name display table at qword_126DE00 maps operator kinds to printable names for diagnostics and error messages. It is populated by sub_588BB0 (initialize_opname_kinds) during fe_wrapup.c initialization.

The initialization loop iterates all 357 entries of byte_E6C0E0 (operator-to-name index), mapping each non-zero entry to the corresponding string from off_E6D240 (the token name table). Two special cases are hardcoded:

Operator Kind	Display Name	Special Case
42	`()`	Function call operator (overridden from default)
43	`[]`	Array subscript operator (overridden from default)

Additionally, the array positions for new[] and delete[] are hardcoded separately, since these operator names do not correspond to single tokens.

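The routine validates that every entry from qword_126DE08 up to qword_126DF80 is non-null, and panics with "initialize_opname_kinds: bad init of opname_names" if any gap is found. The upper bound is most likely exclusive: 0x126DE00 + 48 * 8 = 0x126DF80, and qword_126DF80 is listed separately as token_extra_data, so the check covers slots 1--47 of the 48-slot table.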
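
A compact model of the populate-then-validate sequence is shown below. It is a sketch under stated assumptions: the function names are invented for this page, the tables are passed as parameters instead of being read from byte_E6C0E0 and off_E6D240, the separate new[] and delete[] assignments are omitted, and validation is assumed to cover slots 1 through 47 (slot 0 is taken to be unused).

```c
#include <assert.h>
#include <stddef.h>

enum { OPNAME_SLOTS = 48 };

/* Fills opname_names from a token-kind -> operator-index map. A zero
 * map entry means "this token kind is not an operator name". */
void init_opname_names(const unsigned char *kind_to_opname,
                       const char *const *token_names,
                       size_t n_tokens,
                       const char **opname_names)
{
    for (size_t t = 0; t < n_tokens; t++) {
        unsigned idx = kind_to_opname[t];
        if (idx != 0) {
            assert(idx < OPNAME_SLOTS);
            opname_names[idx] = token_names[t];
        }
    }
    /* Operators with no single-token spelling are set explicitly. */
    opname_names[42] = "()";
    opname_names[43] = "[]";
}

/* Returns the index of the first empty slot in 1..47, or 0 if the
 * table is fully populated (the binary panics instead of returning). */
int first_missing_opname(const char *const *opname_names)
{
    for (int i = 1; i < OPNAME_SLOTS; i++) {
        if (opname_names[i] == NULL)
            return i;
    }
    return 0;
}
```

Returning the index of the gap, where the binary only reports a generic panic, makes the sketch more useful when checking a reconstructed byte_E6C0E0 dump for missing operator entries.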

Token State Globals

When a token is produced by the lexer, the following globals are populated:

Address	Name	Type	Description
`word_126DD58`	`current_token_code`	WORD	16-bit token kind (0--356)
`qword_126DD38`	`current_source_position`	QWORD	Encoded file/line/column
`qword_126DD48`	`token_text_ptr`	QWORD	Pointer to identifier/literal text
`src`	`token_start_position`	char*	Start of token in input buffer
`n`	`token_text_length`	size_t	Length of token text
`dword_126DF90`	`token_flags_1`	DWORD	Classification flags
`dword_126DF8C`	`token_flags_2`	DWORD	Additional flags
`qword_126DF80`	`token_extra_data`	QWORD	Context-dependent payload
`xmmword_106C380`--`106C3B0`	`identifier_lookup_result`	4 x 128-bit	SSE-packed lookup result (64 bytes, 4 XMM registers)

Cross-References

Lexer & Tokenizer -- full lexer subsystem documentation, architecture, and function map
Pipeline Overview -- keyword registration during initialization
Entry Point & Initialization -- keyword_init in the startup sequence
Global Variable Index -- all global addresses referenced here
Template Engine -- template argument scanning and >> disambiguation
CUDA Lambda Overview -- NVIDIA type-trait token usage in lambda transforms
Attribute System Overview -- CUDA attribute handling at token level
EDG Source File Map -- lexical.c and fe_init.c binary layout

CUDA Error Catalog

cudafe++ reserves internal error indices 3457--3794 (338 slots) for CUDA-specific diagnostics. These are displayed to users as numbers 20000--20337 using the formula display = internal + 16543. Of the 338 slots, approximately 210 carry unique message templates; the remainder are reserved or share templates with parametric fill-ins. Every CUDA error can be controlled by its numeric code or diagnostic tag name via --diag_suppress, --diag_warning, --diag_error, or the #pragma nv_diagnostic system.

This page is a flat lookup table. For the diagnostic pipeline architecture (severity stack, pragma scoping, SARIF output), see Diagnostic Overview. For narrative discussion of each category with implementation details, see CUDA Errors.

Numbering and Display Format

User-visible:  file.cu(42): error #20042-D: calling a __device__ function from ...
                                    ^^^^^
                                    display code = internal + 16543

Direction	Formula	Example
Display to internal	`internal = display - 16543`	20042 maps to internal 3499
Internal to display	`display = internal + 16543`	3457 maps to display 20000

The -D suffix appears when severity is 7 or below (note, remark, warning, soft error). Hard errors (severity 8+) omit the suffix.
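The numbering rules above reduce to a few lines of arithmetic. The sketch below uses hypothetical helper names; only the offset 16543, the internal range 3457 to 3794, and the severity threshold of 7 come from the text.

```cpp
#include <string>

constexpr int kCudaDisplayOffset = 16543;
constexpr int kFirstCudaInternal = 3457;
constexpr int kLastCudaInternal  = 3794;

constexpr int to_display(int internal) { return internal + kCudaDisplayOffset; }
constexpr int to_internal(int display) { return display - kCudaDisplayOffset; }

// True when an internal index falls inside the CUDA-reserved range.
constexpr bool is_cuda_internal(int internal) {
    return internal >= kFirstCudaInternal && internal <= kLastCudaInternal;
}

// Formats the code as it appears in a message, e.g. "#20042-D".
// The "-D" suffix is added for severity 7 or below.
std::string format_diag_code(int internal, int severity) {
    std::string text = "#" + std::to_string(to_display(internal));
    if (severity <= 7) text += "-D";
    return text;
}

static_assert(to_display(kFirstCudaInternal) == 20000, "first CUDA code");
static_assert(to_display(kLastCudaInternal) == 20337, "last CUDA code");
```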

Severity Codes

Code	Level	Suppressible
2	note	yes
4	remark	yes
5	warning	yes
6	command-line warning	no
7	error (soft)	yes
8	error (hard, from pragma)	no
9	catastrophic error	no
10	command-line error	no
11	internal error	no
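For tooling that post-processes cudafe++ output, the table above can be captured as an enum plus a predicate. The enumerator names are hypothetical; the numeric values and the suppressible column are taken directly from the table.

```cpp
// Severity codes from the table above (enumerator names are illustrative).
enum class Severity : int {
    Note = 2,
    Remark = 4,
    Warning = 5,
    CommandLineWarning = 6,
    SoftError = 7,
    HardError = 8,
    Catastrophic = 9,
    CommandLineError = 10,
    InternalError = 11
};

// Mirrors the "Suppressible" column: only notes, remarks, warnings,
// and soft errors can be suppressed.
bool is_suppressible(Severity s) {
    switch (s) {
        case Severity::Note:
        case Severity::Remark:
        case Severity::Warning:
        case Severity::SoftError:
            return true;
        default:
            return false;
    }
}
```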

How to Use This Catalog

Suppress by numeric code:

nvcc --diag_suppress=20042

Suppress by tag name:

nvcc --diag_suppress=unsafe_device_call

In source code:

#pragma nv_diag_suppress unsafe_device_call
#pragma nv_diag_suppress 20042

Category 1: Cross-Space Calling

These checks are performed by the call-graph walker, which compares the execution-space byte at entity offset +182 of the caller against that of the callee.

Standard Cross-Space Calls (6 messages)

Tag	Sev	Message
`unsafe_device_call`	W	calling a __device__ function(%sq1) from a __host__ function(%sq2) is not allowed
`unsafe_device_call`	W	calling a __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed
`unsafe_device_call`	W	calling a __host__ function(%sq1) from a __device__ function(%sq2) is not allowed
`unsafe_device_call`	W	calling a __host__ function(%sq1) from a __global__ function(%sq2) is not allowed
`unsafe_device_call`	W	calling a __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed
`unsafe_device_call`	W	calling a __host__ function from a __host__ __device__ function is not allowed
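The caller/callee combinations listed in this table can be summarized as a small predicate. This is a sketch over a hypothetical `ExecSpace` enum, not the bit-level comparison the binary performs; the bit layout of the execution-space byte is not covered in this section, and calls whose callee is `__global__` are outside this table.

```cpp
// Hypothetical symbolic view of the execution space of a function.
enum class ExecSpace { Host, Device, HostDevice, Global };

// Returns true for the caller/callee pairs that appear in the table above.
bool is_flagged_cross_space_call(ExecSpace caller, ExecSpace callee) {
    const bool caller_may_run_on_host =
        caller == ExecSpace::Host || caller == ExecSpace::HostDevice;
    const bool caller_may_run_on_device = caller != ExecSpace::Host;
    if (callee == ExecSpace::Device) return caller_may_run_on_host;
    if (callee == ExecSpace::Host)   return caller_may_run_on_device;
    return false;  // HostDevice and Global callees are not listed here
}
```

Read this way, the rule is that a callee restricted to one side is flagged whenever the caller could execute on the other side.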

Constexpr Cross-Space Calls (6 messages)

These fire when --expt-relaxed-constexpr is not enabled.

Tag	Sev	Message
`unsafe_device_call`	W	calling a constexpr __device__ function(%sq1) from a __host__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
`unsafe_device_call`	W	calling a constexpr __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
`unsafe_device_call`	W	calling a constexpr __host__ function(%sq1) from a __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
`unsafe_device_call`	W	calling a constexpr __host__ function(%sq1) from a __global__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
`unsafe_device_call`	W	calling a constexpr __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
`unsafe_device_call`	W	calling a constexpr __host__ function from a __host__ __device__ function is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.

Category 2: Virtual Override Mismatch

The override checker (sub_432280) extracts the 0x30 mask from the execution-space byte of each entity and compares the results. __global__ is excluded because kernels cannot be virtual.

Tag	Sev	Message
--	E	execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ function
--	E	execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ __device__ function
--	E	execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __device__ function
--	E	execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __host__ __device__ function
--	E	execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __device__ function
--	E	execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __host__ function
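Since the table lists every ordered pair of distinct spaces among __host__, __device__, and __host__ __device__, the check can be modeled as an inequality on the masked byte. The sketch below assumes that the two bits under 0x30 identify exactly one of those three spaces; which bit means what is not stated in this section, and the function name is hypothetical.

```cpp
#include <cstdint>

constexpr std::uint8_t kOverrideSpaceMask = 0x30;

// True when the two execution-space bytes differ in the masked bits,
// i.e. when one of the six mismatch diagnostics above would apply.
constexpr bool override_space_mismatch(std::uint8_t overridden_byte,
                                       std::uint8_t overriding_byte) {
    return ((overridden_byte ^ overriding_byte) & kOverrideSpaceMask) != 0;
}

// Bits outside the mask do not influence the result.
static_assert(!override_space_mismatch(0x10, 0x11), "low bits ignored");
static_assert(override_space_mismatch(0x10, 0x30), "masked bits differ");
```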

Category 3: Redeclaration Mismatch

These mismatches are checked in decl_routine (sub_4CE420) and check_cuda_attribute_consistency (sub_4C6D50).

Incompatible Redeclarations (error-level)

Tag	Sev	Message
`device_function_redeclared_with_global`	E	a __device__ function(%no1) redeclared with __global__
`global_function_redeclared_with_device`	E	a __global__ function(%no1) redeclared with __device__
`global_function_redeclared_with_host`	E	a __global__ function(%no1) redeclared with __host__
`global_function_redeclared_with_host_device`	E	a __global__ function(%no1) redeclared with __host__ __device__
`global_function_redeclared_without_global`	E	a __global__ function(%no1) redeclared without __global__
`host_function_redeclared_with_global`	E	a __host__ function(%no1) redeclared with __global__
`host_device_function_redeclared_with_global`	E	a __host__ __device__ function(%no1) redeclared with __global__

Compatible Promotions (warning-level, promoted to HD)

Tag	Sev	Message
`device_function_redeclared_with_host`	W	a __device__ function(%no1) redeclared with __host__, hence treated as a __host__ __device__ function
`device_function_redeclared_with_host_device`	W	a __device__ function(%no1) redeclared with __host__ __device__, hence treated as a __host__ __device__ function
`device_function_redeclared_without_device`	W	a __device__ function(%no1) redeclared without __device__, hence treated as a __host__ __device__ function
`host_function_redeclared_with_device`	W	a __host__ function(%no1) redeclared with __device__, hence treated as a __host__ __device__ function
`host_function_redeclared_with_host_device`	W	a __host__ function(%no1) redeclared with __host__ __device__, hence treated as a __host__ __device__ function
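The two tables above amount to a simple classification: any redeclaration that adds or removes __global__ is an error, and the listed non-kernel mismatches are promoted to __host__ __device__ with a warning. The sketch below encodes only the rows shown; the enums and function name are hypothetical, and combinations that do not appear in either table are reported as `Ok` purely as a placeholder.

```cpp
// Hypothetical symbolic execution spaces; Unannotated models a
// redeclaration that carries no execution-space specifier.
enum class Space { Host, Device, HostDevice, Global, Unannotated };
enum class Redecl { Ok, PromotedToHostDevice, Error };

Redecl classify_redeclaration(Space previous, Space next) {
    if (previous == next) return Redecl::Ok;
    // Every row that involves __global__ on one side only is an error.
    if (previous == Space::Global || next == Space::Global)
        return Redecl::Error;
    // __device__ redeclared with __host__, __host__ __device__, or nothing.
    if (previous == Space::Device) return Redecl::PromotedToHostDevice;
    // __host__ redeclared with __device__ or __host__ __device__.
    if (previous == Space::Host &&
        (next == Space::Device || next == Space::HostDevice))
        return Redecl::PromotedToHostDevice;
    return Redecl::Ok;  // not listed in the tables above
}
```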

Category 4: __global__ Function Constraints

Return Type and Signature

Tag	Sev	Message
`global_function_return_type`	E	a __global__ function must have a void return type
`global_function_deduced_return_type`	E	a __global__ function must not have a deduced return type
`global_function_has_ellipsis`	E	a __global__ function cannot have ellipsis
`global_rvalue_ref_type`	E	a __global__ function cannot have a parameter with rvalue reference type
`global_ref_param_restrict`	E	a __global__ function cannot have a parameter with __restrict__ qualified reference type
`global_va_list_type`	E	A __global__ function or function template cannot have a parameter with va_list type
`global_function_with_initializer_list`	E	a __global__ function or function template cannot have a parameter with type std::initializer_list
`global_param_align_too_big`	E	cannot pass a parameter with a too large explicit alignment to a __global__ function on win32 platforms

Declaration Context

Tag	Sev	Message
`global_class_decl`	E	A __global__ function or function template cannot be a member function
`global_friend_definition`	E	A __global__ function or function template cannot be defined in a friend declaration
`global_function_in_unnamed_inline_ns`	E	A __global__ function or function template cannot be declared within an inline unnamed namespace
`global_operator_function`	E	An operator function cannot be a __global__ function
`global_new_or_delete`	E	(__global__ on operator new/delete)
--	E	function main cannot be marked __device__ or __global__

C++ Feature Restrictions

Tag	Sev	Message
`global_function_constexpr`	E	A __global__ function or function template cannot be marked constexpr
`global_function_consteval`	E	A __global__ function or function template cannot be marked consteval
`global_function_inline`	E	(__global__ with inline)
`global_exception_spec`	E	An exception specification is not allowed for a __global__ function or function template

Template Argument Restrictions

Tag	Sev	Message
`global_private_type_arg`	E	A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the class is local to a __device__ or __global__ function
`global_private_template_arg`	E	A template that is defined inside a class and has private or protected access cannot be used in the template template argument of a __global__ function template instantiation
`global_unnamed_type_arg`	E	An unnamed type (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the type is local to a __device__ or __global__ function
`global_func_local_template_arg`	E	A type defined inside a __host__ function (%t) cannot be used in the template argument type of a __global__ function template instantiation
`global_lambda_template_arg`	E	The closure type for a lambda (%t%s) cannot be used in the template argument type of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the flag '-extended-lambda' is specified and the lambda is an extended lambda
`local_type_used_in_global_function`	W	a local type %t (defined in %sq1) used in global function %sq2 template argument, the global function cannot be launched from host code.

Variable Template Restrictions (parallel set)

Tag	Sev	Message
`variable_template_private_type_arg`	E	(private/protected type in variable template instantiation)
`variable_template_private_template_arg`	E	(private template template arg in variable template)
`variable_template_unnamed_type_template_arg`	E	An unnamed type (%t) cannot be used in the template argument type of a variable template instantiation, unless the type is local to a __device__ or __global__ function
`variable_template_func_local_template_arg`	E	A type defined inside a __host__ function (%t) cannot be used in the template argument type of a variable template instantiation
`variable_template_lambda_template_arg`	E	The closure type for a lambda (%t%s) cannot be used in the template argument type of a variable template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the lambda is an 'extended lambda' and the flag --extended-lambda is specified

Variadic Template Constraints

Tag	Sev	Message
`global_function_multiple_packs`	E	Multiple pack parameters are not allowed for a variadic __global__ function template
`global_function_pack_not_last`	E	Pack template parameter must be the last template parameter for a variadic __global__ function template

Launch Configuration Attributes

Tag	Sev	Message
`bounds_attr_only_on_global_func`	E	%s is only allowed on a __global__ function
`maxnreg_attr_only_on_global_func`	E	(__maxnreg__ only on __global__)
`missing_launch_bounds`	W	no __launch_bounds__ specified for __global__ function
`cuda_specifier_twice_in_group`	E	(duplicate CUDA specifier on same declaration)
`bounds_maxnreg_incompatible_qualifiers`	E	(__launch_bounds__ and __maxnreg__ conflict)
--	E	The %s qualifiers cannot be applied to the same kernel
--	E	Multiple %s specifiers are not allowed
--	E	incorrect value for launch bounds

Category 5: Extended Lambda Restrictions

Extended lambdas (__device__ or __host__ __device__ lambdas in host code, enabled by --extended-lambda) must have closure types serializable for device transfer.

Capture Restrictions

Tag	Sev	Message
`extended_lambda_reference_capture`	E	An extended %s lambda cannot capture variables by reference
`extended_lambda_pack_capture`	E	An extended %s lambda cannot capture an element of a parameter pack
`extended_lambda_too_many_captures`	E	An extended %s lambda can only capture up to 1023 variables
`extended_lambda_array_capture_rank`	E	An extended %s lambda cannot capture an array variable (type: %t) with more than 7 dimensions
`extended_lambda_array_capture_assignable`	E	An extended %s lambda cannot capture an array variable whose element type (%t) is not assignable on the host
`extended_lambda_array_capture_default_constructible`	E	An extended %s lambda cannot capture an array variable whose element type (%t) is not default constructible on the host
`extended_lambda_init_capture_array`	E	An extended %s lambda cannot init-capture variables with array type
`extended_lambda_init_capture_initlist`	E	An extended %s lambda cannot have init-captures with type std::initializer_list
`extended_lambda_capture_in_constexpr_if`	E	An extended %s lambda cannot first-capture variable in constexpr-if context
`this_addr_capture_ext_lambda`	W	Implicit capture of 'this' in extended lambda expression
`extended_lambda_hd_init_capture`	E	init-captures are not allowed for extended __host__ __device__ lambdas
--	E	Unless enabled by language dialect, *this capture is only supported when the lambda is either __device__ only, or is defined within a __device__ or __global__ function

Type Restrictions on Captures and Parameters

Tag	Sev	Message
`extended_lambda_capture_local_type`	E	A type local to a function (%t) cannot be used in the type of a variable captured by an extended __device__ or __host__ __device__ lambda
`extended_lambda_capture_private_type`	E	A type that is a private or protected class member (%t) cannot be used in the type of a variable captured by an extended __device__ or __host__ __device__ lambda
`extended_lambda_call_operator_local_type`	E	A type local to a function (%t) cannot be used in the return or parameter types of the operator() of an extended __device__ or __host__ __device__ lambda
`extended_lambda_call_operator_private_type`	E	A type that is a private or protected class member (%t) cannot be used in the return or parameter types of the operator() of an extended __device__ or __host__ __device__ lambda
`extended_lambda_parent_local_type`	E	A type local to a function (%t) cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended __device__ or __host__ __device__ lambda
`extended_lambda_parent_private_type`	E	A type that is a private or protected class member (%t) cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended __device__ or __host__ __device__ lambda
`extended_lambda_parent_private_template_arg`	E	A template that is a private or protected class member cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended %s lambda

Enclosing Parent Function Restrictions

Tag	Sev	Message
`extended_lambda_enclosing_function_local`	E	The enclosing parent function (%sq2) for an extended %s1 lambda must not be defined inside another function
`extended_lambda_enclosing_function_not_found`	E	(no enclosing function found for extended lambda)
`extended_lambda_inaccessible_parent`	E	The enclosing parent function (%sq2) for an extended %s1 lambda cannot have private or protected access within its class
`extended_lambda_enclosing_function_deducible`	E	The enclosing parent function (%sq2) for an extended %s1 lambda must not have deduced return type
`extended_lambda_cant_take_function_address`	E	The enclosing parent function (%sq2) for an extended %s1 lambda must allow its address to be taken
`extended_lambda_parent_non_extern`	E	On Windows, the enclosing parent function (%sq2) for an extended %s1 lambda cannot have internal or no linkage
`extended_lambda_parent_class_unnamed`	E	The enclosing parent function (%sq2) for an extended %s1 lambda cannot be a member function of a class that is unnamed
`extended_lambda_parent_template_param_unnamed`	E	The enclosing parent function (%sq2) for an extended %s1 lambda cannot be in a template which has a unnamed parameter: %nd
`extended_lambda_nest_parent_template_param_unnamed`	E	The enclosing parent %n for an extended %s lambda cannot be a template which has a unnamed parameter
`extended_lambda_multiple_parameter_packs`	E	The enclosing parent template function (%sq2) for an extended %s1 lambda cannot have more than one variadic parameter, or it is not listed last in the template parameter list.
`extended_lambda_no_parent_func`	E	(extended lambda has no parent function)
`extended_lambda_illegal_parent`	E	(extended lambda in illegal parent context)

Nesting and Context Restrictions

Tag	Sev	Message
`extended_lambda_enclosing_function_generic_lambda`	E	An extended %s1 lambda cannot be defined inside a generic lambda expression(%sq2).
`extended_lambda_enclosing_function_hd_lambda`	E	An extended %s1 lambda cannot be defined inside an extended __host__ __device__ lambda expression(%sq2).
`extended_lambda_inaccessible_ancestor`	E	An extended %s1 lambda cannot be defined inside a class (%sq2) with private or protected access within another class
`extended_lambda_inside_constexpr_if`	E	For this host platform/dialect, an extended lambda cannot be defined inside the 'if' or 'else' block of a constexpr if statement
`extended_lambda_multiple_parent`	E	Cannot specify multiple __nv_parent directives in a lambda declaration
`extended_host_device_generic_lambda`	E	__host__ __device__ extended lambdas cannot be generic lambdas
--	E	If an extended %s lambda is defined within the body of one or more nested lambda expressions, each of these enclosing lambda expressions must be defined within the immediate or nested block scope of a function.

Specifier and Annotation

Tag	Sev	Message
`extended_lambda_disallowed`	E	__host__ or __device__ annotation on lambda requires --extended-lambda nvcc flag
`extended_lambda_constexpr`	E	The %s1 specifier is not allowed for an extended %s2 lambda
`lambda_operator_annotated`	E	The operator() function for a lambda cannot be explicitly annotated with execution space annotations (__host__/__device__/__global__), the annotations are derived from its closure class
`extended_lambda_discriminator`	E	(extended lambda discriminator collision)

Category 6: Device Code Restrictions

General restrictions that apply to all GPU-side code (__device__ and __global__ function bodies).

Tag	Sev	Message
`cuda_device_code_unsupported_operator`	E	The operator '%s' is not allowed in device code
`unsupported_type_in_device_code`	E	%t %s1 a %s2, which is not supported in device code
--	E	device code does not support exception handling
`no_coroutine_on_device`	E	device code does not support coroutines
--	E	operations on vector types are not supported in device code
`undefined_device_entity`	E	cannot use an entity undefined in device code
`undefined_device_identifier`	E	identifier %sq is undefined in device code
`thread_local_in_device_code`	E	cannot use thread_local specifier for variable declarations in device code
`unrecognized_pragma_device_code`	W	unrecognized #pragma in device code
--	E	zero-sized parameter type %t is not allowed in device code
--	E	zero-sized variable %sq is not allowed in device code
--	E	dynamic initialization is not supported for a function-scope static %s variable within a __device__/__global__ function
--	E	function-scope static variable within a __device__/__global__ function requires a memory space specifier
`use_of_virtual_base_on_compute_1x`	E	Use of a virtual base (%t) requires the compute_20 or higher architecture
--	E	alloca() is not supported for architectures lower than compute_52

Category 7: Kernel Launch

Tag	Sev	Message
`device_launch_no_sepcomp`	E	kernel launch from __device__ or __global__ functions requires separate compilation mode
`missing_api_for_device_side_launch`	E	device-side kernel launch could not be processed as the required runtime APIs are not declared
--	W	explicit stream argument not provided in kernel launch
--	E	kernel launches from templates are not allowed in system files
`device_side_launch_arg_with_user_provided_cctor`	E	cannot pass an argument with a user-provided copy-constructor to a device-side kernel launch
`device_side_launch_arg_with_user_provided_dtor`	E	cannot pass an argument with a user-provided destructor to a device-side kernel launch

Category 8: Memory Space and Variable Restrictions

Variable Access Across Spaces

Tag	Sev	Message
`device_var_read_in_host`	E	a %s1 %n1 cannot be directly read in a host function
`device_var_written_in_host`	E	a %s1 %n1 cannot be directly written in a host function
`device_var_address_taken_in_host`	E	address of a %s1 %n1 cannot be directly taken in a host function
`host_var_read_in_device`	E	a host %n1 cannot be directly read in a device function
`host_var_written_in_device`	E	a host %n1 cannot be directly written in a device function
`host_var_address_taken_in_device`	E	address of a host %n1 cannot be directly taken in a device function

Variable Declaration Restrictions

Tag	Sev	Message
`illegal_local_to_device_function`	E	%s1 %sq2 variable declaration is not allowed inside a device function body
`illegal_local_to_host_function`	E	%s1 %sq2 variable declaration is not allowed inside a host function body
`shared_specifier_in_range_for`	E	the __shared__ memory space specifier is not allowed for a variable declared by the for-range-declaration
`bad_shared_storage_class`	E	__shared__ variables cannot have external linkage
`device_variable_in_unnamed_inline_ns`	E	A %s variable cannot be declared within an inline unnamed namespace
--	E	member variables of an anonymous union at global or namespace scope cannot be directly accessed in __device__ and __global__ functions
`shared_inside_struct`	E	shared type inside a struct or union is not allowed
`shared_parameter`	E	(__shared__ as function parameter)

Auto-Deduced Device References

Tag	Sev	Message
`auto_device_fn_ref`	E	A non-constexpr __device__ function (%sq1) with "auto" deduced return type cannot be directly referenced %s2, except if the reference is absent when __CUDA_ARCH__ is undefined
`device_var_constexpr`	E	(constexpr rules for __device__ variables)
`device_var_structured_binding`	E	(structured bindings on __device__ variables)

Category 9: __grid_constant__

The __grid_constant__ annotation (compute_70+) marks a kernel parameter as read-only grid-wide.

Tag	Sev	Message
`grid_constant_non_kernel`	E	__grid_constant__ annotation is only allowed on a parameter of a __global__ function
`grid_constant_not_const`	E	a parameter annotated with __grid_constant__ must have const-qualified type
`grid_constant_reference_type`	E	a parameter annotated with __grid_constant__ must not have reference type
`grid_constant_unsupported_arch`	E	__grid_constant__ annotation is only allowed for architecture compute_70 or later
`grid_constant_incompat_redecl`	E	incompatible __grid_constant__ annotation for parameter %s in function redeclaration (see previous declaration %p)
`grid_constant_incompat_templ_redecl`	E	incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p)
`grid_constant_incompat_specialization`	E	incompatible __grid_constant__ annotation for parameter %s in function specialization (see previous declaration %p)
`grid_constant_incompat_instantiation_directive`	E	incompatible __grid_constant__ annotation for parameter %s in instantiation directive (see previous declaration %p)

Category 10: JIT Mode

JIT mode (device-only compilation, as performed by NVRTC) restricts host constructs.

Tag	Sev	Message
`no_host_in_jit`	E	A function explicitly marked as a __host__ function is not allowed in JIT mode
`unannotated_function_in_jit`	E	A function without execution space annotations (__host__/__device__/__global__) is considered a host function, and host functions are not allowed in JIT mode. Consider using -default-device flag to process unannotated functions as __device__ functions in JIT mode
`unannotated_variable_in_jit`	E	A namespace scope variable without memory space annotations (__device__/__constant__/__shared__/__managed__) is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process unannotated namespace scope variables as __device__ variables in JIT mode
`unannotated_static_data_member_in_jit`	E	A class static data member with non-const type is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process such data members as __device__ variables in JIT mode
`host_closure_class_in_jit`	E	The execution space for the lambda closure class members was inferred to be __host__ (based on context). This is not allowed in JIT mode. Consider using -default-device to infer __device__ execution space for namespace scope lambda closure classes.

Category 11: RDC / Whole-Program Mode

Tag	Sev	Message
--	E	An inline __device__/__constant__/__managed__ variable must have internal linkage when the program is compiled in whole program mode (-rdc=false)
`template_global_no_def`	E	when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit
`extern_kernel_template`	E	when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false")
--	W	address of internal linkage device function (%sq) was taken (nv bug 2001144). mitigation: no mitigation required if the address is not used for comparison, or if the target function is not a CUDA C++ builtin

Category 12: Atomics

CUDA atomics are lowered to PTX instructions, subject to size, type, scope, and memory-order constraints.

Architecture and Type Constraints

Tag	Sev	Message
`nv_atomic_functions_not_supported_below_sm60`	E	__nv_atomic_* functions are not supported on arch < sm_60.
`nv_atomic_operation_not_in_device_function`	E	atomic operations are not in a device function.
`nv_atomic_function_no_args`	E	atomic function requires at least one argument.
`nv_atomic_function_address_taken`	E	nv atomic function must be called directly.
`invalid_nv_atomic_operation_size`	E	atomic operations and, or, xor, add, sub, min and max are valid only on objects of size 4, or 8.
`invalid_nv_atomic_cas_size`	E	atomic CAS is valid only on objects of size 2, 4, 8 or 16 bytes.
`invalid_nv_atomic_exch_size`	E	atomic exchange is valid only on objects of size 4, 8 or 16 bytes.
`invalid_data_size_for_nv_atomic_generic_function`	E	generic nv atomic functions are valid only on objects of size 1, 2, 4, 8 and 16 bytes.
`non_integral_type_for_non_generic_nv_atomic_function`	E	non-generic nv atomic load, store, cas and exchange are valid only on integral types.
`invalid_nv_atomic_operation_add_sub_size`	E	atomic operations add and sub are not valid on signed integer of size 8.
`nv_atomic_add_sub_f64_not_supported`	W	atomic add and sub for 64-bit float is supported on architecture sm_60 or above.
`invalid_nv_atomic_operation_max_min_float`	E	atomic operations min and max are not supported on any floating-point types.
`floating_type_for_logical_atomic_operation`	E	For a logical atomic operation, the first argument cannot be any floating-point types.
`nv_atomic_cas_b16_not_supported`	E	16-bit atomic compare-and-exchange is supported on architecture sm_70 or above.
`nv_atomic_exch_cas_b128_not_supported`	E	128-bit atomic exchange or compare-and-exchange is supported on architecture sm_90 or above.
`nv_atomic_load_store_b128_version_too_low`	E	128-bit atomic load and store are supported on architecture sm_70 or above.
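The size and architecture limits quoted in these messages can be collected into two predicates. This is an illustrative sketch: the `AtomicKind` grouping and function names are hypothetical, the thresholds are taken from the message text, and the type-based rules (floating-point, signedness) and the 128-bit load/store rule are left out.

```cpp
// Hypothetical grouping of the operations named in the messages above.
enum class AtomicKind {
    ReadModifyWrite,  // and, or, xor, add, sub, min, max
    CompareAndSwap,
    Exchange,
    Generic
};

// Object sizes (in bytes) accepted for each kind of operation.
bool atomic_size_is_valid(AtomicKind kind, unsigned bytes) {
    switch (kind) {
        case AtomicKind::ReadModifyWrite:
            return bytes == 4 || bytes == 8;
        case AtomicKind::CompareAndSwap:
            return bytes == 2 || bytes == 4 || bytes == 8 || bytes == 16;
        case AtomicKind::Exchange:
            return bytes == 4 || bytes == 8 || bytes == 16;
        case AtomicKind::Generic:
            return bytes == 1 || bytes == 2 || bytes == 4 ||
                   bytes == 8 || bytes == 16;
    }
    return false;
}

// Architecture floor for a size-valid operation; sm is e.g. 70 for sm_70.
bool atomic_arch_is_sufficient(AtomicKind kind, unsigned bytes, int sm) {
    if (sm < 60) return false;  // __nv_atomic_* requires sm_60
    if (kind == AtomicKind::CompareAndSwap && bytes == 2) return sm >= 70;
    if ((kind == AtomicKind::CompareAndSwap ||
         kind == AtomicKind::Exchange) && bytes == 16) return sm >= 90;
    return true;
}
```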

Memory Order and Scope

Tag	Sev	Message
`nv_atomic_load_order_error`	E	atomic load's memory order cannot be release or acq_rel.
`nv_atomic_store_order_error`	E	atomic store's memory order cannot be consume, acquire or acq_rel.
`nv_atomic_operation_order_not_constant_int`	E	atomic operation's memory order argument is not an integer literal.
`nv_atomic_operation_scope_not_constant_int`	E	atomic operation's scope argument is not an integer literal.
`invalid_nv_atomic_memory_order_value`	E	(invalid memory order enum value)
`invalid_nv_atomic_thread_scope_value`	E	(invalid thread scope enum value)
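The load and store ordering rules match the usual C++ memory-model restrictions, so they can be illustrated with `std::memory_order`. This is only an analogy: the integer encoding that cudafe++ expects for the memory-order argument is not given in this section, and the function names are hypothetical.

```cpp
#include <atomic>

// A load may not use release or acq_rel ordering.
bool load_order_is_valid(std::memory_order order) {
    return order != std::memory_order_release &&
           order != std::memory_order_acq_rel;
}

// A store may not use consume, acquire, or acq_rel ordering.
bool store_order_is_valid(std::memory_order order) {
    return order != std::memory_order_consume &&
           order != std::memory_order_acquire &&
           order != std::memory_order_acq_rel;
}
```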

Scope Fallback Warnings

Tag	Sev	Message
`nv_atomic_operations_scope_fallback_to_membar`	W	atomic operations' scope argument is supported on architecture sm_60 or above. Fall back to use membar.
`nv_atomic_operations_memory_order_fallback_to_membar`	W	atomic operations' argument of memory order is supported on architecture sm_70 or above. Fall back to use membar.
`nv_atomic_operations_scope_cluster_change_to_device`	W	atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead.
`nv_atomic_load_store_scope_cluster_change_to_device`	W	atomic load and store's scope of cluster is supported on architecture sm_90 or above. Using device scope instead.
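The fallback behavior implied by these warnings can be sketched as a small planning function. The thresholds come from the message text; the `Scope` enum, the `ScopePlan` struct, and the way the three rules are combined are assumptions made for illustration.

```cpp
// Hypothetical subset of thread scopes; only Cluster and Device are
// named in the warnings above.
enum class Scope { Cluster, Device, Other };

struct ScopePlan {
    Scope scope;              // scope actually used after any downgrade
    bool scope_uses_membar;   // scope argument unsupported (below sm_60)
    bool order_uses_membar;   // memory-order argument unsupported (below sm_70)
};

// sm is the numeric architecture, e.g. 90 for sm_90.
ScopePlan plan_atomic_scope(Scope requested, int sm) {
    ScopePlan plan{requested, sm < 60, sm < 70};
    if (requested == Scope::Cluster && sm < 90)
        plan.scope = Scope::Device;  // cluster scope needs sm_90
    return plan;
}
```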

Category 13: ASM in Device Code

The NVPTX backend supports fewer inline-assembly constraint letters than x86, so device-side asm statements are checked against a narrower set.

Tag	Sev	Message
`asm_constraint_letter_not_allowed_in_device`	E	asm constraint letter '%s' is not allowed inside a __device__/__global__ function
`asm_constraint_must_have_single_letter`	E	an asm operand may specify only one constraint letter in a __device__/__global__ function
--	E	The 'C' constraint can only be used for asm statements in device code
`cc_clobber_in_device`	E	The cc clobber constraint is not supported in device code
`cuda_xasm_strict_placeholder_format`	E	(strict placeholder format in CUDA asm)
`addr_of_label_in_device_func`	E	address of label extension is not supported in __device__/__global__ functions

Category 14: #pragma nv_abi

This pragma controls the calling convention for device functions, adjusting parameter passing to match the PTX ABI.

Tag	Sev	Message
`nv_abi_pragma_bad_format`	E	(malformed #pragma nv_abi)
`nv_abi_pragma_invalid_option`	E	#pragma nv_abi contains an invalid option
`nv_abi_pragma_missing_arg`	E	#pragma nv_abi requires an argument
`nv_abi_pragma_duplicate_arg`	E	#pragma nv_abi contains a duplicate argument
`nv_abi_pragma_not_constant`	E	#pragma nv_abi argument must evaluate to an integral constant expression
`nv_abi_pragma_not_positive_value`	E	#pragma nv_abi argument value must be a positive value
`nv_abi_pragma_overflow_value`	E	#pragma nv_abi argument value exceeds the range of an integer
`nv_abi_pragma_device_function`	E	#pragma nv_abi must be applied to device functions
`nv_abi_pragma_device_function_context`	E	#pragma nv_abi is not supported inside a host function
`nv_abi_pragma_next_construct`	E	#pragma nv_abi must appear immediately before a function declaration, function definition, or an expression statement

Category 15: __nv_register_params__

Forces all parameters to be passed in registers (compute_80+).

Tag	Sev	Message
`register_params_not_enabled`	E	__nv_register_params__ support is not enabled
`register_params_unsupported_arch`	E	__nv_register_params__ is only supported for compute_80 or later architecture
`register_params_unsupported_function`	E	__nv_register_params__ is not allowed on a %s function
`register_params_ellipsis_function`	E	__nv_register_params__ is not allowed on a function with ellipsis

Category 16: Name Expression (NVRTC)

__CUDACC_RTC__name_expr forms the mangled name of a __global__ function or __device__/__constant__ variable at compile time.

Tag	Sev	Message
`name_expr_parsing`	E	Error in parsing name expression for lowered name lookup. Input name expression was: %sq
`name_expr_non_global_routine`	E	Name expression cannot form address of a non-__global__ function. Input name expression was: %sq
`name_expr_non_device_variable`	E	Name expression cannot form address of a variable that is not a __device__/__constant__ variable. Input name expression was: %sq
`name_expr_not_routine_or_variable`	E	Name expression must form address of a __global__ function or the address of a __device__/__constant__ variable. Input name expression was: %sq
`name_expr_extra_tokens`	E	Extra tokens found after parsing name expression for lowered name lookup. Input name expression was: %sq
`name_expr_internal_error`	E	Internal error in parsing name expression for lowered name lookup. Input name expression was: %sq

Category 17: Texture and Surface Variables

Tag	Sev	Message
`texture_surface_variable_in_unnamed_inline_ns`	E	A texture or surface variable cannot be declared within an inline unnamed namespace
--	E	A texture or surface variable cannot be used in the non-type template argument of a __device__, __host__ __device__ or __global__ function template instantiation
`reference_to_text_surf_type_in_device_func`	E	a reference to texture/surface type cannot be used in __device__/__global__ functions
`reference_to_text_surf_var_in_device_func`	E	taking reference of texture/surface variable not allowed in __device__/__global__ functions
`addr_of_text_surf_var_in_device_func`	E	cannot take address of texture/surface variable %sq in __device__/__global__ functions
`addr_of_text_surf_expr_in_device_func`	E	cannot take address of texture/surface expression in __device__/__global__ functions
`indir_into_text_surf_var_in_device_func`	E	indirection not allowed for accessing texture/surface through variable %sq in __device__/__global__ functions
`indir_into_text_surf_expr_in_device_func`	E	indirection not allowed for accessing texture/surface through expression in __device__/__global__ functions

Category 18: __managed__ Variables

Tag	Sev	Message
`managed_const_type_not_allowed`	E	a __managed__ variable cannot have a const qualified type
`managed_reference_type_not_allowed`	E	a __managed__ variable cannot have a reference type
`managed_cant_be_shared_constant`	E	__managed__ variables cannot be marked __shared__ or __constant__
`unsupported_arch_for_managed_capability`	E	__managed__ variables require architecture compute_30 or higher
`unsupported_configuration_for_managed_capability`	E	__managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system)
`decltype_of_managed_variable`	E	A __managed__ variable cannot be used as an unparenthesized id-expression argument for decltype()

Category 19: Device Function Signature Constraints

Tag	Sev	Message
`device_function_has_ellipsis`	E	__device__ or __host__ __device__ function with ellipsis requires compute_30 or higher architecture
`device_func_tex_arg`	E	(device function with texture argument restriction)
`no_host_device_initializer_list`	E	(std::initializer_list in __host__ __device__ context)
`no_host_device_move_forward`	E	(std::move/forward in __host__ __device__ context)
`no_strict_cuda_error`	W	(relaxed error checking mode)

Category 20: __wgmma_mma_async Builtins

Warp Group Matrix Multiply-Accumulate builtins (sm_90a+).

Tag	Sev	Message
`wgmma_mma_async_not_enabled`	E	__wgmma_mma_async builtins are only available for sm_90a
`wgmma_mma_async_nonconstant_arg`	E	Non-constant argument to __wgmma_mma_async call
`wgmma_mma_async_missing_args`	E	The 'A' or 'B' argument to __wgmma_mma_async call is missing
`wgmma_mma_async_bad_shape`	E	The shape %s is not supported for __wgmma_mma_async builtin
`wgmma_mma_async_bad_A_type`	E	(invalid type for operand A)
`wgmma_mma_async_bad_B_type`	E	(invalid type for operand B)

Category 21: __block_size__ / __cluster_dims__

Architecture-dependent launch configuration attributes.

Tag	Sev	Message
`block_size_unsupported`	E	__block_size__ is not supported for this GPU architecture
`block_size_must_be_positive`	E	(block size values must be positive)
`cluster_dims_unsupported`	E	__cluster_dims__ is not supported for this GPU architecture
`cluster_dims_must_be_positive`	E	(__cluster_dims__ values must be positive)
`cluster_dims_too_large`	E	cluster dimension value is too large
`conflict_between_cluster_dim_and_block_size`	E	cannot specify the second tuple in __block_size__ while __cluster_dims__ is present
`max_blocks_per_cluster_unsupported`	E	cannot specify max blocks per cluster for this GPU architecture
`max_blocks_per_cluster_negative`	E	max blocks per cluster must not be negative
`max_blocks_per_cluster_too_large`	E	max blocks per cluster is too large
`too_many_blocks_in_cluster`	E	total number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit for max blocks in cluster
`shared_block_size_must_be_positive`	E	the block size of a shared array must be greater than zero
`shared_block_size_too_large`	E	(shared block size exceeds maximum)
`mismatched_shared_block_size`	E	shared block size does not match one previously specified
`ambiguous_block_size_spec`	E	(ambiguous block size specification)
`multiple_block_sizes`	E	multiple block sizes not allowed
`threads_dimension_requires_definite_block_size`	E	a dynamic THREADS dimension requires a definite block size
`shared_nonthreads_dim`	E	(shared array dimension is not THREADS-based)
`shared_affinity_type`	E	(shared affinity type mismatch)

Category 22: Inline Hint Conflicts

Tag	Sev	Message
`inline_hint_forceinline_conflict`	E	"__inline_hint__" and "__forceinline__" may not be used on the same declaration
`inline_hint_noinline_conflict`	E	"__inline_hint__" and "__noinline__" may not be used on the same declaration

Category 23: __local_maxnreg__

Tag	Sev	Message
`local_maxnreg`	E	(__local_maxnreg__ attribute applied)
`local_maxnreg_attr_only_nonmember_func`	E	(__local_maxnreg__ only on non-member functions)
`local_maxnreg_attribute_conflict`	E	(__local_maxnreg__ conflicts with existing attribute)
`local_maxnreg_negative`	E	(__local_maxnreg__ value is negative)
`local_maxnreg_too_large`	E	(__local_maxnreg__ value exceeds maximum)
`maxnreg_attr_only_nonmember_func`	E	(__maxnreg__ only on non-member functions)
`bounds_attr_only_nonmember_func`	E	(launch bounds only on non-member functions)

Category 24: Miscellaneous CUDA Errors

Tag	Sev	Message
`cuda_displaced_new_or_delete_operator`	E	(displaced new/delete in CUDA context)
`cuda_demote_unsupported_floating_point`	W	(unsupported floating-point type demoted)
`illegal_ucn_in_device_identifer`	E	Universal character is not allowed in device entity name (%sq)
`thread_local_for_device_vars`	E	cannot use thread_local specifier for a %s variable
`global_qualifier_not_allowed`	E	(execution space qualifier not allowed here)
`unsupported_nv_attribute`	W	(unrecognized NVIDIA attribute)
`addr_of_nv_builtin_var`	E	(address-of applied to NVIDIA builtin variable)
`shared_address_immutable`	E	(__shared__ variable address is immutable)
`nonshared_blocksizeof`	E	(BLOCKSIZEOF applied to non-__shared__ variable)
`nonshared_strict_relaxed`	E	(strict/relaxed qualifier on non-__shared__ variable)
`extern_shared`	W	(extern __shared__ variable)
`invalid_nvvm_builtin_intrinsic`	E	(invalid NVVM builtin intrinsic)
`unannotated_static_not_allowed_in_device`	E	(unannotated static not allowed in device code)
`missing_pushcallconfig`	E	(cudaConfigureCall not found for kernel launch lowering)

Complete Diagnostic Tag Index

All 286 CUDA-specific diagnostic tag names extracted from the cudafe++ binary, organized alphabetically within functional groups. Every tag except the six pragma action names in the final group can be used with --diag_suppress, --diag_warning, --diag_error, or #pragma nv_diag_suppress / nv_diag_warning / nv_diag_error.

Cross-Space / Execution Space (1 tag)

#	Tag Name
1	`unsafe_device_call`

Redeclaration (12 tags)

#	Tag Name
2	`device_function_redeclared_with_global`
3	`device_function_redeclared_with_host`
4	`device_function_redeclared_with_host_device`
5	`device_function_redeclared_without_device`
6	`global_function_redeclared_with_device`
7	`global_function_redeclared_with_host`
8	`global_function_redeclared_with_host_device`
9	`global_function_redeclared_without_global`
10	`host_device_function_redeclared_with_global`
11	`host_function_redeclared_with_device`
12	`host_function_redeclared_with_global`
13	`host_function_redeclared_with_host_device`

__global__ Constraints (32 tags)

#	Tag Name
14	`bounds_attr_only_on_global_func`
15	`bounds_maxnreg_incompatible_qualifiers`
16	`cuda_specifier_twice_in_group`
17	`global_class_decl`
18	`global_exception_spec`
19	`global_friend_definition`
20	`global_func_local_template_arg`
21	`global_function_consteval`
22	`global_function_constexpr`
23	`global_function_deduced_return_type`
24	`global_function_has_ellipsis`
25	`global_function_in_unnamed_inline_ns`
26	`global_function_inline`
27	`global_function_multiple_packs`
28	`global_function_pack_not_last`
29	`global_function_return_type`
30	`global_function_with_initializer_list`
31	`global_lambda_template_arg`
32	`global_new_or_delete`
33	`global_operator_function`
34	`global_param_align_too_big`
35	`global_private_template_arg`
36	`global_private_type_arg`
37	`global_qualifier_not_allowed`
38	`global_ref_param_restrict`
39	`global_rvalue_ref_type`
40	`global_unnamed_type_arg`
41	`global_va_list_type`
42	`local_type_used_in_global_function`
43	`maxnreg_attr_only_on_global_func`
44	`missing_launch_bounds`
45	`template_global_no_def`

Extended Lambda (39 tags)

#	Tag Name
46	`extended_host_device_generic_lambda`
47	`extended_lambda_array_capture_assignable`
48	`extended_lambda_array_capture_default_constructible`
49	`extended_lambda_array_capture_rank`
50	`extended_lambda_call_operator_local_type`
51	`extended_lambda_call_operator_private_type`
52	`extended_lambda_cant_take_function_address`
53	`extended_lambda_capture_in_constexpr_if`
54	`extended_lambda_capture_local_type`
55	`extended_lambda_capture_private_type`
56	`extended_lambda_constexpr`
57	`extended_lambda_disallowed`
58	`extended_lambda_discriminator`
59	`extended_lambda_enclosing_function_deducible`
60	`extended_lambda_enclosing_function_generic_lambda`
61	`extended_lambda_enclosing_function_hd_lambda`
62	`extended_lambda_enclosing_function_local`
63	`extended_lambda_enclosing_function_not_found`
64	`extended_lambda_hd_init_capture`
65	`extended_lambda_illegal_parent`
66	`extended_lambda_inaccessible_ancestor`
67	`extended_lambda_inaccessible_parent`
68	`extended_lambda_init_capture_array`
69	`extended_lambda_init_capture_initlist`
70	`extended_lambda_inside_constexpr_if`
71	`extended_lambda_multiple_parameter_packs`
72	`extended_lambda_multiple_parent`
73	`extended_lambda_nest_parent_template_param_unnamed`
74	`extended_lambda_no_parent_func`
75	`extended_lambda_pack_capture`
76	`extended_lambda_parent_class_unnamed`
77	`extended_lambda_parent_local_type`
78	`extended_lambda_parent_non_extern`
79	`extended_lambda_parent_private_template_arg`
80	`extended_lambda_parent_private_type`
81	`extended_lambda_parent_template_param_unnamed`
82	`extended_lambda_reference_capture`
83	`extended_lambda_too_many_captures`
84	`this_addr_capture_ext_lambda`

Device Code (16 tags)

#	Tag Name
85	`addr_of_label_in_device_func`
86	`asm_constraint_letter_not_allowed_in_device`
87	`asm_constraint_must_have_single_letter`
88	`auto_device_fn_ref`
89	`cc_clobber_in_device`
90	`cuda_device_code_unsupported_operator`
91	`cuda_xasm_strict_placeholder_format`
92	`illegal_ucn_in_device_identifer`
93	`no_coroutine_on_device`
94	`no_strict_cuda_error`
95	`thread_local_in_device_code`
96	`undefined_device_entity`
97	`undefined_device_identifier`
98	`unrecognized_pragma_device_code`
99	`unsupported_type_in_device_code`
100	`use_of_virtual_base_on_compute_1x`

Device Function (4 tags)

#	Tag Name
101	`device_func_tex_arg`
102	`device_function_has_ellipsis`
103	`no_host_device_initializer_list`
104	`no_host_device_move_forward`

Kernel Launch (4 tags)

#	Tag Name
105	`device_launch_no_sepcomp`
106	`device_side_launch_arg_with_user_provided_cctor`
107	`device_side_launch_arg_with_user_provided_dtor`
108	`missing_api_for_device_side_launch`

Variable Access (11 tags)

#	Tag Name
109	`device_var_address_taken_in_host`
110	`device_var_constexpr`
111	`device_var_read_in_host`
112	`device_var_structured_binding`
113	`device_var_written_in_host`
114	`device_variable_in_unnamed_inline_ns`
115	`host_var_address_taken_in_device`
116	`host_var_read_in_device`
117	`host_var_written_in_device`
118	`illegal_local_to_device_function`
119	`illegal_local_to_host_function`

Variable Template (5 tags)

#	Tag Name
120	`variable_template_func_local_template_arg`
121	`variable_template_lambda_template_arg`
122	`variable_template_private_template_arg`
123	`variable_template_private_type_arg`
124	`variable_template_unnamed_type_template_arg`

__managed__ (6 tags)

#	Tag Name
125	`decltype_of_managed_variable`
126	`managed_cant_be_shared_constant`
127	`managed_const_type_not_allowed`
128	`managed_reference_type_not_allowed`
129	`unsupported_arch_for_managed_capability`
130	`unsupported_configuration_for_managed_capability`

__grid_constant__ (8 tags)

#	Tag Name
131	`grid_constant_incompat_instantiation_directive`
132	`grid_constant_incompat_redecl`
133	`grid_constant_incompat_specialization`
134	`grid_constant_incompat_templ_redecl`
135	`grid_constant_non_kernel`
136	`grid_constant_not_const`
137	`grid_constant_reference_type`
138	`grid_constant_unsupported_arch`

Atomics (26 tags)

#	Tag Name
139	`floating_type_for_logical_atomic_operation`
140	`invalid_data_size_for_nv_atomic_generic_function`
141	`invalid_nv_atomic_cas_size`
142	`invalid_nv_atomic_exch_size`
143	`invalid_nv_atomic_memory_order_value`
144	`invalid_nv_atomic_operation_add_sub_size`
145	`invalid_nv_atomic_operation_max_min_float`
146	`invalid_nv_atomic_operation_size`
147	`invalid_nv_atomic_thread_scope_value`
148	`non_integral_type_for_non_generic_nv_atomic_function`
149	`nv_atomic_add_sub_f64_not_supported`
150	`nv_atomic_cas_b16_not_supported`
151	`nv_atomic_exch_cas_b128_not_supported`
152	`nv_atomic_function_address_taken`
153	`nv_atomic_function_no_args`
154	`nv_atomic_functions_not_supported_below_sm60`
155	`nv_atomic_load_order_error`
156	`nv_atomic_load_store_b128_version_too_low`
157	`nv_atomic_load_store_scope_cluster_change_to_device`
158	`nv_atomic_operation_not_in_device_function`
159	`nv_atomic_operation_order_not_constant_int`
160	`nv_atomic_operation_scope_not_constant_int`
161	`nv_atomic_operations_memory_order_fallback_to_membar`
162	`nv_atomic_operations_scope_cluster_change_to_device`
163	`nv_atomic_operations_scope_fallback_to_membar`
164	`nv_atomic_store_order_error`

JIT Mode (5 tags)

#	Tag Name
165	`host_closure_class_in_jit`
166	`no_host_in_jit`
167	`unannotated_function_in_jit`
168	`unannotated_static_data_member_in_jit`
169	`unannotated_variable_in_jit`

RDC / Whole-Program (2 tags)

#	Tag Name
170	`extern_kernel_template`
171	`template_global_no_def`

#pragma nv_abi (10 tags)

#	Tag Name
172	`nv_abi_pragma_bad_format`
173	`nv_abi_pragma_device_function`
174	`nv_abi_pragma_device_function_context`
175	`nv_abi_pragma_duplicate_arg`
176	`nv_abi_pragma_invalid_option`
177	`nv_abi_pragma_missing_arg`
178	`nv_abi_pragma_next_construct`
179	`nv_abi_pragma_not_constant`
180	`nv_abi_pragma_not_positive_value`
181	`nv_abi_pragma_overflow_value`

__nv_register_params__ (4 tags)

#	Tag Name
182	`register_params_ellipsis_function`
183	`register_params_not_enabled`
184	`register_params_unsupported_arch`
185	`register_params_unsupported_function`

Name Expression (6 tags)

#	Tag Name
186	`name_expr_extra_tokens`
187	`name_expr_internal_error`
188	`name_expr_non_device_variable`
189	`name_expr_non_global_routine`
190	`name_expr_not_routine_or_variable`
191	`name_expr_parsing`

Texture / Surface (7 tags)

#	Tag Name
192	`addr_of_text_surf_expr_in_device_func`
193	`addr_of_text_surf_var_in_device_func`
194	`indir_into_text_surf_expr_in_device_func`
195	`indir_into_text_surf_var_in_device_func`
196	`reference_to_text_surf_type_in_device_func`
197	`reference_to_text_surf_var_in_device_func`
198	`texture_surface_variable_in_unnamed_inline_ns`

__wgmma_mma_async (6 tags)

#	Tag Name
199	`wgmma_mma_async_bad_A_type`
200	`wgmma_mma_async_bad_B_type`
201	`wgmma_mma_async_bad_shape`
202	`wgmma_mma_async_missing_args`
203	`wgmma_mma_async_nonconstant_arg`
204	`wgmma_mma_async_not_enabled`

__block_size__ / __cluster_dims__ (18 tags)

#	Tag Name
205	`ambiguous_block_size_spec`
206	`block_size_must_be_positive`
207	`block_size_unsupported`
208	`cluster_dims_must_be_positive`
209	`cluster_dims_too_large`
210	`cluster_dims_unsupported`
211	`conflict_between_cluster_dim_and_block_size`
212	`max_blocks_per_cluster_negative`
213	`max_blocks_per_cluster_too_large`
214	`max_blocks_per_cluster_unsupported`
215	`mismatched_shared_block_size`
216	`multiple_block_sizes`
217	`shared_affinity_type`
218	`shared_block_size_must_be_positive`
219	`shared_block_size_too_large`
220	`shared_nonthreads_dim`
221	`threads_dimension_requires_definite_block_size`
222	`too_many_blocks_in_cluster`

Inline Hint (2 tags)

#	Tag Name
223	`inline_hint_forceinline_conflict`
224	`inline_hint_noinline_conflict`

__local_maxnreg__ (7 tags)

#	Tag Name
225	`bounds_attr_only_nonmember_func`
226	`local_maxnreg`
227	`local_maxnreg_attr_only_nonmember_func`
228	`local_maxnreg_attribute_conflict`
229	`local_maxnreg_negative`
230	`local_maxnreg_too_large`
231	`maxnreg_attr_only_nonmember_func`

Lambda Annotation (1 tag)

#	Tag Name
232	`lambda_operator_annotated`

Miscellaneous (16 tags)

#	Tag Name
233	`addr_of_nv_builtin_var`
234	`bad_shared_storage_class`
235	`cuda_demote_unsupported_floating_point`
236	`cuda_displaced_new_or_delete_operator`
237	`extern_shared`
238	`invalid_nvvm_builtin_intrinsic`
239	`missing_pushcallconfig`
240	`nonshared_blocksizeof`
241	`nonshared_strict_relaxed`
242	`shared_address_immutable`
243	`shared_inside_struct`
244	`shared_parameter`
245	`shared_specifier_in_range_for`
246	`thread_local_for_device_vars`
247	`unannotated_static_not_allowed_in_device`
248	`unsupported_nv_attribute`

Diagnostic Pragma Actions (6 tags -- not suppressible, but listed for completeness)

#	Tag Name
249	`nv_diag_default`
250	`nv_diag_error`
251	`nv_diag_once`
252	`nv_diag_remark`
253	`nv_diag_suppress`
254	`nv_diag_warning`

Cross-Reference: EDG Error Codes Used for CUDA

The following standard EDG error codes (0--3456) are repurposed or frequently triggered by CUDA-specific validation. These display with their original number (not the 20000-D series).

Internal #	Display #	Context
21	21	CUDA auto type with template deduction
147	147	redeclaration mismatch
149	149	illegal CUDA storage class at namespace scope
246	246	static member of non-class type
298	298	typedef/using with template name
325	325	thread_local in CUDA
337	337	calling convention mismatch
453	453	in template instantiation context
551	551	not a member function
795	795	definition in class scope with external linkage (CUDA)
799	799	definition in class scope (C++20 CUDA)
891	891	anonymous type in variable declaration
892	892	auto with __constant__ variable
893	893	auto with CUDA variable
948	948	calling convention mismatch on redeclaration
992	992	fatal error (suppress-all sentinel)
1034	1034	explicit instantiation with conflicting attributes
1063	1063	in include file context
1118	1118	CUDA attribute on namespace-scope variable
1150	1150	context lines truncated
1158	1158	auto return type with __global__
1306	1306	CUDA memory space mismatch on redeclaration
1418	1418	incomplete type in definition
1430	1430	function attribute mismatch in template
1560	1560	CUDA constexpr class with non-trivial destructor
1580	1580	redeclaration with different template parameters
1655	1655	tentative definition of constexpr
2384	2384	constexpr mismatch on redeclaration (CUDA)
2442	2442	extern variable at block scope with CUDA attribute
2443	2443	extern variable at block scope with CUDA attribute (variant)
2502	2502	no_unique_address mismatch
2503	2503	no_unique_address mismatch (variant)
2656	2656	internal error (assertion failure)
2885	2885	CUDA attribute on deduction guide
2937	2937	structured binding with CUDA attribute
3033	3033	incompatible constexpr CUDA target
3116	3116	restrict qualifier on definition
3414	3414	auto with volatile/atomic qualifier
3510	3510	__shared__ variable with VLA
3566	3566	__constant__ with constexpr with auto
3567	3567	CUDA variable with VLA type
3568	3568	__constant__ with constexpr
3578	3578	CUDA attribute in discarded constexpr-if branch
3579	3579	CUDA attribute at namespace scope with structured binding
3580	3580	CUDA attribute on variable-length array
3648	3648	CUDA __constant__ with external linkage
3698	3698	parameter type mismatch
3709	3709	warnings treated as errors

Format Specifiers in CUDA Messages

CUDA error messages use the same fill-in system as EDG base errors, expanded by process_fill_in (sub_4EDCD0).

Specifier	Kind	Meaning	Example
`%sq`	3	Quoted entity name	Function name in cross-space call
`%sq1`, `%sq2`	3	Indexed quoted names	Caller and callee
`%no1`	4	Entity name (omit kind prefix)	Function in redeclaration
`%n1`, `%n2`	4	Entity names	Override base/derived pair
`%nd`	4	Entity with declaration location	Template parameter
`%s`, `%s1`, `%s2`	3	String fill-in	Execution space keyword
`%t`	6	Type fill-in	Type in template arg errors
`%p`	2	Source position	Previous declaration location
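
To make the fill-in mechanism concrete, here is a minimal sketch of how the string-kind specifiers (`%s`, `%sq`, and their indexed forms) could be expanded. It is not the decompiled process_fill_in: the function name `expand_string_fill_ins`, the `"<?>"` placeholder for a missing argument, the double-quote style for `%sq`, and the rule that an unindexed specifier refers to the first argument are all assumptions made for illustration. Other specifier kinds (`%n`, `%t`, `%p`) are passed through unchanged.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not the binary's process_fill_in): expands only
// %s / %sq with an optional 1-based index digit, e.g. %s1 or %sq2.
inline std::string expand_string_fill_ins(const std::string& tmpl,
                                          const std::vector<std::string>& args) {
    std::string out;
    for (std::size_t i = 0; i < tmpl.size(); ++i) {
        // Copy everything that is not the start of a "%s..." specifier.
        if (tmpl[i] != '%' || i + 1 >= tmpl.size() || tmpl[i + 1] != 's') {
            out += tmpl[i];
            continue;
        }
        std::size_t j = i + 2;  // position just past "%s"
        const bool quoted = j < tmpl.size() && tmpl[j] == 'q';
        if (quoted) ++j;
        // Assumption: an unindexed specifier refers to the first argument.
        std::size_t index = 0;
        if (j < tmpl.size() && tmpl[j] >= '1' && tmpl[j] <= '9') {
            index = static_cast<std::size_t>(tmpl[j] - '1');
            ++j;
        }
        const std::string value = index < args.size() ? args[index] : "<?>";
        out += quoted ? "\"" + value + "\"" : value;
        i = j - 1;  // the loop increment moves past the specifier
    }
    return out;
}
```

Called with the `addr_of_text_surf_var_in_device_func` message and a single argument `tex`, the sketch substitutes the quoted name in place of `%sq` and leaves the rest of the message untouched.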

Architecture Requirements Summary

Quick reference for minimum architecture required by various CUDA features.

Feature	Minimum Architecture
Virtual bases	compute_20
`__device__`/`__host__ __device__` with ellipsis	compute_30
`__managed__` variables	compute_30
`alloca()`	compute_52
`__nv_atomic_*` functions	sm_60
Atomic scope argument	sm_60
Atomic add/sub for f64	sm_60
`__grid_constant__`	compute_70
Atomic memory order argument	sm_70
16-bit atomic CAS	sm_70
128-bit atomic load/store	sm_70
`__nv_register_params__`	compute_80
Cluster scope for atomics	sm_90
128-bit atomic exchange/CAS	sm_90
`__wgmma_mma_async` builtins	sm_90a
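
The table above can be read as a lookup from feature to minimum architecture. The sketch below encodes it that way. The feature keys, the `ArchRequirement` struct, and `feature_available` are hypothetical names invented for this example; only the numeric thresholds come from the table. Treating the `sm_90a` entry as an exact, architecture-specific match follows the wording of the `wgmma_mma_async_not_enabled` message and is an assumption, not something confirmed from the binary.

```cpp
#include <cstring>

// Hypothetical encoding of the table: min_arch is the numeric part of
// compute_NN / sm_NN; exact_arch_specific marks "a"-suffixed targets.
struct ArchRequirement {
    const char* feature;
    int min_arch;
    bool exact_arch_specific;
};

static const ArchRequirement kArchRequirements[] = {
    {"virtual_bases", 20, false},
    {"device_ellipsis", 30, false},
    {"managed_variables", 30, false},
    {"alloca", 52, false},
    {"nv_atomic_functions", 60, false},
    {"atomic_scope_argument", 60, false},
    {"atomic_add_sub_f64", 60, false},
    {"grid_constant", 70, false},
    {"atomic_memory_order_argument", 70, false},
    {"atomic_cas_b16", 70, false},
    {"atomic_load_store_b128", 70, false},
    {"nv_register_params", 80, false},
    {"atomic_cluster_scope", 90, false},
    {"atomic_exch_cas_b128", 90, false},
    {"wgmma_mma_async", 90, true},
};

// Returns true when the feature is known and the target satisfies it.
// target_is_arch_specific is true for targets such as sm_90a.
inline bool feature_available(const char* feature, int target_arch,
                              bool target_is_arch_specific = false) {
    for (const ArchRequirement& r : kArchRequirements) {
        if (std::strcmp(r.feature, feature) != 0) continue;
        if (r.exact_arch_specific) {
            // Assumption: arch-specific features need an exact match.
            return target_is_arch_specific && target_arch == r.min_arch;
        }
        return target_arch >= r.min_arch;
    }
    return false;  // unknown feature key
}
```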

Virtual Override Execution Space Matrix

When a derived class overrides a base class virtual function in CUDA, the execution spaces of both functions must be compatible. A __device__ virtual cannot be overridden by a __host__ function, a __host__ virtual cannot be overridden by a __device__ function, and so on. cudafe++ enforces these rules inside record_virtual_function_override (sub_432280, 437 lines, class_decl.c), which runs each time the EDG front-end registers a virtual override during class body scanning. The function performs three tasks: (1) propagate the base class's execution space obligations onto the derived function, (2) detect illegal mismatches and emit one of six dedicated error messages (3542--3547), and (3) fall through to standard EDG override recording (covariant returns, [[nodiscard]], override/final, requires-clause checks).

This page documents the override checking logic at reimplementation-grade depth: reconstructed pseudocode from the decompiled binary, a complete compatibility matrix, the six error messages with their diagnostic tags, and the relaxed-mode flag that softens certain checks.

Key Facts

Property	Value
Binary function	`sub_432280` (`record_virtual_function_override`, 437 lines)
Source file	`class_decl.c`
Parameters	`a1`=derivation_info, `a2`=overriding_sym, `a3`=overridden_sym, `a4`=base_class_info, `a5`=covariant_return_adjustment
Entity field read	`byte +182` (execution space bitfield) on both overridden and overriding entities
Classification mask	`byte & 0x30` -- two-bit extraction: `0x00`=implicit host, `0x10`=explicit host, `0x20`=device, `0x30`=HD
Propagation bits	`0x10` (host_explicit), `0x20` (device_annotation)
Attribute lookup	`sub_5CEE70` with kind 87 (`__device__`) and 86 (`__host__`)
Error emission	`sub_4F4F10` with severity 8 (hard error)
Relaxed mode flag	`dword_106BFF0` (`relaxed_attribute_mode`)
Implicitly-HD test	`byte +177 & 0x10` on entity -- constexpr / `__forceinline__` bypass
Override-involved mark	`byte +176 |= 0x02` on overriding entity
Assertion guard	`nv_is_device_only_routine` from `nv_transforms.h:367`

Why Virtual Functions Need Execution Space Checks

Standard C++ imposes no concept of execution space on virtual functions. CUDA introduces three execution spaces (__host__, __device__, __host__ __device__) and one launch-only space (__global__). When a virtual function in a base class is declared with one execution space, every override in every derived class must be callable in the same space. If the base declares a __device__ virtual, calling it through a base pointer on the GPU must dispatch to the derived override -- which is only possible if the override is also __device__ (or __host__ __device__).

__global__ functions cannot be virtual at all (errors 3505/3506 prevent this at the attribute application stage), so the override matrix only covers three spaces: __host__, __device__, and __host__ __device__. An unannotated function counts as implicit __host__.

Function Entry: Mark and Resolve Entities

The function begins by resolving the actual entity nodes from the symbol table entries:

// sub_432280 entry (lines 60-69 of decompiled output)
//
// a2 = overriding_sym (symbol table entry for the derived-class function)
// a3 = overridden_sym (symbol table entry for the base-class function)
//
// v10 = entity of overridden function:  *(overridden_sym + 88)
// v11 = entity of overriding function:  *(*(overriding_sym) + 88)
//
// The entity node at offset +88 is the "associated routine entity" --
// the actual function representation containing execution space bits.

int64_t overridden_entity = *(int64_t*)(overridden_sym + 88);   // v10
int64_t overriding_entity = *(int64_t*)(*(int64_t*)overriding_sym + 88);  // v11

// Mark the overriding entity as "involved in an override"
*(uint8_t*)(overriding_entity + 176) |= 0x02;

The +176 |= 0x02 flag marks the derived function as "override-involved." This flag is consumed downstream by the exception specification resolver and other class completion logic.

Phase 1: Implicitly-HD Fast Path and Execution Space Propagation

The first branch tests byte +177 & 0x10 on the overriding entity. This bit indicates the function is implicitly __host__ __device__ -- set for constexpr functions (implicitly HD since CUDA 7.5) and __forceinline__ functions. When this bit is set, the override is exempt from mismatch checking, but execution space propagation still occurs.

// Phase 1: implicitly-HD check and propagation (lines 70-94)
void check_and_propagate(int64_t overriding_entity, int64_t overridden_entity) {

    if (overriding_entity->byte_177 & 0x10) {
        // Overriding function is implicitly HD (constexpr / __forceinline__)
        //
        // Skip mismatch errors entirely -- an implicitly-HD function is
        // compatible with any base execution space.  But we must still
        // propagate the base's space obligations onto the derived entity
        // so that downstream passes (IL marking, code generation) know
        // what to emit.

        if (!(overridden_entity->byte_177 & 0x10)) {
            // Overridden function is NOT implicitly HD -- it has an explicit
            // execution space.  We need to propagate that space.
            //
            // Guard: skip propagation for constexpr lambdas with internal
            // linkage but no override flag (a degenerate case).
            if ((overridden_entity->qword_184 & 0x800001000000) == 0x800000000000
                && !(overridden_entity->byte_176 & 0x02)) {
                // Degenerate case -- skip propagation
                goto done_nvidia_checks;
            }

            uint8_t base_es = overridden_entity->byte_182;

            // Propagate __host__ obligation:
            // If the base is NOT device-only (i.e., base is host, HD, or
            // unannotated), the derived function inherits the host obligation.
            if ((base_es & 0x30) != 0x20) {
                overriding_entity->byte_182 |= 0x10;   // set host_explicit
            }

            // Propagate __device__ obligation:
            // If the base has the device_annotation bit set, the derived
            // function inherits the device obligation.
            if (base_es & 0x20) {
                overriding_entity->byte_182 |= 0x20;   // set device_annotation
            }
        }

        goto done_nvidia_checks;
    }

    // ... Phase 2 continues below
}

Why Propagation Matters

Propagation ensures that a derived class inherits its base class's execution space obligations even when the derived function is implicitly HD. Consider:

struct Base {
    __device__ virtual void f();        // byte_182 & 0x30 == 0x20
};

struct Derived : Base {
    constexpr void f() override;        // byte_177 & 0x10 set (implicitly HD)
};

Without propagation, Derived::f would have byte_182 == 0x00 (no explicit annotation). The device-side IL pass would skip it, and a virtual call base_ptr->f() on the GPU would dispatch to a function never compiled for the device. Propagation sets byte_182 |= 0x20 (device_annotation), ensuring the function is included in device IL.

The propagation follows strict rules:

Base `byte_182 & 0x30`	Propagated to overriding entity
`0x00` (implicit host)	`|= 0x10` (host_explicit)
`0x10` (explicit host)	`|= 0x10` (host_explicit)
`0x20` (device)	`|= 0x20` (device_annotation)
`0x30` (HD)	`|= 0x10` then `|= 0x20` (both)
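
The propagation rules in this table can be exercised with a small stand-alone model. This is a sketch under stated assumptions, not the decompiled routine: `RoutineEntity` is a hypothetical struct that carries only the two bytes Phase 1 touches, and the degenerate-case guard on the qword at +184 is deliberately left out.

```cpp
#include <cstdint>

// Hypothetical stand-in for the entity node: only the two bytes used here.
struct RoutineEntity {
    std::uint8_t byte_177 = 0;  // bit 0x10: implicitly __host__ __device__
    std::uint8_t byte_182 = 0;  // bits 0x30: execution space classification
};

constexpr std::uint8_t kImplicitHD       = 0x10;  // flag in byte_177
constexpr std::uint8_t kHostExplicit     = 0x10;  // flag in byte_182
constexpr std::uint8_t kDeviceAnnotation = 0x20;  // flag in byte_182

// Returns true when the implicitly-HD fast path applies (no mismatch
// checking needed); returns false when Phase 2 would have to run.
inline bool propagate_exec_space(RoutineEntity& overriding,
                                 const RoutineEntity& overridden) {
    if (!(overriding.byte_177 & kImplicitHD)) {
        return false;  // explicit annotations: handled by Phase 2
    }
    if (overridden.byte_177 & kImplicitHD) {
        return true;   // base is implicitly HD too: nothing to propagate
    }
    const std::uint8_t base_es = overridden.byte_182;
    if ((base_es & 0x30) != 0x20) {
        overriding.byte_182 |= kHostExplicit;      // base is not device-only
    }
    if (base_es & kDeviceAnnotation) {
        overriding.byte_182 |= kDeviceAnnotation;  // base is device or HD
    }
    return true;
}
```

With a device-only base (`byte_182 == 0x20`) and an implicitly-HD override, the override ends up with `byte_182 == 0x20`, matching the third row of the table.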

Phase 2: Explicit Annotation Mismatch Detection

When the overriding function is NOT implicitly HD ((byte_177 & 0x10) == 0), the checker must verify that the derived function's explicit execution space matches the base. It does this by querying the attribute lists on the overriding symbol for __device__ (kind 87) and __host__ (kind 86) attributes using sub_5CEE70.

The overriding symbol has two attribute list pointers: offset +184 (primary attributes) and offset +200 (secondary/redeclaration attributes). Both are checked for each attribute kind.

Reconstructed Pseudocode

// Phase 2: explicit annotation mismatch detection (lines 96-188)
//
// At this point, (overriding_entity->byte_177 & 0x10) == 0 (not implicitly HD).
// We must determine what execution space annotations the overriding function
// has, and compare against the overridden function's execution space.

void check_override_mismatch(
    int64_t overriding_sym,       // a2
    int64_t overriding_entity,    // v11
    int64_t overridden_entity,    // v10
    int64_t overridden_sym_list,  // v6 = a2+48 (location info for diagnostics)
    int64_t overridden_sym_arg,   // v8 = a3 (for diagnostics)
    int64_t base_sym              // v9 = *a2 (for diagnostics)
) {
    // -- Assertion: overridden entity must exist --
    if (!overridden_entity) {
        internal_error("nv_transforms.h", 367, "nv_is_device_only_routine");
    }

    // -- Extract overridden execution space --
    uint8_t base_es    = overridden_entity->byte_182;
    uint8_t mask_30    = base_es & 0x30;     // 0x00/0x10/0x20/0x30
    bool    base_no_device_annotation = (base_es & 0x20) == 0;  // v56
    bool    base_is_hd = (mask_30 == 0x30);  // v58
    uint8_t base_device_bit = base_es & 0x20;  // v55

    // -- Check overriding function for __device__ attribute (kind 87) --
    bool has_device_attr = find_attribute(87, overriding_sym->attr_list_184)
                        || find_attribute(87, overriding_sym->attr_list_200);

    if (has_device_attr) {
        // Overriding function has __device__.
        // Now check if it also has __host__ (kind 86) -- making it HD.

        bool has_host_attr = find_attribute(86, overriding_sym->attr_list_184)
                          || find_attribute(86, overriding_sym->attr_list_200);

        if (has_host_attr) {
            // --- Overriding is __host__ __device__ ---
            if (base_device_bit) {
                // Base has device_annotation (bit 5 set).
                // If base is device-only (mask_30 == 0x20), error 3544.
                if (mask_30 == 0x20) {
                    emit_error(8, 3544, location, overridden, base);
                }
                // If base is HD (mask_30 == 0x30), it's legal -- no error.
                // If base has device_bit but mask_30 != 0x20 and != 0x30,
                // that can't happen (bit 5 set implies mask_30 is 0x20 or 0x30).
            } else {
                // Base has no device_annotation -- base is host or implicit host.
                emit_error(8, 3543, location, overridden, base);
            }
        } else {
            // --- Overriding is __device__ only ---
            // Fall through to LABEL_83 logic.
            goto device_only_check;
        }
    } else {
        // Overriding function has NO __device__ attribute.
        // It's either explicit __host__ or implicit host (no annotation).

        if (dword_106BFF0) {
            // Relaxed mode: check if overriding has explicit __host__.
            bool has_host_attr = find_attribute(86, overriding_sym->attr_list_184)
                              || find_attribute(86, overriding_sym->attr_list_200);

            if (!has_host_attr) {
                // No explicit __host__ either -- implicit host.
                // In relaxed mode, an implicit-host override is treated like
                // a device-only override for certain base configurations.
                // Jump into the device-only path with modified conditions.
                goto device_only_check_relaxed;
            }
            // Explicit __host__ in relaxed mode: fall through to normal checks.
        }

        // --- Overriding is __host__ (explicit or implicit) ---
        if (mask_30 == 0x20) {
            // Base is __device__ only
            emit_error(8, 3545, location, overridden, base);
        } else if (mask_30 == 0x30) {
            // Base is __host__ __device__
            emit_error(8, 3546, location, overridden, base);
        }
        // else: base is host/implicit-host, same space -- no error.
        goto done_nvidia_checks;
    }

device_only_check:
    // Overriding is __device__ only (has __device__ but no __host__).
    // v39 = base_no_device_annotation (v56), v40 = 1 (always set entering here).
    {
        bool should_error = base_no_device_annotation;  // v39
        bool relaxed_extra = true;                      // v40

device_only_check_relaxed:
        // (relaxed mode entry: v39 = 0, a1 = v56 = base_no_device_annotation)

        if (dword_106BFF0) {
            // Relaxed mode: the error fires unconditionally when
            // base has no device annotation (base is host/implicit-host).
            // In strict mode, same condition applies.
            should_error = base_no_device_annotation;
            relaxed_extra = true;   // always true in relaxed
        }

        if (should_error) {
            // Base is host-only (no device_annotation) and override is device-only.
            emit_error(8, 3542, location, overridden, base);
        } else if (base_is_hd && relaxed_extra) {
            // Base is HD, override is device-only.
            // v40 (relaxed_extra) is always 1 from Entry A, so this
            // fires in both strict and relaxed modes for D-overrides-HD.
            emit_error(8, 3547, location, overridden, base);
        }
        // else: base is device-only too -- compatible, no error.
    }

done_nvidia_checks:
    // Continue to standard EDG override recording...
}

Decision Tree (Simplified)

overriding byte_177 & 0x10?
  YES (implicitly HD) --> propagate, skip mismatch check
  NO  --> extract base_es = overridden byte_182
          has __device__ attr on overriding?
            YES --> also has __host__ attr?
              YES (override=HD):
                base has device_annotation?
                  YES and mask_30==0x20 --> ERROR 3544
                  NO                    --> ERROR 3543
              NO (override=D-only):
                base has NO device_annotation? --> ERROR 3542
                base is HD?                    --> ERROR 3547
            NO (override=H or implicit-H):
              base mask_30==0x20 --> ERROR 3545
              base mask_30==0x30 --> ERROR 3546
              otherwise         --> legal (same space)
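
The decision tree can also be written as a small pure function, which makes it easy to check each branch in isolation. This is a sketch under stated assumptions: the function name and parameters are invented for illustration, the implicitly-HD case is assumed to have been handled before the call, and the relaxed-mode reroute for unannotated overrides is deliberately not modeled. A return value of 0 stands for "no mismatch".

```cpp
#include <cstdint>

// Returns the internal error number from the decision tree, or 0 if legal.
// base_es is the overridden entity's byte_182; the two flags say which
// explicit attributes were found on the overriding symbol.
inline int override_mismatch_error(std::uint8_t base_es,
                                   bool derived_has_device,
                                   bool derived_has_host) {
    const int mask_30 = base_es & 0x30;
    const bool base_has_device = (base_es & 0x20) != 0;

    if (derived_has_device && derived_has_host) {   // override = HD
        if (!base_has_device) return 3543;          // base is H / implicit H
        return mask_30 == 0x20 ? 3544 : 0;          // base is D-only / HD
    }
    if (derived_has_device) {                       // override = D-only
        if (!base_has_device) return 3542;          // base is H / implicit H
        return mask_30 == 0x30 ? 3547 : 0;          // base is HD / D-only
    }
    // override = explicit or implicit __host__
    if (mask_30 == 0x20) return 3545;               // base is D-only
    if (mask_30 == 0x30) return 3546;               // base is HD
    return 0;                                       // base is H / implicit H
}
```

For example, `override_mismatch_error(0x30, true, false)` yields 3547, the "D overrides HD" case.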

The Six Error Messages

Each mismatch produces one of six errors. All are emitted at severity 8 (hard error) and are individually suppressible by their diagnostic tag via --diag_suppress or #pragma nv_diag_suppress.

Internal	Display	Diagnostic Tag	Message Template
3542	20085	`vfunc_incompat_exec_h_d`	`execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __device__ function`
3543	20086	`vfunc_incompat_exec_h_hd`	`execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __host__ __device__ function`
3544	20087	`vfunc_incompat_exec_d_hd`	`execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ __device__ function`
3545	20088	`vfunc_incompat_exec_d_h`	`execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ function`
3546	20089	`vfunc_incompat_exec_hd_h`	`execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __host__ function`
3547	20090	`vfunc_incompat_exec_hd_d`	`execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __device__ function`

The display number is computed as internal + 16543 (the standard CUDA error renumbering from construct_text_message). The tag naming convention is vfunc_incompat_exec_{overridden}_{overriding}.

The %n1 and %n2 fill-ins resolve to the entity display names of the base and derived functions respectively, including their fully qualified names and parameter types.
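
The renumbering and the tag convention are both mechanical, so a short sketch may help when cross-checking the table. The helper names `display_number` and `override_tag` are hypothetical; only the offset 16543 and the tag pattern are taken from the text above.

```cpp
#include <string>

// Offset between internal and displayed error numbers, as stated above.
constexpr int kCudaDisplayOffset = 16543;

// Internal number -> number shown to the user.
constexpr int display_number(int internal_error) {
    return internal_error + kCudaDisplayOffset;
}

// Builds a tag from the space abbreviations used in the table: "h", "d", "hd".
// The overridden (base) space comes first, the overriding (derived) second.
inline std::string override_tag(const std::string& overridden,
                                const std::string& overriding) {
    return "vfunc_incompat_exec_" + overridden + "_" + overriding;
}
```

For instance, `display_number(3542)` gives 20085 and `override_tag("h", "d")` gives `vfunc_incompat_exec_h_d`, matching the first table row.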

Suppression Example

# Suppress by tag (preferred); -Xcudafe forwards the option to cudafe++
nvcc -Xcudafe --diag_suppress=vfunc_incompat_exec_h_d file.cu

# Suppress by display number
nvcc -Xcudafe --diag_suppress=20085 file.cu

# Suppress in source
#pragma nv_diag_suppress vfunc_incompat_exec_h_d

Complete Compatibility Matrix

This table shows every combination of base (overridden) and derived (overriding) execution space. "Implicit H" means the function has no execution space annotation ((byte_182 & 0x30) == 0x00). Since implicit host and explicit __host__ are treated identically for override purposes (both lack the device_annotation bit and have mask_30 != 0x20), they share the same row/column behavior.

__global__ is excluded because __global__ functions cannot be virtual -- the attribute handler rejects __global__ on virtual functions before override checking ever runs.

The matrix is the same in both strict mode (dword_106BFF0 == 0) and relaxed mode (dword_106BFF0 == 1). The relaxed flag changes the code path used to reach the error decision but produces the same result for all input combinations.

	Derived: H / implicit H	Derived: D	Derived: HD	Derived: implicitly HD
Base: H / implicit H	legal	error 3542	error 3543	legal + propagate `\|= 0x10`
Base: D	error 3545	legal	error 3544	legal + propagate `\|= 0x20`
Base: HD	error 3546	error 3547	legal	legal + propagate `\|= 0x10`, `\|= 0x20`

Reading the matrix: each row is the base class virtual function's space; each column is the derived class override's space. "Legal" means no error is emitted and the override is recorded normally. "Legal + propagate" means the override is accepted AND the base's execution space bits are OR'd into the derived entity's byte_182.

The diagonal (same space in base and derived) is always legal. The last column (implicitly HD) is always legal because an implicitly HD function is compatible with every execution space -- the mismatch check is skipped entirely and only propagation runs.

Why Both Modes Produce the Same Matrix

Tracing the LABEL_83 code path with the two entry points reveals that dword_106BFF0 does NOT gate error 3547. In the critical device-only-override path (Entry A), v40 is set to 1 before reaching LABEL_83 regardless of the relaxed flag. The flag only changes the assignment to a1 and v40 via conditional moves (cmovz/cmovnz in the disassembly), but the net effect is identical for all input combinations:

LABEL_83 internals (decompiled, annotated):
  a2 = 3542;                          // tentative error
  if (!dword_106BFF0) a1 = v39;       // strict: a1 = v39
  if (dword_106BFF0) v40 = 1;         // relaxed: force v40 = 1
  // BUT v40 was already 1 from Entry A (line 134)
  if (a1) emit_error(3542);           // base has no device_annotation
  else if (v58 && v40) emit_error(3547);  // base is HD
  else skip;                          // base is D-only (compatible)

Entry A sets v39 = v56, v40 = 1, a1 = v56. In strict mode, a1 is overwritten to v39 (same value). In relaxed mode, a1 stays v56 (same value). Either way, a1 = v56 = (base has no device annotation). The v40 = 1 from Entry A is preserved. The result is identical.
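
The Entry A argument can be checked mechanically with a small model of the LABEL_83 listing. This sketch transcribes only the lines quoted above, keeps the decompiler's variable names as parameters, and covers Entry A alone (Entry B is not modeled). The function names are invented for the example, and 0 stands for "no error".

```cpp
// Model of the LABEL_83 selection logic quoted above.
// Returns 3542, 3547, or 0 (no error).
inline int label_83(bool relaxed, bool a1, bool v39, bool v40, bool v58) {
    if (!relaxed) a1 = v39;       // strict: a1 = v39
    if (relaxed) v40 = true;      // relaxed: force v40 = 1
    if (a1) return 3542;          // base has no device_annotation
    if (v58 && v40) return 3547;  // base is HD
    return 0;                     // base is D-only (compatible)
}

// Entry A as described: v39 = v56, v40 = 1, a1 = v56.
inline int entry_a(bool relaxed, bool v56, bool v58) {
    return label_83(relaxed, /*a1=*/v56, /*v39=*/v56, /*v40=*/true, v58);
}

// True if strict and relaxed mode agree for every (v56, v58) pair on Entry A.
inline bool entry_a_modes_agree() {
    for (int v56 = 0; v56 <= 1; ++v56) {
        for (int v58 = 0; v58 <= 1; ++v58) {
            if (entry_a(false, v56 != 0, v58 != 0) !=
                entry_a(true, v56 != 0, v58 != 0)) {
                return false;
            }
        }
    }
    return true;
}
```

Because Entry A feeds the same value into both `a1` and `v39` and already has `v40` set, neither conditional assignment can change the outcome, which is what `entry_a_modes_agree()` confirms by enumeration.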

The relaxed flag introduces a second entry point (Entry B) for overriding functions with no explicit annotation. In relaxed mode, such functions are routed through LABEL_83 with v39 = 0 and a1 = v56, producing the same device-only check logic. In strict mode, the same functions take the direct H/implicit-H path and produce errors 3545/3546 for device/HD bases. Both paths reach the same conclusions.

Relaxed Mode: The Unannotated Override Path

When dword_106BFF0 == 1 and the overriding function has no __device__ attribute, the checker takes an additional step before falling through to the H/implicit-H path. It queries the overriding symbol for explicit __host__ (kind 86). If __host__ IS found, the function is confirmed as explicit host and errors 3545/3546 apply normally. If __host__ is NOT found (truly unannotated), the function is reclassified through the device-only check path (LABEL_83). This reclassification does not change the error outcome -- an unannotated function overriding a host base still sees no error (both are host-space), and an unannotated function overriding a device or HD base still produces the appropriate error.

Propagation Details

When the overriding function is implicitly HD (byte_177 & 0x10), execution space is propagated from the base to the derived entity by OR-ing bits into byte_182:

// Propagation (direct from decompiled sub_432280, lines 77-91)
uint8_t base_es = overridden_entity->byte_182;

// If base is NOT device-only, derived inherits host obligation
if ((base_es & 0x30) != 0x20) {
    overriding_entity->byte_182 |= 0x10;   // host_explicit bit
    base_es = overridden_entity->byte_182;  // re-read (compiler artifact)
}

// If base has device_annotation, derived inherits device obligation
if (base_es & 0x20) {
    overriding_entity->byte_182 |= 0x20;   // device_annotation bit
}

The re-read of overridden_entity->byte_182 after setting 0x10 on the overriding entity is a compiler artifact: the decompiler shows it reading back from v10+182 into v22, presumably because the compiler could not prove that the byte store to the overriding entity leaves the overridden entity untouched. v10 is the overridden entity, so the value has not changed. The OR operations are on the overriding entity only.

Propagation Matrix

Base space (`byte_182 & 0x30`)	Bits OR'd into overriding `byte_182`	Net effect on overriding entity
`0x00` (implicit H)	`\|= 0x10`	Becomes explicit host (`0x10`)
`0x10` (explicit H)	`\|= 0x10`	Becomes explicit host (`0x10`)
`0x20` (D only)	`\|= 0x20`	Becomes device-annotated (`0x20`)
`0x30` (HD)	`\|= 0x10`, then `\|= 0x20`	Becomes HD (`0x30`)

After propagation, the overriding entity's byte_182 accurately reflects the execution space obligations inherited from its base class. Downstream passes (device/host separation, IL marking, code generation) use this byte to determine whether the function needs device-side compilation, host-side compilation, or both.
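
As a quick cross-check of the propagation matrix, the two conditional ORs can be expressed as a pure function over the two bytes. The function name `propagate_exec_space` is hypothetical, and the sketch assumes the caller has already established that the overriding function is implicitly HD.

```cpp
#include <cstdint>

// Returns the overriding entity's byte_182 after propagation from the base.
// Bits outside 0x30 in derived_es are passed through unchanged.
inline std::uint8_t propagate_exec_space(std::uint8_t base_es,
                                         std::uint8_t derived_es) {
    if ((base_es & 0x30) != 0x20) {
        // Base is not device-only: derived inherits the host bit.
        derived_es = static_cast<std::uint8_t>(derived_es | 0x10);
    }
    if ((base_es & 0x20) != 0) {
        // Base has device_annotation: derived inherits the device bit.
        derived_es = static_cast<std::uint8_t>(derived_es | 0x20);
    }
    return derived_es;
}
```

Starting from a derived byte of 0, the four base values 0x00, 0x10, 0x20, and 0x30 produce 0x10, 0x10, 0x20, and 0x30 respectively, which matches the table row by row.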

Relaxed Mode (dword_106BFF0)

The global flag dword_106BFF0 (relaxed_attribute_mode, default 1 per CLI defaults) controls permissive handling of execution space annotations across the compiler. Its primary effects are on attribute application (allowing __device__ + __global__ coexistence) and cross-space call validation. For virtual override checking, its effect is narrower:

Unannotated override reclassification. In relaxed mode, when the overriding function has no __device__ attribute, the checker additionally queries the overriding symbol for __host__ (kind 86). If __host__ is NOT found (the function is truly unannotated), the checker treats it as potentially device-compatible and routes it through the device-only check path (LABEL_83). This can produce error 3542 (D overrides H) for an implicit-host function, which would otherwise only see errors 3545/3546.
No error suppression for overrides. Unlike attribute application where relaxed mode suppresses error 3481, relaxed mode does NOT suppress any of the six override errors. All six fire at severity 8 in both modes. The flag dword_106BFF0 modulates the code path taken to reach the error decision, not the severity or suppression of the error itself.

Additional Override Checks (Non-CUDA)

After the CUDA execution space checks, sub_432280 continues with standard EDG override validation:

Error	Condition	Meaning
1788	Base has `[[nodiscard]]`, derived does not	Missing `[[nodiscard]]` on override
1789	Derived has `[[nodiscard]]`, base does not	Extraneous `[[nodiscard]]` on override
1850	Overriding a `final` virtual function	Override of `final` function
2935	Derived has requires-clause, base does not	Requires-clause mismatch
2936	Base has requires-clause, derived does not	Requires-clause mismatch

These are standard C++ checks unrelated to CUDA execution spaces.

Example: Override Interactions

// Example 1: Legal same-space override
struct Base {
    __device__ virtual void f();
};
struct Derived : Base {
    __device__ void f() override;     // Legal: D overrides D
};

// Example 2: Error 3542 -- D overrides H
struct Base2 {
    virtual void f();                 // Implicit __host__
};
struct Derived2 : Base2 {
    __device__ void f() override;     // ERROR 3542 (20085)
};
// error #20085-D: execution space mismatch: overridden entity (Base2::f)
//   is a __host__ function, but overriding entity (Derived2::f)
//   is a __device__ function

// Example 3: Error 3546 -- H overrides HD
struct Base3 {
    __host__ __device__ virtual void f();
};
struct Derived3 : Base3 {
    void f() override;                // ERROR 3546 (20089)
};
// error #20089-D: execution space mismatch: overridden entity (Base3::f)
//   is a __host__ __device__ function, but overriding entity (Derived3::f)
//   is a __host__ function

// Example 4: Legal constexpr override with propagation
struct Base4 {
    __device__ virtual int g();
};
struct Derived4 : Base4 {
    constexpr int g() override;       // Legal: implicitly HD, propagates |= 0x20
};
// Derived4::g now has byte_182 |= 0x20 (device_annotation)
// and is included in device IL compilation.

// Example 5: Error 3547 -- D overrides HD
struct Base5 {
    __host__ __device__ virtual void h();
};
struct Derived5 : Base5 {
    __device__ void h() override;     // ERROR 3547 (20090)
};

Function Map

Address	Identity	Lines	Source
`sub_432280`	`record_virtual_function_override`	437	`class_decl.c`
`sub_5CEE70`	`find_attribute` (attribute list lookup by kind)	~30	`attribute.c`
`sub_4F4F10`	`emit_diag_with_entity_pair` (severity, error, loc, base, derived)	~100	`error.c`
`sub_4F2930`	`internal_error` (assertion failure)	~20	`error.c`
`sub_41A6E0`	`dump_override_entry` (debug trace helper)	~40	`class_decl.c`
`sub_41D010`	`add_to_override_list`	~20	`class_decl.c`
`sub_5E20D0`	`allocate_override_entry` (40-byte node)	~15	`mem.c`
`sub_432130`	`resolve_indeterminate_exception_specification`	~60	`class_decl.c`

Override Entry Structure

Each recorded override is stored as a 40-byte linked list node:

Override entry (40 bytes):
  +0x00 (0):   next pointer
  +0x08 (8):   base_class_symbol (entity in base class vtable)
  +0x10 (16):  derived_class_entity (overriding function entity)
  +0x18 (24):  flags (0 initially, set during processing)
  +0x20 (32):  covariant_return_adjustment (pointer or NULL)

The override list is managed via:

qword_E7FE98: list head (most recent entry)
qword_E7FEA0: free list head (recycled 40-byte entries)
qword_E7FE90: allocation counter

When debug tracing is enabled (dword_126EFCC > 3), the function prints "newly created: ", "existing entry: ", "after modification: ", and "removing: " to stderr via fwrite, followed by calls to sub_41A6E0 to dump the entry contents.
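
The node layout and the free-list recycling can be sketched as below. Several details are assumptions rather than recovered facts: the field types (opaque `void*` pointers and a 64-bit `flags` word), the `OverrideList` wrapper that stands in for the three globals, and the choice to bump the counter only on fresh allocations. The offsets and the 40-byte size come from the layout above.

```cpp
#include <cstddef>
#include <cstdint>

struct OverrideEntry {
    OverrideEntry* next;                // +0x00
    void* base_class_symbol;            // +0x08
    void* derived_class_entity;         // +0x10
    std::uint64_t flags;                // +0x18 (width assumed from the gap)
    void* covariant_return_adjustment;  // +0x20
};
// The size and offset checks only apply on 64-bit targets.
static_assert(sizeof(void*) != 8 || sizeof(OverrideEntry) == 40,
              "entry should be 40 bytes on x86-64");
static_assert(sizeof(void*) != 8 || offsetof(OverrideEntry, flags) == 0x18,
              "flags should sit at +0x18 on x86-64");

// Hypothetical wrapper for the three globals listed above.
struct OverrideList {
    OverrideEntry* head = nullptr;       // models qword_E7FE98
    OverrideEntry* free_head = nullptr;  // models qword_E7FEA0
    std::uint64_t alloc_count = 0;       // models qword_E7FE90

    // Reuse a recycled node if one exists, otherwise allocate a new one.
    OverrideEntry* allocate() {
        OverrideEntry* e = free_head;
        if (e != nullptr) {
            free_head = e->next;
        } else {
            e = new OverrideEntry;
            ++alloc_count;  // assumption: counts fresh allocations only
        }
        *e = OverrideEntry{};  // zero every field, so flags starts at 0
        return e;
    }

    // Link a node at the head, so the head is the most recent entry.
    void push(OverrideEntry* e) { e->next = head; head = e; }

    // Return an already-unlinked node to the free list.
    void recycle(OverrideEntry* e) { e->next = free_head; free_head = e; }
};
```

A removed entry handed to `recycle` is the next one returned by `allocate`, which is the behavior a free list of fixed-size nodes is meant to provide.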

Cross-References

Execution Spaces -- bitfield layout at entity +182, attribute application handlers, conflict matrix
Cross-Space Call Validation -- call-graph enforcement, the implicitly-HD bypass
CUDA Error Catalog -- error numbering scheme, diagnostic tag suppression system
Global Variables -- dword_106BFF0 and other flags
Entity Node Layout -- full byte map of the entity structure including +176, +177, +182
__global__ Function Constraints -- why __global__ functions cannot be virtual
