cudafe++ v13.0 -- Reverse Engineering Reference

cudafe++ is NVIDIA's CUDA frontend compiler -- the first stage of the CUDA compilation pipeline. It is built on the Edison Design Group (EDG) C++ Front End v6.6, a commercial compiler frontend licensed by compiler vendors worldwide. NVIDIA ships cudafe++ as a statically-linked, stripped ELF binary inside every CUDA Toolkit installation. This binary accepts .cu source files, parses them as C++ with CUDA extensions, separates device code from host code, and produces two outputs: an EDG Intermediate Language (IL) stream consumed by cicc (the NVIDIA PTX code generator), and a transformed .int.c host file consumed by the system C++ compiler (gcc, clang, or cl.exe).

This wiki documents the complete internals of the cudafe++ binary from CUDA Toolkit 13.0, reverse-engineered through static analysis (IDA Pro + Hex-Rays decompilation) of all 6,483 functions. The goal is reimplementation-grade documentation: every page should give a senior compiler engineer enough information to build equivalent functionality from scratch.

Binary Identity

Property                      Value
Binary                        cudafe++ from CUDA Toolkit 13.0
Format                        ELF 64-bit LSB executable, x86-64, statically linked, stripped
File size                     8,910,936 bytes (8.5 MB)
EDG base                      Edison Design Group C++ Front End v6.6
Build path                    /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/
Total functions               6,483
Functions mapped to source    2,209 (34%)

Segment Layout

Section        Start      End         Size                         Description
.text          0x403300   0x829722    4,351,010 bytes (4.15 MB)    Executable code
.rodata        0x829740   0xAA3FA3    2,599,011 bytes (2.48 MB)    Read-only data (string tables, jump tables, constants)
.data          0xD46480   0xE7EFF0    1,280,880 bytes (1.22 MB)    Initialized global variables
.bss           0xE7F000   0x12D6F20   4,554,528 bytes (4.34 MB)    Zero-initialized globals
.eh_frame      0xCB1210   0xD3F398    582,024 bytes                Exception handling unwind tables
.data.rel.ro   0xD428C0   0xD45E00    13,632 bytes                 Relocation-read-only (vtables, GOT-relative)

Role in the CUDA Toolchain

  input.cu
     |
     v
 cudafe++  ──────── THIS BINARY ────────
     |                                   |
     v                                   v
  device.gpu  (EDG IL)             input.int.c  (transformed host C++)
     |                                   |
     v                                   v
   cicc                            gcc / clang / cl.exe
     |                                   |
     v                                   v
  device.ptx                       host.o
     |                                   |
     v                                   v
   ptxas                              ld
     |                                   |
     v                                   v
  device.cubin ──────────────────> final executable

cudafe++ is a source-to-source compiler. It never generates machine code directly. Its job is to take a single .cu translation unit, understand which code is device (__device__, __global__) and which is host, then:

  1. For the device track: Emit EDG IL -- a typed, scope-linked intermediate representation containing every declaration, type, expression, and statement. This IL is consumed by cicc, which lowers it through LLVM to PTX assembly.

  2. For the host track: Emit a .int.c file -- valid C++ source where device function bodies are suppressed inside #if 0/#endif, __global__ kernels are replaced by __wrapper__device_stub_<name>() forwarding functions, and CUDA runtime registration boilerplate is appended.

The binary runs as a single-threaded, single-pass-per-stage pipeline with 8 stages: pre-init, CLI parsing (276 flags), one-time init (38 subsystem initializers), TU state reset, frontend parse (EDG parser + CUDA extensions), 5-pass IL finalization, backend .int.c emission, and exit. See Pipeline Overview for the full stage diagram.

Source Attribution

The binary embeds __FILE__ strings from the EDG build system, revealing the original source file structure. From these strings plus address-range analysis of decompiled code, 52 .c source files and 13 .h header files have been identified:

Category               Files    Functions Mapped   Description
EDG core parser        15 .c    ~800               Lexer, expression/declaration parser, statement handling
EDG type system        6 .c     ~350               Type representation, checking, conversion
EDG templates          5 .c     ~300               Template parsing, instantiation, deduction
EDG IL subsystem       8 .c     ~250               IL node types, allocation, walking, display, comparison
EDG infrastructure     12 .c    ~400               Memory management, error handling, name mangling, scope management
EDG code generation    3 .c     ~150               Backend .int.c emission, ASM handling
NVIDIA additions       3 .c     ~110               CUDA transforms, attribute validation, lambda wrappers
Headers                13 .h    (inline)           Shared constants, struct layouts, macro definitions

The NVIDIA-specific source files are:

  • nv_transforms.c (~34 functions, ~14 KB of .text): The heart of CUDA support. Implements device/host-device lambda wrapper template generation (__nv_dl_wrapper_t, __nv_hdl_wrapper_t, __nv_hdl_create_wrapper_t), CUDA attribute validation (__launch_bounds__, __cluster_dims__, __block_size__, __maxnreg__), host reference array emission (.nvHRKI/.nvHRDE/.nvHRCE ELF sections), lambda preamble injection (sub_6BCC20), and array capture helper generation.

  • nv_transforms.h: Header with NVIDIA-specific declarations, type trait template names, and bitmask table definitions.

  • 3 modified EDG files: cmd_line.c (CUDA CLI flags spliced into EDG's flag table), fe_init.c (CUDA-specific initialization at stage 3), and cp_gen_be.c (device stub generation, lambda wrapper emission, registration table output in the backend).

Key Discoveries

Execution Space Bitfield

Every entity node in the EDG IL carries CUDA execution-space information at byte offset +182 (relative to the entity node base). The bitfield encoding:

Bit    Mask    Meaning
4-5    0x30    Execution space: 0=none, 1=__host__, 2=__device__, 3=__host__ __device__
6      0x40    Device/global flag (set for __device__ and __global__ functions)
7      0x80    __global__ kernel flag

This bitfield is checked throughout the pipeline -- in cross-space call validation, device/host code separation, the keep-in-IL predicate, and backend stub generation.
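The encoding above can be sketched as a small decoder. This is an illustrative helper, not code recovered from the binary: the masks and the +182 offset come from the table, while the ExecSpace type and both function names are hypothetical. The low four bits of the byte are not described here, so the sketch ignores them.

```python
from dataclasses import dataclass

# Offset and masks taken from the bitfield table above.
EXEC_SPACE_OFFSET = 182     # byte offset from the entity node base
EXEC_SPACE_MASK = 0x30      # bits 4-5: execution space
DEVICE_GLOBAL_FLAG = 0x40   # bit 6: set for __device__ and __global__
KERNEL_FLAG = 0x80          # bit 7: __global__ kernel

SPACE_NAMES = {
    0: "none",
    1: "__host__",
    2: "__device__",
    3: "__host__ __device__",
}


@dataclass(frozen=True)
class ExecSpace:            # hypothetical helper type, not an EDG name
    space: str
    device_or_global: bool
    is_kernel: bool


def decode_exec_space(flags: int) -> ExecSpace:
    """Split the execution-space byte into its three documented fields."""
    return ExecSpace(
        space=SPACE_NAMES[(flags & EXEC_SPACE_MASK) >> 4],
        device_or_global=bool(flags & DEVICE_GLOBAL_FLAG),
        is_kernel=bool(flags & KERNEL_FLAG),
    )


def exec_space_of(entity: bytes) -> ExecSpace:
    """Decode the byte at +182 of a raw entity node image."""
    return decode_exec_space(entity[EXEC_SPACE_OFFSET])
```

Given a raw dump of an entity node, exec_space_of(node) returns the decoded fields; decode_exec_space can be applied directly to a byte value read in a debugger.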

Lambda Wrapper Template Injection

CUDA extended lambdas (__device__ and __host__ __device__ lambdas) cannot be passed directly across the host/device boundary. cudafe++ solves this by injecting a library of template wrapper structs into the compilation at backend time. The master emitter sub_6BCC20 (nv_emit_lambda_preamble) generates all __nv_* templates in a single function call, driven by two 1024-bit bitmasks that record which capture counts were actually needed during parsing:

  • unk_1286980: Device lambda capture counts (bit N = need __nv_dl_wrapper_t for N captures)
  • unk_1286900: Host-device lambda capture counts (need __nv_hdl_wrapper_t for N captures)

Only the required specializations are emitted, keeping the generated code minimal.
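A minimal sketch of how a 1024-bit capture-count bitmask of this kind can be maintained and scanned. The function names are hypothetical, and the bit order within each byte (least significant bit first) is an assumption; the text above only establishes that bit N corresponds to N captures.

```python
MASK_BITS = 1024
MASK_BYTES = MASK_BITS // 8   # 128 bytes per table


def new_mask() -> bytearray:
    """Create an all-zero capture-count bitmask."""
    return bytearray(MASK_BYTES)


def mark_capture_count(mask: bytearray, n: int) -> None:
    """Record that a wrapper for n captures is needed."""
    if not 0 <= n < MASK_BITS:
        raise ValueError(f"capture count {n} outside 0..{MASK_BITS - 1}")
    # Assumed bit order: LSB-first within each byte.
    mask[n // 8] |= 1 << (n % 8)


def needed_capture_counts(mask: bytearray) -> list[int]:
    """Return every capture count whose bit is set, in ascending order."""
    return [n for n in range(MASK_BITS) if mask[n // 8] & (1 << (n % 8))]
```

During parsing, each extended lambda would call mark_capture_count once; at backend time the emitter would iterate needed_capture_counts and generate one specialization per entry, which is why unused capture counts cost nothing in the output.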

CUDA Error Catalog

The binary contains 3,795 diagnostic messages in the EDG error table. Of these, 338 are CUDA-specific (error numbers in the 20000+ range and the 3500-3800 range). These cover:

  • Execution space violations (calling __device__ from __host__ and vice versa)
  • __global__ function constraints (no return value, no variadic args, no virtual)
  • Lambda restrictions (35+ distinct error categories for extended lambda misuse)
  • Attribute conflicts (__launch_bounds__ + __maxnreg__ mutual exclusion)
  • RDC mode restrictions (user-defined copy constructors in kernel arguments)
  • Architecture feature gates (feature X requires SM_YY or higher)

IL Entry Kind System

The EDG IL uses 85 defined entry kinds (0-84), each representing a distinct node type in the typed, scope-linked IL graph. Key node types include: routine (288 bytes, functions/methods), variable (232 bytes), type (176 bytes, 22 sub-kinds), expr_node (72 bytes, 36 sub-kinds), statement (80 bytes, 26 sub-kinds), and scope (288 bytes, 9 sub-kinds). All nodes live in a region-based arena allocator with 64 KB blocks. See IL Overview for the complete entry kind table.
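To make the allocation model concrete, here is a rough sketch of a region-based bump allocator with 64 KB blocks. Only the block size and the node sizes come from the text above; the Region class, the 8-byte alignment, and the (block, offset) handle are assumptions made for illustration, not the layout used by il_alloc.c.

```python
BLOCK_SIZE = 64 * 1024   # 64 KB blocks, as stated above


class Region:
    """Hypothetical bump allocator: nodes are carved out of fixed-size blocks."""

    def __init__(self) -> None:
        self.blocks = [bytearray(BLOCK_SIZE)]
        self.used = 0    # bytes consumed in the current (last) block

    def alloc(self, size: int, align: int = 8) -> tuple[int, int]:
        """Reserve `size` bytes and return a (block index, offset) handle."""
        if size <= 0 or size > BLOCK_SIZE:
            raise ValueError("node does not fit in a single block")
        start = (self.used + align - 1) & ~(align - 1)   # assumed alignment
        if start + size > BLOCK_SIZE:
            # Current block is exhausted: chain a fresh block onto the region.
            self.blocks.append(bytearray(BLOCK_SIZE))
            start = 0
        self.used = start + size
        return len(self.blocks) - 1, start


region = Region()
routine = region.alloc(288)    # routine node size from the text
expr = region.alloc(72)        # expr_node size from the text
```

Individual nodes are never freed in this model; the whole region is released at once, which suits a compiler that builds an IL graph and discards it per translation unit.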

CLI Flag Inventory

cudafe++ accepts 276 command-line flags parsed in sub_459630 (cmd_line.c). These control:

  • Language mode and C++ standard version (__cplusplus value)
  • Host compiler identity (MSVC, GCC, Clang) and version
  • CUDA-specific modes: extended lambdas, RDC, JIT, architecture target
  • Diagnostic suppression and promotion
  • Include paths and macro definitions
  • Output format and timing

Flags are passed from nvcc via the -Xcudafe forwarding mechanism. Many flags are undocumented EDG internals.

Wiki Structure

This wiki is organized into 12 sections covering the binary from top-level pipeline down to individual data structures.

Overview

  • Function Map -- address-to-identity table for all 2,209 mapped functions
  • Binary Layout -- segment map, memory regions, address space organization
  • Methodology -- RE tools, approach, confidence scoring

Compilation Pipeline

The 8-stage pipeline from main() at 0x408950 through exit. Covers initialization, CLI parsing, EDG frontend invocation, 5-pass IL finalization, backend .int.c emission, and exit code mapping.

CUDA Execution Model

How cudafe++ handles __device__, __host__, and __global__ execution spaces. Device/host code separation, cross-space call validation, kernel stub generation, RDC (relocatable device code) mode, JIT mode, and SM architecture feature gating.

CUDA Attributes

The internal attribute system: __global__ function constraints, __launch_bounds__ / __cluster_dims__ / __block_size__ / __maxnreg__ validation, __grid_constant__ parameter handling, __managed__ variable support, and minor attributes (__nv_pure__, __nv_register_params__).

Lambda Transformations

Extended lambda support architecture: device lambda wrapper (__nv_dl_wrapper_t), host-device lambda wrapper (__nv_hdl_wrapper_t / __nv_hdl_create_wrapper_t), capture handling (field types, array wrappers for up to 8D), preamble injection (sub_6BCC20), and the 35+ lambda restriction error categories.

EDG Intermediate Language

The 85-entry-kind IL format: node allocation (region-based arena), tree walking (5 callback traversal), device code selection (keep-in-IL predicate), display (debug dump), and comparison/copy operations.

Host Output Generation

The .int.c file format, CUDA runtime boilerplate (__nv_managed_rt initialization, crt/host_runtime.h inclusion), host reference arrays (.nvHRKI/.nvHRDE/.nvHRCE ELF sections for device symbol registration), and CRC32-based module ID generation.

EDG Frontend Internals

The stock EDG 6.6 subsystems: lexer/tokenizer (357 token kinds), expression parser, declaration parser, overload resolution, template engine (instantiation worklist), CUDA-specific template restrictions, constexpr interpreter, Itanium ABI name mangling with CUDA extensions, and the type system (176-byte type node, 22 type kinds).

Error & Diagnostic System

The 3,795-entry diagnostic table, CUDA-specific error catalog (338 entries), format specifier system (%t/%s/%n/%sq/%p/%d), and SARIF output / pragma control.

Data Structures

Byte-level layouts for the core IL node types: entity node (execution/memory space at +182), scope entry (784 bytes), translation unit descriptor (424 bytes), type node (176 bytes, 22 kinds), and template instance record (128 bytes).

Configuration

CLI flag inventory (276 flags by category), EDG build configuration (compile-time constants baked into the binary), architecture detection (--nv_arch and SM version mapping), and experimental feature flags.

Reference

EDG source file map (52 .c + 13 .h), global variable index, token kind table (357 types), full error message catalog, and virtual override mismatch matrix.

If you want to understand the compilation pipeline: Start with Pipeline Overview, then follow the stage-by-stage links.

If you want to understand CUDA-specific behavior: Start with the CUDA Execution Model section. The execution spaces page explains the fundamental bitfield encoding that everything else depends on.

If you want to understand lambda transformations: Start with the Lambda Transformations overview. Lambda support is the most complex NVIDIA addition and involves template injection, capture-count bitmasks, and 5 distinct wrapper template families.

If you want to understand the IL format: Start with IL Overview for the 85 entry kinds, then Keep-in-IL for how device code is selected.

If you want to look up a specific function: The Function Map provides address-to-identity mappings for all 2,209 identified functions. The EDG Source File Map shows which source file each address range belongs to.

Data Sources

This wiki is derived from:

  • 6,202 Hex-Rays decompiled C pseudocode files -- one per function with recognizable control flow
  • 6,342 x86-64 disassembly files -- full instruction-level coverage
  • 9.5 MB strings database with cross-references to every function that uses each string
  • 161 MB cross-reference database -- complete caller/callee and data-reference mappings
  • 7.7 MB call graph in JSON and DOT format
  • 6,483 control flow graphs with basic block boundaries
  • 247 MB IDA Pro database (.i64)

All analysis was performed on the binary shipped with CUDA Toolkit 13.0, obtained from NVIDIA's public distribution channels.

Function Map

Every function in the cudafe++ binary that triggers an EDG assertion encodes three pieces of data in the assertion string: the source file path, the line number, and the enclosing function name. These strings survive in .rodata and cross-reference back to the compiled functions, providing a ground-truth mapping from binary address to EDG source file. This page catalogs that mapping for all 52 .c source files and 13 .h header files identified in the CUDA 13.0 build of cudafe++ (EDG 6.6).

The mapping was produced by extracting all string literals matching /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/*.c and *.h from the binary's .rodata section, then tracing their cross-references to determine which functions load each path. A function that references attribute.c in an assertion string was compiled from attribute.c. Functions that reference no source path at all (the "unmapped" pool) are either too small to contain assertions, are inlined from headers, or belong to the statically-linked C++ runtime.
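The attribution step can be sketched as follows. The record layout (a list of entries with a "value" string and an "xrefs" list whose items carry a "func" name) is an assumption about the exported strings database, and the sample function names are made up for the example.

```python
import re
from collections import defaultdict

# Matches EDG source paths embedded by __FILE__ and captures the file name.
SRC_PATH = re.compile(r"/dvs/p4/.*/EDG_6\.6/src/([A-Za-z0-9_]+\.[ch])$")


def map_functions_to_sources(strings):
    """Group referencing functions by the source file named in each path string."""
    by_file = defaultdict(set)
    for entry in strings:
        match = SRC_PATH.search(entry["value"])
        if match is None:
            continue                      # not a source-path string
        for xref in entry.get("xrefs", []):
            by_file[match.group(1)].add(xref["func"])
    return by_file


# Hypothetical sample records in the assumed layout.
BASE = "/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/"
sample = [
    {"value": BASE + "attribute.c", "xrefs": [{"func": "sub_A"}, {"func": "sub_B"}]},
    {"value": BASE + "types.h", "xrefs": [{"func": "sub_B"}]},
    {"value": "unrelated diagnostic text", "xrefs": [{"func": "sub_C"}]},
]
mapping = map_functions_to_sources(sample)
```

A function that appears under both a .c file and a .h file (sub_B in the sample) is attributed to the .c file, with the header reference indicating inlined code.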

Coverage Summary

Category                                         Functions    Percentage
Mapped via .c file paths                         2,129        32.8%
Mapped via .h file paths only                    80           1.2%
Total mapped                                     2,209        34.1%
Unmapped in EDG region (0x403300--0x7E0000)      2,906        44.8%
C++ runtime / demangler (0x7E0000--0x829722)     1,085        16.7%
PLT stubs + init (0x402A18--0x403300)            283          4.4%
Total functions in binary                        6,483        100%

The 2,906 unmapped functions in the EDG region include inlined header expansions (e.g., util.h vector/hash helpers, types.h type queries), small leaf functions below the assertion threshold, switch-table dispatch fragments, and functions from translation units compiled without assertions enabled (notably il_to_str.c display routines and parts of floating.c).
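The coverage figures can be cross-checked with a few lines of arithmetic. The counts are copied from the Coverage Summary; the dictionary keys are just labels chosen for this sketch.

```python
TOTAL_FUNCTIONS = 6483

buckets = {
    "mapped via .c paths": 2129,
    "mapped via .h paths only": 80,
    "unmapped in EDG region": 2906,
    "C++ runtime / demangler": 1085,
    "PLT stubs + init": 283,
}

# Every function should fall into exactly one bucket.
assert sum(buckets.values()) == TOTAL_FUNCTIONS

percent = {name: round(100 * count / TOTAL_FUNCTIONS, 1)
           for name, count in buckets.items()}

total_mapped = buckets["mapped via .c paths"] + buckets["mapped via .h paths only"]
mapped_percent = round(100 * total_mapped / TOTAL_FUNCTIONS, 1)
```

The five buckets partition the binary exactly, and the mapped total is the sum of the .c and .h-only rows.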

Binary Layout

The EDG .text region (0x403300--0x7E0000) has a three-part structure:

  1. Assert stub region (0x403300--0x408B40): 235 small __noreturn functions, one per assertion site. Each encodes a source file path, line number, and function name, then calls sub_4F2930 (the internal error handler). These stubs are sorted by source file name -- the linker grouped them from all 52 .c files into one contiguous block. 200 stubs map to .c files; the remaining 35 are from .h files inlined into .c compilation units.

  2. Constructor region (0x408B40--0x409350): 15 C++ static constructor functions (ctor_001 through ctor_015) that initialize global tables at program startup.

  3. Main body region (0x409350--0x7DFFF0): The bulk of the compiler. Source files are laid out roughly in alphabetical order by filename, a consequence of the linker processing object files in directory-listing order. The ordering holds from attribute.c at 0x409350 and class_decl.c at 0x419280 through types.c at 0x7A4940. Two files follow types.c out of sequence: modules.c at 0x7C0C60 and floating.c at 0x7D0EB0.

Source File Address Table

The table below lists all 52 .c source files sorted by their main body start address. "Total Funcs" counts all functions referencing the file (stubs + main body). "Stubs" counts assert stubs in 0x403300--0x408B40. "Main Funcs" counts functions in the main body region.

#    Source File        Origin   Total Funcs   Stubs   Main Funcs   Main Body Start   Main Body End   Sweep
1    attribute.c        EDG      177           7       170          0x409350          0x418F80        P1.01
2    class_decl.c       EDG      273           9       264          0x419280          0x447930        P1.01--02
3    cmd_line.c         EDG      44            1       43           0x44B250          0x459630        P1.02--03
4    const_ints.c       EDG      4             1       3            0x461C20          0x4659A0        P1.03
5    cp_gen_be.c        EDG      226           25      201          0x466F90          0x489000        P1.03--04
6    debug.c            EDG      2             0       2            0x48A1B0          0x48A1B0        P1.04
7    decl_inits.c       EDG      196           4       192          0x48B3F0          0x4A1540        P1.04--05
8    decl_spec.c        EDG      88            3       85           0x4A1BF0          0x4B37F0        P1.05
9    declarator.c       EDG      64            0       64           0x4B3970          0x4C00A0        P1.05
10   decls.c            EDG      207           5       202          0x4C0910          0x4E8C40        P1.05--06
11   disambig.c         EDG      5             1       4            0x4E9E70          0x4EC690        P1.06
12   error.c            EDG      51            1       50           0x4EDCD0          0x4F8F80        P1.06
13   expr.c             EDG      538           10      528          0x4F9870          0x5565E0        P1.07--08
14   exprutil.c         EDG      299           13      286          0x558720          0x583540        P1.08--09
15   extasm.c           EDG      7             0       7            0x584CA0          0x585850        P1.09
16   fe_init.c          EDG      6             1       5            0x585B10          0x5863A0        P1.09
17   fe_wrapup.c        EDG      2             0       2            0x588D40          0x588F90        P1.09
18   float_pt.c         EDG      79            0       79           0x589550          0x594150        P1.09--10
19   folding.c          EDG      139           9       130          0x594B30          0x5A4FD0        P1.10
20   func_def.c         EDG      56            1       55           0x5A51B0          0x5AAB80        P1.10
21   host_envir.c       EDG      19            2       17           0x5AD540          0x5B1E70        P1.10
22   il.c               EDG      358           16      342          0x5B28F0          0x5DFAD0        P1.10--11d
23   il_alloc.c         EDG      38            1       37           0x5E0600          0x5E8300        P1.11a--11e
24   il_to_str.c        EDG      83            1       82           0x5F7FD0          0x6039E0        P1.11f--12
25   il_walk.c          EDG      27            1       26           0x603FE0          0x620190        P1.12
26   interpret.c        EDG      216           5       211          0x620CE0          0x65DE10        P1.12--13
27   layout.c           EDG      21            2       19           0x65EA50          0x665A60        P1.13
28   lexical.c          EDG      140           5       135          0x666720          0x689130        P1.13--14
29   literals.c         EDG      21            0       21           0x68ACC0          0x68F2B0        P1.14
30   lookup.c           EDG      71            2       69           0x68FAB0          0x69BE80        P1.14
31   lower_name.c       EDG      179           11      168          0x69C980          0x6AB280        P1.14--15
32   macro.c            EDG      43            1       42           0x6AB6E0          0x6B5C10        P1.15
33   mem_manage.c       EDG      9             2       7            0x6B6DD0          0x6BA230        P1.15
34   nv_transforms.c    NVIDIA   1             0       1            0x6BE300          0x6BE300        P1.15
35   overload.c         EDG      284           3       281          0x6BE4A0          0x6EF7A0        P1.15--16
36   pch.c              EDG      23            3       20           0x6F2790          0x6F5DA0        P1.16
37   pragma.c           EDG      28            0       28           0x6F61B0          0x6F8320        P1.16
38   preproc.c          EDG      10            0       10           0x6F9B00          0x6FC940        P1.16
39   scope_stk.c        EDG      186           6       180          0x6FE160          0x7106B0        P1.16--17
40   src_seq.c          EDG      57            1       56           0x710F10          0x718720        P1.17
41   statements.c       EDG      83            1       82           0x719300          0x726A50        P1.17
42   symbol_ref.c       EDG      42            2       40           0x726F20          0x72CEA0        P1.17
43   symbol_tbl.c       EDG      175           8       167          0x72D950          0x74B8D0        P1.17--18
44   sys_predef.c       EDG      35            1       34           0x74C690          0x751470        P1.18
45   target.c           EDG      11            0       11           0x7525F0          0x752DF0        P1.18
46   templates.c        EDG      455           12      443          0x7530C0          0x794D30        P1.18
47   trans_copy.c       EDG      2             0       2            0x796BA0          0x796BA0        P1.18
48   trans_corresp.c    EDG      88            6       82           0x796E60          0x7A3420        P1.18--19
49   trans_unit.c       EDG      10            0       10           0x7A3BB0          0x7A4690        P1.19
50   types.c            EDG      88            5       83           0x7A4940          0x7C02A0        P1.19
51   modules.c          EDG      22            3       19           0x7C0C60          0x7C2560        P1.19
52   floating.c         EDG      50            9       41           0x7D0EB0          0x7D59B0        P1.19

Totals: 5,338 cross-references across 52 .c files, resolving to 2,129 unique functions. With .h file references added, 2,209 unique functions are mapped.

Largest Source Files by Function Count

Source File     Main Body Funcs   Approximate Code Size
expr.c          528               ~373 KB (0x4F9870--0x5565E0)
templates.c     443               ~282 KB (0x7530C0--0x794D30)
il.c            342               ~185 KB (0x5B28F0--0x5DFAD0)
exprutil.c      286               ~175 KB (0x558720--0x583540)
overload.c      281               ~200 KB (0x6BE4A0--0x6EF7A0)
class_decl.c    264               ~187 KB (0x419280--0x447930)
interpret.c     211               ~241 KB (0x620CE0--0x65DE10)
decls.c         202               ~165 KB (0x4C0910--0x4E8C40)
cp_gen_be.c     201               ~141 KB (0x466F90--0x489000)
decl_inits.c    192               ~91 KB (0x48B3F0--0x4A1540)

Header File Cross-References

Thirteen .h header files appear in assertion strings. These are headers that contain non-trivial inline functions or macros that expand to assertion-bearing code. When a function compiled from decls.c triggers an assertion whose __FILE__ is types.h, that assertion was inlined from types.h into the decls.c compilation unit.

#    Header File        Xrefs   Stubs   Main Funcs   Address Range          Inlined Into
1    decls.h            1       0       1            0x4E08F0               decls.c
2    float_type.h       63      0       63           0x7D1C90--0x7DEB90     floating.c
3    il.h               5       2       3            0x52ABC0--0x6011F0     expr.c, il.c, il_to_str.c
4    lexical.h          1       0       1            0x68F2B0               lexical.c / literals.c boundary
5    mem_manage.h       4       0       4            0x4EDCD0               error.c
6    modules.h          5       0       5            0x7C1100--0x7C2560     modules.c
7    nv_transforms.h    3       0       3            0x432280--0x719D20     class_decl.c, cp_gen_be.c, src_seq.c
8    overload.h         1       0       1            0x6C9E40               overload.c
9    scope_stk.h        4       0       4            0x503D90--0x574DD0     expr.c, exprutil.c
10   symbol_tbl.h       2       1       1            0x7377D0               symbol_tbl.c
11   types.h            17      4       13           0x469260--0x7B05E0     Many files (scattered type queries)
12   util.h             124     10      114          0x430E10--0x7C2B10     All major .c files
13   walk_entry.h       51      0       51           0x604170--0x618660     il_walk.c

Notable Header Patterns

util.h is the most widely included header, with 124 cross-references (114 in the main body) spanning nearly the entire EDG .text region from 0x430E10 to 0x7C2B10. It provides generic container templates (dynamic arrays, hash tables, sorted sets) used by every major subsystem. The compiler expanded these templates separately in each compilation unit, creating many small util.h-attributed functions scattered across the binary.

float_type.h is concentrated in a single 52 KB block at 0x7D1C90--0x7DEB90, immediately after floating.c. It contains 63 template instantiations for IEEE 754 floating-point type operations (comparison, conversion, arithmetic) for each target floating-point width. These templates were instantiated in the floating.c compilation unit.

walk_entry.h contributes 51 functions in the tight range 0x604170--0x618660, all within the il_walk.c region. These are the per-entry-kind callback dispatch functions generated by preprocessor macros in the IL walker header.

nv_transforms.h is NVIDIA-specific. Its 3 cross-references appear in class_decl.c (sub_432280 at 0x432280), cp_gen_be.c (sub_47ECC0 at 0x47ECC0), and src_seq.c (sub_719D20 at 0x719D20). These are the integration points where NVIDIA's CUDA transform hooks are called from standard EDG code paths -- class definition processing, backend code generation, and source sequence ordering.

NVIDIA-Specific Files

nv_transforms.c

The only NVIDIA-authored .c file in the EDG source tree. Despite having only 1 mapped function via __FILE__ (sub_6BE300 at 0x6BE300), the sweep analysis of the 0x6BAE70--0x6BE4A0 range identified approximately 40 functions compiled from this file. The discrepancy exists because nv_transforms.c uses NVIDIA's own assertion macros (not EDG's standard internal_error path), so most functions do not reference the EDG-style __FILE__ string.

Functions confirmed in the nv_transforms.c region:

Address     Identity                           Purpose
0x6BAE70    nv_init_transforms                 Zero all NVIDIA transform state at startup
0x6BAF70    alloc_mem_block                    64 KB memory block allocator for NV region pools
0x6BB290    reset_mem_state                    Emergency OOM recovery -- clear memory tracking
0x6BB350    init_memory_regions                Bootstrap region 0 and region 1 with initial blocks
0x6BB790    emit_device_lambda_wrapper         Generate __nv_dl_wrapper_t<> specialization
0x6BCC20    emit_lambda_preamble               Inject lambda wrapper preamble declarations
0x6BD490    emit_host_device_lambda_wrapper    Generate __nv_hdl_wrapper_t<> specialization
0x6BE300    (mapped function)                  Single function with EDG-style __FILE__ reference

Key infrastructure in this file:

  • __nv_dl_wrapper_t<> / __nv_hdl_wrapper_t<> struct template generation
  • Host reference array emission (.nvHRKE, .nvHRKI, .nvHRDE, .nvHRDI, .nvHRCE, .nvHRCI)
  • Capture count bitmask tables: unk_1286980 (device) and unk_1286900 (host-device), 128 bytes each
  • Lambda-to-closure entity mapping via hash table at qword_12868F0

nv_transforms.h

NVIDIA's hook header, #include-d from three EDG source files. It declares the functions that bridge standard EDG processing to NVIDIA's CUDA transform layer. The three inclusion sites represent the three points where EDG's standard C++ frontend cedes control to NVIDIA-specific logic:

  1. class_decl.c (sub_432280 at 0x432280): Called during class definition processing to apply CUDA execution-space attributes to closure types and validate lambda capture constraints.

  2. cp_gen_be.c (sub_47ECC0 at 0x47ECC0): Called during backend code generation to emit CUDA-specific output constructs (device stubs, host reference arrays, registration calls).

  3. src_seq.c (sub_719D20 at 0x719D20): Called during source sequence processing to inject NVIDIA preamble declarations and wrapper type definitions into the correct position in the declaration order.

Unmapped Regions (Gap Analysis)

Several address ranges within the EDG .text region contain functions that could not be mapped to any source file via __FILE__ strings. The major gaps and their probable contents:

Gap Range              Size       Probable Content                               Evidence
0x408B40--0x409350     ~2 KB      Static constructors (ctor_001--ctor_015)       No source path; global table initializers
0x447930--0x44B250     ~14 KB     class_decl.c / cmd_line.c boundary helpers     Between confirmed ranges
0x459630--0x461C20     ~34 KB     cmd_line.c tail + const_ints.c preamble        Unmapped option handlers
0x5E8300--0x5F7FD0     ~63 KB     IL display routines (il_to_str.c early body)   No assertions (display-only code)
0x665A60--0x666720     ~3 KB      layout.c / lexical.c boundary                  Small gap between confirmed ranges
0x689130--0x68ACC0     ~7 KB      lexical.c tail + literals.c preamble           Token/literal conversion helpers
0x6AB280--0x6AB6E0     ~1 KB      lower_name.c / macro.c boundary                Mangling helpers
0x6BA230--0x6BAE70     ~3 KB      mem_manage.c / nv_transforms.c boundary        Memory infrastructure
0x6EF7A0--0x6F2790     ~12 KB     overload.c / pch.c boundary                    Overload resolution helpers
0x6FC940--0x6FE160     ~6 KB      preproc.c / scope_stk.c boundary               Preprocessor tail
0x751470--0x7525F0     ~4 KB      sys_predef.c / target.c boundary               Predefined macro infrastructure
0x7A4690--0x7A4940     ~1 KB      trans_unit.c / types.c boundary                Translation unit helpers
0x7C2560--0x7D0EB0     ~59 KB     Type-name mangling / encoding for output       Between modules.c and floating.c
0x7D1C90--0x7DEB90     ~52 KB     float_type.h template instantiations           Confirmed via .h path strings
0x7DFFF0--0x82A000     ~304 KB    C++ runtime, demangler, soft-float, EH         Statically-linked libstdc++/libgcc

The largest unmapped gap within EDG code is the IL display region at 0x5E8300--0x5F7FD0 (0xFCD0 bytes, about 63 KB). These functions were compiled from il_to_str.c but contain no assertions because the display/dump subsystem was built without assertion macros -- it is purely diagnostic code that formats IL trees to stdout.

The float_type.h block at 0x7D1C90--0x7DEB90 (52 KB) is technically mapped via .h cross-references but has no .c file attribution because the template instantiations carry only the header's __FILE__ path.
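The gap analysis amounts to subtracting each file's main body end from the next file's main body start. A short sketch using four consecutive rows of the Source File Address Table; the find_gaps helper is hypothetical and only illustrates the arithmetic.

```python
# (file, main body start, main body end) from the Source File Address Table.
FILES = [
    ("il.c",        0x5B28F0, 0x5DFAD0),
    ("il_alloc.c",  0x5E0600, 0x5E8300),
    ("il_to_str.c", 0x5F7FD0, 0x6039E0),
    ("il_walk.c",   0x603FE0, 0x620190),
]


def find_gaps(ranges):
    """Return (gap start, gap end, size in bytes, file before, file after)."""
    ordered = sorted(ranges, key=lambda r: r[1])
    gaps = []
    for (prev_name, _, prev_end), (next_name, next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:
            gaps.append((prev_end, next_start, next_start - prev_end,
                         prev_name, next_name))
    return gaps


gaps = find_gaps(FILES)
largest = max(gaps, key=lambda g: g[2])

for start, end, size, before, after in gaps:
    print(f"{start:#x}--{end:#x}  {size / 1024:6.1f} KB  between {before} and {after}")
```

Running the same helper over all 52 rows reproduces the gap list; deciding what each gap contains still requires reading the functions inside it.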

Alphabetical Ordering Observation

The files are laid out in the binary in rough alphabetical order, consistent with a build system that compiles source files in directory-listing order and a linker that processes the resulting objects sequentially:

0x409350  attribute.c      (a)
0x419280  class_decl.c     (c)
0x44B250  cmd_line.c       (c)
0x461C20  const_ints.c     (c)
0x466F90  cp_gen_be.c      (c)
0x48A1B0  debug.c          (d)
0x48B3F0  decl_inits.c     (d)
0x4A1BF0  decl_spec.c      (d)
0x4B3970  declarator.c     (d)
0x4C0910  decls.c          (d)
0x4E9E70  disambig.c       (d)
0x4EDCD0  error.c          (e)
0x4F9870  expr.c           (e)
0x558720  exprutil.c       (e)
0x584CA0  extasm.c         (e)
0x585B10  fe_init.c        (f)
0x588D40  fe_wrapup.c      (f)
0x589550  float_pt.c       (f)
0x594B30  folding.c        (f)
0x5A51B0  func_def.c       (f)
0x5AD540  host_envir.c     (h)
0x5B28F0  il.c             (i)
0x5E0600  il_alloc.c       (i)
0x5F7FD0  il_to_str.c      (i)
0x603FE0  il_walk.c        (i)
0x620CE0  interpret.c      (i)
0x65EA50  layout.c         (l)
0x666720  lexical.c        (l)
0x68ACC0  literals.c       (l)
0x68FAB0  lookup.c         (l)
0x69C980  lower_name.c     (l)
0x6AB6E0  macro.c          (m)
0x6B6DD0  mem_manage.c     (m)
0x6BAE70  nv_transforms.c  (n)  [region start; mapped func at 0x6BE300]
0x6BE4A0  overload.c       (o)
0x6F2790  pch.c            (p)
0x6F61B0  pragma.c         (p)
0x6F9B00  preproc.c        (p)
0x6FE160  scope_stk.c      (s)
0x710F10  src_seq.c        (s)
0x719300  statements.c     (s)
0x726F20  symbol_ref.c     (s)
0x72D950  symbol_tbl.c     (s)
0x74C690  sys_predef.c     (s)
0x7525F0  target.c         (t)
0x7530C0  templates.c      (t)
0x796BA0  trans_copy.c     (t)
0x796E60  trans_corresp.c  (t)
0x7A3BB0  trans_unit.c     (t)
0x7A4940  types.c          (t)
0x7C0C60  modules.c        (m)  [breaks alphabetical order]
0x7D0EB0  floating.c       (f)  [breaks alphabetical order]

Two files break the alphabetical pattern: modules.c at 0x7C0C60 and floating.c at 0x7D0EB0. Both appear after types.c instead of in their expected positions (between mem_manage.c and nv_transforms.c for modules.c, between float_pt.c and folding.c for floating.c). This suggests these two files are compiled as separate translation units outside the main EDG source directory, or are added to the link line after the alphabetically-sorted EDG objects.
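The ordering check can be automated. The sketch below uses a representative subset of the start addresses listed above and plain byte-wise string comparison as the definition of alphabetical; both choices are assumptions of this example, and out_of_order is a hypothetical helper.

```python
# (main body start, source file): a subset of the listing above.
LAYOUT = [
    (0x409350, "attribute.c"),
    (0x419280, "class_decl.c"),
    (0x48B3F0, "decl_inits.c"),
    (0x4B3970, "declarator.c"),
    (0x4F9870, "expr.c"),
    (0x558720, "exprutil.c"),
    (0x5B28F0, "il.c"),
    (0x5E0600, "il_alloc.c"),
    (0x6B6DD0, "mem_manage.c"),
    (0x6BAE70, "nv_transforms.c"),
    (0x7A4940, "types.c"),
    (0x7C0C60, "modules.c"),
    (0x7D0EB0, "floating.c"),
]


def out_of_order(layout):
    """Return files that sort before an earlier-placed file (address order)."""
    names = [name for _, name in sorted(layout)]
    breaks = []
    highest = names[0]
    for name in names[1:]:
        if name < highest:
            breaks.append(name)    # placed after a file it should precede
        else:
            highest = name
    return breaks


misplaced = out_of_order(LAYOUT)
```

Tracking the highest name seen so far, rather than comparing only adjacent pairs, flags floating.c even though it directly follows another misplaced file.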

Data Source

All mappings were extracted from the binary's .rodata string table. The extraction command:

jq '[.[] | select(.value | test("/dvs/p4/.*\\.c$")) |
  {file: (.value | split("/") | last),
   xrefs: [.xrefs[].func] | length}
] | sort_by(.file)' cudafe++_strings.json

The full build path for every source file is:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/<filename>

Address ranges were verified against the 20 sweep reports (P1.01 through P1.20) produced during the binary analysis phase.

Binary Layout

cudafe++ ships as a single statically-linked, stripped ELF 64-bit x86-64 executable. Static linking pulls in the entirety of libstdc++ (locale facets, iostream, exception handling), Berkeley SoftFloat 3e (half/quad-precision arithmetic), and glibc CRT startup code. The resulting 8.5 MB binary has no external shared library dependencies -- it runs identically on any Linux x86-64 host regardless of installed C++ runtime version.

This page documents the complete segment and section layout, the internal organization of each major section, and the key data structures located within each region. All addresses are virtual addresses from the ELF load image.

ELF Header

Property        Value
Format          ELF 64-bit LSB executable
Architecture    x86-64 (AMD64)
Linking         Statically linked
Stripped        Yes (no debug symbols, no .symtab)
File size       8,910,936 bytes (8.5 MB)
Entry point     0x40918C (_start, glibc CRT)
Main            0x408950

Complete Section Table

Section              Start       End         Size (bytes)   Size (human)   Permissions   Purpose
LOAD (ELF hdr)       0x400000    0x402A18    10,776         10.5 KB        r-x           ELF headers and program header table
.init                0x402A18    0x402A30    24             24 B           r-x           Initialization stub (calls init_proc)
.plt                 0x402A30    0x403300    2,256          2.2 KB         r-x           Procedure Linkage Table (141 entries)
.text                0x403300    0x829722    4,351,010      4.15 MB        r-x           All executable code
.fini                0x829724    0x829732    14             14 B           r-x           Finalization stub (empty body)
.rodata              0x829740    0xAA3FA3    2,599,011      2.48 MB        r--           Read-only data
.eh_frame_hdr        0xAA3FA4    0xAB0350    50,092         48.9 KB        r--           Exception frame header index
.eh_frame            0xCB1210    0xD3F398    582,024        568.4 KB       rw-           Exception unwind tables (CFI)
.gcc_except_table    0xD3F398    0xD42854    13,500         13.2 KB        rw-           GCC LSDA exception handler tables
.ctors               0xD42858    0xD428B0    88             88 B           rw-           Constructor table (9 function pointers + 2 sentinels)
.dtors               0xD428B0    0xD428C0    16             16 B           rw-           Destructor table (2 sentinels, empty)
.data.rel.ro         0xD428C0    0xD45E00    13,632         13.3 KB        rw-           Vtables and relocation-read-only data
.got                 0xD45FC0    0xD45FF8    56             56 B           rw-           Global Offset Table
.got.plt             0xD46000    0xD46478    1,144          1.1 KB         rw-           GOT for PLT entries
.data                0xD46480    0xE7EFF0    1,280,880      1.22 MB        rw-           Initialized globals
.bss                 0xE7F000    0x12D6F20   4,554,528      4.34 MB        rw-           Zero-initialized globals
.tls                 0x12D6F20   0x12D6F38   24             24 B           ---           Thread-local storage (exception state)
extern               0x12D6F38   0x12D73A8   1,136          1.1 KB         ---           External symbol stubs

Total virtual address span: 0x12D73A8 - 0x400000 = 0xED73A8 bytes (15,561,640 bytes, about 14.8 MB), including the unmapped gaps between segments.
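Section sizes and the overall span follow directly from the start and end addresses, so the table can be checked mechanically. The sketch below copies a few rows from the section table; the dictionary layout is just a convenience for the example.

```python
# name: (start, end, size in bytes) copied from the section table.
SECTIONS = {
    ".plt":    (0x402A30, 0x403300, 2_256),
    ".text":   (0x403300, 0x829722, 4_351_010),
    ".rodata": (0x829740, 0xAA3FA3, 2_599_011),
    ".data":   (0xD46480, 0xE7EFF0, 1_280_880),
    ".bss":    (0xE7F000, 0x12D6F20, 4_554_528),
}

# Each listed size should equal end minus start.
mismatches = [name for name, (start, end, size) in SECTIONS.items()
              if end - start != size]

IMAGE_BASE = 0x400000
IMAGE_END = 0x12D73A8    # end of the last row ("extern")
span_bytes = IMAGE_END - IMAGE_BASE
span_mib = span_bytes / 2**20

print(f"mismatched sections: {mismatches}")
print(f"address span: {span_bytes:#x} bytes = {span_mib:.1f} MB")
```

The span is the distance from the image base to the end of the last region, so it counts address ranges that no section occupies, such as the hole between .eh_frame_hdr and .eh_frame.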

.text -- Executable Code (4.15 MB)

The .text section contains all 6,483 functions in the binary. It divides into four distinct regions, laid out contiguously by the linker:

0x403300                                                          0x829722
|-- assert stubs --|-- ctors --|---- EDG main body ----|-- C++ runtime ----|
0x403300    0x408B40  0x409350                  0x7DF400          0x829722
   22 KB      2 KB              3.84 MB                  304 KB
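A minimal sketch of how an address can be mapped to one of these four regions, using only the boundary addresses from the diagram. The function name `text_region` is hypothetical.

```python
from bisect import bisect_right
from typing import Optional

# Region start addresses from the diagram; the final value is the end of .text.
BOUNDS = [0x403300, 0x408B40, 0x409350, 0x7DF400, 0x829722]
NAMES = ["assert stubs", "constructors", "EDG main body", "C++ runtime"]


def text_region(addr: int) -> Optional[str]:
    """Return the .text region containing addr, or None if addr is outside .text."""
    if not (BOUNDS[0] <= addr < BOUNDS[-1]):
        return None
    # bisect_right gives the index of the first boundary greater than addr,
    # so the region that starts at or before addr is one position earlier.
    return NAMES[bisect_right(BOUNDS, addr) - 1]


# Byte span of each region, derived from the same boundaries.
REGION_SIZES = {name: BOUNDS[i + 1] - BOUNDS[i] for i, name in enumerate(NAMES)}
```

For example, `text_region(0x4F2930)` classifies the assert handler as part of the EDG main body.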

Assert Stub Region (0x403300 -- 0x408B40, 22 KB)

Contains 235 small __noreturn functions, each encoding a single assertion site. Every stub loads three constants -- the source file path (a string), the line number (an integer immediate), and the function name (a string) -- then calls sub_4F2930 (the internal_error handler in error.c). These stubs are called from the bodies of larger functions when an impossible condition is detected.

The linker groups all stubs from all 52 .c source files into this contiguous block, sorted approximately by source file name. Of the 235 stubs:

  • 200 map to .c source files (e.g., attribute.c:10897 at 0x403300, cp_gen_be.c:22342 at 0x4036F6)
  • 35 map to .h header files inlined into .c compilation units (e.g., types.h at 0x40345C)

Each stub is exactly 29 bytes: a lea for the file path, a mov for the line number, a lea for the function name, then a call to sub_4F2930.

Constructor Region (0x408B40 -- 0x409350, 2 KB)

Contains 9 C++ global constructor functions (ctor_001 through ctor_009) registered in the .ctors table. These run before main() via __libc_start_main's init callback at 0x829640. The constructors, in execution order:

| Constructor | Address | Identity | What It Initializes |
|---|---|---|---|
| ctor_001 | 0x408B40 | EDG diagnostic list | Doubly-linked list at E7FE40..E7FE68 (self-referencing empty sentinel) |
| ctor_002 | 0x408B90 | Stream state table | 13 qwords at 126ED80..126EDE0 (output channel array including 126EDF0 = stderr FILE*) |
| ctor_003 | 0x408C20 | EDG internal caches | ios_base::Init + 7 doubly-linked lists at 12C6A40, 12868C0..1286780 (symbol/type caches) |
| ctor_004 | 0x408E50 | Emergency exception pool | 72,704-byte malloc pool at 12D4870, free-list at 12D4868, with pthread mutex |
| ctor_005 | 0x408ED0 | Locale once-flags (set 1) | 8 flags at 12D6A68..12D6AA0 |
| ctor_006 | 0x408F50 | Locale once-flags (set 2) | 8 flags at 12D6AF0..12D6B28 |
| ctor_007 | 0x408FD0 | Locale once-flags (set 3) | 12 flags at 12D6D28..12D6D80 |
| ctor_008 | 0x409090 | Locale once-flags (set 4) | 12 flags at 12D6DE8..12D6E40 |
| ctor_009 | 0x409150 | Stream buffer destructors | __cxa_atexit for basic_streambuf<char> and basic_streambuf<wchar_t> |

Constructors 4--9 belong to statically-linked libstdc++. Only constructors 1--3 initialize EDG/NVIDIA state.

EDG Main Body (0x409350 -- 0x7DF400, 3.84 MB)

The core of the compiler. Contains 5,115 functions compiled from 52 EDG .c source files plus 3 NVIDIA-specific source files. Functions are laid out in approximate alphabetical order by source file name -- the linker processed object files in directory-listing order:

0x409350   attribute.c     (170 functions)
0x419280   class_decl.c    (264 functions)
0x44B250   cmd_line.c      (43 functions)
0x461C20   const_ints.c    (3 functions)
0x466F90   cp_gen_be.c     (201 functions)
  ...
0x6BE300   nv_transforms.c (1 mapped function, NVIDIA)
0x6BE4A0   overload.c      (281 functions)
  ...
0x7A4940   types.c
0x7C0C60   modules.c
0x7D0EB0   floating.c
  ~0x7DF400  end of EDG code

The 52 source files break down by subsystem:

| Subsystem | Files | Functions | Description |
|---|---|---|---|
| Parser | 15 .c | ~800 | Lexer, expression/declaration parser, statements |
| Type system | 6 .c | ~350 | Type representation, checking, conversion |
| Templates | 5 .c | ~300 | Parsing, instantiation, deduction |
| IL subsystem | 8 .c | ~250 | Node types, allocation, walking, display, comparison |
| Infrastructure | 12 .c | ~400 | Memory, errors, name mangling, scope management |
| Code generation | 3 .c | ~150 | Backend .int.c emission |
| NVIDIA additions | 3 .c | ~110 | CUDA transforms, attribute validation, lambda wrappers |

See Function Map for the complete address-to-source-file table.

C++ Runtime Region (0x7DF400 -- 0x829722, 304 KB)

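Statically-linked library code with no EDG source attribution. Contains approximately 900 functions, mainly from two libraries (Berkeley SoftFloat and libstdc++) plus the CRT startup code. The CUDA-aware demangler is also described here, although it sits just below this region: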

Berkeley SoftFloat 3e (0x7E0D30 -- 0x7E4150, ~80 functions). IEEE 754 arithmetic for half-precision (float16), extended precision (float80), and quad-precision (float128). Operations: add, sub, mul, div, sqrt, comparisons, int/float conversions. Global state at 12D4820 (exception flags) and 12D4821 (rounding mode). Used by the EDG floating.c subsystem for constant folding of non-native float types.

libstdc++ / libsupc++ (0x7E42E0 -- 0x829600, ~800 functions). The C++ runtime:

  • operator new/operator delete with new-handler retry loop (0x7E42E0)
  • Exception handling: __cxa_throw (0x823050), __cxa_begin_catch (0x822EB0), __cxa_allocate_exception (0x7E4750), std::terminate (0x8231A0)
  • Emergency exception pool: 72,704-byte fallback allocator for OOM during exception handling (0x7E45C0)
  • iostream initialization: ios_base::Init constructor/destructor (0x7E5650/0x7E5F20) setting up cout/cin/cerr + wide variants
  • Full locale system: 600+ functions implementing ctype, num_get, num_put, numpunct, collate, time_get/put, money_get/put, moneypunct, messages, and codecvt facets for both char and wchar_t

CUDA-aware name demangler (at 0x7CABB0, technically in the EDG tail region). NVIDIA's custom Itanium ABI demangler with extensions for CUDA lambda wrapper templates. Recognizes mangled prefixes: "Unvdl" for __nv_dl_wrapper_t<>, "Unvdtl" for __nv_dl_wrapper_t<> with trailing return, and "Unvhdl" for __nv_hdl_wrapper_t<>.

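CRT startup (0x40918C and 0x829640 -- 0x829722). _start at 0x40918C calls __libc_start_main(main@0x408950, init@0x829640, fini@0x8296D0). The routine at 0x8296E0 iterates backwards through function pointers starting at off_D428A0. That address is the last constructor slot of .ctors (0xD42858 + 8 + 8 x 8), not a .fini_array entry, so the routine matches the pattern of GCC's __do_global_ctors_aux, which walks .ctors from its end toward the -1 sentinel.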

.rodata -- Read-Only Data (2.48 MB)

The .rodata section at 0x829740 -- 0xAA3FA3 holds all constant data: string literals, jump tables, error message templates, IL metadata tables, and format strings. Major structures:

Error Message Table (off_88FAA0)

The EDG diagnostic system's message template table. An array of 3,795 const char* pointers, indexed by error code 0--3794:

off_88FAA0[0]    = ""                           // error 0: unused
off_88FAA0[1]    = "last line of file ends ..."  // error 1
  ...
off_88FAA0[3794] = "..."                         // error 3794

Each pointer references a NUL-terminated format string elsewhere in .rodata containing % fill-in specifiers (%t = type, %s = string, %n = name, %sq = quoted string, %p = position, %d = decimal). Error codes above 3456 are CUDA-specific (338 entries covering execution space violations, lambda restrictions, architecture feature gates). See Diagnostic Overview.
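As an illustration of the fill-in scheme, the sketch below scans a template for the specifiers listed above and classifies an error code as CUDA-specific. It is an assumption-laden sketch: the sample templates in the usage note are invented, the helper names are hypothetical, and any specifier forms beyond the six listed here (for example numbered variants) are not handled.

```python
import re
from typing import List

# Specifiers named in the text. Longer specifiers are tried first so that
# "%sq" is not misread as "%s" followed by a literal "q".
SPECIFIERS = {
    "%sq": "quoted string",
    "%t": "type",
    "%s": "string",
    "%n": "name",
    "%p": "position",
    "%d": "decimal",
}
_PATTERN = re.compile(
    "|".join(re.escape(s) for s in sorted(SPECIFIERS, key=len, reverse=True))
)

FIRST_CUDA_CODE = 3457  # "error codes above 3456" per the text
LAST_CODE = 3794        # last index of the 3,795-entry table


def fill_ins(template: str) -> List[str]:
    """Return the kinds of fill-in a template expects, in order of appearance."""
    return [SPECIFIERS[m.group(0)] for m in _PATTERN.finditer(template)]


def is_cuda_specific(code: int) -> bool:
    """True for codes in the CUDA-specific tail of the message table."""
    return FIRST_CUDA_CODE <= code <= LAST_CODE
```

With a made-up template such as `"identifier %sq is undefined"`, `fill_ins` reports a single quoted-string fill-in. Counting the codes for which `is_cuda_specific` holds gives 338, matching the figure above.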

IL Entry Kind Name Table (off_E6DD80)

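Maps the 85 entry_kind enum values (0--84) to human-readable strings. Used by the IL display subsystem (il_to_str.c) for debug output. Note that the pointer array itself sits at 0xE6DD80, which lies inside the .data address range (0xD46480 -- 0xE7EFF0); only the name strings it points to are in .rodata: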

off_E6DD80[0]  = "scope"
off_E6DD80[6]  = "type"
off_E6DD80[11] = "routine"
off_E6DD80[23] = "variable"
  ...
off_E6DD80[84] = "last"       // sentinel

The il_one_time_init function (sub_5CF7F0) validates at startup that this table ends with the "last" sentinel, catching version mismatches between the table and the enum.

EDG Source File Path Strings

Approximately 65 string literals of the form /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/<file>.<ext>. These are __FILE__ expansions embedded in assertion macros. Each is referenced by the corresponding assert stub in the 0x403300 region.

Jump Tables

Switch-statement jump tables for the major dispatch functions. The largest are:

  • Expression parser dispatch (~120 case targets)
  • Declaration specifier dispatch (~80 case targets)
  • IL walker entry-kind dispatch (~85 case targets)
  • Backend code generation dispatch (~90 case targets)

Format Strings

Printf-style format strings for the .int.c backend emitter. These include CUDA runtime boilerplate templates ("#include \"crt/host_runtime.h\"", "static __nv_managed_rt ...", "void __device_stub__...") and IL display format strings.

.data -- Initialized Globals (1.22 MB)

The .data section at 0xD46480 -- 0xE7EFF0 holds all initialized global variables. Major structures, ordered by address:

Attribute Descriptor Table (off_D46820)

The master attribute dispatch table, starting at 0xD46820 and extending to approximately 0xD47A60. Each entry is 32 bytes and describes one EDG/CUDA attribute kind: kind code (1 byte), flags (1 byte), name string pointer, validation function pointer, and application function pointer. See Attribute System Overview.
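One layout consistent with the field list and the 32-byte entry size is sketched below. The field offsets are an assumption (natural alignment, with 6 bytes of padding after the two 1-byte fields); the text confirms only the field list and the entry size, and the names `unpack_entry` and `entry_count` are hypothetical.

```python
import struct

# Assumed layout: kind (u8), flags (u8), 6 padding bytes, then three
# 8-byte little-endian pointers (name, validation fn, application fn).
ENTRY = struct.Struct("<BB6xQQQ")

TABLE_START = 0xD46820
TABLE_END_APPROX = 0xD47A60  # "approximately" per the text


def unpack_entry(raw: bytes) -> dict:
    """Decode one 32-byte descriptor under the assumed layout."""
    kind, flags, name_ptr, validate_fn, apply_fn = ENTRY.unpack(raw)
    return {
        "kind": kind,
        "flags": flags,
        "name_ptr": name_ptr,
        "validate_fn": validate_fn,
        "apply_fn": apply_fn,
    }


def entry_count() -> int:
    """Number of whole entries that fit in the approximate table range."""
    return (TABLE_END_APPROX - TABLE_START) // ENTRY.size
```

Under these assumptions the table range holds 146 entries. The offsets should be verified against the decompiled accessors before being relied on.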

Diagnostic Fill-in Tables (off_D481E0)

Named-label fill-in descriptors for the diagnostic system. Maps fill-in label strings to format specifier dispatch codes. Located at 0xD481E0.

Keyword Tables

The EDG keyword registration system stores keyword-to-token-ID mappings. Initialized during fe_translation_unit_init (sub_5863A0) with 200+ C/C++ keywords (from auto through co_yield), 60+ type trait intrinsics (__is_class, __has_trivial_copy, etc.), and CUDA extension keywords (__device__, __global__, __shared__, __constant__, __managed__, __launch_bounds__, __grid_constant__).

Error Severity Override Table

Maps error codes to their overridden severity levels. Populated by --diag_suppress, --diag_warning, --diag_error CLI flags.

libstdc++ Vtables (0xD428C0 -- 0xD45E00, in .data.rel.ro)

The .data.rel.ro section holds vtables for all statically-linked C++ classes. Key vtables:

| Address | Class |
|---|---|
| off_D42C00 | __gnu_cxx::__concurrence_lock_error |
| off_D42C28 | __gnu_cxx::__concurrence_unlock_error |
| off_D42CD8 | std::bad_alloc |
| off_D45740 | std::basic_istream<char> |
| off_D457C0 | std::basic_istream<wchar_t> |
| off_D45860 | std::basic_ostream<char> |
| off_D458E0 | std::basic_ostream<wchar_t> |
| off_D45A28 | std::basic_streambuf<char> |
| off_D45A78 | std::basic_streambuf<wchar_t> |

Exception Handler Pointers (0xE7EExx)

Located at the tail of .data:

| Address | Type | Identity |
|---|---|---|
| off_E7EEB0 | qword | atexit target: basic_streambuf<wchar_t> object |
| off_E7EEB8 | qword | atexit target: basic_streambuf<char> object |
| off_E7EEC0 | qword | std::unexpected_handler pointer |
| off_E7EEC8 | qword | std::terminate_handler pointer |

EDG Diagnostic List Head (0xE7FE40)

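A 40-byte doubly-linked list structure at 0xE7FE40..0xE7FE68. Initialized by ctor_001 as an empty self-referencing sentinel (both forward and backward pointers point to the list head). Used to chain diagnostic records during compilation. Note that 0xE7FE40 lies above the end of .data (0xE7EFF0) and inside .bss (which starts at 0xE7F000). This is consistent with the structure being set up at runtime by a constructor rather than loaded from the file image.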

.bss -- Zero-Initialized Globals (4.34 MB)

The .bss section at 0xE7F000 -- 0x12D6F20 is the largest section by virtual size. It contains all zero-initialized global state for both the EDG compiler and the statically-linked runtime. The .bss occupies no space in the ELF file on disk -- it is allocated and zeroed by the OS loader.

The 4.34 MB .bss divides into two logical regions, with the separate 24-byte .tls section following immediately after it:

EDG Compiler State (0xE7F000 -- 0x1290000, ~4.1 MB)

The bulk of .bss holds the EDG frontend's global state. Major structures:

Scope stack and symbol tables (~1.5 MB). The EDG scope stack (scope_stk.c) maintains nested scope contexts during parsing. Each scope entry is 784 bytes. The scope stack globals, various hash tables for name lookup, and the associated symbol table arrays consume the largest contiguous blocks.

IL region tracking (~800 KB). Region indices, region-to-scope mappings (qword_126EB90), region memory tables (qword_126EC88), and IL entry list heads. The region counter at dword_126EC80 tracks active regions. Each function definition creates a new region.

Translation unit state (~400 KB). The TU descriptor itself is dynamically allocated (424 bytes), but the per-TU global variables -- source file table, include stack, macro state, conditional compilation depth -- live in .bss. sub_7A4860 (reset_tu_state) zeroes these between compilations.

Parser state (~600 KB). Token lookahead buffers, declaration nesting depth, template argument stacks, expression evaluation context. The lexer maintains character classification tables and identifier hash buckets.

Error and diagnostic state (~200 KB). Error count (qword_126ED90), warning count (qword_126ED98), error limit (qword_126ED60), diagnostic suppression bitmaps, and the stream state table at 126ED80..126EDE0 (13 qwords including the stderr FILE* at qword_126EDF0).

Configuration flags (~100 KB). The 0x106xxxx region contains hundreds of dword flags set by CLI parsing and used throughout compilation. Examples:

| Address | Type | Identity |
|---|---|---|
| dword_106B640 | int | Keep-in-IL guard flag |
| dword_106B4B0 | int | Catastrophic error re-entry guard |
| dword_106B4BC | int | Warnings-as-errors recursion guard |
| dword_106B9E8 | int | TU stack depth |
| dword_106BA08 | int | TU-copy mode flag |
| dword_106BBB8 | int | Output format (0=text, 1=SARIF) |
| dword_106BCD4 | int | Predefined macro file mode |
| dword_106C088 | int | Warnings-are-errors mode |
| dword_106C188 | int | wchar_t keyword enabled |
| dword_106C254 | int | Skip backend (errors present) |
| dword_106C2C0 | int | GPU mode flag |
| dword_1065928 | int | Internal error re-entry guard |

Lambda capture bitmasks (~256 bytes). Two 1024-bit bitmasks recording which lambda capture counts were used during parsing:

| Address | Size | Identity |
|---|---|---|
| unk_1286900 | 128 bytes | Host-device lambda capture counts |
| unk_1286980 | 128 bytes | Device lambda capture counts |

Bit N set means a lambda with N captures was encountered, triggering emission of the corresponding __nv_dl_wrapper_t or __nv_hdl_wrapper_t specialization in the backend.
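The bitmask behaviour can be sketched as follows. The bit ordering (least-significant bit first within each byte) is an assumption that the text does not confirm, and the function names are hypothetical.

```python
from typing import List

MASK_BYTES = 128  # 128 bytes = 1024 bits, one bit per possible capture count


def set_capture_count(mask: bytearray, n: int) -> None:
    """Record that a lambda with n captures was seen."""
    if not 0 <= n < len(mask) * 8:
        raise ValueError(f"capture count {n} out of range")
    mask[n >> 3] |= 1 << (n & 7)  # assumed ordering: bit 0 is the LSB of byte 0


def used_capture_counts(mask: bytearray) -> List[int]:
    """Return every capture count whose bit is set, in ascending order."""
    return [
        byte_index * 8 + bit
        for byte_index, value in enumerate(mask)
        for bit in range(8)
        if (value >> bit) & 1
    ]


device_mask = bytearray(MASK_BYTES)
for count in (0, 3, 1023):
    set_capture_count(device_mask, count)
```

A backend pass would then walk `used_capture_counts(device_mask)` and emit one wrapper specialization per returned value.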

IL walker callbacks (5 function pointers at qword_126FB68..126FB88). The five IL tree-walk callback slots: entry filter, entry replace, pre-walk check, string callback, and entry callback. Swapped in and out by different IL traversal passes.

libstdc++ Runtime State (0x1290000 -- 0x12D6F20, ~280 KB)

SoftFloat globals (16 bytes). Exception flags at byte_12D4820, rounding mode at byte_12D4821.

Emergency exception pool (24 bytes of metadata). Free-list head (qword_12D4868), base address (qword_12D4870), capacity (qword_12D4878 = 72,704 bytes). The pool itself is heap-allocated at startup by ctor_004.

Locale system (~2 KB). The "C" locale singleton (unk_12D5E60), global locale impl pointer (qword_12D5E70), classic locale impl pointer (qword_12D5E78), character classification tables (12D5BE0..12D5D50), locale ID counter (dword_12D5E58), and pthread_once control variables.

iostream objects (~2 KB). The six standard stream objects and their backing file buffers:

| Address | Identity |
|---|---|
| 0x12D6000 | std::cerr |
| 0x12D6060 | std::cin |
| 0x12D60C0 | std::cout |
| 0x12D5EE0 | std::wcerr |
| 0x12D5F40 | std::wcin |
| 0x12D5FA0 | std::wcout |

Each stream object is backed by a basic_filebuf at a known offset (e.g., cout's filebuf at 0x12D67E0).

Demangler caches (40 bytes). Template argument cache at qword_12C7B40/12C7B48/12C7B50 (capacity/count/buffer pointer, grows by 500 entries via realloc). Block-scope suppress flag at dword_12C6A24.

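EDG internal lists (7 x 48 bytes). Seven doubly-linked list structures at 12868C0..1286780 initialized by ctor_003. Serve as symbol/scope/type caches with destructor sub_6BD820. These addresses fall below 0x1290000, so by address they belong to the EDG compiler state region; they are listed here alongside the other constructor-initialized structures.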

Thread-Local Storage (0x12D6F20 -- 0x12D6F38, 24 bytes)

The .tls section holds exactly 24 bytes of thread-local data. This is the __cxa_eh_globals structure (accessed via __readfsqword(0) - 16):

struct __cxa_eh_globals {
    void     *caught_exception_stack;   // +0x00: linked list of caught exceptions
    uint32_t  uncaughtExceptions;       // +0x08: count of in-flight exceptions
};

Despite cudafe++ being single-threaded, the TLS infrastructure exists because libstdc++ exception handling unconditionally uses TLS offsets compiled into the static library.

.ctors / .dtors -- Constructor/Destructor Tables

The .ctors section at 0xD42858 is 88 bytes: a -1 sentinel (8 bytes), 9 constructor function pointers (72 bytes), and a 0 terminator (8 bytes). The 9 constructors are ctor_001 through ctor_009 documented above.
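A minimal sketch of decoding a table with this shape. The blob in the usage note is synthetic, not bytes read from the binary, and `parse_ctors` and `backward_walk` are hypothetical names. The reverse walk mirrors the backward iteration attributed to the routine at 0x8296E0 above.

```python
import struct
from typing import List

SENTINEL = 0xFFFFFFFFFFFFFFFF  # -1 as an unsigned 64-bit value
TERMINATOR = 0


def parse_ctors(blob: bytes) -> List[int]:
    """Return the constructor pointers between the -1 sentinel and the 0 terminator."""
    if len(blob) % 8 != 0 or len(blob) < 16:
        raise ValueError("table must hold a whole number of qwords, at least two")
    words = struct.unpack(f"<{len(blob) // 8}Q", blob)
    if words[0] != SENTINEL or words[-1] != TERMINATOR:
        raise ValueError("missing -1 sentinel or 0 terminator")
    return list(words[1:-1])


def backward_walk(ctors: List[int]) -> List[int]:
    """Order in which a routine walking the table from its end visits the entries."""
    return list(reversed(ctors))
```

For an 88-byte table the parser yields exactly nine pointers. For the 16-byte .dtors shape (sentinel plus terminator) it yields an empty list.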

The .dtors section at 0xD428B0 is 16 bytes: a -1 sentinel and a 0 terminator. No destructors are registered -- all cleanup is done via __cxa_atexit handlers registered during construction.

.eh_frame / .gcc_except_table -- Exception Handling

The .eh_frame section (582,024 bytes, 568 KB) contains DWARF Call Frame Information (CFI) records for stack unwinding during C++ exception propagation. The .gcc_except_table section (13.2 KB) contains GCC Language-Specific Data Area (LSDA) records that map program counters to catch handlers and cleanup functions.

The .eh_frame_hdr section (48.9 KB) is a binary search index into .eh_frame, enabling O(log n) lookup of unwind information by instruction pointer during exception throw.
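The lookup can be sketched as a binary search over a table sorted by function start address. The sketch also estimates the number of table entries from the section size. That estimate assumes the common layout of a 12-byte header followed by 8-byte entries, which has not been verified against this binary. All addresses in the usage note are synthetic.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

HDR_SIZE = 50_092   # .eh_frame_hdr size from the section table
HEADER_BYTES = 12   # assumption: 4 encoding bytes + 4-byte eh_frame_ptr + 4-byte fde_count
ENTRY_BYTES = 8     # assumption: 4-byte initial location + 4-byte FDE address


def estimated_entries() -> int:
    """Estimated number of search-table entries under the assumed encoding."""
    return (HDR_SIZE - HEADER_BYTES) // ENTRY_BYTES


def find_fde(table: List[Tuple[int, int]], pc: int) -> Optional[int]:
    """table holds (initial_location, fde_address) pairs sorted by initial_location.

    Returns the FDE address of the last entry starting at or before pc.
    A real unwinder would still check that pc lies inside that FDE's range.
    """
    starts = [start for start, _ in table]
    i = bisect_right(starts, pc) - 1
    return table[i][1] if i >= 0 else None
```

Under the assumed encoding the section holds 6,260 entries, which is in the same range as the binary's function count.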

These sections exist because libstdc++ exception handling requires them. cudafe++ itself rarely throws exceptions -- the EDG frontend uses longjmp-based error recovery. However, the statically-linked libstdc++ code (particularly operator new and locale initialization) uses C++ exceptions internally.

.plt / .got.plt -- PLT Stubs

The .plt section (2.2 KB, 141 entries) and .got.plt (1.1 KB) route calls to glibc internal symbols through GOT slots, even though the binary is statically linked. Each PLT stub follows the standard x86-64 pattern: an indirect jump through its GOT slot, followed by a fallback path into the dynamic linker's resolver. No dynamic linker is present in a static binary, so the fallback path never executes and no lazy binding takes place; the GOT slots hold their final targets before main() runs. In statically linked glibc binaries, PLT entries of this kind are typically IFUNC targets resolved once by startup code, which is the most likely explanation here.

Static Libraries Linked

The binary statically links four library components:

| Library | Functions | .text Range | Purpose |
|---|---|---|---|
| libstdc++ (locale) | ~600 | 0x7EA800 -- 0x829600 | Full locale facet implementations |
| libstdc++ (iostream/exception) | ~60 | 0x7E42E0 -- 0x7EA800 | Streams, exceptions, operator new |
| Berkeley SoftFloat 3e | ~80 | 0x7E0D30 -- 0x7E4150 | float16/float80/float128 arithmetic |
| glibc CRT | ~10 | 0x40918C, 0x829640 -- 0x829722 | _start, init, fini |

No shared libraries are loaded at runtime. The binary is fully self-contained.

Virtual Address Space Map

0x400000 +-----------------------+
         | ELF headers           |  10.5 KB
0x402A18 | .init                 |  24 B
0x402A30 | .plt                  |  2.2 KB
0x403300 | .text                 |  4.15 MB
         |   assert stubs        |    22 KB    (0x403300 - 0x408B40)
         |   constructors        |    2 KB     (0x408B40 - 0x409350)
         |   EDG main body       |    3.84 MB  (0x409350 - 0x7DF400)
         |   C++ runtime         |    304 KB   (0x7DF400 - 0x829722)
0x829722 | padding               |
0x829724 | .fini                 |  14 B
0x829740 | .rodata               |  2.48 MB
         |   error table         |    30 KB    (off_88FAA0)
         |   string literals     |    ~2 MB
         |   IL kind names       |    <1 KB    (strings; pointer table off_E6DD80 is in .data)
         |   jump tables         |    ~400 KB
0xAA3FA4 | .eh_frame_hdr         |  48.9 KB
         |       [gap]           |
0xCB1210 | .eh_frame             |  568 KB
0xD3F398 | .gcc_except_table     |  13.2 KB
0xD42858 | .ctors                |  88 B
0xD428B0 | .dtors                |  16 B
0xD428C0 | .data.rel.ro          |  13.3 KB   (vtables)
0xD45E00 |       [padding/GOT]   |
0xD46480 | .data                 |  1.22 MB
         |   attribute table     |    ~5 KB    (off_D46820)
         |   keyword tables      |    variable
         |   handler pointers    |    (at 0xE7EExx)
         |   diagnostic list     |    (at 0xE7FE40, inside .bss)
0xE7EFF0 |       [padding]       |
0xE7F000 | .bss                  |  4.34 MB
         |   EDG compiler state  |    ~4.1 MB  (0xE7F000 - 0x1290000)
         |   libstdc++ state     |    ~280 KB  (0x1290000 - 0x12D6F20)
0x12D6F20| .tls                  |  24 B
0x12D6F38| extern                |  1.1 KB
0x12D73A8+-----------------------+

Key Observations

The .bss dominates. At 4.34 MB, the .bss is the largest section -- larger than .text. This reflects the EDG frontend's design: hundreds of global variables hold parser state, scope stacks, symbol tables, and IL region metadata. A reimplementation should strongly consider replacing these globals with a context struct passed through the call chain.

Static linking adds 304 KB of dead-weight code. The C++ runtime region (0x7DF400 -- 0x829722) contains 900 functions, the majority of which (600+ locale facet methods) are never called by cudafe++. The locale system is pulled in transitively through iostream initialization. A reimplementation that avoids std::cout/std::cerr could eliminate this entirely.

The EDG code is tightly packed. The 3.84 MB EDG main body has almost no inter-function padding. Functions from the same source file are contiguous, and the alphabetical ordering by filename is consistent across the entire range. This makes address-to-source-file attribution reliable.

The binary is position-dependent. No PIE (Position-Independent Executable) flag is set. All code references use absolute addressing. The .got is minimal (56 bytes / 7 entries) -- almost all data references are direct.

Methodology

This page documents the reverse engineering methodology used to produce every page in this wiki. The goal is full transparency: a reader should be able to reproduce any finding by following the same techniques against the same binary. Every claim in the wiki traces back to one of four evidence categories (CONFIRMED, HIGH, MEDIUM, LOW), and this page defines exactly what each level means, what tools produced the raw data, and how that data was refined into the structured documentation that follows.

Toolchain

| Component | Version | Role |
|---|---|---|
| IDA Pro | 9.0 (64-bit) | Interactive disassembler and database host |
| Hex-Rays | x86-64 decompiler (IDA 9.0 bundled) | Pseudocode generation for all 6,483 functions |
| IDAPython | 3.x (IDA-embedded) | Scripted extraction via analyze_cudafe++.py (531 lines) |
| Target binary | cudafe++ from CUDA Toolkit 13.0 | ELF 64-bit, statically linked, stripped, 8,910,936 bytes |
| IDA database | cudafe++.i64 | 247 MB analysis state (all function boundaries, xrefs, type info, decompilation caches) |

The binary was loaded into IDA Pro 9.0 with default x86-64 analysis settings. IDA's auto-analysis resolved all code/data boundaries, generated function boundaries for 6,483 functions, and identified 52,489 string literals. The Hex-Rays decompiler was invoked on all 6,483 functions; the IDAPython extraction log reports 6,343 successful decompilations (the remaining 140 failures are exception personality routines, SoftFloat leaf functions, and tiny thunks where Hex-Rays cannot reconstruct a valid C AST). However, due to function-name collisions in the output filenames (multiple sub_XXXXXX entries mapping to the same sanitized name after / replacement), the actual decompiled output directory contains 6,202 unique .c files -- the number used throughout this wiki.

Extraction Script

All raw data was exported from the IDA database in a single automated pass using analyze_cudafe++.py, an IDAPython script that runs inside IDA's scripting environment. The script produces 12 output artifacts:

| Artifact | File | Records | Size | Description |
|---|---|---|---|---|
| String table | cudafe++_strings.json | 52,489 strings | 9.2 MB | Every string literal with address, type, and all cross-references |
| Function table | cudafe++_functions.json | 6,483 functions | 12 MB | Address, size, instruction count, callers, callees per function |
| Import table | cudafe++_imports.json | 142 imports | 16 KB | Imported PLT symbols (glibc wrappers in static binary) |
| Segment table | cudafe++_segments.json | 26 segments | 3.3 KB | ELF section addresses, sizes, types, permissions |
| Cross-reference table | cudafe++_xrefs.json | 1,243,258 xrefs | 154 MB | Every code and data xref with source function attribution |
| Comment table | cudafe++_comments.json | 22,911 comments | 2.0 MB | All IDA comments (regular + repeatable) |
| Name table | cudafe++_names.json | 54,771 names | 3.5 MB | All named locations (IDA auto-names + user-defined) |
| Call graph | cudafe++_callgraph.json + .dot | 67,756 edges | 7.4 MB | Complete inter-procedural call graph (5,057 unique callers, 5,382 unique callees) |
| .rodata dump | cudafe++_rodata.bin | 2,599,011 bytes | 2.5 MB | Raw bytes of the read-only data section |
| Disassembly | disasm/<func>_<addr>.asm | 6,342 files | 86 MB | Per-function annotated disassembly with hex bytes |
| CFG graphs | graphs/<func>_<addr>.json + .dot | 12,684 files | 184 MB | Per-function basic-block graph with instructions and edges (JSON + DOT) |
| Decompiled code | decompiled/<func>_<addr>.c | 6,202 files | 38 MB | Hex-Rays pseudocode per function |

Script Architecture

The script is structured as a main() function that calls idaapi.auto_wait() to block until IDA's auto-analysis completes, then executes 12 extraction passes in a fixed order. Output is written to four directories: the root output directory (JSON databases), graphs/ (per-function CFGs), disasm/ (per-function disassembly), and decompiled/ (per-function pseudocode). Directories are created if they do not exist.

The 12 passes, in execution order:

  1. export_all_strings() -- Enumerates idautils.Strings(), then for each string walks XrefsTo(string_ea) to record every function that references it. Each string entry captures the address, string value, string type code, and a list of xref records ({from_addr, func_name, xref_type}). This is the foundation for source attribution (see below).

  2. export_all_functions() -- For each function in idautils.Functions(), records start/end address, size, instruction count (via idc.is_code() on each head), library flag (FUNC_LIB), thunk flag (FUNC_THUNK), and builds caller/callee lists. Callers are found via XrefsTo(func_start); callees via XrefsFrom(head) filtered to call-type xrefs (fl_CF = type 16, fl_CN = type 17).

  3. export_imports() -- Enumerates all imported modules via idaapi.get_import_module_qty() and idaapi.enum_import_names(). Records module name, symbol name, address, and ordinal for each of the 142 glibc imports.

  4. export_segments() -- Iterates idautils.Segments() to record each ELF section's name, start/end address, size, type code, and permission bits.

  5. export_xrefs() -- Full enumeration of all cross-references from every instruction head in every function. For each xref, records source address, source function, target address, target function (if any), and xref type code. Produces the 1,243,258-record xref table. The six xref type codes in the output:

    | Type | Code | Count | Meaning |
    |---|---|---|---|
    | dr_O | 1 | 29,631 | Data offset reference |
    | dr_W | 2 | 11,488 | Data write reference |
    | dr_R | 3 | 42,364 | Data read reference |
    | fl_CN | 17 | 67,756 | Code near call |
    | fl_JN | 19 | 189,364 | Code near jump |
    | fl_F | 21 | 902,655 | Ordinary flow (fall-through to the next instruction) |

  6. export_comments() -- Walks every instruction head in the database via idautils.Heads(), extracting both regular comments (idc.get_cmt(ea, 0)) and repeatable comments (idc.get_cmt(ea, 1)).

  7. export_names() -- Iterates idautils.Names() to export all named locations (function names, data labels, IDA auto-generated names).

  8. extract_rodata() -- Reads the raw bytes of the .rodata segment via ida_bytes.get_bytes() and writes them to a binary file. Used for offline string scanning and jump table analysis.

  9. export_callgraph() -- Builds the 67,756-edge call graph by iterating every function and scanning its instruction heads for outgoing call xrefs (fl_CN, fl_CF). Output in both JSON (array of {from, from_addr, to, to_addr} edge records) and Graphviz DOT format (67,759 lines).

  10. export_complete_disassembly() -- Per-function disassembly files. For each function, iterates all instruction heads within the function's address range, generating hex byte dumps alongside disassembly text via idc.generate_disasm_line(). Each file includes a header with function name, address range, and byte size.

  11. export_function_graphs() -- Per-function control flow graphs via idaapi.FlowChart(). For each basic block: block ID, start/end address, size, and full instruction listing. Block-to-block edges (fall-through and branch targets) are extracted via block.succs(). Output as both JSON (structured blocks + edges) and DOT (for Graphviz visualization).

  12. export_decompilation() -- Calls idaapi.init_hexrays_plugin() to initialize the Hex-Rays decompiler, then iterates all functions and calls idaapi.decompile(func_ea). On success, the pseudocode string (str(cfunc)) is written to a .c file with a header comment containing the function name and address. Failures are caught by a broad except Exception handler and silently skipped.

The script is invoked via IDA's headless batch mode or interactive scripting console. It does not call qexit() at the end, allowing the IDA database to remain open for further interactive analysis after extraction. Total extraction time is approximately 30-45 minutes on a workstation-class machine, dominated by the 6,483 decompilation calls in pass 12.
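The per-type counts reported for pass 5 can be sanity-checked offline, without IDA. The sketch below works on numeric type codes only. The record field name `type` is an assumption about the JSON schema, and the sample records are synthetic.

```python
from collections import Counter
from typing import Dict, Iterable

# Counts per numeric xref type code, as reported for pass 5.
REPORTED = {1: 29_631, 2: 11_488, 3: 42_364, 17: 67_756, 19: 189_364, 21: 902_655}
TOTAL_XREFS = 1_243_258
CALL_GRAPH_EDGES = 67_756


def tally(xrefs: Iterable[Dict]) -> Counter:
    """Histogram of xref records by numeric type code (field name assumed)."""
    return Counter(rec["type"] for rec in xrefs)


def reported_counts_consistent() -> bool:
    """The per-type counts should add up to the total size of the xref table."""
    return sum(REPORTED.values()) == TOTAL_XREFS


sample = [{"type": 17}, {"type": 21}, {"type": 21}, {"type": 3}]
histogram = tally(sample)
```

The type-17 count equals the number of call graph edges, which suggests that the call graph is built from near calls only.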

Source Attribution Technique

The single most powerful technique in this analysis is source attribution via __FILE__ strings. The EDG C++ frontend uses C-style assertions throughout its codebase. When an assertion fires, the handler receives the source file path, line number, and function name as compile-time string constants embedded by the __FILE__, __LINE__, and __func__ macros. Because the binary is stripped (no .symtab), these assertion strings are the only surviving link to the original source tree.

The Assert Handler

The central assert handler is sub_4F2930, located in error.c. It is a __noreturn function that formats and emits an internal compiler error message, then terminates the process. A total of 2,139 functions in the binary call sub_4F2930, with 5,178 total call sites (many functions have multiple assertion points throughout their bodies).

The highest-density callers are the 235 assert stubs in the region 0x403300--0x408B40. Each stub is exactly 29 bytes: three register loads (source file path via lea rdi, line number via mov esi, function name via lea rdx) followed by a call to sub_4F2930:

sub_403300:         ; assert stub for is_aliasable (attribute.c:10897)
  lea  rdi, aAttributeC    ; "/dvs/p4/.../EDG_6.6/src/attribute.c"
  mov  esi, 10897           ; line number (integer, not string)
  lea  rdx, aIsAliasable   ; "is_aliasable"
  call sub_4F2930           ; internal_error(__FILE__, __LINE__, __func__)

Of the 235 stubs, 200 reference .c file paths and 35 reference .h file paths (inlined assertions from header files). The stubs are sorted approximately by source file name within the stub region -- the linker grouped them from all 52 .c compilation units into one contiguous block.

Beyond the dedicated stubs, 1,904 additional functions contain inline assertion checks: the lea rdi, <file_path> instruction appears within the function body at the assertion site, not in a separate stub. These inline assertions provide the same source-file attribution as the stubs.

The Attribution Chain

The attribution chain works in three steps:

  1. String discovery. Extract all strings matching the EDG build path prefix /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/. This yields one string per source file, each cross-referenced by the assert stubs that load it.

  2. Xref tracing. For each assert stub, follow XrefsTo() to find which main-body functions call it. A function at 0x40DFD0 that calls the attribute.c:5108 stub was compiled from attribute.c. This attributes the caller to the source file.

  3. Range extension. Assert stubs are sparse -- not every function contains an assertion. Once a set of functions in a contiguous address range are attributed to the same source file, the entire range is assigned to that file. This works because the linker places all object code from a single .c file contiguously, and the files are arranged roughly alphabetically by filename.
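Step 3 can be sketched as an interval fill over the sorted function list. This is a conservative variant written for illustration, not the procedure actually used: it fills only the gaps between two neighbouring anchors that agree on the source file, and leaves functions between anchors of different files unattributed. The function name and sample addresses are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List


def extend_ranges(functions: List[int], anchors: Dict[int, str]) -> Dict[int, str]:
    """functions: sorted function start addresses.
    anchors: start address -> source file, for functions attributed by xref tracing.
    Returns anchors plus every function lying between two same-file anchors."""
    result = dict(anchors)
    anchored = [addr for addr in functions if addr in anchors]
    for left, right in zip(anchored, anchored[1:]):
        if anchors[left] != anchors[right]:
            continue  # file boundary lies somewhere in this gap; do not guess
        lo = bisect_right(functions, left)
        hi = bisect_left(functions, right)
        for addr in functions[lo:hi]:
            result[addr] = anchors[left]
    return result


functions = [0x409350, 0x409400, 0x409500, 0x419280, 0x419300, 0x419400]
anchors = {0x409350: "attribute.c", 0x409500: "attribute.c", 0x419280: "class_decl.c"}
attributed = extend_ranges(functions, anchors)
```

Here 0x409400 inherits attribute.c because both of its anchored neighbours agree, while 0x419300 and 0x419400 stay unattributed because no second class_decl.c anchor bounds them.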

This technique attributed 2,209 functions (34.1% of the binary) to specific source files. The remaining 4,274 functions fall into three categories: C++ runtime code (1,085 functions from libstdc++/glibc, identifiable by address range), PLT/init stubs (283 functions), and unmapped EDG functions (2,906 functions that contain no assertions and cannot be confidently attributed).

Build Path

The full build path embedded in the binary is:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/

This reveals the NVIDIA internal Perforce depot structure (/dvs/p4/), the release branch (r13.0), and the EDG version (EDG_6.6). It confirms the binary was built from EDG C++ Front End version 6.6, licensed from Edison Design Group.

Confidence Levels

Every identification in the raw sweep reports and wiki pages carries one of four confidence levels:

| Level | Tag | Criteria | Example |
|---|---|---|---|
| CONFIRMED | Direct match | The function's identity is proven by an assertion string that encodes the exact function name, source file, and line number. No ambiguity. | sub_403300 loads "is_aliasable" + "attribute.c" + "10897" -- it is the assertion stub for is_aliasable() in attribute.c at line 10897. |
| HIGH | String + callgraph | The function references a distinctive string (error message, format string, keyword literal) AND its position in the call graph is consistent with a single plausible identity. | sub_459630 references 276 CLI flag strings and is called from main() at the position where command-line processing occurs -- identified as proc_command_line(). |
| MEDIUM | Pattern + context | The function matches a known EDG pattern (struct layout access, IL node walking, type query) and its address falls within the expected source file range, but no string or assertion directly confirms the identity. | A function at 0x5B3000 accesses the IL node kind field at the expected struct offset and falls within the il.c address range -- likely an IL accessor, but the specific function name is inferred. |
| LOW | Address proximity | The function's address falls within a source file's range, but no internal evidence (strings, struct accesses, callees) distinguishes it from neighboring functions. The attribution is based solely on the linker's contiguous placement of object code. | A small leaf function at 0x5B2F80 sits between two il.c-attributed functions -- probably from il.c, but it could be an inlined header function. |

In practice, approximately 34% of functions are CONFIRMED (via assert strings), ~20% are HIGH (via distinctive strings or unique callgraph positions), ~25% are MEDIUM, and ~21% are LOW or unattributed.

Call Graph Analysis

The complete call graph contains 67,756 edges connecting the 6,483 functions. This graph is the primary tool for understanding system architecture -- which subsystems call which, where the hot paths are, and how NVIDIA's additions integrate with the EDG base.

Hub Identification

Hub functions -- those with exceptionally high in-degree (many callers) or out-degree (many callees) -- reveal the architectural spine of the compiler:

| Hub Type | Function | Description | Degree |
|---|---|---|---|
| Top callee | sub_4F2930 | internal_error handler | 2,139 calling functions (235 assert stubs + 1,904 functions with inline assertions) |
| Top callee | Type query functions (104 total) | is_class_or_struct_or_union_type, etc. | 407 call sites for top query |
| Top caller | sub_7A40A0 | process_translation_unit | Calls into parser, IL, type system |
| Top caller | sub_459630 | proc_command_line (4,105-line monster) | Touches 276 flag variables |
| Top caller | sub_585DB0 | fe_one_time_init | 38 subsystem initializer calls |
| Cross-module bridge | sub_6BCC20 | Lambda preamble injection (NVIDIA) | Called from EDG statement handlers |
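Hub identification reduces to counting in-degree and out-degree over the edge list. The sketch below assumes the call graph is available as (caller, callee) pairs. The export format and the function names in the example are hypothetical.

```python
from collections import Counter


def degree_tables(edges):
    """Return (in_degree, out_degree) over distinct caller/callee pairs."""
    unique = set(edges)  # repeated call sites count once per pair
    in_deg = Counter(callee for _, callee in unique)
    out_deg = Counter(caller for caller, _ in unique)
    return in_deg, out_deg


def top_hubs(counter, n=3):
    """The n highest-degree functions as (name, degree) tuples."""
    return counter.most_common(n)


# Toy graph; the duplicate edge models two call sites in one caller.
edges = [
    ("main", "init"), ("main", "parse"), ("main", "err"),
    ("main", "err"), ("parse", "err"), ("init", "err"),
]
in_deg, out_deg = degree_tables(edges)
```

Counting distinct pairs gives "number of callers"; dropping the `set()` would instead give "number of call sites", which is the other figure quoted in the table.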

Graph Structure

The call graph exhibits a layered structure typical of compiler frontends:

  1. Entry layer. main() at 0x408950 calls exactly 8 stage functions in sequence.
  2. Stage layer. Each stage function (init, CLI, parse, wrapup, backend) fans out to dozens of subsystem entry points.
  3. Core layer. The parser (expr.c, decls.c, statements.c) calls into the type system (types.c, exprutil.c), IL builder (il.c, il_alloc.c), and name lookup (lookup.c, scope_stk.c).
  4. Leaf layer. Memory management (mem_manage.c), error reporting (error.c), and type queries form the bottom of the call hierarchy, referenced from almost every subsystem.

NVIDIA's nv_transforms.c sits as a lateral extension at the core layer: it is called from class_decl.c, cp_gen_be.c, and statements.c (via nv_transforms.h inlines), but does not itself call back into the EDG parser. This clean separation suggests NVIDIA modifies the EDG source minimally, preferring to hook into existing EDG extension points rather than fork the core.

String-Based Discovery

The binary contains 52,489 strings in .rodata. These strings are the second most important evidence source after the assertion paths. Major categories:

| Category | Approximate Count | Usage |
|---|---|---|
| EDG assertion paths (/dvs/p4/...) | 65 (52 .c + 13 .h) | Source attribution |
| CUDA keyword strings | ~300 | Keyword table initialization, CLI flag names |
| Error message templates | ~3,800 | Diagnostic emission (off_88FAA0 error table, 3,795 entries) |
| C/C++ keyword strings | ~200 | Lexer token recognition |
| Format strings (%s, %d, etc.) | ~500 | Output formatting in .int.c emission and diagnostics |
| IL kind names | ~200 | IL node type display (off_E6DD80 table) |
| Type name fragments | ~400 | Mangling output, type display |
| CUDA architecture names (sm_XX) | ~50 | Architecture feature gating |
| Internal EDG config strings | ~200 | Build configuration, feature flags |

String Mining Techniques

Three string mining techniques are used throughout the analysis:

  1. Error message tracing. CUDA-specific error messages (e.g., "calling a __host__ function from a __device__ function is not allowed") are grepped from the string table, their xrefs traced to the emitting function, and the emitting function's callers analyzed to understand the validation logic that triggers the error.

  2. Keyword enumeration. The keyword initialization function (sub_5863A0) loads 200+ string constants in sequence. By reading the strings in load order, the complete CUDA keyword vocabulary is recovered -- including internal-only keywords not documented in the CUDA C++ Programming Guide.

  3. Format string analysis. Format strings in the backend (cp_gen_be.c) reveal the exact syntax of .int.c output. A string like "static void __device_stub__%s(" tells us the precise naming convention for device stub wrapper functions.

Decompilation Quality

Hex-Rays produces readable pseudocode for the vast majority of functions, but several systematic limitations affect the analysis:

Control Flow Artifacts

Hex-Rays occasionally introduces control flow constructs that do not exist in the original source. The most prominent example is the while(1) loop in main() (sub_408950): the decompiler wraps the entire function body in an infinite loop because a setjmp-based error recovery mechanism creates a backward edge in the CFG. In reality, main() executes linearly and returns -- the while(1) is a decompiler artifact, not a real loop.

Similar artifacts appear in functions with complex switch statements (EDG uses computed gotos for performance), where Hex-Rays may produce nested if-else chains instead of the flat dispatch table the original code uses.

Lost Preprocessor Logic

The original EDG source makes heavy use of preprocessor conditionals (#if CUDA_SUPPORT, #ifdef FRONT_END_CPFE, etc.). The compiled binary contains only the taken branch -- the preprocessor evaluated all conditions at build time. This means the decompiled code shows the CUDA-enabled configuration only; any host-only or non-CUDA EDG behavior is invisible.

Similarly, C macros that wrap common patterns (assertion macros, IL access macros, type query macros) are fully expanded in the binary. The decompiled output shows the expanded form -- a sequence of struct field accesses and conditional jumps -- rather than the concise macro invocation the original source used.

Unnamed Variables

The binary is stripped. All local variable names are lost. Hex-Rays assigns synthetic names (v1, v2, a1, a2) based on register allocation and stack slot positions. Function parameters are named a1 through aN in declaration order. During analysis, meaningful names are sometimes manually applied in the IDA database, but most decompiled output uses the synthetic names.

Structure field accesses appear as byte-offset expressions (*((_BYTE *)a1 + 182)) rather than named fields (entity->execution_space). Reconstructing the structure layouts from these offset patterns is a core part of the analysis -- see the Entity Node Layout page for the most extensively reconstructed structure.
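A first pass at layout reconstruction can be automated by harvesting these offset expressions from the pseudocode. The sketch below is an assumption about how one might do it, not the tooling behind this wiki. It handles only the `*((_TYPE *)var + N)` form with decimal indices, and it scales the index by the element size because C pointer arithmetic on a cast pointer is in units of the pointee type.

```python
import re
from collections import defaultdict

# Sizes of the Hex-Rays integer pseudo-types handled by this sketch.
ELEM_SIZE = {"_BYTE": 1, "_WORD": 2, "_DWORD": 4, "_QWORD": 8}
ACCESS_RE = re.compile(
    r"\*\(\((_BYTE|_WORD|_DWORD|_QWORD) \*\)(\w+) \+ (\d+)\)")


def collect_field_accesses(pseudocode, base="a1"):
    """Map byte offset -> set of access widths seen for accesses off `base`."""
    fields = defaultdict(set)
    for ctype, var, index in ACCESS_RE.findall(pseudocode):
        if var == base:
            size = ELEM_SIZE[ctype]
            # The index counts elements, so convert it to a byte offset.
            fields[int(index) * size].add(size)
    return dict(fields)


# Hypothetical pseudocode fragment in Hex-Rays style.
text = ("if ( (*((_BYTE *)a1 + 182) & 0x40) != 0 ) "
        "v3 = *((_QWORD *)a1 + 2); x = *((_DWORD *)a2 + 4);")
layout = collect_field_accesses(text)
```

Running the same collection over every function that receives the same struct, then merging the results, is what makes conflicting widths at one offset visible.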

Decompilation Failures

The IDAPython extraction log reports 6,343 successful decompilations out of 6,483 attempts (140 failures). Due to filename collisions in the output directory (functions with identical sanitized names at different addresses overwrite each other), the actual output directory contains 6,202 unique .c files. The 281 "missing" files break down as:

| Category | Count | Reason |
|---|---|---|
| Hex-Rays decompilation failure | ~140 | Exception personality routines, SoftFloat leaf functions, tiny thunks, irreducible CFG |
| Filename collisions (overwritten) | ~141 | Multiple functions with the same IDA name (after / to _ sanitization) write to the same output path |

The 140 true decompilation failures are concentrated in the C++ runtime region (0x7DF400--0x829722), particularly in the libstdc++ locale facet implementations (complex template instantiations with deeply nested virtual dispatch) and Berkeley SoftFloat 3e functions (pure arithmetic with non-standard calling conventions). For these functions, analysis relies on the raw disassembly output in disasm/ instead.
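The collision count follows from the figures above: 6,483 − 140 = 6,343 files written, 6,343 − 6,202 = 141 overwritten. The sketch below shows how such overwrites arise under the `/` to `_` sanitization. The function names in the example are invented for illustration.

```python
def sanitize(name):
    """Mirror the '/' -> '_' replacement applied to output file names."""
    return name.replace("/", "_")


def count_overwrites(names):
    """Outputs lost when each sanitized name maps to a single file."""
    return len(names) - len({sanitize(n) for n in names})


# Invented names: the first two collide after sanitization.
names = ["a/b", "a_b", "sub_1"]
lost = count_overwrites(names)

# Arithmetic from the figures quoted in this section.
decompiled = 6483 - 140
overwritten = decompiled - 6202
```

Appending the function's start address to each file name would be one way to avoid the loss, at the cost of longer names.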

Phase 1: Address-Range Sweeps

The first phase of analysis consists of 20 address-range sweeps that collectively cover the entire .text section from 0x403000 to 0x82A000. Each sweep examines a contiguous address range of 136--384 KB, documenting every function within that range.

Sweep Index

| Sweep | Address Range | Size | Primary Source Files | Key Findings |
|---|---|---|---|---|
| P1.01 | 0x403000--0x425000 | 136 KB | attribute.c, class_decl.c | Assert stub region, CUDA attribute handlers |
| P1.02 | 0x425000--0x450000 | 172 KB | class_decl.c, cmd_line.c | Virtual override checking, execution space propagation |
| P1.03 | 0x450000--0x478000 | 160 KB | cmd_line.c, const_ints.c, cp_gen_be.c | 4,105-line CLI parser, 276 flags |
| P1.04 | 0x478000--0x4A0000 | 160 KB | cp_gen_be.c, decl_inits.c | Backend .int.c emission, device stub generation |
| P1.05 | 0x4A0000--0x4C8000 | 160 KB | decl_inits.c, decl_spec.c, declarator.c, decls.c | Declaration parsing pipeline |
| P1.06 | 0x4C8000--0x4F8000 | 192 KB | decls.c, disambig.c, error.c | Error table (off_88FAA0, 3,795 entries) |
| P1.07 | 0x4F8000--0x530000 | 224 KB | expr.c | Expression parser (528 functions) |
| P1.08 | 0x530000--0x560000 | 192 KB | expr.c, exprutil.c | Expression utilities, operator overloads |
| P1.09 | 0x560000--0x598000 | 224 KB | exprutil.c, extasm.c, fe_init.c, fe_wrapup.c | Initialization chain, 5-pass wrapup |
| P1.10 | 0x598000--0x5C8000 | 192 KB | float_pt.c, folding.c, func_def.c, host_envir.c | Constant folding, timing infrastructure |
| P1.11a--f | 0x5C8000--0x5F8000 | 192 KB | il.c, il_alloc.c | IL node creation, arena allocator |
| P1.12 | 0x5F8000--0x628000 | 192 KB | il_to_str.c, il_walk.c, interpret.c | IL display, tree walking, constexpr |
| P1.13 | 0x628000--0x668000 | 256 KB | interpret.c, layout.c, lexical.c | Constexpr interpreter, struct layout, lexer |
| P1.14 | 0x668000--0x6A8000 | 256 KB | lexical.c, literals.c, lookup.c, lower_name.c | Name lookup, name mangling |
| P1.15 | 0x6A8000--0x6D0000 | 160 KB | lower_name.c, macro.c, mem_manage.c, nv_transforms.c, overload.c | NVIDIA transforms, memory management |
| P1.16 | 0x6D0000--0x708000 | 224 KB | overload.c, pch.c, pragma.c, preproc.c, scope_stk.c | Overload resolution, scope stack |
| P1.17 | 0x708000--0x740000 | 224 KB | scope_stk.c, src_seq.c, statements.c, symbol_ref.c, symbol_tbl.c | Statement parsing, symbol table |
| P1.18 | 0x740000--0x7A0000 | 384 KB | symbol_tbl.c, sys_predef.c, templates.c | Template engine (443 functions) |
| P1.19 | 0x7A0000--0x7E0000 | 256 KB | trans_unit.c, types.c, modules.c, trans_corresp.c | Type system, TU processing |
| P1.20 | 0x7E0000--0x82A000 | 296 KB | (C++ runtime) | libstdc++, SoftFloat, CRT, demangler |

The P1.11 sweep was subdivided into six sub-sweeps (11a through 11f) because the il.c region is dense and complex, containing the core IL node creation and manipulation functions that are referenced from nearly every other source file.
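The sweep sizes and their contiguity can be checked mechanically from the range boundaries. The sketch below takes the 21 boundary addresses from the sweep index and derives each size in KB. The variable and function names are illustrative.

```python
# Boundary addresses taken from the sweep index, P1.01 through P1.20.
BOUNDARIES = [
    0x403000, 0x425000, 0x450000, 0x478000, 0x4A0000, 0x4C8000, 0x4F8000,
    0x530000, 0x560000, 0x598000, 0x5C8000, 0x5F8000, 0x628000, 0x668000,
    0x6A8000, 0x6D0000, 0x708000, 0x740000, 0x7A0000, 0x7E0000, 0x82A000,
]


def sweep_sizes_kb(boundaries):
    """Size of each [start, end) sweep in KB (1 KB = 1024 bytes)."""
    return [(end - start) // 1024
            for start, end in zip(boundaries, boundaries[1:])]


sizes = sweep_sizes_kb(BOUNDARIES)
total_kb = (BOUNDARIES[-1] - BOUNDARIES[0]) // 1024
```

Because each sweep starts where the previous one ends, the per-sweep sizes must sum to the size of the whole covered range.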

Sweep Report Format

Each sweep report follows a consistent format:

================================================================================
P1.XX SWEEP: Address range 0xNNNNNN - 0xMMMMMM
================================================================================
Range: 0xNNNNNN - 0xMMMMMM
Functions found: N
EDG source files:
  - file.c (assert stub range, main body range)
  ...

### 0xAAAAAA -- sub_AAAAAA (NN bytes / NN lines)
**Identity**: function_name (source_file.c:NNNN)
**Confidence**: CONFIRMED / HIGH / MEDIUM / LOW
**EDG Source**: source_file.c
**Notes**: Additional observations about behavior, callers, callees

Every function in the sweep range gets an entry. Functions are documented in address order. The identity field records the inferred function name and source location. The confidence field uses the four-level system defined above. Notes capture anything unusual -- unexpected callers, CUDA-specific behavior, undocumented error codes, or connections to other subsystems.
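Because the entry format is regular, sweep reports can be parsed back into records for cross-checking. The parser below is a sketch under the assumption that reports follow the template above exactly. The byte and line counts in the sample entry are made up; the identity comes from the CONFIRMED example earlier on this page.

```python
import re

HEADER_RE = re.compile(r"^### (0x[0-9A-Fa-f]+) -- (\S+)")
FIELD_RE = re.compile(r"^\*\*(.+?)\*\*: (.*)$")


def parse_entries(report_text):
    """Parse per-function entries into dicts keyed by field name."""
    entries, current = [], None
    for line in report_text.splitlines():
        m = HEADER_RE.match(line)
        if m:
            current = {"address": int(m.group(1), 16), "name": m.group(2)}
            entries.append(current)
            continue
        m = FIELD_RE.match(line)
        if m and current is not None:
            current[m.group(1)] = m.group(2)
    return entries


SAMPLE = """### 0x403300 -- sub_403300 (32 bytes / 8 lines)
**Identity**: is_aliasable (attribute.c:10897)
**Confidence**: CONFIRMED
**EDG Source**: attribute.c
**Notes**: Assert stub; sizes in the heading are illustrative only.
"""
entries = parse_entries(SAMPLE)
```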

Phase 2: Targeted Deep Dives

After the Phase 1 sweep establishes the complete function map and identifies all source files, Phase 2 produces the detailed wiki pages. Each wiki page corresponds to one W-series work report that focuses on a specific subsystem or topic.

Deep Dive Methodology

Each W-series report follows a consistent process:

  1. Scope definition. Identify the set of functions relevant to the topic. For example, W012 (Execution Spaces) requires the CUDA attribute application handlers in attribute.c, the execution space checking functions in nv_transforms.c, and the virtual override validator in class_decl.c.

  2. Decompilation review. Read the full Hex-Rays pseudocode for every function in scope. For complex functions, also review the raw disassembly to catch decompiler artifacts.

  3. String evidence collection. Grep the string table for all strings referenced by the in-scope functions. Error messages reveal validation rules; format strings reveal output patterns; keyword strings reveal accepted syntax.

  4. Call graph traversal. Starting from the in-scope functions, walk callers and callees to understand the full data flow. Who calls apply_nv_global_attr? What does it call? How does data arrive and where does it go?

  5. Struct layout reconstruction. When decompiled code accesses struct fields via byte offsets, reconstruct the field layout by collecting all access patterns across all functions that touch the same struct. Cross-validate offsets across multiple functions.

  6. Pseudocode reconstruction. Translate the Hex-Rays output into readable C-like pseudocode with meaningful variable names, proper control flow, and comments explaining the logic. This reconstructed pseudocode appears in the wiki pages.

  7. Cross-reference synthesis. Link findings to other wiki pages and W-series reports. Every page should situate itself within the overall architecture.
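The call graph traversal in step 4 amounts to a breadth-first walk over reversed edges. The sketch below assumes the graph is available as (caller, callee) pairs. The function names in the example are placeholders, not identifications from the binary.

```python
from collections import defaultdict, deque


def transitive_callers(edges, target, max_depth=None):
    """Return {function: distance} for every direct or indirect caller."""
    callers_of = defaultdict(set)
    for caller, callee in edges:
        callers_of[callee].add(caller)
    depth = {target: 0}
    queue = deque([target])
    while queue:
        fn = queue.popleft()
        if max_depth is not None and depth[fn] >= max_depth:
            continue  # do not expand beyond the requested depth
        for caller in callers_of[fn]:
            if caller not in depth:
                depth[caller] = depth[fn] + 1
                queue.append(caller)
    del depth[target]
    return depth


# Placeholder graph.
edges = [("main", "stage"), ("stage", "handler"),
         ("handler", "apply"), ("other", "apply")]
callers = transitive_callers(edges, "apply")
```

Swapping the roles of caller and callee in `callers_of` gives the forward walk over callees.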

W-Series Report Index

As of this writing, 29 W-series reports have been produced, each backing one or more wiki pages:

| Report | Topic | Wiki Page(s) |
|---|---|---|
| W001 | Index page | index.md |
| W002 | Function map | function-map.md |
| W003 | Binary layout | binary-layout.md |
| W004 | Methodology | methodology.md (this page) |
| W005 | Pipeline overview | pipeline/overview.md |
| W006 | Entry point | pipeline/entry.md |
| W010 | Backend code gen | pipeline/backend.md |
| W012 | Execution spaces | cuda/execution-spaces.md |
| W014 | Cross-space validation | cuda/cross-space-validation.md |
| W015 | Device/host separation | cuda/device-host-separation.md |
| W016 | Kernel stubs | cuda/kernel-stubs.md |
| W020 | Attribute system | attributes/overview.md |
| W021 | __global__ constraints | attributes/global-function.md |
| W026 | Lambda overview | lambda/overview.md |
| W027 | Device wrapper | lambda/device-wrapper.md |
| W028 | Host-device wrapper | lambda/host-device-wrapper.md |
| W029 | Capture handling | lambda/capture-handling.md |
| W032 | IL overview | il/overview.md |
| W033 | IL allocation | il/allocation.md |
| W035 | Keep-in-IL | il/keep-in-il.md |
| W038 | .int.c format | output/int-c-format.md |
| W042 | EDG overview | edg/overview.md |
| W047 | Template engine | edg/template-engine.md |
| W052 | Diagnostics overview | diagnostics/overview.md |
| W053 | CUDA errors | diagnostics/cuda-errors.md |
| W056 | Entity node layout | structs/entity-node.md |
| W061 | CLI flags | config/cli-flags.md |
| W065 | EDG source map | reference/edg-source-map.md |
| W066 | Global variables | reference/global-variables.md |

Numerical Summary

| Metric | Value |
|---|---|
| Binary file size | 8,910,936 bytes (8.5 MB) |
| Total functions in binary | 6,483 |
| Decompiled functions (log-reported) | 6,343 |
| Decompiled files (actual on disk) | 6,202 |
| Disassembly files | 6,342 |
| CFG files (JSON + DOT) | 12,684 |
| Functions attributed to source files | 2,209 (34.1%) |
| Functions calling sub_4F2930 (assert handler) | 2,139 |
| Total call sites to sub_4F2930 | 5,178 |
| Assert stubs (0x403300--0x408B40) | 235 |
| Source files identified (.c) | 52 |
| Header files identified (.h) | 13 |
| EDG build-path strings in .rodata | 65 |
| String literals extracted | 52,489 |
| Cross-references extracted | 1,243,258 |
| Call graph edges | 67,756 (5,057 callers, 5,382 callees) |
| Named locations | 54,771 |
| IDA comments | 22,911 |
| Imported glibc symbols | 142 |
| ELF segments | 26 |
| .rodata raw dump | 2,599,011 bytes |
| IDA database (.i64) | 247 MB |
| Phase 1 sweep reports | 28 files (20 ranges + 8 sub-sweeps), 38,221 lines |
| Phase 2 deep-dive reports (W-series) | 29 |
| Wiki pages | 55 |
| Error table entries (off_88FAA0) | 3,795 |
| CLI flags documented | 276 |
| Total exported data | ~500 MB |

Limitations and Caveats

What This Analysis Cannot Determine

  1. Preprocessor-disabled code. Any EDG code behind #if 0, #ifndef CUDA_SUPPORT, or similar guards was compiled out. The binary reflects only the CUDA-enabled, Linux x86-64, EDG 6.6 configuration. Other EDG frontend features (e.g., Fortran support, Windows target, older C++ standards) are not present.

  2. Inlined function boundaries. When the compiler inlines a function, its code merges with the caller. The binary may contain hundreds of inlined instances of small EDG utility functions (type queries, IL accessors) that are invisible as separate entities. The 6,483 function count represents only the non-inlined functions.

  3. Original variable names. All local and most global variable names are lost. The wiki uses reconstructed names based on semantics (e.g., execution_space_byte for *((_BYTE *)entity + 182)), but these are analyst-assigned, not original.

  4. Exact source line mapping. While assertion strings encode line numbers, these are the assertion site's line number, not the calling function's line number. The analyst can determine that is_aliasable in attribute.c has an assertion at line 10897, but cannot determine the start line of is_aliasable itself.

  5. NVIDIA-internal documentation. Any design documents, code comments, commit messages, or internal wikis that informed the original development are unavailable. All behavioral descriptions in this wiki are inferred from the binary alone.

Reproducibility

Every finding in this wiki can be reproduced by:

  1. Obtaining cudafe++ from CUDA Toolkit 13.0 (the release is identifiable from the r13.0 component of the build path embedded in the binary).
  2. Loading it into IDA Pro 9.0 (64-bit) with default x86-64 analysis settings. Wait for auto-analysis to complete (5-10 minutes).
  3. Running analyze_cudafe++.py via File > Script File to extract all raw data (30-45 minutes).
  4. Querying the exported JSON files with jq to trace cross-reference chains, string lookups, and callgraph paths.
  5. Reading the decompiled .c files and raw .asm files for behavioral analysis.

No proprietary tools beyond IDA Pro + Hex-Rays are required. The analysis does not depend on NVIDIA source code access, NDA-protected documentation, or insider knowledge. Every claim is derived from the publicly distributed binary.

Pipeline Overview

cudafe++ is a source-to-source compiler. It reads a .cu file, parses it as C++ with CUDA extensions using a modified EDG 6.6 frontend, then emits a transformed .int.c file where device code is suppressed and host-side stubs replace kernel launch sites. The entire binary is a single-threaded, single-pass-per-stage pipeline controlled from main() at 0x408950.

Pipeline Diagram

  input.cu
     |
     v
 [1] fe_pre_init          sub_585D60   fe_init.c
     9 subsystem pre-initializers
     |
     v
     * sub_5AF350(v7) ---- capture "Total compilation time" start
     |
     v
 [2] proc_command_line     sub_459630   cmd_line.c
     276 CLI flags parsed, mode selection
     |
     v
 [3] fe_one_time_init      sub_585DB0   fe_init.c
     38 subsystem initializers + keyword registration
     |--- fe_init_part_1 (sub_585EE0): per-unit inits, output file open
     |--- keyword_init + fe_translation_unit_init (sub_5863A0)
     |
     v
     * sub_5AF350(v8) ---- capture "Front end time" start
     |
     v
 [4] reset_tu_state        sub_7A4860   trans_unit.c
     Zero all TU globals
     |
     v
 [5] process_trans_unit    sub_7A40A0   trans_unit.c
     Allocate 424-byte TU descriptor, parse source,
     build EDG IL tree, CUDA attribute propagation
     |
     v
 [6] fe_wrapup             sub_588F90   fe_wrapup.c
     5-pass IL finalization: needed-flags, keep-in-IL marking,
     dead entity elimination, scope cleanup
     |
     v
     * sub_5AF350(v9) ---- capture "Front end time" end
     * sub_5AF390("Front end time", v8, v9)
     |
     v
     * sub_5AF350(v10) --- capture "Back end time" start
     |
     v
 [7] Backend entry         sub_489000   cp_gen_be.c
     Walk source sequence, emit .int.c, device stubs,
     lambda wrappers, registration tables
     |
     v
     * sub_5AF350(v11) --- capture "Back end time" end
     * sub_5AF390("Back end time", v10, v11)
     |
     v
     * sub_5AF350(v12) --- capture "Total compilation time" end
     * sub_5AF390("Total compilation time", v7, v12)
     |
     v
 [8] exit_with_status      sub_5AF1D0   host_envir.c
     Map internal status to exit code, terminate

     |----- "Front end time" covers stages 4-6 ----------|
     |----- "Back end time" covers stage 7 ---------------|
     |----- "Total compilation time" covers stages 2-7 ---|

Call Hierarchy from main()

The decompiled main() at 0x408950 calls the pipeline stages in this exact order:

int main(int argc, char **argv, char **envp)
{
    sub_585D60(argc, argv, envp);      // [1] fe_pre_init
    sub_5AF350(v7);                     //     capture_time (total start)
    sub_459630(argc, argv);             // [2] proc_command_line
    // [stack limit adjustment via setrlimit]
    sub_585DB0();                       // [3] fe_one_time_init
    if (dword_106C0A4)
        sub_5AF350(v8);                 //     capture_time (frontend start)
    sub_7A4860();                       // [4] reset_tu_state
    sub_7A40A0(qword_126EEE0);         // [5] process_translation_unit
    sub_588F90(v5, 1);                  // [6] fe_wrapup
    if (dword_106C0A4) {
        sub_5AF350(v9);
        sub_5AF390("Front end time", v8, v9);
    }
    // --- error-recovery re-compilation loop ---
    if (qword_126ED90) {               //     errors present?
        dword_106C254 = 1;             //     skip backend
    }
    while (1) {
        sub_6B8B20(0);                  //     reset file state
        sub_589530();                   //     write signoff + cleanup
        // exit code computation
        if (dword_106C0A4)
            sub_5AF390("Total compilation time", ...);
        sub_5AF1D0(exit_code);          // [8] exit
        // --- if dword_106C254 == 0, backend runs ---
        if (!dword_106C254) {
            if (dword_106C0A4)
                sub_5AF350(v10);        //     capture_time (backend start)
            sub_489000();               // [7] process_file_scope_entities
            if (dword_106C0A4) {
                sub_5AF350(v11);
                sub_5AF390("Back end time", v10, v11);
            }
        }
    }
}

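The while(1) loop never actually iterates: sub_5AF1D0 calls exit() or abort() and is __noreturn, so control cannot reach the loop's back edge. The loop is an artifact of how the compiler laid out the basic blocks. In execution order, the backend block (LABEL_16 in the raw decompilation) is reached via a goto at the top of the loop when dword_106C254 == 0 (no errors), and it runs before the cleanup and exit sequence even though the listing prints it after the call to sub_5AF1D0.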
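Read linearly, the control flow is simpler than the decompiled listing suggests. The Python model below is only a restatement of the stage order described on this page, with the skip-backend flag as its single input. It is not derived from the binary and omits timing and cleanup calls.

```python
def stage_order(has_errors):
    """Names of the pipeline stages that run, in execution order."""
    order = [
        "fe_pre_init",               # [1]
        "proc_command_line",         # [2]
        "fe_one_time_init",          # [3]
        "reset_tu_state",            # [4]
        "process_translation_unit",  # [5]
        "fe_wrapup",                 # [6]
    ]
    skip_backend = has_errors        # models dword_106C254
    if not skip_backend:
        order.append("backend")      # [7]
    order.append("exit_with_status") # [8], never returns in the binary
    return order


clean_run = stage_order(has_errors=False)
failed_run = stage_order(has_errors=True)
```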

Stage Details

Stage 1: fe_pre_init -- sub_585D60 (0x585D60)

Source: fe_init.c

Performs absolute minimum initialization before anything else can run. Called with the raw argc, argv, envp from the OS.

| Call | Address | Identity | Purpose |
|---|---|---|---|
| 1 | sub_48B3C0 | error_handling_init | Zero error counters |
| 2 | sub_6BB290 | source_file_mgr_init | File descriptor table setup |
| 3 | sub_5B1E70 | scope_symbol_pre_init | Scope stack index = -1 |
| 4 | sub_752C90 | type_system_pre_init | Type table allocation |
| 5 | sub_45EB40 | cmd_line_pre_init | Register CLI flag table |
| 6 | sub_4ED530 | declaration_pre_init | Declaration state zeroing |
| 7 | sub_6F6020 | il_pre_init | IL node allocator setup |
| 8 | sub_7A48B0 | tu_tracking_pre_init | Zero all TU globals |
| 9 | sub_7C00F0 | template_pre_init | Template engine state |

Sets dword_126C5E4 = -1 (current scope index = "none") and dword_126C5C8 = -1 (secondary scope index = "none").

Data flow: No input beyond process args. Output: global state zeroed and ready for CLI parsing.

Stage 2: proc_command_line -- sub_459630 (0x459630)

Source: cmd_line.c (4,105 decompiled lines)

Parses all 276 CLI flags. Populates global configuration variables that control every subsequent stage. Key outputs:

| Global | Address | Meaning |
|---|---|---|
| dword_126EFB4 | 0x126EFB4 | Language mode: 1=K&R C, 2=C++ |
| dword_126EF68 | 0x126EF68 | C++ standard version (__cplusplus value) |
| dword_106C0A4 | 0x106C0A4 | Timing enabled (print stage durations) |
| dword_126E1D8 | 0x126E1D8 | MSVC host compiler |
| dword_126E1F8 | 0x126E1F8 | GNU/GCC host compiler |
| dword_126E1E8 | 0x126E1E8 | Clang host compiler |
| dword_106BF38 | 0x106BF38 | Extended lambda mode |
| qword_126EEE0 | 0x126EEE0 | Output filename (or "-" for stdout) |
| qword_106BA00 | 0x106BA00 | Primary source filename |
| dword_106C29C | 0x106C29C | Preprocessing-only mode |
| dword_106C064 | 0x106C064 | Stack limit adjustment flag |

The parser builds four hash tables: one for macro defines (qword_106C248), one for include paths (qword_106C240), and two for system includes (qword_106C238, qword_106C228). It also suppresses a default set of diagnostic numbers (1257, 1373, 1374, 1375, 1633, 2330, 111, 185, 175).

Data flow: Input: argv. Output: more than 150 global configuration variables populated.

Stage 3: fe_one_time_init -- sub_585DB0 (0x585DB0)

Source: fe_init.c

The heaviest initialization stage. Calls 38 subsystem initializers in dependency order, then validates the function pointer dispatch table (a sentinel check: off_D560C0 must equal the address of nullsub_6). After validation, calls sub_585EE0 (fe_init_part_1) which:

  1. Records compilation timestamp via time()/ctime() into byte_106B5C0
  2. Runs 26 per-compilation-unit initializers
  3. Opens the output file (qword_106C280 = stdout or file)
  4. Writes the output file header via sub_5AEDB0
  5. Calls the keyword registration function sub_5863A0 which registers 200+ C/C++ keywords plus NVIDIA CUDA-specific type traits (__nv_is_extended_device_lambda_closure_type, etc.)

38 subsystem initializers (in call order):

| # | Address | Subsystem |
|---|---|---|
| 1 | sub_752DF0 | types |
| 2 | sub_5B1D40 | scopes |
| 3 | sub_447430 | errors |
| 4 | sub_4B37F0 | preprocessor |
| 5 | sub_4E8ED0 | declarations |
| 6 | sub_4C0840 | attributes |
| 7 | sub_4A1B60 | names |
| 8 | sub_4E9CF0 | declarations (part 2) |
| 9 | sub_4ED710 | declarations (part 3) |
| 10 | sub_510C30 | statements |
| 11 | sub_56DC90 | expression utilities |
| 12 | sub_5A5160 | expressions |
| 13 | sub_603B00 | parser |
| 14 | sub_5CF7F0 | classes |
| 15 | sub_65DC50 | overload resolution |
| 16 | sub_69C8B0 | templates |
| 17 | sub_665A00 | template instantiation |
| 18 | sub_689550 | exception handling |
| 19 | sub_68F640 | implicit conversions |
| 20 | sub_6B6510 | IL |
| 21 | sub_6BAE70 | source file manager |
| 22 | sub_6F5FC0 | IL walking |
| 23 | sub_6F8300 | IL (part 2) |
| 24 | sub_6FDFF0 | lowering |
| 25 | sub_726DC0 | name mangling |
| 26 | sub_72D410 | name mangling (part 2) |
| 27 | sub_74B9A0 | type checking |
| 28 | sub_710B70 | IL (part 3) |
| 29 | sub_76D630 | code generation |
| 30 | nullsub_11 | debug (no-op) |
| 31 | sub_7A4690 | allocation |
| 32 | sub_7A3920 | memory pools |
| 33 | sub_6A0E90 | templates (part 2) |
| 34 | sub_418F80 | diagnostics |
| 35 | sub_5859C0 | extended asm |
| 36 | sub_751540 | types (part 2) |
| 37 | sub_7C25F0 | templates (part 3) |
| 38 | sub_7DF400 | CUDA-specific init |

Data flow: Input: populated config globals. Output: all subsystems initialized, keyword table built, output file open.

Stage 4: reset_tu_state -- sub_7A4860 (0x7A4860)

Source: trans_unit.c

Zeroes all translation unit tracking globals to prepare for processing:

qword_106BA10 = 0;   // current_translation_unit
qword_106B9F0 = 0;   // primary_translation_unit
qword_12C7A90 = 0;   // tu_chain_tail
dword_106B9F8 = 0;   // has_module_info
qword_106BA18 = 0;   // tu_stack_top
dword_106B9E8 = 0;   // tu_stack_depth

Data flow: No input. Output: TU state clean-slated.

Stage 5: process_translation_unit -- sub_7A40A0 (0x7A40A0)

Source: trans_unit.c

The main frontend workhorse. This single call parses the entire .cu source file into the EDG intermediate language. Workflow:

  1. Debug trace: "Processing translation unit %s"
  2. Clean up any previous TU state (sub_7A3A50)
  3. Reset error state (sub_5EAEC0)
  4. Allocate 424-byte TU descriptor via sub_6BA0D0
  5. Initialize TU scope state (offsets 24..192 via sub_7046E0)
  6. Set as primary TU (qword_106B9F0) if first
  7. Link into TU chain
  8. Call sub_586240 -- parse the source file (this enters the EDG parser, which handles all of C++ plus CUDA extensions: __device__, __host__, __global__, __shared__, __managed__, etc.)
  9. Depending on mode:
    • Module compilation: sub_6FDDF0
    • Standard compilation: sub_6F4AD0 (header-unit) + sub_4E8A60 (standard)
  10. Post-processing: sub_588E90 (translation_unit_wrapup -- scope closure, template wrapup, IL output)
  11. Debug trace: "Done processing translation unit %s"

At the end of this stage, the EDG IL tree is fully built. Every declaration, type, expression, and statement from the source has been parsed into IL nodes. CUDA execution-space attributes (__device__, __host__, __global__) have been recorded on entity nodes at byte offset +182 (bit 6 = device/global, bits 4-5 = execution space).
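The bit layout described for byte +182 can be decoded as follows. This is a sketch based only on the bit positions given above. The meaning of each 2-bit execution-space value is not stated on this page, so the helper returns the raw field rather than a name, and the field names are analyst-style labels, not recovered identifiers.

```python
def decode_exec_space_byte(b):
    """Split entity byte +182 into the two fields described above."""
    return {
        "device_or_global": bool(b & 0x40),  # bit 6
        "space_bits": (b >> 4) & 0x3,        # bits 4-5, raw 2-bit value
    }


# Example byte values, chosen only to exercise each field.
only_bit6 = decode_exec_space_byte(0x40)
both_space_bits = decode_exec_space_byte(0x30)
mixed = decode_exec_space_byte(0x50)
```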

Data flow: Input: source filename from qword_126EEE0. Output: complete EDG IL tree anchored at qword_106BA10 (TU descriptor), source sequence list at *(qword_106BA10 + 8).

Stage 6: fe_wrapup -- sub_588F90 (0x588F90)

Source: fe_wrapup.c

Five-pass finalization over all translation units. Each pass iterates the TU chain (qword_106B9F0). Passes 2-4 are per-TU error-gated (skip TUs with qword_126ED90 != 0); passes 1 and 5 run unconditionally.

| Pass | Function | Purpose | Error-gated? |
|---|---|---|---|
| 1 | sub_588C60 | Per-file IL wrapup: template/exception cleanup, IL tree walk (sub_706710), IL finalize (sub_706F40), destroy temporaries | No |
| 2 | sub_707040 | Needed-flags computation: determine which entities must be preserved for backend consumption | Per-TU skip |
| 3 | sub_610420(23) | Keep-in-IL marking: mark entities for device code preservation with guard flag dword_106B640 | Per-TU skip |
| 4 | sub_5CCA40 + sub_5CC410 + sub_5CCBF0 | Dead entity elimination (C++ gate on sub_5CCA40): clear unneeded instantiation flags, remove dead function bodies, remove unneeded IL entries | Per-TU skip |
| 5 | sub_588D40 | Statement finalization, scope assertions, IL output + template output | No |

Between Pass 1 and Pass 2, if no errors have occurred, sub_796C00 runs cross-TU entity marking.
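The pass structure and its error gating can be modelled as a nested loop: passes on the outside, the TU chain on the inside. The sketch below follows the pass table and treats the error check as a per-TU boolean, which is a simplifying assumption. The TU names are placeholders.

```python
# (pass number, error-gated?) in execution order, per the pass table.
PASSES = [(1, False), (2, True), (3, True), (4, True), (5, False)]


def run_wrapup(tu_chain):
    """Return the (pass, tu) pairs that execute, in order.

    tu_chain: list of (tu_name, has_errors) tuples (hypothetical shape).
    """
    executed = []
    for pass_no, gated in PASSES:
        for name, has_errors in tu_chain:
            if gated and has_errors:
                continue  # gated passes skip TUs that have errors
            executed.append((pass_no, name))
    return executed


trace = run_wrapup([("tu_a", False), ("tu_b", True)])
```

In this model a TU with errors still goes through passes 1 and 5, so its wrapup and scope checks run even though no device-code marking or dead-entity elimination does.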

Post-pass operations:

  • Cross-TU consistency (sub_796BA0, error-gated)
  • Scope renumbering (sub_707480 double-loop)
  • Template validation (sub_765480)
  • File index cleanup (sub_6B8B20 for indices 2..dword_126EC80)
  • Output flush + close three output files (IDs 1513, 1514, 1515)
  • Memory statistics: sums 10 space_used() callbacks
  • State teardown

Data flow: Input: fully built IL tree. Output: finalized IL with dead entities eliminated and device-needed entities marked. The source sequence list (qword_1065748) is the ordered list of top-level declarations the backend will walk.

Stage 7: Backend Code Generation -- sub_489000 (0x489000)

Source: cp_gen_be.c (723 decompiled lines, the largest single function in the backend)

This is the host-side C++ code generator. It walks the EDG source sequence and emits the .int.c file that the host compiler (gcc/cl.exe/clang) will compile. The backend is gated by dword_106C254: if set to 1 (errors occurred), stage 7 is skipped entirely.

Initialization:

  1. Zeros output state: dword_1065834 (indent level), stream handle, counters
  2. Clears four hash tables of roughly 512 KB each (memset of 0x7FFE0 = 524,256 bytes per table)
  3. Sets up gen_be_info callback table (xmmword_1065760..10657B0)
  4. Creates output file: <input>.int.c (or stdout for "-")

Boilerplate emission:

  • #pragma GCC diagnostic push/pop blocks for suppressing host compiler warnings
  • __nv_managed_rt initialization boilerplate (for __managed__ variables)
  • Lambda type-trait macro definitions

Main processing loop:

  • Walks qword_1065748 (global source sequence list)
  • For each entry: dispatches to sub_47ECC0 (gen_template/process_source_sequence)
  • Kind 57 entries are pragma interleavings (handled inline)

CUDA-specific transformations performed:

  1. Device stub generation: For __global__ kernels, emit __wrapper__device_stub_<name>() forwarding, wrap original body in #if 0/#endif
  2. Device-only suppression: Device-only declarations wrapped in #if 0/#endif
  3. Lambda wrappers: __nv_dl_wrapper_t<> for device lambdas, __nv_hdl_create_wrapper_t<> for host-device lambdas
  4. Runtime header injection: #include "crt/host_runtime.h" at first CUDA entity
  5. Registration tables: sub_6BCF80 called 6 times for device/host/managed/constant combinations
  6. Anonymous namespace: _NV_ANON_NAMESPACE macro for unique global symbols

Trailer:

  • Empty-file guard: int __dummy_to_avoid_empty_file;
  • Re-inclusion of original source via #include "<original_file>"
  • #undef _NV_ANON_NAMESPACE

Data flow: Input: finalized source sequence from stage 6. Output: .int.c file on disk.

Stage 8: exit_with_status -- sub_5AF1D0 (0x5AF1D0)

Source: host_envir.c

Maps internal compilation status to process exit codes:

Internal Status | Meaning | Exit Code | Action
3, 4, 5 | Success | 0 | exit(0)
8 | Warnings only | 2 | exit(2)
9, 10 | Errors | 4 | exit(4) + "Compilation terminated."
11 | Internal error | -- | abort() + "Compilation aborted."

In SARIF mode (dword_106BBB8), text messages are suppressed but exit codes remain the same.

Key Global Variables Controlling Flow

Variable | Address | Type | Role
dword_106C254 | 0x106C254 | int | Skip-backend flag. Set to 1 when qword_126ED90 (error count) is nonzero after frontend. Prevents stage 7 from running.
dword_106C0A4 | 0x106C0A4 | int | Timing flag. When set, sub_5AF350/sub_5AF390 bracket each phase with CPU + wall-clock timestamps.
dword_126EFB4 | 0x126EFB4 | int | Language mode. 1=K&R C, 2=C++. Controls C++ class finalization in pass 4 of fe_wrapup, keyword set selection, and backend behavior. In CUDA mode, always 2.
qword_126ED90 | 0x126ED90 | qword | Error count. Checked after stages 5-6 to decide whether to run backend. Nonzero skips needed-flags, keep-in-IL marking, and dead entity elimination passes in fe_wrapup.
qword_126EEE0 | 0x126EEE0 | char* | Output filename. Passed to sub_7A40A0 for TU naming. Used by backend to construct .int.c path.
dword_1065850 | 0x1065850 | int | Device stub mode. Toggled during backend generation: 1 = currently emitting device stub code (changes parameter types, suppresses bodies).
dword_106C064 | 0x106C064 | int | Stack limit flag. When set, main adjusts RLIMIT_STACK to max before entering frontend (deep recursion in parser/template engine).

Timing Regions

When dword_106C0A4 is set (via --timing or equivalent flag), three timing regions are printed:

Front end time                     12.34 (CPU)     15.67 (elapsed)
Back end time                       3.45 (CPU)      4.56 (elapsed)
Total compilation time             15.79 (CPU)     20.23 (elapsed)

Format string: "%-30s %10.2f (CPU) %10.2f (elapsed)\n"

The timing is implemented via sub_5AF350 (capture_time: records clock() as CPU milliseconds and time() as wall seconds) and sub_5AF390 (report_timing: computes deltas and prints).

Region | Start | End | Covers
Front end | After sub_585DB0 (fe_one_time_init) | After sub_588F90 (fe_wrapup) | Stages 4-6: TU reset, parse, IL build, wrapup
Back end | Before sub_489000 | After sub_489000 | Stage 7: .int.c generation
Total | After sub_585D60 (fe_pre_init), before sub_459630 (CLI) | Before sub_5AF1D0 (exit) | Stages 2-8: CLI parsing through exit
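The capture/report pair can be sketched as follows. This is a minimal illustration, not the decompiled code: the timestamp_t layout and the helper names are assumptions, and only the use of clock(), time(), and the format string comes from the description above.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical layout: the text only states that CPU time comes from
 * clock() (kept as milliseconds) and wall time from time() (seconds). */
typedef struct {
    long   cpu_ms;    /* clock() converted to milliseconds */
    time_t wall_sec;  /* time(NULL) */
} timestamp_t;

/* Rough analogue of sub_5AF350 (capture_time). */
void capture_time(timestamp_t *ts)
{
    ts->cpu_ms   = (long)((double)clock() * 1000.0 / CLOCKS_PER_SEC);
    ts->wall_sec = time(NULL);
}

/* Rough analogue of the formatting done by sub_5AF390 (report_timing):
 * computes both deltas and renders them with the documented format
 * string. Returns the character count reported by snprintf. */
int format_timing_line(char *buf, size_t size, const char *label,
                       const timestamp_t *start, const timestamp_t *end)
{
    double cpu     = (double)(end->cpu_ms - start->cpu_ms) / 1000.0;
    double elapsed = difftime(end->wall_sec, start->wall_sec);
    return snprintf(buf, size, "%-30s %10.2f (CPU) %10.2f (elapsed)\n",
                    label, cpu, elapsed);
}
```

Bracketing a region then amounts to calling capture_time() before and after it and passing both stamps to format_timing_line() with the region label.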

Error Recovery Loop

The main() function contains a while(1) loop that appears to support re-compilation (the TU processing infrastructure has a dword_106BA08 "is_recompilation" flag and sub_7A40A0 checks an a2 recompilation parameter). In practice, for the standard CUDA compilation flow, this loop executes exactly once: sub_5AF1D0 is __noreturn and terminates the process.

The loop body:

  1. sub_6B8B20(0) -- reset file state for the source file manager
  2. sub_589530() -- write output signoff (sub_5AEE00) + close source manager (sub_6B8DE0)
  3. Compute exit code from qword_126ED90 (errors) and qword_126ED88 (additional status)
  4. Print total timing if enabled
  5. Restore stack limit if it was raised
  6. sub_5AF1D0(exit_code) -- terminate
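Step 3 can be written as a small pure function. The sketch below follows the annotated decompilation of main() (default 8, then 5 or 3 when the error count is zero); the function and parameter names are illustrative assumptions.

```c
#include <stdint.h>

/* Internal status passed to sub_5AF1D0, derived from the two counters.
 * error_count       ~ qword_126ED90
 * additional_status ~ qword_126ED88 (exact meaning not established here) */
uint8_t compute_internal_status(uint64_t error_count,
                                uint64_t additional_status)
{
    uint8_t status = 8;                    /* default used by main() */
    if (error_count == 0)
        status = additional_status ? 5 : 3; /* both map to exit(0) */
    return status;
}
```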


Entry Point & Initialization

main() at 0x408950 is a 488-byte __noreturn function that orchestrates the entire cudafe++ compilation pipeline. It takes the standard POSIX signature (int argc, char **argv, char **envp), performs two phases of subsystem initialization, optionally raises the process stack limit, then runs the frontend, backend, and exit sequence in a linearized loop that executes exactly once. The function has 22 direct callees (including getrlimit, setrlimit, and library calls) and never returns -- sub_5AF1D0 at the bottom of the loop calls exit() or abort().

Key Facts

Property | Value
Address | 0x408950
Size | 488 bytes
Source file | fe_init.c / host_envir.c (initialization); fe_wrapup.c (finalization)
Signature | void __noreturn main(int argc, char **argv, char **envp)
Direct callees | 22 (9 pre-init + CLI + heavy-init + 5 pipeline stages + timing/exit helpers)
Stack frame | 0x88 bytes (136 bytes: 6 timing stamps + rlimit struct + alignment)
Attribute | __noreturn -- the while(1) loop terminates via sub_5AF1D0 which calls exit()/abort()

Annotated Decompilation

void __noreturn main(int argc, char **argv, char **envp)
{
    rlim_t original_stack;
    bool stack_was_raised;
    uint8_t exit_code;
    struct rlimit rlimits;
    timestamp_t t_total_start, t_fe_start, t_fe_end, t_be_start, t_be_end, t_total_end;

    // --- Redirect diagnostic output to stderr ---
    s = stderr;                            // 0x126EDF0 alias
    qword_126EDF0 = stderr;                // diagnostic stream

    // === PHASE 1: Pre-initialization (9 subsystem calls) ===
    sub_585D60(argc, argv, envp);          // fe_pre_init

    // --- Capture total compilation start time ---
    sub_5AF350(&t_total_start);            // capture_time

    // === PHASE 2: Command-line parsing ===
    sub_459630(argc, argv);                // proc_command_line (276 flags)

    // === Stack limit adjustment ===
    if (dword_106C064                      // --modify_stack_limit (default: ON)
        && !getrlimit(RLIMIT_STACK, &rlimits))
    {
        original_stack = rlimits.rlim_cur;
        rlimits.rlim_cur = rlimits.rlim_max;  // raise to hard limit
        stack_was_raised = (setrlimit(RLIMIT_STACK, &rlimits) == 0);
    }

    // === PHASE 3: Heavy initialization (38 subsystem calls + validation) ===
    sub_585DB0();                          // fe_one_time_init
    //   └─ sub_585EE0()  fe_init_part_1  (33 per-unit inits, output file, keywords; reached from here only on the sentinel-failure path)

    if (dword_106C0A4)                     // --timing enabled?
        sub_5AF350(&t_fe_start);           // capture frontend start

    // === PHASE 4: Translation unit setup ===
    sub_7A4860();                          // reset_tu_state (zero 6 TU globals)

    // === PHASE 5: Frontend parse + IL build ===
    sub_7A40A0(qword_126EEE0);            // process_translation_unit

    // === PHASE 6: Frontend wrapup (5-pass IL finalization) ===
    sub_588F90(qword_126EEE0, 1);         // fe_wrapup

    if (dword_106C0A4) {
        sub_5AF350(&t_fe_end);
        sub_5AF390("Front end time", &t_fe_start, &t_fe_end);
    }

    // --- Error gate: skip backend if frontend had errors ---
    if (!qword_126ED90) goto backend;     // no errors → run backend
    dword_106C254 = 1;                    // skip-backend flag

    // === Linearized exit loop (executes once) ===
    while (1) {
        exit_code = 8;                    // default: warnings
        sub_6B8B20(0);                    // reset file state
        sub_589530();                     // write signoff + close source mgr

        if (!qword_126ED90)               // re-check after wrapup
            exit_code = qword_126ED88 ? 5 : 3;  // success codes

        if (dword_106C0A4) {
            sub_5AF350(&t_total_end);
            sub_5AF390("Total compilation time", &t_total_start, &t_total_end);
        }

        if (stack_was_raised) {           // restore original stack limit
            rlimits.rlim_cur = original_stack;
            setrlimit(RLIMIT_STACK, &rlimits);
        }

        sub_5AF1D0(exit_code);            // __noreturn: exit() or abort()

    backend:
        if (!dword_106C254) {             // backend not skipped
            if (dword_106C0A4)
                sub_5AF350(&t_be_start);
            sub_489000();                 // process_file_scope_entities (backend)
            if (dword_106C0A4) {
                sub_5AF350(&t_be_end);
                sub_5AF390("Back end time", &t_be_start, &t_be_end);
            }
        }
    }
}

The while(1) never actually loops. The call to sub_5AF1D0 is __noreturn (it calls exit() or abort() internally), so control never reaches the second iteration. The compiler arranged the basic blocks this way because the backend code at backend: is reached via a goto from the error-gate check, placing it logically "after" the exit call in the CFG.

Phase 1: fe_pre_init -- sub_585D60 (0x585D60)

The first thing main() does after redirecting stderr is call sub_585D60, which performs the absolute minimum initialization needed before command-line parsing can proceed. This function lives in fe_init.c and makes 9 sequential calls to subsystem pre-initializers, plus two inline global assignments.

Pre-Init Call Table

# | Address | Identity | Source | Purpose
1 | sub_48B3C0 | error_pre_init | error.c | Reset 4 error-tracking globals: qword_1065870=0, qword_1065868=0, dword_1065860=-1, qword_1065858=0
2 | sub_6BB290 | source_file_mgr_pre_init | srcfile.c | Zero 10 file descriptor table globals: file chain head, file count, file hash, include stack
3 | sub_5B1E70 | host_envir_early_init | host_envir.c | Heaviest pre-init call. Signal handlers, locale, CWD capture, env vars. See below.
4 | sub_752C90 | type_system_pre_init | type.c | Set dword_126E4A8=-1 (dialect version unset), call sub_7515D0 (type table alloc), set host compiler defaults (qword_126E1F0=70300 = GCC 7.3.0 default), init 3 type comparison descriptor pools via sub_465510
5 | sub_45EB40 | cmd_line_pre_init | cmd_line.c | Zero the 272-flag was-set bitmap (byte_E7FF40, 0x110 bytes), set dword_E7FF20=1 (skip argv[0]), initialize ~350 global config variables to defaults. Notable: dword_106C064=1 (stack limit adjustment ON by default)
6 | sub_4ED530 | declaration_pre_init | decls.c | Store stderr in two global stream pointers, zero error/warning counters (qword_126ED80..qword_126EDE0), set diagnostic defaults (byte_126ED69=5, byte_126ED68=8, qword_126ED60=100 max errors), clear 15.2KB diagnostic severity table (byte_1067920, 0x3B50 bytes)
7 | sub_6F6020 | il_pre_init | il.c | Zero 3 globals: dword_12C6C8C=0 (PCH event counter), qword_12C6EC0=0, qword_12C6EB8=0
-- | (inline) | scope_index_init | fe_init.c | dword_126C5E4 = -1 (current scope stack index = "none"), dword_126C5C8 = -1 (secondary scope index = "none")
8 | sub_7A48B0 | tu_tracking_pre_init | trans_unit.c | Zero 13 TU tracking globals: source filename, compilation mode flags, TU stack pointers, PCH state
9 | sub_7C00F0 | template_pre_init | template.c | Single assignment: dword_106BA20 = 0 (template nesting depth = 0)

host_envir_early_init (sub_5B1E70) Detail

This is the most substantial pre-init call. It initializes the host environment interface layer from host_envir.c:

Signal handlers (one-time, guarded by dword_E6E120):

Signal | Handler | Behavior
SIGINT (2) | handler at 0x5AF2C0 | Write newline to stderr, call sub_5AF2B0(9) which writes signoff then exit(4)
SIGTERM (15) | handler at 0x5AF2C0 | Same as SIGINT
SIGXCPU (24) | sub_5AF270 | Print "Internal error: CPU time limit exceeded.\n", call sub_5AF1D0(11) which calls abort()
SIGXFSZ (25) | SIG_IGN | Ignored (prevents crash on large output files)

After signal setup, dword_E6E120 is set to 0 so handlers are registered only once.

Locale: Calls newlocale(LC_NUMERIC, "C", 0) then uselocale() to force the C locale for numeric output. If either call fails, asserts with "could not set LC_NUMERIC locale" at host_envir.c:264.
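A minimal sketch of the same locale pinning on a POSIX.1-2008 system (glibc assumed). The function names are hypothetical. Note that the POSIX newlocale() interface takes a category mask, so the sketch passes LC_NUMERIC_MASK; which constant the binary actually passes is not verified here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Installs a per-thread locale whose numeric category is "C".
 * Returns 0 on success, -1 if either call fails (the point where
 * cudafe++ is described as asserting). */
int force_c_numeric_locale(void)
{
    locale_t c_numeric = newlocale(LC_NUMERIC_MASK, "C", (locale_t)0);
    if (c_numeric == (locale_t)0)
        return -1;
    if (uselocale(c_numeric) == (locale_t)0) {
        freelocale(c_numeric);
        return -1;
    }
    return 0;
}

/* Numeric output now always uses '.' as the decimal separator. */
int format_fixed(char *buf, size_t size, double value)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%.2f", value);
}
```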

Working directory: Iteratively calls getcwd() with a growing buffer (starting at 256 bytes, expanding by 256 on ERANGE) until it fits, then copies the result into qword_126EEA0 via permanent allocation.
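The growing-buffer loop can be sketched like this. It is an illustration under assumptions: the name capture_cwd and the use of realloc() are mine, whereas the binary copies the final string into its own permanent allocator.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns a heap-allocated copy of the current working directory, or
 * NULL on failure. The buffer starts at `step` bytes and grows by
 * `step` each time getcwd() reports ERANGE (step = 256 in the text). */
char *capture_cwd(size_t step)
{
    size_t size = step;
    char  *buf  = NULL;

    assert(step > 0);
    for (;;) {
        char *grown = realloc(buf, size);
        if (grown == NULL) {
            free(buf);
            return NULL;
        }
        buf = grown;
        if (getcwd(buf, size) != NULL)
            return buf;            /* fits: caller owns the buffer */
        if (errno != ERANGE) {
            free(buf);             /* real error, not "too small" */
            return NULL;
        }
        size += step;
    }
}
```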

Environment variables:

  • EDG_BASE -- read into qword_126EE38 (base path for EDG data files; empty string if unset)
  • EDG_SUPPRESS_ASSERTION_LINE_NUMBER -- if set and not "0", sets dword_126ED40 = 1 (suppress line numbers in internal assertion messages)

CPU time limit: Calls getrlimit(RLIMIT_CPU) then setrlimit() with rlim_cur = RLIM_INFINITY to disable the CPU time limit.

Global zeroing: Zeros ~50 host-environment globals including file descriptors, path buffers, platform flags, output filename pointers.

Language mode: Sets dword_126EFB4 = 2 (default to C++ mode -- this is later overridden by CLI parsing if the c flag, case 25, is given).

Sentinel validation: Checks off_E6E0E0 against the string "last" to verify that the predef_macro_mode_names table was properly initialized at link time. On mismatch, asserts with "predef_macro_mode_names not initialized properly" at host_envir.c:6927.

Stack Limit Adjustment

Between CLI parsing and heavy initialization, main() conditionally raises the process stack limit:

if (dword_106C064 && !getrlimit(RLIMIT_STACK, &rlimits)) {
    original_stack = rlimits.rlim_cur;
    rlimits.rlim_cur = rlimits.rlim_max;   // raise soft to hard limit
    stack_was_raised = (setrlimit(RLIMIT_STACK, &rlimits) == 0);
}

The flag dword_106C064 is set to 1 by default in sub_45EB40 (cmd_line_pre_init) and can be disabled via the --modify_stack_limit=false CLI flag. The purpose is to prevent stack overflow during deep recursion in the C++ parser, template instantiation engine, and constexpr interpreter. After compilation completes (just before exit), main() restores the original rlim_cur value.

Phase 3: fe_one_time_init -- sub_585DB0 (0x585DB0)

This is the heaviest initialization stage. It zeroes the token state (qword_126DD38 -- 6 bytes packed as a dword + word), optionally calls sub_5AF330 for profiling init if dword_106BD4C is set, then makes 38 sequential calls to subsystem one-time initializers.

One-Time Init Call Table

# | Address | Identity | Source file
1 | sub_752DF0 | type_one_time_init | type.c
2 | sub_5B1D40 | scope_one_time_init | scope.c
3 | sub_447430 | error_one_time_init | error.c
4 | sub_4B37F0 | preprocessor_one_time_init | preproc.c
5 | sub_4E8ED0 | declaration_one_time_init | decls.c
6 | sub_4C0840 | attribute_one_time_init | attribute.c
7 | sub_4A1B60 | name_one_time_init | lookup.c
8 | sub_4E9CF0 | declaration_one_time_init_2 | decl_spec.c
9 | sub_4ED710 | declaration_one_time_init_3 | declarator.c
10 | sub_510C30 | statement_one_time_init | stmt.c
11 | sub_56DC90 | exprutil_one_time_init | exprutil.c
12 | sub_5A5160 | expression_one_time_init | expr.c
13 | sub_603B00 | parser_one_time_init | parse.c
14 | sub_5CF7F0 | class_one_time_init | class_decl.c
15 | sub_65DC50 | overload_one_time_init | overload.c
16 | sub_69C8B0 | template_one_time_init | template.c
17 | sub_665A00 | instantiation_one_time_init | instantiate.c
18 | sub_689550 | exception_one_time_init | except.c
19 | sub_68F640 | conversion_one_time_init | convert.c
20 | sub_6B6510 | il_one_time_init | il.c
21 | sub_6BAE70 | srcfile_one_time_init | srcfile.c
22 | sub_6F5FC0 | il_walk_one_time_init | il_walk.c
23 | sub_6F8300 | il_one_time_init_2 | il.c
24 | sub_6FDFF0 | lower_one_time_init | lower_il.c
25 | sub_726DC0 | mangling_one_time_init | lower_name.c
26 | sub_72D410 | mangling_one_time_init_2 | lower_name.c
27 | sub_74B9A0 | typecheck_one_time_init | typecheck.c
28 | sub_710B70 | il_one_time_init_3 | il.c
29 | sub_76D630 | codegen_one_time_init | cp_gen_be.c
30 | nullsub_11 | debug_one_time_init | debug.c (no-op)
31 | sub_7A4690 | allocation_one_time_init | il_alloc.c
32 | sub_7A3920 | pool_one_time_init | il_alloc.c
33 | sub_6A0E90 | template_one_time_init_2 | template.c
34 | sub_418F80 | diagnostics_one_time_init | diag.c
35 | sub_5859C0 | extasm_one_time_init | extasm.c
36 | sub_751540 | type_one_time_init_2 | type.c
37 | sub_7C25F0 | template_one_time_init_3 | template.c
38 | sub_7DF400 | cuda_one_time_init | nv_transforms.c

The call order reflects dependency constraints: types before scopes, scopes before declarations, declarations before expressions, expressions before the parser, etc. Template initialization is split across three calls (#16, #33, #37) because different phases of template support depend on different subsystems being initialized first.

Function Pointer Table Validation

After all 38 initializers complete, sub_585DB0 performs a critical integrity check:

if (funcs_6F71AE || off_D560C0 != nullsub_6)
    sub_4F21C0("function_pointers is incorrectly initialized");

This validates two conditions:

  1. funcs_6F71AE must be zero. This global acts as a "dirty flag" -- if any initializer wrote a nonzero value here, the table was not properly zeroed during static initialization.

  2. off_D560C0 must point to nullsub_6 (0x585B00). The address off_D560C0 is the last entry in a function pointer dispatch table in .data (0xD560C0 lies inside the .data range 0xD46480-0xE7EFF0 from the segment layout, not inside .rodata). The empty function nullsub_6 acts as a sentinel -- its known address is compared against the table's last slot to verify that the table was correctly populated at link time. If the linker reordered or dropped entries, the sentinel would not match.

If either check fails, sub_4F21C0 is called with the message "function_pointers is incorrectly initialized". In the decompiled code, control then continues into sub_585EE0 (fe_init_part_1). This looks like a fallthrough out of the panic handler rather than an intended call: a corrupted dispatch table is not recoverable and would likely cause crashes later.

On the normal (no-error) path, sub_585DB0 does not call sub_585EE0 at all; it returns the return value of sub_7DF400() directly. The call site that invokes fe_init_part_1 during a successful run has therefore not been identified. It most likely sits inside one of the subsystem initializers or in sub_7A40A0.
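The sentinel technique itself is easy to reproduce. The sketch below is a generic illustration with made-up names (hook_a, hook_b, table_end_sentinel); only the idea of comparing the last slot against a known empty function, plus a flag that must stay zero, comes from the check above.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*hook_fn)(void);

/* Hypothetical table members; table_end_sentinel plays the role of
 * nullsub_6. */
void hook_a(void) {}
void hook_b(void) {}
void table_end_sentinel(void) {}

/* The table's final slot must always be the sentinel. */
hook_fn function_pointers[] = { hook_a, hook_b, table_end_sentinel };

/* Counterpart of the funcs_6F71AE condition: must remain zero. */
int dirty_flag = 0;

/* True when the table passes both conditions of the integrity check. */
bool function_pointers_ok(const hook_fn *table, size_t count, int dirty)
{
    if (table == NULL || count == 0)
        return false;
    return dirty == 0 && table[count - 1] == table_end_sentinel;
}
```

If an entry is dropped or the table is truncated, the last slot no longer holds the sentinel and the check fails, which is the condition that triggers the diagnostic.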

fe_init_part_1 -- sub_585EE0 (0x585EE0)

This function performs per-compilation-unit initialization. It is identified by the debug trace string "fe_init_part_1" at level 5 and an assertion path fe_init.c:2007. Its responsibilities:

Compilation Timestamp

time(&timer);
char *t = ctime(&timer);
if (!t) t = "Sun Jan 01 00:00:00 1900\n";
if (strlen(t) > 127)
    assert("fe_init.c", 2007, "fe_init_part_1");  // buffer overflow guard
strcpy(byte_106B5C0, t);   // 128-byte timestamp buffer
dword_126EE48 = 1;         // init-complete flag

Per-Unit Initializer Call Table

After the timestamp, sub_585EE0 calls 33 per-compilation-unit initializers:

# | Address | Identity
1 | sub_4ED7C0 | declaration_unit_init
2 | nullsub_7 | (no-op placeholder)
3 | sub_65DC20 | overload_unit_init
4 | sub_6BB350 | srcfile_unit_init
5 | sub_5B22E0 | scope_unit_init
6 | sub_603B30 | parser_unit_init
7 | sub_5D0170 | class_unit_init
8 | sub_61EBD0 | expression_unit_init
9 | sub_68A0D0 | exception_unit_init
10 | sub_74BFF0 | typecheck_unit_init
11 | sub_710DE0 | il_unit_init
12 | sub_4E8F10 | declaration_unit_init_2
13 | sub_4C0860 | attribute_unit_init
14 | nullsub_2 | (no-op placeholder)
15 | sub_4474D0 | error_unit_init
16 | sub_665A60 | instantiation_unit_init
17 | sub_4E9D10 | decl_spec_unit_init
18 | sub_76D780 | codegen_unit_init
19 | sub_7C0300 | template_unit_init
20 | sub_7A3980 | pool_unit_init
21 | sub_56DEE0 | exprutil_unit_init
22 | nullsub_10 | (no-op placeholder)
23 | sub_6B6890 | il_unit_init_2
24 | sub_726EE0 | mangling_unit_init
25 | sub_6F5DA0 | il_walk_unit_init
26 | sub_6F8320 | il_unit_init_3
27 | sub_6FE130 | lower_unit_init
28 | sub_752FC0 | type_unit_init
29 | sub_4660B0 | folding_unit_init
30 | sub_5943E0 | float_unit_init
31 | sub_6A0F40 | template_unit_init_2
32 | sub_4190B0 | diagnostics_unit_init
33 | sub_7C2640 | template_unit_init_3

Compilation Mode Flags

After the per-unit initializers, sub_585EE0 copies global configuration values (set during CLI parsing) into the compilation-mode descriptor at 0x126EB88:

Field | Address | Source | Meaning
byte_126EB88 | 0x126EB88 | dword_126E498 | Dialect flags
byte_126EBB0 | 0x126EBB0 | dword_126EFB4 == 1 | K&R C mode
dword_126EBA8 | 0x126EBA8 | dword_126EFB4 != 2 | Not-C++ flag
dword_126EBAC | 0x126EBAC | dword_126EF68 | C standard version
byte_126EBB8 | 0x126EBB8 | dword_126EFB0 | Strict C mode
byte_126EBB9 | 0x126EBB9 | dword_126EFAC | EDG GNU-compat extensions
byte_126EBBA | 0x126EBBA | dword_126EFA4 | Clang extensions enabled
xmmword_126EBC0 | 0x126EBC0 | qword_126EF90 | Clang + GNU version thresholds (16 bytes packed)

Output File Setup

if (dword_106C298) {                          // output enabled
    if (qword_106C278)                        // output path specified
        qword_106C280 = sub_4F48F0(qword_106C278, 0, 0, 16, 1513);  // open file (ID 1513)
    else
        qword_106C280 = stdout;               // default to stdout
}
sub_5AEDB0();                                 // write output header

The output file ID 1513 is one of three output file slots used during compilation (1513, 1514, 1515).

Initialization Summary

The total initialization sequence before parsing begins involves 80+ subsystem init calls across three layers:

main()
 ├─ sub_585D60()  fe_pre_init           9 subsystem pre-inits
 │   ├─ sub_48B3C0   error              4 globals zeroed
 │   ├─ sub_6BB290   srcfile            10 globals zeroed
 │   ├─ sub_5B1E70   host_envir         signals, locale, CWD, env vars, ~50 globals
 │   ├─ sub_752C90   types              type table alloc, compiler defaults
 │   ├─ sub_45EB40   cmd_line           272-flag bitmap, ~350 config defaults
 │   ├─ sub_4ED530   declarations       error counters, diagnostic severity table (15KB)
 │   ├─ sub_6F6020   il                 3 globals zeroed
 │   ├─ [inline]     scope indices      dword_126C5E4 = dword_126C5C8 = -1
 │   ├─ sub_7A48B0   tu_tracking        13 globals zeroed
 │   └─ sub_7C00F0   templates          1 global zeroed
 │
 ├─ sub_459630()  proc_command_line     276 flags → ~150 config globals
 │
 ├─ [RLIMIT_STACK adjustment]           raise soft limit to hard limit
 │
 └─ sub_585DB0()  fe_one_time_init      38 subsystem one-time inits
     ├─ token state zeroing             qword_126DD38 = 0 (6 bytes)
     ├─ 38 subsystem calls              types → scopes → errors → ... → CUDA
     ├─ sentinel check                  funcs_6F71AE == 0 && off_D560C0 == nullsub_6
     └─ sub_585EE0()  fe_init_part_1    (on error path, or called from subsystem)
         ├─ compilation timestamp       byte_106B5C0 via ctime()
         ├─ 33 per-unit inits           declarations → overload → ... → templates
         ├─ compilation mode flags      copy CLI config into descriptor struct
         ├─ output file open            stdout or file (ID 1513)
         └─ sub_5AEDB0()               write output header

Global State Set Before Parsing

By the time sub_7A40A0 (process_translation_unit) is called, the following critical globals have been established:

Global | Address | Value | Set by
dword_126EFB4 | 0x126EFB4 | 2 (C++) | sub_5B1E70 default, may be overridden by CLI
dword_126EF68 | 0x126EF68 | C/C++ standard version | CLI parsing
dword_106C064 | 0x106C064 | 1 (stack limit ON) | sub_45EB40 default
dword_106C0A4 | 0x106C0A4 | 0 or 1 | CLI --timing flag
qword_126EEE0 | 0x126EEE0 | source filename | CLI parsing
qword_106C280 | 0x106C280 | output FILE* | sub_585EE0
qword_126EDF0 | 0x126EDF0 | stderr | main() + sub_4ED530
dword_126EE48 | 0x126EE48 | 1 | sub_585EE0 (init-complete flag)
byte_106B5C0 | 0x106B5C0 | ctime string | sub_585EE0 (compilation timestamp)
dword_126C5E4 | 0x126C5E4 | -1 then updated | sub_585D60 then scope init
qword_126F120 | 0x126F120 | C locale handle | sub_5B1E70
qword_126EEA0 | 0x126EEA0 | CWD string copy | sub_5B1E70

The Error Gate

The transition from frontend to backend is controlled by a simple error check:

if (!qword_126ED90)          // qword_126ED90 = error count from frontend
    goto backend_label;      // no errors → run backend
dword_106C254 = 1;           // errors → set skip-backend flag

When dword_106C254 == 1, the backend stage (sub_489000) is skipped entirely. The process still writes a signoff trailer and exits with a nonzero status code. This means a cudafe++ compilation with frontend errors produces no .int.c output file -- the backend never runs.

Exit Code Mapping

The exit function sub_5AF1D0 at 0x5AF1D0 maps internal status codes to process exit codes:

Internal Code | Meaning | Process Exit | Message
3, 4, 5 | Success (various) | exit(0) | (none)
8 | Warnings only | exit(2) | (none)
9, 10 | Compilation errors | exit(4) | "Compilation terminated.\n"
11 | Internal error | abort() | "Compilation aborted.\n"
(other) | Unknown/fatal | abort() | (none)

In SARIF mode (dword_106BBB8 set), the text messages ("Compilation terminated.", "Compilation aborted.") are suppressed, but exit codes remain identical.
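The mapping can be expressed as a table-driven pure function, which is convenient for testing a reimplementation. This is a sketch: the exit_action type and map_status name are hypothetical, and the real sub_5AF1D0 performs the exit()/abort() call itself instead of returning a description of it.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool        aborts;     /* true: abort(); false: exit(exit_code) */
    int         exit_code;  /* meaningful only when aborts is false */
    const char *message;    /* NULL when nothing is printed */
} exit_action;

/* Describes what the exit routine does for a given internal status.
 * sarif_mode corresponds to dword_106BBB8 being set. */
exit_action map_status(int status, bool sarif_mode)
{
    exit_action a = { false, 0, NULL };

    switch (status) {
    case 3: case 4: case 5:
        a.exit_code = 0;
        break;
    case 8:
        a.exit_code = 2;
        break;
    case 9: case 10:
        a.exit_code = 4;
        a.message   = "Compilation terminated.\n";
        break;
    case 11:
        a.aborts  = true;
        a.message = "Compilation aborted.\n";
        break;
    default:                 /* unknown status: abort with no message */
        a.aborts = true;
        break;
    }
    if (sarif_mode)
        a.message = NULL;    /* text suppressed, exit behavior unchanged */
    return a;
}
```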


CLI Processing

proc_command_line (sub_459630) at 0x459630 is a 21,773-byte function (4,105 decompiled lines, 296 callees) in cmd_line.c that parses the entire cudafe++ command line. It registers 276 flags into a flat lookup table, iterates argv with prefix-matching against that table, dispatches each matched flag through a 275-case switch statement, then resolves language dialect settings and opens output files. This function is the second stage of the pipeline, called directly from main() at 0x408950 before any heavy initialization.

cudafe++ is not normally invoked directly. NVIDIA's driver compiler nvcc decomposes its own options and passes the appropriate low-level flags via -Xcudafe <flag>. The full flag inventory is in CLI Flag Inventory; this page documents the implementation mechanics of the parsing system itself.

Key Facts

Property | Value
Address | 0x459630
Binary size | 21,773 bytes
Decompiled lines | 4,105
Source file | cmd_line.c
Signature | int64_t proc_command_line(int argc, char** argv)
Direct callees | 296
Flag table base | dword_E80060
Flag table entry size | 40 bytes
Flag table capacity | 552 entries (overflow panics via sub_40351D)
Registered flags | 276
Switch cases | 275 (case IDs 1--275)
Default-suppressed diagnostics | 9 (1257, 1373, 1374, 1375, 1633, 2330, 111, 185, 175)

Flag Table Layout

The flag table is a contiguous array starting at dword_E80060. Each of the 552 slots occupies 40 bytes. The current entry count is tracked in dword_E80058.

Offset   Field            Type       Access pattern
------   -----            ----       --------------
+0       case_id          int32      dword_E80060[idx * 10]
+8       name             char*      qword_E80068[idx * 5]
+16      short_char       int16      word_E80070[idx * 20]    (low byte = char, high byte = 1)
+17      is_valid         int8       (high byte of short_char word, always 1)
+18      takes_value      int8       byte_E80072[idx * 40]
+19      is_boolean       int8       byte_E80073[idx * 40]    (0xE80073 - 0xE80060 = 0x13 = 19)
+24      name_length      int64      qword_E80078[idx * 5]    (precomputed strlen)
+32      mode_flag        int32      dword_E80080[idx * 10]   (the visible bit is part of this dword)

The flag-was-set bitmap at byte_E7FF40 spans 0x110 bytes (272 slots). When a flag is matched during parsing, the corresponding byte is set to 1 to record that the user explicitly provided it. The bitmap is zeroed by cmd_line_pre_init (sub_45EB40) before every compilation.

Registration: sub_452010 (init_command_line_flags)

sub_452010 at 0x452010 is a 30,133-byte function (3,849 decompiled lines) that populates the entire flag table. It is called once, at line 280 of proc_command_line, before the parsing loop begins.

register_command_flag (sub_451F80)

Each flag is registered through sub_451F80 (25 lines), called approximately 275 times from sub_452010:

void register_command_flag(
    int    case_id,       // dispatch ID for the switch (1-275)
    char*  name,          // flag name without dashes ("preprocess", "timing", etc.)
    char   short_opt,     // single-char alias ('E', '#', etc.), 0 for none
    char   takes_value,   // 1 if the flag requires =<value>
    int    mode_flag,     // visibility/classification (mode vs. action)
    char   enabled        // whether the flag is active (1 = registered, 0 = disabled)
);

The function writes into the next free slot at index dword_E80058, precomputes strlen(name) into name_length, always sets the is_valid byte to 1, then increments the counter. If the counter reaches 552, it panics via sub_40351D -- the table is statically sized.
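A compact model of the table and the registration routine, useful as a starting point for a reimplementation. It is a sketch under assumptions: the struct reproduces the documented 40-byte stride on an LP64 target, but the field that stores the enabled argument is a guess, and the overflow handler is a stand-in for sub_40351D.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FLAG_TABLE_CAPACITY 552

typedef struct {
    int32_t     case_id;      /* +0  dispatch ID for the switch */
    const char *name;         /* +8  flag name without dashes */
    char        short_char;   /* +16 single-char alias, 0 for none */
    int8_t      is_valid;     /* +17 always 1 once registered */
    int8_t      takes_value;  /* +18 */
    int8_t      enabled;      /* +19 placement is an assumption */
    int64_t     name_length;  /* +24 precomputed strlen(name) */
    int32_t     mode_flag;    /* +32 */
} flag_entry;                 /* padded to 40 bytes on LP64 */

flag_entry flag_table[FLAG_TABLE_CAPACITY];
int        flag_count;        /* counterpart of dword_E80058 */

void register_command_flag(int case_id, const char *name, char short_opt,
                           char takes_value, int mode_flag, char enabled)
{
    if (flag_count >= FLAG_TABLE_CAPACITY) {   /* table is fixed-size */
        fputs("flag table overflow\n", stderr);
        abort();
    }
    flag_entry *e  = &flag_table[flag_count];
    e->case_id     = case_id;
    e->name        = name;
    e->short_char  = short_opt;
    e->is_valid    = 1;
    e->takes_value = (int8_t)takes_value;
    e->enabled     = (int8_t)enabled;
    e->name_length = (int64_t)strlen(name);
    e->mode_flag   = mode_flag;
    flag_count++;
}
```

Storing name_length at registration time lets the parsing loop call strncmp with a known length for every candidate instead of recomputing strlen on each comparison.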

Paired Toggle Registration

Approximately half of all flags are boolean toggles registered as pairs: --flag and --no_flag share the same case_id but differ in which value they write. Pairs are registered in two ways:

  1. Two sequential register_command_flag calls -- both point to the same case_id; the parsing loop determines whether the matched name starts with no_ and sets the target global to 0 or 1 accordingly.

  2. Inline table population -- seven additional paired flags (relaxed_abstract_checking, concepts, colors, keep_restrict_in_signatures, check_unicode_security, old_id_chars, add_match_notes) are written directly into the array without going through register_command_flag.
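For the first registration style, the value written by a paired toggle depends only on whether the matched name carries the no_ prefix. A simplified sketch (the function name is hypothetical, and it ignores flags such as no_code_gen whose names start with no_ without being the negative half of a pair):

```c
#include <string.h>

/* Returns the value a paired toggle writes to its target global:
 * 0 when the matched name starts with "no_", 1 otherwise. */
int toggle_value(const char *matched_name)
{
    return strncmp(matched_name, "no_", 3) == 0 ? 0 : 1;
}
```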

Parsing Loop

After flag registration, proc_command_line performs five sequential setup steps, then enters the main argv iteration.

Pre-Loop Setup

Step 1:  Initialize qword_126DD38, qword_126EDE8 (token state / source position)
Step 2:  Call sub_452010() -- register all 276 flags
Step 3:  Allocate 4 hash tables (16-byte header + 256-byte data each):
           qword_106C248  macro define/alias map
           qword_106C240  include path list
           qword_106C238  system include map
           qword_106C228  additional system include map
Step 4:  Suppress 9 diagnostic numbers by default:
           1257, 1373, 1374, 1375, 1633, 2330, 111, 185, 175
         Each via sub_4ED400(number, suppress_severity, 1)
Step 5:  Set dword_E7FF20 = 1 (argv index, skipping argv[0])

The default-suppressed diagnostics are EDG warnings that NVIDIA considers noise for CUDA compilation. Diagnostics 111 ("statement is unreachable"), 185 ("pointless comparison of unsigned integer with zero"), and 175 ("subscript out of range") are common false positives in template-heavy CUDA code.

argv Iteration

The loop processes argv[dword_E7FF20] through argv[argc-1]. For each argument:

  1. Dash detection -- if the argument does not start with -, it is treated as the input filename (stored in qword_126EEE0). Only one non-flag argument is expected.

  2. Short flag matching -- for single-dash arguments (-X), the parser scans the flag table for an entry whose short_char matches. If the flag takes_value, the next argv element is consumed as the value.

  3. Long flag matching -- for double-dash arguments (--flag-name), the parser calls parse_flag_name_value (sub_451EC0) to split on =:

// sub_451EC0: split "--name=value" into name and value
// Respects backslash escapes and quoted strings
// If no '=' found: *name_out = src, *value_out = NULL
void parse_flag_name_value(char* src, char** name_out, char** value_out);

The name portion is then matched against the flag table using strncmp with each entry's precomputed name_length. The parser iterates all entries and counts exact and prefix matches:

  • Exact match (length equals name_length and strncmp returns 0) -- dispatches immediately.
  • Unique prefix match (only one entry's name starts with the given prefix) -- dispatches to that entry.
  • Ambiguous prefix (multiple entries match the prefix) -- emits error 923 ("ambiguous command-line option").
  • No match -- the argument is silently ignored or treated as input.
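The four outcomes above can be sketched as a single scan over the flag names. The enum and function names are illustrative assumptions, and the sketch calls strlen on each entry where the real table would use the precomputed name_length.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    MATCH_NONE,
    MATCH_EXACT,
    MATCH_UNIQUE_PREFIX,
    MATCH_AMBIGUOUS        /* reported as error 923 in cudafe++ */
} match_kind;

/* Looks up arg among count flag names. On MATCH_EXACT or
 * MATCH_UNIQUE_PREFIX, *index_out receives the matching entry. */
match_kind match_flag(const char *const *names, size_t count,
                      const char *arg, size_t *index_out)
{
    size_t arg_len      = strlen(arg);
    size_t prefix_hits  = 0;
    size_t first_prefix = 0;

    for (size_t i = 0; i < count; i++) {
        size_t name_len = strlen(names[i]);
        if (arg_len > name_len || strncmp(arg, names[i], arg_len) != 0)
            continue;                    /* arg is not a prefix of this name */
        if (arg_len == name_len) {       /* exact match wins immediately */
            *index_out = i;
            return MATCH_EXACT;
        }
        if (prefix_hits++ == 0)
            first_prefix = i;
    }
    if (prefix_hits == 1) {
        *index_out = first_prefix;
        return MATCH_UNIQUE_PREFIX;
    }
    return prefix_hits == 0 ? MATCH_NONE : MATCH_AMBIGUOUS;
}
```

With the names preprocess, promote_warnings, and timing, the argument pre resolves uniquely to preprocess while pr is ambiguous.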

Conflict Detection

Before the main loop, check_conflicting_flags (sub_451E80, 15 lines) validates that mutually exclusive flags were not specified together. It checks byte_E7FFF2 || byte_E80031 || byte_E80032 || byte_E80033, corresponding to flags 3, 193, 194, and 195. If any conflict is detected, it emits error 1027 via sub_4F8480.

The Dispatch Switch (275 Cases)

After a flag is matched, its case_id indexes into a giant switch statement occupying the bulk of proc_command_line. The following sections document the most important cases grouped by function.

Preprocessor Control (Cases 3--9)

| Case | Flag | Global(s) | Behavior |
| --- | --- | --- | --- |
| 3 | no_line_commands | dword_106C29C=1, dword_106C294=1, dword_106C288=0 | Suppress #line in preprocessor output |
| 4 | preprocess | dword_106C29C=1, dword_106C294=1, dword_106C288=1 | Preprocessor-only mode (output to stdout) |
| 5 | comments | (flag bitmap) | Preserve comments in preprocessor output |
| 6 | old_line_commands | (flag bitmap) | Use old-style # N "file" line directives |
| 8 | dependencies | (multiple) | Dependencies output mode (preprocessor-only + dependency emission) |
| 9 | trace_includes | (flag bitmap) | Print each #include as it is opened |

Compilation Mode (Cases 14, 20--26)

| Case | Flag | Global | Behavior |
| --- | --- | --- | --- |
| 14 | no_code_gen | dword_106C254 = 1 | Parse-only mode -- sets the skip-backend flag, preventing process_file_scope_entities from running |
| 20 | timing | dword_106C0A4 = 1 | Enable compilation phase timing. main() checks this flag to decide whether to call sub_5AF350/sub_5AF390 for "Front end time", "Back end time", "Total compilation time" |
| 21 | version | (stdout) | Print the version banner and continue (does not exit). Banner includes: "cudafe: NVIDIA (R) Cuda Language Front End", "Portions Copyright (c) 2005, 2024-YYYY NVIDIA Corporation", "Portions Copyright (c) 1988-2018, 2024 Edison Design Group Inc.", "Based on Edison Design Group C/C++ Front End, version 6.6", "Cuda compilation tools, release 13.0, V13.0.88" |
| 22 | no_warnings | byte_126ED69 = 7 | Set diagnostic severity threshold to error-only (suppress all warnings and remarks) |
| 23 | promote_warnings | byte_126ED68 = 5 | Promote all warnings to errors |
| 24 | remarks | byte_126ED69 = 4 | Lower threshold to include remark-level diagnostics |
| 25 | c | calls sub_44C4F0(0) | Force C language mode (overrides default C++ if currently in C++ mode) |
| 26 | c++ | calls sub_44C4F0(2) | Force C++ language mode |

Diagnostic Control (Cases 39--44)

Cases 39--43 (diag_suppress, diag_remark, diag_warning, diag_error, diag_once) share the same value-parsing logic:

1. Read the value string (after '=')
2. Strip leading/trailing whitespace
3. Split on commas
4. For each token:
   a. Parse as integer (diagnostic number)
   b. Call sub_4ED400(number, severity, 1)

The severity values map to:

  • Suppress = skip entirely
  • Remark = informational (level 4)
  • Warning = default warning (level 5)
  • Error = hard error (level 7)
  • Once = emit on first occurrence only
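
The shared value-parsing steps can be sketched as follows. This is an illustration under stated assumptions: apply_diag_list and the callback type are hypothetical names, the callback stands in for sub_4ED400(number, severity, 1), and returning -1 on a non-numeric token is a choice made for the sketch rather than documented behavior. record_severity is only a recording stub so the sketch can be exercised.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical callback standing in for sub_4ED400(number, severity, 1). */
typedef void (*severity_setter)(long number, int severity, int flag);

/* Recording stub used only to demonstrate the parser. */
static long recorded[16];
static int  recorded_count;
static void record_severity(long number, int severity, int flag)
{
    (void)severity;
    (void)flag;
    if (recorded_count < 16)
        recorded[recorded_count++] = number;
}

/* Parse "n1, n2, ..." and apply `severity` to each diagnostic number.
 * Returns the number of entries applied, or -1 on a malformed token. */
static int apply_diag_list(const char *value, int severity, severity_setter set)
{
    assert(value && set);
    int applied = 0;
    const char *p = value;
    while (*p) {
        /* skip separators and surrounding whitespace */
        while (isspace((unsigned char)*p) || *p == ',')
            p++;
        if (!*p)
            break;
        char *end;
        long number = strtol(p, &end, 10);
        if (end == p)
            return -1;                  /* token is not an integer */
        set(number, severity, 1);
        applied++;
        p = end;
        while (isspace((unsigned char)*p))
            p++;
        if (*p && *p != ',')
            return -1;                  /* trailing junk after the number */
    }
    return applied;
}
```

For example, passing " 111, 185 ,175" with severity 7 invokes the callback three times, once per diagnostic number.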

Case 44 (display_error_number / no_display_error_number) toggles whether error codes appear in diagnostic messages.

CUDA-Specific Flags (Cases 45--89)

Output File Paths

| Case | Flag | Global | Description |
| --- | --- | --- | --- |
| 45 | gen_c_file_name | qword_106BF20 | Path for the generated .int.c file |
| 85 | gen_device_file_name | (has_arg global) | Device-side output file name |
| 86 | stub_file_name | (has_arg global) | Stub file output path |
| 87 | module_id_file_name | (has_arg global) | Module ID file path |
| 88 | tile_bc_file_name | (has_arg global) | Tile bitcode file path |

Data Model (Cases 65--66, 90--91)

| Case | Flag | Behavior |
| --- | --- | --- |
| 65 | force-lp64 | LP64 model: pointer size=8, long size=8, specific type encodings for 64-bit |
| 66 | force-llp64 | LLP64 model (Windows): pointer size=8, long size=4 |
| 90 | m32 | ILP32 model: all type sizes set for 32-bit (pointer=4, long=4, etc.) |
| 91 | m64 | 64-bit mode (default on Linux x86-64) |
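
As a reference point, the three named data models differ only in the widths of long and pointers. The sketch below encodes the standard definitions of LP64, LLP64, and ILP32; the data_model struct and model_for_flag are hypothetical names, and the mapping says nothing about which internal globals cudafe++ actually writes.

```c
#include <assert.h>
#include <string.h>

/* Type sizes in bytes for one data model (illustrative struct). */
typedef struct {
    int int_size;
    int long_size;
    int long_long_size;
    int pointer_size;
} data_model;

static const data_model MODEL_LP64  = { 4, 8, 8, 8 };  /* Linux/macOS 64-bit */
static const data_model MODEL_LLP64 = { 4, 4, 8, 8 };  /* Windows 64-bit     */
static const data_model MODEL_ILP32 = { 4, 4, 8, 4 };  /* 32-bit targets     */

/* Map a flag name from the table above to its data model.
 * Returns NULL for flags that do not select a model in this sketch. */
static const data_model *model_for_flag(const char *flag)
{
    assert(flag);
    if (strcmp(flag, "force-lp64") == 0)
        return &MODEL_LP64;
    if (strcmp(flag, "force-llp64") == 0)
        return &MODEL_LLP64;
    if (strcmp(flag, "m32") == 0)
        return &MODEL_ILP32;
    return NULL;
}
```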

Device Compilation Control

| Case | Flag | Global | Description |
| --- | --- | --- | --- |
| 46 | msvc_target_version | dword_126E1D4 | MSVC version for compatibility emulation |
| 47 | host-stub-linkage-explicit | boolean | Use explicit linkage on generated host stubs |
| 48 | static-host-stub | boolean | Generate static (internal linkage) host stubs |
| 49 | device-hidden-visibility | boolean | Apply hidden visibility to device symbols |
| 52 | no-device-int128 | boolean | Disable __int128 type support on device |
| 53 | no-device-float128 | boolean | Disable __float128 type support on device |
| 54 | fe-inlining | dword_106C068 = 1 | Enable frontend inlining pass |
| 55 | modify-stack-limit | dword_106C064 | Whether main() raises the process stack limit via setrlimit. Default is ON. Value parsed as integer: nonzero enables, zero disables. |
| 71 | keep-device-functions | boolean | Do not strip unused device functions |
| 72 | device-syntax-only | boolean | Device-side syntax check without code generation |
| 77 | device-c | boolean | Relocatable device code (RDC) mode |
| 82 | debug_mode | dword_106BFC4=1, dword_106BFC0=1, dword_106BFBC=1 | Full debug mode (sets three debug globals simultaneously) |
| 89 | tile-only | boolean | Tile-only compilation mode |

Template Instantiation (Case 16)

The instantiate flag takes a string value and sets dword_106C094:

| Value | dword_106C094 | Meaning |
| --- | --- | --- |
| "none" | 0 | No implicit instantiation |
| "all" | 1 | Instantiate all referenced templates |
| "used" | 2 | Instantiate only used templates |
| "local" | 3 | Local instantiation only |

Include and Macro Arguments (Cases 29--31, 167)

Cases 29 (include_directory / -I) and 167 (sys_include) append entries to linked lists via sub_4595D0:

// sub_4595D0: append_to_linked_list
// Allocates a 24-byte node: {next_ptr, string_ptr, int_field}
// Appends to singly-linked list with head/tail pointers
void append_to_linked_list(list_head*, char* string, int type);

A special case: -I- (the literal string "-") sets a flag for stdin include mode rather than appending to the path list. It calls sub_5AD0A0 for the actual path registration.

Case 30 (define_macro / -D) builds a linked list of macro definitions via sub_4595D0. Case 31 (undefine_macro / -U) allocates the same 24-byte node but marks the int_field as 1 to indicate undefine.
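
A minimal sketch of the append operation follows, assuming a head/tail list and the {next_ptr, string_ptr, int_field} node described above. The type and function names are hypothetical, and malloc stands in for whatever allocator the binary really uses. On a typical LP64 target this node occupies 24 bytes (8 + 8 + 4 plus padding), which agrees with the documented node size.

```c
#include <assert.h>
#include <stdlib.h>

/* Node layout following the description: next pointer, string, int field. */
typedef struct arg_node {
    struct arg_node *next;
    const char      *string;
    int              kind;     /* e.g. 0 = define, 1 = undefine */
} arg_node;

typedef struct {
    arg_node *head;
    arg_node *tail;
} arg_list;

/* Append to the tail so that command-line order is preserved. */
static arg_node *append_arg(arg_list *list, const char *string, int kind)
{
    assert(list && string);
    arg_node *node = malloc(sizeof *node);
    if (!node)
        return NULL;
    node->next = NULL;
    node->string = string;
    node->kind = kind;
    if (list->tail)
        list->tail->next = node;
    else
        list->head = node;      /* first element */
    list->tail = node;
    return node;
}
```

Keeping a tail pointer makes each append constant-time, and preserving order matters because a -D followed by a -U of the same macro must be replayed in the order given.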

Language Standard Selection (Cases 7, 178--179, 204, 228, 240--252)

These cases set dword_126EF68 -- the internal value of __cplusplus or __STDC_VERSION__:

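| Case(s) | Flag | dword_126EF68 | Standard |
| --- | --- | --- | --- |
| 228 | c++98 | 199711 | C++98/03 |
| 204 | c++11 | 201103 | C++11 |
| 240 | c++14 | 201402 | C++14 |
| 246 | c++17 | 201703 | C++17 |
| 251 | c++20 | 202002 | C++20 |
| 252 | c++23 | 202302 | C++23 |
| 178 | c99 | 199901 | C99 (calls set_c_mode) |
| 179 | pre-c99 | 199000 | Pre-C99 |
| 241 | c11 | 201112 | C11 |
| 242 | c17 | 201710 | C17 |
| 243 | c23 | 202311 | C23 |
| 7 | old_c | (K&R) | K&R C via sub_44C4F0(1) |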

SM Architecture Target (Case 245)

case 245:  // --target=<sm_arch>
    dword_126E4A8 = sub_7525E0(value_string);

sub_7525E0 parses the SM architecture string (e.g., "sm_90", "sm_100") and returns the internal architecture code stored in dword_126E4A8. This value gates which CUDA features are available during compilation (see Architecture Feature Gating).

Host Compiler Compatibility (Cases 95--98, 182--188)

| Case | Flag | Globals | Behavior |
| --- | --- | --- | --- |
| 182 | gcc / no_gcc | dword_126EFA8, dword_126EFB0 | Enable/disable GCC compatibility mode + GNU extensions |
| 184 | gnu_version | qword_126EF98 | GCC version number (default: 80100 = GCC 8.1.0). Parsed as integer. |
| 187 | clang / no_clang | dword_126EFA4 | Enable/disable Clang compatibility mode |
| 188 | clang_version | qword_126EF90 | Clang version number (default: 90100 = Clang 9.1.0) |
| 95 | pgc++ | boolean | PGI C++ compiler mode |
| 96 | icc | boolean | Intel ICC mode |
| 97 | icc_version | (has_arg) | Intel ICC version number |
| 98 | icx | boolean | Intel ICX (oneAPI DPC++) mode |

Raw Flag Manipulation (Case 193)

case 193:  // --set_flag=<name>=<value>  or  --clear_flag=<name>
    // Looks up <name> in off_D47CE0 (a name-to-address lookup table)
    // Sets the corresponding global to <value> (integer)

This is a backdoor for nvcc to set arbitrary internal globals by name, used for flags that do not have dedicated case_id entries.

Output Mode (Case 274)

case 274:  // --output_mode=text  or  --output_mode=sarif
    if (strcmp(value, "text") == 0)
        output_mode = 0;     // plain text diagnostics (default)
    else if (strcmp(value, "sarif") == 0)
        output_mode = 1;     // SARIF JSON diagnostics

SARIF (Static Analysis Results Interchange Format) output is used by IDE integrations and CI pipelines. When enabled, diagnostic messages are emitted as structured JSON instead of traditional file:line: error: format.

Dump Options (Case 273)

case 273:  // --dump_command_options
    // Iterates the entire flag table
    // For each entry where is_valid == 1:
    //   printf("--%s ", name);
    // Then exits

This is a diagnostic/debug mode that prints every registered flag name and exits. Used by nvcc to discover the cudafe++ flag namespace.

Post-Parsing: Dialect Resolution

After the argv loop exits, proc_command_line enters a massive dialect resolution block (approximately 800 lines). This phase reconciles the various mode flags into a consistent configuration.

Input Filename Extraction

The last non-flag argv element is the input filename, stored in qword_126EEE0. This pointer is later passed to process_translation_unit (sub_7A40A0) in stage 5 of the pipeline.

Memory Region Initialization

Eleven memory regions (numbered 1--11) are initialized with default configurations. These correspond to CUDA memory spaces (global, shared, constant, local, texture, etc.) and are used by the frontend to track address space qualifiers.

GCC/Clang Feature Resolution

The resolver checks GCC version thresholds to decide which extensions to enable:

GCC version thresholds (encoded as major * 10000 + minor * 100 + patch):
  40299 (0x9D6B)  -- GCC 4.2.99 boundary
  40599 (0x9E97)  -- GCC 4.5.99 boundary
  40699 (0x9EFB)  -- GCC 4.6.99 boundary
  etc.

For each threshold, specific feature flags are conditionally enabled. For example, if GCC version >= 40599, rvalue references and variadic templates are enabled even if the language standard is technically C++03. This emulates how GCC provides extensions ahead of standards.
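
The version encoding can be checked with a small helper. This is a sketch only: encode_gnu_version and gnu_version_at_least are hypothetical names, and the comparison mirrors the ">=" wording used in the text rather than a verified instruction sequence.

```c
#include <assert.h>

/* Encode a GCC-style version as major * 10000 + minor * 100 + patch,
 * matching values such as 80100 (8.1.0) and 40599 (4.5.99). */
static long encode_gnu_version(int major, int minor, int patch)
{
    assert(major >= 0 && minor >= 0 && minor < 100 &&
           patch >= 0 && patch < 100);
    return major * 10000L + minor * 100L + patch;
}

/* Hypothetical gate: does the emulated version reach a threshold? */
static int gnu_version_at_least(long version, long threshold)
{
    return version >= threshold;
}
```

Because a ".99" patch level never ships, a threshold such as 40599 effectively separates the 4.5.x series from 4.6.0 and later.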

C++ Standard Feature Cascade

Based on the value of dword_126EF68 (__cplusplus), the resolver enables feature flags in a cascade:

199711 (C++98):  base features only
201103 (C++11):  + lambdas, rvalue_refs, auto_type, nullptr,
                   variadic_templates, unrestricted_unions,
                   delegating_constructors, user_defined_literals, ...
201402 (C++14):  + digit_separators, generic lambdas, relaxed_constexpr
201703 (C++17):  + exc_spec_in_func_type, aligned_new, if_constexpr,
                   structured_bindings, fold_expressions, ...
202002 (C++20):  + concepts, modules, coroutines, consteval, ...
202302 (C++23):  + deducing_this, multidimensional_subscript, ...
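
One way to picture the cascade is as a series of cumulative threshold checks on the __cplusplus value. The sketch below uses a small, hypothetical feature bitmask covering a few of the features listed above; the real resolver sets many individual globals, and their names and order are not shown here.

```c
#include <assert.h>

/* Illustrative feature bits; a small subset of the cascade above. */
enum {
    FEAT_LAMBDAS         = 1u << 0,
    FEAT_RVALUE_REFS     = 1u << 1,
    FEAT_GENERIC_LAMBDAS = 1u << 2,
    FEAT_IF_CONSTEXPR    = 1u << 3,
    FEAT_CONCEPTS        = 1u << 4,
    FEAT_DEDUCING_THIS   = 1u << 5
};

/* Each standard level adds to the features enabled by earlier levels. */
static unsigned features_for_standard(long cplusplus)
{
    unsigned features = 0;                          /* 199711: base only */
    if (cplusplus >= 201103L)
        features |= FEAT_LAMBDAS | FEAT_RVALUE_REFS;
    if (cplusplus >= 201402L)
        features |= FEAT_GENERIC_LAMBDAS;
    if (cplusplus >= 201703L)
        features |= FEAT_IF_CONSTEXPR;
    if (cplusplus >= 202002L)
        features |= FEAT_CONCEPTS;
    if (cplusplus >= 202302L)
        features |= FEAT_DEDUCING_THIS;
    return features;
}
```

Writing the checks as independent ">=" tests rather than an if/else chain is what makes the cascade cumulative: C++20 automatically inherits everything enabled for C++11 through C++17.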

Conflict Validation

Post-dialect resolution performs consistency checks:

  • If both gcc and clang modes are enabled, GCC takes precedence
  • If cfront_2.1 or cfront_3.0 is set alongside modern C++ features, features are silently disabled
  • If no_exceptions is set but coroutines is requested, coroutines are disabled (they require exceptions)

Output File Opening

After all flags are resolved:

  • The output .int.c file is opened (path from case 45/gen_c_file_name, or stdout if path is "-")
  • The error output file is opened if --error_output was specified (case 35)
  • The listing file is opened if --list was specified (case 33)

Default Diagnostic Severity Overrides

Nine diagnostic numbers are suppressed by default before any user --diag_suppress flags are processed:

| Diagnostic | EDG meaning | Why suppressed |
| --- | --- | --- |
| 1257 | (C++11 narrowing conversion in aggregate init) | Common in CUDA kernel argument forwarding |
| 1373 | (nonstandard extension used: zero-sized array in struct) | Used in CUDA runtime headers |
| 1374 | (nonstandard extension used: struct with no members) | Empty base optimization patterns |
| 1375 | (nonstandard extension used: unnamed struct/union) | Windows SDK compatibility |
| 1633 | (inline function linkage conflict) | Host/device function linkage edge cases |
| 2330 | (implicit narrowing conversion) | Template-heavy CUDA code triggers false positives |
| 111 | statement is unreachable | __builtin_unreachable() and device code control flow |
| 185 | pointless comparison of unsigned integer with zero | Generic template code comparing unsigned with zero |
| 175 | subscript out of range | Static analysis false positives in device intrinsics |

Users can override these defaults with explicit --diag_error=111 (or similar) on the command line, since user-specified severity always wins.
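
The override behavior follows from ordering: defaults are written first and later writes replace them. The sketch below models this with a per-diagnostic override array. The array, the MAX_DIAG bound, and the numeric value chosen for SEV_SUPPRESS are assumptions for illustration; the remark, warning, and error levels (4, 5, 7) come from the severity list earlier on this page.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SEV_UNSET    = 0,
    SEV_SUPPRESS = 1,   /* hypothetical encoding for "skip entirely" */
    SEV_REMARK   = 4,
    SEV_WARNING  = 5,
    SEV_ERROR    = 7
};

#define MAX_DIAG 4096                     /* assumed upper bound */
static unsigned char severity_override[MAX_DIAG];

/* Last write wins, so later (user) settings replace earlier defaults. */
static void set_severity(int number, int severity)
{
    if (number > 0 && number < MAX_DIAG)
        severity_override[number] = (unsigned char)severity;
}

/* Apply the nine default suppressions listed in the table above. */
static void apply_default_suppressions(void)
{
    static const int defaults[] = {
        1257, 1373, 1374, 1375, 1633, 2330, 111, 185, 175
    };
    for (size_t i = 0; i < sizeof defaults / sizeof defaults[0]; i++)
        set_severity(defaults[i], SEV_SUPPRESS);
}
```

Calling apply_default_suppressions() and then set_severity(111, SEV_ERROR) leaves diagnostic 111 at error severity while the other eight stay suppressed.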

Key Helper Functions

| Function | Address | Lines | Identity | Role |
| --- | --- | --- | --- | --- |
| sub_451E80 | 0x451E80 | 15 | check_conflicting_flags | Validates mutually exclusive flags (3/193/194/195) |
| sub_451EC0 | 0x451EC0 | 57 | parse_flag_name_value | Splits --name=value on =, respecting quotes and backslash escapes |
| sub_451F80 | 0x451F80 | 25 | register_command_flag | Inserts one entry into the flag table |
| sub_452010 | 0x452010 | 3,849 | init_command_line_flags | Registers all 276 flags (called once from proc_command_line) |
| sub_4595D0 | 0x4595D0 | 21 | append_to_linked_list | Allocates 24-byte node, appends to -D/-I argument lists |
| sub_45EB40 | 0x45EB40 | 470 | default_init | Zeros 350 global config variables + flag-was-set bitmap |
| sub_44C4F0 | 0x44C4F0 | -- | set_c_mode | Sets language mode: 0=C, 1=K&R, 2=C++ |
| sub_44C460 | 0x44C460 | -- | parse_integer_arg | Parses string argument as integer (used by error_limit, etc.) |
| sub_4ED400 | 0x4ED400 | -- | set_diagnostic_severity | Sets severity for a single diagnostic number |

Key Global Variables

| Variable | Address | Type | Set by | Description |
| --- | --- | --- | --- | --- |
| dword_E80058 | 0xE80058 | int32 | register_command_flag | Current flag table entry count (max 552) |
| dword_E80060 | 0xE80060 | array | register_command_flag | Flag table base (40 bytes/entry) |
| byte_E7FF40 | 0xE7FF40 | byte[272] | Parsing loop | Flag-was-set bitmap |
| dword_E7FF20 | 0xE7FF20 | int32 | default_init | Current argv index (initialized to 1) |
| qword_126EEE0 | 0x126EEE0 | char* | Post-parse | Input source filename |
| dword_106C254 | 0x106C254 | int32 | Case 14 | Skip-backend flag (--no_code_gen) |
| dword_106C0A4 | 0x106C0A4 | int32 | Case 20 | Timing enabled (--timing) |
| dword_126EF68 | 0x126EF68 | int32 | Standard flags | __cplusplus / __STDC_VERSION__ value |
| dword_126EFB4 | 0x126EFB4 | int32 | Mode flags | Language mode (0=unset, 1=C, 2=C++) |
| dword_126EFA8 | 0x126EFA8 | int32 | Case 182 | GCC compatibility mode enabled |
| dword_126EFA4 | 0x126EFA4 | int32 | Case 187 | Clang compatibility mode enabled |
| qword_126EF98 | 0x126EF98 | int64 | Case 184 | GCC version (default 80100 = 8.1.0) |
| qword_126EF90 | 0x126EF90 | int64 | Case 188 | Clang version (default 90100 = 9.1.0) |
| dword_126EFB0 | 0x126EFB0 | int32 | Case 182 | GNU extensions enabled |
| dword_106C064 | 0x106C064 | int32 | Case 55 | Modify stack limit (default 1) |
| dword_126E4A8 | 0x126E4A8 | int32 | Case 245 | Target SM architecture code |
| dword_106C094 | 0x106C094 | int32 | Case 16 | Template instantiation mode (0--3) |
| byte_126ED69 | 0x126ED69 | int8 | Cases 22/24 | Diagnostic severity threshold |
| byte_126ED68 | 0x126ED68 | int8 | Case 23 | Warning promotion threshold |
| qword_106BF20 | 0x106BF20 | char* | Case 45 | Output .int.c file path |
| qword_106C248 | 0x106C248 | void* | Pre-loop | Macro define/alias hash table |
| qword_106C240 | 0x106C240 | void* | Pre-loop | Include path hash table |
| qword_106C238 | 0x106C238 | void* | Pre-loop | System include map hash table |

Annotated Parsing Flow

int64_t proc_command_line(int argc, char** argv)
{
    // --- Phase 1: Global state init ---
    qword_126DD38 = 0;                         // zero token state
    qword_126EDE8 = 0;                         // zero source position

    // --- Phase 2: Register all flags ---
    init_command_line_flags();                  // sub_452010: 3849 lines, 276 flags

    // --- Phase 3: Allocate hash tables ---
    qword_106C248 = alloc_hash_table();        // macro defines/aliases
    qword_106C240 = alloc_hash_table();        // include paths
    qword_106C238 = alloc_hash_table();        // system includes
    qword_106C228 = alloc_hash_table();        // additional system includes

    // --- Phase 4: Default diagnostic suppressions ---
    set_diagnostic_severity(1257, SUPPRESS, 1);
    set_diagnostic_severity(1373, SUPPRESS, 1);
    set_diagnostic_severity(1374, SUPPRESS, 1);
    set_diagnostic_severity(1375, SUPPRESS, 1);
    set_diagnostic_severity(1633, SUPPRESS, 1);
    set_diagnostic_severity(2330, SUPPRESS, 1);
    set_diagnostic_severity(111,  SUPPRESS, 1);
    set_diagnostic_severity(185,  SUPPRESS, 1);
    set_diagnostic_severity(175,  SUPPRESS, 1);

    // --- Phase 5: Main parsing loop ---
    for (int i = 1; i < argc; i++) {
        char* arg = argv[i];
        if (arg[0] != '-') {
            qword_126EEE0 = arg;               // input filename
            continue;
        }

        // Split --name=value
        char *name, *value;
        parse_flag_name_value(arg + 2, &name, &value);  // sub_451EC0

        // Match against flag table (an exact match wins over prefix matches)
        size_t len = strlen(name);
        int prefix_matches = 0;
        int exact = 0;
        int matched_id = -1;
        for (int f = 0; f < dword_E80058; f++) {
            if (strncmp(name, flag_table[f].name, len) != 0)
                continue;
            matched_id = flag_table[f].case_id;
            if (len == flag_table[f].name_length) {
                exact = 1;                           // exact match
                break;
            }
            prefix_matches++;
        }

        if (!exact && prefix_matches > 1) {
            error(923);  // "ambiguous command-line option"
            continue;
        }
        if (matched_id < 0)
            continue;                              // no match: nothing to dispatch

        byte_E7FF40[matched_id] = 1;           // mark flag as set

        switch (matched_id) {
            case 3:   /* no_line_commands */ ...
            case 4:   /* preprocess */      ...
            ...
            case 274: /* output_mode */     ...
            case 275: /* incognito */       ...
        }
    }

    // --- Phase 6: Post-parsing dialect resolution ---
    // ~800 lines: resolve gcc/clang versions, cascade C++ features,
    // validate consistency, open output files

    // --- Phase 7: Memory region init (1-11) ---
    // Initialize CUDA memory space descriptors

    return 0;
}

Version Banner

Case 21 (--version / -v) prints the following banner to stdout (does not exit):

cudafe: NVIDIA (R) Cuda Language Front End
Portions Copyright (c) 2005, 2024-YYYY NVIDIA Corporation
Portions Copyright (c) 1988-2018, 2024 Edison Design Group Inc.
Based on Edison Design Group C/C++ Front End, version 6.6 (BUILD_DATE BUILD_TIME)
Cuda compilation tools, release 13.0, V13.0.88

Case 92 (--Version / -V) prints a different copyright format and then calls exit(1). This variant is used for machine-parseable version queries.

Relationship to Pipeline

proc_command_line is called as stage 2 of the pipeline, after fe_pre_init (sub_585D60) has initialized signal handlers, locale, working directory, and default config:

main()
  |-- sub_585D60()           [1] fe_pre_init (10 subsystem pre-initializers)
  |-- sub_5AF350()               capture_time (total start)
  |-- sub_459630(argc, argv) [2] proc_command_line  <-- THIS FUNCTION
  |-- setrlimit()                conditional stack raise (gated by dword_106C064)
  |-- sub_585DB0()           [3] fe_one_time_init (38 subsystem initializers)
  ...

By the time proc_command_line returns, every global configuration variable is set to its final value. The subsequent fe_one_time_init phase reads these globals to configure keyword tables, type system parameters, and per-translation-unit state.

Frontend Invocation

process_translation_unit (sub_7A40A0, 1267 bytes at 0x7A40A0, from EDG source file trans_unit.c) is the main frontend workhorse -- stage 5 of the pipeline. Called once from main(), it orchestrates the entire transformation from .cu source text to a fully-built EDG IL tree. The function allocates a 424-byte translation unit descriptor, opens the source file via the lexer, drives the C++ parser to completion, runs semantic analysis on the parsed declarations, and finally performs per-TU wrapup (stop-token verification, class linkage checking, module finalization). By the time it returns, every declaration, type, expression, and statement from the source has been parsed into IL nodes, CUDA execution-space attributes have been resolved, and the TU is linked into the global TU chain ready for the 5-pass fe_wrapup stage.

Key Facts

| Property | Value |
| --- | --- |
| Function | sub_7A40A0 (process_translation_unit) |
| Binary address | 0x7A40A0 |
| Binary size | 1267 bytes |
| EDG source | trans_unit.c |
| Confidence | DEFINITE (source path and function name embedded in assertion calls citing trans_unit.c lines 696, 725, and 556) |
| Signature | int process_translation_unit(char *filename, int is_recompilation, void *module_info) |
| Direct callees | 27 |
| Debug trace entry | "Processing translation unit %s\n" |
| Debug trace exit | "Done processing translation unit %s\n" |
| TU descriptor size | 424 bytes (allocated via sub_6BA0D0) |
| TU stack entry size | 16 bytes ([0]=next, [8]=tu_ptr) |

Annotated Decompilation

int process_translation_unit(char *filename,       // source file path
                              int is_recompilation, // nonzero on error-retry pass
                              void *module_info)    // non-NULL for C++20 module TUs
{
    bool is_primary = (module_info == NULL);

    // --- Debug trace on entry ---
    if (debug_verbosity > 0 || (debug_enabled && trace_category("trans_unit")))
        fprintf(stderr, "Processing translation unit %s\n", filename);

    // --- Module-mode state validation ---
    // If this is a primary TU (no module_info) but we've already seen a module TU,
    // that's an internal consistency error.
    if (is_recompilation)
        goto skip_validation;
    if (!is_primary) {
skip_validation:
        if (module_info)
            has_seen_module_tu = 1;                          // dword_12C7A88
        goto proceed;
    }
    if (has_seen_module_tu)
        assertion_failure("trans_unit.c", 696, "process_translation_unit", 0, 0);

proceed:
    // --- Save previous TU state if any ---
    if (current_translation_unit)                            // qword_106BA10
        save_translation_unit_state(current_translation_unit);  // sub_7A3A50

    // --- Reset per-TU compilation state ---
    current_source_position = 0;                             // qword_126DD38
    is_recompilation_flag = is_recompilation;                 // dword_106BA08
    current_filename = filename;                             // qword_106BA00
    has_module_info = (module_info != NULL);                  // dword_106B9F8

    // --- Initialize error/parser state ---
    reset_error_state();                                     // sub_5EAEC0
    if (is_recompilation)
        fe_init_part_1();                                    // sub_585EE0

    // ==========================================================
    //  PHASE 1: Allocate and initialize TU descriptor (424 bytes)
    // ==========================================================
    registration_complete = 1;                               // dword_12C7A8C
    tu_descriptor *tu = allocate_storage(424);               // sub_6BA0D0
    tu->next_tu = NULL;                                      // [0]
    ++tu_count;                                              // qword_12C7A78
    tu->storage_buffer = allocate_storage(per_tu_storage_size); // [16], sub_6BA0D0
    tu->tu_name = NULL;                                      // [8]
    init_scope_state(tu + 24);                               // sub_7046E0, offsets [24..192]
    tu->field_192 = 0;
    tu->field_352 = 0;
    tu->field_184 = 0;
    memset(&tu->scope_decl_area, 0, ...);                    // [200..360] zeroed
    tu->field_360 = 0;
    tu->field_368 = 0;
    tu->field_376 = 0;
    tu->flags = 0x0100;                                      // [392] = "initialized"
    tu->error_severity_count = 0;                            // [408]
    tu->field_416 = 0;

    // --- Copy registered variable defaults into per-TU storage ---
    for (reg = registered_variable_list; reg; reg = reg->next) {
        if (reg->offset_in_tu)
            *(tu + reg->offset_in_tu) = reg->variable_value;
    }

    // --- Set module info pointer and primary flag ---
    tu->module_info_ptr = module_info;                       // [376]
    tu->is_primary = is_primary;                             // [392] byte 0

    // ==========================================================
    //  PHASE 2: Link TU into global chains
    // ==========================================================

    // --- Set as primary TU if this is the first ---
    if (primary_translation_unit == NULL) {                   // qword_106B9F0
        primary_translation_unit = tu;
        if (!is_recompilation)
            assertion_failure("trans_unit.c", 725, "process_translation_unit", 0, 0);
    }

    // --- Push onto TU stack ---
    current_translation_unit = tu;                           // qword_106BA10
    // (stack entry allocated from free list or via permanent alloc)
    stack_entry = alloc_stack_entry();                        // 16 bytes
    stack_entry->tu_ptr = tu;
    stack_entry->next = tu_stack_top;
    if (tu != primary_translation_unit)
        ++tu_stack_depth;                                    // dword_106B9E8
    tu_stack_top = stack_entry;                               // qword_106BA18

    // --- Append to TU linked list ---
    if (tu_chain_tail)                                       // qword_12C7A90
        tu_chain_tail->next_tu = tu;
    tu_chain_tail = tu;

    // ==========================================================
    //  PHASE 3: Source file setup + parse
    // ==========================================================

    if (module_info) {
        // --- Module compilation path ---
        // Extract header info from module descriptor
        module_id = module_info[7];
        module_info[2] = tu;                                 // back-link TU into module
        current_module_id = module_id;                       // qword_106C0B0
        // ... copy include paths, source paths from module descriptor ...
        source_dir = intern_directory_path(filename, 1);     // sub_5ADC60
        set_include_paths(source_dir, &include_list, &sys_list); // sub_5AD120
        fe_translation_unit_init(source_dir, &include_list); // sub_5863A0
        import_module = module_info[3];
        tu->error_severity_count = current_error_severity;   // [408]
        set_module_id(import_module);                        // sub_5AF7F0
        if (preprocessing_only)                              // dword_106C29C
            goto compile;
        goto compile_module;
    }

    // --- Standard (non-module) path ---
    fe_translation_unit_init(0, 0);                          // sub_5863A0
    tu->error_severity_count = current_error_severity;
    if (preprocessing_only)
        goto compile;

    // --- PCH header processing (optional) ---
    if (pch_enabled && !pch_skip_flag) {                     // dword_106BF18, dword_106B6AC
        setup_pch_source();                                  // sub_5861C0
        precompiled_header_processing();                     // sub_6F4AD0
    }

compile:
    // --- Main compilation: parse + build IL ---
    compile_primary_source();                                // sub_586240
    semantic_analysis();                                     // sub_4E8A60 (standard path)
    goto wrapup;

compile_module:
    compile_primary_source();                                // sub_586240
    module_compilation();                                    // sub_6FDDF0 (module path)

wrapup:
    // ==========================================================
    //  PHASE 4: Per-TU wrapup + stack pop
    // ==========================================================
    translation_unit_wrapup();                               // sub_588E90

    // --- Pop TU stack (inlined pop_translation_unit_stack) ---
    top = tu_stack_top;
    popped_tu = top->tu_ptr;
    if (popped_tu != current_translation_unit)
        assertion_failure("trans_unit.c", 556,
                          "pop_translation_unit_stack", 0, 0);
    if (popped_tu != primary_translation_unit)
        --tu_stack_depth;
    tu_stack_top = top->next;
    // (return stack entry to free list)
    if (tu_stack_top)
        switch_translation_unit(tu_stack_top->tu_ptr);       // sub_7A3D60

    // --- Debug trace on exit ---
    if (debug_verbosity > 0 || (debug_enabled && trace_category("trans_unit")))
        fprintf(stderr, "Done processing translation unit %s\n", filename);
}

Execution Flow

process_translation_unit (sub_7A40A0)
  |
  |-- [1] Debug trace: "Processing translation unit %s"
  |-- [2] Module-state validation (assert at trans_unit.c:696)
  |-- [3] Save previous TU state (sub_7A3A50)
  |-- [4] Reset error state (sub_5EAEC0)
  |-- [5] If recompilation: re-run fe_init_part_1 (sub_585EE0)
  |
  |-- [6] Allocate 424-byte TU descriptor (sub_6BA0D0)
  |       |-- Allocate per-TU storage buffer (sub_6BA0D0(per_tu_storage_size))
  |       |-- Initialize scope state at [24..192] (sub_7046E0)
  |       |-- Zero remaining fields [192..416]
  |       |-- Copy registered variable defaults
  |       |-- Set module_info_ptr [376] and flags [392]
  |
  |-- [7] Set as primary TU if first (assert at trans_unit.c:725)
  |-- [8] Push onto TU stack, link into TU chain
  |
  |-- [9] Module path? (module_info != NULL)
  |       |-- YES: Extract module header info
  |       |        sub_5ADC60 (intern_directory_path)
  |       |        sub_5AD120 (set_include_paths)
  |       |        sub_5863A0 (fe_translation_unit_init)
  |       |        sub_5AF7F0 (set_module_id)
  |       |
  |       |-- NO:  sub_5863A0 (fe_translation_unit_init) with NULL args
  |                sub_5861C0 + sub_6F4AD0 (PCH processing, if enabled)
  |
  |-- [10] sub_586240  -- compile_primary_source (parser entry)
  |
  |-- [11] Post-parse semantic analysis:
  |        |-- Module path: sub_6FDDF0 (module_compilation)
  |        |-- Standard path: sub_4E8A60 (translation_unit / semantic analysis)
  |
  |-- [12] sub_588E90 -- translation_unit_wrapup
  |
  |-- [13] Pop TU stack (assert at trans_unit.c:556)
  |-- [14] Debug trace: "Done processing translation unit %s"

Phase 1: Error State Reset -- sub_5EAEC0

Before any parsing begins, sub_5EAEC0 resets the parser's error recovery state. This is a tiny function (22 bytes) that configures the error-recovery token scan depth based on whether this is a recompilation pass:

void reset_error_state(void) {
    int error_recovery_kind;           // local temporary, copied out below
    if (is_recompilation) {            // dword_106BA08
        error_scan_depth = 8;          // dword_126F68C -- shallower scan on retry
        error_scan_mode = 0;           // dword_126F688
        error_recovery_kind = 16;
    } else {
        error_recovery_kind = 24;      // full recovery on first pass
    }
    error_token_limit = error_recovery_kind;  // dword_126F694
    error_count_local = 0;                     // dword_126F690
}

The different error_recovery_kind values (16 vs 24) control how aggressively the parser attempts to resynchronize after encountering a syntax error. On recompilation (error-retry), the compiler uses a smaller recovery window to avoid cascading errors.

Phase 2: TU Descriptor Allocation

The 424-byte TU descriptor is the central data structure tracking a single translation unit's state during compilation. It is allocated from EDG's permanent storage pool via sub_6BA0D0 and linked into two separate data structures: the TU linked list and the TU stack.

Translation Unit Descriptor Layout (424 bytes)

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| 0 | 8 | next_tu | Singly-linked list pointer: chains all TUs in processing order. qword_106B9F0 (primary TU) is the head; qword_12C7A90 is the tail. |
| 8 | 8 | tu_name | Initially NULL. Set later by the parser to the TU's internal identifier. |
| 16 | 8 | storage_buffer | Pointer to a dynamically-sized buffer holding per-TU copies of all registered global variables. Size = qword_12C7A98 (accumulated during f_register_trans_unit_variable calls). |
| 24-192 | 168 | scope_state | Initialized by sub_7046E0. Contains the TU's scope stack snapshot: file scope descriptor, scope nesting state, using-directive lists. Saved/restored during TU switching by sub_7A3A50/sub_7A3D60. |
| 184 | 8 | source_file_entry | Lies inside the scope_state range. Set to *(qword_126DDF0 + 64) after the source file is opened -- the file descriptor from the source file manager. |
| 192 | 8 | (cleared) | Zero-initialized. |
| 200-352 | ~160 | scope_decl_area | Bulk-zeroed via memset. Holds scope-level declaration state that accumulates during parsing. The zero-init ensures clean state for a new TU. |
| 352 | 8 | (cleared) | Zero-initialized. |
| 360-376 | 24 | additional_state | Three qwords (offsets 360, 368, 376), all zeroed at allocation. The qword at 376 is then overwritten with module_info_ptr; the purpose of the other two is unclear, possibly reserved for future EDG versions. |
| 376 | 8 | module_info_ptr | Pointer to the C++20 module descriptor (the module_info parameter). NULL for standard compilation. When set, the TU participates in modular compilation. |
| 392 | 2 | flags | Byte 0: is_primary (1 when module_info is NULL, 0 otherwise). Byte 1: initialized marker (always 1 = 0x100 in the word). |
| 408 | 4 | error_severity_count | Snapshot of dword_126EC90 at TU creation time. Compared during wrapup to detect new errors introduced during this TU's compilation. |
| 416 | 8 | (cleared) | Zero-initialized. |

Registered Variable Mechanism

EDG's multi-TU infrastructure requires certain global variables to be saved and restored when switching between translation units (e.g., during relocatable device code compilation). The mechanism works as follows:

  1. Registration phase (during initialization, before any TU processing): Subsystem initializers call f_register_trans_unit_variable (sub_7A3C00) to register global variables that need per-TU state. Each registration creates a 40-byte entry:
| Offset | Size | Field |
|---|---|---|
| 0 | 8 | next -- linked list pointer |
| 8 | 8 | variable_address -- pointer to the global variable |
| 16 | 8 | variable_name -- debug name string (e.g., "is_recompilation") |
| 24 | 8 | prior_accumulated_size -- offset into per-TU storage buffer |
| 32 | 8 | field_offset_in_tu -- if nonzero, the offset within the TU descriptor where the default value lives |
  2. Accumulated size tracking: Each registration pads the variable's size to 8-byte alignment and adds it to qword_12C7A98 (per-TU storage size). The linked list head is qword_12C7AA8, tail is qword_12C7AA0.

  3. TU creation: When a TU descriptor is allocated, a storage buffer of per_tu_storage_size bytes is allocated alongside it at offset [16]. Default values from the field_offset_in_tu entries are copied into the TU descriptor's own fields.

  4. TU switching: save_translation_unit_state (sub_7A3A50) iterates the registered variable list, copying each variable's current value from its global address into the outgoing TU's storage buffer. switch_translation_unit (sub_7A3D60) does the reverse: copies from the incoming TU's storage buffer back to the global addresses.
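The registration, size accumulation, and save/restore steps above can be sketched as follows. This is a minimal illustration, not the decompiled code: the function names and signatures are hypothetical, and the `size` field is an assumption added so the sketch can copy the right number of bytes (the documented 40-byte entry lists `field_offset_in_tu` in that slot instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mirror of a registration entry (sketch only). */
typedef struct reg_entry {
    struct reg_entry *next;
    void             *variable_address;
    const char       *variable_name;
    size_t            prior_accumulated_size; /* offset into per-TU buffer */
    size_t            size;                   /* assumption: not in the documented layout */
} reg_entry;

static reg_entry *reg_head, *reg_tail;  /* cf. qword_12C7AA8 / qword_12C7AA0 */
static size_t per_tu_storage_size;      /* cf. qword_12C7A98 */

/* Append an entry and reserve an 8-byte-aligned slot for the variable. */
static void register_variable(void *addr, const char *name, size_t size) {
    reg_entry *e = calloc(1, sizeof *e);
    if (!e) abort();
    e->variable_address = addr;
    e->variable_name = name;
    e->prior_accumulated_size = per_tu_storage_size;
    e->size = size;
    per_tu_storage_size += (size + 7) & ~(size_t)7;
    if (reg_tail) reg_tail->next = e; else reg_head = e;
    reg_tail = e;
}

/* Copy every registered global into the outgoing TU's buffer. */
static void save_state(char *buf) {
    for (reg_entry *e = reg_head; e; e = e->next)
        memcpy(buf + e->prior_accumulated_size, e->variable_address, e->size);
}

/* Copy the incoming TU's buffer back to the registered globals. */
static void restore_state(const char *buf) {
    for (reg_entry *e = reg_head; e; e = e->next)
        memcpy(e->variable_address, buf + e->prior_accumulated_size, e->size);
}
```

Because each slot is padded to 8 bytes, a 4-byte flag and an 8-byte pointer together reserve 16 bytes of per-TU storage.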

Three core variables are always registered (by sub_7A4690):

| Variable | Address | Size | Name |
|---|---|---|---|
| is_recompilation | dword_106BA08 | 4 | "is_recompilation" |
| current_filename | qword_106BA00 | 8 | "current_filename" |
| has_module_info | dword_106B9F8 | 4 | "has_module_info" |

Additional variables are registered by other subsystem initializers (trans_corresp registers 3 more via sub_7A3920).

Phase 3: TU Linking and Stack Management

TU Linked List

Translation units are linked in processing order through the next_tu field at offset [0]:

qword_106B9F0 (primary_translation_unit)
  |
  v
  TU_0 --[next_tu]--> TU_1 --[next_tu]--> TU_2 --[next_tu]--> NULL
                                                       ^
                                                       |
                                          qword_12C7A90 (tu_chain_tail)

qword_106B9F0 always points to the first (primary) TU. qword_12C7A90 always points to the last. The chain is walked by fe_wrapup (sub_588F90) during its 5-pass finalization.
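A head/tail pair like this gives constant-time append. The sketch below shows the pattern under the assumption that new TUs are always added at the tail; `append_tu` is a hypothetical name, and only the `next_tu` field of the descriptor is modeled.

```c
#include <assert.h>
#include <stddef.h>

typedef struct tu_descriptor {
    struct tu_descriptor *next_tu;   /* offset [0] in the real descriptor */
} tu_descriptor;

static tu_descriptor *primary_translation_unit; /* cf. qword_106B9F0 */
static tu_descriptor *tu_chain_tail;            /* cf. qword_12C7A90 */

/* Append in O(1): the first TU becomes the head, later ones hang off the tail. */
static void append_tu(tu_descriptor *tu) {
    tu->next_tu = NULL;
    if (tu_chain_tail)
        tu_chain_tail->next_tu = tu;
    else
        primary_translation_unit = tu;
    tu_chain_tail = tu;
}
```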

TU Stack

The TU stack tracks the active compilation context. Each stack entry is a 16-byte structure:

| Offset | Size | Field |
|---|---|---|
| 0 | 8 | next -- points to the entry below on the stack |
| 8 | 8 | tu_ptr -- pointer to the TU descriptor |

Stack entries are allocated from a free list (qword_12C7AB8); when the free list is empty, a new 16-byte block is allocated via sub_6B7340 (permanent allocator).

qword_106BA18 (tu_stack_top)
  |
  v
  entry_N: [next=entry_N-1, tu_ptr=current_tu]
  entry_N-1: [next=entry_N-2, tu_ptr=prev_tu]
  ...
  entry_0: [next=NULL, tu_ptr=primary_tu]

The stack depth counter dword_106B9E8 tracks how many non-primary TUs are stacked. It is incremented on push (if tu != primary_tu) and decremented on pop.

The pop operation at the end of process_translation_unit includes an assertion (at trans_unit.c:556) verifying that the top-of-stack TU matches current_translation_unit. This guards against mismatched push/pop sequences, which would corrupt the multi-TU state:

if (stack_top->tu_ptr != current_translation_unit)
    assertion_failure("trans_unit.c", 556, "pop_translation_unit_stack", 0, 0);
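The push/pop discipline, free-list recycling, and depth counter can be sketched together. This is an illustration with hypothetical function names; it uses `malloc` in place of the permanent allocator, and it assumes the depth counter is decremented only when a non-primary TU is popped (the text states the condition explicitly only for push).

```c
#include <assert.h>
#include <stdlib.h>

typedef struct tu_descriptor { struct tu_descriptor *next_tu; } tu_descriptor;

typedef struct stack_entry {
    struct stack_entry *next;    /* entry below on the stack */
    tu_descriptor      *tu_ptr;
} stack_entry;

static tu_descriptor *primary_tu, *current_tu;
static stack_entry   *stack_top;   /* cf. qword_106BA18 */
static stack_entry   *free_list;   /* cf. qword_12C7AB8 */
static int            stack_depth; /* cf. dword_106B9E8 */

static void push_tu(tu_descriptor *tu) {
    stack_entry *e = free_list;
    if (e) {
        free_list = e->next;             /* recycle a freed entry */
    } else {
        e = malloc(sizeof *e);           /* stand-in for sub_6B7340 */
        if (!e) abort();
    }
    e->tu_ptr = tu;
    e->next = stack_top;
    stack_top = e;
    if (tu != primary_tu)
        stack_depth++;
    current_tu = tu;
}

static void pop_tu(void) {
    stack_entry *e = stack_top;
    /* Same invariant as the trans_unit.c:556 assertion. */
    assert(e != NULL && e->tu_ptr == current_tu);
    if (e->tu_ptr != primary_tu)
        stack_depth--;                   /* assumption: symmetric with push */
    stack_top = e->next;
    e->next = free_list;                 /* return entry to the free list */
    free_list = e;
    current_tu = stack_top ? stack_top->tu_ptr : NULL;
}
```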

Phase 4: Source File Setup

The source file setup differs between standard compilation and C++20 module compilation.

Standard Path (module_info == NULL)

  1. sub_5863A0 (fe_translation_unit_init / keyword_init, 1113 lines, fe_init.c): The largest initialization function in the binary. Performs the following tasks in sequence:

    • Token state reset: Zeros qword_126DD38 (6-byte source position) and qword_126EDE8 (mirror).
    • Per-TU subsystem reinit: Calls 15+ subsystem re-initializers to prepare for a new compilation unit (source file manager, scope system, preprocessor, diagnostics, etc.).
    • Keyword registration: Registers 200+ C/C++ keywords via sub_7463B0 (enter_keyword), including all C89/C99/C11/C23 keywords, C++ keywords through C++26, GNU extensions, MSVC extensions, Clang extensions, 60+ type traits, and three NVIDIA CUDA-specific type trait keywords (__nv_is_extended_device_lambda_closure_type, __nv_is_extended_host_device_lambda_closure_type, __nv_is_extended_device_lambda_with_preserved_return_type). Keyword registration is version-gated by the language mode (dword_126EFB4) and C++ standard version (dword_126EF68).
    • File scope creation: Calls sub_7047C0(0) to push the initial file scope onto the scope stack.
    • C++ builtins: For C++ mode, registers namespace std, operator new/operator delete allocation functions, std::align_val_t.
  2. PCH processing (optional, if dword_106BF18 is set): Calls sub_5861C0 to open the source file with minimal setup (same as sub_586240 but without the recompilation logic), followed by sub_6F4AD0 (precompiled_header_processing, 721 lines, pch.c) which searches for an applicable .pch file, validates memory allocation history, and restores saved variable state from the precompiled header.

Module Path (module_info != NULL)

When compiling a C++20 module unit, the module descriptor (passed as a3) provides pre-computed configuration:

module_info[2] = tu;              // back-link TU into module descriptor
qword_106C0B0 = module_info[7];  // module identifier
qword_126EE98 = module_info[4];  // include path list
qword_126EE78 = module_info[6];  // system include path list
qword_126EE90 = module_info[5];  // additional path list

The module path then calls:

  • sub_5ADC60(filename, 1) -- intern the source directory path (cached allocation)
  • sub_5AD120(source_dir, &include_list, &sys_list) -- configure include search paths from the module descriptor
  • sub_5863A0(source_dir, &include_list) -- fe_translation_unit_init with module-specific paths
  • sub_5AF7F0(module_info[3]) -- set the module identifier for this TU (asserts not already set)

Phase 5: Compilation Driver -- sub_586240

sub_586240 (fe_init.c, 63 lines) is the compilation driver that opens the source file and launches the parser. It is called for both standard and module compilation paths.

void compile_primary_source(void) {
    // If recompilation: reset file-scope scope pointer
    if (is_recompilation)
        *(uint64_t *)&xmmword_126EB60 = 0;

    // Allocate mutable copy of filename for the lexer
    char *fn_copy = temp_allocate(strlen(current_filename) + 1);  // sub_5E0460
    strcpy(fn_copy, current_filename);

    // --- Open source file and push onto input stack ---
    open_file_and_push_input_stack(fn_copy, 0, 0, 0, 0, 0, 0, 0, 0, 0);  // sub_66E6E0

    // Record source file descriptor in TU
    current_tu->source_file_entry = *(source_file_descriptor + 64);  // [184]

    // --- Scope handling ---
    if (!pch_mode) {                                         // dword_106B690
        init_global_scope_flag = 1;                          // dword_126C708
        global_scope_decl_list = global_decl_chain;          // qword_126C710
        finalize_scope();                                    // sub_66E920
    }
    open_scope(1, 0);                                        // sub_6702F0

    // --- PCH recompilation metadata ---
    if (is_recompilation) {
        // Allocate 4-byte version marker (3550774 = "6.6\0")
        char *ver = temp_allocate(4);
        *(uint32_t *)ver = 3550774;                          // EDG 6.6 version tag
        edg_version_ptr = ver;                               // qword_126EB78
        // Copy compilation timestamp
        char *ts = temp_allocate(strlen(byte_106B5C0) + 1);  // +1 for the NUL copied by strcpy
        compilation_timestamp_copy = strcpy(ts, byte_106B5C0); // qword_126EB80
        dialect_version_snapshot = dialect_version;           // dword_126EBF8
    }

    // --- PCH header loading ---
    if (pch_mode) {
        load_precompiled_header(byte_106B5C0);               // sub_6B5C10
        pch_header_loaded = 1;                               // dword_106B6B0
    }
}
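The version marker 3550774 can be checked by hand: it is the four bytes '6', '.', '6', '\0' read as a little-endian 32-bit integer (0x00362E36). A small sketch of that arithmetic follows; `pack_le32` is a hypothetical helper, written with explicit shifts so the result does not depend on host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Combine four bytes into a 32-bit value, lowest address first (little-endian). */
static uint32_t pack_le32(const unsigned char b[4]) {
    return (uint32_t)b[0]
         | (uint32_t)b[1] << 8
         | (uint32_t)b[2] << 16
         | (uint32_t)b[3] << 24;
}
```

Storing 3550774 through a `uint32_t *` on x86-64 therefore leaves the C string "6.6" in the 4-byte buffer.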

Parser Entry: sub_66E6E0 (open_file_and_push_input_stack)

sub_66E6E0 (lexical.c, 95 lines) is the gateway from file-level compilation into the EDG lexer/parser. It takes 10 parameters controlling how the source file is opened:

| Parameter | Position | Typical Value | Meaning |
|---|---|---|---|
| filename | a1 | source path | Path to the .cu file |
| include_mode | a2 | 0 | 0 = primary source, nonzero = #include |
| search_type | a3 | 0 | 0 = absolute path, nonzero = search include dirs |
| is_system | a4 | 0 | System header flag |
| guard_flag | a5 | 0 | Include guard checking mode |
| is_pragma | a6 | 0 | Pragma-include flag |
| embed_mode | a7 | 0 | #embed processing flag |
| line_adjust | a8 | 0 | Line number adjustment |
| recovery | a9 | 0 | Error recovery mode |
| result_out | a10 | 0 | Output: set to 1 if file was skipped (guard) |

The function delegates to sub_66CBD0 which resolves the file path, opens the file handle, and creates the file descriptor. Then sub_66DFF0 pushes the opened file onto the lexer's input stack, making it the active source for tokenization. The lexer reads from this stack via get_next_token (sub_676860, 1995 lines).

At debug verbosity > 3, it prints: "open_file_and_push_input_stack: skipping guarded include file %s\n" when an include guard causes the file to be skipped.

Phase 6: Semantic Analysis -- sub_4E8A60

After parsing completes, sub_4E8A60 (translation_unit, decls.c, 77 lines) performs semantic analysis on the parsed declarations. This function is called only on the standard (non-module) compilation path.

void translation_unit(void) {
    // PCH mode: additional scope finalization
    if (pch_mode)
        finalize_pch_scope();                                // sub_6FC900
    if (global_decl_chain)
        process_pending_declarations();                      // sub_6FDD60

    // --- Fetch the first token before the declaration loop ---
    declaration_processing_active = 1;                       // dword_126C704
    get_next_token();                                        // sub_676860
    declaration_processing_active = 0;

    // Header-unit stop detection
    if (header_unit_mode)
        finalize_header_unit();                              // sub_6F4A10

    // --- Top-level declaration loop ---
    // Repeatedly processes declarations until token 9 (EOF) is reached.
    // For C++ (dword_126EFB4 == 2) at C++11 or later (dword_126EF68 > 201102):
    //   calls sub_6FBCD0 (deferred template processing)
    //   then sub_4E6F80(1, 0) (process next declaration)
    while (current_token != 9) {  // 9 = EOF token
        if (is_cpp && (cpp_version > 201102 || has_cpp20_features))
            process_deferred_templates();                    // sub_6FBCD0
        if (declaration_enabled)
            process_declaration(1, 0);                       // sub_4E6F80
    }

    // --- Post-parse validation ---
    if (!header_unit_mode) {
        if (is_cpp && (cpp_version > 201102 || has_cpp20_features))
            process_deferred_templates();                    // sub_6FBCD0 final pass
        finalize_module_interface();                         // sub_6F81D0
    } else {
        // Header-unit mode assertion: stop position must be found
        assertion_failure("decls.c", 23975, "translation_unit",
                          "translation_unit:", "header stop position not found");
    }
}

The C++ standard version checks (dword_126EF68 > 201102) gate deferred template processing on the language level. Because the __cplusplus value for C++11 is 201103, the comparison is true for C++11 and every later standard, and false only for C++98/03 (199711). In the enabled modes, sub_6FBCD0 handles deferred template processing between declaration groups.
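The gate used in the declaration loop can be evaluated in isolation. The sketch below assumes the version variable holds __cplusplus-style values (199711, 201103, 201402, ...) and that language mode 2 means C++, as stated for dword_126EFB4; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { LANG_CPP = 2 };   /* value of dword_126EFB4 in C++ mode */

/* Mirrors: is_cpp && (cpp_version > 201102 || has_cpp20_features) */
static bool deferred_templates_enabled(int lang_mode, long std_version,
                                       bool has_cpp20_features) {
    return lang_mode == LANG_CPP
        && (std_version > 201102 || has_cpp20_features);
}
```

With these inputs, 199711 fails the gate while 201103 and 201402 both pass, so the threshold sits between C++03 and C++11.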

Phase 7: Translation Unit Wrapup -- sub_588E90

sub_588E90 (translation_unit_wrapup, fe_wrapup.c, 36 lines) performs per-TU finalization after parsing and semantic analysis are complete. It is the last step before the TU stack is popped.

void translation_unit_wrapup(void) {
    if (debug_enabled)
        trace_enter(1, "translation_unit_wrapup");

    // [1] Stop-token verification
    check_all_stop_token_entries_are_reset(                  // sub_675DA0
        file_scope_stop_tokens + 8);                         // qword_126DB48 + 8

    // [2] Class linkage checking (conditional)
    if (!preprocessing_only) {
        if (rdc_enabled || rdc_alt_enabled)                  // dword_106C2BC, dword_106C2B8
            check_class_linkage();                           // sub_446F80
    }

    // [3] Module import finalization
    finalize_module_imports();                                // sub_7C24D0

    // [4] IL output
    complete_scope();                                        // sub_709250

    // [5] Close file scope
    close_file_scope(1);                                     // sub_7047C0

    // [6] Module correspondence finalization (non-preprocessing)
    if (!preprocessing_only)
        process_verification_list();                         // sub_7A2FE0

    // [7] Write compilation unit boundary
    make_module_id(0);                                       // sub_5AF830

    // [8] Namespace cleanup (C++ only, non-PCH, non-preprocessing)
    if (is_cpp && !is_recompilation && !preprocessing_only)
        namespace_cleanup();                                 // sub_76C910

    if (debug_enabled)
        trace_leave();                                       // sub_48AFD0
}

sub_675DA0: check_all_stop_token_entries_are_reset

Iterates all 357 entries in the stop-token array. If any nonzero entry is found, logs "stop_tokens[\"%s\"] != 0\n" (using off_E6D240 as the token name table) and asserts with "stop token array not all zero" at lexical.c:17680. This catches lexer state corruption where a stop-token (used during error recovery and tentative parsing) was not properly cleared.
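A scan of this shape is sketched below. It is an illustration only: the element width of the stop-token array is not stated in the text, so one byte per entry is an assumption, and the function returns a count instead of raising the assertion so that it can be exercised directly.

```c
#include <assert.h>
#include <stdio.h>

enum { STOP_TOKEN_COUNT = 357 };

/* Report every entry that is still set and return how many were found.
 * `names` stands in for the token name table (off_E6D240); it may be NULL. */
static int count_unreset_stop_tokens(const unsigned char *stop_tokens,
                                     const char *const *names) {
    int bad = 0;
    for (int i = 0; i < STOP_TOKEN_COUNT; i++) {
        if (stop_tokens[i] != 0) {
            fprintf(stderr, "stop_tokens[\"%s\"] != 0\n",
                    names ? names[i] : "?");
            bad++;
        }
    }
    return bad;
}
```

A nonzero return corresponds to the condition under which the real function asserts with "stop token array not all zero".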

sub_446F80: check_class_linkage

Called only when relocatable device code (RDC) compilation is enabled (dword_106C2BC or dword_106C2B8). Iterates file-scope type entities looking for class/struct/union types (kind 9-11) and scoped enums (kind 2, bit 3 of +145) that need external linkage for cross-TU visibility. For qualifying types, calls sub_41F800 (make_class_externally_linked) to set the linkage bits at offset +80 to 0x20 (external linkage flag). The function performs a two-pass scan:

  1. Pass 1: Identify types needing external linkage. Checks whether the type is used by externally-visible definitions, has nested types with external linkage requirements, or has member functions with non-inline definitions.

  2. Pass 2: If any types were promoted, propagates linkage to member functions and nested class template instantiations via sub_41FD90.

sub_7A2FE0: process_verification_list (Module Finalization)

sub_7A2FE0 (trans_corresp.c, 69 lines) processes the deferred correspondence verification list for multi-TU compilation. This is the mechanism EDG uses to verify that declarations shared across translation units are structurally compatible (One Definition Rule checking for RDC).

void process_verification_list(void) {
    if (is_recompilation || error_count != saved_error_count)
        goto skip;  // skip on recompilation or if new errors appeared

    correspondence_active = 1;                               // dword_106B9E4
    source_seq = *(current_tu + 8);                          // TU source sequence

    prepare_correspondence(source_seq);                      // sub_79FE00
    verify_correspondence(source_seq);                       // sub_7A2CC0

    // Process pending verification items
    while (pending_list) {                                   // qword_12C7790
        pending_list_snapshot = pending_list;
        pending_list = NULL;
        for (item = pending_list_snapshot; item; item = next) {
            next = item->next;
            switch (item->kind) {                            // byte at [8]
                case 0:  break;                              // no-op
                case 2:  verify_typedef_correspondence(item->data);           break;  // sub_7986A0
                case 6:  verify_friend_correspondence(item->data);            break;  // sub_7A1830
                case 7:  verify_nested_class_correspondence(item->data);      break;  // sub_798960
                case 8:  verify_enum_member_correspondence(item->data);       break;  // sub_798770
                case 11: verify_member_function_correspondence(item->data);   break;  // sub_7A1DB0
                case 28: verify_using_declaration_correspondence(item->data); break;  // sub_7982C0
                case 58: verify_base_class_correspondence(item->data);        break;  // sub_7A27B0
                default: assertion_failure("trans_corresp.c", 7709, ...);
            }
            // Return item to free list
            item->next = corresp_free_list;
            corresp_free_list = item;
        }
    }

    correspondence_active = 0;

skip:
    correspondence_complete = 1;                             // dword_106B9E0
}

Kind 0 is a no-op. The remaining kind codes correspond to EDG declaration kinds: typedef (2), friend (6), nested class (7), enum member (8), member function (11), using declaration (28), base class (58). Any other value triggers the assertion at trans_corresp.c:7709.
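The outer `while` around the snapshot loop matters if a handler queues further items while the list is being processed. The sketch below demonstrates that drain pattern with hypothetical types and a demo handler; whether the real verifiers actually enqueue follow-up work is an assumption suggested by the loop structure, not something confirmed here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct item { struct item *next; int kind; } item;

static item *pending_list, *free_list;
static int handled;

static void enqueue(item *it) { it->next = pending_list; pending_list = it; }

/* Drain until no handler has left new work on the pending list. */
static void drain(void (*handle)(item *)) {
    while (pending_list) {
        item *snapshot = pending_list;
        pending_list = NULL;             /* handlers enqueue onto a fresh list */
        for (item *it = snapshot, *next; it; it = next) {
            next = it->next;
            handle(it);
            it->next = free_list;        /* recycle the processed item */
            free_list = it;
        }
    }
}

/* Demo handler: a kind-7 item queues one kind-2 follow-up, once. */
static item follow_up;
static void demo_handler(item *it) {
    handled++;
    if (it->kind == 7 && follow_up.kind == 0) {
        follow_up.kind = 2;
        enqueue(&follow_up);
    }
}
```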

Module vs Standard Compilation Path

The control flow diverges based on dword_106C29C (preprocessing-only mode) and the presence of module_info:

                          module_info?
                         /            \
                       YES             NO
                        |               |
                  sub_5ADC60         sub_5863A0(0,0)
                  sub_5AD120              |
                  sub_5863A0       PCH enabled?
                  sub_5AF7F0        /        \
                        |         YES         NO
                        |          |           |
                        |     sub_5861C0       |
                        |     sub_6F4AD0       |
                        |          |           |
                        +-----+----+-----+-----+
                              |          |
                         sub_586240  sub_586240
                              |          |
                      preprocessing_only?
                        /            \
                      YES             NO
                       |               |
                  sub_6FDDF0      sub_4E8A60
                  (module comp)   (standard comp)
                       |               |
                       +-------+-------+
                               |
                         sub_588E90
                    (translation_unit_wrapup)

Note: sub_6FDDF0 is the module compilation driver (59 lines, lower_il.c). It enters a loop calling sub_676860 (get_next_token) until EOF (token 9), processing module import/export declarations. Between module units, it calls sub_66EA70 to close the current input source and advance to the next module partition.

Global State Variables

Translation Unit Tracking

| Variable | Address | Type | Description |
|---|---|---|---|
| current_translation_unit | qword_106BA10 | tu_descriptor* | Points to the TU currently being compiled. Set during TU creation and switching. |
| primary_translation_unit | qword_106B9F0 | tu_descriptor* | Points to the first TU. Set exactly once. Never changes after that. |
| tu_chain_tail | qword_12C7A90 | tu_descriptor* | Tail of the TU linked list. Used for O(1) append of new TUs. |
| tu_stack_top | qword_106BA18 | stack_entry* | Top of the TU stack. Each entry is a 16-byte {next, tu_ptr} node. |
| tu_stack_depth | dword_106B9E8 | int | Number of non-primary TUs on the stack. Incremented on push, decremented on pop. |
| current_filename | qword_106BA00 | char* | Path of the .cu file being compiled. Per-TU variable (saved/restored on switch). |
| is_recompilation | dword_106BA08 | int | Nonzero during error-retry recompilation pass. Per-TU variable. |
| has_module_info | dword_106B9F8 | int | 1 if the current TU is a C++20 module unit. Per-TU variable. |

Registration Infrastructure

| Variable | Address | Type | Description |
|---|---|---|---|
| registered_variable_list_head | qword_12C7AA8 | reg_entry* | Head of the registered variable linked list. Built during initialization. |
| registered_variable_list_tail | qword_12C7AA0 | reg_entry* | Tail of the registered variable list. Used for O(1) append. |
| per_tu_storage_size | qword_12C7A98 | size_t | Accumulated size of all registered variables (8-byte aligned). Determines the storage buffer size at TU descriptor offset [16]. |
| registration_complete | dword_12C7A8C | int | Set to 1 at the start of process_translation_unit. After this, no more variables can be registered. |
| has_seen_module_tu | dword_12C7A88 | int | Set to 1 when a module-info TU is processed. Guards against mixing module and non-module TUs. |
| stack_entry_free_list | qword_12C7AB8 | stack_entry* | Free list for recycling 16-byte TU stack entries. |

Statistics Counters

| Variable | Address | Description |
|---|---|---|
| tu_count | qword_12C7A78 | Total TU descriptors allocated (424 bytes each) |
| stack_entry_count | qword_12C7A80 | Total stack entries allocated (16 bytes each) |
| registration_count | qword_12C7A68 | Total variable registration entries (40 bytes each) |
| corresp_count | qword_12C7A70 | Total correspondence entries (24 bytes each) |

These counters are reported by sub_7A45A0 (print_trans_unit_statistics), which prints formatted memory usage:

trans. unit corresps          N x 24 bytes
translation units             N x 424 bytes
trans. unit stack entry       N x 16 bytes
variable registration         N x 40 bytes

Assertions

process_translation_unit contains three assertion checks, each producing a fatal diagnostic via sub_4F2930:

| Line | Condition | Message | Meaning |
|---|---|---|---|
| trans_unit.c:696 | Primary TU (no module_info) but has_seen_module_tu is set | (none) | Cannot process a non-module TU after a module TU has been seen |
| trans_unit.c:725 | primary_translation_unit is set but is_recompilation is false | (none) | First TU must be on the initial compilation pass, not a retry |
| trans_unit.c:556 | Stack top's TU pointer does not match current_translation_unit | (none) | TU stack push/pop mismatch -- corrupted compilation state |

Callee Reference Table

| Address | Identity | Source | Role in Pipeline |
|---|---|---|---|
| sub_48A7E0 | trace_category | error.c | Check if debug category "trans_unit" is enabled |
| sub_5EAEC0 | reset_error_state | parse.c | Reset parser error recovery state |
| sub_585EE0 | fe_init_part_1 | fe_init.c | Re-run per-unit init on recompilation |
| sub_6BA0D0 | allocate_storage | il_alloc.c | Permanent storage allocator (424-byte TU, per-TU buffer) |
| sub_7046E0 | init_scope_state | scope_stk.c | Initialize scope fields at TU descriptor [24..192] |
| sub_6B7340 | permanent_alloc | il_alloc.c | Allocate 16-byte TU stack entry |
| sub_7A3A50 | save_translation_unit_state | trans_unit.c | Save current TU's registered variables and scope state |
| sub_7A3D60 | switch_translation_unit | trans_unit.c | Restore a TU's state (inverse of save) |
| sub_5ADC60 | intern_directory_path | host_envir.c | Cache directory path string (module path) |
| sub_5AD120 | set_include_paths | host_envir.c | Configure include search paths from module descriptor |
| sub_5863A0 | fe_translation_unit_init | fe_init.c | Per-TU init + keyword registration (1113 lines) |
| sub_5AF7F0 | set_module_id | host_envir.c | Set module identifier for current TU |
| sub_5861C0 | setup_pch_source | fe_init.c | Open source file for PCH mode |
| sub_6F4AD0 | precompiled_header_processing | pch.c | Find/load applicable PCH file (721 lines) |
| sub_586240 | compile_primary_source | fe_init.c | Open source, launch parser, build IL |
| sub_66E6E0 | open_file_and_push_input_stack | lexical.c | Open source file, push onto lexer input stack (10 params) |
| sub_676860 | get_next_token | lexical.c | Main tokenizer (1995 lines) |
| sub_6702F0 | open_scope | scope_stk.c | Push a new scope onto the scope stack |
| sub_6FDDF0 | module_compilation | lower_il.c | Module compilation driver (EOF-driven loop) |
| sub_4E8A60 | translation_unit | decls.c | Standard compilation: semantic analysis + declaration loop |
| sub_588E90 | translation_unit_wrapup | fe_wrapup.c | Per-TU finalization (8 sub-steps) |
| sub_675DA0 | check_all_stop_token_entries_are_reset | lexical.c | Verify all 357 stop-tokens are cleared |
| sub_446F80 | check_class_linkage | class_decl.c | RDC: promote class types to external linkage |
| sub_7C24D0 | finalize_module_imports | modules.c | C++20 module import finalization |
| sub_709250 | complete_scope | il.c | IL scope completion |
| sub_7047C0 | close_file_scope | scope_stk.c | Pop file scope, activate using-directives |
| sub_7A2FE0 | process_verification_list | trans_corresp.c | ODR verification for multi-TU (RDC) |
| sub_76C910 | namespace_cleanup | cp_gen_be.c | C++ namespace state cleanup |
| sub_4F2930 | assertion_failure | error.c | Fatal assertion handler (prints source path + line) |

Cross-References

Frontend Wrapup

fe_wrapup (sub_588F90, 1433 bytes at 0x588F90, from fe_wrapup.c:776) is the sixth stage of the cudafe++ pipeline. It runs after the parser has built the complete EDG IL tree and before the backend emits the .int.c file. The function performs five sequential passes over the translation unit chain, each pass iterating the linked list rooted at qword_106B9F0. After the five passes, it runs a series of post-pass operations: cross-TU consistency checks, graph optimization, template validation, memory statistics reporting, and global state teardown. The function has 51 direct callees.

The five passes transform the raw IL tree into a finalized, pruned representation: Pass 1 cleans up parsing artifacts, Pass 2 computes which entities are needed, Pass 3 marks entities that must be preserved in the IL for device compilation, Pass 4 eliminates everything not marked, and Pass 5 serializes the result and validates scope consistency. The entire sequence is the bridge between the parser's "everything parsed" state and the backend's "only what matters" input.

Key Facts

| Property | Value |
|---|---|
| Function | sub_588F90 (fe_wrapup) |
| Binary address | 0x588F90 |
| Binary size | 1433 bytes |
| EDG source | fe_wrapup.c, line 776 |
| Direct callees | 51 |
| Debug trace name | "fe_wrapup" (level 1 via sub_48AE00) |
| Assertion | "bad translation unit in fe_wrapup" if dword_106BA08 == 0 |
| Error check | qword_126ED90 -- passes 2-4 skip TUs with errors |
| Language gate | dword_126EFB4 == 2 gates C++-only operations in pass 4 |

Architecture Overview

sub_588F90 (fe_wrapup)
  |
  |-- Preamble: debug trace, assertion, C++ wrapup, diagnostic hooks
  |
  |-- Pass 1: per-TU basic declaration processing (sub_588C60)
  |-- Pass 2: template/inline instantiation + needed-flags (sub_707040)
  |       |-- gated by !qword_126ED90 (skip error TUs)
  |       |-- preceded by cross-TU marking (sub_796C00) on first run
  |-- Pass 3: keep-in-IL marking for device code (sub_610420 with arg 23)
  |       |-- sets dword_106B640=1 guard, clears after
  |-- Pass 4: constant folding + CUDA transforms + dead entity elimination
  |       |-- sub_5CCA40 (C++ only), sub_5CC410, sub_5CCBF0
  |-- Pass 5: per-TU final cleanup (sub_588D40)
  |
  |-- Post-pass: cross-TU consistency (sub_796BA0)
  |-- Post-pass: graph optimization (sub_707480 double-loop)
  |-- Post-pass: template validation (sub_765480)
  |-- Post-pass: final main-TU cleanup (sub_588D40)
  |-- Post-pass: file index processing (sub_6B8B20 loop)
  |-- Post-pass: output flush (sub_5F7DF0)
  |-- Post-pass: close output files (sub_4F7B10 x3)
  |-- Post-pass: memory statistics (10 subsystem counters -> sub_6B95C0)
  |-- Post-pass: debug dumps (sub_702DC0, sub_6C6570)
  |-- Post-pass: final teardown (sub_5E1D00, sub_4ED0E0, zero 6 globals)

Translation Unit Chain

All five passes iterate the same linked list structure. Each translation unit descriptor is a 424-byte allocation. The first qword of each descriptor is the next pointer, forming a singly-linked list. The head is qword_106B9F0 (the primary TU). For standard single-file CUDA compilation, there is typically one primary TU and zero secondary TUs, but the multi-TU infrastructure exists for module compilation and precompiled headers.

Before processing each TU, sub_7A3D60 (set_current_translation_unit) is called to switch global state to point at that TU. This updates qword_106BA10 (current TU descriptor), which is then used by all subsystems to find the current scope, IL root, file info, and error state.

The file scope IL node -- the root of the IL tree for a TU -- is at *(qword_106BA10 + 8).

The iteration pattern shared by all passes:

// Walk secondary TUs (linked from primary)
node = *(qword **)qword_106B9F0;     // first secondary TU
while (node) {
    sub_7A3D60(node);                 // set node as current TU
    // ... pass-specific work on *(qword_106BA10 + 8) ...
    node = *(qword **)node;           // follow next pointer at +0
}
// Then process primary TU
sub_7A3D60(qword_106B9F0);
// ... pass-specific work on main TU ...
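The visiting order this produces (secondary TUs in chain order, primary TU last) can be confirmed with a small stand-alone sketch. The types and the callback-based `run_pass` helper are hypothetical; the real passes inline the loop and call sub_7A3D60 before the per-TU work.

```c
#include <assert.h>
#include <stddef.h>

typedef struct tu { struct tu *next; int id; } tu;

static int visit_order[8];
static int visit_count;

/* Stand-in for "switch to this TU and do the pass-specific work". */
static void record_visit(tu *t) { visit_order[visit_count++] = t->id; }

/* Secondary TUs first (following the next pointer at +0), then the primary. */
static void run_pass(tu *primary, void (*pass)(tu *)) {
    for (tu *node = primary->next; node; node = node->next)
        pass(node);
    pass(primary);
}
```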

Preamble

Before the five passes begin, fe_wrapup performs:

  1. Debug trace: If dword_126EFC8 (debug mode), logs "fe_wrapup" at level 1 via sub_48AE00.
  2. Set current TU: Calls sub_7A3D60(qword_106B9F0) to select the primary TU.
  3. Assertion: Checks dword_106BA08 != 0 -- the "full compilation mode" flag. If false, triggers a fatal assertion: "bad translation unit in fe_wrapup". This flag is set during TU initialization; its absence here indicates a corrupted pipeline state.
  4. C++ template wrapup: If dword_126EFB4 == 2 (C++ mode), calls sub_78A9D0 (template_and_inline_entity_wrapup). This performs cross-TU template instantiation setup, walking all TUs and their pending instantiation lists.
  5. No-op hook: Calls nullsub_5 -- a disabled debug hook in the exprutil address range (0x56DC80). Likely a compile-time-disabled expression validation point.
  6. CUDA diagnostics: If dword_106C268 is set, calls sub_6B3260 (CUDA-specific diagnostic processing).
  7. Source sequence debug: If debug mode and the "source_file_for_seq_info" flag is active, calls sub_5B9580 to dump source file sequence information.

Pass 1: Basic Declaration Processing

Function: sub_588C60 (file_scope_il_wrapup)
Address: 0x588C60
Per-TU: Yes (iterates all secondary TUs, then processes the primary TU)
Error-gated: No -- runs unconditionally

This pass performs initial cleanup on each translation unit's IL tree. It runs unconditionally on every TU, regardless of error status, because the cleanup operations (template state release, exception spec finalization) are safe and necessary even after errors.

Operations per TU:

| Step | Function | Purpose |
|---|---|---|
| 1 | sub_7C2690 | Template cleanup -- release deferred template instantiation state |
| 2 | sub_68A0C0 | Exception handling cleanup -- finalize exception specifications, resolve pending catch-block types |
| 3 | sub_446F80 | Class linkage checking (check_class_linkage). Conditional: only if dword_106C2BC or dword_106C2B8 is set, and dword_106C29C is clear -- i.e., not preprocessing-only mode |
| 4 | sub_706710 | IL tree walk with parameters (root, 0, scope_list, 1, 0, 0) -- traverses the full IL tree performing bookkeeping: arg 2=0 means initial walk, arg 4=1 enables scope processing, arg 3 passes the TU scope list at qword_106BA10 + 24 |
| 5 | sub_706F40 | IL finalize -- post-walk finalization of the IL root node, marks it as ready for lowering |
| 6 | sub_5BD350 | Destroy temporaries (C++ only, dword_126EFB4 == 2) -- cleans up temporary objects from expression evaluation |
| 7 | (inline loop) | Clear deferred declaration flags (C++ only, dword_126EE50 == 0): iterates the declaration chain at *(root + 280), and for each declaration where bit 2 of byte +81 is set and sub_5CA6F0 returns true, clears the pointer at +40 and clears bit 2 of byte +81. This removes deferred-initialization markers from declarations whose initialization has completed. |
| 8 | sub_65D9A0 | Overload resolution cleanup -- releases candidate sets and viability data |
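Step 7 is the only step implemented as an inline loop, so a sketch may help. The struct below is hypothetical: its fields stand in for the pointer at +40 and the flag byte at +81, the chain link is an assumption, and the predicate parameter stands in for sub_5CA6F0, whose exact semantics are not described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct decl {
    struct decl  *next;           /* hypothetical chain link */
    void         *deferred_init;  /* stands in for the pointer at +40 */
    unsigned char flags81;        /* stands in for byte +81 */
} decl;

enum { DEFERRED_BIT = 1u << 2 };  /* bit 2 */

/* Clear the marker on every flagged declaration the predicate accepts.
 * Returns the number of declarations that were cleared. */
static int clear_deferred_flags(decl *head, bool (*is_complete)(const decl *)) {
    int cleared = 0;
    for (decl *d = head; d; d = d->next) {
        if ((d->flags81 & DEFERRED_BIT) && is_complete(d)) {
            d->deferred_init = NULL;
            d->flags81 &= (unsigned char)~DEFERRED_BIT;
            cleared++;
        }
    }
    return cleared;
}

/* Demo predicate for exercising the loop. */
static bool always_complete(const decl *d) { (void)d; return true; }
```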

After all secondary TUs are processed, the primary TU itself gets the same treatment:

for (tu = *primary_tu; tu != NULL; tu = *tu) {
    set_current_tu(tu);
    file_scope_il_wrapup();           // sub_588C60
}
set_current_tu(primary_tu);
file_scope_il_wrapup();               // for the primary TU itself

Cross-TU Marking (Between Pass 1 and Pass 2)

Before Pass 2 begins, if no errors have occurred (!qword_126ED90), sub_796C00 (mark_secondary_trans_unit_IL_entities_used_from_primary_as_needed) is called. This function:

  1. Calls sub_60E4F0 with callbacks sub_796A60 (IL walk visitor) and sub_796A20 (scope visitor) to walk the primary TU's IL and mark the secondary-TU entities it references as needed.
  2. Iterates the file table (dword_126EC80 entries starting at index 2), and for each valid file scope that is not bit-2 flagged in byte -8 and has a non-zero scope kind byte +28, calls sub_610200 with the same visitor callbacks.
  3. Runs the walk twice (controlled by a counter: first pass with callback sub_796A60, second with NULL). The two-pass design ensures transitive closure: the first pass discovers direct references, the second propagates through chains of indirect references.

Pass 2: Template/Inline Instantiation and Needed-Flags

Function: sub_707040 (set_needed_flags_at_end_of_file_scope)
Address: 0x707040
Per-TU: Yes, but skips TUs with errors (qword_126ED90 check)
Source: scope_stk.c:8090

This pass determines which entities are "needed" -- must be preserved in the IL for backend consumption. It is the EDG "needed flags" computation, which decides based on linkage, usage, and language rules whether each declaration must survive to the output.

The function operates on a file scope IL node and walks four declaration lists at different offsets:

| Offset | List | Entity Kind | Processing |
|---|---|---|---|
| +168 | Nested scopes | Namespace/class scopes | Recursively calls sub_707040 on each scope's IL root at *(entry + 120), skipping entries with bit 0 of byte +116 set (extern linkage marker) |
| +104 | Type declarations | Classes (kind 9-11 at byte +132) | Calls sub_61D7F0(entry, 6) to set needed flag; recursively processes the class scope at *(*(entry+152) + 128) if non-null and bit 5 of byte +29 is clear |
| +112 | Variable declarations | Variables/objects | Complex multi-condition evaluation (see below) |
| +144 | Routine declarations | Functions/methods | Checks template body availability at *(entry+240) and *(*(entry+240)+8), bit 2 of byte +186 (not-needed marker), and entity class at byte +164; marks via sub_61CE20(entry, 0xB); preserves and restores bit 5 of byte +177 across the call |

Variable needed-flag logic

For each variable in the +112 list, the algorithm checks (in order of precedence):

  1. If bit 3 of byte +80 is set (external/imported), skip the remaining checks: the variable is always marked as needed via sub_61CE20(entry, 7).
  2. Check sub_7A7850(*(entry+112)) -- if referenced, mark as needed.
  3. Check sub_7A7890(*(entry+112)) -- if used, mark as needed.
  4. Otherwise evaluate:
    • Byte +162 bit 4 set and full compilation mode: check linkage class at byte +128 (1=external) and base type completeness via sub_75C1F0.
    • Byte +128 == 0 (no linkage) or byte +169 == 2: check initializer pointer at +224 and constexpr flags at byte +164.
    • Internal/external linkage with specific storage class: check definition pointer at +200, storage class byte +128, and flag patterns in bytes +160, +161.

At the start of file scope processing, dword_106B640 is set to 1. At the end, after optionally calling sub_6FE8C0 (C++ scope merging), it is cleared to 0.

Debug trace: prints "Start/End of set_needed_flags_at_end_of_file_scope" when the "needed_flags" debug flag is active.

Pass 3: Keep-in-IL Marking (Device Code Selection)

Function: sub_610420 (mark_to_keep_in_il)
Address: 0x610420
Per-TU: Yes, skips error TUs
Source: il_walk.c:1959
Argument: 23 (the file-scope walk mode)

This is the critical CUDA-specific pass. It determines which entities must be preserved in the intermediate language for device code compilation by cicc. The caller sets the guard flag dword_106B640 to 1 before the call and clears it to 0 after; the function asserts that the flag is nonzero, so it cannot run outside this guarded window.

The keep-in-IL bit is bit 7 (0x80) of the byte at (entity_pointer - 8). Testing uses signed comparison: *(entry - 8) < 0 means "marked for keeping."
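The signed-comparison test works because bit 7 is the sign bit of an 8-bit two's-complement value. A minimal sketch of the test, set, and clear operations follows; the helper names are hypothetical, and the only layout assumption is the flags byte located 8 bytes before the entity pointer, as stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers: names are illustrative, not taken from the binary. */
enum { KEEP_IN_IL_BIT = 0x80 };

/* The flags byte sits 8 bytes before the entity pointer. */
static unsigned char *keep_flags(void *entity) {
    assert(entity != NULL);
    return (unsigned char *)entity - 8;
}

/* Reading the byte as signed makes "bit 7 set" equivalent to "< 0"
   on a two's-complement target such as x86-64. */
static int is_kept(void *entity) {
    return (signed char)*keep_flags(entity) < 0;
}

/* OR with 0x80 sets the mark; every other flag bit is preserved. */
static void set_kept(void *entity) {
    *keep_flags(entity) |= KEEP_IN_IL_BIT;
}

/* AND with 0x7F clears the mark; every other flag bit is preserved. */
static void clear_kept(void *entity) {
    *keep_flags(entity) &= 0x7F;
}
```

Decompilers usually render the signed form because the compiler emits a sign test on the byte instead of an explicit mask.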

Operation

  1. Save/restore state: Saves and restores 9 global callback/state variables (qword_126FB88 through dword_126FB60), installing sub_617310 (prune_keep_in_il_walk) as the walk prune callback at qword_126FB78. All other callback slots are zeroed. The flag at dword_126FB58 is set to (byte_at_a1_minus_8 & 2) != 0, which is derived from a flag in the scope node header.

  2. File scope walk: When a2 == 23 and scope kind byte *(a1+28) is 0 (file scope), clears bit 7 of byte *(a1-8) via AND 0x7F. Then calls sub_6115E0(a1, 23) -- the recursive walk_tree_and_set_keep_in_il traversal on the file scope root.

  3. C++ companion walk: For C++ mode (dword_126EFB4 == 2), calls sub_6175F0(a1) to walk scopes and mark out-of-line definitions and friend declarations.

  4. Guard assertion: Asserts dword_106B640 != 0. If the guard was cleared during the walk, fires a fatal assertion at il_walk.c:1959 with function name mark_to_keep_in_il.

  5. Pending entity lists: Iterates the deferred entity list at qword_126EBA0, calling sub_6115E0(entity, 55) for each entry with bit 2 set in byte *(entity[1] + 187) (the "deferred instantiation needed" flag).

  6. 45 category-specific walks: Iterates 45 global lists, each containing entities of a specific IL category. Each list is walked with a category-specific tag argument:

    | Global range | Tags | Count |
    |---|---|---|
    | qword_126E610 -- qword_126E770 | 1--23 | 23 lists |
    | qword_126E7B0 -- qword_126E7E0 | 27--30 | 4 lists |
    | qword_126E810 -- qword_126E8A0 | 33--42 | 10 lists |
    | qword_126E8E0 -- qword_126E900 | 46--48 | 3 lists |
    | qword_126E9B0, qword_126E9D0, qword_126E9E0, qword_126E9F0 | 59, 61, 62, 63 | 4 lists |
    | qword_126EA80 | 72 | 1 list |

    These lists follow a reverse-linked structure where the back-pointer is at *(list_entry - 16), not at offset +0. Each entity's tag tells sub_6115E0 what kind of entity it is processing, which affects how the keep_in_il mark propagates to dependents.

  7. Using-declaration fixed-point: Processes namespace member entries at *(root + 256) via sub_6170C0(member, is_file_scope, &changed) in a loop that repeats until changed == 0. The is_file_scope flag is derived from *(a1+28) being 2 or 17.

  8. Hidden name resolution: If *(a1+264) is non-NULL, walks hidden name entries. Each entry has a linked list at entry[1] with per-entry kind at *(entry + 16) (byte). Five kinds are handled:

    | Kind | Name | Action |
    |---|---|---|
    | 0x35 | Instantiation | Walk via sub_6170C0 on *(entry[3] + 8) |
    | 0x33 | Function template | Conditional marking based on scope type and entity mark |
    | 0x34 | Variable template | Same as 0x33 with v111 = entry[3] |
    | 0x36 | Alias template | Same as 0x33 with v110 = entry[3] |
    | 6 | Class/struct | Special handling: checks typedef chain at byte +132 == 12 with non-null source at +8; marks via sub_6115E0(entity, 6) for file-scope entries |

    For each marked hidden name entry, the keep_in_il bit at *(entry - 8) is set via OR with 0x80.

  9. Context restore: Restores all saved function pointers and state variables.
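The list addresses and tags tabulated in step 6 follow a regular 16-byte stride: every row is consistent with list head = 0x126E600 + 16 * tag. That relationship is inferred from the table alone and has not been confirmed against the binary, so the sketch below is a cross-check of the table, not a description of how sub_610420 computes anything. The function and array names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: derived from the tabulated rows only
   (tag 1 -> 0x126E610, tag 72 -> 0x126EA80). */
static uint64_t list_head_for_tag(unsigned tag) {
    return UINT64_C(0x126E600) + 16u * (uint64_t)tag;
}

struct tag_range { unsigned first, last; };

/* Tag ranges as listed in the step 6 table; tag 60 is absent there. */
static const struct tag_range k_walked_tags[] = {
    {1, 23}, {27, 30}, {33, 42}, {46, 48},
    {59, 59}, {61, 63}, {72, 72},
};

/* Counts how many lists the tabulated ranges cover. */
static unsigned count_walked_lists(void) {
    unsigned n = 0;
    for (size_t i = 0; i < sizeof k_walked_tags / sizeof k_walked_tags[0]; i++)
        n += k_walked_tags[i].last - k_walked_tags[i].first + 1;
    return n;
}
```

Counting the tabulated ranges this way gives 45 lists.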

Debug trace: "Beginning/Ending file scope keep_in_il walk" when the "needed_flags" flag is active.
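The using-declaration loop in step 7 is a standard fixed-point iteration: sweep the list, record whether anything was newly marked, and repeat until a sweep changes nothing. The sketch below illustrates only that control structure. The struct layout and the marking rule inside visit_member are hypothetical stand-ins for whatever sub_6170C0 actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical member record: `kept` stands in for the keep-in-IL mark,
   `target` for the entity a using-declaration refers to. */
struct member {
    struct member *next;
    struct member *target;
    int kept;
};

/* Stand-in for sub_6170C0: marks a member whose target is already kept
   and reports the change through *changed. */
static void visit_member(struct member *m, int *changed) {
    if (!m->kept && m->target != NULL && m->target->kept) {
        m->kept = 1;
        *changed = 1;
    }
}

/* Repeats the sweep until one full pass marks nothing new.
   Returns the number of sweeps performed. */
static int mark_until_stable(struct member *head) {
    int sweeps = 0;
    int changed;
    do {
        changed = 0;
        for (struct member *m = head; m != NULL; m = m->next)
            visit_member(m, &changed);
        sweeps++;
    } while (changed);
    return sweeps;
}
```

The loop terminates because marks are only ever set, never cleared, so each productive sweep strictly reduces the number of unmarked members.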

For full details on the keep-in-IL mechanism, see Keep-in-IL.

Pass 4: Constant Folding, CUDA Transforms, and Dead Entity Elimination

Per-TU: Yes, skips error TUs
C++ only: The sub_5CCA40 call is gated by dword_126EFB4 == 2

This pass has three sub-stages per TU. The first (sub_5CCA40) clears flags to prevent unnecessary work. The second (sub_5CC410) removes function bodies. The third (sub_5CCBF0) removes entire IL entries.

Stage 4a: Clear Unneeded Instantiation Flags -- sub_5CCA40

Address: 0x5CCA40
Source: il.c:29450 (clear_instantiation_required_on_unneeded_entities)
C++ only: Asserts dword_126EFB4 == 2

Walks the same four declaration lists as Pass 2 (nested scopes at +168, types at +104, routines at +144, and for non-file scopes variables at +112). For routines that are not marked for keeping but have instantiation-required flags set, calls sub_78A380(entity, 0, 2) to clear the instantiation-required bit. This prevents the template engine from instantiating definitions that will be eliminated in the next sub-stage.

The conditions for clearing a routine's instantiation-required flag are:

  • Byte +80 bit 3 clear (not an external/imported entity)
  • Byte +179 bit 4 clear (not a special instantiation)
  • Byte +179 bits 1-2 == 0b10 (has "instantiation required" set) OR (dword_126E204 is set AND byte +176 bit 7 is set)
  • Non-null template pointer at *(entity + 0) (has a source template)
  • Byte +176 bit 1 clear (not already processed)

For non-file scopes (byte +28 of scope is nonzero), additionally processes variables in the +112 list with an analogous pattern: byte +162 bit 6 clear, bit 4 set with bits 5 and 7 clear ((v8 & 0xB0) == 0x10), and a non-null pointer at *(entry + 0).
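The routine and variable conditions above reduce to a handful of bit tests. The sketch below expresses them as predicates over a hypothetical flattened view of the relevant bytes. It assumes the listed conditions are combined with logical AND, that "bits 1-2 == 0b10" means the two-bit field has value 2 (bit 2 set, bit 1 clear), and that v8 is the byte at +162. None of these assumptions is confirmed by the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened view of the routine fields the conditions read. */
struct routine_view {
    const void *source_template; /* *(entity + 0) */
    uint8_t flags80;             /* byte +80  */
    uint8_t flags176;            /* byte +176 */
    uint8_t flags179;            /* byte +179 */
};

static int bit(uint8_t byte, unsigned n) { return (byte >> n) & 1u; }

/* template_mode models dword_126E204 != 0. */
static int should_clear_instantiation_required(const struct routine_view *r,
                                               int template_mode) {
    assert(r != NULL);
    /* Assumed reading: field formed by bits 1-2 has value 0b10. */
    int required = (((r->flags179 >> 1) & 0x3) == 0x2)
                || (template_mode && bit(r->flags176, 7));
    return !bit(r->flags80, 3)          /* not external/imported   */
        && !bit(r->flags179, 4)         /* not a special instance  */
        && required
        && r->source_template != NULL   /* has a source template   */
        && !bit(r->flags176, 1);        /* not already processed   */
}

/* Variable pattern: mask 0xB0 selects bits 7, 5 and 4, so equality with
   0x10 means bit 4 set and bits 5 and 7 clear; bit 6 is tested separately. */
static int variable_matches(uint8_t flags162, const void *source) {
    return !bit(flags162, 6) && (flags162 & 0xB0) == 0x10 && source != NULL;
}
```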

Stage 4b: Eliminate Unneeded Function Bodies -- sub_5CC410

Address: 0x5CC410
Source: il.c:29231 (eliminate_bodies_of_unneeded_functions)
Gate: dword_126E55C != 0 (deferred class members exist)

Iterates the scope table (qword_126EB98, 16-byte entries: {qword scope_ptr, int file_index, pad}). The iteration runs from index 1 through dword_126EC78. For each entry:

  1. Checks that the file reference at qword_126EC88[file_index] is non-null.
  2. Checks TU ownership:
    • Primary TU (qword_106BA10 == qword_106B9F0): checks ((*(scope_ptr - 8) >> 1) ^ 1) & 1 -- bit 1 of the pre-header flags byte must be clear.
    • Secondary TU: checks qword_126DFE0[*(scope_ptr + 24)] == qword_106BA10 -- the scope's file index maps to the current TU.
  3. Verifies scope kind byte +28 == 17 (class/namespace scope).
  4. Checks the keep-in-il mark: bit 2 of byte *(scope_ptr + 187) must be clear (not needed) AND the scope file entry has bit 0 of byte +29 set (eligible for elimination).
  5. If all checks pass, calls sub_5CAB40 to remove the function body from the scope.

In C++ mode (dword_126EFB4 == 2), also calls sub_6FFBA0 to reorganize namespace-level declarations after body removal.

Debug trace: "eliminate_bodies_of_unneeded_functions" at level 3.
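The scope table entry layout and the primary-TU ownership test can be sketched as follows. The struct and field names are hypothetical; only the 16-byte entry shape, the bit positions, and the scope kind value 17 come from the description above. The scope_view struct flattens bytes that live in different objects in the real binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one scope table entry at qword_126EB98. */
struct scope_table_entry {
    void   *scope_ptr;
    int32_t file_index;
    int32_t pad;
};
_Static_assert(sizeof(void *) != 8 || sizeof(struct scope_table_entry) == 16,
               "entries are 16 bytes on an LP64 target");

/* Primary-TU ownership: yields 1 when bit 1 of the pre-header byte is clear. */
static int owned_by_primary(uint8_t preheader_flags) {
    return ((preheader_flags >> 1) ^ 1) & 1;
}

/* Hypothetical flattened view of the bytes tested in steps 2-4. */
struct scope_view {
    uint8_t preheader; /* byte at scope_ptr - 8 */
    uint8_t kind;      /* byte +28              */
    uint8_t flags29;   /* byte +29              */
    uint8_t flags187;  /* byte +187             */
};

/* Combines the primary-TU variant of the checks that precede sub_5CAB40. */
static int body_is_removable(const struct scope_view *s) {
    assert(s != NULL);
    return owned_by_primary(s->preheader)
        && s->kind == 17
        && !(s->flags187 & 0x04)   /* bit 2 clear: not needed         */
        && (s->flags29 & 0x01);    /* bit 0 set: eligible for removal */
}
```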

Stage 4c: Eliminate Unneeded IL Entries -- sub_5CCBF0

Address: 0x5CCBF0
Source: il.c:29598 (eliminate_unneeded_il_entries)
Gate: dword_126E55C != 0

The heaviest sub-stage. First calls sub_703C30(a1) to get a scope summary structure (7-element qword array stored at v2), asserting the result is non-null. Then walks four entity lists, removing entries whose keep-in-IL mark (bit 7 of byte at entity - 8) is clear:

| List | Offset | Entity Type | Removal actions |
|---|---|---|---|
| Variables | +112 | Variable declarations | Unlink from list; for C++, call sub_7B0B60 on type pointers at +112 and +216 with callback sub_5C71B0 (id 147) to clean up associated type metadata |
| Routines | +144 | Function/method declarations | Unlink from list; same sub_7B0B60 type cleanup on type pointers at +144 and +248; set bit 5 of byte +87 in the routine supplement at *(entity+152) |
| Types | +104 | Type declarations | Unlink from list; for class entities (kind 9-11 at byte +132), call sub_5CB920 (C++ member cleanup) then sub_5E2D70 (scope deallocation); set bit 5 of byte +87 in the entity supplement |
| Hidden names | +272 | Hidden name entries | Unlink unmarked entries from list |

After variable/routine/type processing, the tail pointers are stored into v2[5], v2[6], and v2[4] respectively (the scope summary structure).

For file-scope nodes (byte +28 == 0), additionally calls sub_5CC570 (eliminate unneeded scope orphaned list entries) after variable processing, and sub_718720 (scope-level cleanup) after type/hidden-name processing.

After list processing, walks qword_126EBE0 (a global deferred entity chain) and removes entries where *(entry - 8) >= 0 (bit 7 clear = not marked).
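The unlink-and-track-tail pattern used for each list can be illustrated with a generic singly linked list. This is a sketch of the technique only: the il_entry struct is hypothetical, the `keep` field stands in for bit 7 of the byte at entity - 8, and the per-kind cleanup calls are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; `keep` models the keep-in-IL mark. */
struct il_entry {
    struct il_entry *next;
    int keep;
};

/* Removes every entry whose keep mark is clear. Updates *head in place and
   returns the new tail, or NULL if no entry survives. */
static struct il_entry *prune_unmarked(struct il_entry **head) {
    struct il_entry **link = head;
    struct il_entry *tail = NULL;
    assert(head != NULL);
    while (*link != NULL) {
        struct il_entry *e = *link;
        if (e->keep) {
            tail = e;             /* last survivor seen so far */
            link = &e->next;
        } else {
            *link = e->next;      /* unlink the unmarked entry */
            e->next = NULL;
        }
    }
    return tail;
}
```

Returning the tail lets the caller store it back into the scope summary, matching the way the tail pointers are recorded after each list is processed.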

String arithmetic in debug output

The diagnostic output uses a pointer arithmetic trick: "TARG_VERT_TAB_CHAR" + 17 evaluates to "R", so the format string "%semoving variable " produces either "Removing variable ..." (when the entity is being removed) or "Not removing variable ..." (when kept).
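The trick can be reproduced in a few lines. The literal "TARG_VERT_TAB_CHAR" is 18 characters long, so offset 17 points at its final "R" followed by the terminator. In the sketch below, the "Not r" prefix for the kept case is an assumption inferred from the "Not removing variable" output, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats one removal trace line. &"..."[17] is the same address as
   "..." + 17; the indexed form avoids a compiler warning about adding
   an integer to a string literal. */
static void format_removal_line(char *out, size_t cap, int removing,
                                const char *name) {
    const char *prefix = removing ? &"TARG_VERT_TAB_CHAR"[17] : "Not r";
    assert(out != NULL && name != NULL);
    snprintf(out, cap, "%semoving variable %s", prefix, name);
}
```

A compiler that merges string tails can point "R" into the end of an unrelated literal, which is the likely reason the decompiler shows this expression instead of a plain "R".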

Deferred-Class-Members Flag

Pass 4 checks dword_126E55C after each TU's stage 4a. This flag indicates whether there are deferred class member definitions that need processing. If no errors occurred and the flag is set, stages 4b and 4c run. If errors are present during the per-TU loop, the flag is simply cleared to 0 and stages 4b/4c are skipped for that TU.

Pass 5: Per-TU Final Cleanup

Function: sub_588D40 (file_scope_il_wrapup_part_3)
Address: 0x588D40
Source: fe_wrapup.c:559
Per-TU: Yes (all TUs, no error skip)

This pass performs final statement-level processing and scope validation, then optionally re-runs the Pass 2-4 sequence for the main compilation unit.

Operations

  1. Statement finalization: sub_5BAD30 -- finalizes statement-level IL nodes (label resolution, goto target binding, fall-through analysis).

  2. Scope stack assertion (C++ with dword_106BA08): Verifies that *(qword_126C5E8 + 784 * dword_126C5E4 + 496) == qword_126E4C0. The scope stack is an array of 784-byte entries at qword_126C5E8, indexed by dword_126C5E4 (current depth). The assertion checks that the scope pointer at offset +496 of the current entry matches the expected file scope entity (qword_126E4C0). On mismatch, triggers a fatal assertion at fe_wrapup.c:559 with function name file_scope_il_wrapup_part_3.

  3. Scope cleanup: For C++ mode, calls sub_5C9E10(0) -- finalizes class scope processing, resolves deferred member access checks.

  4. IL output: sub_709250 -- serializes the IL tree to the IL output stream. This produces the internal representation that the backend reads, not the final .int.c file.

  5. Template output: sub_7C2560 -- serializes template instantiation information to the output.

  6. Mirrored 3-pass sequence (only when dword_106BA08 -- full compilation mode): Re-runs passes 2-4 on the main TU's file scope node. This handles entities that were discovered or modified during the per-TU passes. The re-run is necessary because secondary TU processing may have added new cross-references to the primary TU's entities:

    • sub_707040(file_scope) (needed flags) -- if errors appear (qword_126ED90), clears dword_126E55C and skips remaining
    • sub_610420(file_scope, 23) with dword_106B640 = 1/0 guard -- again aborts the re-run if errors are present
    • sub_5CCA40(file_scope) (clear instantiation flags, C++ only)
    • sub_5CC410() + sub_5CCBF0(file_scope) (eliminate, if dword_126E55C)
  7. Source file state: sub_6B9580 -- updates source file tracking counters.

  8. Diagnostic flush: sub_4F4030 -- flushes pending diagnostic messages for this TU.

  9. File scope cleanup: sub_6B9340(dword_126EC90) -- closes file scope state, passing the current error count for this file.
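The scope stack assertion in step 2 is plain array arithmetic over 784-byte entries. A minimal sketch, assuming the stack is addressable as a byte array and that the field at +496 is pointer-sized; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SCOPE_ENTRY_SIZE = 784, SCOPE_PTR_OFFSET = 496 };

/* Reads the pointer-sized field at +496 of entry `depth`.
   memcpy avoids alignment and aliasing assumptions about the buffer. */
static void *scope_ptr_at(const unsigned char *stack_base, int depth) {
    void *p;
    assert(stack_base != NULL && depth >= 0);
    memcpy(&p,
           stack_base + (size_t)SCOPE_ENTRY_SIZE * (size_t)depth
                      + SCOPE_PTR_OFFSET,
           sizeof p);
    return p;
}

/* Models the comparison against qword_126E4C0. */
static int file_scope_matches(const unsigned char *stack_base, int depth,
                              const void *expected) {
    return scope_ptr_at(stack_base, depth) == expected;
}
```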

Post-Pass Operations

After all five passes complete, fe_wrapup performs a series of global operations that are not per-TU.

Cross-TU IL Consistency -- sub_796BA0

Address: 0x796BA0
Source: trans_copy.c:3003 (copy_secondary_trans_unit_IL_to_primary)

Called only when there are no errors (!qword_126ED90), the multi-TU flag is clear (!dword_106C2B4), and there are secondary TUs (*(qword_106B9F0) != 0). In the current binary, this function always triggers a fatal assertion at trans_copy.c:3003 -- the multi-TU IL copy infrastructure is compiled but disabled, likely reserved for future C++ module compilation support. The function traces "copy_secondary_trans_unit_IL_to_primary" before aborting.

Scope Renumbering -- sub_707480

Address: 0x707480
Source: scope_stk.c

Called when dword_126C5A0 (scope renumbering flag) is set and dword_126EC78 > 0 (scope count is positive). Executes a double-loop:

unsigned pass = 1;
do {
    for (int idx = 1; idx < dword_126EC78; idx++)
        sub_707480(idx, pass);
    if (!pass) break;
    pass = 0;
} while (dword_126EC78 > 0);
dword_126C5A0 = 0;

For each scope entry at qword_126EB98 + 16 * idx:

  • Extracts the scope pointer at +0 and file index at +8
  • Checks non-null scope pointer, valid file reference in qword_126EC88[file_index]
  • Verifies scope kind byte +28 == 17 (class/namespace scope)
  • In pass=1: skips entries where byte +176 of the entity at *(scope+32) is non-negative as a signed byte (bit 7 clear)
  • Checks bit 1 of byte *(scope-8) is clear and bit 0 of byte +29 is clear
  • In C++ mode with bit 5 of byte *(*(scope+32) + 186) set: calls sub_6FFBA0 to reorganize scope members
  • Calls sub_6FE2A0(scope, 0, 1) to renumber the scope's declaration entries

After the double-loop, clears dword_126C5A0 = 0.

Template Validation -- sub_765480

Address: 0x765480
Source: templates.c:19822 (remove_unneeded_instantiations)

Called unless dword_106C094 == 1 (minimal compilation mode). Walks the instantiation pending list at qword_12C7740 (linked via offset +8) and removes template instantiations that are no longer needed:

| Referent kind (byte +80) | Entity kind | Action |
|---|---|---|
| 9 | Class template instantiation | If function body exists and is unreferenced (or dword_106C094 == 2), call sub_5CE710 to eliminate class definition |
| 7 | Function template instantiation | Same check with dword_106C094 != 2 guard |
| 10-11 | Variable/alias template | Call sub_5BBC70 to find underlying function, then sub_5CAB40 to remove body |

Each entry has: offset +16 = template entity pointer, offset +24 = referent entity, offset +80 = flags byte.

Final Main-TU Cleanup

Calls sub_588D40 one more time on the main translation unit (not iterating the chain). This ensures the primary TU gets the same final cleanup treatment as secondary TUs.

File Index Processing

If the primary TU has secondary TUs (*(qword_106B9F0) != 0), iterates the file table starting at index 2 through dword_126EC80:

for (int idx = 2; idx <= dword_126EC80; idx++) {
    if (!qword_126EC88[idx] || *(byte *)(qword_126EB90[idx] + 28))
        continue;
    sub_6B8B20(idx);
}

sub_6B8B20 resets the file state for each valid, non-header file index, updating the source file manager's tracking structures.

Output Flush and File Close

  1. Conditional flush: If dword_106C250 is set and no errors, calls sub_5F7DF0(0) -- flushes the IL output stream.

  2. Close three output files via sub_4F7B10:

    | Call | File pointer | ID | Identity |
    |---|---|---|---|
    | sub_4F7B10(&qword_106C280, 1513) | Primary output | 1513 | Main .int.c output (or stdout) |
    | sub_4F7B10(&qword_106C260, 1514) | Secondary output | 1514 | Module interface or IL dump |
    | sub_4F7B10(&qword_106C258, 1515) | Tertiary output | 1515 | Template instantiation log |

    sub_4F7B10 checks if the file pointer is non-null, zeroes it, calls sub_5AEAD0 (fclose wrapper), and on error triggers diagnostic sub_4F7AA0 with the given ID.
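The close helper's behavior, as described, can be sketched with the standard C library. The names close_output_file and report_close_error are hypothetical stand-ins for sub_4F7B10 and sub_4F7AA0, and fclose is used directly where the binary goes through its sub_5AEAD0 wrapper.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical diagnostic sink; records the last reported ID. */
static int g_last_diag_id = 0;
static void report_close_error(int diag_id) { g_last_diag_id = diag_id; }

/* Closes *slot if it is open. The slot is zeroed before fclose so that
   no later code can reuse a stale FILE pointer. Returns 1 if a file was
   closed, 0 if the slot was already empty. */
static int close_output_file(FILE **slot, int diag_id) {
    FILE *f;
    assert(slot != NULL);
    f = *slot;
    if (f == NULL)
        return 0;
    *slot = NULL;
    if (fclose(f) != 0)
        report_close_error(diag_id);
    return 1;
}
```

Zeroing the slot first makes the helper idempotent, which matters here because the three closes run unconditionally at the end of wrapup.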

Memory Statistics Reporting

Triggered when any of these conditions hold:

  • dword_106BC80 is set (always-report-stats flag)
  • dword_126EFCC > 0 (verbosity level > 0)
  • Debug mode (dword_126EFC8) with "space_used" flag active

Sums the return values of 10 subsystem space_used functions:

| # | Function | Address | Subsystem | Report Header |
|---|---|---|---|---|
| 1 | sub_74A980 | 0x74A980 | Symbol table | "Symbol table use:" |
| 2 | sub_6B6280 | 0x6B6280 | Macro table | "Macro table use:" |
| 3 | sub_4ED970 | 0x4ED970 | Error/diagnostic table | "Error table use:" |
| 4 | sub_6887C0 | 0x6887C0 | Conversion table | (conversion/cast subsystem) |
| 5 | sub_4E8F60 | 0x4E8F60 | Declaration table | (declarations subsystem) |
| 6 | sub_56D8C0 | 0x56D8C0 | Expression table | "Expression table use:" |
| 7 | sub_5CEA80 | 0x5CEA80 | IL table | (IL node/class subsystem) |
| 8 | sub_726C80 | 0x726C80 | Mangling table | (name mangling subsystem) |
| 9 | sub_6FDF00 | 0x6FDF00 | Lowering table | (IL lowering subsystem) |
| 10 | sub_419150 | 0x419150 | Diagnostic table | (diagnostic output subsystem) |

Each function prints its own detailed allocation table to stderr in a standardized format with columns Table / Number / Each / Total, tracks "lost" entries (allocated count minus free-list traversal count), and returns its total byte count.

The cumulative sum is passed to sub_6B95C0 (print_memory_management_statistics at 0x6B95C0), which prints the grand total accounting report:

Memory management table use:
                    Table   Number     Each    Total
             text buffers      NNN       40     NNNN
                    Total                       NNNN

Allocated space in all categories:
           Total of above                    NNNNNNN
    Skipped for alignment                       NNNN
       File mapped memory                          0
          Mapped from PCH                          0  (included in previous line)
      Mapped IL file size                          0
               Not listed                      NNNNN
               Total used                    NNNNNNN
  Avail in used mem blocks                      NNNN
 Avail in freed mem blocks                         0
             Max mem alloc                   NNNNNNN

The "Not listed" entry is computed as qword_1280700 + qword_1280708 - qword_12806F8 - total_above -- it captures memory allocated by subsystems that do not have their own space_used reporter.

Debug Dumps

If debug mode (dword_126EFC8) is active:

  • "scope_stack" flag: calls sub_702DC0 -- dumps the entire scope stack to stderr, showing all active scopes with their indices, kinds, and entity counts.
  • "viability" flag: calls sub_6C6570 -- dumps overload viability information, showing candidate sets and resolution decisions.

Final Teardown

  1. IL allocator check -- sub_5E1D00 (check_local_constant_use at il_alloc.c:1177): Copies qword_126EFB8 to qword_126EDE8 (restores the IL source position to a baseline). Asserts qword_126F680 == 0 -- no pending local constants should remain after wrapup. If nonzero, fires a fatal assertion.

  2. Zero 5 global state variables, with one cleanup call interleaved:

    • qword_126DB48 = 0 -- pending entity pointer (scope tracking)
    • Call sub_4ED0E0() -- declaration subsystem cleanup (releases declaration pools)
    • dword_126EE48 = 0 -- init-complete flag (cleared, marking end of frontend processing)
    • qword_106BA10 = 0 -- current TU descriptor (no active TU)
    • qword_12C7768 = 0 -- template state pointer 1
    • qword_12C7770 = 0 -- template state pointer 2
  3. Timing: If debug mode, calls sub_48AFD0 (print trace timing footer for the fe_wrapup section).

Error Gating Summary

Each pass has a distinct error-gating pattern. The conditions below are verified against the decompiled sub_588F90:

| Pass | Error behavior | Decompiled condition |
|---|---|---|
| Pass 1 (sub_588C60) | No gate -- always runs. Cleanup operations (template release, exception spec finalization) are safe and necessary even after errors. | None. Unconditional iteration of all secondary TUs followed by primary. |
| Cross-TU (sub_796C00) | Skipped entirely if any errors occurred. This prevents cross-TU marking from propagating errors between units. | if (!qword_126ED90) sub_796C00(); (line 67-68 of decompiled) |
| Pass 2 (sub_707040) | Per-TU skip. Inside the TU iteration loop, each TU is independently gated: if errors exist when that TU is selected, it is skipped but subsequent TUs may still run. | sub_7A3D60(tu); if (!qword_126ED90) sub_707040(*(qword_106BA10 + 8)); (lines 77-84) |
| Pass 3 (sub_610420) | Per-TU skip. Same per-TU gating as Pass 2. When a TU is skipped, dword_106B640 is never set to 1, so the guard flag remains 0. | sub_7A3D60(tu); if (!qword_126ED90) { dword_106B640 = 1; sub_610420(..., 23); dword_106B640 = 0; } (lines 97-108) |
| Pass 4 (sub_5CCA40 etc.) | Per-TU skip. On error for a TU: dword_126E55C is cleared to 0, which prevents stages 4b (sub_5CC410) and 4c (sub_5CCBF0) from running for that TU. Stage 4a (sub_5CCA40) is additionally gated by dword_126EFB4 == 2 (C++ only). | sub_7A3D60(tu); if (!qword_126ED90) { ... if (dword_126E55C) { sub_5CC410(); sub_5CCBF0(v8); } } else { dword_126E55C = 0; } (lines 120-137) |
| Pass 5 (sub_588D40) | No gate on the per-TU iteration -- always runs. However, the internal mirrored 2-3-4 re-run within sub_588D40 is individually error-gated at each stage. | Unconditional iteration. Internal re-run checks qword_126ED90 before each of sub_707040, sub_610420, sub_5CCA40. |
| Post-passes | sub_796BA0 requires !qword_126ED90 && !dword_106C2B4 && *(qword_106B9F0) != 0. sub_5F7DF0 requires dword_106C250 && !qword_126ED90. All others run unconditionally. | Line 158: if (!qword_126ED90 && !dword_106C2B4 && *v4) sub_796BA0(); Line 213: if (dword_106C250 && !qword_126ED90) sub_5F7DF0(0); |
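The Pass 4 gating pattern from the table can be modeled as a small state machine. This is a simulation sketch only: the struct and counters are hypothetical, and the position of the stage 4a call inside the elided "..." of the decompiled condition is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state mirroring the globals named in the table. */
struct wrapup_state {
    int errors;            /* qword_126ED90 != 0 */
    int is_cplusplus;      /* dword_126EFB4 == 2 */
    int deferred_members;  /* dword_126E55C      */
    int ran_4a, ran_4b, ran_4c;
};

/* One Pass 4 iteration for the currently selected TU. */
static void pass4_for_tu(struct wrapup_state *s) {
    assert(s != NULL);
    if (!s->errors) {
        if (s->is_cplusplus)
            s->ran_4a++;              /* sub_5CCA40 */
        if (s->deferred_members) {
            s->ran_4b++;              /* sub_5CC410 */
            s->ran_4c++;              /* sub_5CCBF0 */
        }
    } else {
        s->deferred_members = 0;      /* disables 4b/4c from here on */
    }
}
```

Because the flag is cleared in the error branch and not restored, an error seen while processing one TU also suppresses stages 4b and 4c for later TUs in this model.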

Data Flow Summary

| Input | Description |
|---|---|
| qword_106B9F0 | TU chain head -- linked list of all translation units |
| *(qword_106BA10 + 8) | File scope IL root node -- the IL tree for each TU |
| qword_126ED90 | Error flag -- nonzero means compilation errors occurred |
| dword_126EFB4 | Language mode -- 2 for C++, gates pass 4 and template operations |
| dword_106BA08 | Full compilation mode flag -- gates Pass 5's mirrored sequence |

| Output | Description |
|---|---|
| Finalized IL tree | Entities marked for keeping preserved; all others eliminated |
| dword_106B640 | IL emission guard flag -- 0 at completion |
| dword_126E55C | Deferred class members flag -- 0 after processing |
| Closed output files | Three output streams (IDs 1513-1515) flushed and closed |
| Zeroed globals | qword_106BA10, dword_126EE48, qword_126DB48, template state -- all cleared |

Function Map

| Address | Identity | Source file | Role in fe_wrapup |
|---|---|---|---|
| sub_588F90 | fe_wrapup | fe_wrapup.c:776 | Top-level entry, called from main() |
| sub_588C60 | file_scope_il_wrapup | fe_wrapup.c | Pass 1: template/exception cleanup, IL walk, IL finalize |
| sub_588D40 | file_scope_il_wrapup_part_3 | fe_wrapup.c:559 | Pass 5: statement finalization, scope assertion, IL/template output |
| sub_588E90 | translation_unit_wrapup | fe_wrapup.c | Called from process_translation_unit, not directly from fe_wrapup |
| sub_707040 | set_needed_flags_at_end_of_file_scope | scope_stk.c:8090 | Pass 2: compute needed-flags on all entity lists |
| sub_610420 | mark_to_keep_in_il | il_walk.c:1959 | Pass 3: mark entities for device code preservation |
| sub_5CCA40 | clear_instantiation_required_on_unneeded_entities | il.c:29450 | Pass 4a: prevent unnecessary template instantiation |
| sub_5CC410 | eliminate_bodies_of_unneeded_functions | il.c:29231 | Pass 4b: remove dead function bodies |
| sub_5CCBF0 | eliminate_unneeded_il_entries | il.c:29598 | Pass 4c: remove dead entities from IL lists |
| sub_796C00 | mark_secondary_trans_unit_IL_entities_used_from_primary_as_needed | scope_stk.c | Between Pass 1 and 2: cross-TU reference marking |
| sub_796BA0 | copy_secondary_trans_unit_IL_to_primary | trans_copy.c:3003 | Post-pass: dead in CUDA build (always asserts) |
| sub_707480 | scope renumber (inferred) | scope_stk.c | Post-pass: renumber scope declarations |
| sub_765480 | remove_unneeded_instantiations | templates.c:19822 | Post-pass: prune template instantiation list |
| sub_6B95C0 | print_memory_management_statistics | memory mgmt | Post-pass: grand total memory report |
| sub_5E1D00 | check_local_constant_use | il_alloc.c:1177 | Post-pass: assert no pending local constants |
| sub_7A3D60 | set_current_translation_unit | trans_unit.c | Called before every per-TU operation |
| sub_706710 | IL tree walk | IL subsystem | Pass 1 via sub_588C60 |
| sub_706F40 | IL finalize | IL subsystem | Pass 1 via sub_588C60 |
| sub_6115E0 | walk_tree_and_set_keep_in_il | il_walk.c | Pass 3: recursive keep_in_il walker |
| sub_6170C0 | namespace member walk | il_walk.c | Pass 3: using-declaration fixed-point |
| sub_6175F0 | C++ companion walk | il_walk.c | Pass 3: out-of-line definitions |
| sub_617310 | prune_keep_in_il_walk | il_walk.c | Pass 3: installed as walk prune callback |
| sub_5BD350 | destroy temporaries | IL subsystem | Pass 1: C++ temporary cleanup |
| sub_7C2690 | template cleanup | template engine | Pass 1: release deferred template state |
| sub_68A0C0 | exception cleanup | exception handling | Pass 1: finalize exception specs |
| sub_78A9D0 | template_and_inline_entity_wrapup | C++ support | Preamble: C++ pre-wrapup |
| sub_78A380 | clear instantiation-required flag | template engine | Pass 4a via sub_5CCA40 |
| sub_5CAB40 | eliminate function body | IL subsystem | Pass 4b via sub_5CC410, post-pass via sub_765480 |
| sub_5CE710 | eliminate class definition | IL subsystem | Post-pass via sub_765480 |
| sub_5CB920 | C++ member cleanup | class subsystem | Pass 4c via sub_5CCBF0 |
| sub_5E2D70 | scope deallocation | scope subsystem | Pass 4c via sub_5CCBF0 |
| sub_5CC570 | eliminate scope orphaned entries | IL subsystem | Pass 4c via sub_5CCBF0 |
| sub_718720 | scope-level cleanup | scope subsystem | Pass 4c via sub_5CCBF0 |
| sub_703C30 | get scope summary | scope subsystem | Pass 4c via sub_5CCBF0 |
| sub_7B0B60 | walk type tree | type subsystem | Pass 4c: type metadata cleanup |
| sub_5C71B0 | type cleanup callback | type subsystem | Pass 4c: invoked via sub_7B0B60 with id 147 |
| sub_6FE2A0 | renumber scope entries | scope subsystem | Post-pass via sub_707480 |
| sub_6FFBA0 | reorganize scope members | scope subsystem | Pass 4b, scope renumbering |
| sub_6FE8C0 | C++ scope merge | scope subsystem | Pass 2: merge declaration/scope lists |
| sub_4F7B10 | close output file | file I/O | Post-pass: close 3 files |
| sub_5F7DF0 | flush IL output | IL output | Post-pass: conditional flush |
| sub_6B8B20 | process file entry | source file mgr | Post-pass: file index loop |
| sub_4ED0E0 | declaration cleanup | declarations | State teardown |
| sub_709250 | IL output | IL output | Pass 5: serialize IL tree |
| sub_7C2560 | template output | template engine | Pass 5: serialize template info |
| sub_5BAD30 | statement finalization | statement subsystem | Pass 5: finalize statement-level nodes |
| sub_5C9E10 | class scope finalization | class subsystem | Pass 5: C++ scope cleanup |
| sub_6B9580 | source file state update | source file mgr | Pass 5: update file tracking |
| sub_4F4030 | diagnostic flush | diagnostics | Pass 5: flush pending messages |
| sub_6B9340 | file scope close | source file mgr | Pass 5: close file scope with error count |
| sub_702DC0 | scope stack dump | scope subsystem | Post-pass: debug dump |
| sub_6C6570 | viability dump | overload resolution | Post-pass: debug dump |
| sub_48AE00 | debug trace enter | debug subsystem | Preamble, Pass 4b/4c |
| sub_48AFD0 | debug trace exit/timing | debug subsystem | Final: print timing |
| sub_48A7E0 | debug flag check | debug subsystem | Multiple: check named trace flags |

Diagnostic Strings

| String | Location | When emitted |
|---|---|---|
| "fe_wrapup" | sub_588F90 preamble | Debug trace at function entry |
| "bad translation unit in fe_wrapup" | sub_588F90 preamble | Fatal assertion when dword_106BA08 == 0 |
| "source_file_for_seq_info" | sub_588F90 preamble | Debug flag name for source sequence dump |
| "Start of set_needed_flags_at_end_of_file_scope" | sub_707040 entry | Pass 2 debug trace |
| "End of set_needed_flags_at_end_of_file_scope" | sub_707040 exit | Pass 2 debug trace |
| "needed_flags" | sub_707040, sub_610420 | Debug flag name for needed-flags diagnostics |
| "bad scope kind" | sub_707040 | Fatal assertion when scope kind is not 0, 3, or 6 |
| "variable_needed_even_if_unreferenced" | sub_707040 | Assertion function name at scope_stk.c:7999/8001 |
| "Beginning file scope keep_in_il walk" | sub_610420 entry | Pass 3 debug trace |
| "Ending file scope keep_in_il walk" | sub_610420 exit | Pass 3 debug trace |
| "mark_to_keep_in_il" | sub_610420 | Fatal assertion function name at il_walk.c:1959 |
| "file_scope_il_wrapup_part_3" | sub_588D40 | Assertion function name at fe_wrapup.c:559 |
| "clear_instantiation_required_on_unneeded_entities" | sub_5CCA40 | Assertion function name at il.c:29450 |
| "eliminate_bodies_of_unneeded_functions" | sub_5CC410 | Debug trace at level 3 |
| "eliminate_unneeded_il_entries" | sub_5CCBF0 | Debug trace at level 3 |
| "Removing variable ..." | sub_5CCBF0 | Verbose output when removing a variable entity |
| "Not removing variable ..." | sub_5CCBF0 | Verbose output when keeping a variable entity |
| "Removing routine ..." | sub_5CCBF0 | Verbose output when removing a function entity |
| "Not removing routine ..." | sub_5CCBF0 | Verbose output when keeping a function entity |
| "Removing hidden name entry for ..." | sub_5CCBF0 | Verbose output during hidden name cleanup |
| "check_local_constant_use" | sub_5E1D00 | Assertion function name at il_alloc.c:1177 |
| "copy_secondary_trans_unit_IL_to_primary" | sub_796BA0 | Debug trace + fatal assertion at trans_copy.c:3003/3008 |
| "remove_unneeded_instantiations" | sub_765480 | Assertion function name at templates.c:19822/19848 |
| "scope_stack" | sub_588F90 post-pass | Debug flag name for scope stack dump |
| "viability" | sub_588F90 post-pass | Debug flag name for viability analysis dump |
| "space_used" | sub_588F90 post-pass | Debug flag name for memory statistics |
| "dump_elim" | sub_5CCBF0, sub_5CC410 | Debug flag name for entity removal details |
| "Memory management table use:" | sub_6B95C0 | Memory statistics report header |
| "Symbol table use:" | sub_74A980 | Symbol table statistics header |
| "Macro table use:" | sub_6B6280 | Macro table statistics header |
| "Error table use:" | sub_4ED970 | Error table statistics header |
| "Expression table use:" | sub_56D8C0 | Expression table statistics header |

Key Global Variables

| Variable | Address | Role in fe_wrapup |
|---|---|---|
| qword_106B9F0 | 0x106B9F0 | TU chain head. Iterated by all 5 passes. |
| qword_106BA10 | 0x106BA10 | Current TU descriptor. Switched by sub_7A3D60 before each TU. |
| qword_126ED90 | 0x126ED90 | Error flag. Passes 2-4 skip TUs when nonzero. |
| dword_126EFB4 | 0x126EFB4 | Language mode. 2 = C++. Gates sub_5CCA40, sub_78A9D0, template operations. |
| dword_106BA08 | 0x106BA08 | Full compilation flag. Gates preamble assertion and Pass 5's mirrored sequence. |
| dword_106B640 | 0x106B640 | IL emission guard. Set=1 during Pass 2 (file scope entry) and Pass 3 (caller). Asserted by sub_610420. Cleared=0 at end. |
| dword_126E55C | 0x126E55C | Deferred class members flag. When set, enables stages 4b and 4c. Cleared on error exit. |
| dword_126C5A0 | 0x126C5A0 | Scope renumbering flag. When set, enables post-pass sub_707480 double-loop. Cleared after. |
| dword_126EC78 | 0x126EC78 | Scope count. Controls iteration bounds for sub_707480 and sub_5CC410. |
| qword_126EB98 | 0x126EB98 | Scope table base. 16-byte entries: {qword scope_ptr, int file_index, pad}. |
| dword_126EC80 | 0x126EC80 | File table entry count. Controls file index processing loop. |
| qword_126EC88 | 0x126EC88 | File table (name/scope pointers). Indexed by file ID. |
| qword_126EB90 | 0x126EB90 | File table (info entries). Indexed by file ID. |
| dword_106C094 | 0x106C094 | Compilation mode. Value 1 skips sub_765480 (template validation). |
| dword_106C250 | 0x106C250 | Output flush flag. When set with no errors, calls sub_5F7DF0(0). |
| dword_106C268 | 0x106C268 | CUDA diagnostics flag. Gates sub_6B3260 in preamble. |
| dword_106C2B4 | 0x106C2B4 | Cross-TU copy disabled. When set, skips sub_796BA0. |
| dword_126EFC8 | 0x126EFC8 | Debug/trace mode. Enables trace output and debug dumps throughout. |
| dword_126EFCC | 0x126EFCC | Diagnostic verbosity level. Level > 0 enables memory stats, > 2 enables dump_elim. |
| dword_106BC80 | 0x106BC80 | Always-report-stats flag. Forces memory statistics regardless of verbosity. |
| dword_126EE48 | 0x126EE48 | Init-complete flag. Set to 1 during fe_init_part_1, cleared to 0 during teardown. |
| qword_126DB48 | 0x126DB48 | Scope tracking pointer. Cleared during teardown. |
| qword_12C7768 | 0x12C7768 | Template state pointer 1. Cleared during teardown. |
| qword_12C7770 | 0x12C7770 | Template state pointer 2. Cleared during teardown. |
| qword_126E4C0 | 0x126E4C0 | Expected file scope entity. Compared in Pass 5 scope assertion. |
| qword_126C5E8 | 0x126C5E8 | Scope stack base pointer. Array of 784-byte entries. |
| dword_126C5E4 | 0x126C5E4 | Current scope stack depth index. |
| dword_126E204 | 0x126E204 | Template mode flag. Affects instantiation-required clearing in Pass 4a. |
| qword_126EBA0 | 0x126EBA0 | Deferred entity list head. Walked in Pass 3. |
| qword_126EBE0 | 0x126EBE0 | Global deferred entity chain. Cleaned in Pass 4c. |
| qword_12C7740 | 0x12C7740 | Template instantiation pending list. Walked by sub_765480. |
| qword_126DFE0 | 0x126DFE0 | File-index-to-TU mapping table. Used for TU ownership checks. |

Backend Code Generation

The backend is the final stage of the cudafe++ pipeline (stage 7 in the overview). It lives in a single function, process_file_scope_entities (sub_489000, 723 decompiled lines, 4520 bytes), whose job is to walk the EDG source sequence produced by the frontend and emit a .int.c file that the host C++ compiler (gcc, clang, or cl.exe) can compile. The function resides in cp_gen_be.c at EDG source lines around 19916-26628, and it delegates per-entity code generation to gen_template (sub_47ECC0, 1917 decompiled lines), which dispatches on entity kind to specialized generators for variables, types, routines, namespaces, and templates.

The backend is gated by the skip-backend flag (dword_106C254): if set to 1 (errors occurred during the frontend), main() never calls sub_489000 and proceeds directly to exit.

Key Facts

| Property | Value |
|---|---|
| Function | sub_489000 (process_file_scope_entities) |
| Binary address | 0x489000 |
| Binary size | 4520 bytes (723 decompiled lines) |
| EDG source | cp_gen_be.c |
| Callees | ~140 distinct call targets |
| Output | .int.c file (or stdout when filename is "-") |
| Main dispatcher | sub_47ECC0 (gen_template, 1917 lines) |
| Host reference emitter | sub_6BCF80 (nv_emit_host_reference_array) |
| Module ID writer | sub_5B0180 (write_module_id_to_file) |
| Skip-backend flag | dword_106C254 |
| Backend timing label | "Back end time" |

Output Primitives

All output to the .int.c file passes through a small set of character-level emitters. Understanding these is essential for reading the decompiled backend code, since every line of generated C/C++ is assembled from these calls:

| Function | Address | Identity | Behavior |
|---|---|---|---|
| sub_467D60 | 0x467D60 | emit_newline | Writes \n via putc(10, stream). Increments dword_1065820 (line counter). Resets dword_106581C (column counter) and dword_1065830 to 0. Calls sub_403730 (write error abort) on failure. |
| sub_467DA0 | 0x467DA0 | emit_line_directive | Checks dword_1065818 (needs-line-directive flag). If the current source position (qword_1065810) differs from the output line counter, calls sub_467EB0 to emit a #line N "file" directive. Resets dword_1065818 to 0. Handles close-range line gaps (within 5 lines) by emitting blank lines instead of a #line directive. |
| sub_467E50 | 0x467E50 | emit_string | If dword_1065818 is set, calls emit_line_directive first. Writes each character of the string via putc. Increments dword_106581C by the string length. |
| sub_467EB0 | 0x467EB0 | emit_line_number | Emits #line N "file" or # N "file" (short form when dword_106C28C or MSVC EDG-native mode is set). Constructs the directive in a stack buffer starting with #line , appends the decimal line number, then the quoted filename via sub_5B1940. Sets dword_1065820 to the target line number. Resets column counters. |
| sub_468150 | 0x468150 | emit_char | If dword_1065818 is set, calls emit_line_directive first. Writes a single character via putc. Increments dword_106581C by 1. |
| sub_468190 | 0x468190 | emit_raw_string | Like emit_string but without strlen -- walks the string character by character, incrementing dword_106581C per character. Calls emit_line_directive first if dword_1065818 is set. |
| sub_468270 | 0x468270 | emit_decimal | Writes an unsigned integer as decimal digits. Has fast paths for 1-5 digit numbers (manual digit extraction via division by powers of 10). Falls back to sub_465480 (sprintf-style) for larger numbers. Calls emit_line_directive first if needed. |
| sub_46BC80 | 0x46BC80 | emit_line_start | If the column counter is nonzero, first emits a newline. Increments dword_1065834 (indent level). Calls emit_line_directive if needed. Then writes the string character by character. Used for the first token on a new line (e.g., #define, #ifdef). |

Output State Variables

| Variable | Address | Type | Role |
|---|---|---|---|
| stream | 0x106583x | FILE* | Output file handle for .int.c |
| dword_1065834 | 0x1065834 | int | Indent level counter. Incremented by emit_line_start, decremented after each directive block. Not used for actual indentation emission -- tracks logical nesting depth for #line management. |
| dword_1065820 | 0x1065820 | int | Output line counter. Tracks the current line number in the generated .int.c file. Incremented by every \n written. |
| dword_106581C | 0x106581C | int | Output column counter. Tracks the current column position. Reset to 0 after each newline. |
| dword_1065830 | 0x1065830 | int | Column counter after last newline (secondary tracking). Reset to 0 with dword_106581C. |
| dword_1065818 | 0x1065818 | int | Needs-line-directive flag. Set to 1 when the source position changes. Checked by every output primitive; when set, a #line directive is emitted before the next output. |
| qword_1065810 | 0x1065810 | qword | Current source position (line number from the original .cu file). Updated when processing each entity. |
| qword_1065828 | 0x1065828 | qword | Current source file index. Compared against new file references to decide whether to emit a #line with filename. |
| qword_126EDE8 | 0x126EDE8 | qword | Mirror of qword_1065810. Updated in parallel; used by other subsystems to query current position. |
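
The way the primitives and these counters interact can be sketched in a few lines of C. This is an illustrative model only, not the decompiled code: the Emitter struct and the em_* names are hypothetical, output goes to an in-memory buffer instead of the real stream, and the "pad gaps of up to 5 lines with blank lines" rule is taken from the emit_line_directive description above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical model of the backend output state (names are illustrative). */
typedef struct {
    char        buf[1024];  /* stands in for the FILE* stream           */
    size_t      len;
    long        out_line;   /* models dword_1065820 (output line)       */
    int         out_col;    /* models dword_106581C (output column)     */
    int         needs_line; /* models dword_1065818 (directive pending) */
    long        src_line;   /* models qword_1065810 (source line)       */
    const char *src_file;
} Emitter;

/* Append one byte, silently dropping output once the buffer is full. */
static void em_putc(Emitter *e, char c) {
    if (e->len + 1 < sizeof e->buf) {
        e->buf[e->len++] = c;
        e->buf[e->len] = '\0';
    }
}

/* emit_newline analogue: bump the line counter, reset the column. */
static void em_newline(Emitter *e) {
    em_putc(e, '\n');
    e->out_line++;
    e->out_col = 0;
}

/* emit_line_directive analogue: pad small forward gaps with blank lines,
   otherwise write a full #line directive and resynchronise the counter. */
static void em_sync_line(Emitter *e) {
    long gap = e->src_line - e->out_line;
    e->needs_line = 0;
    if (gap == 0) return;
    if (gap > 0 && gap <= 5) {
        while (e->out_line < e->src_line) em_newline(e);
        return;
    }
    char tmp[256];
    snprintf(tmp, sizeof tmp, "#line %ld \"%s\"\n", e->src_line, e->src_file);
    for (const char *p = tmp; *p; p++) em_putc(e, *p);
    e->out_line = e->src_line;
    e->out_col = 0;
}

/* emit_string analogue: flush a pending directive, then copy the text. */
static void em_string(Emitter *e, const char *s) {
    if (e->needs_line) em_sync_line(e);
    for (; *s; s++) {
        em_putc(e, *s);
        e->out_col++;
    }
}
```

The point of the design is that callers never emit #line themselves: they only update the source position and raise the flag, and whichever primitive writes next pays for the synchronisation.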

Execution Flow

The backend proceeds through seven sequential phases within sub_489000:

sub_489000 (process_file_scope_entities)
  |
  |-- Phase 1: State initialization (40+ globals zeroed, 4 buffers cleared)
  |-- Phase 2: Output file opening (.int.c or stdout)
  |-- Phase 3: Boilerplate emission (GCC diagnostics, managed runtime, lambda macros)
  |-- Phase 4: Main entity loop (walk source sequence, dispatch to gen_template)
  |-- Phase 5: Empty file guard + scope unwind (sub_466C10)
  |-- [optional] Breakpoint placeholders (qword_1065840 list)
  |-- Phase 6: File trailer (#line, _NV_ANON_NAMESPACE, #include, #undef)
  |-- Phase 7: Host reference arrays (sub_6BCF80 x 6, conditional on dword_106BFD0/BFCC)
  |
  +-- sub_4F7B10: close output file (ID 1701)

Phase 1: State Initialization

The function begins by zeroing approximately 40 global variables and clearing four large buffers. This ensures no state leaks between compilation units (relevant in the recompilation loop, though in practice sub_489000 runs exactly once).

Scalar Zeroing

The first 20 lines of the decompiled function zero individual globals:

dword_1065834 = 0;   // indent level
dword_1065830 = 0;   // column after newline
stream        = 0;   // FILE* handle
qword_126EDE8 = 0;   // current source position (low 6 bytes)
qword_1065828 = 0;   // current file index
dword_1065820 = 0;   // output line counter
dword_106581C = 0;   // output column counter
dword_1065818 = 0;   // needs-line-directive flag
qword_1065748 = 0;   // source sequence cursor
qword_1065740 = 0;   // alternate source sequence cursor
qword_126C5D0 = 0;   // (template instantiation tracking)
dword_106573C = 0;
dword_1065734 = 0;
dword_1065730 = 0;
dword_106572C = 0;
qword_1065708 = 0;   // scope stack head
qword_1065720 = 0;   // scope free list
qword_1065700 = 0;   // scope pool head
dword_10656FC = 0;   // current access specifier
// ... additional counters, flags, sequence pointers

Additional globals zeroed later (after the callback setup):

dword_1065758 = 0;   dword_1065754 = 0;   dword_1065750 = 0;
dword_10656F8 = 0;   dword_10656F4 = 0;
qword_1065718 = 0;   qword_1065710 = 0;
dword_1065728 = 0;   qword_F05708  = 0;

Buffer Clearing

Four memset calls clear hash tables / lookup buffers:

| Buffer Base | Size (hex) | Size (decimal) | Description |
|---|---|---|---|
| unk_FE5700 | 0x7FFE0 | 524,256 bytes (~512 KB) | Entity lookup hash table |
| unk_F65720 | 0x7FFE0 | 524,256 bytes (~512 KB) | Type lookup hash table |
| qword_E85720 | 0x7FFE0 | 524,256 bytes (~512 KB) | Declaration tracking table |
| xmmword_F05720 | 0x5FFE8 | 393,192 bytes (~384 KB) | Scope/name resolution table |

Total: 1,965,960 bytes (approximately 1.87 MB) of memory zeroed at backend entry.
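
The total can be checked directly from the four sizes in the table. A small sketch of that arithmetic (the array and function names are hypothetical):

```c
#include <stddef.h>

/* Sizes of the four cleared buffers, copied from the table above. */
static const size_t cleared_sizes[4] = { 0x7FFE0, 0x7FFE0, 0x7FFE0, 0x5FFE8 };

/* Sum of all memset lengths, in bytes. */
static size_t total_cleared_bytes(void) {
    size_t total = 0;
    for (size_t i = 0; i < 4; i++)
        total += cleared_sizes[i];
    return total;
}

/* The same total in units of 1024 bytes, rounded down. */
static size_t total_cleared_kib(void) {
    return total_cleared_bytes() / 1024;
}
```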

Callback Table Setup

After zeroing, the function initializes two tables of function pointers:

gen_be_info callbacks (6 entries at xmmword_1065760..10657B0):

sub_5F9040(&xmmword_1065760);    // clear the table first
xmmword_1065760 = off_83BD60;    // callback 0: expression gen
xmmword_1065778 = off_83BD68;    // callback 1: type gen
xmmword_1065788 = off_83BD70;    // callback 2: declaration gen
xmmword_10657A0 = off_83BD78;    // callback 3: statement gen
xmmword_10657B0 = qword_83BD80;  // callback 4: scope gen

These pointers are loaded from read-only data via SSE (_mm_loadh_ps), packing two 8-byte function pointers per 16-byte XMM value.

Direct callback assignments (4 entries):

| Variable | Address | Value | Identity |
|---|---|---|---|
| qword_10657C0 | 0x10657C0 | sub_46BEE0 | gen_statement_expression (only set when not in MSVC __declspec mode) |
| qword_10657C8 | 0x10657C8 | loc_469200 | gen_type_operator_expression |
| qword_10657D0 | 0x10657D0 | sub_466F40 | gen_be_helper_1 |
| qword_10657D8 | 0x10657D8 | sub_4686C0 | gen_be_helper_2 |

Host Compiler Version Detection

A block of conditionals determines warning suppression behavior based on the host compiler version:

byte_10657F0 = 1;                        // always set
byte_10657F1 = byte_126EBB0;             // copy verbose-line-dir flag
if (dword_126EFB4 == 2                   // CUDA mode
    || dword_126EF68 <= 199900)          // C++ standard <= C++98
{
    byte_10657F4 = (dword_126EFB0 != 0); // copy flag
} else {
    byte_10657F4 = 1;                    // force on for newer standards
}

The byte_1065803 flag is set to 1 when MSVC mode (dword_126E1D8) is active or when the GNU/Clang version in qword_126E1F0 falls in a narrow window: the check subtracts 40500 and accepts a difference of at most 2, i.e., encoded versions 40500-40502 (4.5.0 through 4.5.2 if the encoding is major*10000 + minor*100 + patch).
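
A "subtract a base, compare against a tolerance" pattern is what a decompiler typically shows for a compiler-generated range test. A minimal sketch of that idiom, assuming this is what the binary does here (the helper name is hypothetical):

```c
/* Returns 1 when v lies in [base, base + tolerance].
   The unsigned cast folds the two comparisons (v >= base and
   v <= base + tolerance) into one: a value below base wraps around
   to a very large unsigned number and fails the test. */
static int in_version_window(long v, long base, unsigned long tolerance) {
    return (unsigned long)(v - base) <= tolerance;
}
```

With base 40500 and tolerance 2, the helper accepts exactly the three encoded versions named above.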

Scope Stack Allocation

A dynamic scope tracking structure is allocated (or resized if it exists from a prior run):

if (qword_10656E8) {
    // resize existing: realloc to 16 * (count + 1) bytes
    sub_6B74D0(*(qword_10656E8), 16 * (*(qword_10656E8 + 8) + 1));
} else {
    // allocate fresh: 16-byte header
    v0 = sub_6B7340(16);
    qword_10656E8 = v0;
}
// allocate 1024-byte data block, zero it, attach to header
v2 = sub_6B7340(1024);
// zero 1024 bytes in 16-byte steps (64 slots of 16 bytes each)
*v0 = v2;
v0[1] = 63;   // capacity = 63 entries

This creates a 64-slot lookup table (63 usable entries plus sentinel) for tracking entity references during code generation.

Phase 2: Output File Opening

The function opens the output .int.c file. Two paths are possible:

Stdout mode: If the output filename (qword_126EEE0) equals "-", the function sets stream = stdout.

// strcmp(qword_126EEE0, "-")
if (filename_is_dash) {
    stream = stdout;
}

File mode: Otherwise, the function constructs the output path by appending .int.c to the base filename (stripping the original extension):

v55 = qword_106BF20;                       // pre-set output path (CLI override)
if (!v55)
    v55 = sub_5ADD90(qword_126EEE0, ".int.c");  // derive_name: strip ext, add ".int.c"
stream = sub_4F48F0(v55, 0, 0, 0, 1701);   // open_output_file (ID 1701)

The sub_5ADD90 function (derive_name) finds the last . in the filename, strips the extension, and appends .int.c. It handles multi-byte UTF-8 characters correctly when scanning for the dot position. The constant 1701 is the file descriptor identifier used by the file management subsystem.
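
The name derivation can be sketched as follows. This is a hypothetical re-creation of the described behaviour, not the decompiled sub_5ADD90: the function name, the caller-supplied buffer, and the choice to look for the dot only in the last path component are all assumptions. A byte-wise search for '.' is safe on UTF-8 input because ASCII bytes never occur inside a multi-byte sequence.

```c
#include <stdio.h>
#include <string.h>

/* Drop the final extension (if any) and append ".int.c".
   Returns 0 on success, -1 if the result does not fit in out. */
static int derive_int_c_name(const char *path, char *out, size_t out_size) {
    const char *slash = strrchr(path, '/');          /* last directory separator */
    const char *base  = slash ? slash + 1 : path;    /* start of the file name   */
    const char *dot   = strrchr(base, '.');          /* last dot in the name     */
    size_t stem_len   = dot ? (size_t)(dot - path) : strlen(path);
    int n = snprintf(out, out_size, "%.*s.int.c", (int)stem_len, path);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```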

After opening the file, sub_5B9A20 is called to initialize the output stream state, and sub_467EB0 emits the initial #line 1 directive.

Phase 3: Boilerplate Emission

Before processing any user declarations, the backend emits several blocks of boilerplate that the host compiler needs. The exact output depends on the host compiler identity (Clang, GCC, MSVC) and the CUDA mode.

GCC Diagnostic Suppressions

Multiple #pragma GCC diagnostic directives suppress host compiler warnings that would be spurious for generated code:

// Conditional on Clang version > 30599 (0x7787) or GNU version > 40799 (0x9F5F)
#pragma GCC diagnostic ignored "-Wunused-local-typedefs"

// Conditional on dword_126EFA8 (attribute mode) && dword_106C07C
#pragma GCC diagnostic ignored "-Wattributes"

// Clang or recent GNU/Clang:
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-function"

// Clang-specific additional suppressions:
#pragma GCC diagnostic ignored "-Wunused-private-field"
#pragma GCC diagnostic ignored "-Wunused-parameter"

The version thresholds use the encoded host compiler version from qword_126EF90 (Clang version) and qword_126E1F0 (GCC/Clang combined version):

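| Hex constant | Decimal | Approximate version (assuming major*10000 + minor*100 + patch) |
|---|---|---|
| 0x7787 | 30,599 | 3.5.99, the last value before Clang 3.6 |
| 0x9D07 | 40,199 | 4.1.99, the last value before GCC/Clang 4.2 |
| 0x9E97 | 40,599 | 4.5.99, the last value before GCC/Clang 4.6 |
| 0x9F5F | 40,799 | 4.7.99, the last value before GCC/Clang 4.8 |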
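
To read these thresholds as version numbers, the encoded value has to be split into components. The sketch below assumes the common major*10000 + minor*100 + patch scheme; the binary does not state the encoding explicitly, and the type and function names are hypothetical.

```c
/* Decoded host compiler version. */
typedef struct {
    int major;
    int minor;
    int patch;
} HostVersion;

/* Split an encoded version, assuming major*10000 + minor*100 + patch. */
static HostVersion decode_host_version(long encoded) {
    HostVersion v;
    v.major = (int)(encoded / 10000);
    v.minor = (int)(encoded / 100 % 100);
    v.patch = (int)(encoded % 100);
    return v;
}
```

Under that assumption a test such as "version > 0x9F5F" is first satisfied by 4.8.0, since 40799 itself decodes to 4.7.99.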

Managed Runtime Boilerplate

A block of C code is emitted unconditionally for __managed__ variable support:

static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) {
    __nv_fatbinhandle_for_managed_rt = in;
}
static char __nv_init_managed_rt_with_module(void **);

Followed by the inline initialization helper:

__attribute__((unused))                    // added when dword_106BF6C (alt host mode) is set
static inline void __nv_init_managed_rt(void) {
    __nv_inited_managed_rt = (__nv_inited_managed_rt
        ? __nv_inited_managed_rt
        : __nv_init_managed_rt_with_module(__nv_fatbinhandle_for_managed_rt));
}

This boilerplate is surrounded by a #pragma GCC diagnostic push / pop pair to suppress warnings about unused variables/functions in the boilerplate itself.

After the pop, additional #pragma GCC diagnostic ignored directives may be emitted for the remainder of the file (outside the push/pop scope), depending on compiler version.

Lambda Detection Macros

When extended lambda mode (dword_106BF38) is NOT active, three stub macro definitions are emitted:

#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false

Followed by a self-checking #if defined block:

#if defined(__nv_is_extended_device_lambda_closure_type) \
 && defined(__nv_is_extended_host_device_lambda_closure_type) \
 && defined(__nv_is_extended_device_lambda_with_preserved_return_type)
#endif

When extended lambda mode IS active, these macros are not emitted -- the frontend's keyword registration has already defined them as built-in type traits recognized by the parser. The empty #if defined / #endif block serves as a guard that downstream tools can detect.

Phase 4: Main Entity Loop

This is the core of the backend. The source sequence cursor qword_1065748 is initialized from the file scope IL node's declaration list at offset +256: qword_1065748 = *(*(xmmword_126EB60 + 8) + 256), where the high qword of xmmword_126EB60 points to the file scope root (set during fe_wrapup). The cursor walks this linked list of top-level declarations in the order they appeared in the source file. For each entry, it dispatches based on the entry's kind field at offset +16.

Source Sequence Entry Structure

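Each source sequence entry has this layout. In the full pseudocode below, next (+0), kind (+16) and entity (+24) are read from the sequence entry itself, while the pragma-related fields (+8, +9, +32, +48, +56, +57) are read through the entity pointer of a kind-57 entry, so the table is best read as a combined view of both nodes:

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Pointer to next entry in the linked list |
| +8 | 1 | sub_kind | Sub-classification within the kind |
| +9 | 1 | skip_flag | If nonzero, entry has already been processed |
| +16 | 1 | kind | Entry kind (see dispatch table below) |
| +24 | 8 | entity | Pointer to the EDG entity node for this declaration |
| +32 | 8 | source_position | Source file/line encoding |
| +48 | 8 | pragma_text | For pragma entries: pointer to raw pragma string |
| +56 | 8 | stdc_kind / pragma_data | STDC pragma kind or additional pragma metadata |
| +57 | 1 | stdc_value | STDC pragma value (ON/OFF/DEFAULT) |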
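
For reimplementation work it can help to pin these offsets down in a C struct. The declaration below is a sketch under stated assumptions: an LP64 target (8-byte pointers), hypothetical type and padding names, and the STDC reading of the bytes at +56/+57 (for other pragma sub-kinds the same storage holds different data).

```c
#include <stddef.h>
#include <stdint.h>

/* Combined view of the offsets listed in the table (LP64 assumed). */
typedef struct SeqEntry {
    struct SeqEntry *next;            /* +0  next entry in the list        */
    uint8_t          sub_kind;        /* +8  sub-classification            */
    uint8_t          skip_flag;       /* +9  nonzero: already processed    */
    uint8_t          pad_10[6];       /* +10 not described in the table    */
    uint8_t          kind;            /* +16 entry kind                    */
    uint8_t          pad_17[7];       /* +17 not described in the table    */
    void            *entity;          /* +24 EDG entity node               */
    uint64_t         source_position; /* +32 file/line encoding            */
    uint8_t          pad_40[8];       /* +40 not described in the table    */
    const char      *pragma_text;     /* +48 raw pragma string             */
    uint8_t          stdc_kind;       /* +56 STDC pragma kind              */
    uint8_t          stdc_value;      /* +57 STDC pragma value             */
    uint8_t          pad_58[6];       /* +58 rest of the 8-byte slot       */
} SeqEntry;
```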

Dual-Cursor Iteration

The loop uses two cursors -- qword_1065748 (primary) and qword_1065740 (alternate) -- to handle pragma interleavings. When the primary cursor encounters a kind-53 entry (a continuation marker), it records that entry in the alternate cursor and follows the continuation chain through the entry's entity pointer. This mechanism handles the case where pragmas are interleaved between parts of a single declaration:

for (i = qword_1065748; i != NULL; ) {
    if (entry_kind(i) == 53) {          // continuation marker
        // save as alternate, follow continuation chain
        alt_cursor = i;
        i = *(i->entity + 8);          // follow entity's next pointer
        continue;
    }
    if (entry_kind(i) == 57) {          // pragma interleave
        entity = i->entity;
        // skip any continuation markers that follow the pragma
        for (i = i->next; i && entry_kind(i) == 53; ) {
            alt_cursor = i;
            i = *(i->entity + 8);
        }
        // handle the pragma inline (see below)
        ...
    } else {
        // non-pragma entity: dispatch to gen_template
        sub_47ECC0(0);
        i = qword_1065748;             // resume from the cursor updated by gen_template
    }
}

When the primary cursor is exhausted and an alternate cursor exists, the primary takes the alternate's next pointer and continues. This ensures correct ordering when pragmas split a declaration sequence.
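
The traversal order this produces can be demonstrated with a small stand-alone model. Everything here is illustrative: the Node type, its chain field (standing in for the pointer reached through the entity at +8), and the walk function are hypothetical, and the model keeps a single alternate cursor rather than a stack, as the description above suggests.

```c
#include <stddef.h>

/* Simplified list node: kind 53 marks a continuation whose nested
   entries hang off chain; every other kind is an ordinary entry. */
typedef struct Node {
    struct Node *next;
    int          kind;
    struct Node *chain;  /* only meaningful when kind == 53 */
    int          id;     /* payload recorded by the walk    */
} Node;

/* Visit entries in emission order and store their ids in out.
   Returns the number of ids written (at most max). */
static int walk(Node *head, int *out, int max) {
    Node *i = head;
    Node *alt = NULL;
    int n = 0;
    for (;;) {
        while (i != NULL) {
            if (i->kind == 53) {     /* remember the marker, descend */
                alt = i;
                i = i->chain;
                continue;
            }
            if (n < max)
                out[n++] = i->id;    /* "emit" the entry */
            i = i->next;
        }
        if (alt == NULL)
            break;                   /* nothing left to resume */
        i = alt->next;               /* resume after the marker */
        alt = NULL;
    }
    return n;
}
```

For a list A, marker(B, C), D the walk yields A, B, C, D: the nested entries are emitted in place, and the outer list resumes after the marker.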

Full Main Loop Pseudocode

The following pseudocode is derived from the decompiled sub_489000 (lines 288-558) and shows the complete dispatch logic. The variable v12 tracks whether any non-pragma entity was emitted (used by the empty file guard in Phase 5). The variable v14 saves/restores byte_10657FB across pragma handling.

// Initialize source sequence cursor from file scope node
qword_1065748 = *(xmmword_126EB60_high + 256);  // source sequence list head
byte_10656F0 = (dword_126EFB4 != 2) + 2;        // linkage: 3=C++, 2=C
sub_466E60(...);                                  // init output state
v12 = 0;                                          // no entities emitted yet

while (1) {
    v14 = byte_10657FB;                           // save pragma-in-progress flag

    i = qword_1065748;                            // primary cursor
    alt = qword_1065740;                          // alternate cursor
    modified_primary = false;
    modified_alt = false;

    while (i != NULL) {
        kind = *(byte*)(i + 16);

        if (kind == 57) {
            // --- Pragma interleave ---
            entity = *(qword*)(i + 24);
            // Walk past continuation markers (kind 53)
            for (i = *(qword*)i; i != NULL; ) {
                if (*(byte*)(i + 16) != 53) break;
                alt = i;
                modified_alt = true;
                i = *(qword*)(*(qword*)(i + 24) + 8);  // follow entity next
            }
            if (i == NULL && alt != NULL) {
                i = *(qword*)alt;
                alt = NULL;
                modified_alt = true;
            }
            modified_primary = true;

            if (*(byte*)(entity + 9))              // skip_flag set?
                continue;                          // already processed

            // Commit cursor state
            qword_1065748 = i;
            if (modified_alt) qword_1065740 = alt;
            byte_10657FB = 1;                      // mark pragma context

            // Set source position from pragma entity
            dword_1065818 = 1;                     // needs line directive
            qword_1065810 = *(qword*)(entity + 32);
            qword_126EDE8 = *(qword*)(entity + 32);

            sub_kind = *(byte*)(entity + 8);
            switch (sub_kind) {
                case 26:  // STDC pragma
                    emit_line_start("#pragma ");
                    emit_raw("STDC ");
                    switch (*(byte*)(entity + 56)) {
                        case 1: emit_raw("FP_CONTRACT ");    break;
                        case 2: emit_raw("FENV_ACCESS ");    break;
                        case 3: emit_raw("CX_LIMITED_RANGE "); break;
                        default: assertion("gen_stdc_pragma: bad kind");
                    }
                    switch (*(byte*)(entity + 57)) {
                        case 1: emit_raw("OFF");     break;
                        case 2: emit_raw("ON");      break;
                        case 3: emit_raw("DEFAULT"); break;
                        default: assertion("gen_stdc_pragma: bad value");
                    }
                    emit_newline();
                    break;

                case 21:  // Line directive pragma
                    emit_line_start("#line ");
                    byte_10657F9 = 1;
                    sub_5FCAF0(*(qword*)(entity + 56), 0, &xmmword_1065760);
                    byte_10657F9 = 0;
                    emit_newline();
                    break;

                default:  // Generic pragma (including sub_kind 19)
                    if (!*(qword*)(entity + 48))
                        assertion("gen_pragma: NULL pragma_text");
                    emit_line_start("#pragma ");
                    emit_raw(*(char**)(entity + 48));
                    emit_newline();
                    if (sub_kind == 19)
                        dword_10656F8 = *(int*)(entity + 56);  // track #pragma pack
                    break;
            }
            byte_10657FB = v14;                    // restore saved flag
            continue;                              // next iteration
        }

        // --- Non-pragma entity ---
        if (modified_primary) qword_1065748 = i;
        if (modified_alt)     qword_1065740 = alt;

        if (kind == 53) {
            // Continuation marker: record it as the alternate cursor, then follow its chain
            alt = i;
            modified_alt = true;
            i = *(qword*)(*(qword*)(i + 24) + 8);
            continue;
        }

        if (kind == 52)  // end_of_construct: should never appear at top level
            sub_4F2930("cp_gen_be.c", 26628,
                       "process_file_scope_entities",
                       "Top-level end-of-construct entry", 0);

        v12 = 1;                                   // mark: entity emitted
        sub_47ECC0(0);                             // gen_template(recursion_level=0)
        // Loop continues from updated qword_1065748
    }

    // Exhausted primary cursor; check for pending alternate
    if (i == NULL && alt != NULL) {
        i = *(qword*)alt;
        alt = NULL;
        // ... continue outer loop
    } else {
        break;  // done
    }
}

// Final cursor cleanup
if (modified_primary) qword_1065748 = 0;
if (modified_alt)     qword_1065740 = alt;

Entity Kind Dispatch

For non-pragma entries (kind != 57), the loop calls sub_47ECC0(0) (gen_template with recursion level 0), which reads the current entity from qword_1065748 and dispatches based on the entity's kind:

| Kind | Name | Handler |
|---|---|---|
| 2 | variable_decl | sub_484A40 (gen_variable_decl) or inline |
| 6 | type_decl | sub_4864F0 (gen_type_decl) |
| 7 | parameter_decl | sub_484A40 |
| 8 | field_decl | Inline field handler |
| 11 | routine_decl | sub_47BFD0 (gen_routine_decl, 1831 lines) |
| 28 | namespace | Inline namespace handler (recursive sub_47ECC0(0)) |
| 29 | using_decl | Inline using-declaration handler |
| 42 | asm_decl | __asm(...) generation |
| 51 | indirect | Unwrap and re-dispatch |
| 52 | end_of_construct | Assertion (kind 52 triggers sub_4F2930 diagnostic) |
| 54 | instantiation | Template instantiation directive |
| 58 | template | Template definition |
| 66 | alias_decl | Alias declaration (using X = Y) |
| 67 | concept_decl | Concept handling |
| 83 | deduction_guide | Deduction guide |

Inline Pragma Handling

Kind 57 entries are pragma interleavings that appear between declarations. The backend handles three sub-kinds inline within sub_489000:

Sub-kind 26: STDC Pragma

Emits #pragma STDC <kind> <value>:

// Read pragma kind from offset +56
switch (stdc_kind) {
    case 1:  emit("FP_CONTRACT ");    break;
    case 2:  emit("FENV_ACCESS ");    break;
    case 3:  emit("CX_LIMITED_RANGE "); break;
    default: assertion_failure("gen_stdc_pragma: bad kind");
}
// Read pragma value from offset +57
switch (stdc_value) {
    case 1:  emit("OFF");     break;
    case 2:  emit("ON");      break;
    case 3:  emit("DEFAULT"); break;
    default: assertion_failure("gen_stdc_pragma: bad value");
}

The #pragma keyword is emitted character-by-character from a hardcoded string at address 0x838441 ("#pragma "), followed by "STDC " from address 0x83847B.
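
The two lookups can be condensed into a table-driven formatter. This is a sketch of the mapping only: the function name and the error-code return are hypothetical (the binary raises an assertion instead), and the real code writes through the output primitives rather than into a caller buffer.

```c
#include <stdio.h>

/* Format "#pragma STDC <kind> <value>" from the two one-byte codes.
   Returns 0 on success, -1 for an unknown code or a too-small buffer. */
static int format_stdc_pragma(unsigned kind, unsigned value,
                              char *out, size_t out_size) {
    static const char *const kinds[] = {
        NULL, "FP_CONTRACT", "FENV_ACCESS", "CX_LIMITED_RANGE"
    };
    static const char *const values[] = { NULL, "OFF", "ON", "DEFAULT" };

    if (kind < 1 || kind > 3 || value < 1 || value > 3)
        return -1;  /* corresponds to the "bad kind" / "bad value" assertions */

    int n = snprintf(out, out_size, "#pragma STDC %s %s",
                     kinds[kind], values[value]);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```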

Sub-kind 21: Raw Pragma (Line Directive)

Calls sub_5FCAF0 to emit a preprocessor line directive using the pragma's data. The byte_10657F9 flag is set to 1 during emission and reset to 0 afterward, temporarily changing the line-directive emission format.

Sub-kind 19 (or other): Generic Pragma

For all other pragma sub-kinds, the backend reads the raw pragma text from offset +48 and emits it character by character after a #pragma prefix:

if (!entity->pragma_text)
    assertion_failure("gen_pragma: NULL pragma_text");
emit("#pragma ");
emit_raw_string(entity->pragma_text);
emit_newline();

For sub-kind 19 specifically, the function also records the pragma data in dword_10656F8, tracking #pragma pack state.

Linkage Specification

The variable byte_10656F0 tracks the current linkage specification:

| Value | Meaning |
|---|---|
| 2 | extern "C" linkage |
| 3 | extern "C++" linkage |

Set at initialization: byte_10656F0 = (dword_126EFB4 != 2) + 2. Taken literally, the comparison yields 0 when dword_126EFB4 == 2, so the expression evaluates to 2 in that mode and to 3 in every other mode. This controls how the backend wraps declarations that need explicit linkage changes.

Phase 5: Empty File Guard

After the main loop completes, the function checks whether any entities were actually emitted:

if (!v12 && dword_126EFB4 != 2) {
    sub_467E50("int __dummy_to_avoid_empty_file;");
    sub_467D60();  // newline
}

The variable v12 tracks whether sub_47ECC0 was called at least once (set to 1 when any non-pragma entity is processed). If no entities were processed AND the mode is not CUDA (dword_126EFB4 != 2), a dummy variable declaration is emitted to prevent the host compiler from rejecting an empty translation unit. In CUDA mode, the file always has content due to the managed runtime boilerplate.

Phase 6: File Trailer

After all entities and the empty-file guard, the function emits a structured trailer. Immediately before it, the call to sub_466C10 (listed under Phase 5 in the execution flow) performs scope stack unwinding: it pops any remaining scope entries, restoring entity attributes that were temporarily modified during code generation (specifically, bits in bytes +82 and +134 of entity nodes).

#line Reset

Two #line 1 "<original_file>" directives bracket the trailer, resetting the host compiler's notion of the current source location back to the original .cu file:

sub_46BC80("#");
if (!dword_126E1F8)      // not GNU mode: use long form
    sub_467E50("line");
sub_467E50(" 1 \"");
filename = sub_5AF450(qword_106BF88);   // get original filename
sub_467E50(filename);
sub_468150(34);           // closing quote '"'

_NV_ANON_NAMESPACE Macro

The anonymous namespace support macro is emitted:

#define _NV_ANON_NAMESPACE <unique_id>

The unique identifier is generated by sub_6BC7E0 (get_anonymous_namespace_name), which returns "_GLOBAL__N_<filename>" -- a mangled name that ensures anonymous namespace entities from different translation units do not collide in the final linked binary.

This is followed by a guard block:

#ifdef _NV_ANON_NAMESPACE
#endif

The #ifdef/#endif block appears to be a deliberate no-op that downstream tools (nvcc's driver) can detect to confirm the file was processed by cudafe++.

MSVC Pack Reset

In MSVC host compiler mode (dword_126E1D8), a #pragma pack() is emitted to reset the packing alignment to the compiler default:

if (dword_126E1D8) {
    sub_46BC80("#pragma pack()");
    sub_467D60();
}

Source Re-inclusion

The original source file is re-included via #include:

#include "<original_file>"

This is the mechanism by which the host compiler sees the original source code: the .int.c file first declares all the generated stubs and boilerplate, then #includes the original file. The EDG frontend has already parsed the original file and knows which declarations are host-visible; the re-inclusion lets the host compiler process them with the stubs already in scope.

A final #line 1 directive follows, and then:

#undef _NV_ANON_NAMESPACE

This cleans up the macro so it does not leak into subsequent compilation units.

Phase 7: Host Reference Arrays

The final emission step generates CUDA host reference arrays via sub_6BCF80 (nv_emit_host_reference_array). These arrays are placed in special ELF sections that the CUDA runtime linker uses to discover device symbols at launch time.

The function is called 6 times with different flag combinations:

// Signature: nv_emit_host_reference_array(emit_fn, is_kernel, is_device, is_internal)

sub_6BCF80(sub_467E50, 1, 0, 1);  // kernel,   internal  -> .nvHRKI
sub_6BCF80(sub_467E50, 1, 0, 0);  // kernel,   external  -> .nvHRKE
sub_6BCF80(sub_467E50, 0, 1, 1);  // device,   internal  -> .nvHRDI
sub_6BCF80(sub_467E50, 0, 1, 0);  // device,   external  -> .nvHRDE
sub_6BCF80(sub_467E50, 0, 0, 1);  // constant, internal  -> .nvHRCI
sub_6BCF80(sub_467E50, 0, 0, 0);  // constant, external  -> .nvHRCE

| Section | Array Name | Symbol Type | Linkage |
|---|---|---|---|
| .nvHRKI | hostRefKernelArrayInternalLinkage | __global__ kernel | Internal (anonymous namespace) |
| .nvHRKE | hostRefKernelArrayExternalLinkage | __global__ kernel | External |
| .nvHRDI | hostRefDeviceArrayInternalLinkage | __device__ variable | Internal |
| .nvHRDE | hostRefDeviceArrayExternalLinkage | __device__ variable | External |
| .nvHRCI | hostRefConstantArrayInternalLinkage | __constant__ variable | Internal |
| .nvHRCE | hostRefConstantArrayExternalLinkage | __constant__ variable | External |
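
The section and array names follow a regular pattern, so the mapping from the three flags can be written compactly. The helpers below are a hypothetical sketch of that naming rule, using the same flag order as the call signature above; they are not the decompiled sub_6BCF80.

```c
#include <stdio.h>

/* Build the section name: ".nvHR" + K/D/C (symbol type) + I/E (linkage).
   Returns 0 on success, -1 if out is too small. */
static int host_ref_section(int is_kernel, int is_device, int is_internal,
                            char *out, size_t out_size) {
    char sym  = is_kernel ? 'K' : (is_device ? 'D' : 'C');
    char link = is_internal ? 'I' : 'E';
    int n = snprintf(out, out_size, ".nvHR%c%c", sym, link);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Build the matching array name, e.g. hostRefKernelArrayExternalLinkage. */
static int host_ref_array_name(int is_kernel, int is_device, int is_internal,
                               char *out, size_t out_size) {
    const char *sym  = is_kernel ? "Kernel" : (is_device ? "Device" : "Constant");
    const char *link = is_internal ? "Internal" : "External";
    int n = snprintf(out, out_size, "hostRef%sArray%sLinkage", sym, link);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```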

Each array entry encodes a device symbol's mangled name as a byte array:

extern "C" {
    extern __attribute__((section(".nvHRKE")))
           __attribute__((weak))
    const unsigned char hostRefKernelArrayExternalLinkage[] = {
        0x5f, 0x5a, ... /* mangled name bytes */ 0x00
    };
}

The 6 global lists from which these symbols are collected reside at:

| Address | Contents |
|---|---|
| unk_1286780 | Device-external symbols |
| unk_12867C0 | Device-internal symbols |
| unk_1286800 | Constant-external symbols |
| unk_1286840 | Constant-internal symbols |
| unk_1286880 | Kernel-external symbols |
| unk_12868C0 | Kernel-internal symbols |

This phase is conditional: it only executes when dword_106BFD0 (CUDA device registration) or dword_106BFCC (CUDA constant registration) is nonzero.

Module ID Output

Before the host reference arrays, if dword_106BFB8 is set, sub_5B0180 (write_module_id_to_file) writes the CRC32-based module identifier to a separate file. This ID is used by the CUDA runtime to match device code fatbinaries with their host-side registration code.

Breakpoint Placeholders (Between Phase 5 and Phase 6)

After the empty file guard and scope unwinding (sub_466C10) but before the file trailer, if the breakpoint placeholder list (qword_1065840) is non-empty, the backend emits debug breakpoint functions:

static __attribute__((used)) void __nv_breakpoint_placeholder<N>_<name>(void) {
    exit(0);
}

The placeholder list is a linked list where each node contains:

| Offset | Field |
|---|---|
| +0 | next pointer |
| +8 | Source position (start) |
| +16 | Source position (end) |
| +24 | Name string (or NULL) |

Each placeholder is numbered sequentially, starting from 0. The __attribute__((used)) attribute tells the host compiler to emit the function even though nothing references it, and the exit(0) body gives the function a concrete implementation that a debugger can set a breakpoint on. The underscore after the number separates the sequence index from the name.
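The emission loop can be sketched as follows. This is a minimal illustration, not the decompiled routine: the names bp_node and emit_breakpoint_placeholders are hypothetical, the source-position fields are treated as opaque 8-byte values, and emitting an empty suffix for a NULL name is an assumption that the binary analysis does not confirm.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical mirror of the node layout in the table above (x86-64). */
typedef struct bp_node {
    struct bp_node *next;      /* +0  next pointer                 */
    uint64_t        pos_start; /* +8  opaque source position       */
    uint64_t        pos_end;   /* +16 opaque source position       */
    const char     *name;      /* +24 name string, may be NULL     */
} bp_node;

/* Writes one placeholder function per node, numbered from 0.
 * Returns the number of placeholders written. */
static int emit_breakpoint_placeholders(FILE *out, const bp_node *head)
{
    int index = 0;
    for (const bp_node *p = head; p != NULL; p = p->next, ++index) {
        /* Assumption: a NULL name yields an empty suffix. */
        fprintf(out,
                "static __attribute__((used)) void "
                "__nv_breakpoint_placeholder%d_%s(void) {\n"
                "    exit(0);\n"
                "}\n",
                index, p->name ? p->name : "");
    }
    return index;
}
```

Passing the head of a two-node list writes __nv_breakpoint_placeholder0_<name> followed by __nv_breakpoint_placeholder1_<name> to the output stream.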

Complete .int.c File Structure

Putting all phases together, the output .int.c file has this structure:

#line 1 "<input>.cu"                          // initial line directive
#pragma GCC diagnostic ignored "-Wunused-local-typedefs"
#pragma GCC diagnostic ignored "-Wattributes"
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-function"
// ... additional suppressions for Clang

// --- managed runtime boilerplate ---
static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) { ... }
static char __nv_init_managed_rt_with_module(void **);
static inline void __nv_init_managed_rt(void) { ... }

#pragma GCC diagnostic pop
#pragma GCC diagnostic ignored "-Wunused-variable"

// --- lambda detection macros (when not in extended lambda mode) ---
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
#if defined(...) && defined(...) && defined(...)
#endif

// --- main entity output ---
// [user declarations, type definitions, function stubs, etc.]
// [device-only code wrapped in #if 0 / #endif]
// [__global__ kernels -> __wrapper__device_stub_ forwarding]
// [pragmas interleaved at original positions]

// --- empty file guard (non-CUDA mode only) ---
int __dummy_to_avoid_empty_file;

// --- breakpoint placeholders (if any) ---
static __attribute__((used)) void __nv_breakpoint_placeholder0_name(void) { exit(0); }

// --- file trailer ---
#line 1 "<input>.cu"
#define _NV_ANON_NAMESPACE _GLOBAL__N_<input>
#ifdef _NV_ANON_NAMESPACE
#endif
#pragma pack()                                // MSVC only
#line 1 "<input>.cu"
#include "<input>.cu"                         // re-include original source
#line 1 "<input>.cu"
#undef _NV_ANON_NAMESPACE

// --- host reference arrays (if CUDA registration active) ---
extern "C" { extern __attribute__((section(".nvHRKI"))) ... }
extern "C" { extern __attribute__((section(".nvHRKE"))) ... }
extern "C" { extern __attribute__((section(".nvHRDI"))) ... }
extern "C" { extern __attribute__((section(".nvHRDE"))) ... }
extern "C" { extern __attribute__((section(".nvHRCI"))) ... }
extern "C" { extern __attribute__((section(".nvHRCE"))) ... }

Key Global Variables

| Variable | Address | Type | Role |
|---|---|---|---|
| stream | output state | FILE* | Output file handle |
| dword_1065834 | 0x1065834 | int | Indent/nesting level |
| dword_1065820 | 0x1065820 | int | Output line counter |
| dword_106581C | 0x106581C | int | Output column counter |
| dword_1065818 | 0x1065818 | int | Needs-line-directive flag |
| qword_1065810 | 0x1065810 | qword | Current source position |
| qword_1065828 | 0x1065828 | qword | Current source file index |
| qword_1065748 | 0x1065748 | qword | Source sequence cursor (primary) |
| qword_1065740 | 0x1065740 | qword | Source sequence cursor (alternate) |
| dword_1065850 | 0x1065850 | int | Device stub mode toggle |
| byte_10656F0 | 0x10656F0 | byte | Current linkage spec (2=C, 3=C++) |
| dword_10656F8 | 0x10656F8 | int | Current #pragma pack state |
| qword_1065708 | 0x1065708 | qword | Scope stack head |
| qword_1065700 | 0x1065700 | qword | Scope pool head |
| qword_1065720 | 0x1065720 | qword | Scope free list |
| dword_106BF38 | 0x106BF38 | int | Extended lambda mode |
| dword_106BFB8 | 0x106BFB8 | int | Emit module ID flag |
| dword_106BFD0 | 0x106BFD0 | int | CUDA device registration flag |
| dword_106BFCC | 0x106BFCC | int | CUDA constant registration flag |
| dword_106BF6C | 0x106BF6C | int | Alternative host compiler mode |
| dword_126EFB4 | 0x126EFB4 | int | Compiler mode (2 = CUDA) |
| dword_126E1D8 | 0x126E1D8 | int | MSVC host compiler flag |
| dword_126E1F8 | 0x126E1F8 | int | GNU/GCC host compiler flag |
| dword_126E1E8 | 0x126E1E8 | int | Clang host compiler flag |
| qword_126E1F0 | 0x126E1F0 | qword | GCC/Clang version number |
| dword_126EF68 | 0x126EF68 | int | C++ standard version (__cplusplus) |

Timing & Exit

The timing and exit subsystem lives in host_envir.c and handles three responsibilities: measuring CPU and wall-clock time for compilation phases, formatting the compilation summary (error/warning counts), and mapping internal status codes to process exit codes. All functions write to qword_126EDF0 (the diagnostic output stream, initialized to stderr in main()).

Key Facts

| Property | Value |
|---|---|
| Source file | host_envir.c (EDG 6.6) |
| Timing functions | sub_5AF350 (capture_time), sub_5AF390 (report_timing) |
| Exit function | sub_5AF1D0 (exit_with_status), 145 bytes, __noreturn |
| Signoff function | sub_5AEE00 (write_signoff), sub_589530 (write_signoff + free_mem_blocks) |
| Timing enable flag | dword_106C0A4 at 0x106C0A4, set by CLI flag --timing (case 20) |
| Diagnostic stream | qword_126EDF0 at 0x126EDF0 (stderr) |
| SARIF mode flag | dword_106BBB8 at 0x106BBB8 |

Timing Infrastructure

capture_time -- sub_5AF350 (0x5AF350)

A 48-byte function that samples both CPU time and wall-clock time into a 16-byte timestamp structure.

// Annotated decompilation
void capture_time(timestamp_t *out)    // sub_5AF350
{
    out->cpu_ms  = (int)((double)(int)clock() * 1000.0 / 1e6);  // [0]: CPU milliseconds
    out->wall_s  = time(NULL);                                    // [1]: wall-clock seconds
}

Timestamp structure layout (16 bytes, two 64-bit fields):

| Offset | Size | Type | Content |
|---|---|---|---|
| +0 | 8 | int64_t | CPU time in milliseconds: clock() * 1000 / CLOCKS_PER_SEC |
| +8 | 8 | time_t | Wall-clock time via time(0) (epoch seconds) |

The CPU time computation clock() * 1000.0 / 1000000.0 normalizes the clock() return value (microseconds on Linux where CLOCKS_PER_SEC = 1000000) to milliseconds, then truncates to integer. This means CPU time resolution is 1 ms.
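A portable restatement of the same sampling logic, as a sketch: it divides by CLOCKS_PER_SEC instead of the literal 1e6 that appears in the decompilation, and stores the result in a 64-bit field to match the layout table. The type and function names follow the annotations above and are not symbols recovered from the binary.

```c
#include <stdint.h>
#include <time.h>

typedef struct {
    int64_t cpu_ms;  /* +0: CPU time in milliseconds  */
    time_t  wall_s;  /* +8: wall-clock epoch seconds  */
} timestamp_t;

/* Samples CPU time (truncated to whole milliseconds) and wall-clock time. */
static void capture_time(timestamp_t *out)
{
    out->cpu_ms = (int64_t)((double)clock() * 1000.0 / (double)CLOCKS_PER_SEC);
    out->wall_s = time(NULL);
}
```

Two consecutive samples never go backwards in either field, but the wall-clock field only advances in whole seconds, so short phases can report an elapsed time of 0.00.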

report_timing -- sub_5AF390 (0x5AF390)

Computes deltas between two timestamps and prints a formatted timing line.

// Annotated decompilation
void report_timing(const char *label,       // sub_5AF390
                   timestamp_t *start,
                   timestamp_t *end)
{
    double elapsed = difftime(end->wall_s, start->wall_s);    // wall seconds
    double cpu_sec = (double)(end->cpu_ms - start->cpu_ms) / 1000.0;  // CPU seconds

    fprintf(qword_126EDF0,
            "%-30s %10.2f (CPU) %10.2f (elapsed)\n",
            label, cpu_sec, elapsed);
}

The decompiled code contains explicit unsigned-to-double conversion handling for 64-bit values (the v6 & 1 | (v6 >> 1) pattern followed by doubling). This is the compiler's standard idiom for converting unsigned 64-bit integers to double on x86-64 when the value might exceed INT64_MAX. In practice, clock() millisecond values fit comfortably in signed 64-bit range, so this path is never taken.
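For readers unfamiliar with the pattern, the idiom can be written out in C as below. This is a generic sketch of the well-known conversion sequence, not code lifted from the binary; the function name is hypothetical. It relies on the usual two's-complement behavior when casting a large uint64_t to int64_t, which GCC and Clang provide.

```c
#include <stdint.h>

/* Converts an unsigned 64-bit value to double using only a signed
 * int64 -> double conversion, as x86-64 code generators do. */
static double u64_to_double(uint64_t v)
{
    if ((int64_t)v >= 0)               /* top bit clear: convert directly */
        return (double)(int64_t)v;

    /* Top bit set: halve the value, folding the lost low bit back in
     * as a sticky bit so rounding stays correct, then double it. */
    uint64_t half = (v >> 1) | (v & 1);
    double d = (double)(int64_t)half;
    return d + d;
}
```

The second branch corresponds to the v6 & 1 | (v6 >> 1) expression followed by the doubling.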

Output format: "%-30s %10.2f (CPU) %10.2f (elapsed)\n"

Front end time                     12.34 (CPU)     15.67 (elapsed)
Back end time                       3.45 (CPU)      4.56 (elapsed)
Total compilation time             15.79 (CPU)     20.23 (elapsed)

The label is left-justified in a 30-character field. CPU and elapsed times are right-justified in 10-character fields with 2 decimal places.
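The field widths are easy to check with a small formatter that writes into a buffer instead of the diagnostic stream. The helper name format_timing is hypothetical; the format string and the delta arithmetic are taken from the annotated decompilation above.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef struct {
    int64_t cpu_ms;  /* CPU time in milliseconds */
    time_t  wall_s;  /* wall-clock epoch seconds */
} timestamp_t;

/* Formats one timing line; returns the snprintf result. */
static int format_timing(char *buf, size_t cap, const char *label,
                         const timestamp_t *start, const timestamp_t *end)
{
    double elapsed = difftime(end->wall_s, start->wall_s);
    double cpu_sec = (double)(end->cpu_ms - start->cpu_ms) / 1000.0;
    return snprintf(buf, cap, "%-30s %10.2f (CPU) %10.2f (elapsed)\n",
                    label, cpu_sec, elapsed);
}
```

For any label of at most 30 characters and times below ten million seconds, every line has the same length: 30 + 1 + 10 + 7 + 10 + 11 = 69 characters including the newline.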

Timing Flag Activation

The timing flag dword_106C0A4 is registered in the CLI flag table as flag ID 20:

// In sub_452010 (register_internal_flags)
sub_451F80(20, "timing", 35, 0, 0, 1);
//         ^id  ^name    ^case ^  ^ ^undocumented

When --timing is passed on the command line, the CLI parser (sub_459630) hits case 20 in its switch statement, which sets dword_106C0A4 = 1. The flag defaults to 0 (disabled), set explicitly in sub_45EB40 (cmd_line_pre_init).

Timing Brackets in main()

main() at 0x408950 allocates six 16-byte timestamp slots on its stack frame:

| Variable | Stack offset | Purpose |
|---|---|---|
| v7 | [rsp+0x00] | Total compilation start |
| v8 | [rsp+0x10] | Frontend start |
| v9 | [rsp+0x20] | Frontend end |
| v10 | [rsp+0x30] | Backend start |
| v11 | [rsp+0x40] | Backend end |
| v12 | [rsp+0x50] | Total compilation end |

Three timing regions are measured:

Region 1: Frontend

Captured after sub_585DB0 (fe_one_time_init) and reported after sub_588F90 (fe_wrapup). Covers stages 3-6 of the pipeline: heavy initialization, TU state reset, source parsing + IL build, and the 5-pass wrapup.

if (dword_106C0A4)
    capture_time(&t_fe_start);      // v8

reset_tu_state();                   // sub_7A4860
process_translation_unit(filename); // sub_7A40A0
fe_wrapup(filename, 1);            // sub_588F90

if (dword_106C0A4) {
    capture_time(&t_fe_end);        // v9
    report_timing("Front end time", &t_fe_start, &t_fe_end);
}

Region 2: Backend

Captured around sub_489000 (process_file_scope_entities). Only executed when dword_106C254 == 0 (no frontend errors).

if (!dword_106C254) {
    if (dword_106C0A4)
        capture_time(&t_be_start);  // v10

    process_file_scope_entities();  // sub_489000

    if (dword_106C0A4) {
        capture_time(&t_be_end);    // v11
        report_timing("Back end time", &t_be_start, &t_be_end);
    }
}

Region 3: Total

Starts before CLI parsing (sub_459630) and ends just before exit. Always uses v7 (captured once at the very beginning) as the start timestamp.

capture_time(&t_total_start);       // v7 — captured before CLI parsing

// ... entire compilation ...

if (dword_106C0A4) {
    capture_time(&t_total_end);     // v12
    report_timing("Total compilation time", &t_total_start, &t_total_end);
}

Note that the "Total compilation time" region begins before command-line parsing and includes the CLI parsing overhead, all initialization, frontend, backend, and signoff. The "Front end time" region does NOT include CLI parsing or pre-init -- it starts after fe_one_time_init.

Compilation Summary -- write_signoff

sub_5AEE00 (0x5AEE00) -- write_signoff

This 490-byte function writes the compilation summary trailer to the diagnostic stream. It has two completely separate code paths: SARIF mode and text mode.

SARIF Mode (dword_106BBB8 == 1)

Closes the SARIF JSON document started by sub_5AEDB0 (write_init):

fwrite("]}]}\n", 1, 5, qword_126EDF0);

This closes the results array, the run object, the runs array, and the top-level SARIF document. If dword_106BBB8 is set but not equal to 1, the function hits an assertion: write_signoff at host_envir.c:2203.

Text Mode (dword_106BBB8 == 0)

The text-mode path assembles a human-readable summary from four counters:

| Global | Address | Meaning |
|---|---|---|
| qword_126ED90 | 0x126ED90 | Error count |
| qword_126ED98 | 0x126ED98 | Warning count |
| qword_126EDB0 | 0x126EDB0 | Suppressed error count |
| qword_126EDB8 | 0x126EDB8 | Suppressed warning count |

The function uses EDG's message catalog (sub_4F2D60) for all translatable strings:

| Message ID | Purpose | Likely content |
|---|---|---|
| 1742 | Error (singular) | "error" |
| 1743 | Errors (plural) | "errors" |
| 1744 | Warning (singular) | "warning" |
| 1745 | Warning (plural) | "warnings" |
| 1746 | Conjunction | "and" |
| 1747 | Source file indicator (format) | "in compilation of \"%s\"" |
| 1748 | Generated indicator | "generated" |
| 3234 | Suppressed intro | "of which" |
| 3235 | Suppressed verb | "were suppressed" / "was suppressed" |

Output assembly logic (simplified pseudocode):

void write_text_signoff(void)     // text-mode path of sub_5AEE00
{
    int64_t errors   = qword_126ED90;
    int64_t warnings = qword_126ED98;
    int64_t total    = errors + warnings;

    // Debug: module declaration count (only if dword_126EFC8 + "module_report")
    if (dword_126EFC8 && is_debug_enabled("module_report") && qword_106B9C8)
        fprintf(stream, "%lu modules declarations processed (%lu failed).\n",
                qword_106B9C8, qword_106B9C0);

    if (total == 0)
        return;                   // nothing to report

    int64_t suppressed_warn = qword_126EDB8;
    int64_t suppressed_total = suppressed_warn + qword_126EDB0;
    int64_t displayed = total - suppressed_total;

    // --- Print displayed counts ---
    if (displayed != 0) {         // there ARE unsuppressed diagnostics
        if (errors)
            fprintf(stream, "%lu %s", errors, msg(errors != 1 ? 1743 : 1742));
        if (errors && warnings)
            fprintf(stream, " %s ", msg(1746));   // " and "
        if (warnings)
            fprintf(stream, "%lu %s", warnings, msg(warnings != 1 ? 1745 : 1744));
    }

    // --- Print suppressed counts ---
    if (suppressed_total > 0) {
        // Assertion: suppressed_warn must be 0 here. Only suppressed errors
        // are expected on this path; a nonzero suppressed-warning count
        // triggers the assert.
        if (suppressed_warn)
            assert(0);  // host_envir.c:2141, "write_text_signoff"

        if (displayed) {
            fprintf(stream, " (%s ", msg(3234));        // " (of which "
            fprintf(stream, "%lu %s %s",
                    suppressed_total,
                    msg(3235),                          // "was/were suppressed"
                    msg(suppressed_total == 1 ? 1742 : 1743));
            fputc(')', stream);                         // close paren
        } else {
            // All diagnostics were suppressed -- just print suppressed count
            fprintf(stream, "%lu %s %s",
                    suppressed_total,
                    msg(3235),
                    msg(suppressed_total == 1 ? 1742 : 1743));
        }
    }

    // --- Print source filename ---
    fputc(' ', stream);
    if (qword_126EEE0 && *qword_126EEE0 && strcmp(qword_126EEE0, "-") != 0) {
        char *display_name = qword_106C040 ? qword_106C040 : qword_126EEE0;
        char *basename = normalize_path(display_name) + 32;  // sub_5AC020 returns
                                                              // buffer, basename at +32
        fprintf(stream, msg(1747), basename);  // "in compilation of \"%s\""
    } else {
        fputs(msg(1748), stream);              // "generated" (stdout mode)
    }
    fputc('\n', stream);
}

Example output:

2 errors and 1 warning in compilation of "kernel.cu"
3 errors (of which 1 was suppressed error) in compilation of "main.cu"
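The assembly logic can be exercised in isolation with a buffer-based sketch. Everything here is illustrative: the English strings stand in for the catalog messages and use the "likely content" wording from the table, the choice between "was suppressed" and "were suppressed" by count is an assumption, and the helper names are hypothetical. The sketch assumes the suppressed count never exceeds the total and covers only the named-file branch.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Appends formatted text to buf, tracking the running length. */
static void appendf(char *buf, size_t cap, size_t *len, const char *fmt, ...)
{
    if (*len >= cap)
        return;                       /* buffer already full */
    va_list ap;
    va_start(ap, fmt);
    int w = vsnprintf(buf + *len, cap - *len, fmt, ap);
    va_end(ap);
    if (w > 0)
        *len += (size_t)w;
}

/* Builds the text-mode summary line; returns the number of characters
 * the full line needs (0 when there is nothing to report). */
static size_t format_signoff(char *buf, size_t cap,
                             unsigned long errors, unsigned long warnings,
                             unsigned long suppressed, const char *file)
{
    size_t len = 0;
    if (cap)
        buf[0] = '\0';
    unsigned long total = errors + warnings;
    if (total == 0)
        return 0;
    unsigned long displayed = total - suppressed;

    if (displayed != 0) {
        if (errors)
            appendf(buf, cap, &len, "%lu %s", errors,
                    errors != 1 ? "errors" : "error");
        if (errors && warnings)
            appendf(buf, cap, &len, " and ");
        if (warnings)
            appendf(buf, cap, &len, "%lu %s", warnings,
                    warnings != 1 ? "warnings" : "warning");
    }
    if (suppressed > 0) {
        /* Assumed wording for message 3235, selected by count. */
        const char *verb = suppressed == 1 ? "was suppressed" : "were suppressed";
        const char *noun = suppressed == 1 ? "error" : "errors";
        if (displayed)
            appendf(buf, cap, &len, " (of which %lu %s %s)", suppressed, verb, noun);
        else
            appendf(buf, cap, &len, "%lu %s %s", suppressed, verb, noun);
    }
    appendf(buf, cap, &len, " in compilation of \"%s\"\n", file);
    return len;
}
```

Calling format_signoff with (2, 1, 0, "kernel.cu") and (3, 0, 1, "main.cu") reproduces the two example lines above.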

sub_589530 (0x589530) -- write_signoff + free_mem_blocks

A thin wrapper (13 bytes) called from main()'s exit path. Performs two operations:

void fe_finish(void)          // sub_589530
{
    write_signoff();          // sub_5AEE00 — print summary
    free_mem_blocks();        // sub_6B8DE0 — release all frontend memory pools
}

sub_6B8DE0 (free_mem_blocks) is the master memory deallocation function from mem_manage.c (assertion at line 1438, function name "free_mem_blocks"). It operates in two modes depending on the global dword_1280728:

Pool allocator mode (dword_1280728 set): Walks three linked lists of allocated memory blocks:

  1. Current block at qword_1280720: freed first, looked up in the free-block hash chain at qword_1280748, then the block descriptor itself is freed.
  2. Hash table at qword_126EC88 with dword_126EC80 buckets: each bucket is a singly-linked list of block descriptors. Blocks with nonzero size (field [4]) are freed; blocks with zero size trigger the mem_manage.c:1438 assertion (invariant: a complete block must have a recorded size).
  3. Overflow list at qword_1280730: same walk-and-free logic.

Each block deallocation decrements qword_1280718 (total allocated bytes) and optionally updates qword_1280710 (low-water mark). At debug level > 4, each free prints: "free_complete_block: freeing block of size %lu\n".

Non-pool mode (dword_1280728 == 0): Iterates source file entries via sub_6B8B20(N) for each entry N from dword_126EC80 down to 0, then walks the permanent allocation array at qword_126EC58, calling sub_5B0500 for each (which wraps munmap or free).

Exit Handling

exit_with_status -- sub_5AF1D0 (0x5AF1D0)

A 145-byte __noreturn function that maps internal compilation status codes to POSIX exit codes. This is the only exit point for normal compilation flow -- every path through main() ends here.

// Full annotated decompilation
__noreturn void exit_with_status(uint8_t status)  // sub_5AF1D0
{
    // --- Text-mode messages (suppressed in SARIF mode) ---
    if (!dword_106BBB8) {           // not SARIF mode
        if (status == 9 || status == 10) {
            fwrite("Compilation terminated.\n", 1, 0x18, qword_126EDF0);
            exit(4);                // goto LABEL_8
        }
        if (status == 11) {
            fwrite("Compilation aborted.\n", 1, 0x15, qword_126EDF0);
            fflush(qword_126EDF0);
            abort();                // goto LABEL_10
        }
    }

    // --- Exit code mapping (both text and SARIF modes) ---
    switch (status) {
        case 3:
        case 4:
        case 5:  exit(0);          // success
        case 8:  exit(2);          // status main() passes when qword_126ED90 (error count) != 0
        case 9:
        case 10: exit(4);          // errors (SARIF mode reaches here)
        default: fflush(qword_126EDF0);
                 abort();          // internal error (11, or any unknown)
    }
}

Status-to-exit-code mapping:

| Internal Status | Meaning | Text Output | Exit Code | Termination |
|---|---|---|---|---|
| 3 | Clean success (no warnings, no additional status) | (none) | 0 | exit(0) |
| 4 | Success variant | (none) | 0 | exit(0) |
| 5 | Success with additional status (qword_126ED88 != 0) | (none) | 0 | exit(0) |
| 8 | Default in main(); kept when the error count qword_126ED90 != 0 | (none) | 2 | exit(2) |
| 9 | Errors | "Compilation terminated.\n" | 4 | exit(4) |
| 10 | Errors (variant) | "Compilation terminated.\n" | 4 | exit(4) |
| 11 | Internal error / fatal | "Compilation aborted.\n" | (n/a) | abort() |

In SARIF mode (dword_106BBB8 != 0), the text messages "Compilation terminated." and "Compilation aborted." are suppressed. The exit codes remain the same -- the function falls through to the switch which dispatches identically.

The default case handles status 11 and any unexpected status value by calling abort() after flushing the diagnostic stream. This generates a core dump for debugging.

Control flow note

The code structure looks unusual because the decompiler linearizes a two-phase dispatch. First, text-mode messages are emitted for statuses 9/10 and 11 (with early exit(4) or abort() respectively). If SARIF mode is active OR status is not 9/10/11, execution falls through to the switch statement. This means statuses 9/10 reach exit(4) via two different paths depending on SARIF mode, but the exit code is always 4.
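Separating the two phases into pure functions makes the dispatch easier to verify. This is a sketch for reasoning about the mapping, not a replacement for the decompilation: the function names and the STATUS_ABORTS sentinel are hypothetical, and nothing here actually terminates the process.

```c
#include <stdbool.h>
#include <stddef.h>

#define STATUS_ABORTS (-1)  /* hypothetical sentinel: the real code calls abort() */

/* Phase 2: exit code chosen for a status, identical in text and SARIF modes. */
static int exit_code_for_status(unsigned status)
{
    switch (status) {
    case 3: case 4: case 5: return 0;
    case 8:                 return 2;
    case 9: case 10:        return 4;
    default:                return STATUS_ABORTS;  /* 11 or any unknown value */
    }
}

/* Phase 1: message written to the diagnostic stream, or NULL for none. */
static const char *message_for_status(unsigned status, bool sarif_mode)
{
    if (sarif_mode)
        return NULL;                      /* messages suppressed in SARIF mode */
    if (status == 9 || status == 10)
        return "Compilation terminated.\n";
    if (status == 11)
        return "Compilation aborted.\n";
    return NULL;
}
```

The two functions are independent by construction, which mirrors the observation that SARIF mode changes only the message and never the exit code.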

Exit Code Determination in main()

The status value passed to sub_5AF1D0 is computed in main() from two global counters:

// From main() at 0x408950
uint8_t exit_code = 8;              // default (v6 = 8): kept when qword_126ED90 (error count) != 0

sub_6B8B20(0);                      // reset file state
sub_589530();                       // write_signoff + free_mem_blocks

if (!qword_126ED90)                 // no errors?
    exit_code = qword_126ED88 ? 5 : 3;   // success codes

// ... timing, stack restore ...
exit_with_status(exit_code);

Decision tree:

qword_126ED90 != 0  (errors present)
  └── exit_code = 8  →  exit(2)
      NOTE: main() leaves exit_code at 8 exactly when the error count
      is nonzero at this check, so this path maps to exit(2), not exit(4).
      Statuses 9/10 (exit(4)) are not assigned here; they reach
      exit_with_status through terminate_compilation, e.g. status 9
      from the SIGINT/SIGTERM handler.

qword_126ED90 == 0  (no errors)
  ├── qword_126ED88 != 0  →  exit_code = 5  →  exit(0)  (success w/ status)
  └── qword_126ED88 == 0  →  exit_code = 3  →  exit(0)  (clean success)

The variable qword_126ED88 at 0x126ED88 is initialized to 0 in sub_4ED530 (declaration_pre_init) and sub_4ED7C0. It appears to track whether any notable conditions occurred during compilation that are not errors or warnings -- possibly informational remarks or specific compiler actions taken. When nonzero, the exit code changes from 3 to 5, but both map to exit(0).
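The selection reduces to a two-input function. The sketch below restates the main() logic with the globals passed as parameters; the function and parameter names are hypothetical.

```c
#include <stdint.h>

/* error_count      plays the role of qword_126ED90
 * additional_state plays the role of qword_126ED88 */
static uint8_t status_from_counters(uint64_t error_count,
                                    uint64_t additional_state)
{
    uint8_t status = 8;                       /* default, kept when errors exist */
    if (error_count == 0)
        status = additional_state ? 5 : 3;    /* both map to exit(0) */
    return status;
}
```

Only three of the seven statuses handled by exit_with_status can come out of this computation: 3, 5, and 8.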

Stack Limit Restoration

Before calling exit_with_status, main() restores the process stack limit if it was raised during initialization:

if (stack_was_raised) {
    rlimits.rlim_cur = original_stack;   // restore saved soft limit
    setrlimit(RLIMIT_STACK, &rlimits);
}

The boolean stack_was_raised (stored in rbp, variable v4) is set during startup when dword_106C064 (the --modify_stack_limit flag, default ON) causes main() to raise RLIMIT_STACK from its soft limit to the hard limit. This restoration is a defensive measure -- it ensures any child processes spawned during cleanup (or signal handlers) inherit a normal stack size.
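A self-contained sketch of the raise-then-restore pattern using the POSIX getrlimit/setrlimit calls. The function and variable names are hypothetical, and skipping the raise when the soft limit already equals the hard limit is an assumption about the startup code rather than something shown in the decompilation.

```c
#include <stdbool.h>
#include <sys/resource.h>

static struct rlimit g_stack_limits;   /* last limits passed to setrlimit */
static rlim_t        g_original_soft;  /* soft limit saved at startup     */

/* Raises the stack soft limit to the hard limit.
 * Returns true only if the limit was actually changed. */
static bool raise_stack_limit(void)
{
    if (getrlimit(RLIMIT_STACK, &g_stack_limits) != 0)
        return false;
    g_original_soft = g_stack_limits.rlim_cur;
    if (g_stack_limits.rlim_cur == g_stack_limits.rlim_max)
        return false;                  /* assumption: nothing to raise */
    g_stack_limits.rlim_cur = g_stack_limits.rlim_max;
    return setrlimit(RLIMIT_STACK, &g_stack_limits) == 0;
}

/* Restores the saved soft limit if raise_stack_limit() changed it. */
static void restore_stack_limit(bool was_raised)
{
    if (was_raised) {
        g_stack_limits.rlim_cur = g_original_soft;
        setrlimit(RLIMIT_STACK, &g_stack_limits);
    }
}
```

Lowering a soft limit back to its earlier value never requires privileges, so the restore step can only fail if the limits structure itself is invalid.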

Signal-Driven Exit Paths

Three additional paths reach exit_with_status:

SIGINT / SIGTERM Handler -- handler (0x5AF2C0)

Registered in sub_5B1E70 (host_envir_early_init) for signals 2 (SIGINT) and 15 (SIGTERM). The registration is one-shot, guarded by dword_E6E120 (set to 0 after first call). SIGINT registration is conditional: the code first calls signal(SIGINT, SIG_IGN) and checks the return value. If the previous handler was already SIG_IGN (meaning the parent process -- typically nvcc -- has set the child to ignore interrupts), it stays ignored. Otherwise, the custom handler is installed. SIGTERM always gets the handler unconditionally.

__noreturn void handler(void)           // 0x5AF2C0
{
    fputc('\n', qword_126EDF0);         // newline to stderr
    terminate_compilation(9);           // sub_5AF2B0
}

terminate_compilation -- sub_5AF2B0 (0x5AF2B0)

Bridge function: writes signoff then exits.

__noreturn void terminate_compilation(uint8_t status)  // sub_5AF2B0
{
    write_signoff();                    // sub_5AEE00
    exit_with_status(status);           // sub_5AF1D0
}

When called from handler, status is 9 (errors), which produces "Compilation terminated.\n" followed by exit(4).

SIGXCPU Handler -- sub_5AF270 (0x5AF270)

Registered for signal 24 (SIGXCPU):

__noreturn void cpu_time_limit_handler(void)  // sub_5AF270
{
    fputc('\n', qword_126EDF0);
    fwrite("Internal error: CPU time limit exceeded.\n", 1, 0x29, qword_126EDF0);
    exit_with_status(11);               // sub_5AF1D0 → abort()
}

This handler fires if the process receives SIGXCPU despite sub_5B1E70 having set RLIMIT_CPU to RLIM_INFINITY at startup. A SIGXCPU could still arrive if an external resource manager (e.g., batch scheduler) overrides the limit after initialization. Status 11 causes abort() with a core dump.

SIGXFSZ

Set to SIG_IGN in sub_5B1E70 (signal(25, SIG_IGN)). SIGXFSZ (25) is raised when a write would grow a file past the process's file-size resource limit (RLIMIT_FSIZE), and its default action terminates the process with a core dump. With the signal ignored, an oversized .int.c write fails with an error (EFBIG) instead of killing the process.

SARIF Output Bookends

The SARIF JSON output is bracketed by two functions:

| Function | Address | When Called | Output |
|---|---|---|---|
| sub_5AEDB0 (write_init) | 0x5AEDB0 | During fe_init_part_1 (stage 3) | {"version":"2.1.0","$schema":"...","runs":[{"tool":{"driver":{"name":"EDG CPFE","version":"6.6",...}},"columnKind":"unicodeCodePoints","results":[ |
| sub_5AEE00 (write_signoff) | 0x5AEE00 | During sub_589530 (pre-exit) | ]}]} + newline |

The tool metadata identifies the frontend as "EDG CPFE" version "6.6" from "Edison Design Group", with fullName "Edison Design Group C/C++ Front End - 6.6" and informationUri "https://edg.com/c". The column kind is "unicodeCodePoints" (not byte offsets). Individual diagnostics are appended to the results array by the error subsystem between these two calls.

The write_init function (sub_5AEDB0) has the same assertion guard as write_signoff: if dword_106BBB8 is set but not equal to 1, it triggers an assertion at host_envir.c:2017 ("write_init"). Both assertions enforce the invariant that SARIF mode is exactly 0 or 1, never any other value.

Profiling Init -- sub_5AF330 (0x5AF330)

A separate but related mechanism. During sub_585DB0 (fe_one_time_init), if dword_106BD4C is set, sub_5AF330 is called:

int profiling_init(void)             // sub_5AF330
{
    int was_initialized = dword_126F110;
    if (!dword_126F110)
        dword_126F110 = 1;           // mark as initialized
    return was_initialized;          // 0 on first call, 1 on subsequent
}

This is a one-shot initializer for a profiling subsystem distinct from the --timing flag. The dword_106BD4C gate is set by a different CLI flag and controls a more granular, per-function profiling infrastructure (used by the EDG debug trace system, not the phase-level timing brackets). The dword_126F110 flag prevents double-initialization if fe_one_time_init is called more than once.

Signal Handler Registration Detail

The full signal setup in sub_5B1E70 (host_envir_early_init):

if (dword_E6E120) {                              // one-shot guard (starts nonzero)
    if (signal(SIGINT, SIG_IGN) != SIG_IGN)       // was SIGINT not already ignored?
        signal(SIGINT, handler);                   //   install interrupt handler
    signal(SIGTERM, handler);                      // always install
    signal(SIGXFSZ, SIG_IGN);                     // ignore file-size limit signals
    signal(SIGXCPU, sub_5AF270);                  // CPU time limit → abort
    dword_E6E120 = 0;                             // prevent re-registration
}

| Signal | Number | Handler | Behavior |
|---|---|---|---|
| SIGINT | 2 | handler (0x5AF2C0) | Conditional: only if not inherited as SIG_IGN. Writes newline, calls terminate_compilation(9). |
| SIGTERM | 15 | handler (0x5AF2C0) | Always installed. Same handler as SIGINT. |
| SIGXFSZ | 25 | SIG_IGN | Ignored. Prevents crash on large .int.c output. |
| SIGXCPU | 24 | sub_5AF270 (0x5AF270) | Prints "Internal error: CPU time limit exceeded.\n", then exit_with_status(11) (abort). |

After signal setup, sub_5B1E70 also disables the CPU time limit by setting RLIMIT_CPU soft limit to RLIM_INFINITY:

getrlimit(RLIMIT_CPU, &rlimits);
rlimits.rlim_cur = RLIM_INFINITY;    // -1 = unlimited
setrlimit(RLIMIT_CPU, &rlimits);

This prevents normal compilations from hitting SIGXCPU. The handler at sub_5AF270 is a safety net for cases where an external resource manager re-imposes the limit after initialization.
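The conditional SIGINT installation is the subtle part of this setup, so here is a runnable sketch of just that pattern. The names are hypothetical, the SIGXCPU registration is omitted, and the handler only sets a flag (the real handler writes a newline and calls terminate_compilation(9)).

```c
#include <signal.h>

static volatile sig_atomic_t g_interrupted = 0;
static int g_register_once = 1;        /* plays the role of dword_E6E120 */

/* Stand-in handler: records the interrupt instead of terminating. */
static void on_interrupt(int signo)
{
    (void)signo;
    g_interrupted = 1;
}

static void install_handlers(void)
{
    if (!g_register_once)
        return;                        /* already registered */

    /* Probe the inherited disposition by setting SIG_IGN. If SIGINT was
     * already ignored (e.g. by a parent driver), leave it ignored. */
    if (signal(SIGINT, SIG_IGN) != SIG_IGN)
        signal(SIGINT, on_interrupt);

    signal(SIGTERM, on_interrupt);     /* always installed */
    signal(SIGXFSZ, SIG_IGN);          /* ignore file-size limit signals */
    g_register_once = 0;
}
```

Because signal() returns the previous disposition, the probe and the "ignore" action are a single call; there is no window in which SIGINT is left at its default after the check.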

Complete Exit Sequence

The full sequence from compilation completion to process termination:

1.  sub_6B8B20(0)           Reset source file manager state
2.  sub_589530()            Write signoff + free memory
    ├── sub_5AEE00()        Print error/warning summary (or close SARIF JSON)
    └── sub_6B8DE0()        Free all frontend memory pools
3.  Compute exit_code       Based on qword_126ED90, qword_126ED88
4.  [If timing enabled]
    ├── sub_5AF350(v12)     Capture total end timestamp
    └── sub_5AF390(...)     Print "Total compilation time"
5.  [If stack was raised]
    └── setrlimit(...)      Restore original stack soft limit
6.  sub_5AF1D0(exit_code)   Map status → exit code, terminate
    ├── 3,4,5 → exit(0)
    ├── 8     → exit(2)
    ├── 9,10  → exit(4) + "Compilation terminated."
    └── 11    → abort()  + "Compilation aborted."

Global Variable Reference

| Variable | Address | Size | Role |
|---|---|---|---|
| dword_106C0A4 | 0x106C0A4 | 4 | Timing enable flag. CLI flag 20 (--timing). |
| dword_106BBB8 | 0x106BBB8 | 4 | SARIF output mode. 0=text, 1=SARIF JSON. |
| qword_126EDF0 | 0x126EDF0 | 8 | Diagnostic output FILE* (stderr). |
| qword_126ED90 | 0x126ED90 | 8 | Total error count. |
| qword_126ED98 | 0x126ED98 | 8 | Total warning count. |
| qword_126ED88 | 0x126ED88 | 8 | Additional status (nonzero changes exit code from 3 to 5). |
| qword_126EDB0 | 0x126EDB0 | 8 | Suppressed error count. |
| qword_126EDB8 | 0x126EDB8 | 8 | Suppressed warning count. |
| qword_126EEE0 | 0x126EEE0 | 8 | Output filename (for source display in signoff). |
| qword_106C040 | 0x106C040 | 8 | Display filename override (used if set, else falls back to qword_126EEE0). |
| dword_106C254 | 0x106C254 | 4 | Skip-backend flag. Set to 1 when errors detected after frontend. |
| dword_106C064 | 0x106C064 | 4 | Stack limit adjustment flag (--modify_stack_limit, default ON). |
| dword_E6E120 | 0xE6E120 | 4 | One-shot guard for signal handler registration in sub_5B1E70. |
| dword_126F110 | 0x126F110 | 4 | Profiling initialized flag. Set to 1 by sub_5AF330. |
| dword_106BD4C | 0x106BD4C | 4 | Profiling gate flag. When set, fe_one_time_init calls sub_5AF330. |
| qword_106B9C8 | 0x106B9C8 | 8 | Module declarations processed count (for debug module_report). |
| qword_106B9C0 | 0x106B9C0 | 8 | Module declarations failed count. |
| dword_1280728 | 0x1280728 | 4 | Memory manager mode flag. Controls pool vs non-pool deallocation in sub_6B8DE0. |

Execution Spaces

Every CUDA function lives in one or more execution spaces that govern where the function can run (host CPU, device GPU, or both) and what it can call. cudafe++ encodes execution space as a single-byte bitfield at offset +182 of the entity (routine) node. This byte is the most frequently tested field in CUDA-specific code paths -- it drives attribute application, redeclaration compatibility, virtual override checking, call-graph validation, IL marking, and code generation selection. Understanding this byte is prerequisite to understanding nearly every CUDA-specific subsystem in cudafe++.

The three CUDA execution-space keywords (__host__, __device__, __global__) are parsed as EDG attributes with internal kind codes 'V' (86), 'W' (87), and 'X' (88) respectively. The attribute dispatch table in apply_one_attribute (sub_413240) routes each kind to a dedicated handler that validates constraints and sets the bitfield. Functions without any explicit annotation default to __host__.

Key Facts

| Property | Value |
|---|---|
| Source file | attribute.c (handlers), class_decl.c (redecl/override), nv_transforms.h (inline predicates) |
| Bitfield location | Entity node byte at offset +182 |
| __global__ handler | sub_40E1F0 / sub_40E7F0 (apply_nv_global_attr, two variants) |
| __device__ handler | sub_40EB80 (apply_nv_device_attr) |
| __host__ handler | sub_4108E0 (apply_nv_host_attr) |
| Virtual override checker | sub_432280 (record_virtual_function_override) |
| Execution space mask table | dword_E7C760[] (indexed by space enum) |
| Mask lookup | sub_6BCF60 (nv_check_execution_space_mask) |
| Annotation helper | sub_41A1F0 (validates HD annotations on types) |
| Relaxed mode flag | dword_106BFF0 (permits otherwise-illegal space combinations) |
| main() entity pointer | qword_126EB70 (compared during attribute application) |

The Execution Space Bitfield (Entity + 182)

Byte offset +182 within a routine entity node encodes the execution space as a bitfield. Individual bits carry distinct meanings:

Byte at entity+182:

  bit 0  (0x01)   device_capable     Function can execute on device
  bit 1  (0x02)   device_explicit    __device__ was explicitly written
  bit 2  (0x04)   host_capable       Function can execute on host
  bit 3  (0x08)   (reserved)
  bit 4  (0x10)   host_explicit      __host__ was explicitly written
  bit 5  (0x20)   device_annotation  Secondary device flag (used in HD detection)
  bit 6  (0x40)   global_kernel      Function is a __global__ kernel
  bit 7  (0x80)   hd_combined        Combined __host__ __device__ flag

Combined Patterns

The attribute handlers do not set individual bits -- they OR entire patterns into the byte. Each CUDA keyword produces a characteristic bitmask:

| Keyword | OR mask | Resulting byte | Bit breakdown |
|---|---|---|---|
| __global__ | 0x61 | 0xE1 | device_capable + device_annotation + global_kernel + bit 7 (always set) |
| __device__ | 0x23 | 0x23 | device_capable + device_explicit + device_annotation |
| __host__ | 0x15 | 0x15 | device_capable + host_capable + host_explicit |
| __host__ __device__ | 0x23 \| 0x15 | 0x37 | device_capable + device_explicit + host_capable + host_explicit + device_annotation |
| (no annotation) | none | 0x00 | Implicit __host__ -- bits remain zero |

The 0x80 bit is set unconditionally by the __global__ handler. After the |= 0x61 operation (which sets bit 6), the handler reads the byte back and checks (byte & 0x40) != 0. Since bit 6 was just set, this is always true, so |= 0x80 always executes. Despite the field name hd_combined in some tooling, the bit functions as a "has global annotation" marker in practice.
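The mask arithmetic can be checked with a few lines of C. The constant and function names are hypothetical, and the sketch models only the bit manipulation: the real handlers also validate constraints before touching the byte.

```c
#include <stdint.h>

enum {
    NV_MASK_GLOBAL       = 0x61,  /* OR mask applied for __global__ */
    NV_MASK_DEVICE       = 0x23,  /* OR mask applied for __device__ */
    NV_MASK_HOST         = 0x15,  /* OR mask applied for __host__   */
    NV_BIT_GLOBAL_KERNEL = 0x40,  /* bit 6 */
    NV_BIT_7             = 0x80   /* bit 7 */
};

/* Mirrors the __global__ sequence: OR in 0x61, re-read, then set bit 7. */
static uint8_t apply_global(uint8_t space)
{
    space |= NV_MASK_GLOBAL;
    if (space & NV_BIT_GLOBAL_KERNEL)   /* always true after the OR above */
        space |= NV_BIT_7;
    return space;
}

static uint8_t apply_device(uint8_t space) { return (uint8_t)(space | NV_MASK_DEVICE); }
static uint8_t apply_host(uint8_t space)   { return (uint8_t)(space | NV_MASK_HOST); }
```

Starting from a zero byte, apply_global yields 0xE1, and applying apply_device and apply_host in either order yields 0x37, matching the table.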

Why device_capable (bit 0) Appears in host

The __host__ mask 0x15 includes bit 0 (device_capable). This is not an error. Bit 0 acts as a "has execution space annotation" marker rather than a strict "runs on device" flag. The actual device-only vs host-only distinction is determined by the two-bit extraction at bits 4-5 (the 0x30 mask), described below.

Execution Space Classification (0x30 Mask)

The critical two-bit extraction byte & 0x30 classifies a routine into one of four categories:

(byte & 0x30):
  0x00  ->  no explicit annotation (implicit __host__)
  0x10  ->  __host__ only
  0x20  ->  __device__ only
  0x30  ->  __host__ __device__
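As a small illustration, the four categories map directly onto an enum whose values are the masked results. This is a sketch with hypothetical names (exec_space, classify_exec_space); the binary performs the mask inline.

```c
#include <stdint.h>

/* Enumerator values equal the result of (byte & 0x30). */
typedef enum {
    SPACE_IMPLICIT_HOST = 0x00,  /* no explicit annotation */
    SPACE_HOST          = 0x10,  /* __host__ only */
    SPACE_DEVICE        = 0x20,  /* __device__ only (kernels also land here) */
    SPACE_HOST_DEVICE   = 0x30   /* __host__ __device__ */
} exec_space;

exec_space classify_exec_space(uint8_t byte)
{
    return (exec_space)(byte & 0x30);
}
```

A __global__ byte (0xE1) classifies as SPACE_DEVICE under this mask alone, which is why the device-only predicate needs a second test.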

This extraction is the basis of nv_is_device_only_routine, an inline predicate defined in nv_transforms.h (line 367). The full check from the decompiled binary is:

// nv_is_device_only_routine (inlined from nv_transforms.h:367)
// entity_sym: the symbol table entry for the routine
// entity_sym+88 -> associated routine entity

__int64 entity = *(entity_sym + 88);
if (!entity)
    internal_error("nv_transforms.h", 367, "nv_is_device_only_routine");

char byte = *(char*)(entity + 182);
bool is_device_only = ((byte & 0x30) == 0x20) && ((byte & 0x60) == 0x20);

The double-check (byte & 0x60) == 0x20 ensures the function is device-only and NOT a __global__ kernel (which would have bit 6 set, making byte & 0x60 == 0x60). This predicate is used in:

  • check_void_return_okay (sub_719D20): suppress missing-return warnings for device-only functions
  • record_virtual_function_override (sub_432280): drive virtual override execution space propagation
  • Cross-space call validation: determine whether a call crosses execution space boundaries
  • IL keep-in-il marking: identify device-reachable code

The 0x60 Mask (Kernel vs Device)

A secondary extraction byte & 0x60 distinguishes kernels from plain device functions:

(byte & 0x60):
  0x00  ->  no device annotation
  0x20  ->  __device__ only (not a kernel)
  0x40  ->  __global__ only (should not occur in isolation)
  0x60  ->  __global__ (which implies __device__)

nv_is_device_only_routine Truth Table

The predicate is inlined from nv_transforms.h:367 and appears in multiple call sites. Its internal_error guard string "nv_is_device_only_routine" appears in sub_432280 at the source path EDG_6.6/src/nv_transforms.h. The complete truth table for all execution space combinations:

Execution spacebyte+182byte & 0x30byte & 0x60Result
(none, implicit __host__)0x000x000x00false
__host__0x150x100x00false
__device__0x230x200x20true
__host__ __device__0x370x300x20false
__global__0xE10x200x60false

The __global__ case is the key distinction: byte & 0x30 yields 0x20 (same as __device__), but byte & 0x60 yields 0x60 (not 0x20), so the predicate correctly rejects kernels.

// Full pseudocode for nv_is_device_only_routine
// Inlined at every call site; not a standalone function in the binary.
//
// Input: sym -- a symbol table entry (not the entity itself)
// Output: true if the routine is __device__ only (not __host__, not __global__)
bool nv_is_device_only_routine(symbol *sym) {
    entity *e = sym->entity;            // sym + 88
    if (!e)
        internal_error("nv_transforms.h", 367, "nv_is_device_only_routine");

    char byte = e->byte_182;
    // First check:  bits 4-5 == 0x20 -> has __device__, no __host__
    // Second check: bits 5-6 == 0x20 -> has __device__, no __global__
    return ((byte & 0x30) == 0x20) && ((byte & 0x60) == 0x20);
}
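The truth table can be checked mechanically on raw byte values. The sketch below works on the byte alone (no symbol or entity structures) and uses hypothetical function names. It also shows that the two masked comparisons are arithmetically equivalent to the single test (byte & 0x70) == 0x20; that is an observation about the bit arithmetic, not a claim about what the compiler emitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* The documented two-step check, applied to the raw byte at +182. */
bool is_device_only_byte(uint8_t byte)
{
    return ((byte & 0x30) == 0x20) && ((byte & 0x60) == 0x20);
}

/* Equivalent single-mask form: bit 5 set, bits 4 and 6 clear. */
bool is_device_only_single_mask(uint8_t byte)
{
    return (byte & 0x70) == 0x20;
}

/* Exhaustively compares both forms over all 256 byte values. */
bool masks_agree(void)
{
    for (unsigned v = 0; v < 256; ++v) {
        uint8_t b = (uint8_t)v;
        if (is_device_only_byte(b) != is_device_only_single_mask(b))
            return false;
    }
    return true;
}
```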

Complete Redeclaration Matrix

The matrix below documents every possible pair of (existing annotation, newly-applied annotation) and the result. Each cell is derived from the three attribute handler functions. "Relaxed" means the outcome changes when dword_106BFF0 is set.

| Existing \ Applying | __host__ | __device__ | __global__ |
|---|---|---|---|
| (none) 0x00 | 0x15 -- OK | 0x23 -- OK | 0xE1 -- OK |
| __host__ 0x15 | 0x15 -- idempotent | 0x37 -- OK (HD) | error 3481 (always: handler checks byte & 0x10 unconditionally) |
| __device__ 0x23 | 0x37 -- OK (HD) | 0x23 -- idempotent | error 3481 (relaxed: OK) |
| __global__ 0xE1 | error 3481 (always) | error 3481 (relaxed: OK) | 0xE1 -- idempotent |
| __host__ __device__ 0x37 | 0x37 -- idempotent | 0x37 -- idempotent | error 3481 (always: byte & 0x10 fires) |

The __global__ column always errors when the existing annotation includes __host__ (bit 4 = 0x10), because the __global__ handler's condition (v5 & 0x10) != 0 is not guarded by the relaxed-mode flag. The __device__ column errors on existing __global__ only when relaxed mode is off, because the __device__ handler guards its check with !dword_106BFF0.

Note that __global__'s byte value is 0xE1 (not 0x61) because the 0x80 bit is always set after __global__ is applied, as documented above.
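The matrix can be reproduced by modelling only the conflict condition and the OR mask of each handler. This is a sketch under stated assumptions: the function names are hypothetical, the relaxed parameter stands in for dword_106BFF0, all other validation is omitted, and the byte is updated even when a conflict is reported (which is what the handler pseudocode in this section suggests).

```c
#include <stdbool.h>
#include <stdint.h>

/* Each function returns true when error 3481 would be reported. */

bool apply_host(uint8_t *es)
{
    bool conflict = (*es & 0x40) != 0;                  /* already __global__ */
    *es = (uint8_t)(*es | 0x15);
    return conflict;
}

bool apply_device(uint8_t *es, bool relaxed)
{
    bool conflict = !relaxed && (*es & 0x40) != 0;      /* already __global__ */
    *es = (uint8_t)(*es | 0x23);
    return conflict;
}

bool apply_global(uint8_t *es, bool relaxed)
{
    bool conflict = (!relaxed && (*es & 0x60) == 0x20)  /* already __device__ */
                 || (*es & 0x10) != 0;                  /* already __host__ */
    *es = (uint8_t)(*es | 0x61);
    if (*es & 0x40)
        *es = (uint8_t)(*es | 0x80);
    return conflict;
}
```

Starting from 0x37 (__host__ __device__), apply_global reports a conflict in both modes, because the host_explicit test is not guarded by the relaxed flag.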

Attribute Application Functions

apply_nv_global_attr (sub_40E1F0 / sub_40E7F0)

Two nearly identical entry points exist. Both apply __global__ to a function entity. The variant at sub_40E7F0 uses a do-while loop for parameter iteration instead of a for loop, but the validation logic is identical. Both variants may exist because EDG generates different code paths for attribute-on-declaration vs attribute-on-definition.

The function performs extensive validation before setting the bitmask:

// Pseudocode for apply_nv_global_attr (sub_40E1F0)
int64_t apply_nv_global_attr(attr_node *a1, entity *a2, char target_kind) {
    if (target_kind != 11)      // only applies to functions
        return a2;

    // Check constexpr lambda with wrong linkage
    if ((a2->qword_184 & 0x800001000000) == 0x800000000000) {
        char *name = get_entity_name(a2, 0);
        error(3469, a1->source_loc, "__global__", name);
        return a2;
    }

    // Static member check
    if ((signed char)a2->byte_176 < 0 && !(a2->byte_81 & 0x04))
        warning(3507, a1->source_loc, "__global__");

    // operator() check
    if (a2->byte_166 == 5)
        error(3644, a1->source_loc);

    // Return type must not be auto/decltype(auto) (skip cv-qualifier wrappers first)
    type *ret = a2->return_type;    // +144
    while (ret->kind == 12)         // 12 = cv-qualifier wrapper
        ret = ret->next;            // +144
    if (ret->prototype->exception_spec)     // +152 -> +56
        error(3647, a1->source_loc);        // auto/decltype(auto) return

    // Execution space conflict check (single condition with ||)
    char es = a2->byte_182;
    if ((!dword_106BFF0 && (es & 0x60) == 0x20) || (es & 0x10) != 0)
        error(3481, a1->source_loc);
        // Left branch: already __device__ (not relaxed mode) -> conflict
        // Right branch: already __host__ explicit (unconditional) -> conflict

    // Return type must be void (non-constexpr path)
    if (!(a2->byte_179 & 0x10)) {      // not constexpr
        if (a2->byte_191 & 0x01)       // lambda
            error(3506, a1->source_loc);
        else if (!is_void_return(a2))
            error(3505, a1->source_loc);
    }

    // Variadic check
    // ... skip to prototype, check bit 0 of proto+16
    if (proto_flags & 0x01)
        error(3503, a1->source_loc);

    // >>> SET THE BITMASK <<<
    a2->byte_182 |= 0x61;    // bits 0,5,6: device_capable + device_annotation + global_kernel

    // Local function check
    if (a2->byte_81 & 0x04)
        error(3688, a1->source_loc);

    // main() check
    if (a2 == qword_126EB70 && (a2->byte_182 & 0x20))
        error(3538, a1->source_loc);

    // Always set bit 7 after __global__: the check reads the byte AFTER |= 0x61,
    // so bit 6 is always set, making this unconditional.
    if (a2->byte_182 & 0x40)
        a2->byte_182 |= 0x80;

    // Parameter default-init check (device-side warning)
    // ... iterate parameters, warn 3669 if missing defaults
    return a2;
}

apply_nv_device_attr (sub_40EB80)

Handles both variables (target_kind == 7) and functions (target_kind == 11). For variables, it sets the memory space bitfield at +148 (bit 0 = __device__). For functions, it sets the execution space.

// Variable path (target_kind == 7):
a2->byte_148 |= 0x01;              // __device__ memory space
if (((a2->byte_148 & 0x02) != 0) + ((a2->byte_148 & 0x04) != 0) == 2)
    error(3481, ...);               // both __shared__ (bit 1) AND __constant__ (bit 2) set
if ((signed char)a2->byte_161 < 0)
    error(3482, ...);               // thread_local
if (a2->byte_81 & 0x04)
    error(3485, ...);               // local variable

// Function path (target_kind == 11):
// Same constexpr-lambda check as __global__
if (!dword_106BFF0 && (a2->byte_182 & 0x40))
    error(3481, ...);               // already __global__, now __device__
a2->byte_182 |= 0x23;              // device_capable + device_explicit + device_annotation
if ((a2->byte_81 & 0x04) && (a2->byte_182 & 0x40))
    error(3688, ...);               // local function with __global__
if (a2 == qword_126EB70 && (a2->byte_182 & 0x20))
    error(3538, ...);               // __device__ on main()

apply_nv_host_attr (sub_4108E0)

The simplest of the three. Only applies to functions (target_kind 11). Fewer validation checks than __global__ or __device__.

// Function path (target_kind == 11):
// Same constexpr-lambda check
if (a2->byte_182 & 0x40)
    error(3481, ...);           // already __global__, now __host__
a2->byte_182 |= 0x15;          // device_capable + host_capable + host_explicit
if ((a2->byte_81 & 0x04) && (a2->byte_182 & 0x40))
    error(3688, ...);           // local function
if (a2 == qword_126EB70 && (a2->byte_182 & 0x20))
    error(3538, ...);           // main(); 0x15 leaves bit 5 clear, so this fires
                                // only if main() already has device_annotation

Default Execution Space

Functions without any explicit annotation have byte +182 == 0x00. This is treated as implicit __host__:

  • The 0x30 mask yields 0x00, which the cross-space validator treats identically to 0x10 (explicit __host__)
  • The function is compiled for the host side only
  • It is excluded from device IL during the keep-in-il pass

In JIT compilation mode (--default-device), the default flips to __device__. This changes which functions are kept in device IL without requiring explicit annotations.

Execution Space Conflict Detection

The attribute handlers enforce a mutual-exclusion matrix. When a second execution space attribute is applied to a function that already has one, the handler checks for conflicts using error 3481:

| Already set | Applying | Result |
|---|---|---|
| (none) | __host__ | 0x15 -- accepted |
| (none) | __device__ | 0x23 -- accepted |
| (none) | __global__ | 0xE1 -- accepted |
| __host__ (0x15) | __device__ | 0x37 -- accepted (HD) |
| __device__ (0x23) | __host__ | 0x37 -- accepted (HD) |
| __host__ (0x15) | __global__ | error 3481 (always -- byte & 0x10 is unconditional) |
| __device__ (0x23) | __global__ | error 3481 (unless dword_106BFF0) |
| __global__ (0xE1) | __host__ | error 3481 (always) |
| __global__ (0xE1) | __device__ | error 3481 (unless dword_106BFF0) |
| __host__ (0x15) | __host__ | idempotent OR, no error |
| __device__ (0x23) | __device__ | idempotent OR, no error |
| __global__ (0xE1) | __global__ | idempotent OR, no error |

The relaxed mode flag dword_106BFF0 suppresses certain conflicts. When set, combinations that would normally produce error 3481 are silently accepted. This flag corresponds to --expt-relaxed-constexpr or similar permissive compilation modes. Note that the relaxed flag does NOT affect the __host__ -> __global__ or __global__ -> __host__ paths -- these always error because the __global__ handler checks byte & 0x10 unconditionally, and the __host__ handler checks byte & 0x40 unconditionally.

Virtual Function Override Checking (sub_432280)

When a derived class overrides a virtual function, cudafe++ must verify execution space compatibility. This check is embedded in record_virtual_function_override (sub_432280, 437 lines, from class_decl.c).

nv_is_device_only_routine Inline Check

The function first tests whether the overriding function has the __device__ flag at +177 bit 4 (0x10). If so, and the overridden function does NOT have this flag, execution space propagation occurs:

// Propagation logic (simplified from sub_432280, lines 70-94)
if (overriding->byte_177 & 0x10) {     // overriding is __device__
    if (!(overridden->byte_177 & 0x10)) {   // overridden is NOT __device__
        char es = overridden->byte_182;
        if ((es & 0x30) != 0x20) {          // overridden is not device-only
            overriding->byte_182 |= 0x10;   // propagate __host__ flag
        }
        if (es & 0x20) {                    // overridden has device_annotation
            overriding->byte_182 |= 0x20;   // propagate device_annotation
        }
    }
}
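The propagation step can be exercised on raw bytes. In this sketch the two boolean parameters model bit 4 (0x10) of byte +177 on each function; the function name and signature are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns the overriding function's execution space byte after propagation.
 * overriding_flag / overridden_flag model (byte_177 & 0x10) on each side. */
uint8_t propagate_override_space(uint8_t overriding_es, bool overriding_flag,
                                 uint8_t overridden_es, bool overridden_flag)
{
    if (overriding_flag && !overridden_flag) {
        if ((overridden_es & 0x30) != 0x20)
            overriding_es = (uint8_t)(overriding_es | 0x10);  /* base is not device-only */
        if (overridden_es & 0x20)
            overriding_es = (uint8_t)(overriding_es | 0x20);  /* base has device_annotation */
    }
    return overriding_es;
}
```

An overridden __host__ __device__ function (0x37) therefore contributes both 0x10 and 0x20, while a device-only one (0x23) contributes only 0x20.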

Six Virtual Override Mismatch Errors (3542-3547)

When the overriding function is NOT __device__, the checker looks up execution space attributes using sub_5CEE70 (attribute kind 87 = __device__, kind 86 = __host__). Based on which attributes are found on the overriding function and the execution space of the overridden function, one of six errors is emitted:

| Error | Overriding has | Overridden space (byte & 0x30) | Meaning |
|---|---|---|---|
| 3542 | __device__ only | 0x00 or 0x10 (host/implicit) | Device override of host virtual |
| 3543 | __device__ + __host__ | 0x00 (no annotation) | HD override of implicit-host virtual |
| 3544 | __device__ + __host__ | 0x20 (device-only) | HD override of device-only virtual |
| 3545 | no __device__ | 0x20 (device-only) | Host override of device-only virtual |
| 3546 | no __device__ | 0x30 (HD) | Host override of HD virtual |
| 3547 | __device__ only | 0x30 (HD), relaxed mode | Device override of HD virtual (relaxed) |

The errors are emitted via sub_4F4F10 with severity 8 (hard error). The dword_106BFF0 relaxed mode flag modulates certain paths: in relaxed mode, some combinations that would otherwise error are accepted or downgraded.

Decision Logic

// Pseudocode for override mismatch detection (sub_432280, lines 95-188)
char es = overridden->byte_182;
char mask_30 = es & 0x30;
bool has_device_bit = (es & 0x20) != 0;  // device_annotation on the overridden function
bool is_hd = (mask_30 == 0x30);

bool has_device_attr = has_attribute(overriding, 87 /*__device__*/);
bool has_host_attr   = has_attribute(overriding, 86 /*__host__*/);

if (has_device_attr) {
    if (has_host_attr) {
        // Overriding is __host__ __device__
        if (has_device_bit)
            error = 3544;   // HD overrides device-only
        else if (mask_30 != 0x20)
            error = 3543;   // HD overrides implicit-host
    } else {
        // Overriding is __device__ only
        if (!has_device_bit)
            error = 3542;   // device overrides host
        if (is_hd && relaxed_mode)
            error = 3547;   // device overrides HD (relaxed)
    }
} else {
    // Overriding has no __device__
    if (mask_30 == 0x20)
        error = 3545;       // host overrides device-only
    else if (mask_30 == 0x30)
        error = 3546;       // host overrides HD
}

__global__ Function Constraints

The __global__ handler enforces the strictest constraints of any execution space. A kernel function must satisfy all of the following:

| Constraint | Check | Error |
|---|---|---|
| Must be a function (not variable/type) | target_kind == 11 | silently ignored if not |
| Not a constexpr lambda with wrong linkage | (qword_184 & 0x800001000000) != 0x800000000000 | 3469 |
| Not a static member function | (signed char)byte_176 >= 0 \|\| (byte_81 & 0x04) | 3507 (warning) |
| Not operator() | byte_166 != 5 | 3644 |
| Return type not auto/decltype(auto) | no exception spec at proto+56 | 3647 |
| No conflicting execution space | see conflict matrix above | 3481 |
| Return type is void (non-constexpr) | is_void_return(a2) | 3505 / 3506 |
| Not variadic | !(proto_flags & 0x01) | 3503 |
| Not a local function | !(byte_81 & 0x04) | 3688 |
| Not main() | a2 != qword_126EB70 | 3538 |
| Parameters have default init (device-side) | walk parameter list | 3669 (warning) |

Execution Space Annotation Helper (sub_41A1F0)

This function validates that type arguments used in __host__ __device__ or __device__ template contexts are well-formed. It traverses the type chain (following cv-qualifier wrappers where kind == 12), emitting diagnostics:

  • Error 3597: Type nesting depth exceeds 7 levels
  • Error 3598: Type is not device-callable (fails sub_550E50 check)
  • Error 3599: Type lacks appropriate constructor/destructor for device context

The third argument (a3) selects the annotation string: when a3 == 0, the string is "__host__ __device__"; when a3 != 0, it is "__device__".

Attribute Dispatch (apply_one_attribute)

The central dispatcher sub_413240 (apply_one_attribute, 585 lines) routes attribute kinds to their handlers via a switch statement:

| Kind byte | Decimal | Attribute | Handler |
|---|---|---|---|
| 'V' | 86 | __host__ | sub_4108E0 |
| 'W' | 87 | __device__ | sub_40EB80 |
| 'X' | 88 | __global__ | sub_40E1F0 or sub_40E7F0 |

Attribute display names are resolved by sub_40A310 (attribute_display_name), which maps the kind byte back to the human-readable CUDA keyword string for use in diagnostic messages.
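The kind-byte mapping for the three execution space attributes can be written as a small switch. This is an illustrative sketch only: cuda_attr_keyword is a hypothetical name, the empty-string fallback is a placeholder, and the real sub_40A310 covers more kinds than are shown here. The character constants assume an ASCII execution character set, where 'V', 'W' and 'X' are 86, 87 and 88.

```c
/* Maps an execution space attribute kind byte to its CUDA keyword. */
const char *cuda_attr_keyword(char kind)
{
    switch (kind) {
    case 'V': return "__host__";     /* 86 */
    case 'W': return "__device__";   /* 87 */
    case 'X': return "__global__";   /* 88 */
    default:  return "";             /* placeholder: other kinds not modelled */
    }
}
```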

Execution Space Mask Table (dword_E7C760)

A lookup table at dword_E7C760 stores precomputed bitmasks indexed by execution space enum value. The function sub_6BCF60 (nv_check_execution_space_mask) performs return a1 & dword_E7C760[a2], allowing fast bitwise checks of whether a given entity's execution space matches a target space category. This table is used throughout cross-space validation and IL marking.

Diagnostics Reference

| Error | Severity | Meaning |
|---|---|---|
| 3469 | error | Execution space attribute on constexpr lambda with wrong linkage |
| 3481 | error | Conflicting execution spaces |
| 3482 | error | __device__ variable with thread_local storage |
| 3485 | error | __device__ attribute on local variable |
| 3503 | error | __global__ function cannot be variadic |
| 3505 | error | __global__ return type must be void (non-constexpr path) |
| 3506 | error | __global__ return type must be void (constexpr/lambda path) |
| 3507 | warning | __global__ on static member function |
| 3538 | error | Execution space attribute on main() |
| 3542 | error | Virtual override: __device__ overrides host |
| 3543 | error | Virtual override: __host__ __device__ overrides implicit-host |
| 3544 | error | Virtual override: __host__ __device__ overrides device-only |
| 3545 | error | Virtual override: host overrides device-only |
| 3546 | error | Virtual override: host overrides __host__ __device__ |
| 3547 | error | Virtual override: __device__ overrides HD (relaxed mode) |
| 3577 | error | Memory space attribute incompatible with __grid_constant__ |
| 3597 | error | Type nesting too deep for execution space annotation |
| 3598 | error | Type not callable in target execution space |
| 3599 | error | Type lacks device-compatible constructor/destructor |
| 3644 | error | __global__ on operator() |
| 3647 | error | __global__ return type cannot be auto/decltype(auto) |
| 3669 | warning | __global__ parameter without default initializer (device-side) |
| 3688 | error | Execution space attribute on local function |

Function Map

| Address | Identity | Lines | Source |
|---|---|---|---|
| sub_40A310 | attribute_display_name | 83 | attribute.c |
| sub_40E1F0 | apply_nv_global_attr (variant 1) | 89 | attribute.c |
| sub_40E7F0 | apply_nv_global_attr (variant 2) | 86 | attribute.c |
| sub_40EB80 | apply_nv_device_attr | 100 | attribute.c |
| sub_4108E0 | apply_nv_host_attr | 31 | attribute.c |
| sub_413240 | apply_one_attribute (dispatch) | 585 | attribute.c |
| sub_41A1F0 | execution space annotation helper | 82 | class_decl.c |
| sub_432280 | record_virtual_function_override | 437 | class_decl.c |
| sub_6BCF60 | nv_check_execution_space_mask | 7 | nv_transforms.c |
| sub_719D20 | check_void_return_okay | 271 | statements.c |

Cross-References

Memory Spaces

Every CUDA variable that resides in GPU memory belongs to one of four memory spaces: __device__ (global memory), __shared__ (per-block scratchpad), __constant__ (read-only broadcast memory), or __managed__ (unified memory). cudafe++ encodes memory space as a two-byte bitfield at offsets +148 and +149 of the variable entity node. These two bytes are the variable-side analog of the execution space byte at +182 used for functions -- the two systems are complementary but independent.

The memory space bitfield passes through three processing stages. First, attribute handlers in attribute.c set the appropriate bits and enforce mutual exclusion constraints (no __shared__ + __constant__, no thread_local, no __grid_constant__ conflict). Second, declaration processing in decls.c applies additional validation: VLA restrictions for __shared__, constexpr and external-linkage restrictions for __constant__/__device__, and structured binding constraints for all spaces. Third, symbol reference recording in symbol_ref.c checks whether host code illegally accesses device-side variables at reference time.

Memory spaces apply exclusively to variables (entity kind 7). __shared__ and __constant__ have no function-side meaning -- only __device__ (kind 'W', 87) doubles as a function execution space attribute.

Key Facts

| Property | Value |
|---|---|
| Memory space offset | Entity node byte +148 (3-bit bitfield) |
| Extended space offset | Entity node byte +149 (1 bit for __managed__) |
| __device__ handler | sub_40EB80 (apply_nv_device_attr, 100 lines, attribute.c) |
| __managed__ handler | sub_40E0D0 (apply_nv_managed_attr, 47 lines, attribute.c:10523) |
| __shared__ handler | Kind 'Z' (90), not individually decompiled; sets +148 \|= 0x02 |
| __constant__ handler | Kind '[' (91), not individually decompiled; sets +148 \|= 0x04 |
| Declaration processor | sub_4DEC90 (variable_declaration, 1098 lines, decls.c) |
| Variable declaration | sub_4CA6C0 (decl_variable, 1090 lines, decls.c:7730) |
| Variable fixup | sub_4CC150 (cuda_variable_fixup, 120 lines, decls.c) |
| Defined-variable check | sub_4DC200 (mark_defined_variable, 26 lines, decls.c) |
| Cross-space reference checker | sub_72A650 / sub_72B510 (record_symbol_reference_full, symbol_ref.c) |
| Device-var-in-host checker | sub_6BCF10 (nv_check_device_variable_in_host, nv_transforms.c) |
| Post-validation | sub_6BC890 (nv_validate_cuda_attributes, 161 lines, nv_transforms.c) |
| Attribute kind codes | 'W'=87 (__device__), 'Z'=90 (__shared__), '['=91 (__constant__), 'f'=102 (__managed__) |

The Memory Space Bitfield (Entity +148 / +149)

Byte +148: Primary Memory Space

Byte at entity+148:

  bit 0  (0x01)   __device__       Variable in device global memory
  bit 1  (0x02)   __shared__       Variable in per-block shared memory
  bit 2  (0x04)   __constant__     Variable in constant memory
  bit 3  (0x08)   type_member      Set when variable inherits space from type context
  bit 4  (0x10)   device_at_file   __device__ at file scope (no enclosing function)
  bit 7  (0x80)   weak_odr         Set by apply_nv_weak_odr_attr (sub_40AD80)

Bits 3 and 4 are set by decl_variable (sub_4CA6C0) during declaration processing, not by the memory space attribute handlers; bit 7 is set separately by apply_nv_weak_odr_attr (sub_40AD80). Bit 3 is set via *(_BYTE *)(v33 + 148) |= 8u when the variable inherits its memory space from a type context (such as a static member of a class with a device annotation). Bit 4 is set via *(_BYTE *)(v43 + 148) = v73 | 0x10 when a __device__ variable is declared at file scope (dword_126C5D8 == -1, meaning no enclosing function).

Byte +149: Extended Memory Space

Byte at entity+149:

  bit 0  (0x01)   __managed__    Unified memory (host + device accessible)
  bits 1-7        (reserved)

Word-Level Access

Some validation code reads bytes +148 and +149 together as a 16-bit word. The __grid_constant__ conflict check in apply_nv_managed_attr tests:

// sub_40E0D0, line 26 (apply_nv_managed_attr)
if ( (a2[164] & 4) != 0 && (*((_WORD *)a2 + 74) & 0x102) != 0 )

Here (_WORD *)(a2 + 148) (offset 74 in 16-bit units) is tested against 0x0102. In little-endian layout, 0x0102 means byte +148 bit 1 (__shared__) OR byte +149 bit 0 (__managed__). This catches the case where a __grid_constant__ parameter also carries __shared__ or __managed__.
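The byte-to-word correspondence can be made explicit by composing the 16-bit value by hand rather than relying on a pointer cast. A minimal sketch with hypothetical function names; it builds the value that a little-endian 16-bit load at +148 would produce.

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte +148 is the low half, byte +149 the high half (little-endian view). */
uint16_t space_word(uint8_t byte_148, uint8_t byte_149)
{
    return (uint16_t)(byte_148 | ((uint16_t)byte_149 << 8));
}

/* True when __shared__ (+148 bit 1) or __managed__ (+149 bit 0) is set. */
bool shared_or_managed(uint8_t byte_148, uint8_t byte_149)
{
    return (space_word(byte_148, byte_149) & 0x0102) != 0;
}
```

Inside apply_nv_managed_attr the __managed__ bit has just been set when this test runs, so by this arithmetic the 0x0102 half of the condition is already satisfied there and the outcome depends on the byte +164 test.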

Mutual Exclusion

In valid CUDA programs, at most one of __device__, __shared__, and __constant__ should be set. However, __managed__ always implies __device__ -- the handler sets both +149 bit 0 and +148 bit 0. The validation logic permits __device__ + __managed__ but rejects combinations like __shared__ + __constant__.

The mutual exclusion check appears identically in both apply_nv_managed_attr and apply_nv_device_attr:

// From sub_40EB80 (apply_nv_device_attr), variable path:
v9 = *(_BYTE *)(a2 + 148) | 1;     // set __device__ bit
*(_BYTE *)(a2 + 148) = v9;
if ( ((v9 & 2) != 0) + ((v9 & 4) != 0) == 2 )
    sub_4F81B0(3481, a1 + 56);      // error: conflicting spaces

The expression ((v9 & 2) != 0) + ((v9 & 4) != 0) == 2 is true only when both __shared__ (bit 1) and __constant__ (bit 2) are set simultaneously. This means:

  • __device__ + __shared__ is allowed (the bits coexist)
  • __device__ + __constant__ is allowed
  • __shared__ + __constant__ triggers error 3481
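The three cases above can be confirmed with a one-line predicate. A sketch with a hypothetical function name; the expression is the decompiled form, which is equivalent to testing (byte & 0x06) == 0x06.

```c
#include <stdbool.h>
#include <stdint.h>

/* True only when both __shared__ (0x02) and __constant__ (0x04) are set. */
bool shared_constant_conflict(uint8_t byte_148)
{
    return ((byte_148 & 0x02) != 0) + ((byte_148 & 0x04) != 0) == 2;
}
```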

Attribute Handlers

apply_nv_managed_attr -- sub_40E0D0

The __managed__ handler is the simplest and most thoroughly documented. It demonstrates the full validation pattern that all memory space handlers share.

Entry point: Called from apply_one_attribute (sub_413240) when attribute kind is 'f' (102).

Decompiled logic (47 lines, attribute.c:10523):

// sub_40E0D0 -- apply_nv_managed_attr
// a1: attribute node, a2: entity node, a3: entity kind

// Gate: only applies to variables
if ( a3 != 7 )
    internal_error("attribute.c", 10523, "apply_nv_managed_attr");

// Step 1: Set managed flag AND device flag
v3 = a2[148];           // save old memory space byte
a2[149] |= 1;           // set __managed__ bit
a2[148] = v3 | 1;       // set __device__ bit (managed implies device)

// Step 2: Mutual exclusion check
if ( ((v3 & 2) != 0) + ((v3 & 4) != 0) == 2 )
    error(3481, ...);    // __shared__ + __constant__ conflict

// Step 3: Thread-local check
if ( (char)a2[161] < 0 )
    error(3482, ...);    // __managed__ on thread_local

// Step 4: Local variable check
if ( (a2[81] & 4) != 0 )
    error(3485, ...);    // __managed__ on local variable

// Step 5: __grid_constant__ conflict
if ( (a2[164] & 4) != 0 && (*(WORD*)(a2 + 148) & 0x102) != 0 )
{
    // Determine which space string to display
    v4 = a2[148];
    v5 = "__constant__";
    if ( (v4 & 4) == 0 ) {
        v5 = "__managed__";
        if ( (a2[149] & 1) == 0 ) {
            v5 = "__shared__";
            if ( (v4 & 2) == 0 ) {
                v5 = "__device__";
                if ( (v4 & 1) == 0 )
                    v5 = "";
            }
        }
    }
    error(3577, ..., v5);   // incompatible with __grid_constant__
}

The space-name selection cascade (__constant__ > __managed__ > __shared__ > __device__ > empty) is used in error messages to show which memory space conflicts with __grid_constant__. The cascade tests bits in priority order, matching the most "restrictive" space first.
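Flattened into early returns, the cascade reads as follows. This is a sketch of the same priority order on raw bytes; space_display_name is a hypothetical name, not a function identified in the binary.

```c
#include <stdint.h>

/* Priority: __constant__ > __managed__ > __shared__ > __device__ > "". */
const char *space_display_name(uint8_t byte_148, uint8_t byte_149)
{
    if (byte_148 & 0x04) return "__constant__";
    if (byte_149 & 0x01) return "__managed__";
    if (byte_148 & 0x02) return "__shared__";
    if (byte_148 & 0x01) return "__device__";
    return "";
}
```

Because __managed__ also sets the __device__ bit, a managed variable (0x01 at +148, 0x01 at +149) is reported as "__managed__" rather than "__device__".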

apply_nv_device_attr -- sub_40EB80

The __device__ handler is dual-purpose: it handles both variables (a3 == 7) and functions (a3 == 11).

Entry point: Called from apply_one_attribute when attribute kind is 'W' (87).

Variable path (entity kind 7):

// sub_40EB80, variable branch
*(_BYTE *)(a2 + 148) |= 1;          // set __device__ bit

// Validation (identical to __managed__):
// 1. Error 3481 if __shared__ + __constant__ both set
// 2. Error 3482 if thread_local (byte +161 bit 7)
// 3. Error 3485 if local variable (byte +81 bit 2)
// 4. Error 3577 if __grid_constant__ conflict

Function path (entity kind 11):

// sub_40EB80, function branch
// Same qword +184 flag test that the __global__ handler uses for error 3469,
// with an additional escape on byte +176 bit 1
if ( (*(_QWORD *)(a2 + 184) & 0x800001000000LL) != 0x800000000000LL
     || (*(_BYTE *)(a2 + 176) & 2) != 0 )
{
    // Conflict with __global__
    if ( !dword_106BFF0 && (*(_BYTE *)(a2 + 182) & 0x40) != 0 )
        error(3481, ...);

    *(_BYTE *)(a2 + 182) |= 0x23;    // set device execution space

    // Local function with __global__ conflict
    if ( (*(_BYTE *)(a2 + 81) & 4) != 0 && (*(_BYTE *)(a2 + 182) & 0x40) != 0 )
        error(3688, ...);

    // __device__ on main()
    if ( a2 == qword_126EB70 && (*(_BYTE *)(a2 + 182) & 0x20) != 0 )
        error(3538, ...);
}
else
{
    // Flag test failed: report 3469 and leave the execution space byte untouched
    v14 = get_entity_display_name(a2);
    error(3469, ..., "__device__", v14);
}

// Check function parameters for missing default initializers
// (warning 3669 for parameters without defaults in device context)

The function path is documented in Execution Spaces -- here we focus on the variable path.

__shared__ and __constant__ Handlers

The __shared__ and __constant__ attribute handlers are dispatched through apply_one_attribute (sub_413240) when attribute kind codes 'Z' (90) and '[' (91) are encountered. Their variable-path logic mirrors __device__ and __managed__:

| Step | __shared__ ('Z') | __constant__ ('[') |
|---|---|---|
| Set memory space bit | byte +148 \|= 0x02 | byte +148 \|= 0x04 |
| Mutual exclusion (3481) | Check __constant__ bit (bit 2) | Check __shared__ bit (bit 1) |
| Thread-local check (3482) | Yes | Yes |
| Local variable check (3485) | Yes | Yes |
| __grid_constant__ conflict (3577) | Yes | Yes |

The __shared__ and __constant__ keywords apply only to variables (kind 7). Unlike __device__, they do not have a function-path branch -- there is no __shared__ or __constant__ function execution space.

Variable Declaration Processing

sub_4DEC90 -- variable_declaration

The top-level declaration processor (decls.c) performs additional CUDA-specific validation after attribute handlers have set the memory space bits. This function is 1098 lines and handles both normal variable declarations and static data member definitions.

CUDA-specific checks in variable_declaration:

| Error | Condition | Description |
|---|---|---|
| 149 | Memory space attribute at illegal scope | CUDA storage class at namespace scope (specific scenarios) |
| 892 | auto with __constant__ | auto-typed __constant__ variable |
| 893 | auto with CUDA attribute | auto-typed variable with other CUDA memory space |
| 3510 | __shared__ with VLA | __shared__ variable with variable-length array type |
| 3566 | __constant__ + constexpr + auto | __constant__ constexpr with auto deduction |
| 3567 | CUDA variable with VLA | CUDA memory-space variable with VLA type |
| 3568 | __constant__ + constexpr | __constant__ combined with constexpr |
| 3578 | CUDA attribute in discarded branch | CUDA attribute on variable in constexpr-if discarded branch |
| 3579 | CUDA attribute + structured binding | CUDA attribute at namespace scope with structured binding |
| 3580 | CUDA attribute on VLA | CUDA attribute on variable-length array |

Memory space string selection (used in error messages):

// sub_4DEC90, line ~357: selecting display name for the memory space
v50 = "__constant__";
if ( (v49 & 4) == 0 ) {
    v50 = "__managed__";
    if ( (*(_BYTE *)(v15 + 149) & 1) == 0 ) {
        v50 = "__host__ __device__" + 9;   // pointer arithmetic: = "__device__"
        if ( (v49 & 2) != 0 )
            v50 = "__shared__";
    }
}

The string "__device__" is produced by taking the string "__host__ __device__" and advancing by 9 bytes, skipping past "__host__ ". This is a binary-level optimization -- the compiler shares string storage between the combined "__host__ __device__" literal and the standalone "__device__" reference.
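The pointer arithmetic is easy to verify in isolation. The sketch below uses a single array and hypothetical accessor names; it shows that offset 9 of "__host__ __device__" is the start of "__device__", and makes no claim about how the original source spelled these strings.

```c
static const char HD_NAME[] = "__host__ __device__";

/* Length of the "__host__ " prefix: 8 characters plus the trailing space. */
enum { HOST_PREFIX_LEN = 9 };

const char *host_device_name(void)
{
    return HD_NAME;
}

/* Points into the same storage, 9 bytes in; reads as "__device__". */
const char *device_name(void)
{
    return HD_NAME + HOST_PREFIX_LEN;
}
```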

sub_4CA6C0 -- decl_variable

The core variable declaration function (1090 lines, decls.c:7730) handles CUDA memory space propagation during symbol table entry creation. Key behaviors:

Storage class mapping: When declaration state byte at offset +269 equals 5, it indicates a CUDA memory space storage class. The function performs a scope walk to determine the correct namespace scope for the variable. If a prior declaration exists at the same scope (dword_126C5DC == dword_126C5B4), the CUDA storage class is reset to allow redeclaration.

Scope walk: Traverses the scope chain (784-byte scope entries at qword_126C5E8, indexed by dword_126C5E4) upward through class scopes (scope_kind 4) and template scopes (bit 0x20 at scope entry +9), until reaching a non-class, non-template scope. This determines whether the variable is at namespace scope, class scope, or block scope.

Error 3483 -- memory space in non-device function: When a variable with a device memory space bit (+148 bit 0 set) is declared inside a function body, and the enclosing routine is NOT device-only, i.e. (byte +182 & 0x30) != 0x20, the function emits error 3483 with the storage kind and space name:

// From sub_4CA6C0, ~line 886-910
if (!at_namespace_scope) {
    char space = entity->byte_148;
    if (storage_class != 1 && (space & 0x01)) {
        routine_descriptor = qword_126C5D0;
        if (routine_descriptor) {
            entity_ptr = *(routine_descriptor + 32);
            if (entity_ptr && (entity_ptr[182] & 0x30) != 0x20) {
                const char *name = get_space_name(entity);  // priority cascade
                const char *kind = (storage_class == 2) ? "a static" : "an automatic";
                error(3483, source_loc, kind, name);
            }
        }
    }
}

File-scope device flag: When a __device__ variable is at file scope (dword_126C5D8 == -1), the function sets bit 4 of +148:

if ((entity->byte_148 & 0x01) && dword_126C5D8 == -1)
    entity->byte_148 |= 0x10;   // bit 4: device_at_file_scope

Redeclaration checking: When a variable is redeclared, the function compares memory space encoding at offset +136 (the attribute byte) between the existing and new entity. Error 1306 is emitted for mismatched CUDA memory spaces.

Memory space propagation: Calls sub_4C4750 (set_variable_attributes) for final attribute propagation, and sub_4CA480 (check_variable_redeclaration) for prior-declaration compatibility.

sub_4DC200 -- mark_defined_variable

Post-declaration validation for device-memory variables with external linkage (26 lines):

// sub_4DC200 -- mark_defined_variable (decompiled)
void mark_defined_variable(entity_t *a1, int a2) {
    if (a1[164] & 0x10) {   // already marked as defined
        if (!dword_106BFD0                    // cross-space checking not overridden
            && (a1[148] & 3) == 1             // __device__ set, __shared__ NOT set
            && !is_compiler_generated(a1)     // not compiler-generated
            && (a1[80] & 0x70) != 0x10)       // not anonymous
        {
            warning(3648, a1 + 64);           // external linkage warning
        }
    } else if (!a2 && (*(byte*)(*(qword*)a1 + 81) & 2)) {
        error(1655, ...);   // tentative definition of constexpr
    } else {
        // Same 3648 check on first definition
        if (!dword_106BFD0 && (a1[148] & 3) == 1 && ...)
            warning(3648, a1 + 64);
        a1[164] |= 0x10;   // mark as defined
    }
}

The condition (a1[148] & 3) == 1 tests that bit 0 (__device__) is set AND bit 1 (__shared__) is NOT set. This catches __device__ variables (including __device__ __constant__ and __device__ __managed__, since those have bit 0 set) but excludes __shared__ variables (which have bit 1 set). The check is NOT about __constant__ alone -- a pure __constant__ variable (only bit 2 set, value 0x04) would yield (0x04 & 3) == 0, failing the test. The p1.06 report's characterization of error 3648 as "constant with external linkage" is misleading; the actual condition is "device-accessible (non-shared) variable with external linkage."
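
The bit test is easy to misread, so here is a minimal sketch of it in isolation. The enum constants and the function name passes_3648_space_test are illustrative and do not come from the binary; only the expression (x & 3) == 1 and the bit positions are taken from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit names follow the byte +148 encoding described in this section.
   The identifiers are illustrative, not recovered symbols. */
enum {
    MS_DEVICE   = 0x01,  /* bit 0: __device__   */
    MS_SHARED   = 0x02,  /* bit 1: __shared__   */
    MS_CONSTANT = 0x04   /* bit 2: __constant__ */
};

/* Mirrors the (a1[148] & 3) == 1 test: bit 0 set, bit 1 clear.
   Bit 2 is masked out and therefore ignored. */
static bool passes_3648_space_test(uint8_t space_byte) {
    return (space_byte & 3) == 1;
}
```

Under this sketch, a byte of 0x05 (__device__ __constant__) passes, while 0x04 (__constant__ alone) and 0x03 (__device__ __shared__) do not.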

sub_4CC150 -- cuda_variable_fixup

Called from variable_declaration after CUDA constexpr-if detection. This function:

  • Manipulates variable entity fields at offset +148 (memory space) and +162 (visibility flags)
  • Adjusts scope chains using the 784-byte scope entry array
  • Creates new type entries for CUDA-specific variable rewriting

Bit Assignment Resolution

Two sweep reports provided conflicting bit assignments for byte +148:

| Source | bit 0 | bit 1 | bit 2 |
| --- | --- | --- | --- |
| p1.01 (attribute.c handlers) | __device__ | __shared__ | __constant__ |
| p1.06 (decls.c) | __constant__ | __shared__ | __managed__ |

The decompiled code resolves this definitively in favor of the p1.01 assignment. Two independent functions confirm it:

  1. sub_40E0D0 (apply_nv_managed_attr) sets a2[149] |= 1 (managed at +149) and a2[148] = v3 | 1 (device at +148 bit 0). The subsequent conflict check tests (v3 & 2) for __shared__ and (v3 & 4) for __constant__.

  2. sub_40EB80 (apply_nv_device_attr) performs *(_BYTE *)(a2 + 148) |= 1 (device at +148 bit 0), then uses the identical conflict test ((v9 & 2) != 0) + ((v9 & 4) != 0) == 2.

The canonical encoding is:

Byte +148:  bit 0 = __device__,  bit 1 = __shared__,  bit 2 = __constant__
Byte +149:  bit 0 = __managed__

The p1.06 report's alternative encoding is an analysis error, caused by mark_defined_variable (sub_4DC200) testing +148 & 3 == 1 in the context of error 3648. That test checks for __device__ set (bit 0) without __shared__ (bit 1) -- not for __constant__ at bit 0. The error was then characterized as "constant with external linkage" based on the error message text rather than the actual bit test.
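
As an illustration of the conflict test that both handlers share, the following sketch reproduces the arithmetic form quoted above. The function names are hypothetical; the expression and bit positions come from the decompiled handlers described in this section.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when both __shared__ (bit 1) and __constant__ (bit 2) are set.
   Uses the same form as the handlers: two 0/1 comparisons summed to 2. */
static bool shared_constant_conflict(uint8_t v) {
    return ((v & 2) != 0) + ((v & 4) != 0) == 2;
}

/* Hypothetical helper: applying __device__ sets bit 0 of the space byte. */
static uint8_t with_device(uint8_t v) {
    return (uint8_t)(v | 1);
}
```

The sum-of-booleans form is equivalent to (v & 6) == 6; the sketch keeps the decompiler's shape so it can be matched against the listing.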

Validation Constraints

managed Constraints

__managed__ has the strictest requirements among memory space annotations. All five checks occur in apply_nv_managed_attr (sub_40E0D0):

| Constraint | Binary test | Error | Description |
| --- | --- | --- | --- |
| Variables only | a3 != 7 | internal_error | __managed__ can only apply to variables, not functions or types |
| No shared+constant | ((old & 2) != 0) + ((old & 4) != 0) == 2 | 3481 | Both __shared__ and __constant__ already set |
| Not thread-local | (signed char)byte+161 < 0 | 3482 | Bit 7 of +161 = thread_local storage |
| Not reference/local | byte+81 & 4 | 3485 | Bit 2 of +81 = reference type or local variable |
| Not grid_constant | byte+164 & 4 and word +148 & 0x0102 | 3577 | __grid_constant__ parameter with managed or shared space |

The __managed__ keyword requires compute capability >= 3.0. This is verified at compilation time via version threshold comparisons (qword_126EF90 > 0x78B3, where 0x78B3 = 30899 in the CUDA version encoding scheme). The specific error code for architecture-too-low is not captured in the decompiled attribute handler.
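
The word +148 & 0x0102 test in the __grid_constant__ row of the table above reads bytes +148 and +149 in a single 16-bit load. A minimal sketch of why that mask covers both __shared__ and __managed__ follows; the raw byte-array view of the entity and the function names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Combines bytes +148 and +149 the way a little-endian (x86-64) 16-bit
   load would. `entity` is a hypothetical raw byte view of the node. */
static uint16_t space_word(const uint8_t *entity) {
    return (uint16_t)(entity[148] | ((uint16_t)entity[149] << 8));
}

/* 0x0002 is __shared__ (byte +148, bit 1);
   0x0100 is __managed__ (byte +149, bit 0). */
static bool shared_or_managed(const uint8_t *entity) {
    return (space_word(entity) & 0x0102) != 0;
}
```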

shared Constraints

__shared__ variables have restrictions enforced across multiple functions:

| Constraint | Where | Error | Description |
| --- | --- | --- | --- |
| No VLA type | sub_4DEC90 | 3510 | __shared__ variable cannot have variable-length array type |
| No VLA (general) | sub_4DEC90 | 3580 | CUDA memory-space attribute on variable-length array |
| Not thread-local | Attribute handler | 3482 | __shared__ on thread_local variable |
| Not local (non-block) | Attribute handler | 3485 | Cannot appear on local variables outside device function scope |
| No grid_constant | Attribute handler | 3577 | Incompatible with __grid_constant__ parameter |

constant Constraints

__constant__ carries additional restrictions related to constexpr and type:

| Constraint | Where | Error | Description |
| --- | --- | --- | --- |
| No constexpr | sub_4DEC90 | 3568 | __constant__ combined with constexpr (when managed+device bits also set) |
| No constexpr+auto | sub_4DEC90 | 3566 | Constexpr with const-qualified type |
| No VLA type | sub_4DEC90 | 3567 | CUDA memory-space variable with VLA type |
| Not thread-local | Attribute handler | 3482 | __constant__ on thread_local variable |
| Not local | Attribute handler | 3485 | Cannot appear on local variables |
| No grid_constant | Attribute handler | 3577 | Incompatible with __grid_constant__ parameter |

Note: Error 3648 (external linkage warning) is emitted by sub_4DC200 but the condition tests (byte+148 & 3) == 1, which checks for __device__ set without __shared__ -- not specifically __constant__. The check applies to any device-accessible non-shared variable, including __device__, __device__ __constant__, and __device__ __managed__.

Cross-Space Variable Access Checking

When host code references a device-side variable, the symbol reference recorder emits diagnostics. This checking occurs in record_symbol_reference_full (sub_72A650 / sub_72B510, symbol_ref.c) and is gated by global flags dword_106BFD0 and dword_106BFCC.

Gate Logic

1. Is cross-space checking enabled?
   → dword_106BFD0 != 0 OR dword_106BFCC != 0

2. Is the referenced entity a variable (kind == 7)?
   → Yes: proceed to nv_check_device_var_ref_in_host
   → No (kind 10/11/20 -- function): check nv_check_host_var_ref_in_device

3. Get current routine from scope stack (dword_126C5D8)
4. Check routine execution space at +182 (0x30 mask):
   → 0x00 or 0x10 (host): emit device-var-in-host errors
   → 0x20 (device): emit host-var-in-device errors

Device Variable Referenced from Host Code

The nv_check_device_var_ref_in_host path (assert string at symbol_ref.c:2347) checks memory space bits and produces specific errors based on which space the variable occupies:

| Error | Condition | Description |
| --- | --- | --- |
| 3548 | Variable has __shared__ or __constant__ (byte+148 bits 1-2) | Reference to __shared__ / __constant__ variable from host code |
| 3549 | Variable has __constant__ and reference is in initializer context (ref_kind bit 4) | Initializer referencing device memory variable from host |
| 3550 | Variable has __shared__ and reference is a write (ref_kind bit 1) | Write to __shared__ variable from host code |
| 3486 | Via sub_6BCF10 -- complex linkage check (+176 & 0x200000000002000, +166 == 5, +168 in [1,4]) | Illegal device variable reference from host (operator function context) |

Host Variable Referenced from Device Code

The nv_check_host_var_ref_in_device path (assert string at symbol_ref.c:2390) handles the reverse direction:

| Error | Condition | Description |
| --- | --- | --- |
| 3623 | Device-only function referenced outside device context | Use of __device__-only function outside the bodies of device functions |

The error 3623 has two context strings:

  • "outside the bodies of device functions" -- general case
  • "from a constexpr or consteval __device__ function" -- constexpr context

Relaxation: dword_106BF40

When dword_106BF40 is set (corresponding to --expt-relaxed-constexpr), and the current routine at +182 has the device annotation pattern (& 0x30 == 0x20) with +177 bit 1 set (explicit __device__), cross-space variable access checks are suppressed. This allows constexpr device functions to reference host variables during constant evaluation.

Host Reference Arrays

When the backend emits host-side code, __device__ and __constant__ variables, together with __global__ kernels, are registered in ELF section arrays so the CUDA runtime can discover them at load time. The emission function sub_6BCF80 (nv_emit_host_reference_array) writes entries into six separate sections:

| Section | Array Name | Memory Space | Linkage |
| --- | --- | --- | --- |
| .nvHRDE | hostRefDeviceArrayExternalLinkage | __device__ | External |
| .nvHRDI | hostRefDeviceArrayInternalLinkage | __device__ | Internal |
| .nvHRCE | hostRefConstantArrayExternalLinkage | __constant__ | External |
| .nvHRCI | hostRefConstantArrayInternalLinkage | __constant__ | Internal |
| .nvHRKE | hostRefKernelArrayExternalLinkage | __global__ (kernel) | External |
| .nvHRKI | hostRefKernelArrayInternalLinkage | __global__ (kernel) | Internal |

Each array entry contains the mangled name of the device symbol as a byte array:

extern "C" {
    extern __attribute__((section(".nvHRDE")))
    __attribute__((weak))
    const unsigned char hostRefDeviceArrayExternalLinkage[] = {
        /* mangled name bytes */ 0x0
    };
}

Six global lists (at addresses unk_1286780 through unk_12868C0) accumulate symbols during compilation, one per section type. Note that __shared__ variables do NOT get host reference arrays -- they have no host-visible address.
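
The section selection can be pictured as a two-key lookup on symbol class and linkage. The sketch below encodes only the six section names from the table above; the enums and the function host_ref_section are hypothetical and say nothing about how sub_6BCF80 actually organizes its six lists.

```c
#include <stddef.h>

/* Illustrative enums, not recovered types. */
typedef enum { REF_DEVICE, REF_CONSTANT, REF_KERNEL } ref_class_t;
typedef enum { LINK_EXTERNAL, LINK_INTERNAL } linkage_t;

/* Returns the ELF section name for a (class, linkage) pair,
   following the host reference array table. */
static const char *host_ref_section(ref_class_t cls, linkage_t link) {
    static const char *const names[3][2] = {
        { ".nvHRDE", ".nvHRDI" },  /* __device__ variables   */
        { ".nvHRCE", ".nvHRCI" },  /* __constant__ variables */
        { ".nvHRKE", ".nvHRKI" },  /* __global__ kernels     */
    };
    return names[cls][link];
}
```

There is deliberately no row for __shared__, matching the note above that such variables get no host reference array.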

Redeclaration Compatibility

When a variable is redeclared, decl_variable (sub_4CA6C0) compares the memory space bits between the prior declaration and the new one. Error 1306 is emitted for mismatched CUDA memory spaces:

Error 1306: CUDA memory space mismatch on redeclaration

The comparison tests byte +148 of both the existing entity and the new declaration's computed attributes. The CUDA memory space acts as an implicit storage class -- storage class value 5 in the declaration state (offset 269) indicates a CUDA-specific storage class that requires special scope-walking behavior.

String Table Usage

The memory space keywords appear in the binary's string table and are referenced by error message formatting code:

| String | Usage |
| --- | --- |
| "__constant__" | Error messages for __constant__ constraints, space name display |
| "__managed__" | Error messages for __managed__ constraints |
| "__device__" | Obtained via "__host__ __device__" + 9 (pointer arithmetic), or direct literal |
| "__shared__" | Error messages for __shared__ constraints |
| "__host__ __device__" | Combined string; +9 yields "__device__" |

The pointer-arithmetic trick for "__device__" appears in both sub_4DEC90 (variable_declaration) and error message formatting throughout the attribute handlers. It saves binary space by reusing the combined "__host__ __device__" string constant.
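
The offset of 9 is simply the length of the prefix "__host__ " (eight characters plus the space). A small sketch of the same trick, with HD_NAME and device_name as illustrative names:

```c
#include <string.h>

/* One literal serves both purposes: the whole string names the combined
   annotation, and skipping the 9-character prefix leaves "__device__". */
static const char HD_NAME[] = "__host__ __device__";

static const char *device_name(void) {
    return HD_NAME + 9;   /* strlen("__host__ ") == 9 */
}
```

Because both names share one NUL terminator, no second string constant is needed.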

Error Code Summary

Attribute Application Errors

| Error | Severity | Description |
| --- | --- | --- |
| 3481 | Error | Conflicting CUDA memory spaces (__shared__ + __constant__ simultaneously) |
| 3482 | Error | CUDA memory space attribute on thread_local variable |
| 3485 | Error | CUDA memory space attribute on local variable |
| 3577 | Error | Memory space incompatible with __grid_constant__ parameter |

Declaration Processing Errors

| Error | Severity | Description |
| --- | --- | --- |
| 149 | Error | Illegal CUDA storage class at namespace scope |
| 892 | Error | auto type with __constant__ variable |
| 893 | Error | auto type with CUDA memory space variable |
| 1306 | Error | CUDA memory space mismatch on redeclaration |
| 3483 | Error | Memory space qualifier on automatic/static variable in non-device function |
| 3510 | Error | __shared__ variable with variable-length array |
| 3566 | Error | __constant__ with constexpr and auto deduction |
| 3567 | Error | CUDA variable with VLA type |
| 3568 | Error | __constant__ combined with constexpr |
| 3578 | Error | CUDA attribute in constexpr-if discarded branch |
| 3579 | Error | CUDA attribute at namespace scope with structured binding |
| 3580 | Error | CUDA attribute on variable-length array |
| 3648 | Warning | Device-accessible (non-shared) variable with external linkage |

Cross-Space Reference Errors

| Error | Severity | Description |
| --- | --- | --- |
| 3486 | Error | Illegal device variable reference from host (operator function context) |
| 3548 | Error | Reference to __shared__ / __constant__ variable from host code |
| 3549 | Error | Initializer referencing device memory variable from host |
| 3550 | Error | Write to __shared__ variable from host code |
| 3623 | Error | Use of __device__-only function outside device context |

Global State Variables

| Variable | Type | Description |
| --- | --- | --- |
| dword_126EFA8 | int | CUDA mode flag (nonzero when compiling CUDA) |
| dword_126EFB4 | int | CUDA dialect (2 = CUDA C++) |
| dword_126EFAC | int | Extended CUDA features flag |
| dword_126EFA4 | int | CUDA version-check control |
| qword_126EF98 | int64 | CUDA version threshold (hex: 0x9E97 = 40599, 0x9D6C, etc.) |
| qword_126EF90 | int64 | CUDA version threshold (hex: 0x78B3 = 30899 for compute_30) |
| dword_106BFD0 | int | Enable cross-space reference checking (primary) |
| dword_106BFCC | int | Enable cross-space reference checking (secondary) |
| dword_106BF40 | int | Allow __device__ function refs in host (--expt-relaxed-constexpr) |
| dword_106BFF0 | int | Relaxed execution space mode (permits otherwise-illegal combos) |
| qword_126EB70 | ptr | Entity pointer for main() (prevents __device__ on main) |
| qword_126C5E8 | ptr | Scope stack base pointer (784-byte entries) |
| dword_126C5E4 | int | Current scope stack top index |
| dword_126C5D8 | int | Current function scope index (-1 if none) |

Function Map

| Address | Identity | Size | Source |
| --- | --- | --- | --- |
| sub_40AD80 | apply_nv_weak_odr_attr | 0.2 KB | attribute.c:10497 |
| sub_40E0D0 | apply_nv_managed_attr | 0.4 KB | attribute.c:10523 |
| sub_40E1F0 | apply_nv_global_attr (variant 1) | 0.9 KB | attribute.c |
| sub_40E7F0 | apply_nv_global_attr (variant 2) | 0.9 KB | attribute.c |
| sub_40EB80 | apply_nv_device_attr | 1.0 KB | attribute.c |
| sub_4108E0 | apply_nv_host_attr | 0.3 KB | attribute.c |
| sub_413240 | apply_one_attribute (dispatch) | 5.9 KB | attribute.c |
| sub_413ED0 | apply_attributes_to_entity | 4.9 KB | attribute.c |
| sub_40A310 | attribute_display_name | 0.6 KB | attribute.c:1307 |
| sub_4CA6C0 | decl_variable | 11 KB | decls.c:7730 |
| sub_4CC150 | cuda_variable_fixup | 1.2 KB | decls.c:20654 |
| sub_4DC200 | mark_defined_variable | 0.3 KB | decls.c |
| sub_4DEC90 | variable_declaration | 11 KB | decls.c:12956 |
| sub_6BC890 | nv_validate_cuda_attributes | 1.6 KB | nv_transforms.c |
| sub_6BCF10 | nv_check_device_variable_in_host | 0.2 KB | nv_transforms.c |
| sub_6BCF80 | nv_emit_host_reference_array | 0.8 KB | nv_transforms.c |
| sub_72A650 | record_symbol_reference_full (6-arg) | 6.6 KB | symbol_ref.c |
| sub_72B510 | record_symbol_reference_full (4-arg) | 7.3 KB | symbol_ref.c |

See Also

Cross-Space Call Validation

CUDA's execution model partitions code into host (CPU) and device (GPU) worlds. A function in one execution space cannot directly call a function in the other -- a __host__ function cannot call a __device__ function, and vice versa. cudafe++ enforces these rules at two points during compilation: at explicit call sites in expressions (expr.c) and at symbol reference recording time (symbol_ref.c). Together these checks cover both direct function calls and indirect references -- variable accesses, implicit constructor/destructor invocations, and template-instantiated calls. The validation produces 12 distinct calling error messages (6 normal + 6 constexpr-with-suggestion variants), plus 4 variable access errors and 1 device-only function reference error.

Key Facts

| Property | Value |
| --- | --- |
| Source files | expr.c (call site checks), symbol_ref.c (reference-time checks), class_decl.c (type hierarchy walk), nv_transforms.c (helpers) |
| Call-site checker | sub_505720 (check_cross_execution_space_call, 4.0 KB) |
| Template variant | sub_505B40 (check_cross_space_call_in_template, 2.7 KB) |
| Reference checker | sub_72A650 (record_symbol_reference_full, 6-arg, 659 lines) |
| Reference checker (short) | sub_72B510 (record_symbol_reference_full, 4-arg, 732 lines) |
| Type hierarchy walker | sub_41A1F0 (annotation helper, walks nested types for HD violations) |
| Type hierarchy entry | sub_41A3E0 (validates lambda/class HD annotation, calls sub_41A1F0) |
| Space name helper | sub_6BC6B0 (get_entity_display_name, 49 lines) |
| Trivial-device-copyable | sub_6BC680 (is_device_or_extended_device_lambda, 16 lines) |
| Device ref expression walker | sub_6BE330 (nv_scan_expression_for_device_refs, 89 lines) |
| Diagnostic emission | sub_4F7450 (multi-arg diagnostic), sub_4F8090 (type+entity diagnostic) |
| Calling errors | 3462, 3463, 3464, 3465, 3508 |
| Variable access errors | 3548, 3549, 3550, 3486 |
| Device-only function ref | 3623 |
| Type annotation errors | 3593, 3594, 3597, 3598, 3599, 3615, 3635, 3691 |
| Cross-space enable flag | dword_106BFD0 (primary), dword_106BFCC (secondary) |
| Device ref relaxation | dword_106BF40 (allow __device__ function refs in host) |
| Relaxed constexpr flag | dword_126EFB0 (also referenced as CLI flag 104) |

Execution Space Recall

The execution space is encoded at byte offset +182 of the entity (routine) node. The two-bit extraction byte & 0x30 classifies the routine:

| byte & 0x30 | Space | Meaning |
| --- | --- | --- |
| 0x00 | (none) | Implicit __host__ |
| 0x10 | __host__ | Explicit host-only |
| 0x20 | __device__ | Device-only |
| 0x30 | __host__ __device__ | Both spaces |

The 0x60 mask distinguishes __global__ kernels: (byte & 0x60) == 0x20 means plain __device__, while byte & 0x40 set means __global__.
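
The two masks can be exercised with a small classifier. The enum and function names below are illustrative; the mask values and their meanings are the ones stated above for byte +182.

```c
#include <stdint.h>

typedef enum {
    SPACE_IMPLICIT_HOST,  /* 0x00 */
    SPACE_HOST,           /* 0x10 */
    SPACE_DEVICE,         /* 0x20 */
    SPACE_HOST_DEVICE     /* 0x30 */
} exec_space_t;

/* Two-bit extraction with the 0x30 mask. */
static exec_space_t classify_space(uint8_t b182) {
    switch (b182 & 0x30) {
    case 0x10: return SPACE_HOST;
    case 0x20: return SPACE_DEVICE;
    case 0x30: return SPACE_HOST_DEVICE;
    default:   return SPACE_IMPLICIT_HOST;
    }
}

/* 0x40 set means __global__. */
static int is_global_kernel(uint8_t b182) {
    return (b182 & 0x40) != 0;
}

/* (b & 0x60) == 0x20: device bit set, __global__ bit clear.
   The host bit (0x10) is outside the mask, so 0x30 also matches. */
static int is_device_not_global(uint8_t b182) {
    return (b182 & 0x60) == 0x20;
}
```

Note that the 0x60 test does not look at the host bit, so it is true for both __device__ and __host__ __device__ routines.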

Additional flags at byte +177 encode secondary space information:

| Bit | Mask | Meaning |
| --- | --- | --- |
| 0 | 0x01 | __host__ annotation present |
| 1 | 0x02 | __device__ annotation present |
| 2 | 0x04 | constexpr device |
| 4 | 0x10 | implicitly HD / __forceinline__ relaxation |

The +177 & 0x10 bit is the critical bypass: when set, the function is treated as implicitly __host__ __device__ and exempt from cross-space checks. This covers constexpr functions (which are implicitly HD since CUDA 7.5) and __forceinline__ functions (which the compiler may allow to be instantiated in either space).
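
For reference, the +177 flags can be written as named constants. The identifiers are hypothetical; the values come from the table above. The sketch also shows that the 0x15 mask used by the device-only function reference check on this page is the union of the host, constexpr-device, and implicitly-HD bits.

```c
#include <stdint.h>

/* Illustrative names for the byte +177 flags. */
enum {
    F177_HOST             = 0x01,  /* bit 0: __host__ annotation present   */
    F177_DEVICE           = 0x02,  /* bit 1: __device__ annotation present */
    F177_CONSTEXPR_DEVICE = 0x04,  /* bit 2: constexpr device              */
    F177_IMPLICIT_HD      = 0x10   /* bit 4: implicitly HD                 */
};

/* The bypass described above: bit 4 set exempts the routine. */
static int bypasses_cross_space(uint8_t b177) {
    return (b177 & F177_IMPLICIT_HD) != 0;
}
```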

The Implicitly-HD Bypass

Before any cross-space error is emitted, both the caller and callee are tested for the implicitly-HD condition. The exact binary test is:

// Implicitly-HD check (appears in both sub_505720 and sub_505B40)
// entity: pointer to routine entity node

bool is_implicitly_hd(int64_t entity) {
    // Check 1: bit 0x10 at +177 (constexpr/forceinline HD)
    if ((*(uint8_t*)(entity + 177) & 0x10) != 0)
        return true;

    // Check 2: deleted function with specific annotation combo
    // +184 is an 8-byte extended flags field
    // 0x800000000000 = deleted bit, 0x1000000 = explicit annotation
    // If deleted but NOT explicitly annotated, AND byte+176 bit 1 is clear:
    if ((*(uint64_t*)(entity + 184) & 0x800001000000LL) == 0x800000000000LL
        && (*(uint8_t*)(entity + 176) & 2) == 0)
        return true;

    return false;
}

This means:

  1. constexpr functions -- the +177 & 0x10 bit is set during attribute processing, making them callable from both host and device code without explicit annotation.
  2. __forceinline__ functions -- same bit, allowing cross-space inlining.
  3. Implicitly-deleted functions -- defaulted special members (constructors, destructors, assignment operators) that are deleted due to non-copyable members. These get a pass because they will never actually be called.

If either the caller or the callee is implicitly HD, the cross-space check returns immediately without error.
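
To experiment with the mask arithmetic outside the decompiler, the same test can be run against a plain byte buffer. This is a sketch under the assumption that the entity can be modelled as a raw byte array with the offsets given above; is_implicitly_hd_sketch is an illustrative name.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Byte-buffer version of the implicitly-HD test.
   `entity` is a hypothetical raw view of a routine entity node. */
static bool is_implicitly_hd_sketch(const uint8_t *entity) {
    uint64_t ext;

    /* Check 1: bit 0x10 at +177 */
    if ((entity[177] & 0x10) != 0)
        return true;

    /* Check 2: 8-byte flags at +184, copied out to avoid an unaligned read.
       Deleted bit (0x800000000000) set, explicit bit (0x1000000) clear,
       and bit 1 of +176 clear. */
    memcpy(&ext, entity + 184, sizeof ext);
    return (ext & 0x800001000000ULL) == 0x800000000000ULL
        && (entity[176] & 2) == 0;
}
```

The combined mask 0x800001000000 selects both bits at once, so comparing against 0x800000000000 checks "deleted set" and "explicit clear" in a single comparison.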

Call-Site Validation: sub_505720

check_cross_execution_space_call is called during expression scanning in scan_expr_full whenever a function call expression is processed. It takes three parameters:

// sub_505720 -- check_cross_execution_space_call
// a1: entity pointer of the callee function (may be NULL)
// a2: bool -- if true, this is a "must be callable" context (__global__ launch)
// a3: source location pointer for diagnostics
// returns: char (nonzero if diagnostic was emitted)
char check_cross_execution_space_call(int64_t callee, bool must_callable, uint64_t *src_loc);

Algorithm

The function follows a multi-stage gate structure. At each gate, an early return can skip the check entirely:

Gate 1 -- Class scope suppression. If we are inside a class definition scope (dword_126C5C8 != -1) and the current scope has device-scope flags set (scope_entry[6] & 0x06), AND we are inside a type node context (dword_106B670 != -1, type_entry[5] & 0x08), the check is suppressed. This allows member function declarations inside device classes to reference host functions without error -- the actual check happens when the member is instantiated/defined.

Gate 2 -- Diagnostic suppression scope. If the current scope entry has diagnostic-suppression bit 1 of byte +14 set (scope_entry[14] & 0x02), checks are suppressed. This covers SFINAE contexts and decltype evaluation.

Gate 3 -- Concept/requires context. If the current context pointer (qword_106B970) is non-null and byte +17 has bit 1 set (strict-mode or concept context), checks are suppressed.

Gate 4 -- No enclosing function. If dword_126C5D8 == -1 (no enclosing function scope), the caller space defaults to host-only (v7=0, v8=1) -- meaning we are at file scope, which is implicitly host.

Gate 5 -- Extract caller space. The enclosing function entity is retrieved from the scope stack at qword_126C5E8 + 784 * dword_126C5D8 + 224. Its execution space is extracted:

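  • v7 = (caller[182] & 0x60) == 0x20 -- caller has the __device__ bit set and is not a __global__ kernel (true for both __device__ and __host__ __device__ callers; 0 for the file-scope default in Gate 4)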
  • v8 = (caller[182] & 0x30) != 0x20 -- caller is NOT device-only
  • v5 = (caller[-8] & 0x10) != 0 -- caller has secondary device mark (the -8 offset reads a flags byte 8 bytes before the entity, in the preceding allocation header)
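
The address computation in Gate 5 can be sketched as follows. The scope stack is modelled as a flat byte array; scope_base stands in for qword_126C5E8 and func_idx for dword_126C5D8. The function name and the constant names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    SCOPE_ENTRY_SIZE     = 784,  /* bytes per scope entry                 */
    SCOPE_ROUTINE_OFFSET = 224   /* offset of the routine entity pointer  */
};

/* Returns the enclosing routine entity pointer, or NULL when there is
   no enclosing function scope (index -1). */
static void *enclosing_routine(const uint8_t *scope_base, int func_idx) {
    void *routine;
    if (func_idx == -1)
        return NULL;
    memcpy(&routine,
           scope_base + (size_t)SCOPE_ENTRY_SIZE * (size_t)func_idx
                      + SCOPE_ROUTINE_OFFSET,
           sizeof routine);
    return routine;
}
```

The byte at +182 of the returned entity is then what the v7 and v8 tests inspect.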

Gate 6 -- Caller implicitly HD. The caller is tested for implicitly-HD status. If true, return immediately.

Gate 7 -- Callee implicitly HD. The callee (parameter a1) is tested for implicitly-HD status. If true, return immediately.

Gate 8 -- No caller entity or secondary device. If no caller entity exists or the secondary device flag is set, skip to the __global__ check.

Error Decision Logic

After passing all gates, the function computes which error to emit based on caller/callee space combination:

// Pseudocode for the error decision tree

bool callee_is_not_device = (callee[182] & 0x30) != 0x20;   // v3: callee is not device-only
bool callee_is_host_only  = (callee[182] & 0x60) == 0x20;   // v4: tests device bit set, __global__ bit clear
bool callee_is_global     = (callee[182] & 0x40) != 0;       // v11 in some paths
bool caller_is_host_only  = (caller[182] & 0x60) == 0x20;    // v7: tests device bit set, __global__ bit clear
bool caller_not_device    = (caller[182] & 0x30) != 0x20;    // v8: caller is not device-only
bool has_forceinline      = (caller[181] & 0x20) != 0;       // caller-side __forceinline__ relaxation

if (caller_is_host_only && caller_not_device) {
    // Caller is __host__ __device__ (both flags set)
    if (has_forceinline || callee_is_not_device || !callee_is_host_only)
        goto global_check;

    // HD caller calling device-only callee (callee_is_not_device is false here)
    if (!is_device_or_extended_lambda(callee)) {
        char *caller_name = get_entity_display_name(caller, 0);
        char *callee_name = get_entity_display_name(callee, 1);
        int errcode = 3462 + ((callee[177] & 0x02) != 0);  // 3462 or 3463
        emit_diagnostic(errcode, src_loc, callee_name, caller_name);
    }
} else if (caller_not_device) {
    // Caller is host-only, callee is device-only
    if (has_forceinline || callee_is_not_device || !callee_is_host_only)
        goto global_check;

    // Check relaxed-constexpr bypass
    if ((callee[177] & 0x02) != 0 && dword_106BF40) {
        // Callee has __device__ annotation AND relaxation flag is set
        if (must_callable && !callee_is_global)
            goto global_check;  // suppress for __global__ must-call context
        return 0;               // otherwise suppress entirely
    }

    // Check constexpr-device bypass
    if ((callee[177] & 0x04) != 0)
        goto global_check;  // constexpr device functions get a pass

    // Host caller calling device-only callee
    char *caller_name = get_entity_display_name(caller, 0);
    char *callee_name = get_entity_display_name(callee, 1);
    int errcode = 3465 - ((callee[177] & 0x02) == 0);  // 3464 or 3465
    emit_diagnostic(errcode, src_loc, callee_name, caller_name);
}

global_check:
if (must_callable && !callee_is_global) {
    // must_callable is true but callee is not __global__
    // (this path is for __global__ launch checks)
    // no error here -- fall through
} else if (!must_callable && callee_is_global) {
    // __global__ function called from wrong context
    if (callee_is_host_only) {
        // __global__ called from host-only -- "cannot be called from host"
        emit_diagnostic(3508, src_loc, "host", "cannot");
    } else if (!callee_is_host_only) {
        // __global__ called from __device__ context
        emit_diagnostic(3508, src_loc, "__device__", "cannot");
    }
} else if (must_callable || !callee_is_global) {
    return;  // no __global__ issue
} else {
    emit_diagnostic(3508, src_loc, "__global__", "must");
}

Error 3462 vs 3463 (Device-from-Host Direction)

The distinction between errors 3462 and 3463 is the +177 & 0x02 bit on the callee -- whether it has an explicit __device__ annotation:

  • 3462: __device__ function called from __host__ context. The callee has no explicit __device__ annotation (it was implicitly device-only).
  • 3463: Same violation, but the callee has explicit __device__ annotation. The error message includes an additional note about the __host__ __device__ context.

The computation: 3462 + ((callee[177] & 0x02) != 0) yields 3462 when the bit is clear, 3463 when set.

Error 3464 vs 3465 (Host-from-Device Direction)

Similarly for the reverse direction:

  • 3464: __host__ function called from __device__ context; the callee does NOT have the explicit __device__ annotation (bit clear, so 1 is subtracted).
  • 3465: Same violation, but the callee has the explicit __device__ annotation (bit set, nothing subtracted).

The computation: 3465 - ((callee[177] & 0x02) == 0) yields 3464 when the bit is clear, 3465 when set.
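
Both computations rely on a C comparison evaluating to 0 or 1. A minimal sketch of the two expressions follows, with hypothetical function names; callee_b177 stands for byte +177 of the callee.

```c
#include <stdint.h>

/* HD-caller path: 3462 when bit 0x02 is clear, 3463 when it is set. */
static int hd_caller_error(uint8_t callee_b177) {
    return 3462 + ((callee_b177 & 0x02) != 0);
}

/* Host-caller path: 3464 when bit 0x02 is clear, 3465 when it is set. */
static int host_caller_error(uint8_t callee_b177) {
    return 3465 - ((callee_b177 & 0x02) == 0);
}
```

In both pairs the higher error number corresponds to a callee that carries the explicit __device__ annotation.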

Error 3508 (global Misuse)

Error 3508 is a parameterized error with two string arguments: the context string and the verb. The combinations are:

| Context | Verb | Meaning |
| --- | --- | --- |
| "host" | "cannot" | __global__ function cannot be called from __host__ code directly (must use <<<>>>) |
| "__device__" | "cannot" | __global__ function cannot be called from __device__ code |
| "__host__ __device__" + 9 = "__device__" | "cannot" | Same, from HD context with device focus |
| "__global__" | "must" | A __global__ function must be called with <<<>>> syntax |

Template Variant: sub_505B40

check_cross_space_call_in_template performs the same validation but is called during template instantiation rather than initial expression scanning. It differs from sub_505720 in five ways:

  1. Guard on dword_126C5C4 == -1: only runs when no nested class scope is active. If dword_126C5C4 != -1, the entire function is skipped -- template instantiation inside nested class definitions defers cross-space checks.

  2. Additional scope guards: checks scope_entry[4] != 12 (not a namespace scope) and qword_106B970 + 17 & 0x40 == 0 (not in a concept context). These prevent false positives during dependent name resolution.

  3. No return value: returns void instead of char. It only emits diagnostics; it does not report whether a diagnostic was emitted.

  4. Error code selection: uses 3463 - ((callee[177] & 0x02) == 0) for the HD-caller case (yielding 3462 or 3463), and 3465 - ((callee[177] & 0x02) == 0) for the host-caller case (yielding 3464 or 3465). The __global__ error always uses "must" verb.

  5. No must_callable parameter: the template variant does not handle the must/cannot distinction for __global__. It always emits 3508 with "__global__" and "must" if the callee is __global__.

Complete Calling Error Matrix

The following matrix shows which errors fire for each caller/callee space combination:

| Caller \ Callee | __host__ | __device__ | __host__ __device__ | __global__ |
| --- | --- | --- | --- | --- |
| __host__ (explicit) | OK | 3464 or 3465 | OK | 3508 ("must") |
| __device__ | 3462 or 3463 | OK | OK | 3508 ("cannot") |
| __host__ __device__ | OK | 3462 or 3463 | OK | 3508 |
| (no annotation) = host | OK | 3464 or 3465 | OK | 3508 ("must") |
| __global__ | OK | OK | OK | 3508 ("cannot") |

Entries marked "OK" pass the cross-space check without error. The specific error (3462 vs 3463, 3464 vs 3465) depends on whether the callee has the +177 & 0x02 bit (explicit __device__ annotation).

Bypass Conditions (No Error Despite Mismatch)

Even when the matrix says an error should fire, the following conditions suppress it:

  1. Caller or callee is implicitly HD (+177 & 0x10): constexpr functions, __forceinline__ functions, implicitly-deleted special members.
  2. Caller has __forceinline__ relaxation (+181 & 0x20): the caller has a __forceinline__ attribute that relaxes cross-space restrictions.
  3. Callee is a device lambda that passes trivial-device-copyable check (sub_6BC680 returns true): extended lambda optimization.
  4. Callee has constexpr-device flag (+177 & 0x04): constexpr functions marked for device use.
  5. dword_106BF40 is set and callee has explicit __device__ (+177 & 0x02): the --expt-relaxed-constexpr or similar flag allows device function references from host code.
  6. Current scope has diagnostic suppression (scope_entry[14] & 0x02): SFINAE context.
  7. Concept/requires context (qword_106B970 + 17 & 0x40).

The 12 Calling Error Messages

cudafe++ emits 6 base error messages for cross-space call violations. Each has a variant that adds a --expt-relaxed-constexpr suggestion when the callee is a constexpr function, yielding 12 total messages:

| Error | Direction | Context | Suggestion? |
| --- | --- | --- | --- |
| 3462 | device called from host | Callee lacks explicit __device__ | No |
| 3463 | device called from HD | Callee has explicit __device__ (HD context note) | No |
| 3464 | host called from device | Callee lacks explicit __device__ (bit clear, 1 subtracted from 3465) | No |
| 3465 | host called from device | Callee has explicit __device__ | No |
| 3508 | __global__ context error | Parameterized: "must" / "cannot" + space string | No |
| 3462+constexpr | device called from host | constexpr callee | Yes: --expt-relaxed-constexpr |
| 3463+constexpr | device called from HD | constexpr callee | Yes |
| 3464+constexpr | host called from device | constexpr callee | Yes |
| 3465+constexpr | host called from device | constexpr callee | Yes |
| 3508+constexpr | __global__ context | constexpr callee | Yes |

The constexpr suggestion variants are selected by the relaxed-constexpr flag state. When dword_106BF40 (the --expt-relaxed-constexpr relaxation flag) is NOT set and the callee has constexpr annotations, the error message includes a note suggesting the flag to resolve the issue.

Variable Access Validation: symbol_ref.c

The record_symbol_reference_full functions (sub_72A650 / sub_72B510) enforce cross-space rules at the symbol reference level. This is a different check point than the call-site checker -- it catches variable accesses and implicit function references that are not explicit function calls.

Reference Kind Bitmask (Parameter a1)

The first parameter encodes the kind of reference being made:

| Bit | Mask | Meaning |
| --- | --- | --- |
| 0 | 0x01 | Address reference (&var) |
| 1 | 0x02 | Write reference (assignment target) |
| 2 | 0x04 | Non-modifying reference (read) |
| 3 | 0x08 | Direct use |
| 4 | 0x10 | Initializer |
| 5 | 0x20 | Potential modification |
| 6 | 0x40 | Move reference |
| 10 | 0x400 | Template argument |
| 13 | 0x2000 | ODR-use |
| 15 | 0x8000 | Negative offset |
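
As a reading aid, the bitmask can be expressed as named constants with a small test helper. The identifiers are illustrative and not recovered from the binary; the mask values are those in the table above.

```c
#include <stdint.h>

/* Illustrative names for the reference-kind bits. */
enum {
    RK_ADDRESS         = 0x0001,  /* bit 0  */
    RK_WRITE           = 0x0002,  /* bit 1  */
    RK_READ            = 0x0004,  /* bit 2  */
    RK_DIRECT_USE      = 0x0008,  /* bit 3  */
    RK_INITIALIZER     = 0x0010,  /* bit 4  */
    RK_POTENTIAL_MOD   = 0x0020,  /* bit 5  */
    RK_MOVE            = 0x0040,  /* bit 6  */
    RK_TEMPLATE_ARG    = 0x0400,  /* bit 10 */
    RK_ODR_USE         = 0x2000,  /* bit 13 */
    RK_NEGATIVE_OFFSET = 0x8000   /* bit 15 */
};

/* Nonzero when any bit of `mask` is present in `ref_kind`. */
static int has_ref_kind(uint32_t ref_kind, uint32_t mask) {
    return (ref_kind & mask) != 0;
}
```

A reference can carry several kinds at once, for example RK_DIRECT_USE | RK_INITIALIZER, which is why the checks on this page test individual bits rather than comparing for equality.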

Device Variable in Host Code (Errors 3548-3550)

When the referenced symbol is a variable (kind == 7), record_symbol_reference_full checks whether a __shared__, __constant__, or __managed__ variable is accessed from host code:

// nv_check_device_var_ref_in_host (inlined in sub_72A650, lines ~330-374)
// var_info: *(entity + 88) -- pointer to variable descriptor

uint64_t caller = 0;
if (dword_126C5D8 != -1)  // enclosing function scope exists
    caller = *(qword_126C5E8 + 784 * dword_126C5D8 + 224);

uint8_t mem_flags = *(var_info + 148);     // memory space flags
const char *space_name;
if (mem_flags & 0x02)
    space_name = "__shared__";
else if (mem_flags & 0x04)
    space_name = "__constant__";
else
    space_name = "";  // no specific space (managed or other)

// Exemption: __device__ (+148 bit 0) together with __managed__ (+149 bit 0,
// which is 0x100 in this 16-bit read) may be referenced from host code
if ((*(uint16_t*)(var_info + 148) & 0x0101) == 0x0101)
    return;  // __device__ + __managed__

// Only check if: has device memory annotation, there is a caller,
// caller is NOT device-only, caller is not implicitly-HD
if ((ref_kind & 0x12040) == 0       // not a transparent reference
    && (mem_flags & 0x07) != 0       // has device memory annotation
    && caller != 0
    && (*(caller + 182) & 0x30) != 0x20   // caller NOT device-only
    && (*(caller + 177) & 0x10) == 0      // caller NOT implicitly HD
    && !is_implicitly_hd(caller))          // extended implicit-HD check
{
    if (ref_kind & 0x08)  // direct use
        emit_diag(3548, src_loc, space_name, entity);  // "reference to __shared__"

    if (ref_kind & 0x10)  // initializer
        emit_diag(3549, src_loc, space_name, entity);  // "initializer for __constant__"

    if ((mem_flags & 0x02) && (ref_kind & 0x20))  // __shared__ + potential modification (bit 5)
        emit_diag(3550, src_loc, space_name, entity);  // "write to __shared__"
}
| Error | Condition | Message |
| --- | --- | --- |
| 3548 | Direct use of __shared__/__constant__ variable from host | Reference to device memory variable from host code |
| 3549 | Initializer referencing __shared__/__constant__ from host | Cannot initialize from host |
| 3550 | Write to __shared__ variable from host | Cannot write to shared memory from host |
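
The decision logic in the pseudocode above can be condensed into a pure function for experimentation. This is a minimal sketch under stated assumptions: the function name and the ERR_* constants are hypothetical, and the three caller-side conditions (a caller exists, it is not device-only, it is not implicitly HD) are collapsed into one boolean parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result bits, one per diagnostic number. */
enum { ERR_3548 = 1, ERR_3549 = 2, ERR_3550 = 4 };

/* ref_kind: reference-kind bitmask (parameter a1 in the binary).
 * mem_flags16: the 16-bit value read at var_info + 148.
 * caller_is_checked_host: assumed summary of the caller conditions. */
static unsigned host_var_ref_errors(uint32_t ref_kind, uint16_t mem_flags16,
                                    bool caller_is_checked_host)
{
    uint8_t mem_flags = (uint8_t)mem_flags16;  /* low byte = memory space flags */
    unsigned errs = 0;

    if ((mem_flags16 & 0x0101) == 0x0101)  /* managed + exemption flag */
        return 0;
    if ((ref_kind & 0x12040) != 0)         /* transparent reference */
        return 0;
    if ((mem_flags & 0x07) == 0)           /* no device memory annotation */
        return 0;
    if (!caller_is_checked_host)
        return 0;

    if (ref_kind & 0x08)
        errs |= ERR_3548;
    if (ref_kind & 0x10)
        errs |= ERR_3549;
    if ((mem_flags & 0x02) && (ref_kind & 0x20))
        errs |= ERR_3550;
    return errs;
}
```

A single reference can raise more than one diagnostic: a direct use of a __shared__ variable that is also a potential modification yields both 3548 and 3550 in this model.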

Device-Only Function Reference (Error 3623)

For function-type symbols (kind 10 or 11, or concept kind 20), the check validates that __device__-only functions are not referenced from host code:

// nv_check_device_function_ref_in_host (inlined in sub_72A650, lines ~382-454)
// entity: the function being referenced
// entity + 88 -> routine info (for kind 10/11)
// entity + 88 -> +192 for concepts (kind 20)

int64_t routine_info = ...;  // resolve through type chain
if (routine_info == 0)
    return;

// Only check if: has device annotation, is device-only,
// has no implicit-HD flags
if ((*(routine_info + 191) & 0x01) == 0     // not a coroutine exemption
    || (*(routine_info + 182) & 0x30) != 0x20  // not device-only
    || (*(routine_info + 177) & 0x15) != 0)    // has HD/host/constexpr flags
    return;

// Check if already exempted by extended flags
if (is_implicitly_hd(routine_info))
    return;

// Determine caller context
int64_t caller_routine = 0;
// Default context; also used when there is no enclosing routine
// (the original control flow jumps straight to the emit step in that case)
const char *context = "outside the bodies of device functions";

if (dword_126C5D8 != -1) {
    caller_routine = *(qword_126C5E8 + 784 * dword_126C5D8 + 224);
} else if (dword_126C5B8) {
    // Walk scope stack to find enclosing try block
    int scope_idx = dword_126C5E4;
    int64_t entry = 0;  // declared outside the loop: it is read after the loop
    while (scope_idx != -1) {
        entry = qword_126C5E8 + 784 * scope_idx;
        if (*(int32_t*)(entry + 408) != -1)  // has try block
            break;
        scope_idx = *(int32_t*)(entry + 560);  // parent scope
    }
    if (scope_idx == -1) return;
    caller_routine = *(entry + 224);
}

if (caller_routine != 0) {
    if (is_implicitly_hd(caller_routine)) return;

    if ((*(caller_routine + 182) & 0x30) == 0x20) {
        // Caller is __device__-only
        if ((*(caller_routine + 177) & 0x05) == 0)
            return;  // no constexpr/consteval markers
        context = "from a constexpr or consteval __device__ function";
    }
}

const char *name = *(routine_info + 8);  // function name
if (!name) name = "";
emit_diagnostic(3623, src_loc, name, context);

Error 3623 has two context strings:

  • "outside the bodies of device functions" -- the reference is from file scope or host code
  • "from a constexpr or consteval __device__ function" -- the reference is from a constexpr/consteval device function that cannot actually call the target
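
The choice between the two context strings depends only on the caller's bytes at +182 and +177, so it can be sketched as a standalone function. The function name and parameter layout are hypothetical, and the is_implicitly_hd exemption is deliberately left out of this model.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the context string for error 3623, or NULL when no diagnostic
 * is issued. has_caller == false models a reference from file scope. */
static const char *error_3623_context(bool has_caller,
                                      uint8_t caller_byte_182,
                                      uint8_t caller_byte_177)
{
    if (has_caller && (caller_byte_182 & 0x30) == 0x20) {
        /* Caller is __device__-only: only the constexpr/consteval
         * variants (bits 0x05 at +177) are diagnosed. */
        if ((caller_byte_177 & 0x05) == 0)
            return NULL;
        return "from a constexpr or consteval __device__ function";
    }
    return "outside the bodies of device functions";
}
```

An ordinary __device__ caller produces no diagnostic, which matches the early return in the pseudocode.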

The dword_106BFD0 / dword_106BFCC Gate

Both record_symbol_reference_full variants gate the cross-space device-reference scan (sub_6BE330) with:

if (dword_106BFD0 || dword_106BFCC) {
    // Cross-space reference checking is enabled
    if (!qword_126C5D0                                    // no current routine descriptor
        || *(qword_126C5D0 + 32) == 0                    // no routine entity
        || (*(*(qword_126C5D0 + 32) + 182) & 0x30) != 0x20  // not device-only
        || (dword_106BF40 && (*(*(qword_126C5D0 + 32) + 177) & 0x02) != 0))
    {
        // Call sub_6BE330 to walk expression tree for device references
        nv_scan_expression_for_device_refs(entity);
    }
}

The scan is skipped when the current routine is __device__-only, because device code referencing other device symbols is always valid. The dword_106BF40 term narrows that exemption: as the condition above is written, when the flag is set and the routine has the explicit __device__ annotation bit (+177 & 0x02), the scan runs even for a device-only routine.
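
The gate condition in the pseudocode above can be isolated as a pure predicate, which makes the four-way OR easier to reason about. This is a sketch with hypothetical names; the two separate null checks in the original (routine descriptor and routine entity) are modeled as a single nullable pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Only the two bytes the gate inspects; the struct name is hypothetical. */
struct routine_tag {
    uint8_t byte_177;
    uint8_t byte_182;
};

/* gate_primary / gate_secondary model dword_106BFD0 / dword_106BFCC,
 * flag_bf40 models dword_106BF40, and `current` is NULL when there is no
 * current routine. Returns true when the device-reference scan would run. */
static bool should_scan_for_device_refs(bool gate_primary, bool gate_secondary,
                                        bool flag_bf40,
                                        const struct routine_tag *current)
{
    if (!gate_primary && !gate_secondary)
        return false;                              /* checking disabled */
    if (current == NULL)
        return true;                               /* no routine context */
    if ((current->byte_182 & 0x30) != 0x20)
        return true;                               /* not device-only */
    return flag_bf40 && (current->byte_177 & 0x02) != 0;
}
```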

Type Hierarchy Walk: sub_41A1F0 / sub_41A3E0

The type hierarchy walkers handle a different class of violation: when a __host__ __device__ or __device__ annotation is applied to a class or lambda whose member types contain HD-incompatible nested types. These functions live in class_decl.c and are called during class completion.

sub_41A3E0 (Entry Point)

This function validates a complete type annotation context. It receives a lambda/class info structure and checks multiple conditions:

// sub_41A3E0 -- validate_type_hd_annotation
// a1: type annotation context structure
//   +8:  entity pointer
//   +32: flags byte (bit 0 = has_host, bit 3 = has_conflict, bit 4 = has_device,
//                     bit 5 = has_virtual)
//   +36: source location
// a2: 0 = __host__ __device__, nonzero = __device__ only
// a3: enable additional nested check (for OptiX path)

char *space_name = (a2 == 0) ? "__host__ __device__" : "__device__";

// Error 3615: duplicate HD annotation conflict
if (a2 == 0 && (flags & 0x01))
    emit_diag(3615, src_loc);

// Error 3593: conflict between __host__ and __device__ on type
if (flags & 0x08) {
    if (entity && entity[163] < 0) {  // entity has device-negative flag
        if ((flags & 0x18) != 0x18)
            goto check_members;
        emit_diag(3635, src_loc);  // both __host__ and __device__ + conflict
    } else {
        emit_diag(3593, src_loc, space_name);
    }
}

// Error 3594: virtual function in __device__ context
if (flags & 0x20 || ...)
    emit_diag(3594, src_loc, space_name);

// Recurse into member types
walk_type_for_hd_violations(type_entry, src_loc, a2);  // sub_41A1F0

// Error 3691: nested OptiX check
if (a3 && (flags & 0x10))
    emit_diag(3691, src_loc, space_name);

sub_41A1F0 (Recursive Type Walker)

This function walks the type hierarchy to find nested violations. It uses sub_7A8370 (is-array-type check) and sub_7A9310 (get-array-element-type) to traverse through arrays, and walks through cv-qualified type wrappers (kind == 12) by following the +144 pointer chain.

// sub_41A1F0 -- walk_type_for_hd_violations (recursive)
// a1: type node pointer
// a2: source location pointer
// a3: 0 = HD mode, nonzero = device-only mode

char *space_name = (a3) ? "__device__" : "__host__ __device__";

if (a1 == 0 || !is_valid_type(a1)) {
    // Base case: no type to check, or check passed at top level
    goto label_20;
}

int depth = 0;
int64_t current = a1;
do {
    if (!is_array_type(current)) {  // sub_7A8370
        // Not an array -- check this type for violations
        if (depth > 7)
            emit_diag(3597, src_loc, space_name, a1);  // nesting depth exceeded

        // Walk through cv-qualified wrappers
        while (*(current + 132) == 12)  // cv-qual kind
            current = *(current + 144);  // underlying type

        // Guard: skip if in nested class scope
        if (dword_126C5C4 != -1)
            return;
        if ((scope_entry[6] & 0x06) != 0)
            return;
        if (scope_entry[4] == 12)  // namespace scope
            goto walk_callback;

        // Error 3598: type not valid in device context
        if (!check_type_valid_for_space(30, current, 0))  // sub_550E50
            emit_diag(3598, src_loc, space_name, current);

        // Error 3599: type has problematic member
        int64_t display = get_type_display_name(current);  // sub_5BD540
        if (!check_member_compat(60, display, current))  // sub_510860
            emit_diag(3599, src_loc, space_name, current);

        goto label_20;
    }
    ++depth;
    current = get_array_element_type(current);  // sub_7A9310
} while (current != 0);

label_20:
// Final phase: walk_tree with callback sub_41B420
if (dword_126C5C4 != -1) return;
if ((scope_entry[6] & 0x06) != 0) return;
if (scope_entry[4] == 12) return;

// Save/restore diagnostic state
saved_state = qword_126EDE8;
qword_126EDE8 = *src_loc;
dword_E7FE78 = 0;
walk_tree(a1, sub_41B420, 792);  // sub_7B0B60 with callback
qword_126EDE8 = saved_state;

The callback sub_41B420 is used in the tree walk to check each nested type member. This is the same callback used for OptiX extended lambda body validation, applied to validate that all types referenced within the annotated scope are compatible with the target execution space.
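
The array-peeling part of the walker can be modeled in isolation. The sketch below assumes a simplified, hypothetical type node (an is_array flag instead of the sub_7A8370 test, and a single inner pointer standing in for both the element type and the +144 underlying-type link); only the kind value 12 and the depth limit of 7 come from the pseudocode above.

```c
#include <stddef.h>

/* Hypothetical, simplified type node for illustration only. */
struct type_node {
    int is_array;                   /* stands in for the sub_7A8370 check */
    int kind;                       /* 12 = cv-qualified wrapper */
    const struct type_node *inner;  /* element type or underlying type */
};

/* Strip array layers (counting them), then strip cv-qualified wrappers.
 * Returns the remaining type and stores the array depth in *depth_out. */
static const struct type_node *peel_type(const struct type_node *t, int *depth_out)
{
    int depth = 0;
    while (t != NULL && t->is_array) {
        ++depth;
        t = t->inner;
    }
    while (t != NULL && t->kind == 12)
        t = t->inner;
    *depth_out = depth;
    return t;
}

/* Error 3597 fires when the depth is greater than 7. */
static int exceeds_depth_limit(int depth)
{
    return depth > 7;
}

/* Build `array_levels` array layers over a cv-wrapper over a class
 * (kind 9) and report the measured depth, or -1 on failure. */
static int demo_peel_depth(int array_levels)
{
    struct type_node nodes[18];
    int n = 0;
    if (array_levels < 0 || array_levels > 16)
        return -1;
    for (int i = 0; i < array_levels; ++i, ++n) {
        nodes[n].is_array = 1;
        nodes[n].kind = 0;
        nodes[n].inner = &nodes[n + 1];
    }
    nodes[n].is_array = 0;
    nodes[n].kind = 12;
    nodes[n].inner = &nodes[n + 1];
    ++n;
    nodes[n].is_array = 0;
    nodes[n].kind = 9;
    nodes[n].inner = NULL;

    int depth = 0;
    const struct type_node *leaf = peel_type(&nodes[0], &depth);
    return (leaf != NULL && leaf->kind == 9) ? depth : -1;
}
```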

Type Annotation Errors

| Error | Condition | Message |
| --- | --- | --- |
| 3593 | Conflict between __host__ and __device__ on extended lambda/type | Cannot apply both annotations |
| 3594 | Virtual function in __device__ or HD context | Virtual dispatch not supported on device |
| 3597 | Type nesting depth exceeds 7 levels in HD validation | Type hierarchy too deep for device |
| 3598 | Nested type not valid in device context | Type X cannot be used in __device__ code |
| 3599 | Nested type member incompatible with device execution | Member of type X is not device-compatible |
| 3615 | Duplicate __host__ __device__ annotation | Already annotated as HD |
| 3635 | Both __host__ and __device__ annotations with negative device flag | Conflicting explicit annotations |
| 3691 | Nested OptiX annotation conflict | OptiX extended lambda nested check failure |

Global State Variables

| Global | Type | Purpose |
| --- | --- | --- |
| qword_126C5E8 | int64_t | Scope stack base pointer (array of 784-byte entries) |
| dword_126C5E4 | int32_t | Current scope stack top index |
| dword_126C5D8 | int32_t | Current function scope index (-1 if none) |
| dword_126C5C8 | int32_t | Class scope index (-1 if none) |
| dword_126C5C4 | int32_t | Nested class scope (-1 if none) |
| dword_126C5B8 | int32_t | Is-member-of-template flag |
| qword_126C5D0 | int64_t | Current routine descriptor pointer |
| qword_106B970 | int64_t | Current compilation context |
| dword_106BFD0 | int32_t | Enable cross-space reference checking (primary) |
| dword_106BFCC | int32_t | Enable cross-space reference checking (secondary) |
| dword_106BF40 | int32_t | Allow __device__ function references in host |
| dword_106B670 | int32_t | Current type node context index (-1 if none) |
| qword_106B678 | int64_t | Type node table base pointer |
| dword_E7FE78 | int32_t | Diagnostic state flag (cleared during type walks) |
| qword_126EDE8 | int64_t | Saved diagnostic source position |

Function Map

| Address | Size | Identity | Source |
| --- | --- | --- | --- |
| sub_41A1F0 | ~0.5 KB | walk_type_for_hd_violations | class_decl.c |
| sub_41A3E0 | ~0.5 KB | validate_type_hd_annotation | class_decl.c |
| sub_41B420 | (callback) | Type walk callback for device compat | class_decl.c |
| sub_4F7450 | ~0.3 KB | emit_diag_multi_arg (cross-space diagnostics) | expr.c |
| sub_505720 | 4.0 KB | check_cross_execution_space_call | expr.c |
| sub_505AA0 | 0.8 KB | get_execution_space_string | expr.c |
| sub_505B40 | 2.7 KB | check_cross_space_call_in_template | expr.c |
| sub_6BC680 | 0.1 KB | is_device_or_extended_device_lambda | nv_transforms.c |
| sub_6BC6B0 | 0.5 KB | get_entity_display_name | nv_transforms.c |
| sub_6BE330 | 0.9 KB | nv_scan_expression_for_device_refs | nv_transforms.c |
| sub_72A650 | 6.6 KB | record_symbol_reference_full (6-arg) | symbol_ref.c |
| sub_72B510 | 7.3 KB | record_symbol_reference_full (4-arg) | symbol_ref.c |

Cross-References

Device/Host Separation

A single .cu file contains both host and device code intermixed. Conventional wisdom assumes cudafe++ splits them with two compilation passes -- one for host, one for device. That assumption is wrong. cudafe++ uses a single-pass, tag-and-filter architecture: the EDG frontend builds one unified IL tree from the entire translation unit, every entity gets execution-space bits written into its node, and then two separate output paths filter the tagged IL -- one path emits the .int.c host file, the other emits the device IL for cicc. There is no re-parse, no second invocation of the frontend.

This page documents the global variables that control the split, the IL-marking walk that selects device-reachable entries, the host-output filtering logic that suppresses device-only entities, and the output files produced.

Key Facts

| Property | Value |
| --- | --- |
| Architecture | Single-pass: parse once, tag with execution-space bits, filter at output time |
| Language mode flag | dword_126EFB4 -- language mode (1 = C, 2 = C++) |
| Host compiler identity | dword_126EFA4 -- clang mode; dword_126EFA8 -- gcc mode |
| Device stub mode | dword_1065850 -- toggled per-entity in sub_47BFD0 (gen_routine_decl) |
| Device-only filter | sub_46B3F0 -- returns 0 for device-only entities when generating host output |
| Keep-in-IL entry point | sub_610420 (mark_to_keep_in_il), 892 lines |
| Keep-in-IL worker | sub_6115E0 (walk_tree_and_set_keep_in_il), 4649 lines |
| Prune callback | sub_617310 (prune_keep_in_il_walk), 127 lines |
| Host output entry point | sub_489000 (process_file_scope_entities) |
| Host sequence dispatcher | sub_47ECC0 (gen_template / top-level source sequence processor), 1917 lines |
| Routine declaration | sub_47BFD0 (gen_routine_decl), 1831 lines |
| Host output file | <input>.int.c (transformed C++ for host compiler) |
| Device output file | Named via --gen_device_file_name CLI flag (binary IL for cicc) |
| Module ID file | Named via --module_id_file_name CLI flag |
| Stub file | Named via --stub_file_name CLI flag |

Why Single-Pass Matters

Old NVIDIA documentation and third-party descriptions sometimes describe a "two-pass" compilation model where cudafe++ runs once to extract device code and once to extract host code. This is not what the binary does. The evidence:

  1. One frontend invocation. sub_489000 (process_file_scope_entities) is called once. It walks the source sequence list (qword_1065748) a single time, dispatching each entity through sub_47ECC0.

  2. No re-parse. The EDG frontend builds the IL tree in memory once. The keep-in-IL walk (sub_610420) runs during fe_wrapup pass 3, marking device-reachable entries with bit 7 of the prefix byte. The host backend then emits .int.c from the same IL tree, filtering based on execution-space bits.

  3. dword_126EFB4 is a language mode, not a pass counter. Its value 2 means "C++ mode," not "second pass." It never changes between device and host output phases.

  4. The device IL is a byte-level binary dump of marked entries, not the output of a separate code-generation pass. The host output is a text-mode C++ file produced by the gen_* family of functions.

The practical implication: every CUDA entity exists once in memory with its execution-space tag at entity+182. The tag drives all downstream decisions -- what goes into device IL, what appears in host .int.c, what gets wrapped in #if 0, and what gets a kernel stub.

Control Globals

dword_126EFB4 -- Language Mode

| Value | Meaning |
| --- | --- |
| 0 | Unset / not initialized |
| 1 | C mode |
| 2 | C++ mode |

Set during CLI processing (sub_45C200, case 228/240/246/251/252 for C++ standard versions). In CUDA compilation this is always 2 because .cu files are compiled as C++. The keep-in-IL logic at sub_610420 checks dword_126EFB4 == 2 to decide whether to run the secondary routine-definition marking pass (sub_6175F0).

dword_126EFA4 -- Clang Mode / Device Code Mode

This global has different semantics depending on context. In CLI processing (case 187), it records whether clang host compiler mode is active. In the template instantiation system (p1.18 sweep), it acts as a device-code mode flag (1 = device code path, 0 = host stubs). The dual use reflects the fact that cudafe++ reuses the same global for different phases.

dword_126EFA8 -- GCC Mode / GPU Compilation Mode

Set when gcc host compiler mode is active. In template-related code paths, a nonzero value indicates GPU compilation mode is enabled.

dword_1065850 -- Device Stub Mode Toggle

This global flag controls how __global__ kernel bodies are emitted. It is toggled inside gen_routine_decl (sub_47BFD0). The toggle mechanism is a self-inverting flip that causes gen_routine_decl to process each __global__ kernel TWICE. Because the toggle fires at the TOP of the function (before body emission), the first call (0->1) emits the static stub definition, and the recursive call (1->0) emits the forwarding body.

Toggle Pseudocode (from sub_47BFD0, decompiled line 551-553)

// v3 = entity pointer, v8 = is_friend flag
uint64_t flags = *(uint64_t*)(entity + 176);    // 8-byte flags field

// Bitmask 0x40000002000000 combines the __global__ attribute bit (0x40000000000000)
// and a definition/linkage flag (0x2000000) from the entity's flags field at +176.
if ((flags & 0x40000002000000) == 0x40000002000000 && !is_friend)
    dword_1065850 = (dword_1065850 == 0);   // flip: 0->1 or 1->0

This toggle fires at the TOP of gen_routine_decl, before either stub variant is emitted. Because the function calls itself recursively at the end (decompiled line 1821: return sub_47BFD0(v152, a2)), the toggle fires again on re-entry, resetting the flag.
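
The 64-bit mask in the pseudocode can be cross-checked against the single-byte tests used elsewhere on this page. The sketch below assumes a little-endian load (the binary is x86-64) and uses hypothetical helper names; it maps each set bit of the mask to a byte offset and a mask within that byte.

```c
#include <stdint.h>

#define GLOBAL_AND_DEF_MASK 0x40000002000000ULL  /* from the pseudocode above */
#define FLAGS_FIELD_OFFSET  176u                 /* entity + 176 */

/* Index of the highest set bit, or -1 for zero. */
static int highest_set_bit(uint64_t v)
{
    int i = -1;
    while (v != 0) {
        v >>= 1;
        ++i;
    }
    return i;
}

/* Index of the lowest set bit, or -1 for zero. */
static int lowest_set_bit(uint64_t v)
{
    if (v == 0)
        return -1;
    int i = 0;
    while ((v & 1) == 0) {
        v >>= 1;
        ++i;
    }
    return i;
}

/* On a little-endian machine, bit n of a multi-byte field that starts at
 * `base` lives in the byte at base + n / 8 ... */
static unsigned byte_offset_of_bit(unsigned base, unsigned bit_index)
{
    return base + bit_index / 8;
}

/* ... under the mask 1 << (n % 8) within that byte. */
static uint8_t mask_within_byte(unsigned bit_index)
{
    return (uint8_t)(1u << (bit_index % 8));
}
```

Under that assumption, bit 54 resolves to byte +182 with mask 0x40 and bit 25 resolves to byte +179 with mask 0x02. These are the same two single-byte tests that the body-emission code applies (byte_182 & 0x40 and byte_179 & 0x02).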

Body Emission Decision (decompiled line 1421-1432)

The actual stub body selection happens later in the function, based on the CURRENT value of dword_1065850 (which has already been toggled):

if ((entity->byte_182 & 0x40) != 0) {       // has __global__ annotation
    char has_body = entity->byte_179 & 0x02;  // has a definition

    if (dword_1065850) {
        // First call (toggle 0->1): emit static stub with cudaLaunchKernel placeholder
        if (!is_specialization && has_body) {
            emit("{ ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}");
        }
    } else if (has_body) {
        // Recursive call (toggle 1->0): emit forwarding stub
        emit("{");
        emit_scope_qualifier(entity);
        emit("__wrapper__device_stub_");
        emit(entity->name);
        emit_template_args_if_needed(entity);
        emit_parameter_forwarding(entity);
        emit(");return;}");
    }
    // Both invocations: wrap original body in #if 0 / #endif
}

Self-Recursion (decompiled line 1817-1821)

After the first call emits the static stub, the function checks whether dword_1065850 is nonzero (the toggle set it to 1). If so, it restores the source sequence pointer and calls itself:

if (dword_1065850) {
    qword_1065748 = saved_source_sequence;
    return sub_47BFD0(context, a2);   // recursive self-call
}

The recursive invocation toggles dword_1065850 back to 0, emits the forwarding body, and returns without further recursion (since dword_1065850 == 0 at the self-recursion check).
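
The toggle-then-recurse control flow is easy to misread, so here is a toy model of it. Everything in it is hypothetical scaffolding: stub_mode stands in for dword_1065850, and the emitted strings are placeholders rather than the real stub text. It reproduces only the ordering (static stub first, forwarding body second) and the fact that the flag returns to 0.

```c
#include <stdio.h>
#include <string.h>

static int  stub_mode = 0;   /* stands in for dword_1065850 */
static char emitted[256];    /* collects the toy output */

/* Toy gen_routine_decl for a __global__ kernel with a definition. */
static void gen_kernel_decl(const char *name)
{
    char line[128];

    stub_mode = (stub_mode == 0);  /* flip at the top: 0->1 or 1->0 */

    if (stub_mode)
        snprintf(line, sizeof line, "static-stub:%s;", name);
    else
        snprintf(line, sizeof line, "forwarder:%s;", name);
    strncat(emitted, line, sizeof emitted - strlen(emitted) - 1);

    if (stub_mode)
        gen_kernel_decl(name);     /* recursive self-call flips the flag back */
}

/* Reset the buffer, emit one kernel, and return the collected text. */
static const char *emit_kernel(const char *name)
{
    emitted[0] = '\0';
    gen_kernel_decl(name);
    return emitted;
}
```

Because the flag ends at 0 after each kernel, processing a second kernel starts from the same state as the first.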

The flag is also set in sub_47ECC0 when processing template instantiation directives (source sequence kind 54): if the entity has byte_182 & 0x40 (device/global annotation) and CUDA language mode is active, dword_1065850 is set to 1 before emitting the instantiation directive.

dword_126EBA8 -- Language Standard Mode

Value 1 indicates C language standard mode. The device-only filtering function sub_46B3F0 references this to determine whether EBA (EDG binary archive) mode applies.

Host-Output Filtering: sub_46B3F0

This compact function (39 lines decompiled) is the gatekeeper that determines whether an entity should be emitted in the host .int.c output. It is called from sub_47ECC0 at the point where the host backend decides whether to emit a type/variable declaration or wrap it in #if 0.

Decompiled Logic

// sub_46B3F0 -- returns 0 to suppress (device-only), nonzero to emit
uint64_t sub_46B3F0(entry *a1, entry *a2) {
    char kind = a1->byte_132;

    // Classes, structs, unions (kind 9-11): always check device-only
    if ((unsigned char)(kind - 9) <= 2)
        goto check_device_flag;

    // Enums (kind 2): check if scoped enum is device-only
    if (kind == 2) {
        if ((a1->byte_145 & 0x08) == 0)  // not an enum definition
            return 1;                      // emit it
        goto check_device_flag;
    }

    // Typedefs (kind 12): check underlying type kind
    if (kind == 12) {
        char underlying = a1->byte_160;
        if (underlying > 10)
            return 0;
        // Magic bitmask: 0x71D = 0b11100011101
        // Bits set for kinds 0,2,3,4,8,9,10 -> emit
        return (0x71DULL >> underlying) & 1;
    }

    return 1;  // everything else: emit

check_device_flag:
    int is_device;
    if (a2)
        is_device = a2->byte_49 & 1;
    else
        is_device = a1->byte_135 >> 7;

    if (!is_device)
        return 0;   // not device-related, suppress? (inverted logic)

    // Device entity: check if it should still be emitted
    return dword_126EBA8           // C mode -> emit anyway
        || (unsigned char)(kind - 9) > 2  // not a class/struct/union -> emit (unsigned compare, as above)
        || *(a1->ptr_152 + 89) != 1;  // scope check
}

The function uses a bitmask trick (0x71D >> underlying_kind) to quickly determine which typedef underlying types pass the filter. The bit pattern 0b11100011101 selects kinds 0 (void/basic), 2 (enum), 3 (parameter), 4 (pointer), 8 (field), 9 (class), and 10 (struct).
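
The bitmask test is compact enough to verify by hand. The following sketch (function name hypothetical, parameter made unsigned for simplicity) reproduces the typedef branch and can be used to confirm which underlying kinds pass.

```c
#include <stdbool.h>

/* 0x71D = 0b11100011101: bit k is set when underlying kind k is emitted.
 * Kinds above 10 are rejected before the shift, as in the decompiled code. */
static bool typedef_underlying_passes(unsigned underlying_kind)
{
    if (underlying_kind > 10)
        return false;
    return ((0x71DULL >> underlying_kind) & 1) != 0;
}
```

Kinds 1, 5, 6, and 7 are the in-range values that the mask rejects.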

Where It Is Called

In sub_47ECC0 (the master source-sequence dispatcher), when processing type declarations (kind 6):

case 6:  // type_decl
    sub_4864F0(recursion_level, &continuation, kind_byte);
    if (!recursion_level && !sub_46B3F0(type_entry, scope_entry)) {
        // Entity is device-only in host context
        // Wrap in #if 0 / #endif
    }

This is the mechanism that makes device-only classes, structs, and enums invisible to the host compiler. They still exist in the IL tree (and participate in the keep-in-IL walk for device output), but their text representation is suppressed in .int.c.

Device-Only Suppression in Host Output

When sub_46B3F0 returns 0 for an entity, or when the execution-space check in gen_routine_decl identifies a device-only function, the host backend wraps the declaration in preprocessor guards:

#if 0
__device__ void device_only_function() {
    // ... original body ...
}
#endif

This pattern appears in three locations:

  1. Type declarations -- sub_47ECC0 wraps device-only types via sub_46B3F0 check.

  2. Routine declarations -- sub_47BFD0 checks entity->byte_81 & 0x04 (has device scope) combined with execution-space bits at entity+182. When a function is device-only and the current output track is host, the function body is suppressed.

  3. Lambda bodies -- sub_47B890 (gen_lambda) wraps device lambda bodies in #if 0 / #endif and emits __nv_dl_wrapper_t wrapper types instead.

The nv_is_device_only_routine Check

The inline predicate from nv_transforms.h:367 is the canonical way to test if a routine lives exclusively in device space:

bool nv_is_device_only_routine(entity *e) {
    char byte = e->byte_182;
    return ((byte & 0x30) == 0x20)    // device annotation, no host
        && ((byte & 0x60) == 0x20);   // device, not __global__
}

The double-mask check applies two tests, which together separate device-only routines from __host__ __device__ routines and from kernels:

  • (byte & 0x30) == 0x20: has __device__ but not __host__ (bits 4-5)
  • (byte & 0x60) == 0x20: has __device__ but not __global__ (bits 5-6)

A __global__ function fails the second test because bit 6 is set ((byte & 0x60) == 0x60). This matters because __global__ functions ARE emitted in host output -- as stubs that call __wrapper__device_stub_<name>.
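
A standalone version of the predicate makes the cases easy to test. The TAG_* names below are hypothetical labels for the bits described on this page (0x10 host, 0x20 device, 0x40 global). The sketch also checks an observation that follows from the two masks: together they are equivalent to the single test (byte & 0x70) == 0x20.

```c
#include <stdbool.h>
#include <stdint.h>

#define TAG_HOST   0x10  /* bit 4 */
#define TAG_DEVICE 0x20  /* bit 5 */
#define TAG_GLOBAL 0x40  /* bit 6 */

/* Same two-mask form as the predicate shown above. */
static bool is_device_only(uint8_t byte_182)
{
    return (byte_182 & 0x30) == 0x20
        && (byte_182 & 0x60) == 0x20;
}

/* Single-mask form: device set, host and global both clear. */
static bool is_device_only_single_mask(uint8_t byte_182)
{
    return (byte_182 & 0x70) == 0x20;
}

/* Exhaustively compare the two forms over every byte value. */
static bool forms_agree_for_all_bytes(void)
{
    for (unsigned b = 0; b < 256; ++b)
        if (is_device_only((uint8_t)b) != is_device_only_single_mask((uint8_t)b))
            return false;
    return true;
}
```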

The Keep-in-IL Walk (Device Code Selection)

The keep-in-IL mechanism runs during fe_wrapup pass 3 and selects which IL entries belong to the device output. The full details are documented in the Keep-in-IL page; this section covers the aspects relevant to device/host separation.

Call Chain

sub_610420 (mark_to_keep_in_il)
  |
  +-- installs pre_walk_check = sub_617310 (prune_keep_in_il_walk)
  +-- walks file-scope IL via sub_6115E0 (walk_tree_and_set_keep_in_il)
  |     |
  |     +-- for each child entry:
  |           *(child - 8) |= 0x80    // set bit 7 = keep_in_il
  |           recurse into child
  |
  +-- if dword_126EFB4 == 2 (C++ mode):
  |     sub_6175F0 (walk_scope_and_mark_routine_definitions)
  |
  +-- iterates 45+ global entry-kind linked lists
  +-- processes using-declarations (fixed-point loop)

The Keep Bit

Every IL entry has an 8-byte prefix. Bit 7 (0x80) of the byte at entry_ptr - 8 is the keep-in-IL flag:

Byte at (entry_ptr - 8):
  bit 0  (0x01)  is_file_scope
  bit 1  (0x02)  is_in_secondary_il
  bit 2  (0x04)  current_il_region
  bits 3-6       reserved
  bit 7  (0x80)  keep_in_il          <<<< THE DEVICE CODE MARKER

The sign bit doubles as the flag, enabling a fast test: *(signed char*)(entry - 8) < 0 means "keep." The recursive worker sub_6115E0 sets this bit on every reachable sub-entry by ORing 0x80 into the prefix byte and recursing.
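
The set and test operations can be sketched against a dummy buffer. This assumes, as the layout above states, that the flag byte sits exactly 8 bytes before the entry pointer. The helper names are hypothetical, and the signed-char test relies on two's complement representation, which holds on x86-64.

```c
#include <stdbool.h>
#include <stdint.h>

/* OR the keep flag into the prefix byte, leaving the other bits intact. */
static void set_keep(void *entry_ptr)
{
    *((uint8_t *)entry_ptr - 8) |= 0x80;
}

/* Fast test: bit 7 set means the byte is negative when read as signed. */
static bool is_kept(const void *entry_ptr)
{
    return *((const signed char *)entry_ptr - 8) < 0;
}

/* Demo: place `initial` in the prefix byte of a scratch entry, mark it,
 * and return the resulting prefix byte (0 if the mark did not register). */
static uint8_t mark_and_read_prefix(uint8_t initial)
{
    uint8_t storage[8 + 16] = {0};  /* 8-byte prefix + 16-byte dummy body */
    void *entry_ptr = storage + 8;

    storage[0] = initial;
    set_keep(entry_ptr);
    return is_kept(entry_ptr) ? storage[0] : 0;
}
```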

Transitive Closure

The walk implements a transitive closure: if a __device__ function references a type, that type gets marked, which transitively marks its member types, base classes, template parameters, and any routines they reference. The prune callback (sub_617310) prevents infinite loops by returning 1 (skip) when an entry already has bit 7 set.
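
The interaction between marking and pruning can be shown on a toy graph. This is a minimal sketch, not the structure of sub_6115E0: the node type, the fixed two-reference limit, and the demo graph are all hypothetical. The early return plays the role of the prune callback, and it is what lets the walk terminate on a cycle.

```c
#include <stdint.h>

#define MAX_REFS 2

struct il_node {
    uint8_t prefix;          /* bit 7 (0x80) = keep_in_il */
    int     refs[MAX_REFS];  /* indices of referenced nodes, -1 = unused */
};

/* Mark `idx` and everything reachable from it. Returns the number of
 * nodes newly marked. Already-marked entries are skipped. */
static int mark_reachable(struct il_node *g, int idx)
{
    if (idx < 0 || (g[idx].prefix & 0x80))
        return 0;
    g[idx].prefix |= 0x80;
    int marked = 1;
    for (int i = 0; i < MAX_REFS; ++i)
        marked += mark_reachable(g, g[idx].refs[i]);
    return marked;
}

/* Demo graph: 0 -> 1 -> 2 -> 0 is a cycle; 3 -> 4 is a separate chain. */
static int demo_mark_from(int seed)
{
    struct il_node g[5] = {
        { 0, { 1, -1 } },
        { 0, { 2, -1 } },
        { 0, { 0, -1 } },
        { 0, { 4, -1 } },
        { 0, { -1, -1 } },
    };
    return mark_reachable(g, seed);
}
```

Seeding from node 0 marks only the three nodes in the cycle. Nodes 3 and 4 stay unmarked, which in the real pipeline would leave them to the elimination pass.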

Additional "keep definition" flags exist for deeper marking:

| Entity | Field | Bit | Effect |
| --- | --- | --- | --- |
| Type (class/struct) | entry + 162 | bit 7 (0x80) | Retain full class body, not just forward decl |
| Routine | entry + 187 | bit 2 (0x04) | Retain function body |

Seed Entries

The walk starts from entities already tagged with execution-space bits. These seeds include:

  • Functions with __device__ or __global__ at entity+182
  • Variables with __shared__, __constant__, or __managed__ memory space attributes
  • Extended device/host-device lambdas

Everything reachable from a seed gets the keep bit. Everything without the keep bit is eliminated from the device IL by the elimination pass (sub_5CCBF0).

__host__ __device__ Functions

Functions annotated with both __host__ and __device__ have bits 4 and 5 set in entity+182, producing (byte & 0x30) == 0x30. These functions participate in BOTH output paths:

  1. Host output (.int.c): The function passes the nv_is_device_only_routine check (it returns false because bit 4 is set alongside bit 5). The function body is emitted normally -- no #if 0 wrapping, no stub substitution.

  2. Device IL: The keep-in-IL walk marks the function and all its dependencies because it has device-capable bits set. The full function body is retained in the device IL.

This dual inclusion is why __host__ __device__ functions must be valid C++ in both execution contexts. They are compiled once by EDG, then the same IL is consumed by both the host compiler (via .int.c text) and cicc (via binary IL).
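
The routing rules described on this page can be summarized in one function over the execution-space byte. This is a deliberate simplification with hypothetical names: it looks only at bits 4-6 of entity+182, treats an untagged routine as host-only, and ignores reachability, so a host function pulled into the device IL by the transitive closure is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical output-path flags for this illustration. */
enum {
    OUT_HOST_BODY = 1,  /* body emitted normally in .int.c */
    OUT_HOST_IF0  = 2,  /* original body wrapped in #if 0 / #endif */
    OUT_HOST_STUB = 4,  /* kernel stub emitted in .int.c */
    OUT_DEVICE_IL = 8   /* seed for the keep-in-IL walk */
};

static unsigned route_routine(uint8_t byte_182)
{
    bool host   = (byte_182 & 0x10) != 0;
    bool device = (byte_182 & 0x20) != 0;
    bool global = (byte_182 & 0x40) != 0;

    if (global)
        return OUT_HOST_STUB | OUT_HOST_IF0 | OUT_DEVICE_IL;
    if (device && !host)
        return OUT_HOST_IF0 | OUT_DEVICE_IL;
    if (device && host)
        return OUT_HOST_BODY | OUT_DEVICE_IL;
    return OUT_HOST_BODY;
}
```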

Template Instantiation Interaction

When sub_47ECC0 processes a template instantiation directive (source sequence kind 54) for a __host__ __device__ template, it does NOT set dword_1065850. The stub mode toggle only activates for entities with byte_182 & 0x40 (the __global__ kernel bit). Host-device functions get their bodies emitted directly in both tracks.

Output Files

cudafe++ produces up to four output files from a single compilation:

1. Host C++ File (.int.c)

Generated by sub_489000 (process_file_scope_entities). The filename is derived from the input: <input>.int.c, or stdout if the output name is "-".

Contents:

  • Pragma boilerplate (#pragma GCC diagnostic ignored ...)
  • Managed runtime initialization (__nv_init_managed_rt, __nv_fatbinhandle_for_managed_rt)
  • Lambda macro definitions (__nv_is_extended_device_lambda_closure_type, etc.)
  • #include "crt/host_runtime.h" (injected when first CUDA-tagged type is encountered)
  • All host-visible declarations with device-only entities wrapped in #if 0
  • Kernel functions replaced with forwarding stubs to __wrapper__device_stub_<name>
  • Registration tables (sub_6BCF80 called 6 times for device/host x managed/constant combinations)
  • Anonymous namespace macro (_NV_ANON_NAMESPACE)
  • Original source re-inclusion (#include "<original_file>")

2. Device IL File

Named via --gen_device_file_name CLI flag (flag index 85). Contains the binary IL for all entries that passed the keep-in-IL walk. This file is consumed by cicc (the CUDA IL compiler).

3. Module ID File

Named via --module_id_file_name CLI flag (flag index 87). Contains the CRC32-based unique identifier for this compilation unit, computed by make_module_id (sub_5B5500). Used to prevent ODR violations across separate compilation units in RDC mode.

4. Stub File

Named via --stub_file_name CLI flag (flag index 86). Contains the __wrapper__device_stub_<name> function definitions that bridge host-side kernel launch calls to the CUDA runtime.

Kernel Stub Generation

For __global__ kernel functions, the host output replaces the original body with two stub forms. The toggle dword_1065850 flips 0->1 at the top of gen_routine_decl, so the static definition is emitted first, followed by the forwarding body from the recursive call:

// Output 1 (dword_1065850 == 1 after toggle, emitted first):
static void __wrapper__device_stub_kernel_name(params) {
    ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
#if 0
<original body>
#endif

// Output 2 (dword_1065850 == 0 after toggle, emitted by recursive call):
void kernel_name(params) {
    <scope>::__wrapper__device_stub_kernel_name(params);
    return;
}
#if 0
<original body>
#endif

The static stub provides the definition of __wrapper__device_stub_ that the forwarding body calls. The cudaLaunchKernel(0, 0, 0, 0, 0, 0) placeholder creates a linker dependency on the CUDA runtime without performing an actual kernel launch.

For template kernels, the forwarding stub includes explicit template arguments: __wrapper__device_stub_kernel_name<T1, T2, ...>(params). For full details see Kernel Stubs.

Architectural Diagram

                        .cu source
                            |
                     EDG Frontend (parse once)
                            |
                     Unified IL Tree
                    (all entities tagged
                     at entity+182)
                            |
              +-------------+-------------+
              |                           |
        fe_wrapup pass 3           Backend (sub_489000)
     mark_to_keep_in_il            walks source sequence
      (sub_610420)                       |
              |                    sub_47ECC0 per entity
        set bit 7 on                     |
        device-reachable          +------+------+
        entries                   |             |
              |              sub_46B3F0    sub_47BFD0
        Device IL output    returns 0?    __global__?
        (binary, for cicc)       |             |
                            #if 0/endif   stub body
                            wrap it       replacement
                                  |             |
                                  +------+------+
                                         |
                                   .int.c output
                                 (text C++ for host
                                  compiler)

Function Map

| Address | Name | Lines | Role |
| --- | --- | --- | --- |
| sub_489000 | process_file_scope_entities | 723 | Backend entry point, .int.c emission |
| sub_47ECC0 | gen_template (source sequence dispatcher) | 1917 | Dispatches each entity; calls sub_46B3F0 for type filtering |
| sub_47BFD0 | gen_routine_decl | 1831 | Routine declaration/definition; toggles dword_1065850 |
| sub_46B3F0 | device-only type filter | 39 | Returns 0 for device-only entities in host output |
| sub_610420 | mark_to_keep_in_il | 892 | Top-level device IL marking entry point |
| sub_6115E0 | walk_tree_and_set_keep_in_il | 4649 | Recursive worker that sets bit 7 on reachable entries |
| sub_617310 | prune_keep_in_il_walk | 127 | Pre-walk callback; skips already-marked entries |
| sub_6175F0 | walk_scope_and_mark_routine_definitions | 634 | Additional pass for C++ routine definitions |
| sub_47B890 | gen_lambda | 336 | Lambda wrapper generation; #if 0 for device lambda bodies |
| sub_4864F0 | gen_type_decl | 751 | Type declaration emission; host runtime injection |
| sub_5CCBF0 | eliminate_unneeded_il_entries | 345 | Elimination pass (removes entries without keep bit) |

Cross-References

  • Execution Spaces -- byte +182 bitfield encoding for __host__/__device__/__global__; the nv_is_device_only_routine predicate that drives host-output filtering
  • Kernel Stubs -- detailed stub generation logic: static cudaLaunchKernel body (first invocation) and forwarding body (recursive invocation)
  • Keep-in-IL -- full documentation of the device code marking walk, the keep bit at entry_ptr - 8, and the transitive closure algorithm
  • Memory Spaces -- variable-side __device__/__shared__/__constant__ at entity+148; these are the seed entries for the keep-in-IL walk
  • .int.c File Format -- structure of the generated host translation file
  • Entity Node Layout -- full byte map of the entity structure including offset +176 (flags field) and +182 (execution space byte)

Kernel Stub Generation

When cudafe++ generates the .int.c host translation of a CUDA source file, every __global__ kernel function undergoes a critical transformation: the original kernel body is suppressed and replaced with a device stub -- a lightweight host-callable wrapper that delegates to cudaLaunchKernel. This mechanism is how CUDA kernel launch syntax (kernel<<<grid, block>>>(args)) ultimately becomes a regular C++ function call that the host compiler can process. The stub generation logic lives entirely within gen_routine_decl (sub_47BFD0), a 1,831-line function in cp_gen_be.c that is the central code generator for all C++ function declarations and definitions. A secondary function, gen_bare_name (sub_473F10), handles the character-by-character emission of the __wrapper__device_stub_ prefix into function names.

The stub mechanism operates in two passes controlled by a global toggle, dword_1065850 (the device_stub_mode flag). The toggle fires at the top of gen_routine_decl, BEFORE the body-selection logic runs. Because the toggle is dword_1065850 = (dword_1065850 == 0), it flips 0->1 on the first invocation. This means:

  • First invocation (toggle 0->1): dword_1065850 == 1 at decision points -> emits the static declaration with cudaLaunchKernel placeholder body, then recurses.
  • Recursive invocation (toggle 1->0): dword_1065850 == 0 at decision points -> emits the forwarding body that calls __wrapper__device_stub_<name>.

Both invocations wrap the original kernel body in #if 0 / #endif so the host compiler never sees device code.

Key Facts

| Property | Value |
| --- | --- |
| Source file | cp_gen_be.c (EDG 6.6 backend code generator) |
| Main generator | sub_47BFD0 (gen_routine_decl, 1831 lines) |
| Bare name emitter | sub_473F10 (gen_bare_name, 671 lines) |
| Stub prefix string | "__wrapper__device_stub_" at 0x839420 |
| Specialization prefix | "__specialization_" at 0x839960 |
| cudaLaunchKernel body | "{ ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}" at 0x839CB8 |
| Device-only dummy (ctor/dtor) | "{int *volatile ___ = 0;" at 0x839A3E + "::free(___);" at 0x839A72 |
| Device-only dummy (global) | "{int volatile ___ = 1;" at 0x839A56 + "::exit(___);" at 0x839A80 |
| Stub mode flag | dword_1065850 (global toggle) |
| Static template stub CLI flag | -static-global-template-stub=true |
| Parameter list generator | sub_478900 (gen_parameter_list) |
| Scope qualifier emitter | sub_474D60 (recursive namespace path) |
| Parameter name emitter | sub_474BB0 (emit entity name for forwarding) |

The Device Stub Mode Toggle

The entire stub generation mechanism hinges on a single global variable, dword_1065850. This flag acts as a modal switch: when set, all subsequent code generation for __global__ functions produces the static stub variant rather than the forwarding body.

Toggle Logic

The toggle occurs in gen_routine_decl at the point where the function's CUDA flags are inspected. The critical line from the decompiled binary:

// sub_47BFD0, around decompiled line 553
// v3 = routine entity pointer, v8 = is_friend flag

__int64 flags = *(_QWORD *)(v3 + 176);

if ((flags & 0x40000002000000) == 0x40000002000000 && v8 != 1)
    dword_1065850 = dword_1065850 == 0;   // toggle: 0->1 or 1->0

The bitmask 0x40000002000000 selects two bits of the entity's 8-byte flags field at offset +176. Read as little-endian bytes, bit 54 is byte +182 mask 0x40 (the __global__ attribute) and bit 25 is byte +179 mask 0x02 (the has-body flag), the same two bits that the forwarding-body condition tests individually. The condition requires BOTH bits set, and the declaration must NOT be a friend declaration (v8 != 1). The toggle expression dword_1065850 == 0 flips the flag: if it was 0, it becomes 1; if it was 1, it becomes 0.
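
The relationship between a 64-bit mask over the flags field and the per-byte tests used elsewhere on this page can be checked mechanically. The following is a minimal sketch, assuming the field is read as a little-endian QWORD (as on x86-64); split_mask is a hypothetical helper and does not exist in the binary.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Split a 64-bit mask over a little-endian field into
// (entity byte offset, byte mask) pairs. `base` is the offset of the
// 8-byte field inside the entity (176 for the flags field).
std::vector<std::pair<int, std::uint8_t>> split_mask(std::uint64_t mask, int base) {
    std::vector<std::pair<int, std::uint8_t>> out;
    for (int i = 0; i < 8; ++i) {
        // Byte i of a little-endian QWORD lives at offset base + i.
        std::uint8_t b = static_cast<std::uint8_t>(mask >> (8 * i));
        if (b != 0) out.emplace_back(base + i, b);
    }
    return out;
}
```

Calling split_mask(0x40000002000000ULL, 176) yields the pairs (179, 0x02) and (182, 0x40), matching the byte-level tests on +179 and +182.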

This means gen_routine_decl is called twice for every __global__ kernel. Crucially, the toggle fires at the TOP of the function, BEFORE the body emission logic:

  1. First call (dword_1065850 == 0 at entry -> toggled to 1): All subsequent decision points see dword_1065850 == 1. Emits the static stub with cudaLaunchKernel placeholder body. Then recurses.
  2. Recursive call (dword_1065850 == 1 at entry -> toggled to 0): All subsequent decision points see dword_1065850 == 0. Emits the forwarding stub body. Does NOT recurse (the flag is 0 at the end).

The self-recursion that drives the second call is explicit at the end of gen_routine_decl:

// sub_47BFD0, decompiled line 1817-1821
if (dword_1065850) {
    qword_1065748 = (int64_t)v163;  // restore source sequence pointer
    return sub_47BFD0(v152, a2);     // recursive self-call
}

After emitting the static stub (first call), the self-recursion check at line 1817 fires because dword_1065850 == 1. The function restores the source sequence state and calls itself. In the recursive call, the toggle fires again (1->0), and the forwarding body is emitted with dword_1065850 == 0. At the end of the recursive call, dword_1065850 == 0, so no further recursion occurs.
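
The control flow described above can be modelled in a few lines. This is a sketch only: gen_routine_decl_sketch and stub_mode are hypothetical stand-ins for sub_47BFD0 and dword_1065850, and the emitted strings are labels rather than real output.

```cpp
#include <cassert>
#include <string>
#include <vector>

static int stub_mode = 0;  // stand-in for dword_1065850

// Models only the toggle at the top and the self-recursion at the end.
void gen_routine_decl_sketch(const std::string& kernel,
                             std::vector<std::string>& out) {
    stub_mode = (stub_mode == 0);  // toggle fires before any body decision
    if (stub_mode)
        out.push_back("static stub: __wrapper__device_stub_" + kernel);
    else
        out.push_back("forwarding body: " + kernel);
    if (stub_mode)                 // tail check: recurse only after pass 1
        gen_routine_decl_sketch(kernel, out);
}
```

A single top-level call records the static stub first and the forwarding body second, and leaves stub_mode at 0, so the next kernel starts from the same state.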

Stub Generation: The Forwarding Body

When dword_1065850 == 0 and the entity has __global__ annotation (byte +182 & 0x40) with a body (byte +179 & 0x02), gen_routine_decl emits a forwarding body instead of the original kernel implementation. This is the output produced by the recursive (second) invocation.

Step-by-Step Emission

The forwarding body is assembled from multiple sub_468190 (emit raw string) calls:

// Condition: (byte[182] & 0x40) != 0 && (byte[179] & 2) != 0 && dword_1065850 == 0

// 1. Open brace
sub_468190("{");

// 2. Scope qualification (if kernel is in a namespace)
scope = *(v3 + 40);  // entity's enclosing scope
if (scope && byte_at(scope + 28) == 3) {       // scope kind 3 = namespace
    sub_474D60(*(scope + 32));   // recursively emit namespace::namespace::...
    sub_468190("::");
}

// 3. Emit "__wrapper__device_stub_" prefix
sub_468190("__wrapper__device_stub_");

// 4. Emit the original function name
sub_468190(*(char **)(v3 + 8));  // entity name string at offset +8

Template Argument Emission

After the function name, template arguments must be forwarded. The logic branches on whether the function is an explicit template specialization (v153) or a non-template member of a template class:

Case A: Explicit specialization (v153 != 0) -- uses the template argument list at entity offset +224:

v135 = *(v3 + 224);  // template_args linked list
if (v135) {
    putc('<', stream);  // emit '<'
    do {
        arg_kind = byte_at(v135 + 8);
        if (arg_kind == 0) {
            // Type argument: emit type specifier + declarator
            sub_5FE8B0(v135[4], ...);   // gen_type_specifier
            sub_5FB270(v135[4], ...);   // gen_declarator
        } else if (arg_kind == 1) {
            // Value argument (non-type template param)
            sub_5FCAF0(v135[4], 1, ...); // gen_constant
        } else {
            // Template-template argument
            sub_472730(v135[4], ...);    // gen_template_arg
        }
        v135 = *v135;          // next in linked list
        separator = v135 ? ',' : '>';
        putc(separator, stream);
    } while (v135);
}

Case B: Non-specialization -- template parameters from the enclosing class template are forwarded:

// v162 = template parameter info from enclosing scope
v92 = v162[1];  // template parameter list
if (v92 && (byte_at(v92 + 113) & 2) == 0) {
    sub_467E50("<");
    do {
        param_kind = byte_at(v92 + 112);
        if (param_kind == 1) {
            // type parameter -- emit the type
            sub_5FE8B0(*(v92 + 120), ...);
            sub_5FB270(*(v92 + 120), ...);
        } else if (param_kind == 2) {
            // non-type parameter -- emit constant
            sub_5FCAF0(*(v92 + 120), 1, ...);
        } else {
            // template-template parameter
            sub_472730(*(v92 + 120), ...);
        }
        if (byte_at(v92 + 113) & 1)
            sub_467E50("...");   // parameter pack expansion
        v92 = *(v92 + 104);     // next parameter
        emit(v92 ? "," : ">");
    } while (v92);
}
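
Both cases share the same loop shape: emit an argument, advance, then emit ',' or '>' depending on whether another node follows. A minimal sketch of that shape, assuming a hypothetical TemplateArg node whose text field holds an already-rendered argument (the real code dispatches on the argument kind instead):

```cpp
#include <cassert>
#include <string>

// Hypothetical list node; field names are illustrative, not binary offsets.
struct TemplateArg {
    const TemplateArg* next;
    std::string text;   // rendered type, constant, or template name
    bool is_pack;       // emit "..." after the argument
};

std::string emit_template_args(const TemplateArg* arg) {
    std::string out;
    if (!arg) return out;          // no list: emit nothing at all
    out += '<';
    do {
        out += arg->text;
        if (arg->is_pack) out += "...";
        arg = arg->next;
        out += arg ? ',' : '>';    // separator chosen after advancing
    } while (arg);
    return out;
}
```

Choosing the separator after advancing means the closing '>' needs no special case after the loop.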

Parameter Forwarding

After the name and template arguments, the forwarding call's actual arguments are emitted:

// 5. Emit parameter forwarding: "(param1, param2, ...)"
sub_468150(40);  // '('
param = *(v167 + 40);  // first parameter entity from definition scope
if (param) {
    for (separator = ""; ; separator = ",") {
        sub_468190(separator);
        sub_474BB0(param, 7);  // emit parameter name
        if (byte_at(param + 166) & 0x40) {
            sub_468190("...");  // variadic parameter pack expansion
        }
        param = *(param + 104);  // next parameter in list
        if (!param) break;
    }
}
sub_468190(");");

// 6. Emit return statement and closing brace
sub_468190("return;}");

Complete Output Example

For a kernel:

namespace my_ns {
template<typename T>
__global__ void my_kernel(T* data, int n) { /* device code */ }
}

The forwarding body (emitted during the recursive call with dword_1065850 == 0) produces:

template<typename T>
void my_ns::my_kernel(T* data, int n) {
    my_ns::__wrapper__device_stub_my_kernel<T>(data, n);
    return;
}
#if 0
/* original kernel body here */
#endif

Note: __host__ is NOT emitted in the forwarding body. The __global__ attribute is stripped and no explicit execution space appears. The function appears as a plain C++ function in .int.c.
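
The emission order of the forwarding body can be summarized as a string builder. This is a sketch under stated assumptions: forwarding_body is a hypothetical function, the namespace path and template argument list are passed in pre-rendered, and whitespace in the real output may differ from the formatted examples on this page.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::string forwarding_body(const std::string& ns_path,        // "" at file scope
                            const std::string& name,
                            const std::string& template_args,  // "" or "<...>"
                            const std::vector<std::string>& params) {
    std::string out = "{";
    if (!ns_path.empty()) out += ns_path + "::";  // scope qualification
    out += "__wrapper__device_stub_";            // stub prefix
    out += name;                                 // original kernel name
    out += template_args;                        // forwarded template arguments
    out += '(';
    const char* sep = "";
    for (const auto& p : params) {               // names only, no types
        out += sep;
        out += p;
        sep = ",";
    }
    out += ");";
    out += "return;}";
    return out;
}
```

For the my_ns::my_kernel example this yields {my_ns::__wrapper__device_stub_my_kernel<T>(data,n);return;}.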

Stub Generation: The Static cudaLaunchKernel Placeholder

When dword_1065850 == 1 (the first invocation, after the toggle), the function declaration is rewritten with a different storage class and body. Despite being called "pass 2" conceptually (it produces the definition that the forwarding body calls), it is emitted FIRST in the output because the toggle sets the flag before any body emission logic runs.

Declaration Modifiers

When dword_1065850 is set, gen_routine_decl forces the storage class to static and optionally prepends the __specialization_ prefix:

// sub_47BFD0, decompiled lines 897-903
if (dword_1065850) {
    v164 = 2;                    // force storage class = static
    v23 = "static";
    if (v153)                    // if template specialization
        sub_467E50("__specialization_");
    goto emit_storage_class;     // -> sub_467E50("static"); sub_468150(' ');
}

The __specialization_ prefix is emitted BEFORE static for template specializations, with no separator between the two strings. The output therefore begins with the single token __specialization_static, as in __specialization_static void __wrapper__device_stub_kernel(...), which the CUDA runtime uses to distinguish specialization stubs from primary template stubs.

Name Emission via gen_bare_name

In stub mode, gen_bare_name (sub_473F10) prepends the wrapper prefix character-by-character. The relevant code path:

// sub_473F10, decompiled lines 130-144
if (byte_at(v2 + 182) & 0x40 && dword_1065850) {
    // Emit line directive if pending
    if (dword_1065818)
        sub_467DA0();

    // Character-by-character emission of "__wrapper__device_stub_"
    v25 = "_wrapper__device_stub_";   // note: starts at second char
    v26 = 95;                          // first char: '_' (0x5F = 95)
    do {
        ++v25;
        putc(v26, stream);
        v26 = *(v25 - 1);
        ++dword_106581C;
    } while ((char)v26);
}

The technique is notable: the string "_wrapper__device_stub_" is stored starting at the second character, and the first underscore (_, ASCII 95) is loaded as the initial character separately. The do/while loop then walks the string pointer forward, emitting each character via putc and incrementing the column counter (dword_106581C). This assembles the full __wrapper__device_stub_ prefix before the actual function name is emitted.
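
The loop can be reproduced outside the binary to confirm that it emits all 23 characters of the prefix. A minimal sketch, assuming a hypothetical Emitter that stands in for the output stream and the column counter dword_106581C:

```cpp
#include <cassert>
#include <string>

struct Emitter {
    std::string out;  // stands in for the output stream
    int column = 0;   // stands in for dword_106581C
};

// Same shape as the decompiled loop: the first '_' is held in `ch`,
// the remaining 22 characters come from the string literal.
void emit_stub_prefix(Emitter& e) {
    const char* p = "_wrapper__device_stub_";
    int ch = '_';
    do {
        ++p;
        e.out.push_back(static_cast<char>(ch));
        ch = *(p - 1);   // load the character to emit on the next pass
        ++e.column;
    } while (static_cast<char>(ch));  // stops once the NUL has been loaded
}
```

The loop reads the terminating NUL through p - 1, so the pointer itself ends one past the terminator without ever being dereferenced there.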

cudaLaunchKernel Placeholder Body

For non-specialization __global__ kernels in stub mode, the body is a single-line placeholder:

// sub_47BFD0, decompiled lines 1424-1429
if (dword_1065850) {
    if (!v153 && v90) {    // not a specialization AND has __global__ body
        sub_468190("{ ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}");
        goto suppress_original;
    }
}

The call ::cudaLaunchKernel(0, 0, 0, 0, 0, 0) is never actually executed at runtime. It exists solely to create a linker dependency on the CUDA runtime library, ensuring that cudaLaunchKernel is linked even though the real launch is performed through the CUDA driver API. The six zero arguments match the signature cudaError_t cudaLaunchKernel(const void*, dim3, dim3, void**, size_t, cudaStream_t).
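
That six literal zeros type-check against this signature can be demonstrated with host-only mocks. Everything below is a stand-in written for this illustration: the dim3 constructor is assumed to mirror the default-argument constructor of the real type, cudaError_t is reduced to int, and the function body only records its arguments.

```cpp
#include <cassert>
#include <cstddef>

// Mock types; not the real CUDA headers.
struct dim3 {
    unsigned x, y, z;
    dim3(unsigned vx = 1, unsigned vy = 1, unsigned vz = 1) : x(vx), y(vy), z(vz) {}
};
typedef int cudaError_t;
typedef struct CUstream_st* cudaStream_t;

static dim3 last_grid, last_block;
static int launch_calls = 0;

cudaError_t cudaLaunchKernel(const void* func, dim3 grid, dim3 block,
                             void** args, std::size_t shared_mem,
                             cudaStream_t stream) {
    (void)func; (void)args; (void)shared_mem; (void)stream;
    last_grid = grid;
    last_block = block;
    ++launch_calls;
    return 0;
}

// Same text as the placeholder body string at 0x839CB8.
void placeholder_body() { ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}
```

Each 0 converts either to a null pointer, to std::size_t, or to dim3 through the converting constructor, which is why the placeholder compiles without casts. With these mocks the two dim3 arguments arrive as (0, 1, 1).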

Complete Output Example (Static Stub)

For the same kernel above, instantiated with T = float, the static stub (emitted first, with dword_1065850 == 1) produces:

static void __wrapper__device_stub_my_kernel(float* data, int n) {
    ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}

Dummy Bodies for Non-Kernel Device Functions

Not all CUDA-annotated functions are __global__ kernels. Device-only functions (constructors, destructors, and plain __device__ functions) that have definitions also need host-side bodies to prevent host compiler errors. These receive dummy bodies designed to suppress optimizer warnings while remaining syntactically valid.

Condition for Dummy Body Emission

The dummy body path activates in the ELSE branch of the __global__ check -- that is, for non-kernel device functions. The condition from the decompiled code (lines 1603-1606):

// This path is reached when (byte[182] & 0x40) == 0 -- entity is NOT __global__
// The flags field at offset +176 is an 8-byte bitfield encoding linkage/definition state.

uint64_t flags = *(uint64_t*)(entity + 176);
if ((flags & 0x30000000000500) != 0x20000000000000)  // NOT a device-only entity with definition
    goto emit_original_body;                          // skip dummy, emit normally

if (!dword_106BFDC || (entity->byte_81 & 4) != 0)   // dword_106BFDC clear, or bit 0x04 of entity byte +81 set
{
    // Emit dummy body for device-only function visible in host output
}

The bitmask 0x30000000000500 covers bits 0x30 of byte +182 (the execution-space byte) and bits 0x05 of byte +177 in the 8-byte flags field. The target value 0x20000000000000 requires byte +182 to have 0x20 set and 0x10 clear, and both masked bits of byte +177 to be clear. This selects entities that have device annotation set but no host-side definition -- exactly the functions that need a dummy body to satisfy the host compiler.
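
The 64-bit comparison and its byte-level equivalent can be written side by side. This is a sketch, assuming a little-endian flags field at +176; the function names and the raw byte-array view of the entity are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// The comparison as it appears in the decompiled code.
bool takes_dummy_path(std::uint64_t flags) {
    return (flags & 0x30000000000500ULL) == 0x20000000000000ULL;
}

// The same test restated on individual entity bytes.
bool takes_dummy_path_bytes(const unsigned char* entity) {
    return (entity[182] & 0x30) == 0x20 && (entity[177] & 0x05) == 0;
}

// Assemble the QWORD at +176 in little-endian order, independent of the
// endianness of the machine running this sketch.
std::uint64_t load_flags(const unsigned char* entity) {
    std::uint64_t v = 0;
    for (int i = 7; i >= 0; --i) v = (v << 8) | entity[176 + i];
    return v;
}
```

Bits outside the mask, such as 0x02 of byte +177, do not affect the outcome in either form.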

Constructor/Destructor Dummy (definition_kind 1 or 2)

For constructors (definition_kind == 1) and destructors (definition_kind == 2), the dummy body declares a volatile null pointer and passes it to ::free:

// sub_47BFD0, decompiled lines 1611-1651
if ((unsigned char)(byte[166] - 1) <= 1) {
    sub_468190("{int *volatile ___ = 0;");
    // ... emit (void)param; for each parameter ...
    sub_468190("::free(___);}");
}

Output:

{int *volatile ___ = 0;(void)param1;(void)param2;::free(___);}

The volatile qualifier prevents the optimizer from removing the variable or folding its value. Because ___ is initialized to 0, the ::free(___) call is a no-op at runtime, but it establishes a dependency on the C library and prevents dead code elimination of the entire body.

global / Regular Device Function Dummy (definition_kind >= 3)

For non-constructor/destructor device functions, a different pattern is used:

else {
    sub_468190("{int volatile ___ = 1;");
    // ... emit (void)param; for each parameter ...
    sub_468190("::exit(___);}");
}

Output:

{int volatile ___ = 1;(void)param1;(void)param2;::exit(___);}

The ::exit(1) call guarantees the function is never considered to "return normally" by the host compiler's control-flow analysis, suppressing missing-return-value warnings for non-void functions.

Parameter Usage Emission

Between the opening and closing statements, each named parameter is referenced with (void)param; to suppress unused-parameter warnings. The loop walks the parameter list:

for (kk = *(v167 + 40); kk; kk = *(kk + 104)) {
    if (*(kk + 8) && !(byte_at(kk + 166) & 0x40)) {  // has name, not a pack
        // For aggregate types with GNU host compiler: complex cast chain
        if (!dword_1065750 && dword_126E1F8
            && is_aggregate_type(*(kk + 112))
            && has_nontrivial_dtor(*(kk + 112))) {
            sub_468190("(void)");
            sub_468190("reinterpret_cast<void *>(&(const_cast<char &>");
            sub_468190("(reinterpret_cast<const volatile char &>(");
            sub_474BB0(kk, 7);  // parameter name
            sub_468190("))))");
        } else {
            sub_468190("(void)");
            sub_474BB0(kk, 7);  // parameter name
        }
        sub_468150(';');
    }
}

The complex reinterpret_cast chain for aggregate types with non-trivial destructors avoids triggering GCC/Clang warnings about taking the address of a parameter that might be passed in registers.
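
The two dummy-body shapes differ only in their opening and closing strings, so the common case can be sketched as one builder. dummy_body and DummyKind are hypothetical names; the sketch covers only the plain (void)param; form and omits the reinterpret_cast chain used for aggregates with non-trivial destructors.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class DummyKind { CtorOrDtor, Other };

// `named_params` should already exclude unnamed parameters and packs,
// mirroring the filter in the loop above.
std::string dummy_body(DummyKind kind,
                       const std::vector<std::string>& named_params) {
    const bool special = (kind == DummyKind::CtorOrDtor);
    std::string out = special ? "{int *volatile ___ = 0;"
                              : "{int volatile ___ = 1;";
    for (const auto& p : named_params) {
        out += "(void)";
        out += p;
        out += ';';
    }
    out += special ? "::free(___);}" : "::exit(___);}";
    return out;
}
```

With parameters x and y and DummyKind::Other, the result is {int volatile ___ = 1;(void)x;(void)y;::exit(___);}.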

The #if 0 / #endif Suppression

After the stub body is emitted, the original kernel body is wrapped in preprocessor guards to hide it from the host compiler:

// sub_47BFD0, decompiled lines 1598-1601
sub_46BC80("#if 0");       // emit "#if 0\n"
--dword_1065834;           // decrease indent level
sub_467D60();              // emit newline

// ... then emit the original body via:
dword_1065850_saved = dword_1065850;
dword_1065850 = 0;                    // temporarily disable stub mode
sub_47AEF0(*(v167 + 80), 0);         // gen_statement_full: emit original body
dword_1065850 = dword_1065850_saved;  // restore stub mode
sub_466C10();                          // finalize

// ... then emit #endif
putc('#', stream);
// character-by-character emission of "#endif\n"

The function temporarily disables stub mode (dword_1065850 = 0) while emitting the original body so that any nested constructs are generated normally. Stub mode is restored as soon as the body has been emitted, and #endif follows.

For definitions (when v112 == 0), a trailing ; is appended after #endif to satisfy host compilers that may expect a statement terminator.
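
A reimplementation would likely express the save/clear/restore sequence around the body emission as a scope guard. The following is one possible sketch, not the structure of the original code: StubModeSuspend and stub_mode_flag are hypothetical names, with stub_mode_flag standing in for dword_1065850.

```cpp
#include <cassert>

static int stub_mode_flag = 1;  // stand-in for dword_1065850

// Clears the flag for the lifetime of the guard and restores it afterwards.
class StubModeSuspend {
public:
    StubModeSuspend() : saved_(stub_mode_flag) { stub_mode_flag = 0; }
    ~StubModeSuspend() { stub_mode_flag = saved_; }
    StubModeSuspend(const StubModeSuspend&) = delete;
    StubModeSuspend& operator=(const StubModeSuspend&) = delete;
private:
    int saved_;
};

// Returns the flag value that nested statement generation would observe.
int emit_original_body_sketch() {
    StubModeSuspend guard;
    return stub_mode_flag;
}
```

The guard restores the flag on every exit path, which the hand-written save and restore in the decompiled code only guarantees on the straight-line path.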

The -static-global-template-stub Flag

The CLI flag -static-global-template-stub=true controls how template __global__ functions are stubbed. When enabled, template kernel stubs receive static linkage, which avoids ODR violations when the same template kernel is instantiated in multiple translation units during whole-program compilation (-rdc=false).

The flag produces two diagnostic messages when it encounters problematic patterns:

  1. Extern template kernel: "when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false")" -- An extern template kernel cannot receive a static stub because the definitions would conflict across TUs.

  2. Missing definition: "when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit" -- The static stub requires a local definition to replace.

Both diagnostics recommend either switching to -rdc=true (separate compilation) or explicitly setting -static-global-template-stub=false.

Diagnostic Push/Pop Around Stubs

Before emitting device stub declarations, gen_routine_decl wraps the output in compiler-specific diagnostic suppression to prevent spurious warnings:

For GCC/Clang hosts (dword_126E1F8 set, version > 0x9E97 = 40599):

sub_467E50("\n#pragma GCC diagnostic push\n");
sub_467E50("#pragma GCC diagnostic ignored \"-Wunused-parameter\"\n");
// ... stub emission ...
sub_467E50("\n#pragma GCC diagnostic pop\n");

For MSVC hosts (dword_126E1D8 set):

sub_467E50("\n__pragma(warning(push))\n");
sub_467E50("__pragma(warning(disable : 4100))\n");  // unreferenced formal parameter
// ... stub emission ...
sub_467E50("\n__pragma(warning(pop))\n");

For static template specialization stubs, an additional warning is suppressed:

  • GCC/Clang: #pragma GCC diagnostic ignored "-Wunused-function" (warning 4505 on MSVC: "unreferenced local function has been removed")
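
The host-compiler dispatch can be sketched as a function returning the text emitted before and after a stub. HostCompiler, StubPragmas and stub_pragmas are hypothetical names, and returning empty strings for older GNU versions or other hosts is an assumption of this sketch rather than something confirmed in the binary.

```cpp
#include <cassert>
#include <string>

enum class HostCompiler { GnuLike, Msvc, Other };

struct StubPragmas {
    std::string before;  // emitted ahead of the stub declaration
    std::string after;   // emitted once the stub is complete
};

StubPragmas stub_pragmas(HostCompiler host, int gnu_version) {
    StubPragmas p;
    if (host == HostCompiler::GnuLike && gnu_version > 0x9E97) {  // > 40599
        p.before = "\n#pragma GCC diagnostic push\n"
                   "#pragma GCC diagnostic ignored \"-Wunused-parameter\"\n";
        p.after = "\n#pragma GCC diagnostic pop\n";
    } else if (host == HostCompiler::Msvc) {
        p.before = "\n__pragma(warning(push))\n"
                   "__pragma(warning(disable : 4100))\n";
        p.after = "\n__pragma(warning(pop))\n";
    }
    return p;  // empty strings: no suppression emitted (assumption)
}
```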

Deferred Function List for Whole-Program Mode

When dword_106BFBC (a whole-program compilation flag) is set and dword_106BFDC is clear, instead of emitting a dummy body immediately, gen_routine_decl adds the function to a deferred list:

// sub_47BFD0, decompiled lines 1713-1745
v117 = sub_6B7340(32);          // allocate 32-byte node
v117[0] = qword_1065840;        // link to previous head
v117[1] = source_start;         // source position start
v117[2] = source_end;           // source position end
if (has_name)
    v117[3] = strdup(name);     // copy of function name
else
    v117[3] = NULL;
qword_1065840 = v117;           // push onto list head

This deferred list (qword_1065840) is later consumed during the breakpoint placeholder generation phase in process_file_scope_entities (sub_489000), where each deferred entry produces a static __attribute__((used)) void __nv_breakpoint_placeholder<N>_<name>(void) { exit(0); } function.
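
The list push above maps onto a four-field node. A minimal sketch, assuming pointer-sized source positions (which gives the 32-byte node on x86-64); DeferredFn, deferred_head and defer_function are hypothetical names, and malloc plus memcpy is used here as a portable substitute for the allocator (sub_6B7340) and strdup seen in the decompilation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

struct DeferredFn {
    DeferredFn* next;          // +0
    std::uintptr_t src_start;  // +8
    std::uintptr_t src_end;    // +16
    char* name;                // +24, owned copy or nullptr
};

static DeferredFn* deferred_head = nullptr;  // stand-in for qword_1065840

DeferredFn* defer_function(std::uintptr_t start, std::uintptr_t end,
                           const char* name) {
    DeferredFn* n = static_cast<DeferredFn*>(std::malloc(sizeof(DeferredFn)));
    if (!n) return nullptr;
    n->next = deferred_head;   // link to previous head
    n->src_start = start;
    n->src_end = end;
    n->name = nullptr;
    if (name) {
        std::size_t len = std::strlen(name) + 1;
        n->name = static_cast<char*>(std::malloc(len));
        if (n->name) std::memcpy(n->name, name, len);
    }
    deferred_head = n;         // push onto list head
    return n;
}
```

Because each node is pushed at the head, a consumer that walks from deferred_head sees the functions in reverse order of deferral.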

Function Map

| Address | Name | Role |
| --- | --- | --- |
| sub_47BFD0 | gen_routine_decl | Main stub generator; 1831 lines; handles all function declarations |
| sub_473F10 | gen_bare_name | Character-by-character name emission with __wrapper__device_stub_ prefix |
| sub_474BB0 | gen_entity_name | Parameter name emission for forwarding calls |
| sub_474D60 | gen_scope_qualifier | Recursive namespace path emission (ns1::ns2::) |
| sub_478900 | gen_parameter_list | Parameter list with type transformation in stub mode |
| sub_478D70 | gen_function_declarator_with_scope | Full function declarator with cv-qualifiers and ref-qualifiers |
| sub_47AEF0 | gen_statement_full | Statement generator used for emitting original body inside #if 0 |
| sub_47ECC0 | gen_template / process_source_sequence | Top-level dispatch; also sets dword_1065850 for instantiation directives |
| sub_46BC80 | (emit #if directive) | Emits #if 0 / #if 1 preprocessor lines |
| sub_467E50 | (emit string) | Primary string emission to output stream |
| sub_468190 | (emit raw string) | Raw string emission (no line directive) |
| sub_489000 | process_file_scope_entities | Backend entry point; consumes deferred function list |

Concrete Example: Simple Kernel Stub Output

Given this input CUDA source:

__global__ void add_one(int *data, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] += 1;
}

cudafe++ generates the following in the .int.c host translation file. The toggle fires at the top of gen_routine_decl (0->1), so the static stub definition is emitted FIRST, followed by the forwarding body from the recursive call.

Output 1: Static Stub Definition (first call, dword_1065850 == 1 after toggle)

The static stub provides the linker symbol that the forwarding body calls. Diagnostic pragmas wrap the declaration to suppress unused-parameter warnings:

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-parameter"
static void __wrapper__device_stub_add_one(int *data, int n) {
    ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
#if 0
/* Original kernel body -- hidden from host compiler */
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] += 1;
}
#endif
#pragma GCC diagnostic pop

The static storage class is forced by the check at decompiled line 897-903. The __wrapper__device_stub_ prefix is emitted by gen_bare_name (sub_473F10). The cudaLaunchKernel placeholder body comes from the string literal at 0x839CB8.

Output 2: Forwarding Body (recursive call, dword_1065850 == 0 after toggle)

After the static stub is emitted and gen_routine_decl recurses, the forwarding body replaces the original kernel body. The __global__ attribute is stripped (kernels become regular host functions in .int.c):

void add_one(int *data, int n) {__wrapper__device_stub_add_one(data, n);return;}
#if 0
/* Original kernel body -- hidden from host compiler (emitted again) */
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        data[idx] += 1;
}
#endif

The forwarding body is assembled piece by piece from successive emit calls:

  1. { -- open brace
  2. Scope qualifier (none for file-scope kernels; ns:: for namespaced ones)
  3. __wrapper__device_stub_ -- the stub prefix from string at 0x839420
  4. add_one -- the original function name from entity + 8
  5. (data, n -- opening parenthesis, then the parameter names (no types, just names via sub_474BB0)
  6. );return;} -- close the forwarding call and return

The original body appears in #if 0 in both outputs because both code paths reach the same LABEL_457 -> sub_46BC80("#if 0") emission point.

Template Kernel Example

For a template kernel:

template<typename T>
__global__ void scale(T *data, T factor, int n) { /* ... */ }

// explicit instantiation
template __global__ void scale<float>(float *, float, int);

Output 1 (first call, dword_1065850 == 1) produces a specialization stub:

__specialization_static void __wrapper__device_stub_scale(float *data, float factor, int n) {
    ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}

Output 2 (recursive call, dword_1065850 == 0) produces a forwarding stub with template arguments:

template<typename T>
void scale(T *data, T factor, int n) {__wrapper__device_stub_scale<T>(data, factor, n);return;}

The __specialization_ prefix is emitted only when the entity is a template specialization (v153 != 0) and dword_1065850 is set (decompiled line 901-902).

Device-Only Function Example

For a non-kernel __device__ function with a body:

__device__ int device_helper(int x, int y) {
    return x + y;
}

The host output uses a dummy body instead of a forwarding stub (since there is no __wrapper__device_stub_ target for non-kernel functions):

__attribute__((unused)) int device_helper(int x, int y) {int volatile ___ = 1;(void)x;(void)y;::exit(___);}
#if 0
{
    return x + y;
}
#endif

The __attribute__((unused)) prefix is emitted when the function's execution space is device-only ((byte_182 & 0x70) == 0x20) and dword_126E1F8 (GCC host compiler mode) is set (decompiled line 905-906).

Cross-References

RDC Mode

CUDA supports two compilation models that fundamentally change how cudafe++ processes device code: whole-program mode (-rdc=false, the default) and separate compilation mode (-rdc=true, also called Relocatable Device Code). The mode switch affects error checking, stub linkage, module ID generation, anonymous namespace mangling, and -- when multiple translation units are involved -- triggers EDG's cross-TU correspondence machinery for structural type verification.

From cudafe++'s perspective, the distinction maps to a single CLI flag (--device-c, flag index 77) and a handful of global booleans that gate code paths throughout the binary. This page documents what changes between the two modes, how module IDs are generated, how cross-TU IL correspondence works, and how host stub linkage is controlled.

Key Facts

| Property | Value |
| --- | --- |
| RDC CLI flag | --device-c (flag index 77, no argument) |
| Whole-program mode flag | dword_106BFBC (also set by --debug_mode) |
| Module ID cache | qword_126F0C0 (cached string, computed once) |
| Module ID generator | sub_5AF830 (make_module_id, ~450 lines) |
| Module ID setter | sub_5AF7F0 (set_module_id) |
| Module ID getter | sub_5AF820 (get_module_id) |
| Module ID file writer | sub_5B0180 (write_module_id_to_file) |
| Module ID file flag | --gen_module_id_file (flag 83) |
| Module ID file path | --module_id_file_name (flag 87) |
| Cross-TU IL copier | sub_796BA0 (copy_secondary_trans_unit_IL_to_primary, trans_copy.c) |
| Cross-TU usage marker | sub_796C00 (mark_secondary_IL_entities_used_from_primary) |
| Class correspondence | sub_7A00D0 (verify_class_type_correspondence, 703 lines) |
| TU processing entry | sub_7A40A0 (process_translation_unit) |
| TU switch | sub_7A3D60 (switch_translation_unit) |
| Host stub linkage flag | --host-stub-linkage-explicit (flag 47) |
| Static host stub flag | --static-host-stub (flag 48) |
| Static template stub flag | --static-global-template-stub (set_flag mechanism) |
| EDG source files | host_envir.c (module ID), trans_copy.c, trans_corresp.c, trans_unit.c |

Whole-Program Mode (-rdc=false)

Whole-program mode is the default. All device code for a given translation unit must be defined within that single .cu file. No external device symbols are allowed. The device compiler sees all of the translation unit's device code at once, and nvlink is not required for device code linking.

Constraints Enforced

Five diagnostics are specific to whole-program mode or are closely tied to the internal-linkage consequences of non-RDC compilation:

1. Inline device/constant/managed variables must have internal linkage.

An inline __device__/__constant__/__managed__ variable must have
internal linkage when the program is compiled in whole program
mode (-rdc=false)

In whole-program mode, the device runtime has no linker step to resolve external inline variables across TUs. An inline __device__ variable with external linkage would need cross-TU deduplication that only nvlink can provide. The frontend forces static (or anonymous-namespace) linkage, emitting an error if the variable has external linkage.

2. Extern __global__ function templates are forbidden (with -static-global-template-stub=true).

when "-static-global-template-stub=true", extern __global__ function
template is not supported in whole program compilation mode ("-rdc=false").
To resolve the issue, either use separate compilation mode ("-rdc=true"),
or explicitly set "-static-global-template-stub=false" (but see nvcc
documentation about downsides of turning it off)

The -static-global-template-stub flag causes template kernel stubs to receive static linkage to avoid ODR violations when the same template is instantiated in multiple host-side compilation units. An extern template declaration conflicts with this because the extern stub expects an external definition while the static stub forces a local one. The diagnostic tag for this is extern_kernel_template.

3. __global__ template instantiations must have local definitions (with -static-global-template-stub=true).

when "-static-global-template-stub=true" in whole program compilation
mode ("-rdc=false"), a __global__ function template instantiation or
specialization (%sq) must have a definition in the current translation
unit.

A static stub requires a definition in the same TU. If the instantiation point references a template defined in another header without an explicit instantiation, the stub has no body to emit. The diagnostic tag is template_global_no_def.

Both template-related diagnostics recommend either switching to -rdc=true or setting -static-global-template-stub=false. The 4 usage contexts in the binary for -static-global-template-stub all appear in error message strings (at addresses 0x88E588 and 0x88E6E0).

4. Kernel launch from __device__ or __global__ functions requires separate compilation.

kernel launch from __device__ or __global__ functions requires
separate compilation mode

Dynamic parallelism -- launching a kernel from device code (a __device__ or __global__ function calling <<<...>>>) -- requires the device linker (nvlink) to resolve cross-module kernel references. In whole-program mode, no device linking occurs, so the construct is illegal. The diagnostic tag is device_launch_no_sepcomp.

5. Address of internal linkage device function (bug mitigation).

address of internal linkage device function (%sq) was taken
(nv bug 2001144). mitigation: no mitigation required if the
address is not used for comparison, or if the target function
is not a CUDA C++ builtin. Otherwise, write a wrapper function
to call the builtin, and take the address of the wrapper
function instead

This diagnostic fires in whole-program mode when code takes the address of a static __device__ function. Because device functions with internal linkage get module-ID-based name mangling, their addresses may differ across compilations or across TUs even when they refer to the "same" function. The warning documents a known NVIDIA bug (2001144) and provides a workaround: wrap the builtin in a non-internal function and take the wrapper's address instead. This diagnostic has no associated tag name -- it is emitted unconditionally when the condition is detected.

Deferred Function List

When dword_106BFBC (whole-program mode) is set and dword_106BFDC (skip-device-only) is clear, gen_routine_decl (sub_47BFD0) adds device-only functions to a deferred linked list (qword_1065840) rather than emitting dummy bodies inline. Each list node is 32 bytes:

| Offset | Field |
| --- | --- |
| +0 | next pointer |
| +8 | Source position (start) |
| +16 | Source position (end) |
| +24 | Name string (strdup'd, or NULL) |

This list is consumed during the breakpoint placeholder phase in process_file_scope_entities (sub_489000), where each entry produces a static __attribute__((used)) void __nv_breakpoint_placeholder<N>_<name>(void) { exit(0); } function for debugger support.

Separate Compilation Mode (-rdc=true)

When nvcc passes --device-c (flag index 77) to cudafe++, separate compilation mode is activated. This:

  • Allows __device__, __constant__, and __managed__ variables to have external linkage
  • Permits extern __global__ template functions
  • Enables dynamic parallelism (kernel launches from device code)
  • Requires nvlink to resolve device-side cross-TU references
  • Generates a module ID that uniquely identifies each compilation unit for runtime registration

In this mode, the host stubs are generated with external linkage (by default) so the host linker can resolve cross-TU kernel calls. The module ID is embedded in the registration code to match host stubs with their corresponding device fatbinary segments.

Multi-TU Processing in EDG

When multiple translation units are compiled in a single cudafe++ invocation (as happens during RDC compilation with nvcc), the EDG frontend processes them sequentially using a stack-based TU management system:

| Global | Purpose |
|---|---|
| qword_106BA10 | Current translation unit pointer |
| qword_106B9F0 | Primary (first) translation unit |
| qword_106BA18 | TU stack top |
| dword_106B9E8 | TU stack depth (excluding primary) |

process_translation_unit (sub_7A40A0, trans_unit.c) is the main entry point called from main() for each source file:

  1. Allocates a 424-byte TU descriptor via sub_6BA0D0
  2. Initializes scope state and copies registered variable defaults
  3. Sets the primary TU pointer (qword_106B9F0) for the first file
  4. Links the TU into the processing chain
  5. Opens the source file and sets up include paths
  6. Runs the parser (sub_586240)
  7. Dispatches to standard compilation (sub_4E8A60) or module compilation (sub_6FDDF0)
  8. Calls finalization (sub_588E90)
  9. Pops the TU from the stack

switch_translation_unit (sub_7A3D60, trans_unit.c, line 514) saves/restores per-TU state when the frontend needs to reference entities from a different TU:

  1. Asserts qword_106BA10 != 0 (current TU exists)
  2. If target differs from current: saves current TU via sub_7A3A50
  3. Restores target TU state via memcpy from per-TU buffer
  4. Sets qword_106BA10 = target
  5. Restores scope chain: xmmword_126EB60, qword_126EB70, etc.
  6. Recomputes file scope indices via sub_704490

Per-TU state is registered through f_register_trans_unit_variable (sub_7A3C00, trans_unit.c, line 227), which accumulates variables into a linked list (qword_12C7AA8). Each registration record is 40 bytes with fields for the variable pointer, name, prior size, and buffer offset. The total per-TU buffer size is tracked in qword_12C7A98.

Three core variables are always registered (sub_7A4690):

  • dword_106BA08 (is_recompilation), 4 bytes
  • qword_106BA00 (current_filename), 8 bytes
  • dword_106B9F8 (has_module_info), 4 bytes
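The registration and save/restore scheme can be sketched in a few lines of C. This is a minimal model, not the decompiled code: the names tu_var, register_tu_var, save_tu_state, and restore_tu_state are hypothetical, the record fields are simplified, and packing variables into the buffer without alignment padding is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative 40-byte registration record (five 8-byte fields on LP64). */
typedef struct tu_var {
    struct tu_var *next;
    void *addr;          /* address of the global being tracked  */
    const char *name;    /* diagnostic name                      */
    size_t size;         /* bytes to save and restore            */
    size_t offset;       /* position inside the per-TU buffer    */
} tu_var;

static tu_var *g_vars;        /* plays the role of qword_12C7AA8 */
static size_t g_buffer_size;  /* plays the role of qword_12C7A98 */

/* Pushes a record onto the list and reserves space in the per-TU buffer.
   Packing without alignment padding is an assumption of this sketch. */
static void register_tu_var(void *addr, size_t size, const char *name)
{
    tu_var *v = malloc(sizeof *v);
    assert(v != NULL);
    v->addr = addr;
    v->name = name;
    v->size = size;
    v->offset = g_buffer_size;
    v->next = g_vars;
    g_vars = v;
    g_buffer_size += size;
}

/* Copies every registered global into a per-TU buffer. */
static void save_tu_state(unsigned char *buf)
{
    for (tu_var *v = g_vars; v != NULL; v = v->next)
        memcpy(buf + v->offset, v->addr, v->size);
}

/* Copies the saved values back when that TU becomes current again. */
static void restore_tu_state(const unsigned char *buf)
{
    for (tu_var *v = g_vars; v != NULL; v = v->next)
        memcpy(v->addr, buf + v->offset, v->size);
}
```

With the three core variables registered (4, 8, and 4 bytes), the tracked buffer size in this model would be 16 bytes.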

Module ID Generation

Every compilation unit in CUDA needs a unique identifier to associate host-side registration code with the correct device fatbinary. This identifier -- the module ID -- is generated by make_module_id (sub_5AF830, host_envir.c, ~450 lines) and cached in qword_126F0C0.

Algorithm

The module ID generator has three source modes, tried in order:

Mode 1: Module ID file. If qword_106BF80 (set by --module_id_file_name) is non-NULL, the entire contents of the specified file are read and used as the module ID. This allows build systems to inject deterministic identifiers.

Mode 2: Explicit numeric token. If the caller provides a non-NULL string argument (nptr), it is parsed via strtoul. If the parse succeeds, the numeric value is used directly. If the parse fails (the string is not a pure integer), the string itself is CRC32-hashed and the hash is used.
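A sketch of the "pure integer" test in Mode 2, using the standard strtoul end-pointer idiom. The helper name is hypothetical, and both the base (10) and the exact success criterion (whole string consumed, no overflow) are assumptions; the source only states that strtoul is used and that non-integer strings fall back to hashing.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Returns 1 and stores the value when s parses as a pure integer,
   otherwise returns 0 so the caller can fall back to hashing the string.
   Base 10 and the "entire string consumed" rule are assumptions. */
static int parse_numeric_token(const char *s, unsigned long *out)
{
    char *end;
    unsigned long v;

    assert(s != NULL && out != NULL);
    errno = 0;
    v = strtoul(s, &end, 10);
    if (end == s || *end != '\0' || errno == ERANGE)
        return 0;
    *out = v;
    return 1;
}
```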

Mode 3: Default computation. The default path builds the ID from several components:

  1. Calls stat() on the source file to obtain mtime
  2. Formats ctime() of the modification time
  3. Reads getpid() for the current process ID
  4. Collects qword_106C038 (command-line options hash input)
  5. Computes the CRC32 hash of the options string
  6. Takes the output filename, strips it to basename
  7. If the source filename exceeds 8 characters, replaces it with its CRC32 hex representation

The final string is assembled in the format:

{options_crc}_{output_name_len}_{output_name}_{source_or_crc}[_{extra}][_{pid}]

All non-alphanumeric characters in the result are replaced with underscores. The string is allocated permanently and cached in qword_126F0C0.
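The assembly and sanitization steps might look like the following C sketch. Function names are hypothetical. The components are passed in already formatted, because how the binary renders each one (hex or decimal, padding) is not modeled here; the %ld rendering of the PID follows the debug trace format.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Replaces every non-alphanumeric character with '_' in place. */
static void sanitize_module_id(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s)
        if (!isalnum((unsigned char)*s))
            *s = '_';
}

/* Joins pre-formatted components in the documented order, then sanitizes.
   extra may be NULL; pid < 0 means "omit the PID" (a convention of this
   sketch). Returns the final length, or -1 if buf is too small. */
static int assemble_module_id(char *buf, size_t cap, const char *options_crc,
                              const char *output_name,
                              const char *source_or_crc,
                              const char *extra, long pid)
{
    size_t used;
    int n;

    assert(buf != NULL && options_crc != NULL);
    assert(output_name != NULL && source_or_crc != NULL);
    n = snprintf(buf, cap, "%s_%zu_%s_%s", options_crc,
                 strlen(output_name), output_name, source_or_crc);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    used = (size_t)n;
    if (extra != NULL) {
        n = snprintf(buf + used, cap - used, "_%s", extra);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }
    if (pid >= 0) {
        n = snprintf(buf + used, cap - used, "_%ld", pid);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }
    sanitize_module_id(buf);
    return (int)used;
}
```

For example, the components "1a2b3c4d", "k.o", "main.cu", no extra string, and PID 77 produce 1a2b3c4d_3_k_o_main_cu_77 in this model.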

Debug tracing (gated by dword_126EFC8) emits:

make_module_id: str1 = %s, str2 = %s, pid = %ld
make_module_id: final string = %s

CRC32 Implementation

The function contains an inline CRC32 implementation that appears three times (for the options hash, the source filename, and the extra string). All three copies use the same algorithm:

  • Polynomial: 0xEDB88320 (standard reflected CRC-32)
  • Initial value: 0xFFFFFFFF
  • Processing: bit-by-bit, 8 iterations per byte
  • Final XOR: implicit via the reflected algorithm

The triple inlining suggests the CRC32 was originally a macro or small inline function that the compiler expanded at each call site. The polynomial 0xEDB88320 is the bit-reversed form of the standard CRC-32 polynomial 0x04C11DB7; together with the 0xFFFFFFFF initial value, this is consistent with the ubiquitous CRC-32/ISO-HDLC algorithm.
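For reference, a bit-by-bit reflected CRC-32 with these parameters can be written as below. This is the textbook CRC-32/ISO-HDLC routine, not a transcription of the decompiled code; the function name is hypothetical, and the explicit final complement is part of the standard algorithm rather than something established from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Textbook reflected CRC-32: polynomial 0xEDB88320, initial value
   0xFFFFFFFF, one bit per iteration, eight iterations per input byte. */
static uint32_t crc32_bitwise(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;  /* final complement, as in standard CRC-32/ISO-HDLC */
}
```

The standard check input "123456789" hashes to 0xCBF43926 with this routine, which is a convenient way to compare a reimplementation against other CRC-32 libraries.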

PID Incorporation

The getpid() call ensures that concurrent compilations of the same source file produce different module IDs. Without the PID, two parallel nvcc invocations compiling the same .cu file with the same flags would generate identical module IDs, potentially causing runtime registration collisions. The PID is appended as the final underscore-separated component.

Module ID File Output

When --gen_module_id_file (flag 83) is set, write_module_id_to_file (sub_5B0180) generates the module ID via sub_5AF830(0) and writes it to the file specified by qword_106BF80 (--module_id_file_name, flag 87). If the filename is not set, it emits "module id filename not specified". If the write fails, it emits "error writing module id to file".

In the backend output phase, if dword_106BFB8 (emit-symbol-table flag) is set, sub_5B0180 is also called to write the module ID before the host reference arrays are emitted.

Entity-Based Module ID Selection

An alternative module ID source is available through use_variable_or_routine_for_module_id_if_needed (sub_5CF030, il.c, line 31969, ~65 lines). Rather than computing a hash from file metadata, this function selects a representative entity (variable or function) from the current TU whose mangled name can serve as a stable identifier. The selection criteria are strict:

  • Entity kind must be 7 (variable) or 11 (routine), tested via ((kind - 7) & 0xFB) == 0 (the inner parentheses matter, because == binds more tightly than & in C)
  • Must have a definition (for variables: offset +169 != 0; for routines: has a body)
  • Must not be a class member
  • Must not be in an unnamed namespace
  • Must have storage class == 0 (no explicit static, extern, or register)
  • Must not be template-related or marked with special compilation flags
  • For routines: must not have explicit specialization, return type must not be a builtin

The selected entity is stored in qword_126F140 with its kind byte in byte_126F138 (7 for variable, 11 for routine). This entity's name is then fed into sub_5AF830 to produce the final module ID string. The entity-based approach provides a more deterministic ID than the PID-based default, since it is derived from source content rather than runtime state.
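The kind test works because 11 - 7 = 4 sets only bit 2, which is the one bit the 0xFB mask ignores. A small C sketch (function names are hypothetical) makes the trick explicit and checks it exhaustively over all byte values:

```c
#include <assert.h>

/* True when kind is 7 (variable) or 11 (routine). Subtracting 7 maps the
   two accepted kinds to 0 and 4; masking with 0xFB clears bit 2, so both
   become 0 while every other byte value leaves some other bit set. */
static int is_variable_or_routine(unsigned char kind)
{
    return ((kind - 7) & 0xFB) == 0;
}

/* Exhaustive check that the bit trick matches the plain comparison. */
static void check_all_kinds(void)
{
    for (int k = 0; k < 256; ++k)
        assert(is_variable_or_routine((unsigned char)k) == (k == 7 || k == 11));
}
```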

Anonymous Namespace Mangling

The module ID directly controls how anonymous namespaces are mangled in the .int.c output. The function sub_6BC7E0 (in nv_transforms.c) constructs the anonymous namespace identifier:

// sub_6BC7E0 implementation:
if (qword_1286A00)                      // cached?
    return qword_1286A00;
module_id = sub_5AF830(0);              // get or compute module ID
buf = malloc(strlen(module_id) + 12);   // "_GLOBAL__N_" = 11 chars + NUL
strcpy(buf, "_GLOBAL__N_");
strcat(buf, module_id);
qword_1286A00 = buf;                    // cache for reuse
return buf;

This _GLOBAL__N_<module_id> string is emitted in the .int.c trailer as:

#define _NV_ANON_NAMESPACE _GLOBAL__N_<module_id>
#ifdef _NV_ANON_NAMESPACE
#endif
#include "<source_file>"
#undef _NV_ANON_NAMESPACE

The #define gives anonymous namespace entities a stable, unique mangled name that is consistent between the device and host compilation paths. The #ifdef/#endif guard is defensive -- it tests that the macro was defined (it always is at this point). The #include re-includes the original source file with the macro defined, allowing the host compiler to see the anonymous namespace entities with their module-ID-qualified names. The #undef cleans up to avoid polluting later inclusions.

The anonymous namespace hash also appears during host reference array name construction. For static or anonymous-namespace device entities, the scoped name prefix builder (sub_6BD2F0) inserts _GLOBAL__N_<module_id> as the namespace component, ensuring the mangled name in the .nvHR* section uniquely identifies the entity even across TUs with the same anonymous namespace structure.

Usage in Output

The module ID is used in three places, two inside the generated .int.c output and one outside it:

  1. Anonymous namespace mangling: sub_6BC7E0 constructs _GLOBAL__N_<module_id> for anonymous-namespace symbols in device code, producing unique mangled names per TU.

  2. Registration boilerplate: The __cudaRegisterFatBinary call passes the module ID to the CUDA runtime, which uses it to match host registration with the correct device fatbinary.

  3. Module ID file: When requested, the ID is written to a separate file for consumption by the build system or nvlink.

Cross-TU IL Correspondence

When multiple TUs are processed in a single cudafe++ invocation, the same C++ types, templates, and declarations may appear in multiple TUs. EDG's correspondence system verifies structural equivalence and establishes canonical entries to avoid duplicate definitions in the merged output.

trans_copy.c: IL Copying Between TUs

The trans_copy.c file contains a single function at address 0x796BA0:

copy_secondary_trans_unit_IL_to_primary -- Copies IL entries from secondary translation units into the primary TU's IL tree. Called after all TUs have been parsed, during the fe_wrapup finalization phase (specifically, after the 5-pass multi-TU iteration). This function ensures that device-reachable IL entries from secondary TUs are available in the primary TU's output scope.

A closely related function exists at 0x796C00:

mark_secondary_IL_entities_used_from_primary (sub_796C00) -- Called during fe_wrapup pass 2 (IL lowering), before the TU iteration loop that applies sub_707040 to each TU's file-scope IL. This function marks IL entities in secondary TUs that are referenced from the primary TU, ensuring they survive any dead-code elimination in later passes.

trans_corresp.c: Structural Equivalence Checking

The trans_corresp.c file (address range 0x796E60--0x7A3420, 88 functions) implements the full cross-TU correspondence verification system. The core functions:

verify_class_type_correspondence (sub_7A00D0, 703 lines) is the centerpiece. It performs a deep structural comparison of two class types from different TUs:

  1. Base class comparison via sub_7A27B0 (verify_base_class_correspondence) -- iterates base class lists, comparing virtual/non-virtual status, accessibility, and type identity
  2. Friend declaration comparison via sub_7A1830 (verify_friend_declaration_correspondence) -- walks friend lists checking structural equivalence
  3. Member function comparison via sub_7A1DB0 (verify_member_function_correspondence, 411 lines) -- compares function signatures, attributes, constexpr status, and virtual overrides
  4. Nested type comparison via sub_798960 (equiv_member_constants) -- verifies nested class/enum/typedef correspondence
  5. Template parameter comparison via sub_7B2260 -- validates template parameter lists match structurally
  6. Using declaration comparison -- dispatches by kind: 36 = alias, 6/11 = using declaration, 7/58 = namespace using declaration

If any comparison fails, the function delegates to sub_797180 to emit a diagnostic (error codes 1795/1796), then falls through to f_set_no_trans_unit_corresp (5 variants at sub_797B50-sub_7981A0 for different entity kinds).

The type node layout used by the correspondence system:

  • Offset +132: type kind (9=struct, 10=class, 11=union)
  • Offset +144: referenced type / next pointer
  • Offset +152: class info pointer
  • Offset +161: flags byte (bits for anonymous, elaborated, template, local)
  • Class info at +128: scope block with members at indexed offsets [12], [13], [14], [18], [22]

Supporting verification functions:

| Address | Name | Scope |
|---|---|---|
| sub_7A0E10 | verify_enum_type_correspondence | Enum underlying type and enumerator list |
| sub_7A1230 | verify_function_type_correspondence | Parameter and return type |
| sub_7A1390 | verify_type_correspondence | Dispatcher to class/enum/function variants |
| sub_7A1460 | set_type_correspondence | Links two types as corresponding |
| sub_7A1CC0 | verify_nested_class_body_correspondence | Nested class scope comparison |
| sub_7A2C10 | verify_template_parameter_correspondence | Template parameter list |
| sub_7A3140 | check_decl_correspondence_with_body | Declaration with definition |
| sub_7A3420 | check_decl_correspondence_without_body | Declaration-only case |
| sub_7A38A0 | check_decl_correspondence | Dispatcher (with/without body) |
| sub_7A38D0 | same_source_position | Source position comparison |
| sub_7999C0 | find_template_correspondence | Cross-TU template entity matching (601 lines) |
| sub_79A5A0 | determine_correspondence | General correspondence determination |
| sub_79B8D0 | mark_canonical_instantiation | Updates instantiation canonical status |
| sub_79C1A0 | get_canonical_entry_of | Returns canonical entity for a TU entry |
| sub_79D080 | establish_instantiation_correspondences | Links instantiations across TUs |
| sub_79DFC0 | set_type_corresp | Sets type correspondence |
| sub_79E760 | find_routine_correspondence | Cross-TU function matching |
| sub_79F320 | find_namespace_correspondence | Cross-TU namespace matching |

Correspondence Lifecycle

The correspondence system uses three hash tables (qword_12C7800, qword_12C7880, qword_12C7900, each 0x70 bytes / 14 slots) plus linked lists to track established correspondences. The lifecycle:

  1. Registration (sub_7A3920): Registers three global variables (dword_106B9E4, dword_106B9E0, qword_12C7798) for per-TU save/restore
  2. Initialization (sub_7A3980): Zeroes all correspondence hash tables and list pointers
  3. Discovery during parsing: As the secondary TU is parsed, types/functions that match primary-TU entities are identified through name and scope comparison
  4. Verification: verify_class_type_correspondence and its siblings perform deep structural comparison
  5. Linkage: set_type_correspondence (sub_7A1460) and f_set_trans_unit_corresp (sub_79C400, 511 lines) connect matching entities
  6. Canonicalization: canonical_ranking (sub_796E60) determines which TU's entity is the canonical representative; mark_canonical_instantiation (sub_79B8D0) updates instantiation records

The correspondence allocation uses 24-byte nodes from a free list (qword_12C7AB0) managed by alloc_trans_unit_corresp (sub_7A3B50) and free_trans_unit_corresp (sub_7A3BB0). The free function decrements a refcount at offset +16; when it reaches 1, the node returns to the free list.

Integration with fe_wrapup

The cross-TU correspondence system hooks into the 5-pass multi-TU architecture in fe_wrapup (sub_588E90):

| Pass | Action | Cross-TU Role |
|---|---|---|
| 1 | Per-file IL wrapup (sub_588C60) | Iterates TU chain, prepares file scope IL |
| 2 | IL lowering (sub_707040) | Calls sub_796C00 (mark secondary IL) before loop |
| 3 | IL emission (sub_610420, arg 23) | Marks device-reachable entries per TU |
| 4 | C++ class finalization | Deferred member processing |
| 5 | Per-file part 3 (sub_588D40) | Final per-TU cleanup |
| Post | Cleanup | Calls sub_796BA0 (copy secondary IL to primary) |

After all five passes complete, sub_796BA0 copies remaining secondary-TU IL into the primary TU's tree, and scope renumbering fixes up any index conflicts.

Host Reference Arrays and Linkage Splitting

The six .nvHR* ELF sections emitted in the .int.c output trailer encode device symbol names for CUDA runtime discovery. These arrays are split along two axes: symbol type (kernel, device variable, constant variable) and linkage (external, internal). The split is critical for RDC: external-linkage symbols are globally resolvable by nvlink across all TUs, while internal-linkage symbols are TU-local and require module-ID-based prefixing to avoid collisions.

| Section | Array Name | Symbol Type | Linkage |
|---|---|---|---|
| .nvHRKE | hostRefKernelArrayExternalLinkage | __global__ kernel | external |
| .nvHRKI | hostRefKernelArrayInternalLinkage | __global__ kernel | internal |
| .nvHRDE | hostRefDeviceArrayExternalLinkage | __device__ variable | external |
| .nvHRDI | hostRefDeviceArrayInternalLinkage | __device__ variable | internal |
| .nvHRCE | hostRefConstantArrayExternalLinkage | __constant__ variable | external |
| .nvHRCI | hostRefConstantArrayInternalLinkage | __constant__ variable | internal |

The emission is driven by 6 calls to nv_emit_host_reference_array (sub_6BCF80, 79 lines, nv_transforms.c) with parameters (emit_callback, is_kernel, is_device, is_internal_linkage):

// From sub_489000 (process_file_scope_entities), backend output phase:
if (dword_106BFD0 || dword_106BFCC) {
    sub_6BCF80(sub_467E50, 1, 0, 1);  // kernel, internal
    sub_6BCF80(sub_467E50, 1, 0, 0);  // kernel, external
    sub_6BCF80(sub_467E50, 0, 1, 1);  // device, internal
    sub_6BCF80(sub_467E50, 0, 1, 0);  // device, external
    sub_6BCF80(sub_467E50, 0, 0, 1);  // constant, internal
    sub_6BCF80(sub_467E50, 0, 0, 0);  // constant, external
}

Each call iterates a separate global list that was populated during the entity walk:

| List Address | Content |
|---|---|
| unk_1286880 | kernel external |
| unk_12868C0 | kernel internal |
| unk_1286780 | device external |
| unk_12867C0 | device internal |
| unk_1286800 | constant external |
| unk_1286840 | constant internal |
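The mapping from the three flag arguments to a section and array name is regular enough to express as a small function. The C sketch below is illustrative only: host_ref_names is a hypothetical helper, and deriving the names from the flags this way is an inference from the naming pattern, not a description of how sub_6BCF80 selects them.

```c
#include <assert.h>
#include <stdio.h>

/* Builds the section name (e.g. ".nvHRKI") and the array name
   (e.g. "hostRefKernelArrayInternalLinkage") from the flag triple.
   When neither is_kernel nor is_device is set, the entry is treated as a
   __constant__ variable, matching the call sequence shown above. */
static void host_ref_names(int is_kernel, int is_device, int is_internal,
                           char section[8], char array[48])
{
    const char *kind = is_kernel ? "Kernel" : is_device ? "Device" : "Constant";
    const char *link = is_internal ? "Internal" : "External";

    assert(!(is_kernel && is_device));
    /* Section suffix is the first letter of the kind and of the linkage. */
    snprintf(section, 8, ".nvHR%c%c", kind[0], link[0]);
    snprintf(array, 48, "hostRef%sArray%sLinkage", kind, link);
}
```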

Entity registration into these lists is performed by nv_get_full_nv_static_prefix (sub_6BE300, 370 lines, nv_transforms.c:2164). This function examines each device-annotated entity and routes it to the appropriate list based on its execution space bits (at entity offset +182) and linkage (internal linkage = static or anonymous namespace, determined by flags at entity offset +80).

For internal linkage entities, the function builds a scoped name prefix:

  1. Recursively constructs the scope path via sub_6BD2F0 (nv_build_scoped_name_prefix)
  2. For anonymous namespaces, inserts the _GLOBAL__N_<module_id> prefix (via qword_1286A00)
  3. Hashes the full path with format_string_to_sso (sub_6BD1C0)
  4. Constructs the prefix: off_E7C768 + len + "_" + filename + "_"
  5. Caches the prefix in qword_1286760 for reuse
  6. Appends "_" and the entity's mangled name

For external linkage entities, the path is simpler: the :: scope-qualified name is used directly without module-ID-based prefixing.

The generated output for each array (shown here for .nvHRKE):

extern "C" {
    extern __attribute__((section(".nvHRKE")))
           __attribute__((weak))
    const unsigned char hostRefKernelArrayExternalLinkage[] = {
        0x5f, 0x5a, /* ... mangled name bytes ... */ 0x00
    };
}

The __attribute__((weak)) allows multiple TUs to define the same array without linker errors -- the CUDA runtime reads whichever copy survives.
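To make the byte-array encoding concrete, the following C sketch renders a symbol name as the comma-separated hex initializer shown above, including the terminating 0x00. The helper name is hypothetical, and the exact spacing of the real output is an assumption; how multiple names are laid out within one array is not modeled.

```c
#include <assert.h>
#include <stdio.h>

/* Writes "0x5f, 0x5a, ..., 0x00" for name, including the terminating NUL
   byte. Returns the number of characters written, or -1 if buf is too
   small. */
static int format_name_bytes(char *buf, size_t cap, const char *name)
{
    size_t used = 0;

    assert(buf != NULL && name != NULL);
    for (const unsigned char *p = (const unsigned char *)name; ; ++p) {
        int n = snprintf(buf + used, cap - used, "%s0x%02x",
                         used ? ", " : "", (unsigned)*p);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
        if (*p == '\0')
            break;
    }
    return (int)used;
}
```

A mangled C++ name always starts with the bytes 0x5f, 0x5a ("_Z"), which is why those two values lead the initializer in the generated output.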

Host Stub Linkage Flags

Three CLI flags control the linkage of generated host stubs:

--host-stub-linkage-explicit (Flag 47)

When set, host stubs are emitted with explicit linkage specifiers rather than relying on the default linkage of the surrounding context. This ensures that the stub's linkage matches what nvcc/nvlink expects regardless of the source file's linkage context (e.g., inside an anonymous namespace or extern "C" block).

--static-host-stub (Flag 48)

Forces all generated host stubs (__wrapper__device_stub_*) to have static linkage. This is used in single-TU compilation where the stubs do not need to be visible to other object files. It prevents symbol conflicts when the same kernel name appears in multiple compilation units that are linked together.

--static-global-template-stub (set_flag Mechanism)

Unlike the two direct CLI flags above, -static-global-template-stub is set through the generic --set_flag mechanism (flag 193), which looks up the name in the off_D47CE0 table and stores the value. The flag name appears in 4 places in the binary, all of them error message strings.

When enabled (=true), template __global__ function stubs receive static linkage. This prevents ODR violations in whole-program mode when the same template kernel is instantiated in multiple host-side TUs. The tradeoff is that extern template kernels and out-of-TU instantiations become illegal (see the constraints in the whole-program section above).

Output Differences Between Modes

| Output Aspect | Whole-Program (-rdc=false) | Separate Compilation (-rdc=true) |
|---|---|---|
| Host stub linkage | Can be static (with flags 47/48) | External (default) |
| Template stub linkage | static (with -static-global-template-stub) | External |
| Module ID generation | Generated but less critical | Required for registration matching |
| Module ID file | Optional | Typically generated |
| Device code embedding | Inline fatbinary in host object | Relocatable device object (.rdc) |
| nvlink requirement | No | Yes (resolves device symbols) |
| Dynamic parallelism | Forbidden | Allowed |
| Extern device variables | Forbidden | Allowed |
| Anonymous namespace hash | Used for device symbol uniqueness | Used for device symbol uniqueness |
| Deferred function list | Active (breakpoint placeholders) | Behavior depends on dword_106BFDC |
| Cross-TU correspondence | N/A (single TU) | Active when multi-TU invocation |

Global Variables

| Address | Size | Name | Purpose |
|---|---|---|---|
| dword_106BFBC | 4 | whole_program_mode | Whole-program mode; also set by --debug_mode (flag 82, which sets dword_106BFC4=1, dword_106BFC0=1, dword_106BFBC=1) |
| dword_106BFDC | 4 | skip_device_only | Disables deferred function list accumulation |
| dword_106BFB8 | 4 | emit_symbol_table | Emit symbol table + module ID to file |
| dword_106BFD0 | 4 | device_registration | Device registration / cross-space reference checking |
| dword_106BFCC | 4 | constant_registration | Constant registration flag |
| qword_126F0C0 | 8 | cached_module_id | Cached module ID string |
| qword_106BF80 | 8 | module_id_file_path | Module ID file path (from --module_id_file_name) |
| qword_106BA10 | 8 | current_translation_unit | Pointer to current TU descriptor |
| qword_106B9F0 | 8 | primary_translation_unit | Pointer to first TU (primary) |
| qword_106BA18 | 8 | translation_unit_stack | Top of TU stack |
| dword_106B9E8 | 4 | tu_stack_depth | TU stack depth (excluding primary) |
| qword_12C7AA8 | 8 | registered_variable_list_head | Per-TU variable registration list |
| qword_12C7A98 | 8 | per_tu_storage_size | Total per-TU buffer size |
| qword_12C7AB0 | 8 | corresp_free_list | Correspondence node free list |
| qword_12C7AB8 | 8 | stack_entry_free_list | TU stack entry free list |
| qword_1065840 | 8 | deferred_function_list | Breakpoint placeholder linked list head |

Function Map

| Address | Name | Source File | Lines | Role |
|---|---|---|---|---|
| sub_5AF830 | make_module_id | host_envir.c | ~450 | CRC32-based unique TU identifier |
| sub_5AF7F0 | set_module_id | host_envir.c | ~10 | Setter for cached module ID |
| sub_5AF820 | get_module_id | host_envir.c | ~3 | Getter for cached module ID |
| sub_5B0180 | write_module_id_to_file | host_envir.c | ~30 | Writes module ID to file |
| sub_5CF030 | use_variable_or_routine_for_module_id_if_needed | il.c:31969 | ~65 | Selects representative entity for ID |
| sub_6BC7E0 | (anon namespace hash) | nv_transforms.c | ~20 | Generates _GLOBAL__N_<module_id> |
| sub_6BCF80 | nv_emit_host_reference_array | nv_transforms.c | 79 | Emits .nvHR* ELF section with symbol names |
| sub_6BD2F0 | nv_build_scoped_name_prefix | nv_transforms.c | ~95 | Recursive scope-qualified name builder |
| sub_6BE300 | nv_get_full_nv_static_prefix | nv_transforms.c:2164 | ~370 | Scoped name + host ref array registration |
| sub_796BA0 | copy_secondary_trans_unit_IL_to_primary | trans_copy.c | ~50 | Copies secondary TU IL to primary |
| sub_796C00 | mark_secondary_IL_entities_used_from_primary | -- | -- | Marks secondary IL referenced from primary |
| sub_796E60 | canonical_ranking | trans_corresp.c | -- | Determines canonical TU entry |
| sub_7975D0 | may_have_correspondence | trans_corresp.c | -- | Quick correspondence eligibility check |
| sub_797990 | f_change_canonical_entry | trans_corresp.c | -- | Updates canonical representative |
| sub_7983A0 | f_same_name | trans_corresp.c | -- | Cross-TU symbol name comparison |
| sub_79C400 | f_set_trans_unit_corresp | trans_corresp.c | 511 | Establishes entity correspondence |
| sub_7A00D0 | verify_class_type_correspondence | trans_corresp.c | 703 | Deep class structural comparison |
| sub_7A0E10 | verify_enum_type_correspondence | trans_corresp.c | -- | Enum comparison |
| sub_7A1230 | verify_function_type_correspondence | trans_corresp.c | -- | Function type comparison |
| sub_7A1460 | set_type_correspondence | trans_corresp.c | -- | Links corresponding types |
| sub_7A1DB0 | verify_member_function_correspondence | trans_corresp.c | 411 | Member function comparison |
| sub_7A27B0 | verify_base_class_correspondence | trans_corresp.c | -- | Base class list comparison |
| sub_7A3920 | register_trans_corresp_variables | trans_corresp.c | -- | Registers per-TU state variables |
| sub_7A3980 | init_trans_corresp_state | trans_corresp.c | -- | Zeroes all correspondence state |
| sub_7A3A50 | save_translation_unit_state | trans_unit.c | -- | Saves current TU state to buffer |
| sub_7A3C00 | f_register_trans_unit_variable | trans_unit.c | -- | Registers a per-TU variable |
| sub_7A3CF0 | fix_up_translation_unit | trans_unit.c | -- | Finalizes TU state |
| sub_7A3D60 | switch_translation_unit | trans_unit.c | -- | Saves/restores TU context |
| sub_7A3EF0 | push_translation_unit_stack | trans_unit.c | -- | Pushes TU onto stack |
| sub_7A3F70 | pop_translation_unit_stack | trans_unit.c | -- | Pops TU from stack |
| sub_7A40A0 | process_translation_unit | trans_unit.c | -- | Main TU processing entry point |
| sub_7A4690 | register_builtin_trans_unit_variables | trans_unit.c | -- | Registers 3 core per-TU vars |

JIT Mode

JIT mode is a compilation mode where cudafe++ produces device code only -- no host .int.c file, no kernel stubs, no CUDA runtime registration tables. The output is a standalone device IL payload suitable for runtime compilation via NVRTC (nvrtcCompileProgram) or direct loading through the CUDA Driver API (cuModuleLoadData, cuModuleLoadDataEx). Because there is no host compiler invocation downstream, anything that belongs exclusively to the host side is illegal: explicit __host__ functions, unannotated functions (which default to __host__), namespace-scope variables without memory-space qualifiers, non-const class static data members, and lambda closures inferred to have __host__ execution space.

The --default-device flag inverts the annotation default -- unannotated entities become __device__ instead of __host__, allowing C++ code written without CUDA annotations to compile directly for the GPU. This is the recommended workaround for all four unannotated-entity diagnostics.

Key Facts

| Property | Value |
|---|---|
| Compilation output | Device IL only (no .int.c, no stubs, no registration) |
| Host output suppression | --gen_c_file_name (flag 45) not supplied by driver |
| Device output path | --gen_device_file_name (flag 85) |
| Default execution space (normal) | __host__ (entity+182 byte == 0x00) |
| Default execution space (JIT + --default-device) | __device__ (entity+182 byte 0x23) |
| Annotation override flag | --default-device (passed to cudafe++ by NVRTC or nvcc) |
| RDC mode flag | --device-c (flag 77) -- relocatable device code; orthogonal to JIT |
| JIT diagnostic count | 5 error messages (1 explicit-host + 4 unannotated-entity) |
| Diagnostic tag suffix | All five tags end with _in_jit |
| NVRTC integration | NVRTC calls cudafe++ with JIT-appropriate flags internally |
| Driver API consumers | cuModuleLoadData, cuModuleLoadDataEx, cuLinkAddData |

How JIT Mode Is Activated

cudafe++ is never invoked directly by application code. In the standard offline compilation pipeline, nvcc invokes cudafe++ with both --gen_c_file_name (flag 45, the host .int.c path) and --gen_device_file_name (flag 85, the device IL path). Both outputs are generated from a single frontend invocation -- cudafe++ uses a single-pass architecture internally (see Device/Host Separation).

In JIT mode, the driving tool -- typically NVRTC -- invokes cudafe++ with only the device-side output path. The host-output file name (--gen_c_file_name) is not provided, so no .int.c file is generated. The absence of a host output target is what structurally makes this "JIT mode": without a host file, there is no host compiler to feed, and therefore no host-side constructs can be tolerated.

Activation Conditions

JIT mode is not a single user-facing CLI flag. It is an internal compilation state activated by the combination of flags that the driving tool (nvcc or NVRTC) sets when invoking cudafe++:

  1. NVRTC invocation. NVRTC always invokes cudafe++ in JIT mode. NVRTC compiles CUDA C++ source to PTX at application runtime. There is no host compiler, no host object file, and no linking -- the output is pure device code.

  2. nvcc --ptx or --cubin without host compilation. When nvcc is asked to produce only PTX or cubin output (no host object), it may invoke cudafe++ with the JIT mode configuration to skip host-side generation entirely.

  3. Architecture target combined with device-only flags. The internal JIT state is set when the target configuration (--target, flag 245 -> dword_126E4A8) is combined with device-only compilation flags (e.g., --device-syntax-only, flag 72).

The practical effect: when JIT mode is active, the entire implicit-host-annotation system becomes a source of errors rather than a convenience. Every function without __device__ or __global__ defaults to __host__, and host entities are illegal.

NVRTC Runtime Compilation Path

NVRTC (libnvrtc.so / nvrtc64_*.dll) is NVIDIA's runtime compilation library. Application code calls nvrtcCreateProgram with CUDA C++ source text, then nvrtcCompileProgram to compile it. Internally, NVRTC embeds a complete CUDA compilation pipeline including cudafe++ and cicc, invoking them with JIT-appropriate flags:

Application
    |
    v
nvrtcCompileProgram(prog, numOptions, options)
    |
    v
cudafe++ --target <sm_code> --gen_device_file_name <tmpfile> [--default-device] ...
    |                    (no --gen_c_file_name => JIT mode)
    v
cicc <tmpfile> --> PTX
    |
    v
ptxas / cuModuleLoadData --> device binary (cubin)

The user-facing NVRTC options (--gpu-architecture=compute_90, --device-debug, etc.) are translated by the NVRTC library into internal cudafe++ and cicc flags. The --default-device flag is passed through when the user includes it in the NVRTC options array.

CUDA Driver API Consumption

The PTX or cubin produced by the JIT pipeline is consumed by the CUDA Driver API:

  • cuModuleLoadData / cuModuleLoadDataEx: Load a compiled module (PTX or cubin) into the current context. The driver JIT-compiles PTX to native binary at load time.
  • cuLinkAddData / cuLinkComplete: Link multiple compiled objects into a single module (JIT linking for RDC workflows).
  • cuModuleGetFunction: Retrieve a __global__ kernel handle from the loaded module for launch via cuLaunchKernel.

Because JIT-compiled code has no host-side registration (no __cudaRegisterFunction calls, no fatbin embedding), the Driver API is the only path to launch kernels from JIT-compiled modules. The CUDA Runtime API launch syntax (<<<>>>) is not available for JIT-compiled kernels -- the application must use cuLaunchKernel explicitly.

The --default-device Flag

In normal (offline) compilation, functions and namespace-scope variables without explicit CUDA annotations default to __host__. This default makes sense when both host and device outputs are generated: the unannotated entities go into the host .int.c file and are compiled by the host compiler.

In JIT mode, this default is counterproductive. Most code intended for JIT compilation targets the GPU, and requiring explicit __device__ on every function and variable is verbose and incompatible with header-only libraries written for standard C++.

The --default-device flag changes the default:

| Entity type | Default without --default-device | Default with --default-device |
|---|---|---|
| Unannotated function | __host__ (entity+182 == 0x00) | __device__ (entity+182 == 0x23) |
| Namespace-scope variable (no memory space) | Host variable | __device__ variable (entity+148 bit 0 set) |
| Non-const class static data member | Host variable | __device__ variable |
| Lambda closure class (namespace scope) | __host__ inferred space | __device__ inferred space |
| Explicitly __host__ function | __host__ (unchanged) | __host__ (unchanged -- always error in JIT) |
| Explicitly __device__ function | __device__ (unchanged) | __device__ (unchanged) |
| __global__ kernel | __global__ (unchanged) | __global__ (unchanged) |

Entities with explicit annotations are unaffected. Only entities that would otherwise receive the implicit __host__ default are redirected to __device__.

Interaction with Entity+182

The execution-space bitfield at entity+182 (documented in Execution Spaces) is set during attribute application. Without --default-device, an unannotated function has byte 0x00 at entity+182 -- the 0x30 mask extracts 0x00, which is treated as implicit __host__. With --default-device active, the frontend treats unannotated functions as if __device__ had been applied, setting byte+182 to 0x23 (the standard __device__ OR mask: device_capable | device_explicit | device_annotation).

This means the downstream subsystems -- keep-in-IL marking, cross-space validation, device-only filtering -- all see a properly-annotated __device__ entity and process it identically to an explicitly annotated one. The flag does not add a "JIT mode" code path through every subsystem; it simply changes the default annotation, and the existing execution-space machinery handles the rest.
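The bit arithmetic described above can be sketched in a few lines of host C++. This is a minimal model, not the decompiled code: the enum labels and function names are hypothetical, the mapping of the two 0x30 bits to spaces is inferred from the 0x15 and 0x23 OR masks, and __global__ is not modeled because its mask is not given in this section.

```cpp
#include <cassert>
#include <cstdint>

// OR masks and classification mask quoted in the text.
constexpr std::uint8_t kDeviceOrMask = 0x23;  // applied for __device__
constexpr std::uint8_t kHostOrMask   = 0x15;  // applied for explicit __host__
constexpr std::uint8_t kSpaceMask    = 0x30;  // selects the two explicit-space bits

// Hypothetical labels for the four values the 0x30 mask can produce.
enum class Space { ImplicitHost, ExplicitHost, Device, HostDevice };

// Classify the byte stored at entity+182.
inline Space classify(std::uint8_t space_byte) {
    switch (space_byte & kSpaceMask) {
        case 0x00: return Space::ImplicitHost;
        case 0x10: return Space::ExplicitHost;   // 0x15 & 0x30
        case 0x20: return Space::Device;         // 0x23 & 0x30
        default:   return Space::HostDevice;     // 0x30: both masks were applied
    }
}

// Model of --default-device: rewrite only bytes with no explicit-space bit.
inline std::uint8_t apply_default_device(std::uint8_t space_byte, bool default_device) {
    if (default_device && (space_byte & kSpaceMask) == 0x00) {
        space_byte = static_cast<std::uint8_t>(space_byte | kDeviceOrMask);
    }
    return space_byte;
}

// Usage: an unannotated function becomes __device__, an explicit __host__ one does not.
inline void example_default_device() {
    assert(classify(apply_default_device(0x00, true)) == Space::Device);
    assert(classify(apply_default_device(kHostOrMask, true)) == Space::ExplicitHost);
}
```

Because the rewrite happens before any downstream consumer reads the byte, nothing after this point needs to know whether the __device__ bits came from source or from the flag.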

How to Pass the Flag

In normal nvcc workflows, --default-device is passed through -Xcudafe:

nvcc -Xcudafe --default-device source.cu

In NVRTC workflows, the flag is passed via the nvrtcCompileProgram options array:

const char *opts[] = {"--default-device"};
nvrtcCompileProgram(prog, 1, opts);

JIT Mode Diagnostics

Five error messages enforce JIT mode restrictions. All five are emitted during semantic analysis when the frontend encounters an entity that cannot exist in a device-only compilation. The messages are self-documenting: four of the five include an explicit suggestion to use --default-device.

Diagnostic 1: Explicit __host__ Function

Tag: no_host_in_jit

Message:

A function explicitly marked as a __host__ function is not allowed in JIT mode

Trigger: The function declaration carries an explicit __host__ annotation (entity+182 has bit 4 set via the 0x15 OR mask from apply_nv_host_attr at sub_4108E0). This is unconditionally illegal in JIT mode -- there is no device-side representation of a host-only function, and JIT mode produces no host output.

No --default-device suggestion: This is the only JIT diagnostic that does not suggest --default-device. The flag only affects unannotated entities. An explicit __host__ annotation overrides the default. The fix must be a source code change: remove __host__, change it to __device__, or change it to __host__ __device__.

Example:

// JIT mode: error no_host_in_jit
__host__ void setup() { /* ... */ }

// Fix options:
__device__ void setup() { /* ... */ }
__host__ __device__ void setup() { /* ... */ }  // if needed in both contexts

Diagnostic 2: Unannotated Function

Tag: unannotated_function_in_jit

Message:

A function without execution space annotations (__host__/__device__/__global__)
is considered a host function, and host functions are not allowed in JIT mode.
Consider using -default-device flag to process unannotated functions as __device__
functions in JIT mode

Trigger: A function entity has (entity+182 & 0x30) == 0x00 -- no explicit execution-space annotation. By default this means implicit __host__, which is illegal in JIT mode.

Fix: Either add __device__ to the function declaration, or compile with --default-device.

Example:

// JIT mode without --default-device: error unannotated_function_in_jit
int compute(int x) { return x * x; }

// Fix 1: explicit annotation
__device__ int compute(int x) { return x * x; }

// Fix 2: compile with --default-device (function becomes implicitly __device__)

Diagnostic 3: Unannotated Namespace-Scope Variable

Tag: unannotated_variable_in_jit

Message:

A namespace scope variable without memory space annotations
(__device__/__constant__/__shared__/__managed__) is considered a host variable,
and host variables are not allowed in JIT mode. Consider using -default-device flag
to process unannotated namespace scope variables as __device__ variables in JIT mode

Trigger: A variable declared at namespace scope (including global scope and anonymous namespaces) lacks a CUDA memory-space annotation. In normal compilation, such variables live in host memory. In JIT mode, host memory is inaccessible.

The check applies to the memory-space bitfield at entity+148, not the execution-space bitfield at entity+182. Without any annotation, none of the memory-space bits (__device__ bit 0, __shared__ bit 1, __constant__ bit 2, __managed__ bit 3) are set.

Scope note: This check targets namespace-scope variables only. Local variables inside __device__ or __global__ functions are not subject to this check -- they live on the device stack or in registers.
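As a rough illustration of the variable-side check, the sketch below tests the four memory-space bits named above and models the --default-device rewrite (bit 0 set, matching the entity+148 OR 0x01 behavior attributed to apply_nv_device_attr). The constant and function names are hypothetical, and the meaning of the remaining bits in that byte is not assumed.

```cpp
#include <cassert>
#include <cstdint>

// Memory-space bits of the field at entity+148, as listed in the text.
constexpr std::uint8_t kMemDevice   = 0x01;  // bit 0: __device__
constexpr std::uint8_t kMemShared   = 0x02;  // bit 1: __shared__
constexpr std::uint8_t kMemConstant = 0x04;  // bit 2: __constant__
constexpr std::uint8_t kMemManaged  = 0x08;  // bit 3: __managed__
constexpr std::uint8_t kAnyMemSpace = 0x0F;  // union of the four bits

// True when no memory-space bit is set, i.e. the variable counts as a host variable.
inline bool is_host_variable(std::uint8_t mem_byte) {
    return (mem_byte & kAnyMemSpace) == 0;
}

// Model of --default-device for variables: set the __device__ bit only
// when no memory-space annotation is present.
inline std::uint8_t apply_default_device_to_variable(std::uint8_t mem_byte,
                                                     bool default_device) {
    if (default_device && is_host_variable(mem_byte)) {
        mem_byte = static_cast<std::uint8_t>(mem_byte | kMemDevice);
    }
    return mem_byte;
}

// Usage: an unannotated variable stops being a host variable under the flag.
inline void example_variable_default() {
    assert(is_host_variable(0x00));
    assert(!is_host_variable(apply_default_device_to_variable(0x00, true)));
}
```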

Fix: Add a memory-space annotation, or compile with --default-device.

Example:

// JIT mode without --default-device: error unannotated_variable_in_jit
int table[256] = { /* ... */ };

// Fix 1: mutable device memory
__device__ int table[256] = { /* ... */ };

// Fix 2: read-only data
__constant__ int table[256] = { /* ... */ };

Diagnostic 4: Non-Const Class Static Data Member

Tag: unannotated_static_data_member_in_jit

Message:

A class static data member with non-const type is considered a host variable,
and host variables are not allowed in JIT mode. Consider using -default-device flag
to process such data members as __device__ variables in JIT mode

Trigger: A class or struct has a static data member whose type is not const-qualified. Static data members are allocated at namespace scope (not per-instance), so they are subject to the same host-variable prohibition as namespace-scope variables.

Why non-const only: const and constexpr static members with compile-time-constant initializers can be folded into device code by cicc without requiring an actual global variable in host memory. Non-const static members require mutable storage that must be explicitly placed in device memory.

Example:

struct Config {
    // JIT mode without --default-device: error unannotated_static_data_member_in_jit
    static int max_iterations;

    // OK: const with constant initializer (compile-time folding)
    static const int default_value = 42;

    // OK: constexpr (compile-time constant)
    static constexpr float pi = 3.14159f;
};

// Fix: explicit annotation
struct Config {
    __device__ static int max_iterations;
};

Diagnostic 5: Lambda Closure Class with Inferred __host__ Space

Tag: host_closure_class_in_jit

Message:

The execution space for the lambda closure class members was inferred to be __host__
(based on context). This is not allowed in JIT mode. Consider using -default-device
to infer __device__ execution space for namespace scope lambda closure classes.

Trigger: A lambda expression at namespace scope (or in a context where the enclosing function has implicit __host__ space) produces a closure class whose execution space is inferred to be __host__. The lambda was not explicitly annotated with __device__, and the enclosing context is host-only, so cudafe++'s execution-space inference assigns __host__ to the closure class members.

This diagnostic interacts with the extended lambda system (documented in Extended Lambda Overview). In normal compilation, a namespace-scope lambda without annotations is host-only and gets a closure type compiled for the CPU. In JIT mode, that closure type has no valid compilation target.

Fix: Either annotate the lambda with __device__ (requires extended lambdas: --expt-extended-lambda), or pass --default-device to change the inference to __device__.

Example:

// JIT mode without --default-device: error host_closure_class_in_jit
auto fn = [](int x) { return x * 2; };

// Fix 1: explicit annotation (requires --expt-extended-lambda)
auto fn = [] __device__ (int x) { return x * 2; };

// Fix 2: compile with --default-device

Diagnostic Summary

| Tag | Entity type | --default-device suggested | Suppressible |
| --- | --- | --- | --- |
| no_host_in_jit | Explicit __host__ function | No | Yes (via --diag_suppress) |
| unannotated_function_in_jit | Function with no annotation | Yes | Yes |
| unannotated_variable_in_jit | Namespace-scope variable, no annotation | Yes | Yes |
| unannotated_static_data_member_in_jit | Non-const static data member | Yes | Yes |
| host_closure_class_in_jit | Lambda closure inferred __host__ | Yes | Yes |

All five diagnostics use the standard cudafe++ diagnostic system. They can be controlled via CLI flags or source pragmas:

--diag_suppress=unannotated_function_in_jit
--diag_warning=no_host_in_jit
#pragma nv_diag_suppress unannotated_variable_in_jit

Warning: Suppressing these diagnostics silences the messages but does not change the underlying problem. The entities still have host execution space and will be absent from the device IL output, leading to link errors or runtime failures when the module is loaded.
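The five rules can be condensed into one decision function. The sketch below is a simplified model under stated assumptions: the EntityKind enum and jit_diagnostic_tag are hypothetical names, and the real frontend inspects entity bitfields at several points in semantic analysis rather than switching on a single enum.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical, simplified view of the entities the five diagnostics inspect.
enum class EntityKind {
    ExplicitHostFunction,
    UnannotatedFunction,
    UnannotatedNamespaceVariable,
    NonConstStaticDataMember,
    HostInferredClosureClass,
    DeviceEntity  // anything already annotated for the device
};

// Returns the diagnostic tag JIT mode would report, or nullptr when none applies.
inline const char* jit_diagnostic_tag(EntityKind kind, bool default_device) {
    switch (kind) {
        case EntityKind::ExplicitHostFunction:
            return "no_host_in_jit";  // the flag never overrides an explicit __host__
        case EntityKind::UnannotatedFunction:
            return default_device ? nullptr : "unannotated_function_in_jit";
        case EntityKind::UnannotatedNamespaceVariable:
            return default_device ? nullptr : "unannotated_variable_in_jit";
        case EntityKind::NonConstStaticDataMember:
            return default_device ? nullptr : "unannotated_static_data_member_in_jit";
        case EntityKind::HostInferredClosureClass:
            return default_device ? nullptr : "host_closure_class_in_jit";
        case EntityKind::DeviceEntity:
            return nullptr;
    }
    return nullptr;
}

// Usage: only the explicit __host__ case survives --default-device.
inline void example_jit_diagnostics() {
    assert(jit_diagnostic_tag(EntityKind::UnannotatedFunction, true) == nullptr);
    assert(std::strcmp(jit_diagnostic_tag(EntityKind::ExplicitHostFunction, true),
                       "no_host_in_jit") == 0);
}
```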

Architecture: JIT Mode vs Normal Mode

| Aspect | Normal (offline) mode | JIT mode |
| --- | --- | --- |
| Driver tool | nvcc | NVRTC (or nvcc with --ptx / --cubin) |
| Host output (.int.c) | Generated via sub_489000 | Not generated |
| Device IL output | Generated via keep-in-IL walk | Generated via keep-in-IL walk (identical) |
| Kernel stubs | __wrapper__device_stub_ in .int.c | Not needed |
| Registration code | __cudaRegisterFunction / __cudaRegisterVar | Not emitted |
| Fatbin embedding | Embedded in host object | Not applicable |
| Default unannotated space | __host__ | __host__ (error) or __device__ (with --default-device) |
| Kernel launch mechanism | <<<>>> -> cudaLaunchKernel (Runtime API) | cuLaunchKernel (Driver API) |
| Module loading | Automatic (CUDA runtime startup) | Manual (cuModuleLoadData) |
| Link model | Static linking with host object | JIT linking (cuLinkAddData) or direct load |

Single-Pass Architecture Impact

cudafe++ uses a single-pass architecture: the EDG frontend parses the source once, builds a unified IL tree, and tags every entity with execution-space bits at entity+182. In normal mode, two output filters run on this tree -- one for the host .int.c file (driven by sub_489000 -> sub_47ECC0), one for the device IL (driven by the keep-in-IL walk at sub_610420). In JIT mode, only the device IL output path runs. The host output path is simply never invoked because no host output was requested.

This means JIT mode does not require a fundamentally different code path through the frontend. Parsing, semantic analysis, template instantiation, and IL construction all proceed identically. The difference manifests at two points:

  1. Diagnostic emission during semantic analysis. The five JIT diagnostics fire when the frontend detects entities that would be host-only. In normal mode, these entities are silently accepted because they will appear in the host output.

  2. Output generation. The backend skips host-file emission entirely. The keep-in-IL walk runs as usual, marking device-reachable entries with bit 7 of the prefix byte (entry_ptr - 8). The device IL writer produces the binary output. No stub generation (gen_routine_decl stub path), no registration table emission, no .int.c formatting.
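The two points where JIT mode diverges are small enough to model directly. The sketch below assumes the prefix byte sits exactly 8 bytes before the entry pointer, as the text's (entry_ptr - 8) suggests. OutputPlan, plan_outputs, and the helper names are hypothetical and do not correspond to functions in the binary.

```cpp
#include <cassert>
#include <cstdint>

// Bit 7 of the prefix byte located 8 bytes before the entry pointer.
constexpr std::uint8_t kKeepInIlBit = 0x80;

// Mark an IL entry as device-reachable (the effect attributed to the keep-in-IL walk).
inline void mark_keep_in_il(std::uint8_t* entry_ptr) {
    *(entry_ptr - 8) = static_cast<std::uint8_t>(*(entry_ptr - 8) | kKeepInIlBit);
}

// Query the keep bit of an IL entry.
inline bool is_kept_in_il(const std::uint8_t* entry_ptr) {
    return (*(entry_ptr - 8) & kKeepInIlBit) != 0;
}

// Hypothetical summary of which output paths run in each mode.
struct OutputPlan {
    bool host_int_c;  // host .int.c emission (stubs, registration, formatting)
    bool device_il;   // keep-in-IL walk plus device IL writer
};

// JIT mode drops the host path; the device path is the same in both modes.
inline OutputPlan plan_outputs(bool jit_mode) {
    return OutputPlan{!jit_mode, true};
}

// Usage: mark an entry inside a scratch buffer and confirm the bit is visible.
inline void example_keep_in_il() {
    std::uint8_t buffer[16] = {};
    std::uint8_t* entry = buffer + 8;  // entry body starts after an 8-byte prefix
    mark_keep_in_il(entry);
    assert(is_kept_in_il(entry));
    assert(!plan_outputs(true).host_int_c);
}
```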

Interaction with Other Modes

RDC (Relocatable Device Code)

JIT mode is orthogonal to RDC (--device-c, flag 77). RDC controls whether device code is compiled for separate linking (enabling cross-TU __device__ function calls and extern __device__ variables), while JIT mode controls whether host output is produced. Both can be active simultaneously -- for example, NVRTC with --relocatable-device-code=true compiles device code for separate device linking without any host output.

When RDC is combined with JIT mode, NVRTC compiles each source file to relocatable device code, and the driver-API linker (cuLinkAddData, cuLinkComplete) resolves cross-references at load time. Without RDC, all device code must be self-contained within a single translation unit.

Extended Lambdas

Extended lambdas (--expt-extended-lambda, controlled by dword_106BF38) interact with JIT mode through the lambda closure class inference. The host_closure_class_in_jit diagnostic targets the case where a lambda's closure is inferred as host-side. With --default-device, the inference changes to device-side, resolving the conflict. Extended lambda capture rules still apply in JIT mode -- captures must be trivially device-copyable, subject to the 1023-capture limit, and array captures are limited to 7 dimensions.

Relaxed Constexpr

Relaxed constexpr mode (--expt-relaxed-constexpr, flag 104, sets dword_106BFF0) makes constexpr functions implicitly __host__ __device__. In JIT mode, this resolves many unannotated-function errors because constexpr functions gain the __device__ annotation implicitly via the HD bypass (entity+177 bit 4). However, non-constexpr unannotated functions still trigger unannotated_function_in_jit unless --default-device is also active.

Practical Patterns

Pattern 1: Minimal JIT Kernel

// Source passed to nvrtcCreateProgram -- no --default-device needed
extern "C" __global__ void add(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

No annotations needed beyond __global__ on the kernel. All code within the kernel body is implicitly device code. The extern "C" prevents name mangling so the kernel can be found by cuModuleGetFunction.

Pattern 2: JIT-Compiling Library Code with --default-device

// Header-only math library, no CUDA annotations
template <typename T>
T clamp(T val, T lo, T hi) {
    return val < lo ? lo : (val > hi ? hi : val);
}

__global__ void kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = clamp(data[i], 0.0f, 1.0f);
}

Without --default-device, clamp triggers unannotated_function_in_jit. With --default-device, clamp is implicitly __device__ and compiles cleanly.

Pattern 3: Guarding Host Code with Preprocessor

// Use __CUDACC_RTC__ to guard host-only code
#ifndef __CUDACC_RTC__
__host__ void cpu_fallback(float* data, int n) {
    for (int i = 0; i < n; i++) data[i] *= 2.0f;
}
#endif

__global__ void gpu_process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__CUDACC_RTC__ is predefined by NVRTC. Code guarded by #ifndef __CUDACC_RTC__ is invisible to the JIT compiler, avoiding no_host_in_jit errors.

Pattern 4: Static Data Members in JIT

struct Constants {
    static constexpr int BLOCK_SIZE = 256;        // OK: constexpr, folded at compile time
    static const float EPSILON;                    // Error without --default-device (non-constexpr const)
};

#ifdef __CUDACC_RTC__
__device__
#endif
const float Constants::EPSILON = 1e-6f;            // Annotated for JIT mode

Function Map

| Address | Name | Lines | Role |
| --- | --- | --- | --- |
| sub_459630 | proc_command_line | 4105 | CLI parser; processes --default-device and --device-c flags |
| sub_452010 | init_command_line_flags | 3849 | Registers all flags including default-device |
| sub_610420 | mark_to_keep_in_il | 892 | Device IL marking (runs identically in JIT and normal mode) |
| sub_489000 | process_file_scope_entities | 723 | Host .int.c backend (skipped entirely in JIT mode) |
| sub_47ECC0 | gen_template | 1917 | Source-sequence dispatcher; host output path (skipped in JIT) |
| sub_40EB80 | apply_nv_device_attr | 100 | Sets __device__ bits; entity+182 OR 0x23 (function), entity+148 OR 0x01 (variable) |
| sub_4108E0 | apply_nv_host_attr | 31 | Sets __host__ bits; entity+182 OR 0x15 |

Cross-References

  • Execution Spaces -- entity+182 bitfield, __host__/__device__/__global__ OR masks, 0x30 mask classification
  • Device/Host Separation -- single-pass architecture, keep-in-IL walk, host/device output file generation
  • Cross-Space Validation -- execution-space call checking (still applies in JIT mode for HD entities)
  • CUDA Error Catalog -- Category 10 (JIT Mode), all five diagnostic messages with tag names
  • CLI Flag Inventory -- flag table, --gen_device_file_name (85), --gen_c_file_name (45), --device-c (77)
  • Architecture Feature Gating -- --target SM code (dword_126E4A8) and feature thresholds
  • Extended Lambda Overview -- lambda closure class execution-space inference, wrapper types
  • Kernel Stubs -- __wrapper__device_stub_ mechanism (absent in JIT mode)
  • RDC Mode -- relocatable device code, separate compilation for device-side linking

Architecture Feature Gating

cudafe++ enforces architecture-dependent feature gates that prevent use of CUDA constructs on hardware that cannot support them. These gates operate at three distinct layers: compile-time SM version checks against dword_126E4A8 during semantic analysis, string-embedded diagnostic messages with architecture names baked into .rodata, and host-compiler version gating controlling which GCC/Clang-specific #pragma directives and language constructs appear in the generated .int.c output. A separate mechanism, the --db debug system, provides runtime tracing that can expose architecture checks as they execute. This page documents all three layers, the global variables involved, every discovered threshold constant, and the complete data flow from nvcc invocation to feature gate evaluation.

Key Facts

| Property | Value |
| --- | --- |
| SM version storage | dword_126E4A8 (sm_architecture, set by --target / case 245) |
| SM version TU-level copy | dword_126EBF8 (target_config_index, copied during TU init in sub_586240) |
| Architecture parser stub | sub_7525E0 (6-byte stub returning -1; actual parsing done by nvcc) |
| Post-parse initializer | sub_7525F0 (set_target_configuration, target.c:299) |
| Type table initializer | sub_7515D0 (sets 100+ type-size/alignment globals, called from sub_7525F0) |
| GCC version global | qword_126EF98 (default 80100 = GCC 8.1.0, set by --gnu_version case 184) |
| Clang version global | qword_126EF90 (default 90100 = Clang 9.1.0, set by --clang_version case 188) |
| GCC host dialect flag | dword_126E1F8 (host compiler identified as GCC) |
| Clang host dialect flag | dword_126E1E8 (host compiler identified as Clang) |
| Host GCC version copy | qword_126E1F0 (copied from qword_126EF98 during dialect init) |
| Host Clang version copy | qword_126E1E0 (copied from qword_126EF90 during dialect init) |
| --nv_arch error string | "invalid or no value specified with --nv_arch flag" at 0x8884F0 |
| Debug option parser | sub_48A390 (proc_debug_option, 238 lines, debug.c) |
| Debug trace linked list | qword_1065870 (head pointer) |
| Invalid arch sentinel | -1 (0xFFFFFFFF) |
| Feature threshold count | 17 CUDA features across 7 SM versions (20, 30, 52, 60, 70, 80, 90/90a) |
| Host compiler threshold count | 20 version constants across GCC 3.0 through GCC 14.0 |

Layer 1: SM Architecture Input

How the Architecture Reaches cudafe++

cudafe++ never parses architecture strings directly from the user. The driver (nvcc) translates user-facing flags like --gpu-architecture=sm_90 into an internal numeric code and passes it via the --target flag when spawning the cudafe++ process. Inside cudafe++, the --target flag is registered as CLI flag 245 and handled in proc_command_line (sub_459630).

The handler calls sub_7525E0, which in the CUDA Toolkit 13.0 binary is a 6-byte stub:

; sub_7525E0 -- architecture parser stub
; Address: 0x7525E0, Size: 6 bytes
mov     eax, 0FFFFFFFFh    ; return -1 unconditionally
retn

This stub always returns -1 (the invalid-architecture sentinel). The actual architecture code is injected by nvcc into the argument string that sub_7525E0 receives. Because IDA decompiled this as a stub, the parsing logic is either inlined by the compiler or resolved through a different mechanism at link time. The result is stored in dword_126E4A8:

// proc_command_line (sub_459630), case 245
v80 = sub_7525E0(qword_E7FF28, v23, v20, v30);  // parse SM code from arg string
dword_126E4A8 = v80;                              // store in sm_architecture
if (v80 == -1) {
    sub_4F8420(2664);  // emit error 2664
    // error string: "invalid or no value specified with --nv_arch flag"
    sub_4F2930("cmd_line.c", 12219, "proc_command_line", 0, 0);
    // assert_fail -- unreachable if error handler returns
}
sub_7525F0(v80);  // set_target_configuration

Error 2664 fires when the architecture value is -1. The error string at 0x8884F0 references --nv_arch (the nvcc-facing name for this flag). This string has no direct xrefs in the IDA analysis, meaning it is loaded indirectly through the error message table (off_88FAA0). The --nv_arch name in the error message is a user-facing alias; internally cudafe++ processes it as --target (flag 245).

set_target_configuration (sub_7525F0)

After storing the SM version, sub_7525F0 performs post-parse initialization. This function lives in target.c:299:

// sub_7525F0 -- set_target_configuration
__int64 __fastcall sub_7525F0(int a1)
{
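    if ((unsigned int)(a1 + 1) > 1)  // true for every value except -1 and 0
        assert_fail("set_target_configuration", 299);
    sub_7515D0();           // initialize type table for target platform
    qword_126E1B0 = "lib";  // library search path prefix
}

As decompiled, the guard (unsigned int)(a1 + 1) > 1 is an unsigned comparison that passes only when a1 is -1 or 0, because a1 + 1 then evaluates to 0 or 1. Every other value reaches assert_fail, including negative values below -1, which wrap to large unsigned numbers. This is a sanity check rather than user-facing validation: an invalid architecture value is reported earlier, as error 2664 in proc_command_line.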
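To make the guard's behavior concrete, the following sketch evaluates the same expression for a few inputs. It is a stand-alone model of the comparison only. The function name guard_trips is hypothetical, and the cast is performed before the increment so that the wrap-around is well-defined unsigned arithmetic.

```cpp
#include <cassert>

// Evaluates the guard expression from the decompiled excerpt.
// Returns true when the assertion path would be taken.
inline bool guard_trips(int a1) {
    return static_cast<unsigned int>(a1) + 1u > 1u;
}

// Usage: only -1 and 0 pass the guard in this model.
inline void example_guard() {
    assert(!guard_trips(-1));  // -1 + 1 == 0
    assert(!guard_trips(0));   //  0 + 1 == 1
    assert(guard_trips(1));    //  1 + 1 == 2
    assert(guard_trips(-2));   // wraps to 0xFFFFFFFF
}
```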

Type Table Initialization (sub_7515D0)

The sub_7515D0 function, called from set_target_configuration, initializes over 100 global variables describing the target platform's type sizes, alignments, and numeric limits. This establishes the data model for CUDA device code:

// sub_7515D0 -- target type initialization (excerpt)
// Sets LP64 data model with CUDA-specific type properties
dword_126E328 = 8;     // sizeof(long)
dword_126E338 = 4;     // sizeof(int)
dword_126E2FC = 16;    // sizeof(long double)
dword_126E308 = 16;    // alignof(long double)
dword_126E2B8 = 8;     // sizeof(pointer)
dword_126E2AC = 8;     // alignof(pointer)
dword_126E420 = 2;     // sizeof(wchar_t)
dword_126E4A0 = 8;     // target vector width
dword_126E258 = 53;    // double mantissa bits
dword_126E250 = 1024;  // double max exponent
dword_126E254 = -1021; // double min exponent
dword_126E234 = 113;   // __float128 mantissa bits
dword_126E22C = 0x4000; // __float128 max exponent
dword_126E230 = -16381; // __float128 min exponent
// ... ~80 more assignments ...

The function unconditionally returns -1, which is not used by the caller.

SM Version Propagation

During translation unit initialization (sub_586240, called from fe_translation_unit_init), the SM version is copied into a TU-level global:

// sub_586240, line 54 in decompiled output
dword_126EBF8 = dword_126E4A8;  // target_config_index = sm_architecture

After this point, architecture checks throughout the compiler read either dword_126E4A8 (the CLI-level global) or dword_126EBF8 (the TU-level copy). Both contain the same integer SM version code. The dual-variable pattern exists because EDG's architecture supports multi-TU compilation where each TU could theoretically target a different architecture (though CUDA compilation always uses a single target per cudafe++ invocation).

Layer 2: CUDA Feature Thresholds

cudafe++ checks the SM architecture version at semantic analysis time to gate CUDA-specific features. When a feature is used on an architecture below its minimum requirement, the compiler emits a diagnostic error or warning. All thresholds below were extracted from error strings embedded in the binary's .rodata section and confirmed through cross-reference with diagnostic tag names.

Complete Feature Threshold Table

| Feature | Min Architecture | Diagnostic Tag | Error String |
| --- | --- | --- | --- |
| Virtual base classes | compute_20 | use_of_virtual_base_on_compute_1x | Use of a virtual base (%t) requires the compute_20 or higher architecture |
| Device variadic functions | compute_30 | device_function_has_ellipsis | __device__ or __host__ __device__ function with ellipsis requires compute_30 or higher architecture |
| __managed__ variables | compute_30 | unsupported_arch_for_managed_capability | __managed__ variables require architecture compute_30 or higher |
| alloca() in device code | compute_52 | alloca_unsupported_for_lower_than_arch52 | alloca() is not supported for architectures lower than compute_52 |
| Atomic scope argument | sm_60 | (inline) | atomic operations' scope argument is supported on architecture sm_60 or above. Fall back to use membar. |
| Atomic f64 add/sub | sm_60 | (inline) | atomic add and sub for 64-bit float is supported on architecture sm_60 or above. |
| __nv_atomic_* functions | sm_60 | (inline) | __nv_atomic_* functions are not supported on arch < sm_60. |
| __grid_constant__ | compute_70 | grid_constant_unsupported_arch | __grid_constant__ annotation is only allowed for architecture compute_70 or later |
| Atomic memory order | sm_70 | (inline) | atomic operations' argument of memory order is supported on architecture sm_70 or above. Fall back to use membar. |
| 128-bit atomic load/store | sm_70 | (inline) | 128-bit atomic load and store are supported on architecture sm_70 or above. |
| 16-bit atomic CAS | sm_70 | (inline) | 16-bit atomic compare-and-exchange is supported on architecture sm_70 or above. |
| __nv_register_params__ | compute_80 | register_params_unsupported_arch | __nv_register_params__ is only supported for compute_80 or later architecture |
| __wgmma_mma_async | sm_90a | wgmma_mma_async_not_enabled | __wgmma_mma_async builtins are only available for sm_90a |
| Atomic cluster scope | sm_90 | (inline) | atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
| Atomic cluster scope (load/store) | sm_90 | (inline) | atomic load and store's scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
| 128-bit atomic exch/CAS | sm_90 | nv_atomic_exch_cas_b128_not_supported | 128-bit atomic exchange or compare-and-exchange is supported on architecture sm_90 or above. |

GPU-Architecture-Gated Attributes (No Specific SM in String)

Several features check the architecture but their error strings do not embed a specific SM version number. Instead, they use the generic phrase "this GPU architecture", meaning the threshold is encoded in the comparison logic rather than the diagnostic text:

| Feature | Diagnostic Tag | Error String |
| --- | --- | --- |
| __cluster_dims__ | cluster_dims_unsupported | __cluster_dims__ is not supported for this GPU architecture |
| max_blocks_per_cluster | max_blocks_per_cluster_unsupported | cannot specify max blocks per cluster for this GPU architecture |
| __block_size__ | block_size_unsupported | __block_size__ is not supported for this GPU architecture |
| __managed__ (config) | unsupported_configuration_for_managed_capability | __managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system) |

These features are gated by the same dword_126E4A8 comparison mechanism as the features in the main table, but their exact SM threshold values would require tracing the specific comparison instructions in the semantic analysis functions.

Diagnostic Behavior: Errors vs Warnings vs Demotions

Architecture gate violations produce three distinct behaviors depending on the feature class:

Hard errors -- Compilation halts. Features that fundamentally cannot work on the target architecture:

  • __managed__ below compute_30 -- No unified memory hardware support
  • __grid_constant__ below compute_70 -- No hardware constant propagation mechanism
  • __nv_register_params__ below compute_80 -- Register parameter ABI not available
  • __wgmma_mma_async below sm_90a -- No warp-group MMA hardware
  • alloca() below compute_52 -- No dynamic stack allocation support on device
  • Virtual base classes below compute_20 -- No vtable support on earliest GPU architectures

Fallback warnings -- Compilation continues with degraded behavior. The compiler generates functionally correct but potentially less performant code:

  • Atomic scope arguments on pre-sm_60 -- Falls back to membar-based synchronization
  • Atomic memory order on pre-sm_70 -- Falls back to membar-based ordering
  • 64-bit float atomics on pre-sm_60 -- Falls back to CAS loop emulation

Scope demotion warnings -- Informational diagnostics about automatic scope narrowing:

  • Cluster scope atomics on pre-sm_90 -- Demotes to device scope and reports the change ("Using device scope instead")
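One way to picture the three behaviors is as a table-driven lookup keyed on the target's compute capability. The sketch below is illustrative only: the struct, enum, and function names are hypothetical, the table holds a subset of the documented thresholds, the comparison direction (target below minimum) is an assumption, and sm_90a is omitted because its integer encoding is not given here.

```cpp
#include <cassert>
#include <cstring>

// The three outcomes described above.
enum class GateAction { HardError, FallbackWarning, ScopeDemotion };

struct FeatureGate {
    const char* feature;  // label used only by this sketch
    int min_arch;         // numeric compute capability, e.g. 70
    GateAction action;    // behavior when the target is below min_arch
};

// A subset of the documented thresholds.
constexpr FeatureGate kGates[] = {
    {"__managed__",            30, GateAction::HardError},
    {"alloca",                 52, GateAction::HardError},
    {"atomic scope argument",  60, GateAction::FallbackWarning},
    {"__grid_constant__",      70, GateAction::HardError},
    {"atomic memory order",    70, GateAction::FallbackWarning},
    {"__nv_register_params__", 80, GateAction::HardError},
    {"atomic cluster scope",   90, GateAction::ScopeDemotion},
};

// Returns the gate that fires for `feature` on target `sm`,
// or nullptr when the feature is allowed or not in the table.
inline const FeatureGate* violated_gate(const char* feature, int sm) {
    for (const FeatureGate& gate : kGates) {
        if (std::strcmp(gate.feature, feature) == 0) {
            return sm < gate.min_arch ? &gate : nullptr;
        }
    }
    return nullptr;
}

// Usage: __grid_constant__ is rejected on a compute capability 6.0 target.
inline void example_feature_gate() {
    const FeatureGate* gate = violated_gate("__grid_constant__", 60);
    assert(gate != nullptr && gate->action == GateAction::HardError);
    assert(violated_gate("__grid_constant__", 70) == nullptr);
}
```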

compute_XX vs sm_XX Naming

Error strings use two naming conventions that reflect CUDA's split between virtual and physical architectures:

  • compute_XX -- Virtual architecture. Checked at PTX generation time. Features gated by compute_XX are relevant to the intermediate PTX representation and are independent of the specific GPU die. Examples: __managed__ (requires unified memory ISA support), alloca() (requires dynamic stack frame instructions).

  • sm_XX -- Physical architecture. Checked at SASS generation time. Features gated by sm_XX are tied to specific hardware capabilities of a GPU die. Examples: 128-bit atomics (require specific load/store unit widths), cluster scope (requires the SM 9.0 thread block cluster hardware).

In practice, cudafe++ stores a single integer in dword_126E4A8 and the distinction is purely semantic -- both forms gate against the same numeric value. The value is a compute capability number (e.g., 70 for Volta, 90 for Hopper).

The sm_90a suffix (with the a accelerator flag) is a special case used exclusively for __wgmma_mma_async builtins. This variant requires the Hopper accelerated architecture, which is distinct from the base sm_90. The a suffix is encoded in the SM integer value passed to cudafe++ by nvcc.

__wgmma_mma_async Detail

The warp-group matrix multiply-accumulate builtin has the most granular validation of any architecture-gated feature. Beyond the sm_90a architecture check, cudafe++ also validates:

| Check | Diagnostic Tag | Error String |
| --- | --- | --- |
| Architecture gate | wgmma_mma_async_not_enabled | __wgmma_mma_async builtins are only available for sm_90a |
| Shape validation | wgmma_mma_async_bad_shape | The shape %s is not supported for __wgmma_mma_async builtin |
| A operand type | wgmma_mma_async_bad_A_type | (type mismatch diagnostic) |
| B operand type | wgmma_mma_async_bad_B_type | (type mismatch diagnostic) |
| Missing arguments | wgmma_mma_async_missing_args | The 'A' or 'B' argument to __wgmma_mma_async call is missing |
| Non-constant args | wgmma_mma_async_nonconstant_arg | Non-constant argument to __wgmma_mma_async call |

The validation function is identified as check_wgmma_mma_async (string at 0x888CAC). Four type-specific builtin variants are registered: __wgmma_mma_async_f16, __wgmma_mma_async_bf16, __wgmma_mma_async_tf32, and __wgmma_mma_async_f8.

__nv_register_params__ Detail

The register parameter attribute has four distinct checks, only one of which is an architecture gate:

| Check | Diagnostic Tag | Error String |
| --- | --- | --- |
| Feature enable flag | register_params_not_enabled | __nv_register_params__ support is not enabled |
| Architecture gate | register_params_unsupported_arch | __nv_register_params__ is only supported for compute_80 or later architecture |
| Function type check | register_params_unsupported_function | __nv_register_params__ is not allowed on a %s function |
| Ellipsis check | register_params_ellipsis_function | (variadic function diagnostic) |

The attribute handler is apply_nv_register_params_attr (string at 0x830C78).

SM Version to Feature Summary

| SM Version | Features Introduced | Feature Count |
| --- | --- | --- |
| compute_20 | Virtual base classes in device code | 1 |
| compute_30 | __managed__ variables, device variadic functions | 2 |
| compute_52 | alloca() in device code | 1 |
| sm_60 | Atomic scope argument, 64-bit float atomics, __nv_atomic_* API | 3 |
| sm_70 | __grid_constant__, 128-bit atomic load/store, atomic memory order, 16-bit CAS | 4 |
| compute_80 | __nv_register_params__ | 1 |
| sm_90 / sm_90a | __wgmma_mma_async, thread block clusters, 128-bit atomic exchange/CAS, cluster scope atomics | 5 |

Notably absent from cudafe++ error strings are features like cooperative groups (sm_60+), tensor cores (sm_70+), and dynamic parallelism (sm_35+). These are checked at runtime or by the PTX assembler (ptxas) rather than the language frontend.

Layer 3: Host Compiler Version Gating

cudafe++ generates .int.c output that must compile cleanly under the host C++ compiler (GCC, Clang, or MSVC). Because different host compiler versions support different warning pragmas, attributes, and language features, cudafe++ gates its output based on the host compiler version stored in qword_126EF98 (GCC) and qword_126EF90 (Clang). Additionally, several C++ language feature flags in the EDG frontend are conditionally enabled based on host compiler version to match the behavior the user expects from their host compiler.

Version Encoding

Both GCC and Clang versions are encoded as a single integer: major * 10000 + minor * 100 + patch. For example, GCC 8.1.0 is encoded as 80100. The compiler tests these values against threshold constants (hexadecimal in the decompiled output) using strictly-greater-than (>) comparisons. Because every threshold ends in a 99 patch level (e.g., 40299 for GCC 4.2.99), the gate > 40299 is equivalent to >= 40300, which means "GCC 4.3 or later."
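The encoding and the strict comparison can be checked with a few lines of C++. The helper names encode_version and passes_gate are hypothetical; the constants are the ones quoted in this section.

```cpp
#include <cassert>

// major * 10000 + minor * 100 + patch, as described in the text.
constexpr long encode_version(long major, long minor, long patch) {
    return major * 10000 + minor * 100 + patch;
}

// Gates use a strictly-greater-than comparison against an "x.y.99" threshold.
constexpr bool passes_gate(long version, long threshold) {
    return version > threshold;
}

static_assert(encode_version(8, 1, 0) == 80100, "default GCC version");
static_assert(0x9D6B == 40299, "threshold for GCC 4.2.99");

// Usage: GCC 4.3.0 passes the 0x9D6B gate, GCC 4.2.9 does not.
inline void example_version_gate() {
    assert(passes_gate(encode_version(4, 3, 0), 0x9D6B));
    assert(!passes_gate(encode_version(4, 2, 9), 0x9D6B));
}
```

Using x.y.99 thresholds with > keeps every gate a single integer compare while still reading as "next minor release or later".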

Complete Threshold Table

| Hex Constant | Decimal | Encoded Version | Effective Gate | Occurrence Count |
|---|---|---|---|---|
| 0x752F | 29,999 | 2.99.99 | GCC/Clang >= 3.0 | 1 (dialect resolution) |
| 0x75F7 | 30,199 | 3.01.99 | GCC/Clang >= 3.2 | low |
| 0x76BF | 30,399 | 3.03.99 | GCC/Clang >= 3.4 | low (cuda_compat_flag gate) |
| 0x7787 | 30,599 | 3.05.99 | Clang >= 3.6 | medium (-Wunused-local-typedefs) |
| 0x78B3 | 30,899 | 3.08.99 | Clang >= 3.9 | low |
| 0x9C3F | 39,999 | 3.99.99 | GCC/Clang >= 4.0 | medium (dword_106BDD8 + Clang gate) |
| 0x9D07 | 40,199 | 4.01.99 | GCC >= 4.2 | medium (-Wunused-variable file-level) |
| 0x9D6B | 40,299 | 4.02.99 | GCC >= 4.3 | medium (variadic templates) |
| 0x9DCF | 40,399 | 4.03.99 | GCC >= 4.4 | low (dialect resolution) |
| 0x9E33 | 40,499 | 4.04.99 | GCC >= 4.5 | low (dialect resolution) |
| 0x9E97 | 40,599 | 4.05.99 | GCC >= 4.6 | medium (diagnostic push/pop) |
| 0x9EFB | 40,699 | 4.06.99 | GCC >= 4.7 | low (feature flag gating) |
| 0x9F5F | 40,799 | 4.07.99 | GCC >= 4.8 | medium (-Wunused-local-typedefs) |
| 0xEA5F | 59,999 | 5.99.99 | GCC >= 6.0 | 22 files (C++14/17 features) |
| 0xEB27 | 60,199 | 6.01.99 | GCC >= 6.2 | low (HasFuncPtrConv gate) |
| 0x1116F | 69,999 | 6.99.99 | GCC >= 7.0 | medium (dword_106BDD8 + feature flags) |
| 0x15F8F | 89,999 | 8.99.99 | GCC/Clang >= 9.0 | medium (C++17/20 features) |
| 0x1D4BF | 119,999 | 11.99.99 | GCC/Clang >= 12.0 | 8 files |
| 0x1FBCF | 129,999 | 12.99.99 | GCC >= 13.0 | 13 files |
| 0x222DF | 139,999 | 13.99.99 | GCC >= 14.0 | 5 files |
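When reading raw constants out of a disassembly, it can help to decode a threshold back into its major.minor.patch fields and compute the first version that passes. The sketch below does that; host_version, decode_version, and first_passing_version are hypothetical helper names introduced for illustration.

```c
#include <stdint.h>

/* Hypothetical helper type: a decoded major.minor.patch triple. */
typedef struct { int major, minor, patch; } host_version;

/* Split an encoded value (major*10000 + minor*100 + patch) into its fields. */
host_version decode_version(int64_t encoded)
{
    host_version v;
    v.major = (int)(encoded / 10000);
    v.minor = (int)((encoded / 100) % 100);
    v.patch = (int)(encoded % 100);
    return v;
}

/* Smallest encoded version that satisfies "version > threshold". */
int64_t first_passing_version(int64_t threshold)
{
    return threshold + 1;
}
```

For 0xEA5F this decodes to 5.99.99, and the first passing version is 60000, which is how the "GCC >= 6.0" entry in the Effective Gate column is derived.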

How Thresholds Are Used

The thresholds serve three purposes:

1. Diagnostic pragma emission. The .int.c output includes #pragma GCC diagnostic directives to suppress host compiler warnings about CUDA-generated code. Different GCC/Clang versions introduced different warning flags, so the pragmas are conditionally emitted:

// From sub_489000 (backend boilerplate emission)
// -Wunused-local-typedefs: GCC 4.8+ (0x9F5F) or Clang 3.6+ (0x7787)
if ((dword_126E1E8 && qword_126EF90 > 0x7787)
    || (!dword_106BF6C && !dword_106BF68
        && dword_126E1F8 && qword_126E1F0 > 0x9F5F))
{
    emit("#pragma GCC diagnostic ignored \"-Wunused-local-typedefs\"");
}

// Push/pop block for managed RT: GCC 4.6+ (0x9E97) or Clang
if (dword_126E1E8 || (!dword_106BF6C && dword_126E1F8 && qword_126E1F0 > 0x9E97))
{
    emit("#pragma GCC diagnostic push");
    emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");
    emit("#pragma GCC diagnostic ignored \"-Wunused-function\"");
    // ... managed runtime boilerplate ...
    emit("#pragma GCC diagnostic pop");
}

// File-level -Wunused-variable: GCC 4.2+ (0x9D07) or Clang
if (dword_126E1E8 || (dword_126E1F8 && qword_126E1F0 > 0x9D07))
    emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");

2. C++ feature gating during dialect resolution. The post-parsing dialect resolution in proc_command_line and the sub_44B6B0 dialect setup function use qword_126EF98 thresholds to decide which C++ language features to enable. Examples from the decompiled code:

// sub_44B6B0 -- dialect resolution, ~400 lines
// GCC 4.3+ (0x9D6B): enable variadic templates
if (qword_126EF98 > 0x9D6B)
    dword_106BE1C = 1;  // variadic_templates

// GCC 4.7+ (0x9EFB): enable list initialization under certain conditions
if (qword_126EF98 > 0x9EFB && dword_106BE1C && (!byte_E7FFF1 || dword_106C10C))
    dword_106BE10 = 1;

// GCC 6.0+ (0xEA5F) or Clang: enable C++14/17 features
if (dword_126EFA4 || (dword_126EFA8 && qword_126EF98 > 0xEA5F)) {
    // Enable feature (Clang always, GCC only 6.0+)
}

3. CUDA compatibility mode. A special flag dword_E7FF10 (cuda_compat_flag) is set when dword_126EFAC && qword_126EF98 <= 0x76BF -- that is, when extended features are enabled but the GCC version is 3.3.99 or below. This activates a legacy compatibility path for very old host compilers that lack modern C++ support.
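The first pragma gate in item 1 can be modeled as a pure predicate over a snapshot of the globals it reads. This is a minimal sketch: the struct, its field names, and the function name are assumptions made for illustration, and the meaning of dword_106BF6C is not established in this section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the globals read by the -Wunused-local-typedefs
 * gate. Field names are illustrative; comments give the mirrored global. */
typedef struct {
    int     host_is_clang;     /* dword_126E1E8 */
    int64_t clang_version;     /* qword_126EF90 */
    int     host_is_gnu;       /* dword_126E1F8 */
    int64_t host_gcc_version;  /* qword_126E1F0 */
    int     flag_106BF6C;      /* dword_106BF6C, meaning not established here */
    int     flag_106BF68;      /* dword_106BF68, described elsewhere as MSVC-mode related */
} host_gate_inputs;

/* Returns true when the gate in the listing would emit the pragma:
 * Clang above 0x7787, or GCC above 0x9F5F with both extra flags clear. */
bool wants_unused_local_typedefs_pragma(const host_gate_inputs *in)
{
    bool clang_ok = in->host_is_clang && in->clang_version > 0x7787;
    bool gcc_ok   = !in->flag_106BF6C && !in->flag_106BF68
                 && in->host_is_gnu && in->host_gcc_version > 0x9F5F;
    return clang_ok || gcc_ok;
}
```

Keeping the predicate separate from the emit call makes it easy to check boundary versions such as GCC 4.7.0 (rejected) and GCC 4.8.0 (accepted).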

The 0xEA5F (59999) Threshold -- The Most Pervasive Gate

The threshold 0xEA5F (GCC 6.0) is the most widely used version constant in the binary, appearing in 22 decompiled functions. It gates the C++14/17 feature set boundary. GCC 6 was the first GCC release series to default to C++14 (-std=gnu++14), and it added a substantial portion of C++17 support.

The typical usage pattern is:

// Pattern: "Clang (any version) OR GCC 6.0+"
if (dword_126EFA4 || (dword_126EFA8 && qword_126EF98 > 0xEA5F)) {
    // Enable C++14/17 feature
}

// Pattern: "GNU extensions but not Clang, GCC 6.0+"
if (dword_126EFAC && !dword_126EFA4 && qword_126EF98 > 0xEA5F) {
    // Enable GNU-specific extended feature
}

Functions using this threshold include: declaration processing (sub_40D900), attribute application (sub_413ED0), class declaration (sub_431590), dialect resolution (sub_44B6B0), initializer processing (sub_48C710, sub_4B6760), backend code generation (sub_4688C0), expression canonicalization (sub_4CA6C0, sub_4D2B70), IL walking (sub_54AED0), scope management (sub_59C9B0, sub_59AF40), type processing (sub_5D1350), overload resolution (sub_662670, sub_666720), and template specialization (sub_6A3B00).

Version-Gated Feature Flag: dword_106BDD8

One particular feature flag (dword_106BDD8) is set during dialect resolution based on a compound version check:

// sub_44B6B0, decompiled line ~228-231
// v4 = (dword_126EFA4 != 0), i.e., is_clang_mode
if ((dword_126EFAC && !v4 && qword_126EF98 > 0x1116F)  // GNU ext, not Clang, GCC >= 7.0
    || (v4 && qword_126EF90 > 0x9C3F))                   // or Clang >= 4.0
{
    dword_106BDD8 = 1;
}

This flag is referenced in 7 decompiled functions (sub_430920, sub_42FE50, sub_447930, sub_44AAC0, sub_44B6B0, sub_45EB40, sub_724630). The W066 global variables report identifies it as optix_mode, but the decompiled code shows that it is set purely from compiler version thresholds during dialect resolution, not from any --emit-optix-ir CLI flag. The "optix_mode" name is therefore probably a misidentification based on the context in which the flag was first encountered. More likely the flag controls a C++ language feature (possibly structured bindings or another C++17 feature) that requires GCC 7.0+ or Clang 4.0+. It gates behavior in attribute validation (sub_42FE50), where it interacts with dword_106B670 to control feature availability.

Dialect Initialization Flow

The host compiler version globals are initialized in proc_command_line and propagated to the dialect system during TU initialization:

proc_command_line (CLI parsing, sub_459630):
  case 184 (--gnu_version=X):   qword_126EF98 = X   // GCC version
  case 188 (--clang_version=X): qword_126EF90 = X   // Clang version
  case 182 (--gcc):             dword_126EFA8 = 1    // GCC mode flag
  case 187 (--clang):           dword_126EFA4 = 1    // Clang mode flag

dialect_init (sub_44B6B0, called during setup):
  // ~400 lines of version-threshold-based feature flag resolution
  // Sets 30+ EDG feature flags based on gcc_version, clang_version,
  // cpp_standard_version, and extension mode flags

target dialect (sub_752A80, select_cp_gen_be_target_dialect):
  if (dword_126EFA8):                           // GCC mode
    dword_126E1F8 = 1                           // host_dialect_gnu
    qword_126E1F0 = qword_126EF98              // host_gcc_version
  if (dword_126EFA4):                           // Clang mode
    dword_126E1E8 = 1                           // host_dialect_clang
    qword_126E1E0 = qword_126EF90              // host_clang_version

The defaults for unspecified versions are qword_126EF98 = 80100 (GCC 8.1.0) and qword_126EF90 = 90100 (Clang 9.1.0), set during default_init (sub_45EB40).
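The defaults and the four options listed in the flow can be modeled with a small string-matching parser. This is only a sketch: the real proc_command_line dispatches on numeric case IDs, and the struct and function names below are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mirror of the four globals set during CLI parsing. */
typedef struct {
    int64_t gcc_version;    /* mirrors qword_126EF98 */
    int64_t clang_version;  /* mirrors qword_126EF90 */
    int     gcc_mode;       /* mirrors dword_126EFA8 */
    int     clang_mode;     /* mirrors dword_126EFA4 */
} host_compiler_cfg;

/* Defaults documented above: GCC 8.1.0 and Clang 9.1.0. */
host_compiler_cfg default_host_cfg(void)
{
    host_compiler_cfg c = { 80100, 90100, 0, 0 };
    return c;
}

/* Apply one option string; returns 1 if recognized, 0 otherwise. */
int apply_host_option(host_compiler_cfg *c, const char *arg)
{
    static const char gnu_pfx[]   = "--gnu_version=";
    static const char clang_pfx[] = "--clang_version=";

    if (strncmp(arg, gnu_pfx, sizeof gnu_pfx - 1) == 0) {
        c->gcc_version = strtoll(arg + sizeof gnu_pfx - 1, NULL, 10);
        return 1;
    }
    if (strncmp(arg, clang_pfx, sizeof clang_pfx - 1) == 0) {
        c->clang_version = strtoll(arg + sizeof clang_pfx - 1, NULL, 10);
        return 1;
    }
    if (strcmp(arg, "--gcc") == 0)   { c->gcc_mode = 1;   return 1; }
    if (strcmp(arg, "--clang") == 0) { c->clang_mode = 1; return 1; }
    return 0;
}
```

Starting from default_host_cfg() and applying only the options nvcc passes reproduces the behavior described above: any version that is not specified keeps its default.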

The --db Debug Mechanism

The --db flag (CLI case 37) activates EDG's internal debug tracing system by calling sub_48A390 (proc_debug_option). The mechanism is not part of architecture gating, but its globals (dword_126EFC8, dword_126EFCC) sit next to the version and mode globals described above, and debug tracing can expose architecture checks as they execute.

Connection Between --db and Architecture

The --db flag does not set or modify any architecture-related globals. Its connection to the architecture system is observational: when debug tracing is enabled, the compiler emits trace output at key decision points throughout compilation, including the semantic analysis functions that evaluate architecture thresholds. Enabling --db=5 (verbosity level 5) causes the compiler to log IL entry kinds, template instantiation steps, and scope transitions, which provides visibility into when and why architecture gates fire.

The CLI dispatch for --db:

// proc_command_line (sub_459630), case 37
case 37:  // --db=<string>
    if (sub_48A390(qword_E7FF28))  // proc_debug_option
        goto error;                // returns nonzero on parse failure
    dword_106C2A0 = dword_126EFCC; // save initial error count baseline

After proc_debug_option returns, dword_106C2A0 is assigned the current value of dword_126EFCC (the debug verbosity level). The saved value then serves as the baseline for subsequent error tracking.

proc_debug_option (sub_48A390)

This 238-line function (debug.c) parses debug control strings. On entry, it unconditionally sets dword_126EFC8 = 1 (debug tracing enabled), then dispatches based on the first character of the input:

// sub_48A390 entry
dword_126EFC8 = 1;  // enable debug tracing
v3 = (unsigned __int8)*nptr;
if ((unsigned int)(v3 - 48) <= 9) {      // first char is '0'..'9'
    dword_126EFCC = strtol(nptr, 0, 10); // set verbosity level
    return 0;
}

The full parsing grammar:

| Input Format | Parsed As | Action |
|---|---|---|
| "5" (numeric only) | Verbosity level | Sets dword_126EFCC = 5 |
| "name=3" | Name with level | Adds trace node: action=1, level=3 |
| "name+=3" | Additive trace | Adds trace node: action=2, level=3 |
| "name-=3" | Subtractive trace | Adds trace node: action=3, level=3 |
| "name=3!" | Permanent trace | Adds trace node: action=1, level=3, permanent=1 |
| "#name" | Hash removal | Removes matching node from trace list |
| "-name" | Dash removal | Removes matching node from trace list |
| "a,b=2,c=3" | Comma-separated | Processes each entry independently |
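The grammar is compact enough to model with a single-entry parser. The sketch below handles one comma-free entry; the type db_request, the DB_* constants, and parse_db_entry are hypothetical names, and DB_VERBOSITY is a local marker rather than a value from the binary. A bare name with no operator is reported as unsupported because the table does not say what level it receives.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

enum { DB_SET = 1, DB_ADD = 2, DB_SUB = 3, DB_REMOVE = 4, DB_VERBOSITY = 100 };

typedef struct {
    int  action;      /* 1..4 as in the table; DB_VERBOSITY is a local marker */
    int  level;
    int  permanent;
    char name[64];
} db_request;

/* Parse one entry (no commas). Returns 0 on success, -1 if the form is not
 * one of those listed in the table. */
int parse_db_entry(const char *s, db_request *out)
{
    memset(out, 0, sizeof *out);

    if (isdigit((unsigned char)s[0])) {            /* "5" */
        out->action = DB_VERBOSITY;
        out->level  = (int)strtol(s, NULL, 10);
        return 0;
    }
    if (s[0] == '#' || s[0] == '-') {              /* "#name", "-name" */
        if (s[1] == '\0' || strlen(s + 1) >= sizeof out->name) return -1;
        out->action = DB_REMOVE;
        strcpy(out->name, s + 1);
        return 0;
    }

    const char *eq = strchr(s, '=');
    if (eq == NULL || eq == s) return -1;          /* bare name: not covered */

    size_t name_len = (size_t)(eq - s);
    out->action = DB_SET;                          /* "name=3" */
    if (s[name_len - 1] == '+')      { out->action = DB_ADD; name_len--; }
    else if (s[name_len - 1] == '-') { out->action = DB_SUB; name_len--; }
    if (name_len == 0 || name_len >= sizeof out->name) return -1;
    memcpy(out->name, s, name_len);
    out->name[name_len] = '\0';

    char *end;
    out->level = (int)strtol(eq + 1, &end, 10);
    if (end == eq + 1) return -1;                  /* no digits after '=' */
    if (*end == '!') { out->permanent = 1; end++; }
    return *end == '\0' ? 0 : -1;
}
```

A comma-separated string can be handled by splitting on ',' and calling parse_db_entry on each piece, which matches the "processes each entry independently" row.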

Debug Trace Node Structure

Debug trace requests are stored as a singly-linked list rooted at qword_1065870. Each node is 28 bytes, allocated via sub_6B7340 (the IL allocator):

struct debug_trace_node {           // 28 bytes (32 allocated)
    struct debug_trace_node* next;  // +0:  linked list link
    char*  name_string;             // +8:  entity name to trace (heap copy)
    int32  action_type;             // +16: 1=set, 2=add, 3=subtract, 4=remove
    int32  level;                   // +20: trace level (integer)
    int32  permanent;               // +24: 1=survives reset, 0=cleared on reset
};
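The "28 bytes (32 allocated)" note follows from ordinary LP64 struct padding, which the sketch below checks with static assertions. It also shows a simple prepend operation; whether the binary prepends or appends is not stated here, and trace_list_push with its malloc-based allocation is an assumption standing in for the allocator the binary actually uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct debug_trace_node {
    struct debug_trace_node *next;  /* +0  */
    char    *name_string;           /* +8  */
    int32_t  action_type;           /* +16 */
    int32_t  level;                 /* +20 */
    int32_t  permanent;             /* +24 */
};

/* On an LP64 target the fields end at byte 28 and the struct pads to 32. */
_Static_assert(sizeof(void *) != 8 ||
               offsetof(struct debug_trace_node, permanent) == 24, "field offset");
_Static_assert(sizeof(void *) != 8 ||
               sizeof(struct debug_trace_node) == 32, "padded size");

/* Prepend a request to a list and return the new head. On allocation
 * failure the list is returned unchanged. */
struct debug_trace_node *trace_list_push(struct debug_trace_node *head,
                                         const char *name, int action,
                                         int level, int permanent)
{
    struct debug_trace_node *n = malloc(sizeof *n);
    if (n == NULL) return head;

    size_t len = strlen(name) + 1;
    n->name_string = malloc(len);           /* heap copy of the name */
    if (n->name_string == NULL) { free(n); return head; }
    memcpy(n->name_string, name, len);

    n->action_type = action;
    n->level       = level;
    n->permanent   = permanent;
    n->next        = head;
    return n;
}
```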

When proc_debug_option encounters its own name in the trace list (the self-referential check !strcmp(src, "proc_debug_option")), it prints the entire trace state to stderr:

if (qword_1065870 && (v2 & 1) != 0) {
    node = qword_1065870;  // start at the head of the trace list
    do {
        fprintf(s, "debug request for: %s\n", node->name_string);
        fprintf(s, "action=%d,  level=%d\n", node->action_type, node->level);
        node = node->next;
    } while (node);
}

Debug Verbosity Levels

The dword_126EFCC verbosity level controls trace output granularity across the entire compiler:

| Level | Effect |
|---|---|
| 0 | No debug output (default) |
| 1-2 | Basic trace: function entry/exit markers |
| 3 | Detailed trace: includes entity names, scope indices |
| 4 | Very detailed: IL entry kinds, overload candidate lists |
| 5+ | Full trace: IL tree walking with "Walking IL tree, entry kind = ..." |

db_name (CLI case 190)

The --db_name flag (case 190) calls a separate function sub_48AD80 to register a debug name filter. Unlike --db which enables global tracing, --db_name restricts trace output to entities matching the specified name pattern. If sub_48AD80 fails (returns nonzero), error 570 is emitted.

Three-Layer Checking Model

Layer 1: Compile-Time Semantic Checks (cudafe++ Frontend)

These are the primary gates. During semantic analysis, cudafe++ reads dword_126E4A8 and compares it against threshold constants. Violations emit diagnostic errors through the standard error system (diagnostic IDs in the 3000+ range, displayed as 20000-series via the +16543 offset formula). These checks are unconditional -- they fire regardless of whether the code would actually execute at runtime.

Enforcement point: Declaration processing, type checking, attribute application, and CUDA-specific semantic validation passes.

Examples:

  • __managed__ variable declaration with dword_126E4A8 < 30 triggers unsupported_arch_for_managed_capability
  • __grid_constant__ parameter with dword_126E4A8 < 70 triggers grid_constant_unsupported_arch
  • __wgmma_mma_async call on non-sm_90a triggers wgmma_mma_async_not_enabled
  • Virtual base class with dword_126E4A8 < 20 triggers use_of_virtual_base_on_compute_1x

Layer 2: String-Embedded Diagnostic Formatting

Error strings with architecture names baked into .rodata represent the complete set of architecture-dependent diagnostics. These strings are loaded by the diagnostic system and serve as the user-visible feedback for Layer 1 checks.

The architecture name in the string (e.g., "compute_70", "sm_90a") is a literal constant, not a formatted parameter -- the compiler does not interpolate the actual target architecture into these messages. This means the error messages always state the minimum required architecture, not what the user actually specified. The only message that takes a format parameter is the virtual base error, which uses %t (a type formatter) to insert the base class name rather than the architecture.

Layer 3: Host Compiler Version Gating

This layer does not check GPU architecture at all -- instead, it gates the output format of the generated .int.c file based on the host C++ compiler's version. The thresholds ensure that GCC/Clang-specific pragmas, attributes, and language constructs in the generated code are compatible with the actual host compiler that will consume the output.

Enforcement point: Backend code generation (sub_489000 and related functions in cp_gen_be.c).

Impact: Incorrect host compiler version gating does not cause compilation failure -- it may produce warnings from the host compiler due to unrecognized pragmas, or miss warning suppression directives that would silence spurious diagnostics.

Interaction Between Layers

nvcc (driver)
  |
  | --target=<sm_code>  --gnu_version=<ver>  --clang_version=<ver>
  v
cudafe++ process
  |
  +-- CLI parsing (proc_command_line)
  |     dword_126E4A8 = sm_code         (SM architecture)
  |     qword_126EF98 = gcc_version     (host GCC version)
  |     qword_126EF90 = clang_version   (host Clang version)
  |
  +-- set_target_configuration (sub_7525F0)
  |     sub_7515D0()  -- type table init (100+ globals)
  |
  +-- dialect_resolution (sub_44B6B0)
  |     30+ feature flags set based on version thresholds
  |     dword_126E1F8 / dword_126E1E8  -- host dialect set
  |     qword_126E1F0 / qword_126E1E0  -- host version copies
  |
  +-- TU init (sub_586240)
  |     dword_126EBF8 = dword_126E4A8   (SM version copy)
  |
  +-- [Layer 1] Semantic analysis
  |     Compare dword_126E4A8 against SM thresholds
  |     Emit CUDA-specific errors for unsupported features
  |
  +-- [Layer 2] Diagnostic formatting
  |     Load error string with baked-in architecture name
  |     Format and display error to user
  |
  +-- [Layer 3] .int.c code generation
  |     Compare qword_126E1F0 / qword_126E1E0 against host thresholds
  |     Emit appropriate #pragma directives
  |     Generate host-compiler-compatible boilerplate
  |
  v
Host Compiler (gcc / clang / cl.exe)

Layers 1 and 2 operate during the frontend phase and can halt compilation. Layer 3 operates during the backend phase and only affects the format of the generated output file.

Global Variable Summary

| Address | Size | Name | Role |
|---|---|---|---|
| dword_126E4A8 | 4 | sm_architecture | Target SM version from --target (case 245). Sentinel: -1. |
| dword_126EBF8 | 4 | target_config_index | TU-level copy of dword_126E4A8, set in sub_586240. |
| qword_126EF98 | 8 | gcc_version | GCC compatibility version. Default 80100. Set by --gnu_version (case 184). |
| qword_126EF90 | 8 | clang_version | Clang compatibility version. Default 90100. Set by --clang_version (case 188). |
| dword_126EFA8 | 4 | gcc_extensions | GCC mode enabled. Set by --gcc (case 182). |
| dword_126EFA4 | 4 | clang_extensions | Clang mode enabled. Set by --clang (case 187). |
| dword_126EFAC | 4 | extended_features | Extended features / GNU compat mode. |
| dword_126EFB0 | 4 | gnu_extensions_enabled | GNU extensions active. |
| dword_126E1F8 | 4 | host_dialect_gnu | Host compiler is GCC/GNU. Set during dialect init. |
| dword_126E1E8 | 4 | host_dialect_clang | Host compiler is Clang. Set during dialect init. |
| qword_126E1F0 | 8 | host_gcc_version | Host GCC version, copied from qword_126EF98. |
| qword_126E1E0 | 8 | host_clang_version | Host Clang version, copied from qword_126EF90. |
| dword_126EFC8 | 4 | debug_trace_enabled | Debug tracing active. Set unconditionally by --db. |
| dword_126EFCC | 4 | debug_verbosity | Debug output level. >2=detailed, >4=IL walk trace. |
| dword_E7FF10 | 4 | cuda_compat_flag | Legacy compat: dword_126EFAC && qword_126EF98 <= 0x76BF. |
| dword_106BDD8 | 4 | version_gated_feature | Set when GCC >= 7.0 or Clang >= 4.0. Referenced in 7 functions. |
| dword_106C2A0 | 4 | error_count_baseline | Saved from dword_126EFCC after --db processing. |
| qword_1065870 | 8 | debug_trace_list | Head of debug trace request linked list. |
| dword_126E4A0 | 4 | target_vector_width | Set to 8 by sub_7515D0. |

Attribute System Overview

cudafe++ processes CUDA attributes through NVIDIA's customization of the EDG 6.6 attribute subsystem. EDG provides a general-purpose attribute infrastructure in attribute.c (approximately 11,500 lines of source, spanning addresses 0x409350--0x418F80 in the binary) that handles C++11 [[...]] attributes, GNU __attribute__((...)), MSVC __declspec, and alignas. NVIDIA extends this infrastructure by injecting 14 CUDA-specific attribute kinds into EDG's attribute kind enumeration, registering CUDA-specific handler callbacks, and adding a post-declaration validation pass that enforces cross-attribute consistency rules (e.g., __launch_bounds__ requires __global__).

The attribute system operates in four phases: scanning (lexer recognizes attribute syntax and builds attribute node lists), lookup (maps attribute names to descriptors via a hash table), application (dispatches to per-attribute handler functions that modify entity nodes), and validation (post-declaration consistency checks). CUDA attributes participate in all four phases, using the same node structures and dispatch mechanisms as standard C++/GNU attributes.

CUDA Attribute Kind Enum

Every attribute node carries a kind byte at offset +8. For standard C++/GNU attributes, EDG assigns kinds from its built-in descriptor table (byte_82C0E0 in the .rodata segment). For CUDA attributes, NVIDIA reserves a block of kind values in the ASCII printable range. The function attribute_display_name (sub_40A310, from attribute.c:1307) contains the authoritative switch table that maps kind values to human-readable names:

| Kind | Hex | ASCII | Display Name | Category | Handler |
|---|---|---|---|---|---|
| 86 | 0x56 | 'V' | __host__ | Execution space | sub_4108E0 |
| 87 | 0x57 | 'W' | __device__ | Execution space | sub_40EB80 |
| 88 | 0x58 | 'X' | __global__ | Execution space | sub_40E1F0 / sub_40E7F0 |
| 89 | 0x59 | 'Y' | __tile_global__ | Execution space | (internal) |
| 90 | 0x5A | 'Z' | __shared__ | Memory space | sub_40E0D0 (shared path) |
| 91 | 0x5B | '[' | __constant__ | Memory space | sub_40E0D0 (constant path) |
| 92 | 0x5C | '\' | __launch_bounds__ | Launch config | sub_411C80 |
| 93 | 0x5D | ']' | __maxnreg__ | Launch config | sub_410F70 |
| 94 | 0x5E | '^' | __local_maxnreg__ | Launch config | sub_411090 |
| 95 | 0x5F | '_' | __tile_builtin__ | Internal | (internal) |
| 102 | 0x66 | 'f' | __managed__ | Memory space | sub_40E0D0 (managed path) |
| 107 | 0x6B | 'k' | __cluster_dims__ | Launch config | sub_4115F0 |
| 108 | 0x6C | 'l' | __block_size__ | Launch config | sub_4109E0 |
| 110 | 0x6E | 'n' | __nv_pure__ | Optimization | (internal) |

The kind values are not contiguous. Kinds 86--95 form a dense block for the original CUDA attributes. Kinds 102, 107, 108, and 110 were added later (managed memory in CUDA 6.0, cluster dimensions in CUDA 11.8, block size and nv_pure more recently), occupying gaps in the ASCII range.
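Because the kinds are one dense block plus four outliers, membership can be tested with a range check and four equality checks. The sketch below assumes an ASCII execution character set; is_cuda_attr_kind and count_cuda_attr_kinds are illustrative names.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `kind` is one of the CUDA attribute kinds in the table:
 * the dense block 'V'..'_' (86..95) or one of 'f', 'k', 'l', 'n'. */
bool is_cuda_attr_kind(uint8_t kind)
{
    if (kind >= 'V' && kind <= '_')
        return true;
    return kind == 'f' || kind == 'k' || kind == 'l' || kind == 'n';
}

/* Count how many byte values are CUDA attribute kinds. */
int count_cuda_attr_kinds(void)
{
    int n = 0;
    for (int k = 0; k < 256; k++)
        if (is_cuda_attr_kind((uint8_t)k))
            n++;
    return n;
}
```

Counting over all byte values yields 14, which agrees with the number of CUDA-specific kinds stated in the overview.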

attribute_display_name (sub_40A310)

This function serves dual duty: it formats the display name for diagnostic messages, and its switch table is the canonical enumeration of all CUDA attribute kinds. The logic:

// sub_40A310 -- attribute_display_name (attribute.c:1307)
// a1: pointer to attribute node
const char* attribute_display_name(attr_node_t* a1) {
    const char* name = a1->name;           // +16
    const char* ns   = a1->namespace_str;  // +24

    // If scoped (namespace::name), format "namespace::name"
    if (ns) {
        size_t ns_len = strlen(ns);
        assert(ns_len + strlen(name) + 3 <= 204);  // buffer byte_E7FB80
        sprintf(byte_E7FB80, "%s::%s", ns, name);
        name = intern_string(byte_E7FB80);  // sub_5E0700
    }

    // Override with CUDA display name based on kind byte
    switch (a1->kind) {  // byte at +8
        case 'V': return "__host__";
        case 'W': return "__device__";
        case 'X': return "__global__";
        case 'Y': return "__tile_global__";
        case 'Z': return "__shared__";
        case '[': return "__constant__";
        case '\\': return "__launch_bounds__";
        case ']': return "__maxnreg__";
        case '^': return "__local_maxnreg__";
        case '_': return "__tile_builtin__";
        case 'f': return "__managed__";
        case 'k': return "__cluster_dims__";
        case 'l': return "__block_size__";
        case 'n': return "__nv_pure__";
        default:  return name ? name : "";
    }
}

The 204-byte static buffer byte_E7FB80 is shared across calls (not thread-safe, but cudafe++ is single-threaded per translation unit). The intern_string call (sub_5E0700) ensures the formatted "namespace::name" string is deduplicated into EDG's permanent string pool.

Attribute Node Structure

Every attribute is represented by a 72-byte IL node (entry kind 0x48 = attribute). The node layout:

struct attr_node_t {               // 72 bytes, IL entry kind 0x48
    attr_node_t*  next;            // +0   next attribute in list
    uint8_t       kind;            // +8   attribute kind byte (CUDA: 'V'..'n')
    uint8_t       source_mode;     // +9   1=C++11, 2=GNU, 3=MSVC, 4=alignas, 5=clang
    uint8_t       target_kind;     // +10  what entity type this targets
    uint8_t       flags;           // +11  bit 0=applies_to_params
                                   //      bit 1=skip_arg_check
                                   //      bit 4=scoped attribute
                                   //      bit 7=unknown/unrecognized
    uint32_t      _pad;            // +12  (alignment)
    const char*   name;            // +16  attribute name string
    const char*   namespace_str;   // +24  namespace (NULL for unscoped)
    arg_node_t*   arguments;       // +32  argument list head
    void*         source_pos;      // +40  source position info
    void*         decl_context;    // +48  declaration context / scope
    void*         src_loc_1;       // +56  source location
    void*         src_loc_2;       // +64  secondary source location
};

For CUDA attributes, the kind byte at offset +8 is the discriminator. When get_attr_descr_for_attribute (sub_40FDB0) resolves an attribute name, it writes the corresponding kind value from the descriptor table (byte_82C0E0) into this field. All subsequent dispatch operates on this byte alone.

The source_mode byte at +9 indicates the syntactic form the user wrote. CUDA attributes like __host__ are parsed as GNU-style attributes (source_mode = 2), because cudafe++ defines them via __attribute__((...)) internally.
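The documented offsets can be checked mechanically by declaring the structure and comparing offsetof and sizeof against the listing. The sketch below does this for an LP64 target; the enum and macro names for source modes and flag bits are hypothetical labels for the values given in the layout comments.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct arg_node  arg_node_t;    /* opaque in this sketch */
typedef struct attr_node attr_node_t;

struct attr_node {
    attr_node_t *next;           /* +0  */
    uint8_t      kind;           /* +8  */
    uint8_t      source_mode;    /* +9  */
    uint8_t      target_kind;    /* +10 */
    uint8_t      flags;          /* +11 */
    uint32_t     _pad;           /* +12 */
    const char  *name;           /* +16 */
    const char  *namespace_str;  /* +24 */
    arg_node_t  *arguments;      /* +32 */
    void        *source_pos;     /* +40 */
    void        *decl_context;   /* +48 */
    void        *src_loc_1;      /* +56 */
    void        *src_loc_2;      /* +64 */
};

/* Source-mode values from the layout comment (names are illustrative). */
enum { ATTR_SRC_CXX11 = 1, ATTR_SRC_GNU = 2, ATTR_SRC_MSVC = 3,
       ATTR_SRC_ALIGNAS = 4, ATTR_SRC_CLANG = 5 };

/* Flag bits from the layout comment (names are illustrative). */
#define ATTR_FLAG_APPLIES_TO_PARAMS 0x01u  /* bit 0 */
#define ATTR_FLAG_SKIP_ARG_CHECK    0x02u  /* bit 1 */
#define ATTR_FLAG_SCOPED            0x10u  /* bit 4 */
#define ATTR_FLAG_UNRECOGNIZED      0x80u  /* bit 7 */

/* Size of the node as laid out by the current compiler and ABI. */
size_t attr_node_size(void)
{
    return sizeof(attr_node_t);
}
```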

Attribute Descriptor Table and Name Lookup

Master Descriptor Table (off_D46820)

The attribute descriptor table is a static array at off_D46820, extending to unk_D47A60. These addresses fall inside the .data segment (0xD46480--0xE7EFF0), not .rodata. Each entry is 32 bytes and encodes:

  • Attribute name string
  • Kind byte (written to attr_node_t.kind on match)
  • Handler function pointer (the apply_* callback)
  • Mode/version condition string (e.g., 'g' for GCC-only, 'l' for Clang-only)
  • Target applicability mask
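The table bounds and the entry size together determine how many descriptors the table can hold. The calculation below assumes the table is contiguous and that unk_D47A60 is the one-past-the-end address, which is how the initialization loop treats it; the macro and function names are illustrative.

```c
#include <stdint.h>

#define ATTR_DESCR_TABLE_START 0xD46820u  /* off_D46820 */
#define ATTR_DESCR_TABLE_END   0xD47A60u  /* unk_D47A60, assumed one past the end */
#define ATTR_DESCR_ENTRY_SIZE  32u        /* bytes per descriptor */

/* Number of 32-byte descriptors between the two addresses. */
uint32_t attr_descr_entry_count(void)
{
    return (ATTR_DESCR_TABLE_END - ATTR_DESCR_TABLE_START) / ATTR_DESCR_ENTRY_SIZE;
}
```

The span is 0x1240 (4,672) bytes, which divides evenly into 146 entries under these assumptions.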

Initialization: init_attr_name_map (sub_418F80)

At startup, init_attr_name_map iterates the descriptor table, validates each name is at most 100 characters, and inserts it into the hash table qword_E7FB60 (created via sub_7425C0). This hash table enables O(1) lookup of attribute names during parsing.

// sub_418F80 -- init_attr_name_map (attribute.c:1524)
void init_attr_name_map(void) {
    attr_name_map = create_hash_table();  // qword_E7FB60
    for (attr_descr* d = off_D46820; d < unk_D47A60; d++) {
        assert(strlen(d->name) <= 100);
        insert_into_hash_table(attr_name_map, d->name, d);
    }
    // Also initializes dword_E7F078 and processes config if dword_106BF18 set
}

A companion function init_attr_token_map (sub_419070) creates a second hash table qword_E7F038 that maps attribute tokens to their descriptors, used during lexer-level attribute recognition.

Name Normalization: sub_40A250

Before looking up an attribute name, EDG strips __ prefixes and suffixes. The function at sub_40A250 checks whether the name starts with "__" and ends with "__", strips them, and looks up the bare name in qword_E7FB60. This means __host__, __attribute__((host)), and host all resolve to the same descriptor. The stripping respects the current language standard (dword_126EFB4) and C++ version (dword_126EF68).
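The underscore stripping can be sketched as follows. This models only the "starts with __ and ends with __" rule; the language-standard and C++-version conditions mentioned above are not modeled, and normalize_attr_name is a hypothetical name rather than the recovered name of sub_40A250.

```c
#include <stddef.h>
#include <string.h>

/* Copy `name` into `out` with one leading "__" and one trailing "__"
 * removed, but only when both are present and something remains.
 * Returns 0 on success, -1 if `out` is too small. */
int normalize_attr_name(const char *name, char *out, size_t out_size)
{
    size_t len = strlen(name);
    const char *start = name;

    if (len > 4 && strncmp(name, "__", 2) == 0
                && strncmp(name + len - 2, "__", 2) == 0) {
        start = name + 2;
        len  -= 4;
    }
    if (len + 1 > out_size)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

With this rule, "__host__" and "host" both produce the lookup key "host", while a name with underscores on only one side is left unchanged.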

Central Dispatch: get_attr_descr_for_attribute (sub_40FDB0)

This 227-line function is the central attribute resolution path. Given an attribute node with a name, it:

  1. Looks up the name in the hash table
  2. Checks mode compatibility (GCC mode via dword_126EFA8, Clang mode via dword_126EFA4, MSVC mode via dword_106BF68/dword_106BF58)
  3. Checks namespace match ("gnu", "__gnu__", "clang") via cond_matches_attr_mode (sub_40C4C0)
  4. Evaluates version-conditional availability via in_attr_cond_range (sub_40D620)
  5. Writes the kind byte from the matched descriptor into attr_node_t.kind
  6. Returns the descriptor entry (which carries the handler function pointer)

The mode condition strings use a compact encoding: 'g'=GCC, 'l'=Clang, 's'=Sun, 'c'=C++, 'm'=MSVC; 'x'=extension, '+'=positive match, '!'=boundary marker.

Attribute Application Pipeline

Phase 1: Scanning

The lexer recognizes attribute syntax and calls into the scanning functions:

| Function | Address | Role |
|---|---|---|
| scan_std_attribute_group | sub_412650 | Parses [[...]] C++11 and __attribute__((...)) GNU attributes |
| scan_gnu_attribute_groups | sub_412F20 | Handles __attribute__((...)) specifically |
| scan_attributes_list | sub_4124A0 | Iterates token stream building attribute node lists |
| parse_attribute_argument_clause | sub_40C8B0 | Parses attribute argument expressions |
| get_balanced_token | sub_40C6C0 | Handles balanced parentheses/brackets in arguments |

Scanning produces a linked list of attr_node_t nodes. At this stage, the kind byte is unset; only the name and namespace_str fields are populated.

Phase 2: Lookup and Kind Assignment

When the parser reaches a declaration, get_attr_descr_for_attribute resolves each attribute name to a descriptor and writes the kind byte. For CUDA attributes, this assigns values in the 'V'--'n' range.

Phase 3: Application -- apply_one_attribute (sub_413240)

The central dispatcher is a 585-line function containing a switch on the kind byte. For each CUDA kind, it calls the corresponding handler:

// sub_413240 -- apply_one_attribute (attribute.c, main dispatch)
// 585 lines, giant switch on attribute kind
void apply_one_attribute(attr_node_t* attr, entity_t* entity, int target_kind) {
    switch (attr->kind) {
        case 'V':  apply_nv_host_attr(attr, entity, target_kind);          break;
        case 'W':  apply_nv_device_attr(attr, entity, target_kind);        break;
        case 'X':  apply_nv_global_attr(attr, entity, target_kind);        break;
        case 'Z':  apply_nv_shared_attr(attr, entity, target_kind);        break;
        case '[':  apply_nv_constant_attr(attr, entity, target_kind);      break;
        case '\\': apply_nv_launch_bounds_attr(attr, entity, target_kind); break;
        case ']':  apply_nv_maxnreg_attr(attr, entity, target_kind);       break;
        case '^':  apply_nv_local_maxnreg_attr(attr, entity, target_kind); break;
        case 'f':  apply_nv_managed_attr(attr, entity, target_kind);       break;
        case 'k':  apply_nv_cluster_dims_attr(attr, entity, target_kind);  break;
        case 'l':  apply_nv_block_size_attr(attr, entity, target_kind);    break;
        // ... standard attributes handled similarly ...
    }
}

The outer iteration is apply_attributes_to_entity (sub_413ED0, 492 lines), which walks the attribute list, calls apply_one_attribute for each, and handles deferred attributes, attribute merging, and ordering constraints.

Phase 4: Post-Declaration Validation -- sub_6BC890

After all attributes on a declaration are applied, sub_6BC890 (nv_validate_cuda_attributes, from nv_transforms.c) performs cross-attribute consistency checking. This function validates that combinations of CUDA attributes are legal:

// sub_6BC890 -- nv_validate_cuda_attributes (nv_transforms.c)
// a1: entity (function), a2: diagnostic location
void nv_validate_cuda_attributes(entity_t* fn, source_loc_t* loc) {
    if (!fn || (fn->byte_177 & 0x10))  // skip if null or already validated
        return;

    uint8_t exec_space = fn->byte_182;  // CUDA execution space bits
    launch_config_t* lc = fn->launch_config;  // entity+256

    // Check 1: parameters with rvalue-reference in __global__ functions
    // Walks parameter list, emits error 3702 for ref-qualified params

    // Check 2: __nv_register_params__ on __host__-only or __global__
    if (fn->byte_183 & 0x08) {
        if (exec_space & 0x40)       // __global__
            emit_error(3661, "__global__");
        else if ((exec_space & 0x30) == 0x20)  // __host__ only (no __device__)
            emit_error(3661, "__host__");
    }

    // Check 3: __launch_bounds__ without __global__
    if (lc && !(exec_space & 0x40)) {
        if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor)
            emit_error(3534, "__launch_bounds__");
    }

    // Check 4: __cluster_dims__ / __block_size__ without __global__
    if (lc && !(exec_space & 0x40)
           && ((fn->byte_183 & 0x40) || lc->cluster_dim_x > 0)) {
        const char* name = (lc->block_size_x > 0) ? "__block_size__" : "__cluster_dims__";
        emit_error(3534, name);
    }

    // Check 5: cluster product exceeds maxBlocksPerClusterSize
    if (lc && lc->cluster_dim_x > 0 && lc->maxBlocksPerClusterSize > 0) {
        if (lc->maxBlocksPerClusterSize <
            lc->cluster_dim_x * lc->cluster_dim_y * lc->cluster_dim_z) {
            emit_error(3707, ...);
        }
    }

    // Check 6: __maxnreg__ without __global__
    if (lc && lc->maxnreg >= 0 && !(exec_space & 0x40))
        emit_error(3715, "__maxnreg__");

    // Check 7: __launch_bounds__ + __maxnreg__ conflict
    if (lc && lc->maxThreadsPerBlock && lc->maxnreg >= 0)
        emit_error(3719, "__launch_bounds__ and __maxnreg__");

    // Check 8: __global__ without __launch_bounds__
    if ((exec_space & 0x40) && (!lc || (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor)))
        emit_warning(3695);  // "no __launch_bounds__ specified for __global__ function"
}
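Check 5 is the only check in the listing that does arithmetic, so it is worth isolating. The sketch below reproduces its condition on a reduced, hypothetical view of the launch configuration; cluster_cfg and cluster_exceeds_limit are illustrative names, and the 64-bit product is a defensive choice of this sketch, not something the listing shows.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced view of the launch configuration fields used by Check 5. */
typedef struct {
    int32_t cluster_dim_x, cluster_dim_y, cluster_dim_z;
    int32_t max_blocks_per_cluster;  /* maxBlocksPerClusterSize in the listing */
} cluster_cfg;

/* True when the Check 5 condition holds, i.e. error 3707 would be emitted:
 * both values are set and the x*y*z product is larger than the limit. */
bool cluster_exceeds_limit(const cluster_cfg *c)
{
    if (c->cluster_dim_x <= 0 || c->max_blocks_per_cluster <= 0)
        return false;
    int64_t product = (int64_t)c->cluster_dim_x
                    * c->cluster_dim_y * c->cluster_dim_z;
    return c->max_blocks_per_cluster < product;
}
```

For example, a 2x2x2 cluster has 8 blocks, so a limit of 4 triggers the error, while a 2x2x1 cluster with the same limit does not.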

Error Codes in Validation

| Error | Severity | Message |
|---|---|---|
| 3534 | 7 (error) | "%s" attribute is not allowed on a non-__global__ function |
| 3661 | 7 (error) | __nv_register_params__ is not allowed on a %s function |
| 3695 | 4 (warning) | no __launch_bounds__ specified for __global__ function |
| 3702 | 7 (error) | Parameter with rvalue reference in __global__ function |
| 3707 | 7 (error) | total number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit |
| 3715 | 7 (error) | __maxnreg__ is not allowed on a non-__global__ function |
| 3719 | 7 (error) | __launch_bounds__ and __maxnreg__ may not be used on the same declaration |
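The Layer 1 description on this page states that internal IDs in the 3000+ range are displayed as 20000-series numbers via a +16543 offset. Assuming that offset applies uniformly to the codes in this table, the user-visible number can be computed as below; the macro and function names are illustrative.

```c
/* Offset quoted in the Layer 1 description on this page. */
#define CUDA_DIAG_DISPLAY_OFFSET 16543

/* Map an internal diagnostic ID to the number shown to the user,
 * assuming the offset applies uniformly to these IDs. */
int displayed_diag_number(int internal_id)
{
    return internal_id + CUDA_DIAG_DISPLAY_OFFSET;
}
```

Under that assumption, internal error 3534 would be reported as 20077 and 3719 as 20262.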

Per-Attribute Handler Function Table

Each CUDA attribute has a dedicated apply_* function registered in the descriptor table. These functions modify entity node fields (execution space bits, memory space bits, launch configuration) and emit diagnostics for invalid usage.

| Attribute | Handler | Address | Lines | Entity Fields Modified |
|---|---|---|---|---|
| __host__ | apply_nv_host_attr | sub_4108E0 | 31 | entity+182 \|= 0x15 |
| __device__ | apply_nv_device_attr | sub_40EB80 | 100 | Functions: entity+182 \|= 0x23; Variables: entity+148 \|= 0x01 |
| __global__ | apply_nv_global_attr | sub_40E1F0 | 89 | entity+182 \|= 0x61 |
| __global__ (variant 2) | apply_nv_global_attr | sub_40E7F0 | 86 | Same as above (alternate entry point) |
| __shared__ | (via device attr path) | -- | -- | entity+148 \|= 0x02 |
| __constant__ | (via device attr path) | -- | -- | entity+148 \|= 0x04 |
| __managed__ | apply_nv_managed_attr | sub_40E0D0 | 47 | entity+148 \|= 0x01, entity+149 \|= 0x01 |
| __launch_bounds__ | apply_nv_launch_bounds_attr | sub_411C80 | 98 | entity+256 -> launch config +0, +8, +16 |
| __maxnreg__ | apply_nv_maxnreg_attr | sub_410F70 | 67 | entity+256 -> launch config +32 |
| __local_maxnreg__ | apply_nv_local_maxnreg_attr | sub_411090 | 67 | entity+256 -> launch config +36 |
| __cluster_dims__ | apply_nv_cluster_dims_attr | sub_4115F0 | 145 | entity+256 -> launch config +20, +24, +28 |
| __block_size__ | apply_nv_block_size_attr | sub_4109E0 | 265 | entity+256 -> launch config +40..+52 |
| __nv_register_params__ | apply_nv_register_params_attr | sub_40B0A0 | 38 | entity+183 \|= 0x08 |

Attribute Registration (sub_6B5E50)

The function sub_6B5E50 (160 lines, in the nv_transforms.c / mem_manage.c area) registers NVIDIA-specific pseudo-attributes into EDG's keyword and macro systems at startup. It operates after EDG's standard keyword initialization but before parsing begins.

The registration creates macro-like definitions that the lexer expands before attribute processing. The function:

  1. Allocates attribute definition nodes via sub_6BA0D0 (EDG's node allocator)
  2. Looks up existing definitions via sub_734430 (hash table search) -- if a definition already exists, it chains the new handler onto it via sub_6AC190
  3. Creates new keyword entries via sub_749600 if no prior definition exists
  4. Registers __nv_register_params__ as a 40-byte attribute definition node (kind marker 8961) with chain linkage
  5. Registers __noinline__ as a 30-byte attribute definition node (kind marker 6401), including the "oinline))" suffix for __attribute__((__noinline__)) expansion
  6. Conditionally registers ARM SME attributes (__arm_in, __arm_inout, __arm_out, __arm_preserves, __arm_streaming, __arm_streaming_compatible) via sub_6ACCB0 when Clang version >= 180000 and ARM target flags are set
  7. Registers _Pragma as an operator-like keyword for _Pragma("...") processing

If any registration fails (the existing entry cannot be extended), it emits internal error 1338 with the attribute name and calls sub_6B6280 (fatal error handler).

Entity Node: CUDA Attribute Fields

CUDA attributes modify specific byte fields in entity nodes. The key fields for a reimplementation:

Execution Space (entity+182)

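Bit 0 (0x01): device-capable       set by all three handlers (host, device, global)
Bit 1 (0x02): (device mask only)   set by apply_nv_device_attr as part of 0x23
Bit 2 (0x04): __host__             set by apply_nv_host_attr
Bit 4 (0x10): __host__ explicit    set by apply_nv_host_attr
Bit 5 (0x20): device annotation    set by apply_nv_device_attr and apply_nv_global_attr
Bit 6 (0x40): __global__           set by apply_nv_global_attr
Bit 7 (0x80): __host__ __device__  combined flag; also set by apply_nv_global_attr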

Handlers use OR-masks: __host__ sets 0x15 (bits 0+2+4), __device__ sets 0x23 (bits 0+1+5), __global__ sets 0x61 (bits 0+5+6). The overlap at bit 0 means all execution-space-annotated functions have bit 0 set, which serves as a quick "has CUDA annotation" predicate.
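The mask arithmetic can be checked with a short C sketch. The constant and function names (ES_HOST_MASK, has_cuda_annotation, is_kernel) are illustrative and are not symbols recovered from the binary; only the numeric mask values come from the handlers described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* OR-masks applied to the byte at entity+182 (values taken from the text). */
enum {
    ES_HOST_MASK   = 0x15,  /* bits 0 + 2 + 4 */
    ES_DEVICE_MASK = 0x23,  /* bits 0 + 1 + 5 */
    ES_GLOBAL_MASK = 0x61   /* bits 0 + 5 + 6 */
};

/* Bit 0 is present in every mask, so it answers
 * "was any execution-space keyword applied?". */
static bool has_cuda_annotation(uint8_t exec_space) {
    return (exec_space & 0x01) != 0;
}

/* Bit 6 appears only in the __global__ mask. */
static bool is_kernel(uint8_t exec_space) {
    return (exec_space & 0x40) != 0;
}
```

Applying both the host and device masks to an empty byte yields 0x37, which is the value used for `__host__ __device__` elsewhere on this page.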

Memory Space (entity+148)

Bit 0 (0x01): __device__           device memory
Bit 1 (0x02): __shared__           shared memory
Bit 2 (0x04): __constant__         constant memory

Extended Memory Space (entity+149)

Bit 0 (0x01): __managed__          managed (unified) memory

Launch Configuration (entity+256)

A pointer to a separately allocated launch_config_t structure (created by sub_5E52F0):

struct launch_config_t {
    uint64_t  maxThreadsPerBlock;          // +0   from __launch_bounds__(N, ...)
    uint64_t  minBlocksPerMultiprocessor;  // +8   from __launch_bounds__(N, M, ...)
    int32_t   maxBlocksPerClusterSize;     // +16  from __launch_bounds__(N, M, K)
    int32_t   cluster_dim_x;              // +20  from __cluster_dims__(X, ...)
    int32_t   cluster_dim_y;              // +24  from __cluster_dims__(X, Y, ...)
    int32_t   cluster_dim_z;              // +28  from __cluster_dims__(X, Y, Z)
    int32_t   maxnreg;                    // +32  from __maxnreg__(N)
    int32_t   local_maxnreg;              // +36  from __local_maxnreg__(N)
    int32_t   block_size_x;              // +40  from __block_size__(X, ...)
    int32_t   block_size_y;              // +44  from __block_size__(X, Y, ...)
    int32_t   block_size_z;              // +48  from __block_size__(X, Y, Z, ...)
    uint8_t   flags;                      // +52  bit 0=cluster_dims_set
                                          //      bit 1=block_size_set
};

This structure is allocated lazily: it is created only when a launch configuration attribute is first applied to a function. The allocation function sub_5E52F0 returns a structure that is zero-initialized except for maxnreg and local_maxnreg, which are set to -1 (the sentinel for "unset").
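A compilable sketch of the layout follows, with static assertions that pin the documented offsets. The allocator name launch_config_new is a hypothetical stand-in for sub_5E52F0, and the use of calloc is an assumption; the binary uses its own allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct launch_config {
    uint64_t maxThreadsPerBlock;          /* +0  */
    uint64_t minBlocksPerMultiprocessor;  /* +8  */
    int32_t  maxBlocksPerClusterSize;     /* +16 */
    int32_t  cluster_dim_x;               /* +20 */
    int32_t  cluster_dim_y;               /* +24 */
    int32_t  cluster_dim_z;               /* +28 */
    int32_t  maxnreg;                     /* +32 */
    int32_t  local_maxnreg;               /* +36 */
    int32_t  block_size_x;                /* +40 */
    int32_t  block_size_y;                /* +44 */
    int32_t  block_size_z;                /* +48 */
    uint8_t  flags;                       /* +52 */
} launch_config_t;

/* Compile-time checks against the offsets listed in the text. */
_Static_assert(offsetof(launch_config_t, maxBlocksPerClusterSize) == 16, "+16");
_Static_assert(offsetof(launch_config_t, cluster_dim_x) == 20, "+20");
_Static_assert(offsetof(launch_config_t, maxnreg) == 32, "+32");
_Static_assert(offsetof(launch_config_t, local_maxnreg) == 36, "+36");
_Static_assert(offsetof(launch_config_t, block_size_x) == 40, "+40");
_Static_assert(offsetof(launch_config_t, flags) == 52, "+52");
_Static_assert(sizeof(launch_config_t) == 56, "56 bytes with tail padding");

/* Hypothetical stand-in for sub_5E52F0: zero everything, then set sentinels. */
static launch_config_t *launch_config_new(void) {
    launch_config_t *lc = calloc(1, sizeof *lc);
    if (lc) {
        lc->maxnreg = -1;        /* "unset" */
        lc->local_maxnreg = -1;  /* "unset" */
    }
    return lc;
}
```

The -1 sentinels are why the validation code tests `maxnreg >= 0` rather than `maxnreg != 0`.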

Attribute Processing Global State

| Global | Address | Purpose |
|---|---|---|
| qword_E7FB60 | 0xE7FB60 | Attribute name hash table (created by init_attr_name_map) |
| qword_E7F038 | 0xE7F038 | Attribute token hash table (created by init_attr_token_map) |
| byte_E7FB80 | 0xE7FB80 | 204-byte static buffer for formatted attribute display names |
| off_D46820 | 0xD46820 | Master attribute descriptor table (32 bytes per entry, extends to 0xD47A60) |
| qword_E7F070 | 0xE7F070 | Visibility stack (for __attribute__((visibility(...))) nesting) |
| qword_E7F048 | 0xE7F048 | Alias/ifunc free list head |
| qword_E7F058/E7F050 | 0xE7F058/0xE7F050 | Alias chain list head/tail |
| dword_E7F080 | 0xE7F080 | Attribute processing flags |
| dword_E7F078 | 0xE7F078 | Extended attribute config flag |

The function reset_attribute_processing_state (sub_4190B0) zeroes all of these at the start of each translation unit.

Function Map

| Address | Identity | Source | Confidence |
|---|---|---|---|
| sub_40A250 | strip_double_underscores_and_lookup | attribute.c | HIGH |
| sub_40A310 | attribute_display_name | attribute.c:1307 | HIGH |
| sub_40C4C0 | cond_matches_attr_mode | attribute.c | HIGH |
| sub_40C6C0 | get_balanced_token | attribute.c | HIGH |
| sub_40C8B0 | parse_attribute_argument_clause | attribute.c | HIGH |
| sub_40D620 | in_attr_cond_range | attribute.c | HIGH |
| sub_40E0D0 | apply_nv_managed_attr | attribute.c:10523 | HIGH |
| sub_40E1F0 | apply_nv_global_attr (variant 1) | attribute.c | HIGH |
| sub_40E7F0 | apply_nv_global_attr (variant 2) | attribute.c | HIGH |
| sub_40EB80 | apply_nv_device_attr | attribute.c | HIGH |
| sub_40FDB0 | get_attr_descr_for_attribute | attribute.c:1902 | HIGH |
| sub_4108E0 | apply_nv_host_attr | attribute.c | HIGH |
| sub_4109E0 | apply_nv_block_size_attr | attribute.c | HIGH |
| sub_410F70 | apply_nv_maxnreg_attr | attribute.c | HIGH |
| sub_411090 | apply_nv_local_maxnreg_attr | attribute.c | HIGH |
| sub_4115F0 | apply_nv_cluster_dims_attr | attribute.c | HIGH |
| sub_411C80 | apply_nv_launch_bounds_attr | attribute.c | HIGH |
| sub_412650 | scan_std_attribute_group | attribute.c:2914 | HIGH |
| sub_413240 | apply_one_attribute | attribute.c | HIGH |
| sub_413ED0 | apply_attributes_to_entity | attribute.c | HIGH |
| sub_418F80 | init_attr_name_map | attribute.c:1524 | HIGH |
| sub_419070 | init_attr_token_map | attribute.c | HIGH |
| sub_4190B0 | reset_attribute_processing_state | attribute.c | HIGH |
| sub_6B5E50 | process_nv_register_params / attribute registration | nv_transforms.c | HIGH |
| sub_6BC890 | nv_validate_cuda_attributes | nv_transforms.c | VERY HIGH |

Cross-References

__global__ Function Constraints

The __global__ attribute designates a CUDA kernel -- a function that executes on the GPU and is callable from host code via the <<<...>>> launch syntax. Of all CUDA execution space attributes, __global__ imposes the most constraints. cudafe++ enforces these constraints across three separate validation passes: attribute application (when __global__ is first applied to an entity), post-declaration validation (after all attributes on a declaration are resolved), and semantic analysis (during template instantiation, redeclaration merging, and lambda processing). This page documents all constraint checks, their implementation in the binary, the entity node fields they inspect, and the diagnostics they emit.

Key Facts

| Property | Value |
|---|---|
| Source files | attribute.c (apply handler), nv_transforms.c (post-validation), class_decl.c (redeclaration, lambda), decls.c (template packs) |
| Apply handler (variant 1) | sub_40E1F0 (89 lines) |
| Apply handler (variant 2) | sub_40E7F0 (86 lines) |
| Post-validation | sub_6BC890 (nv_validate_cuda_attributes, 161 lines) |
| Attribute kind byte | 0x58 = 'X' |
| OR mask applied | entity+182 \|= 0x61 (bits 0 + 5 + 6) |
| HD combined flag | entity+182 \|= 0x80 (set at the end of the handler whenever bit 6 is set, which is always the case after the 0x61 mask) |
| Total constraint checks | 37 distinct error conditions |
| Entity fields read | +81, +144, +148, +152, +166, +176, +179, +182, +183, +184, +191 |
| Relaxed mode flag | dword_106BFF0 (suppresses certain conflict checks) |
| main() entity pointer | qword_126EB70 (compared to detect __global__ main) |

Two Variants of apply_nv_global_attr

Two nearly identical functions implement the __global__ application logic. Both perform the same 11 validation checks and apply the same 0x61 bitmask. The difference is purely structural: sub_40E1F0 walks the parameter list with a for loop that breaks on a null pointer, while sub_40E7F0 uses a do-while loop with an explicit null check and early return. Both exist because EDG's attribute subsystem may route through different call paths depending on whether the attribute appears on a declaration or a definition.

// Pseudocode for apply_nv_global_attr (sub_40E1F0 / sub_40E7F0)
// a1: attribute node, a2: entity node, a3: target kind
entity_t* apply_nv_global_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {

    // Gate: only applies to functions (kind 11)
    if (a3 != 11)
        return a2;

    // ---- Phase 1: Linkage / constexpr lambda check ----
    // Bits 47 and 24 of the 48-bit field at +184
    if ((a2->qword_184 & 0x800001000000) == 0x800000000000) {
        // Constexpr lambda with internal linkage but no local flag
        char* name = get_entity_display_name(a2, 0);  // sub_6BC6B0
        emit_error(3469, a1->src_loc, "__global__", name);
        return a2;   // bail out, do not apply __global__
    }

    // ---- Phase 2: Structural constraints ----

    // 2a. Static member function check
    if ((signed char)a2->byte_176 < 0 && !(a2->byte_81 & 0x04))
        emit_warning(3507, a1->src_loc, "__global__");  // severity 5

    // 2b. operator() check
    if (a2->byte_166 == 5)
        emit_error(3644, a1->src_loc);  // severity 7

    // 2c. Exception specification check (uses return type chain)
    type_t* ret = a2->type_chain;  // entity+144
    while (ret->kind == 12)        // skip cv-qualifier wrappers
        ret = ret->referenced;     // type+144
    if (ret->prototype->exception_spec)  // proto+152 -> +56
        emit_error(3647, a1->src_loc);   // auto/decltype(auto) return

    // 2d. Execution space conflict
    uint8_t es = a2->byte_182;
    if (!relaxed_mode && (es & 0x60) == 0x20)  // already __device__ only
        emit_error(3481, a1->src_loc);
    if (es & 0x10)                              // already __host__ explicit
        emit_error(3481, a1->src_loc);

    // 2e. Return type must be void
    if (!(a2->byte_179 & 0x10)) {  // not constexpr
        if (a2->byte_191 & 0x01)   // is lambda
            emit_error(3506, a1->src_loc);
        else {
            type_t* base = skip_typedefs(a2->type_chain);  // sub_7A68F0
            if (!is_void_type(base->referenced))            // sub_7A6E90
                emit_error(3505, a1->src_loc);
        }
    }

    // 2f. Variadic (ellipsis) check
    type_t* proto_type = a2->type_chain;  // +144
    while (proto_type->kind == 12)
        proto_type = proto_type->referenced;
    if (proto_type->prototype->flags_16 & 0x01)  // bit 0 of proto+16
        emit_error(3503, a1->src_loc);

    // ---- Phase 3: Apply the bitmask ----
    a2->byte_182 |= 0x61;   // device_capable + device_annotation + global_kernel

    // ---- Phase 4: Additional checks (after bitmask set) ----

    // 4a. Local function (constexpr local)
    if (a2->byte_81 & 0x04)
        emit_error(3688, a1->src_loc);

    // 4b. main() function check
    if (a2 == main_entity && (a2->byte_182 & 0x20))
        emit_error(3538, a1->src_loc);

    // ---- Phase 5: Parameter iteration (__grid_constant__ warning) ----
    if (a1->flags & 0x01) {  // attr_node+11 bit 0: applies to parameters
        // Walk parameter list from prototype
        proto_type = a2->type_chain;
        while (proto_type->kind == 12)
            proto_type = proto_type->referenced;
        param_t* param = *proto_type->prototype->param_list;  // deref +152

        source_loc_t loc = a1->src_loc;  // +56
        for (; param != NULL; param = param->next) {
            // Peel cv-qualifier wrappers
            type_t* ptype = param->type;  // param[1]
            while (ptype->kind == 12)
                ptype = ptype->referenced;

            // Check: is type a __grid_constant__ candidate?
            // sub_7A6B60 (has_grid_constant_flag) tests byte+133 bit 5 (0x20)
            if (!has_grid_constant_flag(ptype) && scope_index == -1) {
                // Scope entries are 784 bytes each
                scope_t* scope = (scope_t*)(scope_table_base + 784 * scope_table_index);
                if ((scope->flags_6 & 0x06) == 0 && scope->kind_4 != 12) {
                    type_t* ptype2 = param->type;
                    while (ptype2->kind == 12)
                        ptype2 = ptype2->referenced;
                    if (!ptype2->default_init)  // type+120 == NULL
                        emit_error(3669, &loc);
                }
            }
        }
    }

    // ---- Phase 6: HD combined flag ----
    if (a2->byte_182 & 0x40)       // __global__ now set
        a2->byte_182 |= 0x80;      // mark as combined HD

    return a2;
}

Execution Order Detail

The 0x61 bitmask is applied before the local-function (3688) and main() (3538) checks but after all structural checks (3507, 3644, 3647, 3481, 3505/3506, 3503). This means the bitmask is set even when errors are emitted -- cudafe++ continues processing after errors to collect as many diagnostics as possible in a single compilation pass.

The constexpr-lambda check at the top (error 3469) is the only check that causes an early return. If the function is a constexpr lambda with wrong linkage, the bitmask is NOT set and no further validation is performed.

Validation Error Catalog

The 37 validation errors are organized by the phase in which they are checked and by semantic category. Error codes below are cudafe++ internal diagnostic numbers; severity values match the sub_4F41C0 severity parameter (5 = warning, 7 = error, 8 = hard error).

Category 1: Return Type

| Error | Severity | Check | Message |
|---|---|---|---|
| 3505 | 7 | !is_void_type(skip_typedefs(entity+144)->referenced) | a __global__ function must have a void return type |
| 3506 | 7 | entity+191 & 0x01 (lambda) and non-void | a __global__ function must not have a deduced return type |
| 3647 | 7 | entity+152 -> +56 != NULL (exception spec present on return proto) | auto/decltype(auto) deduced return type |

Errors 3505 and 3506 are mutually exclusive paths guarded by the byte+179 & 0x10 constexpr flag. When the function is not constexpr, the handler checks whether it is a lambda (3506 path, which checks byte+191 bit 0) or a regular function (3505 path, which resolves through skip_typedefs via sub_7A68F0 and tests is_void_type via sub_7A6E90). The skip_typedefs function follows the type chain while type->kind == 12 (cv-qualifier wrapper) and (type->byte_161 & 0x7F) == 0 (no qualifier flags). The is_void_type function follows the same kind == 12 chain and returns whether the final node has kind == 1 (void).
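The two helpers can be sketched in C as follows. The type_node_t layout is a simplified stand-in (the real node keeps these fields at other offsets), and qual_flags is a hypothetical name for byte_161; only the loop conditions are taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an EDG type node. */
typedef struct type_node {
    uint8_t kind;                  /* 12 = wrapper node, 1 = void */
    uint8_t qual_flags;            /* stands in for byte_161 */
    struct type_node *referenced;  /* next node in the type chain */
} type_node_t;

/* Sketch of sub_7A68F0: step through wrapper nodes that carry no qualifier bits. */
static const type_node_t *skip_typedefs(const type_node_t *t) {
    while (t->kind == 12 && (t->qual_flags & 0x7F) == 0)
        t = t->referenced;
    return t;
}

/* Sketch of sub_7A6E90: step through every wrapper node, then test for void. */
static bool is_void_type(const type_node_t *t) {
    while (t->kind == 12)
        t = t->referenced;
    return t->kind == 1;
}
```

Note the parentheses around the mask test: in C, `x & 0x7F == 0` parses as `x & (0x7F == 0)`, which is always zero.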

Error 3647 is checked independently of 3505/3506. The check examines the exception specification pointer at prototype offset +56. In EDG's type system, auto and decltype(auto) return types are represented with a non-null exception specification node on the return type's prototype -- this is a repurposed field that indicates the return type is deduced.

Category 2: Parameters

| Error | Severity | Check | Message |
|---|---|---|---|
| 3503 | 8 | proto+16 & 0x01 (has ellipsis) | a __global__ function cannot have ellipsis |
| 3702 | 7 | param_flags & 0x02 (rvalue ref) | a __global__ function cannot have a parameter with rvalue reference type |
| -- | 7 | Parameter with __restrict__ on reference type | a __global__ function cannot have a parameter with __restrict__ qualified reference type |
| -- | 7 | Parameter of type va_list | A __global__ function or function template cannot have a parameter with va_list type |
| -- | 7 | Parameter of type std::initializer_list | a __global__ function or function template cannot have a parameter with type std::initializer_list |
| -- | 7 | Oversized alignment on win32 | cannot pass a parameter with a too large explicit alignment to a __global__ function on win32 platforms |
| 3669 | 8 | Device-scope parameter without default init | __grid_constant__ parameter warning (device-side check) |

Error 3503 (ellipsis) is checked in the apply handler by testing bit 0 of the function prototype's flags word at offset +16. This bit indicates the parameter list ends with ....

Error 3702 (rvalue reference) is checked in the post-validation pass (sub_6BC890), not in the apply handler. The post-validator walks the parameter list and checks byte offset +32 (bit 1) of each parameter node.

The __restrict__ reference, va_list, initializer_list, and win32 alignment checks are scattered across separate validation functions in nv_transforms.c and are triggered during declaration processing rather than during attribute application.

Error 3669 is checked in the apply handler's parameter iteration loop. It walks each parameter, resolves through cv-qualifier wrappers, and tests whether sub_7A6B60 returns false (meaning the parameter type has bit 5 of byte+133 clear -- not a __grid_constant__ type) AND the scope lookup produces a non-array, non-qualifier type without a default initializer at type+120.

Category 3: Modifiers

| Error | Severity | Check | Message |
|---|---|---|---|
| 3507 | 5 | (signed char)byte_176 < 0 && !(byte_81 & 0x04) | A __global__ function or function template cannot be marked constexpr (warning for static member) |
| 3688 | 8 | byte_81 & 0x04 (local function) | A __global__ function or function template cannot be marked constexpr (constexpr local) |
| 3481 | 8 | Execution space conflict (see matrix) | Conflicting CUDA execution spaces |
| -- | 7 | Function is consteval | A __global__ function or function template cannot be marked consteval |
| 3644 | 7 | byte_166 == 5 (operator function kind) | An operator function cannot be a __global__ function |
| -- | 7 | Defined in friend declaration | A __global__ function or function template cannot be defined in a friend declaration |
| -- | 7 | Exception specification present | An exception specification is not allowed for a __global__ function or function template |
| -- | 7 | Declared in inline unnamed namespace | A __global__ function or function template cannot be declared within an inline unnamed namespace |
| 3538 | 7 | a2 == qword_126EB70 (is main()) | function main cannot be marked __device__ or __global__ |

Error 3507 deserves special attention. The decompiled code shows:

if ((signed char)a2->byte_176 < 0 && !(a2->byte_81 & 0x04))
    emit_warning(3507, ...);

The signed char cast means byte_176 >= 0x80 (bit 7 set = static member function). The !(byte_81 & 0x04) condition ensures it is NOT a local function. The emitter uses severity 5 (warning via sub_4F8DB0), meaning this is a warning, not an error -- NVIDIA chose to warn rather than reject __global__ on static members, though the official documentation says it is not allowed. The displayed string is "A __global__ function or function template cannot be marked constexpr" with "__global__" as the attribute name parameter, though the actual semantic is "static member function" per the field being checked.
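The equivalence between the signed-char comparison and a bit-7 test can be confirmed with a small C sketch. It assumes a two's-complement target such as x86-64, where converting an out-of-range unsigned byte to signed char wraps; the function names are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The form Hex-Rays emits: reinterpret the byte as signed and test the sign. */
static bool bit7_via_signed_cast(uint8_t b) {
    return (signed char)b < 0;
}

/* The equivalent explicit mask test. */
static bool bit7_via_mask(uint8_t b) {
    return (b & 0x80) != 0;
}

/* Condition guarding diagnostic 3507: bit 7 of byte_176 set
 * and the local-function bit (0x04 of byte_81) clear. */
static bool warns_3507(uint8_t byte_176, uint8_t byte_81) {
    return bit7_via_signed_cast(byte_176) && !(byte_81 & 0x04);
}
```

A reimplementation can use the mask form directly; the cast is a compiler idiom, not a semantic requirement.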

Error 3644 checks entity+166 == 5. This field stores the "operator function kind" enum value, where 5 corresponds to operator(). This prevents lambda call operators or functors from being directly marked __global__.

Error 3688 is checked after the bitmask is set (byte_182 |= 0x61). It tests byte_81 & 0x04, which indicates a local (block-scope) function. The handler emits with severity 8 (via sub_4F81B0, hard error).

Error 3538 compares the entity pointer against qword_126EB70, which holds the entity pointer for main() (set during initial declaration processing). The condition also requires byte_182 & 0x20 (device annotation bit set), which is always true after |= 0x61.

Category 4: Template Constraints

| Error | Severity | Check | Message |
|---|---|---|---|
| -- | 7 | Pack parameter is not last template parameter | Pack template parameter must be the last template parameter for a variadic __global__ function template |
| -- | 7 | Multiple pack parameters | Multiple pack parameters are not allowed for a variadic __global__ function template |

These checks are performed during template declaration processing in decls.c, not in the apply handler. They constrain variadic __global__ function templates: CUDA requires that pack parameters appear last (so the runtime can enumerate kernel arguments), and only a single pack is permitted (the CUDA launch infrastructure cannot handle multiple parameter packs).
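The two pack rules can be expressed as a short C sketch over a list of "is this template parameter a pack?" flags. The function name, the result enum, and the order in which the two rules are tested are assumptions made for illustration; the text above does not describe how decls.c orders them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    PACK_OK,        /* no pack, or a single trailing pack */
    PACK_NOT_LAST,  /* one pack that is not the final parameter */
    PACK_MULTIPLE   /* more than one pack */
} pack_check_t;

/* is_pack[i] is true when template parameter i is a parameter pack. */
static pack_check_t check_kernel_template_packs(const bool *is_pack, size_t n) {
    size_t packs = 0;
    for (size_t i = 0; i < n; i++)
        if (is_pack[i])
            packs++;

    if (packs > 1)
        return PACK_MULTIPLE;
    if (packs == 1 && !is_pack[n - 1])
        return PACK_NOT_LAST;
    return PACK_OK;
}
```

For example, `template <typename T, typename... Ts>` maps to `{false, true}` and passes, while `template <typename... Ts, typename T>` maps to `{true, false}` and is rejected.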

Category 5: Redeclaration

| Error | Severity | Check | Message |
|---|---|---|---|
| -- | 7 | Previously __global__, now no execution space | a __global__ function(%no1) redeclared without __global__ |
| -- | 7 | Previously __global__, now __host__ | a __global__ function(%no1) redeclared with __host__ |
| -- | 7 | Previously __global__, now __device__ | a __global__ function(%no1) redeclared with __device__ |
| -- | 7 | Previously __global__, now __host__ __device__ | a __global__ function(%no1) redeclared with __host__ __device__ |

These four errors are mirrored by three messages for the reverse direction, where a function first declared without __global__ is redeclared with it:

  • a __device__ function(%no1) redeclared with __global__
  • a __host__ function(%no1) redeclared with __global__
  • a __host__ __device__ function(%no1) redeclared with __global__

Redeclaration checks occur during declaration merging in class_decl.c. When a function is redeclared and the execution space of the new declaration does not match the original, cudafe++ emits one of these errors. The %no1 format specifier inserts the function name. These checks run independently of the apply_nv_global_attr handler -- they operate on the merged entity after both attribute sets have been processed.

Category 6: Constexpr Lambda Linkage

| Error | Severity | Check | Message |
|---|---|---|---|
| 3469 | 5 | (qword_184 & 0x800001000000) == 0x800000000000 | __global__ on constexpr lambda with wrong linkage |

This is the first check in the apply handler and the only one that causes early return. The 48-bit field at entity+184 encodes template and linkage properties. Bit 47 (0x800000000000) indicates internal linkage or a similar constraint, while bit 24 (0x000001000000) indicates a local entity. When bit 47 is set but bit 24 is clear, the entity is a constexpr lambda that cannot legally receive __global__. The handler calls sub_6BC6B0 (get_entity_display_name) to format the entity name for the diagnostic message, then returns without setting the bitmask.
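The mask test is easier to read when the two bits are named. The following C sketch reproduces the condition; the function name rejects_global_3469 is illustrative and not a recovered symbol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when bit 47 is set and bit 24 is clear in the field at entity+184,
 * which is the condition that triggers diagnostic 3469 and the early return. */
static bool rejects_global_3469(uint64_t qword_184) {
    const uint64_t bit47 = UINT64_C(1) << 47;  /* 0x800000000000 */
    const uint64_t bit24 = UINT64_C(1) << 24;  /* 0x000001000000 */
    return (qword_184 & (bit47 | bit24)) == bit47;
}
```

Of the four combinations of these two bits, only "bit 47 set, bit 24 clear" triggers the early return; setting bit 24 as well lets the handler proceed.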

Category 7: Post-Validation (sub_6BC890)

These checks run after all attributes on a declaration have been applied, in the nv_validate_cuda_attributes function:

| Error | Severity | Check | Message |
|---|---|---|---|
| 3702 | 7 | Parameter with rvalue reference flag (bit 1 at param+32) | a __global__ function cannot have a parameter with rvalue reference type |
| 3661 | 7 | __nv_register_params__ on __global__ | __nv_register_params__ is not allowed on a __global__ function |
| 3534 | 7 | __launch_bounds__ on non-__global__ | %s attribute is not allowed on a non-__global__ function |
| 3707 | 7 | maxBlocksPerCluster < cluster product | total number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit |
| 3715 | 7 | __maxnreg__ on non-__global__ | __maxnreg__ is not allowed on a non-__global__ function |
| 3719 | 7 | Both __launch_bounds__ and __maxnreg__ | __launch_bounds__ and __maxnreg__ may not be used on the same declaration |
| 3695 | 4 | __global__ without __launch_bounds__ | no __launch_bounds__ specified for __global__ function (warning) |

Diagnostic 3695 is a severity-4 diagnostic (informational warning). It fires when a __global__ function has no associated launch configuration, encouraging developers to specify __launch_bounds__ for optimal register allocation. It is the only post-validation check that is a soft advisory rather than an error; the other warning on this page, 3507, is emitted by the apply handler at severity 5.

Entity Node Field Reference

The apply handler reads and writes specific fields within the entity node. Complete field semantics:

| Offset | Size | Field Name | Role in __global__ Validation |
|---|---|---|---|
| +81 | 1 byte | local_flags | Bit 2 (0x04): function is local (block-scope). Checked for 3688 and as exemption for 3507. |
| +144 | 8 bytes | type_chain | Pointer to return type. Followed through kind==12 cv-qualifier wrappers. |
| +152 | 8 bytes | prototype | Function prototype pointer. At prototype+16: flags (bit 0 = ellipsis). At prototype+56: exception spec pointer. At prototype+0: parameter list head (double deref for first param). |
| +166 | 1 byte | operator_kind | Value 5 = operator(). Checked for 3644. |
| +176 | 1 byte | member_flags | Bit 7 (0x80, checked as signed char < 0): static member function. Checked for 3507. |
| +179 | 1 byte | constexpr_flags | Bit 4 (0x10): function is constexpr. Guards 3505/3506 check (skipped if constexpr). |
| +182 | 1 byte | execution_space | The primary execution space bitfield. \|= 0x61 sets global kernel. Read for conflict checks (0x60, 0x10 masks). |
| +183 | 1 byte | extended_cuda | Bit 3 (0x08): __nv_register_params__. Checked in post-validation. Bit 6 (0x40): __cluster_dims__ set. |
| +184 | 8 bytes | linkage_template | 48-bit field encoding template/linkage flags. Only lower 48 bits used; mask 0x800001000000 checks constexpr lambda linkage. |
| +191 | 1 byte | lambda_flags | Bit 0 (0x01): entity is a lambda. Routes to 3506 instead of 3505 for void-return check. |
| +256 | 8 bytes | launch_config | Pointer to launch configuration struct (56 bytes). NULL if no launch attributes applied. Read in post-validation. |

The 0x61 Bitmask

The OR mask 0x61 sets three bits in the execution space byte:

0x61 = 0b01100001

  bit 0 (0x01):  device_capable     -- function can run on device
  bit 5 (0x20):  device_annotation  -- has explicit device-side annotation
  bit 6 (0x40):  global_kernel      -- function is a __global__ kernel

Bit 0 is shared with __device__ (0x23) and __host__ (0x15). It serves as a "has CUDA annotation" predicate -- any entity with bit 0 set has been explicitly annotated with at least one execution space keyword. This enables fast if (byte_182 & 0x01) checks throughout the codebase.

Bit 5 is shared with __device__. A __global__ function is considered device-annotated because kernel code executes on the GPU.

Bit 6 is unique to __global__. The mask byte_182 & 0x40 is the canonical predicate for "is this a kernel function?" used in dozens of locations throughout the binary.

HD Combined Flag (0x80)

After setting 0x61, the handler checks whether bit 6 (0x40, global kernel) is now set. If so, it ORs 0x80 into the byte. This bit means "combined host+device" and is set as a secondary effect. The logic at the end of the function:

if (a2->byte_182 & 0x40)       // just set via |= 0x61
    a2->byte_182 |= 0x80;      // always true after apply

This means every __global__ function ends up with byte_182 & 0x80 set, which marks it as "combined" in the execution space classification. This is semantically correct: a kernel has both a host-side stub (for launching) and device-side code (for execution).

Parameter Iteration for __grid_constant__

The final section of the apply handler iterates the function's parameter list to check for parameters that should be annotated __grid_constant__. This check only runs when bit 0 of attr_node->flags (the byte at a1+11, mask 0x01) is set, indicating the attribute application context includes parameter-level processing.

The iteration follows this structure:

// Navigate to function prototype
type_t* proto_type = entity->type_chain;     // +144
while (proto_type->kind == 12)               // skip cv-qualifiers
    proto_type = proto_type->referenced;      // +144

// Get parameter list head (double dereference)
param_t** param_list = proto_type->prototype->param_head;  // proto+152 -> deref
param_t* param = *param_list;                               // deref again

for (; param != NULL; param = param->next) {
    // Navigate to unqualified parameter type
    type_t* ptype = param->type;   // offset 8; the decompiler renders this as param[1]
    while (ptype->kind == 12)
        ptype = ptype->referenced;

    // sub_7A6B60: checks byte+133 bit 5 (0x20) -- "has __grid_constant__"
    bool has_gc = (ptype->byte_133 & 0x20) != 0;

    if (!has_gc && dword_126C5C4 == -1) {
        // Scope table lookup
        scope_t* scope = (scope_t*)(qword_126C5E8 + 784 * dword_126C5E4);  // 784-byte entries
        uint8_t scope_flags = scope->byte_6;
        uint8_t scope_kind = scope->byte_4;

        // Skip if scope has qualifier flags or is a cv-qualified scope
        if ((scope_flags & 0x06) == 0 && scope_kind != 12) {
            // Re-navigate to unqualified type
            type_t* ptype2 = param->type;   // offset 8, as above
            while (ptype2->kind == 12)
                ptype2 = ptype2->referenced;

            // Check for default initializer
            if (ptype2->qword_120 == 0)
                emit_error(3669, &saved_source_loc);
        }
    }
}

The scope table lookup uses a 784-byte scope structure (at qword_126C5E8 indexed by dword_126C5E4) to determine whether the current context is device-side. The dword_126C5C4 == -1 check verifies we are in device compilation mode. This entire parameter iteration is a device-side warning mechanism: it alerts developers when a kernel parameter lacks a default initializer in a context where __grid_constant__ would be appropriate.

Post-Declaration Validation (sub_6BC890)

After all attributes on a declaration are applied, nv_validate_cuda_attributes (sub_6BC890, 161 lines) performs cross-attribute consistency checks. For __global__ functions, this function enforces:

Rvalue Reference Parameters (3702)

// Walk parameter list
type_t* ret = entity->type_chain;
while (ret->kind == 12)
    ret = ret->referenced;
param_t* param = **((param_t***)ret + 19);  // proto -> param list

while (param) {
    if (param->byte_32 & 0x02)  // rvalue reference flag
        emit_error(3702, source_loc);
    param = param->next;
}

This check scans all parameters for the rvalue reference flag (bit 1 at parameter node offset +32). Kernel functions cannot accept rvalue references because kernel launch involves copying arguments through the CUDA runtime, which does not support move semantics across the host-device boundary.
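The expression `**((param_t***)ret + 19)` is decompiler shorthand: adding 19 to a pointer-to-pointer advances 19 pointer-sized slots, which is byte offset 152 on a 64-bit target. The C sketch below spells this out with named fields. All struct layouts here are simplified stand-ins that assume an LP64 ABI, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct param {
    struct param *next;     /* +0  */
    void         *type;     /* +8  */
    uint8_t       pad[16];  /* +16..+31, contents not modeled */
    uint8_t       byte_32;  /* +32: bit 1 (0x02) = rvalue reference */
} param_t;

typedef struct proto {
    param_t *first_param;   /* +0: head of the parameter list */
} proto_t;

typedef struct type_node {
    uint8_t  pad[152];      /* fields not modeled in this sketch */
    proto_t *prototype;     /* +152 on LP64 */
} type_node_t;

/* Byte offset reached by "(T ***)p + 19": 19 slots of pointer size. */
static size_t slot19_byte_offset(void) {
    return 19 * sizeof(void *);
}

/* Named-field equivalent of the walk: count parameters flagged as rvalue refs. */
static int count_rvalue_ref_params(const type_node_t *t) {
    int n = 0;
    for (const param_t *p = t->prototype->first_param; p != NULL; p = p->next)
        if (p->byte_32 & 0x02)
            n++;
    return n;
}
```

In the binary each flagged parameter produces one 3702 diagnostic; the sketch only counts them.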

__nv_register_params__ Conflict (3661)

if (entity->byte_183 & 0x08) {  // __nv_register_params__ set
    if (entity->byte_182 & 0x40)
        emit_error(3661, ..., "__global__");
    else if ((entity->byte_182 & 0x30) == 0x20)
        emit_error(3661, ..., "__host__");
}

The __nv_register_params__ attribute (bit 3 of byte+183) is incompatible with __global__ because kernel parameter passing uses a fixed ABI that cannot be overridden.

Launch Configuration Without __global__ (3534)

launch_config_t* lc = entity->launch_config;  // +256
if (lc && !(entity->byte_182 & 0x40)) {
    if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor)
        emit_error(3534, ..., "__launch_bounds__");
}

The __launch_bounds__, __cluster_dims__, and __block_size__ attributes require __global__. If a non-kernel function has any of these, error 3534 fires.

Cluster Dimension Product Check (3707)

if (lc->cluster_dim_x > 0 && lc->maxBlocksPerClusterSize > 0) {  // field at +16
    uint64_t product = lc->cluster_dim_x * lc->cluster_dim_y * lc->cluster_dim_z;
    if (lc->maxBlocksPerClusterSize < product)
        emit_error(3707, ...);
}

__launch_bounds__ and __maxnreg__ Conflict (3719)

if (lc->maxThreadsPerBlock && lc->maxnreg >= 0)
    emit_error(3719, ..., "__launch_bounds__ and __maxnreg__");

These two attributes provide contradictory register pressure hints and cannot coexist.

Missing __launch_bounds__ Warning (3695)

if ((entity->byte_182 & 0x40) &&
    (!lc || (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor)))
    emit_warning(3695);

Severity 4 (advisory). Encourages developers to annotate kernels with __launch_bounds__ for optimal register allocation.
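The launch-configuration checks above can be combined into one testable function. This C sketch returns a bitset of triggered diagnostics instead of emitting them. The flag names, the function name, and the trimmed struct are illustrative; the 64-bit widening of the cluster product is a choice made here and may not match the binary; checks 3702 and 3661 are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    DIAG_3534 = 1u << 0,  /* launch bounds on non-kernel */
    DIAG_3707 = 1u << 1,  /* cluster product exceeds limit */
    DIAG_3715 = 1u << 2,  /* __maxnreg__ on non-kernel */
    DIAG_3719 = 1u << 3,  /* __launch_bounds__ with __maxnreg__ */
    DIAG_3695 = 1u << 4   /* kernel without __launch_bounds__ */
};

/* Only the fields the checks read; offsets are not modeled here. */
typedef struct launch_config {
    uint64_t maxThreadsPerBlock;
    uint64_t minBlocksPerMultiprocessor;
    int32_t  maxBlocksPerClusterSize;
    int32_t  cluster_dim_x, cluster_dim_y, cluster_dim_z;
    int32_t  maxnreg;  /* -1 = unset */
} launch_config_t;

static unsigned validate_launch_attrs(uint8_t exec_space, const launch_config_t *lc) {
    unsigned out = 0;
    bool kernel = (exec_space & 0x40) != 0;
    bool has_bounds = lc && (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor);

    if (has_bounds && !kernel)
        out |= DIAG_3534;
    if (lc && lc->cluster_dim_x > 0 && lc->maxBlocksPerClusterSize > 0) {
        /* Assumes non-negative dimensions; widened to avoid int32 overflow. */
        uint64_t product = (uint64_t)lc->cluster_dim_x *
                           (uint64_t)lc->cluster_dim_y *
                           (uint64_t)lc->cluster_dim_z;
        if ((uint64_t)lc->maxBlocksPerClusterSize < product)
            out |= DIAG_3707;
    }
    if (lc && lc->maxnreg >= 0 && !kernel)
        out |= DIAG_3715;
    if (lc && lc->maxThreadsPerBlock && lc->maxnreg >= 0)
        out |= DIAG_3719;
    if (kernel && !has_bounds)
        out |= DIAG_3695;
    return out;
}
```

A kernel with `__launch_bounds__(256)` and no other launch attributes produces an empty result, while a kernel with no launch configuration at all produces only the 3695 advisory.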

Execution Space Conflict Matrix

When __global__ is applied to a function that already has an execution space annotation, the handler checks for conflicts using two conditions:

// Condition 1: already __device__ only (without relaxed mode)
if (!dword_106BFF0 && (byte_182 & 0x60) == 0x20)
    error(3481);

// Condition 2: already __host__ explicit
if (byte_182 & 0x10)
    error(3481);

| Current byte_182 | Applying __global__ | (byte & 0x60) == 0x20 | byte & 0x10 | Result |
|---|---|---|---|---|
| 0x00 (none) | \|= 0x61 -> 0x61 | false | false | accepted |
| 0x23 (__device__) | \|= 0x61 -> 0x63 | true | false | error 3481 (unless relaxed) |
| 0x15 (__host__) | \|= 0x61 -> 0x75 | false | true | error 3481 |
| 0x37 (__host__ __device__) | \|= 0x61 -> 0x77 | true | true | error 3481 |
| 0x61 (__global__) | \|= 0x61 -> 0x61 | false | false | accepted -- idempotent bitmask |

In relaxed mode (dword_106BFF0 != 0), the first condition is suppressed, allowing __device__ + __global__ combinations. The second condition (explicit __host__) is never relaxed.
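
The two conditions reduce to a small predicate over byte_182. The sketch below restates them under stated assumptions: the enum names and `global_conflicts` are hypothetical labels for the masks used in the pseudocode, and the `relaxed` parameter stands in for dword_106BFF0 != 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the byte_182 masks used by the two conditions. */
enum {
    ES_HOST_EXPLICIT = 0x10,  /* explicit __host__ */
    ES_DEVICE        = 0x20,  /* device-capable */
    ES_SPACE_MASK    = 0x60   /* device + global bits */
};

/* Returns true when applying __global__ would raise error 3481. */
static bool global_conflicts(uint8_t byte_182, bool relaxed) {
    /* Condition 1: __device__ only; suppressed in relaxed mode. */
    if (!relaxed && (byte_182 & ES_SPACE_MASK) == ES_DEVICE)
        return true;
    /* Condition 2: explicit __host__; never relaxed. */
    if (byte_182 & ES_HOST_EXPLICIT)
        return true;
    return false;
}
```

For example, `global_conflicts(0x23, false)` reports a conflict while `global_conflicts(0x23, true)` does not, and `global_conflicts(0x15, true)` still reports one because the explicit `__host__` bit is checked unconditionally.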

Helper Functions

| Address | Identity | Lines | Purpose |
|---|---|---|---|
| sub_6BC6B0 | get_entity_display_name | 49 | Formats entity name for diagnostic messages. Handles demangling, strips leading ::. |
| sub_7A68F0 | skip_typedefs | 19 | Follows type chain through kind==12 wrappers while byte_161 & 0x7F == 0. |
| sub_7A6E90 | is_void_type | 16 | Follows type chain through kind==12, returns kind == 1. |
| sub_7A6B60 | has_grid_constant_flag | 9 | Follows type chain through kind==12, returns byte_133 & 0x20. |
| sub_4F7510 | emit_error_with_names | 66 | Emits error with two string arguments (attribute name + entity name). |
| sub_4F8DB0 | emit_warning_with_name | 38 | Emits warning (severity 5) with one string argument. |
| sub_4F8200 | emit_error_basic | 10 | Emits error with severity + code + source location. |
| sub_4F81B0 | emit_error_minimal | 10 | Emits error (severity 8) with code + source location. |
| sub_4F8490 | emit_error_with_extra | 38 | Emits error with one supplementary argument. |

Additional __global__ Constraints (Outside Apply Handler)

Beyond the apply handler and post-validation, several other subsystems enforce __global__-specific rules. These checks occur during template instantiation, lambda processing, and declaration merging:

Template Argument Type Restrictions

CUDA restricts which types can appear as template arguments in __global__ function template instantiations:

  • Host-local types (defined inside a __host__ function) cannot be used
  • Private/protected class members cannot be used (unless the class is local to a __device__/__global__ function)
  • Unnamed types cannot be used (unless local to a __device__/__global__ function)
  • Lambda closure types cannot be used (unless the lambda is defined in a __device__/__global__ function, or is an extended lambda with --extended-lambda)
  • Texture/surface variables cannot be used as non-type template arguments
  • Private/protected template template arguments from class scope cannot be used

Static Global Template Stub

In whole-program compilation mode (-rdc=false) with -static-global-template-stub=true:

  • Extern __global__ function templates are not supported
  • __global__ function template instantiations must have definitions in the current TU

Device-Side Restrictions

Functions marked __global__ (or __device__) are subject to additional restrictions during semantic analysis:

  • The address-of-label extension is not supported
  • ASM operands may specify only one constraint letter
  • Certain ASM constraint letters are forbidden
  • Texture/surface variables cannot have their address taken or be indirected
  • Anonymous union member variables at global/namespace scope cannot be directly accessed
  • Function-scope static variables require a memory space specifier
  • Dynamic initialization of function-scope static variables is not supported

Function Map

| Address | Identity | Lines | Source File |
|---|---|---|---|
| sub_40E1F0 | apply_nv_global_attr (variant 1) | 89 | attribute.c |
| sub_40E7F0 | apply_nv_global_attr (variant 2) | 86 | attribute.c |
| sub_6BC890 | nv_validate_cuda_attributes | 161 | nv_transforms.c |
| sub_6BC6B0 | get_entity_display_name | 49 | nv_transforms.c |
| sub_7A68F0 | skip_typedefs | 19 | types.c |
| sub_7A6E90 | is_void_type | 16 | types.c |
| sub_7A6B60 | has_grid_constant_flag | 9 | types.c |
| sub_4F7510 | emit_error_with_names | 66 | error.c |
| sub_4F8DB0 | emit_warning_with_name | 38 | error.c |
| sub_4F8200 | emit_error_basic | 10 | error.c |
| sub_4F81B0 | emit_error_minimal | 10 | error.c |
| sub_4F8490 | emit_error_with_extra | 38 | error.c |
| sub_413240 | apply_one_attribute (dispatch) | 585 | attribute.c |

Global Variables

| Global | Address | Purpose |
|---|---|---|
| dword_106BFF0 | 0x106BFF0 | Relaxed mode flag. When set, suppresses __device__ + __global__ conflict (3481). |
| qword_126EB70 | 0x126EB70 | Pointer to the entity node for main(). Compared during 3538 check. |
| dword_126C5C4 | 0x126C5C4 | Scope index sentinel (-1 = device compilation mode). Guards 3669 parameter check. |
| dword_126C5E4 | 0x126C5E4 | Current scope table index. |
| qword_126C5E8 | 0x126C5E8 | Scope table base pointer. Each entry is 784 bytes. |

Cross-References

Launch Configuration Attributes

cudafe++ supports five attributes that control CUDA kernel launch parameters: __launch_bounds__, __cluster_dims__, __block_size__, __maxnreg__, and __local_maxnreg__. All five store their values in a shared 56-byte launch configuration struct pointed to by entity+256. The struct is lazily allocated on first use by sub_5E52F0 and initialized with sentinel values (-1 for all int32 fields, 0 for the two leading int64 fields, flags cleared). Each attribute handler parses its arguments through a shared constant-expression evaluation pipeline (sub_461980 for sign checking, sub_461640 for value extraction), validates positivity and 32-bit range, then writes the results to specific offsets in the struct. The first two __launch_bounds__ arguments are the exception: they are copied without validation. A post-declaration validation pass (sub_6BC890 in nv_transforms.c) enforces cross-attribute constraints: launch config attributes require __global__ (3534), the product of the cluster dimensions must not exceed the maxBlocksPerCluster argument of __launch_bounds__ (3707), and __maxnreg__ is mutually exclusive with __launch_bounds__ (3719).

Key Facts

| Property | Value |
|---|---|
| Source files | attribute.c (apply handlers), nv_transforms.c (post-validation) |
| __launch_bounds__ handler | sub_411C80 (98 lines) |
| __cluster_dims__ handler | sub_4115F0 (145 lines) |
| __block_size__ handler | sub_4109E0 (265 lines) |
| __maxnreg__ handler | sub_410F70 (67 lines) |
| __local_maxnreg__ handler | sub_411090 (67 lines) |
| Post-validation | sub_6BC890 (nv_validate_cuda_attributes, 160 lines) |
| Struct allocator | sub_5E52F0 (42 lines) |
| Constant value extractor | sub_461640 (const_expr_get_value, 53 lines) |
| Constant sign checker | sub_461980 (const_expr_sign_compare, 97 lines) |
| Dependent-type check | sub_7BE9E0 (is_dependent_type) |
| Entity field | entity+256 -- pointer to launch_config_t (56 bytes, NULL if no launch attrs) |
| Entity extended flags | entity+183 bit 6 (0x40): cluster_dims intent (set by zero-argument __cluster_dims__) |
| Total error codes | 17 distinct diagnostics across all five attributes and post-validation |

Attribute Kind Codes

Each CUDA attribute carries a kind byte at attr_node+8. The five launch config attributes use these values from the attribute_display_name (sub_40A310) switch table:

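| Kind | Hex | ASCII | Attribute | Handler |
|---|---|---|---|---|
| 92 | 0x5C | '\' | __launch_bounds__ | sub_411C80 |
| 93 | 0x5D | ']' | __maxnreg__ | sub_410F70 |
| 94 | 0x5E | '^' | __local_maxnreg__ | sub_411090 |
| 107 | 0x6B | 'k' | __cluster_dims__ | sub_4115F0 |
| 108 | 0x6C | 'l' | __block_size__ | sub_4109E0 |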

Kinds 92--94 are part of the original dense block (86--95). Kinds 107 and 108 were added later for cluster/Hopper-era features, occupying gaps in the ASCII range.

Launch Configuration Struct Layout

The struct is allocated by sub_5E52F0 and returned with a 16-byte offset from the raw allocation base. All handlers access the struct through the pointer stored at entity+256. The allocator initializes all int32 fields to -1 (sentinel for "not set") and zeroes the two leading int64 fields and the flags byte.

struct launch_config_t {                  // 56 bytes (offsets from entity+256 pointer)
    int64_t  maxThreadsPerBlock;          // +0   from __launch_bounds__ arg 1 (init: 0)
    int64_t  minBlocksPerMultiprocessor;  // +8   from __launch_bounds__ arg 2 (init: 0)
    int32_t  maxBlocksPerCluster;         // +16  from __launch_bounds__ arg 3 (init: -1)
    int32_t  cluster_dim_x;              // +20  from __cluster_dims__ / __block_size__ (init: -1)
    int32_t  cluster_dim_y;              // +24  from __cluster_dims__ / __block_size__ (init: -1)
    int32_t  cluster_dim_z;              // +28  from __cluster_dims__ / __block_size__ (init: -1)
    int32_t  maxnreg;                    // +32  from __maxnreg__ (init: -1)
    int32_t  local_maxnreg;              // +36  from __local_maxnreg__ (init: -1)
    int32_t  block_size_x;              // +40  from __block_size__ (init: -1)
    int32_t  block_size_y;              // +44  from __block_size__ (init: -1)
    int32_t  block_size_z;              // +48  from __block_size__ (init: -1)
    uint8_t  flags;                      // +52  bit 0: cluster_dims_set
                                         //       bit 1: block_size_set
    // +53..+55: padding
};

The struct packs integer fields of mixed widths. The first two fields (maxThreadsPerBlock and minBlocksPerMultiprocessor) are 64-bit to accommodate the full range of CUDA launch bounds values. The cluster dimensions, block sizes, and register counts are 32-bit because individual values cannot exceed hardware limits. The flags byte at offset +52 records which dimension-setting attributes have been applied, enabling mutual exclusion enforcement between __cluster_dims__ and __block_size__.
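
The layout can be checked mechanically. The sketch below restates the struct as compilable C and uses compile-time assertions to confirm the documented offsets and the 56-byte total. It assumes a typical LP64 ABI such as x86-64 System V, where int64_t is 8-byte aligned; the field names are the ones used on this page, not symbols recovered from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field names follow this page; they are documentation labels only. */
typedef struct {
    int64_t maxThreadsPerBlock;          /* +0  */
    int64_t minBlocksPerMultiprocessor;  /* +8  */
    int32_t maxBlocksPerCluster;         /* +16 */
    int32_t cluster_dim_x;               /* +20 */
    int32_t cluster_dim_y;               /* +24 */
    int32_t cluster_dim_z;               /* +28 */
    int32_t maxnreg;                     /* +32 */
    int32_t local_maxnreg;               /* +36 */
    int32_t block_size_x;                /* +40 */
    int32_t block_size_y;                /* +44 */
    int32_t block_size_z;                /* +48 */
    uint8_t flags;                       /* +52, followed by 3 bytes of tail padding */
} launch_config_t;

/* Compile-time checks of the documented offsets. */
static_assert(offsetof(launch_config_t, maxBlocksPerCluster) == 16, "arg 3 at +16");
static_assert(offsetof(launch_config_t, cluster_dim_x) == 20, "cluster x at +20");
static_assert(offsetof(launch_config_t, maxnreg) == 32, "maxnreg at +32");
static_assert(offsetof(launch_config_t, block_size_x) == 40, "block x at +40");
static_assert(offsetof(launch_config_t, flags) == 52, "flags at +52");
static_assert(sizeof(launch_config_t) == 56, "struct is 56 bytes");
```

The 8-byte alignment of the leading int64 fields is what rounds the 53 bytes of data up to 56.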

Allocator: sub_5E52F0

The allocator performs arena allocation via sub_6B7D60, then initializes every field:

// sub_5E52F0 -- allocate_launch_config
launch_config_t* allocate_launch_config() {
    void* raw = arena_alloc(pool_id, launch_config_pool_size + 56);
    char* base = pool_base + raw;

    if (!abi_mode) {              // dword_106BA08 == 0
        ++alloc_counter_prefix;
        base += 8;
        *(int64_t*)(base - 8) = 0;   // 8-byte ABI prefix
    }

    ++alloc_counter_main;

    // Zero three 8-byte slots starting at base (returned pointer = base + 16)
    *(int64_t*)(base + 0)  = 0;       // header word (returned-16)
    *(int64_t*)(base + 8)  = 0;       // header flags + padding (returned-8); base[8] is set below
    *(int64_t*)(base + 16) = 0;       // returned+0: maxThreadsPerBlock = 0

    // Initialize all int32 fields to -1 (sentinel = "not set")
    *(int32_t*)(base + 32) = -1;      // returned+16: maxBlocksPerCluster
    *(int32_t*)(base + 36) = -1;      // returned+20: cluster_dim_x
    *(int32_t*)(base + 40) = -1;      // returned+24: cluster_dim_y
    *(int32_t*)(base + 44) = -1;      // returned+28: cluster_dim_z
    *(int32_t*)(base + 48) = -1;      // returned+32: maxnreg
    *(int32_t*)(base + 52) = -1;      // returned+36: local_maxnreg
    *(int32_t*)(base + 56) = -1;      // returned+40: block_size_x
    *(int32_t*)(base + 60) = -1;      // returned+44: block_size_y
    *(int32_t*)(base + 64) = -1;      // returned+48: block_size_z
    base[68] &= 0xFC;                // returned+52: clear flags bits 0 and 1

    // Set internal flags byte combining ABI mode, device mode, marker
    base[8] = (8 * (device_flag & 1)) & 0x7F
            | (2 * (!abi_mode))       & 0x7E
            | 1;

    return (launch_config_t*)(base + 16);   // return with 16-byte offset
}

The sentinel value -1 (0xFFFFFFFF when read as unsigned) carries meaning throughout: handlers and the post-validator test field >= 0 or field > 0 to decide whether a field has been set, and -1 fails both tests, so unset fields are treated as absent. The two leading int64 fields use 0 as their sentinel instead, because they store __launch_bounds__ arguments, where zero means "not specified."

Constant-Expression Evaluation Pipeline

All five attribute handlers share the same two-function pipeline for parsing attribute argument values from EDG's internal 128-bit constant representation.

sub_461980 -- const_expr_sign_compare

Compares a constant expression's value against a 64-bit threshold. Returns +1 if the expression value is greater, -1 if less, 0 if equal. The comparison operates on the 128-bit extended-precision value stored at offsets +152 through +166 (eight 16-bit words) of the expression node.

// sub_461980 -- const_expr_sign_compare(expr_node, threshold)
// Returns: +1 if expr > threshold, -1 if expr < threshold, 0 if equal
int32_t const_expr_sign_compare(expr_node_t* expr, int64_t threshold) {
    // Decompose threshold into eight 16-bit words with sign extension
    uint16_t thresh_words[8];
    // ... sign-extension propagation through all 8 words ...
    uint16_t threshold_high = thresh_words[0];   // most-significant word (carries the sign)

    // Navigate to base type, skipping cv-qualifier wrappers (kind == 12)
    type_t* type = expr->type_chain;    // expr+112
    while (type->kind_132 == 12)
        type = type->referenced;        // type+144

    // Determine signedness from base type
    bool is_signed = (type->kind_132 == 2
                      && is_signed_type_table[type->subkind_144]);

    if (is_signed && (expr->word_152 & 0x8000)) {
        // Negative expression value
        if (!(threshold_high & 0x8000))
            return -1;    // negative < non-negative
    } else if (!is_signed) {
        if (threshold_high & 0x8000)
            return 1;     // non-negative > negative threshold
    }

    // Word-by-word unsigned comparison, most-significant word first:
    // expr+152 (MSW) through expr+166 (LSW) against thresh_words[0..7]
    const uint16_t* expr_words = (const uint16_t*)((const char*)expr + 152);
    for (int i = 0; i < 8; i++) {
        if (expr_words[i] > thresh_words[i]) return 1;
        if (expr_words[i] < thresh_words[i]) return -1;
    }
    return 0;  // equal
}

The handlers call const_expr_sign_compare(expr, 0) to check positivity:

  • <= 0 means non-positive (used by __cluster_dims__, __block_size__, __maxnreg__, __local_maxnreg__)
  • < 0 means strictly negative (used by __launch_bounds__ arg 3, where zero is allowed)
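
To make the comparison concrete, here is a standalone sketch of a three-way compare between a 128-bit value held as eight 16-bit words (most-significant first) and a sign-extended 64-bit threshold. It is an illustration under stated assumptions, not the decompiled routine: `widen_threshold` and `compare_words128` are hypothetical names, the value is passed as a plain array rather than read from an expression node, and mixed-sign cases are resolved up front before the word loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sign-extend a 64-bit threshold into eight 16-bit words, MSW first. */
static void widen_threshold(int64_t t, uint16_t out[8]) {
    uint16_t fill = (t < 0) ? 0xFFFF : 0x0000;
    for (int i = 0; i < 4; i++)
        out[i] = fill;                                   /* upper 64 bits */
    for (int i = 0; i < 4; i++)
        out[4 + i] = (uint16_t)((uint64_t)t >> (48 - 16 * i));
}

/* Returns +1, 0, or -1 as v is greater than, equal to, or less than threshold. */
static int compare_words128(const uint16_t v[8], bool is_signed, int64_t threshold) {
    assert(v != NULL);
    uint16_t t[8];
    widen_threshold(threshold, t);

    bool v_neg = is_signed && (v[0] & 0x8000) != 0;
    bool t_neg = (t[0] & 0x8000) != 0;
    if (v_neg != t_neg)
        return v_neg ? -1 : 1;          /* signs differ: the negative side is smaller */

    /* Same sign: an unsigned word-by-word compare gives the right order. */
    for (int i = 0; i < 8; i++) {
        if (v[i] > t[i]) return 1;
        if (v[i] < t[i]) return -1;
    }
    return 0;
}
```

Calling `compare_words128(words, true, 0)` and testing the result with `<= 0` or `< 0` reproduces the two positivity checks listed above.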

sub_461640 -- const_expr_get_value

Extracts a uint64_t value from a constant expression node's 128-bit representation. Sets an overflow flag if the value does not fit in 64 bits (accounting for sign).

// sub_461640 -- const_expr_get_value(expr_node, *overflow_flag)
// Returns: uint64_t value; *overflow_flag = 1 if truncation occurred
uint64_t const_expr_get_value(expr_node_t* expr, int32_t* overflow) {
    // Navigate to base type
    type_t* type = expr->type_chain;    // expr+112
    while (type->kind_132 == 12)
        type = type->referenced;

    uint16_t sign_word = expr->word_152;    // most-significant of 128-bit value
    bool is_signed = (type->kind_132 == 2
                      && is_signed_type_table[type->subkind_144]);

    int16_t expected_high;
    if (is_signed) {
        *overflow = 0;
        expected_high = -(sign_word >> 15);     // -1 if negative, 0 if positive
    } else {
        *overflow = 0;
        expected_high = 0;
    }

    // Verify that the upper 64 bits match the expected sign-extension pattern
    bool has_overflow = (sign_word != (uint16_t)expected_high);
    if (expr->word_154 != (uint16_t)expected_high) has_overflow = true;
    if (expr->word_156 != (uint16_t)expected_high) has_overflow = true;
    if (expr->word_158 != (uint16_t)expected_high) has_overflow = true;

    // Reconstruct 64-bit value from the lower four 16-bit words
    uint64_t result = ((uint64_t)expr->word_160 << 48)
                    | ((uint64_t)expr->word_162 << 32)
                    | ((uint64_t)expr->word_164 << 16)
                    | ((uint64_t)expr->word_166);

    if (!is_signed) {
        if (has_overflow) { *overflow = 1; }
        return result;
    }
    // Signed: verify sign bit consistency
    if (((uint16_t)expected_high) != (uint16_t)(result >> 63)
        || has_overflow
        || (int16_t)sign_word < 0) {
        *overflow = 1;
    }
    return result;
}

The overflow flag is used by all handlers with a consistent check pattern:

int32_t overflow;
uint64_t val = const_expr_get_value(expr, &overflow);
if (overflow || val > 0x7FFFFFFF)
    emit_error(OVERFLOW_ERROR_CODE, src_loc);
else
    launch_config->field = (int32_t)val;
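
The extract-then-range-check pattern can be sketched on its own for the case the handlers actually reach, namely a value already known to be non-negative. This is a simplified illustration: `words_fit_int32` is a hypothetical helper, the 128-bit value is passed as eight 16-bit words (most-significant first), and the signed-path handling of the real extractor is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true and writes *out when the non-negative 128-bit value in w
   fits in a positive int32; leaves *out untouched otherwise. */
static bool words_fit_int32(const uint16_t w[8], int32_t *out) {
    assert(w != NULL && out != NULL);

    /* A non-negative value fits in 64 bits only if the upper four words are zero. */
    for (int i = 0; i < 4; i++)
        if (w[i] != 0)
            return false;

    /* Rebuild the low 64 bits from the lower four words. */
    uint64_t v = ((uint64_t)w[4] << 48) | ((uint64_t)w[5] << 32)
               | ((uint64_t)w[6] << 16) |  (uint64_t)w[7];

    if (v > 0x7FFFFFFF)                 /* same bound the handlers use */
        return false;

    *out = (int32_t)v;
    return true;
}
```

Leaving the destination untouched on failure mirrors the handlers, which keep the -1 sentinel in place when they emit the overflow diagnostic.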

Template-Dependent Argument Bailout

Before evaluating constant expressions, all five handlers walk the attribute argument list checking for template-dependent types via sub_7BE9E0 (is_dependent_type). The walk follows a linked list of argument nodes (head at attr_node+32), where each node has:

| Offset | Field | Description |
|---|---|---|
| +0 | next | Next argument node in list |
| +10 | kind | Argument kind: 3 = type-qualified, 4 = expression, 5 = indirect expression |
| +32 | expr | Expression/type pointer (accessed as node[4] in decompiled code) |

If any argument has a dependent type, the handler returns immediately without modifying the entity. This defers attribute processing to template instantiation time, when concrete values are available:

// Common bailout pattern (appears in all 5 handlers)
arg_node_t* walk = *(arg_node_t**)(attr_node + 32);
while (walk) {
    switch (walk->kind_10) {
        case 3:   // type-qualified argument
            if (walk->expr[4]->kind_148 == 12)    // cv-qualifier wrapper
                return entity;                     // dependent -- bail
            break;
        case 4:   // expression argument
            if (is_dependent_type(walk->expr[4]))  // sub_7BE9E0
                return entity;
            if (walk->kind_10 != 5)
                break;
            // fallthrough to case 5
        case 5:   // indirect expression
            if (is_dependent_type(*(walk->expr[4])))
                return entity;
            break;
        default:
            break;
    }
    walk = walk->next;
}
// All args are concrete -- proceed with evaluation

__launch_bounds__ (sub_411C80)

Syntax: __launch_bounds__(maxThreadsPerBlock [, minBlocksPerMultiprocessor [, maxBlocksPerCluster]])

Accepts 1 to 3 arguments. Registered at kind byte 0x5C ('\\').

// sub_411C80 -- apply_nv_launch_bounds_attr (attribute.c, 98 lines)
// a1: attribute node, a2: entity node
entity_t* apply_nv_launch_bounds(attr_node_t* attr, entity_t* entity) {

    // ---- Error 3535: launch_bounds on local function ----
    // Note: does NOT return early -- continues to store values
    if (entity->byte_81 & 0x04)
        emit_error_with_name(7, 3535, attr->src_loc, "__launch_bounds__");

    // ---- Parse argument list ----
    arg_list_t* args = attr->arg_list;    // attr+32
    if (!args)
        return entity;

    // ---- Allocate launch config if needed ----
    launch_config_t* lc = entity->launch_config;   // entity+256
    if (!lc) {
        lc = allocate_launch_config();              // sub_5E52F0
        entity->launch_config = lc;
    }

    // ---- Arg 1: maxThreadsPerBlock (required, stored as int64) ----
    // Copied directly from constant expression value -- no sign/overflow check
    lc->maxThreadsPerBlock = args->const_value;     // +0, int64

    // ---- Arg 2: minBlocksPerMultiprocessor (optional, stored as int64) ----
    arg_node_t* arg2_list = *args;                  // first child
    if (!arg2_list)
        return entity;

    expr_node_t* arg2_expr = *arg2_list;            // expression node
    lc->minBlocksPerMultiprocessor = arg2_list[4];  // +8, int64, raw copy

    // ---- Check for arg 3 existence ----
    if (!arg2_expr)
        goto process_arg3;

    // ---- Template-dependent bailout for remaining args ----
    arg_node_t* walk = *(arg_node_t**)(attr + 32);
    if (!walk)
        goto process_arg3;
    // ... dependent type walk (same pattern as documented above) ...
    // If any arg is dependent, return entity unchanged

process_arg3:
    // ---- Arg 3: maxBlocksPerCluster (optional, int32, uses full pipeline) ----
    expr_node_t* expr3 = arg2_expr ? arg2_expr->const_value : NULL;   // 3rd arg expression, NULL if absent
    if (!expr3)
        return entity;

    if (const_expr_sign_compare(expr3, 0) < 0) {
        // Error 3705: negative maxBlocksPerCluster
        emit_error(7, 3705, attr->src_loc);
    } else {
        int32_t overflow;
        uint64_t val = const_expr_get_value(expr3, &overflow);
        if (overflow || val > 0x7FFFFFFF) {
            // Error 3706: overflow
            emit_error(7, 3706, attr->src_loc);
        } else if (val != 0) {
            lc->maxBlocksPerCluster = (int32_t)val;   // +16
        }
        // val == 0: not stored, sentinel -1 remains (means "use default")
    }

    return entity;
}

Argument Semantics

| Arg | Field | Offset | Type | Validation | Description |
|---|---|---|---|---|---|
| 1 (required) | maxThreadsPerBlock | +0 | int64 | None -- raw copy | Maximum threads per block. Guides register allocation in ptxas. |
| 2 (optional) | minBlocksPerMultiprocessor | +8 | int64 | None -- raw copy | Minimum resident blocks per SM. Guides occupancy optimization. |
| 3 (optional) | maxBlocksPerCluster | +16 | int32 | sign_compare < 0 (3705), overflow (3706) | Maximum blocks per cluster (CUDA 11.8+). |

Critical Implementation Details

First two args bypass the sign/overflow pipeline. Arguments 1 and 2 are copied directly from the constant expression node's value field as 64-bit quantities. They do not pass through const_expr_sign_compare or const_expr_get_value. This means negative or excessively large values for maxThreadsPerBlock and minBlocksPerMultiprocessor are accepted at parse time -- downstream consumers (ptxas) are responsible for rejecting them.

Third argument uses the strict pipeline. Only argument 3 (maxBlocksPerCluster) passes through both const_expr_sign_compare and const_expr_get_value with the overflow check. This argument was added later (CUDA 11.8 cluster launch) and uses the newer, stricter validation pattern.

Zero is acceptable for arg 3. The sign check uses const_expr_sign_compare(expr, 0) < 0 (strictly negative), not <= 0. A zero value passes the sign check but is not written (else if (val != 0) guard), leaving the sentinel -1 in place. This means zero effectively means "use default."

Error 3535 does not abort. The local-function check fires but does NOT return early. Processing continues, arguments are stored, and the launch config struct is populated even after emitting the error. This is consistent with cudafe++'s design of collecting as many diagnostics as possible in a single compilation pass.
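
The three outcomes for argument 3 (reject negative, reject too large, silently ignore zero) can be summarized in a few lines. This is a simplified sketch, assuming the argument has already been reduced to an int64_t; `store_max_blocks_per_cluster` and the enum names are hypothetical, and the return value stands in for the diagnostic that would be emitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ERR_NONE = 0, ERR_NEGATIVE = 3705, ERR_TOO_LARGE = 3706 };

/* Applies the arg-3 rules to `value`; `field` models launch_config+16. */
static int store_max_blocks_per_cluster(int64_t value, int32_t *field) {
    assert(field != NULL);
    if (value < 0)
        return ERR_NEGATIVE;            /* strictly negative is rejected */
    if (value > INT32_MAX)
        return ERR_TOO_LARGE;           /* does not fit in a positive int32 */
    if (value != 0)
        *field = (int32_t)value;        /* zero leaves the -1 sentinel in place */
    return ERR_NONE;
}
```

Starting from a field holding -1, a value of 0 returns `ERR_NONE` and the field still reads -1, which is how "use default" is represented.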

__cluster_dims__ (sub_4115F0)

Syntax: __cluster_dims__(x [, y [, z]]) or __cluster_dims__()

Accepts 0 to 3 arguments. Missing dimensions default to 1. Sets flag bit 0 at +52. Registered at kind byte 0x6B ('k').

// sub_4115F0 -- apply_nv_cluster_dims_attr (attribute.c, 145 lines)
entity_t* apply_nv_cluster_dims(attr_node_t* attr, entity_t* entity) {

    arg_list_t* args = attr->arg_list;    // attr+32

    // ---- No-argument form: set intent flag only ----
    if (args->kind_10 == 0) {             // no arguments present
        entity->byte_183 |= 0x40;        // cluster_dims intent flag
        return entity;
    }

    // ---- Extract argument expressions (up to 3) ----
    expr_node_t* expr_x = args->value;
    arg_node_t* child1 = args->first_child;
    expr_node_t* expr_y = child1 ? child1->value : NULL;
    expr_node_t* expr_z = NULL;
    if (child1 && child1->first_child)
        expr_z = child1->first_child->value;

    // ---- Template-dependent bailout ----
    // ... same walk pattern as __launch_bounds__ ...

    // ---- Allocate launch config if needed ----
    launch_config_t* lc = entity->launch_config;
    if (!lc) {
        lc = allocate_launch_config();
        entity->launch_config = lc;
    }

    // ---- Conflict check: __block_size__ already set cluster dims ----
    if (lc->flags & 0x02) {               // bit 1 = block_size_set
        emit_error(7, 3791, attr->src_loc);
        lc = entity->launch_config;       // reload after error emit
    }

    // ---- Set cluster_dims flag ----
    lc->flags |= 0x01;                    // bit 0 = cluster_dims_set

    // ---- Arg 1: cluster_dim_x ----
    if (!expr_x) {
        lc->cluster_dim_x = 1;            // +20, default
    } else if (const_expr_sign_compare(expr_x, 0) <= 0) {
        emit_error_with_name(7, 3685, attr->src_loc, "__cluster_dims__");
        lc = entity->launch_config;       // reload
    } else {
        int32_t overflow;
        uint64_t val = const_expr_get_value(expr_x, &overflow);
        if (overflow || val > 0x7FFFFFFF)
            emit_error(7, 3686, attr->src_loc);
        else
            lc->cluster_dim_x = (int32_t)val;
    }

    // ---- Arg 2: cluster_dim_y (defaults to 1) ----
    if (!expr_y) {
        lc->cluster_dim_y = 1;            // +24
    } else {
        // Same sign_compare/get_value/3685/3686 pattern
        // Stores at lc->cluster_dim_y (+24)
    }

    // ---- Arg 3: cluster_dim_z (defaults to 1) ----
    if (!expr_z) {
        lc->cluster_dim_z = 1;            // +28
    } else {
        // Same pattern, stores at lc->cluster_dim_z (+28)
    }

    return entity;
}

Key Observations

Zero-argument form. When __cluster_dims__() is called with no arguments, the handler does not allocate the launch config struct. It sets entity+183 |= 0x40 (the "cluster_dims intent" flag) and returns. This intent flag is checked during post-validation to detect __cluster_dims__ on non-__global__ functions (error 3534) even when no dimensions were specified.

Conflict check with block_size. Before storing dimensions, the handler checks lc->flags & 0x02 (bit 1 = block_size_set). If __block_size__ was already applied, error 3791 fires. Crucially, the handler does NOT return early after this error -- it continues to set the flag and attempt to store values. The reverse conflict (applying __block_size__ after __cluster_dims__) is checked in sub_4109E0 with the same error code, testing lc->flags & 0x01.

Strict positivity (zero rejected). All three dimensions use const_expr_sign_compare(expr, 0) <= 0, rejecting zero. Error 3685 fires with the attribute name "__cluster_dims__" as a format argument. Error 3686 fires for values exceeding 0x7FFFFFFF.

Defaults to 1. Unspecified dimensions default to 1, not 0. A cluster dimension of 1 means "no clustering in that dimension" -- the neutral value. The default is written explicitly (lc->cluster_dim_x = 1), overwriting the -1 sentinel from allocation.

__block_size__ (sub_4109E0)

Syntax: __block_size__(bx [, by [, bz [, cx [, cy [, cz]]]]])

Accepts up to 6 arguments: three block dimensions followed by three optional cluster dimensions. Registered at kind byte 0x6C ('l'). At 265 lines, this is the largest launch config handler.

// sub_4109E0 -- apply_nv_block_size_attr (attribute.c, 265 lines)
entity_t* apply_nv_block_size(attr_node_t* attr, entity_t* entity) {

    // ---- Parse up to 6 argument expressions ----
    arg_list_t* args = attr->arg_list;
    expr_node_t* block_x  = args->value;          // arg 1
    expr_node_t* block_y  = NULL;                  // arg 2
    expr_node_t* block_z  = NULL;                  // arg 3
    expr_node_t* cluster_x = NULL;                 // arg 4
    expr_node_t* cluster_y = NULL;                 // arg 5
    expr_node_t* cluster_z = NULL;                 // arg 6
    // ... linked-list traversal to extract args 2-6 ...

    // ---- Template-dependent bailout ----
    // ... same walk pattern ...

    // ---- Allocate launch config ----
    launch_config_t* lc = entity->launch_config;
    if (!lc) {
        lc = allocate_launch_config();
        entity->launch_config = lc;
    }

    // ---- Block dimensions: args 1-3 ----
    // Each uses: sign_compare <= 0 -> error 3788
    //            get_value overflow or > 0x7FFFFFFF -> error 3789
    //            else store at +40/+44/+48
    //            missing args default to 1

    // block_size_x (+40):
    if (!block_x)
        lc->block_size_x = 1;
    else
        validate_positive_int32(block_x, &lc->block_size_x, 3788, 3789, attr);

    // block_size_y (+44): same pattern, default 1
    // block_size_z (+48): same pattern, default 1

    // ---- Cluster dimensions: args 4-6 (only if arg 4 present) ----
    if (!cluster_x) {
        // No cluster dims from __block_size__
        lc->flags &= ~0x02;           // clear bit 1 temporarily

        if (!(lc->flags & 0x01)) {    // cluster_dims NOT already set
            // Write default cluster dims
            lc->cluster_dim_x = 1;     // +20
            lc->cluster_dim_y = 1;     // +24
            lc->cluster_dim_z = 1;     // +28
        }
        return entity;
    }

    // ---- Conflict check: cluster_dims already set ----
    if (lc->flags & 0x01) {           // bit 0 = cluster_dims_set
        emit_error(7, 3791, attr->src_loc);
        lc = entity->launch_config;
    }

    // ---- Set block_size flag ----
    lc->flags |= 0x02;                // bit 1 = block_size_set

    if (lc->flags & 0x01)             // cluster_dims_set -> conflict, bail
        return entity;

    // ---- Parse cluster dims from args 4-6 ----
    // Uses error 3788 for non-positive, 3789 for overflow
    // (same codes as block dims, with "__block_size__" as attr name)
    // Stores at +20/+24/+28, defaults to 1 if absent

    return entity;
}

Key Observations

Dual-purpose attribute. __block_size__ combines block dimensions and cluster dimensions in a single attribute. Arguments 1-3 specify the thread block shape (stored at +40/+44/+48); arguments 4-6 specify the cluster shape (stored at +20/+24/+28). This is NVIDIA's older, combined syntax, compared to the newer separate __cluster_dims__ attribute.

Shared cluster fields. Both __block_size__ and __cluster_dims__ write to the same offsets (+20/+24/+28). The flags byte (bit 0 for cluster_dims, bit 1 for block_size) provides mutual exclusion via error 3791.

Block size fields are separate from launch_bounds. The block dimensions from __block_size__ go to +40/+44/+48, distinct from __launch_bounds__'s maxThreadsPerBlock at +0. The __block_size__ attribute specifies exact dimensions; __launch_bounds__ specifies an upper bound. Both can coexist on the same function.

Defaulting behavior when no cluster args. When only 3 arguments are provided (block dims only), the handler checks whether __cluster_dims__ was already applied (flags & 0x01). If not, it writes default cluster dims of (1, 1, 1) to +20/+24/+28. If __cluster_dims__ was already applied, it leaves the existing cluster dim values untouched.

Error 3788/3789. These are the __block_size__-specific equivalents of __cluster_dims__'s 3685/3686. Both use strict positivity (<= 0), rejecting zero.
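
The interaction between the two flag bits is easier to follow as a small state model. The sketch below is an approximation under stated assumptions: `dims_cfg_t`, `apply_cluster_dims`, and `apply_block_size_cluster` are hypothetical names, argument validation is omitted, and the return value stands in for the emitted diagnostic (3791 or none).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { FLAG_CLUSTER_DIMS_SET = 0x01, FLAG_BLOCK_SIZE_SET = 0x02, ERR_CONFLICT = 3791 };

/* Reduced model: the flags byte (+52) and the shared cluster fields (+20/+24/+28). */
typedef struct {
    uint8_t flags;
    int32_t cluster_dim[3];
} dims_cfg_t;

/* Models __cluster_dims__(x, y, z): reports a conflict but still stores. */
static int apply_cluster_dims(dims_cfg_t *c, int32_t x, int32_t y, int32_t z) {
    assert(c != NULL);
    int err = (c->flags & FLAG_BLOCK_SIZE_SET) ? ERR_CONFLICT : 0;
    c->flags |= FLAG_CLUSTER_DIMS_SET;
    c->cluster_dim[0] = x;
    c->cluster_dim[1] = y;
    c->cluster_dim[2] = z;
    return err;
}

/* Models the cluster half of __block_size__ (arguments 4-6). */
static int apply_block_size_cluster(dims_cfg_t *c, bool has_cluster_args,
                                    int32_t x, int32_t y, int32_t z) {
    assert(c != NULL);
    if (!has_cluster_args) {
        c->flags &= (uint8_t)~FLAG_BLOCK_SIZE_SET;
        if (!(c->flags & FLAG_CLUSTER_DIMS_SET))         /* nothing set yet: default to 1x1x1 */
            c->cluster_dim[0] = c->cluster_dim[1] = c->cluster_dim[2] = 1;
        return 0;
    }
    bool conflict = (c->flags & FLAG_CLUSTER_DIMS_SET) != 0;
    c->flags |= FLAG_BLOCK_SIZE_SET;
    if (conflict)
        return ERR_CONFLICT;                             /* existing cluster dims are kept */
    c->cluster_dim[0] = x;
    c->cluster_dim[1] = y;
    c->cluster_dim[2] = z;
    return 0;
}
```

In this model the conflict is reported whichever attribute is applied second, and a three-argument `__block_size__` never disturbs cluster dimensions that `__cluster_dims__` already wrote.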

__maxnreg__ (sub_410F70)

Syntax: __maxnreg__(N)

Accepts exactly 1 argument. Stores at launch_config+32. Registered at kind byte 0x5D (']').

// sub_410F70 -- apply_nv_maxnreg_attr (attribute.c, 67 lines)
entity_t* apply_nv_maxnreg(attr_node_t* attr, entity_t* entity) {
    arg_list_t* args = attr->arg_list;       // attr+32
    if (!args)
        return entity;

    // ---- Template-dependent bailout ----
    // ... same walk pattern ...

    // ---- Allocate launch config ----
    if (!entity->launch_config)
        entity->launch_config = allocate_launch_config();

    // ---- Parse the single argument ----
    expr_node_t* expr = args->const_value;   // argument expression
    if (!expr)
        return entity;

    if (const_expr_sign_compare(expr, 0) <= 0) {
        emit_error(7, 3717, attr->src_loc);       // non-positive register count
    } else {
        int32_t overflow;
        uint64_t val = const_expr_get_value(expr, &overflow);
        if (overflow || val > 0x7FFFFFFF)
            emit_error(7, 3718, attr->src_loc);    // register count too large
        else
            entity->launch_config->maxnreg = (int32_t)val;   // +32
    }

    return entity;
}

The maxnreg field defaults to -1 from the allocator. A value >= 0 in post-validation unambiguously means the attribute was applied with a valid value (since zero would be caught by the <= 0 check here, the minimum valid value is 1).

Post-Validation Conflict

The __maxnreg__ handler does not check for conflicts with __launch_bounds__ at application time. The mutual exclusion is enforced in post-validation (sub_6BC890), which emits error 3719 when both maxThreadsPerBlock != 0 and maxnreg >= 0. This design allows the apply handlers to be called in any order.
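
Because the check reads only the final field values, the order in which the two attributes were applied does not matter. A minimal sketch of that predicate, assuming a reduced struct (`reg_cfg_t` and `launch_bounds_maxnreg_conflict` are hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced model of the two fields read by the 3719 check. */
typedef struct {
    int64_t maxThreadsPerBlock;  /* 0 when __launch_bounds__ was not applied */
    int32_t maxnreg;             /* -1 when __maxnreg__ was not applied */
} reg_cfg_t;

/* Returns true when both attributes have left a value behind. */
static bool launch_bounds_maxnreg_conflict(const reg_cfg_t *c) {
    assert(c != NULL);
    return c->maxThreadsPerBlock != 0 && c->maxnreg >= 0;
}
```

A freshly initialized config (0 and -1) reports no conflict, and neither does a config with only one of the two fields set.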

__local_maxnreg__ (sub_411090)

Syntax: __local_maxnreg__(N)

Structurally identical to __maxnreg__. Stores at launch_config+36. Registered at kind byte 0x5E ('^').

// sub_411090 -- apply_nv_local_maxnreg_attr (attribute.c, 67 lines)
entity_t* apply_nv_local_maxnreg(attr_node_t* attr, entity_t* entity) {
    // ... identical structure to __maxnreg__ ...

    if (const_expr_sign_compare(expr, 0) <= 0) {
        emit_error(7, 3786, attr->src_loc);        // error 3786: non-positive
    } else {
        int32_t overflow;
        uint64_t val = const_expr_get_value(expr, &overflow);
        if (overflow || val > 0x7FFFFFFF)
            emit_error(7, 3787, attr->src_loc);     // error 3787: too large
        else
            entity->launch_config->local_maxnreg = (int32_t)val;   // +36
    }

    return entity;
}

The __local_maxnreg__ attribute limits register usage within a specific device function scope rather than at the kernel level. It uses a separate struct field (+36 vs +32) so both can coexist. The post-validator does NOT check local_maxnreg for __global__-only enforcement -- __local_maxnreg__ is more permissive than __maxnreg__ and may appear on __device__ functions.

Post-Declaration Validation (sub_6BC890)

After all attributes on a declaration have been applied, nv_validate_cuda_attributes (sub_6BC890, 160 lines, in nv_transforms.c) performs cross-attribute consistency checks. This function is called from the declaration processing pipeline and operates on the completed entity node. Multiple errors can be emitted from a single validation pass -- cudafe++ does not short-circuit after the first error.

// sub_6BC890 -- nv_validate_cuda_attributes (nv_transforms.c, 160 lines)
// a1: entity pointer, a2: source location for diagnostics
void nv_validate_cuda_attributes(entity_t* entity, source_loc_t* loc) {

    if (!entity || (entity->byte_177 & 0x10))
        return;      // null or suppressed entity

    // ---- Phase 1: Parameter validation (rvalue refs, error 3702) ----
    // Walks parameter list checking for rvalue reference flag
    // [documented on __global__ page]

    // ---- Phase 2: __nv_register_params__ check (error 3661) ----
    // [documented on __global__ page]

    // ---- Phase 3: Launch config attribute checks ----
    launch_config_t* lc = entity->launch_config;   // entity+256
    uint8_t es = entity->byte_182;                  // execution space

    if (!lc)
        goto check_global_advisory;

    if (es & 0x40)                                  // is __global__
        goto cross_attribute_checks;

    // ==== Error 3534: launch config on non-__global__ ====

    // 3534 for __launch_bounds__
    if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor) {
        emit_error_with_name(7, 3534, &global_loc, "__launch_bounds__");
        lc = entity->launch_config;                 // reload after emit
    }

    // 3534 for __cluster_dims__ or __block_size__
    if ((entity->byte_183 & 0x40) || lc->cluster_dim_x >= 0) {
        const char* name = (lc->block_size_x > 0) ? "__block_size__"
                                                    : "__cluster_dims__";
        emit_error_with_name(7, 3534, &global_loc, name);
        lc = entity->launch_config;
        if (!lc)
            goto check_global_advisory;
    }

cross_attribute_checks:
    // ==== Error 3707: cluster size exceeds maxBlocksPerCluster ====
    if (lc->cluster_dim_x > 0) {
        if (lc->maxBlocksPerCluster > 0) {
            uint64_t cluster_product = (int64_t)lc->cluster_dim_x
                                     * (int64_t)lc->cluster_dim_y
                                     * (int64_t)lc->cluster_dim_z;
            if ((uint64_t)lc->maxBlocksPerCluster < cluster_product) {
                const char* name = (lc->block_size_x > 0) ? "__block_size__"
                                                           : "__cluster_dims__";
                emit_error_with_name(7, 3707, &global_loc, name);
                lc = entity->launch_config;
                if (!lc)
                    goto check_maxnreg;
            }
        }
    }

    // ==== Error 3719: __launch_bounds__ + __maxnreg__ conflict ====
    if (lc->maxnreg >= 0) {
        if (!(es & 0x40)) {
            // ==== Error 3715: __maxnreg__ on non-__global__ ====
            emit_error_with_name(7, 3715, &global_loc, "__maxnreg__");
            lc = entity->launch_config;
            if (lc)
                goto check_maxnreg_conflict;
            goto check_global_advisory;
        }

check_maxnreg_conflict:
        if (!lc->maxThreadsPerBlock) {
            // No __launch_bounds__ thread limit, so there is no conflict.
            // Reached by __global__ entities (fall-through) and by
            // non-__global__ entities that have already reported 3715.
            goto check_global_advisory;
        }
        // Both __launch_bounds__ and __maxnreg__ present
        emit_error_with_name(7, 3719, &global_loc,
                             "__launch_bounds__ and __maxnreg__");
    }

check_maxnreg:

check_global_advisory:
    // ==== Warning 3695: __global__ without __launch_bounds__ ====
    if (!(es & 0x40))
        return;                  // not __global__, no advisory needed

    lc = entity->launch_config;
    if (!lc) {
        emit_warning(4, 3695, &kernel_decl_loc);
        return;
    }

    if (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor) {
        // Launch config exists but no __launch_bounds__ values set
        // (struct was allocated by __cluster_dims__ or __block_size__)
        emit_warning(4, 3695, &kernel_decl_loc);
    }
}

Validation Logic Detail

Error 3534 -- Launch config on non-global. Tests entity->byte_182 & 0x40 (the __global__ bit). If clear, any non-default values in the launch config struct trigger error 3534. The error message uses %s with the specific attribute name. Notably, the check for __cluster_dims__ or __block_size__ tests lc->cluster_dim_x >= 0 (which is true when any cluster dim handler has run, since they write non-negative values). It also checks the intent flag (entity->byte_183 & 0x40) for the zero-argument __cluster_dims__() form.

Error 3707 -- Cluster product exceeds maxBlocksPerCluster. Computes cluster_dim_x * cluster_dim_y * cluster_dim_z using signed 64-bit arithmetic and compares against maxBlocksPerCluster. The multiplication uses the actual stored dimension values. The error message names whichever attribute set the cluster dims ("__block_size__" if block_size_x > 0, otherwise "__cluster_dims__"). This is a compile-time consistency check: if the programmer specifies both a cluster shape and a maximum cluster block count, the shape must fit.
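The reason for widening before the multiplication is easiest to see with concrete numbers. The sketch below is an illustration under stated assumptions: cluster_exceeds_limit is a hypothetical name, the inputs are assumed to be positive (as the apply handlers guarantee), and unsigned 64-bit arithmetic is used here to keep the C well defined, whereas the decompiled code multiplies signed 64-bit values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the declared cluster shape does not fit within
 * max_blocks, i.e. the condition described for error 3707. */
static bool cluster_exceeds_limit(int32_t x, int32_t y, int32_t z,
                                  int32_t max_blocks) {
    assert(x > 0 && y > 0 && z > 0);
    if (max_blocks <= 0)
        return false;                 /* no limit recorded: nothing to compare */
    /* Widen each factor first so the product is not truncated to 32 bits. */
    uint64_t product = (uint64_t)x * (uint64_t)y * (uint64_t)z;
    return (uint64_t)max_blocks < product;
}
```

With x = y = 65536 and z = 1 the product is 2^32, which a 32-bit multiply would truncate to 0 and silently accept; the widened form reports it as exceeding any int32 limit.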

Error 3715 -- maxnreg on non-global. Separate from the general 3534 check. While 3534 covers __launch_bounds__/__cluster_dims__/__block_size__, __maxnreg__ has its own error code because it is tested in a different branch of the validation logic.

Error 3719 -- launch_bounds + maxnreg conflict. These two attributes provide contradictory register allocation hints: __launch_bounds__ asks the compiler to choose registers based on occupancy targets; __maxnreg__ overrides with a hard limit. Detected by lc->maxThreadsPerBlock != 0 && lc->maxnreg >= 0.

Warning 3695 -- Missing launch_bounds advisory. Severity 4 (informational). Fires when a __global__ function has no __launch_bounds__ annotation. Tests both lc == NULL (no launch config at all) and maxThreadsPerBlock == 0 && minBlocksPerMultiprocessor == 0 (struct exists but was allocated by other attrs). Not an error; can be suppressed.
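The interaction of 3534, 3715, 3719, and 3695 can be summarized as a small decision function. This is a simplified sketch, not the validator itself: config_view_t, the D_* flag names, and validate are hypothetical, and the cluster-related checks (the __cluster_dims__/__block_size__ arm of 3534, and 3707) are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit flags, one per diagnostic discussed above. */
enum {
    D_3534 = 1 << 0,   /* __launch_bounds__ on non-__global__         */
    D_3715 = 1 << 1,   /* __maxnreg__ on non-__global__               */
    D_3719 = 1 << 2,   /* __launch_bounds__ together with __maxnreg__ */
    D_3695 = 1 << 3    /* advisory: __global__ without launch bounds  */
};

/* Simplified view of the fields the validator reads. */
typedef struct {
    bool    has_config;    /* launch_config pointer is non-NULL */
    int32_t max_threads;   /* maxThreadsPerBlock, 0 = unset     */
    int32_t min_blocks;    /* minBlocksPerMultiprocessor        */
    int32_t maxnreg;       /* -1 = unset                        */
} config_view_t;

/* Returns the set of diagnostics as an OR of D_* flags. */
static unsigned validate(bool is_global, const config_view_t *c) {
    assert(c != NULL);
    unsigned diags = 0;
    if (c->has_config) {
        bool has_bounds = c->max_threads != 0 || c->min_blocks != 0;
        if (!is_global && has_bounds)
            diags |= D_3534;
        if (c->maxnreg >= 0) {
            if (!is_global)
                diags |= D_3715;
            if (c->max_threads != 0)
                diags |= D_3719;
        }
    }
    if (is_global &&
        (!c->has_config || (c->max_threads == 0 && c->min_blocks == 0)))
        diags |= D_3695;
    return diags;
}
```

A non-__global__ function carrying both __launch_bounds__ and __maxnreg__ yields three diagnostics in one pass, matching the statement that the validator does not stop after the first error.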

Error Catalog

Apply-Time Errors

| Error | Sev | Attribute | Condition | Sign test | Emit function |
| --- | --- | --- | --- | --- | --- |
| 3535 | 7 | __launch_bounds__ | entity+81 & 0x04 (local function) | -- | sub_4F79D0 |
| 3685 | 7 | __cluster_dims__ | sign_compare(expr, 0) <= 0 | <= 0 (zero rejected) | sub_4F79D0 |
| 3686 | 7 | __cluster_dims__ | overflow or val > 0x7FFFFFFF | -- | sub_4F8200 |
| 3705 | 7 | __launch_bounds__ (arg 3) | sign_compare(expr, 0) < 0 | < 0 (zero allowed) | sub_4F8200 |
| 3706 | 7 | __launch_bounds__ (arg 3) | overflow or val > 0x7FFFFFFF | -- | sub_4F8200 |
| 3717 | 7 | __maxnreg__ | sign_compare(expr, 0) <= 0 | <= 0 | sub_4F8200 |
| 3718 | 7 | __maxnreg__ | overflow or val > 0x7FFFFFFF | -- | sub_4F8200 |
| 3786 | 7 | __local_maxnreg__ | sign_compare(expr, 0) <= 0 | <= 0 | sub_4F8200 |
| 3787 | 7 | __local_maxnreg__ | overflow or val > 0x7FFFFFFF | -- | sub_4F8200 |
| 3788 | 7 | __block_size__ | sign_compare(expr, 0) <= 0 | <= 0 | sub_4F79D0 |
| 3789 | 7 | __block_size__ | overflow or val > 0x7FFFFFFF | -- | sub_4F8200 |
| 3791 | 7 | __cluster_dims__ / __block_size__ | flags & opposite_bit | -- | sub_4F8200 |

Post-Validation Errors

| Error | Sev | Condition | Emit function |
| --- | --- | --- | --- |
| 3534 | 7 | Launch config attrs on non-__global__ | sub_4F79D0 |
| 3695 | 4 | __global__ without __launch_bounds__ | sub_4F8200 |
| 3707 | 7 | maxBlocksPerCluster < cluster_x * cluster_y * cluster_z | sub_4F79D0 |
| 3715 | 7 | maxnreg >= 0 on non-__global__ | sub_4F79D0 |
| 3719 | 7 | maxThreadsPerBlock != 0 && maxnreg >= 0 | sub_4F79D0 |

Sign-Test Summary

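| Attribute | Non-positive error | Overflow error | Sign test | Zero allowed? |
| --- | --- | --- | --- | --- |
| __launch_bounds__ arg 1-2 | (none) | (none) | No check | Yes |
| __launch_bounds__ arg 3 | 3705 | 3706 | < 0 | Yes (not stored) |
| __cluster_dims__ | 3685 | 3686 | <= 0 | No |
| __block_size__ | 3788 | 3789 | <= 0 | No |
| __maxnreg__ | 3717 | 3718 | <= 0 | No |
| __local_maxnreg__ | 3786 | 3787 | <= 0 | No |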

Attribute Interaction Matrix

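| | __launch_bounds__ | __cluster_dims__ | __block_size__ | __maxnreg__ | __local_maxnreg__ |
| --- | --- | --- | --- | --- | --- |
| __launch_bounds__ | -- | OK | OK | 3719 | OK |
| __cluster_dims__ | OK | -- | 3791 | OK | OK |
| __block_size__ | OK | 3791 | -- | OK | OK |
| __maxnreg__ | 3719 | OK | OK | -- | OK |
| __local_maxnreg__ | OK | OK | OK | OK | -- |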

Additional constraints:

  • All attributes except __local_maxnreg__ require __global__ execution space (error 3534 / 3715)
  • __launch_bounds__ arg 3 must be >= cluster product when cluster dims are set (error 3707)
  • __launch_bounds__ is also rejected on local functions at application time (error 3535)

Entity Node Field Reference

| Offset | Size | Field | Role in Launch Config |
| --- | --- | --- | --- |
| +81 | 1 byte | local_flags | Bit 2 (0x04): local function. Checked by sub_411C80 for error 3535. |
| +177 | 1 byte | suppress_flags | Bit 4 (0x10): entity suppressed. Post-validation skips if set. |
| +182 | 1 byte | execution_space | Bit 6 (0x40): __global__. Checked by sub_6BC890 for 3534, 3695, 3715. |
| +183 | 1 byte | extended_cuda | Bit 6 (0x40): cluster_dims intent (set by zero-arg __cluster_dims__). |
| +256 | 8 bytes | launch_config | Pointer to launch_config_t (56 bytes). NULL if no launch config attrs. |

Error Emission Functions

| Address | Identity | Signature | Used for |
| --- | --- | --- | --- |
| sub_4F79D0 | emit_error_with_name | (severity, code, loc, name_str) | 3535, 3685, 3534, 3707, 3715, 3719, 3788 |
| sub_4F8200 | emit_error_basic | (severity, code, loc) | 3686, 3705, 3706, 3717, 3718, 3786, 3787, 3789, 3791, 3695 |

sub_4F79D0 takes an extra string argument (the attribute name), which is interpolated into the diagnostic message via %s. sub_4F8200 emits a fixed message with no string interpolation. Warning 3695 uses severity 4 through sub_4F8200; all other diagnostics use severity 7.

Function Map

| Address | Identity | Lines | Source File |
| --- | --- | --- | --- |
| sub_411C80 | apply_nv_launch_bounds_attr | 98 | attribute.c |
| sub_4115F0 | apply_nv_cluster_dims_attr | 145 | attribute.c |
| sub_4109E0 | apply_nv_block_size_attr | 265 | attribute.c |
| sub_410F70 | apply_nv_maxnreg_attr | 67 | attribute.c |
| sub_411090 | apply_nv_local_maxnreg_attr | 67 | attribute.c |
| sub_6BC890 | nv_validate_cuda_attributes | 160 | nv_transforms.c |
| sub_5E52F0 | allocate_launch_config | 42 | il.c (IL allocation) |
| sub_461640 | const_expr_get_value | 53 | const_expr.c |
| sub_461980 | const_expr_sign_compare | 97 | const_expr.c |
| sub_7BE9E0 | is_dependent_type | 15 | template.c |
| sub_4F79D0 | emit_error_with_name | -- | error.c |
| sub_4F8200 | emit_error_basic | -- | error.c |

Global Variables

| Address | Name | Purpose |
| --- | --- | --- |
| qword_126EDE8 | global_source_loc | Default source location used in post-validation error emission |
| qword_126DD38 | kernel_decl_loc | Source location for kernel declaration (used in 3695 advisory) |
| dword_126EC90 | il_pool_id | Arena allocator pool ID for launch config allocation |
| dword_126F694 | launch_config_size | Size parameter for arena allocator |
| dword_126F690 | pool_base | Base pointer of the IL arena pool |
| dword_106BA08 | abi_mode | ABI compatibility flag; when 0, allocator adds 8-byte prefix |
| dword_126E5FC | device_flag | Device compilation mode; bit 0 affects launch config flags byte |
| byte_E6D1B0 | is_signed_type_table | Lookup table indexed by type subkind; true if type is signed integer |

Cross-References

__grid_constant__

The __grid_constant__ attribute marks a __global__ function parameter as read-only across the entire kernel grid. When applied, the parameter is loaded once from host memory into GPU constant memory at grid launch, and all threads in the grid read from this cached copy instead of loading from the parameter buffer in global memory. The attribute was introduced in CUDA 11.7 and requires compute capability 7.0 or later (Volta+).

cudafe++ enforces 8 validation checks on __grid_constant__ parameters, distributed across three phases: attribute application (checking type constraints -- const qualification, no reference types, SM version), post-declaration validation (checking that the annotation appears only on __global__ function parameters), and redeclaration/template merging (checking consistency of annotations between declarations). A ninth related check (error 3669) in the __global__ apply handler issues an advisory when a kernel parameter lacks a default initializer in device compilation mode, suggesting that __grid_constant__ would be appropriate.

Key Facts

| Property | Value |
| --- | --- |
| Internal keyword | grid_constant (stored at 0x82bf0f), displayed as __grid_constant__ (at 0x82bf1d) |
| Attribute category | Optimization (parameter-level) |
| Minimum architecture | compute_70 (Volta), gated by dword_126E4A8 >= 70 |
| Entity node flag | entity+164 bit 2 (0x04) -- set on the parameter entity during attribute application |
| Type node flag | type+133 bit 5 (0x20) -- checked by sub_7A6B60 (type chain query) |
| Parameter node flag | param+32 bit 1 (0x02) -- checked during post-declaration validation in sub_6BC890 |
| Total diagnostics | 8 unique error strings + 1 related advisory (3669) + 1 memory space conflict (3577) |
| Diagnostic tag prefix | grid_constant_* (8 tags in .rodata at 0x84810f--0x857770) |
| Message string block | 0x88d8b0--0x88dbe8 (contiguous block in .rodata) |

Why __grid_constant__ Exists

A parameter annotated __grid_constant__ tells the CUDA runtime and compiler four things:

1. The parameter value is identical for every thread in the grid. This is inherently true for all kernel parameters -- they are passed by value through the kernel launch API -- but the annotation makes this guarantee explicit and mechanically exploitable.

2. The parameter lives in constant memory, not the parameter buffer. Without the annotation, kernel parameters are placed in a parameter buffer that threads read from global memory (or a dedicated parameter memory space with limited caching). With __grid_constant__, the runtime loads the parameter into the GPU's constant memory cache at launch time. This provides:

  • Broadcast reads: all 32 threads in a warp reading the same constant-memory address execute in a single memory transaction. The uniform cache serves a broadcast at full throughput.
  • Separate cache hierarchy: constant memory has a dedicated L1 cache (the "uniform cache") separate from the general L1/L2 data caches. Using it for grid-wide parameters reduces pressure on the main cache hierarchy.
  • Reduced register pressure: the compiler can re-read the parameter from constant memory at any point instead of keeping it pinned in a register. This frees registers for other values, improving occupancy.

3. The parameter must be const-qualified. Since the value is shared across the grid and cached in constant memory, writes would be nonsensical. The hardware constant memory is read-only from the kernel's perspective. cudafe++ enforces this at the type level.

4. The parameter must not be a reference type. References to host memory are meaningless on the device. Kernel parameters are already copied to the device by the CUDA runtime. A reference would dangle because it would point into host address space. Even a reference to device memory is not valid here -- __grid_constant__ parameters must be values, not indirections.

SM_70+ Requirement Rationale

The compute_70 (Volta) minimum exists because Volta significantly rearchitected the constant memory subsystem. Pre-Volta GPUs (Maxwell, Pascal) have a more restricted constant memory subsystem with a fixed 64 KB window per kernel. Volta introduced:

  • Larger effective constant memory through improved caching
  • Per-thread-block constant buffer indexing
  • Hardware support for grid-wide parameter broadcasting with the new parameter cache architecture

The compiler lowers __grid_constant__ parameters to ld.const (constant-space load) PTX instructions, which rely on the Volta constant memory architecture to function correctly. On pre-Volta hardware, the constant memory hardware cannot serve this use case.

Where Validation Happens

The __grid_constant__ validation logic is spread across multiple compilation phases because the checks require different kinds of information. The type-level checks (const, reference) can be performed as soon as the attribute is applied. The context check (must be on a __global__ parameter) requires the function's execution space to be resolved. The redeclaration checks require both the old and new declarations to be available.

Phase 1: Attribute Application

Checks 1 (const), 2 (reference), and 4 (architecture) execute during attribute application, when the __grid_constant__ attribute handler runs. This handler is registered in EDG's attribute descriptor table under the kind byte for __grid_constant__. It receives the attribute node, the entity node, and the target kind. The handler inspects the parameter's type node to verify const-qualification and absence of reference semantics, and checks dword_126E4A8 against the threshold value 70.

Phase 2: Post-Declaration Validation

Check 3 (must be on __global__ parameter) executes in nv_validate_cuda_attributes (sub_6BC890). This function runs after all attributes on a declaration have been applied and resolved. It walks the function's parameter list and checks whether any parameter carries the __grid_constant__ flag (param+32 bit 1) on a non-__global__ function.

Phase 3: Redeclaration/Template Merging

Checks 5--8 (consistency across redeclarations, template redeclarations, specializations, and explicit instantiations) execute during the declaration merging passes in class_decl.c, decls.c, and template.c. These passes compare the entity+164 bit 2 flag on corresponding parameters of the old and new declarations.

Validation Check 1: const-Qualified Type

| Property | Value |
| --- | --- |
| Tag | grid_constant_not_const (at 0x848146) |
| Message | a parameter annotated with __grid_constant__ must have const-qualified type (at 0x88d8b0) |
| Severity | error |
| Phase | Attribute application |

The parameter's type must carry the const qualifier. The check peels through the type chain, following cv-qualifier wrapper nodes (kind == 12) to reach the underlying type, then verifies the const flag is present.

The type-level check uses the same type chain navigation pattern found throughout EDG's type system:

// Conceptual logic (from the __grid_constant__ attribute handler)
type_t* ptype = param->type;
while (ptype->kind == 12)        // skip cv-qualifier wrapper nodes
    ptype = ptype->referenced;   // follow chain at type+144

if (!(ptype->cv_quals & CONST_FLAG))
    emit_error("grid_constant_not_const", param->src_loc);

If the user writes:

__global__ void kernel(__grid_constant__ int x) { ... }

cudafe++ emits grid_constant_not_const because int is not const-qualified. The correct form is:

__global__ void kernel(__grid_constant__ const int x) { ... }

Validation Check 2: No Reference Type

| Property | Value |
| --- | --- |
| Tag | grid_constant_reference_type (at 0x84815e) |
| Message | a parameter annotated with __grid_constant__ must not have reference type (at 0x88d900) |
| Severity | error |
| Phase | Attribute application |

The parameter must not be a reference (& or &&). This check fires independently of the const check -- both can fire on the same parameter.

In EDG's type system, reference types have kind == 7 (lvalue reference) or kind == 19 (rvalue reference). The check walks the type chain through cv-qualifier wrappers and tests the final type kind:

type_t* ptype = param->type;
while (ptype->kind == 12)
    ptype = ptype->referenced;

if (ptype->kind == 7 || ptype->kind == 19)   // lvalue ref or rvalue ref
    emit_error("grid_constant_reference_type", param->src_loc);

Example that triggers this error:

__global__ void kernel(__grid_constant__ const int& x) { ... }

The rationale is that kernel parameters are copied across the host-device boundary by the CUDA runtime. A reference to host memory would be invalid on the device, and a reference to device memory does not participate in the kernel launch parameter copying mechanism. The __grid_constant__ attribute specifically requests constant-memory placement of the parameter value -- a reference has no value to place.
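Checks 1 and 2 share the same chain walk and differ only in the final test, which is why both can fire for one parameter. The sketch below models that on a toy type node; type_node, the is_const field, the GC_* flags, and KIND_INT = 2 are assumptions made for illustration, while the kind values 12, 7, and 19 are the ones quoted in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a type node. */
enum { KIND_INT = 2, KIND_LREF = 7, KIND_WRAPPER = 12, KIND_RREF = 19 };

typedef struct type_node {
    int                     kind;
    bool                    is_const;     /* stands in for the const flag */
    const struct type_node *referenced;   /* next node in the chain       */
} type_node;

enum { GC_NOT_CONST = 1 << 0, GC_REFERENCE = 1 << 1 };

/* Peels wrapper nodes, then applies the const test and the reference
 * test independently so that one parameter can report both problems. */
static unsigned check_grid_constant_type(const type_node *t) {
    assert(t != NULL);
    while (t->kind == KIND_WRAPPER) {
        assert(t->referenced != NULL);
        t = t->referenced;
    }
    unsigned problems = 0;
    if (!t->is_const)
        problems |= GC_NOT_CONST;
    if (t->kind == KIND_LREF || t->kind == KIND_RREF)
        problems |= GC_REFERENCE;
    return problems;
}
```

Returning a bit set rather than stopping at the first failure mirrors the statement that the reference check fires independently of the const check.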

Validation Check 3: Only on __global__ Parameters

| Property | Value |
| --- | --- |
| Tag | grid_constant_non_kernel (at 0x84812d) |
| Message | __grid_constant__ annotation is only allowed on a parameter of a __global__ function (at 0x88db38) |
| Error code | 3702 |
| Severity | 7 (standard error) |
| Phase | Post-declaration validation (sub_6BC890) |

This check enforces that __grid_constant__ only appears on parameters of __global__ (kernel) functions. Parameters of __device__ or __host__ __device__ functions do not participate in the kernel launch mechanism and have no grid-wide constant memory optimization path.

The check executes in nv_validate_cuda_attributes (sub_6BC890, 160 lines, nv_transforms.c). The validator navigates from the function entity to its parameter list, then walks each parameter testing for the __grid_constant__ flag. The reconstructed pseudocode:

// From nv_validate_cuda_attributes (sub_6BC890)
// a1: function entity node
// a2: pointer to source location for diagnostics

void nv_validate_cuda_attributes(entity_t* a1, source_loc_t* a2) {

    if (!a1 || (a1->byte_177 & 0x10))
        return;   // null entity or suppressed

    type_t* type_chain = a1->type_chain;   // entity+144
    uint8_t exec_space = a1->byte_182;     // execution space bitfield

    // Pre-filter: skip the walk when there is no type chain, or when
    // bits 5 and 6 of the execution space byte are set and bit 4 is clear
    if (!type_chain || ((exec_space & 0x30) == 0x20 &&
                        (exec_space & 0x60) != 0x20))
        goto skip_param_walk;

    // Navigate through cv-qualifier wrappers to reach the function type
    while (type_chain->kind == 12)
        type_chain = type_chain->referenced;   // type+144

    // Get parameter list from prototype (double dereference at byte offset +152)
    param_t* param = **(param_t***)((char*)type_chain + 152);

    // Walk each parameter
    while (param) {
        if (param->byte_32 & 0x02) {
            // __grid_constant__ flag is set on a parameter of a function
            // that the pre-filter did not exempt
            emit_error(7, 3702, a2);   // grid_constant_non_kernel
        }
        param = param->next;
    }

skip_param_walk:
    // ... (continues with __launch_bounds__ validation below)
}

The param->byte_32 & 0x02 test checks bit 1 of the parameter node's byte at offset +32. This bit is the __grid_constant__ flag on the parameter entity node -- it is set by the __grid_constant__ attribute application handler when the attribute is first applied, and checked here to verify the containing function is actually a kernel.

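The error fires for any execution space that is NOT __global__. Inside the walk, 3702 is emitted for every flagged parameter without a further execution space test, so the pre-filter at the top of the function is what exempts kernels. Expanding the condition: (exec_space & 0x30) == 0x20 requires bit 5 set and bit 4 clear, and with bit 5 set, (exec_space & 0x60) != 0x20 can only hold when bit 6 (0x40, the __global__ bit) is set as well. As written, the walk is therefore skipped only when bits 5 and 6 are set and bit 4 is clear; any entity without the __global__ bit proceeds to the walk.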
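The bit arithmetic of the pre-filter is easy to misread, so it helps to check it exhaustively. The sketch below evaluates the predicate exactly as it appears in the pseudocode and compares it with a single masked comparison; skips_param_walk_masked and check_forms_agree are hypothetical helpers, and no meaning is assumed for bits 4 and 5 beyond what the expression itself tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The pre-filter exactly as written in the pseudocode above. */
static bool skips_param_walk(uint8_t es) {
    return (es & 0x30) == 0x20 && (es & 0x60) != 0x20;
}

/* Hypothetical restatement as one masked compare over bits 4-6:
 * bit 6 set, bit 5 set, bit 4 clear. */
static bool skips_param_walk_masked(uint8_t es) {
    return (es & 0x70) == 0x60;
}

/* Exhaustively confirms that both forms agree for all 256 byte values. */
static void check_forms_agree(void) {
    for (unsigned v = 0; v < 256; ++v)
        assert(skips_param_walk((uint8_t)v) ==
               skips_param_walk_masked((uint8_t)v));
}
```

Since the two forms agree on every byte value, the skip can only trigger for an entity whose 0x40 bit is set; a byte of 0x40 alone, 0x20 alone, or 0x70 does not skip.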

Validation Check 4: compute_70+ Architecture

| Property | Value |
| --- | --- |
| Tag | grid_constant_unsupported_arch (at 0x857770) |
| Message | __grid_constant__ annotation is only allowed for architecture compute_70 or later (at 0x88db90) |
| Severity | error |
| Phase | Attribute application |

The target architecture, stored in dword_126E4A8 (set by the --target CLI flag via case 245 in proc_command_line), must be >= 70. The architecture code is an integer representation: sm_70 maps to 70, sm_80 to 80, sm_90 to 90, etc.

// Architecture gate in the __grid_constant__ attribute handler
if (dword_126E4A8 < 70)
    emit_error("grid_constant_unsupported_arch", param->src_loc);

If the user compiles with -arch=compute_60 or lower and uses __grid_constant__, this error fires. The check is a straightforward integer comparison -- no bitmask, no table lookup.

The architecture value reaches cudafe++ through nvcc, which translates user-facing flags like --gpu-architecture=sm_70 into the internal numeric code and passes it via the --target flag. Inside cudafe++, sub_7525E0 (a 6-byte stub returning -1) nominally parses this value, but the actual number is injected by nvcc into the argument string. See Architecture Feature Gating for the full data flow.

Validation Checks 5--8: Redeclaration Consistency

The four redeclaration consistency checks share the same algorithmic structure but apply to different declaration contexts. They all enforce the invariant that __grid_constant__ annotations must match between declarations: if the first declaration annotates a parameter with __grid_constant__, every subsequent declaration (redeclaration, template redeclaration, specialization, explicit instantiation) must also annotate the corresponding parameter, and vice versa.

Why These Checks Exist

The __grid_constant__ attribute affects the kernel's ABI -- specifically, how the CUDA runtime passes the parameter at launch time. If one translation unit sees a declaration with __grid_constant__ and another sees a declaration without it, they would generate incompatible kernel launch code. In RDC (relocatable device code) mode, where kernels can be declared in one TU and defined in another, this mismatch would cause silent data corruption at runtime. The compiler catches it at declaration merging time to prevent this.

Check 5: Function Redeclaration

| Property | Value |
| --- | --- |
| Tag | grid_constant_incompat_redecl (at 0x84810f) |
| Message | incompatible __grid_constant__ annotation for parameter %s in function redeclaration (see previous declaration %p) (at 0x88d950) |
| Phase | Redeclaration merging (class_decl.c) |

When a __global__ function is redeclared, cudafe++ compares the entity+164 bit 2 (0x04) flag on each parameter between the existing and new declarations. If the flags differ for any parameter at the same position, the error fires.

// Redeclaration consistency check (conceptual, in class_decl.c)
param_t* old_param = get_params(old_decl);
param_t* new_param = get_params(new_decl);

while (old_param && new_param) {
    bool old_gc = (old_param->entity->byte_164 & 0x04) != 0;
    bool new_gc = (new_param->entity->byte_164 & 0x04) != 0;

    if (old_gc != new_gc)
        emit_error("grid_constant_incompat_redecl",
                   new_param->name, old_decl->src_loc);

    old_param = old_param->next;
    new_param = new_param->next;
}

Example:

__global__ void kernel(__grid_constant__ const int x);
__global__ void kernel(const int x);  // ERROR: grid_constant_incompat_redecl

The %s in the message is expanded to the parameter name, and %p is expanded to a source location reference pointing at the previous declaration.
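The pairwise comparison at the heart of checks 5--8 can be modeled without any of the surrounding declaration machinery. The following is a simplified sketch under stated assumptions: each declaration is reduced to an array holding one copy of the entity+164 flag byte per parameter, and count_mismatches and GRID_CONSTANT_BIT are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRID_CONSTANT_BIT 0x04   /* bit 2 of the byte at entity+164 */

/* Counts the parameter positions at which two declarations disagree
 * about __grid_constant__. Other bits in the flag byte are ignored. */
static size_t count_mismatches(const uint8_t *old_flags,
                               const uint8_t *new_flags, size_t n) {
    assert(n == 0 || (old_flags != NULL && new_flags != NULL));
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        int old_gc = (old_flags[i] & GRID_CONSTANT_BIT) != 0;
        int new_gc = (new_flags[i] & GRID_CONSTANT_BIT) != 0;
        if (old_gc != new_gc)
            ++mismatches;        /* one diagnostic per differing parameter */
    }
    return mismatches;
}
```

Normalizing each masked value to 0 or 1 before comparing keeps unrelated bits in the same byte from producing false mismatches.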

Check 6: Function Template Redeclaration

| Property | Value |
| --- | --- |
| Tag | grid_constant_incompat_templ_redecl (at 0x857748) |
| Message | incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p) (at 0x88d9c8) |
| Phase | Template redeclaration merging (class_decl.c) |

Same logic as check 5, but for function template redeclarations. Template redeclaration merging occurs in a separate code path from regular function redeclaration because template entities have additional metadata (template parameter lists, partial specialization chains) that must be reconciled.

template<typename T>
__global__ void kernel(__grid_constant__ const T x);

template<typename T>
__global__ void kernel(const T x);  // ERROR: grid_constant_incompat_templ_redecl

Check 7: Template Specialization

| Property | Value |
| --- | --- |
| Tag | grid_constant_incompat_specialization (at 0x857720) |
| Message | incompatible __grid_constant__ annotation for parameter %s in function specialization (see previous declaration %p) (at 0x88da48) |
| Phase | Template specialization processing |

When a function template specialization's __grid_constant__ annotations disagree with the primary template, this error fires. The specialization must preserve the __grid_constant__ annotation from the primary template because the compiler may have already committed to constant-memory parameter placement based on the primary template's declaration.

template<typename T>
__global__ void kernel(__grid_constant__ const T x);

template<>
__global__ void kernel<int>(const int x);  // ERROR: grid_constant_incompat_specialization

A specialization that omits the annotation would require a different ABI for that particular instantiation, which the kernel launch infrastructure cannot accommodate on a per-specialization basis.

Check 8: Explicit Instantiation Directive

| Property | Value |
| --- | --- |
| Tag | grid_constant_incompat_instantiation_directive (at 0x8576f0) |
| Message | incompatible __grid_constant__ annotation for parameter %s in instantiation directive (see previous declaration %p) (at 0x88dac0) |
| Phase | Explicit instantiation processing |

This mirrors the specialization check but applies to explicit instantiation definitions and declarations (template void ... and extern template void ..., respectively).

template<typename T>
__global__ void kernel(__grid_constant__ const T x) { ... }

template __global__ void kernel<int>(const int x);
// ERROR: grid_constant_incompat_instantiation_directive

The instantiation directive must match the primary template's __grid_constant__ annotation for each parameter.

Memory Space Conflict Check (Error 3577)

While not one of the 8 __grid_constant__ validation checks, error 3577 provides a guard in the reverse direction. When apply_nv_managed_attr (sub_40E0D0) or apply_nv_device_attr (sub_40EB80) applies a memory space attribute to a variable, the handler checks whether the entity has the __grid_constant__ flag set at entity+164 bit 2. If the flag is set and the 16-bit memory space word at entity+148 matches the 0x0102 mask, error 3577 is emitted with the name of the conflicting memory space.

The check is identical in both handlers. Here is the reconstructed pseudocode from apply_nv_managed_attr (sub_40E0D0):

// From apply_nv_managed_attr (sub_40E0D0, attribute.c:10523)
// a1: attribute node, a2: entity node, a3: target kind (must be 7 = variable)

entity_t* apply_nv_managed_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {

    // Gate: variables only
    if (a3 != 7)
        internal_error("apply_nv_managed_attr", "attribute.c", 10523);

    // Apply memory space flags
    uint8_t old_memspace = a2->byte_148;
    a2->byte_149 |= 0x01;        // set __managed__ flag
    a2->byte_148 = old_memspace | 0x01;  // set __device__ flag (managed implies device)

    // Check for conflicting memory space combinations
    if (((old_memspace & 0x02) != 0) + ((old_memspace & 0x04) != 0) == 2)
        emit_error(3481, a1->src_loc);   // both __shared__ and __constant__ set

    if ((signed char)a2->byte_161 < 0)
        emit_error(3482, a1->src_loc);   // thread_local conflict

    if (a2->byte_81 & 0x04)
        emit_error(3485, a1->src_loc);   // local variable conflict

    // Grid constant conflict check
    if ((a2->byte_164 & 0x04) != 0      // has __grid_constant__ flag
        && (*(uint16_t*)((char*)a2 + 148) & 0x0102) != 0)  // __shared__ OR __managed__
    {
        // Determine which memory space to report in the diagnostic
        uint8_t mem = a2->byte_148;
        const char* space;
        if      (mem & 0x04)         space = "__constant__";
        else if (a2->byte_149 & 0x01) space = "__managed__";
        else if (mem & 0x02)         space = "__shared__";
        else if (mem & 0x01)         space = "__device__";
        else                         space = "";

        emit_error_with_string(3577, a1->src_loc, space);
    }

    return a2;
}

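The 0x0102 mask on the little-endian 16-bit word at a2 + 148 checks two bits: bit 1 of byte +148 (__shared__, value 0x02) and bit 0 of byte +149 (__managed__, value 0x01 shifted left by 8 bits = 0x0100). This means the conflict check fires specifically when a __grid_constant__ parameter also has __shared__ or __managed__ -- these memory spaces are incompatible with constant memory placement. In apply_nv_managed_attr the __managed__ bit is set before the test runs, so the mask always matches there and the __grid_constant__ flag alone decides whether 3577 is emitted.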

The priority order for the diagnostic message (__constant__ > __managed__ > __shared__ > __device__) determines which memory space name appears in the error output when multiple conflicting spaces are present simultaneously.
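The mask test and the name cascade can be reproduced on two plain bytes. This is a minimal sketch under stated assumptions: memspace_word, mask_hits, and space_name are hypothetical helpers, and the 16-bit word is composed with an explicit shift so that the result matches the x86-64 little-endian load regardless of the host the sketch runs on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* The mask is the union of the __managed__ bit (byte +149, shifted up
 * by 8) and the __shared__ bit (byte +148). */
static_assert((0x0100 | 0x0002) == 0x0102, "mask covers exactly two bits");

/* Combines bytes +148 and +149 the way a little-endian 16-bit load does. */
static uint16_t memspace_word(uint8_t byte_148, uint8_t byte_149) {
    return (uint16_t)(byte_148 | ((uint16_t)byte_149 << 8));
}

/* True when the 0x0102 mask matches: __shared__ or __managed__ is set. */
static bool mask_hits(uint8_t byte_148, uint8_t byte_149) {
    return (memspace_word(byte_148, byte_149) & 0x0102) != 0;
}

/* Priority cascade that picks the name reported with error 3577. */
static const char *space_name(uint8_t byte_148, uint8_t byte_149) {
    if (byte_148 & 0x04) return "__constant__";
    if (byte_149 & 0x01) return "__managed__";
    if (byte_148 & 0x02) return "__shared__";
    if (byte_148 & 0x01) return "__device__";
    return "";
}
```

Note that __constant__ and __device__ appear in the cascade but not in the mask: they can be named in the message, but on their own they do not trigger the check.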

The apply_nv_device_attr handler (sub_40EB80) performs the identical check in its variable-handling branch (when a3 == 7):

// From apply_nv_device_attr (sub_40EB80), variable branch
if (a3 == 7) {
    a2->byte_148 |= 0x01;         // set __device__ flag

    // ... shared/constant conflict, thread_local, local variable checks ...

    // Identical grid_constant conflict check
    if ((a2->byte_164 & 0x04) != 0 && (*(uint16_t*)((char*)a2 + 148) & 0x0102) != 0) {
        // Same priority cascade for space name
        // ...
        emit_error_with_string(3577, a1->src_loc, space);
    }
    return a2;
}

Entity Node Fields

Three distinct locations in entity/type/parameter nodes carry __grid_constant__ state:

entity+164 bit 2 (0x04): Grid Constant Declaration Flag

Set during attribute application when a parameter is declared __grid_constant__. This is the "declaration-side" flag that records the programmer's intent. Used by:

  • Memory space conflict check (error 3577) in apply_nv_managed_attr and apply_nv_device_attr
  • Redeclaration consistency checks (checks 5--8)

type+133 bit 5 (0x20): Type-Level Flag

A flag on the type node (not the entity node) checked by sub_7A6B60. This function follows the type chain through cv-qualifier wrappers (kind == 12) and tests byte+133 & 0x20:

// sub_7A6B60 (types.c)
// In the broader EDG type system, this function checks bit 5 of the
// type's flag byte. For CUDA parameter types, this bit indicates
// __grid_constant__ annotation. The same bit is also used as the
// dependent-type flag in template contexts (hence 299 callers in the binary).
bool type_has_flag_0x20(type_t* type) {
    while (type->kind == 12)       // skip cv-qualifier wrappers
        type = type->referenced;   // follow type chain at +144
    return (type->byte_133 & 0x20) != 0;
}

Used by the __global__ apply handler's parameter iteration to detect parameters that are already annotated with __grid_constant__, suppressing the error 3669 advisory for those parameters.

param+32 bit 1 (0x02): Parameter Node Flag

A flag on the parameter node itself, checked during post-declaration validation (sub_6BC890). The validator walks the parameter list and tests each parameter's byte at offset +32 for bit 1. If set on a parameter of a non-__global__ function, error 3702 (grid_constant_non_kernel) is emitted.

The three flags serve different purposes: the entity flag records the declaration intent and is used for cross-declaration consistency checks, the type flag enables efficient type-level queries during attribute application, and the parameter flag enables the post-validation pass to scan parameter lists without resolving entity or type chains.

Parameter Iteration in the __global__ Apply Handler

The apply_nv_global_attr handlers (sub_40E1F0 and sub_40E7F0) contain a parameter iteration loop that interacts with __grid_constant__. This loop checks each kernel parameter for types that should be __grid_constant__ but are not annotated as such. When found in device compilation mode (dword_126C5C4 == -1), error 3669 is emitted as an advisory.

// From apply_nv_global_attr (sub_40E1F0), Phase 5: parameter iteration
// This section runs only when attr_node+11 bit 0 is set (applies to parameters)

if (a1->byte_11 & 0x01) {

    // Navigate to function prototype through cv-qualifier chain
    type_t* proto_type = entity->type_chain;     // entity+144
    while (proto_type->kind == 12)
        proto_type = proto_type->referenced;

    // Get parameter list head (double dereference from prototype+152)
    // (byte offset, so the pointer is cast to char* before adding)
    param_t* param = **(param_t***)((char*)proto_type + 152);
    source_loc_t saved_loc = a1->src_loc;        // attr_node+56

    for (; param != NULL; param = param->next) {
        // Peel cv-qualifier wrappers from parameter type
        type_t* ptype = param->type;             // param[1] (offset 8)
        while (ptype->kind == 12)
            ptype = ptype->referenced;

        // sub_7A6B60: returns true if type+133 bit 5 is set
        // (parameter is already __grid_constant__)
        if (!sub_7A6B60(ptype) && dword_126C5C4 == -1) {

            // Scope table lookup (784-byte entries); scope_entry_t is a
            // placeholder name for the entry layout
            scope_entry_t* scope =
                (scope_entry_t*)(qword_126C5E8 + 784 * (int64_t)dword_126C5E4);

            // Skip if scope has qualifier flags or is a cv-qualified scope
            if ((scope->byte_6 & 0x06) == 0 && scope->byte_4 != 12) {

                // Re-navigate to unqualified type
                type_t* ptype2 = param->type;
                while (ptype2->kind == 12)
                    ptype2 = ptype2->referenced;

                // If no default initializer, suggest __grid_constant__
                if (ptype2->qword_120 == 0)
                    emit_error(3669, &saved_loc);
            }
        }
    }
}

The logic: for each parameter in a __global__ function, if the parameter type does NOT already have the __grid_constant__ flag AND we are in device compilation mode AND the current scope is not a cv-qualified context AND the parameter type lacks a default initializer (the type+120 pointer is null), then emit error 3669 as an advisory. The advisory nudges kernel authors to add __grid_constant__ annotations for better performance.
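The four conditions can be collapsed into one predicate. The following is a minimal sketch under stated assumptions: the function name and the flattened parameter list are hypothetical, and each argument stands for a value the loop above reads from the type node, the global, or the scope entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened form of the per-parameter decision.
 *   type_flag_0x20 : result of sub_7A6B60 on the parameter type
 *   mode           : value of dword_126C5C4 (-1 is the tested sentinel)
 *   scope_byte_6   : scope entry byte +6
 *   scope_byte_4   : scope entry byte +4
 *   type_qword_120 : pointer stored at type+120                        */
bool should_emit_3669(bool type_flag_0x20, int32_t mode,
                      uint8_t scope_byte_6, uint8_t scope_byte_4,
                      const void *type_qword_120)
{
    return !type_flag_0x20
        && mode == -1
        && (scope_byte_6 & 0x06) == 0
        && scope_byte_4 != 12
        && type_qword_120 == NULL;
}
```

Flipping any single input away from the values in the first case below is enough to suppress the advisory, which matches the nested `if` structure of the pseudocode.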

The scope table lookup (qword_126C5E8 indexed by dword_126C5E4, 784-byte entries) determines whether the current compilation context is device-side. The dword_126C5C4 == -1 sentinel explicitly indicates device compilation mode. Together these two conditions ensure the advisory only fires when processing the device-side compilation of a kernel, not during host-side stub generation.

Keyword Registration

The __grid_constant__ keyword is registered during fe_translation_unit_init (sub_5863A0), alongside other CUDA extension keywords (__device__, __global__, __shared__, __constant__, __managed__, __launch_bounds__). The registration inserts both grid_constant (bare form, for attribute name lookup) and __grid_constant__ (double-underscore form, for lexer recognition) into EDG's keyword-to-token-ID mapping.

The attribute name lookup function (sub_40A250) strips leading and trailing double underscores before searching the attribute name hash table (qword_E7FB60), so __grid_constant__ resolves to the same descriptor entry as the bare grid_constant form.
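The effect of the stripping step can be illustrated with a small comparison helper. This is a sketch, not the logic of sub_40A250: the helper names are hypothetical, and it assumes the underscores are removed only when both the prefix and the suffix are present, which the analysis above does not confirm.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Narrows [*s, *s + *len) past one leading and one trailing "__",
 * but only when both are present (assumption, see lead-in). */
static void strip_dunder(const char **s, size_t *len)
{
    if (*len > 4 && (*s)[0] == '_' && (*s)[1] == '_'
        && (*s)[*len - 2] == '_' && (*s)[*len - 1] == '_') {
        *s   += 2;
        *len -= 4;
    }
}

/* True when two spellings would map to the same hash-table key. */
bool same_attribute_name(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);

    strip_dunder(&a, &la);
    strip_dunder(&b, &lb);
    return la == lb && memcmp(a, b, la) == 0;
}
```

With this normalization, `__grid_constant__` and `grid_constant` compare equal, so a single descriptor entry serves both spellings.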

Diagnostic Tag Summary

| Tag | Error Code | Message | Phase |
| --- | --- | --- | --- |
| grid_constant_not_const | -- | a parameter annotated with __grid_constant__ must have const-qualified type | Application |
| grid_constant_reference_type | -- | a parameter annotated with __grid_constant__ must not have reference type | Application |
| grid_constant_non_kernel | 3702 | __grid_constant__ annotation is only allowed on a parameter of a __global__ function | Post-validation |
| grid_constant_unsupported_arch | -- | __grid_constant__ annotation is only allowed for architecture compute_70 or later | Application |
| grid_constant_incompat_redecl | -- | incompatible __grid_constant__ annotation for parameter %s in function redeclaration (see previous declaration %p) | Redeclaration |
| grid_constant_incompat_templ_redecl | -- | incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p) | Template redecl |
| grid_constant_incompat_specialization | -- | incompatible __grid_constant__ annotation for parameter %s in function specialization (see previous declaration %p) | Specialization |
| grid_constant_incompat_instantiation_directive | -- | incompatible __grid_constant__ annotation for parameter %s in instantiation directive (see previous declaration %p) | Instantiation |

Error codes for checks 1, 2, 4--8 are not individually mapped in the decompiled code available for this analysis. Error 3702 (check 3) is confirmed from the post-validation function sub_6BC890. Error 3577 (memory space conflict) is confirmed from sub_40E0D0 and sub_40EB80.

Function Map

| Address | Identity | Lines | Source File | Role |
| --- | --- | --- | --- | --- |
| sub_7A6B60 | type flag query (byte_133 & 0x20) | 9 | types.c | Follows type chain, returns grid_constant / dependent flag |
| sub_40E0D0 | apply_nv_managed_attr | 47 | attribute.c:10523 | Memory space conflict check (3577) for __managed__ |
| sub_40EB80 | apply_nv_device_attr | 100 | attribute.c | Memory space conflict check (3577) for __device__ |
| sub_6BC890 | nv_validate_cuda_attributes | 161 | nv_transforms.c | Post-validation: param walk for 3702 (grid_constant_non_kernel) |
| sub_40E1F0 | apply_nv_global_attr (variant 1) | 89 | attribute.c | Parameter iteration with grid_constant flag check (3669 advisory) |
| sub_40E7F0 | apply_nv_global_attr (variant 2) | 86 | attribute.c | Same parameter iteration (alternate call path, do-while loop) |
| sub_5863A0 | fe_translation_unit_init | -- | fe_init.c | Registers __grid_constant__ keyword |
| sub_40A250 | attribute name lookup | -- | attribute.c | Strips __ prefix/suffix, searches hash table |

Global Variables

| Global | Address | Purpose |
| --- | --- | --- |
| dword_126E4A8 | 0x126E4A8 | Target SM architecture code (from --target). Must be >= 70 for __grid_constant__. |
| dword_126C5C4 | 0x126C5C4 | Scope index sentinel. -1 = device compilation mode. Guards 3669 advisory check. |
| dword_126C5E4 | 0x126C5E4 | Current scope table index. Used in 3669 scope lookup. |
| qword_126C5E8 | 0x126C5E8 | Scope table base pointer (784-byte entries). Used in 3669 scope lookup. |

__managed__ Variables

The __managed__ attribute declares a variable in CUDA Unified Memory -- a memory region accessible from both host (CPU) and device (GPU) code, with the CUDA runtime handling page migration transparently. Unlike __device__ variables, which host code can reach only through explicit cudaMemcpy-style copies, managed variables can be read and written by both the host and device using the same pointer. The hardware and driver cooperate to migrate pages between CPU and GPU memory, so neither the programmer nor the compiler needs to issue explicit copies.

The constraint set on __managed__ reflects two fundamental realities. First, unified memory is a runtime feature: the compiler cannot resolve managed addresses at compile time, so every host-side access must be gated behind a lazy initialization call that registers the variable with the CUDA runtime's unified memory subsystem. Second, unified memory requires hardware support: it is available only on devices of compute capability 3.0 (Kepler) and later, and it builds on Unified Virtual Addressing (UVA), which itself dates from compute capability 2.0. These two realities drive the entire implementation -- the attribute handler sets both a managed flag and a device flag (because managed memory is device-global memory with extra runtime semantics), the validation chain rejects memory spaces and qualifiers that conflict with runtime writability, and the code generator wraps every host-side access in a comma-operator expression that forces lazy initialization.

Key Facts

| Property | Value |
| --- | --- |
| Attribute kind byte | 0x66 = 'f' (102) |
| Handler function | sub_40E0D0 (apply_nv_managed_attr, 47 lines, attribute.c:10523) |
| Entity node flags set | entity+149 bit 0 (__managed__) AND entity+148 bit 0 (__device__) |
| Detection bitmask | (*(_WORD*)(entity + 148) & 0x101) == 0x101 |
| Minimum architecture | compute_30 (Kepler) -- dword_126E4A8 >= 30 |
| Applies to | Variables only (entity kind 7) |
| Diagnostic codes | 3481, 3482, 3485, 3577 (attribute application); arch/config errors (declaration processing) |
| Managed RT boilerplate emitter | sub_489000 (process_file_scope_entities, line 218) |
| Access wrapper emitters | sub_4768F0 (gen_name_ref), sub_484940 (gen_variable_name) |
| Managed access prefix string | 0x839570 (65 bytes) |
| Managed RT static block string | 0x83AAC8 (243 bytes) |
| Managed RT init function string | 0x83ABC0 (210 bytes) |

Semantic Meaning

A __managed__ variable occupies a single virtual address that is valid on both host and device. The CUDA runtime allocates the variable through cudaMallocManaged during module initialization and registers it so the driver can track page ownership. On GPUs with hardware page faulting (compute capability 6.0 and later), a kernel access to a non-resident page faults and the page is migrated from CPU memory on demand; on older GPUs the driver migrates managed data to the device at kernel launch. When host code accesses the variable after the kernel has completed, the page is migrated back to CPU-accessible memory.

This is fundamentally different from the other three memory spaces:

| Space | Accessibility | Migration | Lifetime |
| --- | --- | --- | --- |
| __device__ | Device only (host needs cudaMemcpy) | Manual | Program lifetime |
| __shared__ | Device only, per-thread-block | None (on-chip SRAM) | Block lifetime |
| __constant__ | Device read-only (host writes via cudaMemcpyToSymbol) | Manual | Program lifetime |
| __managed__ | Host and device, same pointer | Automatic (page faults) | Program lifetime |

Because managed memory is fundamentally device global memory with runtime-managed migration, the __managed__ handler always sets the __device__ bit alongside the __managed__ bit. This is not redundant -- it ensures that all code paths that check for "device-accessible variable" (error 3483 scope checks, external linkage warning 3648, cross-space reference validation) treat managed variables correctly. A managed variable IS a device variable; it just happens to also be host-accessible through the runtime's page migration.

Why the Constraints Exist

Each validation check enforced by the handler exists for a specific hardware or semantic reason:

  • Variables only (kind 7): Unified memory is a storage concept. Functions do not reside in managed memory -- they have execution spaces, not memory spaces.

  • Cannot be __shared__ or __constant__: These are mutually exclusive memory spaces that occupy different physical hardware. __shared__ is per-block on-chip SRAM with no concept of host accessibility. __constant__ is a read-only cached region with no write path from device code. Managed memory is global DRAM with page migration. They cannot coexist.

  • Cannot be thread_local: Thread-local storage uses thread-specific addressing (TLS segments), which is a host-side concept incompatible with CUDA's execution model. A managed variable must have a single global address visible to all threads on both host and device.

  • Cannot be a local variable or reference type: Managed variables require runtime registration with the CUDA driver during module loading. Local variables are stack-allocated with lifetimes that cannot be tracked by the runtime. References cannot cross address spaces -- a reference to a managed variable on the host would hold a CPU virtual address that is meaningless on the device.

  • Requires compute_30+: Unified Memory is supported only on devices of compute capability 3.0 (Kepler) and later. It builds on Unified Virtual Addressing (UVA), which has been available since compute capability 2.0 but does not by itself provide migration of data between host and device memory.

  • Incompatible with __grid_constant__: Grid-constant parameters are loaded into constant memory at kernel launch. A managed variable's value is determined by its current page state, which can change between kernel launches. The two semantics are contradictory.

Attribute Application: apply_nv_managed_attr

sub_40E0D0 -- Full Pseudocode

The __managed__ attribute handler is the simplest of the four memory space handlers and demonstrates the complete validation template. Called from apply_one_attribute (sub_413240) when the attribute kind byte is 'f' (102).

// sub_40E0D0 -- apply_nv_managed_attr (attribute.c:10523)
// a1: attribute node pointer (attribute_node_t*)
// a2: entity node pointer (entity_t*)
// a3: entity kind (uint8_t)
// returns: entity node pointer (passthrough)

entity_t* apply_nv_managed_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {

    // ===== Gate: variables only =====
    // Entity kind 7 = variable. Any other kind (function=11, type=6, etc.)
    // is an internal error -- the dispatch table should never route
    // __managed__ to a non-variable entity.
    if (a3 != 7)
        internal_error("attribute.c", 10523, "apply_nv_managed_attr", 0, 0);

    // ===== Step 1: Set managed + device flags =====
    // Save current memory space byte for later checks.
    // Managed memory IS device global memory, so both flags must be set.
    uint8_t old_space = a2->byte_148;
    a2->byte_149 |= 0x01;         // set __managed__ flag
    a2->byte_148 = old_space | 1;  // set __device__ flag

    // ===== Step 2: Mutual exclusion -- shared + constant =====
    // The expression ((x & 2) != 0) + ((x & 4) != 0) == 2 is true
    // only when BOTH __shared__ (bit 1) and __constant__ (bit 2) are set.
    // This catches an impossible three-way conflict, NOT managed+shared
    // or managed+constant individually. The individual conflicts
    // (__managed__ + __shared__, __managed__ + __constant__) are caught
    // by the __grid_constant__ check or by subsequent declaration processing.
    if (((old_space & 2) != 0) + ((old_space & 4) != 0) == 2)
        emit_error(3481, a1->source_loc);  // "conflicting CUDA memory spaces"

    // ===== Step 3: Thread-local check =====
    // Byte +161 bit 7 (sign bit when read as signed char) indicates
    // thread_local storage duration. Managed variables must have
    // static storage duration with a single global address.
    if ((signed char)a2->byte_161 < 0)
        emit_error(3482, a1->source_loc);  // "CUDA memory space on thread_local"

    // ===== Step 4: Local variable / reference type check =====
    // Byte +81 bit 2 indicates the entity is declared in a local scope
    // (block scope, function parameter, or reference type).
    // Managed variables require file-scope lifetime for runtime registration.
    if (a2->byte_81 & 0x04)
        emit_error(3485, a1->source_loc);  // "CUDA memory space on local/ref"

    // ===== Step 5: __grid_constant__ conflict =====
    // Byte +164 bit 2 is the __grid_constant__ flag on the parameter entity.
    // If set, check whether this entity also has a conflicting memory space.
    // The 16-bit word read at +148 with mask 0x0102 catches:
    //   byte +148 bit 1 (0x02) = __shared__
    //   byte +149 bit 0 (0x01, as 0x100 in word) = __managed__
    // (Little-endian: word = byte_149 << 8 | byte_148)
    // (a2 is cast to char* so that +148 is a byte offset)
    if ((a2->byte_164 & 0x04) && (*(uint16_t*)((char*)a2 + 148) & 0x0102)) {

        // Build error message: select most restrictive space name
        uint8_t space = a2->byte_148;
        const char* name = "__constant__";
        if (!(space & 0x04)) {
            name = "__managed__";
            if (!(a2->byte_149 & 0x01)) {
                name = "__shared__";
                if (!(space & 0x02)) {
                    name = "__device__";
                    if (!(space & 0x01))
                        name = "";
                }
            }
        }
        emit_error_with_name(3577, a1->source_loc, name);
        // "memory space %s incompatible with __grid_constant__"
    }

    return a2;
}

Entity Node Fields Modified

| Offset | Field | Bits Set | Meaning |
| --- | --- | --- | --- |
| +148 | memory_space | bit 0 (0x01) | __device__ -- variable lives in device global memory |
| +149 | extended_space | bit 0 (0x01) | __managed__ -- variable is in unified memory |

Entity Node Fields Read (Validation)

| Offset | Field | Mask | Meaning |
| --- | --- | --- | --- |
| +148 | memory_space | 0x02 | __shared__ flag (mutual exclusion check) |
| +148 | memory_space | 0x04 | __constant__ flag (mutual exclusion check) |
| +161 | storage_flags | bit 7 (sign) | thread_local storage duration |
| +81 | scope_flags | 0x04 | Local scope / reference type indicator |
| +164 | cuda_flags | 0x04 | __grid_constant__ parameter flag |
| +148:149 | space_word | 0x0102 | Combined __shared__ OR __managed__ (grid_constant conflict) |

Comparison with apply_nv_device_attr (sub_40EB80)

The __device__ handler's variable path (entity kind 7) is structurally identical to apply_nv_managed_attr, minus the byte_149 |= 1 step. Both handlers:

  1. Set byte_148 |= 0x01 (device memory space)
  2. Check error 3481 (shared + constant mutual exclusion)
  3. Check error 3482 (thread_local)
  4. Check error 3485 (local variable)
  5. Check error 3577 (grid_constant conflict)

The only difference: __managed__ additionally sets byte_149 |= 0x01. The __device__ handler also has a function path (kind 11) for setting execution space bits -- __managed__ has no function path because managed memory is a storage concept, not an execution concept.
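The shared validation template can be exercised with a small executable model. This is a sketch under stated assumptions: the struct, the enum constants and the `managed` switch are hypothetical, the struct keeps only the bytes the handlers touch, and diagnostics are returned as a bit set instead of being emitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cut-down entity: only the bytes read or written above. */
struct entity_bytes {
    uint8_t b81, b148, b149, b161, b164;
};

/* Bit set of diagnostics raised by one handler invocation. */
enum { E3481 = 1, E3482 = 2, E3485 = 4, E3577 = 8 };

/* managed == true models apply_nv_managed_attr, false models the
 * variable path of apply_nv_device_attr. */
unsigned apply_space_attr(struct entity_bytes *e, bool managed)
{
    unsigned diags = 0;
    uint8_t old_space = e->b148;

    if (managed)
        e->b149 |= 0x01;                 /* __managed__ */
    e->b148 = old_space | 0x01;          /* __device__  */

    if ((old_space & 0x02) && (old_space & 0x04))
        diags |= E3481;                  /* shared and constant both set */
    if (e->b161 & 0x80)
        diags |= E3482;                  /* thread_local */
    if (e->b81 & 0x04)
        diags |= E3485;                  /* local scope / reference */
    if ((e->b164 & 0x04) && ((e->b148 & 0x02) || (e->b149 & 0x01)))
        diags |= E3577;                  /* same bits as the 0x0102 mask */
    return diags;
}
```

One consequence visible in the model: because the managed bit is set before the conflict check runs, the managed path reports 3577 for any entity that carries the __grid_constant__ flag, while the device path reports it only if __shared__ or __managed__ was already present.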

Architecture Gating

The compute_30 requirement for __managed__ is enforced during declaration processing, not in the attribute handler itself. The attribute handler (sub_40E0D0) sets the bitfield flags unconditionally; the architecture check happens later when the declaration is fully processed.

Two diagnostic tags cover managed architecture gating:

| Tag | Message | Condition |
| --- | --- | --- |
| unsupported_arch_for_managed_capability | __managed__ variables require architecture compute_30 or higher | dword_126E4A8 < 30 |
| unsupported_configuration_for_managed_capability | __managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system) | Configuration-specific flag check |

The architecture check uses the global dword_126E4A8 which stores the SM version number from the --gpu-architecture flag. The value 30 corresponds to sm_30 (Kepler), the first architecture on which Unified Memory is supported. The configuration check covers edge cases like 32-bit compilation mode or unsupported operating systems where the CUDA runtime's managed memory subsystem is unavailable.

Managed Runtime Boilerplate

Every .int.c file emitted by cudafe++ contains a block of managed runtime initialization code, emitted unconditionally by sub_489000 (process_file_scope_entities) at line 218. This block is emitted regardless of whether the translation unit contains any __managed__ variables -- the static guard flag ensures zero overhead when no managed variables exist.

Static Declarations

Four declarations are emitted as a single string literal from 0x83AAC8 (243 bytes):

// Emitted verbatim by sub_489000, line 218
static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) {
    __nv_fatbinhandle_for_managed_rt = in;
}
static char __nv_init_managed_rt_with_module(void **);

Each symbol serves a specific role in the initialization chain:

| Symbol | Type | Role |
| --- | --- | --- |
| __nv_inited_managed_rt | static char | Guard flag: 0 = uninitialized, nonzero = initialized |
| __nv_fatbinhandle_for_managed_rt | static void** | Cached fatbinary handle, populated during __cudaRegisterFatBinary |
| __nv_save_fatbinhandle_for_managed_rt | static void(void**) | Callback that stores the fatbin handle -- called at program startup |
| __nv_init_managed_rt_with_module | static char(void**) | Forward declaration -- defined later by crt/host_runtime.h |

The forward declaration of __nv_init_managed_rt_with_module is critical: this function is provided by the CUDA runtime headers and performs the actual cudaRegisterManagedVariable calls. By forward-declaring it here, the managed runtime boilerplate can reference it before the runtime header is #included later in the .int.c file.

Lazy Initialization Function

Emitted immediately after the static block (string at 0x83ABC0, 210 bytes):

// sub_489000, lines 221-224
// Conditional prefix:
if (dword_106BF6C)  // alternative host compiler mode
    emit("__attribute__((unused)) ");

// Function body:
static inline void __nv_init_managed_rt(void) {
    __nv_inited_managed_rt = (
        __nv_inited_managed_rt
            ? __nv_inited_managed_rt
            : __nv_init_managed_rt_with_module(
                  __nv_fatbinhandle_for_managed_rt)
    );
}

The ternary is a lazy-init idiom. On first call, __nv_inited_managed_rt is 0 (falsy), so the false branch executes __nv_init_managed_rt_with_module, which registers all managed variables in the translation unit and returns nonzero. The result is stored back into __nv_inited_managed_rt, so subsequent calls short-circuit through the true branch and return the existing nonzero value without re-initializing.

The __attribute__((unused)) prefix is conditionally added when dword_106BF6C (alternative host compiler mode) is set. This suppresses -Wunused-function warnings on host compilers that may not see any call sites for this function if no managed variables exist in the translation unit.
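The lazy-init idiom is easy to check in isolation. The sketch below mirrors the shape of the emitted declarations but is not the emitted code: the names drop the reserved `__nv_` prefix, the `_with_module` function is a stand-in that only counts its calls, and `simulate_accesses` is a hypothetical driver added for the demonstration.

```c
static char   inited_managed_rt = 0;
static void **fatbinhandle_for_managed_rt;
static int    with_module_calls = 0;   /* instrumentation, not in the emitted code */

static void save_fatbinhandle_for_managed_rt(void **in)
{
    fatbinhandle_for_managed_rt = in;
}

/* Stand-in for the runtime-provided function: counts calls, returns nonzero. */
static char init_managed_rt_with_module(void **handle)
{
    (void)handle;
    ++with_module_calls;
    return 1;
}

static void init_managed_rt(void)
{
    inited_managed_rt = inited_managed_rt
        ? inited_managed_rt
        : init_managed_rt_with_module(fatbinhandle_for_managed_rt);
}

/* Startup step followed by n host accesses; returns how many times the
 * stand-in registration function has run so far. */
int simulate_accesses(void **handle, int n)
{
    save_fatbinhandle_for_managed_rt(handle);
    for (int i = 0; i < n; ++i)
        init_managed_rt();
    return with_module_calls;
}
```

However many accesses are simulated, the stand-in runs once, because the stored nonzero return value makes every later call take the true branch of the ternary.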

Runtime Registration Sequence

The full initialization flow spans the compilation and runtime startup pipeline:

Compile time (cudafe++ emits into .int.c):
  1. __nv_save_fatbinhandle_for_managed_rt() -- defined, stores fatbin handle
  2. __nv_init_managed_rt_with_module()      -- forward-declared only
  3. __nv_init_managed_rt()                  -- defined, lazy init wrapper
  4. #include "crt/host_runtime.h"           -- provides _with_module() definition

Program startup:
  5. __cudaRegisterFatBinary() calls __nv_save_fatbinhandle_for_managed_rt()
     to cache the fatbin handle for this translation unit

First managed variable access:
  6. Comma-operator wrapper calls __nv_init_managed_rt()
  7. Guard flag is 0, so __nv_init_managed_rt_with_module() executes
  8. __nv_init_managed_rt_with_module() calls cudaRegisterManagedVariable()
     for every __managed__ variable in the translation unit
  9. Guard flag set to nonzero, preventing re-initialization

Subsequent accesses:
  10. Comma-operator wrapper calls __nv_init_managed_rt()
  11. Guard flag is nonzero, ternary short-circuits, no runtime call

Host Access Transformation: The Comma-Operator Pattern

When cudafe++ generates the .int.c host-side code and encounters a reference to a __managed__ variable, it wraps the access in a comma-operator expression. This is the core mechanism that ensures the CUDA managed memory runtime is initialized before any managed variable is touched on the host.

Detection

Two backend emitter functions detect managed variables using the same 16-bit bitmask test:

// Used by both sub_4768F0 (gen_name_ref) and sub_484940 (gen_variable_name)
if ((*(_WORD*)(entity + 148) & 0x101) == 0x101)

In little-endian layout, the 16-bit word at offset 148 spans bytes +148 (low) and +149 (high). The mask 0x101 tests:

  • Bit 0 of byte +148 (0x01): __device__ flag
  • Bit 0 of byte +149 (0x100 in the word): __managed__ flag

Both bits are always set together by apply_nv_managed_attr, so this test is equivalent to "is this a managed variable?"
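Note that this is an all-bits test (`== 0x101`), unlike the any-bit test (`& 0x0102) != 0`) used by the __grid_constant__ conflict check. A minimal sketch of the detection, with hypothetical helper names and the two entity bytes passed in directly:

```c
#include <stdbool.h>
#include <stdint.h>

/* Rebuilds the little-endian 16-bit word at +148 from its two bytes. */
static uint16_t space_word(uint8_t b148, uint8_t b149)
{
    return (uint16_t)(b148 | (b149 << 8));
}

/* True only when the __device__ bit AND the __managed__ bit are both set. */
bool is_managed(uint8_t b148, uint8_t b149)
{
    return (space_word(b148, b149) & 0x0101) == 0x0101;
}
```

Other bits in byte +148 do not affect the result, so a word such as 0x0103 still counts as managed in this model.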

Transformed Output

For a managed variable named managed_var, the emitter produces:

(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (managed_var)))

The prefix string lives at 0x839570 (65 bytes):

"(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), ("

After emitting the variable name, the suffix ))) closes the expression.

Why This Works: Anatomy of the Expression

Reading from inside out:

(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (managed_var)))
    ^--------------- ternary: lazy init guard ---------------^  ^-- value --^
  ^----------- comma operator: init side effect, then yield value -----------^
^------------ dereference: access the managed variable's storage -------------^

  1. Ternary __nv_inited_managed_rt ? (void)0 : __nv_init_managed_rt() -- The guard flag is checked. If nonzero (already initialized), the expression evaluates to (void)0, which generates no code. If zero (first access), __nv_init_managed_rt() is called, which performs CUDA runtime registration and sets the guard flag to nonzero.

  2. Comma operator (init_expr, (managed_var)) -- The C comma operator evaluates its left operand for side effects only, discards the result, then evaluates and returns its right operand. This guarantees the initialization side-effect is sequenced before the variable access, per C/C++ sequencing rules (C11 6.5.17, C++17 [expr.comma]).

  3. Outer dereference *(...) -- The outer * dereferences the result. After runtime registration, the managed variable's symbol resolves to the unified memory pointer that the CUDA runtime allocated via cudaMallocManaged. The dereference yields the actual variable value.

The entire expression is parenthesized to be safely usable in any expression context -- assignments, function arguments, member access, etc.
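The sequencing guarantee can be demonstrated with a plain C program. This is a sketch, not emitted code: all names are hypothetical model names, and it assumes (as the description above implies) that the host-side symbol is a pointer that only becomes valid once initialization has run.

```c
#include <stddef.h>

static int  storage;              /* stands in for the unified-memory allocation */
static int *managed_var = NULL;   /* host-side symbol, assumed to be a pointer   */
static char inited      = 0;      /* plays the role of __nv_inited_managed_rt    */
static int  init_calls  = 0;      /* instrumentation only                        */

static void init_rt(void)         /* plays the role of __nv_init_managed_rt()    */
{
    managed_var = &storage;
    inited      = 1;
    ++init_calls;
}

/* Same shape as the emitted wrapper, spelled with the model names above. */
#define MANAGED(v) (*( (inited ? (void)0 : init_rt()), (v)))

int store_then_load(int value)
{
    MANAGED(managed_var) = value; /* would dereference NULL if init ran too late */
    return MANAGED(managed_var);
}

int init_call_count(void)
{
    return init_calls;
}
```

The first store succeeds only because the comma operator sequences `init_rt()` before `managed_var` is read; the dereferenced result is an lvalue, so the wrapper works on both sides of an assignment.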

Two Emitter Paths

The access transformation is applied by two separate functions, covering different name resolution contexts:

sub_484940 (gen_variable_name, 52 lines) -- handles direct variable name emission. Simpler structure: check the 0x101 bitmask, emit prefix, emit the name (handling three sub-cases: thread-local via this, anonymous via sub_483A80, or regular via sub_472730), emit suffix.

// sub_484940 -- gen_variable_name (pseudocode)
void gen_variable_name(entity_t* a1) {
    bool needs_suffix = false;

    // Check: is this a __managed__ variable?
    // (a1 is cast to char* so that +148 is a byte offset)
    if ((*(uint16_t*)((char*)a1 + 148) & 0x101) == 0x101) {
        needs_suffix = true;
        emit("(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (");
    }

    // Emit variable name (three cases)
    if (a1->byte_163 & 0x80)
        emit("this");                       // thread-local proxy
    else if (a1->byte_165 & 0x04)
        emit_anonymous_name(a1);            // compiler-generated name
    else
        gen_expression_or_name(a1, 7);      // regular name emission

    if (needs_suffix)
        emit(")))");
}

sub_4768F0 (gen_name_ref, 237 lines) -- handles qualified name references with :: scope resolution, template arguments, __super:: qualifier, and member access. The managed wrapping applies an additional gate: a3 == 7 (the entity is a variable; named kind in the pseudocode below) AND !v7 (the fourth parameter, named nested in the pseudocode below, is zero, meaning no nested context that already handles initialization).

// sub_4768F0 -- gen_name_ref, managed wrapping (lines 160-163, 231-236)
int gen_name_ref(context_t* ctx, entity_t* entity, uint8_t kind, int nested) {

    bool needs_suffix = false;

    // (entity is cast to char* so that +148 is a byte offset)
    if (!nested && kind == 7
        && (*(uint16_t*)((char*)entity + 148) & 0x101) == 0x101) {
        needs_suffix = true;
        emit("(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (");
    }

    // ... 200+ lines of qualified name emission ...
    // handles ::, template<>, __super::, member access paths

    if (needs_suffix) {
        emit(")))");
        return 1;
    }
    // ...
}
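Both emitters reduce to "optionally emit prefix, emit name, optionally emit suffix". A minimal runnable model of that shape follows. The function name, its signature and the buffer-based output are assumptions made for the sketch; only the prefix and suffix strings and the gate conditions are taken from the analysis above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MANAGED_PREFIX "(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), ("
#define MANAGED_SUFFIX ")))"

/* Writes the host-side spelling of a reference to `name` into `out`.
 * `kind` and `nested` mirror the extra gate in gen_name_ref.
 * Returns the value returned by snprintf (length of the full text). */
int emit_name_ref(char *out, size_t out_size, const char *name,
                  uint16_t space_word, uint8_t kind, bool nested)
{
    bool wrap = !nested && kind == 7 && (space_word & 0x0101) == 0x0101;

    return snprintf(out, out_size, "%s%s%s",
                    wrap ? MANAGED_PREFIX : "", name,
                    wrap ? MANAGED_SUFFIX : "");
}
```

The prefix literal in this sketch is 65 characters long, which agrees with the 65-byte size recorded for the string at 0x839570.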

Host-Side Exemption in Cross-Space Checking

Managed variables receive a special exemption in the cross-space reference validation performed by record_symbol_reference_full (sub_72A650). When host code references a __device__ variable, the checker would normally emit error 3548. But managed variables are specifically exempted:

// Inlined in sub_72A650, cross-space variable reference check
if ((*(uint16_t*)(var_info + 148) & 0x0101) == 0x0101)
    return;  // managed variable -- host access is legal

This uses the same 0x0101 bitmask to detect managed variables. The exemption exists because managed variables are explicitly designed for host access -- that is their entire purpose. Without this exemption, every host-side __managed__ variable access would trigger a spurious "reference to device variable from host code" error.

Managed Variables and constexpr

The declaration processor sub_4DEC90 (variable_declaration) imposes additional constraints when a memory-space attribute is combined with constexpr:

| Error | Condition | Description |
| --- | --- | --- |
| 3568 | __constant__ + constexpr | __constant__ combined with constexpr (prevents runtime initialization) |
| 3566 | __constant__ + constexpr + auto | __constant__ constexpr with auto deduction |

These errors target __constant__ specifically, but the validation cascade also generates the space name for managed variables when constructing error messages. The space name selection uses the same priority cascade as the attribute handler:

// sub_4DEC90, line ~357 -- selecting display name for error messages
const char* space_name = "__constant__";
if (!(space & 0x04)) {
    space_name = "__managed__";
    if (!(*(uint8_t*)(entity + 149) & 0x01)) {
        space_name = "__host__ __device__" + 9;  // pointer trick: "__device__"
        if (space & 0x02)
            space_name = "__shared__";
    }
}

The string "__device__" is obtained by taking "__host__ __device__" and advancing by 9 bytes, skipping the "__host__ " prefix. This is a binary-level optimization where the compiler shares string storage between the combined form and the standalone "__device__" substring.
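The substring trick is ordinary pointer arithmetic on a string literal and can be reproduced directly. In this sketch the array name and the function name are hypothetical, and the two entity bytes are passed in as plain arguments.

```c
#include <stdint.h>

/* One literal holds both spellings; "__device__" is a view into its tail. */
static const char host_device[] = "__host__ __device__";

/* Same cascade as the declaration-processor fragment: constant, then
 * managed, then shared, with __device__ as the fall-through. */
const char *decl_space_name(uint8_t b148, uint8_t b149)
{
    if (b148 & 0x04) return "__constant__";
    if (b149 & 0x01) return "__managed__";
    if (b148 & 0x02) return "__shared__";
    return host_device + 9;   /* skip the 9 bytes of "__host__ " */
}
```

Unlike the attribute handler's cascade, this variant has no empty-string fallback: anything that is not constant, managed or shared is reported as `__device__`.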

Error 3648: External Linkage Warning

The post-definition check in sub_4DC200 (mark_defined_variable) warns when a device-accessible variable has external linkage. This affects managed variables because they always have the __device__ bit set:

// sub_4DC200 -- mark_defined_variable
// Condition for warning 3648:
if ((entity->byte_148 & 3) == 1    // __device__ set AND __shared__ NOT set
    && !is_compiler_generated(entity)
    && (entity->byte_80 & 0x70) != 0x10)  // not anonymous
{
    warning(3648, entity->source_loc);
}

The bit test (byte_148 & 3) == 1 checks that bit 0 (__device__) is set and bit 1 (__shared__) is NOT set. This catches:

  • __device__ variables (0x01): yes, (0x01 & 3) == 1
  • __managed__ variables (0x01 at +148, 0x01 at +149): yes, (0x01 & 3) == 1
  • __device__ __constant__ (0x05): yes, (0x05 & 3) == 1
  • __shared__ (0x02): no, (0x02 & 3) == 2
  • __constant__ alone (0x04): no, (0x04 & 3) == 0

Managed variables therefore trigger this warning if they have external linkage and are not compiler-generated.
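The memory-space part of that condition is a two-bit mask test, sketched below. The function name is hypothetical, and the sketch deliberately ignores the compiler-generated and byte_80 checks that the real condition also applies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate for the memory-space part of the 3648 condition.
 * byte_148: bit 0 = __device__, bit 1 = __shared__, bit 2 = __constant__. */
bool space_bits_trigger_3648(uint8_t byte_148)
{
    /* Keep the two low bits; only "device set, shared clear" equals 1. */
    return (byte_148 & 0x03) == 0x01;
}
```

Because the mask discards bit 2, the __constant__ bit never influences the outcome: 0x01 and 0x05 both pass, while 0x02 and 0x04 both fail.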

Diagnostic Summary

| Error | Phase | Condition | Message |
|---|---|---|---|
| 3481 | Attribute application | __shared__ AND __constant__ both set | Conflicting CUDA memory spaces |
| 3482 | Attribute application | thread_local storage duration | CUDA memory space on thread_local variable |
| 3485 | Attribute application | Local scope or reference type | CUDA memory space on local variable |
| 3577 | Attribute application | __grid_constant__ + managed/shared | Memory space incompatible with __grid_constant__ |
| 3648 | Post-definition | External linkage on device-accessible (non-shared) var | External linkage warning |
| (arch) | Declaration processing | dword_126E4A8 < 30 | __managed__ requires compute_30 or higher |
| (config) | Declaration processing | Unsupported OS/bitness | __managed__ not supported for this configuration |

Function Map

| Address | Name | Lines | Role |
|---|---|---|---|
| sub_40E0D0 | apply_nv_managed_attr | 47 | Attribute handler -- sets flags, validates |
| sub_40EB80 | apply_nv_device_attr | 100 | Device handler (variable path is structurally identical) |
| sub_413240 | apply_one_attribute | 585 | Dispatch -- routes kind 'f' to sub_40E0D0 |
| sub_489000 | process_file_scope_entities | 723 | Emits managed RT boilerplate into .int.c |
| sub_4768F0 | gen_name_ref | 237 | Access wrapper -- qualified name path |
| sub_484940 | gen_variable_name | 52 | Access wrapper -- direct name path |
| sub_4DEC90 | variable_declaration | 1098 | Declaration processing, constexpr/VLA checks |
| sub_4DC200 | mark_defined_variable | 26 | External linkage warning (error 3648) |
| sub_72A650 | record_symbol_reference_full | ~400 | Cross-space check with managed exemption |
| sub_6BC890 | nv_validate_cuda_attributes | 161 | Post-declaration cross-attribute validation |

Minor CUDA Attributes

cudafe++ defines several CUDA-specific attributes beyond the core execution-space, memory-space, and launch-configuration families. These attributes serve diverse purposes: optimization hints for the downstream compiler, parameter passing strategy selection, inline control that bridges the EDG front-end with cicc's code generation, and internal annotations for tile/cooperative infrastructure. Most are undocumented by NVIDIA. This page covers each in detail: what the attribute does, why it exists, how cudafe++ validates and stores it, and where the flags end up in the entity node.

Attribute Summary

| Kind | Hex | ASCII | Display Name | Category | Handler / Flag |
|---|---|---|---|---|---|
| 110 | 0x6E | 'n' | __nv_pure__ | Optimization | entity+183 (via IL propagation) |
| -- | -- | -- | __nv_register_params__ | ABI | sub_40B0A0 (38 lines), entity+183 bit 3 |
| -- | -- | -- | __forceinline__ | Inline control | entity+177 bit 4 |
| -- | -- | -- | __noinline__ | Inline control | sub_40F5F0 / sub_40F6F0, entity+179 bit 5, entity+180 bit 7 |
| -- | -- | -- | __inline_hint__ | Inline control | entity+179 bit 4 |
| 89 | 0x59 | 'Y' | __tile_global__ | Internal | (no handler observed) |
| 95 | 0x5F | '_' | __tile_builtin__ | Internal | (no handler observed) |
| 94 | 0x5E | '^' | __local_maxnreg__ | Launch config | sub_411090 (67 lines) |
| 108 | 0x6C | 'l' | __block_size__ | Launch config | sub_4109E0 (265 lines) |

Note: __nv_register_params__, __forceinline__, __noinline__, and __inline_hint__ do not have CUDA attribute kind codes. They are processed through different paths (EDG's standard attribute system, pragma-like registration at startup, or direct flag manipulation). Only __nv_pure__, __tile_global__, __tile_builtin__, __local_maxnreg__, and __block_size__ have dedicated CUDA kind bytes in the attribute_display_name switch table.
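To make the kind-byte encoding concrete, the following sketch maps the five kind bytes listed above to their display names. It is an illustration only: the function name is hypothetical, and the real attribute_display_name switch covers many more kinds than shown here.

```c
#include <stddef.h>

/* Sketch of a display-name lookup restricted to the kind bytes on this page.
 * Returns NULL for any kind not covered here. Assumes an ASCII execution
 * character set, so that 'n' == 0x6E and so on. */
const char *minor_attr_display_name(unsigned char kind)
{
    switch (kind) {
    case 'Y': return "__tile_global__";    /* 0x59 = 89  */
    case '^': return "__local_maxnreg__";  /* 0x5E = 94  */
    case '_': return "__tile_builtin__";   /* 0x5F = 95  */
    case 'l': return "__block_size__";     /* 0x6C = 108 */
    case 'n': return "__nv_pure__";        /* 0x6E = 110 */
    default:  return NULL;
    }
}
```

Using character literals as case labels matches how the kind bytes appear in the decompiled switch, where each kind value coincides with a printable ASCII character.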


__nv_pure__ (Kind 0x6E = 'n')

Purpose

__nv_pure__ marks a function as having no observable side effects: given the same inputs, it always returns the same result and does not modify any state visible to the caller. This is an optimization hint for cicc (the CUDA compiler backend). A pure function can be:

  • Common-subexpression eliminated (CSE): if f(x) appears twice in the same basic block, the second call can be replaced by the first call's result.
  • Hoisted out of loops: if f(x) is invariant across loop iterations, it can be computed once before the loop (LICM -- loop-invariant code motion).
  • Dead-code eliminated: if the result of f(x) is never used and the function has no side effects, the call can be removed entirely.

This is semantically equivalent to GCC's __attribute__((pure)) and LLVM's readonly function attribute, but expressed through NVIDIA's internal attribute system rather than the standard GNU attribute path. The choice of a separate internal attribute rather than reusing the GNU pure attribute reflects cudafe++'s design of routing all CUDA-specific semantics through its own kind-byte dispatch, keeping the NVIDIA optimization pipeline cleanly separated from EDG's standard attribute handling.

Binary Encoding

In the attribute kind enum, __nv_pure__ has kind value 110 (0x6E, ASCII 'n'). This is the highest kind value in the CUDA attribute range, added later than the original dense block (86--95).

The attribute_display_name switch (sub_40A310) maps it:

case 'n': return "__nv_pure__";

Application Behavior

In the apply_one_attribute constraint checker (sub_413240), kind 'n' has the following entry:

case 'n':
    if (target_kind == 28)   // target is a namespace-level entity
        goto LABEL_21;       // -> pass through (no per-entity modification)
    goto LABEL_8;            // -> attribute doesn't apply to this target

The handler does not modify any entity node fields directly. Unlike __host__ or __device__ which set bitmask flags at entity+182, __nv_pure__ propagates through the attribute node list itself. The attribute node with kind 0x6E remains attached to the entity's attribute chain and is consumed later by:

  1. The .int.c output generator (sub_5565E0 and related functions), which emits the __nv_pure__ attribute into the intermediate C output. In the IL code generator, kind 0x6E shares handling with __launch_bounds__ (0x5C):
case 0x5C:
case 0x6E:
    a2->kind_field = 25;    // IL node type for "function attribute"
    sub_540560(0, 0, a2, a4, ...);  // emit attribute to .int.c
    break;
  2. cicc then reads the __nv_pure__ annotation from the .int.c output and applies the corresponding LLVM-level optimization attributes (readonly, willreturn, etc.) to the function in the NVVM IR.

Why It Exists

CUDA device code has optimization opportunities that GCC's pure does not capture. Device functions execute in a constrained environment (no system calls, no I/O, deterministic memory model), which makes purity easier to verify and more valuable to exploit. By providing __nv_pure__ as a separate internal attribute, NVIDIA can:

  • Gate it behind CUDA mode (it only appears in device compilation flows).
  • Attach it to internal runtime functions (__shfl_sync, math intrinsics, etc.) that NVIDIA knows are pure but that cannot carry GCC pure through the host compilation path.
  • Avoid interactions with EDG's GNU attribute conflict checking, which has its own rules for pure vs const vs noreturn.

String Evidence

The string table contains exactly one reference to __nv_pure__ at address 0x829848, and a diagnostic tag nv_pure at 0x88cc08. The low reference count confirms this is an internal optimization attribute not exposed to user code through documented CUDA APIs.


__nv_register_params__ (Entity+183 bit 3)

Purpose

__nv_register_params__ tells cicc to pass a function's parameters in registers instead of through parameter memory. For comparison, CUDA kernel parameters are loaded via ld.param instructions, which access a dedicated constant memory bank filled in by the kernel launch mechanism. That path works well when parameter counts are large (kernel parameters were historically limited to 4 KB per kernel), but for small parameter counts, passing values directly in registers avoids the latency of the load. Note that the validation described below accepts the attribute only on pure __device__ functions, not on __global__ kernels, so it governs device-side calls rather than kernel launches.

Register parameter passing eliminates the constant-bank load latency (typically 4--8 cycles on modern architectures) and removes potential bank conflicts when multiple warps read the same parameters. The trade-off is that it consumes registers from the limited register file, which can reduce occupancy if the kernel already uses many registers.

Requirements

The attribute has four validation checks, enforced across two separate locations:

  1. Enablement flag (dword_106C028): a compiler internal flag that must be set. If not set, the handler emits error 3659 with the message "__nv_register_params__ support is not enabled". This flag is controlled by an internal nvcc option, not exposed to users.

  2. Architecture check (implied by error string): the string "__nv_register_params__ is only supported for compute_80 or later architecture" exists in the binary at 0x88cb80. This check is performed outside the apply handler, in the post-validation or downstream pipeline.

  3. Function type restriction (implied by error string): the string "__nv_register_params__ is not allowed on a %s function" at 0x88cbd0 shows that certain function types (likely __host__ or non-kernel functions) are rejected. The post-validation in sub_6BC890 checks: if entity+183 & 0x08 is set (register_params flag) but the execution space at entity+182 is __global__ (bit 6) or the function is not a pure __device__ function, it emits error 3661 with the relevant space name.

  4. Ellipsis (variadic) check: the apply handler (sub_40B0A0) traverses the function's return type chain to reach the prototype, then checks prototype+16 & 0x01 (the variadic flag). If set, it emits error 3662 with the message "__nv_register_params__ is not allowed on a function with ellipsis". Variadic functions cannot use register parameter passing because the parameter count is not known at compile time.

Apply Handler: sub_40B0A0 (38 lines)

// sub_40B0A0 -- apply_nv_register_params_attr (attribute.c:10537)
entity_t* apply_nv_register_params_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {
    assert(a3 == 11);  // functions only

    bool enabled = true;
    if (!dword_106C028) {       // enablement flag not set
        emit_error(7, 3659, a1->src_loc);  // "support is not enabled"
        enabled = false;
    }

    if (!a2) return a2;

    // Walk return type chain to get function prototype
    type_t* ret_type = a2->type_at_144;
    if (!ret_type) goto set_flag;

    while (ret_type->kind == 12)     // skip cv-qualifier wrappers
        ret_type = ret_type->next;   // +144

    // Check variadic flag
    if (ret_type->prototype->flags_16 & 0x01) {
        emit_error(7, 3662, a1->src_loc);  // "not allowed on variadic"
        return a2;
    }

set_flag:
    if (enabled)
        a2->byte_183 |= 0x08;  // set register_params bit
    return a2;
}

The flag is stored at entity+183 bit 3 (0x08), the same byte that holds the cluster_dims intent flag (bit 6, 0x40). These two flags coexist without conflict because they serve orthogonal purposes.

Post-Declaration Validation

In sub_6BC890 (nv_validate_cuda_attributes), if entity+183 & 0x08 is set:

if (entity->byte_183 & 0x08) {
    uint8_t es = entity->byte_182;
    if (es & 0x40) {                    // __global__ function
        emit_error(7, 3661, src, "__global__");
    } else if ((es & 0x30) != 0x20) {   // not pure __device__
        emit_error(7, 3661, src, "__host__");
    }
    // else: pure __device__ function -- register_params is valid
}

This means __nv_register_params__ is only valid on __device__ functions (not __global__, not __host__, not __host__ __device__). Kernel functions (__global__) have their own parameter passing ABI dictated by the CUDA runtime, and host functions use the host ABI.
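The decision can be restated as a small classifier over the execution-space byte. This is a sketch under the bit assignments used in the snippet above (bit 6 for __global__, and (es & 0x30) == 0x20 for a pure __device__ function); the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classifier for the post-declaration check.
 * es is the execution-space byte read from entity+182.
 * Returns the space name that would be passed to error 3661, or NULL
 * when __nv_register_params__ is accepted. */
const char *register_params_rejection(uint8_t es)
{
    if (es & 0x40)
        return "__global__";   /* kernels keep the launch ABI */
    if ((es & 0x30) != 0x20)
        return "__host__";     /* host, host-device, or unannotated */
    return NULL;               /* pure __device__ function */
}
```

Only the value 0x20 in the two-bit field passes, so a __host__ __device__ function (both bits set) is rejected with the same "__host__" argument as a plain host function.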

Registration at Startup

The function sub_6B5E50 (called during compiler initialization) registers __nv_register_params__ as a preprocessor macro expansion. It looks up the name via sub_734430, and if not found, creates a new macro definition node and registers it in the symbol table via sub_749600. The macro body is a 40-byte token sequence that, when expanded, produces the __attribute__((__nv_register_params__)) syntax that EDG's attribute parser can consume. This macro-based registration is why __nv_register_params__ does not have a CUDA kind byte -- it enters the attribute system through the standard GNU __attribute__ path, not through the CUDA attribute descriptor table.

The same startup function also registers __noinline__ with a similar mechanism and, conditionally, _Pragma (see Startup Registration for the gating flag).


Inline Control Attributes

cudafe++ provides three inline control attributes that interact with EDG's inline heuristic system. These attributes do not have CUDA kind bytes; they are processed through EDG's standard attribute infrastructure and NVIDIA's own flag-setting paths.

Entity Node Fields

entity+177  cuda_flags (byte):
    bit 4 (0x10) = __forceinline__

entity+179  more_cuda_flags (byte):
    bit 4 (0x10) = __inline_hint__
    bit 5 (0x20) = __noinline__ (EDG internal noinline)

entity+180  function_attrs (byte):
    bit 7 (0x80) = __noinline__ (GNU attribute form)

__forceinline__

__forceinline__ requests that the compiler always inline the function, overriding cost-based heuristics. It is stored at entity+177 bit 4 (0x10). This bit is checked during cross-execution-space call validation (sub_505720): a __forceinline__ function is treated as implicitly host-device, meaning it suppresses cross-space call errors. The logic in the cross-space checker:

if (entity->byte_177 & 0x10) {  // __forceinline__
    // treat as implicitly __host__ __device__ (no cross-space error)
}

This relaxation exists because __forceinline__ functions are expected to be inlined at the call site, so their execution space becomes the caller's execution space. There is no separate call to resolve, hence no cross-space violation.

In the .int.c output, __forceinline__ is emitted so that cicc can apply it during NVVM IR generation. cicc translates it to LLVM's alwaysinline attribute.

__noinline__

__noinline__ prevents the compiler from inlining a function, regardless of heuristics. It has two separate handlers because it can arrive through two syntactic paths:

Path 1: EDG internal form (sub_40F5F0, 51 lines)

This handler is invoked when __noinline__ is recognized as a CUDA-specific attribute (source_mode 3 or with the scoped-attribute bit set). It sets entity+179 |= 0x20. In C mode (dword_126EFB4 == 2), it additionally creates an ABI annotation node by calling sub_5E5130 and linking it to the function's prototype exception-spec chain at prototype+56. This ABI node carries flags 0x19 and signals to the code generator that the noinline directive should be preserved across compilation boundaries.

// sub_40F5F0 -- apply_noinline_attr (EDG internal path)
if (target_kind == 11) {  // function
    if (attr->kind) {
        entity->byte_179 |= 0x20;         // noinline flag
        if (attr->source_mode == 3 && dword_126EFB4 == 2) {
            // Create ABI annotation for C mode
            extract_func_type(entity+144, &ft_out);
            if (!ft_out->prototype->abi_info) {
                abi_node_t* n = alloc_abi_node();
                *n |= 0x19;
                ft_out->prototype->abi_info = n;
            }
        }
    }
    return entity;
}
// else: emit error 1835 (wrong target) or 2470 (alignas context)

Path 2: GNU attribute form (sub_40F6F0, 37 lines)

This handler is invoked when __noinline__ arrives through the __attribute__((__noinline__)) GNU attribute path. It sets a different bit: entity+180 |= 0x80. This separation allows the compiler to distinguish between the CUDA-specific noinline directive and the GNU portable one, although in practice both prevent inlining.

Additionally, the handler calls sub_5CEE70(28, entity->attr_chain) to record the noinline directive for device-side compilation when four conditions all hold: byte+176 bit 7 is set (static member), source_mode indicates GNU/Clang, byte+81 bit 2 is set (local), and byte+187 bit 0 is clear.

// sub_40F6F0 -- apply_noinline_attr (GNU form)
if (target_kind == 11) {
    entity->byte_180 |= 0x80;
    if ((signed char)entity->byte_176 < 0
        && (attr->source_mode == 2 || (attr->flags & 0x10))
        && (entity->byte_81 & 0x04)
        && !(entity->byte_187 & 0x01)) {
        sub_5CEE70(28, entity->attr_chain);
    }
} else {
    // emit error 1835/2470 with appropriate severity
}

__inline_hint__

__inline_hint__ is an internal NVIDIA attribute that provides a non-binding suggestion to the compiler's inlining heuristics. Unlike __forceinline__, which mandates inlining, __inline_hint__ merely biases the cost model in favor of inlining. It is stored at entity+179 bit 4 (0x10).

The attribute is registered through the same startup mechanism as __nv_register_params__ in sub_6B5E50, and its handler apply_nv_inline_hint_attr (referenced at address 0x40A999 within sub_40A8A0) sets the flag. The diagnostic tag nv_inline_hint exists at 0x82bf2f in the string table, suggesting diagnostic messages exist for conflicts.

Mutual Exclusion

__forceinline__ and __noinline__ are mutually exclusive. The diagnostic system includes 2 messages for inline hint conflicts (identified in the W053 error report). When both are applied to the same function, the compiler emits a diagnostic. However, __inline_hint__ can coexist with either, as it is merely a suggestion that the other directives override.

The mutual exclusion is enforced through the constraint checker in apply_one_attribute (sub_413240) and through post-validation checks. The constraint string for the 'r' (routine/function) constraint class includes property codes m (for member/constexpr) and v (for virtual), with + and - qualifiers controlling whether the attribute is allowed or forbidden. Error codes 1835--1843 and 1858--1871 cover the various conflict scenarios.
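The flag layout makes the conflict easy to express as a bit test. The sketch below is illustrative only: the struct and function names are hypothetical, and the real enforcement goes through the constraint checker rather than a single predicate like this one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Copies of the three flag bytes from the "Entity Node Fields" listing.
 * The struct itself is an illustration, not the entity node layout. */
typedef struct {
    uint8_t byte_177;  /* bit 4 (0x10) = __forceinline__         */
    uint8_t byte_179;  /* bit 4 (0x10) = __inline_hint__,
                          bit 5 (0x20) = __noinline__ (EDG path) */
    uint8_t byte_180;  /* bit 7 (0x80) = __noinline__ (GNU path) */
} inline_flags_t;

/* True if either syntactic path recorded a noinline request. */
bool has_noinline(const inline_flags_t *f)
{
    return (f->byte_179 & 0x20) != 0 || (f->byte_180 & 0x80) != 0;
}

/* True if __forceinline__ and __noinline__ are both present.
 * __inline_hint__ is ignored because it may coexist with either. */
bool has_inline_conflict(const inline_flags_t *f)
{
    return (f->byte_177 & 0x10) != 0 && has_noinline(f);
}
```

Checking both noinline bits matters because the EDG and GNU paths store the request in different bytes; testing only one of them would miss half of the conflicting combinations.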

IL Output

In the .int.c output, the inline control attributes are emitted as standard GNU __attribute__ annotations:

// emitted for __noinline__:
__attribute__((noinline))

// emitted for __forceinline__:
__attribute__((always_inline))

cicc reads these and translates them to LLVM's noinline and alwaysinline function attributes respectively.


__tile_global__ (Kind 0x59 = 'Y')

Purpose

__tile_global__ is an internal execution-space attribute that appears in the attribute_display_name switch table but has no user-facing documentation. Its kind value (89, 'Y') places it in the original dense block of CUDA attributes between __global__ (88, 'X') and __shared__ (90, 'Z').

The name strongly suggests this attribute is related to NVIDIA's tile-based cooperative group infrastructure or the Tensor Memory Accelerator (TMA) programming model, where "tile global" would denote a function that operates on a tile of global memory. In the cooperative groups model, tiled partitions allow threads to cooperatively access contiguous memory regions, and a __tile_global__ function might be the kernel entry point for such a tiled execution pattern.

Binary Evidence

The attribute is defined in the kind enum (the attribute_display_name switch case), but no handler function has been identified in the binary. In the apply_one_attribute dispatcher (sub_413240), there is no case for kind 'Y'. This means:

  • The attribute can be parsed and stored in an attribute node.
  • It has a display name for diagnostics.
  • It does not modify entity node fields through the standard apply pipeline.

This is consistent with the attribute being consumed downstream by cicc or another tool in the compilation pipeline, rather than requiring cudafe++ to perform validation beyond basic parsing. Alternatively, it may be a reserved placeholder for future functionality.


__tile_builtin__ (Kind 0x5F = '_')

Purpose

__tile_builtin__ is another internal attribute in the CUDA kind enum, with kind value 95 (0x5F, ASCII '_'). Its kind value is the last in the original dense block (86--95).

The name suggests this attribute marks functions that are tile-level builtins -- compiler intrinsics that implement tile-based operations. These would be functions like cooperative_groups::thread_block_tile<N>::shfl(), cooperative_groups::thread_block_tile<N>::ballot(), or TMA copy intrinsics, which are compiled by cudafe++ as ordinary function calls but need special handling by cicc for efficient code generation.

Binary Evidence

Like __tile_global__, __tile_builtin__ has no handler in the apply_one_attribute dispatcher. It appears only in the attribute_display_name switch table. The attribute node with kind 0x5F passes through cudafe++ without entity node modification and is consumed by the downstream compiler.

The pairing of __tile_global__ (Y) and __tile_builtin__ (_) suggests a two-part infrastructure:

  • __tile_global__ marks kernel-level entry points for tiled execution.
  • __tile_builtin__ marks the intrinsic operations available within that tiled execution context.

__local_maxnreg__ (Kind 0x5E = '^')

Purpose

__local_maxnreg__ sets a per-function register limit, as opposed to __maxnreg__ which is per-kernel. The distinction matters for __device__ helper functions called from kernels: __maxnreg__ can only be applied to __global__ functions, but __local_maxnreg__ can be applied to any device function. This allows fine-grained register pressure tuning at the function level without requiring the entire kernel to be constrained.

When cicc compiles a __device__ function with __local_maxnreg__, it sets the target register limit for that specific function during register allocation, potentially spilling more aggressively to local memory. The surrounding kernel can use a different register budget.

Apply Handler: sub_411090 (67 lines)

The handler is structurally identical to sub_410F70 (__maxnreg__), differing only in the offset within the launch config struct where it stores the value:

// sub_411090 -- apply_nv_local_maxnreg_attr
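entity_t* apply_nv_local_maxnreg_attr(attr_node_t* a1, entity_t* a2, ...) {
    entity_t* entity = a2;   // target function entity
    // arg below stands for the attribute's single constant-expression argument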
    // Allocate launch config struct if needed
    if (!entity->launch_config)
        entity->launch_config = allocate_launch_config();  // sub_5E52F0

    // Skip if template-dependent argument
    if (is_dependent_type(arg))
        return entity;

    // Validate: must be positive
    if (const_expr_sign_compare(arg, 0) <= 0) {   // sub_461980
        emit_error(7, 3786, a1->src_loc);          // non-positive value
        return entity;
    }

    // Validate: must fit in int32
    int64_t val = const_expr_get_value(arg);        // sub_461640
    if (val > INT32_MAX) {
        emit_error(7, 3787, a1->src_loc);           // value too large
        return entity;
    }

    entity->launch_config->local_maxnreg = (int32_t)val;  // offset +36
    return entity;
}

Post-Validation Difference from __maxnreg__

In sub_6BC890, __maxnreg__ (stored at launch_config+32) is validated to require __global__ (error 3715: "__maxnreg__ is only valid on __global__ functions"). __local_maxnreg__ has no such check in post-validation. This is intentional: it is designed to work on __device__ functions as well. The post-validation function only checks the maxnreg field (offset +32) for the __global__ requirement; the local_maxnreg field (offset +36) is left unchecked.

Diagnostics

| Error | Message | Condition |
|---|---|---|
| 3786 | Non-positive __local_maxnreg__ value | const_expr_sign_compare(arg, 0) <= 0 |
| 3787 | __local_maxnreg__ value too large | Value exceeds int32 range |
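The two range checks can be exercised in isolation with plain integers. This is a sketch under the assumption that the constant expression has already been evaluated to an int64_t; the function name and the return convention are hypothetical.

```c
#include <stdint.h>

/* Validates a candidate register limit.
 * Returns 0 and stores the value in *out on success; otherwise returns the
 * diagnostic number (3786 or 3787) and leaves *out untouched. */
int validate_local_maxnreg(int64_t val, int32_t *out)
{
    if (val <= 0)
        return 3786;          /* non-positive value */
    if (val > INT32_MAX)
        return 3787;          /* does not fit in int32 */
    *out = (int32_t)val;
    return 0;
}
```

The order of the checks mirrors the handler: the sign test runs first, so a negative value is always reported as 3786 and never reaches the range test.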

__block_size__ (Kind 0x6C = 'l')

Purpose

__block_size__ specifies the thread block dimensions (and optionally cluster dimensions) for a kernel at compile time. Unlike __launch_bounds__, which provides hints for the compiler's register allocator, __block_size__ declares the actual block geometry. This enables the compiler to optimize based on known block dimensions: unrolling loops by the block dimension, computing shared memory bank conflict patterns at compile time, and statically determining the number of warps.

Apply Handler: sub_4109E0 (265 lines)

This is the largest of the launch config attribute handlers. It accepts up to 6 arguments: three block dimensions (x, y, z) and three cluster dimensions (x, y, z).

// sub_4109E0 -- apply_nv_block_size_attr (simplified)
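entity_t* apply_nv_block_size_attr(attr_node_t* a1, entity_t* a2, ...) {
    entity_t* entity = a2;   // target function entity
    // arg, arg_exists and src below stand for the current argument
    // expression, whether it is present, and a1->src_loc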
    // Allocate launch config struct if needed
    if (!entity->launch_config)
        entity->launch_config = allocate_launch_config();

    launch_config_t* lc = entity->launch_config;

    // Parse block dimensions (arguments 1-3)
    // Each: validate positive, validate fits in int32
    for (int i = 0; i < 3 && arg_exists; i++) {
        if (const_expr_sign_compare(arg, 0) <= 0)
            emit_error(7, 3788, src);   // non-positive
        else {
            int64_t val = const_expr_get_value(arg);
            if (val > INT32_MAX)
                emit_error(7, 3789, src);  // too large
            else
                lc->block_size[i] = (int32_t)val;  // +40, +44, +48
        }
    }

    // Parse optional cluster dimensions (arguments 4-6)
    if (cluster_args_present) {
        // Check for conflict with prior __cluster_dims__
        if (lc->flags & 0x01)
            emit_error(7, 3791, src);  // conflict

        for (int i = 0; i < 3 && arg_exists; i++) {
            // same positive/range validation as above, yielding val
            lc->cluster_dim[i] = (int32_t)val;  // +20, +24, +28
        }
    } else if (!(lc->flags & 0x01)) {
        // Default cluster dims to (1,1,1) when no cluster args
        // and no prior __cluster_dims__
        lc->cluster_dim[0] = 1;  // +20
        lc->cluster_dim[1] = 1;  // +24
        lc->cluster_dim[2] = 1;  // +28
    }

    lc->flags |= 0x02;   // mark block_size_set
    return entity;
}

Conflict with __cluster_dims__

__block_size__ and __cluster_dims__ have a bidirectional conflict. Each handler checks the other's flag:

  • __block_size__ checks flags & 0x01 (cluster_dims_set) before writing cluster dims: error 3791.
  • __cluster_dims__ checks flags & 0x02 (block_size_set) before writing cluster dims: error 3791.

However, neither handler returns early on this conflict. Both continue to set their respective flag bits, so after conflict the flags byte can be 0x03 (both bits set). The error diagnostic is emitted but the compilation continues.
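A short simulation shows how the flags byte ends up at 0x03 after a conflict. The struct, the error counter, and both function names are hypothetical; only the two flag bits and the error number come from the description above.

```c
#include <stdint.h>

enum { CLUSTER_DIMS_SET = 0x01, BLOCK_SIZE_SET = 0x02 };

/* Illustrative stand-in for the flags byte of the launch config struct,
 * plus a counter in place of the real diagnostic machinery. */
typedef struct {
    uint8_t flags;
    int     errors_3791;
} launch_flags_t;

/* __cluster_dims__ handler: report the conflict, then set its bit anyway. */
void apply_cluster_dims(launch_flags_t *lc)
{
    if (lc->flags & BLOCK_SIZE_SET)
        lc->errors_3791++;
    lc->flags |= CLUSTER_DIMS_SET;
}

/* __block_size__ handler called with cluster arguments present:
 * report the conflict, then set its bit anyway. */
void apply_block_size_with_cluster_args(launch_flags_t *lc)
{
    if (lc->flags & CLUSTER_DIMS_SET)
        lc->errors_3791++;
    lc->flags |= BLOCK_SIZE_SET;
}
```

Applying the two handlers in either order to a zeroed struct produces exactly one error and leaves flags at 0x03, since only the second handler sees the other attribute's bit.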

Diagnostics

| Error | Message | Condition |
|---|---|---|
| 3788 | Non-positive __block_size__ dimension | const_expr_sign_compare(arg, 0) <= 0 |
| 3789 | __block_size__ dimension too large | Value exceeds int32 range |
| 3791 | Conflicting __cluster_dims__ and __block_size__ | Both attributes applied to same entity |

Global State and Registration

Startup Registration (sub_6B5E50)

The function sub_6B5E50 runs during compiler initialization and registers three names as preprocessor macro definitions:

  1. __nv_register_params__: looked up via sub_734430; if not found, creates a new macro via sub_749600 and associates it with a 40-byte token sequence. The token body begins with the magic value 8961 (0x2301) as a prefix, followed by attribute argument tokens. If the symbol already exists (the macro was predefined), it appends the token body to the existing definition's expansion via sub_6AC190.

  2. __noinline__: registered with the same mechanism. The decompiled code shows strcpy((char*)(v11+20), "oinline))");, so the token body ends in the suffix "oinline))". That suffix completes an expansion ending in noinline)), which is consistent with the __attribute__((noinline)) form rather than a spelling with trailing double underscores.

  3. _Pragma: conditionally registered if dword_106C0E0 is set. The _Pragma macro registration enables MSVC-compatible pragma handling in certain compilation modes.

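Additionally, if Clang compatibility mode is active (dword_126EFA4 set, qword_126EF90 > 0x2BF1F, and specific extension flags are enabled), the function registers the Arm SME keyword-attribute macros (__arm_in, __arm_inout, __arm_out, __arm_preserves, __arm_streaming, __arm_streaming_compatible). The threshold 0x2BF1F is 179999 in decimal, which corresponds to Clang 18.0 or later if the version is packed as major * 10000 + minor * 100 + patch; that packing is inferred from the constant and is not otherwise confirmed here.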
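The threshold constant is easier to read in decimal. The sketch below decodes a packed version number under one assumed packing, major * 10000 + minor * 100 + patch; that packing, the type, and the function name are assumptions made for illustration.

```c
#include <stdint.h>

typedef struct {
    unsigned major;
    unsigned minor;
    unsigned patch;
} version_t;

/* ASSUMPTION: the version is packed as major * 10000 + minor * 100 + patch.
 * Nothing in this page confirms the packing; it is a plausible reading
 * of the constant 0x2BF1F (179999). */
version_t decode_packed_version(uint64_t packed)
{
    version_t v;
    v.major = (unsigned)(packed / 10000);
    v.minor = (unsigned)((packed / 100) % 100);
    v.patch = (unsigned)(packed % 100);
    return v;
}
```

Under this assumption the smallest value that satisfies the comparison, 0x2BF1F + 1 = 180000, decodes to major version 18 with minor and patch both zero.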

Entity Node Field Summary

entity+177  bit 4 (0x10): __forceinline__
entity+179  bit 4 (0x10): __inline_hint__
entity+179  bit 5 (0x20): __noinline__ (EDG path)
entity+180  bit 7 (0x80): __noinline__ (GNU path)
entity+181  bit 5 (0x20): __forceinline__ relaxation flag
entity+182  [byte]:       execution space (see overview)
entity+183  bit 3 (0x08): __nv_register_params__
entity+183  bit 6 (0x40): __cluster_dims__ intent
entity+256  [pointer]:    launch_config_t* (for __local_maxnreg__, __block_size__)

Function Map

| Address | Size | Identity | Source |
|---|---|---|---|
| sub_40A310 | 83 lines | attribute_display_name | attribute.c:1307 |
| sub_40A8A0 | 23 lines | apply_nv_inline_hint_attr (contains) | attribute.c |
| sub_40B0A0 | 38 lines | apply_nv_register_params_attr | attribute.c:10537 |
| sub_40F5F0 | 51 lines | apply_noinline_attr (EDG path) | attribute.c |
| sub_40F6F0 | 37 lines | apply_noinline_attr (GNU path) | attribute.c |
| sub_40F7B0 | 61 lines | apply_noinline_scoped_attr | attribute.c |
| sub_4109E0 | 265 lines | apply_nv_block_size_attr | attribute.c |
| sub_411090 | 67 lines | apply_nv_local_maxnreg_attr | attribute.c |
| sub_413240 | 585 lines | apply_one_attribute (dispatch) | attribute.c |
| sub_6B5E50 | 160 lines | Startup registration | nv_transforms.c adjacent |
| sub_6BC890 | 160 lines | nv_validate_cuda_attributes | nv_transforms.c |

Diagnostic Tag Index

| Error | Diagnostic Tag | Attribute |
|---|---|---|
| 3659 | register_params_not_enabled | __nv_register_params__ |
| 3661 | register_params_unsupported_function | __nv_register_params__ |
| 3662 | register_params_ellipsis_function | __nv_register_params__ |
| -- | register_params_unsupported_arch | __nv_register_params__ |
| 3786 | local_maxnreg_negative | __local_maxnreg__ |
| 3787 | local_maxnreg_too_large | __local_maxnreg__ |
| 3788 | block_size_must_be_positive | __block_size__ |
| 3789 | (block_size dimension overflow) | __block_size__ |
| 3791 | conflict_between_cluster_dim_and_block_size | __block_size__ / __cluster_dims__ |
| 1835 | (attribute on wrong target) | __noinline__ |
| 2470 | (attribute in alignas context) | __noinline__ |

Extended Lambda Overview

Extended lambdas are the most complex NVIDIA addition to the EDG frontend. Standard C++ lambdas produce closure classes with host linkage only -- they cannot appear in __global__ kernel launches or __device__ function calls because the closure type has no device-side instantiation. The --extended-lambda flag (dword_106BF38) enables a transformation pipeline that wraps each annotated lambda in a device-visible template struct, making the closure class callable across the host/device boundary.

Two wrapper types exist. __nv_dl_wrapper_t handles device-only lambdas (annotated __device__). __nv_hdl_wrapper_t handles host-device lambdas (annotated __host__ __device__). The wrappers are parameterized template structs that store captured variables as typed fields, providing the device compiler with a concrete, instantiatable type for each lambda's captures. The wrapper templates do not exist in any header file -- they are synthesized as raw C++ text and injected into the compilation stream by the backend code generator.

Key Facts

| Property | Value |
|---|---|
| Enable flag | dword_106BF38 (--extended-lambda / --expt-extended-lambda) |
| Source files | class_decl.c (scan), nv_transforms.c (emit), cp_gen_be.c (gen) |
| Device wrapper type | __nv_dl_wrapper_t<Tag, CapturedVarTypePack...> |
| Host-device wrapper type | __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, NeverThrows, Tag, OpFunc, CapturedVarTypePack...> |
| Device bitmap | unk_1286980 (128 bytes, 1024 bits) |
| Host-device bitmap | unk_1286900 (128 bytes, 1024 bits) |
| Max captures supported | 1024 per wrapper type |
| lambda_info allocator | sub_5E92A0 |
| Preamble injection marker | Type named __nv_lambda_preheader_injection |

End-to-End Flow

The extended lambda system spans the entire cudafe++ pipeline -- from parsing through backend emission. Five major functions form the chain:

  FRONTEND (class_decl.c)              BACKEND (cp_gen_be.c + nv_transforms.c)
  ========================             ========================================

  sub_447930 scan_lambda               sub_47ECC0 gen_template (dispatcher)
       |                                    |
       +-- detect annotations               +-- sees __nv_lambda_preheader_injection
       |   (bits at lambda+25)              |
       +-- validate constraints             +-- sub_4864F0 gen_type_decl
       |   (35+ error codes)                |       triggers preamble emission
       |                                    |
       +-- record capture count             +-- sub_6BCC20 nv_emit_lambda_preamble
           in bitmap                        |       emits ALL __nv_* templates
                                            |
                                            +-- sub_47B890 gen_lambda
                                                    emits per-lambda wrapper call

Stage 1: scan_lambda (sub_447930, 2113 lines)

The frontend entry point for all lambda expressions. Called from the expression parser when it encounters [. For extended lambdas, this function performs three critical operations:

  1. Execution space detection -- Walks up the scope stack looking for scope_kind == 17 (function body). Reads execution space byte at offset +182: bit 4 = __device__, bit 5 = __host__. Sets can_be_host and can_be_device flags.

  2. Annotation processing -- Parses the __nv_parent specifier (NVIDIA extension for closure-to-parent linkage) and __host__/__device__ attribute annotations on the lambda expression itself. Sets decision bits at lambda_info + 25.

  3. Validation -- When dword_106BF38 is set, validates that the lambda's execution space is compatible with its enclosing context. Emits errors 3592-3634 and 3689-3690 for violations. Records the capture count in the appropriate bitmap via sub_6BCBF0.

Stage 2: Annotation Detection (Decision Bits)

The scan_lambda function sets bits at lambda_info + 25 that control all downstream behavior:

| Bit | Mask | Meaning | Set when |
| --- | --- | --- | --- |
| bit 3 | 0x08 | Device lambda wrapper needed | Lambda has __device__ annotation |
| bit 4 | 0x10 | Host-device lambda wrapper needed | Lambda has __host__ __device__ |
| bit 5 | 0x20 | Has __nv_parent | __nv_parent pragma parsed in capture list |

Additional flags at lambda_info + 24:

| Bit | Mask | Meaning |
| --- | --- | --- |
| bit 4 | 0x10 | Capture-default is = |
| bit 5 | 0x20 | Capture-default is & |

And at lambda_info + 25 lower bits:

| Bit | Mask | Meaning |
| --- | --- | --- |
| bit 0 | 0x01 | Is generic lambda |
| bit 1 | 0x02 | Has __host__ execution space |
| bit 2 | 0x04 | Has __device__ execution space |

Stage 3: Preamble Trigger (sub_4864F0, gen_type_decl)

During backend code generation, sub_47ECC0 (the master source sequence dispatcher) encounters a type declaration whose name matches __nv_lambda_preheader_injection. This sentinel type is never used by user code -- it exists solely as a trigger. When matched:

  1. The backend emits #line 1 "nvcc_internal_extended_lambda_implementation".
  2. It calls sub_6BCC20 (nv_emit_lambda_preamble) to inject the entire __nv_* template library.
  3. It wraps the trigger type in #if 0 / #endif so it never reaches the host compiler.

Stage 4: Preamble Emission (sub_6BCC20, 244 lines)

This is the single point where all CUDA lambda support templates enter the compilation. It takes a void(*emit)(const char*) callback and emits raw C++ source text. The exact emission order, verified against the decompiled binary, is:

  1. __NV_LAMBDA_WRAPPER_HELPER macro, __nvdl_remove_ref (with T&, T&&, T(&)(Args...) specializations), and __nvdl_remove_const trait helpers
  2. __nv_dl_tag template (device lambda tag type)
  3. Array capture helpers via sub_6BC290 (__nv_lambda_array_wrapper primary + dimension 2-8 specializations, __nv_lambda_field_type primary + array/const-array specializations)
  4. Primary __nv_dl_wrapper_t with static_assert + zero-capture __nv_dl_wrapper_t<Tag> specialization (emitted as a single string literal)
  5. __nv_dl_trailing_return_tag definition + its zero-capture wrapper specialization with __builtin_unreachable() body (emitted as two consecutive string literals)
  6. Device bitmap scan -- iterates unk_1286980 (1024 bits). For each set bit N > 0, calls sub_6BB790(N, emit) to generate two __nv_dl_wrapper_t specializations (standard tag + trailing-return tag) for N captures
  7. __nv_hdl_helper class (anonymous namespace, with fp_copier, fp_deleter, fp_caller, fp_noobject_caller static members + out-of-line definitions)
  8. Primary __nv_hdl_wrapper_t with static_assert
  9. Host-device bitmap scan -- iterates unk_1286900 (1024 bits). For each set bit N (including 0), emits four wrapper specializations per N: sub_6BBB10(0, N) (non-mutable, HasFuncPtrConv=false), sub_6BBEE0(0, N) (mutable, HasFuncPtrConv=false), sub_6BBB10(1, N) (non-mutable, HasFuncPtrConv=true), sub_6BBEE0(1, N) (mutable, HasFuncPtrConv=true)
  10. __nv_hdl_helper_trait_outer with const and non-const operator() specializations, plus conditionally (when dword_126E270 is set for C++17 noexcept-in-type-system) const noexcept and non-const noexcept specializations -- all inside the same struct, closed by \n};
  11. __nv_hdl_create_wrapper_t factory
  12. Type trait helpers: __nv_lambda_trait_remove_const, __nv_lambda_trait_remove_volatile, __nv_lambda_trait_remove_cv (composed from the first two)
  13. __nv_extended_device_lambda_trait_helper + #define __nv_is_extended_device_lambda_closure_type(X) (emitted together in one string)
  14. __nv_lambda_trait_remove_dl_wrapper (unwraps device lambda wrapper to get inner tag)
  15. __nv_extended_device_lambda_with_trailing_return_trait_helper + #define __nv_is_extended_device_lambda_with_preserved_return_type(X) (emitted together)
  16. __nv_extended_host_device_lambda_trait_helper + #define __nv_is_extended_host_device_lambda_closure_type(X) (emitted together)

Note: each SFINAE trait and its corresponding detection macro are emitted as a single a1() call in the decompiled code, not as separate steps. The device bitmap scan skips bit 0 (zero-capture handled by step 4's specialization), but the host-device bitmap scan processes bit 0 (zero-capture host-device wrappers require distinct HasFuncPtrConv specializations).

Stage 5: Per-Lambda Wrapper Emission (sub_47B890, gen_lambda, 336 lines)

For each lambda expression in the translation unit, the backend emits the wrapper call. The decision depends on the bits at lambda_info + 25:

Device lambda (bit 3 set, byte[25] & 0x08):

__nv_dl_wrapper_t< /* closure type tag */ >(/* captured values */)

The original lambda body is wrapped in #if 0 / #endif so it is invisible to the host compiler. The device compiler sees the wrapper struct which provides the captured values as typed fields.

Host-device lambda (bit 4 set, byte[25] & 0x10):

__nv_hdl_create_wrapper_t<IsMutable, HasFuncPtrConv, Tag, CaptureTypes...>
    ::__nv_hdl_create_wrapper( /* lambda expression */, capture_args... )

The lambda expression is emitted inline as the first argument (binds to Lambda &&lam in the factory). The factory internally calls std::move(lam) when heap-allocating. Unlike the device lambda path, the original lambda body is NOT wrapped in #if 0 -- it must be visible to both host and device compilers.

Neither bit set (plain lambda or (byte[25] & 0x06) == 0x02):

Standard lambda emission with no wrapping. If (byte[25] & 0x06) == 0x02 (bit 1 set and bit 2 clear, i.e. __host__ execution space without __device__), emits an empty body placeholder { } with the real body in #if 0 / #endif.
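The three-way decision above can be sketched as a small classifier over the flags byte at lambda_info + 25. This is a minimal sketch, not the decompiled logic: the enum, the function name, and the order in which the bits are tested are assumptions; only the masks (0x08, 0x10, 0x06/0x02) come from the text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names: only the bit masks are taken from the documented layout.
enum class EmissionPath {
    DeviceWrapper,         // __nv_dl_wrapper_t<...>(...) plus #if 0 around the body
    HostDeviceWrapper,     // __nv_hdl_create_wrapper_t<...>::__nv_hdl_create_wrapper(...)
    EmptyBodyPlaceholder,  // "{ }" with the real body under #if 0
    Plain                  // ordinary lambda emission
};

// Classify a lambda from the byte stored at lambda_info + 25.
// Assumption: the device-wrapper bit is tested before the host-device bit.
EmissionPath select_emission_path(std::uint8_t flags_byte_25) {
    if (flags_byte_25 & 0x08) return EmissionPath::DeviceWrapper;      // bit 3
    if (flags_byte_25 & 0x10) return EmissionPath::HostDeviceWrapper;  // bit 4
    // Note the parentheses: in C and C++, == binds tighter than &.
    if ((flags_byte_25 & 0x06) == 0x02) return EmissionPath::EmptyBodyPlaceholder;
    return EmissionPath::Plain;
}
```

A byte of 0x0C (device wrapper needed plus __device__ execution space) selects the device path, while 0x02 alone selects the empty-body placeholder.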

Bitmap System

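Rather than generating all 1024 possible capture-count specializations for each wrapper type, cudafe++ tracks which capture counts were actually used during frontend parsing and emits specializations only for those. This keeps the size of the generated preamble proportional to the capture counts that occur in the translation unit.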

Bitmap Layout

unk_1286980 (device lambda bitmap):
  128 bytes = 16 x uint64 = 1024 bits
  Bit N set  =>  __nv_dl_wrapper_t specialization for N captures is needed

unk_1286900 (host-device lambda bitmap):
  128 bytes = 16 x uint64 = 1024 bits
  Bit N set  =>  __nv_hdl_wrapper_t specializations for N captures are needed

Bitmap Operations

| Function | Address | Operation |
| --- | --- | --- |
| nv_reset_capture_bitmasks | sub_6BCBC0 | Zeroes both 128-byte bitmaps. Called before each translation unit. |
| nv_record_capture_count | sub_6BCBF0 | Sets bit capture_count in the appropriate bitmap. a1 == 0 targets device, a1 != 0 targets host-device. Implementation: ORs 1LL << a2 into result[a2 >> 6]. |
| Scan in sub_6BCC20 | inline | Iterates each uint64 word, shifts right to test each bit, calls the wrapper emitter for each set bit. |

The scan loop in sub_6BCC20 processes 64 bits at a time:

uint64_t *ptr = (uint64_t *)&unk_1286980;
unsigned int idx = 0;
do {
    uint64_t word = *ptr;
    unsigned int limit = idx + 64;
    do {
        if (idx != 0 && (word & 1))
            emit_device_lambda_wrapper(idx, callback);  // sub_6BB790
        ++idx;
        word >>= 1;
    } while (limit != idx);
    ++ptr;
} while (idx != 1024);  // idx == limit here; limit itself is out of scope

Note that bit 0 is never emitted through sub_6BB790 -- the zero-capture case is handled by the __nv_dl_wrapper_t<Tag> specialization that the preamble emits unconditionally alongside the primary template.
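The record-then-scan round trip can be sketched in portable C++ as follows. This is an illustration under stated assumptions, not the decompiled code: CaptureBitmap, record_capture_count, and counts_needing_specialization are hypothetical names, and the shift count is masked explicitly with & 63 because the decompiled 1LL << a2 relies on x86-64 taking 64-bit shift counts modulo 64.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one of the 128-byte bitmaps (16 x 64 = 1024 bits).
struct CaptureBitmap {
    std::uint64_t words[16] = {};
};

// Mark capture count n as used. Precondition: n < 1024.
void record_capture_count(CaptureBitmap &bm, unsigned n) {
    assert(n < 1024);
    bm.words[n >> 6] |= std::uint64_t{1} << (n & 63);
}

// Return every capture count whose bit is set, in ascending order.
// skip_zero models the device scan, which never emits a specialization for bit 0;
// the host-device scan would pass skip_zero = false.
std::vector<unsigned> counts_needing_specialization(const CaptureBitmap &bm,
                                                    bool skip_zero) {
    std::vector<unsigned> out;
    for (unsigned idx = 0; idx < 1024; ++idx) {
        bool is_set = ((bm.words[idx >> 6] >> (idx & 63)) & 1u) != 0;
        if (is_set && !(skip_zero && idx == 0))
            out.push_back(idx);
    }
    return out;
}
```

With counts 0, 1, and 65 recorded, the device-style scan yields 1 and 65, and the host-device-style scan yields 0, 1, and 65.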

The __nv_parent Pragma

__nv_parent is an NVIDIA-specific capture-list extension that provides closure-to-parent class linkage. It appears in the lambda capture list as a special identifier:

auto lam = [__nv_parent = ParentClass, x, y]() __device__ { /* ... */ };

Processing in scan_lambda

During capture list parsing (Phase 3 of sub_447930, around line 584):

  1. The parser checks for a token matching the string "__nv_parent" at address 0x82e284.
  2. If found, calls sub_52FB70 to resolve the parent class by name lookup.
  3. Sets lambda_info + 25 |= 0x20 (bit 5 = has __nv_parent).
  4. Stores the resolved parent class pointer at lambda_info + 32.
  5. If __nv_parent is specified more than once, emits error 3590.
  6. If __nv_parent is specified without __device__, emits error 3634.

The __nv_parent class reference is used during device code generation to establish the relationship between the lambda's closure type and its enclosing class, which is necessary for the device compiler to properly resolve member accesses through the closure.

lambda_info Structure

Allocated by sub_5E92A0. This is the per-lambda metadata node created during scan_lambda and consumed during backend generation.

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| +0 | 8 | captured_variable_list | Head of linked list of capture entries |
| +8 | 8 | closure_class_type_node | Pointer to the closure class type in the IL |
| +16 | 8 | call_operator_symbol | Pointer to the operator() routine entity |
| +24 | 1 | flags_byte_1 | bit 0 = has captures, bit 3 = __host__, bit 4 = __device__, bit 5 = has __nv_parent, bit 6 = is opaque, bit 7 = constexpr const |
| +25 | 1 | flags_byte_2 | bit 0 = is generic, bit 1 = __host__ exec space, bit 2 = __device__ exec space, bit 3 = device wrapper needed, bit 4 = host-device wrapper needed, bit 5 = has __nv_parent |
| +32 | 8 | __nv_parent_class | Parent class pointer (NVIDIA extension) |
| +40 | 4 | lambda_number | Unique lambda index within scope |
| +44 | 4 | source_location | Source position of lambda expression |
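The offsets in the table can be cross-checked with a struct sketch. This is a hypothetical reconstruction for an LP64 target: the struct name, the void * and fixed-width field types, and the field name nv_parent_class (the table's __nv_parent_class) are assumptions, and the real node may contain fields beyond +44 that are not listed here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout sketch; pointer fields are typed void * because the
// pointee types are not described in the table.
struct lambda_info_sketch {
    void         *captured_variable_list;   // +0
    void         *closure_class_type_node;  // +8
    void         *call_operator_symbol;     // +16
    std::uint8_t  flags_byte_1;             // +24
    std::uint8_t  flags_byte_2;             // +25
    // 6 bytes of alignment padding, implied by the +32 offset below
    void         *nv_parent_class;          // +32
    std::uint32_t lambda_number;            // +40
    std::uint32_t source_location;          // +44
};

static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");
static_assert(offsetof(lambda_info_sketch, call_operator_symbol) == 16, "offset +16");
static_assert(offsetof(lambda_info_sketch, flags_byte_1) == 24, "offset +24");
static_assert(offsetof(lambda_info_sketch, flags_byte_2) == 25, "offset +25");
static_assert(offsetof(lambda_info_sketch, nv_parent_class) == 32, "offset +32");
static_assert(offsetof(lambda_info_sketch, lambda_number) == 40, "offset +40");
static_assert(offsetof(lambda_info_sketch, source_location) == 44, "offset +44");
```

The static_asserts confirm that natural alignment alone reproduces the documented offsets, including the gap between +25 and +32.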

Key Functions

| Address | Name (recovered) | Source | Lines | Role |
| --- | --- | --- | --- | --- |
| sub_447930 | scan_lambda | class_decl.c | 2113 | Frontend: parse lambda, validate constraints, record capture count |
| sub_42FE50 | scan_lambda_capture_list | class_decl.c | 524 | Frontend: parse [...] capture list, handle __nv_parent |
| sub_42EE00 | make_field_for_lambda_capture | class_decl.c | 551 | Frontend: create closure class fields for captures |
| sub_42D710 | scan_lambda_capture_list (inner) | class_decl.c | 1025 | Frontend: process individual capture entries |
| sub_42F910 | field_for_lambda_capture | class_decl.c | ~200 | Frontend: resolve capture field via hash lookup |
| sub_436DF0 | Lambda template decl helper | class_decl.c | 65 | Frontend: propagate execution space to call operator template |
| sub_6BCC20 | nv_emit_lambda_preamble | nv_transforms.c | 244 | Backend: emit ALL __nv_* template infrastructure |
| sub_6BB790 | emit_device_lambda_wrapper_specialization | nv_transforms.c | 191 | Backend: emit __nv_dl_wrapper_t<Tag, F1..FN> for N captures |
| sub_6BBB10 | emit_host_device_lambda_wrapper (const) | nv_transforms.c | 238 | Backend: emit __nv_hdl_wrapper_t non-mutable variant |
| sub_6BBEE0 | emit_host_device_lambda_wrapper (mutable) | nv_transforms.c | 236 | Backend: emit __nv_hdl_wrapper_t mutable variant |
| sub_6BC290 | emit_array_capture_helpers | nv_transforms.c | 183 | Backend: emit __nv_lambda_array_wrapper for dim 2-8 |
| sub_6BCBC0 | nv_reset_capture_bitmasks | nv_transforms.c | 9 | Init: zero both 128-byte bitmaps |
| sub_6BCBF0 | nv_record_capture_count | nv_transforms.c | 13 | Record: set bit in device or host-device bitmap |
| sub_6BCDD0 | nv_find_parent_lambda_function | nv_transforms.c | 33 | Query: find enclosing host/device function for nested lambda |
| sub_6BC680 | is_device_or_extended_device_lambda | nv_transforms.c | 16 | Query: test if entity qualifies as device lambda |
| sub_47B890 | gen_lambda | cp_gen_be.c | 336 | Backend: emit per-lambda wrapper construction call |
| sub_4864F0 | gen_type_decl | cp_gen_be.c | 751 | Backend: detect preamble trigger, invoke emission |
| sub_47ECC0 | gen_template (dispatcher) | cp_gen_be.c | 1917 | Backend: master source sequence dispatcher |
| sub_489000 | process_file_scope_entities | cp_gen_be.c | 723 | Backend: entry point, emits lambda macro defines in boilerplate |

Global State

| Variable | Address | Purpose |
| --- | --- | --- |
| dword_106BF38 | 0x106BF38 | Extended lambda mode flag (--extended-lambda) |
| dword_106BF40 | 0x106BF40 | Lambda host-device mode flag |
| unk_1286980 | 0x1286980 | Device lambda capture-count bitmap (128 bytes) |
| unk_1286900 | 0x1286900 | Host-device lambda capture-count bitmap (128 bytes) |
| qword_12868F0 | 0x12868F0 | Entity-to-closure mapping hash table |
| dword_126E270 | 0x126E270 | C++17 noexcept-in-type-system flag (controls noexcept wrapper variants) |
| qword_E7FEC8 | 0xE7FEC8 | Lambda hash table (Robin Hood, 16 bytes/slot, 1024 entries) |
| ptr (E7FE40 area) | 0xE7FE40 | Red-black tree root for lambda numbering per source position |
| dword_E7FE48 | 0xE7FE48 | Red-black tree sentinel node |
| dword_E85700 | 0xE85700 | host_runtime.h already included flag |
| dword_106BDD8 | 0x106BDD8 | OptiX mode flag (triggers error 3689 on incompatible lambdas) |

Concrete End-to-End Example

Consider a user writing this CUDA code with --extended-lambda:

// user.cu
#include <cstdio>
__global__ void kernel(int *out) {
    int scale = 2;
    auto f = [=] __device__ (int x) { return x * scale; };
    out[threadIdx.x] = f(threadIdx.x);
}

Here is the transformation at each stage.

Stage 1: scan_lambda detects the lambda

The frontend parser encounters [=] __device__ (int x) { ... }. sub_447930 runs:

  1. Finds __device__ annotation on the lambda expression.
  2. Sets lambda_info + 25 |= 0x08 (bit 3: device wrapper needed) and lambda_info + 25 |= 0x04 (bit 2: has __device__ exec space).
  3. Sets lambda_info + 24 |= 0x10 (bit 4: capture-default is =).
  4. Counts one capture (scale). Calls sub_6BCBF0(0, 1) to set bit 1 in the device bitmap unk_1286980.
  5. Creates a closure class (compiler-generated name like __lambda_17_16) with one field of type int for the captured scale.

Stage 2: Preamble injection

When the backend encounters the sentinel type __nv_lambda_preheader_injection, sub_6BCC20 emits the template library. Because bit 1 is set in the device bitmap, it calls sub_6BB790(1, emit) which generates a one-capture specialization:

template <typename Tag, typename F1>
struct __nv_dl_wrapper_t<Tag, F1> {
    typename __nv_lambda_field_type<F1>::type f1;
    __nv_dl_wrapper_t(Tag, F1 in1) : f1(in1) { }
    template <typename...U1>
    int operator()(U1...) { return 0; }
};

template <typename U, U func, typename Return, unsigned Id, typename F1>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, F1> {
    typename __nv_lambda_field_type<F1>::type f1;
    __nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>, F1 in1)
        : f1(in1) { }
    template <typename...U1>
    Return operator()(U1...) { __builtin_unreachable(); }
};

Stage 3: Per-lambda wrapper emission

sub_47B890 (gen_lambda) reads byte[25] & 0x08 (device lambda flag is set) and emits the wrapper construction call. The lambda body is hidden from the host compiler:

// Output in .int.c (what the host compiler sees):
__nv_dl_wrapper_t< __nv_dl_tag<
    __NV_LAMBDA_WRAPPER_HELPER(&__lambda_17_16::operator(), &__lambda_17_16::operator()), 0u>,
    int>(
    __nv_dl_tag<
        __NV_LAMBDA_WRAPPER_HELPER(&__lambda_17_16::operator(), &__lambda_17_16::operator()), 0u>{},
    scale)
#if 0
[=] __device__ (int x) { return x * scale; }
#endif

The __NV_LAMBDA_WRAPPER_HELPER(X, Y) macro expands to decltype(X), Y, supplying the tag's first two template arguments: the type parameter U (the function pointer type) and the non-type parameter func (the pointer itself). The trailing 0u is the tag's third argument, the unique ID.

What each compiler sees

Host compiler sees a __nv_dl_wrapper_t<Tag, int> struct with field f1 holding the captured scale. The operator() returns int(0) (never actually called on host). The original lambda body is inside #if 0.

Device compiler sees the same wrapper struct but resolves the tag's encoded function pointer &__lambda_17_16::operator() to call the actual lambda body. The wrapper's f1 field provides the captured scale value.
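The host-side view can be approximated with a self-contained sketch that any C++11 host compiler accepts. All demo_ names are hypothetical stand-ins for the generated __nv_* templates, and demo_closure stands in for the compiler-generated closure class; the sketch only illustrates how the tag encodes the operator() pointer while the wrapper carries the captured value.

```cpp
#include <cassert>

// Stand-in for __nv_dl_tag: type of operator(), the pointer itself, unique ID.
template <typename U, U func, unsigned>
struct demo_dl_tag { };

// Stand-in for the identity case of __nv_lambda_field_type.
template <typename T>
struct demo_field_type { typedef T type; };

// Primary template: traps any capture count without a matching specialization.
template <typename Tag, typename... CapturedVarTypePack>
struct demo_dl_wrapper_t {
    static_assert(sizeof...(CapturedVarTypePack) == 0,
                  "unexpected number of captures!");
};

// One-capture specialization, shaped like the generated text.
template <typename Tag, typename F1>
struct demo_dl_wrapper_t<Tag, F1> {
    typename demo_field_type<F1>::type f1;
    demo_dl_wrapper_t(Tag, F1 in1) : f1(in1) { }
    template <typename... U1>
    int operator()(U1...) { return 0; }  // dummy body on the host
};

// Stand-in for the closure class of [=] __device__ (int x) { return x * scale; }.
struct demo_closure {
    int scale;
    int operator()(int x) const { return x * scale; }
};

typedef demo_dl_tag<decltype(&demo_closure::operator()),
                    &demo_closure::operator(), 0u> demo_tag;

// Mirrors the emitted expression: wrapper<Tag, int>(Tag{}, scale).
inline demo_dl_wrapper_t<demo_tag, int> make_demo_wrapper(int scale) {
    return demo_dl_wrapper_t<demo_tag, int>(demo_tag{}, scale);
}
```

Calling make_demo_wrapper(2) produces an object whose f1 field holds 2 and whose operator() returns 0 regardless of its arguments, which matches the description of what the host compiler sees.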

Architecture: Text Template Approach

NVIDIA's lambda support uses a raw text emission pattern rather than constructing AST nodes. The template infrastructure is generated as C++ source text strings, passed through a callback function:

emit("template <typename Tag, typename...CapturedVarTypePack>\n"
     "struct __nv_dl_wrapper_t {\n"
     "static_assert(sizeof...(CapturedVarTypePack) == 0,"
     "\"nvcc internal error: unexpected number of captures!\");\n"
     "};\n");

This text is emitted to the .int.c output file and subsequently parsed by the host compiler. The device compiler receives the same text through a parallel path. This design is architecturally simpler than building proper AST nodes for the wrapper templates, at the cost of the templates existing only as generated text rather than first-class IL entities.

The preamble injection point is controlled by a sentinel type declaration: when the backend encounters a type named __nv_lambda_preheader_injection, it emits the entire template library and wraps the sentinel in #if 0. This guarantees the templates appear exactly once, before any lambda expression that references them, regardless of declaration ordering in the user's source.

Device Lambda Wrapper (__nv_dl_wrapper_t)

When a C++ lambda is annotated __device__ inside CUDA code compiled with --extended-lambda, the closure class that the frontend creates has host linkage only -- it cannot be instantiated on the device. The device lambda wrapper system solves this by replacing the lambda expression at the call site with a construction of __nv_dl_wrapper_t<Tag, F1, ..., FN>, a template struct whose type parameters encode the lambda's identity (via Tag) and whose fields store the captured variables in device-accessible storage. The wrapper struct has a dummy operator() that never executes real code on the device side -- its purpose is purely to carry captured state across the host/device boundary. The actual device-side call is dispatched through the tag type, which encodes a function pointer to the lambda's operator() as a non-type template parameter.

Two tag types exist. __nv_dl_tag is the standard tag for lambdas with auto-deduced return types. __nv_dl_trailing_return_tag handles lambdas with explicit trailing return types, preserving the user-specified return type through the wrapper. Both tag types carry the lambda's operator() function pointer and a unique ID as template parameters.

The wrapper template does not exist in any header file. It is synthesized as raw C++ text by sub_6BB790 (emit_device_lambda_wrapper_specialization) in nv_transforms.c and injected into the compilation stream during preamble emission. Only the capture counts actually used in the translation unit are emitted, controlled by a 1024-bit bitmap at unk_1286980.

Key Facts

| Property | Value |
| --- | --- |
| Wrapper type | __nv_dl_wrapper_t<Tag, CapturedVarTypePack...> |
| Standard tag | __nv_dl_tag<U, func, unsigned> |
| Trailing-return tag | __nv_dl_trailing_return_tag<U, func, Return, unsigned> |
| Specialization emitter | sub_6BB790 (emit_device_lambda_wrapper_specialization, 191 lines) |
| Per-lambda emission | sub_47B890 (gen_lambda, 336 lines, cp_gen_be.c) |
| Preamble master emitter | sub_6BCC20 (nv_emit_lambda_preamble, 244 lines) |
| Capture bitmap | unk_1286980 (128 bytes = 1024 bits, device lambda) |
| Bitmap setter | sub_6BCBF0 (nv_record_capture_count, 13 lines) |
| Max supported captures | 1024 |
| Source file | nv_transforms.c (specialization emitter), cp_gen_be.c (per-lambda call) |
| Field type trait | __nv_lambda_field_type<T> |

Primary Template and Zero-Capture Specialization

The primary template is a static_assert trap -- any instantiation with a non-zero variadic pack that was not explicitly specialized triggers a compilation error. The zero-capture specialization (Tag only, no F parameters) provides a trivial constructor and a dummy operator() returning 0.

This code is emitted verbatim as a single string literal from sub_6BCC20:

// Exact binary string (emitted as a single a1() call in sub_6BCC20):
template <typename Tag,typename...CapturedVarTypePack>
struct __nv_dl_wrapper_t {
static_assert(sizeof...(CapturedVarTypePack) == 0,"nvcc internal error: unexpected number of captures!");
};
template <typename Tag>
struct __nv_dl_wrapper_t<Tag> {
__nv_dl_wrapper_t(Tag) { }
template <typename...U1>
int operator()(U1...) { return 0; }
};

Note: no space after the comma in Tag,typename... and no indentation -- this is the literal text injected into the .int.c output. The primary template and the zero-capture specialization are emitted as a single string literal.

The primary template's static_assert acts as a safety net: if the frontend records a capture count of N but fails to emit the corresponding N-capture specialization, the host compiler will produce a diagnostic rather than silently generating broken code. The zero-capture specialization's operator() returns int(0) -- this value is never used at runtime because the device compiler dispatches through the tag's encoded function pointer, not through the wrapper's operator().

Tag Types

__nv_dl_tag

The standard device lambda tag. Three template parameters encode the lambda identity. Exact binary string:

template <typename U, U func, unsigned>
struct __nv_dl_tag { };

The string is "\ntemplate <typename U, U func, unsigned>\nstruct __nv_dl_tag { };\n" -- note the leading newline.

| Parameter | Role |
| --- | --- |
| U | Type of the lambda's operator() (deduced via decltype) |
| func | Non-type template parameter: pointer to the lambda's operator() |
| unsigned | Unnamed parameter: unique ID disambiguating lambdas with identical operator types |

The __NV_LAMBDA_WRAPPER_HELPER(X, Y) macro (emitted at preamble start) expands to decltype(X), Y, providing the U, func pair from a single expression. The full macro and helper text emitted as the first a1() call:

#define __NV_LAMBDA_WRAPPER_HELPER(X, Y) decltype(X), Y
template <typename T>
struct __nvdl_remove_ref { typedef T type; };

template<typename T>
struct __nvdl_remove_ref<T&> { typedef T type; };

template<typename T>
struct __nvdl_remove_ref<T&&> { typedef T type; };

template <typename T, typename... Args>
struct __nvdl_remove_ref<T(&)(Args...)> {
  typedef T(*type)(Args...);
};

template <typename T>
struct __nvdl_remove_const { typedef T type; };

template <typename T>
struct __nvdl_remove_const<T const> { typedef T type; };

The __nvdl_remove_ref specialization for function references (T(&)(Args...)) is notable: it converts a function reference type to a function pointer type (T(*)(Args...)). This handles the case where a lambda captures a function by reference -- the wrapper field needs a copyable function pointer, not a reference.
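The effect of the function-reference specialization can be checked at compile time. The sketch below copies the trait under a hypothetical name (remove_ref_sketch) so it does not collide with the generated __nvdl_remove_ref, and uses std::is_same, which is not part of the emitted preamble, purely for verification.

```cpp
#include <type_traits>

// Same shape as the emitted trait, renamed for the sketch.
template <typename T>
struct remove_ref_sketch { typedef T type; };

template <typename T>
struct remove_ref_sketch<T&> { typedef T type; };

template <typename T>
struct remove_ref_sketch<T&&> { typedef T type; };

// More specialized than T&: a reference to function becomes a function pointer.
template <typename T, typename... Args>
struct remove_ref_sketch<T(&)(Args...)> {
    typedef T (*type)(Args...);
};

// Ordinary references are simply stripped.
static_assert(std::is_same<remove_ref_sketch<int&>::type, int>::value,
              "lvalue reference stripped");
static_assert(std::is_same<remove_ref_sketch<int&&>::type, int>::value,
              "rvalue reference stripped");
// A function reference yields a copyable pointer, not the function type itself.
static_assert(std::is_same<remove_ref_sketch<int (&)(float)>::type,
                           int (*)(float)>::value,
              "function reference converted to function pointer");
```

Without the last specialization, int (&)(float) would match the T& case and yield the function type int(float), which cannot be stored as a struct field.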

__nv_dl_trailing_return_tag

For lambdas with explicit trailing return types (-> ReturnType), a separate tag preserves the return type:

template <typename U, U func, typename Return, unsigned>
struct __nv_dl_trailing_return_tag { };

The additional Return parameter carries the user-specified return type. This is necessary because the wrapper's operator() must return this type rather than int, and the body uses __builtin_unreachable() to satisfy the compiler without generating actual return-value code.

Trailing-Return Zero-Capture Specialization

The zero-capture variant for trailing-return lambdas uses __builtin_unreachable() instead of return 0. The exact binary text (emitted as two consecutive a1() calls):

template <typename U, U func, typename Return, unsigned>
struct __nv_dl_trailing_return_tag { };

template <typename U, U func, typename Return, unsigned Id>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id> > {
  __nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>) { }

  template <typename...U1> Return operator()(U1...) { __builtin_unreachable(); }
};

Note: the __nv_dl_trailing_return_tag definition and its zero-capture wrapper specialization are emitted together (two strings in immediate succession: the first ends at { before __builtin_unreachable, the second contains __builtin_unreachable(); }\n}; \n\n -- note the trailing space before the newlines).

The __builtin_unreachable() tells the compiler this code path is never taken, so no return value needs to be materialized. This is safe because the wrapper's operator() is never called on the device side -- the device compiler resolves the call through the tag's encoded function pointer directly.

Per-Capture-Count Specialization Generator (sub_6BB790)

The function sub_6BB790 generates partial specializations of __nv_dl_wrapper_t for a specific capture count N. It takes two arguments: the capture count (unsigned int a1) and an emit callback (void(*a2)(const char*)). For each N, it emits two struct specializations: one for __nv_dl_tag and one for __nv_dl_trailing_return_tag.

Generated Template Structure (N captures)

For a lambda capturing N variables, sub_6BB790(N, emit) produces:

// Standard tag specialization
template <typename Tag, typename F1, typename F2, ..., typename FN>
struct __nv_dl_wrapper_t<Tag, F1, F2, ..., FN> {
    typename __nv_lambda_field_type<F1>::type f1;
    typename __nv_lambda_field_type<F2>::type f2;
    ...
    typename __nv_lambda_field_type<FN>::type fN;

    __nv_dl_wrapper_t(Tag, F1 in1, F2 in2, ..., FN inN)
        : f1(in1), f2(in2), ..., fN(inN) { }

    template <typename...U1>
    int operator()(U1...) { return 0; }
};

// Trailing-return tag specialization
template <typename U, U func, typename Return, unsigned Id,
          typename F1, typename F2, ..., typename FN>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>,
                          F1, F2, ..., FN> {
    typename __nv_lambda_field_type<F1>::type f1;
    typename __nv_lambda_field_type<F2>::type f2;
    ...
    typename __nv_lambda_field_type<FN>::type fN;

    __nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>,
                      F1 in1, F2 in2, ..., FN inN)
        : f1(in1), f2(in2), ..., fN(inN) { }

    template <typename...U1>
    Return operator()(U1...) { __builtin_unreachable(); }
};

__nv_lambda_field_type Indirection

Each field is declared as typename __nv_lambda_field_type<Fi>::type fi rather than Fi fi. This indirection allows the lambda infrastructure to intercept array types (which cannot be captured by value in C++) and replace them with __nv_lambda_array_wrapper instances that perform element-by-element copying. The primary template is an identity transform:

template <typename T>
struct __nv_lambda_field_type {
    typedef T type;
};

Specializations for array types (emitted by sub_6BC290) map T[D1]...[DN] to __nv_lambda_array_wrapper<T[D1]...[DN]>, and const T[D1]...[DN] to const __nv_lambda_array_wrapper<T[D1]...[DN]>.
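The indirection can be illustrated with a one-dimensional sketch. The names array_wrapper_sketch and field_type_sketch are hypothetical, and the wrapper body (a plain element-by-element copy in the constructor) is an assumption about what such a wrapper might do; the actual emitted __nv_lambda_array_wrapper text is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical array wrapper: makes a built-in array copyable as a struct field.
template <typename T>
struct array_wrapper_sketch;

template <typename T, std::size_t N>
struct array_wrapper_sketch<T[N]> {
    T arr[N];
    // Copy the source array one element at a time.
    array_wrapper_sketch(const T (&in)[N]) {
        for (std::size_t i = 0; i < N; ++i) arr[i] = in[i];
    }
};

// Identity for ordinary types ...
template <typename T>
struct field_type_sketch { typedef T type; };

// ... but arrays are redirected to the wrapper.
template <typename T, std::size_t N>
struct field_type_sketch<T[N]> {
    typedef array_wrapper_sketch<T[N]> type;
};

// Const arrays map to a const wrapper of the non-const element type.
template <typename T, std::size_t N>
struct field_type_sketch<const T[N]> {
    typedef const array_wrapper_sketch<T[N]> type;
};

static_assert(std::is_same<field_type_sketch<int>::type, int>::value,
              "scalars pass through unchanged");
static_assert(std::is_same<field_type_sketch<int[3]>::type,
                           array_wrapper_sketch<int[3]> >::value,
              "arrays are wrapped");
```

A wrapper field declared as field_type_sketch<int[3]>::type can then be initialized directly from an int[3] constructor argument, which a raw array member could not be.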

Emission Mechanics

The decompiled sub_6BB790 reveals that emission is entirely sprintf-based: each C++ source fragment is built in a 1064-byte stack buffer (v29[1064]) and passed through the emit callback. The function has two major branches:

Branch 1: a1 == 0 (zero captures) -- Dead code. Falls through to emit __nv_dl_wrapper_t(Tag,) : with a trailing comma and empty initializer list, which would produce syntactically invalid C++. This path is never reached because the bitmap scan loop in sub_6BCC20 skips bit 0 (if (v2 && (v3 & 1) != 0)). The zero-capture case is handled instead by the __nv_dl_wrapper_t<Tag> specialization that sub_6BCC20 emits unconditionally as a string literal alongside the primary template.

Branch 2: a1 > 0 (N captures) -- Generates the N-ary specializations through seven sequential loops:

Loop 1:  Emit template parameter list    ", typename F1, ..., typename FN"
Loop 2:  Emit partial specialization      ", F1, ..., FN"
Loop 3:  Emit field declarations          "typename __nv_lambda_field_type<Fi>::type fi;\n"
Loop 4:  Emit constructor parameters      "F1 in1, F2 in2, ..., FN inN"
Loop 5:  Emit initializer list            "f1(in1), f2(in2), ..., fN(inN)"
         Emit operator() with "return 0"
         Then repeat Loops 1-5 for __nv_dl_trailing_return_tag variant
Loop 6:  Same parameter/field emission for trailing-return variant
Loop 7:  Same initializer list for trailing-return variant
         Emit operator() with __builtin_unreachable()

Each loop uses sprintf(v29, "...", index) for numbered parameters and a2(v29) to emit the fragment. The first element in each comma-separated list is handled specially (no leading comma), with subsequent elements prefixed by ", ".
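The fragment-by-fragment emission can be sketched for the standard-tag variant only. This is an illustrative reimplementation, not the decompiled function: emit_dl_wrapper_sketch, render_dl_wrapper_sketch, and the collecting sink are hypothetical, snprintf replaces sprintf, and the glue strings "> {\n" and ") : " are assumptions chosen to make the output well-formed. The numbered format strings follow the literals recovered from the binary.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

using emit_fn = void (*)(const char *);

// Emit the standard-tag specialization for n captures. Precondition: n >= 1.
void emit_dl_wrapper_sketch(unsigned n, emit_fn emit) {
    char buf[128];
    emit("\ntemplate <typename Tag");
    for (unsigned i = 1; i <= n; ++i) {            // template parameter list
        std::snprintf(buf, sizeof buf, ", typename F%u", i);
        emit(buf);
    }
    emit(">\nstruct __nv_dl_wrapper_t<Tag");
    for (unsigned i = 1; i <= n; ++i) {            // specialization arguments
        std::snprintf(buf, sizeof buf, ", F%u", i);
        emit(buf);
    }
    emit("> {\n");                                 // assumed glue string
    for (unsigned i = 1; i <= n; ++i) {            // field declarations
        std::snprintf(buf, sizeof buf,
                      "typename __nv_lambda_field_type<F%u>::type f%u;\n", i, i);
        emit(buf);
    }
    emit("__nv_dl_wrapper_t(Tag,");
    for (unsigned i = 1; i <= n; ++i) {            // constructor parameters
        if (i > 1) emit(", ");
        std::snprintf(buf, sizeof buf, "F%u in%u", i, i);
        emit(buf);
    }
    emit(") : ");                                  // assumed glue string
    for (unsigned i = 1; i <= n; ++i) {            // initializer list
        if (i > 1) emit(", ");
        std::snprintf(buf, sizeof buf, "f%u(in%u)", i, i);
        emit(buf);
    }
    emit(" { }\ntemplate <typename...U1>\nint operator()(U1...) { return 0; }\n};\n");
}

// Test-only sink that concatenates every fragment into one string.
static std::string g_sink;
static void collect(const char *s) { g_sink += s; }

std::string render_dl_wrapper_sketch(unsigned n) {
    g_sink.clear();
    emit_dl_wrapper_sketch(n, collect);
    return g_sink;
}
```

Rendering with n = 2 yields a struct whose constructor line reads __nv_dl_wrapper_t(Tag,F1 in1, F2 in2) : f1(in1), f2(in2), which shows why the first list element is emitted without a leading separator.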

Key string literals used by sub_6BB790 (extracted from binary):

| String | Purpose |
| --- | --- |
| "\ntemplate <typename Tag" | Opens template parameter list |
| ", typename F%u" | Each additional type parameter |
| ">\nstruct __nv_dl_wrapper_t<Tag" | Opens partial specialization |
| ", F%u" | Each type argument in specialization |
| "typename __nv_lambda_field_type<F%u>::type f%u;\n" | Field declaration |
| "__nv_dl_wrapper_t(Tag," | Constructor declaration (standard tag) |
| "F%u in%u" | Constructor parameter |
| "f%u(in%u)" | Initializer list entry |
| " { }\ntemplate <typename...U1>\nint operator()(U1...) { return 0; }\n};\n" | Standard operator() |
| "__nv_dl_trailing_return_tag<U, func, Return, Id>" | Trailing-return tag name |
| " { }\ntemplate <typename...U1>\nReturn operator()(U1...) " | Trailing-return operator() |
| "{ __builtin_unreachable(); }\n};\n\n" | Unreachable body |

Concrete Example: 2 Captures

For a lambda capturing two variables, sub_6BB790(2, emit) produces:

template <typename Tag, typename F1, typename F2>
struct __nv_dl_wrapper_t<Tag, F1, F2> {
    typename __nv_lambda_field_type<F1>::type f1;
    typename __nv_lambda_field_type<F2>::type f2;
    __nv_dl_wrapper_t(Tag, F1 in1, F2 in2) : f1(in1), f2(in2) { }
    template <typename...U1>
    int operator()(U1...) { return 0; }
};

template <typename U, U func, typename Return, unsigned Id,
          typename F1, typename F2>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>,
                          F1, F2> {
    typename __nv_lambda_field_type<F1>::type f1;
    typename __nv_lambda_field_type<F2>::type f2;
    __nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>,
                      F1 in1, F2 in2) : f1(in1), f2(in2) { }
    template <typename...U1>
    Return operator()(U1...) { __builtin_unreachable(); }
};

Per-Lambda Wrapper Emission (sub_47B890)

The backend code generator sub_47B890 (gen_lambda in cp_gen_be.c) handles the per-lambda transformation at each lambda expression's usage site. It reads the decision bits at lambda_info + 25 and emits a wrapper construction call that replaces the lambda expression in the output .int.c file.

Device Lambda Path (bit 3 set: byte[25] & 0x08)

When the device lambda flag is set, the emitter produces a wrapper construction expression followed by a #if 0 block that hides the original lambda body from the host compiler:

// sub_47B890, decompiled lines 46-58
if ((v2 & 8) != 0) {
    sub_467E50("__nv_dl_wrapper_t< ");   // open wrapper type
    sub_475820(a1);                       // emit tag type (closure class)
    sub_46E640(a1);                       // emit capture type list
    sub_467E50(">( ");                    // close template args, open ctor
    sub_475820(a1);                       // emit tag constructor arg
    sub_467E50("{} ");                    // empty-brace tag construction
    sub_46E550(*a1);                      // emit captured value expressions
    sub_467E50(") ");                     // close ctor call
    sub_46BC80("#if 0");                  // suppress original lambda
    --dword_1065834;                      // adjust nesting depth
    sub_467D60();                         // newline
}

The generated output for a device lambda with two captures looks like:

__nv_dl_wrapper_t< __nv_dl_tag<decltype(&ClosureType::operator()),
    &ClosureType::operator(), 0u>, int, float>(
    __nv_dl_tag<decltype(&ClosureType::operator()),
    &ClosureType::operator(), 0u>{}, x, y)
#if 0
// original lambda body hidden from host compiler
[x, y]() __device__ { /* ... */ }
#endif

The #if 0 suppression ensures the host compiler never attempts to parse the device lambda body, which may contain device-only intrinsics and constructs. The device compiler sees the wrapper struct and resolves the call through the tag type's encoded function pointer.

Body Suppression for Host-Only Pass (bit pattern (byte[25] & 0x06) == 0x02)

A separate suppression path handles lambdas where the body should not be compiled on the current pass. In this case, the emitter outputs an empty body { } and wraps the real body in #if 0 / #endif:

// sub_47B890, decompiled lines 290-306
if ((*(_BYTE *)(a1 + 25) & 6) == 2) {
    sub_467D60();             // newline
    sub_468190("{ }");        // empty body placeholder
    sub_46BC80("#if 0");      // start suppression
    --dword_1065834;
    sub_467D60();
}
// ... emit original body under #if 0 ...
sub_47AEF0(body, 0);         // emit body (invisible due to #if 0)
if ((*(_BYTE *)(a1 + 25) & 6) == 2) {
    sub_46BC80("#endif");     // end suppression
    --dword_1065834;
    sub_467D60();
    dword_1065820 = 0;
    qword_1065828 = 0;
}

After the body emission completes, the device lambda path also emits a matching #endif to close the #if 0 block opened at the wrapper call:

// sub_47B890, decompiled lines 312-320
if ((v29 & 8) != 0) {          // device lambda
    sub_46BC80("#endif");       // close #if 0 from wrapper call
    --dword_1065834;
    sub_467D60();
    dword_1065820 = 0;
    qword_1065828 = 0;
}

Host-Device Lambda Path (bit 4 set: byte[25] & 0x10)

Host-device lambdas take a different path through __nv_hdl_create_wrapper_t rather than __nv_dl_wrapper_t. This is covered in the Host-Device Lambda Wrapper page.

Bitmap-Driven Emission

Only capture counts that were actually used during frontend parsing get specializations emitted. The scan loop in sub_6BCC20 processes the 128-byte bitmap at unk_1286980 as an array of 16 uint64_t values:

uint64_t *ptr = (uint64_t *)&unk_1286980;
unsigned int idx = 0;
do {
    uint64_t word = *ptr;
    unsigned int limit = idx + 64;
    do {
        if (idx != 0 && (word & 1))       // skip bit 0 (handled by primary)
            sub_6BB790(idx, callback);     // emit N-capture specialization
        ++idx;
        word >>= 1;
    } while (limit != idx);
    ++ptr;
} while (limit != 1024);

Bit 0 is skipped because the zero-capture case is already handled by the primary template's __nv_dl_wrapper_t<Tag> specialization (emitted unconditionally as a string literal). For each remaining set bit N, sub_6BB790(N, emit) produces two structs (standard tag and trailing-return tag). A translation unit using lambdas with 1, 3, and 5 captures therefore emits exactly 6 wrapper struct specializations, rather than the 2,046 (1,023 non-zero capture counts x 2 tag variants) that exhaustive generation would produce.
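The record-then-scan scheme can be modeled in a few lines. The following is a sketch under stated assumptions: `CaptureBitmap`, `record`, and `counts_to_emit` are hypothetical names, and the recording side is only inferred from the described role of sub_6BCBF0 (set bit N). The scan mirrors the decompiled loop, including the skip of bit 0.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the 1024-bit capture-count bitmap (16 x uint64_t).
struct CaptureBitmap {
    std::uint64_t words[16] = {};

    // Remember that a lambda with n captures was seen (role of sub_6BCBF0).
    void record(unsigned n) {
        assert(n < 1024);
        words[n / 64] |= std::uint64_t{1} << (n % 64);
    }
};

// Walk all 1024 bits in the same order as the scan loop and return the
// capture counts that need a specialization. Bit 0 is skipped because the
// zero-capture case is covered elsewhere.
std::vector<unsigned> counts_to_emit(const CaptureBitmap &bm) {
    std::vector<unsigned> out;
    unsigned idx = 0;
    for (const std::uint64_t stored : bm.words) {
        std::uint64_t word = stored;
        for (unsigned limit = idx + 64; idx != limit; ++idx, word >>= 1) {
            if (idx != 0 && (word & 1))
                out.push_back(idx);
        }
    }
    return out;
}
```

Each returned count corresponds to one sub_6BB790 call, so the number of emitted structs is twice the length of the returned list.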

Detection Traits

After all wrapper specializations are emitted, sub_6BCC20 emits SFINAE trait templates that allow compile-time detection of device-lambda wrapper types. These are emitted AFTER the host-device wrapper infrastructure (steps 7-12 in the emission sequence), not immediately after the device bitmap scan. Each trait + its #define macro is emitted as a single a1() call:

// Emitted as one string (step 13 in sub_6BCC20):
template <typename T>
struct __nv_extended_device_lambda_trait_helper {
  static const bool value = false;
};
template <typename T1, typename...Pack>
struct __nv_extended_device_lambda_trait_helper<__nv_dl_wrapper_t<T1, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_device_lambda_closure_type(X) __nv_extended_device_lambda_trait_helper< typename __nv_lambda_trait_remove_cv<X>::type>::value

Note: in the binary, the #define is a single line (no backslash continuation). The 2-space indentation on static const bool matches the binary exactly.

An unwrapper trait strips the wrapper to recover the inner tag type (step 14 in emission):

template<typename T> struct __nv_lambda_trait_remove_dl_wrapper { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_dl_wrapper< __nv_dl_wrapper_t<T> > { typedef T type; };

A separate trait detects whether a wrapper uses a trailing-return tag (step 15 in emission):

template <typename T>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper {
  static const bool value = false;
};
template <typename U, U func, typename Return, unsigned Id, typename...Pack>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper<__nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) __nv_extended_device_lambda_with_trailing_return_trait_helper< typename __nv_lambda_trait_remove_cv<X>::type >::value

Note: the emission order in sub_6BCC20 is: device trait (step 13), then __nv_lambda_trait_remove_dl_wrapper (step 14), then trailing-return trait (step 15), then host-device trait (step 16). The unwrapper appears between the two detection traits, not after both of them.

These traits and macros enable the CUDA runtime headers and device compiler to distinguish wrapped device lambdas from ordinary closure types at compile time, which is necessary for proper template argument deduction in kernel launch expressions.
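The pattern behind these traits can be exercised in isolation. The sketch below uses hypothetical, non-reserved names (`dl_tag`, `dl_trailing_return_tag`, `dl_wrapper`, and the three traits) in place of the generated `__nv_` types, and it uses a plain function pointer as the tag's non-type argument, where the real tag encodes a pointer to the closure's operator(). It is only meant to show how the partial specializations select.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the generated tag and wrapper types.
template <typename U, U func, unsigned Id> struct dl_tag {};
template <typename U, U func, typename Return, unsigned Id> struct dl_trailing_return_tag {};
template <typename Tag, typename... Captures> struct dl_wrapper {};

// Detection trait: true only for dl_wrapper<...>.
template <typename T> struct is_dl_wrapper : std::false_type {};
template <typename T1, typename... Pack>
struct is_dl_wrapper<dl_wrapper<T1, Pack...> > : std::true_type {};

// Trailing-return detection: true only when the first argument is the trailing-return tag.
template <typename T> struct has_trailing_return_tag : std::false_type {};
template <typename U, U func, typename Return, unsigned Id, typename... Pack>
struct has_trailing_return_tag<
    dl_wrapper<dl_trailing_return_tag<U, func, Return, Id>, Pack...> > : std::true_type {};

// Unwrapper: like the emitted string, only the single-argument form is specialized.
template <typename T> struct remove_dl_wrapper { typedef T type; };
template <typename T> struct remove_dl_wrapper<dl_wrapper<T> > { typedef T type; };

// Analogue of the macro: strip cv-qualifiers before testing.
template <typename X>
struct is_extended_device_lambda : is_dl_wrapper<typename std::remove_cv<X>::type> {};

int sample_fn(int v) { return v + 1; }  // used only as a non-type template argument
typedef dl_tag<int (*)(int), &sample_fn, 0u> PlainTag;
typedef dl_trailing_return_tag<int (*)(int), &sample_fn, int, 0u> TrailingTag;
typedef dl_wrapper<PlainTag, int, float> TwoCaptureWrapper;
typedef dl_wrapper<TrailingTag> ZeroCaptureTrailingWrapper;

static_assert(is_extended_device_lambda<const TwoCaptureWrapper>::value, "cv is stripped first");
static_assert(!is_extended_device_lambda<int>::value, "ordinary types are rejected");
static_assert(has_trailing_return_tag<ZeroCaptureTrailingWrapper>::value, "trailing tag detected");
static_assert(!has_trailing_return_tag<TwoCaptureWrapper>::value, "plain tag is not a trailing tag");
static_assert(std::is_same<remove_dl_wrapper<ZeroCaptureTrailingWrapper>::type, TrailingTag>::value,
              "unwrapper recovers the tag");
```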

Function Map

| Address | Name (recovered) | Source | Lines | Role |
|---|---|---|---|---|
| sub_6BB790 | emit_device_lambda_wrapper_specialization | nv_transforms.c | 191 | Emit __nv_dl_wrapper_t<Tag, F1..FN> for N captures (both tag variants) |
| sub_6BCC20 | nv_emit_lambda_preamble | nv_transforms.c | 244 | Master emitter: primary template, zero-capture, bitmap scan, traits |
| sub_6BCBF0 | nv_record_capture_count | nv_transforms.c | 13 | Set bit N in device or host-device bitmap |
| sub_6BCBC0 | nv_reset_capture_bitmasks | nv_transforms.c | 9 | Zero both 128-byte bitmaps before each TU |
| sub_47B890 | gen_lambda | cp_gen_be.c | 336 | Per-lambda wrapper call emission in .int.c output |
| sub_467E50 | emit_string | cp_gen_be.c | -- | Low-level string emitter to output buffer |
| sub_46BC80 | emit_preprocessor_directive | cp_gen_be.c | -- | Emit #if 0 / #endif suppression blocks |
| sub_475820 | emit_closure_tag_type | cp_gen_be.c | -- | Emit tag type for wrapper construction |
| sub_46E640 | emit_capture_type_list | cp_gen_be.c | -- | Emit template argument list of capture types |
| sub_46E550 | emit_capture_value_list | cp_gen_be.c | -- | Emit constructor arguments (captured values) |
| sub_6BC290 | emit_array_capture_helpers | nv_transforms.c | 183 | Emit __nv_lambda_array_wrapper for dim 2-8 |

Global State

| Variable | Address | Purpose |
|---|---|---|
| unk_1286980 | 0x1286980 | Device lambda capture-count bitmap (128 bytes, 1024 bits) |
| dword_106BF38 | 0x106BF38 | --extended-lambda mode flag (enables entire system) |
| dword_1065834 | 0x1065834 | Preprocessor nesting depth (decremented after each #if 0 and #endif emission) |
| dword_1065820 | 0x1065820 | Output state flag (reset after #endif emission) |
| qword_1065828 | 0x1065828 | Output state pointer (reset after #endif emission) |

Host-Device Lambda Wrapper

The __nv_hdl_wrapper_t template is cudafe++'s type-erased wrapper for __host__ __device__ extended lambdas. Unlike the device-only __nv_dl_wrapper_t, which is a simple aggregate of captured fields, the host-device wrapper must compile and operate on both the host (through the host compiler) and the device (through the device compilation path, starting with cicc). This dual requirement forces a fundamentally different design: the wrapper uses void*-based type erasure with a manager<Lambda> inner struct that provides do_copy, do_call, and do_delete operations as static functions. The Lambda type is known only inside the constructor -- after construction, all operations go through the type-erased function pointer table stored in __nv_hdl_helper.

A second, lightweight path exists for lambdas that have no captures and can convert to a raw function pointer. When HasFuncPtrConv=true, the wrapper skips heap allocation entirely and stores the lambda directly as a function pointer in fp_noobject_caller, and it provides an operator __opfunc_t*() conversion operator.

Both paths are generated as raw C++ source text by two nearly-identical emitter functions in nv_transforms.c: sub_6BBB10 (non-mutable, IsMutable=false, const operator()) and sub_6BBEE0 (mutable, IsMutable=true, non-const operator()). For each capture count N observed during frontend parsing, the preamble emitter (sub_6BCC20) calls each function twice -- once with HasFuncPtrConv=0 and once with HasFuncPtrConv=1 -- producing four partial specializations per capture count: (non-mutable, no-fptr), (mutable, no-fptr), (non-mutable, fptr), (mutable, fptr).

Key Facts

| Property | Value |
|---|---|
| Full template signature | __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, NeverThrows, Tag, OpFunc, Captures...> |
| Source file | nv_transforms.c (EDG 6.6) |
| Non-mutable emitter | sub_6BBB10 (238 lines, IsMutable=false) |
| Mutable emitter | sub_6BBEE0 (236 lines, IsMutable=true) |
| Helper class | __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...> (anonymous namespace) |
| Factory | __nv_hdl_create_wrapper_t<IsMutable, HasFuncPtrConv, Tag, CaptureArgs...> |
| Trait deduction | __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...> |
| Bitmap | unk_1286900 (128 bytes, 1024 bits) |
| Primary template static_assert | "nvcc internal error: unexpected number of captures in __host__ __device__ lambda!" |
| Specializations per capture count | 4 (2 mutability x 2 HasFuncPtrConv); each of the 4 emitter calls made by sub_6BCC20 emits one specialization |
| Noexcept variants | Additional 2 trait specializations when dword_126E270 is set (C++17) |

Template Parameters

template <bool IsMutable,       // false = const operator(), true = non-const
          bool HasFuncPtrConv,  // true = captureless, function pointer path
          bool NeverThrows,     // maps to noexcept(NeverThrows)
          typename Tag,         // unique tag type per lambda site
          typename OpFunc,      // operator() signature as R(Args...)
          typename... CapturedVarTypePack>  // captured variable types F1..FN
struct __nv_hdl_wrapper_t;

| Parameter | Role |
|---|---|
| IsMutable | Controls whether operator() is const. false for lambdas without mutable keyword (the common case), true for mutable lambdas. Emitted as "false," by sub_6BBB10 and "true," by sub_6BBEE0. |
| HasFuncPtrConv | true when the lambda has no captures and can be implicitly converted to a function pointer. Enables the lightweight fp_noobject_caller path instead of heap allocation. Passed as a1 to the emitter functions. |
| NeverThrows | Propagated to noexcept(NeverThrows) on operator(). Set to true only when dword_126E270 is active (C++17 noexcept-in-type-system) and the lambda's operator() is declared noexcept. |
| Tag | A unique type tag generated per lambda call site, used to give each __nv_hdl_helper instantiation its own static function pointer storage. Same tag system as device lambdas. |
| OpFunc | The lambda's call signature decomposed as OpFuncR(OpFuncArgs...). Used to type the function pointers in __nv_hdl_helper and the wrapper's operator(). |
| CapturedVarTypePack | F1, F2, ..., FN -- one type per captured variable. Each becomes a field typename __nv_lambda_field_type<Fi>::type fi in the wrapper struct. |

The __nv_hdl_helper Class

Before any __nv_hdl_wrapper_t specialization is emitted, sub_6BCC20 emits the __nv_hdl_helper class inside an anonymous namespace. This class holds the static function pointers that enable type erasure -- the Lambda type is known when the constructor assigns the pointers, but the operator(), copy constructor, and destructor access them without knowing the concrete Lambda type.

// Exact binary string (emitted as a single a1() call):
namespace {template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
struct __nv_hdl_helper {
  typedef void * (*fp_copier_t)(void *);
  typedef OpFuncR (*fp_caller_t)(void *, OpFuncArgs...);
  typedef void (*fp_deleter_t) (void *);
  typedef OpFuncR (*fp_noobject_caller_t)(OpFuncArgs...);
  static fp_copier_t fp_copier;
  static fp_deleter_t fp_deleter;
  static fp_caller_t fp_caller;
  static fp_noobject_caller_t fp_noobject_caller;
};

template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier_t __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier;

template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_deleter_t __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_deleter;

template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_caller_t __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_caller;
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_noobject_caller_t __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_noobject_caller;
}

Note three details in the binary that differ from a hand-written version: (1) namespace {template has no newline between the opening brace and template, (2) fp_deleter_t has a space before (void *) that the other typedefs lack: typedef void (*fp_deleter_t) (void *), (3) the blank line between fp_caller and fp_noobject_caller out-of-line definitions is missing -- they are separated by only one newline.

The anonymous namespace is critical: it gives each translation unit its own copy of the static function pointers, preventing ODR violations when multiple TUs use the same lambda tag type. The Tag parameter ensures that different lambda call sites within the same TU get independent function pointer storage even if they share the same OpFuncR(OpFuncArgs...) signature.
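The effect of the Tag parameter on storage can be checked with a reduced model. This is a sketch, not the generated code: `helper_model`, `TagA`, `TagB`, `add_one`, `twice`, and `call_both` are hypothetical names, and only one of the four function pointers is modeled. It shows that two instantiations with the same signature but different tags do not share a static slot.

```cpp
#include <cassert>

namespace {
// Hypothetical reduced helper: one static slot per (Tag, signature) instantiation.
template <typename Tag, typename R, typename... Args>
struct helper_model {
    typedef R (*fp_caller_t)(Args...);
    static fp_caller_t fp_caller;
};
template <typename Tag, typename R, typename... Args>
typename helper_model<Tag, R, Args...>::fp_caller_t
    helper_model<Tag, R, Args...>::fp_caller = nullptr;
}  // namespace

struct TagA {};
struct TagB {};

int add_one(int v) { return v + 1; }
int twice(int v) { return v * 2; }

// Registers two callables with the same signature under different tags,
// then calls both and returns the sum of the results.
int call_both(int v) {
    helper_model<TagA, int, int>::fp_caller = &add_one;
    helper_model<TagB, int, int>::fp_caller = &twice;  // does not overwrite TagA's slot
    assert((helper_model<TagA, int, int>::fp_caller !=
            helper_model<TagB, int, int>::fp_caller));
    return helper_model<TagA, int, int>::fp_caller(v) +
           helper_model<TagB, int, int>::fp_caller(v);
}
```

If both registrations used the same tag, the second assignment would replace the first and both calls would reach `twice`.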

Function Pointer Roles

| Pointer | Type | Set by | Used by | Purpose |
|---|---|---|---|---|
| fp_copier | void*(*)(void*) | Constructor (capturing path) | Copy constructor | Heap-allocates a new Lambda copy from void* buffer |
| fp_caller | OpFuncR(*)(void*, OpFuncArgs...) | Constructor (capturing path) | operator() | Casts void* back to Lambda* and invokes it |
| fp_deleter | void(*)(void*) | Constructor (capturing path) | Destructor | Casts void* to Lambda* and deletes it |
| fp_noobject_caller | OpFuncR(*)(OpFuncArgs...) | Constructor (non-capturing path) | operator() + conversion operator | Stores the lambda directly as a function pointer |

Type-Erasure Mechanism

The following diagram shows how a void* data pointer and the manager<Lambda> static functions work together to erase the concrete lambda type:

Construction (concrete Lambda type known):
============================================

  __nv_hdl_wrapper_t ctor(Tag{}, Lambda &&lam, F1 in1, ...)
       |
       |-- data = new Lambda(std::move(lam))          // heap-allocate
       |
       |-- __nv_hdl_helper<Tag,...>::fp_copier         // ASSIGN function pointers
       |       = &manager<Lambda>::do_copy             //   (Lambda type captured here)
       |-- __nv_hdl_helper<Tag,...>::fp_deleter
       |       = &manager<Lambda>::do_delete
       |-- __nv_hdl_helper<Tag,...>::fp_caller
       |       = &manager<Lambda>::do_call

After construction (Lambda type erased):
============================================

  __nv_hdl_wrapper_t
  +----------------------------+
  | f1, f2, ..., fN            |   captured variable fields (typed)
  | void *data ----------------+---> heap: Lambda object
  +----------------------------+
                                     (concrete type unknown here)
  operator()(args...):
       fp_caller(data, args...)
           |
           v
       manager<Lambda>::do_call(void *buf, args...)
           auto ptr = static_cast<Lambda*>(buf);
           return (*ptr)(args...);

  Copy ctor:
       data = fp_copier(in.data)
           |
           v
       manager<Lambda>::do_copy(void *buf)
           return new Lambda(*static_cast<Lambda*>(buf));

  Move ctor:
       data = in.data;  in.data = 0;     // pointer steal

  Destructor:
       fp_deleter(data)
           |
           v
       manager<Lambda>::do_delete(void *buf)
           delete static_cast<Lambda*>(buf);

The Tag template parameter is critical: it ensures each lambda call site gets its own set of __nv_hdl_helper static function pointers. Without Tag, two different lambdas with the same OpFuncR(OpFuncArgs...) signature would share the same function pointers, and the second constructor call would overwrite the first's fp_caller/fp_copier/fp_deleter.

The Capturing Path (HasFuncPtrConv=false)

When HasFuncPtrConv=false (the a1=0 path in the emitter), the wrapper uses heap allocation for type erasure. This is the full-weight path for lambdas that capture state.

Reconstructed Template (N captures, non-mutable)

The following is the complete C++ output reconstructed from sub_6BBB10 with a1=0 (HasFuncPtrConv=false) and a2=N captures:

template <bool NeverThrows, typename Tag, typename OpFuncR,
          typename... OpFuncArgs, typename F1, typename F2, /* ...FN */>
struct __nv_hdl_wrapper_t<false, false, NeverThrows, Tag,
                           OpFuncR(OpFuncArgs...), F1, F2, /* ...FN */> {
    // --- Captured fields ---
    typename __nv_lambda_field_type<F1>::type f1;
    typename __nv_lambda_field_type<F2>::type f2;
    // ...
    typename __nv_lambda_field_type<FN>::type fN;

    typedef OpFuncR(__opfunc_t)(OpFuncArgs...);

    // --- Data member for type-erased lambda ---
    void *data;

    // --- Type erasure manager ---
    template <typename Lambda>
    struct manager {
        static void *do_copy(void *buf) {
            auto ptr = static_cast<Lambda *>(buf);
            return static_cast<void *>(new Lambda(*ptr));
        };
        static OpFuncR do_call(void *buf, OpFuncArgs... args) {
            auto ptr = static_cast<Lambda *>(buf);
            return (*ptr)(std::forward<OpFuncArgs>(args)...);
        };
        static void do_delete(void *buf) {
            auto ptr = static_cast<Lambda *>(buf);
            delete ptr;
        }
    };

    // --- Constructor: heap-allocate Lambda, register function pointers ---
    template <typename Lambda>
    __nv_hdl_wrapper_t(Tag, Lambda &&lam, F1 in1, F2 in2, /* ...FN inN */)
        : f1(in1), f2(in2), /* ...fN(inN), */
          data(static_cast<void *>(new Lambda(std::move(lam)))) {
        __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier
            = &manager<Lambda>::do_copy;
        __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_deleter
            = &manager<Lambda>::do_delete;
        __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_caller
            = &manager<Lambda>::do_call;
    }

    // --- Call operator: delegate through type-erased fp_caller ---
    // Binary emits: "OpFuncR operator() (OpFuncArgs... args) " + "const " + "noexcept(NeverThrows) "
    OpFuncR operator() (OpFuncArgs... args) const noexcept(NeverThrows) {
        return __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>
            ::fp_caller(data, std::forward<OpFuncArgs>(args)...);
    }

    // --- Copy constructor: delegate through fp_copier ---
    __nv_hdl_wrapper_t(const __nv_hdl_wrapper_t &in)
        : f1(in.f1), f2(in.f2), /* ...fN(in.fN), */
          data(__nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>
               ::fp_copier(in.data)) { }

    // --- Move constructor: steal void* pointer ---
    __nv_hdl_wrapper_t(__nv_hdl_wrapper_t &&in)
        : f1(std::move(in.f1)), f2(std::move(in.f2)), /* ...fN(std::move(in.fN)), */
          data(in.data) { in.data = 0; }

    // --- Destructor: delegate through fp_deleter ---
    ~__nv_hdl_wrapper_t(void) {
        __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_deleter(data);
    }

    // --- Copy assignment: deleted ---
    __nv_hdl_wrapper_t & operator=(const __nv_hdl_wrapper_t &in) = delete;
};

Key Design Decisions

Heap allocation in constructor. The lambda is std::moved into a heap-allocated copy via new Lambda(std::move(lam)). This erases the concrete type -- the wrapper only holds a void* afterward. The manager<Lambda> static methods are assigned to the __nv_hdl_helper static function pointers during construction, preserving the type information as function pointer values rather than as template parameters.

Static function pointers instead of vtable. Rather than using virtual functions, the wrapper stores the type-erasure operations in static function pointers on __nv_hdl_helper. This is an unconventional choice -- it means all wrappers with the same Tag share the same function pointer storage. This works because within a single translation unit, each tag corresponds to exactly one lambda closure type. The approach avoids vtable overhead (no virtual destructor, no vptr in the wrapper) at the cost of not being safe across multiple lambda types sharing a tag.

Move constructor steals pointer. The move constructor copies the void* data pointer and sets the source to 0 (null). The destructor unconditionally calls fp_deleter(data), so a null data pointer after move must be handled by the deleter. Since delete on a null pointer is a no-op in C++, the moved-from wrapper's destructor call is safe.

Copy assignment is deleted. Only copy construction and move construction are supported. This avoids the complexity of managing the void* lifetime during assignment (which would require deleting the old data and copying the new).
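The ownership rules described in these design notes can be observed with a reduced, host-only model. Everything here is a sketch under assumptions: `Counted`, `erased_model`, `DemoTag`, `Trace`, and `run_lifecycle` are hypothetical names, the signature is fixed to `int(int)`, the captured fields are omitted, and the static pointers live on the wrapper model itself rather than on a separate helper class.

```cpp
#include <cassert>
#include <utility>

// Hypothetical payload that counts live instances so heap traffic is observable.
struct Counted {
    static int live;
    int value;
    explicit Counted(int v) : value(v) { ++live; }
    Counted(const Counted &o) : value(o.value) { ++live; }
    Counted(Counted &&o) : value(o.value) { ++live; }
    ~Counted() { --live; }
    int operator()(int x) const { return value + x; }
};
int Counted::live = 0;

// Reduced model of the capturing wrapper: void* data plus per-Tag static pointers.
template <typename Tag>
struct erased_model {
    typedef void *(*copier_t)(void *);
    typedef int (*caller_t)(void *, int);
    typedef void (*deleter_t)(void *);
    static copier_t fp_copier;
    static caller_t fp_caller;
    static deleter_t fp_deleter;

    void *data;

    template <typename Lambda>
    struct manager {
        static void *do_copy(void *buf) { return new Lambda(*static_cast<Lambda *>(buf)); }
        static int do_call(void *buf, int x) { return (*static_cast<Lambda *>(buf))(x); }
        static void do_delete(void *buf) { delete static_cast<Lambda *>(buf); }  // no-op on null
    };

    template <typename Lambda>
    erased_model(Tag, Lambda &&lam) : data(new Lambda(std::move(lam))) {
        fp_copier = &manager<Lambda>::do_copy;
        fp_caller = &manager<Lambda>::do_call;
        fp_deleter = &manager<Lambda>::do_delete;
    }
    erased_model(const erased_model &in) : data(fp_copier(in.data)) {}
    erased_model(erased_model &&in) : data(in.data) { in.data = 0; }  // pointer steal
    ~erased_model() { fp_deleter(data); }
    erased_model &operator=(const erased_model &) = delete;

    int operator()(int x) const {
        assert(data != nullptr);  // calling a moved-from wrapper is outside this sketch
        return fp_caller(data, x);
    }
};
template <typename Tag> typename erased_model<Tag>::copier_t erased_model<Tag>::fp_copier = nullptr;
template <typename Tag> typename erased_model<Tag>::caller_t erased_model<Tag>::fp_caller = nullptr;
template <typename Tag> typename erased_model<Tag>::deleter_t erased_model<Tag>::fp_deleter = nullptr;

struct DemoTag {};
struct Trace { int after_ctor, after_copy, after_move, after_scope, result; };

// Records the number of live heap objects at each stage of the lifecycle.
Trace run_lifecycle() {
    Trace t{};
    {
        erased_model<DemoTag> a(DemoTag{}, Counted(5));
        t.after_ctor = Counted::live;            // the temporary is gone, the heap copy remains
        erased_model<DemoTag> b(a);              // deep copy through fp_copier
        t.after_copy = Counted::live;
        erased_model<DemoTag> c(std::move(a));   // a.data becomes null, no new allocation
        t.after_move = Counted::live;
        t.result = b(1) + c(2);
    }                                            // a's destructor deletes a null pointer
    t.after_scope = Counted::live;
    return t;
}
```

The moved-from wrapper `a` still runs its destructor, and the count returning to zero shows that deleting the null pointer neither leaks nor double-frees.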

Zero-Capture Specialization

When a2=0 (no captures), the emitter skips the field declarations and the field portions of the member initializer lists. The wrapper degenerates to holding only void* data with no fN fields. The constructor takes only (Tag, Lambda&&) with no capture arguments. The copy and move constructors handle only the data member.

The Lightweight Path (HasFuncPtrConv=true)

When HasFuncPtrConv=true (the a1=1 path), the lambda has no captures and can be implicitly converted to a raw function pointer. The emitter produces a drastically simpler wrapper:

template <bool NeverThrows, typename Tag, typename OpFuncR,
          typename... OpFuncArgs>
struct __nv_hdl_wrapper_t<false, true, NeverThrows, Tag,
                           OpFuncR(OpFuncArgs...)> {
    typedef OpFuncR(__opfunc_t)(OpFuncArgs...);

    // --- Constructor: store lambda as function pointer ---
    template <typename Lambda>
    __nv_hdl_wrapper_t(Tag, Lambda &&lam)
     { __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_noobject_caller = lam; }

    // --- Call operator: invoke through stored function pointer ---
    // Binary: "OpFuncR operator() (OpFuncArgs... args) " + "const " + "noexcept(NeverThrows) "
    OpFuncR operator() (OpFuncArgs... args) const noexcept(NeverThrows) {
        return __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>
            ::fp_noobject_caller(std::forward<OpFuncArgs>(args)...);
    }

    // --- Function pointer conversion operator ---
    // Binary: "operator __opfunc_t * () const { ... }"
    operator __opfunc_t * () const {
        return __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_noobject_caller;
    }

    // --- Copy assignment: deleted ---
    __nv_hdl_wrapper_t & operator=(const __nv_hdl_wrapper_t &in) = delete;
};

No void* data member. No manager struct. No heap allocation. No copy constructor, move constructor, or destructor (the compiler-generated defaults suffice). The lambda is stored directly as a function pointer in fp_noobject_caller, and the wrapper provides an implicit conversion to __opfunc_t* -- the raw function pointer type matching the lambda's signature.
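The lightweight path relies on a standard C++ rule: a lambda with no captures converts implicitly to a pointer to function. The sketch below models that path with hypothetical names (`noobject_helper`, `light_wrapper`, `SquareTag`, `square_via_wrapper`) and drops the IsMutable, HasFuncPtrConv, and NeverThrows parameters for brevity.

```cpp
#include <cassert>

// Hypothetical reduced helper holding only the no-object caller slot.
template <typename Tag, typename R, typename... Args>
struct noobject_helper {
    typedef R (*fp_t)(Args...);
    static fp_t fp_noobject_caller;
};
template <typename Tag, typename R, typename... Args>
typename noobject_helper<Tag, R, Args...>::fp_t
    noobject_helper<Tag, R, Args...>::fp_noobject_caller = nullptr;

template <typename Tag, typename Sig> struct light_wrapper;

template <typename Tag, typename R, typename... Args>
struct light_wrapper<Tag, R(Args...)> {
    typedef R(opfunc_t)(Args...);

    template <typename Lambda>
    light_wrapper(Tag, Lambda &&lam) {
        // A captureless lambda converts implicitly to a plain function pointer.
        noobject_helper<Tag, R, Args...>::fp_noobject_caller = lam;
    }
    R operator()(Args... args) const {
        return noobject_helper<Tag, R, Args...>::fp_noobject_caller(args...);
    }
    operator opfunc_t *() const {
        return noobject_helper<Tag, R, Args...>::fp_noobject_caller;
    }
};

struct SquareTag {};

// Calls the wrapper once directly and once through the converted raw pointer.
int square_via_wrapper(int v) {
    light_wrapper<SquareTag, int(int)> w(SquareTag{}, [](int x) { return x * x; });
    int (*raw)(int) = w;  // uses the conversion operator
    assert(raw != nullptr);
    return w(v) + raw(v);
}
```

The wrapper object itself is empty here, which is why no user-written copy, move, or destructor is needed on this path.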

This path is selected when gen_lambda (sub_47B890) detects that the lambda has no capture list (*(_QWORD *)a1 == 0, the capture head pointer is null) and the lambda does not use capture-default = (bit 4 at byte[24] is clear). Additional conditions involving dword_126EFAC, dword_126EFA4, and qword_126EF98 (a version threshold at 0xEB27 = 60199, likely a CUDA toolkit version) gate this detection, suggesting the function-pointer conversion path was added in a specific toolkit release.

Mutable vs Non-Mutable (sub_6BBB10 vs sub_6BBEE0)

The two emitter functions are structurally identical. The sole differences:

| Aspect | sub_6BBB10 (non-mutable) | sub_6BBEE0 (mutable) |
|---|---|---|
| First template bool emitted | "false," | "true," |
| operator() qualifier | a3("const ") before noexcept | No "const " emission |
| Binary difference | Line 190: emits "const " | Line 188: skips to noexcept |

In the decompiled binary, the two functions are 238 and 236 lines respectively. The 2-line difference is exactly the a3("const ") call present in sub_6BBB10 but absent from sub_6BBEE0.

For a mutable lambda, the C++ standard says operator() is non-const, allowing the lambda body to modify captured-by-value variables. The wrapper faithfully propagates this: sub_6BBEE0 generates operator() without the const qualifier. In the capturing path, this means the do_call function pointer invokes a non-const Lambda, which is sound because the lambda is heap-allocated and accessed through a mutable void*.

Emitter Call Matrix

sub_6BCC20 emits all four combinations for each set bit N in the host-device bitmap:

sub_6BBB10(0, N, emit);  // IsMutable=false, HasFuncPtrConv=false
sub_6BBEE0(0, N, emit);  // IsMutable=true,  HasFuncPtrConv=false
sub_6BBB10(1, N, emit);  // IsMutable=false, HasFuncPtrConv=true
sub_6BBEE0(1, N, emit);  // IsMutable=true,  HasFuncPtrConv=true

This produces four partial specializations per set bitmap bit N. The NeverThrows parameter remains a template parameter (not a partial-specialization value), handled at instantiation time. Note in the decompiled binary that the fourth call uses v9 (which holds v6 before the post-increment): v9 = v6++; ... sub_6BBEE0(1, v9, a1); -- all four calls use the same capture count N.

The __nv_hdl_helper_trait_outer Deduction Helper

After the per-capture-count specializations, sub_6BCC20 emits a trait class that deduces the wrapper return type from the lambda's operator() signature:

template <bool IsMutable, bool HasFuncPtrConv, typename ...CaptureArgs>
struct __nv_hdl_helper_trait_outer {
    // Primary: extract operator() signature via decltype(&Lambda::operator())
    template <typename Tag, typename Lambda>
    struct __nv_hdl_helper_trait
        : public __nv_hdl_helper_trait<Tag, decltype(&Lambda::operator())> { };

    // Specialization for const operator() (non-mutable lambda):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

    // Specialization for non-const operator() (mutable lambda):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...)> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

    // C++17 noexcept variants (only when dword_126E270 is set):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const noexcept> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) noexcept> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };
};

The trick here is the primary __nv_hdl_helper_trait inheriting from a specialization on decltype(&Lambda::operator()). The compiler deduces the member function pointer type of operator(), which pattern-matches against one of the four specializations. The non-noexcept specializations pass NeverThrows=false; the noexcept specializations pass NeverThrows=true. This is how the NeverThrows template parameter gets its value -- through trait deduction, not through an explicit argument.

The C++17 noexcept variants are gated on dword_126E270. In C++17, noexcept became part of the type system, so R(C::*)(Args...) noexcept is a distinct type from R(C::*)(Args...). Without the additional specializations, the compiler would fail to match noexcept member function pointers.
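The deduction of NeverThrows from the member function pointer type can be reproduced with a reduced trait. This is a sketch with hypothetical names (`call_traits`, `lambda_traits`, `Plain`, `Mutable`, `Quiet`, `kNoexceptInType`); it also reports constness, which the real trait does not deduce (IsMutable is an explicit template argument there). The `__cplusplus` check stands in for the dword_126E270 gate and is an assumption about how one would reproduce the condition in ordinary source.

```cpp
#include <type_traits>

// Hypothetical reduced trait keyed on the type of &Lambda::operator().
template <typename MemFn> struct call_traits;

template <typename C, typename R, typename... Args>
struct call_traits<R (C::*)(Args...) const> {
    typedef R signature(Args...);
    static const bool is_const_call = true;
    static const bool never_throws = false;
};
template <typename C, typename R, typename... Args>
struct call_traits<R (C::*)(Args...)> {
    typedef R signature(Args...);
    static const bool is_const_call = false;
    static const bool never_throws = false;
};

#if __cplusplus >= 201703L  // noexcept is part of the function type from C++17 on
template <typename C, typename R, typename... Args>
struct call_traits<R (C::*)(Args...) const noexcept> {
    typedef R signature(Args...);
    static const bool is_const_call = true;
    static const bool never_throws = true;
};
template <typename C, typename R, typename... Args>
struct call_traits<R (C::*)(Args...) noexcept> {
    typedef R signature(Args...);
    static const bool is_const_call = false;
    static const bool never_throws = true;
};
#endif

// Same inheritance trick as the primary __nv_hdl_helper_trait.
template <typename Lambda>
struct lambda_traits : call_traits<decltype(&Lambda::operator())> {};

constexpr bool kNoexceptInType = __cplusplus >= 201703L;

// Function objects standing in for closure types.
struct Plain   { int operator()(int) const { return 0; } };
struct Mutable { int operator()(int) { return 0; } };
struct Quiet   { int operator()(int) const noexcept { return 0; } };

static_assert(!lambda_traits<Plain>::never_throws, "plain call operator");
static_assert(!lambda_traits<Mutable>::is_const_call, "mutable call operator is non-const");
static_assert(std::is_same<lambda_traits<Mutable>::signature, int(int)>::value, "signature recovered");
static_assert(lambda_traits<Quiet>::never_throws == kNoexceptInType, "noexcept seen only in C++17 and later");
```

Before C++17 the noexcept call operator has the same pointer type as the plain one, so the two base specializations are sufficient and the extra pair must be omitted.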

In the decompiled sub_6BCC20, the emission is split into three a1() calls: (1) the base struct with the const and non-const specializations (ending with the }; that closes the non-const specialization), (2) conditionally (if (dword_126E270)) the const noexcept and noexcept specializations, and (3) a1("\n};") to close the outer struct. The closing brace of __nv_hdl_helper_trait_outer is therefore always emitted, while the noexcept specializations inside it are conditional. In non-C++17 mode, the only text emitted after the non-const specialization's }; is \n};, which closes the outer struct.

The __nv_hdl_create_wrapper_t Factory

The factory struct ties everything together. It provides a single static method that the backend emits at each host-device lambda usage site:

template <bool IsMutable, bool HasFuncPtrConv,
          typename Tag, typename... CaptureArgs>
struct __nv_hdl_create_wrapper_t {
    template <typename Lambda>
    static auto __nv_hdl_create_wrapper(Lambda &&lam, CaptureArgs... args)
        -> decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...))
    {
        typedef decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...)) container_type;
        return container_type(Tag{}, std::move(lam), args...);
    }
};

The trailing return type uses decltype to invoke the trait chain and deduce the exact __nv_hdl_wrapper_t specialization. The body constructs that deduced type with Tag{} (a value-initialized tag), the moved lambda, and the capture arguments.

Backend Emission at Lambda Call Site

When gen_lambda (sub_47B890) encounters a host-device lambda (bit 4 set at byte[25]), it emits the factory call in two phases:

Phase 1 (before lambda body): Opens the factory call with template arguments and the method name:

__nv_hdl_create_wrapper_t< IsMutable, HasFuncPtrConv, Tag, CaptureTypes... >
    ::__nv_hdl_create_wrapper(

Phase 2 (after lambda body): The lambda expression is emitted as the first argument to __nv_hdl_create_wrapper, then the captured value expressions are appended as trailing arguments, followed by the closing ):

    /* lambda expression emitted inline */,
    capture_arg1, capture_arg2, ... )

This differs from the device lambda path where the original lambda body is wrapped in #if 0 / #endif. In the host-device path, the lambda is passed by rvalue reference to the factory method, which moves it into a heap-allocated copy for type erasure. The captured values are passed separately (via sub_46E550 at line 323 of the decompiled binary) so the wrapper can store them as typed fields alongside the void* data.

The IsMutable decision comes from byte[24] & 0x02 (mutable keyword present). The HasFuncPtrConv decision involves nested conditions, all gated on the capture list head being null (*(_QWORD *)a1 == 0):

HasFuncPtrConv = false;  // default
if (capture_list_head == NULL) {
    if (dword_126EFAC && !dword_126EFA4 && qword_126EF98 <= 0xEB27) {
        HasFuncPtrConv = true;   // forced true for old toolkit versions
    } else {
        // General path: true iff no capture-default '='
        HasFuncPtrConv = !(byte[24] & 0x10);
    }
}

When dword_126EFAC is set and dword_126EFA4 is clear, the toolkit version qword_126EF98 is compared against 0xEB27 (60199). At or below this threshold, HasFuncPtrConv is unconditionally true. Above the threshold, it falls through to the general path which checks whether the lambda has a capture-default = (bit 4 at byte[24]): if no = default, then the lambda is captureless and can convert to a function pointer.

This logic is at sub_47B890 lines 62-77 of the decompiled binary.
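The decision can be restated as a small pure function for testing. This is a sketch under assumptions: `LambdaInfoView`, `ModeState`, and their field names are hypothetical, the three globals are passed in explicitly, and the width and signedness chosen for the version value are guesses. The conditions themselves follow the pseudocode above.

```cpp
#include <cstdint>

// Hypothetical flattened view of the lambda_info fields read by gen_lambda.
struct LambdaInfoView {
    bool has_captures;    // capture list head is non-null
    std::uint8_t byte24;  // flag byte at lambda_info + 24
};

// Hypothetical snapshot of the three globals involved in the gate.
struct ModeState {
    int flag_126EFAC;
    int flag_126EFA4;
    std::uint64_t version_126EF98;
};

constexpr bool has_func_ptr_conv(const LambdaInfoView &l, const ModeState &m) {
    return !l.has_captures &&                      // outer gate: capture list head is null
           ((m.flag_126EFAC != 0 && m.flag_126EFA4 == 0 &&
             m.version_126EF98 <= 0xEB27)          // threshold branch: forced true
            || (l.byte24 & 0x10) == 0);            // general branch: no capture-default '='
}

constexpr bool is_mutable(const LambdaInfoView &l) {
    return (l.byte24 & 0x02) != 0;                 // mutable keyword present
}

static_assert(!has_func_ptr_conv(LambdaInfoView{true, 0x00}, ModeState{0, 0, 70000}),
              "any capture disables the conversion path");
static_assert(has_func_ptr_conv(LambdaInfoView{false, 0x00}, ModeState{0, 0, 70000}),
              "captureless, no capture-default");
static_assert(!has_func_ptr_conv(LambdaInfoView{false, 0x10}, ModeState{0, 0, 70000}),
              "capture-default '=' disables it on the general branch");
static_assert(has_func_ptr_conv(LambdaInfoView{false, 0x10}, ModeState{1, 0, 0xEB27}),
              "at the threshold the result is forced true");
```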

SFINAE Detection Traits

At the end of the preamble, sub_6BCC20 emits a detection trait and macro for identifying host-device lambda wrappers:

// Exact binary string (step 16 in sub_6BCC20, emitted as a single a1() call):
template <typename>
struct __nv_extended_host_device_lambda_trait_helper {
  static const bool value = false;
};
template <bool B1, bool B2, bool B3, typename T1, typename T2, typename...Pack>
struct __nv_extended_host_device_lambda_trait_helper<__nv_hdl_wrapper_t<B1, B2, B3, T1, T2, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_host_device_lambda_closure_type(X)  __nv_extended_host_device_lambda_trait_helper< typename __nv_lambda_trait_remove_cv<X>::type>::value

Note: binary has typename...Pack (no space), Pack...> > (space between angle brackets -- pre-C++11 syntax), two spaces before __nv_extended_host_device_lambda_trait_helper in the macro, and 2-space indentation on static const bool.

This allows compile-time detection of whether a type is a host-device lambda wrapper, used internally by the CUDA runtime headers and by nvcc to apply special handling to extended host-device lambda closure types.
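
The detection pattern can be exercised on its own with stand-in types. In this sketch every name ending in _sketch is hypothetical, the wrapper is an empty struct that only mirrors the parameter shape, and std::remove_cv stands in for __nv_lambda_trait_remove_cv.

```cpp
#include <type_traits>

// Stand-in for __nv_hdl_wrapper_t: three bools, a tag, a call signature,
// and a pack of capture types. No members are needed for detection.
template <bool IsMutable, bool HasFuncPtrConv, bool NeverThrows,
          typename Tag, typename OpFunc, typename... Captures>
struct hdl_wrapper_sketch {};

// Primary template: an arbitrary type is not a host-device wrapper.
template <typename>
struct hdl_trait_helper_sketch {
  static const bool value = false;
};

// Partial specialization: matches every instantiation of the wrapper.
template <bool B1, bool B2, bool B3, typename T1, typename T2, typename... Pack>
struct hdl_trait_helper_sketch<hdl_wrapper_sketch<B1, B2, B3, T1, T2, Pack...> > {
  static const bool value = true;
};

// Function form of the macro: strip cv-qualifiers, then ask the helper.
template <typename X>
constexpr bool is_hdl_closure_sketch() {
  return hdl_trait_helper_sketch<typename std::remove_cv<X>::type>::value;
}

struct tag_sketch {};
using wrapper_example_sketch =
    hdl_wrapper_sketch<false, false, false, tag_sketch, int(int), int>;
```

Stripping cv-qualifiers first matters because the partial specialization only matches the unqualified wrapper type; without it, a const-qualified wrapper would fall through to the primary template.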

Emission Sequence in sub_6BCC20

The host-device wrapper infrastructure is emitted in steps 7-12 of the 20-step preamble emission sequence:

| Step | Content | Function |
| --- | --- | --- |
| 7 | __nv_hdl_helper class (anonymous namespace, 4 static function pointer members + out-of-line definitions) | sub_6BCC20 inline |
| 8 | Primary __nv_hdl_wrapper_t with static_assert (catches unexpected capture counts) | sub_6BCC20 inline |
| 9 | Per-capture-count specializations: for each bit N set in unk_1286900, emit 4 calls: sub_6BBB10(0,N), sub_6BBEE0(0,N), sub_6BBB10(1,N), sub_6BBEE0(1,N) | sub_6BBB10, sub_6BBEE0 |
| 10 | __nv_hdl_helper_trait_outer deduction helper (2 or 4 trait specializations depending on C++17) | sub_6BCC20 inline |
| 11 | C++17 noexcept trait variants (conditional on dword_126E270) | sub_6BCC20 inline |
| 12 | __nv_hdl_create_wrapper_t factory | sub_6BCC20 inline |

The bitmap scan loop for host-device wrappers differs from the device-lambda loop in one important way: bit 0 IS emitted. The device-lambda loop skips bit 0 (the zero-capture case is handled by the primary template), but the host-device loop processes every set bit including 0. This is because the zero-capture host-device wrapper still requires distinct specializations for the HasFuncPtrConv=true and HasFuncPtrConv=false paths.

// sub_6BCC20, host-device bitmap scan (decompiled)
v5 = (unsigned __int64 *)&unk_1286900;
v6 = 0;
do {
    v7 = *v5;
    v8 = v6 + 64;
    do {
        while ((v7 & 1) == 0) {    // skip unset bits
            ++v6;
            v7 >>= 1;
            if (v6 == v8) goto LABEL_13;
        }
        sub_6BBB10(0, v6, a1);     // non-mutable, HasFuncPtrConv=false
        sub_6BBEE0(0, v6, a1);     // mutable,     HasFuncPtrConv=false
        sub_6BBB10(1, v6, a1);     // non-mutable, HasFuncPtrConv=true
        v9 = v6++;
        v7 >>= 1;
        sub_6BBEE0(1, v9, a1);     // mutable,     HasFuncPtrConv=true
    } while (v6 != v8);
LABEL_13:
    ++v5;
} while (v6 != 1024);

Comparison with Device Lambda Wrapper

| Aspect | __nv_dl_wrapper_t | __nv_hdl_wrapper_t |
| --- | --- | --- |
| Type erasure | None -- concrete fields only | void* + manager<Lambda> function pointers |
| Heap allocation | Never | Yes (capturing path) or never (HasFuncPtrConv path) |
| Copy semantics | Trivially copyable aggregate | Custom copy ctor via fp_copier; copy assign deleted |
| Move semantics | Default | Custom move ctor stealing void*; moved-from nulled |
| Destructor | Trivial | Calls fp_deleter(data) |
| operator() body | return 0; / __builtin_unreachable() (placeholder) | Delegates through fp_caller or fp_noobject_caller |
| Function pointer conversion | Not supported | operator __opfunc_t * () when HasFuncPtrConv=true |
| Specializations per N | 2 (standard tag + trailing-return tag) | 4 (2 mutability x 2 HasFuncPtrConv) |
| Template params (partial spec) | Tag, F1..FN | IsMutable, HasFuncPtrConv, NeverThrows, Tag, OpFuncR(OpFuncArgs...), F1..FN |

The host-device wrapper is fundamentally more complex because it must produce a callable object that works on both host and device. The device-only wrapper can use placeholder operator bodies (return 0) because the device compiler sees the original lambda body through a different mechanism. The host-device wrapper must actually call the lambda through the type-erased function pointer table.

Concrete Example: Host-Device Lambda with One Capture

User code:

auto add_n = [n] __host__ __device__ (int x) { return x + n; };
int result = add_n(42);

This lambda has one capture (n, by value), is not mutable (default), and cannot convert to a function pointer (it captures). The frontend sets bit 4 at byte[25] (host-device wrapper needed) and calls sub_6BCBF0(1, 1) to set bit 1 in the host-device bitmap unk_1286900.

During preamble emission, sub_6BCC20 sees bit 1 set and emits four specializations via sub_6BBB10(0,1), sub_6BBEE0(0,1), sub_6BBB10(1,1), sub_6BBEE0(1,1). The relevant one for this lambda (non-mutable, capturing) is from sub_6BBB10(0,1).

At the lambda call site, gen_lambda emits:

__nv_hdl_create_wrapper_t< false, false, __nv_dl_tag<...>, int >
    ::__nv_hdl_create_wrapper(
        [n] __host__ __device__ (int x) { return x + n; },
        n )

The factory method deduces the wrapper type via __nv_hdl_helper_trait_outer and constructs:

__nv_hdl_wrapper_t<false, false, false, Tag, int(int), int>

At runtime on the host: the constructor heap-allocates the lambda, stores n as field f1, and sets the fp_caller/fp_copier/fp_deleter static function pointers. Calling add_n(42) invokes fp_caller(data, 42) which casts void* back to the lambda type and calls operator()(42).

At runtime on the device: the same wrapper struct is memcpy'd to device memory. The device compiler sees the wrapper's fields and operator() which delegates through the function pointer table, resolving to the lambda body.
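
The type-erasure idea behind the host path can be shown in a few lines of ordinary C++. This is a host-only sketch under simplifying assumptions: it is fixed to the signature int(int), it omits the separately stored capture fields, and all names are hypothetical rather than taken from the emitted template.

```cpp
#include <cassert>
#include <utility>

class erased_callable_sketch {
  void* data_ = nullptr;                  // heap copy of the lambda
  int (*caller_)(void*, int) = nullptr;   // role of fp_caller
  void* (*copier_)(void*) = nullptr;      // role of fp_copier
  void (*deleter_)(void*) = nullptr;      // role of fp_deleter

  // One manager per concrete lambda type supplies the three functions.
  template <typename Lambda>
  struct manager {
    static int call(void* p, int x) { return (*static_cast<Lambda*>(p))(x); }
    static void* copy(void* p) { return new Lambda(*static_cast<Lambda*>(p)); }
    static void destroy(void* p) { delete static_cast<Lambda*>(p); }
  };

  erased_callable_sketch() = default;

 public:
  // Factory: moves the lambda to the heap and records its manager functions.
  template <typename Lambda>
  static erased_callable_sketch make(Lambda f) {
    erased_callable_sketch w;
    w.data_ = new Lambda(std::move(f));
    w.caller_ = &manager<Lambda>::call;
    w.copier_ = &manager<Lambda>::copy;
    w.deleter_ = &manager<Lambda>::destroy;
    return w;
  }

  // Copy clones the heap object through the copier.
  erased_callable_sketch(const erased_callable_sketch& o)
      : data_(o.data_ ? o.copier_(o.data_) : nullptr),
        caller_(o.caller_), copier_(o.copier_), deleter_(o.deleter_) {}

  // Move steals the pointer and nulls the source.
  erased_callable_sketch(erased_callable_sketch&& o) noexcept
      : data_(o.data_), caller_(o.caller_),
        copier_(o.copier_), deleter_(o.deleter_) {
    o.data_ = nullptr;
  }

  erased_callable_sketch& operator=(const erased_callable_sketch&) = delete;
  ~erased_callable_sketch() { if (data_) deleter_(data_); }

  int operator()(int x) const { return caller_(data_, x); }
  bool empty() const { return data_ == nullptr; }
};
```

The wrapper object itself has a fixed layout regardless of the lambda type; only the three function pointers know the concrete type, which is the property the comparison table summarizes as "void* + manager<Lambda> function pointers".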

Emitter Function Signature

Both sub_6BBB10 and sub_6BBEE0 share the same prototype:

__int64 __fastcall sub_6BBB10(int a1, unsigned int a2,
                               void (__fastcall *a3)(const char *));

| Parameter | Role |
| --- | --- |
| a1 | HasFuncPtrConv flag. 0 = full type-erased path. 1 = lightweight function pointer path. |
| a2 | Number of captured variables (0 to 1023). |
| a3 | Emit callback. Called with C++ source text fragments that are concatenated to form the output. |

The functions use a 1080-byte stack buffer (v28[1080]) for sprintf formatting of per-capture template parameters and field declarations. The buffer is large enough for field names up to F1023 / f1023 / in1023 with surrounding template syntax.

Key Functions

| Address | Name | Lines | Role |
| --- | --- | --- | --- |
| sub_6BBB10 | emit_hdl_wrapper_nonmutable | 238 | Emit __nv_hdl_wrapper_t<false, ...> specialization |
| sub_6BBEE0 | emit_hdl_wrapper_mutable | 236 | Emit __nv_hdl_wrapper_t<true, ...> specialization |
| sub_6BCC20 | nv_emit_lambda_preamble | 244 | Master emitter; calls both emitters twice for each set bitmap bit |
| sub_47B890 | gen_lambda | 336 | Per-lambda site emission of __nv_hdl_create_wrapper_t::__nv_hdl_create_wrapper(...) call |
| sub_6BCBF0 | nv_record_capture_count | 13 | Sets bit in unk_1286900 bitmap during frontend scan |
| sub_6BCBC0 | nv_reset_capture_bitmasks | 9 | Zeroes both bitmaps before each TU |

Global State

| Variable | Address | Purpose |
| --- | --- | --- |
| unk_1286900 | 0x1286900 | Host-device lambda capture-count bitmap (128 bytes, 1024 bits) |
| dword_126E270 | 0x126E270 | C++17 noexcept-in-type-system flag; gates noexcept trait variants |
| dword_126EFAC | 0x126EFAC | Influences HasFuncPtrConv deduction in gen_lambda |
| dword_126EFA4 | 0x126EFA4 | Secondary gate for HasFuncPtrConv path |
| qword_126EF98 | 0x126EF98 | Toolkit version threshold for HasFuncPtrConv (compared against 0xEB27) |
| dword_106BF38 | 0x106BF38 | Extended lambda mode flag (--extended-lambda) |

Capture Handling

C++ lambdas capture variables by creating closure-class fields -- one field per captured entity. For scalars this is straightforward: the closure stores a copy (or reference) of the variable. Arrays need extra care in CUDA extended lambdas: an ordinary closure can hold a by-value array copy directly, but the wrapper templates receive each capture as a constructor argument, where an array parameter decays to a pointer and cannot initialize an array member. The wrapper template that carries captures across the host/device boundary also needs a uniform way to express every field's type, including multi-dimensional arrays and const-qualified variants. cudafe++ solves this with two injected template families: __nv_lambda_field_type<T> (a type trait that maps each captured variable's declared type to a storable type) and __nv_lambda_array_wrapper<T[D1]...[DN]> (a wrapper struct that holds a deep copy of an N-dimensional array with element-by-element copy in its constructor).

A separate subsystem handles the backend code generator's emission of capture type declarations and capture value expressions for each lambda. nv_gen_extended_lambda_capture_types (sub_46E640) walks the capture list and emits decltype-based template arguments wrapped in __nvdl_remove_ref / __nvdl_remove_const / __nv_lambda_trait_remove_cv. sub_46E550 emits the corresponding capture values (variable names, this, *this, or init-capture expressions).

All of this is driven by a bitmap system that tracks which capture counts were actually used, so cudafe++ only emits the wrapper specializations that a given translation unit requires.

Key Facts

| Property | Value |
| --- | --- |
| Field type trait | __nv_lambda_field_type<T> |
| Array wrapper | __nv_lambda_array_wrapper<T[D1]...[DN]> |
| Supported array dims | 1D through 7D (loop counter runs 2-8 and emits counter - 1 dimensions) |
| Array helper emitter | sub_6BC290 (emit_array_capture_helpers) in nv_transforms.c |
| Capture type emitter | sub_46E640 (nv_gen_extended_lambda_capture_types) in cp_gen_be.c |
| Capture value emitter | sub_46E550 in cp_gen_be.c |
| Device bitmap | unk_1286980 (128 bytes = 1024 bits) |
| Host-device bitmap | unk_1286900 (128 bytes = 1024 bits) |
| Bitmap initializer | sub_6BCBC0 (nv_reset_capture_bitmasks) |
| Bitmap setter | sub_6BCBF0 (nv_record_capture_count) |

__nv_lambda_field_type

This is the type trait that maps every captured variable's declared type to a type suitable for storage in a wrapper struct field. For scalar types (and anything that is not an array), it is the identity:

template <typename T>
struct __nv_lambda_field_type {
    typedef T type;
};

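For array types, it maps to the corresponding __nv_lambda_array_wrapper specialization. cudafe++ generates one pair of partial specializations (non-const and const) per iteration of a loop whose counter runs from 2 through 8. Each iteration emits counter - 1 dimension parameters, so the pairs cover one through seven dimensions.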

Generated Specializations (Example: 3D)

// Non-const array
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_field_type<T [D1][D2][D3]> {
    typedef __nv_lambda_array_wrapper<T [D1][D2][D3]> type;
};

// Const array
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_field_type<const T [D1][D2][D3]> {
    typedef const __nv_lambda_array_wrapper<T [D1][D2][D3]> type;
};

Non-array types use the identity primary template. For arrays, the loop counter in sub_6BC290 starts at 2, but each iteration emits counter - 1 dimension parameters (see the pseudocode under sub_6BC290), so the generated specializations run from the one-dimensional T [D1] up to the seven-dimensional T [D1]...[D7].

Why Ranks 2-8

The loop in sub_6BC290 runs its counter v1 from 2 to 8 inclusive (while (v1 != 9)), and each iteration emits v1 - 1 dimension parameters, so the generated specializations cover one through seven dimensions. Arrays with eight or more dimensions have no dedicated specialization; the static_assert in the unspecialized __nv_lambda_array_wrapper primary template is the trap for unexpected cases. This caps lambda-captured arrays at 7D, a generous limit given that CUDA kernels rarely use arrays beyond 3D.

__nv_lambda_array_wrapper<T[D1]...[DN]>

The array wrapper is a struct that owns a copy of an N-dimensional C-style array. Because an array passed as a constructor argument decays to a pointer, a plain array field could not be initialized from it; this wrapper provides the deep-copy semantics that CUDA extended lambdas need.

Primary Template (Trap)

The unspecialized primary template contains only a static_assert that always fires:

template <typename T>
struct __nv_lambda_array_wrapper {
    static_assert(sizeof(T) == 0,
        "nvcc internal error: unexpected failure in capturing array variable");
};

This is the trap for array types that no generated specialization handles. Since sizeof(T) is never zero for a complete type, the assertion fails whenever the primary template is instantiated.

Generated Specializations

For each loop-counter value from 2 through 8, sub_6BC290 generates a partial specialization with counter - 1 dimensions:

// Example: three dimensions
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_array_wrapper<T [D1][D2][D3]> {
    T arr[D1][D2][D3];
    __nv_lambda_array_wrapper(const T in[D1][D2][D3]) {
        for(size_t i1 = 0; i1 < D1; ++i1)
        for(size_t i2 = 0; i2 < D2; ++i2)
        for(size_t i3 = 0; i3 < D3; ++i3)
            arr[i1][i2][i3] = in[i1][i2][i3];
    }
};

The constructor takes a const T in[D1]...[DN] parameter and performs element-by-element copy via nested for-loops. Each loop variable is named i1 through iN and iterates from 0 to D1 through DN respectively. The assignment arr[i1]...[iN] = in[i1]...[iN] copies each element.

Reconstructed Output for Rank 4

What sub_6BC290 actually emits for a 4-dimensional array (directly from the decompiled string fragments):

template<typename T, size_t D1, size_t D2, size_t D3, size_t D4>
struct __nv_lambda_array_wrapper<T [D1][D2][D3][D4]> {
    T arr[D1][D2][D3][D4];
    __nv_lambda_array_wrapper(const T in[D1][D2][D3][D4]) {
        for(size_t i1 = 0; i1  < D1; ++i1)
        for(size_t i2 = 0; i2  < D2; ++i2)
        for(size_t i3 = 0; i3  < D3; ++i3)
        for(size_t i4 = 0; i4  < D4; ++i4)
            arr[i1][i2][i3][i4] = in[i1][i2][i3][i4];
    }
};

Note the double-space before < in the for condition -- this is present in the actual emitted code (visible in the decompiled sprintf format string "for(size_t i%u = 0; i%u  < D%u; ++i%u)").

sub_6BC290: emit_array_capture_helpers

Address 0x6BC290, 183 decompiled lines, in nv_transforms.c. Takes a single argument: void (*a1)(const char *), the text emission callback.

Algorithm

The function has two major loops. Each runs a counter (rank in the pseudocode below) from 2 to 8 and emits rank - 1 dimensions per iteration.

Loop 1 -- Array wrapper specializations:

for rank = 2 to 8:
    emit "template<typename T"
    for d = 1 to rank-1:
        emit ", size_t D{d}"
    emit ">\nstruct __nv_lambda_array_wrapper<T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> {T arr"
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit ";\n__nv_lambda_array_wrapper(const T in"
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit ") {"
    for d = 1 to rank-1:
        emit "\nfor(size_t i{d} = 0; i{d}  < D{d}; ++i{d})"
    emit " arr"
    for d = 1 to rank-1:
        emit "[i{d}]"
    emit " = in"
    for d = 1 to rank-1:
        emit "[i{d}]"
    emit ";\n}\n};\n"
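
The Loop 1 pseudocode translates directly into a small string builder. The sketch below follows the pseudocode literally, including the rank - 1 dimension count and the double space before <. The function name is hypothetical, and std::string concatenation replaces the callback and sprintf buffers of the real emitter.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Builds one array-wrapper specialization for a given loop-counter value.
// `rank` is the counter (2..8); rank - 1 dimension parameters are emitted.
std::string emit_array_wrapper_sketch(unsigned rank) {
  std::string out = "template<typename T";
  std::string dims, idx, loops;
  char buf[64];
  for (unsigned d = 1; d < rank; ++d) {
    out += ", size_t D" + std::to_string(d);
    dims += "[D" + std::to_string(d) + "]";
    idx += "[i" + std::to_string(d) + "]";
    // Same shape as the decompiled format string: four %u substitutions.
    std::snprintf(buf, sizeof buf,
                  "\nfor(size_t i%u = 0; i%u  < D%u; ++i%u)", d, d, d, d);
    loops += buf;
  }
  out += ">\nstruct __nv_lambda_array_wrapper<T " + dims + "> {T arr" + dims;
  out += ";\n__nv_lambda_array_wrapper(const T in" + dims + ") {";
  out += loops + " arr" + idx + " = in" + idx + ";\n}\n};\n";
  return out;
}
```

Calling it for every counter value from 2 to 8 and concatenating the results reproduces the shape of the first loop's output; the field-type loop can be sketched the same way with the two typedef templates.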

Loop 2 -- Field type specializations:

First emits the primary __nv_lambda_field_type:

emit "template <typename T>\nstruct __nv_lambda_field_type {\ntypedef T type;};"

Then for each rank from 2 to 8, emits two specializations (non-const and const):

for rank = 2 to 8:
    // Non-const specialization
    emit "template<typename T"
    for d = 1 to rank-1:
        emit ", size_t D{d}"
    emit ">\nstruct __nv_lambda_field_type<T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> {\ntypedef __nv_lambda_array_wrapper<T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> type;\n};\n"

    // Const specialization
    emit "template<typename T"
    for d = 1 to rank-1:
        emit ", size_t D{d}"
    emit ">\nstruct __nv_lambda_field_type<const T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> {\ntypedef const __nv_lambda_array_wrapper<T "
    for d = 1 to rank-1:
        emit "[D{d}]"
    emit "> type;\n};\n"

Stack Usage

Two stack buffers: v33[1024] for the for-loop lines (the sprintf format includes four %u substitutions) and s[1064] for the dimension fragments (smaller format: "%s%u%s" with prefix/suffix).

Emission Order in Preamble

sub_6BC290 is called from sub_6BCC20 (nv_emit_lambda_preamble) at step 3, after __nvdl_remove_ref/__nvdl_remove_const trait helpers and __nv_dl_tag, but before the primary __nv_dl_wrapper_t definition. This ordering is critical: __nv_dl_wrapper_t field declarations reference __nv_lambda_field_type, which in turn references __nv_lambda_array_wrapper, so both must be defined first.

Capture Type Emission (sub_46E640)

Address 0x46E640, approximately 400 decompiled lines, in cp_gen_be.c. Confirmed identity: nv_gen_extended_lambda_capture_types (assert string at line 17368 of cp_gen_be.c).

This function emits the template type arguments that appear in a wrapper struct instantiation. For a device lambda wrapper __nv_dl_wrapper_t<Tag, F1, F2, ..., FN>, this function generates the F1 through FN types. Each type must precisely match the declared type of the captured variable, with references and top-level const stripped.

Input

Takes __int64 **a1 -- a pointer to the lambda info structure. The capture list is a linked list starting at *a1 (offset +0 of the lambda info). Each capture entry is a node with:

| Offset | Size | Field |
| --- | --- | --- |
| +0 | 8 | next pointer (linked list) |
| +8 | 8 | variable_entity -- pointer to the captured variable's entity node |
| +24 | 8 | init_capture_scope -- scope for init-capture expressions |
| +32 | 1 | flags_byte_1 -- bit 0 = init-capture, bit 3 = explicit this, bit 7 = has braces/parens |
| +33 | 1 | flags_byte_2 -- bit 0 = paren-init (vs brace-init) |

The variable entity at offset +8 has:

  • Offset +8: name string (null if *this capture)
  • Offset +163: sign bit (bit 7) -- if set, this is a *this or this capture

Algorithm: Three Capture Kinds

The function walks the capture list and for each entry, dispatches on two conditions: the init-capture flag (i[4] & 1) and the *this flag (byte at entity+163 sign bit).

Case 1: Regular variable capture ((i[4] & 1) == 0 and entity+163 >= 0)

Emits:

, typename __nvdl_remove_ref<decltype(varname)>::type

Where varname is the string at entity+8. This strips reference qualification from the variable's type. The decltype(varname) ensures the type is deduced from the actual declaration, not from any decay.

Case 2: *this capture ((i[4] & 1) == 0 and entity+163 < 0)

Two sub-cases, depending on whether the lambda captures the this pointer or a by-value copy of the enclosing object (*this, a C++17 capture form):

If i[4] & 8 (explicit this):

, decltype(this) const

Otherwise (by-value *this):

, typename __nvdl_remove_const<typename __nvdl_remove_ref<decltype(*this) > ::type> :: type

If the lambda is non-const (mutable), const is not appended. The mutable check reads (byte)a1[3] & 2 -- if clear, appends const.

Case 3: Init-capture ((i[4] & 1) != 0)

Emits:

, typename __nv_lambda_trait_remove_cv<typename __nvdl_remove_ref<decltype({expr})>::type>::type

Where {expr} is the init-capture expression, emitted by calling sub_46D910 (the expression code generator). The expression is wrapped in {...} (brace-init) or (...) (paren-init) depending on byte+33 bit 0. The additional __nv_lambda_trait_remove_cv wrapper strips top-level const and volatile from the deduced type.
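
The effect of the trait wrappers in Cases 1 and 3 can be checked with stand-in traits. In this sketch remove_ref_sketch is a hypothetical stand-in for __nvdl_remove_ref, std::remove_cv stands in for __nv_lambda_trait_remove_cv, and the variables are illustrative only.

```cpp
#include <type_traits>

// Stand-in for __nvdl_remove_ref: strips lvalue and rvalue references.
template <typename T> struct remove_ref_sketch { typedef T type; };
template <typename T> struct remove_ref_sketch<T&> { typedef T type; };
template <typename T> struct remove_ref_sketch<T&&> { typedef T type; };

int g_value_sketch = 0;
int& g_ref_sketch = g_value_sketch;
const int g_const_sketch = 3;

// Case 1 shape: only the reference is removed, so const survives.
typedef remove_ref_sketch<decltype(g_ref_sketch)>::type case1_ref_type;
typedef remove_ref_sketch<decltype(g_const_sketch)>::type case1_const_type;

// Case 3 shape: reference removed first, then top-level cv-qualifiers.
// decltype((g_const_sketch)) is const int&, a typical lvalue expression type.
typedef std::remove_cv<
    remove_ref_sketch<decltype((g_const_sketch))>::type>::type case3_type;
```

The order of the two wrappers matters in Case 3: a const on the far side of a reference is not top-level, so the reference has to be removed before remove_cv can strip it.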

GCC Diagnostic Guards

When dword_126E1E8 is set (indicating the host compiler is GCC-based), the init-capture path wraps the decltype expression in pragma guards:

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunevaluated-expression"
decltype({expr})
#pragma GCC diagnostic pop

This suppresses warnings about operands inside decltype that are never evaluated. -Wunevaluated-expression is a Clang diagnostic group, and Clang also honors #pragma GCC diagnostic, so dword_126E1E8 likely marks a GNU-compatible host compiler (GCC or Clang) rather than MSVC.

Character-by-Character Emission

The decompiled code reveals that sub_46E640 does not use sub_467E50 (emit string) for all output. For short constant strings like ", ", "typename __nvdl_remove_ref<decltype(", etc., it emits character-by-character via putc(ch, stream) with a manual loop. This is a common pattern in EDG's code generator where inline string emission avoids function-call overhead for fixed text.

The character counter dword_106581C tracks the column position for line-wrapping decisions. Each emission path increments it by the string length.

Capture Value Emission (sub_46E550)

Address 0x46E550, 60 decompiled lines, in cp_gen_be.c. This function emits the actual values passed to the wrapper constructor -- the runtime expressions that initialize each captured field.

Algorithm

Walks the same capture linked list. For each entry, emits , followed by:

| Condition | Output |
| --- | --- |
| Regular variable ((byte+32 & 1) == 0, entity+163 >= 0) | Variable name string from entity+8 |
| Explicit this ((byte+32 & 8) != 0, entity+163 < 0) | this |
| By-value *this ((byte+32 & 8) == 0, entity+163 < 0) | *this |
| Init-capture ((byte+32 & 1) != 0) | The init-capture expression via sub_46D910 |

For init-captures, the expression is wrapped in (...) or {...} based on bit 0 of byte+33:

  • Bit 0 set: paren-init (expr)
  • Bit 0 clear: brace-init {expr}
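
The dispatch above can be sketched as a walk over a flattened capture list. The struct below is hypothetical: it replaces the real linked-list nodes and entity pointers with plain fields, and init-capture expressions are carried as ready-made text instead of being generated by sub_46D910.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical flattened view of one capture entry.
struct capture_sketch {
  std::string name;       // variable name; empty for this / *this
  std::string init_expr;  // init-capture expression text, if any
  bool is_init_capture;   // byte+32 bit 0
  bool is_explicit_this;  // byte+32 bit 3
  bool is_this_entity;    // sign bit of the byte at entity+163
  bool paren_init;        // byte+33 bit 0
};

// Emits ", value" for each capture, in list order.
std::string emit_capture_values_sketch(const std::vector<capture_sketch>& caps) {
  std::string out;
  for (const capture_sketch& c : caps) {
    out += ", ";
    if (c.is_init_capture) {
      out += c.paren_init ? "(" + c.init_expr + ")" : "{" + c.init_expr + "}";
    } else if (c.is_this_entity) {
      out += c.is_explicit_this ? "this" : "*this";
    } else {
      out += c.name;
    }
  }
  return out;
}
```

Every value is preceded by a comma, which fits a call site where the captures are appended after an earlier argument (the tag or the lambda expression).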

Relationship to Type Emission

sub_46E550 and sub_46E640 are called in sequence by the per-lambda wrapper emitter (sub_47B890, gen_lambda). The type emission produces the template type parameters; the value emission produces the constructor arguments. Together they construct an expression like:

__nv_dl_wrapper_t<
    __nv_dl_tag<decltype(&Closure::operator()), &Closure::operator(), 42>,
    typename __nvdl_remove_ref<decltype(x)>::type,
    typename __nvdl_remove_ref<decltype(y)>::type
>(tag, x, y)

Bitmap System

Rather than generating wrapper specializations for all possible capture counts (0 through 1023), cudafe++ maintains two 1024-bit bitmaps that record which counts were actually observed during frontend parsing. During preamble emission, only the specializations for set bits are generated.

Memory Layout

unk_1286980 (device lambda bitmap):
    Address: 0x1286980
    Size:    128 bytes = 16 x uint64_t = 1024 bits
    Bit N:   __nv_dl_wrapper_t specialization for N captures needed

unk_1286900 (host-device lambda bitmap):
    Address: 0x1286900
    Size:    128 bytes = 16 x uint64_t = 1024 bits
    Bit N:   __nv_hdl_wrapper_t specializations for N captures needed

sub_6BCBC0: nv_reset_capture_bitmasks

Address 0x6BCBC0, 9 decompiled lines. Called before each translation unit.

memset(&unk_1286980, 0, 0x80);   // Clear device bitmap (128 bytes)
memset(&unk_1286900, 0, 0x80);   // Clear host-device bitmap (128 bytes)

sub_6BCBF0: nv_record_capture_count

Address 0x6BCBF0, 13 decompiled lines. Called from scan_lambda (sub_447930) after counting captures.

_QWORD *result = &unk_1286900;          // Default: host-device bitmap
if (!a1)
    result = &unk_1286980;              // a1 == 0: device bitmap
result[a2 >> 6] |= 1LL << a2;          // Set bit a2

Parameters:

  • a1 (int): Bitmap selector. 0 = device, non-zero = host-device.
  • a2 (unsigned): Capture count (0-1023).

The bit-set logic: a2 >> 6 selects the uint64_t word (divides by 64), and 1LL << a2 sets the appropriate bit within that word. In C a shift count of 64 or more is undefined, but the x86-64 shift instruction masks the count to its low 6 bits for 64-bit operands, so the compiled code behaves as 1LL << (a2 & 63) and the word index and bit index stay consistent.

Note the mapping inversion: a1 == 0 maps to unk_1286980 (device), while a1 != 0 maps to unk_1286900 (host-device). This is counterintuitive but confirmed by the decompiled code.

Bitmap Scan in nv_emit_lambda_preamble

The scan loop in sub_6BCC20 processes each bitmap as 16 uint64_t words:

// Device lambda bitmap scan
uint64_t *ptr = (uint64_t *)&unk_1286980;
unsigned int idx = 0;
do {
    uint64_t word = *ptr;
    unsigned int limit = idx + 64;
    do {
        if (idx != 0 && (word & 1))
            sub_6BB790(idx, callback);   // emit_device_lambda_wrapper_specialization
        ++idx;
        word >>= 1;
    } while (limit != idx);
    ++ptr;
} while (idx != 1024);

// Host-device lambda bitmap scan
ptr = (uint64_t *)&unk_1286900;
idx = 0;
do {
    uint64_t word = *ptr;
    unsigned int limit = idx + 64;
    do {
        while ((word & 1) == 0) {    // Skip unset bits
            ++idx;
            word >>= 1;
            if (idx == limit) goto next_word;
        }
        sub_6BBB10(0, idx, callback);    // Non-mutable, HasFuncPtrConv=false
        sub_6BBEE0(0, idx, callback);    // Mutable, HasFuncPtrConv=false
        sub_6BBB10(1, idx, callback);    // Non-mutable, HasFuncPtrConv=true
        sub_6BBEE0(1, idx++, callback);  // Mutable, HasFuncPtrConv=true
        word >>= 1;
    } while (idx != limit);
next_word:
    ++ptr;
} while (idx != 1024);

Key differences between the two scans:

  • The device scan skips bit 0 (if (idx != 0 && ...)). The zero-capture case is handled by the primary template and its explicit <Tag> specialization already emitted as static text.
  • The host-device scan does not skip bit 0 -- zero-capture host-device lambdas (stateless lambdas with __host__ __device__) still need wrapper specializations because the host-device wrapper has function-pointer-conversion variants.
  • Each set bit in the host-device bitmap triggers four emitter calls (non-mutable/mutable x HasFuncPtrConv false/true), compared to one call per bit for device lambdas.
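
The setter and both scans can be modelled in portable C++. This is a sketch with hypothetical names: the bitmaps live in a struct instead of globals, the scans return a record of the emitter calls they would make, and the shift count is masked explicitly instead of relying on the hardware.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Two 1024-bit bitmaps, each stored as 16 64-bit words.
struct capture_bitmaps_sketch {
  std::uint64_t device[16] = {};
  std::uint64_t host_device[16] = {};
};

// Selector 0 = device bitmap, non-zero = host-device bitmap.
void record_capture_count_sketch(capture_bitmaps_sketch& b, int selector,
                                 unsigned count) {
  std::uint64_t* words = selector ? b.host_device : b.device;
  words[count >> 6] |= std::uint64_t{1} << (count & 63);
}

// Device scan: one emitter call per set bit, skipping bit 0.
std::vector<unsigned> scan_device_sketch(const capture_bitmaps_sketch& b) {
  std::vector<unsigned> calls;
  for (unsigned idx = 1; idx < 1024; ++idx)
    if ((b.device[idx >> 6] >> (idx & 63)) & 1) calls.push_back(idx);
  return calls;
}

struct hd_call_sketch {
  bool is_mutable;         // which emitter: sub_6BBB10 (false) or sub_6BBEE0 (true)
  bool has_func_ptr_conv;  // first argument passed to the emitter
  unsigned count;          // capture count
};

// Host-device scan: four emitter calls per set bit, bit 0 included.
std::vector<hd_call_sketch> scan_host_device_sketch(
    const capture_bitmaps_sketch& b) {
  std::vector<hd_call_sketch> calls;
  for (unsigned idx = 0; idx < 1024; ++idx) {
    if (!((b.host_device[idx >> 6] >> (idx & 63)) & 1)) continue;
    calls.push_back({false, false, idx});  // sub_6BBB10(0, idx)
    calls.push_back({true, false, idx});   // sub_6BBEE0(0, idx)
    calls.push_back({false, true, idx});   // sub_6BBB10(1, idx)
    calls.push_back({true, true, idx});    // sub_6BBEE0(1, idx)
  }
  return calls;
}
```

The flat loops visit bits in the same order as the word-at-a-time loops in the decompiled code, so the recorded call sequence is the same even though the control flow is simpler.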

How Fields Use __nv_lambda_field_type

When sub_6BB790 (emit_device_lambda_wrapper_specialization) generates a wrapper struct for N captures, each field is declared as:

typename __nv_lambda_field_type<F1>::type f1;
typename __nv_lambda_field_type<F2>::type f2;
// ... through fN

This indirection through __nv_lambda_field_type means:

  • If F1 is int, the field type is int (identity via primary template).
  • If F1 is float[3][4], the field type is __nv_lambda_array_wrapper<float[3][4]>, which stores a deep copy.
  • If F1 is const double[2][2], the field type is const __nv_lambda_array_wrapper<double[2][2]>.

The constructor mirrors this pattern:

__nv_dl_wrapper_t(Tag, F1 in1, F2 in2, ..., FN inN)
    : f1(in1), f2(in2), ..., fN(inN) { }

For array captures, the f1(in1) initialization invokes __nv_lambda_array_wrapper's constructor, which performs the element-by-element copy. For scalar captures, it is a trivial copy/move.
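
The interaction between the field-type trait, the array wrapper, and the wrapper constructor can be reproduced with stand-ins limited to two dimensions. All names ending in _sketch are hypothetical, the tag parameter is omitted, and the primary array wrapper is left undefined here instead of carrying the static_assert.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

template <typename T>
struct array_wrapper_sketch;  // primary template intentionally undefined

// 2D specialization: owns a deep copy made by the constructor.
template <typename T, std::size_t D1, std::size_t D2>
struct array_wrapper_sketch<T[D1][D2]> {
  T arr[D1][D2];
  array_wrapper_sketch(const T in[D1][D2]) {
    for (std::size_t i1 = 0; i1 < D1; ++i1)
      for (std::size_t i2 = 0; i2 < D2; ++i2)
        arr[i1][i2] = in[i1][i2];
  }
};

// Identity for non-arrays, array wrapper for 2D arrays.
template <typename T>
struct field_type_sketch { typedef T type; };

template <typename T, std::size_t D1, std::size_t D2>
struct field_type_sketch<T[D1][D2]> {
  typedef array_wrapper_sketch<T[D1][D2]> type;
};

// Two-capture wrapper mirroring the f1(in1), f2(in2) constructor.
// A parameter of array type F2 is adjusted to a pointer, which is why
// the field cannot simply be declared as F2.
template <typename F1, typename F2>
struct dl_wrapper_sketch {
  typename field_type_sketch<F1>::type f1;
  typename field_type_sketch<F2>::type f2;
  dl_wrapper_sketch(F1 in1, F2 in2) : f1(in1), f2(in2) {}
};
```

After construction the wrapper holds its own copy of the array, so later writes to the original do not show through.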

End-to-End Example

Given user code:

int x = 42;
float matrix[3][4];
auto lam = [x, matrix] __device__ () { /* use x and matrix */ };

cudafe++ produces:

  1. Frontend (scan_lambda): Counts 2 captures. Calls sub_6BCBF0(0, 2) to set bit 2 in the device bitmap.

  2. Preamble emission (sub_6BCC20): Scans the device bitmap, finds bit 2 set. Calls sub_6BB790(2, emit) which generates:

template <typename Tag, typename F1, typename F2>
struct __nv_dl_wrapper_t<Tag, F1, F2> {
    typename __nv_lambda_field_type<F1>::type f1;
    typename __nv_lambda_field_type<F2>::type f2;
    __nv_dl_wrapper_t(Tag, F1 in1, F2 in2) : f1(in1), f2(in2) { }
    template <typename...U1>
    int operator()(U1...) { return 0; }
};

  3. Per-lambda emission (sub_47B890 calling sub_46E640 and sub_46E550):

__nv_dl_wrapper_t<
    __nv_dl_tag<decltype(&ClosureType::operator()), &ClosureType::operator(), 0>,
    typename __nvdl_remove_ref<decltype(x)>::type,        // int
    typename __nvdl_remove_ref<decltype(matrix)>::type     // float[3][4]
>(tag, x, matrix)

  4. Template instantiation: The host compiler instantiates the wrapper. F1 = int so __nv_lambda_field_type<int>::type = int (identity). F2 = float[3][4] so __nv_lambda_field_type<float[3][4]>::type = __nv_lambda_array_wrapper<float[3][4]>, which selects the two-dimensional specialization with its doubly nested for-loop constructor.

Function Map

| Address | Name (recovered) | Source | Lines | Role |
| --- | --- | --- | --- | --- |
| sub_6BC290 | emit_array_capture_helpers | nv_transforms.c | 183 | Emit __nv_lambda_array_wrapper (loop counter 2-8) and __nv_lambda_field_type specializations |
| sub_6BCBC0 | nv_reset_capture_bitmasks | nv_transforms.c | 9 | Zero both 128-byte bitmaps at translation unit start |
| sub_6BCBF0 | nv_record_capture_count | nv_transforms.c | 13 | Set bit N in device or host-device bitmap |
| sub_6BCC20 | nv_emit_lambda_preamble | nv_transforms.c | 244 | Master emitter -- scans bitmaps, calls all sub-emitters |
| sub_6BB790 | emit_device_lambda_wrapper_specialization | nv_transforms.c | 191 | Emit __nv_dl_wrapper_t<Tag, F1..FN> for N captures |
| sub_46E640 | nv_gen_extended_lambda_capture_types | cp_gen_be.c | ~400 | Emit decltype-based template type args for each capture |
| sub_46E550 | (capture value emitter) | cp_gen_be.c | ~60 | Emit variable names / this / *this / init-capture exprs |
| sub_46D910 | (expression code generator) | cp_gen_be.c | -- | Called by both sub_46E640 and sub_46E550 for init-captures |
| sub_467E50 | (emit string to output) | cp_gen_be.c | -- | String emission helper used by code generator |
| sub_467DA0 | (column tracking helper) | cp_gen_be.c | -- | Called when dword_1065818 is set for line-length management |

Global State

| Variable | Address | Size | Purpose |
| --- | --- | --- | --- |
| unk_1286980 | 0x1286980 | 128 bytes | Device lambda capture-count bitmap |
| unk_1286900 | 0x1286900 | 128 bytes | Host-device lambda capture-count bitmap |
| dword_106581C | 0x106581C | 4 bytes | Column counter for output line tracking |
| dword_1065818 | 0x1065818 | 4 bytes | Line-length management enabled flag |
| dword_126E1E8 | 0x126E1E8 | 4 bytes | GCC-compatible host compiler flag (enables diagnostic pragmas) |
| stream | (global) | 8 bytes | Output FILE* for code generation |

Preamble Injection

The entire CUDA extended lambda template library -- every __nv_dl_wrapper_t, every __nv_hdl_wrapper_t, every trait helper and detection macro -- enters the compilation through a single function: sub_6BCC20 (nv_emit_lambda_preamble). This 244-line function in nv_transforms.c accepts a void(*emit)(const char*) callback and produces raw C++ source text that is injected into the .int.c output stream. The preamble is emitted exactly once per translation unit, triggered by a sentinel type declaration named __nv_lambda_preheader_injection. The trigger mechanism lives in sub_4864F0 (gen_type_decl in cp_gen_be.c), which string-compares each type declaration's name against the sentinel marker, emits a synthetic #line directive, and then calls the master emitter.

The preamble contains 20 logical emission steps, ranging from simple type traits (4 lines each) to bitmap-driven loops that generate hundreds of template specializations. The design is driven by a critical optimization: rather than emitting all 1024 possible capture-count specializations for each wrapper type, cudafe++ maintains two 1024-bit bitmaps (unk_1286980 for device lambdas, unk_1286900 for host-device lambdas) that track which capture counts were actually used during frontend parsing. The preamble emitter scans these bitmaps and generates only the specializations that the translation unit requires.

Key Facts

| Property | Value |
| --- | --- |
| Master emitter | sub_6BCC20 (nv_emit_lambda_preamble, 244 lines, nv_transforms.c) |
| Trigger function | sub_4864F0 (gen_type_decl, 751 lines, cp_gen_be.c) |
| Emit callback (typical) | sub_467E50 (raw text output to .int.c stream) |
| Sentinel type name | __nv_lambda_preheader_injection |
| Synthetic source file | "nvcc_internal_extended_lambda_implementation" |
| Enable flag | dword_106BF38 (--extended-lambda / --expt-extended-lambda) |
| Device bitmap | unk_1286980 (128 bytes = 16 x uint64 = 1024 bits) |
| Host-device bitmap | unk_1286900 (128 bytes = 16 x uint64 = 1024 bits) |
| C++17 noexcept gate | dword_126E270 (controls noexcept trait variants) |
| One-shot guarantee | Once emitted, the sentinel type is wrapped in #if 0 / #endif |
| Max capture count | 1023 (1024 bits, index range 0..1023) |
| Array dimension range | 1D through 7D (7 specializations per wrapper) |

Trigger Mechanism: sub_4864F0 (gen_type_decl)

The preamble is not emitted eagerly at the start of compilation. Instead, the EDG frontend inserts a synthetic type declaration named __nv_lambda_preheader_injection into the IL at the point where the lambda template library is needed. During backend code generation, sub_4864F0 (the type declaration emitter in cp_gen_be.c) encounters this declaration and performs the following sequence:

// sub_4864F0, decompiled lines 200-242
// Check: is this a type tagged with the preheader marker? (bit at v4-8 & 0x10)
if ((*(_BYTE *)(v4 - 8) & 0x10) != 0)
{
    if (dword_106BF38)                           // --extended-lambda enabled?
    {
        v18 = *(_QWORD *)(v4 + 8);              // get type name pointer
        if (v18)
        {
            // Compare name against "__nv_lambda_preheader_injection" (31 chars + NUL)
            v30 = "__nv_lambda_preheader_injection";
            v31 = 32;                             // comparison length
            do {
                if (!v31) break;
                v29 = *(_BYTE *)v18++ == *v30++;
                --v31;
            } while (v29);

            if (v29)                              // name matched
            {
                if (dword_106581C)                // pending newline needed
                    sub_467D60();                 // emit newline

                // Emit #line directive pointing to synthetic source file
                v32 = "#line";
                if (dword_126E1DC)                // shorthand mode
                    v32 = "#";
                sub_467E50(v32);
                sub_467E50(" 1 \"nvcc_internal_extended_lambda_implementation\"");

                if (dword_106581C)
                    sub_467D60();

                // THE CRITICAL CALL: emit entire lambda template library
                sub_6BCC20(sub_467E50);

                dword_1065820 = 0;                // reset line tracking state
                qword_1065828 = 0;
            }
        }
    }
    // Suppress the sentinel type from host compiler output
    sub_46BC80("#if 0");
    --dword_1065834;
    sub_467D60();
}

Trigger Conditions

Three conditions must all be true for preamble emission:

  1. Marker bit set -- The type declaration node has bit 0x10 set at offset -8 (the IL node header flags). This bit marks NVIDIA-injected synthetic declarations.

  2. Extended lambda mode active -- dword_106BF38 is nonzero, meaning --extended-lambda (or --expt-extended-lambda) was passed to nvcc.

  3. Name matches sentinel -- The type's name at offset +8 is byte-equal to "__nv_lambda_preheader_injection" (31 characters plus the terminating NUL, 32 bytes in total; the comparison loop runs for at most 32 iterations, so the NUL is compared as well).
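
The three conditions above can be sketched as a single predicate. This is a minimal illustration, not the decompiled logic: TypeDeclView, kInjectedMarker, and should_emit_preamble are hypothetical names, and the struct models only the two fields that the trigger reads. std::strcmp stands in for the bounded 32-byte loop, which gives the same answer for NUL-terminated names.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical view of the two fields sub_4864F0 inspects; real IL nodes are larger.
struct TypeDeclView {
    std::uint8_t header_flags;  // byte at node offset -8
    const char*  name;          // pointer at node offset +8, may be null
};

constexpr std::uint8_t kInjectedMarker = 0x10;  // marker bit in the header flags
constexpr char kSentinel[] = "__nv_lambda_preheader_injection";

// 31 visible characters plus the terminating NUL.
static_assert(sizeof(kSentinel) == 32, "sentinel occupies 32 bytes");

// Returns true only when all three trigger conditions hold.
inline bool should_emit_preamble(const TypeDeclView& decl,
                                 bool extended_lambda_enabled) {
    if ((decl.header_flags & kInjectedMarker) == 0) return false;  // condition 1
    if (!extended_lambda_enabled) return false;                    // condition 2
    if (decl.name == nullptr) return false;                        // null name guard
    return std::strcmp(decl.name, kSentinel) == 0;                 // condition 3
}
```

Note that in the decompiled code the #if 0 suppression depends only on the marker bit, so a marked declaration is hidden from the host compiler even when the other two conditions fail.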

Synthetic Source File Context

Before calling sub_6BCC20, the trigger emits:

#line 1 "nvcc_internal_extended_lambda_implementation"

This #line directive serves two purposes: it changes the apparent source file for any diagnostics emitted during template parsing, and it provides a recognizable marker in the generated .int.c file for debugging. All lambda template infrastructure appears to originate from "nvcc_internal_extended_lambda_implementation" rather than from the user's source file. The dword_126E1DC flag selects between #line and the shorthand # form for the line directive.

One-Shot Guarantee and Sentinel Suppression

After the preamble is emitted, the sentinel type declaration is wrapped in #if 0 / #endif. The #if 0 is emitted immediately after the preamble call (line 239: sub_46BC80("#if 0")). The matching #endif is emitted later when sub_4864F0 reaches the closing path for this declaration type (lines 736-745):

else if ((*(_BYTE *)(v4 - 8) & 0x10) != 0)
{
    if (dword_106581C)
        sub_467D60();
    ++dword_1065834;
    sub_468190("#endif");
    --dword_1065834;
    sub_467D60();
    dword_1065820 = 0;
    qword_1065828 = 0;
}

The sentinel type __nv_lambda_preheader_injection never reaches the host compiler's type system -- it exists solely as a positional marker in the IL. Because the EDG frontend inserts exactly one such declaration per translation unit, and the backend processes declarations sequentially, the preamble is guaranteed to be emitted exactly once.

After emission, dword_1065820 (output line counter) and qword_1065828 (output state pointer) are reset to zero, ensuring subsequent #line directives correctly track the user's source file.

Master Emitter: sub_6BCC20

The function signature:

__int64 __fastcall sub_6BCC20(void (__fastcall *a1)(const char *));

The single parameter a1 is an output callback. In production, this is always sub_467E50 -- the function that writes raw text to the .int.c output stream. Every a1("...") call appends the given string literal to the output. The function has no other state parameters; all needed state (bitmaps, C++17 flag) is read from globals.

The 17 emission steps are executed in a fixed order. Steps 6 and 9 contain bitmap-scanning loops that conditionally call sub-emitters based on which capture counts were registered during frontend parsing. Step 11 is gated on the C++17 noexcept flag; every other step is unconditional.

Step 1: Type Removal Traits and Wrapper Helper Macro

The first a1(...) call emits the largest single string literal in the function -- three foundational metaprogramming utilities:

#define __NV_LAMBDA_WRAPPER_HELPER(X, Y) decltype(X), Y

template <typename T>
struct __nvdl_remove_ref { typedef T type; };

template<typename T>
struct __nvdl_remove_ref<T&> { typedef T type; };

template<typename T>
struct __nvdl_remove_ref<T&&> { typedef T type; };

template <typename T, typename... Args>
struct __nvdl_remove_ref<T(&)(Args...)> {
  typedef T(*type)(Args...);
};

template <typename T>
struct __nvdl_remove_const { typedef T type; };

template <typename T>
struct __nvdl_remove_const<T const> { typedef T type; };

__NV_LAMBDA_WRAPPER_HELPER(X, Y) expands to decltype(X), Y. It provides the <U, func> pair for tag type construction from a single expression. At each lambda wrapper call site, the per-lambda emitter (sub_47B890) generates __NV_LAMBDA_WRAPPER_HELPER(&Closure::operator(), &Closure::operator()), which expands to decltype(&Closure::operator()), &Closure::operator().

__nvdl_remove_ref strips lvalue and rvalue references, with a special case for function references (T(&)(Args...) -> T(*)(Args...)). __nvdl_remove_const strips top-level const. Both are used during capture type emission to normalize captured variable types before passing them as template arguments to wrapper structs.
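
The behavior of the two traits can be checked with a small standalone sketch. The templates below are copies of the emitted ones with the leading double underscore dropped (those names are reserved for the implementation). The normalized alias is a hypothetical composition added for illustration; the order in which cudafe++ applies the two traits is an assumption here.

```cpp
#include <type_traits>

// Renamed copies of the emitted traits.
template <typename T> struct nvdl_remove_ref { typedef T type; };
template <typename T> struct nvdl_remove_ref<T&> { typedef T type; };
template <typename T> struct nvdl_remove_ref<T&&> { typedef T type; };
template <typename T, typename... Args>
struct nvdl_remove_ref<T (&)(Args...)> { typedef T (*type)(Args...); };

template <typename T> struct nvdl_remove_const { typedef T type; };
template <typename T> struct nvdl_remove_const<T const> { typedef T type; };

// Hypothetical helper: strip the reference first, then top-level const.
template <typename T>
using normalized =
    typename nvdl_remove_const<typename nvdl_remove_ref<T>::type>::type;

// References are stripped; const underneath the reference survives remove_ref.
static_assert(std::is_same<nvdl_remove_ref<int&>::type, int>::value, "");
static_assert(std::is_same<nvdl_remove_ref<int&&>::type, int>::value, "");
static_assert(std::is_same<nvdl_remove_ref<const int&>::type, const int>::value, "");

// Function references decay to function pointers.
static_assert(std::is_same<nvdl_remove_ref<int (&)(double)>::type,
                           int (*)(double)>::value, "");

// Only top-level const is removed; pointee const is untouched.
static_assert(std::is_same<nvdl_remove_const<const int>::type, int>::value, "");
static_assert(std::is_same<nvdl_remove_const<const int*>::type,
                           const int*>::value, "");
```

The function-reference specialization is selected over the plain T& one because it is more specialized under partial ordering.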

Step 2: Device Lambda Tag

template <typename U, U func, unsigned>
struct __nv_dl_tag { };

The device lambda tag type. U is the type of the lambda's operator(), func is a non-type template parameter holding the pointer to that operator, and the unsigned disambiguates lambdas with identical operator types at different call sites within the same TU.

Step 3: Array Capture Helpers (sub_6BC290)

sub_6BCC20 calls sub_6BC290(a1), which emits the __nv_lambda_array_wrapper and __nv_lambda_field_type infrastructure for C-style array captures. This is a separate 183-line function that generates templates for array dimensions 2 through 8.

Three template families are emitted:

Primary template (static_assert trap):

template <typename T>
struct __nv_lambda_array_wrapper {
    static_assert(sizeof(T) == 0,
        "nvcc internal error: unexpected failure in capturing array variable");
};

Per-dimension partial specializations (dimensions 2-8). For each dimension D from 2 to 8, sub_6BC290 generates a partial specialization with D size_t template parameters and a nested-for-loop constructor:

// Example: 3D (v1 = 3)
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_array_wrapper<T [D1][D2][D3]> {
    T arr[D1][D2][D3];
    __nv_lambda_array_wrapper(const T in[D1][D2][D3]) {
        for(size_t i1 = 0; i1 < D1; ++i1)
        for(size_t i2 = 0; i2 < D2; ++i2)
        for(size_t i3 = 0; i3 < D3; ++i3)
            arr[i1][i2][i3] = in[i1][i2][i3];
    }
};

Field type trait specializations:

template <typename T>
struct __nv_lambda_field_type { typedef T type; };

// For each dimension D from 2 to 8:
template<typename T, size_t D1, ..., size_t DN>
struct __nv_lambda_field_type<T [D1]...[DN]> {
    typedef __nv_lambda_array_wrapper<T [D1]...[DN]> type;
};

template<typename T, size_t D1, ..., size_t DN>
struct __nv_lambda_field_type<const T [D1]...[DN]> {
    typedef const __nv_lambda_array_wrapper<T [D1]...[DN]> type;
};

The loop structure in sub_6BC290 uses two stack buffers: v33[1024] for the nested-for-loop lines (each sprintf call formats four copies of the loop index variable) and s[1064] for dimension parameters and array subscript expressions. The outer loop runs from v1 = 2 to v1 = 8 (inclusive, 7 iterations). 1D arrays do not need a wrapper -- they can be captured directly. Arrays of 9+ dimensions are unsupported (the primary template's static_assert fires).
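
The per-dimension generation can be approximated with ordinary string building. The sketch below uses std::string instead of the fixed sprintf buffers, and its whitespace is illustrative rather than byte-exact; make_array_wrapper_specialization and make_all_array_wrappers are hypothetical names.

```cpp
#include <string>

// Builds the text of one partial specialization for a dims-dimensional array.
inline std::string make_array_wrapper_specialization(int dims) {
    std::string params, subs, loops, index;
    for (int d = 1; d <= dims; ++d) {
        const std::string n = std::to_string(d);
        params += ", size_t D" + n;   // template parameter list
        subs   += "[D" + n + "]";     // array bounds, e.g. [D1][D2]
        // One nested loop line per dimension; the index number appears four times.
        loops  += "        for(size_t i" + n + " = 0; i" + n + " < D" + n +
                  "; ++i" + n + ")\n";
        index  += "[i" + n + "]";     // element subscript, e.g. [i1][i2]
    }
    return "template<typename T" + params + ">\n"
           "struct __nv_lambda_array_wrapper<T " + subs + "> {\n"
           "    T arr" + subs + ";\n"
           "    __nv_lambda_array_wrapper(const T in" + subs + ") {\n" +
           loops +
           "            arr" + index + " = in" + index + ";\n"
           "    }\n"
           "};\n";
}

// Mirrors the outer loop: dimensions 2 through 8 inclusive (7 iterations).
inline std::string make_all_array_wrappers() {
    std::string out;
    for (int dims = 2; dims <= 8; ++dims)
        out += make_array_wrapper_specialization(dims) + "\n";
    return out;
}
```

Calling make_array_wrapper_specialization(3) yields text with the same shape as the 3D example shown earlier.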

See Capture Handling for detailed documentation.

Step 4: Primary __nv_dl_wrapper_t and Zero-Capture Specialization

template <typename Tag, typename...CapturedVarTypePack>
struct __nv_dl_wrapper_t {
    static_assert(sizeof...(CapturedVarTypePack) == 0,
                  "nvcc internal error: unexpected number of captures!");
};

template <typename Tag>
struct __nv_dl_wrapper_t<Tag> {
    __nv_dl_wrapper_t(Tag) { }
    template <typename...U1>
    int operator()(U1...) { return 0; }
};

The primary template traps any instantiation with a non-zero capture count that lacks a matching specialization. The zero-capture specialization provides a trivial constructor and a dummy operator() returning int(0). This return value is never used at runtime -- the device compiler dispatches through the tag's encoded function pointer.

Step 5: Trailing-Return Tag and Base Specialization

template <typename U, U func, typename Return, unsigned>
struct __nv_dl_trailing_return_tag { };

template <typename U, U func, typename Return, unsigned Id>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id> > {
    __nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>) { }

    template <typename...U1> Return operator()(U1...) {
        __builtin_unreachable();
    }
};

For lambdas with explicit trailing return types (-> ReturnType), the tag carries the Return type as a template parameter. The operator() returns Return instead of int, with __builtin_unreachable() satisfying the compiler without generating actual return-value code.

The trailing-return tag and its zero-capture specialization are emitted as two separate a1(...) calls. The __builtin_unreachable() body is split: a1("__builtin_unreachable(); }\n}; \n\n").

Step 6: Device Lambda Bitmap Scan

Scans unk_1286980 (the device lambda bitmap, 1024 bits) and calls sub_6BB790 for each set bit with index greater than zero:

// Decompiled from sub_6BCC20
v1 = (unsigned __int64 *)&unk_1286980;
v2 = 0;
do {
    v3 = *v1;                          // load 64-bit word
    v4 = v2 + 64;                      // word boundary
    do {
        if (v2 && (v3 & 1) != 0)       // skip bit 0, emit for set bits
            sub_6BB790(v2, a1);         // emit_device_lambda_wrapper_specialization
        ++v2;
        v3 >>= 1;
    } while (v4 != v2);
    ++v1;
} while (v4 != 1024);

Bit 0 is explicitly skipped (if (v2 && ...)). The zero-capture case is handled by the specializations in steps 4 and 5.

For each set bit N > 0, sub_6BB790(N, a1) emits two __nv_dl_wrapper_t partial specializations: one for __nv_dl_tag and one for __nv_dl_trailing_return_tag, each with N typed fields, a constructor taking N parameters, and an initializer list binding inK to fK. See Device Lambda Wrapper for full emitter logic.

This bitmap-driven approach is the critical compile-time optimization. A translation unit using lambdas with capture counts 1, 3, and 5 emits exactly 6 struct specializations rather than 2046 (1023 counts x 2 tag variants).
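
As a rough model of the device scan, the following sketch walks a 1024-bit bitmap and collects the capture counts that would trigger sub_6BB790. It replaces the decompiled word-by-word shifting with direct bit indexing, which visits the same bits in the same order. CaptureBitmap and both function names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using CaptureBitmap = std::array<std::uint64_t, 16>;  // 16 x 64 = 1024 bits

// Returns every registered capture count N > 0, in ascending order.
inline std::vector<unsigned> device_capture_counts(const CaptureBitmap& bitmap) {
    std::vector<unsigned> counts;
    for (unsigned bit = 1; bit < 1024; ++bit) {  // bit 0 is deliberately skipped
        if ((bitmap[bit >> 6] >> (bit & 63)) & 1u)
            counts.push_back(bit);
    }
    return counts;
}

// Each set bit produces two specializations: __nv_dl_tag and the
// trailing-return tag variant.
inline std::size_t device_specialization_count(const CaptureBitmap& bitmap) {
    return 2 * device_capture_counts(bitmap).size();
}
```

With bits 1, 3, and 5 set, the count is 6, matching the figure quoted above.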

Step 7: Host-Device Helper Class (__nv_hdl_helper)

Emitted inside an anonymous namespace:

namespace {
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
struct __nv_hdl_helper {
    typedef void * (*fp_copier_t)(void *);
    typedef OpFuncR (*fp_caller_t)(void *, OpFuncArgs...);
    typedef void (*fp_deleter_t)(void *);
    typedef OpFuncR (*fp_noobject_caller_t)(OpFuncArgs...);

    static fp_copier_t fp_copier;
    static fp_deleter_t fp_deleter;
    static fp_caller_t fp_caller;
    static fp_noobject_caller_t fp_noobject_caller;
};

// Out-of-line static member definitions (4 members):
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier_t
    __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier;

// ... (fp_deleter, fp_caller, fp_noobject_caller follow the same pattern)
}

The anonymous namespace prevents ODR violations across TUs. The Tag parameter isolates function pointer storage per lambda site even when call signatures are identical. The entire struct definition plus all four out-of-line member definitions are emitted as a single a1(...) call.

| Pointer | Purpose |
|---|---|
| fp_copier | Heap-copies a Lambda from void* (used by copy constructor) |
| fp_caller | Casts void* to Lambda* and invokes operator() |
| fp_deleter | Casts void* to Lambda* and deletes it |
| fp_noobject_caller | Stores captureless lambda as raw function pointer |

Step 8: Primary __nv_hdl_wrapper_t

template <bool IsMutable, bool HasFuncPtrConv, bool NeverThrows,
          typename Tag, typename OpFunc, typename...CapturedVarTypePack>
struct __nv_hdl_wrapper_t {
    static_assert(sizeof...(CapturedVarTypePack) == 0,
        "nvcc internal error: unexpected number of captures "
        "in __host__ __device__ lambda!");
};

Same safety-net pattern as the device wrapper.

Step 9: Host-Device Lambda Bitmap Scan

Scans unk_1286900 (the host-device bitmap, 1024 bits). Unlike the device scan, this loop does not skip bit 0 -- the zero-capture host-device case still requires distinct specializations for HasFuncPtrConv=true vs HasFuncPtrConv=false.

For each set bit N, four specialization calls are made:

v5 = (unsigned __int64 *)&unk_1286900;
v6 = 0;
do {
    v7 = *v5;
    v8 = v6 + 64;
    do {
        while ((v7 & 1) == 0) {        // fast-skip unset bits
            ++v6;
            v7 >>= 1;
            if (v6 == v8) goto LABEL_13;
        }
        sub_6BBB10(0, v6, a1);         // IsMutable=false, HasFuncPtrConv=false
        sub_6BBEE0(0, v6, a1);         // IsMutable=true,  HasFuncPtrConv=false
        sub_6BBB10(1, v6, a1);         // IsMutable=false, HasFuncPtrConv=true
        v9 = v6++;
        v7 >>= 1;
        sub_6BBEE0(1, v9, a1);        // IsMutable=true,  HasFuncPtrConv=true
    } while (v6 != v8);
LABEL_13:
    ++v5;
} while (v6 != 1024);

Note the ordering asymmetry in the fourth call: sub_6BBEE0(1, v9, a1) uses the pre-increment value v9 because v6 has already been incremented by the v9 = v6++ expression.

The inner while ((v7 & 1) == 0) loop provides fast skipping over consecutive unset bits without executing four function calls per zero bit. This is an optimization compared to the device scan loop.

| Call | a1 | a2 | IsMutable | HasFuncPtrConv | operator() qualifier |
|---|---|---|---|---|---|
| sub_6BBB10(0, N, emit) | 0 | N | false | false | const noexcept(NeverThrows) |
| sub_6BBEE0(0, N, emit) | 0 | N | true | false | noexcept(NeverThrows) (no const) |
| sub_6BBB10(1, N, emit) | 1 | N | false | true | const noexcept(NeverThrows) |
| sub_6BBEE0(1, N, emit) | 1 | N | true | true | noexcept(NeverThrows) (no const) |

The sole difference between sub_6BBB10 and sub_6BBEE0 is that sub_6BBB10 emits "false," for IsMutable and adds a3("const ") before the noexcept qualifier on operator(), while sub_6BBEE0 emits "true," and omits the const. They are otherwise structurally identical -- 238 vs 236 lines, the 2-line difference being exactly the a3("const ") call.
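
The emission order of the host-device scan can be modeled as follows. This is a simplified sketch under the assumption that only the order and the (IsMutable, HasFuncPtrConv, N) triple matter; HdlSpecialization and hdl_specializations are hypothetical names, and the fast-skip inner loop is replaced with a plain per-bit test that yields the same sequence.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// One emitted __nv_hdl_wrapper_t specialization, reduced to its parameters.
struct HdlSpecialization {
    bool     is_mutable;
    bool     has_func_ptr_conv;
    unsigned captures;
};

inline std::vector<HdlSpecialization>
hdl_specializations(const std::array<std::uint64_t, 16>& bitmap) {
    std::vector<HdlSpecialization> out;
    for (unsigned bit = 0; bit < 1024; ++bit) {  // bit 0 is NOT skipped here
        if (((bitmap[bit >> 6] >> (bit & 63)) & 1u) == 0) continue;
        for (int conv = 0; conv <= 1; ++conv) {
            out.push_back({false, conv != 0, bit});  // sub_6BBB10(conv, bit, emit)
            out.push_back({true,  conv != 0, bit});  // sub_6BBEE0(conv, bit, emit)
        }
    }
    return out;
}
```

For every set bit the sequence is non-mutable then mutable with HasFuncPtrConv=false, followed by the same pair with HasFuncPtrConv=true, as in the table above.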

See Host-Device Lambda Wrapper for the complete internal structure of each specialization.

Step 10: __nv_hdl_helper_trait_outer (Base Specializations)

The deduction helper trait that extracts the wrapper type from a lambda's operator() signature:

template <bool IsMutable, bool HasFuncPtrConv, typename ...CaptureArgs>
struct __nv_hdl_helper_trait_outer {
    template <typename Tag, typename Lambda>
    struct __nv_hdl_helper_trait
        : public __nv_hdl_helper_trait<Tag, decltype(&Lambda::operator())> { };

    // Match const operator() (non-mutable lambda):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

    // Match non-const operator() (mutable lambda):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...)> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

The primary __nv_hdl_helper_trait inherits from a specialization on decltype(&Lambda::operator()). The compiler deduces the member function pointer type and pattern-matches against the const or non-const specialization. Both produce NeverThrows=false.

This block is emitted without a closing }; -- the noexcept variants (step 11) are conditionally appended before the closing brace.

Step 11: C++17 Noexcept Trait Variants (Conditional)

Gated on dword_126E270:

if (dword_126E270)
    a1(/* noexcept trait specializations */);
a1("\n};");  // close __nv_hdl_helper_trait_outer

When C++17 noexcept-in-type-system is active, two additional __nv_hdl_helper_trait specializations are emitted:

    // Match const noexcept operator():
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const noexcept> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

    // Match non-const noexcept operator():
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) noexcept> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

The noexcept specializations produce NeverThrows=true. In C++17, R(C::*)(Args...) const noexcept is a distinct type from R(C::*)(Args...) const, so without these specializations, noexcept lambdas would fail to match and the trait chain would break.
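
The effect of the conditional specializations can be reproduced outside cudafe++. The sketch below uses the standard feature-test macro __cpp_noexcept_function_type as a stand-in for dword_126E270 (that correspondence is an assumption) and a hypothetical never_throws trait in place of the emitted __nv_hdl_helper_trait specializations.

```cpp
#include <type_traits>

#if defined(__cpp_noexcept_function_type)
constexpr bool kNoexceptInTypeSystem = true;   // C++17 and later
#else
constexpr bool kNoexceptInTypeSystem = false;  // noexcept is not part of the type
#endif

template <typename MemFn> struct never_throws;  // primary intentionally undefined

// Always present: const and non-const operator() (step 10 analogue).
template <typename C, typename R, typename... Args>
struct never_throws<R (C::*)(Args...) const> : std::false_type {};
template <typename C, typename R, typename... Args>
struct never_throws<R (C::*)(Args...)> : std::false_type {};

#if defined(__cpp_noexcept_function_type)
// Only valid when noexcept is part of the function type (step 11 analogue).
// Without the guard these would redefine the two specializations above.
template <typename C, typename R, typename... Args>
struct never_throws<R (C::*)(Args...) const noexcept> : std::true_type {};
template <typename C, typename R, typename... Args>
struct never_throws<R (C::*)(Args...) noexcept> : std::true_type {};
#endif

// Deduces the trait from a (non-generic) lambda object.
template <typename Lambda>
constexpr bool lambda_never_throws(const Lambda&) {
    return never_throws<decltype(&Lambda::operator())>::value;
}
```

Before C++17 a noexcept lambda simply matches the plain specializations, which is why the extra pair must be gated rather than emitted unconditionally.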

Step 12: __nv_hdl_create_wrapper_t Factory

template<bool IsMutable, bool HasFuncPtrConv, typename Tag,
         typename...CaptureArgs>
struct __nv_hdl_create_wrapper_t {
    template <typename Lambda>
    static auto __nv_hdl_create_wrapper(Lambda &&lam, CaptureArgs... args)
        -> decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...))
    {
        typedef decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...)) container_type;
        return container_type(Tag{}, std::move(lam), args...);
    }
};

This factory is the entry point called at each host-device lambda usage site. The trailing return type chains through the trait hierarchy to deduce the exact __nv_hdl_wrapper_t specialization. The body constructs the deduced wrapper with Tag{}, the moved lambda, and the capture arguments.

Step 13: CV-Removal Traits

template<typename T> struct __nv_lambda_trait_remove_const { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_const<T const> { typedef T type; };

template<typename T> struct __nv_lambda_trait_remove_volatile { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_volatile<T volatile> { typedef T type; };

template<typename T> struct __nv_lambda_trait_remove_cv {
    typedef typename __nv_lambda_trait_remove_const<
        typename __nv_lambda_trait_remove_volatile<T>::type>::type type;
};

These are distinct from the __nvdl_remove_ref/__nvdl_remove_const emitted in step 1. The step-1 traits are used during capture type normalization at wrapper call sites. The step-13 traits are used by the detection macros (steps 14, 16, and 17) to strip CV qualifiers before testing whether a type is an extended lambda wrapper.

Step 14: Device Lambda Detection Trait

template <typename T>
struct __nv_extended_device_lambda_trait_helper {
    static const bool value = false;
};

template <typename T1, typename...Pack>
struct __nv_extended_device_lambda_trait_helper<__nv_dl_wrapper_t<T1, Pack...> > {
    static const bool value = true;
};

#define __nv_is_extended_device_lambda_closure_type(X) \
    __nv_extended_device_lambda_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

Partial-specialization detection for device lambda wrappers (plain pattern matching on __nv_dl_wrapper_t<...>; no SFINAE is involved). The macro strips CV qualifiers first, ensuring const __nv_dl_wrapper_t<...> is also detected. Used by CUDA runtime headers for conditional behavior on extended lambda types.
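
The detection pattern can be exercised in isolation. In this sketch dl_wrapper is a hypothetical stand-in for __nv_dl_wrapper_t, example_tag stands in for a tag type, and std::remove_cv replaces __nv_lambda_trait_remove_cv; none of these substitutions come from the binary.

```cpp
#include <type_traits>

// Stand-ins for the emitted wrapper and a tag type.
template <typename Tag, typename... Captures> struct dl_wrapper {};
struct example_tag {};

// Primary: anything that is not a wrapper.
template <typename T>
struct is_dl_wrapper { static constexpr bool value = false; };

// Partial specialization: matches any wrapper instantiation.
template <typename T1, typename... Pack>
struct is_dl_wrapper<dl_wrapper<T1, Pack...> > {
    static constexpr bool value = true;
};

// Equivalent of the macro: strip cv-qualifiers, then pattern-match.
template <typename T>
struct is_extended_device_lambda {
    static constexpr bool value =
        is_dl_wrapper<typename std::remove_cv<T>::type>::value;
};

static_assert(is_extended_device_lambda<dl_wrapper<example_tag> >::value, "");
static_assert(!is_extended_device_lambda<int>::value, "");
```

Without the cv-stripping step, const dl_wrapper<...> would fall through to the primary template and report false.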

Step 15: Device Lambda Wrapper Unwrapper

template<typename T> struct __nv_lambda_trait_remove_dl_wrapper { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_dl_wrapper<__nv_dl_wrapper_t<T> > {
    typedef T type;
};

Extracts the inner tag type from a zero-capture device lambda wrapper. Only matches __nv_dl_wrapper_t<T> with a single template parameter (the tag). Used to access __nv_dl_tag or __nv_dl_trailing_return_tag for device function dispatch resolution.

Step 16: Trailing-Return Device Lambda Detection

template <typename T>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper {
    static const bool value = false;
};

template <typename U, U func, typename Return, unsigned Id, typename...Pack>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper<
    __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, Pack...> > {
    static const bool value = true;
};

#define __nv_is_extended_device_lambda_with_preserved_return_type(X) \
    __nv_extended_device_lambda_with_trailing_return_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

Detects whether a device lambda wrapper uses the trailing-return tag variant. Needed because trailing-return lambdas require different handling during device compilation -- the return type is explicit and must be preserved, rather than deduced.

Step 17: Host-Device Lambda Detection Trait

The final emission:

template <typename>
struct __nv_extended_host_device_lambda_trait_helper {
    static const bool value = false;
};

template <bool B1, bool B2, bool B3, typename T1, typename T2, typename...Pack>
struct __nv_extended_host_device_lambda_trait_helper<
    __nv_hdl_wrapper_t<B1, B2, B3, T1, T2, Pack...> > {
    static const bool value = true;
};

#define __nv_is_extended_host_device_lambda_closure_type(X) \
    __nv_extended_host_device_lambda_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

Detects any __nv_hdl_wrapper_t instantiation. The partial specialization matches all six template parameters (B1=IsMutable, B2=HasFuncPtrConv, B3=NeverThrows, T1=Tag, T2=OpFunc, Pack=captures).

sub_6BCC20 returns the result of this final a1(...) call.

Bitmap Infrastructure

Registration: sub_6BCBF0 (nv_record_capture_count)

During frontend parsing, scan_lambda (sub_447930) records each lambda's capture count:

__int64 __fastcall sub_6BCBF0(int a1, unsigned int a2)
{
    unsigned __int64 *result;
    if (a1)
        result = (unsigned __int64 *)&unk_1286900;  // host-device bitmap
    else
        result = (unsigned __int64 *)&unk_1286980;  // device bitmap
    result[a2 >> 6] |= 1ULL << a2;
    return (__int64)result;
}

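The function selects the bitmap based on a1 (0 = device, nonzero = host-device), computes the word index as a2 >> 6 (divide by 64), and sets the bit via bitwise OR. The decompiled shift 1ULL << a2 relies on x86-64 masking a 64-bit shift count to its low 6 bits; a portable reimplementation should shift by a2 & 63 instead, because shifting a 64-bit value by 64 or more is undefined behavior in C and C++. No synchronization is needed because the frontend is single-threaded.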
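
A portable reimplementation of the registration step might look like the sketch below. CaptureBitmaps and the three function names are hypothetical, the bit position is masked explicitly, and the range guard is an addition of this sketch (the binary appears to rely on the capture-count diagnostic instead).

```cpp
#include <array>
#include <cstdint>

using CaptureBitmap = std::array<std::uint64_t, 16>;  // 1024 bits

struct CaptureBitmaps {
    CaptureBitmap device{};       // models unk_1286980
    CaptureBitmap host_device{};  // models unk_1286900
};

// Marks capture count `count` as used for the chosen lambda kind.
inline void record_capture_count(CaptureBitmaps& maps, bool host_device,
                                 unsigned count) {
    if (count >= 1024) return;  // guard added here; not present in the binary
    CaptureBitmap& target = host_device ? maps.host_device : maps.device;
    target[count >> 6] |= std::uint64_t{1} << (count & 63);  // explicit mask
}

inline bool has_capture_count(const CaptureBitmaps& maps, bool host_device,
                              unsigned count) {
    if (count >= 1024) return false;
    const CaptureBitmap& source = host_device ? maps.host_device : maps.device;
    return ((source[count >> 6] >> (count & 63)) & 1u) != 0;
}

// Clears both bitmaps, as done before each translation unit.
inline void reset_capture_bitmaps(CaptureBitmaps& maps) {
    maps = CaptureBitmaps{};
}
```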

Reset: sub_6BCBC0 (nv_reset_capture_bitmasks)

Before each translation unit, both bitmaps are zeroed:

void sub_6BCBC0(void)
{
    memset(&unk_1286980, 0, 128);  // device bitmap
    memset(&unk_1286900, 0, 128);  // host-device bitmap
}

Scan Algorithm Differences

| Aspect | Device scan (step 6) | Host-device scan (step 9) |
|---|---|---|
| Bitmap | unk_1286980 | unk_1286900 |
| Bit 0 | Skipped (if (v2 && ...)) | Processed |
| Skip strategy | Tests every bit individually | Inner while fast-skips consecutive zeros |
| Calls per set bit | 1 (sub_6BB790) | 4 (sub_6BBB10 x2 + sub_6BBEE0 x2) |
| Specializations per set bit | 2 (standard + trailing-return) | 4 (IsMutable x HasFuncPtrConv) |

The device scan skips bit 0 because the zero-capture case is handled by the always-emitted zero-capture specializations from steps 4 and 5, not by a bitmap-driven emitter. The host-device scan processes bit 0 because the zero-capture case requires explicit specializations for the HasFuncPtrConv and IsMutable dimensions -- the always-emitted primary template contains only a static_assert trap.

Complete Emission Order Summary

| Step | Content | Emitter | Templates Produced |
|---|---|---|---|
| 1 | Ref/const removal traits | inline string | __NV_LAMBDA_WRAPPER_HELPER, __nvdl_remove_ref, __nvdl_remove_const |
| 2 | Device tag | inline string | __nv_dl_tag |
| 3 | Array helpers | sub_6BC290 | __nv_lambda_array_wrapper (dim 2-8), __nv_lambda_field_type specializations |
| 4 | Device wrapper primary | inline string | __nv_dl_wrapper_t primary + zero-capture |
| 5 | Trailing-return tag | inline string | __nv_dl_trailing_return_tag + zero-capture specialization |
| 6 | Device bitmap scan | loop + sub_6BB790 | N-capture __nv_dl_wrapper_t (2 per set bit N > 0) |
| 7 | HD helper | inline string | __nv_hdl_helper (anonymous namespace, 4 static FPs) |
| 8 | HD wrapper primary | inline string | __nv_hdl_wrapper_t primary with static_assert |
| 9 | HD bitmap scan | loop + sub_6BBB10 x2 + sub_6BBEE0 x2 | N-capture __nv_hdl_wrapper_t (4 per set bit) |
| 10 | Trait outer | inline string | __nv_hdl_helper_trait_outer (const + non-const specializations) |
| 11 | C++17 noexcept | conditional inline | Noexcept __nv_hdl_helper_trait specializations |
| 12 | Factory | inline string | __nv_hdl_create_wrapper_t |
| 13 | CV traits | inline string | __nv_lambda_trait_remove_const/volatile/cv |
| 14 | Device detection | inline string | __nv_extended_device_lambda_trait_helper + macro |
| 15 | Wrapper unwrap | inline string | __nv_lambda_trait_remove_dl_wrapper |
| 16 | Trailing-return detection | inline string | __nv_extended_device_lambda_with_trailing_return_trait_helper + macro |
| 17 | HD detection | inline string | __nv_extended_host_device_lambda_trait_helper + macro |

Output Size Characteristics

The preamble size depends on the number of distinct capture counts used:

| Component | Fixed/Variable | Approximate Size |
|---|---|---|
| Steps 1-5 (fixed templates) | Fixed | ~1.5 KB |
| Step 3 (array helpers, dim 2-8) | Fixed | ~4 KB |
| Step 6 (device, per capture count) | Variable | ~0.8 KB per count |
| Steps 7-8 (HD helper + primary) | Fixed | ~1.5 KB |
| Step 9 (HD, per capture count) | Variable | ~6 KB per count (4 specializations) |
| Steps 10-17 (traits, macros) | Fixed | ~3 KB |

A typical translation unit with 3-5 distinct capture counts produces approximately 30-50 KB of injected C++ text.
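
The estimate can be reproduced by summing the table rows. The sketch below treats the approximate sizes as exact constants and assumes the "Steps 1-5" row excludes the separately listed step 3; estimate_preamble_kb is a hypothetical helper, not something found in the binary.

```cpp
// Rough preamble size in KB from the approximate per-component figures.
// device_counts:      distinct nonzero capture counts among device lambdas
// host_device_counts: distinct capture counts among host-device lambdas
//                     (zero captures included, since bit 0 is processed)
inline double estimate_preamble_kb(unsigned device_counts,
                                   unsigned host_device_counts) {
    const double fixed_kb = 1.5    // steps 1-5 fixed templates
                          + 4.0    // step 3 array helpers
                          + 1.5    // steps 7-8 HD helper and primary
                          + 3.0;   // steps 10-17 traits and macros
    return fixed_kb
         + 0.8 * device_counts        // step 6, two specializations per count
         + 6.0 * host_device_counts;  // step 9, four specializations per count
}
```

With three distinct counts of each kind this gives roughly 30 KB, and with five of each roughly 44 KB, consistent with the range quoted above.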

Design Rationale

Text Emission vs AST Construction

The preamble is emitted as raw C++ source text rather than constructed as AST nodes in the EDG IL. This trades correctness-by-construction for implementation simplicity:

  • Avoids IL complexity. Constructing proper AST nodes for template partial specializations, static member definitions, anonymous namespaces, and macros would require deep integration with the EDG IL construction API.
  • Matches output format. The .int.c file is plain C++ text consumed by the host compiler. Since the templates must eventually become text, generating them as text from the start eliminates a serialize-deserialize round trip.
  • Self-documenting. The emitted text is directly readable in the .int.c file. grep for __nv_dl_wrapper_t to see exactly what was produced.

The cost is that the templates exist only as generated text, not as first-class IL entities. They cannot be analyzed or transformed by other EDG passes. This is acceptable because the preamble templates are infrastructure -- they are never the target of user-facing diagnostics or transformations.

Why Bitmaps Instead of Lists

The 1024-bit bitmap offers constant-time set (O(1) via shift-and-OR) and linear-time scan (O(1024) = effectively constant for a fixed-size structure). The bitmap has zero dynamic allocation, fits in two cache lines (128 bytes), and the scan loop compiles to simple shift-and-test instructions. Alternative representations (sorted lists, hash sets) would add allocation overhead and complexity for negligible benefit given the fixed 128-byte size.

Why Bit 0 Is Skipped for Device but Not Host-Device

The device lambda zero-capture case is fully handled by the primary template's zero-capture specialization (step 4), which is always emitted. No per-capture-count specialization is needed because the zero-capture wrapper has no fields, no constructor parameters, and no specialization-specific behavior.

The host-device zero-capture case requires distinct specializations for HasFuncPtrConv=true (lightweight function pointer path) and HasFuncPtrConv=false (heap-allocated type erasure path). These paths have fundamentally different internal structure. The always-emitted primary template contains only a static_assert trap, not a working implementation, so bit 0 must be processed to generate the actual zero-capture specializations.

Function Map

| Address | Name (recovered) | Source | Lines | Role |
|---|---|---|---|---|
| sub_6BCC20 | nv_emit_lambda_preamble | nv_transforms.c | 244 | Master emitter: 17-step template injection pipeline |
| sub_4864F0 | gen_type_decl | cp_gen_be.c | 751 | Trigger: detects sentinel, emits #line, calls master emitter |
| sub_467E50 | emit_string | cp_gen_be.c | ~29 | Output callback: writes string char-by-char via putc() |
| sub_467D60 | emit_newline | cp_gen_be.c | ~15 | Emits \n, increments line counter |
| sub_6BC290 | emit_array_capture_helpers | nv_transforms.c | 183 | Step 3: __nv_lambda_array_wrapper for dim 2-8 |
| sub_6BB790 | emit_device_lambda_wrapper_specialization | nv_transforms.c | 191 | Step 6: N-capture __nv_dl_wrapper_t (both tag variants) |
| sub_6BBB10 | emit_hdl_wrapper_nonmutable | nv_transforms.c | 238 | Step 9: __nv_hdl_wrapper_t<false,...> specialization |
| sub_6BBEE0 | emit_hdl_wrapper_mutable | nv_transforms.c | 236 | Step 9: __nv_hdl_wrapper_t<true,...> specialization |
| sub_6BCBF0 | nv_record_capture_count | nv_transforms.c | 13 | Sets bit N in device or HD bitmap |
| sub_6BCBC0 | nv_reset_capture_bitmasks | nv_transforms.c | 9 | Zeroes both 128-byte bitmaps before each TU |
| sub_46BC80 | emit_preprocessor_directive | cp_gen_be.c | -- | Emits #if 0 / #endif suppression blocks |

Global State

| Variable | Address | Type | Purpose |
|---|---|---|---|
| unk_1286980 | 0x1286980 | uint64_t[16] | Device lambda capture-count bitmap (1024 bits) |
| unk_1286900 | 0x1286900 | uint64_t[16] | Host-device lambda capture-count bitmap (1024 bits) |
| dword_106BF38 | 0x106BF38 | int32 | --extended-lambda mode flag |
| dword_126E270 | 0x126E270 | int32 | C++17 noexcept-in-type-system flag |
| dword_126E1DC | 0x126E1DC | int32 | EDG native mode flag (# vs #line format) |
| dword_106581C | 0x106581C | int32 | Output column counter |
| dword_1065820 | 0x1065820 | int32 | Output line counter (reset after preamble) |
| qword_1065828 | 0x1065828 | int64 | Output state pointer (reset after preamble) |
| dword_1065818 | 0x1065818 | int32 | Pending indentation flag |
| dword_1065834 | 0x1065834 | int32 | Preprocessor nesting depth counter |

Lambda Restrictions

Extended lambdas are CUDA's most constraint-heavy feature. Before a lambda can be wrapped in __nv_dl_wrapper_t or __nv_hdl_wrapper_t for device transfer, cudafe++ must verify that the closure type is serializable: no reference captures (device memory cannot hold host-side pointers), no function-local types in the public interface (device compiler has no access to them), no unnamed parent classes (the wrapper tag requires a mangleable name), and dozens of other structural invariants. The restriction checker runs as Phase 4 of scan_lambda (sub_447930, lines 626--866 of the 2113-line function) and continues through per-capture validation in make_field_for_lambda_capture (sub_42EE00) and recursive type walks in sub_41A3E0 / sub_41A1F0. Together, these functions enforce 39 distinct diagnostic tags covering 35+ error categories and approximately 45 unique error code call sites.

All restrictions apply only when dword_106BF38 (--extended-lambda / --expt-extended-lambda) is set and the lambda has an explicit __device__ or __host__ __device__ annotation. Standard C++ lambdas and lambdas defined inside __device__ / __global__ function bodies are exempt.

Key Facts

| Property | Value |
|---|---|
| Primary validator | sub_447930 (scan_lambda, Phase 4, ~240 lines within 2113-line function) |
| Per-capture validator | sub_42EE00 (make_field_for_lambda_capture, 551 lines) |
| Type hierarchy walker | sub_41A3E0 (validate_type_hd_annotation, 75 lines) |
| Array/element checker | sub_41A1F0 (walk_type_for_hd_violations, 81 lines) |
| Type walk callback | sub_41B420 (33 lines, issues errors 3603/3604/3606/3607/3610/3611) |
| Diagnostic tag count | 39 unique tags for extended lambda errors |
| Error code range | 3592--3635, 3689--3691 |
| Error severity | All severity 7 (error), except 3612 (warning) and 3590 (error) |
| Enable flag | dword_106BF38 (--extended-lambda) |
| OptiX gate | dword_106BDD8 && dword_106B670 (triggers 3689) |

Restriction Categories

The tables below list every restriction enforced by the extended lambda validator, organized by the phase of validation in which each check occurs. The Error column gives the internal error index (displayed to users as 20000-series with the renumbering formula code + 16543). The Tag column gives the diagnostic tag name usable with --diag_suppress / #pragma nv_diag_suppress.
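The renumbering formula quoted above is a fixed offset, so converting between internal indices and displayed codes is simple arithmetic. The sketch below illustrates it; the helper names are hypothetical and do not correspond to any recovered function in the binary.

```c
/* Offset between internal error indices and user-visible codes,
   taken from the renumbering formula "code + 16543". */
#define DIAG_DISPLAY_OFFSET 16543

/* Hypothetical helper: internal index -> displayed 20000-series code. */
static int displayed_error_code(int internal_index)
{
    return internal_index + DIAG_DISPLAY_OFFSET;
}

/* Hypothetical helper: displayed code -> internal index. */
static int internal_error_index(int displayed_code)
{
    return displayed_code - DIAG_DISPLAY_OFFSET;
}
```

Under this formula, internal error 3593 would be displayed as 20136.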

Category 1: Capture Restrictions

These checks run in two phases. The per-lambda checks (3593, 3595) occur in scan_lambda Phase 4 and in sub_4F9F20 (capture count finalization). The per-capture checks (3596--3599, 3616) run inside make_field_for_lambda_capture (sub_42EE00), which calls sub_41A1F0 for array dimension and constructibility analysis.

| Error | Tag | Restriction | Enforcement Location |
| --- | --- | --- | --- |
| 3593 | extended_lambda_reference_capture | Reference capture ([&] or [&x]) is prohibited. Device memory cannot hold host-side references. Fires when capture_default == & and capture_mode == & on the same lambda (byte+24 bits 4 and 5 both set). | sub_447930 Phase 4, line ~825 |
| 3595 | extended_lambda_too_many_captures | Maximum 1023 captures. The bitmap system uses 1024 bits (128 bytes) per wrapper type; bit 0 is reserved for the zero-capture primary template, so the usable range is 1--1023. Capture count > 0x3FE triggers this error. | sub_4F9F20 line ~616 |
| 3596 | extended_lambda_init_capture_array | Init-captures with array type are not supported. The init-capture's type node is checked for kind 3 (array type) with element kind 1 and sub-kind 21. | sub_42EE00 line ~508 |
| 3597 | extended_lambda_array_capture_rank | Arrays with more than 7 dimensions cannot be captured. The walker sub_41A1F0 counts array nesting depth via sub_7A8370 (is_array_type) and sub_7A9310 (get_element_type). If depth > 7, error fires. The limit matches the generated __nv_lambda_array_wrapper specializations (dims 2--8, plus dim 1 as identity). | sub_41A1F0 lines ~29, ~54 |
| 3598 | extended_lambda_array_capture_default_constructible | Array element type must be default-constructible on the host. After unwinding CV-qualifiers (kind 12 loop), calls sub_550E50(30, element_type, 0) to check default-constructibility. Failure emits this error. | sub_41A1F0 line ~40 |
| 3599 | extended_lambda_array_capture_assignable | Array element type must be copy-assignable on the host. Calls sub_5BD540 to get the assignment operator, then sub_510860(60, ...) to verify it is callable. Failure emits this error. | sub_41A1F0 lines ~42--44 |
| 3616 | extended_lambda_pack_capture | Cannot capture an element of a parameter pack. After calling sub_41A1F0 for type validation, sub_7A8C00 checks whether the capture type involves a pack expansion; if so, this error fires. | sub_42EE00 line ~517 |
| 3610 | extended_lambda_init_capture_initlist | Init-captures with std::initializer_list type are prohibited. The type walk callback sub_41B420 checks kind and class identity. | sub_41B420 / sub_4907A0 |
| 3602 | extended_lambda_capture_in_constexpr_if | An extended lambda cannot first-capture a variable inside a constexpr if branch. The capture must be visible outside the discarded branch. | sub_447930 Phase 6 |
| 3614 | extended_lambda_hd_init_capture | Init-captures are completely prohibited for __host__ __device__ lambdas. When byte+25 bit 4 is set (HD wrapper) and the lambda has any captures, this error fires and the HD bits are cleared. | sub_447930 line ~1710 |
| -- | this_addr_capture_ext_lambda | Implicit capture of this in an extended lambda triggers a warning. Separate from the errors above; fires during capture list processing. | sub_42FE50 / sub_42D710 |
| -- | (no tag) | *this capture requires either __device__-only or definition inside __device__/__global__ function, unless enabled by language dialect. | sub_42FE50 |

Category 2: Type Restrictions

Type restrictions enforce that every type visible in the lambda's public interface (captures, parameters, return type, and parent function template arguments) is accessible to the device compiler. Three contexts are checked, each with two sub-checks (function-local types and private/protected class member types). Additionally, the parent function's template arguments are checked for private/protected template members.

| Error | Tag | Context | Restriction |
| --- | --- | --- | --- |
| 3603 | extended_lambda_capture_local_type | Capture variable type | A type local to a function cannot appear in the type of a captured variable. |
| 3604 | extended_lambda_capture_private_type | Capture variable type | A private or protected class member type cannot appear in the type of a captured variable. |
| 3606 | extended_lambda_call_operator_local_type | operator() signature | A function-local type cannot appear in the return or parameter types of the lambda's operator(). |
| 3607 | extended_lambda_call_operator_private_type | operator() signature | A private/protected class member type cannot appear in the operator() return or parameter types. |
| 3610 | extended_lambda_parent_local_type | Parent template args | A function-local type cannot appear in the template arguments of the enclosing parent function or any parent classes. |
| 3611 | extended_lambda_parent_private_type | Parent template args | A private/protected class member type cannot appear in the template arguments of the enclosing parent function or parent classes. |
| 3635 | extended_lambda_parent_private_template_arg | Parent template args | A template that is itself a private/protected class member cannot be used as a template argument of the enclosing parent. |

Type Walk Dispatch via dword_E7FE78

The callback sub_41B420 uses a global discriminator dword_E7FE78 to select between the three contexts. Each context is called with a different value:

| dword_E7FE78 | Context | Local-type error | Private-type error |
| --- | --- | --- | --- |
| 0 | Capture variable type | 3603 | 3604 |
| 1 | operator() signature | 3606 | 3607 |
| 2 | Parent template args | 3610 | 3611 |

The error selection in sub_41B420 is conditional rather than a single arithmetic formula. When dword_E7FE78 == 0, the base error is used unchanged: 3603 for local types, 3604 for private types. For any nonzero value, the error is 4 * (dword_E7FE78 != 1) + 3606 for local types and 4 * (dword_E7FE78 != 1) + 3607 for private types. A value of 1 therefore gives 4*0 + 3606 = 3606 and 4*0 + 3607 = 3607, and a value of 2 gives 4*1 + 3606 = 3610 and 4*1 + 3607 = 3611. The decompiled code shows:

// For function-local type check:
v2 = 3603;
if (dword_E7FE78)
    v2 = 4 * (unsigned int)(dword_E7FE78 != 1) + 3606;
// dword_E7FE78=0 -> 3603
// dword_E7FE78=1 -> 4*0 + 3606 = 3606
// dword_E7FE78=2 -> 4*1 + 3606 = 3610

// For private/protected type check:
v4 = 3604;
if (dword_E7FE78)
    v4 = 4 * (unsigned int)(dword_E7FE78 != 1) + 3607;
// dword_E7FE78=0 -> 3604
// dword_E7FE78=1 -> 4*0 + 3607 = 3607
// dword_E7FE78=2 -> 4*1 + 3607 = 3611
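
The decompiled selection logic above can be restated as two small pure functions, which makes the mapping easy to check against the dispatch table. This is only a sketch: the enum and function names are hypothetical, and the context is passed as a parameter here instead of being read from the global dword_E7FE78.

```c
/* Context values as listed in the dispatch table (names are hypothetical). */
enum walk_context {
    CTX_CAPTURE = 0,              /* capture variable type */
    CTX_CALL_OPERATOR = 1,        /* operator() signature  */
    CTX_PARENT_TEMPLATE_ARGS = 2  /* parent template args  */
};

/* Error for a function-local type found in the given context. */
static int local_type_error(int ctx)
{
    if (ctx == 0)
        return 3603;
    return 4 * (ctx != 1) + 3606;   /* 1 -> 3606, 2 -> 3610 */
}

/* Error for a private/protected member type found in the given context. */
static int private_type_error(int ctx)
{
    if (ctx == 0)
        return 3604;
    return 4 * (ctx != 1) + 3607;   /* 1 -> 3607, 2 -> 3611 */
}
```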

The tree walk itself is invoked via sub_7B0B60(type_node, sub_41B420, error_base). The error_base parameter (792 or 795) is stored in a global and used by the walker to control recursion behavior, not error selection.

Category 3: Enclosing Parent Function Restrictions

The parent function (the function in whose body the extended lambda is defined) must satisfy several naming and linkage constraints. These exist because the device compiler must be able to instantiate the wrapper template at a globally-unique mangled name derived from the parent function's signature.

| Error | Tag | Restriction | Rationale |
| --- | --- | --- | --- |
| 3605 | extended_lambda_enclosing_function_local | Parent function must not be defined inside another function (local function). | Nested function bodies have no externally-visible mangling; the wrapper tag would be unresolvable. |
| 3608 | extended_lambda_cant_take_function_address | Parent function must allow its address to be taken. Checks entity+80 bits 0-1 for address-taken capability. | The wrapper tag encodes a function pointer to the parent's operator(). If address-of is forbidden (e.g., deleted functions), the tag is ill-formed. |
| 3609 | extended_lambda_parent_class_unnamed | Parent function cannot be a member of an unnamed class. Walks the scope chain checking entity+8 (name pointer) for null. | Unnamed classes have no mangled name, making the wrapper tag unresolvable. |
| 3601 | extended_lambda_parent_non_extern | On Windows only: parent function must have external linkage. Internal or no linkage is prohibited. | Windows COFF requires external linkage for cross-TU symbol resolution. On Linux ELF this restriction does not apply. Checks entity+81 bit 2 (has_qualified_scope) and entity+8 (name). |
| 3608 | extended_lambda_inaccessible_parent | Parent function cannot have private or protected access within its class. Checks entity+80 bits 0-1 (access specifier). | Private/protected member functions are not visible to the device compiler's separate compilation pass. |
| 3592 | extended_lambda_enclosing_function_deducible | Parent function must not have a deduced return type (auto return). Checks entity+81 bit 0 (is_deprecated flag used as deducible marker). | Deduced return types are resolved lazily; the wrapper template needs a concrete type. |
| 3600 | (no dedicated tag) | Parent function cannot be = deleted or = defaulted. Checks entity+166 for values 1 or 2 (deleted, defaulted). | A deleted/defaulted function has no body, so the lambda cannot exist. |
| 3613 | (no dedicated tag) | Parent function cannot have a noexcept specification. Checks entity+191 bit 0. | Exception specifications interact with the wrapper's NeverThrows template parameter in ways that cannot be validated at frontend time. |
| 3615 | extended_lambda_enclosing_function_not_found | The validator (sub_41A3E0) could not locate the enclosing function. Fires when the type annotation context byte has bit 0 set but the host-device validation context a2 == 0. | Internal consistency check; should not occur in well-formed code. |

Category 4: Template Parameter Restrictions

The parent function's template parameter list must satisfy naming and variadic constraints to ensure the wrapper tag type can be uniquely instantiated.

| Error | Tag | Restriction |
| --- | --- | --- |
| -- | extended_lambda_parent_template_param_unnamed | Every template parameter of the enclosing parent function must be named. Anonymous template parameters (template <typename>) prevent the wrapper from referencing the parameter in its tag type. Checked per-parameter during scope walk. |
| -- | extended_lambda_nest_parent_template_param_unnamed | Same restriction applied to nested parent scopes (enclosing class templates, enclosing function templates above the immediate parent). |
| -- | extended_lambda_multiple_parameter_packs | The parent template function can have at most one variadic parameter pack, and it must be the last parameter. Multiple packs or non-trailing packs prevent the device compiler from deducing the wrapper specialization. |

Category 5: Nesting and Context Restrictions

| Error | Tag | Restriction | Rationale |
| --- | --- | --- | --- |
| -- | extended_lambda_enclosing_function_generic_lambda | An extended lambda cannot be defined inside a generic lambda expression. Generic lambdas have template operator() which makes the closure type non-deducible for wrapper tag generation. | Generic lambdas produce dependent types that the wrapper system cannot resolve. |
| -- | extended_lambda_enclosing_function_hd_lambda | An extended lambda cannot be defined inside another extended __host__ __device__ lambda. | The wrapper for the outer HD lambda would need to capture the inner wrapper, creating a recursive type dependency. |
| -- | extended_host_device_generic_lambda | A __host__ __device__ extended lambda cannot be a generic lambda (i.e., with auto parameters). | The HD wrapper uses type erasure with concrete function pointer types. Generic lambdas would require polymorphic function pointers, which the type erasure scheme cannot express. |
| -- | extended_lambda_inaccessible_ancestor | An extended lambda cannot be defined inside a class that has private or protected access within another class. | The wrapper tag must be visible to both host and device compilation passes. A privately-nested class is not accessible from the translation-unit scope where the wrapper template is instantiated. |
| -- | extended_lambda_inside_constexpr_if | An extended lambda cannot be defined inside the if or else block of a constexpr if statement (platform/dialect dependent). | Discarded constexpr if branches may eliminate the lambda entirely, but the preamble has already been committed. Restriction prevents dangling wrapper specializations. |
| 3590 | extended_lambda_multiple_parent | Cannot specify __nv_parent more than once in a single lambda's capture list. | __nv_parent stores a single parent class pointer at lambda_info + 32; only one slot exists. |
| 3634 | (no dedicated tag) | __nv_parent requires the lambda to be __device__ annotated. If __nv_parent is specified without __device__ execution space, this error fires. Additionally validates that the enclosing scope has __host__ but not __device__ execution space (bits at entity+182). | __nv_parent is used to link the device closure to its enclosing class for member access. This is only meaningful in device execution context. |

Category 6: Specifier and Annotation Restrictions

| Error | Tag | Restriction |
| --- | --- | --- |
| 3612 | extended_lambda_disallowed | __host__ or __device__ annotation on a lambda when --extended-lambda is not enabled. This is a warning, not an error. The flag must be explicitly passed on the command line. |
| 3620 | extended_lambda_constexpr | The constexpr specifier is not allowed on an extended lambda's operator(). Also applies to consteval. Two separate emit calls: one for "constexpr" and one for "consteval". |
| 3621 | (no dedicated tag) | The operator() function for a lambda cannot be explicitly annotated with execution space annotations (__host__/__device__/__global__). The annotations are derived from the closure class, not the operator. Fires when entity+182 bits 1-2 are set on the call operator. |
| 3689 | (no dedicated tag) | OptiX mode incompatibility. When both dword_106BDD8 (OptiX) and dword_106B670 (a secondary OptiX flag) are set, and the lambda body at qword_106B678 + 176*dword_106B670 + 5 has bit 3 set, this error fires. OptiX has stricter lambda body requirements than standard CUDA. |
| 3690 | extended_lambda_discriminator | Lambda numbering lookup failure in the red-black tree (ptr / dword_E7FE48). The tree maps source positions to lambda indices for unique wrapper tag generation. If the tree search fails, the wrapper cannot be uniquely identified. |
| 3691 | (no dedicated tag) | Extended lambda with __host__ __device__ annotation where the type annotation byte has bit 4 set (HD init-capture validation context). Issued by sub_41A3E0 as a final post-check. |

Category 7: Enclosing Scope Miscellaneous

| Error | Tag | Restriction |
| --- | --- | --- |
| 3617 | extended_lambda_no_parent_func | No enclosing function could be found for the extended lambda. sub_6BCDD0 (nv_find_parent_lambda_function) walked the scope chain and returned null. The lambda may be at file scope, which is not a valid context for an extended lambda. |
| 3618 | extended_lambda_illegal_parent | Ambiguous overload when resolving the enclosing function. sub_6BCDD0 found multiple candidate functions. Emitted via sub_4F6E50 with three operands (location, space string, function name). |
| 3619 | (no dedicated tag) | Secondary ambiguity variant. Same as 3618 but fires on a different branch (the v291[0] check rather than v287[0]), indicating the ambiguity was detected through a different resolution path. |
| 3601 | (duplicate) | Lambda defined in unnamed namespace (entity+81 bit 2 set and entity+8 name pointer is null). The wrapper tag requires a named scope. |
| 3605 | (duplicate) | Non-trivially-copyable type in capture scope. When entity+80 bits 0-1 indicate non-trivial copy semantics, the capture cannot be transferred to device memory. |

Validation Architecture

Phase 4 of scan_lambda: Per-Lambda Validation

After parsing the capture list and annotations (Phases 1--3), scan_lambda enters the extended lambda validation block. This block is guarded by dword_106BF38 (extended lambda mode) and the annotation bits at lambda_info + 25. The validation proceeds as:

sub_447930 (scan_lambda), Phase 4 entry:
  |
  +-- Call sub_6BCDD0 (nv_find_parent_lambda_function)
  |     Returns: parent function node, sets is_device/is_template flags
  |
  +-- If parent == NULL:  emit error 3617
  +-- If ambiguous:       emit error 3618 or 3619
  |
  +-- Validate parent function properties:
  |     entity+81 bit 0  -> error 3592 (deprecated/deducible)
  |     entity+191 bit 0 -> error 3613 (noexcept spec)
  |     entity+166 == 1|2 -> error 3600 (deleted/defaulted)
  |     entity+81 bit 2  -> unnamed scope check -> error 3601
  |     entity+80 bits 0-1 -> address-taken / access check -> error 3608
  |
  +-- Walk parent scope chain for unnamed classes:
  |     entity+8 == NULL -> error 3609
  |     Non-trivial copy   -> error 3605
  |
  +-- Check capture-default conflicts:
  |     byte+24 bits 4+5 both set -> error 3593 (& and = conflict)
  |
  +-- OptiX gate: dword_106BDD8 -> error 3689
  |
  +-- Lambda numbering via red-black tree:
        Lookup failure -> error 3690
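
As an illustration of the parent-property checks in the diagram, the sketch below tests three of the listed entity fields and collects the matching error codes. It is a simplified model under stated assumptions: the entity is treated as a raw byte buffer of at least 192 bytes, bit 0 is taken to be the least significant bit, the function name is hypothetical, and the order and early-exit behaviour of the real checks in sub_447930 are not modelled.

```c
#include <stddef.h>

/* Byte offsets into the entity record, as given in the diagram above. */
enum {
    OFF_FLAGS_81 = 81,    /* bit 0: deduced-return marker    */
    OFF_KIND_166 = 166,   /* 1 or 2: deleted / defaulted     */
    OFF_FLAGS_191 = 191   /* bit 0: noexcept specification   */
};

/* Hypothetical helper: writes matching error codes to out_errors
   (capacity 3) and returns how many were written. */
static size_t check_parent_function(const unsigned char *entity,
                                    int out_errors[3])
{
    size_t n = 0;
    if (entity[OFF_FLAGS_81] & 0x01)
        out_errors[n++] = 3592;   /* deduced return type */
    if (entity[OFF_FLAGS_191] & 0x01)
        out_errors[n++] = 3613;   /* noexcept specification */
    if (entity[OFF_KIND_166] == 1 || entity[OFF_KIND_166] == 2)
        out_errors[n++] = 3600;   /* deleted or defaulted */
    return n;
}
```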

Per-Capture Validation: sub_42EE00

For each captured variable, make_field_for_lambda_capture runs targeted checks:

sub_42EE00 (make_field_for_lambda_capture):
  |
  +-- If byte+25 bit 3 set (device wrapper):
  |     |
  |     +-- Check init-capture for array type
  |     |     type_node+48 == 3 && sub_kind == 21 -> error 3596
  |     |
  |     +-- Call sub_41A1F0 (walk_type_for_hd_violations)
  |     |     Counts array dimensions, checks element type
  |     |     dim > 7 -> error 3597
  |     |     Not default-constructible -> error 3598
  |     |     Not assignable -> error 3599
  |     |
  |     +-- Check for pack expansion
  |           sub_7A8C00 returns true -> error 3616
  |
  +-- (Later) If byte+25 bit 4 set (HD wrapper):
        |
        +-- Call sub_7B0B60 with sub_41B420 callback
              Walks entire type tree, fires 3603/3604 for
              function-local and private/protected types

Type Hierarchy Walker: sub_41A3E0 / sub_41A1F0

sub_41A3E0 is the outer wrapper that validates the per-capture annotation context. sub_41A1F0 performs the recursive array dimension walk and element-type validation.

sub_41A3E0 (validate_type_hd_annotation):
  |
  +-- Determine context string: "__device__" or "__host__ __device__"
  |     Based on a2 parameter (0 = HD, nonzero = device-only)
  |
  +-- Check annotation byte (a1+32):
  |     bit 0 set && a2==0 -> error 3615
  |     bit 3 set -> check parent visibility:
  |       entity+163 < 0 (private) -> check bit pattern
  |       Both bits 3+4 set with private parent -> error 3635
  |       Otherwise -> error 3593
  |     bit 5 set -> error 3594 (private/protected access)
  |
  +-- Unwrap CV-qualifiers on element type (kind==12 loop)
  |
  +-- Call sub_41A1F0 (walk_type_for_hd_violations):
  |     Recursive array walker:
  |       v6 = dimension counter
  |       Loop: while sub_7A8370(type) returns true
  |         increment v6, follow sub_7A9310 to element type
  |       If v6 > 7: error 3597
  |       Unwrap CV (kind==12 loop)
  |       If not in dependent context (dword_126C5C4 == -1):
  |         Check scope flags (byte+6 bits 1-2)
  |         sub_550E50(30, type, 0) -> error 3598 (not default-constructible)
  |         sub_5BD540 + sub_510860(60, ...) -> error 3599 (not assignable)
  |       Call sub_7B0B60(type, sub_41B420, 792) for deep type walk
  |
  +-- If a3 (third parameter) set:
        Check bit 4 of annotation byte -> error 3691
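
The array-rank portion of the walk can be sketched with a toy type representation. The struct and helper names below are hypothetical stand-ins: toy_is_array and toy_element play the roles described for sub_7A8370 and sub_7A9310, and only the dimension count and the "more than 7" test are modelled, not the constructibility or assignability checks.

```c
#include <stddef.h>

/* Toy type node: either a scalar or an array of some element type. */
enum toy_kind { TOY_SCALAR, TOY_ARRAY };

struct toy_type {
    enum toy_kind kind;
    const struct toy_type *element;  /* element type when kind == TOY_ARRAY */
};

#define MAX_CAPTURE_ARRAY_RANK 7

static int toy_is_array(const struct toy_type *t)
{
    return t != NULL && t->kind == TOY_ARRAY;
}

static const struct toy_type *toy_element(const struct toy_type *t)
{
    return t->element;
}

/* Count array nesting depth by peeling one array layer per iteration. */
static int array_rank(const struct toy_type *t)
{
    int rank = 0;
    while (toy_is_array(t)) {
        rank++;
        t = toy_element(t);
    }
    return rank;
}

/* Returns 3597 if the rank exceeds the limit, 0 otherwise. */
static int rank_error(const struct toy_type *t)
{
    return array_rank(t) > MAX_CAPTURE_ARRAY_RANK ? 3597 : 0;
}
```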

sub_41B420: Type Walk Callback

This compact callback (33 lines decompiled) is invoked by sub_7B0B60 for every type node in the capture's type tree. It checks two properties:

  1. Function-local type -- entity+81 bit 0 set: the type is defined inside a function body. Error selection uses dword_E7FE78 to pick between capture context (3603), operator() context (3606), and parent template-arg context (3610).

  2. Private/protected member type -- entity+81 bit 2 set AND entity+80 bits 0-1 in range [1,2] (private or protected access specifier). Error selection parallels the local-type case: 3604, 3607, or 3611 depending on dword_E7FE78.

Special case: when entity+132 == 9 (template parameter dependent type) AND entity+152 points to a class with byte+86 bit 0 set AND entity+72 is non-null, the function-local check is suppressed. This handles template parameters that are not themselves local but instantiate with local types -- the error is deferred to instantiation time.
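
The two property tests described above reduce to bit checks on the bytes at entity+80 and entity+81. A minimal sketch follows, assuming the entity is addressable as a byte buffer and that bit 0 is the least significant bit; the function names are hypothetical, and the entity+132 special case is not modelled.

```c
/* Function-local type: entity+81 bit 0 set. */
static int is_function_local(const unsigned char *entity)
{
    return (entity[81] & 0x01) != 0;
}

/* Private/protected member type: entity+81 bit 2 set AND the 2-bit
   access field at entity+80 bits 0-1 is in the range [1, 2]. */
static int is_private_or_protected_member(const unsigned char *entity)
{
    unsigned access = entity[80] & 0x03;
    return (entity[81] & 0x04) != 0 && access >= 1 && access <= 2;
}
```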

Diagnostic Tag Reference

Complete list of all 39 extended lambda diagnostic tags, sorted alphabetically. All tags can be used with --diag_suppress, --diag_warning, --diag_error on the command line, and with #pragma nv_diag_suppress, #pragma nv_diag_warning, #pragma nv_diag_error in source.

| Tag | Category |
| --- | --- |
| extended_host_device_generic_lambda | Nesting |
| extended_lambda_array_capture_assignable | Capture |
| extended_lambda_array_capture_default_constructible | Capture |
| extended_lambda_array_capture_rank | Capture |
| extended_lambda_call_operator_local_type | Type |
| extended_lambda_call_operator_private_type | Type |
| extended_lambda_cant_take_function_address | Parent |
| extended_lambda_capture_in_constexpr_if | Capture |
| extended_lambda_capture_local_type | Type |
| extended_lambda_capture_private_type | Type |
| extended_lambda_constexpr | Specifier |
| extended_lambda_disallowed | Specifier |
| extended_lambda_discriminator | Internal |
| extended_lambda_enclosing_function_deducible | Parent |
| extended_lambda_enclosing_function_generic_lambda | Nesting |
| extended_lambda_enclosing_function_hd_lambda | Nesting |
| extended_lambda_enclosing_function_local | Parent |
| extended_lambda_enclosing_function_not_found | Parent |
| extended_lambda_hd_init_capture | Capture |
| extended_lambda_illegal_parent | Parent |
| extended_lambda_inaccessible_ancestor | Nesting |
| extended_lambda_inaccessible_parent | Parent |
| extended_lambda_init_capture_array | Capture |
| extended_lambda_init_capture_initlist | Capture |
| extended_lambda_inside_constexpr_if | Nesting |
| extended_lambda_multiple_parameter_packs | Template |
| extended_lambda_multiple_parent | Nesting |
| extended_lambda_nest_parent_template_param_unnamed | Template |
| extended_lambda_no_parent_func | Parent |
| extended_lambda_pack_capture | Capture |
| extended_lambda_parent_class_unnamed | Parent |
| extended_lambda_parent_local_type | Type |
| extended_lambda_parent_non_extern | Parent |
| extended_lambda_parent_private_template_arg | Type |
| extended_lambda_parent_private_type | Type |
| extended_lambda_parent_template_param_unnamed | Template |
| extended_lambda_reference_capture | Capture |
| extended_lambda_too_many_captures | Capture |
| this_addr_capture_ext_lambda | Capture |

Bitmap Interaction

The capture count limit of 1023 derives from the bitmap architecture. Each wrapper type (device and host-device) uses a 128-byte bitmap (unk_1286980 / unk_1286900) storing 1024 bits. The bitmap setter sub_6BCBF0 performs:

result[capture_count >> 6] |= 1LL << capture_count;

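Bit 0 is never emitted as a wrapper specialization (the zero-capture case uses the primary template). Bits 1--1023 map to generated partial specializations. On x86-64 the shift count of a 64-bit shift is taken modulo 64, so 1LL << capture_count selects bit (capture_count & 63) within the word indexed by capture_count >> 6. The error check fires when the capture count exceeds 0x3FE (1022), before the bitmap set operation. Note that a strict > 1022 comparison also rejects the value 1023, so the stated maximum of 1023 captures holds only if the compared value is zero-based; this has not been confirmed here. A count of 1024 or more would index word 16, one past the end of the 16-word bitmap, though in practice the error prevents this.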
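
The setter expression can be modelled in portable C as below. This is a sketch, not the recovered code: the function names are hypothetical, the bounds check is an addition for safety (the text above describes the binary as relying on the earlier error check instead), and the shift count is masked explicitly because shifting a 64-bit value by 64 or more is undefined in C.

```c
#include <stdint.h>

#define CAPTURE_BITMAP_WORDS 16   /* 16 x 64 bits = 1024 bits = 128 bytes */

/* Record that a wrapper with capture_count captures is needed.
   Returns 0 on success, -1 if the count does not fit in the bitmap. */
static int bitmap_set(uint64_t bitmap[CAPTURE_BITMAP_WORDS],
                      unsigned capture_count)
{
    if (capture_count >= CAPTURE_BITMAP_WORDS * 64u)
        return -1;
    bitmap[capture_count >> 6] |= (uint64_t)1 << (capture_count & 63u);
    return 0;
}

/* Query whether the bit for capture_count is set. */
static int bitmap_test(const uint64_t bitmap[CAPTURE_BITMAP_WORDS],
                       unsigned capture_count)
{
    if (capture_count >= CAPTURE_BITMAP_WORDS * 64u)
        return 0;
    return (int)((bitmap[capture_count >> 6] >> (capture_count & 63u)) & 1u);
}
```

For example, a capture count of 70 lands in word 1 (70 >> 6) at bit 6 (70 & 63).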

Operator() Annotation Derivation

Error 3621 enforces a fundamental design rule: the operator() function of an extended lambda must not carry explicit execution space annotations. Instead, the execution space is derived from the closure class. During scan_lambda Phase 5 (decl_call_operator_for_lambda), the code sets the call operator's execution space from lambda_info + 25:

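// Propagate device/host from lambda_info to call operator
// lambda+25 bit 2 (0x04) is copied into operator+182 bit 4 (0x10)
byte[operator+182] = ((4 * byte[lambda+25]) & 0x10) | (byte[operator+182] & 0xEF);
// lambda+25 bit 1 (0x02) is copied into operator+182 bit 5 (0x20)
byte[operator+182] = ((16 * byte[lambda+25]) & 0x20) | (byte[operator+182] & 0xDF);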

If the call operator already has execution space bits set (from explicit annotation by the user), error 3621 fires. The rationale is that the wrapper template's tag type already encodes the execution space; having the operator carry its own annotations would create an inconsistency that the device compiler cannot resolve.
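
The two propagation assignments can be wrapped in a pure function to check which bits move where. This sketch only reproduces the arithmetic shown above; the function name is hypothetical and no meaning is assigned to the individual bits beyond what the expressions imply (the multiplications by 4 and 16 move lambda bit 2 to 0x10 and lambda bit 1 to 0x20).

```c
/* Returns the new operator+182 byte given the old one and lambda+25. */
static unsigned char propagate_exec_space(unsigned char op_flags,
                                          unsigned char lambda_flags)
{
    /* Replace bit 4 of op_flags with bit 2 of lambda_flags. */
    op_flags = (unsigned char)(((4 * lambda_flags) & 0x10) | (op_flags & 0xEF));
    /* Replace bit 5 of op_flags with bit 1 of lambda_flags. */
    op_flags = (unsigned char)(((16 * lambda_flags) & 0x20) | (op_flags & 0xDF));
    return op_flags;
}
```

Because each assignment first clears its target bit, any value already present in bits 4 and 5 of the operator byte is overwritten, while all other bits are preserved.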

Key Functions

| Address | Name (recovered) | Lines | Role |
| --- | --- | --- | --- |
| sub_447930 | scan_lambda | 2113 | Master lambda parser; Phase 4 = restriction validator |
| sub_42EE00 | make_field_for_lambda_capture | 551 | Per-capture field creator with device-lambda validation |
| sub_41A3E0 | validate_type_hd_annotation | 75 | Outer type annotation checker (errors 3593/3594/3615/3635/3691) |
| sub_41A1F0 | walk_type_for_hd_violations | 81 | Recursive array dim / element-type validator (3597/3598/3599) |
| sub_41B420 | (type walk callback) | 33 | Issues 3603/3604/3606/3607/3610/3611 via dword_E7FE78 dispatch |
| sub_6BCDD0 | nv_find_parent_lambda_function | 33 | Scope chain walk to find enclosing host/device function |
| sub_6BCBF0 | nv_record_capture_count | 13 | Set bit in device or host-device bitmap |
| sub_4F9F20 | (capture count finalizer) | ~620 | Checks capture count > 0x3FE, calls bitmap setter |
| sub_7B0B60 | (tree walker) | -- | Recursive type tree traversal, calls callback for each node |
| sub_7A8370 | (is_array_type) | -- | Returns nonzero if type node is an array type |
| sub_7A9310 | (get_element_type) | -- | Returns the element type of an array type node |
| sub_550E50 | (check_default_constructible) | -- | sub_550E50(30, type, 0) tests default-constructibility |
| sub_510860 | (check_callable) | -- | sub_510860(60, op, type) tests if operator is callable |

Global State

| Variable | Address | Purpose |
| --- | --- | --- |
| dword_106BF38 | 0x106BF38 | Extended lambda mode flag (--extended-lambda) |
| dword_106BDD8 | 0x106BDD8 | OptiX mode flag |
| dword_106B670 | 0x106B670 | Secondary OptiX lambda flag |
| qword_106B678 | 0x106B678 | OptiX lambda body array base pointer |
| dword_E7FE78 | 0xE7FE78 | Type walk context discriminator (0=capture, 1=operator, 2=parent) |
| ptr | (stack) | Red-black tree root for lambda numbering per source position |
| dword_E7FE48 | 0xE7FE48 | Red-black tree sentinel node |
| dword_126C5C4 | 0x126C5C4 | Dependent scope index (-1 = not in dependent context) |
| dword_126EFAC | 0x126EFAC | CUDA mode flag |
| dword_126EFA4 | 0x126EFA4 | GCC extensions flag |
| qword_126EF98 | 0x126EF98 | GCC compatibility version |

IL Overview

The Intermediate Language (IL) is EDG's central data structure -- a typed, scope-linked graph of every declaration, type, expression, statement, and template in the translation unit. cudafe++ (EDG 6.6) builds the IL during parsing, walks it for CUDA device/host separation, and emits it as the .int.c output. The IL never touches disk: IL_SHOULD_BE_WRITTEN_TO_FILE=0 forces in-memory-only operation. All IL nodes live in a region-based arena allocator, organized into file-scope (region 1) and per-function (region N) memory pools.

The IL is versioned as IL_VERSION_NUMBER="6.6" and carries the compile-time flag ALL_TEMPLATE_INFO_IN_IL=1, meaning template definitions, specializations, and instantiation directives are fully represented in the IL graph rather than deferred to a separate template database.

Key Configuration Constants

| Constant | Value | Meaning |
| --- | --- | --- |
| IL_VERSION_NUMBER | "6.6" | IL format version, matches EDG version |
| IL_SHOULD_BE_WRITTEN_TO_FILE | 0 | IL is never serialized to disk |
| ALL_TEMPLATE_INFO_IN_IL | 1 | Full template data in IL graph |
| IL_FILE_SUFFIX | (string) | Suffix for IL file names if serialization were enabled |
| sizeof_il_entry sentinel | 9999 | Validated at init time (guard value in qword_E6C580) |

IL Entry Kind System

Every IL node carries an entry_kind byte that identifies its type. The name table off_E6DD80 (aliased as il_entry_kind_names at off_E6E020) maps these bytes to human-readable strings. The il_one_time_init function (sub_5CF7F0) validates that this table ends with a "last" sentinel.

There are 85 defined entry kind values (0-84). Some are primary node types with their own linked lists; others are auxiliary records displayed inline by their parent.

Complete il_entry_kind Table

| Kind | Hex | Name | Bytes | Display | Notes |
| --- | --- | --- | --- | --- | --- |
| 0 | 0x00 | none | -- | -- | Null/invalid sentinel |
| 1 | 0x01 | source_file_entry | 80 | Case 1 | File name, line ranges, include flags |
| 2 | 0x02 | constant | 184 | Case 2 | 16 sub-kinds (ck_*) |
| 3 | 0x03 | param_type | 80 | Case 3 | Parameter type in function signature |
| 4 | 0x04 | routine_type_supplement | 64 | Inline | Embedded in routine type node |
| 5 | 0x05 | routine_type_extra | -- | Inline | Additional routine type data |
| 6 | 0x06 | type | 176 | Case 6 | 22 sub-kinds (tk_*) |
| 7 | 0x07 | variable | 232 | Case 7 | Variables, parameters, structured bindings |
| 8 | 0x08 | field | 176 | Case 8 | Class/struct/union members |
| 9 | 0x09 | exception_specification | 16 | Case 9 | noexcept, throw() specs |
| 10 | 0x0A | exception_spec_type | 24 | Case 0xA | Type in exception specification |
| 11 | 0x0B | routine | 288 | Case 0xB | Functions, methods, constructors, destructors |
| 12 | 0x0C | label | 128 | Case 0xC | Goto labels, break/continue targets |
| 13 | 0x0D | expr_node | 72 | Case 0xD | 36 sub-kinds (enk_*) |
| 14 | 0x0E | (reserved) | -- | Inline | Skipped in display |
| 15 | 0x0F | (reserved) | -- | Inline | Skipped in display |
| 16 | 0x10 | switch_case_entry | 56 | Case 0x10 | Case value + range for switch |
| 17 | 0x11 | switch_info | 24 | Case 0x11 | Switch statement descriptor |
| 18 | 0x12 | handler | 40 | Case 0x12 | try/catch handler entry |
| 19 | 0x13 | try_supplement | 32 | Inline | Try block extra info |
| 20 | 0x14 | asm_supplement | -- | Inline | Inline asm statement data |
| 21 | 0x15 | statement | 80 | Case 0x15 | 26 sub-kinds (stmk_*) |
| 22 | 0x16 | object_lifetime | 64 | Case 0x16 | Destruction ordering |
| 23 | 0x17 | scope | 288 | Case 0x17 | 9 sub-kinds (sck_*) |
| 24 | 0x18 | base_class | 112 | Case 0x18 | Inheritance record |
| 25 | 0x19 | string_text | 1* | -- | Raw string literal bytes |
| 26 | 0x1A | other_text | 1* | -- | Compiler version, misc text |
| 27 | 0x1B | template_parameter | 136 | Case 0x1B | Template param with supplement |
| 28 | 0x1C | namespace | 128 | Case 0x1C | Namespace declarations |
| 29 | 0x1D | using_declaration | 80 | Case 0x1D | Using declarations/directives |
| 30 | 0x1E | dynamic_init | 104 | Case 0x1E | 9 sub-kinds (dik_*) |
| 31 | 0x1F | local_static_variable_init | 40 | Case 0x1F | Static local init records |
| 32 | 0x20 | vla_dimension | 48 | Case 0x20 | Variable-length array bound |
| 33 | 0x21 | overriding_virtual_func | 40 | Case 0x21 | Virtual override info |
| 34 | 0x22 | (reserved) | -- | Inline | Skipped in display |
| 35 | 0x23 | derivation_path | 24 | Case 0x23 | Base-class derivation step |
| 36 | 0x24 | base_class_derivation | 32 | -- | Derivation detail record |
| 37 | 0x25 | (reserved) | -- | Inline | Skipped in display |
| 38 | 0x26 | (reserved) | -- | Inline | Skipped in display |
| 39 | 0x27 | class_info | 208 | Case 0x27 | Class type supplement |
| 40 | 0x28 | (reserved) | -- | -- | Skipped in display |
| 41 | 0x29 | constructor_init | 48 | Case 0x29 | Ctor member/base initializer |
| 42 | 0x2A | asm_entry | 152 | Case 0x2A | Inline assembly block |
| 43 | 0x2B | asm_operand | -- | Case 0x2B | Asm constraint + expression |
| 44 | 0x2C | asm_clobber | -- | Case 0x2C | Asm clobber register |
| 45 | 0x2D | (reserved) | -- | Inline | Skipped in display |
| 46 | 0x2E | (reserved) | -- | Inline | Skipped in display |
| 47 | 0x2F | (reserved) | -- | Inline | Skipped in display |
| 48 | 0x30 | (reserved) | -- | Inline | Skipped in display |
| 49 | 0x31 | element_position | 24 | -- | Designator element position |
| 50 | 0x32 | source_sequence_entry | 32 | Case 0x32 | Declaration ordering |
| 51 | 0x33 | full_entity_decl_info | 56 | Case 0x33 | Full declaration info |
| 52 | 0x34 | instantiation_directive | 40 | Case 0x34 | Explicit instantiation |
| 53 | 0x35 | src_seq_sublist | 24 | Case 0x35 | Source sequence sub-list |
| 54 | 0x36 | explicit_instantiation_decl | -- | Case 0x36 | extern template |
| 55 | 0x37 | orphaned_entities | 56 | Case 0x37 | Entities without parent scope |
| 56 | 0x38 | hidden_name | 32 | Case 0x38 | Hidden name entry |
| 57 | 0x39 | pragma | 64 | Case 0x39 | Pragma records (43 kinds) |
| 58 | 0x3A | template | 208 | Case 0x3A | Template declaration |
| 59 | 0x3B | template_decl | 40 | Case 0x3B | Template declaration head |
| 60 | 0x3C | requires_clause | 16 | Case 0x3C | C++20 requires clause |
| 61 | 0x3D | template_param | 136 | Case 0x3D | Template parameter entry |
| 62 | 0x3E | name_reference | 40 | Case 0x3E | Name lookup reference |
| 63 | 0x3F | name_qualifier | 40 | Case 0x3F | Qualified name qualifier |
| 64 | 0x40 | seq_number_lookup | 32 | Case 0x40 | Sequence number index |
| 65 | 0x41 | local_expr_node_ref | -- | Case 0x41 | Local expression reference |
| 66 | 0x42 | static_assert | 24 | Case 0x42 | Static assertion |
| 67 | 0x43 | linkage_spec | 32 | Case 0x43 | extern "C"/"C++" block |
| 68 | 0x44 | scope_ref | 32 | Case 0x44 | Scope back-reference |
| 69 | 0x45 | (reserved) | -- | Inline | Skipped in display |
| 70 | 0x46 | lambda | -- | Case 0x46 | Lambda expression |
| 71 | 0x47 | lambda_capture | -- | Case 0x47 | Lambda capture entry |
| 72 | 0x48 | attribute | 72 | Case 0x48 | C++11/GNU attribute |
| 73 | 0x49 | attribute_argument | 40 | Case 0x49 | Attribute argument |
| 74 | 0x4A | attribute_group | 8 | Case 0x4A | Attribute group |
| 75 | 0x4B | (reserved) | -- | Inline | Skipped in display |
| 76 | 0x4C | (reserved) | -- | Inline | Skipped in display |
| 77 | 0x4D | (reserved) | -- | Inline | Skipped in display |
| 78 | 0x4E | (reserved) | -- | Inline | Skipped in display |
| 79 | 0x4F | template_info | -- | Case 0x4F | Template instantiation info |
| 80 | 0x50 | subobject_path | 24 | Case 0x50 | Address constant sub-path |
| 81 | 0x51 | (reserved) | -- | Inline | Skipped in display |
| 82 | 0x52 | module_info | -- | Case 0x52 | C++20 module metadata |
| 83 | 0x53 | module_decl | -- | Case 0x53 | Module declaration |
| 84 | 0x54 | last | -- | -- | Sentinel for table validation |

Inline entries (kinds 4, 5, 14, 15, 19, 20, 27, 34, 37, 38, 40, 45-48, 69, 75-78, 81) are displayed as part of their parent node rather than as standalone IL entries. The display dispatcher (sub_5F4930) returns immediately for these kinds.

IL Header Structure

The IL header lives in the BSS segment at 0x126EB60 and is printed by display_il_header_and_file_scope (sub_5F76B0). It records translation-unit-level metadata:

struct il_header {                        // at xmmword_126EB60
    il_entry*   primary_source_file;      // +0x00  head of source file list
    scope*      primary_scope;            // +0x08  file-scope root
    routine*    main_routine;             // +0x10  main() if present
    char*       compiler_version;         // +0x18  "6.6" version string
    char*       time_of_compilation;      // +0x20  build timestamp
    uint8_t     plain_chars_are_signed;   // +0x28  signedness of plain char
    uint32_t    source_language;          // +0x2C  0=C++, 1=C (dword_126EBA8)
    uint32_t    std_version;             // +0x30  e.g. 201703 (dword_126EBAC)
    uint8_t     pcc_compatibility_mode;   // +0x34  PCC compat flag
    uint8_t     enum_type_is_integral;    // +0x35
    uint32_t    default_max_member_align; // +0x38
    uint8_t     gcc_mode;                // +0x3C  GCC compatibility
    uint8_t     gpp_mode;                // +0x3D  G++ compatibility
    uint32_t    gnu_version;             // +0x40  e.g. 40201
    uint8_t     short_enums;             // +0x44
    uint8_t     default_nocommon;         // +0x45
    uint8_t     UCN_identifiers_used;     // +0x46
    uint8_t     vla_used;                // +0x47
    uint8_t     any_templates_seen;       // +0x48
    uint8_t     prototype_instantiations_in_il;     // +0x49
    uint8_t     il_has_all_prototype_instantiations; // +0x4A
    uint8_t     il_has_C_semantics;       // +0x4B
    uint8_t     nontag_types_used_in_exception_or_rtti; // +0x4C
    il_entry*   seq_number_lookup_entries; // +0x50
    uint32_t    target_configuration_index; // +0x58
};

The source_language field selects the display string "sl_Cplusplus" or "sl_C". When source_language == 1 (C mode) and std_version > 199900, the routine display additionally prints C99 pragma state fields (fp_contract, fenv_access, cx_limited_range).

Memory Region System

IL entries are allocated in numbered memory regions managed by a bump allocator (sub_6B7D60):

| Region | Purpose | Lifetime | Globals |
|---|---|---|---|
| 1 | File scope | Entire translation unit | dword_126EC90 (region ID), dword_126F690/dword_126F694 (base offset / prefix size) |
| 2..N | Per-function scope | Duration of function body processing | dword_126EB40 (current region), dword_126F688/dword_126F68C (base offset / prefix size) |

Region 1 contains all file-scope declarations: types, global variables, function declarations, namespaces, templates. Regions 2+ are allocated one per function definition and hold that function's local variables, statements, expressions, labels, and temporaries. The region table at qword_126EC88 maps region indices to their memory, while qword_126EB90 maps region indices to their associated scope entries. dword_126EC80 tracks the total number of regions.

The allocator selects file-scope vs function-scope by comparing dword_126EB40 == dword_126EC90. When equal, the node goes into region 1; otherwise it goes into the current function region. Some node types force a specific region:

  • Labels (alloc_label at sub_5E5CA0): Assert that the current region is NOT file scope
  • Templates (alloc_template at sub_5E8D20): Always file-scope only
  • Sequence number lookups (sub_5E9170): Force region 1 by temporarily setting TU-copy mode

The display system (sub_5F7DF0) iterates all regions:

// File scope
printf("Intermediate language for memory region 1 (file scope):");
walk_file_scope_il(display_il_entry, ...);   // sub_60E4F0

// Per-function regions
for (int r = 2; r <= region_count; r++) {
    scope* s = scope_table[r];
    routine* fn = s->assoc_routine;
    printf("Intermediate language for memory region %ld (function \"%s\"):",
           r, fn->name);
    walk_routine_scope_il(r, display_il_entry, ...);  // sub_610200
}

IL Entry Prefix

Every IL node has a multi-qword prefix preceding the node body. The prefix size depends on allocation mode: 24 bytes (3 qwords) in normal file-scope mode, 16 bytes (2 qwords) in TU-copy mode, and 8 bytes (1 qword) for function-scope allocations. The allocator (sub_6B7D60) allocates a contiguous block and the caller returns a pointer past the prefix, so the prefix occupies negative offsets from the returned node pointer.

Normal file-scope mode (dword_106BA08 == 0, dword_126F694 = 24):

Raw allocation layout (normal file-scope, 24-byte prefix):
Offset  Size  Field
------  ----  -----
+0      8     translation_unit_copy_address   (qword, zeroed in normal mode)
+8      8     next_in_list                    (qword, linked list pointer)
+16     8     prefix flags qword              (flags byte at +16, 7 bytes padding)
+24     ...   node body starts here           (returned pointer)

Node pointer perspective (ptr = raw + 24):
ptr - 24  = TU copy address   (8 bytes at raw+0)
ptr - 16  = next pointer       (8 bytes at raw+8)
ptr - 8   = prefix flags byte  (8 bytes at raw+16, flags in low byte)
ptr + 0   = first byte of node body

TU-copy mode (dword_106BA08 != 0, dword_126F694 = 16):

Raw allocation layout (TU-copy mode, 16-byte prefix):
+0      8     next_in_list                    (no TU copy slot)
+8      8     prefix flags qword
+16     ...   node body starts here           (returned pointer)

Function-scope allocations (dword_126F68C = 8):

Raw allocation layout (function-scope, 8-byte prefix):
+0      8     prefix flags qword              (no TU copy slot, no next_in_list slot)
+8      ...   node body starts here           (returned pointer)

The prefix flags byte is at ptr - 8 from the returned node pointer (in all modes). The next_in_list pointer at ptr - 16 is the linked list link used by the IL walker to traverse all entries of a given kind (file-scope only). The translation_unit_copy_address at ptr - 24 stores the original address when a node is copied between translation units; it is zeroed in normal mode and absent in TU-copy and function-scope modes.
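The negative-offset layout can be sketched in C as follows. This is an illustration only: the helper names (prefix_flags, next_in_list, tu_copy_address, node_from_raw) are hypothetical, and only the offsets and prefix sizes are taken from the layouts above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Prefix sizes per allocation mode, as listed above. */
enum { PREFIX_FILE_SCOPE = 24, PREFIX_TU_COPY = 16, PREFIX_FUNC_SCOPE = 8 };

/* Turn a raw allocation into a node pointer by skipping the prefix. */
static void *node_from_raw(void *raw, int prefix_size) {
    assert(raw != NULL);
    return (char *)raw + prefix_size;
}

/* Flags byte: first byte of the qword at node - 8, valid in every mode. */
static uint8_t prefix_flags(const void *node) {
    return *((const uint8_t *)node - 8);
}

/* next_in_list: only meaningful for file-scope prefixes (24 or 16 bytes). */
static void *next_in_list(const void *node) {
    void *next;
    memcpy(&next, (const char *)node - 16, sizeof next);
    return next;
}

/* TU copy address: only meaningful for the 24-byte normal file-scope prefix. */
static void *tu_copy_address(const void *node) {
    void *addr;
    memcpy(&addr, (const char *)node - 24, sizeof addr);
    return addr;
}
```

A caller that only needs the flags never has to know which mode produced the node, which is why the flags qword sits closest to the body.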

The keep_in_il test throughout cudafe++ uses *(signed char*)(entry - 8) < 0 to check bit 7 of the prefix flags byte -- this works because the flags byte is always at offset -8 from the node pointer regardless of allocation mode.

Prefix Flags Byte

The prefix flags byte (at offset -8 from the returned node pointer) encodes scope and language information:

| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | allocated | Always set on allocation |
| 1 | 0x02 | file_scope | Set when !dword_106BA08 (not in TU-copy mode) |
| 2 | 0x04 | is_in_secondary_il | Entry came from secondary translation unit |
| 3 | 0x08 | language_flag | Copies dword_126E5FC & 1 (C++ vs C mode indicator) |
| 7 | 0x80 | keep_in_il | CUDA-critical: marks entry for device IL output |

Bit 7 (keep_in_il) is the mechanism by which cudafe++ selects device-relevant declarations. The mark_to_keep_in_il pass in il_walk.c sets this bit on all entries that are needed for device compilation. See Device/Host Separation and keep-in-il for details.

Sub-Kind Systems

Most primary IL entry kinds use a secondary kind byte to discriminate between variants. These sub-kind enums are the core classification taxonomy of the IL.

Type Kinds (tk_*)

The type kind byte lives at offset +132 in the type node body. 22 values, dispatched by set_type_kind (sub_5E2E80) and displayed by display_type (sub_5F06B0):

| Value | Name | Supplement | Size | Notes |
|---|---|---|---|---|
| 0 | tk_error | -- | -- | Error/placeholder type |
| 1 | tk_void | -- | -- | void |
| 2 | tk_integer | integer_type_supplement | 32 | int, char, bool, enum, wchar_t, char8/16/32_t |
| 3 | tk_float | -- | -- | float, double, long double |
| 4 | tk_complex | -- | -- | _Complex float/double/ldouble |
| 5 | tk_imaginary | -- | -- | _Imaginary (C99) |
| 6 | tk_pointer | -- | -- | Pointer, reference, rvalue reference |
| 7 | tk_routine | routine_type_supplement | 64 | Function type (return + params) |
| 8 | tk_array | -- | -- | Fixed and variable-length arrays |
| 9 | tk_class | class_type_supplement | 208 | class types |
| 10 | tk_struct | class_type_supplement | 208 | struct types |
| 11 | tk_union | class_type_supplement | 208 | union types |
| 12 | tk_typeref | typeref_type_supplement | 56 | typedef, using, decltype, typeof |
| 13 | tk_ptr_to_member | -- | -- | Pointer-to-member |
| 14 | tk_template_param | templ_param_supplement | 40 | Template type parameter |
| 15 | tk_vector | -- | -- | SIMD vector type |
| 16 | tk_scalable_vector | -- | -- | Scalable vector (SVE) |
| 17 | tk_nullptr | -- | -- | std::nullptr_t |
| 18 | tk_mfp8 | -- | -- | 8-bit floating point |
| 19 | tk_scalable_vector_count | -- | -- | Scalable vector predicate |
| 20 | (auto/decltype_auto) | -- | -- | Placeholder types |
| 21 | (typeof_unqual/typeof_type) | -- | -- | C23 typeof |

The display function references off_A6FE40 (22 string entries) for type kind names. The typeref sub-kind table at off_A6F640 has 28 entries covering typedef aliases, decltype expressions, auto, and concept-constrained placeholders.

Constant Kinds (ck_*)

The constant kind byte lives at offset +148 in the constant node. 16 values, dispatched by display_constant (sub_5F2720):

| Value | Name | Notes |
|---|---|---|
| 0 | ck_error | Error placeholder |
| 1 | ck_integer | Integer value (arbitrary precision via sub_602F20) |
| 2 | ck_string | String/character literal (char kind + length + raw bytes) |
| 3 | ck_float | Floating-point constant |
| 4 | ck_complex | Complex constant (real + imaginary) |
| 5 | ck_imaginary | Imaginary constant |
| 6 | ck_address | Address constant with 7 address sub-kinds (abk_*) |
| 7 | ck_ptr_to_member | Pointer-to-member constant |
| 8 | ck_label_difference | GNU label address difference |
| 9 | ck_dynamic_init | Dynamically initialized constant |
| 10 | ck_aggregate | Aggregate initializer (linked list of sub-constants) |
| 11 | ck_init_repeat | Repeated initializer (constant + count) |
| 12 | ck_template_param | Template parameter constant with 15 sub-kinds (tpck_*) |
| 13 | ck_designator | Designated initializer |
| 14 | ck_void | Void constant |
| 15 | ck_reflection | Reflection entity reference |

Address constant sub-kinds (abk_*): abk_routine, abk_variable, abk_constant, abk_temporary, abk_uuidof, abk_typeid, abk_label.

Template parameter constant sub-kinds (tpck_*), of which 14 of the 15 are named here: tpck_param, tpck_expression, tpck_member, tpck_unknown_function, tpck_address, tpck_sizeof, tpck_datasizeof, tpck_alignof, tpck_uuidof, tpck_typeid, tpck_noexcept, tpck_template_ref, tpck_integer_pack, tpck_destructor.

Expression Node Kinds (enk_*)

The expression kind byte lives at offset +24 in the expression node. 36 values, dispatched by display_expr_node (sub_5ECFE0):

| Value | Name | Notes |
|---|---|---|
| 0 | enk_error | Error expression |
| 1 | enk_operation | Binary/unary/ternary operation (120 operator sub-kinds via eok_*) |
| 2 | enk_constant | Constant reference |
| 3 | enk_variable | Variable reference |
| 4 | enk_field | Field access |
| 5 | enk_temp_init | Temporary initialization |
| 6 | enk_lambda | Lambda expression |
| 7 | enk_new_delete | new/delete expression (56-byte supplement) |
| 8 | enk_throw | throw expression (24-byte supplement) |
| 9 | enk_condition | Conditional expression (32-byte supplement) |
| 10 | enk_object_lifetime | Object lifetime management |
| 11 | enk_typeid | typeid expression |
| 12 | enk_sizeof | sizeof expression |
| 13 | enk_sizeof_pack | sizeof...(pack) |
| 14 | enk_alignof | alignof expression |
| 15 | enk_datasizeof | NVIDIA __datasizeof extension |
| 16 | enk_address_of_ellipsis | Address of variadic parameter |
| 17 | enk_statement | Statement expression (GCC extension) |
| 18 | enk_reuse_value | Reused value reference |
| 19 | enk_routine | Function reference |
| 20 | enk_type_operand | Type as operand (e.g., in sizeof) |
| 21 | enk_builtin_operation | Compiler builtin (indexed via off_E6C5A0) |
| 22 | enk_param_ref | Parameter reference |
| 23 | enk_braced_init_list | C++11 braced init list |
| 24 | enk_c11_generic | C11 _Generic selection |
| 25 | enk_builtin_choose_expr | GCC __builtin_choose_expr |
| 26 | enk_yield | C++20 co_yield |
| 27 | enk_await | C++20 co_await |
| 28 | enk_fold_expression | C++17 fold expression |
| 29 | enk_initializer | Initializer expression |
| 30 | enk_concept_id | C++20 concept-id |
| 31 | enk_requires | C++20 requires expression |
| 32 | enk_compound_req | Compound requirement |
| 33 | enk_nested_req | Nested requirement |
| 34 | enk_const_eval_deferred | Deferred constexpr evaluation |
| 35 | enk_template_name | Template name expression |

The enk_operation kind (value 1) carries an additional operation.kind byte dispatched through off_A6F840 (120 entries, the eok_* enum) and an operation.type_kind byte from off_A6FE40 (22 entries).

Expression Operation Kinds (eok_*)

The 120 operation kinds (one per entry in off_A6F840) cover all C++ operators. Key groups:

| Category | Operations |
|---|---|
| Arithmetic | eok_add, eok_subtract, eok_multiply, eok_divide, eok_remainder, eok_negate, eok_unary_plus |
| Bitwise | eok_and, eok_or, eok_xor, eok_complement, eok_shiftl, eok_shiftr |
| Comparison | eok_eq, eok_ne, eok_lt, eok_gt, eok_le, eok_ge, eok_spaceship |
| Logical | eok_land, eok_lor, eok_not |
| Assignment | eok_assign, eok_add_assign, eok_subtract_assign, eok_multiply_assign, etc. |
| Pointer | eok_indirect, eok_address_of, eok_padd, eok_psubtract, eok_pdiff, eok_subscript |
| Member access | eok_dot_field, eok_points_to_field, eok_dot_static, eok_points_to_static, eok_pm_field, eok_points_to_pm_call |
| Casts | eok_cast, eok_lvalue_cast, eok_ref_cast, eok_dynamic_cast, eok_bool_cast, eok_base_class_cast, eok_derived_class_cast |
| Calls | eok_call, eok_dot_member_call, eok_points_to_member_call, eok_dot_pm_call, eok_points_to_pm_func_ptr |
| Increment | eok_pre_incr, eok_pre_decr, eok_post_incr, eok_post_decr |
| Complex | eok_real_part, eok_imag_part, eok_xconj |
| Vector | eok_vector_fill, eok_vector_eq, eok_vector_ne, eok_vector_lt, eok_vector_gt, eok_vector_le, eok_vector_ge, eok_vector_subscript, eok_vector_question, eok_vector_land, eok_vector_lor, eok_vector_not |
| Control | eok_comma, eok_question, eok_parens, eok_lvalue, eok_lvalue_adjust, eok_noexcept |
| Variadic | eok_va_start, eok_va_end, eok_va_arg, eok_va_copy, eok_va_start_single_operand |
| Virtual | eok_virtual_function_ptr, eok_dot_vacuous_destructor_call, eok_points_to_vacuous_destructor_call |
| Misc | eok_array_to_pointer, eok_reference_to, eok_ref_indirect, eok_ref_dynamic_cast, eok_pm_base_class_cast, eok_pm_derived_class_cast, eok_class_rvalue_adjust |

Statement Kinds (stmk_*)

The statement kind byte lives at offset +32 in the statement node. 26 values:

| Value | Name | Supplement | Notes |
|---|---|---|---|
| 0 | stmk_expr | -- | Expression statement |
| 1 | stmk_if | -- | if statement |
| 2 | stmk_constexpr_if | 24 bytes | if constexpr (C++17) |
| 3 | stmk_if_consteval | -- | if consteval (C++23) |
| 4 | stmk_if_not_consteval | -- | if !consteval (C++23) |
| 5 | stmk_while | -- | while loop |
| 6 | stmk_goto | -- | goto statement |
| 7 | stmk_label | -- | Label statement |
| 8 | stmk_return | -- | return statement |
| 9 | stmk_coroutine | 128 bytes | C++20 coroutine body (full coroutine descriptor) |
| 10 | stmk_coroutine_return | -- | co_return statement |
| 11 | stmk_block | 32 bytes | Compound statement / block |
| 12 | stmk_end_test_while | -- | do-while loop |
| 13 | stmk_for | 24 bytes | for loop |
| 14 | stmk_range_based_for | -- | C++11 range-for (iterator, begin, end, incr) |
| 15 | stmk_switch_case | -- | case label |
| 16 | stmk_switch | 24 bytes | switch statement |
| 17 | stmk_init | -- | Declaration with initializer |
| 18 | stmk_asm | -- | Inline assembly |
| 19 | stmk_try_block | 32 bytes | try block |
| 20 | stmk_decl | -- | Declaration statement |
| 21 | stmk_set_vla_size | -- | VLA size computation |
| 22 | stmk_vla_decl | -- | VLA declaration |
| 23 | stmk_assigned_goto | -- | GCC computed goto |
| 24 | stmk_empty | -- | Empty statement |
| 25 | stmk_stmt_expr_result | -- | GCC statement expression result |

The coroutine statement (kind 9) carries the largest supplement at 128 bytes, containing traits, handle, promise, initial/final suspend calls, unhandled_exception call, get_return_object call, new/delete routines, and parameter copies. A preserved typo in the EDG source reads "paramter_copies" (missing 'e'), confirming genuine EDG lineage.

Scope Kinds (sck_*)

The scope kind byte lives at offset +28 in the scope node. 9 observed values:

| Value | Name | Notes |
|---|---|---|
| 0 | sck_file | File scope (translation unit root) |
| 1 | sck_func_prototype | Function prototype scope |
| 2 | sck_block | Block scope (compound statement) |
| 3 | sck_namespace | Namespace scope |
| 6 | sck_class_struct_union | Class/struct/union scope |
| 8 | sck_template_declaration | Template declaration scope |
| 15 | sck_condition | Condition scope (if/while/for condition variable) |
| 16 | sck_enum | Enum scope (C++11 scoped enums) |
| 17 | sck_function | Function body scope (has routine ptr, parameters, ctor inits) |

Scope kinds determine which child lists are displayed. Two bitmask tests, (1 << kind) & 0x20044 (bits 2, 6, 17 = block, class/struct/union, function) and (1 << kind) & 0x9 (bits 0, 3 = file, namespace), control whether the namespaces, using_declarations, and using_directives lists appear.
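The two masks can be checked with a short C sketch. The SCK_* constants reuse the observed values from the table; the function names are hypothetical, and the sketch makes no claim about which mask gates which child list.

```c
#include <assert.h>
#include <stdbool.h>

/* Scope kind values observed in the table above. */
enum {
    SCK_FILE = 0, SCK_FUNC_PROTOTYPE = 1, SCK_BLOCK = 2, SCK_NAMESPACE = 3,
    SCK_CLASS_STRUCT_UNION = 6, SCK_TEMPLATE_DECLARATION = 8,
    SCK_CONDITION = 15, SCK_ENUM = 16, SCK_FUNCTION = 17
};

/* Convert a scope kind to its single-bit mask. */
static unsigned long scope_bit(unsigned kind) {
    assert(kind < 32);           /* keep the shift well defined */
    return 1ul << kind;
}

/* True for block (2), class/struct/union (6), and function (17). */
static bool in_mask_20044(unsigned kind) {
    return (scope_bit(kind) & 0x20044ul) != 0;
}

/* True for file (0) and namespace (3). */
static bool in_mask_9(unsigned kind) {
    return (scope_bit(kind) & 0x9ul) != 0;
}
```

Testing membership with a shifted bit against a constant is the usual way a compiler lowers a small switch over sparse enum values, which is likely why these masks show up in the decompilation.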

Dynamic Init Kinds (dik_*)

The dynamic init kind byte lives at offset +48. 9 values:

| Value | Name | Notes |
|---|---|---|
| 0 | dik_none | No initialization |
| 1 | dik_zero | Zero initialization |
| 2 | dik_constant | Constant initializer |
| 3 | dik_expression | Expression initializer |
| 4 | dik_class_result_via_ctor | Class value via constructor call |
| 5 | dik_constructor | Constructor call (routine + args) |
| 6 | dik_nonconstant_aggregate | Non-constant aggregate init |
| 7 | dik_bitwise_copy | Bitwise copy from source |
| 8 | dik_lambda | Lambda initialization |

Common IL Node Header

All primary IL node types (type, variable, field, routine, scope, namespace, template, etc.) share a 96-byte common header copied from a template at xmmword_126F6A0..126F6F0. This header is initialized by init_il_alloc (sub_5EAD80) and contains the source correspondence (source_corresp) block: name, position, parent scope, access specifier, linkage, and flags. The display function display_source_corresp (sub_5EDF40) prints these fields for every entity type.

Key source correspondence fields (printed for all entities):

  • name and unmangled_name_or_mangled_encoding
  • decl_position (line + column)
  • name_references list
  • is_class_member + access (from off_A6F760: public/protected/private/none)
  • parent_scope and enclosing_routine
  • name_linkage (from off_E6E040: none/internal/external/C/C++)
  • Flags: referenced, needed, is_local_to_function, marked_as_gnu_extension, externalized, maybe_unused, is_deprecated_or_unavailable

Initialization and Reset

The IL subsystem initializes in two phases:

One-Time Init (sub_5CF7F0)

Called once at program startup. Validates that 7 name-table arrays end with "last" sentinels:

| Table | Address | Content |
|---|---|---|
| il_entry_kind_names | off_E6E020 | 85 IL entry kind names |
| db_storage_class_names | off_E6CD78 | Storage class enum names |
| db_special_function_kinds | off_E6D228 | Special function kind names |
| db_operator_names | off_E6CD20 | Operator kind names |
| name_linkage_kind_names | off_E6E060 | Linkage kind names |
| decl_modifier_names | off_E6CD88 | Declaration modifier names |
| pragma_ids | off_E6CF38 | Pragma identifier names |

Also validates the unsigned_int_kind_of table (byte_E6D1AD == 111, the ASCII code for 'o') and initializes 60+ allocation pools via sub_7A3C00 (pool_init) with element sizes ranging from 1 to 1344 bytes.
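A sentinel check of this kind might look like the sketch below. The function name and the sample table are hypothetical; the only facts taken from the text are that each table's final slot must hold the string "last" and that the probe byte is compared against 111.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when the final slot of a name table holds "last".
   count is the number of slots including the sentinel. */
static bool table_ends_with_last(const char *const *table, size_t count) {
    assert(table != NULL);
    return count > 0 && table[count - 1] != NULL &&
           strcmp(table[count - 1], "last") == 0;
}
```

A check like this catches an enum that has gained a member without a matching entry in its name table, because the sentinel would no longer sit at the expected index.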

Per-TU Init (sub_5CFE20)

Called at the start of each translation unit compilation. Zeroes all pool heads, then allocates the constant-sharing hash table (16,312 bytes = 2,039 buckets of 8 bytes, at qword_126F228) and the character-type hash table (3,240 bytes at qword_126F2F8). Sets sharing mode flags (byte_126E558..126E55A = 3). Tail-calls sub_5EAF00 to reset float constant caches.

Secondary Pool Reset (sub_5D0170)

Resets ~80 transient globals in the 126F680..126F978 range between template instantiation passes. Pure state zeroing, no allocation.

Constant Sharing

IL constants are deduplicated via a 2,039-bucket hash table at qword_126F228. The alloc_shareable_constant function (sub_5D2390) checks constant_is_shareable (sub_5D2210) -- which excludes aggregate constants (kind 10), template parameter constants (kind 12), and string literals when string sharing is disabled (dword_126E1C0).

On a cache hit, the existing constant is relinked to the front of its bucket chain. On a miss, a new 184-byte constant is allocated and inserted. Statistics are tracked: total allocations (qword_126F208), comparisons (qword_126F200), region hits (qword_126F218), global hits (qword_126F220), and new buckets (qword_126F210).
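The hit and miss paths can be sketched as a chained hash table with move-to-front. Everything here is a simplification under stated assumptions: the shared_const payload is a single integer rather than a 184-byte constant, the hash is a plain modulo (the real hash function is not described above), and the type and function names are hypothetical. Only the bucket count and the relink-to-front behavior come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUCKETS 2039   /* bucket count from the text */

typedef struct shared_const {
    struct shared_const *next;   /* bucket chain link */
    uint64_t value;              /* stand-in for the real constant payload */
} shared_const;

typedef struct {
    shared_const *bucket[BUCKETS];
    uint64_t comparisons, hits, misses;
} const_table;

/* Look up value. On a hit, move the existing entry to the front of its
   chain and return it. On a miss, link the caller-provided fresh node
   at the front and return that. */
static shared_const *share_constant(const_table *t, shared_const *fresh,
                                    uint64_t value) {
    assert(t != NULL && fresh != NULL);
    shared_const **head = &t->bucket[value % BUCKETS];
    shared_const **link = head;
    for (shared_const *c = *head; c != NULL; link = &c->next, c = c->next) {
        t->comparisons++;
        if (c->value == value) {
            *link = c->next;     /* unlink from current position */
            c->next = *head;     /* relink at the front */
            *head = c;
            t->hits++;
            return c;
        }
    }
    fresh->value = value;
    fresh->next = *head;
    *head = fresh;
    t->misses++;
    return fresh;
}
```

Move-to-front keeps recently used constants near the head of each chain, so repeated lookups of common values such as 0 or 1 need few comparisons.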

CUDA Extensions to IL

NVIDIA adds several CUDA-specific fields to standard EDG IL nodes:

  • Routine flags (bytes 182-183): nvvm_intrinsic, global (__global__), device (__device__), host (__host__)
  • Variable flags: shared (__shared__), constant (__constant__), device (__device__), managed (__managed__)
  • keep_in_il bit (prefix byte bit 7): The mechanism for device/host code separation
  • Lambda entries (kinds 0x46, 0x47): Extended lambda wrapper support

These extensions are what make cudafe++ the CUDA-aware C++ frontend rather than a stock EDG compiler.

Function Map

| Address | Function | Source | Notes |
|---|---|---|---|
| sub_5CF7F0 | il_one_time_init | il.c | Validates tables, inits 60+ pools |
| sub_5CFE20 | il_init / il_reset | il.c | Per-TU initialization |
| sub_5D0170 | il_reset_secondary_pools | il.c | Template instantiation reset |
| sub_5D01F0 | il_rebuild_entry_index | il.c | Build entry pointer index |
| sub_5D02F0 | il_invalidate_entry_index | il.c | Clear entry index |
| sub_5D0750 | compare_expressions | il.c | Deep structural equality |
| sub_5D1350 | compare_constants | il.c | Constant comparison (525 lines) |
| sub_5D1FE0 | compare_dynamic_inits | il.c | Dynamic init comparison |
| sub_5D2210 | constant_is_shareable | il.c | Shareability predicate |
| sub_5D2390 | alloc_shareable_constant | il.c | Hash-table dedup allocation |
| sub_5D2F90 | i_copy_expr_tree | il.c | Deep expression tree copy |
| sub_5D3B90 | i_copy_constant_full | il.c | Deep constant copy |
| sub_5D47A0 | i_copy_dynamic_init | il.c | Deep dynamic init copy |
| sub_5E2E80 | set_type_kind | il_alloc.c | Type kind dispatch (22 kinds) |
| sub_5E3D40 | alloc_type | il_alloc.c | 176-byte type node |
| sub_5E4D20 | alloc_variable | il_alloc.c | 232-byte variable node |
| sub_5E4F70 | alloc_field | il_alloc.c | 176-byte field node |
| sub_5E53D0 | alloc_routine | il_alloc.c | 288-byte routine node |
| sub_5E5CA0 | alloc_label | il_alloc.c | 128-byte label node |
| sub_5E5F00 | set_expr_node_kind | il_alloc.c | Expression kind dispatch |
| sub_5E62E0 | alloc_expr_node | il_alloc.c | 72-byte expression node |
| sub_5E6E20 | set_statement_kind | il_alloc.c | Statement kind dispatch |
| sub_5E7060 | alloc_statement | il_alloc.c | 80-byte statement node |
| sub_5E7D80 | alloc_scope | il_alloc.c | 288-byte scope node |
| sub_5E7A70 | alloc_namespace | il_alloc.c | 128-byte namespace node |
| sub_5E8D20 | alloc_template | il_alloc.c | 208-byte template node |
| sub_5E99D0 | dump_il_table_statistics | il_alloc.c | Print allocation stats |
| sub_5EAD80 | init_il_alloc | il_alloc.c | Initialize common header template |
| sub_5F4930 | display_il_entry | il_to_str.c | Main display dispatcher (~1,686 lines) |
| sub_5F76B0 | display_il_header_and_file_scope | il_to_str.c | IL header + region 1 |
| sub_5F7DF0 | display_il_file | il_to_str.c | Top-level display entry point |
| sub_60E4F0 | walk_file_scope_il | il_walk.c | File-scope tree walker |
| sub_610200 | walk_routine_scope_il | il_walk.c | Per-function tree walker |

Cross-References

IL Allocation

Every IL node in cudafe++ is allocated through a region-based bump allocator implemented in il_alloc.c (EDG 6.6 source at /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/il_alloc.c). The allocator manages 70+ distinct IL entry types across two memory region categories -- file-scope (persistent for the entire translation unit) and per-function-scope (transient, freed after each function body is processed). Free-lists recycle high-churn node types to reduce region pressure. The allocation subsystem occupies address range 0x5E0600-0x5EAF00 in the binary, roughly 43KB of compiled code covering 100+ functions.

Key Facts

| Property | Value |
|---|---|
| Source file | il_alloc.c (EDG 6.6) |
| Address range | 0x5E0600-0x5EAF00 |
| Core allocator | sub_6B7D60 (region_alloc(region_id, size)) |
| File-scope allocator | sub_5E03D0 (alloc_in_file_scope_region) |
| Dual-region allocator | sub_5E02E0 (alloc_in_region) |
| Scratch-region allocator | sub_5E0460 (alloc_in_scratch_region) |
| Stats dump | sub_5E99D0 (dump_il_table_stats), 340 lines |
| Init function | sub_5EAD80 (init_il_alloc) |
| Reset watermarks | sub_5EAEC0 (reset_region_offsets) |
| Clear free-lists | sub_5EAF00 (clear_free_lists) |
| Node types tracked | 70+ (each with per-type counter) |
| Free-list types | 6 (template_arg, constant_list, expr_node, constant, param_type, source_seq_entry) |

Region-Based Bump Allocator

The core allocation primitive is sub_6B7D60 (region_alloc), a bump allocator that takes a region ID and requested size, and returns a pointer to the allocated block within the region's memory. The caller then writes prefix fields and returns a pointer past the prefix to the node body.

region_alloc Pseudocode

// sub_6B7D60 -- region_alloc(region_id, total_size)
// Returns pointer to start of allocated block within the region.
void* region_alloc(int region_id, int64_t requested_size) {
    // Step 1: Align requested size to 8-byte boundary, add 8 for capacity margin
    int64_t aligned_size = requested_size;
    if (requested_size == 0) {
        aligned_size = 8;              // minimum allocation
    } else if (requested_size & 7) {
        aligned_size = (requested_size + 7) & ~7;   // round up to 8
    }
    int64_t check_size = aligned_size + 8;           // capacity check includes margin

    // Step 2: Get current region block
    mem_block_t* block = region_table[region_id];    // qword_126EC88[region_id]
    void* alloc_ptr = block->next_free;              // block[2] = bump pointer

    // Step 3: Check if current block has enough space
    if (block->end - alloc_ptr < check_size) {
        // Not enough space -- try free-list or allocate new block
        bool is_reuse = block->is_reusable;          // block byte +40
        int64_t block_size;
        if (is_reuse) {
            block_size = 2048;                       // small reuse block
        } else {
            flush_region(region_table[region_id]);   // sub_6B68D0
            block_size = 0x10000;                    // 64KB default
        }

        // Search free-list (qword_1280730) for a suitable block
        block = find_free_block(aligned_size + 56, block_size);
        if (!block) {
            // Allocate fresh block from heap
            if (block_size < aligned_size + 56)
                block_size = aligned_size + 56;
            block_size = (block_size + 7) & ~7;      // align to 8
            block = malloc(block_size);
            if (!block) fatal_error(4);              // out of memory
            block->capacity = block_size;
            block->end = (char*)block + block_size;
            block->next_free = (char*)block + 48;    // skip 48-byte header (6 qwords)
        }

        // Link new block into region's block chain
        block->is_reusable = 0;
        alloc_ptr = block->next_free;
        block->next = region_table[region_id];
        region_table[region_id] = block;
    }

    // Step 4: Bump the pointer
    total_allocated += aligned_size;                 // qword_1280700
    block->next_free = (char*)alloc_ptr + aligned_size;
    alignment_waste += aligned_size - requested_size; // qword_12806F8
    per_region_total[region_id] += aligned_size;     // qword_126EC50[region_id]

    return alloc_ptr;
}
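For readers who want something compilable, here is a reduced bump allocator in the same spirit. It is a sketch, not a transcription: it drops the free-list search, the reusable-block path, and the statistics globals, and the names (region, block, region_alloc_sketch, region_free_all) are hypothetical. The 8-byte rounding, the 8-byte capacity margin, the 48-byte block header, and the 64KB default block size follow the pseudocode above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_HEADER  48        /* header size taken from the pseudocode */
#define DEFAULT_BLOCK 0x10000   /* 64KB default block size */

typedef struct block {
    struct block *next;   /* older block in this region's chain */
    char *next_free;      /* bump pointer */
    char *end;            /* one past the last usable byte */
} block;

typedef struct { block *head; size_t total; } region;

/* Round up to a multiple of 8; a zero-byte request still consumes 8. */
static size_t align8(size_t n) {
    return n == 0 ? 8 : (n + 7) & ~(size_t)7;
}

/* Bump-allocate from the region, chaining a new block when needed. */
static void *region_alloc_sketch(region *r, size_t requested) {
    assert(r != NULL);
    size_t size = align8(requested);
    block *b = r->head;
    if (b == NULL || (size_t)(b->end - b->next_free) < size + 8) {
        size_t block_size = DEFAULT_BLOCK;
        if (block_size < size + BLOCK_HEADER + 8)
            block_size = size + BLOCK_HEADER + 8;
        b = malloc(block_size);
        if (b == NULL) return NULL;      /* the real code raises a fatal error */
        b->end = (char *)b + block_size;
        b->next_free = (char *)b + BLOCK_HEADER;
        b->next = r->head;
        r->head = b;
    }
    void *p = b->next_free;
    b->next_free += size;
    r->total += size;
    return p;
}

/* Release every block of a region at once. */
static void region_free_all(region *r) {
    while (r->head != NULL) {
        block *next = r->head->next;
        free(r->head);
        r->head = next;
    }
    r->total = 0;
}
```

Individual nodes are never freed in this scheme; a whole region is dropped at once, which matches the per-function lifetime described for regions 2..N.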

Region Architecture

dword_126EC90  = file_scope_region_id   (region 1, persistent)
dword_126EB40  = current_region_id      (file-scope or per-function)
dword_126F690  = file-scope base offset (typically 0)
dword_126F694  = file-scope prefix size (24 normal, 16 TU-copy)
dword_126F688  = function-scope base offset
dword_126F68C  = function-scope prefix size (8)
qword_126EC88  = region_table           (region index -> memory block)
qword_126EB90  = scope_table            (region index -> scope entry)
dword_126EC80  = total_region_count

Region selection uses a simple identity test: when dword_126EB40 == dword_126EC90, the current scope is file-scope and nodes go into region 1. When the values differ, the current scope is a function body, and nodes go into the current function's region. Some allocators force a specific behavior:

  • File-scope only: alloc_in_file_scope_region (sub_5E03D0) always uses dword_126EC90
  • Dual-region: alloc_in_region (sub_5E02E0) branches on the identity test
  • Scratch region: alloc_in_scratch_region (sub_5E0460) temporarily sets TU-copy mode, allocates from region 1, and restores state
  • Same-region-as: Used by alloc_class_list_entry (sub_5E2410) and alloc_based_type_list_member (sub_5E29C0) -- inspects the prefix byte of an existing node to determine which region it lives in, then allocates the new node in that same region
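The identity test and its effect on prefix size can be sketched as follows. The alloc_state struct and function names are hypothetical stand-ins for the globals listed under Region Architecture, and the scratch helper only models the save/override/restore pattern; whether the real scratch allocator also overrides the current region is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  file_scope_region;  /* stand-in for dword_126EC90 */
    int  current_region;     /* stand-in for dword_126EB40 */
    bool tu_copy_mode;       /* stand-in for dword_106BA08 != 0 */
} alloc_state;

/* The identity test: file scope when the two region IDs match. */
static bool at_file_scope(const alloc_state *s) {
    assert(s != NULL);
    return s->current_region == s->file_scope_region;
}

/* Prefix size implied by the current mode: 24, 16, or 8 bytes. */
static int prefix_size(const alloc_state *s) {
    if (!at_file_scope(s)) return 8;
    return s->tu_copy_mode ? 16 : 24;
}

/* Scratch pattern: force TU-copy mode and region 1, query, then restore. */
static int scratch_prefix_size(alloc_state *s) {
    alloc_state saved = *s;
    s->tu_copy_mode = true;
    s->current_region = s->file_scope_region;
    int size = prefix_size(s);
    *s = saved;
    return size;
}
```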

Allocation Protocol

Every IL node allocator follows a consistent protocol. The prefix size varies by mode: 24 bytes for file-scope (normal), 16 bytes for file-scope (TU-copy mode), and 8 bytes for function-scope.

File-scope allocation (normal mode, dword_126F694 = 24):

 1. if (dword_126EFC8) trace_enter(5, "alloc_<name>")
 2. raw = region_alloc(file_scope_region, entry_size + 24)
 3. ptr = raw + dword_126F690                       // base offset (typically 0)
 4. *(ptr+0)  = 0                                   // zero TU copy address slot (8 bytes)
    ++qword_126F7C0                                 // TU copy addr counter
 5. *(ptr+8)  = 0                                   // zero the next-in-list pointer (8 bytes)
    ++qword_126F750                                 // orphan pointer counter
 6. ++qword_126F7D8                                 // IL entry prefix counter
 7. *(ptr+16) = flags_byte:                         // prefix flags (8-byte qword, flags in low byte)
        bit 0 = 1                                   // allocated
        bit 1 = 1                                   // file_scope (not TU-copy)
        bit 3 = dword_126E5FC & 1                   // language flag (C++ vs C)
 8. node = ptr + 24                                 // skip 24-byte prefix
 9. ++qword_126F8xx                                 // per-type counter
10. initialize type-specific fields
11. copy 96-byte common header from template globals
12. if (dword_126EFC8) trace_leave()
13. return node

Function-scope allocation (dword_126F68C = 8):

 1. raw = region_alloc(current_region, entry_size + 8)
 2. ptr = raw + dword_126F688                       // function-scope base offset
 3. *(ptr+0) = flags_byte:                          // prefix flags (8-byte qword)
        bit 1 = !dword_106BA08                      // file_scope flag
        bit 3 = dword_126E5FC & 1                   // language flag
    (no TU copy slot, no next-in-list slot)
 4. node = ptr + 8                                  // skip 8-byte prefix
 5. return node

The returned pointer skips the prefix, so all field offsets documented in the IL are relative to this returned pointer. The prefix flags byte is always at node - 8 regardless of allocation mode. The next-in-list link (file-scope only) is at node - 16, and the TU-copy address (normal file-scope only) is at node - 24.
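The two protocols can be condensed into a C sketch. It is illustrative only: malloc stands in for region_alloc, the base offsets are taken as 0, tracing and counters are omitted, and the function names and the is_cplusplus / tu_copy_mode parameters are hypothetical stand-ins for dword_126E5FC & 1 and dword_106BA08. The function-scope variant sets only the bits listed in the steps above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { PF_ALLOCATED = 0x01, PF_FILE_SCOPE = 0x02, PF_LANGUAGE_FLAG = 0x08 };

/* Normal file-scope allocation: 24-byte prefix, then the node body. */
static void *alloc_file_scope_node(size_t entry_size, int is_cplusplus) {
    unsigned char *raw = malloc(entry_size + 24);
    if (raw == NULL) return NULL;
    memset(raw, 0, 24);               /* TU copy slot, next link, flags qword */
    raw[16] = (unsigned char)(PF_ALLOCATED | PF_FILE_SCOPE |
                              (is_cplusplus ? PF_LANGUAGE_FLAG : 0));
    unsigned char *node = raw + 24;   /* skip the prefix */
    memset(node, 0, entry_size);      /* stand-in for type-specific init */
    return node;
}

/* Function-scope allocation: 8-byte prefix holding only the flags qword. */
static void *alloc_function_scope_node(size_t entry_size, int tu_copy_mode,
                                       int is_cplusplus) {
    unsigned char *raw = malloc(entry_size + 8);
    if (raw == NULL) return NULL;
    memset(raw, 0, 8);
    raw[0] = (unsigned char)((!tu_copy_mode ? PF_FILE_SCOPE : 0) |
                             (is_cplusplus ? PF_LANGUAGE_FLAG : 0));
    unsigned char *node = raw + 8;
    memset(node, 0, entry_size);
    return node;
}

/* Free a sketch node; prefix_size must match the allocating function. */
static void free_node(void *node, size_t prefix_size) {
    assert(node != NULL);
    free((unsigned char *)node - prefix_size);
}
```

In both cases the flags byte ends up at node - 8, so code that inspects flags does not need to know which allocator produced the node.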

Common IL Header Template

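Every IL node contains a 96-byte common header, copied from six __m128i template globals initialized by init_il_alloc (sub_5EAD80). The listing also shows qword_126F700, the 8-byte global that immediately follows the 96-byte template: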

xmmword_126F6A0  [+0..+15]    16 bytes, zeroed
xmmword_126F6B0  [+16..+31]   16 bytes (high qword zeroed)
xmmword_126F6C0  [+32..+47]   16 bytes, zeroed
xmmword_126F6D0  [+48..+63]   16 bytes, zeroed
xmmword_126F6E0  [+64..+79]   16 bytes (from qword_126EFB8 = source position)
xmmword_126F6F0  [+80..+95]   16 bytes (low word = 4, high qword = 0)
qword_126F700    [+96..+103]  8 bytes (current source file reference)

This template captures the current source position and language state at the moment of allocation. The template is refreshed when the parser advances through source positions, so each newly-allocated node carries the file/line/column of the construct it represents.

IL Entry Prefix

Every IL entry has a variable-size raw prefix preceding the node body. The prefix is 24 bytes in normal file-scope mode, 16 bytes in TU-copy file-scope mode, and 8 bytes in function-scope mode.

Normal file-scope (24-byte prefix, ptr = raw + 24):
+0   [8 bytes]  TU copy   ptr - 24   translation_unit_copy_address
+8   [8 bytes]  next      ptr - 16   next_in_list link
+16  [8 bytes]  flags     ptr - 8    prefix flags byte (+ 7 padding)
+24  [...]      body      ptr + 0    node-specific fields

TU-copy file-scope (16-byte prefix, ptr = raw + 16):
+0   [8 bytes]  next      ptr - 16   next_in_list link
+8   [8 bytes]  flags     ptr - 8    prefix flags byte (+ 7 padding)
+16  [...]      body      ptr + 0    node-specific fields

Function-scope (8-byte prefix, ptr = raw + 8):
+0   [8 bytes]  flags     ptr - 8    prefix flags byte (+ 7 padding)
+8   [...]      body      ptr + 0    node-specific fields

Prefix Flags Byte

| Bit | Mask | Name | Set When |
|---|---|---|---|
| 0 | 0x01 | allocated | Always set on fresh allocation |
| 1 | 0x02 | file_scope | !dword_106BA08 (not in TU-copy mode) |
| 2 | 0x04 | is_in_secondary_il | Entry from secondary translation unit |
| 3 | 0x08 | language_flag | dword_126E5FC & 1 (C++ mode indicator) |
| 7 | 0x80 | keep_in_il | Set by device code marking pass |

Bit 7 is the CUDA-critical keep_in_il flag used to select device-relevant declarations. See Keep-in-IL for the marking algorithm. The flags byte is always at entry - 8 regardless of allocation mode, and the sign-bit position allows a fast test: *(signed char*)(entry - 8) < 0 means "keep this entry."

Some allocators preserve bit 7 across free-list recycling (notably alloc_local_constant at sub_5E1A80 and alloc_derivation_step at sub_5E1EE0), ensuring that the keep-in-il status is not lost when a node is reclaimed and reissued.

Complete Node Size Table

The stats dump function sub_5E99D0 prints the allocation table with exact names and per-unit sizes for all 70+ IL entry types. Sizes listed are the allocation unit in bytes -- the values passed to region_alloc.

Primary IL Nodes

| IL Entry Type | Size (bytes) | Counter Global | Allocator | Region |
| --- | --- | --- | --- | --- |
| type | 176 | qword_126F8E0 | sub_5E3D40 | file-scope |
| variable | 232 | qword_126F8C0 | sub_5E4D20 | dual (kind-dependent) |
| routine | 288 | qword_126F8A8 | sub_5E53D0 | file-scope |
| expr_node | 72 | qword_126F880 | sub_5E62E0 | dual + free-list |
| statement | 80 | qword_126F818 | sub_5E7060 | dual |
| scope | 288 | qword_126F7E8 | sub_5E7D80 | dual |
| constant | 184 | qword_126F968 | sub_5E11C0 | dual |
| field | 176 | qword_126F8B0 | sub_5E4F70 | file-scope |
| label | 128 | qword_126F888 | sub_5E5CA0 | function-scope only |
| asm_entry | 152 | qword_126F890 | sub_5E57B0 | dual |
| namespace | 128 | qword_126F7F8 | sub_5E7A70 | file-scope |
| template | 208 | qword_126F720 | sub_5E8D20 | file-scope |
| template_parameters | 136 | qword_126F728 | sub_5E8A90 | file-scope |
| template_arg | 64 | qword_126F900 | sub_5E2190 | file-scope + free-list |

Type Supplements

Auxiliary structures allocated alongside type nodes by set_type_kind (sub_5E2E80):

| Supplement | Size | Counter | For Type Kinds |
| --- | --- | --- | --- |
| integer_type_supplement | 32 | qword_126F8E8 | tk_integer (2) |
| routine_type_supplement | 64 | qword_126F958 | tk_routine (7) |
| class_type_supplement | 208 | qword_126F948 | tk_class (9), tk_struct (10), tk_union (11) |
| typeref_type_supplement | 56 | qword_126F8F0 | tk_typeref (12) |
| templ_param_supplement | 40 | qword_126F8F8 | tk_template_param (14) |

Expression Supplements

Allocated inline by set_expr_node_kind (sub_5E5F00) for expression kinds that need extra storage:

| Supplement | Size | Counter | For Expression Kind |
| --- | --- | --- | --- |
| new/delete supplement | 56 | qword_126F868 | enk_new_delete (7) |
| throw supplement | 24 | qword_126F860 | enk_throw (8) |
| condition supplement | 32 | qword_126F858 | enk_condition (9) |

Statement Supplements

Allocated inline by set_statement_kind (sub_5E6E20):

| Supplement | Size | Counter | For Statement Kind |
| --- | --- | --- | --- |
| constexpr_if | 24 | qword_126F798 | stmk_constexpr_if (2) |
| block | 32 | qword_126F830 | stmk_block (11) |
| for_loop | 24 | qword_126F820 | stmk_for (13) |
| try supplement | 32 | qword_126F838 | stmk_try_block (19) |
| switch_stmt_descr | 24 | qword_126F848 | stmk_switch (16) |
| coroutine_descr | 128 | qword_126F828 | stmk_coroutine (9) |

Linked-List Entry Types

| Entry Type | Size | Counter | Notes |
| --- | --- | --- | --- |
| class_list_entry | 16 | qword_126F940 | Region-aware (sub_5E2410) or simple (sub_5E26A0) |
| routine_list_entry | 16 | qword_126F938 | sub_5E2750 |
| variable_list_entry | 16 | qword_126F930 | sub_5E2800 |
| constant_list_entry | 16 | qword_126F928 | Free-list recycled (sub_5E28B0) |
| IL_entity_list_entry | 24 | qword_126F7B8 | sub_5E94F0 |
| based_type_list_member | 24 | qword_126F950 | Region-aware (sub_5E29C0) |

Inheritance and Virtual Dispatch

| Entry Type | Size | Counter | Allocator |
| --- | --- | --- | --- |
| base_class | 112 | qword_126F908 | sub_5E2300 |
| base_class_derivation | 32 | qword_126F910 | sub_5E1FD0 |
| derivation_step | 24 | qword_126F918 | sub_5E1EE0 |
| overriding_virtual_func | 40 | qword_126F920 | sub_5E20D0 |

Variable and Routine Auxiliaries

| Entry Type | Size | Counter | Allocator |
| --- | --- | --- | --- |
| dynamic_init | 104 | qword_126F8D8 | sub_5E4650 |
| local_static_var_init | 40 | qword_126F8D0 | sub_5E4870 |
| vla_dimension | 48 | qword_126F8C8 | sub_5E49C0 |
| variable_template_info | 24 | qword_126F8B8 | sub_5E4C70 |
| exception_specification | 16 | qword_126F8A0 | sub_5E5130 |
| exception_spec_type | 24 | qword_126F898 | sub_5E51D0 |
| param_type | 80 | qword_126F960 | sub_5E1D40 (free-list recycled) |
| constructor_init | 48 | qword_126F810 | sub_5E7410 |
| handler | 40 | qword_126F840 | sub_5E6B90 |
| switch_case_entry | 56 | qword_126F850 | sub_5E6A60 |

Scope and Source Tracking

| Entry Type | Size | Counter | Allocator |
| --- | --- | --- | --- |
| source_sequence_entry | 32 | qword_126F780 | sub_5E8300 (free-list recycled) |
| src-seq_secondary_decl | 56 | qword_126F778 | sub_5E8480 |
| src-seq_end_of_construct | 24 | qword_126F770 | sub_5E85B0 |
| src-seq_sublist | 24 | qword_126F768 | sub_5E86C0 |
| local-scope-ref | 32 | qword_126F7E0 | sub_5E80A0 |
| object_lifetime | 64 | qword_126F800 | sub_5E7800 (free-list recycled) |
| static_assertion | 24 | qword_126F788 | sub_5E81B0 |

Templates, Names, and Pragmas

| Entry Type | Size | Counter | Allocator |
| --- | --- | --- | --- |
| template_decl | 40 | qword_126F738 | sub_5E8C60 |
| requires_clause | 16 | qword_126F730 | sub_5E8BB0 |
| name_reference | 40 | qword_126F718 | sub_5E90B0 |
| name_qualifier | 40 | qword_126F710 | sub_5E8FC0 |
| element_position | 24 | qword_126F708 | sub_5E8EB0 |
| pragma | 64 | qword_126F808 | sub_5E7570 |
| using-decl | 80 | qword_126F7F0 | sub_5E7BF0 |
| instantiation_directive | 40 | qword_126F758 | sub_5E8770 |
| linkage_spec_block | 32 | qword_126F760 | sub_5E8830 |
| hidden_name | 32 | qword_126F740 | sub_5E8980 |

Attributes and Miscellaneous

| Entry Type | Size | Counter | Allocator |
| --- | --- | --- | --- |
| attribute | 72 | qword_126F7B0 | sub_5E9600 |
| attribute_arg | 40 | qword_126F7A8 | sub_5E96F0 |
| attribute_group | 8 | qword_126F7A0 | sub_5E97C0 |
| source_file | 80 | qword_126F970 | sub_5E08D0 |
| seq_number_lookup_entry | 32 | qword_126F7C8 | sub_5E9170 |
| subobject_path | 24 | qword_126F790 | sub_5E0A30 |
| orphaned_list_header | 56 | qword_126F748 | sub_5E0800 |

Bookkeeping Counters (No Separate Allocator)

| Counter | Size | Global | Meaning |
| --- | --- | --- | --- |
| string_literal_text | 1 | qword_126F7D0 | Raw string literal bytes (accumulated) |
| fs_orphan_pointers | 8 | qword_126F750 | File-scope orphan pointer slots |
| trans_unit_copy_addr | 8 | qword_126F7C0 | TU-copy address slots written |
| IL_entry_prefix | 4 | qword_126F7D8 | Total prefix flags bytes written |

Free-List Recycling

Seven node types use free-list recycling to avoid allocating fresh memory for high-churn entries: five through global free-list heads and two through per-scope lists. Each free-list is a singly-linked list with the link pointer embedded in the node itself.

Active Free-Lists

| Node Type | Free-List Head | Link Offset | Alloc Function | Free Function |
| --- | --- | --- | --- | --- |
| template_arg (64B) | qword_126F670 | +0 | sub_5E2190 | sub_5E22D0 (free_template_arg_list) |
| constant_list_entry (16B) | qword_126F668 | +0 | sub_5E28B0 | sub_5E2990 (return_constant_list_entries_to_free_list) |
| expr_node (72B) | qword_126E4B0 | +64 | sub_5E62E0 | (kind set to 36 = ek_reclaimed) |
| constant (184B) | qword_126E4B8 | +104 | sub_5E1A80 (alloc_local_constant) | sub_5E1B70 (free_local_constant) |
| param_type (80B) | qword_126F678 | +0 | sub_5E1D40 (alloc_param_type) | sub_5E1EB0 (free_param_type_list) |
| source_seq_entry (32B) | scope+328 | -- | sub_5E8300 | (per-scope recycling) |
| object_lifetime (64B) | scope+512 | +56 | sub_5E7800 | (per-scope recycling) |

Expression Node Recycling

Expression nodes use the most sophisticated free-list protocol. The allocator (sub_5E62E0) checks qword_126E4B0 before allocating fresh memory:

// Pseudocode for alloc_expr_node
if (expr_free_list != NULL) {
    node = expr_free_list;
    assert(node->kind == 36);        // ek_reclaimed sentinel
    expr_free_list = *(void **)((char *)node + 64);   // link at byte offset +64
    // reuse node (preserves bit 7 of prefix)
} else {
    node = region_alloc(region_id, 72);
    // full prefix initialization
}
set_expr_node_kind(node, requested_kind);
++total_expr_count;
++fs_expr_count;
update_rescan_counter(&rescan_expr_count);

When expression nodes are freed, their kind byte at offset +24 is set to 36 (ek_reclaimed), and their link pointer at offset +64 chains them into the free list. The stats dump walks this free list to count available recycled nodes, printing them as "(avail. fs expr node)".
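The free and allocate halves of this protocol can be exercised together in a small standalone sketch. Assumptions: malloc stands in for region_alloc, the whole node is zeroed on reuse instead of calling set_expr_node_kind, and all function names are hypothetical. The offsets (+24 kind, +64 link), the node size (72), and the sentinel (36) come from the text above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { EXPR_NODE_SIZE = 72, KIND_OFFSET = 24, LINK_OFFSET = 64,
       EK_RECLAIMED = 36 };

static void *expr_free_list = NULL;   /* stands in for qword_126E4B0 */

/* Read the embedded link stored at byte offset +64 of a freed node. */
static void *load_link(const void *node)
{
    void *next;
    memcpy(&next, (const unsigned char *)node + LINK_OFFSET, sizeof next);
    return next;
}

/* Mark the node reclaimed and push it on the free list. */
static void free_expr_node(void *node)
{
    unsigned char *p = node;
    assert(sizeof(void *) <= EXPR_NODE_SIZE - LINK_OFFSET);
    p[KIND_OFFSET] = EK_RECLAIMED;
    memcpy(p + LINK_OFFSET, &expr_free_list, sizeof expr_free_list);
    expr_free_list = node;
}

/* Pop a recycled node if one exists, otherwise allocate fresh memory. */
static void *alloc_expr_node(unsigned char kind)
{
    unsigned char *node = expr_free_list;
    if (node != NULL) {
        assert(node[KIND_OFFSET] == EK_RECLAIMED);
        expr_free_list = load_link(node);
    } else {
        node = malloc(EXPR_NODE_SIZE);   /* stand-in for region_alloc */
        if (node == NULL)
            return NULL;
    }
    memset(node, 0, EXPR_NODE_SIZE);     /* simplification of set_expr_node_kind */
    node[KIND_OFFSET] = kind;
    return node;
}

/* Walk the free list, as the stats dump does for its availability count. */
static size_t count_free_expr_nodes(void)
{
    size_t n = 0;
    for (void *p = expr_free_list; p != NULL; p = load_link(p))
        ++n;
    return n;
}
```

The list is last-in, first-out in this sketch: the most recently freed node is the next one handed out.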

A source-tracking variant alloc_expr_node_with_source_tracking (sub_5E66B0) wraps the allocation in save_source_correspondence/restore_source_correspondence calls (sub_5B8910/sub_5B89C0). For non-same-region allocations, this variant uses alloc_permanent(72) instead of the dual-region allocator because the free list cannot safely cross region boundaries.

Constant Recycling

Local constants use a separate free-list (qword_126E4B8) with the link at offset +104. The free_local_constant function (sub_5E1B70) validates the node is in-use (bit 0 of prefix) before unlinking. The check_local_constant_use assertion function (sub_5E1D00) verifies qword_126F680 == 0 at function boundaries, ensuring all borrowed constants have been returned.

The duplicate_constant_to_other_region function (sub_5E1BB0) handles the case where a constant must be copied from one region to another. When source and destination are the same region, it works in-place. When they differ, it allocates 184 bytes in the target region, copies contents via sub_5BA500, frees the original to the free list, and applies post-copy fixups (sub_5B9DE0, sub_5D39A0).

set_type_kind -- Type Kind Dispatch

set_type_kind (sub_5E2E80, confirmed at il_alloc.c:2334) writes the type kind byte at offset +132 of the type node and allocates any required type supplement. It handles 22 type kinds (0x00-0x15):

| Kind | Name | Action |
| --- | --- | --- |
| 0 | tk_error | No-op |
| 1 | tk_void | No-op |
| 2 | tk_integer | Allocates 32-byte integer_type_supplement, sets default access=5 |
| 3 | tk_float | Sets format byte = 2 |
| 4 | tk_complex | Sets format byte = 2 |
| 5 | tk_imaginary | Sets format byte = 2 |
| 6 | tk_pointer | Zeroes 2 payload fields |
| 7 | tk_routine | Allocates 64-byte routine_type_supplement, initializes calling convention and parameter bitfields |
| 8 | tk_array | Zeroes size and flags fields |
| 9 | tk_class | Allocates 208-byte class_type_supplement, stores kind at +100 |
| 10 | tk_struct | Same as class |
| 11 | tk_union | Same as class |
| 12 | tk_typeref | Allocates 56-byte typeref_type_supplement |
| 13 | tk_ptr_to_member | Zeroes fields |
| 14 | tk_template_param | Allocates 40-byte templ_param_supplement |
| 15 | tk_vector | Zeroes fields |
| 16 | tk_scalable_vector | Zeroes fields |
| 17-21 | Pack/special types | No-op or zeroes |
| default | -- | internal_error("set_type_kind: bad type kind") |

The class type supplement (208 bytes) is the largest supplement. init_class_type_supplement_fields (sub_5E2D70) initializes it with defaults: access=1, virtual_function_table_index=-1, and zeroed member lists. The companion function init_class_type_supplement (sub_5E2C70) accesses the supplement through the type node's pointer at offset +152.

A combined function init_type_fields_and_set_kind (sub_5E3590, 317 lines) copies the 96-byte template header and then runs the same switch as set_type_kind inline. This is used by alloc_type (sub_5E3D40) to avoid a separate function call.
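The supplement-allocating cases of this dispatch reduce to a small lookup. The sketch below uses hypothetical function names and takes the sizes and kind numbers from the tables above. It does not model the field initialization each case performs, and it ignores allocation prefixes.

```c
#include <assert.h>
#include <stddef.h>

enum { TYPE_NODE_SIZE = 176 };  /* size of a type node, from the node size table */

/* Supplement size for a type kind: 0 when the kind has none, -1 when the
 * kind is outside 0..21 (where the binary calls internal_error). */
static int type_supplement_size(int kind)
{
    switch (kind) {
    case 2:                   return 32;   /* tk_integer */
    case 7:                   return 64;   /* tk_routine */
    case 9: case 10: case 11: return 208;  /* tk_class, tk_struct, tk_union */
    case 12:                  return 56;   /* tk_typeref */
    case 14:                  return 40;   /* tk_template_param */
    default:                  return (kind >= 0 && kind <= 21) ? 0 : -1;
    }
}

/* Body bytes for a type node plus its supplement, prefixes not counted. */
static size_t type_alloc_bytes(int kind)
{
    int supplement = type_supplement_size(kind);
    assert(supplement >= 0);  /* caller must pass a valid kind */
    return (size_t)TYPE_NODE_SIZE + (size_t)supplement;
}
```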

set_expr_node_kind -- Expression Kind Dispatch

set_expr_node_kind (sub_5E5F00, confirmed at il_alloc.c:3932) writes the expression kind byte at offset +24 and zeroes offset +8. It handles 36 expression kinds (0-35):

| Kind | Name | Action |
| --- | --- | --- |
| 0 | enk_error | No-op |
| 1 | enk_operation | Sets operation bytes (0x78=120, 0x15=21, 0, 0), zeroes 2 qwords |
| 2-6 | enk_constant..enk_lambda | Zeroes 2 qword operand fields |
| 7 | enk_new_delete | Allocates 56-byte supplement via permanent alloc |
| 8 | enk_throw | Allocates 24-byte supplement |
| 9 | enk_condition | Allocates 32-byte supplement |
| 10 | enk_object_lifetime | Zeroes 2 qwords |
| 11,25,32 | Address-of variants | 1 qword + flag |
| 12-15 | Cast variants | Sets word=1, 1 qword |
| 16 | enk_address_of_ellipsis | No-op |
| 17,18,22,23,29,33,35 | Simple operand | 1 qword |
| 19 | enk_routine | 3 qwords |
| 20 | enk_type_operand | 2 qwords |
| 21 | enk_builtin_operation | Sets byte=117 (0x75), 1 qword |
| 24,26,27,30,31 | Complex operand | 2 qwords |
| 28 | enk_fold_expression | 1 qword + 1 dword |
| 34 | enk_const_eval_deferred | 1 qword + 1 dword |
| default | -- | internal_error("set_expr_node_kind: bad kind") |

The reinit_expr_node_kind function (sub_5E60E0) performs the same dispatch but additionally resets header fields (flag bits and source position from qword_126EFB8) before the kind switch. This is used when an existing expression node is repurposed without reallocation.

set_statement_kind -- Statement Kind Dispatch

set_statement_kind (sub_5E6E20, confirmed at il_alloc.c:4513) writes the statement kind byte at offset +32 and zeroes offset +40. It handles 26 statement kinds (0x00-0x19):

| Kind | Name | Supplement |
| --- | --- | --- |
| 0 | stmk_expr | 1 qword (expression pointer) |
| 1 | stmk_if | 2 qwords (condition + body) |
| 2 | stmk_constexpr_if | Allocates 24 bytes |
| 3,4 | stmk_if_consteval | 2 qwords |
| 5 | stmk_while | 1 qword |
| 6,7 | stmk_goto/stmk_label | 2 qwords |
| 8 | stmk_return | 1 qword |
| 9 | stmk_coroutine | 1 qword (links to 128-byte coroutine_descr) |
| 10,23,24,25 | Various | No-op |
| 11 | stmk_block | Allocates 32 bytes, stores source pos, sets priority |
| 12 | stmk_end_test_while | 1 qword |
| 13 | stmk_for | Allocates 24 bytes |
| 14 | stmk_range_based_for | 2 qwords |
| 15 | stmk_switch_case | 2 qwords |
| 16 | stmk_switch | Allocates 24 bytes |
| 17 | stmk_init | 1 qword |
| 18 | stmk_asm | 1 qword + flag |
| 19 | stmk_try_block | Allocates 32 bytes |
| 20 | stmk_decl | 1 qword |
| 21,22 | VLA statements | 1 qword |
| default | -- | internal_error("set_statement_kind: bad kind") |

set_constant_kind -- Constant Kind Dispatch

set_constant_kind (sub_5E0C60, confirmed at il_alloc.c:952) writes the constant kind byte at offset +148 and initializes the variant-specific union fields. 16 constant kinds (0-15):

| Kind | Name | Action |
| --- | --- | --- |
| 0 | ck_error | Zeroes variant fields |
| 1 | ck_integer | Calls init_target_int (sub_461260) |
| 2 | ck_string | Zeroes string fields |
| 3 | ck_float | Zeroes float fields |
| 4 | ck_address | Allocates 32-byte sub-node in file-scope region |
| 5 | ck_complex | Zeroes complex fields |
| 6 | ck_imaginary | Zeroes imaginary fields |
| 7 | ck_ptr_to_member | Zeroes 2 fields |
| 8 | ck_label_difference | Zeroes 2 fields |
| 9 | ck_dynamic_init | Zeroes |
| 10 | ck_aggregate | Zeroes aggregate list head |
| 11 | ck_init_repeat | Zeroes repeat fields |
| 12 | ck_template_param | Zeroes, dispatches to set_template_param_constant_kind |
| 13 | ck_designator | Zeroes |
| 14 | ck_void | Zeroes |
| 15 | ck_reflection | Zeroes |
| default | -- | internal_error("set_constant_kind: bad kind") |

The template parameter constant kind has its own sub-dispatch (sub_5E0B40, il_alloc.c:768) handling 14 sub-kinds (tpck_*), each zeroing variant fields at offsets +160, +168, +176. It validates the parent constant kind is 12 (ck_template_param) before proceeding.

Additional Kind Dispatchers

set_routine_special_kind

sub_5E5280 (confirmed at il_alloc.c:3065) sets the routine special kind byte at offset +166. 8 values (0-7):

| Kind | Action |
| --- | --- |
| 0 | Sets word at +168 to 0 |
| 1-4 | No-op |
| 5 | Zeroes byte at +168 |
| 6-7 | Zeroes qword at +168 |
| default | internal_error("set_routine_special_kind: bad kind") |

set_dynamic_init_kind

sub_5E45C0 (confirmed at il_alloc.c:2506) sets the dynamic init kind at offset +48. 10 values (0-9) controlling what fields are initialized in the dynamic initialization variant union.

Statistics Dump

dump_il_table_stats (sub_5E99D0) prints a formatted table of all IL allocation counters. It is invoked when tracing is enabled or on explicit request. The output format:

IL table use:
   Table                    Number    Each     Total
   -----                    ------    ----     -----
   source file                  42      80      3360
   constant                   1847     184    339848
   type                        923     176    162448
   variable                    412     232     95584
   routine                     287     288     82656
   expr node                 12847      72    924984
   statement                  5923      80    473840
   scope                       312     288     89856
   ...
   Total                                     2172576

The function iterates all 70+ counters, multiplies count by per-unit size, accumulates a running total, and adds the passed argument a1 (typically the raw region overhead) for the final sum. It also walks the expr_node free list (qword_126E4B0) to count available recycled nodes, printing them separately as "(avail. fs expr node)".
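The accumulation itself is simple enough to sketch. The struct, the function name, and the exact column widths below are hypothetical. Only the count × size accumulation and the added overhead argument follow the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct il_table_row {
    const char   *name;   /* table label, e.g. "expr node" */
    unsigned long count;  /* number of entries allocated */
    unsigned long each;   /* per-unit size in bytes */
};

/* Prints one line per row and returns overhead + sum(count * each). */
static unsigned long dump_il_table(FILE *out, const struct il_table_row *rows,
                                   size_t nrows, unsigned long overhead)
{
    unsigned long total = overhead;
    assert(out != NULL && (rows != NULL || nrows == 0));

    fprintf(out, "IL table use:\n");
    fprintf(out, "   %-22s %8s %7s %9s\n", "Table", "Number", "Each", "Total");
    for (size_t i = 0; i < nrows; ++i) {
        unsigned long bytes = rows[i].count * rows[i].each;
        fprintf(out, "   %-22s %8lu %7lu %9lu\n",
                rows[i].name, rows[i].count, rows[i].each, bytes);
        total += bytes;
    }
    fprintf(out, "   %-22s %26lu\n", "Total", total);
    return total;
}
```

Fed the eight rows shown in the sample output with zero overhead, the function returns 2172576, which matches the Total line of that sample.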

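The counter globals listed in the Complete Node Size Table above occupy BSS from qword_126F708 through qword_126F970 at 8-byte spacing (qword counters). The allocator's other state sits immediately below them: the free-list heads (qword_126F668-qword_126F678), the region watermarks (dword_126F688-dword_126F694), and the header template (xmmword_126F6A0-qword_126F700).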

Initialization and Reset

init_il_alloc (sub_5EAD80)

Called once at compiler startup. Responsibilities:

  1. Zeroes the 96-byte common header template (xmmword_126F6A0-xmmword_126F6F0)
  2. Sets the source position portion of the template from qword_126EFB8
  3. Computes the language mode byte: byte_126E5F8 = (dword_126EFB4 != 2) + 2 (C++ mode detection)
  4. Registers 6 allocator state variables with sub_7A3C00 (saveable state for region offset save/restore across compilation phases)
  5. Optionally calls sub_6F5D00 if dword_106BF18 is set (debug initialization)

reset_region_offsets (sub_5EAEC0)

Resets the bump allocator watermarks. Called at region boundaries:

dword_126F690 = 0;               // base offset reset
if (dword_106BA08) {              // TU-copy mode
    dword_126F68C = 8;            // function-scope watermark
    dword_126F688 = 0;            // function-scope base
    dword_126F694 = 16;           // file-scope watermark
} else {
    dword_126F694 = 24;           // file-scope watermark (extra 8 for TU copy addr)
}

The different initial watermark values (16 vs 24) reflect the prefix size in each mode: normal mode uses a 24-byte prefix (8 TU-copy + 8 next-link + 8 flags), while TU-copy mode uses a 16-byte prefix (8 next-link + 8 flags, no TU-copy slot). Function-scope allocations use dword_126F68C = 8 (8-byte prefix: flags only).

clear_free_lists (sub_5EAF00)

Zeroes all 5 global free-list heads:

qword_126F678 = 0;    // param_type free list
qword_126F670 = 0;    // template_arg free list
qword_126E4B8 = 0;    // local_constant free list
qword_126E4B0 = 0;    // expr_node free list
qword_126F668 = 0;    // constant_list_entry free list

Called at function-scope exit to prevent dangling pointers into freed regions.

String Allocation

Two specialized allocators handle string storage in regions:

copy_string_to_region (sub_5E0600, il_alloc.c:548)

char* copy_string_to_region(int region_id, const char* str) {
    size_t len = strlen(str);
    char* buf;
    if (region_id == 0)
        buf = heap_alloc(len + 1);                // general heap
    else if (region_id == file_scope_region)
        buf = region_alloc(file_scope, len + 1);   // file-scope region
    else if (region_id == -1)
        buf = persistent_alloc(len + 1);           // persistent heap
    else
        internal_error("copy_string_to_region");
    return strcpy(buf, str);
}

copy_string_of_length_to_region (sub_5E0700, il_alloc.c:572)

Same three-way dispatch but takes an explicit length parameter and uses strncpy with explicit null termination: result[len] = 0.
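A minimal sketch of the length-bounded copy, with malloc standing in for the region dispatch (an assumption made to keep the example standalone; the function name is shortened and hypothetical):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copies at most len bytes of str and always terminates the result. */
static char *copy_string_of_length(const char *str, size_t len)
{
    char *buf;
    assert(str != NULL);
    buf = malloc(len + 1);        /* stand-in for the three-way region dispatch */
    if (buf == NULL)
        return NULL;
    strncpy(buf, str, len);       /* may leave buf unterminated when strlen(str) >= len */
    buf[len] = '\0';              /* explicit termination, as in result[len] = 0 */
    return buf;
}
```

The explicit terminator matters because strncpy does not write one when the source is at least len bytes long, which is the common case when copying a substring.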

Special Allocation Patterns

Labels -- Function-Scope Assertion

alloc_label (sub_5E5CA0) asserts that dword_126EB40 != dword_126EC90 (must be in function scope). Labels cannot exist at file scope -- they are always allocated in a function's region:

assert(current_region != file_scope_region);   // il_alloc.c:3588

Variables -- Kind-Dependent Region

alloc_variable (sub_5E4D20) uses the variable's linkage kind to select the allocation strategy: when kind > 2 (local variables), it uses the dual-region allocator (sub_5E02E0); otherwise (non-local variables such as global, extern, static) it allocates directly in the file-scope region. This ensures that local variables live in function regions while globals persist in the file-scope region.

GNU Supplement for Routines

alloc_gnu_supplement_for_routine (sub_5E56D0, il_alloc.c:3412) asserts that no supplement already exists (*(routine+240) == 0), then allocates a 40-byte supplement and stores the pointer at routine+240. This is for GCC-extension attributes on functions (visibility, alias, constructor/destructor priority).

Pragma -- 43 Kinds

alloc_pragma (sub_5E7570, il_alloc.c:4781) uses the same-region-as pattern (handling null, non-file-scope, scratch, and same-region-as cases) and dispatches a switch covering 43 pragma kinds (0-42). Most kinds are no-op; kinds 19, 21, 26, 28, 29 have small payload fields.

Scope -- Routine Association

alloc_scope (sub_5E7D80) validates that if assoc_routine (argument a3) is non-null, the scope kind must be 17 (sck_function). Violation triggers internal_error("assoc_routine is non-NULL") at il_alloc.c:4946. After kind dispatch, it zeroes 26 qword fields (offsets 80-280) and sets *(result+240) = -1 as a sentinel.
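The validation and tail initialization can be sketched as below. The function name is hypothetical and the error path is modeled as a return value instead of internal_error. Storing the -1 sentinel as a full 64-bit value is an assumption; the text only gives its offset.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { SCOPE_SIZE = 288, SCK_FUNCTION = 17 };

/* Returns the number of qwords zeroed, or -1 where the binary would
 * raise internal_error("assoc_routine is non-NULL"). */
static int init_scope_fields(unsigned char *scope, int scope_kind,
                             const void *assoc_routine)
{
    const int64_t sentinel = -1;  /* assumption: stored as a full qword */
    int zeroed = 0;

    assert(scope != NULL);
    if (assoc_routine != NULL && scope_kind != SCK_FUNCTION)
        return -1;

    /* Offsets 80, 88, ..., 280: (280 - 80) / 8 + 1 = 26 qwords. */
    for (size_t off = 80; off <= 280; off += 8) {
        memset(scope + off, 0, 8);
        ++zeroed;
    }
    memcpy(scope + 240, &sentinel, sizeof sentinel);
    return zeroed;
}
```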

Global Variable Map

| Address | Name | Purpose |
| --- | --- | --- |
| dword_126EC90 | file_scope_region_id | Region 1 identifier |
| dword_126EB40 | current_region_id | Active allocation region |
| dword_106BA08 | tu_copy_mode | TU-copy mode flag (affects prefix layout) |
| dword_126EFC8 | tracing_enabled | When set, brackets alloc calls with trace_enter/leave |
| qword_126EFB8 | null_source_position | Default source position for new nodes |
| qword_126F700 | current_source_file | Current source file reference |
| qword_106B9B0 | compilation_context | Active compilation context pointer |
| dword_126E5FC | source_file_flags | Bit 0 = C++ mode indicator |
| byte_126E5F8 | language_std_byte | Language standard (controls routine type init) |
| dword_106BFF0 | uses_exceptions | Exception model flag (set in routine alloc) |

IL Tree Walking

The IL tree walking framework is the backbone of every operation that must visit the complete IL graph: debug display, device code marking, IL serialization, and IL copying for template instantiation. The framework lives in il_walk.c (with entry-kind dispatch logic auto-generated from walk_entry.h). It provides a generic, callback-driven traversal engine consisting of two core functions: walk_file_scope_il (sub_60E4F0), which orchestrates the top-level iteration over all global entry-kind lists, and walk_entry_and_subtree (sub_604170), which recursively descends into a single entry's children according to the IL schema. Five global function-pointer slots allow each client to customize the walk's behavior without modifying the walker itself.

The framework follows a strict separation of traversal and action. The walker knows how to navigate the IL graph; the callbacks decide what to do at each node. This design enables the same walker to serve four fundamentally different purposes: pretty-printing, transitive-closure marking, pointer remapping during copy, and entry filtering during serialization.

Key Facts

| Property | Value |
| --- | --- |
| Source file | il_walk.c (EDG 6.6) |
| Header (auto-generated dispatch) | walk_entry.h |
| Assert path | /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/il_walk.c |
| Top-level file-scope walker | sub_60E4F0 (walk_file_scope_il), 2043 lines |
| Recursive entry walker | sub_604170 (walk_entry_and_subtree), 7763 lines / 42KB |
| Routine-scope walker | sub_610200 (walk_routine_scope_il), 108 lines |
| Hash table reset | sub_603B30 (clear_walk_hash_table), 23 lines |
| Anonymous union lookup | sub_603FE0 (find_parent_var_of_anon_union_type), 127 lines |
| Entry kinds covered | 85 (switch cases 0--84) |
| Recursive self-calls | ~330 (in walk_entry_and_subtree) |
| Callback slots | 5 global function pointers |

Callback Slot Architecture

The five callback slots are stored as global function pointers. Before any walk, the caller saves all five values, installs its own set, and restores the originals on exit. This save/restore discipline makes walks re-entrant -- a callback can itself trigger a nested walk with different callbacks.

| Address | Slot Name | Signature | Purpose |
| --- | --- | --- | --- |
| qword_126FB88 | entry_callback | entry_ptr(entry_ptr, entry_kind) | Called for each entry visited; may return a replacement pointer |
| qword_126FB80 | string_callback | void(char_ptr, string_kind, byte_length) | Called for each string field; string_kind is 24 (id_name), 25 (string_text), or 26 (other_text); byte_length is strlen+1 for kinds 24/26, field-based for kind 25 |
| qword_126FB78 | pre_walk_check | int(entry_ptr, entry_kind) | Called before descending into an entry; returns nonzero to skip the subtree |
| qword_126FB70 | entry_replace | entry_ptr(entry_ptr, entry_kind) | Called to remap an entry pointer (used during IL copy to translate old pointers to new ones) |
| qword_126FB68 | entry_filter | entry_ptr(entry_ptr, entry_kind) | Called on linked-list heads to filter entries; returning NULL removes the entry from the list |

The pre_walk_check slot is the only one whose return value controls flow: nonzero means "already handled, skip this subtree." The keep-in-il pass uses this to avoid revisiting already-marked entries (preventing infinite recursion on cyclic references). The entry_replace slot is used during IL copy operations to translate pointers from the source IL region to the destination region.

Walk State Globals

In addition to the five callback slots, four state variables track the walker's context:

| Address | Name | Description |
| --- | --- | --- |
| dword_126FB5C | is_file_scope_walk | 1 during walk_file_scope_il, 0 during walk_routine_scope_il |
| dword_126FB58 | is_secondary_il | 1 if the current scope belongs to the secondary IL region |
| dword_106B644 | current_il_region | Toggles per IL region; used to stamp bit 2 of entry flags |
| dword_126FB60 | walk_mode_flags | Bitmask controlling walk behavior (e.g., strip template info) |

All four are saved and restored alongside the callback slots, making the entire walk context atomically swappable.

walk_file_scope_il (sub_60E4F0)

This is the central traversal entry point. Every operation that needs to visit the entire file-scope IL calls this function with its desired callbacks. It takes six arguments:

void walk_file_scope_il(
    entry_callback_t   a1,   // entry visitor (qword_126FB88)
    string_callback_t  a2,   // string visitor (qword_126FB80)
    entry_replace_t    a3,   // pointer remapper (qword_126FB70)
    entry_filter_t     a4,   // list filter (qword_126FB68)
    pre_walk_check_t   a5,   // pre-visit gate (qword_126FB78)
    int                a6    // walk_mode_flags (dword_126FB60)
);

Initialization

The function begins by saving all five callback slots and all four walk state variables, then installs the caller's values:

// Save current state
saved_entry_cb      = qword_126FB88;
saved_string_cb     = qword_126FB80;
saved_pre_walk      = qword_126FB78;
saved_entry_replace = qword_126FB70;
saved_entry_filter  = qword_126FB68;
saved_is_file_scope = dword_126FB5C;
saved_is_secondary  = dword_126FB58;
saved_il_region     = dword_106B644;
saved_mode_flags    = dword_126FB60;

// Install new callbacks
qword_126FB88 = a1;
qword_126FB80 = a2;
qword_126FB70 = a3;
qword_126FB68 = a4;
qword_126FB78 = a5;
dword_126FB60 = a6;
dword_126FB5C = 1;  // mark as file-scope walk

An assertion fires if pre_walk_check is NULL and the primary scope is in secondary IL (bit 1 of flags byte set):

if (!a5 && is_secondary)
    assert_fail("il_walk.c", 270, "walk_file_scope_il");

This prevents unguarded walks into secondary IL regions, which would produce incorrect results because secondary entries need canonical-entry delegation.

Walk Order

The function visits IL entries in a fixed, deterministic order. This order is significant for serialization (the IL binary format expects entries in this exact sequence) and for display (the --dump_il output follows this structure).

Phase 1: Primary scope (kind 23)

primary_scope = xmmword_126EB60[1];  // second qword of IL header

// If entry_replace callback exists, remap the scope pointer first
if (entry_replace)
    primary_scope = entry_replace(primary_scope, 23);

// Determine IL region flags from the scope's prefix byte
is_secondary_il = (*(primary_scope - 8) & 0x02) != 0;
current_il_region = ((*(primary_scope - 8) >> 2) ^ 1) & 1;

walk_entry_and_subtree(primary_scope, 23);

The scope entry (kind 23) is walked first because it is the root of the scope tree. Walking the scope recursively visits all nested scopes and their member lists.

Phase 2: Source file entries (kind 1)

for (entry = xmmword_126EB60[0]; entry; entry = entry->child_file) {
    if (entry_filter && !entry_filter(entry, 1))
        continue;  // filtered out
    walk_entry_and_subtree(entry, 1);
}

Source file entries form a linked list via offset +56 (child_file). Each entry holds the file name, full path, and name_as_written strings.

Phase 3: main_routine pointer and string entries

Before walking strings, the function remaps the main_routine pointer from the IL header:

// main_routine (qword_126EB70, IL header + 0x10)
if (entry_replace) {
    il_header.main_routine = entry_replace(il_header.main_routine, 11);
    // Also remap compiler_version through entry_replace
    compiler_version = entry_replace(compiler_version, 26);
}

Then two string entries from the IL header are walked as "other text" (kind 26):

// compiler_version (qword_126EB78, IL header + 0x18)
if (compiler_version) {
    if (trace_verbosity > 4)
        fprintf(s, "Walking IL tree, string entry kind = %s\n", "other text");
    if (string_callback)
        string_callback(compiler_version, 26, strlen(compiler_version) + 1);
}

// time_of_compilation (qword_126EB80, IL header + 0x20) -- same pattern
if (entry_replace)
    time_of_compilation = entry_replace(time_of_compilation, 26);
if (time_of_compilation) {
    if (string_callback)
        string_callback(time_of_compilation, 26, strlen(time_of_compilation) + 1);
}

Strings are walked with kind 26 (other_text) and the string callback receives the raw character pointer, the kind, and the length including the null terminator.

Phase 4: Orphaned entities list (kind 55)

for (entry = qword_126EBA0; entry; entry = entry->next) {
    if (entry_filter && !entry_filter(entry, 55))
        continue;
    walk_entry_and_subtree(entry, 55);
}

Kind 55 entries are orphaned entities -- declarations that lost their parent scope (e.g., after template instantiation cleanup). They are stored in a separate linked list headed at qword_126EBA0.

Phase 5: Global entry-kind lists (kinds 1--72)

The bulk of the walk iterates 45 global linked lists, one per entry kind. Each list head is stored at a fixed address in the 0x126E610--0x126EA80 range, with 16-byte spacing. The complete walk order, verified from the decompiled sub_60E4F0:

| # | Global Address | Kind | Entry Kind Name |
| --- | --- | --- | --- |
| 1 | qword_126E610 | 1 | source_file_entry |
| 2 | qword_126E620 | 2 | constant |
| 3 | qword_126E630 | 3 | param_type |
| 4 | qword_126E640 | 4 | routine_type_supplement |
| 5 | qword_126E650 | 5 | routine_type_extra |
| 6 | qword_126E660 | 6 | type |
| 7 | qword_126E670 | 7 | variable |
| 8 | qword_126E680 | 8 | field |
| 9 | qword_126E690 | 9 | exception_specification |
| 10 | qword_126E6A0 | 10 | exception_spec_type |
| 11 | qword_126E6B0 | 11 | routine |
| 12 | qword_126E6C0 | 12 | label |
| 13 | qword_126E6D0 | 13 | expr_node |
| 14 | qword_126E6E0 | 14 | (reserved) |
| 15 | qword_126E6F0 | 15 | (reserved) |
| 16 | qword_126E700 | 16 | switch_case_entry |
| 17 | qword_126E710 | 17 | switch_info |
| 18 | qword_126E720 | 18 | handler |
| 19 | qword_126E730 | 19 | try_supplement |
| 20 | qword_126E740 | 20 | asm_supplement |
| 21 | qword_126E750 | 21 | statement |
| 22 | qword_126E760 | 22 | object_lifetime |
| 23 | qword_126E770 | 23 | scope |
| 24 | qword_126E7B0 | 27 | template_parameter |
| 25 | qword_126E7C0 | 28 | namespace |
| 26 | qword_126E7D0 | 29 | using_declaration |
| 27 | qword_126E7E0 | 30 | dynamic_init |
| 28 | qword_126E810 | 33 | overriding_virtual_func |
| 29 | qword_126E820 | 34 | (reserved) |
| 30 | qword_126E830 | 35 | derivation_path |
| 31 | qword_126E840 | 36 | base_class_derivation |
| 32 | qword_126E850 | 37 | (reserved) |
| 33 | qword_126E860 | 38 | (reserved) |
| 34 | qword_126E870 | 39 | class_info |
| 35 | qword_126E880 | 40 | (reserved) |
| 36 | qword_126E890 | 41 | constructor_init |
| 37 | qword_126E8A0 | 42 | asm_entry |
| 38 | qword_126E8E0 | 46 | lambda |
| 39 | qword_126E8F0 | 47 | lambda_capture |
| 40 | qword_126E900 | 48 | attribute |
| 41 | qword_126E9D0 | 61 | template_param |
| 42 | qword_126E9B0 | 59 | template_decl |
| 43 | qword_126E9E0 | 62 | name_reference |
| 44 | qword_126E9F0 | 63 | name_qualifier |
| 45 | qword_126EA80 | 72 | attribute (C++11) |

Note the gaps in the walk order: kinds 24-26 (base_class, string_text, other_text), 31-32 (local_static_variable_init, vla_dimension), 43-45 (asm_operand, asm_clobber, reserved), and 49-58 (element_position through hidden_name) are skipped, and kinds 60 and 64-71 do not appear in the table either (kind 64 is walked separately in Phase 6). These entry kinds are either embedded inline within parent entries, accessed only through the recursive descent of walk_entry_and_subtree, or have no file-scope lists. Also note that kinds 59 and 61 appear out-of-order (61 before 59) -- this is verified in the binary.
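Every address in the walk-order table fits one formula: list head = 0x126E600 + 0x10 × kind. The sketch below encodes that observation along with the set of kinds the table actually walks. Both function names are hypothetical, and the formula is inferred from the 45 listed rows, not read from the binary's code.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the global list head for an entry kind (inferred formula). */
static uint64_t kind_list_head_address(unsigned kind)
{
    assert(kind >= 1 && kind <= 72);
    return UINT64_C(0x126E600) + UINT64_C(0x10) * kind;
}

/* Nonzero when the kind has a row in the Phase 5 walk-order table. */
static int kind_in_phase5_walk(unsigned kind)
{
    if (kind >= 1 && kind <= 23)  return 1;
    if (kind >= 27 && kind <= 30) return 1;
    if (kind >= 33 && kind <= 42) return 1;
    if (kind >= 46 && kind <= 48) return 1;
    return kind == 59 || kind == 61 || kind == 62 || kind == 63 || kind == 72;
}
```

Counting the kinds from 1 through 72 that pass kind_in_phase5_walk gives 45, the number of lists the phase iterates.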

For each non-empty list, the walk applies the entry_replace callback (if present) to each entry before descending, and follows the next pointer stored at entry - 16 (the next_in_list link in the entry prefix, 16 bytes before the node pointer).
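A sketch of one such list iteration follows. All names are hypothetical. Two details are assumptions: the next link is read from the original entry (not the remapped one), and a NULL result from the remap callback skips the entry. The visit callback stands in for walk_entry_and_subtree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*entry_replace_t)(void *entry, int kind);
typedef void (*entry_visit_t)(void *entry, int kind);

/* Read the next_in_list link stored 16 bytes before the node pointer. */
static void *next_in_list(const void *entry)
{
    void *next;
    memcpy(&next, (const unsigned char *)entry - 16, sizeof next);
    return next;
}

/* Store the link; a helper for building lists, not part of the walker. */
static void set_next_in_list(void *entry, void *next)
{
    memcpy((unsigned char *)entry - 16, &next, sizeof next);
}

/* Walk one kind-indexed list; returns the number of entries visited. */
static size_t walk_kind_list(void *head, int kind,
                             entry_replace_t replace, entry_visit_t visit)
{
    size_t visited = 0;
    assert(visit != NULL);
    for (void *e = head; e != NULL; e = next_in_list(e)) {
        void *target = replace ? replace(e, kind) : e;  /* remap first */
        if (target == NULL)
            continue;                                   /* assumption: skip */
        visit(target, kind);
        ++visited;
    }
    return visited;
}

/* Example visitor: counts calls and remembers the last kind seen. */
static int visit_calls, last_kind;
static void count_visit(void *entry, int kind)
{
    (void)entry;
    ++visit_calls;
    last_kind = kind;
}
```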

Phase 6: Special trailing lists

Three additional lists are walked after the main kind-indexed sequence:

// seq_number_lookup entries (kind 64) at qword_126EBE8
for (entry = qword_126EBE8; entry; entry = entry->next) {
    if (entry_filter) ...
    walk_entry_and_subtree(entry, 64);
}

// External declarations (kind 6) at qword_126EBE0
// -- uses entry_filter with kind 6 and follows offset +104 links

// Kind 83 entries at qword_126EC00
for (entry = qword_126EC00; entry; entry = entry->next) {
    if (entry_filter) ...
    walk_entry_and_subtree(entry, 83);
}

Cleanup

After all phases complete, the function restores all saved state:

dword_126FB5C = saved_is_file_scope;
dword_126FB58 = saved_is_secondary;
dword_106B644 = saved_il_region;
dword_126FB60 = saved_mode_flags;
qword_126FB88 = saved_entry_cb;
qword_126FB80 = saved_string_cb;
qword_126FB78 = saved_pre_walk;
qword_126FB70 = saved_entry_replace;
qword_126FB68 = saved_entry_filter;

If tracing is active (dword_126EFC8), the function emits trace-leave via sub_48AFD0.

walk_entry_and_subtree (sub_604170)

This is the recursive engine -- the second-largest function in the entire cudafe++ binary at 7763 lines / 42KB of decompiled code. It takes an entry pointer and its kind, then recursively walks every child entry according to the IL schema.

Entry Protocol

Before descending into any entry, the function executes a two-path check:

while (true) {
    if (pre_walk_check != NULL) {
        // Callback path: delegate decision to the callback
        if (pre_walk_check(entry, entry_kind))
            return;  // callback says skip
    } else {
        // Default path: check flags
        flags = *(entry - 8);

        // If not file-scope walk and entry has file-scope bit: skip
        if (!is_file_scope_walk && (flags & 0x01))
            return;

        // If entry's il_region bit matches current_il_region: skip
        if (((flags & 0x04) != 0) == current_il_region)
            return;

        // Stamp the entry's il_region bit to match current region
        *(entry - 8) = (4 * (current_il_region & 1)) | (flags & 0xFB);
    }

    // Trace output at verbosity > 4
    if (trace_verbosity > 4)
        fprintf(s, "Walking IL tree, entry kind = %s\n",
                il_entry_kind_names[entry_kind]);

    // Dispatch on entry kind
    switch ((char)entry_kind) { ... }
}

The while(true) loop structure exists because certain cases (particularly linked-list tails) use continue to re-enter the check with a new entry, avoiding redundant function-call overhead for tail chains.

The default-path flags check serves two purposes:

  1. Scope isolation: File-scope entries encountered during a routine-scope walk are skipped (they belong to the outer walk).
  2. Region tracking: The current_il_region toggle prevents visiting the same entry twice within a single walk -- once stamped, an entry's bit 2 matches current_il_region, and the equality check causes the walker to skip it.
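The default-path check can be isolated into a small function to see why a second visit is skipped. The sketch below mirrors the bit operations shown in the entry protocol; the `walk_state` struct and the function name `default_pre_walk` are hypothetical stand-ins for the globals, and the flags byte is passed by pointer instead of being read from `entry - 8`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical grouping of the two globals consulted by the default path. */
typedef struct walk_state {
    int is_file_scope_walk;
    int current_il_region;   /* 0 or 1 */
} walk_state;

/* Returns 1 when the entry must be skipped; otherwise stamps the
   il_region bit (bit 2) and returns 0 so the caller descends. */
int default_pre_walk(const walk_state *ws, uint8_t *flags_byte) {
    uint8_t flags = *flags_byte;
    assert(ws->current_il_region == 0 || ws->current_il_region == 1);

    /* Scope isolation: file-scope entry seen during a routine-scope walk. */
    if (!ws->is_file_scope_walk && (flags & 0x01))
        return 1;

    /* Region tracking: bit 2 already matches the current toggle value. */
    if (((flags & 0x04) != 0) == ws->current_il_region)
        return 1;

    /* Stamp bit 2, leaving every other bit untouched. */
    *flags_byte = (uint8_t)((4 * (ws->current_il_region & 1)) | (flags & 0xFB));
    return 0;
}
```

With `current_il_region == 1`, a flags byte of 0x01 is stamped to 0x05 on the first visit and rejected on the second. This only works across walks if the toggle changes value between them, which is presumably why it is saved and restored alongside the callbacks.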

Entry Kind Dispatch

The giant switch covers all 85 entry kinds. Each case knows the exact layout of that entry type and recursively calls walk_entry_and_subtree on every child pointer. The three callbacks are invoked at appropriate points:

  • entry_replace: Called on each child pointer before recursion, potentially replacing it with a remapped pointer.
  • string_callback: Called on string fields (file names, identifier text, string literal data), receiving the string pointer, the string kind (24, 25, or 26; see String Entry Handling), and the byte length.
  • entry_filter: Called on linked-list head pointers, returning NULL to remove the entry from the list.

Coverage by Entry Kind

The following table shows the major entry kinds and what the walker visits for each:

| Kind | Name | Children Walked |
|------|------|-----------------|
| 1 | source_file_entry | file_name (string, kind 26 at [0]), full_name (string, kind 26 at [1]), name_as_written (string, kind 26 at [2]), child file list (kind 1, linked via offset +56 at [5]), associated entry at [6] (kind 1), module info at [8] (kind 82) |
| 2 | constant | Type refs at [14]/[15] (kind 6), expression at [16] (kind 13); sub-switch on constant_kind byte at +148 (see below) |
| 3 | parameter | Type (kind 6), declared_type (kind 6), default_arg_expr (kind 13), attributes (kind 72) |
| 6 | type | Base type (kind 6), member field list (kind 8), template info (kind 58), scope (kind 23), base class list (kind 24), class_info supplement (kind 39) |
| 7 | variable | Type (kind 6), initializer expression (kind 13), attributes (kind 72), declared_type (kind 6) |
| 8 | field | Next field (kind 8), type (kind 6), bit_size_constant (kind 2) |
| 9 | exception_spec | Type list (kind 10), noexcept expression (kind 13) |
| 11 | routine | Return type (kind 6), parameter list (kind 3), body scope (kind 23), template info (kind 58), exception spec (kind 9), attributes (kind 72) |
| 12 | label | Break label (kind 12), continue label (kind 12) |
| 13 | expression | Sub-expressions (kind 13), operand entries, type references (kind 6); sub-switch on expression operator covers ~120 operator kinds |
| 16 | switch_case | Statement (kind 21), case_value constant (kind 2) |
| 17 | switch_info | Case list (kind 16), default case (kind 16), sorted case array |
| 18 | catch_entry | Parameter (kind 7), statement (kind 21), dynamic_init expression (kind 13) |
| 21 | statement | Sub-statements (kind 21), expressions (kind 13), labels (kind 12); sub-switch on statement kind |
| 22 | object_lifetime | Variable (kind 7), lifetime scope boundary |
| 23 | scope | Variables list (kind 7), routines list (kind 11), types list (kind 6), nested scopes (kind 23), namespaces (kind 28), using-declarations (kind 29), hidden names (kind 56), labels (kind 12) |
| 24 | base_class | Next (kind 24), type (kind 6), derived_class (kind 6), offset expression |
| 27 | template_parameter | Default value, constraint expression (kind 60), template param supplement |
| 28 | namespace | Associated scope (kind 23), flags |
| 29 | using_declaration | Target entity, position, access specifier |
| 30 | dynamic_init | Expression (kind 13), associated variable (kind 7) |
| 39 | class_info | Constructor initializer list (kind 41), friend list, base class list (kind 24) |
| 41 | constructor_init | Next (kind 41), member/base expression, initializer expression |
| 55 | orphaned_entities | Entity list, scope reference |
| 58 | template | Template parameter list (kind 61), body, specializations list |
| 72 | attribute | Attribute arguments (kind 73), next attribute (kind 72) |
| 80 | subobject_path | Linked list (kind 80), each entry walked recursively |

Constant Entry Sub-Switch (Case 2)

The constant entry handler is one of the most complex cases. After walking two type references ([14], [15] as kind 6) and one expression ([16] as kind 13), it dispatches on the constant_kind byte at entry + 148:

// Walk shared fields first
walk(a1[14], 6);   // type
walk(a1[15], 6);   // declared_type
walk(a1[16], 13);  // associated expression

// Strip template info if walk_mode_flags set
if (walk_mode_flags)
    a1[17] = 0;

switch (constant_kind) {
    case 0:  /* ck_error */
    case 1:  /* ck_integer */
    case 3:  /* ck_float */
    case 5:  /* ck_imaginary */
    case 14: /* ck_void */
        break;  // leaf constants, no children

    case 2:  /* ck_string */
        // Walk string data at [20] as string_text (kind 25)
        // Length comes from [19] (not strlen -- may have embedded NULs)
        if (string_callback)
            string_callback(a1[20], 25, a1[19]);
        break;

    case 4:  /* ck_complex */
        walk(a1[19], 27);   // template_parameter (real/imaginary parts)
        break;

    case 6:  /* ck_address -- 7 sub-kinds at entry+152 */
        switch (address_sub_kind) {
            case 0: entry_replace(a1[20], 11);  break;  // routine
            case 1: entry_replace(a1[20], 7);   break;  // variable
            case 2: case 3:
                walk(a1[20], 2);                break;  // constant (recurse)
            case 4: entry_replace(a1[20], 6);   break;  // type (typeid)
            case 5: walk(a1[20], 6);            break;  // type (uuidof, recurse)
            case 6: entry_replace(a1[20], 12);  break;  // label
            default: error("bad address const kind");
        }
        // Then walk subobject_path list at [22] (kind 80)
        break;

    case 7:  /* ck_ptr_to_member */
        entry_replace(a1[19], 36);   // derivation_path
        walk(a1[20], 62);            // name_reference
        // Conditional: if a1[21] & 2, replace [22] as routine(11)
        //             else replace [22] as field(8)
        break;

    case 8:  /* ck_label_difference */
        walk(a1[20], 2);             // constant (recurse)
        break;

    case 9:  /* ck_dynamic_init */
        walk(a1[19], 30);            // dynamic_init entry
        break;

    case 10: /* ck_aggregate */
        // Linked list of constants at [19], each via offset +104
        for each constant in list: walk(entry, 2);
        entry_replace(a1[20], 2);    // tail constant
        break;

    case 11: /* ck_init_repeat */
        walk(a1[19], 2);            // repeated constant
        break;

    case 12: /* ck_template_param -- 15 sub-kinds at entry+152 */
        // Another sub-switch with cases 0-13 + default error
        break;

    case 13: /* ck_designator */
        walk(a1[20], 2);            // constant value
        break;

    case 15: /* ck_reflection */
        // Walk [20] with kind from entry+152 byte
        break;
}

The walk_mode_flags field zeroing (a1[17] = 0) strips template parameter constant info during IL binary output. This is the template-stripping behavior controlled by argument a6 of walk_file_scope_il.

String Entry Handling

String fields within entries are walked with three distinct string kind values:

| String Kind | Value | Display Name | Used For | Length Source |
|-------------|-------|--------------|----------|---------------|
| id_name | 24 | "id name" | Identifier names (variable, function, field names) | strlen(str) + 1 |
| string_text | 25 | "string text" | String literal content (for ck_string constants) | Constant's length field [19] |
| other_text | 26 | "other text" | File names, compiler version, compilation time, asm text | strlen(str) + 1 |

The string_text kind (25) is special: its length comes from the enclosing constant entry's [19] field rather than strlen, because C/C++ string literals may contain embedded null bytes. All other string kinds use strlen(str) + 1.
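The length rule can be written as a single helper. This is a sketch under the assumptions stated in the text: the enum values 24/25/26 come from the table, while the enum identifiers and the function name `string_field_length` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* String kind values from the table above; the identifiers are illustrative. */
enum string_kind { SK_ID_NAME = 24, SK_STRING_TEXT = 25, SK_OTHER_TEXT = 26 };

/* Returns the byte count that would be handed to string_callback.
   constant_length stands in for the enclosing constant's [19] field
   and is only consulted for string_text. */
size_t string_field_length(enum string_kind kind, const char *str,
                           size_t constant_length) {
    assert(str != NULL);
    if (kind == SK_STRING_TEXT)
        return constant_length;      /* may span embedded NUL bytes */
    return strlen(str) + 1;          /* NUL-terminated text plus terminator */
}
```

For a literal such as `"a\0b"`, `strlen` stops at the embedded NUL and would report 2 bytes including the terminator, while the stored length of 4 covers the whole literal. That difference is why kind 25 cannot share the `strlen` path.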

Error Strings

The function contains diagnostic strings from walk_entry.h that fire on unexpected sub-kind values:

| String | Line | Triggers When |
|--------|------|---------------|
| "walk_entry_and_subtree: bad address const kind" | 883 | Unknown address_constant_kind in constant entry (kind 2, sub-kind 6) |
| "walk_entry_and_subtree: bad template param constant kind" | 1035 | Unknown template_param_constant_kind in constant entry (kind 2, sub-kind 12) |
| "walk_entry_and_subtree: bad constant kind" | 1051 | Unknown constant_kind in constant entry (kind 2) |

All three errors reference walk_entry.h as the source file and walk_entry_and_subtree as the function name, which indicates that the dispatch code is defined in that header and compiled into the walker from there.

walk_routine_scope_il (sub_610200)

The routine-scope counterpart of walk_file_scope_il. It takes a routine index and walks that routine's scope chain:

void walk_routine_scope_il(int routine_index, ...) {
    // Same 5-callback + 4-state save/restore pattern
    // Trace: "walk_routine_scope_il"
    // Assert: il_walk.c, line 376

    dword_126FB5C = 0;  // NOT file-scope walk

    scope = qword_126EB90[routine_index];  // routine_scope_array
    while (scope) {
        walk_entry_and_subtree(scope, 23);
        if (entry_replace)
            scope = entry_replace(scope, 23);
        scope = scope->next;
    }
}

The key difference from walk_file_scope_il is that is_file_scope_walk is set to 0, which changes the entry protocol in walk_entry_and_subtree: entries with the file-scope bit set in their flags byte are skipped, because they belong to the file-scope IL and should not be processed during a routine-scope walk.

Callers and Use Cases

The walk framework serves four distinct purposes. Each caller installs a different callback configuration.

IL Display

The --dump_il debug output uses the walk framework with display_il_entry (sub_5F4930) as the entry callback:

// sub_5F76B0 (display_il_header)
walk_file_scope_il(
    display_il_entry,    // a1: entry callback = sub_5F4930
    NULL,                // a2: no string callback
    NULL,                // a3: no replace
    NULL,                // a4: no filter
    NULL,                // a5: no pre-walk check
    0                    // a6: no special flags
);

With all callbacks NULL except entry_callback, the walker visits every entry in walk order and calls display_il_entry on each, which dispatches on entry kind to print formatted field dumps. The pre_walk_check is NULL, so the default flags-based skip logic applies -- the current_il_region toggle prevents double-visiting.

Keep-in-IL Marking

The device code selection pass (mark_to_keep_in_il, sub_610420) installs the prune callback as pre_walk_check and NULL for everything else:

// sub_610420 (mark_to_keep_in_il)
qword_126FB88 = NULL;                    // no entry callback
qword_126FB80 = NULL;                    // no string callback
qword_126FB78 = prune_keep_in_il_walk;   // sub_617310
qword_126FB70 = NULL;                    // no replace
qword_126FB68 = NULL;                    // no filter

The prune_keep_in_il_walk callback (sub_617310) sets bit 7 (0x80) of each entry's flags byte and returns 1 for already-marked entries (preventing infinite recursion). The actual subtree walk is handled by a specialized copy of walk_entry_and_subtree (sub_6115E0, walk_tree_and_set_keep_in_il, 4649 lines) that directly sets the keep bit on every reachable child rather than using callbacks. See Keep-in-IL for the full mechanism.

IL Serialization

IL binary output (the path gated by IL_SHOULD_BE_WRITTEN_TO_FILE, or device IL output) uses all five callback slots:

  • entry_callback: Records each entry's position in the output stream
  • string_callback: Serializes string data with length prefix
  • entry_replace: Translates IL pointers to output-stream offsets
  • entry_filter: Skips entries that should not appear in the output (e.g., entries without keep_in_il for device IL)
  • pre_walk_check: Prevents re-serializing entries already written

IL Copy (Template Instantiation)

When EDG instantiates a template, it copies the template's IL subtree into a new region. The copy operation uses entry_replace to remap all pointers from the source region to the destination:

  • entry_replace: For each child pointer, allocates a new entry in the destination region, copies the source entry's contents, and returns the new pointer
  • string_callback: Copies string data into the destination region
  • pre_walk_check: Tracks which entries have already been copied (using the visited-set hash table at qword_126FB50)

Hash Table for Visited Set

The walk framework includes a visited-set hash table for cycle detection and deduplication:

| Address | Name | Description |
|---------|------|-------------|
| qword_126FB50 | hash_table_array | Pointer to hash table bucket array |
| dword_126FB48 | hash_table_count | Number of entries in hash table |
| qword_126FB40 | visited_set | Pointer to visited-set data |
| dword_126FB30 | visited_count | Number of visited entries |

The hash table is reset by sub_603B30 (clear_walk_hash_table) before each walk operation. It uses open addressing and is primarily employed during IL copy operations to map source entry pointers to their destination counterparts.
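To make the open-addressing description concrete, here is a minimal pointer-to-pointer map with linear probing. It is an illustration of the general technique only: the capacity, the hash function, the probing scheme, and all names (`ptr_map`, `ptr_map_put`, `ptr_map_get`, `ptr_map_clear`) are assumptions, not details recovered from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_CAPACITY 64u   /* power of two; illustrative size only */

typedef struct ptr_map {
    const void *keys[MAP_CAPACITY];    /* NULL marks an empty slot */
    void       *values[MAP_CAPACITY];
    unsigned    count;
} ptr_map;

/* Reset to empty, analogous in spirit to clear_walk_hash_table. */
void ptr_map_clear(ptr_map *m) {
    *m = (ptr_map){0};
}

/* Hypothetical hash: drop alignment bits and fold in some higher bits. */
unsigned ptr_map_slot(const void *key) {
    uintptr_t h = (uintptr_t)key;
    h ^= h >> 9;
    return (unsigned)(h >> 3) & (MAP_CAPACITY - 1);
}

/* Look up the destination recorded for a source pointer, or NULL. */
void *ptr_map_get(const ptr_map *m, const void *key) {
    unsigned i = ptr_map_slot(key);
    for (unsigned n = 0; n < MAP_CAPACITY; n++) {
        if (m->keys[i] == NULL) return NULL;        /* hit an empty slot */
        if (m->keys[i] == key)  return m->values[i];
        i = (i + 1) & (MAP_CAPACITY - 1);           /* linear probe */
    }
    return NULL;
}

/* Record source -> destination. Returns 0 if the table is full. */
int ptr_map_put(ptr_map *m, const void *key, void *value) {
    assert(key != NULL);
    unsigned i = ptr_map_slot(key);
    for (;;) {
        if (m->keys[i] == key) { m->values[i] = value; return 1; }
        if (m->keys[i] == NULL) {
            /* Always keep one slot empty so probing terminates. */
            if (m->count + 1 >= MAP_CAPACITY) return 0;
            m->keys[i] = key;
            m->values[i] = value;
            m->count++;
            return 1;
        }
        i = (i + 1) & (MAP_CAPACITY - 1);
    }
}
```

In a copy-style walk, a pre-walk check would call `ptr_map_get` on the source entry and skip it when a destination already exists; otherwise the copy is created and recorded with `ptr_map_put`.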

Helper Functions

Several helper functions support the walk framework:

| Address | Identity | Lines | Purpose |
|---------|----------|-------|---------|
| sub_603B30 | clear_walk_hash_table | 23 | Zeros the visited-set hash table (qword_126FB50, dword_126FB48) |
| sub_603FE0 | find_parent_var_of_anon_union_type | 127 | Searches scope member lists for the variable that owns an anonymous union type |
| sub_603BB0 | find_var_in_nested_scopes | 333 | Recursively searches nested scopes for a variable (deeply unrolled, 8+ levels) |
| sub_603B00 | (trivial getter) | 9 | Walk-state accessor |
| sub_610200 | walk_routine_scope_il | 108 | Routine-scope walker (counterpart to walk_file_scope_il) |

Keep-in-IL Specialized Walkers

The keep-in-il pass uses parallel implementations of the walk framework that bypass the callback mechanism for performance:

| Address | Identity | Lines | Purpose |
|---------|----------|-------|---------|
| sub_6115E0 | walk_tree_and_set_keep_in_il | 4649 | File-scope variant -- sets bit 7 directly on every reachable entry |
| sub_618660 | walk_entry_and_set_keep_in_il_routine_scope | 3728 | Routine-scope variant |
| sub_61CE20--sub_620190 | (keep-in-il helpers) | various | Per-kind helpers for template args, exception specs, array bounds, expressions, statements |

These specialized walkers are structurally identical to walk_entry_and_subtree but replace callback invocations with direct *(entry - 8) |= 0x80 operations. The likely reason they exist as separate functions rather than callback-based walks is performance: keep-in-il marking runs on every CUDA compilation, and the specialization removes the function-pointer indirection across ~330 recursive calls.

Global Entry-Kind List Layout

The per-kind linked lists are stored in a contiguous global array starting at 0x126E600, with 16-byte stride. The formula 0x126E600 + kind * 0x10 gives the list head for most entry kinds up to kind 72. The complete walk order with all 51 lists (45 from Phase 5, 3 from Phase 6, plus orphaned entities, source files, and seq_number_lookup) is documented in the Phase 5 table above.
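The address formula is easy to check against the table entries. The helper below is a small sketch; the base and stride come from the text, while the macro and function names are hypothetical. It only applies to kinds that actually follow the formula (the text says "most entry kinds up to kind 72").

```c
#include <assert.h>
#include <stdint.h>

#define IL_LIST_BASE   0x126E600u   /* start of the contiguous list array */
#define IL_LIST_STRIDE 0x10u        /* 16 bytes per entry kind */

/* Address of the list head for a given entry kind, per the formula
   0x126E600 + kind * 0x10. Kinds above 72 are outside the array. */
uint32_t list_head_address(unsigned kind) {
    assert(kind <= 72);
    return IL_LIST_BASE + kind * IL_LIST_STRIDE;
}
```

For example, kind 36 (base_class_derivation) yields 0x126E840 and kind 72 yields 0x126EA80, matching the qword_126E840 and qword_126EA80 symbols in the walk-order table.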

The three trailing lists (Phase 6) are stored outside the contiguous array at separate addresses in the IL header/footer region:

| Address | Kind | Purpose | Next-Pointer Strategy |
|---------|------|---------|-----------------------|
| qword_126EBE8 | 64 | Sequence number lookup entries | Standard next_in_list at node prefix |
| qword_126EBE0 | 6 | External declarations (type list) | Type-specific next at offset +104 |
| qword_126EC00 | 83 | Module declarations (C++20) | Standard next_in_list at node prefix |

The external declarations list (qword_126EBE0) is notable: it walks entries as kind 6 (type) but uses a different linked-list strategy (offset +104 rather than the standard prefix next pointer). This is because the external declarations list is a secondary index over type entries that are also present in the main type list at qword_126E660.

Walk Order Diagram

walk_file_scope_il(callbacks...)
  |
  +-- [save 5 callbacks + 4 state vars]
  +-- [install caller's callbacks]
  |
  +-- Phase 1: walk_entry_and_subtree(primary_scope, 23)
  |     |
  |     +-- Recursively visits all nested scopes,
  |         their member lists (vars, routines, types),
  |         and all subtrees
  |
  +-- Phase 2: source_file list (kind 1)
  |     +-- for each file: walk(file, 1)
  |         +-- walks file_name, full_name, child files
  |
  +-- Phase 3: main_routine + string entries
  |     +-- entry_replace(main_routine, 11)
  |     +-- string_callback(compiler_version, 26, len)
  |     +-- string_callback(time_of_compilation, 26, len)
  |
  +-- Phase 4: orphaned_entities list (kind 55)
  |     +-- for each orphan: walk(orphan, 55)
  |
  +-- Phase 5: global lists (kinds 1, 2, 3, ..., 72)
  |     +-- for each kind:
  |           for each entry in list:
  |             entry_replace(entry, kind)
  |             walk(entry, kind)
  |
  +-- Phase 6: trailing lists (kinds 64, 6-ext, 83)
  |
  +-- [restore saved state]

Diagnostic Strings

| String | Source | Condition |
|--------|--------|-----------|
| "walk_file_scope_il" | sub_60E4F0 | Trace enter (dword_126EFC8 nonzero) |
| "walk_routine_scope_il" | sub_610200 | Trace enter |
| "Walking IL tree, entry kind = %s\n" | sub_604170 | dword_126EFCC > 4 |
| "Walking IL tree, string entry kind = %s\n" | sub_604170 / sub_60E4F0 | dword_126EFCC > 4 |
| "walk_entry_and_subtree: bad address const kind" | sub_604170 | Unknown address constant sub-kind (walk_entry.h:883) |
| "walk_entry_and_subtree: bad template param constant kind" | sub_604170 | Unknown template param constant sub-kind (walk_entry.h:1035) |
| "walk_entry_and_subtree: bad constant kind" | sub_604170 | Unknown constant kind (walk_entry.h:1051) |
| "find_parent_var_of_anon_union_type" | sub_603FE0 | Assert at lines 511, 523 |
| "find_parent_var_of_anon_union_type: var not found" | sub_603FE0 | Variable lookup failed |

Function Map

| Address | Identity | Confidence | Lines | EDG Source |
|---------|----------|------------|-------|------------|
| sub_60E4F0 | walk_file_scope_il | 99% | 2043 | il_walk.c:270 |
| sub_604170 | walk_entry_and_subtree | 99% | 7763 | il_walk.c / walk_entry.h |
| sub_610200 | walk_routine_scope_il | 98% | 108 | il_walk.c:376 |
| sub_603B30 | clear_walk_hash_table | 85% | 23 | il_walk.c |
| sub_603FE0 | find_parent_var_of_anon_union_type | 99% | 127 | il_walk.c:511 |
| sub_603BB0 | find_var_in_nested_scopes | 85% | 333 | il_walk.c |
| sub_603B00 | (trivial walk-state accessor) | 80% | 9 | il_walk.c |
| sub_6115E0 | walk_tree_and_set_keep_in_il | 98% | 4649 | il_walk.c |
| sub_618660 | walk_entry_and_set_keep_in_il_routine_scope | 88% | 3728 | il_walk.c |
| sub_61CE20 | (keep-in-il helper: template args) | 80% | 100 | il_walk.c |
| sub_61D0C0 | (keep-in-il helper: exception spec) | 80% | 108 | il_walk.c |
| sub_61D330 | (keep-in-il helper: array bound) | 80% | 97 | il_walk.c |
| sub_61D570 | (keep-in-il helper: overriding virtual) | 80% | 120 | il_walk.c |
| sub_61D7F0 | (keep-in-il helper: base class) | 80% | 69 | il_walk.c |
| sub_61D9B0 | (keep-in-il helper: attributes) | 80% | 202 | il_walk.c |
| sub_61DEC0 | (keep-in-il helper: using-decl) | 80% | 101 | il_walk.c |
| sub_61E160 | (keep-in-il helper: object lifetime) | 80% | 76 | il_walk.c |
| sub_61E370 | (keep-in-il helper: expressions) | 80% | 369 | il_walk.c |
| sub_61ECF0 | (keep-in-il helper: statements) | 80% | 466 | il_walk.c |
| sub_61F420 | (keep-in-il helper: additional exprs) | 80% | 631 | il_walk.c |
| sub_61FEA0 | (keep-in-il helper: decl sequence) | 80% | 173 | il_walk.c |

Keep-in-IL (Device Code Selection)

cudafe++ compiles a single .cu translation unit that contains both host and device code. After the EDG frontend builds the complete IL tree, cudafe++ must split the two worlds: host-side declarations feed into the .int.c output, while device-side declarations feed into the binary IL emitted for cicc. The keep-in-il mechanism performs this split. It is a transitive-closure walk that starts from known device entities (functions with __device__/__global__ attributes, __shared__/__constant__/__managed__ variables) and recursively marks every IL entry they reference. Entries that survive the mark phase are written to the device IL; entries without the mark are stripped by the elimination pass.

The entire mechanism lives in il_walk.c (the mark/walk side) and il.c (the elimination side). It runs as pass 3 of fe_wrapup, after IL lowering (pass 2) and before C++ class finalization (pass 4).

Key Facts

| Property | Value |
|----------|-------|
| Source file | il_walk.c (mark), il.c (eliminate) |
| Mark entry point | sub_610420 (mark_to_keep_in_il), 892 lines |
| Recursive worker | sub_6115E0 (walk_tree_and_set_keep_in_il), 4649 lines / 23KB |
| Prune callback | sub_617310 (prune_keep_in_il_walk), 127 lines |
| Elimination entry point | sub_5CCBF0 (eliminate_unneeded_il_entries), 345 lines |
| Template cleanup | sub_5CCA40 (clear_instantiation_required_on_unneeded_entities), 86 lines |
| Body removal | sub_5CC410 (eliminate_bodies_of_unneeded_functions), ~200 lines |
| Trigger | fe_wrapup pass 3, argument 23 (scope entry kind) |
| Guard flag | dword_106B640 (set=1 before walk, cleared=0 after) |
| Key bit | Bit 7 (0x80) of byte at entry_ptr - 8 |

The Keep-in-IL Bit

Every IL entry is preceded by a prefix in the raw allocation (the next_in_list link sits at offset -16, so the prefix spans at least 16 bytes). The byte at offset -8 from the entry pointer contains per-entry flags:

Byte at (entry_ptr - 8):
  bit 0  (0x01)  is_file_scope          Entry belongs to file-scope IL region
  bit 1  (0x02)  is_in_secondary_il     Entry is in the secondary IL (second TU)
  bit 2  (0x04)  current_il_region      Toggles per IL region (0 or 1)
  bits 3-6       (reserved)
  bit 7  (0x80)  keep_in_il             DEVICE CODE MARKER

The sign bit of this byte doubles as the keep-in-il flag. This allows a fast check: *(signed char*)(entry - 8) < 0 means "keep this entry." The elimination pass exploits this: it tests *(char*)(entry - 8) >= 0 to identify entries to remove.
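The equivalence between the mask test and the sign test can be shown directly. In this sketch the function names are hypothetical, and the entry is modeled as a pointer into a byte buffer so that `entry[-8]` addresses the flags byte. The sign test relies on 0x80..0xFF converting to negative `signed char` values, which holds on the two's-complement targets cudafe++ runs on.

```c
#include <assert.h>
#include <stddef.h>

/* Mask form: test bit 7 of the flags byte at entry - 8. */
int is_kept_mask(const unsigned char *entry) {
    return (entry[-8] & 0x80) != 0;
}

/* Sign form: the same bit read as the sign of a signed char. */
int is_kept_sign(const unsigned char *entry) {
    return (signed char)entry[-8] < 0;
}

/* Set keep_in_il without disturbing the other flag bits. */
void set_keep_in_il(unsigned char *entry) {
    assert(entry != NULL);
    entry[-8] |= 0x80;
}

/* Clear keep_in_il, as done for a fresh file-scope walk (&= 0x7F). */
void clear_keep_in_il(unsigned char *entry) {
    assert(entry != NULL);
    entry[-8] &= 0x7F;
}
```

Both predicates agree for every possible flags value, so a compiler is free to emit a single signed comparison for what the source presumably writes as a bit test.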

Two additional "keep definition" flags exist on specific entity types:

| Entity kind | Field | Bit | Meaning |
|-------------|-------|-----|---------|
| Type (kind 6, class/struct) | entry + 162 | bit 7 (0x80) | keep_definition_in_il -- retain full class body |
| Routine (kind 11) | entry + 187 | bit 2 (0x04) | keep_definition_in_il -- retain function body |

The keep_definition_in_il flag is stronger than the base keep_in_il flag. A type marked with only keep_in_il may be emitted as a forward declaration; one marked with keep_definition_in_il retains its full member list, base classes, and nested types.

Pipeline Context

fe_wrapup (sub_588F90)
  |
  +-- Pass 1:  sub_588C60   per-file IL wrapup
  +-- Pass 2:  sub_707040   IL lowering
  +-- Pass 3:  dword_106B640 = 1                          <<<< guard ON
  |            sub_610420(file_scope, 23)                  <<<< MARK PHASE
  |            dword_106B640 = 0                           <<<< guard OFF
  +-- Pass 4:  sub_5CCA40   clear_instantiation_required   <<<< CLEANUP
  |            sub_5CC410   eliminate_bodies_of_unneeded    (if deferred)
  |            sub_5CCBF0   eliminate_unneeded_il_entries   (if deferred)
  +-- Pass 5:  sub_588D40   per-file part_3

The guard flag dword_106B640 is an assertion variable: mark_to_keep_in_il asserts it is nonzero during the walk. If the assertion fires, it triggers internal_error at il_walk.c:1959. This prevents accidental invocation of the marking logic outside the designated pass window.

Mark Phase: mark_to_keep_in_il (sub_610420)

This is the top-level entry point for device code marking. It orchestrates the entire transitive-closure walk.

Callback Installation

The function saves all five IL tree-walk callback slots and installs a specialized configuration:

// Save existing callbacks
saved_entry_cb      = qword_126FB88;    // entry_callback
saved_string_cb     = qword_126FB80;    // string_callback
saved_pre_walk      = qword_126FB78;    // pre_walk_check
saved_entry_replace = qword_126FB70;    // entry_replace
saved_entry_filter  = qword_126FB68;    // entry_filter

// Install keep-in-il configuration
qword_126FB88 = NULL;                   // no entry callback
qword_126FB80 = NULL;                   // no string callback
qword_126FB78 = prune_keep_in_il_walk;  // sub_617310
qword_126FB70 = NULL;                   // no replacement
qword_126FB68 = NULL;                   // no filter
dword_126FB60 = 0;                      // walk_mode_flags = 0
dword_126FB58 = (flags_byte & 0x02) != 0;  // is_secondary_il

All callbacks are restored on exit, making the function safely re-entrant from nested invocations (the prune callback can recursively call mark_to_keep_in_il for canonical entries in secondary IL).
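The save/install/restore discipline is what makes nested invocations safe. The sketch below models it with a hypothetical `walk_callbacks` struct standing in for the five global slots; `mark_sketch`, `prune_stub`, `display_stub`, and `g_invocations` are illustrative names, and the recursion is forced by a counter instead of being driven by real IL entries.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*generic_fn)(void);

/* Hypothetical grouping of the five callback slots
   (qword_126FB68 through qword_126FB88 in the binary). */
typedef struct walk_callbacks {
    generic_fn entry_callback;
    generic_fn string_callback;
    generic_fn pre_walk_check;
    generic_fn entry_replace;
    generic_fn entry_filter;
} walk_callbacks;

walk_callbacks g_callbacks;   /* stands in for the global slots */
int g_invocations;            /* counts entries into mark_sketch */

void prune_stub(void)   {}    /* placeholder for prune_keep_in_il_walk */
void display_stub(void) {}    /* placeholder for some outer callback */

/* Save the caller's configuration, install the keep-in-il one,
   optionally re-enter, then restore exactly what was saved. */
void mark_sketch(int nested_calls) {
    walk_callbacks saved = g_callbacks;
    g_callbacks = (walk_callbacks){ .pre_walk_check = prune_stub };
    g_invocations++;

    if (nested_calls > 0)
        mark_sketch(nested_calls - 1);   /* models the prune callback recursing */

    /* The nested call restored our configuration before returning. */
    assert(g_callbacks.pre_walk_check == prune_stub);
    g_callbacks = saved;
}
```

Because each invocation keeps its saved copy on its own stack frame, any nesting depth unwinds to the configuration that was active before the outermost call.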

Scope Initialization

For scope entries (kind 23), the function handles two cases:

  1. Scope already has keep_in_il set (byte at entry + 28 is nonzero): Call walk_tree_and_set_keep_in_il directly. The scope was previously identified as device-relevant.

  2. Fresh scope (byte at entry + 28 is zero): Clear bit 7 of the entry's flags byte, then walk. This is the file-scope entry point where the walk begins with the keep bit initially cleared, allowing the recursive walk to set it transitively.

if (entry_kind == 23) {             // scope
    if (*(entry + 28) != 0) {
        walk_tree_and_set_keep_in_il(entry, 23);
    } else {
        *(entry - 8) &= 0x7F;      // clear keep_in_il
        // Debug: "Beginning file scope keep_in_il walk"
        walk_tree_and_set_keep_in_il(entry, 23);
        if (dword_126EFB4 == 2)     // C++ mode
            walk_scope_and_mark_routine_definitions(entry);  // sub_6175F0
    }
}

Global Entry-Kind List Walk

After processing the scope, mark_to_keep_in_il iterates all 45+ global entry-kind linked lists. These lists at 0x126E610--0x126EA80 hold every file-scope entity indexed by entry kind. The function visits each list and calls walk_tree_and_set_keep_in_il on every entry:

// Orphaned scope list (kind 55) -- only entries with keep_definition flag
for (entry = qword_126EBA0; entry; entry = entry->next) {
    if (entry->routine_byte_187 & 0x04)   // keep_definition_in_il set
        walk_tree_and_set_keep_in_il(entry, 55);
}

// Source files (kind 1), constants (kind 2), parameters (kind 3), ...
// through to attributes (kind 72); the skipped kinds are listed below
for (int kind = 1; kind <= 72; kind++) {
    for (entry = global_list[kind]; entry; entry = entry->next)
        walk_tree_and_set_keep_in_il(entry, kind);
}

The iteration order mirrors walk_file_scope_il (sub_60E4F0), processing kinds 1 through 72 with some gaps (kinds 24--26, 31--32, 43--45, 49--58, 60, 64--71 are skipped because those lists are empty or handled differently).

Using-Declaration Fixup

After the main walk, the function processes using-declarations attached to scopes. This is a fixed-point loop that repeats until no new entities are marked:

do {
    changed = 0;
    process_using_decl_list(scope->using_decls, is_class_scope, &changed);
} while (changed);

For each scope region (iterated via entry + 264), it walks the using-declaration chain and handles six declaration kinds:

| Using-decl kind byte | Name | Action |
|----------------------|------|--------|
| 0x33 | Simple using | If target entity is marked, mark the using-decl |
| 0x34 | Using with namespace | If target entity is marked, mark using-decl + namespace |
| 0x35 | Nested scope | Recurse via sub_6170C0 |
| 0x36 | Using with template | If target entity is marked, mark using-decl + template |
| 6 | Type alias (typedef) | Special: if typedef of a class/struct with has_definition flag, and the underlying class is marked, mark the typedef too |
| 66 | Using-everything | Force-mark unconditionally, set changed = 1 |

The typedef case (kind 6) deserves attention. When a typedef aliases a marked class, the typedef entry gets marked so that device code can reference the class through its alias name. The check verifies entry + 132 == 12 (typedef type kind), the underlying type is a class/struct/union (kinds 9--11), and the has_definition flag (entry + 161, bit 2) is set.
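The fixed-point structure of the fixup is worth seeing in isolation: one pass may mark a declaration that a later pass then treats as a marked target. The sketch below is a simplified model, not the decompiled logic; the `using_decl` struct, its fields, and both function names are hypothetical, and it assumes one declaration's mark can serve as another's target.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified using-declaration: marked once its target is marked. */
typedef struct using_decl {
    int        marked;          /* models the keep_in_il bit */
    const int *target_marked;   /* points at the target's mark */
} using_decl;

/* One pass over the list. Returns 1 if anything new was marked. */
int process_using_decls_once(using_decl *decls, size_t n) {
    int changed = 0;
    assert(decls != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!decls[i].marked && *decls[i].target_marked) {
            decls[i].marked = 1;
            changed = 1;
        }
    }
    return changed;
}

/* Repeat until a pass marks nothing new. Returns the pass count. */
int mark_using_decls_until_stable(using_decl *decls, size_t n) {
    int passes = 0;
    int changed;
    do {
        changed = process_using_decls_once(decls, n);
        passes++;
    } while (changed);
    return passes;
}
```

With a chain where declaration 0 depends on 1, 1 depends on 2, and 2 depends on an already-marked entity, a front-to-back scan marks one declaration per pass and needs a fourth pass to confirm that nothing changed. The loop always terminates because marks are only ever set, never cleared.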

Recursive Worker: walk_tree_and_set_keep_in_il (sub_6115E0)

This 23KB function is structurally identical to the generic walk_entry_and_subtree (sub_604170) but specialized: instead of invoking callbacks, it directly sets the keep_in_il bit on every reachable sub-entry and recurses.

The function dispatches on entry kind (approximately 80 cases) and for each child pointer it encounters, performs:

if (child != NULL) {
    *(child - 8) |= 0x80;                    // set keep_in_il
    walk_tree_and_set_keep_in_il(child, child_kind);  // recurse
}

Key entry kinds and what they transitively mark:

| Entry kind | ID | Children marked |
|------------|----|-----------------|
| source_file | 1 | file_name, full_name, child files |
| constant | 2 | type, string data, address target |
| parameter | 3 | type, declared_type, default_arg_expr, attributes |
| type | 6 | base_type, member fields, template info, scope, base classes |
| variable | 7 | type, initializer expression, attributes |
| field | 8 | next field, type, bit_size_constant |
| routine | 11 | return_type, parameters, body, template info, exception specs |
| expression | 13 | sub-expressions, operands, type references |
| statement | 21 | sub-statements, expressions, labels |
| scope | 23 | all member lists (variables, routines, types, nested scopes) |
| template_parameter | 27 | default values, constraints |
| namespace | 28 | associated scope |

The function also handles cross-references in template instantiations: when it encounters a template specialization, it follows the primary template pointer and marks the template definition too. This ensures that if device code uses vector<int>, the vector template itself is retained.

Pre-Walk Check Integration

Before recursing into any entry, the walk checks the pre_walk_check callback (qword_126FB78), which is set to prune_keep_in_il_walk. This callback returns 1 (skip) if the entry is already marked, preventing infinite recursion on cyclic references (classes referencing their own members) and avoiding redundant work.
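The interaction between the prune check and the recursive walk amounts to a standard mark-reachable traversal. The sketch below shows that technique on a toy graph with a cycle. The `node` type, `MAX_CHILDREN`, and the function names are hypothetical, and the flags byte is stored inside the node instead of at `entry - 8`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4

typedef struct node {
    uint8_t      flags;                    /* models the byte at entry - 8 */
    struct node *children[MAX_CHILDREN];   /* NULL slots are unused */
} node;

/* Pre-walk check: returns 1 (skip) if already marked, otherwise
   sets bit 7 and returns 0 so the caller descends. */
int prune_sketch(node *n) {
    assert(n != NULL);
    if (n->flags & 0x80)
        return 1;
    n->flags |= 0x80;
    return 0;
}

/* Mark everything reachable from n. Returns how many nodes were
   newly marked by this call. */
int mark_reachable(node *n) {
    if (n == NULL || prune_sketch(n))
        return 0;
    int marked = 1;
    for (size_t i = 0; i < MAX_CHILDREN; i++)
        marked += mark_reachable(n->children[i]);
    return marked;
}
```

Setting the bit before descending is what breaks cycles: when the walk comes back around to a node it is already processing, the check sees the mark and returns immediately. Nodes that no marked entity references are never touched, which is how they end up eligible for elimination.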

Prune Callback: prune_keep_in_il_walk (sub_617310)

This callback is installed as the pre_walk_check during the keep-in-il walk. It runs before the walker descends into each entry.

Decision Logic

int prune_keep_in_il_walk(entry_ptr, entry_kind) {
    signed char flags = *(signed char *)(entry_ptr - 8);  // signed: bit 7 set reads as negative

    // Case 1: Secondary IL mismatch -- delegate to canonical
    if (is_secondary_il && !(flags & 0x02)) {
        canonical = lookup_canonical(entry_ptr, entry_kind);  // sub_5B9EE0
        if (dword_126EE48) {   // CUDA mode
            if (canonical && canonical->assoc_entry) {
                target = *canonical->assoc_entry;
                if (target != entry_ptr && (*(target - 8) & 0x02))
                    mark_to_keep_in_il(target, entry_kind);  // recurse
            }
        }
        return 1;  // skip this entry (handled via canonical)
    }

    // Case 2: Already marked -- skip
    if (flags < 0)   // bit 7 set = signed negative
        return 1;

    // Case 3: Type with class/struct/union definition -- mark definition too
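    // type kind byte at +132 in 9..11 (class/struct/union); compare must be unsigned
    if (entry_kind == 6 && (unsigned char)(*(entry_ptr + 132) - 9) <= 2) {
        if (is_local || is_imported || !has_name || has_definition)
            set_keep_definition_on_type(entry_ptr);  // sub_6111C0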
    }

    // Set the keep_in_il bit
    *(entry_ptr - 8) |= 0x80;

    // Debug output
    if (trace_active && trace_filter("needed_flags", entry_ptr, entry_kind)) {
        switch (entry_kind) {
            case 6:  fprintf(s, "Setting keep_in_il on type ");  break;
            case 7:  fprintf(s, "Setting keep_in_il on var  ");  break;
            case 11: fprintf(s, "Setting keep_in_il on rout ");  break;
            case 28: fprintf(s, "Setting keep_in_il on namespace "); break;
        }
    }

    // Case 4: Variable/routine in non-guard mode -- check class membership
    if (!dword_106B640) {
        if (!(*(entry_ptr + 82) & 0x10)) {
            canonical = lookup_canonical(entry_ptr, entry_kind);
            // Assert canonical exists (il_walk.c:1885)
            if (*(canonical + 81) & 0x04) {   // is class member
                class_type = **(canonical + 40 + 32);
                walk_tree_and_set_keep_in_il(class_type, 6);
                set_keep_definition_on_type(class_type);
            }
        }
        return 1;
    }

    // Handle canonical entry in secondary IL (CUDA mode)
    canonical = lookup_canonical(entry_ptr, entry_kind);
    if (dword_126EE48 && canonical) {
        assoc = *(canonical + 32);
        if (assoc) {
            target = *assoc;
            if (target != entry_ptr && (*(target - 8) & 0x02))
                mark_to_keep_in_il(target, entry_kind);
        }
    }

    return 0;  // continue walking into this entry's children
}

The callback's return value controls the walk: returning 1 tells the walker to skip the subtree (entry already processed or delegated to canonical), returning 0 tells it to descend into children.
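The skip/descend contract can be illustrated with a minimal sketch. The Node layout, the two-child array, and the names walk, prune_if_marked, and walk_cycle_demo are hypothetical and are not the EDG entry layout or its walker; only the convention (1 = skip, 0 = descend, mark before descending) is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout, NOT the real EDG entry layout. */
typedef struct Node {
    unsigned char flags;        /* bit 7 (0x80) stands in for keep_in_il */
    struct Node *children[2];   /* edges may form cycles */
} Node;

typedef int (*pre_walk_fn)(Node *);

static int visits;              /* number of entries the walker descended into */

/* Pre-walk check: return 1 to skip the subtree, 0 to descend. */
static int prune_if_marked(Node *n) {
    if (n->flags & 0x80)
        return 1;               /* already marked: stop here */
    n->flags |= 0x80;           /* mark before descending so cycles terminate */
    return 0;
}

static void walk(Node *n, pre_walk_fn pre_walk_check) {
    if (n == NULL || pre_walk_check(n))
        return;
    visits++;
    for (size_t i = 0; i < 2; i++)
        walk(n->children[i], pre_walk_check);
}

/* Builds a three-node cycle a -> b -> c -> a and walks it once. */
static int walk_cycle_demo(void) {
    Node a = {0, {NULL, NULL}}, b = {0, {NULL, NULL}}, c = {0, {NULL, NULL}};
    a.children[0] = &b;
    b.children[0] = &c;
    c.children[0] = &a;         /* back edge */
    visits = 0;
    walk(&a, prune_if_marked);
    return visits;
}
```

Because the mark is set before the recursive calls, the back edge from c to a is pruned and each node is entered exactly once.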

Secondary IL Handling

When cudafe++ processes multiple translation units (e.g., through #include chains that bring in separate compilation units), it maintains primary and secondary IL regions. The secondary IL flag (bit 1 of the flags byte) distinguishes them. The prune callback handles cross-region references by looking up the canonical (primary) version of each entry via sub_5B9EE0 and recursively marking that version instead. This ensures the device IL output contains the primary definitions, not secondary duplicates.
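As a rough illustration of the flags byte used throughout this page, the sketch below decodes the two bits mentioned so far. It assumes the offsets in the pseudocode are byte offsets, so the flags byte lives 8 bytes before the entry pointer; the constant and function names are hypothetical.

```c
#include <assert.h>

enum {
    IL_FLAG_SECONDARY  = 0x02,  /* bit 1: the secondary IL flag */
    IL_FLAG_KEEP_IN_IL = 0x80   /* bit 7: keep_in_il */
};

/* Reads the prefix flags byte located 8 bytes before the entry. */
static int secondary_flag_set(const unsigned char *entry_ptr) {
    return (entry_ptr[-8] & IL_FLAG_SECONDARY) != 0;
}

/* Equivalent to the decompiled sign-bit test on the same byte. */
static int is_kept(const unsigned char *entry_ptr) {
    return (entry_ptr[-8] & IL_FLAG_KEEP_IN_IL) != 0;
}

static int flag_demo(void) {
    unsigned char storage[24] = {0};        /* 8-byte prefix + 16-byte body */
    unsigned char *entry_ptr = storage + 8; /* entry starts after the prefix */
    entry_ptr[-8] |= IL_FLAG_KEEP_IN_IL;
    return is_kept(entry_ptr) && !secondary_flag_set(entry_ptr);
}
```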

Keep-Definition Logic

For Types (sub_6111C0 / sub_611300)

When a class/struct/union type needs its definition kept (not just a forward declaration), set_keep_definition_on_type performs:

void set_keep_definition_on_type(entry) {
    // Debug: "Setting keep_definition_in_il on <type>"
    *(entry + 162) |= 0x80;           // set keep_definition bit

    // If already marked keep_in_il, clear and re-walk
    // (definition requires deeper traversal than reference)
    if (*(entry - 8) & 0x80) {
        *(entry - 8) &= ~0x80;        // clear keep_in_il
        mark_to_keep_in_il(entry, 6);  // re-walk with full traversal
    }

    // For class/struct: also clear/re-walk the associated scope
    if (entry_kind is class/struct/union) {
        scope = entry->associated_scope;
        *(scope - 8) &= ~0x80;
        // Follow canonical type chain
    }
}

The clear-and-re-walk pattern is important: when an entity was initially marked via a shallow reference (e.g., a pointer to the class), only the type entry itself was marked. When the definition is later needed (e.g., the device code accesses a member), the keep bit is cleared and the walk restarts, this time descending into all members, base classes, and nested types.
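A minimal sketch of the clear-and-re-walk idea, under simplifying assumptions: the Type struct, its two-slot members array, and the rule "descend only when keep_definition is set" are hypothetical stand-ins for the real traversal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified type entry. */
typedef struct Type {
    int keep_in_il;
    int keep_definition;
    struct Type *members[2];
} Type;

static void mark(Type *t) {
    if (t == NULL || t->keep_in_il)
        return;                       /* already marked: walk is pruned */
    t->keep_in_il = 1;
    if (!t->keep_definition)
        return;                       /* shallow: reference only */
    for (size_t i = 0; i < 2; i++)
        mark(t->members[i]);          /* deep: definition needed */
}

static void set_keep_definition(Type *t) {
    t->keep_definition = 1;
    if (t->keep_in_il) {
        t->keep_in_il = 0;            /* clear so the walk is not pruned */
        mark(t);                      /* re-walk, now descending into members */
    }
}

static int rewalk_demo(void) {
    Type member = {0, 0, {NULL, NULL}};
    Type cls = {0, 0, {&member, NULL}};
    mark(&cls);                       /* shallow mark via a reference */
    int before = member.keep_in_il;   /* member not reached yet */
    set_keep_definition(&cls);        /* definition now required */
    return before == 0 && cls.keep_in_il && member.keep_in_il;
}
```

Without the clear step, the second mark call would return immediately at the already-marked check and the member would never be reached.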

For Routines (sub_6113F0 / sub_6181E0)

void set_keep_definition_on_routine(entry) {
    // Debug: "Setting keep_definition_in_il on rout <name>"
    *(entry + 187) |= 0x04;          // set keep_definition bit

    // If template specialization: also mark the primary template
    if (*(entry + 177) & 0x20) {
        primary = lookup_primary_template(entry);  // sub_5BBCC0
        mark_to_keep_in_il(primary, 11);
    }

    // Special member handling (copy/move constructors)
    if (special_member_kind == 1 || special_member_kind == 2) {
        // Recurse on associated class type's ctor/dtor
    }
}

Scope-Level Routine Walk: sub_6175F0

In C++ mode (dword_126EFB4 == 2), after the main mark pass, mark_to_keep_in_il calls sub_6175F0 on the file scope. This function performs an additional sweep through all scope hierarchies to ensure routine definitions are correctly retained:

  • For each class/struct scope with keep_in_il set: recurse into the class scope
  • For each namespace (non-alias): recurse into the namespace scope
  • For routines in class scopes with external linkage: if marked but not keep_definition, call set_keep_definition_on_routine

This handles the case where a class method is referenced by device code through a virtual call or template instantiation, requiring the full function body to be available in the device IL.

Elimination Phase

After the mark phase completes, three functions strip unmarked entities from the IL.

clear_instantiation_required_on_unneeded_entities (sub_5CCA40)

Runs in pass 4 of fe_wrapup, C++ mode only. Prevents unnecessary template instantiations from being triggered during IL output.

The function recursively walks the scope tree and for each routine with template instantiation flags, checks whether the instantiation is still needed:

void clear_instantiation_required_on_unneeded_entities(scope) {
    assert(dword_126EFB4 == 2);  // C++ only

    // Recurse into child scopes (skip namespace aliases)
    for (child = scope->nested_namespaces; child; child = child->next) {
        if (!(child->flags & 0x01))    // not an alias
            recurse(child->associated_scope);
    }

    // Recurse into class scopes
    for (type = scope->types_list; type; type = type->next) {
        if (is_class_struct_union(type) && !is_anonymous(type))
            recurse(type->type_extra->scope_entry);
    }

    // Clear instantiation_required on unneeded routines
    for (rout = scope->routines_list; rout; rout = rout->next) {
        if (!(rout->flags_80 & 0x08)           // not suppressed
            && !(rout->flags_179 & 0x10)        // not already cleared
            && ((rout->flags_179 & 6) == 2 || (dword_126E204 && rout->flags_176 < 0))
            && rout->source_corresp != NULL
            && !(rout->flags_176 & 0x02))       // not locally defined
        {
            clear_instantiation_required(rout->name, 0, 2);  // sub_78A380
        }
    }

    // For non-file scopes: also clear on variable templates
    if (scope->scope_kind != 0) {
        for (var = scope->variables_list; var; var = var->next) {
            if (!(var->flags_80 & 0x08)
                && !(var->flags_162 & 0x40)
                && (var->flags_162 & 0xB0) == 0x10
                && var->name != NULL)
            {
                clear_instantiation_required(var->name, 0, 2);
            }
        }
    }
}

eliminate_bodies_of_unneeded_functions (sub_5CC410)

Walks the IL table (qword_126EB98) and removes function bodies for routines that were not marked with keep_definition_in_il:

void eliminate_bodies_of_unneeded_functions() {
    for (idx = 1; idx <= dword_126EC78; idx++) {
        scope = qword_126EC88[idx];
        if (!scope) continue;
        if (scope not in current TU) continue;
        if (scope_kind != 17) continue;     // 17 = function body

        routine = scope->owning_entity;
        if (routine->keep_definition_in_il)  // byte+187 & 0x04
            continue;
        if (!(routine->flags_29 & 0x01))
            continue;

        remove_function_body(routine);       // sub_5CAB40
    }
}

eliminate_unneeded_il_entries (sub_5CCBF0)

The main elimination pass. Walks the scope tree and removes all unmarked entities from the IL linked lists.

void eliminate_unneeded_il_entries(scope) {
    emit_info = get_emit_info(scope);       // sub_703C30
    assert(emit_info != NULL);              // il.c:29598

    // Recurse into child scopes (skip namespace aliases)
    for (child = scope->nested_namespaces; child; child = child->next) {
        if (!(child->flags & 0x01))
            eliminate_unneeded_il_entries(child->associated_scope);
    }

    // --- Eliminate variables ---
    prev = NULL;
    for (var = scope->variables_list; var; var = next) {
        next = var->next_in_list;           // offset +104
        if (*(signed char*)(var - 8) < 0) { // keep_in_il set
            prev = var;                     // keep in list
        } else {
            // Unlink from list
            if (prev) prev->next_in_list = next;
            else scope->variables_list = next;
            var->next_in_list = NULL;
            // C++ mode: walk expression trees to clear hidden names
            if (cpp_mode) {
                walk_tree(var->expr_tree, clear_hidden_name_cb, 147);
                walk_tree(var->alt_tree, clear_hidden_name_cb, 147);
            }
        }
    }
    emit_info[5] = prev;   // last kept variable

    // File scope: also clean orphaned scope list
    if (scope->scope_kind == 0)
        eliminate_unneeded_scope_orphaned_list_entries();  // sub_5CC570

    // --- Eliminate routines (same pattern as variables) ---
    prev = NULL;
    for (rout = scope->routines_list; rout; rout = next) {
        next = rout->next_in_list;
        if (*(signed char*)(rout - 8) < 0) {
            prev = rout;
        } else {
            // Unlink + clear hidden names
        }
    }
    emit_info[6] = prev;

    // Clear global variable reference if unmarked
    if (qword_126EB70 && *(signed char*)(qword_126EB70 - 8) >= 0)
        qword_126EB70 = NULL;

    // --- Eliminate types ---
    prev = NULL;
    for (type = scope->types_list; type; type = next) {
        next = type->next_in_list;

        // Follow typedef chains to find the real type for sign-bit check
        real = type;
        if (real->type_kind == 12 && !real->name) {  // anonymous typedef
            do { real = real->base_type; }
            while (real->type_kind == 12 && !real->name);
        }

        if (*(signed char*)(real - 8) < 0) {   // marked
            prev = type;
            if (is_class_struct_union(type))
                eliminate_unneeded_class_definitions(type);  // sub_5CC1B0
        } else {
            // Unlink + process eliminated class members
            if (is_class_struct_union(type) && cpp_mode)
                process_members_of_eliminated_class(type);
            type->base_type = NULL;
            clear_type_extra_member_lists(type->type_extra);
            type->type_extra->flags |= 0x20;  // mark as eliminated
        }
    }
    emit_info[4] = prev;

    // --- Eliminate hidden names ---
    // (same sign-bit check, unlink unmarked entries from scope->hidden_names)

    // File scope: emit orphaned scopes
    if (scope->scope_kind == 0)
        emit_orphaned_scopes(scope);        // sub_718720

    // Clean external declarations list
    for (ext = qword_126EBE0; ext; ext = next) {
        next = ext->next;
        if (*(signed char*)(ext - 8) >= 0) {
            // Unlink from external declarations list
        }
    }
}
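
Each list in the pseudocode above is filtered with the same trailing-pointer unlink pattern. The sketch below isolates it on a hypothetical Item type with a plain keep field; it is an illustration of the pattern, not the decompiled code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Item {
    int keep;                 /* stands in for the keep_in_il bit */
    struct Item *next;
} Item;

/* Removes unmarked items in place; returns the last kept item (or NULL). */
static Item *filter_list(Item **head) {
    Item *prev = NULL, *next;
    for (Item *it = *head; it != NULL; it = next) {
        next = it->next;      /* save the successor before unlinking */
        if (it->keep) {
            prev = it;        /* stays in the list */
        } else {
            if (prev) prev->next = next;
            else *head = next;
            it->next = NULL;  /* detach the removed item */
        }
    }
    return prev;
}

static int filter_demo(void) {
    Item c = {1, NULL}, b = {0, &c}, a = {1, &b}, z = {0, &a};
    Item *head = &z;          /* list: z(drop) -> a(keep) -> b(drop) -> c(keep) */
    Item *last = filter_list(&head);
    return head == &a && a.next == &c && c.next == NULL && last == &c
        && b.next == NULL && z.next == NULL;
}
```

The returned pointer corresponds to the "last kept" value that the pseudocode stores into emit_info after each list.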

The debug output for eliminated vs. retained entities uses a string trick: "TARG_VERT_TAB_CHAR" + 17 points at the last character of that 18-character literal, so it evaluates to the one-character string "R". The output therefore reads either "Removing variable <name>" (eliminated) or "Not removing variable <name>" (retained).
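
The pointer arithmetic can be checked with a short sketch. The format string "%semoving variable %s" and the "Not r" alternative are assumptions chosen to reproduce the two messages; the real format string is not shown on this page. The expression is written as &"..."[17], which is equivalent to "..." + 17 and avoids compiler warnings about adding an integer to a string literal.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static const char *removal_prefix(int removing) {
    /* "TARG_VERT_TAB_CHAR" has 18 characters; index 17 is the final 'R'. */
    return removing ? &"TARG_VERT_TAB_CHAR"[17] : "Not r";
}

static void format_removal(char *buf, size_t n, int removing, const char *name) {
    /* Hypothetical format string, chosen to match the documented output. */
    snprintf(buf, n, "%semoving variable %s", removal_prefix(removing), name);
}
```

A suffix reference like this is plausibly a side effect of the toolchain merging identical string tails, although that is a guess and not something confirmed from the binary.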

Global State

| Address | Name | Description |
|---|---|---|
| dword_106B640 | keep_in_il_walk_active | Assertion guard; 1 during pass 3, 0 otherwise |
| dword_126EFB4 | cpp_mode | 2 = C++ mode (enables class/template processing) |
| dword_126EFC8 | trace_active | Nonzero enables diagnostic output |
| dword_126EFCC | trace_verbosity | Higher = more output (>2 prints elimination details) |
| dword_126EE48 | cuda_mode | Nonzero enables CUDA-specific canonical entry handling |
| dword_126E204 | template_compat_flag | Affects template instantiation clearing criteria |
| qword_126EBA0 | orphaned_scope_list | File-scope orphaned scopes (kind 55 list) |
| qword_126EB70 | global_variable_ref | Cleared if its entry is unmarked |
| qword_126EBE0 | external_decl_list | External declarations; unmarked ones removed |
| qword_126EB98 | il_table | Array of IL scope pointers, indexed by il_table_index |
| qword_126FB78 | pre_walk_check | Callback slot; set to prune_keep_in_il_walk during mark |
| qword_126FB88 | entry_callback | Callback slot; NULL during mark phase |
| dword_126FB58 | is_secondary_il | Walk state: 1 if currently in secondary IL region |
| dword_126FB5C | is_file_scope_walk | Walk state: 1 during file-scope walk |
| dword_106B644 | current_il_region | Walk state: toggles per IL region |
| dword_126FB60 | walk_mode_flags | Walk state: 0 during keep-in-il walk |

Diagnostic Strings

| String | Source | Condition |
|---|---|---|
| "Beginning file scope keep_in_il walk" | sub_610420 | trace_active && trace_category("needed_flags") |
| "Ending file scope keep_in_il walk" | sub_610420 | Same |
| "Setting keep_in_il on type " | sub_617310 | trace_active && trace_filter("needed_flags", entry, 6) |
| "Setting keep_in_il on var " | sub_617310 | Same, kind 7 |
| "Setting keep_in_il on rout " | sub_617310 | Same, kind 11 |
| "Setting keep_in_il on namespace " | sub_617310 | Same, kind 28 |
| "Setting keep_definition_in_il on " | sub_6111C0 | Trace active |
| "Setting keep_definition_in_il on rout " | sub_6113F0 | Trace active |
| "Removing variable <name>" | sub_5CCBF0 | trace_verbosity > 2 or trace_filter("dump_elim") |
| "Not removing variable <name>" | sub_5CCBF0 | Same (for kept entries) |
| "Removing routine <name>" | sub_5CCBF0 | Same |
| "Removing <type>" | sub_5CCBF0 | Same |
| "eliminate_unneeded_il_entries" | sub_5CCBF0 | trace_active (level 3 trace enter/exit) |

Function Map

| Address | Identity | Confidence | Lines | EDG Source |
|---|---|---|---|---|
| sub_610420 | mark_to_keep_in_il | 99% | 892 | il_walk.c:1959 |
| sub_6115E0 | walk_tree_and_set_keep_in_il | 98% | 4649 | il_walk.c |
| sub_617310 | prune_keep_in_il_walk | 99% | 127 | il_walk.c:1885 |
| sub_6111C0 | set_keep_definition_on_type | 95% | 63 | il_walk.c |
| sub_611300 | set_keep_definition_on_type_simple | 92% | 48 | il_walk.c |
| sub_6113F0 | set_keep_definition_on_routine | 95% | 81 | il_walk.c |
| sub_6181E0 | set_keep_definition_on_routine_unconditional | 90% | 69 | il_walk.c |
| sub_6170C0 | process_using_decl_list | 92% | 154 | il_walk.c |
| sub_6175F0 | walk_scope_and_mark_routine_definitions | 90% | 634 | il_walk.c |
| sub_616EE0 | mark_virtual_function_types_to_keep | 85% | 88 | il_walk.c |
| sub_618370 | walk_and_set_keep_in_il_helper | 80% | 119 | il_walk.c |
| sub_618660 | walk_entry_and_set_keep_in_il_routine_scope | 88% | 3728 | il_walk.c |
| sub_5CCBF0 | eliminate_unneeded_il_entries | 100% | 345 | il.c:29598 |
| sub_5CCA40 | clear_instantiation_required_on_unneeded_entities | 100% | 86 | il.c:29450 |
| sub_5CC410 | eliminate_bodies_of_unneeded_functions | 100% | ~200 | il.c:29231 |
| sub_5CC1B0 | eliminate_unneeded_class_definitions | 100% | ~200 | il.c |
| sub_5CC570 | eliminate_unneeded_scope_orphaned_list_entries | 100% | ~200 | il.c:29398 |
| sub_5CB920 | process_members_of_eliminated_class_definition | 100% | ~300 | il.c:29097 |
| sub_5B9EE0 | lookup_canonical_entry | -- | -- | il_walk.c |
| sub_78A380 | clear_instantiation_required | -- | -- | template.c |

Cross-References

IL Display

The IL display subsystem produces a human-readable text dump of the entire Intermediate Language graph. It is compiled from EDG's il_to_str.c (source path /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/il_to_str.c, confirmed by an assertion at line 6175 in form_float_constant). The display code occupies address range 0x5EB290--0x603A00 in the binary (0x18770 bytes, roughly 98 KB), with the main dispatch functions at 0x5EC600--0x5F7FD0 and formatting helpers continuing through 0x6039E0.

Activation is via the il_display CLI flag (flag index 10 in the boolean flag table), which triggers display_il_file after the frontend completes parsing. The output goes to stdout through an indirectable callback mechanism (qword_126F980). When active, every IL entry in every memory region is printed with labeled fields, 25-column-aligned formatting, and scope/address annotations.
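
The indirectable output mechanism can be sketched as a function-pointer slot with an fputs-to-stdout default. The names il_output, emit_capture, display_line, and capture_demo are hypothetical, and the "label: value" layout here is simplified; only the idea of routing all display text through one replaceable callback comes from the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef void (*il_output_fn)(const char *);

/* Default sink: plain fputs to stdout. */
static void emit_stdout(const char *s) { fputs(s, stdout); }

/* Plays the role of the callback slot (qword_126F980 in the binary). */
static il_output_fn il_output = emit_stdout;

/* Alternative sink that appends to a fixed buffer, truncating if full. */
static char capture[128];
static void emit_capture(const char *s) {
    strncat(capture, s, sizeof capture - strlen(capture) - 1);
}

/* All display code writes through the slot, never directly to stdout. */
static void display_line(const char *label, const char *value) {
    il_output(label);
    il_output(": ");
    il_output(value);
    il_output("\n");
}

static const char *capture_demo(void) {
    il_output_fn saved = il_output;
    capture[0] = '\0';
    il_output = emit_capture;        /* redirect */
    display_line("gcc_mode", "TRUE");
    il_output = saved;               /* restore the default sink */
    return capture;
}
```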

Key Facts

| Property | Value |
|---|---|
| Source file | il_to_str.c (EDG 6.6) |
| Address range | 0x5EB290--0x6039E0 |
| Top-level entry point | sub_5F7DF0 (display_il_file), 56 lines |
| Header + file-scope | sub_5F76B0 (display_il_header), 174 lines |
| Main dispatcher | sub_5F4930 (display_il_entry), 1,686 lines |
| Single-entity display | sub_5F7D50 (display_single_entity), 38 lines |
| CLI flag | il_display (index 10, boolean) |
| Output callback | qword_126F980 (function pointer, default sub_5EB290 = fputs(s, stdout)) |
| Display-active flag | byte_126FA16 (set to 1 during display) |
| Scope context flag | dword_126FA30 (1 = file-scope region, 0 = function-scope) |
| Entry kind name table | off_E6DD80 (~84 entries, indexed by entry kind byte) |

Top-Level Control Flow

display_il_file (sub_5F7DF0)
│
│  printf("Display of IL file \"%s\", produced by the compilation of \"%s\"\n",
│         il_file_name, source_file_name)
│
├── display_il_header (sub_5F76B0)
│   │  dword_126FA30 = 1                              // file-scope mode
│   │  puts("\n\nIntermediate language for memory region 1 (file scope):")
│   │  puts("\nil_header:")
│   │  ... 30+ header fields ...
│   │
│   └── walk_file_scope_il(display_il_entry, ...)      // sub_60E4F0
│       └── display_il_entry (sub_5F4930)              // callback per entity
│
└── for region = 2 .. dword_126EC80:
        dword_126FA30 = 0                              // function-scope mode
        // lookup function name from scope table
        printf("\n\nIntermediate language for memory region %ld (function \"%s\"):\n",
               region, function_name)
        walk_routine_scope_il(region, display_il_entry, ...)   // sub_610200
        └── display_il_entry (sub_5F4930)              // callback per entity

Memory region 1 is always file-scope (global declarations, types, templates). Regions 2+ correspond to individual function bodies. The scope table at qword_126EB90 maps each region index to its owning scope entry; the display code checks scope.kind == 17 (sck_function) and extracts the routine name for the banner.

IL Header Fields

display_il_header (sub_5F76B0) prints the translation-unit-level metadata stored in BSS at 0x126EB60--0x126EBF8:

| Field | Type | Notes |
|---|---|---|
| primary_source_file | IL pointer | Source file entry for the main .cu file |
| primary_scope | IL pointer | File-scope scope entry |
| main_routine | IL pointer | main() routine entry, if present |
| compiler_version | string | EDG compiler version string |
| time_of_compilation | string | Build timestamp |
| plain_chars_are_signed | bool | Default char signedness |
| source_language | enum | sl_Cplusplus (0) or sl_C (1), from dword_126EBA8 |
| std_version | integer | C/C++ standard version (e.g., 201703 for C++17), from dword_126EBAC |
| pcc_compatibility_mode | bool | PCC compatibility |
| enum_type_is_integral | bool | Whether enum underlying type is integral |
| default_max_member_alignment | integer | Default structure packing alignment |
| gcc_mode | bool | GCC compatibility mode |
| gpp_mode | bool | G++ compatibility mode |
| gnu_version | integer | GNU compatibility version number |
| short_enums | bool | -fshort-enums behavior |
| default_nocommon | bool | Default -fno-common |
| UCN_identifiers_used | bool | Universal character names in identifiers |
| vla_used | bool | Variable-length arrays present |
| any_templates_seen | bool | Whether any templates were parsed |
| prototype_instantiations_in_il | bool | Template prototype instantiations included |
| il_has_all_prototype_instantiations | bool | All prototypes included (ALL_TEMPLATE_INFO_IN_IL=1) |
| il_has_C_semantics | bool | C-language semantics active |
| nontag_types_used_in_exception_or_rtti | bool | Non-tag types in EH/RTTI |
| seq_number_lookup_entries | integer | Count of source sequence entries |
| target_configuration_index | integer | Target configuration selector |

After printing the header, display_il_header calls walk_file_scope_il (sub_60E4F0) with display_il_entry as the per-entity callback. This iterates every IL entry in file-scope region 1.

The Main Dispatcher: display_il_entry

display_il_entry (sub_5F4930, 1,686 lines) is the central per-entity display function. It receives an entry pointer and an entry kind byte, and dispatches to the appropriate per-kind display function.

Transparent (Inline) Kinds

The first switch handles kinds that are displayed inline by their parent and should not appear as standalone entries. These return immediately without output:

Transparent kinds (early return):
  4   routine_type_supplement      15  (reserved)
  5   routine_type_extra           19  try_supplement
  14  (reserved)                   20  asm_supplement
  27  template_parameter_suppl     34  (reserved)
  37  (reserved)                   38  (reserved)
  46  (reserved)                   47  (reserved)
  48  (reserved)                   75  (reserved)
  76  (reserved)                   77  (reserved)
  78  (reserved)                   81  (reserved)

Entry Header Line

For non-transparent kinds, the dispatcher prints a scope-annotated header:

file-scope type@7f3a4b200100
func-scope variable@7f3a4b300200

The scope prefix comes from dword_126FA30 (1 = "file-scope", 0 = "func-scope"). The kind name is looked up from off_E6DD80[kind_byte]. The address is the raw entry pointer value. For entries in function-scope regions while dword_126FA30 == 1, a warning "**NON FILE SCOPE PTR**" is emitted.
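
A minimal sketch of how such a header line could be assembled. The three kind names are an excerpt of the dispatch table on this page rather than the contents of off_E6DD80, and the "%s %s@%llx" format is an assumption inferred from the two sample lines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Small excerpt of kind names; the real lookup is off_E6DD80[kind_byte]. */
static const char *kind_name(unsigned kind) {
    switch (kind) {
        case 6:  return "type";
        case 7:  return "variable";
        case 11: return "routine";
        default: return "unknown";   /* placeholder for kinds not listed here */
    }
}

/* Builds "<scope prefix> <kind name>@<hex address>". */
static void format_entry_header(char *buf, size_t n, int file_scope,
                                unsigned kind, unsigned long long addr) {
    snprintf(buf, n, "%s %s@%llx",
             file_scope ? "file-scope" : "func-scope",
             kind_name(kind), addr);
}
```

Called with file_scope = 1, kind 6, and address 0x7f3a4b200100, this reproduces the first sample line above.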

Dispatch Table

The second switch dispatches to specialized display functions:

| Kind | Hex | Name | Display function | Lines |
|---|---|---|---|---|
| 1 | 0x01 | source_file_entry | inline in dispatcher | ~40 |
| 2 | 0x02 | constant | sub_5F2720 (display_constant) | 605 |
| 3 | 0x03 | param_type | inline in dispatcher | ~30 |
| 6 | 0x06 | type | sub_5F06B0 (display_type) | 1,033 |
| 7 | 0x07 | variable | sub_5EE500 (display_variable) | 614 |
| 8 | 0x08 | field | inline in dispatcher | ~80 |
| 9 | 0x09 | exception_specification | inline | ~20 |
| 10 | 0x0A | exception_spec_type | inline | ~10 |
| 11 | 0x0B | routine | sub_5EF1A0 (display_routine) | 1,160 |
| 12 | 0x0C | label | inline | ~30 |
| 13 | 0x0D | expr_node | sub_5ECFE0 (display_expr_node) | 534 |
| 16 | 0x10 | switch_case_entry | inline | ~15 |
| 17 | 0x11 | switch_info | inline | ~10 |
| 18 | 0x12 | handler | inline | ~15 |
| 21 | 0x15 | statement | sub_5EC600 (display_statement) | 328 |
| 22 | 0x16 | object_lifetime | inline | ~20 |
| 23 | 0x17 | scope | sub_5F2140 (display_scope) | 177 |
| 28 | 0x1C | namespace | inline | ~20 |
| 29 | 0x1D | using_declaration | inline | ~20 |
| 30 | 0x1E | dynamic_init | sub_5F37F0 (display_dynamic_init) | 248 |
| 31 | 0x1F | local_static_variable_init | inline | ~15 |
| 32 | 0x20 | vla_dimension | inline | ~10 |
| 33 | 0x21 | overriding_virtual_func | inline | ~15 |
| 35 | 0x23 | derivation_path | inline | ~10 |
| 36 | 0x24 | base_class | inline | ~25 |
| 39 | 0x27 | class_info | sub_5F4030 (display_class_supplement) | 366 |
| 41 | 0x29 | constructor_init | inline | ~15 |
| 42 | 0x2A | asm_entry | inline | ~25 |
| 43 | 0x2B | asm_operand | inline | ~15 |
| 44 | 0x2C | asm_clobber | inline | ~10 |
| 50 | 0x32 | source_sequence_entry | inline | ~15 |
| 51 | 0x33 | full_entity_decl_info | inline | ~15 |
| 52 | 0x34 | instantiation_directive | inline | ~10 |
| 53 | 0x35 | src_seq_sublist | inline | ~10 |
| 54 | 0x36 | explicit_instantiation_decl | inline | ~10 |
| 55 | 0x37 | orphaned_entities | inline | ~10 |
| 56 | 0x38 | hidden_name | inline | ~10 |
| 57 | 0x39 | pragma | inline | ~20 |
| 58 | 0x3A | template | inline | ~20 |
| 59 | 0x3B | template_decl | inline | ~15 |
| 60 | 0x3C | requires_clause | inline | ~10 |
| 61 | 0x3D | template_param | inline | ~15 |
| 62 | 0x3E | name_reference | sub_5EBC60 (display_name_reference) | 84 |
| 63 | 0x3F | name_qualifier | inline | ~15 |
| 64 | 0x40 | seq_number_lookup | inline | ~10 |
| 65 | 0x41 | local_expr_node_ref | inline | ~10 |
| 66 | 0x42 | static_assert | inline | ~10 |
| 67 | 0x43 | linkage_spec | inline | ~10 |
| 68 | 0x44 | scope_ref | inline | ~10 |
| 70 | 0x46 | lambda | inline | ~15 |
| 71 | 0x47 | lambda_capture | inline | ~15 |
| 72 | 0x48 | attribute | inline | ~20 |
| 73 | 0x49 | attribute_argument | inline | ~10 |
| 74 | 0x4A | attribute_group | inline | ~10 |
| 79 | 0x4F | template_info | inline | ~15 |
| 80 | 0x50 | subobject_path | inline | ~10 |
| 82 | 0x52 | module_info | inline | ~10 |
| 83 | 0x53 | module_decl | inline | ~10 |

Per-Kind Display Functions

source_file_entry (Kind 1)

Displayed inline in the dispatcher. Fields:

| Field | Type | Notes |
|---|---|---|
| file_name | string | Short file name |
| full_name | string | Full path |
| name_as_written | string | As-written in #include |
| first_seq_number | integer | First source sequence number in this file |
| last_seq_number | integer | Last source sequence number |
| first_line_number | integer | First line number |
| child_files | IL pointer list | Included files |
| is_implicit_include | bool | Implicitly included |
| is_include_file | bool | Is an #included file (not the primary TU) |
| top_level_file | bool | Top-level compilation unit |

source_corresp (Shared Prefix)

All named entities (variable, routine, type, field, label, namespace, template_param) share a source_corresp sub-record, printed by display_source_corresp (sub_5EDF40, 170 lines). This is the first thing displayed for each such entity:

source_corresp:
  name:                    foo
  unmangled_name_or_mangled_encoding: _Z3foov
  decl_position.seq:       42
  decl_position.column:    5
  name_references:         name_reference@7f3a...
  is_class_member:         TRUE
  access:                  public
  parent_scope:            file-scope scope@7f3a...
  enclosing_routine:       NULL
  referenced:              TRUE
  needed:                  TRUE
  name_linkage:            external
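
The column alignment in this sample can be reproduced with a small helper. This is a sketch inferred from the sample output only: the 2-space indent, the 25-column label width, and the single-space fallback for over-long labels are read off the lines above, and format_field is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LABEL_WIDTH 25   /* width of the "label:" column, inferred from the sample */

/* Formats one "  label:   value" line into buf. */
static void format_field(char *buf, size_t n, const char *label, const char *value) {
    char head[96];
    snprintf(head, sizeof head, "%s:", label);
    if (strlen(head) < (size_t)LABEL_WIDTH)
        snprintf(buf, n, "  %-*s%s", LABEL_WIDTH, head, value);  /* pad to column */
    else
        snprintf(buf, n, "  %s %s", head, value);  /* label overflows: one space */
}
```

With this rule, short labels such as decl_position.seq place their value at the same column, while unmangled_name_or_mangled_encoding overflows and is followed by a single space, matching the sample.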

Fields displayed by display_source_corresp:

| Field | Type | Lookup table |
|---|---|---|
| name | string | Direct string |
| unmangled_name_or_mangled_encoding | string | Direct string |
| decl_position | position | seq + column sub-fields |
| name_references | IL pointer | name_reference entry |
| is_class_member | bool | -- |
| access | enum | off_A6F760 (4 entries: public/protected/private/none) |
| parent_scope | IL pointer | Scope entry |
| enclosing_routine | IL pointer | Routine entry |
| referenced | bool | -- |
| needed | bool | -- |
| is_local_to_function | bool | -- |
| parent_via_local_scope_ref | IL pointer | -- |
| name_linkage | enum | off_E6E040 (none/internal/external/C/C++) |
| has_associated_pragma | bool | -- |
| is_decl_after_first_in_comma_list | bool | -- |
| copied_from_secondary_trans_unit | bool | -- |
| same_name_as_external_entity_in_secondary_trans_unit | bool | -- |
| member_of_unknown_base | bool | -- |
| qualified_unknown_base_member | bool | -- |
| marked_as_gnu_extension | bool | -- |
| is_deprecated_or_unavailable | bool | -- |
| externalized | bool | -- |
| maybe_unused | bool | [[maybe_unused]] attribute |
| source_sequence_entry | IL pointer | -- |
| attributes | IL pointer | Attribute list |

type (Kind 6)

display_type (sub_5F06B0, 1,033 lines) handles all 22 type kinds. After calling display_source_corresp, it prints common type fields then switches on the type kind byte at offset +132:

Common type fields:

| Field | Type / lookup table |
|---|---|
| next | IL pointer |
| based_types | Linked list, kind from off_A6F420 (6 entries) |
| size | Integer |
| alignment | Integer |
| incomplete | bool |
| used_in_exception_or_rtti | bool |
| declared_in_function_prototype | bool |
| alignment_set_explicitly | bool |
| variables_are_implicitly_referenced | bool |
| may_alias | bool |
| autonomous_primary_tag_decl | bool |
| is_builtin_va_list | bool |
| is_builtin_va_list_from_cstdarg | bool |
| has_gnu_abi_tag_attribute | bool |
| in_gnu_abi_tag_namespace | bool |
| type_kind | Enum from off_A6FE40 (22 entries) |

Type kind switch (offset +132):

| Kind | Name | Key sub-fields |
|---|---|---|
| 2 | integer | int_kind (via sub_5F9110), explicitly_signed, wchar_t_type, char8_t_type, char16_t_type, char32_t_type, bool_type; for enums: is_scoped_enum, packed, originally_unnamed, is_template_enum, ELF_visibility, base_type, assoc_template |
| 3/4/5 | float/double/ldouble | float_kind (via sub_5F93D0) |
| 6 | pointer | type_pointed_to, is_reference, is_rvalue_reference |
| 7 | function | return_type, param_type_list, assoc_routine, has_ellipsis, prototyped, trailing_return_type, value_returned_by_cctor, does_not_return, result_should_be_used, is_const, explicit_calling_convention, calling_convention (from off_E6CDA0), this_class, qualifiers, ref_qualifiers, prototype_scope, exception_specification |
| 8 | array | element_type, qualifiers, is_static, is_variable_size_array, is_vla, element_count, bound_constant |
| 9/10/11 | class/struct/union | field_list, extra_info (class supplement via sub_5F4030), final, abstract, any_virtual_base_classes, any_virtual_functions, originally_unnamed, is_template_class, is_specialized, is_empty_class, is_packed, max_member_alignment |
| 12 | typeref | typeref_type, template_arg_list, assoc_template, typeref_kind (from off_A6F640, 28 entries), qualifiers, predeclared, has_variably_modified_type, is_nonreal |
| 13 | member pointer | class_of_which_a_member, type |
| 14 | template param | kind (tptk_param/tptk_member/tptk_unknown), is_pack, is_generic_param, is_auto_param, class_type, coordinates |
| 15 | vector | element_type, size_constant, is_boolean_vector, vector_kind |
| 16 | tuple | element_type, tuple_elements |

variable (Kind 7)

display_variable (sub_5EE500, 614 lines) is one of the most field-heavy display functions. After display_source_corresp, it prints:

| Field | Lookup table / Notes |
|---|---|
| next | IL pointer |
| type | IL pointer |
| storage_class | off_A6FE00 (7 entries: none/auto/register/static/extern/mutable/thread_local) |
| declared_storage_class | Same table |
| asm_name or reg | off_A6F480 (53 register kind entries) |
| alignment | Integer |
| ELF_visibility | off_A6F720 (5 entries) |
| init_priority | Integer |
| cleanup_routine | IL pointer |
| container / bindings | Selected by bits at offset +162 |
| section | String (ELF section name) |
| aliased_variable | IL pointer |
| declared_type | IL pointer |
| template_info | IL pointer |

CUDA-specific variable fields:

| Field | Notes |
|---|---|
| shared | `__shared__` memory space |
| constant | `__constant__` memory space |
| device | `__device__` memory space |

Boolean flags (approximately 50 flags spanning bytes 144--208):

is_weak, is_weakref, is_gnu_alias, has_gnu_used_attribute, has_gnu_abi_tag_attribute, is_not_common, is_common, has_internal_linkage_attribute, asm_name_is_valid, used, address_taken, is_parameter, is_parameter_pack, is_pack_element, is_enhanced_for_iterator, initializer_in_class, constant_valued, is_thread_local, extends_lifetime, is_template_param_object, compiler_generated, is_in_class_specialization, is_handler_param, is_this_parameter, referenced_non_locally, modified_within_try_block, is_template_variable, is_prototype_instantiation, is_nonreal, is_specialized, specialized_with_old_syntax, explicit_instantiation, class_explicitly_instantiated, explicit_do_not_instantiate, param_value_has_been_changed, param_used_more_than_once, is_anonymous_parent_object, is_member_constant, is_constexpr, declared_constinit, is_inline, suppress_inline_definition, superseded_external, has_variably_modified_type, is_vla, is_compound_literal, has_explicit_initializer, has_parenthesized_initializer, has_direct_braced_initializer, has_flexible_array_initializer, declared_with_auto_type_specifier, declared_with_decltype_auto, declared_with_class_template_placeholder

routine (Kind 11)

display_routine (sub_5EF1A0, ~1,160 lines) is the single largest per-kind display function. After display_source_corresp:

| Field | Lookup table / Notes |
|---|---|
| next | IL pointer |
| type | IL pointer (function type) |
| function_def_number | Integer |
| memory_region | Integer (region index for function body) |
| storage_class | off_A6FE00 (7 entries) |
| declared_storage_class | Same table |
| special_kind | off_A6FC00 (13 entries: none/constructor/destructor/conversion/operator/lambda_call_operator/...) |
| opname_kind | off_A6FC80 (47 entries) |
| builtin_function_kind | Integer |
| ELF_visibility | off_A6F720 |
| virtual_function_number | Integer |
| constexpr_intrinsic_number | Integer |
| section | String |
| aliased_routine | IL pointer |
| inline_partner | IL pointer |
| ctor_priority / dtor_priority | Integer |
| asm_name | String |
| declared_type | IL pointer |
| generating_using_decl | IL pointer |
| befriending_classes | IL pointer list |
| assoc_template | IL pointer |
| template_arg_list | Via display_template_arg_list |

CUDA-specific routine flags (byte 182 unless noted):

| Flag | Bit | Meaning |
|---|---|---|
| nvvm_intrinsic | bit 4 | NVVM intrinsic function |
| device | bit 5 | `__device__` execution space |
| global | bit 6 | `__global__` execution space |
| host | bit 4 (byte 183) | `__host__` execution space |
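
The bit assignments in this table can be turned into a small decoder. The sketch below takes the two flag bytes as plain arguments; the constant names and describe_exec_space are hypothetical, and only the bit positions come from the table.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum {
    RF182_NVVM_INTRINSIC = 1u << 4,  /* byte 182, bit 4 */
    RF182_DEVICE         = 1u << 5,  /* byte 182, bit 5 */
    RF182_GLOBAL         = 1u << 6,  /* byte 182, bit 6 */
    RF183_HOST           = 1u << 4   /* byte 183, bit 4 */
};

/* Writes a space-separated list of execution-space keywords into buf (n >= 1). */
static void describe_exec_space(char *buf, size_t n,
                                unsigned char byte182, unsigned char byte183) {
    buf[0] = '\0';
    if (byte183 & RF183_HOST)   strncat(buf, "__host__ ",   n - strlen(buf) - 1);
    if (byte182 & RF182_DEVICE) strncat(buf, "__device__ ", n - strlen(buf) - 1);
    if (byte182 & RF182_GLOBAL) strncat(buf, "__global__ ", n - strlen(buf) - 1);
}
```

Note that bit 4 means different things in the two bytes, so the byte index has to travel with the mask.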

C99-specific fields (displayed when dword_126EBA8 == 1 and std_version > 199900):

fp_contract, fenv_access, cx_limited_range -- pragma state values from off_A6F460 (4 entries).

Boolean flags (approximately 70 flags spanning bytes 176--191):

address_taken, is_virtual, overrides_base_member, pure_virtual, final, override, covariant_return_virtual_override, is_inline, is_declared_constexpr, is_constexpr, is_constexpr_intrinsic, compiler_generated, defined, called, is_explicit_constructor, is_explicit_conversion_function, is_trivial_default_constructor, is_trivial_copy_function, is_trivial_destructor, is_initializer_list_ctor, is_delegating_ctor, is_inheriting_ctor, assignment_to_this_done, is_prototype_instantiation, is_template_function, is_specialized, specialized_with_old_syntax, explicit_instantiation, class_explicitly_instantiated, explicit_do_not_instantiate, has_nodiscard_attribute, never_throws, is_in_class_specialization, never_inline, is_pure, is_initialization_routine, is_finalization_routine, is_weak, is_weakref, is_gnu_alias, is_ifunc, has_gnu_used_attribute, has_gnu_abi_tag_attribute, in_gnu_abi_tag_namespace, allocates_memory, no_instrument_function, no_check_memory_usage, always_inline, gnu_c89_inline, implicit_alias, has_internal_linkage_attribute, contains_try_block, contains_local_class_type, superseded_external, defined_in_friend_decl, contains_statement_expression, inline_in_class_definition, is_lambda_body, is_defaulted, is_deleted, contains_local_static_variable, is_raw_literal_operator, is_tls_init_routine, has_deducible_return_type, has_deduced_return_type, contains_generic_lambda, is_coroutine, is_top_level_in_mem_region, friend_defined_in_instantiation, is_ineligible, definition_needed, defined_outside_of_parent, trailing_requires_clause

expr_node (Kind 13)

display_expr_node (sub_5ECFE0, 534 lines) handles 36 expression node kinds. Common expression fields are printed first:

| Field | Notes |
|---|---|
| type | IL pointer (expression type) |
| orig_lvalue_type | IL pointer |
| next | IL pointer |
| is_lvalue | bool |
| is_xvalue | bool |
| result_is_not_used | bool |
| is_pack_expansion | bool |
| is_parenthesized | bool |
| compiler_generated | bool |
| volatile_fetch | bool |
| do_not_interpret | bool |
| type_definition_needed | bool |

Expression kind switch (offset +24):

Kind | Name | Key sub-fields
0 | enk_error | (none)
1 | enk_operation | operation.kind from off_A6F840 (120 operator kinds), operation.type_kind from off_A6FE40 (22 type kinds), 20+ boolean flags for cast semantics, ADL suppression, virtual call properties, evaluation order
2 | enk_constant | Constant value reference
3 | enk_variable | Variable reference
4 | enk_field | Field access
5 | enk_temp_init | Temporary initialization
6 | enk_lambda | Lambda expression
7 | enk_new_delete | is_new, placement_new, aligned_version, array_delete, global_new_or_delete, deducible_type, type, routine, arg, dynamic_init, number_of_elements
8 | enk_throw | Throw expression
9 | enk_condition | Conditional expression
10 | enk_object_lifetime | Object lifetime tracking
11 | enk_typeid | typeid expression
12 | enk_sizeof | sizeof expression
13 | enk_sizeof_pack | sizeof... (pack)
14 | enk_alignof | alignof expression
15 | enk_datasizeof | __datasizeof
16 | enk_address_of_ellipsis | Address of ...
17 | enk_statement | Statement expression
18 | enk_reuse_value | Value reuse
19 | enk_routine | Function reference
20 | enk_type_operand | Type operand
21 | enk_builtin_operation | Built-in op from off_E6C5A0
22 | enk_param_ref | Parameter reference
23 | enk_braced_init_list | Braced initializer
24 | enk_c11_generic | _Generic selection
25 | enk_builtin_choose_expr | __builtin_choose_expr
26 | enk_yield | co_yield
27 | enk_await | co_await
28 | enk_fold | Fold expression
29 | enk_initializer | Initializer
30 | enk_concept_id | Concept ID
31 | enk_requires | requires expression
32 | enk_compound_req | Compound requirement
33 | enk_nested_req | Nested requirement
34 | enk_const_eval_deferred | Consteval deferred
35 | enk_template_name | Template name

Every expression case ends with dump_source_position("position", ...) to record the source location.

statement (Kind 21)

display_statement (sub_5EC600, 328 lines) handles 26 statement kinds. Common fields first:

Field | Notes
position | Source position
next | IL pointer
parent | IL pointer (enclosing scope/block)
attributes | IL pointer
has_associated_pragma | bool
is_initialization_guard | bool
is_lowering_boilerplate | bool
is_fallthrough_statement | bool
is_likely | bool
is_unlikely | bool

Statement kind switch (offset +32):

Kind | Name | Key sub-fields
0 | stmk_expr | Expression statement
1 | stmk_if | if
2 | stmk_constexpr_if | if constexpr
3 | stmk_if_consteval | if consteval (C++23)
4 | stmk_if_not_consteval | if !consteval
5 | stmk_while | while loop
6 | stmk_goto | goto
7 | stmk_label | label
8 | stmk_return | return
9 | stmk_coroutine | Coroutine body (see below)
10 | stmk_coroutine_return | Coroutine return
11 | stmk_block | Block/compound: statements, final_position, assoc_scope, lifetime, end_of_block_reachable, is_statement_expression
12 | stmk_end_test_while | do-while
13 | stmk_for | for loop
14 | stmk_range_based_for | Range-for: iterator, range, begin, end, ne_call_expr, incr_call_expr
15 | stmk_switch_case | switch case
16 | stmk_switch | switch
17 | stmk_init | Initialization
18 | stmk_asm | Inline assembly
19 | stmk_try_block | try block
20 | stmk_decl | Declaration
21 | stmk_set_vla_size | VLA size
22 | stmk_vla_decl | VLA declaration
23 | stmk_assigned_goto | Computed goto
24 | stmk_empty | Empty statement
25 | stmk_stmt_expr_result | Statement expression result

Coroutine statement (case 9) displays the full C++20 coroutine lowering structure:

traits, handle, promise, init_await_resume, this_param_copy,
paramter_copies, final_suspend_label, initial_suspend_call,
final_suspend_call, unhandled_exception_call, get_return_object_call,
new_routine, delete_routine, ...

The field name "paramter_copies" (missing the 'e' in "parameter") is a typo preserved verbatim from the EDG source. This confirms the display strings originate from Edison Design Group's own il_to_str.c -- a reimplementation would spell it correctly.

scope (Kind 23)

display_scope (sub_5F2140, 177 lines) handles 9 scope kinds:

Kind | Name | Extra fields
0 | sck_file | Top-level file scope
1 | sck_func_prototype | Function prototype scope
2 | sck_block | assoc_handler
3 | sck_namespace | assoc_namespace
6 | sck_class_struct_union | assoc_type
8 | sck_template_declaration | Template declaration scope
15 | sck_condition | assoc_statement
16 | sck_enum | assoc_type
17 | sck_function | routine.ptr, parameters, constructor_inits, lifetime_of_local_static_vars, this_param_variable, return_value_variable

Common scope fields: next, parent, kind

Boolean flags: do_not_free_memory_region, is_constexpr_routine, is_stmt_expr_block, is_placeholder_scope, needed_walk_done

Child entity lists: assoc_block, lifetime, constants, types, variables, nonstatic_variables, labels, routines, asm_entries, scopes

Conditional lists (controlled by bitmask tests on scope kind):

// Bitmask 0x20044 = bits 2+6+17 = sck_block + sck_class_struct_union + sck_function
// Bitmask 0x9     = bits 0+3    = sck_file + sck_namespace

if ((1LL << kind) & 0x20044) {
    // display: namespaces, using_declarations, using_directives
}
if ((1LL << kind) & 0x9) {
    // display: namespaces, using_declarations, using_directives
}
// Also: dynamic_inits, local_static_variable_inits (function/block scopes)
//       expr_node_refs, scope_refs, vla_dimensions (function scope + C mode)
//       pragmas, hidden_names, templates, source_sequence_list, src_seq_sublist_list
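The set-membership test used above can be reproduced in isolation. The following is a minimal sketch, assuming the scope kind numbers from the display_scope table; the enum constants, macro names, and the helper scope_kind_in_mask are hypothetical and do not come from the binary.

```c
#include <assert.h>
#include <stdbool.h>

/* Scope kind numbers as listed in the display_scope table. */
enum {
    SCK_FILE = 0,
    SCK_BLOCK = 2,
    SCK_NAMESPACE = 3,
    SCK_CLASS_STRUCT_UNION = 6,
    SCK_FUNCTION = 17
};

/* The two masks, rebuilt from their member kinds (names are hypothetical). */
#define MASK_BLOCK_CLASS_FUNC \
    ((1ULL << SCK_BLOCK) | (1ULL << SCK_CLASS_STRUCT_UNION) | (1ULL << SCK_FUNCTION))
#define MASK_FILE_NAMESPACE \
    ((1ULL << SCK_FILE) | (1ULL << SCK_NAMESPACE))

/* Hypothetical helper: true when `kind` belongs to the set encoded by `mask`. */
bool scope_kind_in_mask(unsigned kind, unsigned long long mask)
{
    assert(kind < 64);   /* shifting by 64 or more would be undefined */
    return ((1ULL << kind) & mask) != 0;
}
```

Rebuilding the masks from the kind numbers gives 0x20044 and 0x9, which matches the decomposition in the comments above.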

constant (Kind 2)

display_constant (sub_5F2720, 605 lines) handles 16 constant kinds. After display_source_corresp, common fields include next, type, orig_type, expr, and approximately 25 boolean flags.

Constant kind switch (offset +148):

Kind | Name | Key sub-fields
0 | ck_error | (none)
1 | ck_integer | Integer value via sub_602F20
2 | ck_string | character_kind (char/wchar_t/char8_t/char16_t/char32_t), length, literal_kind (see below)
3 | ck_float | Float value via sub_5FCAF0
4 | ck_complex | Complex value
5 | ck_imaginary | Imaginary value
6 | ck_address | Sub-kind: abk_routine/variable/constant/temporary/uuidof/typeid/label; subobject_path, offset
7 | ck_ptr_to_member | casting_base_class, name_reference, cast_to_base, is_function_ptr
8 | ck_label_difference | from_address, to_address
9 | ck_dynamic_init | dynamic_init pointer
10 | ck_aggregate | first_constant, last_constant, has_dynamic_init_component
11 | ck_init_repeat | constant, count, multidimensional_aggr_tail_not_repeated
12 | ck_template_param | Sub-kinds: tpck_param/expression/member/unknown_function/address/sizeof/datasizeof/alignof/uuidof/typeid/noexcept/template_ref/integer_pack/destructor
13 | ck_designator | is_field_designator, is_generic, uses_direct_init_syntax
14 | ck_void | (none)
15 | ck_reflection | entity, local_scope_number

dynamic_init (Kind 30)

display_dynamic_init (sub_5F37F0, 248 lines) handles 9 dynamic initialization kinds:

Kind | Name | Key sub-fields
0 | dik_none | (none)
1 | dik_zero | Zero-initialization
2 | dik_constant | Constant initialization
3 | dik_expression | Expression initialization
4 | dik_class_result_via_ctor | Class result through constructor
5 | dik_constructor | routine, args, is_copy_constructor_with_implied_source, is_implicit_copy_for_copy_initialization, value_initialization
6 | dik_nonconstant_aggregate | Non-constant aggregate
7 | dik_bitwise_copy | source
8 | dik_lambda | lambda, constant, non_constant

Common fields: next, variable, destructor, lifetime, next_in_destruction_list, unordered, init_expr_lifetime, and approximately 20 boolean flags including static_temp, follows_an_exec_statement, inside_conditional_expression, has_temporary_lifetime, is_constructor_init, is_freeing_of_storage_on_exception, overlaps_temps_in_inner_lifetime, is_reused_value, is_creation_of_initializer_list_object, master_entry.

class_info (Kind 39)

display_class_type_supplement (sub_5F4030, 366 lines) is not dispatched directly from the kind table but called by display_type when the type kind is class/struct/union (kinds 9/10/11). It prints the class supplement record:

Field | Notes
base_classes | IL pointer list
direct_base_classes | IL pointer list
preorder_base_classes | IL pointer list
primary_base_class | IL pointer
size_without_virtual_base_classes | Integer
alignment_without_virtual_base_classes | Integer
highest_virtual_function_number | Integer
virtual_function_info_offset | Integer
virtual_function_info_base_class | IL pointer
ELF_visibility | off_A6F720
is_lambda_closure_class | bool
is_generic_lambda_closure_class | bool
has_lambda_conversion_function | bool
is_initializer_list | bool
has_initializer_list_ctor | bool
has_anonymous_union_member | bool
anonymous_union_kind | enum (auk_none/auk_variable/auk_field)
is_va_list_tag | bool
has_nodiscard_attribute | bool
has_field_initializer | bool
removed_from_il | bool
contains_error | bool
befriending_classes | Linked list (checks kind bytes 9/10/11 for class/struct/union)
friend_routines | IL pointer list
friend_classes | IL pointer list
assoc_scope | IL pointer
assoc_template | IL pointer
template_arg_list | Via display_template_arg_list
lambda_parent.variable / .field / .routine | Selected by bits in byte 86
proxy_of_type | IL pointer

Formatting Infrastructure

25-Column Field Labels

dump_field_label (sub_5EB2A0, 22 lines) is the universal field label formatter. It prints "field_name:" then pads with spaces to column 25. If the label plus colon exceeds 24 characters, it prints a newline first to avoid misalignment:

storage_class:           static
alignment:               16
is_constexpr:            TRUE

This produces the consistent columnar output visible in all IL dumps.
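The padding rule can be sketched as follows. This is an illustrative stand-in, not the decompiled routine: format_field_label and VALUE_COLUMN are hypothetical names, the output goes to a buffer instead of the output callback, and the exact behaviour for over-long labels (label, newline, then a full 25-space indent) is an assumption about what "prints a newline first" means.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define VALUE_COLUMN 25   /* values start at 0-based column 25 */

/* Hypothetical stand-in for dump_field_label: writes "name:" plus padding
 * into `out` and returns the number of characters produced. */
int format_field_label(char *out, size_t out_size, const char *name)
{
    assert(out != NULL && name != NULL);
    int label_len = (int)strlen(name) + 1;   /* +1 for the colon */

    if (label_len > VALUE_COLUMN - 1) {
        /* Over-long label: break the line, then indent to the value column.
         * The ordering here is an assumption, not confirmed by the binary. */
        return snprintf(out, out_size, "%s:\n%*s", name, VALUE_COLUMN, "");
    }
    /* Normal case: pad with spaces up to the value column. */
    return snprintf(out, out_size, "%s:%*s", name, VALUE_COLUMN - label_len, "");
}
```

With the name "storage_class" the sketch yields the 14-character label followed by 11 spaces, so the value lands in the same column as in the sample output above.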

Boolean Fields

dump_field_bool (sub_5EB450, 25 lines) prints a label and "TRUE" or "FALSE":

is_virtual:              TRUE
pure_virtual:            FALSE

Source Position Fields

dump_source_position (sub_5EB4E0, 82 lines) prints position as two sub-fields when the position is non-zero (seq != 0 or column != 0):

position.seq:            42
position.column:         5

The function reads a 32-bit sequence number at *position and a 16-bit column at *(position + 4).
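A minimal sketch of that read, assuming only the two offsets stated above. The type source_pos_t and both function names are hypothetical; memcpy is used so the sketch does not depend on the alignment of the raw bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of the six bytes that dump_source_position inspects. */
typedef struct {
    uint32_t seq;      /* 32-bit sequence number at offset +0 */
    uint16_t column;   /* 16-bit column at offset +4 */
} source_pos_t;

source_pos_t read_source_position(const unsigned char *p)
{
    source_pos_t pos;
    assert(p != NULL);
    memcpy(&pos.seq, p, sizeof pos.seq);
    memcpy(&pos.column, p + 4, sizeof pos.column);
    return pos;
}

/* The two sub-fields are printed only when this returns non-zero. */
int position_is_set(source_pos_t pos)
{
    return pos.seq != 0 || pos.column != 0;
}
```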

IL Pointer Annotations

dump_il_entity_pointer (sub_5EB8B0, 99 lines) is the most comprehensive pointer formatter. For each IL entity pointer, it prints:

  1. Scope prefix: "file-scope" or "func-scope" (from bit 0 of the entry prefix byte at entry_ptr - 8)
  2. Kind name: from off_E6DD80[kind_byte]
  3. Hex address: @%lx
  4. Entity name (kind-dependent):
    • Kinds with name at offset +8 (bitmask 0x2000000010001984): prints the name string
    • Kind 12 (label): prints "label " prefix + name
    • Kind 6 (type): calls qualified name formatter
    • Kind 2 (constant): calls type display
    • Kind 0x40 (seq_number_lookup): prints qualified name from offset +0
    • Kind with bit 36 set: prints qualified name from offset +40, plus "in" context from +56

primary_source_file:     file-scope source_file_entry@7f3a4b100020 "test.cu"
main_routine:            file-scope routine@7f3a4b200100 "main"

The variant dump_il_string_pointer (sub_5EB670) prints the same format but includes the string value from the pointed-to entry. A scope mismatch (e.g., function-scope pointer found during file-scope display) triggers a "**NON FILE SCOPE PTR**" warning.

Entity List Display

display_entity_list (sub_5EC450, 87 lines) walks a linked list of entity pointers and prints each with scope/kind/address annotations:

entities:                file-scope variable@7f3a... "x"
                         func-scope variable@7f3a... "y"

It follows the next link at offset 0 of each list node until NULL.
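The traversal can be sketched as below. Only the next link at offset 0 and the NULL terminator come from the description; the entity payload slot, the visitor callback type, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node: the `next` link sits at offset 0. */
typedef struct list_node {
    struct list_node *next;   /* offset 0, per the description */
    const void *entity;       /* assumed payload slot */
} list_node;

_Static_assert(offsetof(list_node, next) == 0, "next link must be at offset 0");

typedef void (*entity_visitor)(const void *entity, void *ctx);

/* Visits every node until the NULL terminator and returns the node count. */
size_t walk_entity_list(const list_node *head, entity_visitor visit, void *ctx)
{
    size_t count = 0;
    assert(visit != NULL);
    for (const list_node *n = head; n != NULL; n = n->next) {
        visit(n->entity, ctx);
        ++count;
    }
    return count;
}

/* Example visitor: counts the entities it is handed. */
void count_visitor(const void *entity, void *ctx)
{
    (void)entity;
    ++*(size_t *)ctx;
}
```

In the real routine the per-node action is the scope/kind/address annotation; the callback here only marks where that work would happen.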

String Literal Display

dump_string_value (sub_5EB300, 41 lines) prints string values with proper escape handling:

  • NULL pointers print "NULL"
  • Non-printable characters are printed as octal escapes (\OOO)
  • Backslash and double-quote are backslash-escaped (\\, \")
  • The octal mask width is controlled by dword_126E49C (CHAR_BIT equivalent, typically 8)

file_name:               "test.cu"
full_name:               "/home/user/project/test.cu"
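The escaping rules listed above can be illustrated with a small buffer-based sketch. It assumes 8-bit characters (three octal digits per escape), that the value is wrapped in double quotes as in the sample output, and that a NULL pointer is rendered as the bare word NULL; escape_string is a hypothetical name, and input that does not fit is silently truncated.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for dump_string_value that writes into a buffer. */
void escape_string(char *out, size_t out_size, const char *s)
{
    size_t n = 0;
    assert(out != NULL && out_size >= 5);

    if (s == NULL) {                         /* NULL pointers print NULL */
        snprintf(out, out_size, "NULL");
        return;
    }
    out[n++] = '"';
    /* Keep 6 bytes spare: up to 4 for one escape, plus closing quote and NUL. */
    for (; *s != '\0' && n + 6 < out_size; ++s) {
        unsigned char c = (unsigned char)*s;
        if (c == '\\' || c == '"') {         /* backslash-escape these two */
            out[n++] = '\\';
            out[n++] = (char)c;
        } else if (isprint(c)) {
            out[n++] = (char)c;
        } else {                             /* non-printable: \OOO */
            n += (size_t)snprintf(out + n, out_size - n, "\\%03o", (unsigned)c);
        }
    }
    out[n++] = '"';
    out[n] = '\0';
}
```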

Float Constant Formatting

form_float_constant (sub_5F7FD0, 302 lines) handles float-to-string conversion with EDG-specific formatting. An assertion at line 6175 guards against buffer overflow (63-byte limit).

Float kind suffixes:

Kind | Suffix | Type
0 | (none) | double
2 | f/F | float
3 | f32x | extended float32
5 | f64x | extended float64
6 | l/L | long double
7 | w | float128/wide
8 | q | quad
9 | bf16 | bfloat16
10 | f16 | float16
11 | f32 | float32
12 | f64 | float64
13 | f128 | float128

Special value handling:

  • NaN: __builtin_nanf(""), __builtin_nan(""), etc. (when compiler version > 30299)
  • Infinity: __builtin_huge_valf() or (__extension__ 0x1.0p<exp>f)
  • Division form: (f/0.0f) or (f/(0,0.0f)) (C++ vs C modes, selected by dword_126E1D8/dword_126E1E8)
  • User-defined literals: (funcname("string_value")) form

Data Tables Referenced

The display subsystem relies on 23 enum-to-string lookup tables. The off_A6F... tables sit in the .rodata segment; the off_E6... addresses fall inside the .data range given in the segment layout:

Address | Name | Entries | Used by
off_A6F000 | attr_arg_kind_names | 6 | Attribute argument display
off_A6F040 | attr_location_names | 24 | Attribute display
off_A6F100 | attr_family_names | 5 | Attribute display
off_A6F140 | attr_kind_names | 86 | Attribute display
off_A6F3F0 | class_kind_labels | 3 | befriending_classes display
off_A6F420 | based_type_kind_names | 6 | display_type based_types
off_A6F460 | pragma_state_names | 4 | fp_contract/fenv_access/cx_limited_range
off_A6F480 | register_kind_names | 53 | display_variable reg field
off_A6F640 | typeref_kind_names | 28 | display_type typeref
off_A6F720 | elf_visibility_kind_names | 5 | ELF visibility (all entity types)
off_A6F760 | access_specifier_names | 4 | public/protected/private/none
off_A6F840 | expr_operator_kind_names | 120 | display_expr_node operations
off_A6FC00 | special_function_kind_names | 13 | display_routine special_kind
off_A6FC80 | operator_name_kind_names | 47 | display_routine opname_kind
off_A6FE00 | storage_class_names | 7 | Storage class (variable + routine)
off_A6FE40 | type_kind_names | 22 | Type kind (all type displays)
off_E6C5A0 | builtin_operation_names | varies | display_expr_node builtins
off_E6CDA0 | calling_convention_names | varies | display_type calling conventions
off_E6CDE0 | pragma_kind_names | varies | Pragma display
off_E6CF40 | asm_clobber_reg_names | varies | Asm clobber display
off_E6D240 | token_kind_names | varies | Fold expression / attribute_arg tokens
off_E6DD80 | il_entry_kind_names | ~84 | All display functions (entry kind)
off_E6E040 | linkage_kind_names | varies | Name linkage (source_corresp)

All tables use the same bounds-checking pattern:

const char *name = "**BAD STORAGE CLASS**";
if ((unsigned char)value <= 6u)
    name = storage_class_names[value];
puts(name);

Out-of-range values produce "**BAD <KIND>**" sentinel strings, which serve as diagnostic markers for corrupted IL.

Global State

Address | Name | Type | Purpose
dword_126FA30 | is_file_scope_region | int | 1 during file-scope display, 0 during function-scope
qword_126F980 | output_callbacks | function ptr | Output function (default: sub_5EB290 = fputs(s, stdout))
byte_126FA16 | display_active | byte | Set to 1 during display, prevents re-entrant calls
byte_126FA11 | pcc_compat_shadow | byte | Shadow of PCC compatibility mode during display
dword_126EBA8 | source_language | int | 0 = C++, 1 = C
dword_126EBAC | std_version | int | C/C++ standard version number
dword_126EC80 | total_region_count | int | Number of memory regions (1 = file scope only)
qword_126EC88 | region_table | pointer array | Region index to memory block mapping
qword_126EB90 | scope_table | pointer array | Region index to scope entry mapping
qword_126EEE0 | source_file_name | string ptr | Name of the source file being compiled

Helper Functions (0x5F8000--0x6039E0)

The display subsystem includes approximately 50 additional helper functions in the address range beyond the main dispatchers:

Address | Lines | Identity | Purpose
sub_5F85E0 | 78 | display_bool_field | Boolean TRUE/FALSE output
sub_5F8760 | 97 | display_flags_word | Flags word display
sub_5F8910 | 88 | display_type_qualifiers | const/volatile/restrict qualifier flags
sub_5F8A80 | 49 | display_storage_class | Storage class enum
sub_5F8BD0 | 139 | display_access_specifier | Access with indentation
sub_5F8DF0 | 103 | display_linkage_kind | Linkage kind enum
sub_5F9040 | 28 | init_output_context | Initialize display callback state
sub_5F9110 | 149 | display_int_type_kind | Integer type kind name
sub_5F93D0 | 70 | display_float_type_kind | Float type kind name
sub_5F9500 | 70 | display_int_type_size | Integer type size name
sub_5F9650 | 99 | display_qualifier_flags | Full qualifier flags
sub_5F9820 | 18 | display_ref_qualifier | & or &&
sub_5F9860 | 91 | display_calling_convention | Calling convention from off_E6CDA0
sub_5F99A0 | 115 | display_attribute_target | Attribute target kind
sub_5F9BC0 | 20 | display_asm_keyword | "asm" or "volatile"
sub_5F9C10 | 26 | display_elaborated_type | Elaborated type specifier
sub_5F9CA0 | 50 | display_struct_layout | Structure layout padding mode
sub_5F9D80 | 89 | display_member_alignment | Member alignment field
sub_5F9F70 | 57 | display_template_kind | Template kind name
sub_5FA0D0 | 283 | display_template_arg_list | Full template argument list
sub_5FA660 | 127 | display_constraint_expr | Constraint expression (C++20)
sub_5FA8F0 | 118 | display_deduction_guide | Deduction guide info
sub_5FAB70 | 333 | display_capture_list | Lambda capture list
sub_5FB270 | 556 | display_expr_operator_name | Expression operator name (120 kinds)
sub_5FBCD0 | 571 | display_expr_details | Operator-specific expression details
sub_5FCAF0 | 1,319 | display_float_constant | Float/complex/imaginary formatting
sub_5FE7C0 | 55 | display_expr_flag | Expression flag display
sub_5FE8B0 | 1,659 | display_expr_operator | Expression operator details (2nd largest)
sub_600740 | 72 | display_for_range | Range-based for details
sub_600870 | 171 | display_coroutine_info | Coroutine info (C++20)
sub_600BF0 | 19 | display_designated_init | Designated initializer
sub_600C50 | 107 | display_attribute_entry | Attribute entry
sub_600E00 | 55 | display_asm_operand | Asm operand display
sub_600EF0 | 76 | display_asm_statement | Asm statement details
sub_600FF0 | 29 | display_gcc_builtin_kind | GCC built-in kind
sub_601070 | 87 | display_pragma_info | Pragma info
sub_6011F0 | 155 | display_declspec_attribute | __declspec attribute
sub_601460 | 92 | display_thread_local | Thread-local info
sub_6015A0 | 73 | display_module_info | Module info (C++20)
sub_6016F0 | 197 | display_concept_requires | Concept/requires expression
sub_601B10 | 48 | display_pack_expansion | Pack expansion info
sub_601BE0 | 50 | display_structured_binding | Structured binding (C++17)
sub_601CB0 | 562 | display_additional_expr | Additional expression info
sub_6027D0 | 144 | display_deduced_class | Deduced class info
sub_6029B0 | 190 | display_decl_sequence | Declaration sequence entry
sub_602DC0 | 74 | display_enum_underlying | Enum underlying type
sub_602F20 | 306 | display_integer_constant | Integer constant formatting
sub_603670 | 134 | display_vendor_attribute | Vendor attribute details
sub_6038F0 | 26 | display_cleanup_handler | Cleanup handler
sub_6039E0 | 78 | display_sequence_entry | Last function in il_to_str region

The "paramter_copies" Typo

The coroutine statement display (case 9 in display_statement) prints the field label "paramter_copies", which is missing the first 'e' of "parameter." This typo is present in the compiled binary's string table and originates from Edison Design Group's source code. It is strong provenance evidence that cudafe++ links genuine EDG il_to_str.c object code, since a clean-room reimplementation would be unlikely to reproduce this exact spelling error.

Complete Call Graph

display_il_file (sub_5F7DF0) ─── TOP LEVEL
├── display_il_header (sub_5F76B0)
│   ├── init_output_context (sub_5F9040)
│   ├── dump_il_entity_pointer (sub_5EB8B0) ×30+ for header fields
│   ├── dump_field_bool (sub_5EB450) ×15+ for header booleans
│   ├── dump_string (sub_5EB790)
│   └── walk_file_scope_il (sub_60E4F0)
│       └── display_il_entry (sub_5F4930) ─── callback per entity
│
└── [loop over regions 2..N]
    └── walk_routine_scope_il (sub_610200)
        └── display_il_entry (sub_5F4930) ─── callback per entity

display_il_entry (sub_5F4930) ─── MAIN DISPATCHER
├── display_source_corresp (sub_5EDF40) ─── shared by named entities
├── display_statement (sub_5EC600) ─── case 0x15
│   ├── display_coroutine_info (sub_600870)
│   └── display_for_range (sub_600740)
├── display_expr_node (sub_5ECFE0) ─── case 0x0D
│   ├── display_expr_operator (sub_5FE8B0)
│   ├── display_expr_operator_name (sub_5FB270)
│   └── display_expr_details (sub_5FBCD0)
├── display_variable (sub_5EE500) ─── case 0x07
│   └── display_init_kind (sub_5EBB50)
├── display_routine (sub_5EF1A0) ─── case 0x0B
│   └── display_template_arg_list (sub_5EBF60 / sub_5FA0D0)
├── display_type (sub_5F06B0) ─── case 0x06
│   ├── display_class_type_supplement (sub_5F4030)
│   ├── display_int_type_kind (sub_5F9110)
│   └── display_float_type_kind (sub_5F93D0)
├── display_scope (sub_5F2140) ─── case 0x17
├── display_constant (sub_5F2720) ─── case 0x02
│   ├── display_integer_constant (sub_602F20)
│   └── display_float_constant (sub_5FCAF0)
├── display_dynamic_init (sub_5F37F0) ─── case 0x1E
├── display_name_reference (sub_5EBC60) ─── case 0x3E
└── display_entity_list (sub_5EC450) ─── multiple cases

display_single_entity (sub_5F7D50) ─── TARGETED DISPLAY
├── entity_lookup (sub_73D400)
├── resolve_entity (sub_7377D0)
├── get_entity_kind (sub_5C64C0)
├── init_output_context (sub_5F9040)
└── display_il_entry (sub_5F4930)

Relationship to Other Subsystems

The IL display subsystem is read-only: it never modifies the IL graph. It shares the same entry walker functions used by the IL Tree Walking framework (walk_file_scope_il = sub_60E4F0, walk_routine_scope_il = sub_610200) and the Keep-in-IL mark phase, but passes display_il_entry as the callback instead of a transformation function.

The IL Allocation subsystem provides dump_il_table_stats (sub_5E99D0), which dumps allocation counters rather than IL content -- a complementary diagnostic activated separately.

The field offsets printed by the display functions serve as ground truth for the IL Overview entry kind table and the Entity Node Layout documentation.

IL Comparison & Deep Copy

The IL comparison and deep copy engines are two tightly coupled subsystems in EDG's il.c that serve template instantiation, constant sharing, and overload resolution. The comparison engine determines structural equivalence between two IL expression trees or constant nodes -- needed when the compiler must decide whether two template arguments are "the same" or whether a constant has already been allocated. The deep copy engine clones expression trees while optionally replacing template parameters with their actual arguments -- the core mechanism behind template instantiation. Both subsystems are recursive tree walkers dispatched by node-kind switches, and both operate on the same IL node layout described in IL Overview.

These two engines share the address range 0x5D0750--0x5DFAD0 in the binary (roughly 37KB of compiled code). The range is not contiguous: the comparison engine occupies 0x5D0750--0x5D2160, constant sharing infrastructure sits at 0x5D2170--0x5D2D80, the expression copy engine fills 0x5D2DE0--0x5D5550, and, after a gap, the template parameter substitution dispatcher extends from 0x5DC000 to 0x5DFAD0.

Key Facts

Property | Value
Source file | il.c (EDG 6.6)
Assert path | /dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/il.c
Comparison engine | sub_5D0750 (compare_expressions), 588 lines
Constant comparison | sub_5D1350 (compare_constants), 525 lines
Dynamic init comparison | sub_5D1FE0 (compare_dynamic_inits), ~80 lines
Constant sharing allocator | sub_5D2390 (alloc_shareable_constant), ~200 lines
Expression tree copier | sub_5D2F90 (i_copy_expr_tree), 424 lines
Constant deep copier | sub_5D3B90 (i_copy_constant_full), 305 lines
Template substitution dispatcher | sub_5DC000 (copy_template_param_expr), 1416 lines
Template constant dispatcher | sub_5DE290 (copy_template_param_con), 819 lines
Constant sharing hash buckets | 2039
Recursion depth guard | dword_126F1D0 (incremented/decremented around compare_expressions)

Part 1: The Comparison Engine

Why It Exists

Three front-end subsystems need structural equality testing on IL trees:

  1. Template argument deduction. When the compiler deduces template arguments from a function call, it must compare the deduced value against a previously deduced value for the same parameter. Two independently constructed expression trees representing sizeof(int) must compare as equal even though they are distinct heap allocations.

  2. Constant sharing. Identical constants across the translation unit are deduplicated into a single canonical node in file-scope memory. The comparison engine is the hash table's equality predicate -- after two constants hash to the same bucket, compare_constants determines whether they are structurally identical.

  3. Overload resolution. When the compiler checks whether two function template specializations have equivalent signatures, it compares their template argument expressions for equivalence.

compare_expressions (sub_5D0750)

This is the main entry point. It takes two expression-node pointers and a flags word, and returns 1 (match) or 0 (mismatch). It uses a 36-case switch on the expression-node kind byte (offset +24 in the node layout).

compare_expressions(expr_a, expr_b, flags) -> bool:

    if expr_a == expr_b:
        return TRUE                          // pointer identity short-circuit

    if expr_a->kind != expr_b->kind:
        return FALSE                         // different node types never match

    recursion_depth++                        // dword_126F1D0

    switch expr_a->kind:

        case 0 (null):
            result = FALSE                   // two null nodes are never "equal"

        case 1 (operation):
            if expr_a->op_code != expr_b->op_code:
                result = FALSE
            else:
                // compare each operand in the linked list pairwise
                result = compare_operand_lists(expr_a->operands, expr_b->operands, flags)
                if result:
                    result = equiv_types(expr_a->result_type, expr_b->result_type)

        case 2 (constant reference):
            result = compare_constants(expr_a->constant, expr_b->constant, flags)

        case 3 (entity reference):
            // first try pointer equality on the referenced entity
            if expr_a->entity == expr_b->entity:
                result = TRUE
            elif sharing_enabled and same_sharing_symbol(expr_a, expr_b):
                result = TRUE
            else:
                // deep entity comparison via equiv_types + compare_template_variables
                result = equiv_types(expr_a->entity->type, expr_b->entity->type)

        case 4, 19 (type reference):
            result = (expr_a->type_ptr == expr_b->type_ptr)
                  or equiv_types(expr_a->type_ptr, expr_b->type_ptr)

        case 5, 18 (dynamic init):
            result = compare_dynamic_inits(expr_a->init, expr_b->init, flags)

        case 6 (source position):
            result = (expr_a->offset == expr_b->offset)

        case 7 (full expression info):
            result = compare_flags(expr_a, expr_b)
                 and equiv_types(...)
                 and compare_expressions(expr_a->sub_expr, expr_b->sub_expr, flags)

        case 8 (template arguments):
            // element-by-element comparison of arg lists
            result = compare_template_arg_lists(expr_a->args, expr_b->args, flags)

        case 10, 33 (sub-expression wrapper):
            result = compare_expressions(expr_a->inner, expr_b->inner, flags)

        case 11, 32 (unary with boolean):
            result = (expr_a->bool_field == expr_b->bool_field)
                 and compare_expressions(expr_a->inner, expr_b->inner, flags)

        case 12, 14, 15 (typed value):
            result = (expr_a->value_byte == expr_b->value_byte)
                 and compare_type_or_value(...)

        case 13 (two-byte key):
            result = (expr_a->key_word == expr_b->key_word)

        case 16 (always-equal sentinel):
            result = TRUE

        case 17, 22, 35 (opaque pointer):
            result = (expr_a->ptr == expr_b->ptr)

        case 20 (pointer with fallback):
            result = (expr_a->ptr == expr_b->ptr)
                  or deep_compare_via_sub_7B2260(...)

        case 21 (keyed sub-expression):
            result = (expr_a->key == expr_b->key)
                 and compare_expressions(expr_a->inner, expr_b->inner, flags)

        case 23 (simple sub-expression):
            result = compare_expressions(expr_a->inner, expr_b->inner, flags)

        case 24 (nested expression pair):
            result = compare_pair(expr_a, expr_b, flags)

        case 25 (lambda/closure):
            result = chase_closure_ptrs_and_compare(...)

        case 28 (attributed expression):
            result = (expr_a->attr_flags == expr_b->attr_flags)
                 and compare_expressions(expr_a->inner, expr_b->inner, flags)

        case 30 (template specialization):
            if expr_a->hash != expr_b->hash:
                result = FALSE
            else:
                result = compare_template_specializations(expr_a, expr_b)

        case 31 (function template args):
            result = compare_each_arg_type(expr_a->args, expr_b->args)

        default:
            internal_error("compare_expressions: bad expr kind")

    recursion_depth--
    return result

Flags interpretation. The third argument flags is a bitmask:

Bit | Mask | Meaning
0 | 0x01 | Strict mode -- entity references must be pointer-identical, not just structurally equivalent
1 | 0x02 | Check constraints -- compare template constraints alongside types
2 | 0x04 | Allow specialization -- used by the equivalence wrapper (sub_5D1320) when comparing for specialization matching

Recursion depth guard. The global dword_126F1D0 is incremented on entry and decremented on exit. This counter is not used for depth limiting -- it exists so that diagnostic routines (guarded by dword_126EFC8) can print indented traces via sub_5C4B70 (dump_expr_tree).

compare_constants (sub_5D1350)

Constants are the most structurally complex IL nodes. A single constant node is 184 bytes and carries a constant_kind byte at offset +148 that selects among 16 primary kinds, some of which contain nested sub-kinds. The comparison function uses an outer switch on constant_kind and inner switches for aggregate and template-parameter sub-kinds.

compare_constants(const_a, const_b, flags) -> bool:

    if const_a == const_b:
        return TRUE

    if const_a->constant_kind != const_b->constant_kind:
        return FALSE

    switch const_a->constant_kind:

        case 0, 14 (trivial kinds):
            return TRUE

        case 1 (integer):
            return compare_integer_values(const_a->value, const_b->value)
               and (const_a->flags == const_b->flags)

        case 2 (string literal):
            return memcmp(const_a->bytes, const_b->bytes, const_a->length) == 0

        case 3, 5 (float):
            return compare_float_value(const_a->value, const_b->value)

        case 4 (complex):
            return compare_float_value(const_a->real, const_b->real)
               and compare_float_value(const_a->imag, const_b->imag)

        case 6 (address constant):
            // nested switch on address_kind (offset +152), 6 sub-kinds:
            switch const_a->address_kind:
                case 0, 1: pointer equality or compare_entities(...)
                case 2:    recursive type comparison
                case 3, 6: pointer equality at offset +160
                case 5:    type comparison via deep_compare
            // uses while(2) loop for manual tail-call optimization
            // on case 2 and case 13

        case 7 (template argument):
            return compare_template_arg(...)

        case 8 (pair of constants):
            return compare_constants(const_a->first, const_b->first, flags)
               and compare_constants(const_a->second, const_b->second, flags)

        case 9 (dynamic init):
            return compare_dynamic_inits(const_a->init, const_b->init, flags)

        case 10 (aggregate):
            // walk linked lists of sub-constants in lockstep
            a_elem = const_a->first_element
            b_elem = const_b->first_element
            while a_elem and b_elem:
                if not compare_constants(a_elem, b_elem, flags): return FALSE
                a_elem = a_elem->next; b_elem = b_elem->next
            return (a_elem == NULL and b_elem == NULL)

        case 11 (constant + scope):
            return compare_constants(const_a->sub, const_b->sub, flags)
               and (const_a->scope_byte == const_b->scope_byte)

        case 12 (template parameter constant):
            // deeply nested -- 14 sub-kinds at offset +152:
            switch const_a->template_param_kind:
                case 0:  pack parameter comparison
                case 1:  compare_expressions on embedded expr
                case 2:  compare_types via sub_5B3080
                case 3:  compare_types + type + flags
                case 4, 12: recursive compare_constants
                case 5-10: type equality + sub_5BFB80 comparison
                case 11: type + template argument list
                case 13: type equality only

        case 13 (entity ref constant):
            return (const_a->flags == const_b->flags)
               and (pointer_equal_or_sharing_match(const_a->entity, const_b->entity))

        case 15 (literal value):
            return (const_a->value_ptr == const_b->value_ptr)

        default:
            internal_error("compare_constants: bad constant kind")

Manual tail-call optimization. For cases 6 (address constant with type sub-kind) and 13 (entity ref), the function uses while(2) (an infinite loop that reassigns the operands and continues from the top of the comparison) instead of making a recursive call. This avoids stack growth when comparing chains of address constants, which can be deeply nested in pointer-to-member types.
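
The loop-instead-of-recursion pattern described above can be sketched in a few lines. This is only an illustration: the `AddrConst` class and its `tag` and `inner` fields are hypothetical stand-ins, not the real constant layout.

```python
class AddrConst:
    """Hypothetical address constant that may wrap another one."""
    __slots__ = ("tag", "inner")

    def __init__(self, tag, inner=None):
        self.tag = tag
        self.inner = inner


def chains_equal(a, b):
    # Where a recursive version would `return chains_equal(a.inner, b.inner)`,
    # this version reassigns the operands and loops, so stack depth is constant.
    while True:
        if a is b:
            return True
        if a is None or b is None:
            return False
        if a.tag != b.tag:
            return False
        a, b = a.inner, b.inner


def make_chain(depth, last_tag=0):
    """Build a chain of `depth` wrappers around one innermost node."""
    node = AddrConst(last_tag)
    for _ in range(depth):
        node = AddrConst(1, node)
    return node
```

A chain tens of thousands of nodes deep compares without exhausting the interpreter stack, which is the property the compiled loop preserves.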

compare_dynamic_inits (sub_5D1FE0)

Dynamic initializers represent runtime initialization expressions (constructors, aggregate init, etc.). The comparison function dispatches on the init kind byte at offset +48:

compare_dynamic_inits(init_a, init_b, flags) -> bool:

    if init_a->kind != init_b->kind:
        return FALSE
    if init_a->flags != init_b->flags:
        return FALSE
    // entity fields at +8, +16 compared with sharing-aware equality

    switch init_a->kind:
        case 0, 1: return TRUE (after header match)
        case 2, 6: return compare_constants(init_a->constant, init_b->constant, flags)
        case 3, 4: return compare_expressions(init_a->expr, init_b->expr)
        case 5:    return compare_entity_ref(...)
                       and compare_sub_exprs(...)

Part 2: Constant Sharing

Why It Exists

Without deduplication, every occurrence of the integer constant 42 in a translation unit would allocate a separate 184-byte constant node in the IL. For large programs (especially heavy template users), this wastes significant memory. The constant sharing system maintains a hash table of canonical constants: when a new constant is about to be allocated, alloc_shareable_constant first checks whether an identical constant already exists in the hash table. If so, the existing node is returned; if not, a new canonical copy is created in the file-scope region and inserted into the table.

Shareability Predicate

Not all constants can be shared. The predicate constant_is_shareable (sub_5D2210) checks several blocking conditions:

constant_is_shareable(constant) -> bool:

    if not sharing_enabled (dword_126EE48):
        return FALSE

    if constant has parent:
        // parent must be type 2 (constant); checks sharing flag 0x40 at byte+81
        // and calls compare_constants on the parent's value
        return parent_is_shareable(...)

    // blocking conditions for parentless constants:
    if constant->associated_entry != NULL:  return FALSE   // already bound to an entry
    if constant->extra_data != 0:           return FALSE   // has auxiliary data
    if constant->flags & 0x02:              return FALSE   // flag bit 1 blocks sharing

    switch constant->constant_kind:
        case 2 (string):    return string_sharing_enabled (dword_126E1C0)
        case 6 (address):   return TRUE unless has extra payload at +176
                             or address_subkind==4 with data
        case 7 (template):  return (constant->extra_ptr == NULL)
        case 10 (aggregate): return FALSE   // aggregates never shared
        case 12 (template param): return FALSE   // template params never shared
        default:            return TRUE

The rationale for excluding aggregates and template parameters: aggregate constants contain linked lists of sub-constants that would require recursive sharing checks, and template parameter constants are inherently unique to their instantiation context.

Hash Table Structure

The hash table is allocated during il_init (sub_5CFE20) as a 16,312-byte block (stored at qword_126F228), yielding 2039 bucket slots (16312 / 8 = 2039). Each bucket is a pointer to the head of a singly-linked chain of constant nodes.

Hash Table Layout (qword_126F228):

    +--------+--------+--------+     +--------+
    | slot 0 | slot 1 | slot 2 | ... |slot 2038|
    +--------+--------+--------+     +--------+
        |        |        |
        v        v        v
      const -> const -> NULL    (singly-linked chains)
       |
       v
      const -> NULL

Why 2039? The number 2039 is prime. A prime table size helps the modular-reduction step (hash % 2039) spread keys evenly across buckets even when the hash values share common factors or follow a regular stride. The compiled code computes the modulus with an optimized multiply-and-shift sequence (multiply by the magic constant 0x121456F, then shift) rather than a hardware division instruction.
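
The arithmetic behind the magic constant can be checked directly. The sketch below assumes a 32-bit unsigned hash and the usual add-and-shift fix-up that compilers emit when the full multiplier needs 33 bits; the exact instruction sequence in the binary is not reproduced here. Under that assumption the full multiplier is 2^32 + 0x121456F, which is the ceiling of 2^43 / 2039.

```python
MAGIC = 0x121456F   # multiplier quoted in the text
DIVISOR = 2039      # number of hash buckets


def mod_2039(n):
    """Remainder of a 32-bit unsigned n modulo 2039 without a division."""
    assert 0 <= n < 2**32
    q = (n * MAGIC) >> 32             # high half of the 32x32-bit product
    t = (((n - q) >> 1) + q) >> 10    # fix-up for the implicit 33rd multiplier bit
    return n - t * DIVISOR            # t is n // 2039, so this is the remainder
```

The intermediate `t` equals `n // 2039` for every 32-bit input, so the final subtraction yields the bucket index.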

alloc_shareable_constant (sub_5D2390)

This is the entry point for all constant allocation when sharing is enabled. It implements a hash-table lookup with MRU (most recently used) reordering of the chain:

alloc_shareable_constant(local_constant) -> constant*:

    total_alloc_count++                              // qword_126F208

    if not sharing_enabled or not constant_is_shareable(local_constant):
        return alloc_constant(local_constant)        // fallback to non-shared alloc

    if local_constant has parent:
        // parent's shared pointer is already the canonical copy
        assert parent->type == 2
        return parent->shared_ptr

    // ---- hash table lookup ----
    hash = compute_constant_hash(local_constant)     // sub_5BE150
    bucket_index = hash % 2039
    bucket_ptr = &hash_table[bucket_index]

    prev = NULL
    curr = *bucket_ptr
    while curr != NULL:
        comparison_count++                           // qword_126F200
        if compare_constants(curr, local_constant, 0):
            // ---- HIT: MRU reorder ----
            if prev != NULL:
                // unlink curr from current position
                prev->next = curr->next
                // move curr to front of chain
                curr->next = *bucket_ptr
                *bucket_ptr = curr
            if curr is in same region:
                region_hit_count++                   // qword_126F218
            else:
                global_hit_count++                   // qword_126F220
            return curr
        prev = curr
        curr = curr->next

    // ---- MISS: allocate new canonical constant ----
    new_bucket_count++                               // qword_126F210
    new_constant = alloc_in_file_scope(184)          // sub_5E11C0 or sub_5E1620
    memcpy(new_constant, local_constant, 184)        // 11 x SSE + 8-byte tail
    clear_sharing_flags(new_constant)
    fixup_constant_references(new_constant)          // sub_5D39A0
    // link at head of chain
    new_constant->next = *bucket_ptr
    *bucket_ptr = new_constant
    return new_constant

MRU optimization rationale. When a hash bucket chain holds several colliding constants, a recently matched constant is likely to be matched again soon (temporal locality from template instantiation expanding the same types repeatedly). Moving the matched node to the front of the chain means repeated lookups of the same constant succeed on the first comparison instead of walking the O(n) chain again.
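
A minimal model of the move-to-front chain makes the effect measurable. It is a sketch under simplifying assumptions: Python's built-in `hash` and `==` stand in for compute_constant_hash and compare_constants, and the `Node` and `SharingTable` names are hypothetical.

```python
NUM_BUCKETS = 2039


class Node:
    def __init__(self, key):
        self.key = key
        self.next = None


class SharingTable:
    def __init__(self):
        self.buckets = [None] * NUM_BUCKETS
        self.comparisons = 0   # counts key comparisons, like qword_126F200
        self.new_entries = 0   # counts misses, like qword_126F210

    def intern(self, key):
        index = hash(key) % NUM_BUCKETS
        prev, curr = None, self.buckets[index]
        while curr is not None:
            self.comparisons += 1
            if curr.key == key:
                if prev is not None:               # hit that is not at the head
                    prev.next = curr.next          # unlink
                    curr.next = self.buckets[index]
                    self.buckets[index] = curr     # relink at the front
                return curr
            prev, curr = curr, curr.next
        node = Node(key)                           # miss: new canonical entry
        node.next = self.buckets[index]
        self.buckets[index] = node
        self.new_entries += 1
        return node
```

With three keys that collide in one bucket, the first lookup of the oldest key costs three comparisons and the next lookup of the same key costs one.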

Statistics counters. The sharing system maintains five counters for profiling:

| Address | Counter | Meaning |
|---|---|---|
| qword_126F200 | comparisons | Total compare_constants calls during sharing |
| qword_126F208 | total_allocs | Total calls to alloc_shareable_constant |
| qword_126F210 | new_buckets | Number of cache misses (new canonical entries) |
| qword_126F218 | region_hits | Sharing hits where the existing constant is in the same region |
| qword_126F220 | global_hits | Sharing hits where the existing constant is in a different region |
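
These counters are enough to derive a few summary figures. The helper below is a hypothetical sketch, not something found in the binary; it assumes every hash-table lookup ends in exactly one hit or one miss, as the alloc_shareable_constant pseudocode suggests, so calls that never reach the table (non-shareable constants and the parent shortcut) are the remainder.

```python
def sharing_report(comparisons, total_allocs, new_entries, region_hits, global_hits):
    """Derive summary figures from the five sharing counters."""
    hits = region_hits + global_hits
    lookups = hits + new_entries          # calls that reached the hash table
    return {
        "hit_rate": hits / lookups if lookups else 0.0,
        "avg_comparisons_per_lookup": comparisons / lookups if lookups else 0.0,
        "calls_bypassing_table": total_allocs - lookups,
    }
```

A low average comparison count alongside a high hit rate would indicate that the move-to-front reordering is keeping hot constants near the chain heads.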

String Constant Interning (sub_5DBAB0)

String literals receive a separate interning pass through intern_string_constant at 0x5DBAB0. This function reuses the same 2039-bucket hash table (qword_126F228) but with string-specific comparison logic:

intern_string_constant(string, context_a, context_b) -> constant*:

    hash = compute_constant_hash(string)
    bucket_index = hash % 2039

    // linear chain search with exact match (flag=1)
    for each entry in chain:
        if compare_constants(entry, local_constant, 1):   // strict mode
            move_to_front(entry)                           // MRU
            return entry

    // miss: allocate new string constant in file-scope region
    new = alloc_constant_with_source_sequence(ck_string)
    memcpy(new, local_constant, 184)
    new->string_data = alloc_string_storage(strlen(string) + 1)
    strcpy(new->string_data, string)
    clear_sharing_flags(new)
    fixup_constant_references(new)
    link_at_chain_head(bucket_index, new)
    free_local_constant(local_constant)
    return new

fixup_constant_references (sub_5D39A0)

After a constant is copied into the shared region, some of its internal pointers may still reference nodes in the source (non-shared) region. fixup_constant_references walks the constant's internal structure and redirects these dangling references:

  • If the constant's associated IL entry is not in the shared region, the back-pointer at offset +128 is cleared.
  • For template parameter constants (kind 12), sub-kinds 1 and 5-10 may embed expression trees at offsets +160/+168. If these expressions are not in the shared region, they are deep-copied via copy_expr_tree or reattached via attach_to_region.
  • For literal value constants (kind 15) with expression sub-kind 13, the constant kind is rewritten based on the expression's kind (expr kind 2 becomes const kind 2, etc.), effectively inlining the expression into the constant.

Part 3: The Deep Copy Engine

Why It Exists

Template instantiation requires cloning expression trees from template definitions while replacing template parameter references with the actual arguments provided at the instantiation site. This is not a simple memcpy -- every node in the tree must be visited and its pointers updated to reference the new region's copies, and template parameter nodes must be intercepted and replaced with substituted values. The deep copy engine provides the cloning half of this work; the parameter replacement is layered on top of it by the substitution engine in Part 4.

Default argument expansion also uses the copy engine: when a function call omits an argument that has a default, the default's expression tree is cloned from the function declaration into the call site.

i_copy_expr_tree (sub_5D2F90)

The central expression copier. It takes an expression node, a flags word, and a substitution-list context, then returns a freshly allocated deep copy.

i_copy_expr_tree(src_expr, flags, subst_list) -> expr_node*:

    // shallow clone: allocate new node, copy fixed fields
    dest = allocate_expr_node_clone(src_expr)      // sub_5C28B0

    switch src_expr->kind:

        case 0  (null):           // no children to copy
        case 3  (entity ref):     // entity pointer is shared, not copied
        case 4  (type ref):       // type pointer is shared
        case 16 (sentinel):       // no data
        case 19 (template ref):   // entity pointer is shared
        case 20 (type constant):  // type pointer is shared
        case 22 (opaque ptr):     // shallow only
        case 30 (template spec):  // shallow only
            break                 // nothing beyond the shallow clone

        case 1 (operation):
            // recursively copy the operand linked list
            dest->operands = i_copy_list_of_expr_trees(src_expr->operands, flags, subst)

        case 2 (constant reference):
            // deep-copy the constant node
            dest->constant = i_copy_constant_full(src_expr->constant, NULL, flags, subst)

        case 5 (dynamic init):
            dest->init = i_copy_dynamic_init(src_expr->init, flags, subst)

        case 6 (call expression):
            // walk argument list, copy each argument expression
            dest->args = copy_arg_list(src_expr->args, flags, subst)

        case 7 (full expression info):
            // copy 6 sub-fields (type, scope, sub-expression, etc.)
            copy_full_expr_children(dest, src_expr, flags, subst)

        case 8 (template arguments):
            dest->type_list = copy_type_list(src_expr->type_list, flags)

        case 9 (pack expansion):
            dest->type = copy_type(src_expr->type)
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)

        case 10 (object lifetime):
            push_lifetime_scope()
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            attach_lifetime(dest)

        case 11, 23, 32, 33 (sub-expression list):
            dest->list = i_copy_list_of_expr_trees(src_expr->list, flags, subst)

        case 12, 14, 15 (typed value):
            // conditional copy based on value byte
            if src_expr->value_byte matches copy-condition:
                dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)

        case 13 (two-byte key):
            // no children beyond the key value

        case 17 (entity reference, copyable):
            if flags & 0x80:   // copy_entities mode
                dest->entity = alloc_constant_from_entity(src_expr->entity)

        case 18 (substitution slot):
            // look up in subst_list for replacement
            dest = resolve_from_substitution_list(subst_list, src_expr->index)

        case 21, 26, 27 (expression + list):
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            dest->list = i_copy_list_of_expr_trees(src_expr->list, flags, subst)

        case 24 (list + pointer):
            dest->list = i_copy_list_of_expr_trees(src_expr->list, flags, subst)
            dest->ptr = copy_pointer_target(src_expr->ptr)

        case 25 (expression + flags):
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            dest->flags |= propagated_flags

        case 28 (attributed expression):
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            // attribute flags are already copied in the shallow clone

        case 31 (expression + extracted pointer):
            dest->inner = i_copy_expr_tree(src_expr->inner, flags, subst)
            dest->extra = extract_pointer(src_expr)

        case 34 (constexpr fold):
            dest = copy_constexpr_fold(src_expr)    // sub_65AE50

        default:
            internal_error("i_copy_expr_tree: bad expr kind")

    // ---- post-copy entity resolution (LABEL_11) ----
    if flags & 0x10:   // resolve_refs mode
        for kinds 2, 3, 7, 19:
            resolve_entity_ref(dest)               // sub_5B3030

    return dest

Flags interpretation for the copy engine:

| Bit | Mask | Meaning |
|---|---|---|
| 4 | 0x10 | resolve_refs -- after copying, resolve entity references through the symbol table |
| 7 | 0x80 | copy_entities -- copy entity nodes themselves (not just references to them) |
| 12 | 0x1000 | mark_instantiated -- stamp copied nodes with the instantiation flag |
| 14 | 0x4000 | preserve_source_pos -- carry source-position annotations from source to copy |
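
The overall shape of the copier (shallow clone first, then kind-specific recursion) can be shown with a much reduced model. The sketch covers only three of the kinds listed above, uses a Python list where the real code walks a linked list, and treats `Expr` and its fields as hypothetical names.

```python
import copy

COPY_ENTITIES = 0x80   # bit 7: copy entity nodes, not just references


class Expr:
    def __init__(self, kind, **fields):
        self.kind = kind
        self.__dict__.update(fields)


def copy_expr(src, flags, subst):
    dest = copy.copy(src)                      # shallow clone of the fixed fields
    if src.kind == 1:                          # operation: recurse into operands
        dest.operands = [copy_expr(op, flags, subst) for op in src.operands]
    elif src.kind == 17 and flags & COPY_ENTITIES:
        dest.entity = copy.copy(src.entity)    # entity duplicated only in this mode
    elif src.kind == 18:                       # substitution slot: use replacement
        dest = subst[src.index]
    return dest                                # other kinds keep the shallow clone
```

Kinds that fall through keep sharing their entity or type pointers with the source tree, which matches the "shallow only" cases in the dispatch above.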

i_copy_list_of_expr_trees (sub_5D38C0)

A helper that walks a linked list of expression nodes (connected via the next pointer at offset +16), copies each via i_copy_expr_tree, and links the copies into a new list:

i_copy_list_of_expr_trees(head, flags, subst) -> expr_node*:

    result_head = NULL
    result_tail = NULL

    curr = head
    while curr != NULL:
        copy = i_copy_expr_tree(curr, flags, subst)
        if result_head == NULL:
            result_head = copy
        else:
            result_tail->next = copy
        result_tail = copy
        curr = curr->next

    return result_head
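
For reference, the same head/tail append loop in runnable form. This is a sketch with hypothetical names: `ExprNode` carries only a payload and a `next` link, and `copy_one` stands in for i_copy_expr_tree.

```python
class ExprNode:
    """Hypothetical expression node: a payload plus the list link."""
    def __init__(self, payload):
        self.payload = payload
        self.next = None


def copy_list(head, copy_one):
    result_head = None
    result_tail = None
    curr = head
    while curr is not None:
        dup = copy_one(curr)
        if result_head is None:
            result_head = dup          # first copy becomes the new head
        else:
            result_tail.next = dup     # append behind the previous copy
        result_tail = dup
        curr = curr.next
    return result_head


def from_payloads(payloads):
    """Build a linked list from a Python list (test helper)."""
    head = None
    for payload in reversed(payloads):
        node = ExprNode(payload)
        node.next = head
        head = node
    return head


def to_payloads(head):
    """Flatten a linked list back into a Python list (test helper)."""
    out = []
    while head is not None:
        out.append(head.payload)
        head = head.next
    return out
```

Keeping a tail pointer makes each append O(1), so the whole copy is linear in the list length and preserves element order.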

i_copy_constant_full (sub_5D3B90)

The constant copier handles the 184-byte constant node and its recursive sub-structure. It maintains a substitution list to avoid duplicating shared type definitions across the copy tree.

i_copy_constant_full(src, dest_or_null, flags, subst_list) -> constant*:

    if dest_or_null:
        dest = dest_or_null                 // copy in place
    else:
        dest = alloc_constant_node()        // sub_5E11C0

    memcpy(dest, src, 184)                  // 11 x SSE + 8-byte tail
    clear_sharing_flag(dest, bit 2 at [5].byte[3])
    clear_sharing_flag(dest, bit 5 at [9].byte[3])

    switch dest->constant_kind:

        case 10 (aggregate):
            // walk linked list of sub-constants, deep-copy each
            for each element in dest->element_list:
                element = i_copy_constant_full(element, NULL, flags, subst_list)
                relink(element)

        case 11 (constant + scope):
            dest->sub_constant = i_copy_constant_full(
                dest->sub_constant, NULL, flags, subst_list)

        case 9 (dynamic init):
            dest->init = i_copy_dynamic_init(dest->init, flags, subst_list)

        case 6 (address constant with type definition):
            // substitution-list management: look up whether this type
            // has already been copied in this tree; if so, reuse the copy
            existing = lookup_in_subst_list(subst_list, src->type_def)
            if existing:
                dest->type_def = existing
            else:
                dest->type_def = copy_type(src->type_def)
                add_to_subst_list(subst_list, src->type_def, dest->type_def)

        case 12 (template parameter constant):
            switch dest->template_param_kind:
                case 0, 2, 3, 13: // no extra copy needed
                case 1:           dest->value_expr = copy_expr_tree(dest->value_expr)
                case 4, 12:       dest->inner = i_copy_constant_full(dest->inner, ...)
                case 5-10:        dest->extra_expr = copy_expr_tree(dest->extra_expr)
                case 11:          dest->type = copy_type(dest->type)
                                  dest->arg_list = copy_template_arg_list(dest->arg_list)

    fixup_constant_references(dest)         // sub_5D39A0
    return alloc_shareable_constant(dest)   // sub_5D2390 -- may deduplicate

Substitution list purpose. When copying an expression tree that references the same type definition in multiple sub-expressions (e.g., two occurrences of decltype(x) in a single template), the substitution list ensures both references point to the same copied type node, preserving the sharing relationship from the original tree.
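
The effect of the substitution list can be sketched as a memo keyed by the original type node. A Python dict is swapped in here for whatever list structure the binary uses, and `TypeDef` and `AddrConst` are hypothetical stand-ins.

```python
class TypeDef:
    def __init__(self, name):
        self.name = name


class AddrConst:
    def __init__(self, type_def):
        self.type_def = type_def


def copy_constants(constants):
    subst = {}                                  # id(original type) -> copied type
    copies = []
    for const in constants:
        key = id(const.type_def)
        if key not in subst:                    # first sighting: copy and record
            subst[key] = TypeDef(const.type_def.name)
        copies.append(AddrConst(subst[key]))    # later sightings reuse the copy
    return copies
```

Two source constants that pointed at one type node end up pointing at one copied type node, rather than at two separate duplicates.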

Public Wrappers

The i_copy_* functions are internal -- they take a substitution-list parameter that must be cleaned up after use. Public wrappers handle this lifecycle:

| Wrapper | Address | Internal function | Purpose |
|---|---|---|---|
| copy_expr_tree | sub_5D3940 | i_copy_expr_tree | Expression deep copy with auto-cleanup |
| copy_constant_full | sub_5D4300 | i_copy_constant_full | Constant deep copy with auto-cleanup |
| copy_dynamic_init | sub_5D4CF0 | i_copy_dynamic_init | Dynamic init deep copy with auto-cleanup |
| copy_constant | sub_5D4D50 | i_copy_constant_full | Simple constant copy (flags=0) |

Each wrapper allocates a local substitution list, calls the internal function, then appends the list entries to the global free list at qword_126F1E0.


Part 4: Template Parameter Substitution

Why It Exists

The deep copy engine (Part 3) performs a mechanical tree clone -- it duplicates structure but does not transform content. Template instantiation requires more: when the copier encounters a node referencing template parameter T, it must replace that node with the actual type argument (e.g., int). When it encounters sizeof(T), it must evaluate the expression with T=int and produce the constant 4. The template parameter substitution engine is the transformation layer that sits on top of the copy engine, intercepting template-parameter nodes and performing the substitution.
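
A toy version of this transformation layer shows the difference from a plain clone. Everything in the sketch is an assumption made for illustration: the node classes, the `SIZES` table (which assumes a 4-byte int), and the representation of types as strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:          # reference to a template type parameter such as T
    name: str


@dataclass(frozen=True)
class SizeOf:         # sizeof(<type>)
    operand: object


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object


SIZES = {"int": 4, "double": 8, "char": 1}   # assumed target sizes


def substitute(node, bindings):
    if isinstance(node, Param):
        return bindings.get(node.name, node)          # T -> int when bound
    if isinstance(node, SizeOf):
        operand = substitute(node.operand, bindings)
        if operand in SIZES:
            return Const(SIZES[operand])              # sizeof(int) -> 4
        return SizeOf(operand)                        # still dependent
    if isinstance(node, Add):
        return Add(substitute(node.lhs, bindings), substitute(node.rhs, bindings))
    return node                                       # constants pass through
```

The tree is rebuilt rather than mutated, so the template definition stays intact for later instantiations with other arguments.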

copy_template_param_expr (sub_5DC000)

This is the central dispatcher for expression-level template substitution. At 1416 lines and 7872 bytes of compiled code, it is the largest single function in the comparison/copy subsystem. It takes up to 10 arguments:

copy_template_param_expr(
    expr_node*   a1,     // expression to substitute
    template_ctx a2,     // template argument context
    template_ctx a3,     // secondary context (for nested templates)
    type*        a4,     // expected result type
    scope*       a5,     // current scope
    int          a6,     // flags
    int*         a7,     // error_flag (output: set to 1 on failure)
    diag_info*   a8,     // diagnostics context
    constant*    a9,     // scratch constant (pre-allocated workspace)
    constant**   a10     // output constant pointer
) -> expr_node*          // substituted expression, or NULL (use a9/a10)

The function dispatches on expr->kind and, for operation nodes, further dispatches on the operation code:

copy_template_param_expr(expr, tctx, ...):

    switch expr->kind:

        case 0 (empty):
            return expr                       // pass through unchanged

        case 1 (operation):
            switch expr->op_code:

                case 116 (type expression):
                    return copy_template_param_type_expr(expr, tctx, ...)

                case 5 (cast):
                    // substitute the cast's target type
                    new_type = copy_template_param_type(expr->target_type, tctx)
                    // recursively substitute the operand
                    new_operand = copy_template_param_expr(expr->operands, tctx, ...)
                    return build_cast_node(new_type, new_operand)

                case 0, 25, 28, 29, 53-57, 71, 72, 87, 88, 103 (unary/simple binary):
                    // substitute operand(s) recursively
                    new_ops = substitute_operand_list(expr->operands, tctx)
                    return build_operation_node(expr->op_code, new_ops, new_type)

                case 26, 27, 39-43, 58-63 (binary with type check):
                    // substitute both operands
                    lhs = copy_template_param_expr(expr->operands[0], tctx, ...)
                    rhs = copy_template_param_expr(expr->operands[1], tctx, ...)
                    // post-substitution type promotion:
                    do_conversions_on_operands_of_copied_template_expr(expr->op_code, &lhs, &rhs)
                    return build_operation_node(expr->op_code, lhs, rhs, result_type)

                case 39 (ternary / conditional):
                    cond  = copy_template_param_expr(operands[0], ...)
                    true_ = copy_template_param_expr(operands[1], ...)
                    false_= copy_template_param_expr(operands[2], ...)
                    return build_conditional(cond, true_, false_)

                case 44, 45 (imaginary):
                    internal_error("imaginary operators not implemented")

        case 2 (constant reference):
            return copy_template_param_con(expr->constant, tctx, ...)

        case 3 (variable / entity reference):
            // look up the variable in the substitution context
            subst = find_substitution(tctx, expr->entity)
            if subst found:
                // check type compatibility between expected and actual
                if is_pointer_compatible(expected_type, subst->type):
                    return build_value_from_constant(subst)
                else:
                    apply_type_conversion(subst, expected_type)
            return expr   // no substitution needed

        case 5 (function call in constant context):
            // dispatch on call sub-kind
            switch expr->call_subkind:
                case 1: substitute type + validate value
                case 2: delegate to copy_template_param_con

        case 19 (template parameter reference):
            // same logic as case 3 for template parameters
            subst = find_substitution(tctx, expr->template_param)
            return build_substituted_expr(subst)

        case 20 (type constant):
            new_type = copy_template_param_type(expr->type, tctx)
            return build_type_constant_expr(new_type)

        case 21 (builtin operation):
            return copy_template_param_builtin_operation(expr, tctx, ...)
            // asserts no error in process

        case 22 (type reference):
            new_type = copy_template_param_type(expr->type, tctx)
            return build_type_ref_expr(new_type)

        case 23 (expression wrapper):
            inner = copy_template_param_expr(expr->inner, tctx, ...)
            return wrap(inner)

        case 30 (pack expansion):
            return expand_pack(expr, tctx, ...)

        case 31 (dependent entity):
            // complex hash-map based instantiation tracking
            // with get_with_hash / vector_insert / entity list processing
            return instantiate_dependent_entity(expr, tctx, ...)

Post-substitution type conversions. The internal helper do_conversions_on_operands_of_copied_template_expr (at il.c line 18885) handles arithmetic promotions that must occur after template parameter substitution. For example, T + U where T=int and U=double requires promoting the int operand to double -- this promotion is not present in the template definition's expression tree because the types are unknown there. The function handles:

  • Shift operators (ops 53-54): promote the result type to the promoted LHS type.
  • Comparison operators (ops 58-63): compute the common type and apply usual arithmetic conversions.
  • Arithmetic operators (default): compute common type via sub_5657C0 and insert implicit conversion nodes.
  • Imaginary operators (ops 44-45): explicitly not implemented (triggers internal_error).
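
The arithmetic case can be illustrated with a deliberately simplified model of the usual arithmetic conversions. The rank table below is an assumption that ignores unsigned types, integral promotions, and everything else the real rules cover; operands are modelled as hypothetical `(expression, type)` pairs.

```python
RANK = {"int": 1, "long": 2, "float": 3, "double": 4}   # simplified ordering


def common_type(lhs_type, rhs_type):
    """Pick the higher-ranked of two arithmetic types."""
    return lhs_type if RANK[lhs_type] >= RANK[rhs_type] else rhs_type


def convert_operands(lhs, rhs):
    """Wrap whichever operand needs it in a cast to the common type."""
    target = common_type(lhs[1], rhs[1])

    def cast(operand):
        expr, ty = operand
        if ty == target:
            return operand
        return (("cast", target, expr), target)    # inserted conversion node

    return cast(lhs), cast(rhs), target
```

For `T + U` with T=int and U=double, only the int operand gains a cast node; the double operand is returned unchanged.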

copy_template_param_con (sub_5DE290)

The constant-level substitution dispatcher. At 819 lines, it handles the case where a constant node in a template definition contains a reference to a template parameter:

copy_template_param_con(constant, tctx, expected_type, scope, flags,
                        error_flag, diag, scratch) -> constant*:

    switch constant->constant_kind:

        case 12 (template parameter constant):
            // this is the core case -- the constant IS a template parameter
            switch constant->template_param_kind:

                case 0 (value parameter):
                    // look up the bound value in the template argument list
                    binding = lookup_template_arg(tctx, constant->param_index)
                    if binding is a pack:
                        return expand_pack_element(binding, ...)
                    return binding->value_constant

                case 1 (expression parameter):
                    // try overload resolution first
                    result = copy_template_param_con_overload_resolution(...)
                    if result: return result
                    // fall back to full expression-level substitution
                    return copy_template_param_expr(constant->expr, tctx, ...)

                case 2 (non-member entity parameter):
                    return copy_template_param_unknown_entity_con(constant, FALSE, ...)

                case 3 (member entity parameter):
                    return copy_template_param_unknown_entity_con(constant, TRUE, ...)

                case 4 (nested constant parameter):
                    return copy_template_param_con(constant->inner, tctx, ...)

                case 5-10 (scalar value parameters: sizeof, alignof, etc.):
                    // look up the substitution via sub_5BFB80
                    // perform type equality check
                    // apply type conversions if needed
                    return substituted_scalar_constant(...)

                case 11 (entity + argument pack):
                    // entity substitution with argument list processing
                    return substitute_entity_with_args(...)

                case 12 (nested recursive):
                    return copy_template_param_con(constant->inner, tctx, ...)

        case 6 (address/aggregate constant):
            switch constant->address_kind:
                case 3 (function call):
                    // substitute callee type + each argument recursively
                    callee_type = copy_template_param_type(constant->callee_type, tctx)
                    for each arg in constant->args:
                        arg = copy_template_param_con(arg, tctx, ...)
                    return build_call_constant(callee_type, args)
                default:
                    if is_dependent_type(constant->type):
                        return deep_copy_constant(constant)
                    // handle address-space attribute patterns

        case 15 (expression constant):
            switch constant->expr_constant_kind:
                case 46 (strip_template_arg):
                    // dispatch on template argument type:
                    //   0 = type argument -> type substitution
                    //   1 = value argument -> value substitution
                    //   2 = template argument -> template substitution
                case 6:  return type_substitution(...)
                case 13: return non_type_param_substitution(...)
                case 2:  return recursive copy_template_param_con(inner, ...)

        default:
            internal_error("copy_template_param_con: unexpected kind")

copy_template_param_con_with_substitution (sub_5DFAD0)

The top-level entry point for template constant substitution, called from the template instantiation driver. It manages the IL region switch (moving allocation to file-scope for the duration of instantiation), handles the initial overload-resolution check, and performs post-substitution type normalization:

copy_template_param_con_with_substitution(constant, template_args, scope,
                                          expected_type, access, flags,
                                          error_flag, scratch):

    saved_region = current_region
    switch_to_file_scope_region()            // with debug trace

    local_scratch = alloc_local_constant()

    // ---- special case: overload resolution for expression parameters ----
    if constant->constant_kind == 12 and constant->template_param_kind == 1:
        overload_info = lookup_overload_candidate(constant)
        if overload_info:
            result = copy_template_param_con_overload_resolution(
                         constant, overload_info, template_args, ...)
            if result: goto post_process

    // ---- validate expected type ----
    if expected_type is pointer_type:
        validate_pointer_binding(expected_type)

    // ---- main substitution ----
    result = copy_template_param_expr(constant->expr, template_args, ...)
    // or: result = copy_template_param_con(constant, template_args, ...)
    // depending on whether the constant embeds an expression

    post_process:
    // ---- post-substitution type normalization ----
    if result->type is pointer_type:
        validate_binding(result)
        result = try_implicit_conversion(result)
    elif result->type is array_type:
        result = try_implicit_conversion(result)
        result = array_to_pointer_decay(result)
    elif result->type is function_type:
        result = try_implicit_conversion(result)
        result = function_to_pointer_decay(result)
    else:
        result = general_conversion(result)

    // ---- handle deferred instantiation ----
    if is_deferred_instantiation(result):
        copy_deferred_data_into_scratch(result, scratch)

    restore_region(saved_region)
    free_local_constant(local_scratch)
    return result
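
Two pieces of this entry point lend themselves to a small sketch: the save/switch/restore of the allocation region, and the decay step in post-substitution normalization. The names below are hypothetical, types are modelled as plain tuples, and the `try`/`finally` is a design choice of the sketch rather than something observed in the binary.

```python
from contextlib import contextmanager

state = {"region": "function-scope"}   # hypothetical allocator state


@contextmanager
def file_scope_region():
    """Switch allocation to file scope, restoring the old region on exit."""
    saved = state["region"]
    state["region"] = "file-scope"
    try:
        yield
    finally:
        state["region"] = saved


def normalize_result_type(ty):
    """Apply array-to-pointer and function-to-pointer decay."""
    kind = ty[0]
    if kind == "array":
        return ("pointer", ty[1])      # pointer to the element type
    if kind == "function":
        return ("pointer", ty)         # pointer to the function type
    return ty
```

Restoring the region in a `finally` block keeps later allocations out of file scope even if substitution fails part-way through.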

Supporting Functions

| Function | Address | Lines | Purpose |
|---|---|---|---|
| copy_template_param_type_expr | sub_5DDEB0 | 82 | Handles op=116 type expressions within template substitution; extracts and substitutes the type, checks dependent-type status |
| copy_template_param_expr_list | sub_5DE010 | 77 | Iterates an expression linked list, calling copy_template_param_expr on each element; shares a single scratch constant across all iterations |
| copy_template_param_value_expr | sub_5DE1A0 | 55 | Single-expression variant; passes the expression's own type as the expected type |
| copy_template_param_con_overload_resolution | sub_5DF6A0 | 180 | Attempts overload resolution during template substitution when the template parameter refers to a set of overloaded functions; validates result type compatibility |
| copy_template_param_unknown_entity_con | sub_5DB420 | 213 | Handles entity constants where the entity kind is not known until substitution time (using declarations, namespace aliases, variables, templates, types) |

Part 5: Data Flow Between the Subsystems

The four subsystems interact in a specific calling pattern during template instantiation:

Template Instantiation Driver
  |
  +-> copy_template_param_con_with_substitution (entry point)
        |
        +-> copy_template_param_expr (expression-level dispatch)
        |     |
        |     +-> copy_template_param_con (constant-level dispatch)
        |     |     |
        |     |     +-> copy_template_param_unknown_entity_con
        |     |     +-> copy_template_param_con_overload_resolution
        |     |     +-> [recursive: copy_template_param_expr]
        |     |     +-> [recursive: copy_template_param_con]
        |     |
        |     +-> copy_template_param_type (type-level, in type.c)
        |     +-> copy_template_param_type_expr
        |     +-> copy_template_param_expr_list
        |     +-> copy_template_param_builtin_operation
        |
        +-> alloc_shareable_constant (deduplication on output)
              |
              +-> compare_constants (hash table equality check)
              +-> fixup_constant_references

The comparison engine is not called during the copy itself -- it is called only at the end, when the newly constructed constants are passed through alloc_shareable_constant for deduplication. This means the copy engine may temporarily create duplicate constants that are later merged by the sharing infrastructure. The design separates concerns: the copy engine focuses on correctness (producing a valid substituted tree), while the sharing engine focuses on efficiency (deduplicating identical results).
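The copy-then-deduplicate split can be illustrated with a toy interning table. This is a minimal sketch under stated assumptions, not the decompiled logic: the struct layout, the hash function, and the names toy_constant, toy_equal, toy_hash, and toy_intern are all hypothetical. Only the bucket count (2039) is taken from the il_init description.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_BUCKETS 2039  /* bucket count reported for the constant sharing table */

/* Hypothetical, heavily simplified constant: a kind tag plus one integer payload. */
typedef struct toy_constant {
    int kind;
    long value;
    struct toy_constant *next;  /* bucket chain */
} toy_constant;

static toy_constant *toy_table[TOY_BUCKETS];

/* Structural equality; stands in for compare_constants. */
static int toy_equal(const toy_constant *a, const toy_constant *b) {
    return a->kind == b->kind && a->value == b->value;
}

/* Illustrative hash only; the real hash function is not described here. */
static unsigned toy_hash(const toy_constant *c) {
    unsigned long h = (unsigned long)c->kind * 31u + (unsigned long)c->value;
    return (unsigned)(h % TOY_BUCKETS);
}

/* Returns the canonical node for *fresh: an existing equal node if one is
 * already in the table, otherwise a newly inserted heap copy. */
static toy_constant *toy_intern(const toy_constant *fresh) {
    assert(fresh != NULL);
    unsigned b = toy_hash(fresh);
    for (toy_constant *p = toy_table[b]; p != NULL; p = p->next)
        if (toy_equal(p, fresh))
            return p;           /* duplicate left over from the copy phase: merge */
    toy_constant *node = malloc(sizeof *node);
    assert(node != NULL);
    *node = *fresh;
    node->next = toy_table[b];
    toy_table[b] = node;
    return node;
}
```

Interning two structurally equal constants yields the same pointer, which is the property that lets the copy phase ignore sharing entirely and leave merging to the final step.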


Part 6: Initialization and Reset

il_one_time_init (sub_5CF7F0)

Called once at program startup. Validates that seven name-table arrays end with the "last" sentinel string, checks the sizeof_il_entry guard value (9999), and initializes 60+ allocation pools via pool_init (sub_7A3C00) with element sizes ranging from 1 byte to 1344 bytes. Conditionally initializes C++-mode pools (guarded by dword_106BF68 || dword_106BF58).

il_init (sub_5CFE20)

Called at the start of each translation unit. Zeroes all global pool heads, allocates and zeroes the two hash tables:

  • Character type table: 3240 bytes at qword_126F2F8 (5 character types x 81 slots = 405 entries, 8 bytes each).
  • Constant sharing table: 16312 bytes at qword_126F228 (2039 buckets, 8 bytes each).

Sets the three sharing mode bytes (byte_126E558, byte_126E559, byte_126E55A) to 3 (all sharing enabled), and tail-calls il_init_float_constants (sub_5EAF00).
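The two table sizes follow directly from the entry counts. The sketch below reproduces the arithmetic and the allocate-and-zero step; the helper names are hypothetical, and the 8-byte entry size is taken from the description above rather than from a portable sizeof.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum {
    CHAR_KINDS      = 5,     /* character types, per the description above */
    SLOTS_PER_KIND  = 81,
    SHARING_BUCKETS = 2039
};

/* Byte sizes of the two tables, assuming 8-byte (pointer-sized) entries. */
static size_t char_table_bytes(void)    { return (size_t)CHAR_KINDS * SLOTS_PER_KIND * 8; }
static size_t sharing_table_bytes(void) { return (size_t)SHARING_BUCKETS * 8; }

/* Allocate one zero-filled table; calloc covers "allocate and zero" in one call. */
static void *alloc_zeroed_table(size_t bytes) {
    void *p = calloc(1, bytes);
    assert(p != NULL);
    return p;
}
```

With these definitions, char_table_bytes() evaluates to 3240 and sharing_table_bytes() to 16312, matching the sizes listed above.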

il_reset_secondary_pools (sub_5D0170)

Zeroes ~80 qword globals in the 0x126F680--0x126F978 range. These are transient counters, list heads, and cached type pointers used during template instantiation. Called separately from il_init, suggesting it resets state between instantiation passes within the same translation unit.


Address Map

| Address | Function | Lines | Role |
|---|---|---|---|
| 0x5CF7F0 | il_one_time_init | ~200 | One-time startup validation + pool init |
| 0x5CFE20 | il_init | ~100 | Per-TU hash table allocation + state reset |
| 0x5D0170 | il_reset_secondary_pools | ~40 | Reset instantiation-transient state |
| 0x5D0750 | compare_expressions | 588 | Expression tree structural equality |
| 0x5D1320 | compare_expressions_for_equivalence | ~10 | Thin wrapper (flags=4) |
| 0x5D1350 | compare_constants | 525 | Constant structural equality, 16 kinds |
| 0x5D1FE0 | compare_dynamic_inits | ~80 | Dynamic init comparison |
| 0x5D2160 | compare_constants_default | ~5 | Thin wrapper (flags=0) |
| 0x5D2170 | expr_tree_contains_template_param_constant | ~50 | Template param presence check |
| 0x5D2210 | constant_is_shareable | ~100 | Shareability predicate |
| 0x5D2390 | alloc_shareable_constant | ~200 | Hash table deduplication allocator |
| 0x5D2890 | alloc_il_entry_from_constant | ~20 | Wraps constant in IL entry |
| 0x5D2F90 | i_copy_expr_tree | 424 | Expression tree deep copy (35-case switch) |
| 0x5D38C0 | i_copy_list_of_expr_trees | ~40 | Linked-list copy helper |
| 0x5D3940 | copy_expr_tree | ~30 | Public wrapper with cleanup |
| 0x5D39A0 | fixup_constant_references | ~80 | Post-copy pointer fixup |
| 0x5D3B90 | i_copy_constant_full | 305 | Constant deep copy (16-kind switch) |
| 0x5D4300 | copy_constant_full | ~20 | Public wrapper with cleanup |
| 0x5D47A0 | i_copy_dynamic_init | ~150 | Dynamic init deep copy |
| 0x5D4C00 | copy_lambda_capture | ~60 | Lambda capture list copy |
| 0x5D4DB0 | alloc_constant | ~150 | Non-shared constant allocation with kind-specific cleanup |
| 0x5DBAB0 | intern_string_constant | ~92 | String literal interning via hash table |
| 0x5DC000 | copy_template_param_expr | 1416 | Template substitution -- expression dispatcher |
| 0x5DDEB0 | copy_template_param_type_expr | 82 | Template substitution -- type expressions |
| 0x5DE010 | copy_template_param_expr_list | 77 | Template substitution -- expression list |
| 0x5DE1A0 | copy_template_param_value_expr | 55 | Template substitution -- single value expr |
| 0x5DE290 | copy_template_param_con | 819 | Template substitution -- constant dispatcher |
| 0x5DF6A0 | copy_template_param_con_overload_resolution | 180 | Template substitution -- overload resolution |
| 0x5DFAD0 | copy_template_param_con_with_substitution | 288 | Template substitution -- top-level entry |

.int.c File Format

When cudafe++ processes a CUDA source file, the backend code generator emits a transformed C++ translation called the .int.c file (short for "intermediate C"). This is the host-side output that the downstream host compiler (GCC, Clang, or MSVC) will compile. The file preserves all host-visible declarations from the original source but replaces device code with stubs, injects CUDA runtime boilerplate, and appends registration tables and anonymous namespace support. The entire emission is driven by process_file_scope_entities (sub_489000), a 723-line function in cp_gen_be.c that serves as the backend entry point. It initializes output state, opens the output stream, emits a fixed sequence of preamble sections, walks the EDG intermediate language source sequence to generate the transformed C++ body, then appends a fixed trailer with _NV_ANON_NAMESPACE handling, #pragma pack() for MSVC, and CUDA host reference arrays.

Key Facts

| Property | Value |
|---|---|
| Backend entry point | sub_489000 (process_file_scope_entities, 723 lines) |
| EDG source file | cp_gen_be.c (lines 19916-26628) |
| Default output name | <input>.int.c (via sub_5ADD90 string concatenation) |
| Output override global | qword_106BF20 (set by CLI flag gen_c_file_name, case 45) |
| Stdout sentinel | "-" (output filename compared character-by-character) |
| Output stream global | stream (FILE pointer at fixed address) |
| Line counter | dword_1065820 (incremented on every \n) |
| Column counter | dword_106581C (character position within current line) |
| Indent level | dword_1065834 (decremented with -- around directive blocks) |
| Needs-line-directive flag | dword_1065818 (triggers #line emission before next output) |
| Source sequence cursor | qword_1065748 (current IL entry being processed) |
| Device stub mode toggle | dword_1065850 (0=normal, 1=generating __wrapper__device_stub_) |
| Empty file guard string | "int __dummy_to_avoid_empty_file;" at 0x83AED8 |
| Anon namespace macro string | "_NV_ANON_NAMESPACE" at 0x83AF45 |
| Managed RT boilerplate | inline static functions for __managed__ variable support |

Output File Naming

The output filename is determined by three inputs, checked in order:

// sub_489000, decompiled lines 153-177
char *input_name = qword_126EEE0;   // source filename from CLI

// 1. Check for stdout mode
if (strcmp(input_name, "-") == 0) {
    stream = stdout;
}
else {
    // 2. Check for explicit output name override
    char *output_name = qword_106BF20;
    if (!output_name)
        // 3. Default: append ".int.c" to input filename
        output_name = sub_5ADD90(input_name, ".int.c");

    stream = sub_4F48F0(output_name, 0, 0, 0, 1701);  // open for writing
}

The - sentinel enables piping cudafe++ output to stdout for debugging or toolchain integration. The qword_106BF20 override is set by the gen_c_file_name CLI option (case 45 in the CLI parser at sub_459630), allowing nvcc to specify an explicit output path. The default .int.c suffix means a file kernel.cu produces kernel.cu.int.c.
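The same decision order can be written as a small self-contained helper. This is a sketch only: choose_output_name is a hypothetical name, its override_name parameter stands in for qword_106BF20, and returning NULL to signal stdout is a convention chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns NULL for stdout mode, otherwise a heap-allocated output path.
 * override_name models qword_106BF20 and may be NULL. */
static char *choose_output_name(const char *input_name, const char *override_name) {
    assert(input_name != NULL);
    if (strcmp(input_name, "-") == 0)
        return NULL;                         /* caller writes to stdout */

    const char *suffix = ".int.c";
    const char *base = override_name ? override_name : input_name;
    size_t len = strlen(base) + (override_name ? 0 : strlen(suffix));
    char *out = malloc(len + 1);
    assert(out != NULL);
    strcpy(out, base);
    if (!override_name)
        strcat(out, suffix);                 /* default: <input>.int.c */
    return out;
}
```

Calling choose_output_name("kernel.cu", NULL) yields "kernel.cu.int.c", while a non-NULL override is returned unchanged; the caller owns and frees the result.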

Complete .int.c File Structure

A fully-generated .int.c file follows this fixed section ordering, top to bottom:

+------------------------------------------------------------------+
| 1. #line directive (initial source position)                     |
+------------------------------------------------------------------+
| 2. #pragma GCC diagnostic ignored "-Wunused-local-typedefs"      |
|    #pragma GCC diagnostic ignored "-Wattributes"                 |
+------------------------------------------------------------------+
| 3. #pragma GCC diagnostic push                                   |
|    #pragma GCC diagnostic ignored "-Wunused-variable"            |
|    #pragma GCC diagnostic ignored "-Wunused-function"            |
+------------------------------------------------------------------+
| 4. Managed runtime boilerplate                                   |
|    (static __nv_inited_managed_rt, __nv_init_managed_rt, etc.)   |
+------------------------------------------------------------------+
| 5. #pragma GCC diagnostic pop                                    |
+------------------------------------------------------------------+
| 6. #pragma GCC diagnostic ignored "-Wunused-variable"            |
|    #pragma GCC diagnostic ignored "-Wunused-private-field"       |
|    #pragma GCC diagnostic ignored "-Wunused-parameter"           |
+------------------------------------------------------------------+
| 7. Extended lambda macro definitions (or #define false stubs)    |
+------------------------------------------------------------------+
| 8. MAIN BODY: transformed C++ from source sequence walk          |
|    - #include "crt/host_runtime.h" (injected at first CUDA type) |
|    - Device stubs for __global__ kernels                         |
|    - #if 0 / #endif around device-only code                     |
|    - All host-visible declarations, types, functions             |
+------------------------------------------------------------------+
| 9. Empty file guard (if no entities generated)                   |
+------------------------------------------------------------------+
| 10. Breakpoint placeholders (debug builds only)                  |
+------------------------------------------------------------------+
| 11. _NV_ANON_NAMESPACE define / include / undef trick            |
+------------------------------------------------------------------+
| 12. #pragma pack() (MSVC only)                                   |
+------------------------------------------------------------------+
| 13. Module ID file output (if dword_106BFB8 set)                 |
+------------------------------------------------------------------+
| 14. Host reference arrays (.nvHRKI, .nvHRDE, etc.)               |
+------------------------------------------------------------------+

Section 1: Initial #line Directive

After opening the output stream, sub_489000 emits a #line directive via sub_46D1A0 to establish the initial source mapping. This directive points the host compiler's diagnostic messages back to the original .cu file:

// sub_489000, decompiled lines 283-287
sub_46D1A0(v10, v11);  // emit #line <number> "<filename>"

The #line directive format depends on the host compiler. For GCC/Clang hosts (dword_126E1F8 set), the line keyword is omitted (producing # 1 "file.cu"). For MSVC hosts (dword_126E1D8 set), the full #line 1 "file.cu" form is used. This pattern recurs throughout the file wherever source position changes.
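The two directive spellings differ only in the keyword after the hash sign. A minimal sketch, assuming a hypothetical helper format_line_directive whose gnu_style parameter models dword_126E1F8 being set; path escaping, which the real emitter performs through a path transform, is deliberately left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats a line directive into buf.
 * gnu_style != 0 -> "# N \"file\"", gnu_style == 0 -> "#line N \"file\"".
 * No escaping is applied to the file name in this sketch. */
static int format_line_directive(char *buf, size_t size, int gnu_style,
                                 unsigned line, const char *file) {
    assert(buf != NULL && file != NULL);
    return snprintf(buf, size, "#%s %u \"%s\"",
                    gnu_style ? "" : "line", line, file);
}
```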

Section 2-6: Diagnostic Suppressions

The preamble contains a layered set of #pragma GCC diagnostic directives that suppress warnings the host compiler would otherwise emit on the generated code. The exact set depends on which host compiler is active and its version.

Suppression Decisions

The conditions controlling each suppression are checked against host compiler identification globals:

| Global | Meaning |
|---|---|
| dword_126E1E8 | Host is Clang |
| dword_126E1F8 | Host is GCC (including Clang in GCC-compat mode) |
| dword_126E1D8 | Host is MSVC |
| qword_126EF90 | Clang version number |
| qword_126E1F0 | GCC/Clang version number |
| dword_106BF6C | Alternative host compiler mode |
| dword_106BF68 | Secondary host compiler flag |

-Wunused-local-typedefs

Emitted early, outside any push/pop scope:

// sub_489000, decompiled lines 182-187
if ((dword_126E1E8 && qword_126EF90 > 0x7787)   // Clang > 30599
    || (!dword_106BF6C && !dword_106BF68
        && dword_126E1F8 && qword_126E1F0 > 0x9F5F))  // GCC > 40799
{
    emit("#pragma GCC diagnostic ignored \"-Wunused-local-typedefs\"");
}

This targets GCC 4.8+ and Clang 3.6+: the thresholds 40799 and 30599 are the highest values that still denote 4.7.x and 3.5.x, so the strict comparisons admit 4.8.0 and 3.6.0 onward. CUDA template machinery frequently creates local typedefs that are used only by device code (suppressed in #if 0 blocks), triggering spurious warnings.
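The hexadecimal thresholds are easier to read once decoded. The sketch below assumes the version globals use the encoding major*10000 + minor*100 + patch, which is consistent with the decimal values in the comments but is not confirmed independently; encode_version and wants_unused_local_typedefs_pragma are hypothetical names, and the parameters stand in for the globals.

```c
#include <assert.h>

/* Assumed encoding: major*10000 + minor*100 + patch (e.g. 4.8.0 -> 40800). */
static long encode_version(int major, int minor, int patch) {
    return major * 10000L + minor * 100L + patch;
}

/* Mirrors the condition quoted above; parameters replace the globals. */
static int wants_unused_local_typedefs_pragma(int is_clang, long clang_version,
                                              int is_gnu, long gnu_version,
                                              int alt_host_mode, int secondary_flag) {
    if (is_clang && clang_version > 0x7787)        /* 30599 */
        return 1;
    return !alt_host_mode && !secondary_flag
        && is_gnu && gnu_version > 0x9F5F;         /* 40799 */
}
```

Under that encoding, 0x7787 corresponds to 3.5.99 and 0x9F5F to 4.7.99, so the strict greater-than comparisons first succeed at 3.6.0 and 4.8.0 respectively.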

-Wattributes

// sub_489000, decompiled lines 188-189
if (dword_126EFA8 && dword_106C07C)
    emit("\n#pragma GCC diagnostic ignored \"-Wattributes\"\n");

Suppresses warnings about unknown or ignored __attribute__ annotations. Emitted when CUDA-specific attribute processing is active (dword_126EFA8) and a secondary flag (dword_106C07C) indicates the host compiler would reject CUDA-specific attributes.

Push/Pop Block with -Wunused-variable and -Wunused-function

The managed runtime boilerplate (section 4) is wrapped in a diagnostic push/pop block:

// sub_489000, decompiled lines 190-234
emit("#pragma GCC diagnostic push");
emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");
emit("#pragma GCC diagnostic ignored \"-Wunused-function\"");

// ... managed runtime boilerplate here ...

emit("#pragma GCC diagnostic pop");

The push/pop scope isolates these suppressions to the managed runtime code. The conditions for emitting this block check Clang presence (dword_126E1E8), or GCC version > 40599 (qword_126E1F0 > 0x9E97). The managed runtime functions are static and may be unused in translation units without __managed__ variables.

Post-Pop File-Level Suppressions

After the pop, additional file-scoped suppressions are emitted that remain active for the rest of the file:

// sub_489000, decompiled lines 243-250
emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"\n");

if (dword_126E1E8) {  // Clang only
    emit("#pragma GCC diagnostic ignored \"-Wunused-private-field\"\n");
    emit("#pragma GCC diagnostic ignored \"-Wunused-parameter\"\n");
}

The -Wunused-private-field and -Wunused-parameter suppressions are Clang-specific. GCC does not have -Wunused-private-field, and GCC's -Wunused-parameter behavior differs.

Summary of All Suppressions

| Warning | Scope | Host Compiler | Version Threshold |
|---|---|---|---|
| -Wunused-local-typedefs | File-level | Clang, GCC | Clang > 30599, GCC > 40799 |
| -Wattributes | File-level | GCC/Clang | When CUDA attrs active |
| -Wunused-variable | Push/pop block | Clang, GCC > 40599 | Around managed RT only |
| -Wunused-function | Push/pop block | Clang, GCC > 40599 | Around managed RT only |
| -Wunused-variable | File-level | Clang, GCC >= 40199 | Rest of file |
| -Wunused-private-field | File-level | Clang only | Always |
| -Wunused-parameter | File-level | Clang only | Always |

Section 7: Extended Lambda Macros

When extended lambda mode is NOT active (dword_106BF38 == 0), three stub macros are defined:

// sub_489000, decompiled lines 259-264
emit("#define __nv_is_extended_device_lambda_closure_type(X) false\n");
emit("#define __nv_is_extended_host_device_lambda_closure_type(X) false\n");
emit("#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false\n");
emit("#if defined(__nv_is_extended_device_lambda_closure_type)"
     " && defined(__nv_is_extended_host_device_lambda_closure_type)"
     "&& defined(__nv_is_extended_device_lambda_with_preserved_return_type)\n"
     "#endif\n");

These macros are consumed by crt/host_runtime.h to conditionally compile lambda wrapper infrastructure. When extended lambdas are disabled, all three evaluate to false, causing the runtime header to skip lambda wrapper code. The #if defined(...) && defined(...) block that immediately follows has an empty body, so it emits no code and cannot by itself raise an error, even if some other header has #undef'd the macros. Its only visible effect is to reference all three macro names; a plausible reading is that this marks them as used for host compilers that warn about unused macros, but the binary does not confirm the intent.

When extended lambda mode IS active (dword_106BF38 != 0), these defines are skipped entirely. The lambda preamble injection system (via sub_6BCC20) provides the real implementations later in the main body.

Section 8: Main Body -- Source Sequence Walk

The main body is generated by iterating the global source sequence list (qword_1065748), which is a linked list of EDG IL entries representing every top-level declaration in the translation unit. For each entry, the backend dispatches to sub_47ECC0 (gen_template / process_source_sequence), which handles all declaration kinds:

// sub_489000, decompiled lines 288-316 (simplified)
while (qword_1065748) {
    entry = qword_1065748;
    kind = entry->kind;  // byte at offset +16

    if (kind == 57) {
        // Pragma interleaving -- handled inline
        handle_pragma(entry);
    } else if (kind == 52) {
        // End-of-construct -- should not appear at top level
        fatal_error("Top-level end-of-construct entry");
    } else {
        entities_generated = 1;
        sub_47ECC0(0);  // gen_template at recursion level 0
    }
}

During this walk, several CUDA-specific injections occur:

  1. #include "crt/host_runtime.h" -- injected by sub_4864F0 (gen_type_decl) or sub_47ECC0 when the first CUDA-tagged entity at global scope is encountered. The flag dword_E85700 prevents duplicate inclusion.

  2. Device stub pairs -- __global__ kernel functions trigger two calls to gen_routine_decl (sub_47BFD0): first the forwarding body, then the static cudaLaunchKernel placeholder, controlled by the dword_1065850 toggle.

  3. #if 0 / #endif guards -- device-only declarations are wrapped in preprocessor guards to hide them from the host compiler.

  4. Interleaved pragmas -- source sequence entries of kind 57 represent #pragma directives from the original source (including #pragma pack, #pragma STDC, and user pragmas), which are re-emitted at their original positions.
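The dispatch rule of the walk (pragma entries handled inline, end-of-construct entries rejected, everything else counted as a generated entity) can be modeled on a plain linked list. This is a sketch with hypothetical types: seq_entry and walk_stats are invented for the example, and the real entries carry far more state than a kind byte.

```c
#include <assert.h>
#include <stddef.h>

enum { KIND_END_OF_CONSTRUCT = 52, KIND_PRAGMA = 57 };  /* kind codes from the walk above */

/* Hypothetical, minimal stand-in for a source sequence entry. */
typedef struct seq_entry {
    unsigned char kind;
    struct seq_entry *next;
} seq_entry;

typedef struct walk_stats {
    int pragmas;
    int declarations;
    int entities_generated;  /* the flag later tested by the empty-file guard */
    int error;               /* set if an end-of-construct entry appears at top level */
} walk_stats;

static walk_stats walk_source_sequence(const seq_entry *cursor) {
    walk_stats s = {0, 0, 0, 0};
    while (cursor != NULL) {
        if (cursor->kind == KIND_PRAGMA) {
            s.pragmas++;
        } else if (cursor->kind == KIND_END_OF_CONSTRUCT) {
            s.error = 1;     /* the backend raises a fatal error at this point */
            break;
        } else {
            s.entities_generated = 1;
            s.declarations++;
        }
        cursor = cursor->next;  /* explicit here; the backend presumably advances
                                   its global cursor inside the callees */
    }
    return s;
}
```

A list containing only pragma entries leaves entities_generated at 0, which is the situation the empty file guard exists for.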

Section 9: Empty File Guard

If the source sequence walk produced no entities (v12 == 0) and the compilation is not in pure CUDA mode (dword_126EFB4 != 2), a dummy declaration is emitted to prevent the host compiler from rejecting an empty translation unit:

// sub_489000, decompiled lines 565-569
if (!entities_generated && dword_126EFB4 != 2) {
    emit("int __dummy_to_avoid_empty_file;");
    newline();
}

Some host compilers (notably older GCC versions) produce warnings or errors on completely empty .c files. The int __dummy_to_avoid_empty_file; line is a minimal valid C/C++ declaration that avoids this.

Section 10: Breakpoint Placeholders

When the deferred function list (qword_1065840) is non-empty, the backend emits one breakpoint placeholder function per entry. These are used for debugger support in whole-program compilation mode:

// sub_489000, decompiled lines 573-651 (simplified)
node = qword_1065840;  // linked list of deferred functions
index = 0;
while (node) {
    emit("static __attribute__((used)) void __nv_breakpoint_placeholder");
    emit_decimal(index);
    putc('_', stream);
    if (node->name)
        emit(node->name);
    emit("(void) ");

    // Set source position from node
    set_source_position(node->source_start);
    emit("{ ");
    set_source_position(node->source_end);
    emit("exit(0); }");

    node = node->next;
    index++;
}

Each placeholder has the form static __attribute__((used)) void __nv_breakpoint_placeholderN_funcname(void) { exit(0); }. The __attribute__((used)) tells the host compiler to keep these static functions even though nothing references them. The debugger uses their addresses to set breakpoints on device functions that have been stripped from the host binary.

The deferred list is populated by gen_routine_decl when dword_106BFBC (whole-program mode) is set and dword_106BFDC is clear -- device-only functions that need host-side breakpoint anchors are pushed onto this list rather than receiving dummy bodies inline.
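The text of a single placeholder can be reproduced with one format string. A minimal sketch, assuming a hypothetical helper format_breakpoint_placeholder that writes into a caller-supplied buffer instead of the output stream and ignores source position tracking:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds one placeholder definition. A NULL name yields an empty suffix,
 * matching the conditional name emission in the loop above. */
static int format_breakpoint_placeholder(char *buf, size_t size,
                                         int index, const char *name) {
    assert(buf != NULL);
    return snprintf(buf, size,
        "static __attribute__((used)) void "
        "__nv_breakpoint_placeholder%d_%s(void) { exit(0); }",
        index, name ? name : "");
}
```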

Section 11: _NV_ANON_NAMESPACE Trick

The trailer contains a five-step sequence that handles C++ anonymous namespace mangling for CUDA. Anonymous namespaces in C++ create translation-unit-local symbols, but CUDA device code requires globally unique symbol names (because device code from multiple TUs is linked together by the device linker). The _NV_ANON_NAMESPACE mechanism assigns a globally unique identifier to each TU's anonymous namespace.

Step-by-Step Emission

// sub_489000, decompiled lines 654-710

// Step 1: #line back to original source
emit("#");
if (!dword_126E1F8)  // MSVC: include "line" keyword
    emit("line");
emit(" 1 \"");
emit(path_transform(qword_106BF88));  // original source file path
emit("\"");

// Step 2: #define _NV_ANON_NAMESPACE <hash>
emit("#define ");
emit("_NV_ANON_NAMESPACE");
emit(" ");
emit(sub_6BC7E0());   // generate unique hash string
newline();

// Step 3: #ifdef / #endif (force inclusion check)
emit("#ifdef ");
emit("_NV_ANON_NAMESPACE");
newline();
emit("#endif");
newline();

// Step 3b: #pragma pack() for MSVC
if (dword_126E1D8) {   // MSVC host
    emit("#pragma pack()");
    newline();
}

// Step 4: #include "<original_file>"
emit("#");
if (!dword_126E1F8)
    emit("line");
emit(" 1 \"");
emit(path_transform(qword_106BF88));
emit("\"");
newline();
emit("#include ");
emit("\"");
emit(path_transform(qword_106BF88));
emit("\"");
newline();

// Step 5: Reset #line and #undef
emit("#");
if (!dword_126E1F8)
    emit("line");
emit(" 1 \"");
emit(path_transform(qword_106BF88));
emit("\"");
newline();
emit("#undef ");
emit("_NV_ANON_NAMESPACE");
newline();

The Hash Generator (sub_6BC7E0)

The _NV_ANON_NAMESPACE value is produced by sub_6BC7E0, which constructs the string _GLOBAL__N_ followed by the module ID hash:

// sub_6BC7E0 (20 lines)
if (cached_result)
    return cached_result;

char *module_id = sub_5AF830(0);   // compute CRC32-based module ID
size_t len = strlen(module_id);
char *result = allocate(len + 12);
strcpy(result, "_GLOBAL__N_");
strcpy(result + 11, module_id);
cached_result = result;
return result;

The module ID (sub_5AF830) is a CRC32-based hash incorporating the source filename, compiler options, file modification time, and process ID. This produces values like _GLOBAL__N_1a2b3c4d5e6f7890, unique enough to avoid collisions between TUs. Because the modification time and process ID are among the inputs, the value should not be assumed to repeat across otherwise identical builds.
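For orientation, the sketch below pairs a standard CRC-32 routine with the prefixing step. It rests on loud assumptions: the source only says "CRC32-based", so the reflected polynomial 0xEDB88320 used here is a guess at the variant, and how the real make_module_id combines and formats its inputs is not reproduced. The names crc32_update and make_anon_namespace_name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Standard reflected CRC-32 (polynomial 0xEDB88320), one bit at a time.
 * Assumption: the exact CRC variant used by the binary is not confirmed.
 * Pass the previous return value as crc to hash several inputs in sequence. */
static uint32_t crc32_update(uint32_t crc, const char *data, size_t len) {
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= (unsigned char)data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Builds "_GLOBAL__N_" + module_id into out.
 * Returns 0 on success, -1 if out is too small. */
static int make_anon_namespace_name(char *out, size_t size, const char *module_id) {
    assert(out != NULL && module_id != NULL);
    int n = snprintf(out, size, "_GLOBAL__N_%s", module_id);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

Chaining crc32_update over several strings gives the same result as hashing their concatenation, which is one simple way a hash could fold in a filename, an option string, and a timestamp in turn.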

Why the Define/Include/Undef Sequence

The three-step define/include/undef pattern serves a specific purpose:

  1. #define _NV_ANON_NAMESPACE <hash> -- establishes the macro before the source file is re-included.

  2. #include "<original_file>" -- re-includes the original .cu source. During this second inclusion, any code inside anonymous namespaces that uses _NV_ANON_NAMESPACE gets the unique hash substituted, producing globally unique symbol names for device code.

  3. #undef _NV_ANON_NAMESPACE -- cleans up the macro after inclusion.

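The #ifdef _NV_ANON_NAMESPACE / #endif block between define and include has an empty body, so it cannot stop compilation if the macro is missing. It only references the macro, which plausibly keeps host compilers from reporting it as unused when the re-included source contains no anonymous namespace; the binary does not confirm the intent.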

This mechanism works in conjunction with the EDG frontend's anonymous namespace handling. When the frontend encounters namespace { ... } containing device code, it generates references to _NV_ANON_NAMESPACE that become concrete identifiers during the re-inclusion pass. The name mangling in the demangler (sub_7CA140, sub_7C5650, sub_7C4E80) also uses _NV_ANON_NAMESPACE to produce consistent mangled names.

Section 12: #pragma pack() for MSVC

When the host compiler is MSVC (dword_126E1D8 set), a bare #pragma pack() is emitted to reset the packing alignment to the compiler default:

// sub_489000, decompiled lines 676-681
if (dword_126E1D8) {
    emit("#pragma pack()");
    newline();
}

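This reset ensures that any #pragma pack(N) directives from the original source or from included CUDA headers do not leak into the code that follows, in particular the re-inclusion of the original source file in Step 4. Packing state never crosses translation-unit boundaries, so later translation units are unaffected either way. On GCC/Clang, the #pragma pack() push/pop mechanism is typically handled differently, so this emission is MSVC-specific.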

Section 13-14: Module ID and Host Reference Arrays

The final two sections are conditional:

Module ID output (sub_5B0180): When dword_106BFB8 is set, the module ID string (the same CRC32-based hash from sub_5AF830) is written to a separate file. This ID is used by the CUDA runtime to match host-side registration code with the device fatbinary.

Host reference arrays (sub_6BCF80): When dword_106BFD0 (device registration) or dword_106BFCC (constant registration) is set, six calls to sub_6BCF80 emit ELF section declarations for host reference arrays:

// sub_489000, decompiled lines 713-721
// nv_emit_host_reference_array(emit_fn, is_kernel, is_device, is_internal)
sub_6BCF80(emit_callback, 1, 0, 1);  // kernel,   internal  -> .nvHRKI
sub_6BCF80(emit_callback, 1, 0, 0);  // kernel,   external  -> .nvHRKE
sub_6BCF80(emit_callback, 0, 1, 1);  // device,   internal  -> .nvHRDI
sub_6BCF80(emit_callback, 0, 1, 0);  // device,   external  -> .nvHRDE
sub_6BCF80(emit_callback, 0, 0, 1);  // constant, internal  -> .nvHRCI
sub_6BCF80(emit_callback, 0, 0, 0);  // constant, external  -> .nvHRCE

These produce extern "C" declarations with __attribute__((section(".nvHRXX"))) annotations, where XX is one of KE, KI, DE, DI, CE, CI (Kernel/Device/Constant + External/Internal). The arrays contain mangled names of device symbols, enabling the CUDA runtime to locate and register them at program startup.
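The mapping from the three boolean arguments to a section name is regular enough to compute. A minimal sketch, with host_ref_section_name as a hypothetical helper; it reproduces only the naming scheme shown in the six calls above, not the array contents.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes ".nvHR" + {K,D,C} + {I,E} into buf, which must hold at least 8 bytes.
 * Kernel takes precedence over device; neither flag set means constant. */
static void host_ref_section_name(char buf[8], int is_kernel, int is_device,
                                  int is_internal) {
    assert(buf != NULL);
    char category = is_kernel ? 'K' : (is_device ? 'D' : 'C');
    char linkage  = is_internal ? 'I' : 'E';
    snprintf(buf, 8, ".nvHR%c%c", category, linkage);
}
```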

Complete Example

For a source file kernel.cu containing a single __global__ kernel function and a host function, the generated kernel.cu.int.c looks approximately like this:

# 1 "kernel.cu"
#pragma GCC diagnostic ignored "-Wunused-local-typedefs"
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-function"
static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) {
    __nv_fatbinhandle_for_managed_rt = in;
}
static char __nv_init_managed_rt_with_module(void **);
static inline void __nv_init_managed_rt(void) {
    __nv_inited_managed_rt = (__nv_inited_managed_rt
        ? __nv_inited_managed_rt
        : __nv_init_managed_rt_with_module(
            __nv_fatbinhandle_for_managed_rt));
}
#pragma GCC diagnostic pop
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-private-field"
#pragma GCC diagnostic ignored "-Wunused-parameter"
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
#if defined(__nv_is_extended_device_lambda_closure_type) \
 && defined(__nv_is_extended_host_device_lambda_closure_type) \
 && defined(__nv_is_extended_device_lambda_with_preserved_return_type)
#endif

/* === main body begins here === */
#include "crt/host_runtime.h"

# 5 "kernel.cu"
void host_function(int *data, int n) {
    for (int i = 0; i < n; i++) data[i] *= 2;
}
# 10 "kernel.cu"
void my_kernel(float *data, int n) {
    ::__wrapper__device_stub_my_kernel(data, n);
    return;
}
#if 0
/* original __global__ kernel body suppressed */
#endif
static void __wrapper__device_stub_my_kernel(float *data, int n) {
    ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
/* === main body ends === */

# 1 "kernel.cu"
#define _NV_ANON_NAMESPACE _GLOBAL__N_a1b2c3d4e5f67890
#ifdef _NV_ANON_NAMESPACE
#endif
# 1 "kernel.cu"
#include "kernel.cu"
# 1 "kernel.cu"
#undef _NV_ANON_NAMESPACE

Initialization State

Before emitting any output, sub_489000 zeroes all output-related global state and clears four large hash tables with memset: three of 524,256 bytes (just under 512 KB) and one of 393,192 bytes (just under 384 KB). It also sets up a function pointer table (xmmword_1065760 through xmmword_10657B0) containing code generation callbacks:

// sub_489000, decompiled lines 62-97 (summarized)
dword_1065834 = 0;         // indent level
stream = NULL;             // output file handle
dword_1065820 = 0;         // line counter
dword_106581C = 0;         // column counter
dword_1065818 = 0;         // needs-line-directive
qword_1065748 = 0;         // source sequence cursor
qword_1065740 = 0;         // alternate cursor
dword_1065850 = 0;         // device stub mode

// Clear four hash tables (three just under 512KB, one just under 384KB)
memset(&unk_FE5700, 0, 0x7FFE0);   // 524,256 bytes
memset(&unk_F65720, 0, 0x7FFE0);
memset(qword_E85720, 0, 0x7FFE0);
memset(&xmmword_F05720, 0, 0x5FFE8);  // 393,192 bytes (smaller)

// Callback setup
if (!dword_126DFF0)                     // not MSVC mode
    qword_10657C0 = sub_46BEE0;        // gen_be callback
qword_10657C8 = loc_469200;            // line directive callback
qword_10657D0 = sub_466F40;            // output callback
qword_10657D8 = sub_4686C0;            // error callback

#line Directive Protocol

Throughout the file, #line directives maintain the mapping between generated output and original source positions. The emission protocol differs by host compiler:

| Host Compiler | #line Format | Example |
|---|---|---|
| GCC / Clang | # <line> "<file>" | # 42 "kernel.cu" |
| MSVC | #line <line> "<file>" | #line 42 "kernel.cu" |

The dword_1065818 flag (needs_line_directive) is set whenever the current source position changes. Before emitting the next declaration or statement, sub_467DA0 checks this flag and emits a #line directive if needed, then clears the flag. The source position is tracked in two globals: qword_1065810 (pending position) and qword_126EDE8 (current position).
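The deferred-directive protocol amounts to a dirty flag that is flushed just before the next piece of output. The sketch below models it with an in-memory buffer; the emitter struct and the em_ functions are hypothetical, and the two-global position tracking is collapsed into a single pending line number.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical emitter state; fields loosely mirror the globals named above. */
typedef struct emitter {
    char out[256];             /* output accumulates here instead of a FILE* */
    size_t used;
    int needs_line_directive;  /* models dword_1065818 */
    unsigned pending_line;     /* models the pending source position */
    const char *file;
    int gnu_style;             /* 1: "# N", 0: "#line N" */
} emitter;

static void em_append(emitter *e, const char *text) {
    size_t n = strlen(text);
    assert(e->used + n < sizeof e->out);
    memcpy(e->out + e->used, text, n + 1);
    e->used += n;
}

/* Record a new source position; the directive itself is deferred. */
static void em_set_position(emitter *e, unsigned line) {
    e->pending_line = line;
    e->needs_line_directive = 1;
}

/* Emit one line of generated code, flushing a pending directive first. */
static void em_emit_line(emitter *e, const char *code) {
    if (e->needs_line_directive) {
        char d[128];
        snprintf(d, sizeof d, "#%s %u \"%s\"\n",
                 e->gnu_style ? "" : "line", e->pending_line, e->file);
        em_append(e, d);
        e->needs_line_directive = 0;
    }
    em_append(e, code);
    em_append(e, "\n");
}
```

Deferring the directive means several position changes with no output in between collapse into one #line, which keeps the generated file compact.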

Function Map

| Address | Name | Role |
|---|---|---|
| sub_489000 | process_file_scope_entities | Backend entry point; orchestrates entire .int.c emission |
| sub_47ECC0 | gen_template / process_source_sequence | Walks source sequence, dispatches all declaration kinds |
| sub_47BFD0 | gen_routine_decl | Function declaration/definition generator; kernel stub logic |
| sub_4864F0 | gen_type_decl | Type declaration generator; injects #include "crt/host_runtime.h" |
| sub_484A40 | gen_variable_decl | Variable declaration generator; managed memory registration |
| sub_467E50 | (emit string) | Primary string emission to output stream |
| sub_468190 | (emit raw string) | Raw string emission without line directive check |
| sub_46BC80 | (emit directive) | Emits #if / #endif preprocessor lines |
| sub_467DA0 | (emit line directive) | Conditionally emits #line when dword_1065818 is set |
| sub_467D60 | (emit newline) | Emits newline and flushes pending line directive |
| sub_46CF20 | (emit source position) | Sets source position for next #line directive |
| sub_5ADD90 | (string concat) | Concatenates input filename with .int.c extension |
| sub_4F48F0 | (file open) | Opens output file for writing (mode 1701) |
| sub_6BC7E0 | (anon namespace hash) | Generates _GLOBAL__N_<module_id> string |
| sub_5AF830 | make_module_id | CRC32-based unique TU identifier |
| sub_5B0180 | write_module_id_to_file | Writes module ID to separate file |
| sub_6BCF80 | nv_emit_host_reference_array | Emits .nvHRKE/.nvHRDI/etc. ELF sections |
| sub_4F7B10 | (file close) | Closes output stream (mode 1701) |

Cross-References

CUDA Runtime Boilerplate

Every .int.c file emitted by cudafe++ contains a fixed block of CUDA runtime initialization code, injected unconditionally before the main body. This boilerplate implements lazy initialization of the CUDA managed memory runtime and defines macro stubs for the extended lambda detection system. The managed runtime block is always emitted regardless of whether the translation unit uses __managed__ variables -- the static flag __nv_inited_managed_rt ensures the runtime is initialized at most once, and the static linkage prevents symbol conflicts across translation units. The lambda detection macros provide a compile-time protocol between cudafe++ and crt/host_runtime.h: the runtime header inspects these macros to decide whether to compile lambda wrapper infrastructure.

Key Facts

| Property | Value |
| --- | --- |
| Emitter function | sub_489000 (process_file_scope_entities, line 218) |
| Managed RT string address | 0x83AAC8 (243 bytes) |
| Init function string address | 0x83ABC0 (210 bytes) |
| Managed access wrapper string | 0x839570 (65 bytes) |
| Access wrapper emitters | sub_4768F0 (gen_name_ref, xref at 0x476DCF), sub_484940 (gen_variable_name, xref at 0x484A08) |
| Lambda stub macros string | 0x83AD10, 0x83AD50, 0x83AD98 |
| Lambda existence check string | 0x83ADE8 (194 bytes) |
| Extended lambda mode flag | dword_106BF38 (extended_lambda_mode) |
| Alternative host flag | dword_106BF6C (alternative_host_compiler_mode) |
| __cudaPushCallConfiguration lookup | sub_511D40 (scan_expr_full), string at 0x899213 |
| Push config error message | 0x88CA48, error code 3654 |
| Managed variable detection | (*(_WORD *)(entity + 148) & 0x101) == 0x101 |
| EDG source file | cp_gen_be.c |

Managed Memory Runtime Initialization

Static Variables Block

The first emission at line 218 of sub_489000 outputs four declarations as a single string literal:

static char __nv_inited_managed_rt = 0;
static void **__nv_fatbinhandle_for_managed_rt;
static void __nv_save_fatbinhandle_for_managed_rt(void **in) {
    __nv_fatbinhandle_for_managed_rt = in;
}
static char __nv_init_managed_rt_with_module(void **);

These are emitted verbatim from a single string at 0x83AAC8:

"static char __nv_inited_managed_rt = 0; static void **__nv_fatbinhandle_for_managed_rt;
 static void __nv_save_fatbinhandle_for_managed_rt(void **in)
 {__nv_fatbinhandle_for_managed_rt = in;} static char __nv_init_managed_rt_with_module(void **);"

Each component serves a specific role:

| Symbol | Type | Purpose |
| --- | --- | --- |
| __nv_inited_managed_rt | static char | Guard flag: 0 = not initialized, nonzero = initialized |
| __nv_fatbinhandle_for_managed_rt | static void** | Cached fatbinary handle, set during __cudaRegisterFatBinary |
| __nv_save_fatbinhandle_for_managed_rt | static void (void**) | Stores the fatbin handle for later use by the init function |
| __nv_init_managed_rt_with_module | static char (void**) | Forward declaration -- defined by crt/host_runtime.h |

The forward declaration of __nv_init_managed_rt_with_module is critical: this function is provided by the CUDA runtime headers (crt/host_runtime.h) and performs the actual CUDA runtime API calls to register managed variables with the unified memory system. By forward-declaring it here, the managed runtime boilerplate can reference it before the header is #included later in the file.

Lazy Initialization Function

Immediately after the static block, sub_489000 emits the __nv_init_managed_rt inline function. The emission has a conditional prefix:

// sub_489000, decompiled lines 221-224
if (dword_106BF6C)   // alternative host compiler mode
    emit("__attribute__((unused)) ");

emit(" static inline void __nv_init_managed_rt(void) {"
     " __nv_inited_managed_rt = (__nv_inited_managed_rt"
     " ? __nv_inited_managed_rt"
     "                 : __nv_init_managed_rt_with_module("
     "__nv_fatbinhandle_for_managed_rt));}");

When dword_106BF6C (alternative host compiler mode) is set, the function is prefixed with __attribute__((unused)) to suppress "defined but not used" warnings on host compilers that do not understand CUDA semantics.

The emitted function, reformatted for readability:

static inline void __nv_init_managed_rt(void) {
    __nv_inited_managed_rt = (
        __nv_inited_managed_rt
            ? __nv_inited_managed_rt
            : __nv_init_managed_rt_with_module(
                  __nv_fatbinhandle_for_managed_rt)
    );
}

This is a lazy initialization pattern. On the first call, __nv_inited_managed_rt is 0 (falsy), so the conditional operator takes the false branch and calls __nv_init_managed_rt_with_module. That function performs CUDA runtime registration and returns a nonzero value, which is stored back into __nv_inited_managed_rt. On subsequent calls the flag is nonzero, so only the true branch is evaluated and the existing value is stored back without re-initializing. The function is declared inline so the host compiler can inline it at every managed variable access site, and static so that each translation unit gets its own copy without symbol collisions.
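The guard can be exercised in isolation. The following is a minimal sketch, not the emitted code: the names drop the reserved __nv_ prefix, and init_managed_rt_with_module is a hypothetical stand-in that only counts its invocations and returns 1 instead of calling into the CUDA runtime.

```cpp
#include <cassert>

// Hypothetical stand-ins for the emitted symbols; the real ones carry the
// reserved __nv_ prefix and live in the generated .int.c file.
static char inited_managed_rt = 0;
static void **fatbinhandle_for_managed_rt = nullptr;
static int init_calls = 0;  // instrumentation for this sketch only

// Stand-in for __nv_init_managed_rt_with_module: pretends that registration
// succeeded and reports a nonzero status.
static char init_managed_rt_with_module(void **handle) {
    (void)handle;
    ++init_calls;
    return 1;
}

// Same shape as the emitted __nv_init_managed_rt: the conditional operator
// evaluates only the selected operand, so the call happens only while the
// flag is still 0.
static inline void init_managed_rt() {
    inited_managed_rt = inited_managed_rt
        ? inited_managed_rt
        : init_managed_rt_with_module(fatbinhandle_for_managed_rt);
}

// Runs the guard twice and reports how often the stand-in initializer ran.
static int demo_lazy_init() {
    init_managed_rt();
    init_managed_rt();
    assert(inited_managed_rt != 0);
    return init_calls;
}
```

Note that the guard is a plain char with no synchronization, so the sketch (like the emitted pattern as described here) makes no claim about concurrent first accesses.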

Runtime Registration Flow

The complete managed memory initialization sequence spans the compilation pipeline:

1. cudafe++ emits __nv_save_fatbinhandle_for_managed_rt() definition
2. cudafe++ emits forward decl of __nv_init_managed_rt_with_module()
3. cudafe++ emits __nv_init_managed_rt() with lazy init pattern
4. #include "crt/host_runtime.h" provides __nv_init_managed_rt_with_module()
5. __cudaRegisterFatBinary() calls __nv_save_fatbinhandle_for_managed_rt()
   to cache the fatbin handle
6. First access to any __managed__ variable triggers __nv_init_managed_rt()
7. __nv_init_managed_rt_with_module() calls __cudaRegisterManagedVariable()
   for every __managed__ variable in the TU

Managed Variable Access Transformation

When the backend encounters a reference to a __managed__ variable during code generation, it wraps the access in a comma-operator expression that forces lazy initialization. This transformation is performed by two functions:

  • sub_4768F0 (gen_name_ref, xref at 0x476DCF) -- handles qualified name references
  • sub_484940 (gen_variable_name, xref at 0x484A08) -- handles direct variable name emission

Detection Condition

Both functions detect __managed__ variables using the same bitfield test:

// sub_484940, decompiled line 11
if ((*(_WORD *)(entity + 148) & 0x101) == 0x101)

This tests two bits simultaneously as a 16-bit word read at offset 148:

| Byte | Bit | Mask | Meaning |
| --- | --- | --- | --- |
| +148 | bit 0 | 0x01 | __device__ memory space |
| +149 | bit 0 | 0x01 (reads as 0x100 in word) | __managed__ flag |

The combined mask 0x101 matches when both __device__ and __managed__ are set. The __managed__ attribute handler (sub_40E0D0, apply_nv_managed_attr) always sets both bits: __managed__ implies the variable resides in device global memory (__device__), with the additional unified-memory semantics.
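A small sketch of the same test over a raw byte buffer may make the byte/word relationship concrete. The entity is modelled as an opaque byte vector because only offsets +148 and +149 are known; make_entity is a hypothetical helper, and the word is reassembled explicitly as little-endian (the binary is x86-64).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Offset and mask taken from the decompiled test; everything else about the
// entity layout is unknown, so the entity is just a byte buffer here.
constexpr std::size_t kFlagsOffset = 148;
constexpr std::uint16_t kManagedMask = 0x101;

// Rebuilds the little-endian 16-bit word that *(_WORD *)(entity + 148) reads,
// then requires both the __device__ and __managed__ bits.
bool is_managed_variable(const std::vector<std::uint8_t> &entity) {
    assert(entity.size() >= kFlagsOffset + 2);
    const std::uint16_t word = static_cast<std::uint16_t>(
        entity[kFlagsOffset] | (entity[kFlagsOffset + 1] << 8));
    return (word & kManagedMask) == kManagedMask;
}

// Hypothetical helper: a zeroed entity with the two flag bytes filled in.
std::vector<std::uint8_t> make_entity(std::uint8_t byte148, std::uint8_t byte149) {
    std::vector<std::uint8_t> e(256, 0);
    e[148] = byte148;
    e[149] = byte149;
    return e;
}
```

A variable with only the __device__ bit (byte +148) or only the byte +149 bit fails the test; both must be present.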

Emitted Wrapper

When the condition matches, the emitter outputs a prefix string from 0x839570:

(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (

After the variable name is emitted normally, the suffix ))) closes the expression. The complete transformed access for a managed variable managed_var becomes:

(*( (__nv_inited_managed_rt ? (void)0 : __nv_init_managed_rt()), (managed_var)))

Breaking down the expression:

  1. Outer *(...) -- dereferences the result (the managed variable is accessed through a pointer after initialization)
  2. Comma operator (init_expr, (managed_var)) -- evaluates the init expression for its side effect, then yields the variable
  3. Ternary __nv_inited_managed_rt ? (void)0 : __nv_init_managed_rt() -- lazy init guard: if already initialized, the ternary evaluates to (void)0 (no-op). Otherwise, calls __nv_init_managed_rt() which performs runtime registration

This pattern guarantees that initialization has run before any access to any __managed__ variable, regardless of access order, and the guard flag keeps it from running again once it has succeeded. The comma operator sequences the initialization side effect before the variable access.
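The prefix/suffix mechanics can be sketched as a plain string transformation. wrap_managed_access is a hypothetical helper name; the prefix is copied verbatim from the string described above (including the missing space before the colon), and the suffix is the three closing parentheses.

```cpp
#include <cassert>
#include <string>

// Prefix as stored at 0x839570 (65 characters) and the matching suffix.
static const std::string kManagedPrefix =
    "(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (";
static const std::string kManagedSuffix = ")))";

// Hypothetical helper: wraps an already-formatted variable name.
std::string wrap_managed_access(const std::string &name) {
    assert(!name.empty());
    return kManagedPrefix + name + kManagedSuffix;
}
```

The prefix opens three parentheses that it does not close, which is why both emitters track a flag and append exactly three closing parentheses after the name.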

sub_4768F0 (gen_name_ref) -- Qualified Access Path

The name reference generator at sub_4768F0 handles the more complex case where the variable access includes scope qualification (::, template arguments, member access):

// sub_4768F0, decompiled lines 160-163
if (!v7 && a3 == 7 && (*(_WORD *)(v9 + 148) & 0x101) == 0x101) {
    v13 = 1;  // flag: need closing )))
    emit("(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (");
    // ... emit qualified name with scope resolution ...
}

The condition a3 == 7 indicates the entity is a variable (IL entry kind 7). The !v7 check (v7 = a4, the fourth parameter) gates on whether the access is from a context that already handles initialization. The v13 flag tracks whether the closing ))) needs to be emitted after the complete name expression:

// sub_4768F0, decompiled lines 231-236
if (v13) {
    emit(")))");
    return 1;
}

sub_484940 (gen_variable_name) -- Direct Access Path

The direct variable name emitter at sub_484940 follows the same pattern but with a simpler structure:

// sub_484940, decompiled lines 10-15
v1 = 0;
if ((*(_WORD *)(a1 + 148) & 0x101) == 0x101) {
    v1 = 1;  // flag: need closing )))
    emit("(*( (__nv_inited_managed_rt ? (void)0: __nv_init_managed_rt()), (");
}

// ... emit variable name (possibly anonymous, templated, etc.) ...

if (v1) {
    emit(")))");
    return;
}

This function handles three variable name forms:

  1. this-parameter variables (byte +163 bit 7 set) -- emits the literal string "this" (4 characters via an inline loop)
  2. Anonymous variables (byte +165 bit 2 set) -- dispatches to sub_483A80 for generated name emission
  3. Regular variables -- dispatches to sub_472730 (gen_expression_or_name, mode 7)

The managed wrapper is applied around all three forms.

__cudaPushCallConfiguration Lookup

When cudafe++ processes a CUDA kernel launch expression (kernel<<<grid, block, shmem, stream>>>(args...)), the frontend must locate the __cudaPushCallConfiguration runtime function to lower the <<<>>> syntax into standard C++ function calls. This lookup occurs in sub_511D40 (scan_expr_full), the 80KB expression scanner.

Lookup Mechanism

At case 0x48 (decimal 72, the token for kernel launch <<<), the scanner performs a name lookup:

// sub_511D40, decompiled lines 1999-2006
sub_72EEF0("__cudaPushCallConfiguration", 0x1B);   // inject name into scope
v206 = sub_698940(v255, 0);                         // lookup the declaration

if (!v206 || *(_BYTE *)(v206 + 80) != 11) {        // not found or not a function
    sub_4F8200(0x0B, 3654, &qword_126DD38);         // emit error 3654
}

The lookup calls sub_72EEF0 to insert the identifier __cudaPushCallConfiguration (27 bytes, 0x1B) into the current scope context, then sub_698940 performs the actual name resolution. If the declaration is not found (!v206) or the entity at offset +80 is not a function (kind != 11), error 3654 is emitted.
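The acceptance test reduces to a null check plus a kind check. The sketch below models only that decision; LookupResult is a hypothetical struct that keeps just the kind byte read at offset +80, and the function returns the error code instead of emitting a diagnostic.

```cpp
#include <cassert>

// Hypothetical model of the lookup result: only the kind byte that the
// decompiled code reads at offset +80 is represented.
struct LookupResult {
    unsigned char kind;  // 11 = function, per the decompiled check
};

constexpr unsigned char kKindFunction = 11;
constexpr int kErrMissingPushConfig = 3654;

// The length passed to the identifier helper matches the name: 27 == 0x1B.
static_assert(sizeof("__cudaPushCallConfiguration") - 1 == 0x1B,
              "identifier length is 27 bytes");

// Returns 0 when the declaration is usable, otherwise the error code the
// scanner would report.
int check_push_call_configuration(const LookupResult *decl) {
    if (decl == nullptr || decl->kind != kKindFunction)
        return kErrMissingPushConfig;
    return 0;
}
```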

Error 3654

The error string at 0x88CA48:

unable to find __cudaPushCallConfiguration declaration.
CUDA toolkit installation may be corrupt.

This error indicates that the CUDA runtime headers have not been properly included or that the toolkit installation is broken. The __cudaPushCallConfiguration function is declared in crt/device_runtime.h (included transitively through crt/host_runtime.h), so this error should only appear if the include paths are misconfigured.

The error is emitted with severity 0x0B (11), which maps to a fatal error -- compilation cannot continue without this function because every kernel launch depends on it.

Kernel Launch Lowering

After successful lookup, the scanner builds an AST node representing the lowered kernel launch. The <<<grid, block, shmem, stream>>> syntax is transformed into:

// Conceptual lowering (a zero return means the configuration was accepted):
if (__cudaPushCallConfiguration(grid, block, shmem, stream) == 0) {
    kernel(args...);   // host stub runs only after a successful push
}

Error 3655 (emitted at decompiled line 2019) covers a separate case: the lookup succeeds, but the launch omits an explicit stream argument in a context that requires one. Its message is "explicit stream argument not provided in kernel launch".

Lambda Detection Macros

Default Stub Macros (No Extended Lambdas)

When dword_106BF38 (extended_lambda_mode) is 0, sub_489000 emits three macro definitions that evaluate to false, followed by an existence check:

// sub_489000, decompiled lines 259-264
emit("#define __nv_is_extended_device_lambda_closure_type(X) false\n");
emit("#define __nv_is_extended_host_device_lambda_closure_type(X) false\n");
emit("#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false\n");
emit("#if defined(__nv_is_extended_device_lambda_closure_type)"
     " && defined(__nv_is_extended_host_device_lambda_closure_type)"
     "&& defined(__nv_is_extended_device_lambda_with_preserved_return_type)\n"
     "#endif\n");

Verbatim emitted code:

#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
#if defined(__nv_is_extended_device_lambda_closure_type) && defined(__nv_is_extended_host_device_lambda_closure_type)&& defined(__nv_is_extended_device_lambda_with_preserved_return_type)
#endif

Note the missing space before the second && -- this is exactly how the string appears in the binary at 0x83ADE8. The #endif follows immediately, so the #if defined(...) block has an empty body and no effect on the compiled output: whether or not the three macros are still defined at that point, nothing is included or skipped. It therefore cannot act as an assertion or a guard on its own. The reason for emitting it is not evident from the binary; it may be a leftover from an earlier check or a placeholder.

These macros are consumed by crt/host_runtime.h to conditionally compile lambda wrapper infrastructure. When all three evaluate to false, the runtime header skips device lambda wrapper template instantiation, host-device lambda wrapper instantiation, and trailing-return-type lambda handling.
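As an illustration of how a consumer sees the stub path, the sketch below defines the same three macros and folds them into one query. is_any_extended_lambda is a hypothetical consumer, not something found in crt/host_runtime.h; the double-underscore names are reserved for the implementation and are mirrored here only because they are the names under discussion.

```cpp
#include <cassert>

// The three definitions that the stub path emits.
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false

// Hypothetical consumer: true if T is any kind of extended lambda closure.
// With the stubs in place the macro argument is discarded, so this folds to
// a constant false for every T.
template <typename T>
constexpr bool is_any_extended_lambda() {
    return __nv_is_extended_device_lambda_closure_type(T) ||
           __nv_is_extended_host_device_lambda_closure_type(T) ||
           __nv_is_extended_device_lambda_with_preserved_return_type(T);
}

static_assert(!is_any_extended_lambda<int>(), "stubs always report false");
```

Because the result is a compile-time constant, any code conditioned on it can be discarded by the host compiler.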

Trait-Based Macros (Extended Lambdas Active)

When dword_106BF38 is nonzero (--extended-lambda or --expt-extended-lambda CLI flag), the stub macros are NOT emitted. Instead, the lambda preamble emitter sub_6BCC20 (nv_emit_lambda_preamble) provides trait-based implementations later in the file body. The decision is made at line 256 of sub_489000:

// sub_489000, decompiled lines 251-264
if (dword_106BF38)        // extended lambdas enabled?
    goto LABEL_38;        // skip stub macros, jump to next section
// else: emit stubs
emit("#define __nv_is_extended_device_lambda_closure_type(X) false\n");
// ...

The trait-based implementations emitted by sub_6BCC20 keep the same three macro names but back them with template specialization instead of a constant false. Each macro is #define'd to invoke a type trait helper:

Device lambda detection (string at 0xA82CF8):

template <typename T>
struct __nv_extended_device_lambda_trait_helper {
  static const bool value = false;
};
template <typename T1, typename...Pack>
struct __nv_extended_device_lambda_trait_helper<__nv_dl_wrapper_t<T1, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_device_lambda_closure_type(X) \
    __nv_extended_device_lambda_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

Preserved return type detection (string at 0xA82F68):

template <typename T>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper {
  static const bool value = false;
};
template <typename U, U func, typename Return, unsigned Id, typename...Pack>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper<
    __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) \
    __nv_extended_device_lambda_with_trailing_return_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type >::value

Host-device lambda detection (string at 0xA831B0):

template <typename>
struct __nv_extended_host_device_lambda_trait_helper {
  static const bool value = false;
};
template <bool B1, bool B2, bool B3, typename T1, typename T2, typename...Pack>
struct __nv_extended_host_device_lambda_trait_helper<
    __nv_hdl_wrapper_t<B1, B2, B3, T1, T2, Pack...> > {
  static const bool value = true;
};
#define __nv_is_extended_host_device_lambda_closure_type(X) \
    __nv_extended_host_device_lambda_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

All three trait helpers follow the same pattern: a primary template with value = false, a partial specialization matching the corresponding wrapper type with value = true, and a macro that instantiates the trait after stripping cv-qualifiers via __nv_lambda_trait_remove_cv. The cv-stripping is necessary because a partial specialization matches only the exact wrapper type: a cv-qualified closure type such as const __nv_dl_wrapper_t<...> would otherwise fall through to the primary template and report false.
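The effect of the cv-strip can be reproduced with a reduced version of the pattern. In this sketch demo_dl_wrapper_t, demo_lambda_trait_helper and DEMO_IS_DEVICE_LAMBDA are hypothetical stand-ins for the emitted names, and std::remove_cv stands in for __nv_lambda_trait_remove_cv.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical wrapper standing in for __nv_dl_wrapper_t.
template <typename Tag, typename... Captures>
struct demo_dl_wrapper_t {};

// Primary template: not a wrapper.
template <typename T>
struct demo_lambda_trait_helper {
    static const bool value = false;
};

// Partial specialization: matches only the unqualified wrapper type.
template <typename Tag, typename... Captures>
struct demo_lambda_trait_helper<demo_dl_wrapper_t<Tag, Captures...> > {
    static const bool value = true;
};

// Macro in the style of the emitted ones.
#define DEMO_IS_DEVICE_LAMBDA(X) \
    demo_lambda_trait_helper<typename std::remove_cv<X>::type>::value

// Alias keeps the template-argument comma out of the macro argument.
using Wrapped = demo_dl_wrapper_t<int, float>;

static_assert(DEMO_IS_DEVICE_LAMBDA(Wrapped), "wrapper is detected");
static_assert(DEMO_IS_DEVICE_LAMBDA(const Wrapped), "const wrapper is detected");
static_assert(!DEMO_IS_DEVICE_LAMBDA(int), "other types are rejected");
// Without the cv-strip the const-qualified type hits the primary template.
static_assert(!demo_lambda_trait_helper<const Wrapped>::value,
              "no match without remove_cv");
```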

Macro Registration in the Frontend

The three macro names are registered as built-in identifiers by sub_5863A0 (a frontend initialization function), which calls sub_7463B0 to register each name with a unique identifier code:

// sub_5863A0, decompiled lines 976-978
sub_7463B0(328, "__nv_is_extended_device_lambda_closure_type");
sub_7463B0(329, "__nv_is_extended_host_device_lambda_closure_type");
sub_7463B0(330, "__nv_is_extended_device_lambda_with_preserved_return_type");

These registrations (IDs 328, 329, 330) make the names known to the EDG lexer before any source code is parsed, ensuring they can be resolved during preprocessing even if no header has defined them yet.

Diagnostic Suppression Scope

The managed runtime boilerplate is wrapped in a #pragma GCC diagnostic push / pop block to isolate its warning suppressions:

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-variable"
#pragma GCC diagnostic ignored "-Wunused-function"

/* managed runtime declarations */

#pragma GCC diagnostic pop

The push/pop is emitted only when the host compiler supports it: Clang (dword_126E1E8 set), or GCC version > 40599 (qword_126E1F0 > 0x9E97 and dword_106BF6C not set). The suppressions are necessary because __nv_inited_managed_rt and __nv_init_managed_rt are static symbols that may never be referenced in translation units without __managed__ variables, causing -Wunused-variable and -Wunused-function warnings.
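The condition can be written out as a predicate. This is a sketch under assumptions: HostCompilerInfo is a hypothetical bundle of the globals, the GCC branch is assumed to also require the host-is-GCC flag, and the version is assumed to be encoded as major*10000 + minor*100 + patch (so that "> 40599" would admit GCC 4.6.0 and later).

```cpp
#include <cassert>

// Hypothetical bundle of the globals involved in the decision.
struct HostCompilerInfo {
    bool is_clang;       // dword_126E1E8
    bool is_gcc;         // dword_126E1F8
    long version;        // qword_126E1F0
    bool alt_host_mode;  // dword_106BF6C
};

constexpr long kPushPopThreshold = 0x9E97;
static_assert(kPushPopThreshold == 40599, "hex and decimal forms agree");

// Clang always qualifies; GCC qualifies only above the threshold and only
// when alternative host compiler mode is off.
bool should_emit_diagnostic_push_pop(const HostCompilerInfo &h) {
    if (h.is_clang)
        return true;
    return h.is_gcc && !h.alt_host_mode && h.version > kPushPopThreshold;
}
```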

Global State Dependencies

| Global | Type | Meaning | Effect on Emission |
| --- | --- | --- | --- |
| dword_106BF38 | int | extended_lambda_mode | 0: emit false stubs. Nonzero: skip stubs, sub_6BCC20 provides traits |
| dword_106BF6C | int | alternative_host_compiler_mode | Adds __attribute__((unused)) to __nv_init_managed_rt |
| dword_126E1E8 | int | Host is Clang | Controls push/pop and extra suppressions |
| dword_126E1F8 | int | Host is GCC | Controls push/pop version threshold |
| qword_126E1F0 | int64 | GCC/Clang version number | > 0x9E97 (40599) for push/pop support |

Function Map

| Address | Name | Role |
| --- | --- | --- |
| sub_489000 | process_file_scope_entities | Emits managed RT block and lambda macros |
| sub_4768F0 | gen_name_ref | Wraps qualified managed variable accesses |
| sub_484940 | gen_variable_name | Wraps direct managed variable accesses |
| sub_511D40 | scan_expr_full | Looks up __cudaPushCallConfiguration for <<<>>> lowering |
| sub_6BCC20 | nv_emit_lambda_preamble | Emits trait-based lambda detection macros |
| sub_5863A0 | (frontend init) | Registers lambda macro names as built-in identifiers |
| sub_467E50 | (emit string) | Primary string emission to output stream |
| sub_72EEF0 | (inject identifier) | Inserts __cudaPushCallConfiguration into scope for lookup |
| sub_698940 | (name lookup) | Resolves identifier to entity declaration |
| sub_4F8200 | (emit error) | Error emission with severity and error code |

Host Reference Arrays

When cudafe++ splits a CUDA source file into device and host halves, the host-side .int.c output is compiled by a standard C++ compiler (GCC, Clang, or MSVC) that has no concept of device symbols. The CUDA runtime, however, needs to know which __global__ kernels, __device__ variables, and __constant__ variables exist so it can register them at program startup. cudafe++ solves this by emitting host reference arrays -- static byte arrays containing the mangled names of device symbols, placed into specially-named ELF sections that downstream tools (the fatbinary linker and crt/host_runtime.h registration code) read to enumerate device entities. The mechanism exists because the host compiler's symbol table contains only host-side symbols; the .nvHR* sections provide the complementary device-side symbol directory that the CUDA runtime needs to build the host-device binding table.

The arrays are emitted at the very end of the .int.c file, after the #undef _NV_ANON_NAMESPACE cleanup, by six calls to nv_emit_host_reference_array (sub_6BCF80, 79 lines, nv_transforms.c). Each call handles one combination of symbol type (kernel, device variable, constant variable) and linkage class (external, internal). The split by linkage is critical for RDC (relocatable device code) compilation: external-linkage symbols are globally visible across translation units and resolved by nvlink, while internal-linkage symbols (from static declarations or anonymous namespaces) are TU-local and must carry module-ID-based name prefixes to avoid collisions.

Key Facts

| Property | Value |
| --- | --- |
| Emission function | sub_6BCF80 (nv_emit_host_reference_array, 79 lines) |
| EDG source file | nv_transforms.c |
| Caller | sub_489000 (process_file_scope_entities, lines 713--721) |
| Guard condition | dword_106BFD0 \|\| dword_106BFCC (device or constant registration enabled) |
| Emit callback | sub_467E50 (primary string emitter to output stream) |
| Registration function | sub_6BE300 (nv_get_full_nv_static_prefix, 370 lines, nv_transforms.c:2164) |
| Scope prefix builder | sub_6BD2F0 (nv_build_scoped_name_prefix, 95 lines) |
| Expression walker | sub_6BE330 (nv_scan_expression_for_device_refs, 89 lines) |
| List data structure | std::list<std::string>-like containers at 6 global addresses |
| Static prefix cache | qword_1286760 |
| Anonymous namespace name | qword_1286A00 (format: _GLOBAL__N_<module_id>) |
| Prefix format string | "%s%lu_%s_", with the leading %s supplied by off_E7C768 |
| Assert guard | nv_transforms.c:2164, "nv_get_full_nv_static_prefix" |

The Six Sections

The arrays are organized into 6 ELF sections along two axes: symbol type (3 values) and linkage (2 values):

| Section | Array Name | Symbol Type | Linkage | Global List Address |
| --- | --- | --- | --- | --- |
| .nvHRKE | hostRefKernelArrayExternalLinkage | __global__ kernel | External | unk_1286880 |
| .nvHRKI | hostRefKernelArrayInternalLinkage | __global__ kernel | Internal | unk_12868C0 |
| .nvHRDE | hostRefDeviceArrayExternalLinkage | __device__ variable | External | unk_1286780 |
| .nvHRDI | hostRefDeviceArrayInternalLinkage | __device__ variable | Internal | unk_12867C0 |
| .nvHRCE | hostRefConstantArrayExternalLinkage | __constant__ variable | External | unk_1286800 |
| .nvHRCI | hostRefConstantArrayInternalLinkage | __constant__ variable | Internal | unk_1286840 |

The section name encoding is: .nvHR (host reference) + one letter for symbol type (K=kernel, D=device, C=constant) + one letter for linkage (E=external, I=internal).

Note that __shared__ variables are not included -- they have no host-visible address and exist only within a kernel's execution lifetime.

Emission Architecture

Invocation from the Backend

The backend entry point sub_489000 (process_file_scope_entities) calls sub_6BCF80 six times at the very end of .int.c generation (decompiled lines 713--721). The calls are guarded by two flags: dword_106BFD0 (device registration mode) and dword_106BFCC (constant registration mode). If neither is set, no arrays are emitted.

// sub_489000 trailer, decompiled lines 713-721
if (dword_106BFD0 || dword_106BFCC) {
    // nv_emit_host_reference_array(emit_fn, is_kernel, is_device, is_internal)
    sub_6BCF80(sub_467E50, 1, 0, 1);  // kernel,   internal  -> .nvHRKI
    sub_6BCF80(sub_467E50, 1, 0, 0);  // kernel,   external  -> .nvHRKE
    sub_6BCF80(sub_467E50, 0, 1, 1);  // device,   internal  -> .nvHRDI
    sub_6BCF80(sub_467E50, 0, 1, 0);  // device,   external  -> .nvHRDE
    sub_6BCF80(sub_467E50, 0, 0, 1);  // constant, internal  -> .nvHRCI
    sub_6BCF80(sub_467E50, 0, 0, 0);  // constant, external  -> .nvHRCE
}

The function signature is:

void nv_emit_host_reference_array(
    void (*emit)(const char *),  // a1: string emission callback
    int is_kernel,               // a2: 1 = kernel, 0 = variable
    int is_device,               // a3: 1 = __device__, 0 = __constant__ (only when is_kernel=0)
    int is_internal              // a4: 1 = internal linkage, 0 = external linkage
);

The flag decoding for selecting which global list, section name, and array name to use works as follows:

if is_kernel (a2 != 0):
    if is_internal (a4 != 0):  list = unk_12868C0, section = ".nvHRKI", name = "hostRefKernelArrayInternalLinkage"
    else:                       list = unk_1286880, section = ".nvHRKE", name = "hostRefKernelArrayExternalLinkage"
else if is_internal (a4 != 0):
    if is_device (a3 != 0):    list = unk_12867C0, section = ".nvHRDI", name = "hostRefDeviceArrayInternalLinkage"
    else:                       list = unk_1286840, section = ".nvHRCI", name = "hostRefConstantArrayInternalLinkage"
else:
    if is_device (a3 != 0):    list = unk_1286780, section = ".nvHRDE", name = "hostRefDeviceArrayExternalLinkage"
    else:                       list = unk_1286800, section = ".nvHRCE", name = "hostRefConstantArrayExternalLinkage"

Note the precedence: the kernel flag is checked first. When is_kernel=1, the is_device flag is ignored entirely -- kernels are always kernels regardless of is_device.
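The decoding above can be condensed into a helper that composes the section and array names from the three flags. This is a sketch: HostRefTarget and select_host_ref_target are hypothetical names, the global list addresses are left out, and the binary may well use a literal table rather than string composition.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type: the two names chosen for one call.
struct HostRefTarget {
    std::string section;
    std::string array_name;
};

// Kernel takes precedence; is_device is consulted only for variables.
HostRefTarget select_host_ref_target(bool is_kernel, bool is_device,
                                     bool is_internal) {
    const char *kind_letter = is_kernel ? "K" : (is_device ? "D" : "C");
    const char *kind_word =
        is_kernel ? "Kernel" : (is_device ? "Device" : "Constant");
    const char *link_letter = is_internal ? "I" : "E";
    const char *link_word = is_internal ? "Internal" : "External";

    HostRefTarget t;
    t.section = std::string(".nvHR") + kind_letter + link_letter;
    t.array_name = std::string("hostRef") + kind_word + "Array" + link_word +
                   "Linkage";
    return t;
}
```

Passing is_kernel = true together with is_device = true still selects a kernel section, mirroring the precedence noted above.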

Emission Output Format

For each section, sub_6BCF80 emits a single array declaration:

extern "C" {
extern __attribute__((section(".nvHRKE")))
       __attribute__((weak))
const unsigned char hostRefKernelArrayExternalLinkage[] = {
/* _Z8myKernelPfi */
0x5f,0x5a,0x38,0x6d,0x79,0x4b,0x65,0x72,0x6e,0x65,0x6c,0x50,0x66,0x69,0x0,
/* _Z11otherKernelPd */
0x5f,0x5a,0x31,0x31,0x6f,0x74,0x68,0x65,0x72,0x4b,0x65,0x72,0x6e,0x65,0x6c,0x50,0x64,0x0,
0x0};
}

Key details about the emitted C:

  • extern "C" wrapping ensures no C++ name mangling is applied to the array itself. The section name in the ELF binary is the sole identifier.
  • __attribute__((section(".nvHRXX"))) places the array in a named ELF section that downstream tools scan by name.
  • __attribute__((weak)) allows multiple translation units to define the same array name without causing linker errors. When multiple TUs each emit their own hostRefKernelArrayExternalLinkage, the linker resolves the symbol to a single definition. This is safe because the CUDA runtime reads the section contents, not the symbol: the .nvHRKE contributions from all object files are concatenated in the linked section.
  • const unsigned char[] encodes each mangled name as individual hex bytes, not as a string literal. This sidesteps string-literal escaping for any special characters in mangled names and lets each name carry an explicit 0x0 terminator.
  • Each symbol name is preceded by a /* mangled_name */ comment for human readability.
  • Each name is terminated by 0x0 (NUL byte).
  • If the list is empty (no symbols of that type/linkage), the array contains a single 0x0 sentinel.

The iteration traverses a doubly-linked list rooted at the global list variable. From the decompiled code:

// Decompiled iteration in sub_6BCF80, lines 56-73
for (node = list[3]; list + 1 != node; node = next_node(node)) {
    emit("/* ");
    emit(*(char **)(node + 32));   // mangled name string
    emit(" */\n");
    size_t len = *(size_t *)(node + 40);  // string length
    for (size_t j = 0; j < len; j++) {
        char byte = *(char *)(*(char **)(node + 32) + j);
        snprintf(buf, 128, "0x%x,", byte);
        emit(buf);
    }
    emit("0x0,");  // NUL terminator for this name
}

Each node in the linked list stores:

  • +32: pointer to the mangled name string
  • +40: length of the mangled name

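The list structure itself is a std::list<std::string>-compatible container where list[3] (offset +24) points to the first data node and list + 1 (offset +8) is the sentinel/end node. The same offsets, together with the payload at node +32, also fit the header and node layout of a libstdc++ red-black tree (for example std::set<std::string>), in which case next_node would be an in-order successor step rather than a plain next-pointer load.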
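The emission loop can be approximated over an ordinary container. In this sketch emit_host_reference_array is a hypothetical re-creation that returns the text instead of calling an emit callback, a std::vector stands in for the global list, and the exact whitespace and line breaks are assumptions.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Builds the array declaration text for one section.
std::string emit_host_reference_array(const std::string &section,
                                      const std::string &array_name,
                                      const std::vector<std::string> &names) {
    assert(!section.empty() && !array_name.empty());
    std::string out = "extern \"C\" {\n";
    out += "extern __attribute__((section(\"" + section + "\")))";
    out += " __attribute__((weak))\n";
    out += "const unsigned char " + array_name + "[] = {\n";
    for (const std::string &name : names) {
        out += "/* " + name + " */\n";
        for (unsigned char c : name) {
            char buf[128];
            std::snprintf(buf, sizeof buf, "0x%x,", static_cast<unsigned>(c));
            out += buf;
        }
        out += "0x0,\n";  // NUL terminator for this name
    }
    out += "0x0};\n}\n";  // trailing sentinel, present even for an empty list
    return out;
}
```

An empty input yields an array whose only element is the 0x0 sentinel, matching the empty-list case described above.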

Symbol Registration Pipeline

The host reference arrays are the output of a two-phase pipeline: (1) symbol collection during compilation, and (2) array emission at the end of the backend pass.

Phase 1: Collection During Compilation

As cudafe++ processes the AST, it encounters declarations marked with __global__, __device__, or __constant__. Each such entity must be registered in the appropriate global list so it appears in the host reference array. This registration is performed by two cooperating functions:

nv_scan_expression_for_device_refs (sub_6BE330, 89 lines) recursively walks expression trees looking for references to device-annotated entities. It dispatches on expression kind:

| Expression Kind | Handling |
| --- | --- |
| 7 (variable reference) | Checks __global__ bit, registers if device-annotated |
| 11 (function reference) | Checks function attributes, registers if __global__ |
| 15 (member access) | Recurses on the member |
| 16 (pointer dereference) | Recurses on the operand |
| 17 (expression list) | Recurses on each element |
| 20 (call expression) | Checks the callee |
| 24 (cast expression) | Recurses on the operand |

When the walker finds a device entity, it tail-calls into nv_get_full_nv_static_prefix.

nv_get_full_nv_static_prefix (sub_6BE300, 370 lines) is the master registration function. It determines the symbol's linkage class and constructs the name that goes into the host reference array. The function begins with two early-exit checks:

if (!entity) return;
if ((entity[182] & 0x40) == 0) return;  // not __global__

Byte +182 of the entity node carries execution space bits. Bit 6 (0x40) indicates __global__. Byte +179 carries additional flags where bits 0x12 indicate device/constant annotation. Byte +80 bits 0x70 encode the linkage class: 0x10 = internal (static/anonymous), 0x30 = external.

The function then splits into two paths based on linkage:

Internal Linkage Path

For static functions, anonymous-namespace entities, or entities with forced internal linkage, the name must include a TU-unique prefix to prevent collisions across translation units:

  1. Scope prefix construction (sub_6BD2F0): Recursively walks the entity's enclosing scopes (byte +28 == 3 indicates "has parent scope"). For each scope level, the scope name is extracted from +32 -> +8 (the scope's identifier string). For anonymous namespaces (where the scope name pointer is NULL), the function substitutes _GLOBAL__N_<module_id>, constructing and caching this string in qword_1286A00.

  2. Hash computation (sub_6BD1C0): The scope-qualified name is run through vsnprintf with the format string at address 0x82D326 (decimal 8573734, likely "%s%lu" or similar) and a 32-byte buffer. This produces a deterministic, hash-like token for the scope path.

  3. Static prefix construction: The full prefix is assembled as:

    snprintf(buf, size, "%s%lu_%s_", off_E7C768, strlen(module_id), module_id)
    

    where off_E7C768 is a fixed prefix string (likely "__nv_static_" or similar) and module_id comes from sub_5AF830 (the CRC32-based module identifier). The result is cached in qword_1286760 so it is computed only once per TU.

  4. Name assembly: The prefix, a "_" separator, and the entity's mangled name (from entity +8) are concatenated.

  5. List insertion: The assembled name is pushed into the internal-linkage list (unk_12868C0 for kernels, unk_12867C0 for device variables, unk_1286840 for constants) via a std::list::push_back-equivalent call.
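Steps 1 and 3 can be sketched with the format string quoted above. The fixed prefix value is an assumption (the text only guesses at what off_E7C768 points to), and make_static_prefix and make_anon_namespace_name are hypothetical helper names; caching is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Assumed value of the string behind off_E7C768; not confirmed by the source.
static const char *const kAssumedFixedPrefix = "__nv_static_";

// Formats "<fixed><strlen(module_id)>_<module_id>_" using "%s%lu_%s_".
std::string make_static_prefix(const std::string &module_id) {
    char buf[256];
    const int n = std::snprintf(buf, sizeof buf, "%s%lu_%s_",
                                kAssumedFixedPrefix,
                                static_cast<unsigned long>(module_id.size()),
                                module_id.c_str());
    assert(n >= 0 && static_cast<std::size_t>(n) < sizeof buf);
    (void)n;
    return std::string(buf);
}

// Substitute name used for anonymous namespaces in the scope prefix.
std::string make_anon_namespace_name(const std::string &module_id) {
    return "_GLOBAL__N_" + module_id;
}
```

Embedding the length of the module ID ahead of the ID itself keeps the prefix unambiguous to parse even when the ID contains underscores.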

External Linkage Path

For entities with default (external) linkage, the path is simpler:

  1. A "::" scope prefix is prepended (string at address 0xA7D32F, decimal 10998575; two bytes).

  2. If the entity has a parent scope (byte +28 == 3 at the scope entry), the scope-qualified name is built by recursing through parent scopes, concatenating "::" separators and hashing each level with sub_6BD1C0.

  3. The entity's mangled name (from entity +8) is appended directly.

  4. The result is pushed into the external-linkage list (unk_1286880 for kernels, unk_1286780 for device variables, unk_1286800 for constants).

Phase 2: Emission (Backend Trailer)

After the entire source file has been processed and all entity walks have populated the 6 global lists, the backend trailer calls sub_6BCF80 six times. Each call walks one list and emits the corresponding ELF section declaration. Once the guard condition passes, all 6 sections are emitted, even if some lists are empty (producing arrays with only a 0x0 sentinel).

Internal vs. External Linkage Split

The split into internal and external linkage sections serves two distinct purposes:

Whole-Program Mode (-rdc=false)

In whole-program (non-RDC) mode, all device code from a single TU is embedded directly in the host object file as a fatbinary. The host reference arrays tell the __cudaRegisterFatBinary machinery in crt/host_runtime.h which symbols exist in the fatbinary so it can register them with the CUDA driver at program startup.

Internal-linkage symbols require the TU-unique prefix to avoid name collisions if two TUs define identically-named static __global__ kernels. The prefix incorporates the module ID (a CRC32 of the TU's representative entity) to ensure uniqueness.

Separate Compilation Mode (-rdc=true)

In RDC mode, device code is compiled to relocatable device objects (.rdc files) that nvlink links together. External-linkage device symbols must be globally resolvable across TUs. The .nvHRKE/.nvHRDE/.nvHRCE sections provide the symbol directory that nvlink uses to match device symbols with their host-side registration entries.

Internal-linkage symbols in RDC mode remain TU-local. They carry module-ID prefixes and are placed in the *I sections, which nvlink processes separately. The split ensures that nvlink does not attempt to deduplicate or cross-reference symbols that were intentionally given internal linkage.

Downstream Consumption

Host Compiler

GCC/Clang/MSVC compiles the .int.c file and sees the extern "C" array declarations with __attribute__((section(...))). The host compiler places each array into the named ELF section (or PE section on Windows). Because the arrays are const unsigned char[] with weak linkage, they impose no runtime overhead and can be safely deduplicated by the linker.

The fatbinary linker reads the .nvHR* sections from each object file to discover which device symbols need registration. For each entry in the byte arrays, it extracts the mangled name (scanning for 0x0 terminators) and matches it against the device code in the fatbinary or relocatable device object.
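The scan described above can be sketched in C. This is a minimal illustration, not the linker's actual code: it assumes the array layout shown in the emission example (NUL-terminated names stored back to back, closed by one extra 0x0 sentinel), and the names host_ref_next and host_ref_count are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the entry that follows `entry`, or NULL once the sentinel (an empty
 * string) is reached. `entry` must point at the start of a name inside an
 * array laid out as "name\0name\0...\0". */
static const char *host_ref_next(const char *entry)
{
    assert(entry != NULL);
    if (*entry == '\0')
        return NULL;                      /* already at the sentinel */
    const char *next = entry + strlen(entry) + 1;
    return *next ? next : NULL;
}

/* Count the names stored in one host-reference array. */
static size_t host_ref_count(const unsigned char *array)
{
    size_t n = 0;
    const char *p = (const char *)array;
    if (*p == '\0')
        return 0;                         /* empty list: sentinel only */
    for (; p != NULL; p = host_ref_next(p))
        n++;
    return n;
}
```

An array that holds only the 0x0 sentinel yields a count of zero, which matches the empty sections that the trailer always emits.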

CUDA Runtime (crt/host_runtime.h)

At program startup, the CUDA runtime's __cudaRegisterLinkedBinary function (or __cudaRegisterFatBinary in whole-program mode) walks the .nvHR* sections to:

  1. Register each __global__ kernel with __cudaRegisterFunction
  2. Register each __device__ variable with __cudaRegisterVar
  3. Register each __constant__ variable with __cudaRegisterVar (with the constant flag)

This registration enables the host-side API (cudaLaunchKernel, cudaMemcpyToSymbol, etc.) to resolve device symbols by name at runtime.

Supporting Data Structures

Global List Nodes

Each of the 6 global lists (unk_1286780 through unk_12868C0) is a std::list<std::string>-compatible doubly-linked list. The list head structure occupies 48 bytes (3 pointers + metadata):

| Offset | Field | Description |
|---|---|---|
| +0 | allocator | Allocator state |
| +8 | sentinel | Sentinel/end node address (comparison target for iteration end) |
| +16 | size | Number of entries |
| +24 | first | Pointer to first data node |

Each data node stores:

| Offset | Field | Description |
|---|---|---|
| +0 | prev | Previous node pointer |
| +8 | next | Next node pointer |
| +16 | data_start | Start of string data area |
| +32 | str_ptr | Pointer to mangled name character data |
| +40 | str_len | Length of the mangled name |

The strings use SSO (Small String Optimization): if the mangled name is 15 bytes or shorter, the character data is stored inline starting at offset +16; otherwise str_ptr at +32 points to a heap allocation and offset +16 stores the heap capacity.
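As a reading aid, the node layout and the inline-versus-heap rule can be mirrored in a C struct. This is only a sketch under stated assumptions: the struct and function names are hypothetical, the offsets are taken from the table above, and an x86-64 target with 8-byte pointers is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the data node layout described in the table. */
struct host_ref_node {
    struct host_ref_node *prev;      /* +0  */
    struct host_ref_node *next;      /* +8  */
    union {
        char   inline_data[16];      /* +16: names of up to 15 bytes plus NUL */
        size_t heap_capacity;        /* +16: capacity when the name is on the heap */
    } data_start;
    char  *str_ptr;                  /* +32 */
    size_t str_len;                  /* +40 */
};

/* Return the character data, following the 15-byte rule stated in the text. */
static const char *host_ref_node_name(const struct host_ref_node *n)
{
    assert(n != NULL);
    return n->str_len <= 15 ? n->data_start.inline_data : n->str_ptr;
}
```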

Static Prefix Cache

qword_1286760 caches the internal-linkage prefix string computed by nv_get_full_nv_static_prefix. The format is:

<off_E7C768><module_id_length>_<module_id>_

Where off_E7C768 is a fixed string (the NVIDIA static prefix marker), the module ID comes from sub_5AF830 (CRC32-based), and the underscores separate the components. This prefix is allocated once via sub_5E03D0 and reused for all internal-linkage entities in the TU.
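The prefix format can be illustrated with a short C sketch. The function name build_static_prefix is hypothetical, and the marker is passed in as a parameter because the contents of off_E7C768 are not reproduced here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format "<marker><module_id_length>_<module_id>_" into out.
 * Returns the number of characters written, or -1 if out is too small. */
static int build_static_prefix(char *out, size_t out_size,
                               const char *marker, const char *module_id)
{
    assert(out && marker && module_id);
    int n = snprintf(out, out_size, "%s%zu_%s_",
                     marker, strlen(module_id), module_id);
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}
```

Storing the module ID length in front of the ID lets a consumer split the prefix without scanning for a delimiter, which matters because the module ID itself contains underscores.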

Anonymous Namespace Name Cache

qword_1286A00 caches the anonymous namespace identifier, constructed as _GLOBAL__N_<module_id>. This follows the Itanium ABI convention for anonymous namespace mangling but uses the CUDA module ID instead of a random hash. It is allocated once by sub_6BD2F0 and reused for all entities in anonymous namespaces.

Scope-Qualified Name Builder

sub_6BD2F0 (nv_build_scoped_name_prefix) recursively constructs scope-qualified names for internal-linkage entities:

void nv_build_scoped_name_prefix(char **scope_name, scope_entry *parent, string *result) {
    // Recurse to parent scope first
    if (parent && parent->kind == 3)  // byte +28 == 3
        nv_build_scoped_name_prefix(parent->parent->name, parent->parent->scope, result);

    char *name = *scope_name;
    if (!name)
        name = get_or_create_anon_namespace_name();  // _GLOBAL__N_<module_id>

    // Build: hash(name) via vsnprintf with format at 8573734, 32-byte buffer
    // Append to result string
    format_string_to_sso(&tmp, vsnprintf, 32, 8573734, name_len);
    string_append(result, tmp);
}

The recursion visits ancestor scopes from outermost to innermost, concatenating hashed scope names. This produces a deterministic, collision-resistant path that uniquely identifies the entity's position in the namespace hierarchy.
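The outermost-first ordering follows from recursing before appending. The C sketch below shows only that ordering. It is a simplification: the struct, the function name, and the "::" joiner are assumptions, and the per-level formatting step through sub_6BD1C0 is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct scope {                     /* hypothetical, simplified scope entry */
    const char         *name;      /* NULL for an unnamed namespace */
    const struct scope *parent;    /* NULL at the outermost level */
};

/* Append the scope path to out, outermost scope first, each level followed
 * by "::". The caller must pass a NUL-terminated (possibly empty) out. */
static void build_scope_path(const struct scope *s, const char *anon_name,
                             char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    if (s == NULL)
        return;
    build_scope_path(s->parent, anon_name, out, out_size);  /* ancestors first */
    const char *name = s->name ? s->name : anon_name;       /* unnamed fallback */
    strncat(out, name, out_size - strlen(out) - 1);
    strncat(out, "::", out_size - strlen(out) - 1);
}
```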

Host Reference Trie

During compilation, cudafe++ maintains a trie (prefix tree) structure for deduplicating host reference entries. This trie is stored alongside the linear lists and prevents the same symbol from being registered twice if it is referenced from multiple points in the source.

The trie is cleaned up at the end of compilation by:

  • sub_6BD530 (nv_free_host_ref_tree, 257 lines) -- deeply recursive tree destructor with 9 levels of inlined recursion
  • sub_6BD820 (nv_free_host_ref_list, 34 lines) -- iterates the linked list, calling nv_free_host_ref_tree for each node's tree, then frees the node

Each trie node structure:

| Offset | Field | Description |
|---|---|---|
| +0 | next | Next sibling pointer |
| +8 | (reserved) | Alignment/flags |
| +16 | child_chain | First child in chain |
| +24 | child_tree | Child subtree pointer |
| +32 | data_ptr | Pointer to name data (or +48 if inline) |
| +40 | data_len | Length of name data |
| +48 | inline_data | Inline storage for short names |

If data_ptr == &node[48] (the inline data area), no separate allocation was made; otherwise data_ptr points to a heap-allocated string that nv_free_host_ref_tree frees separately.

Complete Emission Example

For a source file containing:

__global__ void myKernel(float *data, int n) { /* ... */ }
__device__ int d_counter;
static __constant__ float c_table[256];

The .int.c trailer emits:

extern "C" {
extern __attribute__ ((section (".nvHRKI"))) __attribute__((weak)) const unsigned char hostRefKernelArrayInternalLinkage[] = {
0x0};
}

extern "C" {
extern __attribute__ ((section (".nvHRKE"))) __attribute__((weak)) const unsigned char hostRefKernelArrayExternalLinkage[] = {
/* _Z8myKernelPfi */
0x5f,0x5a,0x38,0x6d,0x79,0x4b,0x65,0x72,0x6e,0x65,0x6c,0x50,0x66,0x69,0x0,
0x0};
}

extern "C" {
extern __attribute__ ((section (".nvHRDI"))) __attribute__((weak)) const unsigned char hostRefDeviceArrayInternalLinkage[] = {
0x0};
}

extern "C" {
extern __attribute__ ((section (".nvHRDE"))) __attribute__((weak)) const unsigned char hostRefDeviceArrayExternalLinkage[] = {
/* _Z9d_counter */
0x5f,0x5a,0x39,0x64,0x5f,0x63,0x6f,0x75,0x6e,0x74,0x65,0x72,0x0,
0x0};
}

extern "C" {
extern __attribute__ ((section (".nvHRCI"))) __attribute__((weak)) const unsigned char hostRefConstantArrayInternalLinkage[] = {
/* __nv_static_42_kernel_cu_c_table */
0x5f,0x5f,0x6e,0x76,0x5f,...,0x0,
0x0};
}

extern "C" {
extern __attribute__ ((section (".nvHRCE"))) __attribute__((weak)) const unsigned char hostRefConstantArrayExternalLinkage[] = {
0x0};
}

Note how c_table (declared static __constant__) appears in the internal-linkage .nvHRCI section with its module-ID-prefixed name, while myKernel (external linkage by default) appears in .nvHRKE with its standard Itanium-ABI mangled name.
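The byte rows in the example can be reproduced with a small formatter. This is a sketch of the output format only, not of sub_6BCF80 itself; the function name format_host_ref_entry is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write one array entry: every byte of name as unpadded lowercase hex,
 * followed by the 0x0 terminator. Returns the text length, or -1 if out
 * is too small. */
static int format_host_ref_entry(char *out, size_t out_size, const char *name)
{
    assert(out && name && out_size > 0);
    size_t used = 0;
    out[0] = '\0';
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        int n = snprintf(out + used, out_size - used, "0x%x,", *p);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
    }
    int n = snprintf(out + used, out_size - used, "0x0,");
    if (n < 0 || (size_t)n >= out_size - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

Calling it with "_Z8myKernelPfi" produces the same row as the .nvHRKE entry above; the array's closing 0x0 sentinel is emitted separately.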

Function Map

| Address | Name | Source | Lines | Role |
|---|---|---|---|---|
| sub_6BCF80 | nv_emit_host_reference_array | nv_transforms.c | 79 | Selects section/list by flags, emits array declaration |
| sub_6BE300 | nv_get_full_nv_static_prefix | nv_transforms.c:2164 | 370 | Master registration: determines linkage, builds name, inserts into list |
| sub_6BE330 | nv_scan_expression_for_device_refs | nv_transforms.c | 89 | Recursive expression walker that finds device entity references |
| sub_6BD2F0 | nv_build_scoped_name_prefix | nv_transforms.c | 95 | Recursive scope-qualified name builder for internal-linkage entities |
| sub_6BD1C0 | format_string_to_sso | nv_transforms.c | 48 | Formats via vsnprintf into std::string SSO buffer |
| sub_6BD530 | nv_free_host_ref_tree | nv_transforms.c | 257 | Recursive deep-free of deduplication trie |
| sub_6BD820 | nv_free_host_ref_list | nv_transforms.c | 34 | Frees linked list of host reference entries |
| sub_6BCF10 | nv_check_device_variable_in_host | nv_transforms.c | 16 | Validates device variable not improperly referenced from host |
| sub_5AF830 | make_module_id | host_envir.c | ~450 | CRC32-based TU identifier used in internal-linkage prefixes |
| sub_489000 | process_file_scope_entities | cp_gen_be.c | 723 | Backend entry point; calls sub_6BCF80 x6 in trailer |
| sub_467E50 | (emit string) | cp_gen_be.c | -- | Primary string emission callback passed to sub_6BCF80 |

Global Variables

| Address | Type | Name | Purpose |
|---|---|---|---|
| unk_1286780 | list | device external list | Accumulates __device__ external-linkage symbol names |
| unk_12867C0 | list | device internal list | Accumulates __device__ internal-linkage symbol names |
| unk_1286800 | list | constant external list | Accumulates __constant__ external-linkage symbol names |
| unk_1286840 | list | constant internal list | Accumulates __constant__ internal-linkage symbol names |
| unk_1286880 | list | kernel external list | Accumulates __global__ external-linkage symbol names |
| unk_12868C0 | list | kernel internal list | Accumulates __global__ internal-linkage symbol names |
| qword_1286760 | char* | static prefix cache | Cached internal-linkage prefix string (computed once per TU) |
| qword_1286A00 | char* | anon namespace name | Cached _GLOBAL__N_<module_id> string |
| dword_106BFD0 | int | device registration flag | Enables device symbol registration (guard for emission) |
| dword_106BFCC | int | constant registration flag | Enables constant symbol registration (guard for emission) |

Cross-References

Module ID & Registration

When CUDA programs are compiled with separate compilation (-rdc=true), each .cu translation unit is compiled independently and later linked by nvlink. The host-side registration code emitted by cudafe++ must associate its __cudaRegisterFatBinary call with the correct device fatbinary, and anonymous namespace device symbols must receive globally unique mangled names. The module ID is a string identifier computed by make_module_id (sub_5AF830, host_envir.c, ~450 lines) that provides this uniqueness. It is derived from a CRC32 hash of the compiler options and source filename, combined with the output filename and process ID. Once computed, the module ID is cached in qword_126F0C0 and referenced throughout the backend code generator -- in _NV_ANON_NAMESPACE construction, _GLOBAL__N_ mangling, _INTERNAL prefixing, host reference array scoped names, and the module ID file written for nvlink consumption.

Key Facts

| Property | Value |
|---|---|
| Generator function | sub_5AF830 (make_module_id, ~450 lines, host_envir.c) |
| Setter | sub_5AF7F0 (set_module_id, host_envir.c, line 3387 assertion) |
| Getter | sub_5AF820 (get_module_id, host_envir.c) |
| File writer | sub_5B0180 (write_module_id_to_file, host_envir.c) |
| Entity-based selector | sub_5CF030 (use_variable_or_routine_for_module_id_if_needed, il.c, line 31969) |
| Anon namespace constructor | sub_6BC7E0 (nv_transforms.c, ~20 lines) |
| Cached module ID global | qword_126F0C0 (8 bytes, initially NULL) |
| Selected entity global | qword_126F140 (8 bytes, IL entity pointer) |
| Selected entity kind | byte_126F138 (1 byte, 7=variable or 11=routine) |
| Module ID file path global | qword_106BF80 (set by --module_id_file_name, flag 87) |
| Generate-module-ID-file flag | --gen_module_id_file (flag 83, no argument) |
| Module ID file path flag | --module_id_file_name (flag 87, has argument) |
| Options hash input global | qword_106C038 (string, command-line options to hash) |
| Output filename global | qword_106C040 (display filename override) |
| Emit-symbol-table flag | dword_106BFB8 (triggers write_module_id_to_file in backend) |
| CRC32 polynomial | 0xEDB88320 (CRC-32/ISO-HDLC, reflected) |
| CRC32 initial value | 0xFFFFFFFF |
| Debug trace topic | "module_id" (gated by dword_126EFC8) |
| Debug format strings | "make_module_id: str1 = %s, str2 = %s, pid = %ld\n" at 0xA5DA48; "make_module_id: final string = %s\n" at 0xA5DA80 |

Algorithm Overview

The module ID generator has three source modes, tried in priority order. The result is always cached in qword_126F0C0 -- the function returns immediately if the cache is populated.

Mode 1: Module ID File

If qword_106BF80 (set by the --module_id_file_name CLI flag) is non-NULL and dword_106BFB8 is clear, the function opens the specified file, reads its entire contents into a heap-allocated buffer, null-terminates it, and uses that as the module ID verbatim. This allows build systems to inject deterministic, reproducible identifiers from external sources (e.g., a content hash of the source file computed by the build system).

// sub_5AF830, mode 1: read module ID from file
if (!dword_106BFB8 && qword_106BF80) {
    FILE *f = open_file(qword_106BF80, "r");  // sub_4F4870
    if (!f) fatal("unable to open module id file for reading");

    fseek(f, 0, SEEK_END);
    size_t len = ftell(f);
    rewind(f);

    char *buf = allocate(len + 1);             // sub_6B7340
    if (fread(buf, 1, len, f) != len)
        fatal("unable to read module id from file");

    buf[len] = '\0';
    fclose(f);
    qword_126F0C0 = buf;
    return buf;
}

Mode 2: Explicit Token (Caller-Provided String)

If the caller passes a non-NULL first argument (src), the function uses that string directly as the source filename component and skips the token handling described next. When a secondary string argument (nptr) is provided instead (used by use_variable_or_routine_for_module_id_if_needed), it is first parsed with strtoul. If the parse succeeds (the entire string was consumed as a number), the numeric value is formatted as an 8-digit hex string. If the parse fails (the string is not purely numeric), the string is CRC32-hashed and the hash is formatted the same way. On the nptr path, the working directory (qword_126EEA0) is used as an extra component and the PID is obtained via getpid().
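The "entire string was consumed" test can be sketched with strtoul and its end pointer. The function name parse_numeric_token is hypothetical, and base 10 is an assumption; the radix actually passed by sub_5AF830 is not established here.

```c
#include <assert.h>
#include <stdlib.h>

/* Return 1 and store the value when the whole token parses as a number.
 * Return 0 otherwise, in which case the caller would fall back to hashing
 * the token with CRC32. */
static int parse_numeric_token(const char *token, unsigned long *value)
{
    assert(token && value);
    char *end = NULL;
    unsigned long v = strtoul(token, &end, 10);   /* base 10 is an assumption */
    if (end == token || *end != '\0')
        return 0;                                 /* nothing or only part consumed */
    *value = v;
    return 1;
}
```

A token such as "12ab" fails the check because strtoul stops at 'a', leaving a non-NUL character at the end pointer.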

Mode 3: Default Computation (stat + ctime + getpid)

When no caller-provided string is available, the function stat()s the output file. If the stat succeeds and the file is a regular file (S_IFREG), the modification time (st_mtime) is converted to a string via ctime() and used as the source component. If the stat fails or the result is not a regular file, the compilation timestamp string (qword_126EB80) is used as the source component instead. In both cases the PID is obtained via getpid() and the working directory (qword_126EEA0) is the extra component.
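A minimal POSIX sketch of this fallback logic follows. The function name source_component is hypothetical and the sketch covers only the choice of source string, not the PID or working-directory handling.

```c
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/* Return the ctime() text of the file's modification time, or fallback when
 * the path cannot be stat'ed or does not name a regular file. */
static const char *source_component(const char *output_path, const char *fallback)
{
    assert(output_path && fallback);
    struct stat st;
    if (stat(output_path, &st) == 0 && S_ISREG(st.st_mode))
        return ctime(&st.st_mtime);   /* static buffer; text ends with '\n' */
    return fallback;
}
```

Because ctime() text is far longer than 8 characters, a source component obtained this way would always take the CRC32 compression branch in the assembly step, assuming the length rule in the pseudocode below is accurate.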

Complete Generation Pseudocode

function make_module_id(src_arg):
    // Check cache
    if qword_126F0C0 != NULL:
        return qword_126F0C0

    // Mode 1: read from file
    if !dword_106BFB8 AND qword_106BF80 != NULL:
        return read_file_contents(qword_106BF80)

    // Determine the output filename base
    if dword_126EE48:                    // multi-TU mode
        output_name = **(qword_106BA10 + 184)   // from TU descriptor
    else:
        output_name = xmmword_126EB60[0]         // primary source file
    if qword_106C040 != NULL:
        output_name = qword_106C040              // display name override

    // Determine source string and extra string
    pid = 0
    extra = NULL

    if src_arg != NULL:
        src = src_arg
        // skip nptr processing, fall through to assembly

    else if nptr != NULL:                // caller-provided numeric token
        (value, endptr) = strtoul(nptr, 0)
        if endptr <= nptr OR *endptr != '\0':
            value = crc32(nptr)          // not a pure number, hash it
        src = sprintf("%08lx", value)
        pid = getpid()
        extra = qword_126EEA0           // working directory

    else:                                // default: stat the output file
        if stat(output_name) succeeds AND is regular file:
            mtime = stat.st_mtime
            src = ctime(mtime)
            pid = getpid()
            extra = qword_126EEA0
        else:
            pid = getpid()
            src = qword_126EB80         // compilation timestamp
            extra = qword_126EEA0

    // --- Assemble the module ID string ---

    // Step 1: CRC32 of command-line options
    if qword_106C038 != NULL:
        options_crc = crc32(qword_106C038)
        options_hex = sprintf("_%08lx", options_crc)
    else:
        options_hex = sprintf("_%08lx", 0)

    // Step 2: source name compression
    name_len = strlen(src) + (extra ? strlen(extra) + 1 : 0)
    if name_len > 8:
        // Source name too long -- replace with CRC32
        combined_crc = crc32(src)
        if extra:
            combined_crc = crc32_continue(combined_crc, extra)
        src = sprintf("%08lx", combined_crc)
        // extra is consumed into the hash, set to NULL
        extra = NULL

    // Step 3: PID suffix
    if pid != 0:
        pid_suffix = sprintf("_%ld", pid)
    else:
        pid_suffix = ""

    // Step 4: extract basename of output file
    basename = strip_directory_prefix(output_name)   // sub_5AC1F0
    basename_len = strlen(basename)

    // Step 5: concatenate all components
    result = options_hex + "_" + basename_len + "_" + basename + "_" + src
    if extra:
        result += "_" + extra
    if pid != 0:                         // pid is set on the nptr and default paths
        result += pid_suffix

    // Step 6: sanitize -- replace all non-alphanumeric with '_'
    for each character c in result:
        if !isalnum(c):
            c = '_'

    // Cache and return
    qword_126F0C0 = result
    return result

Module ID Format

The final module ID string follows this structure:

_{options_crc}_{basename_len}_{basename}_{source_or_crc}[_{extra}][_{pid}]

All non-alphanumeric characters are replaced with underscores after assembly. A concrete example for a file kernel.cu compiled with nvcc -arch=sm_89 -rdc=true:

_a1b2c3d4_9_kernel_cu_5e6f7890_1234
 |        | |         |        |
 |        | |         |        +-- PID (getpid())
 |        | |         +----------- CRC32 of source name (> 8 chars compressed)
 |        | +--------------------- output basename ("kernel.cu", dot -> "_")
 |        +----------------------- basename length (9, "kernel.cu")
 +-------------------------------- CRC32 of options string

The leading underscore comes from the options_hex format ("_%08lx"). All dots, slashes, dashes, and other non-alphanumeric characters are uniformly replaced with underscores, making the result safe for use as a C identifier suffix.
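The concatenation and sanitization steps can be sketched as a single C function. The name assemble_module_id and its parameter list are hypothetical; the sketch takes the already-computed components as inputs and does not reproduce the CRC32 or stat logic.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Build "_{options_crc}_{basename_len}_{basename}_{src}[_{extra}][_{pid}]"
 * and replace every non-alphanumeric character with '_'.
 * Returns the string length, or -1 if out is too small. */
static int assemble_module_id(char *out, size_t out_size,
                              unsigned long options_crc, const char *basename,
                              const char *src, const char *extra, long pid)
{
    assert(out && basename && src);
    int n = snprintf(out, out_size, "_%08lx_%zu_%s_%s",
                     options_crc, strlen(basename), basename, src);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    size_t used = (size_t)n;

    if (extra) {                              /* optional extra component */
        n = snprintf(out + used, out_size - used, "_%s", extra);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
    }
    if (pid != 0) {                           /* optional PID suffix */
        n = snprintf(out + used, out_size - used, "_%ld", pid);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
    }
    for (char *p = out; *p; p++)              /* sanitize after assembly */
        if (!isalnum((unsigned char)*p))
            *p = '_';
    return (int)used;
}
```

With options CRC 0xa1b2c3d4, basename "kernel.cu", source "5e6f7890", no extra string, and PID 1234, the function yields the example string shown above.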

CRC32 Implementation

The function contains an inline CRC32 implementation that appears three times in the decompiled output: once for the options string, once for the source name (continuing into the extra string), and once for the caller-provided token (nptr). The three copies are structurally identical, indicating the compiler inlined a shared helper (likely a static inline function or macro) at each call site.

The algorithm is the standard bit-by-bit reflected CRC-32 used by ISO 3309, ITU-T V.42, Ethernet, PNG, and zlib. The polynomial 0xEDB88320 is the bit-reversed form of the generator polynomial 0x04C11DB7.

CRC32 Pseudocode

function crc32(data: byte_string) -> uint32:
    crc = 0xFFFFFFFF                    // initialization vector

    for each byte in data:
        for bit_index in 0..7:
            // XOR the lowest bit of crc with the current data bit
            if ((crc ^ (byte >> bit_index)) & 1) != 0:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc = crc >> 1

    return crc ^ 0xFFFFFFFF             // final inversion

CRC32 Decompiled (Single Instance)

This is one of the three identical inline copies from sub_5AF830, processing the options string at qword_106C038:

// sub_5AF830, lines 121-165 (options CRC32)
uint64_t crc = 0xFFFFFFFF;
uint8_t *ptr = (uint8_t *)qword_106C038;

if (ptr) {
    while (*ptr) {
        uint8_t byte = *ptr;
        while (1) {
            ++ptr;
            // Bit 0
            uint64_t tmp = crc >> 1;
            if (((uint8_t)crc ^ byte) & 1) tmp ^= 0xEDB88320;
            // Bit 1
            uint64_t tmp2 = tmp >> 1;
            if (((uint8_t)tmp ^ (byte >> 1)) & 1) tmp2 ^= 0xEDB88320;
            // Bit 2
            uint64_t tmp3 = tmp2 >> 1;
            if (((uint8_t)tmp2 ^ (byte >> 2)) & 1) tmp3 ^= 0xEDB88320;
            // Bit 3
            uint64_t tmp4 = tmp3 >> 1;
            if (((uint8_t)tmp3 ^ (byte >> 3)) & 1) tmp4 ^= 0xEDB88320;
            // Bit 4
            uint64_t tmp5 = tmp4 >> 1;
            if (((uint8_t)tmp4 ^ (byte >> 4)) & 1) tmp5 ^= 0xEDB88320;
            // Bit 5
            uint64_t tmp6 = tmp5 >> 1;
            if (((uint8_t)tmp5 ^ (byte >> 5)) & 1) tmp6 ^= 0xEDB88320;
            // Bit 6
            uint64_t tmp7 = tmp6 >> 1;
            if (((uint8_t)tmp6 ^ (byte >> 6)) & 1) tmp7 ^= 0xEDB88320;
            // Bit 7
            crc = tmp7 >> 1;
            if ((((uint8_t)tmp7 ^ (byte >> 7)) & 1) == 0)
                break;
            byte = *ptr;
            crc ^= 0xEDB88320;
            if (!*ptr) goto done;
        }
    }
done:
    sprintf(options_hex, "_%08lx", crc ^ 0xFFFFFFFF);
}

The unrolled 8-iteration loop processes one byte at a time without a lookup table. Each iteration shifts the CRC right by one bit and conditionally XORs the polynomial. The final XOR with 0xFFFFFFFF is the standard CRC-32 finalization step. The compiler fully unrolled the inner 8-bit loop, turning what was originally a counted for (int i = 0; i < 8; i++) loop into 8 sequential if-shift-xor blocks. The three copies in the function differ only in which input string they process and which output variable receives the result.

Why Three Inline Copies

The CRC32 code appears at three locations within sub_5AF830:

| Copy | Input | Output | Purpose |
|---|---|---|---|
| 1 (lines 121-164) | qword_106C038 (options string) | options_hex | Hash compiler flags into the module ID prefix |
| 2 (lines 186-273) | src + extra (source + extra strings) | src (overwritten with hex) | Compress long source filenames (> 8 chars) into a fixed-width hash |
| 3 (lines 361-407) | nptr (explicit token string) | v67 | Hash non-numeric caller-provided tokens |

Copy 2 is a two-pass CRC: it first hashes the source filename string (pass 2a), then continues the CRC state into the extra string, the working directory (pass 2b), producing a single combined hash. This is why the code between the two passes checks if (extra_len != 0) before starting the second one.

The original C source almost certainly had a single crc32_string() helper function (or macro) that the compiler inlined at each call site during optimization. The EDG front-end codebase uses similar inline expansion patterns elsewhere (e.g., the 9 copies of UTF-8 decoding logic in the same file).

Module ID Source Modes -- Decision Tree

make_module_id(src)
    |
    +-- qword_126F0C0 set? --> return cached
    |
    +-- File mode available?
    |   (qword_106BF80 != NULL && !dword_106BFB8)
    |   YES --> read file, cache, return
    |
    +-- Caller provided src argument?
    |   YES --> use src as source component, no PID
    |
    +-- nptr set (explicit token)?
    |   YES --> strtoul(nptr)
    |           |
    |           +-- parse OK? --> use numeric value
    |           +-- parse fail? --> CRC32 hash nptr
    |           extra = working_directory
    |           pid = getpid()
    |
    +-- Default (no src, no nptr)
        stat(output_file)
        |
        +-- stat OK && regular file?
        |   src = ctime(st_mtime)
        |   pid = getpid()
        |   extra = working_directory
        |
        +-- stat fail
            src = qword_126EB80 (compilation timestamp)
            pid = getpid()
            extra = working_directory

Entity-Based Module ID Selection

An alternative entry path into the module ID system is use_variable_or_routine_for_module_id_if_needed (sub_5CF030, il.c, line 31969, ~65 lines). Instead of computing a hash from file metadata, this function selects a representative entity (variable or function) from the current translation unit whose mangled name serves as a stable identifier. The mangled name is then passed to sub_5AF830 as the src argument.

Selection Criteria

The function is invoked during IL processing. It first checks sub_5AF820 (get_module_id) -- if a module ID is already cached, it returns immediately. Otherwise, it evaluates the candidate entity:

// sub_5CF030, simplified
char *use_variable_or_routine_for_module_id_if_needed(entity, kind) {
    if (get_module_id())
        return get_module_id();      // already computed

    if (qword_126F140) {
        // Already selected an entity, extract its name
        assert(dword_106BF10 || dword_106BEF8);  // il.c:32064
        goto extract_name;
    }

    // Validate entity kind: must be 7 (variable) or 11 (routine)
    assert(entity && ((kind - 7) & 0xFB) == 0);   // il.c:31969

    // Check if entity is unsuitable (member of TU scope, etc.)
    if (entity->scope == primary_scope
        || (entity->flags_81 & 0x04)       // unnamed namespace
        || (entity->scope && entity->scope->kind == 3))
    {
        // Skip: entity in primary scope, unnamed namespace, or class scope
        ...
        return NULL;
    }

    if (kind == 7) {   // Variable
        // Must have: no storage class, has definition, not template-related,
        // not inline, not constexpr, not thread-local
        if (entity->storage_class == 0
            && entity->has_definition          // offset +169
            && !(entity->flags_162 & 0x10)     // not explicit specialization
            && !(entity->flags_164 & 0x10)     // not partial specialization
            && entity->flags_148 >= 0          // not extern template
            && !(entity->flags_160 & 0x08)     // not inline variable
            && entity->flags_165 >= 0)         // not constexpr
        {
            qword_126F140 = entity;
            byte_126F138 = 7;
        }
    }
    else {   // Routine (kind == 11)
        // Must have: no specialization, no builtin return type,
        // no template parameters, not defaulted/deleted
        if (!entity->flags_164
            && entity->flags_176 >= 0          // not defaulted
            && !(entity->flags_179 & 0x02)     // not deleted
            && !(entity->flags_180 & 0x38)     // not template-related
            && !(entity->flags_184 & 0x20))    // not consteval
        {
            // Additional checks: return type not builtin, not coroutine
            if (!is_builtin_type(entity->return_type)
                && !is_generic_function(entity)
                && !is_concept_function(entity->return_type_entry))
            {
                qword_126F140 = entity;
                byte_126F138 = 11;
            }
        }
    }

extract_name:
    // Get the entity's mangled name
    char *name;
    if (byte_126F138 == 7) {
        // Variable: check unnamed namespace, use mangled or lowered name
        if ((entity->flags_81 & 0x04) || (entity->scope && entity->scope->kind == 3))
            name = get_lowered_name();      // sub_6A70C0
        else
            name = entity->name;            // offset +8
    } else {
        // Routine: similar checks, use name or lowered name
        assert(byte_126F138 == 11);         // il.c:32079
        if (dword_126EFB4 == 2)             // C++20 mode
            name = get_mangled_name();      // sub_6A76C0
        else
            name = entity->name;
    }

    assert(name != NULL);                   // il.c:32086
    return make_module_id(name);            // sub_5AF830(name)
}

The strict filtering ensures the selected entity is one whose mangled name is deterministic across compilations of the same source. Template instantiations, inline variables, and unnamed namespace entities are excluded because their names may vary or conflict.

set_module_id and get_module_id

The module ID cache has a setter/getter pair for use by external callers that compute the ID through other means:

// sub_5AF7F0 -- set_module_id (host_envir.c, line 3387)
void set_module_id(char *id) {
    assert(qword_126F0C0 == NULL);   // "set_module_id" -- must not be set already
    qword_126F0C0 = id;
}

// sub_5AF820 -- get_module_id (host_envir.c)
char *get_module_id(void) {
    return qword_126F0C0;
}

The setter asserts that the module ID has not been previously set. This is a safety guard: the module ID must be computed exactly once per compilation. Any attempt to set it twice indicates a logic error in the pipeline.

write_module_id_to_file

The write_module_id_to_file function (sub_5B0180, host_envir.c, ~30 lines) is called during the backend output phase when dword_106BFB8 (emit-symbol-table flag) is set. It generates the module ID (via sub_5AF830(0)) and writes the raw string to a file:

// sub_5B0180 -- write_module_id_to_file
void write_module_id_to_file(void) {
    char *id = make_module_id(NULL);       // sub_5AF830(0)
    char *path = qword_106BF80;            // module ID file path

    if (!path)
        fatal("module id filename not specified");

    FILE *f = open_file_for_writing(path); // sub_4F48F0
    size_t len = strlen(id);

    if (fwrite(id, 1, len, f) != len)
        fatal("error writing module id to file");

    fclose(f);
}

The module ID file is a plain text file containing nothing but the module ID string (no newline, no header). This file is consumed by the fatbinary linker (fatbinary) and nvlink during the device linking phase.

Downstream Consumers

The module ID is referenced in seven distinct locations across the cudafe++ binary:

1. Anonymous Namespace Mangling (sub_6BC7E0)

Constructs the _GLOBAL__N_<module_id> string used as the _NV_ANON_NAMESPACE macro value in the .int.c trailer:

// sub_6BC7E0 (nv_transforms.c, ~20 lines)
if (qword_1286A00)                      // cached?
    return qword_1286A00;

char *id = make_module_id(NULL);        // sub_5AF830(0)
char *buf = allocate(strlen(id) + 12);  // "_GLOBAL__N_" = 11 chars + NUL
strcpy(buf, "_GLOBAL__N_");
strcpy(buf + 11, id);
qword_1286A00 = buf;                   // cache for reuse
return buf;

This string appears in the .int.c output as:

#define _NV_ANON_NAMESPACE _GLOBAL__N_a1b2c3d4e5f67890
#ifdef _NV_ANON_NAMESPACE
#endif
#include "kernel.cu"
#undef _NV_ANON_NAMESPACE

2. Scoped Name Prefix Builder (sub_6BD2F0)

The recursive nv_build_scoped_name_prefix function uses the same _GLOBAL__N_<module_id> string when building scope-qualified names for internal-linkage device entities in host reference arrays. If the entity is in an anonymous namespace and qword_1286A00 is not yet computed, it calls sub_5AF830(0) directly to generate the module ID.

3. Internal Linkage Prefix (sub_69DAA0)

Constructs _INTERNAL<module_id> for internal-linkage entities during name lowering:

// sub_69DAA0 (lower_name.c context)
char *id = make_module_id(NULL);
char *buf = allocate(strlen(id) + 10);
strcpy(buf, "_INTERNAL");              // 0x414E5245544E495F in little-endian
strcpy(buf + 9, id);

4. Unnamed Namespace Naming (sub_69ED40, give_unnamed_namespace_a_name)

When the name lowering pass encounters an unnamed (anonymous) namespace entity, it calls sub_5AF830(0) to obtain the module ID and constructs a _GLOBAL__N_<module_id> name for the namespace. The function is confirmed as give_unnamed_namespace_a_name from assert strings at lower_name.c lines 7880 and 7889.

5. Frontend Wrapup (sub_588E90)

The translation_unit_wrapup function (sub_588E90, fe_wrapup.c) calls sub_5AF830(0) unconditionally during frontend finalization. This ensures the module ID is computed and cached before the backend code generator needs it, even if no earlier consumer triggered computation.

6. Entity-Based Selection (sub_5CF030)

As described above, use_variable_or_routine_for_module_id_if_needed selects a representative entity and passes its mangled name to sub_5AF830, which then uses the name as the src component instead of file metadata.

7. Module ID File Output (sub_5B0180)

Writes the raw module ID string to a file for consumption by fatbinary and nvlink.

Integration with the Compilation Pipeline

The module ID is computed at multiple points during compilation, but only the first computation persists (all subsequent calls return the cached value):

Pipeline stage                    Module ID action
--------------------------------------------------------------
CLI parsing                       Flags 83/87 set qword_106BF80
                                  Options string stored in qword_106C038
Frontend processing               sub_5CF030 may select entity-based ID
Frontend wrapup (sub_588E90)      sub_5AF830(0) ensures ID is computed
Backend output (sub_489000)       sub_6BC7E0 uses ID for _NV_ANON_NAMESPACE
                                  sub_6BCF80 uses ID in host reference arrays
                                  sub_5B0180 writes ID to file (if dword_106BFB8)
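
The compute-once behavior can be sketched as follows. This is a minimal illustration, not the decompiled logic: compute_id is a hypothetical stand-in for the CRC32-based computation, and compute_calls exists only to make the caching observable.

```c
#include <assert.h>
#include <stddef.h>

static const char *cached_id = NULL;  /* plays the role of qword_126F0C0 */
static int compute_calls = 0;         /* instrumentation for this example only */

/* Hypothetical stand-in for the expensive ID computation. */
static const char *compute_id(const char *src) {
    compute_calls++;
    return src ? src : "id_from_file_metadata";
}

/* First call computes and caches; later calls ignore their argument
   and return the cached pointer. */
const char *make_module_id_sketch(const char *src) {
    if (cached_id == NULL) {
        cached_id = compute_id(src);
        assert(cached_id != NULL);
    }
    return cached_id;
}
```

Under this model, whichever consumer runs first (entity-based selection or frontend wrapup) determines the ID that every later stage sees.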

The --gen_module_id_file flag (83) controls whether a module ID file is generated at all. The --module_id_file_name flag (87) specifies its path. Both are set by nvcc when invoking cudafe++ with -rdc=true.

PID Incorporation

The getpid() call ensures that concurrent compilations of the same source file produce different module IDs. Without the PID, two parallel nvcc invocations compiling the same .cu file with the same flags would generate identical module IDs, causing runtime registration collisions when the resulting objects are linked together. The PID is appended as the final underscore-separated component and is only included in modes 2 and 3 (not when the caller provides a src argument directly, and not when the module ID is read from a file). This means reproducible builds require mode 1 (file-based) or entity-based selection.

Global Variables

| Address | Size | Name | Description |
|---|---|---|---|
| qword_126F0C0 | 8 | cached_module_id | Cached module ID string (computed once, never freed) |
| qword_106BF80 | 8 | module_id_file_path | Path from --module_id_file_name (flag 87) |
| qword_106C038 | 8 | options_hash_input | Command-line options string for CRC32 hashing |
| qword_106C040 | 8 | display_filename | Output filename override (used as basename source) |
| qword_126F140 | 8 | selected_entity | Entity chosen by use_variable_or_routine_for_module_id_if_needed |
| byte_126F138 | 1 | selected_entity_kind | Kind of selected entity (7=variable, 11=routine) |
| dword_106BFB8 | 4 | emit_symbol_table | Flag: write module ID file + symbol table in backend |
| qword_1286A00 | 8 | cached_anon_namespace_hash | Cached _GLOBAL__N_<module_id> string |
| qword_126EEA0 | 8 | working_directory | Current working directory (set during host_envir_early_init) |
| qword_126EB80 | 8 | compilation_timestamp | ctime() of compilation start (IL header) |
| dword_126EFC8 | 4 | debug_trace_flag | Enables debug trace output to FILE s |

Function Map

| Address | Name | Source File | Lines | Role |
|---|---|---|---|---|
| sub_5AF830 | make_module_id | host_envir.c | ~450 | CRC32-based unique TU identifier generator |
| sub_5AF7F0 | set_module_id | host_envir.c | ~10 | Setter with assert guard (must be called once) |
| sub_5AF820 | get_module_id | host_envir.c | ~3 | Returns qword_126F0C0 |
| sub_5B0180 | write_module_id_to_file | host_envir.c | ~30 | Writes module ID to file for nvlink |
| sub_5CF030 | use_variable_or_routine_for_module_id_if_needed | il.c:31969 | ~65 | Selects representative entity for stable ID |
| sub_6BC7E0 | (anon namespace hash) | nv_transforms.c | ~20 | Constructs _GLOBAL__N_<module_id> |
| sub_6BD2F0 | nv_build_scoped_name_prefix | nv_transforms.c | ~95 | Recursive scope-qualified name builder |
| sub_69DAA0 | (internal linkage prefix) | lower_name.c | ~60 | Constructs _INTERNAL<module_id> prefix |
| sub_69ED40 | give_unnamed_namespace_a_name | lower_name.c:7880 | ~80 | Names anonymous namespaces with module ID |
| sub_588E90 | translation_unit_wrapup | fe_wrapup.c | ~30 | Ensures module ID is computed during wrapup |

EDG 6.6 Overview

cudafe++ is built on top of Edison Design Group's (EDG) commercial C++ frontend, version 6.6. EDG provides the complete C++ language implementation -- lexer, preprocessor, parser, semantic analysis, type system, template instantiation, overload resolution, constant evaluation, and Itanium ABI name mangling. NVIDIA licenses this frontend and compiles it from source with CUDA-specific modifications injected at three distinct integration levels: a dedicated NVIDIA source file (nv_transforms.c), surgical modifications to EDG source files that call into NVIDIA headers, and a large layer of CUDA property-query leaf functions that permeate every compilation phase.

The build path embedded in the binary is:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/

Source Tree

The binary contains debug path references to 52 .c files and 13 .h files. Together these constitute the entire EDG frontend plus NVIDIA's single dedicated source file.

Source Files (.c)

| # | File | Pipeline role |
|---|---|---|
| 1 | attribute.c | C++11/GNU/CUDA attribute parsing and validation |
| 2 | class_decl.c | Class/struct/union declaration processing, lambda scanning |
| 3 | cmd_line.c | Command-line argument parsing (276 flags) |
| 4 | const_ints.c | Compile-time integer constant evaluation |
| 5 | cp_gen_be.c | Backend -- .int.c code generation, source sequence walking |
| 6 | debug.c | Debug output and IL dump infrastructure |
| 7 | decl_inits.c | Declaration initializer processing |
| 8 | decl_spec.c | Declaration specifier parsing (storage class, type qualifiers) |
| 9 | declarator.c | Declarator parsing (pointers, arrays, function signatures) |
| 10 | decls.c | General declaration processing |
| 11 | disambig.c | Syntactic disambiguation (expression vs. declaration) |
| 12 | error.c | Diagnostic message formatting and emission (3,795 messages) |
| 13 | expr.c | Expression parsing and semantic analysis |
| 14 | exprutil.c | Expression utility functions (coercion, evaluation) |
| 15 | extasm.c | Extended inline assembly parsing |
| 16 | fe_init.c | Frontend initialization (36 subsystem init routines) |
| 17 | fe_wrapup.c | Frontend finalization (5-pass wrapup sequence) |
| 18 | float_pt.c | Floating-point literal parsing |
| 19 | floating.c | IEEE 754 constant folding (arbitrary precision) |
| 20 | folding.c | General constant folding |
| 21 | func_def.c | Function definition processing |
| 22 | host_envir.c | Host environment interface (file I/O, exit, signals) |
| 23 | il.c | IL node creation, linking, and management |
| 24 | il_alloc.c | IL arena allocator (region-based, 64KB blocks) |
| 25 | il_to_str.c | IL-to-string conversion for debug display |
| 26 | il_walk.c | IL tree walking with 5 callback functions |
| 27 | interpret.c | Constexpr interpreter (compile-time evaluation engine) |
| 28 | layout.c | Struct/class memory layout computation |
| 29 | lexical.c | Lexer / tokenizer (357 token kinds) |
| 30 | literals.c | String and numeric literal processing |
| 31 | lookup.c | Name lookup (unqualified, qualified, ADL) |
| 32 | lower_name.c | Itanium ABI name mangling |
| 33 | macro.c | Preprocessor macro expansion |
| 34 | mem_manage.c | Internal memory management (arena allocator, tracking) |
| 35 | modules.c | C++20 module support (mostly stubs in CUDA build) |
| 36 | nv_transforms.c | NVIDIA-authored -- CUDA AST transforms, lambda wrappers |
| 37 | overload.c | C++ overload resolution |
| 38 | pch.c | Precompiled header support |
| 39 | pragma.c | Pragma processing (43 pragma kinds) |
| 40 | preproc.c | Preprocessor directives (#include, #ifdef, etc.) |
| 41 | scope_stk.c | Scope stack management |
| 42 | src_seq.c | Source sequence (declaration ordering for emission) |
| 43 | statements.c | Statement parsing and semantic analysis |
| 44 | symbol_ref.c | Symbol reference tracking |
| 45 | symbol_tbl.c | Symbol table operations (hash-based lookup) |
| 46 | sys_predef.c | System predefinitions (built-in types, macros) |
| 47 | target.c | Target configuration (data model, ABI) |
| 48 | templates.c | Template instantiation, specialization, deduction |
| 49 | trans_copy.c | Translation unit IL deep copy |
| 50 | trans_corresp.c | Cross-TU type correspondence verification (RDC) |
| 51 | trans_unit.c | Translation unit lifecycle (the main entry point) |
| 52 | types.c | C++ type system (22 type kinds, queries, construction) |

Header Files (.h)

| # | File | Contents |
|---|---|---|
| 1 | decls.h | Declaration node structure definitions |
| 2 | float_type.h | Floating-point type descriptors |
| 3 | il.h | IL entry kind enums, node structure definitions |
| 4 | lexical.h | Token kind enums, lexer state |
| 5 | mem_manage.h | Memory allocator interface |
| 6 | modules.h | Module system declarations |
| 7 | nv_transforms.h | NVIDIA-authored -- CUDA transform API, called from EDG files |
| 8 | overload.h | Overload resolution structures |
| 9 | scope_stk.h | Scope stack interface |
| 10 | symbol_tbl.h | Symbol table interface |
| 11 | types.h | Type node structure, type kind enum |
| 12 | util.h | General utility macros and inline functions |
| 13 | walk_entry.h | IL walking callback signatures |

Code Breakdown

The binary contains approximately 6,300 identifiable functions, split between EDG code and the C++ runtime:

| Category | Functions | % of binary | Description |
|---|---|---|---|
| Attributed to source files | ~2,200 | ~35% | Matched to one of the 52 .c files via assert strings, source path references, or address-range mapping |
| Unmapped EDG functions | ~2,900 | ~46% | EDG code without source file attribution (inlined, optimized, or from headers) |
| C++ runtime / ABI | ~1,200 | ~19% | Itanium ABI runtime, exception handling, std:: library, operator new/delete |

Top 10 Source Files by Function Count

| Rank | File | Functions | Primary responsibility |
|---|---|---|---|
| 1 | expr.c | ~195 | Expression parsing, operator semantics, implicit conversions |
| 2 | il.c | ~185 | IL node creation, entry kind dispatch, node linking |
| 3 | templates.c | ~172 | Template instantiation worklist, SFINAE, deduction |
| 4 | exprutil.c | ~154 | Expression coercion, arithmetic conversions, lvalue analysis |
| 5 | symbol_tbl.c | ~102 | Symbol table hash operations, scope chain walking |
| 6 | overload.c | ~100 | Candidate set construction, ICS ranking, best viable function |
| 7 | class_decl.c | ~90 | Class body parsing, member declarations, lambda scanning |
| 8 | attribute.c | ~83 | Attribute parsing, CUDA attribute validation dispatch |
| 9 | cp_gen_be.c | ~81 | Backend emission, .int.c generation, device stub writing |
| 10 | scope_stk.c | ~72 | Scope push/pop, scope kind management, lookup context |

Architecture: Classic Frontend Pipeline

EDG implements a textbook multi-pass compiler frontend. cudafe++ drives it in a single-threaded, sequential pipeline from main() at 0x408950:

  source.cu
     |
     v
  +-----------+     lexical.c, macro.c, preproc.c, literals.c
  |  Lexer /  |     357 token kinds, trigraph handling, raw string
  |  Preproc  |     adjustment, __CUDA_ARCH__ macro injection
  +-----------+
     |  token stream
     v
  +-----------+     expr.c, declarator.c, decl_spec.c, statements.c,
  |  Parser   |     class_decl.c, disambig.c, func_def.c, extasm.c
  |           |     Recursive-descent with disambiguation
  +-----------+
     |  parse tree
     v
  +-----------+     overload.c, exprutil.c, lookup.c, templates.c,
  | Semantic  |     types.c, attribute.c, const_ints.c, folding.c
  | Analysis  |     Type checking, overload resolution, template
  |           |     instantiation, constexpr evaluation
  +-----------+
     |  annotated AST
     v
  +-----------+     il.c, il_alloc.c, il_walk.c, scope_stk.c,
  |  IL Build |     symbol_tbl.c, src_seq.c, trans_unit.c
  |           |     Scope-linked graph of all declarations, types,
  |           |     expressions, statements, templates
  +-----------+
     |  IL graph
     v
  +-----------+     fe_wrapup.c, lower_name.c, trans_corresp.c
  |  Wrapup   |     5-pass finalization: dead code marking,
  |           |     name lowering, cross-TU correspondence (RDC)
  +-----------+
     |  finalized IL
     v
  +-----------+     cp_gen_be.c, nv_transforms.c, host_envir.c
  |  Backend  |     Walk source sequence, emit .int.c file,
  | Emission  |     inject CUDA stubs, lambda wrappers, host
  |           |     reference arrays, managed variable boilerplate
  +-----------+
     |
     v
  output.int.c

The process_translation_unit function (sub_7A40A0 in trans_unit.c) is the main entry point for compilation. It allocates a 424-byte TU descriptor, opens the source file, and orchestrates the parse-to-IL sequence. For the main compilation path, it calls:

  1. sub_586240 -- parse the translation unit (drives lexer + parser)
  2. sub_4E8A60 -- standard compilation finalization (IL completion)
  3. sub_588F90 -- fe_wrapup (5-pass IL finalization)
  4. sub_489000 -- backend entry (.int.c emission, "Back end time")

NVIDIA Modifications

NVIDIA's CUDA integration is organized in three layers, from most isolated to most pervasive.

Level 1: NVIDIA-Authored Source (nv_transforms.c + nv_transforms.h)

A single dedicated NVIDIA source file at address range 0x6BAE70--0x6BE4A0, containing approximately 34 functions in ~14KB of code. This file implements all CUDA-specific AST transformations:

| Function | Address | Purpose |
|---|---|---|
| nv_init_transforms | 0x6BAE70 | Zero all NVIDIA transform state at startup |
| emit_device_lambda_wrapper | 0x6BB790 | Generate __nv_dl_wrapper_t<Tag, F1..FN> partial specialization |
| emit_hdl_wrapper (non-mutable) | 0x6BBB10 | Generate __nv_hdl_wrapper_t<false, ...> type-erased wrapper |
| emit_hdl_wrapper (mutable) | 0x6BBEE0 | Same as above but operator() is non-const |
| emit_array_capture_helpers | 0x6BC290 | Generate __nv_lambda_array_wrapper for 2D-8D arrays |
| nv_validate_cuda_attributes | 0x6BC890 | Validate __launch_bounds__, __cluster_dims__, __maxnreg__ |
| nv_reset_capture_bitmasks | 0x6BCBC0 | Zero device/host-device capture bitmasks per TU |
| nv_record_capture_count | 0x6BCBF0 | Set bit N in capture bitmap for wrapper generation |
| nv_emit_lambda_preamble | 0x6BCC20 | Master emitter: inject all __nv_* templates into compilation |
| nv_find_parent_lambda_function | 0x6BCDD0 | Walk scope chain for enclosing device/global function |
| nv_emit_host_reference_array | 0x6BCF80 | Generate .nvHRKE/.nvHRDI/etc. ELF section arrays |
| nv_get_full_nv_static_prefix | 0x6BE300 | Build scoped name + register entity in host ref arrays |

The companion header nv_transforms.h declares the API surface that EDG source files call into. This is the primary NVIDIA integration point: EDG files see nv_transforms.c only through the declarations in this header.

Key data structures managed by nv_transforms.c:

| Global | Size | Purpose |
|---|---|---|
| unk_1286980 | 128 bytes (1024 bits) | Device lambda capture-count bitmap |
| unk_1286900 | 128 bytes (1024 bits) | Host-device lambda capture-count bitmap |
| qword_12868F0 | pointer | Entity-to-closure ID hash table |
| qword_1286A00 | pointer | Cached anonymous namespace name (_GLOBAL__N_<module_id>) |
| qword_1286760 | pointer | Cached static name prefix string |
| unk_1286780--unk_12868C0 | 6 lists | Host reference array symbol lists (one per section type) |
| dword_126E270 | 4 bytes | C++17 noexcept-in-type-system flag |

Level 2: NVIDIA-Modified EDG Files

Three EDG source files contain direct calls into nv_transforms.h functions, making them the "NVIDIA-aware" EDG files:

cp_gen_be.c -- The backend code generator. When it encounters a type named __nv_lambda_preheader_injection during source sequence walking, it calls nv_emit_lambda_preamble (sub_6BCC20) to inject the entire __nv_* template library. It also calls NVIDIA functions for host reference array emission, managed variable boilerplate, and device stub generation.

class_decl.c -- The class/struct declaration processor. The scan_lambda function (sub_447930, 2113 lines) detects __host__/__device__ annotations on lambda expressions, validates CUDA-specific constraints (35+ error codes in range 3592--3690), and records capture counts in the bitmaps via nv_record_capture_count.

statements.c -- The statement parser. Calls NVIDIA transform functions for statement-level CUDA validation, such as checking that __syncthreads() is not called in divergent control flow within __global__ functions.

Level 3: CUDA Property Query Layer

The most pervasive integration layer consists of 104 small leaf functions clustered at addresses 0x7A6000--0x7AA000 (within types.c). These are type-system query functions that answer questions like "is this type a __device__ pointer?", "does this class have __shared__ storage?", "is this a kernel function type?".

Each follows a canonical pattern:

bool is_<property>_type(type_node *t) {
    while (t->kind == 12)       // 12 = tk_typedef
        t = t->referenced_type; // strip typedef layers
    return <check on underlying type>;
}

These 104 accessors account for 3,648 total call sites across the binary. The ten most frequently called, by call-site count:

| Address | Call sites | Identity | Returns |
|---|---|---|---|
| 0x7A8A30 | 407 | is_class_or_struct_or_union_type | kind in {9, 10, 11} |
| 0x7A9910 | 389 | type_pointed_to | ptr->referenced_type (kind == 6) |
| 0x7A9E70 | 319 | get_cv_qualifiers | accumulated cv-qual bits (& 0x7F) |
| 0x7A6B60 | 299 | is_dependent_type | bit 5 of byte +133 |
| 0x7A7630 | 243 | is_object_pointer_type | kind == 6 && !(bit 0 of +152) |
| 0x7A8370 | 221 | is_array_type | kind == 8 |
| 0x7A7B30 | 199 | is_member_pointer_or_ref | kind == 6 && (bit 0 of +152) |
| 0x7A6AC0 | 185 | is_reference_type | kind == 7 |
| 0x7A8DC0 | 169 | is_function_type | kind == 14 |
| 0x7A6E90 | 140 | is_void_type | kind == 1 |
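
The accessor pattern and two of the kind tests from the table can be sketched in compilable form. The struct below is a deliberately reduced stand-in for the 176-byte type_node: only the two fields the pattern needs are modeled, and the field and constant names are illustrative rather than recovered symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-in for type_node; the real kind field lives at offset +132. */
typedef struct type_node {
    unsigned char kind;
    struct type_node *referenced_type;  /* target of a typedef, pointer, ... */
} type_node;

enum { TK_ARRAY = 8, TK_STRUCT = 9, TK_CLASS = 10, TK_UNION = 11,
       TK_TYPEDEF = 12 };

/* Strip typedef layers until a non-typedef type is reached. */
static const type_node *skip_typedefs(const type_node *t) {
    assert(t != NULL);
    while (t->kind == TK_TYPEDEF)
        t = t->referenced_type;
    return t;
}

bool is_array_type(const type_node *t) {
    return skip_typedefs(t)->kind == TK_ARRAY;
}

bool is_class_or_struct_or_union_type(const type_node *t) {
    unsigned char k = skip_typedefs(t)->kind;
    return k >= TK_STRUCT && k <= TK_UNION;
}
```

Factoring the typedef loop into one helper is a choice made for this sketch; the section describes each accessor as carrying its own copy of the loop.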

CUDA integration is pervasive because these tiny accessors are called from every phase of compilation -- the parser checks execution space during declaration, semantic analysis validates cross-space calls, the type system queries CUDA qualifiers during overload resolution, and the backend reads them during IL emission. There is no isolated "CUDA layer"; the CUDA awareness is distributed across the entire frontend through these leaf functions.

Type Kind Constants

The type query functions operate on a type_node structure (176 bytes, IL entry kind 6). The kind field at offset +132 encodes:

| kind | Name | Description |
|---|---|---|
| 0 | tk_none | Null/invalid |
| 1 | tk_void | void |
| 2 | tk_integer | All integer types including bool, char, enums |
| 3 | tk_float | float |
| 4 | tk_double | double |
| 5 | tk_long_double | long double |
| 6 | tk_pointer | Pointer types (object and member) |
| 7 | tk_reference | Lvalue reference (T&) |
| 8 | tk_array | Array types (T[], T[N]) |
| 9 | tk_struct | struct |
| 10 | tk_class | class |
| 11 | tk_union | union |
| 12 | tk_typedef | Typedef alias (stripped by all query functions) |
| 13 | tk_pointer_to_member | Pointer-to-member (T C::*) |
| 14 | tk_function | Function type |
| 15 | tk_bitfield | Bit-field |
| 16 | tk_pack_expansion | Parameter pack expansion |
| 17 | tk_pack_expansion | Alternate pack expansion form |
| 18 | tk_auto | auto / decltype(auto) placeholder |
| 19 | tk_rvalue_reference | Rvalue reference (T&&) |
| 20 | tk_nullptr_t | std::nullptr_t |

Memory Management

EDG uses a custom region-based arena allocator implemented in mem_manage.c (address range 0x6B5E40--0x6BA230). Key characteristics:

  • Block size: 64KB (0x10000) per block
  • Region model: Multiple numbered regions (file-scope = region 1, per-function = region N)
  • Free list recycling: Freed blocks go to qword_1280730 for reuse before new allocation
  • Trim threshold: Blocks with more than 1,887 unused bytes are split; remainder goes to free list
  • Tracking: All allocations recorded for watermark monitoring (qword_1280718 = total, qword_1280710 = peak)
  • Dual mode: Malloc-based (mode 0) or mmap-based (mode 1), selected by dword_1280728 from CLI flag

Block structure (48-byte header at the start of each 64KB block):

| Offset | Type | Field |
|---|---|---|
| +0 | void* | Next pointer (block chain) |
| +8 | void* | Current allocation pointer |
| +16 | void* | High-water mark within block |
| +24 | void* | End-of-block pointer |
| +32 | int64 | Block total size (0 if sub-block) |
| +40 | byte | Trimmed flag |
| +48 | -- | Start of usable data |
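
A minimal bump allocator over one such block might look like the sketch below. It mirrors the header offsets in the table on an LP64 target, but the function names, the 8-byte rounding, and the use of malloc as backing store are assumptions made for illustration; region numbering, trimming, and the free list are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_SIZE 0x10000  /* 64KB per block */

typedef struct block {
    struct block *next;   /* +0  block chain */
    char *cur;            /* +8  current allocation pointer */
    char *high_water;     /* +16 high-water mark */
    char *end;            /* +24 end of block */
    int64_t total_size;   /* +32 total size */
    uint8_t trimmed;      /* +40 trimmed flag */
} block;                  /* padded to 48 bytes on LP64 */

block *block_new(void) {
    block *b = malloc(BLOCK_SIZE);
    if (!b) return NULL;
    b->next = NULL;
    b->cur = b->high_water = (char *)b + sizeof(block);
    b->end = (char *)b + BLOCK_SIZE;
    b->total_size = BLOCK_SIZE;
    b->trimmed = 0;
    return b;
}

/* Bump-allocate n bytes (rounded up to 8); NULL if the block is full. */
void *block_alloc(block *b, size_t n) {
    assert(b != NULL);
    n = (n + 7) & ~(size_t)7;
    if ((size_t)(b->end - b->cur) < n) return NULL;
    void *p = b->cur;
    b->cur += n;
    if (b->cur > b->high_water) b->high_water = b->cur;
    return p;
}
```

A caller that gets NULL back would chain a fresh block through the next pointer, which is where a free list like the one at qword_1280730 would be consulted first.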

The free_fe function (sub_6BA230, 533 lines) implements a hash-table-based deduplicating allocator for front-end object deallocation, using open addressing with linear probing.

C++20 Modules (Stubs)

The modules.c file (address range 0x7C0C60--0x7C2560) contains approximately 20 functions implementing the C++20 module import/export interface. CUDA does not support C++20 modules, so most functions are stubs that return 0:

  • has_pending_template_definition_from_module -- returns 0
  • has_pending_template_specializations_from_module -- returns 0
  • Seven additional stub functions at 0x7C2350--0x7C2410 -- all return 0

The non-stub functions handle the binary module interface file format (magic header {0x9A, 0x13, 0x37, 0x7D}) and basic module name matching, likely preserved from the EDG baseline for future CUDA module support.
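
Recognizing the module interface magic is a plain 4-byte prefix comparison. The sketch below shows one way to write it; the function name is hypothetical and nothing past the magic bytes is modeled, since the rest of the file format is not described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static const unsigned char MODULE_MAGIC[4] = { 0x9A, 0x13, 0x37, 0x7D };

/* True if buf starts with the 4-byte module interface magic. */
bool has_module_magic(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    return len >= sizeof MODULE_MAGIC &&
           memcmp(buf, MODULE_MAGIC, sizeof MODULE_MAGIC) == 0;
}
```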

Cross-TU Correspondence (RDC Mode)

When compiling with Relocatable Device Code (--rdc), multiple translation units are processed sequentially. The trans_corresp.c file (address range 0x7A00D0--0x7A38A0) implements structural equivalence checking between types from different TUs:

  • verify_class_type_correspondence (sub_7A00D0, 703 lines) -- Deep comparison of class types: base classes, friend declarations, member functions, nested types, template parameters
  • verify_enum_type_correspondence (sub_7A0E10) -- Enum underlying type and enumerator list comparison
  • verify_function_type_correspondence (sub_7A1230) -- Parameter list and return type comparison
  • set_type_correspondence (sub_7A1460) -- Links two corresponding types across TUs

The trans_unit.c file manages TU lifecycle with a stack-based model:

| Global | Purpose |
|---|---|
| qword_106BA10 | Current translation unit pointer |
| qword_106B9F0 | Primary (first) translation unit |
| qword_106BA18 | TU stack top |
| dword_106B9E8 | TU stack depth (excluding primary) |

process_translation_unit (sub_7A40A0) allocates a 424-byte TU descriptor and drives the parse-to-completion sequence. switch_translation_unit (sub_7A3D60) saves/restores per-TU state (registered variables, scope stack, file scope) when switching between TUs during RDC compilation.

Lexer & Tokenizer

The lexer in cudafe++ is EDG 6.6's lexical.c implementation -- a hand-coded, state-machine-driven tokenizer that converts raw source bytes into a stream of 357 distinct token kinds. It spans approximately 185 functions across the address range 0x668330--0x689130 and constitutes one of the densest subsystems in the binary. The design is a classic multi-layered scanner: a byte-level character scanner (sub_679800, 907 lines) feeds into a token acquisition engine (sub_6810F0, 3,811 lines), which in turn is wrapped by a cache-aware token delivery function (sub_676860, 1,995 lines). CUDA keyword recognition is injected at the get_token_main level, gated on dword_106C2C0 (GPU compilation mode flag).

The lexer does not use generated tables from tools like flex. Instead, every character-class test, keyword match, and operator scan is written as explicit C switch/if chains, compiled into dense jump tables by the optimizer. This produces extremely large functions -- get_token_main alone has approximately 300 local variables in its decompiled form -- but eliminates the overhead of table-driven DFA transitions for a language as context-sensitive as C++.

Key Facts

| Property | Value |
|---|---|
| Source file | lexical.c (~185 functions) |
| Address range | 0x668330--0x689130 |
| Token kinds | 357 (indexed from off_E6D240 name table) |
| Primary scanner | sub_679800 (scan_token, 907 lines) |
| Token acquisition | sub_6810F0 (get_token_main, 3,811 lines, ~300 locals) |
| Cache + delivery | sub_676860 (get_next_token, 1,995 lines) |
| Numeric literal scanner | sub_672390 (scan_numeric_literal, 1,571 lines) |
| Keyword registration | sub_5863A0 (keyword_init, in fe_init.c, 200+ keywords) |
| Universal char scanner | sub_6711E0 (scan_universal_character, 278 lines) |
| Template arg scanner | sub_67DC90 (scan_template_argument_list, 1,078 lines) |
| Token cache entry size | 80--112 bytes (8 cache entry kinds) |
| Scope entry size | 784 bytes (at qword_126C5E8) |
| GPU mode gate | dword_106C2C0 |
| Current token global | word_126DD58 |

Architecture

The lexer is organized as four concentric layers, each calling into the one below it:

Parser (expr.c, decls.c, statements.c)
  │
  ▼
get_next_token (sub_676860)         ← Cache management, macro rescan
  │
  ▼
get_token_main (sub_6810F0)         ← Keyword classification, CUDA gates
  │
  ▼
scan_token (sub_679800)             ← Character-level scanning
  │
  ▼
Input buffer (qword_126DDA0)        ← Raw bytes from source file

The parser never calls the character-level scanner directly. All token consumption flows through get_next_token, which checks the token cache and rescan lists before falling through to get_token_main. This layering allows the lexer to support lookahead, backtracking, macro expansion replay, and template argument rescanning without modifying the core scanner.

Token System

The 357 Token Kinds

Every token produced by the lexer carries a 16-bit token code stored in word_126DD58. The complete set of 357 token kinds is indexed through the name table at off_E6D240, which maps each token code to its string representation. The stop-token table at qword_126DB48 + 8 contains 357 boolean entries used by the error recovery scanner to identify synchronization points.

Token codes are assigned in blocks:

| Range | Category | Examples |
|---|---|---|
| 1--51 | Operators and punctuation | +, -, *, /, (, ), {, }, ::, -> |
| 52--76 | Alternative tokens / digraphs | and, or, not, <%, %>, <:, :> |
| 77--108 | C89 keywords | auto(77), break(78), case(79), char(80), while(108) |
| 109--131 | C99/C11 keywords | restrict(119), _Bool(120), _Complex(121), _Imaginary(122) |
| 132--136 | MSVC keywords | __declspec(132), __int8(133), __int16(134), __int32(135), __int64(136) |
| 137--199 | C++ keywords | catch(150), class(151), template(160), decltype(185), typeof(189) |
| 200--206 | Compiler internal | Internal token kinds for the preprocessor |
| 207--330 | Type traits | __is_class(207), __has_trivial_copy, ..., NVIDIA-specific traits at 328--330 |
| 331--356 | Extended types / recent additions | _Float32(331)--_Float128(335), C++23/26 features |
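
Because the blocks are contiguous, mapping a token code to its category reduces to a chain of upper-bound comparisons. The sketch below encodes the ranges from the table; the enum and function names are hypothetical, and code 0 is treated as outside the table because the table starts at 1.

```c
#include <assert.h>

enum token_category {
    TC_UNKNOWN, TC_OPERATOR, TC_ALTERNATIVE, TC_C89_KEYWORD, TC_C99_KEYWORD,
    TC_MSVC_KEYWORD, TC_CXX_KEYWORD, TC_INTERNAL, TC_TYPE_TRAIT, TC_EXTENDED
};

/* Classify a token code by the block it falls in. */
enum token_category token_category_of(unsigned code) {
    assert(code <= 0xFFFF);              /* token codes are 16-bit */
    if (code == 0 || code > 356) return TC_UNKNOWN;
    if (code <= 51)  return TC_OPERATOR;
    if (code <= 76)  return TC_ALTERNATIVE;
    if (code <= 108) return TC_C89_KEYWORD;
    if (code <= 131) return TC_C99_KEYWORD;
    if (code <= 136) return TC_MSVC_KEYWORD;
    if (code <= 199) return TC_CXX_KEYWORD;
    if (code <= 206) return TC_INTERNAL;
    if (code <= 330) return TC_TYPE_TRAIT;
    return TC_EXTENDED;
}
```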

CUDA-Specific Token Kinds

Three NVIDIA type-trait keywords occupy dedicated token codes registered during keyword_init:

| Token Code | Keyword | Purpose |
|---|---|---|
| 328 | __nv_is_extended_device_lambda_closure_type | Tests if type is a device lambda |
| 329 | __nv_is_extended_host_device_lambda_closure_type | Tests if type is a host-device lambda |
| 330 | __nv_is_extended_device_lambda_with_preserved_return_type | Tests if device lambda preserves return type |

These are registered as standard type-trait keywords and participate in the same token classification path as the 60+ standard __is_xxx/__has_xxx traits.

Token State Globals

When a token is produced, the following globals are populated:

| Address | Name | Type | Description |
|---|---|---|---|
| word_126DD58 | current_token_code | WORD | 16-bit token kind (0--356) |
| qword_126DD38 | current_source_position | QWORD | Encoded file/line/column |
| qword_126DD48 | token_text_ptr | QWORD | Pointer to identifier/literal text |
| src | token_start_position | char* | Start of token in input buffer |
| n | token_text_length | size_t | Length of token text |
| dword_126DF90 | token_flags_1 | DWORD | Classification flags |
| dword_126DF8C | token_flags_2 | DWORD | Additional flags |
| qword_126DF80 | token_extra_data | QWORD | Context-dependent payload |
| xmmword_106C380--106C3B0 | identifier_lookup_result | 4 x 128-bit | SSE-packed lookup result for identifiers (64 bytes) |

The 64-byte identifier lookup result is stored in four consecutive 128-bit global slots (xmmword_106C380 through xmmword_106C3B0) by the identifier classification path. When a scanned identifier is also a keyword, the lookup result contains the keyword's token code, scope information, and classification flags. The compiled code uses movaps/movups instructions to copy this packed state 16 bytes at a time.

Token Cache

The token cache provides the lookahead, backtracking, and macro-expansion replay capabilities required by C++ parsing. Tokens are stored in a linked list of cache entries that can be consumed, rewound, and re-scanned.

Cache Entry Layout (80--112 bytes)

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Next entry in cache linked list |
| +8 | 8 | source_position | Encoded source location |
| +16 | 2 | token_code | Token kind (0--356) |
| +18 | 1 | cache_entry_kind | Discriminator for payload type (see below) |
| +20 | 4 | flags | Token flags |
| +24 | 4 | extra_flags | Additional flags |
| +32 | 8 | extra_data | Context-dependent data |
| +40.. | varies | payload | Kind-specific payload data |

Cache Entry Kinds

| Kind | Value | Payload | Description |
|---|---|---|---|
| identifier | 1 | Name pointer + lookup result | Identifier token with pre-resolved scope lookup |
| macro_def | 2 | Macro definition pointer | Macro definition for re-expansion (calls sub_5BA500) |
| pragma | 3 | Pragma data | Preprocessor pragma for deferred processing |
| pp_number | 4 | Number text | Preprocessing number (not yet classified as int/float) |
| (reserved) | 5 | -- | Not observed in use |
| string | 6 | String data + encoding | String literal token |
| (reserved) | 7 | -- | Not observed in use |
| concatenated_string | 8 | Concatenated string data | Wide or multi-piece concatenated string literal |
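
The fixed 40-byte header and the kind discriminator can be written down as a C struct whose natural LP64 alignment reproduces the documented offsets. This is a sketch under assumptions: the type and constant names are invented, and the payload is modeled as a flat 40-byte array to reach the 80-byte minimum entry size, whereas the real payload varies by kind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum cache_entry_kind {
    CEK_IDENTIFIER = 1, CEK_MACRO_DEF = 2, CEK_PRAGMA = 3, CEK_PP_NUMBER = 4,
    CEK_STRING = 6, CEK_CONCATENATED_STRING = 8   /* 5 and 7 not observed */
};

typedef struct cache_entry {
    struct cache_entry *next;   /* +0  */
    uint64_t source_position;   /* +8  */
    uint16_t token_code;        /* +16 */
    uint8_t  kind;              /* +18, an enum cache_entry_kind value */
    uint32_t flags;             /* +20 */
    uint32_t extra_flags;       /* +24 */
    uint64_t extra_data;        /* +32 */
    unsigned char payload[40];  /* +40, kind-specific in the real layout */
} cache_entry;

static_assert(offsetof(cache_entry, payload) == 40, "header is 40 bytes");

/* Kinds 5 and 7 are reserved; the other values in 1..8 were observed. */
int cache_kind_is_observed(int kind) {
    return kind >= 1 && kind <= 8 && kind != 5 && kind != 7;
}
```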

Cache Management Globals

| Address | Name | Description |
|---|---|---|
| qword_1270150 | cached_token_rescan_list | Head of list of tokens to re-scan (pushed back for lookahead) |
| qword_1270128 | reusable_cache_stack | Stack of reusable cache entry blocks |
| qword_1270148 | free_token_list | Free list for recycling cache entries |
| qword_1270140 | macro_definition_chain | Active macro definition chain |
| dword_126DB74 | has_cached_tokens | Boolean flag: nonzero when cache is non-empty |

Cache Operations

| Address | Identity | Description |
|---|---|---|
| sub_669650 | copy_tokens_from_cache | Copies cached preprocessor tokens for macro re-expansion (assert at lexical.c:3417) |
| sub_669D00 | allocate_token_cache_entry | Allocates from free list at qword_1270118 |
| sub_669EB0 | create_cached_token_node | Creates and initializes token cache node |
| sub_66A000 | append_to_token_cache | Appends token to cache list, maintains tail pointer |
| sub_66A140 | push_token_to_rescan_list | Pushes token onto rescan stack at qword_1270150 |
| sub_66A2C0 | free_single_cache_entry | Returns cache entry to free list |
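
The interplay between the rescan stack and the free list can be illustrated with a much smaller model. In the sketch below a token is just an integer code, pushed-back tokens are served before the scanner is consulted, and consumed entries are recycled. All names are hypothetical, and demo_scan is a toy scanner that returns 100, 101, ... so the example is self-contained.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct tok { struct tok *next; int code; } tok;

static tok *rescan_list = NULL;  /* plays the role of qword_1270150 */
static tok *free_list = NULL;    /* recycled entries */

static tok *tok_alloc(int code) {
    tok *t = free_list;
    if (t) free_list = t->next;          /* reuse a recycled entry */
    else   t = malloc(sizeof *t);
    assert(t != NULL);
    t->code = code;
    t->next = NULL;
    return t;
}

/* Push a token back so that the next fetch returns it again. */
void push_rescan(int code) {
    tok *t = tok_alloc(code);
    t->next = rescan_list;
    rescan_list = t;
}

/* Fetch: pushed-back tokens take priority over the scanner callback. */
int next_token(int (*scan)(void)) {
    if (rescan_list) {
        tok *t = rescan_list;
        int code = t->code;
        rescan_list = t->next;
        t->next = free_list;             /* recycle the entry */
        free_list = t;
        return code;
    }
    return scan();
}

/* Toy scanner used only for demonstration. */
int demo_scan(void) { static int n = 100; return n++; }
```

The last-in, first-out order matches the description of the rescan list as a stack; the real entries carry the full cache-entry payload rather than a bare code.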

Layer 1: scan_token (sub_679800)

scan_token is the character-level scanner. It reads raw bytes from the input buffer at qword_126DDA0, classifies them, and produces a single token. The function is 907 lines and dispatches on the first byte of each token.

Character Dispatch

The scanner reads the byte at the current input position and enters one of the following paths:

| First Byte | Action |
|---|---|
| 0x00 (NUL) | Control byte processing (8 embedded control types, see below) |
| 0x09 (TAB), 0x0B (VT), 0x0C (FF), 0x20 (space) | Whitespace -- advance and retry |
| a--z, A--Z, _ | Identifier or keyword scanning |
| 0--9 | Numeric literal scanning (decimal, hex, octal, binary) |
| ' | Character literal scanning |
| " | String literal scanning |
| / | Comment (// or /* */) or division operator |
| . | Dot operator, or float literal if followed by digit |
| < | Less-than, <=, <<, <<=, <=>, or template bracket |
| > | Greater-than, >=, >>, >>=, or template bracket |
| +, -, *, %, ^, ~, !, =, &, \| | Operator scanning (single or compound) |
| (, ), [, ], {, }, ;, ,, ?, @ | Single-character tokens |
| # | Preprocessor directive or stringification operator |
| \ | Universal character name (\uXXXX, \UXXXXXXXX) or line continuation |

Embedded Control Bytes (NUL Dispatch)

The input buffer uses embedded NUL bytes (0x00) as in-band control markers. When the scanner encounters a NUL, it reads the next byte as a control type code:

| Control Type | Value | Action |
|---|---|---|
| Newline marker | 1 | End of line -- calls sub_6702F0 (refill_buffer) to read next source line |
| (reserved) | 2 | -- |
| Macro position | 3 | Macro expansion position marker -- calls sub_66A770 to update position tracking |
| End of directive | 4 | Marks end of a preprocessor directive |
| EOF (primary) | 5 | End of current source file -- pops file stack |
| Stale position | 6 | Invalid position marker -- emits diagnostic 1192 or 861 |
| Continuation | 7 | Backslash-newline continuation was here |
| EOF (secondary) | 8 | Secondary EOF marker for nested includes |

This in-band signaling approach avoids the cost of checking buffer boundaries on every character read. The refill_buffer function (sub_6702F0, 792 lines) places these marker bytes at the end of each source line, so the scanner can detect line endings and EOF without comparing the input pointer against a limit.
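
The effect of in-band markers on the scanning loop can be shown with a toy scanner that needs no length parameter. The sketch assumes a simplification: the whole input is laid out in one buffer, with a NUL+1 pair after each line and a NUL+5 pair at the end, whereas the real scanner refills the buffer at each newline marker. The names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { CTL_NEWLINE = 1, CTL_EOF = 5 };

typedef struct { size_t lines; size_t bytes; } scan_stats;

/* Walk a marker-terminated buffer. The only per-byte test is "is it NUL?";
   no pointer-versus-limit comparison is needed. The buffer must end with
   a NUL byte followed by CTL_EOF. */
scan_stats scan_buffer(const unsigned char *p) {
    scan_stats st = { 0, 0 };
    assert(p != NULL);
    for (;;) {
        if (*p != 0) { st.bytes++; p++; continue; }  /* ordinary source byte */
        unsigned char ctl = p[1];                    /* control type follows NUL */
        if (ctl == CTL_EOF) return st;
        if (ctl == CTL_NEWLINE) st.lines++;
        p += 2;                                      /* skip the marker pair */
    }
}
```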

Input Buffer System

| Address | Name | Description |
|---|---|---|
| qword_126DDA0 | current_input_position | Read pointer into the input buffer |
| qword_126DDD8 | input_buffer_base | Start of the allocated input buffer |
| qword_126DDD0 | input_buffer_end | End of the allocated input buffer |
| qword_126DDF0 | file_stack | Stack of open source files (for #include) |
| qword_127FBA8 | current_file_handle | FILE* for the current source file |
| dword_127FBA0 | eof_flag | Set when current file reaches EOF |
| dword_127FB9C | multibyte_encoding_mode | Values >1 enable multibyte character decoding via sub_5B09B0 |
| dword_126DDA8 | source_line_counter | Lines read from current source file |
| dword_126DDBC | output_line_counter | Lines emitted to preprocessed output |

Buffer Refill: read_next_source_line (sub_66F4E0)

sub_66F4E0 (735 lines) reads the next line from the source file into the input buffer. It calls getc() for single-byte mode or sub_5B09B0 for multibyte mode (controlled by dword_127FB9C > 1). The function:

  1. Reads characters one at a time until newline or EOF
  2. Handles backslash-newline line splicing (joining continuation lines)
  3. Places control byte markers at newline positions (type 1) and EOF (type 5/8)
  4. Updates the line counter at dword_126DDA8
  5. Manages trigraph warnings (diagnostic 1750) through the companion function sub_6702F0
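
Steps 1 through 4 can be sketched for the single-byte path as follows. This is an illustration, not the decompiled routine: the function name, the caller-supplied buffer, and the line-counter parameter are assumptions, and multibyte decoding and trigraph handling are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Read one logical line from fp into out (capacity cap), joining
   backslash-newline continuations, then append the two-byte marker:
   NUL + 1 after a newline, NUL + 5 at end of file.
   Returns bytes written including the marker, or 0 if the line does not fit. */
size_t read_logical_line(FILE *fp, unsigned char *out, size_t cap,
                         int *physical_lines) {
    assert(fp && out && physical_lines);
    size_t n = 0;
    int c;
    while ((c = getc(fp)) != EOF && c != '\n') {
        if (c == '\\') {
            int d = getc(fp);
            if (d == '\n') { (*physical_lines)++; continue; }  /* splice */
            if (d != EOF) ungetc(d, fp);
        }
        if (n + 3 > cap) return 0;       /* keep room for the marker */
        out[n++] = (unsigned char)c;
    }
    if (c == '\n') (*physical_lines)++;
    if (n + 2 > cap) return 0;
    out[n++] = 0;
    out[n++] = (c == EOF) ? 5 : 1;
    return n;
}
```

Counting physical lines separately from logical lines is what lets diagnostics keep correct line numbers after splicing.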

Layer 2: get_token_main (sub_6810F0)

get_token_main is the largest function in the lexer at 3,811 decompiled lines with approximately 300 local variables. It wraps scan_token and performs the complete token classification pipeline: keyword recognition, CUDA keyword gating, template parameter detection, operator overload name lookup, access specifier tracking, and namespace scope management.

Token Classification Pipeline

After scan_token produces a raw token, get_token_main performs these classification steps:

scan_token produces raw token
  │
  ├── Identifier?
  │     ├── Look up in keyword table
  │     │     ├── Standard C/C++ keyword → set token_code to keyword kind
  │     │     ├── CUDA keyword (dword_106C2C0 != 0) → set token_code
  │     │     ├── Type trait keyword → set token_code (207--330)
  │     │     └── Not a keyword → classify as identifier token
  │     │
  │     ├── Check template parameter context
  │     │     └── If inside template<>, classify as type-name or non-type
  │     │
  │     └── Entity lookup for context-sensitive classification
  │           ├── typedef name → classify as TYPE_NAME token
  │           ├── class/struct name → classify as CLASS_NAME
  │           ├── enum name → classify as ENUM_NAME
  │           ├── namespace name → classify as NAMESPACE_NAME
  │           └── template name → classify as TEMPLATE_NAME
  │
  ├── Numeric literal?
  │     └── Route to scan_numeric_literal (sub_672390)
  │
  ├── String/character literal?
  │     └── Handle encoding prefix (L, u8, u, U, R)
  │
  └── Operator/punctuation?
        ├── Check for template angle bracket context
        ├── Handle digraphs/alternative tokens
        └── Produce operator token code

CUDA Keyword Detection

CUDA keyword handling is gated on dword_106C2C0 (GPU mode). When this flag is nonzero, get_token_main recognizes CUDA-specific identifiers and routes them to the CUDA attribute processing path:

// Pseudocode from get_token_main
if (token_is_identifier) {
    // ... standard keyword lookup ...

    if (dword_106C2C0 != 0) {  // GPU mode active
        // Check for __device__, __host__, __global__,
        // __shared__, __constant__, __managed__,
        // __launch_bounds__, __grid_constant__
        // Route to CUDA attribute handlers
        if (dword_106BA08) {   // CUDA attribute processing enabled
            sub_74DC30(...);   // CUDA attribute resolution
            sub_74E240(...);   // CUDA attribute application
        }
    }
}

The GPU mode flag dword_106C2C0 is also checked during:

  • Attribute token processing in sub_686350 (handle_attribute_token, 584 lines)
  • Deferred diagnostic emission in sub_668660 (severity override via byte_126ED55)
  • Entity visibility computation in sub_669130
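
The gating itself is simple to illustrate. The sketch below assumes a flat name table and invented token codes (`TOK_IDENTIFIER`, `TOK_CUDA_ATTRIBUTE`); the real lexer routes through sub_74DC30 and sub_74E240 rather than returning a single code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token codes for illustration only; the real values are
 * not documented in this section. */
enum { TOK_IDENTIFIER = 1, TOK_CUDA_ATTRIBUTE = 1000 };

static const char *const cuda_names[] = {
    "__device__", "__host__", "__global__", "__shared__",
    "__constant__", "__managed__", "__launch_bounds__", "__grid_constant__",
};

/* Classify `name`; `gpu_mode` plays the role of dword_106C2C0. */
static int classify_cuda_identifier(const char *name, int gpu_mode)
{
    assert(name != NULL);
    if (gpu_mode != 0) {
        for (size_t i = 0; i < sizeof cuda_names / sizeof cuda_names[0]; i++)
            if (strcmp(name, cuda_names[i]) == 0)
                return TOK_CUDA_ATTRIBUTE;
    }
    return TOK_IDENTIFIER;   /* GPU mode off: an ordinary identifier */
}
```

With the flag clear, `__global__` falls through to the ordinary identifier path, so the same lexer can process plain C++ without reserving the CUDA names.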

C++ Standard Version Gating

Throughout get_token_main, keyword classification is gated on the C++ standard version stored in dword_126EF68:

| Version Value | Standard | Keywords Enabled |
| --- | --- | --- |
| 201102 | C++11 | constexpr, decltype, nullptr, char16_t, char32_t, static_assert |
| 201402 | C++14 | binary literals, digit separators |
| 201703 | C++17 | if constexpr, char8_t, structured bindings |
| 202002 | C++20 | concept, requires, co_yield, co_return, co_await, consteval, constinit |
| 202302 | C++23 | typeof, typeof_unqual, extended digit separators |

The language mode at dword_126EFB4 controls broader dialect selection:

| Value | Mode | Effect |
| --- | --- | --- |
| 1 | GNU/default | GNU extensions enabled, alternative tokens recognized |
| 2 | MSVC | MSVC keywords enabled (__declspec, __int8--__int64), some GNU extensions disabled |

Context-Sensitive Token Classification

C++ requires the lexer to classify identifiers based on declaration context. The functions supporting this classification:

| Address | Identity | Description |
| --- | --- | --- |
| sub_668C90 | classify_identifier_entity | Dispatches on entity kind: typedef(3), class(4,5), function(7), namespace(9,10), template(19-22) |
| sub_668E00 | resolve_entity_through_alias | Walks typedef/using chains (kind=3 with +104 flag, kind=16 → **[+88]) |
| sub_668F80 | get_resolved_entity_type | Resolves entity to underlying type through alias chains |
| sub_668900 | handle_token_identifier_type_check | Determines if token is identifier vs typename vs template |
| sub_666720 | select_dual_lookup_symbol | Selects between two candidate symbols in dual-scope lookup (372 lines) |

Entity classification reads the entity_kind byte at offset +80 of entity nodes:

switch (entity->kind) {    // offset +80
    case 3:                // typedef
        return TYPE_NAME;
    case 4: case 5:        // class / struct
        return CLASS_NAME;
    case 6:                // enum
        return ENUM_NAME;
    case 7:                // function
        return IDENTIFIER;
    case 9: case 10:       // namespace / namespace alias
        return NAMESPACE_NAME;
    case 19: case 20: case 21: case 22:  // template kinds
        return TEMPLATE_NAME;
    case 16:               // using declaration
        return resolve_through_using(entity);
    case 24:               // namespace alias (resolved)
        return NAMESPACE_NAME;
}
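
The switch above combines naturally with the alias walk attributed to sub_668E00. The following sketch uses a simplified `struct entity` with a hypothetical `target` field standing in for the pointer chain at +88; the hop limit and the `default` fallback are assumptions added for safety, not observed behavior.

```c
#include <assert.h>
#include <stddef.h>

enum token_class { IDENTIFIER, TYPE_NAME, CLASS_NAME, ENUM_NAME,
                   NAMESPACE_NAME, TEMPLATE_NAME };

/* Simplified entity node; the real node keeps `kind` at offset +80. */
struct entity {
    unsigned char kind;      /* kind codes as listed in the switch above */
    struct entity *target;   /* for kind 16: the entity the using-declaration names */
};

static enum token_class classify_entity(const struct entity *e)
{
    /* Follow using-declaration links; the hop limit guards against cycles. */
    for (int hops = 0; e != NULL && e->kind == 16 && hops < 64; hops++)
        e = e->target;
    if (e == NULL)
        return IDENTIFIER;

    switch (e->kind) {
    case 3:                             return TYPE_NAME;
    case 4: case 5:                     return CLASS_NAME;
    case 6:                             return ENUM_NAME;
    case 9: case 10: case 24:           return NAMESPACE_NAME;
    case 19: case 20: case 21: case 22: return TEMPLATE_NAME;
    default:                            return IDENTIFIER;  /* assumed fallback */
    }
}
```

Resolving the alias before dispatching means a using-declaration that names a typedef is reported as a type name, which is what the parser needs to disambiguate declarations from expressions.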

Layer 3: get_next_token (sub_676860)

get_next_token (1,995 lines) is the token delivery function called by the parser. It manages the token cache, handles macro expansion replay, and calls get_token_main only when no cached tokens are available.

Token Delivery Flow

get_next_token (sub_676860)
  │
  ├── Check cached_token_rescan_list (qword_1270150)
  │     └── If non-empty: pop token, dispatch on cache_entry_kind
  │           ├── kind 1 (identifier): load xmmword_106C380..106C3B0
  │           ├── kind 2 (macro_def): call sub_5BA500 (macro expansion)
  │           ├── kind 3 (pragma): process deferred pragma
  │           ├── kind 4 (pp_number): return as-is
  │           ├── kind 6 (string): return string token
  │           └── kind 8 (concatenated_string): return concatenated string
  │
  ├── Check reusable_cache_stack (qword_1270128)
  │     └── If non-empty: pop and return cached token
  │           (assert: "get_token_from_reusable_cache_stack" at 4450, 4469)
  │
  ├── Check pending_macro_arg (qword_106B8A0)
  │     └── If set: process macro argument token
  │
  └── Fall through to get_token_main (sub_6810F0)
        └── Full token acquisition from source
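
The priority order in this flow can be sketched as follows. The names are hypothetical stand-ins, the pending macro argument check is omitted, and `scan_fresh_token` only imitates get_token_main by handing out increasing codes.

```c
#include <assert.h>
#include <stddef.h>

struct cached_token {
    int code;
    struct cached_token *next;
};

/* Stand-ins for qword_1270150 and qword_1270128. */
static struct cached_token *rescan_list;
static struct cached_token *reusable_stack;

/* Stand-in for get_token_main: hands out increasing codes from 100. */
static int fresh_code = 100;
static int scan_fresh_token(void) { return fresh_code++; }

static int pop(struct cached_token **head)
{
    struct cached_token *t = *head;
    *head = t->next;
    return t->code;
}

/* Deliver the next token code, preferring cached sources over a fresh scan. */
static int next_token(void)
{
    if (rescan_list != NULL)
        return pop(&rescan_list);
    if (reusable_stack != NULL)
        return pop(&reusable_stack);
    return scan_fresh_token();
}
```

Draining the caches first is what lets the parser back up and replay tokens without the underlying scanner ever re-reading source text.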

The function sets the following globals on every token delivery:

  • word_126DD58 = token code
  • qword_126DD38 = source position
  • dword_126DF90 = token flags 1
  • dword_126DF8C = token flags 2
  • qword_126DF80 = extra data

CUDA Attribute Token Interception

When CUDA attribute processing is enabled (dword_106BA08 != 0), get_next_token intercepts identifier tokens and routes them through CUDA attribute resolution via sub_74DC30 and sub_74E240. This allows CUDA execution-space attributes (__device__, __host__, __global__) to be recognized at the token level rather than requiring full declaration parsing.

Numeric Literal Scanner: scan_numeric_literal (sub_672390)

The numeric literal scanner is 1,571 lines and handles every numeric literal format defined by C89 through C++23.

Literal Prefix Dispatch

scan_numeric_literal
  │
  ├── First char '0':
  │     ├── 0x/0X → hex literal (isxdigit validation)
  │     ├── 0b/0B → binary literal (C++14)
  │     ├── 0[0-7] → octal literal
  │     └── 0 alone → decimal zero
  │
  ├── First char '1'-'9':
  │     └── decimal literal
  │
  └── After integer part:
        ├── '.' → floating-point literal
        ├── 'e'/'E' → decimal float exponent
        ├── 'p'/'P' → hex float exponent
        └── suffix → type suffix parsing

C++14 Digit Separators

Digit separators (' characters within numeric literals) are handled through a two-flag system:

| Address | Name | Purpose |
| --- | --- | --- |
| dword_126EEFC | cpp14_digit_separators_enabled | Master enable for digit separator support |
| dword_126DB58 | digit_separator_seen | Set when a separator is encountered in the current literal |

When dword_126EEFC is enabled, the scanner accepts ' between digits:

// Digit separator handling in scan_numeric_literal
while (isdigit(*pos) || (*pos == '\'' && dword_126EEFC)) {
    if (*pos == '\'') {
        dword_126DB58 = 1;  // mark separator seen
        pos++;
        if (!isdigit(*pos))
            emit_diagnostic(2629);  // separator not followed by digit
        continue;
    }
    // process digit...
    pos++;                  // advance past the digit just processed
}

C++23 extended digit separators (for binary, octal, hex) are gated on dword_126EF68 > 202302:

if (dword_126EF68 > 202302) {
    // C++23: allow digit separators in binary/octal/hex
} else {
    emit_diagnostic(2628);  // C++23 feature used in earlier mode
}

Integer Suffix Parsing

sub_6748A0 (convert_integer_suffix, 137 lines) parses the following suffixes:

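| Suffix | Type |
| --- | --- |
| (none) | int (or promoted per value) |
| u / U | unsigned int |
| l / L | long |
| ll / LL | long long |
| ul / UL | unsigned long |
| ull / ULL | unsigned long long |
| z / Z | signed integer type corresponding to size_t (C++23) |
| uz / UZ | size_t (C++23) |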

sub_674BB0 (determine_numeric_literal_type, 400 lines) applies the C++ promotion rules based on the literal value and suffix to determine the final type.
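
A suffix parser along these lines might look like the sketch below. The `SUF_*` flags and `parse_integer_suffix` are hypothetical names chosen for the example; the real convert_integer_suffix presumably writes into EDG's constant descriptor instead of returning a bitmask.

```c
#include <assert.h>
#include <stddef.h>

enum { SUF_UNSIGNED = 1, SUF_LONG = 2, SUF_LONG_LONG = 4, SUF_SIZE = 8 };

/* Parse an integer-literal suffix such as "u", "LL", "ull" or "uz".
 * Returns a bitmask of SUF_* flags, or -1 if the suffix is malformed. */
static int parse_integer_suffix(const char *s)
{
    assert(s != NULL);
    int flags = 0;
    while (*s) {
        if (*s == 'u' || *s == 'U') {
            if (flags & SUF_UNSIGNED) return -1;            /* "uu" */
            flags |= SUF_UNSIGNED;
            s++;
        } else if (*s == 'l' || *s == 'L') {
            if (flags & (SUF_LONG | SUF_LONG_LONG | SUF_SIZE)) return -1;
            if (s[1] == s[0]) { flags |= SUF_LONG_LONG; s += 2; } /* "ll" or "LL", never "lL" */
            else              { flags |= SUF_LONG;      s += 1; }
        } else if (*s == 'z' || *s == 'Z') {
            if (flags & (SUF_LONG | SUF_LONG_LONG | SUF_SIZE)) return -1;
            flags |= SUF_SIZE;
            s++;
        } else {
            return -1;                                      /* unknown character */
        }
    }
    return flags;
}
```

Returning flags rather than a type keeps suffix parsing separate from the value-dependent promotion step, mirroring the split between sub_6748A0 and sub_674BB0.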

Floating-Point Literal Handling

| Address | Identity | Description |
| --- | --- | --- |
| sub_675390 | scan_float_exponent | Scans e/E/p/P exponent suffix (57 lines) |
| sub_6754B0 | convert_float_literal | Converts float literal string to value (338 lines) |

Float suffixes: f/F (float), l/L (long double), none (double).

Universal Character Names: scan_universal_character (sub_6711E0)

sub_6711E0 (278 lines, assert at lexical.c:12384) scans \uXXXX and \UXXXXXXXX universal character names in identifiers and string/character literals.

void scan_universal_character(char *input, uint32_t *result) {
    // input points at the backslash of \uXXXX or \UXXXXXXXX
    int width;
    if (input[1] == 'u')
        width = 4;    // \uXXXX
    else
        width = 8;    // \UXXXXXXXX
    input += 2;       // skip the backslash and the u/U marker

    uint32_t value = 0;
    for (int i = 0; i < width; i++) {
        unsigned char c = (unsigned char)*input++;  // keep <ctype.h> arguments non-negative
        if (!isxdigit(c)) {
            // emit error diagnostic
            return;
        }
        int digit;
        if (c >= '0' && c <= '9')
            digit = c - 48;      // '0' = 48
        else if (islower(c))
            digit = c - 87;      // 'a' = 97, 97-87 = 10
        else
            digit = c - 55;      // 'A' = 65, 65-55 = 10
        value = (value << 4) | digit;
    }
    *result = value;
}

sub_671870 (validate_universal_character_value, 62 lines) performs range checking after scanning: surrogate pair values (0xD800--0xDFFF) are rejected, and values outside the valid Unicode range (> 0x10FFFF) produce an error.
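
The range check described here reduces to two comparisons. The helper name `is_valid_ucn_value` is hypothetical, and the sketch returns a flag where the real function emits diagnostics.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if `cp` is acceptable as a universal character name value:
 * not a UTF-16 surrogate and not beyond the last Unicode code point. */
static int is_valid_ucn_value(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogate range */
    if (cp > 0x10FFFF)
        return 0;               /* outside the Unicode code space */
    return 1;
}
```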

The feature is controlled by dword_106BCC4 (universal characters enabled) and dword_106BD4C (extended character mode).

Keyword Registration: keyword_init (sub_5863A0)

sub_5863A0 (1,113 lines, in fe_init.c) registers all C/C++ keywords with the symbol table during frontend initialization. It calls sub_7463B0 (enter_keyword) once per keyword, passing the token ID and string representation. GNU double-underscore variants are registered via sub_585B10, and alternative tokens via sub_749600.

Keyword Categories and Version Gating

Keywords are registered conditionally based on language mode and standard version:

keyword_init (sub_5863A0)
  │
  ├── C89 core (always registered)
  │     auto(77), break(78), case(79), char(80), continue(82),
  │     default(83), do(84), double(85), else(86), enum(87),
  │     extern(88), float(89), for(90), goto(91), if(92),
  │     int(93), long(94), register(95), return(96), short(97),
  │     sizeof(99), static(100), struct(101), switch(102),
  │     typedef(103), union(104), unsigned(105), void(106), while(108)
  │
  ├── C99 (gated on C99+ mode)
  │     _Bool(120), _Complex(121), _Imaginary(122), restrict(119)
  │
  ├── C11 (gated on C11+ mode)
  │     _Generic(262), _Atomic(263), _Alignof(247), _Alignas(248),
  │     _Thread_local(194), _Static_assert(184), _Noreturn(260)
  │
  ├── C23 (gated on C23 mode)
  │     bool, true, false, alignof, alignas, static_assert,
  │     thread_local, typeof(189), typeof_unqual(190)
  │
  ├── C++ core (gated on C++ mode: dword_126EFB4 == 2)
  │     catch(150), class(151), friend(153), inline(154),
  │     mutable(174), operator(156), new(155), delete(152),
  │     private(157), protected(158), public(159), template(160),
  │     this(161), throw(162), try(163), virtual(164),
  │     namespace(175), using(179), typename(183), typeid(178),
  │     const_cast(166), dynamic_cast(167), static_cast(177),
  │     reinterpret_cast(176)
  │
  ├── C++ alternative tokens (gated on C++ mode)
  │     and(52), and_eq(64), bitand(33), bitor(51), compl(37),
  │     not(38), not_eq(48), or(53), or_eq(66), xor(50), xor_eq(65)
  │
  ├── C++ modern keywords (gated on standard version)
  │     C++11: constexpr(244), decltype(185), nullptr(237),
  │            char16_t(126), char32_t(127)
  │     C++17: char8_t(128)
  │     C++20: consteval(245), constinit(246), co_yield(267),
  │            co_return(268), co_await(269), concept(295), requires(294)
  │     C++23: typeof(189), typeof_unqual(190)
  │
  ├── GNU extensions (gated on dword_126EFA8)
  │     __extension__(187), __auto_type(186), __attribute(142),
  │     __builtin_offsetof(117), __builtin_types_compatible_p(143),
  │     __builtin_shufflevector(258), __builtin_convertvector(259),
  │     __builtin_complex(261), __builtin_has_attribute(296),
  │     __builtin_addressof(271), __builtin_bit_cast(297),
  │     __int128(239), __bases(249), __direct_bases(250),
  │     _Float32(331), _Float32x(332), _Float64(333),
  │     _Float64x(334), _Float128(335)
  │
  ├── MSVC extensions (gated on dword_126EFB0)
  │     __declspec(132), __int8(133), __int16(134),
  │     __int32(135), __int64(136)
  │
  ├── Clang extensions (gated on Clang version at qword_126EF90)
  │     _Nullable(264), _Nonnull(265), _Null_unspecified(266)
  │
  ├── Type traits (60+, gated by standard version)
  │     __is_class(207), __is_enum, __is_union, __has_trivial_copy,
  │     __has_virtual_destructor, ... through token code 327
  │
  ├── NVIDIA CUDA type traits (gated on GPU mode)
  │     __nv_is_extended_device_lambda_closure_type(328),
  │     __nv_is_extended_host_device_lambda_closure_type(329),
  │     __nv_is_extended_device_lambda_with_preserved_return_type(330)
  │
  └── EDG internal keywords (always registered)
        __edg_type__(272), __edg_size_type__(277),
        __edg_ptrdiff_type__(278), __edg_bool_type__(279),
        __edg_wchar_type__(280), __edg_opnd__(282),
        __edg_throw__(281), __edg_is_deducible(304),
        __edg_vector_type__(273), __edg_neon_vector_type__(274)

Version gating globals used during keyword registration:

| Address | Name | Values |
| --- | --- | --- |
| dword_126EFB4 | language_mode | 1 = K&R C / GNU default, 2 = C++ |
| dword_126EF68 | cpp_standard_version | 199900, 201102, 201402, 201703, 202002, 202302 |
| qword_126EF98 | gnu_version | e.g., 0x9FC3 (40899 decimal, i.e. GCC 4.8.99 if encoded as major*10000 + minor*100 + patch) |
| qword_126EF90 | clang_version | e.g., 0x15F8F (89999 decimal), 0x1D4BF (119999 decimal) |
| dword_126EFA8 | gnu_extensions_enabled | Boolean |
| dword_126EFA4 | extensions_enabled | Boolean (Clang compat) |
| dword_126EFAC | c_language_mode | Boolean: C vs C++ |
| dword_126EFB0 | microsoft_extensions_enabled | Boolean |
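
Registration-time gating can be pictured as a table walk. In this sketch the token codes come from the listing above, but `struct keyword_spec`, the fixed-size `registered` array and the single version gate are simplifying assumptions; the real keyword_init is straight-line code with many more conditions.

```c
#include <assert.h>
#include <stddef.h>

struct keyword_spec {
    const char *name;
    int token_code;        /* codes taken from the listing above */
    long min_cpp_version;  /* 0 = no version gate */
};

static const struct keyword_spec specs[] = {
    { "class",     151, 0      },
    { "constexpr", 244, 201102 },
    { "consteval", 245, 202002 },
    { "co_await",  269, 202002 },
};

/* Hypothetical stand-in for enter_keyword (sub_7463B0): records the entry
 * in a fixed-size table instead of the real symbol table. */
static const struct keyword_spec *registered[8];
static size_t registered_count;

static void enter_keyword(const struct keyword_spec *k)
{
    assert(registered_count < sizeof registered / sizeof registered[0]);
    registered[registered_count++] = k;
}

/* Register every keyword whose version gate is satisfied.
 * Returns the number of keywords registered by this call. */
static size_t keyword_init(long cpp_standard_version)
{
    size_t before = registered_count;
    for (size_t i = 0; i < sizeof specs / sizeof specs[0]; i++)
        if (specs[i].min_cpp_version <= cpp_standard_version)
            enter_keyword(&specs[i]);
    return registered_count - before;
}
```

Because unregistered keywords never enter the symbol table, an identifier such as `consteval` lexes as a plain identifier in C++14 mode with no extra check at lookup time.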

String and Character Literal Scanning

Character Literal Scanning

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_66CB30 | scan_character_literal_prefix | 34 | Detects encoding prefix (L, u, U, u8) |
| sub_66CBD0 | scan_character_literal | 111 | Scans 'x' / L'x' / u'x' / U'x' / u8'x' literals |

String Literal Scanning

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_66C550 | scan_string_literal | 356 | Scans quoted string literals with escape sequences |
| sub_676080 | scan_raw_string_literal | 391 | Scans R"delimiter(content)delimiter" raw strings |
| sub_66E6E0 | scan_identifier_suffix | 94 | Checks for user-defined literal suffixes (C++11) |
| sub_66E920 | is_valid_ud_suffix | 51 | Validates user-defined literal suffix names |
| sub_6892F0 | string_literal_concatenation_check | 107 | Checks adjacent string literal tokens for concatenation |
| sub_689550 | process_user_defined_literal | 332 | Handles C++11 UDL operator lookup |

Encoding Prefixes

The lexer recognizes five string literal encodings (the unprefixed form plus four prefixes), each producing a different string literal type:

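| Prefix | Token | Character Type | Width |
| --- | --- | --- | --- |
| (none) | "..." | char | 1 byte |
| L | L"..." | wchar_t | 4 bytes (Linux) |
| u8 | u8"..." | char8_t (C++20) / char | 1 byte |
| u | u"..." | char16_t | 2 bytes |
| U | U"..." | char32_t | 4 bytes |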
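
Prefix detection can be sketched as a small table search. `string_prefix_width` is a hypothetical helper; it uses the widths from the table above and ignores the raw-string `R` marker, which the real lexer handles in sub_676080.

```c
#include <assert.h>
#include <string.h>

/* Given the start of a string literal, return the width in bytes of one
 * code unit according to the table above (wchar_t taken as 4 bytes, as on
 * Linux), and store the prefix length in *prefix_len. Returns 0 if `s`
 * does not start a string literal. */
static int string_prefix_width(const char *s, size_t *prefix_len)
{
    static const struct { const char *prefix; int width; } table[] = {
        { "u8", 1 },   /* must be tested before "u" */
        { "u",  2 },
        { "U",  4 },
        { "L",  4 },
        { "",   1 },
    };
    assert(s != NULL && prefix_len != NULL);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        size_t n = strlen(table[i].prefix);
        if (strncmp(s, table[i].prefix, n) == 0 && s[n] == '"') {
            *prefix_len = n;
            return table[i].width;
        }
    }
    return 0;
}
```

Requiring the opening quote immediately after the prefix is what keeps an ordinary identifier such as `u8x` from being mistaken for a literal.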

Scope Entry Layout

The lexer interacts heavily with the scope system. Scope entries are 784-byte records stored in an array at qword_126C5E8, indexed by dword_126C5E4 (current scope index).

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| +0 | 4 | name_hash | Hash of scope name for lookup |
| +4 | 1 | scope_kind | Kind code (12 = file scope, see below) |
| +6 | 1 | scope_flags | Bit flags: bit 5 = inline namespace |
| +7 | 1 | access_flags | Bit 0 = in class context |
| +10 | 1 | extra_flags | Bit 0 = module scope |
| +12 | 1 | template_flags | Bit 0 = in template argument scan, bit 4 = has concepts |
| +24 | 8 | symbol_chain_or_hash_ptr | Head of symbol chain or hash table |
| +32 | 8 | hash_table_ptr | Hash table for O(1) lookup in large scopes |
| +192 | 8 | lazy_load_scope_ptr | Pointer for lazy symbol loading (calls sub_7C1900) |
| +208 | 4 | scope_depth | Nesting depth counter |
| +376 | 8 | parent_template_info | Template context for template scope entries |
| +416 | 8 | module_info | C++20 module partition data |
| +632 | 8 | class_info_ptr | Pointer to class descriptor for class scopes |
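
For reimplementation work it can help to pin this layout down as a C struct. The sketch below covers only the fields in the table; every `pad*` member is an unknown region, pointers are modeled as `uint64_t` to match the x86-64 binary, and the compile-time checks confirm the offsets and the 784-byte size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct scope_entry {
    uint32_t name_hash;                 /* +0   */
    uint8_t  scope_kind;                /* +4   */
    uint8_t  pad5;                      /* +5   unknown */
    uint8_t  scope_flags;               /* +6   */
    uint8_t  access_flags;              /* +7   */
    uint8_t  pad8[2];                   /* +8   unknown */
    uint8_t  extra_flags;               /* +10  */
    uint8_t  pad11;                     /* +11  unknown */
    uint8_t  template_flags;            /* +12  */
    uint8_t  pad13[11];                 /* +13  unknown */
    uint64_t symbol_chain_or_hash_ptr;  /* +24  */
    uint64_t hash_table_ptr;            /* +32  */
    uint8_t  pad40[152];                /* +40  unknown */
    uint64_t lazy_load_scope_ptr;       /* +192 */
    uint8_t  pad200[8];                 /* +200 unknown */
    int32_t  scope_depth;               /* +208 */
    uint8_t  pad212[164];               /* +212 unknown */
    uint64_t parent_template_info;      /* +376 */
    uint8_t  pad384[32];                /* +384 unknown */
    uint64_t module_info;               /* +416 */
    uint8_t  pad424[208];               /* +424 unknown */
    uint64_t class_info_ptr;            /* +632 */
    uint8_t  pad640[144];               /* +640 unknown */
};

_Static_assert(offsetof(struct scope_entry, template_flags) == 12, "template_flags");
_Static_assert(offsetof(struct scope_entry, symbol_chain_or_hash_ptr) == 24, "symbol chain");
_Static_assert(offsetof(struct scope_entry, lazy_load_scope_ptr) == 192, "lazy load");
_Static_assert(offsetof(struct scope_entry, scope_depth) == 208, "scope_depth");
_Static_assert(offsetof(struct scope_entry, parent_template_info) == 376, "template info");
_Static_assert(offsetof(struct scope_entry, module_info) == 416, "module_info");
_Static_assert(offsetof(struct scope_entry, class_info_ptr) == 632, "class_info_ptr");
_Static_assert(sizeof(struct scope_entry) == 784, "scope entry size");
```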

Scope-related globals:

| Address | Name | Description |
| --- | --- | --- |
| dword_126C5E4 | current_scope_index | Index into scope table |
| dword_126C5C4 | class_scope_index | Innermost class scope (-1 if none) |
| dword_126C5C8 | namespace_scope_index | Innermost namespace scope (-1 if none) |
| dword_126C5DC | file_scope_index | File (global) scope index |
| xmmword_126C520 | entity_kind_to_language_mode_map | 32-entry table mapping entity kinds to required language modes |

Lexer State Stack

The lexer supports push/pop of its entire state for speculative parsing and template argument scanning.

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_688320 | push_lexical_state | 137 | Pushes current lexer state onto qword_126DB40 stack |
| sub_668330 | pop_lexical_state_stack_full | 166 | Pops state, restores stop-token table, macro chains (assert at lexical.c:17808) |

State stack nodes are 80-byte linked-list entries:

| Offset | Size | Field |
| --- | --- | --- |
| +0 | 8 | next (previous state) |
| +8 | 8 | cached_tokens |
| +16 | 8 | source_position |
| +24--+72 | 48 | token_cache_state (saved cache pointers and flags) |

The push/pop mechanism is used for:

  • Template argument list scanning (sub_67DC90, 1,078 lines)
  • Speculative parsing in disambiguation contexts
  • Macro expansion state save/restore
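
The save/restore discipline behind speculative parsing can be sketched with a linked stack. Only a read position is saved here, and `restore` is a hypothetical parameter; the real nodes also carry cached tokens and token-cache pointers.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified lexer state: only a read position is saved. */
struct lex_state {
    struct lex_state *next;   /* previously pushed state */
    long position;
};

static struct lex_state *state_stack;   /* stand-in for qword_126DB40 */
static long current_position;

static void push_lexical_state(void)
{
    struct lex_state *s = malloc(sizeof *s);
    assert(s != NULL);
    s->next = state_stack;
    s->position = current_position;
    state_stack = s;
}

/* Pop the saved state. If `restore` is nonzero, rewind to the saved
 * position (speculation failed); otherwise keep the current position. */
static void pop_lexical_state(int restore)
{
    struct lex_state *s = state_stack;
    assert(s != NULL);
    state_stack = s->next;
    if (restore)
        current_position = s->position;
    free(s);
}
```

A caller would push before attempting one interpretation of an ambiguous construct, then pop with `restore` set if that attempt fails and another must be tried from the same point.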

Template Argument Scanning: scan_template_argument_list (sub_67DC90)

sub_67DC90 (1,078 lines, assert at lexical.c:19918) scans template argument lists (<...>). This is one of the most complex lexer functions because of the >> ambiguity: in vector<vector<int>>, the closing >> must be split into two > tokens to close two template argument lists.

The scanner:

  1. Pushes lexer state and sets template argument scanning mode (scope entry offset +12, bit 0)
  2. Scans tokens while tracking nesting depth of <> pairs
  3. Handles nested template-ids recursively
  4. Creates token cache entries for deferred parsing
  5. Uses the scope system to classify identifiers within template arguments
  6. Disambiguates >> as either right-shift or double template close

The entity kind check at offset +80 (values 19--22) identifies template entities for recursive template-id scanning.

Preprocessor Integration

The lexer handles several preprocessor-related responsibilities:

Source Position Tracking

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_66D100 | set_source_position | 282 | Converts raw input position to file/line/column (called from dozens of locations) |
| sub_66D5E0 | emit_output_line | 491 | Emits source text and #line directives to preprocessed output |
| sub_66B1F0 | emit_preprocessed_output | 231 | Outputs #line directives via qword_106C280 (output FILE*) |

Macro Expansion Support

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_66A770 | lookup_macro_at_position | 41 | Scans macro chain (qword_126DD80) for macro enclosing given position |
| sub_66A7F0 | create_macro_expansion_record | 44 | Allocates macro expansion tracking node |
| sub_66A890 | push_macro_expansion | 41 | Pushes new expansion onto active stack |
| sub_66A940 | pop_macro_expansion | 28 | Pops expansion from stack |
| sub_66A9D0 | is_in_macro_expansion | 12 | Returns whether currently inside macro expansion |
| sub_66A9F0 | get_macro_expansion_depth | 17 | Returns nesting depth of macro expansions |
| sub_66A310 | invalidate_macro_node | 56 | Clears macro definition when it goes out of scope |
| sub_66A5E0 | free_macro_definition_chain | 91 | Walks and frees macro chain via qword_126DD70 / qword_126DDE0 |

Include File Handling

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_66BB50 | open_source_file | 332 | Opens include files via sub_4F4970 (fopen wrapper), creates file tracking nodes |
| sub_66EA70 | open_next_input_file | 364 | Opens next input source after current file ends, manages include-stack unwinding |
| sub_67BAB0 | scan_header_name | 110 | Scans <filename> or "filename" for #include directives |

Token Pasting and Stringification

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_67D1E0 | handle_token_pasting | 117 | Implements ## preprocessor operator |
| sub_67D440 | stringify_token | 251 | Implements # preprocessor operator |
| sub_67D050 | check_token_paste_validity | 57 | Validates token paste produces a valid token |
| sub_67D900 | expand_macro_argument | 204 | Expands a single macro argument during substitution |

Operator Scanning

Multi-character operators are scanned by a set of dedicated functions in the 0x67ABB0--0x67BAB0 range. The scanner reads the first operator character and dispatches to the appropriate function to check for compound operators:

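| First Char | Possible Tokens |
| --- | --- |
| < | <, <=, <<, <<=, <=>, <% (digraph {), <: (digraph [) |
| > | >, >=, >>, >>= |
| + | +, ++, += |
| - | -, --, -=, ->, ->* |
| * | *, *= |
| & | &, &&, &= |
| \| | \|, \|\|, \|= |
| = | =, == |
| ! | !, != |
| : | :, :: |
| . | ., ..., .* |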
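
Dispatch on the first character followed by a longest-match test is the usual way to scan such a table. The sketch below covers only the `<` row; `scan_less_operator` is a hypothetical name, and the C++11 special case for `<::` is not handled.

```c
#include <assert.h>
#include <string.h>

/* Return the length of the longest operator token at the start of `s`
 * that begins with '<', or 0 if `s` does not start with '<'.
 * Candidates are ordered longest first so the first match wins. */
static size_t scan_less_operator(const char *s)
{
    static const char *const candidates[] = {
        "<<=", "<=>", "<<", "<=", "<%", "<:", "<",
    };
    assert(s != NULL);
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        size_t n = strlen(candidates[i]);
        if (strncmp(s, candidates[i], n) == 0)
            return n;
    }
    return 0;
}
```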

Template Angle Bracket Disambiguation

sub_67CB70 (handle_template_angle_brackets, 263 lines) handles the critical disambiguation of < and > in template contexts. In template argument lists, < opens and > closes, but in expressions, they are comparison operators. The function uses scope context information and the current parsing state (from the 784-byte scope entries) to make the determination.

Error Recovery

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_6887C0 | skip_to_token | 317 | Error recovery: skips tokens until finding a synchronization point (;, }, etc.) |
| sub_6886F0 | expect_token | 31 | Checks current token matches expected kind, emits diagnostic on mismatch |
| sub_688560 | peek_next_token | 44 | Looks ahead at next token without consuming it |

The stop-token table at qword_126DB48 + 8 (357 entries) controls which token kinds are valid synchronization points for error recovery.
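
A table-driven skip loop of this kind might look like the following sketch. The token codes `TOK_SEMICOLON`, `TOK_RBRACE` and `TOK_EOF` are invented for the example, and the token stream is modeled as a plain array rather than calls into the lexer.

```c
#include <assert.h>
#include <stddef.h>

#define TOKEN_KIND_COUNT 357   /* table size reported above */

/* Hypothetical token codes used only in this sketch. */
enum { TOK_EOF = 0, TOK_SEMICOLON = 10, TOK_RBRACE = 11 };

/* One flag per token kind: nonzero marks a synchronization point. */
static unsigned char stop_token[TOKEN_KIND_COUNT];

/* Advance through `tokens` (terminated by TOK_EOF) until a token marked
 * in stop_token is found. Returns the index of that token, or the index
 * of TOK_EOF if none is found. */
static size_t skip_to_token(const int *tokens, size_t start)
{
    size_t i = start;
    while (tokens[i] != TOK_EOF) {
        int code = tokens[i];
        assert(code > 0 && code < TOKEN_KIND_COUNT);
        if (stop_token[code])
            break;
        i++;
    }
    return i;
}
```

Keeping the stop set in a table rather than in code lets each parsing context install its own synchronization points, which fits the observation that pop_lexical_state_stack_full restores the stop-token table.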

Built-in Type and Attribute Handling

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_685AB0 | handle_builtin_type_token | 289 | Processes built-in type keywords (int, float, etc.) into type tokens |
| sub_685F10 | process_decltype_token | 212 | Handles decltype() expression in token stream |
| sub_686350 | handle_attribute_token | 584 | Processes [[attribute]] and __attribute__((x)) syntax, including CUDA attributes |
| sub_686F40 | process_asm_or_extension_keyword | 244 | Handles asm, __asm__, and extension keywords |

Diagnostic Strings

| String | Source | Condition |
| --- | --- | --- |
| "pop_lexical_state_stack_full" | sub_668330 | Assert at lexical.c:17808 |
| "copy_tokens_from_cache" | sub_669650 | Assert at lexical.c:3417 |
| "scan_universal_character" | sub_6711E0 | Assert at lexical.c:12384 |
| "get_token_from_cached_token_rescan_list" | sub_676860 | Assert at lexical.c:4302 |
| "get_token_from_reusable_cache_stack" | sub_676860 | Assert at lexical.c:4450, 4469 |
| "scan_template_argument_list" | sub_67DC90 | Assert at lexical.c:19918 |
| "select_dual_lookup_symbol" | sub_666720 | Assert at lexical.c:22477 |
| "keyword_init" | sub_5863A0 | Assert at fe_init.c:1597 |
| "fe_translation_unit_init" | sub_5863A0 | Assert at fe_init.c:2373 |

| Diagnostic Code | Context | Meaning |
| --- | --- | --- |
| 870 | Character literal scanning | Invalid character in literal |
| 912 | select_dual_lookup_symbol | Ambiguous lookup result |
| 1192 | Control byte type 6 | Stale source position marker |
| 861 | Control byte type 6 | Invalid position reference |
| 1665 | check_deferred_diagnostics | Deferred macro-related warning |
| 1750 | refill_buffer | Trigraph sequence warning |
| 2628 | Numeric literal scanner | C++23 digit separator used in earlier mode |
| 2629 | Numeric literal scanner | Digit separator not followed by digit |

Function Map

| Address | Identity | Confidence | Lines | EDG Source |
| --- | --- | --- | --- | --- |
| sub_5863A0 | keyword_init / fe_translation_unit_init | 98% | 1,113 | fe_init.c:1597 |
| sub_666720 | select_dual_lookup_symbol | HIGH | 372 | lexical.c:22477 |
| sub_668330 | pop_lexical_state_stack_full | HIGH | 166 | lexical.c:17808 |
| sub_668660 | check_deferred_diagnostics | MEDIUM | 104 | lexical.c |
| sub_6688A0 | get_scope_from_entity | HIGH | 32 | lexical.c |
| sub_668C90 | classify_identifier_entity | MEDIUM | 89 | lexical.c |
| sub_668E00 | resolve_entity_through_alias | MEDIUM | 88 | lexical.c |
| sub_669650 | copy_tokens_from_cache | HIGH | 385 | lexical.c:3417 |
| sub_669D00 | allocate_token_cache_entry | MEDIUM | 119 | lexical.c |
| sub_66A000 | append_to_token_cache | MEDIUM | 88 | lexical.c |
| sub_66A140 | push_token_to_rescan_list | MEDIUM | 46 | lexical.c |
| sub_66A3F0 | create_source_region_node | MEDIUM | 84 | lexical.c |
| sub_66A5E0 | free_macro_definition_chain | MEDIUM | 91 | lexical.c |
| sub_66A770 | lookup_macro_at_position | MEDIUM | 41 | lexical.c |
| sub_66A890 | push_macro_expansion | MEDIUM | 41 | lexical.c |
| sub_66AA50 | process_preprocessor_directive | MEDIUM | 380 | lexical.c |
| sub_66B1F0 | emit_preprocessed_output | MEDIUM | 231 | lexical.c |
| sub_66B910 | skip_whitespace_and_comments | MEDIUM | 105 | lexical.c |
| sub_66BB50 | open_source_file | HIGH | 332 | lexical.c |
| sub_66C550 | scan_string_literal | MEDIUM | 356 | lexical.c |
| sub_66CBD0 | scan_character_literal | MEDIUM | 111 | lexical.c |
| sub_66D100 | set_source_position | HIGH | 282 | lexical.c |
| sub_66D5E0 | emit_output_line | HIGH | 491 | lexical.c |
| sub_66DFF0 | scan_pp_number | MEDIUM | 268 | lexical.c |
| sub_66EA70 | open_next_input_file | MEDIUM | 364 | lexical.c |
| sub_66F4E0 | read_next_source_line | HIGH | 735 | lexical.c |
| sub_6702F0 | refill_buffer | HIGH | 792 | lexical.c |
| sub_6711E0 | scan_universal_character | HIGH | 278 | lexical.c:12384 |
| sub_671870 | validate_universal_character_value | MEDIUM | 62 | lexical.c |
| sub_6719B0 | scan_identifier_or_keyword | HIGH | 400 | lexical.c |
| sub_672390 | scan_numeric_literal | HIGH | 1,571 | lexical.c |
| sub_6748A0 | convert_integer_suffix | MEDIUM | 137 | lexical.c |
| sub_674BB0 | determine_numeric_literal_type | MEDIUM | 400 | lexical.c |
| sub_675390 | scan_float_exponent | MEDIUM | 57 | lexical.c |
| sub_6754B0 | convert_float_literal | MEDIUM | 338 | lexical.c |
| sub_676080 | scan_raw_string_literal | MEDIUM-HIGH | 391 | lexical.c |
| sub_676860 | get_next_token | HIGHEST | 1,995 | lexical.c:4302 |
| sub_679800 | scan_token | HIGH | 907 | lexical.c |
| sub_67BAB0 | scan_header_name | MEDIUM | 110 | lexical.c |
| sub_67CB70 | handle_template_angle_brackets | MEDIUM | 263 | lexical.c |
| sub_67D050 | check_token_paste_validity | LOW | 57 | lexical.c |
| sub_67D1E0 | handle_token_pasting | MEDIUM | 117 | lexical.c |
| sub_67D440 | stringify_token | MEDIUM | 251 | lexical.c |
| sub_67D900 | expand_macro_argument | MEDIUM | 204 | lexical.c |
| sub_67DC90 | scan_template_argument_list | HIGH | 1,078 | lexical.c:19918 |
| sub_67F2E0 | create_template_argument_cache | MEDIUM | 184 | lexical.c |
| sub_67F740 | rescan_template_arguments | MEDIUM-HIGH | 583 | lexical.c |
| sub_680670 | resolve_dependent_template_id | MEDIUM | 240 | lexical.c |
| sub_680AE0 | handle_dependent_name_context | MEDIUM | 235 | lexical.c |
| sub_6810F0 | get_token_main | HIGHEST | 3,811 | lexical.c |
| sub_685AB0 | handle_builtin_type_token | MEDIUM | 289 | lexical.c |
| sub_685F10 | process_decltype_token | MEDIUM | 212 | lexical.c |
| sub_686350 | handle_attribute_token | MEDIUM-HIGH | 584 | lexical.c |
| sub_686F40 | process_asm_or_extension_keyword | MEDIUM | 244 | lexical.c |
| sub_687F30 | setup_lexer_for_parsing_mode | MEDIUM | 216 | lexical.c |
| sub_688320 | push_lexical_state | MEDIUM | 137 | lexical.c |
| sub_688560 | peek_next_token | MEDIUM | 44 | lexical.c |
| sub_6886F0 | expect_token | MEDIUM | 31 | lexical.c |
| sub_6887C0 | skip_to_token | MEDIUM | 317 | lexical.c |

Expression Parser

The expression parser is the largest subsystem in cudafe++. It lives in EDG 6.6's expr.c, which compiles to approximately 385KB of code (address range 0x4F8000--0x556600, i.e. 0x5E600 = 386,560 bytes) containing roughly 320 functions. The central function scan_expr_full (sub_511D40) alone occupies 80KB -- approximately 2,000 decompiled lines with over 300 local variables. EDG uses a hand-written recursive descent parser, not a generated one (no yacc/bison). Each C++ operator precedence level has its own scanning function, and the call chain follows the precedence hierarchy: assignment, conditional, logical-or, logical-and, bitwise-or, bitwise-xor, bitwise-and, equality, relational, shift, additive, multiplicative, pointer-to-member, unary, postfix, primary.

CUDA-specific extensions are woven directly into this subsystem: cross-execution-space call validation at every function call site, remapping of GCC __sync_fetch_and_* builtins to NVIDIA __nv_atomic_fetch_* intrinsics, and constexpr-if gating of literal evaluation based on compilation mode.

Key Facts

| Property | Value |
| --- | --- |
| Source file | expr.c (~320 functions) + exprutil.c (~90 functions) |
| Address range | 0x4F8000--0x556600 (expr.c), 0x558720--0x55FE10 (exprutil.c) |
| Total code size | ~385KB |
| Central dispatcher | sub_511D40 (scan_expr_full, 80KB, ~2,000 lines, 300+ locals) |
| Ternary handler | sub_526E30 (scan_conditional_operator, 48KB) |
| Function call handler | sub_545F00 (scan_function_call, 2,490 lines) |
| New-expression handler | sub_54AED0 (scan_new_operator, 2,333 lines) |
| Identifier handler | sub_5512B0 (scan_identifier, 1,406 lines) |
| Template rescan | sub_5565E0 (rescan_expr_with_substitution_internal, 1,558 lines) |
| Atomic builtin remapper | sub_537BF0 (adjust_sync_atomic_builtin, 1,108 lines, NVIDIA-specific) |
| Cross-space validation | sub_505720 (check_cross_execution_space_call, 4KB) |
| Current token global | word_126DD58 (16-bit token kind) |
| Expression context | qword_106B970 (current scope/context pointer) |
| Trace flag | dword_126EFC8 (debug trace), dword_126EFCC (verbosity level) |

Architecture

Recursive Descent, No Generator

EDG's expression parser is entirely hand-written C. There are no parser tables, no DFA state machines, and no grammar transformation output. Each operator precedence level maps to one or more scan_* functions that call down the precedence chain via direct function calls. The parser is effectively a family of mutually recursive functions whose call graph encodes the C++ grammar.

The top-level entry point is scan_expr_full, which serves a dual role: (1) it contains the primary-expression scanner as a massive switch on token kind, and (2) after scanning a primary expression, it enters a post-scan binary-operator dispatch loop that routes to the correct precedence-level handler based on the next operator token.

scan_expr_full (sub_511D40)
  │
  ├─ [token switch] ─────────► Primary expressions
  │     case 1   → scan_identifier (sub_5512B0)
  │     case 2,3 → scan_numeric_literal (sub_5632C0)
  │     case 27  → scan_cast_or_expr (sub_544290)
  │     case 161 → scan_new_operator (sub_54AED0)
  │     case 162 → scan_throw_operator (sub_5211B0)
  │     ... (100+ token cases)
  │
  ├─ [postfix loop] ──────────► Postfix operators
  │     ()   → scan_function_call (sub_545F00)
  │     []   → scan_subscript_operator (sub_540560)
  │     .->  → scan_field_selection_operator (sub_5303E0)
  │     ++-- → scan_postfix_incr_decr (sub_510D70)
  │
  └─ [binary dispatch] ───────► Binary operators by precedence
        prec 64 → scan_simple_assignment_operator (sub_53FD70)
                  scan_compound_assignment_operator (sub_536E80)
        prec 60 → scan_conditional_operator (sub_526E30)
        prec 59 → scan_logical_operator (sub_526040)     [&&]
        prec 58 → scan_logical_operator (sub_526040)     [||]
        prec 57 → scan_comma_operator (sub_529720)
        ...     → scan_bit_operator (sub_525BC0)         [| ^ &]
        ...     → scan_eq_operator (sub_524ED0)          [== !=]
        ...     → scan_add_operator (sub_523EB0)         [+ -]
        ...     → scan_mult_operator (sub_5238C0)        [* / %]
        ...     → scan_shift_operator (sub_524960)       [<< >>]
        ...     → scan_ptr_to_member_operator (sub_522650) [.* ->*]

Precedence Levels

The parser assigns numeric precedence levels internally, passed as the a3 (third) parameter to scan_expr_full. The precedence integer increases with binding strength (higher values = tighter binding):

| Level | Operators | Handler |
| --- | --- | --- |
| 57 | , (comma) | scan_comma_operator |
| 58 | \|\| | scan_logical_operator |
| 59 | && | scan_logical_operator |
| 60 | ? : (conditional) | scan_conditional_operator |
| 61 | \| | scan_bit_operator |
| 62 | ^ | scan_bit_operator |
| 63 | & | scan_bit_operator |
| 64 | = += -= ... | scan_simple_assignment_operator / scan_compound_assignment_operator |

When scan_expr_full encounters a binary operator token whose precedence is lower than the current precedence parameter, it returns immediately, allowing the caller at that precedence level to consume the operator. This is the standard recursive descent technique: each level calls the next-higher-precedence scanner for its operands.
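
The early-return rule can be seen in a toy precedence-climbing evaluator. This sketch is not EDG's code: it handles only single-character `+ - * /` over decimal integers, its precedence numbers (1 and 2) are illustrative rather than the 57--64 levels above, and `cursor`, `precedence_of`, `scan_primary` and `scan_expr` are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>

static const char *cursor;   /* current read position in the input */

/* Binding strength of a binary operator; 0 means "not a binary operator". */
static int precedence_of(char op)
{
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

static long scan_primary(void)
{
    long v = 0;
    assert(isdigit((unsigned char)*cursor));
    while (isdigit((unsigned char)*cursor))
        v = v * 10 + (*cursor++ - '0');
    return v;
}

/* Scan an expression whose operators all bind at least as tightly as
 * `min_prec`. An operator that binds more loosely is left unconsumed
 * for the caller, mirroring the early return described above. */
static long scan_expr(int min_prec)
{
    long lhs = scan_primary();
    for (;;) {
        char op = *cursor;
        int prec = precedence_of(op);
        if (prec == 0 || prec < min_prec)
            return lhs;
        cursor++;
        long rhs = scan_expr(prec + 1);   /* right operand must bind tighter */
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
        }
    }
}
```

Passing `prec + 1` for the right operand makes the operators left-associative; a right-associative operator such as assignment would pass `prec` instead.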

scan_expr_full -- The Central Dispatcher

scan_expr_full (sub_511D40, 80KB) is the largest function in the entire cudafe++ binary. Its structure follows this pattern:

function scan_expr_full(result, scan_info, precedence, flags, ...) {
    // 1. Trace entry
    if (debug_trace_flag)
        trace_enter(4, "scan_expr_full")
    if (debug_verbosity > 3)
        fprintf(trace_stream, "precedence level = %d\n", precedence)

    // 2. Extract context flags from current scope
    context = current_scope          // qword_106B970
    in_cuda_extension = (context[20] & 0x08) != 0
    in_pack_expansion = context[21] & 0x01
    saved_pending_expr = pending_expression   // qword_106B968
    pending_expression = 0

    // 3. Handle template rescan context
    if (in_template_context) {
        if (context.flags == TEMPLATE_ONLY_DEPENDENT)
            init_expr_stack_entry(...)
            // Mark as template-argument context
    }

    // 4. Handle forced-parenthesized-expression flag
    if (flags & 0x08)
        goto scan_cast_or_expr       // sub_544290

    // 5. Check for decltype token (185)
    if (current_token == 185 && dialect == C++)
        call sub_6810F0(...)         // re-classify through lexer

    // 6. MASTER TOKEN SWITCH -- dispatch on word_126DD58
    switch (current_token) {
        case 1:   // identifier
            // Special-case: check if identifier is a hidden type trait
            if (identifier_is("__is_pointer"))  { set_token(320); scan_unary_type_trait(); break; }
            if (identifier_is("__is_invocable")) { set_token(225); scan_call_like_builtin(); break; }
            if (identifier_is("__is_signed"))   { set_token(324); scan_unary_type_trait(); break; }
            // Default: full identifier scan
            scan_identifier(result, flags, precedence, ...)
            break;

        case 2, 3, 123, 124, 125:  // numeric, char, utf literals
            // Context-sensitive literal handling:
            //   - Check constexpr-if context (execution-space dependent)
            //   - Route to appropriate literal scanner
            if (is_constexpr_if_context) {
                value = compute_constexpr_literal()
                scan_constexpr_literal_result(value, result)
            } else {
                scan_numeric_literal(literal_data, result)  // sub_5632C0
            }
            break;

        case 4, 5, 6, 181, 182:  // string literals
            scan_string_literal(literal_data, result)       // sub_5632C0
            // Vector deprecation check for CUDA
            if ((cuda_mode || cuda_device_mode) && has_vector_literal_flag)
                result.flags |= VECTOR_DEPRECATED
            break;

        case 7:   // postfix-string-context (interpolated strings)
            check_postfix_string_context(...)
            scan_string_expression(literal_data, result)    // sub_563580
            break;

        case 27:  // left-paren '('
            scan_cast_or_expr(result, scratch, flags)       // sub_544290
            // Disambiguates: C-cast, grouped expr, GNU statement expr, fold expr
            break;

        case 31, 32:  // prefix ++ / --
            scan_prefix_incr_decr(result, ...)              // sub_516080
            break;

        case 33:  // & (address-of)
            scan_ampersand_operator(result, ...)            // sub_516720
            break;

        case 34:  // * (indirection)
            scan_indirection_operator(result, ...)          // sub_517270
            break;

        case 35, 36, 37, 38:  // unary + - ~ !
            scan_arith_prefix_operator(result, ...)         // sub_517680
            break;

        case 77:  // lambda expression '['
            scan_lambda_expression(result, ...)             // sub_5BBA60
            break;

        case 99, 284:  // sizeof
            scan_sizeof_operator(result, ...)               // sub_517BD0
            break;

        case 109: // _Generic
            scan_type_generic_operator(result, ...)         // inlined
            break;

        case 152: // requires
            scan_requires_expression(result, ...)           // sub_52CFF0
            break;

        case 155: // new (in C++ concept context path)
            scan_new_operator(result, ...)                  // sub_54AED0
            break;

        case 161: // new-expression
            scan_class_new_expression(result, ...)          // sub_6C9940/sub_6C9C50
            break;

        case 162: // throw
            scan_throw_operator(result, ...)                // sub_5211B0
            break;

        case 166: // const_cast
            scan_const_cast_operator(result, ...)           // sub_520280
            break;

        case 167: // static_cast
            scan_static_cast_operator(result, ...)          // sub_51F670
            break;

        case 176: // reinterpret_cast
            scan_reinterpret_cast_operator(result, ...)     // sub_5209A0
            break;

        case 177: // dynamic_cast
            scan_named_cast_operator(result, ...)           // sub_53D590
            break;

        case 178: // typeid
            scan_typeid_operator(result, ...)               // sub_535370
            break;

        case 185: // decltype
            scan_decltype_operator(result, ...)             // sub_52A3B0
            break;

        case 195 ... 356:  // type traits (__is_class, __is_enum, etc.)
            scan_unary_type_trait_helper(result, ...)       // sub_51A690
            // or
            scan_binary_type_trait_helper(result, ...)      // sub_51B650
            break;

        case 243: // noexcept
            scan_noexcept_operator(result, ...)             // sub_51D910
            break;

        case 267: // co_yield
            // Coroutine yield expression handling
            scan_braced_init_list_full(result, ...)         // sub_5360D0
            add_await_to_operand(result, ...)               // sub_50B630
            break;

        case 269: // co_await
            // Recursive scan of operand, then wrap with await semantics
            scan_expr_full(result, info, precedence, flags | AWAIT)
            add_await_to_operand(result, ...)               // sub_50B630
            break;

        case 297: // __builtin_bit_cast
            scan_builtin_bit_cast(result, ...)              // sub_51CC60
            break;

        // ... approximately 100 additional cases
    }

    // 7. POST-SCAN BINARY OPERATOR DISPATCH LOOP
    //    After scanning a primary/prefix expression, check for binary operators
    while (true) {
        op = current_token
        op_prec = get_binary_op_precedence(op)
        if (op_prec < precedence)
            break    // operator binds less tightly than our level

        switch (op) {
            case '?':  scan_conditional_operator(result, info, flags)   // sub_526E30
            case '=':  scan_simple_assignment_operator(result, ...)     // sub_53FD70
            case '+=': scan_compound_assignment_operator(result, ...)   // sub_536E80
            case '||': scan_logical_operator(result, info, ...)        // sub_526040
            case '&&': scan_logical_operator(result, info, ...)        // sub_526040
            case '|':  scan_bit_operator(result, ...)                  // sub_525BC0
            case '^':  scan_bit_operator(result, ...)
            case '&':  scan_bit_operator(result, ...)
            case '==': scan_eq_operator(result, ...)                   // sub_524ED0
            case '!=': scan_eq_operator(result, ...)
            case '<':  scan_rel_operator(result, ...)                  // sub_543A90
            case '+':  scan_add_operator(result, ...)                  // sub_523EB0
            case '-':  scan_add_operator(result, ...)
            case '*':  scan_mult_operator(result, ...)                 // sub_5238C0
            case '/':  scan_mult_operator(result, ...)
            case '%':  scan_mult_operator(result, ...)
            case '<<': scan_shift_operator(result, ...)                // sub_524960
            case '>>': scan_shift_operator(result, ...)
            case '.*': scan_ptr_to_member_operator(result, ...)        // sub_522650
            case '->*': scan_ptr_to_member_operator(result, ...)
            case ',':  scan_comma_operator(result, ...)                // sub_529720
            // Postfix operators (not precedence-gated):
            case '(':  scan_function_call(result, ...)                 // sub_545F00
            case '[':  scan_subscript_operator(result, ...)            // sub_540560
            case '.':  scan_field_selection_operator(result, ...)      // sub_5303E0
            case '->': scan_field_selection_operator(result, ...)
            case '++': scan_postfix_incr_decr(result, ...)             // sub_510D70
            case '--': scan_postfix_incr_decr(result, ...)
        }
    }

    // 8. Restore saved state and return
    pending_expression = saved_pending_expr
    if (debug_trace_flag)
        trace_exit(...)
    return result
}
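
Case 1 of the master switch re-classifies certain identifiers as type-trait tokens before falling back to scan_identifier. A minimal sketch of that lookup follows; the three spellings and token codes are the ones shown in the pseudocode above, while the function names and the use of std::optional are assumptions made for illustration.

```cpp
#include <optional>
#include <string_view>

// Token code for an ordinary identifier, as used in the master switch.
constexpr int TOKEN_IDENTIFIER = 1;

// Hypothetical helper: returns the trait token for the three spellings
// shown in the pseudocode, or nothing for any other identifier.
std::optional<int> hidden_trait_token(std::string_view id) {
    if (id == "__is_pointer")   return 320;
    if (id == "__is_invocable") return 225;
    if (id == "__is_signed")    return 324;
    return std::nullopt;
}

// Either the re-classified token or the plain identifier token.
int classify_identifier(std::string_view id) {
    return hidden_trait_token(id).value_or(TOKEN_IDENTIFIER);
}
```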

Token Dispatch Map (Complete)

The master switch in scan_expr_full covers approximately 120 distinct token cases. The full dispatch table:

| Token Code(s) | Expression Form | Handler |
|---|---|---|
| 1 | Identifier (with __is_pointer/__is_signed detection) | scan_identifier (sub_5512B0) |
| 2, 3 | Integer / floating-point literal | scan_numeric_literal (sub_5632C0) |
| 4, 5, 6, 181, 182 | String literal (narrow, wide, UTF-8/16/32) | scan_string_literal (sub_5632C0) |
| 7 | Postfix string context | sub_563580 |
| 8 | Literal operator call | make_func_operand_for_literal_operator_call (sub_4FFFB0) |
| 18, 80--136, 165, 180, 183 | Type keywords in expression context | scan_type_returning_type_trait_operator / scan_identifier |
| 25 | __extension__ | scan_expr_splicer (sub_52FD70) or scan_statement_expression (sub_4F9F20) |
| 27 | ( | scan_cast_or_expr (sub_544290) -- disambiguates cast/group/fold/stmt-expr |
| 31, 32 | ++ / -- (prefix) | scan_prefix_incr_decr (sub_516080) |
| 33 | & (address-of) | scan_ampersand_operator (sub_516720) |
| 34 | * (indirection) | scan_indirection_operator (sub_517270) |
| 35--38 | + - ~ ! (unary) | scan_arith_prefix_operator (sub_517680) |
| 50 | __builtin_expect | bound_function_in_cast (sub_503F70) |
| 77 | [ (lambda) | scan_lambda_expression (sub_5BBA60) |
| 99, 284 | sizeof | scan_sizeof_operator (sub_517BD0) |
| 109 | _Generic | scan_type_generic_operator (inlined) |
| 111, 247 | alignof / _Alignof | scan_alignof_operator (sub_519300) |
| 112 | __intaddr | scan_intaddr_operator (sub_520EE0) |
| 113 | va_start | scan_va_start_operator (sub_51E8A0) |
| 114 | va_arg | scan_va_arg_operator (sub_51DFA0) |
| 115 | va_end | scan_va_end_operator (sub_51E4A0) |
| 116 | va_copy | scan_va_copy_operator (sub_51E670) |
| 117 | offsetof | scan_offsetof (sub_555530) |
| 123 | char literal | scan_utf_char_literal (sub_5659D0) |
| 124 | wchar_t literal | scan_wchar_literal (sub_5658D0) |
| 125 | UTF character literal | scan_wide_char_literal (sub_565950) |
| 138--141 | __FUNCTION__/__PRETTY_FUNCTION__/__func__ | setup_function_name_literal (sub_50AC80) |
| 143 | __builtin_types_compatible_p | scan_builtin_operation_args_list (sub_534920) |
| 144, 145 | __real__ / __imag__ | scan_complex_projection (sub_51D210) |
| 146 | typeid (execution-space variant) | scan_typeid_operator (sub_535370) |
| 152 | requires (C++20) | scan_requires_expression (sub_52CFF0) |
| 155 | Concept expression | scan_new_operator path (sub_54AED0) |
| 161 | new | scan_class_new_expression (sub_6C9940) |
| 162 | throw | scan_throw_operator (sub_5211B0) |
| 166 | const_cast | scan_const_cast_operator (sub_520280) |
| 167 | static_cast | scan_static_cast_operator (sub_51F670) |
| 176 | reinterpret_cast | scan_reinterpret_cast_operator (sub_5209A0) |
| 177 | dynamic_cast | scan_named_cast_operator (sub_53D590) |
| 178 | typeid | scan_typeid_operator (sub_535370) |
| 185 | decltype | scan_decltype_operator (sub_52A3B0) |
| 188 | wchar_t literal (alt) | sub_5BCDE0 |
| 189 | typeof | scan_typeof_operator (sub_52B540) |
| 195--206 | Unary type traits | scan_unary_type_trait_helper (sub_51A690) |
| 207--292 | Binary type traits | scan_binary_type_trait_helper (sub_51B650) |
| 225, 226 | __is_invocable / __is_nothrow_invocable | dispatch_call_like_builtin (sub_535080) |
| 227--235 | Builtin operations | sub_535080 / sub_51BC10 / sub_51B0C0 |
| 237 | __builtin_constant_p | sub_5BC7E0 |
| 243 | noexcept (operator) | scan_noexcept_operator (sub_51D910) |
| 251--256 | Builtin atomic operations | check_operand_is_pointer (sub_5338B0/sub_533B80) |
| 257, 258 | Fold expression tokens | scan_builtin_shuffle (sub_53E480) |
| 259 | __builtin_convertvector | scan_builtin_convertvector (sub_521950) |
| 261 | __builtin_complex | scan_builtin_complex (sub_521DB0) |
| 262 | __builtin_choose_expr | scan_c11_generic_selection (sub_554400) |
| 267 | co_yield | Braced-init-list + coroutine add_await_to_operand (sub_50B630) |
| 269 | co_await | Recursive scan_expr_full + add_await_to_operand |
| 270 | __builtin_launder | sub_51B0C0(60, ...) |
| 271 | __builtin_addressof | scan_builtin_addressof (sub_519CF0) |
| 294 | Pack expansion | scan_requires_expr (sub_542D90) |
| 296 | __has_attribute | scan_builtin_has_attribute (sub_51C780) |
| 297 | __builtin_bit_cast | scan_builtin_bit_cast (sub_51CC60) |
| 300, 301 | __is_pointer_interconvertible_with_class | sub_51BE60 |
| 302, 303 | __is_corresponding_member | sub_51C270 |
| 304 | __edg_is_deducible | sub_51B360 |
| 306, 307 | __builtin_source_location | sub_5BC720 / sub_534920 |

scan_conditional_operator -- Ternary ? :

scan_conditional_operator (sub_526E30, 48KB) is the second-largest expression-scanning function. The conditional operator is complex because it must derive a single result type and value category from two branches whose types may be unrelated. The function handles:

  • Type unification between branches: determines the common type of the true and false expressions. This involves the usual arithmetic conversions for numeric types, pointer-to-derived to pointer-to-base conversions, null pointer conversions, and user-defined conversion sequences.
  • Lvalue conditional expressions (GCC extension): when both branches are lvalues of the same type, the result is itself an lvalue.
  • Void branches: if both branches are void expressions, the result type is void. A single void branch is valid only when it is a throw expression (next item).
  • Throw in branches: a throw expression in one branch causes the result to take the type of the other branch.
  • Constexpr evaluation: when the condition is a constant expression, only one branch is semantically evaluated (the other is discarded).
  • Reference binding: determines whether the result is an lvalue reference, rvalue reference, or prvalue.
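  • Overload resolution for ?:: C++ does not allow operator?: to be user-defined, so this is presumably the resolution against built-in candidates that selects conversions for class-type operands.

function scan_conditional_operator(result, info, flags) {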
    // 1. The condition has already been scanned -- it is in 'result'
    //    We are positioned at the '?' token
    condition = result

    // 2. Save expression stack state
    saved_stack = save_expr_stack()

    // 3. Scan true branch (between ? and :)
    //    Note: precedence resets -- assignment expressions allowed here
    init_expr_stack_entry(...)
    scan_expr_full(true_result, info, ASSIGNMENT_PREC, flags)

    // 4. Expect and consume ':'
    expect_token(':')

    // 5. Scan false branch
    scan_expr_full(false_result, info, ASSIGNMENT_PREC, flags)

    // 6. Type unification of true_result and false_result
    true_type  = get_type(true_result)
    false_type = get_type(false_result)

    if (both_void(true_type, false_type))
        result_type = void
    else if (is_throw(true_result))
        result_type = false_type
    else if (is_throw(false_result))
        result_type = true_type
    else if (arithmetic_types(true_type, false_type))
        result_type = usual_arithmetic_conversions(true_type, false_type)
    else if (same_class_lvalues(true_result, false_result))
        result_type = common_lvalue_type(true_type, false_type)  // GCC ext
    else if (pointer_types(true_type, false_type))
        result_type = composite_pointer_type(true_type, false_type)
    else
        // Try user-defined conversions (overload resolution)
        result_type = resolve_via_conversion_sequences(true_type, false_type)

    // 7. Apply cv-qualification merging
    result_type = merge_cv_qualifications(true_type, false_type, result_type)

    // 8. Build result expression node
    build_conditional_expr_node(result, condition, true_result, false_result, result_type)

    // 9. Restore stack
    restore_expr_stack(saved_stack)
}

The complexity arises from the 15+ different type-pair combinations (arithmetic-arithmetic, pointer-pointer, pointer-null, class-class with conversions, void-void, throw-anything, lvalue-lvalue GCC extension) that each require different conversion logic.
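
Several of these type-pair rules can be observed directly in standard C++ with decltype, which reports the type a conforming frontend computes for a conditional expression. The checks below are a sketch of the language rules, not of the cudafe++ implementation; all declared names (Base, Derived, lhs_var, rhs_var, assign_through_conditional) are hypothetical.

```cpp
#include <type_traits>

struct Base {};
struct Derived : Base {};

int lhs_var = 0;
int rhs_var = 0;

// arithmetic-arithmetic: usual arithmetic conversions (int, double -> double)
static_assert(std::is_same_v<decltype(true ? 1 : 2.0), double>);
// lvalue-lvalue of the same type: the result is itself an lvalue
static_assert(std::is_same_v<decltype(true ? lhs_var : rhs_var), int&>);
// void-void: the result is void
static_assert(std::is_same_v<decltype(true ? void() : void()), void>);
// throw-anything: the result takes the type of the other branch
static_assert(std::is_same_v<decltype(true ? throw 0 : 1L), long>);
// pointer-pointer: composite pointer type (Derived* converts to Base*)
static_assert(std::is_same_v<
    decltype(true ? static_cast<Derived*>(nullptr) : static_cast<Base*>(nullptr)),
    Base*>);
// pointer-null: the null pointer constant converts to the pointer type
static_assert(std::is_same_v<decltype(true ? nullptr : &lhs_var), int*>);

// Because the lvalue-lvalue form yields an lvalue, it can be assigned through.
int assign_through_conditional(bool pick_lhs, int value) {
    (pick_lhs ? lhs_var : rhs_var) = value;
    return pick_lhs ? lhs_var : rhs_var;
}
```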

scan_function_call -- All Call Forms

scan_function_call (sub_545F00, 2,490 lines) handles every form of function call expression. It is invoked from the postfix operator dispatch in scan_expr_full when a ( follows a primary expression, and also from various specialized paths.

The function handles:

  1. Regular function calls with overload resolution
  2. Builtin function calls -- GCC/Clang __builtin_* with special semantics
  3. Pseudo-calls to builtins -- va_start, __builtin_va_start, etc.
  4. GNU __builtin_classify_type -- compile-time type classification
  5. SFINAE context -- failed overload resolution suppresses errors instead of aborting
  6. Template argument deduction for function templates at call sites
  7. CUDA atomic builtin remapping -- delegates to adjust_sync_atomic_builtin (see below)

function scan_function_call(callee_operand, flags, context, ...) {
    // 1. Classify the callee
    operand_kind = get_operand_kind(callee_operand)
    assert(operand_kind is valid)  // "scan_function_call: bad operand kind"

    // 2. Scan argument list
    scan_call_arguments(arg_list, ...)   // sub_545760

    // 3. Branch on callee kind
    if (is_builtin_function(callee_operand)) {
        // Check if this is a special builtin
        if (is_sync_atomic_builtin(callee_operand)) {
            // CUDA-specific: remap __sync_fetch_and_* → __nv_atomic_fetch_*
            result = adjust_sync_atomic_builtin(callee, args, ...)  // sub_537BF0
            return result
        }

        // check_builtin_function_for_call: validate args for builtins
        check_builtin_function_for_call(callee, arg_list, ...)

        // scan_builtin_pseudo_call: for builtins with special evaluation
        if (is_pseudo_call_builtin(callee))
            return scan_builtin_pseudo_call(callee, arg_list, ...)
    }

    // 4. Overload resolution
    if (has_overload_candidates(callee_operand)) {
        best = perform_overload_resolution(callee, arg_list, ...)
        if (best == AMBIGUOUS)
            emit_error(...)
        if (best == NO_MATCH && in_sfinae_context)
            return SFINAE_FAILURE
        callee = best.function
    }

    // 5. Template argument deduction (if callee is a function template)
    if (is_function_template(callee)) {
        deduced = deduce_template_args(callee, arg_list, ...)
        if (deduction_failed && in_sfinae_context)
            return SFINAE_FAILURE
        callee = instantiate_template(callee, deduced)
    }

    // 6. CUDA cross-execution-space check
    if (cuda_mode)
        check_cross_execution_space_call(callee, ...)  // sub_505720

    // 7. Apply implicit conversions to arguments
    for each (arg, param) in zip(arg_list, callee.params):
        convert_arg_to_param_type(arg, param)

    // 8. Build call expression node
    build_call_expression(result, callee, arg_list, return_type)
}

scan_call_arguments (sub_545760, 332 lines)

The argument scanner called from scan_function_call:

function scan_call_arguments(arg_list_out, ...) {
    // assert "scan_call_arguments"
    // Loop: scan comma-separated expressions until ')'
    while (current_token != ')') {
        scan_expr_full(arg, info, ASSIGNMENT_PREC, flags)
        append(arg_list_out, arg)
        if (current_token == ',')
            consume(',')
        else
            break
    }
    // Handle default arguments for missing trailing params
    // Handle parameter pack expansion
}

scan_new_operator -- All new Forms

scan_new_operator (sub_54AED0, 2,333 lines) implements the complete C++ new expression. The function name strings embedded in the binary confirm the following sub-operations:

| Sub-operation | Embedded Assert String |
|---|---|
| Entry point | "scan_new_operator" |
| Rescan in template | "rescan_new_operator_expr" |
| Token validation | "scan_new_operator: expected new or gcnew" |
| Token extraction | "get_new_operator_token" |
| Type parsing | "scan_new_type" |
| Paren-as-braced fallback | "scan_paren_expr_list_as_braced_list" |
| Array size deduction | "deduce_new_array_size" |
| Deallocation lookup | "determine_deletion_for_new" |
| Paren initializer | "prep_new_object_init_paren_initializer" |
| Brace initializer | "prep_new_object_init_braced_initializer" |
| No initializer | "prep_new_object_init_no_initializer" |
| Non-POD error | "scan_new_operator: non-POD class has neither actual nor assumed ctor" |

The function processes all forms:

function scan_new_operator(result, flags, context, ...) {
    // Determine scope: ::new (global) vs. new (class-scope)
    is_global = check_and_consume("::")

    // Parse optional placement arguments: new(placement_args)
    if (current_token == '(')
        placement_args = scan_expression_list(...)

    // Parse the allocated type: new Type
    type = scan_new_type(...)

    // Parse optional array dimension: new Type[size]
    is_array = (current_token == '[')
    if (is_array)
        array_size = scan_expression(...)

    // Parse optional initializer
    if (current_token == '(')
        init = prep_new_object_init_paren_initializer(type, ...)
    else if (current_token == '{')
        init = prep_new_object_init_braced_initializer(type, ...)
    else
        init = prep_new_object_init_no_initializer(type, ...)

    // An omitted array bound is deduced from the initializer, so the
    // deduction can only run once the initializer has been prepared
    if (is_array && can_deduce_size)
        deduce_new_array_size(type, init)

    // Look up matching operator new
    new_fn = lookup_operator_new(type, placement_args, is_global, ...)

    // Look up matching operator delete (for exception cleanup)
    determine_deletion_for_new(new_fn, type, placement_args, ...)

    // For template-dependent types, defer to rescan at instantiation
    if (is_dependent_type(type))
        record_for_rescan(...)

    // Build new-expression node
    build_new_expr(result, new_fn, type, init, placement_args, array_size)
}
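
For orientation, the source-level forms that this function has to accept can be written out in standard C++. The example below is only an illustration of the syntax being parsed; Point and exercise_new_forms are hypothetical names, and nothing here reflects how cudafe++ represents these expressions internally.

```cpp
#include <new>

struct Point {
    int x = 0;
    int y = 0;
    Point() = default;
    Point(int x_, int y_) : x(x_), y(y_) {}
};

int exercise_new_forms() {
    Point* plain   = new Point;                       // no initializer
    Point* paren   = new Point(1, 2);                 // parenthesized initializer
    Point* braced  = new Point{3, 4};                 // braced initializer
    Point* global  = ::new Point(5, 6);               // leading :: forces global operator new
    Point* nothrow = new (std::nothrow) Point(7, 8);  // placement arguments select an overload
    int*   array   = new int[3]{10, 20, 30};          // array form with braced initializer

    alignas(Point) unsigned char storage[sizeof(Point)];
    Point* in_place = new (storage) Point(9, 10);     // placement into caller-owned storage

    int sum = plain->x + paren->x + braced->x + global->x
            + (nothrow ? nothrow->x : 0) + in_place->x
            + array[0] + array[1] + array[2];

    in_place->~Point();   // placement new needs an explicit destructor call, no delete
    delete[] array;
    delete nothrow;       // deleting a null pointer is a no-op
    ::delete global;
    delete braced;
    delete paren;
    delete plain;
    return sum;
}
```

Each placement form changes which operator new is looked up, which is why the pseudocode passes placement_args to both the allocation and the deallocation lookup.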

scan_identifier -- Name Resolution in Expression Context

scan_identifier (sub_5512B0, 1,406 lines) handles the case where the current token is an identifier in expression context. This is far more complex than a simple name lookup because identifiers in C++ can resolve to variables, functions, enumerators, type names (triggering functional-notation casts), anonymous union members, or preprocessing constants.

The function contains assert strings revealing its sub-operations:

| Assert String | Purpose |
|---|---|
| "scan_identifier" | Entry point |
| "scan_identifier: in preprocessing expr" | Identifier in #if context evaluates to 0 or 1 |
| "anonymous_parent_variable_of" | Navigate to parent variable of anonymous union member |
| "anonymous_parent_variable_of: bad symbol kind on list" | Error path for malformed anonymous union chain |
| "make_anonymous_union_field_operand" | Construct operand for anonymous union member access |
| "get_with_hash" | Hash-based lookup for cached resolution results |

function scan_identifier(result, flags, precedence, ...) {
    // 1. Preprocessing-expression context
    //    In #if, undefined identifiers evaluate to 0
    if (in_preprocessing_expression) {
        // "scan_identifier: in preprocessing expr"
        result = make_integer_constant(0)
        return
    }

    // 2. Look up identifier in current scope
    lookup_result = scope_lookup(current_identifier, current_scope)

    // 3. If identifier resolves to a type name → functional-notation cast
    if (is_type_entity(lookup_result)) {
        scan_functional_notation_type_conversion(type, result, ...)  // sub_54E7C0
        return
    }

    // 4. If identifier is an anonymous union member
    if (is_anonymous_union_member(lookup_result)) {
        // Walk up to find the named parent variable
        parent = anonymous_parent_variable_of(lookup_result)
        result = make_anonymous_union_field_operand(parent, lookup_result)
        return
    }

    // 5. If identifier is a function (possibly overloaded)
    if (is_function_entity(lookup_result)) {
        result = make_func_operand(lookup_result)
        // Lambda capture check
        if (in_lambda_scope)
            check_var_for_lambda_capture(lookup_result, ...)
        return
    }

    // 6. Variable reference
    result = make_var_operand(lookup_result)

    // 7. Lambda capture analysis
    if (in_lambda_scope)
        check_var_for_lambda_capture(lookup_result, ...)

    // 8. Cross-execution-space reference check (CUDA)
    if (cuda_mode)
        check_cross_execution_space_reference(lookup_result, ...)
}
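
Step 4 exists because an anonymous union member is named in source without any enclosing object. The short standard C++ example below shows the two situations; it illustrates the language feature only, and all identifiers in it are hypothetical.

```cpp
// A namespace-scope anonymous union must be declared static. Its members
// are used as if they were ordinary variables, but they live inside an
// unnamed union object that the frontend has to locate.
static union {
    int   as_int;
    float as_float;
};

struct Value {
    int tag;
    union {            // anonymous union as a class member
        int    i;
        double d;
    };
};

int store_global(int v) {
    as_int = v;        // no object name appears in the source
    return as_int;
}

int store_member(int v) {
    Value val{};
    val.i = v;         // i is found in Value's scope but stored in the unnamed union
    return val.i;
}
```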

CUDA-Specific: Cross-Execution-Space Call Validation

Two functions implement the CUDA execution space enforcement that prevents illegal calls between __host__ and __device__ code:

check_cross_execution_space_call (sub_505720)

Called from scan_function_call and other call sites. The function extracts execution space information from bit-packed flags at entity offset +182:

function check_cross_execution_space_call(callee, is_must_check, diag_ctx) {
    // Extract callee's execution space from entity flags
    if (callee != NULL) {
        is_not_device_only = (callee[182] & 0x30) != 0x20  // bits 4-5
        is_host_only       = (callee[182] & 0x60) == 0x20  // bits 5-6
        is_global          = (callee[182] & 0x40) != 0      // bit 6
    }

    // Early exits for special contexts
    if (compilation_chain == -1)    return   // not in compilation
    if (CU has CUDA flags cleared)  return   // not a CUDA compilation unit
    if (in_SFINAE_context)          return   // errors suppressed

    // Get caller's execution space from enclosing function
    enclosing_fn = CU_table[enclosing_CU_index].function  // at +224
    if (enclosing_fn != NULL) {
        caller_host_only = (enclosing_fn[182] & 0x60) == 0x20
        caller_not_device_only = (enclosing_fn[182] & 0x30) != 0x20
    } else {
        // Top-level code: treated as __host__
        caller_host_only = 0
        caller_not_device_only = 1
    }

    // Check for implicitly HD (constexpr or __host__ __device__ by inference)
    if (callee[177] & 0x10)  return   // callee is implicitly HD
    if (callee has deleted+explicit HD flags)  return

    // The actual cross-space check matrix:
    // caller=host,  callee=device  → error 3462 or 3463
    // caller=device, callee=host   → error 3464 or 3465
    // callee=__global__            → error 3508

    if (caller_not_device_only && caller_host_only) {
        // Caller is __host__ only
        if (callee is __device__ only) {
            if (is_trivial_device_copyable(callee))  // sub_6BC680
                return  // allow
            space1 = get_execution_space_name(enclosing_fn, 0)  // sub_6BC6B0
            space2 = get_execution_space_name(callee, 1)
            emit_error(3462 + has_explicit_host, ...)
        }
    } else if (caller_not_device_only) {
        // Caller is __device__ only
        if (callee is __host__ only)
            emit_error(3464 + has_explicit_device, ...)
    }

    if (callee is __global__) {
        emit_error(3508, is_must_check ? "must" : "cannot", ...)
    }
}

The bit encoding at entity offset +182:

| Bits | Mask | Meaning |
|---|---|---|
| 4--5 | & 0x30 | __device__ flag: 0x20 = device-only |
| 5--6 | & 0x60 | __host__ flag: 0x20 = host-only |
| 6 | & 0x40 | __global__ flag |
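
The three mask tests in the table can be transcribed as small predicates over the flag byte. This is a sketch that reproduces the masks exactly as listed; the predicate names are illustrative and are not recovered symbols.

```cpp
#include <cstdint>

// Masks and comparison values copied from the table above.
// The names describe the table's labels, not verified semantics.
constexpr bool device_only_pattern(std::uint8_t flags) {
    return (flags & 0x30) == 0x20;   // bits 4-5
}

constexpr bool host_only_pattern(std::uint8_t flags) {
    return (flags & 0x60) == 0x20;   // bits 5-6
}

constexpr bool global_bit(std::uint8_t flags) {
    return (flags & 0x40) != 0;      // bit 6
}
```

Note that, as transcribed, a byte with only bit 5 set satisfies both the device-only and the host-only pattern, so the sketch should be read as a restatement of the masks rather than as an interpretation of what each bit means.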

Error codes issued:

| Code | Meaning |
|---|---|
| 3462 | __device__ function called from __host__ context |
| 3463 | Variant of 3462 with __host__ annotation note |
| 3464 | __host__ function called from __device__ context |
| 3465 | Variant of 3464 with __device__ annotation note |
| 3508 | __global__ function called from wrong context |
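
The pairing of codes follows the base-plus-offset pattern seen in the pseudocode (emit_error(3462 + has_explicit_host, ...)). A minimal sketch of that selection, with hypothetical names throughout:

```cpp
// Caller classification used only by this sketch.
enum class Caller { HostOnly, DeviceOnly };

// Code for a __global__ callee, as listed in the table above.
constexpr int kGlobalCallError = 3508;

// Base code depends on the direction of the illegal call; the +1 variant
// is chosen when the explicit-annotation flag is set.
constexpr int cross_space_error(Caller caller, bool has_explicit_annotation) {
    int base = (caller == Caller::HostOnly) ? 3462 : 3464;
    return base + (has_explicit_annotation ? 1 : 0);
}
```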

check_cross_space_call_in_template (sub_505B40)

A simplified variant (2.7KB) used during template instantiation. The logic mirrors check_cross_execution_space_call but operates when dword_126C5C4 == -1 (template instantiation depth guard). It does not take the is_must_check parameter and always checks both directions.

See the Execution Spaces page for full details on the CUDA execution model.

CUDA-Specific: adjust_sync_atomic_builtin

adjust_sync_atomic_builtin (sub_537BF0, 1,108 lines) is the largest NVIDIA-specific function in the expression parser. It transforms GCC-style __sync_fetch_and_* atomic builtins into NVIDIA's own __nv_atomic_fetch_* intrinsics.

Why This Remapping Exists

CUDA inherits GCC's __sync_fetch_and_* builtin family from the host-side C/C++ dialect, but these builtins are generic over the operand type, whereas atomic instructions in PTX (NVIDIA's virtual GPU ISA) carry an explicit type qualifier such as .s32, .u64, or .f32. Rather than teaching the backend to decompose generic __sync_* builtins, NVIDIA front-loads the transformation in the parser, mapping each builtin to a type-suffixed __nv_atomic_fetch_* intrinsic from which the backend can select the matching PTX atomic instruction.

The type suffix ensures correct instruction selection:

| Suffix | Type Category | PTX Atomic Type |
|---|---|---|
| _s | Signed integer | .s32, .s64 |
| _u | Unsigned integer | .u32, .u64 |
| _f | Floating-point | .f32, .f64 |

Remapping Table

| GCC Builtin | NVIDIA Intrinsic (base) |
|---|---|
| __sync_fetch_and_add | __nv_atomic_fetch_add |
| __sync_fetch_and_sub | __nv_atomic_fetch_sub |
| __sync_fetch_and_and | __nv_atomic_fetch_and |
| __sync_fetch_and_xor | __nv_atomic_fetch_xor |
| __sync_fetch_and_or | __nv_atomic_fetch_or |
| __sync_fetch_and_max | __nv_atomic_fetch_max |
| __sync_fetch_and_min | __nv_atomic_fetch_min |

Pseudocode

function adjust_sync_atomic_builtin(callee, args, arg_list, builtin_info, result_ptr) {
    // assert "adjust_sync_atomic_builtin" at line 6073

    original_entity = get_builtin_entity(callee)   // sub_568F30
    assert(original_entity != NULL)

    // Check arg count -- if extra args and first arg is not pointer type
    if (builtin_info.extra_arg_count && callee[8] != 1) {
        // Reset and emit diagnostic 3768 (wrong arg type for atomic)
        original_entity = NULL
        if (validate_arg_types(...))
            emit_error(3768, diag_ctx)
        return original_entity
    }

    // Walk argument list to find the pointee type (type of *ptr)
    if (args == NULL) {
        // Use declared arg count from builtin info
        arg_index = builtin_info.declared_arg_count
        // ... validate, may emit error 3769 or 1645
    } else {
        // Navigate to the relevant argument node
        // Extract the pointee type by unwinding cv-qualifiers
        arg_type = get_init_component_type(args)
        pointee = unwrap_cv_qualifiers(arg_type)  // while type_kind == 12
    }

    // Determine the type suffix based on pointee type
    if (is_integer_type(pointee)) {
        if (is_signed(pointee))
            suffix = "_s"    // signed
        else
            suffix = "_u"    // unsigned
    } else if (is_float_type(pointee)) {
        suffix = "_f"        // floating-point
    } else {
        // Not a supported atomic type
        if (validate_arg_types(...))
            emit_error(1645 or 852, diag_ctx)
        return original_entity
    }

    // Construct the NVIDIA intrinsic name
    // Map __sync_fetch_and_OP → __nv_atomic_fetch_OP + suffix
    base_name = map_sync_to_nv(original_entity.name)
    // e.g., "__sync_fetch_and_add" → "__nv_atomic_fetch_add"
    full_name = base_name + suffix
    // e.g., "__nv_atomic_fetch_add_s" for signed int

    // Look up or create the NVIDIA intrinsic entity
    nv_entity = lookup_nv_intrinsic(full_name)

    // Replace the callee with the NVIDIA intrinsic
    *result_ptr = nv_entity

    return original_entity
}

The function validates that the pointee type is one of the supported atomic types. If the user passes a pointer to an unsupported type (e.g., a struct), it falls through to emit diagnostic 1645 ("argument type not supported for atomic operation") or 852 (a more specific variant when the __sync function has explicit type constraints).
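
The name construction itself is simple enough to sketch in isolation. The version below assumes the mapping is a prefix swap plus a one-letter suffix, which is what the remapping and suffix tables suggest; PointeeKind and remap_sync_builtin are hypothetical names, and the real function works on entities and emits diagnostics rather than returning an empty optional.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Simplified stand-in for the pointee type classification.
enum class PointeeKind { SignedInt, UnsignedInt, Float, Other };

std::optional<std::string> remap_sync_builtin(std::string_view name, PointeeKind kind) {
    constexpr std::string_view from = "__sync_fetch_and_";
    constexpr std::string_view to   = "__nv_atomic_fetch_";

    if (name.substr(0, from.size()) != from)
        return std::nullopt;                     // not a __sync_fetch_and_* builtin
    std::string_view op = name.substr(from.size());

    // Only the seven operations listed in the remapping table.
    static constexpr std::string_view ops[] = {"add", "sub", "and", "xor", "or", "max", "min"};
    bool known = false;
    for (std::string_view candidate : ops)
        if (candidate == op) known = true;
    if (!known)
        return std::nullopt;

    const char* suffix = nullptr;
    switch (kind) {
        case PointeeKind::SignedInt:   suffix = "_s"; break;
        case PointeeKind::UnsignedInt: suffix = "_u"; break;
        case PointeeKind::Float:       suffix = "_f"; break;
        case PointeeKind::Other:       return std::nullopt;  // diagnostic path in the real function
    }

    std::string out(to.data(), to.size());
    out.append(op.data(), op.size());
    out += suffix;
    return out;
}
```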

Template Expression Rescanning

When a template is instantiated, expression trees from the template definition are re-evaluated with concrete template arguments substituted. This is handled by rescan_expr_with_substitution_internal (sub_5565E0, 1,558 lines), one of the larger functions in the expression parser.

The function dispatches on expression kind (not token kind -- these are IL expression nodes, not source tokens) and recursively rescans each sub-expression with substitutions applied:

| Assert String | Purpose |
|---|---|
| "rescan_expr_with_substitution_internal" | Entry point |
| "operator_token_for_builtin_operator" | Maps operator codes to tokens for rescan |
| "operator_token_for_expr_rescan" | Alternate operator-to-token mapping |
| "invalid expr kind in expr rescan" | Unreachable default case |
| "rescan_braced_init_list" | Rescans {init-list} nodes |
| "make_operand_for_rescanned_identifier" | Rebuilds identifier operands after substitution |
| "symbol_for_template_param_unknown_entity_rescan" | Handles dependent names during rescan |
| "scan_rel_operator" | Rescans relational operators (for comparison rewriting) |

The key insight is that during template definition parsing, the parser builds a partially-evaluated expression tree where template-dependent parts are stored as opaque nodes. During instantiation, this function walks that tree, substitutes concrete types/values, and re-runs the semantic analysis that was deferred.
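
The walk-substitute-reanalyze idea can be shown on a toy expression tree. In the sketch below a dependent expression is stored with a placeholder node, and a recursive rescan resolves it once a binding is supplied. The node kinds, the Bindings map, and every function name are assumptions for illustration; the real function operates on EDG IL nodes and performs full semantic analysis rather than integer folding.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

enum class ExprKind { Constant, TemplateParam, Add, Less };

struct Expr;
using ExprPtr = std::shared_ptr<const Expr>;

struct Expr {
    ExprKind kind;
    long value;           // used by Constant
    std::string param;    // used by TemplateParam
    ExprPtr lhs, rhs;     // used by Add and Less
};

using Bindings = std::map<std::string, long>;

ExprPtr constant(long v) {
    return std::make_shared<Expr>(Expr{ExprKind::Constant, v, "", nullptr, nullptr});
}
ExprPtr param(std::string name) {
    return std::make_shared<Expr>(Expr{ExprKind::TemplateParam, 0, std::move(name), nullptr, nullptr});
}
ExprPtr binary(ExprKind kind, ExprPtr l, ExprPtr r) {
    return std::make_shared<Expr>(Expr{kind, 0, "", std::move(l), std::move(r)});
}

// Dispatch on expression kind, substituting bound parameters on the way down.
long rescan(const Expr& e, const Bindings& args) {
    switch (e.kind) {
        case ExprKind::Constant:      return e.value;
        case ExprKind::TemplateParam: return args.at(e.param);  // throws if unbound
        case ExprKind::Add:           return rescan(*e.lhs, args) + rescan(*e.rhs, args);
        case ExprKind::Less:          return rescan(*e.lhs, args) < rescan(*e.rhs, args) ? 1 : 0;
    }
    throw std::logic_error("invalid expr kind in expr rescan");
}

// Definition-time tree for the dependent expression `N + 1 < 4`,
// instantiated with a concrete value for N.
long instantiate_demo(long n) {
    ExprPtr tree = binary(ExprKind::Less,
                          binary(ExprKind::Add, param("N"), constant(1)),
                          constant(4));
    return rescan(*tree, Bindings{{"N", n}});
}
```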

Supporting Infrastructure

Diagnostic Emission (30+ wrapper functions, 0x4F8000--0x4F8F80)

The expression parser uses a family of thin diagnostic wrapper functions at the beginning of the address range. Each wraps the core pattern: create_diag(code) -> add_arg(type/entity/string) -> emit(diag). The variants differ only in argument count and types:

| Function | Identity | Arguments |
|---|---|---|
| sub_4F8090 | emit_diag_with_type_and_entity | Type arg + entity arg |
| sub_4F8160 | emit_diag_1arg | Single argument |
| sub_4F8220 | emit_diag_with_2_type_args | Two type arguments |
| sub_4F8320 | emit_diag_with_entity_and_type | Entity first, type second |
| sub_4F8B20 | issue_incomplete_type_diag | Incomplete type diagnostic (assert confirmed) |
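The create/add/emit pattern that every wrapper shares can be illustrated as follows. This is a minimal sketch: the Diag class and the emitted list are hypothetical stand-ins, and only the wrapper names are taken from the table.

```python
class Diag:
    """Hypothetical stand-in for the record built by create_diag."""
    def __init__(self, code):
        self.code = code
        self.args = []

emitted = []   # collects emitted diagnostics for this demo

def create_diag(code):
    return Diag(code)

def add_arg(diag, arg):
    diag.args.append(arg)

def emit(diag):
    emitted.append((diag.code, tuple(diag.args)))

def emit_diag_1arg(code, arg):
    diag = create_diag(code)
    add_arg(diag, arg)
    emit(diag)

def emit_diag_with_type_and_entity(code, type_arg, entity_arg):
    diag = create_diag(code)
    add_arg(diag, type_arg)       # type argument first
    add_arg(diag, entity_arg)     # entity argument second
    emit(diag)

emit_diag_1arg(1645, "struct S *")
emit_diag_with_type_and_entity(244, "class C", "C::secret")
```

The wrappers differ only in how many add_arg calls they make and in what order, which is why the binary contains so many near-identical functions in this range.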

Expression Stack (exprutil.c, 0x558720+)

The expression parser maintains a stack of expression contexts via qword_106B970. Each stack entry (the "current context") holds compilation mode flags, scope depth, CUDA execution space state, and template context bits. Key operations:

| Function | Identity | Purpose |
|---|---|---|
| sub_55D0D0 | save_expr_stack | Saves current expression stack state |
| sub_55D100 | init_expr_stack_entry | Creates new stack frame |
| sub_55DB50 | pop_expr_stack | Restores previous frame |
| sub_55E490 | set_operand_kind | Sets the operand classification |
| sub_55C180 | alloc_ref_entry | Allocates reference-entry for tracking |
| sub_55C830 | free_init_component | Frees initializer component node |
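A linked stack of context frames behaves roughly like the sketch below. The ExprStack class and its field names are hypothetical; the source states only which kinds of state a frame holds and that the top of the stack is reachable through qword_106B970.

```python
class ExprStack:
    """Hypothetical model of the context stack rooted at qword_106B970."""
    def __init__(self):
        self.top = None

    def push(self, mode_flags, scope_depth):
        # Role of init_expr_stack_entry: new frame linked to the previous one.
        self.top = {"mode_flags": mode_flags,
                    "scope_depth": scope_depth,
                    "prev": self.top}

    def pop(self):
        # Role of pop_expr_stack: restore the previous frame.
        if self.top is None:
            raise RuntimeError("expression stack underflow")
        frame, self.top = self.top, self.top["prev"]
        return frame

stack = ExprStack()
stack.push(mode_flags=1, scope_depth=0)
stack.push(mode_flags=2, scope_depth=1)
inner = stack.pop()
```

After the pop, the outer frame is current again, so state set while scanning a nested expression does not leak into the enclosing one.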

Comparison Rewriting (C++20, 0x501020--0x508DC0)

The C++20 three-way comparison operator (<=>) triggers rewriting of traditional comparison expressions. complete_comparison_rewrite (sub_505E80, 6.9KB) rewrites a < b into (a <=> b) < 0 when a spaceship operator exists. It uses a recursion counter at qword_106B510 limited to 100 to prevent infinite rewrite loops. Related functions:

| Function | Identity |
|---|---|
| sub_501020 | determine_defaulted_spaceship_return_type |
| sub_5015D0 | synthesize_defaulted_comparison_body |
| sub_501B00 | check_comparison_category_type |
| sub_505E10 | token_for_rel_op -- maps operator kinds to tokens (16->43, 17->44, 32->45, 33->46) |
| sub_505E80 | complete_comparison_rewrite -- core rewrite engine |
| sub_506430 | check_defaulted_eq_properties |
| sub_5068F0 | check_defaulted_secondary_comp |
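The rewrite and its depth guard can be sketched as below. The tuple representation, the function name rewrite_comparison, and the exception class are hypothetical; whether the binary rejects at depth 100 or 101 is also an assumption.

```python
MAX_REWRITE_DEPTH = 100                        # limit stated in the text
TOKEN_FOR_REL_OP = {16: 43, 17: 44, 32: 45, 33: 46}   # mapping from the table

class RewriteLimitExceeded(Exception):
    """Stands in for diagnostic 2982."""

def rewrite_comparison(op, lhs, rhs, has_spaceship, depth=0):
    """Rewrite `lhs op rhs` into `(lhs <=> rhs) op 0` when <=> is available."""
    if depth >= MAX_REWRITE_DEPTH:             # assumed boundary condition
        raise RewriteLimitExceeded(2982)
    if op in ("<", ">", "<=", ">=") and has_spaceship:
        return (op, ("<=>", lhs, rhs), 0)
    return (op, lhs, rhs)                      # no rewrite possible

rewritten = rewrite_comparison("<", "a", "b", has_spaceship=True)
```

A caller that re-enters the rewrite for a nested operand would pass depth + 1, which is the role the counter at qword_106B510 plays in the binary.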

Range-Based For Loop Desugaring (0x50C510, 16.8KB)

fill_in_range_based_for_loop_constructs (sub_50C510) generates the desugared components of for (auto x : range):

// Source:     for (auto x : range_expr) body
// Desugared:  {
//               auto && __range = range_expr;
//               auto __begin = begin(__range);
//               auto __end = end(__range);
//               for (; __begin != __end; ++__begin) {
//                 auto x = *__begin;
//                 body
//               }
//             }

The function calls sub_6EF7A0 (overload resolution) to look up begin() and end() via ADL, and emits error 2285 when no suitable begin/end is found.

Key Global Variables

| Address | Name | Type | Description |
|---|---|---|---|
| word_126DD58 | current_token_code | WORD | Current token kind (0--356) |
| qword_126DD38 | current_source_position | QWORD | Encoded file/line/column |
| qword_106B970 | current_scope | QWORD | Expression context stack pointer |
| qword_106B968 | pending_expression | QWORD | Pending expression accumulator |
| dword_126EFC8 | debug_trace_flag | DWORD | Nonzero enables trace output |
| dword_126EFCC | debug_verbosity | DWORD | Trace verbosity level (>3 prints precedence) |
| dword_126EFB4 | language_dialect | DWORD | 1=C, 2=C++ |
| qword_126EF98 | standard_version | QWORD | Language standard version level |
| dword_126EFA8 | in_template_context | DWORD | Nonzero during template parsing |
| dword_126EFA4 | strict_mode | DWORD | Strict conformance mode flag |
| dword_126EFAC | extended_features | DWORD | Extended features enabled |
| xmmword_106C380 | identifier_lookup_result | 128-bit | SSE-packed identifier lookup (64 bytes total, 4 xmmwords) |
| qword_106B510 | comparison_rewrite_depth | QWORD | Recursion counter for C++20 comparison rewriting (max 100) |
| dword_106C2C0 | gpu_compilation_mode | DWORD | Nonzero during GPU compilation |
| qword_126C5E8 | compilation_unit_table | QWORD | Base of CU array (784-byte stride) |
| dword_126C5E4 | current_CU_index | DWORD | Index into compilation unit table |
| dword_126C5D8 | enclosing_function_CU_index | DWORD | CU index of enclosing function |
| dword_126C5C4 | template_instantiation_depth | DWORD | -1 = not in template instantiation |

Diagnostic Codes

The expression parser emits approximately 50 distinct diagnostic codes:

| Code | Meaning |
|---|---|
| 57 | Pointer-to-member on non-class type |
| 58 | Pointer-to-member on incomplete type |
| 60 | Pointer-to-member on wrong class type |
| 165 | Wrong argument count for builtin |
| 244 | Type access violation in member selection |
| 529 | Pointer-to-member in concept context |
| 852 | Unsupported type for atomic operation (typed variant) |
| 1022 | Inaccessible member in selection |
| 1032 | Invalid _Generic controlling expression |
| 1036 | Unsupported predefined function name |
| 1436 | __builtin_types_compatible_p not available |
| 1543 | __builtin_source_location not available |
| 1596 | Invalid literal operator call |
| 1645 | Argument type not supported for atomic operation |
| 1733 | new-expression in module context |
| 1763 | GNU statement expression not available |
| 1777 | Statement expression in constexpr context |
| 2285 | No begin/end for range-based for |
| 2669 | co_yield outside coroutine |
| 2747 | co_yield not in function scope |
| 2866 | Statement expression in constexpr context |
| 2896 | Statement expression in template instantiation |
| 2982 | Comparison rewrite recursion limit exceeded |
| 3462 | __device__ function called from __host__ context |
| 3463 | Variant of 3462 with __host__ annotation note |
| 3464 | __host__ function called from __device__ context |
| 3465 | Variant of 3464 with __device__ annotation note |
| 3508 | __global__ function called from wrong context |
| 3768 | Wrong argument type for atomic builtin (extra arg) |
| 3769 | Wrong argument type for atomic builtin (declared arg) |

Function Index

Complete listing of confirmed functions in the expression parser, grouped by subsystem:

Core Expression Scanning

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_511D40 | 80KB | scan_expr_full | DEFINITE |
| sub_526E30 | 48KB | scan_conditional_operator | DEFINITE |
| sub_545F00 | 16KB | scan_function_call | DEFINITE |
| sub_54AED0 | 15KB | scan_new_operator | DEFINITE |
| sub_5512B0 | 9KB | scan_identifier | DEFINITE |
| sub_544290 | 6KB | scan_cast_or_expr | DEFINITE |
| sub_5565E0 | 10KB | rescan_expr_with_substitution_internal | DEFINITE |
| sub_529720 | 12KB | scan_comma_operator | DEFINITE |
| sub_526040 | 15KB | scan_logical_operator | DEFINITE |
| sub_543A90 | 1.4KB | scan_rel_operator | DEFINITE |
| sub_540160 | 1.2KB | apply_one_fold_operator | DEFINITE |
| sub_543FA0 | 1KB | assemble_fold_expression_operand | DEFINITE |

Unary Operators

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_516080 | 7.6KB | scan_prefix_incr_decr | DEFINITE |
| sub_516720 | 13KB | scan_ampersand_operator | DEFINITE |
| sub_517270 | 4.4KB | scan_indirection_operator | DEFINITE |
| sub_517680 | 5.1KB | scan_arith_prefix_operator | DEFINITE |
| sub_517BD0 | 26KB | scan_sizeof_operator | DEFINITE |
| sub_519300 | 9.4KB | scan_alignof_operator | DEFINITE |
| sub_519CF0 | 6.1KB | scan_builtin_addressof | DEFINITE |
| sub_510D70 | 8.2KB | scan_postfix_incr_decr | DEFINITE |

Binary Operators

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_5238C0 | 5.4KB | scan_mult_operator | DEFINITE |
| sub_523EB0 | 10.6KB | scan_add_operator | DEFINITE |
| sub_524960 | 5.8KB | scan_shift_operator | DEFINITE |
| sub_524ED0 | 5.6KB | scan_eq_operator | DEFINITE |
| sub_525BC0 | 4.7KB | scan_bit_operator | DEFINITE |
| sub_525450 | 8.6KB | scan_gnu_min_max_operator | DEFINITE |
| sub_522650 | 19.8KB | scan_ptr_to_member_operator | DEFINITE |

Assignment

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_53FD70 | 1.1KB | scan_simple_assignment_operator | DEFINITE |
| sub_536E80 | 3.1KB | scan_compound_assignment_operator | DEFINITE |
| sub_508770 | 4.7KB | process_simple_assignment | DEFINITE |

Member Access

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_5303E0 | 15KB | scan_field_selection_operator | DEFINITE |
| sub_4FEB60 | 4.5KB | make_field_selection_operand | DEFINITE |
| sub_4FEF00 | 4.6KB | do_field_selection_operation | DEFINITE |
| sub_540560 | 3.1KB | scan_subscript_operator | DEFINITE |

Cast Operators

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_51EE00 | 8.3KB | scan_new_style_cast | DEFINITE |
| sub_51F670 | 13.5KB | scan_static_cast_operator | DEFINITE |
| sub_520280 | 8.8KB | scan_const_cast_operator | DEFINITE |
| sub_5209A0 | 4.9KB | scan_reinterpret_cast_operator | DEFINITE |
| sub_53C690 | 3.6KB | scan_named_cast_operator | HIGH |

Type Traits

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_51A690 | 12KB | scan_unary_type_trait_helper | DEFINITE |
| sub_51B650 | 7.2KB | scan_binary_type_trait_helper | DEFINITE |
| sub_535080 | 0.2KB | dispatch_call_like_builtin | MEDIUM |
| sub_534B60 | 1.8KB | scan_call_like_builtin_operation | DEFINITE |
| sub_549700 | 2.2KB | compute_is_invocable | DEFINITE |
| sub_550E50 | 1.3KB | compute_is_constructible | DEFINITE |
| sub_510410 | 2.1KB | compute_is_convertible | DEFINITE |
| sub_510860 | 2.3KB | compute_is_assignable | DEFINITE |

CUDA-Specific

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_505720 | 4KB | check_cross_execution_space_call | DEFINITE |
| sub_505B40 | 2.7KB | check_cross_space_call_in_template | DEFINITE |
| sub_537BF0 | 7KB | adjust_sync_atomic_builtin | DEFINITE |
| sub_520EE0 | 2.7KB | scan_intaddr_operator | DEFINITE |

Initializers and Braced-Init-Lists

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_5360D0 | 4.7KB | parse_braced_init_list_full | DEFINITE |
| sub_5392B0 | 0.2KB | complete_braced_init_list_parsing | DEFINITE |
| sub_539340 | 1KB | scan_braced_init_list_cast | DEFINITE |
| sub_539670 | 0.4KB | get_braced_init_list | DEFINITE |
| sub_541000 | 2KB | scan_member_constant_initializer_expression | DEFINITE |
| sub_541DC0 | 5.5KB | prescan_initializer_for_auto_type_deduction | DEFINITE |

Coroutines

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_50B630 | 10KB | add_await_to_operand | DEFINITE |
| sub_50C070 | 1.8KB | check_coroutine_context | HIGH |
| sub_50E080 | 4.5KB | make_coroutine_result_expression | DEFINITE |

C++20 Concepts and Requires

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_52CFF0 | 13.5KB | scan_requires_expression | DEFINITE |
| sub_542D90 | 3.8KB | scan_requires_expr | DEFINITE |
| sub_52EB60 | 8.6KB | scan_requires_clause | DEFINITE |

Declaration Parser

C++ declaration parsing is the most ambiguity-ridden phase of front-end compilation. A statement like T(x); is simultaneously a valid function-style cast (expression) and a variable declaration with redundant parentheses. EDG 6.6 in cudafe++ resolves this by splitting the work into two stages: a prescanning/disambiguation phase (disambig.c) that probes ahead in the token stream to classify ambiguous constructs, followed by committed parsing across four tightly-coupled source files -- decl_spec.c (declaration specifiers), declarator.c (declarator syntax), decls.c (symbol table insertion and semantic validation), and decl_inits.c (initializer processing). CUDA adds a fifth axis of complexity: every declaration may carry execution space attributes (__device__, __host__, __global__) and memory space qualifiers (__shared__, __constant__, __managed__), which are parsed as attribute category 4 and must be separated from standard C++ attributes before semantic analysis.

The core pipeline processes approximately 22,000 lines of decompiled logic across six major functions, each exceeding 1,000 lines. The design is a classic recursive-descent parser with significant state carried in stack-allocated structures (128-byte decl_spec accumulators packed as __m128i arrays) and global scope chain state (784-byte entries in the scope table at qword_126C5E8).

Key Facts

| Property | Value |
|---|---|
| Source files | decl_spec.c, declarator.c, decls.c, decl_inits.c, disambig.c |
| Address range | 0x4A0000--0x4F8000 (~360 KB of code, ~530 functions) |
| Central dispatcher | sub_4ACF80 (decl_specifiers, 4,761 lines) |
| Declarator entry | sub_4B7BC0 (declarator, 284 lines) |
| Function declarator | sub_4B8190 (function_declarator, 3,144 lines) |
| Recursive declarator | sub_4BC950 (r_declarator, 2,578 lines) |
| Function declaration | sub_4CE420 (decl_routine, 2,858 lines) |
| Variable declaration | sub_4CA6C0 (decl_variable, 1,090 lines) |
| Top-level variable entry | sub_4DEC90 (variable_declaration, 1,098 lines) |
| Disambiguation | sub_4EA560 (prescan_declaration, ~400 lines) |
| Scope entry size | 784 bytes (at qword_126C5E8) |
| Decl specifier accumulator | 128 bytes (4 x __m128i, stack-allocated) |
| CUDA mode flag | dword_126EFA8 (bool), dialect in dword_126EFB4 (2 = C++) |
| Current token global | word_126DD58 |
| Token advance | sub_676860 (get_next_token) |

Architecture

The declaration parsing pipeline operates as a five-stage waterfall. Each stage narrows the interpretation of the token stream until a fully-resolved declaration is inserted into the symbol table:

Token Stream (from lexer)
  │
  ▼
STAGE 1: Disambiguation (disambig.c)
  │  prescan_declaration ─── lookahead to classify ambiguous constructs
  │  prescan_gnu_attribute ── skip __attribute__((...)) blocks
  │  find_for_loop_separator ── distinguish for-init from expression
  │
  ▼
STAGE 2: Declaration Specifiers (decl_spec.c)
  │  decl_specifiers ─── 4,761-line switch dispatching on token kind
  │  ├── storage class: auto, register, static, extern, typedef
  │  ├── type specifiers: int, char, void, class/struct/enum, typename
  │  ├── cv-qualifiers: const, volatile, restrict
  │  ├── function specifiers: inline, virtual, explicit, constexpr, consteval
  │  ├── CUDA attributes: __device__, __host__, __global__ (category 4)
  │  └── class_specifier / enum_specifier (recursive for definitions)
  │
  ▼
STAGE 3: Declarator (declarator.c)
  │  declarator ─── coordinates pointer/array/function declarators
  │  ├── pointer_declarator ── *, &, &&, ::*
  │  ├── r_declarator ── recursive descent on declarator-id
  │  ├── array_declarator ── [expression], []
  │  ├── function_declarator ── (params) cv-qualifiers ref-qualifier noexcept -> trailing-return
  │  └── scan_declarator_attributes ── separates CUDA attrs from standard
  │
  ▼
STAGE 4: Declaration Processing (decls.c)
  │  decl_routine ─── function/method declarations (2,858 lines)
  │  decl_variable ── variable declarations with CUDA memory space
  │  variable_declaration ── top-level entry with CUDA error emission
  │  find_linked_symbol ── redeclaration detection
  │  id_linkage ── linkage determination (internal/external/none)
  │
  ▼
STAGE 5: Initializer Processing (decl_inits.c)
     ctor_inits_for_inheriting_ctor ── inheriting constructors
     dtor_initializer ── destructor init lists
     check_for_missing_initializer_full ── missing initializer diagnostics

Stage 1: Disambiguation (disambig.c)

The Problem

C++ has a famous syntactic ambiguity: many token sequences can be parsed as either declarations or expressions. The canonical example:

T(x);          // declaration of variable x of type T?  or  function-style cast of x to T?
T(x)(y);       // declaration of function x returning T?  or  call to T(x) with argument y?
T * x;         // declaration of pointer-to-T named x?  or  multiplication of T and x?

The C++ standard resolves these with the "if it can be a declaration, it is a declaration" rule. EDG implements this by prescanning: before committing to a parse, the parser saves the lexer state, probes ahead through the token stream to determine whether the construct is a declaration, then restores the lexer state and dispatches to the appropriate parser.
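The save, probe, and restore cycle can be sketched with a cursor over a token list. Everything here is illustrative: TokenCursor and looks_like_declaration are hypothetical names, and the probe recognizes only the single shape `T ( identifier ) ;`, far less than the real prescanner handles.

```python
class TokenCursor:
    """Hypothetical lexer wrapper whose read position can be saved and restored."""
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

def looks_like_declaration(cur, type_names):
    """Probe for `T ( id ) ;`, then rewind regardless of the outcome."""
    saved = cur.pos                          # role of save_lexer_state()
    try:
        if cur.advance() not in type_names:
            return False
        if cur.advance() != "(":
            return False
        name = cur.advance()
        if name is None or not name.isidentifier():
            return False
        return cur.advance() == ")" and cur.advance() == ";"
    finally:
        cur.pos = saved                      # role of restore_lexer_state()

decl = TokenCursor(["T", "(", "x", ")", ";"])
call = TokenCursor(["f", "(", "x", ")", ";"])
is_decl = looks_like_declaration(decl, {"T"})
is_call_decl = looks_like_declaration(call, {"T"})
```

Because the position is restored in the finally block, the committed parser always starts from the same token the probe started from, whichever classification was reached.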

prescan_declaration (sub_4EA560)

This is the top-level disambiguation entry point, called when the parser encounters an ambiguous construct at statement or declaration level. It operates in a non-destructive lookahead mode: it consumes tokens tentatively, classifies the construct, then rewinds.

prescan_declaration(flags):
    save_lexer_state()
    
    # Compute CUDA-aware skip mode
    if (flags & 0x800) == 0:     # not in template context
        skip_mode = 16385         # 0x4001: standard prescan
    else:
        skip_mode = 67125249      # 0x4004001: template-aware prescan
    
    # In CUDA C++ mode, use cuda_skip_token for identifier classification
    if dword_126EFB4 == 2:       # CUDA C++ dialect
        while not at_end_of_tentative_scan():
            token = current_token()
            if is_cuda_keyword(token):
                cuda_skip_token(skip_mode)   # sub_6810F0
            else:
                advance_token()              # sub_676860
            classify_declaration_vs_expression()
    
    restore_lexer_state()
    return classification  # DECLARATION or EXPRESSION

The skip_mode is a bitmask encoding which token classes to recognize during prescanning. The wider mask (67125249 = 0x4004001), selected when bit 0x800 of flags is set, is the standard mask 0x4001 with bit 26 (0x4000000) added. Independently of the mask, CUDA keywords are routed through cuda_skip_token so that __device__ int x; is correctly classified as a declaration even though __device__ is not a standard C++ keyword.
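The two decimal constants from the pseudocode are easier to compare once they are broken into bits. The snippet below is plain arithmetic on those two values; the names STANDARD_PRESCAN, TEMPLATE_PRESCAN, and set_bits are hypothetical.

```python
STANDARD_PRESCAN = 16385        # decimal value from the pseudocode
TEMPLATE_PRESCAN = 67125249     # decimal value from the pseudocode

def set_bits(mask):
    """Return the positions of the bits that are set in mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

standard_bits = set_bits(STANDARD_PRESCAN)
template_bits = set_bits(TEMPLATE_PRESCAN)
extra_bits = set_bits(TEMPLATE_PRESCAN & ~STANDARD_PRESCAN)

print(hex(STANDARD_PRESCAN), standard_bits)
print(hex(TEMPLATE_PRESCAN), template_bits)
```

The standard mask sets bits 0 and 14, and the template-aware mask sets the same two bits plus bit 26. What each bit means to the token skipper is not established by this arithmetic.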

prescan_gnu_attribute (sub_4E9E70)

Attributes complicate disambiguation because __attribute__((foo)) can appear almost anywhere in a declaration. This function skips over balanced GNU attribute sequences during prescanning:

prescan_gnu_attribute():
    assert current_token == 142     # GNU __attribute__ token
    while current_token == 142:
        advance_token()             # consume __attribute__
        match_balanced_parens()     # skip ((...))
        
        # CUDA extension: check if identifier is CUDA keyword
        if dword_126EFB4 == 2:      # CUDA C++ mode
            if BYTE1(xmmword_106C390) & 2:  # CUDA extension flag
                cuda_skip_token(...)
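Skipping a balanced `__attribute__((...))` run reduces to parenthesis counting. The sketch below works on a list of token strings rather than token codes; skip_gnu_attributes is a hypothetical name and the CUDA keyword check is left out.

```python
def skip_gnu_attributes(tokens, pos=0):
    """Return the index of the first token after any run of __attribute__((...))."""
    while pos < len(tokens) and tokens[pos] == "__attribute__":
        pos += 1                              # consume __attribute__
        if pos >= len(tokens) or tokens[pos] != "(":
            break                             # malformed: no argument list follows
        depth = 0
        while pos < len(tokens):
            if tokens[pos] == "(":
                depth += 1
            elif tokens[pos] == ")":
                depth -= 1
            pos += 1
            if depth == 0:
                break                         # matching ')' consumed
    return pos

tokens = ["__attribute__", "(", "(", "aligned", "(", "16", ")", ")", ")", "int", "x"]
next_index = skip_gnu_attributes(tokens)
```

Nested argument lists such as aligned(16) are handled by the depth counter, so the scan stops on the first token of the declaration proper.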

find_for_loop_separator (sub_4EC690)

A special-purpose disambiguator for for loops. In for(init; cond; incr), the parser must find the semicolons that separate the three clauses. This is non-trivial because the init clause can contain declarations with complex types, nested parentheses, and template angle brackets.

find_for_loop_separator():
    create_disambiguation_checkpoint()  # sub_67B4F0
    paren_depth = 0
    while true:
        token = current_token()
        if token == '(':
            paren_depth++
        elif token == ')':
            if paren_depth == 0:
                break
            paren_depth--
        elif token == ';' and paren_depth == 0:
            restore_checkpoint()
            return SEMICOLON_FOUND   # 0x4B = 75
        elif token == EOF:
            restore_checkpoint()
            return EOF               # 9
        advance_token()              # sub_676860: move to the next token
    restore_checkpoint()
    return NOT_FOUND                 # 0
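A runnable version of the same scan over a token list looks like this. The return codes are copied from the pseudocode; the function signature and the use of string tokens are assumptions made for the example, and template angle brackets are not tracked.

```python
SEMICOLON_FOUND, EOF_REACHED, NOT_FOUND = 75, 9, 0   # codes from the pseudocode

def find_for_loop_separator(tokens, start=0):
    """Scan from just after the opening '(' of a for statement."""
    depth = 0
    pos = start
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "(":
            depth += 1
        elif tok == ")":
            if depth == 0:
                return NOT_FOUND          # closed the for header without a ';'
            depth -= 1
        elif tok == ";" and depth == 0:
            return SEMICOLON_FOUND
        pos += 1                          # advance to the next token
    return EOF_REACHED

classic = ["int", "i", "=", "f", "(", "0", ")", ";", "i", "<", "n", ";", "++", "i", ")"]
ranged = ["auto", "x", ":", "f", "(", ")", ")"]
```

The classic header yields SEMICOLON_FOUND, while the range-based header closes its parenthesis first and yields NOT_FOUND, which is how the caller can tell the two forms apart.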

Stage 2: Declaration Specifiers (decl_spec.c)

decl_specifiers (sub_4ACF80) -- The Central Dispatcher

This is the most complex function in the declaration parser: 4,761 decompiled lines, a while(2) loop containing a giant switch on token kinds, processing every specifier in a C++ declaration. It handles storage classes, type specifiers, cv-qualifiers, function specifiers, and CUDA attributes, accumulating results into a 128-byte stack structure.

Input Parameter: Context Flags

The a1 parameter encodes the parsing context as a bitmask:

| Bit | Mask | Context |
|---|---|---|
| 2 | 0x4 | Inside class member declaration |
| 3 | 0x8 | Inside function parameter list |
| 4 | 0x10 | At block scope |
| 6 | 0x40 | Inside template parameter list |
| 14 | 0x4000 | Friend declaration |
| 15 | 0x8000 | At class scope |
| 18 | 0x40000 | In-declaration (re-entrant) |
| 20 | 0x100000 | Constexpr lambda context |
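When reading decompiled call sites it helps to decode a raw a1 value back into the contexts from the table. The enum member names below are hypothetical labels; only the mask values come from the table.

```python
from enum import IntFlag

class DeclContext(IntFlag):
    # Member names are hypothetical labels for the masks in the table.
    CLASS_MEMBER = 0x4
    FUNCTION_PARAM = 0x8
    BLOCK_SCOPE = 0x10
    TEMPLATE_PARAM = 0x40
    FRIEND = 0x4000
    CLASS_SCOPE = 0x8000
    IN_DECLARATION = 0x40000
    CONSTEXPR_LAMBDA = 0x100000

def describe(a1):
    """List the names of all known context bits set in a1."""
    return [flag.name for flag in DeclContext if a1 & flag]

decoded = describe(0x8004)
```

Bits that are not in the table are simply ignored by describe, so an unexpected value shows up as a shorter list rather than an error.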

The Accumulator Structure

Results are accumulated into a stack-allocated structure (parameter a2) laid out as:

| Offset | Size | Field | Description |
|---|---|---|---|
| +8 | 4 | specifier_flags | Bitmask of specifiers seen |
| +32 | 8 | source_position | Position of first specifier |
| +120 | 4 | flags | Parsing state flags |
| +132 | 4 | context | Context discriminator |
| +200 | 8 | attribute_list | Linked list of parsed attributes |
| +208 | 8 | attribute_list_alt | Secondary attribute list (CUDA exec space) |
| +228 | 4 | modifiers | Accumulated modifier bits |
| +272 | 8 | type_ptr | Resolved type pointer |
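Given a raw memory dump of the structure, the fields can be pulled out by offset. The sketch below assumes little-endian x86-64 layout, treats 4-byte fields as unsigned, and uses a zero-filled 280-byte buffer as a stand-in for real data; FIELDS and read_field are hypothetical helpers.

```python
import struct

# (offset, struct format) per field, taken from the table; little-endian.
FIELDS = {
    "specifier_flags": (8, "<I"),
    "source_position": (32, "<Q"),
    "flags": (120, "<I"),
    "context": (132, "<I"),
    "attribute_list": (200, "<Q"),
    "attribute_list_alt": (208, "<Q"),
    "modifiers": (228, "<I"),
    "type_ptr": (272, "<Q"),
}

def read_field(buf, name):
    """Decode one named field from a raw dump of the structure."""
    offset, fmt = FIELDS[name]
    return struct.unpack_from(fmt, buf, offset)[0]

buf = bytearray(280)                        # covers offset 272 plus 8 bytes
struct.pack_into("<I", buf, 8, 0x21)        # fake specifier_flags value
struct.pack_into("<Q", buf, 272, 0xE7F000)  # fake type_ptr value
```

For example, read_field(buf, "type_ptr") returns the 64-bit value stored at offset +272.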

Pseudocode

decl_specifiers(context_flags, accumulator, type_chain, ...):
    debug_trace(3, "decl_specifiers")
    
    spec_bits = 0        # accumulated specifier combination flags
    error_flag = 0
    
    while true:  # while(2) in decompilation
        token = word_126DD58    # current token
        
        switch token:
        
        # ── Storage class specifiers ──
        case TOKEN_AUTO:         # 77
        case TOKEN_REGISTER:     # 119
        case TOKEN_STATIC:       # 99
        case TOKEN_EXTERN:       # 88
        case TOKEN_TYPEDEF:      # 103
            process_storage_class_specifier(
                auto_flag, ..., context_flags, accumulator,
                prev_scope, &spec_bits, &result, &type_out, &error_flag
            )
            continue
        
        # ── Type specifiers (keywords) ──
        case TOKEN_VOID .. TOKEN_DOUBLE:       # 81-119 range
        case TOKEN_SIGNED:
        case TOKEN_UNSIGNED:
        case TOKEN_CHAR:
        case TOKEN_INT:
        case TOKEN_FLOAT:
        case TOKEN_DOUBLE:
            # Validate combination with existing specifiers
            if spec_bits & CONFLICTING_TYPE_MASK:
                emit_error(84)     # conflicting type specifiers
            spec_bits |= type_specifier_bit(token)
            advance_token()
            continue
        
        # ── cv-qualifiers ──
        case TOKEN_CONST:        # 263
        case TOKEN_VOLATILE:     # 264
        case TOKEN_RESTRICT:     # 265, 266
            accumulator.modifiers |= cv_bit(token)
            advance_token()
            continue
        
        # ── Function specifiers ──
        case TOKEN_INLINE:
            spec_bits |= INLINE_BIT
            advance_token()
            continue
        
        case TOKEN_VIRTUAL:
            spec_bits |= VIRTUAL_BIT
            advance_token()
            continue
        
        case TOKEN_EXPLICIT:
            spec_bits |= EXPLICIT_BIT
            advance_token()
            continue
        
        # ── C++11/17/20 specifiers ──
        case TOKEN_CONSTEXPR:
            spec_bits |= CONSTEXPR_BIT
            if context_flags & 0x100000:    # constexpr lambda
                emit_error(1570)
            advance_token()
            continue
        
        case TOKEN_CONSTEVAL:
            spec_bits |= CONSTEVAL_BIT
            advance_token()
            continue
        
        case TOKEN_CONSTINIT:
            spec_bits |= CONSTINIT_BIT
            advance_token()
            continue
        
        case TOKEN_THREAD_LOCAL:
            spec_bits |= THREAD_LOCAL_BIT
            advance_token()
            continue
        
        # ── Class/struct/union/enum definitions ──
        case TOKEN_CLASS:        # 151
        case TOKEN_STRUCT:
        case TOKEN_UNION:
            class_specifier(scope, context_flags, ..., &result, &error_flag)
            continue
        
        case TOKEN_ENUM:
            enum_specifier(scope, context_flags, ..., &result, &error_flag)
            continue
        
        # ── typename specifier ──
        case TOKEN_TYPENAME:     # 183
            typename_specifier(&type_out, accumulator, context_flag, ...)
            continue
        
        # ── Identifier (type name or constructor) ──
        case TOKEN_IDENTIFIER:   # 1
            # This is the declaration/expression ambiguity hotspot
            if try_interpret_as_type_name(accumulator):    # sub_4C4F80
                continue
            if is_constructor_decl(enclosing_class):       # sub_4AC970
                continue
            # Not a type name — fall through to end of specifiers
            break
        
        # ── GNU __attribute__ / __declspec ──
        case TOKEN_ATTRIBUTE:    # 142
            parse_attribute_list(accumulator)
            # CUDA: execution space attributes separated here
            continue
        
        # ── typeof / decltype ──
        case TOKEN_TYPEOF:       # 189
        case TOKEN_DECLTYPE:     # 185
            parse_typeof_or_decltype(accumulator)
            continue
        
        # ── End of specifiers ──
        case TOKEN_SEMICOLON:    # 55
        default:
            break  # exit while loop
    
    # Post-processing: validate specifier combinations
    if spec_bits == 0 and no_type_found:
        emit_error(79)   # missing type specifier
    
    # CUDA: check execution space context
    if dword_126EFB4 == 2:   # CUDA C++ mode
        validate_cuda_execution_space(accumulator, context_flags)
        if invalid_cuda_context:
            emit_error(3537)  # execution space attribute in wrong context

Token Classification Map

The switch in decl_specifiers handles the following token kinds:

| Token Code | Keyword | Category |
|---|---|---|
| 1 | identifier | Type name or constructor check |
| 77 | auto | Storage class (C++03) / placeholder type (C++11) |
| 88 | extern | Storage class |
| 99 | static | Storage class |
| 103 | typedef | Storage class |
| 119 | register | Storage class |
| 80--108 | C type keywords | Type specifiers |
| 142 | __attribute__ | GNU attribute |
| 151 | class | Class specifier |
| 183 | typename | Typename specifier |
| 185 | decltype | Decltype specifier |
| 189 | typeof | GNU typeof |
| 263--266 | cv-qualifiers | const, volatile, restrict, __restrict |

process_storage_class_specifier (sub_4A31A0)

Validates and records a storage class specifier. C++ allows at most one storage class per declaration, with one exception: thread_local may be combined with static or extern.

process_storage_class_specifier(auto_flag, ..., context_flags, decl_info,
                                 prev_scope, spec_bits, result, type_out, error_flag):
    # Flag bits in context_flags:
    #   1=function, 4=class, 8=extern, 0x10=static, 0x200=register,
    #   0x4000=friend, 0x8000=at class scope, 0x100000=constexpr lambda

    if *spec_bits & STORAGE_CLASS_MASK:
        emit_error(80)     # duplicate storage class
        return
    
    if conflicting_with_previous_specifier:
        emit_error(81)     # conflicting storage class
        return
    
    switch current_storage_class:
        case EXTERN:
            if at_block_scope and not_cpp_mode:
                emit_error(85)
            if at_file_scope and not_standard_mode:
                emit_error(149)
            decl_info.linkage_byte = 3    # external linkage
        
        case STATIC:
            if in_class_definition and cpp_mode:
                emit_error(328)
        
        case REGISTER:
            emit_error(481)   # deprecated
        
        case AUTO:
            if dword_126EF4C:     # auto parameter support enabled
                # C++20: auto in parameter list = abbreviated template
                create_placeholder_type()    # sub_5BBA60
            else:
                emit_error(1598)  # auto type in invalid context
    
    *spec_bits |= storage_class_bit

class_specifier (sub_4A57C0, 2,179 lines)

Parses class/struct/union specifiers including the full class body. This function manages scope entry/exit, base class lists, member declarations, access specifiers, and CUDA execution space propagation.

Key operations:

  • Calls scan_tag_name (sub_4A38A0, 1,216 lines) to parse the class name, handling qualified names and template parameters
  • Calls check_for_class_modifiers (sub_4A3610) to detect final/__final
  • Manages the scope stack: pushes a class scope (kind 6 or 7) at qword_126C5E8 + 784 * scope_index
  • Sets CUDA execution space flags at scope entry offset +182 (bit 0x20) for device-side class definitions
  • Issues error 2407 for enum definitions in prohibited CUDA execution contexts
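The scope stack addressing in the list above is ordinary array indexing with a 784-byte stride. The sketch below models the table as a byte array; the four-entry table size and the helper names are hypothetical, while the stride, the +182 offset, and the 0x20 bit come from the text.

```python
SCOPE_ENTRY_SIZE = 784          # stride of the table at qword_126C5E8
EXEC_SPACE_FLAG_OFFSET = 182    # byte holding the execution space flags
DEVICE_CLASS_BIT = 0x20         # set for device-side class definitions

def scope_entry_offset(scope_index):
    """Byte offset of a scope entry relative to the table base."""
    return SCOPE_ENTRY_SIZE * scope_index

def mark_device_class(table, scope_index):
    """Set the device-side class bit in the given scope entry."""
    table[scope_entry_offset(scope_index) + EXEC_SPACE_FLAG_OFFSET] |= DEVICE_CLASS_BIT

table = bytearray(SCOPE_ENTRY_SIZE * 4)     # hypothetical four-entry table
mark_device_class(table, 2)
```

Entry 2 starts at byte 1568, so the flag byte touched here is at 1568 + 182 = 1750 from the table base.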

enum_specifier (sub_4AA2F0, 1,437 lines)

Parses enum, enum class, and enum struct specifiers, including:

  • Underlying type (enum E : int)
  • Opaque enum declarations (enum class E : int;)
  • Scoped vs. unscoped enum semantics
  • Calls scan_enumerator_list (sub_4A89F0, 950 lines) for the enumerator body

Specifier Validation Functions

After decl_specifiers accumulates all specifiers, several validation functions check that the combination is legal:

| Function | Address | Lines | Purpose |
|---|---|---|---|
| check_use_of_constexpr | sub_4A22B0 | 153 | Validates constexpr on functions and variables |
| check_use_of_consteval | sub_4A1BF0 | 104 | Validates consteval on functions only |
| check_use_of_constinit | sub_4A1EC0 | 77 | Validates constinit on variables with static storage |
| check_use_of_thread_local | sub_4A2000 | 111 | Validates thread_local placement |
| check_explicit_specifier | sub_4A1DF0 | 45 | Validates explicit on constructors/conversions |
| check_gnu_c_auto_type | sub_4A2580 | 52 | Validates GNU __auto_type |

Each follows the same pattern: examine the accumulated specifier bits and the entity kind at offset +80 of the declaration node, and emit a targeted error if the combination is illegal. For example, check_use_of_consteval:

check_use_of_consteval(decl_info):
    entity = decl_info[0]
    kind = entity[+80]       # symbol kind
    
    if kind != FUNCTION (10) and kind != MEMBER_FUNCTION (11):
        emit_error(2926)      # consteval on non-function
        entity[+177] &= 0xF9 # clear consteval state (bits 1-2, mask 0x06)
        return
    
    func_kind = entity[+166]
    if func_kind == DESTRUCTOR (2):
        emit_error(2927)      # consteval on destructor
        entity[+177] &= 0xF9
        return
    
    if func_kind == CONSTRUCTOR (1):
        if type_has_virtual_base(entity[+88]):
            emit_error(2928)  # consteval on ctor with virtual base
            entity[+177] &= 0xF9
            return
    
    if func_kind == CONVERSION (5):
        if certain_conversion_conditions:
            emit_error(2959)  # consteval on certain conversions
            entity[+177] &= 0xF9
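The same checks can be written as a small runnable function. The entity is modeled as a dict with hypothetical keys (kind, func_kind, has_virtual_base, flags_177); the conversion-function case is omitted because its conditions are not spelled out in the pseudocode.

```python
FUNCTION, MEMBER_FUNCTION = 10, 11          # symbol kinds from the pseudocode
CONSTRUCTOR, DESTRUCTOR = 1, 2              # function kinds from the pseudocode

def check_use_of_consteval(entity):
    """Return the diagnostic code for an illegal consteval, or None if legal."""
    def reject(code):
        entity["flags_177"] &= 0xF9         # clears bits 1 and 2 (mask 0x06)
        return code

    if entity["kind"] not in (FUNCTION, MEMBER_FUNCTION):
        return reject(2926)                 # consteval on non-function
    if entity["func_kind"] == DESTRUCTOR:
        return reject(2927)                 # consteval on destructor
    if entity["func_kind"] == CONSTRUCTOR and entity.get("has_virtual_base"):
        return reject(2928)                 # ctor of class with virtual base
    return None

variable = {"kind": 7, "func_kind": 0, "flags_177": 0xFF}
plain_fn = {"kind": FUNCTION, "func_kind": 0, "flags_177": 0x06}
variable_code = check_use_of_consteval(variable)
plain_code = check_use_of_consteval(plain_fn)
```

Each rejection both reports a code and strips the flag bits, so later stages never see a consteval marking on an entity that cannot carry one.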

Stage 3: Declarator Parsing (declarator.c)

Architecture

Declarator parsing uses inside-out construction: the C++ declarator syntax places the declared name in the center, with type constructors radiating outward (pointers to the left, arrays and function parameters to the right). The parser builds a derived-type chain that is later unwound against the base type from decl_specifiers to produce the final type.

Declarator syntax (C++ grammar):
    declarator := pointer-declarator
    pointer-declarator := {*, &, &&, C::*} cv-qualifiers* direct-declarator
    direct-declarator := declarator-id | ( declarator ) | direct-declarator ( params ) | direct-declarator [ expr ]
    declarator-id := qualified-name | unqualified-name

The parser coordinates five specialized sub-parsers:

| Function | Address | Lines | Role |
|---|---|---|---|
| declarator | sub_4B7BC0 | 284 | Top-level entry: dispatches to pointer/r_declarator |
| r_declarator | sub_4BC950 | 2,578 | Recursive descent on direct-declarator |
| pointer_declarator | sub_4B72A0 | 440 | *, &, &&, ::* with cv-qualifiers |
| array_declarator | sub_4B6760 | 518 | [expr] and [] |
| function_declarator | sub_4B8190 | 3,144 | (params) cv-quals ref-qual noexcept -> ret |

scan_declarator_attributes (sub_4B3970) -- CUDA Attribute Separation

This is the critical function that separates CUDA execution space attributes from standard C++ attributes on declarators. In standard C++, attributes apply to the entity being declared. CUDA adds a parallel attribute dimension -- execution space -- that must be routed to a separate storage location.

The function iterates through the attribute list and sorts each attribute by its category byte at offset +9:

scan_declarator_attributes(decl_info, attr_accumulator):
    attr_list = decl_info[+200]    # primary attribute list
    
    for each attr in attr_list:
        category = attr[+9]         # attribute category byte
        kind = attr[+8]             # attribute kind
        placement = attr[+10]       # where in declaration it appeared
        
        switch category:
            case 1:  # TYPE attribute (alignas, etc.)
                # Keep on primary list, set placement
                attr[+10] = 10      # after type specifier
                
            case 2:  # DECLARATION attribute ([[nodiscard]], etc.)
                if attr[+11] & 0x10:
                    # CUDA/vendor declaration attribute
                    route_to_vendor_list(attr)
                else:
                    # Standard declaration attribute
                    attr[+10] = 12  # before declarator
                
            case 3:  # STATEMENT attribute ([[fallthrough]], etc.)
                if decl_info[+131] & 8:  # class-key context
                    handle_class_key_stmt_attr(attr)
                
            case 4:  # CUDA EXECUTION SPACE attribute
                # __device__, __host__, __global__
                # Move to SECONDARY attribute list
                move_to_list(attr, decl_info[+184])
                
                # Error if misplaced
                if wrong_position:
                    emit_error(1847)  # attribute in wrong position
    
    # Mark all processed attributes
    for each attr in processed:
        attr[+11] |= 1    # set "consumed" flag

The separation into primary (offset +200) and secondary (offset +184) attribute lists is essential: downstream code (decl_routine, decl_variable) reads execution space from the secondary list and standard attributes from the primary list. This prevents CUDA execution space from interfering with standard attribute processing like [[nodiscard]] or [[deprecated]].
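The routing rule itself is a partition on the category byte. In the sketch below the Attr class and split_attributes are hypothetical; only category 4 as the execution space category and the consumed flag come from the description above.

```python
from dataclasses import dataclass

@dataclass
class Attr:
    name: str
    category: int           # byte at +9 in the attribute node
    consumed: bool = False  # models bit 0 of the byte at +11

def split_attributes(attrs):
    """Route category 4 attributes to the secondary list, the rest to primary."""
    primary, secondary = [], []
    for attr in attrs:
        (secondary if attr.category == 4 else primary).append(attr)
        attr.consumed = True
    return primary, secondary

attrs = [Attr("nodiscard", 2), Attr("__device__", 4), Attr("alignas", 1)]
primary, secondary = split_attributes(attrs)
```

After the split, code that only cares about standard attributes can iterate the primary list without ever seeing __device__.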

function_declarator (sub_4B8190, 3,144 lines)

The largest function in declarator.c and, at 3,144 lines, the second-largest in the declaration parser after decl_specifiers. It handles the complete C++ function declarator grammar including C++11 trailing return types, C++11/17 noexcept specifications, C++23 deducing this, and the C++ function qualifier trailer (const, volatile, &, &&).

function_declarator(decl_info, context_flags):
    debug_trace(3, "function_declarator")
    
    # Parse parameter list
    expect_token('(')
    param_list = parse_parameter_list()
    expect_token(')')
    
    # C++ member function qualifiers
    cv_quals = 0
    while is_cv_qualifier(current_token):
        cv_quals |= cv_bit(current_token)
        advance_token()
    
    # Ref-qualifier (& or &&)
    ref_qual = NONE
    if current_token == '&':
        ref_qual = LVALUE_REF
        advance_token()
    elif current_token == '&&':
        ref_qual = RVALUE_REF
        advance_token()
    
    # Exception specification
    except_spec = NONE
    if current_token == TOKEN_THROW:
        except_spec = parse_throw_spec()
    elif current_token == TOKEN_NOEXCEPT:
        except_spec = parse_noexcept_spec()
    
    # C++11 trailing return type
    trailing_return = NULL
    if current_token == TOKEN_ARROW:   # ->
        advance_token()
        trailing_return = parse_type()
    
    # C++20 trailing requires clause
    requires_clause = NULL
    if current_token == TOKEN_REQUIRES:
        requires_clause = scan_trailing_requires_clause()
    
    # C++23 deducing this
    if has_explicit_this_parameter(param_list):
        mark_deducing_this()
    
    # Build function type node
    func_type = add_to_derived_type_list(
        FUNCTION_TYPE,
        param_list, cv_quals, ref_qual,
        except_spec, trailing_return, requires_clause
    )
    
    return func_type
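The fixed order of the trailer (cv-qualifiers, ref-qualifier, exception specification, trailing return type) is easy to see in a toy parser. The sketch below works on a pre-split token list and is only an illustration: `parse_function_trailer` is a hypothetical name, `noexcept` is handled without an operand, and the trailing return type is taken to be every token up to `requires` or the end of input.

```python
def parse_function_trailer(tokens):
    """Consume the tokens that may follow ')' in a function declarator.

    Returns (parsed fields, remaining tokens).
    """
    pos = 0
    out = {"cv": set(), "ref": None, "noexcept": False, "trailing_return": None}

    # cv-qualifiers may repeat in any order
    while pos < len(tokens) and tokens[pos] in ("const", "volatile"):
        out["cv"].add(tokens[pos])
        pos += 1

    # at most one ref-qualifier
    if pos < len(tokens) and tokens[pos] in ("&", "&&"):
        out["ref"] = tokens[pos]
        pos += 1

    # simplified exception specification: bare noexcept only
    if pos < len(tokens) and tokens[pos] == "noexcept":
        out["noexcept"] = True
        pos += 1

    # trailing return type: everything up to 'requires' (simplification)
    if pos < len(tokens) and tokens[pos] == "->":
        pos += 1
        start = pos
        while pos < len(tokens) and tokens[pos] != "requires":
            pos += 1
        out["trailing_return"] = " ".join(tokens[start:pos])

    return out, tokens[pos:]

parsed, rest = parse_function_trailer(["const", "&&", "noexcept", "->", "int"])
print(parsed, rest)
```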

Derived Type Construction

add_to_derived_type_list (sub_4B4CF0, 600 lines) is the type-chain builder. Each declarator modifier (pointer, reference, array, function) appends a new node to a linked list. After parsing completes, form_declared_type (sub_4B4870) walks this chain bottom-up, applying each modifier to the base type to produce the final declared type.

For a declaration like const int *(*fp)(double):

Base type: const int
Derived chain (from the identifier outward): [pointer] → [function(double)] → [pointer]
Unwound: pointer to function(double) returning pointer to const int
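The unwinding step can be checked with a small model. Under standard C++ declarator rules, `const int *(*fp)(double)` declares a pointer to a function taking `double` and returning a pointer to `const int`. The sketch below assumes the chain is stored from the identifier outward and applies it to the base type starting from the outermost modifier; the function name is reused from the text for readability, but the tuple representation is hypothetical.

```python
def form_declared_type(base, chain):
    """Apply a derived-type chain (ordered from the identifier outward) to a base type."""
    desc = base
    # The outermost modifier binds to the base type first, so walk in reverse.
    for kind, arg in reversed(chain):
        if kind == "pointer":
            desc = f"pointer to {desc}"
        elif kind == "function":
            desc = f"function({arg}) returning {desc}"
        elif kind == "array":
            desc = f"array[{arg}] of {desc}"
        else:
            raise ValueError(f"unknown modifier: {kind}")
    return desc

# const int *(*fp)(double)
chain = [("pointer", None), ("function", "double"), ("pointer", None)]
result = form_declared_type("const int", chain)
print(result)  # pointer to function(double) returning pointer to const int
```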

Stage 4: Declaration Processing (decls.c)

decl_variable (sub_4CA6C0, 1,090 lines)

Processes variable declarations after specifiers and declarator have been parsed. This is where CUDA memory space qualifiers are applied and the variable entity is inserted into the symbol table.

CUDA Memory Space Bits

Variable entries carry a CUDA memory space bitmask at offset +148:

| Bit | Mask | Memory Space | Meaning |
|---|---|---|---|
| 0 | 0x01 | __constant__ | Device-side constant memory |
| 1 | 0x02 | __shared__ | Block-shared memory (per-SM) |
| 2 | 0x04 | __managed__ | Unified memory (host + device accessible) |
| 4 | 0x10 | __device__ | Device global memory |

These bits are set from the declaration state object (parameter a2), which carries the parsed CUDA attribute at offset +240:

decl_variable(decl_specs, decl_state, storage_class, out_entity, out_flags):
    debug_trace(3, "decl_variable")
    assert(decl_state != NULL)              # decls.c:7730
    
    # Look up existing variable in scope
    existing = lookup_variable_in_scope(    # sub_4C84B0
        scope, name, type_info
    )
    
    # Create new variable entity
    var_entity = create_variable_entry(     # sub_5C9840
        name, type, storage_class
    )
    
    # Apply CUDA memory space from declaration state
    if dword_126EFA8:                       # CUDA mode enabled
        cuda_attr_ptr = decl_state[+240]
        if cuda_attr_ptr != NULL:
            # Extract memory space from attribute
            space = extract_memory_space(cuda_attr_ptr)
            var_entity[+148] = space        # set memory space bits
            
            # Scope walk: determine if variable is at namespace scope
            # or inside a function (affects valid memory space combinations)
            scope_idx = dword_126C5E4       # current scope index
            scope_base = qword_126C5E8      # scope table base
            while scope_idx > 0:
                scope_entry = scope_base + 784 * scope_idx
                scope_kind = scope_entry[+4]
                if scope_kind == 4:          # class scope — walk up
                    scope_idx = scope_entry[+256]  # parent scope
                    continue
                break
            
            # Template scope check
            if scope_entry[+9] & 0x20:       # is_template_scope
                handle_template_variable()
    
    # Check redeclaration compatibility
    if existing != NULL:
        old_space = existing[+148]
        new_space = var_entity[+148]
        if old_space != new_space:
            # Determine which string to use for error message
            if new_space & 0x04:
                space_name = "__managed__"
            elif new_space & 0x01:
                space_name = "__constant__"
            elif new_space & 0x02:
                space_name = "__shared__"
            elif new_space & 0x10:
                space_name = "__device__"
            emit_error(1306)      # CUDA memory space mismatch on redeclaration
    
    # Anonymous type check
    if type_is_anonymous(var_entity):
        emit_error(891)           # anonymous type in variable declaration
    
    # Apply remaining attributes
    set_variable_attributes(var_entity)     # sub_4C4750
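The bitmask at +148 and the name-selection order used in the redeclaration check can be expressed compactly. This is a sketch, assuming the mask values from the table above and the priority order from the pseudocode (managed, constant, shared, device); `MemSpace` and `space_name` are hypothetical names.

```python
from enum import IntFlag

class MemSpace(IntFlag):
    CONSTANT = 0x01
    SHARED = 0x02
    MANAGED = 0x04
    DEVICE = 0x10

# Order in which the redeclaration check tests the bits.
_PRIORITY = (
    (MemSpace.MANAGED, "__managed__"),
    (MemSpace.CONSTANT, "__constant__"),
    (MemSpace.SHARED, "__shared__"),
    (MemSpace.DEVICE, "__device__"),
)

def space_name(bits):
    """Return the first matching qualifier name, or None if no bit is set."""
    for flag, name in _PRIORITY:
        if bits & flag:
            return name
    return None

def is_space_mismatch(old_bits, new_bits):
    """Error 1306 condition: the two declarations disagree on the mask."""
    return old_bits != new_bits

print(space_name(0x14), space_name(0x10), is_space_mismatch(0x10, 0x02))
```

Because the chain tests `__managed__` first, a variable carrying both the managed and device bits is reported as `__managed__`.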

variable_declaration (sub_4DEC90, 1,098 lines) -- Top-Level Entry

This is the outermost entry point for processing a variable declaration. It wraps decl_variable with CUDA-specific validation, constexpr/constinit checks, and static data member definition handling.

CUDA-Specific Error Emission

The function contains a dense block of CUDA error checks for variable declarations:

variable_declaration(decl_info, ...):
    # Early CUDA checks
    check_constexpr_variable_init(decl_info)    # sub_4DAC80
    
    # CUDA memory space string selection for error messages
    mem_space_bits = entity[+148]
    byte_149 = entity[+149]
    
    if mem_space_bits & 0x04:     # __managed__
        # No __managed__-specific string needed here
        pass
    
    # Build human-readable attribute name for diagnostics
    if byte_149 & 1:
        space_str = "__constant__"
    elif (mem_space_bits & 4) == 0:
        space_str = "__managed__"
        if (byte_149 & 1) == 0:
            space_str = "__device__"
            if mem_space_bits & 2:
                space_str = "__shared__"
    
    # CUDA variable constraint errors
    if is_shared_variable:
        if is_variable_length_array:
            emit_error(3510)      # __shared__ variable with VLA
    
    if is_constant_variable:
        if is_constexpr:
            emit_error(3568)      # __constant__ combined with constexpr
        if is_volatile:
            emit_error(3566)      # __constant__ combined with volatile
        if is_vla:
            emit_error(3567)      # __constant__ with VLA
    
    if has_cuda_attribute:
        if in_constexpr_if_discarded_branch:
            emit_error(3578)      # CUDA attribute in discarded branch
        if at_namespace_scope and is_structured_binding:
            emit_error(3579)      # CUDA attribute on structured binding
        if is_variable_length_array:
            emit_error(3580)      # CUDA attribute on VLA
    
    # Dispatch to decl_variable or define_static_data_member
    if is_static_member_definition:
        define_static_data_member(...)
    else:
        decl_variable(decl_specs, decl_state, storage_class, ...)
    
    # Post-declaration CUDA fixup
    cuda_variable_fixup(entity)     # sub_4CC150
    mark_defined_variable(entity)   # sub_4DC200

Complete CUDA Variable Error Table

| Error | Condition | Message Summary |
|---|---|---|
| 149 | Illegal CUDA storage class at namespace scope | Storage class not allowed here |
| 891 | Anonymous type in variable declaration | Anonymous type cannot be used |
| 892 | auto-typed CUDA variable (variant) | auto not allowed with CUDA qualifier |
| 893 | auto-typed CUDA variable | auto not allowed with CUDA qualifier |
| 1306 | Memory space mismatch on redeclaration | Conflicting CUDA memory space |
| 3483 | (CUDA variable context error) | CUDA attribute context mismatch |
| 3510 | __shared__ variable with VLA | Variable-length arrays not allowed in __shared__ |
| 3566 | __constant__ with volatile | volatile incompatible with __constant__ |
| 3567 | __constant__ with VLA | Variable-length arrays not allowed in __constant__ |
| 3568 | __constant__ with constexpr | constexpr incompatible with __constant__ |
| 3578 | CUDA attribute in constexpr if discarded branch | CUDA attribute in dead code |
| 3579 | CUDA attribute on structured binding at namespace scope | Structured binding cannot have CUDA attribute |
| 3580 | CUDA attribute on VLA | Variable-length arrays not allowed with CUDA attribute |
| 3648 | __constant__ with external linkage | External __constant__ not allowed |
| 1655 | Tentative definition of constexpr variable | Missing initializer |
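The constraint block in variable_declaration amounts to a table of independent checks. The sketch below collects the error numbers that would fire for a given combination of properties. It assumes the conditions and ordering shown in the pseudocode; the function name and keyword arguments are hypothetical, and only the nine-error subset from that pseudocode is modelled.

```python
def check_cuda_variable(*, is_shared=False, is_constant=False, has_cuda_attr=False,
                        is_vla=False, is_constexpr=False, is_volatile=False,
                        in_discarded_branch=False, namespace_structured_binding=False):
    """Return the CUDA variable error numbers triggered, in emission order."""
    errors = []
    if is_shared and is_vla:
        errors.append(3510)      # __shared__ variable with VLA
    if is_constant:
        if is_constexpr:
            errors.append(3568)  # __constant__ combined with constexpr
        if is_volatile:
            errors.append(3566)  # __constant__ combined with volatile
        if is_vla:
            errors.append(3567)  # __constant__ with VLA
    if has_cuda_attr:
        if in_discarded_branch:
            errors.append(3578)  # CUDA attribute in discarded branch
        if namespace_structured_binding:
            errors.append(3579)  # CUDA attribute on structured binding
        if is_vla:
            errors.append(3580)  # CUDA attribute on VLA
    return errors

errs = check_cuda_variable(is_constant=True, has_cuda_attr=True,
                           is_volatile=True, is_vla=True)
print(errs)
```

The checks are not mutually exclusive, so a single declaration can produce several diagnostics.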

decl_routine (sub_4CE420, 2,858 lines)

The largest function in the declaration processing stage. It handles function and method declarations, integrating CUDA calling convention validation, attribute consistency checking, and template interaction.

Parameters

| Parameter | Offset | Description |
|---|---|---|
| a1 | -- | decl_specifiers accumulator (__m128i*) |
| a2 | -- | Declaration state object |
| a3 | -- | Function info (offset +64 = flags, +80 = prior type) |
| a4 | -- | SRK flags bitmask |
| a5--a8 | -- | Output pointers and context |

SRK Flag Bits

The a4 parameter carries "scan result kind" flags that describe what was parsed:

| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x01 | SRK_DECLARATION -- forward declaration |
| 1 | 0x02 | SRK_DEFINITION -- has function body |
| 7 | 0x80 | SRK_IMPLICIT -- compiler-generated |
| 8 | 0x100 | SRK_CONSTEXPR -- constexpr function |

Function Entity Layout

After processing, a function entity contains:

| Offset | Size | Field | Description |
|---|---|---|---|
| +80 | 1 | entity_kind | 10 = function, 11 = member function |
| +88 | 8 | descriptor | Pointer to function descriptor |
| +144 | 8 | type | Function type pointer |
| +164 | 1 | defined_flag | Set when definition is seen |
| +166 | 1 | function_kind | 1=ctor, 2=dtor, 5=conversion, 7=deduction guide |
| +168 | 8 | template_info | Template instantiation info |
| +177 | 1 | attribute_flags | bit 1=constexpr, bit 2=consteval |
| +188 | 1 | cuda_flags_1 | CUDA calling convention |
| +189 | 1 | cuda_flags_2 | CUDA execution space |
| +192 | 8 | parameter_list | Head of parameter linked list |

Pseudocode

decl_routine(decl_specs, decl_state, func_info, srk_flags, ...):
    debug_trace(3, "decl_routine")
    
    # Assertions
    assert func_info != NULL                    # decls.c:10057
    assert storage_class is valid               # decls.c:10059
    assert srk_flags & SRK_DECLARATION          # decls.c:10061
    assert func_type is routine type            # decls.c:10063
    if srk_flags & SRK_DEFINITION:
        assert body follows                     # decls.c:10068
    if srk_flags & SRK_IMPLICIT:
        assert compiler-generated context       # decls.c:10149
    
    # CUDA calling convention check
    if dword_126EFB4 == 2:                      # CUDA C++ mode
        check_cuda_calling_convention(          # sub_4C6AB0
            func_type, decl_specs
        )
        check_cuda_attribute_consistency(       # sub_4C6D50
            decl_state
        )
    
    # Look up existing declaration
    existing = find_linked_symbol(name, scope)
    
    if existing != NULL:
        # Redeclaration checks
        if existing.calling_convention != new_calling_convention:
            emit_error(948)         # calling convention mismatch
        
        if has_cuda_attribute(existing) and has_cuda_attribute(new):
            if not compatible_cuda_attributes(existing, new):
                emit_error(1430)    # function attribute mismatch
    
    # CUDA-specific restrictions
    if has_global_attribute:
        if return_type is auto:
            emit_error(1158)        # auto return type with __global__
    
    if is_deduction_guide:
        if has_any_cuda_attribute:
            emit_error(2885)        # CUDA attribute on deduction guide
    
    if is_explicit_instantiation:
        if conflicting_template_attributes:
            emit_error(1034)        # explicit instantiation conflict
    
    # Process CUDA attributes on the function
    process_cuda_attributes(decl_state)         # sub_42A250
    remove_cuda_trailing_return(decl_state)     # sub_42A210
    
    # Canonicalize trailing return type in CUDA mode
    if dword_126EFB4 == 2:
        canonicalize_return_type(func_type)      # sub_5DBCB0
    
    # Symbol table insertion
    entity = create_function_entity(name, func_type, storage_class)
    
    # Set defined flag
    assert entity.defined_flag is correct       # decls.c:10417
    
    # OpenMP variant handling (if active)
    if dword_106B4B8:                           # omp_declare_variant_active
        create_omp_variant_name("$$OMP_VARIANT%06d", variant_id)

CUDA Attribute Integration

Attribute Category System

EDG classifies attributes using a category byte at offset +9 in the attribute node:

| Category | Value | Meaning | Examples |
|---|---|---|---|
| Type | 1 | Applies to the type | alignas, __aligned__ |
| Declaration | 2 | Applies to the declaration | [[nodiscard]], [[deprecated]] |
| Statement | 3 | Applies to a statement | [[fallthrough]], [[likely]] |
| Execution space | 4 | CUDA execution space | __device__, __host__, __global__ |

Category 4 is NVIDIA's addition to EDG's attribute system. Standard EDG uses categories 1-3. CUDA execution space attributes are recognized by the lexer as identifiers, classified as CUDA keywords by get_token_main (sub_6810F0) when dword_106C2C0 (GPU mode) is active, and converted to attribute nodes with category 4 during attribute parsing.

Attribute Node Layout

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Next attribute in linked list |
| +8 | 1 | kind | Attribute kind (0 when cleared/consumed) |
| +9 | 1 | category | 1=type, 2=decl, 3=stmt, 4=exec-space |
| +10 | 1 | placement | Where in declaration it appeared (10=after type, 12=before declarator) |
| +11 | 1 | flags | bit 0 = consumed, bit 4 = CUDA/vendor |
| +16 | 8 | payload | Attribute-specific data |
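When inspecting a memory dump, the layout above can be decoded with Python's `struct` module. This is a sketch under stated assumptions: the node is treated as exactly 24 bytes, bytes +12 to +15 are assumed to be padding because the table does not describe them, and byte order is little-endian as on x86-64. `parse_attr_node` and the sample values are hypothetical.

```python
import struct

# next(8) kind(1) category(1) placement(1) flags(1) pad(4) payload(8)
ATTR_NODE = struct.Struct("<QBBBB4xQ")

CATEGORY_NAMES = {1: "type", 2: "decl", 3: "stmt", 4: "exec-space"}

def parse_attr_node(buf, offset=0):
    """Decode one attribute node from a bytes-like object."""
    nxt, kind, category, placement, flags, payload = ATTR_NODE.unpack_from(buf, offset)
    return {
        "next": nxt,
        "kind": kind,
        "category": CATEGORY_NAMES.get(category, f"unknown({category})"),
        "placement": placement,
        "consumed": bool(flags & 0x01),
        "vendor": bool(flags & 0x10),
        "payload": payload,
    }

# Synthetic node: kind 7 is an arbitrary illustrative value.
raw = ATTR_NODE.pack(0, 7, 4, 12, 0x11, 0xDEADBEEF)
node = parse_attr_node(raw)
print(node)
```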

Execution Space Propagation

When a CUDA execution space attribute is parsed, it flows through three processing points:

  1. decl_specifiers (sub_4ACF80): CUDA attributes are recognized as token 142 (attribute) and parsed into the attribute list. The attribute parser sets category 4 for execution space attributes.

  2. scan_declarator_attributes (sub_4B3970): Separates category-4 attributes from the primary attribute list and moves them to the secondary list at offset +184 of the declaration info structure.

  3. decl_routine / decl_variable: Reads execution space from the secondary attribute list and applies it to the function/variable entity. For functions, the execution space goes to offsets +188/+189 of the entity. For variables, the memory space goes to offset +148.

warn_on_cuda_execution_space_attributes (sub_4A8990)

A safety valve that catches execution space attributes in places where they should not appear (e.g., on type definitions that are not function or variable declarations):

warn_on_cuda_execution_space_attributes(attr_list):
    warned = false
    for each attr in attr_list:
        category = attr[+9]
        if category == 1 or category == 4:     # type or exec-space
            if not warned:
                emit_error(1882)               # invalid exec space attr
                warned = true
            attr[+8] = 0                       # clear kind (suppress further processing)

Scope Chain and Context Tracking

The declaration parser relies heavily on the scope chain stored in the global scope table. Every declaration must be inserted at the correct scope, and many validation checks depend on whether the current scope is namespace-scope, class-scope, block-scope, or template-scope.

Scope Entry Layout (784 bytes)

| Offset | Size | Field | Description |
|---|---|---|---|
| +4 | 1 | scope_kind | 2=namespace, 4=class, 6=function, 8=nested block, 10=block, 12=template, 15/17=special |
| +6 | 1 | flags_1 | bit 1=extern, bit 2=inline namespace, bit 7=pending class flag |
| +7 | 1 | flags_2 | bit 1=has using directives |
| +9 | 1 | template_flags | bit 5=is template scope, bit 1-3=template kind |
| +12 | 4 | scope_flags | bit 2-3=scope modifier |
| +182 | 1 | cuda_flags | bit 5 (0x20)=CUDA device-side scope |
| +192 | 8 | first_entity | Head of entity linked list |
| +216 | 8 | type_pointer | Associated type (for class scopes) |
| +224 | 8 | namespace_ptr | Associated namespace |
| +256 | 4 | parent_scope | Index of parent scope in table |
| +368 | 8 | source_begin | Source position where scope begins |
| +376 | 8 | associated_entity | Entity that opened this scope |
| +408 | 4 | parent_scope_idx | Alternate parent scope index |

Scope Table Globals

| Address | Name | Description |
|---|---|---|
| qword_126C5E8 | scope_table_base | Array of 784-byte scope entries |
| dword_126C5E4 | current_scope_index | Index into scope table |
| dword_126C5DC | current_scope_id | Current scope identifier |
| dword_126C5B4 | namespace_scope_id | Nearest enclosing namespace scope |
| dword_126C5BC | class_scope_depth | Nesting depth of class scopes |
| dword_126C5C4 | lambda_scope_id | Current lambda scope (-1 if none) |
| dword_126C5C8 | template_scope_id | Current template scope (-1 if none) |

Scope Walk for CUDA Memory Space

When processing a CUDA variable declaration, the parser walks up the scope chain to determine if the variable is at namespace scope (where __device__/__constant__/__managed__ are valid) or inside a function body (where __shared__ is additionally valid):

determine_cuda_variable_scope(var_entity):
    scope_idx = dword_126C5E4
    scope_base = qword_126C5E8
    
    while scope_idx > 0:
        entry = scope_base + 784 * scope_idx
        kind = entry[+4]
        
        if kind == 4:                  # class scope
            # Walk through class scopes to find enclosing namespace/function
            scope_idx = entry[+256]    # parent scope
            continue
        
        if kind == 2:                  # namespace scope
            # Variable is at namespace scope
            # Valid spaces: __device__, __constant__, __managed__
            return NAMESPACE_SCOPE
        
        if kind == 6 or kind == 10:    # function or block scope
            # Variable is inside a function body
            # Valid spaces: __shared__, __device__, __constant__, __managed__
            return FUNCTION_SCOPE
        
        scope_idx = entry[+256]
    
    return FILE_SCOPE
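The walk can be reproduced against a synthetic scope table to check the index arithmetic. The sketch below builds a byte array of 784-byte entries, stores scope_kind at +4 and a 4-byte little-endian parent index at +256, and then classifies a starting index. The table contents, the helper names, and the returned strings are hypothetical; the stride, offsets, and kind values come from the layout described above.

```python
import struct

ENTRY_SIZE = 784
KIND_OFF = 4        # scope_kind, 1 byte
PARENT_OFF = 256    # parent_scope, 4 bytes

def make_table(entries):
    """Build a synthetic scope table from (kind, parent_index) pairs."""
    table = bytearray(ENTRY_SIZE * len(entries))
    for i, (kind, parent) in enumerate(entries):
        base = i * ENTRY_SIZE
        table[base + KIND_OFF] = kind
        struct.pack_into("<i", table, base + PARENT_OFF, parent)
    return table

def classify_scope(table, scope_idx):
    """Walk parent links until a namespace or function/block scope is found."""
    while scope_idx > 0:
        base = ENTRY_SIZE * scope_idx
        kind = table[base + KIND_OFF]
        if kind == 2:
            return "namespace"
        if kind in (6, 10):
            return "function"
        # class scopes (4) and anything else: keep climbing
        (scope_idx,) = struct.unpack_from("<i", table, base + PARENT_OFF)
    return "file"

# index: 0=file, 1=namespace, 2=class in namespace, 3=function, 4=local class
table = make_table([(0, 0), (2, 0), (4, 1), (6, 2), (4, 3)])
print(classify_scope(table, 2), classify_scope(table, 4), classify_scope(table, 0))
```

Index 0 is never inspected, which matches the `scope_idx > 0` loop condition in the pseudocode.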

Linkage Determination

id_linkage (sub_4C3380, 310 lines)

Determines whether an identifier has internal, external, or no linkage. This is called during decl_variable and decl_routine to set the linkage byte on the entity.

id_linkage(entity, storage_class, scope):
    debug_trace(3, "id_linkage")
    
    kind = entity[+80]       # entity kind
    linkage = UNSET          # pseudocode marker: no rule matched yet
    
    # C++ linkage rules
    if dword_126EFB4 == 2:    # C++ mode
        if storage_class == STATIC:
            linkage = INTERNAL    # 0x10
        elif storage_class == EXTERN:
            linkage = EXTERNAL    # 0x20
        elif scope_kind == NAMESPACE and kind == FUNCTION:
            linkage = EXTERNAL
        elif scope_kind == NAMESPACE and kind == VARIABLE:
            if is_const_qualified and not explicitly_extern:
                linkage = INTERNAL
            else:
                linkage = EXTERNAL
        elif scope_kind == BLOCK:
            linkage = NONE        # 0x00
    
    # C linkage rules (simpler); also the fallback when no C++ rule matched
    if linkage == UNSET:
        if storage_class == STATIC:
            linkage = INTERNAL
        elif scope_kind == FILE:
            linkage = EXTERNAL
        else:
            linkage = NONE
    
    # Debug output, emitted before returning
    debug_print(linkage_string)   # "internal" / "external" / "none"
    return linkage

find_linked_symbol (sub_4C1CC0, 608 lines)

The redeclaration detection engine. When a new declaration is processed, this function searches the current and enclosing scopes for a previously-declared symbol with the same name and compatible linkage:

find_linked_symbol(name, scope, entity_kind):
    debug_trace(3, "find_linked_symbol")
    
    # Look up in symbol table
    existing = symbol_lookup(name, scope)    # sub_698940
    
    if existing == NULL:
        return NULL
    
    # For functions: handle overload sets
    if entity_kind == FUNCTION:
        # Walk overload set checking for compatible signature
        for each overload in existing.overload_set:
            if types_match(overload.type, new_type):
                return overload
        return NULL    # new overload, not redeclaration
    
    # For variables: check linkage compatibility
    if entity_kind == VARIABLE:
        if existing.linkage == new_linkage:
            return existing
        # Special case: extern at block scope refers to
        # namespace-scope variable with same name
        if new_storage_class == EXTERN and scope_kind == BLOCK:
            return walk_to_namespace_scope_and_search(name)
    
    return NULL
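The function branch is the part that separates a redeclaration from a new overload. A minimal sketch of that decision, assuming a symbol table modelled as a dictionary from name to a list of overload records and a signature modelled as a tuple of parameter type names; all of these representations and the name `find_linked_function` are hypothetical.

```python
def find_linked_function(symbols, name, new_signature):
    """Return the matching prior declaration, or None for a new overload."""
    for overload in symbols.get(name, []):
        if overload["signature"] == new_signature:
            return overload      # same name and signature: redeclaration
    return None                  # unknown name or new signature: new overload

symbols = {
    "f": [
        {"signature": ("int",), "decl_line": 3},
        {"signature": ("double",), "decl_line": 4},
    ]
}

redecl = find_linked_function(symbols, "f", ("double",))
fresh = find_linked_function(symbols, "f", ("char",))
print(redecl, fresh)
```

The real types_match comparison is more involved than tuple equality, so this only illustrates the control flow.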

Constructor and Destructor Initialization (decl_inits.c)

ctor_inits_for_inheriting_ctor (sub_4A0310, 746 lines)

Builds the initialization sequence for inheriting constructors (C++11 using Base::Base;). The function iterates virtual base member lists to find matching base constructors and constructs the initialization order:

ctor_inits_for_inheriting_ctor(decl_info):
    class_type = decl_info[+40][+32]    # enclosing class type
    member_list = class_type[+152]       # member list
    
    # Iterate virtual bases
    for each member in member_list:
        if member[+80] == 8:             # base class member kind
            base_type = resolve_base_type(member)
            base_ctor = find_base_constructor(base_type)
            
            if decl_info[+178] & 0x40:   # inheriting-ctor redirection
                # Walk class hierarchy via offset+216 link
                while has_redirect(current):
                    current = current[+216]
                base_ctor = find_redirect_target(current)
            
            # Check accessibility
            check_base_ctor_accessibility(base_ctor)   # sub_48B3F0
            
            # Build init entry
            init_entry = allocate_init_entry()          # sub_6BA0D0
            init_entry.target = base_ctor
            append_to_init_list(init_entry)

dtor_initializer (sub_4A0EC0, 339 lines)

Builds the destructor initialization (destruction) list for a class. The destruction order is the reverse of construction order -- members are destroyed in reverse declaration order, then base classes in reverse order:

dtor_initializer(decl_info):
    debug_trace(3, "dtor_initializer")       # decl_inits.c:10153
    
    class_type = decl_info[5][+32]
    member_list = class_type[+152]
    
    # Check for delegating constructor
    if decl_info[22] & 2:
        return    # delegating ctor, no separate dtor init needed
    
    # Pass 1: members with flag (offset[10] & 2)
    for each member in member_list:
        if member[10] & 2:
            if class_type[+132] != 11:       # not union
                dtor = resolve_member_destructor(member)
                entry = allocate_init_entry()
                entry.destructor = dtor
    
    # Pass 2: members with (offset[10] & 3) == 1
    for each member in member_list:
        if (member[10] & 3) == 1:
            dtor = resolve_member_destructor(member)
            entry = allocate_init_entry()
            entry.destructor = dtor
    
    # Base class destructors (reverse order)
    base_list = class_type[+96]
    for each base in reverse(base_list):
        dtor = resolve_base_destructor(base)   # sub_737270
        entry = allocate_init_entry()
        entry.destructor = dtor

check_for_missing_initializer_full (sub_4A1540, 248 lines)

Checks whether a variable declaration is missing a required initializer:

check_for_missing_initializer_full(entity, type, unused, deferred_error):
    kind = entity[+80]       # 7=variable, 9=static member
    
    # VLA check
    if is_variable_length_array(type):
        emit_error(252)       # VLA cannot have initializer
    
    # const check (C++ mode)
    if dword_126EFB4 == 2:    # C++ mode
        if is_const_qualified(type) and not has_initializer(entity):
            if not is_extern(entity):
                emit_error(257)   # const object requires initializer
    
    # Abstract class check
    if type[+160] & 2:        # abstract class flag
        if (type[+132] & 0xFB) == 8:    # array of abstract
            emit_error(812)   # array of abstract class
        else:
            emit_error(516)   # abstract class cannot be instantiated
    
    # constexpr check
    if entity has constexpr flag:
        if not has_initializer(entity):
            emit_error(517)   # constexpr variable requires initializer

CUDA Mode Control Globals

The declaration parser is gated on several CUDA mode flags that control which code paths are active:

| Address | Name | Type | Description |
|---|---|---|---|
| dword_126EFA8 | is_cuda_compilation | bool | Master CUDA mode flag |
| dword_126EFB4 | cuda_dialect | int | 0=none, 1=C, 2=C++ |
| dword_126EFAC | extended_cuda_features | bool | Additional CUDA extensions enabled |
| dword_126EFA4 | cuda_host_compilation | bool | Compiling host-side code |
| dword_126EFB0 | cuda_relaxed_constexpr | bool | Allow constexpr on device functions |
| dword_106C17C | constexpr_cuda_enabled | bool | CUDA constexpr compatibility mode |
| qword_126EF98 | cuda_version_threshold_1 | int64 | Version gate (0x9E97 = 40599 = CUDA 12.x) |
| qword_126EF90 | cuda_version_threshold_2 | int64 | Version gate (0x78B3 = 30899 = CUDA 11.x) |
| dword_126EF68 | cpp_standard_version | int | C++ standard version (201103, 201402, ...) |
| dword_126EF64 | cpp_extensions_enabled | bool | Language extensions active |

CUDA Version Gating

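Several CUDA-specific code paths are guarded by version thresholds. Each test is a strict greater-than, so a threshold of N passes for N + 1 and above: 30900, 40600, and 70000 for the three gates shown here. The stored values look like packed version numbers; read as major * 10000 + minor * 100 + patch, the thresholds are 3.8.99, 4.5.99, and 6.99.99. The CUDA release labels in the comments (11.x, 12.x) are tentative and do not follow from that decoding: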

// CUDA 11.x and later: enable extended constexpr
if qword_126EF90 > 0x78B3:     // 30899 → CUDA version >= 11.x
    enable_extended_constexpr()

// CUDA 12.x and later: enable managed memory attributes
if qword_126EF98 > 0x9E97:     // 40599 → CUDA version >= 12.x
    enable_managed_attributes()

// Recent CUDA: enable namespace-scope CUDA variable checks
if qword_126EF98 > 0x1116F:    // 70000+ → very recent CUDA
    enable_strict_namespace_checks()
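The constants can be checked mechanically. The sketch below confirms the hex-to-decimal conversions, shows the first value that passes each strict `>` test, and splits a packed value into (major, minor, patch) for a chosen pair of units. Two unit choices are compared, (1000, 10) and (10000, 100); neither is confirmed from the binary here, and the helper names are hypothetical.

```python
THRESHOLDS = {
    "qword_126EF90 gate": 0x78B3,
    "qword_126EF98 gate": 0x9E97,
    "namespace-check gate": 0x1116F,
}

def first_passing(threshold):
    """Smallest integer for which `value > threshold` holds."""
    return threshold + 1

def split_version(value, major_unit, minor_unit):
    """Split a packed version into (major, minor, patch) for the given units."""
    major, rest = divmod(value, major_unit)
    minor, patch = divmod(rest, minor_unit)
    return major, minor, patch

for name, t in THRESHOLDS.items():
    print(name, t, first_passing(t),
          split_version(t, 1000, 10),      # units of 1000 / 10
          split_version(t, 10000, 100))    # units of 10000 / 100
```

With units of 10000 and 100, each threshold ends in .99, which is the usual shape of a "newer than release X.Y" sentinel.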

Function Map

decl_spec.c (0x4A1BF0--0x4B37F0)

| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_4A1BF0 | check_use_of_consteval | 104 | Validate consteval specifier |
| sub_4A1DF0 | check_explicit_specifier | 45 | Validate explicit specifier |
| sub_4A1EC0 | check_use_of_constinit | 77 | Validate constinit specifier |
| sub_4A2000 | check_use_of_thread_local | 111 | Validate thread_local specifier |
| sub_4A22B0 | check_use_of_constexpr | 153 | Validate constexpr specifier |
| sub_4A2580 | check_gnu_c_auto_type | 52 | Validate GNU __auto_type |
| sub_4A2630 | scan_edg_vector_type | 203 | Parse vector type syntax |
| sub_4A2B80 | is_function_declaration_ahead | 162 | Lookahead: function declaration? |
| sub_4A2E40 | process_auto_parameter | 153 | C++20 auto parameters |
| sub_4A31A0 | process_storage_class_specifier | 223 | Storage class validation |
| sub_4A3610 | check_for_class_modifiers | 139 | Detect final/__final |
| sub_4A38A0 | scan_tag_name | 1,216 | Parse class/enum name |
| sub_4A4FD0 | set_name_linkage_for_type | 41 | Set type linkage |
| sub_4A5140 | update_membership_of_class | 173 | Update class scope info |
| sub_4A5510 | attach_tag_attributes | 143 | Attach attributes to types |
| sub_4A57C0 | class_specifier | 2,179 | Parse class/struct/union definition |
| sub_4A8990 | warn_on_cuda_execution_space_attributes | 33 | CUDA exec space warning |
| sub_4A89F0 | scan_enumerator_list | 950 | Parse enum body |
| sub_4AA2F0 | enum_specifier | 1,437 | Parse enum specifier |
| sub_4AC550 | typename_specifier | 197 | Parse typename T::type |
| sub_4AC970 | is_constructor_decl | 225 | Detect constructor declaration |
| sub_4ACE00 | enclosing_class_type | 43 | Get enclosing class from scope |
| sub_4ACF80 | decl_specifiers | 4,761 | Central specifier dispatcher |
| sub_4B37F0 | decl_spec_one_time_init | 40 | Module initialization |

declarator.c (0x4B3920--0x4C00A0)

| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_4B3970 | scan_declarator_attributes | 297 | Separate CUDA exec-space attrs |
| sub_4B3E80 | scan_trailing_requires_clause | 136 | C++20 requires clause |
| sub_4B4230 | check_for_restrict_qualifier_on_derived_type | 124 | Restrict validation |
| sub_4B4870 | form_declared_type | 53 | Combine base type + derived chain |
| sub_4B4990 | report_bad_return_type_qualifier | 89 | cv-qual on return type |
| sub_4B4CF0 | add_to_derived_type_list | 600 | Build derived type chain |
| sub_4B5A70 | delayed_scan_of_exception_spec | 211 | Deferred exception spec |
| sub_4B6760 | array_declarator | 518 | Parse [expr] |
| sub_4B72A0 | pointer_declarator | 440 | Parse *, &, &&, ::* |
| sub_4B7BC0 | declarator | 284 | Top-level declarator entry |
| sub_4B8190 | function_declarator | 3,144 | Parse function signature |
| sub_4BC7F0 | scan_requires_expr_parameters | 61 | C++20 requires-expr params |
| sub_4BC950 | r_declarator | 2,578 | Recursive descent declarator |
| sub_4C00A0 | scan_lambda_declarator | 414 | Lambda declarator |

decls.c (0x4C0840--0x4F0000)

| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_4C0910 | incompatible_types_are_SVR4_compatible | 77 | SVR4 ABI compat check |
| sub_4C0B10 | set_default_calling_convention | 112 | Calling convention setup |
| sub_4C0CB0 | record_overload | 91 | Record function overload |
| sub_4C0E90 | set_linkage_for_class_members | 107 | Propagate class linkage |
| sub_4C10E0 | set_linkage_environment | 138 | Linkage environment setup |
| sub_4C15D0 | check_use_of_placeholder_type | 175 | Validate auto/decltype(auto) |
| sub_4C1CC0 | find_linked_symbol | 608 | Redeclaration detection |
| sub_4C3380 | id_linkage | 310 | Linkage determination |
| sub_4C3A80 | qualified_name_redecl_sym | 320 | Qualified redeclaration |
| sub_4CA6C0 | decl_variable | 1,090 | Variable declaration processing |
| sub_4CC150 | cuda_variable_fixup | 120 | CUDA post-decl variable fixup |
| sub_4CE420 | decl_routine | 2,858 | Function declaration processing |
| sub_4DAC80 | check_constexpr_variable_init | 60 | CUDA constexpr check |
| sub_4DB440 | process_asm_block | 200 | Inline assembly declaration |
| sub_4DC200 | mark_defined_variable | 26 | CUDA constexpr linkage |
| sub_4DD710 | check_trailing_return_type | 80 | Auto type deduction check |
| sub_4DEC90 | variable_declaration | 1,098 | Top-level variable entry |

disambig.c (0x4E9E70--0x4EC690)

| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_4E9E70 | prescan_gnu_attribute | 98 | Skip __attribute__ in prescan |
| sub_4EA560 | prescan_declaration | 400 | Top-level disambiguation |
| sub_4EB270 | prescan_declarator | 200 | Prescan declarator tokens |
| sub_4EC690 | find_for_loop_separator | 100 | Find ; in for-init |

decl_inits.c (0x4A0310--0x4A1BE0)

| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_4A0310 | ctor_inits_for_inheriting_ctor | 746 | Inheriting ctor init list |
| sub_4A0EC0 | dtor_initializer | 339 | Destructor init list |
| sub_4A1540 | check_for_missing_initializer_full | 248 | Missing init diagnostic |
| sub_4A1B60 | decl_inits_init | 11 | Module initialization |
| sub_4A1BB0 | decl_inits_reset | 9 | Module reset |

Cross-References

Overload Resolution

The overload resolution engine in cudafe++ is EDG 6.6's implementation of the C++ overload resolution algorithm (ISO C++ [over.match]). It lives in overload.c -- approximately 100 functions spanning address range 0x6BE4A0--0x6EF7A0 (roughly 200KB of compiled code). Overload resolution is one of the most complex subsystems in any C++ compiler because it sits at the intersection of nearly every other language feature: implicit conversions, user-defined conversions, template argument deduction, SFINAE, partial ordering, reference binding, list initialization, copy elision, and operator overloading each contribute decision branches to the algorithm. EDG implements the standard three-phase architecture -- candidate collection, viability checking, best-viable selection -- with NVIDIA-specific extensions for CUDA execution-space filtering.
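The three phases can be illustrated with a toy resolver. This is a sketch of the general best-viable rule (a candidate wins if it is no worse on every argument and strictly better on at least one), not a model of EDG's code: the type names, the reduced rank table, and the `rank` and `resolve` functions are hypothetical, and CUDA execution-space filtering, user-defined conversions, and templates are left out.

```python
EXACT, PROMOTION, CONVERSION = 0, 1, 2

PROMOTIONS = {("char", "int"), ("short", "int"), ("float", "double")}
ARITHMETIC = {"char", "short", "int", "long", "float", "double"}

def rank(arg, param):
    """Rank of the standard conversion from arg to param, or None if impossible."""
    if arg == param:
        return EXACT
    if (arg, param) in PROMOTIONS:
        return PROMOTION
    if arg in ARITHMETIC and param in ARITHMETIC:
        return CONVERSION
    return None

def better(r1, r2):
    """r1 beats r2: no worse on every argument, strictly better on one."""
    return all(a <= b for a, b in zip(r1, r2)) and any(a < b for a, b in zip(r1, r2))

def resolve(candidates, args):
    """candidates: list of (name, parameter types). Returns a name or None."""
    # Phase 1 (collection) is the candidate list itself.
    # Phase 2: viability, every argument needs a conversion.
    viable = []
    for name, params in candidates:
        if len(params) != len(args):
            continue
        ranks = [rank(a, p) for a, p in zip(args, params)]
        if None not in ranks:
            viable.append((name, ranks))
    # Phase 3: best viable, must beat every other viable candidate.
    best = [c for c in viable
            if all(c is o or better(c[1], o[1]) for o in viable)]
    return best[0][0] if len(best) == 1 else None   # None: no match or ambiguous

cands = [("f(int)", ("int",)), ("f(double)", ("double",))]
print(resolve(cands, ("float",)), resolve(cands, ("long",)))
```

A `long` argument is ambiguous here because both candidates need a conversion of the same rank, which matches what a C++ compiler reports for the same overload pair.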

Key Facts

| Property | Value |
|---|---|
| Source file | overload.c (~100 functions) |
| Address range | 0x6BE4A0--0x6EF7A0 |
| Total code size | ~200KB |
| Main selection entry | sub_6E6400 (select_overloaded_function, 1,483 lines, 20 parameters) |
| Operator dispatch | sub_6EF7A0 (select_overloaded_operator, 2,174 lines) |
| Viability checker | sub_6E2040 (determine_function_viability, 2,120 lines) |
| Candidate evaluator | sub_6C4C00 (candidate evaluation, 1,044 lines) |
| Main driver | sub_6CE6E0 (overload resolution driver, 1,246 lines) |
| Built-in candidates | sub_6CD010 (built-in operator candidates, 752 lines) |
| Candidate iterator | sub_6E4FA0 (try_overloaded_function_match, 633 lines) |
| Conversion scoring | sub_6BEE10 (standard_conversion_sequence, 375 lines) |
| ICS comparison | sub_6CBC40 (implicit conversion sequence comparison, 345 lines) |
| Qualification compare | sub_6BE6C0 (compare_qualification_conversions, 127 lines) |
| Copy constructor select | sub_6DBEA0 (select_overloaded_copy_constructor, 625 lines) |
| Default constructor select | sub_6E9080 (select_overloaded_default_constructor, 358 lines) |
| Assignment operator select | sub_6DD600 (select_overloaded_assignment_operator, 492 lines) |
| CTAD entry | sub_6E8300 (deduce_class_template_args, 285 lines) |
| List initializer | sub_6D7C80 (prep_list_initializer, 2,119 lines) |
| Overload set traversal | sub_6BA230 (iterate overload set) |
| Overload debug trace | dword_126EFC8 (enable), qword_106B988 (output stream) |
| CUDA extensions flag | byte_126E349 |
| Language mode | dword_126EFB4 (2 = C++) |
| Standard version | dword_126EF68 (201103 = C++11, 201703 = C++17, 202301 = C++23) |

Why Overload Resolution Is Hard

Overload resolution is not a simple "find the best match" operation. The C++ standard defines it as a partial ordering problem over implicit conversion sequences, where each sequence is itself a multi-step chain of type transformations. The key sources of complexity:

  1. Implicit conversion sequences (ICS). Each argument-to-parameter match produces an ICS consisting of up to three steps: a standard conversion (lvalue-to-rvalue, array-to-pointer, etc.), optionally a user-defined conversion (constructor or conversion function), then another standard conversion. Ranking two ICSs against each other requires comparing each step independently.

  2. User-defined conversions. When no standard conversion exists, the compiler must search for converting constructors on the target type AND conversion operators on the source type, then perform a nested overload resolution among those candidates. This creates recursive invocations of the overload engine.

  3. Template argument deduction. Function templates produce candidates only after deduction succeeds. Deduction may fail (SFINAE), producing no candidate. Successfully deduced candidates participate in a separate tie-breaking rule: non-template functions are preferred over template specializations, and "more specialized" templates are preferred over "less specialized" ones ([over.match.best] p2.5).

  4. Partial ordering. When comparing two function templates that are both viable, the compiler must determine which is "more specialized" by attempting deduction in both directions (templates.c handles this). The result feeds back into overload ranking.

  5. Operator overloading. Built-in operators (like + on int) compete with user-defined operator+. The compiler synthesizes "built-in candidate functions" representing every valid built-in operator signature, adds them to the candidate set alongside user-defined operators, and runs the same best-viable algorithm on the combined set.

  6. Special contexts. Copy-initialization vs. direct-initialization, list-initialization, reference binding, and conditional-operator type determination each have their own overload resolution sub-procedures with modified candidate sets and ranking rules.

Architecture: Three-Phase Pipeline

                          PHASE 1                    PHASE 2                   PHASE 3
                     Candidate Collection        Viability Check         Best-Viable Selection
                    ┌───────────────────┐     ┌──────────────────┐     ┌──────────────────────┐
                    │                   │     │                  │     │                      │
  f(args...) ──────►│ Name lookup       │────►│ For each cand:   │────►│ Pairwise comparison  │
                    │ ADL (arg-dep.)    │     │  - param count   │     │ of viable candidates │
                    │ Using-declarations│     │  - conversions   │     │ via ICS ranking      │──► winner
                    │ Template deduction│     │  - constraints   │     │                      │
                    │ Built-in synth    │     │  - access check  │     │ Tie-breakers:        │──► or ambiguity
                    │                   │     │                  │     │  - non-template pref │
                    └───────────────────┘     └──────────────────┘     │  - partial ordering  │──► or no match
                                                                      │  - cv-qual ranking   │
                                                                      └──────────────────────┘

Phase 1: Candidate Collection

Candidates are collected into an overload set -- a linked list of entries allocated via sub_6BA0D0 and iterated via sub_6BA230. The overload set is built by the caller before invoking select_overloaded_function. Sources of candidates include:

  • Name lookup results. All declarations visible by name at the call site, including base class members and using-declarations.
  • Argument-dependent lookup (ADL). Additional functions found by searching the associated namespaces of the argument types (Koenig lookup). These are added to the set by the name lookup machinery before overload resolution begins.
  • Template specializations. For each function template in the name lookup result, template argument deduction is attempted. If deduction succeeds, the resulting specialization is added as a candidate. If deduction fails, the template is silently dropped (SFINAE).
  • Built-in operator candidates. For operator expressions, sub_6CD010 synthesizes candidate functions representing every valid built-in operator signature for the given operand types. These synthetic candidates use single-character type classification codes to match operand patterns.

Phase 2: Viability Checking

determine_function_viability (sub_6E2040, 2,120 lines) is the core viability checker. For each candidate function, it determines whether all arguments can be implicitly converted to the corresponding parameter types.

determine_function_viability (sub_6E2040, 2120 lines)
    Input:  candidate function F, argument list A[0..n-1]
    Output: viability flag, per-argument conversion summaries

    // Guard: SFINAE context handling
    if (in_sfinae_context)
        push_diagnostic_suppression()

    // PASS 1: Basic eligibility
    if (F is deleted)
        return NOT_VIABLE
    if (F is template && deduction_failed)
        return NOT_VIABLE
    if (F has fewer params than args && !F.is_variadic)
        return NOT_VIABLE
    if (F has more params than args && excess params lack defaults)
        return NOT_VIABLE

    // Handle implicit 'this' parameter for member functions
    if (F is non-static member function) {
        this_match = selector_match_with_this_param(
            object_operand, F.this_param_type)       // sub_6D0A80
        if (this_match == FAILED)
            return NOT_VIABLE
    }

    // PASS 2: Per-argument conversion check
    for i in 0..n-1:
        log("determine_function_viability: arg %d", i)

        param_type = F.params[i].type
        arg_type   = A[i].type

        // Compute implicit conversion sequence
        ics = compute_standard_conversion_sequence(   // sub_6BEE10
                  arg_type, param_type, context_flags)

        if (ics == NO_CONVERSION) {
            // Try user-defined conversion
            ics = try_user_defined_conversion(
                      arg_type, param_type)           // sub_6BF610
            if (ics == NO_CONVERSION)
                return NOT_VIABLE
        }

        // Check narrowing for list-initialization
        if (context == LIST_INIT && ics.is_narrowing)
            return NOT_VIABLE

        // Record per-argument match summary
        summaries[i] = ics

    log("(pass 2)")  // second pass through for detailed scoring

    // All arguments convertible -- candidate is viable
    return VIABLE, summaries[]

The function implements a two-pass approach visible in the debug trace output: pass 1 performs a quick rejection check (parameter count, deleted status, deduction success), and pass 2 computes the full conversion sequence for each argument. The per-argument summaries are stored in a 48-byte structure (set_arg_summary_for_user_conversion at sub_6BE990 initializes these).
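
The pass-1 parameter-count test is simple enough to model directly. The following is a minimal Python sketch, not the decompiled code: the `Candidate` record and its field names are hypothetical, chosen only to mirror the two arity checks in the pseudocode above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for one candidate function entry."""
    name: str
    num_params: int            # declared parameters, not counting a trailing '...'
    num_defaults: int = 0      # trailing parameters that have default arguments
    is_variadic: bool = False  # declared with a trailing '...'


def arity_ok(cand: Candidate, num_args: int) -> bool:
    """Quick rejection on argument count alone (pass-1 style)."""
    if num_args > cand.num_params:
        # Surplus arguments are acceptable only if they can match '...'.
        return cand.is_variadic
    # Every missing argument must be covered by a trailing default.
    return cand.num_params - num_args <= cand.num_defaults
```

For example, `arity_ok(Candidate("f", 3, num_defaults=2), 1)` holds because the two missing arguments both have defaults, while `arity_ok(Candidate("f", 2), 3)` does not.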

Phase 3: Best-Viable Selection

select_overloaded_function (sub_6E6400, 1,483 lines, 20 parameters) performs the final selection. It is the master entry point for overload resolution -- called from the expression parser, from CTAD, and from special member function selection.

select_overloaded_function (sub_6E6400, 1483 lines, 20 params)
    Input:  overload_set, arg_list, context_flags, ...
    Output: best_function or AMBIGUOUS or NO_MATCH

    log("Entering select_overloaded_function with ...")

    // Early exit: dependent type arguments => defer to instantiation time
    if (selector_type_is_dependent)
        return DEPENDENT

    // Step 1: Iterate candidates and check viability
    viable_set = []
    try_overloaded_function_match(                    // sub_6E4FA0
        overload_set, arg_list, &viable_set, ...)

    if (viable_set is empty)
        return NO_MATCH

    if (viable_set has exactly 1 candidate)
        return viable_set[0]

    // Step 2: Pairwise comparison of viable candidates
    //   For each pair (F1, F2), compare their ICS for each argument
    best = viable_set[0]
    ambiguous = false

    for each candidate C in viable_set[1..]:
        cmp = compare_candidates(best, C)
        //   compare_candidates calls compare_conversion_sequences (sub_6BFF70)
        //   for each argument position, and applies tie-breakers

        if (cmp == C_IS_BETTER) {
            best = C
            ambiguous = false
        } else if (cmp == NEITHER_BETTER) {
            ambiguous = true
        }

    // Step 3: Verify best is strictly better than ALL others.
    //   This pass must run even when 'ambiguous' is false: a candidate
    //   adopted late in Step 2 has not yet been compared against the
    //   candidates that preceded it.
    for each candidate C in viable_set:
        if (C != best) {
            cmp = compare_candidates(best, C)
            if (cmp != BEST_IS_BETTER)
                return AMBIGUOUS
        }

    return best
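
The scan-then-verify shape of best-viable selection can be modelled independently of how candidates are compared. This is a hedged Python sketch under stated assumptions: the result codes, `select_best`, and the toy comparator `by_rank` are all hypothetical names, and candidates are reduced to tuples of per-argument ranks (lower is better). The sketch always runs the verification pass, because a candidate adopted in the middle of the scan has not been compared against the ones before it.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical result codes for one pairwise comparison.
FIRST_BETTER, SECOND_BETTER, NEITHER = 1, -1, 0

Ranks = Tuple[int, ...]


def by_rank(a: Ranks, b: Ranks) -> int:
    """Toy comparator: better somewhere and worse nowhere wins."""
    a_wins = any(x < y for x, y in zip(a, b))
    b_wins = any(y < x for x, y in zip(a, b))
    if a_wins and not b_wins:
        return FIRST_BETTER
    if b_wins and not a_wins:
        return SECOND_BETTER
    return NEITHER


def select_best(viable: Sequence[Ranks],
                compare: Callable[[Ranks, Ranks], int]) -> Optional[Ranks]:
    """Return the unique best candidate, or None if there is none
    (empty set or ambiguity)."""
    if not viable:
        return None
    # Scan: keep whichever candidate wins each pairwise comparison.
    best = viable[0]
    for cand in viable[1:]:
        if compare(best, cand) == SECOND_BETTER:
            best = cand
    # Verify: the survivor must beat every other candidate.
    for cand in viable:
        if cand is not best and compare(best, cand) != FIRST_BETTER:
            return None
    return best
```

With `[(1, 3), (3, 1), (1, 2)]` the scan ends on `(1, 2)`, but verification finds it incomparable with `(3, 1)`, so the call is reported as ambiguous.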

Candidate Comparison Rules

The pairwise comparison between two viable candidates F1 and F2 follows [over.match.best]. The result is one of: F1-better, F2-better, or indistinguishable.

compare_candidates(F1, F2):
    // Rule 1: Compare implicit conversion sequences argument-by-argument
    f1_better_count = 0
    f2_better_count = 0

    for i in 0..n-1:
        cmp = compare_conversion_sequences(           // sub_6BFF70
                  F1.ics[i], F2.ics[i])
        if (cmp == F1_BETTER) f1_better_count++
        if (cmp == F2_BETTER) f2_better_count++

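    if (f1_better_count > 0 && f2_better_count == 0)
        return F1_IS_BETTER
    if (f2_better_count > 0 && f1_better_count == 0)
        return F2_IS_BETTER
    if (f1_better_count > 0 && f2_better_count > 0)
        return NEITHER_BETTER    // each wins somewhere: tie-breakers do not apply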

    // Rule 2: Non-template preferred over template
    if (F1 is non-template && F2 is template)
        return F1_IS_BETTER
    if (F2 is non-template && F1 is template)
        return F2_IS_BETTER

    // Rule 3: More-specialized template preferred
    if (both are templates) {
        partial = partial_ordering(F1.template, F2.template)
        if (partial == F1_MORE_SPECIALIZED)
            return F1_IS_BETTER
        if (partial == F2_MORE_SPECIALIZED)
            return F2_IS_BETTER
    }

    // Rule 4: Compare qualification conversions
    cmp_qual = compare_qualification_conversions(     // sub_6BE6C0
                   F1.qual_info, F2.qual_info)
    if (cmp_qual != 0)
        return cmp_qual

    return NEITHER_BETTER
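
The first two rules can be sketched compactly. This is an illustrative Python model, not EDG's data layout: the `Viable` record is hypothetical, conversion sequences are collapsed to integer ranks (lower is better), and partial ordering and qualification comparison are left out. Note that under [over.match.best] the tie-breakers only apply when neither candidate has a worse conversion at any position, which the sketch enforces explicitly.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical result codes.
F1_BETTER, F2_BETTER, NEITHER = 1, -1, 0


@dataclass(frozen=True)
class Viable:
    """Hypothetical summary of one viable candidate."""
    ranks: Tuple[int, ...]     # per-argument conversion rank, lower is better
    is_template: bool = False  # candidate is a function template specialization


def compare_candidates(f1: Viable, f2: Viable) -> int:
    # Rule 1: count the argument positions each side wins.
    f1_wins = sum(a < b for a, b in zip(f1.ranks, f2.ranks))
    f2_wins = sum(b < a for a, b in zip(f1.ranks, f2.ranks))
    if f1_wins and not f2_wins:
        return F1_BETTER
    if f2_wins and not f1_wins:
        return F2_BETTER
    if f1_wins and f2_wins:
        return NEITHER  # each side wins somewhere: no tie-breaker applies
    # Rule 2: all positions tie, so a non-template beats a template.
    if f1.is_template != f2.is_template:
        return F2_BETTER if f1.is_template else F1_BETTER
    return NEITHER
```

For instance, `compare_candidates(Viable((1, 3)), Viable((2, 1), is_template=True))` yields `NEITHER`: the non-template preference cannot rescue a candidate that is worse at some argument position.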

Implicit Conversion Sequence (ICS) Model

An ICS is the sequence of transformations needed to convert an argument type to a parameter type. EDG computes and stores ICS information in a compact structure.

Standard Conversion Sequence

standard_conversion_sequence (sub_6BEE10, 375 lines) computes the standard-conversion component of an ICS. It produces a conversion rank used in comparison.

| Rank | Description | Examples | Priority |
| --- | --- | --- | --- |
| Exact Match | No conversion needed | int to int, lvalue-to-rvalue | 1 (best) |
| Promotion | Integer/float promotion | short to int, float to double | 2 |
| Conversion | Standard conversion | int to double, derived-to-base | 3 |
| User-Defined | User conversion + std conversion | Foo to Bar via constructor | 4 |
| Ellipsis | Match via ... parameter | Any type to variadic | 5 (worst) |

Within the same rank, additional criteria refine the comparison:

  1. Qualification adjustment. Converting T* to const T* is worse than T* to T*. compare_qualification_conversions (sub_6BE6C0) encodes cv-qualification as a bitmask (const = 0x20, volatile = 0x40, restrict = 0x80) and compares subset relationships: the conversion whose target carries a proper subset of the other's qualifiers is better.
  2. Derived-to-base distance. Conversion through a shorter inheritance chain is better. Checked via sub_7AB300.
  3. Reference binding. Binding to T&& is preferred over binding to const T& when the argument is an rvalue.
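
The subset test on cv-qualifier bitmasks can be sketched as follows. The bit values are the ones reported above for compare_qualification_conversions; the function name `compare_cv` and its 1 / -1 / 0 return convention are assumptions made for illustration, not the binary's actual interface.

```python
# Bit values as reported above for compare_qualification_conversions.
CONST, VOLATILE, RESTRICT = 0x20, 0x40, 0x80


def compare_cv(q1: int, q2: int) -> int:
    """Compare two target qualifier sets.

    Returns 1 if q1 is better (a proper subset of q2), -1 if q2 is better,
    and 0 if they are equal or neither is a subset of the other.
    """
    if q1 == q2:
        return 0
    if q1 & q2 == q1:   # every qualifier in q1 also appears in q2
        return 1
    if q1 & q2 == q2:   # every qualifier in q2 also appears in q1
        return -1
    return 0            # e.g. const vs. volatile: unordered
```

So an unqualified target beats a `const` one, `const` beats `const volatile`, and `const` versus `volatile` is indistinguishable.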

User-Defined Conversion Sequence

When no standard conversion exists, try_conversion_function_match_full (sub_6D0F50, 1,085 lines) searches for a user-defined conversion path. It considers:

  1. Converting constructors on the target type (non-explicit constructors that accept the source type).
  2. Conversion functions on the source type (operator T() members).

For each candidate conversion, it checks:

try_conversion_function_match_full (sub_6D0F50, 1085 lines)
    Input:  source_class_type, dest_type, context_flags
    Output: selected conversion function/constructor, or AMBIGUOUS, or NONE

    log("considering conversion functions for [%lu.%d]")

    if (source is not class type)
        error("try_conversion_function_match_full: source not class")

    // Iterate conversion function candidates of source class
    for each conv_func in source_class.conversion_functions:  // via sub_6BA230
        return_type = conv_func.return_type
        if (return_type is compatible with dest_type) {
            // Check standard conversion from return_type to dest_type
            post_ics = compute_standard_conversion_sequence(
                           return_type, dest_type)
            if (post_ics != NO_CONVERSION)
                add_to_viable(conv_func, post_ics)
        }

    // Also check for converting constructors on dest type
    conversion_from_class_possible(                  // sub_6D28C0/6D2ED0
        source_class_type, dest_type, &viable_set)

    // Select best among viable user-defined conversions
    if (viable_set has 1 candidate)
        return viable_set[0]
    if (viable_set has multiple candidates)
        return best-of or AMBIGUOUS

    return NONE

The conversion_from_class_possible functions (sub_6D28C0 252 lines, sub_6D2ED0 293 lines) emit full debug traces with entry/exit messages:

Entering conversion_from_class_possible, dest_type = <type>
Candidate functions list: ...
Leaving conversion_from_class_possible: <result>

The Main Overload Resolution Driver

sub_6CE6E0 (1,246 lines) is the central driver function -- "THE MONSTER" -- that coordinates the overload resolution pipeline. It is called from determine_selector_match_level and from the candidate evaluation logic, acting as the type-comparison and scoring backbone that feeds the higher-level selection functions.

overload_resolution_driver (sub_6CE6E0, 1246 lines)
    // This function performs the detailed type comparison and conversion
    // sequence computation that determines how well a candidate matches.
    //
    // It is called per-candidate, per-argument-position from the viability
    // checker and the candidate evaluator.

    // 1. Quick identity check
    if (arg_type == param_type)
        return EXACT_MATCH

    // 2. Chase typedef chains to canonical types
    arg_canon  = canonical_type(arg_type)
    param_canon = canonical_type(param_type)

    // 3. Apply lvalue-to-rvalue conversion
    if (param expects rvalue && arg is lvalue)
        apply lvalue_to_rvalue conversion, record in ICS

    // 4. Apply array-to-pointer / function-to-pointer decay
    if (arg is array)      convert to pointer-to-element
    if (arg is function)   convert to pointer-to-function

    // 5. Check for standard conversions (integral promotion, float promotion,
    //    integral conversion, floating conversion, pointer conversion,
    //    pointer-to-member conversion, boolean conversion)
    std_conv = find_applicable_standard_conversion(arg_canon, param_canon)
    if (std_conv != NONE)
        return std_conv with rank

    // 6. Check for qualification conversion (add const/volatile)
    qual_conv = check_qualification_conversion(arg_canon, param_canon)
    if (qual_conv)
        return EXACT_MATCH with qual adjustment

    // 7. Check derived-to-base conversion
    if (is_class(arg_canon) && is_class(param_canon)) {
        if (is_derived_from(arg_canon, param_canon))
            return CONVERSION_RANK with derived-to-base marker
    }

    // 8. No standard conversion found
    return NO_CONVERSION
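
Step 5 is where promotions are separated from ordinary conversions. The sketch below is a toy Python classifier over type names, not the driver's logic: it covers only a handful of arithmetic types, ignores enums and the wide character types, and assumes a platform where `int` can represent every `unsigned short` value. The constant names reuse the priorities from the rank table.

```python
from typing import Optional

EXACT, PROMOTION, CONVERSION = 1, 2, 3  # priorities from the rank table

# Types subject to integral promotion to int (assumes int is wider than short).
_PROMOTES_TO_INT = {"bool", "char", "signed char", "unsigned char",
                    "short", "unsigned short"}
_ARITHMETIC = _PROMOTES_TO_INT | {
    "int", "unsigned int", "long", "unsigned long",
    "long long", "unsigned long long", "float", "double", "long double",
}


def arithmetic_rank(src: str, dst: str) -> Optional[int]:
    """Rank a conversion between two arithmetic type names.

    Returns None when either name is outside the toy arithmetic set.
    """
    if src not in _ARITHMETIC or dst not in _ARITHMETIC:
        return None
    if src == dst:
        return EXACT
    if dst == "int" and src in _PROMOTES_TO_INT:
        return PROMOTION      # integral promotion
    if src == "float" and dst == "double":
        return PROMOTION      # floating-point promotion
    return CONVERSION         # any other arithmetic conversion
```

This reproduces the familiar asymmetry that `short` to `int` is a promotion while `short` to `long` is only a conversion.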

Candidate Evaluation Function

sub_6C4C00 (1,044 lines) is the candidate evaluation function -- it scores each candidate by computing the full set of implicit conversion sequences across all arguments and produces the data that compare_candidates uses.

evaluate_candidate (sub_6C4C00, 1044 lines)
    Input:  candidate F, argument list args[], match_context
    Output: per-argument ICS array, overall viability

    for each argument position i:
        // Compute the implicit conversion sequence
        ics = overload_resolution_driver(             // sub_6CE6E0
                  args[i].type, F.params[i].type, flags)

        if (ics == NO_CONVERSION) {
            // Try user-defined conversion
            udc = try_user_defined_conversion(args[i].type, F.params[i].type)
            if (udc == NONE) {
                mark F as non-viable for position i
                return NON_VIABLE
            }
            ics = user_defined_ics(udc)
        }

        // Record the ICS for this position
        F.arg_summaries[i] = ics

    // Compute overall match quality
    F.match_level = worst(F.arg_summaries[0..n-1])
    return VIABLE
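
The final `worst(...)` step reduces the per-argument ranks to a single match level. A minimal Python sketch, assuming the five priorities from the rank table; the `Rank` enum, the use of `None` for "no conversion", and the choice of `EXACT` for an empty argument list are all illustrative assumptions.

```python
from enum import IntEnum
from typing import Optional, Sequence


class Rank(IntEnum):
    """Priorities from the rank table; a larger value is a worse match."""
    EXACT = 1
    PROMOTION = 2
    CONVERSION = 3
    USER_DEFINED = 4
    ELLIPSIS = 5


def match_level(arg_ranks: Sequence[Optional[Rank]]) -> Optional[Rank]:
    """Overall match quality is the worst per-argument rank.

    A None entry marks a position with no conversion at all, which makes
    the whole candidate non-viable (reported here as None).
    """
    if any(r is None for r in arg_ranks):
        return None
    return max(arg_ranks, default=Rank.EXACT)
```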

Candidate Iteration

try_overloaded_function_match (sub_6E4FA0, 633 lines, and variant sub_6E5B20, 367 lines) iterates the overload set and calls determine_function_viability for each candidate.

try_overloaded_function_match (sub_6E4FA0, 633 lines)
    Input:  overload_set, arg_list, context
    Output: viable_candidates[]

    log("try_overloaded_function_match")

    // Traverse the overload set
    cursor = overload_set.head
    while (cursor != NULL):                           // via sub_6BA230

        candidate = cursor.function
        log("try_overloaded_function_match: considering %s",
            candidate.name)                           // via sub_5B72C0

        // Set up traversal symbol for template deduction
        set_overload_set_traversal_symbol(cursor)

        // Check viability
        viable = determine_function_viability(        // sub_6E2040
                     candidate, arg_list, context)

        if (viable) {
            add candidate to viable_candidates[]
            record conversion summaries
        }

        cursor = cursor.next
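
The traversal itself is an ordinary singly linked list walk with a filter. The sketch below models it in Python; `OverloadEntry`, its fields, and the `is_viable` callback are hypothetical stand-ins for the overload-set entries and for determine_function_viability.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class OverloadEntry:
    """Hypothetical node of the overload-set linked list."""
    function: str
    next: Optional["OverloadEntry"] = None


def collect_viable(head: Optional[OverloadEntry],
                   is_viable: Callable[[str], bool]) -> List[str]:
    """Walk the list from head and keep the candidates that pass the check."""
    viable: List[str] = []
    cursor = head
    while cursor is not None:
        if is_viable(cursor.function):
            viable.append(cursor.function)
        cursor = cursor.next
    return viable
```

Candidates come out in list order, which matters later only for diagnostics, since best-viable selection does not depend on the order of the viable set.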

Operator Overloading

Operator overloading resolution follows a specialized path because it must consider both user-defined operators AND synthesized built-in operator candidates.

Entry Point: select_overloaded_operator

sub_6EF7A0 (2,174 lines) is the master entry point for operator overloading. It is called from the expression parser whenever an operator expression involves a class-type operand.

select_overloaded_operator / check_for_operator_overloading
    (sub_6EF7A0, 2174 lines)
    Input:  operator_kind, lhs_operand, rhs_operand (if binary), context
    Output: selected function (user-defined or built-in), or use-builtin flag

    log("Entering check_for_operator_overloading")

    // Guard: dependent operands => defer
    if (lhs is dependent || rhs is dependent) {
        log("check_for_operator_overloading: dep operand")
        return DEPENDENT
    }

    // Step 1: Collect user-defined operator candidates
    //   Search member operators of lhs class
    //   Search non-member operators via name lookup + ADL
    user_candidates = collect_user_operator_candidates(
                          operator_kind, lhs, rhs)

    // Step 2: Generate built-in operator candidates
    builtin_candidates = generate_builtin_candidates(   // sub_6CD010
                             operator_kind, lhs.type, rhs.type)

    // Step 3: Combine candidate sets
    combined = user_candidates + builtin_candidates

    // Step 4: Run standard overload resolution on combined set
    result = select_overloaded_function(                 // sub_6E6400
                 combined, [lhs, rhs], OPERATOR_CONTEXT)

    if (result is a built-in candidate) {
        // Adjust operands for built-in semantics
        adjust_operand_for_builtin_operator(             // sub_6E0E50
            lhs, rhs, operator_kind)
        return USE_BUILTIN
    }

    log("Leaving f_check_for_operator_overloading")
    return result.function

Built-in Operator Candidate Generation

sub_6CD010 (752 lines) generates synthetic candidate functions representing built-in operators. It uses a type classification code scheme where each type category is encoded as a single character.

Type Classification Codes

| Code | Meaning | Query Function |
| --- | --- | --- |
| A / a | Arithmetic type | sub_7A7590 (is_arithmetic) |
| B | Boolean type | is_bool |
| b | Boolean-equivalent | is_pointer/bool |
| C | Class type | sub_7A8A30 (is_class) |
| D / I / i | Integer/integral type | sub_7A71E0 (is_integral) |
| E | Enum type | sub_7A70F0 (is_enum) |
| F | Pointer-to-function | is_function_pointer |
| H | Handle type (CLI) | is_handle |
| M | Pointer-to-member | sub_7A8D90 (is_member_pointer) |
| N | nullptr_t | is_nullptr |
| O | Pointer-to-object | is_object_pointer |
| P | Pointer (any) | is_pointer |
| S | Scoped enum | is_scoped_enum |
| h | Handle-to-CLI-array | is_handle_array |
| n | Non-bool arithmetic | is_non_bool_arithmetic |

The function matches_type_code (sub_6BECA0) dispatches on these codes to check whether an operand matches a candidate pattern. The function name_for_type_code (sub_6BE4A0, 67 lines) converts codes to human-readable strings for diagnostics (e.g., A becomes "arithmetic").
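
A code-to-predicate dispatch of this kind can be sketched with a lookup table. This Python model is illustrative only: a type is reduced to a set of property strings, the property names are invented for the sketch, and only a subset of the codes from the table is covered.

```python
from typing import Set

# Subset of the codes from the table above, mapped to hypothetical
# property names that a toy type may carry.
_CODE_TO_PROPERTY = {
    "A": "arithmetic",
    "B": "bool",
    "C": "class",
    "I": "integral",
    "E": "enum",
    "M": "member_pointer",
    "N": "nullptr",
    "P": "pointer",
}


def matches_type_code(type_props: Set[str], code: str) -> bool:
    """Return True if a toy type (a set of properties) satisfies the code."""
    required = _CODE_TO_PROPERTY.get(code)
    if required is None:
        raise ValueError(f"unknown type code {code!r}")
    return required in type_props
```

For a toy `int` described as `{"arithmetic", "integral"}`, codes `A` and `I` match and `P` does not.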

Candidate Pattern Matching

try_builtin_operands_match (sub_6ED2A0, 812 lines) matches operands against built-in operator patterns. The patterns are encoded as strings like "A;P" where each character is a type code and ; separates operand positions.

try_builtin_operands_match (sub_6ED2A0, 812 lines)
    Input:  operator_kind, pattern_string, operand types
    Output: match result

    log("try_builtin_operands_match: considering %s", pattern_string)

    for i in 0..num_operands-1:
        code = pattern_string[i]    // after skipping separators
        log("try_builtin_operands_match: operand %d", i)

        if (!matches_type_code(operand[i].type, code)) {
            log("try_builtin_operands_match: ran off pattern")
            return NO_MATCH
        }

    return MATCH with conversion cost
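
Matching operands against a pattern string such as "A;P" can be modelled in a few lines. This Python sketch assumes exactly one type code per operand position, as in the example above; `operands_match` and the caller-supplied `matches` predicate are hypothetical, and the conversion-cost part of the result is omitted.

```python
from typing import Callable, Sequence


def operands_match(pattern: str,
                   operand_types: Sequence[str],
                   matches: Callable[[str, str], bool]) -> bool:
    """Match operand type names against a pattern such as "A;P".

    `matches(type_name, code)` is a caller-supplied type-code predicate.
    """
    codes = pattern.split(";")            # one code per operand position
    if len(codes) != len(operand_types):
        return False                      # operand count does not fit the pattern
    return all(matches(t, c) for t, c in zip(operand_types, codes))
```

With a toy predicate that treats `int` and `double` as arithmetic and any name ending in `*` as a pointer, the pattern "A;P" accepts `(int, char*)` and rejects `(char*, int)`.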

try_conversions_for_builtin_operator (sub_6EE340, 1,058 lines) contains a large switch over operator kinds that selects the appropriate type pattern tables. It checks dword_126EF68 for C++17 features (>= 201703) and dword_126EFB4 for language mode.

Special Member Function Selection

Overload resolution for special member functions uses dedicated entry points that share the same underlying machinery but provide specialized candidate sets and matching rules.

Copy/Move Constructor Selection

select_overloaded_copy_constructor (sub_6DBEA0, 625 lines)
    Input:  class_type, source_operand, context_flags
    Output: selected constructor symbol, or NULL

    log("Entering select_overloaded_copy_constructor, class_type = %s",
        class_type.name)

    // Iterate all constructors of the class
    for each ctor in class_type.constructors:         // via sub_6BA230
        log("select_overloaded_copy_constructor: considering %s",
            ctor.name)                                // via sub_5B72C0

        // Check copy parameter match
        match = determine_copy_param_match(           // sub_6DBAC0
                    ctor, source_operand)

        // determine_copy_param_match calls:
        //   sub_6CE6E0 (type comparison)
        //   sub_6BE5D0 (value category check)
        //   sub_6DB6E0 (deduce_one_parameter for template ctors)

        if (match.viable) {
            if (current_best == NULL || match better than best_match) {
                current_best = ctor
                best_match   = match
                ambiguous    = false
            } else if (match equal to best_match) {
                // A different constructor matches equally well
                ambiguous = true
            }
        }

    log("Leaving select_overloaded_copy_constructor, cctor_sym = %s",
        current_best.name)
    return current_best

The value category check (sub_6BE5D0, copy_function_not_callable_because_of_arg_value_category, 39 lines) is critical for C++11 move semantics: it rejects copy constructors when the source is an rvalue and a move constructor is available, and rejects move constructors when the source is an lvalue.

Default Constructor Selection

select_overloaded_default_constructor (sub_6E9080, 358 lines)
    Input:  class_type
    Output: selected constructor symbol

    log("Entering select_overloaded_default_constructor, class_type = %s",
        class_type.name)

    // Collect zero-argument constructors
    // Check for default arguments (a 1-param ctor with default is a default ctor)
    // Run standard overload resolution with empty argument list

    log("Leaving select_overloaded_default_constructor, ctor_sym = %s",
        result.name)
    return result

Assignment Operator Selection

select_overloaded_assignment_operator (sub_6DD600, 492 lines)
    Input:  class_type, rhs_operand
    Output: selected assignment operator symbol

    log("Entering select_overloaded_assignment_operator, class_type = %s",
        class_type.name)

    // Iterate assignment operator candidates
    for each assign_op in class_type.assignment_operators:  // via sub_6BA230
        log("select_overloaded_assignment_operator: considering %s",
            assign_op.name)

        // Check parameter match (similar to copy constructor)
        // ...

    log("Leaving select_overloaded_assignment_operator, assign_sym = %s",
        result.name)
    return result

Copy Elision

Copy elision is handled by handle_elided_copy_constructor_no_guard (two variants: sub_6DCD60, 166 lines, and sub_6DD180, 169 lines). Where elision is permitted but not guaranteed (the pre-C++17 rule, which still governs cases such as named return value optimization), the compiler must still verify that the copy/move constructor would be callable: the constructor is selected via select_overloaded_copy_constructor but never actually invoked. C++17 guaranteed elision of prvalues imposes no such requirement. The wrapper arg_copy_can_be_done_via_constructor (sub_6DCC00, 55 lines) performs this check.

List Initialization

prep_list_initializer (sub_6D7C80, 2,119 lines) implements C++11 brace-enclosed initializer list resolution. It is one of the largest functions in overload.c, reflecting the combinatorial complexity of list initialization.

prep_list_initializer (sub_6D7C80, 2119 lines)
    Input:  init_list (braced expression list), target_type, context
    Output: converted initializer expression

    // The algorithm (per [dcl.init.list]):
    //
    // 1. If T has an initializer_list<X> constructor and the braced-init-list
    //    can be converted to initializer_list<X>, use that constructor.
    //
    // 2. If T is an aggregate, perform aggregate initialization.
    //
    // 3. If T has constructors, overload resolution selects a constructor
    //    with the elements of the braced-init-list as arguments.
    //
    // 4. If T is a reference, bind to a temporary or element.
    //
    // At each step, check for narrowing conversions (C++11 requirement).

    // Gate: C++11 required
    if (dword_126EF68 < 201103)   // std_version < C++11
        return LEGACY_PATH

    // Step 1: Check for initializer_list constructor
    init_list_ctor = find_initializer_list_constructor(    // sub_6DFEC0
                         target_type, element_type)
    if (init_list_ctor) {
        init_list_obj = make_initializer_list_object(      // sub_6DFEC0
                            init_list, element_type)
        return set_up_for_constructor_call(init_list_ctor,
                                           init_list_obj)
    }

    // Step 2: Aggregate initialization (recursive for nested braces)
    if (is_aggregate(target_type)) {
        for each element in init_list:
            // Recursively call prep_list_initializer for nested braces
            prep_list_initializer(element, member_type, ...)  // recursive
        return aggregate_init_expr
    }

    // Step 3: Constructor overload resolution
    result = select_overloaded_function(                   // sub_6E6400
                 target_type.constructors, init_list.elements, LIST_INIT)

    // Step 4: Check for narrowing
    check_narrowing_conversions(init_list, result)

    return result

The find_initializer_list_constructor / make_initializer_list_object function (sub_6DFEC0, 692 lines) handles std::initializer_list<T> construction. It iterates constructors to find one taking initializer_list<T> and sets up the backing array via set_overload_set_traversal_symbol.
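
The decision ladder in the pseudocode above can be summarized as a small dispatcher. This is a Python sketch that follows the order shown in this section rather than the wording of [dcl.init.list]; the `TargetType` record, its flags, and the returned strategy strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TargetType:
    """Hypothetical summary of the type being list-initialized."""
    has_init_list_ctor: bool = False  # a usable initializer_list<X> constructor
    is_aggregate: bool = False
    has_constructors: bool = False
    is_reference: bool = False


def list_init_strategy(t: TargetType, std_version: int) -> str:
    """Pick the initialization path in the order described above."""
    if std_version < 201103:          # list initialization needs C++11
        return "legacy"
    if t.has_init_list_ctor:
        return "initializer_list constructor"
    if t.is_aggregate:
        return "aggregate initialization"
    if t.has_constructors:
        return "constructor overload resolution"
    if t.is_reference:
        return "reference binding"
    return "other"
```

An aggregate cannot declare an initializer_list constructor, so checking the constructor before the aggregate case does not change which path such a type takes.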

Class Template Argument Deduction (CTAD)

C++17 CTAD is implemented by deduce_class_template_args (sub_6E8300, 285 lines). CTAD works by synthesizing a set of "deduction guides" -- function-like entities derived from the class template's constructors -- and running overload resolution on them.

deduce_class_template_args (sub_6E8300, 285 lines)
    Input:  class_template, constructor_arguments, context
    Output: deduced template arguments

    // Step 1: Generate implicit deduction guides from constructors
    //   For each constructor C(P1, P2, ...) of class template T<A, B, ...>:
    //     Create guide: T(P1, P2, ...) -> T<deduced-A, deduced-B, ...>

    // Step 2: Add explicit deduction guides (user-provided)

    // Step 3: Run overload resolution among all guides
    selected_guide = select_overloaded_function(          // sub_6E6400
                         deduction_guides, constructor_args, CTAD_CONTEXT)

    // Step 4: Extract deduced template arguments from selected guide
    return selected_guide.deduced_args

CTAD delegates entirely to select_overloaded_function for the actual resolution -- the deduction guides are treated as ordinary function candidates with synthesized parameter types.
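
Step 1 amounts to turning each constructor signature into a function-like candidate whose result is the class template specialization. The following Python sketch models only that shape: the `Guide` tuple, `implicit_guides`, and the string representation of types are illustrative assumptions, and the copy deduction candidate and explicit guides are not modelled.

```python
from typing import List, Sequence, Tuple

# (parameter types, resulting class type): a hypothetical guide representation.
Guide = Tuple[Tuple[str, ...], str]


def implicit_guides(template_name: str,
                    template_params: Sequence[str],
                    ctor_param_lists: Sequence[Sequence[str]]) -> List[Guide]:
    """Build one implicit guide per constructor of the class template."""
    result_type = f"{template_name}<{', '.join(template_params)}>"
    return [(tuple(params), result_type) for params in ctor_param_lists]
```

For a template `Box<T>` with constructors `Box(T)` and `Box(T, int)`, this yields the guides `Box(T) -> Box<T>` and `Box(T, int) -> Box<T>`, which then compete as ordinary candidates.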

Auto Type Deduction

deduce_auto_type (sub_6DB010, 314 lines) implements C++11 auto type deduction, which is structurally similar to template argument deduction. It handles the special case of auto x = {1, 2, 3} where the deduced type is std::initializer_list<int>.

Conversion Infrastructure

Reference Binding

prep_reference_initializer_operand (sub_6D47B0, 1,121 lines) handles reference initialization, which has its own overload-resolution sub-algorithm for selecting the correct binding path:

  1. Direct binding. If the initializer is an lvalue of the right type (or derived), bind directly.
  2. Conversion-through-temporary. If a user-defined conversion exists, create a temporary and bind the reference to it.
  3. Direct reference binding check. conversion_for_direct_reference_binding_possible (sub_6D4610, 49 lines) checks whether direct binding is possible.

Operand Conversion

After overload resolution selects a function, the arguments must be physically converted to match the parameter types:

| Function | Lines | Role |
| --- | --- | --- |
| sub_6D6650 (user_convert_operand) | 427 | Applies user-defined conversion (constructor call or conversion function call) |
| sub_6E1430 (convert_operand_into_temp) | 418 | Creates a temporary and converts operand into it |
| sub_6E1C40 (prep_argument variant 1) | 69 | Prepares argument for function call |
| sub_6E1E40 (prep_argument variant 2) | 69 | Simplified argument preparation |
| sub_6EB1C0 (adjust_overloaded_function_call_arguments) | 249 | Post-resolution argument adjustment |
| sub_6E0E50 (adjust_operand_for_builtin_operator) | 199 | Adjusts operands for built-in operator semantics |

The high-level call setup function select_and_prepare_to_call_overloaded_function (sub_6EB550, 392 lines) combines overload resolution with argument preparation in a single entry point.

Dynamic Initialization

determine_dynamic_init_for_class_init (sub_6DEBC0, 679 lines) determines whether a class object initialization requires a runtime (dynamic) initialization routine rather than static initialization. It checks whether the constructor is trivial, whether the initializer is a constant expression, and whether the target requires dynamic dispatch.

Conditional Operator

conditional_operator_conversion_possible (sub_6EBFC0, 326 lines) handles the special overload resolution for the ternary conditional operator (? :), which has unique type-determination rules involving common type computation between the second and third operands.

Ambiguity Diagnostics

When overload resolution fails due to ambiguity, dedicated diagnostic functions produce the error messages:

| Function | Lines | Role |
|---|---|---|
| sub_6D7040 (diagnose_overload_ambiguity standalone) | 191 | Formats and emits ambiguity diagnostic with candidate list |
| sub_6D35E0 (user_defined_conversion_possible with diagnosis) | 399 | Handles ambiguity in user-defined conversion resolution |

The diagnostic output uses sub_4F59D0, sub_4F5C10, sub_4F5CF0, and sub_4F5D50 for type-to-string formatting, producing messages in the format:

ambiguous overload for 'operator+(A, B)':
  candidate: operator+(int, int)
  candidate: operator+(A::operator int(), int)

Missing Sentinel Warning

warn_if_missing_sentinel (sub_6E9C60, 1,170 lines) is a large function that checks for missing sentinel arguments (NULL terminators) in variadic function calls. It references multiple CUDA extension flags (byte_126E349, byte_126E358, byte_126E3C0, byte_126E3C1, byte_126E481), presumably because CUDA-annotated functions follow different variadic conventions.

CUDA Execution Space Interaction

CUDA introduces an additional dimension to overload resolution: execution space compatibility. In standard C++, any visible function is a candidate. In CUDA, the function that wins must also be callable from the caller's execution space, and a few overload.c paths consult the CUDA flag while searching for conversions.

How Execution Spaces Affect Candidates

The CUDA execution space interaction with overload resolution happens at two levels:

Level 1: Post-resolution validation (expr.c). After overload resolution selects the best viable function, check_cross_execution_space_call (sub_505720, 4KB) validates that the selected function is callable from the current execution context. If the call is illegal (e.g., calling a __device__-only function from __host__ code), one of errors 3462--3465 is emitted. This check runs AFTER overload resolution, not during candidate filtering.

Level 2: Overload-internal CUDA awareness (overload.c). Within overload.c itself, the CUDA extensions flag byte_126E349 gates CUDA-specific behavior in several functions:

  • try_conversion_function_match_full (sub_6D0F50): Checks byte_126E349 when evaluating whether a conversion function is viable. In CUDA mode, conversion functions from the wrong execution space may be excluded from consideration during the user-defined conversion search.

  • warn_if_missing_sentinel (sub_6E9C60): Uses byte_126E349 and byte_126E358 to adjust sentinel checking behavior for CUDA-annotated variadic functions.

The key architectural decision is that CUDA does NOT filter candidates during Phase 1 (candidate collection) or Phase 2 (viability checking) of overload resolution proper. Instead, execution-space validation is a separate pass that runs after the standard C++ overload algorithm completes. This preserves EDG's clean separation between the standard-conforming overload engine and NVIDIA's CUDA extensions.

Cross-Space Validation

The execution space is encoded in the entity node at offset +182 as a bitfield:

| Bit Pattern | Meaning |
|---|---|
| (byte & 0x30) == 0x20 | __device__ only |
| (byte & 0x60) == 0x20 | __host__ only |
| (byte & 0x60) == 0x40 | __global__ |
| (byte & 0x30) == 0x30 | __host__ __device__ |

The cross-space checker (sub_505720) compares the caller's execution space with the callee's and emits:

| Error | Condition |
|---|---|
| 3462 | __device__ called from __host__ |
| 3463 | Variant of 3462 for HD context |
| 3464 | __host__ called from __device__ |
| 3465 | Variant of 3464 with __device__ note |
| 3508 | __global__ called from wrong context |

A template-instantiation variant (sub_505B40, check_cross_space_call_in_template) performs the same checks during template instantiation.
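
The two base diagnostics in the table can be sketched as a small lookup. This is an illustration only: the enum and function names are hypothetical, the HD-context variants (3463, 3465) and the __global__ case (3508) are omitted because the table does not pin down their exact conditions, and nothing here is taken from the body of sub_505720.

```c
#include <assert.h>

/* Hypothetical names; none of these identifiers come from the binary. */
typedef enum { CALLER_HOST, CALLER_DEVICE } caller_space;
typedef enum { CALLEE_HOST, CALLEE_DEVICE, CALLEE_HOST_DEVICE } callee_space;

/* Returns 0 when the call is allowed, otherwise the diagnostic number
 * listed in the table above. */
int cross_space_error(caller_space caller, callee_space callee)
{
    assert(caller == CALLER_HOST || caller == CALLER_DEVICE);

    if (callee == CALLEE_HOST_DEVICE)
        return 0;                 /* callable from either side */
    if (caller == CALLER_HOST && callee == CALLEE_DEVICE)
        return 3462;              /* __device__ called from __host__ */
    if (caller == CALLER_DEVICE && callee == CALLEE_HOST)
        return 3464;              /* __host__ called from __device__ */
    return 0;                     /* caller and callee share a space */
}
```

Because the check runs after resolution, a sketch like this takes only the already-selected callee; it never sees the rest of the candidate set.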

Debug Tracing

Overload resolution includes extensive debug tracing controlled by dword_126EFC8. When enabled, functions emit trace output via sub_48AFD0 / sub_48AE00 to the stream at qword_106B988:

Entering select_overloaded_function with ...
  try_overloaded_function_match: considering foo(int)
    determine_function_viability: arg 0
    (pass 2)
  try_overloaded_function_match: considering foo(double)
    determine_function_viability: arg 0
    (pass 2)
  comparing candidates: foo(int) vs foo(double)
Leaving select_overloaded_function: foo(int)

The trace format [%lu.%d] is used in conversion function matching to identify candidates by internal ID.

Overload Set Management

Overload sets are managed via two key functions in the memory management subsystem (sub_6BA0D0 and sub_6BA230), plus three utilities inside overload.c:

| Function | Role |
|---|---|
| sub_6BA0D0 | Allocate a new overload set entry |
| sub_6BA230 | Iterate/traverse an overload set (linked list walk) |
| sub_6EC650 | Overload set traversal utility (212 lines) |
| sub_6ECA20 | Overload set construction from multiple sources (137 lines) |
| sub_6ECCE0 | Overload set initialization wrapper (23 lines) |

The linked-list representation means candidate iteration is O(n) per traversal, but overload sets are typically small (< 100 candidates), so this is not a performance concern.

Complete Function Map

| Address | Size (lines) | Identity | Confidence |
|---|---|---|---|
| 0x6BE4A0 | 67 | name_for_type_code | VERY HIGH |
| 0x6BE5D0 | 39 | copy_function_not_callable_because_of_arg_value_category | VERY HIGH |
| 0x6BE6C0 | 127 | compare_qualification_conversions | HIGH |
| 0x6BE990 | 68 | set_arg_summary_for_user_conversion | VERY HIGH |
| 0x6BEAF0 | 30 | set_explicit_flag_on_param_list | HIGH |
| 0x6BEB60 | 69 | find_conversion_function | VERY HIGH |
| 0x6BECA0 | 70 | matches_type_code | VERY HIGH |
| 0x6BEE10 | 375 | standard_conversion_sequence | HIGH |
| 0x6BF610 | 80 | check_user_defined_conversion | HIGH |
| 0x6BF710 | 163 | evaluate_conversion_for_argument | HIGH |
| 0x6BFA50 | 129 | process_builtin_operator_candidate | HIGH |
| 0x6BFD00 | 67 | name_for_overloaded_operator | HIGH |
| 0x6BFE40 | 48 | check_ambiguous_conversion | HIGH |
| 0x6BFF70 | 100 | compare_conversion_sequences | HIGH |
| 0x6C4C00 | 1,044 | candidate evaluation | HIGH |
| 0x6C5C90 | 386 | candidate scoring/ranking | MEDIUM |
| 0x6C8B70 | 418 | argument conversion computation | MEDIUM |
| 0x6C92B0 | 383 | template argument deduction for overloads | MEDIUM |
| 0x6CBC40 | 345 | implicit conversion sequence comparison | MEDIUM |
| 0x6CD010 | 752 | built-in operator candidate generation | HIGH |
| 0x6CE010 | 226 | operator overload candidate setup | MEDIUM |
| 0x6CE6E0 | 1,246 | overload resolution driver ("THE MONSTER") | HIGH |
| 0x6D03D0 | 170 | determine_selector_match_level (6-param) | HIGH |
| 0x6D0790 | 132 | determine_selector_match_level (4-param) | HIGH |
| 0x6D0A80 | 225 | selector_match_with_this_param | HIGH |
| 0x6D0F50 | 1,085 | try_conversion_function_match_full | HIGH |
| 0x6D28C0 | 252 | conversion_from_class_possible (9-param) | HIGH |
| 0x6D2ED0 | 293 | conversion_from_class_possible (10-param) | HIGH |
| 0x6D35E0 | 399 | user_defined_conversion_possible / diagnose_overload_ambiguity | HIGH |
| 0x6D3DC0 | 360 | conversion_possible | HIGH |
| 0x6D4610 | 49 | conversion_for_direct_reference_binding_possible | HIGH |
| 0x6D47B0 | 1,121 | prep_reference_initializer_operand | HIGH |
| 0x6D61F0 | 176 | reference init helper | MEDIUM |
| 0x6D6650 | 427 | user_convert_operand / set_up_for_conversion_function_call | HIGH |
| 0x6D7040 | 191 | diagnose_overload_ambiguity (standalone) | HIGH |
| 0x6D7410 | 239 | prep_conversion_operand | HIGH |
| 0x6D79E0 | 93 | conversion operand wrapper | MEDIUM |
| 0x6D7C80 | 2,119 | prep_list_initializer | HIGH |
| 0x6DACA0 | 154 | list init parameter deduction helper | MEDIUM |
| 0x6DB010 | 314 | deduce_auto_type | HIGH |
| 0x6DB6E0 | 236 | deduce_one_parameter | HIGH |
| 0x6DBAC0 | 175 | determine_copy_param_match | HIGH |
| 0x6DBEA0 | 625 | select_overloaded_copy_constructor | HIGH |
| 0x6DCC00 | 55 | arg_copy_can_be_done_via_constructor | HIGH |
| 0x6DCD60 | 166 | handle_elided_copy_constructor_no_guard (variant 1) | HIGH |
| 0x6DD180 | 169 | handle_elided_copy_constructor_no_guard (variant 2) | HIGH |
| 0x6DD600 | 492 | select_overloaded_assignment_operator | HIGH |
| 0x6DE110 | 31 | actualize_class_object_from_braced_init_list_for_bitwise_copy | HIGH |
| 0x6DE1D0 | 75 | full_adjust_class_object_type | HIGH |
| 0x6DE320 | 111 | set_up_for_constructor_call | HIGH |
| 0x6DE5A0 | 174 | temp_init_from_operand_full | HIGH |
| 0x6DE9E0 | 7 | temp_init_from_operand (wrapper) | HIGH |
| 0x6DE9F0 | 114 | find_top_temporary | HIGH |
| 0x6DEBC0 | 679 | determine_dynamic_init_for_class_init | HIGH |
| 0x6DF8C0 | 107 | conversion with dynamic init wrapper | MEDIUM |
| 0x6DFBF0 | 92 | convert and determine dynamic init helper | MEDIUM |
| 0x6DFEC0 | 692 | make_initializer_list_object / find_initializer_list_constructor | HIGH |
| 0x6E0E50 | 199 | adjust_operand_for_builtin_operator | HIGH |
| 0x6E1250 | 79 | argument preparation helper | MEDIUM |
| 0x6E1430 | 418 | convert_operand_into_temp | HIGH |
| 0x6E1C40 | 69 | prep_argument (5-param) | HIGH |
| 0x6E1E40 | 69 | prep_argument (4-param) | HIGH |
| 0x6E2040 | 2,120 | determine_function_viability | HIGH |
| 0x6E4FA0 | 633 | try_overloaded_function_match (variant 1) | HIGH |
| 0x6E5B20 | 367 | try_overloaded_function_match (variant 2) | HIGH |
| 0x6E61D0 | 121 | overload match wrapper | MEDIUM |
| 0x6E6400 | 1,483 | select_overloaded_function (20 params) | HIGH |
| 0x6E8300 | 285 | deduce_class_template_args (CTAD) | HIGH |
| 0x6E8890 | 199 | type comparison for overload | MEDIUM |
| 0x6E8E20 | 93 | overload candidate evaluation helper | MEDIUM |
| 0x6E9080 | 358 | select_overloaded_default_constructor | HIGH |
| 0x6E9750 | 281 | argument list builder | MEDIUM |
| 0x6E9C60 | 1,170 | warn_if_missing_sentinel | HIGH |
| 0x6EAF90 | 105 | node_for_arg_of_overloaded_function_call | HIGH |
| 0x6EB1C0 | 249 | adjust_overloaded_function_call_arguments | HIGH |
| 0x6EB550 | 392 | select_and_prepare_to_call_overloaded_function | HIGH |
| 0x6EBFC0 | 326 | conditional_operator_conversion_possible | HIGH |
| 0x6EC650 | 212 | overload set iterator | MEDIUM |
| 0x6ECA20 | 137 | overload set builder | MEDIUM |
| 0x6ECCE0 | 23 | overload set init wrapper | LOW |
| 0x6ECD70 | 160 | util.h insert operation | MEDIUM |
| 0x6ECFB0 | 193 | util.h insert variant | MEDIUM |
| 0x6ED2A0 | 812 | try_builtin_operands_match | HIGH |
| 0x6EE340 | 1,058 | try_conversions_for_builtin_operator | HIGH |
| 0x6EF7A0 | 2,174 | select_overloaded_operator / check_for_operator_overloading | HIGH |

Key Globals

| Global | Usage |
|---|---|
| dword_126EFB4 | Language mode (2 = C++) |
| dword_126EF68 | Language standard version (201103/201703/202301) |
| dword_126EFA4 | GNU extensions enabled |
| dword_126EFAC | Extended mode flag |
| dword_126EFC8 | Debug trace enabled (controls overload trace output) |
| dword_126EFCC | Debug output level |
| qword_106B988 | Overload debug output stream |
| qword_106B990 | Overload debug output stream (alternate) |
| qword_12C6B30 | Overload candidate list |
| byte_126E349 | CUDA extensions flag |
| byte_126E358 | Extension flag (likely __CUDA_ARCH__-related) |
| dword_106BEA8 | Overload configuration flag |
| dword_106BEC0 | Overload configuration flag |
| dword_106C2A8 | Used by selector match level |
| dword_106C2B8 | Operator-related flag |
| dword_106C2BC | Operator mode flag |
| dword_106C104 | Operator configuration |
| dword_106C124 | Operator configuration |
| dword_106C140 | Operator configuration |
| dword_106C16C | Operator configuration |
| dword_126C5C4 | Template nesting depth |
| dword_126C5E4 | Scope stack depth |
| qword_126C5E8 | Scope stack base |

Template Engine

The template engine in cudafe++ is EDG 6.6's implementation of C++ template instantiation, argument deduction, partial specialization ordering, and the worklist-driven fixpoint loop that produces all needed template instantiations at translation-unit end. It lives primarily in templates.c (172 functions at 0x7530C0--0x794D30) with supporting cross-TU correspondence logic in trans_corresp.c (36 functions at 0x796E60--0x79F9E0).

Template instantiation in a C++ compiler is fundamentally a deferred operation: the compiler parses template definitions, records their bodies in a declaration cache, and only instantiates when a concrete use forces it. EDG implements this with two pending worklists -- one for class templates, one for function/variable templates -- that accumulate entries during parsing and are drained by a fixpoint loop at the end of each translation unit. This page documents the complete instantiation pipeline from "entity added to worklist" through "instantiated body emitted into IL."

Key Facts

| Property | Value |
|---|---|
| Source file | templates.c (172 functions), trans_corresp.c (36 functions) |
| Address range | 0x7530C0--0x794D30 (templates), 0x796E60--0x79F9E0 (correspondence) |
| Fixpoint entry point | sub_78A9D0 (template_and_inline_entity_wrapup), 136 lines |
| Worklist walker | sub_78A7F0 (do_any_needed_instantiations), 72 lines |
| Should-instantiate gate | sub_774620 (should_be_instantiated), 326 lines |
| Function instantiation | sub_775E00 (instantiate_template_function_full), 839 lines |
| Class instantiation | sub_777CE0 (f_instantiate_template_class), 516 lines |
| Variable instantiation | sub_774C30 (instantiate_template_variable), 751 lines |
| Pending function/variable list | qword_12C7740 (linked list head) |
| Pending class list | qword_12C7758 (linked list head) |
| Function depth limit | qword_12C76E0 (max 255 = 0xFF) |
| Class depth limit | Per-type counter at type entry +56, via qword_106BD10 |
| Pending counter | sub_75D740 (increment) / sub_75D7C0 (decrement) |
| SSE state save | 4 xmmword registers for functions, 12 for classes |
| Instantiation modes | "none" / "all" / "used" / "local" |
| Fixpoint flag | dword_12C771C (set=1 when new work discovered, loop restarts) |

Instantiation Entry Structure

Each pending instantiation is represented as a linked-list node. The function/variable worklist uses entries with the following layout:

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | entity | Primary symbol pointer |
| +8 | 8 | next | Next entry in pending list |
| +16 | 8 | inst_info | Instantiation info record (must be non-null) |
| +24 | 8 | master_instance | Canonical template symbol |
| +32 | 8 | actual_decl | Declaration in the instantiation context |
| +40 | 8 | cached_decl | Cached declaration (for kind 7 / function-local) |
| +64 | 8 | body_flags | Deferred/deleted function flags |
| +72 | 8 | pre_computed_result | Result from prior instantiation attempt |
| +80 | 1 | flags | Status bitfield (see below) |

Flags Byte at +80

| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | instantiated | Entity has been instantiated |
| 1 | 0x02 | not_needed | Entity was determined to not need instantiation |
| 3 | 0x08 | explicit_instantiation | From explicit template declaration |
| 4 | 0x10 | suppress_auto | Auto-instantiation suppressed (extern template) |
| 5 | 0x20 | excluded | Entity excluded from instantiation set |
| 7 | 0x80 | can_be_instantiated_checked | Pre-check already performed |

Flags Byte at +28 (on inst_info at +16)

| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | blocked | Instantiation blocked (dependency cycle) |
| 3 | 0x08 | debug_checked | Already checked by debug tracing path |
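
The layout above can be written down as a C struct and checked with offsetof. This is a sketch under stated assumptions: the struct, enum, and helper names are hypothetical, pointer fields are modelled as uint64_t so the offsets match the x86-64 binary on any host, and the 16 bytes at +48..+63 are left as an opaque gap because the table does not describe them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct pending_entry {
    uint64_t entity;              /* +0  primary symbol pointer */
    uint64_t next;                /* +8  next entry in pending list */
    uint64_t inst_info;           /* +16 instantiation info record */
    uint64_t master_instance;     /* +24 canonical template symbol */
    uint64_t actual_decl;         /* +32 declaration in instantiation context */
    uint64_t cached_decl;         /* +40 cached declaration */
    uint8_t  undocumented_48[16]; /* +48..+63: not covered by the table */
    uint64_t body_flags;          /* +64 deferred/deleted function flags */
    uint64_t pre_computed_result; /* +72 result from a prior attempt */
    uint8_t  flags;               /* +80 status bitfield */
} pending_entry;

/* Masks for the flags byte at +80, named after the table. */
enum {
    ENTRY_INSTANTIATED                = 0x01,
    ENTRY_NOT_NEEDED                  = 0x02,
    ENTRY_EXPLICIT_INSTANTIATION      = 0x08,
    ENTRY_SUPPRESS_AUTO               = 0x10,
    ENTRY_EXCLUDED                    = 0x20,
    ENTRY_CAN_BE_INSTANTIATED_CHECKED = 0x80
};

/* Compile-time checks that the struct reproduces the documented offsets. */
static_assert(offsetof(pending_entry, inst_info) == 16, "inst_info at +16");
static_assert(offsetof(pending_entry, body_flags) == 64, "body_flags at +64");
static_assert(offsetof(pending_entry, flags) == 80, "flags at +80");

/* Nonzero when any bit of `mask` is set in the entry's flags byte. */
int entry_has(const pending_entry *e, unsigned mask)
{
    return (e->flags & mask) != 0;
}
```

Testing bit 7 with a mask, as entry_has does, avoids depending on whether the flags byte is treated as signed.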

The Fixpoint Loop: template_and_inline_entity_wrapup

sub_78A9D0 is the top-level entry point, called at the end of each translation unit from fe_wrapup. It implements a fixpoint loop that keeps running until no new instantiations are discovered.

template_and_inline_entity_wrapup (sub_78A9D0)
  |
  +-- Assert: qword_106BA18 == 0  (not nested in another TU)
  +-- Check: dword_126EFB4 == 2   (language mode is C++; see Key Globals)
  |
  +-- FOR EACH translation_unit IN qword_106B9F0 linked list:
  |     |
  |     +-- sub_7A3EF0: set up TU context (switch active TU)
  |     |
  |     +-- PHASE 1: Process pending class instantiations
  |     |   Walk qword_12C7758 list:
  |     |     For each class entry:
  |     |       if sub_7A6B60 (is_dependent_type) == false
  |     |          AND sub_7A8A30 (is_class_or_struct_type) == true:
  |     |            f_instantiate_template_class(entry)
  |     |
  |     +-- PHASE 2: Enable instantiation mode
  |     |   dword_12C7730 = 1
  |     |
  |     +-- PHASE 3: Process pending function/variable instantiations
  |     |   do_any_needed_instantiations()
  |     |
  |     +-- sub_7A3F70: tear down TU context
  |
  +-- PHASE 4: Check for newly-needed instantiations
  |   if dword_12C771C != 0:
  |     dword_12C771C = 0
  |     LOOP BACK to top          <<<< FIXPOINT
  |
  +-- Check dword_12C7718 for additional pass

The fixpoint is necessary because instantiating one template may trigger references to other uninstantiated templates. For example, instantiating std::vector<Foo> may require instantiating std::allocator<Foo>, Foo's copy constructor, comparison operators, and so on. The loop re-runs until dword_12C771C (the "new instantiations needed" flag) remains zero through an entire pass.
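
The restart-on-new-work pattern can be illustrated with a toy worklist. This is a minimal sketch, not the binary's logic: every name is hypothetical, each item depends on at most one other item, and the pass deliberately walks from the highest index down so that newly discovered work is only picked up on the following pass.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ITEMS 8

/* Toy model: item i, once instantiated, requires item deps[i]
 * (or nothing when deps[i] < 0). */
typedef struct {
    int  deps[MAX_ITEMS];
    bool needed[MAX_ITEMS];
    bool done[MAX_ITEMS];
    int  count;
    bool new_work;              /* plays the role of dword_12C771C */
} worklist;

/* One pass over all items, highest index first. */
static void one_pass(worklist *w)
{
    for (int i = w->count - 1; i >= 0; --i) {
        if (!w->needed[i] || w->done[i])
            continue;
        w->done[i] = true;                  /* "instantiate" item i */
        int d = w->deps[i];
        if (d >= 0 && !w->needed[d]) {
            assert(d < w->count);
            w->needed[d] = true;            /* discovered new work */
            w->new_work = true;
        }
    }
}

/* Repeats passes until one completes without discovering new work.
 * Returns the number of passes executed. */
int run_to_fixpoint(worklist *w)
{
    int passes = 0;
    assert(w->count <= MAX_ITEMS);
    do {
        w->new_work = false;
        one_pass(w);
        ++passes;
    } while (w->new_work);
    return passes;
}
```

With a dependency chain 0 -> 1 -> 2 -> 3 and only item 0 initially needed, this toy takes four passes; the last pass instantiates item 3 and finds nothing new, so the loop stops.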

Class-Before-Function Ordering

Classes are instantiated first (Phase 1) because function template instantiations may depend on complete class types. A function template body that accesses T::value_type requires T to be fully instantiated before the function body can be parsed. The two-phase design avoids forward-reference failures during function body replay.

Worklist Walker: do_any_needed_instantiations

sub_78A7F0 walks the pending function/variable instantiation list and processes each entry that passes the should_be_instantiated gate.

void do_any_needed_instantiations(void) {
    entry_t *v0 = qword_12C7740;          // pending list head
    while (v0) {
        if (v0->flags & 0x02) {            // not_needed (bit 1 of +80): skip
            v0 = v0->next;
            continue;
        }
        inst_info_t *v2 = v0->inst_info;   // offset +16, must be non-null
        if (!(v2->flags & 0x08)) {         // not debug-checked
            if (dword_126EFC8)             // debug tracing enabled
                sub_756B40(v0);            // f_is_static_or_inline check
        }
        if (v2->flags & 0x01) {            // blocked
            v0 = v0->next;
            continue;
        }
        if (!(v0->flags & 0x80)) {        // bit 7 clear: not pre-checked (decompiled as signed "flags >= 0")
            sub_7574B0(v0);               // f_entity_can_be_instantiated
        }
        if (should_be_instantiated(v0, 1)) {
            instantiate_template_function_full(v0, 1);
        }
        v0 = v0->next;                    // offset +8
    }
}

The walk is a simple linear traversal. New entries appended during instantiation will be visited on the current pass if they appear after the current position, or on the next fixpoint iteration otherwise.
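
The "appended entries are seen in the same pass" behaviour follows directly from reading the next pointer after processing each node. A minimal sketch, assuming tail insertion and using hypothetical names throughout:

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    int id;
    struct node *next;
} node;

/* Walks the list once and returns how many nodes were visited.
 * Visiting the node whose id equals `trigger` appends `extra` at the
 * tail, imitating an instantiation that queues more work. */
int walk_once(node *head, int trigger, node *extra)
{
    int visited = 0;
    for (node *n = head; n != NULL; n = n->next) {
        ++visited;
        if (extra != NULL && n->id == trigger) {
            node *tail = n;
            while (tail->next != NULL)
                tail = tail->next;
            extra->next = NULL;
            tail->next = extra;   /* lands after the current position */
            extra = NULL;         /* append only once */
        }
    }
    return visited;
}
```

Starting from a three-node list, a walk that appends one node while visiting the second node visits four nodes in that same pass. An entry inserted before the current position would instead wait for the next fixpoint iteration.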

Debug tracing output: when dword_126EFC8 is nonzero, the walker emits "do_any_needed_instantiations, checking: " followed by the entity name for each entry it considers.

Decision Gate: should_be_instantiated

sub_774620 is the critical decision function that determines whether a pending template entity actually requires instantiation. It implements a chain of rejection checks -- an entity must pass all of them to be instantiated.

int should_be_instantiated(entry_t *a1, int a2) {
    // 1. Blocked? (bit 0 of the inst_info flags byte at +28)
    if (a1->flags_28 & 0x01)    return 0;

    // 2. Suppressed by extern template?
    if (a1->flags_80 & 0x20)    return 0;

    // 3. Already instantiated and not explicit?
    if ((a1->flags_80 & 0x08) && !(a1->flags_80 & 0x01))
        return 0;

    // 4. Has valid master instance?
    if (!a1->master_instance)   return 0;    // offset +24

    // 5. Entity kind filter (function-specific)
    int kind = get_entity_kind(a1->master_instance);
    switch (kind) {
        case 10: case 11:   // class member function
        case 17:            // lambda
        case 9:             // namespace-scope function
        case 7:             // variable template
            break;          // eligible
        default:
            return 0;       // not a function/variable entity
    }

    // 6. Implicit include needed?
    if (needs_implicit_include(a1))
        do_implicit_include_if_needed(a1);    // sub_754A70

    // 7. Depth limit check
    if (get_depth(a1) > *qword_106BD10)
        return 0;

    // 8. Depth warning (diagnostic 489/490)
    if (approaching_depth_limit(a1))
        emit_warning(489);  // or 490

    return 1;
}

The depth limit at qword_106BD10 is the configurable maximum instantiation nesting depth. When the depth exceeds it, the entity is silently skipped. When the depth approaches the limit, warning 489 or 490 is emitted to alert the developer.

Function Instantiation: instantiate_template_function_full

sub_775E00 (839 lines) is the workhorse for instantiating function templates. It saves global parser state, replays the cached function body through the parser with substituted template arguments, and restores state afterward.

SSE State Save/Restore

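The function saves and restores four 16-byte globals (xmmword_106C380--xmmword_106C3B0) that hold critical parser/scope state. IDA names them xmmword_* because they are 16-byte data items that the compiled code moves through XMM registers; they are memory locations in .bss, not registers, and this page calls them "SSE registers" as shorthand. They store packed parser context (scope indices, token positions, flags) that must be preserved across instantiation because instantiation re-enters the stateful parser: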

Save on entry:
    saved_state[0] = xmmword_106C380    // parser scope context
    saved_state[1] = xmmword_106C390    // token stream state
    saved_state[2] = xmmword_106C3A0    // scope nesting info
    saved_state[3] = xmmword_106C3B0    // auxiliary flags

Restore on exit (always, even on error):
    xmmword_106C380 = saved_state[0]
    xmmword_106C390 = saved_state[1]
    xmmword_106C3A0 = saved_state[2]
    xmmword_106C3B0 = saved_state[3]

The use of SSE instructions for the save/restore is a compiler optimization: the generated code copies 64 bytes of state with four 16-byte movaps/movups loads (and four matching stores) rather than eight 8-byte mov pairs. The data itself is ordinary integer/pointer fields; the compiler simply copies adjacent fields 128 bits at a time.
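
At source level the pattern is an ordinary copy-out/copy-back around the body replay, with a single restore label shared by the normal and error paths. The sketch below is an assumption about the shape of the original code, not a reconstruction of it: parser_state, replay_body, and instantiate_with_saved_state are hypothetical names, and the 64-byte buffer merely stands in for the four 16-byte globals.

```c
#include <assert.h>
#include <string.h>

/* 64 bytes standing in for xmmword_106C380..xmmword_106C3B0. */
unsigned char parser_state[64];
static_assert(sizeof(parser_state) == 4 * 16, "four 16-byte blocks");

/* Stand-in for the body replay: it clobbers the global state and
 * reports failure when asked to. */
static int replay_body(int fail)
{
    memset(parser_state, 0xEE, sizeof(parser_state));
    return fail ? -1 : 0;
}

int instantiate_with_saved_state(int fail)
{
    unsigned char saved[sizeof(parser_state)];
    int rc;

    memcpy(saved, parser_state, sizeof(saved));     /* save on entry */

    rc = replay_body(fail);
    if (rc != 0)
        goto restore;                               /* error path */

    /* ... further work that may touch parser_state ... */

restore:
    memcpy(parser_state, saved, sizeof(saved));     /* restore on every exit */
    return rc;
}
```

An optimizing compiler is free to lower both memcpy calls to four 16-byte vector moves each, which is consistent with the instruction pattern described above.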

Instantiation Flow

instantiate_template_function_full (sub_775E00)
  |
  +-- Save 4 SSE registers (parser state)
  |
  +-- Check pre-existing result: a1[9] (offset +72)
  |   If result exists:
  |     Load associated translation unit
  |     GOTO restore
  |
  +-- Fresh instantiation:
  |   |
  |   +-- Check implicit include needed
  |   +-- Resolve actual declaration via find_corresponding_instance
  |   +-- For class members (kind 20): handle member function templates
  |   |
  |   +-- Depth limit check:
  |   |   if qword_12C76E0 >= 0xFF (255):
  |   |     emit error, GOTO restore
  |   |   qword_12C76E0++
  |   |
  |   +-- Constraint satisfaction check:
  |   |   sub_7C2370 / sub_7C23B0 (C++20 requires-clause)
  |   |
  |   +-- Handle deferred/deleted functions (offset +64 flags)
  |   |
  |   +-- Set up substitution context: sub_709DE0
  |   |   Binds template parameters to concrete arguments
  |   |
  |   +-- Replay cached function body: sub_5A88B0
  |   |   Re-parses the saved token stream with substituted types
  |   |
  |   +-- Emit into IL: sub_676860
  |   |   Processes tokens until end marker (token kind 9)
  |   |
  |   +-- Update canonical entry: sub_79F1D0
  |   |   Links instantiation to cross-TU correspondence table
  |   |
  |   +-- qword_12C76E0--  (decrement depth)
  |
  +-- Restore 4 SSE registers

Depth Counter: qword_12C76E0

This global counter tracks the current nesting depth of function template instantiations. The hard limit is 255 (0xFF). Each call to instantiate_template_function_full increments it on entry and decrements on exit. When the counter reaches 255, the function emits a fatal error and aborts instantiation.

The 255 limit is a safety valve against infinite recursive template instantiation (e.g., template<int N> struct S { S<N+1> member; }). The C++ standard recommends, as a non-binding guideline, that implementations support at least 1,024 recursively nested template instantiations (Annex B, [implimits]), but this counter stops at 255. A CLI flag may set the related limit held in qword_106BD10.
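
The guard can be sketched as a counter that is checked before it is incremented and decremented on the way out. This is an illustrative model only: fn_depth stands in for qword_12C76E0, depth_errors and instantiate are hypothetical, and the toy instantiation always demands one more nested instantiation so that it runs into the limit.

```c
#include <assert.h>

#define MAX_FUNCTION_DEPTH 0xFF   /* 255, the hard limit described above */

unsigned long fn_depth = 0;       /* stands in for qword_12C76E0 */
unsigned long depth_errors = 0;   /* counts "limit reached" diagnostics */

/* Toy instantiation that always requires one more nested instantiation.
 * Records the deepest nesting level reached in *deepest. */
void instantiate(unsigned long *deepest)
{
    if (fn_depth >= MAX_FUNCTION_DEPTH) {
        ++depth_errors;           /* emit error, skip the instantiation */
        return;
    }
    ++fn_depth;
    if (fn_depth > *deepest)
        *deepest = fn_depth;

    instantiate(deepest);         /* nested instantiation */

    assert(fn_depth > 0);
    --fn_depth;                   /* balanced decrement on exit */
}
```

In this model the recursion reaches depth 255, reports exactly one error at the attempted 256th level, and unwinds with the counter back at zero.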

Class Instantiation: f_instantiate_template_class

sub_777CE0 (516 lines) instantiates class templates. It is structurally similar to the function instantiation path but saves significantly more state (12 SSE registers vs. 4) because class instantiation involves deeper parser state perturbation -- class bodies contain member declarations, nested types, and member function definitions.

SSE State Save/Restore (12 Registers)

Save on entry:
    saved[0]  = xmmword_106C380
    saved[1]  = xmmword_106C390
    saved[2]  = xmmword_106C3A0
    saved[3]  = xmmword_106C3B0
    saved[4]  = xmmword_106C3C0
    saved[5]  = xmmword_106C3D0
    saved[6]  = xmmword_106C3E0
    saved[7]  = xmmword_106C3F0
    saved[8]  = xmmword_106C400
    saved[9]  = xmmword_106C410
    saved[10] = xmmword_106C420
    saved[11] = xmmword_106C430

Restore on exit:
    (reverse order, same 12 registers)

The additional 8 registers (beyond the 4 used by function instantiation) appear to capture the extended scope stack state, class body parsing context, base class list, member template processing state, and access specifier tracking that class body parsing requires.

Class Type Entry Layout

Class instantiation operates on a type entry with the following relevant fields:

| Offset | Size | Field | Description |
|---|---|---|---|
| +56 | 8 | instantiation_depth_counter | Per-type depth limit via qword_106BD10 |
| +72 | 8 | containing_template_decl | The template declaration this specialization came from |
| +88 | 8 | scope_name_info | Scope and name resolution data |
| +96 | 8 | class_body_info | Pointer to cached class body tokens |
| +104 | 8 | base_class_list | Linked list of base class entries |
| +120 | 8 | namespace_lookup_info | Namespace and extern template info |
| +132 | 1 | kind | Type kind: 9=struct, 10=class, 11=union, 12=alias |
| +144 | 8 | canonical_type | Pointer to canonical type entry (follow kind==12 chain) |
| +152 | 8 | parent_scope | Enclosing scope entry |
| +160 | 4 | attribute_flags | Attribute bits |
| +176 | 1 | template_flags | bit 0 = primary template, bit 7 = inline |
| +192 | 8 | template_argument_list | Substituted template argument list |
| +200 | 8 | member_template_list | Linked list of member templates |
| +296 | 8 | associated_constraint | C++20 constraint expression |
| +298 | 1 | extra_flags | Additional status bits |

Instantiation Flow

f_instantiate_template_class (sub_777CE0)
  |
  +-- Walk to canonical type entry: follow kind==12 chain at +144
  +-- Get class symbol: sub_72F640
  |
  +-- Check extern template constraints: sub_7C2370/sub_7C23B0
  |
  +-- Save 12 SSE registers
  |
  +-- Depth limit check:
  |   if type_entry[+56] >= *qword_106BD10:
  |     emit error, GOTO restore
  |   type_entry[+56]++
  |
  +-- Set up substitution context: sub_709DE0
  |
  +-- Handle base class list:
  |   sub_415BE0 (parse base-specifier-list)
  |   sub_4A5510 (validate base classes)
  |
  +-- Parse class body from declaration cache
  |   Replay saved tokens with substituted types
  |
  +-- Process member templates:
  |   Loop on member_template_list (offset +200)
  |   sub_7856E0 for each member template
  |
  +-- Perform deferred access checks:
  |   sub_744F60 (perform_deferred_access_checks_at_depth)
  |
  +-- type_entry[+56]--  (decrement depth)
  |
  +-- Restore 12 SSE registers

Per-Type Depth Limit

Unlike function instantiation (which uses a single global counter qword_12C76E0 with a hard limit of 255), class instantiation uses a per-type counter stored at offset +56 of the type entry. The limit is still read from qword_106BD10. This per-type design prevents one deeply-nested class hierarchy from consuming the entire depth budget -- each class type tracks its own instantiation nesting independently.
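
The difference from the global function counter can be shown with two type entries that each carry their own depth field. This is a sketch with hypothetical names; the limit value of 4 is chosen only to keep the example short and is not the value stored in qword_106BD10.

```c
#include <assert.h>

typedef struct {
    const char   *name;
    unsigned long depth;          /* models the counter at type entry +56 */
} type_entry;

/* Stands in for *qword_106BD10; 4 is an arbitrary demo value. */
unsigned long class_depth_limit = 4;

/* Returns 1 and bumps the per-type counter, or 0 when the limit is hit. */
int enter_class_instantiation(type_entry *t)
{
    if (t->depth >= class_depth_limit)
        return 0;                 /* emit error, skip instantiation */
    ++t->depth;
    return 1;
}

void leave_class_instantiation(type_entry *t)
{
    assert(t->depth > 0);
    --t->depth;
}
```

Exhausting the budget of one type entry leaves every other entry untouched, which is the independence property described above.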

Variable Instantiation: instantiate_template_variable

sub_774C30 (751 lines) handles variable template instantiation. Variable templates (C++14) are less common than function or class templates but follow the same pattern: extract master instance, set up substitution, replay cached declaration.

Instantiation Flow

instantiate_template_variable (sub_774C30)
  |
  +-- Extract master instance: a1[3]=symbol, a1[4]=decl
  |
  +-- Look up declaration type:
  |   Switch on kind: 4/5, 6, 9/10, 19-22
  |
  +-- Find declaration cache: offset +216 or +264
  |
  +-- Depth limit check: qword_106BD10
  |
  +-- Set up substitution context: sub_709DE0
  |
  +-- Create declaration state:
  |   memset(v77, 0, 0x1D8)    // 472 bytes = declaration state
  |   v77[0]  = symbol
  |   v77[3]  = source position
  |   v77[6]  = type
  |   v77[15] = flags
  |   v77[19] = self-pointer
  |   v77[33] = additional flags
  |   v77[35] = initializer
  |   v77[36] = IL tree
  |
  +-- Perform type substitution: sub_764AE0 (scan_template_declaration)
  |
  +-- Handle constexpr/constinit evaluation
  |
  +-- Handle deferred access checks
  |
  +-- Update canonical entry
  |
  +-- For kind==7 (function-local variable templates):
      Special handling via sub_5C9600, copy attributes from prototype

The declaration state structure is 472 bytes (0x1D8), stack-allocated and zero-initialized. This is the same structure used by the main declaration parser -- variable template instantiation reuses the declaration parsing infrastructure with pre-populated fields.

Pending Counter Management

Two small functions manage a pending-instantiation counter that tracks how many instantiations are in flight. This counter is used for progress reporting and infinite-loop detection.

increment_pending_instantiations (sub_75D740)

Called when a new template entity is added to the pending worklist. Increments the counter and checks against a maximum threshold via too_many_pending_instantiations (sub_75D6A0).

decrement_pending_instantiations (sub_75D7C0)

Called when an instantiation completes (successfully or by rejection). Decrements the counter.

The counter itself is not directly visible in the sweep report but is inferred from the call pattern: the increment function is called from code paths that add entries to qword_12C7740 or qword_12C7758, and the decrement is called at the end of each instantiate_template_function_full / f_instantiate_template_class / instantiate_template_variable invocation.

Instantiation Modes

The template engine supports four instantiation modes, controlled by CLI flags that set dword_12C7730 and related configuration globals:

| Mode | dword_12C7730 | Behavior |
|---|---|---|
| "none" | 0 | No automatic instantiation. Only explicit template declarations trigger instantiation. Used for precompiled headers. |
| "used" | 1 | Instantiate templates that are actually used (ODR-referenced). This is the default mode. The should_be_instantiated function checks usage flags. |
| "all" | 2 | Instantiate all templates that have been declared, whether or not they are used. Used for template library precompilation. |
| "local" | 3 | Instantiate only templates with internal linkage. Extern templates are skipped. Used for split compilation models. |

The mode transitions during compilation:

  1. During parsing: dword_12C7730 = 0 (collection only, no instantiation)
  2. At wrapup entry: dword_12C7730 = 1 (enable "used" mode)
  3. During fixpoint: mode may escalate to "all" if dword_12C7718 is set

The precompile mode (dword_106C094 == 3) skips the fixpoint loop entirely and records template entities for later instantiation in the consuming translation unit.

Substitution Engine: copy_type_with_substitution

sub_76D860 (1,229 lines) is the core type substitution function. It takes a type node and a set of template-parameter-to-argument bindings, and produces a new type with all template parameters replaced by their concrete values.

copy_type_with_substitution(type, bindings) -> type
  |
  +-- Dispatch on type->kind:
  |
  +-- Simple types (int, float, void): return type unchanged
  |
  +-- Pointer type (kind 6):
  |   new_pointee = copy_type_with_substitution(type->pointee, bindings)
  |   return make_pointer_type(new_pointee)
  |
  +-- Reference types (kind 7, 19):
  |   new_referent = copy_type_with_substitution(type->referent, bindings)
  |   return make_reference_type(new_referent, type->is_rvalue)
  |
  +-- Array type (kind 8):
  |   new_element = copy_type_with_substitution(type->element, bindings)
  |   new_size = substitute_expression(type->size_expr, bindings)
  |   return make_array_type(new_element, new_size)
  |
  +-- Function type (kind 14):
  |   new_return = copy_type_with_substitution(type->return_type, bindings)
  |   new_params = [substitute each parameter type]
  |   return make_function_type(new_return, new_params, type->cv_quals)
  |
  +-- Template parameter type:
  |   Look up parameter in bindings
  |   return concrete argument type
  |
  +-- Template-id type:
  |   new_args = copy_template_arg_list_with_substitution(type->args, bindings)
  |   return find_or_instantiate_template_class(type->template, new_args)
  |
  +-- Pack expansion (kind 16, 17):
  |   Expand pack with all elements from the binding
  |   return list of substituted types

Supporting substitution functions:

| Address | Identity | Description |
|---|---|---|
| sub_77BA10 | copy_parent_type_with_substitution | Substitutes in enclosing class context |
| sub_77BFE0 | copy_template_with_substitution | Substitutes within template declarations |
| sub_77FDE0 | copy_template_arg_list_with_substitution | Substitutes within argument lists (612 lines) |
| sub_780B80 | copy_template_class_reference_with_substitution | Handles class template references |
| sub_78B600 | copy_template_variable_with_substitution | Handles variable template references |
| sub_793DF0 | substitute_template_param_list | Walks parameter list with substitution (741 lines) |
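
The recursive shape of the substitution can be shown on a tiny type tree. This is a minimal sketch under stated assumptions: the type representation, the kind names, and the substitute function are all hypothetical (they do not use EDG's kind numbers), only simple, pointer, reference, and template-parameter types are modelled, and new nodes are heap-allocated without any canonicalization.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { T_INT, T_POINTER, T_REFERENCE, T_PARAM } type_kind;

typedef struct type {
    type_kind kind;
    const struct type *inner;   /* pointee or referent, else NULL */
    int param_index;            /* template parameter index for T_PARAM */
} type;

/* bindings[i] is the concrete type bound to template parameter i.
 * Returns a type in which every T_PARAM node has been replaced. */
const type *substitute(const type *t, const type **bindings, int nbindings)
{
    switch (t->kind) {
    case T_INT:
        return t;                              /* simple type: unchanged */
    case T_PARAM:
        assert(t->param_index >= 0 && t->param_index < nbindings);
        return bindings[t->param_index];       /* look up the binding */
    case T_POINTER:
    case T_REFERENCE: {
        type *copy = malloc(sizeof *copy);     /* caller owns the new node */
        if (copy == NULL)
            abort();
        copy->kind = t->kind;                  /* rebuild the same wrapper */
        copy->inner = substitute(t->inner, bindings, nbindings);
        copy->param_index = -1;
        return copy;
    }
    }
    return NULL;                               /* not reached for valid kinds */
}
```

Substituting T = int into "reference to pointer to T" yields a fresh reference node wrapping a fresh pointer node whose pointee is the original int node.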

Template Argument Deduction

The deduction subsystem determines template argument values from function call arguments. Key functions:

| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_77CEE0 | matches_template_type | 788 | Core deduction: matches actual type against template parameter pattern. Implements [temp.deduct]. |
| sub_77CA90 | matches_template_type_for_class_type | -- | Class-specific variant with additional base class traversal |
| sub_77C720 | matches_template_arg_list | -- | Matches a sequence of template arguments |
| sub_77C510 | matches_template_template_param | -- | Matches template template parameters |
| sub_77C240 | template_template_arg_matches_param | -- | Template template argument compatibility check |
| sub_77E9F0 | matches_template_constant | -- | Matches non-type template arguments (constant expressions) |
| sub_77E310 | parameter_is_more_specialized | 330 | Partial ordering rule: determines which parameter is more specialized |
| sub_780FC0 | all_templ_params_have_values | 332 | Post-deduction check: verifies all parameters received values |
| sub_781660 | wrapup_template_argument_deduction | -- | Finalizes deduction, applies default arguments |
| sub_781C40 | matches_partial_specialization | 316 | Tests actual arguments against a partial specialization |
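
As a rough illustration of what "matching an actual type against a parameter pattern" means, the sketch below deduces bindings on a toy type tree. The `Ty` struct and the `mk_*`, `same_type`, and `match` helpers are hypothetical names invented for this example; they stand in for, and are far simpler than, matches_template_type.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical toy types: a named type, a template parameter, or a pointer.
struct Ty;
using TyPtr = std::shared_ptr<const Ty>;
struct Ty {
    enum Kind { Named, Param, Pointer } kind;
    std::string name;  // used by Named and Param
    TyPtr pointee;     // used by Pointer
};

TyPtr mk_named(const std::string& n) { return std::make_shared<Ty>(Ty{Ty::Named, n, nullptr}); }
TyPtr mk_param(const std::string& n) { return std::make_shared<Ty>(Ty{Ty::Param, n, nullptr}); }
TyPtr mk_ptr(TyPtr p) { return std::make_shared<Ty>(Ty{Ty::Pointer, "", std::move(p)}); }

// Structural equality of two toy types.
bool same_type(const TyPtr& a, const TyPtr& b) {
    if (a->kind != b->kind) return false;
    if (a->kind == Ty::Pointer) return same_type(a->pointee, b->pointee);
    return a->name == b->name;
}

using Deduced = std::map<std::string, TyPtr>;

// Match `actual` against `pattern`, recording parameter bindings in `out`.
bool match(const TyPtr& pattern, const TyPtr& actual, Deduced& out) {
    switch (pattern->kind) {
    case Ty::Param: {
        auto it = out.find(pattern->name);
        if (it == out.end()) { out[pattern->name] = actual; return true; }
        return same_type(it->second, actual);  // repeated parameter must agree
    }
    case Ty::Pointer:
        return actual->kind == Ty::Pointer &&
               match(pattern->pointee, actual->pointee, out);
    case Ty::Named:
        return same_type(pattern, actual);
    }
    return false;
}
```

Matching the pattern `T*` against `int*` binds `T` to `int`; a later attempt to bind the same `T` to `float` fails, which is the conflict case a real deduction pass has to reject.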

Partial Specialization Ordering

When multiple partial specializations match, the engine must select the "most specialized" one. This implements C++ [temp.class.order] and [temp.func.order]:

check_partial_specializations (sub_774470)
  |
  +-- For each partial specialization of the template:
  |   matches_partial_specialization(actual_args, partial_spec)
  |   If matches: add to candidates list
  |     add_to_partial_order_candidates_list (sub_773E40)
  |
  +-- If multiple candidates:
  |   partial_ord (sub_75D2A0)
  |     Pairwise comparison using parameter_is_more_specialized
  |     Select most specialized, or emit ambiguity error
  |
  +-- Return winning specialization (or primary template if no match)

For function templates, ordering uses compare_function_templates (sub_7730D0, 665 lines) which implements the more complex function template partial ordering rules.
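
The candidate selection step can be sketched as follows. The `Pattern` encoding (one integer per parameter slot, higher meaning more constrained) and all function names are assumptions made for illustration; the toy relation only stands in for parameter_is_more_specialized.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical encoding: one entry per template parameter slot,
// 0 = unconstrained (T), 1 = constrained (for example T*).
// All patterns are assumed to have the same length.
using Pattern = std::vector<int>;

bool at_least_as_specialized(const Pattern& a, const Pattern& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] < b[i]) return false;
    return true;
}

bool more_specialized(const Pattern& a, const Pattern& b) {
    return at_least_as_specialized(a, b) && !at_least_as_specialized(b, a);
}

// Returns the index of the unique most specialized candidate,
// or -1 when there is none (empty list or ambiguity).
int select_most_specialized(const std::vector<Pattern>& c) {
    for (std::size_t i = 0; i < c.size(); ++i) {
        bool wins = true;
        for (std::size_t j = 0; j < c.size() && wins; ++j)
            if (i != j && !more_specialized(c[i], c[j])) wins = false;
        if (wins) return static_cast<int>(i);
    }
    return -1;
}
```

With candidates `{1,0}` and `{1,1}` the second wins; with `{1,0}` and `{0,1}` neither dominates, which corresponds to the ambiguity error path.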

Template Declaration Infrastructure

The declaration side handles parsing template<...> prefixes and setting up template entities:

| Address | Identity | Lines | Description |
|---|---|---|---|
| sub_786260 | template_declaration | 2,487 | Main entry point for all template declarations. Handles primary, explicit specialization, partial specialization, and friend templates. |
| sub_782690 | class_template_declaration | 2,280 | Class-specific template declaration processing |
| sub_78D600 | template_or_specialization_declaration_full | 2,034 | Unified handler routing to class, function, or variable paths |
| sub_764AE0 | scan_template_declaration | 412 | Parses the template<...> prefix |
| sub_779D80 | scan_template_param_list | 626 | Parses template parameter lists |
| sub_77AAB0 | scan_lambda_template_param_list | -- | C++20 lambda template parameter parsing |
| sub_770790 | make_template_function | 914 | Creates function template entity |
| sub_753870 | make_template_variable | -- | Creates variable template entity |
| sub_756310 | set_up_template_decl | -- | Template declaration state initialization |

Explicit Instantiation

Explicit instantiation (template class Foo<int>; or template void f<int>();) is handled by a dedicated path:

explicit_instantiation (sub_791C70, 105 lines)
  |
  +-- Parse 'extern' flag: a2 & 1 = is_extern_instantiation
  +-- Save compilation mode (dword_106C094)
  |
  +-- Determine instantiation kind:
  |   extern:              kind = 16
  |   non-extern, no inline: kind = 15
  |   non-extern, inline:   kind = 18
  |
  +-- For precompiled header mode: mark scope entry
  |
  +-- instantiation_directive (sub_7908E0, 626 lines):
  |   |
  |   +-- Initialize target scope entry (memset 472 bytes)
  |   +-- Check CUDA device-code instantiation pragmas
  |   +-- Parse declaration:
  |   |   For classes:    sub_789EF0 (update_instantiation_flags)
  |   |   For functions:  sub_78D0E0 (find_matching_template_instance)
  |   |                   then sub_7897C0 (update_instantiation_flags)
  |   |   For variables:  similar path
  |   +-- Handle instantiation attributes (dllexport/visibility)
  |   +-- Clean up parser state
  |
  +-- Handle deferred access checks: sub_744F60
  +-- Restore compilation mode

update_instantiation_flags (sub_7897C0, 351 lines) sets the appropriate instantiation-required bits on the template entity after matching an explicit instantiation directive. It checks compilation mode, CUDA device/host targeting, and adjusts flags accordingly.
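
The kind selection at the top of the explicit_instantiation flow reduces to a small decision. The sketch below uses hypothetical names (`explicit_instantiation_kind`, the enum labels) and assumes that the extern case yields kind 16 whether or not inline is present, which is how the flow above reads.

```cpp
#include <cassert>

// Hypothetical labels for the three kind values listed in the flow.
enum InstantiationKind {
    kExplicit = 15,        // non-extern, no inline
    kExternExplicit = 16,  // extern
    kInlineExplicit = 18   // non-extern, inline
};

// `a2` mirrors the decompiled flag word: bit 0 is the extern flag.
int explicit_instantiation_kind(unsigned a2, bool is_inline) {
    bool is_extern = (a2 & 1u) != 0;
    if (is_extern) return kExternExplicit;  // assumption: extern takes priority
    return is_inline ? kInlineExplicit : kExplicit;
}
```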

CUDA Integration Points

The template engine interacts with CUDA through several mechanisms:

  1. Device/host filtering in should_be_instantiated: The function checks CUDA execution space attributes via sub_756840 (sym_can_be_instantiated) to determine if a template entity should be instantiated for the current compilation target (device or host).

  2. Instantiation directives: CUDA-specific #pragma directives can trigger or suppress template instantiation for device code. The instantiation_directive function checks for these at dword_126EFA8 (GPU mode) and dword_126EFA4 (device-code flag).

  3. Namespace injection: CUDA-specific symbols are entered into cuda::std via enter_symbol_for_namespace_cuda_std (sub_749330) and std::meta via enter_symbol_for_namespace_std_meta (sub_7493C0, C++26 reflection support).

  4. Target dialect selection: select_cp_gen_be_target_dialect (sub_752A80) determines whether template instantiations emit device PTX code or host code, based on dword_126EFA8 (GPU mode) and dword_126EFA4 (device vs. host).

Cross-TU Correspondence

When compiling with RDC mode or multiple translation units, the same template may be instantiated in different TUs. The trans_corresp.c file (0x796E60--0x79F9E0) handles deduplication and canonical entry selection:

| Address | Identity | Description |
|---|---|---|
| sub_796E60 | canonical_ranking | Determines which of two TU entries is canonical |
| sub_7975D0 | may_have_correspondence | Checks if cross-TU correspondence is possible |
| sub_7999C0 | find_template_correspondence | Finds corresponding template across TUs (601 lines) |
| sub_79A5A0 | determine_correspondence | Establishes correspondence relationship |
| sub_79B8D0 | mark_canonical_instantiation | Marks the canonical version of an instantiation |
| sub_79C400 | f_set_trans_unit_corresp | Sets up cross-TU correspondence (511 lines) |
| sub_79D080 | establish_instantiation_correspondences | Links instantiation results across TUs |
| sub_79EE80--sub_79F1D0 | update_canonical_entry (3 variants) | Updates canonical representative after instantiation |
| sub_79F9E0 | record_instantiation | Records an instantiation for cross-TU tracking |

The correspondence system ensures that when std::vector<int> is instantiated in TU1 and TU2, both produce structurally equivalent IL, and only one canonical version is emitted to the output.

Global State

| Address | Name | Description |
|---|---|---|
| qword_12C7740 | pending_instantiation_list | Head of pending function/variable instantiation linked list |
| qword_12C7758 | pending_class_instantiation_list | Head of pending class instantiation linked list |
| dword_12C7730 | instantiation_mode_active | Current instantiation mode (0=none, 1=used, 2=all, 3=local) |
| dword_12C771C | new_instantiations_needed | Fixpoint flag: set to 1 when new work discovered |
| dword_12C7718 | additional_pass_needed | Secondary fixpoint flag for extra passes |
| qword_12C76E0 | instantiation_depth_counter | Current function template nesting depth (max 0xFF) |
| qword_106BD10 | max_instantiation_depth_limit | Configurable depth limit (read by class and function paths) |
| xmmword_106C380--106C3B0 | parser_state_save_area | 4 SSE registers saved by function instantiation |
| xmmword_106C380--106C430 | parser_state_save_area_full | 12 SSE registers saved by class instantiation |
| dword_106C094 | compilation_mode | 0=none, 1=normal, 3=precompile |
| dword_126EFB4 | compilation_phase | 2=full compilation (required for fixpoint loop) |
| qword_106B9F0 | translation_unit_list_head | Linked list of TUs for per-TU fixpoint iteration |
| qword_106BA18 | tu_stack_top | Must be 0 (not nested) when fixpoint starts |
| dword_126EFC8 | debug_tracing_enabled | Nonzero enables trace output for instantiation |
| dword_126EFA8 | gpu_mode | Nonzero when compiling CUDA code |
| dword_126EFA4 | device_code | 1=device-side compilation, 0=host stubs |
| word_126DD58 | current_token_kind | Parser state: current token (9=END) |
| qword_126DD38 | source_position | Parser state: current source location |
| qword_126C5E8 | scope_table_base | Array of 784-byte scope entries |
| dword_126C5E4 | current_scope_index | Index into scope table |

Diagnostic Strings

| String | Source | Condition |
|---|---|---|
| "do_any_needed_instantiations, checking: " | sub_78A7F0 | dword_126EFC8 != 0 (debug tracing) |
| "template_and_inline_entity_wrapup" | sub_78A9D0 | Assert string |
| "should_be_instantiated" | sub_774620 | Assert string at templates.c:36894 |
| "instantiate_template_function_full" | sub_775E00 | Assert string at templates.c:7359 |
| "f_instantiate_template_class" | sub_777CE0 | Assert string at templates.c:5277 |
| "instantiate_template_variable" | sub_774C30 | Assert string at templates.c:7814 |
| "check_template_nesting_depth" | sub_7533E0 | Assert string |
| "instantiation_directive" | sub_7908E0 | Assert string at templates.c:41682 |
| "explicit_instantiation" | sub_791C70 | Assert string at templates.c:42231 |
| "template_arg_is_dependent" | sub_7530C0 | Assert string at templates.c:8897 |

Function Map

| Address | Identity | Confidence | Lines | EDG Source |
|---|---|---|---|---|
| sub_78A9D0 | template_and_inline_entity_wrapup | 100% | 136 | templates.c:40084 |
| sub_78A7F0 | do_any_needed_instantiations | 100% | 72 | templates.c:39760 |
| sub_774620 | should_be_instantiated | 95% | 326 | templates.c:36894 |
| sub_775E00 | instantiate_template_function_full | 95% | 839 | templates.c:7359 |
| sub_777CE0 | f_instantiate_template_class | 95% | 516 | templates.c:5277 |
| sub_774C30 | instantiate_template_variable | 95% | 751 | templates.c:7814 |
| sub_75D740 | increment_pending_instantiations | 95% | -- | templates.c |
| sub_75D7C0 | decrement_pending_instantiations | 95% | -- | templates.c |
| sub_75D6A0 | too_many_pending_instantiations | 95% | -- | templates.c |
| sub_7574B0 | f_entity_can_be_instantiated | 95% | -- | templates.c:37066 |
| sub_756B40 | f_is_static_or_inline_template_entity | 95% | -- | templates.c |
| sub_756840 | sym_can_be_instantiated | 95% | -- | templates.c |
| sub_754A70 | do_implicit_include_if_needed | 95% | -- | templates.c |
| sub_76D860 | copy_type_with_substitution | 95% | 1229 | templates.c |
| sub_77FDE0 | copy_template_arg_list_with_substitution | 95% | 612 | templates.c |
| sub_793DF0 | substitute_template_param_list | 95% | 741 | templates.c |
| sub_77CEE0 | matches_template_type | 95% | 788 | templates.c |
| sub_780FC0 | all_templ_params_have_values | 95% | 332 | templates.c |
| sub_781C40 | matches_partial_specialization | 95% | 316 | templates.c |
| sub_774470 | check_partial_specializations | 95% | 58 | templates.c |
| sub_773E40 | add_to_partial_order_candidates_list | 95% | 306 | templates.c |
| sub_75D2A0 | partial_ord | 95% | -- | templates.c |
| sub_7730D0 | compare_function_templates | 95% | 665 | templates.c |
| sub_786260 | template_declaration | 95% | 2487 | templates.c |
| sub_782690 | class_template_declaration | 95% | 2280 | templates.c |
| sub_78D600 | template_or_specialization_declaration_full | 95% | 2034 | templates.c |
| sub_764AE0 | scan_template_declaration | 95% | 412 | templates.c |
| sub_779D80 | scan_template_param_list | 95% | 626 | templates.c |
| sub_770790 | make_template_function | 95% | 914 | templates.c |
| sub_771D50 | find_template_function | 95% | 470 | templates.c |
| sub_7621A0 | find_template_class | 95% | 519 | templates.c |
| sub_78AC50 | find_template_variable | 95% | 528 | templates.c |
| sub_7908E0 | instantiation_directive | 95% | 626 | templates.c:41682 |
| sub_791C70 | explicit_instantiation | 95% | 105 | templates.c:42231 |
| sub_7897C0 | update_instantiation_flags | 90% | 351 | templates.c |
| sub_7770E0 | update_instantiation_required_flag | 95% | 434 | templates.c |
| sub_78D0E0 | find_matching_template_instance | 95% | -- | templates.c |
| sub_709DE0 | set_up_substitution_context | -- | -- | (likely templates.c) |
| sub_744F60 | perform_deferred_access_checks_at_depth | 95% | -- | symbol_tbl.c |
| sub_7530C0 | template_arg_is_dependent | 95% | -- | templates.c:8897 |
| sub_762C80 | template_arg_list_is_dependent_full | 95% | 839 | templates.c |
| sub_75EF10 | equiv_template_arg_lists | 95% | 493 | templates.c |
| sub_7931B0 | make_template_implicit_deduction_guide | 95% | 433 | templates.c |
| sub_794D30 | ctad | 95% | 990 | templates.c |
| sub_796E60 | canonical_ranking | 95% | -- | trans_corresp.c |
| sub_7999C0 | find_template_correspondence | 95% | 601 | trans_corresp.c |
| sub_79C400 | f_set_trans_unit_corresp | 95% | 511 | trans_corresp.c |
| sub_79F1D0 | update_canonical_entry | 95% | -- | trans_corresp.c |
| sub_79F9E0 | record_instantiation | 95% | -- | trans_corresp.c |

Cross-References

CUDA Template Restrictions

CUDA's split-compilation model imposes restrictions on C++ templates that have no counterpart in standard C++. When a __global__ function template is instantiated, cudafe++ generates a host-side stub whose mangled name must exactly match what the device compiler (cicc) independently produces. This agreement is only possible if both compilers can derive the complete mangled name from the template's signature and arguments. Types that are invisible to one side -- host-local types, unnamed types, private class members, certain lambda closures -- break this invariant and are therefore rejected. The same constraints apply to variable templates used in device contexts, and additional structural restrictions prevent variadic __global__ templates from producing ambiguous mangled names. This page documents all 24 CUDA-specific template restriction errors across 8 categories, the implementation functions that enforce them, and the __NV_name_expr mechanism that relies on these guarantees.

Key Facts

| Property | Value |
|---|---|
| Source file | cp_gen_be.c (EDG 6.6 backend code generator) |
| Access checker | sub_469F80 (template_arg_is_accessible, 144 lines) |
| Cache engine | sub_469480 (cache_access_result_for, 670 lines) |
| Arg list walker | sub_46A230 (walks template arg lists, 182 lines) |
| Pre-unnamed check | sub_46A5B0 (arg_before_unnamed_template_param_arg, 396 lines) |
| Scope resolver | sub_469F30 (resolves scope via hash lookup, 23 lines) |
| Callback for scope walk | sub_46ACC0 (passed as callback into sub_61FE60) |
| Cache hash table | xmmword_F05720 (384 KB, 16,383-entry table (0x3FFF), 24 bytes per slot) |
| Entity lookup table | unk_FE5700 (512 KB, used by sub_469F30) |
| Free list head | qword_F05708 (recycled cache entries) |
| Total restriction errors | 24 across 8 categories |

Why These Restrictions Exist

The CUDA compilation model splits a single .cu source file into two compilation paths:

  1. Host path: cudafe++ generates a .int.c file containing host stubs. The host compiler (gcc, clang, MSVC) compiles these stubs and produces a host object file. Each __global__ function template instantiation becomes a __wrapper__device_stub_ function.

  2. Device path: The same source is compiled by cicc into PTX. The device compiler independently instantiates the same templates and produces the device-side function bodies.

At link time, the CUDA runtime matches host stubs to device functions by mangled name. Both compilers must produce identical mangled names for every __global__ template instantiation. This is only possible when all template arguments are types that both compilers can see, name, and mangle identically. A host-only local type, for example, exists only in the host compiler's scope -- cicc cannot see it and cannot produce a matching mangled name. The restrictions documented below enforce this invariant.

The same logic applies to __device__/__constant__ variable templates, which must also match across the host/device boundary for registration and symbol lookup.

Category A: __global__ Declaration Restrictions (8 errors)

These errors prevent __global__ function templates from using C++ features that would prevent host stub generation or violate kernel ABI constraints.

| Tag | Message | Reason |
|---|---|---|
| global_function_constexpr | A __global__ function or function template cannot be marked constexpr | Kernels are not evaluated at compile time; constexpr is meaningless for device launch. |
| global_function_consteval | A __global__ function or function template cannot be marked consteval | consteval requires compile-time evaluation, incompatible with runtime kernel launch. |
| global_class_decl | A __global__ function or function template cannot be a member function | Kernels have no this pointer; the launch ABI has no slot for an object reference. |
| global_friend_definition | A __global__ function or function template cannot be defined in a friend declaration | Friend definitions have limited visibility, conflicting with the requirement for a globally-linkable stub. |
| global_exception_spec | An exception specification is not allowed for a __global__ function or function template | GPU hardware has no exception unwinding mechanism. |
| global_function_in_unnamed_inline_ns | A __global__ function or function template cannot be declared within an inline unnamed namespace | Unnamed namespaces produce TU-local linkage, but kernel stubs must have external linkage for runtime registration. |
| global_function_with_initializer_list | a __global__ function or function template cannot have a parameter with type std::initializer_list | std::initializer_list holds a pointer to backing storage that cannot be transparently transferred to device memory. |
| global_va_list_type | A __global__ function or function template cannot have a parameter with va_list type | Variadic argument lists require stack-based access that does not exist on GPU hardware. |

These checks occur during attribute application in apply_nv_global_attr (sub_40E1F0 / sub_40E7F0) and in the post-validation pass nv_validate_cuda_attributes (sub_6BC890). The checks apply equally to non-template __global__ functions and __global__ function templates.

Category B: Variadic __global__ Template Constraints (2 errors)

Standard C++ allows multiple parameter packs in a template and does not require packs to be the last parameter. CUDA restricts this for __global__ templates because the host stub ABI requires unambiguous argument layout.

| Tag | Message |
|---|---|
| global_function_pack_not_last | Pack template parameter must be the last template parameter for a variadic __global__ function template |
| global_function_multiple_packs | Multiple pack parameters are not allowed for a variadic __global__ function template |

Rationale

The kernel launch wrapper (<<<grid, block>>>) must marshal each argument into a contiguous parameter buffer. For a variadic template like template<typename... Ts> __global__ void kernel(Ts... args), the compiler generates the buffer layout at instantiation time. If the pack is not last, or if multiple packs are present, the positional mapping between template parameters and launch arguments becomes ambiguous -- the compiler cannot determine which arguments belong to which pack without full deduction context that may not be available at stub generation time.

Example

// OK: single pack, last position
template<typename T, typename... Ts>
__global__ void kernel(T first, Ts... rest);

// Error: pack not last
template<typename... Ts, typename T>
__global__ void kernel(Ts... args, T last);  // global_function_pack_not_last

// Error: multiple packs
template<typename... Ts, typename... Us>
__global__ void kernel(Ts... a, Us... b);    // global_function_multiple_packs
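
The two structural rules can be expressed as a check over a template parameter list. This is a sketch with hypothetical names (`TemplateParam`, `PackCheck`, `check_global_template_packs`); the order in which cudafe++ tests the two conditions is not established here, so reporting the multiple-pack case first is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one template parameter.
struct TemplateParam {
    std::string name;
    bool is_pack;
};

enum class PackCheck { Ok, PackNotLast, MultiplePacks };

// Applies the two variadic __global__ template rules to a parameter list.
PackCheck check_global_template_packs(const std::vector<TemplateParam>& params) {
    int packs = 0;
    for (const auto& p : params)
        if (p.is_pack) ++packs;
    if (packs > 1) return PackCheck::MultiplePacks;  // assumption: checked first
    if (packs == 1 && !params.back().is_pack) return PackCheck::PackNotLast;
    return PackCheck::Ok;
}
```

The three declarations in the example above map to `Ok`, `PackNotLast`, and `MultiplePacks` respectively.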

Category C: Template Argument Visibility for __global__ (6 errors)

These are the core name-mangling restrictions. Every type used as a template argument to a __global__ function template instantiation must be visible and nameable by both the host and device compilers.

C.1: Host-local types

| Tag | Message |
|---|---|
| global_func_local_template_arg | A type defined inside a __host__ function (%t) cannot be used in the template argument type of a __global__ function template instantiation |

A type defined inside a __host__ function exists only within that function's scope. The device compiler never sees it and cannot produce a matching mangled name.

void host_function() {
    struct LocalType { int x; };
    kernel<LocalType><<<1,1>>>();  // error: host-local type
}

C.2: Private/protected class members

| Tag | Message |
|---|---|
| global_private_type_arg | A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the class is local to a __device__ or __global__ function |

Private/protected nested types are accessible only through the enclosing class's access control. While C++ allows friend access and member function access to these types, the device compiler processes templates independently and may not have the same access context. The exception for types local to __device__/__global__ functions reflects that both compilers see device function bodies.

class Outer {
    struct Inner { int x; };     // private
    friend void launch();
};

void launch() {
    kernel<Outer::Inner><<<1,1>>>();  // error: private type
}

C.3: Unnamed types

| Tag | Message |
|---|---|
| global_unnamed_type_arg | An unnamed type (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the type is local to a __device__ or __global__ function |

Unnamed types (anonymous structs, unnamed enums) have no canonical name. Itanium ABI mangling for unnamed types relies on positional encoding within the enclosing scope, which may differ between host and device compilers if they process the enclosing scope differently. Types local to __device__/__global__ functions are exempt because the device compiler processes those scopes identically.

enum { A, B, C };                     // unnamed enum
kernel<decltype(A)><<<1,1>>>();       // error: unnamed type

C.4: Lambda closures

| Tag | Message |
|---|---|
| global_lambda_template_arg | The closure type for a lambda (%t%s) cannot be used in the template argument type of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the flag '-extended-lambda' is specified and the lambda is an extended lambda (a __device__ or __host__ __device__ lambda defined within a __host__ or __host__ __device__ function) |

Lambda closures are compiler-generated anonymous types. Without --extended-lambda, there is no protocol for both compilers to agree on the closure type's mangled representation. The extended lambda mechanism (--extended-lambda, short form -extended-lambda) establishes a naming convention for lambdas annotated with __device__ or __host__ __device__, enabling cross-compiler name agreement.

auto f = [](int x){ return x*2; };
kernel<decltype(f)><<<1,1>>>();       // error unless extended lambda

C.5: Private/protected template template arguments

| Tag | Message |
|---|---|
| global_private_template_arg | A template that is defined inside a class and has private or protected access cannot be used in the template template argument of a __global__ function template instantiation |

The same access-control problem as C.2, but for template template parameters. A private class template used as a template template argument cannot be guaranteed visible in the device compiler's independent instantiation context.

C.6: Texture/surface non-type arguments

Message
A texture or surface variable cannot be used in the non-type template argument of a __device__, __host__ __device__ or __global__ function template instantiation

Texture and surface objects have special hardware semantics. Their runtime addresses are not fixed at compile time (they are bound through the texture subsystem), so they cannot serve as non-type template arguments whose values must be known to produce a deterministic mangled name.

Implementation: The Access Checking Pipeline

The template argument restriction checks are implemented in a three-function pipeline within cp_gen_be.c:

sub_469F80 — template_arg_is_accessible

This is the primary entry point. It dispatches on the template argument kind (byte at arg+8):

int template_arg_is_accessible(arg_t *arg, int scope_depth, char check_scope, int *cache_miss) {
    arg->flags_25 |= 0x10;               // mark: currently checking
    int kind = arg->kind;                 // offset +8
    
    switch (kind) {
    case 0:  // type argument
        type = arg->value;               // offset +32
        result = cache_access_result_for(type, 6, scope_depth, cache_miss);
        if (!result && (check_scope & 1)) {
            // walk through typedef chains (type_kind == 12)
            while (type->kind == 12)
                type = type->canonical;   // offset +144
            result = cache_access_result_for(type, 6, scope_depth, cache_miss);
            if (!result) {
                sub_469F30(&type_holder, 0);   // resolve via entity lookup
                result = (type_holder != original_type);
            }
        }
        break;
        
    case 1:  // template argument (template template parameter)
        entity = arg->value;             // offset +32
        // Check class accessibility via derivation chain
        if (entity->base_class) {        // offset +128
            // Use IL walker sub_61FE60 with callback sub_46ACC0
            sub_61EC40(visitor_state);
            visitor_state[0] = sub_46ACC0;   // the callback
            sub_61FE60(entity->base_class, visitor_state);
            result = (visitor_state->found == 0);
        }
        break;
        
    case 2:  // non-type argument
        result = cache_access_result_for(arg->value, 58, scope_depth, cache_miss);
        break;
        
    default:
        __assert_fail("template_arg_is_accessible", "cp_gen_be.c", 2448);
    }
    
    arg->flags_25 &= ~0x10;              // clear: done checking
    return result;
}

The flags_25 |= 0x10 / &= ~0x10 pattern is a recursion guard: it marks the argument as "currently being checked" to prevent infinite loops through mutually-referential template arguments.
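
A minimal sketch of this guard pattern, using a hypothetical `Arg` node and `check` function rather than the real argument layout: the in-progress bit is set on entry, tested on re-entry, and cleared on exit, so a cycle of mutually-referential arguments terminates. Treating a re-entered node as "skip" is an assumption modeled on the walker's behavior.

```cpp
#include <cassert>
#include <vector>

// Hypothetical argument node; `flags` plays the role of flags_25.
struct Arg {
    unsigned char flags = 0;
    bool accessible = true;
    std::vector<Arg*> nested;  // arguments reachable from this one
};

constexpr unsigned char kInProgress = 0x10;

// Returns false if this argument or anything reachable from it is inaccessible.
bool check(Arg* a) {
    if (a->flags & kInProgress)
        return true;                 // already on the call stack: skip it
    a->flags |= kInProgress;         // mark: currently checking
    bool ok = a->accessible;
    for (Arg* n : a->nested)
        ok = ok && check(n);
    a->flags &= static_cast<unsigned char>(~kInProgress);  // clear: done
    return ok;
}
```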

sub_469480 — cache_access_result_for

This function caches the result of access checking for a given entity to avoid redundant computation. The cache is a hash table at xmmword_F05720 with 16,383 buckets (0x3FFF), each 24 bytes wide.

Cache entry layout (24 bytes):

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Pointer to next entry in chain (collision list) |
| +8 | 8 | entity | Entity pointer being cached |
| +16 | 4 | scope_id | Scope identifier from qword_1065708 chain |
| +20 | 1 | result | Cached access result (1 = accessible, 0 = not) |
| +21 | 1 | arg_kind | Template argument kind that was checked |

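Hash function: The entity pointer is right-shifted by 6 bits, then reduced modulo 0x3FFF (16,383). The decompiled form is the compiler's multiply-and-shift lowering of that modulo, using the magic constant 262161 (0x40011):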

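unsigned n = (unsigned)(entity >> 6);                  // drop the low 6 pointer bits
unsigned hash = ((unsigned long long)n * 262161ULL) >> 32;
// quotient n / 0x3FFF is (hash + ((n - hash) >> 1)) >> 13; note the inner parentheses
unsigned bucket = n - 0x3FFF * ((hash + ((n - hash) >> 1)) >> 13);
char *slot = (char *)&xmmword_F05720 + 24 * bucket;    // 24-byte slots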
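
The multiply-and-shift sequence is the standard lowering of an unsigned 32-bit modulo by the constant 16,383, and that can be checked directly. The sketch below assumes the shifted pointer is truncated to 32 bits before hashing; `bucket_magic` is a name chosen for this example.

```cpp
#include <cassert>
#include <cstdint>

// Computes n % 0x3FFF without a division instruction.
// 262161 is the low 32 bits of ceil(2^46 / 16383) = 2^32 + 262161.
std::uint32_t bucket_magic(std::uint32_t n) {
    std::uint32_t q = static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(n) * 262161ULL) >> 32);
    // (q + ((n - q) >> 1)) equals floor((n + q) / 2) without overflow;
    // shifting by a further 13 bits yields n / 16383.
    std::uint32_t quotient = (q + ((n - q) >> 1)) >> 13;
    return n - 0x3FFFu * quotient;
}
```

For every 32-bit input the result agrees with `n % 0x3FFF`, so the bucket index always falls in the range 0 to 16,382.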

Cache hit path: If slot->entity == entity and the scope matches, return the cached result immediately. The function walks the qword_1065708 chain (the scope stack) to verify that the cached result was computed in a compatible scope context.

Cache eviction: When a cached entry's scope no longer matches (the scope stack has changed since caching), the entry is moved to the free list (qword_F05708). New entries are allocated from the free list or via sub_6B7340 (24-byte allocation).

Fallback (cache miss): On cache miss, the function performs the actual accessibility analysis:

  1. For type arguments (kind 6): resolves typedefs, checks if the type is a class/struct/enum with access restrictions. Uses sub_5F9C10 to resolve through elaborated type specifiers. Checks entity->access_bits at +80 (bits 0-1: 0=public, 1=protected, 2=private).

  2. For non-type arguments (kind 58): checks the entity's accessibility directly.

  3. For class/struct types (kinds 9-11): walks the class's template argument list recursively via sub_469F80.

  4. For dependent types (kind 14): recursively checks the base type.

  5. For function types (kind 7) and pointer-to-member types (kind 13): recursively checks the return type, parameter types, and pointed-to class.

After computing the result, it is stored in the cache for future lookups.
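
The hit, eviction, and free-list behavior described above can be sketched as a small chained cache. This is an illustration under simplifying assumptions: the class and member names are hypothetical, buckets hold pointers rather than inline 24-byte slots, and scope compatibility is reduced to comparing a single integer instead of walking the scope chain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical cache entry mirroring the documented fields.
struct Entry {
    Entry* next;
    const void* entity;
    int scope_id;
    bool result;
};

class AccessCache {
    std::vector<Entry*> buckets_;
    std::deque<Entry> pool_;       // owns every entry ever allocated
    Entry* free_list_ = nullptr;   // recycled entries

    std::size_t index(const void* e) const {
        return (reinterpret_cast<std::uintptr_t>(e) >> 6) % buckets_.size();
    }

public:
    std::size_t recycled = 0;

    explicit AccessCache(std::size_t n = 0x3FFF) : buckets_(n, nullptr) {}

    // Returns true on a hit; a stale entry is unlinked and recycled.
    bool lookup(const void* e, int scope, bool* out) {
        Entry** link = &buckets_[index(e)];
        while (Entry* cur = *link) {
            if (cur->entity == e) {
                if (cur->scope_id == scope) { *out = cur->result; return true; }
                *link = cur->next;        // scope changed: evict
                cur->next = free_list_;
                free_list_ = cur;
                ++recycled;
                return false;
            }
            link = &cur->next;
        }
        return false;
    }

    // Stores a freshly computed result, reusing a recycled entry if possible.
    void store(const void* e, int scope, bool result) {
        Entry* n;
        if (free_list_) { n = free_list_; free_list_ = n->next; }
        else { pool_.push_back(Entry{}); n = &pool_.back(); }
        Entry*& head = buckets_[index(e)];
        *n = Entry{head, e, scope, result};
        head = n;
    }
};
```

A caller would try `lookup` first, run the full accessibility analysis only on a miss, and then `store` the outcome under the current scope.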

sub_46A230 — Template Arg List Walker

This function walks a template instantiation's argument list and checks each argument for accessibility. It uses the entity lookup hash table at unk_FE5700 to find cached resolution results.

__int64 walk_template_args(__int64 hash_table, unsigned __int64 type) {
    // Resolve through typedef chains
    while (type->kind == 12)
        type = type->canonical;           // offset +144
    
    // Hash the type pointer into a bucket
    _QWORD *bucket = hash_table + 32 * ((type >> 6) % 0x3FFF);
    
    // Walk the bucket chain
    while (bucket && bucket[1]) {
        entry = bucket[1];                // the entity entry
        
        // Check if this entry matches our type
        if (entry->canonical != type && !sub_7B2260(entry->canonical, type, 0))
            continue;
        
        // Scope compatibility check
        if (bucket[2] && bucket[2] != qword_126C5D0)
            continue;
        
        // For template entities (kind 10), walk their argument lists
        if (entry->kind == 10) {
            arg = *entry->template_args;      // first argument in the list
            while (arg) {
                if (arg->flags_25 & 0x10)     // already being checked
                    goto next;
                if (!template_arg_is_accessible(arg, 0, 0, &miss))
                    goto not_found;
                arg = arg->next;
            }
        }
        
        // Access check on the entity itself
        if (entry->access_bits != 0)      // private/protected
            if (!sub_467780(entry, 1, 0)) // check access
                goto not_found;
        
        // Cache the resolved entity in bucket[3]
        bucket[3] = qword_10657E8;
        return entry;
    }
    return 0;
}

The walker handles three argument kinds:

  • Kind 0 (type): Checks the type entity's accessibility and, for class templates (kind 12 with subkind 10), recursively walks nested template arguments.
  • Kind 1 (template): Checks the template entity's class ancestry.
  • Kind 2 (non-type): Resolves the non-type argument's scope via sub_5F9BC0.

sub_46A5B0 — arg_before_unnamed_template_param_arg

This function handles the generation of template arguments that appear before unnamed template parameter arguments. It determines the positional index of each argument relative to the template parameter list and calls the appropriate code-generation routine. The assert at line 4795 guards against an unexpected argument kind (must be 0, 1, or 2; kind 3 is a pack expansion sentinel).

Category D: Variable Template Parallel Restrictions (5 errors)

Variable templates (template<typename T> __device__ T var = ...) used in device contexts carry the same restrictions as __global__ function templates. The diagnostics mirror Category C exactly:

| Tag | Message |
|---|---|
| variable_template_private_type_arg | A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a variable template instantiation, unless the class is local to a __device__ or __global__ function |
| variable_template_private_template_arg | (private template template arg in variable template) |
| variable_template_unnamed_type_template_arg | An unnamed type (%t) cannot be used in the template argument type of a variable template template instantiation, unless the type is local to a __device__ or __global__ function |
| variable_template_func_local_template_arg | A type defined inside a __host__ function (%t) cannot be used in the template argument type of a variable template template instantiation |
| variable_template_lambda_template_arg | The closure type for a lambda (%t%s) cannot be used in the template argument type of a variable template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the lambda is an 'extended lambda' and the flag --extended-lambda is specified |

The implementation shares the same cache_access_result_for / template_arg_is_accessible pipeline described in the Category C implementation section. The only difference is the error tag and message string emitted on failure.

Why Variable Templates Need the Same Restrictions

Variable templates instantiated with __device__, __constant__, or __managed__ memory space are registered by the CUDA runtime using their mangled names. The host-side .int.c file contains registration arrays (emitted in .nvHRDE, .nvHRDI, .nvHRCE, .nvHRCI sections) whose entries are byte arrays encoding mangled variable names. The device compiler independently mangles the same variable template instantiation. Both must produce identical names, so the same visibility constraints apply.

Category E: Static Global Template Stub (2 errors)

In whole-program compilation mode (-rdc=false) with -static-global-template-stub=true, template __global__ functions receive static linkage on their host stubs. This prevents ODR violations when the same template kernel is instantiated in multiple translation units. Two scenarios are incompatible with this mode:

| Tag | Message |
|---|---|
| extern_kernel_template | when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false"). To resolve the issue, either use separate compilation mode ("-rdc=true"), or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off) |
| template_global_no_def | when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit. To resolve this issue, either use separate compilation mode ("-rdc=true"), or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off) |

The Problem

An extern template kernel declaration says "this template instantiation exists elsewhere." But if the stub is static, there is no way for the linker to resolve the extern reference to a stub in another TU, because static symbols are TU-local. Similarly, a template instantiation without a definition in the current TU cannot have a static stub generated for it, because there is no body to inline.

Resolution Paths

Both diagnostics suggest the same two alternatives:

  1. Switch to -rdc=true (separate compilation): each TU gets its own device object, and cross-TU kernel references are resolved by the device linker (nvlink).
  2. Set -static-global-template-stub=false: stubs get external linkage, allowing cross-TU references at the cost of potential ODR violations if the same template is instantiated in multiple TUs.

Category F: Local Type Prevents Host Launch (1 warning)

| Tag | Message |
|---|---|
| local_type_used_in_global_function | a local type %t (defined in %sq1) used in global function %sq2 template argument, the global function cannot be launched from host code. |

This is a warning-level diagnostic, not a hard error. It fires when a type local to a function (but not a __host__-function-local type, which would be Category C.1) is used as a template argument. The kernel can still be instantiated and called from device code, but the host-side launch path is blocked because the local type is not visible to the host stub generator.

This diagnostic differs from global_func_local_template_arg in severity and scope: it is a soft warning that the kernel "cannot be launched from host code," rather than a hard error that rejects the instantiation entirely.

Category G: __grid_constant__ in Instantiation Directives (1 error)

| Tag | Message |
|---|---|
| grid_constant_incompat_templ_redecl | incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p) |

When a function template is redeclared, the __grid_constant__ annotations on its parameters must match the original declaration. This is enforced because __grid_constant__ affects the ABI: a parameter marked __grid_constant__ is placed in constant memory and accessed through a different addressing mode. If a redeclaration omits the annotation, the host stub and device function would disagree on parameter layout.

The related diagnostic grid_constant_incompat_instantiation_directive applies to explicit instantiation directives (template __global__ void kernel<int>(...)) and is documented in the grid_constant page.

Category H: Kernel Launches from System File Templates (1 error)

Message
kernel launches from templates are not allowed in system files

This error fires when a <<<...>>> kernel launch expression appears inside a template function defined in a system header file. System headers are files marked with #pragma system_header or located in system include paths (e.g., the CUDA toolkit's include/ directory).

The restriction exists because system headers are processed with relaxed diagnostics. Kernel launch expressions inside template functions in system headers would be instantiated in user code contexts, but the launch transformation (replacing <<<...>>> with cudaConfigureCall + stub call) operates during the system header's processing pass where diagnostic state may be suppressed. Rather than risk silent miscompilation, the compiler rejects this pattern outright.

The __NV_name_expr Mechanism (6 errors)

NVRTC (NVIDIA's runtime compilation library) provides a mechanism to obtain the mangled name of a __global__ function or __device__/__constant__ variable at compile time. This mechanism is exposed through the __CUDACC_RTC__name_expr intrinsic, which the frontend processes during lowered name lookup.

Purpose

NVRTC compiles CUDA code at runtime, producing PTX that is loaded into the driver. The host application needs to look up compiled kernels and device variables by name via cuModuleGetFunction / cuModuleGetGlobal. The __NV_name_expr mechanism bridges this gap: the user provides a C++ name expression (e.g., my_kernel<int> or my_device_var<float>), and the compiler returns the corresponding mangled name (e.g., _Z9my_kernelIiEvv).
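The mangled name in the example can be reproduced by hand for the simplest case. The sketch below is not the frontend's mangler; it only covers a non-member function template that returns void, takes no parameters, and whose template arguments are builtin types already written as Itanium ABI codes (for example `i` for int, `f` for float). The function name `mangle_nullary_void_template` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "_Z" <length><name> "I" <template arg codes> "E" "v" "v":
   the first "v" is the void return type (encoded for function template
   instantiations), the second is the empty parameter list.
   Returns 1 on success, 0 if the output buffer is too small. */
static int mangle_nullary_void_template(const char *name,
                                        const char *arg_codes,
                                        char *out, size_t out_size) {
    assert(name != NULL && arg_codes != NULL && out != NULL);
    int n = snprintf(out, out_size, "_Z%zu%sI%sEvv",
                     strlen(name), name, arg_codes);
    return n >= 0 && (size_t)n < out_size;
}
```

Calling it with `"my_kernel"` and `"i"` yields the string quoted above, which is the name a host application would pass to cuModuleGetFunction.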

The 6 Errors

| Tag | Message |
|---|---|
| name_expr_parsing | Error in parsing name expression for lowered name lookup. Input name expression was: %sq |
| name_expr_extra_tokens | Extra tokens found after parsing name expression for lowered name lookup. Input name expression was: %sq |
| name_expr_internal_error | Internal error in parsing name expression for lowered name lookup. Input name expression was: %sq |
| name_expr_non_global_routine | Name expression cannot form address of a non-__global__ function. Input name expression was: %sq |
| name_expr_non_device_variable | Name expression cannot form address of a variable that is not a __device__/__constant__ variable. Input name expression was: %sq |
| name_expr_not_routine_or_variable | Name expression must form address of a __global__ function or the address of a __device__/__constant__ variable. Input name expression was: %sq |

Processing Pipeline

  1. Parsing: The name expression is parsed as a C++ id-expression. If parsing fails, name_expr_parsing is emitted. If tokens remain after a successful parse, name_expr_extra_tokens fires.

  2. Lookup: The parsed expression is resolved via standard C++ name lookup (qualified or unqualified, with template argument deduction if needed).

  3. Validation: The resolved entity is checked:

    • If it is a function, it must be __global__ (has the __global__ execution space byte set). Otherwise: name_expr_non_global_routine.
    • If it is a variable, it must be __device__ or __constant__ (memory space bits at entity+148). Otherwise: name_expr_non_device_variable.
    • If it is neither a function nor a variable: name_expr_not_routine_or_variable.
  4. Mangling: If validation passes, the entity is mangled using the Itanium ABI mangler (in lower_name.c) and the resulting string is recorded for NVRTC output.
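The validation step reduces to a three-way decision on the resolved entity. The following is a minimal sketch of that decision only; the `name_expr_entity` struct and its boolean fields are hypothetical stand-ins for the entity-node bits the frontend actually reads, and the returned strings are the diagnostic tags listed earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entity model used only for this illustration. */
typedef enum { ENTITY_FUNCTION, ENTITY_VARIABLE, ENTITY_OTHER } entity_kind;

typedef struct {
    entity_kind kind;
    int is_global;           /* function is marked __global__ */
    int is_device_or_const;  /* variable is __device__ or __constant__ */
} name_expr_entity;

/* Returns NULL when the entity may proceed to mangling, otherwise the
   tag of the diagnostic that would be emitted. */
static const char *validate_name_expr_entity(const name_expr_entity *e) {
    assert(e != NULL);
    switch (e->kind) {
    case ENTITY_FUNCTION:
        return e->is_global ? NULL : "name_expr_non_global_routine";
    case ENTITY_VARIABLE:
        return e->is_device_or_const ? NULL
                                     : "name_expr_non_device_variable";
    default:
        return "name_expr_not_routine_or_variable";
    }
}
```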

Connection to Template Restrictions

The __NV_name_expr mechanism relies on every template argument being mangleable. All of the Category C restrictions directly support this: if a template argument type cannot be mangled consistently (because it is unnamed, local, private, etc.), the name expression lookup would produce a mangled name that does not match the device-side mangling. The restrictions are enforced at template instantiation time, before any name expression lookup occurs, so that invalid instantiations never reach the mangling stage.

Data Structures

Template Argument Node (arg_t)

The template argument node is a linked-list entry used by sub_469F80 and sub_46A230:

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Next argument in the list |
| +8 | 1 | kind | Argument kind: 0=type, 1=template, 2=non-type, 3=pack expansion |
| +24 | 1 | flags_24 | Bit 0: is pack expansion |
| +25 | 1 | flags_25 | Bit 4 (0x10): currently being checked (recursion guard) |
| +32 | 8 | value | Pointer to the type/entity/expression |
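The offsets in this table can be written down as a C struct and checked at compile time. This is a sketch under assumptions: an LP64 target (8-byte pointers), `pad_*` members standing in for bytes the table does not describe, and a hypothetical helper `count_unvisited_args` that shows how the recursion-guard bit might be consulted while walking the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field names follow the table; padding members are placeholders. */
typedef struct arg_node {
    struct arg_node *next;      /* +0  next argument in the list */
    uint8_t          kind;      /* +8  0=type, 1=template, 2=non-type, 3=pack */
    uint8_t          pad_9[15]; /* +9..+23, not described in the table */
    uint8_t          flags_24;  /* +24 bit 0: is pack expansion */
    uint8_t          flags_25;  /* +25 bit 4: currently being checked */
    uint8_t          pad_26[6]; /* +26..+31, not described in the table */
    void            *value;     /* +32 type/entity/expression pointer */
} arg_node;

static_assert(offsetof(arg_node, kind) == 8, "kind at +8");
static_assert(offsetof(arg_node, flags_24) == 24, "flags_24 at +24");
static_assert(offsetof(arg_node, flags_25) == 25, "flags_25 at +25");
static_assert(offsetof(arg_node, value) == 32, "value at +32");

#define ARG_BEING_CHECKED 0x10  /* bit 4 of flags_25 */

/* Counts list entries whose recursion guard is not set. */
static size_t count_unvisited_args(const arg_node *head) {
    size_t n = 0;
    for (const arg_node *a = head; a != NULL; a = a->next)
        if (!(a->flags_25 & ARG_BEING_CHECKED))
            n++;
    return n;
}
```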

Entity Node (type/symbol)

Relevant fields for accessibility checking:

| Offset | Size | Field | Description |
|---|---|---|---|
| +8 | 8 | name_entry | Name string pointer (or next scope for unnamed) |
| +24 | 8 | alt_name | Alternative name (for flag bit 3 at +81) |
| +40 | 8 | scope_info | Scope information; +32 from this is the enclosing class/namespace |
| +80 | 1 | access_bits | Bits 0-1: access specifier (0=public, 1=protected, 2=private) |
| +81 | 1 | entity_flags | Bit 2 (0x04): is template specialization; bit 6 (0x40): is anonymous |
| +128 | 8 | base_class | Base class pointer (for class entities) |
| +132 | 1 | type_kind | Type kind: 6/8=pointer/ref, 7=function, 9-11=class/struct/enum, 12=typedef, 13=pointer-to-member, 14=dependent |
| +144 | 8 | canonical | Canonical type (for typedefs: the underlying type) |
| +148 | 1 | subtype_kind | Subkind (for type_kind 12: 10=template-id, 12=elaborated) |
| +152 | 8 | type_info | Type-specific data (template args, function params, etc.) |
| +160 | 1 | template_kind | For template entities: template kind |
| +161 | 1 | visibility | Bit 7 (0x80): private visibility (negative char value) |
| +162 | 2 | extra_flags | Bit 7 (0x80) + bit 9 (0x200): cached accessibility state |

Diagnostic Summary

All 24 diagnostics sorted by category (23 errors and 1 warning):

| # | Category | Tag | Severity |
|---|---|---|---|
| 1 | A | global_function_constexpr | error |
| 2 | A | global_function_consteval | error |
| 3 | A | global_class_decl | error |
| 4 | A | global_friend_definition | error |
| 5 | A | global_exception_spec | error |
| 6 | A | global_function_in_unnamed_inline_ns | error |
| 7 | A | global_function_with_initializer_list | error |
| 8 | A | global_va_list_type | error |
| 9 | B | global_function_pack_not_last | error |
| 10 | B | global_function_multiple_packs | error |
| 11 | C | global_func_local_template_arg | error |
| 12 | C | global_private_type_arg | error |
| 13 | C | global_unnamed_type_arg | error |
| 14 | C | global_lambda_template_arg | error |
| 15 | C | global_private_template_arg | error |
| 16 | C | (texture/surface non-type arg) | error |
| 17 | D | variable_template_private_type_arg | error |
| 18 | D | variable_template_private_template_arg | error |
| 19 | D | variable_template_unnamed_type_template_arg | error |
| 20 | D | variable_template_func_local_template_arg | error |
| 21 | D | variable_template_lambda_template_arg | error |
| 22 | E | extern_kernel_template | error |
| 23 | E | template_global_no_def | error |
| 24 | F | local_type_used_in_global_function | warning |

Category G (grid_constant_incompat_templ_redecl) and Category H (kernel launches from templates...) are counted separately as they span the template/non-template boundary.

Function Map

| Address | Identity | Lines | Role |
|---|---|---|---|
| sub_469F80 | template_arg_is_accessible | 144 | Primary access checker -- dispatches on arg kind |
| sub_469480 | cache_access_result_for | 670 | Hash-cached accessibility analysis |
| sub_46A230 | (walks template arg lists) | 182 | Iterates entity lookup table for arg lists |
| sub_46A5B0 | arg_before_unnamed_template_param_arg | 396 | Handles args before unnamed template params |
| sub_469F30 | (scope resolve helper) | 23 | Resolves scope via cache_access_result_for + entity lookup |
| sub_46ACC0 | (scope walk callback) | -- | Callback passed to IL walker sub_61FE60 |
| sub_467780 | (access check) | -- | Checks C++ access control (public/protected/private) |
| sub_466F40 | (output callback) | -- | Code generation output callback |
| sub_5BFC70 | (pack expansion resolver) | -- | Resolves pack expansion nodes (kind 3) |
| sub_5F9BC0 | (scope resolver) | -- | Resolves entity scope chain |
| sub_5F9C10 | (elaborated type resolver) | -- | Resolves elaborated type specifiers |
| sub_7B2260 | (type equivalence) | -- | Checks structural type equivalence |
| sub_61EC40 | (init visitor) | 27 | Initializes IL tree visitor state |
| sub_61FE60 | (walk expression tree) | 17 | Walks expression tree with callback |

Global Variables

| Global | Address | Description |
|---|---|---|
| xmmword_F05720 | 0xF05720 | Access check cache hash table (384 KB, 16,382 entries x 24 bytes) |
| qword_F05708 | 0xF05708 | Free list head for recycled cache entries |
| qword_F05730 | 0xF05730 | Scope ID array parallel to cache (4 bytes per entry) |
| unk_FE5700 | 0xFE5700 | Entity lookup hash table (512 KB) |
| qword_1065708 | 0x1065708 | Scope stack head (linked list of scope entries) |
| qword_126C5D0 | 0x126C5D0 | Global scope sentinel |
| qword_10657E8 | 0x10657E8 | Current scope context for entity resolution |
| dword_1065848 | 0x1065848 | Extended lambda mode flag |
| dword_1065850 | 0x1065850 | Device stub mode flag |

Cross-References

Constexpr Interpreter

The constexpr interpreter is the compile-time expression evaluation engine inside cudafe++. It lives in EDG 6.6's interpret.c (69 functions at 0x620CE0--0x65DE10, approximately 33,000 decompiled lines) and implements a virtual machine that executes arbitrary C++ expressions during compilation. Its central function, do_constexpr_expression (sub_634740), is the single largest function in the entire cudafe++ binary: 11,205 decompiled lines, 63KB of machine code, 128 unique callees, and 28 self-recursive call sites.

The interpreter exists because C++ constexpr evaluation requires the compiler to act as an execution engine. Since C++11, constexpr has grown from simple return-expression functions to a Turing-complete subset of C++ that includes loops, branches, dynamic memory allocation (C++20), virtual dispatch, exception-like control flow, and -- as of C++26 -- compile-time reflection. The interpreter must evaluate all of these constructs faithfully, track object lifetimes, detect undefined behavior, and convert results back into IL constants.

Key Facts

| Property | Value |
|---|---|
| Source file | interpret.c (69 functions, ~33,000 decompiled lines) |
| Address range | 0x620CE0--0x65DE10 |
| Main evaluator | sub_634740 (do_constexpr_expression), 11,205 lines, 63KB |
| Builtin evaluator | sub_651150 (do_constexpr_builtin_function), 5,032 lines |
| Loop evaluator | sub_644580 (do_constexpr_range_based_for_statement), 2,836 lines |
| Constructor evaluator | sub_6480F0 (do_constexpr_ctor), 1,659 lines |
| Call dispatcher | sub_657560 (do_constexpr_call), 1,445 lines |
| Top-level entry | sub_65AE50 (interpret_expr) |
| Materialization | sub_631110 (copy_interpreter_object_to_constant), 1,444 lines |
| Value extraction | sub_64B580 (extract_value_from_constant), 2,299 lines |
| Arena block size | 64KB (0x10000) |
| Large alloc threshold | 1,024 bytes (0x400) |
| Max type size | 64MB (0x4000000) |
| Uninitialized marker | 0xDB fill pattern |
| Self-recursive calls | 28 (in do_constexpr_expression) |
| Confirmed assert IDs | 38 functions with assert strings |
| C++26 reflection | 8 std::meta::* functions |

Architecture Overview

The interpreter is structured as a tree-walking evaluator with arena-based memory, memoization caching, and a call stack that mirrors C++ function invocation. The rest of the compiler invokes it through interpret_expr, which sets up interpreter state, calls the recursive evaluator, and converts the result back to an IL constant.

  AST expression node
        |
        v
  +-----------------+
  | interpret_expr  |  sub_65AE50 — allocates state, arena, hash table
  +-----------------+
        |
        v
  +---------------------------+
  | do_constexpr_expression   |  sub_634740 — the 11,205-line evaluator
  |                           |  dispatches on expression-kind code
  |  +-- arithmetic ops       |  cases 40-45: +, -, *, /, %
  |  +-- comparisons          |  cases 49-51: <, >, ==, !=, <=, >=
  |  +-- member access        |  cases 3-4: . and ->
  |  +-- type conversions     |  case 5: cast sub-switch (20+ type pairs)
  |  +-- pointer arithmetic   |  cases 46-48, 50: ptr+int, ptr-ptr
  |  +-- function calls ------+---> do_constexpr_call (sub_657560)
  |  +-- constructors   ------+---> do_constexpr_ctor (sub_6480F0)
  |  +-- builtins       ------+---> do_constexpr_builtin_function (sub_651150)
  |  +-- loops          ------+---> do_constexpr_range_based_for (sub_644580)
  |  +-- statements     ------+---> do_constexpr_statement (sub_647850)
  |  +-- dynamic_cast         |  inline within main evaluator
  |  +-- typeid               |  inline within main evaluator
  |  +-- offsetof             |  inline within main evaluator
  |  +-- bit_cast             |  calls translate_*_bytes functions
  +---------------------------+
        |
        v
  +-------------------------------------+
  | copy_interpreter_object_to_constant |  sub_631110 — materializes result
  +-------------------------------------+  back into IL constant nodes
        |
        v
  IL constant (returned to compiler)

Entry Points

The interpreter has multiple entry points, each called from a different compilation phase:

| Entry | Address | Lines | Called from |
|---|---|---|---|
| interpret_expr | sub_65AE50 | 572 | General constexpr evaluation (primary) |
| Entry for expression lowering | sub_65A290 | 311 | Expression lowering phase (sub_6E2040) |
| Entry for expression trees | sub_65A8C0 | 274 | Expression handling (sub_5BB4C0, sub_5C3760) |
| interpret_dynamic_sub_initializers | sub_65CFA0 | 67 | Aggregate initialization |
| Misc entries | sub_65BAB0--sub_65D150 | 150-470 | Template instantiation, static_assert, enum values |

All entry points follow the same pattern: allocate the interpreter state object, initialize the arena and hash table, call do_constexpr_expression, then extract and convert the result.

Interpreter State Object

The interpreter state is a structure passed as the first argument (a1) to every evaluator function. It contains the evaluation stack, heap tracking, memoization cache, and diagnostic context.

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | hash_table | Pointer to variable-to-value hash table |
| +8 | 8 | hash_capacity | Hash table capacity mask (low 32) / entry count (high 32) |
| +16 | 8 | stack_top | Current stack allocation pointer |
| +24 | 8 | stack_base | Base of current arena block |
| +32 | 8 | heap_list | Head of heap allocation chain (large objects) |
| +40 | 4 | scope_depth | Current scope nesting counter |
| +56 | 8 | hash_aux_1 | Auxiliary hash table pointer |
| +64 | 8 | hash_aux_2 | Auxiliary hash table capacity |
| +72 | 8 | call_chain | Current call stack chain (for recursion tracking) |
| +88 | 8 | diag_context_1 | Diagnostic context pointer |
| +96 | 8 | diag_context_2 | Source location for error reporting |
| +112 | 8 | diag_context_3 | Additional diagnostic metadata |
| +132 | 1 | flags_1 | Mode flags (bit 0 = strict mode) |
| +133 | 1 | flags_2 | Additional mode flags |

Memory Model

The interpreter uses a dual-tier memory system: an arena allocator for small objects and direct heap allocation for large ones.

Arena Allocator

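Arena blocks are 64KB (0x10000 bytes) each. Blocks are chained through the next_block pointer at block offset +0, and the interpreter state's stack_base field (state offset +24) points at the current block: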

Block layout:
  +------------------+
  | next_block (+0)  |---> previous block (or null)
  | alloc_ptr  (+8)  |---> current bump position
  | capacity   (+16) |---> end of usable space
  | base       (+24) |---> start of block data
  +------------------+
  | usable space     |  64KB of object storage
  | ...              |
  +------------------+

Allocation follows a bump-pointer pattern:

void *arena_alloc(interp_state *state, size_t size) {
    size = ALIGN_UP(size, 8);
    ptrdiff_t remaining = 0x10000 - (state->stack_top - state->stack_base);
    if (remaining < size) {
        // Allocate new 64KB block, link to chain
        new_block = sub_622D20();
        new_block->next = state->stack_base;
        state->stack_base = new_block;
        state->stack_top = new_block + HEADER_SIZE;
    }
    void *result = state->stack_top;
    state->stack_top += size;
    return result;
}
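For experimentation outside the binary, the same bump-pointer scheme can be written as a small standalone allocator. This is a sketch, not a reconstruction of sub_622D20: the `arena` and `arena_block` types, the use of malloc for block storage, and the explicit `used` counter are all assumptions chosen to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define ARENA_BLOCK_SIZE ((size_t)0x10000)  /* 64KB of usable space per block */

typedef struct arena_block {
    struct arena_block *next;   /* previously active block, or NULL */
    size_t              used;   /* bytes handed out from data[] so far */
    unsigned char       data[ARENA_BLOCK_SIZE];
} arena_block;

typedef struct { arena_block *head; } arena;

/* Returns 8-byte-aligned storage, or NULL if the request does not fit in
   one block (large objects are handled elsewhere) or malloc fails. */
static void *arena_alloc(arena *a, size_t size) {
    if (size > ARENA_BLOCK_SIZE)
        return NULL;
    size = (size + 7u) & ~(size_t)7u;            /* round up to 8 bytes */
    if (a->head == NULL || ARENA_BLOCK_SIZE - a->head->used < size) {
        arena_block *b = malloc(sizeof *b);      /* start a fresh block */
        if (b == NULL)
            return NULL;
        b->next = a->head;
        b->used = 0;
        a->head = b;
    }
    void *result = a->head->data + a->head->used;
    a->head->used += size;
    return result;
}

/* Frees every block in the chain. */
static void arena_release(arena *a) {
    while (a->head != NULL) {
        arena_block *next = a->head->next;
        free(a->head);
        a->head = next;
    }
}
```

Individual objects are never freed; the whole chain is dropped at once, which matches the scoped lifetime of a single constant evaluation.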

Large Object Heap

Objects larger than 1,024 bytes (0x400) bypass the arena and are allocated individually via sub_6B7340 (the compiler's general-purpose allocator). These allocations are tracked through an allocation chain so they can be freed when the interpreter scope exits.

Object Header Layout

Every interpreter object has a header preceding the value bytes:

  offset -10  [-10]  bitmap byte 2 (validity tracking)
  offset  -9  [ -9]  bitmap byte 1 (initialization tracking)
  offset  -8  [ -8]  type pointer (8 bytes, points to type_node)
  offset   0  [  0]  value bytes start here
              ...     value data (size depends on type)

New objects are initialized with their value bytes filled with 0xDB (decimal 219), which serves as an uninitialized-memory sentinel. Any read from an object whose bytes still contain 0xDB triggers error 2700 (access to uninitialized object).
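The header layout and the fill pattern can be illustrated with a small standalone sketch. It assumes a 16-byte header region (so the value bytes stay 8-byte aligned), uses malloc instead of the interpreter's arena, and the function names are hypothetical. How the two bitmap bytes interact with the sentinel check is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define UNINIT_FILL  0xDB  /* sentinel byte from the text */
#define HEADER_BYTES 16    /* assumption: header padded to 16 bytes */

/* Allocates an object and returns a pointer to its value bytes.
   Layout relative to the returned pointer: [-10] and [-9] are bitmap
   bytes (zeroed here), [-8..-1] holds the type pointer. */
static unsigned char *object_new(const void *type, size_t value_size) {
    unsigned char *raw = malloc(HEADER_BYTES + value_size);
    if (raw == NULL)
        return NULL;
    unsigned char *value = raw + HEADER_BYTES;
    memset(raw, 0, HEADER_BYTES);
    memcpy(value - 8, &type, sizeof type);
    memset(value, UNINIT_FILL, value_size);
    return value;
}

/* Reads back the type pointer stored at offset -8. */
static const void *object_type(const unsigned char *value) {
    const void *type;
    memcpy(&type, value - 8, sizeof type);
    return type;
}

/* True if every value byte still carries the fill pattern. */
static int object_looks_uninitialized(const unsigned char *value, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (value[i] != UNINIT_FILL)
            return 0;
    return 1;
}

static void object_free(unsigned char *value) {
    if (value != NULL)
        free(value - HEADER_BYTES);
}
```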

Constexpr Value Representation

Values in the interpreter use a type-dependent representation:

| Type category | kind byte | Value size | Representation |
|---|---|---|---|
| void | 0 | 0 | Flag 0x40 set, no value bytes |
| pointer | 1 | 0 | Stored as reference metadata, not inline bytes |
| integral | 2 | 16 bytes | Two 64-bit words (supports __int128) |
| float | 3 | 16 bytes | IEEE 754 value in first 4/8 bytes, padded |
| double | 4 | 16 bytes | IEEE 754 value in first 8 bytes, padded |
| complex | 5 | 32 bytes | Real + imaginary parts |
| class/struct | 6 | 32 bytes | Reference to interpreter object |
| union | 7 | 32 bytes | Reference to interpreter object |
| array | 8 | N * elem_size | Recursive: element count times element size |
| class (variants) | 9, 10, 11 | Cached | Looked up in type-to-size hash table |
| typedef | 12 | (follow) | Chase to underlying type |
| enum | 13 | 16 bytes | Same as integral |
| nullptr_t | 19 | 32 bytes | Null pointer representation |
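The fixed-size rows of this table translate directly into a lookup function. The sketch below covers only those rows; `value_size_for_kind` is a hypothetical name, and kinds that need the type-to-size cache (9 to 11) or a typedef chase (12) are reported as -1 rather than guessed at.

```c
#include <assert.h>

/* Returns the value size in bytes for a kind byte, or -1 when the size
   depends on information this sketch does not model. For arrays the
   caller supplies the element count and element size. */
static long value_size_for_kind(int kind, long elem_count, long elem_size) {
    assert(elem_count >= 0 && elem_size >= 0);
    switch (kind) {
    case 0:  /* void */
    case 1:  /* pointer: kept as reference metadata */
        return 0;
    case 2:  /* integral */
    case 3:  /* float */
    case 4:  /* double */
    case 13: /* enum */
        return 16;
    case 5:  /* complex */
    case 6:  /* class/struct */
    case 7:  /* union */
    case 19: /* nullptr_t */
        return 32;
    case 8:  /* array */
        return elem_count * elem_size;
    default: /* 9-11 use the cache, 12 follows the underlying type */
        return -1;
    }
}
```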

The reference representation for pointers and class objects uses 32 bytes (two __m128i values). The flag byte at offset +8 within a reference encodes:

| Bit | Meaning |
|---|---|
| 0 | Has concrete object backing |
| 1 | Past-the-end pointer (one past array) |
| 2 | Has allocation chain (from constexpr new) |
| 3 | Has subobject path (member/base offset chain) |
| 4 | Has bitfield information |
| 5 | Is dangling (object lifetime ended) |
| 6 | Is const-qualified |
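Written as C constants, the flag bits look like this. The enumerator names and the `ref_is_readable` helper are hypothetical; the helper encodes the usual C++ rule that a read needs a live backing object and may not go through a past-the-end or dangling pointer, which is an assumption about how the evaluator combines these bits.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the table; names are illustrative only. */
enum ref_flag {
    REF_HAS_OBJECT         = 1 << 0,
    REF_PAST_THE_END       = 1 << 1,
    REF_HAS_ALLOC_CHAIN    = 1 << 2,
    REF_HAS_SUBOBJECT_PATH = 1 << 3,
    REF_HAS_BITFIELD       = 1 << 4,
    REF_DANGLING           = 1 << 5,
    REF_CONST              = 1 << 6
};

/* Returns 1 if a read through a reference with these flags is plausible:
   it has a backing object and is neither past-the-end nor dangling. */
static int ref_is_readable(uint8_t flags) {
    return (flags & REF_HAS_OBJECT) != 0 &&
           (flags & (REF_PAST_THE_END | REF_DANGLING)) == 0;
}
```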

Memoization Hash Table

The interpreter maintains a hash table that maps type pointers to precomputed value sizes, avoiding redundant recursive size computations for class types:

| Global | Purpose |
|---|---|
| qword_126FEC0 | Hash table base pointer |
| qword_126FEC8 | Capacity mask (low 32 bits) / entry count (high 32 bits) |

Each entry is 16 bytes: 8-byte key (type pointer), 4-byte size value, 4-byte padding. Collision resolution uses linear probing with a bitmask. The table grows (via sub_620760) when load factor exceeds 50%.
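A table with these properties (16-byte entries, mask-based linear probing, growth above 50% load) can be sketched as follows. The hash function, the doubling growth policy, and all names are assumptions; the binary's actual hash and the behavior of sub_620760 are not reproduced here. The 16-byte entry size assumes an LP64 target.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { const void *key; uint32_t size; uint32_t pad; } size_entry;
static_assert(sizeof(size_entry) == 16, "8-byte key + 4-byte size + 4 pad");

typedef struct { size_entry *slots; uint32_t mask; uint32_t count; } size_table;

/* Hypothetical pointer hash; any reasonable mix works for the sketch. */
static uint32_t hash_ptr(const void *p) {
    uintptr_t v = (uintptr_t)p;
    return (uint32_t)((v >> 3) ^ (v >> 19));
}

/* capacity must be a power of two so that (capacity - 1) is a bitmask. */
static int size_table_init(size_table *t, uint32_t capacity) {
    t->slots = calloc(capacity, sizeof *t->slots);
    t->mask = capacity - 1;
    t->count = 0;
    return t->slots != NULL;
}

static int size_table_put(size_table *t, const void *key, uint32_t size);

/* Doubles the capacity and reinserts every occupied slot. */
static int size_table_grow(size_table *t) {
    size_table bigger;
    if (!size_table_init(&bigger, (t->mask + 1) * 2))
        return 0;
    for (uint32_t i = 0; i <= t->mask; i++)
        if (t->slots[i].key != NULL)
            size_table_put(&bigger, t->slots[i].key, t->slots[i].size);
    free(t->slots);
    *t = bigger;
    return 1;
}

/* Inserts or updates; key must be non-NULL (NULL marks an empty slot). */
static int size_table_put(size_table *t, const void *key, uint32_t size) {
    assert(key != NULL);
    if ((t->count + 1) * 2 > t->mask + 1 && !size_table_grow(t))
        return 0;                       /* would exceed 50% load, no memory */
    uint32_t i = hash_ptr(key) & t->mask;
    while (t->slots[i].key != NULL && t->slots[i].key != key)
        i = (i + 1) & t->mask;          /* linear probe */
    if (t->slots[i].key == NULL) {
        t->slots[i].key = key;
        t->count++;
    }
    t->slots[i].size = size;
    return 1;
}

/* Returns 1 and stores the cached size in *out if key is present. */
static int size_table_get(const size_table *t, const void *key, uint32_t *out) {
    uint32_t i = hash_ptr(key) & t->mask;
    while (t->slots[i].key != NULL) {
        if (t->slots[i].key == key) { *out = t->slots[i].size; return 1; }
        i = (i + 1) & t->mask;
    }
    return 0;
}
```

Because the load factor never exceeds one half, every probe sequence is guaranteed to reach an empty slot, so neither loop needs a separate termination check.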

Constexpr Allocation Tracking (C++20)

C++20 introduced constexpr dynamic memory allocation (new/delete in constexpr contexts). The interpreter tracks these through a global allocation chain:

| Global | Purpose |
|---|---|
| qword_126FBC0 | Free list head |
| qword_126FBB8 | Outstanding allocation count |

When std::allocator<T>::allocate() is called during constexpr (sub_62B100), the interpreter allocates from its arena, sets bit 2 in the object's flag byte, and links the allocation into the chain. std::allocator<T>::deallocate() (sub_62B470) validates that the freed pointer was actually allocated by std::allocator::allocate() and unlinks it. At the end of constexpr evaluation, any remaining allocations indicate a bug in the evaluated code (memory leaked during constant evaluation).
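The bookkeeping described here amounts to a list of live allocations plus a counter. The sketch below shows that idea in isolation; it uses malloc rather than the interpreter arena, and the `alloc_tracker` type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct alloc_record {
    struct alloc_record *next;
    void                *ptr;
} alloc_record;

typedef struct {
    alloc_record *live;         /* chain of allocations not yet freed */
    long          outstanding;  /* number of entries in the chain */
} alloc_tracker;

/* Allocates n bytes and links the allocation into the chain. */
static void *tracked_allocate(alloc_tracker *t, size_t n) {
    alloc_record *r = malloc(sizeof *r);
    void *p = malloc(n ? n : 1);
    if (r == NULL || p == NULL) { free(r); free(p); return NULL; }
    r->ptr = p;
    r->next = t->live;
    t->live = r;
    t->outstanding++;
    return p;
}

/* Returns 1 if p came from tracked_allocate and is now released,
   0 if p is not a live tracked allocation (the evaluator would
   report an error in that case). */
static int tracked_deallocate(alloc_tracker *t, void *p) {
    for (alloc_record **link = &t->live; *link != NULL; link = &(*link)->next) {
        if ((*link)->ptr == p) {
            alloc_record *r = *link;
            *link = r->next;        /* unlink from the chain */
            free(r->ptr);
            free(r);
            t->outstanding--;
            return 1;
        }
    }
    return 0;
}

/* Nonzero if allocations remain at the end of evaluation. */
static int evaluation_leaked(const alloc_tracker *t) {
    return t->outstanding != 0;
}
```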

The Main Evaluator: do_constexpr_expression

sub_634740 is the heart of the interpreter. It takes four parameters:

// Returns 1 on success, 0 on failure
int do_constexpr_expression(
    interp_state *a1,       // interpreter state
    expr_node    *a2,       // AST expression node to evaluate
    value_buf    *a3,       // output value buffer (32 bytes)
    address_t    *a4        // "home" address for reference tracking
);

The function body is organized as a nested switch statement. The outer switch dispatches on the expression category at *(a2+24), and several cases contain inner switches for further dispatch.

Outer Switch: Expression Categories

int do_constexpr_expression(interp_state *a1, expr_node *a2,
                            value_buf *a3, address_t *a4) {
    int category = *(a2 + 24);    // expression category code
    switch (category) {
    case 0:   // ---- void expression ----
        a3->flags = 0x40;         // mark as void
        return 1;

    case 1:   // ---- operator expression ----
        return eval_operator(a1, a2, a3, a4);    // inner switch on *(a2+40)

    case 10:  // ---- sub-expression wrapper ----
        return do_constexpr_expression(a1, *(a2+40), a3, a4);  // recurse

    case 11:  // ---- typeid expression ----
        return do_constexpr_typeid(a1, a2, a3);  // inline

    case 17:  // ---- statement expression (GNU extension) ----
        return do_constexpr_statement(a1, *(a2+40), a3);  // sub_647850

    case 18:  // ---- variable lookup ----
        return lookup_variable(a1, a2, a3, a4);  // hash table at a1+0

    case 19:  // ---- function / static variable reference ----
        return resolve_static_ref(a1, a2, a3);

    case 21:  // ---- special expressions ----
        return eval_special(a1, a2, a3, a4);     // inner switch on *(a2+40)

    default:
        emit_error(2721);  // "expression is not a constant expression"
        return 0;
    }
}

Inner Switch: Operator Codes (case 1)

The operator expression case dispatches on the operator code at *(a2+40). This is the largest sub-switch, covering 100+ cases:

int eval_operator(interp_state *a1, expr_node *a2,
                  value_buf *a3, address_t *a4) {
    int opcode = *(a2 + 40);
    switch (opcode) {

    // ---- Assignment ----
    case 0: case 1:
        // Evaluate RHS, store to LHS address
        if (!do_constexpr_expression(a1, rhs, &rval, NULL)) return 0;
        if (!do_constexpr_expression(a1, lhs, &lval, NULL)) return 0;
        assign_value(lval.address, &rval, type);
        *a3 = lval;
        return 1;

    // ---- Member access (. and ->) ----
    case 3: case 4:
        // Evaluate base object, compute member offset
        if (!do_constexpr_expression(a1, base_expr, &base, NULL)) return 0;
        member_offset = compute_member_offset(member_decl, base.type);
        a3->address = base.address + member_offset;
        return 1;

    // ---- Type conversion (cast) ----
    case 5:
        return eval_conversion(a1, a2, a3, a4);  // 20+ type-pair sub-switch

    // ---- Parenthesized expression ----
    case 9:
        return do_constexpr_expression(a1, *(a2+48), a3, a4);  // recurse

    // ---- Pointer increment/decrement ----
    case 22: case 23:
        if (!do_constexpr_expression(a1, operand, &val, NULL)) return 0;
        // Validate pointer is within array bounds
        pos = get_runtime_array_pos(val.address);
        if (pos < 0 || pos >= array_size) {
            emit_error(2692);  // array bounds violation
            return 0;
        }
        val.address += (opcode == 22) ? elem_size : -elem_size;
        *a3 = val;
        return 1;

    // ---- Unary negation / bitwise complement ----
    case 26:
        if (!do_constexpr_expression(a1, operand, &val, NULL)) return 0;
        if (is_integer_type(val.type))
            a3->int_val = ~val.int_val;  // or -val.int_val
        else if (is_float_type(val.type))
            a3->float_val = -val.float_val;
        return 1;

    // ---- Arithmetic binary operators ----
    case 40:  // addition
    case 41:  // subtraction
    case 42:  // multiplication
    case 43:  // division
    case 44:  // modulo
    case 45:  // (additional arithmetic)
        if (!do_constexpr_expression(a1, lhs, &left, NULL)) return 0;
        if (!do_constexpr_expression(a1, rhs, &right, NULL)) return 0;
        if ((opcode == 43 || opcode == 44) && right.int_val == 0) {
            emit_error(61);    // division or modulo by zero = UB
            return 0;
        }
        result = apply_arithmetic(opcode, left, right, type);
        if (check_overflow(result, type)) {
            emit_error(2708);  // arithmetic overflow
            return 0;
        }
        *a3 = result;
        return 1;

    // ---- Pointer arithmetic ----
    case 46:  // pointer + integer
    case 47:  // integer + pointer
    case 48:  // pointer - integer
        if (!do_constexpr_expression(a1, ptr_expr, &ptr, NULL)) return 0;
        if (!do_constexpr_expression(a1, int_expr, &idx, NULL)) return 0;
        // Case 48 (pointer - integer) moves backwards, so negate the offset
        delta = (opcode == 48) ? -idx.int_val : idx.int_val;
        // Validate result stays within allocation bounds
        new_pos = get_runtime_array_pos(ptr.address) + delta;
        if (new_pos < 0 || new_pos > array_size) {  // past-the-end is valid
            emit_error(2735);  // pointer arithmetic underflow/overflow
            return 0;
        }
        a3->address = ptr.address + delta * elem_size;
        return 1;

    // ---- Comparison operators ----
    case 49: case 50: case 51:
        if (!do_constexpr_expression(a1, lhs, &left, NULL)) return 0;
        if (!do_constexpr_expression(a1, rhs, &right, NULL)) return 0;
        // Pointer comparison: validate same complete object
        if (is_pointer(left.type) && !same_complete_object(left, right)) {
            emit_error(2734);  // invalid pointer comparison
            return 0;
        }
        a3->int_val = apply_comparison(opcode, left, right);
        return 1;

    // ---- Compound assignment (+=, -=, etc.) ----
    case 74: case 75:
        // Evaluate LHS address, compute new value, store back
        ...

    // ---- Shift operators ----
    case 80: case 81: case 82: case 83:
        // Left shift, right shift (arithmetic and logical)
        ...

    // ---- Array subscript ----
    case 87: case 88: case 89: case 90: case 91:
        if (!do_constexpr_expression(a1, base, &arr, NULL)) return 0;
        if (!do_constexpr_expression(a1, index, &idx, NULL)) return 0;
        if (idx.int_val < 0 || idx.int_val >= array_dimension) {
            emit_error(2692);  // array bounds violation
            return 0;
        }
        a3->address = arr.address + idx.int_val * elem_size;
        return 1;

    // ---- Pointer-to-member dereference (.* and ->*) ----
    case 92: case 93:
        ...

    // ---- sizeof ----
    case 94:
        a3->int_val = compute_sizeof(operand_type);
        return 1;

    // ---- Comma operator ----
    case 103:
        // Evaluate the LHS for side effects and discard its value; a failure still aborts
        if (!do_constexpr_expression(a1, lhs, &discard, NULL)) return 0;
        return do_constexpr_expression(a1, rhs, a3, a4);   // return RHS

    default:
        emit_error(2721);  // not a constant expression
        return 0;
    }
}

Type Conversion Sub-Switch (operator case 5)

Type conversions are one of the most complex parts of the evaluator. The sub-switch dispatches on source/target type pairs and handles overflow detection:

int eval_conversion(interp_state *a1, expr_node *a2,
                    value_buf *a3, address_t *a4) {
    type_node *src_type = source_type(a2);
    type_node *dst_type = target_type(a2);
    int src_kind = src_type->kind;  // offset +132
    int dst_kind = dst_type->kind;

    // Evaluate the operand first
    value_buf operand;
    if (!do_constexpr_expression(a1, *(a2+48), &operand, NULL))
        return 0;

    // Dispatch on type pair
    if (src_kind == 2 && dst_kind == 2) {
        // int -> int: check truncation
        if (!fits_in_target(operand.int_val, dst_type)) {
            emit_error(2707);  // integer overflow in conversion
            return 0;
        }
        a3->int_val = truncate_to(operand.int_val, dst_type);
    }
    else if (src_kind == 2 && (dst_kind == 3 || dst_kind == 4)) {
        // int -> float/double
        a3->float_val = (double)operand.int_val;
    }
    else if ((src_kind == 3 || src_kind == 4) && dst_kind == 2) {
        // float/double -> int: check overflow
        if (operand.float_val > INT_MAX || operand.float_val < INT_MIN) {
            emit_error(2728);  // floating-point conversion overflow
            return 0;
        }
        a3->int_val = (int64_t)operand.float_val;
    }
    else if (src_kind == 1 && dst_kind == 2) {
        // pointer -> int (reinterpret_cast)
        if (!cuda_allows_reinterpret_cast()) {  // dword_106C2C0
            emit_error(2727);  // invalid conversion
            return 0;
        }
    }
    else if (src_kind == 6 && dst_kind == 6) {
        // class -> class (derived-to-base or base-to-derived)
        a3->address = adjust_pointer_for_base(operand.address, src_type, dst_type);
    }
    else if (src_kind == 19 && dst_kind == 1) {
        // nullptr_t -> pointer
        a3->address = 0;
        a3->flags |= 0;  // null pointer
    }
    // ... 15+ additional type pairs ...
    return 1;
}

Variable Lookup (case 18)

When the evaluator encounters a variable reference, it looks up the variable's current value in the interpreter's hash table:

int lookup_variable(interp_state *a1, expr_node *a2,
                    value_buf *a3, address_t *a4) {
    void *var_key = get_variable_entity(a2);
    uint64_t *table = a1->hash_table;                 // offset +0
    uint32_t  mask  = (uint32_t)a1->hash_capacity;    // offset +8, low 32 bits only

    // Linear-probing hash lookup; each entry is a (key, value address) pair
    uint32_t idx = hash(var_key) & mask;
    while (table[idx * 2] != 0) {
        if (table[idx * 2] == (uint64_t)var_key) {
            // Found: load value from stored address
            void *value_addr = (void *)table[idx * 2 + 1];
            load_value(a3, value_addr, get_type(a2));
            return 1;
        }
        idx = (idx + 1) & mask;
    }
    // Variable not in scope -> likely a static/global constexpr
    return resolve_static_ref(a1, a2, a3);
}

Function Call Dispatch: do_constexpr_call

sub_657560 (1,445 lines) handles all function call evaluation during constexpr. It is the central dispatcher that routes calls to the appropriate evaluator based on the callee kind.

int do_constexpr_call(interp_state *a1, expr_node *call_expr,
                      value_buf *result, address_t *home) {
    // 1. Resolve the callee
    func_info callee;
    if (!eval_constexpr_callee(a1, call_expr, &callee))  // sub_643FE0
        return 0;

    // 2. Check recursion depth
    int depth = count_call_chain(a1->call_chain);
    if (depth > MAX_CONSTEXPR_DEPTH) {
        emit_error(2701);  // constexpr evaluation exceeded depth limit
        return 0;
    }

    // 3. Dispatch by callee kind
    if (callee.is_builtin) {
        // Route to builtin evaluator
        return do_constexpr_builtin_function(       // sub_651150
            a1, callee.descriptor, args, result, &success);
    }

    if (callee.is_constructor) {
        // Route to constructor evaluator
        return do_constexpr_ctor(a1, callee, args,  // sub_6480F0
                                 result, home);
    }

    if (callee.is_destructor) {
        // Route to destructor evaluator (two variants)
        return do_constexpr_dtor(a1, callee, args,  // sub_64EFE0 or sub_64FB10
                                 result);
    }

    if (callee.is_virtual) {
        // Virtual dispatch: resolve through vtable
        func_info resolved = resolve_virtual_call(callee, this_obj);
        if (!resolved.is_constexpr) {
            emit_error(269);  // virtual function is not constexpr
            return 0;
        }
        callee = resolved;
    }

    // 4. Check that function body is available
    if (!callee.has_body) {
        emit_error(2823);  // constexpr function not defined
        return 0;
    }

    // 5. Push call frame
    call_frame frame;
    int ok = 0;  // declared before the goto; stays 0 if argument binding fails
    frame.prev = a1->call_chain;
    a1->call_chain = &frame;
    frame.func = callee.entity;

    // 6. Bind arguments to parameters
    for (int i = 0; i < callee.param_count; i++) {
        value_buf arg_val;
        if (!do_constexpr_expression(a1, args[i], &arg_val, NULL))
            goto cleanup;
        bind_parameter(a1, callee.params[i], &arg_val);
    }

    // 7. Evaluate function body
    ok = do_constexpr_statement(a1, callee.body, result);  // sub_647850

    // 8. Pop call frame, clean up allocations
cleanup:
    a1->call_chain = frame.prev;
    release_allocation_chain(a1, &frame);  // sub_633EC0
    return ok;
}

Callee Resolution: eval_constexpr_callee

sub_643FE0 (305 lines) resolves the callee expression of a function call. It handles direct calls, virtual dispatch (vtable lookup), and pointer-to-member-function calls. For virtual calls, it resolves overrides by walking the vtable of the most-derived type of the object being called on.

Recursion Depth Tracking

The interpreter tracks call depth through the call_chain linked list at offset +72 in the interpreter state. Each do_constexpr_call invocation pushes a frame; each return pops it. The chain is also used for diagnostic output -- when a constexpr evaluation fails, the error message includes the call stack showing how the offending expression was reached.
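A minimal sketch of this push/pop discipline follows. The type layouts, the `push_frame`/`pop_frame` helpers, and the `max_depth` parameter are hypothetical; the real frame carries more fields and the actual depth limit is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced frame: only the link and a label. */
typedef struct call_frame {
    struct call_frame *prev;
    const char        *func;
} call_frame;

typedef struct {
    call_frame *call_chain;  /* innermost active call, or NULL */
} interp_state;

/* Depth is the length of the chain; the same walk can be reused to
 * print a call stack in diagnostics. */
static int count_call_chain(const call_frame *f) {
    int depth = 0;
    for (; f != NULL; f = f->prev)
        depth++;
    return depth;
}

/* Links `frame` in front of the chain unless the limit is reached.
 * Returns 1 on success, 0 when the call would exceed max_depth. */
static int push_frame(interp_state *st, call_frame *frame,
                      const char *func, int max_depth) {
    if (count_call_chain(st->call_chain) >= max_depth)
        return 0;
    frame->prev = st->call_chain;
    frame->func = func;
    st->call_chain = frame;
    return 1;
}

/* Unlinks the innermost frame. */
static void pop_frame(interp_state *st) {
    assert(st->call_chain != NULL);
    st->call_chain = st->call_chain->prev;
}
```

Because frames live on the native stack of the evaluator in the pseudocode, popping is just restoring the previous head pointer; no allocation is involved.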

Constructor Evaluation: do_constexpr_ctor

sub_6480F0 (1,659 lines) evaluates constructor calls during constexpr. It implements the full C++ construction sequence:

int do_constexpr_ctor(interp_state *a1, func_info *ctor,
                      expr_node **args, value_buf *result,
                      address_t *target_addr) {
    class_type *cls = ctor->parent_class;
    value_buf val;  // scratch result for sub-initializers

    // 1. Initialize virtual base classes (if most-derived)
    for (vbase in cls->virtual_bases) {
        address_t vbase_addr = target_addr + vbase.offset;
        if (vbase.has_initializer) {
            if (!do_constexpr_expression(a1, vbase.init, &val, &vbase_addr))
                return 0;
        } else {
            init_subobject_to_zero(vbase_addr, vbase.type);  // sub_62C030
        }
    }

    // 2. Initialize non-virtual base classes
    for (base in cls->bases) {
        address_t base_addr = target_addr + base.offset;
        if (base.has_ctor_call) {
            if (!do_constexpr_ctor(a1, base.ctor, base.args,
                                   &val, &base_addr))
                return 0;
        }
    }

    // 3. Initialize data members (in declaration order)
    for (member in cls->members) {
        address_t mem_addr = target_addr + member.offset;
        if (member.has_mem_initializer) {
            // From constructor's member-initializer-list
            if (!do_constexpr_expression(a1, member.init, &val, &mem_addr))
                return 0;
        } else if (member.has_default_initializer) {
            // From in-class default member initializer
            if (!do_constexpr_expression(a1, member.default_init,
                                         &val, &mem_addr))
                return 0;
        } else {
            // Default-initialize (zero for trivial types)
            init_subobject_to_zero(mem_addr, member.type);
        }
    }

    // 4. Execute constructor body (if non-trivial)
    if (ctor->has_body) {
        if (!do_constexpr_statement(a1, ctor->body, result))
            return 0;
    }

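    // 5. Handle delegating constructors
    //    (listed last in this summary; in C++ the target constructor runs
    //    in place of steps 1-3 and before the delegating constructor's body)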
    if (ctor->is_delegating) {
        return do_constexpr_ctor(a1, ctor->delegate_target,
                                 args, result, target_addr);
    }

    // 6. For trivial copy/move, use memcpy optimization
    if (ctor->is_trivial_copy) {
        copy_interpreter_subobject(target_addr, source_addr, cls);  // sub_6337D0
        return 1;
    }

    return 1;
}

Loop Evaluation: do_constexpr_range_based_for_statement

sub_644580 (2,836 lines) evaluates all loop constructs during constexpr: for, while, do-while, and range-based for. It is self-recursive for nested loops.

int do_constexpr_range_based_for_statement(
        interp_state *a1, stmt_node *loop, value_buf *result) {

    // --- Range-based for ---
    if (loop->kind == RANGE_FOR) {
        // 1. Evaluate range expression: auto&& __range = <expr>
        value_buf range_val;
        if (!do_constexpr_expression(a1, loop->range_expr, &range_val, NULL))
            return 0;

        // 2. Evaluate begin() and end()
        value_buf begin_val, end_val;
        if (!do_constexpr_call(a1, loop->begin_call, &begin_val, NULL))
            return 0;
        if (!do_constexpr_call(a1, loop->end_call, &end_val, NULL))
            return 0;

        // 3. Loop: while (begin != end)
        while (true) {
            // Evaluate condition: begin != end
            value_buf cond;
            if (!do_constexpr_expression(a1, loop->condition, &cond, NULL))
                return 0;
            if (!cond.int_val)
                break;  // loop finished

            // Bind loop variable: auto x = *begin
            value_buf elem;
            if (!do_constexpr_expression(a1, loop->deref_expr, &elem, NULL))
                return 0;
            bind_parameter(a1, loop->loop_var, &elem);

            // Execute loop body (routed through the loop body evaluator, sub_6593C0)
            int body_result = do_constexpr_statement(
                a1, loop->body, result);

            if (body_result == BREAK)   break;
            if (body_result == RETURN)  return body_result;
            // CONTINUE falls through to increment

            // Increment iterator: ++begin
            if (!do_constexpr_expression(a1, loop->increment, &begin_val, NULL))
                return 0;

            // Destroy loop variable for this iteration
            cleanup_iteration(a1, loop->loop_var);  // sub_658CE0
        }
        return 1;
    }

    // --- Traditional for/while/do-while ---
    if (loop->kind == FOR_LOOP) {
        // Initialize
        if (loop->init_stmt &&
            !do_constexpr_statement(a1, loop->init_stmt, NULL))
            return 0;

        while (true) {
            // Condition (a failed evaluation aborts the loop)
            if (loop->condition) {
                value_buf cond;
                if (!do_constexpr_expression(a1, loop->condition, &cond, NULL))
                    return 0;
                if (!cond.int_val) break;
            }
            // Body
            int r = do_constexpr_statement(a1, loop->body, result);
            if (r == BREAK)  break;
            if (r == RETURN) return r;
            // Increment (also reached after CONTINUE)
            if (loop->increment &&
                !do_constexpr_expression(a1, loop->increment, NULL, NULL))
                return 0;
        }
    }
    return 1;
}

The loop body evaluation is delegated to sub_6593C0 (816 lines), which handles per-iteration variable binding, break/continue/return propagation, and destruction of loop-scoped temporaries.

Statement Evaluation: do_constexpr_statement

sub_647850 (509 lines) evaluates compound statements, declarations, branches, and switch statements during constexpr:

int do_constexpr_statement(interp_state *a1, stmt_node *stmt,
                           value_buf *result) {
    switch (stmt->kind) {
    case COMPOUND:
        // Push scope, evaluate each sub-statement, pop scope
        a1->scope_depth++;
        for (s in stmt->children) {
            int r = do_constexpr_statement(a1, s, result);
            if (r == RETURN || r == BREAK || r == CONTINUE)
                { a1->scope_depth--; return r; }
        }
        a1->scope_depth--;
        return OK;

    case DECLARATION:
        // Allocate interpreter storage, evaluate initializer
        return do_constexpr_init_variable(a1, stmt->decl, result);

    case IF_STMT: {
        value_buf cond;
        if (!do_constexpr_expression(a1, stmt->condition, &cond, NULL))
            return 0;  // condition is not a constant expression
        if (cond.int_val)
            return do_constexpr_statement(a1, stmt->then_branch, result);
        else if (stmt->else_branch)
            return do_constexpr_statement(a1, stmt->else_branch, result);
        return OK;
    }

    case SWITCH_STMT: {
        value_buf switch_val;
        if (!do_constexpr_expression(a1, stmt->condition, &switch_val, NULL))
            return 0;
        // Find matching case label
        stmt_node *case_label = find_case(stmt->cases, switch_val.int_val);
        if (!case_label)
            return OK;  // no label selected: the switch body is skipped
        return do_constexpr_statement(a1, case_label->body, result);
    }

    case RETURN_STMT:
        if (stmt->return_expr)
            do_constexpr_expression(a1, stmt->return_expr, result, NULL);
        return RETURN;

    case FOR_STMT: case WHILE_STMT: case DO_STMT: case RANGE_FOR:
        return do_constexpr_range_based_for_statement(  // sub_644580
            a1, stmt, result);

    case BREAK_STMT:    return BREAK;
    case CONTINUE_STMT: return CONTINUE;

    case TRY_STMT:
        // try/catch in constexpr (C++26 direction, partially supported)
        ...
    }
}
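The way BREAK, CONTINUE, and RETURN travel outward through nested statements can be modelled with a toy evaluator. This is a sketch under stated assumptions, not the recovered code: the `status` and `stmt_kind` enums, the `stmt` layout, the fixed iteration bound, and the `counter` hook are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes; 0 is reserved for evaluation failure. */
typedef enum { ST_FAIL = 0, ST_OK, ST_BREAK, ST_CONTINUE, ST_RETURN } status;
typedef enum { K_NOP, K_BREAK, K_CONTINUE, K_RETURN, K_BLOCK, K_LOOP } stmt_kind;

typedef struct stmt {
    stmt_kind           kind;
    const struct stmt **children;    /* K_BLOCK: statements; K_LOOP: [0] is the body */
    size_t              child_count;
    int                 iterations;  /* K_LOOP only: bound used by this toy model */
    int                *counter;     /* K_NOP only: bumped each time it executes */
} stmt;

static status run(const stmt *s) {
    switch (s->kind) {
    case K_NOP:
        if (s->counter) ++*s->counter;
        return ST_OK;
    case K_BREAK:    return ST_BREAK;
    case K_CONTINUE: return ST_CONTINUE;
    case K_RETURN:   return ST_RETURN;
    case K_BLOCK:
        /* A block stops at the first non-OK child and hands the status up. */
        for (size_t i = 0; i < s->child_count; i++) {
            status r = run(s->children[i]);
            if (r != ST_OK)
                return r;
        }
        return ST_OK;
    case K_LOOP:
        /* A loop absorbs BREAK and CONTINUE; RETURN and failure pass through. */
        for (int i = 0; i < s->iterations; i++) {
            status r = run(s->children[0]);
            if (r == ST_BREAK)
                break;
            if (r == ST_RETURN || r == ST_FAIL)
                return r;
        }
        return ST_OK;
    }
    return ST_FAIL;
}
```

The design point is that only loops consume BREAK and CONTINUE, while compound statements merely forward them, which matches the early-exit check in the COMPOUND case above.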

Builtin Function Evaluation: do_constexpr_builtin_function

sub_651150 (5,032 lines) evaluates compiler intrinsics and __builtin_* functions at compile time. It dispatches on the builtin function ID (a 16-bit value at byte offset +168 of the function descriptor a2), using a sparse comparison tree rather than a dense switch table.

int do_constexpr_builtin_function(
        interp_state *a1,
        func_desc    *a2,       // function descriptor
        value_buf   **a3,       // argument array
        value_buf    *a4,       // result buffer
        int          *a5) {     // success/failure output

    uint16_t builtin_id = *(uint16_t *)((char *)a2 + 168);  // byte offset +168

    // --- Arithmetic overflow detection ---
    // __builtin_add_overflow, __builtin_sub_overflow, __builtin_mul_overflow
    if (builtin_id == BUILTIN_ADD_OVERFLOW) {
        int64_t a = a3[0]->int_val, b = a3[1]->int_val;
        bool overflow;
        int64_t result = checked_add(a, b, &overflow);
        a3[2]->int_val = result;     // write to output parameter
        a4->int_val = overflow ? 1 : 0;
        return 1;
    }

    // --- Bit manipulation ---
    // __builtin_clz, __builtin_ctz, __builtin_popcount, __builtin_parity
    if (builtin_id == BUILTIN_CLZ) {
        // __builtin_clz takes unsigned int; the l/ll variants need the wider forms
        unsigned int val = (unsigned int)a3[0]->int_val;
        if (val == 0) { emit_error(61); return 0; }  // UB: clz(0)
        a4->int_val = __builtin_clz(val);
        return 1;
    }
    if (builtin_id == BUILTIN_POPCOUNT) {
        a4->int_val = __builtin_popcount((unsigned int)a3[0]->int_val);
        return 1;
    }
    if (builtin_id == BUILTIN_BSWAP32) {
        a4->int_val = __builtin_bswap32((uint32_t)a3[0]->int_val);
        return 1;
    }

    // --- String operations ---
    // __builtin_strlen, __builtin_strcmp, __builtin_memcmp,
    // __builtin_strchr, __builtin_memchr
    if (builtin_id == BUILTIN_STRLEN) {
        char *str = get_interpreter_string(a1, a3[0]);
        a4->int_val = strlen(str);
        return 1;
    }
    if (builtin_id == BUILTIN_STRCMP) {
        char *s1 = get_interpreter_string(a1, a3[0]);
        char *s2 = get_interpreter_string(a1, a3[1]);
        a4->int_val = strcmp(s1, s2);
        return 1;
    }

    // --- Floating-point classification ---
    // __builtin_isnan, __builtin_isinf, __builtin_isfinite,
    // __builtin_fpclassify, __builtin_huge_val, __builtin_nan
    if (builtin_id == BUILTIN_ISNAN) {
        a4->int_val = isnan(a3[0]->float_val) ? 1 : 0;
        return 1;
    }
    if (builtin_id == BUILTIN_NAN) {
        char *tag = get_interpreter_string(a1, a3[0]);
        a4->float_val = nan(tag);
        return 1;
    }

    // --- C++20/23 bit operations ---
    // std::bit_cast (via __builtin_bit_cast)
    if (builtin_id == BUILTIN_BIT_CAST) {
        // Serialize source object to target-format bytes
        translate_interpreter_object_to_target_bytes(  // sub_62A490
            a1, a3[0], byte_buffer);
        // Deserialize into destination type
        translate_target_bytes_to_interpreter_object(  // sub_62C670
            a1, byte_buffer, a4, dst_type);
        return 1;
    }

    // --- Type traits ---
    // __is_constant_evaluated()
    if (builtin_id == BUILTIN_IS_CONSTANT_EVALUATED) {
        a4->int_val = 1;  // always true inside constexpr evaluator
        return 1;
    }

    // --- Memory operations ---
    // __builtin_memcpy, __builtin_memmove
    if (builtin_id == BUILTIN_MEMCPY) {
        // Copy N bytes between interpreter objects
        copy_interpreter_bytes(a3[0]->address, a3[1]->address,
                               a3[2]->int_val);
        *a4 = *a3[0];  // return dest pointer
        return 1;
    }

    // ... 50+ additional builtin categories ...

    emit_error(2721);  // builtin not evaluable at compile time
    return 0;
}
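Two of the helpers used above are easy to get subtly wrong, so here is a minimal portable sketch. `checked_add` mirrors the hypothetical helper named in the pseudocode; `clz_width` is an invented name that illustrates why the operand width matters (counting leading zeros of 1 gives 31 for a 32-bit operand but 63 for a 64-bit one). Neither is taken from the binary.

```c
#include <assert.h>
#include <stdint.h>

/* Signed 64-bit addition with overflow detection. The overflow test is
 * done before adding, so no signed overflow ever occurs; the wrapped
 * result is computed in unsigned arithmetic. */
static int64_t checked_add(int64_t a, int64_t b, int *overflow) {
    *overflow = (b > 0 && a > INT64_MAX - b) ||
                (b < 0 && a < INT64_MIN - b);
    return (int64_t)((uint64_t)a + (uint64_t)b);
}

/* Leading-zero count for an operand that is `bits` wide (1..64),
 * written without compiler builtins. val must be nonzero, matching
 * the clz(0) rejection in the pseudocode. */
static int clz_width(uint64_t val, int bits) {
    assert(val != 0 && bits >= 1 && bits <= 64);
    int zeros = 0;
    for (int i = bits - 1; i >= 0 && !((val >> i) & 1); i--)
        zeros++;
    return zeros;
}
```

An evaluator that stores every integer in a 64-bit slot has to carry the declared operand width alongside the value, otherwise width-dependent builtins such as clz and popcount return results for the wrong type.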

Builtin Categories Summary

| Category | Examples | Count |
|---|---|---|
| Arithmetic overflow | __builtin_add_overflow, __builtin_mul_overflow | 3 |
| Bit manipulation | __builtin_clz, __builtin_ctz, __builtin_popcount, __builtin_bswap | 8+ |
| String operations | __builtin_strlen, __builtin_strcmp, __builtin_memcmp, __builtin_strchr | 6+ |
| Math/FP classify | __builtin_isnan, __builtin_isinf, __builtin_huge_val, __builtin_nan | 8+ |
| Type queries | __is_constant_evaluated, __has_unique_object_representations | 4+ |
| Memory operations | __builtin_memcpy, __builtin_memmove | 3+ |
| C++20/23 <bit> | std::bit_cast, std::bit_ceil, std::bit_floor, std::countl_zero | 8+ |
| Atomic (limited) | Constexpr-evaluable atomic subset | 2+ |

Destructor Evaluation

Two functions handle constexpr destructor calls, splitting responsibilities:

do_constexpr_dtor variant 1 (sub_64EFE0, 503 lines) -- Evaluates the destructor body itself. Runs the user-written destructor code, then destroys members in reverse declaration order.

do_constexpr_dtor variant 2 / perform_destructions (sub_64FB10, 877 lines) -- Handles the full destruction sequence including base class destructors and array element destruction. Also implements perform_destructions, the post-evaluation cleanup that destroys all constexpr-created objects when their scope ends.

Materialization: Interpreter Objects to IL Constants

After constexpr evaluation completes, the interpreter's internal objects must be converted back into IL constant nodes that the rest of the compiler can consume.

copy_interpreter_object_to_constant

sub_631110 (1,444 lines) traverses the interpreter's memory representation of an object and builds the corresponding IL constant tree:

il_node *copy_interpreter_object_to_constant(
        interp_state *a1, address_t obj_addr, type_node *type) {
    int kind = type->kind;

    switch (kind) {
    case 2: case 13:  // integer, enum
        return make_integer_constant(load_int(obj_addr), type);

    case 3: case 4:   // float, double
        return make_float_constant(load_float(obj_addr), type);

    case 1:           // pointer
        if (is_null_pointer(obj_addr))
            return make_null_pointer_constant(type);
        // Non-null: build address expression with relocation
        return make_address_constant(
            translate_interpreter_offset(obj_addr),  // inline helper
            type);

    case 6: case 9: case 10: case 11:  // class/struct/union
        il_node *result = make_aggregate_constant(type);
        // Recursively convert each member
        for (member in get_members(type)) {
            address_t mem_addr = obj_addr + member.offset;
            il_node *mem_val = copy_interpreter_object_to_constant(
                a1, mem_addr, member.type);
            add_member_to_aggregate(result, mem_val);
        }
        return result;

    case 8:           // array
        il_node *result = make_array_constant(type);
        for (int i = 0; i < array_dimension(type); i++) {
            address_t elem_addr = obj_addr + i * elem_size;
            il_node *elem = copy_interpreter_object_to_constant(
                a1, elem_addr, elem_type);
            add_element_to_array(result, elem);
        }
        return result;
    }
}

This function also contains get_reflection_string_entry and translate_interpreter_offset as inlined helpers -- the former handles C++26 reflection string extraction, and the latter converts interpreter memory addresses into IL address expressions with proper relocations.

extract_value_from_constant (reverse direction)

sub_64B580 (2,299 lines) performs the inverse: given an IL constant node (from a previously evaluated constexpr), it extracts the value into the interpreter's internal representation. This is used when a constexpr function references another constexpr variable whose value was already computed.

__builtin_bit_cast Support

Two functions implement the byte-level serialization needed for std::bit_cast:

translate_interpreter_object_to_target_bytes (sub_62A490, 461 lines) -- Serializes an interpreter object to a target-format byte sequence. Must handle endianness conversion, padding bytes, and bitfield layout according to the target ABI.

translate_target_bytes_to_interpreter_object (sub_62C670, 529 lines) -- Deserializes target-format bytes back into an interpreter object. Validates that the source bytes represent a valid value for the destination type (e.g., no trap representations for bool).
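The two directions can be illustrated with a small round-trip sketch. It assumes a little-endian target and treats only 0 and 1 as valid bool representations; the helper names `store_le`, `load_le`, and `load_bool` are hypothetical and do not correspond to functions in the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Serialize the low `size` bytes of `value` in little-endian order,
 * independent of the host byte order. */
static void store_le(uint64_t value, unsigned char *out, size_t size) {
    assert(size <= 8);
    for (size_t i = 0; i < size; i++)
        out[i] = (unsigned char)(value >> (8 * i));
}

/* Rebuild an integer from `size` little-endian bytes. */
static uint64_t load_le(const unsigned char *in, size_t size) {
    assert(size <= 8);
    uint64_t value = 0;
    for (size_t i = 0; i < size; i++)
        value |= (uint64_t)in[i] << (8 * i);
    return value;
}

/* Deserialize a bool, rejecting bytes that are not a valid value.
 * Returns 1 on success, 0 when the byte is neither 0 nor 1. */
static int load_bool(const unsigned char *in, int *out) {
    if (in[0] > 1)
        return 0;
    *out = in[0];
    return 1;
}
```

Working through shifts rather than memcpy keeps the byte order a property of the target description instead of the machine running the compiler, which is the reason a serializer of this kind exists at all.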

C++20 Constexpr Memory Support

std::allocator::allocate

sub_62B100 (do_constexpr_std_allocator_allocate, 177 lines) -- Handles new expressions in constexpr context. Allocates from the interpreter arena, sets the allocation-chain flag (bit 2), and links the allocation into the tracking chain.

std::allocator::deallocate

sub_62B470 (do_constexpr_std_allocator_deallocate, 195 lines) -- Handles delete in constexpr context. Validates the pointer was allocated by std::allocator::allocate() by searching the allocation chain (qword_126FBC0 / qword_126FBB8).
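The validate-then-unlink step can be sketched as a singly linked chain of allocation records. The record layout and the `chain_track`/`chain_release` helpers are hypothetical; the real chain is rooted in the globals named above and its node format is not documented here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocation record. */
typedef struct alloc_record {
    struct alloc_record *next;
    void                *base;
    size_t               size;
} alloc_record;

typedef struct {
    alloc_record *head;
} alloc_chain;

/* Link a new record at the head of the chain. */
static void chain_track(alloc_chain *c, alloc_record *rec,
                        void *base, size_t size) {
    rec->base = base;
    rec->size = size;
    rec->next = c->head;
    c->head   = rec;
}

/* Deallocate: succeeds only if `ptr` is the exact base address of a
 * tracked allocation. Returns 1 and unlinks the record on success,
 * 0 for unknown pointers, interior pointers, and double frees. */
static int chain_release(alloc_chain *c, const void *ptr) {
    for (alloc_record **link = &c->head; *link != NULL; link = &(*link)->next) {
        if ((*link)->base == ptr) {
            *link = (*link)->next;
            return 1;
        }
    }
    return 0;
}
```

Any records still linked when evaluation finishes would correspond to allocations that were never freed, which a constant evaluator has to diagnose rather than leak.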

std::construct_at

sub_64F920 (do_constexpr_std_construct_at, 108 lines) -- Handles std::construct_at() (C++20). Validates the target pointer, then delegates to do_constexpr_ctor for actual construction.

C++26 Reflection Support

EDG 6.6 includes experimental support for the P2996 compile-time reflection proposal. Eight dedicated functions implement std::meta::* operations:

| Function | Address | Lines | Reflection operation |
|---|---|---|---|
| do_constexpr_std_meta_substitute | sub_628510 | 526 | std::meta::substitute() -- template argument substitution |
| do_constexpr_std_meta_enumerators_of | sub_62EB00 | 342 | std::meta::enumerators_of() -- enum value list |
| do_constexpr_std_meta_subobjects_of | sub_62F0B0 | 434 | std::meta::subobjects_of() -- all subobjects |
| do_constexpr_std_meta_bases_of | sub_62F7B0 | 339 | std::meta::bases_of() -- base class list |
| do_constexpr_std_meta_nonstatic_data_members_of | sub_62FD30 | 308 | std::meta::nonstatic_data_members_of() |
| do_constexpr_std_meta_static_data_members_of | sub_630280 | 308 | std::meta::static_data_members_of() |
| do_constexpr_std_meta_members_of | sub_6307E0 | 590 | std::meta::members_of() -- all members |
| do_constexpr_std_meta_define_class | sub_65DE10 | 553 | std::meta::define_class() -- class synthesis |

These functions operate on "infovecs" -- information vectors created by make_infovec (sub_62E1B0, 241 lines) that encode reflection metadata as interpreter-internal objects. The get_interpreter_string and get_interpreter_string_length helpers (also within sub_65DE10) extract string values from these infovecs for operations that take string parameters (member names, type names).

The define_class operation is particularly notable: it allows constexpr code to synthesize entirely new class types at compile time, a capability that goes beyond simple introspection.

CUDA-Specific Constexpr Behavior

The interpreter checks several global flags to relax standard constexpr restrictions for CUDA device code:

| Global | Purpose |
|---|---|
| dword_106C2C0 | Controls reinterpret_cast semantics in device constexpr |
| dword_106C1D8 | Controls pointer dereference behavior (likely --expt-relaxed-constexpr) |
| dword_106C1E0 | Controls typeid availability in device constexpr |
| dword_126EFAC | CUDA mode flag (enables/disables constexpr relaxations) |
| dword_126EFA4 | Secondary CUDA mode flag (combined with EFAC for fine control) |

Standard C++ forbids reinterpret_cast, typeid on polymorphic operands (before C++20), and certain pointer operations in constant expressions. CUDA relaxes these restrictions because GPU programming patterns frequently require type punning and address manipulation that the standard deems non-constant. When these flags are set, the interpreter suppresses the corresponding error codes and evaluates the expression as if it were permitted.

Language Version Gates

| Global | Check | Meaning |
|---|---|---|
| qword_126EF98 | > 0x222DF (139,999) | C++20 features enabled (standard 202002) |
| qword_126EF98 | > 0x15F8F (89,999) | C++14 features enabled (standard 201402) |
| dword_126EFB4 | == 2 | Full C++20+ compilation mode |
| dword_126EF68 | >= 202001 | C++20 constexpr dynamic allocation enabled |

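These version checks gate features like constexpr new/delete (C++20), constexpr dynamic_cast and typeid (C++20), and constexpr virtual dispatch (C++20). Neither hexadecimal threshold matches a __cplusplus value (201402 or 202002), and the Name Mangling page lists qword_126EF98 as an ABI version selector, so the C++14/C++20 reading of the first two rows should be treated as tentative.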
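The two hexadecimal thresholds are worth converting by hand, since 0x222DF is 139,999 and 0x15F8F is 89,999, so each `>` comparison is equivalent to `>=` against a round number (140,000 and 90,000). The sketch below only checks that arithmetic; the enum names and the `passes_gate` helper are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Thresholds as they appear in the comparisons; names are invented. */
enum {
    GATE_HIGH = 0x222DF,  /* 139999 */
    GATE_LOW  = 0x15F8F   /*  89999 */
};

/* The gate form used in the binary: strictly greater than the threshold. */
static int passes_gate(int64_t value, int64_t threshold) {
    return value > threshold;
}
```

Note that a __cplusplus-style value such as 201402 already exceeds both thresholds, which is one reason to be cautious about reading these gates as language-standard checks.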

Error Codes

The interpreter emits detailed diagnostics when constexpr evaluation fails. Each error code identifies a specific category of failure:

| Error | Meaning |
|---|---|
| 61 | Undefined behavior detected (division by zero, clz(0), etc.) |
| 269 | Virtual function called is not constexpr |
| 286 | Pure virtual function called |
| 2691 | Invalid pointer comparison direction |
| 2692 | Array bounds violation |
| 2700 | Access to uninitialized object |
| 2701 | Constexpr evaluation exceeded depth limit |
| 2707 | Integer overflow in type conversion |
| 2708 | Arithmetic overflow in computation |
| 2721 | Expression is not a constant expression |
| 2725 | Type too large for constexpr evaluation (> 64MB) |
| 2727 | Invalid type conversion in constexpr |
| 2728 | Floating-point conversion overflow |
| 2734 | Invalid pointer comparison (different complete objects) |
| 2735 | Pointer arithmetic out of bounds |
| 2751 | Null pointer dereference |
| 2760 | Pointer-to-member dereference failure |
| 2766 | Null pointer arithmetic |
| 2808 | Class too large for constexpr representation |
| 2823 | Constexpr function body not available |
| 2879 | offsetof on invalid member |
| 2921 | Direct value return failure |
| 2938 | Virtual base class offset not found |
| 2955 | Statement expression evaluation failure |
| 2993 | Object lifetime violation |
| 2999 | Variable-length array in constexpr |
| 3007 | Pointer-to-member comparison failure |
| 3024 | Dynamic initialization order issue |
| 3248 | Member access on uninitialized object |
| 3312 | Object representation mismatch (bit_cast) |

Supporting Functions

Value Management

| Function | Address | Lines | Purpose |
|---|---|---|---|
| f_value_bytes_for_type | sub_628DE0 | 843 | Compute interpreter storage size for a type |
| init_subobject_to_zero | sub_62C030 | 284 | Zero-initialize a constexpr subobject |
| mark_mutable_members_not_initialized | sub_62D0F0 | 203 | Mark mutable members after copy |
| Copy scalar value | sub_62B8A0 | 61 | Assign scalar value to interpreter object |
| Load value | sub_64EA30 | 293 | Load value from interpreter object into buffer |
| Check initialized | sub_62BF60 | 55 | Validate interpreter object is initialized |

Object Addressing

| Function | Address | Lines | Purpose |
|---|---|---|---|
| find_subobject_for_interpreter_address | sub_629D30 | 334 | Map address to subobject identity |
| obj_type_at_address | sub_62A210 | 133 | Most-derived type at an address |
| get_runtime_array_pos | sub_6341C0 | 224 | Array element index for a pointer |
| last_subobject_path_link | sub_6345D0 | 21 | Tail of subobject path chain |
| get_trailing_subobject_path_entry | sub_634630 | 82 | Trailing subobject for virtual bases |
| Copy subobject | sub_6337D0 | 379 | Copy subobject between interpreter addresses |
| Validate subobject path | sub_62B980 | 314 | Recursive validation of class hierarchy traversal |

Condition and Allocation

| Function | Address | Lines | Purpose |
|---|---|---|---|
| do_constexpr_condition | sub_658EE0 | 302 | Evaluate if/while/for condition |
| do_constexpr_condition_alloc | sub_62D810 | 187 | Allocate storage for condition result |
| do_constexpr_init_variable | sub_6509E0 | 427 | Initialize local variable in constexpr |
| Allocate value slot | sub_62D4F0 | 183 | Allocate and init a value slot in arena |
| Release allocation chain | sub_633EC0 | 157 | Free tracked constexpr allocations |

Dynamic Initialization and Lambdas

| Function | Address | Lines | Purpose |
|---|---|---|---|
| do_constexpr_dynamic_init | sub_64A040 | 1,111 | Dynamic initialization of constexpr variables |
| do_constexpr_lambda | (within sub_64A040) | -- | Lambda capture evaluation |
| do_array_constructor_copy | (within sub_64A040) | -- | Array construction via copy ctor |

Debug and Diagnostics

| Function | Address | Lines | Purpose |
|---|---|---|---|
| Format constexpr value | sub_632E80 | 268 | Format value for error messages |
| Dump constexpr value | sub_6333E0 | 166 | fprintf-based debug dump |

Complete Function Map

| Address | Lines | Identity | Confidence |
|---|---|---|---|
| sub_628180 | 237 | Init/entry wrapper | MEDIUM |
| sub_628510 | 526 | do_constexpr_std_meta_substitute | HIGH (95%) |
| sub_628DE0 | 843 | f_value_bytes_for_type | VERY HIGH (99%) |
| sub_629D30 | 334 | find_subobject_for_interpreter_address | VERY HIGH (99%) |
| sub_62A210 | 133 | obj_type_at_address | VERY HIGH (99%) |
| sub_62A490 | 461 | translate_interpreter_object_to_target_bytes | VERY HIGH (99%) |
| sub_62AD90 | 194 | Allocate interpreter value storage | HIGH (85%) |
| sub_62B100 | 177 | do_constexpr_std_allocator_allocate | VERY HIGH (99%) |
| sub_62B470 | 195 | do_constexpr_std_allocator_deallocate | VERY HIGH (99%) |
| sub_62B8A0 | 61 | Copy scalar value | HIGH (85%) |
| sub_62B980 | 314 | Validate/traverse subobject path | HIGH (80%) |
| sub_62BF60 | 55 | Validate initialization state | HIGH (85%) |
| sub_62C030 | 284 | init_subobject_to_zero | VERY HIGH (99%) |
| sub_62C670 | 529 | translate_target_bytes_to_interpreter_object | VERY HIGH (99%) |
| sub_62D0F0 | 203 | mark_mutable_members_not_initialized | VERY HIGH (99%) |
| sub_62D4F0 | 183 | Allocate constexpr value slot | HIGH (80%) |
| sub_62D810 | 187 | do_constexpr_condition_alloc | VERY HIGH (99%) |
| sub_62DB00 | 132 | Get value type size (wrapper) | HIGH (80%) |
| sub_62DD10 | 242 | Builtin dispatch helper | MEDIUM (70%) |
| sub_62E1B0 | 241 | make_infovec | VERY HIGH (99%) |
| sub_62E670 | 276 | Init/entry wrapper | MEDIUM (60%) |
| sub_62EB00 | 342 | do_constexpr_std_meta_enumerators_of | VERY HIGH (99%) |
| sub_62F0B0 | 434 | do_constexpr_std_meta_subobjects_of | VERY HIGH (99%) |
| sub_62F7B0 | 339 | do_constexpr_std_meta_bases_of | VERY HIGH (99%) |
| sub_62FD30 | 308 | do_constexpr_std_meta_nonstatic_data_members_of | VERY HIGH (99%) |
| sub_630280 | 308 | do_constexpr_std_meta_static_data_members_of | VERY HIGH (99%) |
| sub_6307E0 | 590 | do_constexpr_std_meta_members_of | VERY HIGH (99%) |
| sub_631110 | 1,444 | copy_interpreter_object_to_constant | VERY HIGH (99%) |
| sub_632CB0 | 36 | Create reflection string object | MEDIUM (70%) |
| sub_632D80 | 64 | get_reflection_string_entry helper | HIGH (85%) |
| sub_632E80 | 268 | Format constexpr value for diagnostics | MEDIUM (65%) |
| sub_6333E0 | 166 | Dump constexpr value (debug) | MEDIUM (65%) |
| sub_6337D0 | 379 | Copy interpreter subobject | HIGH (85%) |
| sub_633EC0 | 157 | Release allocation chain | HIGH (80%) |
| sub_6341C0 | 224 | get_runtime_array_pos | VERY HIGH (99%) |
| sub_6345D0 | 21 | last_subobject_path_link | VERY HIGH (99%) |
| sub_634630 | 82 | get_trailing_subobject_path_entry | VERY HIGH (99%) |
| sub_634740 | 11,205 | do_constexpr_expression | ABSOLUTE (100%) |
| sub_643C50 | 202 | Prepare constexpr callee | HIGH (85%) |
| sub_643FE0 | 305 | eval_constexpr_callee | VERY HIGH (99%) |
| sub_644580 | 2,836 | do_constexpr_range_based_for_statement | VERY HIGH (99%) |
| sub_647850 | 509 | do_constexpr_statement | HIGH (90%) |
| sub_6480F0 | 1,659 | do_constexpr_ctor | VERY HIGH (99%) |
| sub_64A040 | 1,111 | do_constexpr_dynamic_init / do_constexpr_lambda | VERY HIGH (99%) |
| sub_64B580 | 2,299 | extract_value_from_constant | VERY HIGH (99%) |
| sub_64DFA0 | 86 | Destructor chain walker | HIGH (80%) |
| sub_64E170 | 404 | Perform destruction sequence | HIGH (85%) |
| sub_64E9E0 | 26 | Predicate / flag check | MEDIUM (65%) |
| sub_64EA30 | 293 | Load value from interpreter object | HIGH (85%) |
| sub_64EFE0 | 503 | do_constexpr_dtor (variant 1) | VERY HIGH (99%) |
| sub_64F8F0 | 14 | Trivial forwarding wrapper | MEDIUM (60%) |
| sub_64F920 | 108 | do_constexpr_std_construct_at | VERY HIGH (99%) |
| sub_64FB10 | 877 | do_constexpr_dtor (v2) / perform_destructions | VERY HIGH (99%) |
| sub_6509E0 | 427 | do_constexpr_init_variable | VERY HIGH (99%) |
| sub_651150 | 5,032 | do_constexpr_builtin_function | ABSOLUTE (100%) |
| sub_657560 | 1,445 | do_constexpr_call | VERY HIGH (99%) |
| sub_658CE0 | 134 | Loop iteration cleanup | HIGH (80%) |
| sub_658EE0 | 302 | do_constexpr_condition | VERY HIGH (99%) |
| sub_6593C0 | 816 | Loop body evaluator | HIGH (85%) |
| sub_65A290 | 311 | Entry from expression lowering | MEDIUM (70%) |
| sub_65A8C0 | 274 | Entry from expression trees | MEDIUM (70%) |
| sub_65AE50 | 572 | interpret_expr (primary entry) | VERY HIGH (99%) |
| sub_65BAB0--sub_65D150 | 150-470 | Misc entry points | MEDIUM (70%) |
| sub_65CFA0 | 67 | interpret_dynamic_sub_initializers | VERY HIGH (99%) |
| sub_65D9A0--sub_65DD20 | 7-68 | Small utility/accessor functions | MEDIUM (65%) |
| sub_65DE10 | 553 | do_constexpr_std_meta_define_class | VERY HIGH (99%) |

Cross-References

Name Mangling

The name mangling subsystem in cudafe++ implements the Itanium C++ ABI name mangling specification, with NVIDIA-specific extensions for CUDA device lambda wrappers and host reference array registration. The mangling pipeline lives in lower_name.c (60+ functions spanning 0x69C980--0x6AB280) and produces the _Z prefixed symbols that appear in .int.c output and PTX. A separate CUDA-aware demangler at sub_7CABB0 (930 lines, statically linked, not EDG code) reverses the process with extensions for three NVIDIA vendor-specific mangled prefixes: Unvdl, Unvdtl, and Unvhdl. The glue between mangling and CUDA execution spaces is nv_get_full_nv_static_prefix in nv_transforms.c, which constructs scoped static prefixes for __global__ template stubs destined for host reference arrays.

Key Facts

| Property | Value |
|---|---|
| Source file | lower_name.c (60+ functions), nv_transforms.c (prefix builder) |
| Address range | 0x69C980--0x6AB280 (mangling), 0x6BE300 (static prefix) |
| Demangler | sub_7CABB0 (930 lines, NVIDIA custom, not EDG) |
| ABI standard | Itanium C++ ABI (IA-64), extended with NVIDIA vendor types |
| Operator name table | sub_69C980 (mangled_operator_name), 47 entries |
| Entity mangler | sub_6A1F00 (mangle_entity_name), ~1000 lines |
| Expression mangler | sub_6A8B10 (mangled_expression), ~700 lines |
| Scalable vector mangler | sub_69CF10 (mangled_scalable_vector_name), 170 lines |
| Static prefix builder | sub_6BE300 (nv_get_full_nv_static_prefix), 370 lines |
| Output buffer | qword_127FCC0 (dynamic buffer with capacity tracking) |
| Demangling mode flag | qword_126ED90 (non-zero = demangling/diagnostic mode) |
| Compressed mangling flag | dword_106BC7C (ABI version control) |
| ABI version selector | qword_126EF98 (selects vendor-specific vs standard codes) |

Architecture Overview

Name mangling occurs at two distinct points in the cudafe++ pipeline:

  1. Forward mangling (IL lowering): EDG's lower_name.c converts entity nodes into Itanium ABI mangled names during the IL-to-text code generation phase. The entry point is mangle_entity_name (sub_6A1F00), which dispatches through 60+ helper functions to handle every C++ construct -- namespaces, classes, templates, operators, expressions, lambdas, and vendor-extended types.

  2. Reverse demangling (diagnostics): A statically linked demangler at sub_7CABB0 converts mangled names back to human-readable form for error messages and debug output. This demangler is not EDG code -- it is NVIDIA's custom implementation that wraps the standard Itanium ABI demangling algorithm with CUDA-specific extensions for device lambda wrapper types.

Entity Node (IL)
  |
  +-- sub_69FF70 (check_mangling_special_cases)
  |     Checks: extern "C", linkage name override, builtin
  |     If special case handled, done.
  |
  +-- sub_6A1F00 (mangle_entity_name)          ~1000 lines
  |     |
  |     +-- sub_69C980 (mangled_operator_name)  47 operators
  |     +-- sub_69E740 (mangle_type_encoding)   type dispatch
  |     +-- sub_6A3B00 (mangle_function_encoding)
  |     +-- sub_6A41A0 (mangle_declaration)
  |     +-- sub_6A4920 (mangle_template_parameter)
  |     +-- sub_6A5DC0 (mangle_abi_tags)        B<tag> encoding
  |     +-- sub_6A6AF0 (mangle_template_args)
  |     +-- sub_6A78B0 (mangle_complete_type)
  |     +-- sub_6A8390 (mangled_nested_name_component)
  |     +-- sub_6A85E0 (mangled_entity_reference)
  |     +-- sub_6A8B10 (mangled_expression)     ~700 lines
  |     +-- sub_6AB280 (mangled_encoding_for_sizeof)
  |
  +-- Output buffer: qword_127FCC0
        [buffer_ptr, write_pos, capacity, overflow_flag, ...]

Operator Name Table (sub_69C980)

mangled_operator_name at 0x69C980 is a pure lookup function: it takes an operator kind byte and an arity flag, and returns a pointer to the two-character Itanium ABI mangled operator code. The function covers all 47 overloadable C++ operators, including C++20 co_await.

Assert: "mangled_operator_name: bad kind" at lower_name.c:11557.

Four operators are context-sensitive -- their mangled code depends on whether the usage is unary (arity a2==1) or binary:

| Kind | Unary | Binary | C++ Operator |
|---|---|---|---|
| 5 | ps | pl | + |
| 6 | ng | mi | - |
| 7 | de | ml | * |
| 11 | ad | an | & |
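The arity-dependent part of the lookup can be sketched as a small switch. This is a minimal illustration that uses only the kind numbers and codes from the table above; the function name, its signature, and the NULL return for unmodeled kinds are hypothetical and are not taken from the binary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the four context-sensitive operator kinds.
 * arity == 1 selects the unary code, arity == 2 the binary code. */
const char *operator_code_sketch(unsigned kind, int arity)
{
    assert(arity == 1 || arity == 2);
    int unary = (arity == 1);
    switch (kind) {
    case 5:  return unary ? "ps" : "pl";  /* +a  vs  a + b */
    case 6:  return unary ? "ng" : "mi";  /* -a  vs  a - b */
    case 7:  return unary ? "de" : "ml";  /* *a  vs  a * b */
    case 11: return unary ? "ad" : "an";  /* &a  vs  a & b */
    default: return NULL;                 /* other kinds are not modeled here */
    }
}
```

For example, operator_code_sketch(7, 1) yields "de" (dereference) while operator_code_sketch(7, 2) yields "ml" (multiplication).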

Complete Operator Kind Table

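| Kind | Code | Operator | Kind | Code | Operator |
|---|---|---|---|---|---|
| 1 | nw | new | 26 | ls | << |
| 2 | dl | delete | 27 | rs | >> |
| 5 | ps/pl | + (unary/binary) | 28 | rS | >>= |
| 6 | ng/mi | - (unary/binary) | 29 | lS | <<= |
| 7 | de/ml | * (unary/binary) | 30 | eq | == |
| 9 | rm | % | 31 | ne | != |
| 11 | ad/an | & (unary/binary) | 32 | le | <= |
| 12 | or | \| | 33 | ge | >= |
| 13 | co | ~ | 34 | ss | <=> |
| 14 | nt | ! | 37 | pp | ++ |
| 16 | lt | < | 40 | pm | ->* |
| 17 | gt | > | 41 | pt | -> |
| 24 | aN | &= | 42 | cl | () |
| 43 | ix | [] | 44 | qu | ?: |
| 45 | v23min | vendor min | 46 | v23max | vendor max |
| 47 | aw | co_await (C++20) | | | |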

Kinds 3, 4, 8, 10, 15, 18--23, 25, 28--29, 35--36, 38--39 return pointers to .rodata string constants (unk_A7C560 etc.) that encode the remaining standard operators (dv, eo, aS, pL, mI, mL, dV, eO, aa, oo, mm, cm).

Note kinds 45 and 46: these are vendor-extended operators using the Itanium ABI form v<digit><source-name>, where the digit is the operand count and the source name carries its own length prefix. v23min and v23max therefore decode as v + 2 (binary operator) + 3min / 3max, i.e. the 3-character names "min" and "max"; the "23" is not a single length value. They are NVIDIA/CUDA-specific min/max operators.
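The Itanium ABI grammar for a vendor extended operator is v <digit> <source-name>, with the digit giving the operand count and <source-name> being a decimal length followed by that many characters. The following sketch splits such a code into its parts; the helper name and its return convention are hypothetical, and it is not a reconstruction of any function in the binary.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: splits "v<digit><len><name>" into arity and name.
 * Returns 0 on success, -1 if the text does not have that shape. */
int parse_vendor_operator_sketch(const char *s, int *arity,
                                 char *name, size_t name_cap)
{
    assert(s && arity && name);
    if (s[0] != 'v' || !isdigit((unsigned char)s[1]) ||
        !isdigit((unsigned char)s[2]))
        return -1;
    char *end;
    unsigned long len = strtoul(s + 2, &end, 10);  /* source-name length */
    if (len == 0 || len >= name_cap || strlen(end) < len)
        return -1;
    *arity = s[1] - '0';                           /* operand count digit */
    memcpy(name, end, len);
    name[len] = '\0';
    return 0;
}
```

Applied to "v23min" this gives arity 2 and name "min"; applied to "v18__real__" it gives arity 1 and name "__real__".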

Entity Name Mangling (sub_6A1F00)

mangle_entity_name at 0x6A1F00 is the master mangling function. It produces the complete Itanium ABI mangled name for any entity node. At roughly 1000 decompiled lines, it handles every C++ entity kind through a multi-level dispatch.

Demangling Mode Early Exit

The function begins with a demangling-mode check:

if (qword_126ED90) {          // demangling / diagnostic mode
    emit_char(1, output);     // '?'
    emit_string("?", output);
    return;
}

When qword_126ED90 is non-zero, the function emits "?" and returns immediately. This mode is used during diagnostic output when the compiler needs a placeholder rather than a real mangled name.

Pre-dispatch: Special Cases (sub_69FF70)

Before the main dispatch, sub_69FF70 (check_mangling_special_cases, 447 lines at 0x69FF70) screens for entities that bypass normal mangling:

  • Linkage name override: If the entity has an explicit asm("name") or [[gnu::alias("name")]], the override name is used directly.
  • extern "C" linkage: Returns the unmangled source name.
  • Builtin entities: Special-cased to avoid generating bogus mangled names.

Main Dispatch Structure

After special-case screening, mangle_entity_name dispatches on the entity kind byte at entity node offset +132:

| Entity Kind | Handler | Encoding |
|---|---|---|
| Regular function | sub_6A3B00 (mangle_function_encoding) | _Z<encoding> |
| Regular variable | Direct type mangling | _Z<name><type> |
| Namespace member | sub_6A0740 (mangle_namespace_prefix) | N<qual>..E |
| Class member | sub_6A0A80 (mangle_class_prefix) | N<class><name>E |
| Template specialization | sub_6A6AF0 (mangle_template_args) | I<args>E |
| Operator function | sub_69C980 (mangled_operator_name) | operator codes |
| Constructor/destructor | sub_69FE30 | C1/C2/C3/D0/D1/D2 |
| Lambda closure | Lambda-specific path | Ul<sig>E<disc>_ |
| Local entity | sub_69F830 (mangle_local_name) | Z<func>E<entity> |
| Special (vtable etc.) | sub_69FBC0 (mangle_special_name) | TV/TI/GV etc. |

Type Encoding Subpipeline

Type mangling is handled by sub_69E740 (mangle_type_encoding, 177 lines at 0x69E740), which dispatches on type kind to produce Itanium ABI type codes:

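| Type | Code | Type | Code |
|---|---|---|---|
| void | v | bool | b |
| char | c | signed char | a |
| unsigned char | h | short | s |
| unsigned short | t | int | i |
| unsigned int | j | long | l |
| unsigned long | m | long long | x |
| unsigned long long | y | float | f |
| double | d | long double | e |
| __int128 | n | unsigned __int128 | o |
| wchar_t | w | char8_t | Du |
| char16_t | Ds | char32_t | Di |
| _Float16 | DF16_ | __float128 | g |
| std::nullptr_t | Dn | auto | Da |
| decltype(auto) | Dc | | |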

Pointer and reference types are encoded with prefix qualifiers: P (pointer), R (lvalue reference), O (rvalue reference). CV-qualifiers use K (const), V (volatile), r (restrict).
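How these prefixes compose can be shown with a short sketch. It assumes the Itanium ABI ordering in which qualifiers precede the type they qualify and cv-qualifiers are emitted as r, then V, then K. The flag values and function names are hypothetical; this is not the logic of sub_69E740.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical flag values for this sketch only. */
enum { Q_CONST = 1, Q_VOLATILE = 2, Q_RESTRICT = 4 };

/* Writes "<cv><base>", emitting qualifiers in the ABI order r, V, K. */
int mangle_cv_sketch(char *out, size_t cap, unsigned cv, const char *base)
{
    assert(out && base);
    return snprintf(out, cap, "%s%s%s%s",
                    (cv & Q_RESTRICT) ? "r" : "",
                    (cv & Q_VOLATILE) ? "V" : "",
                    (cv & Q_CONST)    ? "K" : "",
                    base);
}

/* Wraps an already-mangled type in P (pointer), R (lvalue reference)
 * or O (rvalue reference). */
int mangle_indirection_sketch(char *out, size_t cap, char prefix,
                              const char *inner)
{
    assert(out && inner);
    assert(prefix == 'P' || prefix == 'R' || prefix == 'O');
    return snprintf(out, cap, "%c%s", prefix, inner);
}
```

Mangling const char and then wrapping it with 'P' produces PKc, the encoding of const char *.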

The builtin type mangler at sub_6A13A0 (396 lines) includes CUDA-specific type detection through dword_106C2C0 (GPU mode flag) to handle CUDA-extended types.

Substitution Mechanism

The Itanium ABI uses substitution sequences (S_, S0_, S1_, ...) to compress repeated type references. The substitution infrastructure in lower_name.c centers on:

  • sub_69F0D0 (mangle_substitution_check): Checks whether a type/name component has already been emitted and should use a substitution reference.
  • sub_69F150 (mangle_with_substitution, 87 lines): Handles S_ encoding, including the well-known substitutions Sa (std::allocator), Sb (std::basic_string), Ss (std::string), Si (std::istream), So (std::ostream), Sd (std::iostream).
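The numbering of substitution references follows the Itanium ABI seq-id rule: the first candidate is S_, the second is S0_, and later ones count upward in base 36 using digits and upper-case letters. A minimal sketch of that formatting, with a hypothetical function name and a zero-based candidate index as input:

```c
#include <assert.h>
#include <stddef.h>

/* index 0 -> "S_", index 1 -> "S0_", index 11 -> "SA_", index 37 -> "S10_".
 * Returns the number of characters written, excluding the terminator. */
size_t format_substitution_sketch(unsigned index, char *out, size_t cap)
{
    static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    char tmp[16];
    size_t n = 0, len = 0;
    assert(out && cap >= 10);          /* 'S' + up to 7 digits + '_' + NUL */
    out[len++] = 'S';
    if (index > 0) {
        unsigned v = index - 1;        /* seq-id is offset by one */
        do { tmp[n++] = digits[v % 36]; v /= 36; } while (v);
        while (n) out[len++] = tmp[--n];   /* digits were built in reverse */
    }
    out[len++] = '_';
    out[len] = '\0';
    return len;
}
```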

Template Argument Mangling

Template arguments are enclosed in I...E and handled by:

  • sub_69ED40 (mangle_template_args, 86 lines): Iterates the template argument list, emitting I prefix and E suffix.
  • sub_69EEE0 (mangle_template_arg, 109 lines): Mangles individual template arguments, dispatching between type arguments (direct type encoding), non-type arguments (expression or literal encoding), and template template arguments.
  • sub_6A4920 (mangle_template_parameter, 277 lines): Encodes template parameter references (T_, T0_, T1_, ...).

ABI Tag Mangling (sub_6A5DC0)

sub_6A5DC0 (643 lines at 0x6A5DC0) handles [[gnu::abi_tag("...")]] attribute propagation per the Itanium ABI extensions. ABI tags are encoded as B<length><tag> suffixes and must be propagated through template instantiations and inline namespaces (e.g., std::__cxx11::basic_string with tag cxx11). This is one of the more complex mangling functions due to the transitive nature of tag propagation.
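The B<length><tag> form itself is simple; the complexity described above lies in deciding which tags apply. As a minimal sketch of the encoding step only (the function name is hypothetical and propagation is not modeled):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes a single ABI tag suffix, e.g. "cxx11" -> "B5cxx11".
 * Returns the snprintf result (characters that were or would be written). */
int append_abi_tag_sketch(char *out, size_t cap, const char *tag)
{
    assert(out && tag && tag[0] != '\0');
    return snprintf(out, cap, "B%zu%s", strlen(tag), tag);
}
```

A function foo() carrying the tag cxx11 would then mangle along the lines of _Z3fooB5cxx11v, with the suffix placed directly after the source name.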

Constructor/Destructor Encoding (sub_69FE30)

Constructors and destructors use the Itanium ABI's multi-variant encoding:

| Code | Meaning |
|---|---|
| C1 | Complete object constructor |
| C2 | Base object constructor |
| C3 | Complete object allocating constructor |
| D0 | Deleting destructor |
| D1 | Complete object destructor |
| D2 | Base object destructor |

Special Name Mangling (sub_69FBC0)

sub_69FBC0 (125 lines) produces mangled names for compiler-generated symbols:

| Prefix | Symbol |
|---|---|
| _ZTV | Virtual table |
| _ZTT | VTT (construction vtable) |
| _ZTI | typeinfo structure |
| _ZTS | typeinfo name string |
| _ZGV | Guard variable for static initialization |
| _ZTH | Thread-local initialization function |
| _ZTW | Thread-local wrapper function |

Expression Mangling (sub_6A8B10)

mangled_expression at 0x6A8B10 is the second-largest function in lower_name.c at roughly 700 decompiled lines. It produces the Itanium ABI encoding for arbitrary C++ expressions appearing in template arguments, noexcept specifications, and decltype contexts.

Assert: "mangled_encoding_for_expression_full" at lower_name.c:6870, "mangled_expr_operator_name: bad operator" at lower_name.c:11873, "mangled_call_operation" at lower_name.c:6132.

Expression Kind Dispatch

The function first calls sub_69E740 to classify the expression node, then dispatches on the expression kind byte at node offset +24:

| Kind | Description | ABI Encoding |
|---|---|---|
| 0 | Error/unknown expression | ? (demangling mode only) |
| 1 | Operator expression | Dispatches on operator byte at +40 |
| 2 | Literal value | L<type><value>E |
| 3 | Entity reference | L_Z<encoding>E or substitution |
| 4 | Template parameter | T_/T0_ etc. |
| 5 | sizeof/alignof/typeid/noexcept | Delegated to sub_6AB280 |
| 6 | Cast expression | sc/dc/rc/cv prefix |
| 7 | Call expression | cl<callee><args>E or cp<args>E |
| 8 | Member access | dt/pt prefix |
| 9 | Conditional expression | qu<cond><then><else> |
| 10 | Pack expansion | sp<pattern> |

Operator Sub-dispatch (Kind 1)

When the expression is an operator expression, the function reads the operator byte at node offset +40 and performs a large switch covering 100+ cases. For standard binary and unary operators, it calls sub_69C980 (mangled_operator_name) to get the two-character ABI code, then recursively processes operands. Notable special cases:

  • Cast operators (kinds 0x05--0x13): Dispatches between sc (static_cast), dc (dynamic_cast), rc (reinterpret_cast), and cv (C-style cast) based on cast flags at node offset +25 and +42. The compressed mangling flag dword_106BC7C forces cv for all casts when set.
  • Vendor extensions (0x21, 0x22): __real__ and __imag__ complex number operations, encoded as v18__real__ and v18__imag__ using the vendor-extended operator format.
  • Increment/decrement (kinds 0x23--0x26): Pre/post increment (pp) and decrement (mm). The Itanium ABI distinguishes the two forms with a trailing _: the prefix forms are encoded pp_ and mm_, while the postfix forms are plain pp and mm.
  • Call expressions (kinds 0x16--0x17, 0x69--0x6D): Dispatches to mangled_call_operation which determines the callee encoding and emits cl (call) or cp (non-dependent call) prefix.

sizeof/alignof/typeid/noexcept (sub_6AB280)

mangled_encoding_for_sizeof at 0x6AB280 (130 lines) handles the sizeof-family of operators:

| ABI Code | Operator | Variant |
|---|---|---|
| sz | sizeof(expr) | Expression operand |
| st | sizeof(type) | Type operand |
| az | alignof(expr) | Expression operand |
| at | alignof(type) | Type operand |
| te | typeid(expr) | Expression operand |
| ti | typeid(type) | Type operand |
| nx | noexcept(expr) | Expression operand |

For older ABI versions (controlled by dword_106BC7C and qword_126EF98), the function emits vendor-specific codes v17alignof and v18alignofe instead of the standard at/az codes.
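The code selection can be summarized in a small sketch. The enum, the function name, and the single legacy_abi flag are hypothetical simplifications; in particular, pairing v17alignof with the type-operand form and v18alignofe with the expression form is an assumption based on the order in which the text lists them, and the exact flag test in the binary is not reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operator classification for this sketch only. */
typedef enum { OP_SIZEOF, OP_ALIGNOF, OP_TYPEID, OP_NOEXCEPT } sizeof_family_op;

/* Returns the ABI code, or NULL for a combination that does not exist. */
const char *sizeof_family_code_sketch(sizeof_family_op op,
                                      int operand_is_type, int legacy_abi)
{
    switch (op) {
    case OP_SIZEOF:
        return operand_is_type ? "st" : "sz";
    case OP_ALIGNOF:
        if (legacy_abi)  /* assumed mapping, see lead-in */
            return operand_is_type ? "v17alignof" : "v18alignofe";
        return operand_is_type ? "at" : "az";
    case OP_TYPEID:
        return operand_is_type ? "ti" : "te";
    case OP_NOEXCEPT:
        return operand_is_type ? NULL : "nx";  /* noexcept takes an expression */
    }
    return NULL;
}
```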

Scalable Vector Name Mangling (sub_69CF10)

mangled_scalable_vector_name at 0x69CF10 (170 lines) returns mangled names for ARM SVE and RISC-V V extension scalable vector types. EDG supports these types natively, and they must be mangled using the vendor-specific Itanium ABI encoding.

Assert: "mangled_scalable_vector_name" at lower_name.c:10473 and lower_name.c:10440.

The function dispatches on the type node's kind byte at offset +132:

Dispatch Logic

  1. Kind 12 (elaborated type): Unwraps through the elaboration chain (offset +144 points to the underlying type).
  2. Kind 3 (typedef/alias): Dispatches on subkind at offset +144:
    • Subkind 1: svint variants (signed integer vectors)
    • Subkind 2: svfloat variants (floating-point vectors)
    • Subkind 4: svbool variants (predicate vectors)
    • Subkind 9: svcount variants
  3. Kind 18 (mfloat8): mfloat8x types for ML inference.
  4. Kind 2 (plain vector): Dispatches on element type byte at offset +144, handling 8 element widths (cases 1--8).

Each type category has 4 mangling variants selected by the a2 parameter (values 1--4), corresponding to different vector widths or tuple sizes (e.g., svint8_t, svint8x2_t, svint8x3_t, svint8x4_t). The actual mangled strings are stored in .rodata pointer tables (off_A7E950 through off_A7EA18).

There is also special handling for svboolx4_t via sub_7A7220, which detects the specific boolean-tuple-of-4 predicate type and returns a dedicated mangling string.

Mangling Output Buffer

All mangling functions write into a shared output buffer managed through qword_127FCC0. The buffer structure:

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | reserved | Not used during mangling |
| +8 | 8 | capacity | Allocated buffer size |
| +16 | 8 | write_pos | Current write position (length of mangled name so far) |
| +24 | 8 | unused | Reserved |
| +32 | 8 | buffer_ptr | Pointer to character buffer |

Key buffer operations:

  • sub_69D850 (append_char_to_buffer): Appends a single character, calls sub_6B9B20 to grow the buffer if write_pos + 1 > capacity.
  • sub_69D530 (append_string): Appends a string to the buffer.
  • sub_69D580 (append_number): Appends a base-36 encoded number.
  • sub_6B9B20 (ensure_output_buffer_space): Grows the buffer (doubles capacity).

The sub_69DAA0 function (mangle_number, 63 lines) writes numbers in base-36 encoding as required by the Itanium ABI for substitution indices and discriminators.
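The append-and-grow behavior described for the output buffer can be sketched as follows. The struct mirrors the offsets in the table above; the function names, the initial capacity of 16, and the use of realloc are assumptions made for illustration, since the text only states that the buffer doubles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t reserved;    /* +0  */
    uint64_t capacity;    /* +8  */
    uint64_t write_pos;   /* +16 */
    uint64_t unused;      /* +24 */
    char    *buffer_ptr;  /* +32 */
} mangle_buf_sketch;

_Static_assert(offsetof(mangle_buf_sketch, buffer_ptr) == 32, "layout");

/* Doubles the capacity until `extra` more bytes fit. Returns 0 on success. */
static int ensure_space_sketch(mangle_buf_sketch *b, uint64_t extra)
{
    uint64_t cap = b->capacity ? b->capacity : 16;  /* assumed start size */
    while (b->write_pos + extra > cap)
        cap *= 2;
    if (cap != b->capacity || b->buffer_ptr == NULL) {
        char *p = realloc(b->buffer_ptr, (size_t)cap);
        if (!p) return -1;
        b->buffer_ptr = p;
        b->capacity = cap;
    }
    return 0;
}

/* Appends one character, growing first if write_pos + 1 > capacity. */
int append_char_sketch(mangle_buf_sketch *b, char c)
{
    assert(b);
    if (b->write_pos + 1 > b->capacity || b->buffer_ptr == NULL)
        if (ensure_space_sketch(b, 1) != 0)
            return -1;
    b->buffer_ptr[b->write_pos++] = c;
    return 0;
}
```

Starting from a zero-initialized struct, the sketch allocates 16 bytes on the first append and doubles to 32 and then 64 as more characters arrive.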

Mangling Type Marks

The mangling pipeline uses a mark-and-sweep mechanism to track which types have been referenced during signature mangling (needed for substitution sequence generation):

  • sub_69CCB0 (set_signature_mark, 76 lines): Marks types in a function signature for mangling. Handles function types (a2=7) and template functions (a2=11) by calling sub_5CF440 for type traversal.
  • sub_69CE10 (ttt_mark_entry, 36 lines): Sets or clears the mangling mark on individual type entities. Uses bit 7 of byte at entity offset +81. The direction (mark vs unmark) is controlled by dword_127FC70.

CUDA Demangler Extensions (sub_7CABB0)

The CUDA-aware demangler at sub_7CABB0 (930 decompiled lines at 0x7CABB0) is a statically linked NVIDIA implementation, not part of EDG. It implements a full Itanium ABI C++ name demangler with three NVIDIA vendor-type extensions for CUDA lambda wrappers.

Function Signature

unsigned char* sub_7CABB0(
    unsigned char *mangled_name,   // a1: input cursor into mangled name
    int64_t       qualifier_out,   // a2: output qualifier struct (24 bytes)
    char          flags,           // a3: behavior flags
    int64_t       output_ctx       // a4: output buffer context
);

Output Buffer Context (a4)

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | buffer_ptr | Output character buffer |
| +8 | 8 | write_pos | Current output position |
| +16 | 8 | capacity | Buffer capacity |
| +24 | 4 | error_flag | Set to 1 on buffer overflow |
| +28 | 4 | overflow | Redundant overflow indicator |
| +32 | 8 | suppress_level | When >0, output is suppressed (for dry-run parsing) |
| +48 | 8 | error_count | Cumulative parse error counter |
| +64 | 8 | skip_template | When set, suppresses template argument output |

Qualifier Output (a2)

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 4 | has_template_args | Set to 1 when template arguments were parsed |
| +4 | 4 | cv_qualifiers | bit 0=const, bit 1=volatile, bit 2=restrict |
| +8 | 4 | ref_qualifier | 0=none, 1=lvalue &, 2=rvalue && |
| +16 | 8 | template_depth | Template nesting depth |
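The qualifier record can be written down as a C struct whose layout matches the offsets above. This is a sketch: the type and field names follow the table, the 4 bytes at +12 are not described by the table and are assumed here to be padding, and ref_text_sketch is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t has_template_args; /* +0  set to 1 when template args were parsed */
    uint32_t cv_qualifiers;     /* +4  bit 0=const, bit 1=volatile, bit 2=restrict */
    uint32_t ref_qualifier;     /* +8  0=none, 1=lvalue &, 2=rvalue && */
    uint32_t pad_12;            /* +12 assumed padding, not in the table */
    uint64_t template_depth;    /* +16 */
} demangle_qualifier_sketch;

_Static_assert(offsetof(demangle_qualifier_sketch, cv_qualifiers) == 4, "layout");
_Static_assert(offsetof(demangle_qualifier_sketch, template_depth) == 16, "layout");
_Static_assert(sizeof(demangle_qualifier_sketch) == 24, "24-byte record");

/* Maps the ref_qualifier field to the text a demangler would print. */
const char *ref_text_sketch(const demangle_qualifier_sketch *q)
{
    assert(q);
    switch (q->ref_qualifier) {
    case 1:  return "&";
    case 2:  return "&&";
    default: return "";
    }
}
```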

Flags (a3)

| Bit | Meaning |
|---|---|
| 0 | Static-from mode: wraps output in [static from ...]...[C++] |
| 1 | Suppress-scope mode: increments suppress level |

Parsing Dispatch

The demangler handles these Itanium ABI top-level prefixes:

| Prefix Byte | ASCII | ABI Meaning | Handler |
|---|---|---|---|
| 0x42 | B | EDG block-scope static entity | Block-scope handler (offset + length) |
| 0x4E | N | Nested name (qualified) | sub_7CA440 (nested-name parser) |
| 0x5A | Z | Local entity | sub_7CEAE0 (encoding parser) + local suffix |
| 0x53 | S | Substitution | sub_7CD7B0 (substitution resolver) |
| 0x53 0x74 | St | std:: prefix | Emits std:: + sub_7CD0B0 (unqualified-name) |
| other | | Unqualified name | sub_7CD0B0 (unqualified-name parser) |

After parsing the name, the function checks for I (template argument list, 0x49) and dispatches to sub_7C9D30 (template-args parser). A template argument cache at qword_12C7B48/12C7B40/12C7B50 stores parsed entries using a dynamic array that grows by 500 entries via malloc/realloc.

CUDA Vendor Type Extensions

The key NVIDIA extensions are triggered when the demangler encounters the vendor-extended type prefix U followed by nv (bytes 0x55 0x6E 0x76). Three patterns are recognized:

Unvdl -- Device Lambda Wrapper

Pattern: Unvdl<arity><encoding><type>...

Input:  "Unvdl" + <numeric_arity> + <function_encoding> + <captured_types>...
Output: "__nv_dl_wrapper_t<__nv_dl_tag<(& :: <scope>), <arity>, <type1>, ...> >"

Decoded step by step:

  1. Emit __nv_dl_wrapper_t<
  2. Emit __nv_dl_tag<
  3. Parse numeric arity via sub_7C3180, subtract 2 to get actual capture count
  4. Parse one type (sub_7CE590) for the wrapped function type
  5. Emit ,( + & :: + recursively demangle scope (calling sub_7CABB0 with flags=2)
  6. Emit ),
  7. Parse remaining captured types (count from step 3)
  8. Emit > >

Unvdtl -- Trailing Return Device Lambda

Pattern: Unvdtl<arity><return_type><encoding><captured_types>...

Input:  "Unvdtl" + <arity> + <type> + <func_encoding> + <captured_types>...
Output: "__nv_dl_wrapper_t<__nv_dl_trailing_return_tag<...>, <return_type>, ...>"

Same as Unvdl except:

  1. Emit __nv_dl_wrapper_t<
  2. Emit __nv_dl_trailing_return_tag< (instead of __nv_dl_tag<)
  3. After the scope demangling, parse an additional return type via sub_7CE590
  4. Parse a function type via sub_7CE5D0 (adds 1 to result pointer for the E terminator)
  5. Then parse remaining captured types

Unvhdl -- Host-Device Lambda Wrapper

Pattern: Unvhdl<bool1><bool2><bool3><arity><encoding><captured_types>...

Input:  "Unvhdl" + <IsMutable> + <HasFuncPtrConv> + <NeverThrows> + <arity> + ...
Output: "__nv_hdl_wrapper_t<true/false, true/false, true/false,
          __nv_dl_tag<(& :: <scope>), <arity>, <type1>, ...> >"

The three boolean template parameters are decoded first:

  1. Parse numeric value via sub_7C3180 -- if the value is 2 (the encoding of false), emit false,; otherwise emit true,
  2. Repeat for HasFuncPtrConv (second boolean)
  3. Repeat for NeverThrows (third boolean)
  4. Then proceed identically to Unvdl (emit __nv_dl_tag<, parse captures, etc.), but with v68=1 flag marking the host-device variant

The boolean encoding convention: 2 encodes false, any other value (typically 0 or 1) encodes true. This is the reverse of the usual convention and matches the internal encoding used by nv_transforms.c when generating the mangled names.
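The boolean convention is easy to get backwards, so a short sketch may help. Only the rule "2 means false, anything else means true" comes from the text; the function names are hypothetical and the exact spacing of the emitted text is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* 2 encodes false; any other value (typically 0 or 1) encodes true. */
const char *nv_hdl_bool_text_sketch(long encoded)
{
    return encoded == 2 ? "false" : "true";
}

/* Formats the three leading template arguments of __nv_hdl_wrapper_t.
 * The ", " separators are an assumption about the output spacing. */
int format_hdl_flags_sketch(char *out, size_t cap, long is_mutable,
                            long has_func_ptr_conv, long never_throws)
{
    assert(out && cap > 0);
    return snprintf(out, cap, "__nv_hdl_wrapper_t<%s, %s, %s, ",
                    nv_hdl_bool_text_sketch(is_mutable),
                    nv_hdl_bool_text_sketch(has_func_ptr_conv),
                    nv_hdl_bool_text_sketch(never_throws));
}
```

With the encoded values 2, 1, 0 the sketch produces the opening text __nv_hdl_wrapper_t<false, true, true, after which the __nv_dl_tag portion would follow.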

Block-Scope Static Handling

When the input starts with B (ASCII 0x42), the demangler handles EDG's block-scope static entity encoding:

  1. If flags bit 0 is set and suppress_level is 0: emit [static from
  2. Parse an optional negative sign (n) followed by a decimal length
  3. Skip ahead by that length (the block-scope name)
  4. If suppress_level is 0: emit ] followed by [C++] (the closing bracket and C++ marker)
  5. If flags bit 0 is not set: decrement suppress_level

Instance Suffix

After parsing the main name, if the next character is _ followed by digits (or __ followed by digits), the demangler parses an instance discriminator and emits (instance N) suffix in the output, where N = parsed_value + 2.

Default Argument Suffix

For local entities (after Z...E), the discriminator prefix d triggers special handling:

  • d_ or d<number>_: emits [default argument N (from end)]:: where N = parsed_value + 2
  • dn<number>_: negative-index variant

Call Graph

The demangler calls into specialized sub-parsers:

| Address | Function | Purpose |
|---|---|---|
| sub_7CA440 | Nested-name parser | Handles N...E qualified names |
| sub_7CEAE0 | Encoding parser | Top-level <encoding> production |
| sub_7CD0B0 | Unqualified-name parser | <source-name> and operator names |
| sub_7CD7B0 | Substitution resolver | S_/S0_ back-references |
| sub_7C9D30 | Template-args parser | I<args>E |
| sub_7CE590 | Type parser | Full type demangling |
| sub_7CE5D0 | Function-type parser | Function signature types |
| sub_7C3180 | Numeric literal parser | Decimal number extraction |
| sub_7C30C0 | Arity emitter | Outputs numeric arity values |
| sub_7C2FB0 | String emitter | Emits literal strings to output buffer |
| sub_7C3030 | Signed number parser | Handles negative numbers |

Static Prefix for __global__ Templates (sub_6BE300)

nv_get_full_nv_static_prefix at 0x6BE300 (370 lines) in nv_transforms.c constructs unique prefix strings for __global__ function templates with static/internal linkage. These prefixes are used to register device symbols in host reference arrays (the .nvHR* ELF sections that the CUDA runtime uses for symbol discovery).

Assert: "nv_get_full_nv_static_prefix" at nv_transforms.c:2164.

Entry Conditions

The function checks two conditions on the entity node:

  1. Bit 0x40 at entity offset +182 must be set (marks __global__ functions)
  2. A name string at entity offset +8 must be non-null

Internal vs External Linkage Paths

The function takes different paths based on entity linkage:

Internal linkage (bits 0x12 at offset +179 set, or storage class 0x10 at offset +80):

  1. Build scoped name prefix via sub_6BD2F0 (nv_build_scoped_name_prefix), which recursively walks the scope chain (offset +40 -> parent scope at offset +28) to build Namespace1::Namespace2:: style prefixes. Anonymous namespaces insert _GLOBAL__N_<filename>.
  2. Hash the entity name via sub_6BD1C0 (format_string_to_sso) using vsnprintf with a format string at address 0x82D326 (decimal 8573734, inside .rodata).
  3. Build the full prefix string using snprintf:
snprintf(qword_1286760, n, "%s%lu_%s_", off_E7C768, strlen(filename), filename);

Where off_E7C768 is a global prefix string (likely "_nv_static_"), the %lu is the filename length, and %s is the filename from sub_5AF830(0). The result is cached in qword_1286760 for reuse across entities in the same translation unit.

  4. Concatenate prefix + "_" separator + entity scoped name
  5. Register the full string in qword_12868C0 (kernel internal-linkage host reference list)
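The snprintf call from step 3 can be exercised in isolation. In this sketch the base string "_nv_static_" is the value the text marks as likely rather than confirmed, the file name is an arbitrary example, and the function name is hypothetical; caching in qword_1286760 is not modeled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "<base><strlen(filename)>_<filename>_" using the same format
 * string as the decompiled call. Returns the snprintf result. */
int build_static_prefix_sketch(char *out, size_t cap,
                               const char *base, const char *filename)
{
    assert(out && base && filename);
    return snprintf(out, cap, "%s%lu_%s_",
                    base, (unsigned long)strlen(filename), filename);
}
```

For base "_nv_static_" and file name "kernel.cu" (9 characters) this yields _nv_static_9_kernel.cu_. The length prefix keeps the file name unambiguous even if it contains underscores.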

External linkage:

  1. Build name with " ::" scope prefix (the leading space is intentional -- it matches the demangler output format)
  2. Walk scope chain via sub_6BD2F0 if the entity has a parent scope with kind 3 (namespace)
  3. Hash the entity name via sub_6BD1C0
  4. Append "_" separator
  5. Register in qword_1286880 (kernel external-linkage host reference list)

Host Reference Arrays

The prefixes generated by this function end up in six global lists, one per combination of {kernel, device, constant} x {external, internal} linkage:

| Global | Section | Array Name |
|---|---|---|
| unk_1286780 | .nvHRDE | hostRefDeviceArrayExternalLinkage |
| unk_12867C0 | .nvHRDI | hostRefDeviceArrayInternalLinkage |
| unk_1286800 | .nvHRCE | hostRefConstantArrayExternalLinkage |
| unk_1286840 | .nvHRCI | hostRefConstantArrayInternalLinkage |
| unk_1286880 | .nvHRKE | hostRefKernelArrayExternalLinkage |
| unk_12868C0 | .nvHRKI | hostRefKernelArrayInternalLinkage |

These are emitted by sub_6BCF80 (nv_emit_host_reference_array) as weak extern "C" byte arrays in the specified ELF sections.

Type Mangling Subsystem (0x7C3000--0x7D0E00)

A separate type mangling subsystem exists in the 0x7C3000--0x7D0E00 range, used for diagnostic output and type encoding (distinct from the lower_name.c mangling used for symbol generation). Key functions:

| Address | Function | Lines | Description |
|---|---|---|---|
| sub_7C3480 | encode_operator_name | 716 | Operator name encoding for diagnostics |
| sub_7C5650 | encode_type_for_mangling | 794 | Full type encoding dispatcher |
| sub_7C6290 | encode_expression | 2519 | Largest function -- expression encoding |
| sub_7C8BE0 | encode_special_expression | 674 | Special expression forms |
| sub_7CBB90 | encode_builtin_type | 1314 | All builtin type mappings |
| sub_7CEAE0 | encode_template_args | 1417 | Template argument encoding |
| sub_7CFFC0 | encode_nullptr | 484 | nullptr-related type encoding |

The encode_expression function at sub_7C6290 (2519 lines) is the largest single function in the entire type mangling subsystem and handles the full range of C++ expressions including dynamic_cast, const_cast, reinterpret_cast, safe_cast, static_cast, subscript, and throw.

Nested Name Components (sub_6A8390)

mangled_nested_name_component at 0x6A8390 (101 lines) handles the intermediate components within N...E nested name encodings. It emits ABI substitution codes:

  • dn: Destructor name
  • co: Coercion operator
  • sr: Unresolved scope resolution
  • L_ZN: Local scope nested name
  • D1Ev: Destructor suffix (complete object destructor, void return)

When in compressed mode (dword_106BC7C set), the function checks for std:: namespace via sub_7BE9E0 (is_std_namespace) and uses shortened forms.

Entity Reference Mangling (sub_6A85E0)

mangled_entity_reference at 0x6A85E0 (197 lines) is the central dispatch for mangling entity references within expressions. It handles:

  • Qualified scope resolution (bit 2 at entity offset +81)
  • Address-of expressions (ad prefix)
  • Compressed vs full mangling paths
  • Class member vs free-function encoding

Assert: "mangled_entity_reference" at lower_name.c:4183.

Mangling Discriminators (sub_69DBE0)

mangle_discriminator at 0x69DBE0 (72 lines) writes discriminators for local entities. Itanium ABI uses _ for discriminator 0, _<number>_ for higher discriminators, where the number is encoded in base-36.

Global State Summary

| Global | Type | Purpose |
|---|---|---|
| qword_127FCC0 | Buffer* | Primary mangling output buffer |
| qword_126ED90 | qword | Demangling/diagnostic mode flag |
| dword_106BC7C | dword | Compressed/vendor-ABI mode flag |
| qword_126EF98 | qword | ABI version selector |
| dword_127FC70 | dword | Mark/unmark direction for type marks |
| qword_1286760 | char* | Cached static prefix string |
| qword_1286A00 | char* | Cached anonymous namespace name |
| dword_12C6A24 | dword | Block-scope suppress level (demangler) |
| qword_12C7B48 | qword | Template argument cache index |
| qword_12C7B40 | qword | Template argument cache capacity |
| qword_12C7B50 | qword | Template argument cache pointer |
| off_E7C768 | char* | Static prefix base string |

Function Address Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| 0x69C830 | 24 | init_lower_name | LOW |
| 0x69C980 | 168 | mangled_operator_name | HIGH |
| 0x69CCB0 | 76 | set_signature_mark | HIGH |
| 0x69CE10 | 36 | ttt_mark_entry | HIGH |
| 0x69CF10 | 170 | mangled_scalable_vector_name | HIGH |
| 0x69D530 | -- | append_string | MEDIUM |
| 0x69D580 | -- | append_number | MEDIUM |
| 0x69D850 | -- | append_char_to_buffer | MEDIUM |
| 0x69DAA0 | 63 | mangle_number | MEDIUM |
| 0x69DBE0 | 72 | mangle_discriminator | MEDIUM |
| 0x69E380 | 116 | mangle_cv_qualifiers | MEDIUM |
| 0x69E5F0 | 79 | mangle_ref_qualifier | MEDIUM |
| 0x69E740 | 177 | mangle_type_encoding | MEDIUM-HIGH |
| 0x69EA40 | 150 | mangle_function_type | MEDIUM |
| 0x69ED40 | 86 | mangle_template_args | MEDIUM |
| 0x69EEE0 | 109 | mangle_template_arg | MEDIUM |
| 0x69F0D0 | 28 | mangle_substitution_check | LOW |
| 0x69F150 | 87 | mangle_with_substitution | MEDIUM |
| 0x69F320 | 78 | mangle_nested_name | MEDIUM |
| 0x69F830 | 54 | mangle_local_name | MEDIUM |
| 0x69F930 | 60 | mangle_unscoped_name | MEDIUM |
| 0x69FA90 | 58 | mangle_source_name | MEDIUM |
| 0x69FBC0 | 125 | mangle_special_name | MEDIUM |
| 0x69FE30 | 78 | mangle_constructor_destructor | MEDIUM |
| 0x69FF70 | 447 | check_mangling_special_cases | MEDIUM-HIGH |
| 0x6A0740 | 189 | mangle_namespace_prefix | MEDIUM |
| 0x6A0A80 | 88 | mangle_class_prefix | MEDIUM |
| 0x6A0FB0 | 245 | mangle_pointer_type | MEDIUM |
| 0x6A13A0 | 396 | mangle_builtin_type | MEDIUM-HIGH |
| 0x6A1C80 | 97 | mangle_expression | MEDIUM |
| 0x6A1F00 | ~1000 | mangle_entity_name | HIGH |
| 0x6A4920 | 277 | mangle_template_parameter | MEDIUM |
| 0x6A5DC0 | 643 | mangle_abi_tags | MEDIUM-HIGH |
| 0x6A78B0 | 297 | mangle_complete_type | MEDIUM |
| 0x6A7F20 | 232 | mangle_initializer | MEDIUM |
| 0x6A8390 | 101 | mangled_nested_name_component | HIGH |
| 0x6A85E0 | 197 | mangled_entity_reference | HIGH |
| 0x6A8B10 | ~700 | mangled_expression | HIGH |
| 0x6AA030 | 30 | mangled_expression_list | HIGH |
| 0x6AB280 | 130 | mangled_encoding_for_sizeof | HIGH |
| 0x6BE300 | 370 | nv_get_full_nv_static_prefix | VERY HIGH |
| 0x7CABB0 | 930 | CUDA demangler (top-level) | HIGH |

Type System

The type system in cudafe++ is EDG 6.6's implementation of the C++ type representation, query, construction, comparison, and layout infrastructure. It lives primarily in types.c (250+ functions at 0x7A4940--0x7C02A0) with type allocation in il_alloc.c (0x5E2E80--0x5E45C0), type construction helpers in il.c (0x5D64F0--0x5D6DB0), and class layout computation in layout.c (0x65EA50--0x665B50).

Every C++ entity -- variable, function parameter, expression result, template argument -- carries a type pointer. EDG represents types as 176-byte heap-allocated nodes organized by a type_kind discriminant, with supplementary structures for complex kinds (classes, functions, integers, typedefs, template parameters). Type identity in the IL is pointer-based: two types are the "same type" if and only if they resolve to the same canonical node after chasing typedef chains. This page documents the complete type node architecture, the 22 type kinds, the 130 leaf query functions, the MRU-cached type construction pipeline, and the Itanium ABI class layout engine.

Key Facts

| Property | Value |
|---|---|
| Source file | types.c (250+ functions), il_alloc.c (allocators), il.c (construction), layout.c (class layout) |
| Address range | 0x7A4940--0x7C02A0 (types.c), 0x5E2E80--0x5E45C0 (alloc), 0x5D64F0--0x5D6DB0 (il.c), 0x65EA50--0x665B50 (layout) |
| Type node size | 176 bytes (raw allocation includes 16-byte IL prefix) |
| Type kind count | 22 values (0x00--0x15) |
| Leaf query functions | 130 at 0x7A6260--0x7A9F90 (3,648 total call sites across binary) |
| Class layout entry | sub_662670 (do_class_layout), 2,548 lines |
| Type allocator | sub_5E3D40 (alloc_type), 176-byte bump allocation |
| Kind dispatch | sub_5E2E80 (set_type_kind), 22-way switch |
| Qualified type cache | sub_5D64F0 (f_make_qualified_type), MRU linked list at type +112 |
| Type comparison | sub_7AA150 (types_are_identical), 636 lines |
| Top query by callers | is_class_or_struct_or_union_type at 0x7A8A30 (407 call sites) |
| Type counter global | qword_126F8E0 (incremented on every alloc_type) |
| Void type singleton | qword_126E5E0 |

Type Node Layout (176 Bytes)

Every type in the IL is a 176-byte node allocated by alloc_type (sub_5E3D40). The allocator prepends a 16-byte IL prefix (8-byte TU-copy address + 8-byte next pointer), so the pointer returned to callers points at offset +16 of the raw allocation. All offsets below are relative to the returned pointer.
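The relationship between the raw block and the pointer handed to callers can be sketched as follows. This is a simplified model, not the allocator in the binary: it uses calloc instead of a bump allocator, the names are hypothetical, and it assumes the 176 bytes are the caller-visible node so that the raw block is 192 bytes (the text leaves open whether the prefix is counted inside the 176).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { IL_PREFIX_SIZE = 16, TYPE_NODE_SIZE = 176 };  /* assumed split */

typedef struct {
    uint64_t tu_copy_address;  /* 8-byte TU-copy address */
    uint64_t next;             /* 8-byte next pointer */
} il_prefix_sketch;

/* Allocates prefix + node and returns a pointer to the node part. */
void *alloc_type_sketch(void)
{
    unsigned char *raw = calloc(1, IL_PREFIX_SIZE + TYPE_NODE_SIZE);
    if (!raw)
        return NULL;
    return raw + IL_PREFIX_SIZE;  /* callers see offset +16 of the raw block */
}

/* Recovers the prefix that sits immediately before a node pointer. */
il_prefix_sketch *prefix_of_sketch(void *node)
{
    assert(node);
    return (il_prefix_sketch *)((unsigned char *)node - IL_PREFIX_SIZE);
}

void free_type_sketch(void *node)
{
    if (node)
        free((unsigned char *)node - IL_PREFIX_SIZE);
}
```

With this arrangement every documented field offset applies directly to the returned pointer, and the prefix is reached with a fixed negative displacement.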

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 96 | common header | Copied from xmmword_126F6A0..126F6F0 at allocation time |
| +0 | 8 | source_corresp | Source position info |
| +8 | 1 | prefix_flags | IL entry prefix: bit 0 = allocated, bit 1 = file-scope, bit 3 = language |
| +112 | 8 | qualified_chain | Head of MRU linked list of cv-qualified variants |
| +120 | 4 | size_info | Type size in target units (for constexpr value computation) |
| +128 | 4 | alignment | Type alignment |
| +132 | 1 | type_kind | Discriminant byte: 0--21 (22 values) |
| +133 | 1 | type_flags_1 | Bit 5 = is_dependent |
| +134 | 1 | elaboration_flags | Low 2 bits = elaboration specifier kind |
| +136 | 1 | type_flags_3 | Bit 2 = bitfield flag, bit 5 = unqualified strip flag |
| +144 | 8 | referenced_type | Points to base/element/return type (kind-dependent). For pointers: pointed-to type. For arrays: element type. For typedefs: underlying type |
| +145 | 1 | integer_subkind | (overlaps +144 byte 1; valid when kind==2) Bit 3 = scoped enum, bit 4 = bit-int capable |
| +146 | 1 | integer_flags | (overlaps +144 byte 2; valid when kind==2) Bit 2 = _BitInt |
| +152 | 8 | supplement_ptr | Pointer to kind-specific supplement, or member-pointer class type (kind==6 with member bit, kind==13) |
| +153 | 1 | array_flags | (overlaps +152 byte 1; valid when kind==8) Bit 0 = dependent, bit 1 = VLA, bit 5 = star-modified |
| +160 | 8 | secondary_data | Array bound (kind==8) / attribute info (kind==12) / enum underlying type (kind==2) |
| +161 | 1 | qualifier_or_class_flags | Typeref: cv-qualifier bits (kind==12). Class: bit 0 = local, bit 4 = template, bit 5 = anonymous (kind==9/10/11) |
| +163 | 1 | class_flags_2 | (valid when kind==9/10/11) Bit 0 = empty class |
| +164 | 1 | feature_usage | Copied to byte_12C7AFC by record_type_features_used |

Note: Fields at offsets +144--+164 form a union-like region. Different type kinds interpret these bytes differently. The overlap is intentional -- a pointer type uses +152 for the class pointer while an array type uses +153 for VLA flags, and so on.

The type_kind byte at +132 is the single most frequently read field in the entire binary. Every type query function begins by checking it, and the canonical typedef-chase pattern reads it in a tight loop.
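The typedef-chase pattern can be sketched with a struct that places type_kind at +132 and referenced_type at +144. The struct is a partial model with opaque filler for the fields not needed here; the names are hypothetical, kind 12 is tk_typeref as listed in the kind table on this page, and the size comment assumes 8-byte pointers as on x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TK_INTEGER = 2, TK_TYPEREF = 12 };

typedef struct type_node_sketch {
    unsigned char header[132];                 /* +0..+131, not modeled */
    uint8_t type_kind;                         /* +132 */
    unsigned char pad[11];                     /* +133..+143, not modeled */
    struct type_node_sketch *referenced_type;  /* +144 */
    unsigned char tail[24];                    /* +152..+175 with 8-byte pointers */
} type_node_sketch;

_Static_assert(offsetof(type_node_sketch, type_kind) == 132, "kind at +132");
_Static_assert(offsetof(type_node_sketch, referenced_type) == 144, "ref at +144");

/* Follows typeref nodes until a non-typeref kind is reached. */
const type_node_sketch *skip_typerefs_sketch(const type_node_sketch *t)
{
    assert(t);
    while (t->type_kind == TK_TYPEREF)
        t = t->referenced_type;
    return t;
}
```

A chain alias2 -> alias1 -> int resolves to the int node in two iterations, while a node that is not a typeref is returned unchanged.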

Type Kind Enumeration (22 Values)

EDG uses 22 type kind values (tk_*), each with optional supplementary allocations for kind-specific metadata.

| Value | Name | Supplement | Supplement Size | Description |
|---|---|---|---|---|
| 0 | tk_none | -- | -- | Sentinel / uninitialized |
| 1 | tk_void | -- | -- | void type |
| 2 | tk_integer | integer_type_supplement | 32 B | All integer types: bool, char, short, int, long, long long, __int128, _BitInt(N), and enumerations. Subkind at +145 discriminates |
| 3 | tk_float | -- | -- | float (format byte at +144 = 2) |
| 4 | tk_double | -- | -- | double |
| 5 | tk_long_double | -- | -- | long double, __float128, _Float16, __bf16 |
| 6 | tk_pointer | -- | -- | Pointer to T. Bit 0 of +152 distinguishes member pointers from object pointers |
| 7 | tk_routine | routine_type_supplement | 64 B | Function type. Supplement holds parameter list, calling convention, this-class pointer, exception specification |
| 8 | tk_array | -- | -- | Array of T. Bound at +160, element type at +144 |
| 9 | tk_struct | class_type_supplement | 208 B | struct type |
| 10 | tk_class | class_type_supplement | 208 B | class type |
| 11 | tk_union | class_type_supplement | 208 B | union type |
| 12 | tk_typeref | typeref_type_supplement | 56 B | Typedef / elaborated type. References the underlying type at +144. This is the "chase me" kind |
| 13 | tk_pointer_to_member | -- | -- | Pointer-to-member. Member type at +144, class type at +152 |
| 14 | tk_template_param | templ_param_supplement | 40 B | Unresolved template type parameter |
| 15 | tk_typeof | -- | -- | typeof / __typeof__ expression type |
| 16 | tk_decltype | -- | -- | decltype(expr) type |
| 17 | tk_pack_expansion | -- | -- | Parameter pack expansion |
| 18 | tk_auto | -- | -- | auto / decltype(auto) placeholder |
| 19 | tk_rvalue_reference | -- | -- | Rvalue reference T&& |
| 20 | tk_nullptr_t | -- | -- | std::nullptr_t |
| 21 | tk_reserved | -- | -- | Reserved / unused (handled as no-op in set_type_kind) |

The Integer Type Supplement (32 Bytes)

Integer types (kind 2) carry a 32-byte supplement allocated by set_type_kind and tracked by qword_126F8E8. This supplement discriminates the enormous variety of C++ integer types -- bool, char, signed char, unsigned char, wchar_t, char8_t, char16_t, char32_t, short, int, long, long long, __int128, _BitInt(N), and all scoped/unscoped enumerations.

The integer subkind value (at byte +145 of the parent type node) encodes:

| Value | Type Category |
|---|---|
| 1--10 | Standard integer types (bool through unsigned long long) |
| 11 | _BitInt / extended integer |
| 12 | __int128 / extended |

Signedness is determined by a lookup table at byte_E6D1B0, indexed by the integer subkind value.

The Routine Type Supplement (64 Bytes)

Function types (kind 7) carry a 64-byte supplement tracked by qword_126F958. Key fields:

| Offset (in supplement) | Size | Field |
|---|---|---|
| +0 | 8 | Parameter type list head |
| +8 | 8 | Exception specification |
| +16 | 4 | Calling convention / noexcept flags |
| +32 | 16 | Bitfield struct (ABI attributes, variadic flag) |
| +40 | 8 | this-class pointer (for member functions) |

The Class Type Supplement (208 Bytes)

Class/struct/union types (kinds 9/10/11) carry a 208-byte supplement tracked by qword_126F948. This is the largest supplement and contains the full class metadata:

| Offset (in supplement) | Size | Field |
|---|---|---|
| +0 | 8 | Scope pointer (member declarations) |
| +8 | 8 | Base class list head |
| +16 | 8 | Virtual function table pointer |
| +40 | 4 | Initialized to 1 by init_class_type_supplement_fields |
| +86 | 1 | Bit 0 = has virtual bases, bit 3 = has user conversion |
| +88 | 1 | Bit 5 = has flexible array / VLA member |
| +100 | 4 | Class kind (9=struct, 10=class, 11=union) |
| +128 | 8 | Scope block pointer |
| +176 | 4 | Initialized to -1 (sentinel) |

The Typeref Supplement (56 Bytes)

Typedef types (kind 12) carry a 56-byte supplement tracked by qword_126F8F0. A typeref wraps another type, creating the alias chain that all query functions must chase. The supplement holds the typedef declaration entity, elaborated type specifier information, and attribute data.

The Typedef Chase Pattern

The most pervasive code pattern in the entire binary is the typedef chase loop. Because C++ types may be wrapped in arbitrarily many typedef layers (typedef int myint; typedef myint myint2;), every function that inspects a type property must first resolve through all typedef indirections to reach the underlying canonical type.

The canonical pattern appears in every one of the 130 leaf query functions:

// Canonical typedef chase -- appears 130+ times in types.c
type_t *skip_typedefs(type_t *type) {
    while (type->type_kind == 12)   // 12 = tk_typeref
        type = type->referenced_type;  // offset +144
    return type;
}

bool is_class_or_struct_or_union_type(type_t *type) {
    type = skip_typedefs(type);
    int kind = type->type_kind;     // offset +132
    return kind == 9 || kind == 10 || kind == 11;
}

In x86-64 machine code, this compiles to a four-instruction loop:

.loop:
    cmp  byte [rdi+132], 12       ; type->type_kind == tk_typeref?
    jne  .done
    mov  rdi, [rdi+144]           ; type = type->referenced_type
    jmp  .loop
.done:

Why 130 Separate Functions?

A natural question: why does EDG have 130 individual query functions instead of a single get_type_kind() accessor? The answer is the EDG compilation model. Each function in types.c is a public API entry point that other source files (parse.c, lower.c, templates.c, etc.) can call without including the full type-system header. This provides:

  1. Encapsulation. Callers never see the type_kind enum values or internal layout. They call is_class_or_struct_or_union_type() instead of checking kind == 9 || kind == 10 || kind == 11.

  2. Binary stability. If EDG adds a new type kind or renumbers existing ones, only types.c needs recompilation. All callers are insulated.

  3. Fast-path optimization. Each leaf function is tiny (10--30 bytes of machine code), fits in a single cache line, and branches on at most 2--3 constants. The branch predictor handles these trivially.

  4. Semantic naming. is_arithmetic_type() is self-documenting where kind >= 2 && kind <= 5 is not. This matters in a 2.5M-line codebase.

Query Function Catalog (Top 30 by Caller Count)

| Address | Callers | Function | Returns |
|---|---|---|---|
| 0x7A8A30 | 407 | is_class_or_struct_or_union_type | kind in {9,10,11} |
| 0x7A9910 | 389 | type_pointed_to | ptr->referenced_type (kind==6) |
| 0x7A9E70 | 319 | get_cv_qualifiers | Accumulated cv-qualifier bits (& 0x7F) |
| 0x7A6B60 | 299 | is_dependent_type | Bit 5 of byte +133 |
| 0x7A7630 | 243 | is_object_pointer_type | kind==6 and not member pointer |
| 0x7A8370 | 221 | is_array_type | kind==8 |
| 0x7A7B30 | 199 | is_member_pointer_or_ref | kind==6 with member bit |
| 0x7A6AC0 | 185 | is_reference_type | kind==7 |
| 0x7A8DC0 | 169 | is_function_type | kind==14 |
| 0x7A6E90 | 140 | is_void_type | kind==1 |
| 0x7A7C40 | 132 | is_trivially_copy_constructible | Recursive triviality check |
| 0x7A9350 | 126 | array_element_type (deep) | Strips arrays+typedefs to element |
| 0x7A7010 | 85 | is_enum_type | kind==2 with scoped check |
| 0x7A71B0 | 82 | is_integer_type | kind==2 |
| 0x7A8020 | 77 | type_size_and_alignment | Computes sizeof/alignof |
| 0x7A7810 | 77 | is_member_pointer_flag | kind==6, bit 0 of +152 |
| 0x7A8270 | 77 | get_mangled_type_encoding | Type encoding for name mangling |
| 0x7A8D90 | 76 | is_pointer_to_member_type | kind==13 |
| 0x7A73F0 | 70 | is_long_double_type | kind==5 |
| 0x7A7950 | 68 | is_member_ptr_with_both_bits | kind==6, bits 0 and 1 of +152 |
| 0x7A70F0 | 62 | is_scoped_enum_type | kind==2, bit 3 of +145 |
| 0x7A6EF0 | 56 | is_rvalue_reference_type | kind==19 (rvalue reference T&&) |
| 0x7A9310 | 51 | array_element_type (shallow) | One-level array to element |
| 0x7A6B90 | 46 | is_simple_function_type | kind==8, specific flag pattern |
| 0x7A7220 | 43 | is_bit_int_type | kind==2, bit 2 of +146 |
| 0x7A7300 | 42 | is_floating_point_type | kind in {3,4,5} |
| 0x7A7750 | 40 | is_non_member_ptr_type | kind==6, no member bit |
| 0x7A6EC0 | 39 | is_nullptr_t_type | kind==20 |
| 0x7A99D0 | 37 | pm_member_type | kind==13, extracts member type at +152 |
| 0x7A8F10 | 34 | is_unresolved_function_type | kind==14, constraint check |

Total: 128 unique query functions, 4,448 call sites, average 34.75 callers per function.

Typedef Stripping Variants

Six specialized typedef-stripping functions exist, each stopping at a different boundary:

| Address | Function | Behavior |
|---|---|---|
| 0x7A68F0 | skip_typedefs | Strips all typedef layers, preserves cv-qualifiers |
| 0x7A6930 | skip_named_typedefs | Strips typedefs that have no name |
| 0x7A6970 | skip_to_attributed_typedef | Stops at typedef with attribute flag set |
| 0x7A69C0 | skip_typedefs_and_attributes | Strips both typedef and attribute layers |
| 0x7A6A10 | skip_to_elaborated_typedef | Stops at typedef with elaborated-type-specifier flag |
| 0x7A6A70 | skip_non_attributed_typedefs | Stops at typedef with any attribute bits |

These variants exist because C++ semantics sometimes care about intermediate typedef layers. For example, [[deprecated]] typedef int bad_int; attaches the attribute to the typedef itself, not to int. A function checking for deprecation must stop at the attributed typedef layer rather than chasing through to int.
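
The difference between a full chase and a boundary-aware chase can be sketched on a simplified node. This is only an illustration: the `type_node` struct, the `has_attribute` flag, and both function names are hypothetical stand-ins, not the layout or names recovered from the binary.

```c
#include <stddef.h>
#include <stdbool.h>

/* Simplified stand-in for a type node; field names are illustrative only. */
typedef struct type_node {
    unsigned char kind;            /* 12 = tk_typeref, as in the kind table */
    bool has_attribute;            /* hypothetical: this typedef carries an attribute */
    struct type_node *referenced;  /* underlying type for typerefs */
} type_node;

enum { TK_INTEGER = 2, TK_TYPEREF = 12 };

/* Chase every typedef layer down to the underlying type. */
static type_node *chase_all(type_node *t) {
    while (t->kind == TK_TYPEREF)
        t = t->referenced;
    return t;
}

/* Chase typedef layers, but stop at the first one that carries an attribute. */
static type_node *chase_to_attributed(type_node *t) {
    while (t->kind == TK_TYPEREF && !t->has_attribute)
        t = t->referenced;
    return t;
}
```

For a chain `alias -> bad_int -> int` where only `bad_int` is attributed, `chase_all` ends at `int` while `chase_to_attributed` stops at `bad_int`, which is the layer a deprecation check needs to see.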

Duplicate Query Functions

Several functions are exact binary duplicates with distinct addresses:

  • 0x7A7630 = 0x7A7670 = 0x7A7750 (is_non_member_pointer / is_object_pointer_type)
  • 0x7A7B00 = 0x7A7B70 (is_pointer_type)
  • 0x7A78D0 = 0x7A7910 (is_non_const_ref)

These duplicates exist because EDG uses distinct function names for semantic clarity even when the implementation is identical. The function-level linker does not merge them because they have distinct symbols with different ABI meanings: callers of is_object_pointer_type() and is_non_member_pointer_type() conceptually ask different questions even though the current answer is the same. If a future C++ revision changed pointer semantics, only one function would need updating.

Type Allocation

Type nodes are allocated by alloc_type (sub_5E3D40), which follows the standard IL allocation pattern used by all node allocators in il_alloc.c:

type_t *alloc_type(int type_kind) {
    // 1. Optional debug trace
    if (dword_126EFC8)
        trace_enter("alloc_type");

    // 2. Bump-allocate 176 bytes from the current region
    void *raw = region_alloc(dword_126EC90, 176);

    // 3. Set up IL prefix (16 bytes before the returned pointer)
    // raw[0..7] = TU-copy address (0 if not in copy mode)
    // raw[8..15] = next pointer (0)
    if (!dword_106BA08) {
        ++qword_126F7C0;         // orphan prefix count
        *(raw + 0) = 0;          // TU-copy addr
    }
    ++qword_126F750;             // IL entry count
    *(raw + 8) = 0;              // next pointer

    // 4. Set prefix flags byte
    byte flags = 1;              // bit 0 = allocated
    if (!dword_106BA08)
        flags |= 2;             // bit 1 = file-scope
    if (dword_126E5FC & 1)
        flags |= 8;             // bit 3 = C++ mode
    *(raw + 8) = flags;

    // 5. Increment type counter
    ++qword_126F8E0;

    // 6. Copy 96-byte common IL header
    type_t *result = raw + 16;
    memcpy(result, &xmmword_126F6A0, 96);

    // 7. Dispatch to set_type_kind
    set_type_kind(result, type_kind);

    // 8. Optional debug trace
    if (dword_126EFC8)
        trace_leave();

    return result;
}

set_type_kind: The 22-Way Dispatch

set_type_kind (sub_5E2E80) writes the kind byte and allocates any required supplement structure:

void set_type_kind(type_t *type, int kind) {
    type->type_kind = kind;      // byte at +132

    switch (kind) {
    case 0:  case 1:             // tk_none, tk_void
    case 17: case 18:            // tk_pack_expansion, tk_auto
    case 19: case 20: case 21:   // tk_rvalue_reference, tk_nullptr_t, tk_reserved
        break;                   // no supplement needed

    case 2:                      // tk_integer
        type->referenced_type = 5;  // default integer subkind
        type->supplement_ptr = alloc_permanent(32);
        ++qword_126F8E8;        // integer supplement counter
        // Store source position at supplement+16
        break;

    case 3: case 4: case 5:     // tk_float, tk_double, tk_long_double
        type->referenced_type = 2;  // format byte
        break;

    case 6:                      // tk_pointer
        type->supplement_ptr = 0;   // zero class-pointer field
        type->secondary_data = 0;
        break;

    case 7:                      // tk_routine (function type)
        type->supplement_ptr = alloc_permanent(64);
        ++qword_126F958;        // routine supplement counter
        // Initialize calling convention bitfield at supplement+32
        break;

    case 8:                      // tk_array
        // Zero size and flags fields
        break;

    case 9: case 10: case 11:   // tk_struct, tk_class, tk_union
        type->supplement_ptr = alloc_permanent(208);
        ++qword_126F948;        // class supplement counter
        init_class_type_supplement_fields(type->supplement_ptr);
        type->supplement_ptr->class_kind = kind;  // at supplement+100
        break;

    case 12:                     // tk_typeref
        type->supplement_ptr = alloc_permanent(56);
        ++qword_126F8F0;        // typeref supplement counter
        break;

    case 13:                     // tk_pointer_to_member
        // Zero member/class fields
        break;

    case 14:                     // tk_template_param
        type->supplement_ptr = alloc_permanent(40);
        ++qword_126F8F8;        // template param supplement counter
        break;

    case 15: case 16:           // tk_typeof, tk_decltype
        // Zero expression pointer fields
        break;

    default:
        internal_error("set_type_kind: bad type kind");
    }
}
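
The supplement sizes used by the dispatch can be condensed into a lookup, which is handy when sanity-checking the statistics counters. The function name `supplement_size_for_kind` is hypothetical; the sizes are the ones listed in the type kind table.

```c
#include <stddef.h>

/* Supplement size in bytes for a given type kind; 0 means no supplement.
   Hypothetical helper: the values mirror the type kind table. */
static size_t supplement_size_for_kind(int kind) {
    switch (kind) {
    case 2:                     /* tk_integer */
        return 32;
    case 7:                     /* tk_routine */
        return 64;
    case 9: case 10: case 11:   /* tk_struct, tk_class, tk_union */
        return 208;
    case 12:                    /* tk_typeref */
        return 56;
    case 14:                    /* tk_template_param */
        return 40;
    default:                    /* all other kinds carry no supplement */
        return 0;
    }
}
```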

Qualified Type Construction: The MRU Cache

When the compiler needs a const int given an int, it calls f_make_qualified_type (sub_5D64F0). This function is called extremely frequently -- every variable declaration, function parameter, and expression type computation may need cv-qualified variants. EDG optimizes this with a move-to-front (MRU) linked list cache on each type node.

type_t *f_make_qualified_type(type_t *base_type, int qualifiers) {
    // qualifiers bitmask: bit 0 = const, bit 1 = volatile,
    //                     bit 2 = restrict, bits 3-6 = address space

    // 1. Array special case: cv-qualify the element type, not the array
    if (base_type->type_kind == 8) {   // array
        type_t *elem = base_type->referenced_type;
        type_t *qual_elem = f_make_qualified_type(elem, qualifiers);
        return rebuild_array_type(base_type, qual_elem);
    }

    // 2. Strip existing qualifiers that already match
    int existing = get_cv_qualifiers(base_type) & 0x7F;
    int needed = qualifiers & ~existing;
    if (needed == 0)
        return base_type;           // already qualified as requested

    // 3. Search the MRU cache at base_type->qualified_chain (+112)
    type_t *prev = NULL;
    type_t *cur = base_type->qualified_chain;
    while (cur) {
        if (cur->type_kind == 12 &&             // must be typeref
            (cur->class_flags_1 & 0x7F) == qualifiers) {
            // Cache hit -- move to front if not already there
            if (prev) {
                prev->next = cur->next;
                cur->next = base_type->qualified_chain;
                base_type->qualified_chain = cur;
            }
            return cur;
        }
        prev = cur;
        cur = cur->next;
    }

    // 4. Cache miss -- allocate new qualified type
    type_t *qual = alloc_type(12);              // tk_typeref
    qual->referenced_type = base_type;          // +144 = underlying type
    qual->class_flags_1 = qualifiers & 0x7F;    // +161 = qualifier bits
    setup_type_node(qual);                       // sub_5B3DE0

    // 5. Insert at head of cache list
    qual->next = base_type->qualified_chain;
    base_type->qualified_chain = qual;

    return qual;
}

The MRU optimization is critical because type construction is highly skewed: const T is needed far more often than volatile const restrict T. By moving the most recently matched qualified variant to the front of the chain, subsequent lookups for the same qualification find it immediately.

The same MRU pattern appears in ptr_to_member_type_full (sub_5DB220), which caches pointer-to-member types on the member type's qualification chain at +112.
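
The move-to-front step can be isolated from the type machinery. The following is a minimal sketch on a hypothetical `qual_node` list; the struct and the function name `mru_lookup` are illustrative assumptions, and only the relinking logic is meant to match the pseudocode above.

```c
#include <stddef.h>

/* Illustrative cache entry: one node per qualifier combination. */
typedef struct qual_node {
    unsigned quals;            /* 7-bit qualifier mask */
    struct qual_node *next;    /* next entry in the per-type chain */
} qual_node;

/* Look up `quals` in the chain at *head. On a hit, unlink the entry and
   reinsert it at the front. Returns NULL on a miss. */
static qual_node *mru_lookup(qual_node **head, unsigned quals) {
    qual_node *prev = NULL;
    for (qual_node *cur = *head; cur; prev = cur, cur = cur->next) {
        if ((cur->quals & 0x7F) == quals) {
            if (prev) {                 /* not already at the front */
                prev->next = cur->next;
                cur->next = *head;
                *head = cur;
            }
            return cur;
        }
    }
    return NULL;
}
```

After a hit on the last entry of a chain `a -> b -> c`, the chain becomes `c -> a -> b`, so a repeated request for the same qualification is answered by the first comparison.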

CV-Qualifier Bitmask

| Bit | Mask | Qualifier |
|---|---|---|
| 0 | 0x01 | const |
| 1 | 0x02 | volatile |
| 2 | 0x04 | __restrict |
| 3--6 | 0x78 | Address space qualifier (CUDA/OpenCL) |

The 7-bit mask (& 0x7F) at offset +161 of a typeref node encodes the full cv-qualification. get_cv_qualifiers (sub_7A9E70, 319 callers) accumulates these bits by chasing the typedef chain:

int get_cv_qualifiers(type_t *type) {
    int quals = 0;
    while (type->type_kind == 12) {     // chase typedefs
        quals |= type->class_flags_1 & 0x7F;
        type = type->referenced_type;
    }
    return quals;
}
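
A worked example may help: for `typedef const int ci; typedef volatile ci vci;`, chasing `vci` accumulates `0x02 | 0x01 = 0x03`. The sketch below reproduces that on a simplified node. The `tnode` struct and the enum names are hypothetical, and extracting the address-space field by shifting right by 3 is an assumption based only on the bit positions in the table.

```c
#include <stddef.h>

enum {
    CVQ_CONST          = 0x01,
    CVQ_VOLATILE       = 0x02,
    CVQ_RESTRICT       = 0x04,
    CVQ_ADDRSPACE_MASK = 0x78,
    CVQ_ALL            = 0x7F
};

/* Simplified stand-in for a type node (illustrative field names). */
typedef struct tnode {
    unsigned char kind;          /* 12 = tk_typeref */
    unsigned char quals;         /* qualifier bits carried by this layer */
    const struct tnode *referenced;
} tnode;

/* OR together the qualifier bits of every typeref layer in the chain. */
static int accumulate_quals(const tnode *t) {
    int quals = 0;
    while (t->kind == 12) {
        quals |= t->quals & CVQ_ALL;
        t = t->referenced;
    }
    return quals;
}

/* Assumed decoding: bits 3-6 as a small integer. */
static int address_space_bits(int quals) {
    return (quals & CVQ_ADDRSPACE_MASK) >> 3;
}
```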

Type Comparison

sub_7AA150 (types_are_identical, 636 lines) is the main structural type comparison function. It handles all 22 type kinds with recursive descent into component types. The algorithm:

  1. Chase typedefs on both operands to reach canonical types.
  2. If pointer-equal after chasing, return true (the common fast path).
  3. If kinds differ, return false.
  4. Dispatch on kind:
    • Integer (kind 2): Compare integer subkind values.
    • Pointer (kind 6): Recursively compare pointed-to types.
    • Array (kind 8): Compare bounds and recursively compare element types.
    • Function (kind 7): Compare return type, then parameter-by-parameter.
    • Class (kind 9/10/11): Pointer equality only (nominal typing).
    • Template param (kind 14): Compare parameter index and depth.
    • Pointer-to-member (kind 13): Compare both class and member types.

The comparison is structural for most types but nominal for classes. Two distinct struct Foo definitions in different scopes are different types even if they have identical members.
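
The structural-versus-nominal split can be illustrated with a reduced comparison that covers only three kinds. This is a sketch under stated assumptions: the `ty` struct and `same_type` are hypothetical, cv-qualifiers are ignored, and every kind other than integer, pointer, and class is simply treated as unequal.

```c
#include <stddef.h>
#include <stdbool.h>

/* Reduced type node: kind, integer subkind, and one component type. */
typedef struct ty {
    int kind;                 /* 2 = integer, 6 = pointer, 9/10/11 = class, 12 = typeref */
    int subkind;              /* meaningful for integers only */
    const struct ty *target;  /* pointee or typedef target */
} ty;

static const ty *strip_typerefs(const ty *t) {
    while (t->kind == 12)
        t = t->target;
    return t;
}

static bool same_type(const ty *a, const ty *b) {
    a = strip_typerefs(a);
    b = strip_typerefs(b);
    if (a == b)
        return true;                  /* fast path: same node */
    if (a->kind != b->kind)
        return false;
    switch (a->kind) {
    case 2:                           /* structural: compare subkinds */
        return a->subkind == b->subkind;
    case 6:                           /* structural: recurse into pointee */
        return same_type(a->target, b->target);
    case 9: case 10: case 11:         /* nominal: distinct nodes never match */
        return false;
    default:                          /* kinds omitted from this sketch */
        return false;
    }
}
```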

Cross-TU Type Correspondence

For relocatable device code (RDC) compilation, cudafe++ must match types across translation units. sub_7B2260 (types_are_equivalent_for_correspondence, 688 lines) performs a deep structural comparison that tolerates certain cross-TU differences (different typedef layers, different source positions) while requiring identical essential structure.

Type Construction Functions

Beyond f_make_qualified_type, several other type construction functions use the same cache pattern:

| Address | Function | Creates | Cache Location |
|---|---|---|---|
| 0x5D64F0 | f_make_qualified_type | const T, volatile T, etc. | Type +112 chain |
| 0x5D6770 | make_vector_type | __attribute__((vector_size(N))) T | Allocated fresh |
| 0x5D68E0 | character_type | char[N] string literal types | Hash table at qword_126F2F8 (81-slot per kind) |
| 0x5DB220 | ptr_to_member_type_full | T Class::* | Member type +112 chain (MRU) |
| 0x7AB9B0 | construct_function_type | R(Args...) | Allocated fresh (423 lines) |
| 0x7A6320 | make_cv_combined_type | Combines cv-quals from two types | Allocated fresh |

Character Type Cache

String literal types (char[5], wchar_t[12], etc.) are extremely common in C++ programs. character_type (sub_5D68E0) uses a hash-table cache at qword_126F2F8 with 81 slots per character kind (5 kinds: char, wchar_t, char8_t, char16_t, char32_t), covering array sizes 0 through 80. For sizes exceeding 80, no caching is performed and a fresh array type is allocated every time.
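
The caching policy described here (81 slots for each of 5 character kinds, no caching above size 80) can be sketched as follows. This version assumes a directly indexed two-dimensional table, which may differ from the actual hashing in the binary; `array_type`, `new_array_type`, and `character_type_sketch` are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

enum { CHAR_KINDS = 5, CACHED_SIZES = 81 };   /* sizes 0..80 are cached */

/* Minimal stand-in for an array type node. */
typedef struct array_type {
    int char_kind;      /* 0..4: char, wchar_t, char8_t, char16_t, char32_t */
    size_t length;      /* number of elements */
} array_type;

static array_type *cache[CHAR_KINDS][CACHED_SIZES];

/* Hypothetical allocator standing in for the real type constructor. */
static array_type *new_array_type(int char_kind, size_t length) {
    array_type *t = malloc(sizeof *t);
    if (t) {
        t->char_kind = char_kind;
        t->length = length;
    }
    return t;
}

/* Return a shared node for small sizes, a fresh node for anything larger. */
static array_type *character_type_sketch(int char_kind, size_t length) {
    if (length >= CACHED_SIZES)
        return new_array_type(char_kind, length);   /* uncached path */
    array_type **slot = &cache[char_kind][length];
    if (*slot == NULL)
        *slot = new_array_type(char_kind, length);
    return *slot;
}
```

Two requests for `char[5]` return the same node, while two requests for `char[81]` return distinct nodes.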

Class Layout: do_class_layout

sub_662670 (do_class_layout, 2,548 lines) is the most complex function in the type system. It implements the Itanium C++ ABI class layout algorithm with GNU extensions, MSVC compatibility mode, and CUDA-specific adjustments. It is called exactly once per class definition from sub_442680 (class definition processing).

What do_class_layout Computes

For each class/struct/union, the function determines:

  • sizeof: Total class size including padding.
  • alignof: Required alignment, incorporating alignas, __attribute__((aligned)), and #pragma pack.
  • Member offsets: Byte offset of each non-static data member.
  • Base class offsets: Byte offset of each non-virtual base class subobject.
  • Virtual base offsets: Byte offset of each virtual base class subobject (stored in the vtable).
  • Vtable pointer placement: Where _vptr is placed (offset 0 for primary base, elsewhere for secondary).
  • Empty base optimization (EBO): Whether empty base classes can share address with data members.
  • Bit-field packing: How bit-fields are packed into allocation units.
  • Tail padding reuse: Whether derived classes can place members in base class tail padding (non-POD only).

Pseudocode: Itanium ABI Layout

void do_class_layout(type_t *class_type) {
    class_info_t *info = class_type->supplement_ptr;
    int sizeof_val = 0;
    int alignof_val = 1;
    int dsize = 0;          // data size (excludes tail padding)

    // PHASE 1: Lay out non-virtual base classes
    for (base_t *base = info->base_list; base; base = base->next) {
        if (base->is_virtual)
            continue;       // defer virtual bases

        int base_size = base->type->size_info;
        int base_align = base->type->alignment;

        // Empty base optimization
        if (is_empty_class(base->type)) {
            int offset = 0;
            while (empty_base_conflict(class_type, base->type, offset))
                offset += base_align;
            set_base_class_offset(base, offset);
            // sizeof may not increase for empty bases
        } else {
            // Align dsize up to base alignment
            dsize = ALIGN_UP(dsize, base_align);
            set_base_class_offset(base, dsize);
            dsize += base_size;
        }

        alignof_val = MAX(alignof_val, base_align);
    }

    // PHASE 2: Place vptr if needed
    if (class_has_virtual_functions(class_type) &&
        !has_primary_base_with_vptr(class_type)) {
        // vptr at current offset (usually 0)
        dsize = ALIGN_UP(dsize, POINTER_ALIGN);
        dsize += POINTER_SIZE;
        alignof_val = MAX(alignof_val, POINTER_ALIGN);
    }

    // PHASE 3: Lay out non-static data members
    for (field_t *field = info->first_field; field; field = field->next) {
        int field_align = alignment_of_field_full(field);
        int field_size = field->type->size_info;

        if (field->is_bitfield) {
            align_offsets_for_bit_field(field, &dsize, &alignof_val);
            continue;
        }

        dsize = ALIGN_UP(dsize, field_align);

        // Warn if field lands in tail padding of a base class
        warn_if_offset_in_tail_padding(class_type, dsize, field);

        field->offset = dsize;
        dsize += field_size;
        alignof_val = MAX(alignof_val, field_align);
    }

    // PHASE 4: Lay out virtual base classes
    for (base_t *base = info->base_list; base; base = base->next) {
        if (!base->is_virtual)
            continue;

        int base_align = base->type->alignment;
        if (is_empty_class(base->type)) {
            int offset = sizeof_val;
            while (subobject_conflict(class_type, base->type, offset))
                offset += base_align;
            set_virtual_base_class_offset(base, offset);
        } else {
            sizeof_val = ALIGN_UP(sizeof_val > dsize ? sizeof_val : dsize,
                                  base_align);
            set_virtual_base_class_offset(base, sizeof_val);
            sizeof_val += base->type->size_info;
        }
    }

    // PHASE 5: Finalize
    sizeof_val = MAX(sizeof_val, dsize);
    sizeof_val = ALIGN_UP(sizeof_val, alignof_val);
    if (sizeof_val == 0)
        sizeof_val = 1;     // C++ requires sizeof >= 1

    compute_empty_class_bit(class_type);
    trailing_base_does_not_affect_gnu_size(class_type);
    check_explicit_alignment(class_type);

    class_type->size_info = sizeof_val;
    class_type->alignment = alignof_val;

    // Debug: dump_layout() if debug flag set
    if (dword_126EFC8)
        dump_layout(class_type);
}
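
Phases 3 and 5 reduce to a small amount of arithmetic when there are no bases, bit-fields, or virtual functions. The sketch below covers only that reduced case; `field_desc`, `layout_result`, and `layout_fields` are hypothetical names, and `ALIGN_UP` is written out as an assumption about what the pseudocode's macro does.

```c
#include <stddef.h>

/* Round x up to the next multiple of a (a must be nonzero). */
#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

typedef struct { size_t size, align; } field_desc;
typedef struct { size_t size, align; } layout_result;

/* Assign an offset to each field in declaration order and compute the
   resulting sizeof and alignof. Bases, bit-fields, and vptrs are omitted. */
static layout_result layout_fields(const field_desc *fields, size_t n,
                                   size_t *offsets) {
    size_t dsize = 0, align = 1;
    for (size_t i = 0; i < n; i++) {
        dsize = ALIGN_UP(dsize, fields[i].align);   /* pad to field alignment */
        offsets[i] = dsize;
        dsize += fields[i].size;
        if (fields[i].align > align)
            align = fields[i].align;
    }
    size_t size = ALIGN_UP(dsize, align);           /* add tail padding */
    if (size == 0)
        size = 1;                                   /* sizeof is at least 1 */
    return (layout_result){ size, align };
}
```

For fields of (size, align) = (1, 1), (8, 8), (4, 4) the offsets come out as 0, 8, 16, with dsize 20 rounded up to a sizeof of 24 and an alignof of 8.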

Key Sub-Functions

| Address | Function | Purpose |
|---|---|---|
| 0x65EA50 | trailing_base_does_not_affect_gnu_size | Checks if trailing empty base affects GNU-compatible size vs dsize |
| 0x65EE70 | empty_base_conflict | Self-recursive: detects two empty bases of same type at same address |
| 0x65F410 | increment_field_offsets | Advances offset counters; warns about tail-padding overlap |
| 0x65F9F0 | last_user_field_of | Finds last user-declared (non-compiler-generated) field |
| 0x65FC20 | subobject_conflict | Generalizes empty_base_conflict to all subobjects |
| 0x6610B0 | set_base_class_offsets | Assigns offsets to non-virtual base class subobjects |
| 0x6614A0 | set_virtual_base_class_offset | Assigns offsets to virtual base class subobjects |
| 0x6621E0 | alignment_of_field_full | Computes field alignment considering packed, aligned, pragma pack |

Empty Base Optimization

The EBO is one of the most subtle parts of C++ layout. The C++ standard requires that two distinct subobjects of the same type have different addresses. But empty base classes (no data members, no virtual functions, all bases empty) can be placed at offset 0 without consuming space -- unless another subobject of the same type already occupies that address.

empty_base_conflict (sub_65EE70, 240 lines) is self-recursive: it walks the entire base class hierarchy checking for address collisions. When a conflict is detected, the layout engine advances the offset by the base's alignment until no conflict exists.

Alignment Computation

alignment_of_field_full (sub_6621E0, 193 lines) computes the effective alignment of a data member considering all alignment modifiers in priority order:

  1. Natural alignment of the field's type.
  2. __attribute__((aligned(N))) -- increases alignment.
  3. __attribute__((packed)) -- reduces alignment to 1.
  4. #pragma pack(N) -- caps alignment at N.
  5. __declspec(align(N)) -- MSVC mode alignment.

The interaction between these modifiers follows complex ABI rules. For example, #pragma pack(4) on a struct with a double member reduces the double's alignment from 8 to 4, but __attribute__((aligned(16))) on the same member overrides the pack to 16.
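
One way to model the interaction in the example above is shown below. This is a simplified sketch, not the recovered logic of alignment_of_field_full: the function name, the parameter encoding (0 meaning "absent"), and the rule that an explicit aligned(N) is applied last are assumptions chosen to reproduce the `#pragma pack(4)` / `aligned(16)` example. MSVC-mode `__declspec(align(N))` is not modeled.

```c
/* Effective alignment of one field.
   natural      : natural alignment of the field's type
   aligned_attr : N from aligned(N), or 0 if absent
   packed       : nonzero if packed applies
   pragma_pack  : N from #pragma pack(N), or 0 if absent */
static unsigned effective_alignment(unsigned natural, unsigned aligned_attr,
                                    int packed, unsigned pragma_pack) {
    unsigned a = natural;
    if (packed)
        a = 1;                              /* packed drops alignment to 1 */
    if (pragma_pack && a > pragma_pack)
        a = pragma_pack;                    /* pack(N) caps alignment at N */
    if (aligned_attr && aligned_attr > a)
        a = aligned_attr;                   /* assumed: explicit request wins */
    return a;
}
```

With these rules a `double` under `#pragma pack(4)` gets alignment 4, and adding `aligned(16)` to the same member raises it to 16.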

Type Trait Evaluation

sub_7BDCB0 (evaluate_type_trait, 510 lines) implements the compiler built-in type traits: __is_trivially_copyable, __is_constructible, __has_unique_object_representations, __is_aggregate, __is_empty, etc. These are dispatched via a switch on trait ID and return boolean results by inspecting the class type supplement flags and calling recursive property checks.

Type Deduction

sub_7B9670 (deduce_template_argument_type, 459 lines) handles template argument deduction from function arguments to template parameters. This is separate from the template engine's substitute_in_type (sub_7BCDE0, 800 lines), which performs the reverse operation: given concrete template arguments, produce the substituted type.

Global Type Singletons

Several frequently-used types are cached as global pointers to avoid repeated allocation:

| Global | Type |
|---|---|
| qword_126E5E0 | void type |
| qword_126F2F0 | void type (duplicate reference) |
| qword_126F1A0 | std::source_location::__impl (cached on first use) |

Statistics Tracking

Every type-related allocation increments a per-kind counter. print_trans_unit_statistics (sub_7A45A0) dumps these counters via fprintf:

| Counter | What it counts | Per-entry size |
|---|---|---|
| qword_126F8E0 | Type nodes allocated | 176 B |
| qword_126F8E8 | Integer type supplements | 32 B |
| qword_126F958 | Routine type supplements | 64 B |
| qword_126F948 | Class type supplements | 208 B |
| qword_126F8F0 | Typeref supplements | 56 B |
| qword_126F8F8 | Template param supplements | 40 B |
| qword_126F280 | Pointer-to-member types constructed | -- |

CUDA-Specific Type Extensions

Address Space Qualifiers

CUDA's __shared__, __constant__, and __device__ memory spaces are represented as address-space qualifiers in the cv-qualifier bitmask (bits 3--6 at +161). The attribute kind values {1, 6, 11, 12} (bitmask 0x1842) are checked in compare_attribute_specifiers (sub_7A5E10) to detect incompatible address-space qualified typedefs.
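
The constant 0x1842 is exactly the bit set {1, 6, 11, 12}: 0x0002 + 0x0040 + 0x0800 + 0x1000. A membership test against such a mask is typically a shift and an AND, as in this sketch. The function name is hypothetical, and the shift-and-test form is an assumption about how the check is written.

```c
/* Nonzero if `kind` is one of the attribute kinds {1, 6, 11, 12}. */
static int is_address_space_attr_kind(unsigned kind) {
    return kind < 32 && ((0x1842u >> kind) & 1u);
}
```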

Feature Usage Tracking

record_type_features_used (sub_7A4F10) records GPU feature requirements based on types encountered:

  • _BitInt types (integer subkind 11/12): sets bit 0 of byte_12C7AFC
  • __float128 / __bf16 types: sets bit 2
  • Bit-fields: sets bit 1
  • Class types: copies feature bits from +164

This information feeds into architecture gating, ensuring that code using _BitInt(128) targets a GPU architecture that supports it.

Constexpr Type Size Limits

The constexpr interpreter (sub_628DE0, f_value_bytes_for_type) enforces a 64 MB limit (0x4000000 bytes) on types used in constexpr evaluation. This prevents compile-time memory exhaustion from expressions like constexpr std::array<char, 1'000'000'000> x{};.

Function Map

| Address | Lines | Function | Source |
|---|---|---|---|
| 0x5D64F0 | 340 | f_make_qualified_type | il.c |
| 0x5DB220 | 63 | ptr_to_member_type_full | il.c |
| 0x5E2E80 | -- | set_type_kind | il_alloc.c |
| 0x5E3D40 | -- | alloc_type | il_alloc.c |
| 0x65EA50 | 105 | trailing_base_does_not_affect_gnu_size | layout.c |
| 0x65EE70 | 240 | empty_base_conflict | layout.c |
| 0x65FC20 | 271 | subobject_conflict | layout.c |
| 0x6610B0 | 196 | set_base_class_offsets | layout.c |
| 0x6614A0 | 204 | set_virtual_base_class_offset | layout.c |
| 0x6621E0 | 193 | alignment_of_field_full | layout.c |
| 0x662670 | 2548 | do_class_layout | layout.c |
| 0x7A4B40 | -- | ttt_is_type_with_no_name_linkage | types.c |
| 0x7A4F10 | -- | record_type_features_used | types.c |
| 0x7A5E10 | -- | compare_attribute_specifiers | types.c |
| 0x7A6260 | -- | type_has_flexible_array_or_vla | types.c |
| 0x7A6320 | -- | make_cv_combined_type | types.c |
| 0x7A68F0--0x7A9F90 | -- | 130 leaf query functions | types.c |
| 0x7AA150 | 636 | types_are_identical | types.c |
| 0x7AB9B0 | 423 | construct_function_type | types.c |
| 0x7AE680 | 541 | adjust_type_for_templates | types.c |
| 0x7B2260 | 688 | types_are_equivalent_for_correspondence | types.c |
| 0x7B3400 | 905 | standard_conversion_sequence | types.c |
| 0x7B5210 | 441 | require_complete_type | types.c |
| 0x7B6350 | 1107 | compute_type_layout | types.c |
| 0x7B7750 | 784 | compute_class_properties | types.c |
| 0x7B9670 | 459 | deduce_template_argument_type | types.c |
| 0x7BDCB0 | 510 | evaluate_type_trait | types.c |
| 0x7BF630 | 348 | format_type_for_diagnostic | types.c |
| 0x7C02A0 | -- | compatible_ms_bit_field_container_types | types.c |

Diagnostic System Overview

The cudafe++ diagnostic system is a 7-stage pipeline rooted in EDG 6.6's error.c. It manages 3,795 error message templates, 9 severity levels, per-error suppression tracking, #pragma diagnostic overrides, and two output formats (text and SARIF JSON). The most-connected function in the entire binary -- sub_4F2930 (assertion handler) with 5,185 call sites -- feeds into this system, making error handling the single largest cross-cutting concern in cudafe++.

Error Table

The error message template table lives at off_88FAA0: an array of 3,795 const char* pointers indexed by error code (0--3794).

| Range | Count | Origin | Display Format |
|---|---|---|---|
| 0--3456 | 3,457 | Standard EDG 6.6 | #N-D |
| 3457--3794 | 338 | NVIDIA CUDA extensions | #(N+16543)-D (20000--20337-D series) |

The renumbering logic in construct_text_message (sub_4EF9D0):

int display_code = error_code;
if (display_code > 3456)
    display_code = error_code + 16543;   // 3457 → 20000, 3794 → 20337
sprintf(buf, "%d", display_code);

The -D suffix is appended only when severity <= 7 (soft errors and below). Diagnostics with severity > 7 (hard error, catastrophic, command-line error, internal error) omit the suffix:

const char *suffix = "-D";
if (severity > 7)
    suffix = "";

Any access with error code > 3794 triggers sub_4F2D30 (error_text), which fires an assertion: "error_text: invalid error code" (error.c:911).
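
The renumbering and suffix rules combine into one small formatter. The sketch below is illustrative: `display_code` and `format_error_number` are hypothetical names, and emitting the leading `#` in the same call is an assumption based on the display formats listed in the error table.

```c
#include <stdio.h>
#include <stddef.h>

enum { LAST_EDG_ERROR = 3456, CUDA_DISPLAY_BIAS = 16543 };

/* Map an internal error code to the number shown to the user. */
static int display_code(int error_code) {
    return error_code > LAST_EDG_ERROR ? error_code + CUDA_DISPLAY_BIAS
                                       : error_code;
}

/* Write "#<number>" plus "-D" for severities up to 7.
   Returns the number of characters written, as snprintf does. */
static int format_error_number(char *buf, size_t n, int error_code,
                               int severity) {
    const char *suffix = severity > 7 ? "" : "-D";
    return snprintf(buf, n, "#%d%s", display_code(error_code), suffix);
}
```

Code 3457 at warning severity (5) formats as `#20000-D`; the same code at catastrophic severity (9) formats as `#20000`.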

Severity Levels

Nine severity values are stored as a single byte at offset 180 of the diagnostic record:

| Value | Name | Display String (lowercase) | Display String (uppercase) | Colorization | Exit Behavior |
|---|---|---|---|---|---|
| 2 | note | "note" | "Note" | cyan (code 4) | continues |
| 4 | remark | "remark" | "Remark" | cyan (code 4) | continues |
| 5 | warning | "warning" | "Warning" | magenta (code 3) | continues |
| 6 | command-line warning | "command-line warning" | "Command-line warning" | magenta (code 3) | continues |
| 7 | error (soft) | "error" | "Error" | red (code 2) | continues, counted |
| 8 | error (hard) | "error" | "Error" | red (code 2) | continues, counted, not suppressible by pragma |
| 9 | catastrophic | "catastrophic error" | "Catastrophic error" | red (code 2) | immediate exit(4) |
| 10 | command-line error | "command-line error" | "Command-line error" | red (code 2) | immediate exit(4) |
| 11 | internal error | "internal error" | "Internal error" | red (code 2) | immediate exit(11) via abort path |

Uppercase display strings are used when dword_106BCD4 is set, indicating the diagnostic originates from a predefined macro file context (e.g., "In predefined macro file: Error #...").

The special string "nv_diag_remark" at offset +8 yields "remark" -- an NVIDIA-specific annotation kind for CUDA diagnostic remarks.

Severity Byte Arrays

Three parallel byte arrays, indexed as [4 * error_code], track per-error severity state:

| Array | Address | Purpose |
|---|---|---|
| byte_1067920 | 0x1067920 | Default severity -- the compile-time severity assigned to each error code |
| byte_1067921 | 0x1067921 | Current severity -- the effective severity after #pragma overrides |
| byte_1067922 | 0x1067922 | Tracking flags -- bit 0: first-time guard, bit 1: already-emitted, bit 2: has pragma override |

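The 4-byte stride means each error code owns one 4-byte slot. Because the three array bases sit at consecutive addresses (0x1067920, 0x1067921, 0x1067922), they are interleaved views of the same slot at byte offsets 0, 1, and 2, so the three state bytes for one error code are adjacent in memory. This layout allows the pragma override system (sub_4F30A0) to look up and modify per-error severity with a single index computation.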
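
Seen from C, three byte arrays at consecutive base addresses with a stride of 4 are equivalent to an array of 4-byte records. The sketch below models that view. The struct and function names are hypothetical, the fourth byte is assumed to be unused, and setting bit 2 on override is an assumption drawn from the flag description rather than from observed code.

```c
#include <stdint.h>

enum { ERROR_COUNT = 3795 };

/* One 4-byte slot per error code. */
typedef struct {
    uint8_t default_severity;   /* byte_1067920[4 * code] */
    uint8_t current_severity;   /* byte_1067921[4 * code] */
    uint8_t tracking_flags;     /* byte_1067922[4 * code] */
    uint8_t reserved;           /* assumed unused */
} severity_slot;

static severity_slot severity_table[ERROR_COUNT];

/* Override the effective severity and mark the slot (bit 2, assumed). */
static void override_severity(int code, uint8_t severity) {
    severity_table[code].current_severity = severity;
    severity_table[code].tracking_flags |= 0x04;
}

/* Restore the compile-time default severity. */
static void restore_default(int code) {
    severity_table[code].current_severity =
        severity_table[code].default_severity;
}
```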

7-Stage Diagnostic Pipeline

  caller emits error
       |
       v
  [1] create_diagnostic_entry     sub_4F40C0
       Allocate ~200-byte record, set error_code + severity
       |
       v
  [2] check_for_overridden_severity   sub_4F30A0
       Walk #pragma diagnostic stack, apply push/pop overrides
       |
       v
  [3] check_severity              sub_4F1330      ← 62 callers, 77 callees
       Central dispatch: suppress/promote, error limit, output routing
       |
       ├─── text path ──────────────────────────────────────┐
       |                                                     |
       v                                                     v
  [4] write_message_to_buffer    sub_4EF620       [6] write_sarif_message_json  sub_4EF8A0
       Expand %XY format specifiers from template             JSON-escape + wrap
       |                                                     |
       v                                                     v
  [5] construct_text_message     sub_4EF9D0 (6.5 KB)       SARIF JSON buffer → stderr
       file:line prefix, severity label, word wrap,
       caret lines, template context, include stack
       |
       v
  [4a] process_fill_in           sub_4EDCD0 (1,202 lines)
       Expand %T/%n/%s/%p/%d/%u/%t/%r specifiers
       |
       v
       output → stderr or redirect file

Stage 1: create_diagnostic_entry (sub_4F40C0)

Allocates a diagnostic record via sub_4EC940 and initializes it:

record = allocate_diagnostic_record();
record->kind = 0;              // primary diagnostic
record->error_code = a1;       // offset 176
if (severity <= 7)
    check_for_overridden_severity(a1, &severity, position);
record->severity = severity;   // offset 180
// resolve source position → file, line, column
// link into global diagnostic chain (qword_106BA10)

The wrapper sub_4F41C0 sets dword_106B4A8 (file index mode) to -1 for command-line and fatal severities (6, 9, 10, 11), disabling file-index tracking for diagnostics that have no meaningful source location.

Sub-diagnostics are created by sub_4F5A70 with kind=2, linked to their parent's sub-diagnostic chain at offsets 40/48 of the parent record.

Stage 2: check_for_overridden_severity (sub_4F30A0)

Walks the #pragma diagnostic stack stored in qword_1067820. Each stack entry is a 24-byte record containing a source position, a pragma action code, and an optional error code target.

Pragma action codes and their effect on severity:

Code | Pragma | Effect
30 | ignored | Set severity to 3 (suppress)
31 | remark | Set severity to 4
32 | warning | Set severity to 5
33 | error | Set severity to 7
35 | default | Restore from byte_1067920[4 * error_code]
36 | push/pop marker | Scope boundary for push/pop tracking
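The table reads naturally as a switch on the action code. The sketch below is illustrative only: the enum names and the function apply_pragma_action are hypothetical, and leaving the severity unchanged for code 36 and for unknown codes is an assumption.

```c
#include <stdint.h>

/* Numeric action codes from the table; the identifiers are hypothetical. */
enum pragma_action {
    ACT_IGNORED = 30, ACT_REMARK = 31, ACT_WARNING = 32,
    ACT_ERROR = 33, ACT_DEFAULT = 35, ACT_PUSH_POP = 36
};

/* Returns the severity that results from applying one pragma action. */
uint8_t apply_pragma_action(int action, uint8_t current, uint8_t default_sev) {
    switch (action) {
    case ACT_IGNORED: return 3;            /* suppress */
    case ACT_REMARK:  return 4;
    case ACT_WARNING: return 5;
    case ACT_ERROR:   return 7;
    case ACT_DEFAULT: return default_sev;  /* value from byte_1067920[4 * code] */
    default:          return current;      /* scope markers: assumed no effect */
    }
}
```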

The function uses binary search (bsearch with comparator sub_4ECD20) to find the nearest pragma entry that applies at the current source position, then walks backward through the stack to resolve nested push/pop scopes.
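A rough sketch of the lookup follows. It swaps the bsearch call for a hand-written upper-bound search, because plain bsearch only returns exact matches. The pragma_entry fields, the integer position, and both function names are hypothetical, and push/pop scope resolution is omitted.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t position;   /* hypothetical: a totally ordered source position */
    int      action;     /* pragma action code (30..36) */
    int      error_code; /* error code the pragma targets */
} pragma_entry;

/* Index of the last entry with position <= pos, or -1 if there is none.
 * The stack is assumed to be sorted by position. */
long last_entry_at_or_before(const pragma_entry *stack, size_t n, uint64_t pos) {
    size_t lo = 0, hi = n;   /* entries [0, lo) are <= pos, [hi, n) are > pos */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (stack[mid].position <= pos)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (long)lo - 1;
}

/* Walk backward from the nearest entry to the most recent one naming `code`. */
const pragma_entry *find_override(const pragma_entry *stack, size_t n,
                                  uint64_t pos, int code) {
    for (long i = last_entry_at_or_before(stack, n, pos); i >= 0; i--)
        if (stack[i].error_code == code)
            return &stack[i];
    return NULL;
}
```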

Stage 3: check_severity (sub_4F1330)

The central dispatch function (601 decompiled lines, 62 callers, 77 callees). This is the most complex function in the error subsystem.

Complete decision tree pseudocode (derived from the decompiled sub_4F1330):

void check_severity(diagnostic_record *record) {
    dword_1065938 = 0;                    // reset caret-position cache
    uint8_t min_sev = byte_126ED69;       // minimum severity threshold

    // ── Gate 1: Minimum severity filter ──
    if (min_sev > record->severity) {
        if (min_sev == 3)                 // severity 3 = suppress sentinel
            ASSERT_FAIL("check_severity", error.c:3859);
        goto count_and_exit;              // silently discard
    }

    // ── Gate 2: System-header / suppress-all severity floor ──
    if (is_system_header(record->source_sequence_number)) {
        min_sev = 8;                      // only hard errors (>= 8) survive
    } else if (qword_106BCD8) {           // suppress-all-but-fatal mode
        min_sev = 7;                      // error-level floor
    } else if (min_sev == 3) {
        ASSERT_FAIL("check_severity", error.c:3859);
    }

    if (record->severity < min_sev)
        goto count_and_exit;

    // ── Gate 3: Per-error tracking flags ──
    uint8_t *flags = &byte_1067922[4 * record->error_code];
    if (record->severity <= 7) {          // suppressible severities only
        uint8_t old = *flags;
        *flags |= 2;                      // mark as emitted
        if ((old & 1) && (old & 2))       // first-time guard + already-emitted
            goto suppressed;              // skip: already seen in this scope
    } else {
        *flags |= 2;                      // hard errors: always mark, never skip
    }

    // ── Gate 4: Pragma diagnostic check ──
    if (dword_126C5E4 != -1) {            // scope tracking active
        if (check_pragma_diagnostic(record->error_code,
                                     record->severity,
                                     &record->source_seq)) {
suppressed:
            // Update error/warning counters even when suppressed
            uint8_t sev = record->severity;
            if (sev > 7 || sev >= byte_126ED68)   // at or above the promotion threshold
                sev = 7;                          // count as error
            update_suppressed_counts(sev, &qword_126EDC8);
            goto count_and_exit;
        }
        // Record pragma scope if applicable
        if (in_template_scope() || has_special_scope_flags())
            record_pragma_diagnostic(record->error_code, record->severity);
    }

    // ── Gate 5: Suppress-all-but-fatal redirect ──
    if (qword_106BCD8 && !dword_106BCD4 && record->error_code != 992) {
        emit_error_992();                 // replace with fatal error 992
        return;                           // guard against catastrophic loop
    }

    // ── Severity promotion: warning → error threshold ──
    uint8_t effective_sev = record->severity;
    if (effective_sev <= 7 && effective_sev >= byte_126ED68) {
        effective_sev = 7;                // promote to error
        if (dword_126C5C8 == -1) {
            update_counts(7, &qword_126ED80);
            if (!dword_126ED78)           // no further counting needed
                goto skip_extra_counts;
            goto update_additional_counts;
        }
    }
    update_counts(effective_sev, &qword_126ED80);
    if (dword_126ED78 && (uint8_t)(effective_sev - 9) > 2)  // unsigned test: severity outside 9..11
        goto update_additional_counts;

skip_extra_counts:
    if (qword_126EDC0)
        update_counts(effective_sev, qword_126EDC0);

    // ── Allocate output buffers (first use) ──
    if (!qword_106B488) {
        qword_106B488 = alloc_buffer(0x400);
        qword_106B480 = alloc_buffer(0x80);
    }
    reset_buffer(qword_106B488);
    reset_buffer(qword_106B480);

    // ── Catastrophic loop detection ──
    if (record->severity == 9) {
        if (dword_106B4B0) {              // already processing catastrophic
            fprintf(stderr, "%s\n", "Loop in catastrophic error processing.");
            emergency_exit(9);            // never returns
        }
        dword_106B4B0 = 1;               // set catastrophic guard
        if (record->error_code == 3709 || !dword_126ED48)
            goto emit_message;
    } else if (record->severity == 11 || record->error_code == 3709) {
        goto emit_message;                // internal error or warnings-as-errors
    }

    // ── Template context expansion ──
    int context_count = 0;
    for (int scope = dword_126C5E4; scope > 0; scope--) {
        context_count += format_scope_context(scope);
    }
    // Include-file context
    if (dword_126EE48 && qword_106B9F0 && has_include_context()) {
        file_info = lookup_source_file(record->source_seq);
        if (file_info != current_file) {
            context_count++;
            // Emit error 1063/1064 (include-stack context)
            create_sub_diagnostic(record, (context_count != 1) ? 1064 : 1063);
        }
    }
    // Context elision (when context_limit is set)
    if (dword_126ED58 > 0 && dword_126ED58 + 1 < context_count) {
        // Emit error 1150: "%d context lines elided"
    }

emit_message:
    // ── Output routing ──
    reset_buffer(qword_106B488);
    if (dword_106BBB8 == 1) {
        // SARIF JSON path
        write_sarif_json(record);         // → qword_106B478
        fputs(sarif_buffer, stderr);
        fflush(stderr);
    } else {
        construct_text_message(record);   // → sub_4EF9D0
    }

    // ── Termination for fatal severities ──
    if (record->severity >= 9 && record->severity <= 11) {
        cleanup();                        // flush output, close files
        emergency_exit(record->severity); // exit with severity as code
        // unreachable
    }

    // ── Error limit enforcement ──
    if (qword_126ED90 + qword_126ED98 >= qword_126ED60) {
        fprintf(stderr, "%s\n", "Error limit reached.");
        if (qword_106C260)                // raw listing file
            fwrite("C \"\" 0 0 error limit reached\n", 1, 29, qword_106C260);
        cleanup();
        emergency_exit(9);               // exit(catastrophic)
    }

    // ── Warnings-as-errors promotion ──
    if (record->severity == 5 && dword_106C088 && !dword_106B4BC) {
        uint8_t saved_min = byte_126ED69;
        byte_126ED69 = 4;                // temporarily lower threshold
        dword_106C088 = 0;               // disable mode while emitting 3709
        dword_106B4BC = 1;               // recursion guard (not cleared below)
        emit_diagnostic(4, 3709, ...);   // "warnings treated as errors"
        byte_126ED69 = saved_min;        // restore threshold
        dword_106C088 = 1;               // restore mode
    }

    // ── File index update ──
    if (dword_106B4A8 != -1)
        update_file_index(record);

count_and_exit:
    return;
}

Key decision points explained:

Minimum severity gate:

The global byte_126ED69 is the minimum severity threshold -- diagnostics below this level are silently discarded. When the threshold is 3 (the "suppress" sentinel), an assertion fires, which prevents the threshold from ever being set to the suppress level directly.

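System-header filtering:

When a diagnostic originates from a system header (detected by sub_5B9B60), the local threshold is raised to 8, so only hard errors (severity 8 and above, not suppressible by pragma) are reported; anything lower is silently discarded. The record's own severity is not changed. This applies equally to CUDA system headers.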
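Following the pseudocode for Gates 1 and 2, the two filters reduce to a comparison against a floor that is raised for system headers and for suppress-all-but-fatal mode. The function name and parameters below are hypothetical, and the assertion on threshold 3 is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when a diagnostic survives Gate 1 and Gate 2. */
bool passes_severity_floor(uint8_t severity, uint8_t min_sev,
                           bool in_system_header, bool suppress_all_mode) {
    if (severity < min_sev)
        return false;                 /* Gate 1: below the global threshold */

    uint8_t floor = min_sev;
    if (in_system_header)
        floor = 8;                    /* only hard errors pass */
    else if (suppress_all_mode)
        floor = 7;                    /* error-level floor */

    return severity >= floor;         /* Gate 2 */
}
```

With a global threshold of 4, a warning (severity 5) from user code passes, while the same warning from a system header is dropped.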

Per-error tracking:

Bits 0 and 1 of the tracking flags (byte_1067922[4 * error_code]) work together. Every emission sets bit 1 (already-emitted). For suppressible severities (7 and below), if bit 0 (first-time guard) and bit 1 were both already set before this emission, control jumps to the suppressed path: the diagnostic is counted but not printed. Severities above 7 set bit 1 but are never skipped.
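The flag test in Gate 3 can be isolated as below. The macro names and the function mark_emitted_and_check_skip are hypothetical; the bit positions come from the tracking-flags description.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAG_FIRST_TIME_GUARD 0x1   /* bit 0 */
#define FLAG_ALREADY_EMITTED  0x2   /* bit 1 */
#define FLAG_PRAGMA_OVERRIDE  0x4   /* bit 2, unused in this sketch */

/* Marks the error code as emitted and reports whether this emission
 * should take the suppressed path. */
bool mark_emitted_and_check_skip(uint8_t *flags, uint8_t severity) {
    uint8_t old = *flags;
    *flags |= FLAG_ALREADY_EMITTED;
    if (severity > 7)
        return false;               /* hard errors are never skipped */
    return (old & FLAG_FIRST_TIME_GUARD) && (old & FLAG_ALREADY_EMITTED);
}
```

When the guard bit is set, the first emission passes and later ones are skipped; when it is clear, every emission passes.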

Suppress-all-but-fatal mode:

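When qword_106BCD8 is set, dword_106BCD4 (predefined macro file mode) is clear, and the error is not error 992 (the fatal sentinel), check_severity emits error 992 in place of the current diagnostic and returns. The error_code != 992 test keeps the replacement from triggering itself.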

Catastrophic loop detection:

The re-entry guard dword_106B4B0 prevents infinite recursion when a catastrophic error triggers another catastrophic error during its own processing. The message "Loop in catastrophic error processing." is printed directly to stderr followed by emergency_exit(9).

Error limit enforcement:

The sum of qword_126ED90 (total errors) and qword_126ED98 (total warnings) is checked against qword_126ED60 (error limit). When the sum reaches or exceeds the limit, the compiler writes the limit message and exits with catastrophic status. The raw listing file also receives a machine-readable C "" 0 0 error limit reached line.

Warnings-as-errors promotion:

When dword_106C088 (warnings-are-errors mode) is set and the guard dword_106B4BC is clear, a warning (severity 5) triggers error 3709 ("warnings treated as errors") as a follow-up diagnostic. The implementation temporarily lowers the minimum severity threshold to 4 (remark), disables warnings-as-errors mode, sets the recursion guard, emits the diagnostic, then restores the threshold and the mode flag. This prevents the error-3709 diagnostic from itself triggering another error-3709. The pseudocode above never clears the guard, which suggests the follow-up is emitted at most once per run.

Output routing:

if (dword_106BBB8 == 1)
    sub_4EF8A0(record);   // SARIF JSON path → qword_106B478
else
    sub_4EF9D0(record);   // text path → construct_text_message

Termination for fatal severities:

Severities 9 (catastrophic), 10 (command-line error), and 11 (internal error) all trigger cleanup via sub_66B5E0 followed by sub_5AF2B0(severity), which maps severity to the process exit code.

Stage 4: write_message_to_buffer (sub_4EF620)

Looks up the error template string from the table and expands format specifiers:

const char *template = off_88FAA0[error_code];   // error_code must be <= 3794

Format specifier syntax: %XY...Zn where:

  • X = specifier letter (T, d, n, p, r, s, t, u)
  • Y...Z = option characters (a-z, A-Z), max 29
  • n = trailing digit = fill-in index

Special forms:

  • %% = literal %
  • %[label] = named label fill-in, looked up in off_D481E0 table

Each specifier dispatches to process_fill_in (sub_4EDCD0) with the appropriate fill-in kind.
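As an illustration of the %XY...Zn syntax, here is a minimal parser sketch. The struct and function names are hypothetical. Defaulting the index to 1 when no digit follows is an assumption made so that templates such as "%d context lines elided" parse; the %% and %[label] forms are not handled.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_OPTIONS 29

typedef struct {
    char letter;                    /* T, d, n, p, r, s, t or u */
    char options[MAX_OPTIONS + 1];  /* option letters, NUL-terminated */
    int  index;                     /* fill-in index */
} fill_in_spec;

/* Parses a specifier that starts at s[0] == '%'. Returns the number of
 * characters consumed, or 0 if s does not start a recognised specifier. */
size_t parse_fill_in_spec(const char *s, fill_in_spec *out) {
    if (s[0] != '%' || s[1] == '\0' || !strchr("Tdnprstu", s[1]))
        return 0;
    size_t i = 2, nopt = 0;
    out->letter = s[1];
    while (isalpha((unsigned char)s[i])) {
        if (nopt == MAX_OPTIONS)
            return 0;                        /* too many option letters */
        out->options[nopt++] = s[i++];
    }
    out->options[nopt] = '\0';
    if (isdigit((unsigned char)s[i]))
        out->index = s[i++] - '0';           /* trailing digit */
    else
        out->index = 1;                      /* assumed default */
    return i;
}
```

For the template fragment "%sq1) from" this yields letter 's', options "q", index 1, and consumes four characters.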

Stage 5: construct_text_message (sub_4EF9D0)

The largest function in the error subsystem at 1,464 decompiled lines (6.5 KB). Formats the complete diagnostic output.

Output format:

file(line): severity #code-D: message text

Variant formats:

  • "At end of source: ..." -- when line number is 0
  • "In predefined macro file: ..." -- when dword_106BCD4 is set
  • "Line N" -- when the file name is "-" (stdin)

Sub-diagnostic indentation:

Kind | Indent (chars) | Continuation indent
0 (primary) | 0 | 10
2 (sub-diagnostic, same parent) | 10 | 20
2 (sub-diagnostic, different parent) | 12 | 22
3 (related) | 1 | 11

Word wrapping:

The function wraps output text at dword_106B470 (terminal width) column boundaries. When colorization is disabled, it uses a simple space-scanning algorithm. When colorization is enabled (ESC byte 0x1B in the formatted string), it tracks visible character width separately from escape sequence bytes and wraps only on visual boundaries.
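The colorized case depends on measuring visible width independently of escape bytes. A minimal sketch of that measurement, assuming escapes have the usual "ESC [ parameters final-byte" shape; visible_width is a hypothetical name, and the count is in bytes rather than display columns for non-ASCII text.

```c
#include <stddef.h>

/* Counts visible characters, skipping "ESC [ ... <final byte>" sequences. */
size_t visible_width(const char *s) {
    size_t width = 0;
    while (*s) {
        if (s[0] == 0x1B && s[1] == '[') {
            s += 2;
            /* Parameter bytes such as digits and ';' precede the final byte. */
            while (*s && !(*s >= 0x40 && *s <= 0x7E))
                s++;
            if (*s)
                s++;                /* final byte, e.g. 'm' */
        } else {
            width++;
            s++;
        }
    }
    return width;
}
```

A wrapper built on this would break lines when the visible width, not the byte length, reaches the terminal width.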

Fill-in verification:

After output, the function iterates the fill-in linked list and asserts that every entry has used_flag == 1. An unused fill-in triggers: "construct_text_message: not all fill-ins used for error string: \"...\"" (error.c:4781).

Raw listing output:

When qword_106C260 (raw listing file) is open and the diagnostic is not a continuation (kind != 3), a machine-readable line is emitted:

S "filename" line column message\n

Where S is a single-character severity code: R (remark), W (warning), E (error), C (catastrophic/internal). Internal errors additionally prefix "(internal error) " before the message text.

Stage 6: process_fill_in (sub_4EDCD0)

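Expands a single format specifier by searching the diagnostic record's fill-in linked list (head at offset 184) for an entry matching the requested kind and index. The function spans 1,202 decompiled lines. The pipeline diagram labels it [4a] because it is invoked per specifier from write_message_to_buffer rather than run as a separate pass.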

Fill-in kind dispatch (from ASCII code of specifier letter):

Letter | ASCII | Kind | Payload
%T | 84 | 6 (type) | Type node pointer
%d | 100 | 0 (decimal) | Integer value
%n | 110 | 4 (entity name) | Entity node pointer + options
%p | 112 | 2 (parameter) | Source position
%r | 114 | 7 | Byte + pointer
%s | 115 | 3 (string) | String pointer
%t | 116 | 5 | (type variant)
%u | 117 | 1 (unsigned) | Unsigned integer value
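The letter-to-kind mapping in the table can be written as a switch. The function name is hypothetical, and returning -1 for unknown letters is an assumption; the binary's behaviour for other letters is not described here.

```c
/* Returns the fill-in kind for a specifier letter, or -1 if unknown. */
int fill_in_kind_for_letter(char letter) {
    switch (letter) {
    case 'd': return 0;   /* decimal */
    case 'u': return 1;   /* unsigned */
    case 'p': return 2;   /* parameter (source position payload) */
    case 's': return 3;   /* string */
    case 'n': return 4;   /* entity name */
    case 't': return 5;   /* type variant */
    case 'T': return 6;   /* type */
    case 'r': return 7;   /* byte + pointer payload */
    default:  return -1;
    }
}
```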

Entity name options (for %n specifier):

Option | Meaning
f | Full qualification
o | Omit kind prefix
p | Omit parameters
t | Full with template arguments
a | Omit + show accessibility
d | Show declaration location
T | Show template specialization

Assertion Handler (sub_4F2930)

The most-connected function in the entire cudafe++ binary: 5,185 call sites. Declared __noreturn.

Signature:

void __noreturn assertion_handler(
    char *source_file,     // EDG source file path
    int   line_number,     // source line number
    const char *func_name, // enclosing function name
    const char *prefix,    // message prefix (or NULL)
    const char *message    // detail message (or NULL)
);

Message format (with prefix):

assertion failed: <prefix> <message> (<file>, line <line> in <func>)

Message format (without prefix):

assertion failed at: "<file>", line <line> in <func>

The function allocates a 0x400-byte buffer via sub_6B98A0, concatenates the message components using sub_6B9CD0 (buffer append), then calls sub_4F21C0 (internal_error). Because sub_4F21C0 is also __noreturn, the code after the call is dead -- the decompiler shows a loop structure with sprintf(v20, "%d", v8) that is never actually reached.

When dword_126ED40 (suppress assertion output) is set, the message text is replaced with "<suppressed>".
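The two message formats can be reproduced with snprintf. This is only a sketch of the resulting text: the binary builds the string with its own buffer helpers, the function name is hypothetical, and the "<suppressed>" substitution is not modelled.

```c
#include <stdio.h>

/* Writes the assertion text into buf and returns snprintf's result. */
int format_assertion_text(char *buf, size_t size, const char *file, int line,
                          const char *func, const char *prefix,
                          const char *message) {
    if (prefix)
        return snprintf(buf, size,
                        "assertion failed: %s %s (%s, line %d in %s)",
                        prefix, message ? message : "", file, line, func);
    return snprintf(buf, size,
                    "assertion failed at: \"%s\", line %d in %s",
                    file, line, func);
}
```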

Internal Error Handler (sub_4F21C0)

Creates error 2656 with severity 11 (internal error), outputs it through the standard pipeline, then exits.

void __noreturn internal_error(const char *message) {
    if (dword_1065928) {                     // re-entry guard
        fprintf(stderr, "%s: %s\n", "Internal error loop", message);
        sub_5AF2B0(11);                      // emergency exit
    }
    dword_1065928 = 1;                       // set guard
    diag = sub_4F41C0(2656, &current_pos, 11);  // create diag record
    if (message)
        sub_4F2E90(diag, message);           // attach message as fill-in
    sub_4F1330(diag);                        // route through check_severity
    sub_5AF1D0(11);                          // cleanup + exit(11)
    sub_4F2240();                            // update file index (unreachable)
}

The re-entry guard dword_1065928 prevents infinite recursion: if internal_error is called while already processing an internal error (e.g., an assertion fires inside the error formatting code), it prints "Internal error loop: <message>" directly to stderr and exits immediately with code 11.

Exit Codes

Code | Condition | Trigger
0 | Compilation succeeded | Normal exit via sub_5AF1D0(0)
2 | Errors encountered | total_errors > 0 at exit
4 | Catastrophic error | Severity 9 or 10 reached
11 | Internal error | Severity 11 (assertion failure)
abort | Double internal error | Re-entry in sub_4F21C0 or catastrophic loop

The exit path flows through sub_5AF2B0, which maps the severity to the appropriate process exit code. Catastrophic loop detection ("Loop in catastrophic error processing.") calls sub_5AF2B0(9), which maps to exit code 4.
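The numeric rows of the exit-code table amount to the mapping below. Both function names are hypothetical, and the -1 return for non-fatal severities is a convention of this sketch, not something sub_5AF2B0 is known to do.

```c
/* Exit code for a fatal severity (9..11); -1 for anything else. */
int exit_code_for_fatal_severity(int severity) {
    switch (severity) {
    case 9:             /* catastrophic */
    case 10:            /* command-line error */
        return 4;
    case 11:            /* internal error */
        return 11;
    default:
        return -1;
    }
}

/* Exit code when compilation runs to completion. */
int exit_code_at_normal_end(unsigned long total_errors) {
    return total_errors > 0 ? 2 : 0;
}
```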

Diagnostic Record Layout

Each diagnostic record is approximately 200 bytes, allocated by sub_4EC940:

Offset | Size | Field | Description
0 | 4 | kind | 0=primary, 2=sub-diagnostic, 3=continuation
8 | 8 | next | Linked list pointer (global chain)
16 | 8 | parent | Parent diagnostic (for sub-diagnostics)
24 | 8 | related_list | Related diagnostic chain
40 | 8 | sub_diagnostic_head | First sub-diagnostic
48 | 8 | sub_diagnostic_tail | Last sub-diagnostic
72 | 8 | context_head | Template/include context chain
88 | 8 | related_info | Related location info pointer
96 | 8 | source_sequence_number | Position in source sequence
136 | 4 | file_index | Index into source file table
140 | 2 | column_end | End column for caret range
144 | 4 | line_delta | Line offset for continuation
152 | 8 | file_name_string | Canonical file path
160 | 8 | display_file_name | Display-formatted file path
168 | 4 | column_number | Column number
172 | 4 | caret_info | Caret position data
176 | 4 | error_code | Error code (0--3794)
180 | 1 | severity | Severity level (2--11)
184 | 8 | fill_in_list_head | First fill-in entry
192 | 8 | fill_in_list_tail | Last fill-in entry
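The offsets can be cross-checked by writing the record as a C struct with compile-time assertions. This sketch assumes an LP64 target (8-byte pointers). The unknown_* and pad_* members are placeholders for ranges the table does not describe, and the C types chosen for each field are guesses based only on the listed sizes.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct diagnostic_record diagnostic_record;
struct fill_in_entry;                        /* opaque in this sketch */

struct diagnostic_record {
    uint32_t kind;                           /*   0 */
    uint32_t pad_4;                          /*   4: assumed padding */
    diagnostic_record *next;                 /*   8 */
    diagnostic_record *parent;               /*  16 */
    diagnostic_record *related_list;         /*  24 */
    uint8_t  unknown_32[8];                  /*  32: undocumented */
    diagnostic_record *sub_diagnostic_head;  /*  40 */
    diagnostic_record *sub_diagnostic_tail;  /*  48 */
    uint8_t  unknown_56[16];                 /*  56: undocumented */
    void    *context_head;                   /*  72 */
    uint8_t  unknown_80[8];                  /*  80: undocumented */
    void    *related_info;                   /*  88 */
    uint64_t source_sequence_number;         /*  96 */
    uint8_t  unknown_104[32];                /* 104: undocumented */
    uint32_t file_index;                     /* 136 */
    uint16_t column_end;                     /* 140 */
    uint16_t pad_142;                        /* 142: assumed padding */
    int32_t  line_delta;                     /* 144 */
    uint32_t pad_148;                        /* 148: assumed padding */
    const char *file_name_string;            /* 152 */
    const char *display_file_name;           /* 160 */
    uint32_t column_number;                  /* 168 */
    uint32_t caret_info;                     /* 172 */
    uint32_t error_code;                     /* 176 */
    uint8_t  severity;                       /* 180 */
    uint8_t  pad_181[3];                     /* 181: assumed padding */
    struct fill_in_entry *fill_in_list_head; /* 184 */
    struct fill_in_entry *fill_in_list_tail; /* 192 */
};

_Static_assert(offsetof(diagnostic_record, sub_diagnostic_head) == 40, "sub head");
_Static_assert(offsetof(diagnostic_record, source_sequence_number) == 96, "seq");
_Static_assert(offsetof(diagnostic_record, file_index) == 136, "file_index");
_Static_assert(offsetof(diagnostic_record, error_code) == 176, "error_code");
_Static_assert(offsetof(diagnostic_record, severity) == 180, "severity");
_Static_assert(offsetof(diagnostic_record, fill_in_list_head) == 184, "fill-ins");
_Static_assert(sizeof(diagnostic_record) == 200, "record is 200 bytes");
```

The listed fields plus padding add up to exactly 200 bytes, consistent with the "approximately 200 bytes" allocation size.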

Fill-In Entry Layout

Each fill-in entry is 40 bytes, allocated from a free-list pool (qword_106B490) or heap (sub_6B8070):

Offset | Size | Field | Description
0 | 4 | kind | Fill-in kind (0--7, mapped from format specifier letter)
4 | 1 | used_flag | Set to 1 when consumed during formatting
8 | 8 | next | Next fill-in in linked list
16 | 8+ | payload | Union: qword for most kinds; int+int for kind 4 (entity name)

Kind-specific initialization in alloc_fill_in_entry (sub_4F2DE0):

  • Kind 2 (parameter): payload = qword_126EFB8 (current source position)
  • Kind 4 (entity name): payload = 0, extra = 0xFFFFFFFF, flags = 0
  • Kind 7: byte + qword payload
  • Default: payload = 0
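A free-list pool with heap fallback, as described for fill-in entries, might look like the sketch below. The struct models the payload as an opaque 24-byte area. The function names, the zero-initialisation, and the idea that released entries are pushed back onto the pool are assumptions, and kind-specific initialisation is omitted.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct fill_in_entry {
    uint32_t kind;                 /* offset 0: 0..7 */
    uint8_t  used_flag;            /* offset 4: 1 once consumed */
    uint8_t  pad[3];               /* assumed padding */
    struct fill_in_entry *next;    /* offset 8 */
    uint8_t  payload[24];          /* offset 16: kind-specific, opaque here */
} fill_in_entry;

_Static_assert(sizeof(fill_in_entry) == 40, "fill-in entry is 40 bytes");

static fill_in_entry *free_list;   /* stands in for qword_106B490 */

/* Takes an entry from the pool if one is available, else from the heap. */
fill_in_entry *alloc_fill_in(uint32_t kind) {
    fill_in_entry *e = free_list;
    if (e)
        free_list = e->next;       /* reuse a pooled entry */
    else
        e = malloc(sizeof *e);     /* heap fallback */
    if (!e)
        return NULL;
    memset(e, 0, sizeof *e);
    e->kind = kind;
    return e;
}

/* Returns an entry to the pool (assumed behaviour). */
void release_fill_in(fill_in_entry *e) {
    e->next = free_list;
    free_list = e;
}
```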

Colorization

Initialized by sub_4F2C10 (init_colorization, error.c:825):

  1. Check NOCOLOR environment variable -- if set, disable colorization
  2. Check sub_5AF770 (isatty) -- if stderr is not a terminal, disable
  3. Read EDG_COLORS or GCC_COLORS environment variable
  4. Default: "error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32"
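The colour specification is a colon-separated list of name=value pairs, so looking up one category is a short scan. The function below is a hypothetical helper, not the routine in the binary.

```c
#include <stddef.h>
#include <string.h>

/* Copies the value for `name` out of a "key=val:key=val" spec.
 * Returns 1 on success, 0 if the key is absent or out is too small. */
int lookup_color(const char *spec, const char *name, char *out, size_t out_size) {
    size_t name_len = strlen(name);
    const char *p = spec;
    while (*p) {
        const char *end = strchr(p, ':');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        if (len > name_len && strncmp(p, name, name_len) == 0
            && p[name_len] == '=') {
            size_t val_len = len - name_len - 1;
            if (val_len >= out_size)
                return 0;
            memcpy(out, p + name_len + 1, val_len);
            out[val_len] = '\0';
            return 1;
        }
        if (!end)
            break;
        p = end + 1;               /* move to the next pair */
    }
    return 0;
}
```

Looking up "warning" in the default string yields "01;35", which would be wrapped as ESC [ 01;35 m to form the escape sequence.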

Category codes used in escape sequences:

Code | Category | Default ANSI
1 | reset | \033[0m
2 | error | \033[01;31m (bold red)
3 | warning | \033[01;35m (bold magenta)
4 | note/remark | \033[01;36m (bold cyan)
5 | locus | \033[01m (bold)
6 | quote | \033[01m (bold)
7 | range1 | \033[32m (green)

Controlled by dword_126ECA0 (colorization requested) and dword_126ECA4 (colorization active). The sub_4ECDD0 function emits escape sequences to the output buffer, and sub_4F3E50 handles escape insertion during word-wrapped output.

Key Global Variables

Variable | Address | Type | Purpose
off_88FAA0 | 0x88FAA0 | const char*[3795] | Error message template table
off_D481E0 | 0xD481E0 | struct[] | Named label fill-in table
byte_1067920 | 0x1067920 | byte[4*3795] | Default severity per error
byte_1067921 | 0x1067921 | byte[4*3795] | Current severity per error
byte_1067922 | 0x1067922 | byte[4*3795] | Per-error tracking flags
byte_126ED68 | 0x126ED68 | byte | Error promotion threshold
byte_126ED69 | 0x126ED69 | byte | Minimum severity threshold
qword_126ED60 | 0x126ED60 | qword | Error limit
qword_126ED90 | 0x126ED90 | qword | Total error count
qword_126ED98 | 0x126ED98 | qword | Total warning count
dword_106B4B0 | 0x106B4B0 | int | Catastrophic error re-entry guard
dword_106B4BC | 0x106B4BC | int | Warnings-as-errors recursion guard
dword_106BBB8 | 0x106BBB8 | int | Output format (0=text, 1=SARIF)
dword_106C088 | 0x106C088 | int | Warnings-are-errors mode
dword_1065928 | 0x1065928 | int | Internal error re-entry guard
qword_106BCD8 | 0x106BCD8 | qword | Suppress-all-but-fatal mode
dword_106BCD4 | 0x106BCD4 | int | Predefined macro file mode
qword_106B488 | 0x106B488 | qword | Message text buffer (0x400 initial)
qword_106B480 | 0x106B480 | qword | Location prefix buffer (0x80 initial)
qword_106B478 | 0x106B478 | qword | SARIF JSON buffer (0x400 initial)
dword_106B470 | 0x106B470 | int | Terminal width for word wrapping
qword_126EDF0 | 0x126EDF0 | FILE* | Error output stream (default stderr)
qword_106C260 | 0x106C260 | FILE* | Raw listing output file

Function Map

Address | Name (Recovered) | EDG Source | Size | Role
0x4EC940 | allocate_diagnostic_record | error.c | -- | Pool allocator for diagnostic records
0x4ECB10 | write_sarif_physical_location | error.c | -- | SARIF location JSON fragment
0x4ECDD0 | emit_colorization_escape | error.c | -- | Emit ANSI escape to buffer
0x4ED190 | record_pragma_diagnostic | error.c | -- | Record pragma override in scope
0x4ED240 | check_pragma_diagnostic | error.c | -- | Check if error suppressed by pragma
0x4EDCD0 | process_fill_in | error.c:4297 | 1,202 lines | Format specifier expansion
0x4EF620 | write_message_to_buffer | error.c:4703 | 159 lines | Template string expansion
0x4EF8A0 | write_sarif_message_json | error.c | 79 lines | SARIF message JSON wrapper
0x4EF9D0 | construct_text_message | error.c:3153 | 1,464 lines | Full text diagnostic formatter
0x4F1330 | check_severity | error.c:3859 | 601 lines | Central severity dispatch
0x4F2190 | check_severity_thunk | error.c | 8 lines | Tail-call wrapper
0x4F21A0 | internal_error_variant | error.c | 9 lines | check_severity + exit(11)
0x4F21C0 | internal_error | error.c | 22 lines | Error 2656, severity 11, re-entry guard
0x4F2240 | update_file_index | error.c | 114 lines | LRU source-file index cache
0x4F24B0 | build_source_caret_line | error.c | ~100 lines | Source caret underline
0x4F2930 | assertion_handler | error.c | 101 lines | 5,185 callers, __noreturn
0x4F2C10 | init_colorization | error.c:825 | 43 lines | Parse EDG_COLORS/GCC_COLORS
0x4F2D30 | error_text_invalid_code | error.c:911 | 12 lines | Assert on code > 3794
0x4F2DE0 | alloc_fill_in_entry | error.c | 41 lines | Pool allocator for fill-ins
0x4F2E90 | append_fill_in_string | error.c | -- | Attach string fill-in to diagnostic
0x4F30A0 | check_for_overridden_severity | error.c:3803 | ~130 lines | Pragma diagnostic stack walk
0x4F3480 | format_assertion_message | error.c | ~100 lines | Multi-arg string builder
0x4F3E50 | emit_colorization_in_wrap | error.c | -- | Escape handling during word wrap
0x4F40C0 | create_diagnostic_entry | error.c:5202 | ~50 lines | Base record creator
0x4F41C0 | create_diagnostic_entry_with_file_index | error.c | 13 lines | Wrapper with file-index mode
0x4F5A70 | create_sub_diagnostic | error.c:5242 | 32 lines | kind=2 sub-diagnostic creator
0x4F6C40 | format_scope_context | error.c | -- | Extract instantiation context from scope

Call Graph

sub_4F2930 (assertion_handler)  [5,185 callers, __noreturn]
  └── sub_4F21C0 (internal_error)
        ├── sub_4F41C0 (create_diagnostic_entry_with_file_index, error=2656, sev=11)
        │     └── sub_4F40C0 (create_diagnostic_entry)
        │           └── sub_4F30A0 (check_for_overridden_severity)
        ├── sub_4F2E90 (append_fill_in_string)
        ├── sub_4F1330 (check_severity)  [62 callers, 77 callees]
        │     ├── sub_4ED240 (check_pragma_diagnostic)
        │     ├── sub_4EF9D0 (construct_text_message)
        │     │     ├── sub_4EF620 (write_message_to_buffer)
        │     │     │     └── sub_4EDCD0 (process_fill_in)
        │     │     ├── sub_4F24B0 (build_source_caret_line)
        │     │     └── sub_4F3E50 (emit_colorization_in_wrap)
        │     ├── sub_4EF8A0 (write_sarif_message_json)
        │     │     └── sub_4EF620 (write_message_to_buffer)
        │     ├── sub_4F5A70 (create_sub_diagnostic)
        │     ├── sub_4F2DE0 (alloc_fill_in_entry)
        │     ├── sub_4F6C40 (format_scope_context)
        │     ├── sub_66B5E0 (cleanup)
        │     └── sub_5AF2B0 (exit)
        ├── sub_5AF1D0 (cleanup + exit)
        └── sub_4F2240 (update_file_index)

CUDA Error Catalog

cudafe++ reserves error indices 3457--3794 for CUDA-specific diagnostics. These 338 slots are displayed to the user as error numbers 20000--20337 with a -D suffix (for suppressible severities), produced by the renumbering logic in construct_text_message (sub_4EF9D0): when the internal error code exceeds 3456, the display code is error_code + 16543. Of the 338 slots, approximately 210 carry unique error message templates; the remainder are reserved or share templates with parametric fill-ins (%s, %sq, %t, %n, %no). Every CUDA error can be suppressed, promoted, or demoted by its diagnostic tag name via --diag_suppress, --diag_warning, --diag_error, or the #pragma nv_diagnostic system.

This page is a searchable reference catalog organized by error category. For the diagnostic pipeline mechanics (severity levels, pragma stack, output formatting), see Diagnostic Overview.

Error Numbering Scheme

// construct_text_message (sub_4EF9D0), error.c:3153
int display_code = error_code;
if (display_code > 3456)
    display_code = error_code + 16543;   // 3457 -> 20000, 3794 -> 20337
sprintf(buf, "%d", display_code);

// Suffix: "-D" appended when severity <= 7 (note, remark, warning, soft error)
const char *suffix = (severity > 7) ? "" : "-D";

User-visible format: file(line): error #20042-D: calling a __device__ function from a __host__ function is not allowed

Mapping formula:

Direction | Formula
Display to internal | internal = display - 16543 (for display >= 20000)
Internal to display | display = internal + 16543 (for internal > 3456)
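Both directions of the mapping, plus the suffix rule, fit in a few lines. The function and macro names are hypothetical; the constants are the ones given in the formulas.

```c
#define CUDA_FIRST_INTERNAL 3457    /* first CUDA-specific internal code */
#define CUDA_DISPLAY_OFFSET 16543   /* 3457 + 16543 == 20000 */

/* Internal error code to the number shown to the user. */
int internal_to_display(int internal) {
    return internal >= CUDA_FIRST_INTERNAL ? internal + CUDA_DISPLAY_OFFSET
                                           : internal;
}

/* User-visible number back to the internal error code. */
int display_to_internal(int display) {
    return display >= 20000 ? display - CUDA_DISPLAY_OFFSET : display;
}

/* "-D" marks diagnostics whose severity can still be changed (<= 7). */
const char *display_suffix(int severity) {
    return severity > 7 ? "" : "-D";
}
```

For example, the user-visible number 20042 corresponds to internal code 3499.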

Diagnostic Tag Names and Suppression

Each CUDA error has an associated diagnostic tag name -- a snake_case identifier that can be passed to --diag_suppress, --diag_warning, --diag_error, or --diag_default instead of the numeric code. The tag names are also accepted by #pragma nv_diag_suppress, #pragma nv_diag_warning, etc.

# Suppress a specific CUDA error by tag name (forwarded to cudafe++ through nvcc)
nvcc -Xcudafe --diag_suppress=calling_a_constexpr__host__function_from_a__device__function

# Suppress by numeric display code
nvcc -Xcudafe --diag_suppress=20042

# In source code
#pragma nv_diag_suppress device_function_redeclared_with_host

The pragma actions understood by cudafe++:

Pragma | Internal Code | Effect
nv_diag_suppress | 30 | Set severity to 3 (suppressed)
nv_diag_remark | 31 | Set severity to 4 (remark)
nv_diag_warning | 32 | Set severity to 5 (warning)
nv_diag_error | 33 | Set severity to 7 (error)
nv_diag_default | 35 | Restore original severity
nv_diag_once | -- | Emit only on first occurrence

Category 1: Cross-Space Calling (12 messages)

Cross-space call validation is the highest-frequency CUDA diagnostic category. The checker walks the call graph and emits an error whenever a function in one execution space calls a function in an incompatible space. Six variants cover non-constexpr calls; six more cover constexpr calls (which can be relaxed with --expt-relaxed-constexpr).

Standard Cross-Space Calls

Tag | Message Template
unsafe_device_call | calling a __device__ function(%sq1) from a __host__ function(%sq2) is not allowed
unsafe_device_call | calling a __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed
unsafe_device_call | calling a __host__ function(%sq1) from a __device__ function(%sq2) is not allowed
unsafe_device_call | calling a __host__ function(%sq1) from a __global__ function(%sq2) is not allowed
unsafe_device_call | calling a __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed
unsafe_device_call | calling a __host__ function from a __host__ __device__ function is not allowed

Constexpr Cross-Space Calls

These fire when --expt-relaxed-constexpr is not enabled. The message explicitly suggests the flag.

Tag | Message Template
unsafe_device_call | calling a constexpr __device__ function(%sq1) from a __host__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
unsafe_device_call | calling a constexpr __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
unsafe_device_call | calling a constexpr __host__ function(%sq1) from a __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
unsafe_device_call | calling a constexpr __host__ function(%sq1) from a __global__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
unsafe_device_call | calling a constexpr __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
unsafe_device_call | calling a constexpr __host__ function from a __host__ __device__ function is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.

Implementation: Cross-space checks are performed by the call-graph walker in the CUDA validation pass. The checker compares the execution space byte at entity offset +182 of the callee against the caller. When the mask test fails, the appropriate variant is selected based on whether either function is constexpr and whether the callee has named fill-ins or uses the anonymous (no %sq) form.

Category 2: Virtual Override Mismatch (6 messages)

When a derived class overrides a virtual function, the execution space of the override must match the base. Six combinations cover all mismatched pairs among __host__, __device__, and __host__ __device__.

Tag | Message Template
-- | execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ function
-- | execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ __device__ function
-- | execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __device__ function
-- | execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __host__ __device__ function
-- | execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __device__ function
-- | execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __host__ function

Implementation: The override checker (sub_432280, record_virtual_function_override) extracts the 0x30 mask from the execution space byte of both the base and derived function entities. If they differ, the appropriate pair is selected and emitted. The __global__ space is not included because __global__ functions cannot be virtual (see Category 4).
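The mask comparison itself is a one-liner. The sketch below only tests whether the masked bits differ; it does not assume which bit within 0x30 denotes which execution space, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXEC_SPACE_MASK 0x30

/* True when base and derived differ in the masked execution-space bits. */
bool override_space_mismatch(uint8_t base_byte, uint8_t derived_byte) {
    return ((base_byte ^ derived_byte) & EXEC_SPACE_MASK) != 0;
}
```

Bits outside the mask are ignored, so two functions that differ only in unrelated flag bits compare as matching.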

Category 3: Redeclaration Mismatch (12 messages)

When a function is redeclared with a different execution space annotation, cudafe++ either emits an error (incompatible combination) or a warning (compatible promotion to __host__ __device__).

Error-Level Redeclarations (4 messages)

Tag | Message Template
device_function_redeclared_with_global | a __device__ function(%no1) redeclared with __global__
global_function_redeclared_with_device | a __global__ function(%no1) redeclared with __device__
global_function_redeclared_with_host | a __global__ function(%no1) redeclared with __host__
global_function_redeclared_with_host_device | a __global__ function(%no1) redeclared with __host__ __device__

Warning-Level Redeclarations (Promoted to HD, 5 messages)

Tag | Message Template
device_function_redeclared_with_host | a __device__ function(%no1) redeclared with __host__, hence treated as a __host__ __device__ function
device_function_redeclared_with_host_device | a __device__ function(%no1) redeclared with __host__ __device__, hence treated as a __host__ __device__ function
device_function_redeclared_without_device | a __device__ function(%no1) redeclared without __device__, hence treated as a __host__ __device__ function
host_function_redeclared_with_device | a __host__ function(%no1) redeclared with __device__, hence treated as a __host__ __device__ function
host_function_redeclared_with_host_device | a __host__ function(%no1) redeclared with __host__ __device__, hence treated as a __host__ __device__ function

Global Redeclarations (3 messages)

| Tag | Message Template |
|---|---|
| global_function_redeclared_without_global | a __global__ function(%no1) redeclared without __global__ |
| host_function_redeclared_with_global | a __host__ function(%no1) redeclared with __global__ |
| host_device_function_redeclared_with_global | a __host__ __device__ function(%no1) redeclared with __global__ |

Implementation: Redeclaration checking occurs in decl_routine (sub_4CE420) and check_cuda_attribute_consistency (sub_4C6D50). The checker compares the execution space byte from the prior declaration against the new declaration's attribute set. When bits differ, it selects the message based on which bits changed and whether the result is a compatible promotion.
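
The twelve tags above amount to a lookup from (prior execution space, new annotation) to a message. The sketch below encodes that lookup directly from the three tables. The enum and function names are hypothetical, and the real checker works on attribute bits rather than on an enum pair.

```cpp
#include <string>

// Execution space recorded on the prior declaration.
enum class Space { Host, Device, HostDevice, Global };
// Annotation written on the new declaration; None means no execution space keyword.
enum class Annot { None, Host, Device, HostDevice, Global };
// Which of the three tables above the selected tag belongs to.
enum class Group { None, Error, PromotedToHostDevice, GlobalRedeclaration };

struct Redeclaration {
    Group group;
    std::string tag;  // empty when no diagnostic applies
};

inline Redeclaration classify_redeclaration(Space prior, Annot now) {
    switch (prior) {
    case Space::Device:
        if (now == Annot::Global)     return {Group::Error, "device_function_redeclared_with_global"};
        if (now == Annot::Host)       return {Group::PromotedToHostDevice, "device_function_redeclared_with_host"};
        if (now == Annot::HostDevice) return {Group::PromotedToHostDevice, "device_function_redeclared_with_host_device"};
        if (now == Annot::None)       return {Group::PromotedToHostDevice, "device_function_redeclared_without_device"};
        break;
    case Space::Global:
        if (now == Annot::Device)     return {Group::Error, "global_function_redeclared_with_device"};
        if (now == Annot::Host)       return {Group::Error, "global_function_redeclared_with_host"};
        if (now == Annot::HostDevice) return {Group::Error, "global_function_redeclared_with_host_device"};
        if (now == Annot::None)       return {Group::GlobalRedeclaration, "global_function_redeclared_without_global"};
        break;
    case Space::Host:
        if (now == Annot::Device)     return {Group::PromotedToHostDevice, "host_function_redeclared_with_device"};
        if (now == Annot::HostDevice) return {Group::PromotedToHostDevice, "host_function_redeclared_with_host_device"};
        if (now == Annot::Global)     return {Group::GlobalRedeclaration, "host_function_redeclared_with_global"};
        break;
    case Space::HostDevice:
        if (now == Annot::Global)     return {Group::GlobalRedeclaration, "host_device_function_redeclared_with_global"};
        break;
    }
    return {Group::None, ""};  // matching or otherwise undiagnosed combination
}
```

For example, `classify_redeclaration(Space::Device, Annot::Host)` selects the promotion warning, while `classify_redeclaration(Space::Global, Annot::Host)` selects an error-level tag.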

Category 4: __global__ Function Constraints (37 messages)

__global__ (kernel) functions have the most extensive constraint set of any execution space. These errors enforce the CUDA programming model requirement that kernels have specific signatures, cannot be members, and cannot use certain C++ features.

Return Type and Signature

| Tag | Message Template |
|---|---|
| global_function_return_type | a __global__ function must have a void return type |
| global_function_deduced_return_type | a __global__ function must not have a deduced return type |
| global_function_has_ellipsis | a __global__ function cannot have ellipsis |
| global_rvalue_ref_type | a __global__ function cannot have a parameter with rvalue reference type |
| global_ref_param_restrict | a __global__ function cannot have a parameter with __restrict__ qualified reference type |
| global_va_list_type | A __global__ function or function template cannot have a parameter with va_list type |
| global_function_with_initializer_list | a __global__ function or function template cannot have a parameter with type std::initializer_list |
| global_param_align_too_big | cannot pass a parameter with a too large explicit alignment to a __global__ function on win32 platforms |
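
Three of these rules (void return type, no ellipsis, no rvalue reference parameters) can be expressed as ordinary type traits over a function type. The sketch below is only an illustration of what the rules mean. The trait names are hypothetical, and the remaining rules (deduced return type, __restrict__ references, va_list, std::initializer_list, alignment) are not modelled.

```cpp
#include <type_traits>

// True when none of the parameter types is an rvalue reference.
template <typename... Ts>
struct no_rvalue_refs : std::true_type {};

template <typename T, typename... Ts>
struct no_rvalue_refs<T, Ts...>
    : std::integral_constant<bool, !std::is_rvalue_reference<T>::value &&
                                       no_rvalue_refs<Ts...>::value> {};

// Primary template: anything that is not a plain function type fails.
template <typename F>
struct kernel_signature_ok : std::false_type {};

// R(Args...): requires a void return type and no rvalue reference parameters.
template <typename R, typename... Args>
struct kernel_signature_ok<R(Args...)>
    : std::integral_constant<bool, std::is_void<R>::value &&
                                       no_rvalue_refs<Args...>::value> {};

// R(Args..., ...): a C-style ellipsis is always rejected.
template <typename R, typename... Args>
struct kernel_signature_ok<R(Args..., ...)> : std::false_type {};
```

With these definitions `kernel_signature_ok<void(int, float*)>::value` is true, while `int(int)`, `void(int&&)`, and `void(int, ...)` are all rejected.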

Declaration Context

| Tag | Message Template |
|---|---|
| global_class_decl | A __global__ function or function template cannot be a member function |
| global_friend_definition | A __global__ function or function template cannot be defined in a friend declaration |
| global_function_in_unnamed_inline_ns | A __global__ function or function template cannot be declared within an inline unnamed namespace |
| global_operator_function | An operator function cannot be a __global__ function |
| global_new_or_delete | (internal -- global on operator new/delete) |
| -- | function main cannot be marked __device__ or __global__ |

C++ Feature Restrictions

| Tag | Message Template |
|---|---|
| global_function_constexpr | A __global__ function or function template cannot be marked constexpr |
| global_function_consteval | A __global__ function or function template cannot be marked consteval |
| global_function_inline | (internal -- global with inline) |
| global_exception_spec | An exception specification is not allowed for a __global__ function or function template |

Template Argument Restrictions

| Tag | Message Template |
|---|---|
| global_private_type_arg | A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the class is local to a __device__ or __global__ function |
| global_private_template_arg | A template that is defined inside a class and has private or protected access cannot be used in the template template argument of a __global__ function template instantiation |
| global_unnamed_type_arg | An unnamed type (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the type is local to a __device__ or __global__ function |
| global_func_local_template_arg | A type defined inside a __host__ function (%t) cannot be used in the template argument type of a __global__ function template instantiation |
| global_lambda_template_arg | The closure type for a lambda (%t%s) cannot be used in the template argument type of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the flag '-extended-lambda' is specified and the lambda is an extended lambda (a __device__ or __host__ __device__ lambda defined within a __host__ or __host__ __device__ function) |
| local_type_used_in_global_function | a local type %t (defined in %sq1) used in global function %sq2 template argument, the global function cannot be launched from host code. |

Variadic Template Constraints

| Tag | Message Template |
|---|---|
| global_function_multiple_packs | Multiple pack parameters are not allowed for a variadic __global__ function template |
| global_function_pack_not_last | Pack template parameter must be the last template parameter for a variadic __global__ function template |
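
The two pack rules can be checked from a simple description of the template parameter list. The sketch below assumes a hypothetical representation (one flag per template parameter) and an assumed check order. It only illustrates the conditions the two messages describe.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// is_pack[i] is true when template parameter i is a parameter pack.
// Returns the tag of the first violated rule, or an empty string.
inline std::string check_kernel_template_packs(const std::vector<bool>& is_pack) {
    std::size_t packs = 0;
    for (bool p : is_pack) {
        if (p) ++packs;
    }
    if (packs > 1) return "global_function_multiple_packs";
    if (packs == 1 && !is_pack.back()) return "global_function_pack_not_last";
    return "";
}
```

A list such as `{false, true}` (one trailing pack) passes, `{true, false}` yields global_function_pack_not_last, and `{true, true}` yields global_function_multiple_packs.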

Variable Template Restrictions (parallel to kernel template)

| Tag | Message Template |
|---|---|
| variable_template_private_type_arg | A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a variable template instantiation, unless the class is local to a __device__ or __global__ function |
| variable_template_private_template_arg | (private template template arg in variable template) |
| variable_template_unnamed_type_template_arg | An unnamed type (%t) cannot be used in the template argument type of a variable template template instantiation, unless the type is local to a __device__ or __global__ function |
| variable_template_func_local_template_arg | A type defined inside a __host__ function (%t) cannot be used in the template argument type of a variable template template instantiation |
| variable_template_lambda_template_arg | The closure type for a lambda (%t%s) cannot be used in the template argument type of a variable template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the lambda is an 'extended lambda' and the flag --extended-lambda is specified |

Launch Configuration Attributes

| Tag | Message Template |
|---|---|
| bounds_attr_only_on_global_func | %s is only allowed on a __global__ function |
| maxnreg_attr_only_on_global_func | (maxnreg only on global) |
| -- | The %s qualifiers cannot be applied to the same kernel |
| -- | Multiple %s specifiers are not allowed |
| -- | no __launch_bounds__ specified for __global__ function |
| cuda_specifier_twice_in_group | (duplicate CUDA specifier on same declaration) |

Category 5: Extended Lambda Restrictions (38 messages)

Extended lambdas (__device__ or __host__ __device__ lambdas defined within host code, enabled by --extended-lambda) are one of the most constraint-heavy features in CUDA. The restriction set enforces that the lambda's closure type can be serialized for device transfer.

Capture Restrictions

| Tag | Message Template |
|---|---|
| extended_lambda_reference_capture | An extended %s lambda cannot capture variables by reference |
| extended_lambda_pack_capture | An extended %s lambda cannot capture an element of a parameter pack |
| extended_lambda_too_many_captures | An extended %s lambda can only capture up to 1023 variables |
| extended_lambda_array_capture_rank | An extended %s lambda cannot capture an array variable (type: %t) with more than 7 dimensions |
| extended_lambda_array_capture_assignable | An extended %s lambda cannot capture an array variable whose element type (%t) is not assignable on the host |
| extended_lambda_array_capture_default_constructible | An extended %s lambda cannot capture an array variable whose element type (%t) is not default constructible on the host |
| extended_lambda_init_capture_array | An extended %s lambda cannot init-capture variables with array type |
| extended_lambda_init_capture_initlist | An extended %s lambda cannot have init-captures with type std::initializer_list |
| extended_lambda_capture_in_constexpr_if | An extended %s lambda cannot first-capture variable in constexpr-if context |
| this_addr_capture_ext_lambda | Implicit capture of 'this' in extended lambda expression |
| extended_lambda_hd_init_capture | init-captures are not allowed for extended __host__ __device__ lambdas |
| -- | Unless enabled by language dialect, *this capture is only supported when the lambda is either __device__ only, or is defined within a __device__ or __global__ function |
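
The three array-capture rules map naturally onto standard type traits. The sketch below is an approximation under stated assumptions: "assignable on the host" is modelled as std::is_copy_assignable, the trait and constant names are hypothetical, and the two helper structs exist only to exercise the failing cases. The limits 7 and 1023 come from the message text.

```cpp
#include <cstddef>
#include <type_traits>

constexpr std::size_t kMaxCaptures = 1023;  // extended_lambda_too_many_captures
constexpr std::size_t kMaxArrayRank = 7;    // extended_lambda_array_capture_rank

// True when an array of type T satisfies the rank, assignability, and
// default-constructibility rules for its element type.
template <typename T>
struct array_capture_ok
    : std::integral_constant<
          bool,
          (std::rank<T>::value <= kMaxArrayRank) &&
              std::is_copy_assignable<
                  typename std::remove_all_extents<T>::type>::value &&
              std::is_default_constructible<
                  typename std::remove_all_extents<T>::type>::value> {};

// Illustrative element types that violate one rule each.
struct NoAssign {
    NoAssign() = default;
    NoAssign& operator=(const NoAssign&) = delete;
};
struct NoDefault {
    explicit NoDefault(int) {}
};
```

A plausible reading of these rules is that the captured array has to be rebuilt element by element on the host, which would need a default-constructed element followed by an assignment. That reading is an inference from the message wording, not something confirmed from the binary.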

Type Restrictions on Captures and Parameters

| Tag | Message Template |
|---|---|
| extended_lambda_capture_local_type | A type local to a function (%t) cannot be used in the type of a variable captured by an extended __device__ or __host__ __device__ lambda |
| extended_lambda_capture_private_type | A type that is a private or protected class member (%t) cannot be used in the type of a variable captured by an extended __device__ or __host__ __device__ lambda |
| extended_lambda_call_operator_local_type | A type local to a function (%t) cannot be used in the return or parameter types of the operator() of an extended __device__ or __host__ __device__ lambda |
| extended_lambda_call_operator_private_type | A type that is a private or protected class member (%t) cannot be used in the return or parameter types of the operator() of an extended __device__ or __host__ __device__ lambda |
| extended_lambda_parent_local_type | A type local to a function (%t) cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended __device__ or __host__ __device__ lambda |
| extended_lambda_parent_private_type | A type that is a private or protected class member (%t) cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended __device__ or __host__ __device__ lambda |
| extended_lambda_parent_private_template_arg | A template that is a private or protected class member cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended %s lambda |

Enclosing Parent Function Restrictions

| Tag | Message Template |
|---|---|
| extended_lambda_enclosing_function_local | The enclosing parent function (%sq2) for an extended %s1 lambda must not be defined inside another function |
| extended_lambda_inaccessible_parent | The enclosing parent function (%sq2) for an extended %s1 lambda cannot have private or protected access within its class |
| extended_lambda_enclosing_function_deducible | The enclosing parent function (%sq2) for an extended %s1 lambda must not have deduced return type |
| extended_lambda_cant_take_function_address | The enclosing parent function (%sq2) for an extended %s1 lambda must allow its address to be taken |
| extended_lambda_parent_non_extern | On Windows, the enclosing parent function (%sq2) for an extended %s1 lambda cannot have internal or no linkage |
| extended_lambda_parent_class_unnamed | The enclosing parent function (%sq2) for an extended %s1 lambda cannot be a member function of a class that is unnamed |
| extended_lambda_parent_template_param_unnamed | The enclosing parent function (%sq2) for an extended %s1 lambda cannot be in a template which has a unnamed parameter: %nd |
| extended_lambda_nest_parent_template_param_unnamed | The enclosing parent %n for an extended %s lambda cannot be a template which has a unnamed parameter |
| extended_lambda_multiple_parameter_packs | The enclosing parent template function (%sq2) for an extended %s1 lambda cannot have more than one variadic parameter, or it is not listed last in the template parameter list. |

Nesting and Context Restrictions

| Tag | Message Template |
|---|---|
| extended_lambda_enclosing_function_generic_lambda | An extended %s1 lambda cannot be defined inside a generic lambda expression(%sq2). |
| extended_lambda_enclosing_function_hd_lambda | An extended %s1 lambda cannot be defined inside an extended __host__ __device__ lambda expression(%sq2). (note: double space before "lambda" is present in the binary) |
| extended_lambda_inaccessible_ancestor | An extended %s1 lambda cannot be defined inside a class (%sq2) with private or protected access within another class |
| extended_lambda_inside_constexpr_if | For this host platform/dialect, an extended lambda cannot be defined inside the 'if' or 'else' block of a constexpr if statement |
| extended_lambda_multiple_parent | Cannot specify multiple __nv_parent directives in a lambda declaration |
| extended_host_device_generic_lambda | __host__ __device__ extended lambdas cannot be generic lambdas |
| -- | If an extended %s lambda is defined within the body of one or more nested lambda expressions, each of these enclosing lambda expressions must be defined within the immediate or nested block scope of a function. |

Specifier and Annotation

| Tag | Message Template |
|---|---|
| extended_lambda_disallowed | __host__ or __device__ annotation on lambda requires --extended-lambda nvcc flag |
| extended_lambda_constexpr | The %s1 specifier is not allowed for an extended %s2 lambda |
| -- | The operator() function for a lambda cannot be explicitly annotated with execution space annotations (__host__/__device__/__global__), the annotations are derived from its closure class |

Category 6: Device Code Restrictions (13 messages)

General restrictions that apply to any code executing on the GPU. These errors are emitted when C++ features unsupported by the NVPTX backend appear in __device__ or __global__ function bodies.

| Tag | Message Template |
|---|---|
| cuda_device_code_unsupported_operator | The operator '%s' is not allowed in device code |
| unsupported_type_in_device_code | %t %s1 a %s2, which is not supported in device code |
| -- | device code does not support exception handling |
| -- | device code does not support coroutines |
| -- | operations on vector types are not supported in device code |
| undefined_device_entity | cannot use an entity undefined in device code |
| undefined_device_identifier | identifier %sq is undefined in device code |
| thread_local_in_device_code | cannot use thread_local specifier for variable declarations in device code |
| unrecognized_pragma_device_code | unrecognized #pragma in device code |
| -- | zero-sized parameter type %t is not allowed in device code |
| -- | zero-sized variable %sq is not allowed in device code |
| -- | dynamic initialization is not supported for a function-scope static %s variable within a __device__/__global__ function |
| -- | function-scope static variable within a __device__/__global__ function requires a memory space specifier |

Category 7: Kernel Launch (6 messages)

Errors related to <<<...>>> kernel launch syntax.

| Tag | Message Template |
|---|---|
| device_launch_no_sepcomp | kernel launch from __device__ or __global__ functions requires separate compilation mode |
| missing_api_for_device_side_launch | device-side kernel launch could not be processed as the required runtime APIs are not declared |
| -- | explicit stream argument not provided in kernel launch |
| -- | kernel launches from templates are not allowed in system files |
| device_side_launch_arg_with_user_provided_cctor | cannot pass an argument with a user-provided copy-constructor to a device-side kernel launch |
| device_side_launch_arg_with_user_provided_dtor | cannot pass an argument with a user-provided destructor to a device-side kernel launch |

Category 8: Memory Space and Variable Restrictions (15 messages)

Variable Access Across Spaces

| Tag | Message Template |
|---|---|
| device_var_read_in_host | a %s1 %n1 cannot be directly read in a host function |
| device_var_written_in_host | a %s1 %n1 cannot be directly written in a host function |
| device_var_address_taken_in_host | address of a %s1 %n1 cannot be directly taken in a host function |
| host_var_read_in_device | a host %n1 cannot be directly read in a device function |
| host_var_written_in_device | a host %n1 cannot be directly written in a device function |
| host_var_address_taken_in_device | address of a host %n1 cannot be directly taken in a device function |
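
The six tags form a 2 x 3 grid: which side the variable lives on, crossed with the kind of access. A minimal sketch of that selection follows. The enum and function names are hypothetical, and any exemptions the real checker applies before reaching these diagnostics are not modelled.

```cpp
#include <string>

enum class Side { Host, Device };
enum class Access { Read, Write, AddressOf };

// var_side: memory space of the variable; func_side: execution space of the
// function containing the access. Returns an empty string when both agree.
inline std::string cross_space_access_tag(Side var_side, Side func_side, Access access) {
    if (var_side == func_side) return "";
    if (var_side == Side::Device) {  // device variable used in a host function
        switch (access) {
        case Access::Read:      return "device_var_read_in_host";
        case Access::Write:     return "device_var_written_in_host";
        case Access::AddressOf: return "device_var_address_taken_in_host";
        }
    } else {                         // host variable used in a device function
        switch (access) {
        case Access::Read:      return "host_var_read_in_device";
        case Access::Write:     return "host_var_written_in_device";
        case Access::AddressOf: return "host_var_address_taken_in_device";
        }
    }
    return "";
}
```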

Variable Declaration Restrictions

| Tag | Message Template |
|---|---|
| illegal_local_to_device_function | %s1 %sq2 variable declaration is not allowed inside a device function body |
| illegal_local_to_host_function | %s1 %sq2 variable declaration is not allowed inside a host function body |
| -- | the __shared__ memory space specifier is not allowed for a variable declared by the for-range-declaration |
| -- | __shared__ variables cannot have external linkage |
| device_variable_in_unnamed_inline_ns | A %s variable cannot be declared within an inline unnamed namespace |
| -- | member variables of an anonymous union at global or namespace scope cannot be directly accessed in __device__ and __global__ functions |

Auto-Deduced Device References

| Tag | Message Template |
|---|---|
| auto_device_fn_ref | A non-constexpr __device__ function (%sq1) with "auto" deduced return type cannot be directly referenced %s2, except if the reference is absent when __CUDA_ARCH__ is undefined |
| device_var_constexpr | (constexpr rules for device variables) |
| device_var_structured_binding | (structured bindings on device variables) |

Category 9: __grid_constant__ (8 messages)

The __grid_constant__ annotation (compute_70+) marks a kernel parameter as read-only grid-wide. Errors enforce that the parameter is on a __global__ function, is const-qualified, and is not a reference type.

| Tag | Message Template |
|---|---|
| grid_constant_non_kernel | __grid_constant__ annotation is only allowed on a parameter of a __global__ function |
| grid_constant_not_const | a parameter annotated with __grid_constant__ must have const-qualified type |
| grid_constant_reference_type | a parameter annotated with __grid_constant__ must not have reference type |
| grid_constant_unsupported_arch | __grid_constant__ annotation is only allowed for architecture compute_70 or later |
| grid_constant_incompat_redecl | incompatible __grid_constant__ annotation for parameter %s in function redeclaration (see previous declaration %p) |
| grid_constant_incompat_templ_redecl | incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p) |
| grid_constant_incompat_specialization | incompatible __grid_constant__ annotation for parameter %s in function specialization (see previous declaration %p) |
| grid_constant_incompat_instantiation_directive | incompatible __grid_constant__ annotation for parameter %s in instantiation directive (see previous declaration %p) |
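
The type and architecture rules for a single annotated parameter can be sketched with type traits. The names below are hypothetical and the order in which the checks run is an assumption. The compute_70 threshold and the const / non-reference requirements come from the messages above.

```cpp
#include <type_traits>

enum class GridConstantCheck {
    Ok,
    UnsupportedArch,  // grid_constant_unsupported_arch
    ReferenceType,    // grid_constant_reference_type
    NotConst          // grid_constant_not_const
};

// Param is the declared parameter type; compute_capability is e.g. 70 for compute_70.
template <typename Param>
constexpr GridConstantCheck check_grid_constant(int compute_capability) {
    return compute_capability < 70          ? GridConstantCheck::UnsupportedArch
           : std::is_reference<Param>::value ? GridConstantCheck::ReferenceType
           : !std::is_const<Param>::value    ? GridConstantCheck::NotConst
                                             : GridConstantCheck::Ok;
}
```

For example, `check_grid_constant<const int>(80)` yields Ok, while `check_grid_constant<const int&>(80)` is rejected as a reference type before constness is considered.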

Category 10: JIT Mode (5 messages)

JIT mode (device-only compilation, as used by NVRTC) restricts host constructs. These errors guide users toward the -default-device flag for unannotated declarations.

| Tag | Message Template |
|---|---|
| no_host_in_jit | A function explicitly marked as a __host__ function is not allowed in JIT mode |
| unannotated_function_in_jit | A function without execution space annotations (__host__/__device__/__global__) is considered a host function, and host functions are not allowed in JIT mode. Consider using -default-device flag to process unannotated functions as __device__ functions in JIT mode |
| unannotated_variable_in_jit | A namespace scope variable without memory space annotations (__device__/__constant__/__shared__/__managed__) is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process unannotated namespace scope variables as __device__ variables in JIT mode |
| unannotated_static_data_member_in_jit | A class static data member with non-const type is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process such data members as __device__ variables in JIT mode |
| host_closure_class_in_jit | The execution space for the lambda closure class members was inferred to be __host__ (based on context). This is not allowed in JIT mode. Consider using -default-device to infer __device__ execution space for namespace scope lambda closure classes. |

Category 11: RDC / Whole-Program Mode (4 messages)

Diagnostics related to relocatable device code (-rdc=true) and whole-program compilation (-rdc=false).

| Tag | Message Template |
|---|---|
| -- | An inline __device__/__constant__/__managed__ variable must have internal linkage when the program is compiled in whole program mode (-rdc=false) |
| template_global_no_def | when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit. To resolve this issue, either use separate compilation mode ("-rdc=true"), or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off) |
| extern_kernel_template | when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false"). To resolve the issue, either use separate compilation mode ("-rdc=true"), or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off) |
| -- | address of internal linkage device function (%sq) was taken (nv bug 2001144). mitigation: no mitigation required if the address is not used for comparison, or if the target function is not a CUDA C++ builtin. Otherwise, write a wrapper function to call the builtin, and take the address of the wrapper function instead |

Category 12: Atomics (26 messages)

CUDA atomics are lowered to PTX instructions with specific size, type, scope, and memory order constraints. These diagnostics enforce hardware limits.

Architecture and Type Constraints

| Tag | Message Template |
|---|---|
| nv_atomic_functions_not_supported_below_sm60 | __nv_atomic_* functions are not supported on arch < sm_60. |
| nv_atomic_operation_not_in_device_function | atomic operations are not in a device function. |
| nv_atomic_function_no_args | atomic function requires at least one argument. |
| nv_atomic_function_address_taken | nv atomic function must be called directly. |
| invalid_nv_atomic_operation_size | atomic operations and, or, xor, add, sub, min and max are valid only on objects of size 4, or 8. |
| invalid_nv_atomic_cas_size | atomic CAS is valid only on objects of size 2, 4, 8 or 16 bytes. |
| invalid_nv_atomic_exch_size | atomic exchange is valid only on objects of size 4, 8 or 16 bytes. |
| invalid_data_size_for_nv_atomic_generic_function | generic nv atomic functions are valid only on objects of size 1, 2, 4, 8 and 16 bytes. |
| non_integral_type_for_non_generic_nv_atomic_function | non-generic nv atomic load, store, cas and exchange are valid only on integral types. |
| invalid_nv_atomic_operation_add_sub_size | atomic operations add and sub are not valid on signed integer of size 8. |
| nv_atomic_add_sub_f64_not_supported | atomic add and sub for 64-bit float is supported on architecture sm_60 or above. |
| invalid_nv_atomic_operation_max_min_float | atomic operations min and max are not supported on any floating-point types. |
| floating_type_for_logical_atomic_operation | For a logical atomic operation, the first argument cannot be any floating-point types. |
| nv_atomic_cas_b16_not_supported | (16-bit CAS not supported) |
| nv_atomic_exch_cas_b128_not_supported | (128-bit exchange/CAS not supported) |
| nv_atomic_load_store_b128_version_too_low | (128-bit load/store requires newer arch) |
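
The four size rules can be condensed into one predicate. The sketch below uses hypothetical names and covers only the sizes quoted in the messages. The architecture-dependent refinements (16-bit CAS, 128-bit exchange/CAS, 128-bit load/store) and the type rules are not modelled.

```cpp
#include <cstddef>

enum class AtomicOp {
    Rmw,       // and, or, xor, add, sub, min, max
    Cas,       // compare-and-swap
    Exchange,  // exchange
    Generic    // generic nv atomic functions
};

// Returns true when an object of `bytes` bytes is a valid operand size
// for the given operation class, per the message text above.
constexpr bool atomic_size_ok(AtomicOp op, std::size_t bytes) {
    return op == AtomicOp::Rmw
               ? (bytes == 4 || bytes == 8)
           : op == AtomicOp::Cas
               ? (bytes == 2 || bytes == 4 || bytes == 8 || bytes == 16)
           : op == AtomicOp::Exchange
               ? (bytes == 4 || bytes == 8 || bytes == 16)
               : (bytes == 1 || bytes == 2 || bytes == 4 || bytes == 8 || bytes == 16);
}
```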

Memory Order and Scope

| Tag | Message Template |
|---|---|
| nv_atomic_load_order_error | atomic load's memory order cannot be release or acq_rel. |
| nv_atomic_store_order_error | atomic store's memory order cannot be consume, acquire or acq_rel. |
| nv_atomic_operation_order_not_constant_int | atomic operation's memory order argument is not an integer literal. |
| nv_atomic_operation_scope_not_constant_int | atomic operation's scope argument is not an integer literal. |
| invalid_nv_atomic_memory_order_value | (invalid memory order enum value) |
| invalid_nv_atomic_thread_scope_value | (invalid thread scope enum value) |
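
The load and store order rules mirror the usual C++ memory model restrictions. A minimal sketch follows. The `Order` enum is a hypothetical stand-in, and its enumerator values are not claimed to match the integers the builtins actually accept.

```cpp
// Hypothetical stand-in for the memory order argument.
enum class Order { Relaxed, Consume, Acquire, Release, AcqRel, SeqCst };

// nv_atomic_load_order_error: a load cannot be release or acq_rel.
constexpr bool load_order_ok(Order o) {
    return o != Order::Release && o != Order::AcqRel;
}

// nv_atomic_store_order_error: a store cannot be consume, acquire or acq_rel.
constexpr bool store_order_ok(Order o) {
    return o != Order::Consume && o != Order::Acquire && o != Order::AcqRel;
}
```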

Scope Fallback Warnings

| Tag | Message Template |
|---|---|
| nv_atomic_operations_scope_fallback_to_membar | atomic operations' scope argument is supported on architecture sm_60 or above. Fall back to use membar. |
| nv_atomic_operations_memory_order_fallback_to_membar | atomic operations' argument of memory order is supported on architecture sm_70 or above. Fall back to use membar. |
| nv_atomic_operations_scope_cluster_change_to_device | atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
| nv_atomic_load_store_scope_cluster_change_to_device | atomic load and store's scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
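
Two of the fallbacks reduce to simple threshold checks on the target architecture. The sketch below uses hypothetical names. Only the cluster and device scopes and the sm_70 / sm_90 thresholds are taken from the messages. The other scope enumerators are placeholders.

```cpp
// Hypothetical scope enumeration; only Cluster and Device appear in the messages.
enum class Scope { Thread, Block, Cluster, Device, System };

// Cluster scope is downgraded to device scope below sm_90.
constexpr Scope effective_scope(Scope requested, int sm) {
    return (requested == Scope::Cluster && sm < 90) ? Scope::Device : requested;
}

// The memory order argument falls back to membar below sm_70.
constexpr bool memory_order_falls_back_to_membar(int sm) {
    return sm < 70;
}
```

Both cases are warnings rather than errors: compilation continues with the weaker but still valid lowering.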

Category 13: ASM in Device Code (6 messages)

Inline assembly constraints are more restrictive in device code (NVPTX backend supports fewer constraint letters than x86).

| Tag | Message Template |
|---|---|
| asm_constraint_letter_not_allowed_in_device | asm constraint letter '%s' is not allowed inside a __device__/__global__ function |
| -- | an asm operand may specify only one constraint letter in a __device__/__global__ function |
| -- | The 'C' constraint can only be used for asm statements in device code |
| -- | The cc clobber constraint is not supported in device code |
| cuda_xasm_strict_placeholder_format | (strict placeholder format in CUDA asm) |
| addr_of_label_in_device_func | address of label extension is not supported in __device__/__global__ functions |

Category 14: #pragma nv_abi (10 messages)

The #pragma nv_abi directive controls the calling convention for device functions, adjusting parameter passing to match PTX ABI requirements.

| Tag | Message Template |
|---|---|
| nv_abi_pragma_bad_format | (malformed #pragma nv_abi) |
| nv_abi_pragma_invalid_option | #pragma nv_abi contains an invalid option |
| nv_abi_pragma_missing_arg | #pragma nv_abi requires an argument |
| nv_abi_pragma_duplicate_arg | #pragma nv_abi contains a duplicate argument |
| nv_abi_pragma_not_constant | #pragma nv_abi argument must evaluate to an integral constant expression |
| nv_abi_pragma_not_positive_value | #pragma nv_abi argument value must be a positive value |
| nv_abi_pragma_overflow_value | #pragma nv_abi argument value exceeds the range of an integer |
| nv_abi_pragma_device_function | #pragma nv_abi must be applied to device functions |
| nv_abi_pragma_device_function_context | #pragma nv_abi is not supported inside a host function |
| nv_abi_pragma_next_construct | #pragma nv_abi must appear immediately before a function declaration, function definition, or an expression statement |
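
The two value-range diagnostics can be sketched as a check on an already-evaluated constant. The function name is hypothetical, and two points are assumptions: that "the range of an integer" means the range of `int`, and that the positivity check runs before the overflow check.

```cpp
#include <climits>
#include <string>

// `value` is the pragma argument after constant evaluation.
// Returns the tag of the violated rule, or an empty string when the value is accepted.
inline std::string check_nv_abi_value(long long value) {
    if (value <= 0) return "nv_abi_pragma_not_positive_value";
    if (value > INT_MAX) return "nv_abi_pragma_overflow_value";
    return "";
}
```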

Category 15: __nv_register_params__ (4 messages)

The __nv_register_params__ attribute forces all parameters to be passed in registers (compute_80+).

| Tag | Message Template |
|---|---|
| register_params_not_enabled | __nv_register_params__ support is not enabled |
| register_params_unsupported_arch | __nv_register_params__ is only supported for compute_80 or later architecture |
| register_params_unsupported_function | __nv_register_params__ is not allowed on a %s function |
| register_params_ellipsis_function | __nv_register_params__ is not allowed on a function with ellipsis |

Category 16: __CUDACC_RTC__name_expr (6 messages)

The __CUDACC_RTC__name_expr intrinsic is used by NVRTC to form the mangled name of a __global__ function or __device__/__constant__ variable at compile time.

| Tag | Message Template |
|---|---|
| name_expr_parsing | (error during name expression parsing) |
| name_expr_non_global_routine | Name expression cannot form address of a non-__global__ function. Input name expression was: %sq |
| name_expr_non_device_variable | Name expression cannot form address of a variable that is not a __device__/__constant__ variable. Input name expression was: %sq |
| name_expr_not_routine_or_variable | Name expression must form address of a __global__ function or the address of a __device__/__constant__ variable. Input name expression was: %sq |
| name_expr_extra_tokens | (extra tokens after name expression) |
| name_expr_internal_error | (internal error in name expression processing) |

Category 17: Texture and Surface Variables (8 messages)

Texture and surface objects have special memory semantics. These errors enforce that they are not used in ways incompatible with the GPU texture subsystem.

| Tag | Message Template |
|---|---|
| texture_surface_variable_in_unnamed_inline_ns | A texture or surface variable cannot be declared within an inline unnamed namespace |
| -- | A texture or surface variable cannot be used in the non-type template argument of a __device__, __host__ __device__ or __global__ function template instantiation |
| reference_to_text_surf_type_in_device_func | a reference to texture/surface type cannot be used in __device__/__global__ functions |
| reference_to_text_surf_var_in_device_func | taking reference of texture/surface variable not allowed in __device__/__global__ functions |
| addr_of_text_surf_var_in_device_func | cannot take address of texture/surface variable %sq in __device__/__global__ functions |
| addr_of_text_surf_expr_in_device_func | cannot take address of texture/surface expression in __device__/__global__ functions |
| indir_into_text_surf_var_in_device_func | indirection not allowed for accessing texture/surface through variable %sq in __device__/__global__ functions |
| indir_into_text_surf_expr_in_device_func | indirection not allowed for accessing texture/surface through expression in __device__/__global__ functions |

Category 18: __managed__ Variables (7 messages)

__managed__ unified-memory variables have significant restrictions because they must be accessible from both host and device.

| Tag | Message Template |
|---|---|
| managed_const_type_not_allowed | a __managed__ variable cannot have a const qualified type |
| managed_reference_type_not_allowed | a __managed__ variable cannot have a reference type |
| managed_cant_be_shared_constant | __managed__ variables cannot be marked __shared__ or __constant__ |
| unsupported_arch_for_managed_capability | __managed__ variables require architecture compute_30 or higher |
| unsupported_configuration_for_managed_capability | __managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system) |
| decltype_of_managed_variable | A __managed__ variable cannot be used as an unparenthesized id-expression argument for decltype() |
| -- | (dynamic initialization restrictions for managed variables) |

Category 19: Device Function Signature Constraints (5 messages)

Restrictions on __device__ and __host__ __device__ functions that are distinct from __global__ constraints.

| Tag | Message Template |
|---|---|
| device_function_has_ellipsis | __device__ or __host__ __device__ function with ellipsis requires compute_30 or higher architecture |
| device_func_tex_arg | (device function with texture argument restriction) |
| no_host_device_initializer_list | (std::initializer_list in host device context) |
| no_host_device_move_forward | (std::move/forward in host device context) |
| no_strict_cuda_error | (relaxed error checking mode) |

Category 20: __wgmma_mma_async Builtins (4 messages)

Warp Group Matrix Multiply-Accumulate builtins (sm_90a+).

| Tag | Message Template |
|---|---|
| wgmma_mma_async_not_enabled | __wgmma_mma_async builtins are only available for sm_90a |
| wgmma_mma_async_nonconstant_arg | Non-constant argument to __wgmma_mma_async call |
| wgmma_mma_async_missing_args | The 'A' or 'B' argument to __wgmma_mma_async call is missing |
| wgmma_mma_async_bad_shape | The shape %s is not supported for __wgmma_mma_async builtin |

Category 21: __block_size__ / __cluster_dims__ (8 messages)

Architecture-dependent launch configuration attributes.

| Tag | Message Template |
|---|---|
| block_size_unsupported | __block_size__ is not supported for this GPU architecture |
| block_size_must_be_positive | (block size values must be positive) |
| cluster_dims_unsupported | __cluster_dims__ is not supported for this GPU architecture |
| cluster_dims_must_be_positive | (cluster_dims values must be positive) |
| cluster_dims_too_large | (cluster_dims exceeds maximum) |
| conflict_between_cluster_dim_and_block_size | cannot specify the second tuple in __block_size__ while __cluster_dims__ is present |
| -- | cannot specify max blocks per cluster for this GPU architecture |
| shared_block_size_must_be_positive | (shared block size must be positive) |

Category 22: Inline Hint Conflicts (2 messages)

| Tag | Message Template |
|---|---|
| -- | "__inline_hint__" and "__forceinline__" may not be used on the same declaration |
| -- | "__inline_hint__" and "__noinline__" may not be used on the same declaration |

Category 23: Miscellaneous CUDA Errors

Remaining CUDA-specific diagnostics that do not fall into the above categories.

| Tag | Message Template |
|---|---|
| cuda_displaced_new_or_delete_operator | (displaced new/delete in CUDA context) |
| cuda_demote_unsupported_floating_point | (unsupported floating-point type demoted) |
| illegal_ucn_in_device_identifer | Universal character is not allowed in device entity name (%sq) |
| thread_local_for_device_vars | (thread_local on device variables) |
| -- | __global__ function or function template cannot have a parameter with va_list type |
| global_qualifier_not_allowed | (execution space qualifier not allowed here) |

Complete Diagnostic Tag Index (286 tags)

The following table lists all 286 CUDA-specific diagnostic tag names extracted from the cudafe++ binary. Each tag can be used with --diag_suppress, --diag_warning, --diag_error, or #pragma nv_diag_suppress / nv_diag_warning / nv_diag_error.

Tags are organized alphabetically within functional groups.

Cross-Space / Execution Space

Tag Name
unsafe_device_call

Redeclaration

Tag Name
device_function_redeclared_with_global
device_function_redeclared_with_host
device_function_redeclared_with_host_device
device_function_redeclared_without_device
global_function_redeclared_with_device
global_function_redeclared_with_host
global_function_redeclared_with_host_device
global_function_redeclared_without_global
host_device_function_redeclared_with_global
host_function_redeclared_with_device
host_function_redeclared_with_global
host_function_redeclared_with_host_device

__global__ Constraints

Tag Name
bounds_attr_only_on_global_func
cuda_specifier_twice_in_group
global_class_decl
global_exception_spec
global_friend_definition
global_func_local_template_arg
global_function_consteval
global_function_constexpr
global_function_deduced_return_type
global_function_has_ellipsis
global_function_in_unnamed_inline_ns
global_function_inline
global_function_multiple_packs
global_function_pack_not_last
global_function_return_type
global_function_with_initializer_list
global_lambda_template_arg
global_new_or_delete
global_operator_function
global_param_align_too_big
global_private_template_arg
global_private_type_arg
global_qualifier_not_allowed
global_ref_param_restrict
global_rvalue_ref_type
global_unnamed_type_arg
global_va_list_type
local_type_used_in_global_function
maxnreg_attr_only_on_global_func
missing_launch_bounds
template_global_no_def

Extended Lambda

Tag Name
extended_host_device_generic_lambda
extended_lambda_array_capture_assignable
extended_lambda_array_capture_default_constructible
extended_lambda_array_capture_rank
extended_lambda_call_operator_local_type
extended_lambda_call_operator_private_type
extended_lambda_cant_take_function_address
extended_lambda_capture_in_constexpr_if
extended_lambda_capture_local_type
extended_lambda_capture_private_type
extended_lambda_constexpr
extended_lambda_disallowed
extended_lambda_discriminator
extended_lambda_enclosing_function_deducible
extended_lambda_enclosing_function_generic_lambda
extended_lambda_enclosing_function_hd_lambda
extended_lambda_enclosing_function_local
extended_lambda_enclosing_function_not_found
extended_lambda_hd_init_capture
extended_lambda_illegal_parent
extended_lambda_inaccessible_ancestor
extended_lambda_inaccessible_parent
extended_lambda_init_capture_array
extended_lambda_init_capture_initlist
extended_lambda_inside_constexpr_if
extended_lambda_multiple_parameter_packs
extended_lambda_multiple_parent
extended_lambda_nest_parent_template_param_unnamed
extended_lambda_no_parent_func
extended_lambda_pack_capture
extended_lambda_parent_class_unnamed
extended_lambda_parent_local_type
extended_lambda_parent_non_extern
extended_lambda_parent_private_template_arg
extended_lambda_parent_private_type
extended_lambda_parent_template_param_unnamed
extended_lambda_reference_capture
extended_lambda_too_many_captures
this_addr_capture_ext_lambda

Device Code

Tag Name
addr_of_label_in_device_func
asm_constraint_letter_not_allowed_in_device
auto_device_fn_ref
cuda_device_code_unsupported_operator
cuda_xasm_strict_placeholder_format
illegal_ucn_in_device_identifer
no_strict_cuda_error
thread_local_in_device_code
undefined_device_entity
undefined_device_identifier
unrecognized_pragma_device_code
unsupported_type_in_device_code

Device Function

Tag Name
device_func_tex_arg
device_function_has_ellipsis
no_host_device_initializer_list
no_host_device_move_forward

Kernel Launch

Tag Name
device_launch_no_sepcomp
device_side_launch_arg_with_user_provided_cctor
device_side_launch_arg_with_user_provided_dtor
missing_api_for_device_side_launch

Variable Access

Tag Name
device_var_address_taken_in_host
device_var_constexpr
device_var_read_in_host
device_var_structured_binding
device_var_written_in_host
device_variable_in_unnamed_inline_ns
host_var_address_taken_in_device
host_var_read_in_device
host_var_written_in_device
illegal_local_to_device_function
illegal_local_to_host_function

Variable Template

Tag Name
variable_template_func_local_template_arg
variable_template_lambda_template_arg
variable_template_private_template_arg
variable_template_private_type_arg
variable_template_unnamed_type_template_arg

__managed__

Tag Name
decltype_of_managed_variable
managed_cant_be_shared_constant
managed_const_type_not_allowed
managed_reference_type_not_allowed
unsupported_arch_for_managed_capability
unsupported_configuration_for_managed_capability

__grid_constant__

Tag Name
grid_constant_incompat_instantiation_directive
grid_constant_incompat_redecl
grid_constant_incompat_specialization
grid_constant_incompat_templ_redecl
grid_constant_non_kernel
grid_constant_not_const
grid_constant_reference_type
grid_constant_unsupported_arch

Atomics

Tag Name
floating_type_for_logical_atomic_operation
invalid_data_size_for_nv_atomic_generic_function
invalid_nv_atomic_cas_size
invalid_nv_atomic_exch_size
invalid_nv_atomic_memory_order_value
invalid_nv_atomic_operation_add_sub_size
invalid_nv_atomic_operation_max_min_float
invalid_nv_atomic_operation_size
invalid_nv_atomic_thread_scope_value
non_integral_type_for_non_generic_nv_atomic_function
nv_atomic_add_sub_f64_not_supported
nv_atomic_cas_b16_not_supported
nv_atomic_exch_cas_b128_not_supported
nv_atomic_function_address_taken
nv_atomic_function_no_args
nv_atomic_functions_not_supported_below_sm60
nv_atomic_load_order_error
nv_atomic_load_store_b128_version_too_low
nv_atomic_load_store_scope_cluster_change_to_device
nv_atomic_operation_not_in_device_function
nv_atomic_operation_order_not_constant_int
nv_atomic_operation_scope_not_constant_int
nv_atomic_operations_memory_order_fallback_to_membar
nv_atomic_operations_scope_cluster_change_to_device
nv_atomic_operations_scope_fallback_to_membar
nv_atomic_store_order_error

JIT Mode

Tag Name
host_closure_class_in_jit
no_host_in_jit
unannotated_function_in_jit
unannotated_static_data_member_in_jit
unannotated_variable_in_jit

RDC / Whole-Program

Tag Name
extern_kernel_template
template_global_no_def

#pragma nv_abi

Tag Name
nv_abi_pragma_bad_format
nv_abi_pragma_device_function
nv_abi_pragma_device_function_context
nv_abi_pragma_duplicate_arg
nv_abi_pragma_invalid_option
nv_abi_pragma_missing_arg
nv_abi_pragma_next_construct
nv_abi_pragma_not_constant
nv_abi_pragma_not_positive_value
nv_abi_pragma_overflow_value

__nv_register_params__

Tag Name
register_params_ellipsis_function
register_params_not_enabled
register_params_unsupported_arch
register_params_unsupported_function

name_expr

Tag Name
name_expr_extra_tokens
name_expr_internal_error
name_expr_non_device_variable
name_expr_non_global_routine
name_expr_not_routine_or_variable
name_expr_parsing

Texture / Surface

Tag Name
addr_of_text_surf_expr_in_device_func
addr_of_text_surf_var_in_device_func
indir_into_text_surf_expr_in_device_func
indir_into_text_surf_var_in_device_func
reference_to_text_surf_type_in_device_func
reference_to_text_surf_var_in_device_func
texture_surface_variable_in_unnamed_inline_ns

__wgmma_mma_async

Tag Name
wgmma_mma_async_bad_shape
wgmma_mma_async_missing_args
wgmma_mma_async_nonconstant_arg
wgmma_mma_async_not_enabled

__block_size__ / __cluster_dims__

Tag Name
block_size_must_be_positive
block_size_unsupported
cluster_dims_must_be_positive
cluster_dims_too_large
cluster_dims_unsupported
conflict_between_cluster_dim_and_block_size
shared_block_size_must_be_positive
shared_block_size_too_large

Miscellaneous

Tag Name
cuda_demote_unsupported_floating_point
cuda_displaced_new_or_delete_operator
thread_local_for_device_vars

Internal Representation

Each CUDA error message is stored as a const char* entry in the error template table at off_88FAA0. The diagnostic tag names are stored in a separate string-to-integer lookup table; the tag name resolver (sub_4ED240 and related functions) performs a binary search on this table to match tag strings against internal error codes.
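The tag lookup described above can be sketched as a binary search over a sorted table. This is a minimal illustration only: the `tag_entry` layout, the `lookup_tag` name, and the numeric codes are assumptions made for the example, not values recovered from the binary.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry layout; the real table layout is not documented here. */
struct tag_entry {
    const char *tag;   /* diagnostic tag name */
    int code;          /* internal error code (placeholder values below) */
};

/* Must stay sorted in strcmp order for bsearch to work. */
static const struct tag_entry tag_table[] = {
    { "block_size_unsupported", 1001 },
    { "cluster_dims_too_large", 1002 },
    { "global_function_inline", 1003 },
    { "unsafe_device_call",     1004 },
};

/* bsearch comparator: key is a C string, elem is a table entry. */
static int compare_tag(const void *key, const void *elem)
{
    const struct tag_entry *entry = elem;
    return strcmp((const char *)key, entry->tag);
}

/* Returns the code for a tag, or -1 if the tag is unknown. */
int lookup_tag(const char *tag)
{
    const struct tag_entry *hit = bsearch(
        tag, tag_table,
        sizeof tag_table / sizeof tag_table[0],
        sizeof tag_table[0], compare_tag);
    return hit ? hit->code : -1;
}
```

A caller handling --diag_suppress could pass the user-supplied tag string to such a function and treat -1 as an unknown tag.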

The format specifiers embedded in CUDA error messages use the same system as EDG base errors:

| Specifier | Meaning | Example in CUDA messages |
| --- | --- | --- |
| %sq | Quoted entity name | Function name in cross-space call |
| %sq1, %sq2 | Indexed quoted names | Caller and callee in call errors |
| %no1 | Entity name (omit kind) | Function name in redeclaration |
| %n1, %n2 | Entity names | Override base/derived pair |
| %nd | Entity name with decl location | Template parameter |
| %s, %s1, %s2 | String fill-in | Execution space keyword |
| %t | Type fill-in | Type name in template arg errors |
| %p | Source position | Previous declaration location |

For full format specifier documentation, see Format Specifiers.

Format Specifiers

The cudafe++ diagnostic system uses a custom format specifier language -- not printf -- to expand parameterized error messages. The expansion engine is process_fill_in (sub_4EDCD0, 1,202 decompiled lines in error.c), called by write_message_to_buffer (sub_4EF620, 159 lines) during template string expansion. Each diagnostic record carries a linked list of typed fill-in entries that supply the actual values -- type nodes, entity pointers, strings, integers, source positions -- which the format engine renders into the final message text.

This page documents the specifier syntax, the fill-in kind system, entity-kind dispatch, suffix options, numeric indexing, and the labeled fill-in mechanism.

Specifier Syntax

When write_message_to_buffer walks an error template string (looked up from off_88FAA0[error_code]), it recognizes three format constructs:

| Syntax | Meaning | Example |
| --- | --- | --- |
| %% | Literal % character | "100%% complete" |
| %XY...Zn | Fill-in specifier: letter X, options Y...Z, index n | %nfd2, %sq1, %t |
| %[label] | Named label fill-in reference | %[class_or_struct] |

Positional Specifier Parsing

The parser (sub_4EF620, error.c:4703) processes %XY...Zn specifiers as follows:

// After seeing '%', read next char as specifier letter
char spec_letter = template[pos + 1];      // 'T', 'd', 'n', 'p', 'r', 's', 't', 'u'
pos += 2;

// Collect option characters (a-z, A-Z) into buffer, max 29
int opt_count = 0;
char options[30];
while (true) {
    char c = template[pos];
    if (c >= '0' && c <= '9') {
        // Trailing digit = fill-in index (1-based)
        fill_in_index = c - '0';
        break;
    }
    if ((c & 0xDF) < 'A' || (c & 0xDF) > 'Z') {
        // Not a letter -- end of specifier, index defaults to 1
        fill_in_index = 1;
        break;
    }
    options[opt_count++] = c;
    if (opt_count > 29)
        assertion_handler("error.c", 4739,
            "write_message_to_buffer",
            "construct_text_message:",
            "too many option characters");
    pos++;
}
options[opt_count] = '\0';

process_fill_in(diagnostic_record, spec_letter, options, fill_in_index);

The maximum of 29 option characters is enforced by an assertion. In practice, specifiers use 0--3 option characters.
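As a self-contained illustration of the parsing rules above, the following sketch extracts the letter, options, and index from a specifier. The `fill_in_spec` struct and `parse_fill_in_spec` function are hypothetical names. Two choices are assumptions of this sketch rather than recovered behavior: it advances past the trailing digit, and it reports overflow with a return value of 0 instead of calling an assertion handler. It also assumes a well-formed template in which '%' is never the last character.

```c
#include <stddef.h>

struct fill_in_spec {
    char letter;        /* specifier letter, e.g. 'n' */
    char options[30];   /* NUL-terminated option characters (max 29) */
    int  index;         /* fill-in index, defaults to 1 */
};

/* Parses the specifier that starts at tmpl[pos] == '%'.
 * Returns the position just past the specifier, or 0 if more than
 * 29 option characters are present. */
size_t parse_fill_in_spec(const char *tmpl, size_t pos,
                          struct fill_in_spec *out)
{
    size_t count = 0;

    out->letter = tmpl[pos + 1];
    out->index = 1;
    pos += 2;

    for (;;) {
        char c = tmpl[pos];
        if (c >= '0' && c <= '9') {
            out->index = c - '0';   /* trailing digit selects the fill-in */
            pos++;
            break;
        }
        if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')))
            break;                  /* not a letter: specifier ends here */
        if (count == 29)
            return 0;               /* too many option characters */
        out->options[count++] = c;
        pos++;
    }
    out->options[count] = '\0';
    return pos;
}
```

For the template fragment "%nfd2 x" this yields letter 'n', options "fd", index 2, and a resume position of 5.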

Fill-In Kinds

The specifier letter maps to a fill-in kind value through a switch on (letter - 84) in process_fill_in (sub_4EDCD0, error.c:4297):

| Letter | ASCII | letter - 84 | Kind | Payload Type | Description |
| --- | --- | --- | --- | --- | --- |
| %T | 84 | 0 | 6 | Type node pointer | Type name, uppercase rendering ("<int, float>") |
| %d | 100 | 16 | 0 | int64 | Signed decimal integer |
| %n | 110 | 26 | 4 | Entity node pointer | Entity/symbol name with rich formatting |
| %p | 112 | 28 | 2 | Source position cookie | Source file + line reference |
| %r | 114 | 30 | 7 | byte + pointer | Template parameter reference |
| %s | 115 | 31 | 3 | const char* | Plain string |
| %t | 116 | 32 | 5 | Type node pointer | Type name, lowercase rendering ("int") |
| %u | 117 | 33 | 1 | uint64 | Unsigned decimal integer |

Any other letter triggers the assertion: "process_fill_in: bad fill-in kind" (error.c:4297).
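The mapping in the table can be written as a small switch. This is a sketch: `fill_in_kind_for_letter` is a hypothetical name, and returning -1 stands in for the assertion path.

```c
/* Maps a specifier letter to its fill-in kind (0-7).
 * Returns -1 for any other letter, the case the binary rejects
 * with "bad fill-in kind". */
int fill_in_kind_for_letter(char letter)
{
    switch (letter - 84) {      /* 84 == 'T' */
    case 0:  return 6;          /* %T: type, uppercase */
    case 16: return 0;          /* %d: signed decimal */
    case 26: return 4;          /* %n: entity name */
    case 28: return 2;          /* %p: source position */
    case 30: return 7;          /* %r: template parameter reference */
    case 31: return 3;          /* %s: string */
    case 32: return 5;          /* %t: type, lowercase */
    case 33: return 1;          /* %u: unsigned decimal */
    default: return -1;
    }
}
```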

Usage Frequency Across 3,795 Templates

Measured across all error message templates in off_88FAA0:

| Specifier | Occurrences | Typical Context |
| --- | --- | --- |
| %s | ~470 | String fragments: attribute names, keyword text, flag names |
| %t | ~241 | Type names in mismatch diagnostics |
| %sq | ~233 | Quoted string fragments in CUDA cross-space messages |
| %n | ~179 | Entity names: function, variable, class, template |
| %p | ~76 | Source positions: "declared at line N of file.cu" |
| %d | ~60 | Numeric values: counts, limits, sizes |
| %T | ~40 | Type template parameter lists |
| %u | ~20 | Unsigned counts |
| %r | ~10 | Template parameter back-references |

Fill-In Entry Layout

Each fill-in entry is a 40-byte node allocated from a pool (qword_106B490) or heap by alloc_fill_in_entry (sub_4F2DE0):

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| 0 | 4 | kind | Fill-in kind (0--7, from specifier letter mapping) |
| 4 | 1 | used_flag | Set to 1 when consumed during expansion |
| 5 | 3 | (padding) | -- |
| 8 | 8 | next | Next fill-in in linked list |
| 16 | 8+ | payload | Union, varies by kind (see below) |

Payload Layout by Kind

Kind 0 (decimal, %d) / Kind 1 (unsigned, %u) / Kind 3 (string, %s) / Kind 5 (type, %t) / Kind 6 (type, %T):

| Offset | Size | Field |
| --- | --- | --- |
| 16 | 8 | value -- int64 for kind 0, uint64 for kind 1, const char* for kind 3, type node pointer for kind 5/6 |

Kind 2 (position, %p):

| Offset | Size | Field |
| --- | --- | --- |
| 16 | 8 | position_cookie -- initialized to qword_126EFB8 (current source position) at allocation time |

Kind 4 (entity name, %n):

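| Offset | Size | Field |
| --- | --- | --- |
| 16 | 8 | entity_ptr -- pointer to entity node |
| 24 | 4 | scope_index -- initialized to 0xFFFFFFFF (invalid) |
| 28 | 1 | full_qualification_flag |
| 29 | 1 | original_name_flag |
| 30 | 1 | parameter_list_flag |
| 31 | 1 | template_function_flag |
| 32 | 1 | definition_flag |
| 33 | 1 | alternate_original_flag |
| 34 | 1 | template_only_flag |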
341template_only_flag

Kind 7 (%r):

| Offset | Size | Field |
| --- | --- | --- |
| 16 | 1 | param_byte |
| 17 | 7 | (padding) |
| 24 | 8 | template_scope_ptr |
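The offsets in the layout tables above can be cross-checked with a C struct. This is a sketch assuming an LP64 target (8-byte pointers, as on x86-64) and a C11 compiler. The field names follow the tables, but the struct itself, including the opaque `void *` pointers and the union member names, is a reconstruction for illustration and not a recovered declaration.

```c
#include <stddef.h>
#include <stdint.h>

struct fill_in_entry {
    uint32_t kind;                          /* offset 0 */
    uint8_t  used_flag;                     /* offset 4 */
    uint8_t  pad[3];                        /* offset 5 */
    struct fill_in_entry *next;             /* offset 8 */
    union {                                 /* offset 16 */
        int64_t     signed_value;           /* kind 0 */
        uint64_t    unsigned_value;         /* kind 1 */
        uint64_t    position_cookie;        /* kind 2 */
        const char *string;                 /* kind 3 */
        void       *type_node;              /* kinds 5 and 6 */
        struct {                            /* kind 4 */
            void    *entity_ptr;            /* offset 16 */
            uint32_t scope_index;           /* offset 24 */
            uint8_t  full_qualification_flag;   /* offset 28 */
            uint8_t  original_name_flag;        /* offset 29 */
            uint8_t  parameter_list_flag;       /* offset 30 */
            uint8_t  template_function_flag;    /* offset 31 */
            uint8_t  definition_flag;           /* offset 32 */
            uint8_t  alternate_original_flag;   /* offset 33 */
            uint8_t  template_only_flag;        /* offset 34 */
        } entity;
        struct {                            /* kind 7 */
            uint8_t  param_byte;            /* offset 16 */
            uint8_t  pad[7];                /* offset 17 */
            void    *template_scope_ptr;    /* offset 24 */
        } tparam;
    } payload;
};

/* Compile-time checks against the documented layout (LP64 only). */
_Static_assert(sizeof(struct fill_in_entry) == 40, "node is 40 bytes");
_Static_assert(offsetof(struct fill_in_entry, next) == 8, "next at 8");
_Static_assert(offsetof(struct fill_in_entry, payload) == 16, "payload at 16");
_Static_assert(offsetof(struct fill_in_entry, payload.entity.scope_index) == 24,
               "scope_index at 24");
_Static_assert(offsetof(struct fill_in_entry, payload.entity.template_only_flag) == 34,
               "template_only_flag at 34");
_Static_assert(offsetof(struct fill_in_entry, payload.tparam.template_scope_ptr) == 24,
               "template_scope_ptr at 24");
```

The largest union member (the kind 4 payload) occupies 19 bytes and is padded to 24 by pointer alignment, which accounts for the 40-byte node size.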

Fill-In Linked List

Fill-in entries attach to the diagnostic record as a singly-linked list:

  • Head pointer: diagnostic record offset 184 (fill_in_list_head)
  • Tail pointer: diagnostic record offset 192 (fill_in_list_tail)

When process_fill_in searches for a matching entry, it walks the list from head, looking for the first entry where node->kind == requested_kind. If the specifier includes an index (e.g., %t2), it skips index - 1 matching entries before consuming the target:

struct fill_in_entry *node = *(struct fill_in_entry **)((char *)diagnostic + 184);   // fill_in_list_head
if (!node)
    goto fill_in_not_found;

while (node->kind != requested_kind || --index > 0) {
    node = node->next;                        // offset 8
    if (!node)
        goto fill_in_not_found;
}

node->used_flag = 1;                          // mark consumed (offset 4)
// proceed with kind-specific rendering

If no matching entry is found, process_fill_in triggers an assertion with a diagnostic message identifying the missing fill-in: "specified fill-in (%X, N) not found for error string: \"...\"" (error.c:4317).

After all format specifiers have been expanded, construct_text_message (sub_4EF9D0) iterates the entire fill-in list and asserts that every entry has used_flag == 1. An unconsumed fill-in triggers: "construct_text_message: not all fill-ins used for error string: \"...\"" (error.c:4781).
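The search and the final consistency check can be sketched on a simplified node type. The `fill_in` struct below keeps only the fields the walk needs, and both function names are hypothetical. Returning NULL or 0 stands in for the assertion paths.

```c
#include <stddef.h>

/* Simplified node: only the fields used by the list walk. */
struct fill_in {
    int kind;
    unsigned char used_flag;
    struct fill_in *next;
};

/* Finds the index-th entry of the requested kind and marks it used.
 * The counter is decremented only on a kind match, so index 0 and
 * index 1 both select the first match. Returns NULL if not found. */
struct fill_in *find_fill_in(struct fill_in *head, int kind, int index)
{
    struct fill_in *node = head;
    while (node && (node->kind != kind || --index > 0))
        node = node->next;
    if (node)
        node->used_flag = 1;
    return node;
}

/* Returns 1 if every entry has been consumed, 0 otherwise. */
int all_fill_ins_used(const struct fill_in *head)
{
    for (; head; head = head->next)
        if (!head->used_flag)
            return 0;
    return 1;
}
```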

Numeric Indexing

When a template string must reference multiple fill-ins of the same kind, a trailing digit selects which one:

| Specifier | Meaning |
| --- | --- |
| %t | First type fill-in (index 1, default) |
| %t1 | First type fill-in (index 1, explicit) |
| %t2 | Second type fill-in (index 2) |
| %n1 | First entity name fill-in |
| %n2 | Second entity name fill-in |
| %sq1 | First string fill-in, quoted |
| %sq2 | Second string fill-in, quoted |

The index is a single digit 0--9. Index 0 behaves identically to index 1 (the counter is pre-decremented before comparison). In practice, most templates use indices 1 and 2; a few use up to 3.

Real template example (CUDA cross-space call, error 3499):

calling a __device__ function(%sq1) from a __host__ function(%sq2) is not allowed

Here %sq1 and %sq2 are both kind 3 (string) with option q (quoted), selecting the first and second string fill-ins respectively. The caller attaches two string fill-ins -- the called function's name and the calling function's name.

Suffix Options

String Options (%s)

The %s specifier accepts only one option character: q for quoted output.

| Form | Rendering |
| --- | --- |
| %s | Raw string: foo |
| %sq | Quoted string: "foo" |

The q option wraps the string in double-quote characters (") and applies colorization if enabled (quote category, code 6 = bold). Any other option character on %s triggers: "process_fill_in: bad option" (error.c:4364).

Multiple q characters are permitted syntactically (the parser loops over all option chars validating each is q) but have no additional effect -- only one layer of quoting is applied.

Entity Name Options (%n)

The %n specifier accepts a rich set of option suffixes that control how an entity is rendered. Options are processed left-to-right, setting flags on the fill-in entry's flag bytes (offsets 28--34):

| Option | Flag Byte | Effect |
| --- | --- | --- |
| f | offset 28 (full_qualification) | Show fully-qualified name with namespace/class scope chain |
| o | offset 29 (original_name) | Omit the entity kind prefix (suppress "function ", "variable ", etc.) |
| p | offset 30 (parameter_list) | Show function parameter types in signature |
| t | offset 31 + offset 28 | Show template arguments AND full qualification (sets both flags) |
| a | offset 29 + offset 33 | Show original name AND alternate/accessibility info |
| d | offset 32 (definition) | Append declaration location: " (declared at line N of file.cu)" |
| T | offset 34 (template_only) | Show template specialization context: " (from translation unit ...)" |

Options can be combined. Common combinations from the error template table:

| Specifier | Rendering Example |
| --- | --- |
| %n | function "foo" |
| %no | "foo" (no kind prefix) |
| %nf | function "ns::cls::foo" (fully qualified) |
| %nfd | function "ns::cls::foo" (declared at line 42 of bar.cu) |
| %nt | function "ns::cls::foo<int>" (full + template args) |
| %np | function "foo" [with parameters shown] |
| %nT | function "foo" (from translation unit bar.cu) |
| %na | "foo" based on template argument(s) ... |
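The option-to-flag mapping can be sketched as a left-to-right scan. The `name_flags` struct and `apply_name_options` function are hypothetical names. How the binary reacts to an unrecognized %n option is not documented here, so the -1 return is an assumption of this sketch.

```c
#include <stdint.h>

/* One byte per flag, mirroring offsets 28-34 of the kind 4 payload. */
struct name_flags {
    uint8_t full_qualification;   /* offset 28 */
    uint8_t original_name;        /* offset 29 */
    uint8_t parameter_list;       /* offset 30 */
    uint8_t template_function;    /* offset 31 */
    uint8_t definition;           /* offset 32 */
    uint8_t alternate_original;   /* offset 33 */
    uint8_t template_only;        /* offset 34 */
};

/* Applies each option character in order.
 * Returns 0 on success, -1 on an unknown option (sketch behavior). */
int apply_name_options(const char *options, struct name_flags *flags)
{
    for (; *options; options++) {
        switch (*options) {
        case 'f': flags->full_qualification = 1; break;
        case 'o': flags->original_name = 1; break;
        case 'p': flags->parameter_list = 1; break;
        case 't': flags->template_function = 1;
                  flags->full_qualification = 1; break;
        case 'a': flags->original_name = 1;
                  flags->alternate_original = 1; break;
        case 'd': flags->definition = 1; break;
        case 'T': flags->template_only = 1; break;
        default:  return -1;
        }
    }
    return 0;
}
```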

No Options for Other Kinds

The %d, %u, %p, %t, %T, and %r specifiers reject all option characters:

if (*options != '\0')
    assertion_handler("error.c", 4372,
        "process_fill_in",
        "process_fill_in: bad option", NULL);

Kind-Specific Rendering

Kind 0 -- Signed Decimal (%d)

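Renders the 64-bit signed integer payload using snprintf(buf, 20, "%lli", value), then writes the result to the output buffer. A size of 20 leaves room for 19 characters plus the terminating NUL, which covers every non-negative int64_t value (at most 19 digits). A negative value with 19 digits, such as INT64_MIN, needs 21 bytes and would be truncated by this call.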
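The buffer-size arithmetic for the snprintf call can be checked directly. A small sketch, assuming a 64-bit long long; both helper names are hypothetical.

```c
#include <limits.h>
#include <stdio.h>

/* Characters needed to print value with "%lli", excluding the NUL. */
int decimal_width(long long value)
{
    return snprintf(NULL, 0, "%lli", value);
}

/* Formats into a 20-byte buffer with the same size argument as the
 * call described above. Returns 1 if the whole number fit, 0 if
 * snprintf had to truncate it. */
int format_fits_in_20(long long value, char out[20])
{
    int needed = snprintf(out, 20, "%lli", value);
    return needed >= 0 && needed < 20;
}
```

LLONG_MAX needs 19 characters and fits. LLONG_MIN needs 20 characters plus the NUL, so a 20-byte buffer keeps only the first 19.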

Kind 1 -- Unsigned Decimal (%u)

Formats the payload through sub_4F63D0, which renders the unsigned 64-bit value into a dynamically-sized string buffer.

Kind 2 -- Source Position (%p)

Calls sub_4F6820 (form_source_position) with the position cookie from the fill-in payload. The rendering includes:

  • File name (via sub_5B15D0 for display formatting)
  • Line number
  • Contextual text supplied by the caller through three string arguments (prefix, suffix, end-of-source fallback)

The caller passes context strings like " (declared ", ")", "(at end of source)" to frame the position reference. When the position resolves to line 0 or the file is "-" (stdin), alternate formats are used.

Kind 3 -- String (%s / %sq)

Without the q option, writes the string pointer payload directly to the output buffer via strlen + sub_6B9CD0 (buffer append).

With the q option, wraps the string in double quotes with colorization:

if (colorization_active)
    emit_escape(buffer, 6);       // quote color (bold)
write_char(buffer, '"');
write_string(buffer, payload);
if (colorization_active)
    emit_escape(buffer, 1);       // reset
write_char(buffer, '"');

Kind 5 -- Type, Lowercase (%t)

Renders the type node through the type formatting subsystem. The rendering pipeline:

  1. Set byte_10678FA = 1 (name lookup kind = type display mode)
  2. Write opening "
  3. Call sub_600740 (format type for display) with the type node and the entity formatter callback table (qword_1067860)
  4. Write closing "
  5. Check via sub_7BE9C0 if the type has an "aka" (also-known-as) desugared form
  6. If yes, append ' (aka "desugared_type")' -- comparing the rendered forms to avoid redundant output when they are identical

The aka check compares the rendered text of the original type against the desugared type. If they produce identical strings (same length, same content via strncmp), the aka suffix is suppressed by truncating the buffer back to the pre-aka position.
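The append-then-truncate behavior of the aka check can be sketched as follows. The function name and the fixed-capacity buffer are assumptions for illustration; the real code writes into the growable message buffer and obtains both strings from the type formatter.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Writes "\"rendered\"" into buf, then appends " (aka \"desugared\")".
 * If both forms are identical (same length, same content), the suffix
 * is dropped by truncating back to the pre-aka position.
 * Returns the final length, or 0 if the quoted name does not fit. */
size_t render_type_with_aka(char *buf, size_t cap,
                            const char *rendered, const char *desugared)
{
    int n = snprintf(buf, cap, "\"%s\"", rendered);
    if (n < 0 || (size_t)n >= cap)
        return 0;

    size_t mark = (size_t)n;                 /* pre-aka position */
    size_t rendered_len = strlen(rendered);
    size_t desugared_len = strlen(desugared);

    int m = snprintf(buf + mark, cap - mark, " (aka \"%s\")", desugared);
    int identical = rendered_len == desugared_len &&
                    strncmp(rendered, desugared, rendered_len) == 0;

    if (m < 0 || (size_t)m >= cap - mark || identical) {
        buf[mark] = '\0';                    /* truncate the aka suffix */
        return mark;
    }
    return mark + (size_t)m;
}
```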

Kind 6 -- Type, Uppercase (%T)

Renders a type template argument list in angle brackets:

write_string(buffer, "\"<");
// Walk the template argument linked list
for (arg = payload; arg != NULL; arg = arg->next) {
    if (arg->kind != 3)   // skip pack expansion markers
        format_template_argument(arg, &entity_formatter);
    if (arg->next && arg->next->kind != 3)
        write_string(buffer, ", ");
}
write_string(buffer, ">\"");

Template argument entries with kind == 3 (at byte offset +8) are pack-expansion markers and are skipped during rendering.

Kind 7 -- Template Parameter Reference (%r)

Renders a template parameter by looking up the parameter entity through sub_5B9EE0 (entity lookup by scope + index). If found and non-null, renders via sub_4F3970 (unqualified entity name). Otherwise, falls back to sub_6011F0 (generic template parameter formatting).

Entity Kind Dispatch (%n)

When processing %n specifiers, process_fill_in reads the entity kind byte at offset 80 of the entity node and dispatches to kind-specific rendering logic. The function first resolves through projection indirection: if entity_kind == 16 (typedef), it follows the pointer at entity->info_ptr->pointed_to; if entity_kind == 24 (resolved namespace alias), it follows entity->info_ptr.

The dispatch handles 25 entity kind values (0--24, with gaps at 14/15/16/24 handled as special cases):

| Entity Kind | Value | Kind Label String | Index in off_88FAA0 | Rendering Logic |
| --- | --- | --- | --- | --- |
| keyword | 0 | (none -- literal "keyword") | -- | Write keyword ", then the keyword's name string from entity->name_sym->name |
| concept | 1 | (from table) | 1462 | Simple: write kind label + quoted name |
| constant template parameter | 2 | "constant" or "nontype" | -- | Check template parameter subkind: type_kind 14 with subkind 2 = "nontype", else "constant" |
| template parameter | 3 | (from table) | 1464 or 1465 | Check whether the template parameter is a type parameter (type_kind != 14) → index 1465, else 1464 |
| class | 4 | (from table, CUDA-aware) | 1466--1468 | CUDA mode: 1467 or 1468 (class vs struct); non-CUDA: 1466 |
| struct | 5 | (same as class) | 1466--1468 | Same dispatch as class, differentiated by v46 != 5 |
| enum | 6 | (from table) | 1472 | Simple: write kind label + quoted name |
| variable | 7 | "variable" or "handler parameter" | 1474 or 1475 | Check handler-parameter flag (offset 163, bit 0x40). If set: "handler parameter" (index 1474). If variable is a structured binding (offset 162, bit 1): use index 2937. Otherwise: "variable" (index 1475) with optional template context |
| field | 8 | "field" or "member" | 1480 or 1481 | CUDA C++ mode: "member" (index 1480); C mode: "field" (index 1481) |
| member | 9 | "member" | 1480 | Always "member" with optional template context from scope chain |
| function | 10 | "function" or "deduction guide" | 1478 or 2892 | Check linkage kind (offset 166 == 7): deduction guide → index 2892. Otherwise "function" (1478). Walk qualified type chain to strip cv-qualifiers |
| function overload | 11 | (same as function) | 1478 or 2892 | Same dispatch as function (case 10), merged in the switch |
| namespace | 12 | (from table) | 1463 | Simple: write kind label + quoted name |
| label | 13 | (none) | -- | Write quoted name only, no kind prefix, no type info |
| typedef (indirect variable) | 14 | "variable" | 1475 | Dereferences through entity->info_ptr->pointed_to and renders as variable |
| typedef (indirect function) | 15 | "function" | 1478 | Dereferences through entity->info_ptr, extracts function entity + routine info |
| typedef | 16 | -- | -- | Assertion: "form_symbol_summary: projection of projection kind" (error.c:2020). Should have been resolved before dispatch |
| using declaration | 17 | (from table) | 1479 | Simple: write kind label + quoted name |
| parameter | 18 | "parameter" | 1473 | Simple: write "parameter" + quoted name with type info |
| class (anonymous/unnamed) | 19 | (from table) | 1469--1471 or 1889 | Multiple sub-cases: anonymous class bit 0x40 → index 1469; class-template with bit 0x02 → index 1470; deduction_guide bit → index 1889; else index 1471 |
| function template | 20 | "function template" | 1485 (lambda) or kind label | Lambda function (offset 189, bit 0x20): index 1485 with scope entity. Otherwise: "function template" with type and parameter info |
| variable template | 21 | (from table) | 2750 | Simple: write kind label + quoted name |
| alias template | 22 | (from table) | 3050 | Simple: write kind label + quoted name |
| concept template | 23 | (from table) | 1482 | Simple: write kind label + quoted name |
| resolved namespace alias | 24 | -- | -- | Assertion: "form_symbol_summary: projection of projection kind" (same as kind 16). Should have been resolved |

Any entity kind value outside 0--24 (excluding the gaps that trigger assertions) hits the default case: "form_symbol_summary: unsupported symbol kind" (error.c:2023).

Entity Rendering Pipeline

For entity kinds that produce a fully-formatted name (most non-trivial cases), the rendering proceeds through these stages:

1. Write entity kind label string (e.g., "function ")
   └── sub_6B9EA0(buffer, kind_label_string)
   └── sub_6B9CD0(buffer, " ", 1)

2. Open quote
   └── Optional colorization: sub_4ECDD0(buffer, 6)   // quote color
   └── sub_6B9CD0(buffer, "\"", 1)

3. Render type prefix (if has_type_info and full_qualification)
   └── sub_5FE8B0(type_node, 0, 1, 0, 0, &entity_formatter)

4. Render qualified or unqualified name
   ├── With template context:  sub_737A00(entity, &entity_formatter)
   └── Without template context: sub_4F3970(entity)

5. Render function parameters (if applicable)
   ├── Full parameter types: sub_5FB270(type, 0, 0, &entity_formatter)
   └── Simple type suffix:   sub_6016F0(type, &entity_formatter)

6. Close quote
   └── sub_6B9CD0(buffer, "\"", 1)
   └── Optional colorization: sub_4ECDD0(buffer, 1)   // reset

7. Append accessibility info (if 'a' option)
   └── " based on template argument(s) "
   └── sub_5FA660(template_arg_list, 0, &entity_formatter)

8. Append declaration location (if 'd' option)
   └── sub_4F6820(position, diag, " (declared ", ")", "(at end of source)")

9. Append translation unit info (if 'T' option)
   └── " (from translation unit <filename>)"

The original_name flag (o option) suppresses steps 1 and 3, rendering only the bare quoted name without a kind prefix or type qualification. The full_qualification flag (f option) enables step 3 and uses sub_737A00 for fully-qualified name rendering in step 4. The parameter_list flag (p option) forces step 5 to include full parameter-type rendering.

Template Context in Entity Names

When dword_126E274 (show template arguments) is non-zero and the entity has template context, the renderer can walk up the template scope chain:

  1. Access the entity's routine info (for functions: offset 88 → offset 192 → offset 16)
  2. Check for the instantiated-from entity (offset 104 of scope info, guarded by !(offset_176 & 1))
  3. If found, use the instantiated-from entity as the display target
  4. For class templates (entity_kind == 20): walk the template parameter chain, rendering <param1, param2, ...> with pack-expansion markers (...) for variadic parameters

CUDA-Specific Entity Rendering

Several entity kinds have CUDA-aware rendering paths:

  • Class/struct (kinds 4/5): When dword_126EFB4 == 2 (CUDA C++ mode) and the entity has an anonymous flag (offset 161, bit 0x80), rendering jumps to the anonymous-class handler (kind 19) instead
  • Field (kind 8): In CUDA C++ mode, the kind label is "member" (index 1480); in C mode, it is "field" (index 1481)
  • Class/struct label selection: In CUDA C++ mode, the kind label index is always 1467; in non-CUDA mode, it depends on whether the entity is class vs struct

Labeled Fill-Ins (%[label])

The %[label] syntax references a named fill-in from the label table at off_D481E0. This mechanism allows error templates to include conditional text fragments that vary based on language mode or compilation context.

Label Table Structure

off_D481E0 is an array of 24-byte entries (two 8-byte pointers followed by two 4-byte string table indices):

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| 0 | 8 | name | Label name string (e.g., "class_or_struct") |
| 8 | 8 | condition_ptr | Pointer to condition flag (dword) |
| 16 | 4 | true_index | String table index when *condition_ptr != 0 |
| 20 | 4 | false_index | String table index when *condition_ptr == 0 |

Label Lookup Algorithm

// write_message_to_buffer, error.c:4714
char *label_start = template + pos + 2;      // skip "%["
char *label_end = strchr(template + pos + 1, ']');
if (!label_end)
    assertion_handler("error.c", 4714, "write_message_to_buffer", NULL, NULL);

size_t label_len = label_end - label_start;

// Walk off_D481E0 table
struct label_entry *entry = off_D481E0;
while (entry->name) {
    if (strncmp(entry->name, label_start, label_len) == 0) {
        // Found matching label
        int string_index;
        if (*entry->condition_ptr)
            string_index = entry->true_index;
        else
            string_index = entry->false_index;

        if (string_index > 3794)
            error_text_invalid_code();     // sub_4F2D30

        // Expand the referenced string directly into the buffer
        const char *text = off_88FAA0[string_index];
        write_to_buffer(buffer, text, strlen(text));
        pos = (label_end - template) + 1;   // resume just past ']'
        break;
    }
    entry++;   // advance by 24 bytes
}

if (!entry->name) {
    // Label not found -- fatal
    fprintf(stderr, "missing fill-in label: %.*s\n", (int)label_len, label_start);
    assertion_handler("error.c", 430,
        "get_label_fill_in_entry",
        "get_label_fill_in_entry: no label fill-in found", NULL);
}

The label table entries reference string indices in the same off_88FAA0 table used for error messages. This allows a single error template to produce different text depending on compilation mode -- for example, using "class" vs "struct" based on a language-mode flag, or "virtual" vs "" based on a feature flag.

The label text is written directly to the output buffer without further format specifier processing -- labels cannot contain nested % specifiers.
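The conditional index selection can be isolated into a small function. This is a sketch: `resolve_label` is a hypothetical name, and the table indices in any caller are placeholders. Unlike the strncmp-only comparison in the walk above, this version also requires the lengths to match, so a shorter label that is a prefix of an entry name is rejected. That stricter check is a choice of the sketch, not recovered behavior.

```c
#include <stddef.h>
#include <string.h>

/* Mirrors the 24-byte entry layout on an LP64 target. */
struct label_entry {
    const char *name;            /* offset 0 */
    const int  *condition_ptr;   /* offset 8 */
    int         true_index;      /* offset 16 */
    int         false_index;     /* offset 20 */
};

/* Walks a table terminated by a NULL name. Returns the selected
 * string table index, or -1 if no entry matches the label. */
int resolve_label(const struct label_entry *table,
                  const char *label, size_t label_len)
{
    for (; table->name; table++) {
        if (strlen(table->name) == label_len &&
            strncmp(table->name, label, label_len) == 0) {
            return *table->condition_ptr ? table->true_index
                                         : table->false_index;
        }
    }
    return -1;
}
```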

Output Buffer

All rendering targets the global message text buffer at qword_106B488:

  • Initial allocation: 0x400 bytes (1 KB) via sub_6B98A0
  • Dynamic growth: sub_6B9B20 doubles the buffer when capacity is exceeded
  • String append: sub_6B9CD0(buffer, data, length) -- the workhorse write function
  • String write: sub_6B9EA0(buffer, string) -- convenience wrapper (calls strlen + sub_6B9CD0)
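The growth policy in the list above can be sketched with a doubling buffer. The `text_buffer` struct and its field names are assumptions; only the 0x400 initial size, the doubling rule, and the append/write split come from the description. The function names reuse the recovered names from the function map for readability.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical descriptor; the real buffer header layout is not shown. */
struct text_buffer {
    char  *data;
    size_t length;
    size_t capacity;
};

/* Allocates the initial 0x400-byte buffer. Returns 0 on success. */
int buffer_init(struct text_buffer *b)
{
    b->data = malloc(0x400);
    b->length = 0;
    b->capacity = b->data ? 0x400 : 0;
    return b->data ? 0 : -1;
}

/* Appends n bytes, doubling the capacity until the data fits. */
int buffer_append(struct text_buffer *b, const char *src, size_t n)
{
    if (b->capacity == 0)
        return -1;
    while (b->length + n > b->capacity) {
        size_t new_capacity = b->capacity * 2;
        char *grown = realloc(b->data, new_capacity);
        if (!grown)
            return -1;
        b->data = grown;
        b->capacity = new_capacity;
    }
    memcpy(b->data + b->length, src, n);
    b->length += n;
    return 0;
}

/* Convenience wrapper: strlen + buffer_append. */
int buffer_write_string(struct text_buffer *b, const char *s)
{
    return buffer_append(b, s, strlen(s));
}
```

Appending 3,000 bytes to a fresh buffer doubles the capacity twice, from 1,024 to 2,048 to 4,096 bytes.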

The entity display callback infrastructure at qword_1067860 allows the type/name formatting subsystem to write to the same buffer through an indirect call:

| Variable | Address | Purpose |
| --- | --- | --- |
| qword_1067860 | 0x1067860 | Entity formatter callback (set to sub_5B29C0) |
| qword_1067870 | 0x1067870 | Entity formatter output buffer (set to qword_106B488) |
| byte_10678F1 | 0x10678F1 | C mode flag (dword_126EFB4 == 1) |
| byte_10678F4 | 0x10678F4 | Pre-C++11 flag |
| byte_10678FA | 0x10678FA | Name lookup kind (saved/restored around type rendering) |
| byte_10678FE | 0x10678FE | Entity display flags (saved/restored around %n processing) |
| byte_1067902 | 0x1067902 | Type desugaring mode flag (saved/restored around %t aka rendering) |

Colorization Interaction

When dword_126ECA4 (colorization active) is non-zero, the format engine inserts ANSI escape sequences around quoted names and type references:

| Context | Color Code | ANSI Sequence | Visual |
| --- | --- | --- | --- |
| Opening quote (") | 6 (quote) | \033[01m | Bold |
| Closing quote (") | 1 (reset) | \033[0m | Normal |
| Type rendering context | (inherited) | -- | Inherits from diagnostic severity color |

The escape sequences are emitted by sub_4ECDD0(buffer, color_code). The color codes correspond to the categories parsed from EDG_COLORS / GCC_COLORS environment variables during initialization.
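A quoted string with colorization can be sketched as below, following the emission order shown in the %sq rendering code: color escape, opening quote, text, reset escape, closing quote. Both function names are hypothetical. Only codes 1 and 6 are documented here, so returning an empty string for other codes is an assumption of the sketch.

```c
#include <stdio.h>

/* Maps a color code to its ANSI sequence. */
const char *color_escape(int color_code)
{
    switch (color_code) {
    case 1:  return "\033[0m";    /* reset */
    case 6:  return "\033[01m";   /* quote: bold */
    default: return "";           /* other codes not covered here */
    }
}

/* Formats text as a quoted string, with escapes when colorize != 0.
 * Returns the snprintf result (length that was or would be written). */
int format_quoted(char *out, size_t cap, const char *text, int colorize)
{
    return snprintf(out, cap, "%s\"%s%s\"",
                    colorize ? color_escape(6) : "",
                    text,
                    colorize ? color_escape(1) : "");
}
```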

Function Map

| Address | Name (Recovered) | Size | Role |
| --- | --- | --- | --- |
| 0x4EDCD0 | process_fill_in | 1,202 lines | Core format specifier expansion |
| 0x4EF620 | write_message_to_buffer | 159 lines | Template string walker, % parser |
| 0x4F2DE0 | alloc_fill_in_entry | 41 lines | Pool allocator for 40-byte fill-in nodes |
| 0x4F2D30 | error_text_invalid_code | 12 lines | Assert on invalid error code (> 3794) |
| 0x4F2930 | assertion_handler | 101 lines | __noreturn, 5,185 callers |
| 0x4F3480 | format_assertion_message | ~100 lines | Multi-arg string builder for assertion text |
| 0x4F6820 | form_source_position | ~130 lines | Render %p source position with file + line |
| 0x4F3970 | format_entity_unqualified | -- | Render unqualified entity name |
| 0x4F39E0 | format_entity_with_template | -- | Render entity with template args + accessibility |
| 0x737A00 | format_qualified_name | -- | Render fully-qualified name through scope chain |
| 0x5FE8B0 | format_type_with_qualifiers | -- | Render type with cv-qualifiers for %n prefix |
| 0x5FB270 | format_function_parameters | -- | Render function parameter type list |
| 0x6016F0 | format_simple_type | -- | Render simple type suffix |
| 0x600740 | format_type_for_display | -- | Render type for %t specifier |
| 0x7BE9C0 | has_desugared_type | -- | Check if type has an "aka" form |
| 0x5FA660 | format_template_argument_list | -- | Render template argument list for %n a option |
| 0x5FA0D0 | format_template_argument | -- | Render single template argument for %T |
| 0x5B9EE0 | lookup_entity_by_scope | -- | Entity lookup for %r template parameter |
| 0x4F63D0 | format_unsigned_decimal | -- | Render unsigned integer for %u |
| 0x6B9CD0 | buffer_append | -- | Write bytes to dynamic buffer |
| 0x6B9EA0 | buffer_write_string | -- | Write null-terminated string to buffer |
| 0x4ECDD0 | emit_colorization_escape | -- | Emit ANSI escape sequence |

SARIF Output & Pragma Diagnostic Control

cudafe++ supports two diagnostic output formats -- traditional text (default) and SARIF v2.1.0 JSON -- selected by the --output_mode flag (flag index 274, stored in dword_106BBB8). Alongside the output format, the pragma diagnostic system allows per-error severity overrides at arbitrary source positions through #pragma nv_diag_* directives; these are recorded on a stack of severity modifications that is binary-searched at emission time. A companion colorization subsystem adds ANSI escape sequences to text-mode output, governed by environment variables and terminal detection. This page covers the internals of all three subsystems.

For the diagnostic pipeline architecture, severity levels, and error message formatting, see Diagnostic Overview. For the CUDA error catalog and tag-name suppression, see CUDA Errors.

SARIF Output Mode

Activation

SARIF mode is activated by passing --output_mode sarif on the command line. The flag handler (case 274 in the CLI parser at sub_454160) performs a simple string comparison:

// sub_454160, case 274
if (strcmp(arg, "text") == 0)
    dword_106BBB8 = 0;        // text mode (default)
else if (strcmp(arg, "sarif") == 0)
    dword_106BBB8 = 1;        // SARIF JSON mode
else
    error("unrecognized output mode (must be one of text, sarif): %s", arg);

When dword_106BBB8 == 1, three changes take effect globally:

  1. write_init (sub_5AEDB0) emits the SARIF JSON header (in text mode it emits nothing)
  2. check_severity (sub_4F1330) routes each diagnostic through the SARIF JSON builder instead of construct_text_message
  3. write_signoff (sub_5AEE00) emits ]}]}\n instead of the error/warning summary line

All other pipeline behavior -- severity computation, pragma overrides, error counting, exit codes -- is identical in both modes. SARIF mode omits the termination messages ("Compilation terminated.", "Compilation aborted.") but exits with the same numeric codes (0, 2, 4, 11).

SARIF Header (sub_5AEDB0)

write_init is called once at the start of compilation. In SARIF mode, it writes the JSON envelope to qword_126EDF0 (the diagnostic output stream, typically stderr):

{
  "version": "2.1.0",
  "$schema": "https://raw.githubusercontent.com/oasis-tcs/sarif-spec/master/Schemata/sarif-schema-2.1.0.json",
  "runs": [{
    "tool": {
      "driver": {
        "name": "EDG CPFE",
        "version": "6.6",
        "organization": "Edison Design Group",
        "fullName": "Edison Design Group C/C++ Front End - 6.6",
        "informationUri": "https://edg.com/c"
      }
    },
    "columnKind": "unicodeCodePoints",
    "results": [

The version strings ("6.6") are supplied through two %s format arguments that both resolve to the static string "6.6" in the binary. The header leaves the results array, the enclosing run object, and the runs array open -- each diagnostic result is appended to results as the compilation proceeds, and write_signoff closes all three along with the top-level object.

An assertion guards the mode value: if dword_106BBB8 is neither 0 nor 1, the function fires sub_4F2930 with "write_init" at host_envir.c:2017.

SARIF Result Object

Each diagnostic emitted through check_severity (sub_4F1330) produces one JSON result object. The construction happens inline within check_severity at LABEL_91, building the JSON into the SARIF buffer qword_106B478:

{
  "ruleId": "EC<error_code>",
  "level": "<severity_string>",
  "message": {"text": "<expanded_message>"},
  "locations": [{"physicalLocation": <location_object>}],
  "relatedLocations": [<related_location_objects>]
}

Comma handling: When qword_126ED90 + qword_126ED98 > 1 (more than one diagnostic has been emitted), a comma is prepended before the opening { to maintain valid JSON array syntax.

Rule ID Format

The rule ID is always "EC" followed by the internal error code (0--3794), not the display code:

sub_6B9CD0(sarif_buf, "\"ruleId\":", 9);
sub_6B9CD0(sarif_buf, "\"EC", 3);
sprintf(s, "%lu", (unsigned long)*(uint32_t*)(record + 176));  // internal error code, widened to match %lu
sub_6B9CD0(sarif_buf, s, strlen(s));
sub_6B9CD0(sarif_buf, "\"", 1);

For a CUDA error with internal code 3499 (display code 20042), the rule ID is "EC3499", not "EC20042". This differs from the text-mode format which uses "EC%lu" with the same internal code in construct_text_message.

Level Mapping

The level field is derived from the diagnostic severity byte at record offset 180. When severity <= byte_126ED68 (the error-promotion threshold) and severity <= 7, it is promoted to "error" before level selection. The mapping:

| Severity | level String | SARIF Standard? |
|---|---|---|
| 4 (remark) | "remark" | Non-standard extension |
| 5 (warning) | "warning" | Standard |
| 7 (error, soft) | "error" | Standard |
| 8 (error, hard) | "error" | Standard |
| 9 (catastrophic) | "catastrophe" | Non-standard extension |
| 11 (internal) | "internal_error" | Non-standard extension |

Any other severity value triggers the assertion at error.c:4886:

sub_4F2930(..., "write_sarif_level",
    "determine_severity_code: bad severity", 0);

Notes (severity 2) and command-line diagnostics (severity 6, 10) never reach the SARIF level mapper -- notes are suppressed below the minimum severity gate, and command-line diagnostics bypass the SARIF path entirely.
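The mapping in the table above can be sketched as a small lookup function. This is a minimal illustration, not the decompiled routine: the function name and the NULL return for unmapped severities are hypothetical (the binary asserts instead), and the error-promotion step is left out.

```c
#include <stddef.h>

/* Hypothetical helper: maps an EDG severity number to the SARIF "level"
 * string listed in the table. Returns NULL for any severity the table
 * does not cover; the real code fires an assertion in that case. */
const char *sarif_level_for_severity(int severity)
{
    switch (severity) {
    case 4:  return "remark";          /* non-standard extension */
    case 5:  return "warning";
    case 7:                            /* soft error */
    case 8:  return "error";           /* hard error */
    case 9:  return "catastrophe";     /* non-standard extension */
    case 11: return "internal_error";  /* non-standard extension */
    default: return NULL;              /* notes, command-line severities, etc. */
    }
}
```

A consumer that validates against the SARIF 2.1.0 schema would likely need to remap the three non-standard strings before ingesting the output.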

Message Object (sub_4EF8A0)

The message text is produced by write_sarif_message_json (sub_4EF8A0), which wraps the expanded error template in a JSON {"text":"..."} object:

  1. Appends {"text":" to the SARIF buffer
  2. Calls write_message_to_buffer (sub_4EF620) to expand the error template with fill-in values into qword_106B488
  3. Null-terminates the message buffer
  4. JSON-escapes the message: iterates each character, prepending \ before any " (0x22) or \ (0x5C) character
  5. Appends "} to close the message object

The escaping is minimal -- only double-quote and backslash are escaped. Control characters (newlines, tabs) are not escaped, relying on the fact that EDG error messages do not contain embedded newlines.
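The escaping step can be sketched as follows. This is an illustrative reimplementation under stated assumptions: the function name, the caller-supplied output buffer, and the (size_t)-1 overflow convention are hypothetical, since the binary appends to its own growable buffer instead.

```c
#include <stddef.h>

/* Hypothetical sketch of the minimal escaping described above: only '"'
 * and '\\' receive a backslash prefix; every other byte, including
 * control characters, is copied unchanged.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or (size_t)-1 if the output buffer is too small. */
size_t json_escape_minimal(const char *in, char *out, size_t cap)
{
    size_t n = 0;
    for (; *in != '\0'; ++in) {
        if (*in == '"' || *in == '\\') {
            if (n + 1 >= cap)
                return (size_t)-1;
            out[n++] = '\\';
        }
        if (n + 1 >= cap)
            return (size_t)-1;
        out[n++] = *in;
    }
    if (n >= cap)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

Because a raw newline passes through untouched, a message that did contain one would produce a JSON string that strict parsers reject, which is why the scheme depends on the message catalog staying single-line.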

Physical Location (sub_4ECB10)

When the diagnostic record has a valid file index (offset 136 != 0), a locations array is emitted containing one physical location object:

{
  "physicalLocation": {
    "artifactLocation": {"uri": "file://<canonical_path>"},
    "region": {"startLine": <line>, "startColumn": <column>}
  }
}

The function sub_4ECB10 (write_sarif_physical_location):

  1. Calls sub_5B97A0 to resolve the source-position cookie at record offset 136 into file path, line number, and column number
  2. Calls sub_5B1060 to canonicalize the file path
  3. Emits the artifactLocation with a file:// URI prefix
  4. Emits startLine unconditionally
  5. Emits startColumn only when the column value is non-zero (the v4 check: if (v4))

The startColumn conditional emission means that diagnostics without column information (e.g., command-line errors) produce location objects with only startLine.
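The conditional column emission can be illustrated with a short formatter. This is a sketch only: the function name and the snprintf-based approach are assumptions, the path is taken as already canonicalized, and no URI escaping is applied (the source does not say whether the binary performs any).

```c
#include <stdio.h>

/* Hypothetical sketch: formats one location object in the shape shown
 * above. startColumn is written only when column is non-zero.
 * Returns snprintf's result (characters that would have been written). */
int format_physical_location(char *out, size_t cap, const char *path,
                             unsigned long line, unsigned long column)
{
    if (column != 0) {
        return snprintf(out, cap,
            "{\"physicalLocation\":{\"artifactLocation\":{\"uri\":\"file://%s\"},"
            "\"region\":{\"startLine\":%lu,\"startColumn\":%lu}}}",
            path, line, column);
    }
    return snprintf(out, cap,
        "{\"physicalLocation\":{\"artifactLocation\":{\"uri\":\"file://%s\"},"
        "\"region\":{\"startLine\":%lu}}}",
        path, line);
}
```

With an absolute path such as /path/to/test.cu the prefix yields the three-slash form file:///path/to/test.cu.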

Sub-diagnostics (linked at record offset 72, the sub_diagnostic_head pointer) are serialized into the relatedLocations array:

if (record->sub_diagnostic_head) {
    append(",\"relatedLocations\":[");
    int first = 1;
    for (sub = record->sub_diagnostic_head; sub; sub = sub->next) {
        sub->parent = record;          // back-link at offset 16
        append("{\"message\":");
        write_sarif_message_json(sub);  // expand sub-diagnostic message
        if (sub->file_index)
            write_sarif_physical_location(sub);
        append("}");
        if (!first)
            append(",");               // note: comma AFTER closing }
        first = 0;
    }
    append("]");
}

Each related location has its own message object and an optional physicalLocation. As the loop above shows, the comma is appended after the closing brace of every entry except the first, yielding [{...}{...},{...},] -- a bug in the JSON generation. The separator between the first and second entries is missing and the last entry is followed by a trailing comma, so the output is malformed whenever there are two or more related locations.
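For contrast, a well-formed array needs the separator before every entry except the first. The sketch below shows that placement; it is a hypothetical helper (names, fixed output buffer, and the -1 overflow convention are all assumptions), not a patch recovered from the binary.

```c
#include <stddef.h>
#include <string.h>

/* Appends s to out at offset *n, keeping out NUL-terminated.
 * Returns 0 on success, -1 if it would not fit. */
static int rl_append(char *out, size_t cap, size_t *n, const char *s)
{
    size_t len = strlen(s);
    if (*n + len >= cap)
        return -1;
    memcpy(out + *n, s, len + 1);
    *n += len;
    return 0;
}

/* Hypothetical sketch: joins already-serialized JSON objects into
 * ,"relatedLocations":[...] with the comma placed BEFORE each entry
 * except the first. Emits nothing when count is zero.
 * Returns the length written, or -1 on overflow. */
int join_related_locations(char *out, size_t cap,
                           const char *const *items, size_t count)
{
    size_t n = 0;
    if (cap == 0)
        return -1;
    out[0] = '\0';
    if (count == 0)
        return 0;
    if (rl_append(out, cap, &n, ",\"relatedLocations\":["))
        return -1;
    for (size_t i = 0; i < count; ++i) {
        if (i > 0 && rl_append(out, cap, &n, ","))
            return -1;
        if (rl_append(out, cap, &n, items[i]))
            return -1;
    }
    if (rl_append(out, cap, &n, "]"))
        return -1;
    return (int)n;
}
```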

write_signoff closes the JSON structure:

if (dword_106BBB8 == 1) {
    fwrite("]}]}\n", 1, 5, qword_126EDF0);
    return;
}

This closes: results array (]), the run object (}), the runs array (]), and the top-level object (}), followed by a newline.

In text mode, write_signoff instead prints the error/warning summary (e.g., "3 errors, 2 warnings detected in file.cu"), using message-table lookups via sub_4F2D60 with IDs 1742--1748 and 3234--3235 for pluralization.

Complete SARIF Output Example

{"version":"2.1.0","$schema":"https://raw.githubusercontent.com/oasis-tcs/sarif-spec/master/Schemata/sarif-schema-2.1.0.json","runs":[{"tool":{"driver":{"name":"EDG CPFE","version":"6.6","organization":"Edison Design Group","fullName":"Edison Design Group C/C++ Front End - 6.6","informationUri":"https://edg.com/c"}},"columnKind":"unicodeCodePoints","results":[{"ruleId":"EC3499","level":"error","message":{"text":"calling a __device__ function(\"foo\") from a __host__ function(\"main\") is not allowed"},"locations":[{"physicalLocation":{"artifactLocation":{"uri":"file:///path/to/test.cu"},"region":{"startLine":10,"startColumn":5}}}]}]}]}

Pragma Diagnostic Control

Pragma Actions

cudafe++ processes #pragma nv_diag_* directives through the preprocessor, which records them as pragma action entries on a global stack. Six action codes are defined:

| Code | Pragma Directive | Severity Effect | Internal Name |
|---|---|---|---|
| 30 | #pragma nv_diag_suppress | Set severity to 3 (suppressed) | ignored |
| 31 | #pragma nv_diag_remark | Set severity to 4 (remark) | remark |
| 32 | #pragma nv_diag_warning | Set severity to 5 (warning) | warning |
| 33 | #pragma nv_diag_error | Set severity to 7 (error) | error |
| 35 | #pragma nv_diag_default | Restore from byte_1067920[4 * error_code] | default |
| 36 | #pragma nv_diag_push / pop | Scope boundary marker | push/pop |

Note the gap: action code 34 is not used. Actions 30--33 modify severity, 35 restores the compile-time default, and 36 provides push/pop scoping to allow localized overrides.

The pragmas accept either a numeric error code or a diagnostic tag name:

#pragma nv_diag_suppress 20042              // by display code
#pragma nv_diag_suppress calling_a_constexpr__host__function  // by tag name

Display codes >= 20000 are converted to internal codes by sub_4ED170:

int internal_code = (display_code > 19999) ? display_code - 16543 : display_code;
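As a worked check of the conversion, the sketch below applies the same expression and formats the resulting SARIF rule ID. The function names are illustrative (the first borrows the recovered name of sub_4ED170); the rule-ID formatter is a hypothetical helper that mirrors the "EC" prefix described earlier.

```c
#include <stdio.h>

/* Same arithmetic as the expression above: display codes of 20000 and up
 * are shifted down by 16543; smaller codes are already internal. */
int display_to_internal_code(int display_code)
{
    return (display_code > 19999) ? display_code - 16543 : display_code;
}

/* Hypothetical helper: formats "EC<internal_code>" as used for ruleId. */
int format_rule_id(char *out, size_t cap, unsigned long internal_code)
{
    return snprintf(out, cap, "EC%lu", internal_code);
}
```

Display code 20042 maps to 20042 - 16543 = 3499, which matches the EC3499 rule ID used in the SARIF example; the smallest shifted code, 20000, maps to 3457.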

Pragma Stack (qword_1067820)

The pragma stack is a dynamically-growing array of 24-byte records stored at qword_1067820. The array is managed as a sorted-by-position sequence to enable binary search.

Each 24-byte stack entry has the following layout:

| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | position_cookie | Source position (sequence number) |
| 4 | 2 | column | Column number within the line |
| 8 | 1 | action_code | Pragma action (30--36) |
| 9 | 1 | flags | Bit 0: is push/pop with saved index |
| 16 | 8 | error_code or saved_index | Target error code, or -1/saved push index for scope markers |

The array header (pointed to by qword_1067820) contains:

| Offset | Size | Field |
|---|---|---|
| 0 | 8 | Pointer to entry array base |
| 8 | 8 | Array capacity |
| 16 | 8 | Entry count |
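The two layouts above can be written down as C structs with compile-time checks on the documented offsets. This is a sketch: the type and member names are invented for illustration, the padding members cover bytes the tables leave undescribed, and the header struct assumes an LP64 target such as x86-64.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct mirroring the 24-byte pragma stack entry. */
struct pragma_entry {
    uint32_t position_cookie;   /* +0  source position (sequence number) */
    uint16_t column;            /* +4  column within the line            */
    uint8_t  pad0[2];           /* +6  not described in the table        */
    uint8_t  action_code;       /* +8  pragma action, 30..36             */
    uint8_t  flags;             /* +9  bit 0: scope marker has saved idx */
    uint8_t  pad1[6];           /* +10 not described in the table        */
    union {                     /* +16 */
        uint32_t error_code;    /* for actions 30..35                    */
        int64_t  saved_index;   /* for action 36; -1 when absent         */
    } u;
};

/* Hypothetical struct mirroring the array header (LP64 assumed). */
struct pragma_stack {
    struct pragma_entry *base;  /* +0  entry array base */
    uint64_t capacity;          /* +8  */
    uint64_t count;             /* +16 */
};

_Static_assert(sizeof(struct pragma_entry) == 24, "entry must be 24 bytes");
_Static_assert(offsetof(struct pragma_entry, action_code) == 8, "action at +8");
_Static_assert(offsetof(struct pragma_entry, u) == 16, "payload at +16");
```

The union reflects how the walker reads offset 16 both as a 32-bit error code and as a 64-bit saved index, depending on the action code.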

Recording Pragma Entries (sub_4ED190)

When the preprocessor encounters a #pragma nv_diag_* directive, record_pragma_diagnostic (sub_4ED190) creates a new stack entry:

void record_pragma_diagnostic(uint error_code, uint8_t severity, uint *position) {
    // Hash: (column+1) * (position+1) * error_code * (severity+1)
    // column is the uint16_t at byte offset 4 of the position record
    uint64_t hash = (*((uint16_t*)position + 2) + 1) * (*position + 1)
                    * error_code * (severity + 1);
    uint64_t bucket = hash % 983;     // 0x3D7

    entry = allocate(32);
    entry->error_code_field = error_code;    // offset 8
    entry->severity = severity;              // offset 12
    entry->position = *position;             // offset 16
    entry->saved_index = 0xFFFFFFFF;         // offset 24 = -1

    // Insert at head of hash chain
    entry->next = hash_table[bucket];        // qword_1065960
    hash_table[bucket] = entry;
}

This function serves double duty: it records the pragma entry for the per-diagnostic suppression hash table (qword_1065960, 983 buckets) used by check_pragma_diagnostic (sub_4ED240), and it simultaneously records the entry on the position-sorted pragma stack.

Bit 2 of byte_1067922[4 * error_code] is also set (|= 4) to mark that this error code has at least one pragma override, enabling the fast-path check in check_for_overridden_severity.
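The bucket computation can be sketched in isolation. The formula and the 983-bucket modulus come from the pseudocode above; the function name and the choice to widen every factor to 64 bits before multiplying are assumptions, since the decompilation does not pin down the intermediate width.

```c
#include <stdint.h>

/* Hypothetical sketch of the hash used for the suppression table:
 * (column + 1) * (cookie + 1) * error_code * (severity + 1), mod 983.
 * All factors are widened to 64 bits here (an assumption). */
uint64_t pragma_hash_bucket(uint32_t cookie, uint16_t column,
                            uint32_t error_code, uint8_t severity)
{
    uint64_t hash = ((uint64_t)column + 1)
                  * ((uint64_t)cookie + 1)
                  * (uint64_t)error_code
                  * ((uint64_t)severity + 1);
    return hash % 983;
}
```

For cookie 9, column 4, error code 3499 and severity 3, the product is 5 * 10 * 3499 * 4 = 699800, and 699800 mod 983 = 887. Because error_code is a bare factor, an error code of 0 always lands in bucket 0.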

Per-Diagnostic Suppression Check (sub_4ED240)

check_pragma_diagnostic (sub_4ED240) is the fast-path check called from check_severity to determine whether a specific diagnostic at a specific source position should be suppressed. It operates on the hash table rather than the sorted stack:

bool check_pragma_diagnostic(uint error_code, uint8_t severity, uint *position) {
    uint64_t hash = (position->column + 1) * (position->cookie + 1)
                    * error_code * (severity + 1);
    entry = hash_table[hash % 983];

    // Walk hash chain matching all four fields
    while (entry) {
        if (entry->error_code == error_code &&
            entry->severity == severity &&
            entry->position == position->cookie &&
            entry->column == position->column)
            break;
        entry = entry->next;
    }
    if (!entry) return false;

    // Scope check: compare current scope ID
    scope = scope_table[current_scope_index];
    if (entry->saved_scope_id != scope->id || scope->kind == 9) {
        entry->saved_scope_id = scope->id;
        entry->emit_count = 0;
        return true;    // first time in this scope → suppress
    }

    // Already seen in this scope → check error limit
    entry->emit_count++;
    return entry->emit_count <= error_limit;
}

Severity Override Resolution (sub_4F30A0)

check_for_overridden_severity (sub_4F30A0) is the position-based pragma stack walker. It is called from create_diagnostic_entry (sub_4F40C0) for any diagnostic with severity <= 7, and determines the effective severity by walking the pragma stack backward from the diagnostic's source position.

Entry conditions:

void check_for_overridden_severity(int error_code, char *severity_out,
                                    int64_t position, ...) {
    char current_severity = byte_1067921[4 * error_code];

    // Fast path: if no pragma override exists for this error code, skip
    if ((byte_1067922[4 * error_code] & 4) == 0)
        goto done;

    // Ensure pragma stack exists and has entries
    if (!qword_1067820 || !qword_1067820->count)
        goto done;

Binary search phase:

When the diagnostic position is before the last pragma stack entry (i.e., the position comparison at offset 0/4 shows the diagnostic comes before the final entry), the function uses bsearch with comparator sub_4ECD20 to find the nearest pragma entry at or before the diagnostic position:

// Construct search key from diagnostic position
search_key.position = position->cookie;
search_key.column = position->column;

qword_10658F8 = 0;  // scratch: will hold the best-match pointer
result = bsearch(&search_key, stack_base, entry_count, 24, comparator);

The comparator sub_4ECD20 compares position cookies first, then columns. It has a side effect: whenever the comparison result is >= 0 (the search key is at or after the candidate), it stores the candidate pointer in qword_10658F8. This means after bsearch completes, qword_10658F8 holds the rightmost entry that is at or before the search key -- the "floor" entry.
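The floor-search trick can be reproduced with the standard library. The sketch below is illustrative: the struct, the global, and the function names are hypothetical stand-ins for the 24-byte entries, qword_10658F8, and sub_4ECD20. It also relies on bsearch probing elements in conventional binary-search order, which common C libraries do but the C standard does not promise.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the position part of a pragma stack entry. */
struct pos_entry {
    uint32_t cookie;
    uint16_t column;
};

/* Scratch pointer, playing the role of qword_10658F8. */
static const struct pos_entry *g_floor;

/* Compares cookie first, then column. Side effect: remembers the last
 * probed element that is at or before the key. */
static int cmp_floor(const void *key, const void *elem)
{
    const struct pos_entry *k = key;
    const struct pos_entry *e = elem;
    int r;
    if (k->cookie != e->cookie)
        r = (k->cookie > e->cookie) ? 1 : -1;
    else
        r = (k->column > e->column) - (k->column < e->column);
    if (r >= 0)
        g_floor = e;
    return r;
}

/* Returns the rightmost entry at or before (cookie, column), or NULL
 * when every entry lies after the key. base must be sorted ascending. */
const struct pos_entry *find_floor(const struct pos_entry *base, size_t n,
                                   uint32_t cookie, uint16_t column)
{
    struct pos_entry key = { cookie, column };
    g_floor = NULL;
    (void)bsearch(&key, base, n, sizeof *base, cmp_floor);
    return g_floor;
}
```

The return value of bsearch itself is ignored: an exact match is also captured by the side effect, so the scratch pointer is the only result the caller needs.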

Backward walk phase:

After finding the starting position (either via binary search or by starting from the last entry), the function walks backward through the stack:

while (1) {
    uint8_t action = *(uint8_t*)(entry + 8);

    if (action == 36) {             // push/pop marker
        if ((*(uint8_t*)(entry+9) & 1) == 0)
            goto skip;              // marker without a saved index (push side)
        int64_t saved_idx = *(int64_t*)(entry + 16);
        if (saved_idx == -1)
            goto skip;              // pop with no matching push recorded
        // Jump to the push point
        entry = &stack_base[24 * saved_idx];
        continue;
    }

    if (*(uint32_t*)(entry + 16) == error_code) {
        switch (action) {
            case 30: current_severity = 3; goto apply;     // suppress
            case 31: current_severity = 4; goto apply;     // remark
            case 32: current_severity = 5; goto apply;     // warning
            case 33: current_severity = 7; goto apply;     // error
            case 35:                                        // default
                current_severity = byte_1067920[4 * error_code];
                goto done;
            default:
                assertion("get_severity_from_pragma", error.c:3741);
        }
    }

skip:
    if (entry == stack_base)
        goto done;                  // reached bottom of stack
    entry -= 24;                    // previous entry
}

done:
    if (current_severity)
        *severity_out = current_severity;
    return;                         // do not fall through into apply

apply:
    *severity_out = current_severity;

The key insight is the push/pop handling: action code 36 entries with flags & 1 set contain a saved index at offset 16 that points to the corresponding push entry. The walker jumps to the push entry, effectively skipping all pragma entries within the pushed scope, restoring the severity state from before the push.
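The walk can be condensed into a small self-contained function for experimentation. This is a sketch under several assumptions: the struct is a simplified stand-in for the 24-byte entry, the function returns the resolved severity instead of writing through a pointer, -1 replaces the assertion for unknown actions, and the saved index is additionally required to be smaller than the current index so the loop always makes progress (a guard not present in the pseudocode).

```c
#include <stddef.h>
#include <stdint.h>

enum { ACT_SUPPRESS = 30, ACT_REMARK = 31, ACT_WARNING = 32,
       ACT_ERROR = 33, ACT_DEFAULT = 35, ACT_SCOPE = 36 };

/* Simplified, hypothetical view of a pragma stack entry. */
struct walk_entry {
    uint8_t action;    /* 30..36 */
    uint8_t flags;     /* bit 0: scope marker carries a saved push index */
    int64_t payload;   /* error code, saved push index, or -1 */
};

/* Walks backward from stack[start] and returns the effective severity
 * for error_code: 3/4/5/7 for suppress/remark/warning/error, default_sev
 * for nv_diag_default, current_sev if no entry applies. */
int resolve_severity(const struct walk_entry *stack, size_t start,
                     uint32_t error_code, int default_sev, int current_sev)
{
    size_t i = start;
    for (;;) {
        const struct walk_entry *e = &stack[i];
        if (e->action == ACT_SCOPE) {
            if ((e->flags & 1) && e->payload >= 0 && (size_t)e->payload < i) {
                i = (size_t)e->payload;   /* jump to the matching push */
                continue;
            }
        } else if ((uint32_t)e->payload == error_code) {
            switch (e->action) {
            case ACT_SUPPRESS: return 3;
            case ACT_REMARK:   return 4;
            case ACT_WARNING:  return 5;
            case ACT_ERROR:    return 7;
            case ACT_DEFAULT:  return default_sev;
            default:           return -1; /* the binary asserts here */
            }
        }
        if (i == 0)
            return current_sev;           /* reached bottom of stack */
        --i;
    }
}
```

With the sequence suppress(100), push, error(100), pop, a diagnostic located after the pop resolves to 3 (the override inside the pushed scope is skipped), while one located between the error pragma and the pop resolves to 7.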

An out-of-bounds entry pointer triggers the assertion at error.c:3803:

if (entry < stack_base || entry >= &stack_base[24 * count])
    assertion("check_for_overridden_severity", error.c:3803);

GCC Diagnostic Pragma Output

cudafe++ generates #pragma GCC diagnostic directives in its output (the transformed C++ sent to the host compiler) to suppress host-compiler warnings on code that cudafe++ knowingly generates or transforms. These are not the same as the nv_diag_* pragmas that control cudafe++'s own diagnostics.

The output pragmas are emitted via sub_467E50 (the line-output function) with hardcoded strings:

// Emitted around certain code regions
sub_467E50("#pragma GCC diagnostic push");
sub_467E50("#pragma GCC diagnostic ignored \"-Wunused-local-typedefs\"");
sub_467E50("#pragma GCC diagnostic ignored \"-Wattributes\"");
// ... generated code ...
sub_467E50("#pragma GCC diagnostic pop");

The full set of GCC warnings suppressed in output:

| Warning Flag | Context |
|---|---|
| -Wunevaluated-expression | decltype expressions in init-captures (when dword_126E1E8 = GCC host) |
| -Wattributes | CUDA attribute annotations on transformed code |
| -Wunused-parameter | Device function stubs with unused parameters |
| -Wunused-function | Forward-declared device functions not called in host path |
| -Wunused-local-typedefs | Type aliases generated for CUDA type handling |
| -Wunused-variable | Variables in constexpr-if discarded branches |
| -Wunused-private-field | Private members of device-only classes |

On MSVC host compilers, the equivalent mechanism uses __pragma(warning(push)) / __pragma(warning(pop)) instead.

Colorization

Initialization (sub_4F2C10)

Colorization is initialized by init_colorization (sub_4F2C10), called from the diagnostic pipeline setup. The function determines whether color output should be enabled and parses the color specification.

Decision sequence:

1. Assert dword_126ECA0 != 0       (colorization was requested via --colors)
2. Check getenv("NOCOLOR")         → if set, disable
3. Check sub_5AF770()              → if stderr is not a TTY, disable
4. If still enabled, parse color spec
5. Set dword_126ECA4 = dword_126ECA0  (activate colorization)

Step 3 calls sub_5AF770 (check_terminal_capabilities), which:

  • Verifies qword_126EDF0 (diagnostic output FILE*) exists
  • Calls fileno() + isatty() on it
  • Calls getenv("TERM") and rejects "dumb" terminals
  • Returns 1 if interactive, 0 otherwise

The --colors / --no_colors CLI flag pair controls dword_126ECA0 (colorization requested). When --no_colors is set or NOCOLOR is in the environment, colorization is unconditionally disabled regardless of terminal capabilities.
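The terminal check in step 3 can be approximated with POSIX calls. This sketch is not the decompiled sub_5AF770: it takes a file descriptor instead of a FILE*, the function names are hypothetical, and treating an unset TERM as acceptable is an assumption the source does not confirm.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: rejects only the literal "dumb" terminal type.
 * Accepting a NULL (unset) TERM is an assumption of this sketch. */
int term_is_usable(const char *term)
{
    return term == NULL || strcmp(term, "dumb") != 0;
}

/* Returns 1 when fd refers to a terminal and TERM is usable, else 0. */
int fd_is_interactive(int fd)
{
    if (fd < 0)
        return 0;
    if (!isatty(fd))
        return 0;
    return term_is_usable(getenv("TERM"));
}
```

A caller would pass fileno() of the diagnostic stream, so redirecting stderr to a file or pipe disables color even when --colors was given.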

Color Specification Parsing (sub_4EC850)

The color specification string is sourced from environment variables with a fallback chain:

char *spec = getenv("EDG_COLORS");
if (!spec) {
    spec = getenv("GCC_COLORS");
    if (!spec)
        spec = "error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32";
}

Note: although the string "DEFAULT_EDG_COLORS" appears in the binary (as a compile-time macro name), the actual default is hardcoded. The EDG_COLORS variable takes priority over GCC_COLORS, allowing EDG-specific customization while maintaining GCC compatibility.

The specification format is category=codes:category=codes:... where:

  • category is one of: error, warning, note, locus, quote, range1
  • codes is a semicolon-separated sequence of ANSI SGR parameters (digits and ; only)
  • : separates category assignments

sub_4EC850 (parse_color_category) is called once for each of the 6 configurable categories:

sub_4EC850(2, "error");      // category code 2
sub_4EC850(3, "warning");    // category code 3
sub_4EC850(4, "note");       // category code 4
sub_4EC850(5, "locus");      // category code 5
sub_4EC850(6, "quote");      // category code 6
sub_4EC850(7, "range1");     // category code 7

For each category, the parser:

  1. Uses strstr() to find the category name in the spec string
  2. Checks that the character after the name is =
  3. Extracts the value up to the next : (or end of string)
  4. Validates that the value contains only digits (0x30--0x39) and semicolons (0x3B)
  5. Stores the pointer and length in qword_126ECC0[2*code] and qword_126ECC8[2*code]
  6. If validation fails (non-digit, non-semicolon character), nullifies the entry

Color Category Codes

Seven category codes are used internally, with code 1 reserved for reset:

| Code | Category | Default ANSI | Escape | Applied To |
|---|---|---|---|---|
| 1 | reset | \033[0m | ESC [ 0 m | End of any colored region |
| 2 | error | \033[01;31m | ESC [ 01;31 m | Error/catastrophic/internal severity labels |
| 3 | warning | \033[01;35m | ESC [ 01;35 m | Warning/command-line-warning labels |
| 4 | note/remark | \033[01;36m | ESC [ 01;36 m | Note and remark severity labels |
| 5 | locus | \033[01m | ESC [ 01 m | Source file:line location prefix |
| 6 | quote | \033[01m | ESC [ 01 m | Quoted identifiers in messages |
| 7 | range1 | \033[32m | ESC [ 32 m | Source-range underline markers |

Escape Sequence Emission

Two functions handle color escape output, depending on context:

sub_4ECDD0 (emit_colorization_escape): Used within construct_text_message for inline color markers. Writes a 2-byte internal marker (ESC byte 0x1B followed by the category code) into the output buffer. These markers are later expanded into full ANSI sequences during the final output pass.

void emit_colorization_escape(buffer *buf, uint8_t category_code) {
    buf_append_byte(buf, 0x1B);     // ESC
    buf_append_byte(buf, category_code);
}

sub_4F3E50 (add_colorization_characters): Used during word-wrapped output to emit full ANSI escape sequences. For category 1 (reset), it writes ESC [ 0 m. For categories 2--7, it writes ESC [ followed by the parsed ANSI codes from qword_126ECC0, followed by m.

void add_colorization_characters(uint8_t category) {
    if (category > 7)
        assertion("add_colorization_characters", error.c:862);

    if (category == 1) {
        // Reset: ESC [ 0 m
        buf_append(sarif_buf, ESC);
        buf_append(sarif_buf, '[');
        buf_append(sarif_buf, '0');
        buf_append(sarif_buf, 'm');
    } else if (color_pointer[category]) {
        // ESC [ <codes> m
        buf_append(sarif_buf, ESC);
        buf_append(sarif_buf, '[');
        buf_append_n(sarif_buf, color_pointer[category], color_length[category]);
        buf_append(sarif_buf, 'm');
    }
}

The assertion at error.c:862 fires if a category code > 7 is passed, which would indicate a programming error in the diagnostic formatter.
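The two-stage scheme (compact markers first, full sequences later) can be sketched as a single expansion pass. This is an assumption-laden illustration: the function name, the flat input/output buffers, and the choice to silently drop markers with an unknown or unconfigured category are all hypothetical; the binary asserts on categories above 7.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: expands 2-byte markers (ESC, category) into full
 * ANSI SGR sequences. codes[c] holds the SGR parameter string for
 * category c (2..7) or NULL when unset; category 1 always means reset.
 * Returns the output length, or (size_t)-1 on overflow. */
size_t expand_color_markers(const char *in, size_t in_len,
                            char *out, size_t cap,
                            const char *const codes[8])
{
    size_t n = 0;
    for (size_t i = 0; i < in_len; ++i) {
        unsigned char ch = (unsigned char)in[i];
        if (ch == 0x1B && i + 1 < in_len) {
            unsigned char cat = (unsigned char)in[++i];
            const char *sgr = NULL;
            if (cat == 1)
                sgr = "0";
            else if (cat >= 2 && cat <= 7)
                sgr = codes[cat];
            if (sgr != NULL) {
                size_t len = strlen(sgr);
                if (n + len + 3 >= cap)   /* ESC '[' codes 'm' + NUL */
                    return (size_t)-1;
                out[n++] = 0x1B;
                out[n++] = '[';
                memcpy(out + n, sgr, len);
                n += len;
                out[n++] = 'm';
            }
            continue;
        }
        if (n + 1 >= cap)
            return (size_t)-1;
        out[n++] = (char)ch;
    }
    if (n >= cap)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

Keeping markers at a fixed two bytes until the last moment is what lets the wrapping code account for them cheaply: their visible width is always zero and their byte length is constant.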

Word Wrapping with Colors

construct_text_message (sub_4EF9D0) has two code paths for word wrapping:

  1. Non-colorized: Simple space-scanning algorithm that breaks at the terminal width (dword_106B470)
  2. Colorized: Tracks visible character width separately from escape sequence bytes. When the formatted string contains byte 0x1B (ESC), the wrapping logic counts only non-escape characters toward the column width, ensuring that ANSI codes do not prematurely trigger line breaks.

The terminal width dword_106B470 defaults to a reasonable value (typically 80 or derived from the terminal) and controls the column at which output lines are wrapped.

Colorization State Variables

| Variable | Address | Purpose |
|---|---|---|
| dword_126ECA0 | 0x126ECA0 | Colorization requested (--colors flag) |
| dword_126ECA4 | 0x126ECA4 | Colorization active (after init_colorization) |
| qword_126ECC0 | 0x126ECC0 | Color spec pointer array (2 qwords per category) |
| qword_126ECC8 | 0x126ECC8 | Color spec length array (paired with pointers) |
| dword_106B470 | 0x106B470 | Terminal width for word wrapping |

Diagnostic Counter System (sub_4F3020)

The function update_diagnostic_counter (sub_4F3020) is called from check_severity to increment per-severity counters. These counters drive the summary output in write_signoff and the error-limit check:

void update_diagnostic_counter(uint8_t severity, uint64_t *counter_block) {
    switch (severity) {
        case 2:  break;              // notes: not counted
        case 4:  counter_block[0]++; break;  // remarks
        case 5:
        case 6:  counter_block[1]++; break;  // warnings
        case 7:
        case 8:  counter_block[2]++; break;  // errors
        case 9:
        case 10:
        case 11: counter_block[3]++; break;  // fatal
        default:
            assertion("update_diagnostic_counter: bad severity", error.c:3223);
    }
}

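The primary counter block is at qword_126ED80 (4 qwords: remark_count, warning_count, error_count, fatal_count). The global totals qword_126ED90 (total errors) and qword_126ED98 (total warnings) are updated from a different counter block qword_126EDC8 after pragma-suppressed diagnostics are processed. Note that 0x126ED90 and 0x126ED98 are also the addresses of slots 2 and 3 of a four-qword block based at 0x126ED80 (base + 16 and base + 24), so the two descriptions overlap at those addresses.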

Global Variables

| Variable | Address | Type | Purpose |
|---|---|---|---|
| dword_106BBB8 | 0x106BBB8 | int | Output format: 0=text, 1=SARIF |
| qword_106B478 | 0x106B478 | buffer* | SARIF JSON output buffer (0x400 initial) |
| qword_106B488 | 0x106B488 | buffer* | Message text buffer (0x400 initial) |
| qword_106B480 | 0x106B480 | buffer* | Location prefix buffer (0x80 initial) |
| qword_1067820 | 0x1067820 | array* | Pragma diagnostic stack (24-byte entries) |
| qword_1065960 | 0x1065960 | ptr[983] | Per-diagnostic suppression hash table |
| qword_10658F8 | 0x10658F8 | ptr | bsearch scratch: best-match pragma entry |
| byte_1067920 | 0x1067920 | byte[4*3795] | Default severity per error code |
| byte_1067921 | 0x1067921 | byte[4*3795] | Current severity per error code |
| byte_1067922 | 0x1067922 | byte[4*3795] | Per-error tracking flags (bit 2 = has pragma) |
| dword_126ECA0 | 0x126ECA0 | int | Colorization requested |
| dword_126ECA4 | 0x126ECA4 | int | Colorization active |
| qword_126ECC0 | 0x126ECC0 | ptr[] | Color spec pointers (per category) |
| qword_126ECC8 | 0x126ECC8 | size_t[] | Color spec lengths (per category) |
| qword_126EDF0 | 0x126EDF0 | FILE* | Diagnostic output stream |

Function Map

| Address | Name | EDG Source | Size | Role |
|---|---|---|---|---|
| 0x4EC850 | parse_color_category | error.c | 47 lines | Parse one category=codes from color spec |
| 0x4ECB10 | write_sarif_physical_location | error.c | 64 lines | Emit SARIF physicalLocation JSON |
| 0x4ECD20 | bsearch_comparator | error.c | 15 lines | Position comparator for pragma stack search |
| 0x4ECD50 | check_suppression_flags | error.c | 30 lines | Bit-flag suppression test |
| 0x4ECDD0 | emit_colorization_escape | error.c | 30 lines | Write ESC+category to buffer |
| 0x4ED100 | create_file_index_entry | error.c | 22 lines | Allocate 160-byte file-index node |
| 0x4ED170 | display_to_internal_code | error.c | 12 lines | Convert display code >= 20000 to internal |
| 0x4ED190 | record_pragma_diagnostic | error.c | 24 lines | Record pragma entry in hash table |
| 0x4ED240 | check_pragma_diagnostic | error.c | 39 lines | Hash-based per-diagnostic suppression check |
| 0x4EF8A0 | write_sarif_message_json | error.c | 79 lines | JSON-escape and wrap message text |
| 0x4F1330 | check_severity | error.c:3859 | 601 lines | Central dispatch, SARIF/text routing |
| 0x4F2C10 | init_colorization | error.c:825 | 43 lines | Parse color env vars, set up categories |
| 0x4F3020 | update_diagnostic_counter | error.c:3223 | 38 lines | Increment per-severity counters |
| 0x4F30A0 | check_for_overridden_severity | error.c:3803 | ~130 lines | Pragma stack walk with bsearch |
| 0x4F3E50 | add_colorization_characters | error.c:862 | ~80 lines | Emit full ANSI escape sequence |
| 0x5AEDB0 | write_init | host_envir.c:2017 | 28 lines | SARIF header / text-mode no-op |
| 0x5AEE00 | write_signoff | host_envir.c:2203 | 131 lines | SARIF footer / text-mode summary |
| 0x5AF770 | check_terminal_capabilities | host_envir.c | ~30 lines | TTY + TERM detection |

Entity Node Layout

The entity node is the central data structure in cudafe++ (EDG 6.6) for representing every named declaration: functions, variables, fields, parameters, namespaces, and types. Each node is a variable-size record -- routines occupy 288 bytes, variables 232 bytes, fields 176 bytes -- linked into scope chains and cross-referenced by type nodes, expression nodes, and template instantiation records.

This page focuses on the CUDA-specific fields that NVIDIA grafted onto the EDG entity node. These fields encode execution space (__host__/__device__/__global__), variable memory space (__shared__/__constant__/__managed__), launch configuration (__launch_bounds__/__cluster_dims__/__block_size__/__maxnreg__), and assorted kernel metadata. The attribute application functions in attribute.c write these fields; the backend code generator, cross-space validator, IL walker, and stub emitter read them.

Key Facts

| Property | Value |
|---|---|
| Routine entity size | 288 bytes (IL entry kind 11) |
| Variable entity size | 232 bytes (IL entry kind 7) |
| Field entity size | 176 bytes (IL entry kind 8) |
| Execution space offset | +182 (1 byte, bitfield) |
| Memory space offset | +148 (1 byte, bitfield) |
| Launch config pointer | +256 (8-byte pointer to 56-byte struct) |
| Source file | attribute.c (writers), nv_transforms.c / cp_gen_be.c (readers) |
| Attribute dispatch | sub_413240 (apply_one_attribute, 585 lines) |
| Post-validation | sub_6BC890 (nv_validate_cuda_attributes) |

Visual Layout (Routine Entity, 288 Bytes)

Offset   0         8        16        24        32        40        48        56
       +=========+=========+=========+=========+=========+=========+=========+=========+
  0x00 | next_entity_ptr   | name_string_ptr   |            (EDG internal)             |
       +---------+---------+---------+---------+---------+---------+---------+---------+
  0x20 |                              (EDG internal continued)                          |
       +---------+---------+---------+---------+---------+---------+---------+---------+
  0x40 |                              (EDG internal continued)                          |
       +====+====+=========+=========+=========+=========+=========+=========+=========+
  0x50 |kind|stor|         | assoc_entity_ptr  |                                       |
       |+80 |+81 |         |                   |                                       |
       +----+----+---------+---------+---------+---------+---------+---------+---------+
  0x60 |                   | variable_type_ptr |                                       |
       +=========+=========+=========+=========+====+=========+=========+==========+===+
  0x80 | storage_class/align|         |type_kind|    | return_type_ptr   |MEM |EXT |    |
       |         |         |         |+132     |    | +144              |+148|+149|    |
       +---------+---------+---------+----+----+----+---------+---------+----+----+----+
  0x98 | proto_ptr / param_list +152  |link|stor|    |grid|    |op  |         |         |
       |                              |+160|+161|    |+164|    |+166|         |         |
       +---------+---------+---------+----+----+----+----+----+----+---------+---------+
  0xB0 |mbr |dev |    |kern|func|    |EXEC|CEXT| template_linkage_flags +184            |
       |+176|+177|    |+179|+180|    |+182|+183|                                       |
       +----+----+----+----+----+----+----+----+=========+=========+=========+=========+
  0xC0 | alias_chain/linkage+186      |         |ctor/dtor|lambda  |                    |
       |                              |         |  +190   | +191   |                    |
       +---------+---------+---------+---------+---------+---------+---------+---------+
  0xD0 | variable_alias_chain_next +208         |                                       |
       +---------+---------+---------+---------+---------+---------+---------+---------+
  0xF0 | func_extra / alias_entry +240          |                                       |
       +---------+---------+---------+---------+---------+---------+---------+---------+
 0x100 | LAUNCH_CONFIG_PTR +256      |         (padding to 288)                         |
       +=========+=========+=========+=========+=========+=========+=========+=========+

CUDA-specific fields (UPPERCASE):
  MEM   = +148  variable memory space bitfield (__device__/__shared__/__constant__)
  EXT   = +149  extended memory space (__managed__)
  EXEC  = +182  execution space bitfield (__host__/__device__/__global__)
  CEXT  = +183  CUDA extended flags (__nv_register_params__, __cluster_dims__ intent)
  LAUNCH_CONFIG_PTR = +256  pointer to 56-byte launch_config_t struct

Full Offset Map (CUDA-Relevant Fields)

The table below documents every entity node offset touched by CUDA attribute handlers and validation functions. Offsets are byte positions from the start of the entity node. Fields marked "EDG base" are standard EDG fields that CUDA code tests but does not define.

| Offset | Size | Field | Set By | Read By |
| --- | --- | --- | --- | --- |
| +0 | 8 | Next entity pointer (linked list) | EDG | Scope iteration |
| +8 | 8 | Name string pointer | EDG | Error messages, stub emission |
| +80 | 1 | Entity kind byte (7=variable, 8=field, 11=routine) | EDG | All attribute handlers |
| +81 | 1 | Storage flags (bit 2=local, bit 3=has_name, bit 6=anonymous) | EDG | __global__ / __device__ validation |
| +88 | 8 | Associated entity pointer | EDG | nv_is_device_only_routine |
| +112 | 8 | Variable type pointer | EDG | get_func_type_for_attr |
| +128 | 1 | Storage class code / alignment | EDG | apply_internal_linkage_attr |
| +132 | 1 | Type kind byte (12=qualifier) | EDG | Return type traversal |
| +144 | 8 | Return type / next-in-chain pointer | EDG | __global__ void-return check |
| +148 | 1 | Variable memory space bitfield | CUDA attr handlers | Backend, IL walker |
| +149 | 1 | Extended memory space | apply_nv_managed_attr | Backend, runtime init |
| +152 | 8 | Function prototype / parameter list head | EDG | __global__ param checks |
| +160 | 1 | Linkage/visibility bits (variable: low 3 = visibility) | Various | Visibility propagation |
| +161 | 1 | Storage/linkage flags (bit 7=thread_local) | EDG | __managed__ / __device__ validation |
| +164 | 1 | Storage class / grid_constant flags (bit 2=grid_constant) | __grid_constant__ handler | __managed__/__device__ conflict check |
| +166 | 1 | Operator function kind (5=operator function) | EDG | __global__ validation |
| +176 | 1 | Member function flags (bit 7=static member) | EDG | __global__ static-member check |
| +177 | 1 | Device propagation flag (bit 4=0x10) | Virtual override propagation | Override space checking |
| +179 | 1 | Constexpr/kernel flags | Declaration processing | Stub generation, attribute interaction |
| +180 | 1 | Function attributes (bit 6=nodiscard, bit 7=noinline) | Various attribute handlers | Backend |
| +182 | 1 | Execution space bitfield | CUDA execution space handlers | Everywhere |
| +183 | 1 | CUDA extended flags | __cluster_dims__ / __nv_register_params__ | Post-validation, stub emission |
| +184 | 8 | Template/linkage flags (48-bit field) | EDG + CUDA handlers | Lambda check, visibility |
| +186 | 1 | Alias chain flag (bit 3=internal linkage) | apply_internal_linkage_attr | Linker |
| +190 | 1 | Constructor/destructor priority flags | apply_constructor_attr / apply_destructor_attr | Backend |
| +191 | 1 | Lambda flags (bit 0=is_lambda) | EDG lambda processing | __global__ validation |
| +208 | 8 | Variable alias chain next pointer | apply_alias_attr | Alias loop detection |
| +240 | 8 | Function extra info / alias entry | apply_alias_attr | Alias chain traversal |
| +256 | 8 | Launch configuration pointer | CUDA launch config handlers | Post-validation, backend |

Execution Space Bitfield (Byte +182)

This is the most frequently read field in CUDA-specific code paths. Every function entity carries a single byte that encodes which execution spaces the function belongs to.

Byte at entity+182:

  bit 0  (0x01)   device_capable     Function can execute on device
  bit 1  (0x02)   device_explicit    __device__ was explicitly written
  bit 2  (0x04)   host_capable       Function can execute on host
  bit 3  (0x08)   (reserved)
  bit 4  (0x10)   host_explicit      __host__ was explicitly written
  bit 5  (0x20)   device_annotation  Secondary device flag (HD detection)
  bit 6  (0x40)   global_kernel      Function is a __global__ kernel
  bit 7  (0x80)   global_confirmed   Always set by __global__ handler tail guard

Combined Patterns

The attribute handlers do not set individual bits. They OR entire patterns into the byte. Each CUDA keyword produces a fixed bitmask:

| Keyword | OR mask(s) | Result byte | Handler | Evidence |
| --- | --- | --- | --- | --- |
| __global__ | 0x61 then 0x80 | 0xE1 | sub_40E1F0 (apply_nv_global_attr) | `entity+182 |
| __device__ | 0x23 | 0x23 | sub_40EB80 (apply_nv_device_attr) | `entity+182 |
| __host__ | 0x15 | 0x15 | sub_4108E0 (apply_nv_host_attr) | `entity+182 |
| __host__ __device__ | 0x23 then 0x15 | 0x37 | Both handlers in sequence | OR of device + host masks |
| (no annotation) | none | 0x00 | -- | Implicit __host__ |

The 0x80 bit is set on every successful path through apply_nv_global_attr. After the main body ORs 0x61 into byte +182 (setting bit 6 = global_kernel), a tail guard tests bit 6 and, finding it set, ORs in 0x80:

// sub_40E1F0, lines 84-88
v10 = *(_BYTE *)(a2 + 182);
if ( (v10 & 0x40) == 0 )       // if bit 6 (global_kernel) not set, bail
    return a2;                  // (only reachable via early error paths)
*(_BYTE *)(a2 + 182) = v10 | 0x80;   // always set for __global__

Since 0x61 was already OR'd in, bit 6 is always set on the normal path, so 0x80 is always applied. The actual result byte for any successful __global__ application is 0x61 | 0x80 = 0xE1. The guard condition only triggers on error paths where 0x61 was never applied (e.g., the template-lambda error at line 21 which returns before reaching line 56).
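The mask arithmetic above can be checked with a small model. This is a minimal sketch, not the decompiled handlers: the helper names apply_global, apply_device, and apply_host are hypothetical, and only the OR masks and the bit-6 tail guard are taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* OR masks as listed in the Combined Patterns table. */
enum {
    NV_MASK_GLOBAL_BODY = 0x61, /* main body of the __global__ handler */
    NV_MASK_GLOBAL_TAIL = 0x80, /* tail guard, applied only when bit 6 is set */
    NV_MASK_DEVICE      = 0x23,
    NV_MASK_HOST        = 0x15
};

/* Hypothetical helper: models the normal (non-error) __global__ path. */
static uint8_t apply_global(uint8_t exec_space)
{
    exec_space = (uint8_t)(exec_space | NV_MASK_GLOBAL_BODY);
    if (exec_space & 0x40)                  /* bit 6 = global_kernel */
        exec_space = (uint8_t)(exec_space | NV_MASK_GLOBAL_TAIL);
    return exec_space;
}

/* Hypothetical helpers: each keyword ORs one fixed pattern into the byte. */
static uint8_t apply_device(uint8_t exec_space)
{
    return (uint8_t)(exec_space | NV_MASK_DEVICE);
}

static uint8_t apply_host(uint8_t exec_space)
{
    return (uint8_t)(exec_space | NV_MASK_HOST);
}
```

Because every handler only ORs bits in, the order in which __host__ and __device__ are applied does not change the final byte.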

Extraction Patterns

Code throughout cudafe++ extracts execution space category using bitmask tests:

| Mask | Test | Meaning | Used in |
| --- | --- | --- | --- |
| & 0x30 | == 0x00 | No explicit annotation (implicit host) | Space classification |
| & 0x30 | == 0x10 | __host__ only | Space classification |
| & 0x30 | == 0x20 | __device__ only | nv_is_device_only_routine |
| & 0x30 | == 0x30 | __host__ __device__ | Space classification |
| & 0x60 | == 0x20 | Device, not kernel | Device-only predicate |
| & 0x60 | == 0x60 | __global__ kernel (implies device) | Kernel identification |
| & 0x40 | != 0 | Is a __global__ kernel | Stub generation gate |
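The mask tests in the table can be folded into a single classifier. The sketch below is illustrative only: the enum and the function name classify_exec_space are hypothetical, and testing the kernel bit first is an assumption made so that 0xE1 is reported as a kernel rather than as device-only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification result; not an EDG or NVIDIA type. */
typedef enum {
    SPACE_IMPLICIT_HOST,
    SPACE_HOST,
    SPACE_DEVICE,
    SPACE_HOST_DEVICE,
    SPACE_GLOBAL
} exec_space_kind;

/* Classify the byte stored at entity+182 using the documented masks. */
static exec_space_kind classify_exec_space(uint8_t b)
{
    if (b & 0x40)                    /* & 0x40 != 0: __global__ kernel */
        return SPACE_GLOBAL;
    switch (b & 0x30) {              /* explicit annotation bits 4 and 5 */
    case 0x10: return SPACE_HOST;
    case 0x20: return SPACE_DEVICE;
    case 0x30: return SPACE_HOST_DEVICE;
    default:   return SPACE_IMPLICIT_HOST;
    }
}
```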

Variable Memory Space Bitfield (Byte +148)

For variable entities (kind 7), byte +148 encodes the CUDA memory space:

Byte at entity+148:

  bit 0  (0x01)   __device__     Variable resides in device global memory
  bit 1  (0x02)   __shared__     Variable resides in shared memory
  bit 2  (0x04)   __constant__   Variable resides in constant memory

These bits are mutually exclusive in valid programs. The attribute handlers enforce this by checking for conflicting combinations:

// From apply_nv_device_attr (sub_40EB80), variable path:
a2->byte_148 |= 0x01;                      // set __device__
int shared_or_constant = a2->byte_148 & 0x06;   // check __shared__ | __constant__
if (popcount(shared_or_constant) + (a2->byte_148 & 0x01) == 2)
    error(3481, ...);                       // conflicting memory spaces

The __device__ attribute on a function (kind 11) does NOT touch byte +148. It writes to byte +182 (execution space) instead. The memory space byte is strictly for variables.

Extended Memory Space (Byte +149)

Byte at entity+149:

  bit 0  (0x01)   __managed__    Unified memory, accessible from both host and device

Set by apply_nv_managed_attr (sub_40E0D0). The handler also sets bit 0 of +148 (__device__) because managed memory resides in device global memory. Additional validation:

  • Error 3481: conflicting if __shared__ or __constant__ is already set
  • Error 3482: cannot be thread-local (byte +161 bit 7)
  • Error 3485: cannot be a local variable (byte +81 bit 2)
  • Error 3577: incompatible with __grid_constant__ parameter (byte +164 bit 2)
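The four validation rules can be modelled on a reduced view of the variable entity. This is a sketch under stated assumptions: the var_bits struct and apply_managed are hypothetical, the order of the checks is assumed to follow the list above, and whether the real handler writes the bits before or after validating is not established here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view holding only the bytes the __managed__ checks touch. */
typedef struct {
    uint8_t storage_flags;  /* entity+81:  bit 2 (0x04) = local */
    uint8_t mem_space;      /* entity+148: 0x01 device, 0x02 shared, 0x04 constant */
    uint8_t ext_mem_space;  /* entity+149: 0x01 managed */
    uint8_t linkage_flags;  /* entity+161: bit 7 (0x80) = thread_local */
    uint8_t grid_flags;     /* entity+164: bit 2 (0x04) = grid_constant */
} var_bits;

/* Returns 0 on success, otherwise the first matching error number.
   Check order is an assumption. */
static int apply_managed(var_bits *v)
{
    if (v->mem_space & 0x06)     return 3481;  /* __shared__ or __constant__ set */
    if (v->linkage_flags & 0x80) return 3482;  /* thread-local */
    if (v->storage_flags & 0x04) return 3485;  /* local variable */
    if (v->grid_flags & 0x04)    return 3577;  /* __grid_constant__ parameter */
    v->ext_mem_space |= 0x01;                  /* __managed__ */
    v->mem_space     |= 0x01;                  /* managed implies __device__ */
    return 0;
}
```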

Constexpr and Kernel Flags (Byte +176, +179)

Byte +176: Member Function Flags

Byte at entity+176:

  bit 7  (0x80)   static_member   Function is a static class member

Tested by apply_nv_global_attr to detect static __global__ functions. The check is (signed char)(a2->byte_176) < 0, which is true when bit 7 is set. Combined with the local-function test (byte +81 bit 2 clear), this triggers warning 3507.
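The signed-char comparison is a common decompiler rendering of a bit-7 test. The sketch below (function names are hypothetical) shows that the two forms agree for every byte value; it assumes the usual two's-complement narrowing conversion, which is what x86-64 compilers implement.

```c
#include <assert.h>
#include <stdint.h>

/* Form seen in the decompilation: reinterpret the byte as signed. */
static int is_static_member_signed(uint8_t byte_176)
{
    return (int8_t)byte_176 < 0;
}

/* Equivalent explicit mask test on bit 7. */
static int is_static_member_masked(uint8_t byte_176)
{
    return (byte_176 & 0x80) != 0;
}

/* Exhaustively compare both forms over all 256 byte values. */
static int idioms_agree(void)
{
    for (unsigned v = 0; v < 256; ++v)
        if (is_static_member_signed((uint8_t)v) != is_static_member_masked((uint8_t)v))
            return 0;
    return 1;
}
```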

Byte +179: Constexpr / Kernel Property Flags

Byte at entity+179:

  bit 1  (0x02)   kernel_body        Function has a kernel body (used for stub generation)
  bit 2  (0x04)   (instantiation)    Instantiation-required status
  bit 4  (0x10)   constexpr          Function is constexpr
  bit 5  (0x20)   noinline           Function is noinline

The kernel_body flag at bit 1 (0x02) is the primary gate for device stub generation. The backend code generator (gen_routine_decl in cp_gen_be.c) checks:

// From gen_routine_decl (sweep p1.04, line ~1430)
if ((*(_BYTE *)(v3 + 182) & 0x40) != 0       // is __global__ kernel
    && (*(_BYTE *)(v3 + 179) & 2) != 0)       // has kernel body
{
    // Emit __wrapper__device_stub_<name>(<params>) forwarding body
}

The constexpr flag at bit 4 (0x10) is tested during __global__ attribute validation. When set, the void-return-type check AND the lambda check are both skipped:

// From apply_nv_global_attr (sub_40E1F0), lines 39-50
if ( (*(_BYTE *)(a2 + 179) & 0x10) == 0 )   // NOT constexpr
{
    // Non-constexpr __global__: check return type and lambda
    if ( (*(_BYTE *)(a2 + 191) & 1) != 0 )
        error(3506, ...);                    // lambda __global__ not allowed
    else if ( !is_void_return_type(a2) )
        error(3505, ...);                    // must return void
}
// If constexpr (bit 4 set): skip both checks entirely

This is a separate check from the static-member test (byte +176 bit 7 with byte +81 bit 2), which appears earlier at line 28:

if ( *(char *)(a2 + 176) < 0         // static member (bit 7 set)
    && (*(_BYTE *)(a2 + 81) & 4) == 0 )  // not local
    warning(3507, "__global__");          // static __global__ warning

Operator Function Kind (Byte +166)

Byte at entity+166:

  Value 5:  operator function (operator(), operator+, etc.)

Tested during __global__ attribute application. If the entity is an operator function (value == 5), error 3644 is emitted: operator() cannot be declared __global__.

// From apply_nv_global_attr (sub_40E1F0), line 30-31
if ( *(_BYTE *)(a2 + 166) == 5 )
    sub_4F8200(7, 3644, a1 + 56);     // error: __global__ on operator function

This prevents declaring lambda call operators as kernels via the __global__ attribute directly (extended lambdas use a different mechanism with wrapper types).

Parameter List (Pointer +152)

For routine entities, offset +152 holds a pointer to the function prototype structure. The prototype's first field (+0) points to the parameter list head -- a linked list of parameter entities.

Two checks consume this structure, one in the __global__ attribute handler and one in post-validation:

  1. Variadic check: prototype +16 bit 0 indicates variadic parameters. If set, error 3503 is emitted (variadic __global__ functions are not allowed).

  2. __grid_constant__ check: the post-validation function nv_validate_cuda_attributes (sub_6BC890) walks the parameter list looking for parameters with byte +32 bit 1 set (the __grid_constant__ flag on a parameter entity). If found on a non-__global__ function, error 3702 is emitted.

// From nv_validate_cuda_attributes (sub_6BC890), lines 26-39
// Walk parameter list from prototype
v10 = **(__int64 ****)(v2 + 152);    // parameter list head
while (v10) {
    if (((_BYTE)v10[4] & 2) != 0)    // parameter byte+32 bit 1 = __grid_constant__
        error(3702, ...);            // grid_constant on non-kernel parameter
    v10 = (__int64 **)*v10;          // next parameter
}
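The pointer casts in the decompilation hide a plain singly linked list. A minimal sketch of the same walk with named fields follows; param_node and count_grid_constant are hypothetical, and only the next pointer at +0 and the flag byte at +32 come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter node: only +0 and +32 are modelled. */
typedef struct param_node {
    struct param_node *next;                 /* +0: next parameter */
    uint8_t reserved[32 - sizeof(void *)];   /* bytes up to +31, not modelled */
    uint8_t flags;                           /* +32: bit 1 (0x02) = __grid_constant__ */
} param_node;

static_assert(offsetof(param_node, flags) == 32, "flag byte must sit at +32");

/* Count parameters carrying the __grid_constant__ flag. */
static int count_grid_constant(const param_node *head)
{
    int n = 0;
    for (const param_node *p = head; p != NULL; p = p->next)
        if (p->flags & 0x02)
            ++n;
    return n;
}
```

A validator in this style would report error 3702 when the count is non-zero and the owning function is not a __global__ kernel.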

CUDA Extended Flags (Byte +183)

Byte at entity+183:

  bit 3  (0x08)   __nv_register_params__   Function uses register parameter passing
  bit 6  (0x40)   __cluster_dims__ intent  cluster_dims attribute with no arguments

nv_register_params (Bit 0x08)

Set by apply_nv_register_params_attr (sub_40B0A0). When present, the post-validation function nv_validate_cuda_attributes checks whether the function is __global__ or __host__, and emits error 3661 if so. Device-only functions (__device__ without __host__) are exempt:

// From nv_validate_cuda_attributes (sub_6BC890), lines 42-69
if ( (*(_BYTE *)(a1 + 183) & 8) == 0 )     // no __nv_register_params__
    goto check_launch_config;

if ( (v3 & 0x40) != 0 ) {                  // __global__ kernel
    v4 = "__global__";
    error(3661, &qword_126EDE8, v4);       // incompatible
} else if ( (v3 & 0x30) != 0x20 ) {        // NOT device-only (has host component)
    v4 = "__host__";
    error(3661, &qword_126EDE8, v4);       // incompatible
}
// else: device-only function -- __nv_register_params__ is allowed

The key check is (v3 & 0x30) != 0x20: when the masked annotation bits (bits 4 and 5) equal 0x20, the function is device-only and the error is skipped. This means __nv_register_params__ is valid only on __device__ functions -- it is rejected on __global__, __host__, __host__ __device__, and unannotated (implicitly __host__) functions.
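The decision logic reduces to two mask tests on the execution space byte. The following sketch uses a hypothetical enum and function name; the branch order mirrors the decompiled excerpt.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for the __nv_register_params__ check. */
typedef enum {
    RP_OK,               /* device-only: attribute accepted */
    RP_CONFLICT_GLOBAL,  /* error 3661 reported with "__global__" */
    RP_CONFLICT_HOST     /* error 3661 reported with "__host__" */
} rp_result;

/* exec_space is the byte at entity+182. */
static rp_result check_register_params(uint8_t exec_space)
{
    if (exec_space & 0x40)
        return RP_CONFLICT_GLOBAL;
    if ((exec_space & 0x30) != 0x20)
        return RP_CONFLICT_HOST;
    return RP_OK;
}
```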

cluster_dims Intent (Bit 0x40)

Set by apply_nv_cluster_dims_attr (sub_4115F0) when the attribute is applied with zero arguments. This marks the function as "wants cluster dimensions" without specifying concrete values -- the values may come from a separate __block_size__ attribute or from a template parameter.

Template / Linkage Flags (Field +184)

Offset +184 is a 48-bit (6-byte) field encoding template instantiation and linkage information; the decompiled code reads it as a QWORD and masks the bits it needs. The __global__ attribute handler tests a specific bit pattern to detect template lambdas that do not have a definition yet:

// From apply_nv_global_attr (sub_40E1F0), line 21
if ( (*(_QWORD *)(a2 + 184) & 0x800001000000LL) == 0x800000000000LL )
{
    // This is a template lambda with external linkage but no definition yet.
    // Applying __global__ to it is an error.
    v14 = sub_6BC6B0(a2, 0);    // get entity name
    sub_4F7510(3469, a1 + 56, "__global__", v14);
    return;
}

The mask 0x800001000000 tests two bits:

  • Bit 47 (0x800000000000): template instantiation pending
  • Bit 24 (0x000001000000): has definition body

When bit 47 is set but bit 24 is clear, the entity is a template lambda awaiting instantiation that has no body yet -- applying __global__ (or __device__) to such an entity produces error 3469.
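The two-bit test can be written with named masks. In this sketch the macro and function names are hypothetical; the bit positions and their meanings are the ones given above.

```c
#include <assert.h>
#include <stdint.h>

#define TPL_INSTANTIATION_PENDING (UINT64_C(1) << 47)  /* 0x800000000000 */
#define TPL_HAS_DEFINITION_BODY   (UINT64_C(1) << 24)  /* 0x000001000000 */

/* True when bit 47 is set and bit 24 is clear, the case that
   leads to error 3469 in the handler. */
static int is_pending_without_body(uint64_t flags_184)
{
    const uint64_t mask = TPL_INSTANTIATION_PENDING | TPL_HAS_DEFINITION_BODY;
    return (flags_184 & mask) == TPL_INSTANTIATION_PENDING;
}
```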

Launch Configuration Struct (Pointer +256)

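Offset +256 holds a pointer to a lazily-allocated 56-byte launch configuration structure. This pointer is NULL for functions without any launch configuration attributes. The allocation function sub_5E52F0 creates and zero-initializes the struct on first use; the __maxnreg__ check in post-validation tests +32 against a negative sentinel (-1), which implies that at least some int32 fields are then set to a negative "unset" value rather than left at 0.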

Launch Config Layout

struct launch_config_t {           // 56 bytes, allocated by sub_5E52F0
    int64_t  maxThreadsPerBlock;   // +0   from __launch_bounds__(arg1)
    int64_t  minBlocksPerMP;       // +8   from __launch_bounds__(arg2)
    int32_t  maxBlocksPerCluster;  // +16  from __launch_bounds__(arg3)
    int32_t  cluster_dim_x;        // +20  from __cluster_dims__(x) or __block_size__(x,y,z,cx)
    int32_t  cluster_dim_y;        // +24  from __cluster_dims__(y) or __block_size__(x,y,z,cx,cy)
    int32_t  cluster_dim_z;        // +28  from __cluster_dims__(z) or __block_size__(x,y,z,cx,cy,cz)
    int32_t  maxnreg;              // +32  from __maxnreg__(N)
    int32_t  local_maxnreg;        // +36  from __local_maxnreg__(N)
    int32_t  block_size_x;         // +40  from __block_size__(x)
    int32_t  block_size_y;         // +44  from __block_size__(y)
    int32_t  block_size_z;         // +48  from __block_size__(z)
    uint8_t  flags;                // +52  bit 0=cluster_dims_set, bit 1=block_size_set
};                                 //      3 bytes padding to 56
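The offsets in the comments can be verified mechanically. The sketch below restates the struct with the same field names and lets the compiler check each offset and the total size; it assumes a conventional ABI such as x86-64 System V and is not taken from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct launch_config_t {
    int64_t maxThreadsPerBlock;
    int64_t minBlocksPerMP;
    int32_t maxBlocksPerCluster;
    int32_t cluster_dim_x;
    int32_t cluster_dim_y;
    int32_t cluster_dim_z;
    int32_t maxnreg;
    int32_t local_maxnreg;
    int32_t block_size_x;
    int32_t block_size_y;
    int32_t block_size_z;
    uint8_t flags;
} launch_config_t;

/* Compile-time checks against the documented layout. */
static_assert(offsetof(launch_config_t, minBlocksPerMP) == 8, "+8");
static_assert(offsetof(launch_config_t, maxBlocksPerCluster) == 16, "+16");
static_assert(offsetof(launch_config_t, cluster_dim_x) == 20, "+20");
static_assert(offsetof(launch_config_t, maxnreg) == 32, "+32");
static_assert(offsetof(launch_config_t, block_size_x) == 40, "+40");
static_assert(offsetof(launch_config_t, flags) == 52, "+52");
static_assert(sizeof(launch_config_t) == 56, "3 bytes of tail padding");
```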

Attribute-to-Field Mapping

| Attribute | Arguments | Fields Written | Handler |
| --- | --- | --- | --- |
| __launch_bounds__(M) | 1 int | +0 = M | sub_411C80 |
| __launch_bounds__(M,N) | 2 ints | +0 = M, +8 = N | sub_411C80 |
| __launch_bounds__(M,N,C) | 3 ints | +0 = M, +8 = N, +16 = C | sub_411C80 |
| __cluster_dims__(x) | 1 int | +20 = x, +24 = 1, +28 = 1, +52 bit 0 | sub_4115F0 |
| __cluster_dims__(x,y) | 2 ints | +20 = x, +24 = y, +28 = 1, +52 bit 0 | sub_4115F0 |
| __cluster_dims__(x,y,z) | 3 ints | +20 = x, +24 = y, +28 = z, +52 bit 0 | sub_4115F0 |
| __cluster_dims__() | 0 args | entity+183 bit 6 (intent flag only) | sub_4115F0 |
| __maxnreg__(N) | 1 int | +32 = N | sub_410F70 |
| __local_maxnreg__(N) | 1 int | +36 = N | sub_411090 |
| __block_size__(x,y,z) | 3 ints | +40 = x, +44 = y, +48 = z, +52 bit 1 | sub_4109E0 |
| __block_size__(x,y,z,cx,cy,cz) | 6 ints | block + cluster dims, +52 bits 0+1 | sub_4109E0 |

Post-Validation Constraints

The function nv_validate_cuda_attributes (sub_6BC890) performs cross-attribute validation after all attributes have been applied. The key checks on the launch config struct:

1. __launch_bounds__ only on __global__:

// sub_6BC890, lines 45-51
v5 = *(_QWORD *)(a1 + 256);         // launch config pointer
if ( !v5 )  goto done;
if ( (v3 & 0x40) != 0 )             // if __global__, skip to next check
    goto check_cluster;
// Not __global__ but has launch_bounds values
if ( *(_QWORD *)v5 || *(_QWORD *)(v5 + 8) )
    error(3534, "__launch_bounds__");   // launch_bounds on non-kernel

2. __cluster_dims__/__block_size__ only on __global__:

// sub_6BC890, lines 81-87
if ( (*(_BYTE *)(a1 + 183) & 0x40) != 0    // cluster_dims intent
    || *(int *)(v5 + 20) >= 0 )             // cluster_dim_x set
{
    v11 = "__cluster_dims__";
    if ( *(int *)(v5 + 40) > 0 )
        v11 = "__block_size__";
    error(3534, v11);                       // not allowed on non-kernel
}

3. maxBlocksPerCluster vs cluster product:

// sub_6BC890, lines 101-114
v6 = *(int *)(v5 + 20);                    // cluster_dim_x
if ( (int)v6 > 0 ) {
    v7 = *(int *)(v5 + 16);                // maxBlocksPerCluster
    if ( (int)v7 > 0
        && v7 < *(int*)(v5 + 28) * *(int*)(v5 + 24) * v6 )
    {
        // maxBlocksPerCluster < cluster_dim_x * cluster_dim_y * cluster_dim_z
        error(3707, "__cluster_dims__");    // inconsistent values
    }
}

4. __maxnreg__ only on __global__:

// sub_6BC890, lines 116-121
if ( *(int *)(v5 + 32) < 0 )               // maxnreg not set (sentinel -1)
    goto check_launch_maxnreg_conflict;
if ( (v9 & 0x40) == 0 )                    // not __global__
    error(3715, "__maxnreg__");             // maxnreg on non-kernel

5. __launch_bounds__ + __maxnreg__ conflict:

// sub_6BC890, lines 144-145
if ( *(_QWORD *)v5 )                       // maxThreadsPerBlock set
    error(3719, "__launch_bounds__ and __maxnreg__");
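Check 3 is the only constraint above that involves arithmetic, so it is worth isolating. The sketch below uses a hypothetical cluster_cfg struct and function name; it widens the product to 64 bits to avoid overflow in the model, which is a choice made here and not something observed in the binary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the launch config used by check 3. */
typedef struct {
    int32_t maxBlocksPerCluster;  /* +16 */
    int32_t cluster_dim_x;        /* +20 */
    int32_t cluster_dim_y;        /* +24 */
    int32_t cluster_dim_z;        /* +28 */
} cluster_cfg;

/* Returns 1 when error 3707 would apply: both values are set (> 0)
   and maxBlocksPerCluster is smaller than x * y * z. */
static int cluster_exceeds_max(const cluster_cfg *c)
{
    if (c->cluster_dim_x <= 0 || c->maxBlocksPerCluster <= 0)
        return 0;
    int64_t product = (int64_t)c->cluster_dim_x * c->cluster_dim_y * c->cluster_dim_z;
    return c->maxBlocksPerCluster < product;
}
```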

Entity Kind Reference

The entity kind byte at +80 determines which offsets are valid. CUDA attribute handlers gate on this value:

| Kind | Value | CUDA offsets used | Handler examples |
| --- | --- | --- | --- |
| Variable | 7 | +148, +149, +161, +164 | __device__, __shared__, __constant__, __managed__ |
| Field | 8 | +136 | packed, aligned (non-CUDA) |
| Routine | 11 | +144, +152, +166, +176, +179, +182, +183, +184, +191, +256 | All execution space attrs, launch config |

Cross-References

  • Execution Spaces -- deep dive on byte +182 semantics and the six virtual override mismatch errors
  • Attributes Overview -- attribute kind enum (86-108) and apply_one_attribute dispatch
  • IL Overview -- IL entry kinds 7 (variable), 8 (field), 11 (routine) node sizes
  • Scope Entry -- 784-byte scope structure that contains entity chains

Scope Entry

The scope entry is the 784-byte record type that makes up the scope stack, the central data structure in cudafe++ for tracking nested lexical scopes during C++ parsing and semantic analysis. The scope stack is a flat array at qword_126C5E8, indexed by dword_126C5E4 (current depth). Every time the parser enters a new scope -- file, block, function body, class definition, template declaration, namespace -- a new 784-byte entry is pushed onto this stack. When the scope closes, the entry is popped and all associated cleanup runs: symbol table housekeeping, using-directive deactivation, name collision discriminator assignment, template parameter restoration, and memory region disposal.

This page documents the scope stack entry layout, the scope kind enum, the key flag bytes, the CUDA-specific additions (device/host scope context), the template instantiation depth counters, and the major push/pop functions.

Key Facts

| Property | Value |
| --- | --- |
| Entry size | 784 bytes (constant, verified by "Stack entry size: %d\n" in debug statistics) |
| Stack base pointer | qword_126C5E8 (global, array of 784-byte entries) |
| Current depth index | dword_126C5E4 (global, 0-based index of topmost entry) |
| Function scope index | dword_126C5D8 (-1 if not inside a function scope) |
| Class scope index | dword_126C5C8 (-1 if not inside a class scope) |
| File scope index | dword_126C5DC |
| EDG source file | scope_stk.c (address range 0x6FE160-0x7106B0, ~160 functions) |
| Push function | sub_700560 (push_scope_full, 1476 lines, 13 parameters) |
| Pop function | sub_7076A0 (pop_scope, 1142 lines) |
| Index arithmetic | 784 * index for byte offset; reverse via multiply by 0x7D6343EB1A1F58D1 (the multiplicative inverse of 49 modulo 2^64) and shift right (division by 784 = 16 * 49) |
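The index arithmetic relies on exact division by a constant: because 49 * 0x7D6343EB1A1F58D1 is congruent to 1 modulo 2^64, multiplying an exact multiple of 49 by that constant recovers the quotient. The sketch below demonstrates this with hypothetical helper names; shifting before multiplying is one ordering that works and is assumed here.

```c
#include <assert.h>
#include <stdint.h>

#define SCOPE_ENTRY_SIZE UINT64_C(784)                 /* 16 * 49 */
#define INV49            UINT64_C(0x7D6343EB1A1F58D1)  /* 49^-1 mod 2^64 */

/* Byte offset of entry `index` from the stack base. */
static uint64_t index_to_offset(uint64_t index)
{
    return index * SCOPE_ENTRY_SIZE;
}

/* Recover the index from a byte offset that is an exact multiple of 784:
   shift out the factor 16, then cancel the factor 49 with its inverse. */
static uint64_t offset_to_index(uint64_t byte_offset)
{
    return (byte_offset >> 4) * INV49;
}
```

The multiplication wraps modulo 2^64 by design, which is why unsigned 64-bit arithmetic is used throughout.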

Scope Stack Global Variables

| Global | Type | Meaning |
| --- | --- | --- |
| qword_126C5E8 | void* | Base pointer to the scope stack array |
| dword_126C5E4 | int32 | Current scope stack top index (0-based) |
| dword_126C5D8 | int32 | Current function scope index (-1 if none) |
| dword_126C5DC | int32 | File scope index / secondary depth marker |
| dword_126C5AC | int32 | Saved depth for template instantiation |
| qword_126C5D0 | void* | Current routine descriptor pointer |
| dword_126C5B8 | int32 | is_member_of_template flag |
| dword_126C5C8 | int32 | Class scope index (-1 if none) |
| dword_126C5C4 | int32 | Nested class / lambda scope index (-1 if none) |
| dword_126C5E0 | int32 | Scope hash / identifier |
| dword_126C5B4 | int32 | Namespace scope index |
| dword_126C5BC | int32 | Class scope depth counter |
| qword_126C598 | void* | Pack expansion context pointer |

Full Offset Map

The table documents every field observed in the 784-byte scope stack entry. These are scope stack entry fields, not IL scope node fields (the IL scope is a separate 288-byte structure pointed to from offset +192).

| Offset | Size | Field | Evidence |
| --- | --- | --- | --- |
| +0 | 4 | scope_number | Unique identifier for this scope instance; checked in pop_scope assertions |
| +4 | 1 | scope_kind | Scope kind enum byte (see table below) |
| +5 | 1 | scope_flags_1 | General flags |
| +6 | 2 | scope_flags_2 | Bit 0 = void return flag; bit 1 = device scope context (NVIDIA addition); bit 2 = inline namespace; in some contexts bit 1 = is_extern, bit 5 = inline_namespace |
| +7 | 1 | access_flags | Bit 0 = in class context; bit 1 = has using-directives; bit 4 = lambda body |
| +8 | 1 | scope_flags_4 | Template/class/reactivation bits; bit 5 (0x20) = is_template_scope |
| +9 | 1 | scope_flags_5 | Bit 0 = needs cleanup / scope pop control -- when set, triggers sub_67B4E0() cleanup of template instantiation artifacts before popping |
| +10 | 1 | scope_flags_6 | Bit 0 = in_template_context |
| +11 | 1 | sign bit | in_template_dependent_context |
| +12 | 1 | scope_flags_7 | Bit 0 = in_template_arg_scan; bit 2 = suppress_diagnostics; bit 4 = has_concepts / void_return_warned |
| +13 | 1 | scope_flags_8 | Bit 4 = warned_no_return |
| +14 | 1 | flags3 | Bit 2 = in_device_code (NVIDIA-specific, marks whether code in this scope is device code) |
| +24 | 8 | symbol_chain_or_hash_ptr | Pointer to the name symbol chain or hash table for name lookup |
| +32 | 8 | hash_table_ptr | Hash table pointer (when scope uses hashing for lookup) |
| +32-+144 | 112 | Inline tail info | When +24 is 0, this region contains inline tail pointers for entity lists: +40 = variables tail, +48 = types tail, +56 = routines next, +88 = asm tail, +112 = namespaces tail, +144 = templates tail |
| +192 | 8 | il_scope_ptr | Pointer to the associated 288-byte IL scope node (the persistent representation that survives scope pop) |
| +200 | 8 | local_static_init_list | List of local static variable initializers |
| +208 | 8 | vla_dimensions_list / scope_depth | VLA dimension tracking (C mode); scope depth integer |
| +216 | 8 | class_type_ptr / tu_ptr | For class scopes: pointer to the class type symbol. For template instantiation scopes: pointer to the translation unit descriptor |
| +224 | 8 | routine_descriptor | Pointer to the current routine descriptor (set for function scopes) |
| +232 | 8 | namespace_entity | For namespace scopes: pointer to the namespace entity |
| +240 | 4 | region_number | Memory region number (-1 = invalid sentinel, set by alloc_scope) |
| +256 | 4 | parent_scope_index | Index of the enclosing scope in the stack (reported at both +240 and +256 in different sweeps -- likely +240 = region, +256 = parent) |
| +272 | 8 | name_hiding_list | Linked list of names hidden by declarations in this scope |
| +296 | 8 | local_vars_tail | Tail pointer for the local variables list |
| +368 | 8 | source_begin | Source position at scope entry |
| +376 | 8 | associated_entity / parent_template_info | Associated entity pointer / template information pointer |
| +384 | 8 | template_argument_list | Template argument list for instantiation scopes |
| +408 | 4 | try_block_index / enclosing_class_scope_index | Try block index (-1 = none); in class contexts, index to enclosing class scope |
| +416 | 8 | module_info | Module information pointer (C++20 modules support) |
| +424 | 4 | line_number | Line number at scope open (for diagnostics) |
| +496 | 8 | root_object_lifetime | Root of the object lifetime tree for this scope |
| +512 | 8 | freed_lifetime_list | List of freed object lifetimes awaiting reuse |
| +560 | 4 | enclosing_scope_index | Parent scope index for pop validation |
| +576 | 4 | template_instantiation_depth_counter | Nested instantiation depth counter -- incremented on recursive template instantiation push, decremented on pop; when > 0, pop just decrements without actually popping the scope stack |
| +580 | 4 | orig_depth | Original scope stack depth at time of template instantiation push; validated during pop |
| +584 | 4 | saved_scope_depth | Saved scope depth; restored via dword_126C5AC on template instantiation pop |
| +608 | 8 | class_def_info_ptr | Class definition information pointer |
| +624 | 8 | template_info_ptr | Template information record pointer |
| +632 | 8 | template_parameter_list / class_info_ptr | Template parameter list pointer |
| +704 | 8 | lambda_counter | Lambda expression counter within this scope (int64) |
| +720 | 4 | fixup_counter | Deferred fixup counter |
| +728 | 8 | has_been_completed | Completion flag (int64 used as bool) |
| +736 | 8 | deferred_fixup_list_head | Head of deferred fixup linked list |
| +744 | 8 | deferred_fixup_list_tail | Tail of deferred fixup linked list |

Scope Stack Kind Enum

The scope stack kind byte at +4 uses a different, larger enum than the IL scope kind (sck_*) at IL scope node +28. The scope stack enum includes additional entries for reactivation states, template instantiation context, and module scopes. The mapping is derived from scope_kind_to_string (sub_7000E0, 77 lines) which contains display string literals for each enum value, and from display_scope (sub_5F2140) in il_to_str.c.

Scope Stack Kind Values

| Value | Name | Display String | Notes |
| --- | --- | --- | --- |
| 0 | ssk_source_file | "source file" | Top-level file scope. Maps to IL sck_file (0). |
| 1 | ssk_func_prototype | "function prototype" | Function prototype scope (parameter names). Maps to IL sck_func_prototype (1). |
| 2 | ssk_block | "block" | Block scope (compound statement). Maps to IL sck_block (2). |
| 3 | ssk_alloc_namespace | "alloc_namespace" | Namespace scope (first opening). Maps to IL sck_namespace (3). |
| 4 | ssk_namespace_extension | "namespace extension" | Namespace extension (reopened namespace N { ... }). |
| 5 | ssk_namespace_reactivation | "namespace reactivation" | Namespace scope reactivated for out-of-line definition. |
| 6 | ssk_class_struct_union | "class/struct/union" | Class/struct/union scope. Maps to IL sck_class_struct_union (6). |
| 7 | ssk_class_reactivation | "class reactivation" | Class scope reactivated for out-of-line member definition (e.g., void MyClass::foo() { ... }). |
| 8 | ssk_template_declaration | "template declaration" | Template declaration scope (template<...>). Maps to IL sck_template_declaration (8). |
| 9 | ssk_template_instantiation | "template instantiation" | Template instantiation scope (pushed by push_template_instantiation_scope). |
| 10 | ssk_instantiation_context | "instantiation context" | Instantiation context scope (tracks the chain of instantiation sites for diagnostics). |
| 11 | ssk_module_decl_import | "module decl import" | C++20 module declaration/import scope. |
| 12 | ssk_module_isolation | "module isolation" | C++20 module isolation scope (module purview boundary). |
| 13 | ssk_pragma | "pragma" | Pragma scope (for pragma-delimited regions). |
| 14 | ssk_function_access | "function access" | Function access scope. |
| 15 | ssk_condition | "condition" | Condition scope (if/while/for condition variable). Maps to IL sck_condition (15). |
| 16 | ssk_enum | "enum" | Scoped enum scope (C++11 enum class). Maps to IL sck_enum (16). |
| 17 | ssk_function | "function" | Function body scope (has routine pointer, parameters, ctor init list). Maps to IL sck_function (17). |

Relationship to IL Scope Kinds

The IL scope node (288 bytes, allocated by alloc_scope at sub_5E7D80) uses a smaller sck_* enum at its +28 field. The scope stack entry at +192 points to the IL scope that persists after the stack entry is popped. Not all scope stack kinds produce an IL scope -- reactivation kinds (5, 7) and context kinds (9, 10) reuse existing IL scopes.

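| IL sck_* | Value | Corresponding stack kind(s) |
| --- | --- | --- |
| sck_file | 0 | 0 |
| sck_func_prototype | 1 | 1 |
| sck_block | 2 | 2 |
| sck_namespace | 3 | 3, 4, 5 |
| sck_class_struct_union | 6 | 6, 7 |
| sck_template_declaration | 8 | 8 |
| sck_condition | 15 | 15 |
| sck_enum | 16 | 16 |
| sck_function | 17 | 17 |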
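The mapping table translates directly into a lookup function. This is an illustrative sketch with a hypothetical name; returning -1 for stack kinds that have no row in the table (9 through 14) is a convention chosen here, not behavior observed in the binary.

```c
#include <assert.h>

/* Map a scope stack kind (entry +4) to the IL sck_* value from the table.
   Returns -1 for stack kinds with no listed IL counterpart. */
static int stack_kind_to_il_kind(int stack_kind)
{
    switch (stack_kind) {
    case 0:  return 0;    /* sck_file */
    case 1:  return 1;    /* sck_func_prototype */
    case 2:  return 2;    /* sck_block */
    case 3:
    case 4:
    case 5:  return 3;    /* sck_namespace */
    case 6:
    case 7:  return 6;    /* sck_class_struct_union */
    case 8:  return 8;    /* sck_template_declaration */
    case 15: return 15;   /* sck_condition */
    case 16: return 16;   /* sck_enum */
    case 17: return 17;   /* sck_function */
    default: return -1;   /* no mapping listed */
    }
}
```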

CUDA-Specific Fields

NVIDIA added two device/host scope tracking bits to the scope entry, grafted onto the EDG base structure.

Byte +6, Bit 1: Device Scope Context

scope_entry+6, bit 1 (0x02):

  When set: code in this scope is compiled for the device execution space.
  When clear: code in this scope is compiled for the host.

This bit is tested by CUDA-specific code paths to determine whether the current compilation context targets device or host. It affects:

  • Whether __device__-only functions suppress certain diagnostics (e.g., missing return value warning at check_void_return_okay, sub_719D20)
  • Whether device-specific type validation applies
  • Severity overrides via byte_126ED55 (default diagnostic severity for device mode)

The bit is set when entering __device__ or __global__ function scopes and cleared when entering __host__ scopes. This allows mixed host/device compilation to track which context is active at any nesting depth.

Byte +14, Bit 2: In Device Code

scope_entry+14, bit 2 (0x04):

  Secondary device-code marker. Set when the parser is inside a device
  function body. Used in conjunction with dword_106C2C0 (CUDA device
  compilation mode flag).

Template Instantiation Depth Counters

Three fields at offsets +576, +580, and +584 form the template instantiation depth tracking system. These fields enable the scope stack to handle nested template instantiations without fully pushing/popping scope entries at every nesting level.

Mechanism

When push_template_instantiation_scope (sub_709DE0) sets up a template instantiation, it writes the current scope stack depth into +580 (orig_depth) and the saved global depth into +584 (saved_scope_depth). The +576 counter starts at 0.

If the same template scope is re-entered for a nested instantiation (e.g., recursive template), +576 is incremented rather than pushing a full new scope entry. On pop, pop_template_instantiation_scope (sub_708EE0) checks +576:

if (scope_entry[576] > 0) {
    scope_entry[576]--;     // just decrement, don't pop
    return;
}
// Full pop: restore scope stack to orig_depth
pop_scopes_to(scope_entry[580]);
restore(dword_126C5AC, scope_entry[584]);

This optimization avoids deep scope stack growth during deeply recursive template instantiations (e.g., std::tuple<T1, T2, ..., TN> with large N).
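The counter behavior can be modelled in isolation. In the sketch below the struct and function names are hypothetical, and the full-pop work (unwinding the stack to orig_depth and restoring dword_126C5AC) is reduced to a return value so that only the counting logic is shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of the three depth-tracking fields. */
typedef struct {
    int32_t nest_count;   /* +576: nested re-entry counter */
    int32_t orig_depth;   /* +580: stack depth at first push */
    int32_t saved_depth;  /* +584: saved global depth */
} inst_scope;

/* Re-entering the same instantiation scope only bumps the counter. */
static void reenter_instantiation(inst_scope *s)
{
    s->nest_count++;
}

/* Returns 0 when the pop was absorbed by the counter,
   1 when the caller must perform the full pop. */
static int pop_instantiation(inst_scope *s)
{
    if (s->nest_count > 0) {
        s->nest_count--;
        return 0;
    }
    return 1;
}
```

With two nested re-entries, the first two pops are absorbed and only the third requires unwinding the stack.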

Validation

pop_template_instantiation_scope_with_check (sub_708E90) validates that +576 matches the expected depth before calling the actual pop. The assertion is at scope_stk.c line 5593. A mismatch triggers an internal error.

Push Scope: push_scope_full (sub_700560)

The core scope push function (1476 lines, 13 parameters, located at 0x700560). Called directly or via thin wrappers for each scope kind.

Parameters

The 13-parameter signature handles all scope kinds through a single entry point:

  1. Scope kind
  2. Associated entity pointer (class type, namespace entity, routine descriptor, etc.)
  3. Region number
  4. Additional flags
  5-13. Kind-specific parameters (template info, reactivation data, etc.)

Key Operations

  1. Stack growth: Increments dword_126C5E4. If the stack exceeds its allocation, it is reallocated (the base pointer qword_126C5E8 may change).

  2. Entry initialization: Zeros the 784-byte entry, then sets:

    • +0 = scope number (from a global counter)
    • +4 = scope kind
    • +240 = region number
    • +192 = IL scope pointer (newly allocated via alloc_scope or reused from a reactivated entity)
    • +560 = parent scope index
  3. Kind-specific setup:

    • File (0): Sets dword_126C5DC, initializes file-level state.
    • Block (2): Links to enclosing function scope.
    • Namespace (3, 4, 5): Sets +232 to namespace entity. For extensions (4), reuses existing IL scope. For reactivation (5), calls add_active_using_directives_for_scope.
    • Class (6, 7): Sets +216 to class type pointer, dword_126C5C8 to current index. For reactivation (7), walks the class hierarchy to restore template context.
    • Template declaration (8): Sets template-related bits in +8.
    • Function (17): Sets dword_126C5D8, qword_126C5D0, +224.
  4. Parent scope linkage: Calls set_parent_scope_on_push to establish the scope tree.

  5. Memory region: Calls get_enclosing_memory_region to determine the memory arena for allocations within this scope.
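The stack growth and entry initialization steps above can be sketched as follows. This is an illustrative model under stated assumptions: the `scope_stack` struct, the doubling growth policy, the `capacity` field, and storing the scope kind as an `int` are all hypothetical. Only the 784-byte entry size, the +4 scope kind offset, and the fact that the base pointer may move on reallocation come from the text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { ENTRY_SIZE = 784 };  /* scope entry size from the text */

typedef struct {
    unsigned char *base;  /* stand-in for qword_126C5E8; may move on growth  */
    int top;              /* stand-in for dword_126C5E4; -1 means empty here */
    int capacity;         /* hypothetical bookkeeping: entries allocated     */
} scope_stack;

static void stack_init(scope_stack *st) {
    st->base = NULL;
    st->top = -1;
    st->capacity = 0;
}

/* Look entries up by index: a raw pointer can dangle after a reallocation. */
static unsigned char *entry_at(const scope_stack *st, int idx) {
    return st->base + (size_t)idx * ENTRY_SIZE;
}

/* Push one zeroed entry and record its kind; returns the index or -1. */
static int stack_push(scope_stack *st, int kind) {
    int idx = st->top + 1;
    if (idx >= st->capacity) {
        /* Assumed policy: double the allocation when full. */
        int new_cap = st->capacity ? st->capacity * 2 : 4;
        unsigned char *p = realloc(st->base, (size_t)new_cap * ENTRY_SIZE);
        if (!p) return -1;
        st->base = p;
        st->capacity = new_cap;
    }
    unsigned char *e = entry_at(st, idx);
    memset(e, 0, ENTRY_SIZE);            /* zero the whole entry first       */
    memcpy(e + 4, &kind, sizeof kind);   /* +4 = scope kind (int is assumed) */
    st->top = idx;
    return idx;
}
```

Addressing entries by index rather than by pointer is the design consequence of a relocatable base: any pointer held across a push may be invalidated.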

Push Wrappers

Wrapper | Address | Lines | Target Kind
--- | --- | --- | ---
push_scope (thin) | sub_704790 | 7 | Various
push_scope_with_using_dirs | sub_7047C0 | 29 | Namespace + using
push_template_scope | sub_704870 | 7 | Template declaration (8)
push_block_reactivation_scope | sub_7048A0 | 32 | Block reactivation
push_namespace_scope_full | sub_7024D0 | 40 | Namespace (3)
push_function_scope | sub_704BB0 | 13 | Function (17)
push_class_scope | sub_704C10 | 17 | Class (6)
push_scope_for_compound_statement | sub_70C8A0 | 64 | Block (2)
push_scope_for_condition | sub_70C950 | 86 | Condition (15)
push_scope_for_init_statement | sub_70CAE0 | 49 | Block (2), C++17 init

Pop Scope: pop_scope (sub_7076A0)

The core scope pop function (1142 lines, at 0x7076A0). Complement to push_scope_full. Performs all scope cleanup in a specific order.

Cleanup Sequence

  1. Debug trace: If dword_126EFC8 is set, prints "pop_scope: number = %ld, depth = %d".

  2. Scope wrapup: Calls wrapup_scope (sub_706710, 381 lines) which:

    • Iterates all symbols in the scope
    • Runs end_of_scope_symbol_check (sub_705440, 781 lines) for consistency validation
    • Emits needed definitions
    • Reports unreferenced entities
  3. Using-directive deactivation: Clears active using-directives for this scope via sub_6FEC10 (debug: clear using-directive).

  4. Template parameter restoration: If leaving a template scope, calls restore_default_template_params (sub_6FEEE0) to undo template parameter symbol bindings.

  5. Name collision discriminators: Assigns ABI discriminator values to entities with the same name in this scope via assign_discriminators_to_entities_list (sub_7036E0).

  6. C99 inline definitions: Checks check_c99_inline_definition (sub_703AD0) for C99-mode inline function rules.

  7. Module/pragma state: Adjusts STDC pragma state (byte_126E558/559/55A) and module context if applicable.

  8. Stack decrement: Decrements dword_126C5E4. Restores previous scope's global state (function scope index, class scope index, etc.).

  9. Memory region disposal: Frees the memory arena associated with this scope if the scope kind has one.

Pop Variants

Function | Address | Lines | Purpose
--- | --- | --- | ---
pop_scope (core) | sub_7076A0 | 1142 | Full scope pop with all cleanup
pop_scope_full | sub_70C440 | 100 | Wrapper calling core + name hiding cleanup
pop_scope (validation) | sub_70C620 | 62 | Pop with object lifetime validation: asserts "pop_scope: curr_object_lifetime is not that of", "pop_scope: unexpected curr_object_lifetime"

Template Instantiation Scope

The template instantiation scope push and pop are separate from the generic scope push/pop. They handle the complex process of binding template parameters to arguments, setting up the correct translation unit context, and managing pack expansions.

Push: push_template_instantiation_scope (sub_709DE0)

One of the largest functions in the scope_stk.c range at 1281 lines (push_scope_full is larger at 1476 lines). Takes 8 parameters: template pointer, association info, and various flags.

Key operations:

  1. Translation unit check: Calls sub_7418D0 to verify that the template being instantiated belongs to the same translation unit, or that cross-TU instantiation is explicitly allowed (flag & 0x1000). Failure triggers the assert "push_template_instantiation_scope: wrong translation unit".

  2. Template parameter binding: Iterates the template parameter list and the instantiation argument list in parallel, creating bindings. For each template parameter:

    • Type parameters: binds to the supplied type argument
    • Non-type parameters: binds to the supplied expression/value
    • Template template parameters: binds to the supplied template
  3. Pack expansion: For variadic templates, handles parameter pack expansion. Creates pack instantiation descriptors via create_pack_instantiation_descr (sub_70CF50, 772 lines).

  4. Scope entry setup: Writes +576 = 0, +580 = current depth, +584 = dword_126C5AC. Sets +216 to the translation unit pointer.

  5. State save/restore: Saves dword_126C5B8 (is_member_of_template), dword_126C5D8 (function scope), qword_126C5D0 (routine descriptor).

  6. Reactivation flags: Flag bits & 0x84000 control class template reactivation behavior. When set, the function enters class reactivation mode via sub_70BB60.

Pop: pop_template_instantiation_scope (sub_708EE0)

66 lines. Reverse of the push.

  1. Reads +576 (depth counter). If > 0, decrements and returns early (nested instantiation shortcut).
  2. If bit 0 of +9 is set (needs_cleanup), calls sub_67B4E0() to clean up template instantiation artifacts.
  3. Pops scope entries back to orig_depth (+580) via sub_7076A0.
  4. Restores dword_126C5AC from +584.
  5. Calls sub_6FED20 (debug trace: set using-directive).

Function | Address | Lines | Role
--- | --- | --- | ---
pop_template_instantiation_scope_wrapper | sub_708E70 | 7 | Thin wrapper passing through to sub_708EE0
pop_template_instantiation_scope_with_check | sub_708E90 | 14 | Validates +576 depth counter before calling sub_708EE0
pop_template_instantiation_scope_variant | sub_709110 | 71 | Alternative pop with extra +8 flag processing, returns int64
pop_instantiation_scope_for_rescan | sub_709000 | 54 | Pop for template argument rescan case
push_instantiation_scope_for_rescan | sub_70B900 | 123 | Push for template parameter rescanning
push_instantiation_scope_for_templ_param_rescan | sub_70B7C0 | 52 | Push for template parameter rescan
push_instantiation_scope_for_class | sub_70BB60 | 131 | Push for class template instantiation
push_class_and_template_reactivation_scope_full | sub_7098B0 | 261 | Combined class + template reactivation

Using-Directive Activation

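When entering a namespace scope that has active using namespace directives, those directives must be reactivated so that names from the nominated namespace are visible. The scope stack manages this through two functions: add_active_using_directives_for_scope, which walks the scope's directive list, and add_active_using_directive_to_scope, which it calls for each directive.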

add_active_using_directives_for_scope (sub_6FFCC0)

246 lines. Called during scope push when entering a namespace or block that may have inherited using-directives. Walks the using-directive list for the scope and calls add_active_using_directive_to_scope for each one.

Debug trace format: "adding using-dir at depth %d for namespace %s applies at %d".

Using-Directive Debug Traces

Function | Address | Lines | Trace
--- | --- | --- | ---
Debug: set using-directive | sub_6FED20 | 74 | "setting using-dir at depth %d for namespace %s applies at %d"
Debug: clear using-directive | sub_6FEC10 | 34 | "clearing using-dir at depth %d for namespace %s applies at %d"
Debug: using-dir set/clear | sub_704490 | 106 | "using_dir", "setting", "clearing"

Name Collision Discriminators

When multiple local entities share the same name (e.g., two struct S in different blocks within the same function), the Itanium ABI requires a discriminator suffix in the mangled name. The scope stack manages this through:

Function | Address | Lines | Role
--- | --- | --- | ---
get_name_collision_list + initialize_local_name_collision_table | sub_6FE760 | 64 | Manages the name collision table at qword_12C6FE0
compute_local_name_collision_discriminator + distinct_lambda_signatures | sub_702FB0 | 293 | Computes ABI discriminator values for local entities; includes lambda signature discrimination logic
cancel_name_collision_discriminator | sub_7034C0 | 118 | Cancels a previously assigned discriminator (7 assertion sites)
assign_discriminators_to_entities_list | sub_7036E0 | 46 | Assigns ABI discriminators to a list of entities at scope exit
set_parent_entity_for_closure_types | sub_703790 | 91 | Sets parent entity for lambda closure types (needed for correct mangling, 5 assertion sites)
set_parent_routine_for_closure_types_in_default_args | sub_703920 | 43 | Sets parent routine for lambdas in default argument contexts
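To make the discriminator idea concrete, the sketch below counts same-named entities and formats the Itanium ABI `<discriminator>` suffix: none for the first occurrence, `_0` for the second, and the `__N_` form once the number reaches 10. The function names and the quadratic name comparison are illustrative assumptions. How cudafe++ actually stores and looks up collisions (the table at qword_12C6FE0) is not modeled here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* out[i] = number of earlier entities with the same name (0 for the first). */
static void count_collisions(const char *const *names, int n, int *out) {
    for (int i = 0; i < n; i++) {
        out[i] = 0;
        for (int j = 0; j < i; j++) {
            if (strcmp(names[i], names[j]) == 0) out[i]++;
        }
    }
}

/* Format the Itanium ABI <discriminator> suffix into buf (len must be >= 1).
 * The first occurrence gets no suffix; the n-th occurrence (n >= 2) encodes
 * the number n-2 as "_<d>" when d < 10 and as "__<d>_" when d >= 10. */
static void format_discriminator(int earlier, char *buf, size_t len) {
    if (earlier == 0) {
        buf[0] = '\0';
        return;
    }
    int d = earlier - 1;
    if (d < 10) {
        snprintf(buf, len, "_%d", d);
    } else {
        snprintf(buf, len, "__%d_", d);
    }
}
```

For two local `struct S` declarations in different blocks of one function, the first mangles without a suffix and the second carries `_0`.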

Class and Template Reactivation

When defining an out-of-line member function (void MyClass::foo() { ... }), the parser must reactivate the class scope so that class member names are visible. For class templates, this also requires reactivating the template instantiation scope.

reactivate_class_context (sub_7029D0 / sub_70BE50)

Two implementations exist:

  • sub_7029D0 (196 lines, in p1.16): Reactivates a class scope for out-of-line definition. Asserts "reactivate_class_context: class type has NULL assoc_info".
  • sub_70BE50 (130 lines, in p1.17): Additional variant that handles nesting, template flags, and scope_entry +8 bit manipulation.

push_class_and_template_reactivation_scope_full (sub_7098B0)

261 lines. Handles the combined case of class template reactivation. Reads symbol flags at offsets +80, +81, +161, +162. Processes "specified template decl info" at +64 of assoc_info. Detects member templates via bit 0x10 at +81. When dword_106BC58 is set, enters class reactivation mode with sub_70BB60.

reactivate_local_context (sub_702670 / sub_70C0F0)

  • sub_702670 (120 lines): Reactivates a previously saved local scope context. Calls push_scope_full.
  • sub_70C0F0 (50 lines): Asserts "reactivate_local_context".

Pack Expansion Support

The scope stack provides infrastructure for variadic template parameter pack expansion during instantiation.

Function | Address | Lines | Role
--- | --- | --- | ---
create_pack_instantiation_descr | sub_70CF50 | 772 | Creates pack instantiation descriptors; handles sizeof..., fold expressions
create_pack_instantiation_descr_helper | sub_70DD60 | 212 | Helper for pack descriptor creation
cleanup_pack_instantiation_state | sub_70E130 | 37 | Cleans up pack expansion state
end_potential_pack_expansion_context | sub_70E1D0 | 392 | Processes end of pack expansion; checks C++17 via dword_126EF68 > 199900; uses qword_126C598 (pack expansion context)
find_template_arg_for_pack + get_enclosing_template_params_and_args | sub_6FE9B0 | 140 | Traverses scope stack to find template arguments for parameter packs

Scope Stack Query Functions

Function | Address | Lines | Role
--- | --- | --- | ---
get_innermost_template_dependent_context | sub_6FE160 | 72 | Traverses scope stack to find innermost template-dependent scope
get_outermost_template_dependent_context | sub_6FFA60 | 54 | Complement to innermost variant
get_curr_template_params_and_args (part 1) | sub_70E7F0 | 321 | Retrieves current template parameters and arguments from scope stack
get_curr_template_params_and_args (full) | sub_70F540 | 1002 | Full implementation with default argument handling and pack expansion
is_in_template_context | sub_70EE10 | 16 | No-arg predicate, returns bool
is_in_template_instantiation_scope | sub_70EDA0 | 27 | 6-arg predicate, returns bool
current_class_symbol_if_class_template | sub_704130 | 84 | Returns class symbol if inside a class template definition
is_in_deprecated_context | sub_70F440 | 43 | Checks scope_entry[83] bit 4 and walks scope stack
get_scope_depth | sub_70C600 | 17 | Returns current scope stack depth value
get_template_scope_info_for_entity | sub_7106B0 | 74 | Last function in scope_stk.c range

Debug and Statistics

Scope Statistics Dump (sub_702DC0)

95 lines. Prints scope stack statistics when debug tracing is enabled. Output format:

Scope stack statistics
Stack entry size: 784
Max. stack depth: <N>

Followed by per-scope-kind counts using all scope kind display names.

Scope Entry Dump (sub_700260 / sub_7002D0)

  • sub_700260 (17 lines): Prints " scope %d" with scope kind name via scope_kind_to_string. Detects bad depth with "***BAD SCOPE DEPTH***".
  • sub_7002D0 (111 lines): Detailed dump using format "%s%3ld %3d " with associated type/symbol information.

End-of-Scope Processing

wrapup_scope (sub_706710)

381 lines. Major scope cleanup function called from pop_scope. Processes all symbols in the scope, emits needed definitions, runs end_of_scope_symbol_check. Debug traces: "wrapup_scope", "Wrapping up ", " scope".

end_of_scope_symbol_check (sub_705440)

781 lines, 6 assertion sites. The largest validation function. Checks:

  • Symbol-to-IL-entry parent-class consistency ("end_of_scope_symbol_check: sym/il-entry parent-class mismatch")
  • Parameter-to-routine association ("end_of_scope_symbol_check: parameter with no assoc routine")
  • Hash table statistics ("hash_stats", "Hash statistics for: ")

set_needed_flags_at_end_of_file_scope (sub_707040)

188 lines. Determines which entities need to be emitted at the end of the translation unit. Validates scope kind ("set_needed_flags_at_end_of_file_scope: bad scope kind"). Debug brackets: "Start of set_needed_flags_at_end_of_file_scope\n" / "End of set_needed_flags_at_end_of_file_scope\n".

finish_function_body_processing (sub_6FE2A0)

142 lines. Post-processes function bodies after the scope closes. Determines whether the function needs to be emitted ("routine_needed_even_if_unreferenced", "Not calling mark_as_needed for", "storage class is %s\n").

Cross-References

Translation Unit Descriptor

The translation unit descriptor is the 424-byte structure at the heart of cudafe++'s multi-TU compilation support. Every source file processed by the frontend -- whether via RDC separate compilation or C++20 module import -- gets its own TU descriptor. The descriptor holds pointers to the parser state, scope stack snapshot, error context, and IL tree root for that translation unit. When the frontend switches from one TU to another, it saves the entire set of per-TU global variables into the outgoing descriptor's storage buffer and restores the incoming descriptor's saved values, making TU switching look like a cooperative context switch for compiler state.

The descriptor is allocated from the region-based arena (sub_6BA0D0), linked into a global TU chain, and managed through a TU stack that tracks the active-TU history for nested TU switches (e.g., when processing an entity requires switching to its owning TU temporarily).

Key Facts

Property | Value
--- | ---
Size | 424 bytes (confirmed by print_trans_unit_statistics: "translation units ... 424 bytes each")
Allocation | sub_6BA0D0(424) -- region-based arena, never individually freed
Source file | trans_unit.c (EDG 6.6, address range 0x7A3A50-0x7A48B0, ~12 functions)
Allocator | sub_7A40A0 (process_translation_unit)
Save function | sub_7A3A50 (save_translation_unit_state)
Restore function | sub_7A3D60 (switch_translation_unit)
Fix-up function | sub_7A3CF0 (fix_up_translation_unit)
Statistics | sub_7A45A0 (print_trans_unit_statistics)
TU count global | qword_12C7A78 (incremented on each allocation)
Active TU global | qword_106BA10 (current_translation_unit)
Primary TU global | qword_106B9F0 (primary_translation_unit)

Full Offset Map

The table below documents every field in the 424-byte TU descriptor. Offsets are byte positions from the start of the descriptor. Fields are identified from the initialization code in process_translation_unit (sub_7A40A0), the save/restore pair (sub_7A3A50/sub_7A3D60), and the fix-up function (sub_7A3CF0).

Offset | Size | Field | Set By | Read By
--- | --- | --- | --- | ---
+0 | 8 | next_tu -- linked list pointer to next TU in chain | process_translation_unit (via qword_12C7A90) | fe_wrapup TU iteration loop
+8 | 8 | prev_scope_state -- saved scope pointer (xmmword_126EB60+8) | save_translation_unit_state | switch_translation_unit
+16 | 8 | storage_buffer -- pointer to bulk registered-variable storage | process_translation_unit (allocates sub_6BA0D0(per_tu_storage_size)) | save/switch_translation_unit
+24 | 160 | file_scope_info -- file scope state block (20 qwords, initialized by sub_7046E0) | sub_7046E0 (zeroes 20 fields at offsets 0-152 within this block) | Scope stack operations, sub_704490
+184 | 8 | (cleared to 0) -- within file scope info tail | process_translation_unit | --
+192 | 8 | (cleared to 0) -- gap between scope info and registered-variable zone | process_translation_unit | --
+200 | 160 | registered-variable direct fields -- zeroed bulk region (offsets +200 through +359) | memset in process_translation_unit; individual fields written by registered-variable initialization loop | save/switch_translation_unit via storage buffer
+208 | 8 | scope_stack_saved_1 -- saved qword_126EB70 (scope stack depth marker) | save_translation_unit_state (a1[26]) | switch_translation_unit
+256 | 8 | scope_stack_saved_2 -- saved qword_126EBA0 | save_translation_unit_state (a1[32]) | switch_translation_unit
+320 | 8 | scope_stack_saved_3 -- saved qword_126EBE0 | save_translation_unit_state (a1[40]) | switch_translation_unit
+352 | 8 | (cleared to 0) -- end of registered-variable zone | process_translation_unit | --
+360 | 8 | (cleared to 0) -- additional state word 1 | process_translation_unit | --
+368 | 8 | (cleared to 0) -- additional state word 2 | process_translation_unit | --
+376 | 8 | module_info_ptr -- pointer to module info structure (parameter a3 of process_translation_unit) | process_translation_unit | Module import path, a3[2] back-link
+384 | 8 | il_state_ptr -- shortcut pointer for IL state (1344-byte aggregate at unk_126E600), set via registered-variable mechanism with offset_in_tu = 384 | Registered-variable init loop | IL operations
+392 | 2 | flags -- bit field: byte 0 = is_primary_tu (1 if a3 == NULL), byte 1 = 0x01 (initialization sentinel, combined initial value = 0x0100) | process_translation_unit | TU classification
+394 | 14 | (padding / reserved) | -- | --
+408 | 4 | error_severity_level -- copied from dword_126EC90 (current maximum error severity) | process_translation_unit | Error reporting, recovery decisions
+416 | 8 | (cleared to 0) -- additional state | process_translation_unit | --

Layout Diagram

Translation Unit Descriptor (424 bytes)
===========================================

 +0    [next_tu          ] -----> next TU in chain (NULL for last)
 +8    [prev_scope_state ] -----> saved scope ptr (from xmmword_126EB60+8)
+16    [storage_buffer   ] -----> heap block for registered variable values
+24    [                                                              ]
       [  file_scope_info (160 bytes, 20 qwords)                      ]
       [  initialized by sub_7046E0: all fields zeroed                ]
       [  scope state snapshot for this TU's file scope               ]
+184   [  (tail of scope info, cleared)                               ]
+192   [  (gap, cleared to 0)                                         ]
+200   [                                                              ]
       [  registered-variable direct fields (160 bytes, bulk zeroed)  ]
       [  includes scope stack snapshots at +208, +256, +320          ]
       [  individual fields set by registered-variable init loop      ]
+352   [  (cleared to 0)                                              ]
+360   [  (additional state, cleared)                                 ]
+368   [  (additional state, cleared)                                 ]
+376   [module_info_ptr   ] -----> module info (NULL for primary TU)
+384   [il_state_ptr      ] -----> shortcut to IL state in storage buffer
+392   [flags             ] 0x0100 initial; byte 0 = is_primary
+394   [  (reserved)      ]
+408   [error_severity    ] from dword_126EC90
+412   [  (pad)           ]
+416   [  (additional, 0) ]
+424   === end ===
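The layout can be cross-checked mechanically by writing it as a C struct and asserting the documented offsets at compile time. This is a sketch under assumptions: pointer-sized fields are modeled as `uint64_t` so the checks hold on any host, and field names such as `tail_184` or `state_360` are placeholders for the unnamed slots. The `tu_initial_flags` helper is hypothetical and only encodes the documented 0x0100 sentinel plus the is_primary byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t next_tu;                /* +0                                     */
    uint64_t prev_scope_state;       /* +8                                     */
    uint64_t storage_buffer;         /* +16                                    */
    uint64_t file_scope_info[20];    /* +24, 160 bytes                         */
    uint64_t tail_184;               /* +184, cleared                          */
    uint64_t gap_192;                /* +192, cleared                          */
    uint64_t registered_direct[20];  /* +200..+359; [1], [7], [15] hold the
                                        scope snapshots at +208, +256, +320    */
    uint64_t state_360;              /* +360, cleared                          */
    uint64_t state_368;              /* +368, cleared                          */
    uint64_t module_info_ptr;        /* +376                                   */
    uint64_t il_state_ptr;           /* +384                                   */
    uint16_t flags;                  /* +392                                   */
    uint8_t  reserved_394[14];       /* +394                                   */
    uint32_t error_severity;         /* +408                                   */
    uint32_t pad_412;                /* +412                                   */
    uint64_t state_416;              /* +416, cleared                          */
} tu_desc_layout;

_Static_assert(sizeof(tu_desc_layout) == 424, "descriptor must be 424 bytes");
_Static_assert(offsetof(tu_desc_layout, file_scope_info) == 24, "+24");
_Static_assert(offsetof(tu_desc_layout, registered_direct) == 200, "+200");
_Static_assert(offsetof(tu_desc_layout, module_info_ptr) == 376, "+376");
_Static_assert(offsetof(tu_desc_layout, il_state_ptr) == 384, "+384");
_Static_assert(offsetof(tu_desc_layout, flags) == 392, "+392");
_Static_assert(offsetof(tu_desc_layout, error_severity) == 408, "+408");

/* Hypothetical helper: sentinel 0x01 in the high byte, is_primary in the
 * low byte (set when no module info pointer was supplied). */
static uint16_t tu_initial_flags(int has_module_info) {
    return (uint16_t)(0x0100 | (has_module_info ? 0 : 1));
}
```

If any documented offset or size were inconsistent with its neighbors, one of the static assertions would fail at compile time.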

Initialization Sequence

The initialization in process_translation_unit proceeds in this order:

  1. [+0] = 0 (next_tu pointer, not yet linked)
  2. [+16] = sub_6BA0D0(qword_12C7A98) (allocate storage buffer, size = accumulated registered-variable total)
  3. [+8] = 0 (prev_scope_state)
  4. sub_7046E0(tu + 24) -- zero-initialize the 160-byte file scope info block
  5. [+192] = 0, [+352] = 0, [+184] = 0 -- explicit clears around the bulk region
  6. memset(aligned(tu + 200), 0, ...) -- bulk-zero the registered-variable direct fields from +200 to +360 (aligned to 8-byte boundary)
  7. [+360] = 0, [+368] = 0, [+376] = 0 -- clear additional state
  8. [+392] = 0x0100 (flags: initialized sentinel in high byte)
  9. [+408] = 0, [+416] = 0
  10. Registered-variable default-value loop: iterate qword_12C7AA8 (registered variable list) and for each entry with offset_in_tu != 0, write variable_address into tu_desc[offset_in_tu]
  11. [+376] = a3 (module_info_ptr)
  12. [+392] byte 0 = (a3 == NULL) (is_primary flag)

Lifecycle

Phase 1: Registration (Before Any TU Processing)

Before the first translation unit is processed, every EDG subsystem registers its per-TU global variables by calling f_register_trans_unit_variable (sub_7A3C00). This happens during frontend initialization, before dword_12C7A8C (registration_complete) is set to 1.

The three core variables are registered by register_builtin_trans_unit_variables (sub_7A4690):

// sub_7A4690 -- register_builtin_trans_unit_variables
f_register_trans_unit_variable(&dword_106BA08, 4, 0);   // is_recompilation
f_register_trans_unit_variable(&qword_106BA00, 8, 0);   // current_filename
f_register_trans_unit_variable(&dword_106B9F8, 4, 0);   // has_module_info

In total, approximately 217 calls to f_register_trans_unit_variable are made across all subsystems. Each call adds a 40-byte registration record to the linked list headed by qword_12C7AA8 and accumulates the variable size into qword_12C7A98 (the per-TU storage buffer size). The accumulated size determines how large the storage buffer allocation will be for each TU descriptor.

Phase 2: Allocation and Initialization

When process_translation_unit (sub_7A40A0) is called for each source file:

process_translation_unit(filename, is_recompilation, module_info_ptr)
  1. If a current TU exists (qword_106BA10 != 0), save its state via save_translation_unit_state
  2. Reset compilation state (sub_5EAEC0 -- error state reset)
  3. If recompilation mode: reset parse state (sub_585EE0)
  4. Set dword_12C7A8C = 1 (registration complete -- no more variable registrations allowed)
  5. Allocate the 424-byte descriptor and its storage buffer
  6. Initialize all fields (see sequence above)
  7. Copy registered-variable defaults into the descriptor
  8. Link into the TU chain

Phase 3: Linking

The descriptor is linked into two structures simultaneously:

TU Chain (singly-linked list via [+0]):

  • Head: qword_106B9F0 (primary_translation_unit) -- the first TU processed
  • Tail: qword_12C7A90 (tu_chain_tail) -- the most recently allocated TU
  • Linking: *tu_chain_tail = new_tu; tu_chain_tail = new_tu
  • Used by: fe_wrapup to iterate all TUs during the 5-pass post-processing

TU Stack (singly-linked list of 16-byte stack entries):

  • Top: qword_106BA18 (translation_unit_stack_top)
  • Each entry: [+0] = next, [+8] = tu_descriptor_ptr
  • Free list: qword_12C7AB8 (stack entries are recycled, not freed)
  • Depth counter: dword_106B9E8 (counts non-primary TUs on the stack)

TU Chain:                    TU Stack:
                             qword_106BA18
primary_tu --> tu_2 --> tu_3    |
    ^                           v
    |                      [next|tu_3] --> [next|tu_2] --> [next|primary] --> NULL
qword_106B9F0                        each entry: 16 bytes

Phase 4: Active TU Tracking

The global qword_106BA10 always points to the currently active TU descriptor. All compiler state -- parser globals, scope stack, symbol tables, error context -- corresponds to this TU. Switching the active TU requires a full context switch through switch_translation_unit.

Phase 5: Processing (5 Passes in fe_wrapup)

After parsing completes, fe_wrapup (sub_588F90) iterates the TU chain and performs 5 passes over all TUs:

  1. Pass 1 (file_scope_il_wrapup): per-TU scope cleanup, cross-TU entity marking
  2. Pass 2 (set_needed_flags_at_end_of_file_scope): compute needed-flags for entities
  3. Pass 3 (mark_to_keep_in_il): mark entities to keep in the IL tree
  4. Pass 4 (three sub-stages): clear unneeded instantiation flags, eliminate unneeded function bodies, eliminate unneeded IL entries
  5. Pass 5 (file_scope_il_wrapup_part_3): final cleanup, scope assertion, re-run of passes 2-4 for the primary TU

Each pass switches to the target TU via switch_translation_unit before processing.

Phase 6: Pop and Cleanup

After sub_588E90 (translation_unit_wrapup) and the main compilation passes complete, the TU is popped from the stack. The pop is inlined in process_translation_unit and mirrors pop_translation_unit_stack (sub_7A3F70):

  1. Assert: stack_top->tu_ptr == current_tu (stack integrity check)
  2. Decrement dword_106B9E8 if popped TU is not the primary TU
  3. Move stack entry to free list (qword_12C7AB8)
  4. If a previous TU remains on the stack, switch to it via switch_translation_unit

Registered Variable Mechanism

The registered variable mechanism is the save/restore system that makes TU switching possible. It works in three phases: registration, save, and restore.

Registration Phase

During frontend initialization, each subsystem calls f_register_trans_unit_variable to declare global variables that contain per-TU state. Each call creates a 40-byte registration record:

Registered Variable Entry (40 bytes)
  [0]   8   next               linked list pointer
  [8]   8   variable_address   pointer to the global variable
  [16]  8   variable_size      number of bytes to save/restore
  [24]  8   offset_in_storage  byte offset within the TU storage buffer
  [32]  8   offset_in_tu       byte offset within the TU descriptor (0 = none)

Registration accumulates the total storage buffer size in qword_12C7A98. Each variable gets assigned a sequential offset within the buffer:

// f_register_trans_unit_variable (sub_7A3C00), simplified
void f_register_trans_unit_variable(void *var_ptr, size_t size, size_t offset_in_tu) {
    assert(!registration_complete);   // dword_12C7A8C must be 0
    assert(var_ptr != NULL);

    record = alloc(40);
    record->next = NULL;
    record->variable_address = var_ptr;
    record->variable_size = size;
    record->offset_in_storage = per_tu_storage_size;  // qword_12C7A98
    record->offset_in_tu = offset_in_tu;

    // append to linked list
    if (!list_head) list_head = record;     // qword_12C7AA8
    if (list_tail) list_tail->next = record;
    list_tail = record;                      // qword_12C7AA0

    // align size to 8 bytes, accumulate
    size_t aligned = (size + 7) & ~7;
    per_tu_storage_size += aligned;          // qword_12C7A98
}

The third parameter offset_in_tu is non-zero only for variables that need a direct shortcut pointer within the TU descriptor itself. For example, the 1344-byte IL state aggregate at unk_126E600 registers with offset_in_tu = 384, so tu_desc[384] receives a pointer to the stored copy of that aggregate within the storage buffer. Most variables pass 0 (no shortcut needed).
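A small worked example of the offset arithmetic: each variable is placed at the running total, and the total then advances by the size rounded up to 8 bytes. The `reg_record` struct and `register_var` function below are simplified, hypothetical stand-ins that keep only the two fields involved.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t size;               /* bytes to save/restore               */
    size_t offset_in_storage;  /* where the copy lives in the buffer  */
} reg_record;

static size_t g_total = 0;     /* stand-in for qword_12C7A98 */

/* Assign the next storage offset, then advance by the 8-byte-aligned size. */
static reg_record register_var(size_t size) {
    reg_record r = { size, g_total };
    g_total += (size + 7) & ~(size_t)7;
    return r;
}
```

Registering variables of 4, 8, and 4 bytes in that order yields offsets 0, 8, and 16 and a 24-byte buffer: each 4-byte variable still consumes a full 8-byte slot.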

Save Phase (save_translation_unit_state)

When switching away from a TU, sub_7A3A50 saves the current state:

// save_translation_unit_state (sub_7A3A50), simplified
void save_translation_unit_state(tu_desc *tu) {
    char *storage = tu->storage_buffer;     // tu[2]

    // Iterate all registered variables
    for (reg = registered_variable_list_head; reg; reg = reg->next) {
        // Copy current global value into storage buffer
        void *dst = storage + reg->offset_in_storage;
        memcpy(dst, reg->variable_address, reg->variable_size);

        // If this variable has a direct field in the TU descriptor,
        // store a pointer to the saved copy there
        if (reg->offset_in_tu != 0) {
            *(void **)((char *)tu + reg->offset_in_tu) = dst;
        }
    }

    // Save scope stack state (3 explicit fields)
    tu->scope_saved_1 = qword_126EB70;   // tu[26]
    tu->scope_saved_2 = qword_126EBA0;   // tu[32]
    tu->scope_saved_3 = qword_126EBE0;   // tu[40]

    // Save file scope indices via sub_704490
    if (dword_126C5E4 != -1) {
        sub_704490(dword_126C5E4, 0, 0);
        // Walk scope stack entries, clear scope-to-TU back-pointers
        for (entry = scope_top; entry; entry -= 784) {
            if (entry[+192])
                *(int *)(entry[+192] + 240) = -1;
            if (!entry[+4]) break;
        }
    }
}

Restore Phase (switch_translation_unit)

When switching to a different TU, sub_7A3D60 restores its state:

// switch_translation_unit (sub_7A3D60), simplified
void switch_translation_unit(tu_desc *target) {
    assert(current_tu != NULL);       // qword_106BA10

    if (current_tu == target) return; // no-op if already active

    save_translation_unit_state(current_tu);  // save outgoing
    current_tu = target;              // qword_106BA10 = target

    char *storage = target->storage_buffer;

    // Iterate all registered variables in the same order as save,
    // with the copy direction reversed
    for (reg = registered_variable_list_head; reg; reg = reg->next) {
        // Copy saved value from storage buffer back to global;
        // memcpy returns its destination, i.e. the global's address
        void *live = memcpy(reg->variable_address,
                            storage + reg->offset_in_storage,
                            reg->variable_size);

        // Update shortcut pointer if present: it now points at the global
        if (reg->offset_in_tu != 0) {
            *(void **)((char *)target + reg->offset_in_tu) = live;
        }
    }

    // Restore scope stack state
    xmmword_126EB60_high = target[1];  // prev_scope_state
    qword_126EB70 = target[26];
    qword_126EBA0 = target[32];
    qword_126EBE0 = target[40];

    // Rebuild file scope indices via sub_704490
    if (dword_126C5E4 != -1) {
        // Recompute scope-to-TU back-pointers
        for (entry = scope_top; entry; entry -= 784) {
            if (entry[+192])
                *(int *)(entry[+192] + 240) = index_formula;
            if (!entry[+4]) break;
        }
        sub_704490(dword_126C5E4, 1, computed_flag);
    }
}

The key asymmetry between save and restore: memcpy direction is reversed. Save copies global -> storage_buffer. Restore copies storage_buffer -> global. The shortcut pointer (offset_in_tu) semantics also differ: during save it points into the storage buffer; during restore it points back to the global variable.
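The save/restore pair can be exercised end to end with a toy model. The sketch below is an assumption-laden simplification: it uses an array of records instead of the linked list, omits the shortcut pointers and the scope stack fields, and the names `tu_save` and `tu_restore` are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    void  *addr;  /* address of the live global             */
    size_t size;  /* bytes to copy                          */
    size_t off;   /* offset inside the per-TU storage block */
} reg_record;

/* Save: live globals -> this TU's storage buffer. */
static void tu_save(unsigned char *storage, const reg_record *regs, size_t n) {
    for (size_t i = 0; i < n; i++) {
        memcpy(storage + regs[i].off, regs[i].addr, regs[i].size);
    }
}

/* Restore: this TU's storage buffer -> live globals. */
static void tu_restore(const unsigned char *storage, const reg_record *regs,
                       size_t n) {
    for (size_t i = 0; i < n; i++) {
        memcpy(regs[i].addr, storage + regs[i].off, regs[i].size);
    }
}
```

A switch from one TU to another is then a save into the outgoing TU's buffer followed by a restore from the incoming TU's buffer; code that reads the globals never sees the buffers.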

Fix-Up Phase

After the primary TU's registered-variable defaults are first copied into its descriptor, fix_up_translation_unit (sub_7A3CF0) performs a one-time pass that writes variable-address pointers into the TU descriptor's shortcut fields:

// fix_up_translation_unit (sub_7A3CF0)
void fix_up_translation_unit(tu_desc *primary) {
    assert(primary->next_tu == NULL);  // must be the first TU

    for (reg = registered_variable_list_head; reg; reg = reg->next) {
        if (reg->offset_in_tu != 0) {
            *(void **)((char *)primary + reg->offset_in_tu) =
                reg->variable_address;
        }
    }
}

This ensures the primary TU's shortcut pointers point directly to the live global variables rather than the storage buffer, since the primary TU's globals are the "real" values (not copies).

TU Stack Operations

The TU stack supports nested TU switches. This is needed when processing an entity declared in a different TU requires temporarily switching to that TU's context.

Push (push_translation_unit_stack)

// sub_7A3EF0 -- push_translation_unit_stack
void push_translation_unit_stack(tu_desc *tu) {
    // Allocate stack entry from free list or fresh allocation
    stack_entry *entry;
    if (stack_entry_free_list) {          // qword_12C7AB8
        entry = stack_entry_free_list;
        stack_entry_free_list = entry->next;
    } else {
        entry = alloc(16);               // sub_6B7340(16)
        ++stack_entry_count;              // qword_12C7A80
    }

    entry->tu_ptr = tu;                  // [+8]
    entry->next = tu_stack_top;          // [+0] = qword_106BA18

    // If pushing a different TU than current, switch to it
    if (current_tu != tu)
        switch_translation_unit(tu);

    // Track depth for non-primary TUs
    if (tu != primary_tu)
        ++tu_stack_depth;                // dword_106B9E8

    tu_stack_top = entry;                // qword_106BA18
}

Pop (pop_translation_unit_stack)

// sub_7A3F70 -- pop_translation_unit_stack
void pop_translation_unit_stack() {
    stack_entry *top = tu_stack_top;      // qword_106BA18

    // Integrity assertion: top-of-stack TU must match current TU
    assert(top->tu_ptr == current_tu);    // top[+8] == qword_106BA10

    if (top->tu_ptr != primary_tu)
        --tu_stack_depth;                 // dword_106B9E8

    // Pop: move top to free list, advance stack
    stack_entry *prev = top->next;
    top->next = stack_entry_free_list;   // return to free list
    stack_entry_free_list = top;
    tu_stack_top = prev;                 // qword_106BA18

    // If another TU remains, switch to it
    if (prev)
        switch_translation_unit(prev->tu_ptr);
}

Push Entity's TU (push_entity_translation_unit)

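The convenience function push_entity_translation_unit (sub_7A3FE0) pushes the TU that owns a given entity. It returns 1 only when it actually pushed, so the caller pops only in that case: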

// sub_7A3FE0 -- push_entity_translation_unit
int push_entity_translation_unit(entity *ent) {
    if (ent->flags_81 & 0x20)  return 0;  // anonymous entity, no TU
    tu_desc *owning_tu = get_entity_tu(ent);  // sub_741960
    if (owning_tu == current_tu)  return 0;   // already in correct TU

    push_translation_unit_stack(owning_tu);
    return 1;  // caller must pop when done
}

TU Stack Entry Layout

TU Stack Entry (16 bytes)
  [0]   8   next            next entry in stack (toward bottom) or free list
  [8]   8   tu_desc_ptr     pointer to the TU descriptor

Stack entries are recycled through a free list (qword_12C7AB8). They are allocated by sub_6B7340 (general storage, not arena) on first use and never deallocated -- only returned to the free list on pop.
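
The recycling behaviour can be modelled in isolation. The sketch below is a simplified stand-in, assuming malloc in place of sub_6B7340 and omitting the TU switch and depth tracking; the names push, pop, and entries_allocated are hypothetical.

```c
#include <stdlib.h>

typedef struct stack_entry {
    struct stack_entry *next;   /* [+0] next entry or next free-list node */
    void *tu_ptr;               /* [+8] TU descriptor */
} stack_entry;

static stack_entry *free_list;      /* models qword_12C7AB8 */
static stack_entry *stack_top;      /* models qword_106BA18 */
static long entries_allocated;      /* models qword_12C7A80 */

static void push(void *tu) {
    stack_entry *e = free_list;
    if (e) {
        free_list = e->next;                     /* reuse a recycled entry */
    } else {
        e = (stack_entry *)malloc(sizeof *e);    /* fresh allocation */
        if (!e) abort();
        ++entries_allocated;
    }
    e->tu_ptr = tu;
    e->next = stack_top;
    stack_top = e;
}

static void *pop(void) {
    stack_entry *e = stack_top;
    void *tu = e->tu_ptr;
    stack_top = e->next;
    e->next = free_list;                         /* recycle, never free() */
    free_list = e;
    return tu;
}
```

In this model the allocation counter only grows when the free list is empty, so it tracks the peak stack depth rather than the number of pushes.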

TU Correspondence (24 bytes)

When processing multiple TUs in RDC mode, the frontend must track structural equivalence between types and declarations across TUs. Each correspondence is a 24-byte node:

Trans Unit Correspondence (24 bytes)
  [0]   8   next            linked list pointer
  [8]   8   ptr             pointer to the corresponding entity/type
  [16]  4   refcount        reference count (freed when decremented to 1)
  [20]  1   flag            correspondence type flag

Allocation uses a free list (qword_12C7AB0) with fallback to arena allocation (sub_6BA0D0(24)). The reference-counted deallocation in free_trans_unit_corresp (sub_7A3BB0) asserts that refcount > 0 before decrementing, and only pushes the node onto the free list when the count reaches 1 (not 0 -- the last reference is the free-list entry itself).

Global Variables

TU State

| Global | Type | Identity |
| --- | --- | --- |
| qword_106BA10 | tu_desc* | current_translation_unit -- always points to the active TU |
| qword_106B9F0 | tu_desc* | primary_translation_unit -- the first TU processed (head of chain) |
| qword_12C7A90 | tu_desc* | tu_chain_tail -- last TU in the linked list |
| qword_106BA18 | stack_entry* | translation_unit_stack_top -- top of the TU stack |
| dword_106B9E8 | int32 | tu_stack_depth -- number of non-primary TUs on the stack |
| qword_106BA00 | char* | current_filename -- source file name for the active TU |
| dword_106BA08 | int32 | is_recompilation -- 1 if this TU is being recompiled |
| dword_106B9F8 | int32 | has_module_info -- 1 if the active TU has module info |

Registration Infrastructure

| Global | Type | Identity |
| --- | --- | --- |
| qword_12C7AA8 | reg_entry* | registered_variable_list_head |
| qword_12C7AA0 | reg_entry* | registered_variable_list_tail |
| qword_12C7A98 | size_t | per_tu_storage_size -- accumulated total size of all registered variables (determines storage buffer allocation) |
| dword_12C7A8C | int32 | registration_complete -- set to 1 when first TU is allocated; guards against late registration |
| dword_12C7A88 | int32 | has_seen_module_tu -- set to 1 when a TU with module info is processed |

Free Lists and Counters

| Global | Type | Identity |
| --- | --- | --- |
| qword_12C7AB8 | stack_entry* | stack_entry_free_list |
| qword_12C7AB0 | corresp* | corresp_free_list |
| qword_12C7A78 | int64 | tu_count -- total TU descriptors allocated |
| qword_12C7A80 | int64 | stack_entry_count -- total stack entries allocated (not freed) |
| qword_12C7A68 | int64 | registration_count -- total registered variable entries |
| qword_12C7A70 | int64 | corresp_count -- total correspondence nodes allocated |

Correspondence State

| Global | Type | Identity |
| --- | --- | --- |
| dword_106B9E4 | int32 | Correspondence state variable 1 (per-TU, registered) |
| dword_106B9E0 | int32 | Correspondence state variable 2 (per-TU, registered) |
| qword_12C7798 | int64 | Correspondence state variable 3 (per-TU, registered) |
| qword_12C7800 | [14] | Correspondence hash table 1 (0x70 bytes) |
| qword_12C7880 | [14] | Correspondence hash table 2 (0x70 bytes) |
| qword_12C7900 | [14] | Correspondence hash table 3 (0x70 bytes) |

Reset Functions

Two reset functions exist for different scopes:

reset_translation_unit_state (sub_7A4860) -- zeroes the 6 core runtime globals. Called during error recovery or frontend teardown:

qword_106BA10 = 0;  // current_tu
qword_106B9F0 = 0;  // primary_tu
qword_12C7A90 = 0;  // tu_chain_tail
dword_106B9F8 = 0;  // has_module_info
qword_106BA18 = 0;  // tu_stack_top
dword_106B9E8 = 0;  // tu_stack_depth

init_translation_unit_tracking (sub_7A48B0) -- zeroes all 13 tracking globals. Called during frontend initialization before any registrations:

qword_12C7AA8 = 0;  // registered_variable_list_head
qword_12C7AA0 = 0;  // registered_variable_list_tail
qword_12C7A98 = 0;  // per_tu_storage_size
dword_106BA08 = 0;  // is_recompilation
qword_106BA00 = 0;  // current_filename
qword_12C7AB8 = 0;  // stack_entry_free_list
qword_12C7AB0 = 0;  // corresp_free_list
qword_12C7A80 = 0;  // stack_entry_count
qword_12C7A78 = 0;  // tu_count
qword_12C7A68 = 0;  // registration_count
qword_12C7A70 = 0;  // corresp_count
dword_12C7A8C = 0;  // registration_complete
dword_12C7A88 = 0;  // has_seen_module_tu

Memory Statistics

The print_trans_unit_statistics function (sub_7A45A0) reports the allocation counts and total memory for the four structure types managed by the TU system:

| Structure | Size | Counter | Storage |
| --- | --- | --- | --- |
| Trans unit correspondence | 24 bytes | qword_12C7A70 | Arena |
| Translation unit descriptor | 424 bytes | qword_12C7A78 | Arena (gen. storage) |
| TU stack entry | 16 bytes | qword_12C7A80 | General storage |
| Variable registration | 40 bytes | qword_12C7A68 | General storage |
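
The total memory figure follows from multiplying each counter by its structure size. The sketch below shows only that arithmetic; the tu_stat record and tu_stats_total are hypothetical names, and the real report format of sub_7A45A0 is not reproduced here.

```c
#include <stddef.h>

/* Hypothetical row of the statistics table: one structure type. */
typedef struct {
    const char   *label;   /* structure name */
    size_t        size;    /* bytes per instance */
    unsigned long count;   /* value of the matching counter global */
} tu_stat;

/* Sum of size * count over all rows. */
static size_t tu_stats_total(const tu_stat *rows, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += rows[i].size * rows[i].count;
    return total;
}
```

For example, 2 correspondence nodes, 1 TU descriptor, 3 stack entries, and 5 registrations come to 2*24 + 424 + 3*16 + 5*40 = 720 bytes.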

Function Map

| Address | Identity | Confidence | Role |
| --- | --- | --- | --- |
| sub_7A3A50 | save_translation_unit_state | HIGH | Save all per-TU globals to storage buffer |
| sub_7A3B50 | alloc_trans_unit_corresp | HIGH | Allocate 24-byte correspondence node |
| sub_7A3BB0 | free_trans_unit_corresp | HIGH | Reference-counted deallocation |
| sub_7A3C00 | f_register_trans_unit_variable | DEFINITE | Register a global for per-TU save/restore |
| sub_7A3CF0 | fix_up_translation_unit | DEFINITE | One-time shortcut pointer fix-up for primary TU |
| sub_7A3D60 | switch_translation_unit | DEFINITE | Context-switch to a different TU |
| sub_7A3EF0 | push_translation_unit_stack | HIGH | Push TU onto the stack |
| sub_7A3F70 | pop_translation_unit_stack | DEFINITE | Pop TU from the stack |
| sub_7A3FE0 | push_entity_translation_unit | MEDIUM-HIGH | Push the TU that owns a given entity |
| sub_7A40A0 | process_translation_unit | DEFINITE | Main entry: allocate, init, parse, cleanup |
| sub_7A45A0 | print_trans_unit_statistics | HIGH | Memory usage report for TU subsystem |
| sub_7A4690 | register_builtin_trans_unit_variables | HIGH | Register the 3 core per-TU globals |
| sub_7A4860 | reset_translation_unit_state | DEFINITE | Zero 6 runtime globals |
| sub_7A48B0 | init_translation_unit_tracking | DEFINITE | Zero all 13 tracking globals |

Assertions

The TU system contains 8 assertion sites (calls to sub_4F2930 with source path trans_unit.c):

| Line | Function | Condition | Meaning |
| --- | --- | --- | --- |
| 163 | free_trans_unit_corresp | refcount > 0 | Correspondence double-free |
| 227 | f_register_trans_unit_variable | !registration_complete | Variable registered after first TU allocated |
| 230 | f_register_trans_unit_variable | var_ptr != NULL | NULL pointer passed for registration |
| 469 | fix_up_translation_unit | primary_tu->next == NULL | Fix-up called on non-primary TU |
| 514 | switch_translation_unit | current_tu != NULL | Switch attempted with no active TU |
| 556 | pop_translation_unit_stack | stack_top->tu == current_tu | Stack/active TU mismatch |
| 696 | process_translation_unit | !(a3==NULL && has_seen_module) | Non-module TU after module TU |
| 725 | process_translation_unit | is_recompilation (when primary and first TU) | Primary TU must be a recompilation |

Cross-References

  • RDC Mode -- multi-TU compilation: correspondence system, cross-TU IL copying, module ID generation
  • Frontend Wrapup -- the 5-pass post-processing architecture that iterates the TU chain
  • Scope Entry -- 784-byte scope stack entries saved/restored during TU switches
  • Entity Node -- entities carry a back-pointer to their owning TU (extracted via sub_741960)
  • IL Overview -- the IL tree rooted in each TU's file scope
  • Pipeline Overview -- where process_translation_unit sits in the full pipeline

Type Node

The type node is the fundamental type representation in EDG's intermediate language (IL). Every C++ type -- from int to const volatile std::vector<std::pair<int, float>>*& -- is represented as a 176-byte type node allocated by alloc_type (sub_5E3D40 in il_alloc.c). Type nodes form the backbone of the type system: every variable, field, routine, expression, and parameter carries a pointer to its type node. The 128 type-query leaf functions in types.c alone account for approximately 4,448 call sites.

The type node is a discriminated union. The type_kind byte at offset +132 selects one of 22 type kinds (0-21), and certain type kinds trigger allocation of a separate supplementary structure (class_type_supplement, routine_type_supplement, etc.) that hangs off the type node at offset +152. The base 176 bytes contain the common header shared with all IL entries (96 bytes), the type discriminator, qualifier flags, size/alignment, and type-kind-specific inline payload fields.

Key Facts

| Property | Value |
| --- | --- |
| Allocation size | 176 bytes (IL entry size) |
| Allocator | sub_5E3D40 (alloc_type), il_alloc.c |
| Kind setter | sub_5E2E80 (set_type_kind), 22 cases |
| In-place reinit | sub_5E3590 (init_type_fields), no allocation |
| Counter global | qword_126F8E0 |
| Stats label | "type" (176 bytes each) |
| Region | file-scope only (always dword_126EC90) |
| Type query functions | 128 leaf functions in types.c (0x7A4940-0x7C02A0) |
| Most-called query | is_class_or_struct_or_union_type (sub_7A8A30, 407 call sites) |
| Source files | il_alloc.c (allocation), types.c (queries/construction) |

Memory Layout

Raw Allocation vs Returned Pointer

Like all IL entries, the raw allocation includes a 16-byte prefix that is hidden from the returned pointer. The allocator returns raw + 16, so all field offsets documented below are relative to this returned pointer.

Raw allocation (192 bytes total):
  raw+0   [8 bytes]   TU copy address (zeroed, ptr[-2])
  raw+8   [8 bytes]   next-in-list link (zeroed, ptr[-1])
  raw+16  [176 bytes] type node body (ptr+0 onward)

Prefix flags byte at ptr[-8]:
  bit 0 (0x01)  allocated
  bit 1 (0x02)  file_scope
  bit 3 (0x08)  language_flag (C++ mode)
  bit 7 (0x80)  keep_in_il (CUDA device marking)
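
The pointer arithmetic implied by this layout can be written down directly. The helper names and flag constants below are hypothetical, and the sketch assumes 8-byte pointers as on the x86-64 target; it only shows how the two prefix slots relate to the returned pointer and how the documented flag bits are tested.

```c
#include <stdint.h>

/* Bit values of the prefix flags byte, as listed above. */
enum {
    IL_PREFIX_ALLOCATED  = 0x01,
    IL_PREFIX_FILE_SCOPE = 0x02,
    IL_PREFIX_LANGUAGE   = 0x08,
    IL_PREFIX_KEEP_IN_IL = 0x80
};

/* Returned pointer -> start of the raw allocation (16 bytes earlier). */
static void *il_raw_from_ptr(void *ptr) {
    return (unsigned char *)ptr - 16;
}

/* The two 8-byte prefix slots, i.e. ptr[-2] and ptr[-1] in qword units. */
static void **il_tu_copy_slot(void *ptr)   { return (void **)ptr - 2; }
static void **il_next_link_slot(void *ptr) { return (void **)ptr - 1; }

/* Test the CUDA device-marking bit in a prefix flags byte value. */
static int il_keep_in_il(uint8_t prefix_flags) {
    return (prefix_flags & IL_PREFIX_KEEP_IN_IL) != 0;
}
```

Given a node pointer, il_raw_from_ptr recovers the address originally returned by the region allocator, and the two slot helpers address raw+0 and raw+8.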

Complete Field Map

The 176 bytes of the type node body divide into three regions: the common IL header (bytes 0-95), the type discriminator and qualifier zone (bytes 96-135), and the type-kind-specific payload (bytes 136-175).

Offset  Size  Field                  Description
------  ----  -----                  -----------
+0      96    common_il_header       Shared with all IL entry types (see below)
+96     24    (continuation of       Source position, declaration metadata,
              common header area)    scope/name linkage -- varies by IL kind
+120    8     type_size              Computed size of this type in bytes
+128    4     alignment              Required alignment (bytes)
+132    1     type_kind              Type discriminator (0-21, see table below)
+133    1     type_flags_1           bit 5 = is_dependent
+134    1     type_qual_flags        bit 0 = const, bit 1 = volatile
                                    (cleared by sub_5E3580: *(a1+134) &= 0xFC)
+135    1     (padding/reserved)
+136    8     (reserved/varies)      Kind-dependent inline storage
+144    8     referenced_type        For tk_pointer/tk_reference/tk_typedef:
                                      -> pointed-to/referenced/aliased type
                                    For tk_pointer_to_member: -> class type
                                    For tk_function: return type enum (2=void)
                                    For tk_integer (enum): underlying type ptr
+145    1     enum_flags             For tk_integer:
                                      bit 3 = scoped_enum
                                      bit 4 = is_bit_int_capable
+146    1     extended_int_flags     For tk_integer:
                                      bit 2 = is_BitInt
+147    1     (padding)
+148    1     (varies by kind)       For tk_class/struct/union: access default
                                    set_type_kind initializes to 1
+149    1     kind_init_byte         Flags initialized during set_type_kind
+150    2     (cleared by init)      Zeroed by init_type_fields_and_set_kind
+152    8     supplement_ptr         For tk_class/struct/union: -> class_type_supplement
                                    For tk_routine: -> routine_type_supplement
                                    For tk_integer: -> integer_type_supplement
                                    For tk_typeref: -> typeref_type_supplement
                                    For tk_template_param: -> templ_param_supplement
                                    For tk_pointer: member_pointer_flag
                                      bit 0 = is_member_pointer
                                      bit 1 = extended_member_ptr
                                    For tk_array: bound expression pointer
+153    1     array_flags            For tk_array:
                                      bit 0 = dependent_bound
                                      bit 1 = is_VLA
                                      bit 5 = star_modifier
+154    6     (varies)
+160    8     typedef_attr_kind      For tk_typeref: attribute kind value
                                    For tk_array: numeric bound value
+161    1     class_flags_1          For tk_class/struct/union:
                                      bit 0 = is_local_class
                                      bit 2 = no_name_linkage
                                      bit 4 = is_template_class
                                      bit 5 = is_anonymous
                                      bit 7 = has_nested_types
+162    1     typedef_flags          For tk_typeref:
                                      bit 0 = is_elaborated
                                      bit 6 = is_attributed
                                      bit 7 = has_addr_space
+163    1     class_flags_2          For tk_class/struct/union:
                                      bit 0 = extern_template_inst
                                      bit 3 = alignment_set
                                      bit 4 = is_scoped (for union)
+164    2     feature_flags          Target feature requirements
                                    (copied to byte_12C7AFC by
                                     record_type_features_used)
+166    2     (reserved)
+168    4     alignment_attr         Explicit alignment / packed attribute
+172    4     (tail padding)
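
As a reading aid, the fixed-offset fields of the map can be accessed through small helpers over a raw byte buffer. The helper names are hypothetical; only the offsets and bit positions come from the map above, and multi-byte fields are read in host byte order.

```c
#include <stdint.h>
#include <string.h>

/* Offsets taken from the field map; names are illustrative. */
enum {
    TN_OFF_TYPE_SIZE  = 120,
    TN_OFF_ALIGNMENT  = 128,
    TN_OFF_TYPE_KIND  = 132,
    TN_OFF_FLAGS_1    = 133,
    TN_OFF_QUAL_FLAGS = 134
};

static uint64_t tn_type_size(const unsigned char *node) {
    uint64_t v;
    memcpy(&v, node + TN_OFF_TYPE_SIZE, sizeof v);   /* 8-byte size at +120 */
    return v;
}

static uint32_t tn_alignment(const unsigned char *node) {
    uint32_t v;
    memcpy(&v, node + TN_OFF_ALIGNMENT, sizeof v);   /* 4-byte alignment at +128 */
    return v;
}

static unsigned tn_kind(const unsigned char *node)         { return node[TN_OFF_TYPE_KIND]; }
static int      tn_is_dependent(const unsigned char *node) { return (node[TN_OFF_FLAGS_1] >> 5) & 1; }
static int      tn_is_const(const unsigned char *node)     { return node[TN_OFF_QUAL_FLAGS] & 1; }
static int      tn_is_volatile(const unsigned char *node)  { return (node[TN_OFF_QUAL_FLAGS] >> 1) & 1; }
```

Using memcpy for the wider fields avoids alignment and aliasing problems when the buffer is a plain byte array.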

Common IL Header (Bytes 0-95)

The first 96 bytes are copied verbatim from the template globals (xmmword_126F6A0 through xmmword_126F6F0) during allocation. This template captures the current source file, line, and column position, and is refreshed as the parser advances. The header contains:

xmmword_126F6A0  [+0..+15]    scope/class pointer, name pointer (zeroed)
xmmword_126F6B0  [+16..+31]   declaration metadata (high qword zeroed)
xmmword_126F6C0  [+32..+47]   reserved (zeroed)
xmmword_126F6D0  [+48..+63]   reserved (zeroed)
xmmword_126F6E0  [+64..+79]   source position (from qword_126EFB8)
xmmword_126F6F0  [+80..+95]   low word = 4 (access default), high zeroed
qword_126F700    [+96..+103]  current source file reference

The source position at bytes +64..+79 allows error messages and diagnostics to reference the exact declaration point for each type.

Type Kind Enumeration

The type kind byte at offset +132 holds one of 22 values. The set_type_kind function (sub_5E2E80, 279 lines, il_alloc.c:2334) dispatches on this value to initialize type-kind-specific fields and allocate supplement structures where needed.

| Value | Name | C++ Constructs | Supplement | Payload |
| --- | --- | --- | --- | --- |
| 0 | tk_none | Placeholder / uninitialized | None | no-op |
| 1 | tk_void | void | None | no-op |
| 2 | tk_integer | bool, char, short, int, long, long long, __int128, _BitInt(N), all unsigned variants, wchar_t, char8_t, char16_t, char32_t, enumerations | 32-byte integer_type_supplement | +144=5 (default) |
| 3 | tk_float | float, _Float16, __bf16 | None | format byte = 2 |
| 4 | tk_double | double | None | format byte = 2 |
| 5 | tk_long_double | long double, __float128 | None | format byte = 2 |
| 6 | tk_pointer | T*, member pointers (bit 0 of +152) | None | 2 payload fields zeroed |
| 7 | tk_routine | Function type int(int, float) | 64-byte routine_type_supplement | calling convention, params init |
| 8 | tk_array | T[N], T[], VLAs | None | size+flags zeroed |
| 9 | tk_struct | struct S | 208-byte class_type_supplement | kind stored at supplement+100 |
| 10 | tk_class | class C | 208-byte class_type_supplement | kind stored at supplement+100 |
| 11 | tk_union | union U | 208-byte class_type_supplement | kind stored at supplement+100 |
| 12 | tk_typeref | typedef, using alias, elaborated type specifiers | 56-byte typeref_type_supplement | -- |
| 13 | tk_typeof | typeof(expr), __typeof__ | None | zeroed |
| 14 | tk_template_param | typename T, template type/non-type/template parameters | 40-byte templ_param_supplement | -- |
| 15 | tk_decltype | decltype(expr) | None | zeroed |
| 16 | tk_pack_expansion | T... (parameter pack expansion) | None | zeroed |
| 17 | tk_pack_expansion_alt | Alternate pack expansion form | None | no-op |
| 18 | tk_auto | auto, decltype(auto) | None | no-op |
| 19 | tk_rvalue_reference | T&& (rvalue reference) | None | no-op |
| 20 | tk_nullptr_t | std::nullptr_t | None | no-op |
| 21 | tk_reserved_21 | Reserved / unused | None | no-op |

Reconciling set_type_kind with types.c query functions: There is an apparent conflict between the set_type_kind dispatch (where case 7 allocates a routine supplement, case 0xD/13 is typeof, case 0xE/14 is template_param) and the types.c query function catalog (where is_reference_type tests kind==7, is_pointer_to_member_type tests kind==13, is_function_type tests kind==14). The set_type_kind switch is the authoritative source for allocation behavior -- it is a 279-line DEFINITE-confidence function with the embedded error string "set_type_kind: bad type kind". The types.c catalog was reconstructed from runtime query patterns and may reflect a different numbering or the fact that type kind values are reassigned after initial allocation. The table above follows the set_type_kind dispatch numbering. The types.c query mappings are documented in the query function catalog below for cross-reference.

set_type_kind Dispatch Summary

switch (type_kind) {
  case 0, 1, 17..21:   // tk_none, tk_void, alt-pack, auto, rvalue_ref, nullptr, reserved
    break;             // no-op: simple types with no extra state

  case 2:              // tk_integer
    type->+144 = 5;    // default integer subkind
    supplement = alloc_in_file_scope_region(32);   // integer_type_supplement
    ++qword_126F8E8;
    supplement->+16 = source_position;
    type->+152 = supplement;
    break;

  case 3, 4, 5:        // tk_float, tk_double, tk_long_double
    type->format_byte = 2;   // IEEE format indicator
    break;

  case 6:              // tk_pointer
    type->+144 = 0;    // pointed-to type (to be set later)
    type->+152 = 0;    // member-pointer flags (cleared)
    break;

  case 7:              // tk_routine
    supplement = alloc_in_file_scope_region(64);   // routine_type_supplement
    ++qword_126F958;
    init_bitfield_struct(supplement+32);            // calling convention defaults
    type->+152 = supplement;
    break;

  case 8:              // tk_array
    type->+120 = 0;    // array total size (unknown)
    type->+152 = 0;    // bound expression (none)
    type->+153 &= mask; // clear array flags
    type->+160 = 0;    // numeric bound (none)
    break;

  case 9, 10, 11:      // tk_struct, tk_class, tk_union
    supplement = alloc_in_file_scope_region(208);  // class_type_supplement
    ++qword_126F948;
    init_class_type_supplement_fields(supplement);
    supplement->+100 = type_kind;                  // remember which class flavor
    type->+152 = supplement;
    break;

  case 12 (0xC):       // tk_typeref (typedef / using alias)
    supplement = alloc_in_file_scope_region(56);   // typeref_type_supplement
    ++qword_126F8F0;
    type->+152 = supplement;
    break;

  case 13 (0xD):       // tk_typeof
    type->+144 = 0;
    type->+152 = 0;
    break;

  case 14 (0xE):       // tk_template_param
    supplement = alloc_in_file_scope_region(40);   // templ_param_supplement
    ++qword_126F8F8;
    type->+152 = supplement;
    break;

  case 15 (0xF):       // tk_decltype
    type->+144 = 0;
    type->+152 = 0;
    break;

  case 16 (0x10):      // tk_pack_expansion
    type->+144 = 0;
    type->+152 = 0;
    break;

  default:
    internal_error("set_type_kind: bad type kind");
}
type->+132 = type_kind;

Supplement Structures

Seven type kinds (tk_integer, tk_routine, tk_struct, tk_class, tk_union, tk_typeref, tk_template_param) trigger allocation of one of five supplementary structures. The supplement pointer lives at type node offset +152 and points to a separately-allocated block in the file-scope region.
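
The kind-to-supplement mapping from the enumeration table and the dispatch summary can be condensed into a lookup. The enumerator names follow the table; the function supplement_size is a hypothetical helper, not a recovered symbol.

```c
#include <stddef.h>

/* Type kinds in set_type_kind numbering (values 0-21). */
enum type_kind {
    tk_none = 0, tk_void, tk_integer, tk_float, tk_double, tk_long_double,
    tk_pointer, tk_routine, tk_array, tk_struct, tk_class, tk_union,
    tk_typeref, tk_typeof, tk_template_param, tk_decltype,
    tk_pack_expansion, tk_pack_expansion_alt, tk_auto,
    tk_rvalue_reference, tk_nullptr_t, tk_reserved_21
};

/* Size in bytes of the supplement allocated for a kind, or 0 if none. */
static size_t supplement_size(enum type_kind k) {
    switch (k) {
    case tk_integer:        return 32;   /* integer_type_supplement */
    case tk_routine:        return 64;   /* routine_type_supplement */
    case tk_struct:
    case tk_class:
    case tk_union:          return 208;  /* class_type_supplement */
    case tk_typeref:        return 56;   /* typeref_type_supplement */
    case tk_template_param: return 40;   /* templ_param_supplement */
    default:                return 0;    /* no supplement */
    }
}
```

A zero result means offset +152 is either unused or holds inline data for that kind (flags for tk_pointer, the bound expression for tk_array).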

class_type_supplement (208 bytes)

Allocated for tk_struct (kind 9), tk_class (kind 10), and tk_union (kind 11). This is the richest supplement, carrying the full class definition metadata. Initialized by init_class_type_supplement_fields (sub_5E2D70, 40 lines) and init_class_type_supplement (sub_5E2C70, 42 lines).

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| +0 | 8 | scope_ptr | Pointer to the class scope (288-byte scope node) |
| +8 | 8 | base_class_list | Head of linked list of base class entries (112 bytes each) |
| +16 | 8 | friend_decl_list | Head of friend declaration list |
| +24 | 8 | member_list_head | Member entity list (routines, variables, nested types) |
| +32 | 8 | nested_type_list | Nested type definitions |
| +40 | 4 | default_access | 1 = public (struct/union), 2 = private (class) |
| +44 | 4 | (reserved) | |
| +48 | 8 | using_decl_list | Using declarations in class scope |
| +56 | 8 | static_data_members | Static data member list |
| +64 | 8 | template_info | Template instantiation info (if template class) |
| +72 | 8 | virtual_base_list | Virtual base class chain |
| +80 | 4 | (flags) | |
| +84 | 2 | (reserved) | |
| +86 | 1 | class_property_flags | bit 0 = has_virtual_bases, bit 3 = has_user_conversion |
| +88 | 1 | extended_flags | bit 5 = has_flexible_array |
| +96 | 8 | vtable_ptr | Virtual function table pointer |
| +100 | 4 | class_kind | Copy of type_kind (9, 10, or 11) |
| +104 | 8 | destructor_ptr | Pointer to destructor entity |
| +112 | 8 | copy_ctor_ptr | Copy constructor entity |
| +120 | 8 | move_ctor_ptr | Move constructor entity |
| +128 | 8 | (scope chain) | |
| +136 | 8 | conversion_functions | User-defined conversion operator list |
| +144 | 8 | befriending_classes | List of classes that befriend this class |
| +152 | 8 | deduction_guides | Deduction guide list (C++17) |
| +160 | 8 | (reserved) | |
| +168 | 8 | (reserved) | |
| +176 | 4 | vtable_index | Virtual function table index, initialized to -1 (0xFFFFFFFF) |
| +180 | 4 | (padding) | |
| +184 | 8 | (reserved) | |
| +192 | 8 | (reserved) | |
| +200 | 8 | (reserved) | |

Counter: qword_126F948, stats label "class type supplement".

routine_type_supplement (64 bytes)

Allocated for tk_routine (kind 7) by set_type_kind. Encodes the function signature metadata.

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| +0 | 8 | param_type_list | Head of parameter type linked list (80 bytes each) |
| +8 | 8 | return_type | Return type pointer |
| +16 | 8 | exception_spec | Exception specification pointer (16 bytes) |
| +24 | 8 | (reserved) | |
| +32 | 4 | calling_convention | Calling convention bitfield (initialized by set_type_kind) |
| +36 | 4 | param_count | Number of parameters |
| +40 | 4 | flags | Variadic, noexcept, trailing-return, etc. |
| +44 | 4 | (reserved) | |
| +48 | 8 | (reserved) | |
| +56 | 8 | (reserved) | |

Counter: qword_126F958, stats label "routine type supplement".

Each parameter in the param_type_list is an 80-byte param_type node (allocated by alloc_param_type, sub_5E1D40, free-list recycled from qword_126F678). Parameter types form a singly-linked list through their +0 field.

integer_type_supplement (32 bytes)

Allocated for tk_integer (kind 2). Represents the properties of integral and enumeration types.

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| +0 | 4 | integer_subkind | Subkind identifier (values 1-12, default 5) |
| +4 | 4 | bit_width | Width in bits (for _BitInt(N)) |
| +8 | 4 | signedness | 0=unsigned, 1=signed (lookup via byte_E6D1B0) |
| +12 | 4 | (reserved) | |
| +16 | 8 | source_position | Source position at allocation time |
| +24 | 8 | underlying_type | For enums: pointer to the underlying integer type |

Counter: qword_126F8E8, stats label "integer type supplement".

The integer_subkind field distinguishes between the various integer types. Known subkind values from type query analysis:

| Subkind | Type |
| --- | --- |
| 1-10 | Standard integer types (bool through long long) |
| 11 | _Float16 / extended |
| 12 | __int128 / extended |

typeref_type_supplement (56 bytes)

Allocated for tk_typeref (kind 12 = 0xC in set_type_kind). Links the typedef/using-alias to its referenced declaration and tracks elaborated type specifier properties.

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| +0 | 8 | referenced_decl | The declaration this typedef names |
| +8 | 8 | original_type | The original type before typedef expansion |
| +16 | 8 | scope_ptr | Scope in which the typedef was declared |
| +24 | 8 | (reserved) | |
| +32 | 8 | attribute_info | Attribute specifier chain |
| +40 | 8 | template_info | Template argument list (for alias templates) |
| +48 | 8 | (reserved) | |

Counter: qword_126F8F0, stats label "typeref type supplement".

The elaborated type specifier kind is encoded in type_node+162:

  • bit 0: is_elaborated (uses struct/class/union/enum keyword)
  • bit 6: is_attributed (carries [[...]] attributes)
  • bit 7: has_addr_space (CUDA address space attribute)

The constant 0x18C2 (= bits {1,6,7,11,12}) is used as a bitmask in is_incomplete_type_deep (sub_7A6580) to identify the set of elaborated type specifier kinds.
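
The bit decomposition of such a mask is easy to check mechanically. The helper below is a hypothetical utility for verifying constants like 0x18C2; it says nothing about how is_incomplete_type_deep applies the mask.

```c
#include <stdint.h>

/* Write the positions of all set bits in mask to out[], lowest first.
   Returns the number of positions written (at most 32). */
static unsigned mask_bits(uint32_t mask, unsigned out[32]) {
    unsigned n = 0;
    for (unsigned bit = 0; bit < 32; ++bit)
        if ((mask >> bit) & 1u)
            out[n++] = bit;
    return n;
}
```

Applied to 0x18C2 it yields the five positions 1, 6, 7, 11, and 12, matching the set quoted above (0x2 + 0x40 + 0x80 + 0x800 + 0x1000).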

templ_param_supplement (40 bytes)

Allocated for tk_template_param (kind 14 = 0xE in set_type_kind). Represents a template type parameter (typename T), non-type template parameter, or template template parameter.

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| +0 | 4 | param_index | Zero-based index in the template parameter list |
| +4 | 4 | param_depth | Nesting depth (0 for outermost template) |
| +8 | 4 | param_kind | 0=type, 1=non-type, 2=template-template |
| +12 | 4 | (reserved) | |
| +16 | 8 | constraint | Associated constraint expression (C++20 concepts) |
| +24 | 8 | default_arg | Default template argument (type or expression) |
| +32 | 8 | (reserved) | |

Counter: qword_126F8F8, stats label "templ param supplement".

Type Qualifier Encoding

CV-qualifiers are not stored as separate type nodes (unlike some compiler designs). Instead, they are encoded as bit flags within the type node itself. The primary qualifier storage is at offset +134:

Byte at type+134 (type_qual_flags):
  bit 0 (0x01)   const
  bit 1 (0x02)   volatile

The function clear_type_qualifier_bits (sub_5E3580) performs *(a1+134) &= 0xFC to strip both const and volatile.
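
The mask arithmetic is small enough to state as a value-level sketch. The constant and function names are hypothetical; only the bit positions and the 0xFC mask come from the description above.

```c
#include <stdint.h>

enum { TQ_CONST = 0x01, TQ_VOLATILE = 0x02 };

/* Value-level equivalent of *(a1+134) &= 0xFC: drop const and volatile,
   keep the upper six bits untouched. */
static uint8_t clear_cv_qualifiers(uint8_t qual_flags) {
    return (uint8_t)(qual_flags & 0xFC);
}
```

0xFC is simply the complement of TQ_CONST | TQ_VOLATILE within one byte, so any other bits stored at +134 survive the operation.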

Additional qualifier information is accessed through the prefix flags byte at ptr[-8]:

  • bit 5 (0x20): __restrict qualifier (has_restrict_qualifier, sub_7A7850)
  • bit 6 (0x40): volatile qualifier duplicate (has_volatile_qualifier, sub_7A7890)

The function get_cv_qualifiers (sub_7A9E70, 319 call sites) accumulates cv-qualifier bits by walking through typedef chains, applying a & 0x7F mask to collect all qualifier bits from each layer.

Type Query Function Catalog

The types.c file (address range 0x7A4940-0x7C02A0) contains approximately 250 functions. Of these, 128 are tiny leaf functions that query type node properties. They follow a canonical pattern:

// Canonical type query pattern
bool is_<property>_type(type_node *type) {
    while (type->type_kind == 12)        // skip typedefs
        type = type->referenced_type;
    return type->type_kind == <expected>;  // or flag check
}
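
A concrete instance of the pattern, using a reduced two-field node instead of the full 176-byte layout. The struct and function names are illustrative; the kind values (12 for tk_typeref, 9-11 for struct/class/union) come from the type kind table.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced model of a type node: only the two fields the pattern touches. */
typedef struct type_node {
    uint8_t type_kind;                   /* +132 in the real node */
    struct type_node *referenced_type;   /* +144 in the real node */
} type_node;

/* Follow tk_typeref (kind 12) links until a non-typedef node is reached. */
static const type_node *strip_typedefs(const type_node *t) {
    while (t->type_kind == 12)
        t = t->referenced_type;
    return t;
}

/* Kind test after typedef stripping: struct (9), class (10), union (11). */
static int is_class_or_struct_or_union(const type_node *t) {
    t = strip_typedefs(t);
    return t->type_kind >= 9 && t->type_kind <= 11;
}
```

A typedef of a typedef of a struct therefore answers the query exactly as the struct itself would.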

Most-Referenced Query Functions

Sorted by call site count across the entire binary:

| Callers | Function | Address | Test |
| --- | --- | --- | --- |
| 407 | is_class_or_struct_or_union_type | 0x7A8A30 | kind in {9, 10, 11} |
| 389 | type_pointed_to | 0x7A9910 | kind==6, return +144 |
| 319 | get_cv_qualifiers | 0x7A9E70 | accumulate qualifier bits (& 0x7F) |
| 299 | is_dependent_type | 0x7A6B60 | bit 5 of byte +133 |
| 243 | is_object_pointer_type | 0x7A7630 | kind==6 && !(bit 0 of +152) |
| 221 | is_array_type | 0x7A8370 | kind==8 |
| 199 | is_member_pointer_or_ref | 0x7A7B30 | kind==6 && (bit 0 of +152) |
| 185 | is_reference_type | 0x7A6AC0 | kind==7 |
| 169 | is_function_type | 0x7A8DC0 | kind==14 |
| 140 | is_void_type | 0x7A6E90 | kind==1 |
| 126 | array_element_type (deep) | 0x7A9350 | strips arrays+typedefs recursively |
| 85 | is_enum_type | 0x7A7010 | kind==2 (with scoped check) |
| 82 | is_integer_type | 0x7A71B0 | kind==2 |
| 77 | is_member_pointer_flag | 0x7A7810 | kind==6, bit 0 of +152 |
| 76 | is_pointer_to_member_type | 0x7A8D90 | kind==13 |
| 70 | is_long_double_type | 0x7A73F0 | kind==5 |
| 62 | is_scoped_enum_type | 0x7A70F0 | kind==2, bit 3 of +145 |
| 56 | is_rvalue_reference_type | 0x7A6EF0 | kind==19 |

Typedef Stripping Functions

Six functions strip typedef layers with different stopping conditions:

| Function | Address | Behavior |
| --- | --- | --- |
| skip_typedefs | 0x7A68F0 | Strips all typedef layers, preserves cv-qualifiers |
| skip_named_typedefs | 0x7A6930 | Stops at unnamed typedefs |
| skip_to_attributed_typedef | 0x7A6970 | Stops at typedef with attribute flag |
| skip_typedefs_and_attributes | 0x7A69C0 | Strips both typedefs and attributed-typedefs |
| skip_to_elaborated_typedef | 0x7A6A10 | Stops at typedef with elaborated flag |
| skip_non_attributed_typedefs | 0x7A6A70 | Stops at typedef with any attribute bits |

Compound Type Predicates

| Function | Address | Type Kinds |
| --- | --- | --- |
| is_arithmetic_type | 0x7A7560 | {2, 3, 4, 5} |
| is_scalar_type | 0x7A7BA0 | {2, 3, 4, 5, 6(non-member), 13, 19, 20} |
| is_aggregate_type | 0x7A8B40 | {8, 9, 10, 11} |
| is_floating_point_type | 0x7A7300 | {3, 4, 5} |
| is_pack_or_auto_type | 0x7A7420 | {16, 17, 18} |
| is_pack_expansion_type | 0x7A6BE0 | {16, 17} |
| is_complete_type | 0x7A6DA0 | Not void, not reference, not incomplete class |
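
For a reimplementation, kind sets like these can be expressed as bitmasks over the 22 kind values. This is one possible encoding chosen for illustration, not a claim about how the binary implements the predicates; all names are hypothetical and typedef stripping is assumed to have happened already.

```c
#include <stdint.h>

#define KIND_BIT(k) (UINT32_C(1) << (k))

/* Kind sets from the table above, one bit per type kind. */
static const uint32_t ARITHMETIC_KINDS = KIND_BIT(2) | KIND_BIT(3) | KIND_BIT(4) | KIND_BIT(5);
static const uint32_t AGGREGATE_KINDS  = KIND_BIT(8) | KIND_BIT(9) | KIND_BIT(10) | KIND_BIT(11);
static const uint32_t FLOATING_KINDS   = KIND_BIT(3) | KIND_BIT(4) | KIND_BIT(5);

/* Membership test: 1 if kind is in the set, 0 otherwise. */
static int kind_in(uint32_t set, unsigned kind) {
    return kind < 32 && ((set >> kind) & 1u);
}
```

With this encoding a compound predicate becomes a single shift-and-mask, and subset relations such as floating-point being contained in arithmetic can be checked with a bitwise AND.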

Duplicate Functions

EDG uses distinct function names for semantic clarity even when the implementation is identical. The toolchain that built cudafe++ did not fold these identical bodies, so each copy keeps its own address:

  • 0x7A7630 == 0x7A7670 == 0x7A7750 (all: is_non_member_pointer / is_object_pointer_type)
  • 0x7A7B00 == 0x7A7B70 (both: is_pointer_type)
  • 0x7A78D0 == 0x7A7910 (both: is_non_const_ref)

Type Construction

alloc_type (sub_5E3D40)

The primary type allocation function. Takes a single argument: the type kind. Returns a pointer to a fully-initialized 176-byte type node with the appropriate supplement structure allocated and linked.

Protocol:

  1. Trace enter (if dword_126EFC8 set)
  2. Allocate the node via region_alloc in the file-scope region (176-byte body plus the 16-byte prefix, 192 bytes raw)
  3. Write 16-byte prefix (TU copy addr, next link, flags byte)
  4. Increment qword_126F8E0 (type counter)
  5. Copy 96-byte common IL header from template globals
  6. Set default access to 1 at +148
  7. Dispatch set_type_kind switch for the requested kind
  8. Trace leave (if tracing)
  9. Return raw + 16
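
The protocol above can be condensed into a simplified model. This sketch assumes calloc in place of the region allocator, leaves the prefix zeroed instead of writing the flags byte, omits tracing, and reduces set_type_kind to its final kind store; alloc_type_sketch, header_template, and type_count are hypothetical stand-ins.

```c
#include <stdlib.h>
#include <string.h>

enum { PREFIX_SIZE = 16, NODE_SIZE = 176, HEADER_SIZE = 96 };

static unsigned char header_template[HEADER_SIZE]; /* stands in for xmmword_126F6A0..126F6F0 */
static unsigned long type_count;                    /* stands in for qword_126F8E0 */

static void *alloc_type_sketch(unsigned char kind) {
    /* Steps 2-3: raw block with a zeroed 16-byte prefix. */
    unsigned char *raw = (unsigned char *)calloc(1, PREFIX_SIZE + NODE_SIZE);
    if (!raw)
        return NULL;
    unsigned char *node = raw + PREFIX_SIZE;

    ++type_count;                                   /* step 4 */
    memcpy(node, header_template, HEADER_SIZE);     /* step 5 */
    node[148] = 1;                                  /* step 6: default access */
    node[132] = kind;                               /* step 7, reduced to the kind store */
    return node;                                    /* step 9: raw + 16 */
}
```

Because the caller only ever sees raw + 16, releasing a node in this model means passing (unsigned char *)node - 16 back to free.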

init_type_fields (sub_5E3590)

Re-initializes an existing type node in-place without allocating new memory. Used when a type node needs to change kind after initial allocation (rare but occurs during template instantiation). Copies the template header and dispatches the same set_type_kind switch.

make_cv_combined_type (sub_7A6320)

Constructs a new type that combines cv-qualifiers from two source types. Recursively handles arrays (recurses on element type) and pointer-to-member (recurses on member type). Allocates a fresh type node via alloc_type, copies the base type via sub_5DA0A0, then applies the combined qualifiers via sub_5D64F0.

Type Comparison

types_are_identical (sub_7AA150)

The main type comparison function (636 lines). Handles all 22 type kinds with deep structural comparison. For class types, delegates to the class scope comparison infrastructure. For function types, compares parameter lists, return types, and calling conventions.

types_are_equivalent_for_correspondence (sub_7B2260)

A 688-line function used during multi-TU compilation (CUDA RDC mode). Compares types across translation units for structural equivalence, called from verify_class_type_correspondence (sub_7A00D0).

compatible_ms_bit_field_container_types (sub_7C02A0)

The last function in types.c. Checks if two integer types are compatible for MSVC bit-field container layout rules: both must be kind==2 (integer) with matching size at offset +120.

Pointer and Reference Encoding

Pointers use type kind 6 (tk_pointer), with member-pointer status distinguished by flag bits at offset +152:

tk_pointer (kind 6):
  +144  referenced_type   The pointed-to / referenced type
  +152  bit 0 = 0         Object pointer (T*)
        bit 0 = 1         Member pointer (T C::*)
        bit 1             Extended member pointer flag

The types.c query functions use the following kind tests for pointer/reference classification. Note that the kind values tested here correspond to the types.c query numbering (see reconciliation note in the type kind table):

| Query | Kind Test | +152 Test | Matches |
|---|---|---|---|
| is_pointer_type | kind==6 | -- | T*, T C::* |
| is_object_pointer_type | kind==6 | !(bit 0) | T* only |
| is_member_pointer_flag | kind==6 | bit 0 | T C::* only |
| is_reference_type | kind==7 | -- | T& (lvalue reference) |
| is_rvalue_reference_type | kind==19 | -- | T&& |
| is_pointer_to_member_type | kind==13 | -- | T C::* (alternate encoding) |

The pm_class_type (sub_7A9A10) and pm_member_type (sub_7A99D0) access +144 and +152 respectively for kind-13 nodes.
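
The kind and flag tests in the table can be written out as small predicates. This is a sketch under assumptions: type_view is a hypothetical two-field view of a type node (the real offset of the kind byte is not modeled), and only the tests listed in the table are reproduced.

```c
#include <stdbool.h>
#include <stdint.h>

enum { TK_POINTER = 6, TK_REFERENCE = 7, TK_PTR_TO_MEMBER = 13, TK_RVALUE_REF = 19 };

/* Hypothetical reduced view of a type node: only the fields the queries read. */
typedef struct {
    uint8_t kind;       /* type kind (types.c query numbering) */
    uint8_t flags_152;  /* flag byte at +152 */
} type_view;

static bool is_pointer_type(const type_view *t) {
    return t->kind == TK_POINTER;
}
static bool is_object_pointer_type(const type_view *t) {
    return t->kind == TK_POINTER && !(t->flags_152 & 0x01);
}
static bool is_member_pointer_flag(const type_view *t) {
    return t->kind == TK_POINTER && (t->flags_152 & 0x01) != 0;
}
static bool is_reference_type(const type_view *t) {
    return t->kind == TK_REFERENCE;
}
static bool is_rvalue_reference_type(const type_view *t) {
    return t->kind == TK_RVALUE_REF;
}
static bool is_pointer_to_member_type(const type_view *t) {
    return t->kind == TK_PTR_TO_MEMBER;
}
```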

Array Type Encoding

Array types (kind 8) store bounds inline in the type node:

tk_array (kind 8):
  +120  type_size          Total array size in bytes (0 if unknown)
  +128  alignment          Element alignment
  +144  element_type       Pointer to the element type node
  +152  bound_expr         Bound expression pointer (for VLAs and dependent)
  +153  array_flags:
          bit 0 = dependent_bound    (template-dependent array size)
          bit 1 = is_VLA             (C99 variable-length array)
          bit 5 = star_modifier      (C99 [*] syntax)
  +160  numeric_bound      Compile-time bound value (when not VLA/dependent)

The function identical_array_type_level (sub_7A4E10, types.c:6779) compares two array types by checking the VLA flag, dependent flag, and then either bound expressions (via sub_5D2160) or numeric bounds at +160.
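
A minimal sketch of that comparison order follows. The array_view struct and exprs_equal_sketch are hypothetical: the latter stands in for sub_5D2160 and uses pointer identity only, whereas the real routine presumably compares expression trees. The exact order of checks in the binary is an assumption here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced view of a kind-8 type node. */
typedef struct {
    uint8_t     flags;          /* +153: bit 0 = dependent_bound, bit 1 = is_VLA */
    const void *bound_expr;     /* bound expression pointer */
    uint64_t    numeric_bound;  /* +160: compile-time bound */
} array_view;

/* Stand-in for sub_5D2160; pointer identity only in this sketch. */
static bool exprs_equal_sketch(const void *a, const void *b) { return a == b; }

static bool identical_array_level_sketch(const array_view *a, const array_view *b) {
    if ((a->flags & 0x02) != (b->flags & 0x02))   /* VLA flag must match */
        return false;
    if ((a->flags & 0x01) != (b->flags & 0x01))   /* dependent flag must match */
        return false;
    if (a->flags & 0x03)                          /* VLA or dependent: compare expressions */
        return exprs_equal_sketch(a->bound_expr, b->bound_expr);
    return a->numeric_bound == b->numeric_bound;  /* otherwise compare numeric bounds */
}
```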

Class Type Flags

Class types (kinds 9, 10, 11) carry two flag bytes at offsets +161 and +163 in the type node, plus property flags in the class_type_supplement at supplement offset +86:

type_node+161 (class_flags_1)

| Bit | Mask | Field | Query Function |
|---|---|---|---|
| 0 | 0x01 | is_local_class | is_local_class_type (0x7A8EE0) |
| 2 | 0x04 | no_name_linkage | ttt_is_type_with_no_name_linkage (0x7A4B40) |
| 4 | 0x10 | is_template_class | is_template_class_type (0x7A8EA0) |
| 5 | 0x20 | is_anonymous | is_non_anonymous_class_type tests !(bit 5) (0x7A8A90) |
| 7 | 0x80 | has_nested_types | -- |

type_node+163 (class_flags_2)

| Bit | Mask | Field | Query Function |
|---|---|---|---|
| 0 | 0x01 | extern_template_inst | is_empty_class checks this (0x7A range) |
| 3 | 0x08 | alignment_set | -- |
| 4 | 0x10 | is_scoped | is_scoped_union_type (0x7A8B00) |

class_type_supplement+86

| Bit | Mask | Field | Query Function |
|---|---|---|---|
| 0 | 0x01 | has_virtual_bases | class_has_virtual_bases (0x7A8BC0) |
| 3 | 0x08 | has_user_conversion | class_has_user_conversion (0x7A8C00) |

Type Size and Layout

type_size_and_alignment (sub_7A8020, 132 lines) computes the size and alignment of a type for ABI purposes. The computed size is stored at type_node offset +120 and alignment at +128.

For class types, the major layout computation is performed by compute_type_layout (sub_7B6350, 1107 lines), which handles:

  • Base class sub-object placement
  • Virtual base class offsets
  • Member field alignment and padding
  • Bit-field packing (with MSVC compatibility via compatible_ms_bit_field_container_types)
  • Empty base optimization

Integration with Other IL Nodes

Type nodes are referenced from virtually every other IL entity:

| IL Node | Offset | Description |
|---|---|---|
| Variable entity (232B) | +112 | Variable's declared type |
| Field entity (176B) | +112 | Field's declared type |
| Routine entity (288B) | +112 | Function's type (kind 7 with routine_type_supplement) |
| Expression node (72B) | +16 | Expression result type |
| Parameter type (80B) | +8 | Parameter's declared type |
| Constant (184B) | +112 | Constant's type |
| Template argument (64B) | +32 | Type argument value (when kind=0) |

Allocation Statistics

In a typical CUDA compilation, the stats dump (sub_5E99D0) reports type node counts in the thousands. The supplement allocation counts track closely:

type                    176 bytes each   (qword_126F8E0)
integer type supplement  32 bytes each   (qword_126F8E8)
routine type supplement  64 bytes each   (qword_126F958)
class type supplement   208 bytes each   (qword_126F948)
typeref type supplement  56 bytes each   (qword_126F8F0)
templ param supplement   40 bytes each   (qword_126F8F8)
param type               80 bytes each   (qword_126F960, free-list recycled)

Type nodes are always allocated in the file-scope region (persistent for the entire translation unit) because types must outlive any individual function body. This contrasts with expression nodes and statements which can be allocated in per-function regions and freed after each function is processed.

Template Instance Record

The template instance record is the 128-byte structure that represents a pending or completed template instantiation in cudafe++ (EDG 6.6). Every template entity that may require instantiation -- function templates, class member function templates, variable templates -- gets one of these records allocated by alloc_template_instance (sub_7416E0). The records are chained into a singly-linked worklist for function/variable templates (qword_12C7740). A separate worklist of type entries (not instance records) at qword_12C7758 tracks pending class template instantiations. A fixpoint loop at translation-unit end drains both lists, instantiating entities until no new work remains.

This page documents the instance record layout, the master instance info record, the two worklists, the depth-tracking mechanisms, the parser state save/restore during instantiation, and the fixpoint algorithm that ties everything together.

Key Facts

| Property | Value |
|---|---|
| Instance record size | 128 bytes (allocated by sub_7416E0, alloc_template_instance) |
| Master instance info size | 32 bytes (allocated by sub_7416A0, alloc_master_instance_info) |
| Allocation counter (instances) | qword_12C74F0 (incremented on each 128-byte allocation) |
| Allocation counter (master info) | qword_12C74E8 (incremented on each 32-byte allocation) |
| Memory allocator | sub_6BA0D0 (EDG arena allocator) |
| Function/variable worklist head | qword_12C7740 |
| Function/variable worklist tail | qword_12C7738 |
| Class worklist head | qword_12C7758 |
| Fixpoint entry point | sub_78A9D0 (template_and_inline_entity_wrapup) |
| Worklist walker | sub_78A7F0 (do_any_needed_instantiations) |
| Decision gate | sub_774620 (should_be_instantiated) |
| Source file | templates.c (EDG 6.6, path edg/EDG_6.6/src/templates.c) |

Instance Record Layout (128 bytes)

Each record is allocated by alloc_template_instance (sub_7416E0) and zero-initialized. The allocator clears all 128 bytes, then initializes offsets +84 and +92 from qword_126EFB8 (the current source position context). The low nibble of byte +81 is explicitly masked to zero (*(_BYTE *)(result + 81) &= 0xF0).

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | entity_primary | Primary entity pointer (the instantiation's own symbol) |
| +8 | 8 | next | Next entry in the pending worklist (singly-linked) |
| +16 | 8 | inst_info | Pointer to 32-byte master instance info record (see below) |
| +24 | 8 | master_symbol | Canonical template symbol -- the entity being instantiated from |
| +32 | 8 | actual_decl | Declaration entity in the instantiation context |
| +40 | 8 | cached_decl | Cached declaration for function-local templates (partial specialization lookup result) |
| +48 | 8 | referencing_namespace | Namespace that triggered the instantiation (set by determine_referencing_namespace, sub_75D5B0) |
| +56 | 8 | (reserved) | Zero-initialized, usage not observed |
| +64 | 8 | body_flags | Deferred/deleted function body flags |
| +72 | 8 | pre_computed_result | Result from a prior instantiation attempt (non-null skips re-instantiation) |
| +80 | 1 | flags | Status bitfield (see flags table below) |
| +81 | 1 | flags2 | Secondary flags (bit 0 = on_worklist, bit 1 = warning_emitted) |
| +84 | 8 | source_position_1 | Source location context at entry creation (from qword_126EFB8) |
| +92 | 8 | source_position_2 | Second source location context at entry creation (from qword_126EFB8) |
| +104 | 8 | (reserved) | Zero-initialized |
| +112 | 8 | (reserved) | Zero-initialized |
| +120 | 8 | (reserved) | Zero-initialized |
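
The layout can be checked mechanically by writing it as a C struct with compile-time offset assertions. This is a sketch for an LP64 target, not a recovered declaration: the field types are guesses (every 8-byte slot is modeled as an untyped pointer), and the pad_82 and pad_100 members are assumptions that account for the gaps the table leaves at +82 and +100. The two source positions are modeled as byte arrays because they sit at 4-byte-aligned offsets.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct template_instance {
    void    *entity_primary;           /* +0   */
    struct template_instance *next;    /* +8   */
    void    *inst_info;                /* +16  */
    void    *master_symbol;            /* +24  */
    void    *actual_decl;              /* +32  */
    void    *cached_decl;              /* +40  */
    void    *referencing_namespace;    /* +48  */
    void    *reserved_56;              /* +56  */
    void    *body_flags;               /* +64  */
    void    *pre_computed_result;      /* +72  */
    uint8_t  flags;                    /* +80  */
    uint8_t  flags2;                   /* +81  */
    uint8_t  pad_82[2];                /* +82: gap implied by the table */
    uint8_t  source_position_1[8];     /* +84  */
    uint8_t  source_position_2[8];     /* +92  */
    uint8_t  pad_100[4];               /* +100: gap implied by the table */
    void    *reserved_104;             /* +104 */
    void    *reserved_112;             /* +112 */
    void    *reserved_120;             /* +120 */
} template_instance;

_Static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");
_Static_assert(offsetof(template_instance, inst_info) == 16, "inst_info at +16");
_Static_assert(offsetof(template_instance, flags) == 80, "flags at +80");
_Static_assert(offsetof(template_instance, flags2) == 81, "flags2 at +81");
_Static_assert(offsetof(template_instance, source_position_1) == 84, "pos 1 at +84");
_Static_assert(offsetof(template_instance, source_position_2) == 92, "pos 2 at +92");
_Static_assert(offsetof(template_instance, reserved_104) == 104, "reserved at +104");
_Static_assert(sizeof(template_instance) == 128, "record is 128 bytes");
```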

Allocator Pseudocode

// sub_7416E0 — alloc_template_instance
template_instance_t *alloc_template_instance(void) {
    if (debug_tracing_enabled)
        trace_enter(5, "alloc_template_instance");

    template_instance_t *rec = arena_alloc(128);   // sub_6BA0D0
    alloc_count_instances++;                        // qword_12C74F0

    // Zero all fields
    rec->entity_primary    = NULL;    // +0
    rec->next              = NULL;    // +8
    rec->inst_info         = NULL;    // +16
    rec->master_symbol     = NULL;    // +24
    rec->actual_decl       = NULL;    // +32
    rec->cached_decl       = NULL;    // +40
    rec->ref_namespace     = NULL;    // +48
    rec->reserved_56       = NULL;    // +56
    rec->body_flags        = NULL;    // +64
    rec->precomputed       = NULL;    // +72
    rec->flags             = 0;       // +80
    rec->flags2           &= 0xF0;   // +81: clear low nibble
    rec->source_pos_1      = current_source_context;  // +84 from qword_126EFB8
    rec->source_pos_2      = current_source_context;  // +92 from qword_126EFB8
    rec->reserved_104      = NULL;    // +104
    rec->reserved_112      = NULL;    // +112
    rec->reserved_120      = NULL;    // +120

    if (debug_tracing_enabled)
        trace_leave();
    return rec;
}

Flags Byte at +80

Six bits are used. The byte is written by update_instantiation_required_flag (sub_7770E0) and read by do_any_needed_instantiations (sub_78A7F0) and should_be_instantiated (sub_774620).

| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | instantiation_required | Entity needs instantiation (set by update_instantiation_required_flag) |
| 1 | 0x02 | not_needed | Entity was determined not to need instantiation (skip on worklist walk) |
| 3 | 0x08 | explicit_instantiation | Explicit template declaration triggered this entry |
| 4 | 0x10 | suppress_auto | Auto-instantiation suppressed (from extern template declaration) |
| 5 | 0x20 | excluded | Entity excluded from instantiation set |
| 7 | 0x80 | can_be_instantiated_checked | Pre-check (f_entity_can_be_instantiated) already performed; skip redundant check |

Flags Byte at +81

| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | on_worklist | Entry has been linked into the pending worklist |
| 1 | 0x02 | warning_emitted | Depth-limit warning already emitted for this entry |

The on_worklist bit at +81 bit 0 is the guard that prevents double-insertion into the linked list. When add_to_instantiations_required_list sets up the linked list linkage (at qword_12C7740/qword_12C7738), it checks this bit first and sets it afterward. If the bit is already set, the function takes the "already on worklist" path which may set the new_instantiations_needed fixpoint flag (dword_12C771C = 1) instead.

Master Instance Info Record (32 bytes)

Each template entity (class, function, or variable template) has exactly one master instance info record, allocated by alloc_master_instance_info (sub_7416A0). This record is shared across all instantiations of the same template and is stored at the template's associated scope info (scope_assoc + 16). The link between a 128-byte instance record and its master info is at instance +16.

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 8 | next | Next master info in chain (linked list) |
| +8 | 8 | back_pointer | Pointer back to the template instance record that owns this info |
| +16 | 8 | associated_scope | Pointer to the associated scope/translation-unit data |
| +24 | 4 | pending_count | Number of pending instantiations of this template (incremented/decremented by update_instantiation_required_flag) |
| +28 | 1 | flags | Status bits (low nibble cleared on allocation) |

Master Info Flags Byte at +28

| Bit | Mask | Name | Meaning |
|---|---|---|---|
| 0 | 0x01 | blocked | Instantiation blocked (dependency cycle or extern template) |
| 2 | 0x04 | has_instances | At least one instantiation has been completed |
| 3 | 0x08 | debug_checked | Already checked by debug tracing path |

Allocator Pseudocode

// sub_7416A0 — alloc_master_instance_info
master_instance_info_t *alloc_master_instance_info(void) {
    master_instance_info_t *info = arena_alloc(32);   // sub_6BA0D0
    alloc_count_master_info++;                         // qword_12C74E8

    info->next             = NULL;    // +0
    info->back_pointer     = NULL;    // +8
    info->associated_scope = NULL;    // +16
    info->pending_count    = 0;       // +24
    info->flags           &= 0xF0;   // +28: clear low nibble

    return info;
}

find_or_create_master_instance (sub_753550)

This function connects a 128-byte instance record to its shared master info. It looks up the template's scope association, checks whether a master info record already exists at scope_assoc + 16, and creates one if absent.

// sub_753550 — find_or_create_master_instance
void find_or_create_master_instance(template_instance_t *inst) {
    entity_t *sym = inst->master_symbol;            // inst[3], offset +24
    scope_t  *scope = resolve_template_scope(sym);  // sub_73DE50

    // Find the template's canonical entity
    entity_t *canonical;
    if (is_variable(sym))                           // ((kind - 7) & 0xFD) == 0
        canonical = *find_variable_correspondence(scope);   // sub_79AAA0
    else
        canonical = *find_function_correspondence(scope);   // sub_79FD80
    assert(canonical != NULL, "find_or_create_master_instance");

    scope_assoc_t *assoc = canonical->scope_assoc;  // offset +96
    master_instance_info_t *info = assoc->master_info;  // assoc + 16

    if (info == NULL) {
        // First instantiation of this template — allocate master info
        info = alloc_master_instance_info();         // sub_7416A0
        info->back_pointer = inst;                   // info[1] = inst

        if (sym != inst->actual_decl) {
            // Class members: add to secondary deferred list
            // qword_12C7750 / qword_12C7748
            append_to_deferred_list(info);
        }
        assoc->master_info = info;                   // assoc + 16

        if (debug_tracing_enabled) {
            trace("find_or_create_master_instance: symbol:");
            print_symbol(inst->master_symbol);
        }
    }

    inst->inst_info = info;                          // inst[2], offset +16
}

The Two Worklists

Template instantiation uses two separate worklists -- one for class templates, one for function/variable templates. This separation is fundamental to correctness: class templates must be instantiated before function templates within each fixpoint iteration, because function template bodies may reference members of class template instantiations.

Function/Variable Worklist (qword_12C7740)

| Global | Purpose |
|---|---|
| qword_12C7740 | Head of the singly-linked list |
| qword_12C7738 | Tail pointer (for O(1) append) |

Entries are 128-byte instance records linked through +8 (next pointer). New entries are appended at the tail by add_to_instantiations_required_list (the tail section of sub_7770E0).

Class Worklist (qword_12C7758)

This list holds type entries (not 128-byte instance records) that need class template instantiation. Entries are linked through offset +0 of the type entry. The list is populated by update_instantiation_flags (sub_789EF0) and drained by template_and_inline_entity_wrapup (sub_78A9D0).

Worklist Insertion: add_to_instantiations_required_list

The tail portion of update_instantiation_required_flag (sub_7770E0, starting at the label checking inst->flags2 & 0x01) implements worklist insertion:

// Tail of sub_7770E0 — add_to_instantiations_required_list
void add_to_instantiations_required_list(template_instance_t *inst) {
    if (inst->flags2 & 0x01) {
        // Already on the worklist — do not re-add.
        // But if instantiation mode is active and instantiation_required is set,
        // signal that a new fixpoint pass is needed.
        if (instantiation_mode_active
            && (inst->flags & 0x01)
            && inst->inst_info != NULL
            && !(inst->inst_info->flags & 0x01))     // not blocked
        {
            new_instantiations_needed = 1;            // dword_12C771C
            tu_ptr->needs_recheck = 1;                // TU + 393
        }
        return;
    }

    // Link into the function/variable worklist
    if (pending_list_head)                            // qword_12C7740
        pending_list_tail->next = inst;               // qword_12C7738->next
    else
        pending_list_head = inst;

    pending_list_tail = inst;
    inst->flags2 |= 0x01;                            // mark as on worklist

    // Verify correct translation unit
    tu_t *tu = trans_unit_for_symbol(inst->master_symbol);  // sub_741960
    assert(tu == current_tu,
           "add_to_instantiations_required_list: symbol for wrong translation unit");
}
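
The guarded tail-append at the core of this routine can be isolated into a small runnable sketch. The inst struct, pending_head, pending_tail, and enqueue_once are hypothetical names; the fixpoint signalling and the translation-unit assertion are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced instance record: link field plus the +81 flag byte. */
typedef struct inst {
    struct inst *next;
    uint8_t      flags2;   /* bit 0 = on_worklist */
} inst;

static inst *pending_head;   /* stands in for qword_12C7740 */
static inst *pending_tail;   /* stands in for qword_12C7738 */

/* Append e at the tail unless it is already linked. Returns true if appended. */
static bool enqueue_once(inst *e) {
    if (e->flags2 & 0x01)
        return false;            /* guard bit set: never link twice */
    e->next = NULL;
    if (pending_head)
        pending_tail->next = e;  /* O(1) append through the tail pointer */
    else
        pending_head = e;
    pending_tail = e;
    e->flags2 |= 0x01;           /* mark as on worklist */
    return true;
}
```

Keeping the guard bit in the record itself makes the membership test O(1) and avoids scanning the list before every append.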

The Fixpoint Loop

The fixpoint loop is the algorithm that drives all template instantiation in cudafe++. It runs at the end of each translation unit, after parsing is complete, and iterates until no new instantiation work remains. The entry point is template_and_inline_entity_wrapup (sub_78A9D0).

Algorithm

template_and_inline_entity_wrapup (sub_78A9D0):

    assert(tu_stack_top == 0)          // qword_106BA18: not nested in another TU
    assert(compilation_phase == 2)     // dword_126EFB4: full compilation mode

    LOOP:
        FOR EACH translation_unit IN tu_list (qword_106B9F0):

            set_up_tu_context(tu)      // sub_7A3EF0

            // PHASE 1 — Class templates (from qword_12C7758)
            for entry in class_worklist:
                if is_dependent_type(entry)       continue
                if !is_class_or_struct(entry)     continue
                f_instantiate_template_class(entry)

            // PHASE 2 — Enable instantiation mode
            instantiation_mode_active = 1         // dword_12C7730

            // PHASE 3 — Function/variable templates
            do_any_needed_instantiations()        // sub_78A7F0

            tear_down_tu_context()                // sub_7A3F70

        // PHASE 4 — Check fixpoint condition
        new_instantiations_needed = 0             // dword_12C771C
        FOR EACH translation_unit IN tu_list:
            if tu->needs_recheck:                 // TU + 393
                tu->needs_recheck = 0
                set_up_tu_context(tu)
                do_any_needed_instantiations()
                // process inline entities
                tear_down_tu_context()
                additional_pass_needed = 1        // dword_12C7718

            if new_instantiations_needed:
                GOTO LOOP                         // restart fixpoint

The fixpoint is necessary because instantiating one template can trigger references to other not-yet-instantiated templates. For example, instantiating std::vector<Foo> may require instantiating std::allocator<Foo>, Foo's copy constructor, comparison operators, and any other templates used in std::vector's implementation. Each such reference may add a new entry to the worklist, which the next pass will discover and process. The loop terminates when a complete pass produces no new entries.
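
The termination behaviour can be illustrated with a toy model. This is a sketch, not the EDG algorithm: entities are array indices, needs[i] names the single entity that instantiating i references (or -1), and run_fixpoint is a hypothetical name. A pass that marks an entity behind the current position must be followed by another pass.

```c
#include <stdbool.h>

/* Repeats forward passes until a pass discovers no work behind the cursor.
   Returns the number of passes performed. */
static int run_fixpoint(const int *needs, bool *required, bool *done, int n) {
    int passes = 0;
    bool new_work;
    do {
        new_work = false;
        passes++;
        for (int i = 0; i < n; i++) {
            if (!required[i] || done[i])
                continue;
            done[i] = true;                 /* "instantiate" entity i */
            int d = needs[i];
            if (d >= 0 && !required[d]) {
                required[d] = true;         /* instantiation triggered a new reference */
                if (d < i)
                    new_work = true;        /* already walked past it: needs another pass */
            }
        }
    } while (new_work);
    return passes;
}
```

With needs = {3, -1, -1, 1} and only entity 0 initially required, the first pass handles 0 and 3, marks 1 as required after the cursor has passed it, and a second pass picks it up. Entity 2 is never referenced and is never processed.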

Worklist Walker: do_any_needed_instantiations

sub_78A7F0 performs a linear walk over the function/variable worklist. For each entry, it applies a series of rejection filters, and if the entry passes all of them, dispatches to instantiate_template_function_full (sub_775E00).

// sub_78A7F0 — do_any_needed_instantiations
void do_any_needed_instantiations(void) {
    template_instance_t *entry = pending_list_head;   // qword_12C7740

    while (entry) {
        // 1. Already processed?
        if (entry->flags & 0x02) {                    // not_needed
            entry = entry->next;
            continue;
        }

        // 2. Get master instance info
        master_instance_info_t *info = entry->inst_info;  // offset +16
        assert(info != NULL, "do_any_needed_instantiations");

        // 3. Blocked by dependency?
        if (info->flags & 0x01) {                     // blocked
            entry = entry->next;
            continue;
        }

        // 4. Check if debug-verified
        if (!(info->flags & 0x08))                    // not debug_checked
            f_is_static_or_inline(entry);             // sub_756B40

        // 5. Pre-check if not already done
        if (!(entry->flags & 0x80))                   // can_be_instantiated not checked
            f_entity_can_be_instantiated(entry);      // sub_7574B0

        // 6. Mode filter
        if (compilation_mode != 1                     // dword_106C094
            && !(entry->flags & 0x01))                // not instantiation_required
        {
            entry = entry->next;
            continue;
        }

        // 7. Blocked after pre-check?
        if (info->flags & 0x01) {
            entry = entry->next;
            continue;
        }

        // 8. Decision gate
        if (!should_be_instantiated(entry, 1)) {      // sub_774620
            entry = entry->next;
            continue;
        }

        // 9. Instantiate
        instantiate_template_function_full(entry, 1); // sub_775E00
        entry = entry->next;
    }
}

The walk is a simple forward traversal. Because insertion always appends at the tail, entries added during instantiation (by new add_to_instantiations_required_list calls from within instantiated function bodies) land after the current position and are visited on this same pass. Entries the walk has already passed, whose status changes after they were skipped, are caught by the next fixpoint iteration.

Decision Gate: should_be_instantiated

sub_774620 (326 lines) is the final filter before instantiation. It implements a nine-step rejection chain. An entity must pass every step to be instantiated.

// sub_774620 — should_be_instantiated
bool should_be_instantiated(template_instance_t *inst, int check_implicit) {
    master_instance_info_t *info = inst->inst_info;     // +16
    assert(info != NULL, "should_be_instantiated");

    // 1. Blocked?
    if (info->flags & 0x01)          return false;

    // 2. Excluded?
    if (inst->flags & 0x20)          return false;      // excluded

    // 3. Explicit but not required?
    if (inst->flags & 0x08) {                           // explicit_instantiation
        if (!(inst->flags & 0x01))   return false;      // not marked required
    }

    // 4. Not required and not in normal mode?
    if (!(inst->flags & 0x01) && compilation_mode != 1)
        return false;

    // 5. Has valid master_symbol?
    entity_t *master = inst->master_symbol;             // +24
    if (!master)                     return false;

    // 6. Entity kind check
    //    Function: kind 10/11 (member), kind 9 (namespace-scope)
    //    Variable: kind 7
    //    class-local function: kind 17 (lambda)
    int kind = master->kind;                            // master + 80
    // ... kind-specific filtering ...

    // 7. Template body available?
    //    Check that the template has a cached body to replay
    if (!has_template_body(inst))     return false;

    // 8. Implicit include?
    if (check_implicit && implicit_include_enabled) {
        do_implicit_include_if_needed(inst);            // sub_754A70
        // re-check body availability after include
    }

    // 9. Depth limit warning (diagnostics 489/490)
    if (approaching_depth_limit(inst)) {
        if (!(inst->flags2 & 0x02)) {                   // warning not yet emitted
            emit_warning(489 or 490, inst->master_symbol);
            inst->flags2 |= 0x02;                       // mark warning emitted
        }
        return false;
    }

    return true;
}
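
Steps 1 through 4 depend only on flag bytes and the compilation mode, so they can be isolated as a pure function. This is a sketch: passes_flag_filters is a hypothetical name, and the later steps (entity kind, body availability, implicit include, depth warning) are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* inst_flags: byte at instance +80; info_flags: byte at master info +28. */
static bool passes_flag_filters(uint8_t inst_flags, uint8_t info_flags,
                                int compilation_mode) {
    bool required = (inst_flags & 0x01) != 0;

    if (info_flags & 0x01)                      /* 1. blocked */
        return false;
    if (inst_flags & 0x20)                      /* 2. excluded */
        return false;
    if ((inst_flags & 0x08) && !required)       /* 3. explicit but not required */
        return false;
    if (!required && compilation_mode != 1)     /* 4. not required, mode != 1 */
        return false;
    return true;
}
```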

Depth Tracking

Template instantiation depth is tracked at two levels -- a global counter for function templates and a per-type counter for class templates -- plus a pending-instantiation counter that detects runaway expansion.

Function Template Depth: qword_12C76E0

A single global counter incremented on entry to instantiate_template_function_full and decremented on exit. Hard limit: 255 (0xFF).

// Inside sub_775E00 (instantiate_template_function_full):
if (instantiation_depth >= 0xFF) {                // qword_12C76E0
    emit_fatal_error(/* depth exceeded */);
    goto restore_state;
}
instantiation_depth++;
// ... perform instantiation ...
instantiation_depth--;

The 255 limit is a safety valve against infinite recursive template metaprogramming. Consider:

template<int N>
struct factorial {
    static constexpr int value = N * factorial<N-1>::value;
};

Without a depth limit, this template (which has no terminating specialization such as factorial<0>) would recurse without bound, and even a properly terminated factorial<256> would nest 256 levels deep, each level re-entering the parser to process the template body. At 255, EDG aborts with a fatal error rather than risk a stack overflow. The C++ standard (Annex B) recommends that implementations support at least 1,024 recursively nested template instantiations, but EDG defaults to 255 as a practical limit -- configurable via qword_106BD10.
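
The increment/check/decrement discipline can be exercised with a small recursive model. This is a sketch only: instantiate_nested is a hypothetical name, the recursion stands in for nested parser re-entry, and a false return stands in for the fatal error.

```c
#include <stdbool.h>

enum { MAX_DEPTH = 0xFF };

static unsigned instantiation_depth;   /* stands in for qword_12C76E0 */

/* Simulates an instantiation that triggers nested_levels further nested ones.
   Returns false if the depth limit is hit anywhere in the chain. */
static bool instantiate_nested(unsigned nested_levels) {
    if (instantiation_depth >= MAX_DEPTH)
        return false;                  /* limit reached before entering */
    instantiation_depth++;
    bool ok = (nested_levels == 0) || instantiate_nested(nested_levels - 1);
    instantiation_depth--;             /* always balanced, even on failure */
    return ok;
}
```

In this model a chain of 255 nested entries succeeds and the 256th is refused; the counter returns to zero in both cases because the decrement runs on every exit path.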

Class Template Depth: Per-Type Counter at type_entry + 56

Each class type entry has its own depth counter at offset +56. The limit is read from qword_106BD10 (the same configurable limit, typically 500). This per-type design is critical: it prevents one deeply-nested class hierarchy from blocking all other class instantiations.

// Inside sub_777CE0 (f_instantiate_template_class):
uint32_t depth = type_entry->depth_counter;       // type_entry + 56
if (depth >= max_depth_limit) {                   // qword_106BD10
    emit_error(456, decl);
    type_entry->flags |= 0x01;                    // mark completed
    goto restore_state;
}
type_entry->depth_counter++;
// ... perform class instantiation ...
type_entry->depth_counter--;

Pending Instantiation Counter: increment/decrement/too_many

Three functions manage a per-type pending-instantiation counter that detects exponential expansion of template instantiation work.

increment_pending_instantiations (sub_75D740): dispatches on the entity kind byte at entity + 80 to locate the owning type entry, then increments the counter at type_entry + 56.

decrement_pending_instantiations (sub_75D7C0): mirror of the above, decrements.

too_many_pending_instantiations (sub_75D6A0): compares the counter against qword_106BD10. If the threshold is met, emits diagnostic 456 and returns true to abort the instantiation.

// sub_75D6A0 — too_many_pending_instantiations
bool too_many_pending_instantiations(entity_t *entity, entity_t *context,
                                     source_pos_t *pos) {
    type_entry_t *type = resolve_owning_type(entity);  // dispatch on entity->kind
    assert(type != NULL, "too_many_pending_instantiations");

    uint32_t count = type->pending_counter;            // type + 56
    if (count >= max_depth_limit) {                    // qword_106BD10
        emit_error(456, pos, context);
        return true;
    }
    return false;
}

The entity-kind dispatch is identical across all three functions:

| Entity Kind (byte +80) | Type Entry | Resolution |
|---|---|---|
| 4, 5 (template member function) | entity->scope_assoc->field_80 | *(entity->assoc + 96)->offset_80 |
| 6 (type alias template) | entity->scope_assoc->field_32 | *(entity->assoc + 96)->offset_32 |
| 9, 10 (namespace function, class) | entity->scope_assoc->field_56 | *(entity->assoc + 96)->offset_56 |
| 19-22 (class member types) | entity->type_info | entity->offset_88 |

Depth Limit Counter at type_entry + 432

Inside update_instantiation_required_flag (sub_7770E0), a secondary counter at type_entry + 432 (a 16-bit word) tracks how many times an entity's instantiation-required flag has been toggled. When the counter reaches 200, diagnostic 599 is emitted once as a warning; from that toggle onward (counter >= 200) the update is skipped entirely. This catches oscillating patterns where two mutually-dependent templates keep adding and removing each other from the worklist.

// Inside sub_7770E0, compilation_mode == 1 path:
if (!setting_required && is_function_or_variable(master_symbol)) {
    int16_t toggle_count = *(int16_t *)(type_entry + 432);
    toggle_count++;
    *(int16_t *)(type_entry + 432) = toggle_count;
    if (toggle_count == 200)
        emit_warning(599, actual_decl);
    if (toggle_count > 199)
        return;  // stop oscillating
}
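
The interaction of the two comparisons is easy to misread, so here is a runnable model of the counter. The type_entry_sketch struct, note_toggle, and count_allowed are hypothetical names; the warning is modeled as a counter rather than a diagnostic.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int16_t toggle_count;   /* stands in for the 16-bit word at type_entry + 432 */
    int     warnings;       /* number of times diagnostic 599 would fire */
} type_entry_sketch;

/* Records one toggle. Returns true if the update may proceed. */
static bool note_toggle(type_entry_sketch *t) {
    t->toggle_count++;
    if (t->toggle_count == 200)
        t->warnings++;              /* fires exactly once, on the 200th toggle */
    if (t->toggle_count > 199)
        return false;               /* 200th and later toggles are skipped */
    return true;
}

/* Applies n toggles and returns how many were allowed through. */
static int count_allowed(type_entry_sketch *t, int n) {
    int allowed = 0;
    for (int i = 0; i < n; i++)
        if (note_toggle(t))
            allowed++;
    return allowed;
}
```

Out of 250 toggles, the first 199 proceed, the 200th emits the single warning and is skipped, and the remaining 50 are skipped silently.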

Parser State Save/Restore During Instantiation

Template instantiation re-enters the parser: the compiler replays the cached template body tokens with substituted types. This means the parser's global state -- scope indices, current token, source position, declaration context -- must be saved before instantiation and restored afterward. EDG uses movups/movaps SSE instructions to bulk-save/restore this state in 128-bit chunks.

Why SSE?

The global parser state variables are ordinary integers, pointers, and flags laid out at consecutive addresses. The compiler's register allocator (or manual optimization) packs adjacent globals into 128-bit SSE loads/stores, saving 4 or more individual mov instructions per save/restore. This is not a quirk of the architecture -- it is a deliberate performance optimization for a hot path. Template-heavy C++ codebases (Boost, STL, Eigen) can trigger thousands of instantiations, each requiring a state save/restore pair.

Function Instantiation: 4 SSE Registers

instantiate_template_function_full (sub_775E00) saves and restores 4 SSE registers covering 64 bytes of parser state at addresses 0x106C380--0x106C3B0.

Save on entry (before any parser re-entry):
    local[0] = xmmword_106C380      // 16 bytes: parser scope context
    local[1] = xmmword_106C390      // 16 bytes: token stream state
    local[2] = xmmword_106C3A0      // 16 bytes: scope nesting info
    local[3] = xmmword_106C3B0      // 16 bytes: auxiliary flags

Also saved as individual scalars:
    saved_source_pos     = qword_126DD38
    saved_source_col     = WORD2(qword_126DD38)
    saved_diag_pos       = qword_126EDE8
    saved_diag_col       = WORD2(qword_126EDE8)

Restore on exit (always, even on error path):
    xmmword_106C380 = local[0]
    xmmword_106C390 = local[1]
    xmmword_106C3A0 = local[2]
    xmmword_106C3B0 = local[3]
    qword_126DD38   = saved_source_pos
    qword_126EDE8   = saved_diag_pos
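
At the source level, a save/restore of this shape is typically just a struct copy into a local and back; optimizing x86-64 compilers commonly lower such copies to 16-byte SSE moves. The sketch below assumes that pattern. The parser_state type, the globals, and instantiate_with_saved_state are hypothetical names, and the callback stands in for the parser re-entry.

```c
#include <stdint.h>
#include <string.h>

/* 64 bytes = the four 16-byte chunks saved by the function-instantiation path. */
typedef struct { uint8_t bytes[64]; } parser_state;

static parser_state g_parser_state;   /* stands in for 0x106C380..0x106C3B0 */
static uint64_t     g_source_pos;     /* stands in for qword_126DD38 */

/* Runs body() and restores the global parser state afterwards. */
static void instantiate_with_saved_state(void (*body)(void)) {
    parser_state saved_state = g_parser_state;   /* bulk copy to the stack */
    uint64_t     saved_pos   = g_source_pos;

    body();                                      /* may clobber any global state */

    g_parser_state = saved_state;                /* restore on every exit */
    g_source_pos   = saved_pos;
}

/* Example body that overwrites everything, as a re-entered parser would. */
static void clobber_state(void) {
    memset(&g_parser_state, 0xAB, sizeof g_parser_state);
    g_source_pos = 99;
}
```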

Class Instantiation: 11 + 11 SSE Registers (Conditional)

f_instantiate_template_class (sub_777CE0) saves substantially more state because class body parsing involves deeper parser perturbation -- member declarations, nested types, access specifiers, base class processing, and member template definitions all modify global parser state.

The save is conditional on the current token kind (word_126DD58). If the token kind is between 2 and 8 inclusive (meaning the parser is mid-expression or mid-declaration when the class instantiation is triggered), the full save executes:

Primary state block (always saved when token is 2--8): 11 SSE registers from xmmword_126DC60--xmmword_126DD00, covering 176 bytes of declaration parser state, plus qword_126DD10 (8 bytes).

Save:
    local[0]  = xmmword_126DC60     // declaration context
    local[1]  = xmmword_126DC70     // access specifier state
    local[2]  = xmmword_126DC80     // base class list context
    local[3]  = xmmword_126DC90     // member template tracking
    local[4]  = xmmword_126DCA0     // nested type state
    local[5]  = xmmword_126DCB0     // friend declaration context
    local[6]  = xmmword_126DCC0     // using declaration state
    local[7]  = xmmword_126DCD0     // default argument context
    local[8]  = xmmword_126DCE0     // static assertion state
    local[9]  = xmmword_126DCF0     // concept/requires state
    local[10] = xmmword_126DD00     // template parameter context
    saved_dd10 = qword_126DD10      // additional scalar

Extended state block (saved only when token kind == 8, class definition in progress): 11 more SSE registers from xmmword_126DBA0--xmmword_126DC40, covering 176 bytes, plus qword_126DC50 (8 bytes).

    local[11] = xmmword_126DBA0     // class body parse state
    local[12] = xmmword_126DBB0     // virtual function table context
    local[13] = xmmword_126DBC0     // constructor/destructor tracking
    local[14] = xmmword_126DBD0     // initializer list state
    local[15] = xmmword_126DBE0     // exception specification
    local[16] = xmmword_126DBF0     // noexcept evaluation context
    local[17] = xmmword_126DC00     // member initializer state
    local[18] = xmmword_126DC10     // default member init
    local[19] = xmmword_126DC20     // alignment tracking
    local[20] = unk_126DC30         // padding/layout state
    local[21] = xmmword_126DC40     // class completion state
    saved_dc50 = qword_126DC50      // additional scalar

The conditional save is a performance optimization: when the parser is in a simple context (token kind outside 2--8), the class instantiation only needs to save the 4 SSE registers from xmmword_106C380--xmmword_106C3B0 (same as function instantiation). The full 23-register save is only needed when a class instantiation is triggered mid-parse (e.g., during elaborated type specifier resolution or SFINAE evaluation).

Summary of State Save Areas

| Instantiation Kind | Condition | SSE Registers | Bytes Saved | Address Range |
|---|---|---|---|---|
| Function | Always | 4 | 64 | 0x106C380--0x106C3B0 |
| Class (minimal) | token not 2--8 | 4 | 64 | 0x106C380--0x106C3B0 |
| Class (mid-declaration) | token 2--8 | 4 + 11 | 64 + 184 | 0x106C380--0x106C3B0 + 0x126DC60--0x126DD10 |
| Class (mid-class-body) | token == 8 | 4 + 11 + 12 | 64 + 184 + 200 | All three ranges |
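
The three-way decision can be sketched in C as follows. This is an illustration only: the names save_level, class_save_level, and saved_bytes are hypothetical, and the byte totals are copied from the summary table rather than measured independently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the binary only exposes word_126DD58 and the xmmword ranges. */
enum save_level {
    SAVE_MINIMAL  = 0,  /* xmmword_106C380..xmmword_106C3B0 only */
    SAVE_PRIMARY  = 1,  /* plus xmmword_126DC60..xmmword_126DD00 and qword_126DD10 */
    SAVE_EXTENDED = 2   /* plus xmmword_126DBA0..xmmword_126DC40 and qword_126DC50 */
};

/* Map the current token kind (word_126DD58) to the amount of state to snapshot. */
static enum save_level class_save_level(int token_kind)
{
    if (token_kind < 2 || token_kind > 8)
        return SAVE_MINIMAL;
    return token_kind == 8 ? SAVE_EXTENDED : SAVE_PRIMARY;
}

/* Byte totals taken from the summary table. */
static size_t saved_bytes(enum save_level level)
{
    assert(level >= SAVE_MINIMAL && level <= SAVE_EXTENDED);
    size_t total = 64;                 /* 4 SSE registers, always saved */
    if (level >= SAVE_PRIMARY)
        total += 184;                  /* 11 SSE registers + one qword */
    if (level == SAVE_EXTENDED)
        total += 200;                  /* extended block, per the table */
    return total;
}
```

Keeping the decision in one small function makes the cost model explicit: the common path copies 64 bytes, and only a mid-parse instantiation pays for the larger snapshot.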

The update_instantiation_required_flag Function

sub_7770E0 (434 lines) is the central function that decides whether to add a template instance to the worklist. Its name in EDG source is update_instantiation_required_flag, confirmed by the assert string at templates.c:38863 and the debug trace "Setting instantiation_required flag to %s for (options=%d)".

This function is called whenever a template entity's instantiation status changes -- when a template is first referenced, when it is explicitly instantiated, when its definition becomes available, or when an extern template declaration is encountered.

Parameters

void update_instantiation_required_flag(
    template_instance_t *inst,     // a1: the instance record
    bool                 setting,  // a2 (int cast): true = mark required, false = unmark
    unsigned int         options   // a3: bitmask controlling behavior
);

Options Bitmask

| Bit | Mask | Meaning |
|---|---|---|
| 0 | 0x01 | Force worklist addition even without inline body |
| 1 | 0x02 | Unmarking: decrement pending count and clear flags |
| 2 | 0x04 | Suppress should_be_instantiated check |
| 3 | 0x08 | Check for inline member of class template |

High-Level Flow

update_instantiation_required_flag(inst, setting, options):

    // 1. Resolve owning type entry (for toggle counter)
    type_entry = resolve_type_from_actual_decl(inst->actual_decl)

    // 2. Check inline member status
    if is_function_or_variable(master_symbol):
        if is_inline_member:
            adjust options based on module/inline status

    // 3. Debug trace
    if debug_tracing_enabled:
        "Setting instantiation_required flag to TRUE/FALSE for (options=N)"
        print_symbol(inst->master_symbol)

    // 4. Ensure inst_info exists
    if inst_info is NULL:
        find_or_create_master_instance(inst)    // sub_753550

    // 5. If setting == true and entity != actual_decl:
    //    — Set inst->flags |= 0x01 (instantiation_required)
    //    — If inst_info exists, increment pending_count
    //    — Set referencing_namespace
    //    — Possibly call should_be_instantiated + instantiate immediately
    //    — Add to worklist via add_to_instantiations_required_list

    // 6. If setting == false and options & 0x02:
    //    — Decrement pending_count
    //    — Clear inst->flags bit 0
    //    — If count < 0: internal error (templates.c:38908)

    // 7. Worklist linkage (add_to_instantiations_required_list tail)
    //    — Check on_worklist bit
    //    — Append to qword_12C7740/qword_12C7738
    //    — Or set fixpoint flag if already on list
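
Steps 5 through 7 can be modelled with a small self-contained C sketch. Everything here is a simplification under stated assumptions: the struct layouts, the INST_ON_WORKLIST bit value, and the function names are hypothetical, and pending_count is kept on the instance itself rather than on the shared inst_info record.

```c
#include <assert.h>
#include <stddef.h>

/* Option bits from the bitmask table; the identifiers are illustrative. */
enum {
    OPT_FORCE_WORKLIST   = 0x01,
    OPT_UNMARK           = 0x02,
    OPT_SKIP_SHOULD_INST = 0x04,
    OPT_CHECK_INLINE_MEM = 0x08
};

/* INST_REQUIRED is bit 0 per the flow above; the on-worklist bit is assumed. */
enum { INST_REQUIRED = 0x01, INST_ON_WORKLIST = 0x02 };

struct instance {
    struct instance *next;
    unsigned         flags;
    int              pending_count;   /* lives in inst_info in the real code */
};

struct worklist {
    struct instance *head;                      /* models qword_12C7740 */
    struct instance *tail;                      /* models qword_12C7738 */
    int              new_instantiations_needed; /* models dword_12C771C */
};

/* Append once; a repeat request only raises the fixpoint flag. */
static void worklist_add(struct worklist *wl, struct instance *inst)
{
    if (inst->flags & INST_ON_WORKLIST) {
        wl->new_instantiations_needed = 1;
        return;
    }
    inst->flags |= INST_ON_WORKLIST;
    inst->next = NULL;
    if (wl->tail)
        wl->tail->next = inst;
    else
        wl->head = inst;
    wl->tail = inst;
}

/* Returns 0 on success, -1 where the real code raises an internal error. */
static int update_required(struct worklist *wl, struct instance *inst,
                           int setting, unsigned options)
{
    assert(wl != NULL && inst != NULL);
    if (setting) {
        inst->flags |= INST_REQUIRED;
        inst->pending_count++;
        worklist_add(wl, inst);
        return 0;
    }
    if (options & OPT_UNMARK) {
        inst->pending_count--;
        inst->flags &= ~(unsigned)INST_REQUIRED;
        if (inst->pending_count < 0)
            return -1;
    }
    return 0;
}
```

The sketch omits the should_be_instantiated gate and the referencing-namespace bookkeeping; it only shows why a second request for the same instance sets the fixpoint flag instead of duplicating the list node.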

Global State Reference

| Address | Name | Type | Description |
|---|---|---|---|
| qword_12C7740 | pending_instantiation_list | void* | Head of function/variable worklist |
| qword_12C7738 | pending_instantiation_list_tail | void* | Tail of function/variable worklist |
| qword_12C7758 | pending_class_list | void* | Head of class template worklist |
| qword_12C7750 | deferred_master_info_list | void* | Head of deferred master-info list |
| qword_12C7748 | deferred_master_info_list_tail | void* | Tail of deferred master-info list |
| dword_12C7730 | instantiation_mode | int32 | 0=none, 1=used, 2=all, 3=local |
| dword_12C771C | new_instantiations_needed | int32 | Fixpoint flag (1 = restart loop) |
| dword_12C7718 | additional_pass_needed | int32 | Secondary fixpoint flag |
| qword_12C76E0 | function_depth_counter | int64 | Current function instantiation depth (max 255) |
| qword_106BD10 | max_depth_limit | int64 | Configurable depth limit (read by both function and class paths) |
| qword_12C74F0 | instance_alloc_count | int64 | Total 128-byte records allocated |
| qword_12C74E8 | master_info_alloc_count | int64 | Total 32-byte master-info records allocated |
| qword_12C7708 | inline_entity_list_head | void* | Head of inline entity fixup list |
| qword_12C7700 | inline_entity_list_tail | void* | Tail of inline entity fixup list |

Diagnostic Messages

| Number | Severity | Condition | Message Summary |
|---|---|---|---|
| 456 | Error | Depth counter >= max limit | Excessive template instantiation depth |
| 489 | Warning | Approaching depth limit (explicit instantiation) | Template instantiation depth nearing limit |
| 490 | Warning | Approaching depth limit (auto instantiation) | Template instantiation depth nearing limit |
| 599 | Warning | Toggle counter reaches 200 | Instantiation flag oscillation detected |
| 759 | Error | Entity not visible at file scope | Template entity not accessible for instantiation |

Function Map

| Address | Identity | Confidence | Lines | Role |
|---|---|---|---|---|
| sub_7416E0 | alloc_template_instance | 95% | 40 | Allocate 128-byte instance record |
| sub_7416A0 | alloc_master_instance_info | 95% | 16 | Allocate 32-byte master info record |
| sub_753550 | find_or_create_master_instance | 95% | 75 | Link instance to shared master info |
| sub_7770E0 | update_instantiation_required_flag | 95% | 434 | Update flags, add to worklist |
| sub_78A7F0 | do_any_needed_instantiations | 100% | 72 | Walk function/variable worklist |
| sub_78A9D0 | template_and_inline_entity_wrapup | 100% | 136 | Fixpoint loop entry point |
| sub_774620 | should_be_instantiated | 95% | 326 | Decision gate |
| sub_775E00 | instantiate_template_function_full | 95% | 839 | Function template instantiation |
| sub_777CE0 | f_instantiate_template_class | 95% | 516 | Class template instantiation |
| sub_774C30 | instantiate_template_variable | 95% | 751 | Variable template instantiation |
| sub_75D740 | increment_pending_instantiations | 95% | -- | Increment per-type depth counter |
| sub_75D7C0 | decrement_pending_instantiations | 95% | -- | Decrement per-type depth counter |
| sub_75D6A0 | too_many_pending_instantiations | 95% | -- | Check depth limit, emit diagnostic 456 |
| sub_75D5B0 | determine_referencing_namespace | 95% | 47 | Find namespace that triggered instantiation |
| sub_7574B0 | f_entity_can_be_instantiated | 95% | -- | Pre-check: body available, constraints satisfied |
| sub_756B40 | f_is_static_or_inline_template_entity | 95% | -- | Check linkage for instantiation eligibility |
| sub_789EF0 | update_instantiation_flags | 95% | -- | Update class instantiation flags, add to class worklist |
| sub_72ED70 | alloc_symbol_list_entry | 95% | 39 | Allocate 16-byte symbol list node (for inline entity list) |

CLI Flag Inventory

Quick Reference: 20 Most Important CUDA-Specific Flags

| Flag (via -Xcudafe) | nvcc Equivalent | ID | Effect |
|---|---|---|---|
| --diag_suppress=N | --diag-suppress=N | 39 | Suppress diagnostic number N (comma-separated) |
| --diag_error=N | --diag-error=N | 42 | Promote diagnostic N to error |
| --diag_warning=N | --diag-warning=N | 41 | Demote diagnostic N to warning |
| --display_error_number | -- | 44 | Show #NNNNN-D error codes in output |
| --target=smXX | --gpu-architecture=smXX | 245 | Set SM architecture target (parsed via sub_7525E0) |
| --relaxed_constexpr | --expt-relaxed-constexpr | 104 | Allow constexpr cross-space calls |
| --extended-lambda | --expt-extended-lambda | 106 | Enable __device__/__host__ __device__ lambdas in host code (dword_106BF38) |
| --device-c | -rdc=true | 77 | Relocatable device code (separate compilation) |
| --keep-device-functions | --keep-device-functions | 71 | Do not strip unused device functions |
| --no_warnings | -w | 22 | Suppress all warnings |
| --promote_warnings | -W | 23 | Promote all warnings to errors |
| --error_limit=N | -- | 32 | Maximum errors before abort (default: unbounded) |
| --force-lp64 | -m64 | 65 | LP64 data model (pointer=8, long=8) |
| --output_mode=sarif | -- | 274 | SARIF JSON diagnostic output |
| --debug_mode | -G | 82 | Full debug mode (sets 3 debug globals) |
| --device-syntax-only | -- | 72 | Device-side syntax check without codegen |
| --no-device-int128 | -- | 52 | Disable __int128 on device |
| --zero_init_auto_vars | -- | 81 | Zero-initialize automatic variables |
| --fe-inlining | -- | 54 | Enable frontend inlining |
| --gen_c_file_name=path | -- | 45 | Set output .int.c file path |

These are the flags most commonly passed through -Xcudafe for CUDA development. The full inventory of 276 flags follows below.


cudafe++ accepts 276 command-line flags registered in a flat table at dword_E80060. These flags do not normally come from a command line the user writes: NVIDIA's compiler driver nvcc decomposes its own options and invokes cudafe++ with the appropriate low-level flags. Users never run cudafe++ directly; to reach a flag that nvcc does not set on its own, they pass it as nvcc -Xcudafe <flag>, and nvcc strips the -Xcudafe prefix and forwards the remainder as a bare argument to the cudafe++ process.

The flag system is implemented in four functions within cmd_line.c:

| Function | Address | Lines | Role |
|---|---|---|---|
| register_command_flag | sub_451F80 | 25 | Insert one entry into the flag table |
| init_command_line_flags | sub_452010 | 3,849 | Register all 276 flags (called once) |
| proc_command_line | sub_459630 | 4,105 | Main parser: match argv against table, dispatch to 275-case switch |
| default_init | sub_45EB40 | 470 | Zero 350 global config variables + flag-was-set bitmap |

Flag Table Structure

Each flag occupies a 40-byte entry in a contiguous array beginning at dword_E80060, with a maximum capacity of 552 entries (overflow triggers a panic via sub_40351D). The current count is tracked in dword_E80058.

struct flag_entry {                    // 40 bytes per entry, fields listed in address order
    int32_t   case_id;                 // +0x00  dword_E80060[idx*10]  -- switch dispatch ID
                                       // +0x04  4 bytes padding
    char*     name;                    // +0x08  qword_E80068[idx*5]   -- long flag name string
    int8_t    short_char;              // +0x10  word_E80070[idx*20], low byte  -- single-char alias (0 if none)
    int8_t    is_valid;                // +0x11  word_E80070[idx*20], high byte -- always 1
    int8_t    takes_value;             // +0x12  byte_E80072[idx*40]   -- flag requires =<value> argument
    int8_t    is_boolean;              // +0x13  byte_E80073[idx*40]   -- flag is on/off toggle
                                       // +0x14  4 bytes padding
    int64_t   name_length;             // +0x18  qword_E80078[idx*5]   -- strlen(name), precomputed
    int32_t   visible;                 // +0x20  dword_E80080[idx*10]  -- mode/action classification
                                       // +0x24  4 bytes tail padding
};
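
As a cross-check on the layout, the following C sketch places the fields in the order implied by the array base addresses in the comments (0xE80060, 0xE80068, 0xE80070, 0xE80072, 0xE80073, 0xE80078, 0xE80080) and computes field addresses from an index. It assumes an LP64 target with 8-byte pointers; the names flag_entry_layout, flag_entry_address, and flag_field_address are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields in address order; natural alignment supplies the padding. */
struct flag_entry_layout {
    int32_t     case_id;      /* +0  */
    const char *name;         /* +8  */
    int8_t      short_char;   /* +16 */
    int8_t      is_valid;     /* +17 */
    int8_t      takes_value;  /* +18 */
    int8_t      is_boolean;   /* +19 */
    int64_t     name_length;  /* +24 */
    int32_t     visible;      /* +32 */
};

enum {
    FLAG_TABLE_BASE = 0xE80060,  /* dword_E80060 */
    FLAG_ENTRY_SIZE = 40,
    FLAG_CAPACITY   = 552
};

/* Virtual address of entry idx inside the binary's flag table. */
static uint64_t flag_entry_address(unsigned idx)
{
    assert(idx < FLAG_CAPACITY);
    return (uint64_t)FLAG_TABLE_BASE + (uint64_t)idx * FLAG_ENTRY_SIZE;
}

/* Virtual address of one field of entry idx. */
static uint64_t flag_field_address(unsigned idx, size_t field_offset)
{
    assert(field_offset < FLAG_ENTRY_SIZE);
    return flag_entry_address(idx) + field_offset;
}
```

For example, flag_field_address(0, offsetof(struct flag_entry_layout, name_length)) gives 0xE80078, which matches the qword_E80078 base that the decompiler uses for the name_length column.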

The flag-was-set bitmap at byte_E7FF40 spans 0x110 bytes (272 flag slots, one byte per slot). When a flag is matched during parsing, the corresponding slot is set to record that the user explicitly provided it. This bitmap is zeroed by default_init before every compilation.

Registration Protocol

register_command_flag (sub_451F80) is called approximately 275 times from init_command_line_flags. Its prototype:

void register_command_flag(
    int    case_id,        // dispatch ID for the switch statement
    char*  name,           // flag name without the leading "--"
    char   short_opt,      // single-letter alias, 0 for none
    char   takes_value,    // 1 if the flag requires =<value>
    int    mode_flag,      // visibility / classification
    char   enabled         // whether the flag is active
);

Some flags are registered as paired toggles -- --flag and --no_flag share the same case_id but set the target global to 1 or 0 respectively. These pairs are registered either by two calls to register_command_flag or by inline table population within init_command_line_flags.
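
A minimal C sketch of the registration step, assuming the table layout described earlier. The mapping of the enabled parameter onto is_valid is an assumption, the source of is_boolean is not known and is left clear here, and returning -1 stands in for the panic in sub_40351D. The names register_flag, flag_table, and flag_count are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { FLAG_CAPACITY = 552 };   /* capacity quoted in the text */

struct flag_entry {
    int32_t     case_id;
    const char *name;
    int8_t      short_char;
    int8_t      is_valid;
    int8_t      takes_value;
    int8_t      is_boolean;
    int64_t     name_length;
    int32_t     visible;
};

static struct flag_entry flag_table[FLAG_CAPACITY];  /* models dword_E80060 */
static int flag_count;                               /* models dword_E80058 */

/* Returns the new entry's index, or -1 where the binary would panic. */
static int register_flag(int case_id, const char *name, char short_opt,
                         char takes_value, int mode_flag, char enabled)
{
    assert(name != NULL);
    if (flag_count >= FLAG_CAPACITY)
        return -1;
    struct flag_entry *e = &flag_table[flag_count];
    e->case_id     = case_id;
    e->name        = name;
    e->short_char  = (int8_t)short_opt;
    e->is_valid    = (int8_t)enabled;        /* assumed mapping */
    e->takes_value = (int8_t)takes_value;
    e->is_boolean  = 0;                      /* origin unknown; left clear */
    e->name_length = (int64_t)strlen(name);  /* precomputed for strncmp matching */
    e->visible     = mode_flag;
    return flag_count++;
}
```

A paired toggle is then just two entries with one case_id, for example register_flag(27, "exceptions", 'x', 0, 0, 1) followed by register_flag(27, "no_exceptions", 0, 0, 0, 1).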

Parsing Flow

proc_command_line (sub_459630) is the master CLI parser. It:

  1. Calls init_command_line_flags to populate the flag table (once)
  2. Allocates four hash tables for accumulating -D, -I, system include, and macro alias arguments
  3. Adjusts nine diagnostic severities by default via sub_4ED400: four are suppressed (severity 3: errors 1373, 1374, 1375, 2330) and five are demoted to remark (severity 4: errors 1257, 1633, 111, 185, 175)
  4. Enters the main loop over argv:
    • Scans for - prefix to identify flags
    • Handles single-character short flags (-E, -I, and so on) and --flag-name long flags
    • Handles --flag=value syntax via parse_flag_name_value (sub_451EC0)
    • Matches flag names against the registered table using strncmp against each entry's precomputed name_length
    • Dispatches to a giant switch(case_id) with 275 cases
  5. Executes post-parsing dialect resolution (described below)
  6. Opens output, error, and list files
  7. Treats the remaining non-flag argv entry as the input filename
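
The table-matching step in item 4 can be sketched in C as below. This is not the decompiled code: the struct is trimmed to the fields the match needs, match_flag is a hypothetical name, and the check on the character following the matched prefix is an assumption added so that a short name cannot match a longer argument by accident.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Trimmed-down entry: only what the matcher reads. */
struct flag_entry {
    int         case_id;
    const char *name;
    size_t      name_length;   /* strlen(name), precomputed at registration */
    int         takes_value;
};

/*
 * Match one "--name" or "--name=value" argument against the table.
 * On success returns the entry and stores the value (or NULL) in *value_out.
 */
static const struct flag_entry *match_flag(const struct flag_entry *table,
                                           size_t count, const char *arg,
                                           const char **value_out)
{
    assert(table != NULL && arg != NULL && value_out != NULL);
    *value_out = NULL;
    if (strncmp(arg, "--", 2) != 0)
        return NULL;                      /* not a long flag */
    arg += 2;
    for (size_t i = 0; i < count; i++) {
        const struct flag_entry *e = &table[i];
        if (strncmp(arg, e->name, e->name_length) != 0)
            continue;
        char next = arg[e->name_length];  /* safe: the prefix matched in full */
        if (next == '\0')
            return e;
        if (next == '=' && e->takes_value) {
            *value_out = arg + e->name_length + 1;
            return e;
        }
    }
    return NULL;
}
```

The returned case_id would then feed the switch dispatch, and an argument that matches nothing and has no dash prefix is a candidate for the input filename.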

The -Xcudafe Pass-Through

Users never invoke cudafe++ directly. The intended usage path is:

nvcc --some-option -Xcudafe --diag_suppress=1234 source.cu

nvcc strips -Xcudafe and passes --diag_suppress=1234 directly to the cudafe++ process as an argv element. Multiple -Xcudafe arguments accumulate. Because cudafe++ flags use -- long-form prefixes, there is no ambiguity with nvcc's own flag namespace.

Certain nvcc flags like --expt-extended-lambda and --expt-relaxed-constexpr are translated by nvcc into the corresponding cudafe++ internal flags (--extended-lambda, --relaxed_constexpr) before invocation. Users do not need to know the internal names.

Flag Catalog by Category

The 276 flags are grouped below by functional category. Each table lists:

  • ID -- the case_id used in the dispatch switch
  • Flag -- the --name as registered (paired flags shown as name / no_name)
  • Short -- single-character alias (dash required: -E, -C, etc.)
  • Arg -- whether the flag takes a =<value> argument
  • Effect -- what the flag does internally

Core EDG Flags (1--44)

These are standard Edison Design Group frontend options that predate NVIDIA's CUDA modifications.

| ID | Flag | Short | Arg | Effect |
|---|---|---|---|---|
| 1 | strict | -A | no | Enable strict standards conformance mode |
| 2 | strict_warnings | -a | no | Strict mode with extra warnings |
| 3 | no_line_commands | -P | no | Suppress #line directives in preprocessor output |
| 4 | preprocess | -E | no | Preprocessor-only mode (output to stdout) |
| 5 | comments | -C | no | Preserve comments in preprocessor output |
| 6 | old_line_commands | -- | no | Use old-style # N "file" line directives |
| 7 | old_c | -K | no | K&R C mode (calls set_c_mode(1)) |
| 8 | dependencies | -M | no | Output #include dependency list (preprocessor-only) |
| 9 | trace_includes | -H | no | Print each #include file as it is opened |
| 10 | il_display | -- | no | Dump intermediate language after parsing |
| 11 | anachronisms / no_anachronisms | -- | no | Allow/disallow anachronistic C++ constructs |
| 12 | cfront_2.1 | -b | no | Cfront 2.1 compatibility mode |
| 13 | cfront_3.0 | -- | no | Cfront 3.0 compatibility mode |
| 14 | no_code_gen | -n | no | Parse only, skip code generation |
| 15 | signed_chars / unsigned_chars | -s | no | Default char signedness |
| 16 | instantiate | -t | yes | Template instantiation mode: none, all, used, local |
| 17 | implicit_include / no_implicit_include | -B | no | Enable/disable implicit inclusion of template definitions |
| 18 | suppress_vtbl / force_vtbl | -- | no | Control virtual table emission |
| 19 | dollar | -$ | no | Allow $ in identifiers |
| 20 | timing | -# | no | Print compilation phase timing |
| 21 | version | -v | no | Print version banner and continue |
| 22 | no_warnings | -w | no | Suppress all warnings (sets severity threshold to error-only) |
| 23 | promote_warnings | -W | no | Promote warnings to errors |
| 24 | remarks | -r | no | Enable remark-level diagnostics |
| 25 | c | -m | no | Force C language mode |
| 26 | c++ | -p | no | Force C++ language mode |
| 27 | exceptions / no_exceptions | -x | no | Enable/disable C++ exception handling |
| 28 | no_use_before_set_warnings | -j | no | Suppress "used before set" variable warnings |
| 29 | include_directory | -I | yes | Add include search path (handles - for stdin) |
| 30 | define_macro | -D | yes | Define preprocessor macro (builds linked list) |
| 31 | undefine_macro | -U | yes | Undefine preprocessor macro |
| 32 | error_limit | -e | yes | Maximum number of errors before abort |
| 33 | list | -L | yes | Generate listing file |
| 34 | xref | -X | yes | Generate cross-reference file |
| 35 | error_output | -- | yes | Redirect error output to file |
| 36 | output | -o | yes | Set output file path |
| 37 | db | -d | yes | Load debug database |
| 38 | time_limit | -- | yes | Set compilation time limit |
| 39 | diag_suppress | -- | yes | Suppress diagnostic numbers (comma-separated list) |
| 40 | diag_remark | -- | yes | Demote diagnostics to remark severity |
| 41 | diag_warning | -- | yes | Set diagnostics to warning severity |
| 42 | diag_error | -- | yes | Promote diagnostics to error severity |
| 43 | diag_once | -- | yes | Emit diagnostic only on first occurrence |
| 44 | display_error_number / no_display_error_number | -- | no | Show/hide error code numbers in output |

NVIDIA CUDA-Specific Flags (45--89)

These flags are NVIDIA additions absent from stock EDG. They control CUDA compilation modes, device code generation, and host/device interaction.

| ID | Flag | Arg | Effect |
|---|---|---|---|
| 45 | gen_c_file_name | yes | Set output .int.c file path (qword_106BF20) |
| 46 | msvc_target_version | yes | MSVC version for compatibility (dword_126E1D4) |
| 47 | host-stub-linkage-explicit | no | Use explicit linkage on host stubs |
| 48 | static-host-stub | no | Generate static host stubs |
| 49 | device-hidden-visibility | no | Apply hidden visibility to device symbols |
| 50 | no-hidden-visibility-on-unnamed-ns | no | Exempt unnamed namespaces from hidden visibility |
| 51 | no-multiline-debug | no | Disable multiline debug info |
| 52 | no-device-int128 | no | Disable __int128 on device |
| 53 | no-device-float128 | no | Disable __float128 on device |
| 54 | fe-inlining | no | Enable frontend inlining (dword_106C068 = 1) |
| 55 | modify-stack-limit | yes | Control stack limit modification (dword_106C064) |
| 56 | fassociative-math | no | Enable associative floating-point math |
| 57 | orig_src_file_name | yes | Original source file name (before preprocessing) |
| 58 | orig_src_path_name | yes | Original source path name (full path) |
| 59 | frandom-seed | yes | Random seed for reproducible output |
| 60 | check-template-param-qual | no | Check template parameter qualifications |
| 61 | check-clock-call | no | Validate clock() calls in device code |
| 62 | check-ffs-call | no | Validate ffs() calls in device code |
| 63 | check-routine-address-taken | no | Check when device routine address is taken |
| 64 | check-memory-clobber | no | Validate memory clobber in inline asm |
| 65 | force-lp64 | no | LP64 data model: pointer=8, long=8 |
| 66 | force-llp64 | no | LLP64 data model: pointer=8, long=4 |
| 67 | pgi_llvm | no | PGI/LLVM backend mode |
| 68 | pgi_arch_ppc | no | PGI PowerPC architecture |
| 69 | pgi_arch_aarch64 | no | PGI AArch64 architecture |
| 70 | pgi_version | yes | PGI compiler version number |
| 71 | keep-device-functions | no | Do not strip unused device functions |
| 72 | device-syntax-only | no | Device-side syntax check without codegen |
| 73 | device-time-trace | no | Enable device compilation time tracing |
| 74 | force_linkonce_to_weak | no | Convert linkonce to weak linkage |
| 75 | disable_host_implicit_call_check | no | Skip implicit call validation on host |
| 76 | no_strict_cuda_error | no | Relax strict CUDA error checking |
| 77 | device-c | no | Relocatable device code (RDC) mode |
| 78 | no-shadow-functions | no | Disable function shadowing in device code |
| 79 | disable_ext_lambda_cache | no | Disable extended lambda capture cache |
| 80 | no-constant-variable-inferencing | no | Disable constexpr variable inference on device |
| 81 | zero_init_auto_vars | no | Zero-initialize automatic variables |
| 82 | debug_mode | no | Full debug mode (sets 3 debug globals to 1) |
| 83 | gen_module_id_file | no | Generate module ID file |
| 84 | include_file_name | yes | Forced include file name |
| 85 | gen_device_file_name | yes | Device-side output file name |
| 86 | stub_file_name | yes | Stub file output path |
| 87 | module_id_file_name | yes | Module ID file path |
| 88 | tile_bc_file_name | yes | Tile bitcode file path |
| 89 | tile-only | no | Tile-only compilation mode |

Architecture and Host Compiler Flags (90--114)

These flags identify the target architecture and host compiler for compatibility emulation.

| ID | Flag | Short | Arg | Effect |
|---|---|---|---|---|
| 90 | m32 | -- | no | 32-bit mode: pointer=4, long=4, all types sized for ILP32 |
| 91 | m64 | -- | no | 64-bit mode (default on Linux x86-64) |
| 92 | Version | -V | no | Print version with different copyright format, then exit(1) |
| 93 | compiler_bindir | -- | yes | Host compiler binary directory |
| 94 | sdk_dir | -- | yes | SDK directory path |
| 95 | pgc++ | -- | no | PGI C++ compiler mode |
| 96 | icc | -- | no | Intel ICC compiler mode |
| 97 | icc_version | -- | yes | Intel ICC version number |
| 98 | icx | -- | no | Intel ICX (oneAPI) compiler mode |
| 99 | grco | -- | no | GRCO compiler mode |
| 100 | allow_managed | -- | no | Allow __managed__ variable declarations |
| 101 | gen_system_templates_from_text | -- | no | Generate system templates from text |
| 102 | no_host_device_initializer_list | -- | no | Disable HD initializer_list support |
| 103 | no_host_device_move_forward | -- | no | Disable HD std::move/std::forward |
| 104 | relaxed_constexpr | -- | no | Relaxed constexpr rules for device code (--expt-relaxed-constexpr) |
| 105 | dont_suppress_host_wrappers | -- | no | Emit host wrapper functions unconditionally |
| 106 | arm_cross_compiler | -- | no | ARM cross-compilation mode |
| 107 | target_woa | -- | no | Windows on ARM target |
| 108 | gen_div_approx_no_ftz | -- | no | Generate approximate division without flush-to-zero |
| 109 | gen_div_approx_ftz | -- | no | Generate approximate division with flush-to-zero |
| 110 | shared_address_immutable | -- | no | Shared memory addresses are immutable |
| 111 | uumn | -- | no | Unnamed union member naming |

C++ Language Feature Toggle Flags (115--275)

The largest group -- approximately 120 paired boolean toggles that control individual C++ language features. Most are inherited from EDG's configuration surface. Each pair shares a case_id and sets a global variable to 1 (--flag) or 0 (--no_flag).

Precompiled Headers (115--121)

| ID | Flag | Arg | Effect |
|---|---|---|---|
| 115 | unsigned_wchar_t | no | wchar_t is unsigned |
| 116 | create_pch | yes | Create precompiled header file |
| 117 | use_pch | yes | Use existing precompiled header |
| 118 | pch | no | Enable PCH mode |
| 119 | pch_messages / no_pch_messages | no | Show/hide PCH status messages |
| 120 | pch_verbose / no_pch_verbose | no | Verbose PCH output |
| 121 | pch_dir | yes | PCH file directory |

Core C++ Feature Toggles (122--171)

| ID | Flag | Arg | Default / Notes |
|---|---|---|---|
| 122 | restrict / no_restrict | no | on |
| 123 | long_lifetime_temps / short_lifetime_temps | no | -- |
| 124 | wchar_t_keyword / no_wchar_t_keyword | no | on |
| 125 | pack_alignment | yes | -- |
| 126 | alternative_tokens / no_alternative_tokens | no | on |
| 127 | svr4 / no_svr4 | no | -- |
| 128 | brief_diagnostics / no_brief_diagnostics | no | -- |
| 129 | nonconst_ref_anachronism / no_nonconst_ref_anachronism | no | -- |
| 130 | no_preproc_only | no | -- |
| 131 | rtti / no_rtti | no | on |
| 132 | building_runtime | no | -- |
| 133 | bool / no_bool | no | on |
| 134 | array_new_and_delete / no_array_new_and_delete | no | -- |
| 135 | explicit / no_explicit | no | -- |
| 136 | namespaces / no_namespaces | no | on |
| 137 | using_std / no_using_std | no | -- |
| 138 | remove_unneeded_entities / no_remove_unneeded_entities | no | on |
| 139 | typename / no_typename | no | -- |
| 140 | implicit_typename / no_implicit_typename | no | on |
| 141 | special_subscript_cost / no_special_subscript_cost | no | -- |
| 143 | old_style_preprocessing | no | -- |
| 144 | old_for_init / new_for_init | no | -- |
| 145 | for_init_diff_warning / no_for_init_diff_warning | no | -- |
| 146 | distinct_template_signatures / no_distinct_template_signatures | no | -- |
| 147 | guiding_decls / no_guiding_decls | no | on |
| 148 | old_specializations / no_old_specializations | no | on |
| 149 | wrap_diagnostics / no_wrap_diagnostics | no | -- |
| 150 | implicit_extern_c_type_conversion / no_implicit_extern_c_type_conversion | no | -- |
| 151 | long_preserving_rules / no_long_preserving_rules | no | -- |
| 152 | extern_inline / no_extern_inline | no | -- |
| 153 | multibyte_chars / no_multibyte_chars | no | -- |
| 154 | embedded_c++ | no | Embedded C++ mode |
| 155 | vla / no_vla | no | -- |
| 156 | enum_overloading / no_enum_overloading | no | -- |
| 157 | nonstd_qualifier_deduction / no_nonstd_qualifier_deduction | no | -- |
| 158 | late_tiebreaker / early_tiebreaker | no | -- |
| 159 | preinclude | yes | -- |
| 160 | preinclude_macros | yes | -- |
| 161 | pending_instantiations | yes | -- |
| 162 | const_string_literals / no_const_string_literals | no | on |
| 163 | class_name_injection / no_class_name_injection | no | on |
| 164 | arg_dep_lookup / no_arg_dep_lookup | no | on |
| 165 | friend_injection / no_friend_injection | no | on |
| 166 | nonstd_using_decl / no_nonstd_using_decl | no | -- |
| 168 | designators / no_designators | no | -- |
| 169 | extended_designators / no_extended_designators | no | -- |
| 170 | variadic_macros / no_variadic_macros | no | -- |
| 171 | extended_variadic_macros / no_extended_variadic_macros | no | -- |

Include Paths and Module Support (167, 172, 256--265)

Note: These flags use non-contiguous IDs because sys_include and incl_suffixes are registered early, while the C++20 module flags use a separate ID range (256+).

| ID | Flag | Arg | Effect |
|---|---|---|---|
| 167 | sys_include | yes | System include directory |
| 172 | incl_suffixes | yes | Include file suffix list (default "::stdh:") |
| 256 | modules_directory | yes | C++20 modules directory |
| 257 | ms_mod_file_map | yes | MSVC module file mapping |
| 258 | ms_header_unit | yes | MSVC header unit |
| 259 | ms_header_unit_quote | yes | MSVC quoted header unit |
| 260 | ms_header_unit_angle | yes | MSVC angle-bracket header unit |
| 261 | ms_mod_interface / no_ms_mod_interface | no | MSVC module interface mode |
| 262 | ms_internal_partition / no_ms_internal_partition | no | MSVC internal partition mode |
| 263 | ms_translate_include / no_ms_translate_include | no | MSVC translate #include to import |
| 264 | modules / no_modules | no | Enable/disable C++20 modules |
| 265 | module_import_diagnostics / no_module_import_diagnostics | no | Module import diagnostic messages |

Host Compiler and Language Feature Toggles (182--239)

Note: All IDs below are verified against the decompiled init_command_line_flags (sub_452010). Flags are registered by sub_451F80 (explicit call) or by inline array population. IDs are not sequential -- gaps exist where flags were removed or repurposed.

| ID | Flag | Arg | Effect |
|---|---|---|---|
| 182 | gcc / no_gcc | no | GCC compatibility mode |
| 183 | g++ / no_g++ | no | G++ mode (alias for GCC C++ mode) |
| 184 | gnu_version | yes | GCC version number (default 80100 = GCC 8.1.0) |
| 185 | report_gnu_extensions | no | Report use of GNU extensions |
| 186 | short_enums / no_short_enums | no | Use minimal-size enum representation |
| 187 | clang / no_clang | no | Clang compatibility mode |
| 188 | clang_version | yes | Clang version number (default 90100 = Clang 9.1.0) |
| 189 | strict_gnu / no_strict_gnu | no | Strict GNU mode |
| 190 | db_name | yes | Debug database name |
| 191 | long_long | no | Allow long long type |
| 192 | context_limit | yes | Maximum template instantiation context depth |
| 193 | set_flag / clear_flag | yes | Raw flag manipulation via off_D47CE0 lookup table |
| 194 | edg_base_dir | yes | EDG base directory (error on invalid path) |
| 195 | embedded_c / no_embedded_c | no | Embedded C mode (not relevant to CUDA) |
| 196 | thread_local_storage / no_thread_local_storage | no | thread_local support |
| 197 | trigraphs / no_trigraphs | no | Trigraph processing (default on) |
| 198 | nonstd_default_arg_deduction / no_nonstd_default_arg_deduction | no | -- |
| 199 | stdc_zero_in_system_headers / no_stdc_zero_in_system_headers | no | -- |
| 200 | template_typedefs_in_diagnostics / no_template_typedefs_in_diagnostics | no | -- |
| 202 | uliterals / no_uliterals | no | Unicode literals (u"", U"", u8"") |
| 203 | type_traits_helpers / no_type_traits_helpers | no | Intrinsic type traits |
| 204 | c++11 / c++0x | no | C++11 mode (sets dword_126EF68 to 201103 or 199711) |
| 205 | list_macros | no | List all defined macros after preprocessing |
| 206 | dump_configuration | no | Dump full compiler configuration |
| 207 | dump_legacy_as_target | yes | Dump legacy configuration in target format |
| 208 | signed_bit_fields / unsigned_bit_fields | no | Default bit-field signedness |
| 210 | check_concatenations / no_check_concatenations | no | String literal concatenation checks |
| 211 | unicode_source_kind | yes | Source encoding: UTF-8=1, UTF-16LE=2, UTF-16BE=3, none=0 |
| 212 | lambdas / no_lambdas | no | C++ lambda expressions |
| 213 | rvalue_refs / no_rvalue_refs | no | Rvalue references |
| 214 | rvalue_ctor_is_copy_ctor / rvalue_ctor_is_not_copy_ctor | no | Rvalue constructor treatment |
| 215 | gen_move_operations / no_gen_move_operations | no | Implicit move constructor/assignment (default on) |
| 216 | auto_type / no_auto_type | no | C++11 auto type deduction |
| 217 | auto_storage / no_auto_storage | no | auto as storage class (C++03 meaning) |
| 218 | nonstd_instantiation_lookup / no_nonstd_instantiation_lookup | no | -- |
| 219 | nullptr / no_nullptr | no | nullptr keyword |
| 220 | gcc89_inlining | no | GNU C89 (gnu89) inline semantics |
| 221 | nonstd_gnu_keywords / no_nonstd_gnu_keywords | no | GNU extension keywords |
| 222 | default_nocommon_tentative_definitions / default_common_tentative_definitions | no | Tentative definition linkage |
| 223 | no_token_separators_in_pp_output | no | -- |
| 224 | c23_typeof / no_c23_typeof | no | C23 typeof operator |
| 225 | c++11_sfinae / no_c++11_sfinae | no | C++11 SFINAE rules |
| 226 | c++11_sfinae_ignore_access / no_c++11_sfinae_ignore_access | no | Ignore access checks in SFINAE |
| 227 | variadic_templates / no_variadic_templates | no | Parameter packs and pack expansion |
| 228 | c++03 | no | C++03 mode (sets dword_126EF68 to 199711) |
| 229 | func_prototype_tags / no_func_prototype_tags | no | -- |
| 230 | implicit_noexcept / no_implicit_noexcept | no | Implicit noexcept on destructors |
| 231 | unrestricted_unions / no_unrestricted_unions | no | Unrestricted unions (C++11) |
| 232 | max_depth_constexpr_call | yes | Maximum constexpr recursion depth (default 200) |
| 233 | max_cost_constexpr_call | yes | Maximum constexpr evaluation cost (default 256) |
| 234 | delegating_constructors / no_delegating_constructors | no | -- |
| 235 | lossy_conversion_warning / no_lossy_conversion_warning | no | -- |
| 236 | deprecated_string_conv / no_deprecated_string_conv | no | Deprecated string literal to char* conversion |
| 237 | user_defined_literals / no_user_defined_literals | no | UDL support |
| 238 | preserve_lvalues_with_same_type_casts / no_... | no | -- |
| 239 | nonstd_anonymous_unions / no_nonstd_anonymous_unions | no | -- |

Late C++/Architecture/Output Flags (240--275)

| ID | Flag | Arg | Effect |
|---|---|---|---|
| 240 | c++14 | no | C++14 mode (sets dword_126EF68 to 201402) |
| 241 | c11 | no | C11 mode (sets dword_126EF68 to 201112) |
| 242 | c17 | no | C17 mode (sets dword_126EF68 to 201710) |
| 243 | c23 | no | C23 mode (sets dword_126EF68 to 202311) |
| 244 | digit_separators / no_digit_separators | no | C++14 digit separators (1'000'000) |
| 245 | target | yes | SM architecture string, parsed via sub_7525E0 into dword_126E4A8 |
| 246 | c++17 | no | C++17 mode (sets dword_126EF68 to 201703) |
| 247 | utf8_char_literals / no_utf8_char_literals | no | UTF-8 character literal support |
| 248 | stricter_template_checking | no | Additional template constraint checks |
| 249 | exc_spec_in_func_type / no_exc_spec_in_func_type | no | Exception spec as part of function type (C++17) |
| 250 | aligned_new / no_aligned_new | no | Aligned operator new (C++17) |
| 251 | c++20 | no | C++20 mode (sets dword_126EF68 to 202002) |
| 252 | c++23 | no | C++23 mode (sets dword_126EF68 to 202302) |
| 253 | ms_std_preprocessor / no_ms_std_preprocessor | no | MSVC standard preprocessor mode |
| 268 | partial-link | no | Partial linking mode |
| 273 | dump_command_options | no | Print all registered flag names |
| 274 | output_mode | yes | Output format: text (0) or sarif (1) |
| 275 | incognito / no_incognito | no | Incognito mode |

Note: Many IDs in the 240--252 range are C/C++ standard selectors interleaved with feature toggles. The standard selection IDs are also cross-referenced in the Language Standard Selection section below.

Inline-Registered Paired Flags

Seven additional paired flags are registered through inline table population rather than calls to register_command_flag. They share the same entry structure but are populated directly into the array:

| Flag | Effect |
|---|---|
| relaxed_abstract_checking / no_relaxed_abstract_checking | Relax abstract class checks |
| concepts / no_concepts | C++20 concepts support |
| colors / no_colors | Colorized diagnostic output |
| keep_restrict_in_signatures / no_keep_restrict_in_signatures | Preserve restrict in mangled names |
| check_unicode_security / no_check_unicode_security | Unicode security checks (homoglyph detection) |
| old_id_chars / no_old_id_chars | Legacy identifier character rules |
| add_match_notes / no_add_match_notes | Add notes about matching overloads |

Language Standard Selection

The language standard flags, six for C and six for C++, set dword_126EF68 (the internal __cplusplus / __STDC_VERSION__ value) and trigger corresponding mode changes:

C Standards

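| ID | Flag | dword_126EF68 value | C standard |
|---|---|---|---|
| 7 | old_c | (K&R) | Pre-ANSI C via set_c_mode(1) |
| 179 | c89 | 198912 | ANSI C / C89 |
| 178 | c99 | 199901 | C99 |
| 241 | c11 | 201112 | C11 |
| 242 | c17 | 201710 | C17 |
| 243 | c23 | 202311 | C23 |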

C++ Standards

| ID | Flag | dword_126EF68 value | C++ standard |
|---|---|---|---|
| 228 | c++03 | 199711 | C++98/03 (also aliased as c++98 via --c++11 flag ID 204 with conditional) |
| 204 | c++11 | 201103 | C++11 (sets 199711 if dword_E7FF14 is unset or C mode) |
| 240 | c++14 | 201402 | C++14 |
| 246 | c++17 | 201703 | C++17 |
| 251 | c++20 | 202002 | C++20 |
| 252 | c++23 | 202302 | C++23 |

When a C++ standard is selected, the post-parsing dialect resolution logic automatically enables the corresponding feature flags. For example, selecting --c++11 (value 201103) enables lambdas, rvalue references, auto type deduction, nullptr, variadic templates, and other C++11 features. The resolution logic also interacts with GCC/Clang version thresholds to determine which extensions are available.
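
The two tables reduce to a flag-name lookup, sketched here in C. The struct and the standard_value helper are hypothetical, and the sketch deliberately omits old_c (which goes through set_c_mode(1) rather than a version value) and the conditional 199711 fallback of --c++11.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row per standard-selection flag: name, case_id, value for dword_126EF68. */
struct std_flag {
    const char *flag;
    int         case_id;
    long        value;
};

static const struct std_flag std_flags[] = {
    {"c89",   179, 198912L},
    {"c99",   178, 199901L},
    {"c11",   241, 201112L},
    {"c17",   242, 201710L},
    {"c23",   243, 202311L},
    {"c++03", 228, 199711L},
    {"c++11", 204, 201103L},
    {"c++14", 240, 201402L},
    {"c++17", 246, 201703L},
    {"c++20", 251, 202002L},
    {"c++23", 252, 202302L},
};

/* Returns the version value for a flag name, or -1 if it selects no standard. */
static long standard_value(const char *flag)
{
    assert(flag != NULL);
    for (size_t i = 0; i < sizeof std_flags / sizeof std_flags[0]; i++) {
        if (strcmp(std_flags[i].flag, flag) == 0)
            return std_flags[i].value;
    }
    return -1;
}
```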

Diagnostic Control Flags

The five diag_* flags (IDs 39--43) accept comma-separated lists of diagnostic numbers. The parser strips whitespace, splits on commas, and calls sub_4ED400(number, severity, 1) for each number:

--diag_suppress=1234,5678       # suppress errors 1234 and 5678
--diag_warning=20001            # demote CUDA error 20001 to warning
--diag_error=111                # promote diagnostic 111 to error
--diag_remark=185               # demote diagnostic 185 to remark
--diag_once=175                 # emit diagnostic 175 only once
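
The list handling described above (strip whitespace, split on commas) might look like the following C sketch. The real parser calls sub_4ED400 for each number as it goes; this version only collects the numbers into an array so that it stays self-contained, and parse_diag_list is a hypothetical name.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/*
 * Parse a comma-separated list such as "1234, 5678" into out[].
 * Returns the number of entries, or -1 on malformed input or overflow.
 */
static int parse_diag_list(const char *s, long *out, int max_out)
{
    assert(s != NULL && out != NULL && max_out > 0);
    int n = 0;
    for (;;) {
        while (isspace((unsigned char)*s))
            s++;
        char *end;
        long v = strtol(s, &end, 10);
        if (end == s || v <= 0)
            return -1;                 /* empty field or non-positive number */
        if (n == max_out)
            return -1;                 /* caller's buffer is full */
        out[n++] = v;
        s = end;
        while (isspace((unsigned char)*s))
            s++;
        if (*s == '\0')
            return n;
        if (*s != ',')
            return -1;                 /* unexpected separator */
        s++;
    }
}
```

How the real parser treats malformed lists is not established here; rejecting them outright is a choice made for the sketch.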

The error number system is documented in Diagnostic System Overview. Numbers above 3456 in the internal range correspond to the 20000-series CUDA errors via the offset formula display_code = internal_code + 16543.
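
The offset formula is easy to check with a short C sketch. The function names are hypothetical, and the assumption that internal codes at or below 3456 are displayed unchanged is inferred from the sentence above rather than confirmed separately.

```c
#include <assert.h>

enum {
    CUDA_INTERNAL_LIMIT = 3456,   /* last internal code outside the CUDA range */
    CUDA_DISPLAY_OFFSET = 16543   /* display_code = internal_code + 16543 */
};

/* Internal diagnostic number -> number shown to the user. */
static int display_code(int internal_code)
{
    assert(internal_code > 0);
    if (internal_code > CUDA_INTERNAL_LIMIT)
        return internal_code + CUDA_DISPLAY_OFFSET;
    return internal_code;
}

/* Inverse mapping, for turning a 20000-series number back into a table index. */
static int internal_from_display(int shown)
{
    assert(shown > 0);
    if (shown > CUDA_INTERNAL_LIMIT + CUDA_DISPLAY_OFFSET)
        return shown - CUDA_DISPLAY_OFFSET;
    return shown;
}
```

With these constants the first CUDA-specific internal code, 3457, is displayed as 20000.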

Post-Parsing Dialect Resolution

After the main parsing loop completes, proc_command_line executes a large block of dialect resolution logic that:

  1. Resolves host compiler mode conflicts -- If both --gcc and --clang are set, or --cfront_2.1 is combined with modern modes, the resolution picks one and adjusts feature flags accordingly
  2. Sets C++ feature flags from __cplusplus version -- Based on the value in dword_126EF68:
    • 199711 (C++98/03): baseline features only
    • 201103 (C++11): enables lambdas, rvalue refs, auto, nullptr, variadic templates, range-based for, delegating constructors, unrestricted unions, user-defined literals
    • 201402 (C++14): adds digit separators, generic lambdas, relaxed constexpr
    • 201703 (C++17): adds aligned new, exception spec in function type, structured bindings
    • 202002 (C++20): adds concepts, modules, coroutines
    • 202302 (C++23): adds latest features
  3. Applies GCC version thresholds -- When in GCC compatibility mode, certain features are gated on the GCC version number stored in qword_126EF98 (default 80100 = GCC 8.1.0). Known thresholds:
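    • 40299 (0x9D6B): GCC 4.2.99, the upper bound of the GCC 4.2.x series
    • 40599 (0x9E97): GCC 4.5.99, the upper bound of the GCC 4.5.x series
    • 40699 (0x9EFB): GCC 4.6.99, the upper bound of the GCC 4.6.x series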
    • Higher versions enable progressively more features
  4. Opens output files -- Error output, listing file, output file
  5. Processes the input filename -- The remaining non-flag argv entry
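Step 2 above describes a cumulative scheme: each standard adds features on top of the previous one. A minimal sketch of that lookup, assuming a simple threshold table, is shown below. The feature names are copied from the list above; the table layout and function name are hypothetical and do not reflect how the binary stores its feature flags.

```python
# Illustrative only: the binary sets individual globals, not a Python list.
FEATURES_ADDED = [
    (201103, ["lambdas", "rvalue refs", "auto", "nullptr", "variadic templates",
              "range-based for", "delegating constructors",
              "unrestricted unions", "user-defined literals"]),
    (201402, ["digit separators", "generic lambdas", "relaxed constexpr"]),
    (201703, ["aligned new", "exception spec in function type",
              "structured bindings"]),
    (202002, ["concepts", "modules", "coroutines"]),
]


def enabled_features(cplusplus_value):
    """Return every feature whose threshold is <= the selected standard."""
    enabled = []
    for threshold, names in FEATURES_ADDED:
        if cplusplus_value >= threshold:
            enabled.extend(names)
    return enabled
```

With this table, 199711 yields an empty list (baseline only), while 201703 yields the C++11, C++14, and C++17 entries but not concepts or modules.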

Key Globals After Resolution

| Global | Type | Content |
|---|---|---|
| dword_126EF68 | int32 | __cplusplus / __STDC_VERSION__ value |
| dword_126EFB4 | int32 | Language mode: 0=unset, 1=C, 2=C++ |
| dword_126EFA8 | int32 | GCC compatibility enabled |
| dword_126EFA4 | int32 | Clang compatibility enabled |
| qword_126EF98 | int64 | GCC version (default 80100) |
| qword_126EF90 | int64 | Clang version (default 90100) |
| dword_126EFB0 | int32 | GNU extensions enabled |
| dword_126EFAC | int32 | Clang extensions enabled |
| dword_126E4A8 | int32 | SM architecture code (from --target) |
| dword_126E1D4 | int32 | MSVC target version |

The set_flag / clear_flag Mechanism

Flag ID 199 (--set_flag / --clear_flag) provides a raw escape hatch. The argument is a flag name looked up in the off_D47CE0 table -- an array of {name, global_address} pairs. If the name is found, the corresponding global variable is set to the provided integer value (--set_flag=name=value) or cleared to 0 (--clear_flag=name). This mechanism allows nvcc to toggle internal EDG configuration flags that do not have dedicated CLI flag registrations.
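The lookup described above amounts to a name-to-variable table. The sketch below models it with a dictionary. The entry name example_flag is invented for illustration, the real contents of off_D47CE0 are not reproduced here, and the behavior on an unknown name (returning False) is an assumption.

```python
# Hypothetical stand-in for the off_D47CE0 {name, global_address} table.
# "example_flag" is an invented entry, not a name taken from the binary.
flag_globals = {"example_flag": 0}


def apply_set_flag(arg):
    """Handle --set_flag=name=value; returns False if the name is unknown."""
    name, sep, value = arg.partition("=")
    if not sep or name not in flag_globals:
        return False
    flag_globals[name] = int(value)
    return True


def apply_clear_flag(name):
    """Handle --clear_flag=name by storing 0 in the matching global."""
    if name not in flag_globals:
        return False
    flag_globals[name] = 0
    return True
```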

Default Values

default_init (sub_45EB40) runs before proc_command_line and initializes approximately 350 global configuration variables. Notable non-zero defaults:

| Global | Default | Meaning |
|---|---|---|
| dword_106C210 | 1 | Exceptions enabled |
| dword_106C180 | 1 | RTTI enabled |
| dword_106C178 | 1 | bool is keyword |
| dword_106C194 | 1 | Namespaces enabled |
| dword_106C19C | 1 | Argument-dependent lookup enabled |
| dword_106C1A0 | 1 | Class name injection enabled |
| dword_106C1A4 | 1 | String literals are const |
| dword_106C188 | 1 | wchar_t is keyword |
| dword_106C18C | 1 | Alternative tokens enabled |
| dword_106C140 | 1 | Compound literals allowed |
| dword_106C138 | 1 | Dependent name processing enabled |
| dword_106C134 | 1 | Template parsing enabled |
| dword_106C12C | 1 | Friend injection enabled |
| dword_106BDB8 | 1 | restrict enabled |
| dword_106BDB0 | 1 | Remove unneeded entities enabled |
| dword_106BD98 | 1 | Trigraphs enabled |
| dword_106BD68 | 1 | Guiding declarations allowed |
| dword_106BD58 | 1 | Old specializations allowed |
| dword_106BD54 | 1 | Implicit typename enabled |
| dword_106BE84 | 1 | Generate move operations enabled |
| dword_106C064 | 1 | Stack limit modification enabled |
| qword_106BD10 | 200 | Max constexpr recursion depth |
| qword_106BD08 | 256 | Max constexpr evaluation cost |
| qword_126EF98 | 80100 | Default GCC version (8.1.0) |
| qword_126EF90 | 90100 | Default Clang version (9.1.0) |
| qword_126EF78 | 1926 | MSVC version threshold |
| qword_126EF70 | 99999 | Some upper bound sentinel |

Conflict Detection

Before the main parsing loop, check_conflicting_flags (sub_451E80) verifies that flags 3, 193, 194, and 195 (no_line_commands, set_flag, clear_flag, and related flags) are not used in conflicting combinations. If any conflict is detected, error 1027 is emitted.

Version Banners

Two flags print version information:

--version (ID 21, -v):

cudafe: NVIDIA (R) Cuda Language Front End
Portions Copyright (c) 2005, 2024 NVIDIA Corporation
Portions Copyright (c) 1988-2018, 2024 Edison Design Group Inc.
Based on Edison Design Group C/C++ Front End, version 6.6
Cuda compilation tools, release 13.0, V13.0.88

--Version (ID 92, -V): Prints a different copyright format with full date/time stamp, then calls exit(1).

Cross-References

EDG Build Configuration

cudafe++ is built from Edison Design Group (EDG) C/C++ front end source code, version 6.6. At build time, NVIDIA sets approximately 750 compile-time constants that control every aspect of the front end's behavior -- from which backend generates output, to how the IL system operates, to what ABI conventions are followed. These constants are baked into the binary and cannot be changed at runtime. They represent the specific EDG configuration NVIDIA chose for CUDA compilation.

The function dump_configuration (sub_44CF30, 785 lines) prints all 747 constants as C preprocessor #define statements when invoked with --dump_configuration. Of these, 613 are defined and 134 are explicitly listed as "not defined." The output is written to qword_126EDF0 (the configuration output stream, typically stderr) in alphabetical order.

$ cudafe++ --dump_configuration
/* Configuration data for Edison Design Group C/C++ Front End */
/* version 6.6, built on Aug 20 2025 at 13:59:03. */

#define ABI_CHANGES_FOR_ARRAY_NEW_AND_DELETE 1
#define ABI_CHANGES_FOR_CONSTRUCTION_VTBLS 1
...
#define WRITE_SIGNOFF_MESSAGE 1

/* Legacy configuration: <unnamed> */
#define LEGACY_TARGET_CONFIGURATION_NAME NULL

The constants fall into seven categories: backend selection, IL system, internal checking, diagnostics, target platform model, compiler compatibility, and feature defaults.

Backend Selection

The EDG front end supports multiple backend code generators. NVIDIA configured cudafe++ for the C++ code generation backend (cp_gen_be), which means the front end's output is C++ source code -- not object code, not C, and not a serialized IL file.

| Constant | Value | Meaning |
|---|---|---|
| BACK_END_IS_CP_GEN_BE | 1 | Backend generates C++ source (the .ii / .int.c output) |
| BACK_END_IS_C_GEN_BE | 0 | Not the C code generation backend |
| BACK_END_SHOULD_BE_CALLED | 1 | Backend phase is active (front end does not stop after parsing) |
| CP_GEN_BE_TARGET_MATCHES_SOURCE_DIALECT | 1 | Generated C++ targets the same dialect as the input |
| GEN_CPP_FILE_SUFFIX | ".int.c" | Output file suffix for generated C++ |
| GEN_C_FILE_SUFFIX | ".int.c" | Output file suffix for generated C (same as C++, unused) |

This is the central architectural fact about cudafe++. It is a source-to-source translator: CUDA C++ goes in, host-side C++ with device stubs comes out. The cp_gen_be backend walks the IL tree and emits syntactically valid C++ that the host compiler (gcc/clang/MSVC) can consume. The generated code preserves the original types, templates, and namespaces rather than lowering to a simpler representation.

The CP_GEN_BE_TARGET_MATCHES_SOURCE_DIALECT=1 setting means the backend does not down-level the output. If the input is C++17, the generated code uses C++17 constructs. This avoids the complexity of translating modern C++ features into older dialects.

Disabled Backend Features

Several backend capabilities are compiled out:

| Constant | Value | Meaning |
|---|---|---|
| GCC_IS_GENERATED_CODE_TARGET | 0 | Output is not GCC-specific C |
| CLANG_IS_GENERATED_CODE_TARGET | 0 | Output is not Clang-specific C |
| MSVC_IS_GENERATED_CODE_TARGET | 0 | Output is not MSVC-specific C |
| SUN_IS_GENERATED_CODE_TARGET | 0 | Output is not Sun/Oracle compiler C |
| MICROSOFT_DIALECT_IS_GENERATED_CODE_TARGET | 0 | Output does not use Microsoft C++ extensions |

None of the compiler-specific code generation targets are enabled. The cp_gen_be emits portable C++ that is syntactically valid across all major compilers. This is possible because CUDA's host compilation already controls dialect selection through its own flag forwarding to the host compiler.

IL System

The Intermediate Language (IL) system is the core data structure connecting the parser to the backend. NVIDIA's configuration makes a critical choice: the IL is never serialized to disk.

| Constant | Value | Meaning |
|---|---|---|
| IL_SHOULD_BE_WRITTEN_TO_FILE | 0 | IL stays in memory -- never written to an IL file |
| DO_IL_LOWERING | 0 | No IL transformation passes before backend |
| IL_WALK_NEEDED | 1 | IL walker infrastructure is compiled in |
| IL_VERSION_NUMBER | "6.6" | IL format version, matches EDG version |
| ALL_TEMPLATE_INFO_IN_IL | 1 | Complete template metadata in the IL graph |
| PROTOTYPE_INSTANTIATIONS_IN_IL | 1 | Uninstantiated function prototypes preserved |
| NEED_IL_DISPLAY | 1 | IL display/dump routines compiled in |
| NEED_NAME_MANGLING | 1 | Name mangling infrastructure compiled in |
| NEED_DECLARATIVE_WALK | 0 | Declarative IL walker not needed |

Why IL_SHOULD_BE_WRITTEN_TO_FILE=0 Matters

In a standard EDG deployment (like the Comeau C++ compiler or Intel ICC's older front end), the IL can be serialized to a binary file for separate backend processing. With IL_SHOULD_BE_WRITTEN_TO_FILE=0, NVIDIA eliminates the entire IL serialization path. The IL exists only as an in-memory graph during compilation:

  1. The parser builds IL nodes in region-based arenas (file-scope region 1, per-function region N)
  2. The IL walker traverses the graph to select device vs. host code
  3. The cp_gen_be backend reads the IL graph directly and emits C++ source
  4. The arenas are freed

This design means the IL_FILE_SUFFIX constant is left undefined -- there is no suffix because there is no file. The constants LARGE_IL_FILE_SUPPORT, USE_TEMPLATE_INFO_FILE, TEMPLATE_INFO_FILE_SUFFIX, INSTANTIATION_FILE_SUFFIX, and EXPORTED_TEMPLATE_FILE_SUFFIX are all similarly undefined.

Why DO_IL_LOWERING=0 Matters

IL lowering is an optional transformation pass that simplifies the IL before the backend processes it. In a lowering-enabled build, complex C++ constructs (VLAs, complex numbers, rvalue adjustments) are reduced to simpler forms. With DO_IL_LOWERING=0, NVIDIA bypasses all of this:

| Constant | Value | Meaning |
|---|---|---|
| DO_IL_LOWERING | 0 | Master lowering switch is off |
| LOWER_COMPLEX | 0 | No lowering of _Complex types |
| LOWER_VARIABLE_LENGTH_ARRAYS | 0 | VLAs passed through as-is |
| LOWER_CLASS_RVALUE_ADJUST | 0 | No rvalue conversion lowering |
| LOWER_FIXED_POINT | 0 | No fixed-point lowering |
| LOWER_IFUNC | 0 | No indirect function lowering |
| LOWER_STRING_LITERALS_TO_NON_CONST | 0 | String literals keep const qualification |
| LOWER_EXTERN_INLINE | 1 | Exception: extern inline functions are lowered |
| LOWERING_NORMALIZES_BOOLEAN_CONTROLLING_EXPRESSIONS | 0 | No boolean normalization |
| LOWERING_REMOVES_UNNEEDED_CONSTRUCTIONS_AND_DESTRUCTIONS | 0 | No dead construction removal |

The only lowering that remains active is LOWER_EXTERN_INLINE=1, which handles extern inline functions that need special treatment in the generated output. Everything else passes through the IL untransformed.

This makes sense for cudafe++'s role. As a source-to-source translator, it benefits from preserving the original code structure. The host compiler handles all the actual lowering when it compiles the generated .int.c file.

Why IL_WALK_NEEDED=1 Matters

Despite no serialization and no lowering, the IL walk infrastructure is compiled in. This is because cudafe++ uses the IL walker for its primary CUDA-specific task: device/host code separation. The walker traverses the IL graph and marks each entity with execution space flags (__host__, __device__, __global__), then the backend selectively emits code based on which space is being generated.

Template Information Preservation

| Constant | Value | Meaning |
|---|---|---|
| ALL_TEMPLATE_INFO_IN_IL | 1 | Full template definitions in the IL, not a separate database |
| PROTOTYPE_INSTANTIATIONS_IN_IL | 1 | Even uninstantiated prototypes kept |
| RECORD_TEMPLATE_STRINGS | 1 | Template argument strings preserved |
| RECORD_HIDDEN_NAMES_IN_IL | 1 | Names hidden by using declarations still recorded |
| RECORD_UNRECOGNIZED_ATTRIBUTES | 1 | Unknown [[attributes]] preserved in IL |
| RECORD_RAW_ASM_OPERAND_DESCRIPTIONS | 1 | Raw asm operand text kept |
| KEEP_TEMPLATE_ARG_EXPR_THAT_CAUSES_INSTANTIATION | 1 | Template argument expressions that trigger instantiation are retained |

With ALL_TEMPLATE_INFO_IN_IL=1, template definitions, partial specializations, and instantiation directives live directly in the IL graph. This eliminates the need for a separate template information file (USE_TEMPLATE_INFO_FILE is undefined). Combined with PROTOTYPE_INSTANTIATIONS_IN_IL=1, the IL retains complete template metadata -- even for function templates that have been declared but not yet instantiated. This is essential for CUDA's device/host separation, where a template might be instantiated in different execution spaces.

Internal Checking

NVIDIA builds cudafe++ with assertions enabled. This produces a binary with extensive runtime self-checking.

| Constant | Value | Meaning |
|---|---|---|
| CHECKING | 1 | Internal assertion macros are active |
| DEBUG | 1 | Debug-mode code paths are compiled in |
| CHECK_SWITCH_DEFAULT_UNEXPECTED | 1 | Default cases in switch statements trigger assertions |
| EXPENSIVE_CHECKING | 0 | Costly O(n) verification checks are disabled |
| OVERWRITE_FREED_MEM_BLOCKS | 0 | No memory poisoning on free |
| EXIT_ON_INTERNAL_ERROR | 0 | Internal errors do not call exit() directly |
| ABORT_ON_INIT_COMPONENT_LEAKAGE | 0 | No abort on init-time leaks |
| TRACK_INTERPRETER_ALLOCATIONS | 0 | constexpr interpreter does not track allocations |

Assertion Infrastructure

With CHECKING=1, the internal assertion handler internal_error (sub_4F2930) is live. The binary contains 5,178 call sites across 2,139 functions that invoke this handler. Each call site passes the source file name, line number, function name, and a diagnostic message pair. When an assertion fires, the handler constructs error 2656 with severity level 11 (catastrophic) and reports it through the standard diagnostic infrastructure.

The DEBUG=1 setting enables additional code paths that perform intermediate consistency checks during parsing and IL construction. These checks are less expensive than EXPENSIVE_CHECKING (which is off) but still add measurable overhead to compilation time. NVIDIA presumably leaves both CHECKING and DEBUG on because cudafe++ is a critical toolchain component where silent corruption is far worse than a slightly slower compilation.

The CHECK_SWITCH_DEFAULT_UNEXPECTED=1 setting means that every switch statement in the EDG source that handles enumerated values will trigger an assertion if control reaches the default case. This catches missing case handling when new enum values are added.

Diagnostics Configuration

These constants control the default formatting and behavior of compiler error messages.

| Constant | Value | Meaning |
|---|---|---|
| DEFAULT_BRIEF_DIAGNOSTICS | 0 | Full diagnostics by default (not one-line) |
| DEFAULT_DISPLAY_ERROR_NUMBER | 0 | Error numbers hidden by default |
| COLUMN_NUMBER_IN_BRIEF_DIAGNOSTICS | 1 | Column numbers included in brief-mode output |
| DEFAULT_ENABLE_COLORIZED_DIAGNOSTICS | 1 | ANSI color codes enabled by default |
| MAX_ERROR_OUTPUT_LINE_LENGTH | 79 | Diagnostic lines wrap at 79 characters |
| DEFAULT_CONTEXT_LIMIT | 10 | Maximum 10 lines of instantiation context shown |
| DEFAULT_DISPLAY_ERROR_CONTEXT_ON_CATASTROPHE | 1 | Show context even on fatal errors |
| DEFAULT_ADD_MATCH_NOTES | 1 | Add notes explaining overload/template resolution |
| DEFAULT_DISPLAY_TEMPLATE_TYPEDEFS_IN_DIAGNOSTICS | 0 | Use raw types, not typedef aliases, in messages |
| DEFAULT_OUTPUT_MODE | om_text | Default output is text, not SARIF JSON |
| DEFAULT_MACRO_POSITIONS_IN_DIAGNOSTICS | (undefined) | Macro expansion position tracking is off |
| ERROR_SEVERITY_EXPLICIT_IN_ERROR_MESSAGES | 1 | Severity word ("error"/"warning") always printed |
| DIRECT_ERROR_OUTPUT_TO_STDOUT | 0 | Errors go to stderr |
| WRITE_SIGNOFF_MESSAGE | 1 | Print summary line at compilation end |

Color Configuration

The DEFAULT_EDG_COLORS constant encodes ANSI SGR (Select Graphic Rendition) color codes for diagnostic categories:

"error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32"

| Category | SGR Code | Appearance |
|---|---|---|
| error | 01;31 | Bold red |
| warning | 01;35 | Bold magenta |
| note | 01;36 | Bold cyan |
| locus | 01 | Bold (default color) |
| quote | 01 | Bold (default color) |
| range1 | 32 | Green (non-bold) |

This matches GCC's diagnostic color scheme, which is intentional -- cudafe++ is designed to produce diagnostics that look visually consistent with the host GCC compiler's output.
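The DEFAULT_EDG_COLORS string is a colon-separated list of category=SGR pairs. The sketch below shows one way to split it and wrap text in the resulting escape sequences. The function names are hypothetical, and the reset sequence ESC[0m is an assumption about how a colored span is terminated.

```python
DEFAULT_EDG_COLORS = "error=01;31:warning=01;35:note=01;36:locus=01:quote=01:range1=32"


def parse_colors(spec):
    """Split 'cat=sgr:cat=sgr:...' into a {category: sgr} dictionary."""
    table = {}
    for entry in spec.split(":"):
        category, _, sgr = entry.partition("=")
        table[category] = sgr
    return table


def colorize(text, category, table):
    """Wrap text in the SGR sequence for category; pass through if unknown."""
    sgr = table.get(category)
    if sgr is None:
        return text
    return "\x1b[" + sgr + "m" + text + "\x1b[0m"
```

For example, colorize("error", "error", parse_colors(DEFAULT_EDG_COLORS)) wraps the word in ESC[01;31m and ESC[0m, which a terminal renders as bold red.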

ABI Configuration

| Constant | Value | Meaning |
|---|---|---|
| ABI_COMPATIBILITY_VERSION | 9999 | Maximum ABI compatibility level |
| IA64_ABI | 1 | Uses Itanium C++ ABI (standard on Linux) |
| ABI_CHANGES_FOR_ARRAY_NEW_AND_DELETE | 1 | Array new/delete ABI changes active |
| ABI_CHANGES_FOR_CONSTRUCTION_VTBLS | 1 | Construction vtable ABI changes active |
| ABI_CHANGES_FOR_COVARIANT_VIRTUAL_FUNC_RETURN | 1 | Covariant return ABI changes active |
| ABI_CHANGES_FOR_PLACEMENT_DELETE | 1 | Placement delete ABI changes active |
| ABI_CHANGES_FOR_RTTI | 1 | RTTI ABI changes active |
| DRIVER_COMPATIBILITY_VERSION | 9999 | Maximum driver-level compatibility |

The ABI_COMPATIBILITY_VERSION=9999 is a sentinel meaning "accept all ABI changes." In EDG's versioning scheme, specific ABI compatibility versions can be set to match a particular compiler release (e.g., GCC 3.2's ABI). Setting it to 9999 means cudafe++ uses the latest ABI rules for every construct, which is appropriate because it generates source code, and the host compiler applies its own ABI rules when it compiles that output.

All five ABI_CHANGES_FOR_* constants are set to 1, meaning every ABI improvement EDG has made is active. These affect name mangling, vtable layout, and RTTI representation. Since cudafe++ emits C++ source rather than object code, these primarily affect name mangling output and the structure of compiler-generated entities.

Compiler Compatibility Layer

cudafe++ emulates GCC by default. These constants configure the compatibility surface.

| Constant | Value | Meaning |
|---|---|---|
| DEFAULT_GNU_COMPATIBILITY | 1 | GCC compatibility mode is on by default |
| DEFAULT_GNU_VERSION | 80100 | Default GCC version = 8.1.0 |
| GNU_TARGET_VERSION_NUMBER | 70300 | Target GCC version = 7.3.0 |
| DEFAULT_GNU_ABI_VERSION | 30200 | Default GNU ABI version = 3.2.0 |
| DEFAULT_CLANG_COMPATIBILITY | 0 | Clang compat off by default |
| DEFAULT_CLANG_VERSION | 90100 | Clang version if enabled = 9.1.0 |
| DEFAULT_MICROSOFT_COMPATIBILITY | 0 | MSVC compat off by default |
| DEFAULT_MICROSOFT_VERSION | 1926 | MSVC version if enabled = 19.26 (VS 2019) |
| MSVC_TARGET_VERSION_NUMBER | 1926 | Same: MSVC 19.26 target |
| GNU_EXTENSIONS_ALLOWED | 1 | GNU extensions compiled into the parser |
| GNU_X86_ASM_EXTENSIONS_ALLOWED | 1 | GNU inline asm syntax supported |
| GNU_X86_ATTRIBUTES_ALLOWED | 1 | GNU __attribute__ on x86 targets |
| GNU_VECTOR_TYPES_ALLOWED | 1 | GNU vector types (__attribute__((vector_size(...)))) |
| GNU_VISIBILITY_ATTRIBUTE_ALLOWED | 1 | __attribute__((visibility(...))) support |
| GNU_INIT_PRIORITY_ATTRIBUTE_ALLOWED | 1 | __attribute__((init_priority(...))) support |
| MICROSOFT_EXTENSIONS_ALLOWED | 0 | MSVC extensions not available |
| SUN_EXTENSIONS_ALLOWED | 0 | Sun/Oracle extensions not available |

The DEFAULT_GNU_VERSION=80100 encodes GCC 8.1.0 as major*10000 + minor*100 + patch. This is the baseline GCC version cudafe++ emulates when nvcc does not specify an explicit --compiler-bindir host compiler. At runtime, nvcc overrides this with the actual detected host GCC version via --gnu_version=NNNNN.

The version numbers stored here serve as fallback defaults. They affect which GNU extensions and builtins are available, which warning behaviors are emulated, and how __GNUC__ / __GNUC_MINOR__ / __GNUC_PATCHLEVEL__ are defined for the preprocessor.
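The major*10000 + minor*100 + patch encoding is easy to check by hand. The sketch below encodes and decodes it and derives the three __GNUC__ macro values from a stored version. The function names are hypothetical; only the encoding itself comes from the text above.

```python
def encode_gnu_version(major, minor, patch):
    """Pack a GCC version as major*10000 + minor*100 + patch."""
    return major * 10000 + minor * 100 + patch


def decode_gnu_version(value):
    """Unpack an encoded version into (major, minor, patch)."""
    return value // 10000, (value // 100) % 100, value % 100


def gnuc_macros(value):
    """Macro values a preprocessor would see for the encoded version."""
    major, minor, patch = decode_gnu_version(value)
    return {
        "__GNUC__": major,
        "__GNUC_MINOR__": minor,
        "__GNUC_PATCHLEVEL__": patch,
    }
```

Decoding the default 80100 gives (8, 1, 0), and decoding GNU_TARGET_VERSION_NUMBER=70300 gives (7, 3, 0).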

Disabled Compatibility Modes

| Constant | Value | Meaning |
|---|---|---|
| CFRONT_2_1_OBJECT_CODE_COMPATIBILITY | 0 | No AT&T cfront 2.1 compat |
| CFRONT_3_0_OBJECT_CODE_COMPATIBILITY | 0 | No AT&T cfront 3.0 compat |
| CFRONT_GLOBAL_VS_MEMBER_NAME_LOOKUP_BUG | 0 | No cfront name lookup bug emulation |
| DEFAULT_SUN_COMPATIBILITY | (undefined) | No Sun/Oracle compat |
| CPPCLI_ENABLING_POSSIBLE | 0 | C++/CLI (managed C++) disabled |
| CPPCX_ENABLING_POSSIBLE | 0 | C++/CX (WinRT extensions) disabled |
| DEFAULT_UPC_MODE | 0 | Unified Parallel C disabled |
| DEFAULT_EMBEDDED_C_ENABLED | 0 | Embedded C extensions disabled |

NVIDIA disables every compatibility mode except GCC. This is consistent with CUDA's host compiler support matrix: GCC and Clang on Linux, MSVC on Windows. The cfront, Sun, UPC, and embedded C modes are EDG capabilities that NVIDIA does not need.

Target Platform Model

The TARG_* constants describe the target architecture's data model. Since cudafe++ is a source-to-source translator for the host side, these model x86-64 Linux.

Data Type Sizes (bytes)

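| Type | Size | Alignment |
|---|---|---|
| char | 1 | 1 |
| short | 2 | 2 |
| int | 4 | 4 |
| long | 8 | 8 |
| long long | 8 | 8 |
| __int128 | 16 | 16 |
| pointer | 8 | 8 |
| float | 4 | 4 |
| double | 8 | 8 |
| long double | 16 | 16 |
| __float80 | 16 | 16 |
| __float128 | 16 | 16 |
| ptr-to-data-member | 8 | 8 |
| ptr-to-member-function | 16 | 8 |
| ptr-to-virtual-base | 8 | 8 |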
This is the standard LP64 data model (long and pointer are 64-bit). TARG_ALL_POINTERS_SAME_SIZE=1 confirms there are no near/far pointer distinctions.

Key Target Properties

| Constant | Value | Meaning |
|---|---|---|
| TARG_CHAR_BIT | 8 | 8 bits per byte |
| TARG_HAS_SIGNED_CHARS | 1 | char is signed by default |
| TARG_HAS_IEEE_FLOATING_POINT | 1 | IEEE 754 floating point |
| TARG_SUPPORTS_X86_64 | 1 | x86-64 target support |
| TARG_SUPPORTS_ARM64 | 0 | No ARM64 target support |
| TARG_SUPPORTS_ARM32 | 0 | No ARM32 target support |
| TARG_DEFAULT_NEW_ALIGNMENT | 16 | operator new returns 16-byte aligned |
| TARG_IA64_ABI_USE_GUARD_ACQUIRE_RELEASE | 1 | Thread-safe static local init guards |
| TARG_CASE_SENSITIVE_EXTERNAL_NAMES | 1 | Symbol names are case-sensitive |
| TARG_EXTERNAL_NAMES_GET_UNDERSCORE_ADDED | 0 | No leading underscore on symbols |

The TARG_SUPPORTS_ARM64=0 and TARG_SUPPORTS_ARM32=0 confirm that this build of cudafe++ targets x86-64 Linux only. NVIDIA produces separate cudafe++ builds for other host platforms (ARM64 Linux, Windows).

Floating Point Model

| Constant | Value | Meaning |
|---|---|---|
| FP_USE_EMULATION | 1 | Floating-point constant folding uses software emulation |
| USE_SOFTFLOAT | 1 | Software floating-point library linked |
| APPROXIMATE_QUADMATH | 1 | __float128 operations use approximate arithmetic |
| USE_QUADMATH_LIBRARY | 0 | Not linked against libquadmath |
| HOST_FP_VALUE_IS_128BIT | 1 | Host FP value representation uses 128 bits |
| FP_LONG_DOUBLE_IS_80BIT_EXTENDED | 1 | long double is x87 80-bit extended precision |
| FP_LONG_DOUBLE_IS_BINARY128 | 0 | long double is not IEEE binary128 |
| FLOAT80_ENABLING_POSSIBLE | 1 | __float80 type can be enabled |
| FLOAT128_ENABLING_POSSIBLE | 1 | __float128 type can be enabled |

The FP_USE_EMULATION=1 and USE_SOFTFLOAT=1 settings mean cudafe++ does not use the host CPU's floating-point unit for constant folding during compilation. Instead, it uses a software emulation library. This guarantees deterministic results regardless of the build machine's FPU behavior, rounding mode, or x87 precision settings. The APPROXIMATE_QUADMATH=1 indicates that __float128 constant folding uses an approximate (but portable) implementation rather than requiring libquadmath.

Memory and Host Configuration

| Constant | Value | Meaning |
|---|---|---|
| USE_MMAP_FOR_MEMORY_REGIONS | 1 | IL memory regions use mmap |
| USE_MMAP_FOR_MODULES | 1 | C++ module storage uses mmap |
| HOST_ALLOCATION_INCREMENT | 65536 | Arena grows in 64 KB increments |
| HOST_ALIGNMENT_REQUIRED | 8 | Host requires 8-byte alignment |
| HOST_IL_ENTRY_PREFIX_ALIGNMENT | 8 | IL node prefix aligned to 8 bytes |
| HOST_POINTER_ALIGNMENT | 8 | Pointer alignment on host platform |
| USE_FIXED_ADDRESS_FOR_MMAP | 0 | No fixed mmap addresses |
| NULL_POINTER_IS_ZERO | 1 | Null pointer has all-zero bit pattern |

The USE_MMAP_FOR_MEMORY_REGIONS=1 setting means the IL's region-based arena allocator uses mmap system calls (likely MAP_ANONYMOUS) rather than malloc. This gives EDG more control over memory layout and allows whole-region deallocation via munmap without fragmentation concerns. The 64 KB allocation increment (HOST_ALLOCATION_INCREMENT=65536) means each arena expansion maps a new 64 KB page-aligned chunk.
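A region allocator of this shape can be sketched as a bump allocator over anonymous mappings. The code below is a simplified model, not EDG's allocator: the Region class and its methods are hypothetical, requests larger than one increment are rejected rather than handled, and allocations are reported as (chunk index, offset) pairs instead of raw addresses.

```python
import mmap

HOST_ALLOCATION_INCREMENT = 65536  # 64 KB per arena expansion
HOST_ALIGNMENT_REQUIRED = 8        # every allocation is 8-byte aligned


class Region:
    """Toy region: bump-allocates inside 64 KB anonymous mappings."""

    def __init__(self):
        self.chunks = []
        # Start "full" so the first allocation maps the first chunk.
        self.offset = HOST_ALLOCATION_INCREMENT

    def alloc(self, size):
        # Round the request up to the required alignment.
        mask = HOST_ALIGNMENT_REQUIRED - 1
        size = (size + mask) & ~mask
        if size > HOST_ALLOCATION_INCREMENT:
            raise ValueError("oversized requests are not handled in this sketch")
        if self.offset + size > HOST_ALLOCATION_INCREMENT:
            # Anonymous mapping: fileno -1 means no backing file.
            self.chunks.append(mmap.mmap(-1, HOST_ALLOCATION_INCREMENT))
            self.offset = 0
        start = self.offset
        self.offset += size
        return len(self.chunks) - 1, start

    def release(self):
        """Free the whole region at once by unmapping every chunk."""
        for chunk in self.chunks:
            chunk.close()
        self.chunks = []
        self.offset = HOST_ALLOCATION_INCREMENT
```

Freeing is per region rather than per object, which is the property the paragraph above attributes to munmap-based deallocation.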

Code Generation Controls

These constants affect what the cp_gen_be backend emits.

| Constant | Value | Meaning |
|---|---|---|
| GENERATE_SOURCE_SEQUENCE_LISTS | 1 | Source sequence lists (instantiation ordering) generated |
| GENERATE_LINKAGE_SPEC_BLOCKS | 1 | extern "C" blocks preserved in output |
| USING_DECLARATIONS_IN_GENERATED_CODE | 1 | using declarations appear in output |
| GENERATE_EH_TABLES | 0 | No EH tables -- host compiler handles exceptions |
| GENERATE_MICROSOFT_IF_EXISTS_ENTRIES | 0 | No __if_exists / __if_not_exists output |
| SUPPRESS_ARRAY_STATIC_IN_GENERATED_CODE | 1 | static in array parameter declarations suppressed |
| GCC_BUILTIN_VARARGS_IN_GENERATED_CODE | 0 | No GCC __builtin_va_* in output |
| USE_HEX_FP_CONSTANTS_IN_GENERATED_CODE | 0 | No hex float literals in output |
| ADD_BRACES_TO_AVOID_DANGLING_ELSE_IN_GENERATED_C | 0 | No extra braces for dangling else |
| DOING_SOURCE_ANALYSIS | 1 | Source analysis mode (affects what is preserved) |

The GENERATE_EH_TABLES=0 setting is significant. Exception handling tables are not generated because cudafe++ emits source code -- the host compiler is responsible for generating the actual EH tables when it compiles the .int.c output. Similarly, GCC_BUILTIN_VARARGS_IN_GENERATED_CODE=0 means the output uses standard <stdarg.h> varargs rather than GCC builtins, keeping the output compiler-portable.

Template and Instantiation Model

| Constant | Value | Meaning |
|---|---|---|
| AUTOMATIC_TEMPLATE_INSTANTIATION | 0 | No automatic instantiation to separate files |
| INSTANTIATION_BY_IMPLICIT_INCLUSION | 1 | Template definitions found via implicit include |
| INSTANTIATE_TEMPLATES_EVERYWHERE_USED | 0 | Not every use triggers instantiation |
| INSTANTIATE_EXTERN_INLINE | 0 | Extern inline templates not instantiated eagerly |
| INSTANTIATE_INLINE_VARIABLES | 0 | Inline variables not instantiated eagerly |
| INSTANTIATE_BEFORE_PCH_CREATION | 0 | No instantiation before PCH |
| DEFAULT_INSTANTIATION_MODE | tim_none | No separate instantiation mode |
| DEFAULT_MAX_PENDING_INSTANTIATIONS | 200 | Maximum pending instantiations per TU |
| MAX_TOTAL_PENDING_INSTANTIATIONS | 256 | Hard cap on total pending |
| MAX_UNUSED_ALL_MODE_INSTANTIATIONS | 200 | Limit on unused instantiation entries |
| DEFAULT_MAX_DEPTH_CONSTEXPR_CALL | 256 | Maximum constexpr recursion depth |
| DEFAULT_MAX_COST_CONSTEXPR_CALL | 2000000 | Maximum constexpr evaluation cost |

The AUTOMATIC_TEMPLATE_INSTANTIATION=0 and DEFAULT_INSTANTIATION_MODE=tim_none disable EDG's automatic template instantiation mechanism. This mechanism (where EDG writes instantiation requests to a file for later processing) is unnecessary because cudafe++ processes each translation unit in a single pass -- templates are instantiated inline as the parser encounters them, and the backend emits the instantiated code directly.

Feature Enablement Constants

The DEFAULT_* constants set the initial values of runtime-configurable features. These can be overridden by command-line flags, but they establish the baseline behavior when no flags are specified.

Enabled by Default

| Constant | Value | Feature |
|---|---|---|
| DEFAULT_GNU_COMPATIBILITY | 1 | GCC compatibility mode |
| DEFAULT_EXCEPTIONS_ENABLED | 1 | C++ exception handling |
| DEFAULT_RTTI_ENABLED | 1 | Runtime type identification |
| DEFAULT_BOOL_IS_KEYWORD | 1 | bool is a keyword (not a typedef) |
| DEFAULT_WCHAR_T_IS_KEYWORD | 1 | wchar_t is a keyword |
| DEFAULT_NAMESPACES_ENABLED | 1 | Namespaces are supported |
| DEFAULT_ARG_DEPENDENT_LOOKUP | 1 | ADL (Koenig lookup) active |
| DEFAULT_CLASS_NAME_INJECTION | 1 | Class name injected into its own scope |
| DEFAULT_EXPLICIT_KEYWORD_ENABLED | 1 | explicit keyword recognized |
| DEFAULT_EXTERN_INLINE_ALLOWED | 1 | extern inline permitted |
| DEFAULT_IMPLICIT_NOEXCEPT_ENABLED | 1 | Implicit noexcept on dtors/deallocs |
| DEFAULT_IMPLICIT_TYPENAME_ENABLED | 1 | typename implicit in dependent contexts |
| DEFAULT_TYPE_TRAITS_HELPERS_ENABLED | 1 | Compiler intrinsic type traits |
| DEFAULT_STRING_LITERALS_ARE_CONST | 1 | String literals have const type |
| DEFAULT_TYPE_INFO_IN_NAMESPACE_STD | 1 | type_info in std:: |
| DEFAULT_C_AND_CPP_FUNCTION_TYPES_ARE_DISTINCT | 1 | C and C++ function types differ |
| DEFAULT_FRIEND_INJECTION | 1 | Friend declarations inject names |
| DEFAULT_DISTINCT_TEMPLATE_SIGNATURES | 1 | Template signatures are distinct |
| DEFAULT_ARRAY_NEW_AND_DELETE_ENABLED | 1 | operator new[] / operator delete[] |
| DEFAULT_CPP11_DEPENDENT_NAME_PROCESSING | 1 | C++11-style dependent name processing |
| DEFAULT_ENABLE_COLORIZED_DIAGNOSTICS | 1 | ANSI color in diagnostics |
| DEFAULT_CHECK_FOR_BYTE_ORDER_MARK | 1 | UTF-8 BOM detection on |
| DEFAULT_CHECK_PRINTF_SCANF_POSITIONAL_ARGS | 1 | printf/scanf format checking |
| DEFAULT_ALWAYS_FOLD_CALLS_TO_BUILTIN_CONSTANT_P | 1 | __builtin_constant_p folded |

Disabled by Default (Require Explicit Enabling)

| Constant | Value | Feature |
|---|---|---|
| DEFAULT_CPP_MODE | 199711 | Default language standard is C++98 |
| DEFAULT_LAMBDAS_ENABLED | 0 | Lambdas off (enabled by C++ version selection) |
| DEFAULT_RVALUE_REFERENCES_ENABLED | 0 | Rvalue refs off (enabled by C++ version) |
| DEFAULT_VARIADIC_TEMPLATES_ENABLED | 0 | Variadic templates off (enabled by C++ version) |
| DEFAULT_NULLPTR_ENABLED | 0 | nullptr off (enabled by C++ version) |
| DEFAULT_RANGE_BASED_FOR_ENABLED | 0 | Range-for off (enabled by C++ version) |
| DEFAULT_AUTO_TYPE_SPECIFIER_ENABLED | 0 | auto type deduction off (enabled by C++ version) |
| DEFAULT_COMPOUND_LITERALS_ALLOWED | 0 | C99 compound literals off |
| DEFAULT_DESIGNATORS_ALLOWED | 0 | C99/C++20 designated initializers off |
| DEFAULT_C99_MODE | 0 | Not in C99 mode |
| DEFAULT_VLA_ENABLED | 0 | Variable-length arrays off |
| DEFAULT_CPP11_SFINAE_ENABLED | 0 | C++11 SFINAE rules off (enabled by C++ version) |
| DEFAULT_MODULES_ENABLED | 0 | C++20 modules off |
| DEFAULT_REFLECTION_ENABLED | 0 | C++ reflection off |
| DEFAULT_MICROSOFT_COMPATIBILITY | 0 | MSVC compat off |
| DEFAULT_CLANG_COMPATIBILITY | 0 | Clang compat off |
| DEFAULT_BRIEF_DIAGNOSTICS | 0 | Full diagnostic output |
| DEFAULT_DISPLAY_ERROR_NUMBER | 0 | Error numbers hidden |
| DEFAULT_INCOGNITO | 0 | Not in incognito mode |
| DEFAULT_REMOVE_UNNEEDED_ENTITIES | 0 | Dead code not removed |

The DEFAULT_CPP_MODE=199711 (C++98) looks surprising, but this is simply the EDG default. In practice, nvcc always passes an explicit --c++NN flag to cudafe++ that overrides this default, typically --c++17 in modern CUDA. The C++11/14/17/20 features listed as "disabled by default" are all enabled by the standard version selection code in proc_command_line.

Predefined Macro Constants

These constants control which macros cudafe++ automatically defines for the preprocessor.

| Constant | Value | Effect |
|---|---|---|
| DEFINE_MACRO_WHEN_EXCEPTIONS_ENABLED | 1 | --exceptions causes #define __EXCEPTIONS |
| DEFINE_MACRO_WHEN_RTTI_ENABLED | 1 | --rtti causes #define __RTTI |
| DEFINE_MACRO_WHEN_BOOL_IS_KEYWORD | 1 | bool keyword causes #define _BOOL |
| DEFINE_MACRO_WHEN_WCHAR_T_IS_KEYWORD | 1 | wchar_t keyword causes #define _WCHAR_T |
| DEFINE_MACRO_WHEN_ARRAY_NEW_AND_DELETE_ENABLED | 1 | Causes #define __ARRAY_OPERATORS |
| DEFINE_MACRO_WHEN_PLACEMENT_DELETE_ENABLED | 1 | Causes #define __PLACEMENT_DELETE |
| DEFINE_MACRO_WHEN_VARIADIC_TEMPLATES_ENABLED | 1 | Causes #define __VARIADIC_TEMPLATES |
| DEFINE_MACRO_WHEN_CHAR16_T_AND_CHAR32_T_ARE_KEYWORDS | 1 | Causes #define __CHAR16_T_AND_CHAR32_T |
| DEFINE_MACRO_WHEN_LONG_LONG_IS_DISABLED | 1 | Causes #define __NO_LONG_LONG when long long is off |
| DEFINE_FEATURE_TEST_MACRO_OPERATORS_IN_ALL_MODES | 1 | Feature test macros available in all modes |
| MACRO_DEFINED_WHEN_IA64_ABI | "__EDG_IA64_ABI" | Always defined (since IA64_ABI=1) |
| MACRO_DEFINED_WHEN_TYPE_TRAITS_HELPERS_ENABLED | "__EDG_TYPE_TRAITS_ENABLED" | Always defined (since type traits are on) |

These macros allow header files to conditionally compile based on which compiler features are active. They are part of EDG's mechanism for compatibility with GCC's predefined macro surface -- GCC defines __EXCEPTIONS when exceptions are on, so cudafe++ does the same.

Miscellaneous Constants

| Constant | Value | Meaning |
|---|---|---|
| VERSION_NUMBER | "6.6" | EDG front end version |
| VERSION_NUMBER_FOR_MACRO | 606 | Numeric form for __EDG_VERSION__ macro |
| DIRECTORY_SEPARATOR | '/' | Unix path separator |
| FILE_NAME_FOR_STDIN | "-" | Standard Unix convention for stdin |
| OBJECT_FILE_SUFFIX | ".o" | Unix object file suffix |
| PCH_FILE_SUFFIX | ".pch" | Precompiled header suffix |
| PREDEFINED_MACRO_FILE_NAME | "predefined_macros.txt" | File with platform-defined macros |
| DEFAULT_TMPDIR | "/tmp" | Default temp directory |
| DEFAULT_USR_INCLUDE | "/usr/include" | Default system include path |
| DEFAULT_EDG_BASE | "" | EDG base directory (empty = use argv[0] path) |
| MAX_INCLUDE_FILES_OPEN_AT_ONCE | 8 | Limit on simultaneously open include files |
| MODULE_MAX_LINE_NUMBER | 250000 | Maximum source lines per module |
| COMPILE_MULTIPLE_SOURCE_FILES | 0 | One source file per invocation |
| COMPILE_MULTIPLE_TRANSLATION_UNITS | 0 | One TU per invocation |
| USING_DRIVER | 0 | Not integrated into a driver binary |
| EDG_WIN32 | 0 | Not a Windows build |
| WINDOWS_PATHS_ALLOWED | 0 | No backslash path separators |

The VERSION_NUMBER="6.6" identifies this as EDG C/C++ front end version 6.6, which is the latest major release. VERSION_NUMBER_FOR_MACRO=606 becomes the __EDG_VERSION__ predefined macro, allowing header files to detect the exact EDG version (e.g., #if __EDG_VERSION__ >= 606).

The legacy configuration section at the bottom of the dump output reports LEGACY_TARGET_CONFIGURATION_NAME as NULL, meaning this build does not use a named legacy target configuration. In EDG's framework, named target configurations are used to preset constants for specific compilers (e.g., "gnu" or "microsoft"). NVIDIA's configuration is fully custom and does not map to any of EDG's predefined configurations.

Relationship Between Build Configuration and Runtime Flags

The build configuration constants and the runtime CLI flags form a three-layer system:

  1. Build-time constants (CHECKING=1, BACK_END_IS_CP_GEN_BE=1, IL_SHOULD_BE_WRITTEN_TO_FILE=0) determine what code paths exist in the binary. If IL_SHOULD_BE_WRITTEN_TO_FILE=0, the IL serialization code is not compiled in -- no runtime flag can enable it.

  2. DEFAULT_* constants set initial values for features that can be toggled at runtime. DEFAULT_EXCEPTIONS_ENABLED=1 means exceptions are on unless --no_exceptions is passed. These defaults are loaded by default_init (sub_45EB40) before command-line parsing.

  3. *_ENABLING_POSSIBLE constants gate whether a feature can be toggled at all. COROUTINE_ENABLING_POSSIBLE=1 means the --coroutines / --no_coroutines flag pair is registered. REFLECTION_ENABLING_POSSIBLE=0 means the reflection flag pair is not even registered -- the feature cannot be turned on.

This layering means the build configuration determines the binary's permanent capabilities, while the CLI flags select among the enabled possibilities.
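The layering can be illustrated with a small sketch. Everything here is a hypothetical miniature: the flag table, init_flags, and set_flag are invented for illustration, the --no_ flag pairs are collapsed into a single setter, and the zero defaults for coroutines and reflection are assumptions. Only the three constant names and their values come from the text above.

```c
#include <stddef.h>
#include <string.h>

/* Build-time constants (values as described in the text). */
#define COROUTINE_ENABLING_POSSIBLE  1
#define REFLECTION_ENABLING_POSSIBLE 0
#define DEFAULT_EXCEPTIONS_ENABLED   1

/* Runtime-togglable state (hypothetical names). */
int exceptions_enabled, coroutines_enabled, reflection_enabled;

struct flag { const char *name; int *target; };
static struct flag flag_table[4];
static size_t flag_count;

static void register_flag(const char *name, int *target)
{
    flag_table[flag_count].name = name;
    flag_table[flag_count].target = target;
    flag_count++;
}

/* Loads defaults, then registers only the flags the build allows. */
void init_flags(void)
{
    exceptions_enabled = DEFAULT_EXCEPTIONS_ENABLED;
    coroutines_enabled = 0;   /* assumed default */
    reflection_enabled = 0;   /* assumed default */
    flag_count = 0;
    register_flag("exceptions", &exceptions_enabled);
#if COROUTINE_ENABLING_POSSIBLE
    register_flag("coroutines", &coroutines_enabled);
#endif
#if REFLECTION_ENABLING_POSSIBLE
    register_flag("reflection", &reflection_enabled);
#endif
}

/* Returns 1 if the flag exists and was applied, 0 if it was never registered. */
int set_flag(const char *name, int value)
{
    for (size_t i = 0; i < flag_count; i++) {
        if (strcmp(flag_table[i].name, name) == 0) {
            *flag_table[i].target = value;
            return 1;
        }
    }
    return 0;
}
```

After init_flags(), set_flag("coroutines", 1) succeeds, while set_flag("reflection", 1) returns 0 because the registration was compiled out.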

Function Reference

| Function | Address | Lines | Role |
|---|---|---|---|
| dump_configuration | sub_44CF30 | 785 | Print all 747 constants as #define statements |
| default_init | sub_45EB40 | 470 | Initialize 350 config globals from DEFAULT_* values |
| init_command_line_flags | sub_452010 | 3,849 | Register all CLI flags (gated by *_ENABLING_POSSIBLE) |
| proc_command_line | sub_459630 | 4,105 | Parse flags and override DEFAULT_* settings |

Architecture Detection

cudafe++ determines the target GPU architecture through a five-stage pipeline: nvcc translates the user-facing --gpu-architecture=sm_XX flag into an internal numeric index, passes it to cudafe++ via --target, the CLI parser stores the index in a global, set_target_configuration configures over 100 type-system globals for that target, and the TU initializer copies the index into per-translation-unit state where it is read by feature gates throughout compilation. A parallel path, select_cp_gen_be_target_dialect, routes the backend to emit either device-side or host-side C++ based on a separate flag. This page documents the complete chain from nvcc invocation to the point where individual feature checks read the stored architecture value.

Key Facts

| Property | Value |
|---|---|
| Target index global | dword_126E4A8 (set by --target, CLI case 245) |
| Invalid sentinel | -1 (0xFFFFFFFF) |
| Error on invalid target | Error 2664: "invalid or no value specified with --nv_arch flag" |
| Target parser stub | sub_7525E0 (6 bytes, returns -1 unconditionally) |
| Configuration function | sub_7525F0 (set_target_configuration, target.c:299) |
| Type table initializer | sub_7515D0 (100+ globals, called from sub_7525F0) |
| Configuration validator | sub_7527B0 (check_target_configuration, target.c:512-659) |
| Field alignment initializer | sub_752DF0 (init_field_alignment_tables, target.c:825) |
| Dialect selector | sub_752A80 (select_cp_gen_be_target_dialect, target.c:736) |
| TU-level copy | dword_126EBF8 (target_configuration_index, set in sub_586240) |
| GPU mode flag | dword_126EFA8 (set by --gcc, case 182; gates dialect selection) |
| Device-side flag | dword_126EFA4 (set by --clang, case 187; selects device vs host output) |

The Full Propagation Chain

The architecture value flows through five distinct stages before it is available for feature gate checks. Each stage adds a layer of processing: parsing, validation, type model configuration, dialect routing, and per-TU state replication.

Stage 1: nvcc                         Stage 2: CLI parsing
  --gpu-architecture=sm_90    --->      case 245 (--target)
  translates to --target=<idx>          sub_7525E0(<arg>) -> dword_126E4A8
                                        if -1: error 2664, abort
                                            |
                                            v
Stage 3: Target init                   Stage 4: Dialect selection
  sub_7525F0(idx)                        sub_752A80()
    assert idx is -1 or 0                  if dword_126EFA8 (GPU mode):
    sub_7515D0()   -> 100+ type globals      if dword_126EFA4: device path
    qword_126E1B0 = "lib"                    else: host path
    sub_752DF0()   -> alignment tables
    sub_7527B0()   -> validation
                                            |
                                            v
Stage 5: TU initialization
  sub_586240()
    dword_126EBF8 = dword_126E4A8  (per-TU copy)
    version marker: "6.6\0"
    timestamp copy
                                            |
                                            v
Feature checks throughout compilation
  if (dword_126E4A8 < 70) { error("__grid_constant__ requires compute_70"); }
  if (dword_126E4A8 < 80) { error("__nv_register_params__ requires compute_80"); }
  ...

Stage 1: nvcc Translates the Architecture

Users specify the GPU architecture through nvcc:

nvcc --gpu-architecture=sm_90 source.cu

nvcc translates this into an internal numeric index and passes it to cudafe++ as --target=<index>. The value stored in dword_126E4A8 is NOT a raw SM number like 90 -- it is an index into EDG's target configuration table. nvcc performs the mapping from user-facing strings (sm_90, compute_80, etc.) to this index. cudafe++ never sees the sm_XX string directly.

The --target flag is registered as CLI flag 253 with the internal case_id 245 in the flag table:

// From sub_452010 (init_command_line_flags)
sub_451F80(245, "target", 0, 1, 1, 1);
//         ^id   ^name   ^no_short ^has_arg ^mode ^enabled

Stage 2: CLI Parsing (proc_command_line, case 245)

When proc_command_line (sub_459630) encounters --target, it dispatches to case 245:

// sub_459630, case 245 (decompiled)
case 245:
    v80 = sub_7525E0(qword_E7FF28, v23, v20, v30);
    dword_126E4A8 = v80;                   // store target index
    if (v80 == -1) {
        sub_4F8420(2664);                   // emit error 2664
        // "invalid or no value specified with --nv_arch flag"
        sub_4F2930("cmd_line.c", 12219,
                   "proc_command_line", 0, 0);  // assert-fail
    }
    sub_7525F0(v80);                        // set_target_configuration
    goto LABEL_136;                         // continue parsing

The error string references --nv_arch, which is the nvcc-facing name for this flag. Internally cudafe++ processes it as --target (case 245). The discrepancy presumably exists because the error message is shared with nvcc's error reporting path.

The sub_7525E0 Stub

sub_7525E0 is the architecture parser function. In the CUDA Toolkit 13.0 binary, it is a 6-byte stub:

// sub_7525E0 -- 0x7525E0, 6 bytes
__int64 sub_7525E0()
{
    return 0xFFFFFFFFLL;  // always returns -1
}
; IDA disassembly
sub_7525E0:
    mov     eax, 0FFFFFFFFh
    retn

This stub always returns the invalid sentinel -1 and ignores its inputs: the call site (sub_7525E0(qword_E7FF28, v23, v20, v30)) passes four arguments, none of which the stub reads. Taken literally, the decompiled case 245 would therefore store -1 in dword_126E4A8 and raise error 2664 on every --target invocation, so static analysis of this function alone does not explain how a valid index reaches the global. Two explanations are possible:

  1. The actual parsing is performed by nvcc, which passes the pre-resolved numeric index as the argument string, and sub_7525E0 originally just converted it (for example with strtol) before its body was eliminated in this build.

  2. The function is a placeholder that was meant to be replaced at link time by a different object file provided when the toolchain is built.

Neither explanation is confirmed by the binary. What the decompilation does establish is that -1 is the sentinel for a missing or invalid --target argument, and that this sentinel triggers error 2664.

Stage 3: set_target_configuration (sub_7525F0)

After the target index is stored, sub_7525F0 performs the post-parse initialization. This function lives in target.c:299:

// sub_7525F0 -- set_target_configuration
__int64 __fastcall sub_7525F0(int a1)
{
    // Guard: passes only for a1 == -1 and a1 == 0.
    // -1 + 1 = 0,  0 > 1u is false (passes)
    //  0 + 1 = 1,  1 > 1u is false (passes)
    //  1 + 1 = 2,  2 > 1u is true  (fires; same for every a1 >= 1)
    // -2 + 1 = -1, cast to unsigned = 0xFFFFFFFF > 1u (fires; same for every a1 <= -2)
    if ((unsigned int)(a1 + 1) > 1)
        assert_fail("target.c", 299, "set_target_configuration", 0, 0);

    sub_7515D0();              // initialize type tables
    qword_126E1B0 = "lib";    // library search path prefix
    return -1;                 // return value unused
}

The unsigned comparison (a1 + 1) > 1u accepts values 0 and -1, rejecting everything else, including positive indices. Because the -1 case is caught earlier by the error 2664 check, 0 is the only value that passes this guard in the decompiled code path. The guard is a sanity assertion rather than a functional check.
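The behavior of the guard can be checked in isolation. The sketch below reproduces only the comparison; the name guard_fires is hypothetical, and the cast is applied before the addition (rather than after, as in the decompilation) so that the sketch has defined behavior for every int input.

```c
/* Nonzero means the assertion in set_target_configuration would fire.
 * Unsigned arithmetic wraps, so -1 maps to 0 and 0 maps to 1; every
 * other input maps to a value greater than 1. */
int guard_fires(int a1)
{
    return ((unsigned int)a1 + 1u) > 1u;
}
```

Calling guard_fires with -1 or 0 yields 0; any other argument, such as 1, 90, or -2, yields 1.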

Type Table Initialization (sub_7515D0)

sub_7515D0 is the core of Stage 3. It sets over 100 global variables that define the target platform's data model. These globals control how the EDG front end sizes types, computes alignments, and evaluates constant expressions. The function hardcodes an LP64 data model with CUDA-specific properties:

// sub_7515D0 -- target type initialization (complete decompilation)
__int64 sub_7515D0()
{
    // === Integer type sizes (in bytes) ===
    dword_126E338 = 4;     // sizeof(int)
    dword_126E328 = 8;     // sizeof(long)
    dword_126E410 = 4;     // sizeof(short) [confirmed by cross-ref]
    dword_126E420 = 2;     // sizeof(wchar_t)

    // === Pointer properties ===
    dword_126E2B8 = 8;     // sizeof(pointer)
    dword_126E2AC = 8;     // alignof(pointer)
    dword_126E4A0 = 8;     // target bits-per-byte (CHAR_BIT)
    dword_126E29C = 8;     // sizeof(ptrdiff_t)

    // === Floating-point properties ===
    // float: 24-bit mantissa, exponent range [-125, 128]
    dword_126E264 = 24;    // float mantissa bits
    dword_126E25C = 128;   // float max exponent
    dword_126E260 = -125;  // float min exponent

    // double: 53-bit mantissa, exponent range [-1021, 1024]
    dword_126E258 = 53;    // double mantissa bits
    dword_126E250 = 1024;  // double max exponent
    dword_126E254 = -1021; // double min exponent

    // long double: 16 bytes, same as __float128
    dword_126E2FC = 16;    // sizeof(long double)
    dword_126E308 = 16;    // alignof(long double)

    // __float128: 113-bit mantissa, exponent range [-16381, 16384]
    dword_126E234 = 113;   // __float128 mantissa bits
    dword_126E22C = 0x4000; // __float128 max exponent (16384)
    dword_126E230 = -16381; // __float128 min exponent

    // 80-bit extended (x87): 64-bit mantissa, same exponent range as __float128
    dword_126E240 = 64;    // x87 extended mantissa bits
    dword_126E238 = 0x4000; // x87 extended max exponent
    dword_126E23C = -16381; // x87 extended min exponent
    dword_126E24C = 64;    // another extended format (IBM double-double?)
    dword_126E244 = 0x4000;
    dword_126E248 = -16381;

    // === Alignment properties ===
    dword_126E400 = 8;     // alignof(long long)
    dword_126E3F0 = 8;     // alignof(double)
    dword_126E35C = 8;     // alignof(long)
    dword_126E3E0 = 16;    // alignof(__int128) or max alignment
    dword_126E318 = 16;    // alignof(long double, repeated)
    dword_126E278 = 16;    // maximum natural alignment

    // === Endianness and signedness ===
    dword_126E4A4 = 1;     // little-endian
    dword_126E498 = 1;     // char is signed
    dword_126E368 = 1;     // int is 2's complement
    dword_126E384 = 1;     // enum underlying type signed

    // === Bit-field and struct layout ===
    dword_126E3A8 = -1;    // MSVC bit-field allocation mode (-1 = disabled)
    dword_126E2A8 = 0;     // no extra struct padding
    dword_126E2F0 = 0;     // field alignment override disabled
    dword_126E398 = 0;     // no special alignment for unnamed fields
    dword_126E298 = 0;     // no zero-length array as last field padding
    dword_126E288 = 1;     // field alloc order = declaration order
    dword_126E294 = 1;     // allow zero-sized objects
    dword_126E28C = 1;     // allow empty base optimization

    // === ABI flags ===
    dword_126E394 = 1;     // ELF-style name mangling
    dword_126E3AC = 1;     // Itanium ABI compliance
    dword_126E37C = 1;     // EH table generation enabled
    dword_126E3A0 = 0;     // no Windows SEH
    dword_126E36C = 1;     // thunks for virtual calls
    dword_126E380 = 1;     // covariant return types
    dword_126E39C = 0;     // no RTTI incompatibility workaround

    // === Integral type encoding (byte_126E4xx) ===
    byte_126E431 = 0;      // bool encoding index
    byte_126E430 = 2;      // char encoding index
    byte_126E480 = 4;      // char16_t encoding
    byte_126E470 = 6;      // char32_t encoding
    byte_126E490 = 5;      // wchar_t encoding
    byte_126E481 = 6;      // char8_t encoding

    // === Size_t properties ===
    byte_126E349 = 8;      // size_t byte width indicator
    qword_126E350 = -1;    // SIZE_MAX (0xFFFFFFFFFFFFFFFF for 64-bit)
    byte_126E348 = 7;      // size_t type encoding index

    // === String properties ===
    dword_126E49C = 8;     // host string char bit width
    dword_126E1BC = 1;     // feature flag (enabled)
    dword_126E494 = 1;     // null-terminated string assumption

    // === Replicated size values (qword versions) ===
    // These are 64-bit copies of the 32-bit size values above,
    // used for 64-bit arithmetic in constant evaluation
    qword_126E330 = 8;     // sizeof(long) as int64
    qword_126E340 = 4;     // sizeof(int) as int64
    qword_126E300 = 16;    // sizeof(long double) as int64
    qword_126E310 = 16;    // alignof(long double) as int64
    qword_126E418 = 4;     // sizeof(short) as int64
    qword_126E3E8 = 16;    // sizeof(__int128) as int64
    qword_126E408 = 8;     // sizeof(long long) as int64
    qword_126E320 = 16;    // alignof(something 16B) as int64
    qword_126E3F8 = 8;     // alignof(double) as int64
    qword_126E3D0 = 16;    // sizeof(max int) as int64
    qword_126E360 = 8;     // sizeof(long) alignment as int64
    qword_126E2C0 = 8;     // sizeof(pointer) as int64
    qword_126E2B0 = 16;    // alignof(pointer, packed) as int64
    qword_126E428 = 2;     // sizeof(wchar_t) as int64
    qword_126E2A0 = 8;     // sizeof(ptrdiff_t) as int64

    // === Miscellaneous ===
    qword_126E3B0 = 0;     // no custom va_list
    qword_126E3B8 = 0;     // no custom va_list secondary
    dword_126E3A4 = 0;     // bit-field container sizing disabled
    byte_126E2F6 = 4;      // unnamed struct alignment
    byte_126E2F5 = 4;      // unnamed union alignment
    byte_126E2F4 = 4;      // default minimum alignment
    byte_126E2F7 = 4;      // stack alignment
    byte_126E2F8 = 4;      // thread-local alignment
    byte_126E358 = 7;      // size_t type kind encoding
    dword_126E370 = 0;     // padding/zero
    dword_126E374 = 0;
    dword_126E378 = 1;     // 64-bit mode flag (LP64)
    dword_126E290 = 0;
    dword_126E388 = 0;
    dword_126E38C = 0;
    dword_126E390 = -1;    // special marker

    return -1;  // return value unused by caller
}

The function establishes the LP64 data model: sizeof(int)=4, sizeof(long)=8, sizeof(pointer)=8. This matches the CUDA device code ABI where device pointers are 64-bit. The dword_126E378 = 1 flag explicitly marks this as 64-bit mode.
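For reference, the common data models differ only in the sizes of long and pointers. The sketch below classifies a triple of byte sizes; the enum and function names are hypothetical and do not correspond to anything in the binary.

```c
enum data_model { DM_OTHER, DM_ILP32, DM_LP64, DM_LLP64 };

/* Classifies a data model from the byte sizes of int, long and pointer. */
enum data_model classify_data_model(int int_size, int long_size, int ptr_size)
{
    if (int_size != 4)
        return DM_OTHER;
    if (long_size == 8 && ptr_size == 8) return DM_LP64;   /* Unix-style 64-bit */
    if (long_size == 4 && ptr_size == 8) return DM_LLP64;  /* Windows-style 64-bit */
    if (long_size == 4 && ptr_size == 4) return DM_ILP32;  /* 32-bit */
    return DM_OTHER;
}
```

With the values written by sub_7515D0 (4, 8, 8) the result is DM_LP64.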

CLI Overrides for the Data Model

Three CLI flags write type properties that sub_7515D0 also sets. Whether their values survive depends on the order in which they are processed relative to case 245:

Case 65 (--force-lp64): Enforces 64-bit pointer and long sizes:

case 65:
    dword_106C01C = 1;          // force-lp64 flag recorded
    qword_126E408 = 8;          // sizeof(long long) = 8
    dword_126E400 = 8;          // alignof(long long) = 8
    byte_126E349 = 8;           // size_t = 8 bytes
    byte_126E358 = 7;           // size_t type encoding

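Case 66 (--force-llp64): Selects the Windows-like LLP64 model, in which long is 32-bit while pointers remain 64-bit. The comments below keep this analysis's label of "long long" for the affected globals; since LLP64 shrinks long rather than long long, that label is probably imprecise: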

case 66:
    dword_106C018 = 1;          // force-llp64 flag recorded
    qword_126E408 = 4;          // sizeof(long long) = 4
    dword_126E400 = 4;          // alignof(long long) = 4
    byte_126E349 = 10;          // size_t = different encoding
    byte_126E358 = 9;           // size_t type encoding

Case 90 (--m32): Sets the complete 32-bit (ILP32) data model:

case 90:
    dword_126E378 = 0;          // 32-bit mode (not LP64)
    qword_126E360 = 4;          // sizeof(long) = 4
    dword_126E35C = 4;          // alignof(long) = 4
    qword_126E350 = 0xFFFFFFFF; // SIZE_MAX = 32-bit
    byte_126E349 = 6;           // size_t = 4 bytes
    byte_126E358 = 5;           // size_t type encoding
    qword_126E2C0 = 4;          // sizeof(pointer) = 4
    dword_126E2B8 = 4;          // sizeof(pointer, dword) = 4
    qword_126E2B0 = 8;          // alignof(pointer, packed) = 8
    dword_126E2AC = 4;          // alignof(pointer) = 4
    qword_126E2A0 = 4;          // sizeof(ptrdiff_t) = 4
    dword_126E29C = 4;          // sizeof(ptrdiff_t, dword) = 4
    qword_126E408 = 4;          // sizeof(long long) = 4
    dword_126E400 = 4;          // alignof(long long) = 4
    byte_126E2F4 = 4;           // default minimum alignment = 4

Because sub_7515D0 is called from sub_7525F0 (which runs during case 245), and case 90 executes before case 245, the --m32 overrides are applied first and then overwritten by sub_7515D0's LP64 defaults. The 32-bit overrides from --m32 therefore remain effective ONLY for globals that sub_7515D0 does NOT touch. Every global written in the case 90 listing above (including qword_126E408, dword_126E400, byte_126E349, and byte_126E358) also appears in the sub_7515D0 listing, so for all of them the LP64 values take precedence. By the same reasoning, --force-lp64 and --force-llp64 are no-ops when --target is also specified, because sub_7515D0 overwrites their values too.

In practice, nvcc controls all of these flags coherently -- it does not pass conflicting combinations.
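The ordering effect is a plain last-writer-wins situation. The sketch below models it with a three-field struct; the struct and both function names are hypothetical stand-ins for the globals, case 90, and sub_7515D0 respectively.

```c
/* Hypothetical miniature of the type-model globals. */
struct type_model {
    int sizeof_pointer;
    int sizeof_long;
    int is_64bit;
};

/* Stand-in for case 90 (--m32): writes ILP32 values. */
void apply_m32(struct type_model *m)
{
    m->sizeof_pointer = 4;
    m->sizeof_long = 4;
    m->is_64bit = 0;
}

/* Stand-in for sub_7515D0: unconditionally writes LP64 values. */
void init_target_defaults(struct type_model *m)
{
    m->sizeof_pointer = 8;
    m->sizeof_long = 8;
    m->is_64bit = 1;
}
```

Calling apply_m32 and then init_target_defaults leaves the LP64 values in place; reversing the calls leaves the ILP32 values. Only the first order corresponds to the sequence described above.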

Configuration Validation (sub_7527B0)

After sub_7515D0 sets the type tables, sub_752DF0 (init_field_alignment_tables) populates alignment lookup tables and then calls sub_7527B0 (check_target_configuration). This function validates the consistency of the configured type model:

// sub_7527B0 -- check_target_configuration (pseudocode summary)
void check_target_configuration()
{
    // Validate char fits in 8 bytes
    compute_type_size(0, &size, &precision);
    if (size > 8) fatal("target char is too large");

    // Validate wchar_t size
    if (qword_126E488 > 8) fatal("target wchar_t is too large");

    // Validate char16_t: must be unsigned, at least 16 bits
    if (qword_126E478 > 8) fatal("target char16_t is too large");
    if (dword_126E4A0 * qword_126E478 <= 15)
        fatal("target char16_t is too small");
    if (is_unsigned[byte_126E480] == 0)
        assert_fail("target char16_t must be unsigned");

    // Validate char32_t: must be unsigned, at least 32 bits
    if (qword_126E468 > 8) fatal("target char32_t is too large");
    if (dword_126E4A0 * qword_126E468 <= 31)
        fatal("target char32_t is too small");
    if (is_unsigned[byte_126E470] == 0)
        assert_fail("target char32_t must be unsigned");

    // Validate size_t range
    compute_type_size(byte_126E349, &size, &precision);
    size_bits = size * dword_126E4A0;
    if (size_bits > 64) size_bits = 64;    // clamp to 64 bits
    if (qword_126E350 > max_for_bits(size_bits))
        fatal("targ_size_t_max is too large");

    // Validate largest integer type
    if (qword_126E3D8 > 16) fatal("targ_sizeof_largest_integer too large");
    if (qword_126E3D8 < qword_126E3F8)
        fatal("invalid targ_sizeof_largest_integer");

    // Validate INT_VALUE_PARTS
    if (16 * dword_126E4A0 != 128)
        fatal("invalid INT_VALUE_PARTS_PER_INTEGER_VALUE");

    // Validate host string char
    if (dword_126E49C > 8) fatal("targ_host_string_char_bit too large");

    // Validate pack alignment
    if (dword_126E284 < 1 || dword_126E284 > 255)
        fatal("invalid targ_minimum_pack_alignment");
    if (dword_126E284 > dword_126E280)
        fatal("invalid targ_maximum_pack_alignment");

    // Validate GNU IA-32 vector function integer sizes
    if (qword_126E428 != 2 || qword_126E418 != 4 || qword_126E3F8 != 8)
        assert_fail("invalid integer sizes for GNU IA-32 vector functions");

    // Validate MSVC bit-field allocation
    if (dword_126E3A4 && dword_126E3A8 != -1)
        fatal("targ_microsoft_bit_field_allocation must be -1 "
              "when targ_bit_field_container_size is TRUE");

    // Validate field allocation order
    if (!dword_126E3AC) assert_fail(...);
    if (!dword_126E288)
        fatal("targ_field_alloc_sequence_equals_decl_sequence must be TRUE");

    // Validate host/target endianness match
    if (dword_126E4A4 != dword_126EE40)
        fatal("unexpected host/target endian mismatch");

    // After validation, call dialect selector
    select_cp_gen_be_target_dialect();  // sub_752A80
}

The validator confirms that the type model is internally consistent. Most of these checks are runtime sanity assertions that should never fire with the hardcoded LP64 values from sub_7515D0, but they guard against corruption or misconfiguration if the type globals are modified by other code paths (such as --m32 or --force-llp64).
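As one concrete instance, the char16_t checks can be written as a standalone function. This is a sketch under stated assumptions: check_char16 and its parameters are hypothetical, the real validator reads globals and aborts rather than returning a string, and only the three conditions and their message texts are taken from the pseudocode above.

```c
#include <stddef.h>

/* Returns NULL when the configuration is acceptable, otherwise the
 * diagnostic text associated with the first failing check. */
const char *check_char16(int char_bit, long size_bytes, int is_unsigned)
{
    if (size_bytes > 8)
        return "target char16_t is too large";
    if (char_bit * size_bytes <= 15)       /* fewer than 16 bits in total */
        return "target char16_t is too small";
    if (!is_unsigned)
        return "target char16_t must be unsigned";
    return NULL;
}
```

For example, check_char16(8, 2, 1) passes, while a one-byte or signed char16_t is rejected.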

Notably, check_target_configuration calls select_cp_gen_be_target_dialect (sub_752A80) as its last action. This means dialect selection happens after all type model validation is complete.

Field Alignment Tables (sub_752DF0)

init_field_alignment_tables populates two alignment lookup tables at qword_12C7640 and qword_12C7680. These tables map integer type kinds to their struct field alignment requirements. The function only fills the tables when dword_126E2F0 (field alignment override) is nonzero; in the default CUDA configuration, this field is set to 0 by sub_7515D0, so the alignment tables remain at their initialized-to-zero state.

When the tables are populated, they read alignment values from the dword_126E2CC-dword_126E2F0 range (which sub_7515D0 leaves at zero), meaning the alignment tables are effectively disabled for CUDA targets. The function also copies qword_126E3E8 (sizeof largest integer type) into qword_126E3D8 before calling the configuration validator.

Stage 4: Dialect Selection (sub_752A80)

select_cp_gen_be_target_dialect determines whether the backend generates device-side or host-side C++ output. It is called from check_target_configuration (sub_7527B0) after all type model validation passes:

// sub_752A80 -- select_cp_gen_be_target_dialect (complete decompilation)
__int64 sub_752A80()
{
    // Guard: no dialect should be set yet
    if (dword_126E1F8 || dword_126E1D0 || dword_126E1FC || dword_126E1E8)
        assert_fail("target.c", 736,
                    "select_cp_gen_be_target_dialect",
                    "Target dialect already set.", 0);

    if (dword_126EFA8) {           // GPU compilation mode enabled
        dword_126E1DC = 1;         // enable cp_gen backend
        dword_126E1EC = 1;         // enable backend output

        if (dword_126EFA4) {       // device-side compilation
            dword_126E1E8 = 1;     // set device target dialect
            qword_126E1E0 = qword_126EF90;  // copy Clang version
            return qword_126EF90;
        } else {                   // host-side compilation (stub generation)
            dword_126E1F8 = 1;     // set host target dialect
            qword_126E1F0 = qword_126EF98;  // copy GCC version
            return qword_126EF98;
        }
    }
    return result;  // non-GPU mode: no dialect set ('result' is never assigned here; decompiler artifact)
}

The guard at entry checks that no dialect has been previously set. This fires only if select_cp_gen_be_target_dialect is called twice, which is a programming error.

Dialect Global Roles

| Global | Role | Set When |
|---|---|---|
| dword_126EFA8 | GPU compilation mode active | --gcc flag (case 182) sets this to 1 |
| dword_126EFA4 | Device-side (vs host-side) compilation | --clang flag (case 187) sets this to 1 |
| dword_126E1DC | cp_gen backend enabled | GPU mode active |
| dword_126E1EC | Backend output enabled | GPU mode active |
| dword_126E1E8 | Device target dialect selected | Device-side compilation |
| dword_126E1F8 | Host target dialect selected | Host-side compilation |
| qword_126E1E0 | Device compiler version | Copied from qword_126EF90 (Clang version) |
| qword_126E1F0 | Host compiler version | Copied from qword_126EF98 (GCC version) |

The naming of dword_126EFA8 as "gcc mode" and dword_126EFA4 as "clang mode" is misleading. In CUDA compilation, dword_126EFA8 means "GPU compilation is active" (nvcc always passes --gcc) and dword_126EFA4 means "this is the device-side pass" (nvcc passes --clang for the device compilation pass, not for the host pass). The version numbers copied into qword_126E1E0 and qword_126E1F0 represent the host compiler's version for pragma compatibility, not the "Clang" or "GCC" version in any semantic sense.

Device vs Host Output Paths

cudafe++ is invoked twice per .cu file by nvcc:

  1. Device pass (dword_126EFA4 = 1): cudafe++ processes the CUDA source and emits the device-side IL, which cicc later lowers to PTX. The dialect is set to "device" (dword_126E1E8 = 1) and the version number comes from qword_126EF90.

  2. Host pass (dword_126EFA4 = 0): cudafe++ processes the same source and emits the host-side .int.c file with device stubs. The dialect is set to "host" (dword_126E1F8 = 1) and the version number comes from qword_126EF98.

The dialect selection determines which backend code paths execute during .int.c generation. Device-dialect mode generates PTX-compatible output; host-dialect mode generates host C++ with stub functions.
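The decision itself reduces to two booleans. The sketch below restates it as a pure function; the enum and function names are hypothetical, and the version-number copies and backend-enable flags from sub_752A80 are left out.

```c
enum dialect { DIALECT_NONE, DIALECT_HOST, DIALECT_DEVICE };

/* gpu_mode    corresponds to dword_126EFA8,
 * device_side corresponds to dword_126EFA4. */
enum dialect select_dialect(int gpu_mode, int device_side)
{
    if (!gpu_mode)
        return DIALECT_NONE;               /* no dialect is set */
    return device_side ? DIALECT_DEVICE : DIALECT_HOST;
}
```

Note that device_side has no effect unless gpu_mode is set, which matches the nesting of the two tests in the decompiled function.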

Stage 5: TU Initialization (sub_586240)

During translation unit initialization, sub_586240 copies the target index from the CLI-level global into per-TU state:

// sub_586240 -- fe_translation_unit_init_secondary (relevant excerpt)
if (dword_106BA08) {                       // is recompilation / secondary TU
    // ... version marker and timestamp setup ...
    v6 = allocate(4);
    *(int32_t *)v6 = 3550774;              // "6.6\0" -- EDG version marker
    qword_126EB78 = v6;                    // store version string pointer
    qword_126EB80 = strcpy(allocate(len), byte_106B5C0);  // timestamp
    dword_126EBF8 = dword_126E4A8;         // CRITICAL: copy target index
}

The copy dword_126EBF8 = dword_126E4A8 replicates the architecture index into the translation unit's state block. Both globals contain the same value in single-TU compilation (which is the only mode CUDA uses). The dual-variable pattern exists because EDG's multi-TU architecture theoretically supports per-TU target configurations, but CUDA compilation always uses a single target per cudafe++ invocation.
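The integer 3550774 in the excerpt is the string "6.6" plus its terminating NUL, stored as one little-endian 32-bit word (0x00362E36). The sketch below recomputes it with explicit shifts so the result does not depend on the host's byte order; pack_le32 is a hypothetical helper, not a function from the binary.

```c
#include <stdint.h>

/* Packs up to four leading bytes of a string into a 32-bit value in
 * little-endian order; bytes after the terminator are left as zero. */
uint32_t pack_le32(const char *s)
{
    uint32_t v = 0;
    for (int i = 0; i < 4 && s[i] != '\0'; i++)
        v |= (uint32_t)(unsigned char)s[i] << (8 * i);
    return v;
}
```

pack_le32("6.6") evaluates to 3550774, matching the constant stored through v6.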

After this point, feature checks throughout the compiler read either dword_126E4A8 (the CLI-level global) or dword_126EBF8 (the TU-level copy). Both contain the same integer target index.

Feature Gate Mechanism

Individual features are gated by comparing dword_126E4A8 against threshold constants during semantic analysis. The pattern is consistent across all architecture-gated features:

// Pattern: hard error on unsupported architecture
if (dword_126E4A8 < THRESHOLD) {
    emit_error(DIAGNOSTIC_ID, location);
    // compilation continues or aborts depending on severity
}

Some features use a global flag that is set during command-line processing rather than reading dword_126E4A8 directly. For example, __nv_register_params__ checks dword_106C028 (the "uumn" flag, set by CLI case 112) rather than comparing the architecture directly:

// sub_40B0A0 -- apply_nv_register_params_attr
if (!dword_106C028) {                    // feature not enabled
    emit_error(7, 3659, location);       // "not enabled" error
    v3 = 0;                              // mark as invalid
}

The architecture check for __nv_register_params__ is separate -- it uses diagnostic tag register_params_unsupported_arch (requiring compute_80+), which is evaluated in a different code path from the enable flag check.

Feature Flag vs Direct Comparison

The distinction between feature-flag gating and direct SM comparison is:

  • Direct comparison (dword_126E4A8 < N): Used for features where the threshold is baked into the comparison instruction. The threshold cannot be changed without recompiling cudafe++. Examples: __grid_constant__ (< 70), __managed__ (< 30), alloca() (< 52).

  • Feature flag (dword_XXXXXX == 0): Used for features that can be enabled/disabled independently of the architecture. The flag is set by a CLI option, and the architecture is checked separately. Example: __nv_register_params__ uses dword_106C028 for the enable check and a separate comparison for the architecture check.

Both patterns ultimately depend on the target index value, but the feature-flag pattern adds an extra level of indirection that allows nvcc to control feature availability through CLI flags rather than relying solely on the architecture number.
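The two patterns can be contrasted in a short sketch. All names are hypothetical; target_value stands for whatever integer dword_126E4A8 holds, and the order of the two checks in check_flagged is an assumption, since the text only states that they live in separate code paths.

```c
enum gate_result { GATE_OK, GATE_ARCH_TOO_LOW, GATE_NOT_ENABLED };

/* Direct comparison: the threshold is a constant at the call site. */
enum gate_result check_direct(int target_value, int threshold)
{
    return target_value < threshold ? GATE_ARCH_TOO_LOW : GATE_OK;
}

/* Feature flag plus a separate architecture check. */
enum gate_result check_flagged(int enable_flag, int target_value, int threshold)
{
    if (!enable_flag)
        return GATE_NOT_ENABLED;           /* e.g. dword_106C028 == 0 */
    return check_direct(target_value, threshold);
}
```

With the thresholds quoted above, check_direct(60, 70) reports GATE_ARCH_TOO_LOW, and check_flagged(0, 90, 80) reports GATE_NOT_ENABLED regardless of the architecture value.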

The --db Debug Mechanism

The --db flag (CLI case 37) activates EDG's internal debug tracing. While not directly part of the architecture detection chain, it shares adjacent globals (dword_126EFC8, dword_126EFCC) and can expose architecture checks as they execute.

The --db flag calls sub_48A390 (proc_debug_option, 238 lines, debug.c). On entry, it unconditionally enables tracing:

dword_126EFC8 = 1;  // debug tracing enabled

If the argument is a bare integer, it sets the verbosity level:

if (isdigit((unsigned char)arg[0])) {       // argument starts with a digit
    dword_126EFCC = strtol(arg, NULL, 10);  // verbosity level
    return 0;
}

Otherwise, it parses debug trace control entries (see Architecture Feature Gating for the full --db parsing grammar). After proc_debug_option returns, the CLI parser saves the verbosity level:

// proc_command_line, case 37
dword_106C2A0 = dword_126EFCC;  // save verbosity level

At higher verbosity levels (5+), the compiler logs IL tree walking with messages like "Walking IL tree, entry kind = ...", which provides visibility into when architecture gate checks fire during semantic analysis.
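The bare-integer branch of the --db handling can be sketched as follows. The struct and function are hypothetical; the trace-entry grammar is not modelled and is represented only by a distinct return value.

```c
#include <ctype.h>
#include <stdlib.h>

struct debug_state {
    int  tracing_enabled;   /* models dword_126EFC8 */
    long verbosity;         /* models dword_126EFCC */
};

/* Returns 0 when the argument was a bare integer, 1 when it would be
 * handed to the trace-entry parser (not implemented in this sketch). */
int parse_db_arg(const char *arg, struct debug_state *st)
{
    st->tracing_enabled = 1;                /* unconditional on entry */
    if (isdigit((unsigned char)arg[0])) {
        st->verbosity = strtol(arg, NULL, 10);
        return 0;
    }
    return 1;
}
```

Passing "5" enables tracing and sets the verbosity to 5; a non-numeric argument enables tracing but leaves the verbosity untouched.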

Complete Call Graph

main (sub_585EE0)
  |
  +-> proc_command_line (sub_459630)
  |     |
  |     +-> case 90 (--m32):     set ILP32 type properties
  |     +-> case 65 (--force-lp64): set LP64 overrides
  |     +-> case 66 (--force-llp64): set LLP64 overrides
  |     +-> case 245 (--target):
  |           |
  |           +-> sub_7525E0(<arg>)            // parse target index (stub: returns -1)
  |           +-> dword_126E4A8 = result       // store target index
  |           +-> if -1: emit error 2664       // invalid target
  |           +-> sub_7525F0(result)           // set_target_configuration
  |                 |
  |                 +-> sub_7515D0()           // initialize 100+ type globals (LP64)
  |                 +-> qword_126E1B0 = "lib"  // library prefix
  |                 |
  |                 +-> [implicit via sub_752DF0]:
  |                       +-> sub_752DF0()     // init_field_alignment_tables
  |                       +-> sub_7527B0()     // check_target_configuration
  |                             |
  |                             +-> [20+ consistency checks]
  |                             +-> sub_752A80()   // select_cp_gen_be_target_dialect
  |                                   |
  |                                   +-> if GPU mode && device:
  |                                   |     dword_126E1E8 = 1 (device dialect)
  |                                   |     qword_126E1E0 = qword_126EF90
  |                                   +-> if GPU mode && host:
  |                                         dword_126E1F8 = 1 (host dialect)
  |                                         qword_126E1F0 = qword_126EF98
  |
  +-> fe_translation_unit_init (sub_586240)
        |
        +-> dword_126EBF8 = dword_126E4A8      // copy target index to TU state
        +-> qword_126EB78 = "6.6\0"            // EDG version marker
        +-> qword_126EB80 = timestamp           // compilation timestamp

[After TU init, feature checks read dword_126E4A8 or dword_126EBF8]

Global Variable Summary

Target Architecture State

| Address | Size | Name | Role |
|---|---|---|---|
| dword_126E4A8 | 4 | sm_architecture | Target index from --target. Sentinel: -1. |
| dword_126EBF8 | 4 | target_configuration_index | TU-level copy of dword_126E4A8. |
| dword_126E378 | 4 | is_64bit_mode | 1 = LP64 (64-bit), 0 = ILP32 (32-bit). |
| dword_126E4A4 | 4 | target_little_endian | 1 = little-endian. |

Type Model (Sizes, set by sub_7515D0)

| Address | Size | Name | LP64 Value |
| --- | --- | --- | --- |
| dword_126E338 / qword_126E340 | 4/8 | sizeof_int | 4 |
| dword_126E328 / qword_126E330 | 4/8 | sizeof_long | 8 |
| dword_126E2B8 / qword_126E2C0 | 4/8 | sizeof_pointer | 8 |
| dword_126E29C / qword_126E2A0 | 4/8 | sizeof_ptrdiff | 8 |
| dword_126E410 / qword_126E418 | 4/8 | sizeof_short | 2 |
| dword_126E400 / qword_126E408 | 4/8 | sizeof_long_long | 8 |
| dword_126E420 / qword_126E428 | 4/8 | sizeof_wchar | 2 |
| dword_126E2FC / qword_126E300 | 4/8 | sizeof_long_double | 16 |
| dword_126E258 | 4 | double_mantissa_bits | 53 |
| dword_126E264 | 4 | float_mantissa_bits | 24 |
| dword_126E234 | 4 | float128_mantissa_bits | 113 |
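
As a rough illustration of how the LP64 defaults and the --m32 override relate, the sketch below groups a few of the size globals into one struct. The struct, the function names, and the particular consistency checks are hypothetical; the binary keeps each value in its own global, and the real checks inside sub_7527B0 are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical aggregate of a few size globals from the table above.
   The binary stores each value in its own dword/qword pair. */
typedef struct {
    uint32_t sizeof_int;
    uint32_t sizeof_long;
    uint32_t sizeof_pointer;
    uint32_t sizeof_ptrdiff;
    uint32_t sizeof_long_long;
} type_model;

/* LP64 values as listed in the table. */
type_model make_lp64_model(void) {
    type_model m = { 4, 8, 8, 8, 8 };
    return m;
}

/* ILP32 by definition narrows long and pointers to 32 bits.
   Narrowing ptrdiff together with the pointer is an assumption of this sketch. */
type_model make_ilp32_model(void) {
    type_model m = make_lp64_model();
    m.sizeof_long = 4;
    m.sizeof_pointer = 4;
    m.sizeof_ptrdiff = 4;
    return m;
}

/* Illustrative sanity checks only; not the checks performed by the binary. */
int type_model_is_consistent(const type_model *m) {
    assert(m != 0);
    return m->sizeof_int <= m->sizeof_long
        && m->sizeof_long <= m->sizeof_long_long
        && m->sizeof_ptrdiff == m->sizeof_pointer;
}
```

Both models pass the illustrative checks; only the long, pointer, and ptrdiff widths differ between them.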

Type Model (Alignment, set by sub_7515D0)

| Address | Size | Name | LP64 Value |
| --- | --- | --- | --- |
| dword_126E2AC | 4 | alignof_pointer | 8 |
| dword_126E35C / qword_126E360 | 4/8 | alignof_long | 8 |
| dword_126E308 / qword_126E310 | 4/8 | alignof_long_double | 16 |
| dword_126E3F0 / qword_126E3F8 | 4/8 | alignof_double | 8 |
| dword_126E278 | 4 | max_natural_alignment | 16 |
| byte_126E2F4 | 1 | default_min_alignment | 4 |

Dialect Selection State

| Address | Size | Name | Role |
| --- | --- | --- | --- |
| dword_126EFA8 | 4 | gpu_mode_enabled | GPU compilation active (set by --gcc) |
| dword_126EFA4 | 4 | is_device_compilation | Device-side pass (set by --clang) |
| dword_126E1DC | 4 | cp_gen_enabled | cp_gen backend active |
| dword_126E1EC | 4 | backend_output_enabled | Backend output generation active |
| dword_126E1E8 | 4 | device_dialect_set | Device target dialect selected |
| dword_126E1F8 | 4 | host_dialect_set | Host target dialect selected |
| qword_126E1E0 | 8 | device_version | Clang version copied for device dialect |
| qword_126E1F0 | 8 | host_version | GCC version copied for host dialect |
| qword_126E1B0 | 8 | lib_prefix | Library search prefix, set to "lib" |

Feature Gate Globals

| Address | Size | Name | Role |
| --- | --- | --- | --- |
| dword_106C028 | 4 | nv_register_params_enabled | Enable flag for __nv_register_params__ (set by --uumn, case 112) |


Experimental and Version-Gated Flags

cudafe++ gates several categories of CUDA language features behind flags that nvcc manages automatically. Users interact with these through nvcc options like --expt-extended-lambda and --expt-relaxed-constexpr; nvcc translates these into the internal cudafe++ flags --extended-lambda and --relaxed_constexpr before invocation. A third category, C++ standard version gating, controls which language-level features affect the CUDA compilation pipeline. Two further flag groups (--default-device, and the pair --no-device-int128/--no-device-float128) tune device code semantics without the "experimental" label.

This page documents the internal mechanism of each flag: the global variable it sets, every code path it unlocks, the diagnostics it suppresses or enables, and the compile-time cost of enabling it.

Flag Summary

| nvcc Flag | cudafe++ Internal Flag | Flag ID | Global Variable | Default | Effect |
| --- | --- | --- | --- | --- | --- |
| --expt-extended-lambda | --extended-lambda | 79* | dword_106BF38 | 0 | Enable entire extended lambda wrapper infrastructure |
| --expt-relaxed-constexpr | --relaxed_constexpr | 104 | dword_106BF40 | 0 | Allow constexpr cross-space calls |
| -std=c++NN | --c++NN / set_flag | -- | dword_126EF68 | 199711 | Gate C++ standard features |
| (JIT mode) | --default-device | ** | -- | 0 | Change unannotated default to __device__ |
| --no-device-int128 | --no-device-int128 | 52 | -- | 0 | Disable __int128 in device code |
| --no-device-float128 | --no-device-float128 | 53 | -- | 0 | Disable __float128/_Float128 in device code |

* The extended-lambda flag is listed here as flag 79, but that assignment is not fully disambiguated: some reports place a separate flag, disable_ext_lambda_cache, at that slot. The case_id for the flag parsed as extended-lambda lies in the CUDA-specific range 47--89, and the individual case within the grouped 47--53 block has not been pinned down. The flag string "extended-lambda" is at binary address 0x836410, referenced from sub_452010 (init_command_line_flags).

** The --default-device flag is not in the standard numbered flag catalog (1--275). It is registered through one of the 7 inline-registered paired flags or the set_flag/clear_flag table (off_D47CE0). Its string literal appears in four JIT error messages in the binary.

--extended-lambda (dword_106BF38)

This is the single most impactful experimental flag in cudafe++. It enables the entire extended lambda subsystem -- approximately 40 functions in nv_transforms.c, 2,100 lines of lambda scanning in class_decl.c (sub_447930 sits at the end of that file's address range in the source file map), 17 steps of preamble text emission, and per-lambda wrapper generation in the backend. Without it, CUDA lambdas annotated with __device__ or __host__ __device__ are rejected outright.

What It Enables

When dword_106BF38 != 0, the following subsystems activate:

1. Lambda scanning (sub_447930, 2,113 lines)

The 7-phase scan_lambda function performs full CUDA validation on every lambda expression. Phase 4 checks all 35+ restriction categories documented in the restrictions page. Without the flag, phase 4 early-exits and emits error 3612 instead.

2. Preamble injection (sub_4864F0 + sub_6BCC20)

When the backend encounters a type declaration for the sentinel __nv_lambda_preheader_injection, three conditions must all be true for the preamble to fire:

// sub_4864F0 trigger conditions:
if ((entity_bits[-8] & 0x10) != 0      // marker bit set
    && dword_106BF38 != 0               // --extended-lambda enabled
    && name_matches_sentinel)           // 30-byte name comparison
{
    sub_6BCC20(emit_func);              // emit ~10-50 KB of template text
}

The master emitter (sub_6BCC20) produces the complete lambda wrapper infrastructure as inline C++ text injected into the .int.c output. The 17-step emission sequence generates:

| Step | Output | Purpose |
| --- | --- | --- |
| 1 | __NV_LAMBDA_WRAPPER_HELPER, __nvdl_remove_ref, __nvdl_remove_const | Utility macros and type traits |
| 2 | __nv_dl_tag | Device lambda tag type |
| 3 | Array capture helpers (dim 2--8) | N-dimensional array forwarding via sub_6BC290 |
| 4 | Primary __nv_dl_wrapper_t + zero-capture specialization | Device lambda wrapper template |
| 5 | __nv_dl_trailing_return_tag + zero-capture specialization | Trailing return type support |
| 6 | Device bitmap scan | One sub_6BB790 call per set bit in unk_1286980 |
| 7 | __nv_hdl_helper (anonymous namespace, 4 static function pointers) | Host-device lambda dispatch helper |
| 8 | Primary __nv_hdl_wrapper_t with static_assert | Host-device wrapper template |
| 9 | HD bitmap scan | Four calls per set bit in unk_1286900 (const x mutable x 2 helpers) |
| 10 | __nv_hdl_helper_trait_outer | Deduction helper traits |
| 11 | C++17 noexcept variants | Conditional on dword_126E270 (see C++ version gating) |
| 12 | __nv_hdl_create_wrapper_t | Factory for HD wrappers |
| 13 | __nv_lambda_trait_remove_const/volatile/cv | CV-qualifier removal traits |
| 14 | __nv_extended_device_lambda_trait_helper + detection macro | Device lambda type detection |
| 15 | __nv_lambda_trait_remove_dl_wrapper | Unwrapper trait |
| 16 | Trailing-return detection trait + macro | Type introspection |
| 17 | HD detection trait + macro | Host-device lambda type detection |

3. 1024-bit capture bitmaps

Two bitmaps track which capture counts have been observed during parsing:

| Bitmap | Address | Size | Bit Meaning |
| --- | --- | --- | --- |
| Device | unk_1286980 | 128 bytes (1024 bits) | Bit N = capture count N seen in a __device__ lambda |
| Host-device | unk_1286900 | 128 bytes (1024 bits) | Bit N = capture count N seen in an HD lambda |

sub_6BCBF0 registers a capture count by setting the corresponding bit. sub_6BCBC0 resets both bitmaps to zero between translation units. The maximum representable capture count is 1023 (bit 0 is reserved for the primary template in the device path; the HD path uses bit 0). Error 3595 fires when capture count exceeds 1022 (v33 > 0x3FE).
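
The bitmap bookkeeping described above can be sketched as follows. This is a minimal model, not the decompiled code: the type and function names are hypothetical, and the byte-wise bit ordering is an assumption (the binary may well operate on wider words). The 0x3FE limit mirrors the v33 > 0x3FE test quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { CAPTURE_BITMAP_BITS = 1024, MAX_CAPTURE_COUNT = 0x3FE };

/* Hypothetical stand-in for one 128-byte bitmap (unk_1286980 or unk_1286900). */
typedef struct {
    uint8_t bits[CAPTURE_BITMAP_BITS / 8];
} capture_bitmap;

/* Clear every bit, as done between translation units. */
void capture_bitmap_reset(capture_bitmap *bm) {
    memset(bm->bits, 0, sizeof bm->bits);
}

/* Record one observed capture count. Returns false when the count is over
   the limit, which is where the frontend would report error 3595. */
bool capture_bitmap_record(capture_bitmap *bm, unsigned count) {
    if (count > MAX_CAPTURE_COUNT)
        return false;
    bm->bits[count / 8] = (uint8_t)(bm->bits[count / 8] | (1u << (count % 8)));
    return true;
}

/* Query whether a capture count has been seen. */
bool capture_bitmap_test(const capture_bitmap *bm, unsigned count) {
    assert(count < CAPTURE_BITMAP_BITS);
    return ((bm->bits[count / 8] >> (count % 8)) & 1u) != 0;
}

/* Number of distinct capture counts seen, i.e. how many wrapper
   specializations a bitmap scan would emit. */
unsigned capture_bitmap_distinct(const capture_bitmap *bm) {
    unsigned n = 0;
    for (unsigned i = 0; i < CAPTURE_BITMAP_BITS; ++i)
        n += capture_bitmap_test(bm, i) ? 1u : 0u;
    return n;
}
```

Recording the same count twice leaves a single bit set, which is why the number of emitted specializations follows the number of distinct capture counts rather than the number of lambdas.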

4. Per-lambda wrapper generation (sub_47B890, 336 lines)

During backend code generation, gen_lambda produces the per-lambda wrapper specialization for each extended lambda encountered. This runs in the gen_template dispatcher (sub_47ECC0).

5. Extended lambda capture type generation (sub_46E640, ~400 lines)

nv_gen_extended_lambda_capture_types generates explicit type declarations for captured variables, enabling the closure type to be serialized across host/device boundaries.

What Happens Without It

When dword_106BF38 == 0, any lambda with __host__ or __device__ annotations triggers error 3612:

error #20155-D: __host__ or __device__ annotation on lambda requires --extended-lambda nvcc flag

Additionally, the .int.c header emits hardcoded false macros (from sub_489000):

#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false

These definitions ensure that code using the detection macros compiles without error but reports that no extended lambdas exist.

Compile-Time Cost

Enabling --extended-lambda has measurable compile-time impact:

  • Fixed overhead: ~10 KB of injected template text (steps 1--5, 7--8, 10--17) emitted for every translation unit, regardless of how many lambdas appear
  • Variable per capture count: ~0.8 KB per distinct device lambda capture count, ~6 KB per distinct HD capture count (the HD path emits 4 specializations per bit: const non-mutable, const mutable, non-const non-mutable, non-const mutable)
  • Typical TU with 3--5 distinct capture counts: 30--50 KB of additional .int.c text
  • Template instantiation load: The wrapper templates use deep SFINAE patterns; the host compiler (gcc/clang/MSVC) must instantiate these for every extended lambda in the TU
  • Lambda scanning: The 2,113-line scan_lambda function performs full restriction validation on every lambda expression, adding O(N) per-lambda overhead

The cost is proportional to the number of distinct capture counts, not the total number of lambdas. Two __device__ lambdas each capturing 3 variables share a single wrapper specialization.
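
The size figures above combine into a simple linear estimate. The sketch below is only a back-of-the-envelope model built from the approximate numbers quoted in the list; the function name is hypothetical, 1 KB is taken as 1024 bytes, and 0.8 KB is rounded to 819 bytes.

```c
#include <assert.h>

/* Rough estimate of injected .int.c preamble text, in bytes.
   device_counts / hd_counts are the numbers of DISTINCT capture counts
   seen for __device__ and __host__ __device__ lambdas respectively. */
unsigned long estimate_preamble_bytes(unsigned device_counts, unsigned hd_counts) {
    const unsigned long fixed_bytes = 10UL * 1024;  /* ~10 KB fixed overhead */
    const unsigned long per_device  = 819UL;        /* ~0.8 KB per device count */
    const unsigned long per_hd      = 6UL * 1024;   /* ~6 KB per HD count */
    return fixed_bytes + per_device * device_counts + per_hd * hd_counts;
}
```

Under these assumptions a translation unit with no extended lambdas still pays the fixed 10 KB, and each additional distinct HD capture count costs several times more than a device one.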

All 35+ extended lambda error codes (3590--3691) are documented in lambda/restrictions.md. Key errors specific to the flag gate:

| Error | Display | Tag | Condition |
| --- | --- | --- | --- |
| 3612 | 20155-D | extended_lambda_disallowed | Lambda has __host__/__device__ annotation but dword_106BF38 == 0 |
| 3595 | 20138-D | extended_lambda_too_many_captures | Capture count > 1022 (0x3FE) |
| 3590 | 20133-D | extended_lambda_multiple_parent | Multiple __nv_parent pragmas |

--expt-relaxed-constexpr (dword_106BF40)

This flag relaxes cross-execution-space calling rules for constexpr functions. Without it, a constexpr __device__ function cannot be called from a __host__ function and vice versa, even though constexpr functions are evaluated at compile time on the host regardless of their execution space annotation.

Flag Registration

Registered as flag ID 104 (relaxed_constexpr) in the CUDA-specific flag range. The --expt-relaxed-constexpr nvcc flag is translated to --relaxed_constexpr before passing to cudafe++. The flag sets dword_106BF40 to 1.

Note: Despite the W066 report labeling this global lambda_host_device_mode, the decompiled code shows it is checked in two distinct contexts: cross-space call validation (sub_505720) and extended lambda device qualification (sub_6BC680). The variable name reflects its role in relaxing constexpr constraints, not lambda-specific behavior. It affects lambda behavior only in the specific case of is_device_or_extended_device_lambda (see below).

What It Relaxes

The flag modifies behavior in two code paths:

1. Cross-space call checking (sub_505720)

In check_cross_execution_space_call, when the caller is a __device__-only function and the callee has mask 0x02 (bit 1) set at offset +177 (explicit __device__ annotation), the checker tests dword_106BF40:

// sub_505720, caller is __device__ or __global__, callee is constexpr __host__:
if ((callee[177] & 0x02) != 0) {     // callee has explicit execution space
    if (dword_106BF40) {               // --expt-relaxed-constexpr
        // skip error, allow the call
        return;
    }
}

Without the flag, this path falls through to emit one of the 6 constexpr-specific cross-space errors.

2. Device lambda qualification (sub_6BC680)

In is_device_or_extended_device_lambda, when an entity has the __device__ annotation (byte +177, mask 0x02) but NOT the extended lambda bit (byte +177, mask 0x04), the function returns dword_106BF40 != 0:

// sub_6BC680 (decompiled):
bool is_device_or_extended_device_lambda(entity* a1) {
    if ((a1->byte_177 & 0x02) != 0) {       // has __device__
        if ((a1->byte_177 & 0x04) == 0) {    // NOT extended lambda
            return dword_106BF40 != 0;        // relaxed constexpr allows it
        }
        return true;
    }
    return false;
}

This means --expt-relaxed-constexpr allows certain __device__ lambdas to be treated as extended device lambdas even without the --extended-lambda flag, but only in the specific context of device lambda type checking.

The 6 Error Messages It Suppresses

When dword_106BF40 == 0 and a constexpr function call crosses execution spaces, one of these 6 error messages is emitted. Each message explicitly suggests the flag as a workaround:

| # | Caller Space | Callee Space | Error Message |
| --- | --- | --- | --- |
| 1 | __host__ __device__ | constexpr __device__ | "calling a constexpr __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this." |
| 2 | __host__ | constexpr __device__ | "calling a constexpr __device__ function(%sq1) from a __host__ function(%sq2) is not allowed. ..." |
| 3 | __host__ __device__ | constexpr __host__ | "calling a constexpr __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. ..." |
| 4 | __host__ __device__ | constexpr __host__ | "calling a constexpr __host__ function from a __host__ __device__ function is not allowed. ..." (no entity names -- edge case for unresolved functions) |
| 5 | __device__ | constexpr __host__ | "calling a constexpr __host__ function(%sq1) from a __device__ function(%sq2) is not allowed. ..." |
| 6 | __global__ | constexpr __host__ | "calling a constexpr __host__ function(%sq1) from a __global__ function(%sq2) is not allowed. ..." |

The %sq1 and %sq2 format specifiers are cudafe++'s diagnostic format for qualified entity names (see diagnostics/format-specifiers.md).

Why It Is Experimental

The flag is labeled "experimental" because enabling it can produce silent runtime errors when:

  1. A constexpr function has different behavior on host vs device due to #ifdef __CUDA_ARCH__ guards or host/device-specific intrinsics. The compiler evaluates constexpr functions on the host during compilation, but with the flag enabled, a constexpr __device__ function might be evaluated on the host where __CUDA_ARCH__ is not defined, producing a different constant value than the programmer expects for device code.

  2. A constexpr __host__ function references host-only APIs (file I/O, system calls, host-specific math libraries). With relaxed constexpr, this function can be called from a __device__ context. If the call is not resolved at compile time (not actually evaluated as a constant expression), the linker or runtime will fail with an obscure error rather than the clear cudafe++ diagnostic.

  3. The relaxation applies globally -- there is no per-function opt-in. Once enabled, all constexpr cross-space calls are permitted, making it impossible to catch genuinely incorrect calls alongside intentionally relaxed ones.

The related diagnostic tag is cl_relaxed_constexpr_requires_bool (at binary address 0x853640), which indicates there was at some point a stricter validation that the flag's value must be boolean.

Interaction with Other Globals

The dword_106BF40 flag interacts with the cross-space checking infrastructure controlled by dword_106BFD0 (device_registration) and dword_106BFCC (constant_registration). When dword_106BF40 is set AND the current routine is in device scope ((+182 & 0x30) == 0x20) AND the routine has the __device__ annotation (+177, bit 1, mask 0x02), the cross-space reference check in record_symbol_reference_full (sub_72A650/sub_72B510) skips the error entirely.
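
The three-part skip condition can be written out as a predicate. This is a sketch under stated assumptions: the struct and function names are hypothetical, and only the two bytes mentioned in the text are modeled. The explicit parentheses around the mask test matter, because in C `==` binds tighter than `&`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the two routine bytes referenced above. */
typedef struct {
    uint8_t byte_177;  /* execution-space annotation bits */
    uint8_t byte_182;  /* execution-space / scope mask */
} routine_bits;

/* True when the cross-space reference error would be skipped. */
bool relaxed_constexpr_skips_ref_check(bool relaxed_constexpr, routine_bits r) {
    bool in_device_scope       = (r.byte_182 & 0x30) == 0x20;
    bool has_device_annotation = (r.byte_177 & 0x02) != 0;
    return relaxed_constexpr && in_device_scope && has_device_annotation;
}
```

All three conditions must hold; clearing the flag or either bit pattern restores the normal error path.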

C++ Standard Version Gating (dword_126EF68)

The global variable dword_126EF68 holds the C++ (or C) standard version as an integer matching the __cplusplus or __STDC_VERSION__ predefined macro value. This is set during CLI parsing and controls feature gating throughout the frontend.

Version Values

| Standard | dword_126EF68 Value | nvcc Flag |
| --- | --- | --- |
| C++98/03 | 199711 | -std=c++03 |
| C++11 | 201103 | -std=c++11 |
| C++14 | 201402 | -std=c++14 |
| C++17 | 201703 | -std=c++17 |
| C++20 | 202002 | -std=c++20 |
| C++23 | 202302 | -std=c++23 |

C standard values are also stored here when compiling C code:

| Standard | dword_126EF68 Value |
| --- | --- |
| K&R | (triggers set_c_mode(1) instead) |
| C89 | 198912 |
| C99 | 199901 |
| C11 | 201112 |
| C17 | 201710 |
| C23 | 202311 |

How Version Gating Works

Throughout the frontend, dword_126EF68 is compared against threshold values to enable or disable features. The comparison is usually >= or > against the version number, with an occasional <= test for older dialects (as in the sub_489000 example below). Examples from the binary:

List initialization (sub_6D7DE0, overload.c): The 2,119-line list initialization function checks dword_126EF68 >= 201103 before enabling C++11 brace-enclosed initializer semantics.

Operator overloading (sub_6E7310, overload.c): Checks dword_126EF68 >= 201703 for C++17 features like class template argument deduction in operator resolution.

Preprocessor directives (sub_6FEDD0, preproc.c): Checks dword_126EF68 >= 202301 for #elifdef/#elifndef support (C++23 feature).

Byte ordering in .int.c output (sub_489000): Sets byte_10657F4 based on:

if (dword_126EFB4 == 2)              // CUDA mode
    byte_10657F4 = (dword_126EFB0 != 0);
else if (dword_126EF68 <= 199900)    // pre-C99
    byte_10657F4 = (dword_126EFB0 != 0);
else
    byte_10657F4 = 1;
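
The threshold comparisons listed above can be modeled as small predicates over the version integer. The enum and function names are hypothetical, and the sketch deliberately ignores any language-mode check (C and C++ values share the same global), so it is only meaningful for the C++ values in the version table.

```c
#include <assert.h>
#include <stdbool.h>

/* C++ values of dword_126EF68 from the version table. */
enum {
    STD_CXX98 = 199711,
    STD_CXX11 = 201103,
    STD_CXX14 = 201402,
    STD_CXX17 = 201703,
    STD_CXX20 = 202002,
    STD_CXX23 = 202302
};

/* Thresholds as quoted in the examples above. */
bool gate_cxx11_list_init(long std_version)  { return std_version >= 201103; }
bool gate_cxx17_overloading(long std_version) { return std_version >= 201703; }
bool gate_elifdef(long std_version)          { return std_version >= 202301; }
```

Note that the #elifdef threshold (202301) sits just below the C++23 value (202302), so it admits C++23 while still excluding C++20.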

C++17 noexcept-in-Type-System (dword_126E270)

A key version-gated feature for CUDA is dword_126E270, the C++17 "noexcept is part of the type system" flag. This global is set when dword_126EF68 >= 201703 and controls whether the lambda preamble injection (step 11 in sub_6BCC20) emits noexcept specializations of __nv_hdl_helper_trait_outer:

// sub_6BCC20, step 11:
if (dword_126E270) {                   // C++17 noexcept in type system
    // Emit 2 additional trait specializations with NeverThrows=true
    // for noexcept-qualified function types
    emit_noexcept_trait_specialization(emit, /* const */ 0);
    emit_noexcept_trait_specialization(emit, /* non-const */ 1);
}
// Closing }; of __nv_hdl_helper_trait_outer emitted unconditionally after

Without these specializations, C++17 code using noexcept lambdas in host-device contexts would fail to match the wrapper traits, producing template deduction failures.

Version Interactions with CUDA

The C++ standard version interacts with CUDA semantics in several ways:

  • C++11 minimum: Most CUDA lambda features require >= 201103. Extended lambdas are only meaningful with C++11 lambda syntax.
  • C++14 generic lambdas: Generic __device__ lambdas (with auto parameters) are gated on >= 201402.
  • C++17 structured bindings and if constexpr: The extended lambda system interacts with if constexpr through restriction errors 3620/3621 (constexpr/consteval conflict in lambda operator()).
  • C++20 concepts: The template variant of cross-space checking (sub_505B40) has a concept-context guard that checks dword_126C5C4 (nested class scope), which is only meaningful with C++20 concepts.

--default-device

This flag is specific to JIT (device-only) compilation mode and changes the default execution space for unannotated entities from __host__ to __device__.

Mechanism

When enabled, the execution-space assignment logic modifies entity+182 to receive the __device__ OR mask (0x23) instead of the implicit host default (0x00). For variables, entity+148 bit 0 (__device__ memory space) is set.
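
A minimal sketch of that assignment, treating an entity as a raw byte buffer addressed by the offsets quoted above. The function name, the entity_kind enum, and the split between routines (+182) and variables (+148) are assumptions made for illustration; the text does not say whether variables also receive the +182 mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    OFF_VAR_MEMORY_SPACE = 148,  /* variable memory-space byte */
    OFF_EXEC_SPACE       = 182,  /* execution-space byte */
    DEVICE_OR_MASK       = 0x23, /* __device__ OR mask */
    VAR_DEVICE_BIT       = 0x01  /* bit 0: __device__ memory space */
};

/* Hypothetical entity classification for this sketch. */
typedef enum { ENTITY_ROUTINE, ENTITY_VARIABLE } entity_kind;

/* Apply the default execution/memory space to an unannotated entity. */
void assign_default_space(uint8_t *entity, entity_kind kind, bool default_device) {
    assert(entity != 0);
    if (!default_device)
        return;  /* implicit host default: bytes stay 0x00 */
    if (kind == ENTITY_ROUTINE)
        entity[OFF_EXEC_SPACE] = (uint8_t)(entity[OFF_EXEC_SPACE] | DEVICE_OR_MASK);
    else
        entity[OFF_VAR_MEMORY_SPACE] = (uint8_t)(entity[OFF_VAR_MEMORY_SPACE] | VAR_DEVICE_BIT);
}
```

A routine marked this way satisfies the device-scope test used elsewhere on this page, since (0x23 & 0x30) == 0x20.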

JIT Mode Context

JIT mode activates when --gen_c_file_name (flag 45) is NOT provided -- there is no host output path, so the host backend never runs. This is the compilation mode used by NVRTC (the CUDA runtime compilation library) and the CUDA Driver API's runtime compilation facilities (cuModuleLoadData, cuLinkAddData).

Without --default-device, five JIT-specific diagnostics warn about unannotated entities:

| Diagnostic Tag | Message Summary |
| --- | --- |
| no_host_in_jit | Explicit __host__ not allowed in JIT mode (no --default-device suggestion) |
| unannotated_function_in_jit | Unannotated function considered host, not allowed in JIT |
| unannotated_variable_in_jit | Namespace-scope variable without memory space annotation |
| unannotated_static_data_member_in_jit | Non-const static data member considered host |
| host_closure_class_in_jit | Lambda closure class inferred as __host__ |

Four of the five messages explicitly suggest --default-device as a workaround. The exception is no_host_in_jit -- an explicit __host__ annotation cannot be overridden by a flag and requires a source code change.

The --default-device flag interacts with the extended lambda system (dword_106BF38): when both are active, namespace-scope lambda closure classes infer __device__ execution space instead of __host__, avoiding the host_closure_class_in_jit diagnostic.

See cuda/jit-mode.md for full JIT mode documentation.

--no-device-int128 / --no-device-float128

These two flags (IDs 52 and 53) disable 128-bit integer and floating-point types in device code respectively.

Registration

Both are registered in sub_452010 as no-argument mode flags in the CUDA-specific range:

| Flag | ID | Binary Address | Global Effect |
| --- | --- | --- | --- |
| no-device-int128 | 52 | 0x836133 | Disables __int128 type in device compilation |
| no-device-float128 | 53 | 0x836144 | Disables __float128/_Float128 in device compilation |

Purpose

The EDG frontend supports __int128 (keyword ID 239 in the builtin keyword table) and _Float128 (keyword ID 335) as extended types. In device code, these types may not be supported by all GPU architectures or may have different semantics than on the host.

The flags belong to the grouped CUDA boolean flags (cases 47--53 in proc_command_line), alongside host-stub-linkage-explicit, static-host-stub, device-hidden-visibility, no-hidden-visibility-on-unnamed-ns, and no-multiline-debug.

Type feature tracking uses byte_12C7AFC as a usage flags byte: bit 0 tracks specific integer subtypes (kinds 11, 12), bit 2 tracks float128/bfloat16 usage. The dword_106C070 global serves as the float128 feature flag, and dword_106C06C controls bfloat16.

NVRTC has specific support strings for both types in the binary (int128 NVRTC, float128 NVRTC), confirming that the JIT compilation path handles the presence or absence of these types explicitly.

Interaction Matrix

The experimental flags interact with each other and with version gating:

| Interaction | Behavior |
| --- | --- |
| --extended-lambda + C++17 | Enables noexcept wrapper trait specializations (step 11 in preamble) via dword_126E270 |
| --extended-lambda + --expt-relaxed-constexpr | A __device__ lambda without the extended-lambda bit is treated as extended if dword_106BF40 is set (via sub_6BC680) |
| --extended-lambda + JIT mode | Lambda closure class execution space inference changes; --default-device affects namespace-scope lambda inference |
| --expt-relaxed-constexpr + cross-space checking | Suppresses 6 specific constexpr cross-space errors; does NOT suppress the 6 non-constexpr variants |
| --no-device-int128 + NVRTC | NVRTC-specific handling confirms both flags are respected in JIT compilation |
| C++20 + cross-space checking | Concept context guard in sub_505B40 adds an additional bypass condition for template cross-space calls |

Global Variable Reference

| Address | Size | Semantic Name | Set By | Checked By |
| --- | --- | --- | --- | --- |
| dword_106BF38 | 4 | extended_lambda_mode | Flag 79* (--extended-lambda) | sub_4864F0 (trigger), sub_489000 (macros), sub_447930 (scan_lambda) |
| dword_106BF40 | 4 | relaxed_constexpr_mode | Flag 104 (--relaxed_constexpr) | sub_505720 (cross-space call), sub_6BC680 (device lambda test), sub_72A650/sub_72B510 (symbol ref) |
| dword_126EF68 | 4 | cpp_standard_version | CLI std selection | 28+ functions across all subsystems |
| dword_126E270 | 4 | cpp17_noexcept_type | Post-parsing dialect resolution | sub_6BCC20 (preamble step 11) |

Function Reference

| Address | Lines | Identity | Source | Role |
| --- | --- | --- | --- | --- |
| sub_452010 | 3,849 | init_command_line_flags | cmd_line.c | Registers all 276 flags including experimental |
| sub_459630 | 4,105 | proc_command_line | cmd_line.c | Parses flags, sets globals |
| sub_447930 | 2,113 | scan_lambda | class_decl.c | Full lambda validation (uses dword_106BF38) |
| sub_4864F0 | 751 | gen_type_decl | cp_gen_be.c | Preamble injection trigger (checks dword_106BF38) |
| sub_6BCC20 | 244 | nv_emit_lambda_preamble | nv_transforms.c | Master preamble emitter (17 steps) |
| sub_505720 | 147 | check_cross_execution_space_call | expr.c | Cross-space call checker (uses dword_106BF40) |
| sub_505B40 | 92 | check_cross_space_call_in_template | expr.c | Template variant of cross-space checker |
| sub_6BC680 | 16 | is_device_or_extended_device_lambda | nv_transforms.c | Device lambda test (uses dword_106BF40) |
| sub_489000 | 723 | process_file_scope_entities | cp_gen_be.c | Backend entry; emits false macros when flag off |
| sub_46E640 | ~400 | nv_gen_extended_lambda_capture_types | cp_gen_be.c | Capture type declarations for extended lambdas |
| sub_6BCBF0 | 13 | nv_record_capture_count | nv_transforms.c | Bitmap bit-set for capture counts |
| sub_6BCBC0 | ~10 | nv_reset_capture_bitmaps | nv_transforms.c | Reset both 1024-bit bitmaps |


EDG Source File Map

This page is the definitive reference table mapping all 52 .c source files and 13 .h header files from EDG 6.6 to their binary addresses in the cudafe++ CUDA 13.0 build. Every column is derived from the .rodata string cross-reference database and verified against the 20 sweep reports (P1.01 through P1.20).

For narrative discussion of these files and their roles in the compilation pipeline, see the Function Map and EDG Overview pages.

Build Path

All source files share the build prefix:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r13.0/compiler/drivers/compiler/edg/EDG_6.6/src/

Coverage Summary

| Metric | Count |
| --- | --- |
| .c files with mapped functions | 52 |
| .h files with mapped functions | 13 |
| Total source files | 65 |
| Functions mapped via .c paths | 2,129 |
| Functions mapped via .h paths only | 80 |
| Total mapped functions | 2,209 |
| Unmapped functions in EDG region (0x403300--0x7E0000) | ~2,896 |
| C++ runtime / demangler (0x7E0000--0x829722) | ~1,085 |
| PLT stubs + init (0x402A18--0x403300) | ~283 |
| Total functions in binary | ~6,483 |
| Mapping coverage | 34.1% |

The 34% mapping rate reflects the fact that only functions containing EDG internal_error assertions reference __FILE__ strings. Functions below the assertion threshold, display-only code compiled without assertions, inlined leaf functions, and the statically-linked C++ runtime are all invisible to this technique.

Column Definitions

| Column | Meaning |
| --- | --- |
| # | Row number, ordered by main body start address |
| Source File | Filename from the EDG source tree |
| Origin | EDG = standard Edison Design Group code; NVIDIA = NVIDIA-authored |
| Total Funcs | Unique functions referencing this file's __FILE__ string (stubs + main) |
| Stubs | Assert wrapper functions in 0x403300--0x408B40 |
| Main Funcs | Functions in the main body region (after 0x409350) |
| Main Body Start | Lowest xref address outside the stub region |
| Main Body End | Highest xref address outside the stub region |
| Code Size | Main Body End - Main Body Start in bytes; approximate (includes interleaved .h inlines and alignment padding) |
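
The derived columns are plain arithmetic on the addresses and counts. The helper names below are hypothetical; the sketch just restates the Code Size definition and the coverage ratio from the summary. For attribute.c, for example, 0x418F80 - 0x409350 gives 64,560 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Code Size column: Main Body End minus Main Body Start, in bytes. */
uint64_t main_body_code_size(uint64_t start, uint64_t end) {
    assert(end >= start);
    return end - start;
}

/* Mapping coverage: mapped functions as a percentage of all functions. */
double mapping_coverage_percent(uint32_t mapped, uint32_t total) {
    assert(total > 0);
    return 100.0 * (double)mapped / (double)total;
}
```

With 2,209 mapped functions out of roughly 6,483, the ratio comes out just above 34%, matching the coverage line in the summary.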

Source File Table -- 52 .c Files

Sorted by main body start address. This ordering reflects the binary layout, which is near-alphabetical with two exceptions noted below.

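| # | Source File | Origin | Total Funcs | Stubs | Main Funcs | Main Body Start | Main Body End | Code Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | attribute.c | EDG | 177 | 7 | 170 | 0x409350 | 0x418F80 | 64,560 |
| 2 | class_decl.c | EDG | 273 | 9 | 264 | 0x419280 | 0x447930 | 190,160 |
| 3 | cmd_line.c | EDG | 44 | 1 | 43 | 0x44B250 | 0x459630 | 58,336 |
| 4 | const_ints.c | EDG | 4 | 1 | 3 | 0x461C20 | 0x4659A0 | 15,744 |
| 5 | cp_gen_be.c | EDG | 226 | 25 | 201 | 0x466F90 | 0x489000 | 139,376 |
| 6 | debug.c | EDG | 2 | 0 | 2 | 0x48A1B0 | 0x48A1B0 | <1 KB |
| 7 | decl_inits.c | EDG | 196 | 4 | 192 | 0x48B3F0 | 0x4A1540 | 90,448 |
| 8 | decl_spec.c | EDG | 88 | 3 | 85 | 0x4A1BF0 | 0x4B37F0 | 72,704 |
| 9 | declarator.c | EDG | 64 | 0 | 64 | 0x4B3970 | 0x4C00A0 | 50,480 |
| 10 | decls.c | EDG | 207 | 5 | 202 | 0x4C0910 | 0x4E8C40 | 164,656 |
| 11 | disambig.c | EDG | 5 | 1 | 4 | 0x4E9E70 | 0x4EC690 | 10,272 |
| 12 | error.c | EDG | 51 | 1 | 50 | 0x4EDCD0 | 0x4F8F80 | 45,744 |
| 13 | expr.c | EDG | 538 | 10 | 528 | 0x4F9870 | 0x5565E0 | 380,528 |
| 14 | exprutil.c | EDG | 299 | 13 | 286 | 0x558720 | 0x583540 | 175,648 |
| 15 | extasm.c | EDG | 7 | 0 | 7 | 0x584CA0 | 0x585850 | 2,992 |
| 16 | fe_init.c | EDG | 6 | 1 | 5 | 0x585B10 | 0x5863A0 | 2,192 |
| 17 | fe_wrapup.c | EDG | 2 | 0 | 2 | 0x588D40 | 0x588F90 | 592 |
| 18 | float_pt.c | EDG | 79 | 0 | 79 | 0x589550 | 0x594150 | 44,032 |
| 19 | folding.c | EDG | 139 | 9 | 130 | 0x594B30 | 0x5A4FD0 | 66,464 |
| 20 | func_def.c | EDG | 56 | 1 | 55 | 0x5A51B0 | 0x5AAB80 | 22,992 |
| 21 | host_envir.c | EDG | 19 | 2 | 17 | 0x5AD540 | 0x5B1E70 | 18,736 |
| 22 | il.c | EDG | 358 | 16 | 342 | 0x5B28F0 | 0x5DFAD0 | 184,800 |
| 23 | il_alloc.c | EDG | 38 | 1 | 37 | 0x5E0600 | 0x5E8300 | 31,488 |
| 24 | il_to_str.c | EDG | 83 | 1 | 82 | 0x5F7FD0 | 0x6039E0 | 47,632 |
| 25 | il_walk.c | EDG | 27 | 1 | 26 | 0x603FE0 | 0x620190 | 115,120 |
| 26 | interpret.c | EDG | 216 | 3 | 213 | 0x620CE0 | 0x65DE10 | 250,160 |
| 27 | layout.c | EDG | 21 | 2 | 19 | 0x65EA50 | 0x665A60 | 28,688 |
| 28 | lexical.c | EDG | 140 | 5 | 135 | 0x666720 | 0x689130 | 141,328 |
| 29 | literals.c | EDG | 21 | 0 | 21 | 0x68ACC0 | 0x68F2B0 | 17,904 |
| 30 | lookup.c | EDG | 71 | 2 | 69 | 0x68FAB0 | 0x69BE80 | 50,128 |
| 31 | lower_name.c | EDG | 179 | 11 | 168 | 0x69C980 | 0x6AB280 | 59,648 |
| 32 | macro.c | EDG | 43 | 1 | 42 | 0x6AB6E0 | 0x6B5C10 | 42,288 |
| 33 | mem_manage.c | EDG | 9 | 2 | 7 | 0x6B6DD0 | 0x6BA230 | 13,408 |
| 34 | nv_transforms.c | NVIDIA | 1 | 0 | 1 | 0x6BE300 | 0x6BE300 | ~22 KB [1] |
| 35 | overload.c | EDG | 284 | 3 | 281 | 0x6BE4A0 | 0x6EF7A0 | 201,472 |
| 36 | pch.c | EDG | 23 | 3 | 20 | 0x6F2790 | 0x6F5DA0 | 13,840 |
| 37 | pragma.c | EDG | 28 | 0 | 28 | 0x6F61B0 | 0x6F8320 | 8,560 |
| 38 | preproc.c | EDG | 10 | 0 | 10 | 0x6F9B00 | 0x6FC940 | 11,840 |
| 39 | scope_stk.c | EDG | 186 | 6 | 180 | 0x6FE160 | 0x7106B0 | 75,600 |
| 40 | src_seq.c | EDG | 57 | 1 | 56 | 0x710F10 | 0x718720 | 30,736 |
| 41 | statements.c | EDG | 83 | 1 | 82 | 0x719300 | 0x726A50 | 55,120 |
| 42 | symbol_ref.c | EDG | 42 | 2 | 40 | 0x726F20 | 0x72CEA0 | 24,448 |
| 43 | symbol_tbl.c | EDG | 175 | 8 | 167 | 0x72D950 | 0x74B8D0 | 122,688 |
| 44 | sys_predef.c | EDG | 35 | 1 | 34 | 0x74C690 | 0x751470 | 19,936 |
| 45 | target.c | EDG | 11 | 0 | 11 | 0x7525F0 | 0x752DF0 | 2,048 |
| 46 | templates.c | EDG | 455 | 12 | 443 | 0x7530C0 | 0x794D30 | 285,808 |
| 47 | trans_copy.c | EDG | 2 | 0 | 2 | 0x796BA0 | 0x796BA0 | <1 KB |
| 48 | trans_corresp.c | EDG | 88 | 6 | 82 | 0x796E60 | 0x7A3420 | 50,112 |
| 49 | trans_unit.c | EDG | 10 | 0 | 10 | 0x7A3BB0 | 0x7A4690 | 2,784 |
| 50 | types.c | EDG | 88 | 5 | 83 | 0x7A4940 | 0x7C02A0 | 112,480 |
| 51 | modules.c | EDG | 22 | 3 | 19 | 0x7C0C60 | 0x7C2560 | 6,400 |
| 52 | floating.c | EDG | 50 | 9 | 41 | 0x7D0EB0 | 0x7D59B0 | 19,200 |
| | TOTALS | | 5,338 | 198 | 5,140 | 0x409350 | 0x7D59B0 | ~3.57 MB |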

Source File Table -- 13 .h Header Files

Header files appear in assertion strings when an inline function or macro defined in the header triggers an internal_error call. The function itself is compiled within the .c file's translation unit, but __FILE__ resolves to the header path. These functions are scattered across the binary, interleaved with the .c file that #include-d them.

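| # | Header File | Total Funcs | Stubs | Main Funcs | Min Address | Max Address | Primary Host |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | decls.h | 1 | 0 | 1 | 0x4E08F0 | 0x4E08F0 | decls.c |
| 2 | float_type.h | 63 | 0 | 63 | 0x7D1C90 | 0x7DEB90 | floating.c |
| 3 | il.h | 5 | 2 | 3 | 0x52ABC0 | 0x6011F0 | expr.c, il.c, il_to_str.c |
| 4 | lexical.h | 1 | 0 | 1 | 0x68F2B0 | 0x68F2B0 | lexical.c / literals.c boundary |
| 5 | mem_manage.h | 4 | 0 | 4 | 0x4EDCD0 | 0x4EDCD0 | error.c |
| 6 | modules.h | 5 | 0 | 5 | 0x7C1100 | 0x7C2560 | modules.c |
| 7 | nv_transforms.h | 3 | 0 | 3 | 0x432280 | 0x719D20 | class_decl.c, cp_gen_be.c, src_seq.c |
| 8 | overload.h | 1 | 0 | 1 | 0x6C9E40 | 0x6C9E40 | overload.c |
| 9 | scope_stk.h | 4 | 0 | 4 | 0x503D90 | 0x574DD0 | expr.c, exprutil.c |
| 10 | symbol_tbl.h | 2 | 1 | 1 | 0x7377D0 | 0x7377D0 | symbol_tbl.c |
| 11 | types.h | 17 | 4 | 13 | 0x469260 | 0x7B05E0 | Many .c files (scattered type queries) |
| 12 | util.h | 124 | 10 | 114 | 0x430E10 | 0x7C2B10 | All major .c files |
| 13 | walk_entry.h | 51 | 0 | 51 | 0x604170 | 0x618660 | il_walk.c |
| | TOTALS | 281 | 17 | 264 | | | |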

Header Distribution Patterns

The 13 headers fall into three distinct patterns:

Localized headers -- functions cluster in a single .c file's address range:

  • float_type.h (63 funcs in 52 KB at 0x7D1C90--0x7DEB90, all within floating.c)
  • walk_entry.h (51 funcs in 83 KB at 0x604170--0x618660, all within il_walk.c)
  • modules.h (5 funcs in 5 KB at 0x7C1100--0x7C2560, all within modules.c)
  • decls.h, lexical.h, overload.h, symbol_tbl.h (1--2 funcs each, single site)
  • mem_manage.h (4 funcs, single site in error.c)

Moderately scattered headers -- functions appear in 2--3 .c files:

  • il.h (5 funcs across expr.c, il.c, il_to_str.c)
  • scope_stk.h (4 funcs across expr.c, exprutil.c)
  • nv_transforms.h (3 funcs across class_decl.c, cp_gen_be.c, src_seq.c)

Pervasive headers -- functions inlined into most .c files:

  • util.h (124 xrefs spanning 0x430E10--0x7C2B10, nearly the entire EDG region)
  • types.h (17 funcs spanning 0x469260--0x7B05E0, scattered type queries)

Assert Stub Region

The region 0x403300--0x408B40 contains 198 small __noreturn functions. Each encodes a single assertion site: the source file path, line number, and enclosing function name. When the assertion condition fails, the stub calls sub_4F2930 (EDG's internal_error handler) and does not return. Every stub is 29 bytes.
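
In source form, one such stub plausibly looks like the sketch below: a no-return function that forwards a fixed (file, line, function) triple to the error handler. Everything named here is hypothetical, including the handler's signature, the message format, and the example site; sub_4F2930's real interface is not documented on this page.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format an assertion site into buf; returns the resulting string length. */
size_t format_assert_site(char *buf, size_t size,
                          const char *file, int line, const char *func) {
    assert(buf != NULL && size > 0);
    snprintf(buf, size, "%s:%d: assertion failed in %s", file, line, func);
    return strlen(buf);
}

/* Hypothetical stand-in for the internal_error handler (sub_4F2930). */
_Noreturn void internal_error_sketch(const char *file, int line, const char *func) {
    char msg[256];
    format_assert_site(msg, sizeof msg, file, line, func);
    fprintf(stderr, "%s\n", msg);
    abort();
}

/* One stub per assertion site: all three arguments are compile-time
   constants. The values used here are made up for illustration. */
_Noreturn void assert_stub_sketch(void) {
    internal_error_sketch("expr.c", 100, "example_function");
}
```

Because each stub carries only constants and a single call, the compiler can place all of them in one cold region, which is consistent with the dense stub block described above.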

Stub Distribution by Source File

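| Source File | Stub Count | Source File | Stub Count |
| --- | --- | --- | --- |
| cp_gen_be.c | 25 | macro.c | 1 |
| il.c | 16 | mem_manage.c | 2 |
| exprutil.c | 13 | modules.c | 3 |
| templates.c | 12 | overload.c | 3 |
| lower_name.c | 11 | pch.c | 3 |
| expr.c | 10 | preproc.c | 0 |
| class_decl.c | 9 | scope_stk.c | 6 |
| folding.c | 9 | src_seq.c | 1 |
| floating.c | 9 | statements.c | 1 |
| attribute.c | 7 | symbol_ref.c | 2 |
| symbol_tbl.c | 8 | symbol_tbl.h | 1 |
| trans_corresp.c | 6 | sys_predef.c | 1 |
| lexical.c | 5 | target.c | 0 |
| decls.c | 5 | trans_copy.c | 0 |
| types.c | 5 | trans_unit.c | 0 |
| decl_inits.c | 4 | types.h | 4 |
| interpret.c | 3 | util.h | 10 |
| decl_spec.c | 3 | il.h | 2 |
| host_envir.c | 2 | debug.c | 0 |
| layout.c | 2 | extasm.c | 0 |
| lookup.c | 2 | fe_wrapup.c | 0 |
| cmd_line.c | 1 | float_pt.c | 0 |
| const_ints.c | 1 | declarator.c | 0 |
| disambig.c | 1 | pragma.c | 0 |
| error.c | 1 | nv_transforms.c | 0 |
| fe_init.c | 1 | literals.c | 0 |
| func_def.c | 1 | | |
| il_alloc.c | 1 | | |
| il_to_str.c | 1 | | |
| il_walk.c | 1 | | |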

After the stubs, addresses 0x408B40--0x409350 contain 15 C++ static constructor functions (ctor_001 through ctor_015) that initialize global tables at program startup. These have no source file attribution.

Gap Analysis -- Unmapped Regions

The following address ranges within the EDG .text region contain functions that could not be mapped to any source file via __FILE__ strings. Each gap represents functions that either lack assertions entirely, use non-EDG assertion macros, or are compiler-generated (vtable thunks, exception handlers, template instantiation artifacts).

| # | Gap Range | Size | Between | Probable Content |
| --- | --- | --- | --- | --- |
| 1 | 0x408B40--0x409350 | 2 KB | stubs / attribute.c | Static constructors (ctor_001--ctor_015) |
| 2 | 0x447930--0x44B250 | 13 KB | class_decl.c / cmd_line.c | Boundary helpers, small inlines |
| 3 | 0x459630--0x461C20 | 34 KB | cmd_line.c / const_ints.c | Unmapped option handlers, flag tables |
| 4 | 0x4659A0--0x466F90 | 6 KB | const_ints.c / cp_gen_be.c | Constant integer helpers |
| 5 | 0x489000--0x48A1B0 | 5 KB | cp_gen_be.c / debug.c | Backend emission tail |
| 6 | 0x48A1B0--0x48B3F0 | 5 KB | debug.c / decl_inits.c | Debug infrastructure |
| 7 | 0x5E8300--0x5F7FD0 | 87 KB | il_alloc.c / il_to_str.c | IL display routines (no assertions) |
| 8 | 0x620190--0x620CE0 | 3 KB | il_walk.c / interpret.c | Walk epilogue |
| 9 | 0x65DE10--0x65EA50 | 3 KB | interpret.c / layout.c | Interpreter tail |
| 10 | 0x665A60--0x666720 | 3 KB | layout.c / lexical.c | Layout/lexer boundary |
| 11 | 0x689130--0x68ACC0 | 7 KB | lexical.c / literals.c | Token conversion helpers |
| 12 | 0x6AB280--0x6AB6E0 | 1 KB | lower_name.c / macro.c | Mangling helpers |
| 13 | 0x6BA230--0x6BAE70 | 3 KB | mem_manage.c / nv_transforms.c | Memory infrastructure |
| 14 | 0x6EF7A0--0x6F2790 | 12 KB | overload.c / pch.c | Overload resolution tail |
| 15 | 0x6FC940--0x6FE160 | 6 KB | preproc.c / scope_stk.c | Preprocessor tail functions |
| 16 | 0x751470--0x7525F0 | 7 KB | sys_predef.c / target.c | Predefined macro infrastructure |
| 17 | 0x7A4690--0x7A4940 | 1 KB | trans_unit.c / types.c | TU helpers |
| 18 | 0x7C2560--0x7D0EB0 | 59 KB | modules.c / floating.c | Type-name encoding, module helpers |
| 19 | 0x7D59B0--0x7DEB90 | 37 KB | floating.c tail | float_type.h template instantiations |
| 20 | 0x7DFFF0--0x82A000 | 304 KB | post-EDG | C++ runtime, demangler, soft-float, EH |
| | Total unmapped | ~582 KB | | |

The largest unmapped gap within EDG code proper is the IL display region at 0x5E8300--0x5F7FD0 (87 KB). These functions were most likely compiled from il_to_str.c. They contain no assertions, which is consistent with the display/dump subsystem having been built without assertion macros: it is purely diagnostic code that formats IL trees to stdout.
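Gaps of this kind can be recomputed mechanically once each source file has an address range. The following is a minimal sketch, not the tooling used for this wiki: `find_gaps` is a hypothetical helper, ranges are treated as end-exclusive, and the end address given for attribute.c is an assumption (it reuses the address at which class_decl.c begins).

```python
def find_gaps(ranges, min_size=1):
    """Return (gap_start, gap_end, size, before, after) for each hole between ranges.

    ranges: iterable of (name, start, end) tuples, with end exclusive.
    """
    ordered = sorted(ranges, key=lambda r: r[1])
    gaps = []
    for (before, _, prev_end), (after, next_start, _) in zip(ordered, ordered[1:]):
        size = next_start - prev_end
        if size >= min_size:
            gaps.append((prev_end, next_start, size, before, after))
    return gaps


# Gap 1 from the table; the end address used for attribute.c is an assumption.
regions = [
    ("assert stubs", 0x403300, 0x408B40),
    ("attribute.c", 0x409350, 0x419280),
]
for start, end, size, before, after in find_gaps(regions):
    print(f"{start:#x}--{end:#x}  {size / 1024:.1f} KB  {before} / {after}")
```

Raising `min_size` filters out alignment padding between adjacent functions, which would otherwise show up as many tiny gaps.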

Alphabetical Layout Observation

Source files are laid out in the binary in near-alphabetical order by filename, a consequence of the build system compiling .c files in directory-listing order and the linker processing them sequentially. The sequence is strictly alphabetical from attribute.c through types.c (rows 1--50).

Two files break this pattern:

| File | Expected Position | Actual Position | Offset |
| --- | --- | --- | --- |
| modules.c | Between mem_manage.c and nv_transforms.c (#33--#34) | After types.c (#51, at 0x7C0C60) | +18 rows late |
| floating.c | Between float_pt.c and folding.c (#18--#19) | After modules.c (#52, at 0x7D0EB0) | +34 rows late |

Both files appear after the main alphabetical sequence, placed at the very end of the EDG region. The most likely explanation is that modules.c and floating.c are compiled as separate translation units outside the main EDG build directory -- perhaps in a subdirectory or a secondary build target -- and are appended to the link line after the alphabetically-sorted main objects. The modules.c file implements C++20 module support (mostly stubs in the CUDA build), and floating.c implements arbitrary-precision IEEE 754 arithmetic -- both are semi-independent subsystems that could plausibly be compiled separately.

Note that floating.c is followed immediately by its private header float_type.h (63 template instantiations at 0x7D1C90--0x7DEB90), which is consistent with the two sharing a compilation unit.
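The out-of-order files can be found automatically by walking the file names in address order and flagging any name that sorts before one already seen. This is a sketch under stated assumptions: `alphabetical_breaks` is a hypothetical helper, the sequence below is an abbreviated version of the layout, and plain code-point comparison stands in for whatever ordering the build system actually used.

```python
def alphabetical_breaks(files_in_address_order):
    """Return the files that sort before some file placed earlier in the binary."""
    breaks = []
    highest = ""
    for name in files_in_address_order:
        if name < highest:
            breaks.append(name)
        else:
            highest = name
    return breaks


# Abbreviated address-ordered sequence taken from the layout described here.
layout = ["attribute.c", "float_pt.c", "folding.c", "mem_manage.c",
          "nv_transforms.c", "types.c", "modules.c", "floating.c"]
print(alphabetical_breaks(layout))
```

Code-point ordering places `.` before `_` and `_` before lowercase letters, which is why il.c precedes il_alloc.c and decl_spec.c precedes declarator.c in the main sequence.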

Binary Region Map

0x402A18 +--------------------------+
         | PLT stubs / init (283)   |  3 KB
0x403300 +--------------------------+
         | Assert stubs (198)       |  22 KB
0x408B40 +--------------------------+
         | Constructors (15)        |  2 KB
0x409350 +--------------------------+
         | attribute.c              |  65 KB
0x419280 | class_decl.c             | 190 KB
         | cmd_line.c               |  58 KB
         | const_ints.c             |  16 KB
         | cp_gen_be.c              | 139 KB
         | debug.c                  |  <1 KB
         | decl_inits.c             |  90 KB
         | decl_spec.c              |  73 KB
         | declarator.c             |  50 KB
         | decls.c                  | 165 KB
         | disambig.c               |  10 KB
         | error.c                  |  46 KB
         | expr.c                   | 381 KB
         | exprutil.c               | 176 KB
         | extasm.c                 |   3 KB
         | fe_init.c                |   2 KB
         | fe_wrapup.c              |  <1 KB
         | float_pt.c               |  44 KB
         | folding.c                |  66 KB
         | func_def.c               |  23 KB
         | host_envir.c             |  19 KB
         | il.c                     | 185 KB
         | il_alloc.c               |  31 KB
         | [il display gap]         |  87 KB  (unmapped)
         | il_to_str.c              |  48 KB
         | il_walk.c                | 115 KB
         | interpret.c              | 250 KB
         | layout.c                 |  29 KB
         | lexical.c                | 141 KB
         | literals.c               |  18 KB
         | lookup.c                 |  50 KB
         | lower_name.c             |  60 KB
         | macro.c                  |  42 KB
         | mem_manage.c             |  13 KB
         | nv_transforms.c          |  22 KB
         | overload.c               | 201 KB
         | pch.c                    |  14 KB
         | pragma.c                 |   9 KB
         | preproc.c                |  12 KB
         | scope_stk.c              |  76 KB
         | src_seq.c                |  31 KB
         | statements.c             |  55 KB
         | symbol_ref.c             |  24 KB
         | symbol_tbl.c             | 123 KB
         | sys_predef.c             |  20 KB
         | target.c                 |   2 KB
         | templates.c              | 286 KB
         | trans_copy.c             |  <1 KB
         | trans_corresp.c          |  50 KB
         | trans_unit.c             |   3 KB
         | types.c                  | 112 KB
         |  --- alphabetical break ---
         | modules.c                |   6 KB
         |  --- gap (59 KB) ---
0x7D0EB0 | floating.c               |  19 KB
         | float_type.h inlines     |  52 KB
0x7DFFF0 +--------------------------+
         | C++ runtime / demangler  | 304 KB
0x82A000 +--------------------------+

Reproduction

To regenerate the source file list from the strings database:

```sh
jq '[.[] | select(.value | test("/dvs/p4/.*\\.[ch]$")) |
  {file: (.value | split("/") | last),
   xrefs: (.xrefs | length)}
] | group_by(.file) |
  map({file: .[0].file,
       total_xrefs: (map(.xrefs) | add)}) |
  sort_by(.file)' cudafe++_strings.json
```

To extract address ranges per file:

```python
import json
from collections import defaultdict

# Load the strings database exported from the disassembler.
with open('cudafe++_strings.json') as f:
    data = json.load(f)

# Map each source file name to the addresses that reference its __FILE__ string.
files = defaultdict(list)
for entry in data:
    val = entry.get('value', '')
    # Keep only EDG build paths that name a .c or .h file.
    if '/dvs/p4/' not in val:
        continue
    if not val.endswith(('.c', '.h')):
        continue
    fname = val.split('/')[-1]
    for xref in entry.get('xrefs', []):
        files[fname].append(int(xref['from'], 16))

# Print the lowest and highest referencing address for each file.
for fname in sorted(files):
    addrs = sorted(files[fname])
    print(f"{fname:25s}  {hex(addrs[0]):>12s} - {hex(addrs[-1]):>12s}"
          f"  ({len(addrs)} xrefs)")
```

  1. nv_transforms.c has only 1 function with an EDG-style __FILE__ reference, but sweep analysis confirms ~40 functions in the 0x6BAE70--0x6BE4A0 region (~22 KB). Most use NVIDIA's own assertion macros instead of EDG's internal_error path.

Global Variable Index

cudafe++ v13.0 uses more than 400 global variables scattered across the .bss and .data segments. These variables fall into clear functional categories: compilation mode selectors, error/diagnostic state, I/O handles, CUDA-specific flags, translation unit management, scope tracking, IL allocation, lexer state, template instantiation, lambda transforms, and memory management. Every address listed below was confirmed through binary analysis of the x86-64 Linux ELF (sha256 6a69...). This page serves as the canonical cross-reference for all other wiki articles.

The variables cluster into three address regions: 0x106xxxx (NVIDIA-added configuration flags, typically set during CLI processing), 0x126xxxx (EDG core compiler state, used throughout parsing, IL generation, and code emission), and 0x12Cxxxx / 0x128xxxx (template instantiation, lambda transform, and arena allocator state). A few tables live outside these clusters: the error message table (off_88FAA0) sits in the read-only .rodata segment, while the tables at 0xD4xxxx--0xE7xxxx fall within the initialized .data segment (0xD46480--0xE7EFF0).

Compilation Mode and Language Standard

These globals control the fundamental compilation dialect -- C vs C++, which standard version, which vendor extensions are active, and whether the compiler is in CUDA mode.

| Address | Size | Name | Description |
| --- | --- | --- | --- |
| dword_126EFB4 | 4 | language_mode | Master dialect selector. 1 = C, 2 = C++. Checked in virtually every subsystem. In some contexts (p1.12) interpreted as device_il_mode when value is 2. |
| dword_126EF68 | 4 | cpp_standard_version | __cplusplus value. 199711 = C++98, 201103 = C++11, 201402 = C++14, 201703 = C++17, 202002 = C++20, 202302 = C++23. For C mode: 199000 (pre-C99), 199901 (C99), 201112 (C11), 201710 (C17), 202311 (C23). |
| dword_126EFAC | 4 | extended_features | EDG extended features / GNU compatibility mode flag. Also used as CUDA mode indicator in several paths. |
| dword_126EFA8 | 4 | gcc_extensions | GCC extensions mode (1 = enabled). Also used as GPU compilation mode flag in device/host separation. |
| dword_126EFA4 | 4 | clang_extensions | Clang extensions mode. Dual-use: also serves as device-code-mode flag during device/host separation (1 = compiling device side). |
| dword_126EFB0 | 4 | gnu_extensions_enabled | GNU extensions active (set alongside dword_126EFA8). Also used as strict_c_mode and relaxed_constexpr in some paths. |
| qword_126EF98 | 8 | gcc_version | GCC compatibility version, encoded as major*10000+minor*100+patch. Default 80100 (GCC 8.1.0). Compared as hex thresholds (e.g., 0x9E97 = 40599). |
| qword_126EF90 | 8 | clang_version | Clang compatibility version. Default 90100. Used for feature gating (compared against 0x78B3, 0x15F8F, 0x1D4BF). |
| qword_126EF78 | 8 | msvc_version | MSVC compatibility version. Default 1926. |
| qword_126EF70 | 8 | version_threshold_max | Upper version bound. Default 99999. |
| dword_126EF64 | 4 | cpp_extensions_enabled | C extension level (nonstandard extensions). |
| dword_126EF80 | 4 | feature_flag_80 | Miscellaneous feature flag, default 1. |
| dword_126EF48 | 4 | auto_parameter_mode | Auto parameter support flag (inverse of input). |
| dword_126EF4C | 4 | auto_parameter_support | Auto-parameter enabled (C++20 auto function params). |
| dword_126EEFC | 4 | digit_separators_enabled | C++14 digit separator (') support. |
| dword_126EF0C | 4 | feature_flag_0C | Miscellaneous feature flag, default 1. |
| dword_126E4A8 | 4 | sm_architecture | Target SM architecture version (set by --nv_arch / case 245). |
| dword_126E498 | 4 | signed_chars | Whether plain char is signed. |
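The hex thresholds in the gcc_version and clang_version rows are easier to read once decoded. A minimal sketch, assuming clang_version uses the same major*10000+minor*100+patch scheme stated for gcc_version; `decode_version` is a hypothetical helper, not a function in the binary:

```python
def decode_version(encoded):
    """Split major*10000 + minor*100 + patch into its three components."""
    major, rest = divmod(encoded, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


for raw in (80100, 0x9E97, 0x78B3, 0x15F8F, 0x1D4BF):
    print(f"{raw:#x} = {raw} -> {decode_version(raw)}")
```

Each of the listed thresholds decodes to a patch level of 99, which would fit a "greater than" comparison that admits the following release, although the direction of each comparison is not stated in the table.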

CUDA-Specific Flags

Flags controlling CUDA-specific behavior: device code generation, extended lambdas, relaxed constexpr, OptiX mode.

| Address | Size | Name | Description |
| --- | --- | --- | --- |
| dword_1065850 | 4 | device_stub_mode | Device stub mode toggle. Toggled by expression dword_1065850 = (dword_1065850 == 0) in gen_routine_decl. 0 = forwarding body pass, 1 = static stub pass. |
| dword_106BF38 | 4 | extended_lambda_mode | NVIDIA extended lambdas enabled (--expt-extended-lambda). Gates the lambda wrapper generation pipeline. |
| dword_106BF40 | 4 | lambda_host_device_mode | Lambda host-device mode flag. Controls whether __device__ function references are allowed in host code. |
| dword_106BF34 | 4 | lambda_validation_skip | Skip lambda validation checks. |
| dword_106BFDC | 4 | skip_device_only | Skip device-only code generation. When clear, deferred function list accumulates at qword_1065840. |
| dword_106BFF0 | 4 | relaxed_attribute_mode | NVIDIA relaxed override mode. Controls permissive __host__/__device__ attribute mismatch handling. Default 1 in CLI defaults. |
| dword_106BFBC | 4 | whole_program_mode | Whole-program mode (affects deferred function list behavior). |
| dword_106BFD0 | 4 | device_registration | Enable CUDA device registration / cross-space reference checking. |
| dword_106BFCC | 4 | constant_registration | Enable CUDA constant registration / another cross-space check flag. |
| dword_106BFB8 | 4 | emit_symbol_table | Emit symbol table in output. |
| dword_106BF6C | 4 | alt_host_compiler_mode | Alternative host compiler mode. |
| dword_106BF68 | 4 | host_compiler_flag | Host compiler attribute support flag. Also dword_106BF58. |
| dword_106BDD8 | 4 | optix_mode | OptiX compilation mode flag. |
| dword_106B670 | 4 | optix_kernel_index | OptiX kernel index (combined with dword_106BDD8 for error 3689). |
| qword_106B678 | 8 | optix_kernel_table | OptiX kernel info table pointer. |
| dword_106C2C0 | 4 | gpu_mode | GPU/device compilation mode. Controls reinterpret_cast semantics, pointer dereference, and keyword detection in device context. |
| dword_106C1D8 | 4 | relaxed_constexpr_ptr | Controls pointer dereference in device constexpr (--expt-relaxed-constexpr related). |
| dword_106C1E0 | 4 | device_typeid | Controls typeid availability in device constexpr context. |
| dword_106C1F4 | 4 | device_class_lookup | CUDA device class member lookup flag. |
| dword_E7C760 | 4[6] | exec_space_table | Execution space bitmask table (6 entries). a1 & dword_E7C760[a2] tests space compatibility. |
| dword_106B640 | 4 | keep_in_il_active | Assertion guard: set to 1 before keep_in_il walk, cleared to 0 after. |
| dword_E85700 | 4 | host_runtime_included | Flag: host_runtime.h already included in .int.c output. |
| dword_126E270 | 4 | cpp17_noexcept_type | C++17 noexcept-in-type-system flag. Gates noexcept variant emission for lambda wrappers. |
| dword_106BF80 | 4-ptr | module_id_file | Module-ID file path (for CRC32 calculation). |
| qword_1065840 | 8 | deferred_function_list | Linked list of deferred functions (used when dword_106BFDC is clear). |

Error and Diagnostic State

The diagnostic subsystem uses a set of globals to track error/warning counts, severity thresholds, output format, and per-error suppression state.

| Address | Size | Name | Description |
| --- | --- | --- | --- |
| qword_126ED90 | 8 | error_count | Total errors emitted. Also used as error-recovery-mode flag (nonzero = in recovery). |
| qword_126ED98 | 8 | warning_count | Total warnings emitted. |
| qword_126EDF0 | 8 | error_output_stream | FILE* for diagnostic output. Default stderr. Initialized during ctor_002. |
| qword_126EDE8 | 8 | current_source_position | Current source position for error reporting. Mirrored from qword_1065810. |
| qword_126ED60 | 8 | error_limit | Maximum error count before abort. |
| byte_126ED69 | 1 | min_severity_threshold | Minimum severity for diagnostic output (default threshold). |
| byte_126ED68 | 1 | error_promotion_threshold | Severity at or above which warnings become errors. |
| dword_126ED40 | 4 | suppress_assertion_output | Suppress assertion output flag. |
| dword_126ED48 | 4 | no_catastrophic_on_error | Disable catastrophic error on internal assertion. |
| dword_126ED50 | 4 | no_caret_diagnostics | Disable caret (^) diagnostics. |
| dword_126ED58 | 4 | max_context_lines | Maximum source context lines in diagnostics. |
| dword_126ED78 | 4 | has_error_in_scope | Error occurred in current scope. |
| dword_126ED44 | 4 | name_lookup_kind | Name lookup kind for diagnostic formatting. |
| byte_126ED55 | 1 | device_severity_override | Default severity for device-mode diagnostics. |
| byte_126ED56 | 1 | warning_level_control | Warning level control byte. |
| dword_106BBB8 | 4 | output_format | Output format selector. 0 = plaintext, 1 = SARIF JSON. |
| dword_106C088 | 4 | warnings_are_errors | Treat warnings as errors (-Werror equivalent). |
| dword_126ECA0 | 4 | colorization_requested | Color output requested. |
| dword_126ECA4 | 4 | colorization_active | Color output currently active (after TTY detection). |
| off_88FAA0 | 8[3795] | error_message_table | Array of 3,795 const char* pointers indexed by error code. |
| byte_1067920 | 1[3795] | default_severity_table | Default severity for each error code. |
| byte_1067921 | 1[3795] | current_severity_table | Current (possibly pragma-modified) severity. |
| byte_1067922 | 4[3795] | per_error_flags | Per-error tracking: bit 0 = first occurrence, other bits = suppression state. |
| off_D481E0 | -- | label_fill_in_table | Diagnostic label fill-in table ({name, cond_index, default_index} entries). |
| qword_106B488 | 8 | message_text_buffer | Growable message text buffer (initial 0x400 bytes via sub_6B98A0). |
| qword_106B480 | 8 | location_prefix_buffer | Location prefix buffer (initial 0x80 bytes). |
| qword_106B478 | 8 | sarif_json_buffer | SARIF JSON output buffer (initial 0x400 bytes). |
| dword_106B470 | 4 | terminal_width | Terminal width for word wrapping. |
| dword_106B4A0 | 4 | fill_in_alloc_count | Fill-in entry allocation counter. |
| qword_106B490 | 8 | fill_in_free_list | Free list for 40-byte fill-in entries. |
| dword_106B4B0 | 4 | catastrophic_error_guard | Re-entry guard for catastrophic error processing. |
| dword_1065928 | 4 | assertion_reentry_guard | Re-entry guard for assertion handler. |
| qword_1067860 | 8 | entity_formatter_callback | Entity name formatting callback (sub_5B29C0). |
| qword_1067870 | 8 | entity_formatter_buffer | Entity formatter output buffer. |
| byte_10678F1 | 1 | diag_is_c_mode | Diagnostic C mode flag (dword_126EFB4 == 1). |
| byte_10678F4 | 1 | diag_is_pre_cpp11 | Diagnostic pre-C++11 flag. |
| byte_10678FA | 1 | diag_name_lookup_kind | Name lookup kind for entity display. |
| qword_106BCD8 | 8 | suppress_all_but_fatal | When set, suppress all errors except 992 (fatal). |
| dword_106BCD4 | 4 | predefined_macro_file_mode | Predefined macro file mode (affects error case). |
| qword_10658F8 | 8 | pragma_scratch_buffer | Scratch buffer for pragma bsearch operations. |
| dword_106B4BC | 4 | werror_emitted_guard | Prevents recursion in warnings-as-errors emission. |

I/O and File Management

Globals controlling input/output filenames, streams, include paths, and preprocessor output.

| Address | Size | Name | Description |
| --- | --- | --- | --- |
| qword_126EEE0 | 8 | input_filename | Current output/source filename (write-protected name). Compared against "-" for stdout mode. |
| qword_106BF20 | 8 | output_filename_override | Output C file path (set by --gen_c_file_name / case 45). |
| qword_106C040 | 8 | output_filename_alt | Alternative output filename (used in signoff). |
| qword_106C280 | 8 | output_file | FILE* for .int.c output (stdout or file). |
| qword_126EE98 | 8 | include_path_list | Include search path linked list head. |
| qword_126F100 | 8 | include_path_free_list | Free list for recycled search path nodes. |
| qword_126F0E8 | 8 | path_normalize_buffer | Growable buffer for path normalization (0x100 initial). |
| dword_126EE58 | 4 | backslash_as_separator | Backslash as path separator (Windows mode). |
| dword_126EE54 | 4 | windows_drive_letter | Recognize Windows drive-letter paths. |
| dword_126EEE8 | 4 | bom_detection_enabled | Byte-order mark detection enabled. |
| dword_126F110 | 4 | once_guard | One-time initialization guard for source file processing. |
| qword_126F0C0 | 8 | cached_module_id | Cached module ID string (CRC32-based). |
| qword_106BF80 | 8 | module_id_file_path | Module-ID file path for external ID override. |
| qword_106C038 | 8 | options_hash_input | Command-line options hash input for module ID. |
| qword_106C248 | 8 | macro_alias_map | Hash table: macro define/alias mappings. |
| qword_106C240 | 8 | include_path_map | Include path list for CLI processing. |
| qword_106C238 | 8 | sys_include_map | System include path map. |
| qword_106C228 | 8 | sys_include_map_2 | Additional system include map. |
| dword_106C29C | 4 | preprocess_mode | Preprocessing-only mode (1 = active). Set by CLI cases 3,4. |
| dword_106C294 | 4 | no_line_commands | Suppress #line directives in output. |
| dword_106C288 | 4 | preprocess_output_mode | Preprocess output: 0 = suppress, 1 = emit preprocessed text. |
| dword_106C254 | 4 | skip_backend | Skip backend code generation entirely. |

Scope Stack

The scope stack is an array of 784-byte entries whose base pointer is stored in qword_126C5E8; the current top of stack is the index held in dword_126C5E4. It tracks the nested scope hierarchy (file, namespace, class, function, block, template).

| Address | Size | Name | Description |
| --- | --- | --- | --- |
| qword_126C5E8 | 8 | scope_table_base | Base pointer to scope stack array. Each entry is 784 bytes. |
| dword_126C5E4 | 4 | current_scope_index | Current top-of-stack index. |
| dword_126C5DC | 4 | saved_scope_index | Saved scope index (for enum processing, lambda nesting). |
| dword_126C5D8 | 4 | function_scope_index | Enclosing function scope index (-1 if none). |
| dword_126C5C8 | 4 | template_scope_index | Template scope index (-1 if not in template). |
| dword_126C5C4 | 4 | class_scope_index | Class/nested-class scope index (-1 if none). Also used as friend_scope_index in some paths. |
| dword_126C5BC | 4 | lambda_body_flag | Lambda body processing flag / template declaration flag. |
| dword_126C5B8 | 4 | class_nesting_depth | Class nesting depth / is_member_of_template flag. |
| dword_126C5B4 | 4 | block_scope_counter | Block scope counter / namespace scope parameter. |
| dword_126C5AC | 4 | saved_depth_template | Saved scope depth for template instantiation restore. |
| dword_126C5E0 | 4 | scope_hash | Scope hash/identifier. |
| dword_126C5A4 | 4 | nesting_scope_index | Nesting scope index. |
| dword_126C5A0 | 4 | scope_misc_flag | Miscellaneous scope flag. |
| dword_126C5C0 | 4 | instantiation_scope_index | Instantiation scope index. |
| qword_126C5D0 | 8 | current_routine_ptr | Current enclosing function/routine descriptor pointer. Used for execution space checks (offset +32 -> byte +177 bit 2 for device, byte +182 & 0x30 for space mask). |
| qword_126C598 | 8 | pack_expansion_context | Pack expansion context pointer (C++17). |
| qword_126C590 | 8 | symbol_hash_table | Robin Hood hash table for symbol lookup within scope. |
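Because every scope entry has a fixed 784-byte size, locating an entry is plain base-plus-stride arithmetic on the pointer held in qword_126C5E8. A minimal sketch, assuming zero-based indexing; `scope_entry_address` is a hypothetical helper and the base address below is made up for illustration:

```python
SCOPE_ENTRY_SIZE = 784  # bytes per scope stack entry, per the table above


def scope_entry_address(scope_table_base, scope_index):
    """Address of entry scope_index, given the pointer held in qword_126C5E8."""
    if scope_index < 0:
        raise ValueError("-1 marks 'no such scope' in the index globals")
    return scope_table_base + SCOPE_ENTRY_SIZE * scope_index


base = 0x7F0000000000  # hypothetical heap address, for illustration only
print(hex(scope_entry_address(base, 0)))
print(hex(scope_entry_address(base, 3)))
```

The same computation applies to any of the index globals (function, template, class scope) once the -1 sentinel has been ruled out.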

Lexer and Token State

The lexer maintains its current token, source position, and preprocessor state in these globals.

| Address | Size | Name | Description |
| --- | --- | --- | --- |
| word_126DD58 | 2 | current_token | Current token kind (357 possible values). Key values: 7 = identifier, 33 = comma, 55 = semicolon, 56 = =, 67 = equals, 73 = CUDA token, 76 = *, 142 = __attribute__, 161 = this, 187 = requires clause. |
| qword_126DD38 | 8 | token_source_position | Source position of current token. |
| qword_126DD48 | 8 | token_text_ptr | Pointer to current identifier/literal text. |
| dword_126DF90 | 4 | token_flags_1 | Token flags / current declaration counter. |
| dword_126DF8C | 4 | token_flags_2 | Secondary token flags. |
| qword_126DF80 | 8 | token_extra_data | Token extra data pointer. |
| dword_126DB74 | 4 | has_cached_tokens | Cached token state flag. |
| dword_126DB58 | 4 | digit_separator_seen | C++14 digit separator seen during number scanning. |
| qword_126DDA0 | 8 | input_position | Current position in input buffer. |
| qword_126DDD8 | 8 | input_buffer_base | Input buffer base address. |
| qword_126DDD0 | 8 | input_buffer_end | Input buffer end address. |
| dword_126DDA8 | 4 | line_counter | Current line number in input. |
| dword_126DDBC | 4 | source_line_number | Source line number (for #line directive tracking). |
| qword_126DD80 | 8 | active_macro_chain | Active macro expansion chain head. |
| qword_126DD60 | 8 | macro_expansion_marker | Macro expansion position marker. |
| dword_126DD30 | 4 | in_directive_flag | Currently processing preprocessor directive. |
| qword_126DD18 | 8 | current_macro_node | Current macro being expanded. |
| qword_126DD70 | 8 | macro_tracking_1 | Macro position tracking state. |
| qword_126DDE0 | 8 | macro_tracking_2 | Secondary macro tracking state. |
| qword_126DDF0 | 8 | file_stack | Include file stack (for #include nesting). |
| dword_126DDE8 | 4 | preproc_state_1 | Preprocessor state variable. |
| dword_126E49C | 4 | preproc_state_2 | Preprocessor state variable. |
| qword_126DB40 | 8 | lexical_state_stack | Lexical state save/restore stack (linked list of 80-byte nodes). |
| qword_126DB48 | 8 | stop_token_table | Stop token table: 357 entries at offset +8, indexed by token kind. |
| qword_126DD98 | 8 | raw_string_state | Raw string literal tracking state. |
| dword_126EF00 | 4 | raw_string_flag | Raw string literal processing flag. |
| qword_126DDD8 | 8 | raw_string_base | Raw string buffer base. |
| qword_126DDD0 | 8 | raw_string_end | Raw string buffer end. |

Preprocessor and Macro System

| Address | Size | Name | Description |
| --- | --- | --- | --- |
| qword_1270140 | 8 | macro_definition_chain | Macro definition chain head. |
| qword_1270148 | 8 | free_token_list | Free list for recycled token nodes. |
| qword_1270150 | 8 | cached_token_list | Cached token list head (for rescan). |
| qword_1270128 | 8 | reusable_cache_stack | Reusable macro cache stack. |
| qword_106B8A0 | 8 | pending_macro_arg | Pending macro argument pointer. |
| dword_106B718 | 4 | suppress_pragma_mode | Suppress pragma processing mode. |
| dword_106B720 | 4 | preprocessing_mode | Preprocessor-only mode active. |
| dword_106B6EC | 4 | line_numbering_state | Line numbering state for #line output. |
| qword_106B740 | 8 | pragma_binding_table | Pragma binding table (0x158 bytes initial). |
| qword_106B730 | 8 | pragma_alloc_pool_1 | Pragma allocation pool. |
| qword_106B738 | 8 | pragma_alloc_pool_2 | Pragma allocation pool (secondary). |
| qword_106B890 | 8 | pragma_name_hash_1 | Pragma name hash table. |
| qword_106B8A8 | 8 | pragma_name_hash_2 | Pragma name hash table (secondary). |
| off_E6CDE0 | -- | pragma_id_table | Pragma ID-to-name mapping table. |
| byte_126E558 | 1 | stdc_cx_limited_range | #pragma STDC CX_LIMITED_RANGE state. Default 3. |
| byte_126E559 | 1 | stdc_fenv_access | #pragma STDC FENV_ACCESS state. Default 3. |
| byte_126E55A | 1 | stdc_fp_contract | #pragma STDC FP_CONTRACT state. Default 3. |
| dword_126EE48 | 4 | macro_expansion_tracking | Macro expansion tracking / secondary IL enabled flag. Set to 1 during init-complete. Also controls shareable-constants feature. |

Translation Unit State

These globals track the current translation unit, TU list, and per-TU save/restore mechanism.

| Address | Size | Name | Description |
| --- | --- | --- | --- |
| qword_106BA10 | 8 | current_tu | Pointer to current translation unit descriptor (424 bytes). |
| qword_106B9F0 | 8 | primary_tu | Pointer to first (primary) translation unit. |
| qword_12C7A90 | 8 | tu_chain_tail | Tail of translation unit linked list. |
| qword_106BA18 | 8 | tu_stack | Translation unit stack (for nested TU processing). |
| dword_106B9E8 | 4 | tu_stack_depth | TU stack depth (excluding primary). |
| dword_106BA08 | 4 | is_recompilation | Recompilation / secondary-TU flag. When 0 = primary TU, when 1 = secondary. Affects IL entity flag bits. |
| qword_106BA00 | 8 | current_filename | Current filename string pointer. |
| dword_106B9F8 | 4 | has_module_info | TU has module information. |
| qword_12C7A98 | 8 | per_tu_storage_size | Total per-TU variable buffer size. |
| qword_12C7AA8 | 8 | registered_var_list_head | Registered per-TU variable list head. |
| qword_12C7AA0 | 8 | registered_var_list_tail | Registered per-TU variable list tail. |
| qword_12C7AB8 | 8 | stack_entry_free_list | TU stack entry free list. |
| qword_12C7AB0 | 8 | corresp_free_list | TU correspondence structure free list. |
| dword_12C7A8C | 4 | registration_complete | Variable registration complete flag. |
| dword_12C7A88 | 4 | has_seen_module_tu | Has seen a module TU. |
| qword_12C7A70 | 8 | corresp_count | TU correspondence allocation counter. |
| qword_12C7A78 | 8 | tu_count | Translation unit allocation counter. |
| qword_12C7A80 | 8 | stack_entry_count | Stack entry allocation counter. |
| qword_12C7A68 | 8 | registration_count | Variable registration allocation counter. |

IL (Intermediate Language) State

The IL subsystem uses arena-allocated regions for entities. Two primary regions exist: file-scope and function-scope.

| Address | Size | Name | Description |
| --- | --- | --- | --- |
| dword_126EC90 | 4 | file_scope_region_id | File-scope IL region ID. Persistent for the entire TU. |
| dword_126EB40 | 4 | current_region_id | Current allocation region ID (file-scope or function-scope). |
| dword_126EC80 | 4 | max_region_id | Maximum allocated region ID. |
| qword_126EB60 | 16 | il_header | IL header (SSE-width, used for expression copy). |
| qword_126EB70 | 8 | main_routine | Main routine entity (main() function). Sign-bit used as elimination marker. |
| qword_126EB78 | 8 | compiler_version_string | Compiler version string pointer. |
| qword_126EB80 | 8 | compilation_timestamp | Compilation timestamp string. |
| byte_126EB88 | 1 | plain_chars_signed | Plain chars are signed flag (IL header field). |
| qword_126EB90 | 8 | routine_scope_array | Array indexed by routine number. Also per-region metadata. |
| qword_126EB98 | 8 | function_def_table | Function definition table (16 bytes per entry, indexed 1..dword_126EC78). |
| qword_126EBA0 | 8 | orphaned_scope_list | Orphaned scope list head (for dead code elimination). |
| dword_126EBA8 | 4 | source_language | Source language (0 = C++, 1 = C). |
| dword_126EBAC | 4 | std_version_il | Standard version for IL header. |
| byte_126EBB0 | 1 | pcc_compatibility_mode | PCC compatibility mode. |
| byte_126EBB1 | 1 | enum_type_is_integral | Enum underlying type is integral. |
| dword_126EBB4 | 4 | max_member_alignment | Default maximum member alignment. |
| byte_126EBB8 | 1 | il_gcc_mode | IL GCC mode. |
| byte_126EBB9 | 1 | il_gpp_mode | IL G++ mode. |
| byte_126EBD5 | 1 | any_templates_seen | Any templates encountered. |
| byte_126EBD6 | 1 | proto_instantiations_in_il | Prototype instantiations present in IL. |
| byte_126EBD7 | 1 | il_all_proto_instantiations | IL has all prototype instantiations. |
| byte_126EBD8 | 1 | il_c_semantics | IL has C semantics. |
| qword_126EBE0 | 8 | deferred_instantiation_list | Deferred/external declaration list head. |
| qword_126EBE8 | 8 | seq_number_entries | Sequence number lookup entries (for IL index build). |
| dword_126EBF8 | 4 | target_config_index | Target configuration index. |
| dword_126EC78 | 4 | routine_counter | Current routine / entity counter. |
| dword_126EC7C | 4 | entity_buffer_capacity | Entity buffer capacity (grows by 2048). |
| qword_126EC88 | 8 | region_block_chains | Array of block chains indexed by region ID. |
| qword_126EC50 | 8 | region_size_tracking | Array of region size tracking. |
| qword_126EC58 | 8 | large_alloc_array | Large-allocation (mmap) array. |
| dword_126E5FC | 4 | file_scope_constant_flag | Source-file-info flags (bit 0 = constant region flag). |
| byte_126E5F8 | 1 | il_language_byte | Language standard byte for routine-type init. |
| qword_126EFB8 | 8 | null_source_position | Default/null source position struct. |
| qword_126F700 | 8 | current_source_file_ref | Current source file reference for IL entities. |

IL Entity Kind Lists

The IL maintains per-kind linked lists for file-scope entities (kinds 1 through 72+).

| Address | Size | Name | Description |
| --- | --- | --- | --- |
| qword_126E610 | 8 | kind_1_list | Source file entries (kind 1). |
| qword_126E620 | 8 | kind_2_list | Constant entries (kind 2). |
| qword_126E630 | 8 | kind_3_list | Parameter entries (kind 3). |
| ... | ... | | Continues through all 72+ entry kinds. |
| qword_126EA80 | 8 | kind_72_list | Last numbered kind list (kind 72). |
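The four listed addresses are spaced 0x10 apart per kind, so the list head for any numbered kind can be derived rather than looked up. This is a sketch that assumes the elided rows follow the same stride as the listed ones; `kind_list_address` is a hypothetical helper, not a function in the binary:

```python
KIND_1_LIST = 0x126E610   # qword_126E610, head of the kind-1 list
KIND_STRIDE = 0x10        # spacing inferred from the listed addresses
LAST_KIND = 72            # last numbered kind in the table


def kind_list_address(kind):
    """Address of the list-head global for a numbered IL entry kind."""
    if not 1 <= kind <= LAST_KIND:
        raise ValueError(f"kind {kind} is outside 1..{LAST_KIND}")
    return KIND_1_LIST + (kind - 1) * KIND_STRIDE


for kind in (1, 2, 3, 72):
    print(kind, f"qword_{kind_list_address(kind):X}")
```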

IL Allocation Counters

Each IL entity type has a dedicated allocation counter used for memory statistics reporting.

| Address | Size | Name | Description |
| --- | --- | --- | --- |
| qword_126F680 | 8 | local_constant_count | Local constant allocation count. Asserted zero at region boundaries. |
| qword_126F748 | 8 | orphan_ptr_count | Orphan pointer allocation count. |
| qword_126F750 | 8 | entity_prefix_count | Entity prefix allocation count. |
| qword_126F790 | 8 | source_corresp_count | Source correspondence allocation count. |
| qword_126F7C0 | 8 | gen_alloc_header_count | Gen-alloc header count (TU copy addresses). |
| qword_126F7D0 | 8 | string_bytes_count | String literal bytes counter. |
| qword_126F7D8 | 8 | il_entry_prefix_count | IL entry prefix allocation count. |
| qword_126F8A0 | 8 | exception_spec_count | Exception specification entry count (16 bytes). |
| qword_126F898 | 8 | exception_spec_type_count | Exception spec type count (24 bytes). |
| qword_126F890 | 8 | asm_entry_count | ASM entry count (152 bytes). |
| qword_126F8A8 | 8 | routine_count | Routine entry count (288 bytes). |
| qword_126F8B0 | 8 | field_count | Field entry count (176 bytes). |
| qword_126F8B8 | 8 | var_template_count | Variable template entry count (24 bytes). |
| qword_126F8C0 | 8 | variable_count | Variable entry count (232 bytes). |
| qword_126F8C8 | 8 | vla_dim_count | VLA dimension entry count (48 bytes). |
| qword_126F8D0 | 8 | local_static_init_count | Local static init count (40 bytes). |
| qword_126F8D8 | 8 | dynamic_init_count | Dynamic init entry count (104 bytes). |
| qword_126F8E0 | 8 | type_count | Type entry count (176 bytes). |
| qword_126F8E8 | 8 | enum_supplement_count | Enum type supplement count. |
| qword_126F8F0 | 8 | typeref_supplement_count | Typeref type supplement count (56 bytes). |
| qword_126F8F8 | 8 | misc_supplement_count | Misc type supplement count. |
| qword_126F900 | 8 | template_arg_count | Template argument count (64 bytes). |
| qword_126F908 | 8 | base_class_count | Base class count (112 bytes). |
| qword_126F910 | 8 | base_class_deriv_count | Base class derivation count (32 bytes). |
| qword_126F918 | 8 | derivation_step_count | Derivation step count (24 bytes). |
| qword_126F920 | 8 | overriding_count | Overriding entry count (40 bytes). |
| qword_126F928 | 8 | constant_list_count | Constant list entry count (16 bytes). |
| qword_126F930 | 8 | variable_list_count | Variable list entry count (16 bytes). |
| qword_126F938 | 8 | routine_list_count | Routine list entry count (16 bytes). |
| qword_126F940 | 8 | class_list_count | Class list entry count (16 bytes). |
| qword_126F948 | 8 | class_supplement_count | Class type supplement count. |
| qword_126F950 | 8 | based_type_member_count | Based type list member count (24 bytes). |
| qword_126F958 | 8 | routine_supplement_count | Routine type supplement count (64 bytes). |
| qword_126F960 | 8 | param_type_count | Parameter type entry count (80 bytes). |
| qword_126F968 | 8 | constant_alloc_count | Constant allocation count (184 bytes). |
| qword_126F970 | 8 | source_file_count | Source file entry count. |

IL Free Lists

Arena allocators recycle nodes through per-type free lists.

| Address | Size | Name | Description |
|---|---|---|---|
| qword_126E4B8 | 8 | constant_free_list | Constants (linked via offset +104). |
| qword_126E4B0 | 8 | expr_node_free_list | Expression nodes (linked via offset +64). |
| qword_126F678 | 8 | param_type_free_list | Parameter type entries (linked via offset +0). |
| qword_126F670 | 8 | template_arg_free_list | Template argument entries (linked via offset +0). |
| qword_126F668 | 8 | constant_list_free_list | Constant list entries (linked via offset +0). |
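
"Linked via offset +N" means each recycled node stores the pointer to the next free node inside its own body, N bytes from its start. The following is a minimal sketch of that intrusive free-list technique, with memory modeled as a `bytearray`. The `Arena` and `FreeList` classes, the little-endian 8-byte links, and the choice of 184 bytes as the constant node size (borrowed from the constant allocation counter) are illustrative assumptions, not recovered code.

```python
import struct

NULL = 0  # address 0 stands for an empty list, as a null pointer would


class Arena:
    """Toy flat memory with bump allocation (hypothetical, for illustration)."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.top = 8  # keep address 0 unused so it can act as NULL

    def bump(self, nbytes):
        addr = self.top
        self.top += nbytes
        assert self.top <= len(self.mem), "arena exhausted"
        return addr


class FreeList:
    """Intrusive free list: the 'next' pointer lives at link_offset in each node."""

    def __init__(self, arena, node_size, link_offset):
        self.arena = arena
        self.node_size = node_size
        self.link_offset = link_offset
        self.head = NULL  # plays the role of a global such as constant_free_list

    def release(self, addr):
        # Store the old head inside the freed node, then make the node the head.
        struct.pack_into("<Q", self.arena.mem, addr + self.link_offset, self.head)
        self.head = addr

    def allocate(self):
        if self.head == NULL:
            return self.arena.bump(self.node_size)  # nothing to recycle
        addr = self.head
        (self.head,) = struct.unpack_from("<Q", self.arena.mem, addr + self.link_offset)
        return addr


arena = Arena(4096)
constants = FreeList(arena, node_size=184, link_offset=104)
a = constants.allocate()
b = constants.allocate()
constants.release(a)
reused = constants.allocate()  # hands back the node that was just released
```

Because the link lives inside the dead node, the free list needs no storage of its own beyond the 8-byte head global.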

IL Pools and Region Allocator

| Address | Size | Name | Description |
|---|---|---|---|
| qword_126F600 | 104 | type_node_pool_1 | Type node pool (104-byte entries). |
| qword_126F580 | 104 | type_node_pool_2 | Secondary type node pool. |
| qword_126F500 | 104 | conditional_pool_1 | Conditional pool (guarded by dword_106BF68 or dword_106BF58 being nonzero). |
| qword_126F480 | 104 | conditional_pool_2 | Conditional pool (secondary). |
| qword_126F400 | 112 | expr_pool_1 | Expression/statement node pool (112 bytes). |
| qword_126F380 | 112 | expr_pool_2 | Expression pool (secondary). |
| qword_126F300 | 112 | expr_pool_3 | Expression pool (tertiary). |
| unk_126E600 | 1344 | scope_pool | Scope table pool (1344 bytes, 384 initial count). |
| qword_126E580 | 96 | common_header_pool | Common IL header pool (96 bytes). |
| dword_126F690 | 4 | region_prefix_offset | Region allocation prefix offset (0 or 8). |
| dword_126F694 | 4 | region_prefix_size | Region allocation prefix size (16 or 24). |
| dword_126F688 | 4 | alt_prefix_offset | Alternate region prefix offset. |
| dword_126F68C | 4 | alt_prefix_size | Alternate region prefix size (8). |

Constant Sharing Hash Table

| Address | Size | Name | Description |
|---|---|---|---|
| qword_126F128 | 8 | constant_hash_table | Hash table for constant sharing/dedup. |
| qword_126F130 | 8 | next_constant_index | Next constant index (monotonically increasing). |
| qword_126F228 | 8 | shareable_constant_hash | Shareable constant hash table (2039 buckets). |
| qword_126F200 | 8 | hash_comparisons | Hash comparison count (statistics). |
| qword_126F208 | 8 | hash_searches | Hash search count. |
| qword_126F210 | 8 | hash_new_buckets | New hash bucket count. |
| qword_126F218 | 8 | hash_region_hits | Region hit count. |
| qword_126F220 | 8 | hash_global_hits | Global hit count. |
| qword_126F280 | 8 | member_ptr_type_count | Member-pointer / qualified type allocation counter. |
| qword_126F2F8 | 3240 | char_string_type_cache | Character string type cache (405 entries = 3240/8). Indexed by 648*char_kind + 8*length. |
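
The index formula for char_string_type_cache implies a two-dimensional layout: 648 bytes per character kind at 8 bytes per slot gives 81 length slots per kind, and 3240 / 648 gives 5 kinds. The sketch below works through that arithmetic. The zero-based ranges for `char_kind` and `length` are assumptions derived from the sizes alone, and the function names are hypothetical.

```python
SLOT_SIZE = 8        # one cached type pointer per slot
KIND_STRIDE = 648    # bytes reserved per character kind
CACHE_BYTES = 3240   # total size of the cache

LENGTHS_PER_KIND = KIND_STRIDE // SLOT_SIZE  # 81 length slots per kind
NUM_KINDS = CACHE_BYTES // KIND_STRIDE       # 5 character kinds


def cache_byte_offset(char_kind, length):
    """Byte offset of the slot for (char_kind, length), per the documented formula."""
    if not (0 <= char_kind < NUM_KINDS and 0 <= length < LENGTHS_PER_KIND):
        raise IndexError("outside the cached range")  # assumed bounds
    return KIND_STRIDE * char_kind + SLOT_SIZE * length


def cache_slot_index(char_kind, length):
    """Same position expressed as an index into an array of 8-byte slots."""
    return cache_byte_offset(char_kind, length) // SLOT_SIZE
```

Under these assumptions the last slot, `(4, 80)`, is index 404, which matches the stated 405 entries.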

Cached Type Nodes

AddressSizeNameDescription
qword_126F2F08cached_void_typeLazy-init cached void type node.
qword_126F2E08cached_size_t_typeLazy-init cached size_t type (for array memcpy).
qword_126F2D08cached_wchar_typeCached wchar_t type.
qword_126F2C88cached_char16_typeCached char16_t type.
qword_126F2C08cached_char32_typeCached char32_t type.
qword_126F2B88cached_char8_typeCached char8_t type (C++20).
qword_126F6108cached_char16_variantCached char16_t variant type.
qword_106B6608cached_void_fn_typeCached void function type (C++ mode).
qword_126E5E08global_char_typeGlobal char type. Used with qualifier 1 = const for const char*.

Template Instantiation

AddressSizeNameDescription
qword_12C77408pending_instantiation_listPending function/variable instantiation worklist head.
qword_12C77588pending_class_listPending class instantiation list.
qword_12C76E08instantiation_depthCurrent instantiation depth counter (max 0xFF = 255).
qword_106BD108max_instantiation_depthMaximum template instantiation depth limit. Default 200.
qword_106BD088max_constexpr_costMaximum constexpr evaluation cost. Default 256.
dword_12C77304instantiation_mode_activeInstantiation mode active flag.
dword_12C771C4new_instantiations_neededFixpoint flag: new instantiations generated in current pass.
dword_12C77184additional_pass_neededAdditional instantiation pass needed flag.
dword_106C0944compilation_modeCompilation mode: 0 = none, 1 = normal, 2 = used-only, 3 = precompile.
dword_106C09C4extended_language_modeExtended language mode.
qword_12C7B488template_arg_cacheTemplate argument cache.
qword_12C7B408template_arg_cache_2Template argument cache (secondary).
qword_12C7B508template_arg_cache_3Template argument cache (tertiary).
qword_12C7800112[3]template_hash_tablesThree template hash tables (0x70 bytes each = 14 slots).
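
The worklist head, the fixpoint flag, and the depth limit suggest a pass-based loop that repeats until a pass generates no new instantiations. The sketch below illustrates that general technique only. The `(entity, depth)` worklist representation, the `instantiate` callback, and the exact point at which the limit is checked are assumptions, not the recovered control flow.

```python
MAX_INSTANTIATION_DEPTH = 200  # documented default of max_instantiation_depth


def run_instantiation_passes(pending, instantiate):
    """Repeat passes over the worklist until a pass creates no new work.

    pending     -- list of (entity, depth) pairs (hypothetical representation)
    instantiate -- callback returning the entities that `entity` newly requires
    """
    done = []
    passes = 0
    new_instantiations_needed = True
    while new_instantiations_needed:
        new_instantiations_needed = False
        passes += 1
        current, pending = pending, []
        for entity, depth in current:
            if depth > MAX_INSTANTIATION_DEPTH:
                raise RecursionError(f"instantiation depth limit exceeded at {entity}")
            done.append(entity)
            for dependency in instantiate(entity):
                pending.append((dependency, depth + 1))
                new_instantiations_needed = True  # schedule another pass
    return done, passes
```

A chain such as `f<int>` requiring `g<int>` requiring `h<int>` takes three passes here, one per newly discovered entity. A real implementation would also deduplicate entities, which this sketch omits.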

Lambda Transform State

NVIDIA's extended lambda system uses bitmaps and linked lists to track device and host-device lambda closures.

| Address | Size | Name | Description |
|---|---|---|---|
| unk_1286980 | 128 | device_lambda_bitmap | Device lambda capture count bitmap (1024 bits). One bit per closure class index. |
| unk_1286900 | 128 | host_device_lambda_bitmap | Host-device lambda capture count bitmap (1024 bits). |
| qword_12868F0 | 8 | entity_closure_map | Entity-to-closure mapping hash table (via sub_742670). |
| qword_1286A00 | 8 | cached_anon_namespace_name | Cached anonymous namespace name (_GLOBAL__N_<filename>). |
| qword_1286760 | 8 | cached_static_prefix | Cached static prefix string for mangled names. |
| byte_1286A20 | 256K | name_format_buffer | 256KB buffer for name formatting. |
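
Each 128-byte bitmap holds 1024 bits, one per index. A minimal sketch of setting and testing a bit in such a bitmap follows. The least-significant-bit-first order within each byte and the helper names are assumptions, since the table does not state the bit order.

```python
BITMAP_BYTES = 128              # size of each lambda bitmap
BITMAP_BITS = BITMAP_BYTES * 8  # 1024 indices


def set_bit(bitmap, index):
    """Mark `index` in the bitmap (LSB-first within each byte, by assumption)."""
    if not 0 <= index < BITMAP_BITS:
        raise IndexError("index outside the 1024-bit bitmap")
    bitmap[index >> 3] |= 1 << (index & 7)


def bit_is_set(bitmap, index):
    """Return True if `index` has been marked."""
    if not 0 <= index < BITMAP_BITS:
        raise IndexError("index outside the 1024-bit bitmap")
    return bool(bitmap[index >> 3] & (1 << (index & 7)))


device_lambda_bitmap = bytearray(BITMAP_BYTES)
set_bit(device_lambda_bitmap, 0)
set_bit(device_lambda_bitmap, 1023)
```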

Lambda Registration Lists

Six linked lists track device/constant/kernel entities with internal/external linkage for .int.c registration emission.

AddressSizeNameDescription
unk_1286780--device_external_listDevice entities with external linkage.
unk_12867C0--device_internal_listDevice entities with internal linkage.
unk_1286800--constant_external_listConstant entities with external linkage.
unk_1286840--constant_internal_listConstant entities with internal linkage.
unk_1286880--kernel_external_listKernel entities with external linkage.
unk_12868C0--kernel_internal_listKernel entities with internal linkage.

IL Tree Walking

The walk_tree subsystem uses global callback pointers for its 5-callback traversal model.

AddressSizeNameDescription
qword_126FB888entry_callbackCalled for each IL entry during walk.
qword_126FB808string_callbackCalled for each string encountered.
qword_126FB788pre_walk_checkPre-walk filter: if returns nonzero, skip subtree.
qword_126FB708entry_replaceEntry replacement callback.
qword_126FB688entry_filterLinked-list entry filter callback.
dword_126FB5C4is_file_scope_walk1 = walking file-scope IL.
dword_126FB584is_secondary_il1 = current scope is in secondary IL region.
dword_126FB604walk_mode_flagsWalk mode flags (template stripping, etc.).
dword_106B6444current_il_regionCurrent IL region (0 or 1; toggles bit 2 of entry flags).
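
A callback-driven walk of this kind can be sketched as follows. Only three of the five callbacks are modeled (entry, string, and the pre-walk filter); the `ILNode` class and the order in which the callbacks fire are assumptions made for illustration.

```python
class ILNode:
    """Hypothetical IL entry: a kind tag, attached strings, and child entries."""

    def __init__(self, kind, strings=(), children=()):
        self.kind = kind
        self.strings = list(strings)
        self.children = list(children)


def walk_tree(node, entry_callback, string_callback, pre_walk_check):
    # A nonzero (truthy) result from the pre-walk filter prunes the subtree.
    if pre_walk_check(node):
        return
    entry_callback(node)
    for text in node.strings:
        string_callback(text)
    for child in node.children:
        walk_tree(child, entry_callback, string_callback, pre_walk_check)


tree = ILNode("scope", children=[
    ILNode("routine", strings=["main"]),
    ILNode("template", children=[ILNode("routine", strings=["hidden"])]),
    ILNode("variable", strings=["counter"]),
])
visited, strings = [], []
walk_tree(tree,
          lambda n: visited.append(n.kind),
          strings.append,
          lambda n: n.kind == "template")  # skip template subtrees
```

Keeping the callbacks in globals rather than passing them as arguments, as the binary does, avoids threading five extra parameters through every recursive call, but it also makes the walk non-reentrant.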

IL Walk Visited-Set

AddressSizeNameDescription
dword_126FB304visited_countCount of visited entries in current walk.
qword_126FB408visited_setVisited-entry set pointer.
dword_126FB484hash_table_countHash table entry count for visited set.
qword_126FB508hash_table_arrayHash table array for visited set.

IL Display

AddressSizeNameDescription
qword_126F9808display_output_contextIL-to-string output callback/context.
dword_126FA304is_file_scope_display1 = displaying file-scope region.
byte_126FA161display_activeIL display currently active flag.
byte_126FA111pcc_mode_shadowPCC compatibility mode shadow for display.
qword_126FA40--display_string_bufferDisplay string buffer (raw literal prefix, etc.).

Constexpr Evaluator

AddressSizeNameDescription
qword_126FDE08eval_node_free_listEvaluation node free list (0x10000-byte arena blocks).
qword_126FDE88eval_nesting_depthEvaluation nesting depth counter.
qword_126FE008[11]hash_bucket_free_listsHash bucket free lists by popcount size class (11 buckets).
qword_126FE608[11]value_node_free_listsValue node free lists by popcount size class (11 buckets).
qword_126FBC08variant_path_free_listVariant path node free list.
qword_126FBB88variant_path_countVariant path allocation count.
qword_126FBC88variant_path_limitVariant path limit.
qword_126FBD08variant_path_tableVariant path table pointer.
qword_126FEC08constexpr_class_hash_tableClass type hash table base for constexpr.
qword_126FEC88constexpr_class_hash_infoLow 32 = capacity mask, high 32 = entry count.
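
constexpr_class_hash_info packs two 32-bit fields into one 64-bit word. The sketch below shows how such a word can be split and how a capacity mask is typically applied. Treating the mask as `capacity - 1` for a power-of-two table is an assumption, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def pack_hash_info(capacity_mask, entry_count):
    """Combine the two 32-bit fields: low half = mask, high half = count."""
    return ((entry_count & MASK32) << 32) | (capacity_mask & MASK32)


def unpack_hash_info(info):
    """Split the 64-bit word back into (capacity_mask, entry_count)."""
    return info & MASK32, (info >> 32) & MASK32


def bucket_index(hash_value, info):
    """Reduce a hash to a bucket index using the stored capacity mask."""
    capacity_mask, _ = unpack_hash_info(info)
    return hash_value & capacity_mask


info = pack_hash_info(capacity_mask=0x3FF, entry_count=17)
```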

Backend Code Generation (cp_gen_be.c)

AddressSizeNameDescription
dword_10658344indent_levelCurrent indentation depth in output.
dword_10658204output_line_numberOutput line counter.
dword_106581C4output_columnOutput column counter (chars since last newline).
dword_10658304output_column_altAlternate column counter.
dword_10658184needs_line_directiveNeeds #line directive flag.
qword_10658108output_source_positionCurrent source position for #line directives.
qword_10657488source_sequence_ptrCurrent source sequence entry pointer.
qword_10657408source_sequence_altSecondary source sequence pointer (nested scope iteration).
byte_10656F01current_linkage_specCurrent linkage spec: 2 = extern "C", 3 = extern "C++".
qword_10657088output_scope_stackOutput scope stack pointer (linked list).
qword_10658708debug_trace_listDebug trace request linked list.

Expression Parsing State

AddressSizeNameDescription
qword_106B9708expr_stack_topCurrent expression stack top pointer. Primary context object for expression parsing. Checked at offset +17 (flags), +18, +19 (bit flags), +48, +120.
qword_106B9688expr_stack_prevPrevious expression stack entry (push/pop).
qword_106B5808saved_expr_contextSaved expression context (for nested evaluation).
qword_106B5108rewrite_loop_counterRewrite loop counter (limited to 100 to prevent infinite loops).
dword_126EF084requires_expr_enabledRequires-expression enabled (C++20).

Overload Resolution

AddressSizeNameDescription
qword_E7FE988override_pending_listVirtual function override pending list head (40-byte entries).
qword_E7FEA08override_free_listOverride entry free list.
qword_E7FE888covariant_free_listCovariant override free list.
qword_E7FEC88lambda_hash_tableLambda closure class hash table pointer.
qword_E7FED08template_member_hashTemplate member hash table pointer.
dword_E7FE484rbtree_sentinelRed-black tree sentinel node (for lambda numbering).
qword_E7FE588rbtree_left_sentinelRed-black tree left sentinel (= &dword_E7FE48).
qword_E7FE608rbtree_right_sentinelRed-black tree right sentinel (= &dword_E7FE48).
qword_E7FE688rbtree_sizeRed-black tree entry count.

Attribute System

AddressSizeNameDescription
off_D4682032/entryattribute_descriptor_tableAttribute descriptor table. ~160 entries, stride 32 bytes. Runs to unk_D47A60.
qword_E7FB608attribute_hash_tableAttribute name hash table (Robin Hood lookup via sub_742670).
qword_E7F0388attribute_hash_table_2Secondary attribute hash table.
byte_E7FB80204scoped_attr_bufferBuffer for scoped attribute name formatting ("namespace::name").
byte_82C0E0--attribute_kind_tableAttribute kind descriptor table (indexed by attribute kind).
dword_E7F0784attr_init_flagAttribute subsystem initialization flag.
dword_E7F0804attr_flagsAttribute system flags.
qword_E7F0708visibility_stackVisibility stack linked list.
qword_E7F0688visibility_stateCurrent visibility state.
qword_E7F0488alias_ifunc_free_listFree list for alias/ifunc entries.
qword_E7F0588alias_list_headAlias entry linked list head.
qword_E7F0508alias_list_nextAlias entry linked list next.
dword_106BF184extended_attr_configExtended attribute configuration flag. Gates additional initialization.

Control Flow Tracking

AddressSizeNameDescription
qword_12C71108cf_descriptor_free_listControl flow descriptor free list.
qword_12C71188cf_active_list_tailActive control flow list tail.
qword_12C71208cf_active_list_headActive control flow list head.

Cross-Reference System

AddressSizeNameDescription
qword_106C2588xref_output_fileCross-reference output file handle. When nonzero, enables xref emission.
qword_12C71608xref_callbackCross-reference callback (sub_726F10).
dword_12C71484xref_enabledCross-reference generation enabled.
byte_12C71FA1xref_flag_aCross-reference flag A.
byte_12C71FE1xref_flag_bCross-reference flag B. Default 1.

Object Lifetime Stack

AddressSizeNameDescription
qword_126E4C08curr_object_lifetimeTop of object lifetime stack. Used for destructor ordering and scope cleanup.

Timing and Debug

AddressSizeNameDescription
dword_106C0A44timing_enabledTiming/profiling enabled flag.
dword_126EFC84debug_traceDebug tracing active. When set, calls sub_48AE00/sub_48AFD0 trace hooks.
dword_126EFCC4debug_verbosityDebug verbosity level. >2 = detailed, >3 = very detailed, >4 = IL walk trace.
byte_106B5C0128compilation_timestampCompilation timestamp string (from ctime()).

Memory Allocator (Arena/Pool System)

AddressSizeNameDescription
qword_12807308block_free_listRecycled 0x10000-byte block free list.
qword_12807188total_memory_allocatedTotal memory allocated (watermark).
qword_12807108peak_memory_allocatedPeak memory allocated.
qword_12807088tracked_alloc_totalTracked allocation total.
qword_12807208free_fe_hash_tableHash table for free_fe tracked allocations.
qword_12807488alloc_tracking_listLinked list of allocation tracking records.
dword_12807284mmap_modeAllocation mode flag. 0 = malloc-based, 1 = mmap-based. Set from dword_106BF18.
dword_12807504tracking_record_countTracking record count (inline up to 1023, then heap).
unk_1280760--tracking_record_arrayInline tracking record array.

IL Copy Remap

AddressSizeNameDescription
qword_126F1E08copy_remap_free_listCopy remap entry free list (24 bytes each).
qword_126F1D88copy_remap_countCopy remap entry count.
qword_126F1D08copy_recursion_depthCopy recursion depth counter.
qword_126F1F88copy_remap_stat_countCopy remap statistics count.
qword_126F1408selected_entitySelected entity for copy/comparison.
byte_126F1381selected_entity_kindKind of selected entity (7 or 11).

IL Deferred Reordering Batch

AddressSizeNameDescription
qword_126F1708reorder_batchBatch reordering array (24-byte records: entity, placeholder, source_sequence).
qword_126F1588reorder_ptr_arrayPointer array for batch reordering.
qword_126F1508reorder_batch_limitBatch size limit (100 entries).
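
The 24-byte record layout and the 100-entry limit can be illustrated with a small batching sketch. The three fields are modeled as 8-byte little-endian integers; the `ReorderBatch` class, the `process` callback, and the rule that the batch is processed as soon as it fills are assumptions.

```python
import struct

# Three 8-byte fields: entity, placeholder, source_sequence (24 bytes total).
RECORD = struct.Struct("<QQQ")
BATCH_LIMIT = 100


class ReorderBatch:
    def __init__(self, process):
        self.process = process  # hypothetical consumer of a completed batch
        self.buffer = bytearray()
        self.count = 0

    def add(self, entity, placeholder, source_sequence):
        self.buffer += RECORD.pack(entity, placeholder, source_sequence)
        self.count += 1
        if self.count == BATCH_LIMIT:
            self.flush()

    def flush(self):
        records = [RECORD.unpack_from(self.buffer, i * RECORD.size)
                   for i in range(self.count)]
        if records:
            self.process(records)
        self.buffer.clear()
        self.count = 0
```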

CLI Processing State

AddressSizeNameDescription
dword_E800584flag_countCurrent registered CLI flag count (panics at 552 via sub_40351D).
dword_E7FF204argv_indexCurrent argv parsing index (starts at 1).
byte_E7FF40272flag_was_set_bitmap272-byte bitmap: which CLI flags were explicitly set.
dword_E7FF144language_already_setGuard against switching language mode after initial set.
dword_E7FF104cuda_compat_flagCUDA compatibility flag (set based on dword_126EFAC && qword_126EF98 <= 0x76BF).
off_D47CE0--set_flag_lookup_tableLookup table for --set_flag CLI option (name-to-address mapping).

EDG Feature Flags (0x106Bxxx-0x106Cxxx Region)

These flags control individual C/C++ language features. Set during CLI processing and standard-version initialization.

AddressSizeNameDescription
dword_106C2104exceptions_enabledException handling enabled. Default 1.
dword_106C1804rtti_enabledRTTI enabled. Default 1.
dword_106C1644templates_enabledTemplates enabled.
dword_106C1B84template_arg_contextTemplate argument context flag.
dword_106C1944namespaces_enabledNamespaces enabled. Default 1.
dword_106C19C4arg_dep_lookupArgument-dependent lookup. Default 1.
dword_106C1784bool_keywordbool keyword enabled. Default 1.
dword_106C1884wchar_t_keywordwchar_t keyword enabled. Default 1.
dword_106C18C4alternative_tokensAlternative tokens enabled. Default 1.
dword_106C1A04class_name_injectionClass name injection. Default 1.
dword_106C1A44const_string_literalsConst string literals. Default 1.
dword_106C1344parse_templatesParse templates. Default 1.
dword_106C1384dep_nameDependent name processing. Default 1.
dword_106C12C4friend_injectionFriend injection. Default 1.
dword_106C1284adl_relatedADL related feature. Default 1.
dword_106C1244module_visibilityModule-level visibility. Default 1.
dword_106C1404compound_literalsCompound literals. Default 1.
dword_106C13C4base_assign_defaultBase assign op is default. Default 1.
dword_106C10C4deferred_instantiationDeferred instantiation flag.
dword_106C0E44exceptions_featureExceptions feature flag (version-dependent).
dword_106C0644modify_stack_limitModify stack limit. Default 1.
dword_106C0684fe_inliningFrontend inlining enabled.
dword_106C0A04feature_A0Miscellaneous feature flag. Default 1.
dword_106C0984feature_98Miscellaneous feature flag. Default 1.
dword_106C0FC4feature_FCMiscellaneous feature flag. Default 1.
dword_106C1544feature_154Miscellaneous feature flag. Default 1.
dword_106C2084constexpr_if_discardConstexpr-if discarded-statement handling.
dword_106C1F04cpp_mode_featureC++ mode feature flag.
dword_106C2A44feature_2A4Default 1.
dword_106C2144feature_214Default 1.
dword_106C2BC4modules_enabledC++20 modules enabled.
dword_106C2B84module_partitionsModule partitions enabled.
dword_106BDB84restrict_enabledrestrict keyword enabled. Default 1.
dword_106BDB04remove_unneeded_entitiesRemove unneeded entities. Default 1.
dword_106BD984trigraphs_enabledTrigraph support. Default 1.
dword_106BD684guiding_declsGuiding declarations. Default 1.
dword_106BD584old_specializationsOld-style specializations. Default 1.
dword_106BD544implicit_typenameImplicit typename. Default 1.
dword_106BEA04rtti_configRTTI configuration flag.
dword_106BE844gen_move_operationsGenerate move operations. Default 1.
dword_106BC084nodiscard_enabled[[nodiscard]] enabled.
dword_106BC644visibility_supportVisibility support enabled.
dword_106BDF04gnu_attr_groupsGNU attribute groups enabled.
dword_106BDF44msvc_declspecMSVC __declspec enabled.
dword_106BCBC4template_featuresTemplate features flag.
dword_106BFC44debug_mode_1Debug mode flag 1 (set by --debug_mode).
dword_106BFC04debug_mode_2Debug mode flag 2.
dword_106BFBC4debug_mode_3Debug mode flag 3.
qword_106BCE08include_suffix_defaultInclude suffix default string ("::stdh:").
qword_106BC708version_thresholdFeature version threshold. Default 30200.

Host Compiler Target Configuration

AddressSizeNameDescription
dword_126E1D44msvc_target_versionMSVC target version (1200 = VC6, 1400 = VS2005, etc.).
dword_126E1D84is_msvc_hostIs MSVC host compiler.
dword_126E1DC4is_edg_nativeEDG native mode.
dword_126E1E84is_clang_hostIs Clang host compiler.
dword_126E1F84is_gnu_hostIs GNU/GCC host compiler.
qword_126E1F08gnu_host_versionGCC/Clang host version number.
qword_126E1E08clang_host_versionClang host version number.
dword_126E1EC4backend_enabledBackend generation enabled.
dword_126E1BC4host_feature_flagHost feature flag. Default 1.
dword_126DFF04msvc_declspec_modeMSVC __declspec mode enabled.
qword_126E1B08library_prefixLibrary search path prefix ("lib").
dword_126E2004constexpr_init_flagConstexpr initialization flag.
dword_126E2044instantiation_flagInstantiation control flag.
dword_126E2244parameter_flagParameter handling flag.

Type System Lookup Tables (Read-Only)

AddressSizeNameDescription
byte_E6D1B0256signedness_tableType-code-to-signedness lookup table.
byte_E6D1AD1unsigned_int_kind_sentinelMust equal 111 ('o') -- sentinel validation.
byte_A668A0256type_kind_propertiesType kind property table. Bit 1 = callable, bit 4 = aggregate.
off_E6E020--il_entry_kind_namesIL entry kind name table (last = "last", sentinel = 9999).
off_E6CD78--db_storage_class_namesStorage class name table (last = "last").
off_E6D228--db_special_function_kindsSpecial function kind name table.
off_E6CD20--db_operator_namesOperator name table.
off_E6E060--name_linkage_kind_namesName linkage kind names.
off_E6CD88--decl_modifier_namesDeclaration modifier names.
off_E6CF38--pragma_idsPragma ID table.
qword_E6C5808sizeof_il_entry_sentinelMust equal 9999 -- sizeof IL entry validation.
off_E6DD80--il_entry_kind_display_namesIL entry kind display names (indexed by kind byte).
off_E6E040--linkage_kind_display_namesLinkage kind display names (none/internal/external/C/C++).
off_E6E140--feature_init_tableFeature initialization table (used with dword_106BF18).

IL Display Tables (Read-Only)

AddressSizeNameDescription
off_A6F8408[120]builtin_op_namesBuiltin operation kind names (120 entries).
off_A6FE408[22]type_kind_namesType kind names (22 entries: void, bool, int, float, ...).
off_A6F7608[4]access_specifier_namesAccess specifier names (public/protected/private/none).
off_A6FE008[7]storage_class_display_namesStorage class display names (7: none/auto/register/static/extern/mutable/thread_local).
off_A6F480--register_kind_namesRegister kind names.
off_A6FC00--special_kind_namesSpecial function kind names (lambda call operator, etc.).
off_A6FC80--opname_kind_namesOperator name kind names.
off_A6F640--typeref_kind_namesTyperef kind names.
off_A6F420--based_type_kind_namesBased type kind names.
off_A6F3F0--class_kind_namesClass/struct/union kind names.
off_E6C5A0--builtin_op_tableBuiltin operation reference table.

PCH and Serialization

AddressSizeNameDescription
dword_106B6904pch_modePrecompiled header mode.
dword_106B6B04pch_loadedPCH loaded flag.
qword_12C6BA08pch_string_buffer_1PCH string buffer.
qword_12C6BA88pch_string_buffer_2PCH string buffer (secondary).
qword_12C6EA08pch_write_statePCH binary write state.
qword_12C6EA88pch_misc_statePCH miscellaneous state.
dword_12C6C884pch_config_flagPCH configuration flag.
byte_12C6EE01pch_byte_flagPCH byte flag.
dword_12C6C8C4saved_var_list_countSaved variable list count (PCH).
qword_12C6CA08saved_var_listsSaved variable list array (PCH).

Inline and Linkage Tracking

AddressSizeNameDescription
qword_12C6FC88inline_def_tracking_1Inline definition tracking.
qword_12C6FD08inline_def_tracking_2Inline definition tracking (secondary).
qword_12C6FD88inline_def_tracking_3Inline definition tracking (tertiary).
qword_12C6FB88linkage_stack_1Linkage stack.
qword_12C6FC08linkage_stack_2Linkage stack (secondary).
qword_12C6FE08mangling_discriminatorABI mangling discriminator tracking.
qword_12C70E88misc_trackingMiscellaneous definition tracking.

Miscellaneous

AddressSizeNameDescription
qword_106B9B08active_compilation_ctxActive compilation context pointer.
dword_126E2804max_object_sizeMaximum object size (for vector/array validation).
dword_106B4B84omp_declare_variantOpenMP declare variant active flag.
dword_106BC7C4compressed_manglingCompressed name mangling mode.
dword_106BD4C4profiling_flagProfiling / performance measurement flag.
dword_106BCFC4traditional_enumTraditional (unscoped) enum mode.
dword_106BBD44char16_variant_flagchar16_t variant selection flag.
dword_106BD744sharing_mode_configIL sharing mode configuration.
dword_126E1C04string_sharing_enabledString sharing enabled in IL.
byte_126E1C41basic_char_typeBasic char type code (for sub_5BBDF0).
dword_106BD8C4svr4_modeSVR4 ABI mode.
byte_126E3491cuda_extensions_byteCUDA extensions flag (byte-sized).
byte_126E3581arch_extension_byteExtension flag (possibly __CUDA_ARCH__).
byte_126E3C01extension_byte_C0Extension flag byte.
byte_126E3C11extension_byte_C1Extension flag byte.
byte_126E4811extension_byte_481Extension flag byte.
dword_126F2484il_index_validIL index valid flag (1 = index built).
qword_126F2408il_index_capacityIL index array capacity.
qword_126EBF08il_index_countIL index entry count.
qword_126F2308il_index_auxIL index auxiliary pointer.
dword_12C6A244block_scope_suppressBlock-scope suppress level.
dword_127FC704mark_directionMark/unmark direction for entity traversal.
dword_127FBA04eof_flagInput EOF flag.
qword_127FBA88file_handleCurrent input file handle.
dword_127FB9C4multibyte_modeMultibyte character mode (>1 = active).
qword_126E4408[6]char_type_widthsCharacter type width table (indexed by char kind: 1,2,4 bytes).
qword_126E5808[11]special_type_entriesSpecial type entries (11 entries).
qword_126DE00--operator_name_tableOperator name string table.
off_E6E0E0--predef_macro_mode_namesPredefined macro mode name table (sentinel = "last").
qword_126EEA08predef_macro_statePredefined macro initialization state.
dword_106BBA84c23_featuresC23 features flag (#elifdef/#elifndef).
dword_106C2B04preproc_feature_flagPreprocessor feature flag.
dword_106BEF84pch_config_2PCH configuration flag (secondary).

GCC Pragma State

AddressSizeNameDescription
qword_12C6F608gcc_pragma_stack_1GCC pragma push/pop stack.
qword_12C6F688gcc_pragma_stack_2GCC pragma stack (secondary).
qword_12C6F788gcc_pragma_stateGCC pragma state.
qword_12C6F988gcc_pragma_miscGCC pragma miscellaneous state.

Integer Range Tables (SSE-width)

AddressSizeNameDescription
xmmword_126E0E016integer_upper_boundsUpper bounds for integer kinds (populated during init).
xmmword_126E00016integer_lower_boundsLower bounds for integer kinds.

IL Common Header Template

The 96-byte (6 x 16 bytes) template copied into every new IL entity:

| Address | Size | Name |
|---|---|---|
| xmmword_126F6A0 | 16 | IL header template word 0 |
| xmmword_126F6B0 | 16 | IL header template word 1 |
| xmmword_126F6C0 | 16 | IL header template word 2 |
| xmmword_126F6D0 | 16 | IL header template word 3 |
| xmmword_126F6E0 | 16 | IL header template word 4 |
| xmmword_126F6F0 | 16 | IL header template word 5 |

Address Region Summary

| Region | Range | Count | Purpose |
|---|---|---|---|
| .rodata | 0x82xxxx--0xA7xxxx | ~30 | Constant tables (attribute descriptors, operation names, type kind names) |
| .data | 0xD46xxx--0xD48xxx | ~10 | Attribute descriptor table, CLI flag lookup |
| .data | 0xE6xxxx--0xE7Exxx | ~40 | IL metadata tables (entry kind names, type properties, signedness, pragma IDs) |
| .rodata | 0x88xxxx | 1 | Error message template table (3795 entries) |
| .bss | 0x106Bxxx--0x106Cxxx | ~120 | NVIDIA-added CLI flags, feature toggles, CUDA configuration |
| .bss | 0x1065xxx | ~20 | Backend code generator state (output position, stub mode) |
| .bss | 0x1067xxx | ~10 | Diagnostic per-error tracking, entity formatter |
| .bss | 0x126xxxx | ~200 | EDG core state (scope stack, lexer, IL, error counters, source position) |
| .bss | 0x1270xxx | ~10 | Preprocessor macro chains |
| .bss | 0x1280xxx | ~15 | Arena allocator tracking |
| .bss | 0x1286xxx | ~10 | Lambda transform state (including the lambda bitmaps), registration lists |
| .bss | 0x12C6xxx--0x12C7xxx | ~40 | PCH, template instantiation, TU management |
| .bss | 0xE7Fxxx | ~30 | Attribute system, override tracking, red-black tree |
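
Any address in this reference can be checked against the section bounds given in the Segment Layout table at the top of the document. The sketch below does that lookup, treating each End value as exclusive (Start + Size). The helper names are hypothetical, and `address_of_symbol` assumes IDA-style auto-generated names in which the hexadecimal address follows the last underscore.

```python
# (section, start, end) with end exclusive, copied from the Segment Layout table.
SECTIONS = [
    (".text",     0x403300, 0x829722),
    (".rodata",   0x829740, 0xAA3FA3),
    (".eh_frame", 0xCB1210, 0xD3F398),
    (".data",     0xD46480, 0xE7EFF0),
    (".bss",      0xE7F000, 0x12D6F20),
]


def address_of_symbol(name):
    """Parse an IDA-style auto name such as 'qword_126F7D0' into its address."""
    return int(name.rsplit("_", 1)[1], 16)


def section_of(address):
    """Return the section containing `address`, or None if it falls in a gap."""
    for name, start, end in SECTIONS:
        if start <= address < end:
            return name
    return None
```

For example, `section_of(address_of_symbol("word_126DD58"))` places the current-token global in `.bss`.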

Token Kind Table

Every token produced by cudafe++'s lexer carries a 16-bit token kind stored in the global word_126DD58. There are exactly 357 token kinds, numbered 0 through 356. Their names are indexed from a string pointer table at off_E6D240 that is treated as read-only; its address falls in the .data range (0xD46480--0xE7EFF0) of the Segment Layout rather than in .rodata. A parallel 357-entry byte array at byte_E6C0E0 maps each token kind to an operator-name index, which the initialize_opname_kinds routine (sub_588BB0) uses to populate the operator name display table at qword_126DE00. A boolean stop-token table at qword_126DB48 + 8 (357 entries) marks which token kinds are valid synchronization points for error recovery in skip_to_token (sub_6887C0).
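
Stop-token error recovery is a standard technique: after a syntax error the parser discards tokens until it reaches one whose kind is flagged in the table. The sketch below shows the idea with a 357-entry boolean list. Which kinds are actually flagged in the binary is not stated here, so the choice of end-of-file (0), `;` (6), and `}` (12) is an assumption, as is the function signature.

```python
TOKEN_KIND_COUNT = 357
TK_EOF, TK_SEMICOLON, TK_RBRACE = 0, 6, 12  # kinds taken from the token table

# One boolean per token kind, mirroring the 357-entry stop-token table.
stop_token = [False] * TOKEN_KIND_COUNT
for kind in (TK_EOF, TK_SEMICOLON, TK_RBRACE):  # assumed choice of stop tokens
    stop_token[kind] = True


def skip_to_token(kinds, pos):
    """Advance pos until it rests on a stop token (or runs off the end)."""
    while pos < len(kinds) and not stop_token[kinds[pos]]:
        pos += 1
    return pos


# identifier, +, identifier, ;, identifier, <eof>
stream = [1, 13, 1, 6, 1, 0]
```

Starting at position 0, the scan stops at the semicolon (position 3); starting just past it, the scan stops at end-of-file.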

Token kind assignment follows a block scheme established by the EDG 6.6 frontend: operators and punctuation occupy the lowest range, followed by alternative tokens (digraphs and named operators), C89 keywords, C99/C11 extensions, MSVC keywords, core C++ keywords, compiler internals, type-trait intrinsics, and finally the newest C++23/26 and extended-type additions at the top. NVIDIA's CUDA-specific additions occupy three dedicated slots (328--330) immediately after the type-trait block, plus additional entries in the extended range. The ordering reflects the historical accretion of the C and C++ standards: each new standard appended its keywords at the end rather than filling gaps.

Key Facts

| Property | Value |
|---|---|
| Total token kinds | 357 (indices 0--356) |
| Name table | off_E6D240 (357 string pointers, located in the .data address range) |
| Operator-to-name map | byte_E6C0E0 (357-byte index array) |
| Operator name display table | qword_126DE00 (48 string pointers, populated by sub_588BB0) |
| Stop-token table | qword_126DB48 + 8 (357 boolean entries) |
| Current token global | word_126DD58 (WORD) |
| Keyword registration function | sub_5863A0 (keyword_init, 1,113 lines, fe_init.c) |
| Keyword entry function | sub_7463B0 (enter_keyword) |
| GNU variant registration | sub_585B10 (enter_gnu_keyword) |
| Alternative token entry | sub_749600 (registers named operator alternative) |

Token Kind Ranges

| Range | Count | Category | Description |
|---|---|---|---|
| 0 | 1 | Special | End-of-file / no-token sentinel |
| 1--31 | 31 | Operators and punctuation | Identifier and literal kinds (1--5), then core operators (`+`, `-`, `*`, etc.) and delimiters (`(`, `)`, `{`, `}`, `;`) |
| 32--51 | 20 | Operators (continued) | Compound and remaining operators (`--`, `==`, `<<`, `>>`, `&&`, `::`) |
| 52--76 | 25 | Alternative tokens / digraphs | C++ named operators (`and`, `or`, `and_eq`), digraphs (`<%`, `%>`, `<:`, `:>`), and the operators `->*`, `.*`, `...`, `<=>`, `#`, `##` |
| 77--108 | 32 | C89 keywords | All keywords from ANSI C89/ISO C90 |
| 109--131 | 23 | C99/C11 keywords | `restrict`, `_Bool`, `_Complex`, `_Imaginary`, character types |
| 132--136 | 5 | MSVC keywords | `__declspec`, `__int8`--`__int64` |
| 137--199 | 63 | C++ keywords | Core C++ keywords plus C++11/14/17/20/23 additions |
| 200--206 | 7 | Compiler internal | Preprocessor and internal token kinds |
| 207--327 | 121 | Type trait intrinsics | `__is_xxx` / `__has_xxx` compiler intrinsic keywords |
| 328--330 | 3 | NVIDIA CUDA type traits | NVIDIA-specific lambda type-trait intrinsics |
| 331--356 | 26 | Extended types / recent additions | `_Float32`--`_Float128`, C++23/26 features, scalable vector types |
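
Because the ranges are contiguous and sorted, a token kind can be mapped to its category with a binary search over the range starts. The sketch below is a convenience for readers working with raw token kind values; the function name is hypothetical, and the table data is copied from the ranges above.

```python
import bisect

TOKEN_KIND_COUNT = 357

# (first kind in range, category), in ascending order.
RANGES = [
    (0, "Special"),
    (1, "Operators and punctuation"),
    (32, "Operators (continued)"),
    (52, "Alternative tokens / digraphs"),
    (77, "C89 keywords"),
    (109, "C99/C11 keywords"),
    (132, "MSVC keywords"),
    (137, "C++ keywords"),
    (200, "Compiler internal"),
    (207, "Type trait intrinsics"),
    (328, "NVIDIA CUDA type traits"),
    (331, "Extended types / recent additions"),
]
RANGE_STARTS = [start for start, _ in RANGES]


def token_category(kind):
    """Return the category whose range contains the given token kind."""
    if not 0 <= kind < TOKEN_KIND_COUNT:
        raise ValueError("token kind out of range")
    return RANGES[bisect.bisect_right(RANGE_STARTS, kind) - 1][1]
```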

Complete Token Table

Operators and Punctuation (0--51)

These tokens are produced directly by the character-level scanner sub_679800 (scan_token). Multi-character operators are resolved by dedicated scanning functions in the 0x67ABB0--0x67BAB0 range.

KindNameC/C++ ConstructNotes
0<eof>End of fileSentinel / no-token marker
1<identifier>IdentifierAny non-keyword identifier
2<integer literal>Integer constantDecimal, hex, octal, or binary
3<floating literal>Floating-point constantFloat, double, or long double
4<character literal>Character constant'x', includes wide/u8/u16/u32
5<string literal>String literal"...", includes wide/u8/u16/u32/raw
6;SemicolonStatement terminator
7(Left parenthesisGrouping, function call
8)Right parenthesis
9,CommaSeparator, comma operator
10=Assignmenta = b
11{Left braceBlock/initializer open
12}Right braceBlock/initializer close
13+PlusAddition, unary plus
14-MinusSubtraction, unary minus
15*StarMultiplication, pointer dereference, pointer declarator
16/SlashDivision
17<Less-thanComparison, template open bracket
18>Greater-thanComparison, template close bracket
19&AmpersandBitwise AND, address-of, reference declarator
20?Question markTernary conditional
21:ColonLabel, ternary, bit-field width
22~TildeBitwise complement, destructor
23%PercentModulo
24^CaretBitwise XOR
25[Left bracketArray subscript, attributes [[
26.DotMember access
27]Right bracket
28!ExclamationLogical NOT
29|PipeBitwise OR
30->ArrowPointer member access
31++IncrementPre/post increment
32--DecrementPre/post decrement
33==EqualEquality comparison; also bitand alt-token for &
34!=Not-equalInequality comparison
35<=Less-or-equalComparison
36>=Greater-or-equalComparison
37<<Left shiftAlso compl alt-token for ~
38>>Right shiftAlso not alt-token for !
39+=Add-assignCompound assignment
40-=Subtract-assign
41*=Multiply-assign
42/=Divide-assign
43%=Modulo-assign
44<<=Left-shift-assign
45>>=Right-shift-assign
46&&Logical ANDAlso rvalue-reference declarator (C++11)
47||Logical OR
48^=XOR-assignAlso not_eq alt-token for !=
49&=AND-assign
50|=OR-assignAlso xor alt-token for ^
51::Scope resolutionAlso bitor alt-token for |

Alternative Tokens and Digraphs (52--76)

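C++ alternative tokens (ISO 14882 clause 5.5) and C/C++ digraphs. The block also holds several ordinary operators (->*, .*, ..., <=>) and the preprocessor tokens # and ##. The alternative tokens are registered during keyword_init (sub_5863A0) via sub_749600 when in C++ mode (dword_126EFB4 == 2).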

KindNameEquivalentNotes
52and&&Logical AND
53or||Logical OR
54->*->*Pointer-to-member via pointer
55.*.*Pointer-to-member via object
56......Ellipsis (variadic)
57<=><=>Three-way comparison (C++20)
58##Preprocessor stringification
59####Preprocessor token paste
60<%{Digraph for left brace
61%>}Digraph for right brace
62<:[Digraph for left bracket
63:>]Digraph for right bracket
64and_eq&=Bitwise AND-assign
65xor_eq^=Bitwise XOR-assign
66or_eq|=Bitwise OR-assign
67%:#Digraph for hash
68%:%:##Digraph for token paste
69--76(reserved)--Reserved for future alternative tokens

C89 Keywords (77--108)

Registered unconditionally. These form the base keyword set present in every compilation mode.

| Kind | Name | C/C++ Construct |
| --- | --- | --- |
| 77 | auto | Storage class (C89); type deduction (C++11) |
| 78 | break | Loop/switch exit |
| 79 | case | Switch case label |
| 80 | char | Character type |
| 81 | const | Const qualifier |
| 82 | continue | Loop continuation |
| 83 | default | Switch default label; defaulted function (C++11) |
| 84 | do | Do-while loop |
| 85 | double | Double-precision float |
| 86 | else | If-else branch |
| 87 | enum | Enumeration |
| 88 | extern | External linkage |
| 89 | float | Single-precision float |
| 90 | for | For loop |
| 91 | goto | Unconditional jump |
| 92 | if | Conditional |
| 93 | int | Integer type |
| 94 | long | Long integer modifier |
| 95 | register | Register storage hint (deprecated in C++11, removed in C++17) |
| 96 | return | Function return |
| 97 | short | Short integer modifier |
| 98 | signed | Signed integer modifier |
| 99 | sizeof | Size query operator |
| 100 | static | Static storage / internal linkage |
| 101 | struct | Structure |
| 102 | switch | Multi-way branch |
| 103 | typedef | Type alias (C-style) |
| 104 | union | Union type |
| 105 | unsigned | Unsigned integer modifier |
| 106 | void | Void type |
| 107 | volatile | Volatile qualifier |
| 108 | while | While loop |

C99/C11/C23 Keywords (109--131)

Gated on the C standard version at dword_126EF68 (values: 199901 = C99, 201112 = C11, 202311 = C23).

| Kind | Name | Standard | C/C++ Construct |
| --- | --- | --- | --- |
| 109 | inline | C99 | Inline function hint (already C++ keyword at 154) |
| 110--118 | (reserved) | -- | -- |
| 119 | restrict | C99 | Pointer restrict qualifier |
| 120 | _Bool | C99 | Boolean type (C-style) |
| 121 | _Complex | C99 | Complex number type |
| 122 | _Imaginary | C99 | Imaginary number type |
| 123--125 | (reserved) | -- | -- |
| 126 | char16_t | C++11/C23 | 16-bit character type |
| 127 | char32_t | C++11/C23 | 32-bit character type |
| 128 | char8_t | C++20/C23 | UTF-8 character type |
| 129--131 | (reserved) | -- | -- |

MSVC Keywords (132--136)

Gated on dword_126EFB0 (Microsoft extensions enabled, language mode 2/MSVC).

| Kind | Name | MSVC Construct |
| --- | --- | --- |
| 132 | __declspec | MSVC declaration specifier |
| 133 | __int8 | 8-bit integer type |
| 134 | __int16 | 16-bit integer type |
| 135 | __int32 | 32-bit integer type |
| 136 | __int64 | 64-bit integer type |

C++ Core Keywords (137--199)

Gated on C++ mode (dword_126EFB4 == 2). Some keywords within this range were added in C++11 through C++23 and have additional standard-version gates.

| Kind | Name | Standard | C/C++ Construct |
| --- | --- | --- | --- |
| 137 | bool | C++98 | Boolean type |
| 138 | true | C++98 | Boolean literal |
| 139 | false | C++98 | Boolean literal |
| 140 | wchar_t | C++98 | Wide character type |
| 141 | (reserved) | -- | -- |
| 142 | __attribute | GNU | GCC attribute syntax |
| 143 | __builtin_types_compatible_p | GNU | GCC type compatibility test |
| 144--149 | (reserved) | -- | -- |
| 150 | catch | C++98 | Exception handler |
| 151 | class | C++98 | Class definition |
| 152 | delete | C++98 | Deallocation; deleted function (C++11) |
| 153 | friend | C++98 | Friend declaration |
| 154 | inline | C++98 | Inline function/variable |
| 155 | new | C++98 | Allocation expression |
| 156 | operator | C++98 | Operator overload |
| 157 | private | C++98 | Access specifier |
| 158 | protected | C++98 | Access specifier |
| 159 | public | C++98 | Access specifier |
| 160 | template | C++98 | Template declaration |
| 161 | this | C++98 | Current object pointer |
| 162 | throw | C++98 | Throw expression |
| 163 | try | C++98 | Try block |
| 164 | virtual | C++98 | Virtual function/base |
| 165 | (reserved) | -- | -- |
| 166 | const_cast | C++98 | Const cast expression |
| 167 | dynamic_cast | C++98 | Dynamic cast expression |
| 168 | (reserved) | -- | -- |
| 169 | export | C++98/20 | Export declaration (original C++98, revived for modules in C++20) |
| 170 | export | C++20 | Module export (alternate registration slot) |
| 171--173 | (reserved) | -- | -- |
| 174 | mutable | C++98 | Mutable data member |
| 175 | namespace | C++98 | Namespace declaration |
| 176 | reinterpret_cast | C++98 | Reinterpret cast expression |
| 177 | static_cast | C++98 | Static cast expression |
| 178 | typeid | C++98 | Runtime type identification |
| 179 | using | C++98 | Using declaration/directive |
| 180--182 | (reserved) | -- | -- |
| 183 | typename | C++98 | Dependent type name |
| 184 | static_assert | C++11 | Static assertion; also _Static_assert in C11 |
| 185 | decltype | C++11 | Decltype specifier |
| 186 | __auto_type | GNU | GCC auto type extension |
| 187 | __extension__ | GNU | GCC extension marker (suppress warnings) |
| 188 | (reserved) | -- | -- |
| 189 | typeof | C23/GNU | Type-of expression |
| 190 | typeof_unqual | C23 | Unqualified type-of expression |
| 191--193 | (reserved) | -- | -- |
| 194 | thread_local | C++11 | Thread-local storage; also _Thread_local in C11 |
| 195--199 | (reserved) | -- | -- |

Compiler Internal Tokens (200--206)

These tokens are used internally by the preprocessor and the token cache. They never appear in user-visible diagnostics.

| Kind | Name | Purpose |
| --- | --- | --- |
| 200 | <pp-number> | Preprocessing number (not yet classified as integer or float) |
| 201 | <header-name> | Include file name (<file> or "file") |
| 202 | <newline> | Logical newline token (preprocessor directive boundary) |
| 203 | <whitespace> | Whitespace token (preprocessing mode only) |
| 204 | <placemarker> | Token-paste placeholder (empty argument in ##) |
| 205 | <pragma> | Pragma token (deferred for later processing) |
| 206 | <end-of-directive> | End of preprocessor directive |

Type Trait Intrinsics (207--327)

These are compiler intrinsic keywords that implement the C++ type traits (from <type_traits>) without requiring template instantiation. They are registered during keyword_init with C++ standard version gating -- earlier traits (C++11) are always available in C++ mode, while newer traits (C++20, C++23, C++26) require the corresponding standard version at dword_126EF68. Some traits are MSVC-specific (gated on dword_126EFB0) or Clang-specific (gated on qword_126EF90).

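The complete list of type-trait intrinsics follows, in token-kind order within each sub-category. Several non-trait keywords (nullptr, constexpr, the coroutine keywords, C11 underscore keywords, Clang builtins) are interleaved in the same numeric range and are marked as such: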

Unary Type Predicates

| Kind | Name | Standard | Tests Whether... |
| --- | --- | --- | --- |
| 207 | __is_class | C++11 | Type is a class (not union) |
| 208 | __is_enum | C++11 | Type is an enumeration |
| 209 | __is_union | C++11 | Type is a union |
| 210 | __is_pod | C++11 | Type is POD (plain old data) |
| 211 | __is_empty | C++11 | Type has no non-static data members |
| 212 | __is_polymorphic | C++11 | Type has at least one virtual function |
| 213 | __is_abstract | C++11 | Type has at least one pure virtual function |
| 214 | __is_literal_type | C++11 | Type is a literal type (deprecated C++17) |
| 215 | __is_standard_layout | C++11 | Type is standard-layout |
| 216 | __is_trivial | C++11 | Type is trivially copyable and has trivial default constructor |
| 217 | __is_trivially_copyable | C++11 | Type is trivially copyable |
| 218 | __is_final | C++14 | Class is marked final |
| 219 | __is_aggregate | C++17 | Type is an aggregate |
| 220 | __has_virtual_destructor | C++11 | Type has a virtual destructor |
| 221 | __has_trivial_constructor | C++11 | Type has a trivial default constructor |
| 222 | __has_trivial_copy | C++11 | Type has a trivial copy constructor |
| 223 | __has_trivial_assign | C++11 | Type has a trivial copy assignment |
| 224 | __has_trivial_destructor | C++11 | Type has a trivial destructor |
| 225 | __has_nothrow_constructor | C++11 | Default constructor is noexcept |
| 226 | __has_nothrow_copy | C++11 | Copy constructor is noexcept |
| 227 | __has_nothrow_assign | C++11 | Copy assignment is noexcept |
| 228 | __has_trivial_move_constructor | C++11 | Type has a trivial move constructor |
| 229 | __has_trivial_move_assign | C++11 | Type has a trivial move assignment |
| 230 | __has_nothrow_move_assign | C++11 | Move assignment is noexcept |
| 231 | __has_unique_object_representations | C++17 | Type has unique object representations |
| 232 | __is_signed | C++11 | Type is a signed arithmetic type |
| 233 | __is_unsigned | C++11 | Type is an unsigned arithmetic type |
| 234 | __is_integral | C++11 | Type is an integral type |
| 235 | __is_floating_point | C++11 | Type is a floating-point type |
| 236 | __is_arithmetic | C++11 | Type is an arithmetic type |
| 237 | nullptr | C++11 | Null pointer literal (not a trait; shares range) |
| 238 | __is_fundamental | C++11 | Type is a fundamental type |
| 239 | __int128 | GNU | 128-bit integer type (not a trait; shares range) |
| 240 | __is_scalar | C++11 | Type is a scalar type |
| 241 | __is_object | C++11 | Type is an object type |
| 242 | __is_compound | C++11 | Type is a compound type |
| 243 | __is_reference | C++11 | Type is an lvalue or rvalue reference |
| 244 | constexpr | C++11 | Constexpr specifier (not a trait; shares range) |
| 245 | consteval | C++20 | Consteval specifier (not a trait; shares range) |
| 246 | constinit | C++20 | Constinit specifier (not a trait; shares range) |
| 247 | _Alignof | C11 | Alignment query (C11 spelling) |
| 248 | _Alignas | C11 | Alignment specifier (C11 spelling) |
| 249 | __bases | GCC | All base classes, direct and indirect (GCC extension) |
| 250 | __direct_bases | GCC | Direct base classes only (GCC extension) |
| 251 | __builtin_arm_ldrex | Clang | ARM load-exclusive intrinsic |
| 252 | __builtin_arm_ldaex | Clang | ARM load-acquire-exclusive intrinsic |
| 253 | __builtin_arm_addg | Clang | ARM MTE add-tag intrinsic |
| 254 | __builtin_arm_irg | Clang | ARM MTE insert-random-tag intrinsic |
| 255 | __builtin_arm_ldg | Clang | ARM MTE load-tag intrinsic |
| 256 | __is_member_pointer | C++11 | Type is a pointer to member |
| 257 | __is_member_function_pointer | C++11 | Type is a pointer to member function |
| 258 | __builtin_shufflevector | Clang | Clang vector shuffle intrinsic |
| 259 | __builtin_convertvector | Clang | Clang vector conversion intrinsic |
| 260 | _Noreturn | C11 | No-return function specifier |
| 261 | __builtin_complex | GNU | GCC complex number construction |
| 262 | _Generic | C11 | Generic selection expression |
| 263 | _Atomic | C11 | Atomic type qualifier/specifier |
| 264 | _Nullable | Clang | Nullable pointer qualifier |
| 265 | _Nonnull | Clang | Non-null pointer qualifier |
| 266 | _Null_unspecified | Clang | Null-unspecified pointer qualifier |
| 267 | co_yield | C++20 | Coroutine yield expression |
| 268 | co_return | C++20 | Coroutine return statement |
| 269 | co_await | C++20 | Coroutine await expression |
| 270 | __is_member_object_pointer | C++11 | Type is a pointer to data member |
| 271 | __builtin_addressof | GNU | Address-of without operator overload |

EDG Internal Keywords (272--283)

These are not user-facing keywords. They are injected by the EDG frontend into synthesized declarations for built-in types, throw specifications, and vector types.

| Kind | Name | Purpose |
| --- | --- | --- |
| 272 | __edg_type__ | EDG internal type placeholder |
| 273 | __edg_vector_type__ | SIMD vector type (GCC __attribute__((vector_size)) lowering) |
| 274 | __edg_neon_vector_type__ | ARM NEON vector type |
| 275 | __edg_scalable_vector_type__ | ARM SVE scalable vector type |
| 276 | __edg_neon_polyvector_type__ | ARM NEON polynomial vector type |
| 277 | __edg_size_type__ | Placeholder for size_t before it is typedef'd |
| 278 | __edg_ptrdiff_type__ | Placeholder for ptrdiff_t before it is typedef'd |
| 279 | __edg_bool_type__ | Placeholder for bool / _Bool |
| 280 | __edg_wchar_type__ | Placeholder for wchar_t |
| 281 | __edg_throw__ | Throw specification in synthesized declarations |
| 282 | __edg_opnd__ | Operand reference in synthesized expressions |
| 283 | (reserved) | -- |

More Type Predicates and Binary Traits (284--327)

| Kind | Name | Standard | Tests Whether... |
| --- | --- | --- | --- |
| 284 | __is_const | C++11 | Type is const-qualified |
| 285 | __is_volatile | C++11 | Type is volatile-qualified |
| 286 | __is_void | C++11 | Type is void |
| 287 | __is_array | C++11 | Type is an array |
| 288 | __is_pointer | C++11 | Type is a pointer |
| 289 | __is_lvalue_reference | C++11 | Type is an lvalue reference |
| 290 | __is_rvalue_reference | C++11 | Type is an rvalue reference |
| 291 | __is_function | C++11 | Type is a function type |
| 292 | __is_constructible | C++11 | Type is constructible from given args |
| 293 | __is_nothrow_constructible | C++11 | Construction is noexcept |
| 294 | requires | C++20 | Requires expression/clause |
| 295 | concept | C++20 | Concept definition |
| 296 | __builtin_has_attribute | GNU | Tests if declaration has given attribute |
| 297 | __builtin_bit_cast | C++20 | Bit cast intrinsic (std::bit_cast implementation) |
| 298 | __is_assignable | C++11 | Type is assignable from given type |
| 299 | __is_nothrow_assignable | C++11 | Assignment is noexcept |
| 300 | __is_trivially_constructible | C++11 | Construction is trivial |
| 301 | __is_trivially_assignable | C++11 | Assignment is trivial |
| 302 | __is_destructible | C++11 | Type is destructible |
| 303 | __is_nothrow_destructible | C++11 | Destruction is noexcept |
| 304 | __edg_is_deducible | EDG | EDG internal: template argument is deducible |
| 305 | __is_trivially_destructible | C++11 | Destruction is trivial |
| 306 | __is_base_of | C++11 | First type is base of second (binary trait) |
| 307 | __is_convertible | C++11 | First type is convertible to second (binary trait) |
| 308 | __is_same | C++11 | Two types are the same (binary trait) |
| 309 | __is_trivially_copy_assignable | C++11 | Copy assignment is trivial |
| 310 | __is_assignable_no_precondition_check | EDG | Assignable without precondition validation |
| 311 | __is_same_as | Clang | Alias for __is_same (Clang compatibility) |
| 312 | __is_referenceable | C++11 | Type can be referenced |
| 313 | __is_bounded_array | C++20 | Type is a bounded array |
| 314 | __is_unbounded_array | C++20 | Type is an unbounded array |
| 315 | __is_scoped_enum | C++23 | Type is a scoped enumeration |
| 316 | __is_literal | C++11 | Alias for __is_literal_type |
| 317 | __is_complete_type | EDG | Type is complete (not forward-declared) |
| 318 | __is_nothrow_convertible | C++20 | Conversion is noexcept (binary trait) |
| 319 | __is_convertible_to | MSVC | MSVC alias for __is_convertible |
| 320 | __is_invocable | C++17 | Callable with given arguments |
| 321 | __is_nothrow_invocable | C++17 | Call is noexcept |
| 322 | __is_trivially_equality_comparable | Clang | Bitwise equality is equivalent |
| 323 | __is_layout_compatible | C++20 | Types have compatible layouts |
| 324 | __is_pointer_interconvertible_base_of | C++20 | Pointer-interconvertible base (binary trait) |
| 325 | __is_corresponding_member | C++20 | Corresponding members in layout-compatible types |
| 326 | __is_pointer_interconvertible_with_class | C++20 | Member pointer is interconvertible with class pointer |
| 327 | __is_trivially_relocatable | C++26 | Type can be trivially relocated |

NVIDIA CUDA Type Traits (328--330)

Three NVIDIA-specific type-trait intrinsics occupy dedicated token kinds. These are registered during keyword_init when GPU mode is active (dword_106C2C0 != 0) and participate in the same token classification pipeline as all other type traits. They are used internally by the CUDA frontend to detect extended lambda closure types during device/host separation.

| Kind | Name | Purpose |
| --- | --- | --- |
| 328 | __nv_is_extended_device_lambda_closure_type | Tests whether a type is the closure type of an extended device lambda. Used during device code generation to identify lambda closures that require special treatment (wrapper function generation, address-space conversion). |
| 329 | __nv_is_extended_host_device_lambda_closure_type | Tests whether a type is the closure type of an extended host-device lambda (__host__ __device__). These lambdas require dual code generation paths and wrapper functions for both host and device. |
| 330 | __nv_is_extended_device_lambda_with_preserved_return_type | Tests whether a device lambda has an explicitly specified (preserved) return type rather than a deduced one. Affects how the compiler generates the wrapper function return type. |

When extended lambdas are disabled, these traits are predefined as macros expanding to false:

// Fallback definitions in preprocessor preamble:
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false

Extended Types and Recent Additions (331--356)

These are the newest token kinds, added for extended floating-point types (ISO/IEC TS 18661-3) and recent C++23/26 features.

| Kind | Name | Standard | C/C++ Construct |
| --- | --- | --- | --- |
| 331 | _Float32 | TS 18661-3 | 32-bit IEEE 754 float |
| 332 | _Float32x | TS 18661-3 | Extended 32-bit float |
| 333 | _Float64 | TS 18661-3 | 64-bit IEEE 754 float |
| 334 | _Float64x | TS 18661-3 | Extended 64-bit float |
| 335 | _Float128 | TS 18661-3 | 128-bit IEEE 754 float |
| 336--340 | (reserved) | -- | -- |
| 341--356 | (recent additions) | C++23/26 | Reserved for MSVC C++/CLI traits (__is_ref_class, __is_value_class, __is_interface_class, __is_delegate, __is_sealed, __has_finalizer, __has_copy, __has_assign, __is_simple_value_class, __is_ref_array, __is_valid_winrt_type, __is_win_class, __is_win_interface) and additional future extensions |
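
The numeric ranges in the section headings above can be collected into a small lookup table. The sketch below is illustrative only: the enum, struct, and function names are hypothetical and do not come from the binary, and kinds below 52 (covered earlier on this page) are simply reported as unknown.

```c
#include <stddef.h>

/* Hypothetical grouping labels; one per section heading on this page. */
enum kind_group {
    KG_UNKNOWN = 0,
    KG_ALT_TOKENS,      /* 52--76   */
    KG_C89,             /* 77--108  */
    KG_C99_C23,         /* 109--131 */
    KG_MSVC,            /* 132--136 */
    KG_CXX_CORE,        /* 137--199 */
    KG_INTERNAL,        /* 200--206 */
    KG_TRAITS_AND_MISC, /* 207--271 */
    KG_EDG_INTERNAL,    /* 272--283 */
    KG_MORE_TRAITS,     /* 284--327 */
    KG_CUDA_TRAITS,     /* 328--330 */
    KG_EXTENDED         /* 331--356 */
};

struct kind_range { int first; int last; enum kind_group group; };

static const struct kind_range KIND_RANGES[] = {
    {  52,  76, KG_ALT_TOKENS },
    {  77, 108, KG_C89 },
    { 109, 131, KG_C99_C23 },
    { 132, 136, KG_MSVC },
    { 137, 199, KG_CXX_CORE },
    { 200, 206, KG_INTERNAL },
    { 207, 271, KG_TRAITS_AND_MISC },
    { 272, 283, KG_EDG_INTERNAL },
    { 284, 327, KG_MORE_TRAITS },
    { 328, 330, KG_CUDA_TRAITS },
    { 331, 356, KG_EXTENDED },
};

/* Linear scan is fine for eleven ranges. */
enum kind_group token_kind_group(int kind) {
    size_t count = sizeof KIND_RANGES / sizeof KIND_RANGES[0];
    for (size_t i = 0; i < count; i++) {
        if (kind >= KIND_RANGES[i].first && kind <= KIND_RANGES[i].last)
            return KIND_RANGES[i].group;
    }
    return KG_UNKNOWN;
}
```

A group label says nothing about whether a given kind inside the range is actually assigned; reserved slots still fall inside their enclosing range.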

Token Cache

The token cache provides lookahead, backtracking, and macro-expansion replay for C++ parsing. Tokens are stored in a linked list of cache entries, each 80--112 bytes depending on payload.

Cache Entry Layout

| Offset | Size | Field | Description |
| --- | --- | --- | --- |
| +0 | 8 | next | Next entry in linked list |
| +8 | 8 | source_position | Encoded file/line/column |
| +16 | 2 | token_code | Token kind (0--356) |
| +18 | 1 | cache_entry_kind | Payload discriminator (see table below) |
| +20 | 4 | flags | Token classification flags |
| +24 | 4 | extra_flags | Additional flags |
| +32 | 8 | extra_data | Context-dependent data |
| +40.. | varies | payload | Kind-specific data (40--72 bytes) |
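
As a cross-check of the offsets above, the layout can be written as a C struct with compile-time assertions. This is a sketch under assumptions: an LP64 target (8-byte pointers), natural alignment, and padding at +19 and +28 where the table lists no field. The struct and helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of one token cache entry; field names follow the table. */
struct token_cache_entry {
    struct token_cache_entry *next;  /* +0:  next entry in linked list */
    uint64_t source_position;        /* +8:  encoded file/line/column */
    uint16_t token_code;             /* +16: token kind */
    uint8_t  cache_entry_kind;       /* +18: payload discriminator */
    uint8_t  pad0;                   /* +19: assumed padding */
    uint32_t flags;                  /* +20: classification flags */
    uint32_t extra_flags;            /* +24: additional flags */
    uint32_t pad1;                   /* +28: assumed padding */
    uint64_t extra_data;             /* +32: context-dependent data */
    unsigned char payload[72];       /* +40: sized for the largest payload */
};

_Static_assert(offsetof(struct token_cache_entry, token_code) == 16, "token_code at +16");
_Static_assert(offsetof(struct token_cache_entry, cache_entry_kind) == 18, "kind at +18");
_Static_assert(offsetof(struct token_cache_entry, flags) == 20, "flags at +20");
_Static_assert(offsetof(struct token_cache_entry, extra_flags) == 24, "extra_flags at +24");
_Static_assert(offsetof(struct token_cache_entry, extra_data) == 32, "extra_data at +32");
_Static_assert(offsetof(struct token_cache_entry, payload) == 40, "payload at +40");
_Static_assert(sizeof(struct token_cache_entry) == 112, "largest entry is 112 bytes");

/* Bytes occupied by an entry whose payload is payload_bytes long. */
size_t token_cache_entry_size(size_t payload_bytes) {
    return offsetof(struct token_cache_entry, payload) + payload_bytes;
}
```

With a 40-byte payload the helper gives 80 bytes and with a 72-byte payload 112 bytes, matching the 80--112 byte range quoted for cache entries.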

Cache Entry Kinds

Discriminator values 1--8 select the payload interpretation at offset +40. Six of them were observed in use; values 5 and 7 appear to be unused:

| Kind | Value | Payload Content | Size | Description |
| --- | --- | --- | --- | --- |
| identifier | 1 | Name pointer + 64-byte lookup result | 72 | Identifier with pre-resolved scope/symbol lookup. The 64-byte lookup result mirrors xmmword_106C380--106C3B0. |
| macro_def | 2 | Macro definition pointer | 8 | Reference to a macro definition for re-expansion. Dispatched to sub_5BA500. |
| pragma | 3 | Pragma data | varies | Preprocessor pragma deferred for later processing |
| pp_number | 4 | Number text pointer | 8 | Preprocessing number not yet classified as integer or float |
| (reserved) | 5 | -- | -- | Not observed in use |
| string | 6 | String data + encoding byte | varies | String literal with encoding prefix information |
| (reserved) | 7 | -- | -- | Not observed in use |
| concatenated_string | 8 | Concatenated string data | varies | Wide or multi-piece concatenated string literal |

Cache Management Globals

| Address | Name | Description |
| --- | --- | --- |
| qword_1270150 | cached_token_rescan_list | Head of list of tokens to re-scan (pushed back for lookahead) |
| qword_1270128 | reusable_cache_stack | Stack of reusable cache entry blocks |
| qword_1270148 | free_token_list | Free list for recycling cache entries |
| qword_1270140 | macro_definition_chain | Active macro definition chain |
| qword_1270118 | cache_entry_free_list | Free list for allocate_token_cache_entry |
| dword_126DB74 | has_cached_tokens | Boolean: nonzero when cache is non-empty |

Cache Operations

| Address | Identity | Lines | Description |
| --- | --- | --- | --- |
| sub_669650 | copy_tokens_from_cache | 385 | Copies cached preprocessor tokens for macro re-expansion (assert at lexical.c:3417) |
| sub_669D00 | allocate_token_cache_entry | 119 | Allocates from free list at qword_1270118, initializes fields |
| sub_669EB0 | create_cached_token_node | 83 | Creates and initializes cache node with source position |
| sub_66A000 | append_to_token_cache | 88 | Appends token to cache list, maintains tail pointer |
| sub_66A140 | push_token_to_rescan_list | 46 | Pushes token onto rescan stack at qword_1270150 |
| sub_66A2C0 | free_single_cache_entry | 18 | Returns cache entry to free list |

Keyword Registration

All keywords are registered during frontend initialization by sub_5863A0 (keyword_init / fe_translation_unit_init, 1,113 lines, in fe_init.c). The function calls sub_7463B0 (enter_keyword) for each keyword, passing the numeric token kind and the keyword string. GNU double-underscore variants (e.g., __asm and __asm__ for asm) are registered via sub_585B10 (enter_gnu_keyword), which automatically generates both __name and __name__ forms from a single root. Alternative tokens are registered via sub_749600.

Version Gating Architecture

Registration is conditional on a set of global configuration flags established during CLI processing:

| Address | Name | Controls | Values |
| --- | --- | --- | --- |
| dword_126EFB4 | language_mode | C vs C++ dialect | 1 = C (GNU default), 2 = C++ |
| dword_126EF68 | cpp_standard_version | Standard version level | 199711 (C++98), 201103 (C++11), 201402 (C++14), 201703 (C++17), 202002 (C++20), 202302 (C++23) |
| dword_126EFAC | c_language_mode | C mode flag | Boolean |
| dword_126EFB0 | microsoft_extensions | MSVC keywords | Boolean |
| dword_126EFA8 | gnu_extensions | GCC keywords | Boolean |
| dword_126EFA4 | clang_extensions | Clang keywords | Boolean |
| qword_126EF98 | gnu_version | GCC version threshold | Encoded: e.g., 0x9FC3 (decimal 40899, which reads as 4.8.99 if the encoding is major*10000 + minor*100 + patch) |
| qword_126EF90 | clang_version | Clang version threshold | Encoded: e.g., 0x15F8F (decimal 89999), 0x1D4BF (decimal 119999) |
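
The encoded version thresholds are easier to read once split into components. The sketch below assumes the encoding is decimal major*10000 + minor*100 + patch; that is an inference from the constants rather than something confirmed in the binary, and the type and function names are hypothetical.

```c
/* Hypothetical decoded form of a gnu_version / clang_version value. */
struct compiler_version {
    unsigned major;
    unsigned minor;
    unsigned patch;
};

/* Assumed encoding: major * 10000 + minor * 100 + patch. */
struct compiler_version decode_version(unsigned long encoded) {
    struct compiler_version v;
    v.major = (unsigned)(encoded / 10000);
    v.minor = (unsigned)(encoded / 100 % 100);
    v.patch = (unsigned)(encoded % 100);
    return v;
}
```

Under this assumption 0x9FC3 decodes to 4.8.99, 0x15F8F to 8.99.99, and 0x1D4BF to 11.99.99. Values of the form x.99.99 would be natural upper bounds for "newer than release x" comparisons, which lends some support to the assumed encoding.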

Registration Pattern

The pseudocode below shows the version-gated registration pattern reconstructed from sub_5863A0:

void keyword_init(void) {
    // C89 keywords -- always registered
    enter_keyword(77, "auto");
    enter_keyword(78, "break");
    enter_keyword(79, "case");
    // ... all C89 keywords ...
    enter_keyword(108, "while");

    // C99 keywords -- gated on C99+ standard
    if (c_standard_version >= 199901) {
        enter_keyword(119, "restrict");
        enter_keyword(120, "_Bool");
        enter_keyword(121, "_Complex");
        enter_keyword(122, "_Imaginary");
    }

    // C11 keywords
    if (c_standard_version >= 201112) {
        enter_keyword(184, "_Static_assert");
        enter_keyword(247, "_Alignof");
        enter_keyword(248, "_Alignas");
        enter_keyword(260, "_Noreturn");
        enter_keyword(262, "_Generic");
        enter_keyword(263, "_Atomic");
        enter_keyword(194, "_Thread_local");
    }

    // C++ mode keywords
    if (language_mode == 2) {  // C++ mode
        enter_keyword(137, "bool");
        enter_keyword(138, "true");
        enter_keyword(139, "false");
        enter_keyword(140, "wchar_t");
        enter_keyword(150, "catch");
        enter_keyword(151, "class");
        // ... all C++ core keywords ...
        enter_keyword(183, "typename");

        // Alternative tokens (C++ only)
        enter_alt_token(52, "and", /*len*/3);
        enter_alt_token(53, "or", 2);
        enter_alt_token(64, "and_eq", 6);
        // ... all alternative tokens ...

        // C++11 keywords
        if (cpp_standard_version >= 201103) {
            enter_keyword(244, "constexpr");
            enter_keyword(185, "decltype");
            enter_keyword(237, "nullptr");
            enter_keyword(126, "char16_t");
            enter_keyword(127, "char32_t");
            enter_keyword(184, "static_assert");
            enter_keyword(194, "thread_local");
        }

        // C++20 keywords
        if (cpp_standard_version >= 202002) {
            enter_keyword(245, "consteval");
            enter_keyword(246, "constinit");
            enter_keyword(267, "co_yield");
            enter_keyword(268, "co_return");
            enter_keyword(269, "co_await");
            enter_keyword(294, "requires");
            enter_keyword(295, "concept");
        }
    }

    // GNU extensions -- gated on gnu_extensions flag
    if (gnu_extensions) {
        enter_gnu_keyword(187, "__extension__");
        enter_gnu_keyword(186, "__auto_type");
        enter_gnu_keyword(142, "__attribute");
        enter_keyword(117, "__builtin_offsetof");
        enter_keyword(143, "__builtin_types_compatible_p");
        enter_keyword(239, "__int128");
        // ... all GNU extensions ...
    }

    // MSVC extensions
    if (microsoft_extensions) {
        enter_keyword(132, "__declspec");
        enter_keyword(133, "__int8");
        enter_keyword(134, "__int16");
        enter_keyword(135, "__int32");
        enter_keyword(136, "__int64");
    }

    // Type traits (C++11+, ~60 traits)
    if (language_mode == 2) {
        enter_keyword(207, "__is_class");
        enter_keyword(208, "__is_enum");
        // ... all type traits through 327 ...
    }

    // CUDA type traits (GPU mode)
    if (gpu_mode) {
        enter_keyword(328, "__nv_is_extended_device_lambda_closure_type");
        enter_keyword(329, "__nv_is_extended_host_device_lambda_closure_type");
        enter_keyword(330, "__nv_is_extended_device_lambda_with_preserved_return_type");
    }

    // Extended float types (GNU)
    if (gnu_extensions) {
        enter_keyword(331, "_Float32");
        enter_keyword(332, "_Float32x");
        enter_keyword(333, "_Float64");
        enter_keyword(334, "_Float64x");
        enter_keyword(335, "_Float128");
    }

    // Post-keyword init: scope setup, builtin registration
    // ...
}

GNU Double-Underscore Registration

sub_585B10 (enter_gnu_keyword, assert at fe_init.c:698) implements the pattern where a single keyword name is registered in two or three forms:

  • If name starts with _: registers name as-is and __name__ (e.g., _Bool stays, plus ___Bool__ if applicable)
  • Otherwise: registers __name and __name__ (e.g., asm produces __asm and __asm__)

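The function builds the names in a stack buffer of at most 49 bytes. The guard is name length + 5 <= 0x31: two bytes for the leading __, two for the trailing __, and one for the null terminator. It writes the leading __ as the 16-bit constant 0x5F5F, copies the name, and appends __ with a null terminator. Both variants are passed to sub_7463B0 (enter_keyword) with the same token kind.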
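
A minimal sketch of the name-building step, assuming the behavior described above. It is not a transcription of sub_585B10: the function name, the buffer macro, and the error return are hypothetical, and the real routine calls enter_keyword instead of returning the strings.

```c
#include <string.h>

#define GNU_KW_BUF 0x31  /* 49 bytes, the bound quoted above */

/* Writes the two spellings derived from `name` into `first` and `second`.
   Returns 0 on success, -1 if "__" + name + "__" + NUL would not fit. */
int gnu_keyword_forms(const char *name,
                      char first[GNU_KW_BUF],
                      char second[GNU_KW_BUF]) {
    size_t len = strlen(name);
    if (len + 5 > GNU_KW_BUF)
        return -1;

    if (name[0] == '_') {
        /* Names that already start with '_' are kept as-is. */
        memcpy(first, name, len + 1);
    } else {
        /* Otherwise the first form is "__name". */
        first[0] = '_';
        first[1] = '_';
        memcpy(first + 2, name, len + 1);
    }

    /* The second form is always "__name__". */
    second[0] = '_';
    second[1] = '_';
    memcpy(second + 2, name, len);
    memcpy(second + 2 + len, "__", 3);  /* copies both underscores and the NUL */
    return 0;
}
```

For the root asm this yields __asm and __asm__, matching the example in the list above.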

Operator Name Table

The operator name display table at qword_126DE00 maps operator kinds to printable names for diagnostics and error messages. It is populated by sub_588BB0 (initialize_opname_kinds) during fe_wrapup.c initialization.

The initialization loop iterates all 357 entries of byte_E6C0E0 (operator-to-name index), mapping each non-zero entry to the corresponding string from off_E6D240 (the token name table). Two special cases are hardcoded:

| Operator Kind | Display Name | Special Case |
| --- | --- | --- |
| 42 | () | Function call operator (overridden from default) |
| 43 | [] | Array subscript operator (overridden from default) |

Additionally, the array positions for new[] and delete[] are hardcoded separately, since these operator names do not correspond to single tokens.

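The routine validates that every entry from qword_126DE08 up to qword_126DF80 is non-null, and panics with "initialize_opname_kinds: bad init of opname_names" if any gap is found. A 48-slot table of 8-byte pointers based at qword_126DE00 ends exactly at 0x126DF80, and that address is also listed under Token State Globals as token_extra_data, so the upper bound is most likely exclusive: the check covers slots 1--47, with slot 0 left unchecked.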

Token State Globals

When a token is produced by the lexer, the following globals are populated:

| Address | Name | Type | Description |
| --- | --- | --- | --- |
| word_126DD58 | current_token_code | WORD | 16-bit token kind (0--356) |
| qword_126DD38 | current_source_position | QWORD | Encoded file/line/column |
| qword_126DD48 | token_text_ptr | QWORD | Pointer to identifier/literal text |
| src | token_start_position | char* | Start of token in input buffer |
| n | token_text_length | size_t | Length of token text |
| dword_126DF90 | token_flags_1 | DWORD | Classification flags |
| dword_126DF8C | token_flags_2 | DWORD | Additional flags |
| qword_126DF80 | token_extra_data | QWORD | Context-dependent payload |
| xmmword_106C380--106C3B0 | identifier_lookup_result | 4 x 128-bit | SSE-packed lookup result (64 bytes, 4 XMM registers) |

Cross-References

CUDA Error Catalog

cudafe++ reserves internal error indices 3457--3794 (338 slots) for CUDA-specific diagnostics. These are displayed to users as numbers 20000--20337 using the formula display = internal + 16543. Of the 338 slots, approximately 210 carry unique message templates; the remainder are reserved or share templates with parametric fill-ins. Every CUDA error can be controlled by its numeric code or diagnostic tag name via --diag_suppress, --diag_warning, --diag_error, or the #pragma nv_diagnostic system.

This page is a flat lookup table. For the diagnostic pipeline architecture (severity stack, pragma scoping, SARIF output), see Diagnostic Overview. For narrative discussion of each category with implementation details, see CUDA Errors.

Numbering and Display Format

User-visible:  file.cu(42): error #20042-D: calling a __device__ function from ...
                                    ^^^^^
                                    display code = internal + 16543

| Direction | Formula | Example |
| --- | --- | --- |
| Display to internal | internal = display - 16543 | 20042 maps to internal 3499 |
| Internal to display | display = internal + 16543 | 3457 maps to display 20000 |

The -D suffix appears when severity is 7 or below (note, remark, warning, soft error). Hard errors (severity 8+) omit the suffix.

Severity Codes

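| Code | Level | Suppressible |
| --- | --- | --- |
| 2 | note | yes |
| 4 | remark | yes |
| 5 | warning | yes |
| 6 | command-line warning | no |
| 7 | error (soft) | yes |
| 8 | error (hard, from pragma) | no |
| 9 | catastrophic error | no |
| 10 | command-line error | no |
| 11 | internal error | no |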
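
The numbering formula and the -D suffix rule can be combined into a small formatter. This is a sketch of the rules stated on this page, not code recovered from the binary; all identifiers are hypothetical.

```c
#include <stdio.h>

enum {
    CUDA_DIAG_FIRST  = 3457,   /* first internal CUDA error index */
    CUDA_DIAG_LAST   = 3794,   /* last internal CUDA error index */
    CUDA_DIAG_OFFSET = 16543   /* display = internal + offset */
};

/* Returns the user-visible number, or -1 if `internal` is outside the CUDA range. */
int cuda_diag_display(int internal) {
    if (internal < CUDA_DIAG_FIRST || internal > CUDA_DIAG_LAST)
        return -1;
    return internal + CUDA_DIAG_OFFSET;
}

/* Returns the internal index, or -1 if `display` is outside 20000--20337. */
int cuda_diag_internal(int display) {
    int internal = display - CUDA_DIAG_OFFSET;
    if (internal < CUDA_DIAG_FIRST || internal > CUDA_DIAG_LAST)
        return -1;
    return internal;
}

/* Writes "#NNNNN-D" for severity 7 or below and "#NNNNN" otherwise.
   Returns the snprintf result, or -1 for an out-of-range index. */
int cuda_diag_format(char *buf, size_t cap, int internal, int severity) {
    int display = cuda_diag_display(internal);
    if (display < 0)
        return -1;
    return snprintf(buf, cap, "#%d%s", display, severity <= 7 ? "-D" : "");
}
```

For internal index 3499 at warning severity (5) the formatter produces #20042-D; at severity 8 the suffix is dropped.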

How to Use This Catalog

Suppress by numeric code (forwarded to cudafe++ through nvcc's -Xcudafe pass-through):

nvcc -Xcudafe --diag_suppress=20042

Suppress by tag name:

nvcc -Xcudafe --diag_suppress=unsafe_device_call

In source code:

#pragma nv_diag_suppress unsafe_device_call
#pragma nv_diag_suppress 20042

Category 1: Cross-Space Calling

Checks performed by the call-graph walker comparing the execution-space byte at entity offset +182 of caller vs. callee.

Standard Cross-Space Calls (6 messages)

| Tag | Sev | Message |
| --- | --- | --- |
| unsafe_device_call | W | calling a __device__ function(%sq1) from a __host__ function(%sq2) is not allowed |
| unsafe_device_call | W | calling a __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed |
| unsafe_device_call | W | calling a __host__ function(%sq1) from a __device__ function(%sq2) is not allowed |
| unsafe_device_call | W | calling a __host__ function(%sq1) from a __global__ function(%sq2) is not allowed |
| unsafe_device_call | W | calling a __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed |
| unsafe_device_call | W | calling a __host__ function from a __host__ __device__ function is not allowed |
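
The caller/callee pairs in the table reduce to one rule: a host-only callee is flagged when the caller can run on the device, and a device-only callee is flagged when the caller can run on the host. The sketch below encodes that rule. The enum values are hypothetical and are not the bit layout of the execution-space byte at offset +182; calls to __global__ callees (kernel launches) are not covered by this table and are treated as unflagged here.

```c
/* Hypothetical execution-space labels, not the binary's encoding. */
enum exec_space { ES_HOST, ES_DEVICE, ES_HOST_DEVICE, ES_GLOBAL };

static int runs_on_host(enum exec_space s) {
    return s == ES_HOST || s == ES_HOST_DEVICE;
}

static int runs_on_device(enum exec_space s) {
    return s == ES_DEVICE || s == ES_HOST_DEVICE || s == ES_GLOBAL;
}

/* Returns 1 when the pair matches one of the unsafe_device_call rows. */
int is_unsafe_cross_space_call(enum exec_space caller, enum exec_space callee) {
    if (callee == ES_DEVICE && runs_on_host(caller))
        return 1;   /* __device__ callee reached from host-side code */
    if (callee == ES_HOST && runs_on_device(caller))
        return 1;   /* __host__ callee reached from device-side code */
    return 0;       /* __host__ __device__ callees are callable from anywhere */
}
```

A __host__ __device__ caller appears on both sides of the rule, which is why it shows up with both __device__ and __host__ callees in the table.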

Constexpr Cross-Space Calls (6 messages)

These fire when --expt-relaxed-constexpr is not enabled.

| Tag | Sev | Message |
| --- | --- | --- |
| unsafe_device_call | W | calling a constexpr __device__ function(%sq1) from a __host__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
| unsafe_device_call | W | calling a constexpr __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
| unsafe_device_call | W | calling a constexpr __host__ function(%sq1) from a __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
| unsafe_device_call | W | calling a constexpr __host__ function(%sq1) from a __global__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
| unsafe_device_call | W | calling a constexpr __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |
| unsafe_device_call | W | calling a constexpr __host__ function from a __host__ __device__ function is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this. |

Category 2: Virtual Override Mismatch

Override checker (sub_432280) extracts the 0x30 mask from the execution-space byte. __global__ is excluded because kernels cannot be virtual.

| Tag | Sev | Message |
| --- | --- | --- |
| -- | E | execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ function |
| -- | E | execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ __device__ function |
| -- | E | execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __device__ function |
| -- | E | execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __host__ __device__ function |
| -- | E | execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __device__ function |
| -- | E | execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __host__ function |

Category 3: Redeclaration Mismatch

Checked in decl_routine (sub_4CE420) and check_cuda_attribute_consistency (sub_4C6D50).

Incompatible Redeclarations (error-level)

| Tag | Sev | Message |
| --- | --- | --- |
| device_function_redeclared_with_global | E | a __device__ function(%no1) redeclared with __global__ |
| global_function_redeclared_with_device | E | a __global__ function(%no1) redeclared with __device__ |
| global_function_redeclared_with_host | E | a __global__ function(%no1) redeclared with __host__ |
| global_function_redeclared_with_host_device | E | a __global__ function(%no1) redeclared with __host__ __device__ |
| global_function_redeclared_without_global | E | a __global__ function(%no1) redeclared without __global__ |
| host_function_redeclared_with_global | E | a __host__ function(%no1) redeclared with __global__ |
| host_device_function_redeclared_with_global | E | a __host__ __device__ function(%no1) redeclared with __global__ |

Compatible Promotions (warning-level, promoted to HD)

| Tag | Sev | Message |
| --- | --- | --- |
| device_function_redeclared_with_host | W | a __device__ function(%no1) redeclared with __host__, hence treated as a __host__ __device__ function |
| device_function_redeclared_with_host_device | W | a __device__ function(%no1) redeclared with __host__ __device__, hence treated as a __host__ __device__ function |
| device_function_redeclared_without_device | W | a __device__ function(%no1) redeclared without __device__, hence treated as a __host__ __device__ function |
| host_function_redeclared_with_device | W | a __host__ function(%no1) redeclared with __device__, hence treated as a __host__ __device__ function |
| host_function_redeclared_with_host_device | W | a __host__ function(%no1) redeclared with __host__ __device__, hence treated as a __host__ __device__ function |
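
The two redeclaration tables can be summarized as a decision function. The sketch below is derived only from the rows listed above and makes three simplifying assumptions, each noted in the comments: an unannotated redeclaration is modeled as __host__, a __host__ __device__ function redeclared as __host__ or __device__ is treated as accepted because no row covers it, and the enum values are hypothetical rather than the binary's encoding.

```c
/* Hypothetical labels; an unannotated declaration is modeled as SP_HOST. */
enum space { SP_HOST, SP_DEVICE, SP_HOST_DEVICE, SP_GLOBAL };

enum redecl_result {
    REDECL_OK,          /* no diagnostic */
    REDECL_PROMOTE_HD,  /* warning; function becomes __host__ __device__ */
    REDECL_ERROR        /* incompatible redeclaration */
};

enum redecl_result check_redeclaration(enum space prev, enum space next) {
    if (prev == next)
        return REDECL_OK;
    /* Every mismatch involving __global__ is an error in the first table. */
    if (prev == SP_GLOBAL || next == SP_GLOBAL)
        return REDECL_ERROR;
    /* Assumption: no row covers HD redeclared as H or D, so accept it. */
    if (prev == SP_HOST_DEVICE)
        return REDECL_OK;
    /* Remaining cases: H or D redeclared with a different non-global space. */
    return REDECL_PROMOTE_HD;
}
```

The model merges "redeclared with __host__" and "redeclared without __global__" (or without __device__), which the catalog reports under separate tags.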

Category 4: __global__ Function Constraints

Return Type and Signature

| Tag | Sev | Message |
| --- | --- | --- |
| global_function_return_type | E | a __global__ function must have a void return type |
| global_function_deduced_return_type | E | a __global__ function must not have a deduced return type |
| global_function_has_ellipsis | E | a __global__ function cannot have ellipsis |
| global_rvalue_ref_type | E | a __global__ function cannot have a parameter with rvalue reference type |
| global_ref_param_restrict | E | a __global__ function cannot have a parameter with __restrict__ qualified reference type |
| global_va_list_type | E | A __global__ function or function template cannot have a parameter with va_list type |
| global_function_with_initializer_list | E | a __global__ function or function template cannot have a parameter with type std::initializer_list |
| global_param_align_too_big | E | cannot pass a parameter with a too large explicit alignment to a __global__ function on win32 platforms |

Declaration Context

| Tag | Sev | Message |
| --- | --- | --- |
| global_class_decl | E | A __global__ function or function template cannot be a member function |
| global_friend_definition | E | A __global__ function or function template cannot be defined in a friend declaration |
| global_function_in_unnamed_inline_ns | E | A __global__ function or function template cannot be declared within an inline unnamed namespace |
| global_operator_function | E | An operator function cannot be a __global__ function |
| global_new_or_delete | E | (__global__ on operator new/delete) |
| -- | E | function main cannot be marked __device__ or __global__ |

C++ Feature Restrictions

| Tag | Sev | Message |
| --- | --- | --- |
| global_function_constexpr | E | A __global__ function or function template cannot be marked constexpr |
| global_function_consteval | E | A __global__ function or function template cannot be marked consteval |
| global_function_inline | E | (__global__ with inline) |
| global_exception_spec | E | An exception specification is not allowed for a __global__ function or function template |

Template Argument Restrictions

| Tag | Sev | Message |
|---|---|---|
| global_private_type_arg | E | A type that is defined inside a class and has private or protected access (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the class is local to a __device__ or __global__ function |
| global_private_template_arg | E | A template that is defined inside a class and has private or protected access cannot be used in the template template argument of a __global__ function template instantiation |
| global_unnamed_type_arg | E | An unnamed type (%t) cannot be used in the template argument type of a __global__ function template instantiation, unless the type is local to a __device__ or __global__ function |
| global_func_local_template_arg | E | A type defined inside a __host__ function (%t) cannot be used in the template argument type of a __global__ function template instantiation |
| global_lambda_template_arg | E | The closure type for a lambda (%t%s) cannot be used in the template argument type of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the flag '-extended-lambda' is specified and the lambda is an extended lambda |
| local_type_used_in_global_function | W | a local type %t (defined in %sq1) used in global function %sq2 template argument, the global function cannot be launched from host code. |

Variable Template Restrictions (parallel set)

| Tag | Sev | Message |
|---|---|---|
| variable_template_private_type_arg | E | (private/protected type in variable template instantiation) |
| variable_template_private_template_arg | E | (private template template arg in variable template) |
| variable_template_unnamed_type_template_arg | E | An unnamed type (%t) cannot be used in the template argument type of a variable template instantiation, unless the type is local to a __device__ or __global__ function |
| variable_template_func_local_template_arg | E | A type defined inside a __host__ function (%t) cannot be used in the template argument type of a variable template instantiation |
| variable_template_lambda_template_arg | E | The closure type for a lambda (%t%s) cannot be used in the template argument type of a variable template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the lambda is an 'extended lambda' and the flag --extended-lambda is specified |

Variadic Template Constraints

| Tag | Sev | Message |
|---|---|---|
| global_function_multiple_packs | E | Multiple pack parameters are not allowed for a variadic __global__ function template |
| global_function_pack_not_last | E | Pack template parameter must be the last template parameter for a variadic __global__ function template |
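
The two pack rules reduce to a single pass over the template parameter list. Below is a minimal sketch under stated assumptions: `check_kernel_template_packs` is a hypothetical helper, the parameter list is reduced to one flag per parameter, and the order in which the two conditions are tested is a guess rather than something recovered from the binary.

```cpp
#include <cstddef>
#include <initializer_list>
#include <string_view>

// Returns the tag that the given template parameter list would trigger, or an
// empty view when the list is acceptable. Each element is true when the
// corresponding template parameter is a parameter pack.
constexpr std::string_view check_kernel_template_packs(
    std::initializer_list<bool> is_pack) {
    std::size_t packs = 0;
    bool last_is_pack = false;
    for (bool p : is_pack) {
        if (p) ++packs;
        last_is_pack = p;  // after the loop this describes the final parameter
    }
    if (packs > 1) return "global_function_multiple_packs";
    if (packs == 1 && !last_is_pack) return "global_function_pack_not_last";
    return "";
}

// template <typename T, typename... Ts> is accepted.
static_assert(check_kernel_template_packs({false, true}).empty());
// template <typename... Ts, typename T> is rejected.
static_assert(check_kernel_template_packs({true, false}) ==
              "global_function_pack_not_last");
```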

Launch Configuration Attributes

| Tag | Sev | Message |
|---|---|---|
| bounds_attr_only_on_global_func | E | %s is only allowed on a __global__ function |
| maxnreg_attr_only_on_global_func | E | (__maxnreg__ only on __global__) |
| missing_launch_bounds | W | no __launch_bounds__ specified for __global__ function |
| cuda_specifier_twice_in_group | E | (duplicate CUDA specifier on same declaration) |
| bounds_maxnreg_incompatible_qualifiers | E | (__launch_bounds__ and __maxnreg__ conflict) |
| -- | E | The %s qualifiers cannot be applied to the same kernel |
| -- | E | Multiple %s specifiers are not allowed |
| -- | E | incorrect value for launch bounds |

Category 5: Extended Lambda Restrictions

Extended lambdas (__device__ or __host__ __device__ lambdas in host code, enabled by --extended-lambda) must have closure types serializable for device transfer.

Capture Restrictions

| Tag | Sev | Message |
|---|---|---|
| extended_lambda_reference_capture | E | An extended %s lambda cannot capture variables by reference |
| extended_lambda_pack_capture | E | An extended %s lambda cannot capture an element of a parameter pack |
| extended_lambda_too_many_captures | E | An extended %s lambda can only capture up to 1023 variables |
| extended_lambda_array_capture_rank | E | An extended %s lambda cannot capture an array variable (type: %t) with more than 7 dimensions |
| extended_lambda_array_capture_assignable | E | An extended %s lambda cannot capture an array variable whose element type (%t) is not assignable on the host |
| extended_lambda_array_capture_default_constructible | E | An extended %s lambda cannot capture an array variable whose element type (%t) is not default constructible on the host |
| extended_lambda_init_capture_array | E | An extended %s lambda cannot init-capture variables with array type |
| extended_lambda_init_capture_initlist | E | An extended %s lambda cannot have init-captures with type std::initializer_list |
| extended_lambda_capture_in_constexpr_if | E | An extended %s lambda cannot first-capture variable in constexpr-if context |
| this_addr_capture_ext_lambda | W | Implicit capture of 'this' in extended lambda expression |
| extended_lambda_hd_init_capture | E | init-captures are not allowed for extended __host__ __device__ lambdas |
| -- | E | Unless enabled by language dialect, *this capture is only supported when the lambda is either __device__ only, or is defined within a __device__ or __global__ function |
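
The array-capture rows can be approximated with standard type traits. This is a sketch only: `array_capture_ok`, `kMaxCaptures`, and `kMaxArrayRank` are hypothetical names, the numeric limits are taken from the message texts above, and mapping "assignable on the host" to `std::is_copy_assignable` is an assumption.

```cpp
#include <cstddef>
#include <type_traits>

constexpr std::size_t kMaxCaptures = 1023;  // extended_lambda_too_many_captures
constexpr std::size_t kMaxArrayRank = 7;    // extended_lambda_array_capture_rank

// True when a captured variable of type T passes the three array-capture
// rules. Non-array types are not subject to them.
template <typename T>
constexpr bool array_capture_ok =
    !std::is_array_v<T> ||
    (std::rank_v<T> <= kMaxArrayRank &&
     std::is_default_constructible_v<std::remove_all_extents_t<T>> &&
     std::is_copy_assignable_v<std::remove_all_extents_t<T>>);

// Example element type that cannot be assigned.
struct NoAssign {
    NoAssign& operator=(const NoAssign&) = delete;
};

static_assert(array_capture_ok<int[4][4]>);
static_assert(!array_capture_ok<int[1][1][1][1][1][1][1][1]>);  // 8 dimensions
static_assert(!array_capture_ok<NoAssign[4]>);                  // not assignable
```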

Type Restrictions on Captures and Parameters

| Tag | Sev | Message |
|---|---|---|
| extended_lambda_capture_local_type | E | A type local to a function (%t) cannot be used in the type of a variable captured by an extended __device__ or __host__ __device__ lambda |
| extended_lambda_capture_private_type | E | A type that is a private or protected class member (%t) cannot be used in the type of a variable captured by an extended __device__ or __host__ __device__ lambda |
| extended_lambda_call_operator_local_type | E | A type local to a function (%t) cannot be used in the return or parameter types of the operator() of an extended __device__ or __host__ __device__ lambda |
| extended_lambda_call_operator_private_type | E | A type that is a private or protected class member (%t) cannot be used in the return or parameter types of the operator() of an extended __device__ or __host__ __device__ lambda |
| extended_lambda_parent_local_type | E | A type local to a function (%t) cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended __device__ or __host__ __device__ lambda |
| extended_lambda_parent_private_type | E | A type that is a private or protected class member (%t) cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended __device__ or __host__ __device__ lambda |
| extended_lambda_parent_private_template_arg | E | A template that is a private or protected class member cannot be used in the template argument of the enclosing parent function (and any parent classes) of an extended %s lambda |

Enclosing Parent Function Restrictions

| Tag | Sev | Message |
|---|---|---|
| extended_lambda_enclosing_function_local | E | The enclosing parent function (%sq2) for an extended %s1 lambda must not be defined inside another function |
| extended_lambda_enclosing_function_not_found | E | (no enclosing function found for extended lambda) |
| extended_lambda_inaccessible_parent | E | The enclosing parent function (%sq2) for an extended %s1 lambda cannot have private or protected access within its class |
| extended_lambda_enclosing_function_deducible | E | The enclosing parent function (%sq2) for an extended %s1 lambda must not have deduced return type |
| extended_lambda_cant_take_function_address | E | The enclosing parent function (%sq2) for an extended %s1 lambda must allow its address to be taken |
| extended_lambda_parent_non_extern | E | On Windows, the enclosing parent function (%sq2) for an extended %s1 lambda cannot have internal or no linkage |
| extended_lambda_parent_class_unnamed | E | The enclosing parent function (%sq2) for an extended %s1 lambda cannot be a member function of a class that is unnamed |
| extended_lambda_parent_template_param_unnamed | E | The enclosing parent function (%sq2) for an extended %s1 lambda cannot be in a template which has a unnamed parameter: %nd |
| extended_lambda_nest_parent_template_param_unnamed | E | The enclosing parent %n for an extended %s lambda cannot be a template which has a unnamed parameter |
| extended_lambda_multiple_parameter_packs | E | The enclosing parent template function (%sq2) for an extended %s1 lambda cannot have more than one variadic parameter, or it is not listed last in the template parameter list. |
| extended_lambda_no_parent_func | E | (extended lambda has no parent function) |
| extended_lambda_illegal_parent | E | (extended lambda in illegal parent context) |

Nesting and Context Restrictions

| Tag | Sev | Message |
|---|---|---|
| extended_lambda_enclosing_function_generic_lambda | E | An extended %s1 lambda cannot be defined inside a generic lambda expression(%sq2). |
| extended_lambda_enclosing_function_hd_lambda | E | An extended %s1 lambda cannot be defined inside an extended __host__ __device__ lambda expression(%sq2). |
| extended_lambda_inaccessible_ancestor | E | An extended %s1 lambda cannot be defined inside a class (%sq2) with private or protected access within another class |
| extended_lambda_inside_constexpr_if | E | For this host platform/dialect, an extended lambda cannot be defined inside the 'if' or 'else' block of a constexpr if statement |
| extended_lambda_multiple_parent | E | Cannot specify multiple __nv_parent directives in a lambda declaration |
| extended_host_device_generic_lambda | E | __host__ __device__ extended lambdas cannot be generic lambdas |
| -- | E | If an extended %s lambda is defined within the body of one or more nested lambda expressions, each of these enclosing lambda expressions must be defined within the immediate or nested block scope of a function. |

Specifier and Annotation

| Tag | Sev | Message |
|---|---|---|
| extended_lambda_disallowed | E | __host__ or __device__ annotation on lambda requires --extended-lambda nvcc flag |
| extended_lambda_constexpr | E | The %s1 specifier is not allowed for an extended %s2 lambda |
| lambda_operator_annotated | E | The operator() function for a lambda cannot be explicitly annotated with execution space annotations (__host__/__device__/__global__), the annotations are derived from its closure class |
| extended_lambda_discriminator | E | (extended lambda discriminator collision) |

Category 6: Device Code Restrictions

General restrictions that apply to all GPU-side code (__device__ and __global__ function bodies).

| Tag | Sev | Message |
|---|---|---|
| cuda_device_code_unsupported_operator | E | The operator '%s' is not allowed in device code |
| unsupported_type_in_device_code | E | %t %s1 a %s2, which is not supported in device code |
| -- | E | device code does not support exception handling |
| no_coroutine_on_device | E | device code does not support coroutines |
| -- | E | operations on vector types are not supported in device code |
| undefined_device_entity | E | cannot use an entity undefined in device code |
| undefined_device_identifier | E | identifier %sq is undefined in device code |
| thread_local_in_device_code | E | cannot use thread_local specifier for variable declarations in device code |
| unrecognized_pragma_device_code | W | unrecognized #pragma in device code |
| -- | E | zero-sized parameter type %t is not allowed in device code |
| -- | E | zero-sized variable %sq is not allowed in device code |
| -- | E | dynamic initialization is not supported for a function-scope static %s variable within a __device__/__global__ function |
| -- | E | function-scope static variable within a __device__/__global__ function requires a memory space specifier |
| use_of_virtual_base_on_compute_1x | E | Use of a virtual base (%t) requires the compute_20 or higher architecture |
| -- | E | alloca() is not supported for architectures lower than compute_52 |

Category 7: Kernel Launch

| Tag | Sev | Message |
|---|---|---|
| device_launch_no_sepcomp | E | kernel launch from __device__ or __global__ functions requires separate compilation mode |
| missing_api_for_device_side_launch | E | device-side kernel launch could not be processed as the required runtime APIs are not declared |
| -- | W | explicit stream argument not provided in kernel launch |
| -- | E | kernel launches from templates are not allowed in system files |
| device_side_launch_arg_with_user_provided_cctor | E | cannot pass an argument with a user-provided copy-constructor to a device-side kernel launch |
| device_side_launch_arg_with_user_provided_dtor | E | cannot pass an argument with a user-provided destructor to a device-side kernel launch |

Category 8: Memory Space and Variable Restrictions

Variable Access Across Spaces

| Tag | Sev | Message |
|---|---|---|
| device_var_read_in_host | E | a %s1 %n1 cannot be directly read in a host function |
| device_var_written_in_host | E | a %s1 %n1 cannot be directly written in a host function |
| device_var_address_taken_in_host | E | address of a %s1 %n1 cannot be directly taken in a host function |
| host_var_read_in_device | E | a host %n1 cannot be directly read in a device function |
| host_var_written_in_device | E | a host %n1 cannot be directly written in a device function |
| host_var_address_taken_in_device | E | address of a host %n1 cannot be directly taken in a device function |
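
The six tags form a 2 x 3 matrix: which side owns the variable, and how it is accessed. The lookup below is an illustrative sketch, assuming a simplified model in which every variable is either "host" or "device"; `Access` and `cross_space_tag` are hypothetical names, and exemptions the real front end applies (for example for constant-evaluated or managed variables) are not modelled.

```cpp
#include <string_view>

enum class Access { Read, Write, AddressOf };

// Returns the diagnostic tag for accessing a variable from a function in the
// other space, or an empty view when variable and function share a space.
constexpr std::string_view cross_space_tag(bool var_is_device,
                                           bool fn_is_device, Access access) {
    if (var_is_device == fn_is_device) return "";
    if (var_is_device) {  // device variable used in a host function
        switch (access) {
            case Access::Read:      return "device_var_read_in_host";
            case Access::Write:     return "device_var_written_in_host";
            case Access::AddressOf: return "device_var_address_taken_in_host";
        }
    } else {  // host variable used in a device function
        switch (access) {
            case Access::Read:      return "host_var_read_in_device";
            case Access::Write:     return "host_var_written_in_device";
            case Access::AddressOf: return "host_var_address_taken_in_device";
        }
    }
    return "";
}

static_assert(cross_space_tag(true, false, Access::Read) ==
              "device_var_read_in_host");
```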

Variable Declaration Restrictions

| Tag | Sev | Message |
|---|---|---|
| illegal_local_to_device_function | E | %s1 %sq2 variable declaration is not allowed inside a device function body |
| illegal_local_to_host_function | E | %s1 %sq2 variable declaration is not allowed inside a host function body |
| shared_specifier_in_range_for | E | the __shared__ memory space specifier is not allowed for a variable declared by the for-range-declaration |
| bad_shared_storage_class | E | __shared__ variables cannot have external linkage |
| device_variable_in_unnamed_inline_ns | E | A %s variable cannot be declared within an inline unnamed namespace |
| -- | E | member variables of an anonymous union at global or namespace scope cannot be directly accessed in __device__ and __global__ functions |
| shared_inside_struct | E | shared type inside a struct or union is not allowed |
| shared_parameter | E | (__shared__ as function parameter) |

Auto-Deduced Device References

| Tag | Sev | Message |
|---|---|---|
| auto_device_fn_ref | E | A non-constexpr __device__ function (%sq1) with "auto" deduced return type cannot be directly referenced %s2, except if the reference is absent when __CUDA_ARCH__ is undefined |
| device_var_constexpr | E | (constexpr rules for __device__ variables) |
| device_var_structured_binding | E | (structured bindings on __device__ variables) |

Category 9: __grid_constant__

The __grid_constant__ annotation (compute_70+) marks a kernel parameter as read-only grid-wide.

| Tag | Sev | Message |
|---|---|---|
| grid_constant_non_kernel | E | __grid_constant__ annotation is only allowed on a parameter of a __global__ function |
| grid_constant_not_const | E | a parameter annotated with __grid_constant__ must have const-qualified type |
| grid_constant_reference_type | E | a parameter annotated with __grid_constant__ must not have reference type |
| grid_constant_unsupported_arch | E | __grid_constant__ annotation is only allowed for architecture compute_70 or later |
| grid_constant_incompat_redecl | E | incompatible __grid_constant__ annotation for parameter %s in function redeclaration (see previous declaration %p) |
| grid_constant_incompat_templ_redecl | E | incompatible __grid_constant__ annotation for parameter %s in function template redeclaration (see previous declaration %p) |
| grid_constant_incompat_specialization | E | incompatible __grid_constant__ annotation for parameter %s in function specialization (see previous declaration %p) |
| grid_constant_incompat_instantiation_directive | E | incompatible __grid_constant__ annotation for parameter %s in instantiation directive (see previous declaration %p) |
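
The type and architecture rows can be expressed as two small predicates. This is a hedged sketch: `grid_constant_type_ok` and `grid_constant_arch_ok` are hypothetical names, and reading "const-qualified type" as top-level const is an assumption.

```cpp
#include <type_traits>

// grid_constant_not_const and grid_constant_reference_type: the parameter
// type must carry top-level const and must not be a reference.
template <typename Param>
constexpr bool grid_constant_type_ok =
    std::is_const_v<Param> && !std::is_reference_v<Param>;

// grid_constant_unsupported_arch: compute_70 or later.
constexpr bool grid_constant_arch_ok(int compute_capability) {
    return compute_capability >= 70;
}

struct Config { int n; };

static_assert(grid_constant_type_ok<const Config>);
static_assert(!grid_constant_type_ok<Config>);         // not const
static_assert(!grid_constant_type_ok<const Config&>);  // reference type
static_assert(!grid_constant_type_ok<const int*>);     // pointee const, pointer not
static_assert(grid_constant_type_ok<int* const>);      // top-level const pointer
```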

Category 10: JIT Mode

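JIT mode (device-only compilation, as used for runtime compilation through NVRTC) does not allow host functions or host variables. The -default-device flag named in the messages below reclassifies unannotated entities as __device__.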

| Tag | Sev | Message |
|---|---|---|
| no_host_in_jit | E | A function explicitly marked as a __host__ function is not allowed in JIT mode |
| unannotated_function_in_jit | E | A function without execution space annotations (__host__/__device__/__global__) is considered a host function, and host functions are not allowed in JIT mode. Consider using -default-device flag to process unannotated functions as __device__ functions in JIT mode |
| unannotated_variable_in_jit | E | A namespace scope variable without memory space annotations (__device__/__constant__/__shared__/__managed__) is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process unannotated namespace scope variables as __device__ variables in JIT mode |
| unannotated_static_data_member_in_jit | E | A class static data member with non-const type is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process such data members as __device__ variables in JIT mode |
| host_closure_class_in_jit | E | The execution space for the lambda closure class members was inferred to be __host__ (based on context). This is not allowed in JIT mode. Consider using -default-device to infer __device__ execution space for namespace scope lambda closure classes. |

Category 11: RDC / Whole-Program Mode

| Tag | Sev | Message |
|---|---|---|
| -- | E | An inline __device__/__constant__/__managed__ variable must have internal linkage when the program is compiled in whole program mode (-rdc=false) |
| template_global_no_def | E | when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit |
| extern_kernel_template | E | when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false") |
| -- | W | address of internal linkage device function (%sq) was taken (nv bug 2001144). mitigation: no mitigation required if the address is not used for comparison, or if the target function is not a CUDA C++ builtin |

Category 12: Atomics

CUDA atomics are lowered to PTX instructions, subject to size, type, scope, and memory-order constraints.

Architecture and Type Constraints

| Tag | Sev | Message |
|---|---|---|
| nv_atomic_functions_not_supported_below_sm60 | E | __nv_atomic_* functions are not supported on arch < sm_60. |
| nv_atomic_operation_not_in_device_function | E | atomic operations are not in a device function. |
| nv_atomic_function_no_args | E | atomic function requires at least one argument. |
| nv_atomic_function_address_taken | E | nv atomic function must be called directly. |
| invalid_nv_atomic_operation_size | E | atomic operations and, or, xor, add, sub, min and max are valid only on objects of size 4, or 8. |
| invalid_nv_atomic_cas_size | E | atomic CAS is valid only on objects of size 2, 4, 8 or 16 bytes. |
| invalid_nv_atomic_exch_size | E | atomic exchange is valid only on objects of size 4, 8 or 16 bytes. |
| invalid_data_size_for_nv_atomic_generic_function | E | generic nv atomic functions are valid only on objects of size 1, 2, 4, 8 and 16 bytes. |
| non_integral_type_for_non_generic_nv_atomic_function | E | non-generic nv atomic load, store, cas and exchange are valid only on integral types. |
| invalid_nv_atomic_operation_add_sub_size | E | atomic operations add and sub are not valid on signed integer of size 8. |
| nv_atomic_add_sub_f64_not_supported | W | atomic add and sub for 64-bit float is supported on architecture sm_60 or above. |
| invalid_nv_atomic_operation_max_min_float | E | atomic operations min and max are not supported on any floating-point types. |
| floating_type_for_logical_atomic_operation | E | For a logical atomic operation, the first argument cannot be any floating-point types. |
| nv_atomic_cas_b16_not_supported | E | 16-bit atomic compare-and-exchange is supported on architecture sm_70 or above. |
| nv_atomic_exch_cas_b128_not_supported | E | 128-bit atomic exchange or compare-and-exchange is supported on architecture sm_90 or above. |
| nv_atomic_load_store_b128_version_too_low | E | 128-bit atomic load and store are supported on architecture sm_70 or above. |
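
The size and architecture rows can be summarized as a pair of lookup functions. The sketch below takes its numbers directly from the message texts above; the `AtomicOp` grouping and the function names are hypothetical, and type-based rules (integral versus floating-point operands) are not modelled.

```cpp
#include <cstddef>

// Hypothetical grouping of the atomic builtins by the size rule they follow.
enum class AtomicOp { Arithmetic, Cas, Exchange, Generic };

// Object sizes (in bytes) accepted for each group.
constexpr bool atomic_size_ok(AtomicOp op, std::size_t bytes) {
    switch (op) {
        case AtomicOp::Arithmetic:  // and, or, xor, add, sub, min, max
            return bytes == 4 || bytes == 8;
        case AtomicOp::Cas:
            return bytes == 2 || bytes == 4 || bytes == 8 || bytes == 16;
        case AtomicOp::Exchange:
            return bytes == 4 || bytes == 8 || bytes == 16;
        case AtomicOp::Generic:
            return bytes == 1 || bytes == 2 || bytes == 4 || bytes == 8 ||
                   bytes == 16;
    }
    return false;
}

// Lowest sm_XX on which an otherwise valid combination is accepted.
constexpr int min_sm_for(AtomicOp op, std::size_t bytes) {
    if (bytes == 16 && (op == AtomicOp::Cas || op == AtomicOp::Exchange))
        return 90;  // nv_atomic_exch_cas_b128_not_supported
    if (bytes == 2 && op == AtomicOp::Cas)
        return 70;  // nv_atomic_cas_b16_not_supported
    return 60;      // nv_atomic_functions_not_supported_below_sm60
}

static_assert(atomic_size_ok(AtomicOp::Cas, 2));
static_assert(!atomic_size_ok(AtomicOp::Exchange, 2));
static_assert(min_sm_for(AtomicOp::Exchange, 16) == 90);
```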

Memory Order and Scope

| Tag | Sev | Message |
|---|---|---|
| nv_atomic_load_order_error | E | atomic load's memory order cannot be release or acq_rel. |
| nv_atomic_store_order_error | E | atomic store's memory order cannot be consume, acquire or acq_rel. |
| nv_atomic_operation_order_not_constant_int | E | atomic operation's memory order argument is not an integer literal. |
| nv_atomic_operation_scope_not_constant_int | E | atomic operation's scope argument is not an integer literal. |
| invalid_nv_atomic_memory_order_value | E | (invalid memory order enum value) |
| invalid_nv_atomic_thread_scope_value | E | (invalid thread scope enum value) |
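
The load and store order rules mirror the C++ memory model. As a minimal sketch, `std::memory_order` is used here as a stand-in for the order argument of the `__nv_atomic_*` builtins; the builtins' actual constant values are not shown in this section, and the function names are hypothetical.

```cpp
#include <atomic>

// nv_atomic_load_order_error: a load cannot be release or acq_rel.
constexpr bool load_order_ok(std::memory_order order) {
    return order != std::memory_order_release &&
           order != std::memory_order_acq_rel;
}

// nv_atomic_store_order_error: a store cannot be consume, acquire or acq_rel.
constexpr bool store_order_ok(std::memory_order order) {
    return order != std::memory_order_consume &&
           order != std::memory_order_acquire &&
           order != std::memory_order_acq_rel;
}

static_assert(load_order_ok(std::memory_order_acquire));
static_assert(!load_order_ok(std::memory_order_release));
static_assert(store_order_ok(std::memory_order_release));
static_assert(!store_order_ok(std::memory_order_acquire));
```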

Scope Fallback Warnings

| Tag | Sev | Message |
|---|---|---|
| nv_atomic_operations_scope_fallback_to_membar | W | atomic operations' scope argument is supported on architecture sm_60 or above. Fall back to use membar. |
| nv_atomic_operations_memory_order_fallback_to_membar | W | atomic operations' argument of memory order is supported on architecture sm_70 or above. Fall back to use membar. |
| nv_atomic_operations_scope_cluster_change_to_device | W | atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
| nv_atomic_load_store_scope_cluster_change_to_device | W | atomic load and store's scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
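
The cluster-scope warnings describe a substitution rather than a rejection. A rough model of that substitution follows; the `Scope` enumerators and `effective_scope` are hypothetical names chosen for illustration, and the membar fallbacks in the first two rows are not modelled.

```cpp
// Hypothetical scope enumeration, ordered from narrowest to widest.
enum class Scope { Thread, Block, Cluster, Device, System };

// Per the warning texts, cluster scope needs sm_90 or above; on older
// targets the request is widened to device scope.
constexpr Scope effective_scope(Scope requested, int sm) {
    return (requested == Scope::Cluster && sm < 90) ? Scope::Device : requested;
}

static_assert(effective_scope(Scope::Cluster, 80) == Scope::Device);
static_assert(effective_scope(Scope::Cluster, 90) == Scope::Cluster);
static_assert(effective_scope(Scope::Block, 80) == Scope::Block);
```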

Category 13: ASM in Device Code

The NVPTX backend supports fewer inline-assembly constraint letters than x86.

| Tag | Sev | Message |
|---|---|---|
| asm_constraint_letter_not_allowed_in_device | E | asm constraint letter '%s' is not allowed inside a __device__/__global__ function |
| asm_constraint_must_have_single_letter | E | an asm operand may specify only one constraint letter in a __device__/__global__ function |
| -- | E | The 'C' constraint can only be used for asm statements in device code |
| cc_clobber_in_device | E | The cc clobber constraint is not supported in device code |
| cuda_xasm_strict_placeholder_format | E | (strict placeholder format in CUDA asm) |
| addr_of_label_in_device_func | E | address of label extension is not supported in __device__/__global__ functions |

Category 14: #pragma nv_abi

Controls calling convention for device functions, adjusting parameter passing to match PTX ABI.

| Tag | Sev | Message |
|---|---|---|
| nv_abi_pragma_bad_format | E | (malformed #pragma nv_abi) |
| nv_abi_pragma_invalid_option | E | #pragma nv_abi contains an invalid option |
| nv_abi_pragma_missing_arg | E | #pragma nv_abi requires an argument |
| nv_abi_pragma_duplicate_arg | E | #pragma nv_abi contains a duplicate argument |
| nv_abi_pragma_not_constant | E | #pragma nv_abi argument must evaluate to an integral constant expression |
| nv_abi_pragma_not_positive_value | E | #pragma nv_abi argument value must be a positive value |
| nv_abi_pragma_overflow_value | E | #pragma nv_abi argument value exceeds the range of an integer |
| nv_abi_pragma_device_function | E | #pragma nv_abi must be applied to device functions |
| nv_abi_pragma_device_function_context | E | #pragma nv_abi is not supported inside a host function |
| nv_abi_pragma_next_construct | E | #pragma nv_abi must appear immediately before a function declaration, function definition, or an expression statement |
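
The two value checks on a #pragma nv_abi argument can be sketched as follows. This assumes the argument has already been evaluated to a 64-bit integer and that "the range of an integer" means `int`; `check_nv_abi_value` is a hypothetical helper, and the order of the two tests is a guess.

```cpp
#include <climits>
#include <string_view>

// Returns the tag triggered by an evaluated pragma argument, or an empty view
// when the value is acceptable.
constexpr std::string_view check_nv_abi_value(long long value) {
    if (value <= 0) return "nv_abi_pragma_not_positive_value";
    if (value > INT_MAX) return "nv_abi_pragma_overflow_value";
    return "";
}

static_assert(check_nv_abi_value(8).empty());
static_assert(check_nv_abi_value(0) == "nv_abi_pragma_not_positive_value");
static_assert(check_nv_abi_value(1LL + INT_MAX) ==
              "nv_abi_pragma_overflow_value");
```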

Category 15: __nv_register_params__

Forces all parameters to be passed in registers (compute_80+).

| Tag | Sev | Message |
|---|---|---|
| register_params_not_enabled | E | __nv_register_params__ support is not enabled |
| register_params_unsupported_arch | E | __nv_register_params__ is only supported for compute_80 or later architecture |
| register_params_unsupported_function | E | __nv_register_params__ is not allowed on a %s function |
| register_params_ellipsis_function | E | __nv_register_params__ is not allowed on a function with ellipsis |

Category 16: Name Expression (NVRTC)

__CUDACC_RTC__name_expr forms the mangled name of a __global__ function or __device__/__constant__ variable at compile time.

| Tag | Sev | Message |
|---|---|---|
| name_expr_parsing | E | Error in parsing name expression for lowered name lookup. Input name expression was: %sq |
| name_expr_non_global_routine | E | Name expression cannot form address of a non-__global__ function. Input name expression was: %sq |
| name_expr_non_device_variable | E | Name expression cannot form address of a variable that is not a __device__/__constant__ variable. Input name expression was: %sq |
| name_expr_not_routine_or_variable | E | Name expression must form address of a __global__ function or the address of a __device__/__constant__ variable. Input name expression was: %sq |
| name_expr_extra_tokens | E | Extra tokens found after parsing name expression for lowered name lookup. Input name expression was: %sq |
| name_expr_internal_error | E | Internal error in parsing name expression for lowered name lookup. Input name expression was: %sq |

Category 17: Texture and Surface Variables

| Tag | Sev | Message |
|---|---|---|
| texture_surface_variable_in_unnamed_inline_ns | E | A texture or surface variable cannot be declared within an inline unnamed namespace |
| -- | E | A texture or surface variable cannot be used in the non-type template argument of a __device__, __host__ __device__ or __global__ function template instantiation |
| reference_to_text_surf_type_in_device_func | E | a reference to texture/surface type cannot be used in __device__/__global__ functions |
| reference_to_text_surf_var_in_device_func | E | taking reference of texture/surface variable not allowed in __device__/__global__ functions |
| addr_of_text_surf_var_in_device_func | E | cannot take address of texture/surface variable %sq in __device__/__global__ functions |
| addr_of_text_surf_expr_in_device_func | E | cannot take address of texture/surface expression in __device__/__global__ functions |
| indir_into_text_surf_var_in_device_func | E | indirection not allowed for accessing texture/surface through variable %sq in __device__/__global__ functions |
| indir_into_text_surf_expr_in_device_func | E | indirection not allowed for accessing texture/surface through expression in __device__/__global__ functions |

Category 18: __managed__ Variables

| Tag | Sev | Message |
|---|---|---|
| managed_const_type_not_allowed | E | a __managed__ variable cannot have a const qualified type |
| managed_reference_type_not_allowed | E | a __managed__ variable cannot have a reference type |
| managed_cant_be_shared_constant | E | __managed__ variables cannot be marked __shared__ or __constant__ |
| unsupported_arch_for_managed_capability | E | __managed__ variables require architecture compute_30 or higher |
| unsupported_configuration_for_managed_capability | E | __managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system) |
| decltype_of_managed_variable | E | A __managed__ variable cannot be used as an unparenthesized id-expression argument for decltype() |

Category 19: Device Function Signature Constraints

| Tag | Sev | Message |
|---|---|---|
| device_function_has_ellipsis | E | __device__ or __host__ __device__ function with ellipsis requires compute_30 or higher architecture |
| device_func_tex_arg | E | (device function with texture argument restriction) |
| no_host_device_initializer_list | E | (std::initializer_list in __host__ __device__ context) |
| no_host_device_move_forward | E | (std::move/forward in __host__ __device__ context) |
| no_strict_cuda_error | W | (relaxed error checking mode) |

Category 20: __wgmma_mma_async Builtins

Warp Group Matrix Multiply-Accumulate builtins (sm_90a+).

| Tag | Sev | Message |
|---|---|---|
| wgmma_mma_async_not_enabled | E | __wgmma_mma_async builtins are only available for sm_90a |
| wgmma_mma_async_nonconstant_arg | E | Non-constant argument to __wgmma_mma_async call |
| wgmma_mma_async_missing_args | E | The 'A' or 'B' argument to __wgmma_mma_async call is missing |
| wgmma_mma_async_bad_shape | E | The shape %s is not supported for __wgmma_mma_async builtin |
| wgmma_mma_async_bad_A_type | E | (invalid type for operand A) |
| wgmma_mma_async_bad_B_type | E | (invalid type for operand B) |

Category 21: __block_size__ / __cluster_dims__

Architecture-dependent launch configuration attributes.

| Tag | Sev | Message |
|---|---|---|
| block_size_unsupported | E | __block_size__ is not supported for this GPU architecture |
| block_size_must_be_positive | E | (block size values must be positive) |
| cluster_dims_unsupported | E | __cluster_dims__ is not supported for this GPU architecture |
| cluster_dims_must_be_positive | E | (__cluster_dims__ values must be positive) |
| cluster_dims_too_large | E | cluster dimension value is too large |
| conflict_between_cluster_dim_and_block_size | E | cannot specify the second tuple in __block_size__ while __cluster_dims__ is present |
| max_blocks_per_cluster_unsupported | E | cannot specify max blocks per cluster for this GPU architecture |
| max_blocks_per_cluster_negative | E | max blocks per cluster must not be negative |
| max_blocks_per_cluster_too_large | E | max blocks per cluster is too large |
| too_many_blocks_in_cluster | E | total number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit for max blocks in cluster |
| shared_block_size_must_be_positive | E | the block size of a shared array must be greater than zero |
| shared_block_size_too_large | E | (shared block size exceeds maximum) |
| mismatched_shared_block_size | E | shared block size does not match one previously specified |
| ambiguous_block_size_spec | E | (ambiguous block size specification) |
| multiple_block_sizes | E | multiple block sizes not allowed |
| threads_dimension_requires_definite_block_size | E | a dynamic THREADS dimension requires a definite block size |
| shared_nonthreads_dim | E | (shared array dimension is not THREADS-based) |
| shared_affinity_type | E | (shared affinity type mismatch) |

Category 22: Inline Hint Conflicts

| Tag | Sev | Message |
|---|---|---|
| inline_hint_forceinline_conflict | E | "__inline_hint__" and "__forceinline__" may not be used on the same declaration |
| inline_hint_noinline_conflict | E | "__inline_hint__" and "__noinline__" may not be used on the same declaration |

Category 23: __local_maxnreg__

| Tag | Sev | Message |
|---|---|---|
| local_maxnreg | E | (__local_maxnreg__ attribute applied) |
| local_maxnreg_attr_only_nonmember_func | E | (__local_maxnreg__ only on non-member functions) |
| local_maxnreg_attribute_conflict | E | (__local_maxnreg__ conflicts with existing attribute) |
| local_maxnreg_negative | E | (__local_maxnreg__ value is negative) |
| local_maxnreg_too_large | E | (__local_maxnreg__ value exceeds maximum) |
| maxnreg_attr_only_nonmember_func | E | (__maxnreg__ only on non-member functions) |
| bounds_attr_only_nonmember_func | E | (launch bounds only on non-member functions) |

Category 24: Miscellaneous CUDA Errors

| Tag | Sev | Message |
|---|---|---|
| cuda_displaced_new_or_delete_operator | E | (displaced new/delete in CUDA context) |
| cuda_demote_unsupported_floating_point | W | (unsupported floating-point type demoted) |
| illegal_ucn_in_device_identifer | E | Universal character is not allowed in device entity name (%sq) |
| thread_local_for_device_vars | E | cannot use thread_local specifier for a %s variable |
| global_qualifier_not_allowed | E | (execution space qualifier not allowed here) |
| unsupported_nv_attribute | W | (unrecognized NVIDIA attribute) |
| addr_of_nv_builtin_var | E | (address-of applied to NVIDIA builtin variable) |
| shared_address_immutable | E | (__shared__ variable address is immutable) |
| nonshared_blocksizeof | E | (BLOCKSIZEOF applied to non-__shared__ variable) |
| nonshared_strict_relaxed | E | (strict/relaxed qualifier on non-__shared__ variable) |
| extern_shared | W | (extern __shared__ variable) |
| invalid_nvvm_builtin_intrinsic | E | (invalid NVVM builtin intrinsic) |
| unannotated_static_not_allowed_in_device | E | (unannotated static not allowed in device code) |
| missing_pushcallconfig | E | (cudaConfigureCall not found for kernel launch lowering) |

Complete Diagnostic Tag Index

All 286 CUDA-specific diagnostic tag names extracted from the cudafe++ binary, organized alphabetically within functional groups. Every tag can be used with --diag_suppress, --diag_warning, --diag_error, or #pragma nv_diag_suppress / nv_diag_warning / nv_diag_error.

Cross-Space / Execution Space (1 tag)

| # | Tag Name |
|---|---|
| 1 | unsafe_device_call |

Redeclaration (12 tags)

| # | Tag Name |
|---|---|
| 2 | device_function_redeclared_with_global |
| 3 | device_function_redeclared_with_host |
| 4 | device_function_redeclared_with_host_device |
| 5 | device_function_redeclared_without_device |
| 6 | global_function_redeclared_with_device |
| 7 | global_function_redeclared_with_host |
| 8 | global_function_redeclared_with_host_device |
| 9 | global_function_redeclared_without_global |
| 10 | host_device_function_redeclared_with_global |
| 11 | host_function_redeclared_with_device |
| 12 | host_function_redeclared_with_global |
| 13 | host_function_redeclared_with_host_device |

__global__ Constraints (32 tags)

| # | Tag Name |
|---|---|
| 14 | bounds_attr_only_on_global_func |
| 15 | bounds_maxnreg_incompatible_qualifiers |
| 16 | cuda_specifier_twice_in_group |
| 17 | global_class_decl |
| 18 | global_exception_spec |
| 19 | global_friend_definition |
| 20 | global_func_local_template_arg |
| 21 | global_function_consteval |
| 22 | global_function_constexpr |
| 23 | global_function_deduced_return_type |
| 24 | global_function_has_ellipsis |
| 25 | global_function_in_unnamed_inline_ns |
| 26 | global_function_inline |
| 27 | global_function_multiple_packs |
| 28 | global_function_pack_not_last |
| 29 | global_function_return_type |
| 30 | global_function_with_initializer_list |
| 31 | global_lambda_template_arg |
| 32 | global_new_or_delete |
| 33 | global_operator_function |
| 34 | global_param_align_too_big |
| 35 | global_private_template_arg |
| 36 | global_private_type_arg |
| 37 | global_qualifier_not_allowed |
| 38 | global_ref_param_restrict |
| 39 | global_rvalue_ref_type |
| 40 | global_unnamed_type_arg |
| 41 | global_va_list_type |
| 42 | local_type_used_in_global_function |
| 43 | maxnreg_attr_only_on_global_func |
| 44 | missing_launch_bounds |
| 45 | template_global_no_def |

Extended Lambda (39 tags)

| # | Tag Name |
|---|---|
| 46 | extended_host_device_generic_lambda |
| 47 | extended_lambda_array_capture_assignable |
| 48 | extended_lambda_array_capture_default_constructible |
| 49 | extended_lambda_array_capture_rank |
| 50 | extended_lambda_call_operator_local_type |
| 51 | extended_lambda_call_operator_private_type |
| 52 | extended_lambda_cant_take_function_address |
| 53 | extended_lambda_capture_in_constexpr_if |
| 54 | extended_lambda_capture_local_type |
| 55 | extended_lambda_capture_private_type |
| 56 | extended_lambda_constexpr |
| 57 | extended_lambda_disallowed |
| 58 | extended_lambda_discriminator |
| 59 | extended_lambda_enclosing_function_deducible |
| 60 | extended_lambda_enclosing_function_generic_lambda |
| 61 | extended_lambda_enclosing_function_hd_lambda |
| 62 | extended_lambda_enclosing_function_local |
| 63 | extended_lambda_enclosing_function_not_found |
| 64 | extended_lambda_hd_init_capture |
| 65 | extended_lambda_illegal_parent |
| 66 | extended_lambda_inaccessible_ancestor |
| 67 | extended_lambda_inaccessible_parent |
| 68 | extended_lambda_init_capture_array |
| 69 | extended_lambda_init_capture_initlist |
| 70 | extended_lambda_inside_constexpr_if |
| 71 | extended_lambda_multiple_parameter_packs |
| 72 | extended_lambda_multiple_parent |
| 73 | extended_lambda_nest_parent_template_param_unnamed |
| 74 | extended_lambda_no_parent_func |
| 75 | extended_lambda_pack_capture |
| 76 | extended_lambda_parent_class_unnamed |
| 77 | extended_lambda_parent_local_type |
| 78 | extended_lambda_parent_non_extern |
| 79 | extended_lambda_parent_private_template_arg |
| 80 | extended_lambda_parent_private_type |
| 81 | extended_lambda_parent_template_param_unnamed |
| 82 | extended_lambda_reference_capture |
| 83 | extended_lambda_too_many_captures |
| 84 | this_addr_capture_ext_lambda |

Device Code (16 tags)

#Tag Name
85addr_of_label_in_device_func
86asm_constraint_letter_not_allowed_in_device
87asm_constraint_must_have_single_letter
88auto_device_fn_ref
89cc_clobber_in_device
90cuda_device_code_unsupported_operator
91cuda_xasm_strict_placeholder_format
92illegal_ucn_in_device_identifer
93no_coroutine_on_device
94no_strict_cuda_error
95thread_local_in_device_code
96undefined_device_entity
97undefined_device_identifier
98unrecognized_pragma_device_code
99unsupported_type_in_device_code
100use_of_virtual_base_on_compute_1x

Device Function (4 tags)

#Tag Name
101device_func_tex_arg
102device_function_has_ellipsis
103no_host_device_initializer_list
104no_host_device_move_forward

Kernel Launch (4 tags)

#Tag Name
105device_launch_no_sepcomp
106device_side_launch_arg_with_user_provided_cctor
107device_side_launch_arg_with_user_provided_dtor
108missing_api_for_device_side_launch

Variable Access (11 tags)

#Tag Name
109device_var_address_taken_in_host
110device_var_constexpr
111device_var_read_in_host
112device_var_structured_binding
113device_var_written_in_host
114device_variable_in_unnamed_inline_ns
115host_var_address_taken_in_device
116host_var_read_in_device
117host_var_written_in_device
118illegal_local_to_device_function
119illegal_local_to_host_function

Variable Template (5 tags)

#Tag Name
120variable_template_func_local_template_arg
121variable_template_lambda_template_arg
122variable_template_private_template_arg
123variable_template_private_type_arg
124variable_template_unnamed_type_template_arg

__managed__ (6 tags)

#Tag Name
125decltype_of_managed_variable
126managed_cant_be_shared_constant
127managed_const_type_not_allowed
128managed_reference_type_not_allowed
129unsupported_arch_for_managed_capability
130unsupported_configuration_for_managed_capability

__grid_constant__ (8 tags)

#Tag Name
131grid_constant_incompat_instantiation_directive
132grid_constant_incompat_redecl
133grid_constant_incompat_specialization
134grid_constant_incompat_templ_redecl
135grid_constant_non_kernel
136grid_constant_not_const
137grid_constant_reference_type
138grid_constant_unsupported_arch

Atomics (26 tags)

#Tag Name
139floating_type_for_logical_atomic_operation
140invalid_data_size_for_nv_atomic_generic_function
141invalid_nv_atomic_cas_size
142invalid_nv_atomic_exch_size
143invalid_nv_atomic_memory_order_value
144invalid_nv_atomic_operation_add_sub_size
145invalid_nv_atomic_operation_max_min_float
146invalid_nv_atomic_operation_size
147invalid_nv_atomic_thread_scope_value
148non_integral_type_for_non_generic_nv_atomic_function
149nv_atomic_add_sub_f64_not_supported
150nv_atomic_cas_b16_not_supported
151nv_atomic_exch_cas_b128_not_supported
152nv_atomic_function_address_taken
153nv_atomic_function_no_args
154nv_atomic_functions_not_supported_below_sm60
155nv_atomic_load_order_error
156nv_atomic_load_store_b128_version_too_low
157nv_atomic_load_store_scope_cluster_change_to_device
158nv_atomic_operation_not_in_device_function
159nv_atomic_operation_order_not_constant_int
160nv_atomic_operation_scope_not_constant_int
161nv_atomic_operations_memory_order_fallback_to_membar
162nv_atomic_operations_scope_cluster_change_to_device
163nv_atomic_operations_scope_fallback_to_membar
164nv_atomic_store_order_error

JIT Mode (5 tags)

#Tag Name
165host_closure_class_in_jit
166no_host_in_jit
167unannotated_function_in_jit
168unannotated_static_data_member_in_jit
169unannotated_variable_in_jit

RDC / Whole-Program (2 tags)

#Tag Name
170extern_kernel_template
171template_global_no_def

#pragma nv_abi (10 tags)

#Tag Name
172nv_abi_pragma_bad_format
173nv_abi_pragma_device_function
174nv_abi_pragma_device_function_context
175nv_abi_pragma_duplicate_arg
176nv_abi_pragma_invalid_option
177nv_abi_pragma_missing_arg
178nv_abi_pragma_next_construct
179nv_abi_pragma_not_constant
180nv_abi_pragma_not_positive_value
181nv_abi_pragma_overflow_value

__nv_register_params__ (4 tags)

#Tag Name
182register_params_ellipsis_function
183register_params_not_enabled
184register_params_unsupported_arch
185register_params_unsupported_function

Name Expression (6 tags)

#Tag Name
186name_expr_extra_tokens
187name_expr_internal_error
188name_expr_non_device_variable
189name_expr_non_global_routine
190name_expr_not_routine_or_variable
191name_expr_parsing

Texture / Surface (7 tags)

#Tag Name
192addr_of_text_surf_expr_in_device_func
193addr_of_text_surf_var_in_device_func
194indir_into_text_surf_expr_in_device_func
195indir_into_text_surf_var_in_device_func
196reference_to_text_surf_type_in_device_func
197reference_to_text_surf_var_in_device_func
198texture_surface_variable_in_unnamed_inline_ns

__wgmma_mma_async (6 tags)

#Tag Name
199wgmma_mma_async_bad_A_type
200wgmma_mma_async_bad_B_type
201wgmma_mma_async_bad_shape
202wgmma_mma_async_missing_args
203wgmma_mma_async_nonconstant_arg
204wgmma_mma_async_not_enabled

__block_size__ / __cluster_dims__ (18 tags)

#Tag Name
205ambiguous_block_size_spec
206block_size_must_be_positive
207block_size_unsupported
208cluster_dims_must_be_positive
209cluster_dims_too_large
210cluster_dims_unsupported
211conflict_between_cluster_dim_and_block_size
212max_blocks_per_cluster_negative
213max_blocks_per_cluster_too_large
214max_blocks_per_cluster_unsupported
215mismatched_shared_block_size
216multiple_block_sizes
217shared_affinity_type
218shared_block_size_must_be_positive
219shared_block_size_too_large
220shared_nonthreads_dim
221threads_dimension_requires_definite_block_size
222too_many_blocks_in_cluster

Inline Hint (2 tags)

#Tag Name
223inline_hint_forceinline_conflict
224inline_hint_noinline_conflict

__local_maxnreg__ (7 tags)

#Tag Name
225bounds_attr_only_nonmember_func
226local_maxnreg
227local_maxnreg_attr_only_nonmember_func
228local_maxnreg_attribute_conflict
229local_maxnreg_negative
230local_maxnreg_too_large
231maxnreg_attr_only_nonmember_func

Lambda Annotation (1 tag)

#Tag Name
232lambda_operator_annotated

Miscellaneous (16 tags)

#Tag Name
233addr_of_nv_builtin_var
234bad_shared_storage_class
235cuda_demote_unsupported_floating_point
236cuda_displaced_new_or_delete_operator
237extern_shared
238invalid_nvvm_builtin_intrinsic
239missing_pushcallconfig
240nonshared_blocksizeof
241nonshared_strict_relaxed
242shared_address_immutable
243shared_inside_struct
244shared_parameter
245shared_specifier_in_range_for
246thread_local_for_device_vars
247unannotated_static_not_allowed_in_device
248unsupported_nv_attribute

Diagnostic Pragma Actions (6 tags -- not suppressible, but listed for completeness)

#Tag Name
249nv_diag_default
250nv_diag_error
251nv_diag_once
252nv_diag_remark
253nv_diag_suppress
254nv_diag_warning

Cross-Reference: EDG Error Codes Used for CUDA

The following standard EDG error codes (0--3456) are repurposed or frequently triggered by CUDA-specific validation. These display with their original number (not the 20000-D series).

Internal #  Display #  Context
21          21         CUDA auto type with template deduction
147         147        redeclaration mismatch
149         149        illegal CUDA storage class at namespace scope
246         246        static member of non-class type
298         298        typedef/using with template name
325         325        thread_local in CUDA
337         337        calling convention mismatch
453         453        in template instantiation context
551         551        not a member function
795         795        definition in class scope with external linkage (CUDA)
799         799        definition in class scope (C++20 CUDA)
891         891        anonymous type in variable declaration
892         892        auto with __constant__ variable
893         893        auto with CUDA variable
948         948        calling convention mismatch on redeclaration
992         992        fatal error (suppress-all sentinel)
1034        1034       explicit instantiation with conflicting attributes
1063        1063       in include file context
1118        1118       CUDA attribute on namespace-scope variable
1150        1150       context lines truncated
1158        1158       auto return type with __global__
1306        1306       CUDA memory space mismatch on redeclaration
1418        1418       incomplete type in definition
1430        1430       function attribute mismatch in template
1560        1560       CUDA constexpr class with non-trivial destructor
1580        1580       redeclaration with different template parameters
1655        1655       tentative definition of constexpr
2384        2384       constexpr mismatch on redeclaration (CUDA)
2442        2442       extern variable at block scope with CUDA attribute
2443        2443       extern variable at block scope with CUDA attribute (variant)
2502        2502       no_unique_address mismatch
2503        2503       no_unique_address mismatch (variant)
2656        2656       internal error (assertion failure)
2885        2885       CUDA attribute on deduction guide
2937        2937       structured binding with CUDA attribute
3033        3033       incompatible constexpr CUDA target
3116        3116       restrict qualifier on definition
3414        3414       auto with volatile/atomic qualifier
3510        3510       __shared__ variable with VLA
3566        3566       __constant__ with constexpr with auto
3567        3567       CUDA variable with VLA type
3568        3568       __constant__ with constexpr
3578        3578       CUDA attribute in discarded constexpr-if branch
3579        3579       CUDA attribute at namespace scope with structured binding
3580        3580       CUDA attribute on variable-length array
3648        3648       CUDA __constant__ with external linkage
3698        3698       parameter type mismatch
3709        3709       warnings treated as errors

Format Specifiers in CUDA Messages

CUDA error messages use the same fill-in system as EDG base errors, expanded by process_fill_in (sub_4EDCD0).

Specifier      Kind  Meaning                          Example
%sq            3     Quoted entity name               Function name in cross-space call
%sq1, %sq2     3     Indexed quoted names             Caller and callee
%no1           4     Entity name (omit kind prefix)   Function in redeclaration
%n1, %n2       4     Entity names                     Override base/derived pair
%nd            4     Entity with declaration location Template parameter
%s, %s1, %s2   3     String fill-in                   Execution space keyword
%t             6     Type fill-in                     Type in template arg errors
%p             2     Source position                  Previous declaration location
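
To make the specifier shapes concrete, here is a minimal C++ sketch that splits one specifier into its letter code and optional index. It is not a reconstruction of process_fill_in: the FillIn struct, the parse_fill_in name, and the longest-match strategy are all assumptions made for illustration, and only the letter codes are taken from the table.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical result type: letter code, optional index (0 = none), and the
// number of characters consumed (0 = not a recognized specifier).
struct FillIn {
    std::string code;
    int index;
    std::size_t length;
};

// Parses one specifier at the start of `s`. Letter codes are tried
// longest-first so that "%sq1" is not read as "%s" followed by "q1".
inline FillIn parse_fill_in(const std::string& s) {
    static const char* const kCodes[] = {"sq", "no", "nd", "n", "s", "t", "p"};
    FillIn out{"", 0, 0};
    if (s.empty() || s[0] != '%') return out;
    for (const char* code : kCodes) {
        const std::size_t n = std::strlen(code);
        if (s.compare(1, n, code) == 0) {
            out.code = code;
            out.length = 1 + n;
            // A single trailing digit is treated as the fill-in index.
            if (out.length < s.size() &&
                std::isdigit(static_cast<unsigned char>(s[out.length]))) {
                out.index = s[out.length] - '0';
                ++out.length;
            }
            return out;
        }
    }
    return out;
}
```

For example, parse_fill_in("%n2) is a __device__ function") yields code "n", index 2, and a consumed length of 3. A real expander would also need a rule for ordinary words that happen to follow a bare %s or %n, which this sketch does not attempt.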

Architecture Requirements Summary

Quick reference for minimum architecture required by various CUDA features.

Feature                                        Minimum Architecture
Virtual bases                                  compute_20
__device__/__host__ __device__ with ellipsis   compute_30
__managed__ variables                          compute_30
alloca()                                       compute_52
__nv_atomic_* functions                        sm_60
Atomic scope argument                          sm_60
Atomic add/sub for f64                         sm_60
__grid_constant__                              compute_70
Atomic memory order argument                   sm_70
16-bit atomic CAS                              sm_70
128-bit atomic load/store                      sm_70
__nv_register_params__                         compute_80
Cluster scope for atomics                      sm_90
128-bit atomic exchange/CAS                    sm_90
__wgmma_mma_async builtins                     sm_90a
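
A reimplementation would probably keep these thresholds in a lookup table rather than scattering comparisons through the code. The sketch below shows one way to do that for a few rows. The struct, the table name, and feature_supported are hypothetical, the compute_/sm_ distinction is collapsed to the numeric part, and the "a" suffix of sm_90a is not modeled.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical row type: feature label and minimum architecture number.
struct FeatureMinArch {
    const char* feature;
    int min_arch;
};

// A subset of the rows above, numeric part only.
static const FeatureMinArch kMinArch[] = {
    {"__managed__", 30},
    {"alloca", 52},
    {"__nv_atomic_*", 60},
    {"__grid_constant__", 70},
    {"__nv_register_params__", 80},
    {"atomic cluster scope", 90},
};

// Returns true if `arch` meets the minimum for `feature`.
// Unknown features are reported as unsupported.
inline bool feature_supported(const char* feature, int arch) {
    for (const FeatureMinArch& row : kMinArch) {
        if (std::strcmp(row.feature, feature) == 0) {
            return arch >= row.min_arch;
        }
    }
    return false;
}
```

With this table, feature_supported("__grid_constant__", 61) is false and feature_supported("__grid_constant__", 70) is true.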

Virtual Override Execution Space Matrix

When a derived class overrides a base class virtual function in CUDA, the execution spaces of both functions must be compatible. A __device__ virtual cannot be overridden by a __host__ function, a __host__ virtual cannot be overridden by a __device__ function, and so on. cudafe++ enforces these rules inside record_virtual_function_override (sub_432280, 437 lines, class_decl.c), which runs each time the EDG front-end registers a virtual override during class body scanning. The function performs three tasks: (1) propagate the base class's execution space obligations onto the derived function, (2) detect illegal mismatches and emit one of six dedicated error messages (3542--3547), and (3) fall through to standard EDG override recording (covariant returns, [[nodiscard]], override/final, requires-clause checks).

This page documents the override checking logic at reimplementation-grade depth: reconstructed pseudocode from the decompiled binary, a complete compatibility matrix, the six error messages with their diagnostic tags, and the relaxed-mode flag that softens certain checks.

Key Facts

Property                Value
Binary function         sub_432280 (record_virtual_function_override, 437 lines)
Source file             class_decl.c
Parameters              a1=derivation_info, a2=overriding_sym, a3=overridden_sym, a4=base_class_info, a5=covariant_return_adjustment
Entity field read       byte +182 (execution space bitfield) on both overridden and overriding entities
Classification mask     byte & 0x30 -- two-bit extraction: 0x00=implicit host, 0x10=explicit host, 0x20=device, 0x30=HD
Propagation bits        0x10 (host_explicit), 0x20 (device_annotation)
Attribute lookup        sub_5CEE70 with kind 87 (__device__) and 86 (__host__)
Error emission          sub_4F4F10 with severity 8 (hard error)
Relaxed mode flag       dword_106BFF0 (relaxed_attribute_mode)
Implicitly-HD test      byte +177 & 0x10 on entity -- constexpr / __forceinline__ bypass
Override-involved mark  byte +176 |= 0x02 on overriding entity
Assertion guard         nv_is_device_only_routine from nv_transforms.h:367

Why Virtual Functions Need Execution Space Checks

Standard C++ imposes no concept of execution space on virtual functions. CUDA introduces three execution spaces (__host__, __device__, __host__ __device__) and one launch-only space (__global__). When a virtual function in a base class is declared with one execution space, every override in every derived class must be callable in the same space. If the base declares a __device__ virtual, calling it through a base pointer on the GPU must dispatch to the derived override -- which is only possible if the override is also __device__ (or __host__ __device__).

__global__ functions cannot be virtual at all (error 3505/3506 prevents this at the attribute application stage), so the override matrix only covers three spaces: __host__, __device__, and __host__ __device__. An unannotated function counts as implicit __host__.
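
The classification that the override check relies on can be sketched in C++ as follows. The enum and function names are hypothetical; only the 0x30 mask and the four bit patterns are taken from the Key Facts table.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for the four values of the two-bit field.
enum class ExecSpace { ImplicitHost, ExplicitHost, Device, HostDevice };

// Classifies the execution-space byte (entity offset +182) by its 0x30 field.
inline ExecSpace classify_exec_space(std::uint8_t byte_182) {
    switch (byte_182 & 0x30) {
        case 0x10: return ExecSpace::ExplicitHost;
        case 0x20: return ExecSpace::Device;
        case 0x30: return ExecSpace::HostDevice;
        default:   return ExecSpace::ImplicitHost;  // 0x00: no annotation
    }
}

// Implicit and explicit host both lack the device_annotation bit (0x20),
// which is why the override matrix needs only three spaces.
inline bool is_host_only(std::uint8_t byte_182) {
    return (byte_182 & 0x20) == 0;
}
```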

Function Entry: Mark and Resolve Entities

The function begins by resolving the actual entity nodes from the symbol table entries:

// sub_432280 entry (lines 60-69 of decompiled output)
//
// a2 = overriding_sym (symbol table entry for the derived-class function)
// a3 = overridden_sym (symbol table entry for the base-class function)
//
// v10 = entity of overridden function:  *(overridden_sym + 88)
// v11 = entity of overriding function:  *(*(overriding_sym) + 88)
//
// The entity node at offset +88 is the "associated routine entity" --
// the actual function representation containing execution space bits.

int64_t overridden_entity = *(int64_t*)(overridden_sym + 88);   // v10
int64_t overriding_entity = *(int64_t*)(*(int64_t*)overriding_sym + 88);  // v11

// Mark the overriding entity as "involved in an override"
*(uint8_t*)(overriding_entity + 176) |= 0x02;

The +176 |= 0x02 flag marks the derived function as "override-involved." This flag is consumed downstream by the exception specification resolver and other class completion logic.

Phase 1: Implicitly-HD Fast Path and Execution Space Propagation

The first branch tests byte +177 & 0x10 on the overriding entity. This bit indicates the function is implicitly __host__ __device__ -- set for constexpr functions (implicitly HD since CUDA 7.5) and __forceinline__ functions. When this bit is set, the override is exempt from mismatch checking, but execution space propagation still occurs.

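// Phase 1: implicitly-HD check and propagation (lines 70-94)
// Pseudocode: entity_t is a notional struct view of the raw entity node;
// byte_N / qword_N name the field at offset +N.
void check_and_propagate(entity_t *overriding_entity, entity_t *overridden_entity) {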

    if (overriding_entity->byte_177 & 0x10) {
        // Overriding function is implicitly HD (constexpr / __forceinline__)
        //
        // Skip mismatch errors entirely -- an implicitly-HD function is
        // compatible with any base execution space.  But we must still
        // propagate the base's space obligations onto the derived entity
        // so that downstream passes (IL marking, code generation) know
        // what to emit.

        if (!(overridden_entity->byte_177 & 0x10)) {
            // Overridden function is NOT implicitly HD -- it has an explicit
            // execution space.  We need to propagate that space.
            //
            // Guard: skip propagation for constexpr lambdas with internal
            // linkage but no override flag (a degenerate case).
            if ((overridden_entity->qword_184 & 0x800001000000) == 0x800000000000
                && !(overridden_entity->byte_176 & 0x02)) {
                // Degenerate case -- skip propagation
                goto done_nvidia_checks;
            }

            uint8_t base_es = overridden_entity->byte_182;

            // Propagate __host__ obligation:
            // If the base is NOT device-only (i.e., base is host, HD, or
            // unannotated), the derived function inherits the host obligation.
            if ((base_es & 0x30) != 0x20) {
                overriding_entity->byte_182 |= 0x10;   // set host_explicit
            }

            // Propagate __device__ obligation:
            // If the base has the device_annotation bit set, the derived
            // function inherits the device obligation.
            if (base_es & 0x20) {
                overriding_entity->byte_182 |= 0x20;   // set device_annotation
            }
        }

        goto done_nvidia_checks;
    }

    // ... Phase 2 continues below
}

Why Propagation Matters

Propagation ensures that a derived class inherits its base class's execution space obligations even when the derived function is implicitly HD. Consider:

struct Base {
    __device__ virtual void f();        // byte_182 & 0x30 == 0x20
};

struct Derived : Base {
    constexpr void f() override;        // byte_177 & 0x10 set (implicitly HD)
};

Without propagation, Derived::f would have byte_182 == 0x00 (no explicit annotation). The device-side IL pass would skip it, and a virtual call base_ptr->f() on the GPU would dispatch to a function never compiled for the device. Propagation sets byte_182 |= 0x20 (device_annotation), ensuring the function is included in device IL.

The propagation follows strict rules:

Base byte_182 & 0x30    Propagated to overriding entity
0x00 (implicit host)    |= 0x10 (host_explicit)
0x10 (explicit host)    |= 0x10 (host_explicit)
0x20 (device)           |= 0x20 (device_annotation)
0x30 (HD)               |= 0x10 then |= 0x20 (both)
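
The four rows can be checked against a small runnable model. This is a sketch under stated assumptions: propagate_exec_space is a hypothetical helper that models only the two OR operations on byte +182, not the degenerate-case guard or any other entity field.

```cpp
#include <cassert>
#include <cstdint>

// Bit names follow the "Propagation bits" row of Key Facts.
constexpr std::uint8_t kHostExplicit     = 0x10;
constexpr std::uint8_t kDeviceAnnotation = 0x20;

// Returns the overriding entity's byte +182 after propagation from the base.
inline std::uint8_t propagate_exec_space(std::uint8_t base_es,
                                         std::uint8_t derived_es) {
    if ((base_es & 0x30) != kDeviceAnnotation) {
        derived_es |= kHostExplicit;       // base is host, implicit host, or HD
    }
    if (base_es & kDeviceAnnotation) {
        derived_es |= kDeviceAnnotation;   // base is device or HD
    }
    return derived_es;
}
```

Starting from an unannotated derived byte of 0x00, the four base values 0x00, 0x10, 0x20, and 0x30 produce 0x10, 0x10, 0x20, and 0x30 respectively. Bits outside the 0x30 field of the derived byte are left untouched.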

Phase 2: Explicit Annotation Mismatch Detection

When the overriding function is NOT implicitly HD ((byte_177 & 0x10) == 0), the checker must verify that the derived function's explicit execution space matches the base. It does this by querying the attribute lists on the overriding symbol for __device__ (kind 87) and __host__ (kind 86) attributes using sub_5CEE70.

The overriding symbol has two attribute list pointers: offset +184 (primary attributes) and offset +200 (secondary/redeclaration attributes). Both are checked for each attribute kind.

Reconstructed Pseudocode

// Phase 2: explicit annotation mismatch detection (lines 96-188)
//
// At this point, (overriding_entity->byte_177 & 0x10) == 0 (not implicitly HD).
// We must determine what execution space annotations the overriding function
// has, and compare against the overridden function's execution space.
//
// Pseudocode: symbol_t / entity_t are notional struct views of the raw nodes;
// byte_N and attr_list_N name the field at offset +N.

void check_override_mismatch(
    symbol_t *overriding_sym,     // a2
    entity_t *overriding_entity,  // v11
    entity_t *overridden_entity,  // v10
    int64_t overridden_sym_list,  // v6 = a2+48 (location info for diagnostics)
    int64_t overridden_sym_arg,   // v8 = a3 (for diagnostics)
    int64_t base_sym              // v9 = *a2 (for diagnostics)
) {
    // -- Assertion: overridden entity must exist --
    if (!overridden_entity) {
        internal_error("nv_transforms.h", 367, "nv_is_device_only_routine");
    }

    // -- Extract overridden execution space --
    uint8_t base_es    = overridden_entity->byte_182;
    uint8_t mask_30    = base_es & 0x30;     // 0x00/0x10/0x20/0x30
    bool    base_no_device_annotation = (base_es & 0x20) == 0;  // v56
    bool    base_is_hd = (mask_30 == 0x30);  // v58
    uint8_t base_device_bit = base_es & 0x20;  // v55

    // -- Check overriding function for __device__ attribute (kind 87) --
    bool has_device_attr = find_attribute(87, overriding_sym->attr_list_184)
                        || find_attribute(87, overriding_sym->attr_list_200);

    if (has_device_attr) {
        // Overriding function has __device__.
        // Now check if it also has __host__ (kind 86) -- making it HD.

        bool has_host_attr = find_attribute(86, overriding_sym->attr_list_184)
                          || find_attribute(86, overriding_sym->attr_list_200);

        if (has_host_attr) {
            // --- Overriding is __host__ __device__ ---
            if (base_device_bit) {
                // Base has device_annotation (bit 5 set).
                // If base is device-only (mask_30 == 0x20), error 3544.
                if (mask_30 == 0x20) {
                    emit_error(8, 3544, location, overridden, base);
                }
                // If base is HD (mask_30 == 0x30), it's legal -- no error.
                // If base has device_bit but mask_30 != 0x20 and != 0x30,
                // that can't happen (bit 5 set implies mask_30 is 0x20 or 0x30).
            } else {
                // Base has no device_annotation -- base is host or implicit host.
                emit_error(8, 3543, location, overridden, base);
            }
        } else {
            // --- Overriding is __device__ only ---
            // Fall through to LABEL_83 logic.
            goto device_only_check;
        }
    } else {
        // Overriding function has NO __device__ attribute.
        // It's either explicit __host__ or implicit host (no annotation).

        if (dword_106BFF0) {
            // Relaxed mode: check if overriding has explicit __host__.
            bool has_host_attr = find_attribute(86, overriding_sym->attr_list_184)
                              || find_attribute(86, overriding_sym->attr_list_200);

            if (!has_host_attr) {
                // No explicit __host__ either -- implicit host.
                // In relaxed mode, an implicit-host override is treated like
                // a device-only override for certain base configurations.
                // Jump into the device-only path with modified conditions.
                goto device_only_check_relaxed;
            }
            // Explicit __host__ in relaxed mode: fall through to normal checks.
        }

        // --- Overriding is __host__ (explicit or implicit) ---
        if (mask_30 == 0x20) {
            // Base is __device__ only
            emit_error(8, 3545, location, overridden, base);
        } else if (mask_30 == 0x30) {
            // Base is __host__ __device__
            emit_error(8, 3546, location, overridden, base);
        }
        // else: base is host/implicit-host, same space -- no error.
        goto done_nvidia_checks;
    }

device_only_check:
    // Overriding is __device__ only (has __device__ but no __host__).
    // v39 = base_no_device_annotation (v56), v40 = 1 (always set entering here).
    {
        bool should_error = base_no_device_annotation;  // v39
        bool relaxed_extra = true;                      // v40

device_only_check_relaxed:
        // (relaxed mode entry: v39 = 0, a1 = v56 = base_no_device_annotation)

        if (dword_106BFF0) {
            // Relaxed mode: the error fires unconditionally when
            // base has no device annotation (base is host/implicit-host).
            // In strict mode, same condition applies.
            should_error = base_no_device_annotation;
            relaxed_extra = true;   // always true in relaxed
        }

        if (should_error) {
            // Base is host-only (no device_annotation) and override is device-only.
            emit_error(8, 3542, location, overridden, base);
        } else if (base_is_hd && relaxed_extra) {
            // Base is HD, override is device-only.
            // v40 (relaxed_extra) is always 1 from Entry A, so this
            // fires in both strict and relaxed modes for D-overrides-HD.
            emit_error(8, 3547, location, overridden, base);
        }
        // else: base is device-only too -- compatible, no error.
    }

done_nvidia_checks:
    // Continue to standard EDG override recording...
}

Decision Tree (Simplified)

overriding byte_177 & 0x10?
  YES (implicitly HD) --> propagate, skip mismatch check
  NO  --> extract base_es = overridden byte_182
          has __device__ attr on overriding?
            YES --> also has __host__ attr?
              YES (override=HD):
                base has device_annotation?
                  YES and mask_30==0x20 --> ERROR 3544
                  NO                    --> ERROR 3543
              NO (override=D-only):
                base has NO device_annotation? --> ERROR 3542
                base is HD?                    --> ERROR 3547
            NO (override=H or implicit-H):
              base mask_30==0x20 --> ERROR 3545
              base mask_30==0x30 --> ERROR 3546
              otherwise         --> legal (same space)
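
The explicit-annotation branches of the tree can be written as a small pure function, which makes them easy to test in isolation. The sketch below covers strict mode only and leaves out the implicitly-HD fast path and the relaxed-mode entry. The function name and the convention of returning 0 for a legal override are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Returns the internal error number, or 0 when the override is legal.
// base_es is the overridden entity's byte +182; the two flags say whether
// the overriding symbol carries __device__ (kind 87) and __host__ (kind 86).
inline int check_override_strict(std::uint8_t base_es,
                                 bool has_device_attr,
                                 bool has_host_attr) {
    const std::uint8_t mask_30 = base_es & 0x30;
    const bool base_has_device = (base_es & 0x20) != 0;

    if (has_device_attr && has_host_attr) {      // override is HD
        if (!base_has_device) return 3543;       // base is host
        return mask_30 == 0x20 ? 3544 : 0;       // device-only base, else HD base
    }
    if (has_device_attr) {                       // override is device-only
        if (!base_has_device) return 3542;       // base is host
        return mask_30 == 0x30 ? 3547 : 0;       // HD base, else device base
    }
    // Override is explicit or implicit host.
    if (mask_30 == 0x20) return 3545;
    if (mask_30 == 0x30) return 3546;
    return 0;
}
```

For instance, check_override_strict(0x20, true, true) returns 3544 (HD override of a device-only base), while check_override_strict(0x30, true, true) returns 0.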

The Six Error Messages

Each mismatch produces one of six errors. All are emitted at severity 8 (hard error) and are individually suppressible by their diagnostic tag via --diag_suppress or #pragma nv_diag_suppress.

Internal  Display  Diagnostic Tag             Message Template
3542      20085    vfunc_incompat_exec_h_d    execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __device__ function
3543      20086    vfunc_incompat_exec_h_hd   execution space mismatch: overridden entity (%n1) is a __host__ function, but overriding entity (%n2) is a __host__ __device__ function
3544      20087    vfunc_incompat_exec_d_hd   execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ __device__ function
3545      20088    vfunc_incompat_exec_d_h    execution space mismatch: overridden entity (%n1) is a __device__ function, but overriding entity (%n2) is a __host__ function
3546      20089    vfunc_incompat_exec_hd_h   execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __host__ function
3547      20090    vfunc_incompat_exec_hd_d   execution space mismatch: overridden entity (%n1) is a __host__ __device__ function, but overriding entity (%n2) is a __device__ function

The display number is computed as internal + 16543 (the standard CUDA error renumbering from construct_text_message). The tag naming convention is vfunc_incompat_exec_{overridden}_{overriding}.
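
The renumbering is a fixed offset, so converting between the two numbering schemes is a one-line computation. The helpers below are hypothetical and only apply the offset; they do not decide whether a given code belongs to the CUDA series in the first place.

```cpp
#include <cassert>

// Offset quoted in the text for the CUDA-series renumbering.
constexpr int kCudaDisplayOffset = 16543;

// Maps an internal CUDA-series error number to the number shown to the user.
inline int cuda_display_number(int internal) {
    return internal + kCudaDisplayOffset;
}

// Inverse mapping, handy when starting from a 20xxx number in compiler output.
inline int cuda_internal_number(int display) {
    return display - kCudaDisplayOffset;
}
```

Applied to the six override errors, 3542 maps to 20085 and 3547 maps to 20090, matching the table.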

The %n1 and %n2 fill-ins resolve to the entity display names of the base and derived functions respectively, including their full qualified names and parameter types.

Suppression Example

# Suppress by tag (preferred)
nvcc --diag_suppress=vfunc_incompat_exec_h_d file.cu

# Suppress by display number
nvcc --diag_suppress=20085 file.cu

# Suppress in source
#pragma nv_diag_suppress vfunc_incompat_exec_h_d

Complete Compatibility Matrix

This table shows every combination of base (overridden) and derived (overriding) execution space. "Implicit H" means the function has no execution space annotation ((byte_182 & 0x30) == 0x00). Since implicit host and explicit __host__ are treated identically for override purposes (both lack the device_annotation bit and have mask_30 != 0x20), they share the same row/column behavior.

__global__ is excluded because __global__ functions cannot be virtual -- the attribute handler rejects __global__ on virtual functions before override checking ever runs.

The matrix is the same in both strict mode (dword_106BFF0 == 0) and relaxed mode (dword_106BFF0 == 1). The relaxed flag changes the code path used to reach the error decision but produces the same result for all input combinations.

Base / Derived         Derived: H / implicit H   Derived: D   Derived: HD   Derived: implicitly HD
Base: H / implicit H   legal                     error 3542   error 3543    legal + propagate |= 0x10
Base: D                error 3545                legal        error 3544    legal + propagate |= 0x20
Base: HD               error 3546                error 3547   legal         legal + propagate |= 0x10, |= 0x20

Reading the matrix: each row is the base class virtual function's space; each column is the derived class override's space. "Legal" means no error is emitted and the override is recorded normally. "Legal + propagate" means the override is accepted AND the base's execution space bits are OR'd into the derived entity's byte_182.

The diagonal (same space in base and derived) is always legal. The last column (implicitly HD) is always legal because an implicitly HD function is compatible with every execution space -- the mismatch check is skipped entirely and only propagation runs.
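
Because every off-diagonal cell in the first three columns is an error, and the tags follow the vfunc_incompat_exec_{overridden}_{overriding} convention, the explicit-annotation part of the matrix can be regenerated from the two spaces alone. The sketch below does that; the Space enum and both function names are hypothetical, and the implicitly-HD column is not modeled.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum; H covers both implicit and explicit host.
enum class Space { H, D, HD };

// Suffix used in the diagnostic tag for each space.
inline const char* space_suffix(Space s) {
    switch (s) {
        case Space::H: return "h";
        case Space::D: return "d";
        default:       return "hd";
    }
}

// Returns the diagnostic tag for an explicitly annotated override,
// or an empty string when base and derived are in the same space.
inline std::string override_tag(Space base, Space derived) {
    if (base == derived) return "";
    return std::string("vfunc_incompat_exec_") + space_suffix(base) + "_" +
           space_suffix(derived);
}
```

For example, override_tag(Space::HD, Space::D) yields "vfunc_incompat_exec_hd_d", the tag listed for error 3547.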

Why Both Modes Produce the Same Matrix

Tracing the LABEL_83 code path with the two entry points reveals that dword_106BFF0 does NOT gate error 3547. In the critical device-only-override path (Entry A), v40 is set to 1 before reaching LABEL_83 regardless of the relaxed flag. The flag only changes the assignment to a1 and v40 via conditional moves (cmovz/cmovnz in the disassembly), but the net effect is identical for all input combinations:

LABEL_83 internals (decompiled, annotated):
  a2 = 3542;                          // tentative error
  if (!dword_106BFF0) a1 = v39;       // strict: a1 = v39
  if (dword_106BFF0) v40 = 1;         // relaxed: force v40 = 1
  // BUT v40 was already 1 from Entry A (line 134)
  if (a1) emit_error(3542);           // base has no device_annotation
  else if (v58 && v40) emit_error(3547);  // base is HD
  else skip;                          // base is D-only (compatible)

Entry A sets v39 = v56, v40 = 1, a1 = v56. In strict mode, a1 is overwritten to v39 (same value). In relaxed mode, a1 stays v56 (same value). Either way, a1 = v56 = (base has no device annotation). The v40 = 1 from Entry A is preserved. The result is identical.
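
The Entry A argument can be checked mechanically by transcribing the annotated listing into a pure function and running it under both flag values. This is a sketch of the listing as annotated above, not of the binary itself; the function names and the use of 0 for "no error" are assumptions, and Entry B is deliberately not modeled.

```cpp
#include <cassert>

// Transcription of the annotated LABEL_83 listing. Parameter names mirror
// the decompiler's variables. Returns the internal error number, or 0.
inline int label_83(bool relaxed, bool a1, bool v39, bool v40, bool v58) {
    if (!relaxed) a1 = v39;        // strict: a1 = v39
    if (relaxed)  v40 = true;      // relaxed: force v40 = 1
    if (a1) return 3542;           // base has no device_annotation
    if (v58 && v40) return 3547;   // base is HD
    return 0;                      // base is device-only (compatible)
}

// Entry A: device-only override.
// v56 = base has no device annotation, v58 = base is HD.
inline int entry_a(bool relaxed, bool v56, bool v58) {
    return label_83(relaxed, /*a1=*/v56, /*v39=*/v56, /*v40=*/true, v58);
}
```

For each of the three reachable base configurations (host, HD, device-only), entry_a returns the same value with relaxed set to false and to true: 3542, 3547, and 0 respectively.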

The relaxed flag introduces a second entry point (Entry B) for overriding functions with no explicit annotation. In relaxed mode, such functions are routed through LABEL_83 with v39 = 0 and a1 = v56, producing the same device-only check logic. In strict mode, the same functions take the direct H/implicit-H path and produce errors 3545/3546 for device/HD bases. Both paths reach the same conclusions.

Relaxed Mode: The Unannotated Override Path

When dword_106BFF0 == 1 and the overriding function has no __device__ attribute, the checker takes an additional step before falling through to the H/implicit-H path. It queries the overriding symbol for explicit __host__ (kind 86). If __host__ IS found, the function is confirmed as explicit host and errors 3545/3546 apply normally. If __host__ is NOT found (truly unannotated), the function is reclassified through the device-only check path (LABEL_83). This reclassification does not change the error outcome -- an unannotated function overriding a host base still sees no error (both are host-space), and an unannotated function overriding a device or HD base still produces the appropriate error.

Propagation Details

When the overriding function is implicitly HD (byte_177 & 0x10), execution space is propagated from the base to the derived entity by OR-ing bits into byte_182:

// Propagation (direct from decompiled sub_432280, lines 77-91)
uint8_t base_es = overridden_entity->byte_182;

// If base is NOT device-only, derived inherits host obligation
if ((base_es & 0x30) != 0x20) {
    overriding_entity->byte_182 |= 0x10;   // host_explicit bit
    base_es = overridden_entity->byte_182;  // re-read (compiler artifact)
}

// If base has device_annotation, derived inherits device obligation
if (base_es & 0x20) {
    overriding_entity->byte_182 |= 0x20;   // device_annotation bit
}

The re-read of overridden_entity->byte_182 after setting 0x10 is a compiler artifact: the decompiler shows v10+182 being loaded into v22 a second time, but v10 is the overridden entity, which the preceding store does not touch, so the value is unchanged. Both OR operations write to the overriding entity only.

Propagation Matrix

| Base space (byte_182 & 0x30) | Bits OR'd into overriding byte_182 | Net effect on overriding entity |
|---|---|---|
| 0x00 (implicit H) | 0x10 | Becomes explicit host (0x10) |
| 0x10 (explicit H) | 0x10 | Becomes explicit host (0x10) |
| 0x20 (D only) | 0x20 | Becomes device-annotated (0x20) |
| 0x30 (HD) | 0x10, then 0x20 | Becomes HD (0x30) |

After propagation, the overriding entity's byte_182 accurately reflects the execution space obligations inherited from its base class. Downstream passes (device/host separation, IL marking, code generation) use this byte to determine whether the function needs device-side compilation, host-side compilation, or both.
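The propagation step can be isolated as a pure function and checked against each row of the matrix. This is a sketch under stated assumptions: the macro and function names are hypothetical, and the entity is reduced to its byte_182 value; only the bit values and the two conditions come from the listing above.

```c
#include <stdint.h>

/* Bit meanings follow the comments in the decompiled listing. */
#define ES_HOST_EXPLICIT      0x10u
#define ES_DEVICE_ANNOTATION  0x20u
#define ES_MASK               0x30u

/* Hypothetical helper: returns the overriding entity's byte_182 after
 * propagation from a base whose byte_182 is base_es. */
static uint8_t propagate_exec_space(uint8_t base_es, uint8_t derived_es)
{
    /* Base is not device-only: derived inherits the host obligation. */
    if ((base_es & ES_MASK) != ES_DEVICE_ANNOTATION)
        derived_es |= ES_HOST_EXPLICIT;

    /* Base carries the device annotation: derived inherits it. */
    if (base_es & ES_DEVICE_ANNOTATION)
        derived_es |= ES_DEVICE_ANNOTATION;

    return derived_es;
}
```

Because the function only ORs bits in, any bits already set on the overriding entity (inside or outside the 0x30 mask) are preserved.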

Relaxed Mode (dword_106BFF0)

The global flag dword_106BFF0 (relaxed_attribute_mode, default 1 per CLI defaults) controls permissive handling of execution space annotations across the compiler. Its primary effects are on attribute application (allowing __device__ + __global__ coexistence) and cross-space call validation. For virtual override checking, its effect is narrower:

  1. Unannotated override reclassification. In relaxed mode, when the overriding function has no __device__ attribute, the checker additionally queries the overriding symbol for an explicit __host__ (kind 86). If __host__ is NOT found, the checker treats the unannotated function as potentially device-compatible and routes it through the device-only check path (LABEL_83). This can produce error 3542 (D overrides H) for an implicit-host function, which would otherwise only see errors 3545/3546.

  2. No error suppression for overrides. Unlike attribute application where relaxed mode suppresses error 3481, relaxed mode does NOT suppress any of the six override errors. All six fire at severity 8 in both modes. The flag dword_106BFF0 modulates the code path taken to reach the error decision, not the severity or suppression of the error itself.

Additional Override Checks (Non-CUDA)

After the CUDA execution space checks, sub_432280 continues with standard EDG override validation:

| Error | Condition | Meaning |
|---|---|---|
| 1788 | Base has [[nodiscard]], derived does not | Missing [[nodiscard]] on override |
| 1789 | Derived has [[nodiscard]], base does not | Extraneous [[nodiscard]] on override |
| 1850 | Overriding a final virtual function | Override of final function |
| 2935 | Derived has requires-clause, base does not | Requires-clause mismatch |
| 2936 | Base has requires-clause, derived does not | Requires-clause mismatch |

These are standard C++ checks unrelated to CUDA execution spaces.
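The two symmetric mismatch pairs in the table reduce to simple comparisons. This is an illustrative sketch only: the function names and boolean parameters are hypothetical, and a return value of 0 is used here to mean "no diagnostic"; the error numbers and conditions are those listed in the table.

```c
#include <stdbool.h>

/* Hypothetical helper: selects the [[nodiscard]] mismatch error, if any. */
static int nodiscard_mismatch_error(bool base_has, bool derived_has)
{
    if (base_has && !derived_has)
        return 1788;   /* missing [[nodiscard]] on override */
    if (derived_has && !base_has)
        return 1789;   /* extraneous [[nodiscard]] on override */
    return 0;          /* attributes agree */
}

/* Hypothetical helper: selects the requires-clause mismatch error, if any. */
static int requires_mismatch_error(bool base_has, bool derived_has)
{
    if (derived_has && !base_has)
        return 2935;   /* derived adds a requires-clause */
    if (base_has && !derived_has)
        return 2936;   /* derived drops the requires-clause */
    return 0;          /* both present or both absent */
}
```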

Example: Override Interactions

// Example 1: Legal same-space override
struct Base {
    __device__ virtual void f();
};
struct Derived : Base {
    __device__ void f() override;     // Legal: D overrides D
};

// Example 2: Error 3542 -- D overrides H
struct Base2 {
    virtual void f();                 // Implicit __host__
};
struct Derived2 : Base2 {
    __device__ void f() override;     // ERROR 3542 (20085)
};
// error #20085-D: execution space mismatch: overridden entity (Base2::f)
//   is a __host__ function, but overriding entity (Derived2::f)
//   is a __device__ function

// Example 3: Error 3546 -- H overrides HD
struct Base3 {
    __host__ __device__ virtual void f();
};
struct Derived3 : Base3 {
    void f() override;                // ERROR 3546 (20089)
};
// error #20089-D: execution space mismatch: overridden entity (Base3::f)
//   is a __host__ __device__ function, but overriding entity (Derived3::f)
//   is a __host__ function

// Example 4: Legal constexpr override with propagation
struct Base4 {
    __device__ virtual int g();
};
struct Derived4 : Base4 {
    constexpr int g() override;       // Legal: implicitly HD, propagates |= 0x20
};
// Derived4::g now has byte_182 |= 0x20 (device_annotation)
// and is included in device IL compilation.

// Example 5: Error 3547 -- D overrides HD
struct Base5 {
    __host__ __device__ virtual void h();
};
struct Derived5 : Base5 {
    __device__ void h() override;     // ERROR 3547 (20090)
};

Function Map

| Address | Identity | Lines | Source |
|---|---|---|---|
| sub_432280 | record_virtual_function_override | 437 | class_decl.c |
| sub_5CEE70 | find_attribute (attribute list lookup by kind) | ~30 | attribute.c |
| sub_4F4F10 | emit_diag_with_entity_pair (severity, error, loc, base, derived) | ~100 | error.c |
| sub_4F2930 | internal_error (assertion failure) | ~20 | error.c |
| sub_41A6E0 | dump_override_entry (debug trace helper) | ~40 | class_decl.c |
| sub_41D010 | add_to_override_list | ~20 | class_decl.c |
| sub_5E20D0 | allocate_override_entry (40-byte node) | ~15 | mem.c |
| sub_432130 | resolve_indeterminate_exception_specification | ~60 | class_decl.c |

Override Entry Structure

Each recorded override is stored as a 40-byte linked list node:

Override entry (40 bytes):
  +0x00 (0):   next pointer
  +0x08 (8):   base_class_symbol (entity in base class vtable)
  +0x10 (16):  derived_class_entity (overriding function entity)
  +0x18 (24):  flags (0 initially, set during processing)
  +0x20 (32):  covariant_return_adjustment (pointer or NULL)

The override list is managed via:

  • qword_E7FE98: list head (most recent entry)
  • qword_E7FEA0: free list head (recycled 40-byte entries)
  • qword_E7FE90: allocation counter
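The node layout and the three globals can be modeled as follows. This is a sketch under several assumptions, none confirmed by the text above: all type, field, and function names are hypothetical; flags is modeled as a full 64-bit field (it may be narrower with padding); the counter is assumed to count fresh allocations only; and the allocator is assumed to reuse free-list entries before allocating. It assumes an LP64 target such as x86-64.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct override_entry {
    struct override_entry *next;          /* +0x00 */
    void *base_class_symbol;              /* +0x08 */
    void *derived_class_entity;           /* +0x10 */
    uint64_t flags;                       /* +0x18, zero initially */
    void *covariant_return_adjustment;    /* +0x20, pointer or NULL */
} override_entry;

static override_entry *list_head;  /* models qword_E7FE98 */
static override_entry *free_head;  /* models qword_E7FEA0 */
static size_t alloc_count;         /* models qword_E7FE90 (assumed meaning) */

/* Take a recycled node if one exists, otherwise allocate a fresh one. */
static override_entry *alloc_override_entry(void)
{
    override_entry *e = free_head;
    if (e) {
        free_head = e->next;
    } else {
        e = malloc(sizeof *e);
        if (!e)
            return NULL;
        ++alloc_count;
    }
    memset(e, 0, sizeof *e);
    return e;
}

/* Record a new override at the head of the list (most recent first). */
static override_entry *record_override(void *base, void *derived)
{
    override_entry *e = alloc_override_entry();
    if (!e)
        return NULL;
    e->base_class_symbol = base;
    e->derived_class_entity = derived;
    e->next = list_head;
    list_head = e;
    return e;
}

/* Unlink the most recent entry and push it onto the free list. */
static void release_head(void)
{
    override_entry *e = list_head;
    if (!e)
        return;
    list_head = e->next;
    e->next = free_head;
    free_head = e;
}
```

Recycling through a free list avoids repeated small allocations for a node type that is created and discarded frequently during class processing.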

When debug tracing is enabled (dword_126EFCC > 3), the function prints "newly created: ", "existing entry: ", "after modification: ", and "removing: " to stderr via fwrite, followed by calls to sub_41A6E0 to dump the entry contents.
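The tracing guard can be sketched as below. The enum, function name, and parameters are hypothetical; debug_level stands in for dword_126EFCC, and the call to the dump helper (sub_41A6E0) is represented only by a comment. The label strings and the threshold are the ones quoted above.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef enum {
    TRACE_NEWLY_CREATED,
    TRACE_EXISTING_ENTRY,
    TRACE_AFTER_MODIFICATION,
    TRACE_REMOVING
} trace_event;

/* Label strings as they appear in the binary, indexed by trace_event. */
static const char *const trace_labels[] = {
    "newly created: ",
    "existing entry: ",
    "after modification: ",
    "removing: "
};

/* Writes the label for ev to out when tracing is enabled.
 * Returns true if anything was written. */
static bool trace_override_event(FILE *out, int debug_level, trace_event ev)
{
    if (debug_level <= 3)
        return false;              /* tracing requires level > 3 */
    const char *label = trace_labels[ev];
    fwrite(label, 1, strlen(label), out);
    /* The real code follows this with a dump of the entry contents. */
    return true;
}
```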

Cross-References