RDC Mode
CUDA supports two compilation models that fundamentally change how cudafe++ processes device code: whole-program mode (-rdc=false, the default) and separate compilation mode (-rdc=true, also called Relocatable Device Code). The mode switch affects error checking, stub linkage, module ID generation, anonymous namespace mangling, and -- when multiple translation units are involved -- triggers EDG's cross-TU correspondence machinery for structural type verification.
From cudafe++'s perspective, the distinction maps to a single CLI flag (--device-c, flag index 77) and a handful of global booleans that gate code paths throughout the binary. This page documents what changes between the two modes, how module IDs are generated, how cross-TU IL correspondence works, and how host stub linkage is controlled.
Key Facts
| Property | Value |
|---|---|
| RDC CLI flag | --device-c (flag index 77, no argument) |
| Whole-program mode flag | dword_106BFBC (also set by --debug_mode) |
| Module ID cache | qword_126F0C0 (cached string, computed once) |
| Module ID generator | sub_5AF830 (make_module_id, ~450 lines) |
| Module ID setter | sub_5AF7F0 (set_module_id) |
| Module ID getter | sub_5AF820 (get_module_id) |
| Module ID file writer | sub_5B0180 (write_module_id_to_file) |
| Module ID file flag | --gen_module_id_file (flag 83) |
| Module ID file path | --module_id_file_name (flag 87) |
| Cross-TU IL copier | sub_796BA0 (copy_secondary_trans_unit_IL_to_primary, trans_copy.c) |
| Cross-TU usage marker | sub_796C00 (mark_secondary_IL_entities_used_from_primary) |
| Class correspondence | sub_7A00D0 (verify_class_type_correspondence, 703 lines) |
| TU processing entry | sub_7A40A0 (process_translation_unit) |
| TU switch | sub_7A3D60 (switch_translation_unit) |
| Host stub linkage flag | --host-stub-linkage-explicit (flag 47) |
| Static host stub flag | --static-host-stub (flag 48) |
| Static template stub flag | --static-global-template-stub (set_flag mechanism) |
| EDG source files | host_envir.c (module ID), trans_copy.c, trans_corresp.c, trans_unit.c |
Whole-Program Mode (-rdc=false)
Whole-program mode is the default. All device code for a given translation unit must be defined within that single .cu file. No external device symbols are allowed. The host compiler sees the entire program at once, and nvlink is not required for device code linking.
Constraints Enforced
Five diagnostics are specific to whole-program mode or are closely tied to the internal-linkage consequences of non-RDC compilation:
1. Inline device/constant/managed variables must have internal linkage.
An inline __device__/__constant__/__managed__ variable must have
internal linkage when the program is compiled in whole program
mode (-rdc=false)
In whole-program mode, the device runtime has no linker step to resolve external inline variables across TUs. An inline __device__ variable with external linkage would need cross-TU deduplication that only nvlink can provide. The frontend forces static (or anonymous-namespace) linkage, emitting an error if the variable has external linkage.
2. Extern __global__ function templates are forbidden (with -static-global-template-stub=true).
when "-static-global-template-stub=true", extern __global__ function
template is not supported in whole program compilation mode ("-rdc=false").
To resolve the issue, either use separate compilation mode ("-rdc=true"),
or explicitly set "-static-global-template-stub=false" (but see nvcc
documentation about downsides of turning it off)
The -static-global-template-stub flag causes template kernel stubs to receive static linkage to avoid ODR violations when the same template is instantiated in multiple host-side compilation units. An extern template declaration conflicts with this because the extern stub expects an external definition while the static stub forces a local one. The diagnostic tag for this is extern_kernel_template.
3. __global__ template instantiations must have local definitions (with -static-global-template-stub=true).
when "-static-global-template-stub=true" in whole program compilation
mode ("-rdc=false"), a __global__ function template instantiation or
specialization (%sq) must have a definition in the current translation
unit.
A static stub requires a definition in the same TU. If the instantiation point references a template defined in another header without an explicit instantiation, the stub has no body to emit. The diagnostic tag is template_global_no_def.
Both template-related diagnostics recommend either switching to -rdc=true or setting -static-global-template-stub=false. The 4 usage contexts in the binary for -static-global-template-stub all appear in error message strings (at addresses 0x88E588 and 0x88E6E0).
4. Kernel launch from __device__ or __global__ functions requires separate compilation.
kernel launch from __device__ or __global__ functions requires
separate compilation mode
Dynamic parallelism -- launching a kernel from device code (a __device__ or __global__ function calling <<<...>>>) -- requires the device linker (nvlink) to resolve cross-module kernel references. In whole-program mode, no device linking occurs, so the construct is illegal. The diagnostic tag is device_launch_no_sepcomp.
5. Address of internal linkage device function (bug mitigation).
address of internal linkage device function (%sq) was taken
(nv bug 2001144). mitigation: no mitigation required if the
address is not used for comparison, or if the target function
is not a CUDA C++ builtin. Otherwise, write a wrapper function
to call the builtin, and take the address of the wrapper
function instead
This diagnostic fires in whole-program mode when code takes the address of a static __device__ function. Because device functions with internal linkage get module-ID-based name mangling, their addresses may differ across compilations or across TUs even when they refer to the "same" function. The warning documents a known NVIDIA bug (2001144) and provides a workaround: wrap the builtin in a non-internal function and take the wrapper's address instead. This diagnostic has no associated tag name -- it is emitted unconditionally when the condition is detected.
Deferred Function List
When dword_106BFBC (whole-program mode) is set and dword_106BFDC (skip-device-only) is clear, gen_routine_decl (sub_47BFD0) adds device-only functions to a deferred linked list (qword_1065840) rather than emitting dummy bodies inline. Each list node is 32 bytes:
| Offset | Field |
|---|---|
| +0 | next pointer |
| +8 | Source position (start) |
| +16 | Source position (end) |
| +24 | Name string (strdup'd, or NULL) |
This list is consumed during the breakpoint placeholder phase in process_file_scope_entities (sub_489000), where each entry produces a static __attribute__((used)) void __nv_breakpoint_placeholder<N>_<name>(void) { exit(0); } function for debugger support.
Separate Compilation Mode (-rdc=true)
When nvcc passes --device-c (flag index 77) to cudafe++, separate compilation mode is activated. This:
- Allows
__device__,__constant__, and__managed__variables to have external linkage - Permits
extern __global__template functions - Enables dynamic parallelism (kernel launches from device code)
- Requires nvlink to resolve device-side cross-TU references
- Generates a module ID that uniquely identifies each compilation unit for runtime registration
In this mode, the host stubs are generated with external linkage (by default) so the host linker can resolve cross-TU kernel calls. The module ID is embedded in the registration code to match host stubs with their corresponding device fatbinary segments.
Multi-TU Processing in EDG
When multiple translation units are compiled in a single cudafe++ invocation (as happens during RDC compilation with nvcc), the EDG frontend processes them sequentially using a stack-based TU management system:
| Global | Purpose |
|---|---|
qword_106BA10 | Current translation unit pointer |
qword_106B9F0 | Primary (first) translation unit |
qword_106BA18 | TU stack top |
dword_106B9E8 | TU stack depth (excluding primary) |
process_translation_unit (sub_7A40A0, trans_unit.c) is the main entry point called from main() for each source file:
- Allocates a 424-byte TU descriptor via
sub_6BA0D0 - Initializes scope state and copies registered variable defaults
- Sets the primary TU pointer (
qword_106B9F0) for the first file - Links the TU into the processing chain
- Opens the source file and sets up include paths
- Runs the parser (
sub_586240) - Dispatches to standard compilation (
sub_4E8A60) or module compilation (sub_6FDDF0) - Calls finalization (
sub_588E90) - Pops the TU from the stack
switch_translation_unit (sub_7A3D60, trans_unit.c, line 514) saves/restores per-TU state when the frontend needs to reference entities from a different TU:
- Asserts
qword_106BA10 != 0(current TU exists) - If target differs from current: saves current TU via
sub_7A3A50 - Restores target TU state via
memcpyfrom per-TU buffer - Sets
qword_106BA10 = target - Restores scope chain:
xmmword_126EB60,qword_126EB70, etc. - Recomputes file scope indices via
sub_704490
Per-TU state is registered through f_register_trans_unit_variable (sub_7A3C00, trans_unit.c, line 227), which accumulates variables into a linked list (qword_12C7AA8). Each registration record is 40 bytes with fields for the variable pointer, name, prior size, and buffer offset. The total per-TU buffer size is tracked in qword_12C7A98.
Three core variables are always registered (sub_7A4690):
dword_106BA08(is_recompilation), 4 bytesqword_106BA00(current_filename), 8 bytesdword_106B9F8(has_module_info), 4 bytes
Module ID Generation
Every compilation unit in CUDA needs a unique identifier to associate host-side registration code with the correct device fatbinary. This identifier -- the module ID -- is generated by make_module_id (sub_5AF830, host_envir.c, ~450 lines) and cached in qword_126F0C0.
Algorithm
The module ID generator has three source modes, tried in order:
Mode 1: Module ID file. If qword_106BF80 (set by --module_id_file_name) is non-NULL, the entire contents of the specified file are read and used as the module ID. This allows build systems to inject deterministic identifiers.
Mode 2: Explicit numeric token. If the caller provides a non-NULL string argument (nptr), it is parsed via strtoul. If the parse succeeds, the numeric value is used directly. If the parse fails (the string is not a pure integer), the string itself is CRC32-hashed and the hash is used.
Mode 3: Default computation. The default path builds the ID from several components:
- Calls
stat()on the source file to obtainmtime - Formats
ctime()of the modification time - Reads
getpid()for the current process ID - Collects
qword_106C038(command-line options hash input) - Computes the CRC32 hash of the options string
- Takes the output filename, strips it to basename
- If the source filename exceeds 8 characters, replaces it with its CRC32 hex representation
The final string is assembled in the format:
{options_crc}_{output_name_len}_{output_name}_{source_or_crc}[_{extra}][_{pid}]
All non-alphanumeric characters in the result are replaced with underscores. The string is allocated permanently and cached in qword_126F0C0.
Debug tracing (gated by dword_126EFC8) emits:
make_module_id: str1 = %s, str2 = %s, pid = %ld
make_module_id: final string = %s
CRC32 Implementation
The function contains an inline CRC32 implementation that appears three times (for the options hash, the source filename, and the extra string). All three copies use the same algorithm:
- Polynomial:
0xEDB88320(standard reflected CRC-32) - Initial value:
0xFFFFFFFF - Processing: bit-by-bit, 8 iterations per byte
- Final XOR: implicit via the reflected algorithm
The triple inlining suggests the CRC32 was originally a macro or small inline function that the compiler expanded at each call site. The polynomial 0xEDB88320 is the bitwise reversal of the standard CRC-32 polynomial 0x04C11DB7, confirming this is the ubiquitous CRC-32/ISO-HDLC algorithm.
PID Incorporation
The getpid() call ensures that concurrent compilations of the same source file produce different module IDs. Without the PID, two parallel nvcc invocations compiling the same .cu file with the same flags would generate identical module IDs, potentially causing runtime registration collisions. The PID is appended as the final underscore-separated component.
Module ID File Output
When --gen_module_id_file (flag 83) is set, write_module_id_to_file (sub_5B0180) generates the module ID via sub_5AF830(0) and writes it to the file specified by qword_106BF80 (--module_id_file_name, flag 87). If the filename is not set, it emits "module id filename not specified". If the write fails, it emits "error writing module id to file".
In the backend output phase, if dword_106BFB8 (emit-symbol-table flag) is set, sub_5B0180 is also called to write the module ID before the host reference arrays are emitted.
Entity-Based Module ID Selection
An alternative module ID source is available through use_variable_or_routine_for_module_id_if_needed (sub_5CF030, il.c, line 31969, ~65 lines). Rather than computing a hash from file metadata, this function selects a representative entity (variable or function) from the current TU whose mangled name can serve as a stable identifier. The selection criteria are strict:
- Entity kind must be 7 (variable) or 11 (routine), tested via
(kind - 7) & 0xFB == 0 - Must have a definition (for variables: offset
+169 != 0; for routines: has a body) - Must not be a class member
- Must not be in an unnamed namespace
- Must have storage class
== 0(no explicitstatic,extern, orregister) - Must not be template-related or marked with special compilation flags
- For routines: must not have explicit specialization, return type must not be a builtin
The selected entity is stored in qword_126F140 with its kind byte in byte_126F138 (7 for variable, 11 for routine). This entity's name is then fed into sub_5AF830 to produce the final module ID string. The entity-based approach provides a more deterministic ID than the PID-based default, since it is derived from source content rather than runtime state.
Anonymous Namespace Mangling
The module ID directly controls how anonymous namespaces are mangled in the .int.c output. The function sub_6BC7E0 (in nv_transforms.c) constructs the anonymous namespace identifier:
// sub_6BC7E0 implementation:
if (qword_1286A00) // cached?
return qword_1286A00;
module_id = sub_5AF830(0); // get or compute module ID
buf = malloc(strlen(module_id) + 12); // "_GLOBAL__N_" = 11 chars + NUL
strcpy(buf, "_GLOBAL__N_");
strcat(buf, module_id);
qword_1286A00 = buf; // cache for reuse
return buf;
This _GLOBAL__N_<module_id> string is emitted in the .int.c trailer as:
#define _NV_ANON_NAMESPACE _GLOBAL__N_<module_id>
#ifdef _NV_ANON_NAMESPACE
#endif
#include "<source_file>"
#undef _NV_ANON_NAMESPACE
The #define gives anonymous namespace entities a stable, unique mangled name that is consistent between the device and host compilation paths. The #ifdef/#endif guard is defensive -- it tests that the macro was defined (it always is at this point). The #include re-includes the original source file with the macro defined, allowing the host compiler to see the anonymous namespace entities with their module-ID-qualified names. The #undef cleans up to avoid polluting later inclusions.
The anonymous namespace hash also appears during host reference array name construction. For static or anonymous-namespace device entities, the scoped name prefix builder (sub_6BD2F0) inserts _GLOBAL__N_<module_id> as the namespace component, ensuring the mangled name in the .nvHR* section uniquely identifies the entity even across TUs with the same anonymous namespace structure.
Usage in Output
The module ID appears in three places in the generated .int.c output:
-
Anonymous namespace mangling:
sub_6BC7E0constructs_GLOBAL__N_<module_id>for anonymous-namespace symbols in device code, producing unique mangled names per TU. -
Registration boilerplate: The
__cudaRegisterFatBinarycall passes the module ID to the CUDA runtime, which uses it to match host registration with the correct device fatbinary. -
Module ID file: When requested, the ID is written to a separate file for consumption by the build system or nvlink.
Cross-TU IL Correspondence
When multiple TUs are processed in a single cudafe++ invocation, the same C++ types, templates, and declarations may appear in multiple TUs. EDG's correspondence system verifies structural equivalence and establishes canonical entries to avoid duplicate definitions in the merged output.
trans_copy.c: IL Copying Between TUs
The trans_copy.c file contains a single function at address 0x796BA0:
copy_secondary_trans_unit_IL_to_primary -- Copies IL entries from secondary translation units into the primary TU's IL tree. Called after all TUs have been parsed, during the fe_wrapup finalization phase (specifically, after the 5-pass multi-TU iteration). This function ensures that device-reachable IL entries from secondary TUs are available in the primary TU's output scope.
A closely related function exists at 0x796C00:
mark_secondary_IL_entities_used_from_primary (sub_796C00) -- Called during fe_wrapup pass 2 (IL lowering), before the TU iteration loop that applies sub_707040 to each TU's file-scope IL. This function marks IL entities in secondary TUs that are referenced from the primary TU, ensuring they survive any dead-code elimination in later passes.
trans_corresp.c: Structural Equivalence Checking
The trans_corresp.c file (address range 0x796E60--0x7A3420, 88 functions) implements the full cross-TU correspondence verification system. The core functions:
verify_class_type_correspondence (sub_7A00D0, 703 lines) is the centerpiece. It performs a deep structural comparison of two class types from different TUs:
- Base class comparison via
sub_7A27B0(verify_base_class_correspondence) -- iterates base class lists, comparing virtual/non-virtual status, accessibility, and type identity - Friend declaration comparison via
sub_7A1830(verify_friend_declaration_correspondence) -- walks friend lists checking structural equivalence - Member function comparison via
sub_7A1DB0(verify_member_function_correspondence, 411 lines) -- compares function signatures, attributes, constexpr status, and virtual overrides - Nested type comparison via
sub_798960(equiv_member_constants) -- verifies nested class/enum/typedef correspondence - Template parameter comparison via
sub_7B2260-- validates template parameter lists match structurally - Using declaration comparison -- dispatches by kind:
36= alias,6/11= using declaration,7/58= namespace using declaration
If any comparison fails, the function delegates to sub_797180 to emit a diagnostic (error codes 1795/1796), then falls through to f_set_no_trans_unit_corresp (5 variants at sub_797B50-sub_7981A0 for different entity kinds).
The type node layout used by the correspondence system:
- Offset
+132: type kind (9=struct, 10=class, 11=union) - Offset
+144: referenced type / next pointer - Offset
+152: class info pointer - Offset
+161: flags byte (bits for anonymous, elaborated, template, local) - Class info at
+128: scope block with members at indexed offsets[12],[13],[14],[18],[22]
Supporting verification functions:
| Address | Name | Scope |
|---|---|---|
sub_7A0E10 | verify_enum_type_correspondence | Enum underlying type and enumerator list |
sub_7A1230 | verify_function_type_correspondence | Parameter and return type |
sub_7A1390 | verify_type_correspondence | Dispatcher to class/enum/function variants |
sub_7A1460 | set_type_correspondence | Links two types as corresponding |
sub_7A1CC0 | verify_nested_class_body_correspondence | Nested class scope comparison |
sub_7A2C10 | verify_template_parameter_correspondence | Template parameter list |
sub_7A3140 | check_decl_correspondence_with_body | Declaration with definition |
sub_7A3420 | check_decl_correspondence_without_body | Declaration-only case |
sub_7A38A0 | check_decl_correspondence | Dispatcher (with/without body) |
sub_7A38D0 | same_source_position | Source position comparison |
sub_7999C0 | find_template_correspondence | Cross-TU template entity matching (601 lines) |
sub_79A5A0 | determine_correspondence | General correspondence determination |
sub_79B8D0 | mark_canonical_instantiation | Updates instantiation canonical status |
sub_79C1A0 | get_canonical_entry_of | Returns canonical entity for a TU entry |
sub_79D080 | establish_instantiation_correspondences | Links instantiations across TUs |
sub_79DFC0 | set_type_corresp | Sets type correspondence |
sub_79E760 | find_routine_correspondence | Cross-TU function matching |
sub_79F320 | find_namespace_correspondence | Cross-TU namespace matching |
Correspondence Lifecycle
The correspondence system uses three hash tables (qword_12C7800, qword_12C7880, qword_12C7900, each 0x70 bytes / 14 slots) plus linked lists to track established correspondences. The lifecycle:
- Registration (
sub_7A3920): Registers three global variables (dword_106B9E4,dword_106B9E0,qword_12C7798) for per-TU save/restore - Initialization (
sub_7A3980): Zeroes all correspondence hash tables and list pointers - Discovery during parsing: As the secondary TU is parsed, types/functions that match primary-TU entities are identified through name and scope comparison
- Verification:
verify_class_type_correspondenceand its siblings perform deep structural comparison - Linkage:
set_type_correspondence(sub_7A1460) andf_set_trans_unit_corresp(sub_79C400, 511 lines) connect matching entities - Canonicalization:
canonical_ranking(sub_796E60) determines which TU's entity is the canonical representative;mark_canonical_instantiation(sub_79B8D0) updates instantiation records
The correspondence allocation uses 24-byte nodes from a free list (qword_12C7AB0) managed by alloc_trans_unit_corresp (sub_7A3B50) and free_trans_unit_corresp (sub_7A3BB0). The free function decrements a refcount at offset +16; when it reaches 1, the node returns to the free list.
Integration with fe_wrapup
The cross-TU correspondence system hooks into the 5-pass multi-TU architecture in fe_wrapup (sub_588E90):
| Pass | Action | Cross-TU Role |
|---|---|---|
| 1 | Per-file IL wrapup (sub_588C60) | Iterates TU chain, prepares file scope IL |
| 2 | IL lowering (sub_707040) | Calls sub_796C00 (mark secondary IL) before loop |
| 3 | IL emission (sub_610420, arg 23) | Marks device-reachable entries per TU |
| 4 | C++ class finalization | Deferred member processing |
| 5 | Per-file part 3 (sub_588D40) | Final per-TU cleanup |
| Post | Cleanup | Calls sub_796BA0 (copy secondary IL to primary) |
After all five passes complete, sub_796BA0 copies remaining secondary-TU IL into the primary TU's tree, and scope renumbering fixes up any index conflicts.
Host Reference Arrays and Linkage Splitting
The six .nvHR* ELF sections emitted in the .int.c output trailer encode device symbol names for CUDA runtime discovery. These arrays are split along two axes: symbol type (kernel, device variable, constant variable) and linkage (external, internal). The split is critical for RDC: external-linkage symbols are globally resolvable by nvlink across all TUs, while internal-linkage symbols are TU-local and require module-ID-based prefixing to avoid collisions.
| Section | Array Name | Symbol Type | Linkage |
|---|---|---|---|
.nvHRKE | hostRefKernelArrayExternalLinkage | __global__ kernel | external |
.nvHRKI | hostRefKernelArrayInternalLinkage | __global__ kernel | internal |
.nvHRDE | hostRefDeviceArrayExternalLinkage | __device__ variable | external |
.nvHRDI | hostRefDeviceArrayInternalLinkage | __device__ variable | internal |
.nvHRCE | hostRefConstantArrayExternalLinkage | __constant__ variable | external |
.nvHRCI | hostRefConstantArrayInternalLinkage | __constant__ variable | internal |
The emission is driven by 6 calls to nv_emit_host_reference_array (sub_6BCF80, 79 lines, nv_transforms.c) with parameters (emit_callback, is_kernel, is_device, is_internal_linkage):
// From sub_489000 (process_file_scope_entities), backend output phase:
if (dword_106BFD0 || dword_106BFCC) {
sub_6BCF80(sub_467E50, 1, 0, 1); // kernel, internal
sub_6BCF80(sub_467E50, 1, 0, 0); // kernel, external
sub_6BCF80(sub_467E50, 0, 1, 1); // device, internal
sub_6BCF80(sub_467E50, 0, 1, 0); // device, external
sub_6BCF80(sub_467E50, 0, 0, 1); // constant, internal
sub_6BCF80(sub_467E50, 0, 0, 0); // constant, external
}
Each call iterates a separate global list that was populated during the entity walk:
| List Address | Content |
|---|---|
unk_1286880 | kernel external |
unk_12868C0 | kernel internal |
unk_1286780 | device external |
unk_12867C0 | device internal |
unk_1286800 | constant external |
unk_1286840 | constant internal |
Entity registration into these lists is performed by nv_get_full_nv_static_prefix (sub_6BE300, 370 lines, nv_transforms.c:2164). This function examines each device-annotated entity and routes it to the appropriate list based on its execution space bits (at entity offset +182) and linkage (internal linkage = static or anonymous namespace, determined by flags at entity offset +80).
For internal linkage entities, the function builds a scoped name prefix:
- Recursively constructs the scope path via
sub_6BD2F0(nv_build_scoped_name_prefix) - For anonymous namespaces, inserts the
_GLOBAL__N_<module_id>prefix (viaqword_1286A00) - Hashes the full path with
format_string_to_sso(sub_6BD1C0) - Constructs the prefix:
off_E7C768 + len + "_" + filename + "_" - Caches the prefix in
qword_1286760for reuse - Appends
"_"and the entity's mangled name
For external linkage entities, the path is simpler: the :: scope-qualified name is used directly without module-ID-based prefixing.
The generated output for each symbol:
extern "C" {
extern __attribute__((section(".nvHRKE")))
__attribute__((weak))
const unsigned char hostRefKernelArrayExternalLinkage[] = {
0x5f, 0x5a, /* ... mangled name bytes ... */ 0x00
};
}
The __attribute__((weak)) allows multiple TUs to define the same array without linker errors -- the CUDA runtime reads whichever copy survives.
Host Stub Linkage Flags
Three CLI flags control the linkage of generated host stubs:
--host-stub-linkage-explicit (Flag 47)
When set, host stubs are emitted with explicit linkage specifiers rather than relying on the default linkage of the surrounding context. This ensures that the stub's linkage matches what nvcc/nvlink expects regardless of the source file's linkage context (e.g., inside an anonymous namespace or extern "C" block).
--static-host-stub (Flag 48)
Forces all generated host stubs (__wrapper__device_stub_*) to have static linkage. This is used in single-TU compilation where the stubs do not need to be visible to other object files. It prevents symbol conflicts when the same kernel name appears in multiple compilation units that are linked together.
--static-global-template-stub (set_flag Mechanism)
Unlike the direct CLI flags above, -static-global-template-stub is set through the generic --set_flag mechanism (flag 193), which looks up the name in the off_D47CE0 table and stores the value. It has 4 usage contexts in the binary, all in error message strings.
When enabled (=true), template __global__ function stubs receive static linkage. This prevents ODR violations in whole-program mode when the same template kernel is instantiated in multiple host-side TUs. The tradeoff is that extern template kernels and out-of-TU instantiations become illegal (see the constraints in the whole-program section above).
Output Differences Between Modes
| Output Aspect | Whole-Program (-rdc=false) | Separate Compilation (-rdc=true) |
|---|---|---|
| Host stub linkage | Can be static (with flags 47/48) | External (default) |
| Template stub linkage | static (with -static-global-template-stub) | External |
| Module ID generation | Generated but less critical | Required for registration matching |
| Module ID file | Optional | Typically generated |
| Device code embedding | Inline fatbinary in host object | Relocatable device object (.rdc) |
| nvlink requirement | No | Yes (resolves device symbols) |
| Dynamic parallelism | Forbidden | Allowed |
| Extern device variables | Forbidden | Allowed |
| Anonymous namespace hash | Used for device symbol uniqueness | Used for device symbol uniqueness |
| Deferred function list | Active (breakpoint placeholders) | Behavior depends on dword_106BFDC |
| Cross-TU correspondence | N/A (single TU) | Active when multi-TU invocation |
Global Variables
| Address | Size | Name | Purpose |
|---|---|---|---|
dword_106BFBC | 4 | whole_program_mode | Whole-program mode; also set by --debug_mode (flag 82, which sets dword_106BFC4=1, dword_106BFC0=1, dword_106BFBC=1) |
dword_106BFDC | 4 | skip_device_only | Disables deferred function list accumulation |
dword_106BFB8 | 4 | emit_symbol_table | Emit symbol table + module ID to file |
dword_106BFD0 | 4 | device_registration | Device registration / cross-space reference checking |
dword_106BFCC | 4 | constant_registration | Constant registration flag |
qword_126F0C0 | 8 | cached_module_id | Cached module ID string |
qword_106BF80 | 8 | module_id_file_path | Module ID file path (from --module_id_file_name) |
qword_106BA10 | 8 | current_translation_unit | Pointer to current TU descriptor |
qword_106B9F0 | 8 | primary_translation_unit | Pointer to first TU (primary) |
qword_106BA18 | 8 | translation_unit_stack | Top of TU stack |
dword_106B9E8 | 4 | tu_stack_depth | TU stack depth (excluding primary) |
qword_12C7AA8 | 8 | registered_variable_list_head | Per-TU variable registration list |
qword_12C7A98 | 8 | per_tu_storage_size | Total per-TU buffer size |
qword_12C7AB0 | 8 | corresp_free_list | Correspondence node free list |
qword_12C7AB8 | 8 | stack_entry_free_list | TU stack entry free list |
qword_1065840 | 8 | deferred_function_list | Breakpoint placeholder linked list head |
Function Map
| Address | Name | Source File | Lines | Role |
|---|---|---|---|---|
sub_5AF830 | make_module_id | host_envir.c | ~450 | CRC32-based unique TU identifier |
sub_5AF7F0 | set_module_id | host_envir.c | ~10 | Setter for cached module ID |
sub_5AF820 | get_module_id | host_envir.c | ~3 | Getter for cached module ID |
sub_5B0180 | write_module_id_to_file | host_envir.c | ~30 | Writes module ID to file |
sub_5CF030 | use_variable_or_routine_for_module_id_if_needed | il.c:31969 | ~65 | Selects representative entity for ID |
sub_6BC7E0 | (anon namespace hash) | nv_transforms.c | ~20 | Generates _GLOBAL__N_<module_id> |
sub_6BCF80 | nv_emit_host_reference_array | nv_transforms.c | 79 | Emits .nvHR* ELF section with symbol names |
sub_6BD2F0 | nv_build_scoped_name_prefix | nv_transforms.c | ~95 | Recursive scope-qualified name builder |
sub_6BE300 | nv_get_full_nv_static_prefix | nv_transforms.c:2164 | ~370 | Scoped name + host ref array registration |
sub_796BA0 | copy_secondary_trans_unit_IL_to_primary | trans_copy.c | ~50 | Copies secondary TU IL to primary |
sub_796C00 | mark_secondary_IL_entities_used_from_primary | -- | -- | Marks secondary IL referenced from primary |
sub_796E60 | canonical_ranking | trans_corresp.c | -- | Determines canonical TU entry |
sub_7975D0 | may_have_correspondence | trans_corresp.c | -- | Quick correspondence eligibility check |
sub_797990 | f_change_canonical_entry | trans_corresp.c | -- | Updates canonical representative |
sub_7983A0 | f_same_name | trans_corresp.c | -- | Cross-TU symbol name comparison |
sub_79C400 | f_set_trans_unit_corresp | trans_corresp.c | 511 | Establishes entity correspondence |
sub_7A00D0 | verify_class_type_correspondence | trans_corresp.c | 703 | Deep class structural comparison |
sub_7A0E10 | verify_enum_type_correspondence | trans_corresp.c | -- | Enum comparison |
sub_7A1230 | verify_function_type_correspondence | trans_corresp.c | -- | Function type comparison |
sub_7A1460 | set_type_correspondence | trans_corresp.c | -- | Links corresponding types |
sub_7A1DB0 | verify_member_function_correspondence | trans_corresp.c | 411 | Member function comparison |
sub_7A27B0 | verify_base_class_correspondence | trans_corresp.c | -- | Base class list comparison |
sub_7A3920 | register_trans_corresp_variables | trans_corresp.c | -- | Registers per-TU state variables |
sub_7A3980 | init_trans_corresp_state | trans_corresp.c | -- | Zeroes all correspondence state |
sub_7A3A50 | save_translation_unit_state | trans_unit.c | -- | Saves current TU state to buffer |
sub_7A3C00 | f_register_trans_unit_variable | trans_unit.c | -- | Registers a per-TU variable |
sub_7A3CF0 | fix_up_translation_unit | trans_unit.c | -- | Finalizes TU state |
sub_7A3D60 | switch_translation_unit | trans_unit.c | -- | Saves/restores TU context |
sub_7A3EF0 | push_translation_unit_stack | trans_unit.c | -- | Pushes TU onto stack |
sub_7A3F70 | pop_translation_unit_stack | trans_unit.c | -- | Pops TU from stack |
sub_7A40A0 | process_translation_unit | trans_unit.c | -- | Main TU processing entry point |
sub_7A4690 | register_builtin_trans_unit_variables | trans_unit.c | -- | Registers 3 core per-TU vars |
Cross-References
- Kernel Stub Generation --
-static-global-template-stubdetails and the stub toggle mechanism - Device/Host Separation -- How the single-pass tag-and-filter architecture works
- .int.c File Format -- Anonymous namespace mangling and module ID in output
- Backend Code Generation -- Module ID output phase
- Host Reference Arrays --
.nvHR*section format and runtime discovery - CLI Flag Inventory -- Flag indices 47, 48, 77, 83, 87
- CUDA Error Catalog -- Category 11 (RDC / whole-program diagnostics)
- EDG 6.6 Overview -- Cross-TU correspondence section
- Template Engine -- Template instantiation deduplication across TUs
- Global Variable Index -- All globals referenced here