Kernel Stub Generation
When cudafe++ generates the .int.c host translation of a CUDA source file, every __global__ kernel function undergoes a critical transformation: the original kernel body is suppressed and replaced with a device stub -- a lightweight host-callable wrapper that delegates to cudaLaunchKernel. This mechanism is how CUDA kernel launch syntax (kernel<<<grid, block>>>(args)) ultimately becomes a regular C++ function call that the host compiler can process. The stub generation logic lives entirely within gen_routine_decl (sub_47BFD0), a 1,831-line function in cp_gen_be.c that is the central code generator for all C++ function declarations and definitions. A secondary function, gen_bare_name (sub_473F10), handles the character-by-character emission of the __wrapper__device_stub_ prefix into function names.
The stub mechanism operates in two passes controlled by a global toggle, dword_1065850 (the device_stub_mode flag). The toggle fires at the top of gen_routine_decl, BEFORE the body-selection logic runs. Because the toggle is dword_1065850 = (dword_1065850 == 0), it flips 0->1 on the first invocation. This means:
- First invocation (toggle 0->1): `dword_1065850 == 1` at decision points -> emits the `static` declaration with the `cudaLaunchKernel` placeholder body, then recurses.
- Recursive invocation (toggle 1->0): `dword_1065850 == 0` at decision points -> emits the forwarding body that calls `__wrapper__device_stub_<name>`.
Both invocations wrap the original kernel body in #if 0 / #endif so the host compiler never sees device code.
Key Facts
| Property | Value |
|---|---|
| Source file | cp_gen_be.c (EDG 6.6 backend code generator) |
| Main generator | sub_47BFD0 (gen_routine_decl, 1831 lines) |
| Bare name emitter | sub_473F10 (gen_bare_name, 671 lines) |
| Stub prefix string | "__wrapper__device_stub_" at 0x839420 |
| Specialization prefix | "__specialization_" at 0x839960 |
| cudaLaunchKernel body | "{ ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}" at 0x839CB8 |
| Device-only dummy (ctor/dtor) | "{int *volatile ___ = 0;" at 0x839A3E + "::free(___);" at 0x839A72 |
| Device-only dummy (global) | "{int volatile ___ = 1;" at 0x839A56 + "::exit(___);" at 0x839A80 |
| Stub mode flag | dword_1065850 (global toggle) |
| Static template stub CLI flag | -static-global-template-stub=true |
| Parameter list generator | sub_478900 (gen_parameter_list) |
| Scope qualifier emitter | sub_474D60 (recursive namespace path) |
| Parameter name emitter | sub_474BB0 (emit entity name for forwarding) |
The Device Stub Mode Toggle
The entire stub generation mechanism hinges on a single global variable, dword_1065850. This flag acts as a modal switch: when set, all subsequent code generation for __global__ functions produces the static stub variant rather than the forwarding body.
Toggle Logic
The toggle occurs in gen_routine_decl at the point where the function's CUDA flags are inspected. The critical line from the decompiled binary:
// sub_47BFD0, around decompiled line 553
// v3 = routine entity pointer, v8 = is_friend flag
__int64 flags = *(_QWORD *)(v3 + 176);
if ((flags & 0x40000002000000) == 0x40000002000000 && v8 != 1)
dword_1065850 = dword_1065850 == 0; // toggle: 0->1 or 1->0
The bitmask 0x40000002000000 encodes a combination of the __global__ attribute and a linkage/definition flag in the entity's 8-byte flags field at offset +176. The condition requires BOTH bits set and the declaration must NOT be a friend declaration (v8 != 1). The toggle expression dword_1065850 == 0 flips the flag: if it was 0, it becomes 1; if it was 1, it becomes 0.
This means gen_routine_decl is called twice for every __global__ kernel. Crucially, the toggle fires at the TOP of the function, BEFORE the body emission logic:
- First call (`dword_1065850 == 0` at entry -> toggled to `1`): all subsequent decision points see `dword_1065850 == 1`. Emits the `static` stub with the `cudaLaunchKernel` placeholder body, then recurses.
- Recursive call (`dword_1065850 == 1` at entry -> toggled to `0`): all subsequent decision points see `dword_1065850 == 0`. Emits the forwarding stub body. Does NOT recurse (the flag is 0 at the end).
The self-recursion that drives the second call is explicit at the end of gen_routine_decl:
// sub_47BFD0, decompiled line 1817-1821
if (dword_1065850) {
qword_1065748 = (int64_t)v163; // restore source sequence pointer
return sub_47BFD0(v152, a2); // recursive self-call
}
After emitting the static stub (first call), the self-recursion check at line 1817 fires because dword_1065850 == 1. The function restores the source sequence state and calls itself. In the recursive call, the toggle fires again (1->0), and the forwarding body is emitted with dword_1065850 == 0. At the end of the recursive call, dword_1065850 == 0, so no further recursion occurs.
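The toggle-then-recurse control flow can be modeled with a short Python sketch. Only the control flow mirrors the decompiled code; the function and variable names here are illustrative, and `GLOBAL_AND_DEFINED` stands in for the `0x40000002000000` bit pair:

```python
# Sketch of the dword_1065850 two-pass toggle in gen_routine_decl.
GLOBAL_AND_DEFINED = 0x40000002000000  # __global__ attribute + definition/linkage bits

device_stub_mode = 0   # models dword_1065850
emitted = []           # records which body each pass produces

def gen_routine_decl(flags, is_friend):
    global device_stub_mode
    # The toggle fires at the TOP, before any body-selection logic.
    if (flags & GLOBAL_AND_DEFINED) == GLOBAL_AND_DEFINED and not is_friend:
        device_stub_mode = int(device_stub_mode == 0)  # 0->1 or 1->0
    if device_stub_mode == 1:
        emitted.append("static stub: { ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}")
    else:
        emitted.append("forwarding body: {__wrapper__device_stub_<name>(...);return;}")
    # The self-recursion at the end of the function drives the second pass.
    if device_stub_mode:
        gen_routine_decl(flags, is_friend)

gen_routine_decl(GLOBAL_AND_DEFINED, is_friend=False)
```

Running the sketch shows the ordering described above: the first pass emits the static stub, the recursive pass emits the forwarding body, and the flag is back to 0 afterwards, so no third pass occurs.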
Stub Generation: The Forwarding Body
When dword_1065850 == 0 and the entity has __global__ annotation (byte +182 & 0x40) with a body (byte +179 & 0x02), gen_routine_decl emits a forwarding body instead of the original kernel implementation. This is the output produced by the recursive (second) invocation.
Step-by-Step Emission
The forwarding body is assembled from multiple sub_468190 (emit raw string) calls:
// Condition: (byte[182] & 0x40) != 0 && (byte[179] & 2) != 0 && dword_1065850 == 0
// 1. Open brace
sub_468190("{");
// 2. Scope qualification (if kernel is in a namespace)
scope = *(v3 + 40); // entity's enclosing scope
if (scope && byte_at(scope + 28) == 3) { // scope kind 3 = namespace
sub_474D60(*(scope + 32)); // recursively emit namespace::namespace::...
sub_468190("::");
}
// 3. Emit "__wrapper__device_stub_" prefix
sub_468190("__wrapper__device_stub_");
// 4. Emit the original function name
sub_468190(*(char **)(v3 + 8)); // entity name string at offset +8
Template Argument Emission
After the function name, template arguments must be forwarded. The logic branches on whether the function is an explicit template specialization (v153) or a non-template member of a template class:
Case A: Explicit specialization (v153 != 0) -- uses the template argument list at entity offset +224:
v135 = *(v3 + 224); // template_args linked list
if (v135) {
putc('<', stream); // emit '<'
do {
arg_kind = byte_at(v135 + 8);
if (arg_kind == 0) {
// Type argument: emit type specifier + declarator
sub_5FE8B0(v135[4], ...); // gen_type_specifier
sub_5FB270(v135[4], ...); // gen_declarator
} else if (arg_kind == 1) {
// Value argument (non-type template param)
sub_5FCAF0(v135[4], 1, ...); // gen_constant
} else {
// Template-template argument
sub_472730(v135[4], ...); // gen_template_arg
}
v135 = *v135; // next in linked list
separator = v135 ? ',' : '>';
putc(separator, stream);
} while (v135);
}
Case B: Non-specialization -- template parameters from the enclosing class template are forwarded:
// v162 = template parameter info from enclosing scope
v92 = v162[1]; // template parameter list
if (v92 && (byte_at(v92 + 113) & 2) == 0) {
sub_467E50("<");
do {
param_kind = byte_at(v92 + 112);
if (param_kind == 1) {
// type parameter -- emit the type
sub_5FE8B0(*(v92 + 120), ...);
sub_5FB270(*(v92 + 120), ...);
} else if (param_kind == 2) {
// non-type parameter -- emit constant
sub_5FCAF0(*(v92 + 120), 1, ...);
} else {
// template-template parameter
sub_472730(*(v92 + 120), ...);
}
if (byte_at(v92 + 113) & 1)
sub_467E50("..."); // parameter pack expansion
v92 = *(v92 + 104); // next parameter
emit(v92 ? "," : ">");
} while (v92);
}
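Both cases share the same separator pattern: after each argument, a comma is emitted if another argument follows, otherwise the closing angle bracket. A minimal Python sketch of just that punctuation logic (argument rendering is stubbed out with plain strings):

```python
# Sketch of the ','/'>' separator pattern used when forwarding template arguments.
def emit_template_args(args):
    if not args:
        return ""          # no argument list at all
    out = "<"
    for i, arg in enumerate(args):
        out += arg         # stands in for gen_type_specifier / gen_constant / gen_template_arg
        out += "," if i + 1 < len(args) else ">"
    return out

print(emit_template_args(["T"]))       # <T>
print(emit_template_args(["T", "N"]))  # <T,N>
```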
Parameter Forwarding
After the name and template arguments, the forwarding call's actual arguments are emitted:
// 5. Emit parameter forwarding: "(param1, param2, ...)"
sub_468150(40); // '('
param = *(v167 + 40); // first parameter entity from definition scope
if (param) {
for (separator = ""; ; separator = ",") {
sub_468190(separator);
sub_474BB0(param, 7); // emit parameter name
if (byte_at(param + 166) & 0x40) {
sub_468190("..."); // variadic parameter pack expansion
}
param = *(param + 104); // next parameter in list
if (!param) break;
}
}
sub_468190(");");
// 6. Emit return statement and closing brace
sub_468190("return;}");
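Putting the emission steps together, the forwarding body can be sketched as a single string-building function. All names are illustrative; this only models the concatenation order of the `sub_468190` calls above (the document's own examples use ", " between forwarded parameter names, which is what the sketch joins with):

```python
# Sketch assembling the forwarding body from the same pieces the decompiled
# code emits one sub_468190 call at a time.
STUB_PREFIX = "__wrapper__device_stub_"

def build_forwarding_body(name, params, namespaces=(), template_args=()):
    out = "{"
    for ns in namespaces:                  # sub_474D60 + "::"
        out += ns + "::"
    out += STUB_PREFIX + name              # prefix + entity name at offset +8
    if template_args:
        out += "<" + ",".join(template_args) + ">"
    out += "(" + ", ".join(params) + ");"  # forwarded parameter names only
    out += "return;}"
    return out

body = build_forwarding_body("my_kernel", ["data", "n"],
                             namespaces=("my_ns",), template_args=("T",))
print(body)
# {my_ns::__wrapper__device_stub_my_kernel<T>(data, n);return;}
```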
Complete Output Example
For a kernel:
namespace my_ns {
template<typename T>
__global__ void my_kernel(T* data, int n) { /* device code */ }
}
The forwarding body (emitted during the recursive call with dword_1065850 == 0) produces:
template<typename T>
void my_ns::my_kernel(T* data, int n) {
my_ns::__wrapper__device_stub_my_kernel<T>(data, n);
return;
}
#if 0
/* original kernel body here */
#endif
Note: __host__ is NOT emitted in the forwarding body. The __global__ attribute is stripped and no explicit execution space appears. The function appears as a plain C++ function in .int.c.
Stub Generation: The Static cudaLaunchKernel Placeholder
When dword_1065850 == 1 (the first invocation, after the toggle), the function declaration is rewritten with a different storage class and body. Despite being called "pass 2" conceptually (it produces the definition that the forwarding body calls), it is emitted FIRST in the output because the toggle sets the flag before any body emission logic runs.
Declaration Modifiers
When dword_1065850 is set, gen_routine_decl forces the storage class to static and optionally prepends the __specialization_ prefix:
// sub_47BFD0, decompiled lines 897-903
if (dword_1065850) {
v164 = 2; // force storage class = static
v23 = "static";
if (v153) // if template specialization
sub_467E50("__specialization_");
goto emit_storage_class; // -> sub_467E50("static"); sub_468150(' ');
}
The __specialization_ prefix is emitted BEFORE static for template specializations, with no separator in between. This produces declarations of the form __specialization_static void __wrapper__device_stub_kernel(...), which the CUDA runtime uses to distinguish specialization stubs from primary template stubs.
Name Emission via gen_bare_name
In stub mode, gen_bare_name (sub_473F10) prepends the wrapper prefix character-by-character. The relevant code path:
// sub_473F10, decompiled lines 130-144
if (byte_at(v2 + 182) & 0x40 && dword_1065850) {
// Emit line directive if pending
if (dword_1065818)
sub_467DA0();
// Character-by-character emission of "__wrapper__device_stub_"
v25 = "_wrapper__device_stub_"; // note: starts at second char
v26 = 95; // first char: '_' (0x5F = 95)
do {
++v25;
putc(v26, stream);
v26 = *(v25 - 1);
++dword_106581C;
} while ((char)v26);
}
The technique is notable: the string "_wrapper__device_stub_" is stored starting at the second character, and the first underscore (_, ASCII 95) is loaded as the initial character separately. The do/while loop then walks the string pointer forward, emitting each character via putc and incrementing the column counter (dword_106581C). This assembles the full __wrapper__device_stub_ prefix before the actual function name is emitted.
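The off-by-one pointer walk can be reproduced in a few lines of Python; this sketch models only the loop structure (seed the first `'_'` out-of-band, then emit the previous character while advancing through the shortened literal):

```python
# Sketch of the gen_bare_name trick: the literal is stored starting at the
# SECOND character, and the first '_' (ASCII 95) is seeded separately.
def emit_prefix():
    rest = "_wrapper__device_stub_"  # the string minus its first character
    out = []
    col = 0                          # models dword_106581C, the column counter
    i = 0
    c = 95                           # first char '_' loaded out-of-band
    while True:
        i += 1                       # ++v25
        out.append(chr(c))           # putc(v26)
        col += 1                     # ++dword_106581C
        c = ord(rest[i - 1]) if i - 1 < len(rest) else 0  # v26 = *(v25 - 1)
        if c == 0:                   # NUL terminator ends the do/while
            break
    return "".join(out), col

prefix, columns = emit_prefix()
print(prefix)   # __wrapper__device_stub_
```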
cudaLaunchKernel Placeholder Body
For non-specialization __global__ kernels in stub mode, the body is a single-line placeholder:
// sub_47BFD0, decompiled lines 1424-1429
if (dword_1065850) {
if (!v153 && v90) { // not a specialization AND has __global__ body
sub_468190("{ ::cudaLaunchKernel(0, 0, 0, 0, 0, 0);}");
goto suppress_original;
}
}
The call ::cudaLaunchKernel(0, 0, 0, 0, 0, 0) is never actually executed at runtime. It exists solely to create a linker dependency on the CUDA runtime library, ensuring that cudaLaunchKernel is linked even though the real launch is performed through the CUDA driver API. The six zero arguments match the signature cudaError_t cudaLaunchKernel(const void*, dim3, dim3, void**, size_t, cudaStream_t).
Complete Output Example (Static Stub)
For the add_one-style kernel shown with concrete types (here float* data, int n), the static stub (emitted first, with dword_1065850 == 1) produces:
static void __wrapper__device_stub_my_kernel(float* data, int n) {
::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
Dummy Bodies for Non-Kernel Device Functions
Not all CUDA-annotated functions are __global__ kernels. Device-only functions (constructors, destructors, and plain __device__ functions) that have definitions also need host-side bodies to prevent host compiler errors. These receive dummy bodies designed to suppress optimizer warnings while remaining syntactically valid.
Condition for Dummy Body Emission
The dummy body path activates in the ELSE branch of the __global__ check -- that is, for non-kernel device functions. The condition from the decompiled code (lines 1603-1606):
// This path is reached when (byte[182] & 0x40) == 0 -- entity is NOT __global__
// The flags field at offset +176 is an 8-byte bitfield encoding linkage/definition state.
uint64_t flags = *(uint64_t*)(entity + 176);
if ((flags & 0x30000000000500) != 0x20000000000000) // NOT a device-only entity with definition
goto emit_original_body; // skip dummy, emit normally
if (!dword_106BFDC || (entity->byte_81 & 4) != 0) // whole-program flag check
{
// Emit dummy body for device-only function visible in host output
}
The bitmask 0x30000000000500 extracts the device-annotation and definition bits from the 8-byte flags field. The target value 0x20000000000000 selects device-only entities whose definition exists only as device code, leaving the host side with no usable body -- exactly the functions that need a dummy body to satisfy the host compiler.
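The masked-compare pattern is easy to verify directly. A small sketch (constants taken from the decompiled condition; the classification helper name is illustrative):

```python
# Sketch of the masked-compare used to select dummy-body candidates:
# only entities with (flags & MASK) == TARGET take the dummy-body path.
MASK   = 0x30000000000500
TARGET = 0x20000000000000

def needs_dummy_body(flags):
    return (flags & MASK) == TARGET

assert needs_dummy_body(0x20000000000000)            # exact target bits
assert needs_dummy_body(0x20000000000000 | 0x1)      # bits outside the mask are ignored
assert not needs_dummy_body(0x30000000000000)        # extra masked bit set
assert not needs_dummy_body(0x20000000000500)        # the 0x500 bits must be clear
```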
Constructor/Destructor Dummy (definition_kind 1 or 2)
For constructors (definition_kind == 1) and destructors (definition_kind == 2), the dummy body allocates a volatile null pointer and frees it:
// sub_47BFD0, decompiled lines 1611-1651
if ((unsigned char)(byte[166] - 1) <= 1) {
sub_468190("{int *volatile ___ = 0;");
// ... emit (void)param; for each parameter ...
sub_468190("::free(___);}");
}
Output:
{int *volatile ___ = 0;(void)param1;(void)param2;::free(___);}
The volatile qualifier prevents the optimizer from removing the allocation. The ::free(0) call is a no-op at runtime but establishes a dependency on the C library and prevents dead code elimination of the entire body.
global / Regular Device Function Dummy (definition_kind >= 3)
For non-constructor/destructor device functions, a different pattern is used:
else {
sub_468190("{int volatile ___ = 1;");
// ... emit (void)param; for each parameter ...
sub_468190("::exit(___);}");
}
Output:
{int volatile ___ = 1;(void)param1;(void)param2;::exit(___);}
The ::exit(1) call guarantees the function is never considered to "return normally" by the host compiler's control-flow analysis, suppressing missing-return-value warnings for non-void functions.
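The two dummy-body shapes differ only in the opening declaration and the closing call. A Python sketch generating both variants (the helper name is illustrative; the output strings match the examples above):

```python
# Sketch of the two dummy-body shapes: definition_kind 1/2 (ctor/dtor) gets
# the volatile-pointer/free pattern; everything else gets volatile-int/exit.
def dummy_body(definition_kind, params):
    uses = "".join(f"(void){p};" for p in params)
    if definition_kind in (1, 2):      # constructor or destructor
        return "{int *volatile ___ = 0;" + uses + "::free(___);}"
    return "{int volatile ___ = 1;" + uses + "::exit(___);}"

print(dummy_body(1, ["param1", "param2"]))
# {int *volatile ___ = 0;(void)param1;(void)param2;::free(___);}
print(dummy_body(3, ["x", "y"]))
# {int volatile ___ = 1;(void)x;(void)y;::exit(___);}
```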
Parameter Usage Emission
Between the opening and closing statements, each named parameter is referenced with (void)param; to suppress unused-parameter warnings. The loop walks the parameter list:
for (kk = *(v167 + 40); kk; kk = *(kk + 104)) {
if (*(kk + 8) && !(byte_at(kk + 166) & 0x40)) { // has name, not a pack
// For aggregate types with GNU host compiler: complex cast chain
if (!dword_1065750 && dword_126E1F8
&& is_aggregate_type(*(kk + 112))
&& has_nontrivial_dtor(*(kk + 112))) {
sub_468190("(void)");
sub_468190("reinterpret_cast<void *>(&(const_cast<char &>");
sub_468190("(reinterpret_cast<const volatile char &>(");
sub_474BB0(kk, 7); // parameter name
sub_468190("))))");
} else {
sub_468190("(void)");
sub_474BB0(kk, 7); // parameter name
}
sub_468150(';');
}
}
The complex reinterpret_cast chain for aggregate types with non-trivial destructors avoids triggering GCC/Clang warnings about taking the address of a parameter that might be passed in registers.
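The two per-parameter emission shapes can be sketched as string builders; the helper name is illustrative, and the cast-chain string is assembled from the same fragments the decompiled code emits:

```python
# Sketch of per-parameter usage emission: plain "(void)name;" for simple
# parameters, or the cast chain for aggregates with non-trivial destructors
# on GNU hosts.
def param_use(name, aggregate_with_dtor=False):
    if aggregate_with_dtor:
        return ("(void)reinterpret_cast<void *>(&(const_cast<char &>"
                "(reinterpret_cast<const volatile char &>(" + name + "))));")
    return "(void)" + name + ";"

print(param_use("n"))
print(param_use("big", aggregate_with_dtor=True))
```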
The #if 0 / #endif Suppression
After the stub body is emitted, the original kernel body is wrapped in preprocessor guards to hide it from the host compiler:
// sub_47BFD0, decompiled lines 1598-1601
sub_46BC80("#if 0"); // emit "#if 0\n"
--dword_1065834; // decrease indent level
sub_467D60(); // emit newline
// ... then emit the original body via:
dword_1065850_saved = dword_1065850;
dword_1065850 = 0; // temporarily disable stub mode
sub_47AEF0(*(v167 + 80), 0); // gen_statement_full: emit original body
dword_1065850 = dword_1065850_saved; // restore stub mode
sub_466C10(); // finalize
// ... then emit #endif
putc('#', stream);
// character-by-character emission of "#endif\n"
The function temporarily disables stub mode (dword_1065850 = 0) while emitting the original body so that any nested constructs are generated normally. After the body, #endif is emitted and stub mode is restored.
For definitions (when v112 == 0), a trailing ; is appended after #endif to satisfy host compilers that may expect a statement terminator.
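The save/restore discipline around the suppressed body can be sketched as follows (class and method names are illustrative; only the flag-handling pattern is modeled):

```python
# Sketch of the save/restore around the #if 0 body: stub mode is cleared
# while the original body is regenerated, then restored so the caller's
# pass state is unaffected.
class Backend:
    def __init__(self):
        self.device_stub_mode = 1   # models dword_1065850
        self.lines = []

    def emit_suppressed_body(self, body_lines):
        self.lines.append("#if 0")
        saved = self.device_stub_mode
        self.device_stub_mode = 0         # nested constructs generate normally
        self.lines.extend(body_lines)     # stands in for gen_statement_full
        self.device_stub_mode = saved     # restore pass state
        self.lines.append("#endif")

be = Backend()
be.emit_suppressed_body(["{", "  /* device code */", "}"])
assert be.device_stub_mode == 1
assert be.lines[0] == "#if 0" and be.lines[-1] == "#endif"
```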
The -static-global-template-stub Flag
The CLI flag -static-global-template-stub=true controls how template __global__ functions are stubbed. When enabled, template kernel stubs receive static linkage, which avoids ODR violations when the same template kernel is instantiated in multiple translation units during whole-program compilation (-rdc=false).
The flag produces two diagnostic messages when it encounters problematic patterns:
- Extern template kernel: "when "-static-global-template-stub=true", extern __global__ function template is not supported in whole program compilation mode ("-rdc=false")" -- an extern template kernel cannot receive a static stub because the definitions would conflict across TUs.
- Missing definition: "when "-static-global-template-stub=true" in whole program compilation mode ("-rdc=false"), a __global__ function template instantiation or specialization (%sq) must have a definition in the current translation unit" -- the static stub requires a local definition to replace.

Both diagnostics recommend either switching to -rdc=true (separate compilation) or explicitly setting -static-global-template-stub=false.
Diagnostic Push/Pop Around Stubs
Before emitting device stub declarations, gen_routine_decl wraps the output in compiler-specific diagnostic suppression to prevent spurious warnings:
For GCC/Clang hosts (dword_126E1F8 set, version > 0x9E97 = 40599):
sub_467E50("\n#pragma GCC diagnostic push\n");
sub_467E50("#pragma GCC diagnostic ignored \"-Wunused-parameter\"\n");
// ... stub emission ...
sub_467E50("\n#pragma GCC diagnostic pop\n");
For MSVC hosts (dword_126E1D8 set):
sub_467E50("\n__pragma(warning(push))\n");
sub_467E50("__pragma(warning(disable : 4100))\n"); // unreferenced formal parameter
// ... stub emission ...
sub_467E50("\n__pragma(warning(pop))\n");
For static template specialization stubs, an additional warning is suppressed:
- GCC/Clang: #pragma GCC diagnostic ignored "-Wunused-function"
- MSVC: warning 4505 ("unreferenced local function has been removed")
Deferred Function List for Whole-Program Mode
When dword_106BFBC (a whole-program compilation flag) is set and dword_106BFDC is clear, instead of emitting a dummy body immediately, gen_routine_decl adds the function to a deferred list:
// sub_47BFD0, decompiled lines 1713-1745
v117 = sub_6B7340(32); // allocate 32-byte node
v117[0] = qword_1065840; // link to previous head
v117[1] = source_start; // source position start
v117[2] = source_end; // source position end
if (has_name)
v117[3] = strdup(name); // copy of function name
else
v117[3] = NULL;
qword_1065840 = v117; // push onto list head
This deferred list (qword_1065840) is later consumed during the breakpoint placeholder generation phase in process_file_scope_entities (sub_489000), where each deferred entry produces a static __attribute__((used)) void __nv_breakpoint_placeholder<N>_<name>(void) { exit(0); } function.
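The head-push list and its later consumption can be sketched in Python. The placeholder naming below follows the description above but is illustrative, as is every identifier in the sketch:

```python
# Sketch of the deferred-function list: a singly linked list pushed at the
# head (qword_1065840), later walked to emit breakpoint placeholders.
deferred_head = None  # models qword_1065840

def defer_function(name, src_start, src_end):
    global deferred_head
    node = {"next": deferred_head, "start": src_start,
            "end": src_end, "name": name}
    deferred_head = node              # push onto list head

def consume_deferred():
    """Walk the list (most recently deferred first) and emit placeholders."""
    out, node, n = [], deferred_head, 0
    while node:
        out.append(f"static __attribute__((used)) void "
                   f"__nv_breakpoint_placeholder{n}_{node['name']}(void) "
                   f"{{ exit(0); }}")
        node, n = node["next"], n + 1
    return out

defer_function("dev_fn_a", 100, 140)
defer_function("dev_fn_b", 200, 260)
stubs = consume_deferred()
# dev_fn_b was pushed last, so it is consumed first.
```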
Function Map
| Address | Name | Role |
|---|---|---|
| sub_47BFD0 | gen_routine_decl | Main stub generator; 1831 lines; handles all function declarations |
| sub_473F10 | gen_bare_name | Character-by-character name emission with __wrapper__device_stub_ prefix |
| sub_474BB0 | gen_entity_name | Parameter name emission for forwarding calls |
| sub_474D60 | gen_scope_qualifier | Recursive namespace path emission (ns1::ns2::) |
| sub_478900 | gen_parameter_list | Parameter list with type transformation in stub mode |
| sub_478D70 | gen_function_declarator_with_scope | Full function declarator with cv-qualifiers and ref-qualifiers |
| sub_47AEF0 | gen_statement_full | Statement generator used for emitting original body inside #if 0 |
| sub_47ECC0 | gen_template / process_source_sequence | Top-level dispatch; also sets dword_1065850 for instantiation directives |
| sub_46BC80 | (emit #if directive) | Emits #if 0 / #if 1 preprocessor lines |
| sub_467E50 | (emit string) | Primary string emission to output stream |
| sub_468190 | (emit raw string) | Raw string emission (no line directive) |
| sub_489000 | process_file_scope_entities | Backend entry point; consumes deferred function list |
Concrete Example: Simple Kernel Stub Output
Given this input CUDA source:
__global__ void add_one(int *data, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n)
data[idx] += 1;
}
cudafe++ generates the following in the .int.c host translation file. The toggle fires at the top of gen_routine_decl (0->1), so the static stub definition is emitted FIRST, followed by the forwarding body from the recursive call.
Output 1: Static Stub Definition (first call, dword_1065850 == 1 after toggle)
The static stub provides the linker symbol that the forwarding body calls. Diagnostic pragmas wrap the declaration to suppress unused-parameter warnings:
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-parameter"
static void __wrapper__device_stub_add_one(int *data, int n) {
::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
#if 0
/* Original kernel body -- hidden from host compiler */
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n)
data[idx] += 1;
}
#endif
#pragma GCC diagnostic pop
The static storage class is forced by the check at decompiled line 897-903. The __wrapper__device_stub_ prefix is emitted by gen_bare_name (sub_473F10). The cudaLaunchKernel placeholder body comes from the string literal at 0x839CB8.
Output 2: Forwarding Body (recursive call, dword_1065850 == 0 after toggle)
After the static stub is emitted and gen_routine_decl recurses, the forwarding body replaces the original kernel body. The __global__ attribute is stripped (kernels become regular host functions in .int.c):
void add_one(int *data, int n) {__wrapper__device_stub_add_one(data, n);return;}
#if 0
/* Original kernel body -- hidden from host compiler (emitted again) */
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < n)
data[idx] += 1;
}
#endif
The forwarding body is assembled character-by-character:
- `{` -- open brace
- Scope qualifier (none for file-scope kernels; `ns::` for namespaced ones)
- `__wrapper__device_stub_` -- the stub prefix from the string at 0x839420
- `add_one` -- the original function name from entity + 8
- `(data, n)` -- parameter names forwarded (no types, just names via sub_474BB0)
- `;return;}` -- close the forwarding call and return
The original body appears in #if 0 in both outputs because both code paths reach the same LABEL_457 -> sub_46BC80("#if 0") emission point.
Template Kernel Example
For a template kernel:
template<typename T>
__global__ void scale(T *data, T factor, int n) { /* ... */ }
// explicit instantiation
template __global__ void scale<float>(float *, float, int);
Output 1 (first call, dword_1065850 == 1) produces a specialization stub:
__specialization_static void __wrapper__device_stub_scale(float *data, float factor, int n) {
::cudaLaunchKernel(0, 0, 0, 0, 0, 0);
}
Output 2 (recursive call, dword_1065850 == 0) produces a forwarding stub with template arguments:
template<typename T>
void scale(T *data, T factor, int n) {__wrapper__device_stub_scale<T>(data, factor, n);return;}
The __specialization_ prefix is emitted only when the entity is a template specialization (v153 != 0) and dword_1065850 is set (decompiled line 901-902).
Device-Only Function Example
For a non-kernel __device__ function with a body:
__device__ int device_helper(int x, int y) {
return x + y;
}
The host output uses a dummy body instead of a forwarding stub (since there is no __wrapper__device_stub_ target for non-kernel functions):
__attribute__((unused)) int device_helper(int x, int y) {int volatile ___ = 1;(void)x;(void)y;::exit(___);}
#if 0
{
return x + y;
}
#endif
The __attribute__((unused)) prefix is emitted when the function's execution space is device-only ((byte_182 & 0x70) == 0x20) and dword_126E1F8 (GCC host compiler mode) is set (decompiled line 905-906).
Cross-References
- Execution Spaces -- byte +182 bitfield that drives the __global__ check; complete redeclaration matrix
- Device/Host Separation -- IL marking that determines which functions need stubs; the dword_1065850 toggle lifecycle
- RDC Mode -- separate compilation mode that affects stub linkage
- .int.c File Format -- overall structure of the generated host file
- CUDA Runtime Boilerplate -- managed memory initialization emitted alongside stubs