Architecture Feature Gating
cudafe++ enforces architecture-dependent feature gates that prevent use of CUDA constructs on hardware that cannot support them. These gates operate at three distinct layers: compile-time SM version checks against dword_126E4A8 during semantic analysis, string-embedded diagnostic messages with architecture names baked into .rodata, and host-compiler version gating controlling which GCC/Clang-specific #pragma directives and language constructs appear in the generated .int.c output. A separate mechanism, the --db debug system, provides runtime tracing that can expose architecture checks as they execute. This page documents all three layers, the global variables involved, every discovered threshold constant, and the complete data flow from nvcc invocation to feature gate evaluation.
Key Facts
| Property | Value |
|---|---|
| SM version storage | dword_126E4A8 (sm_architecture, set by --target / case 245) |
| SM version TU-level copy | dword_126EBF8 (target_config_index, copied during TU init in sub_586240) |
| Architecture parser stub | sub_7525E0 (6-byte stub returning -1; actual parsing done by nvcc) |
| Post-parse initializer | sub_7525F0 (set_target_configuration, target.c:299) |
| Type table initializer | sub_7515D0 (sets 100+ type-size/alignment globals, called from sub_7525F0) |
| GCC version global | qword_126EF98 (default 80100 = GCC 8.1.0, set by --gnu_version case 184) |
| Clang version global | qword_126EF90 (default 90100 = Clang 9.1.0, set by --clang_version case 188) |
| GCC host dialect flag | dword_126E1F8 (host compiler identified as GCC) |
| Clang host dialect flag | dword_126E1E8 (host compiler identified as Clang) |
| Host GCC version copy | qword_126E1F0 (copied from qword_126EF98 during dialect init) |
| Host Clang version copy | qword_126E1E0 (copied from qword_126EF90 during dialect init) |
--nv_arch error string | "invalid or no value specified with --nv_arch flag" at 0x8884F0 |
| Debug option parser | sub_48A390 (proc_debug_option, 238 lines, debug.c) |
| Debug trace linked list | qword_1065870 (head pointer) |
| Invalid arch sentinel | -1 (0xFFFFFFFF) |
| Feature threshold count | 17 CUDA features across 7 SM versions (20, 30, 52, 60, 70, 80, 90/90a) |
| Host compiler threshold count | 19 version constants across GCC 3.0 through GCC 14.0 |
Layer 1: SM Architecture Input
How the Architecture Reaches cudafe++
cudafe++ never parses architecture strings directly from the user. The driver (nvcc) translates user-facing flags like --gpu-architecture=sm_90 into an internal numeric code and passes it via the --target flag when spawning the cudafe++ process. Inside cudafe++, the --target flag is registered as CLI flag 245 and handled in proc_command_line (sub_459630).
The handler calls sub_7525E0, which in the CUDA Toolkit 13.0 binary is a 6-byte stub:
; sub_7525E0 -- architecture parser stub
; Address: 0x7525E0, Size: 6 bytes
mov eax, 0FFFFFFFFh ; return -1 unconditionally
retn
This stub always returns -1 (the invalid-architecture sentinel). The actual architecture code is injected by nvcc into the argument string that sub_7525E0 receives. Because IDA decompiled this as a stub, the parsing logic is either inlined by the compiler or resolved through a different mechanism at link time. The result is stored in dword_126E4A8:
// proc_command_line (sub_459630), case 245
v80 = sub_7525E0(qword_E7FF28, v23, v20, v30); // parse SM code from arg string
dword_126E4A8 = v80; // store in sm_architecture
if (v80 == -1) {
sub_4F8420(2664); // emit error 2664
// error string: "invalid or no value specified with --nv_arch flag"
sub_4F2930("cmd_line.c", 12219, "proc_command_line", 0, 0);
// assert_fail -- unreachable if error handler returns
}
sub_7525F0(v80); // set_target_configuration
Error 2664 fires when the architecture value is -1. The error string at 0x8884F0 references --nv_arch (the nvcc-facing name for this flag). This string has no direct xrefs in the IDA analysis, meaning it is loaded indirectly through the error message table (off_88FAA0). The --nv_arch name in the error message is a user-facing alias; internally cudafe++ processes it as --target (flag 245).
set_target_configuration (sub_7525F0)
After storing the SM version, sub_7525F0 performs post-parse initialization. This function lives in target.c:299:
// sub_7525F0 -- set_target_configuration
__int64 __fastcall sub_7525F0(int a1)
{
if ((unsigned int)(a1 + 1) > 1) // rejects only -1
assert_fail("set_target_configuration", 299);
sub_7515D0(); // initialize type table for target platform
qword_126E1B0 = "lib"; // library search path prefix
}
The guard (a1 + 1) > 1u is an unsigned comparison that accepts any value >= 0 and rejects only -1 (which wraps to 0 when incremented). This is a sanity check -- in production, nvcc always provides a valid SM code.
Type Table Initialization (sub_7515D0)
The sub_7515D0 function, called from set_target_configuration, initializes over 100 global variables describing the target platform's type sizes, alignments, and numeric limits. This establishes the data model for CUDA device code:
// sub_7515D0 -- target type initialization (excerpt)
// Sets LP64 data model with CUDA-specific type properties
dword_126E328 = 8; // sizeof(long)
dword_126E338 = 4; // sizeof(int)
dword_126E2FC = 16; // sizeof(long double)
dword_126E308 = 16; // alignof(long double)
dword_126E2B8 = 8; // sizeof(pointer)
dword_126E2AC = 8; // alignof(pointer)
dword_126E420 = 2; // sizeof(wchar_t)
dword_126E4A0 = 8; // target vector width
dword_126E258 = 53; // double mantissa bits
dword_126E250 = 1024; // double max exponent
dword_126E254 = -1021; // double min exponent
dword_126E234 = 113; // __float128 mantissa bits
dword_126E22C = 0x4000; // __float128 max exponent
dword_126E230 = -16381; // __float128 min exponent
// ... ~80 more assignments ...
The function unconditionally returns -1, which is not used by the caller.
SM Version Propagation
During translation unit initialization (sub_586240, called from fe_translation_unit_init), the SM version is copied into a TU-level global:
// sub_586240, line 54 in decompiled output
dword_126EBF8 = dword_126E4A8; // target_config_index = sm_architecture
After this point, architecture checks throughout the compiler read either dword_126E4A8 (the CLI-level global) or dword_126EBF8 (the TU-level copy). Both contain the same integer SM version code. The dual-variable pattern exists because EDG's architecture supports multi-TU compilation where each TU could theoretically target a different architecture (though CUDA compilation always uses a single target per cudafe++ invocation).
Layer 2: CUDA Feature Thresholds
cudafe++ checks the SM architecture version at semantic analysis time to gate CUDA-specific features. When a feature is used on an architecture below its minimum requirement, the compiler emits a diagnostic error or warning. All thresholds below were extracted from error strings embedded in the binary's .rodata section and confirmed through cross-reference with diagnostic tag names.
Complete Feature Threshold Table
| Feature | Min Architecture | Diagnostic Tag | Error String |
|---|---|---|---|
| Virtual base classes | compute_20 | use_of_virtual_base_on_compute_1x | Use of a virtual base (%t) requires the compute_20 or higher architecture |
| Device variadic functions | compute_30 | device_function_has_ellipsis | __device__ or __host__ __device__ function with ellipsis requires compute_30 or higher architecture |
__managed__ variables | compute_30 | unsupported_arch_for_managed_capability | __managed__ variables require architecture compute_30 or higher |
alloca() in device code | compute_52 | alloca_unsupported_for_lower_than_arch52 | alloca() is not supported for architectures lower than compute_52 |
| Atomic scope argument | sm_60 | (inline) | atomic operations' scope argument is supported on architecture sm_60 or above. Fall back to use membar. |
| Atomic f64 add/sub | sm_60 | (inline) | atomic add and sub for 64-bit float is supported on architecture sm_60 or above. |
__nv_atomic_* functions | sm_60 | (inline) | __nv_atomic_* functions are not supported on arch < sm_60. |
__grid_constant__ | compute_70 | grid_constant_unsupported_arch | __grid_constant__ annotation is only allowed for architecture compute_70 or later |
| Atomic memory order | sm_70 | (inline) | atomic operations' argument of memory order is supported on architecture sm_70 or above. Fall back to use membar. |
| 128-bit atomic load/store | sm_70 | (inline) | 128-bit atomic load and store are supported on architecture sm_70 or above. |
| 16-bit atomic CAS | sm_70 | (inline) | 16-bit atomic compare-and-exchange is supported on architecture sm_70 or above. |
__nv_register_params__ | compute_80 | register_params_unsupported_arch | __nv_register_params__ is only supported for compute_80 or later architecture |
__wgmma_mma_async | sm_90a | wgmma_mma_async_not_enabled | __wgmma_mma_async builtins are only available for sm_90a |
| Atomic cluster scope | sm_90 | (inline) | atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
| Atomic cluster scope (load/store) | sm_90 | (inline) | atomic load and store's scope of cluster is supported on architecture sm_90 or above. Using device scope instead. |
| 128-bit atomic exch/CAS | sm_90 | nv_atomic_exch_cas_b128_not_supported | 128-bit atomic exchange or compare-and-exchange is supported on architecture sm_90 or above. |
GPU-Architecture-Gated Attributes (No Specific SM in String)
Several features check the architecture but their error strings do not embed a specific SM version number. Instead, they use the generic phrase "this GPU architecture", meaning the threshold is encoded in the comparison logic rather than the diagnostic text:
| Feature | Diagnostic Tag | Error String |
|---|---|---|
__cluster_dims__ | cluster_dims_unsupported | __cluster_dims__ is not supported for this GPU architecture |
max_blocks_per_cluster | max_blocks_per_cluster_unsupported | cannot specify max blocks per cluster for this GPU architecture |
__block_size__ | block_size_unsupported | __block_size__ is not supported for this GPU architecture |
__managed__ (config) | unsupported_configuration_for_managed_capability | __managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system) |
These features are gated by the same dword_126E4A8 comparison mechanism as the features in the main table, but their exact SM threshold values would require tracing the specific comparison instructions in the semantic analysis functions.
Diagnostic Behavior: Errors vs Warnings vs Demotions
Architecture gate violations produce three distinct behaviors depending on the feature class:
Hard errors -- Compilation halts. Features that fundamentally cannot work on the target architecture:
__managed__below compute_30 -- No unified memory hardware support__grid_constant__below compute_70 -- No hardware constant propagation mechanism__nv_register_params__below compute_80 -- Register parameter ABI not available__wgmma_mma_asyncbelow sm_90a -- No warp-group MMA hardwarealloca()below compute_52 -- No dynamic stack allocation support on device- Virtual base classes below compute_20 -- No vtable support on earliest GPU architectures
Fallback warnings -- Compilation continues with degraded behavior. The compiler generates functionally correct but potentially less performant code:
- Atomic scope arguments on pre-sm_60 -- Falls back to
membar-based synchronization - Atomic memory order on pre-sm_70 -- Falls back to
membar-based ordering - 64-bit float atomics on pre-sm_60 -- Falls back to CAS loop emulation
Scope demotion warnings -- Informational diagnostics about automatic scope narrowing:
- Cluster scope atomics on pre-sm_90 -- Silently demotes to device scope ("Using device scope instead")
compute_XX vs sm_XX Naming
Error strings use two naming conventions that reflect CUDA's split between virtual and physical architectures:
-
compute_XX-- Virtual architecture. Checked at PTX generation time. Features gated bycompute_XXare relevant to the intermediate PTX representation and are independent of the specific GPU die. Examples:__managed__(requires unified memory ISA support),alloca()(requires dynamic stack frame instructions). -
sm_XX-- Physical architecture. Checked at SASS generation time. Features gated bysm_XXare tied to specific hardware capabilities of a GPU die. Examples: 128-bit atomics (require specific load/store unit widths), cluster scope (requires the SM 9.0 thread block cluster hardware).
In practice, cudafe++ stores a single integer in dword_126E4A8 and the distinction is purely semantic -- both forms gate against the same numeric value. The value is a compute capability number (e.g., 70 for Volta, 90 for Hopper).
The sm_90a suffix (with the a accelerator flag) is a special case used exclusively for __wgmma_mma_async builtins. This variant requires the Hopper accelerated architecture, which is distinct from the base sm_90. The a suffix is encoded in the SM integer value passed to cudafe++ by nvcc.
__wgmma_mma_async Detail
The warp-group matrix multiply-accumulate builtin has the most granular validation of any architecture-gated feature. Beyond the sm_90a architecture check, cudafe++ also validates:
| Check | Diagnostic Tag | Error String |
|---|---|---|
| Architecture gate | wgmma_mma_async_not_enabled | __wgmma_mma_async builtins are only available for sm_90a |
| Shape validation | wgmma_mma_async_bad_shape | The shape %s is not supported for __wgmma_mma_async builtin |
| A operand type | wgmma_mma_async_bad_A_type | (type mismatch diagnostic) |
| B operand type | wgmma_mma_async_bad_B_type | (type mismatch diagnostic) |
| Missing arguments | wgmma_mma_async_missing_args | The 'A' or 'B' argument to __wgmma_mma_async call is missing |
| Non-constant args | wgmma_mma_async_nonconstant_arg | Non-constant argument to __wgmma_mma_async call |
The validation function is identified as check_wgmma_mma_async (string at 0x888CAC). Four type-specific builtin variants are registered: __wgmma_mma_async_f16, __wgmma_mma_async_bf16, __wgmma_mma_async_tf32, and __wgmma_mma_async_f8.
nv_register_params Detail
The register parameter attribute has three distinct checks, only one of which is an architecture gate:
| Check | Diagnostic Tag | Error String |
|---|---|---|
| Feature enable flag | register_params_not_enabled | __nv_register_params__ support is not enabled |
| Architecture gate | register_params_unsupported_arch | __nv_register_params__ is only supported for compute_80 or later architecture |
| Function type check | register_params_unsupported_function | __nv_register_params__ is not allowed on a %s function |
| Ellipsis check | register_params_ellipsis_function | (variadic function diagnostic) |
The attribute handler is apply_nv_register_params_attr (string at 0x830C78).
SM Version to Feature Summary
| SM Version | Features Introduced | Feature Count |
|---|---|---|
| compute_20 | Virtual base classes in device code | 1 |
| compute_30 | __managed__ variables, device variadic functions | 2 |
| compute_52 | alloca() in device code | 1 |
| sm_60 | Atomic scope argument, 64-bit float atomics, __nv_atomic_* API | 3 |
| sm_70 | __grid_constant__, 128-bit atomic load/store, atomic memory order, 16-bit CAS | 4 |
| compute_80 | __nv_register_params__ | 1 |
| sm_90 / sm_90a | __wgmma_mma_async, thread block clusters, 128-bit atomic exchange/CAS, cluster scope atomics | 5 |
Notably absent from cudafe++ error strings are features like cooperative groups (sm_60+), tensor cores (sm_70+), and dynamic parallelism (sm_35+). These are checked at runtime or by the PTX assembler (ptxas) rather than the language frontend.
Layer 3: Host Compiler Version Gating
cudafe++ generates .int.c output that must compile cleanly under the host C++ compiler (GCC, Clang, or MSVC). Because different host compiler versions support different warning pragmas, attributes, and language features, cudafe++ gates its output based on the host compiler version stored in qword_126EF98 (GCC) and qword_126EF90 (Clang). Additionally, several C++ language feature flags in the EDG frontend are conditionally enabled based on host compiler version to match the behavior the user expects from their host compiler.
Version Encoding
Both GCC and Clang versions are encoded as a single integer: major * 10000 + minor * 100 + patch. For example, GCC 8.1.0 is encoded as 80100. The compiler tests these values against hexadecimal threshold constants using > (strictly-greater-than) comparisons, which effectively means "version at or above threshold + 1." Since all threshold values use a 99 patch level (e.g., 40299 for GCC 4.2.99), the gate > 40299 is equivalent to >= 40300, which effectively means "GCC 4.3 or later."
Complete Threshold Table
| Hex Constant | Decimal | Encoded Version | Effective Gate | Occurrence Count |
|---|---|---|---|---|
0x752F | 29,999 | 2.99.99 | GCC/Clang >= 3.0 | 1 (dialect resolution) |
0x75F7 | 30,199 | 3.01.99 | GCC/Clang >= 3.2 | low |
0x76BF | 30,399 | 3.03.99 | GCC/Clang >= 3.4 | low (cuda_compat_flag gate) |
0x7787 | 30,599 | 3.05.99 | Clang >= 3.6 | medium (-Wunused-local-typedefs) |
0x78B3 | 30,899 | 3.08.99 | Clang >= 3.9 | low |
0x9C3F | 39,999 | 3.99.99 | GCC >= 4.0 | medium (dword_106BDD8 + Clang gate) |
0x9D07 | 40,199 | 4.01.99 | GCC >= 4.2 | medium (-Wunused-variable file-level) |
0x9D6B | 40,299 | 4.02.99 | GCC >= 4.3 | medium (variadic templates) |
0x9DCF | 40,399 | 4.03.99 | GCC >= 4.4 | low (dialect resolution) |
0x9E33 | 40,499 | 4.04.99 | GCC >= 4.5 | low (dialect resolution) |
0x9E97 | 40,599 | 4.05.99 | GCC >= 4.6 | medium (diagnostic push/pop) |
0x9EFB | 40,699 | 4.06.99 | GCC >= 4.7 | low (feature flag gating) |
0x9F5F | 40,799 | 4.07.99 | GCC >= 4.8 | medium (-Wunused-local-typedefs) |
0xEA5F | 59,999 | 5.99.99 | GCC >= 6.0 | 22 files (C++14/17 features) |
0xEB27 | 60,199 | 6.01.99 | GCC >= 6.2 | low (HasFuncPtrConv gate) |
0x1116F | 69,999 | 6.99.99 | GCC >= 7.0 | medium (dword_106BDD8 + feature flags) |
0x15F8F | 89,999 | 8.99.99 | GCC/Clang >= 9.0 | medium (C++17/20 features) |
0x1D4BF | 119,999 | 11.99.99 | GCC/Clang >= 12.0 | 8 files |
0x1FBCF | 129,999 | 12.99.99 | GCC >= 13.0 | 13 files |
0x222DF | 139,999 | 13.99.99 | GCC >= 14.0 | 5 files |
How Thresholds Are Used
The thresholds serve three purposes:
1. Diagnostic pragma emission. The .int.c output includes #pragma GCC diagnostic directives to suppress host compiler warnings about CUDA-generated code. Different GCC/Clang versions introduced different warning flags, so the pragmas are conditionally emitted:
// From sub_489000 (backend boilerplate emission)
// -Wunused-local-typedefs: GCC 4.8+ (0x9F5F) or Clang 3.6+ (0x7787)
if ((dword_126E1E8 && qword_126EF90 > 0x7787)
|| (!dword_106BF6C && !dword_106BF68
&& dword_126E1F8 && qword_126E1F0 > 0x9F5F))
{
emit("#pragma GCC diagnostic ignored \"-Wunused-local-typedefs\"");
}
// Push/pop block for managed RT: GCC 4.6+ (0x9E97) or Clang
if (dword_126E1E8 || (!dword_106BF6C && dword_126E1F8 && qword_126E1F0 > 0x9E97))
{
emit("#pragma GCC diagnostic push");
emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");
emit("#pragma GCC diagnostic ignored \"-Wunused-function\"");
// ... managed runtime boilerplate ...
emit("#pragma GCC diagnostic pop");
}
// File-level -Wunused-variable: GCC 4.2+ (0x9D07) or Clang
if (dword_126E1E8 || (dword_126E1F8 && qword_126E1F0 > 0x9D07))
emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");
2. C++ feature gating during dialect resolution. The post-parsing dialect resolution in proc_command_line and the sub_44B6B0 dialect setup function use qword_126EF98 thresholds to decide which C++ language features to enable. Examples from the decompiled code:
// sub_44B6B0 -- dialect resolution, ~400 lines
// GCC 4.3+ (0x9D6B): enable variadic templates
if (qword_126EF98 > 0x9D6B)
dword_106BE1C = 1; // variadic_templates
// GCC 4.7+ (0x9EFB): enable list initialization under certain conditions
if (qword_126EF98 > 0x9EFB && dword_106BE1C && (!byte_E7FFF1 || dword_106C10C))
dword_106BE10 = 1;
// GCC 6.0+ (0xEA5F) or Clang: enable C++14/17 features
if (dword_126EFA4 || (dword_126EFA8 && qword_126EF98 > 0xEA5F))
// Enable feature (Clang always, GCC only 6.0+)
3. CUDA compatibility mode. A special flag dword_E7FF10 (cuda_compat_flag) is set when dword_126EFAC && qword_126EF98 <= 0x76BF -- that is, when extended features are enabled but the GCC version is 3.3.99 or below. This activates a legacy compatibility path for very old host compilers that lack modern C++ support.
The 0xEA5F (59999) Threshold -- The Most Pervasive Gate
The threshold 0xEA5F (GCC 6.0) is the most widely used version constant in the binary, appearing in 22 decompiled functions. It gates the C++14/17 feature set boundary. GCC 6.0 was the first GCC release with full C++14 support and substantial C++17 support.
The typical usage pattern is:
// Pattern: "Clang (any version) OR GCC 6.0+"
if (dword_126EFA4 || (dword_126EFA8 && qword_126EF98 > 0xEA5F))
// Enable C++14/17 feature
// Pattern: "GNU extensions but not Clang, GCC 6.0+"
if (dword_126EFAC && !dword_126EFA4 && qword_126EF98 > 0xEA5F)
// Enable GNU-specific extended feature
Functions using this threshold include: declaration processing (sub_40D900), attribute application (sub_413ED0), class declaration (sub_431590), dialect resolution (sub_44B6B0), initializer processing (sub_48C710, sub_4B6760), backend code generation (sub_4688C0), expression canonicalization (sub_4CA6C0, sub_4D2B70), IL walking (sub_54AED0), scope management (sub_59C9B0, sub_59AF40), type processing (sub_5D1350), overload resolution (sub_662670, sub_666720), and template specialization (sub_6A3B00).
Version-Gated Feature Flag: dword_106BDD8
One particular feature flag (dword_106BDD8) is set during dialect resolution based on a compound version check:
// sub_44B6B0, decompiled line ~228-231
// v4 = (dword_126EFA4 != 0), i.e., is_clang_mode
if ((dword_126EFAC && !v4 && qword_126EF98 > 0x1116F) // GNU ext, not Clang, GCC >= 7.0
|| (v4 && qword_126EF90 > 0x9C3F)) // or Clang >= 4.0
{
dword_106BDD8 = 1;
}
This flag is referenced in 7 decompiled functions (sub_430920, sub_42FE50, sub_447930, sub_44AAC0, sub_44B6B0, sub_45EB40, sub_724630). The W066 global variables report identifies it as optix_mode, but the decompiled code shows it is set purely based on compiler version thresholds during dialect resolution, not from any --emit-optix-ir CLI flag. It likely controls a C++ language feature (possibly structured bindings or another C++17 feature) that requires GCC 7.0+ or Clang 4.0+ support, and the "optix_mode" name in the report may be a misidentification based on context where it was encountered. The flag gates behavior in attribute validation (sub_42FE50), where it interacts with dword_106B670 to control feature availability.
Dialect Initialization Flow
The host compiler version globals are initialized in proc_command_line and propagated to the dialect system during TU initialization:
proc_command_line (CLI parsing, sub_459630):
case 184 (--gnu_version=X): qword_126EF98 = X // GCC version
case 188 (--clang_version=X): qword_126EF90 = X // Clang version
case 182 (--gcc): dword_126EFA8 = 1 // GCC mode flag
case 187 (--clang): dword_126EFA4 = 1 // Clang mode flag
dialect_init (sub_44B6B0, called during setup):
// ~400 lines of version-threshold-based feature flag resolution
// Sets 30+ EDG feature flags based on gcc_version, clang_version,
// cpp_standard_version, and extension mode flags
target dialect (sub_752A80, select_cp_gen_be_target_dialect):
if (dword_126EFA8): // GCC mode
dword_126E1F8 = 1 // host_dialect_gnu
qword_126E1F0 = qword_126EF98 // host_gcc_version
if (dword_126EFA4): // Clang mode
dword_126E1E8 = 1 // host_dialect_clang
qword_126E1E0 = qword_126EF90 // host_clang_version
The defaults for unspecified versions are qword_126EF98 = 80100 (GCC 8.1.0) and qword_126EF90 = 90100 (Clang 9.1.0), set during default_init (sub_45EB40).
The --db Debug Mechanism
The --db flag (CLI case 37) activates EDG's internal debug tracing system by calling sub_48A390 (proc_debug_option). While not directly related to architecture gating, the --db mechanism shares the adjacent global namespace (dword_126EFC8, dword_126EFCC) and is relevant because debug tracing can expose architecture checks as they execute in real time.
Connection Between --db and Architecture
The --db flag does not set or modify any architecture-related globals. Its connection to the architecture system is observational: when debug tracing is enabled, the compiler emits trace output at key decision points throughout compilation, including the semantic analysis functions that evaluate architecture thresholds. Enabling --db=5 (verbosity level 5) causes the compiler to log IL entry kinds, template instantiation steps, and scope transitions, which provides visibility into when and why architecture gates fire.
The CLI dispatch for --db:
// proc_command_line (sub_459630), case 37
case 37: // --db=<string>
if (sub_48A390(qword_E7FF28)) // proc_debug_option
goto error; // returns nonzero on parse failure
dword_106C2A0 = dword_126EFCC; // save initial error count baseline
After proc_debug_option returns, dword_106C2A0 captures the current value of dword_126EFCC (debug verbosity level). This is used as a baseline error count for subsequent error tracking.
proc_debug_option (sub_48A390)
This 238-line function (debug.c) parses debug control strings. On entry, it unconditionally sets dword_126EFC8 = 1 (debug tracing enabled), then dispatches based on the first character of the input:
// sub_48A390 entry
dword_126EFC8 = 1; // enable debug tracing
v3 = (unsigned __int8)*nptr;
if ((v3 - 48) <= 9) { // first char is digit
dword_126EFCC = strtol(v1, 0, 10); // set verbosity level
return 0;
}
The full parsing grammar:
| Input Format | Parsed As | Action |
|---|---|---|
"5" (numeric only) | Verbosity level | Sets dword_126EFCC = 5 |
"name=3" | Name with level | Adds trace node: action=1, level=3 |
"name+=3" | Additive trace | Adds trace node: action=2, level=3 |
"name-=3" | Subtractive trace | Adds trace node: action=3, level=3 |
"name=3!" | Permanent trace | Adds trace node: action=1, level=3, permanent=1 |
"#name" | Hash removal | Removes matching node from trace list |
"-name" | Dash removal | Removes matching node from trace list |
"a,b=2,c=3" | Comma-separated | Processes each entry independently |
Debug Trace Node Structure
Debug trace requests are stored as a singly-linked list rooted at qword_1065870. Each node is 28 bytes, allocated via sub_6B7340 (the IL allocator):
struct debug_trace_node { // 28 bytes (32 allocated)
struct debug_trace_node* next; // +0: linked list link
char* name_string; // +8: entity name to trace (heap copy)
int32 action_type; // +16: 1=set, 2=add, 3=subtract, 4=remove
int32 level; // +20: trace level (integer)
int32 permanent; // +24: 1=survives reset, 0=cleared on reset
};
When proc_debug_option encounters its own name in the trace list (the self-referential check !strcmp(src, "proc_debug_option")), it prints the entire trace state to stderr:
if (qword_1065870 && (v2 & 1) != 0) {
do {
fprintf(s, "debug request for: %s\n", node->name_string);
fprintf(s, "action=%d, level=%d\n", node->action_type, node->level);
node = node->next;
} while (node);
}
Debug Verbosity Levels
The dword_126EFCC verbosity level controls trace output granularity across the entire compiler:
| Level | Effect |
|---|---|
| 0 | No debug output (default) |
| 1-2 | Basic trace: function entry/exit markers |
| 3 | Detailed trace: includes entity names, scope indices |
| 4 | Very detailed: IL entry kinds, overload candidate lists |
| 5+ | Full trace: IL tree walking with "Walking IL tree, entry kind = ..." |
db_name (CLI case 190)
The --db_name flag (case 190) calls a separate function sub_48AD80 to register a debug name filter. Unlike --db which enables global tracing, --db_name restricts trace output to entities matching the specified name pattern. If sub_48AD80 fails (returns nonzero), error 570 is emitted.
Three-Layer Checking Model
Layer 1: Compile-Time Semantic Checks (cudafe++ Frontend)
These are the primary gates. During semantic analysis, cudafe++ reads dword_126E4A8 and compares it against threshold constants. Violations emit diagnostic errors through the standard error system (diagnostic IDs in the 3000+ range, displayed as 20000-series via the +16543 offset formula). These checks are unconditional -- they fire regardless of whether the code would actually execute at runtime.
Enforcement point: Declaration processing, type checking, attribute application, and CUDA-specific semantic validation passes.
Examples:
__managed__variable declaration withdword_126E4A8 < 30triggersunsupported_arch_for_managed_capability__grid_constant__parameter withdword_126E4A8 < 70triggersgrid_constant_unsupported_arch__wgmma_mma_asynccall on non-sm_90a triggerswgmma_mma_async_not_enabled- Virtual base class with
dword_126E4A8 < 20triggersuse_of_virtual_base_on_compute_1x
Layer 2: String-Embedded Diagnostic Formatting
Error strings with architecture names baked into .rodata represent the complete set of architecture-dependent diagnostics. These strings are loaded by the diagnostic system and formatted with the current architecture value. The strings serve as the user-visible feedback for Layer 1 checks.
The architecture name in the string (e.g., "compute_70", "sm_90a") is a literal constant, not a formatted parameter -- the compiler does not interpolate the actual target architecture into these messages. This means the error messages always state the minimum required architecture, not what the user actually specified. The only exception is the virtual base error which uses %t (a type formatter) to include the base class name, not the architecture.
Layer 3: Host Compiler Version Gating
This layer does not check GPU architecture at all -- instead, it gates the output format of the generated .int.c file based on the host C++ compiler's version. The thresholds ensure that GCC/Clang-specific pragmas, attributes, and language constructs in the generated code are compatible with the actual host compiler that will consume the output.
Enforcement point: Backend code generation (sub_489000 and related functions in cp_gen_be.c).
Impact: Incorrect host compiler version gating does not cause compilation failure -- it may produce warnings from the host compiler due to unrecognized pragmas, or miss warning suppression directives that would silence spurious diagnostics.
Interaction Between Layers
nvcc (driver)
|
| --target=<sm_code> --gnu_version=<ver> --clang_version=<ver>
v
cudafe++ process
|
+-- CLI parsing (proc_command_line)
| dword_126E4A8 = sm_code (SM architecture)
| qword_126EF98 = gcc_version (host GCC version)
| qword_126EF90 = clang_version (host Clang version)
|
+-- set_target_configuration (sub_7525F0)
| sub_7515D0() -- type table init (100+ globals)
|
+-- dialect_resolution (sub_44B6B0)
| 30+ feature flags set based on version thresholds
| dword_126E1F8 / dword_126E1E8 -- host dialect set
| qword_126E1F0 / qword_126E1E0 -- host version copies
|
+-- TU init (sub_586240)
| dword_126EBF8 = dword_126E4A8 (SM version copy)
|
+-- [Layer 1] Semantic analysis
| Compare dword_126E4A8 against SM thresholds
| Emit CUDA-specific errors for unsupported features
|
+-- [Layer 2] Diagnostic formatting
| Load error string with baked-in architecture name
| Format and display error to user
|
+-- [Layer 3] .int.c code generation
| Compare qword_126E1F0 / qword_126E1E0 against host thresholds
| Emit appropriate #pragma directives
| Generate host-compiler-compatible boilerplate
|
v
Host Compiler (gcc / clang / cl.exe)
Layers 1 and 2 operate during the frontend phase and can halt compilation. Layer 3 operates during the backend phase and only affects the format of the generated output file.
Global Variable Summary
| Address | Size | Name | Role |
|---|---|---|---|
dword_126E4A8 | 4 | sm_architecture | Target SM version from --target (case 245). Sentinel: -1. |
dword_126EBF8 | 4 | target_config_index | TU-level copy of dword_126E4A8, set in sub_586240. |
qword_126EF98 | 8 | gcc_version | GCC compatibility version. Default 80100. Set by --gnu_version (case 184). |
qword_126EF90 | 8 | clang_version | Clang compatibility version. Default 90100. Set by --clang_version (case 188). |
dword_126EFA8 | 4 | gcc_extensions | GCC mode enabled. Set by --gcc (case 182). |
dword_126EFA4 | 4 | clang_extensions | Clang mode enabled. Set by --clang (case 187). |
dword_126EFAC | 4 | extended_features | Extended features / GNU compat mode. |
dword_126EFB0 | 4 | gnu_extensions_enabled | GNU extensions active. |
dword_126E1F8 | 4 | host_dialect_gnu | Host compiler is GCC/GNU. Set during dialect init. |
dword_126E1E8 | 4 | host_dialect_clang | Host compiler is Clang. Set during dialect init. |
qword_126E1F0 | 8 | host_gcc_version | Host GCC version, copied from qword_126EF98. |
qword_126E1E0 | 8 | host_clang_version | Host Clang version, copied from qword_126EF90. |
dword_126EFC8 | 4 | debug_trace_enabled | Debug tracing active. Set unconditionally by --db. |
dword_126EFCC | 4 | debug_verbosity | Debug output level. >2=detailed, >4=IL walk trace. |
dword_E7FF10 | 4 | cuda_compat_flag | Legacy compat: dword_126EFAC && qword_126EF98 <= 0x76BF. |
dword_106BDD8 | 4 | version_gated_feature | Set when GCC >= 7.0 or Clang >= 4.0. Referenced in 7 functions. |
dword_106C2A0 | 4 | error_count_baseline | Saved from dword_126EFCC after --db processing. |
qword_1065870 | 8 | debug_trace_list | Head of debug trace request linked list. |
dword_126E4A0 | 4 | target_vector_width | Set to 8 by sub_7515D0. |
Cross-References
- CLI Flag Inventory --
--target,--gnu_version,--clang_version,--dbflag details - Architecture Detection --
--targetflag and SM version parsing details - CUDA Error Catalog -- Complete diagnostic messages for each feature gate
- .int.c File Format -- Host compiler pragma emission details
- Backend Code Generation -- GCC/Clang version threshold usage in output
- Global Variable Index -- Full address-level documentation
- Execution Spaces -- Execution space bitfield and attribute handlers
- __managed__ Variables -- Managed variable attribute and SM 30 gate
- __grid_constant__ -- Grid constant attribute and SM 70 gate