Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Architecture Feature Gating

cudafe++ enforces architecture-dependent feature gates that prevent use of CUDA constructs on hardware that cannot support them. These gates operate at three distinct layers: compile-time SM version checks against dword_126E4A8 during semantic analysis, string-embedded diagnostic messages with architecture names baked into .rodata, and host-compiler version gating controlling which GCC/Clang-specific #pragma directives and language constructs appear in the generated .int.c output. A separate mechanism, the --db debug system, provides runtime tracing that can expose architecture checks as they execute. This page documents all three layers, the global variables involved, every discovered threshold constant, and the complete data flow from nvcc invocation to feature gate evaluation.

Key Facts

PropertyValue
SM version storagedword_126E4A8 (sm_architecture, set by --target / case 245)
SM version TU-level copydword_126EBF8 (target_config_index, copied during TU init in sub_586240)
Architecture parser stubsub_7525E0 (6-byte stub returning -1; actual parsing done by nvcc)
Post-parse initializersub_7525F0 (set_target_configuration, target.c:299)
Type table initializersub_7515D0 (sets 100+ type-size/alignment globals, called from sub_7525F0)
GCC version globalqword_126EF98 (default 80100 = GCC 8.1.0, set by --gnu_version case 184)
Clang version globalqword_126EF90 (default 90100 = Clang 9.1.0, set by --clang_version case 188)
GCC host dialect flagdword_126E1F8 (host compiler identified as GCC)
Clang host dialect flagdword_126E1E8 (host compiler identified as Clang)
Host GCC version copyqword_126E1F0 (copied from qword_126EF98 during dialect init)
Host Clang version copyqword_126E1E0 (copied from qword_126EF90 during dialect init)
--nv_arch error string"invalid or no value specified with --nv_arch flag" at 0x8884F0
Debug option parsersub_48A390 (proc_debug_option, 238 lines, debug.c)
Debug trace linked listqword_1065870 (head pointer)
Invalid arch sentinel-1 (0xFFFFFFFF)
Feature threshold count17 CUDA features across 7 SM versions (20, 30, 52, 60, 70, 80, 90/90a)
Host compiler threshold count19 version constants across GCC 3.0 through GCC 14.0

Layer 1: SM Architecture Input

How the Architecture Reaches cudafe++

cudafe++ never parses architecture strings directly from the user. The driver (nvcc) translates user-facing flags like --gpu-architecture=sm_90 into an internal numeric code and passes it via the --target flag when spawning the cudafe++ process. Inside cudafe++, the --target flag is registered as CLI flag 245 and handled in proc_command_line (sub_459630).

The handler calls sub_7525E0, which in the CUDA Toolkit 13.0 binary is a 6-byte stub:

; sub_7525E0 -- architecture parser stub
; Address: 0x7525E0, Size: 6 bytes
mov     eax, 0FFFFFFFFh    ; return -1 unconditionally
retn

This stub always returns -1 (the invalid-architecture sentinel). The actual architecture code is injected by nvcc into the argument string that sub_7525E0 receives. Because IDA decompiled this as a stub, the parsing logic is either inlined by the compiler or resolved through a different mechanism at link time. The result is stored in dword_126E4A8:

// proc_command_line (sub_459630), case 245
v80 = sub_7525E0(qword_E7FF28, v23, v20, v30);  // parse SM code from arg string
dword_126E4A8 = v80;                              // store in sm_architecture
if (v80 == -1) {
    sub_4F8420(2664);  // emit error 2664
    // error string: "invalid or no value specified with --nv_arch flag"
    sub_4F2930("cmd_line.c", 12219, "proc_command_line", 0, 0);
    // assert_fail -- unreachable if error handler returns
}
sub_7525F0(v80);  // set_target_configuration

Error 2664 fires when the architecture value is -1. The error string at 0x8884F0 references --nv_arch (the nvcc-facing name for this flag). This string has no direct xrefs in the IDA analysis, meaning it is loaded indirectly through the error message table (off_88FAA0). The --nv_arch name in the error message is a user-facing alias; internally cudafe++ processes it as --target (flag 245).

set_target_configuration (sub_7525F0)

After storing the SM version, sub_7525F0 performs post-parse initialization. This function lives in target.c:299:

// sub_7525F0 -- set_target_configuration
__int64 __fastcall sub_7525F0(int a1)
{
    if ((unsigned int)(a1 + 1) > 1)  // rejects only -1
        assert_fail("set_target_configuration", 299);
    sub_7515D0();           // initialize type table for target platform
    qword_126E1B0 = "lib";  // library search path prefix
}

The guard (a1 + 1) > 1u is an unsigned comparison that accepts any value >= 0 and rejects only -1 (which wraps to 0 when incremented). This is a sanity check -- in production, nvcc always provides a valid SM code.

Type Table Initialization (sub_7515D0)

The sub_7515D0 function, called from set_target_configuration, initializes over 100 global variables describing the target platform's type sizes, alignments, and numeric limits. This establishes the data model for CUDA device code:

// sub_7515D0 -- target type initialization (excerpt)
// Sets LP64 data model with CUDA-specific type properties
dword_126E328 = 8;     // sizeof(long)
dword_126E338 = 4;     // sizeof(int)
dword_126E2FC = 16;    // sizeof(long double)
dword_126E308 = 16;    // alignof(long double)
dword_126E2B8 = 8;     // sizeof(pointer)
dword_126E2AC = 8;     // alignof(pointer)
dword_126E420 = 2;     // sizeof(wchar_t)
dword_126E4A0 = 8;     // target vector width
dword_126E258 = 53;    // double mantissa bits
dword_126E250 = 1024;  // double max exponent
dword_126E254 = -1021; // double min exponent
dword_126E234 = 113;   // __float128 mantissa bits
dword_126E22C = 0x4000; // __float128 max exponent
dword_126E230 = -16381; // __float128 min exponent
// ... ~80 more assignments ...

The function unconditionally returns -1, which is not used by the caller.

SM Version Propagation

During translation unit initialization (sub_586240, called from fe_translation_unit_init), the SM version is copied into a TU-level global:

// sub_586240, line 54 in decompiled output
dword_126EBF8 = dword_126E4A8;  // target_config_index = sm_architecture

After this point, architecture checks throughout the compiler read either dword_126E4A8 (the CLI-level global) or dword_126EBF8 (the TU-level copy). Both contain the same integer SM version code. The dual-variable pattern exists because EDG's architecture supports multi-TU compilation where each TU could theoretically target a different architecture (though CUDA compilation always uses a single target per cudafe++ invocation).

Layer 2: CUDA Feature Thresholds

cudafe++ checks the SM architecture version at semantic analysis time to gate CUDA-specific features. When a feature is used on an architecture below its minimum requirement, the compiler emits a diagnostic error or warning. All thresholds below were extracted from error strings embedded in the binary's .rodata section and confirmed through cross-reference with diagnostic tag names.

Complete Feature Threshold Table

FeatureMin ArchitectureDiagnostic TagError String
Virtual base classescompute_20use_of_virtual_base_on_compute_1xUse of a virtual base (%t) requires the compute_20 or higher architecture
Device variadic functionscompute_30device_function_has_ellipsis__device__ or __host__ __device__ function with ellipsis requires compute_30 or higher architecture
__managed__ variablescompute_30unsupported_arch_for_managed_capability__managed__ variables require architecture compute_30 or higher
alloca() in device codecompute_52alloca_unsupported_for_lower_than_arch52alloca() is not supported for architectures lower than compute_52
Atomic scope argumentsm_60(inline)atomic operations' scope argument is supported on architecture sm_60 or above. Fall back to use membar.
Atomic f64 add/subsm_60(inline)atomic add and sub for 64-bit float is supported on architecture sm_60 or above.
__nv_atomic_* functionssm_60(inline)__nv_atomic_* functions are not supported on arch < sm_60.
__grid_constant__compute_70grid_constant_unsupported_arch__grid_constant__ annotation is only allowed for architecture compute_70 or later
Atomic memory ordersm_70(inline)atomic operations' argument of memory order is supported on architecture sm_70 or above. Fall back to use membar.
128-bit atomic load/storesm_70(inline)128-bit atomic load and store are supported on architecture sm_70 or above.
16-bit atomic CASsm_70(inline)16-bit atomic compare-and-exchange is supported on architecture sm_70 or above.
__nv_register_params__compute_80register_params_unsupported_arch__nv_register_params__ is only supported for compute_80 or later architecture
__wgmma_mma_asyncsm_90awgmma_mma_async_not_enabled__wgmma_mma_async builtins are only available for sm_90a
Atomic cluster scopesm_90(inline)atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead.
Atomic cluster scope (load/store)sm_90(inline)atomic load and store's scope of cluster is supported on architecture sm_90 or above. Using device scope instead.
128-bit atomic exch/CASsm_90nv_atomic_exch_cas_b128_not_supported128-bit atomic exchange or compare-and-exchange is supported on architecture sm_90 or above.

GPU-Architecture-Gated Attributes (No Specific SM in String)

Several features check the architecture but their error strings do not embed a specific SM version number. Instead, they use the generic phrase "this GPU architecture", meaning the threshold is encoded in the comparison logic rather than the diagnostic text:

FeatureDiagnostic TagError String
__cluster_dims__cluster_dims_unsupported__cluster_dims__ is not supported for this GPU architecture
max_blocks_per_clustermax_blocks_per_cluster_unsupportedcannot specify max blocks per cluster for this GPU architecture
__block_size__block_size_unsupported__block_size__ is not supported for this GPU architecture
__managed__ (config)unsupported_configuration_for_managed_capability__managed__ variables are not yet supported for this configuration (compilation mode (32/64 bit) and/or target operating system)

These features are gated by the same dword_126E4A8 comparison mechanism as the features in the main table, but their exact SM threshold values would require tracing the specific comparison instructions in the semantic analysis functions.

Diagnostic Behavior: Errors vs Warnings vs Demotions

Architecture gate violations produce three distinct behaviors depending on the feature class:

Hard errors -- Compilation halts. Features that fundamentally cannot work on the target architecture:

  • __managed__ below compute_30 -- No unified memory hardware support
  • __grid_constant__ below compute_70 -- No hardware constant propagation mechanism
  • __nv_register_params__ below compute_80 -- Register parameter ABI not available
  • __wgmma_mma_async below sm_90a -- No warp-group MMA hardware
  • alloca() below compute_52 -- No dynamic stack allocation support on device
  • Virtual base classes below compute_20 -- No vtable support on earliest GPU architectures

Fallback warnings -- Compilation continues with degraded behavior. The compiler generates functionally correct but potentially less performant code:

  • Atomic scope arguments on pre-sm_60 -- Falls back to membar-based synchronization
  • Atomic memory order on pre-sm_70 -- Falls back to membar-based ordering
  • 64-bit float atomics on pre-sm_60 -- Falls back to CAS loop emulation

Scope demotion warnings -- Informational diagnostics about automatic scope narrowing:

  • Cluster scope atomics on pre-sm_90 -- Silently demotes to device scope ("Using device scope instead")

compute_XX vs sm_XX Naming

Error strings use two naming conventions that reflect CUDA's split between virtual and physical architectures:

  • compute_XX -- Virtual architecture. Checked at PTX generation time. Features gated by compute_XX are relevant to the intermediate PTX representation and are independent of the specific GPU die. Examples: __managed__ (requires unified memory ISA support), alloca() (requires dynamic stack frame instructions).

  • sm_XX -- Physical architecture. Checked at SASS generation time. Features gated by sm_XX are tied to specific hardware capabilities of a GPU die. Examples: 128-bit atomics (require specific load/store unit widths), cluster scope (requires the SM 9.0 thread block cluster hardware).

In practice, cudafe++ stores a single integer in dword_126E4A8 and the distinction is purely semantic -- both forms gate against the same numeric value. The value is a compute capability number (e.g., 70 for Volta, 90 for Hopper).

The sm_90a suffix (with the a accelerator flag) is a special case used exclusively for __wgmma_mma_async builtins. This variant requires the Hopper accelerated architecture, which is distinct from the base sm_90. The a suffix is encoded in the SM integer value passed to cudafe++ by nvcc.

__wgmma_mma_async Detail

The warp-group matrix multiply-accumulate builtin has the most granular validation of any architecture-gated feature. Beyond the sm_90a architecture check, cudafe++ also validates:

CheckDiagnostic TagError String
Architecture gatewgmma_mma_async_not_enabled__wgmma_mma_async builtins are only available for sm_90a
Shape validationwgmma_mma_async_bad_shapeThe shape %s is not supported for __wgmma_mma_async builtin
A operand typewgmma_mma_async_bad_A_type(type mismatch diagnostic)
B operand typewgmma_mma_async_bad_B_type(type mismatch diagnostic)
Missing argumentswgmma_mma_async_missing_argsThe 'A' or 'B' argument to __wgmma_mma_async call is missing
Non-constant argswgmma_mma_async_nonconstant_argNon-constant argument to __wgmma_mma_async call

The validation function is identified as check_wgmma_mma_async (string at 0x888CAC). Four type-specific builtin variants are registered: __wgmma_mma_async_f16, __wgmma_mma_async_bf16, __wgmma_mma_async_tf32, and __wgmma_mma_async_f8.

nv_register_params Detail

The register parameter attribute has three distinct checks, only one of which is an architecture gate:

CheckDiagnostic TagError String
Feature enable flagregister_params_not_enabled__nv_register_params__ support is not enabled
Architecture gateregister_params_unsupported_arch__nv_register_params__ is only supported for compute_80 or later architecture
Function type checkregister_params_unsupported_function__nv_register_params__ is not allowed on a %s function
Ellipsis checkregister_params_ellipsis_function(variadic function diagnostic)

The attribute handler is apply_nv_register_params_attr (string at 0x830C78).

SM Version to Feature Summary

SM VersionFeatures IntroducedFeature Count
compute_20Virtual base classes in device code1
compute_30__managed__ variables, device variadic functions2
compute_52alloca() in device code1
sm_60Atomic scope argument, 64-bit float atomics, __nv_atomic_* API3
sm_70__grid_constant__, 128-bit atomic load/store, atomic memory order, 16-bit CAS4
compute_80__nv_register_params__1
sm_90 / sm_90a__wgmma_mma_async, thread block clusters, 128-bit atomic exchange/CAS, cluster scope atomics5

Notably absent from cudafe++ error strings are features like cooperative groups (sm_60+), tensor cores (sm_70+), and dynamic parallelism (sm_35+). These are checked at runtime or by the PTX assembler (ptxas) rather than the language frontend.

Layer 3: Host Compiler Version Gating

cudafe++ generates .int.c output that must compile cleanly under the host C++ compiler (GCC, Clang, or MSVC). Because different host compiler versions support different warning pragmas, attributes, and language features, cudafe++ gates its output based on the host compiler version stored in qword_126EF98 (GCC) and qword_126EF90 (Clang). Additionally, several C++ language feature flags in the EDG frontend are conditionally enabled based on host compiler version to match the behavior the user expects from their host compiler.

Version Encoding

Both GCC and Clang versions are encoded as a single integer: major * 10000 + minor * 100 + patch. For example, GCC 8.1.0 is encoded as 80100. The compiler tests these values against hexadecimal threshold constants using > (strictly-greater-than) comparisons, which effectively means "version at or above threshold + 1." Since all threshold values use a 99 patch level (e.g., 40299 for GCC 4.2.99), the gate > 40299 is equivalent to >= 40300, which effectively means "GCC 4.3 or later."

Complete Threshold Table

Hex ConstantDecimalEncoded VersionEffective GateOccurrence Count
0x752F29,9992.99.99GCC/Clang >= 3.01 (dialect resolution)
0x75F730,1993.01.99GCC/Clang >= 3.2low
0x76BF30,3993.03.99GCC/Clang >= 3.4low (cuda_compat_flag gate)
0x778730,5993.05.99Clang >= 3.6medium (-Wunused-local-typedefs)
0x78B330,8993.08.99Clang >= 3.9low
0x9C3F39,9993.99.99GCC >= 4.0medium (dword_106BDD8 + Clang gate)
0x9D0740,1994.01.99GCC >= 4.2medium (-Wunused-variable file-level)
0x9D6B40,2994.02.99GCC >= 4.3medium (variadic templates)
0x9DCF40,3994.03.99GCC >= 4.4low (dialect resolution)
0x9E3340,4994.04.99GCC >= 4.5low (dialect resolution)
0x9E9740,5994.05.99GCC >= 4.6medium (diagnostic push/pop)
0x9EFB40,6994.06.99GCC >= 4.7low (feature flag gating)
0x9F5F40,7994.07.99GCC >= 4.8medium (-Wunused-local-typedefs)
0xEA5F59,9995.99.99GCC >= 6.022 files (C++14/17 features)
0xEB2760,1996.01.99GCC >= 6.2low (HasFuncPtrConv gate)
0x1116F69,9996.99.99GCC >= 7.0medium (dword_106BDD8 + feature flags)
0x15F8F89,9998.99.99GCC/Clang >= 9.0medium (C++17/20 features)
0x1D4BF119,99911.99.99GCC/Clang >= 12.08 files
0x1FBCF129,99912.99.99GCC >= 13.013 files
0x222DF139,99913.99.99GCC >= 14.05 files

How Thresholds Are Used

The thresholds serve three purposes:

1. Diagnostic pragma emission. The .int.c output includes #pragma GCC diagnostic directives to suppress host compiler warnings about CUDA-generated code. Different GCC/Clang versions introduced different warning flags, so the pragmas are conditionally emitted:

// From sub_489000 (backend boilerplate emission)
// -Wunused-local-typedefs: GCC 4.8+ (0x9F5F) or Clang 3.6+ (0x7787)
if ((dword_126E1E8 && qword_126EF90 > 0x7787)
    || (!dword_106BF6C && !dword_106BF68
        && dword_126E1F8 && qword_126E1F0 > 0x9F5F))
{
    emit("#pragma GCC diagnostic ignored \"-Wunused-local-typedefs\"");
}

// Push/pop block for managed RT: GCC 4.6+ (0x9E97) or Clang
if (dword_126E1E8 || (!dword_106BF6C && dword_126E1F8 && qword_126E1F0 > 0x9E97))
{
    emit("#pragma GCC diagnostic push");
    emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");
    emit("#pragma GCC diagnostic ignored \"-Wunused-function\"");
    // ... managed runtime boilerplate ...
    emit("#pragma GCC diagnostic pop");
}

// File-level -Wunused-variable: GCC 4.2+ (0x9D07) or Clang
if (dword_126E1E8 || (dword_126E1F8 && qword_126E1F0 > 0x9D07))
    emit("#pragma GCC diagnostic ignored \"-Wunused-variable\"");

2. C++ feature gating during dialect resolution. The post-parsing dialect resolution in proc_command_line and the sub_44B6B0 dialect setup function use qword_126EF98 thresholds to decide which C++ language features to enable. Examples from the decompiled code:

// sub_44B6B0 -- dialect resolution, ~400 lines
// GCC 4.3+ (0x9D6B): enable variadic templates
if (qword_126EF98 > 0x9D6B)
    dword_106BE1C = 1;  // variadic_templates

// GCC 4.7+ (0x9EFB): enable list initialization under certain conditions
if (qword_126EF98 > 0x9EFB && dword_106BE1C && (!byte_E7FFF1 || dword_106C10C))
    dword_106BE10 = 1;

// GCC 6.0+ (0xEA5F) or Clang: enable C++14/17 features
if (dword_126EFA4 || (dword_126EFA8 && qword_126EF98 > 0xEA5F))
    // Enable feature (Clang always, GCC only 6.0+)

3. CUDA compatibility mode. A special flag dword_E7FF10 (cuda_compat_flag) is set when dword_126EFAC && qword_126EF98 <= 0x76BF -- that is, when extended features are enabled but the GCC version is 3.3.99 or below. This activates a legacy compatibility path for very old host compilers that lack modern C++ support.

The 0xEA5F (59999) Threshold -- The Most Pervasive Gate

The threshold 0xEA5F (GCC 6.0) is the most widely used version constant in the binary, appearing in 22 decompiled functions. It gates the C++14/17 feature set boundary. GCC 6.0 was the first GCC release with full C++14 support and substantial C++17 support.

The typical usage pattern is:

// Pattern: "Clang (any version) OR GCC 6.0+"
if (dword_126EFA4 || (dword_126EFA8 && qword_126EF98 > 0xEA5F))
    // Enable C++14/17 feature

// Pattern: "GNU extensions but not Clang, GCC 6.0+"
if (dword_126EFAC && !dword_126EFA4 && qword_126EF98 > 0xEA5F)
    // Enable GNU-specific extended feature

Functions using this threshold include: declaration processing (sub_40D900), attribute application (sub_413ED0), class declaration (sub_431590), dialect resolution (sub_44B6B0), initializer processing (sub_48C710, sub_4B6760), backend code generation (sub_4688C0), expression canonicalization (sub_4CA6C0, sub_4D2B70), IL walking (sub_54AED0), scope management (sub_59C9B0, sub_59AF40), type processing (sub_5D1350), overload resolution (sub_662670, sub_666720), and template specialization (sub_6A3B00).

Version-Gated Feature Flag: dword_106BDD8

One particular feature flag (dword_106BDD8) is set during dialect resolution based on a compound version check:

// sub_44B6B0, decompiled line ~228-231
// v4 = (dword_126EFA4 != 0), i.e., is_clang_mode
if ((dword_126EFAC && !v4 && qword_126EF98 > 0x1116F)  // GNU ext, not Clang, GCC >= 7.0
    || (v4 && qword_126EF90 > 0x9C3F))                   // or Clang >= 4.0
{
    dword_106BDD8 = 1;
}

This flag is referenced in 7 decompiled functions (sub_430920, sub_42FE50, sub_447930, sub_44AAC0, sub_44B6B0, sub_45EB40, sub_724630). The W066 global variables report identifies it as optix_mode, but the decompiled code shows it is set purely based on compiler version thresholds during dialect resolution, not from any --emit-optix-ir CLI flag. It likely controls a C++ language feature (possibly structured bindings or another C++17 feature) that requires GCC 7.0+ or Clang 4.0+ support, and the "optix_mode" name in the report may be a misidentification based on context where it was encountered. The flag gates behavior in attribute validation (sub_42FE50), where it interacts with dword_106B670 to control feature availability.

Dialect Initialization Flow

The host compiler version globals are initialized in proc_command_line and propagated to the dialect system during TU initialization:

proc_command_line (CLI parsing, sub_459630):
  case 184 (--gnu_version=X):   qword_126EF98 = X   // GCC version
  case 188 (--clang_version=X): qword_126EF90 = X   // Clang version
  case 182 (--gcc):             dword_126EFA8 = 1    // GCC mode flag
  case 187 (--clang):           dword_126EFA4 = 1    // Clang mode flag

dialect_init (sub_44B6B0, called during setup):
  // ~400 lines of version-threshold-based feature flag resolution
  // Sets 30+ EDG feature flags based on gcc_version, clang_version,
  // cpp_standard_version, and extension mode flags

target dialect (sub_752A80, select_cp_gen_be_target_dialect):
  if (dword_126EFA8):                           // GCC mode
    dword_126E1F8 = 1                           // host_dialect_gnu
    qword_126E1F0 = qword_126EF98              // host_gcc_version
  if (dword_126EFA4):                           // Clang mode
    dword_126E1E8 = 1                           // host_dialect_clang
    qword_126E1E0 = qword_126EF90              // host_clang_version

The defaults for unspecified versions are qword_126EF98 = 80100 (GCC 8.1.0) and qword_126EF90 = 90100 (Clang 9.1.0), set during default_init (sub_45EB40).

The --db Debug Mechanism

The --db flag (CLI case 37) activates EDG's internal debug tracing system by calling sub_48A390 (proc_debug_option). While not directly related to architecture gating, the --db mechanism shares the adjacent global namespace (dword_126EFC8, dword_126EFCC) and is relevant because debug tracing can expose architecture checks as they execute in real time.

Connection Between --db and Architecture

The --db flag does not set or modify any architecture-related globals. Its connection to the architecture system is observational: when debug tracing is enabled, the compiler emits trace output at key decision points throughout compilation, including the semantic analysis functions that evaluate architecture thresholds. Enabling --db=5 (verbosity level 5) causes the compiler to log IL entry kinds, template instantiation steps, and scope transitions, which provides visibility into when and why architecture gates fire.

The CLI dispatch for --db:

// proc_command_line (sub_459630), case 37
case 37:  // --db=<string>
    if (sub_48A390(qword_E7FF28))  // proc_debug_option
        goto error;                // returns nonzero on parse failure
    dword_106C2A0 = dword_126EFCC; // save initial error count baseline

After proc_debug_option returns, dword_106C2A0 captures the current value of dword_126EFCC (debug verbosity level). This is used as a baseline error count for subsequent error tracking.

proc_debug_option (sub_48A390)

This 238-line function (debug.c) parses debug control strings. On entry, it unconditionally sets dword_126EFC8 = 1 (debug tracing enabled), then dispatches based on the first character of the input:

// sub_48A390 entry
dword_126EFC8 = 1;  // enable debug tracing
v3 = (unsigned __int8)*nptr;
if ((v3 - 48) <= 9) {               // first char is digit
    dword_126EFCC = strtol(v1, 0, 10); // set verbosity level
    return 0;
}

The full parsing grammar:

Input FormatParsed AsAction
"5" (numeric only)Verbosity levelSets dword_126EFCC = 5
"name=3"Name with levelAdds trace node: action=1, level=3
"name+=3"Additive traceAdds trace node: action=2, level=3
"name-=3"Subtractive traceAdds trace node: action=3, level=3
"name=3!"Permanent traceAdds trace node: action=1, level=3, permanent=1
"#name"Hash removalRemoves matching node from trace list
"-name"Dash removalRemoves matching node from trace list
"a,b=2,c=3"Comma-separatedProcesses each entry independently

Debug Trace Node Structure

Debug trace requests are stored as a singly-linked list rooted at qword_1065870. Each node is 28 bytes, allocated via sub_6B7340 (the IL allocator):

struct debug_trace_node {           // 28 bytes (32 allocated)
    struct debug_trace_node* next;  // +0:  linked list link
    char*  name_string;             // +8:  entity name to trace (heap copy)
    int32  action_type;             // +16: 1=set, 2=add, 3=subtract, 4=remove
    int32  level;                   // +20: trace level (integer)
    int32  permanent;               // +24: 1=survives reset, 0=cleared on reset
};

When proc_debug_option encounters its own name in the trace list (the self-referential check !strcmp(src, "proc_debug_option")), it prints the entire trace state to stderr:

if (qword_1065870 && (v2 & 1) != 0) {
    do {
        fprintf(s, "debug request for: %s\n", node->name_string);
        fprintf(s, "action=%d,  level=%d\n", node->action_type, node->level);
        node = node->next;
    } while (node);
}

Debug Verbosity Levels

The dword_126EFCC verbosity level controls trace output granularity across the entire compiler:

LevelEffect
0No debug output (default)
1-2Basic trace: function entry/exit markers
3Detailed trace: includes entity names, scope indices
4Very detailed: IL entry kinds, overload candidate lists
5+Full trace: IL tree walking with "Walking IL tree, entry kind = ..."

db_name (CLI case 190)

The --db_name flag (case 190) calls a separate function sub_48AD80 to register a debug name filter. Unlike --db which enables global tracing, --db_name restricts trace output to entities matching the specified name pattern. If sub_48AD80 fails (returns nonzero), error 570 is emitted.

Three-Layer Checking Model

Layer 1: Compile-Time Semantic Checks (cudafe++ Frontend)

These are the primary gates. During semantic analysis, cudafe++ reads dword_126E4A8 and compares it against threshold constants. Violations emit diagnostic errors through the standard error system (diagnostic IDs in the 3000+ range, displayed as 20000-series via the +16543 offset formula). These checks are unconditional -- they fire regardless of whether the code would actually execute at runtime.

Enforcement point: Declaration processing, type checking, attribute application, and CUDA-specific semantic validation passes.

Examples:

  • __managed__ variable declaration with dword_126E4A8 < 30 triggers unsupported_arch_for_managed_capability
  • __grid_constant__ parameter with dword_126E4A8 < 70 triggers grid_constant_unsupported_arch
  • __wgmma_mma_async call on non-sm_90a triggers wgmma_mma_async_not_enabled
  • Virtual base class with dword_126E4A8 < 20 triggers use_of_virtual_base_on_compute_1x

Layer 2: String-Embedded Diagnostic Formatting

Error strings with architecture names baked into .rodata represent the complete set of architecture-dependent diagnostics. These strings are loaded by the diagnostic system and formatted with the current architecture value. The strings serve as the user-visible feedback for Layer 1 checks.

The architecture name in the string (e.g., "compute_70", "sm_90a") is a literal constant, not a formatted parameter -- the compiler does not interpolate the actual target architecture into these messages. This means the error messages always state the minimum required architecture, not what the user actually specified. The only exception is the virtual base error which uses %t (a type formatter) to include the base class name, not the architecture.

Layer 3: Host Compiler Version Gating

This layer does not check GPU architecture at all -- instead, it gates the output format of the generated .int.c file based on the host C++ compiler's version. The thresholds ensure that GCC/Clang-specific pragmas, attributes, and language constructs in the generated code are compatible with the actual host compiler that will consume the output.

Enforcement point: Backend code generation (sub_489000 and related functions in cp_gen_be.c).

Impact: Incorrect host compiler version gating does not cause compilation failure -- it may produce warnings from the host compiler due to unrecognized pragmas, or miss warning suppression directives that would silence spurious diagnostics.

Interaction Between Layers

nvcc (driver)
  |
  | --target=<sm_code>  --gnu_version=<ver>  --clang_version=<ver>
  v
cudafe++ process
  |
  +-- CLI parsing (proc_command_line)
  |     dword_126E4A8 = sm_code         (SM architecture)
  |     qword_126EF98 = gcc_version     (host GCC version)
  |     qword_126EF90 = clang_version   (host Clang version)
  |
  +-- set_target_configuration (sub_7525F0)
  |     sub_7515D0()  -- type table init (100+ globals)
  |
  +-- dialect_resolution (sub_44B6B0)
  |     30+ feature flags set based on version thresholds
  |     dword_126E1F8 / dword_126E1E8  -- host dialect set
  |     qword_126E1F0 / qword_126E1E0  -- host version copies
  |
  +-- TU init (sub_586240)
  |     dword_126EBF8 = dword_126E4A8   (SM version copy)
  |
  +-- [Layer 1] Semantic analysis
  |     Compare dword_126E4A8 against SM thresholds
  |     Emit CUDA-specific errors for unsupported features
  |
  +-- [Layer 2] Diagnostic formatting
  |     Load error string with baked-in architecture name
  |     Format and display error to user
  |
  +-- [Layer 3] .int.c code generation
  |     Compare qword_126E1F0 / qword_126E1E0 against host thresholds
  |     Emit appropriate #pragma directives
  |     Generate host-compiler-compatible boilerplate
  |
  v
Host Compiler (gcc / clang / cl.exe)

Layers 1 and 2 operate during the frontend phase and can halt compilation. Layer 3 operates during the backend phase and only affects the format of the generated output file.

Global Variable Summary

AddressSizeNameRole
dword_126E4A84sm_architectureTarget SM version from --target (case 245). Sentinel: -1.
dword_126EBF84target_config_indexTU-level copy of dword_126E4A8, set in sub_586240.
qword_126EF988gcc_versionGCC compatibility version. Default 80100. Set by --gnu_version (case 184).
qword_126EF908clang_versionClang compatibility version. Default 90100. Set by --clang_version (case 188).
dword_126EFA84gcc_extensionsGCC mode enabled. Set by --gcc (case 182).
dword_126EFA44clang_extensionsClang mode enabled. Set by --clang (case 187).
dword_126EFAC4extended_featuresExtended features / GNU compat mode.
dword_126EFB04gnu_extensions_enabledGNU extensions active.
dword_126E1F84host_dialect_gnuHost compiler is GCC/GNU. Set during dialect init.
dword_126E1E84host_dialect_clangHost compiler is Clang. Set during dialect init.
qword_126E1F08host_gcc_versionHost GCC version, copied from qword_126EF98.
qword_126E1E08host_clang_versionHost Clang version, copied from qword_126EF90.
dword_126EFC84debug_trace_enabledDebug tracing active. Set unconditionally by --db.
dword_126EFCC4debug_verbosityDebug output level. >2=detailed, >4=IL walk trace.
dword_E7FF104cuda_compat_flagLegacy compat: dword_126EFAC && qword_126EF98 <= 0x76BF.
dword_106BDD84version_gated_featureSet when GCC >= 7.0 or Clang >= 4.0. Referenced in 7 functions.
dword_106C2A04error_count_baselineSaved from dword_126EFCC after --db processing.
qword_10658708debug_trace_listHead of debug trace request linked list.
dword_126E4A04target_vector_widthSet to 8 by sub_7515D0.

Cross-References