Experimental and Version-Gated Flags
cudafe++ gates several categories of CUDA language features behind flags that nvcc manages automatically. Users interact with these through nvcc options like --expt-extended-lambda and --expt-relaxed-constexpr; nvcc translates these into the internal cudafe++ flags --extended-lambda and --relaxed_constexpr before invocation. A third category, C++ standard version gating, controls which language-level features affect the CUDA compilation pipeline. Two additional flags (--default-device, --no-device-int128/--no-device-float128) tune device code semantics without the "experimental" label.
This page documents the internal mechanism of each flag: the global variable it sets, every code path it unlocks, the diagnostics it suppresses or enables, and the compile-time cost of enabling it.
Flag Summary
| nvcc Flag | cudafe++ Internal Flag | Flag ID | Global Variable | Default | Effect |
|---|---|---|---|---|---|
--expt-extended-lambda | --extended-lambda | 79* | dword_106BF38 | 0 | Enable entire extended lambda wrapper infrastructure |
--expt-relaxed-constexpr | --relaxed_constexpr | 104 | dword_106BF40 | 0 | Allow constexpr cross-space calls |
-std=c++NN | --c++NN / set_flag | -- | dword_126EF68 | 199711 | Gate C++ standard features |
| (JIT mode) | --default-device | ** | -- | 0 | Change unannotated default to __device__ |
--no-device-int128 | --no-device-int128 | 52 | -- | 0 | Disable __int128 in device code |
--no-device-float128 | --no-device-float128 | 53 | -- | 0 | Disable __float128/_Float128 in device code |
* The extended-lambda flag is registered as flag 79 (disable_ext_lambda_cache is a separate flag at that slot in some reports; the exact case_id for the flag parsed as extended-lambda is in the CUDA-specific range 47--89 but the individual case within the grouped 47--53 block is not fully disambiguated). The flag string "extended-lambda" is at binary address 0x836410, referenced from sub_452010 (init_command_line_flags).
** The --default-device flag is not in the standard numbered flag catalog (1--275). It is registered through one of the 7 inline-registered paired flags or the set_flag/clear_flag table (off_D47CE0). Its string literal appears in four JIT error messages in the binary.
--extended-lambda (dword_106BF38)
This is the single most impactful experimental flag in cudafe++. It enables the entire extended lambda subsystem -- approximately 40 functions in nv_transforms.c, 2,100 lines of lambda scanning in cmd_line.c, 17 steps of preamble text emission, and per-lambda wrapper generation in the backend. Without it, CUDA lambdas annotated with __device__ or __host__ __device__ are rejected outright.
What It Enables
When dword_106BF38 != 0, the following subsystems activate:
1. Lambda scanning (sub_447930, 2,113 lines)
The 7-phase scan_lambda function performs full CUDA validation on every lambda expression. Phase 4 checks all 35+ restriction categories documented in the restrictions page. Without the flag, phase 4 early-exits and emits error 3612 instead.
2. Preamble injection (sub_4864F0 + sub_6BCC20)
When the backend encounters a type declaration for the sentinel __nv_lambda_preheader_injection, three conditions must all be true for the preamble to fire:
// sub_4864F0 trigger conditions:
if ((entity_bits[-8] & 0x10) != 0 // marker bit set
&& dword_106BF38 != 0 // --extended-lambda enabled
&& name_matches_sentinel) // 30-byte name comparison
{
sub_6BCC20(emit_func); // emit ~10-50 KB of template text
}
The master emitter (sub_6BCC20) produces the complete lambda wrapper infrastructure as inline C++ text injected into the .int.c output. The 17-step emission sequence generates:
| Step | Output | Purpose |
|---|---|---|
| 1 | __NV_LAMBDA_WRAPPER_HELPER, __nvdl_remove_ref, __nvdl_remove_const | Utility macros and type traits |
| 2 | __nv_dl_tag | Device lambda tag type |
| 3 | Array capture helpers (dim 2--8) | N-dimensional array forwarding via sub_6BC290 |
| 4 | Primary __nv_dl_wrapper_t + zero-capture specialization | Device lambda wrapper template |
| 5 | __nv_dl_trailing_return_tag + zero-capture specialization | Trailing return type support |
| 6 | Device bitmap scan | One sub_6BB790 call per set bit in unk_1286980 |
| 7 | __nv_hdl_helper (anonymous namespace, 4 static function pointers) | Host-device lambda dispatch helper |
| 8 | Primary __nv_hdl_wrapper_t with static_assert | Host-device wrapper template |
| 9 | HD bitmap scan | Four calls per set bit in unk_1286900 (const x mutable x 2 helpers) |
| 10 | __nv_hdl_helper_trait_outer | Deduction helper traits |
| 11 | C++17 noexcept variants | Conditional on dword_126E270 (see C++ version gating) |
| 12 | __nv_hdl_create_wrapper_t | Factory for HD wrappers |
| 13 | __nv_lambda_trait_remove_const/volatile/cv | CV-qualifier removal traits |
| 14 | __nv_extended_device_lambda_trait_helper + detection macro | Device lambda type detection |
| 15 | __nv_lambda_trait_remove_dl_wrapper | Unwrapper trait |
| 16 | Trailing-return detection trait + macro | Type introspection |
| 17 | HD detection trait + macro | Host-device lambda type detection |
3. 1024-bit capture bitmaps
Two bitmaps track which capture counts have been observed during parsing:
| Bitmap | Address | Scope | Bits Used |
|---|---|---|---|
| Device | unk_1286980 | 128 bytes (1024 bits) | Bit N = capture count N seen in a __device__ lambda |
| Host-device | unk_1286900 | 128 bytes (1024 bits) | Bit N = capture count N seen in an HD lambda |
sub_6BCBF0 registers a capture count by setting the corresponding bit. sub_6BCBC0 resets both bitmaps to zero between translation units. The maximum representable capture count is 1023 (bit 0 is reserved for the primary template in the device path; the HD path uses bit 0). Error 3595 fires when capture count exceeds 1022 (v33 > 0x3FE).
4. Per-lambda wrapper generation (sub_47B890, 336 lines)
During backend code generation, gen_lambda produces the per-lambda wrapper specialization for each extended lambda encountered. This runs in the gen_template dispatcher (sub_47ECC0).
5. Extended lambda capture type generation (sub_46E640, ~400 lines)
nv_gen_extended_lambda_capture_types generates explicit type declarations for captured variables, enabling the closure type to be serialized across host/device boundaries.
What Happens Without It
When dword_106BF38 == 0, any lambda with __host__ or __device__ annotations triggers error 3612:
error #20155-D: __host__ or __device__ annotation on lambda requires --extended-lambda nvcc flag
Additionally, the .int.c header emits hardcoded false macros (from sub_489000):
#define __nv_is_extended_device_lambda_closure_type(X) false
#define __nv_is_extended_host_device_lambda_closure_type(X) false
#define __nv_is_extended_device_lambda_with_preserved_return_type(X) false
These definitions ensure that code using the detection macros compiles without error but reports that no extended lambdas exist.
Compile-Time Cost
Enabling --extended-lambda has measurable compile-time impact:
- Fixed overhead: ~10 KB of injected template text (steps 1--5, 7--8, 10--17) emitted for every translation unit, regardless of how many lambdas appear
- Variable per capture count: ~0.8 KB per distinct device lambda capture count, ~6 KB per distinct HD capture count (the HD path emits 4 specializations per bit: const non-mutable, const mutable, non-const non-mutable, non-const mutable)
- Typical TU with 3--5 distinct capture counts: 30--50 KB of additional
.int.ctext - Template instantiation load: The wrapper templates use deep SFINAE patterns; the host compiler (gcc/clang/MSVC) must instantiate these for every extended lambda in the TU
- Lambda scanning: The 2,113-line
scan_lambdafunction performs full restriction validation on every lambda expression, adding O(N) per-lambda overhead
The cost is proportional to the number of distinct capture counts, not the total number of lambdas. Two __device__ lambdas each capturing 3 variables share a single wrapper specialization.
Related Error Codes
All 35+ extended lambda error codes (3590--3691) are documented in lambda/restrictions.md. Key errors specific to the flag gate:
| Error | Display | Tag | Condition |
|---|---|---|---|
| 3612 | 20155-D | extended_lambda_disallowed | Lambda has __host__/__device__ annotation but dword_106BF38 == 0 |
| 3595 | 20138-D | extended_lambda_too_many_captures | Capture count > 1023 |
| 3590 | 20133-D | extended_lambda_multiple_parent | Multiple __nv_parent pragmas |
--expt-relaxed-constexpr (dword_106BF40)
This flag relaxes cross-execution-space calling rules for constexpr functions. Without it, a constexpr __device__ function cannot be called from a __host__ function and vice versa, even though constexpr functions are evaluated at compile time on the host regardless of their execution space annotation.
Flag Registration
Registered as flag ID 104 (relaxed_constexpr) in the CUDA-specific flag range. The --expt-relaxed-constexpr nvcc flag is translated to --relaxed_constexpr before passing to cudafe++. The flag sets dword_106BF40 to 1.
Note: Despite the W066 report labeling this global lambda_host_device_mode, the decompiled code shows it is checked in two distinct contexts: cross-space call validation (sub_505720) and extended lambda device qualification (sub_6BC680). The variable name reflects its role in relaxing constexpr constraints, not lambda-specific behavior. It affects lambda behavior only in the specific case of is_device_or_extended_device_lambda (see below).
What It Relaxes
The flag modifies behavior in two code paths:
1. Cross-space call checking (sub_505720)
In check_cross_execution_space_call, when the caller is a __device__-only function and the callee has bit 2 set at offset +177 (explicit __device__ annotation), the checker tests dword_106BF40:
// sub_505720, caller is __device__ or __global__, callee is constexpr __host__:
if ((callee[177] & 0x02) != 0) { // callee has explicit execution space
if (dword_106BF40) { // --expt-relaxed-constexpr
// skip error, allow the call
return;
}
}
Without the flag, this path falls through to emit one of the 6 constexpr-specific cross-space errors.
2. Device lambda qualification (sub_6BC680)
In is_device_or_extended_device_lambda, when an entity has __device__ annotation (bit 177|2) but NOT the extended lambda bit (bit 177|4), the function returns dword_106BF40 != 0:
// sub_6BC680 (decompiled):
bool is_device_or_extended_device_lambda(entity* a1) {
if ((a1->byte_177 & 0x02) != 0) { // has __device__
if ((a1->byte_177 & 0x04) == 0) { // NOT extended lambda
return dword_106BF40 != 0; // relaxed constexpr allows it
}
return true;
}
return false;
}
This means --expt-relaxed-constexpr allows certain __device__ lambdas to be treated as extended device lambdas even without the --extended-lambda flag, but only in the specific context of device lambda type checking.
The 6 Error Messages It Suppresses
When dword_106BF40 == 0 and a constexpr function call crosses execution spaces, one of these 6 error messages is emitted. Each message explicitly suggests the flag as a workaround:
| # | Caller Space | Callee Space | Error Message |
|---|---|---|---|
| 1 | __host__ __device__ | constexpr __device__ | "calling a constexpr __device__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this." |
| 2 | __host__ | constexpr __device__ | "calling a constexpr __device__ function(%sq1) from a __host__ function(%sq2) is not allowed. ..." |
| 3 | __host__ __device__ | constexpr __host__ | "calling a constexpr __host__ function(%sq1) from a __host__ __device__ function(%sq2) is not allowed. ..." |
| 4 | __host__ __device__ | constexpr __host__ | "calling a constexpr __host__ function from a __host__ __device__ function is not allowed. ..." (no entity names -- edge case for unresolved functions) |
| 5 | __device__ | constexpr __host__ | "calling a constexpr __host__ function(%sq1) from a __device__ function(%sq2) is not allowed. ..." |
| 6 | __global__ | constexpr __host__ | "calling a constexpr __host__ function(%sq1) from a __global__ function(%sq2) is not allowed. ..." |
The %sq1 and %sq2 format specifiers are cudafe++'s diagnostic format for qualified entity names (see diagnostics/format-specifiers.md).
Why It Is Experimental
The flag is labeled "experimental" because enabling it can produce silent runtime errors when:
-
A
constexprfunction has different behavior on host vs device due to#ifdef __CUDA_ARCH__guards or host/device-specific intrinsics. The compiler evaluatesconstexprfunctions on the host during compilation, but with the flag enabled, aconstexpr __device__function might be evaluated on the host where__CUDA_ARCH__is not defined, producing a different constant value than the programmer expects for device code. -
A
constexpr __host__function references host-only APIs (file I/O, system calls, host-specific math libraries). With relaxed constexpr, this function can be called from a__device__context. If the call is not resolved at compile time (not actually evaluated as a constant expression), the linker or runtime will fail with an obscure error rather than the clear cudafe++ diagnostic. -
The relaxation applies globally -- there is no per-function opt-in. Once enabled, all constexpr cross-space calls are permitted, making it impossible to catch genuinely incorrect calls alongside intentionally relaxed ones.
The related diagnostic tag is cl_relaxed_constexpr_requires_bool (at binary address 0x853640), which indicates there was at some point a stricter validation that the flag's value must be boolean.
Interaction with Other Globals
The dword_106BF40 flag interacts with the cross-space checking infrastructure controlled by dword_106BFD0 (device_registration) and dword_106BFCC (constant_registration). When dword_106BF40 is set AND the current routine is in device scope (+182 & 0x30 == 0x20) AND the routine has __device__ annotation (+177 bit 1), the cross-space reference check in record_symbol_reference_full (sub_72A650/sub_72B510) skips the error entirely.
C++ Standard Version Gating (dword_126EF68)
The global variable dword_126EF68 holds the C++ (or C) standard version as an integer matching the __cplusplus or __STDC_VERSION__ predefined macro value. This is set during CLI parsing and controls feature gating throughout the frontend.
Version Values
| Standard | dword_126EF68 Value | nvcc Flag |
|---|---|---|
| C++98/03 | 199711 | -std=c++03 |
| C++11 | 201103 | -std=c++11 |
| C++14 | 201402 | -std=c++14 |
| C++17 | 201703 | -std=c++17 |
| C++20 | 202002 | -std=c++20 |
| C++23 | 202302 | -std=c++23 |
C standard values are also stored here when compiling C code:
| Standard | dword_126EF68 Value |
|---|---|
| K&R | (triggers set_c_mode(1) instead) |
| C89 | 198912 |
| C99 | 199901 |
| C11 | 201112 |
| C17 | 201710 |
| C23 | 202311 |
How Version Gating Works
Throughout the frontend, dword_126EF68 is compared against threshold values to enable or disable features. The comparison is always >= or > against the version number. Examples from the binary:
List initialization (sub_6D7DE0, overload.c): The 2,119-line list initialization function checks dword_126EF68 >= 201103 before enabling C++11 brace-enclosed initializer semantics.
Operator overloading (sub_6E7310, overload.c): Checks dword_126EF68 >= 201703 for C++17 features like class template argument deduction in operator resolution.
Preprocessor directives (sub_6FEDD0, preproc.c): Checks dword_126EF68 >= 202301 for #elifdef/#elifndef support (C++23 feature).
Byte ordering in .int.c output (sub_489000): Sets byte_10657F4 based on:
if (dword_126EFB4 == 2) // CUDA mode
byte_10657F4 = (dword_126EFB0 != 0);
else if (dword_126EF68 <= 199900) // pre-C99
byte_10657F4 = (dword_126EFB0 != 0);
else
byte_10657F4 = 1;
C++17 noexcept-in-Type-System (dword_126E270)
A key version-gated feature for CUDA is dword_126E270, the C++17 "noexcept is part of the type system" flag. This global is set when dword_126EF68 >= 201703 and controls whether the lambda preamble injection (step 11 in sub_6BCC20) emits noexcept specializations of __nv_hdl_helper_trait_outer:
// sub_6BCC20, step 11:
if (dword_126E270) { // C++17 noexcept in type system
// Emit 2 additional trait specializations with NeverThrows=true
// for noexcept-qualified function types
emit_noexcept_trait_specialization(emit, /* const */ 0);
emit_noexcept_trait_specialization(emit, /* non-const */ 1);
}
// Closing }; of __nv_hdl_helper_trait_outer emitted unconditionally after
Without these specializations, C++17 code using noexcept lambdas in host-device contexts would fail to match the wrapper traits, producing template deduction failures.
Version Interactions with CUDA
The C++ standard version interacts with CUDA semantics in several ways:
- C++11 minimum: Most CUDA lambda features require
>= 201103. Extended lambdas are only meaningful with C++11 lambda syntax. - C++14 generic lambdas: Generic
__device__lambdas (withautoparameters) are gated on>= 201402. - C++17 structured bindings and if constexpr: The extended lambda system interacts with
if constexprthrough restriction errors 3620/3621 (constexpr/consteval conflict in lambda operator()). - C++20 concepts: The template variant of cross-space checking (
sub_505B40) has a concept-context guard that checksdword_126C5C4(nested class scope), which is only meaningful with C++20 concepts.
--default-device
This flag is specific to JIT (device-only) compilation mode and changes the default execution space for unannotated entities from __host__ to __device__.
Mechanism
When enabled, the execution-space assignment logic modifies entity+182 to receive the __device__ OR mask (0x23) instead of the implicit host default (0x00). For variables, entity+148 bit 0 (__device__ memory space) is set.
JIT Mode Context
JIT mode activates when --gen_c_file_name (flag 45) is NOT provided -- there is no host output path, so the host backend never runs. This is the compilation mode used by NVRTC (the CUDA runtime compilation library) and the CUDA Driver API's runtime compilation facilities (cuModuleLoadData, cuLinkAddData).
Without --default-device, five JIT-specific diagnostics warn about unannotated entities:
| Diagnostic Tag | Message Summary |
|---|---|
no_host_in_jit | Explicit __host__ not allowed in JIT mode (no --default-device suggestion) |
unannotated_function_in_jit | Unannotated function considered host, not allowed in JIT |
unannotated_variable_in_jit | Namespace-scope variable without memory space annotation |
unannotated_static_data_member_in_jit | Non-const static data member considered host |
host_closure_class_in_jit | Lambda closure class inferred as __host__ |
Four of the five messages explicitly suggest --default-device as a workaround. The exception is no_host_in_jit -- an explicit __host__ annotation cannot be overridden by a flag and requires a source code change.
The --default-device flag interacts with the extended lambda system (dword_106BF38): when both are active, namespace-scope lambda closure classes infer __device__ execution space instead of __host__, avoiding the host_closure_class_in_jit diagnostic.
See cuda/jit-mode.md for full JIT mode documentation.
--no-device-int128 / --no-device-float128
These two flags (IDs 52 and 53) disable 128-bit integer and floating-point types in device code respectively.
Registration
Both are registered in sub_452010 as no-argument mode flags in the CUDA-specific range:
| Flag | ID | Binary Address | Global Effect |
|---|---|---|---|
no-device-int128 | 52 | 0x836133 | Disables __int128 type in device compilation |
no-device-float128 | 53 | 0x836144 | Disables __float128/_Float128 in device compilation |
Purpose
The EDG frontend supports __int128 (keyword ID 239 in the builtin keyword table) and _Float128 (keyword ID 335) as extended types. In device code, these types may not be supported by all GPU architectures or may have different semantics than on the host.
The flags belong to the grouped CUDA boolean flags (cases 47--53 in proc_command_line), alongside host-stub-linkage-explicit, static-host-stub, device-hidden-visibility, no-hidden-visibility-on-unnamed-ns, and no-multiline-debug.
Type feature tracking uses byte_12C7AFC as a usage flags byte: bit 0 tracks specific integer subtypes (kinds 11, 12), bit 2 tracks float128/bfloat16 usage. The dword_106C070 global serves as the float128 feature flag, and dword_106C06C controls bfloat16.
NVRTC has specific support strings for both types in the binary (int128 NVRTC, float128 NVRTC), confirming that the JIT compilation path handles the presence or absence of these types explicitly.
Interaction Matrix
The experimental flags interact with each other and with version gating:
| Interaction | Behavior |
|---|---|
--extended-lambda + C++17 | Enables noexcept wrapper trait specializations (step 11 in preamble) via dword_126E270 |
--extended-lambda + --expt-relaxed-constexpr | A __device__ lambda without the extended-lambda bit is treated as extended if dword_106BF40 is set (via sub_6BC680) |
--extended-lambda + JIT mode | Lambda closure class execution space inference changes; --default-device affects namespace-scope lambda inference |
--expt-relaxed-constexpr + cross-space checking | Suppresses 6 specific constexpr cross-space errors; does NOT suppress the 6 non-constexpr variants |
--no-device-int128 + NVRTC | NVRTC-specific handling confirms both flags are respected in JIT compilation |
| C++20 + cross-space checking | Concept context guard in sub_505B40 adds an additional bypass condition for template cross-space calls |
Global Variable Reference
| Address | Size | Semantic Name | Set By | Checked By |
|---|---|---|---|---|
dword_106BF38 | 4 | extended_lambda_mode | Flag 79* (--extended-lambda) | sub_4864F0 (trigger), sub_489000 (macros), sub_447930 (scan_lambda) |
dword_106BF40 | 4 | relaxed_constexpr_mode | Flag 104 (--relaxed_constexpr) | sub_505720 (cross-space call), sub_6BC680 (device lambda test), sub_72A650/sub_72B510 (symbol ref) |
dword_126EF68 | 4 | cpp_standard_version | CLI std selection | 28+ functions across all subsystems |
dword_126E270 | 4 | cpp17_noexcept_type | Post-parsing dialect resolution | sub_6BCC20 (preamble step 11) |
Function Reference
| Address | Lines | Identity | Source | Role |
|---|---|---|---|---|
sub_452010 | 3,849 | init_command_line_flags | cmd_line.c | Registers all 276 flags including experimental |
sub_459630 | 4,105 | proc_command_line | cmd_line.c | Parses flags, sets globals |
sub_447930 | 2,113 | scan_lambda | cmd_line.c | Full lambda validation (uses dword_106BF38) |
sub_4864F0 | 751 | gen_type_decl | cp_gen_be.c | Preamble injection trigger (checks dword_106BF38) |
sub_6BCC20 | 244 | nv_emit_lambda_preamble | nv_transforms.c | Master preamble emitter (17 steps) |
sub_505720 | 147 | check_cross_execution_space_call | expr.c | Cross-space call checker (uses dword_106BF40) |
sub_505B40 | 92 | check_cross_space_call_in_template | expr.c | Template variant of cross-space checker |
sub_6BC680 | 16 | is_device_or_extended_device_lambda | nv_transforms.c | Device lambda test (uses dword_106BF40) |
sub_489000 | 723 | process_file_scope_entities | cp_gen_be.c | Backend entry; emits false macros when flag off |
sub_46E640 | ~400 | nv_gen_extended_lambda_capture_types | cp_gen_be.c | Capture type declarations for extended lambdas |
sub_6BCBF0 | 13 | nv_record_capture_count | nv_transforms.c | Bitmap bit-set for capture counts |
sub_6BCBC0 | ~10 | nv_reset_capture_bitmaps | nv_transforms.c | Reset both 1024-bit bitmaps |
Cross-References
- config/cli-flags.md -- complete flag catalog and registration protocol
- lambda/overview.md -- extended lambda pipeline architecture
- lambda/preamble-injection.md -- 17-step preamble emission detail
- lambda/restrictions.md -- all 35+ lambda restriction error codes
- cuda/cross-space-validation.md -- cross-space call checking and
dword_106BF40relaxation - cuda/jit-mode.md -- JIT mode,
--default-deviceflag, and NVRTC - diagnostics/cuda-errors.md -- complete CUDA error catalog