Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Preamble Injection

The entire CUDA extended lambda template library -- every __nv_dl_wrapper_t, every __nv_hdl_wrapper_t, every trait helper and detection macro -- enters the compilation through a single function: sub_6BCC20 (nv_emit_lambda_preamble). This 244-line function in nv_transforms.c accepts a void(*emit)(const char*) callback and produces raw C++ source text that is injected into the .int.c output stream. The preamble is emitted exactly once per translation unit, triggered by a sentinel type declaration named __nv_lambda_preheader_injection. The trigger mechanism lives in sub_4864F0 (gen_type_decl in cp_gen_be.c), which string-compares each type declaration's name against the sentinel marker, emits a synthetic #line directive, and then calls the master emitter.

The preamble contains 20 logical emission steps, ranging from simple type traits (4 lines each) to bitmap-driven loops that generate hundreds of template specializations. The design is driven by a critical optimization: rather than emitting all 1024 possible capture-count specializations for each wrapper type, cudafe++ maintains two 1024-bit bitmaps (unk_1286980 for device lambdas, unk_1286900 for host-device lambdas) that track which capture counts were actually used during frontend parsing. The preamble emitter scans these bitmaps and generates only the specializations that the translation unit requires.

Key Facts

PropertyValue
Master emittersub_6BCC20 (nv_emit_lambda_preamble, 244 lines, nv_transforms.c)
Trigger functionsub_4864F0 (gen_type_decl, 751 lines, cp_gen_be.c)
Emit callback (typical)sub_467E50 (raw text output to .int.c stream)
Sentinel type name__nv_lambda_preheader_injection
Synthetic source file"nvcc_internal_extended_lambda_implementation"
Enable flagdword_106BF38 (--extended-lambda / --expt-extended-lambda)
Device bitmapunk_1286980 (128 bytes = 16 x uint64 = 1024 bits)
Host-device bitmapunk_1286900 (128 bytes = 16 x uint64 = 1024 bits)
C++17 noexcept gatedword_126E270 (controls noexcept trait variants)
One-shot guaranteeOnce emitted, the sentinel type is wrapped in #if 0 / #endif
Max capture count1024 (bit index range 0..1023)
Array dimension range2D through 8D (7 specializations per wrapper)

Trigger Mechanism: sub_4864F0 (gen_type_decl)

The preamble is not emitted eagerly at the start of compilation. Instead, the EDG frontend inserts a synthetic type declaration named __nv_lambda_preheader_injection into the IL at the point where the lambda template library is needed. During backend code generation, sub_4864F0 (the type declaration emitter in cp_gen_be.c) encounters this declaration and performs the following sequence:

// sub_4864F0, decompiled lines 200-242
// Check: is this a type tagged with the preheader marker? (bit at v4-8 & 0x10)
if ((*(_BYTE *)(v4 - 8) & 0x10) != 0)
{
    if (dword_106BF38)                           // --extended-lambda enabled?
    {
        v18 = *(_QWORD *)(v4 + 8);              // get type name pointer
        if (v18)
        {
            // Compare name against "__nv_lambda_preheader_injection" (30 chars + NUL)
            v30 = "__nv_lambda_preheader_injection";
            v31 = 32;                             // comparison length
            do {
                if (!v31) break;
                v29 = *(_BYTE *)v18++ == *v30++;
                --v31;
            } while (v29);

            if (v29)                              // name matched
            {
                if (dword_106581C)                // pending newline needed
                    sub_467D60();                 // emit newline

                // Emit #line directive pointing to synthetic source file
                v32 = "#line";
                if (dword_126E1DC)                // shorthand mode
                    v32 = "#";
                sub_467E50(v32);
                sub_467E50(" 1 \"nvcc_internal_extended_lambda_implementation\"");

                if (dword_106581C)
                    sub_467D60();

                // THE CRITICAL CALL: emit entire lambda template library
                sub_6BCC20(sub_467E50);

                dword_1065820 = 0;                // reset line tracking state
                qword_1065828 = 0;
            }
        }
    }
    // Suppress the sentinel type from host compiler output
    sub_46BC80("#if 0");
    --dword_1065834;
    sub_467D60();
}

Trigger Conditions

Three conditions must all be true for preamble emission:

  1. Marker bit set -- The type declaration node has bit 0x10 set at offset -8 (the IL node header flags). This bit marks NVIDIA-injected synthetic declarations.

  2. Extended lambda mode active -- dword_106BF38 is nonzero, meaning --extended-lambda (or --expt-extended-lambda) was passed to nvcc.

  3. Name matches sentinel -- The type's name at offset +8 is byte-equal to "__nv_lambda_preheader_injection" (a 31-character string including NUL; the comparison loop runs up to 32 iterations).

Synthetic Source File Context

Before calling sub_6BCC20, the trigger emits:

#line 1 "nvcc_internal_extended_lambda_implementation"

This #line directive serves two purposes: it changes the apparent source file for any diagnostics emitted during template parsing, and it provides a recognizable marker in the generated .int.c file for debugging. All lambda template infrastructure appears to originate from "nvcc_internal_extended_lambda_implementation" rather than from the user's source file. The dword_126E1DC flag selects between #line and the shorthand # form for the line directive.

One-Shot Guarantee and Sentinel Suppression

After the preamble is emitted, the sentinel type declaration is wrapped in #if 0 / #endif. The #if 0 is emitted immediately after the preamble call (line 239: sub_46BC80("#if 0")). The matching #endif is emitted later when sub_4864F0 reaches the closing path for this declaration type (lines 736-745):

else if ((*(_BYTE *)(v4 - 8) & 0x10) != 0)
{
    if (dword_106581C)
        sub_467D60();
    ++dword_1065834;
    sub_468190("#endif");
    --dword_1065834;
    sub_467D60();
    dword_1065820 = 0;
    qword_1065828 = 0;
}

The sentinel type __nv_lambda_preheader_injection never reaches the host compiler's type system -- it exists solely as a positional marker in the IL. Because the EDG frontend inserts exactly one such declaration per translation unit, and the backend processes declarations sequentially, the preamble is guaranteed to be emitted exactly once.

After emission, dword_1065820 (output line counter) and qword_1065828 (output state pointer) are reset to zero, ensuring subsequent #line directives correctly track the user's source file.

Master Emitter: sub_6BCC20

The function signature:

__int64 __fastcall sub_6BCC20(void (__fastcall *a1)(const char *));

The single parameter a1 is an output callback. In production, this is always sub_467E50 -- the function that writes raw text to the .int.c output stream. Every a1("...") call appends the given string literal to the output. The function has no other state parameters; all needed state (bitmaps, C++17 flag) is read from globals.

The 20 emission steps are executed unconditionally in a fixed order. Steps 6 and 9 contain bitmap-scanning loops that conditionally call sub-emitters based on which capture counts were registered during frontend parsing. Step 11 is gated on the C++17 noexcept flag.

Step 1: Type Removal Traits and Wrapper Helper Macro

The first a1(...) call emits the largest single string literal in the function -- three foundational metaprogramming utilities:

#define __NV_LAMBDA_WRAPPER_HELPER(X, Y) decltype(X), Y

template <typename T>
struct __nvdl_remove_ref { typedef T type; };

template<typename T>
struct __nvdl_remove_ref<T&> { typedef T type; };

template<typename T>
struct __nvdl_remove_ref<T&&> { typedef T type; };

template <typename T, typename... Args>
struct __nvdl_remove_ref<T(&)(Args...)> {
  typedef T(*type)(Args...);
};

template <typename T>
struct __nvdl_remove_const { typedef T type; };

template <typename T>
struct __nvdl_remove_const<T const> { typedef T type; };

__NV_LAMBDA_WRAPPER_HELPER(X, Y) expands to decltype(X), Y. It provides the <U, func> pair for tag type construction from a single expression. At each lambda wrapper call site, the per-lambda emitter (sub_47B890) generates __NV_LAMBDA_WRAPPER_HELPER(&Closure::operator(), &Closure::operator()), which expands to decltype(&Closure::operator()), &Closure::operator().

__nvdl_remove_ref strips lvalue and rvalue references, with a special case for function references (T(&)(Args...) -> T(*)(Args...)). __nvdl_remove_const strips top-level const. Both are used during capture type emission to normalize captured variable types before passing them as template arguments to wrapper structs.

Step 2: Device Lambda Tag

template <typename U, U func, unsigned>
struct __nv_dl_tag { };

The device lambda tag type. U is the type of the lambda's operator(), func is a non-type template parameter holding the pointer to that operator, and the unsigned disambiguates lambdas with identical operator types at different call sites within the same TU.

Step 3: Array Capture Helpers (sub_6BC290)

sub_6BCC20 calls sub_6BC290(a1), which emits the __nv_lambda_array_wrapper and __nv_lambda_field_type infrastructure for C-style array captures. This is a separate 183-line function that generates templates for array dimensions 2 through 8.

Three template families are emitted:

Primary template (static_assert trap):

template <typename T>
struct __nv_lambda_array_wrapper {
    static_assert(sizeof(T) == 0,
        "nvcc internal error: unexpected failure in capturing array variable");
};

Per-dimension partial specializations (dimensions 2-8). For each dimension D from 2 to 8, sub_6BC290 generates a partial specialization with D size_t template parameters and a nested-for-loop constructor:

// Example: 3D (v1 = 3)
template<typename T, size_t D1, size_t D2, size_t D3>
struct __nv_lambda_array_wrapper<T [D1][D2][D3]> {
    T arr[D1][D2][D3];
    __nv_lambda_array_wrapper(const T in[D1][D2][D3]) {
        for(size_t i1 = 0; i1 < D1; ++i1)
        for(size_t i2 = 0; i2 < D2; ++i2)
        for(size_t i3 = 0; i3 < D3; ++i3)
            arr[i1][i2][i3] = in[i1][i2][i3];
    }
};

Field type trait specializations:

template <typename T>
struct __nv_lambda_field_type { typedef T type; };

// For each dimension D from 2 to 8:
template<typename T, size_t D1, ..., size_t DN>
struct __nv_lambda_field_type<T [D1]...[DN]> {
    typedef __nv_lambda_array_wrapper<T [D1]...[DN]> type;
};

template<typename T, size_t D1, ..., size_t DN>
struct __nv_lambda_field_type<const T [D1]...[DN]> {
    typedef const __nv_lambda_array_wrapper<T [D1]...[DN]> type;
};

The loop structure in sub_6BC290 uses two stack buffers: v33[1024] for the nested-for-loop lines (each sprintf call formats four copies of the loop index variable) and s[1064] for dimension parameters and array subscript expressions. The outer loop runs from v1 = 2 to v1 = 8 (inclusive, 7 iterations). 1D arrays do not need a wrapper -- they can be captured directly. Arrays of 9+ dimensions are unsupported (the primary template's static_assert fires).

See Capture Handling for detailed documentation.

Step 4: Primary __nv_dl_wrapper_t and Zero-Capture Specialization

template <typename Tag, typename...CapturedVarTypePack>
struct __nv_dl_wrapper_t {
    static_assert(sizeof...(CapturedVarTypePack) == 0,
                  "nvcc internal error: unexpected number of captures!");
};

template <typename Tag>
struct __nv_dl_wrapper_t<Tag> {
    __nv_dl_wrapper_t(Tag) { }
    template <typename...U1>
    int operator()(U1...) { return 0; }
};

The primary template traps any instantiation with a non-zero capture count that lacks a matching specialization. The zero-capture specialization provides a trivial constructor and a dummy operator() returning int(0). This return value is never used at runtime -- the device compiler dispatches through the tag's encoded function pointer.

Step 5: Trailing-Return Tag and Base Specialization

template <typename U, U func, typename Return, unsigned>
struct __nv_dl_trailing_return_tag { };

template <typename U, U func, typename Return, unsigned Id>
struct __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id> > {
    __nv_dl_wrapper_t(__nv_dl_trailing_return_tag<U, func, Return, Id>) { }

    template <typename...U1> Return operator()(U1...) {
        __builtin_unreachable();
    }
};

For lambdas with explicit trailing return types (-> ReturnType), the tag carries the Return type as a template parameter. The operator() returns Return instead of int, with __builtin_unreachable() satisfying the compiler without generating actual return-value code.

The trailing-return tag and its zero-capture specialization are emitted as two separate a1(...) calls. The __builtin_unreachable() body is split: a1("__builtin_unreachable(); }\n}; \n\n").

Step 6: Device Lambda Bitmap Scan

Scans unk_1286980 (the device lambda bitmap, 1024 bits) and calls sub_6BB790 for each set bit with index greater than zero:

// Decompiled from sub_6BCC20
v1 = (unsigned __int64 *)&unk_1286980;
v2 = 0;
do {
    v3 = *v1;                          // load 64-bit word
    v4 = v2 + 64;                      // word boundary
    do {
        if (v2 && (v3 & 1) != 0)       // skip bit 0, emit for set bits
            sub_6BB790(v2, a1);         // emit_device_lambda_wrapper_specialization
        ++v2;
        v3 >>= 1;
    } while (v4 != v2);
    ++v1;
} while (v4 != 1024);

Bit 0 is explicitly skipped (if (v2 && ...)). The zero-capture case is handled by the specializations in steps 4 and 5.

For each set bit N > 0, sub_6BB790(N, a1) emits two __nv_dl_wrapper_t partial specializations: one for __nv_dl_tag and one for __nv_dl_trailing_return_tag, each with N typed fields, a constructor taking N parameters, and an initializer list binding inK to fK. See Device Lambda Wrapper for full emitter logic.

This bitmap-driven approach is the critical compile-time optimization. A translation unit using lambdas with capture counts 1, 3, and 5 emits exactly 6 struct specializations rather than 2046 (1023 counts x 2 tag variants).

Step 7: Host-Device Helper Class (__nv_hdl_helper)

Emitted inside an anonymous namespace:

namespace {
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
struct __nv_hdl_helper {
    typedef void * (*fp_copier_t)(void *);
    typedef OpFuncR (*fp_caller_t)(void *, OpFuncArgs...);
    typedef void (*fp_deleter_t)(void *);
    typedef OpFuncR (*fp_noobject_caller_t)(OpFuncArgs...);

    static fp_copier_t fp_copier;
    static fp_deleter_t fp_deleter;
    static fp_caller_t fp_caller;
    static fp_noobject_caller_t fp_noobject_caller;
};

// Out-of-line static member definitions (4 members):
template <typename Tag, typename OpFuncR, typename ...OpFuncArgs>
typename __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier_t
    __nv_hdl_helper<Tag, OpFuncR, OpFuncArgs...>::fp_copier;

// ... (fp_deleter, fp_caller, fp_noobject_caller follow the same pattern)
}

The anonymous namespace prevents ODR violations across TUs. The Tag parameter isolates function pointer storage per lambda site even when call signatures are identical. The entire struct definition plus all four out-of-line member definitions are emitted as a single a1(...) call.

PointerPurpose
fp_copierHeap-copies a Lambda from void* (used by copy constructor)
fp_callerCasts void* to Lambda* and invokes operator()
fp_deleterCasts void* to Lambda* and deletes it
fp_noobject_callerStores captureless lambda as raw function pointer

Step 8: Primary __nv_hdl_wrapper_t

template <bool IsMutable, bool HasFuncPtrConv, bool NeverThrows,
          typename Tag, typename OpFunc, typename...CapturedVarTypePack>
struct __nv_hdl_wrapper_t {
    static_assert(sizeof...(CapturedVarTypePack) == 0,
        "nvcc internal error: unexpected number of captures "
        "in __host__ __device__ lambda!");
};

Same safety-net pattern as the device wrapper.

Step 9: Host-Device Lambda Bitmap Scan

Scans unk_1286900 (the host-device bitmap, 1024 bits). Unlike the device scan, this loop does not skip bit 0 -- the zero-capture host-device case still requires distinct specializations for HasFuncPtrConv=true vs HasFuncPtrConv=false.

For each set bit N, four specialization calls are made:

v5 = (unsigned __int64 *)&unk_1286900;
v6 = 0;
do {
    v7 = *v5;
    v8 = v6 + 64;
    do {
        while ((v7 & 1) == 0) {        // fast-skip unset bits
            ++v6;
            v7 >>= 1;
            if (v6 == v8) goto LABEL_13;
        }
        sub_6BBB10(0, v6, a1);         // IsMutable=false, HasFuncPtrConv=false
        sub_6BBEE0(0, v6, a1);         // IsMutable=true,  HasFuncPtrConv=false
        sub_6BBB10(1, v6, a1);         // IsMutable=false, HasFuncPtrConv=true
        v9 = v6++;
        v7 >>= 1;
        sub_6BBEE0(1, v9, a1);        // IsMutable=true,  HasFuncPtrConv=true
    } while (v6 != v8);
LABEL_13:
    ++v5;
} while (v6 != 1024);

Note the ordering asymmetry in the fourth call: sub_6BBEE0(1, v9, a1) uses the pre-increment value v9 because v6 has already been incremented by the v9 = v6++ expression.

The inner while ((v7 & 1) == 0) loop provides fast skipping over consecutive unset bits without executing four function calls per zero bit. This is an optimization compared to the device scan loop.

Calla1a2IsMutableHasFuncPtrConvoperator() qualifier
sub_6BBB10(0, N, emit)0Nfalsefalseconst noexcept(NeverThrows)
sub_6BBEE0(0, N, emit)0Ntruefalsenoexcept(NeverThrows) (no const)
sub_6BBB10(1, N, emit)1Nfalsetrueconst noexcept(NeverThrows)
sub_6BBEE0(1, N, emit)1Ntruetruenoexcept(NeverThrows) (no const)

The sole difference between sub_6BBB10 and sub_6BBEE0 is that sub_6BBB10 emits "false," for IsMutable and adds a3("const ") before the noexcept qualifier on operator(), while sub_6BBEE0 emits "true," and omits the const. They are otherwise structurally identical -- 238 vs 236 lines, the 2-line difference being exactly the a3("const ") call.

See Host-Device Lambda Wrapper for the complete internal structure of each specialization.

Step 10: __nv_hdl_helper_trait_outer (Base Specializations)

The deduction helper trait that extracts the wrapper type from a lambda's operator() signature:

template <bool IsMutable, bool HasFuncPtrConv, typename ...CaptureArgs>
struct __nv_hdl_helper_trait_outer {
    template <typename Tag, typename Lambda>
    struct __nv_hdl_helper_trait
        : public __nv_hdl_helper_trait<Tag, decltype(&Lambda::operator())> { };

    // Match const operator() (non-mutable lambda):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

    // Match non-const operator() (mutable lambda):
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...)> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, false,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

The primary __nv_hdl_helper_trait inherits from a specialization on decltype(&Lambda::operator()). The compiler deduces the member function pointer type and pattern-matches against the const or non-const specialization. Both produce NeverThrows=false.

This block is emitted without a closing }; -- the noexcept variants (step 11) are conditionally appended before the closing brace.

Step 11: C++17 Noexcept Trait Variants (Conditional)

Gated on dword_126E270:

if (dword_126E270)
    a1(/* noexcept trait specializations */);
a1("\n};");  // close __nv_hdl_helper_trait_outer

When C++17 noexcept-in-type-system is active, two additional __nv_hdl_helper_trait specializations are emitted:

    // Match const noexcept operator():
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) const noexcept> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

    // Match non-const noexcept operator():
    template <typename Tag, typename C, typename R, typename... OpFuncArgs>
    struct __nv_hdl_helper_trait<Tag, R(C::*)(OpFuncArgs...) noexcept> {
        template <typename Lambda>
        static auto get(Lambda lam, CaptureArgs... args)
            -> __nv_hdl_wrapper_t<IsMutable, HasFuncPtrConv, true,
                                   Tag, R(OpFuncArgs...), CaptureArgs...>;
    };

The noexcept specializations produce NeverThrows=true. In C++17, R(C::*)(Args...) const noexcept is a distinct type from R(C::*)(Args...) const, so without these specializations, noexcept lambdas would fail to match and the trait chain would break.

Step 12: __nv_hdl_create_wrapper_t Factory

template<bool IsMutable, bool HasFuncPtrConv, typename Tag,
         typename...CaptureArgs>
struct __nv_hdl_create_wrapper_t {
    template <typename Lambda>
    static auto __nv_hdl_create_wrapper(Lambda &&lam, CaptureArgs... args)
        -> decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...))
    {
        typedef decltype(
            __nv_hdl_helper_trait_outer<IsMutable, HasFuncPtrConv, CaptureArgs...>
                ::template __nv_hdl_helper_trait<Tag, Lambda>
                ::get(lam, args...)) container_type;
        return container_type(Tag{}, std::move(lam), args...);
    }
};

This factory is the entry point called at each host-device lambda usage site. The trailing return type chains through the trait hierarchy to deduce the exact __nv_hdl_wrapper_t specialization. The body constructs the deduced wrapper with Tag{}, the moved lambda, and the capture arguments.

Step 13: CV-Removal Traits

template<typename T> struct __nv_lambda_trait_remove_const { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_const<T const> { typedef T type; };

template<typename T> struct __nv_lambda_trait_remove_volatile { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_volatile<T volatile> { typedef T type; };

template<typename T> struct __nv_lambda_trait_remove_cv {
    typedef typename __nv_lambda_trait_remove_const<
        typename __nv_lambda_trait_remove_volatile<T>::type>::type type;
};

These are distinct from the __nvdl_remove_ref/__nvdl_remove_const emitted in step 1. The step-1 traits are used during capture type normalization at wrapper call sites. The step-13 traits are used by the detection macros (steps 14-17) to strip CV qualifiers before testing whether a type is an extended lambda wrapper.

Step 14: Device Lambda Detection Trait

template <typename T>
struct __nv_extended_device_lambda_trait_helper {
    static const bool value = false;
};

template <typename T1, typename...Pack>
struct __nv_extended_device_lambda_trait_helper<__nv_dl_wrapper_t<T1, Pack...> > {
    static const bool value = true;
};

#define __nv_is_extended_device_lambda_closure_type(X) \
    __nv_extended_device_lambda_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

SFINAE detection for device lambda wrappers. The macro strips CV qualifiers first, ensuring const __nv_dl_wrapper_t<...> is also detected. Used by CUDA runtime headers for conditional behavior on extended lambda types.

Step 15: Device Lambda Wrapper Unwrapper

template<typename T> struct __nv_lambda_trait_remove_dl_wrapper { typedef T type; };
template<typename T> struct __nv_lambda_trait_remove_dl_wrapper<__nv_dl_wrapper_t<T> > {
    typedef T type;
};

Extracts the inner tag type from a zero-capture device lambda wrapper. Only matches __nv_dl_wrapper_t<T> with a single template parameter (the tag). Used to access __nv_dl_tag or __nv_dl_trailing_return_tag for device function dispatch resolution.

Step 16: Trailing-Return Device Lambda Detection

template <typename T>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper {
    static const bool value = false;
};

template <typename U, U func, typename Return, unsigned Id, typename...Pack>
struct __nv_extended_device_lambda_with_trailing_return_trait_helper<
    __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<U, func, Return, Id>, Pack...> > {
    static const bool value = true;
};

#define __nv_is_extended_device_lambda_with_preserved_return_type(X) \
    __nv_extended_device_lambda_with_trailing_return_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

Detects whether a device lambda wrapper uses the trailing-return tag variant. Needed because trailing-return lambdas require different handling during device compilation -- the return type is explicit and must be preserved, rather than deduced.

Step 17: Host-Device Lambda Detection Trait

The final emission:

template <typename>
struct __nv_extended_host_device_lambda_trait_helper {
    static const bool value = false;
};

template <bool B1, bool B2, bool B3, typename T1, typename T2, typename...Pack>
struct __nv_extended_host_device_lambda_trait_helper<
    __nv_hdl_wrapper_t<B1, B2, B3, T1, T2, Pack...> > {
    static const bool value = true;
};

#define __nv_is_extended_host_device_lambda_closure_type(X) \
    __nv_extended_host_device_lambda_trait_helper< \
        typename __nv_lambda_trait_remove_cv<X>::type>::value

Detects any __nv_hdl_wrapper_t instantiation. The partial specialization matches all six template parameters (B1=IsMutable, B2=HasFuncPtrConv, B3=NeverThrows, T1=Tag, T2=OpFunc, Pack=captures).

sub_6BCC20 returns the result of this final a1(...) call.

Bitmap Infrastructure

Registration: sub_6BCBF0 (nv_record_capture_count)

During frontend parsing, scan_lambda (sub_447930) records each lambda's capture count:

__int64 __fastcall sub_6BCBF0(int a1, unsigned int a2)
{
    unsigned __int64 *result;
    if (a1)
        result = (unsigned __int64 *)&unk_1286900;  // host-device bitmap
    else
        result = (unsigned __int64 *)&unk_1286980;  // device bitmap
    result[a2 >> 6] |= 1ULL << a2;
    return (__int64)result;
}

The function selects the bitmap based on a1 (0 = device, nonzero = host-device), computes the word index as a2 >> 6 (divide by 64), and sets the bit via bitwise OR. No synchronization is needed because the frontend is single-threaded.

Reset: sub_6BCBC0 (nv_reset_capture_bitmasks)

Before each translation unit, both bitmaps are zeroed:

void sub_6BCBC0(void)
{
    memset(&unk_1286980, 0, 128);  // device bitmap
    memset(&unk_1286900, 0, 128);  // host-device bitmap
}

Scan Algorithm Differences

AspectDevice scan (step 6)Host-device scan (step 9)
Bitmapunk_1286980unk_1286900
Bit 0Skipped (if (v2 && ...))Processed
Skip strategyTests every bit individuallyInner while fast-skips consecutive zeros
Calls per set bit1 (sub_6BB790)4 (sub_6BBB10 x2 + sub_6BBEE0 x2)
Specializations per set bit2 (standard + trailing-return)4 (IsMutable x HasFuncPtrConv)

The device scan skips bit 0 because the zero-capture case is handled by the always-emitted primary template. The host-device scan processes bit 0 because the zero-capture case requires explicit specializations for the HasFuncPtrConv and IsMutable dimensions -- the always-emitted primary template contains only a static_assert trap.

Complete Emission Order Summary

StepContentEmitterTemplates Produced
1Ref/const removal traitsinline string__NV_LAMBDA_WRAPPER_HELPER, __nvdl_remove_ref, __nvdl_remove_const
2Device taginline string__nv_dl_tag
3Array helperssub_6BC290__nv_lambda_array_wrapper (dim 2-8), __nv_lambda_field_type specializations
4Device wrapper primaryinline string__nv_dl_wrapper_t primary + zero-capture
5Trailing-return taginline string__nv_dl_trailing_return_tag + zero-capture specialization
6Device bitmap scanloop + sub_6BB790N-capture __nv_dl_wrapper_t (2 per set bit N > 0)
7HD helperinline string__nv_hdl_helper (anonymous namespace, 4 static FPs)
8HD wrapper primaryinline string__nv_hdl_wrapper_t primary with static_assert
9HD bitmap scanloop + sub_6BBB10 x2 + sub_6BBEE0 x2N-capture __nv_hdl_wrapper_t (4 per set bit)
10Trait outerinline string__nv_hdl_helper_trait_outer (const + non-const specializations)
11C++17 noexceptconditional inlineNoexcept __nv_hdl_helper_trait specializations
12Factoryinline string__nv_hdl_create_wrapper_t
13CV traitsinline string__nv_lambda_trait_remove_const/volatile/cv
14Device detectioninline string__nv_extended_device_lambda_trait_helper + macro
15Wrapper unwrapinline string__nv_lambda_trait_remove_dl_wrapper
16Trailing-return detectioninline string__nv_extended_device_lambda_with_trailing_return_trait_helper + macro
17HD detectioninline string__nv_extended_host_device_lambda_trait_helper + macro

Output Size Characteristics

The preamble size depends on the number of distinct capture counts used:

ComponentFixed/VariableApproximate Size
Steps 1-5 (fixed templates)Fixed~1.5 KB
Step 3 (array helpers, dim 2-8)Fixed~4 KB
Step 6 (device, per capture count)Variable~0.8 KB per count
Steps 7-8 (HD helper + primary)Fixed~1.5 KB
Step 9 (HD, per capture count)Variable~6 KB per count (4 specializations)
Steps 10-17 (traits, macros)Fixed~3 KB

A typical translation unit with 3-5 distinct capture counts produces approximately 30-50 KB of injected C++ text.

Design Rationale

Text Emission vs AST Construction

The preamble is emitted as raw C++ source text rather than constructed as AST nodes in the EDG IL. This trades correctness-by-construction for implementation simplicity:

  • Avoids IL complexity. Constructing proper AST nodes for template partial specializations, static member definitions, anonymous namespaces, and macros would require deep integration with the EDG IL construction API.
  • Matches output format. The .int.c file is plain C++ text consumed by the host compiler. Since the templates must eventually become text, generating them as text from the start eliminates a serialize-deserialize round trip.
  • Self-documenting. The emitted text is directly readable in the .int.c file. grep for __nv_dl_wrapper_t to see exactly what was produced.

The cost is that the templates exist only as generated text, not as first-class IL entities. They cannot be analyzed or transformed by other EDG passes. This is acceptable because the preamble templates are infrastructure -- they are never the target of user-facing diagnostics or transformations.

Why Bitmaps Instead of Lists

The 1024-bit bitmap offers constant-time set (O(1) via shift-and-OR) and linear-time scan (O(1024) = effectively constant for a fixed-size structure). The bitmap has zero dynamic allocation, fits in two cache lines (128 bytes), and the scan loop compiles to simple shift-and-test instructions. Alternative representations (sorted lists, hash sets) would add allocation overhead and complexity for negligible benefit given the fixed 128-byte size.

Why Bit 0 Is Skipped for Device but Not Host-Device

The device lambda zero-capture case is fully handled by the primary template's zero-capture specialization (step 4), which is always emitted. No per-capture-count specialization is needed because the zero-capture wrapper has no fields, no constructor parameters, and no specialization-specific behavior.

The host-device zero-capture case requires distinct specializations for HasFuncPtrConv=true (lightweight function pointer path) and HasFuncPtrConv=false (heap-allocated type erasure path). These paths have fundamentally different internal structure. The always-emitted primary template contains only a static_assert trap, not a working implementation, so bit 0 must be processed to generate the actual zero-capture specializations.

Function Map

AddressName (recovered)SourceLinesRole
sub_6BCC20nv_emit_lambda_preamblenv_transforms.c244Master emitter: 17-step template injection pipeline
sub_4864F0gen_type_declcp_gen_be.c751Trigger: detects sentinel, emits #line, calls master emitter
sub_467E50emit_stringcp_gen_be.c~29Output callback: writes string char-by-char via putc()
sub_467D60emit_newlinecp_gen_be.c~15Emits \n, increments line counter
sub_6BC290emit_array_capture_helpersnv_transforms.c183Step 3: __nv_lambda_array_wrapper for dim 2-8
sub_6BB790emit_device_lambda_wrapper_specializationnv_transforms.c191Step 6: N-capture __nv_dl_wrapper_t (both tag variants)
sub_6BBB10emit_hdl_wrapper_nonmutablenv_transforms.c238Step 9: __nv_hdl_wrapper_t<false,...> specialization
sub_6BBEE0emit_hdl_wrapper_mutablenv_transforms.c236Step 9: __nv_hdl_wrapper_t<true,...> specialization
sub_6BCBF0nv_record_capture_countnv_transforms.c13Sets bit N in device or HD bitmap
sub_6BCBC0nv_reset_capture_bitmasksnv_transforms.c9Zeroes both 128-byte bitmaps before each TU
sub_46BC80emit_preprocessor_directivecp_gen_be.c--Emits #if 0 / #endif suppression blocks

Global State

VariableAddressTypePurpose
unk_12869800x1286980uint64_t[16]Device lambda capture-count bitmap (1024 bits)
unk_12869000x1286900uint64_t[16]Host-device lambda capture-count bitmap (1024 bits)
dword_106BF380x106BF38int32--extended-lambda mode flag
dword_126E2700x126E270int32C++17 noexcept-in-type-system flag
dword_126E1DC0x126E1DCint32EDG native mode flag (# vs #line format)
dword_106581C0x106581Cint32Output column counter
dword_10658200x1065820int32Output line counter (reset after preamble)
qword_10658280x1065828int64Output state pointer (reset after preamble)
dword_10658180x1065818int32Pending indentation flag
dword_10658340x1065834int32Preprocessor nesting depth counter