Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

global Function Constraints

The __global__ attribute designates a CUDA kernel -- a function that executes on the GPU and is callable from host code via the <<<...>>> launch syntax. Of all CUDA execution space attributes, __global__ imposes the most constraints. cudafe++ enforces these constraints across three separate validation passes: attribute application (when __global__ is first applied to an entity), post-declaration validation (after all attributes on a declaration are resolved), and semantic analysis (during template instantiation, redeclaration merging, and lambda processing). This page documents all constraint checks, their implementation in the binary, the entity node fields they inspect, and the diagnostics they emit.

Key Facts

PropertyValue
Source filesattribute.c (apply handler), nv_transforms.c (post-validation), class_decl.c (redeclaration, lambda), decls.c (template packs)
Apply handler (variant 1)sub_40E1F0 (89 lines)
Apply handler (variant 2)sub_40E7F0 (86 lines)
Post-validationsub_6BC890 (nv_validate_cuda_attributes, 161 lines)
Attribute kind byte0x58 = 'X'
OR mask appliedentity+182 |= 0x61 (bits 0 + 5 + 6)
HD combined flagentity+182 |= 0x80 (set when __global__ applied to function already marked __host__)
Total constraint checks37 distinct error conditions
Entity fields read+81, +144, +148, +152, +166, +176, +179, +182, +183, +184, +191
Relaxed mode flagdword_106BFF0 (suppresses certain conflict checks)
main() entity pointerqword_126EB70 (compared to detect __global__ main)

Two Variants of apply_nv_global_attr

Two nearly identical functions implement the __global__ application logic. Both perform the same 11 validation checks and apply the same 0x61 bitmask. The difference is purely structural: sub_40E1F0 uses a for loop with a null-terminated break for the parameter default-init iteration, while sub_40E7F0 uses a do-while loop with an explicit null check and early return. Both exist because EDG's attribute subsystem may route through different call paths depending on whether the attribute appears on a declaration or a definition.

// Pseudocode for apply_nv_global_attr (sub_40E1F0 / sub_40E7F0)
// a1: attribute node, a2: entity node, a3: target kind
entity_t* apply_nv_global_attr(attr_node_t* a1, entity_t* a2, uint8_t a3) {

    // Gate: only applies to functions (kind 11)
    if (a3 != 11)
        return a2;

    // ---- Phase 1: Linkage / constexpr lambda check ----
    // Bits 47 and 24 of the 48-bit field at +184
    if ((a2->qword_184 & 0x800001000000) == 0x800000000000) {
        // Constexpr lambda with internal linkage but no local flag
        char* name = get_entity_display_name(a2, 0);  // sub_6BC6B0
        emit_error(3469, a1->src_loc, "__global__", name);
        return a2;   // bail out, do not apply __global__
    }

    // ---- Phase 2: Structural constraints ----

    // 2a. Static member function check
    if ((signed char)a2->byte_176 < 0 && !(a2->byte_81 & 0x04))
        emit_warning(3507, a1->src_loc, "__global__");  // severity 5

    // 2b. operator() check
    if (a2->byte_166 == 5)
        emit_error(3644, a1->src_loc);  // severity 7

    // 2c. Exception specification check (uses return type chain)
    type_t* ret = a2->type_chain;  // entity+144
    while (ret->kind == 12)        // skip cv-qualifier wrappers
        ret = ret->referenced;     // type+144
    if (ret->prototype->exception_spec)  // proto+152 -> +56
        emit_error(3647, a1->src_loc);   // auto/decltype(auto) return

    // 2d. Execution space conflict
    uint8_t es = a2->byte_182;
    if (!relaxed_mode && (es & 0x60) == 0x20)  // already __device__ only
        emit_error(3481, a1->src_loc);
    if (es & 0x10)                              // already __host__ explicit
        emit_error(3481, a1->src_loc);

    // 2e. Return type must be void
    if (!(a2->byte_179 & 0x10)) {  // not constexpr
        if (a2->byte_191 & 0x01)   // is lambda
            emit_error(3506, a1->src_loc);
        else {
            type_t* base = skip_typedefs(a2->type_chain);  // sub_7A68F0
            if (!is_void_type(base->referenced))            // sub_7A6E90
                emit_error(3505, a1->src_loc);
        }
    }

    // 2f. Variadic (ellipsis) check
    type_t* proto_type = a2->type_chain;  // +144
    while (proto_type->kind == 12)
        proto_type = proto_type->referenced;
    if (proto_type->prototype->flags_16 & 0x01)  // bit 0 of proto+16
        emit_error(3503, a1->src_loc);

    // ---- Phase 3: Apply the bitmask ----
    a2->byte_182 |= 0x61;   // device_capable + device_annotation + global_kernel

    // ---- Phase 4: Additional checks (after bitmask set) ----

    // 4a. Local function (constexpr local)
    if (a2->byte_81 & 0x04)
        emit_error(3688, a1->src_loc);

    // 4b. main() function check
    if (a2 == main_entity && (a2->byte_182 & 0x20))
        emit_error(3538, a1->src_loc);

    // ---- Phase 5: Parameter iteration (__grid_constant__ warning) ----
    if (a1->flags & 0x01) {  // attr_node+11 bit 0: applies to parameters
        // Walk parameter list from prototype
        proto_type = a2->type_chain;
        while (proto_type->kind == 12)
            proto_type = proto_type->referenced;
        param_t* param = *proto_type->prototype->param_list;  // deref +152

        source_loc_t loc = a1->src_loc;  // +56
        for (; param != NULL; param = param->next) {
            // Peel cv-qualifier wrappers
            type_t* ptype = param->type;  // param[1]
            while (ptype->kind == 12)
                ptype = ptype->referenced;

            // Check: is type a __grid_constant__ candidate?
            if (!has_grid_constant_flag(ptype) && scope_index == -1) {
                // sub_7A6B60: checks byte+133 bit 5 (0x20)
                int64_t scope = scope_table_base + 784 * scope_table_index;
                if ((scope->flags_6 & 0x06) == 0 && scope->kind_4 != 12) {
                    type_t* ptype2 = param->type;
                    while (ptype2->kind == 12)
                        ptype2 = ptype2->referenced;
                    if (!ptype2->default_init)  // type+120 == NULL
                        emit_error(3669, &loc);
                }
            }
        }
    }

    // ---- Phase 6: HD combined flag ----
    if (a2->byte_182 & 0x40)       // __global__ now set
        a2->byte_182 |= 0x80;      // mark as combined HD

    return a2;
}

Execution Order Detail

The 0x61 bitmask is applied before the local-function (3688) and main() (3538) checks but after all structural checks (3507, 3644, 3647, 3481, 3505/3506, 3503). This means the bitmask is set even when errors are emitted -- cudafe++ continues processing after errors to collect as many diagnostics as possible in a single compilation pass.

The constexpr-lambda check at the top (error 3469) is the only check that causes an early return. If the function is a constexpr lambda with wrong linkage, the bitmask is NOT set and no further validation is performed.

Validation Error Catalog

The 37 validation errors are organized by the phase in which they are checked and by semantic category. Error codes below are cudafe++ internal diagnostic numbers; severity values match the sub_4F41C0 severity parameter (5 = warning, 7 = error, 8 = hard error).

Category 1: Return Type

ErrorSeverityCheckMessage
35057!is_void_type(skip_typedefs(entity+144)->referenced)a __global__ function must have a void return type
35067entity+191 & 0x01 (lambda) and non-voida __global__ function must not have a deduced return type
36477entity+152 -> +56 != NULL (exception spec present on return proto)auto/decltype(auto) deduced return type

Error 3505 and 3506 are mutually exclusive paths guarded by the byte+179 & 0x10 constexpr flag. When the function is not constexpr, the handler checks whether it is a lambda (3506 path, which checks byte+191 bit 0) or a regular function (3505 path, which resolves through skip_typedefs via sub_7A68F0 and tests is_void_type via sub_7A6E90). The skip_typedefs function follows the type chain while type->kind == 12 (cv-qualifier wrapper) and type->byte_161 & 0x7F == 0 (no qualifier flags). The is_void_type function follows the same chain and returns kind == 1 (void).

Error 3647 is checked independently of 3505/3506. The check examines the exception specification pointer at prototype offset +56. In EDG's type system, auto and decltype(auto) return types are represented with a non-null exception specification node on the return type's prototype -- this is a repurposed field that indicates the return type is deduced.

Category 2: Parameters

ErrorSeverityCheckMessage
35038proto+16 & 0x01 (has ellipsis)a __global__ function cannot have ellipsis
37027param_flags & 0x02 (rvalue ref)a __global__ function cannot have a parameter with rvalue reference type
--7Parameter with __restrict__ on reference typea __global__ function cannot have a parameter with __restrict__ qualified reference type
--7Parameter of type va_listA __global__ function or function template cannot have a parameter with va_list type
--7Parameter of type std::initializer_lista __global__ function or function template cannot have a parameter with type std::initializer_list
--7Oversized alignment on win32cannot pass a parameter with a too large explicit alignment to a __global__ function on win32 platforms
36698Device-scope parameter without default init__grid_constant__ parameter warning (device-side check)

Error 3503 (ellipsis) is checked in the apply handler by testing bit 0 of the function prototype's flags word at offset +16. This bit indicates the parameter list ends with ....

Error 3702 (rvalue reference) is checked in the post-validation pass (sub_6BC890), not in the apply handler. The post-validator walks the parameter list and checks byte offset +32 (bit 1) of each parameter node.

The __restrict__ reference, va_list, initializer_list, and win32 alignment checks are scattered across separate validation functions in nv_transforms.c and are triggered during declaration processing rather than during attribute application.

Error 3669 is checked in the apply handler's parameter iteration loop. It walks each parameter, resolves through cv-qualifier wrappers, and tests whether sub_7A6B60 returns false (meaning the parameter type has bit 5 of byte+133 clear -- not a __grid_constant__ type) AND the scope lookup produces a non-array, non-qualifier type without a default initializer at type+120.

Category 3: Modifiers

ErrorSeverityCheckMessage
35075(signed char)byte_176 < 0 && !(byte_81 & 0x04)A __global__ function or function template cannot be marked constexpr (warning for static member)
36888byte_81 & 0x04 (local function)A __global__ function or function template cannot be marked constexpr (constexpr local)
34818Execution space conflict (see matrix)Conflicting CUDA execution spaces
--7Function is constevalA __global__ function or function template cannot be marked consteval
36447byte_166 == 5 (operator function kind)An operator function cannot be a __global__ function
--7Defined in friend declarationA __global__ function or function template cannot be defined in a friend declaration
--7Exception specification presentAn exception specification is not allowed for a __global__ function or function template
--7Declared in inline unnamed namespaceA __global__ function or function template cannot be declared within an inline unnamed namespace
35387a2 == qword_126EB70 (is main())function main cannot be marked __device__ or __global__

Error 3507 deserves special attention. The decompiled code shows:

if ((signed char)a2->byte_176 < 0 && !(a2->byte_81 & 0x04))
    emit_warning(3507, ...);

The signed char cast means byte_176 >= 0x80 (bit 7 set = static member function). The !(byte_81 & 0x04) condition ensures it is NOT a local function. The emitter uses severity 5 (warning via sub_4F8DB0), meaning this is a warning, not an error -- NVIDIA chose to warn rather than reject __global__ on static members, though the official documentation says it is not allowed. The displayed string is "A __global__ function or function template cannot be marked constexpr" with "__global__" as the attribute name parameter, though the actual semantic is "static member function" per the field being checked.

Error 3644 checks entity+166 == 5. This field stores the "operator function kind" enum value, where 5 corresponds to operator(). This prevents lambda call operators or functors from being directly marked __global__.

Error 3688 is checked after the bitmask is set (byte_182 |= 0x61). It tests byte_81 & 0x04, which indicates a local (block-scope) function. The handler emits with severity 8 (via sub_4F81B0, hard error).

Error 3538 compares the entity pointer against qword_126EB70, which holds the entity pointer for main() (set during initial declaration processing). The condition also requires byte_182 & 0x20 (device annotation bit set), which is always true after |= 0x61.

Category 4: Template Constraints

ErrorSeverityCheckMessage
--7Pack parameter is not last template parameterPack template parameter must be the last template parameter for a variadic __global__ function template
--7Multiple pack parametersMultiple pack parameters are not allowed for a variadic __global__ function template

These checks are performed during template declaration processing in decls.c, not in the apply handler. They constrain variadic __global__ function templates: CUDA requires that pack parameters appear last (so the runtime can enumerate kernel arguments), and only a single pack is permitted (the CUDA launch infrastructure cannot handle multiple parameter packs).

Category 5: Redeclaration

ErrorSeverityCheckMessage
--7Previously __global__, now no execution spacea __global__ function(%no1) redeclared without __global__
--7Previously __global__, now __host__a __global__ function(%no1) redeclared with __host__
--7Previously __global__, now __device__a __global__ function(%no1) redeclared with __device__
--7Previously __global__, now __host__ __device__a __global__ function(%no1) redeclared with __host__ __device__

These four error variants are symmetrical with the reverse direction:

  • a __device__ function(%no1) redeclared with __global__
  • a __host__ function(%no1) redeclared with __global__
  • a __host__ __device__ function(%no1) redeclared with __global__

Redeclaration checks occur during declaration merging in class_decl.c. When a function is redeclared and the execution space of the new declaration does not match the original, cudafe++ emits one of these errors. The %no1 format specifier inserts the function name. These checks run independently of the apply_nv_global_attr handler -- they operate on the merged entity after both attribute sets have been processed.

Category 6: Constexpr Lambda Linkage

ErrorSeverityCheckMessage
34695(qword_184 & 0x800001000000) == 0x800000000000__global__ on constexpr lambda with wrong linkage

This is the first check in the apply handler and the only one that causes early return. The 48-bit field at entity+184 encodes template and linkage properties. Bit 47 (0x800000000000) indicates internal linkage or a similar constraint, while bit 24 (0x000001000000) indicates a local entity. When bit 47 is set but bit 24 is clear, the entity is a constexpr lambda that cannot legally receive __global__. The handler calls sub_6BC6B0 (get_entity_display_name) to format the entity name for the diagnostic message, then returns without setting the bitmask.

Category 7: Post-Validation (sub_6BC890)

These checks run after all attributes on a declaration have been applied, in the nv_validate_cuda_attributes function:

ErrorSeverityCheckMessage
37027Parameter with rvalue reference flag (bit 1 at param+32)a __global__ function cannot have a parameter with rvalue reference type
36617__nv_register_params__ on __global____nv_register_params__ is not allowed on a __global__ function
35347__launch_bounds__ on non-__global__%s attribute is not allowed on a non-__global__ function
37077maxBlocksPerCluster < cluster producttotal number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit
37157__maxnreg__ on non-__global____maxnreg__ is not allowed on a non-__global__ function
37197Both __launch_bounds__ and __maxnreg____launch_bounds__ and __maxnreg__ may not be used on the same declaration
36954__global__ without __launch_bounds__no __launch_bounds__ specified for __global__ function (warning)

Error 3695 is a severity-4 diagnostic (informational warning). It fires when a __global__ function has no associated launch configuration, encouraging developers to specify __launch_bounds__ for optimal register allocation. This is the only constraint that is a soft advisory rather than a hard or standard error.

Entity Node Field Reference

The apply handler reads and writes specific fields within the entity node. Complete field semantics:

OffsetSizeField NameRole in __global__ Validation
+811 bytelocal_flagsBit 2 (0x04): function is local (block-scope). Checked for 3688 and as exemption for 3507.
+1448 bytestype_chainPointer to return type. Followed through kind==12 cv-qualifier wrappers.
+1528 bytesprototypeFunction prototype pointer. At prototype+16: flags (bit 0 = ellipsis). At prototype+56: exception spec pointer. At prototype+0: parameter list head (double deref for first param).
+1661 byteoperator_kindValue 5 = operator(). Checked for 3644.
+1761 bytemember_flagsBit 7 (0x80, checked as signed char < 0): static member function. Checked for 3507.
+1791 byteconstexpr_flagsBit 4 (0x10): function is constexpr. Guards 3505/3506 check (skipped if constexpr).
+1821 byteexecution_spaceThe primary execution space bitfield. |= 0x61 sets global kernel. Read for conflict checks (0x60, 0x10 masks).
+1831 byteextended_cudaBit 3 (0x08): __nv_register_params__. Checked in post-validation. Bit 6 (0x40): __cluster_dims__ set.
+1848 byteslinkage_template48-bit field encoding template/linkage flags. Only lower 48 bits used; mask 0x800001000000 checks constexpr lambda linkage.
+1911 bytelambda_flagsBit 0 (0x01): entity is a lambda. Routes to 3506 instead of 3505 for void-return check.
+2568 byteslaunch_configPointer to launch configuration struct (56 bytes). NULL if no launch attributes applied. Read in post-validation.

The 0x61 Bitmask

The OR mask 0x61 sets three bits in the execution space byte:

0x61 = 0b01100001

  bit 0 (0x01):  device_capable     -- function can run on device
  bit 5 (0x20):  device_annotation  -- has explicit device-side annotation
  bit 6 (0x40):  global_kernel      -- function is a __global__ kernel

Bit 0 is shared with __device__ (0x23) and __host__ (0x15). It serves as a "has CUDA annotation" predicate -- any entity with bit 0 set has been explicitly annotated with at least one execution space keyword. This enables fast if (byte_182 & 0x01) checks throughout the codebase.

Bit 5 is shared with __device__. A __global__ function is considered device-annotated because kernel code executes on the GPU.

Bit 6 is unique to __global__. The mask byte_182 & 0x40 is the canonical predicate for "is this a kernel function?" used in dozens of locations throughout the binary.

HD Combined Flag (0x80)

After setting 0x61, the handler checks whether bit 6 (0x40, global kernel) is now set. If so, it ORs 0x80 into the byte. This bit means "combined host+device" and is set as a secondary effect. The logic at the end of the function:

if (a2->byte_182 & 0x40)       // just set via |= 0x61
    a2->byte_182 |= 0x80;      // always true after apply

This means every __global__ function ends up with byte_182 & 0x80 set, which marks it as "combined" in the execution space classification. This is semantically correct: a kernel has both a host-side stub (for launching) and device-side code (for execution).

Parameter Iteration for grid_constant

The final section of the apply handler iterates the function's parameter list to check for parameters that should be annotated __grid_constant__. This check only runs when attr_node->flags bit 0 (a1+11 & 0x01) is set, indicating the attribute application context includes parameter-level processing.

The iteration follows this structure:

// Navigate to function prototype
type_t* proto_type = entity->type_chain;     // +144
while (proto_type->kind == 12)               // skip cv-qualifiers
    proto_type = proto_type->referenced;      // +144

// Get parameter list head (double dereference)
param_t** param_list = proto_type->prototype->param_head;  // proto+152 -> deref
param_t* param = *param_list;                               // deref again

for (; param != NULL; param = param->next) {
    // Navigate to unqualified parameter type
    type_t* ptype = param[1];    // param->type (offset 8)
    while (ptype->kind == 12)
        ptype = ptype->referenced;

    // sub_7A6B60: checks byte+133 bit 5 (0x20) -- "has __grid_constant__"
    bool has_gc = (ptype->byte_133 & 0x20) != 0;

    if (!has_gc && dword_126C5C4 == -1) {
        // Scope table lookup
        int64_t scope = qword_126C5E8 + 784 * dword_126C5E4;
        uint8_t scope_flags = scope->byte_6;
        uint8_t scope_kind = scope->byte_4;

        // Skip if scope has qualifier flags or is a cv-qualified scope
        if ((scope_flags & 0x06) == 0 && scope_kind != 12) {
            // Re-navigate to unqualified type
            type_t* ptype2 = param[1];
            while (ptype2->kind == 12)
                ptype2 = ptype2->referenced;

            // Check for default initializer
            if (ptype2->qword_120 == 0)
                emit_error(3669, &saved_source_loc);
        }
    }
}

The scope table lookup uses a 784-byte scope structure (at qword_126C5E8 indexed by dword_126C5E4) to determine whether the current context is device-side. The dword_126C5C4 == -1 check verifies we are in device compilation mode. This entire parameter iteration is a device-side warning mechanism: it alerts developers when a kernel parameter lacks a default initializer in a context where __grid_constant__ would be appropriate.

Post-Declaration Validation (sub_6BC890)

After all attributes on a declaration are applied, nv_validate_cuda_attributes (sub_6BC890, 161 lines) performs cross-attribute consistency checks. For __global__ functions, this function enforces:

Rvalue Reference Parameters (3702)

// Walk parameter list
type_t* ret = entity->type_chain;
while (ret->kind == 12)
    ret = ret->referenced;
param_t* param = **((param_t***)ret + 19);  // proto -> param list

while (param) {
    if (param->byte_32 & 0x02)  // rvalue reference flag
        emit_error(3702, source_loc);
    param = param->next;
}

This check scans all parameters for the rvalue reference flag (bit 1 at parameter node offset +32). Kernel functions cannot accept rvalue references because kernel launch involves copying arguments through the CUDA runtime, which does not support move semantics across the host-device boundary.

nv_register_params Conflict (3661)

if (entity->byte_183 & 0x08) {  // __nv_register_params__ set
    if (entity->byte_182 & 0x40)
        emit_error(3661, ..., "__global__");
    else if ((entity->byte_182 & 0x30) == 0x20)
        emit_error(3661, ..., "__host__");
}

The __nv_register_params__ attribute (bit 3 of byte+183) is incompatible with __global__ because kernel parameter passing uses a fixed ABI that cannot be overridden.

Launch Configuration Without global (3534)

launch_config_t* lc = entity->launch_config;  // +256
if (lc && !(entity->byte_182 & 0x40)) {
    if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor)
        emit_error(3534, ..., "__launch_bounds__");
}

The __launch_bounds__, __cluster_dims__, and __block_size__ attributes require __global__. If a non-kernel function has any of these, error 3534 fires.

Cluster Dimension Product Check (3707)

if (lc->cluster_dim_x > 0 && lc->maxBlocksPerCluster > 0) {
    uint64_t product = lc->cluster_dim_x * lc->cluster_dim_y * lc->cluster_dim_z;
    if (lc->maxBlocksPerCluster < product)
        emit_error(3707, ...);
}

launch_bounds and maxnreg Conflict (3719)

if (lc->maxThreadsPerBlock && lc->maxnreg >= 0)
    emit_error(3719, ..., "__launch_bounds__ and __maxnreg__");

These two attributes provide contradictory register pressure hints and cannot coexist.

Missing launch_bounds Warning (3695)

if ((entity->byte_182 & 0x40) &&
    (!lc || (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor)))
    emit_warning(3695);

Severity 4 (advisory). Encourages developers to annotate kernels with __launch_bounds__ for optimal register allocation.

Execution Space Conflict Matrix

When __global__ is applied to a function that already has an execution space annotation, the handler checks for conflicts using two conditions:

// Condition 1: already __device__ only (without relaxed mode)
if (!dword_106BFF0 && (byte_182 & 0x60) == 0x20)
    error(3481);

// Condition 2: already __host__ explicit
if (byte_182 & 0x10)
    error(3481);
Current byte_182Applying __global__(byte & 0x60) == 0x20byte & 0x10Result
0x00 (none)|= 0x61 -> 0x61falsefalseaccepted
0x23 (__device__)truefalseerror 3481 (unless relaxed)
0x15 (__host__)falsetrueerror 3481
0x37 (__host__ __device__)falsetrueerror 3481
0x61 (__global__)truefalseerror 3481 (unless relaxed) -- idempotent bitmask

In relaxed mode (dword_106BFF0 != 0), the first condition is suppressed, allowing __device__ + __global__ combinations. The second condition (explicit __host__) is never relaxed.

Helper Functions

AddressIdentityLinesPurpose
sub_6BC6B0get_entity_display_name49Formats entity name for diagnostic messages. Handles demangling, strips leading ::.
sub_7A68F0skip_typedefs19Follows type chain through kind==12 wrappers while byte_161 & 0x7F == 0.
sub_7A6E90is_void_type16Follows type chain through kind==12, returns kind == 1.
sub_7A6B60has_grid_constant_flag9Follows type chain through kind==12, returns byte_133 & 0x20.
sub_4F7510emit_error_with_names66Emits error with two string arguments (attribute name + entity name).
sub_4F8DB0emit_warning_with_name38Emits warning (severity 5) with one string argument.
sub_4F8200emit_error_basic10Emits error with severity + code + source location.
sub_4F81B0emit_error_minimal10Emits error (severity 8) with code + source location.
sub_4F8490emit_error_with_extra38Emits error with one supplementary argument.

Additional global Constraints (Outside Apply Handler)

Beyond the apply handler and post-validation, several other subsystems enforce __global__-specific rules. These checks occur during template instantiation, lambda processing, and declaration merging:

Template Argument Type Restrictions

CUDA restricts which types can appear as template arguments in __global__ function template instantiations:

  • Host-local types (defined inside a __host__ function) cannot be used
  • Private/protected class members cannot be used (unless the class is local to a __device__/__global__ function)
  • Unnamed types cannot be used (unless local to a __device__/__global__ function)
  • Lambda closure types cannot be used (unless the lambda is defined in a __device__/__global__ function, or is an extended lambda with --extended-lambda)
  • Texture/surface variables cannot be used as non-type template arguments
  • Private/protected template template arguments from class scope cannot be used

Static Global Template Stub

In whole-program compilation mode (-rdc=false) with -static-global-template-stub=true:

  • Extern __global__ function templates are not supported
  • __global__ function template instantiations must have definitions in the current TU

Device-Side Restrictions

Functions marked __global__ (or __device__) are subject to additional restrictions during semantic analysis:

  • address of label extension is not supported
  • ASM operands may specify only one constraint letter
  • Certain ASM constraint letters are forbidden
  • Texture/surface variables cannot have their address taken or be indirected
  • Anonymous union member variables at global/namespace scope cannot be directly accessed
  • Function-scope static variables require a memory space specifier
  • Dynamic initialization of function-scope static variables is not supported

Function Map

AddressIdentityLinesSource File
sub_40E1F0apply_nv_global_attr (variant 1)89attribute.c
sub_40E7F0apply_nv_global_attr (variant 2)86attribute.c
sub_6BC890nv_validate_cuda_attributes161nv_transforms.c
sub_6BC6B0get_entity_display_name49nv_transforms.c
sub_7A68F0skip_typedefs19types.c
sub_7A6E90is_void_type16types.c
sub_7A6B60has_grid_constant_flag9types.c
sub_4F7510emit_error_with_names66error.c
sub_4F8DB0emit_warning_with_name38error.c
sub_4F8200emit_error_basic10error.c
sub_4F81B0emit_error_minimal10error.c
sub_4F8490emit_error_with_extra38error.c
sub_413240apply_one_attribute (dispatch)585attribute.c

Global Variables

GlobalAddressPurpose
dword_106BFF00x106BFF0Relaxed mode flag. When set, suppresses __device__ + __global__ conflict (3481).
qword_126EB700x126EB70Pointer to the entity node for main(). Compared during 3538 check.
dword_126C5C40x126C5C4Scope index sentinel (-1 = device compilation mode). Guards 3669 parameter check.
dword_126C5E40x126C5E4Current scope table index.
qword_126C5E80x126C5E8Scope table base pointer. Each entry is 784 bytes.

Cross-References