Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Execution Spaces

Every CUDA function lives in one or more execution spaces that govern where the function can run (host CPU, device GPU, or both) and what it can call. cudafe++ encodes execution space as a single-byte bitfield at offset +182 of the entity (routine) node. This byte is the most frequently tested field in CUDA-specific code paths -- it drives attribute application, redeclaration compatibility, virtual override checking, call-graph validation, IL marking, and code generation selection. Understanding this byte is prerequisite to understanding nearly every CUDA-specific subsystem in cudafe++.

The three CUDA execution-space keywords (__host__, __device__, __global__) are parsed as EDG attributes with internal kind codes 'V' (86), 'W' (87), and 'X' (88) respectively. The attribute dispatch table in apply_one_attribute (sub_413240) routes each kind to a dedicated handler that validates constraints and sets the bitfield. Functions without any explicit annotation default to __host__.

Key Facts

PropertyValue
Source fileattribute.c (handlers), class_decl.c (redecl/override), nv_transforms.h (inline predicates)
Bitfield locationEntity node byte at offset +182
__global__ handlersub_40E1F0 / sub_40E7F0 (apply_nv_global_attr, two variants)
__device__ handlersub_40EB80 (apply_nv_device_attr)
__host__ handlersub_4108E0 (apply_nv_host_attr)
Virtual override checkersub_432280 (record_virtual_function_override)
Execution space mask tabledword_E7C760[] (indexed by space enum)
Mask lookupsub_6BCF60 (nv_check_execution_space_mask)
Annotation helpersub_41A1F0 (validates HD annotations on types)
Relaxed mode flagdword_106BFF0 (permits otherwise-illegal space combinations)
main() entity pointerqword_126EB70 (compared during attribute application)

The Execution Space Bitfield (Entity + 182)

Byte offset +182 within a routine entity node encodes the execution space as a bitfield. Individual bits carry distinct meanings:

Byte at entity+182:

  bit 0  (0x01)   device_capable     Function can execute on device
  bit 1  (0x02)   device_explicit    __device__ was explicitly written
  bit 2  (0x04)   host_capable       Function can execute on host
  bit 3  (0x08)   (reserved)
  bit 4  (0x10)   host_explicit      __host__ was explicitly written
  bit 5  (0x20)   device_annotation  Secondary device flag (used in HD detection)
  bit 6  (0x40)   global_kernel      Function is a __global__ kernel
  bit 7  (0x80)   hd_combined        Combined __host__ __device__ flag

Combined Patterns

The attribute handlers do not set individual bits -- they OR entire patterns into the byte. Each CUDA keyword produces a characteristic bitmask:

KeywordOR maskResulting byteBit breakdown
__global__0x610xE1device_capable + device_annotation + global_kernel + bit 7 (always set)
__device__0x230x23device_capable + device_explicit + device_annotation
__host__0x150x15device_capable + host_capable + host_explicit
__host__ __device__0x23 | 0x150x37device_capable + device_explicit + host_capable + host_explicit + device_annotation
(no annotation)none0x00Implicit __host__ -- bits remain zero

The 0x80 bit is set unconditionally by the __global__ handler. After the |= 0x61 operation (which sets bit 6), the handler reads the byte back and checks (byte & 0x40) != 0. Since bit 6 was just set, this is always true, so |= 0x80 always executes. Despite the field name hd_combined in some tooling, the bit functions as a "has global annotation" marker in practice.

Why device_capable (bit 0) Appears in host

The __host__ mask 0x15 includes bit 0 (device_capable). This is not an error. Bit 0 acts as a "has execution space annotation" marker rather than a strict "runs on device" flag. The actual device-only vs host-only distinction is determined by the two-bit extraction at bits 4-5 (the 0x30 mask), described below.

Execution Space Classification (0x30 Mask)

The critical two-bit extraction byte & 0x30 classifies a routine into one of four categories:

(byte & 0x30):
  0x00  ->  no explicit annotation (implicit __host__)
  0x10  ->  __host__ only
  0x20  ->  __device__ only
  0x30  ->  __host__ __device__

This extraction is the basis of nv_is_device_only_routine, an inline predicate defined in nv_transforms.h (line 367). The full check from the decompiled binary is:

// nv_is_device_only_routine (inlined from nv_transforms.h:367)
// entity_sym: the symbol table entry for the routine
// entity_sym+88 -> associated routine entity

__int64 entity = *(entity_sym + 88);
if (!entity)
    internal_error("nv_transforms.h", 367, "nv_is_device_only_routine");

char byte = *(char*)(entity + 182);
bool is_device_only = ((byte & 0x30) == 0x20) && ((byte & 0x60) == 0x20);

The double-check (byte & 0x60) == 0x20 ensures the function is device-only and NOT a __global__ kernel (which would have bit 6 set, making byte & 0x60 == 0x60). This predicate is used in:

  • check_void_return_okay (sub_719D20): suppress missing-return warnings for device-only functions
  • record_virtual_function_override (sub_432280): drive virtual override execution space propagation
  • Cross-space call validation: determine whether a call crosses execution space boundaries
  • IL keep-in-il marking: identify device-reachable code

The 0x60 Mask (Kernel vs Device)

A secondary extraction byte & 0x60 distinguishes kernels from plain device functions:

(byte & 0x60):
  0x00  ->  no device annotation
  0x20  ->  __device__ only (not a kernel)
  0x40  ->  __global__ only (should not occur in isolation)
  0x60  ->  __global__ (which implies __device__)

nv_is_device_only_routine Truth Table

The predicate is inlined from nv_transforms.h:367 and appears in multiple call sites. Its internal_error guard string "nv_is_device_only_routine" appears in sub_432280 at the source path EDG_6.6/src/nv_transforms.h. The complete truth table for all execution space combinations:

Execution spacebyte+182byte & 0x30byte & 0x60Result
(none, implicit __host__)0x000x000x00false
__host__0x150x100x00false
__device__0x230x200x20true
__host__ __device__0x370x300x20false
__global__0xE10x200x60false

The __global__ case is the key distinction: byte & 0x30 yields 0x20 (same as __device__), but byte & 0x60 yields 0x60 (not 0x20), so the predicate correctly rejects kernels.

// Full pseudocode for nv_is_device_only_routine
// Inlined at every call site; not a standalone function in the binary.
//
// Input: sym -- a symbol table entry (not the entity itself)
// Output: true if the routine is __device__ only (not __host__, not __global__)
bool nv_is_device_only_routine(symbol *sym) {
    entity *e = sym->entity;            // sym + 88
    if (!e)
        internal_error("nv_transforms.h", 367, "nv_is_device_only_routine");

    char byte = e->byte_182;
    // First check:  bits 4-5 == 0x20 -> has __device__, no __host__
    // Second check: bits 5-6 == 0x20 -> has __device__, no __global__
    return ((byte & 0x30) == 0x20) && ((byte & 0x60) == 0x20);
}

Complete Redeclaration Matrix

The matrix below documents every possible pair of (existing annotation, newly-applied annotation) and the result. Each cell is derived from the three attribute handler functions. "Relaxed" means the outcome changes when dword_106BFF0 is set.

Existing \ Applying__host____device____global__
(none) 0x000x15 -- OK0x23 -- OK0xE1 -- OK
__host__ 0x150x15 -- idempotent0x37 -- OK (HD)error 3481 (always: handler checks byte & 0x10 unconditionally)
__device__ 0x230x37 -- OK (HD)0x23 -- idempotenterror 3481 (relaxed: OK)
__global__ 0xE1error 3481 (always)error 3481 (relaxed: OK)0xE1 -- idempotent
__host__ __device__ 0x370x37 -- idempotent0x37 -- idempotenterror 3481 (always: byte & 0x10 fires)

The __global__ column always errors when the existing annotation includes __host__ (bit 4 = 0x10), because the __global__ handler's condition (v5 & 0x10) != 0 is not guarded by the relaxed-mode flag. The __device__ column errors on existing __global__ only when relaxed mode is off, because the __device__ handler guards its check with !dword_106BFF0.

Note that __global__'s byte value is 0xE1 (not 0x61) because the 0x80 bit is always set after __global__ is applied, as documented above.

Attribute Application Functions

apply_nv_global_attr (sub_40E1F0 / sub_40E7F0)

Two nearly identical entry points exist. Both apply __global__ to a function entity. The variant at sub_40E7F0 uses a do-while loop for parameter iteration instead of a for loop, but the validation logic is identical. Both variants may exist because EDG generates different code paths for attribute-on-declaration vs attribute-on-definition.

The function performs extensive validation before setting the bitmask:

// Pseudocode for apply_nv_global_attr (sub_40E1F0)
int64_t apply_nv_global_attr(attr_node *a1, entity *a2, char target_kind) {
    if (target_kind != 11)      // only applies to functions
        return a2;

    // Check constexpr lambda with wrong linkage
    if ((a2->qword_184 & 0x800001000000) == 0x800000000000) {
        char *name = get_entity_name(a2, 0);
        error(3469, a1->source_loc, "__global__", name);
        return a2;
    }

    // Static member check
    if ((signed char)a2->byte_176 < 0 && !(a2->byte_81 & 0x04))
        warning(3507, a1->source_loc, "__global__");

    // operator() check
    if (a2->byte_166 == 5)
        error(3644, a1->source_loc);

    // Return type must be void (skip cv-qualifiers)
    type *ret = a2->return_type;    // +144
    while (ret->kind == 12)         // 12 = cv-qualifier wrapper
        ret = ret->next;            // +144
    if (ret->prototype->exception_spec)     // +152 -> +56
        error(3647, a1->source_loc);        // auto/decltype(auto) return

    // Execution space conflict check (single condition with ||)
    char es = a2->byte_182;
    if ((!dword_106BFF0 && (es & 0x60) == 0x20) || (es & 0x10) != 0)
        error(3481, a1->source_loc);
        // Left branch: already __device__ (not relaxed mode) -> conflict
        // Right branch: already __host__ explicit (unconditional) -> conflict

    // Return type must be void (non-constexpr path)
    if (!(a2->byte_179 & 0x10)) {      // not constexpr
        if (a2->byte_191 & 0x01)       // lambda
            error(3506, a1->source_loc);
        else if (!is_void_return(a2))
            error(3505, a1->source_loc);
    }

    // Variadic check
    // ... skip to prototype, check bit 0 of proto+16
    if (proto_flags & 0x01)
        error(3503, a1->source_loc);

    // >>> SET THE BITMASK <<<
    a2->byte_182 |= 0x61;    // bits 0,5,6: device_capable + device_annotation + global_kernel

    // Local function check
    if (a2->byte_81 & 0x04)
        error(3688, a1->source_loc);

    // main() check
    if (a2 == qword_126EB70 && (a2->byte_182 & 0x20))
        error(3538, a1->source_loc);

    // Always set bit 7 after __global__: the check reads the byte AFTER |= 0x61,
    // so bit 6 is always set, making this unconditional.
    if (a2->byte_182 & 0x40)
        a2->byte_182 |= 0x80;

    // Parameter default-init check (device-side warning)
    // ... iterate parameters, warn 3669 if missing defaults
    return a2;
}

apply_nv_device_attr (sub_40EB80)

Handles both variables (target_kind == 7) and functions (target_kind == 11). For variables, it sets the memory space bitfield at +148 (bit 0 = __device__). For functions, it sets the execution space.

// Variable path (target_kind == 7):
a2->byte_148 |= 0x01;              // __device__ memory space
if (((a2->byte_148 & 0x02) != 0) + ((a2->byte_148 & 0x04) != 0) == 2)
    error(3481, ...);               // both __shared__ (bit 1) AND __constant__ (bit 2) set
if ((signed char)a2->byte_161 < 0)
    error(3482, ...);               // thread_local
if (a2->byte_81 & 0x04)
    error(3485, ...);               // local variable

// Function path (target_kind == 11):
// Same constexpr-lambda check as __global__
if (!dword_106BFF0 && (a2->byte_182 & 0x40))
    error(3481, ...);               // already __global__, now __device__
a2->byte_182 |= 0x23;              // device_capable + device_explicit + device_annotation
if ((a2->byte_81 & 0x04) && (a2->byte_182 & 0x40))
    error(3688, ...);               // local function with __global__
if (a2 == qword_126EB70 && (a2->byte_182 & 0x20))
    error(3538, ...);               // __device__ on main()

apply_nv_host_attr (sub_4108E0)

The simplest of the three. Only applies to functions (target_kind 11). Fewer validation checks than __global__ or __device__.

// Function path (target_kind == 11):
// Same constexpr-lambda check
if (a2->byte_182 & 0x40)
    error(3481, ...);           // already __global__, now __host__
a2->byte_182 |= 0x15;          // device_capable + host_capable + host_explicit
if ((a2->byte_81 & 0x04) && (a2->byte_182 & 0x40))
    error(3688, ...);           // local function
if (a2 == qword_126EB70 && (a2->byte_182 & 0x20))
    error(3538, ...);           // __host__ on main()

Default Execution Space

Functions without any explicit annotation have byte +182 == 0x00. This is treated as implicit __host__:

  • The 0x30 mask yields 0x00, which the cross-space validator treats identically to 0x10 (explicit __host__)
  • The function is compiled for the host side only
  • It is excluded from device IL during the keep-in-il pass

In JIT compilation mode (--default-device), the default flips to __device__. This changes which functions are kept in device IL without requiring explicit annotations.

Execution Space Conflict Detection

The attribute handlers enforce a mutual-exclusion matrix. When a second execution space attribute is applied to a function that already has one, the handler checks for conflicts using error 3481:

Already setApplyingResult
(none)__host__0x15 -- accepted
(none)__device__0x23 -- accepted
(none)__global__0xE1 -- accepted
__host__ (0x15)__device__0x37 -- accepted (HD)
__device__ (0x23)__host__0x37 -- accepted (HD)
__host__ (0x15)__global__error 3481 (always -- byte & 0x10 is unconditional)
__device__ (0x23)__global__error 3481 (unless dword_106BFF0)
__global__ (0xE1)__host__error 3481 (always)
__global__ (0xE1)__device__error 3481 (unless dword_106BFF0)
__host__ (0x15)__host__idempotent OR, no error
__device__ (0x23)__device__idempotent OR, no error
__global__ (0xE1)__global__idempotent OR, no error

The relaxed mode flag dword_106BFF0 suppresses certain conflicts. When set, combinations that would normally produce error 3481 are silently accepted. This flag corresponds to --expt-relaxed-constexpr or similar permissive compilation modes. Note that the relaxed flag does NOT affect the __host__ -> __global__ or __global__ -> __host__ paths -- these always error because the __global__ handler checks byte & 0x10 unconditionally, and the __host__ handler checks byte & 0x40 unconditionally.

Virtual Function Override Checking (sub_432280)

When a derived class overrides a virtual function, cudafe++ must verify execution space compatibility. This check is embedded in record_virtual_function_override (sub_432280, 437 lines, from class_decl.c).

nv_is_device_only_routine Inline Check

The function first tests whether the overriding function has the __device__ flag at +177 bit 4 (0x10). If so, and the overridden function does NOT have this flag, execution space propagation occurs:

// Propagation logic (simplified from sub_432280, lines 70-94)
if (overriding->byte_177 & 0x10) {     // overriding is __device__
    if (!(overridden->byte_177 & 0x10)) {   // overridden is NOT __device__
        char es = overridden->byte_182;
        if ((es & 0x30) != 0x20) {          // overridden is not device-only
            overriding->byte_182 |= 0x10;   // propagate __host__ flag
        }
        if (es & 0x20) {                    // overridden has device_annotation
            overriding->byte_182 |= 0x20;   // propagate device_annotation
        }
    }
}

Six Virtual Override Mismatch Errors (3542-3547)

When the overriding function is NOT __device__, the checker looks up execution space attributes using sub_5CEE70 (attribute kind 87 = __device__, kind 86 = __host__). Based on which attributes are found on the overriding function and the execution space of the overridden function, one of six errors is emitted:

ErrorOverriding hasOverridden space (byte & 0x30)Meaning
3542__device__ only0x00 or 0x10 (host/implicit)Device override of host virtual
3543__device__ + __host__0x00 (no annotation)HD override of implicit-host virtual
3544__device__ + __host__0x20 (device-only)HD override of device-only virtual
3545no __device__0x20 (device-only)Host override of device-only virtual
3546no __device__0x30 (HD)Host override of HD virtual
3547__device__ only0x30 (HD), relaxed modeDevice override of HD virtual (relaxed)

The errors are emitted via sub_4F4F10 with severity 8 (hard error). The dword_106BFF0 relaxed mode flag modulates certain paths: in relaxed mode, some combinations that would otherwise error are accepted or downgraded.

Decision Logic

// Pseudocode for override mismatch detection (sub_432280, lines 95-188)
char es = overridden->byte_182;
char mask_30 = es & 0x30;
bool has_host_bit = (es & 0x20) != 0;    // device_annotation
bool is_hd = (mask_30 == 0x30);

bool has_device_attr = has_attribute(overriding, 87 /*__device__*/);
bool has_host_attr   = has_attribute(overriding, 86 /*__host__*/);

if (has_device_attr) {
    if (has_host_attr) {
        // Overriding is __host__ __device__
        if (has_host_bit)
            error = 3544;   // HD overrides device-only
        else if (mask_30 != 0x20)
            error = 3543;   // HD overrides implicit-host
    } else {
        // Overriding is __device__ only
        if (!has_host_bit)
            error = 3542;   // device overrides host
        if (is_hd && relaxed_mode)
            error = 3547;   // device overrides HD (relaxed)
    }
} else {
    // Overriding has no __device__
    if (mask_30 == 0x20)
        error = 3545;       // host overrides device-only
    else if (mask_30 == 0x30)
        error = 3546;       // host overrides HD
}

global Function Constraints

The __global__ handler enforces the strictest constraints of any execution space. A kernel function must satisfy all of the following:

ConstraintCheckError
Must be a function (not variable/type)target_kind == 11silently ignored if not
Not a constexpr lambda with wrong linkage(qword_184 & 0x800001000000) != 0x8000000000003469
Not a static member function(signed char)byte_176 >= 0 || (byte_81 & 0x04)3507
Not operator()byte_166 != 53644
Return type not auto/decltype(auto)no exception spec at proto+563647
No conflicting execution spacesee conflict matrix above3481
Return type is void (non-constexpr)is_void_return(a2)3505 / 3506
Not variadic!(proto_flags & 0x01)3503
Not a local function!(byte_81 & 0x04)3688
Not main()a2 != qword_126EB703538
Parameters have default init (device-side)walk parameter list3669 (warning)

Execution Space Annotation Helper (sub_41A1F0)

This function validates that type arguments used in __host__ __device__ or __device__ template contexts are well-formed. It traverses the type chain (following cv-qualifier wrappers where kind == 12), emitting diagnostics:

  • Error 3597: Type nesting depth exceeds 7 levels
  • Error 3598: Type is not device-callable (fails sub_550E50 check)
  • Error 3599: Type lacks appropriate constructor/destructor for device context

The first argument selects the annotation string: when a3 == 0, the string is "__host__ __device__"; when a3 != 0, it is "__device__".

Attribute Dispatch (apply_one_attribute)

The central dispatcher sub_413240 (apply_one_attribute, 585 lines) routes attribute kinds to their handlers via a switch statement:

Kind byteDecimalAttributeHandler
'V'86__host__sub_4108E0
'W'87__device__sub_40EB80
'X'88__global__sub_40E1F0 or sub_40E7F0

Attribute display names are resolved by sub_40A310 (attribute_display_name), which maps the kind byte back to the human-readable CUDA keyword string for use in diagnostic messages.

Execution Space Mask Table (dword_E7C760)

A lookup table at dword_E7C760 stores precomputed bitmasks indexed by execution space enum value. The function sub_6BCF60 (nv_check_execution_space_mask) performs return a1 & dword_E7C760[a2], allowing fast bitwise checks of whether a given entity's execution space matches a target space category. This table is used throughout cross-space validation and IL marking.

Diagnostics Reference

ErrorSeverityMeaning
3469errorExecution space attribute on constexpr lambda with wrong linkage
3481errorConflicting execution spaces
3482error__device__ variable with thread_local storage
3485error__device__ attribute on local variable
3503error__global__ function cannot be variadic
3505error__global__ return type must be void (non-constexpr path)
3506error__global__ return type must be void (constexpr/lambda path)
3507warning__global__ on static member function
3538errorExecution space attribute on main()
3577error__device__ variable with constexpr and conflicting memory space
3542errorVirtual override: __device__ overrides host
3543errorVirtual override: __host__ __device__ overrides implicit-host
3544errorVirtual override: __host__ __device__ overrides device-only
3545errorVirtual override: host overrides device-only
3546errorVirtual override: host overrides __host__ __device__
3547errorVirtual override: __device__ overrides HD (relaxed mode)
3597errorType nesting too deep for execution space annotation
3598errorType not callable in target execution space
3599errorType lacks device-compatible constructor/destructor
3644error__global__ on operator()
3647error__global__ return type cannot be auto/decltype(auto)
3669warning__global__ parameter without default initializer (device-side)
3688errorExecution space attribute on local function

Function Map

AddressIdentityLinesSource
sub_40A310attribute_display_name83attribute.c
sub_40E1F0apply_nv_global_attr (variant 1)89attribute.c
sub_40E7F0apply_nv_global_attr (variant 2)86attribute.c
sub_40EB80apply_nv_device_attr100attribute.c
sub_4108E0apply_nv_host_attr31attribute.c
sub_413240apply_one_attribute (dispatch)585attribute.c
sub_41A1F0execution space annotation helper82class_decl.c
sub_432280record_virtual_function_override437class_decl.c
sub_6BCF60nv_check_execution_space_mask7nv_transforms.c
sub_719D20check_void_return_okay271statements.c

Cross-References