Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Entity Node Layout

The entity node is the central data structure in cudafe++ (EDG 6.6) for representing every named declaration: functions, variables, fields, parameters, namespaces, and types. Each node is a variable-size record -- routines occupy 288 bytes, variables 232 bytes, fields 176 bytes -- linked into scope chains and cross-referenced by type nodes, expression nodes, and template instantiation records.

This page focuses on the CUDA-specific fields that NVIDIA grafted onto the EDG entity node. These fields encode execution space (__host__/__device__/__global__), variable memory space (__shared__/__constant__/__managed__), launch configuration (__launch_bounds__/__cluster_dims__/__block_size__/__maxnreg__), and assorted kernel metadata. The attribute application functions in attribute.c write these fields; the backend code generator, cross-space validator, IL walker, and stub emitter read them.

Key Facts

PropertyValue
Routine entity size288 bytes (IL entry kind 11)
Variable entity size232 bytes (IL entry kind 7)
Field entity size176 bytes (IL entry kind 8)
Execution space offset+182 (1 byte, bitfield)
Memory space offset+148 (1 byte, bitfield)
Launch config pointer+256 (8-byte pointer to 56-byte struct)
Source fileattribute.c (writers), nv_transforms.c / cp_gen_be.c (readers)
Attribute dispatchsub_413240 (apply_one_attribute, 585 lines)
Post-validationsub_6BC890 (nv_validate_cuda_attributes)

Visual Layout (Routine Entity, 288 Bytes)

Offset   0         8        16        24        32        40        48        56
       +=========+=========+=========+=========+=========+=========+=========+=========+
  0x00 | next_entity_ptr   | name_string_ptr   |            (EDG internal)             |
       +---------+---------+---------+---------+---------+---------+---------+---------+
  0x20 |                              (EDG internal continued)                          |
       +---------+---------+---------+---------+---------+---------+---------+---------+
  0x40 |                              (EDG internal continued)                          |
       +====+====+=========+=========+=========+=========+=========+=========+=========+
  0x50 |kind|stor|         | assoc_entity_ptr  |                                       |
       |+80 |+81 |         |                   |                                       |
       +----+----+---------+---------+---------+---------+---------+---------+---------+
  0x60 |                   | variable_type_ptr |                                       |
       +=========+=========+=========+=========+====+=========+=========+==========+===+
  0x80 | storage_class/align|         |type_kind|    | return_type_ptr   |MEM |EXT |    |
       |         |         |         |+132     |    | +144              |+148|+149|    |
       +---------+---------+---------+----+----+----+---------+---------+----+----+----+
  0x98 | proto_ptr / param_list +152  |link|stor|    |grid|    |op  |         |         |
       |                              |+160|+161|    |+164|    |+166|         |         |
       +---------+---------+---------+----+----+----+----+----+----+---------+---------+
  0xB0 |mbr |dev |    |kern|func|    |EXEC|CEXT| template_linkage_flags +184            |
       |+176|+177|    |+179|+180|    |+182|+183|                                       |
       +----+----+----+----+----+----+----+----+=========+=========+=========+=========+
  0xC0 | alias_chain/linkage+186      |         |ctor/dtor|lambda  |                    |
       |                              |         |  +190   | +191   |                    |
       +---------+---------+---------+---------+---------+---------+---------+---------+
  0xD0 | variable_alias_chain_next +208         |                                       |
       +---------+---------+---------+---------+---------+---------+---------+---------+
  0xF0 | func_extra / alias_entry +240          |                                       |
       +---------+---------+---------+---------+---------+---------+---------+---------+
 0x100 | LAUNCH_CONFIG_PTR +256      |         (padding to 288)                         |
       +=========+=========+=========+=========+=========+=========+=========+=========+

CUDA-specific fields (UPPERCASE):
  MEM   = +148  variable memory space bitfield (__device__/__shared__/__constant__)
  EXT   = +149  extended memory space (__managed__)
  EXEC  = +182  execution space bitfield (__host__/__device__/__global__)
  CEXT  = +183  CUDA extended flags (__nv_register_params__, __cluster_dims__ intent)
  LAUNCH_CONFIG_PTR = +256  pointer to 56-byte launch_config_t struct

Full Offset Map (CUDA-Relevant Fields)

The table below documents every entity node offset touched by CUDA attribute handlers and validation functions. Offsets are byte positions from the start of the entity node. Fields marked "EDG base" are standard EDG fields that CUDA code tests but does not define.

OffsetSizeFieldSet ByRead By
+08Next entity pointer (linked list)EDGScope iteration
+88Name string pointerEDGError messages, stub emission
+801Entity kind byte (7=variable, 8=field, 11=routine)EDGAll attribute handlers
+811Storage flags (bit 2=local, bit 3=has_name, bit 6=anonymous)EDG__global__ / __device__ validation
+888Associated entity pointerEDGnv_is_device_only_routine
+1128Variable type pointerEDGget_func_type_for_attr
+1281Storage class code / alignmentEDGapply_internal_linkage_attr
+1321Type kind byte (12=qualifier)EDGReturn type traversal
+1448Return type / next-in-chain pointerEDG__global__ void-return check
+1481Variable memory space bitfieldCUDA attr handlersBackend, IL walker
+1491Extended memory spaceapply_nv_managed_attrBackend, runtime init
+1528Function prototype / parameter list headEDG__global__ param checks
+1601Linkage/visibility bits (variable: low 3 = visibility)VariousVisibility propagation
+1611Storage/linkage flags (bit 7=thread_local)EDG__managed__ / __device__ validation
+1641Storage class / grid_constant flags (bit 2=grid_constant)__grid_constant__ handler__managed__/__device__ conflict check
+1661Operator function kind (5=operator function)EDG__global__ validation
+1761Member function flags (bit 7=static member)EDG__global__ static-member check
+1771Device propagation flag (bit 4=0x10)Virtual override propagationOverride space checking
+1791Constexpr/kernel flagsDeclaration processingStub generation, attribute interaction
+1801Function attributes (bit 6=nodiscard, bit 7=noinline)Various attribute handlersBackend
+1821Execution space bitfieldCUDA execution space handlersEverywhere
+1831CUDA extended flags__cluster_dims__ / __nv_register_params__Post-validation, stub emission
+1848Template/linkage flags (48-bit field)EDG + CUDA handlersLambda check, visibility
+1861Alias chain flag (bit 3=internal linkage)apply_internal_linkage_attrLinker
+1901Constructor/destructor priority flagsapply_constructor_attr / apply_destructor_attrBackend
+1911Lambda flags (bit 0=is_lambda)EDG lambda processing__global__ validation
+2088Variable alias chain next pointerapply_alias_attrAlias loop detection
+2408Function extra info / alias entryapply_alias_attrAlias chain traversal
+2568Launch configuration pointerCUDA launch config handlersPost-validation, backend

Execution Space Bitfield (Byte +182)

This is the most frequently read field in CUDA-specific code paths. Every function entity carries a single byte that encodes which execution spaces the function belongs to.

Byte at entity+182:

  bit 0  (0x01)   device_capable     Function can execute on device
  bit 1  (0x02)   device_explicit    __device__ was explicitly written
  bit 2  (0x04)   host_capable       Function can execute on host
  bit 3  (0x08)   (reserved)
  bit 4  (0x10)   host_explicit      __host__ was explicitly written
  bit 5  (0x20)   device_annotation  Secondary device flag (HD detection)
  bit 6  (0x40)   global_kernel      Function is a __global__ kernel
  bit 7  (0x80)   global_confirmed   Always set by __global__ handler tail guard

Combined Patterns

The attribute handlers do not set individual bits. They OR entire patterns into the byte. Each CUDA keyword produces a fixed bitmask:

KeywordOR mask(s)Result byteHandlerEvidence
__global__0x61 then 0x800xE1sub_40E1F0 (apply_nv_global_attr)`entity+182
__device__0x230x23sub_40EB80 (apply_nv_device_attr)`entity+182
__host__0x150x15sub_4108E0 (apply_nv_host_attr)`entity+182
__host__ __device__0x23 then 0x150x37Both handlers in sequenceOR of device + host masks
(no annotation)none0x00--Implicit __host__

The 0x80 bit is set unconditionally at the end of apply_nv_global_attr. After the main body ORs 0x61 into byte+182 (setting bit 6 = global_kernel), a tail guard checks bit 6 and always ORs 0x80:

// sub_40E1F0, lines 84-88
v10 = *(_BYTE *)(a2 + 182);
if ( (v10 & 0x40) == 0 )       // if bit 6 (global_kernel) not set, bail
    return a2;                  // (only reachable via early error paths)
*(_BYTE *)(a2 + 182) = v10 | 0x80;   // always set for __global__

Since 0x61 was already OR'd in, bit 6 is always set on the normal path, so 0x80 is always applied. The actual result byte for any successful __global__ application is 0x61 | 0x80 = 0xE1. The guard condition only triggers on error paths where 0x61 was never applied (e.g., the template-lambda error at line 21 which returns before reaching line 56).

Extraction Patterns

Code throughout cudafe++ extracts execution space category using bitmask tests:

MaskTestMeaningUsed in
& 0x30== 0x00No explicit annotation (implicit host)Space classification
& 0x30== 0x10__host__ onlySpace classification
& 0x30== 0x20__device__ onlynv_is_device_only_routine
& 0x30== 0x30__host__ __device__Space classification
& 0x60== 0x20Device, not kernelDevice-only predicate
& 0x60== 0x60__global__ kernel (implies device)Kernel identification
& 0x40!= 0Is a __global__ kernelStub generation gate

Variable Memory Space Bitfield (Byte +148)

For variable entities (kind 7), byte +148 encodes the CUDA memory space:

Byte at entity+148:

  bit 0  (0x01)   __device__     Variable resides in device global memory
  bit 1  (0x02)   __shared__     Variable resides in shared memory
  bit 2  (0x04)   __constant__   Variable resides in constant memory

These bits are mutually exclusive in valid programs. The attribute handlers enforce this by checking for conflicting combinations:

// From apply_nv_device_attr (sub_40EB80), variable path:
a2->byte_148 |= 0x01;                      // set __device__
int shared_or_constant = a2->byte_148 & 0x06;   // check __shared__ | __constant__
if (popcount(shared_or_constant) + (a2->byte_148 & 0x01) == 2)
    error(3481, ...);                       // conflicting memory spaces

The __device__ attribute on a function (kind 11) does NOT touch byte +148. It writes to byte +182 (execution space) instead. The memory space byte is strictly for variables.

Extended Memory Space (Byte +149)

Byte at entity+149:

  bit 0  (0x01)   __managed__    Unified memory, accessible from both host and device

Set by apply_nv_managed_attr (sub_40E0D0). The handler also sets bit 0 of +148 (__device__) because managed memory resides in device global memory. Additional validation:

  • Error 3481: conflicting if __shared__ or __constant__ is already set
  • Error 3482: cannot be thread-local (byte +161 bit 7)
  • Error 3485: cannot be a local variable (byte +81 bit 2)
  • Error 3577: incompatible with __grid_constant__ parameter (byte +164 bit 2)

Constexpr and Kernel Flags (Byte +176, +179)

Byte +176: Member Function Flags

Byte at entity+176:

  bit 7  (0x80)   static_member   Function is a static class member

Tested by apply_nv_global_attr to detect static __global__ functions. The check is (signed char)(a2->byte_176) < 0, which is true when bit 7 is set. Combined with the local-function test (byte +81 bit 2 clear), this triggers warning 3507.

Byte +179: Constexpr / Kernel Property Flags

Byte at entity+179:

  bit 1  (0x02)   kernel_body        Function has a kernel body (used for stub generation)
  bit 2  (0x04)   (instantiation)    Instantiation-required status
  bit 4  (0x10)   constexpr          Function is constexpr
  bit 5  (0x20)   noinline           Function is noinline

The kernel_body flag at bit 1 (0x02) is the primary gate for device stub generation. The backend code generator (gen_routine_decl in cp_gen_be.c) checks:

// From gen_routine_decl (sweep p1.04, line ~1430)
if ((*(_BYTE *)(v3 + 182) & 0x40) != 0       // is __global__ kernel
    && (*(_BYTE *)(v3 + 179) & 2) != 0)       // has kernel body
{
    // Emit __wrapper__device_stub_<name>(<params>) forwarding body
}

The constexpr flag at bit 4 (0x10) is tested during __global__ attribute validation. When set, the void-return-type check AND the lambda check are both skipped:

// From apply_nv_global_attr (sub_40E1F0), lines 39-50
if ( (*(_BYTE *)(a2 + 179) & 0x10) == 0 )   // NOT constexpr
{
    // Non-constexpr __global__: check return type and lambda
    if ( (*(_BYTE *)(a2 + 191) & 1) != 0 )
        error(3506, ...);                    // lambda __global__ not allowed
    else if ( !is_void_return_type(a2) )
        error(3505, ...);                    // must return void
}
// If constexpr (bit 4 set): skip both checks entirely

This is a separate check from the static-member test (byte +176 bit 7 with byte +81 bit 2), which appears earlier at line 28:

if ( *(char *)(a2 + 176) < 0         // static member (bit 7 set)
    && (*(_BYTE *)(a2 + 81) & 4) == 0 )  // not local
    warning(3507, "__global__");          // static __global__ warning

Operator Function Kind (Byte +166)

Byte at entity+166:

  Value 5:  operator function (operator(), operator+, etc.)

Tested during __global__ attribute application. If the entity is an operator function (value == 5), error 3644 is emitted: operator() cannot be declared __global__.

// From apply_nv_global_attr (sub_40E1F0), line 30-31
if ( *(_BYTE *)(a2 + 166) == 5 )
    sub_4F8200(7, 3644, a1 + 56);     // error: __global__ on operator function

This prevents declaring lambda call operators as kernels via the __global__ attribute directly (extended lambdas use a different mechanism with wrapper types).

Parameter List (Pointer +152)

For routine entities, offset +152 holds a pointer to the function prototype structure. The prototype's first field (+0) points to the parameter list head -- a linked list of parameter entities.

The __global__ attribute handler iterates this list to check two constraints:

  1. Variadic check: prototype +16 bit 0 indicates variadic parameters. If set, error 3503 is emitted (variadic __global__ functions are not allowed).

  2. __grid_constant__ check: the post-validation function nv_validate_cuda_attributes (sub_6BC890) walks the parameter list looking for parameters with byte +32 bit 1 set (the __grid_constant__ flag on a parameter entity). If found on a non-__global__ function, error 3702 is emitted.

// From nv_validate_cuda_attributes (sub_6BC890), lines 26-39
// Walk parameter list from prototype
v10 = **(__int64 ****)(v2 + 152);    // parameter list head
while (v10) {
    if (((_BYTE)v10[4] & 2) != 0)    // parameter byte+32 bit 1 = __grid_constant__
        error(3702, ...);            // grid_constant on non-kernel parameter
    v10 = (__int64 **)*v10;          // next parameter
}

CUDA Extended Flags (Byte +183)

Byte at entity+183:

  bit 3  (0x08)   __nv_register_params__   Function uses register parameter passing
  bit 6  (0x40)   __cluster_dims__ intent  cluster_dims attribute with no arguments

nv_register_params (Bit 0x08)

Set by apply_nv_register_params_attr (sub_40B0A0). When present, the post-validation function nv_validate_cuda_attributes checks whether the function is __global__ or __host__, and emits error 3661 if so. Device-only functions (__device__ without __host__) are exempt:

// From nv_validate_cuda_attributes (sub_6BC890), lines 42-69
if ( (*(_BYTE *)(a1 + 183) & 8) == 0 )     // no __nv_register_params__
    goto check_launch_config;

if ( (v3 & 0x40) != 0 ) {                  // __global__ kernel
    v4 = "__global__";
    error(3661, &qword_126EDE8, v4);       // incompatible
} else if ( (v3 & 0x30) != 0x20 ) {        // NOT device-only (has host component)
    v4 = "__host__";
    error(3661, &qword_126EDE8, v4);       // incompatible
}
// else: device-only function -- __nv_register_params__ is allowed

The key check is (v3 & 0x30) != 0x20: when the execution space annotation bits indicate device-only (bits 4,5 = 0x20), the error is skipped. This means __nv_register_params__ is valid only on __device__ functions -- it is rejected on __global__, __host__, and __host__ __device__ functions.

cluster_dims Intent (Bit 0x40)

Set by apply_nv_cluster_dims_attr (sub_4115F0) when the attribute is applied with zero arguments. This marks the function as "wants cluster dimensions" without specifying concrete values -- the values may come from a separate __block_size__ attribute or from a template parameter.

Template / Linkage Flags (Pointer +184)

Offset +184 is a 48-bit (6-byte) field encoding template instantiation and linkage information. The __global__ attribute handler tests a specific bit pattern to detect constexpr lambdas with template linkage:

// From apply_nv_global_attr (sub_40E1F0), line 21
if ( (*(_QWORD *)(a2 + 184) & 0x800001000000LL) == 0x800000000000LL )
{
    // This is a template lambda with external linkage but no definition yet.
    // Applying __global__ to it is an error.
    v14 = sub_6BC6B0(a2, 0);    // get entity name
    sub_4F7510(3469, a1 + 56, "__global__", v14);
    return;
}

The mask 0x800001000000 tests two bits:

  • Bit 47 (0x800000000000): template instantiation pending
  • Bit 24 (0x000001000000): has definition body

When bit 47 is set but bit 24 is clear, the entity is a template lambda awaiting instantiation that has no body yet -- applying __global__ (or __device__) to such an entity produces error 3469.

Launch Configuration Struct (Pointer +256)

Offset +256 holds a pointer to a lazily-allocated 56-byte launch configuration structure. This pointer is NULL for functions without any launch configuration attributes. The allocation function sub_5E52F0 creates and zero-initializes the struct on first use.

Launch Config Layout

struct launch_config_t {           // 56 bytes, allocated by sub_5E52F0
    int64_t  maxThreadsPerBlock;   // +0   from __launch_bounds__(arg1)
    int64_t  minBlocksPerMP;       // +8   from __launch_bounds__(arg2)
    int32_t  maxBlocksPerCluster;  // +16  from __launch_bounds__(arg3)
    int32_t  cluster_dim_x;        // +20  from __cluster_dims__(x) or __block_size__(x,y,z,cx)
    int32_t  cluster_dim_y;        // +24  from __cluster_dims__(y) or __block_size__(x,y,z,cx,cy)
    int32_t  cluster_dim_z;        // +28  from __cluster_dims__(z) or __block_size__(x,y,z,cx,cy,cz)
    int32_t  maxnreg;              // +32  from __maxnreg__(N)
    int32_t  local_maxnreg;        // +36  from __local_maxnreg__(N)
    int32_t  block_size_x;         // +40  from __block_size__(x)
    int32_t  block_size_y;         // +44  from __block_size__(y)
    int32_t  block_size_z;         // +48  from __block_size__(z)
    uint8_t  flags;                // +52  bit 0=cluster_dims_set, bit 1=block_size_set
};                                 //      3 bytes padding to 56

Attribute-to-Field Mapping

AttributeArgumentsFields WrittenHandler
__launch_bounds__(M)1 int+0 = Msub_411C80
__launch_bounds__(M,N)2 ints+0 = M, +8 = Nsub_411C80
__launch_bounds__(M,N,C)3 ints+0 = M, +8 = N, +16 = Csub_411C80
__cluster_dims__(x)1 int+20 = x, +24 = 1, +28 = 1, +52 bit 0sub_4115F0
__cluster_dims__(x,y)2 ints+20 = x, +24 = y, +28 = 1, +52 bit 0sub_4115F0
__cluster_dims__(x,y,z)3 ints+20 = x, +24 = y, +28 = z, +52 bit 0sub_4115F0
__cluster_dims__()0 argsentity+183 bit 6 (intent flag only)sub_4115F0
__maxnreg__(N)1 int+32 = Nsub_410F70
__local_maxnreg__(N)1 int+36 = Nsub_411090
__block_size__(x,y,z)3 ints+40 = x, +44 = y, +48 = z, +52 bit 1sub_4109E0
__block_size__(x,y,z,cx,cy,cz)6 intsblock + cluster dims, +52 bits 0+1sub_4109E0

Post-Validation Constraints

The function nv_validate_cuda_attributes (sub_6BC890) performs cross-attribute validation after all attributes have been applied. The key checks on the launch config struct:

1. __launch_bounds__ only on __global__:

// sub_6BC890, lines 45-51
v5 = *(_QWORD *)(a1 + 256);         // launch config pointer
if ( !v5 )  goto done;
if ( (v3 & 0x40) != 0 )             // if __global__, skip to next check
    goto check_cluster;
// Not __global__ but has launch_bounds values
if ( *(_QWORD *)v5 || *(_QWORD *)(v5 + 8) )
    error(3534, "__launch_bounds__");   // launch_bounds on non-kernel

2. __cluster_dims__/__block_size__ only on __global__:

// sub_6BC890, lines 81-87
if ( (*(_BYTE *)(a1 + 183) & 0x40) != 0    // cluster_dims intent
    || *(int *)(v5 + 20) >= 0 )             // cluster_dim_x set
{
    v11 = "__cluster_dims__";
    if ( *(int *)(v5 + 40) > 0 )
        v11 = "__block_size__";
    error(3534, v11);                       // not allowed on non-kernel
}

3. maxBlocksPerCluster vs cluster product:

// sub_6BC890, lines 101-114
v6 = *(int *)(v5 + 20);                    // cluster_dim_x
if ( (int)v6 > 0 ) {
    v7 = *(int *)(v5 + 16);                // maxBlocksPerCluster
    if ( (int)v7 > 0
        && v7 < *(int*)(v5 + 28) * *(int*)(v5 + 24) * v6 )
    {
        // maxBlocksPerCluster < cluster_dim_x * cluster_dim_y * cluster_dim_z
        error(3707, "__cluster_dims__");    // inconsistent values
    }
}

4. __maxnreg__ only on __global__:

// sub_6BC890, lines 116-121
if ( *(int *)(v5 + 32) < 0 )               // maxnreg not set (sentinel -1)
    goto check_launch_maxnreg_conflict;
if ( (v9 & 0x40) == 0 )                    // not __global__
    error(3715, "__maxnreg__");             // maxnreg on non-kernel

5. __launch_bounds__ + __maxnreg__ conflict:

// sub_6BC890, lines 144-145
if ( *(_QWORD *)v5 )                       // maxThreadsPerBlock set
    error(3719, "__launch_bounds__ and __maxnreg__");

Entity Kind Reference

The entity kind byte at +80 determines which offsets are valid. CUDA attribute handlers gate on this value:

KindValueCUDA offsets usedHandler examples
Variable7+148, +149, +161, +164__device__, __shared__, __constant__, __managed__
Field8+136packed, aligned (non-CUDA)
Routine11+144, +152, +166, +176, +179, +182, +183, +184, +191, +256All execution space attrs, launch config

Cross-References

  • Execution Spaces -- deep dive on byte +182 semantics and the six virtual override mismatch errors
  • Attributes Overview -- attribute kind enum (86-108) and apply_one_attribute dispatch
  • IL Overview -- IL entry kinds 7 (variable), 8 (field), 11 (routine) node sizes
  • Scope Entry -- 784-byte scope structure that contains entity chains