Entity Node Layout
The entity node is the central data structure in cudafe++ (EDG 6.6) for representing every named declaration: functions, variables, fields, parameters, namespaces, and types. Each node is a variable-size record -- routines occupy 288 bytes, variables 232 bytes, fields 176 bytes -- linked into scope chains and cross-referenced by type nodes, expression nodes, and template instantiation records.
This page focuses on the CUDA-specific fields that NVIDIA grafted onto the EDG entity node. These fields encode execution space (__host__/__device__/__global__), variable memory space (__shared__/__constant__/__managed__), launch configuration (__launch_bounds__/__cluster_dims__/__block_size__/__maxnreg__), and assorted kernel metadata. The attribute application functions in attribute.c write these fields; the backend code generator, cross-space validator, IL walker, and stub emitter read them.
Key Facts
| Property | Value |
|---|---|
| Routine entity size | 288 bytes (IL entry kind 11) |
| Variable entity size | 232 bytes (IL entry kind 7) |
| Field entity size | 176 bytes (IL entry kind 8) |
| Execution space offset | +182 (1 byte, bitfield) |
| Memory space offset | +148 (1 byte, bitfield) |
| Launch config pointer | +256 (8-byte pointer to 56-byte struct) |
| Source file | attribute.c (writers), nv_transforms.c / cp_gen_be.c (readers) |
| Attribute dispatch | sub_413240 (apply_one_attribute, 585 lines) |
| Post-validation | sub_6BC890 (nv_validate_cuda_attributes) |
Visual Layout (Routine Entity, 288 Bytes)
Offset 0 8 16 24 32 40 48 56
+=========+=========+=========+=========+=========+=========+=========+=========+
0x00 | next_entity_ptr | name_string_ptr | (EDG internal) |
+---------+---------+---------+---------+---------+---------+---------+---------+
0x20 | (EDG internal continued) |
+---------+---------+---------+---------+---------+---------+---------+---------+
0x40 | (EDG internal continued) |
+====+====+=========+=========+=========+=========+=========+=========+=========+
0x50 |kind|stor| | assoc_entity_ptr | |
|+80 |+81 | | | |
+----+----+---------+---------+---------+---------+---------+---------+---------+
0x60 | | variable_type_ptr | |
+=========+=========+=========+=========+====+=========+=========+==========+===+
0x80 | storage_class/align| |type_kind| | return_type_ptr |MEM |EXT | |
| | | |+132 | | +144 |+148|+149| |
+---------+---------+---------+----+----+----+---------+---------+----+----+----+
0x98 | proto_ptr / param_list +152 |link|stor| |grid| |op | | |
| |+160|+161| |+164| |+166| | |
+---------+---------+---------+----+----+----+----+----+----+---------+---------+
0xB0 |mbr |dev | |kern|func| |EXEC|CEXT| template_linkage_flags +184 |
|+176|+177| |+179|+180| |+182|+183| |
+----+----+----+----+----+----+----+----+=========+=========+=========+=========+
0xC0 | alias_chain/linkage+186 | |ctor/dtor|lambda | |
| | | +190 | +191 | |
+---------+---------+---------+---------+---------+---------+---------+---------+
0xD0 | variable_alias_chain_next +208 | |
+---------+---------+---------+---------+---------+---------+---------+---------+
0xF0 | func_extra / alias_entry +240 | |
+---------+---------+---------+---------+---------+---------+---------+---------+
0x100 | LAUNCH_CONFIG_PTR +256 | (padding to 288) |
+=========+=========+=========+=========+=========+=========+=========+=========+
CUDA-specific fields (UPPERCASE):
MEM = +148 variable memory space bitfield (__device__/__shared__/__constant__)
EXT = +149 extended memory space (__managed__)
EXEC = +182 execution space bitfield (__host__/__device__/__global__)
CEXT = +183 CUDA extended flags (__nv_register_params__, __cluster_dims__ intent)
LAUNCH_CONFIG_PTR = +256 pointer to 56-byte launch_config_t struct
Full Offset Map (CUDA-Relevant Fields)
The table below documents every entity node offset touched by CUDA attribute handlers and validation functions. Offsets are byte positions from the start of the entity node. Fields marked "EDG base" are standard EDG fields that CUDA code tests but does not define.
| Offset | Size | Field | Set By | Read By |
|---|---|---|---|---|
| +0 | 8 | Next entity pointer (linked list) | EDG | Scope iteration |
| +8 | 8 | Name string pointer | EDG | Error messages, stub emission |
| +80 | 1 | Entity kind byte (7=variable, 8=field, 11=routine) | EDG | All attribute handlers |
| +81 | 1 | Storage flags (bit 2=local, bit 3=has_name, bit 6=anonymous) | EDG | __global__ / __device__ validation |
| +88 | 8 | Associated entity pointer | EDG | nv_is_device_only_routine |
| +112 | 8 | Variable type pointer | EDG | get_func_type_for_attr |
| +128 | 1 | Storage class code / alignment | EDG | apply_internal_linkage_attr |
| +132 | 1 | Type kind byte (12=qualifier) | EDG | Return type traversal |
| +144 | 8 | Return type / next-in-chain pointer | EDG | __global__ void-return check |
| +148 | 1 | Variable memory space bitfield | CUDA attr handlers | Backend, IL walker |
| +149 | 1 | Extended memory space | apply_nv_managed_attr | Backend, runtime init |
| +152 | 8 | Function prototype / parameter list head | EDG | __global__ param checks |
| +160 | 1 | Linkage/visibility bits (variable: low 3 = visibility) | Various | Visibility propagation |
| +161 | 1 | Storage/linkage flags (bit 7=thread_local) | EDG | __managed__ / __device__ validation |
| +164 | 1 | Storage class / grid_constant flags (bit 2=grid_constant) | __grid_constant__ handler | __managed__/__device__ conflict check |
| +166 | 1 | Operator function kind (5=operator function) | EDG | __global__ validation |
| +176 | 1 | Member function flags (bit 7=static member) | EDG | __global__ static-member check |
| +177 | 1 | Device propagation flag (bit 4=0x10) | Virtual override propagation | Override space checking |
| +179 | 1 | Constexpr/kernel flags | Declaration processing | Stub generation, attribute interaction |
| +180 | 1 | Function attributes (bit 6=nodiscard, bit 7=noinline) | Various attribute handlers | Backend |
| +182 | 1 | Execution space bitfield | CUDA execution space handlers | Everywhere |
| +183 | 1 | CUDA extended flags | __cluster_dims__ / __nv_register_params__ | Post-validation, stub emission |
| +184 | 8 | Template/linkage flags (48-bit field) | EDG + CUDA handlers | Lambda check, visibility |
| +186 | 1 | Alias chain flag (bit 3=internal linkage) | apply_internal_linkage_attr | Linker |
| +190 | 1 | Constructor/destructor priority flags | apply_constructor_attr / apply_destructor_attr | Backend |
| +191 | 1 | Lambda flags (bit 0=is_lambda) | EDG lambda processing | __global__ validation |
| +208 | 8 | Variable alias chain next pointer | apply_alias_attr | Alias loop detection |
| +240 | 8 | Function extra info / alias entry | apply_alias_attr | Alias chain traversal |
| +256 | 8 | Launch configuration pointer | CUDA launch config handlers | Post-validation, backend |
Execution Space Bitfield (Byte +182)
This is the most frequently read field in CUDA-specific code paths. Every function entity carries a single byte that encodes which execution spaces the function belongs to.
Byte at entity+182:
bit 0 (0x01) device_capable Function can execute on device
bit 1 (0x02) device_explicit __device__ was explicitly written
bit 2 (0x04) host_capable Function can execute on host
bit 3 (0x08) (reserved)
bit 4 (0x10) host_explicit __host__ was explicitly written
bit 5 (0x20) device_annotation Secondary device flag (HD detection)
bit 6 (0x40) global_kernel Function is a __global__ kernel
bit 7 (0x80) global_confirmed Always set by __global__ handler tail guard
Combined Patterns
The attribute handlers do not set individual bits. They OR entire patterns into the byte. Each CUDA keyword produces a fixed bitmask:
| Keyword | OR mask(s) | Result byte | Handler | Evidence |
|---|---|---|---|---|
| __global__ | 0x61 then 0x80 | 0xE1 | sub_40E1F0 (apply_nv_global_attr) | `entity+182 \|= 0x61`, then tail guard `\|= 0x80` |
| __device__ | 0x23 | 0x23 | sub_40EB80 (apply_nv_device_attr) | `entity+182 \|= 0x23` |
| __host__ | 0x15 | 0x15 | sub_4108E0 (apply_nv_host_attr) | `entity+182 \|= 0x15` |
| __host__ __device__ | 0x23 then 0x15 | 0x37 | Both handlers in sequence | OR of device + host masks |
| (no annotation) | none | 0x00 | -- | Implicit __host__ |
The 0x80 bit is set unconditionally at the end of apply_nv_global_attr. After the main body ORs 0x61 into byte+182 (setting bit 6 = global_kernel), a tail guard checks bit 6 and always ORs 0x80:
// sub_40E1F0, lines 84-88
v10 = *(_BYTE *)(a2 + 182);
if ( (v10 & 0x40) == 0 ) // if bit 6 (global_kernel) not set, bail
return a2; // (only reachable via early error paths)
*(_BYTE *)(a2 + 182) = v10 | 0x80; // always set for __global__
Since 0x61 was already OR'd in, bit 6 is always set on the normal path, so 0x80 is always applied. The actual result byte for any successful __global__ application is 0x61 | 0x80 = 0xE1. The guard condition only triggers on error paths where 0x61 was never applied (e.g., the template-lambda error at line 21 which returns before reaching line 56).
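The OR-mask composition and the tail guard can be modeled in a few lines of C. This is an illustrative sketch, not the decompiled handler code; the mask values come from the table above, and the names are invented for readability.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the execution-space byte at entity+182. */
enum {
    EXEC_DEVICE_MASK = 0x23,  /* __device__ handler OR mask          */
    EXEC_HOST_MASK   = 0x15,  /* __host__ handler OR mask            */
    EXEC_GLOBAL_MASK = 0x61,  /* __global__ main-body OR mask        */
    EXEC_GLOBAL_TAIL = 0x80   /* tail-guard bit, set when bit 6 set  */
};

/* Mimics apply_nv_global_attr: OR 0x61, then the tail guard ORs 0x80
   only if bit 6 (global_kernel) is set -- always true on this path. */
static uint8_t apply_global(uint8_t b) {
    b |= EXEC_GLOBAL_MASK;
    if (b & 0x40)
        b |= EXEC_GLOBAL_TAIL;
    return b;
}
```

Running `apply_global(0)` yields 0xE1, and ORing the device and host masks together yields 0x37, matching the combined-pattern table.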
Extraction Patterns
Code throughout cudafe++ extracts execution space category using bitmask tests:
| Mask | Test | Meaning | Used in |
|---|---|---|---|
| & 0x30 | == 0x00 | No explicit annotation (implicit host) | Space classification |
| & 0x30 | == 0x10 | __host__ only | Space classification |
| & 0x30 | == 0x20 | __device__ only | nv_is_device_only_routine |
| & 0x30 | == 0x30 | __host__ __device__ | Space classification |
| & 0x60 | == 0x20 | Device, not kernel | Device-only predicate |
| & 0x60 | == 0x60 | __global__ kernel (implies device) | Kernel identification |
| & 0x40 | != 0 | Is a __global__ kernel | Stub generation gate |
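The mask tests above amount to a small classifier over byte +182. The sketch below is illustrative only -- the enum names are invented; the mask arithmetic is taken directly from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative classifier over the execution-space byte (entity+182). */
typedef enum {
    SPACE_IMPLICIT_HOST,  /* no annotation written        */
    SPACE_HOST,           /* __host__ only                */
    SPACE_DEVICE,         /* __device__ (or __global__)   */
    SPACE_HOST_DEVICE     /* __host__ __device__          */
} exec_space_t;

static exec_space_t classify_exec_space(uint8_t b) {
    switch (b & 0x30) {               /* explicit-annotation bits 4-5 */
        case 0x10: return SPACE_HOST;
        case 0x20: return SPACE_DEVICE;
        case 0x30: return SPACE_HOST_DEVICE;
        default:   return SPACE_IMPLICIT_HOST;
    }
}

/* Stub-generation gate: bit 6 marks a __global__ kernel. */
static int is_global_kernel(uint8_t b) { return (b & 0x40) != 0; }
```

Note that a kernel byte of 0xE1 classifies as device-only under the 0x30 test, which is why kernel identification uses the separate 0x40/0x60 tests.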
Variable Memory Space Bitfield (Byte +148)
For variable entities (kind 7), byte +148 encodes the CUDA memory space:
Byte at entity+148:
bit 0 (0x01) __device__ Variable resides in device global memory
bit 1 (0x02) __shared__ Variable resides in shared memory
bit 2 (0x04) __constant__ Variable resides in constant memory
These bits are mutually exclusive in valid programs. The attribute handlers enforce this by checking for conflicting combinations:
// From apply_nv_device_attr (sub_40EB80), variable path:
a2->byte_148 |= 0x01; // set __device__
int shared_or_constant = a2->byte_148 & 0x06; // check __shared__ | __constant__
if (popcount(shared_or_constant) + (a2->byte_148 & 0x01) == 2)
error(3481, ...); // conflicting memory spaces
The __device__ attribute on a function (kind 11) does NOT touch byte +148. It writes to byte +182 (execution space) instead. The memory space byte is strictly for variables.
Extended Memory Space (Byte +149)
Byte at entity+149:
bit 0 (0x01) __managed__ Unified memory, accessible from both host and device
Set by apply_nv_managed_attr (sub_40E0D0). The handler also sets bit 0 of +148 (__device__) because managed memory resides in device global memory. Additional validation:
- Error 3481: conflicting if `__shared__` or `__constant__` is already set
- Error 3482: cannot be thread-local (byte `+161` bit 7)
- Error 3485: cannot be a local variable (byte `+81` bit 2)
- Error 3577: incompatible with a `__grid_constant__` parameter (byte `+164` bit 2)
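Condensed, the __managed__ validation sequence looks like the sketch below. This is a hypothetical model, not the decompiled handler: the struct and field names are invented, but the bit positions and error numbers follow the list above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a variable entity's CUDA-relevant bytes. */
typedef struct {
    uint8_t mem_space;   /* entity+148: bit0 __device__, bit1 __shared__, bit2 __constant__ */
    uint8_t ext_space;   /* entity+149: bit0 __managed__                 */
    uint8_t stor_flags;  /* entity+81:  bit2 = local variable            */
    uint8_t link_flags;  /* entity+161: bit7 = thread_local              */
} var_entity_model;

/* Returns the CUDA error number that would fire, or 0 on success. */
static int check_managed(var_entity_model *e) {
    if (e->mem_space & 0x06) return 3481;  /* __shared__/__constant__ conflict */
    if (e->link_flags & 0x80) return 3482; /* thread-local                     */
    if (e->stor_flags & 0x04) return 3485; /* local variable                   */
    e->ext_space |= 0x01;                  /* mark __managed__                 */
    e->mem_space |= 0x01;                  /* managed implies __device__ mem   */
    return 0;
}
```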
Constexpr and Kernel Flags (Byte +176, +179)
Byte +176: Member Function Flags
Byte at entity+176:
bit 7 (0x80) static_member Function is a static class member
Tested by apply_nv_global_attr to detect static __global__ functions. The check is (signed char)(a2->byte_176) < 0, which is true when bit 7 is set. Combined with the local-function test (byte +81 bit 2 clear), this triggers warning 3507.
Byte +179: Constexpr / Kernel Property Flags
Byte at entity+179:
bit 1 (0x02) kernel_body Function has a kernel body (used for stub generation)
bit 2 (0x04) (instantiation) Instantiation-required status
bit 4 (0x10) constexpr Function is constexpr
bit 5 (0x20) noinline Function is noinline
The kernel_body flag at bit 1 (0x02) is the primary gate for device stub generation. The backend code generator (gen_routine_decl in cp_gen_be.c) checks:
// From gen_routine_decl (sweep p1.04, line ~1430)
if ((*(_BYTE *)(v3 + 182) & 0x40) != 0 // is __global__ kernel
&& (*(_BYTE *)(v3 + 179) & 2) != 0) // has kernel body
{
// Emit __wrapper__device_stub_<name>(<params>) forwarding body
}
The constexpr flag at bit 4 (0x10) is tested during __global__ attribute validation. When set, the void-return-type check AND the lambda check are both skipped:
// From apply_nv_global_attr (sub_40E1F0), lines 39-50
if ( (*(_BYTE *)(a2 + 179) & 0x10) == 0 ) // NOT constexpr
{
// Non-constexpr __global__: check return type and lambda
if ( (*(_BYTE *)(a2 + 191) & 1) != 0 )
error(3506, ...); // lambda __global__ not allowed
else if ( !is_void_return_type(a2) )
error(3505, ...); // must return void
}
// If constexpr (bit 4 set): skip both checks entirely
This is a separate check from the static-member test (byte +176 bit 7 with byte +81 bit 2), which appears earlier at line 28:
if ( *(char *)(a2 + 176) < 0 // static member (bit 7 set)
&& (*(_BYTE *)(a2 + 81) & 4) == 0 ) // not local
warning(3507, "__global__"); // static __global__ warning
Operator Function Kind (Byte +166)
Byte at entity+166:
Value 5: operator function (operator(), operator+, etc.)
Tested during __global__ attribute application. If the entity is an operator function (value == 5), error 3644 is emitted: an operator function cannot be declared __global__.
// From apply_nv_global_attr (sub_40E1F0), line 30-31
if ( *(_BYTE *)(a2 + 166) == 5 )
sub_4F8200(7, 3644, a1 + 56); // error: __global__ on operator function
This prevents declaring lambda call operators as kernels via the __global__ attribute directly (extended lambdas use a different mechanism with wrapper types).
Parameter List (Pointer +152)
For routine entities, offset +152 holds a pointer to the function prototype structure. The prototype's first field (+0) points to the parameter list head -- a linked list of parameter entities.
The __global__ attribute handler iterates this list to check two constraints:
- Variadic check: prototype `+16` bit 0 indicates variadic parameters. If set, error 3503 is emitted (variadic `__global__` functions are not allowed).
- `__grid_constant__` check: the post-validation function `nv_validate_cuda_attributes` (sub_6BC890) walks the parameter list looking for parameters with byte `+32` bit 1 set (the `__grid_constant__` flag on a parameter entity). If found on a non-`__global__` function, error 3702 is emitted.
// From nv_validate_cuda_attributes (sub_6BC890), lines 26-39
// Walk parameter list from prototype
v10 = **(__int64 ****)(v2 + 152); // parameter list head
while (v10) {
if (((_BYTE)v10[4] & 2) != 0) // parameter byte+32 bit 1 = __grid_constant__
error(3702, ...); // grid_constant on non-kernel parameter
v10 = (__int64 **)*v10; // next parameter
}
CUDA Extended Flags (Byte +183)
Byte at entity+183:
bit 3 (0x08) __nv_register_params__ Function uses register parameter passing
bit 6 (0x40) __cluster_dims__ intent cluster_dims attribute with no arguments
nv_register_params (Bit 0x08)
Set by apply_nv_register_params_attr (sub_40B0A0). When present, the post-validation function nv_validate_cuda_attributes checks whether the function is __global__ or __host__, and emits error 3661 if so. Device-only functions (__device__ without __host__) are exempt:
// From nv_validate_cuda_attributes (sub_6BC890), lines 42-69
if ( (*(_BYTE *)(a1 + 183) & 8) == 0 ) // no __nv_register_params__
goto check_launch_config;
if ( (v3 & 0x40) != 0 ) { // __global__ kernel
v4 = "__global__";
error(3661, &qword_126EDE8, v4); // incompatible
} else if ( (v3 & 0x30) != 0x20 ) { // NOT device-only (has host component)
v4 = "__host__";
error(3661, &qword_126EDE8, v4); // incompatible
}
// else: device-only function -- __nv_register_params__ is allowed
The key check is (v3 & 0x30) != 0x20: when the execution space annotation bits indicate device-only (bits 4,5 = 0x20), the error is skipped. This means __nv_register_params__ is valid only on __device__ functions -- it is rejected on __global__, __host__, and __host__ __device__ functions.
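The decision can be restated as a small predicate. This sketch is illustrative (the function name and return convention are invented); the masks mirror the decompiled check.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the __nv_register_params__ compatibility test described
   above. Returns the offending keyword for error 3661, or NULL if the
   attribute is allowed (device-only functions). */
static const char *register_params_conflict(uint8_t exec_byte) {
    if (exec_byte & 0x40)
        return "__global__";          /* kernels are rejected         */
    if ((exec_byte & 0x30) != 0x20)
        return "__host__";            /* any host component rejected  */
    return 0;                         /* device-only: allowed         */
}
```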
cluster_dims Intent (Bit 0x40)
Set by apply_nv_cluster_dims_attr (sub_4115F0) when the attribute is applied with zero arguments. This marks the function as "wants cluster dimensions" without specifying concrete values -- the values may come from a separate __block_size__ attribute or from a template parameter.
Template / Linkage Flags (Pointer +184)
Offset +184 is a 48-bit (6-byte) field encoding template instantiation and linkage information. The __global__ attribute handler tests a specific bit pattern to detect constexpr lambdas with template linkage:
// From apply_nv_global_attr (sub_40E1F0), line 21
if ( (*(_QWORD *)(a2 + 184) & 0x800001000000LL) == 0x800000000000LL )
{
// This is a template lambda with external linkage but no definition yet.
// Applying __global__ to it is an error.
v14 = sub_6BC6B0(a2, 0); // get entity name
sub_4F7510(3469, a1 + 56, "__global__", v14);
return;
}
The mask 0x800001000000 tests two bits:
- Bit 47 (`0x800000000000`): template instantiation pending
- Bit 24 (`0x000001000000`): has definition body
When bit 47 is set but bit 24 is clear, the entity is a template lambda awaiting instantiation that has no body yet -- applying __global__ (or __device__) to such an entity produces error 3469.
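The two-bit test is easy to get wrong when read as a single 48-bit constant, so here is a minimal model of it. The function name is invented; the constants are the ones from the decompiled check.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the test at line 21 of sub_40E1F0: true when bit 47
   (instantiation pending) is set AND bit 24 (has definition body)
   is clear -- i.e., a template lambda with no body yet. */
static int is_undefined_template_lambda(uint64_t flags) {
    return (flags & 0x800001000000ULL) == 0x800000000000ULL;
}
```

Masking with both bits and comparing against only bit 47 is the standard idiom for "bit A set, bit B clear" in one branch.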
Launch Configuration Struct (Pointer +256)
Offset +256 holds a pointer to a lazily-allocated 56-byte launch configuration structure. This pointer is NULL for functions without any launch configuration attributes. The allocation function sub_5E52F0 creates the struct on first use; note that the post-validation checks below treat negative values in several count fields (e.g., maxnreg) as "unset", which implies those fields start at a -1 sentinel rather than zero.
Launch Config Layout
struct launch_config_t { // 56 bytes, allocated by sub_5E52F0
int64_t maxThreadsPerBlock; // +0 from __launch_bounds__(arg1)
int64_t minBlocksPerMP; // +8 from __launch_bounds__(arg2)
int32_t maxBlocksPerCluster; // +16 from __launch_bounds__(arg3)
int32_t cluster_dim_x; // +20 from __cluster_dims__(x) or __block_size__(x,y,z,cx)
int32_t cluster_dim_y; // +24 from __cluster_dims__(y) or __block_size__(x,y,z,cx,cy)
int32_t cluster_dim_z; // +28 from __cluster_dims__(z) or __block_size__(x,y,z,cx,cy,cz)
int32_t maxnreg; // +32 from __maxnreg__(N)
int32_t local_maxnreg; // +36 from __local_maxnreg__(N)
int32_t block_size_x; // +40 from __block_size__(x)
int32_t block_size_y; // +44 from __block_size__(y)
int32_t block_size_z; // +48 from __block_size__(z)
uint8_t flags; // +52 bit 0=cluster_dims_set, bit 1=block_size_set
}; // 3 bytes padding to 56
Attribute-to-Field Mapping
| Attribute | Arguments | Fields Written | Handler |
|---|---|---|---|
| __launch_bounds__(M) | 1 int | +0 = M | sub_411C80 |
| __launch_bounds__(M,N) | 2 ints | +0 = M, +8 = N | sub_411C80 |
| __launch_bounds__(M,N,C) | 3 ints | +0 = M, +8 = N, +16 = C | sub_411C80 |
| __cluster_dims__(x) | 1 int | +20 = x, +24 = 1, +28 = 1, +52 bit 0 | sub_4115F0 |
| __cluster_dims__(x,y) | 2 ints | +20 = x, +24 = y, +28 = 1, +52 bit 0 | sub_4115F0 |
| __cluster_dims__(x,y,z) | 3 ints | +20 = x, +24 = y, +28 = z, +52 bit 0 | sub_4115F0 |
| __cluster_dims__() | 0 args | entity+183 bit 6 (intent flag only) | sub_4115F0 |
| __maxnreg__(N) | 1 int | +32 = N | sub_410F70 |
| __local_maxnreg__(N) | 1 int | +36 = N | sub_411090 |
| __block_size__(x,y,z) | 3 ints | +40 = x, +44 = y, +48 = z, +52 bit 1 | sub_4109E0 |
| __block_size__(x,y,z,cx,cy,cz) | 6 ints | block + cluster dims, +52 bits 0+1 | sub_4109E0 |
Post-Validation Constraints
The function nv_validate_cuda_attributes (sub_6BC890) performs cross-attribute validation after all attributes have been applied. The key checks on the launch config struct:
1. __launch_bounds__ only on __global__:
// sub_6BC890, lines 45-51
v5 = *(_QWORD *)(a1 + 256); // launch config pointer
if ( !v5 ) goto done;
if ( (v3 & 0x40) != 0 ) // if __global__, skip to next check
goto check_cluster;
// Not __global__ but has launch_bounds values
if ( *(_QWORD *)v5 || *(_QWORD *)(v5 + 8) )
error(3534, "__launch_bounds__"); // launch_bounds on non-kernel
2. __cluster_dims__/__block_size__ only on __global__:
// sub_6BC890, lines 81-87
if ( (*(_BYTE *)(a1 + 183) & 0x40) != 0 // cluster_dims intent
|| *(int *)(v5 + 20) >= 0 ) // cluster_dim_x set
{
v11 = "__cluster_dims__";
if ( *(int *)(v5 + 40) > 0 )
v11 = "__block_size__";
error(3534, v11); // not allowed on non-kernel
}
3. maxBlocksPerCluster vs cluster product:
// sub_6BC890, lines 101-114
v6 = *(int *)(v5 + 20); // cluster_dim_x
if ( (int)v6 > 0 ) {
v7 = *(int *)(v5 + 16); // maxBlocksPerCluster
if ( (int)v7 > 0
&& v7 < *(int*)(v5 + 28) * *(int*)(v5 + 24) * v6 )
{
// maxBlocksPerCluster < cluster_dim_x * cluster_dim_y * cluster_dim_z
error(3707, "__cluster_dims__"); // inconsistent values
}
}
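Check 3 reduces to a product comparison with two "unset" guards. The sketch below is illustrative (the function name and boolean convention are invented); the thresholds match the decompiled test.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of check 3: error 3707 fires when __launch_bounds__'s third
   argument (maxBlocksPerCluster) is smaller than the product of the
   cluster dimensions. Returns nonzero when the values are consistent. */
static int cluster_dims_consistent(int32_t max_blocks,
                                   int32_t cx, int32_t cy, int32_t cz) {
    if (cx <= 0)
        return 1;  /* cluster dims not set: nothing to check     */
    if (max_blocks <= 0)
        return 1;  /* maxBlocksPerCluster not set: nothing to do */
    return max_blocks >= cx * cy * cz;
}
```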
4. __maxnreg__ only on __global__:
// sub_6BC890, lines 116-121
if ( *(int *)(v5 + 32) < 0 ) // maxnreg not set (sentinel -1)
goto check_launch_maxnreg_conflict;
if ( (v9 & 0x40) == 0 ) // not __global__
error(3715, "__maxnreg__"); // maxnreg on non-kernel
5. __launch_bounds__ + __maxnreg__ conflict:
// sub_6BC890, lines 144-145
if ( *(_QWORD *)v5 ) // maxThreadsPerBlock set
error(3719, "__launch_bounds__ and __maxnreg__");
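Check 5 is a simple mutual-exclusion test between the two attribute families. This predicate is a sketch under the sentinel conventions described above (maxThreadsPerBlock nonzero when set, maxnreg negative when unset); the name is invented.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of check 5: __launch_bounds__ and __maxnreg__ conflict
   (error 3719) when both carry set values. */
static int launch_bounds_maxnreg_ok(int64_t max_threads_per_block,
                                    int32_t maxnreg) {
    int lb_set  = (max_threads_per_block != 0); /* +0 nonzero when set  */
    int reg_set = (maxnreg >= 0);               /* +32 is -1 when unset */
    return !(lb_set && reg_set);
}
```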
Entity Kind Reference
The entity kind byte at +80 determines which offsets are valid. CUDA attribute handlers gate on this value:
| Kind | Value | CUDA offsets used | Handler examples |
|---|---|---|---|
| Variable | 7 | +148, +149, +161, +164 | __device__, __shared__, __constant__, __managed__ |
| Field | 8 | +136 | packed, aligned (non-CUDA) |
| Routine | 11 | +144, +152, +166, +176, +179, +182, +183, +184, +191, +256 | All execution space attrs, launch config |
Cross-References
- Execution Spaces -- deep dive on byte `+182` semantics and the six virtual override mismatch errors
- Attributes Overview -- attribute kind enum (86-108) and `apply_one_attribute` dispatch
- IL Overview -- IL entry kinds 7 (variable), 8 (field), 11 (routine) node sizes
- Scope Entry -- 784-byte scope structure that contains entity chains