PTX Directive Handling

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

PTX directives -- .version, .target, .entry, .func, .global, .shared, .local, .const, .reg, .param, .weak, .common, .extern, .visible, .alias, .pragma -- are parsed and semantically validated by the Bison reduction actions embedded in the 48 KB parser function sub_4CE6B0. Unlike instructions, which pass through opcode table lookup (sub_46E000) and per-instruction semantic validators, directives are handled entirely within the Bison reduction switch. Each grammar production's action block reads values from the parser value stack, validates them against the current PTX version and target architecture, and writes the results into the 1,200-byte parser state object or its child compile-unit state (CU_state). No intermediate AST is constructed; directives take effect immediately during parsing.

The state object maintains 18 linked lists (9 head/tail pairs at offsets 368--512) that track symbols per state space, a string-keyed hash map (offset 208) for target feature flags, and a scope chain (offset 984) rooted at offset 968 for nested function declarations. Two version-gating functions -- sub_489050 (PTX ISA version) and sub_489390 (SM architecture) -- guard every directive that was introduced after the baseline ISA.

Bison parser         sub_4CE6B0 (48,263 bytes, 631 case labels)
Version validator    sub_44A100 (bsearch over 44 valid PTX version IDs at xmmword_1CFD940)
PTX version gate     sub_489050 -- sub_454E70 + sub_455A80(major, minor, state)
SM arch gate         sub_489390 -- checks state+168 >= required_sm
Target handler       sub_4B1080 (per-target, texmode logic)
Function handler     sub_497C00 (entry/func declarations, ABI)
Variable handler     sub_4A0CD0 (state-space declarations, type validation)
Parameter allocator  sub_44F6E0 (48-byte parameter nodes)
Scope manager        sub_44B9C0 (scope hash map at state+1008)
State-space lists    18 linked lists at state+368--state+512
Target feature map   Hash map at state+208 (string keys, presence values)

Architecture

PTX source text
     |
     v
+-------------------------------------------------------------------+
| BISON LALR(1) PARSER  sub_4CE6B0                                  |
| 631 reduction cases, each a direct action                         |
|                                                                   |
|   DIRECTIVE       CASES           HANDLER                         |
|   .version        35              sscanf + sub_44A100             |
|   .target         5, 38           sub_4B1080 (per-target)         |
|   .address_size   10              inline validation                |
|   .entry          82, 86-88       sub_497C00                      |
|   .func           97, 100-105     sub_497C00                      |
|   .global/shared  57-68           sub_4A0CD0                      |
|     /local/const                                                  |
|   .reg/.param     110-112         inline + sub_48BE80             |
|   .weak           55              sub_489050(3,1)                 |
|   .common         56              sub_489050(5,0)                 |
|   .extern         79              sets CU+81, linkage=3           |
|   .visible        80              sets CU+81, linkage=2           |
|   .alias          41              sub_4036D9 (param match)        |
|   .pragma         42              prefix-match dispatch chain     |
|                                                                   |
+-------------------+-----------------------------------------------+
                    |
          +---------+---------+
          v                   v
  PARSER STATE OBJECT     CU_STATE (compile-unit)
  ~1200 bytes             pointed to by state+1096
  state+144: version      CU+0:   linkage code
  state+152: target       CU+24:  state-space ID
  state+160: ptx_major    CU+48:  func metadata buf
  state+164: ptx_minor    CU+80:  return type
  state+168: sm_id        CU+81:  declaration linkage
  state+196: addr_size    CU+88:  current function
  state+208: feature map  CU+156: noinline pragma
  state+368: 18 ll heads  CU+172: reg-usage pragma
  state+968: scope root   CU+784: arch capability
  state+984: scope chain  CU+2448: target string
  state+1008: scope map   CU+2456: version string

.version X.Y -- Case 35

The .version directive establishes the PTX ISA version for the compilation unit. The parser extracts the major and minor version integers from the grammar, validates the combined version against a sorted table of 44 known versions, and stores both the numeric and string forms.

// Reconstructed from case 35 of sub_4CE6B0
int major = sub_449950();       // extract major from parser state
int minor = sub_449960();       // extract minor from parser state
sscanf(token, "%d.%d", &major, &minor);

// Allocate formatted version string
char* ver_str = pool_alloc(pool, 5);
sprintf(ver_str, "%d.%d", major, minor);

// Validate: bsearch over 44 valid version IDs
int combined = major * 10 + minor;
if (!sub_44A100(combined))
    fatal_error("Unsupported PTX version %s", ver_str);

pool_free(ver_str);

// Store in parser state
state->version_string = token;          // state+144
state->ptx_major = major;               // state+160
state->ptx_minor = minor;               // state+164
CU_state->version_string = token;       // CU+2456
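
The parse-and-format step above can be tried in isolation. The following is a minimal sketch, not the recovered code: parse_ptx_version is a hypothetical name, and the trailing-character check and the bounded snprintf are defensive choices made for this example rather than behavior observed in case 35.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: split a .version operand such as "8.5" into
 * major/minor and re-render it into a caller-supplied buffer. */
bool parse_ptx_version(const char *token, int *major, int *minor,
                       char *out, size_t out_len)
{
    int ma = 0, mi = 0;
    char trailing = 0;

    /* Exactly two integers separated by '.', with nothing after them. */
    if (sscanf(token, "%d.%d%c", &ma, &mi, &trailing) != 2)
        return false;
    if (ma < 0 || mi < 0)
        return false;

    /* snprintf returns the length it needed; reject truncated output. */
    int n = snprintf(out, out_len, "%d.%d", ma, mi);
    if (n < 0 || (size_t)n >= out_len)
        return false;

    *major = ma;
    *minor = mi;
    return true;
}
```

Sizing the output buffer from the snprintf return value avoids depending on a fixed allocation that only fits single-digit components.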

Version Validation -- sub_44A100

// sub_44A100: validate PTX version against known versions
bool sub_44A100(int version_id) {
    int key = version_id;
    return bsearch(&key,
                   xmmword_1CFD940,   // sorted table base
                   0x2C,              // 44 entries
                   4,                 // sizeof(int)
                   compar) != NULL;   // simple integer compare
}

The 44-entry table at xmmword_1CFD940 contains the combined version IDs (major*10 + minor) for every PTX ISA version recognized by ptxas v13.0. This covers PTX 1.0 through 8.7+.
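
The lookup can be illustrated with a self-contained sketch. The table below is an illustrative excerpt chosen for this example (it was not dumped from xmmword_1CFD940), and version_is_known and cmp_int are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative excerpt only: combined IDs (major*10 + minor), sorted
 * ascending as bsearch requires. The real table has 44 entries. */
static const int k_versions[] = {
    10, 11, 12, 13, 14, 15, 20, 21, 22, 23, 30, 31, 32,
};

/* Three-way integer compare without overflow. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* True when major.minor appears in the sorted table. */
bool version_is_known(int major, int minor)
{
    int key = major * 10 + minor;
    return bsearch(&key, k_versions,
                   sizeof k_versions / sizeof k_versions[0],
                   sizeof k_versions[0], cmp_int) != NULL;
}
```

With this excerpt, version_is_known(2, 3) succeeds and version_is_known(1, 9) fails, mirroring the accept/reject decision made in case 35.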

.target sm_XX -- Cases 5 and 38

The .target directive accepts a comma-separated list of targets: SM architecture identifiers (sm_XX, compute_XX) and feature modifiers (texmode_unified, texmode_independent, texmode_raw, map_f64_to_f32, debug).

Case 38 -- Target List Iteration

// Reconstructed from case 38
for (node = list_begin(*v5); !list_end(node); node = list_next(node)) {
    char* target_str = list_value(node);
    sub_4B1080(target_str, location, state);
}

Per-Target Handler -- sub_4B1080

The function branches on whether the target string contains "sm_" or "compute_".

SM/compute targets:

// SM target path in sub_4B1080
state->target_string = target_str;       // state+152
CU->target_string = target_str;          // CU+2448
state->arch_variant = sub_1CBEFD0(target_str);  // state+177

int sm_id;
sscanf(target_str + prefix_len, "%d", &sm_id);
state->target_id = sm_id;               // state+168
if (sm_id > state->max_target)
    state->max_target = sm_id;           // state+204

// Validate against one of three target tables:
//   compute_ targets:  unk_1D16160 (6 entries, 12 bytes each)
//   sm_ sub-variant:   unk_1D161C0 (7 entries, 12 bytes each)
//   standard sm_:      unk_1D16220 (32 entries, 12 bytes each)
// Each entry: { sm_id, required_ptx_major, required_ptx_minor }
entry = bsearch(&sm_id, table, count, 12, sub_484B70);
if (entry) {
    if (!sub_455A80(entry->ptx_major, entry->ptx_minor, state))
        state->version_mismatch_flag |= 1;   // state+178
}
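
The table lookup and the follow-up PTX version comparison can be sketched as below. The rows are assumed values picked for illustration, not the contents of unk_1D16220, and target_entry, cmp_sm, and check_target are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

/* One 12-byte record: { sm_id, required_ptx_major, required_ptx_minor }. */
typedef struct {
    int sm_id;
    int ptx_major;
    int ptx_minor;
} target_entry;

/* Illustrative rows only (assumed), sorted by sm_id for bsearch. */
static const target_entry k_targets[] = {
    {70, 6, 0}, {75, 6, 3}, {80, 7, 0}, {90, 7, 8},
};

/* bsearch comparator: key is an int, element is a target_entry. */
static int cmp_sm(const void *key, const void *elem)
{
    int k = *(const int *)key;
    int e = ((const target_entry *)elem)->sm_id;
    return (k > e) - (k < e);
}

/* Returns 1 if the PTX version is new enough for sm_id, 0 if it is too
 * old (the case that would set the mismatch flag), and -1 if sm_id is
 * not in the table. */
int check_target(int sm_id, int ptx_major, int ptx_minor)
{
    const target_entry *e = bsearch(&sm_id, k_targets,
                                    sizeof k_targets / sizeof k_targets[0],
                                    sizeof k_targets[0], cmp_sm);
    if (!e)
        return -1;
    if (ptx_major != e->ptx_major)
        return ptx_major > e->ptx_major;
    return ptx_minor >= e->ptx_minor;
}
```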

Feature modifiers:

Modifier             PTX Requirement          Action                                            CU State
map_f64_to_f32       Deprecated for sm > 12   Stored in feature map; CU+152 |= 1                Feature flag
texmode_unified      --                       Stored in feature map; default if none specified  Default
texmode_independent  PTX >= 1.5               Stored in feature map; CU+2464 = 1                Tex mode
texmode_raw          Requires state+220 flag  Stored in feature map; CU+2465 = 1                Tex mode
debug                PTX >= 3.0               CU+2466 = 1; state+1033 = 1; state+834 = 1        Debug on

Texmode values are mutually exclusive. Each setter checks the feature hash map at state+208 for conflicting entries before inserting:

// texmode_unified path in sub_4B1080
if (map_get(state->feature_map, "texmode_independent"))
    error("conflicting texmode: %s", target_str);
if (map_get(state->feature_map, "texmode_raw"))
    error("conflicting texmode: %s", target_str);
map_put(state->feature_map, "texmode_unified", 1);

Case 5 -- Automatic Texmode Inference

When the .target directive omits an explicit texmode, case 5 infers one based on CLI flags:

if (arch_supports_texmode(CU->arch_capability)) {
    if (!map_has(feature_map, "texmode_independent") &&
        !map_has(feature_map, "texmode_raw")) {
        if (state->cli_texmode_independent)
            sub_4B1080("texmode_independent", loc, state);
        else if (state->cli_texmode_raw)
            sub_4B1080("texmode_raw", loc, state);
        else
            sub_4B1080("texmode_unified", loc, state);
    }
}
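
The priority order of the inference can be restated as a pure function. This is a sketch under stated assumptions: infer_texmode is a hypothetical name, the four booleans stand in for the feature-map lookups and the CLI flags, and the architecture capability test is left to the caller.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns the texmode to apply, or NULL when the .target list already
 * named texmode_independent or texmode_raw. */
const char *infer_texmode(bool has_independent, bool has_raw,
                          bool cli_independent, bool cli_raw)
{
    if (has_independent || has_raw)
        return NULL;                    /* explicit choice wins */
    if (cli_independent)
        return "texmode_independent";   /* CLI flag, checked first */
    if (cli_raw)
        return "texmode_raw";
    return "texmode_unified";           /* default */
}
```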

.address_size 32|64 -- Case 10

// Reconstructed from case 10
sub_489050(state, 2, 3, ".address_size directive", location);  // PTX >= 2.3

int value = stack_value;
if (((value - 32) & ~0x20) != 0)       // allows exactly 32 and 64
    error("Invalid address size: %d", value);

state->address_size = value;             // state+196

The bit trick (v - 32) & ~0x20 passes for exactly two values:

  • v=32: (0) & 0xFFFFFFDF = 0
  • v=64: (32) & 0xFFFFFFDF = 0

Any other value produces a nonzero result and triggers an error.
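
The check is small enough to test exhaustively. The sketch below uses a hypothetical name, address_size_is_valid, and does the subtraction in unsigned arithmetic so that it stays well defined for any input; that choice belongs to this example and is not a claim about the compiled code.

```c
#include <stdbool.h>
#include <stdint.h>

/* True only for 32 and 64: after subtracting 32, the only bit allowed
 * to remain set is bit 5 (0x20). */
bool address_size_is_valid(int value)
{
    uint32_t delta = (uint32_t)value - 32u;
    return (delta & ~UINT32_C(0x20)) == 0;
}
```

Values such as 0, 16, 96, and 128 all leave at least one bit outside 0x20 set and are rejected.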

.entry / .func Declarations -- Cases 76+, 82, 88, 97, 103

Function and entry declarations span multiple Bison productions because the grammar decomposes them into prototype, parameter list, linkage qualifier, and body productions. The central handler sub_497C00 processes both entry functions and device functions.

sub_497C00 -- Function Declaration Handler

// Reconstructed signature
int64 sub_497C00(
    state,          // parser state
    int decl_type,  // 1=visible, 2=forward, 3=extern, 4=static, 5=definition
    name,           // function name token
    return_params,  // return parameter list (NULL for entries)
    params,         // input parameter list
    bool is_entry,  // 1 for .entry, 0 for .func
    bool is_func,   // CU+80 qualifier for .func
    scratch_regs,   // scratch register list
    int retaddr,    // return address allocno (-1 if none)
    bool noreturn,  // .noreturn attribute
    bool unique,    // .unique attribute
    bool force_inline, // .FORCE_INLINE attribute
    location        // source location token
);

Processing steps:

  1. Scope creation: sub_44B9C0(state) creates a new scope context. The scope hash map at state+1008 maps scope IDs (starting at 61) to 40-byte scope descriptors.

  2. Parameter node allocation: sub_44F6E0(state, scope, name, 0, 0, location) allocates a 48-byte parameter descriptor: {type_info, name, scope, alignment, init_data, location}.

  3. Symbol lookup: sub_4504D0(state+968, name, 1, state) searches the current scope chain for an existing declaration.

  4. Forward declaration resolution: If a matching forward declaration exists, the handler validates compatibility:

    • Declaration type consistency (except 2->1 and 4->1 promotions)
    • Parameter list type/alignment/state-space matching via sub_484DA0
    • Return parameter matching via sub_484DA0
    • Scratch register count and types
    • Return address register, first parameter register
    • .noreturn and .unique attribute consistency
    • Unified identifier matching
  5. New function creation: If no prior declaration:

    • Registers in state+968 (regular scope) or state+976 (extern scope)
    • Calls sub_44FDC0 to record ABI metadata
    • For Blackwell GB10B architecture (sub_70FA00(CU, 33)): allocates __nv_reservedSMEM_gb10b_war_var in shared memory as a hardware workaround

Case 82 -- Entry Function

// Case 82: .entry declaration
if (CU->output_param_context)
    error("Parameter to entry function");

result = sub_497C00(state, decl_type, name,
                    NULL,    // no return params for entries
                    params,
                    1,       // is_entry = true
                    0,       // is_func = false
                    NULL,    // no scratch regs
                    -1,      // no retaddr
                    0, 0, 0, // no .noreturn/.unique/.force_inline
                    location);

Case 88 -- Entry Function Body Completion

After the function body is parsed, case 88 performs the final validation pass:

  1. Performance directive validation:

    • .maxntid and .reqntid are mutually exclusive
    • .maxnctapersm/.minnctapersm require either .maxntid or .reqntid
    • .reqntid + .reqnctapercluster require .blocksareclusters
    • .reqnctapercluster and .maxclusterrank are mutually exclusive
  2. Kernel parameter size limits (computed via sub_42CBF0 + sub_484ED0):

    PTX Version      Max Kernel Param Size
    < 1.5            256 bytes
    >= 1.5, < 8.1    4,352 bytes
    >= 8.1           32,764 bytes

    Parameters exceeding 4,352 bytes also require SM >= 70 and PTX >= 8.1.

  3. Debug labels: Generates __$startLabel$__<name> and __$endLabel$__<name> for DWARF debug info.

  4. Debug hash: If debug mode enabled (state+856 != 0), computes CRC32(name) % 0xFFFF + base as a debug identifier stored at func->80+176.
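
The kernel parameter size tiers from step 2 can be written as a small function. This is a sketch with hypothetical names (ptx_at_least, max_kernel_param_bytes). It assumes that when PTX >= 8.1 is used with an SM older than 70, the 4,352-byte tier applies; the text states the requirement but not the fallback.

```c
#include <stdbool.h>

/* Lexicographic (major, minor) comparison. */
static bool ptx_at_least(int major, int minor, int req_major, int req_minor)
{
    return major > req_major || (major == req_major && minor >= req_minor);
}

/* Size ceiling in bytes for a kernel's parameter block. */
int max_kernel_param_bytes(int ptx_major, int ptx_minor, int sm_id)
{
    if (!ptx_at_least(ptx_major, ptx_minor, 1, 5))
        return 256;
    /* Assumption: both conditions must hold for the largest tier. */
    if (ptx_at_least(ptx_major, ptx_minor, 8, 1) && sm_id >= 70)
        return 32764;
    return 4352;
}
```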

Case 97 -- Device Function

// Case 97: .func declaration
result = sub_497C00(state, decl_type, name,
                    return_params, params,
                    0,                   // is_entry = false
                    CU->return_qualifier, // CU+80
                    scratch_regs, retaddr,
                    noreturn, unique, force_inline,
                    location);

State-Space Declarations -- .global, .shared, .local, .const

State-space directives set the "current state space" field (CU+24) and then delegate to sub_4A0CD0 for variable declaration processing or sub_4A2020 for declaration-without-initializer processing.

State-Space Code Assignment

Case                    Action           State Space
57                      *CU = 1          (extern/unresolved)
59                      *CU = 3          .shared
61                      *CU = 2          .global
63                      *CU = 4          .local
65                      *CU = 5          .const
67                      *CU = 0          .reg
58, 60, 62, 64, 66, 68  sub_4A2020(...)  Process declaration in current space

The odd-numbered cases set the state-space code; the immediately following even-numbered cases trigger the actual declaration processing.

Variable Validator -- sub_4A0CD0

This 4,937-byte function validates variable declarations across all state spaces. Key checks:

  1. Type validation: Resolves .texref via sub_450D00. For types 9 (.surfref) and 10 (.texref), enforces .tex deprecation after PTX 1.5 and .surfref scope restrictions.

  2. .b128 type: Requires PTX >= 8.3 (sub_455A80(8, 3)) and SM >= 70 (sub_489390(state, 70)).

  3. State-space restrictions:

    • .managed valid only with .global (space 5)
    • .reserved valid only with .shared (space 8)
    • .reserved shared alignment must be <= 64
    • .common valid only with .const
    • .param at file scope requires .const space
    • .local const disallowed at file scope
  4. Texmode interaction:

    • .surfref types require texmode_independent in the feature map
    • .tex/.texref types incompatible with texmode_raw
  5. Initializer handling: If an initializer is present, calls sub_4A02A0 to validate constant expressions (no function pointers, no entry functions as values, no opaque type initializers).

State-Space Linked Lists -- 18 Lists at state+368

The parser maintains 18 linked list heads (9 head/tail pairs) at state offsets 368--512 to track declared symbols per state space:

Offset    Pair   State Space
368/376   0      .global
384/392   1      .shared
400/408   2      .local
416/424   3      .const
432/440   4      .param
448/456   5      .tex
464/472   6      .surf
480/488   7      .sampler
496/504   8      reserved / other

Initialization (case 3 -- section begin): Iterates j from 0 to 144 in steps of 8, allocating an 88-byte sentinel node (type=6) for each list. Each node's +48 field links to per-section tracking data at state+656 + j.
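
The offset arithmetic behind the table and the initialization loop can be checked with a short sketch. The names are hypothetical, and the loop bound is assumed to be exclusive (j < 144), which is what yields 18 slots.

```c
#include <stddef.h>

enum {
    LIST_BASE   = 368,  /* offset of the first head pointer */
    LIST_STRIDE = 16,   /* one 8-byte head plus one 8-byte tail */
    LIST_PAIRS  = 9
};

/* Offset of the head pointer for a given pair index (0..8). */
size_t list_head_offset(unsigned pair)
{
    return LIST_BASE + (size_t)pair * LIST_STRIDE;
}

/* The tail pointer sits 8 bytes after its head. */
size_t list_tail_offset(unsigned pair)
{
    return list_head_offset(pair) + 8;
}

/* Number of 8-byte slots visited by "for (j = 0; j < 144; j += 8)". */
unsigned list_slot_count(void)
{
    unsigned n = 0;
    for (unsigned j = 0; j < LIST_PAIRS * LIST_STRIDE; j += 8)
        n++;
    return n;
}
```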

Scope teardown (case 76 -- new compilation unit): Destroys old symbol tables via sub_425D20, clears the target feature map, and merges scope-level lists into the parent scope by concatenating linked list chains for offsets 16, 48, 112, 128, 144, and 184 of the scope node.

.reg / .param -- Register and Parameter Declarations

Within function bodies, .reg and .param declarations create typed register/parameter entries. Three grammar productions (cases 110--112) handle the single, range, and vector variants.

Declaration Node Layout (56 bytes)

Offset  Type   Field
0       ptr    Type list pointer
8       ptr    Name pointer
16      int32  State-space code
20      byte   Is array
21      byte   Is vector
24      int32  Alignment
28      byte   Extra flags
40      int32  Count / range start
44      int32  Range end (0xFFFFFFFF = no upper bound)
48      ptr    Auxiliary data

Case 110 -- Single declaration: Reads type info from CU_state (offsets +16, +24, +28, +29, +32, +36), allocates the 56-byte node, sets count from the parsed integer, and calls sub_48BE80(state) to validate.

Case 111 -- Range declaration: Same as 110 but sets both start and end bounds. The sentinel value 0xFFFFFFFF at offset 44 distinguishes range from single declarations.

Case 112 -- Vector declaration: Handles vector type qualifiers (.v2, .v4).
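
The 56-byte node layout can be expressed as a C struct with compile-time offset checks. This is a sketch, not a recovered type: the field names are hypothetical, pointer slots are modeled as uint64_t so the layout does not depend on the host's pointer size, and bytes 29 through 39 are left as an opaque array because the table does not describe them.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t type_list;    /*  0: type list pointer */
    uint64_t name;         /*  8: name pointer */
    int32_t  space;        /* 16: state-space code */
    uint8_t  is_array;     /* 20 */
    uint8_t  is_vector;    /* 21 */
    uint8_t  pad0[2];      /* 22: alignment padding */
    int32_t  alignment;    /* 24 */
    uint8_t  extra_flags;  /* 28 */
    uint8_t  unknown[11];  /* 29..39: not described in the table */
    int32_t  count;        /* 40: count / range start */
    uint32_t range_end;    /* 44: 0xFFFFFFFF = no upper bound */
    uint64_t aux;          /* 48: auxiliary data pointer */
} decl_node;

/* Fail the build if the struct drifts from the documented layout. */
_Static_assert(sizeof(decl_node) == 56, "node must be 56 bytes");
_Static_assert(offsetof(decl_node, space) == 16, "space at +16");
_Static_assert(offsetof(decl_node, alignment) == 24, "alignment at +24");
_Static_assert(offsetof(decl_node, count) == 40, "count at +40");
_Static_assert(offsetof(decl_node, range_end) == 44, "range_end at +44");
_Static_assert(offsetof(decl_node, aux) == 48, "aux at +48");
```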

Visibility / Linkage Directives

.weak -- Case 55

sub_489050(state, 3, 1, ".weak directive", location);  // PTX >= 3.1

.common -- Case 56

sub_489050(state, 5, 0, ".common directive", location);  // PTX >= 5.0

Linkage Qualifiers -- Cases 78--81

These set CU+81 (declaration linkage type) within function prototype production contexts:

Case  Linkage  PTX Directive
78    1        .visible (default/internal)
79    3        .extern
80    2        .visible
81    4        .weak

.alias -- Case 41

Symbol aliasing requires PTX >= 6.3 and SM >= 30:

// Reconstructed from case 41
sub_489050(state, 6, 3, ".alias", location);   // PTX >= 6.3
sub_489390(state, 0x1E, ".alias", location);   // SM >= 30

sym1 = sub_4504D0(state->scope_chain, name1, 1, state);
sym2 = sub_4504D0(state->scope_chain, name2, 1, state);

if (!sym1) error("undefined symbol: %s", name1);
if (!sym2) error("undefined symbol: %s", name2);

Validation:

  • Both symbols must be function type (node type == 5)
  • sym1 must not already have a body defined (sym1->80->88 == 0)
  • Neither can be entry functions
  • No self-aliasing (names must differ)
  • Parameter lists must match (calls sub_4036D9 twice: once for return params, once for input params)
  • .noreturn attribute must be consistent across both symbols
  • Cannot alias to .extern or declaration-qualified functions

On success: sym1->80->64 = sym2 (sets the alias-target pointer).
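
The validation list can be sketched as an ordered series of checks. Everything here is hypothetical scaffolding: sym_info flattens the fields reached through sym->80 in the real structures, the check order is a guess, and parameter-list matching (the two sub_4036D9 calls) is omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;
    int  node_type;   /* 5 = function, per the text */
    bool has_body;
    bool is_entry;
    bool noreturn;
    bool is_extern;
} sym_info;

/* Returns NULL when the alias is acceptable, otherwise a short reason. */
const char *alias_check(const sym_info *alias, const sym_info *target)
{
    if (!alias || !target)
        return "undefined symbol";
    if (alias->node_type != 5 || target->node_type != 5)
        return "not a function";
    if (alias->has_body)
        return "alias already has a body";
    if (alias->is_entry || target->is_entry)
        return "entry function";
    if (strcmp(alias->name, target->name) == 0)
        return "self-alias";
    if (alias->noreturn != target->noreturn)
        return "noreturn mismatch";
    if (target->is_extern)
        return "extern target";
    return NULL;
}
```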

.pragma -- Case 42

The .pragma directive requires PTX >= 2.0 and dispatches through a prefix-matching chain. Each pragma string is compared against known prefixes via sub_4279D0 (starts-with test):

// Reconstructed dispatch structure from case 42
for (node = list_begin(pragma_list); !list_end(node); node = list_next(node)) {
    char* pragma_str = list_value(node);
    sub_489050(state, 2, 0, ".pragma directive", location);  // PTX >= 2.0

    char* arch_str = sub_457CB0(CU->arch_descriptor, index);
    if (starts_with(arch_str, pragma_str)) {
        // matched known pragma
        dispatch_to_handler(pragma_str, state);
    }
}

Pragma Dispatch Chain

Priority  Prefix Index          Pragma          Handler                                Storage
1         sub_457CB0(arch, 1)   "noinline"      sub_456A50 + sub_48D8F0                CU+156, CU+192
2         sub_457CB0(arch, 3)   inline-related  Sets CU+160 = 1                        CU+160
3         sub_457CB0(arch, 16)  register-usage  sub_4563E0 + sub_48C370                CU+172
4         sub_457CB0(arch, 5)   min threads     sub_4563E0 + sub_48C6F0                CU+164 or CU+168
5         sub_457CB0(arch, 9)   max constraint  sub_4567E0 + sub_403D2F                CU+176
6         sub_457CB0(arch, 10)  min constraint  sub_4567E0 + sub_403D2F                CU+184
7         sub_457CB0(arch, 18)  deprecated      Warning via dword_29FA6C0              --
8         sub_457CC0(arch, 1)   deprecated      Warning via dword_29FA6C0              --
9--11     sub_457C60/CA0/C70    unsupported     Warning via dword_29FA7F0              --
12        sub_457D30/D50        unsupported     Warning via dword_29FA7F0              --
13        sub_457CB0(arch, 22)  function-level  Appends to func or module pragma list  func->80->80 or state+272

Unmatched pragmas trigger an error via dword_29FA6C0.
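
A first-match prefix chain of this kind can be sketched as below. Only "noinline" comes from the table above; "nounroll" is a pragma string documented in the PTX ISA, but its position in this chain is an assumption, and pragma_kind, pragma_rule, and classify_pragma are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    PRAGMA_UNKNOWN = 0,
    PRAGMA_NOINLINE,
    PRAGMA_NOUNROLL
} pragma_kind;

typedef struct {
    const char *prefix;
    pragma_kind kind;
} pragma_rule;

/* Rules are tried in order; the first matching prefix wins. */
static const pragma_rule k_rules[] = {
    {"noinline", PRAGMA_NOINLINE},
    {"nounroll", PRAGMA_NOUNROLL},
};

/* True when s begins with prefix. */
static bool starts_with(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Walk the chain and return the kind of the first matching rule. */
pragma_kind classify_pragma(const char *text)
{
    for (size_t i = 0; i < sizeof k_rules / sizeof k_rules[0]; i++) {
        if (starts_with(text, k_rules[i].prefix))
            return k_rules[i].kind;
    }
    return PRAGMA_UNKNOWN;  /* caller reports the unmatched pragma */
}
```

Because matching is by prefix, rule order matters whenever one pragma name is a prefix of another.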

Feature Version Gating

Two functions guard every version-dependent directive against its minimum PTX ISA version and SM architecture requirements. They are called hundreds of times throughout the Bison reduction actions.

sub_489050 -- PTX ISA Version Check

// sub_489050(state, required_major, required_minor, directive_name, location)
char sub_489050(state, int major, int minor, char* name, location) {
    if (sub_454E70(state->version_check_disabled))  // state+960
        return 1;  // checks disabled
    if (state->lenient_mode)  // state+832
        return 1;  // lenient mode
    if (!sub_455A80(major, minor, state)) {
        char buf[152];
        sprintf(buf, "%d.%d", major, minor);
        sub_42FBA0(error_desc, location, name, buf);
    }
    return result;  // 'result' is the decompiler's return-value temporary; its assignment is not shown here
}

sub_489390 -- SM Architecture Check

// sub_489390(state, required_sm, directive_name, location)
char sub_489390(state, uint required_sm, char* name, location) {
    if (sub_454E70(state->version_check_disabled))  // state+960
        return 1;
    if (!state->target_string || state->target_id < required_sm) {
        // state+152 == NULL or state+168 < required_sm: target too old
        char buf[48];
        sprintf(buf, "sm_%d", required_sm);
        sub_42FBA0(error_desc, location, name, buf);
    }
    return result;  // 'result' is the decompiler's return-value temporary; its assignment is not shown here
}
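
The two comparisons at the heart of these gates can be sketched independently of the parser state. The names are hypothetical, and the error reporting and the check-disable and lenient-mode shortcuts are left out. The .alias requirements (PTX >= 6.3, SM >= 30) serve as the worked case.

```c
#include <stdbool.h>

/* Lexicographic (major, minor) comparison, as a PTX version gate needs. */
bool ptx_version_at_least(int have_major, int have_minor,
                          int req_major, int req_minor)
{
    if (have_major != req_major)
        return have_major > req_major;
    return have_minor >= req_minor;
}

/* The SM gate passes when the target is at least the required architecture. */
bool sm_at_least(int have_sm, int req_sm)
{
    return have_sm >= req_sm;
}

/* Worked case: .alias needs PTX >= 6.3 and SM >= 30. */
bool alias_allowed(int ptx_major, int ptx_minor, int sm_id)
{
    return ptx_version_at_least(ptx_major, ptx_minor, 6, 3) &&
           sm_at_least(sm_id, 30);
}
```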

Version Requirements by Directive

Directive              PTX ISA  SM Architecture
.address_size          >= 2.3   --
.weak                  >= 3.1   --
.common                >= 5.0   --
.alias                 >= 6.3   >= 30
.branchtargets         >= 6.0   >= 30
.calltargets           >= 2.1   >= 20
.callprototype         >= 2.1   >= 20
.pragma                >= 2.0   --
texmode_independent    >= 1.5   --
debug target           >= 3.0   --
kernel param list      >= 1.4   --
opaque types           >= 1.5   --
.b128 type             >= 8.3   >= 70
kernel params > 4352B  >= 8.1   >= 70

Parser State Object Layout

The parser state object (v1127 / a1 in sub_4CE6B0) is approximately 1,200 bytes. Key offsets for directive handling:

Offset    Type     Field
72        ptr      Module-level output buffer
88        ptr      Current function link
144       char*    .version string (e.g., "8.5")
152       char*    .target string (e.g., "sm_90")
160       int32    PTX major version
164       int32    PTX minor version
168       int32    SM architecture ID
177       byte     Architecture sub-variant flag
178       byte     Version mismatch flag
196       int32    .address_size (32 or 64)
204       int32    Maximum SM ID encountered
208       ptr      Target feature hash map
219       byte     CLI texmode_independent flag
220       byte     CLI texmode_raw flag
272       ptr      Module pragma list head
368--512  ptr[18]  State-space linked list heads
656--800  bytes    Per-section tracking data (144 bytes)
832       byte     Lenient mode flag
834       word     Debug mode flags
856       int32    Debug hash base
960       int32    Version check disable flag
968       ptr      Scope root (top-level symbol table)
976       ptr      Extern function scope
984       ptr      Current scope chain pointer
1000      byte     Function body active flag
1008      ptr      Scope hash map
1033      byte     Debug info enabled
1096      ptr      CU_state pointer

Function Map

Address     Size      Identity                                    Callers
sub_44A100  39 B      PTX version bsearch validator               case 35
sub_44B9C0  171 B     Scope context creator                       case 82, 97 via sub_497C00
sub_44F6E0  135 B     Parameter node allocator (48 B nodes)       sub_497C00
sub_489050  115 B     PTX ISA version gate                        ~30 directive cases
sub_489390  85 B      SM architecture version gate                ~15 directive cases
sub_497C00  2,992 B   Function/entry declaration handler          cases 82, 97
sub_4A0CD0  4,937 B   Variable/symbol declaration validator       cases 58--68
sub_4A02A0  2,607 B   Initializer/constant expression validator   sub_4A0CD0
sub_4B1080  ~700 B    Per-target handler (SM + texmode)           cases 5, 38
sub_4036D9  437 B     Parameter list compatibility check          case 41 (.alias)
sub_4CE6B0  48,263 B  Bison parser (all directive cases)          compilation driver