PTX Parser (Flex + Bison)

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas front-end parses PTX assembly text into internal IR using a classic two-stage architecture: a Flex-generated DFA scanner (lexer) and a Bison-generated LALR(1) shift-reduce parser. Unlike most compiler front-ends, the parser does not construct an AST. Instead, Bison reduction actions directly build IR nodes, populate the instruction table, and emit validation calls -- the parse tree is consumed inline and never materialized as a data structure. A separate macro preprocessor handles .MACRO, .ELSE/.ELIF/.ENDIF, and .INCLUDE directives at the character level before tokens reach the Flex DFA. The instruction table builder (sub_46E000, 93 KB) registers all PTX opcodes with their legal type combinations during parser initialization, and an instruction lookup subsystem classifies operands into 12 categories at parse time.


Flex scanner	`sub_720F00` (15.8 KB, 64 KB with inlined helpers)
DFA table	`off_203C020` (transition/accept array)
Scanner rules	~552 Flex rules, 162 token types (codes 258--422)
Scanner prefix	`ptx` (all Flex symbols: `ptxlex`, `ptxensure_buffer_stack`, etc.)
Bison parser	`sub_4CE6B0` (48 KB, spans `0x4CE6B0`--`0x4DA337`)
Grammar size	~512 productions, 443 reduction cases
LALR tables	`word_1D146A0` (yydefact), `word_1D121A0` (yycheck), `word_1D13360` (yypact), `word_1D150C0` (yypgoto), `byte_1D15960` (yyr2)
Instruction table builder	`sub_46E000` (93 KB, 1,141 calls to `sub_46BED0`)
Instruction lookup	`sub_46C690` (entry), `sub_46C6E0` (6.4 KB descriptor matcher)
Macro preprocessor	`sub_71F630` (14 KB dispatcher), `sub_71E2B0` (32 KB conditional handler)
Parser state object	1,128 bytes (+ 2,528-byte lexer state via pointer at +1096)
Error handler	`sub_42FBA0` (2,350 callers, central diagnostics)
Parser init	`sub_451730` (14 KB, symbol table + special registers + opcode table)

Architecture

PTX source text
     │
     ▼
┌─────────────────────────────────────────────────────────┐
│  MACRO PREPROCESSOR (character-level, 0x71B000-0x720000)│
│  sub_71F630  dispatch: .MACRO / .ELSE / .INCLUDE        │
│  sub_71E2B0  conditional: .ELSE / .ELIF / .ENDIF (32KB) │
│  sub_71DCA0  macro definition handler                   │
│  sub_71C310  .INCLUDE file handler                      │
└────────────────────┬────────────────────────────────────┘
                     │ preprocessed character stream
                     ▼
┌─────────────────────────────────────────────────────────┐
│  FLEX DFA SCANNER  sub_720F00 (15.8KB, 552 rules)       │
│  off_203C020       DFA transition table                  │
│  Token codes:      258-422 (162 types)                   │
│  Helper:           sub_720410 (yy_get_next_buffer)       │
│                    sub_720630 (yy_get_previous_state)     │
│                    sub_720BA0 (yy_scan_string)            │
└────────────────────┬────────────────────────────────────┘
                     │ token stream (code + attribute)
                     ▼
┌─────────────────────────────────────────────────────────┐
│  BISON LALR(1) PARSER  sub_4CE6B0 (48KB, 512 prods)     │
│  5 LALR tables at 0x1D12xxx-0x1D15xxx                    │
│  443 reduction actions → direct IR construction           │
│  NO AST: reductions emit IR nodes inline                  │
└────────────────────┬────────────────────────────────────┘
                     │
          ┌──────────┴──────────┐
          ▼                     ▼
  INSTRUCTION TABLE         SEMANTIC VALIDATORS
  sub_46E000 (93KB)         sub_4B2F20 (52KB, general)
  sub_46BED0 (per-opcode)   sub_4C5FB0 (28KB, operands)
  sub_46C690 (lookup)       sub_4C2FD0 (12KB, WMMA/MMA)
  sub_46C6E0 (6.4KB match)  sub_4ABFD0 (11KB, async copy)
                            sub_4A73C0 (10KB, tensormap)
                            + 20 more validators

Flex DFA Scanner -- `sub_720F00`

The scanner is a standard Flex-generated DFA with the ptx prefix (all exported symbols use ptx instead of yy: ptxlex, ptxensure_buffer_stack, ptx_create_buffer, etc.). At 15.8 KB of core logic (64 KB including inlined buffer management), it is the largest single function in the lexer region. The DFA transition table lives at off_203C020 and is indexed by *(DWORD*)(state + 76) (the current start condition). The main loop structure follows the textbook Flex pattern:

// DFA transition core (reconstructed from sub_720F00)
while (1) {
    v10 = (DWORD*)(table_base + 8 * state);   // table[state]
    if (current_char == *v10) {                 // character match
        state = table_base + 8 * v10[1];       // goto next state
        action = *(unsigned int*)(state - 4);   // accept action (or 0)
    }
    if (action != 0) break;                     // matched a rule
}
// Giant switch on action number (0..~550)
switch (action) { ... }

The scanner returns integer token codes to the Bison parser. The value 550 is YY_NULL (end-of-input sentinel). Token attributes are communicated through the lexer state object, which the parser state carries as a pointer at offset +1096. The scanner receives this pointer as its a3 argument and dereferences it (e.g., *(_QWORD *)(a3 + 1096)) to reach the 2,528-byte lexer state.

Token Categories

The 552 Flex rules map PTX lexemes to 162 distinct token types. Bison terminal codes range from 258 to 422. The scanner switch cases reveal the following category structure:

Switch cases	Token code	Category	Examples / attributes
2	364	Semicolons / newlines	Statement terminator
5--7	340, 341, 344	Keywords	PTX keywords
63--65	302	Register names	Attribute: -1, `chr-48`, `chr-38` (register numbering)
74--91	320	Data types	Values 1--18: `.b8` through `.f64` (18 type qualifiers)
92--94	322	Comparison types	Values 9, 7, 11
95--99	323	Rounding modes	Values 24--29: `.rn`, `.rz`, `.rm`, `.rp`, etc.
1	(internal)	`#include`	Strips whitespace, copies filename
3	(dispatch)	Preprocessor directive	Calls `sub_71F630`
4	339	`#pragma`	Strips whitespace

Line and column tracking uses fields at *(state+48) (line number) and *(state+52) (column), incremented on each newline character.

Buffer Management

The scanner uses the standard Flex buffer stack for nested input sources (includes, macros, inline strings). Key buffer management functions:

Address	Size	Identity	Purpose
`sub_720190`	2.0 KB	`ptxensure_buffer_stack`	Grows buffer stack via realloc
`sub_7202E0`	1.3 KB	`ptx_create_buffer`	Creates `YY_BUFFER_STATE` from FILE*
`sub_720410`	3.3 KB	`yy_get_next_buffer`	Refills character buffer, handles EOF
`sub_720630`	9.7 KB	`yy_get_previous_state`	Restores DFA state, SIMD-optimized memmove
`sub_720BA0`	4.3 KB	`ptx_scan_string`	Scans inline string into buffer
`sub_724CC0`	4.9 KB	`ptx_scan_bytes`	Macro expansion buffer allocation
`sub_725070`	2.7 KB	`ptx_scan_buffer`	Buffer creation with error recovery

Notable: sub_720630 contains SSE2-optimized memmove using __m128i aligned 16-byte copies for buffer compaction -- a Flex optimization for large input buffers. The ptx_scan_bytes function (sub_724CC0) is called from the Bison parser actions (3 call sites in sub_4CCF30) to handle inline macro expansion during parsing.

Error strings in the buffer system:

"out of dynamic memory in ptxensure_buffer_stack()"
"out of dynamic memory in ptx_create_buffer()"
"out of dynamic memory in yy_get_next_buffer()"
"out of dynamic memory in ptx_scan_bytes()"
"bad buffer in ptx_scan_bytes()"
"out of dynamic memory in ptx_scan_buffer()"
"fatal flex scanner internal error--no action found"
"fatal flex scanner internal error--end of buffer missed"
"unexpected EOF while scanning"

Macro Preprocessor

Before tokens reach the Flex DFA, a character-level macro preprocessor handles .MACRO/.ENDM, .ELSE/.ELIF/.ENDIF, and .INCLUDE directives. The preprocessor lives at 0x71B000--0x720000 (~20 KB) and operates on raw character streams, not tokens. This design is identical to C's preprocessor running before the lexer.

Preprocessor Dispatch -- `sub_71F630`

The top-level dispatcher (14 KB) is called from the Flex scanner's case 3 (directive detection). It examines the directive name and routes to the appropriate handler:

Directive	Handler	Size	Description
`.MACRO`	`sub_71DCA0`	8.4 KB	Macro definition: records body text, handles nesting
`.ELSE` / `.ELIF`	`sub_71E2B0`	32 KB	Conditional code: skips blocks, handles nested conditionals
`.ENDIF`	`sub_71E2B0`	(shared)	End of conditional block
`.INCLUDE`	`sub_71C310`	8.3 KB	File inclusion: pushes new input source onto lexer stack

The dispatcher uses strstr for substring matching on directive names and returns token codes (e.g., 364 for end-of-directive).

Conditional Handler -- `sub_71E2B0`

At 32 KB, this is the largest preprocessor function. It handles .ELSE, .ELIF, and .ENDIF by scanning ahead through the input character stream, counting nesting levels, and skipping entire blocks of PTX text when conditions are false. It calls sub_4287D0 (the token reader) to evaluate conditional expressions and sub_428C40 (string compare) for keyword matching. Two nearly-duplicate code blocks handle .ELSE and .ELIF paths with identical scanning logic but different branch conditions.

Macro Definition -- `sub_71DCA0`

Handles .MACRO directives by recording the macro body text. The function is recursive to support nested .MACRO definitions. It delegates to sub_71D710 (macro body scanner, 7.5 KB) and sub_71D1B0 (macro argument scanner, 6.8 KB). The argument scanner uses strlen + strncmp for keyword matching against a delimiter string parameter.

Include Handler -- `sub_71C310`

Processes .INCLUDE by pushing a new file onto the lexer's input stack. The function is recursive (calls itself 4 times) for nested includes. It manages the include-stack pointers at offsets +2128, +2136, +2160, and +2168 of the lexer state object (the 2,528-byte struct pointed to by parser+1096), and uses the "pushback character" register at offset +2441 of the same lexer state. String reference: "ptxset_lineno called with no buffer".

Error Handling

Macro errors are reported through sub_71BF60 (fatal macro abort) which calls sub_71BF30 to print "out of dynamic memory..." messages, and sub_71C140 (format error) which calls sub_42CA60 (error output). Nesting depth is checked by sub_724CC0 which prints "macro nesting too deep!" on overflow.

Bison LALR(1) Parser -- `sub_4CE6B0`

The parser is a standard Bison-generated LALR(1) shift-reduce parser spanning 48 KB (addresses 0x4CE6B0--0x4DA337). It contains ~512 grammar productions with 443 reduction cases. The function calls ptxlex (sub_720F00) to obtain tokens and uses five LALR tables for state transitions:

Table	Address	Bison name	Purpose
`word_1D146A0`	`0x1D146A0`	`yydefact`	Default reduction rule for each state
`word_1D121A0`	`0x1D121A0`	`yycheck`	Valid lookahead verification
`word_1D13360`	`0x1D13360`	`yypact`	Parser action table (shift/reduce)
`word_1D150C0`	`0x1D150C0`	`yypgoto`	Goto table for nonterminals
`byte_1D15960`	`0x1D15960`	`yyr2`	Right-hand-side length for each rule

Direct IR Construction (No AST)

The critical architectural decision: Bison reduction actions directly construct IR nodes rather than building an intermediate AST. When a grammar rule is reduced, the semantic action immediately:

Allocates IR nodes via the pool allocator (sub_424070)
Populates instruction fields from token attributes
Calls instruction validators for semantic checking
Links nodes into the instruction stream
Registers symbols in the symbol table (via sub_426150, the hash map)

This means the parser is a single-pass translator from PTX text to IR. The trade-off is clear: no AST means no multi-pass source-level analysis, but it eliminates an entire allocation and traversal phase. For an assembler (as opposed to a high-level language compiler), this is the right choice -- PTX is already a linearized instruction stream with no complex scoping or overload resolution that would benefit from an AST.

Reduction Actions -- Semantic Processing

The 443 reduction cases in the parser body handle PTX constructs from simple register declarations to complex matrix instruction specifications. Diagnostic strings found in the parser tail (0x4D5000--0x4DA337) reveal the kinds of semantic checks performed during reduction:

Directive validation:

"Defining labels in .section"
"dwarf data" -- DWARF section processing
"reqntid" / ".reqntid directive" -- required thread count
".minnctapersm directive" -- min CTAs per SM
".maxnctapersm" / ".maxnctapersm directive" -- max CTAs per SM (deprecated)
".maxntid and .reqntid cannot both be specified"
".maxnctapersm directive deprecated..."
".minnctapersm is ignored..."

Type and operand validation:

"Vector Type not specified properly"
".f16x2 packed data-type" -- half-precision packed type
"matrix shape" -- matrix instruction dimensions
".scale_vectorsize" -- vector scaling modifier
"too many layout specifiers"

Resource limits:

"Kernel parameter size larger than 4352 bytes"

Architecture gating:

"sm_50", "sm_20", "sm_53" -- target architecture checks via sub_485520(ctx, sm_number)
PTX version checks via sub_485570(ctx, major, minor)

Expression handling:

"%s+%llu" / "%s-%s" -- label arithmetic in address expressions
"Negative numbers in dwarf section" -- DWARF data validation

Symbol resolution:

"unrecognized symbol" -- lexer/symbol table failure
"syntax error" -- generic parse error
".extern" -- external declarations
".noreturn directive" -- function attributes
"texmode_unified" / "texmode_raw" -- texture mode selection
"cache eviction priority" / ".level::eviction_priority" -- cache policy

Error Recovery

Parse errors trigger sub_42FBA0 with "syntax error" as the message. The central diagnostic emitter (sub_42FBA0, 2,388 bytes, 2,350 callers) handles all severity levels:

Severity	Prefix	Tag	Behavior
0	(suppressed)	--	Silently ignored
1--2	`"info "`	`@I@`	Informational message
3	`"warning "` or `"error "`	`@W@` or `@E@`	Context-dependent; promoted to error by `--Werror`
4	`"error* "`	`@E@`	Non-fatal error
5	`"error "`	`@E@`	Error
6+	`"fatal "`	(none)	Calls `longjmp` to abort compilation

The diagnostic system reads the source file to display context lines (prefixed with "# "), caching file offsets every 10 lines in a hash map for fast random-access seeking.

Parser Initialization -- `sub_451730`

Parser initialization (14 KB) builds the lexer's symbol table with all built-in PTX names before parsing begins. This function is called from the compilation driver (sub_446240) and performs three major tasks:

1. Special Register Registration

All PTX special registers are pre-registered in the symbol table with their internal identifiers:

Category	Registers
Thread/block ID	`%ntid`, `%laneid`, `%warpid`, `%nwarpid`, `%smid`, `%nsmid`, `%ctaid`, `%nctaid`, `%gridid`
Clocks	`%clock`, `%clock_hi`, `%clock64`
Performance counters	`%%pm0`--`%%pm7`, `%%pm0_64`--`%%pm7_64`
Lane masks	`%lanemask_eq`, `%lanemask_le`, `%lanemask_lt`, `%lanemask_ge`, `%lanemask_gt`
Environment	`%%envreg0`--`%%envreg31`
Timers	`%globaltimer_lo`, `%globaltimer_hi`
Shared memory	`%total_smem_size`, `%dynamic_smem_size`
Texture types	`.texref`, `.samplerref`, `.surfref`
Predefined macros	`GPU_ARCH`, `PTX_MAJOR_VERSION`, `PTX_MINOR_VERSION`

2. Opcode Table Construction

Calls sub_46E000 -- the 93 KB instruction table builder -- to register all PTX opcodes with their legal type combinations. See the dedicated section below.

3. Context State Initialization

Allocates and initializes two objects: the parser state (1,128 bytes, sub_424070(pool, 1128)) and the lexer state (2,528 bytes, sub_424070(pool, 2528)). The parser state stores a pointer to the lexer state at offset +1096. The string "PTX parsing state" identifies the parser state allocation in memory dumps. The string "<builtin>" serves as the filename for built-in declarations. Both objects are zeroed via memset before field initialization.

Instruction Table Builder -- `sub_46E000`

This is the largest single function in the front-end region at 93 KB. It is not a normal function body but a massive initialization sequence that calls sub_46BED0 exactly 1,141 times -- once per legal PTX instruction variant. Each call registers an opcode name together with its accepted type combinations using compact encoding strings.

Operand Encoding Strings

Each instruction variant is registered with a string that encodes its operand signature. The encoding uses single-character codes for operand categories:

Code	Meaning
`F`	Float operand (`.f16`, `.f32`, `.f64`)
`H`	Half-precision (`.f16`, `.f16x2`)
`I`	Integer operand (`.s8`--`.s64`, `.u8`--`.u64`)
`B`	Bitwise operand (`.b8`--`.b128`)
`N`	Immediate / numeric literal
`P`	Predicate operand

String references found in the function include composite type signatures:

"F32F32" -- binary float32 operation
"F16F16F16F16" -- quad half-precision
"I32I8I8I32" -- integer MMA (int32 accumulator, int8 operands)
"F64F64F64F64" -- quad float64 (double-precision MMA)
"_mma.warpgroup" -- warp-group MMA marker

Hash Tables

The instruction table builder populates two hash tables at offsets +2472 and +2480 within the lexer state object (the 2,528-byte struct passed as the first argument to sub_46E000). These hash tables provide O(1) lookup from opcode name to the registered type combination list.

Registration Function -- `sub_46BED0`

Called 1,141 times from sub_46E000. Each call takes an opcode name string and an operand encoding string, creates a descriptor node, and inserts it into the hash table. The descriptor captures the opcode, its legal operand types, and the semantic validation function to call during parsing.

Instruction Lookup -- `sub_46C690` and `sub_46C6E0`

At parse time, when the parser reduces an instruction production, it calls sub_46C690 to look up the instruction name in the hash table built by sub_46E000. The lookup returns a descriptor list, and sub_46C6E0 (6.4 KB, the descriptor matcher) walks the list to find the variant matching the actual operands present in the source.

Operand Classification -- 12 Categories

The descriptor matcher (sub_46C6E0) classifies each operand into one of 12 categories based on its syntactic form, then matches the category sequence against the registered encoding strings. The 12 categories cover:

General-purpose register (R)
Predicate register (P)
Uniform register (UR)
Uniform predicate (UP)
Integer immediate
Float immediate
Address expression (register + offset)
Label / symbol reference
Special register
Vector operand
Texture / surface / sampler reference
Bitfield / compound modifier

The classification examines token attributes set by the lexer -- register type bits at (field >> 28) & 7, immediate flag (0x1000000), uniform flag (0x6000000), and operand descriptor fields at instruction offset 84+.

Parser State Object (1,128 bytes)

The parser passes a state object through all phases. This 1,128-byte structure (sub_424070(pool, 1128)) carries compilation context and pointers to sub-systems. It is indexed as _QWORD* (8-byte slots), so QWORD index [N] = byte offset N*8. The highest accessed byte is +1120 (index [140]), fitting exactly within the 1,128-byte allocation.

Offset	Size	Field	Description
+0	8	`pool_context`	Pool allocator handle (from `sub_4258D0`)
+8	8	`compilation_unit`	Pointer to compilation unit (parameter a2)
+16	8	`macro_symbol_table`	Hash table for macros (`sub_425CA0`, 64 buckets)
+24	8	`module_ptr`	Pointer to module object (parameter a3)
+32	8	`container_a`	Sorted set container (8,192 buckets)
+56	8	`scope_chain[0]`	Scope chain entry (`sub_44F7C0`), used for symbol resolution
+64	8	`scope_chain[1]`	Second scope chain entry
+72	8	`scope_chain[2]`	Third scope chain entry
+80	8	`type_map`	Type descriptor hash map (`sub_42D150`, 8 buckets)
+96	8	`symbol_tables[0..5]`	Six hash tables for symbol lookup (at +96, +104, +112, +120, +128, +136)
+152	8	`current_function`	Pointer to current function being parsed
+160	4	`ptx_major_version`	PTX ISA major version (set by Bison reduction)
+164	4	`ptx_minor_version`	PTX ISA minor version
+168	4	`sm_version_check`	SM target version for feature gating
+177	1	`flag_a`	Initialization flag
+192	2	`word_96`	Zero-initialized word at WORD index 96
+196	4	`address_size`	32 or 64 (address width)
+208	8	`hash_ref_a`	Hash table reference (64-bucket)
+236	1	`default_flag`	Initialized to 1
+264	16	`list_a`	Linked list (head at +264, tail ptr at +272 points to head)
+280	8	`sorted_set_b`	Sorted set (8,192 buckets)
+288	8	`sorted_set_c`	Sorted set (1,024 buckets)
+296	16	`sorted_maps[0..1]`	Two sorted maps (`sub_42A300`)
+320	8	`hash_e`	Hash table (1,024 buckets)
+328	16	`list_b`	Linked list (head/tail pair)
+344	16	`list_c`	Linked list (head/tail pair)
+360	256	`offset_table[16]`	SSE-initialized offset table (16 entries of 16 bytes each, computed from base address + constants at `xmmword_1CFDA00`--`1CFDA70`)
+616	16	`list_d`	Linked list (head/tail pair)
+632	16	`list_e`	Linked list (head/tail pair); low bits of first word used as `address_space_flags`
+648	8	`local_symbol_table`	Per-scope local symbol table pointer
+824	8	`symbol_lookup_ref`	Hash table for symbol name lookup
+832	1	`dwarf_section_flag`	Nonzero when inside `.section` DWARF data
+834	1	`directive_flag_a`	Checked as pair with +835
+836	1	`directive_flag_b`	Set to 1 by multiple Bison reductions
+840	8	`builtin_filename`	Interned string `"<builtin>"`
+848	8	`empty_string`	Interned empty string `""`
+856	4	`sm_arch_number`	SM architecture number (parameter a6, e.g. 90 for sm_90)
+860	1	`feature_a`	Feature flags set during parsing
+861	1	`feature_b`
+862	1	`feature_c`
+864	1	`feature_d`
+865	1	`feature_e`	ORed with 1 by Bison reductions
+869	1	`flag_h`	Initialized to 0
+960	4	`sm_target_code`	SM target code used in `sub_454E70` checks
+968	8	`insn_stream_a`	Instruction stream pointer A (set in Bison)
+976	8	`insn_stream_b`	Instruction stream pointer B
+984	8	`insn_stream_c`	Instruction stream pointer C
+1000	1	`insn_state_flag`	Instruction state flag (= 0)
+1008	8	`string_pool`	String pool pointer
+1016	8	`context_ref`	Compilation context reference (parameter a4)
+1048	4	`dword_262`	Zero-initialized
+1053	1	`parsing_active`	Toggled 1/0 during active parsing
+1080	16	`list_f`	Linked list (head/tail pair)
+1096	8	`lexer_state_ptr`	Pointer to 2,528-byte lexer state object (see below)
+1104	16	`list_g`	Linked list (head/tail pair)
+1120	1	`param_flag`	From parameter a10

Lexer State Object (2,528 bytes)

The lexer state is a separate heap-allocated object (sub_424070(pool, 2528)) pointed to by parser_state+1096. It is the primary state carrier for the Flex DFA scanner and the instruction table subsystem. All functions that need scanner state (the Bison parser, the Flex scanner, the include handler, and the instruction table builder) access this object through the pointer at +1096.

Offset	Size	Field	Description
+48	4	`line_number`	Current source line (incremented on newline)
+52	4	`column_number`	Current source column
+64	8	`buffer_limit`	Pointer to end of current scan buffer
+76	4	`start_condition`	Flex DFA start condition (`*(state+76)`, indexes `off_203C020`)
+152	1	`flag_a`	Scanner state flag
+156	8	`sentinel_a`	Initialized to -1 (0xFFFFFFFFFFFFFFFF)
+164	8	`sentinel_b`	Initialized to -1
+172	4	`address_size_proxy`	Written by Bison via `sub_4563E0`; -1 on init
+180	8	`zero_pair`	Zero-initialized
+188	8	`sentinel_c`	Initialized to 0xFFFFFFFF00000000
+196	8	`sentinel_d`	Initialized to -1
+204	4	`sentinel_e`	DWORD[51], initialized to -1
+208	2	`word_104`	WORD[104], zero-initialized
+540	1	`flag_b`	Scanner flag
+541	1	`include_active`	Checked by Flex (`lexer+541`) and Bison to gate `.INCLUDE` behavior
+784	8	`current_filename`	Pointer to current filename string (set during include handling)
+1984	128	`version_array[32]`	DWORD array of version fields; written by `sub_70FDD0(lexer, index, value)` as `(lexer + 4index + 1984) = value`
+2104	4	`ptx_major_ver`	`version_array[30]` = PTX major version (initialized to 9)
+2108	4	`ptx_minor_ver`	`version_array[31]` = PTX minor version (initialized to 0)
+2128	8	`include_stack_a`	Include nesting pointer 1 (linked list for file stack)
+2136	8	`include_stack_b`	Include nesting pointer 2
+2160	8	`include_stack_head`	Head of include stack (walked by `sub_71C310`)
+2168	8	`include_stack_file`	Include stack filename pointer
+2441	1	`pushback_char`	Character pushed back into input stream by scanner
+2464	2	`word_1232`	Zero-initialized
+2466	1	`flag_c`	Flag
+2472	8	`opcode_hash_a`	Opcode lookup hash table (populated by `sub_46E000`)
+2480	8	`opcode_hash_b`	Second opcode lookup hash table (populated by `sub_46E000`)
+2488	8	`context_sub_ref`	Compilation context sub-reference (parameter a9); accessed by Bison for `sub_457CB0`/`sub_70A5B0` calls
+2496	1	`flag_d`	Flag
+2504	24	`tail_fields`	Three zero-initialized QWORD slots (indices [313],[314],[315])

Version checks use sub_485520(ctx, sm_number) (SM architecture >= N) and sub_485570(ctx, major, minor) (PTX version >= major.minor). For example, the address-space attribute setter (sub_4035D3) checks sm_90 and PTX 7.8:

if (!sub_485520(ctx, 90))
    sub_42FBA0(&err, loc, "sm_90", ...);   // Error: requires sm_90
if (!sub_485570(ctx, 7, 8))
    sub_42FBA0(&err, loc, "7.8", ...);     // Error: requires PTX 7.8
*(byte*)(v15 + 632) = (old & 0xFC) | (a2 & 3);   // Set address space bits

Semantic Validators

The parser's reduction actions dispatch to specialized validator functions for each instruction category. These functions live in 0x460000--0x4D5000 and check SM architecture requirements, type compatibility, operand constraints, and instruction-specific invariants.

Address	Size	Identity	Coverage
`sub_4B2F20`	52.6 KB	General instruction validator	Textures, surfaces, loads, stores, cvt, calls
`sub_4CE6B0` tail	48 KB	Directive/declaration validator	`.local_maxnreg`, `.alias`, `.unified`, `.pragma`, `.noreturn`
`sub_4C5FB0`	28.5 KB	Operand validator	State spaces, rounding, barriers, cache levels
`sub_4C2FD0`	12.2 KB	WMMA/MMA validator	Matrix dimensions, FP8 types, layout specifiers
`sub_49BBA0`	11.4 KB	MMA scale/block validator	`.scale_vec_size`, `.block_scale`, sparse GMMA
`sub_4ABFD0`	11.1 KB	Async copy validator	`cp.async`, bulk copy, `cvt.tf32.f32.rna`
`sub_4A73C0`	10.9 KB	Tensormap validator	`.tile`, field ranges, `.tensormap::generic`
`sub_4BFED0`	10.3 KB	WMMA shape/type validator	`.m%dn%dk%d` shapes, `.aligned` modifier
`sub_4AF9F0`	5.8 KB	CVT validator	`cvt.f16x2.f32`, type combinations, rounding
`sub_4AEB60`	3.7 KB	LDSM validator	`_ldsm.s8.s4`/`_ldsm.u8.u4` format conversion
`sub_4B1630`	4.6 KB	Function address validator	`cudaDeviceSynchronize`, kernel/device addresses
`sub_498AF0`	3.9 KB	MMA layout validator	Row/col layout, floating-point type constraints
`sub_497C00`	3.0 KB	Prototype validator	`.FORCE_INLINE`, `.noreturn`, `.unique`, register counts
`sub_496690`	3.6 KB	Scope/barrier validator	Scope modifiers, barrier constraints
`sub_494210`	2.3 KB	Sparse GMMA validator	Sparse GMMA with specific types
`sub_492C80`	4.0 KB	Cache eviction validator	L2 eviction priority, `.v8.b32`/`.v4.b64`
`sub_49A5A0`	3.5 KB	Special register validator	`%laneid`, `%clock64`, `%lanemask_*`, arch gating
`sub_4A0CD0`	4.9 KB	Variable declaration validator	`.texref`, `.managed`, `.reserved`, `.common`
`sub_4A02A0`	2.6 KB	Initializer validator	`generic()` operator, function addresses
`sub_4036D9`	437 B	Parameter list validator	Count, types, alignment, state space

Validators follow a uniform pattern: they receive the parser context and instruction data, check constraints against the current SM architecture and PTX version, and call sub_42FBA0 with descriptive error messages when violations are found. The general validator (sub_4B2F20, 52.6 KB) is the second-largest function in the front-end and covers the broadest range of PTX instructions.

ROT13 Opcode Name Obfuscation

PTX opcode names stored in the binary are ROT13-encoded as an obfuscation measure. The static constructor ctor_003 at 0x4095D0 (17 KB, ~1,700 lines) decodes and populates the opcode name table at 0x29FE300 during program startup. Each entry is a (string_ptr, length) pair. Decoded examples:

ROT13	Decoded	PTX instruction
`NPDOHYX`	`ACQBULK`	`acqbulk`
`OFLAP`	`BSYNC`	`bsync`
`PPGY.P`	`CCTL.C`	`cctl.c`
`SZN`	`FMA`	`fma`
`FRGC`	`SETP`	`setp`
`ERGHEA`	`RETURN`	`return`
`RKVG`	`EXIT`	`exit`

The table covers the entire PTX ISA vocabulary -- hundreds of opcodes. A separate ROT13 table in ctor_005 (0x40D860, 80 KB) encodes 2,000+ internal Mercury/OCG tuning knob names (see Knobs System).

Compilation Pipeline Integration

The parser is invoked from the top-level compilation driver sub_446240 (11 KB), which orchestrates the full pipeline:

Parse  →  CompileUnitSetup  →  DAGgen  →  OCG  →  ELF  →  DebugInfo

The driver reports timing for each phase:

"Parse-time : %.3f ms (%.2f%%)"
"CompileUnitSetup-time : %.3f ms (%.2f%%)"
"DAGgen-time : %.3f ms (%.2f%%)"
"OCG-time : %.3f ms (%.2f%%)"
"ELF-time : %.3f ms (%.2f%%)"
"DebugInfo-time : %.3f ms (%.2f%%)"

The parse phase encompasses the Flex scanner, macro preprocessor, Bison parser, instruction table lookup, and all semantic validation. Since the parser directly builds IR, the output of the parse phase is a populated instruction stream ready for the DAG generation phase.

PTX Text Generation (Reverse Direction)

The inverse of parsing -- converting IR back to PTX text -- lives in 0x4DA340--0x5A8E40 (580 formatter functions). Each handles one PTX opcode. A dispatcher at sub_5D4190 (12.9 KB) routes by opcode name using 81 direct string comparisons plus a 473-entry hash switch. Every formatter follows an identical allocation pattern:

pool = sub_4280C0(ctx)[3];              // Get allocator pool
buf = sub_424070(pool, 50000);          // 50KB temp buffer
// ... sprintf() operands into buf ...
len = strlen(buf);
result = sub_424070(pool, len + 1);     // Exact-size allocation
strcpy(result, buf);
sub_4248B0(buf);                        // Free temp buffer
return result;

A monolithic format string table (~1.8 MB) at the a2 parameter contains pre-assembled PTX text templates with %s/%llu/%d placeholders. This trades memory for speed: instead of building instruction text dynamically, ptxas simply fills in operand names at runtime.

Function Map

Address	Size	Identity	Confidence
`sub_720F00`	15.8 KB	`ptxlex` -- Flex DFA scanner main	98%
`sub_4CE6B0`	48 KB	`ptxparse` -- Bison LALR(1) parser	HIGH
`sub_46E000`	93 KB	Instruction table builder (1,141 opcode registrations)	HIGH
`sub_46BED0`	--	Per-opcode registration function (called 1,141x)	HIGH
`sub_46C690`	--	Instruction lookup entry	HIGH
`sub_46C6E0`	6.4 KB	Descriptor matcher (12-category operand classifier)	HIGH
`sub_451730`	14 KB	Parser initialization (allocs 1,128B parser state + 2,528B lexer state)	HIGH
`sub_70FDD0`	14 B	Lexer version array writer: `(a1 + 4a2 + 1984) = a3`	HIGH
`sub_71F630`	14 KB	Preprocessor directive dispatcher	93%
`sub_71E2B0`	32 KB	Conditional handler (`.ELSE`/`.ELIF`/`.ENDIF`)	92%
`sub_71DCA0`	8.4 KB	Macro definition handler (`.MACRO`)	90%
`sub_71C910`	13 KB	Directive scanner	91%
`sub_71C310`	8.3 KB	Include handler (`.INCLUDE`)	90%
`sub_71D1B0`	6.8 KB	Macro argument scanner	89%
`sub_71D710`	7.5 KB	Macro body scanner	89%
`sub_71BA10`	2.3 KB	Macro character peek	88%
`sub_71BB80`	2.6 KB	Macro buffer reader	88%
`sub_71BE20`	1.1 KB	Macro expansion entry	85%
`sub_71BF60`	1.8 KB	Macro fatal abort	90%
`sub_71C140`	2.5 KB	Macro format error	88%
`sub_720190`	2.0 KB	`ptxensure_buffer_stack`	95%
`sub_7202E0`	1.3 KB	`ptx_create_buffer`	96%
`sub_720410`	3.3 KB	`yy_get_next_buffer`	95%
`sub_720630`	9.7 KB	`yy_get_previous_state` (SSE2 optimized)	94%
`sub_720BA0`	4.3 KB	`ptx_scan_string`	93%
`sub_724CC0`	4.9 KB	`ptx_scan_bytes` / macro nesting check	91%
`sub_725070`	2.7 KB	`ptx_scan_buffer`	93%
`sub_42FBA0`	2.4 KB	Central diagnostic emitter (2,350 callers)	HIGH
`sub_4280C0`	597 B	Thread-local context accessor (3,928 callers)	HIGH
`sub_424070`	2.1 KB	Pool allocator (3,809 callers)	HIGH
`sub_4248B0`	923 B	Pool deallocator (1,215 callers)	HIGH
`sub_42BDB0`	14 B	Fatal OOM handler (3,825 callers)	HIGH
`sub_446240`	11 KB	Top-level compilation driver	HIGH
`sub_4095D0`	17 KB	ROT13 opcode name table initializer	HIGH
`sub_5D4190`	12.9 KB	PTX text format dispatcher	HIGH
`sub_4B2F20`	52.6 KB	General instruction validator	HIGH
`sub_4C5FB0`	28.5 KB	Instruction operand validator	HIGH
`sub_4C2FD0`	12.2 KB	WMMA/MMA validator	HIGH
`sub_485520`	--	SM architecture check (`sm >= N`)	HIGH
`sub_485570`	--	PTX version check (`version >= M.N`)	HIGH

Cross-References

Pipeline Overview -- where the parser fits in the compilation flow
PTX Directive Handling -- detailed directive processing after parsing
PTX-to-Ori Lowering -- what happens to the IR the parser builds
Knobs System -- ROT13-encoded knob names from ctor_005
Memory Pool Allocator -- sub_424070/sub_4248B0 pool system
Hash Tables & Bitvectors -- sub_426150/sub_426D60 hash map
PTX Instruction Table -- full opcode catalog
CLI Options -- sub_432A00/sub_434320 option handling

Keyboard shortcuts

PTXAS Reverse Engineering Reference