Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

PTX Parsing

The embedded ptxas compiler in nvlink v13.0.88 contains a complete PTX assembler frontend: a flex-generated lexer, a bison/yacc-generated parser, a module initialization system that defines every PTX special register and builtin type, an instruction handler registry covering the full PTX ISA 9.0, a CUDA builtin function database of 608 entries, and a recursive expression printer. This page documents each subsystem at the function level, reconstructed from decompiled binary analysis across the 0x12AF000--0x12C0000, 0x1430000--0x15C0000, and 0x16E0000--0x16F6000 address ranges.

Architecture Overview

PTX parsing is organized into seven subsystems that form a pipeline from raw text to internal IR:

PTX source text
    |
    v
[1] Flex Lexer (sub_16EAF60, 65KB)
    |  DFA state table at off_22788C0
    |  550+ token actions
    |  Macro preprocessor (sub_16E8310)
    v
[2] Bison/Yacc Parser (sub_16F0E60, 31KB)
    |  1172 lines of parser actions
    |  Operand list builders (sub_16E4D60..sub_16E7D00)
    v
[3] Module Initializer (sub_12AF950, 60KB)
    |  Special registers, builtin types, hash tables
    |  PTX version 9.0, ~20 symbol caches
    v
[4] Instruction Handler Registry (sub_158D130, 41KB)
    |  115 named mnemonic handlers + 473 hash-encoded handlers (588 total)
    |  Two-level dispatch: name table *(ctx+808), hash table *(ctx+816)
    v
[5] Semantic Validators (sub_147EF50, 288KB + sub_146BEC0, 206KB + ...)
    |  SM version gates, PTX version gates, operand type checks
    |  Relaxed mode bypass (sub_12B3090)
    v
[6] Code Template Generators (~250 functions, 3--50KB each)
    |  50KB temp buffer, sprintf-based PTX emission
    v
[7] Expression Printer (sub_12B33A0, 9KB)
       17 expression kinds, recursive descent

Module Initialization

sub_12AF950 -- ptx_init_compilation_state (59,874 bytes, 1,440 lines)

The master initialization function. Creates the entire PTX compilation state in a single call, including the symbol table infrastructure, all PTX special registers, all builtin aggregate types, and the version/target configuration. Takes 11 parameters including a function pointer callback.

Allocation. Creates two major structures: a 1,128-byte compilation context and a 2,528-byte instruction state block. Both are arena-allocated via sub_4307C0.

PTX version. Sets PTX_MAJOR_VERSION = 9, PTX_MINOR_VERSION = 0 via sub_448E70 hash table inserts. This corresponds to CUDA Toolkit 13.0 / PTX ISA 9.0.

Special register initialization. Registers every PTX special register in declaration order, using sub_448E70 to insert each name into the symbol table:

Register GroupNamesCount
Thread/block IDs%tid, %ntid, %laneid, %warpid, %nwarpid5
SM identity%smid, %nsmid2
CTA/grid IDs%ctaid, %nctaid, %gridid3
Clocks%clock, %clock_hi, %clock643
Performance counters%pm0--%pm7 (loop 0..7)8
64-bit perf counters%pm0_64--%pm7_64 (loop 0..7)8
Lane masks%lanemask_eq, %lanemask_le, %lanemask_lt, %lanemask_ge, %lanemask_gt5
Environment%envreg0--%envreg31 (loop 0..31)32
Timers%globaltimer_lo, %globaltimer_hi, %globaltimer3
Shared memory%total_smem_size, %dynamic_smem_size, %aggr_smem_size3
Reserved SMEM%reserved_smem_offset_begin, %reserved_smem_offset_end, %reserved_smem_offset_cap, %reserved_smem_offset_0, %reserved_smem_offset_15
Cluster%is_explicit_cluster, %clusterid, %nclusterid, %cluster_ctaid, %cluster_ctarank, %cluster_nctaid, %cluster_nctarank7
CUDA graphs%current_graph_exec1
Total85

Builtin type initialization. Creates three aggregate type descriptors via sub_12AD9E0:

.texref fields: width, height, depth, channel_data_type, channel_order, normalized_coords, filter_mode, addr_mode_0, addr_mode_1, addr_mode_2, array_size, num_mipmap_levels, num_samples (13 fields).

.samplerref fields: force_unnormalized_coords, filter_mode, addr_mode_0, addr_mode_1, addr_mode_2 (5 fields).

.surfref fields: width, height, depth, channel_data_type, channel_order, array_size, memory_layout (7 fields).

Hash table allocation. Initializes approximately 20 hash tables (via sub_4489C0) for symbol caches, type caches, instruction lookup, and builtin registration. Each table uses the standard pattern: sub_4489C0(hash_fn, eq_fn, bucket_count).

Target configuration. Calls sub_1426EE0 to initialize target-specific configuration and sub_16E3AA0 to set SM version info (field 30 = SM major, field 31 = SM minor).

sub_12AF200 -- ptx_init_builtin_macros (4,556 bytes, 144 lines)

Initializes Fermi-era backward-compatibility macro definitions. Referenced as "<fermi macros>" in calling context.

sub_12AF550 -- ptx_init_target_features (4,625 bytes, 144 lines)

Initializes the target feature set based on the SM version. Companion to the main initializer.

Lexer (Flex-Generated)

sub_16EAF60 -- ptx_flex_scanner_main_loop (64,758 bytes, 2,530 lines)

The main scanner loop, generated by flex. This is the lexical analysis engine that tokenizes PTX source text.

DFA structure. The scanner uses a DFA state table at off_22788C0, indexed by (current_state, input_character). The action dispatch table at dword_2278020 maps accepted token types to action categories. The action switch has 550+ cases covering the entire PTX token vocabulary: keywords, directives, register names, literals, operators, and punctuation.

EOF handling. Case 550 handles end-of-file. The scanner tracks line numbers by counting newline characters (ASCII 10).

Buffer management. Three helper functions manage scanner input:

FunctionSizeDescription
sub_16EA470 (ptx_create_buffer)3,359 BAllocates new scan buffer. Error: "out of dynamic memory in ptx_create_buffer()"
sub_16EA690 (ptx_scan_buffer_handler)9,723 BManages buffer switching/refilling. Error: "fatal error - scanner input buffer overflow"
sub_16EED20 (ptx_scan_bytes)5,001 BCreates scan buffer from byte array. Errors: "out of dynamic memory in ptx_scan_bytes()", "bad buffer in ptx_scan_bytes()"
sub_16EF8E0 (ptx_scan_buffer_create)3,161 BCompanion buffer creator. Error: "out of dynamic memory in ptx_scan_buffer()"
sub_16EAC00 (ptx_string_concat)4,329 BString concatenation for token accumulation, called with "\n" separator

sub_16E8310 -- ptx_macro_preprocessor (33,290 bytes, 1,068 lines)

The PTX macro preprocessor. Runs before lexical analysis to expand .MACRO, .ELSE, .ELIF directives.

Input handling. Processes input character by character, checking for whitespace (ASCII 9 = tab, 32 = space). Manages macro nesting with a depth limit -- exceeding it triggers "macro nesting too deep!". Uses sub_44FB20 for character classification.

State tracking. Key offsets in the preprocessor context:

  • +48: current line number
  • +52: current column number
  • +2144: input buffer pointer
  • +2441: lookahead character
  • +8: stdin fallback (when no input buffer is active)

sub_16EF680 -- ptx_macro_expansion (4,010 bytes, 142 lines)

Handles individual macro expansion with nesting depth checking. Emits "macro nesting too deep!" on overflow.

Parser (Bison/Yacc-Generated)

sub_16F0E60 -- ptx_parser_yacc_main (30,684 bytes, 1,172 lines)

The main parser driver, generated by bison/yacc. Works in conjunction with the flex scanner (sub_16EAF60). At 1,172 lines, this encodes the full PTX grammar reduction rules and semantic actions.

sub_16E9690 -- ptx_directive_parser (14,777 bytes, 525 lines)

Parses PTX directives including .surfref, .samplerref, .texref, and section directives. The 525-line function implements a complex state machine that handles multiple directive types and their parameter lists.

sub_1442040 -- section_directive_validator (3,300 bytes, 82 lines)

Validates .section and @@DWARF directives. The @@DWARF handler processes inline DWARF debug data embedded in PTX source, which is the mechanism by which nvcc passes debug information through PTX to the assembler.

Instruction Construction

The parser builds instruction IR nodes through a family of builder functions in the 0x16E4D60--0x16E7D00 range. All work with dynamic arrays of 48-byte records (one per operand), using a growth factor of 1.5x with a minimum of 16 elements. The block allocation unit is 768 bytes (16 elements times 48 bytes). Memory allocation uses vtable callbacks at **obj+24.

FunctionSizeDescription
sub_16E4D60 (ptx_instruction_builder)17,306 BMain instruction node constructor. Manages operand count fields at offsets +144, +148, +152, +156, +176, +184, +188
sub_16E5A60 (ptx_operand_insert_helper)7,165 BOperand insertion into existing instruction
sub_16E6370 (ptx_operand_setter)8,516 BSets operand properties within instruction node
sub_16E6970 (ptx_complex_builder)13,307 BMulti-operand instruction construction (505 lines)
sub_16E7210 (ptx_operand_append)6,875 BAppends operand to instruction's operand array
sub_16E7770 (ptx_operand_modifier_handler)7,563 BApplies modifiers during instruction construction
sub_16E7D00 (ptx_instruction_fixup)8,513 BPost-processing: adjusts operand counts and internal state

Instruction Handler Registration

sub_158D130 -- ptx_instruction_handler_table_init (40,900 bytes, 595 lines)

The master instruction handler registration function. First calls sub_158A600 to initialize the CUDA builtin name table, then populates two dispatch hash tables with a combined 588 unique handler functions spanning 988.7 KB of code.

Named Mnemonic Table (115 entries)

Stored at *(context+808). Maps PTX instruction mnemonic strings to handler function pointers via sub_448E70. Every entry below is extracted verbatim from the decompiled registration sequence -- no entries are omitted.

Arithmetic and Math (17 entries)

MnemonicHandlerSize (bytes)Notes
divsub_157067017,760Integer and FP division, largest arithmetic handler
div.fullsub_152C8002,032Full-range division (no fast approximation)
remsub_15427B03,872Integer remainder
rcpsub_1569C7013,168Reciprocal approximation
rsqrtsub_1534F602,704Reciprocal square root
sqrtsub_156CFE013,968Square root
ex2sub_153C1303,520Base-2 exponential
lg2sub_14E35601,360Base-2 logarithm
tanhsub_14BEAA01,072Hyperbolic tangent (sm_75+)
cvtsub_15585D06,256Type conversion
testpsub_153A9B02,960Floating-point property test (NaN, Inf, etc.)
copysignsub_14C41201,056Copy sign bit between FP values
dp2a.losub_1524A001,8402-element dot product accumulate (low half)
dp2a.hisub_15258701,8402-element dot product accumulate (high half)
dp4asub_1530B402,2564-element dot product accumulate
membarsub_14943B0496Memory barrier
prefetchsub_14C0F501,024Cache prefetch

Bit Manipulation (6 entries)

MnemonicHandlerSize (bytes)Notes
bfindsub_1549BC04,832Find most significant bit
brevsub_14C45401,136Bit reversal
bfesub_15314102,368Bit field extract
bfisub_14E70A01,504Bit field insert
clzsub_1494C60608Count leading zeros
popcsub_14941B0512Population count

Warp-Level Operations (5 entries)

MnemonicHandlerSize (bytes)Notes
votesub_1539DF03,008Warp vote (all/any/uni/ballot)
shflsub_15391703,200Warp shuffle (up/down/bfly/idx)
matchsub_15436D03,952Warp match (sm_70+)
reduxsub_15206201,792Warp-level reduction (sm_80+)
bar.warpsub_15235401,760Intra-warp barrier

Barrier and Synchronization (12 entries)

MnemonicHandlerSize (bytes)Notes
barriersub_15292301,712Named barrier sync
barrier.arrivesub_15298E01,808Named barrier arrive (non-blocking)
barrier.redsub_15419503,680Named barrier with reduction
barrier.ctasub_1493B00416CTA-scope barrier
barrier.cta.arrivesub_1493CA0416CTA-scope barrier arrive
barrier.cta.redsub_1494A00608CTA-scope barrier with reduction
barsub_14DDF501,216Legacy barrier (PTX < 7.0 form)
bar.arrivesub_14B9B90944Legacy barrier arrive
bar.redsub_14E65301,424Legacy barrier with reduction
bar.ctasub_1493820368Legacy CTA-scope barrier
bar.cta.arrivesub_1493990368Legacy CTA-scope barrier arrive
bar.cta.redsub_14945A0576Legacy CTA-scope barrier with reduction

Texture and Surface (6 entries)

MnemonicHandlerSize (bytes)Notes
texsub_153DCB03,984Texture fetch
tex.basesub_15409504,096Texture fetch (base LOD)
tex.levelsub_15446404,432Texture fetch (explicit LOD)
tex.gradsub_1566D6012,048Texture fetch (explicit gradients)
tld4sub_15266A02,000Texture load 4 (gather)
sured.bsub_14E5A601,408Surface reduction (byte)

Matrix (Tensor Core) Operations (16 entries)

MnemonicHandlerSize (bytes)Notes
mmasub_157A04027,056Generic MMA (sm_80+)
wmma.mmasub_15809F039,952WMMA matrix multiply-accumulate (sm_70+)
wmma.load.asub_155BCB07,984WMMA load matrix A fragment
wmma.load.bsub_1559E407,792WMMA load matrix B fragment
wmma.load.csub_1561DE09,760WMMA load matrix C/accumulator fragment
wmma.store.dsub_155FB708,816WMMA store matrix D result fragment
wgmma.mma_asyncsub_14C3C101,296Warpgroup MMA async (sm_90+)
wgmma.fencesub_1493320304Warpgroup MMA fence
wgmma.commit_groupsub_1493450304Warpgroup MMA commit group
wgmma.wait_groupsub_1493580320Warpgroup MMA wait group
ldmatrixsub_14C64501,200Load matrix from shared memory
stmatrixsub_14A9570912Store matrix to shared memory
movmatrixsub_1493E40432Move matrix between registers
tcgen05.mmasub_1574BD021,6165th-gen tensor core MMA (sm_100+)
tcgen05.mma.wssub_15489C04,6085th-gen tensor core MMA with warpgroup specialization
tensormap.replacesub_15386802,800Replace fields in tensor map descriptor

tcgen05 (5th-Gen Tensor Core) Management (9 entries)

MnemonicHandlerSize (bytes)Notes
tcgen05.allocsub_15221201,712Allocate tensor core accumulator
tcgen05.deallocsub_15457904,288Deallocate tensor core accumulator
tcgen05.relinquish_alloc_permitsub_14DF3101,296Relinquish allocation permit
tcgen05.ldsub_152CFF02,144Load into tensor core accumulator
tcgen05.ld.redsub_1531D502,480Load-reduce into tensor core accumulator
tcgen05.stsub_152AF802,080Store from tensor core accumulator
tcgen05.commitsub_15251301,856Commit tensor core operations
tcgen05.cpsub_14FB7901,584Copy tensor core data
tcgen05.shiftsub_14AAA30912Shift tensor core accumulator rows

tcgen05 Guardrail Intrinsics (8 entries)

These are internal pseudo-instructions (prefixed with _) used for runtime validation of tensor core operations. They emit guard traps expanded by sub_15B86A0.

MnemonicHandlerSize (bytes)Notes
_tcgen05.guardrails.is_phase_validsub_14936C0352Validate mbarrier phase
_tcgen05.guardrails.are_columns_allocatedsub_1496E10784Validate column allocation
_tcgen05.guardrails.is_current_warp_valid_ownersub_1494EC0608Validate warp ownership
_tcgen05.guardrails.in_physical_boundssub_1497120800Validate physical address bounds
_tcgen05.guardrails.allocation_granularitysub_1493FF0448Validate allocation granularity
_tcgen05.guardrails.datapath_alignmentsub_14A9900848Validate datapath alignment
_tcgen05.guardrails.sp_consistency_across_idesc_modsub_1496520736Validate stack pointer consistency across idesc modifiers
_tcgen05.guardrails.check_sparse_usagesub_14B9F40976Validate sparse matrix usage

Async Copy and Bulk Operations (6 entries)

MnemonicHandlerSize (bytes)Notes
cp.async.bulksub_154C1B05,072Asynchronous bulk copy
cp.async.bulk.tensorsub_156440010,592Asynchronous bulk tensor copy (sm_90+)
cp.async.mbarrier.arrivesub_1495120704Async copy mbarrier arrive notification
st.asyncsub_15479504,208Asynchronous store
red.asyncsub_153B5403,056Asynchronous reduction
st.bulksub_15023D01,552Bulk store

Multi-Memory (CTA-level Multicast) (3 entries)

MnemonicHandlerSize (bytes)Notes
multimem.ld_reducesub_15468504,352Multicast memory load-reduce (sm_90+)
multimem.stsub_15344602,816Multicast memory store
multimem.redsub_14C37F01,056Multicast memory reduction

Cache Policy (3 entries)

MnemonicHandlerSize (bytes)Notes
createpolicy.rangesub_15364D02,816Create L2 cache policy by address range
createpolicy.fractionalsub_15302A02,208Create L2 cache policy by fractional hit rate
createpolicy.cvtsub_14947E0544Convert between cache policy representations

SIMD Video Instructions (23 entries)

MnemonicHandlerSize (bytes)Notes
vaddsub_15557805,648Scalar video add
vsubsub_15541705,648Scalar video subtract
vminsub_15515D05,568Scalar video minimum
vmaxsub_1552B905,600Scalar video maximum
vabsdiffsub_1556D906,208Scalar video absolute difference
vshlsub_154D5805,280Scalar video shift left
vshrsub_154EA205,296Scalar video shift right
vsetsub_154AEA04,880Scalar video compare and set
vmadsub_154FED05,888Scalar video multiply-add
vadd2sub_152E8F02,208SIMD-2 video add
vsub2sub_152F1902,192SIMD-2 video subtract
vmin2sub_1536FD02,880SIMD-2 video minimum
vmax2sub_1537B102,928SIMD-2 video maximum
vabsdiff2sub_15092001,632SIMD-2 video absolute difference
vset2sub_14E6AC01,504SIMD-2 video compare and set
vavrg2sub_15327002,512SIMD-2 video average
vadd4sub_153FAB03,744SIMD-4 video add
vsub4sub_153EC403,696SIMD-4 video subtract
vmin4sub_152BFE02,080SIMD-4 video minimum
vmax4sub_152A7802,048SIMD-4 video maximum
vabsdiff4sub_152B7A02,112SIMD-4 video absolute difference
vset4sub_1526E701,824SIMD-4 video compare and set
vavrg4sub_1533A802,528SIMD-4 video average

Internal / Special (1 entry)

MnemonicHandlerSize (bytes)Notes
_ldldusub_1496800800Internal: unaligned load (emitted by lowering)

Named handler statistics. 115 total entries, 115 unique function addresses. Total code size: 426,096 bytes (416.1 KB). Handler sizes range from 304 bytes (wgmma.fence) to 39,952 bytes (wmma.mma). The five largest handlers account for 99,632 bytes (23.4% of all named handler code): wmma.mma (39.9 KB), mma (27.1 KB), tcgen05.mma (21.6 KB), div (17.8 KB), sqrt (14.0 KB).

Hash-Encoded Opcode Table (473 entries)

Stored at *(context+816). Maps 32-bit hash keys (stored as decimal string keys like "2644314910") to handler function pointers. These 473 entries cover instruction variants that are dispatched by opcode encoding hash rather than mnemonic string. Each handler follows the same code template pattern as the named handlers: allocate 50 KB temp buffer, build PTX text via sprintf from string table offsets, shrink-allocate, return.

Structure. The hash keys are 32-bit unsigned integers representing pre-computed hashes of instruction encoding descriptors. The hash function is not applied at lookup time to a mnemonic string; instead, the parser pre-computes the hash from the instruction's opcode + type + modifier combination and looks it up directly. This covers complex variant combinations (e.g., add.s32, add.f32, mad.lo.s32, setp.eq.f64) where encoding a separate named mnemonic for each variant would be impractical.

Statistics. 473 entries, 473 unique function addresses, zero overlap with the named mnemonic table. Total code size: 586,352 bytes (572.6 KB). Handler sizes range from 736 to 8,080 bytes, with median 1,136 bytes and mean 1,239 bytes. Size distribution:

Size RangeCountPct
500--999 bytes17035.9%
1,000--1,999 bytes29662.6%
2,000--4,999 bytes61.3%
5,000--9,999 bytes10.2%

Address distribution. Hash handlers cluster in the 0x1490000--0x1520000 range (92% of entries), while named handlers dominate 0x1520000--0x1580000. This reflects a compilation unit boundary: the hash handlers are generated from table-driven templates, while the named handlers are handwritten for semantically complex instructions.

Complete hash-encoded dispatch table. All 473 entries in registration order:

#Hash KeyHandlerSize (B)#Hash KeyHandlerSize (B)
12644314910sub_1496B207362605425506sub_1497440800
3359337725sub_14D1F801,02444134604268sub_14AA6A0912
53457617063sub_14BBE601,02463458731190sub_14B84C01,024
73461614778sub_14C13501,0248273026588sub_14A9FC0912
9273550881sub_14AA33091210766250731sub_14BC2401,040
11767430384sub_14BC6301,02412278072875sub_1497770832
13278597168sub_1497AB0832143013151698sub_1527CA01,792
151034227483sub_14CAB901,136162998735345sub_14CB0001,136
172995851755sub_14CB4701,136181030557441sub_14C8C801,136
192983137751sub_14C90F02,272202980254161sub_14C99D01,136
214210694260sub_14987B0832224203485294sub_1498AF0832
231497109918sub_1498E30832242394822284sub_1499170832
252387613318sub_14994B0832264199093174sub_14997F0832
2788478685sub_1499B308322885595095sub_1499E70832
291848645077sub_14B3EC0912302192249095sub_149A1B0832
313719369477sub_14B3B20928322918388213sub_149A4F0832
332915504623sub_149A83083234224662509sub_14B33E0928
35952045343sub_149AB70832362319257885sub_14B3780928
371094325528sub_14BBA701,024383069252162sub_149AEB0832
391087116562sub_149B1F0832402848461329sub_149B530832
41843846881sub_149B87083242851055847sub_14BB6801,024
4390377643sub_14FB1801,584441335693225sub_15138C01,648
451883181179sub_14F13E01,520463364426361sub_150EB801,584
471886064769sub_1520D201,792481678118992sub_15275901,808
491325928250sub_14CB8E01,136502738688312sub_14D57E01,136
513355120138sub_14CBD501,13652709825544sub_14D49701,136
533359969808sub_14D3FE01,136541675235402sub_14F26101,520
553140424264sub_150CB501,584564198894970sub_14F2C201,520
571134169976sub_15118901,584581059984638sub_14B8C50944
591052775672sub_149BBB0832603046838824sub_149BEF0832
61816714957sub_14B9020976622826047991sub_149C230832
63809505991sub_149C570832641870467175sub_15283C01,856
6586707601sub_14E82601,552661323503503sub_150D8301,648
671867583585sub_14E7C501,552683340309087sub_150D1C01,648
691662521398sub_1528B001,840701313738528sub_14CC1C01,136
712717978910sub_14D96701,136723331002864sub_14CC6301,136
73677188590sub_14D8CD01,136743335852534sub_14E3AB01,360
751659637808sub_14FA5601,584763116306990sub_15165D01,648
774195224928sub_14FAB701,584781121980254sub_15172B01,648
792889945594sub_14E16401,376802867532256sub_14E4A701,008
812673873353sub_14E0C201,376822651460015sub_14E20A01,360
832678854089sub_14DB3701,136842656440751sub_14DA9B01,136
854003207167sub_14BA6F01,024861343296809sub_149C8B0832
873995998201sub_149CBF0832883711244238sub_14BAEB01,024
891073812728sub_149DF70832903704035272sub_149E2B0832
913716224974sub_14BAAD01,024921078793464sub_149E5F0832
933709016008sub_149E930832944259320680sub_14BA3101,024
952122780818sub_14B1E20960963611562640sub_14BE2801,040
974256437090sub_14B21C0960981687164256sub_14BE6901,024
994002681655sub_14E3FF01,3601003999798065sub_14C57001,136
1011414468911sub_14D6FF01,1361021888620641sub_14C52901,136
1033361346143sub_14D7E601,1361044007662391sub_14A9C50912
1051893601377sub_149EC708321063366326879sub_14B3040928
1074004778801sub_149EFB08321081419449647sub_14B1340832
1093968866277sub_149F2F08321101320883471sub_149F630832
1113961657311sub_149F9708321123676903348sub_149FCB0832
1131051399390sub_149FFF08321143669694382sub_14A0330832
1153681884084sub_14A06708321161056380126sub_14A09B0832
1173674675118sub_14A0CF08321184243723086sub_14B0860928
1192119110776sub_14B0C009281203599372918sub_14BDE701,040
1214240839496sub_14B0FA09281221663046982sub_14BEED01,024
1233987084061sub_14CEE201,1361243984200471sub_14CCAA01,136
1251390351637sub_14D91A01,1361261884950599sub_14CCF101,136
1273349156421sub_14D88001,2321283992064797sub_149D8F0832
1291889931335sub_149DC308321303354137157sub_14AFD80928
1313989181207sub_149D2708321321395332373sub_14B1A80960
1331537545665sub_149D5B08321341515132327sub_149CF30832
135453121116sub_14953E0736136437523522sub_14956C0736
137456004706sub_14959A0736138440407112sub_1495C80736
1392498761100sub_1495F607361402495091058sub_1496240736
1411034555164sub_14C71E01,1361422999521778sub_14C76501,136
1432996638188sub_14C7AC01,1361441030885122sub_14C9E401,136
1452983924184sub_14CA2B01,1361462981040594sub_14CA7201,136
1471095832857sub_14B88809441481088623891sub_1497DF0832
1493070300739sub_1498130832150852563176sub_14B93F0944
1512849509906sub_1498470832152845354210sub_14A1030832
1531886851202sub_1523C201,79215490705324sub_14F87101,520
1551336348586sub_15186001,6481561883967612sub_14F8D201,520
1573365540474sub_150E5101,6481583361083921sub_14D3B201,184
1591326583611sub_14C69001,1361602739671353sub_14D66501,232
1613356234251sub_14C6D701,136162711267337sub_14D6B201,136
1631678905425sub_1522E701,7921641676021835sub_14F19F01,520
1653141538377sub_1510BB01,6481664199222651sub_14EF5901,536
1671134825337sub_15105401,6481683137016288sub_14B16E0960
1693132166618sub_14B0120960170471143384sub_14BD6501,040
1711124994826sub_14B04C09601722522026248sub_14BDA601,040
1731061491967sub_14BB2901,0081741054283001sub_14A1370832
1753047887401sub_14A16B0832176818222286sub_14B97C0976
1772827096568sub_14A19F0832178811013320sub_14A1D30832
1791871253608sub_1529FF01,93618087035282sub_14EFBA01,552
1811324158864sub_1515F601,6481821868370018sub_14F01B01,520
1833341423200sub_15152801,6481843336966647sub_14E45301,344
1851314393889sub_14C5B701,1361862718961951sub_14D4E401,136
1873332116977sub_14C5FE01,136188678630383sub_14D53101,136
1891663307831sub_15243101,7761901660424241sub_14F93301,552
1913117421103sub_1518C701,6481924195552609sub_14F99401,552
1931122635615sub_1513F301,6481943112899014sub_14B2900960
1953108049344sub_14B2CA0960196438506430sub_14BCA201,040
1971112805104sub_14B25609601982501316846sub_14BD2401,040
1994177860699sub_14A20708322004170651733sub_14A23B0832
2011475745157sub_14A26F083220273667524sub_14A2A30832
20370783934sub_14A2D708322041825641916sub_14AB500912
2052188906734sub_14A30B08322063707835116sub_14AB160912
2072361988723sub_14A33F08322082354779757sub_14A3730832
2094177728413sub_14A3A708322102903577052sub_14A3DB0832
211201659348sub_14AADC09122122900693462sub_14A40F0832
2132307723524sub_14AE420928214948702982sub_14A4430832
2152868580833sub_14E1B701,3602162890994171sub_14E5FE01,360
2172652508592sub_14DAE901,2482182674921930sub_14DE4101,280
2192657489328sub_14E11301,2962202679902666sub_14E2B001,296
2211516180904sub_14A4DF08322221538594242sub_14A5130832
2233061191149sub_14E30301,3602243243250227sub_14A5470832
2253020952069sub_14A57B08322263019248130sub_14A5AF0832
227434050149sub_14A5E308322283231978109sub_14A6170832
229303240176sub_14A64B083223043389887sub_14A67F0832
23148370623sub_14A6B30832232275189148sub_14F9F501,552
2331515589530sub_1516C401,6482342061111404sub_14F20001,520
2353537441386sub_150FED01,5842362063994994sub_14EDD501,552
23786052206sub_14EE3601,5522381850871870sub_14EE9701,552
2391853755460sub_14EEF801,5522401308364633sub_14EC5101,552
2412694582615sub_150F1F01,6482423298431529sub_14ECB201,520
243626594855sub_150F8601,6482443301315119sub_14ED1301,552
24586510955sub_14ED7401,5522461310855017sub_150DEA01,648
2471849954363sub_14F32301,5522483310228025sub_150C4E01,584
2491852837953sub_14F38401,5522501083838248sub_14F3E501,552
2512453999910sub_150BE701,6482523051426296sub_14F44601,552
253363533302sub_150AB201,5842543054309886sub_14F4A701,552
2551088818984sub_14F50801,5522562458980646sub_15158F01,648
2573059290622sub_14F07C01,5522583056407032sub_14F0DD01,552
259368514038sub_15145A01,6482601354829774sub_14AC380912
2612779386316sub_14C07301,0562623398570654sub_14AC720912
263765072540sub_14C0B401,0242643401454244sub_14AB8A0912
265210570726sub_14ABC409122661475219428sub_14C03201,024
2672030440630sub_14ABFE07362683531018932sub_14BCE301,040
2692033324220sub_14ACAC0912270411636811sub_14A6E70832
2713209564771sub_14A71B08322722998538731sub_14A74F0832
2733220836889sub_14A78308322742996834792sub_14A7B70832
275280826838sub_14A7EB083227620976549sub_14A81F0832
27725957285sub_14A853083227882382164sub_14F6ED01,552
2791835274276sub_14F74E01,5202801838157866sub_14E88701,552
281271519106sub_14E8E801,5522821503399808sub_1514C101,648
2832045513810sub_14E94901,5522843513324112sub_1511F001,648
2852048397400sub_14E9AA01,5522861304694591sub_14EA0B01,552
2872682392893sub_15125701,6482883282833935sub_14EA6C01,552
289602477581sub_1512BE01,6482903285717525sub_14EACD01,552
2911085148942sub_14EB2E01,5522922446790924sub_15132501,648
2933043693028sub_14EB8F01,5522943040809438sub_14EBF001,552
295344396764sub_15112201,64829682840913sub_14F56901,552
2971298665295sub_150B1901,6482981834356769sub_14F5CA01,552
2993286110751sub_150B8001,6483001837240359sub_14F62B01,552
3011080168206sub_14F68C01,5523022441810188sub_15179201,648
3033038712292sub_14F7AF01,5523043035828702sub_14F81001,552
305339416028sub_1517F901,6483061351159732sub_14AD5A0928
3072767196594sub_14BF6F01,0403083382973060sub_14AD940928
309740955266sub_14BF2E01,0403103385856650sub_14ADCE0928
311206900684sub_14AE0809283121463029706sub_14BFF101,040
3132014843036sub_14ACE609123143506901658sub_14BFB001,040
3152017726626sub_14AD2009283163038777811sub_14E25D01,296
3173245216309sub_14A88708323183021214212sub_14A8BB0832
319275713438sub_14CD3801,1363201516769180sub_14D83301,136
3212065436788sub_14CD7F01,1363222062553198sub_14CDC601,136
3233539538540sub_14DA0101,2163241854279747sub_14CE0D01,136
3251851396157sub_14CE5401,1363263312325179sub_14D9B401,136
32787035245sub_14CE9B01,1363281312034667sub_14DA4E01,280
3293222802971sub_14A8EF08323302998800874sub_14A9230832
331272043396sub_14C95601,1363321504579458sub_14D74C01,136
3332049839194sub_14C49B01,1363342046955604sub_14C4E201,136
3353515421266sub_14D79901,13633683365203sub_14C7F301,136
3371299844945sub_14D61801,1363381838682153sub_14C83A01,136
3391835798563sub_14C88101,1363403288207905sub_14D5CB01,136
3412841907644sub_14D31D01,1843422819494306sub_14D36702,416
3432846888380sub_14DB8501,2483442824475042sub_14DBD301,248
3453059225067sub_14E4FC01,3603463036811729sub_14E55101,360
3472839941562sub_14E07101,2963482817528224sub_14E02001,280
3492844922298sub_14DE9101,2803502822508960sub_14DEE101,280
3513422950081sub_14AF9E09603523407352487sub_14AEB60928
3531363152881sub_14AEF009283541359482839sub_14AF2A0928
3553425833671sub_14AF6409603563410236077sub_14AE7C0928
357467604616sub_14A4770832358445191278sub_14A4AB0832
3591158287159sub_1504EA01,5683603132756487sub_15054C01,568
3613135640077sub_1505AE01,5683623637908057sub_15061001,568
363981800023sub_15192E01,6483641686312233sub_15048801,568
3653560118055sub_151CC401,5363661689195823sub_14FC3D01,536
3673387691560sub_1503C401,568368715527206sub_151D2A01,552
3691413616888sub_15042601,5683703271366390sub_151C5E01,632
3711416500478sub_15036201,5683721147801373sub_14FC9D01,536
3733110343149sub_14FCFD01,5363743113226739sub_14FD5D01,536
3751666782485sub_14FDBD01,5363761663898895sub_14FE1D01,536
3773529185037sub_15199401,5363783627422271sub_14FE7D01,536
379962794557sub_1519FA01,6323803377205774sub_15067201,568
381696521740sub_151F9201,5683821391203550sub_1506D401,568
3833240433372sub_151FFA01,5683841394087140sub_15029E01,568
3853392672296sub_14FEDD01,536386720507942sub_151BF801,536
3871421481214sub_14FF3D01,5363881418597624sub_14FF9D01,536
3893276347126sub_151B9201,5363903382186510sub_15073601,568
391701502476sub_151F2A01,5683921396184286sub_15079801,568
3933245414108sub_151EC201,7923941399067876sub_1507FA01,568
3951359482725sub_15085C01,5683962749632867sub_151A6001,648
3973355054645sub_1508BE01,568398687150131sub_151D9001,568
3993357938235sub_15030001,5684001157959476sub_15098601,568
4012532053298sub_15227D01,7124023131052548sub_1509EA01,568
403447091714sub_1521A801,6484043133936138sub_150A4E01,568
4053745714894sub_14B42609284061127945420sub_14C33E01,040
4071847793054sub_14B5C309604083759937436sub_14C23A01,040
4091850676644sub_14B5FE09284102345603302sub_14B6390944
4113861910244sub_14C1F901,040412223810486sub_14B6740944
4131976047028sub_14C17701,040414226694076sub_14B6AF0944
4153735229108sub_14B6EA09284161108939954sub_14C1B801,040
4171825379716sub_14B72509604183729004418sub_14C2FD01,040
4191828263306sub_14B76009604202335117516sub_14B79B0944
4213842904778sub_14C2BC01,040422201397148sub_14B7D60944
4231945114010sub_14C27B01,040424204280738sub_14B81101,008
4251348996939sub_1500BD01,5364262730627401sub_151DF601,632
4273332641307sub_15011D01,536428656217113sub_151E5C01,552
4293335524897sub_15017D01,5364301147473690sub_1501DD01,568
4312513047832sub_151AC601,5844323108639210sub_14FFFD01,536
433416158696sub_151B2C01,6324343111522800sub_15005D01,536
4351359482727sub_14CF7001,1524362750288229sub_14DDA701,248
4373358855741sub_14CFB801,1524383355972151sub_14D00001,152
439688722997sub_14DC2101,2484401348996941sub_14D04801,152
4412731282763sub_14DC6F01,2484423333558813sub_14D09001,152
443657789979sub_14DCBD01,2484443336442403sub_14D0D801,152
4451157959478sub_14D28B01,1684462532708660sub_14DF8201,264
4473131970054sub_14D2D401,152448448664580sub_14DFD101,264
4493134853644sub_14D24201,1844501147473692sub_14D12001,152
4512513703194sub_14DD0B01,2804523109556716sub_14D16801,152
453417731562sub_14DD5901,2484543112440306sub_14D1B001,152
4551884820921sub_14B46109284561862407583sub_14B49C0928
4573770356457sub_14B4D709284583759870671sub_14B5120944
4591887704511sub_14B54D09444601865291173sub_14B5880944
4611944652170sub_14FBDC01,552462597299025sub_152D8502,128
4632637630251sub_152E0A02,1284642351893152sub_152FA202,176
4652267286071sub_14D44A01,2324663536260243sub_153CEF03,520
4673543796907sub_155DBE08,080468263656552sub_14E76801,488
4691337398693sub_15330D02,4804701319900574sub_15213D01,712
471626988249sub_1525FA01,8084723246921673sub_14CF2901,136
4731681331703sub_15359F02,784

Two-Level Dispatch Protocol

When the parser encounters an instruction:

  1. Look up the mnemonic in the named table at *(context+808) via sub_449A80.
  2. If not found, compute the instruction encoding hash and look up in *(context+816).
  3. The resolved handler receives (context, instruction_node) and generates PTX code template text.

Combined scale. The two tables register 588 unique handler functions covering 988.7 KB of code (416.1 KB named + 572.6 KB hash-encoded). This is the complete PTX ISA 9.0 instruction set as implemented by nvlink's embedded assembler. The handler address range spans 0x1493320--0x15809F0 (949.7 KB of .text).

Code Template Generators

Approximately 250 functions in the 0x14A0000--0x15C0000 range follow an identical code generation pattern. Each handles one specific instruction variant (a particular MMA shape, a barrier variant, an arithmetic operation with specific type combination).

Common pattern (verified across sub_14A4770, sub_14AAA30, sub_14C3C10):

1. arena = sub_44F410(a1, a2) + offset[3]     // get memory arena
2. buf   = sub_4307C0(arena, 50000)            // allocate 50KB temp buffer
3. // Build PTX text via successive sprintf() calls:
   //   - Base template from read-only string table (a2 + constant_offset)
   //   - Conditional sections based on sub_16DF3B0 (has descriptor),
   //     sub_16DF3D0 (has scale), sub_16E4530 (matrix dimension)
   //   - Register names, immediates, matrix shapes substituted in
4. len   = strlen(buf)
5. out   = sub_4307C0(arena, len + 1)          // shrink-allocate final string
6. strcpy(out, buf)
7. sub_431000(buf)                             // free temp buffer
8. return out

The larger handlers deviate from this pattern by size but not structure. The six largest handlers each exceed 100KB:

FunctionSizeInstruction
sub_15B86A0345.2 KBCUDA builtin prototype generator (608-case switch)
sub_147EF50287.9 KBMaster instruction semantic validator
sub_1487650240.3 KBTop-level PTX statement processor
sub_146BEC0206.1 KBLoad/store memory operation validator
sub_15809F0170.5 KBwmma.mma code template generator
sub_157A040118.6 KBmma (tensor core) code template generator

Instruction Property Query Functions

Code template generators query instruction properties through a family of accessor functions in the 0x16DF000--0x16E4000 range:

FunctionDescription
sub_16DF3B0Check if instruction has descriptor field
sub_16DF3D0Check if instruction has scale field
sub_16DF450Get scale descriptor value
sub_16DF5E0Get load descriptor value
sub_16DF5F0Get store descriptor value
sub_16E4530Get matrix dimension (a/b selector, variant)
sub_16E1030Get matrix operation mode
sub_16E2410Get immediate value from instruction node
sub_16E32D0Get register range start
sub_16E3320Get register range end
sub_16E3960Get matrix layout/shape
sub_16E36D0Check for specific instruction modifier
sub_16DBA40Get instruction mnemonic string
sub_16DBB00Get instruction type string
sub_16DBDD0Get operand type string
sub_16DBE80Get scope/modifier string
sub_16DDD30Get data type string from enum

CUDA Builtin Registration

sub_158A600 -- cuda_builtin_name_table_init (45,700 bytes, 630 lines)

Registers 608 CUDA runtime builtin function names into a hash table for the instruction lowering pipeline. These builtins are PTX pseudo-functions that the assembler expands into instruction sequences during compilation.

Data structure. Creates a hash table via sub_4489C0 with capacity 0x80. Allocates a 9,728-byte builtin data array from unk_1F8E0C0. Stores: *(context+1056) = data array, *(context+1064) = hash table, *(context+1072) = 608 (entry count).

Each builtin is registered with a sequential index (1--608) mapping the name to an index used by sub_15B86A0 (the 345KB prototype generator) to produce the corresponding .weak .func (...) __cuda_smXX_foo (...) ; declaration.

Builtin families (608 total, organized by SM generation and functional category):

TypePrefix PatternApprox. CountSM Range
Redux/sync__cuda_reduxsync_b32_*~20sm_70+
Sanitizer__cuda_sanitizer_memcheck_*~10all
Video emulation__cuda_scalar_video_emulation_*~30sm_30+
Guardrail traps__cuda_sm10x_tcgen05_guardrail_trap_*~8sm_100+
Bulk copy__cuda_sm1xx_bulk_copy_*, __cuda_cp_async_bulk_tensor_*~20sm_100+
SM20 math__cuda_sm20_div_*, __cuda_sm20_rem_*, __cuda_sm20_rcp_*, __cuda_sm20_sqrt_*, __cuda_sm20_dsqrt_*, __cuda_sm20_drsqrt_*, __cuda_sm20_bfe_*, __cuda_sm20_bfi_*~80sm_20+
SM3x math__cuda_sm3x_div_rn_ftz_*, __cuda_sm3x_div_rn_noftz_*~20sm_30+
SM62 emulation__cuda_sm62_dp2a_*, __cuda_sm62_dp4a_*~8sm_62
SM70 barriers__cuda_sm70_barrier_arrive_*, __cuda_sm70_barrier_red_*, __cuda_sm70_barrier_sync_*~30sm_70+
SM70 warp__cuda_sm70_matchsync_*, __cuda_sm70_shflsync_*, __cuda_sm70_votesync_*, __cuda_sm70_warpsync~40sm_70+
SM70 WMMA__cuda_sm70_wmma_m16n16k16_*, __cuda_sm70_wmma_m32n8k16_*, etc.~100+sm_70+
SM80+ MMA__cuda_sm80_mma_* and later~100+sm_80+

sub_15B86A0 -- cuda_builtin_prototype_generator (345,200 bytes, 9,666 lines)

A giant switch on builtin index (608 cases). For each case, allocates a string via sub_14932E0(length, arena) and writes the PTX prototype. Returns a string of the form:

.weak .func (.param .b32 retval) __cuda_sm20_div_s16 (.param .b32 a, .param .b32 b) ;

Covers all SM generations (sm20 through sm10x) and all function families: div, rem, rcp, sqrt, dsqrt, drsqrt, barrier, wmma, shfl, vote, matchsync, warpsync, reduxsync, sanitizer_memcheck, scalar_video_emulation, tcgen05, bulk_copy, cp_async_bulk_tensor.

Semantic Validation

Instruction validation follows a consistent protocol across all validators:

1. sub_12B3090(context)       // Check "relaxed mode" -- if true, skip all checks
2. sub_12B3CA0(major, minor)  // Check PTX version requirement
3. sub_1441FB0(context, ...)  // Emit error if PTX version too low
4. sub_12A8530(sm_version)    // Check SM target capability
5. sub_14422F0(context, ...)  // Emit error if SM version too low
6. (operand type/count checks)
7. sub_467A70(error_code)     // Emit diagnostic with specific error dword

Error codes are global dword_ variables:

VariableMeaning
dword_2A5C530PTX version requirement not met
dword_2A5C500SM version requirement not met
dword_2A5C560Target not supported
dword_2A5CFD0Unsupported modifier
dword_2A5D680Syntax error
dword_2A5CAB0Type error
dword_2A5D380Duplicate symbol definition
dword_2A5D290Parameter error

sub_147EF50 -- ptx_instruction_semantic_analyzer (287,900 bytes, 7,803 lines)

The master instruction validator. Takes (context, char* name, char mode, error_sink). Validates every aspect of a PTX instruction against the current SM target and PTX version:

  • Dimensions, texture modes (texmode_independent, texmode_raw, texmode_unified)
  • Cache policies, state spaces, vector types
  • Scoping (.cta, .cluster, .sys), ordering (.acquire, .release)
  • Proxy fences, comparison operators, eviction priorities
  • Directives: .branchtargets, .calltargets, .callprototype, .weak, .common, .noreturn, .abi_preserve, .abi_preserve_control, .unified, .allocno, .scratch, .retaddr_allocno, .ptr

SM version gates referenced: sm_20, sm_30, sm_32, sm_50, sm_60, sm_61, sm_70, sm_72, sm_75, sm_80, sm_89, sm_90.

sub_146BEC0 -- ptx_load_store_validator (206,100 bytes, 5,546 lines)

Validates all memory operations: ld/st, atomics, reductions, fence, membar, cp.async, st.async/red.async. Checks:

  • Vector types (.v2, .v4, .v8), data widths (.b32, .b64, .b128)
  • Memory ordering (.acquire/.release), scope (.sys)
  • Cache eviction priority for L2
  • Address space modifiers (.shared::cluster, .shared::cta)
  • .sync_restrict, .op_restrict, .cluster scope on atomics
  • SM gates: sm_11, sm_12, sm_20, sm_30, sm_50, sm_60, sm_80, sm_90
  • PTX version gates: 1.1, 1.2, 2.0, 6.0, 7.8, 8.0, 8.1, 8.4, 8.6

References the string "igonre-src on cp.async" -- a genuine typo ("ignore") in NVIDIA's source code at approximately offset +1100.

sub_1487650 -- ptx_statement_processor (240,300 bytes, 6,428 lines)

Top-level PTX statement handler. Processes kernel/function declarations including:

  • Performance directives: .maxnctapersm, .minnctapersm, .maxntid, .reqntid, .reqnctapercluster, .blocksareclusters
  • Kernel parameter size validation (4,352 byte limit)
  • Function attributes: .noreturn, .extern, .FORCE_INLINE, .unique
  • Start/end label creation: "__$startLabel$__%s", "__$endLabel$__%s"
  • Parameter registers, scratch registers, return address registers
  • PTX version gates: 2.0, 2.1, 2.2, 2.3, 3.0, 3.1, 4.3, 5.0, 6.0, 6.3, 6.4, 7.8, 8.0, 8.1, 8.8, 9.0

Instruction Modifier Groups

sub_14871D0 (3,400 bytes, 54 lines) initializes the modifier group enum. Registers bidirectional group mappings via sub_465720 and sub_448E70:

GUARD  <-> PRED     (bidirectional)
TYPES, POSTOP, COMPARE, APRX, FTZ, SAT, SHAMT, ORDER, NC, ROUND,
TESTP, FLOW, TEXTURE, QUERY, CLAMP, SHR, VMAD, PRMT, SHFL,
ENDIS, UNIFORM, VECTOR, VOTE

These 25 modifier groups classify the possible instruction suffixes. The GUARD <-> PRED bidirectional mapping reflects that guard predicates and predicate operands share the same register file.

Expression Printer

sub_12B33A0 -- ptx_print_expression (8,692 bytes, 291 lines)

A recursive pretty-printer that serializes expression AST nodes to text output. Dispatches on (*node & 0x3F), a 6-bit expression kind discriminant:

TypeNameOutput Format
0Binary operator<left> <op> <right> (recursive)
1Unary operator<op> <child> (recursive)
2Integer literal%lld
3Float literal0D%016llx (f64) or 0F%08x (f32) -- PTX hex float format
4Variable referencevariable name string
5Array index<base>[<index>]
6Vector swizzle.x, .y, .z, .w component select
7Half-register select.h0--.h3, .b0--.b7 component select
8Byte-register select.b0--.b3 component select
9Predicate negate! prefix if bit 0 of byte 1 is set
10Nop/empty(nothing)
11Parenthesized( <child> ) (recursive)
12Memory dereference[ <child> ] (recursive)
13Label referencelabel name
14Vector init{ <elem>, <elem>, ... }
15Tuple( <elem>, <elem>, ... )
16Wildcard_

The binary operator table at off_1F24200 has 30 entries (indices 0 through 0x1D), covering arithmetic, bitwise, shift, logical, and comparison operators.

Symbol Management

sub_12ADB80 -- ptx_register_function_symbol (5,093 bytes, 171 lines)

Registers a function in the PTX module's symbol table. Allocates an 88-byte function descriptor (type = 4). Checks for __cudart_ prefix to classify CUDA runtime functions, which are stored in a separate linked list at module offset +168 (versus offset +56 for user functions). Parameter lists are linked at offsets +32 and +40.

sub_12ADFE0 -- ptx_register_global_symbol (5,025 bytes, 153 lines)

Registers a global/shared/const variable. Takes 15 parameters covering all variable attributes. Allocates three structures: 88-byte descriptor (type = 5), 248-byte extended info (at offset +80), and 80-byte type descriptor (at extended info offset +136).

sub_12AEA80 -- ptx_resolve_struct_member (7,282 bytes, 278 lines)

Resolves member access on struct types. Walks member name strings character by character, uses ctype_b_loc() for digit detection to extract numeric array index suffixes. Temporarily replaces characters with '<' during lookup. Recursive: calls sub_12AE6F0 for base lookup, then recurses on the result for chained member access.

sub_12B8790 -- ptx_define_global_symbol (6,937 bytes, 264 lines)

Defines a new global symbol with allocation. Takes 10 parameters: (module, name, storage_class, alignment, size, ...). Storage class dispatch:

Storage classAddress spaceBehavior
2.sharedWraps name as "$__internal_%d_$%s" for module version <= 2. Aligns to boundary, updates module offset +120.
3.constAligns to boundary, updates module offset +80.
5SpecialCreates "$ADDRESS$%s" wrapper symbol with alignment 4 (32-bit) or 8 (64-bit). Triggered by module->byte[11] or SM capability above threshold.

Magic relocation base: 0x70000064 (= R_CUDA relocation type constant).

Bindless Texture Lowering

sub_12B9660 -- ptx_process_bindless_textures (7,505 bytes, 220 lines)

Processes bindless texture/surface operand pairs from the module. Iterates over three lists: samplers (offset +144), textures (offset +152), surfaces (offset +160). For each (texture, sampler) pair:

  1. Query maximum resource limits via vtable calls at offsets +24+56, +24+40, +24+48.
  2. Compare actual counts against limits; warn via sub_467460 if exceeded.
  3. Create a merged, sorted list of (texture, sampler) combinations.
  4. For each pair within the address space limit, create a "$BINDLESS$%s$%s$%s" symbol (module name, texture name, sampler name). Allocate a 112-byte descriptor with type = 4, alignment = 4.
  5. Set relocation offset = vtable[304]() - 0x70000064.
  6. Register relocation entries via sub_12B8650.

Unpaired textures get "$BINDLESS$%s$%s" (two-part, no sampler) via sub_12B9320. Surfaces are processed in a final pass via sub_465390.

DWARF Debug Data Processing

Several functions handle DWARF debug information embedded in PTX via @@DWARF directives:

FunctionSizeDescription
sub_14420403.3 KB@@DWARF directive handler (also validates .section)
sub_1444AB03.7 KB"dwarf data" processing
sub_14492F08.3 KBDWARF data with SM validation (sm_80 gate)
sub_14498C04.1 KBDWARF data section handler ("%s+%llu", "labels + imm expression in .section")
sub_1449BA04.3 KBDWARF label validator

SM100 tcgen05 Intrinsic Codegen

A cluster of functions in the 0x16E0A70--0x16E3AB0 range generates PTX inline code for Blackwell (SM100+) tensor core gen05 operations. These are not parser functions per se but are invoked during instruction lowering to produce PTX helper sequences.

sub_16E0A70 -- tcgen05_instruction_type_classifier (17,302 bytes, 322 lines)

Chains approximately 50 type-check predicates (sub_12B5670--sub_12B5950) against an instruction object at *(a1+8). Returns an integer type ID (1--54) corresponding to specific tcgen05 MMA variants. Type IDs 1--8 map to base types, 18--29 to extended variants. Modifiers include _expand16bit, _pack16bit, _maxabs, _minabs, _fused, _blockscale, _ashift.

sub_16E1DB0 -- tcgen05_guardrails_codegen (10,365 bytes, 325 lines)

Generates boundary-checking PTX code for tcgen05 tensor memory operations. Emits "mov.u32 %s, %s;" and "mov.u32 %s, %d;" sequences. References: "__cuda__sm10x_tcgen05_guardrails_in_physical_bounds_", "__cuda__sm10x_tcgen05_guardrails_are_columns_allocated_". Checks SM100/SM10x variants.

sub_16E2410 -- tcgen05_tmem_addr_compute (3,411 bytes, 111 lines)

Emits "add.u32 %s, %s, %s;" for tensor memory address computation. References "__cuda_sm_100_tcgen05_tmem_addr".

sub_16E3AB0 -- tcgen05_instruction_args_type_mapper (18,623 bytes, 337 lines)

Maps arrays of instruction argument pointers to arrays of numeric type IDs, using the same predicate chain as sub_16E0A70. Stores results at a1+944 per argument, with argument count at base struct offset +796.

PTX Type System

sub_12A7B90 -- ptx_init_version_table (17,596 bytes, 430 lines)

Initializes the PTX ISA version-to-feature mapping table. The entire body is SIMD divmod-by-10 computation on constant arrays at xmmword_1F24440--1F244E0. The pattern q = (x * 0xCCCCCCCD) >> 34; r = x - 10*q decomposes version numbers into major/minor pairs. Writes 30+ (quotient, remainder) pairs. String reference "%d.%d\n" confirms version formatting.

sub_12AA190 -- ptx_get_cvt_opcode (6,332 bytes, 171 lines)

Resolves conversion instruction opcodes from source/destination type pairs. Takes (src_type, dst_type, rounding_mode, context). Returns PTX opcode enum values (25--59). Examples:

  • src=8, dst=7 -> opcode 53 or 55 (f16 to f32 or similar)
  • src=2, dst=3 -> opcode 29 (s32 to s64)
  • src=3, dst=2 -> opcode 28 (s64 to s32)

Calls sub_12A72D0 to validate that the conversion is legal. Calls sub_16E08F0/sub_16E60D0 for illegal conversion warnings.

sub_12AA9C0 / sub_12AAD20 -- ptx_canonicalize_opcode (5,110--6,944 bytes)

Opcode renumbering for stable serialization or cross-version compatibility. Maps old opcode IDs to canonical form via a giant if-else chain. Most values map to themselves, but some are reordered: 33->31, 32->30, 49->50, 50->49, 53->58, 54->59, 55->57, 56->54, 57->55, 58->53, 59->56.

Shared Utility Functions

The entire PTX frontend is built on a consistent set of utility primitives:

FunctionDescription
sub_467A70(dword)Diagnostic/error/warning emitter (takes error code dword)
sub_44F410()Memory arena accessor
sub_4307C0(arena, size)Arena memory allocator
sub_431000(ptr)Arena memory deallocator
sub_45CAC0()Out-of-memory handler (abort)
sub_448E70(table, key, val)Hash table insert
sub_449A80(table, key)Hash table lookup
sub_449BE0(table, key)Hash table contains check
sub_4489C0(hash, eq, cap)Hash table constructor
sub_44E3A0()String comparison/matching
sub_464460(node, next)Linked list append
sub_464700(list, cb, ctx)Linked list foreach
sub_464BB0(list)Linked list count
sub_450280()sprintf to output buffer
sub_12B3090()Check if in "relaxed mode" (skip validation)
sub_12B3CA0(major, minor)PTX version check
sub_12A8530(sm)SM version comparison (>= threshold)
sub_16EF370()Parser error emitter

Context Structure (Partial Reconstruction)

The main compiler context is accessed via *(a1 + 1096) throughout the frontend. Key fields in the instruction structure pointed to by this offset:

Offset  Size  Field
  +88    8B   scope/function pointer
 +104    8B   modifier enum table
 +112    8B   modifier group table
 +220    4B   operand count
 +224    4B   primary operand type enum
 +228    4B   secondary operand type enum
 +244    4B   instruction opcode / shape identifier
 +256   40B   operand type descriptors (8 bytes each, 5 operands max)
 +560    4B   parameter stride
 +596    4B   operand count (second field)
 +601    1B   flags (bit 1: has_scale, bits 2-5: type code)
 +602    1B   flags (bits 7-8: block_scale_kind)
 +603    1B   flags (bits 1-2: scale variant)
 +620    1B   instruction subtype byte
 +627    1B   flags (bits 4-5: layout variant)
 +648+   8B   per-operand tree pointers (8 bytes each)
 +776    4B   instruction class (11/12 = MMA class)
 +808    8B   named instruction handler hash table
 +816    8B   hash-encoded instruction handler hash table
+1056    8B   CUDA builtin data array
+1064    8B   CUDA builtin hash table
+1072    4B   CUDA builtin count (608)
+2488    8B   pragma definition table

SM Version Gate Reference

The validators enforce feature availability across GPU generations. This table summarizes when major PTX features became available, as encoded in the validation logic:

SM VersionKey Features Gated
sm_11Early shared memory loads
sm_12Shared memory banking
sm_20Generic addressing, membar.sys, clock64, %pm0--%pm7, math builtins
sm_30Shuffle, barrier_sync, .branchtargets, .callprototype, %envreg
sm_50%total_smem_size, %dynamic_smem_size
sm_53.f16x2 packed type
sm_60Atomic scopes
sm_62dp2a/dp4a emulation via builtins
sm_70WMMA, barrier_sync*, matchsync*, shflsync*, votesync*
sm_72Additional MMA shapes
sm_75Format conversion, ldmatrix, cache prefetching
sm_80.bf16, .tf32, extended cache, st.async, eviction priorities
sm_89FP8 MMA
sm_90wgmma, cp.async.bulk.tensor, cluster scope, tensormap
sm_100+tcgen05.*, SM1xx builtins, guardrail traps

Cross-References

Sibling Wikis